Cloud Street

Friday, June 24, 2005

A trick of the eye

A long time ago on a Web site far, far away, Clay Shirky wrote:
"We are all so used to bell curve distributions that power law distributions can seem odd."
He then traced Pareto-like 'power law' curves operating in a number of domains where large numbers of people make unconstrained choices - most memorably, inbound link counts for blogs. The inverse 'power law' curve dives steeply, then levels out, glides downwards almost to zero and peters out slowly. And thus was born the 'Long Tail'.

As I wrote here, there's a problem with this article, and hence with the 'Long Tail' image itself. Despite repeated references to 'power law distributions', none of the curves Clay presented were distributions. They were histograms representing ranked lists: in other words, series of numbers ordered from high to low.

What's the difference? A short answer is that the data Clay presents makes his own comparison with 'bell curve' (normal) distributions unsustainable: order from high to low and you will only ever get a downward curve.

For a longer answer, you'll have to look at some numbers. Here are some x,y values which would give you a normal distribution. (For anyone in danger of glazing over, that's 'x' as in horizontal axis, low to high values running left to right; 'y' values are on the vertical axis, low to high running bottom to top).

1, 1
2, 30
3, 100
4, 240
5, 400
6, 600
7, 750
8, 900
9, 960
10, 1000
11, 1000
12, 960
13, 900
14, 750
15, 600
16, 400
17, 240
18, 100
19, 30
20, 1

OK? And here are some co-ordinates which would give you an inverse power-law distribution:

1, 1000
2, 444
3, 250
4, 160
5, 111
6, 82
7, 63
8, 49
9, 40
10, 33
11, 28
12, 24
13, 20
14, 18
15, 16
16, 14
17, 12
18, 11
19, 10
20, 9

Just for the hell of it, here are some numbers that would give you a direct (ascending) power law distribution:

1, 9
2, 10
3, 11
4, 12
5, 14
6, 16
7, 18
8, 20
9, 24
10, 28
11, 33
12, 40
13, 49
14, 63
15, 82
16, 111
17, 160
18, 250
19, 444
20, 1000

Finally, by way of contrast, here's a series of numbers.

1000
444
250
160
111
82
63
49
40
33
28
24
20
18
16
14
12
11
10
9

I've sorted these numbers high to low, but - unlike the other three examples - there's nothing in the data that told me to do that. You could arrange them that way; you could sort them low to high instead; you could even hack them about manually to produce a rather lumpy and uneven bell curve. It's up to you.
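The point can be made in a few lines of code. This is just an illustration, reusing the y-values from the series above: however thoroughly you scramble the numbers, ranking them high to low hands you a descending 'long tail' curve, every time.

```python
# Illustration: sorting any series high-to-low always yields a descending
# curve, whether or not the underlying data follows a power law.
import random

# The 'series of numbers' above (the same y-values as the inverse
# power-law curve)
series = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

# Shuffle to destroy any ordering, then rank high to low
shuffled = series[:]
random.shuffle(shuffled)
ranked = sorted(shuffled, reverse=True)

# However the numbers started out, the ranked version only ever descends
assert all(a >= b for a, b in zip(ranked, ranked[1:]))
```

The assertion never fails, whatever the shuffle produced: the downward curve is a property of the sort, not of the data.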

I'm not saying that a ranked listing - arranging numbers like these high to low - is meaningless. The ranked histogram is quite a good graphic - it's informative (within limits) and easy to grasp. What I am saying is that it's an arbitrary ordering rather than a distribution. Which is to say, it's not the best way of representing this data - let alone the only way. It's a relatively information-poor representation, and one which tends to promote perverse and unproductive ways of thinking about the data.

More about this - and a couple of constructive suggestions - next time I post.

Thursday, June 23, 2005

Authority you can respect

Or: on popularity, deference, knowledge domains and knowledge clouds.

Or: "if Pietro was right, and Dave was right (and I was right about how they fit together), does that mean Shelley was wrong to say that Technorati was wrong?"

I'm not sure I get Technorati. As far as I can understand, it does three things.
  1. Tagging. Using some standard HTML, bloggers can tag their own articles with keywords; Technorati then tracks and aggregates these tags, allowing users to find similarly-tagged entries in other blogs. I'm not sure I see the point of this. Compared with del.icio.us - which builds a public archive of tagged material by enabling users to tag other people's articles (and their own, if they so wish) - this seems underpowered at best, ego-driven at worst.
  2. Linking. Technorati tracks blog-to-blog links, enabling users to find out who's been linking to their articles. I've used this a few times, but I'm not convinced it's that great a feature. Firstly, Google purports to do the same thing with its 'link:' search option; it's only the fact that 'link:' is broken that makes me use Technorati. Secondly, after tracking them for a while, it's dawned on me that I don't really care about links: I care about people reading my articles (which my hit-counter can tell me about), and I care about getting into conversations, either through an exchange of posts or in Comments threads. If people aren't interested in talking to me, I'd just as soon they didn't advertise my blog. (What would it gain me, after all?) Which brings me to
  3. Popularity and Authority. This is the big one. From the name on down, Technorati is all about in-groups and out-groups. 'Authority' is one of the two sort orders which appear when you search for links to your blog (the other being 'date'). 'Authority' is measured by the number of in-bound links the sites linking to yours have in their own right. To put it another way, authority directly tracks popularity (although this is 'popularity' in that odd American high-school sense of the word: 'popular' sites aren't the ones with the most friends (most out-bound links, most distinct participants in Comments threads or even most traffic) but the ones with the most people envying them (hence: most in-bound links)).
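To make the measure concrete, here's a toy sketch of 'authority' as I understand it - popularity as in-bound link counts, with your linkers ranked by their own counts. The blog names and the link graph are invented for illustration; this is not Technorati's actual data or algorithm.

```python
from collections import defaultdict

# Hypothetical link graph: blog -> blogs it links out to
links_out = {
    "alice":       {"kos", "instapundit"},
    "bob":         {"kos"},
    "carol":       {"kos", "alice"},
    "instapundit": {"kos"},
    "kos":         set(),
}

# Invert the graph to find each blog's in-bound links ('popularity')
inbound = defaultdict(set)
for source, targets in links_out.items():
    for target in targets:
        inbound[target].add(source)

authority = {blog: len(sources) for blog, sources in inbound.items()}

# The 'authority' sort order: sites linking to yours, ranked by the number
# of in-bound links they have in their own right (names pre-sorted so that
# ties break predictably)
linkers = sorted(sorted(inbound["kos"]),
                 key=lambda b: authority.get(b, 0), reverse=True)
```

Note what the sketch rewards: "kos" tops the table purely on links received, with out-bound links, comments and traffic counting for nothing - the high-school sense of 'popular' exactly.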
The equation of authority with 'popularity' is, in one sense, neither inappropriate nor avoidable. In another sense it's both reprehensible and wrong. First, the argument in favour. As I wrote here, the distinction between the knowledge produced in academic discourse and the knowledge produced in conversation is ultimately artificial: in both cases, there's a cloud of competing and overlapping arguments and definitions; in both cases, each speaker - or each intervention - draws a line around a preferred constellation of concepts. At some level, all knowledge is 'cloudy'. Moreover, in both cases, the outcome of interactions depends in large part on the connections which speakers can make between their own arguments and those of other speakers, particularly those who speak with greater authority. (Hence controversy: your demonstration that an established writer is wrong about A, B and C will interest a lot more people - and do more for your reputation - than your utterly original exposition of X, Y and Z.) You may not like the internationally-renowned scholar who's agreed to look in on your workshop - you may resent his refusal to attend the whole thing and disapprove of his attitude to questioners; you may not even think his work's that great - but you still invite him: he's popular, which means he's authoritative, which means he reflects well on you. Domain by domain, authority does indeed track popularity.

But there's the rub - and here begins the argument against Technorati. Domain by domain, authority tracks popularity, but not globally: it makes a certain kind of sense to say that the Sun is more authoritative than the Star, but to say that it's more authoritative than the Guardian would be absurd. (Perverse rankings like this are precisely an indicator of when two distinct domains are being merged.) Similarly, it's easy to imagine somebody describing either the Daily Kos or Instapundit as the most 'authoritative' site on the Web; what's impossible to imagine is the mindset which would say that Kos was almost the most authoritative source, second only to Glenn Reynolds. But this is what drops out if we use Technorati's (global) equation of popularity with authority.

Some counter-propositions. Firstly, more is not (necessarily) better. The intrinsic appeal of different domains of knowledge varies enormously: in most academic specialities, if you've got a regular audience in three figures you're doing extraordinarily well. Conversely, if you want a mass audience, you'll need to write the kind of stuff that will get you a mass audience.

Secondly, broadcasting is not conversation; linking is not conversation. My only concern about readership is that I'm reaching enough people with similar interests to have a decent conversation. I'm particularly concerned that the people I'm responding to in this blog are reading it - but I've got no way of knowing that they are, unless they carry the conversation on, either in comments or on their own blogs (hi Adam, hi Dave). A blogroll link, while it would please my vanity, would tell me nothing at all about whether the words I write are actually being read.

Thirdly, domain by domain, popularity records itself: if you keep your eyes and ears open, you very rapidly discover the sources being cited, the authors you need to line up with (or against), the major arguments and their proponents. In this perspective, Technorati is of dubious merit at best, positively misleading at worst. A domain-by-domain popularity meter - like the information you can glean from del.icio.us link-shading, and to some extent from Technorati's tagging - could give you a condensed who's who, although the effort you could save by this kind of shortcut has to be set against the information you'd lose by not taking part in the arguments yourself. A global popularity meter - like Technorati's link-count - will tell you nothing you need to know and a lot that you don't. (This effect has been masked up to now by the prevalence of a single domain among Technorati tags (and, indeed, Technorati users): it's a design flaw which has been compensated by an implementation flaw.)

So I tend to agree with Shelley: the globally 'popular' blogs are quite popular enough already without their readers directing yet more traffic their way - and, for most of us, global 'popularity' is an irrelevant distraction. From which it follows that blogs don't need blogrolls. If we blogroll everyone whose posts we respond to, the blogroll's unnecessary. If, on the other hand, we blogroll everyone whose blog we read - or, from the look of some blogrolls, every Web site we've ever heard of - the power law will kick in: links will inevitably tend to cluster around the 'top' five or ten or fifty blogs, the blogs Everybody Knows, the A List (ugh).

Some final brief thoughts. Blogging tends towards conversation. Conversation routes around gatekeepers (Technorati is, precisely, a gatekeeper - but an avoidable gatekeeper). Conversations happen within domains. People cross domains, but domains don't overlap. Every domain thinks it's the only one.

And there is no long tail. (That's not connected, it's just a trailer for my next post...)

Monday, June 13, 2005

The cloud of knowing

Dave Weinberger has got me thinking again (cheers, Dave).

1. I had been planning on beginning by talking briefly about Aristotle's discovery of the shape of knowledge: To know this robin is to see its place in a hierarchy of similarities (it's like other birds) and differences (it's different from other birds), an incredibly efficient way to organize complex systems.

2. I had been planning on ending by talking about knowledge as a property of conversations.

3. Last year, when writing about why blogs are not (generally) echo chambers, I had talked about conversation as the iterating of differences on a shared ground.

So, in the middle of last night it occurred to me that conversations, as the iteration of differences on the basis of similarity, are formally like Aristotle's description of knowledge as the placing of the known in a system of differences and similarities.

(The 'echo chamber' piece is well worth a look, incidentally, even if you aren't interested in the Dean campaign.)

The key point here, I think, is that the Aristotelian hierarchy is an achieved system of differences-within-similarity. If we characterise a conversation as 'the iteration of differences on the basis of similarity', the stress should be on iteration, on process. In other words, it's not a collaborative attempt to chip away the accumulated crud of ambiguity and tautology and reveal the true hierarchy of knowledge in all its crystalline precision. The knowledge produced by a conversation exists within the conversation, and grows within it; there's always another difference to be iterated (or collapsed). (Compare wikis - although not, oddly, (the public face of) Wikipedia.)

Conversations don't produce a tidy set of definitions which can be picked up and applied elsewhere. What they produce - in one light, what they are - is a tangle of more-or-less definitive associations and exclusions, all resting on a set of prior assumptions whose own definitions are fairly hazy. The sense you make of any argument depends on what you think of its reference points, the argument it's responding to, the person advancing it, the person being responded to... The knowledge produced within a conversation is the (continuing) accumulation of this kind of 'sense'. Structurally, it's not a tree; it's more like a swarm. Conversations are knowledge clouds.

Now pull back. I recently wrote a paper on the 2001-5 Italian government led by Silvio Berlusconi. I quoted several news sources, but also cited sources with titles like "Interpretive approaches and the study of Italian politics" and "System crisis and the origins of a new Right". In other words, I situated my argument within the context of arguments already advanced by other authors. I'm a newcomer to the field of Italian studies; as such, I have little or no standing in the field, and what I have is enhanced if I can underpin what I write with assertions from established writers. The credibility of my arguments is also enhanced, at least among readers who agree with the writers I've cited; to turn it round, the credibility of my arguments, advanced without supporting quotations, is minimal. (Referees made this point, without commenting on the merits of the arguments themselves.) Academic publication, I would suggest, is a continuing conversation - and academic discourse is a knowledge cloud.

Two conclusions. If 'cloudiness' is a universal condition, del.icio.us and flickr and tag clouds and so forth don't enable us to do anything new; what they are giving us is a live demonstration of how the social mind works. Which could be interesting, to put it mildly. On the other hand, those of us who are into tagging need to give some thought to what we've been doing in all these other areas to mitigate the adverse effects of clouds - ranging from group pathologies to the undue influence exerted by anti-social young guys.

Friday, June 03, 2005

The Web as Umwelt

Alfred Schutz: "since human beings are born of mothers and not concocted in retorts, the experience of the existence of other human beings and of the meaning of their actions is certainly the first and most original empirical observation man[sic] makes"

Dave Weinberger: "some things become clearer if you do not start with the premise that people are fundamentally isolated and battle against noise in order to connect with others. Instead, we find ourselves in a world shared by others. Connection comes first. Isolation and alienation are withdrawals from the pre-existence of what is shared."

Connection comes first. I think this insight is significant and underrated, even though in some ways it's staringly obvious. If we take it seriously, it gives us (among other things) a new way of looking at the geek-pathologies of online life: in this view, it wouldn't be a question of isolated (in Real Life) individuals kidding themselves that they're connected (online), but of connected individuals distorting some of their connections - overloading some and neglecting others. But then, in obvious but significant ways, we all are connected individuals: if there's an element of self-deception in these cases, perhaps it starts with the denial of connection.

Thursday, June 02, 2005

Semiological, or almost entirely?

Mike Harper:

Semiotics, which is clearly older than the semantic web, tells us you can't always map signs to real world objects. You can do it for things like, say, the Taj Mahal, but not for things like democracy, justice etc. So they map to concepts. Trouble is, you're talking really about what's inside someone else's head. And you can't really be sure what that is. So, the argument goes, stuff like RDF is just "syntactic sugar". It's neatly structured but can't escape the fact that the tags, URNs etc have to have an agreed meaning ... I can't bring myself to agree with this completely. In practice people seem to get by. I think there must be a feedback loop involved. If you interpret a statement about X and act on it, and your interpretation is wrong, and the interpretation matters in this case, something bad will probably happen. You will then revise your understanding of what is meant by X.

This is all good phenomenological stuff - see the Schutz quote above. One of Schutz's great arguments was that there is no definitional God's eye view - there is only human social experience, including the experience of making and using signs.

So surely the semantic web can work in small ways where all parties are agreed on the meaning of the vocabulary.

The trouble is - as Clay pointed out back here - that if you've got that level of agreement among all participants you don't need the semantics. If you're all using the same schema anyway, your respective schemas don't need to describe themselves - and if they do need to describe themselves, there needs to be a common language they can do it in, and hence a higher level of shared context.

What you can do is say "I'm using [x] to mean $FOO, which is a subtype of $BAR but does not overlap with $BAZ; how about you?" Or rather, "On 2005-06-03, writing in Manchester (England/UK/EU), I used [x] to mean $FOO..." and so on. That, to me, is (or rather will be) where it gets interesting - the point is not to encode semiotics but to encode semantics in such a way that the semiotics can be inferred.
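A sketch of such a statement as plain data, just to show the shape of the thing. Everything here - the speaker, the term, the FOO/BAR/BAZ meanings - is an invented placeholder standing in for a real vocabulary.

```python
# One dated, located assertion about what a term was used to mean
assertion = {
    "speaker": "me",
    "date": "2005-06-03",
    "place": "Manchester (England/UK/EU)",
    "term": "x",
    "means": "FOO",
    "subtype_of": ["BAR"],
    "disjoint_with": ["BAZ"],
}

def compatible(a, b):
    """Two usages of the same term clash only if one declares the other's
    meaning to be non-overlapping."""
    return (a["term"] == b["term"]
            and b["means"] not in a["disjoint_with"]
            and a["means"] not in b["disjoint_with"])
```

The interesting work happens in `compatible`: given two such stamped assertions, a machine can at least detect that your [x] and my [x] don't overlap - which is the "how about you?" half of the exchange.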

Or rather, in such a way that the semiotics can't not be inferred. Which they need to be. Once you get away from the physical sciences and their geek spinoffs, it's very, very hard to reach a final level of granularity. You can map the physical contours of France in exactly the same way that you can map Britain - and with enough data you could map Britain 100 years ago and map France 100 years ago in exactly the same way. What you can't do is chart the number of suicides or street thefts or families in poverty or users of illegal drugs or asylum applications or hospital admissions in Britain and compare them with the figures for Britain 100 years ago, let alone with French figures. This is not because the data isn't there, but (in all those cases) because it's the product of a complex set of social interactions - and, as such, it doesn't have a stable meaning, in time or in space.

This is what I mean about inferring semiotics: figures on 'drug use', to take the most obvious example, are produced in particular ways and classified using particular criteria, which correspond to patterns of public health and law enforcement activity as well as to broader social attitudes. The data doesn't contain or express those attitudes and patterns of activity - but if you don't know about them it's effectively meaningless. ("Hey, look, there are twice as many people using drugs! Oh, wait, there are twice as many substances classified as drugs. Never mind.") The only way forward, it seems to me, is to (as it were) factory-stamp data with the conditions of its production, as far as they can be established: "this source on 'drugs' covers this period in this jurisdiction, and consequently uses definitions derived from this legislation, including this but excluding this and this".
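What might that factory-stamp look like in practice? A minimal sketch, assuming nothing about any real schema: the field names are mine, and the figures are invented (the legislation named is real, but its use here is purely illustrative).

```python
from dataclasses import dataclass, field

@dataclass
class StampedSeries:
    name: str                 # what the figures purport to measure
    values: list              # the raw numbers (invented here)
    jurisdiction: str         # where they were produced
    period: str               # when
    defined_by: str           # the legislation/criteria behind the category
    includes: list = field(default_factory=list)
    excludes: list = field(default_factory=list)

drug_figures = StampedSeries(
    name="recorded drug offences",
    values=[1200, 1350, 2700],      # invented: note the apparent doubling
    jurisdiction="England and Wales",
    period="1971-1973",
    defined_by="Misuse of Drugs Act 1971",
    includes=["cannabis", "opiates"],
    excludes=["alcohol", "tobacco"],
)

# Before reading anything into the jump from 1350 to 2700, a consumer of
# this series can check whether 'defined_by' changed mid-series - the
# "twice as many substances classified as drugs" trap.
```

The stamp doesn't make the data objective; it makes the conditions of its production inspectable, which is as much as we can ask.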

That's what I'd like to do, anyway.