Cloud Street

Monday, October 31, 2005

None of you stand so tall

In the previous post, I showed that the canonical 'power law' chart which underlies the Long Tail image does not, in fact, represent a power law. What it represents is a ranked list, which happens to have a similar shape to a power law series: as it stands, the 'power law' is an artifact of the way the list has been sorted. In particular, the contrast which is often drawn, in this context, between a power law distribution and a normal distribution is inappropriate and misleading. If you sort a list high to low, it can only ever have the shape of a descending curve.
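The point is easy to demonstrate in miniature. The sketch below (in Python; the normal distribution and its parameters are my own choice for illustration, not anything drawn from the original data) generates values from a perfectly ordinary bell curve and then sorts them the way the 'power law' charts do:

```python
import random

random.seed(42)

# Draw 1,000 values from a perfectly ordinary bell curve.
values = [random.gauss(100, 15) for _ in range(1000)]

# Now sort them high to low, as the 'power law' charts do.
ranked = sorted(values, reverse=True)

# The ranked list descends by construction, whatever the
# underlying distribution was.
assert all(a >= b for a, b in zip(ranked, ranked[1:]))
```

Plot `ranked` against its index and you get a descending curve from normally distributed data: the shape tells you about the sorting, not the distribution.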

There are counter-arguments, which I'll go through in strength order (weakest first).

Counter-argument 1: the Argument from Inconsequentiality.

In the post which started it all, Clay wrote:
the shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution.

Note weasel wordage: it would be possible to argue that what Clay (and Jason Kottke) identified wasn't really a power law distribution, it was just some data which could be plotted in a way which looked oddly like a power law curve. Thankfully, Clay cut off this line of retreat, referring explicitly to power law distributions:

power law distributions are ubiquitous. Yahoo Groups mailing lists ranked by subscribers is a power law distribution. LiveJournal users ranked by friends is a power law ... we know that power law distributions tend to arise in social systems where many people express their preferences among many options.

And so on. When we say 'power law', we mean 'power law distribution': we're all agreed on that.

Except, of course, that what we're talking about isn't a power law distribution. Which brings us to...

Counter-argument 2: the Argument from Intuition.

The pages I excerpted in the previous post specifically contrast the power law distribution with the 'normal' bell curve.

many web statistics don’t follow a normal distribution (the infamous bell curve), but a power law distribution. A few items have a significant percentage of the total resource (e.g., inbound links, unique visitors, etc.), and many items with a modest percentage of the resources form a long “tail” in a plot of the distribution.

we find a very few highly connected sites, and very many nearly unconnected sites, a power law distribution whose curve is very high to the left of the graph with the highly connected sites, with a long "tail" to the right of the unconnected sites. This is completely different than the bell curve that folks normally assume

The Web, like most networks, has a peculiar behavior: it doesn't follow standard bell curve distributions ... [it] follows a power law distribution where you get one or two sites with a ton of traffic (like MSN or Yahoo!), and then 10 or 20 sites each with one tenth the traffic of those two, and 100 or 200 sites each with 100th of the traffic, etc.


One of my Latin teachers at school had an infuriating habit, for which (in the best school-story tradition) I'm now very grateful. If you read him a translation which didn't make sense (grammatically, syntactically or literally) he'd give you an anguished look and say, "But how can that be?" It was a rhetorical question, but it was also - infuriatingly - an open question: he genuinely wanted you to look again at what you'd written and realise that, no, actually that noun in the ablative couldn't be the object of the verb... Good training, and not only for reading Latin.

If you've got this far, do me a favour and re-read the excerpts above. Then ask yourself: how can that be?

As long as we're talking about interval/ratio variables - the only type for which a normal distribution can be plotted - it's hard to make sense of this stuff. What, to put it bluntly, is being plotted on the X axis? The best I can do is to suppose that the X axis plots number of sites: A few items have a significant percentage of the total resource; a very few highly connected sites; one or two sites with a ton of traffic. There's your spike on the left: a low X value (a few items) and a high Y (a significant percentage of the total resource).

But this doesn't really work either. Or rather, it could work, but only if every group of sites with the same number of links had a uniquely different number of members - and if the number of members in each group were in inverse proportion to the number of links (1 site with n links, 2 sites with n/2 links, 3 sites with n/3 links, 4 sites with n/4 links...). This isn't impossible, in very much the same way that the spontaneous development of a vacuum in this room isn't impossible; a pattern like that wouldn't be a power law so much as evidence of Intelligent Design.
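A quick sketch of the contrived pattern, with an invented figure of 1,000 links for the top site, makes the implausibility concrete: every group would have to account for exactly the same total number of links.

```python
from fractions import Fraction

n = Fraction(1000)  # links held by the single top site (invented figure)

# The contrived pattern: 1 site with n links, 2 sites with n/2 links,
# 3 sites with n/3 links, and so on.
groups = [(k, n / k) for k in range(1, 11)]  # (sites in group, links each)

# Each group's total contribution is identical -- k * (n/k) == n --
# which is the kind of regularity real data never delivers unasked.
totals = [sites * links for sites, links in groups]
assert all(total == n for total in totals)
```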

This is an elaborate and implausible model; it's also something of a red herring, as we'll see in a minute. It's worth going into in detail, though; as far as I can see, it's the only way of getting these data into a power law distribution, with high numbers of links on the left, without using ranking. And cue...

Counter-argument 3: the Argument from Ranking.

Over to Clay:

The basic shape is simple - in any system sorted by rank, the value for the Nth position will be 1/N. For whatever is being ranked -- income, links, traffic -- the value of second place will be half that of first place, and tenth place will be one-tenth of first place. (There are other, more complex formulae that make the slope more or less extreme, but they all relate to this curve.)


"The value for the Nth position will be 1/N" (or proportionate to 1/N, to be more precise); alternatively, you could say that N items have a value of 1/N or greater. (Have a think about this one - we'll be coming back to it later.) Either way, it's a power law, right? Well, yes - and no. It's certainly true to say that a ranked list with these properties conforms to a version of the power law - specifically, Zipf's law. It's also true to say that Zipfian rankings are associated with Pareto-like power law distributions: we may yet be able to find a power law in this data. But we're not there yet - and Clay's presentation of the data doesn't help us to get there. (Jason's post has some of the same problems, but Clay's piece is a worse offender; it's also much more widely known.)
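The two readings of Zipf's law are easy to check against each other in a quick sketch (the value of first place is an invented 1,000):

```python
top = 1000.0  # value of first place (invented)

# Reading 1: the value at rank N is top / N.
zipf = {rank: top / rank for rank in range(1, 101)}

# Reading 2: N items have a value of top / N or greater.
def items_at_or_above(threshold):
    return sum(1 for value in zipf.values() if value >= threshold)

# Rank 10 has value 100, and exactly 10 items have value >= 100:
assert zipf[10] == 100.0
assert items_at_or_above(100.0) == 10
```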

The first problem is with the recurrent comparison of ranked graphs with bell curves. Adam: "a ranked graph ... by definition is *always* decreasing, and can *never* be a bell curve". If anyone tells you that such and such a phenomenon follows a power law rather than a normal distribution, take a good look at their X axis. If they've got ranks there, the statement is meaningless.

Secondly, the graph Clay presented - a classic of the 'big head, long tail' genre - isn't actually a Zipfian series, for the simple reason that it includes tied ranks: it's not a list of ranks but a list of nominals sorted into rank order.

I'll clarify. Suppose that we've got a series which only loosely conforms to Zipf's Law, perhaps owing to errors in the real world:

Rank   Value
1      1000
2      490
3      340
4      220
5      220
6      180
7      140

Now, what happens on the graph around values 4 and 5? If the X axis represents ranking, it makes no sense to say that the value of 220 corresponds to a rank of 4 and a rank of 5: it's a rank of 4, followed by no ranking for 5 and a rank of 6 for the value of 180. We can see the point even more clearly if we take the alternative interpretation of a Zipfian list and say that the X axis tracks 'number of items with value greater than or equal to Y'. Clearly there are 6 items greater than or equal to 180 and 5 greater than or equal to 220 - but it would be nonsensical to say that there are also 4 items greater than or equal to 220. Either way, if you have a ranked list with tied rankings this should be represented by gaps in the graph.
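The gap is easy to compute. Using the series from the table above, with competition-style ranking (a value's rank is one more than the number of values strictly greater than it):

```python
values = [1000, 490, 340, 220, 220, 180, 140]

# Competition ranking: tied values share a rank, and the ranks
# they displace are skipped.
def rank_of(v):
    return 1 + sum(1 for x in values if x > v)

ranks = sorted({rank_of(v) for v in values})
# The two 220s share rank 4; there is no rank 5 -- a gap in the graph.
assert ranks == [1, 2, 3, 4, 6, 7]
```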

This may seem like a minor nitpick, but it's actually very important. Back to Adam:

One nice thing about a ranked graph is that the “area” under the curve is equal to the total value associated with the items spanned on the ranked axis

Or, in the words of one of the pieces I quoted in the previous post:

In such a curve the distribution tapers off slowly into the sunset, and is called a tail. What is most intriguing about this long tail is that if you add up all the traffic at the end of it, you get a lot of traffic

What we're talking about, clearly, is the Long Tail. Looking at some actual figures for inbound linkage (collected from NZ Bear earlier this year), there are few tied ranks in the higher rankings and more as we go further out: 95 unique values in the first 100 ranks and 79 in the next 100. Further down, the curve grows flatter, as we'd expect. The first ten rankings (ranging from 5,389 down to 2,142 links) correspond to ten sites; the last ten (ranging, predictably, from 9 down to zero) correspond to a total of 14,445. As Adam says, if you were to graph these data as a list of nominals ranked in descending order, the 'area' covered by the curve would give you a good visual impression of the total number of links accounted for by low-linked sites: the Long Tail, no other. But this graphic does not conform to a power law - not even Zipf's Law. A list conforming to Zipf's Law would drop tied ranks - it would exclude duplicates, if that's any clearer. Instead of a long tail, it would trail off to the right with a series of widely-spaced fenceposts. ("In equal 9126th place, blogs with 9 links; in equal 9593rd place, 8-linkers...")

Long Tail, power law: choose one.

You can have a Long Tail, but only by graphing a list of nominals ranked in descending order.

You can have a power law series with rankings, but only by replacing the long tail with scattered fenceposts.

Even more importantly, neither of these is a power law distribution. Given the appropriate data values, you can derive a power law distribution from a ranked list - but it doesn't look like the 'long tail' graphic we know so well. I'll talk about what it does look like in the next post.

Friday, October 21, 2005

Put your head back in the clouds

OK, let's talk about the Long Tail.

I've been promising a series of posts on the Long Tail myth for, um, quite a while. (What's a month in blog time? A few of those.) The Long Tail posts begin here.

Here's what we're talking about, courtesy of our man Shirky:
We are all so used to bell curve distributions that power law distributions can seem odd. The shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution. Of the 433 listed blogs, the top two sites accounted for fully 5% of the inbound links between them. (They were InstaPundit and Andrew Sullivan, unsurprisingly.) The top dozen (less than 3% of the total) accounted for 20% of the inbound links, and the top 50 blogs (not quite 12%) accounted for 50% of such links.


Figure #1: 433 weblogs arranged in rank order by number of inbound links.

It's a popular meme, or it would be if there were any such thing as a meme (maybe I'll tackle that one another time). Here's one echo:
many web statistics don’t follow a normal distribution (the infamous bell curve), but a power law distribution. A few items have a significant percentage of the total resource (e.g., inbound links, unique visitors, etc.), and many items with a modest percentage of the resources form a long “tail” in a plot of the distribution. For example, a few websites have millions of links, more have hundreds of thousands, even more have hundreds or thousands, and a huge number of sites have just one, two, or a few.
Another:
if we measure the connectivity of a sample of 1000 web sites, (i.e. the number of other web sites that point to them), we might find a bell curve distribution, with an "average" of X and a standard deviation of Y. If, however, that sample happened to contain google.com, then things would be off the chart for the "outlier" and normal for every other one.

If we back off to see the whole web's connectivity, we find a very few highly connected sites, and very many nearly unconnected sites, a power law distribution whose curve is very high to the left of the graph with the highly connected sites, with a long "tail" to the right of the unconnected sites. This is completely different than the bell curve that folks normally assume
And another:
The Web, like most networks, has a peculiar behavior: it doesn't follow standard bell curve distributions where most people's activities are very similar (for example if you plot out people's heights you get a bell curve with lots of five- and six-foot people and no 20-foot giants). The Web, on the other hand, follows a power law distribution where you get one or two sites with a ton of traffic (like MSN or Yahoo!), and then 10 or 20 sites each with one tenth the traffic of those two, and 100 or 200 sites each with 100th of the traffic, etc. In such a curve the distribution tapers off slowly into the sunset, and is called a tail. What is most intriguing about this long tail is that if you add up all the traffic at the end of it, you get a lot of traffic
All familiar, intuitive stuff. It's entered the language, after all - we all know what the 'long tail' is. And when, for example, Ross writes about somebody who started blogging about cooking at the end of the tail and is now part of the fat head and has become a pro, we all know what the 'fat head' is, too - and we know what (and who) is and isn't part of it.

Unfortunately, the Long Tail doesn't exist.

To back up that assertion, I'm going to have to go into basic statistics - and trust me, I do mean 'basic'. In statistics there are three levels of measurement, which is to say that there are three types of variable. You can measure by dividing the field of measurement into discrete partitions, none of which is inherently ranked higher than any other. This car is blue (could have been red or green); this conference speaker is male (could have been female); this browser is running under OS X (could have been Win XP). These are nominal variables. You can code up nominals like this as numbers - 01=blue, 02=red; 1=male, 2=female - but it won't help you with the analysis. The numbers can't be used as numbers: there's no sense in which red is greater than blue, female is greater than male or OS X is - OK, bad example. Since nominals don't have numerical value, you can't calculate a mean or a median with them; the most you can derive is a mode (the most frequent value).

Then there are ordinal variables. You derive ordinal variables by dividing the field of measurement into discrete and ordered partitions: 1st, 2nd, 3rd; very probable, quite probable, not very probable, improbable; large, extra-large, XXL, SuperSize. As this last example suggests, the range covered by values of an ordinal variable doesn't have to exhaust all the possibilities; all that matters is that the different values are distinct and can be ranked in order. Numeric coding starts to come into its own with ordinals. Give 'large' (etc) codes 1, 2, 3 and 4, and a statement that (say) '50% of size observations are less than 3' actually makes sense, in a way that it wouldn't have made sense if we were talking about car colour observations. In slightly more technical language, you can calculate a mode with ordinal variables, but you can also calculate a median: the value which is at the numerical mid-point of the sample, when the entire sample is ordered low to high.

Finally, we have interval/ratio or I/R variables. You derive an I/R variable by measuring against a standard scale, with a zero point and equal units. As the name implies, an I/R variable can be an interval (ten hours, five metres) or a ratio (30 decibels, 30% probability). All that matters is that different values are arithmetically consistent: 3 units minus 2 units is the same as 5 minus 4; there's a 6:5 ratio between 6 units and 5 units. Statistics starts to take off when you introduce I/R variables. We can still calculate a mode (the most common value) and a median (the midpoint of the distribution), but now we can also calculate a mean: the arithmetic average of all values. (You could calculate a mean for ordinals or even nominals, but the resulting number wouldn't tell you anything: you can't take an average of 'first', 'second' and 'third'.)
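A small worked example of the three averages, using invented height measurements in centimetres (an I/R variable):

```python
import statistics

heights = [165, 170, 175, 180, 180, 180, 190]  # cm

mode = statistics.mode(heights)      # most frequent value
median = statistics.median(heights)  # midpoint of the ordered sample
mean = statistics.mean(heights)      # arithmetic average

# All three are meaningful here; for ordinals only the mode and median
# would be, and for nominals only the mode.
```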

You can visualise the difference between nominals, ordinals and I/R variables by imagining you're laying out a simple bar chart. It's very simple: you've got two columns, a long one and a short one. We'll also assume that you're doing this by hand, with two rectangular pieces of paper that you've cut out - perhaps you're designing a poster, or decorating a float for the Statistical Parade. Now: where are you going to place those two columns? If they're nominals ('red cars' vs 'blue cars'), it's entirely up to you: you can put the short one on the left or the right, you can space them out or push them together, you can do what you like. If they're ordinals ('second class degree awards' vs 'third class') you don't have such a free rein: spacing is still up to you, but you will be expected to put the 'third' column to the right of the 'second'. If they're I/R variables, finally - '180 cm', '190 cm' - you'll have no discretion at all: the 180 column needs to go at the 180 point on the X axis, and similarly for the 190.

Almost finished. Now let's talk curves. The 'normal distribution' - the 'bell curve' - is a very common distribution of I/R variables: not very many low values on the left, lots of values in the middle, not very many high values on the right. The breadth and steepness of the 'hump' varies, but all bell curves are characterised by relatively steep rising and falling curves, contrasting with the relative flatness of the two tails and the central plateau. The 'power law distribution' is a less common family of distributions, in which the number of values is inversely proportionate to the value itself or a power of the value. For example, deriving Y values from the inverse of the cube of X:
X value   Y formula        Y value
1         1000 / (1^3)     1000
2         1000 / (2^3)     125
3         1000 / (3^3)     37.037
4         1000 / (4^3)     15.625
5         1000 / (5^3)     8
6         1000 / (6^3)     4.63
As you can see, a power law curve begins high, declines steeply then 'levels out' and declines ever more shallowly (it tends towards zero without ever reaching it, in fact).
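The table above is a one-liner to reproduce:

```python
# Y = 1000 / X^3, as in the table above.
pairs = [(x, 1000 / x**3) for x in range(1, 7)]

# The curve begins high, declines steeply, then ever more shallowly,
# tending towards zero without reaching it.
assert pairs[0][1] == 1000.0 and pairs[1][1] == 125.0
assert all(y > 0 for _, y in pairs)
```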

Got all that? Right. Quick question: how do you tell a normal distribution from a power-law distribution? It's simple, really. In one case both low and high values have low numbers of occurrences, while most occurrences are in the central plateau of values around the mean. In the other, the lowest values have the highest numbers of occurrences; most values have low occurrence counts, and high values have the lowest counts of all. In both cases, though, what you're looking at is the distribution of interval/ratio variables. The peaks and tails of those distribution curves can be located precisely, because they're determined by the relative counts (Y axis) of different values (X axis) - just as in the case of our imaginary bar chart.

Back to a real bar chart.

Figure #1: 433 weblogs arranged in rank order by number of inbound links.

The shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution.

As you can see, this actually isn't a power law distribution - roughly or otherwise. It's just a list. These aren't I/R variables; they aren't even ordinals. What we've got here is a graphical representation of a list of nominal variables (look along the X axis), ranked in descending order of occurrences. We can do a lot better than that - but it will mean forgetting all about the idea that low-link-count sites are in a 'long tail', while the sites with heavy traffic are in the 'head'.
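The difference between the two pictures is worth spelling out in code. With invented link counts for a handful of sites: the 'long tail' chart sorts nominals into descending order, while an actual distribution counts how many sites have each link count.

```python
from collections import Counter

# Invented link counts for a handful of sites (the site names are
# nominals -- the X axis of Figure #1).
links = {"site_a": 50, "site_b": 12, "site_c": 3, "site_d": 3,
         "site_e": 1, "site_f": 1, "site_g": 1}

# The 'long tail' chart: nominals sorted into descending order.
ranked = sorted(links.values(), reverse=True)   # [50, 12, 3, 3, 1, 1, 1]

# An actual distribution: for each link count (an I/R variable),
# how many sites have that count?
distribution = Counter(links.values())          # {1: 3, 3: 2, 50: 1, 12: 1}
```

The first descends by construction; the second is what a power law distribution would have to be a statement about.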

[Next post: how we could save the Long Tail, and why we shouldn't try.]

Wednesday, October 19, 2005

Good neighbors

[Updated 20/10 - tidying-up, response to Adam, Malik quote etc etc]

Shelley:
Through the various link services, last week I found that my RSS entries were being published to a GreatestJournal site. I’d never heard of GreatestJournal, and when I went to contact the site to ask them to remove the feed, there is no contact information. I did find, though, a trouble ticket area and submitted a ticket asking the site to remove the account.
In reply, "GreatestJournal" (whoever they are) told Shelley that her RSS feed was in the public domain, so they could do whatever they liked with it. ("You might wish to take your feed down if you don’t want people to use it." That's helpful.)

One other thing: the email in which they conveyed this information had a copyright notice at the bottom. (Shelley reprinted it anyway.)

Coincidentally, I'd recently been reading this post on EconoMeta, in which Adam talks about our changing relationship with our personal data:
one important part of Web 2.0 is the separation of user data from the applications that use it, and the idea that users should own and control this data.
...
the switching costs imposed by Web 1.0 companies to get a competitive advantage are being replaced by different switching costs created by the *users* of Web 2.0 companies ... [e.g.] the switching costs created by the value of a social network at MySpace or a reputation on eBay, as opposed to the switching cost created by the email address and “walled garden” at AOL.
Separation of user data from applications? Check. User ownership and control? Um, not so much.

It seems to me that this is (depending on how charitable you're feeling) a naive oversight, a lurking contradiction or a dirty little secret at the heart of the "Web 2.0" vision: it's not about the users. Here's Tim O'Reilly, no less:
Let's close, therefore, by summarizing what we believe to be the core competencies of Web 2.0 companies:
  • Services, not packaged software, with cost-effective scalability
  • Control over unique, hard-to-recreate data sources that get richer as more people use them
  • Trusting users as co-developers
  • Harnessing collective intelligence
  • Leveraging the long tail through customer self-service
  • Software above the level of a single device
  • Lightweight user interfaces, development models, AND business models
So we've got software companies harnessing collective intelligence, leveraging the Snaggly Fence* - and, of course, exercising control over unique data. Unique and hard-to-recreate data. Unique data that's continually enriched by its users. We're talking social software, aren't we?

It seems increasingly clear that there are two sides to Web 2.0. The sunny side - the 'social software' side - is where we ask questions like:

Q: How will the data sources become unique and impossible to recreate?
A: By being enriched!
Q: How will the data be enriched?
A: Through being used by people!
Q: How will people use the data?
A: Quickly, easily, intuitively and in their thousands!

That's also the easy side of Web 2.0 - there aren't too many posers there, as you can see.

But there's another side, where we ask questions like "Who will own those data sources?" - and, increasingly, "How will they get hold of them to begin with?" Which, I think, is where GreatestJournal comes in. In comments at Shelley's post, Roger Benningfield made the Web 2.0 connection:
I came across a whole swarm of Web 2.0 stuff in my aggregator. “Microformats, XHTML, death to walled gardens!” they cried.

And I thought, “Oh, you guys are *fucked*.” Because ultimately, the business models they’re envisioning are going to make GreatestJournal’s response look friendly in comparison. If they ever manage to build any momentum (questionable), they’re going to hit a brick wall of posts like this one… a *big* wall.
Case in point: a thoroughly odd development called Sxore. Adina: "The idea is that if a user signs up to comment on one blog, they'll be able to comment on other blogs. ... Sxore creates an RSS feed for each user. Presumably you can follow comments made by that user across different blogs. So, if you think someone has good ideas about blog visualizations, you get to read what they also think about President Bush." Hmmm. What was that about users owning and controlling their data again?

Om Malik has been having similar thoughts:
if we tag, bookmark or share, and help del.icio.us or Technorati or Yahoo become better commercial entities, aren’t we seemingly commoditizing our most valuable asset - time. We become the outsourced workforce, the collective, though it is still unclear what is the pay-off. While we may (or may not) gain something from the collective efforts, the odds are whatever “the collective efforts” are, they are going to boost the economic value of those entities. Will they share in their upside? Not likely!

Take Skype as an example - it rides on our broadband pipes, for which we pay a hefty monthly charge. It uses our computers and pipes to replace a network that cost phone companies billions to build. In exchange we can make free phone calls to other Skype users. I have no problems with that. I had no problems with Skype charging me for SkypeIN and SkypeOUT calls as well, for this was only a premium service only to be used if and when needed.

However, now that it is part of eBay, I do cringe a little.
It seems to me that the Web 2.0 hype is about social software, but only in the sense that it's about monetising social software: in Marxist terms it's a form of primitive accumulation. In non-Marxist terms, it's enclosure: appropriating something that exists outside the circuit of trading and ownership and managing the supply so that it can only be obtained within that circuit. Or: stealing it and selling it back. I don't know what the GreatestJournal business model is, or how Sxore are planning on making their money; probably something perfectly obvious and straightforward. But it seems to involve turning our work into their assets. I'm not too keen.

In response to Adam (in comments), my concern isn't that it's impossible to draw a line where the benefits of social software can coexist with monetisation (I myself use and endorse the fine products of Blogger.com, after all). What worries me, firstly, is that the drive for monetisation is producing pressures for closure (and enclosure). Secondly, that half the people who advocate Web 2.0 seem to share the company perspective to the point of positively welcoming these developments (see the O'Reilly sermon linked above) - while a lot of the rest are so committed to the vision as to be spectacularly ill-prepared to put up any resistance.

My immediate reaction to Shelley's GreatestJournal post was to leap to the defence of walled gardens - "Walled gardens are full of people!". It's a nice line, but on reflection I don't think it's quite right. What we're hearing is a sublime (although far from unprecedented) example of chutzpah - a critique of barriers by advocates of enclosure. The blogosphere isn't a walled garden, it's a wide-open common where nobody has ownership rights. An enclave which can't be strip-mined isn't walled in; all that's happened is that the predators - who would put their own fences around it if they could - have been walled out. Long may they remain so.

(The Americanism in the title is deliberate, incidentally.)

*There Is No Long Tail

Wednesday, October 05, 2005

Everything playing at once

Dave:

I no longer look at the front page of the NY Times to tell me what's important. I look at it to see what people like the editors of the NY Times think is important. I'm finding the news that matters through the Internet recommendation engine: Blogs, emails, mailing lists, my aggregator, websites that aggregate and comment on news, etc.

Brief thoughts (also appearing in comments at Dave's): we're back with finding out what people say about stuff. Which is, ultimately, all there is to find out. Knowledge - and, for that matter, news - has always been produced in cloud form, as an emergent property of conversations. When we counterpose knowledge to conversation, we're really saying that certain conversations have ended - or been brought to an end - and left unchallenged conclusions behind them. What's changed is that, until recently, the conversations which produce knowledge (and news) have taken place within small and closed groups, so that most of us have only seen the crystallised end-product of the conversation. What Wikipedia, blogging, RSS and del.icio.us give us is the rudiments of a distributed conversation platform, enabled by pervasive broadband. (Which is why the ownership of the authority to stop the conversation - and crystallise the cloud - is such a big issue.)