Cloud Street

Friday, October 21, 2005

Put your head back in the clouds

OK, let's talk about the Long Tail.

I've been promising a series of posts on the Long Tail myth for, um, quite a while. (What's a month in blog time? A few of those.) The Long Tail posts begin here.

Here's what we're talking about, courtesy of our man Shirky:
We are all so used to bell curve distributions that power law distributions can seem odd. The shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution. Of the 433 listed blogs, the top two sites accounted for fully 5% of the inbound links between them. (They were InstaPundit and Andrew Sullivan, unsurprisingly.) The top dozen (less than 3% of the total) accounted for 20% of the inbound links, and the top 50 blogs (not quite 12%) accounted for 50% of such links.


Figure #1: 433 weblogs arranged in rank order by number of inbound links.

It's a popular meme, or it would be if there were any such thing as a meme (maybe I'll tackle that one another time). Here's one echo:
many web statistics don’t follow a normal distribution (the infamous bell curve), but a power law distribution. A few items have a significant percentage of the total resource (e.g., inbound links, unique visitors, etc.), and many items with a modest percentage of the resources form a long “tail” in a plot of the distribution. For example, a few websites have millions of links, more have hundreds of thousands, even more have hundreds or thousands, and a huge number of sites have just one, two, or a few.
Another:
if we measure the connectivity of a sample of 1000 web sites, (i.e. the number of other web sites that point to them), we might find a bell curve distribution, with an "average" of X and a standard deviation of Y. If, however, that sample happened to contain google.com, then things would be off the chart for the "outlier" and normal for every other one.

If we back off to see the whole web's connectivity, we find a very few highly connected sites, and very many nearly unconnected sites, a power law distribution whose curve is very high to the left of the graph with the highly connected sites, with a long "tail" to the right of the unconnected sites. This is completely different than the bell curve that folks normally assume
And another:
The Web, like most networks, has a peculiar behavior: it doesn't follow standard bell curve distributions where most people's activities are very similar (for example if you plot out people's heights you get a bell curve with lots of five- and six-foot people and no 20-foot giants). The Web, on the other hand, follows a power law distribution where you get one or two sites with a ton of traffic (like MSN or Yahoo!), and then 10 or 20 sites each with one tenth the traffic of those two, and 100 or 200 sites each with 100th of the traffic, etc. In such a curve the distribution tapers off slowly into the sunset, and is called a tail. What is most intriguing about this long tail is that if you add up all the traffic at the end of it, you get a lot of traffic
All familiar, intuitive stuff. It's entered the language, after all - we all know what the 'long tail' is. And when, for example, Ross writes about somebody who started blogging about cooking at the end of the tail and is now part of the fat head and has become a pro, we all know what the 'fat head' is, too - and we know what (and who) is and isn't part of it.

Unfortunately, the Long Tail doesn't exist.

To back up that assertion, I'm going to have to go into basic statistics - and trust me, I do mean 'basic'. In statistics there are three levels of measurement, which is to say that there are three types of variable. You can measure by dividing the field of measurement into discrete partitions, none of which is inherently ranked higher than any other. This car is blue (could have been red or green); this conference speaker is male (could have been female); this browser is running under OS X (could have been Win XP). These are nominal variables. You can code up nominals like this as numbers - 01=blue, 02=red; 1=male, 2=female - but it won't help you with the analysis. The numbers can't be used as numbers: there's no sense in which red is greater than blue, female is greater than male or OS X is - OK, bad example. Since nominals don't have numerical value, you can't calculate a mean or a median with them; the most you can derive is a mode (the most frequent value).

Then there are ordinal variables. You derive ordinal variables by dividing the field of measurement into discrete and ordered partitions: 1st, 2nd, 3rd; very probable, quite probable, not very probable, improbable; large, extra-large, XXL, SuperSize. As this last example suggests, the range covered by values of an ordinal variable doesn't have to exhaust all the possibilities; all that matters is that the different values are distinct and can be ranked in order. Numeric coding starts to come into its own with ordinals. Give 'large' (etc) codes 1, 2, 3 and 4, and a statement that (say) '50% of size observations are less than 3' actually makes sense, in a way that it wouldn't have made sense if we were talking about car colour observations. In slightly more technical language, you can calculate a mode with ordinal variables, but you can also calculate a median: the value which is at the numerical mid-point of the sample, when the entire sample is ordered low to high.

Finally, we have interval/ratio or I/R variables. You derive an I/R variable by measuring against a standard scale, with a zero point and equal units. As the name implies, an I/R variable can be an interval (ten hours, five metres) or a ratio (30 decibels, 30% probability). All that matters is that different values are arithmetically consistent: 3 units minus 2 units is the same as 5 minus 4; there's a 6:5 ratio between 6 units and 5 units. Statistics starts to take off when you introduce I/R variables. We can still calculate a mode (the most common value) and a median (the midpoint of the distribution), but now we can also calculate a mean: the arithmetic average of all values. (You could calculate a mean for ordinals or even nominals, but the resulting number wouldn't tell you anything: you can't take an average of 'first', 'second' and 'third'.)
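The three levels, and which averages are legitimate at each, can be sketched in a few lines of Python (the data here are invented for illustration):

```python
from statistics import mode, median, mean

# Nominal: unordered categories - only the mode is meaningful.
colours = ["blue", "red", "blue", "green", "blue"]
print(mode(colours))  # -> blue

# Ordinal: ordered categories - mode and median both make sense.
# Coding large=1, extra-large=2, XXL=3, SuperSize=4 preserves the ranking.
sizes = [1, 2, 2, 3, 4, 4, 4]
print(mode(sizes), median(sizes))  # -> 4 3

# Interval/ratio: a real scale with equal units - the mean is now valid too.
heights_cm = [180, 185, 190, 175, 180]
print(mode(heights_cm), median(heights_cm), mean(heights_cm))  # -> 180 180 182
```

Note that Python will happily compute `mean(sizes)` as well - the point is not that the software stops you, but that the resulting number doesn't mean anything for ordinals.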

You can visualise the difference between nominals, ordinals and I/R variables by imagining you're laying out a simple bar chart. It's very simple: you've got two columns, a long one and a short one. We'll also assume that you're doing this by hand, with two rectangular pieces of paper that you've cut out - perhaps you're designing a poster, or decorating a float for the Statistical Parade. Now: where are you going to place those two columns? If they're nominals ('red cars' vs 'blue cars'), it's entirely up to you: you can put the short one on the left or the right, you can space them out or push them together, you can do what you like. If they're ordinals ('second class degree awards' vs 'third class') you don't have such a free rein: spacing is still up to you, but you will be expected to put the 'third' column to the right of the 'second'. If they're I/R variables, finally - '180 cm', '190 cm' - you'll have no discretion at all: the 180 column needs to go at the 180 point on the X axis, and similarly for the 190.

Almost finished. Now let's talk curves. The 'normal distribution' - the 'bell curve' - is a very common distribution of I/R variables: not very many low values on the left, lots of values in the middle, not very many high values on the right. The breadth and steepness of the 'hump' varies, but all bell curves are characterised by relatively steep rising and falling curves, contrasting with the relative flatness of the two tails and the central plateau. The 'power law distribution' is a less common family of distributions, in which the frequency of a value is inversely proportional to the value itself, or to a power of the value. For example, deriving Y values from the inverse of the cube of X:
X value   Y formula        Y value
1         1000 / (1^3)     1000
2         1000 / (2^3)     125
3         1000 / (3^3)     37.037
4         1000 / (4^3)     15.625
5         1000 / (5^3)     8
6         1000 / (6^3)     4.63
As you can see, a power law curve begins high, declines steeply, then 'levels out' and declines ever more shallowly (in fact it tends towards zero without ever reaching it).
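The table above is just arithmetic, so it's easy to reproduce (a quick sketch, using the same inverse-cube formula):

```python
# Reproduce the table: Y = 1000 / X^3 for X from 1 to 6.
for x in range(1, 7):
    y = 1000 / x**3
    print(f"{x}\t{y:.3f}")
# -> 1  1000.000
#    2  125.000
#    3  37.037
#    4  15.625
#    5  8.000
#    6  4.630
```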

Got all that? Right. Quick question: how do you tell a normal distribution from a power-law distribution? It's simple, really. In one case both low and high values have low numbers of occurrences, while most occurrences are in the central plateau of values around the mean. In the other, the lowest values have the highest numbers of occurrences; most values have low occurrence counts, and high values have the lowest counts of all. In both cases, though, what you're looking at is the distribution of interval/ratio variables. The peaks and tails of those distribution curves can be located precisely, because they're determined by the relative counts (Y axis) of different values (X axis) - just as in the case of our imaginary bar chart.
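As a toy sanity check (the distributions and parameters here are invented for illustration), you can draw one sample from a normal distribution and one from a power-law-ish (Pareto) distribution, and look at which values occur most often:

```python
import random
from collections import Counter

random.seed(0)

# Normal-ish I/R data: occurrences should peak near the mean (100).
normal = [round(random.gauss(100, 10)) for _ in range(10_000)]

# Power-law-ish data (Pareto, shape 1.5): the lowest values occur most often.
power = [round(random.paretovariate(1.5)) for _ in range(10_000)]

print(Counter(normal).most_common(1))  # mode lands near 100
print(Counter(power).most_common(1))   # mode is the lowest value, 1
```

In both samples the variables are genuine I/R values, so the counts-per-value plot is a real distribution curve - which is exactly the property the figure below lacks.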

Back to a real bar chart.

Figure #1: 433 weblogs arranged in rank order by number of inbound links.

The shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution.

As you can see, this actually isn't a power law distribution - roughly or otherwise. It's just a list. These aren't I/R variables; they aren't even ordinals. What we've got here is a graphical representation of a list of nominal variables (look along the X axis), ranked in descending order of occurrences. We can do a lot better than that - but it will mean forgetting all about the idea that low-link-count sites are in a 'long tail', while the sites with heavy traffic are in the 'head'.
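To see why the ranking alone tells you so little, here's a toy simulation (entirely invented data): give 433 hypothetical blogs link counts drawn from a textbook bell curve, then rank them in descending order as in Figure #1. The ranked chart still falls away from left to right - head, tail and all - even though the underlying distribution is perfectly normal.

```python
import random

random.seed(42)

# Invented data: 433 blogs whose inbound-link counts come from a
# normal distribution (mean 500, sd 100) - a pure bell curve.
links = [max(0, round(random.gauss(500, 100))) for _ in range(433)]

# Rank in descending order, as in Figure #1.
ranked = sorted(links, reverse=True)

# The ranked list declines from left to right by construction -
# the 'tail' is an artefact of the sorting, not evidence of a power law.
print(ranked[0], ranked[216], ranked[-1])
```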

[Next post: how we could save the Long Tail, and why we shouldn't try.]

3 Comments:

  • Great point Phil, I think bringing in formal variable classes helps to make clear the differences here. Maybe you're going to get to this in the next installment, but the whole point of my previous post was that a ranked histogram is indeed a list of nominal variables, but that the ranking itself allows one to extract a closely related I/R variable.

    In your example, the X axis of a ranked histogram of blogs is indeed a list of nominal variables, but the fact that this list is ranked makes them into ordinals, since blog A has a rank (number of inbound links) that is higher or lower than blog B. We can then note that if the nth ranked blog has y inbound links, this is equivalent to saying that n blogs have y or more inbound links. This allows us to transform the ordinal variable of ranked blogs into the I/R variable of number of blogs (with y or more inbound links).

    Now you're an inversion and a derivative away from a classic probability density function (PDF) showing the number (or percentage) of blogs with a given number of inbound links. I showed before that if the ranked histogram fits a power law, then the corresponding PDF is also a power law. I just put up a new post showing that if the ranked histogram doesn't fit a power law (but perhaps still has a "long tail"), the corresponding PDF *can* in fact have a meaningful average; in fact, it can be an exact bell curve!

    By Blogger Adam, at 22/10/05 18:49  

  • Nice work, Adam, although I confess that the math[s] is still a bit beyond me in a couple of places. Yes, I'll be going over some of the ground you covered in the next post in the series, after I've pointed out the problems with some of those superficially plausible 'power law' illustrations ("A few items have a significant percentage of the total resource ... and many items with a modest percentage of the resources form a long 'tail'"). Stay tuned (but don't hold your breath!)

    By Blogger Phil, at 24/10/05 11:04  

  • yes, i know: time is tight. but i really would love to read that:
    "It's a popular meme, or it would be if there were any such thing as a meme (maybe I'll tackle that one another time)."

    By Blogger martin lindner, at 14/11/05 11:58  
