Cloud Street

Monday, October 31, 2005

None of you stand so tall

In the previous post, I showed that the canonical 'power law' chart which underlies the Long Tail image does not, in fact, represent a power law. What it represents is a ranked list, which happens to have a similar shape to a power law series: as it stands, the 'power law' is an artifact of the way the list has been sorted. In particular, the contrast which is often drawn, in this context, between a power law distribution and a normal distribution is inappropriate and misleading. If you sort a list high to low, it can only ever have the shape of a descending curve.
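To make that last point concrete, here is a minimal sketch in Python. The data are invented for illustration: a thousand values drawn from a normal distribution, then sorted high to low in the way the Long Tail charts sort their sites.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical data: 1,000 values drawn from a normal (bell curve) distribution.
sample = [random.gauss(100, 15) for _ in range(1000)]

# "Rank" the values by sorting them high to low.
ranked = sorted(sample, reverse=True)

# Whatever distribution the values came from, the ranked series can never rise.
never_rises = all(a >= b for a, b in zip(ranked, ranked[1:]))
print(never_rises)  # True
```

The descending shape tells us nothing about where the values came from; it's produced by the sort.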

There are counter-arguments, which I'll go through in strength order (weakest first).

Counter-argument 1: the Argument from Inconsequentiality.

In the post which started it all, Clay wrote:
the shape of Figure #1, several hundred blogs ranked by number of inbound links, is roughly a power law distribution.

Note weasel wordage: it would be possible to argue that what Clay (and Jason Kottke) identified wasn't really a power law distribution, it was just some data which could be plotted in a way which looked oddly like a power law curve. Thankfully, Clay cut off this line of retreat, referring explicitly to power law distributions:

power law distributions are ubiquitous. Yahoo Groups mailing lists ranked by subscribers is a power law distribution. LiveJournal users ranked by friends is a power law ... we know that power law distributions tend to arise in social systems where many people express their preferences among many options.

And so on. When we say 'power law', we mean 'power law distribution': we're all agreed on that.

Except, of course, that what we're talking about isn't a power law distribution. Which brings us to...

Counter-argument 2: the Argument from Intuition.

The pages I excerpted in the previous post specifically contrast the power law distribution with the 'normal' bell curve.

many web statistics don’t follow a normal distribution (the infamous bell curve), but a power law distribution. A few items have a significant percentage of the total resource (e.g., inbound links, unique visitors, etc.), and many items with a modest percentage of the resources form a long “tail” in a plot of the distribution.

we find a very few highly connected sites, and very many nearly unconnected sites, a power law distribution whose curve is very high to the left of the graph with the highly connected sites, with a long "tail" to the right of the unconnected sites. This is completely different than the bell curve that folks normally assume

The Web, like most networks, has a peculiar behavior: it doesn't follow standard bell curve distributions ... [it] follows a power law distribution where you get one or two sites with a ton of traffic (like MSN or Yahoo!), and then 10 or 20 sites each with one tenth the traffic of those two, and 100 or 200 sites each with 100th of the traffic, etc.


One of my Latin teachers at school had an infuriating habit, for which (in the best school-story tradition) I'm now very grateful. If you read him a translation which didn't make sense (grammatically, syntactically or literally) he'd give you an anguished look and say, "But how can that be?" It was a rhetorical question, but it was also - infuriatingly - an open question: he genuinely wanted you to look again at what you'd written and realise that, no, actually that noun in the ablative couldn't be the object of the verb... Good training, and not only for reading Latin.

If you've got this far, do me a favour and re-read the excerpts above. Then ask yourself: how can that be?

As long as we're talking about interval/ratio variables - the only type for which a normal distribution can be plotted - it's hard to make sense of this stuff. What, to put it bluntly, is being plotted on the X axis? The best I can do is to suppose that the X axis plots number of sites: A few items have a significant percentage of the total resource; a very few highly connected sites; one or two sites with a ton of traffic. There's your spike on the left: a low X value (a few items) and a high Y (a significant percentage of the total resource).

But this doesn't really work either. Or rather, it could work, but only if every group of sites with the same number of links had a uniquely different number of members - and if the number of members in each group were in inverse proportion to the number of links (1 site with n links, 2 sites with n/2 links, 3 sites with n/3 links, 4 sites with n/4 links...). This isn't impossible, in very much the same way that the spontaneous development of a vacuum in this room isn't impossible; a pattern like that wouldn't be a power law so much as evidence of Intelligent Design.

This is an elaborate and implausible model; it's also something of a red herring, as we'll see in a minute. It's worth going into in detail, though; as far as I can see, it's the only way of getting these data into a power law distribution, with high numbers of links on the left, without using ranking. And cue...

Counter-argument 3: the Argument from Ranking.

Over to Clay:

The basic shape is simple - in any system sorted by rank, the value for the Nth position will be 1/N. For whatever is being ranked -- income, links, traffic -- the value of second place will be half that of first place, and tenth place will be one-tenth of first place. (There are other, more complex formulae that make the slope more or less extreme, but they all relate to this curve.)


"The value for the Nth position will be 1/N" (or proportionate to 1/N, to be more precise); alternatively, you could say that N items have a value of 1/N or greater. (Have a think about this one - we'll be coming back to it later.) Either way, it's a power law, right? Well, yes - and no. It's certainly true to say that a ranked list with these properties confirms to a version of the power law - specifically, Zipf's law. It's also true to say that Zipfian rankings are associated with Pareto-like power law distributions: we may yet be able to find a power law in this data. But we're not there yet - and Clay's presentation of the data doesn't help us to get there. (Jason's has some of the same problems, but Clay's piece is a worse offender; it's also much more widely known.)

The first problem is with the recurrent comparison of ranked graphs with bell curves. Adam: "a ranked graph ... by definition is *always* decreasing, and can *never* be a bell curve". If anyone tells you that such and such a phenomenon follows a power law rather than a normal distribution, take a good look at their X axis. If they've got ranks there, the statement is meaningless.

Secondly, the graph Clay presented - a classic of the 'big head, long tail' genre - isn't actually a Zipfian series, for the simple reason that it includes tied ranks: it's not a list of ranks but a list of nominals sorted into rank order.

I'll clarify. Suppose that we've got a series which only loosely conforms to Zipf's Law, perhaps owing to errors in the real world:

Rank  Value
1     1000
2     490
3     340
4     220
5     220
6     180
7     140

Now, what happens on the graph around ranks 4 and 5? If the X axis represents ranking, it makes no sense to say that the value of 220 corresponds to a rank of 4 and a rank of 5: it's a rank of 4, followed by no ranking for 5 and a rank of 6 for the value of 180. We can see the point even more clearly if we take the alternative interpretation of a Zipfian list and say that the X axis tracks 'number of items with value greater than or equal to Y'. Clearly there are 6 items greater than or equal to 180 and 5 greater than or equal to 220 - but it would be nonsensical to say that there are also 4 items greater than or equal to 220. Either way, if you have a ranked list with tied rankings this should be represented by gaps in the graph.
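Both readings of the X axis can be computed for the seven-value series above. This is a sketch using what is usually called standard competition ranking (tied items share the best rank and the next rank is skipped); the variable names are mine.

```python
values = [1000, 490, 340, 220, 220, 180, 140]  # the example series above

# Reading 1: rank = 1 + number of strictly larger values (ties share a rank).
ranks = [1 + sum(other > v for other in values) for v in values]
print(ranks)  # [1, 2, 3, 4, 4, 6, 7]

# Reading 2: for each distinct value, how many items are >= that value.
at_least = {
    v: sum(other >= v for other in values)
    for v in sorted(set(values), reverse=True)
}
print(at_least)  # {1000: 1, 490: 2, 340: 3, 220: 5, 180: 6, 140: 7}
```

On the first reading no item sits at rank 5; on the second, no value sits at X = 4. Either way there is a gap on the X axis where the tie occurs.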

This may seem like a minor nitpick, but it's actually very important. Back to Adam:

One nice thing about a ranked graph is that the “area” under the curve is equal to the total value associated with the items spanned on the ranked axis

Or, in the words of one of the pieces I quoted in the previous post:

In such a curve the distribution tapers off slowly into the sunset, and is called a tail. What is most intriguing about this long tail is that if you add up all the traffic at the end of it, you get a lot of traffic

What we're talking about, clearly, is the Long Tail. Looking at some actual figures for inbound linkage (collected from NZ Bear earlier this year), there are few tied ranks in the higher rankings and more as we go further out: 95 unique values in the first 100 ranks and 79 in the next 100. Further down, the curve grows flatter, as we'd expect. The first ten rankings (ranging from 5,389 down to 2,142 links) correspond to ten sites; the last ten (ranging, predictably, from 9 down to zero) correspond to a total of 14,445. As Adam says, if you were to graph these data as a list of nominals ranked in descending order, the 'area' covered by the curve would give you a good visual impression of the total number of links accounted for by low-linked sites: the Long Tail, no other. But this graphic does not conform to a power law - not even Zipf's Law. A list conforming to Zipf's Law would drop tied ranks - it would exclude duplicates, if that's any clearer. Instead of a long tail, it would trail off to the right with a series of widely-spaced fenceposts. ("In equal 9126th place, blogs with 9 links; in equal 9593rd place, 8-linkers...")
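The difference between the two graphs is easy to see on a toy example. The link counts below are invented for illustration (they are not the NZ Bear figures), and the sketch assumes that a 'fencepost' is one point per distinct value, plotted at the first position where that value occurs.

```python
# Hypothetical inbound-link counts, one entry per site, sorted high to low.
links = [50, 20, 9, 3, 3, 2, 2, 2, 1, 1, 1, 1, 1, 0, 0]

# Long Tail view: one bar per site, so the "area" is the total number of links.
area = sum(links)
print(area)  # 96

# Zipf-style view: drop the duplicates, keeping each distinct value once,
# at the first (best) position where it occurs.
fenceposts = {}
for position, value in enumerate(links, start=1):
    fenceposts.setdefault(value, position)
print(fenceposts)  # {50: 1, 20: 2, 9: 3, 3: 4, 2: 6, 1: 9, 0: 14}

# Adding up the fenceposts no longer gives the total number of links.
fencepost_total = sum(fenceposts)  # summing a dict sums its keys (the values)
print(fencepost_total)  # 85
```

The first list has a tail whose area means something; the second has ranks but only seven points, with wider and wider gaps between them.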

Long Tail, power law: choose one.

You can have a Long Tail, but only by graphing a list of nominals ranked in descending order.

You can have a power law series with rankings, but only by replacing the long tail with scattered fenceposts.

Even more importantly, neither of these is a power law distribution. Given the appropriate data values, you can derive a power law distribution from a ranked list - but it doesn't look like the 'long tail' graphic we know so well. I'll talk about what it does look like in the next post.

1 Comment:

  • Aha, I get it now, great point: all this talk about curves that fit histogram data, ranked or not, can have problems when the numbers get too small, since we're fitting a real-valued function to a histogram consisting of integers. This problem is really just one of accuracy, although you're correct to point out that we have to be careful about what people really mean when they say that a histogram "fits" a curve such as a power law.

    For example, the histogram may fit the data by "jumping" to integer values. In other words, if the power law says that a series of values in the ranked long tail should be {5, 4.55, 4.17, 3.85, 3.57}, a good fit would be a histogram consisting of repeated integer data {5, 5, 4, 4, 4}. This maintains the "area equalling total" interpretation, and thus also preserves the validity of observations regarding the total significance of the long tail as compared to the head. However, if the data was instead "scattered fenceposts," e.g. {5, 0, 0, 0, 4}, this might fit the curve while ignoring the zeros, but it would invalidate the "area" interpretation, and thus destroy the validity of observations on long tail importance based upon the power law curve.

    So I don't think that it's really correct to say "you can have a power law series with rankings, but only by replacing the long tail with scattered fenceposts." A better statement is that in order for area to be meaningful, as it must be for the usual "long tail" conclusions to hold, the data that fits the power law curve must have repeated values rather than scattered fenceposts.

    The key missing step that can be confusing here is the definition of the word "fit": for traditional long tail interpretations to work, the "fit" must be area-based rather than point-based, i.e. repeated values rather than scattered fenceposts.

    By Blogger Adam, at 10/11/05 01:51  
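The arithmetic in Adam's comment can be checked with a short sketch. The three lists are taken from the comment itself; treating the 'area' as a plain sum of the values is an assumption made to keep the example small.

```python
# The three series from the comment above.
curve = [5, 4.55, 4.17, 3.85, 3.57]   # real-valued power law curve
stepped = [round(v) for v in curve]   # the same curve "jumping" to integers
fenceposts = [5, 0, 0, 0, 4]          # scattered fenceposts

print(stepped)               # [5, 5, 4, 4, 4]
print(round(sum(curve), 2))  # 21.14
print(sum(stepped))          # 22
print(sum(fenceposts))       # 9
```

The stepped series keeps roughly the same total as the curve; the fenceposts keep less than half of it, which is why the 'area' interpretation fails for them.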
