Cloud Street

Wednesday, May 25, 2005

When is a spike not a spike?

When it's a long tail. Maybe.

David Weinberger writes:

In a conversation with Erica George at the Berkman she pointed out that the demographics of Live Journal don't always represent one's experience of Live Journal — the demographics say that teenage girls are the largest users, but if you're a 25 year old, your social group there may not look that way at all.

Which raises an issue about the way the "long tail" is pictured. Clay's charts are accurate depictions of his data, but they have a mythic power that's misleading: The long tail looks like, well, a long tail when in fact it's a fractal curlicue of relationships.

This is an interesting point in itself - perhaps the blogosphere would be better viewed as a series (archipelago? galaxy?) of more or less closed, more or less interlinked 'spheres'. I'm not sure how you'd visualise that, though - perhaps something like the Jefferson High School network diagram?

But there's a broader point about the accuracy of those 'long tail' graphics. Adam Marsh made an interesting point here about a recently-discovered 'long tail':

Clay refers to “the characteristic long tail of people who use many fewer tags than the power taggers.” While this chart does exhibit a “long tail,” this is simply a result of the fact that the users were ordered by decreasing tag usage (also true of the following three charts) — the X axis here doesn’t represent a value, it is just a sequence of users.

The phrase “long tail” usually refers to the observation that for many distributions, the number of elements with outlying values (the “tail”) may be cumulatively significant compared to the number of elements clustered near the average.

On inspection, it turns out that this is also true of the celebrated 'Power law and Weblogs' graphic: there are no values on the X axis, just a list of blogs arranged in descending order of number of links. This matters, because in a graphical representation of a statistical distribution both axes carry information. Typically, values of the variable being measured run low to high on the X axis, left to right, while the count of occurrences of each value runs high to low on the Y axis, top to bottom. Clay wrote, "We are all so used to bell curve distributions that power law distributions can seem odd." But Clay's own graphics aren't so much odd as misleading, and not only because he's put high values on the left of the graph rather than the right. In effect, he's got two axes conveying one piece of information. Andrew Sullivan's blog and Instapundit get a high Y value (lots of links) and a leading X position (because all the sites with lots of links have been sorted to the left).
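To make the 'two axes, one piece of information' point concrete, here's a minimal sketch in Python. The blog names and link counts are invented for illustration (they are not Clay's data), and rank_order is just a hypothetical helper name. All it does is sort blogs by descending link count and number them, which is what the X axis of a ranked chart amounts to.

```python
# Invented inbound-link counts for eight hypothetical blogs (not real data).
links = {"blog_a": 3, "blog_b": 120, "blog_c": 1, "blog_d": 3,
         "blog_e": 45, "blog_f": 1, "blog_g": 1, "blog_h": 7}

def rank_order(counts):
    """Return (rank, name, count) tuples sorted by descending count."""
    ordered = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [(rank, name, n) for rank, (name, n) in enumerate(ordered, start=1)]

ranked = rank_order(links)
for rank, name, n in ranked:
    # The rank is derived entirely from the count: it is a position
    # in a sorted list, not an independently measured value.
    print(rank, name, n)
```

Whatever numbers you feed in, the ranked series can only go down or stay level as you move right, because the sort guarantees it. The X position tells you nothing that the Y value hasn't already told you.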

If you took the same numbers and plotted them on an X axis with values - if you produced a graph showing how many blogs had how many links, with zero at the origin on both scales... Well, I don't know what would happen - but five minutes' experimentation reminds me that, if you wanted to produce a nice clear series of vertical bars rather than a line that wanders all over the place, you'd need to put 'number of blogs' on the Y axis and 'number of inbound links' on the X axis, rather than vice versa. (There's a simple reason for this: some values are unique by definition, others aren't.) Which in turn means that any vertical spike would represent large numbers of blogs (say, for example, blogs with small numbers of inbound links) while any long tail would represent small numbers (say, for example, the few blogs with lots of links).
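For what it's worth, the kind of graph described above can be sketched in a few lines of Python. The link counts below are made up purely for illustration, and link_distribution is a hypothetical helper name; this is a toy under those assumptions, not an analysis of anyone's real figures. It counts how many blogs share each inbound-link value, so 'number of inbound links' goes on the X axis and 'number of blogs' on the Y axis.

```python
from collections import Counter

# Made-up inbound-link counts, one entry per hypothetical blog (not real data).
inbound_links = [3, 120, 1, 3, 45, 1, 1, 7]

def link_distribution(counts):
    """Return (number_of_links, number_of_blogs) pairs, lowest link count first."""
    return sorted(Counter(counts).items())

distribution = link_distribution(inbound_links)
for n_links, n_blogs in distribution:
    # One text 'bar' per link value; bar length is the number of blogs.
    print(f"{n_links:>4} links | {'#' * n_blogs} ({n_blogs})")
```

With these toy numbers the tallest bar sits at the low end (three blogs with one link each), while the heavily linked blogs each get a bar of height one, strung out to the right.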

Caveat: I haven't crunched any actual numbers, or even mumbled them gently. But maybe we've been looking at this the wrong way round, statistically speaking. Perhaps the long tail is the spike; perhaps the spike is really the long tail.