Cloud Street

Friday, June 24, 2005

A trick of the eye

A long time ago on a Web site far, far away, Clay Shirky wrote:
"We are all so used to bell curve distributions that power law distributions can seem odd."
He then traced Pareto-like 'power law' curves operating in a number of domains where large numbers of people make unconstrained choices - most memorably, inbound link counts for blogs. The inverse 'power law' curve dives steeply, then levels out, glides downwards almost to zero and peters out slowly. And thus was born the 'Long Tail'.

As I wrote here, there's a problem with this article, and hence with the 'Long Tail' image itself. Despite repeated references to 'power law distributions', none of the curves Clay presented were distributions. They were histograms representing ranked lists: in other words series of numbers ordered from high to low.

What's the difference? A short answer is that the data Clay presents makes his own comparison with 'bell curve' (normal) distributions unsustainable: order from high to low and you will only ever get a downward curve.
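A quick way to see this for yourself, sketched in Python. The language, the helper names `ranked` and `is_non_increasing`, and the two made-up random samples are all assumptions for illustration; none of them comes from Clay's article. The point is only that a high-to-low ordering slopes downward whatever the data looked like beforehand.

```python
import random

def ranked(values):
    """Return the values ordered from high to low, as in a ranked list."""
    return sorted(values, reverse=True)

def is_non_increasing(seq):
    """True if no item is larger than the item before it."""
    return all(a >= b for a, b in zip(seq, seq[1:]))

rng = random.Random(0)  # fixed seed, chosen arbitrarily, so runs are repeatable

# Two hypothetical data sets with very different shapes.
bell_sample = [rng.gauss(100, 15) for _ in range(1000)]   # bell-shaped
flat_sample = [rng.uniform(0, 200) for _ in range(1000)]  # flat

for name, sample in [("bell", bell_sample), ("flat", flat_sample)]:
    # Ranking either sample gives a curve that never rises.
    print(name, is_non_increasing(ranked(sample)))
```

Both samples pass the check once ranked, which is why a downward-sloping ranked list cannot, on its own, be set against a bell curve.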

For a longer answer, you'll have to look at some numbers. Here are some x,y values which would give you a normal distribution. (For anyone in danger of glazing over, that's 'x' as in horizontal axis, low to high values running left to right; 'y' values are on the vertical axis, low to high running bottom to top.)

x    y
1    1
2    30
3    100
4    240
5    400
6    600
7    750
8    900
9    960
10   1000
11   1000
12   960
13   900
14   750
15   600
16   400
17   240
18   100
19   30
20   1

OK? And here are some co-ordinates which would give you an inverse power-law distribution:

x    y
1    1000
2    444
3    250
4    160
5    111
6    82
7    63
8    49
9    40
10   33
11   28
12   24
13   20
14   18
15   16
16   14
17   12
18   11
19   10
20   9
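For anyone who wants to see where numbers like these could come from: the y values listed above happen to match 4000 / (x + 1)^2, rounded to the nearest whole number with halves rounded up. That formula is an inference from the figures, not something stated in the post, and the function name `inverse_curve` is made up for this sketch.

```python
def inverse_curve(x):
    """Assumed formula (inferred from the table): 4000 / (x + 1)^2, rounded half up."""
    return int(4000 / (x + 1) ** 2 + 0.5)

# The y values from the table, for x = 1 to 20.
listed = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

computed = [inverse_curve(x) for x in range(1, 21)]

# Each y value is tied to a particular x: here the ordering comes from the data.
for x, y in zip(range(1, 21), computed):
    print(x, y)

print(computed == listed)
```

The half-up rounding is written out by hand because Python's built-in `round` rounds 62.5 down to 62, where the table has 63.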

Just for the hell of it, here are some numbers that would give you a direct (ascending) power law distribution:

x    y
1    9
2    10
3    11
4    12
5    14
6    16
7    18
8    20
9    24
10   28
11   33
12   40
13   49
14   63
15   82
16   111
17   160
18   250
19   444
20   1000

Finally, by way of contrast, here's a series of numbers.

1000
444
250
160
111
82
63
49
40
33
28
24
20
18
16
14
12
11
10
9

I've sorted these numbers high to low, but - unlike the other three examples - there's nothing in the data that told me to do that. You could arrange them that way; you could sort them low to high instead; you could even hack them about manually to produce a rather lumpy and uneven bell curve. It's up to you.
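Here is a small Python sketch of those three arrangements. The `bell_arrangement` helper and its alternating left/right recipe are assumptions for illustration, just one of many ways to hack the numbers into a hump; nothing in the data calls for it, which is rather the point.

```python
series = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

def bell_arrangement(values):
    """Arrange values so they rise to a single peak and then fall.

    Hypothetical recipe: sort low to high, send every other value to the
    left-hand slope, and put the remaining values, reversed, on the right.
    """
    ascending = sorted(values)
    left = ascending[0::2]          # climbs towards the peak
    right = ascending[1::2][::-1]   # starts at the largest value and descends
    return left + right

high_to_low = sorted(series, reverse=True)
low_to_high = sorted(series)
bell = bell_arrangement(series)

# Three different pictures drawn from exactly the same twenty numbers.
print(high_to_low)
print(low_to_high)
print(bell)
```

All three lists contain the same values; only the order, which I chose, differs.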

I'm not saying that a ranked listing - arranging numbers like these high to low - is meaningless. The ranked histogram is quite a good graphic - it's informative (within limits) and easy to grasp. What I am saying is that it's an arbitrary ordering rather than a distribution. Which is to say, it's not the best way of representing this data - let alone the only way. It's a relatively information-poor representation, and one which tends to promote perverse and unproductive ways of thinking about the data.

More about this - and a couple of constructive suggestions - next time I post.
