Cloud Street: A trick of the eye

A long time ago on a Web site far, far away, Clay Shirky wrote:

"We are all so used to bell curve distributions that power law distributions can seem odd."

He then traced Pareto-like 'power law' curves operating in a number of domains where large numbers of people make unconstrained choices - most memorably, inbound link counts for blogs. The inverse 'power law' curve dives steeply, then levels out, glides downwards almost to zero and peters out slowly. And thus was born the 'Long Tail'.

As I wrote here, there's a problem with this article, and hence with the 'Long Tail' image itself. Despite repeated references to 'power law distributions', none of the curves Clay presented was a distribution. They were histograms representing ranked lists: in other words, series of numbers ordered from high to low.

What's the difference? A short answer is that the data Clay presents makes his own comparison with 'bell curve' (normal) distributions unsustainable: order any set of numbers from high to low and you will only ever get a downward curve, whatever shape the underlying distribution has.
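If you'd rather see that than take it on trust, here's a minimal sketch in Python (my choice of language; nothing in Clay's article depends on it). It draws some made-up values from a bell curve and then ranks them high to low. The sample size, mean, spread and seed are all arbitrary assumptions, and the function names are invented for illustration.

```python
import random

def rank_high_to_low(values):
    """Return the values as a ranked list, largest first."""
    return sorted(values, reverse=True)

def is_downward(series):
    """True if no value in the series is larger than the one before it."""
    return all(a >= b for a, b in zip(series, series[1:]))

# Hypothetical data: 1,000 draws from a normal (bell curve) distribution.
rng = random.Random(2005)
samples = [rng.gauss(100, 15) for _ in range(1000)]

ranked = rank_high_to_low(samples)

print(is_downward(ranked))  # True: a ranked list can only ever run downhill
```

The ranking step throws away whatever shape the data had to begin with; all that's left is a line that slopes down from left to right.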

For a longer answer, you'll have to look at some numbers. Here are some x,y values which would give you a normal distribution. (For anyone in danger of glazing over, that's 'x' as in horizontal axis, low to high values running left to right; 'y' values are on the vertical axis, low to high running bottom to top.)

1	1
2	30
3	100
4	240
5	400
6	600
7	750
8	900
9	960
10	1000
11	1000
12	960
13	900
14	750
15	600
16	400
17	240
18	100
19	30
20	1

OK? And here are some co-ordinates which would give you an inverse power-law distribution:

1	1000
2	444
3	250
4	160
5	111
6	82
7	63
8	49
9	40
10	33
11	28
12	24
13	20
14	18
15	16
16	14
17	12
18	11
19	10
20	9
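For what it's worth, the y values above match y = 4000 / (x + 1)^2, rounded to the nearest whole number: a power law with an exponent of -2, shifted one step along the x axis. That formula is simply read off the table, and the function name below is invented for illustration. A quick check in Python:

```python
import math

# The y values listed above, for x = 1 to 20.
table = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
         28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

def shifted_power_law(x):
    """4000 / (x + 1)^2, rounded half up to a whole number."""
    return math.floor(4000 / (x + 1) ** 2 + 0.5)

fitted = [shifted_power_law(x) for x in range(1, 21)]

print(fitted == table)  # True
```

The point is that each y here is a function of its x: change the x and the formula gives you a different y.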

Just for the hell of it, here are some numbers that would give you a direct (ascending) power law distribution:

1	9
2	10
3	11
4	12
5	14
6	16
7	18
8	20
9	24
10	28
11	33
12	40
13	49
14	63
15	82
16	111
17	160
18	250
19	444
20	1000

Finally, by way of contrast, here's a series of numbers.

1000
444
250
160
111
82
63
49
40
33
28
24
20
18
16
14
12
11
10
9

I've sorted these numbers high to low, but - unlike the other three examples - there's nothing in the data that told me to do that. You could arrange them that way; you could sort them low to high instead; you could even hack them about manually to produce a rather lumpy and uneven bell curve. It's up to you.
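Here's a rough sketch of those three arrangements in Python. The 'bell' version is just one way of hacking the numbers about (smallest values at the edges, largest in the middle); the function name and the alternating left/right rule are my own assumptions, not a standard technique.

```python
# The series of numbers listed above.
series = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]

high_to_low = sorted(series, reverse=True)
low_to_high = sorted(series)

def lumpy_bell(values):
    """Deal the values out, smallest first, to the left and right edges
    in turn, so that the largest values end up in the middle."""
    left, right = [], []
    for i, value in enumerate(sorted(values)):
        (left if i % 2 == 0 else right).append(value)
    return left + right[::-1]

bell = lumpy_bell(series)

print(high_to_low)
print(low_to_high)
print(bell)
```

All three lists contain exactly the same twenty numbers. The only thing that differs is the order, and the order comes from whoever did the arranging rather than from the data.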

I'm not saying that a ranked listing - arranging numbers like these high to low - is meaningless. The ranked histogram is quite a good graphic - it's informative (within limits) and easy to grasp. What I am saying is that it's an arbitrary ordering rather than a distribution. Which is to say, it's not the best way of representing this data - let alone the only way. It's a relatively information-poor representation, and one which tends to promote perverse and unproductive ways of thinking about the data.

More about this - and a couple of constructive suggestions - next time I post.

Friday, June 24, 2005
