### A trick of the eye

A long time ago on a Web site far, far away, Clay Shirky wrote:

"We are all so used to bell curve distributions that power law distributions can seem odd."

He then traced Pareto-like 'power law' curves operating in a number of domains where large numbers of people make unconstrained choices - most memorably, inbound link counts for blogs. The inverse 'power law' curve dives steeply, then levels out, glides downwards almost to zero and peters out slowly. And thus was born the 'Long Tail'.

As I wrote here, there's a problem with this article, and hence with the 'Long Tail' image itself. Despite repeated references to 'power law distributions', **none of the curves Clay presented were distributions**. They were histograms representing ranked lists: in other words, series of numbers ordered from high to low.

What's the difference? A short answer is that the data Clay presents makes his own comparison with 'bell curve' (normal) distributions unsustainable: order any set of numbers from high to low and you will only ever get a downward curve.

For a longer answer, you'll have to look at some numbers. Here are some x,y values which would give you a normal distribution. (For anyone in danger of glazing over, that's 'x' as in horizontal axis, with values running low to high from left to right; 'y' values are on the vertical axis, running low to high from bottom to top.)

| x | y |
|---|---|
| 1 | 1 |
| 2 | 30 |
| 3 | 100 |
| 4 | 240 |
| 5 | 400 |
| 6 | 600 |
| 7 | 750 |
| 8 | 900 |
| 9 | 960 |
| 10 | 1000 |
| 11 | 1000 |
| 12 | 960 |
| 13 | 900 |
| 14 | 750 |
| 15 | 600 |
| 16 | 400 |
| 17 | 240 |
| 18 | 100 |
| 19 | 30 |
| 20 | 1 |

OK? And here are some co-ordinates which would give you an inverse power-law distribution:

| x | y |
|---|---|
| 1 | 1000 |
| 2 | 444 |
| 3 | 250 |
| 4 | 160 |
| 5 | 111 |
| 6 | 82 |
| 7 | 63 |
| 8 | 49 |
| 9 | 40 |
| 10 | 33 |
| 11 | 28 |
| 12 | 24 |
| 13 | 20 |
| 14 | 18 |
| 15 | 16 |
| 16 | 14 |
| 17 | 12 |
| 18 | 11 |
| 19 | 10 |
| 20 | 9 |
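As a rough check that the y values here really are a function of x, the figures in the table can be regenerated from a formula. The formula below is my own fit to the table, not something stated in the post, and strictly speaking it is a shifted power law (in x + 1 rather than x); the names `inverse_power_law` and `TABLE_Y` are illustrative only.

```python
import math

# y values copied from the table above, for x = 1..20
TABLE_Y = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
           28, 24, 20, 18, 16, 14, 12, 11, 10, 9]


def inverse_power_law(x, scale=4000, shift=1, exponent=2):
    """Assumed form: y = scale / (x + shift) ** exponent, rounded half up."""
    return math.floor(scale / (x + shift) ** exponent + 0.5)


# Each y is computed from its x alone; the order of the rows follows from x.
generated = [inverse_power_law(x) for x in range(1, 21)]
print(generated == TABLE_Y)
```

The point of the sketch is only that each y is determined by its x: the shape of the curve comes from the relationship between the two columns, not from any decision about how to sort the rows.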

Just for the hell of it, here are some numbers that would give you a direct (ascending) power law distribution:

| x | y |
|---|---|
| 1 | 9 |
| 2 | 10 |
| 3 | 11 |
| 4 | 12 |
| 5 | 14 |
| 6 | 16 |
| 7 | 18 |
| 8 | 20 |
| 9 | 24 |
| 10 | 28 |
| 11 | 33 |
| 12 | 40 |
| 13 | 49 |
| 14 | 63 |
| 15 | 82 |
| 16 | 111 |
| 17 | 160 |
| 18 | 250 |
| 19 | 444 |
| 20 | 1000 |

Finally, by way of contrast, here's a series of numbers.

| value |
|---|
| 1000 |
| 444 |
| 250 |
| 160 |
| 111 |
| 82 |
| 63 |
| 49 |
| 40 |
| 33 |
| 28 |
| 24 |
| 20 |
| 18 |
| 16 |
| 14 |
| 12 |
| 11 |
| 10 |
| 9 |

I've sorted these numbers high to low, but - unlike the other three examples - there's nothing in the data that told me to do that. You could arrange them that way; you could sort them low to high instead; you could even hack them about manually to produce a rather lumpy and uneven bell curve. It's up to you.
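A small sketch of why the ordering tells you so little: take the y values from the bell-curve example, throw away the x values, and rank what is left. The helper names `ranked` and `is_non_increasing` are made up for illustration, and the fixed shuffle seed is there only to make the run repeatable.

```python
import random

# y values from the bell-curve example, with the x column discarded
bell = [1, 30, 100, 240, 400, 600, 750, 900, 960, 1000,
        1000, 960, 900, 750, 600, 400, 240, 100, 30, 1]


def ranked(values):
    """Return the values sorted high to low, as in a ranked histogram."""
    return sorted(values, reverse=True)


def is_non_increasing(seq):
    """True if the sequence never rises from one item to the next."""
    return all(a >= b for a, b in zip(seq, seq[1:]))


# Scramble the same values: the original arrangement is gone for good.
shuffled = bell[:]
random.Random(0).shuffle(shuffled)

print(is_non_increasing(bell))            # the bell shape rises, then falls
print(is_non_increasing(ranked(bell)))    # ranking forces a downward curve
print(ranked(shuffled) == ranked(bell))   # ranking ignores the prior order
```

Whatever shape the numbers had before, the ranked version only ever slopes downward, so the slope by itself cannot be evidence of what kind of distribution produced them.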

I'm not saying that a ranked listing - arranging numbers like these high to low - is meaningless. The ranked histogram is quite a good graphic - it's informative (within limits) and easy to grasp. What I am saying is that it's an arbitrary ordering rather than a distribution. Which is to say, it's not the best way of representing this data - let alone the only way. It's a relatively information-poor representation, and one which tends to promote perverse and unproductive ways of thinking about the data.
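To make the contrast concrete, here is a minimal sketch of one way to turn the same series into a distribution in the sense used above: count how many values fall into each band, so that the horizontal axis has an order of its own. The band width of 100 and the name `frequency_by_band` are arbitrary choices for illustration, not anything proposed in this post.

```python
from collections import Counter

WIDTH = 100  # arbitrary band width, chosen only for this illustration

# the unordered series from the final example
values = [1000, 444, 250, 160, 111, 82, 63, 49, 40, 33,
          28, 24, 20, 18, 16, 14, 12, 11, 10, 9]


def frequency_by_band(values, width=WIDTH):
    """Map each band's lower bound to the number of values falling in it."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


bands = frequency_by_band(values)
for lower, count in bands.items():
    # x axis: the band of values; y axis: how many items fall in that band
    print(f"{lower:>4}-{lower + WIDTH - 1:<4} {count}")
```

Here the input order makes no difference to the result: shuffle `values` and the counts per band stay the same, which is exactly what cannot be said of a ranked listing.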

More about this - and a couple of constructive suggestions - next time I post.
