Saturday, February 25, 2012

Bins & buckets

Binning data seems a very wasteful thing to do. Even if you have finished with it, someone else may find it useful.

I first came across the concept of binning data when I was recently reading a paper about statistical distributions written by a physicist. I had to look at the figures before I understood what he meant, & then spent a few more minutes trying to think what word I would have used. Probably grouping, or maybe just cross tabulation, which in social statistics usually involves grouping: for instance, population in 5-year age bands.
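
To make the idea concrete, here is a minimal Python sketch of grouping ages into 5-year bands. The ages are invented for illustration, & the function name age_band & the band labels are assumptions of this sketch rather than anything taken from the paper.

```python
from collections import Counter


def age_band(age, width=5):
    """Return the label of the band containing age, e.g. 23 -> '20-24'."""
    lower = (age // width) * width  # lower edge of the band
    return f"{lower}-{lower + width - 1}"


# Made-up ages, purely for illustration.
ages = [3, 7, 21, 23, 24, 40, 41, 67]

# Count how many people fall into each band (each "bin").
counts = Counter(age_band(a) for a in ages)

# Print the bands in ascending order of their lower edge.
for band in sorted(counts, key=lambda b: int(b.split("-")[0])):
    print(band, counts[band])
```

Each band is one bin: the individual ages are dropped & only the count per band is kept, which is exactly what a population table in 5-year age bands does.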

I thought binning must derive from the idea of sorting data into different bins according to the value of the variable – much as, in the old days, postal workers used to sort outgoing mail by tossing it into the correct bin for the postal town of destination.

Just the other day it occurred to me that bin might in fact derive from binary, which would fit the truth table method of programming cross tabulations.

And then I came across this in an article about statistical analysis of language & economics on Language Log:

What this does, in effect, is drop families around the world into one of 1.4 billion buckets, where two families fall into the same bucket if and only if they are identical in country of birth and residence, age, sex, income, family structure, number of children, and religion, where the religions of the world are broken up into 74 types.
Keith Chen
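
The "same bucket if and only if identical" rule in that passage can be sketched in Python by using the tuple of attributes as a dictionary key. This is only an illustration, not Chen's actual method: the records are made up, only three of the listed attributes are used, & the names families, KEY_FIELDS & bucket_key are hypothetical.

```python
from collections import defaultdict

# Made-up records, purely for illustration.
families = [
    {"id": 1, "country": "FR", "religion": "A", "children": 2},
    {"id": 2, "country": "FR", "religion": "B", "children": 2},
    {"id": 3, "country": "FR", "religion": "A", "children": 2},
    {"id": 4, "country": "JP", "religion": "A", "children": 0},
]

# The attributes that must all match for two families to share a bucket.
KEY_FIELDS = ("country", "religion", "children")


def bucket_key(family):
    """Build the bucket key: a tuple of the chosen attribute values."""
    return tuple(family[field] for field in KEY_FIELDS)


# Drop each family into the bucket named by its key.
buckets = defaultdict(list)
for family in families:
    buckets[bucket_key(family)].append(family["id"])

for key, ids in buckets.items():
    print(key, ids)
```

The number of possible buckets is the product of the number of values each attribute can take, which is how a handful of attributes can multiply up to a figure as large as the one quoted.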


So perhaps I was right first time.