Tuesday 23 September 2008

Giving in to data loss aversion

D. Kelly O'Day, of the Process Trends web site, has a new Charts & Graphs blog. His first post comments on the US Census Bureau's Current Population Reports data, discussed by Jorge Camoes and by Andreas Lipphardt at xlcubed.com, as an example of a confusing data set with too much detail. Or is it just detail in need of sensitive presentation?



(as reader michael rightly points out in comments on the XLCubed post, this can't be showing the rise of income disparity, because it's percentage of households against fixed income level, not percentage of income against fixed household quantiles. Also, the series tops out at $100k, which is peanuts compared to the super-rich levels where the income transfer of the last few decades has really been most spectacular.)

Kelly abandons the time series approach to present the two end points in a dot plot:



However, I think that reducing the curves to a pair of points, one in 1967 and one in 2007, loses a lot of information that the full time series has to tell. So, in the spirit of defending loss aversion, I wondered if a readable time series graph could be constructed that would give real insights into the census table.

First, I made it a cumulative graph, so each income bracket now goes from zero to n(i), not n(i-1) to n(i). This means the lines can't cross any more, which should help:



I eliminated the >$100k line, as this is now by definition 100% of the households, cumulatively. As we see, the percentage of households making under $100k in constant dollars falls steadily from 1967 to 2005, which is just what we would expect from economic growth, and a desirable result. You want more people getting richer. However, the fall does not carry through equally to all income levels: the lower levels fall by less and less.
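The cumulative transform described above is just a running total across brackets. Here is a minimal sketch in Python, assuming one year's bracket shares are held in a list ordered from lowest to highest income. The bracket labels and the percentages are made up for illustration and are not the census figures.

```python
from itertools import accumulate

# Hypothetical brackets and shares of households (percent) for a single year.
# These numbers are invented for illustration; they are NOT census figures.
BRACKETS = ["<$5k", "$5k-$15k", "$15k-$35k", "$35k-$60k", "$60k-$100k", ">$100k"]
SHARES = [3.0, 5.0, 12.0, 25.0, 35.0, 20.0]


def cumulative_shares(shares):
    """Return running totals: entry i is the share at or below bracket i."""
    return list(accumulate(shares))


if __name__ == "__main__":
    for label, total in zip(BRACKETS, cumulative_shares(SHARES)):
        print(f"through bracket {label}: {total:.1f}%")
```

Because every share is non-negative, each running total is at least as large as the one before it, which is why the cumulative lines cannot cross, and why the top bracket always lands on 100%.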

Perhaps the smaller falls at the lower income levels are an artifact of the linear scale, as written about by Jon Peltier and Nicolas Bissantz recently. So I tried it with a log scale instead:



(fully accepting Jon's comment that log scales do an equal injustice to the upper values, when the data are values evenly distributed below an upper limit rather than proportionally distributed toward infinity)
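To see why the choice of scale matters here, compare the same proportional fall at a low level and at a high level. This is a small sketch with invented percentages (not values from the census table): on a linear axis the two falls look very different, while on a log axis they cover the same distance.

```python
import math


def linear_change(start, end):
    """Distance moved on a linear axis."""
    return end - start


def log_change(start, end):
    """Distance moved on a base-10 log axis; depends only on the ratio end/start."""
    return math.log10(end / start)


if __name__ == "__main__":
    # Hypothetical (start, end) percentages; both pairs are a 25% proportional fall.
    for start, end in [(4.0, 3.0), (80.0, 60.0)]:
        print(
            f"{start:.0f}% -> {end:.0f}%: "
            f"linear {linear_change(start, end):+.1f}, "
            f"log10 {log_change(start, end):+.4f}"
        )
```

A one-point fall near the bottom of a linear chart is almost invisible next to a twenty-point fall near the top, even though both are the same proportional change, so a log axis gives the lower brackets a fairer hearing.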

The surprising result is that the <$5k income level contains almost as many households in 2005 as in 1967, and significantly more than in the 70s, after a fall in the 60s. After a modest fall in the 90s, it rises again after 2000, as do all the other income levels below $100k. If we have said a fall is a good thing, then a rise must be bad. So the full time series data do contain a shocking insight, but it's to be found not at the upper levels but the lower ones, and not just by comparing 1967 to 2005, but by following the trend down, then up. The dot plot can only tell a story of average improvement between 1967 and 2007, which masks a story of advance and retreat.

1 comment:

Anonymous said...

Derek:

Very interesting!

Thanks for the tip on Nicolas's time series post; it's an important point for time series charts.

I've worked up a post on logarithmic scales.

I'm still thinking about the loss aversion issue. I'll be posting on the connection between research question and chart type.

You, Andreas and I have focused on different questions, leading to different charts. It's like a photographer who uses different lenses to shoot a scene. While the scene doesn't change, the view of the scene does.

While some scenes (photos) may be more interesting or pleasing than others, each has a role to play in understanding the actual scene that the photographer shot.

Let me know if you have any thoughts or suggestions on this. Chart type selection is an important topic. Jorge's income data provides a good example of how the question determines the chart type.

Kelly