Tuesday, 23 September 2008

Giving in to data loss aversion

D. Kelly O'Day, of the Process Trends web site, has a new Charts & Graphs blog, and his first post comments on the US Census Bureau's Current Population Reports data discussed by Jorge Camoes and Andreas Lipphardt at xlcubed.com, as an example of a confusing data set with too much detail. Or is it just detail in need of more sensitive presentation?



(As reader michael rightly points out in comments on the XLCubed post, this can't be showing the rise of income disparity, because it plots percentage of households against fixed income levels, not percentage of income against fixed household quantiles. Also, the series tops out at $100k, which is peanuts compared to the super-rich levels where the income transfer of the last few decades has really been most spectacular.)

Kelly abandons the time series approach to present the two end points in a dot plot:



However, I think that reducing the curves to a pair of points, one in 1967 and one in 2007, loses a lot of the information the full time series has to tell. So, in the spirit of defending loss aversion, I wondered whether a readable time series graph could be constructed that would give real insights into the census table.

First, I made it a cumulative graph, so each income bracket now runs from zero to n(i) instead of from n(i-1) to n(i). This means the lines can no longer cross, which should help:



I eliminated the >$100k line, as it is now by definition 100% of households, cumulatively. As we see, the percentage of households making under $100k in constant dollars falls steadily from 1967 to 2005, which is just what we would expect from economic growth, and a desirable result: you want more people getting richer. However, this does not carry through to all income levels; the lower the bracket, the smaller the fall.
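The cumulative transform itself is trivial. A sketch in Python, with invented bracket shares rather than the actual Census figures:

```python
# Turn per-bracket household percentages into cumulative "under $X" series.
brackets = ["<$5k", "<$15k", "<$35k", "<$100k", ">=$100k"]
shares = [4.0, 12.0, 30.0, 44.0, 10.0]  # hypothetical, sums to 100

cumulative = []
running = 0.0
for s in shares:
    running += s
    cumulative.append(running)

# The top bracket is by definition 100% cumulatively, so drop it.
under = dict(zip(brackets[:-1], cumulative[:-1]))
print(under)  # {'<$5k': 4.0, '<$15k': 16.0, '<$35k': 46.0, '<$100k': 90.0}
```

Applied to each year's column of the census table, this yields the non-crossing curves plotted above.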

Perhaps this is an artifact of the linear scale, which Jon Peltier and Nicolas Bissantz have both written about recently. So I tried it with a log scale instead:



(fully accepting Jon's comment that log scales do an equal injustice to the upper values, when the data consists of values equally distributed within an upper limit rather than proportionally distributed to infinity)

The surprising result is that the <$5k income level contains almost as many households in 2005 as in 1967, and significantly more than in the 70s, after a fall in the 60s. After a modest fall in the 90s, it rises again after 2000, as do all the other income levels below $100k. If we have said a fall is a good thing, then a rise must be bad. So the full time series data do contain a shocking insight, but it's to be found not at the upper levels but the lower ones, and not just by comparing 1967 to 2005, but by following the trend down, then up. The dot plot can only tell a story of average improvement between 1967 and 2007, which masks a story of advance and retreat.

Thursday, 4 September 2008

Reorderable Tables

I've been interested in the idea of reorderable tables since I read about Jacques Bertin coming up with the idea in his 1967 book Sémiologie graphique. The idea is that if you don't have a preferred order for your rows and columns, why not order them so the values in the cells make a rough diagonal across the grid? This makes the patterns much more straightforward to detect.

But I didn't know how to implement it in a spreadsheet until I googled the word "reorderable", which turns out to be used for Bertin's tables and almost nothing else. Some of the hits describe something called the "barycenter heuristic", which is really quite simple: each row or column has a "centre of gravity" that you can calculate, and then sort by. Take, for example, this table from the Juice Analytics review 5 Options for Embedding Charts in a Web Page.

I added formulae for calculating the barycentre of each row (weighting the cells by column number) and each column (weighting by row number), of the following form:

=SUMPRODUCT(COLUMN(D5:F5),D5:F5)/SUM(D5:F5)
=SUMPRODUCT(ROW(D5:D9),D5:D9)/SUM(D5:D9)


Then, because Excel sort up/down is temperamental without well-defined headers, and left/right is even more so, I recorded macros like this:

Application.Goto Reference:="SORT_COLS"
Selection.Sort Key1:=Range("SORT_COLS"), Order1:=xlDescending, _
    Header:=xlNo, Orientation:=xlLeftToRight

Application.Goto Reference:="SORT_ROWS"
Selection.Sort Key1:=Range("SORT_ROWS"), Order1:=xlAscending, _
    Header:=xlNo, Orientation:=xlTopToBottom


These use the pre-created named ranges SORT_ROWS and SORT_COLS shown in red and blue here:

Now when I run the two macros, the table sorts itself into this:

I linked the macros to the shortcut keys Ctrl-m and Ctrl-M, and depending on the structure of the data, they either converge quickly and stop responding to key presses, or sort the reverse of whatever they sorted the last time. This last behaviour is handy because it makes the table cycle through the four possible orientations. You may prefer the table in another orientation than the one it converges to initially.

This is a simple example, and could have been done by hand, but you can do it with more complicated tables too. This example is from the Junk Charts article Noisy subways:



It's not a perfect diagonal, and it wouldn't be even if it were the optimal solution; indeed, I'm not convinced this technique finds the optimal solution at all. But it's not bad for something so simple.
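Outside Excel, one pass of the barycentre heuristic is only a few lines. A minimal Python sketch, using an invented 0/1 table rather than either of the examples above:

```python
def barycentre(values):
    # Centre of gravity: cell positions weighted by cell values.
    # (Assumes a nonzero row/column sum, true for tables of counts.)
    return sum((i + 1) * v for i, v in enumerate(values)) / sum(values)

def reorder_once(table):
    # Sort rows by barycentre, then transpose, sort columns, transpose back.
    table = sorted(table, key=barycentre)
    cols = sorted(zip(*table), key=barycentre)
    return [list(row) for row in zip(*cols)]

table = [[0, 1, 1],
         [1, 1, 0],
         [0, 0, 1]]
# Repeat until the ordering stops changing, like pressing the
# shortcut keys until the macros converge.
print(reorder_once(table))  # [[1, 1, 0], [0, 1, 1], [0, 0, 1]]
```

On this toy table a single pass already pushes the ones toward the diagonal, matching what the two sort macros do in the spreadsheet.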

Friday, 15 August 2008

Clock this

Jon Peltier has written about graphs based on hours of the day (and chosen the perfect title, so I have to make do with second best), with examples attempting to show the day and night cycles together intuitively.

I think the whole day-night split thing is artificial: there's no connection between the times that are twelve hours apart and happen to have the same number under our old mediæval twelve-hour day, twelve-hour night scheme. I'm surprised how much less Americans use the 24-hour clock, and how they call it "military time" when they do. In Britain we associate it more with the coming of the railways in the nineteenth century, and the need for standard timetables (until trains made national time zones necessary, individual towns had their own time!). I've made a polar area graph on a 24-hour scheme here:
The polar area graph is one of my favourite neglected specialist graph types, but it really only has a chance of being the graph type of choice if

a) you don't mind reading stacked areas instead of position along a common scale (see Cleveland's Hierarchy),
b) the data set is truly cyclic, not linear, and
c) there are two series to compare, one much smaller than the other (here we lack a second, smaller series that could be plotted closer to the centre, to take advantage of the square-root relationship between the area of the slices and their radius).
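That square-root relationship is worth spelling out: every wedge spans the same angle, so a wedge's area is (angle/2)·r², and making area proportional to the value means scaling the radius by the square root. A sketch with made-up values:

```python
import math

values = [3.0, 12.0, 27.0, 48.0]   # hypothetical counts for four wedges
angle = 2 * math.pi / len(values)  # each wedge spans the same angle

# Area of a wedge = (angle / 2) * r**2, so r = sqrt(2 * value / angle)
# makes the area, not the radius, proportional to the value.
radii = [math.sqrt(2 * v / angle) for v in values]

# Quadrupling a value only doubles the radius, which is why a smaller
# second series stays legible near the centre.
```

This is the scaling Florence Nightingale's coxcomb charts used; plotting radius proportional to the value instead would exaggerate the large slices quadratically.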

Saturday, 12 July 2008

Sloppy step charts in the media

Major news organizations seem so slick that it surprises me when they fall down on the simplest of graphical tasks. On 10 July MSNBC ran a story, How to value life? EPA devalues its estimate, about the decline in the dollar value of one life used to weigh the cost of pollution prevention (even if lives were static in value, you'd expect the figure to rise with inflation). The graph they showed was this one from the AP:



I doubt that the value fell at a constant rate during certain periods, but that's what the chart seems to show, even though all graph-drawing software has the means to represent the actual situation: immediate drops on certain dates. Jon Peltier has written about this in Line Chart vs. Step Chart.
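Even in a tool with no step-chart type, the fix is a simple data transform: repeat each change date with both the old and the new value, so the connecting line drops vertically instead of sloping. A sketch with invented figures, not the actual EPA numbers:

```python
# (year, value) pairs marking each revision of an estimate - hypothetical.
changes = [(1999, 7.9), (2004, 7.0), (2008, 6.9)]

points = []
prev = None
for year, value in changes:
    if prev is not None:
        points.append((year, prev))  # carry the old value up to the drop date
    points.append((year, value))     # then the immediate drop
    prev = value

print(points)
# [(1999, 7.9), (2004, 7.9), (2004, 7.0), (2008, 7.0), (2008, 6.9)]
```

Feeding the expanded points to an ordinary line chart produces horizontal plateaus with vertical drops, which is what the AP graphic should have shown.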

The Wall Street Journal got it right with Clinton's Road to Second Place on 4 June: