Yesterday I spoke at Stat Bytes, our student-run statistical computing seminar.
My goal was to introduce two principled frameworks for thinking about data visualization: human visual perception and the Grammar of Graphics.
(We also covered some relevant R packages: RColorBrewer, directlabels, and a gentle intro to ggplot2.)
These are not the only “right” approaches, nor do they guarantee your graphics will be good. They are just useful tools to have in your arsenal.
The talk was also a teaser for my upcoming fall course, 36-721: Statistical Graphics and Visualization [draft syllabus pdf].
Here are my slides and R code.
The talk was quite interactive, so the slides aren’t designed to stand alone. Open the slides and follow along using my notes below.
(Answers are intentionally in white text, so you have a chance to think for yourself before you highlight the text to read them.)
If you want a deeper introduction to dataviz, including human visual perception, Alberto Cairo’s The Functional Art [website, amazon] is a great place to start.
For a more thorough intro to ggplot2, see creator Hadley Wickham’s own presentations at the bottom of this page.
(Apologies also to the National Statistical Service of the Republic of Armenia for using their plots on slides 4, 6, and 12. They are a group of skilled people working hard under challenging conditions (including the need to show 3 languages on most reports and graphs!). I hope they do not mind me using a few of their graphics as starting points for discussing redesigns.)
Framework 1: human visual perception
- (2) How many 6s can you find in this image? How long does it take you?
- (3) To compare numeral shapes alone, you have to apply conscious attention, thinking slowly. But the human brain is amazingly efficient at grouping and comparing items with contrasting colors, automatically, before the image even reaches your conscious attention. Whenever possible, your graphics should make use of the brain’s preattentive processing to simplify the task and to help viewers see the structure in a flash.
- (4) Consider this idea of preattentive processing. What makes this graphic difficult or slow to read, and how could it be improved? [Some answers: legend is far from plot; year-to-year comparisons are difficult; pie slice angles are hard to compare; order of slices is uninformative]
- (5) One possible redesign [year-to-year comparisons are shown directly; categories labeled directly, not with a legend; y-axis positions are easier to compare than angles; colors now provide meaning (blue for increase, red for decrease)]
- (6) What could be improved? [legend is far from plot; similar colors are hard to distinguish; semi-alphabetical ordering is uninformative; comparing marriage to divorce rates within a region is hard]
- (7) One possible redesign [direct labels; informative sorting by marriage rate]
- (8) This is not an exhaustive explanation of visual perception and preattentive processing. But using that framework, here are a few principles you can apply directly when designing graphics.
Next we’ll talk more about the first two bullets and how to use them in R.
- (9) Think for a moment: How would you choose a color scheme for its usability? What would you need to know about the color palette?
- (10) Cynthia Brewer and colleagues at Penn State do research into usable color palettes (for cartography, but also useful for other graphics). Their findings are summarized pragmatically on the ColorBrewer website. Play around with the site. Most of these palettes are easily accessed within R using the RColorBrewer package.
- (11) Start R and play with the first half of my code, to see examples of RColorBrewer and directlabels.
The dataset is a small subset of the NHANES 2011-2012 survey. This kind of data is used to create those growth percentile charts you see at the doctor’s office, when your baby gets weighed and measured to see whether the child’s growth is in a normal range. My wife and I have been seeing a lot of these lately 🙂
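As a rough sketch of the ColorBrewer idea from slide (10), here is one way to pull those palettes into R with the RColorBrewer package. This is not the code from the talk, and the counts below are made up purely for illustration.

```r
library(RColorBrewer)

# Metadata for every palette: max number of colours, category
# (seq, div, or qual), and whether it is flagged as colourblind-safe.
head(brewer.pal.info)

# Names of the qualitative palettes flagged as colourblind-safe.
info <- brewer.pal.info
safe_qual <- rownames(info)[info$category == "qual" & info$colorblind]
print(safe_qual)

# Pick 4 colours from one qualitative palette and preview them.
pal <- brewer.pal(4, "Dark2")
display.brewer.pal(4, "Dark2")

# Hypothetical counts, NOT from the NHANES subset.
counts <- c(A = 12, B = 30, C = 21, D = 17)

# Sort the bars so the ordering is informative, then colour them.
barplot(sort(counts, decreasing = TRUE), col = pal, border = NA)
```

Qualitative palettes suit unordered categories; for ordered or diverging data, a sequential or diverging palette (category `seq` or `div` in `brewer.pal.info`) is usually the better fit.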
Framework 2: The Grammar of Graphics
- (12) What could be improved? [legend far from plotted values; axis/scale also far and misaligned from data; graphic shows volume, but the data is actually mapped to height]
- (13) One possible redesign [show bar heights directly without the confusing use of volume; informative sorting; direct labels]
- (14) GoG is principled because it cannot do “ungrammatical” things, like the plot on slide (12) which misleadingly shows changing volumes that do not represent a data variable. On the other hand, it’s more flexible than (say) Excel’s hard-wired templates. GoG lets you specify the graph you need from the ground up.
Leland Wilkinson developed this Grammar of Graphics idea and wrote a great book about it [amazon, my review]. This influential concept has been implemented many times, serving as the basis for the data visualization tools in Tableau, SPSS, JMP, D3.js, and (as ggplot2) R.
- (15) What are the aes, stat, geom, facet for slide (13)? [aes: service maps to position on x-axis, percent maps to position on y-axis; stat: identity; geom: bar; facet: none]
- (16) What are the aes, stat, geom, facet here? [original charts from WHO for Boys and for Girls] [aes: age maps to position on x-axis, length maps to position on y-axis, quantile maps to color; stat: quantiles (3, 15, 50, 85, and 97%); geom: line; facet: gender]
- (17) What are the aes, stat, geom, facet here? [aes: weight maps to position on x-axis, length maps to position on y-axis, gender maps to color and shape; stat: median; geom: point; facet: age] (This example isn’t perfect, because each month also shows previous months’ data)
- (18) Go back to R and play with the second half of my code, to see examples of making similar baby-growth plots in ggplot2.
- (19) We discussed other plots that could be made with these commands, including a few variations that show all 6 variables at once (including both the raw data and overlaid statistical summaries). I’m not saying these are great, insightful plots, just showing the flexibility of ggplot2 as a tool.
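To make the aes / stat / geom / facet vocabulary concrete, here is a minimal ggplot2 sketch in the spirit of the baby-growth plots. It is not the talk's code: the data are simulated, and the column names (`age_months`, `length_cm`, `gender`) are my own assumptions rather than the NHANES variable names. It also assumes ggplot2 3.3.0 or later, where `stat_summary()` takes a `fun` argument.

```r
library(ggplot2)

# Simulated stand-in for the baby-growth data (NOT the NHANES subset).
set.seed(1)
babies <- data.frame(
  age_months = rep(0:24, times = 40),
  gender     = rep(c("Boys", "Girls"), each = 500)
)
babies$length_cm <- 50 + 1.5 * babies$age_months +
  ifelse(babies$gender == "Boys", 1, 0) +
  rnorm(nrow(babies), sd = 2)

p <- ggplot(babies, aes(x = age_months, y = length_cm)) +  # aes: age -> x, length -> y
  geom_point(alpha = 0.2) +                  # geom: point, stat: identity (raw data)
  stat_summary(fun = median, geom = "line",  # stat: median of y at each x, geom: line
               colour = "red") +
  facet_wrap(~ gender)                       # facet: one panel per gender

print(p)
```

Each `+` adds one grammatical piece, so swapping `median` for another summary function or `facet_wrap(~ gender)` for a different variable changes the plot without rewriting the rest.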
I also find that working with ggplot2 is very similar to coming up with a statistical regression model. Say we use facets to subgroup the data by gender and race/ethnicity, and the race/ethnicity facets look very similar but the gender facets clearly differ. That suggests our regression model should probably include a term for gender, but it’s probably OK to omit the race/ethnicity terms.
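A small sketch of that analogy, again on simulated data rather than the NHANES subset: here `gender` truly shifts length while a hypothetical three-level `group` variable (standing in for race/ethnicity) has no effect by construction. All variable names are assumptions made for illustration.

```r
# Simulated data (hypothetical): gender shifts length, `group` does not.
set.seed(2)
n <- 400
d <- data.frame(
  age_months = sample(0:24, n, replace = TRUE),
  gender     = sample(c("Boys", "Girls"), n, replace = TRUE),
  group      = sample(c("g1", "g2", "g3"), n, replace = TRUE)
)
d$length_cm <- 50 + 1.5 * d$age_months -
  3 * (d$gender == "Girls") + rnorm(n, sd = 2)

# Model suggested by the facets: keep gender, omit group.
fit_gender <- lm(length_cm ~ age_months + gender, data = d)

# Larger model that also includes group.
fit_both <- lm(length_cm ~ age_months + gender + group, data = d)

# Coefficient table for the smaller model.
summary(fit_gender)$coefficients

# F-test of whether adding `group` improves the fit; in this
# simulation `group` has no true effect on length.
anova(fit_gender, fit_both)
```

The faceted plot plays the role of an informal look at the data, and the nested-model comparison is the formal counterpart of "these panels look the same".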
I was glad to hear some audience members thought this was a good intro to ggplot2. I tried to keep it simple by using just a few limited commands, reusing the same dataset over and over, and not bothering with the qplot command (which I find gives you the wrong idea about how the GoG works).