Category Archives: Visualization

Dataclysm, Christian Rudder

In between project deadlines and homework assignments, I enjoyed taking a break to read Christian Rudder’s Dataclysm. (That’s right, my pleasure-reading break from statistics grad school textbooks is… a different book about statistics. I think I have a problem. Please suggest some good fiction!)

So, Rudder is one of the founders of dating site OkCupid and its quirky, data-driven research blog. His new book is very readable—each short, catchy chapter was hard to put down. I like how he gently alludes to the statistical details for nerds like myself, in a way that shouldn’t overwhelm lay readers. The clean, Tufte-minimalist graphs work quite well and are accompanied by clear writeups. Some of the insights are basically repeats of material already on the blog, but with a cleaner writeup, though there’s plenty of new stuff too. Whether or not you agree with all of his conclusions [edit: see Cathy O’Neil’s valid critiques of the stats analyses here], the book sets a good example to follow for anyone interested in data- or evidence-based popular science writing.

Most of all, I loved his description of statistical precision:

Ironically, with research like this, precision is often less appropriate than a generalization. That’s why I often round findings to the nearest 5 or 10 and the words ‘roughly’ and ‘approximately’ and ‘about’ appear frequently in these pages. When you see in some article that ‘89.6 percent’ of people do x, the real finding is that ‘many’ or ‘nearly all’ or ‘roughly 90 percent’ of them do it, it’s just that the writer probably thought the decimals sounded cooler and more authoritative. The next time a scientist runs the numbers, perhaps the outcome will be 85.2 percent. The next time, maybe it’s 93.4. Look out at the churning ocean and ask yourself exactly which whitecap is ‘sea level.’ It’s a pointless exercise at best. At worst, it’s a misleading one.

I might use that next time I teach.
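For a classroom version of Rudder's point, here is a minimal sketch. It assumes a made-up "true" rate of 90 percent and a hypothetical sample of 200 people per study; neither number comes from the book. The helper names `round_to_nearest` and `simulate_survey` are illustrative only.

```python
import random

def round_to_nearest(value, step=5):
    """Round a percentage to the nearest multiple of `step` (5 or 10, say)."""
    # Note: Python's round() sends exact halves to the even neighbor.
    return step * round(value / step)

def simulate_survey(true_rate, n, rng):
    """Return the observed percentage of 'yes' answers in one simulated study."""
    hits = sum(rng.random() < true_rate for _ in range(n))
    return 100 * hits / n

# Hypothetical setup: the same study rerun five times on fresh samples.
rng = random.Random(0)
for _ in range(5):
    pct = simulate_survey(0.9, 200, rng)
    print(f"{pct:.1f} percent -> roughly {round_to_nearest(pct)} percent")
```

The decimals change from run to run, while the rounded figure stays much more stable, which is the sense in which the rounded number is the more honest one to report.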

The description of how academics hunt for data is also spot on: “Data sets move through the research community like yeti—I have a bunch of interesting stuff but I can’t say from where; I heard someone at Temple has tons of Amazon reviews; I think L has a scrape of Facebook.”

Sorry I didn’t take many notes this time, but Alberto Cairo’s post on the book links to a few more detailed reviews.

Winter is coming (to the Broad Street pump)

We live in an amazing future, where an offhand Twitter joke about classic data visualizations and Game of Thrones immediately turns into a real t-shirt you can buy.

You know nothing (about cholera), John Snow

Hats off to Alberto Cairo (whose book The Functional Art and blog are the best introductions to data visualization that I can recommend—but you already knew that).

If you don’t already know the story of John Snow and the Broad Street pump—or if you think you do but haven’t heard the full details—then The Ghost Map is a great telling.

Update: Alberto continues to kick this up a notch, adding two more Game Of Thrones-themed classic dataviz jokes, and making the images/captions available under the Creative Commons license. Awesome.

Winter is coming (for Napoleon)

‘Census Marketing’ app by Olin College students

I love having the chance to promote nifty data visualizations; good work from my former employer, the Census Bureau; and student projects from my alma mater, Olin College. So it’s a particular pleasure to highlight all three at once:

Elizabeth Duncan and Marena Richardson, students in Olin’s Data Science course, teamed up with Census staff and BusinessUSA to develop an app that helps make Census data accessible to small business owners.


The result, Census Marketing, is a nifty and simple interface to overlay Decennial Census and American Community Survey data on Google Maps.

Imagine you’re planning to start or expand a small business, and you know the demographic you’d like to target (age, income, etc.). Where in your town is there a high concentration of your target market? And are there already competing businesses nearby?

Load up Duncan and Richardson’s website, enter your location, select demographic categories from a few drop-down menus, and give your business type. The app will go find the relevant data (through the Census API) and display it for you as a block-level heatmap on Google Maps. It’ll also highlight the locations of existing businesses that might be competitors.

For example, say you want to open a pizzeria in my Pittsburgh neighborhood of Squirrel Hill. You might want to target the undergrad and grad student populations, since they tend to order pizza pretty often. Punch in the zip code 15217, choose all races and both sexes, select age groups 20-29 and 30-39, and specify that you’re looking for other competing pizzerias:


Well! The student-age population is clearly concentrated around Hobart and Murray… but so are the competing pizzerias. Good to know. Maybe you need to brainstorm a new business plan, seek out a different part of town, or try marketing to a different demographic.

Besides learning about data science and creating a website, Duncan and Richardson also interviewed several actual small business owners to refine the user experience. It’s a nice example of Olin’s design-centered approach to engineering education. I can imagine a couple of further improvements to this app… But it’s already a nice use case for the Census API, and a good example of the work Olin students can do in a short time.

PS—the course instructor, Allen Downey, has a free book ThinkStats on introductory statistics from a computer scientist’s point of view. I hear that a revised second edition is on its way.

Data-Driven Journalism MOOC

TL;DR: The Knight Center’s free online journalism courses are great for anyone who works with data, storytelling, or both. See what’s being offered here.
My favorite links from a recent course on Data-Driven Journalism are here.
And a fellow student’s suggested reading list is here.

Last fall, a coworker and I led a study group for the Knight Center’s MOOC (massive open online course) on “Introduction to Infographics and Data Visualization”, taught by Alberto Cairo. The course and Alberto’s book were excellent, and we were actually able to bring Alberto in to the Census Bureau for a great lecture a few months later. This course is now in its 3rd offering (starting today!) and I cannot recommend it highly enough if you have any interest in data, journalism, visualization, design, storytelling, etc.!

So, this summer I was happy to see the Knight Center offering another MOOC, this time on “Data-Driven Journalism: The Basics”. What with moving cities and starting the semester, I hadn’t kept up with the class, but I’ve finally finished the last few readings & videos. Overall I found a ton of great material.

The course’s five lecturers gave an overview of data-driven journalism: from its historical roots in the 1800s and its relation to computer-aided reporting, to how to get data in the first place, through cleaning and checking the data, and finally to building news apps and journalistic data visualizations.

In week 3 there was a particularly useful exercise of going through a spreadsheet of hunting accidents. Of course it illustrated some of the difficulties in cleaning data, and it gave concrete practice in filtering and sorting. But it was also a great illustration of how a database can lead you to potential trends or stories that you might have missed if you’d only gone out to interview a few individual hunters.
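The same clean-filter-sort routine can be sketched outside a spreadsheet. The rows, column names, and county labels below are invented purely for illustration; they are not the course's actual hunting-accident file.

```python
import csv
import io
from collections import Counter

# Hypothetical, made-up rows with the kind of messiness the exercise
# highlighted: stray spaces and inconsistent capitalization.
RAW = """year,county,cause,injury
2010,County A,Fell from stand,minor
2011,county a ,self-inflicted,serious
2011,County B,Mistaken for game,fatal
2012,County A,mistaken for game,serious
2012,COUNTY B,Self-inflicted,minor
"""

def load_rows(text):
    """Parse the CSV text and normalize each field so values compare equal."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "year": int(row["year"]),
            "county": row["county"].strip().title(),
            "cause": row["cause"].strip().lower(),
            "injury": row["injury"].strip().lower(),
        })
    return rows

rows = load_rows(RAW)

# Filter: keep only the serious or fatal accidents.
serious = [r for r in rows if r["injury"] in {"serious", "fatal"}]

# Sort: by county, then by year within each county.
serious.sort(key=lambda r: (r["county"], r["year"]))

# Count causes across all rows to see which might be worth a story.
cause_counts = Counter(r["cause"] for r in rows)
print(cause_counts.most_common())
```

Without the cleaning step, "Mistaken for game" and "mistaken for game" would be tallied as two different causes, which is exactly the sort of problem the exercise was meant to surface.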

I loved some of the language that came up, such as “backgrounding the data” — analogous to checking out your sources to see how much you can trust them — or “interrogating the data,” including coming prepared to the “data interview” to ask thorough, thoughtful questions. I’d love to see a Statistics 101 course taught from this perspective. Statisticians do these things all the time, but our terminology and approach seem alien and confusing the first few times you see them. “Thinking like a journalist” and “thinking like a statistician” are not all that different, and the former might be a much more approachable path to the latter.

For those who missed the course, consider skimming the Data Journalism Handbook (free online); Stanford’s Data Journalism lectures (hour-long video); the course readings I saved on Pinboard; and my notes below.
Edit: See also fellow student Daniel Drew Turner’s suggested reading list.
Then, keep an eye out for next time it’s offered on the Knight Center MOOC page.

Below is a (very messy) braindump of notes I took during the class, in case there are any useful nuggets in there. (Messiness due to my own limited time to clean the notes up, not to any disorganization in the course itself!) I think the course videos were not for sharing outside the class, but I’ve linked to the other readings and videos.


Census-related links from June

A few links to share since I’ve been away:

Article on “An American Tradition: The U.S. Census Bureau Continues to Innovate in Data Visualization” (free access after watching an ad): A colleague summarizes some of the many mapping and datavis resources provided by the Census Bureau.

Two interactive web apps by data users (these use Census Bureau data but the datavises are someone else’s products):

  • Web app “Point Context”: a data user calls the Census Bureau’s API to find the distributions of age, race, income, and education for residents of the “average” neighborhood containing an arbitrary set of latitude-longitude coordinates.
  • Interactive map of “Is the United States spending less on public education?”: A Census data user practices with D3 and tries out the lessons from datavis classes — show comparisons, allow color-blind-safe color palettes, “catchy” headlines and informative annotations help guide readers, etc. I particularly like the arrow indicating where the selected state falls on the colorbar.

Several tools for making maps from Excel or spreadsheet-like tools:

  • Esri, the makers of ArcGIS software, have created a Microsoft Office add-on that lets you create maps of your data in Excel. A live demo looked promising, especially if your organization is already an Esri client… but otherwise ArcGIS is not cheap!
  • If you do have ArcGIS Online access, you can try using Esri’s “Story Maps” templates. Their published examples include this simple one based on Census Bureau data.
  • JMP, a SAS product, also has mapping tools that should be fairly simple for people used to spreadsheets. But again, SAS’s products tend to be expensive too.

Visual Revelations, Howard Wainer

I’m starting to recognize several clusters of data visualization books. These include:

(Of course this list calls out for a flowchart or something to visualize it!)

Howard Wainer’s Visual Revelations falls in this last category. And it’s no surprise Wainer’s book emulates Tufte’s, given how often the author refers back to Tufte’s work (including comments like “As Edward Tufte told me once…”). And The Visual Display of Quantitative Information is still probably the best introduction to the genre. But Visual Revelations is different enough to be a worthwhile read too if you enjoy such books, as I do.

Most of all, I appreciated that Wainer presents many bad graph examples found “in the wild” and follows them with improvements of his own. Not all are successful, but even so I find this approach very helpful for learning to critique and improve my own graphics. (Tufte’s classic book critiques plenty, but spends less time on before-and-after redesigns. On the other hand, Kosslyn’s book is full of redesigns, but his “before” graphs are largely made up by him to illustrate a specific point, rather than real graphics created by someone else.)

Of course, Wainer covers the classics like John Snow’s cholera map and Minard’s plot of Napoleon’s march on Russia (well-trodden by now, but perhaps less so in 1997?). But I was pleased to find some fascinating new-to-me graphics. In particular, the Mann Gulch Fire section (p. 65-68) gave me shivers: it’s not a flashy graphic, but it tells a terrifying story and tells it well.
[Edit: I should point out that Snow’s and Minard’s plots are so well-known today largely thanks to Wainer’s own efforts. I also meant to mention that Wainer is the man who helped bring into print an English translation of Jacques Bertin’s seminal Semiology of Graphics and a replica volume of William Playfair’s Commercial and Political Atlas and Statistical Breviary. He has done amazing work at unearthing and popularizing many lost gems of historical data visualization!
See also Alberto Cairo’s review of a more recent Wainer book.]

Finally, Wainer’s tone overall is also much lighter and more humorous than Tufte’s. His first section gives detailed advice on how to make a bad graph, for example. I enjoyed Wainer’s jokes, though some might prefer more gravitas.


audiolyzR: Data sonification with R

Update (5/15/2014): I just realized audiolyzR is publicly available on CRAN. See also co-creator Jesse Garrison’s audiolyzR page.

In his talk “Give Your Data A Listen” at last summer’s useR! 2012 conference, Eric Stone presented joint work with Jesse Garrison on audiolyzR, an R package for “data sonification.” I thought this was a nifty and well-executed idea. Since I haven’t seen Eric and Jesse post any demos online yet, I’d like to share a summary and video clip here, so that I can point to them whenever I describe audiolyzR to other folks.


In August I invited Eric to my workplace to speak, and he gave us a great talk including demos of features added since the useR session. Here’s the post-event summary:

Eric Stone, a PhD student at Temple University, presented his co-authored work with Jesse Garrison on “data sonification”: using sound (other than speech) to visualize a dataset.
Eric demonstrated audiolizations of scatterplots and histograms using the statistical software R and the audio toolkit Max/MSP, as well as his ongoing research on time-series line plots. The software shows a visual display of the data and then plays an audio version, with the x-axis mapped to time and the y-axis to pitch. For instance, a positively-correlated scatterplot sounds like rising scales or arpeggios. Other variables are represented by timbre, volume, etc. to distinguish them. The analyst can also tweak the tempo and other settings while listening to the data repeatedly to help outliers stand out more clearly. A few training examples helped the audience to learn how to listen to these audiolizations and identify these outliers.
Eric believes that, even if the audiolization itself is no clearer than a visual plot, activating multiple cortices in the brain makes the analyst more attuned to the data. As a musician since childhood, he succeeded in making the results sound pleasant so that they do not wear out the listener.
The software will soon be released as an R package and linked to RExcel to expand its reach to Excel users. Future work includes: 1) supporting more data structures and more layers of data in the same audiolization; 2) testing the software with visually impaired users as a tool for accessibility; and 3) developing ways to embed the audiolizations into a website.
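To make the x-to-time, y-to-pitch mapping concrete, here is a rough sketch. It is not audiolyzR's code (which relies on R and Max/MSP). The function name `sonify`, the MIDI note range, and the four-second duration are all assumptions chosen for illustration, and the sketch only computes note events rather than playing sound.

```python
def sonify(points, duration=4.0, low_note=48, high_note=84):
    """Map (x, y) points to (onset_seconds, midi_note, frequency_hz) events.

    x is rescaled linearly onto [0, duration]; y is rescaled onto the MIDI
    note range [low_note, high_note] and rounded to the nearest semitone.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]

    def scale(v, lo, hi, out_lo, out_hi):
        if hi == lo:  # all values identical: use the middle of the range
            return (out_lo + out_hi) / 2
        return out_lo + (v - lo) * (out_hi - out_lo) / (hi - lo)

    events = []
    for x, y in sorted(points):
        onset = scale(x, min(xs), max(xs), 0.0, duration)
        note = round(scale(y, min(ys), max(ys), low_note, high_note))
        # Standard equal-temperament conversion: MIDI note 69 is A4 = 440 Hz.
        freq = 440.0 * 2 ** ((note - 69) / 12)
        events.append((onset, note, freq))
    return events

# A positively correlated toy scatterplot yields a rising sequence of notes.
for onset, note, freq in sonify([(0, 0), (1, 1), (2, 2)], low_note=57, high_note=81):
    print(f"t={onset:.1f}s  note={note}  {freq:.0f} Hz")
```

An outlier in y would show up as a note far above or below its neighbors in time, which is the effect the training examples in the talk were teaching listeners to hear.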

Eric suggested that he can imagine someone using this as part of an information dashboard or for reviewing a zillion different data views in a row, while multi-tasking: Just set it to loop through each slice of the data while you work on something else. Your ears will alert you when you hit a data slice that’s unusual and worth investigating further.

Eric has kindly sent me a version of the package, and below I demonstrate a few examples using NHANES data:

I’ve asked Eric if there’s a public release coming anytime soon, but it may be a while:

I am nearly ready to release it, but it’s one of those situations where my advisor will come up with “just one more thing” to add, so, you know, it might be a while.. Anyway, if people are interested I can provide them with the software and everything. Just let me know if anyone is.

If you want to get in touch with Eric, his contact info is in the useR talk abstract linked at the top.

On a very-loosely-related note, consider also John Cook’s post on measuring evidence in decibels. Someday I’d like to re-read this after I’ve had my morning coffee and think about if there’s any useful way to turn this metaphor into literal sonic hypothesis testing.
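As a starting point for that idea, here is a small sketch of the arithmetic. It assumes the convention of expressing evidence as 10 times the base-10 logarithm of the odds (the convention used by E. T. Jaynes, and as far as can be told the one Cook's post discusses). The function names are hypothetical.

```python
import math

def evidence_db(prob):
    """Evidence for a hypothesis in decibels: 10 * log10 of its odds."""
    return 10 * math.log10(prob / (1 - prob))

def update_db(prior_db, likelihood_ratio):
    """Bayes' rule in decibels: new data simply adds 10 * log10(LR)."""
    return prior_db + 10 * math.log10(likelihood_ratio)

def prob_from_db(db):
    """Convert decibels of evidence back to a probability."""
    odds = 10 ** (db / 10)
    return odds / (1 + odds)

prior = evidence_db(0.5)            # even odds
posterior = update_db(prior, 10.0)  # data ten times likelier under the hypothesis
print(prior, posterior, prob_from_db(posterior))
```

Because updates are additive on this scale, a literal "sonic" version could in principle map accumulated evidence to loudness, though whether that would be useful is an open question.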

The tuba effect

The Jingle All The Way 8k results are up, and naturally I was curious how I stacked up against the other runners. I know I’m no sprinter, so I’ve just plotted the median times within each age-by-gender category. Apparently carrying a tuba gave me a race time comparable to the median among 70-74 year old women.

Of course I already knew I’d lose a race against my grandmother, a strong Polish woman who taught PE for many years. But when I’m carrying a tuba, your grandmother could likely beat me too.
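For anyone wanting to repeat the comparison on their own race, the grouping step might look like the sketch below. The finish times are invented placeholders, not the actual Jingle All The Way results, and the five-year age brackets are an assumption about how the race reports its categories.

```python
from collections import defaultdict
from statistics import median

def age_bracket(age, width=5):
    """Label an age with its bracket, e.g. 72 -> '70-74' for width 5."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def median_times(results):
    """Median finish time (minutes) for each (age bracket, gender) group."""
    groups = defaultdict(list)
    for age, gender, minutes in results:
        groups[(age_bracket(age), gender)].append(minutes)
    return {key: median(times) for key, times in groups.items()}

# Hypothetical (age, gender, minutes) records for illustration only.
results = [
    (72, "F", 58.0), (70, "F", 62.0), (74, "F", 60.0),
    (31, "M", 40.0), (33, "M", 44.0),
]
for group, minutes in sorted(median_times(results).items()):
    print(group, minutes)
```

Medians are the natural summary here because a handful of walkers (or tuba players) would drag a group's mean time upward.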

Most-cited books on list of lists of data visualization readings

As part of the resources for his online data visualization course, Alberto Cairo has posted several lists of recommended readings:

Some of these links lead to other excellent recommended-readings lists:

I figured I should focus on reading the book suggestions that came up more than once across these lists. Below is the ranking; it’s by author rather than book, since some authors were suggested with multiple books. So many good books!
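The tallying itself is simple enough to sketch. The author and title placeholders below are hypothetical stand-ins, not the contents of the actual reading lists, and `rank_authors` is an illustrative name.

```python
from collections import Counter

def rank_authors(reading_lists):
    """Rank authors by how many lists mention them, keeping those on 2+ lists.

    Each reading list is a sequence of (author, title) pairs. An author with
    several books on one list still counts only once for that list.
    """
    counts = Counter()
    for entries in reading_lists:
        counts.update({author for author, _title in entries})
    # Sort by descending count, breaking ties alphabetically by author.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(author, n) for author, n in ranked if n > 1]

# Hypothetical placeholder lists for illustration.
lists = [
    [("Author A", "Book 1"), ("Author B", "Book 2")],
    [("Author A", "Book 3"), ("Author A", "Book 1"), ("Author C", "Book 4")],
    [("Author B", "Book 2"), ("Author A", "Book 1")],
]
print(rank_authors(lists))
```

Counting by author rather than by title means an author recommended for different books on different lists still rises to the top.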

The list, by number of citations per author:

Graph Design for the Eye and Mind, Stephen Kosslyn

When I reviewed The Grammar of Graphics, Harlan Harris pointed me to Kosslyn’s book Graph Design for the Eye and Mind. I’ve since read it and can recommend it highly, although the two books have quite different goals. Unlike Wilkinson’s book, which provides a framework encompassing all the graphics that are possible, Kosslyn’s book summarizes perceptual research on what makes graphics actually readable.

In other words, this is something of the graphics equivalent to Strunk and White’s The Elements of Style, except that Kosslyn’s book is grounded in actual psychology research rather than personal preferences. This is a good book to keep at your desk for quickly checking whether your most recent graphic follows his advice.

Kosslyn is targeting the communicator-of-results, not the pure statistician (churning out graphs for experts’ data exploration) or the data artist (playing with data-inspired, more-pretty-than-meaningful visual effects). In contrast to Tukey’s remark that a good statistical graphic “forces us to notice what we never expected to see,” Kosslyn’s focus is clear communication of what the analyst has already noticed.

For present purposes I would say that a good graph forces the reader to see the information the designer wanted to convey. This is the difference between graphics for data analysis and graphics for communication.

Kosslyn also respects aesthetics but does not focus on them:

Making a display attractive is the task of the designer […] But these properties should not obscure the message of the graph, and that’s where this book comes in.

So Kosslyn presents his 8 “psychological principles of effective graphics” (for details, see Chopeta Lyons’ review or pages 4-12 of Kosslyn’s Clear and to the Point). Then he illustrates the principles with clear examples and backs them up with research citations, for each of several common graph types as well as for labels, axes, etc. in general. I particularly like all the paired “Don’t” and “Do” examples, showing both what to avoid and how to fix it. Most of the book is fairly easy reading and solid advice. Although much of it is common sense, it’s useful as a quick checkup of the graphs you’re creating, especially as it’s so well laid-out.

Bonus: Unlike many other recent data visualization books, Kosslyn does not completely disavow pie charts. Rather, he gives solid advice on the situations where they are appropriate, and on how to use them well in those cases.

If you want to dig even deeper, Colin Ware’s Information Visualization is a very detailed but readable reference on the psychological and neural research that underpins Kosslyn’s advice.

The rest of this post is a list of notes-to-self about details I want to remember or references to keep handy… Bolded notes are things I plan to read about further.