The Grammar of Graphics: notes on first reading

Leland Wilkinson’s The Grammar of Graphics is a classic in the data visualization literature. Wilkinson created a framework that coherently ties together many aspects of designing, implementing, reading, and understanding a graphic. It’s a useful approach and has been fairly influential: The popular R package ggplot2 is, more or less, an implementation of Wilkinson’s ideas, and I also see their influence in the software Tableau (about which more another time). Wilkinson himself helped to build these ideas into SPSS’s Graphics Production Language (GPL) and then SPSS Visualization Designer.

So what’s so special here? One of the core ideas is to start with the raw data and think about all the transformations, summaries, etc. that go into graphing it. With a good framework, this can help us see connections between different graphs and create new ones. (The opposite extreme is a “typology” or list of graph types, like you get in Excel: do you want a bar chart, a pie chart, a line chart, or one of these other 10 types? Such a list has no deep structure.) Following Wilkinson’s approach, you’ll realize that a pie chart is basically just a stacked bar chart plotted in polar coordinates, with bar height mapped to pie-slice angle… and that can get you thinking: What if I mapped bar height to radius, not angle? What if I added a variable and moved to spherical coordinates? What if I put a scatterplot in polar coordinates too? These may turn out to be bad ideas, but at least you’re thinking — in a way that is not encouraged by Excel’s list of 10 graph types.

This is NOT the approach that Wilkinson takes.
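
The pie-chart observation can be made concrete with a little arithmetic. The following is a minimal sketch in plain Python (not Wilkinson's notation, and not ggplot2): it builds the segments of a single stacked bar, then maps bar height to angle by rescaling the total height to a full circle. The function names and the category counts are made up for illustration.

```python
import math

def stacked_bar_segments(values):
    """Return (start, end) heights for each segment of one stacked bar."""
    segments = []
    start = 0.0
    for v in values:
        segments.append((start, start + v))
        start += v
    return segments

def to_pie_angles(segments):
    """Map bar height to angle: the total height becomes 2*pi radians."""
    total = segments[-1][1]
    return [(2 * math.pi * lo / total, 2 * math.pi * hi / total)
            for lo, hi in segments]

counts = [10, 30, 60]  # hypothetical category counts
bar = stacked_bar_segments(counts)
pie = to_pie_angles(bar)

for (lo, hi), (a0, a1) in zip(bar, pie):
    print(f"bar {lo:5.1f} to {hi:5.1f}  ->  "
          f"slice {math.degrees(a0):5.1f} to {math.degrees(a1):5.1f} degrees")
```

The data and the stacking step are the same for both charts; only the final coordinate mapping changes. Swapping `to_pie_angles` for a function that maps height to radius would give one of the "what if" variants mentioned above.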

But, of course, thinking is hard, and so is this book. Reading The Grammar of Graphics requires much more of a dedicated slog than, say, Edward Tufte’s books, which you can just flip through randomly for inspiration and bite-sized nuggets of wisdom. (I admire Tufte too, but I have to admit that Wilkinson’s occasional jabs at Tufte were spot-on and amused me to no end.) It’s a book full of wit and great ideas, but also full of drawn-out sections that require serious focus, and it takes a while to digest it all and put it together in your mind.

So, although I’d highly recommend this book to anyone deeply interested in visualization, I’m still digesting it. What follows is not a review but just notes-to-self from my first read-through: things to follow up on and for my own reference. It might not be particularly thrilling for other readers.

Commandeering a map from PDF or EPS, using Inkscape and R

I love Nathan Yau’s tutorial on making choropleths from an SVG file. However, if you don’t have an SVG handy already and instead you want to repurpose a map from another vector format such as PDF or EPS, there are a few extra steps that can be done in the free tool Inkscape. And while I’m at it, how could I turn down the opportunity to replicate Nathan’s Python approach in R instead?

The following was inspired by the 300-page Census Atlas of the United States, full of beautiful maps of 2000 decennial census data. I particularly liked the small multiples of state maps, which were highly generalized (i.e. the fine detail was smoothed out) but still recognizable, and DC was enlarged to be big enough to see.

I have wanted a map like this for my own purposes, when mapping a variable for all 50 states and DC. Unfortunately, I haven’t been able to track down any colleagues who know where to find the original shapefiles for this map. Fortunately, several images from the Census Atlas are available in EPS format near the bottom of this page, under “PostScript Map Files.” With access to such vector graphics, we can get started.


Making R graphics legible in presentation slides

I only visited a few JSM sessions today, as I’ve been focused on preparing for my own talk tomorrow morning. However, I went to several talks in a row which all had a common problem that made me cringe: graphics where the fonts (titles, axes, labels) are too small to read.

You used R's default settings when putting this graph in your slides? Too bad I won't be able to read it from anywhere but the front of the room.

Dear colleagues: if we’re going to the effort of analyzing our data carefully, and creating a lovely graph in R or otherwise to convey our results in a slideshow, let’s PLEASE save our graphs in a way that the text is legible on the slides! If the audience has to strain to read your graphics, they’re no easier to digest than a slide with dense equations or massive tables of numbers.

For those of us working in R, here are some very quick suggestions that would help me focus on the content of your graphics, not on how hard I’m squinting to read them.


JSM 2012: Sunday

Greetings from lovely San Diego, CA, site of this year’s Joint Statistical Meetings. I can’t believe it’s already been a year since I was inspired to start blogging during the JSM in Miami!

If you’re keeping tabs on this year’s conference, there’s a fair amount of #JSM2012 activity on Twitter. Sadly, I haven’t seen any recent posts on The Statistics Forum, which blogged JSM so actively last year.

Yesterday’s Dilbert cartoon was also particularly fitting for the start of JSM, with its focus on big data 🙂


U.S. Census Bureau releases API

The Census API, which was in the works for a while, was finally made publicly available yesterday (news release).

I’ve heard the DC dating scene is tough for single women… But especially for centenarians!

So far, two datasets are accessible:

  • the 2010 Census Summary File 1, providing counts down to the tract and block levels
  • the 2006-2010 American Community Survey five-year estimates, providing estimates down to the tract and block-group levels (but not all the way down to blocks)
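
As a rough sketch of what a query against the first dataset might look like, the Python snippet below only builds a request URL and does not contact the server. Everything specific in it is an assumption on my part rather than something taken from the news release: the base URL, the dataset path `2010/dec/sf1`, and the variable name `P001001` (total population) reflect my understanding of the API's current layout, which may differ from the initial release. The helper `build_census_url` is hypothetical. Check the developers page for the authoritative paths, variable names, and API key requirements.

```python
from urllib.parse import urlencode

# Assumption: base URL of the Census data API; verify on the developers page.
BASE_URL = "https://api.census.gov/data"

def build_census_url(dataset_path, variables, geography, key=None):
    """Build a query URL (hypothetical helper; makes no network request).

    dataset_path: e.g. "2010/dec/sf1" (assumed path for 2010 Summary File 1)
    variables:    list of variable names to request
    geography:    a "for" clause such as "state:*" (all states)
    key:          optional API key string
    """
    params = {"get": ",".join(variables), "for": geography}
    if key is not None:
        params["key"] = key
    # Keep ':', '*' and ',' unescaped so the query stays readable.
    return f"{BASE_URL}/{dataset_path}?{urlencode(params, safe=':*,')}"

# Assumed variable name for total population in the 2010 SF1 tables.
url = build_census_url("2010/dec/sf1", ["P001001", "NAME"], "state:*")
print(url)
```

Pasting the printed URL into a browser, or fetching it with any HTTP client, is then a separate step, which keeps the URL-building logic easy to inspect on its own.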

The developers page provides more information and showcases a couple of the first apps to use the API, including one by Cornell’s Jan Vink (whose online poverty maps I’ve mentioned before).

For a handy list of the other government agencies with APIs and developers pages, check out the FCC’s developers page.

useR 2012: main conference braindump

I knew R was versatile, but DANG, people do a lot with it:

> > … I don’t think anyone actually believes that R is designed to make *everyone* happy. For me, R does about 99% of the things I need to do, but sadly, when I need to order a pizza, I still have to pick up the telephone. —Roger Peng

> There are several chains of pizzerias in the U.S. that provide for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it’s only a matter of time before you will have a pizza-ordering function available. —Doug Bates

Indeed, the GraphApp toolkit … provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). —Brian Ripley

So, heads up: the following post is super long, given how much R was covered at the conference. Much of this is a “notes-to-self” braindump of topics I’d like to follow up on further. I’m writing up the invited talks, the presentation and poster sessions, and a few other notes. The conference program has links to all the abstracts, and the main website should collect most of the slides eventually.


Maps of changes in area boundaries, with R

Today a coworker needed some maps showing boundary changes. I used what I learned last week in the useR 2012 geospatial data course to make a few simple maps in R, overlaid on OpenStreetMap tiles. I’m posting my maps and my R code in case others find them useful.

A change in Census block-groups from 2000 to 2010, in Mobile, AL


useR 2012: impressions, tutorials

First of all, useR 2012 (the 8th International R User Conference) was, hands down, the best-organized conference I’ve had the luck to attend. The session chairs kept everything moving on time, tactfully but sternly; the catering was delicious and varied; and Vanderbilt University’s leafy green campus and comfortable facilities were an excellent setting. Many thanks to Frank Harrell and the rest of Vanderbilt’s biostatistics department for hosting!

Plus there's a giant statue of bacon. What's not to love?


Pithy and pragmatic textbooks

I enjoy the rare statistics textbook that can take its subject with a grain of salt:

The practitioner has heard that the [random field] should be ergodic, since “this is what makes statistical inference possible,” but is not sure how to check this fact and proceeds anyway, feeling vaguely guilty of having perhaps overlooked something very important.
Geostatistics: Modeling Spatial Uncertainty, by Chilès and Delfiner.

It’s a familiar feeling!
As Chilès and Delfiner wryly suggest, we statisticians could often do a better job of writing for beginners or practitioners. We should not just state the assumptions needed by our tools, but also explain how sensitive results are to the assumptions, how to check these assumptions in practice, and what else to try if they’re not met.