The Elements of Graphing Data, William S. Cleveland

Bill Cleveland is one of the founding figures in statistical graphics and data visualization. His two books, The Elements of Graphing Data and Visualizing Data, are classics in the field, still well worth reading today.

Visualizing is about the use of graphics as a data analysis tool: how to check model fit by plotting residuals and so on. Elements, on the other hand, is about the graphics themselves and how we read them. Cleveland (co-)authored some of the seminal papers on human visual perception, including the often-cited Cleveland & McGill (1984), “Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods.” Plenty of authors doled out common-sense advice about graphics before then, and some even ran controlled experiments (say, comparing bars to pies). But Cleveland and colleagues were so influential because they set up a broader framework that is still experimentally testable, but that encompasses the older experiments (say, encoding data by position vs length vs angle vs other things, so that bars and pies are special cases). This is just one approach to evaluating graphics, and it has limitations, but it’s better than many competing criteria, and much better than “because I said so” *coughtuftecough* 🙂

In Elements, Cleveland summarizes his experimental research articles and expands on them, adding many helpful examples and summarizing the underlying principles. What cognitive tasks do graph readers perform? How do they relate to what we know about the strengths and weaknesses of the human visual system, from eye to brain? How do we apply this research-based knowledge, so that we encode data in the most effective way? How can we use guides (labels, axes, scales, etc.) to support graph comprehension instead of getting in the way? It’s a lovely mix of theory, experimental evidence, and practical advice including concrete examples.

Now, I’ll admit that (at least in the 1st edition of Elements) the graphics certainly aren’t beautiful: blocky all-caps fonts, black-and-white (not even grayscale), etc. Some data examples seem dated now (Cold War / nuclear winter predictions). The principles aren’t all coherent. Each new graph variant is given a name, leading to a “plot zoo” that the Grammar of Graphics folks would hate. Many examples, written for an audience of practicing scientists, may be too technical for lay readers (for whom I strongly recommend Naomi Robbins’ Creating More Effective Graphs, a friendlier re-packaging of Cleveland).

Nonetheless, I still found Elements a worthwhile read, and it made a big impact on the data visualization course I taught. Although the book is 30 years old, I still found many new-to-me insights, along with historical context for many aspects of R’s base graphics.

[Edit: I’ll post my notes on Visualizing Data separately.]

Below are my notes-to-self, with things-to-follow-up in bold:

Continue reading

A cursory overview of Differential Privacy

I went to a talk today about Differential Privacy. Unfortunately the talk was rushed due to a late start, so I didn’t quite catch the basic concept. But later I found this nice review paper by Cynthia Dwork, who does a lot of research in this area. Here’s a hand-wavy summary for myself to review next time I’m parsing the technical definition.

I’m used to thinking about privacy or disclosure prevention as they do at the Census Bureau. If you release a sample dataset, such as the American Community Survey (ACS) Public Use Microdata Sample (PUMS), you want to preserve the included respondents’ confidentiality. You don’t want any data user to be able to identify individuals from this dataset. So you perturb the data to protect confidentiality, and then you release this anonymized sample as a static database. Anyone who downloads it will get the same answer each time they compute summaries on this dataset.

(How can you anonymize the records? You might remove obvious identifying information (name and address); distort some data (add statistical noise to ages and incomes); topcode very high values (round down the highest incomes above some fixed level); and limit the precision of variables (round age to the nearest 5-year range, or give geography only at a large-area level). If you do this right, hopefully (1) potential attackers won’t be able to link the released records to any real individuals, and (2) potential researchers will still get accurate estimates from the data. For example, say you add zero-mean random noise to each person’s age. Then the mean age in this edited sample will still be near the mean age in the original sample, even if no single person’s age is correct.)
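To convince myself of that last claim, here is a minimal sketch in R. The ages are simulated, and the noise standard deviation of 5 years is an arbitrary choice of mine, not anything the Census Bureau actually uses.

```r
# Toy illustration with simulated (not real) ages; all names are hypothetical.
set.seed(1)
n <- 10000
true_age <- sample(18:90, n, replace = TRUE)

# Perturb every record with zero-mean noise (sd = 5 years is an arbitrary choice).
noisy_age <- true_age + rnorm(n, mean = 0, sd = 5)

# Individual records change, but the two sample means stay close.
mean_gap <- abs(mean(noisy_age) - mean(true_age))
share_unchanged <- mean(noisy_age == true_age)
mean_gap
share_unchanged
```

The gap between the two means shrinks as the sample grows, since the average of the added noise has standard error sd / sqrt(n).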

So we want to balance privacy (if you include *my* record, it should be impossible for outsiders to tell that it’s *me*) with utility (broader statistical summaries from the original and anonymized datasets should be similar).

In the Differential Privacy setup, the setting and goal are a bit different. You (generally) don’t release a static version of the dataset. Instead, you create an interactive website or something, where people can query the dataset, and the website will always add some random noise before reporting the results. (Say, instead of tweaking each person’s age, we just wait for a user to ask for something. One person requests the mean age, and we add random noise to that mean age before we report it. Another user asks for mean age among left-handed college-educated women, and we add new random noise to this mean before reporting it.)

If you do this right, you can get a Differential Privacy guarantee: Whether or not *I* participate in your database has only a small effect on the risk to *my* privacy (for all possible *I* and *my*). This doesn’t mean no data user can identify you or your sensitive information from the data… only that your risk of identification won’t change much whether or not you’re included in the database. Finally, depending on how you choose the noise mechanism, you can ensure that this Differentially Private system retains some level of utility: estimates based on these noisified queries won’t be too far from the noiseless versions.
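Here is a rough sketch of the "add noise to each query" idea, using the Laplace mechanism, which is one standard noise mechanism in the Differential Privacy literature (my summary above doesn't commit to any particular one). The function names, the clamping bounds, and the value of epsilon are all my own assumptions for illustration, and I'm treating the number of records as public.

```r
# Sketch of a noisy mean query via the Laplace mechanism.
# Assumptions: values are clamped to [lower, upper], the record count is public,
# and epsilon is a privacy parameter chosen by whoever runs the database.
rlaplace <- function(n, scale) {
  # The difference of two iid exponentials is Laplace(0, scale).
  rexp(n, rate = 1 / scale) - rexp(n, rate = 1 / scale)
}

noisy_mean <- function(x, lower, upper, epsilon) {
  x <- pmin(pmax(x, lower), upper)            # clamp so one record has bounded influence
  sensitivity <- (upper - lower) / length(x)  # largest change in the mean if one record changes
  mean(x) + rlaplace(1, scale = sensitivity / epsilon)
}

set.seed(2)
ages <- sample(18:90, 5000, replace = TRUE)   # simulated data
answer_1 <- noisy_mean(ages, lower = 0, upper = 100, epsilon = 0.5)
answer_2 <- noisy_mean(ages, lower = 0, upper = 100, epsilon = 0.5)  # fresh noise per query
```

Smaller epsilon means more noise and stronger privacy; larger datasets need less noise for the same epsilon. Note that repeated queries each use up some of the privacy protection, so a real system would also have to track a total budget, which this sketch ignores.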

At first glance, this isn’t quite satisfying. It feels in the spirit of several other statistical ideas, such as confidence intervals: it’s tractable for theoretical statisticians to work with, but it doesn’t really address your actual question/concern.

But in a way, Dwork’s paper suggests that this might be the best we can hope for. It’s possible to use a database to learn sensitive information about a person, even if they are not in that database! Imagine a celebrity admits on the radio that their income is 100 times the national median income. Using this external “auxiliary” information, you can learn the celebrity’s income from any database that’ll give you the national median income—even if the celebrity’s data is not in that database. Of course much subtler examples are possible. In this sense, Dwork argues, you can never make *absolute* guarantees to avoid breaching anyone’s privacy, whether or not they are in your dataset, because you can’t control the auxiliary information out there in the world. But you can make the *relative* guarantee that a person’s inclusion in the dataset won’t *increase* their risk of a privacy breach by much.

Still, I don’t think this’ll really assuage people’s fears when you ask them to include their data in your Differentially Private system:

“Hello, ma’am, would you take our survey about [sensitive topic]?”
“Will you keep my responses private?”
“Well, sure, but only in the sense that this survey will *barely* raise your privacy breach risk, compared to what anyone could already discover about you on the Internet!”
“Uh, I’m going to go off the grid forever now. Goodbye.” [click]
“Dang, we lost another one.”

Manual backtrack: Three-Toed Sloth.

Stefan Wager on the statistics of random forests

Yesterday’s CMU stats department seminar was given by Stefan Wager, who spoke on statistical estimation with random forests (RFs).

Random forests are very popular models in machine learning and data science for prediction tasks. They often have great empirical performance when all you need is a black-box algorithm, as in many Kaggle competitions. On the other hand, RFs are less commonly used for estimation tasks, because historically we could not do well at computing confidence intervals or hypothesis tests: there was no good understanding of RFs’ statistical properties, nor good estimators of variance (needed for confidence intervals). Until now.

Wager has written several papers on the statistical properties of random forests. He also has made code available for computing pointwise confidence intervals. (Confidence bands, for the whole RF-estimated regression function at once, have not been developed yet.)

Wager gave concrete examples of when this can be useful, for instance in personalized medicine. You don’t always want just point-estimate predictions for how a patient will respond to a certain treatment. Often you want some margin of error too, so you can decide on the treatment that’s most likely to help. That is, you’d like to avoid a treatment with a positive estimate but a margin of error so big that you can’t be sure it helps (it might actually be harmful).

It’s great to see such work on statistical properties of (traditionally) black-box models. In general, it’s an exciting (if challenging) problem to figure out properties and estimate MOEs for such ML-flavored algorithms. Some data science or applied ML folks like to deride high-falutin’ theoretical statisticians, as did Breiman himself (the originator of random forests)… But work like Wager’s is very practical, not merely theoretically interesting. We need more of this, not less.

PS: One other nifty idea from his talk, something I hadn’t seen before: In the usual k-nearest-neighbor algorithm, you pick a target point where you want to make a prediction, then use Euclidean distance to find the k closest neighbors in the training data. Wager showed examples where it works better to train a random forest first, then use “number of trees where this data point is in the same leaf as the target point” as your measure of closeness (more shared leaves means a closer neighbor). That is, choose as “neighbors” any points that tend to land in the same leaf as your target, regardless of their Euclidean distance. The results seem more stable than usual kNN. New predictions may be faster to compute too.
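Here is my own rough sketch of that shared-leaf idea using the randomForest package. This is not Wager's code, and the simulated data, the number of trees, and k = 10 are arbitrary choices of mine. It relies on predict() with nodes = TRUE, which attaches the terminal node IDs as a "nodes" attribute.

```r
library(randomForest)

# Simulated regression data, purely for illustration.
set.seed(3)
n <- 200
x <- data.frame(x1 = runif(n), x2 = runif(n))
y <- sin(4 * x$x1) + rnorm(n, sd = 0.1)

ntree <- 200
rf <- randomForest(x, y, ntree = ntree)
target <- data.frame(x1 = 0.5, x2 = 0.5)

# Terminal node IDs: one row per point, one column per tree.
train_nodes  <- attr(predict(rf, x, nodes = TRUE), "nodes")
target_nodes <- attr(predict(rf, target, nodes = TRUE), "nodes")

# For each training point, count the trees where it shares a leaf with the target.
shared <- rowSums(sweep(train_nodes, 2, as.vector(target_nodes), "=="))

# Treat the k points with the most shared leaves as the "neighbors".
k <- 10
neighbors <- order(shared, decreasing = TRUE)[1:k]
pred <- mean(y[neighbors])
```

Here x2 is pure noise, so the forest should rarely split on it, and the chosen neighbors should be close to the target in x1 even when they are far away in x2. Plain Euclidean kNN would not make that distinction.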

Followup for myself:

  • Ryan Tibshirani asked about using shrinkage together with random forests. I can imagine small area estimators that shrink towards a CART or random forest prediction instead of a usual regression, but Ryan sounded more like he had lasso or ridge penalties in mind. Does anyone do either of these?
  • Trees and forests can only split perpendicular to the variables, but sometimes you might have “rotated” structure (i.e. interesting clustering splits are diagonal in the predictor space). So, do people ever find it useful to do PCA first, and *then* CART or RFs? Maybe even using all the PCs, so that you’re not doing it for dimension reduction, just for the sake of rotation? Or maybe some kind of sparse PCA variant where you only rotate certain variables that need it, but leave the others alone (unrotated) when you run CART or RFs?
  • The “infinitesimal jackknife” sounded like a nifty proof technique, but I didn’t catch all the details. Read up more on this.

Participant observation in statistics classes (Steve Fienberg interview)

CMU professor Steve Fienberg has a nice recent interview at Statistics Views.

He brings up great nuggets of stats history, including insights into the history and challenges of Big Data. I also want to read his recommended books, especially Fisher’s Design of Experiments and Raiffa & Schlaifer’s Applied Statistical Decision Theory. But my favorite part was about involving intro stats students in data collection:

One of the things I’ve been able to do is teach a freshman seminar every once in a while. In 1990, I did it as a class in a very ad hoc way and then again in 2000, and again in 2010, I taught small freshman seminars on the census. Those were the census years, so I would bring real data into the classroom which we would discuss. One of the nice things about working on those seminars is that, because I personally knew many of the Census Directors, I was able to bring many of them to class as my guests. It was great fun and it really changes how students think about what they do. In 1990, we signed all students up as census enumerators and they did a shelter and homeless night and had to come back and describe their experiences and share them. That doesn’t sound like it should belong in a stat class but I can take you around here at JSM and introduce you to people who were in those classes and they’ve become statisticians!

What a great teaching idea 🙂 It reminds me of discussions in an anthropology class I took, where we learned about participant observation and communities of practice. Instead of just standing in a lecture hall talking about statistics, we’d do well to expose students to real-life statistical work “in the field”—not just analysis, but data collection too. I still feel strongly that data collection/generation is the heart of statistics (while data analysis is just icing on the cake), and Steve’s seminar is a great way to hammer that home.

Victoria Stodden on Reproducible Research

Yesterday’s department seminar was by Victoria Stodden [see slides from Nov 9, 2015]. With some great Q&A during the talk, we only made it through about half the slides.

Dr Stodden spoke about several kinds of reproducibility important to science, and their links to different “flavors” of science. As I understood it, there are

  • empirical reproducibility: are the methods (lab-bench protocol, psych-test questionnaire, etc.) available, so that we could repeat the experiment or data-collection?
  • computational reproducibility: are the code and data available, so that we could repeat the processing and calculations?
  • statistical reproducibility: was the sample large enough that we can expect to get comparable results, if we do repeat the experiment and calculations?

Her focus is on the computational piece. As more and more research involves methodological contributions primarily in the software itself (and not explained in complete detail in the paper), it’s critical for that code to be open and reproducible.


Teaching data visualization: approaches and syllabi

While I’m still working on my reflections on the dataviz course I just taught, there were some useful dataviz-teaching talks at the recent IEEE VIS conference.

Jen Christiansen and Robert Kosara have great summaries of the panel on “Vis, The Next Generation: Teaching Across the Researcher-Practitioner Gap.”

Even better, slides are available for some of the talks: Marti Hearst, Tamara Munzner, and Eytan Adar. Lots of inspiration for the next time I teach.


Finally, here are links to the syllabi or websites of various past dataviz courses. Browsing these helps me think about what to cover and how to teach it.

Not quite data visualization, but related:

Comment below or tweet @civilstat with any others I’ve missed, and I’ll add them to the list.
(Update: Thanks to John Stasko for links to many I missed, including his own excellent course site & resource page.)

Why bother with magrittr

I’ve seen R users swooning over the magrittr package for a while now, but I couldn’t make heads or tails of all these scary %>% symbols. Finally I had time for a closer look, and it seems potentially handy indeed. Here’s the idea and a simple toy example.

So, it can be confusing and messy to write (and read) functions from the inside out. This is especially true when functions take multiple arguments. Instead, magrittr lets you write (and read) functions from left to right.

Say you need to compute the LogSumExp function \log\left(\sum_{i=1}^n\exp(x_i)\right), and you’d like your code to specify the logarithm base explicitly.

In base R, you might write
log(sum(exp(MyData)), exp(1))
But this is a bit of a mess to read. It takes a lot of parentheses-matching to see that the exp(1) is an argument to log and not to one of the other functions.

Instead, with magrittr, you program from left to right:
MyData %>% exp %>% sum %>% log(exp(1))
The pipe operator %>% takes the output from its left-hand side and passes it as the first argument to the function on its right-hand side. Now it’s very clear that the exp(1) is an argument to log.
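As a quick sanity check that the two versions really agree, here is a tiny runnable example. The values in MyData are made up for illustration.

```r
library(magrittr)

# Toy data, made up for illustration.
MyData <- c(0.5, 1.2, -0.3)

nested <- log(sum(exp(MyData)), exp(1))
piped  <- MyData %>% exp %>% sum %>% log(exp(1))

# x %>% f(y) is evaluated as f(x, y), so both forms compute the same value.
all.equal(nested, piped)
```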

There’s a lot more you can do with magrittr, but code with fewer nested parentheses is already a good selling point for me.

Apart from cleaning up your nested functions, this approach to programming might be helpful if you write a lot of JavaScript code, for example if you make D3.js visualizations. R’s magrittr pipe is similar in spirit to JavaScript’s method chaining, so it might make context-switching a little easier.

Statistical Graphics and Visualization course materials

I’ve just finished teaching the Fall 2015 session of 36-721, Statistical Graphics and Visualization. Again, it is a half-semester course designed primarily for students in the MSP program (Master of Statistical Practice) in the CMU statistics department. I’m pleased that we also had a large number of students from other departments taking this as an elective.

For software we used mostly R (base graphics, ggplot2, and Shiny). But we also spent some time on Tableau, Inkscape, D3, and GGobi.

We covered a LOT of ground. At each point I tried to hammer home the importance of legible, comprehensible graphics that respect human visual perception.

Pie chart with remake

Remaking pie charts is a rite of passage for statistical graphics students

My course materials are below. Not all the slides are designed to stand alone, but I have no time to remake them right now. I’ll post some reflections separately.

Download all materials as a ZIP file (38 MB), or browse individual files:
Continue reading

Chai Squares

I saw this typo for Chi Square a while back and thought it’d make a great recipe idea. Turns out I was right: these bars won a prize at my department’s World Statistics Day bake-off.


Start with Mark Bittman’s blondie recipe (copied/adapted from here), and add some of the spices that go into chai tea.

  • 8 tablespoons (1 stick, 4 ounces or 113 grams) butter, melted
  • 1 cup (218 grams or 7 3/4 ounces for light; 238 grams or 8 3/8 ounces for dark) brown sugar
  • 1 large egg
  • 1 teaspoon vanilla
  • Pinch salt
  • 1 cup (4 3/8 ounces or 125 grams) all-purpose flour
  • 1/2 teaspoon cardamom
  • 1/2 teaspoon cinnamon
  • 1/2 teaspoon ground ginger
  • 1/2 teaspoon ground cloves
  • scant 1/2 teaspoon freshly ground black pepper

  1. Preheat oven to 350°F. Butter an 8×8 pan, or line the pan with aluminum foil and grease the foil.
  2. Mix melted butter with brown sugar. Beat until smooth. Beat in egg and vanilla.
  3. Combine salt, flour, and spices. Gently stir flour mixture into butter mixture.
  4. Pour into prepared pan. Bake 20-25 minutes, or until barely set in the middle. Cool on a rack before cutting.


Summary sheet of ways to map statistical uncertainty

A few years ago, a team at the Cornell Program on Applied Demographics (PAD) created a really nice demo of several ways to show statistical uncertainty on thematic maps / choropleths. They have kindly allowed me to host their large file here: PAD_MappingExample.pdf (63 MB)

Screenshot of index page from PAD mapping examples


Each of these maps shows a dataset with statistical estimates and their precision/uncertainty for various areas in New York state. If we use color or shading to show the estimates, like in a traditional choropleth map, how can we also show the uncertainty at the same time? The PAD examples include several variations of static maps, interaction by toggling overlays, and interaction with mouseover and sliders. Interactive map screenshots are linked to live demos on the PAD website.

I’m still fascinated by this problem. Each of these approaches has its strengths and weaknesses: Symbology Overlay uses separable dimensions, but there’s no natural order to the symbols. Pixelated Classification seems intuitively clear, but may be misleading if people (incorrectly) try to find meaning in the locations of pixels within an area. Side-by-side maps are each clear on their own, but it’s hard to see both variables at once. Dynamic Feedback gives detailed info about precision, but only for one area at a time, not all at once. And so forth. It’s an interesting challenge, and I find it really helpful to see so many potential solutions collected in one document.

The creators include Nij Tontisirin and Sutee Anantsuksomsri (both since moved on from Cornell), and Jan Vink and Joe Francis (both still there). The pixelated classification map is based on work by Nicholas Nagle.

For more about mapping uncertainty, see their paper:

Francis, J., Tontisirin, N., Anantsuksomsri, S., Vink, J., & Zhong, V. (2015). Alternative strategies for mapping ACS estimates and error of estimation. In Hoque, N. and Potter, L. B. (Eds.), Emerging Techniques in Applied Demography (pp. 247–273). Dordrecht: Springer Netherlands, DOI: 10.1007/978-94-017-8990-5_16 [preprint]

and my related posts:

See also Visualizing Attribute Uncertainty in the ACS: An Empirical Study of Decision-Making with Urban Planners. This talk by Amy Griffin is about studying how urban planners actually use statistical uncertainty on maps in their work.