Dataviz contest on “Visualizing Well-Being”

Someone from OECD emailed me about a data visualization contest for the Wikiprogress website (the deadline is August 24th):

Visualizing Well-Being contest

I am contacting you on behalf of the website Wikiprogress, which is currently running a Data Visualization Contest, with the prize of a paid trip to Mexico to attend the 5th OECD World Forum in Guadalajara in October this year. Wikiprogress is an open-source website, hosted by the OECD, to facilitate the exchange of information on well-being and sustainability, and the aim of the competition is to encourage participants to use well-being measurement in innovative ways to a) show how data on well-being give a more meaningful picture of the progress of societies than more traditional growth-oriented approaches, and b) to use their creativity to communicate key ideas about well-being to a broad audience.

After reading your blog, I think that you and your readers might be interested in this challenge. The OECD World Forums bring together hundreds of change-makers from around the world, from world leaders to small, grassroots projects, and the winners will have their work displayed and will be presented with a certificate of recognition during the event.

You can also visit the competition website here:

It does sound like a challenge that might intrigue this blog’s readers:

  • think about how to report human well-being, beyond traditional measures like GDP;
  • find relevant good datasets (“official statistics” or otherwise);
  • visualize these measures’ importance or insightful trends in the data; and
  • possibly win a prize trip to the next OECD forum in Guadalajara, Mexico to network with others who are interested in putting data, statistics, and visualization to good use.

“Don’t invert that matrix” – why and how

The first time I read John Cook’s advice “Don’t invert that matrix,” I wasn’t sure how to follow it. I was familiar with manipulating matrices analytically (with pencil and paper) for statistical derivations, but not with implementation details in software. For reference, here are some simple examples in MATLAB and R, showing what to avoid and what to do instead.

[Edit: R code examples and results have been revised based on Nicholas Nagle’s comment below and advice from Ryan Tibshirani.]

If possible, John says, you should just ask your scientific computing software to directly solve the linear system Ax = b. This is often faster and more numerically accurate than computing the matrix inverse of A and then computing x = A^{-1}b.

We’ll chug through a computation example below, to illustrate the difference between these two methods. But first, let’s start with some context: a common statistical situation where you may think you need matrix inversion, even though you really don’t.

[One more edit: I’ve been guilty of inverting matrices directly, and it’s never caused a problem in my one-off data analyses. As Ben Klemens comments below, this may be overkill for most statisticians. But if you’re writing a package, which many people will use on datasets of varying sizes and structures, it may well be worth the extra effort to use solve or QR instead of inverting a matrix if you can help it.]

Continue reading

Two principled approaches to data visualization

Yesterday I spoke at Stat Bytes, our student-run statistical computing seminar.

My goal was to introduce two principled frameworks for thinking about data visualization: human visual perception and the Grammar of Graphics.
(We also covered some relevant R packages: RColorBrewer, directlabels, and a gentle intro to ggplot2.)

These are not the only “right” approaches, nor do they guarantee your graphics will be good. They are just useful tools to have in your arsenal.

Example plot with direct labels and ColorBrewer colors, made in ggplot2.

Example plot with direct labels and ColorBrewer colors, made in ggplot2.

The talk was also a teaser for my upcoming fall course, 36-721: Statistical Graphics and Visualization [draft syllabus pdf].

Here are my

The talk was quite interactive, so the slides aren’t designed to stand alone. Open the slides and follow along using my notes below.
(Answers are intentionally in white text, so you have a chance to think for yourself before you highlight the text to read them.)

If you want a deeper introduction to dataviz, including human visual perception, Alberto Cairo’s The Functional Art [website, amazon] is a great place to start.
For a more thorough intro to ggplot2, see creator Hadley Wickham’s own presentations at the bottom of this page.

Continue reading

Simplistic thinking in statistical inference

My friend Brian Segal, at the University of Michigan, writes in response to the journal that banned statistical inference:

If nothing else, I think BASP did a great job of starting a discussion on p-values, and more generally, the role of statistical inference in certain types of research. Stepping back a bit, I think the discussion fits into a broader question of how we deal with answers that are inherently grey, as opposed to clear cut. Hypothesis testing, combined with traditional cutoff values, is a neat way to get a yes/no answer, but many reviewers want a yes/no answer, even in the absence of hypothesis tests.

As one example, I recently helped a friend in psychology to validate concepts measured by a survey. In case you haven’t done this before, here’s a quick (and incomplete) summary of construct validation: based on substantive knowledge, group the questions in the survey into groups, each of which measures a different underlying concept, like positive attitude, or negativity. The construct validation question is then, “Do these groups of questions actually measure the concepts I believe they measure?”

In addition to making sure the groups are defensible based on their interpretation, you usually have to do a quantitative analysis to get published The standard approach is to model the data with a structural equation model (as a side note, this includes confirmatory factor analysis, which is not factor analysis!). The goodness of fit statistic is useless in this context, because the null hypothesis is not aligned with the scientific question, so people use a variety of heuristics, or fit indices, to decide if the model fits. The model is declared to either fit or not fit (and consequently the construct is either valid or not valid) depending on whether the fit index is larger or smaller than a rule-of-thumb value. This is the same mentality as hypothesis testing.

Setting aside the question of whether it makes sense to use structural equation models to validate constructs, the point I’m trying to make is that the p-value mentality is not restricted to statistical inference. Like any unsupervised learning situation, it’s very difficult to say how well the hypothesized groups measure the underlying constructs (or if they even exist). Any answer is inherently grey, and yet many researchers want a yes/no answer. In these types of cases, I think it would be great if statisticians could help other researchers come to terms not just with the limits of the statistical tools, but with the inquiry itself.

I agree with Brian that we can all do a better job of helping our collaborators to think statistically. Statistics is not just a set of arbitrary yes/no hoops to jump through in the process of publishing a paper; it’s a kind of applied epistemology. As tempting as it might be to just ban all conclusions entirely, we statisticians are well-trained in probing what can be known and how that knowledge can be justified. Give us the chance, and we’d would love to help you navigate the subtleties, limits, and grey areas in your research!

DotCity: a game written in R? and other statistical computer games?

A while back I recommended Nathan Uyttendaele’s beginner’s guide to speeding up R code.

I’ve just heard about Nathan’s computer game project, DotCity. It sounds like a statistician’s minimalist take on SimCity, with a special focus on demographic shifts in your population of dots (baby booms, aging, etc.). Furthermore, he’s planning to program the internals using R.

This is where scatterplot points go to live and play when they're not on duty.

This is where scatterplot points go to live and play when they’re not on duty.

Consider backing the game on Kickstarter (through July 8th). I’m supporting it not just to play the game itself, but to see what Nathan learns from the development process. How do you even begin to write a game in R? Will gamers need to have R installed locally to play it, or will it be running online on something like an RStudio server?

Meanwhile, do you know of any other statistics-themed computer games?

  • I missed the boat on backing Timmy’s Journey, but happily it seems that development is going ahead.
  • SpaceChem is a puzzle game about factory line optimization (and not, actually, about chemistry). Perhaps someone can imagine how to take it a step further and gamify statistical process control √† la Shewhart and Deming.
  • It’s not exactly stats, but working with data in textfiles is an important related skill. The Command Line Murders is a detective noir game for teaching this skill to journalists.
  • The command line approach reminds me of Zork and other old text adventure / interactive fiction games. Perhaps, using a similar approach to the step-by-step interaction of swirl (“Learn R, in R”), someone could make an I.F. game about data analysis. Instead of OPEN DOOR, ASK TROLL ABOUT SWORD, TAKE AMULET, you would type commands like READ TABLE, ASK SCIENTIST ABOUT DATA DICTIONARY, PLOT RESIDUALS… all in the service of some broader story/puzzle context, not just an analysis by itself.
  • Kim Asendorf wrote a fictional “short story” told through a series of data visualizations. (See also FlowingData’s overview.) The same medium could be used for a puzzle/mystery/adventure game.

After 4th semester of statistics PhD program

This was my first PhD semester without any required courses (more or less). That means I had time to focus on research, right?

It was also my first semester as a dad. Exhilarating, joyful, and exhausting :) So, time was freed up by having less coursework, but it was reallocated largely towards diapering and sleep. Still, I did start on a new research project, about which I’m pretty excited.

Our department was also recognized as one of the nation’s fastest-growing statistics departments. I got to see some of the challenges with this first-hand as a TA for a huge 200-student class.

See also my previous posts on the 1st, the 2nd, and the 3rd semester of my Statistics PhD program.


  • Statistical Computing:
    This was a revamped, semi-required, half-semester course, and we were the guinea pigs. I found it quite useful. The revamp was spearheaded by our department chair Chris Genovese, who wanted to pass on his software engineering knowledge/mindset to the rest of us statisticians. This course was not just “how to use R” (though we did cover some advanced topics from Hadley Wickham’s new books Advanced R and R Packages; and it got me to try writing homework assignment analyses as R package vignettes).
    Rather, it was a mix of pragmatic coding practices (using version control such as Git; writing and running unit tests; etc.) and good-to-know algorithms (hashing; sorting and searching; dynamic programming; etc.). It’s the kind of stuff you’d pick up on the job as a programmer, or in class as a CS student, but not necessarily as a statistician even if you write code often.
    The homework scheme was nice in that we could choose from a large set of assignments. We had to do two per week, but could do them in any order—so you could do several on a hard topic you really wanted to learn, or pick an easy one if you were having a rough week. The only problem is that I never had to practice certain topics if I wanted to avoid them. I’d like to try doing this as an instructor sometime, but I’d want to control my students’ coverage a bit more tightly.
    This fall, Stat Computing becomes an actually-required, full-semester course and will be cotaught by my classmate Alex Reinhart.
  • Convex Optimization:
    Another great course with Ryan Tibshirani. Tons of work, with fairly long homeworks, but I also learned a huge amount of very practical stuff, both theory (how to prove a certain problem is convex? how to prove a certain optimization method works well?) and practice (which methods are likely to work on which problems?).
    My favorite assignments were the ones in which we replicated analyses from recent papers. A great way to practice your coding, improve your optimization, and catch up with the literature all at once. One of these homeworks actually inspired in me a new methodological idea, which I’ve pursued as a research project.
    Ryan’s teaching was great as usual. He’d start each class with a review from last time and how it connects to today. There were also daily online quizzes, posted after class and due at midnight, that asked simple comprehension questions—not difficult and not a huge chunk of your grade, but enough to encourage you to keep up with the class regularly instead of leaving your studying to the last minute.
  • TAing for Intro to Stat Inference:
    This was the 200-student class. I’m really glad statistics is popular enough to draw such crowds, but it’s the first time the department has had so many folks in the course, and we are still working out how to manage it. We had an army of undergrad- and Masters-level graders for the weekly homeworks, but just three of us PhD-level TAs to grade midterms and exams, which made for several loooong weekends.
    I also regret that I often wasn’t at my best during my office hours this semester. I’ll blame it largely on baby-induced sleep deprivation, but I could have spent more time preparing too. I hope the students who came to my sessions still found them helpful.
  • Next semester, I’ll be teaching the grad-level data visualization course! It will be heavily inspired by Alberto Cairo’s book and his MOOC. I’m still trying to find the right balance between the theory I think is important (how does the Grammar of Graphics work, and why does it underpin ggplot2, Tableau, D3, etc.? how does human visual perception work? what makes for a well-designed graphic?) vs. the tool-using practice that would certainly help many students too (teach me D3 and Shiny so I can make something impressive for portfolios and job interviews!)
    I was glad to hear Scott Murray’s reflections on his recent online dataviz course co-taught with Alberto.


  • Sparse PCA: I’ve been working with Jing Lei on several aspects of sparse PCA, extending some methodology that he’s developed with collaborators including his wife Kehui Chen (also a statistics professor, just down the street at UPitt). It’s a great opportunity to practice what I’ve learned in Convex Optimization and earlier courses. I admired Jing’s teaching when I took his courses last year, and I’m enjoying research work with him: I have plenty of independence, but he is also happy to provide direction and advice when needed.
    We have some nice simulation results illustrating that our method can work in an ideal setting, so now it’s time to start looking at proofs of why it should work :) as well as a real dataset to showcase its use. More on this soon, I hope.
    Unfortunately, one research direction that I thought could become a thesis topic turned out to be a dead end as soon as we formulated the problem more precisely. Too bad, though at least it’s better to find out now than after spending months on it.
  • I still need to finish writing up a few projects from last fall: my ADA report and a Small Area Estimation paper with Rebecca Steorts (now moving from CMU to Duke). I really wish I had pushed myself to finish them before the baby came—now they’ve been on the backburner for months. I hope to wrap them up this summer. Apologies to my collaborators!


  • Being a sDADistician: Finally, my penchant for terrible puns becomes socially acceptable, maybe even expected—they’re “dad jokes,” after all.
    Grad school seems to be a good time to start a family. (If you don’t believe me, I heard it as well from Rob Tibshirani last semester.) I have a pretty flexible schedule, so I can easily make time to see the baby and help out, working from home or going back and forth, instead of staying all day on campus or at the office until late o’clock after he’s gone to bed. Still, it helps to make a concrete schedule with my wife, about who’s watching the baby when. Before he arrived, I had imagined we could just pop him in the crib to sleep or entertain himself when we needed to work—ah, foolish optimism…
    It certainly doesn’t work for us both to work from home and be half-working, half-watching him. Neither the work nor the child care is particularly good that way. But when we set a schedule, it’s great for organization & motivation—I only have a chunk of X hours now, so let me get this task DONE, not fritter the day away.
    I’ve spent less time this semester attending talks and department events (special apologies to all the students whose defenses I missed!), but I’ve also forced myself to get much better about ignoring distractions like computer games and Facebook, and I spend more of my free time on things that really do make me feel better such as exercise and reading.
  • Stoicism: This semester I decided to really finish the Seneca book I’d started years ago. It is part of a set of philosophy books I received as a gift from my grandparents. Long story short, once I got in the zone I was hooked, and I’ve really enjoyed Seneca’s Letters to Lucilius as well as Practical Philosophy, a Great Courses lecture series on his contemporaries.
    It turns out several of my fellow students (including Lee Richardson) have been reading the Stoics lately too. The name “Stoic” comes from “Stoa,” i.e. porch, after the place where they used to gather… so clearly we need to meet for beers at The Porch by campus to discuss this stuff.
  • Podcasts: This semester I also discovered the joy of listening to good podcasts.
    (1) Planet Money is the perfect length for my walk to/from campus, covers quirky stories loosely related to economics and finance, and includes a great episode with a shoutout to CMU’s Computer Science school.
    (2) Talking Machines is a more academic podcast about Machine Learning. The hosts cover interesting recent ideas and hit a good balance—the material is presented deeply enough to interest me, but not so deeply I can’t follow it while out on a walk. The episodes usually explain a novel paper and link to it online, then answer a listener question, and end with an interview with a ML researcher or practitioner. They cover not only technical details, but other important perspectives as well: how do you write a ML textbook and get it published? how do you organize a conference to encourage women in ML? how do you run a successful research lab? Most of all, I love that they respect statisticians too :) and in fact, when they interview the creator of The Automatic Statistician, they probe him on whether this isn’t just going to make the data-fishing problem worse.
    (3) PolicyViz is a new podcast on data visualization, with somewhat of a focus on data and analyses for the public: government statistics, data journalism, etc. It’s run by Jon Schwabish, whom I (think I) got to meet when I still worked in DC, and whose visualization workshop materials are a great resource.
  • It’s a chore to update R with all the zillion packages I have installed. I found that Tal Galili’s installr manages updates cleanly and helpfully.
  • Next time I bake brownies, I’ll add some spices and call them “Chai squares.” But we must ask, of course: what size to cut them for optimal goodness of fit in the mouth?

Nice example of a map with uncertainty

OK, back to statistics and datavis.

As I’ve said before, I’m curious about finding better ways to draw maps which simultaneously show numerical estimates and their precision or uncertainty.

The April 2015 issue of Significance magazine includes a nice example of this [subscription link; PDF], thanks to Michael Wininger. Here is his Figure 2a (I think the labels for the red and blue areas are mistakenly swapped, but you get the idea):

Wininger, Fig. 2a

Basically, Wininger is mapping the weather continuously over space, and he overlays two contours: one for where the predicted snowfall amount is highest, and another for where the probability of snowfall is highest.

I can imagine people would also enjoy an interactive version of this map, where you have sliders for the two cutoffs (how many inches of snow? what level of certainty?). You could also just show more levels of the contours on one static map, by adding extra lines, though that would get messy fast.

I think Wininger’s approach looks great and is easy to read, but it works largely because he’s mapping spatially-continuous data. The snowfall levels and their certainties are estimated at a very fine spatial resolution, unlike say a choropleth map of the average snowfall by county or by state. The other thing that helps here is that certainty is expressed as a probability (which most people can interpret)… not as a measure of spread or precision (standard deviation, margin of error, coefficient of variation, or what have you).

Could this also work on a choropleth map? If you only have data at the level of discrete areas, such as counties… Well, this is not a problem with weather data, but it does come up with administrative or survey data. Say you have survey estimates for the poverty rate in each county (along with MOEs or some other measure of precision). You could still use one color to fill all the counties with high estimated poverty rates. Then use another color to fill all the counties with highly precise estimates. Their overlap would show the areas where poverty is estimated to be high and that estimate is very precise. Sliders would let the readers set their own definition of “high poverty” and “highly precise.”

I might be wrong, but I don’t think I’ve seen this approach before. Could be worth a try.

Roulette Wheel of Time

While we’re on crossovers between statistics and brick-sized fantasy novels, I remember taking some notes on references to math, logic, and probability in Robert Jordan’s Wheel of Time book series.

(I really can’t recommend the series. I enjoyed the first few books in middle school, but in a re-read last year they haven’t stood up to my childhood memories. The first is still fun but a blatant Tolkien ripoff; the rest are plodding and repetitive.)

Readers, can you recommend any good fantasy / sci-fi (or other fiction) that treats stats & math well?

The Dragon Reborn

A few of the characters discuss the difference between distributions that show clustering, uniformity, and randomness:

“It tells us it is all too neat,” Elayne said calmly. “What chance that thirteen women chosen solely because they were Darkfriends would be so neatly arrayed across age, across nations, across Ajahs? Shouldn’t there be perhaps three Reds, or four born in Cairhien, or just two the same age, if it was all chance? They had women to choose from or they could not have chosen so random a pattern. There are still Black Ajah in the Tower, or elsewhere we don’t know about. It must mean that.”

She’s suspicious of the very uniform distribution of demographic characteristics in the observed sample of 13 bad-guy characters. If turning evil happens at random, or at least is independent of these demographics, you’d expect some clusters to occur by chance in such a small sample—that’s why statistical theory exists, to help decide if apparent patterns are spurious. And if evil was associated with any demographic, you’d certainly expect to see some clusters. The complete absence of clustering (in fact, we see the opposite: dispersion) looks more like an experimental design, selecting observations that are as different as possible… implying there is a larger population to choose from than just these 13. Nice ūüėõ

There are also records of historical hypothesis testing of a magical artifact:

“Use unknown, save that channeling through it seems to suspend chance in some way, or twist it.” She began to read aloud. “‘Tossed coins presented the same face every time, and in one test landed balanced on edge one hundred times in a row. One thousand tosses of the dice produced five crowns one thousand times.'”

That’s a degenerate distribution right there.

Mat, the lucky-gambler character, also talks of luck going in his favor more often where there’s more randomness: he always wins at dice, usually at card games, and rarely at games like “stones” (basically Go). It’d be good fodder for a short story set in our own world—a character who realizes he’s no braniac but incredibly lucky and so seeks out luck-based situations. What else could you do, besides the obvious lottery tickets and casinos?

The Shadow Rising

I was impressed by Elayne’s budding ability to think like a statistician in the previous book, but she returns to more simplistic thinking in this book. The characters ponder murder motives (p.157):

“They were killed because they talked […] Or to stop them from it […] They might have been killed simply to punish them for being captured […] Three possibilities, and only one says the Black Ajah knows they revealed a word. Since all three are equal, the chances are that they do not know.”

Oh, Elayne. There are well-known problems with the principle of insufficient reason. Your approach to logic may get you into trouble yet.

Lord of Chaos

The description of Caemlyn’s chief clerk and census-taker Halwin Norry is hamfisted and a missed opportunity:

Rand … was not certain anything was real to Norry except the numbers in his ledgers. He recited the number of deaths during the week and the price of turnips carted in from the countryside in the same dusty tone, arranged the daily burials of penniless friendless refugees with no more horror and no more joy than he showed hiring masons to check the repair of the city walls. Illian was just another land to him, not the abode of Sammael, and Rand just another ruler.

If anything, Norry sounds like an admirable professional! Official statisticians must be as objective and politically disinterested as possible; else the rulers can make whatever “decisions” they like but there’ll be no way to accurately carry them out when you don’t know what resources you actually have on hand nor how severe the problem really is. It’d be fascinating to see how Norry actually gets runs a war-time census—perhaps with scrying help from the local magic users? But here Jordan is just sneering down. Such a shame.

Knife of Dreams

There are a few ridiculous scenes of White Ajah logicians arguing; I should have noted them down. I’m not sure if Jordan really believes mathematicians and logicians talk like this, or whether his tongue is in cheek and he’s just joking, but man, it’s a grotesque caricature. Someday I’d love to see a popular book describe the kind of arguments mathematicians actually have with each other. But this isn’t it.

Reader Morghulis

TL;DR: Memento mori. After reading too much Seneca, I’m meditating on death like a statistician, by counting how many of GRRM’s readers did not even survive to see the HBO show (much less the end of the book series). Rough answer: around 40,000.
No disrespect meant to Martin, his readers, or their families—it’s just a thought exercise that intrigued me, and I figured it may interest other people.
Also, we’ve blogged about GoT and statistics before.

In the Spring a young man’s fancy lightly turns to actuarial tables.

That’s right: Spring is the time of year when the next bloody season of Game of Thrones airs. This means the internet is awash with death counts from the show and survival predictions for the characters still alive.

All the deaths in 'A Song of Ice and Fire'

Others, more pessimistically, wonder about the health of George R. R. Martin, author of the A Song of Ice and Fire¬†(ASOIAF) book series (on which Game of Thrones is based). Some worried readers compare Martin to Robert Jordan, who passed away after writing the 11th¬†Wheel of Time book, leaving 3 more books to be finished posthumously. Martin’s trilogy has become 5 books so far and is supposed to end at 7, unless it’s 8… so who really knows how long it’ll take.

(Understandably, Martin responds emphatically to these concerns. And after all, Martin and Jordan are completely different aging white American men who love beards and hats and are known for writing phone-book-sized fantasy novels that started out as intended trilogies but got out of hand. So, basically no similarities whatsoever.)

But besides the author and his characters, there’s another set of deaths to consider. The books will get finished eventually. But how many readers will have passed away waiting for that ending? Let’s take a look.

Caveat: the inputs are uncertain, the process is handwavy, and the outputs are certainly wrong. This is all purely for fun (depressing as it may be).


Continue reading

Small Area Estimation 101: old materials posted

I never got around to polishing my Small Area Estimation (SAE) “101” tutorial materials that I promised a while ago. So here they are, though still unedited and not as clean / self-explanatory as I’d like.

The slides introduce a few variants of the simplest area-level (Fay-Herriot) model, analyzing the same dataset in a few different ways. The slides also explain some basic concepts behind Bayesian inference and MCMC, since the target audience wasn’t expected to be familiar with these topics.

  • Part 1: the basic Frequentist area-level model; how to estimate it; model checking (pdf)
  • Part 2: overview of Bayes and MCMC; model checking; how to estimate the basic Bayesian area-level model (pdf)
  • All slides, data, and code (ZIP)

The code for all the Frequentist analyses is in SAS. There’s R code too, but only for a WinBUGS example of a Bayesian analysis (also repeated in SAS). One day I’ll redo the whole thing in R, but it’s not at the top of the list right now.

Frequentist examples:

  • “ByHand” where we compute the Prasad-Rao estimator of the model error variance (just for illustrative purposes since all the steps are explicit and simpler to follow; but not something I’d usually recommend in practice)
  • “ProcMixed” where we use mixed modeling to estimate the model error variance at the same time as everything else (a better way to go in practice; but the details get swept up under the hood)

Bayesian examples:

  • “ProcMCMC” and “ProcMCMC_alt” where we use SAS to fit essentially the same model parameterized in a few different ways, some of whose chains converge better than others
  • “R_WinBUGS” where we do the same but using R to call WinBUGS instead of using SAS

The example data comes from Mukhopadhyay and McDowell, “Small Area Estimation for Survey Data Analysis using SAS Software” [pdf].

If you get the code to run, I’d appreciate hearing that it still works :)

My SAE resources page still includes a broader set of tutorials/textbooks/examples.