I enjoyed this week’s Symposium on Large-Scale Data Inference, which honored Harvard’s Carl Morris as the keynote speaker. This was the second such symposium; last year’s honoree was Brad Efron (whose new book I also recommend after seeing it at this event).

This year’s focus was the intersection of statistics and data visualization around the question, “Can we believe what we see?” I was seriously impressed by the variety and quality of the speakers & panelists — many thanks to Social & Scientific Systems for organizing! Look for the lecture videos to be posted online in January.

See below for the first two speakers, Carl Morris and Mark Hansen. The next posts will summarize talks by Di Cook and Rob Kass (part 2), and Chris Volinsky and the final panel discussion (part 3).

**“I don’t think of Bayes and frequentist as opposites; they can be mutually supportive.”**

Carl Morris is one of the leading figures in empirical Bayes methods, so it’s no wonder they appeared in his keynote on regression to the mean (RTTM) and how to address it with multilevel models. He gave a nice example of how the usual hypothesis tests and confidence intervals (CIs) can be biased in situations where we expect RTTM. For example, say you’re reporting polls in swing states and a small poll shows unusually high support for one candidate over the other, although you strongly expect the true proportions to be close to 0.5. In other words, this poll is likely a fluke and you expect the next poll’s results to be closer to even, i.e. to regress to the mean. So even if the p-value strongly rejects H0: theta=0.5, this p-value with small n may be less reliable evidence than a weaker p-value with large n. If you have a threshold for believable poll results (say, neither candidate is likely to have more than 60% of the vote in this swing state), you can set up a simple two-level model that’ll account for this.

Morris’ approach will reduce your number of false positives (the number of times you wrongly reject H0). But unlike corrections such as Bonferroni, which make confidence intervals wider, a multilevel model shrinks the poll estimates towards the believable 50% result. So you don’t get wider CIs, just more plausible point estimates, and yet you still avoid false positives. As Morris pointed out, “You’ll do better as a frequentist if you use Bayesian methods.”
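To make the shrinkage idea concrete, here’s a minimal sketch in Python. It assumes a conjugate Beta-Binomial two-level model, which is my own choice for illustration and not necessarily the model Morris presented. The function names, the “60% is about two prior standard deviations from 50%” rule, and the poll numbers are all made up for the example.

```python
def prior_strength(max_plausible_share: float, n_sd: float = 2.0) -> float:
    """Pseudo-sample size m of a symmetric Beta(m/2, m/2) prior on theta.

    Assumption (illustrative): max_plausible_share sits n_sd prior
    standard deviations above 0.5. The variance of Beta(m/2, m/2) is
    0.25 / (m + 1), so we solve that for m.
    """
    sd = (max_plausible_share - 0.5) / n_sd
    return 0.25 / sd**2 - 1.0


def shrunken_estimate(y: int, n: int, m: float) -> float:
    """Posterior mean of theta when y of n respondents back the candidate.

    Level 1: y ~ Binomial(n, theta); level 2: theta ~ Beta(m/2, m/2).
    The posterior mean (y + m/2) / (n + m) is a weighted average of the
    raw share y/n and the prior mean 0.5.
    """
    weight_on_prior = m / (n + m)
    return (1.0 - weight_on_prior) * (y / n) + weight_on_prior * 0.5


if __name__ == "__main__":
    m = prior_strength(0.60)  # hypothetical "believable" threshold of 60%
    # Hypothetical polls: a small one at 66% and a large one at 55%.
    for y, n in [(33, 50), (1100, 2000)]:
        print(f"n={n}: raw={y / n:.3f}, shrunken={shrunken_estimate(y, n, m):.3f}")
```

With these made-up numbers the prior is worth about 99 pseudo-respondents, so the small poll’s 66% gets pulled to roughly 55%, while the large poll’s 55% barely moves. That’s the RTTM adjustment in miniature: the less data behind an extreme result, the more it’s shrunk towards 50%.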

I’ll revisit this talk in detail in another post, especially since it ties in closely with small area estimation (my primary area of work these days). I’ll just end with a comment from Morris’ colleague Herman Chernoff, a diehard frequentist, who nonetheless says that if you don’t understand a problem from a Bayesian perspective then you don’t understand it well enough. Go team!

**“Everyone types the name of an object in R and watches all 20 billion lines scroll by at least once, right?”**

I’d known of Mark Hansen as creator of the Moveable Type installation at the New York Times offices, and also as the PhD advisor to Nathan Yau of FlowingData.com. But until this talk I hadn’t been aware of the breadth and depth of his work on “information performance,” or as he put it, “turning data into something poetic.”

Consider Moveable Type: it takes an intimidatingly-large skillset to put together the custom hardware, the software coding (“There’s mounds of Python flying in the hallway”), the linguistic parsing and statistical analysis… not to mention the design sense that makes this installation such a palpable, moving experience. Or the patience to assemble it where they did: “Writing and debugging code is painful enough; but doing it in a tower full of journalists who are trained to ask questions nonstop… whew!”

Nonetheless, Hansen’s still drawn to the newsroom crew, to the point that he’s recently left UCLA for the Columbia Journalism School where I believe he’ll be helping build a joint program in journalism and computer science. As he says, technology and coding have “flattened the terrain” of data visualization and brought in people from many different disciplines, and he’ll help newcomers tap into the aesthetics and storytelling here. Data’s not just there for confirmatory analysis (checking the facts on a hypothesis you already had), but can be explored; journalistic uncovering can play a role, and we as statisticians haven’t sold people on that yet. We need to bridge the gaps between the various datavis communities, and also recognize that the aesthetic component to storytelling is important: one example is the Shakespeare Machine, whose stunning effect required plenty of R analysis in the background but far more elegance in the physical implementation.

Hansen demonstrated several other nifty projects but I was most curious when he told us how he got a glance inside the eBay/PayPal “war room.” The layout of the display screens and dashboards, the architecture of who sits where, the social relationships among who’s in charge of what aspects — all these fascinating questions concerning design, ergonomics, anthropology, etc. reminded me of my old job at Ziba Design. I hadn’t thought before about how combining such experience with a statistician’s viewpoint on data management could be useful in designing war-rooms and command centers for decision-making. Hansen will actually be teaching a related course this spring, on newsrooms, inspired by “places that take ‘out there’ in here.” Students will visit air traffic control rooms & network operations centers; study the involvement from the view of architects, engineers, designers; and learn from that in order to design a better newsroom. I begged him to make this class available on Coursera or Udacity 🙂

I’d love to see that happen! If you know him personally, please keep asking.

Haha, I’ll do what I can 🙂 and definitely will post here if it does ever happen!