Carl Morris Symposium on Large-Scale Data Inference (2/3)

Continuing the summary of last week’s symposium on statistics and data visualization (see part 1 and part 3)… Here I describe Dianne Cook’s discussion of visual inference, and Rob Kass’ talk on statistics in cognitive neuroscience.

[Edit: I’ve added a few more related links throughout the post.]

Continue reading “Carl Morris Symposium on Large-Scale Data Inference (2/3)”

Carl Morris Symposium on Large-Scale Data Inference (1/3)

I enjoyed this week’s Symposium on Large-Scale Data Inference, which honored Harvard’s Carl Morris as the keynote speaker. This was the 2nd such symposium; last year’s honoree was Brad Efron (whose new book I also recommend after seeing it at this event).

This year’s focus was the intersection of statistics and data visualization around the question, “Can we believe what we see?” I was seriously impressed by the variety and quality of the speakers & panelists — many thanks to Social & Scientific Systems for organizing! Look for the lecture videos to be posted online in January.

See below for the first two speakers, Carl Morris and Mark Hansen. The next posts will summarize talks by Di Cook and Rob Kass (part 2), and Chris Volinsky and the final panel discussion (part 3).

Continue reading “Carl Morris Symposium on Large-Scale Data Inference (1/3)”

USGS mapping suggestions

Geology, to put it bluntly, rocks. Where else can you talk about cleavage, bedding attitudes, discharge, and thrust faults with a straight face?

Anyhow, the United States Geological Survey (USGS) has a nice document of Suggestions To Authors of their technical reports and maps. In particular, the chapter on “Preparing maps and other illustrations” seems to be a good reference on maps for those of us without much formal cartography/geography training. For example, there are good tips on index maps (little inset maps showing the context around the bigger map, or pointing out where your study took place). The overall focus is naturally on geological maps, but much of the advice applies to other kinds of maps and visualizations too.

This is the 7th edition from 1991, so perhaps it’s due for an update, but the advice still seems solid. I’d also love to find the 1st edition from 1909 and see how much the guide has changed.

Superheroes? Dataheroes!

Jake Porway of DataKind gave an inspiring talk comparing statisticians and data scientists to superheroes. Hear the story of how “the data scientists, statisticians, analysts were able to bend data to their will” and how these powers are being used for good or for awesome:

(Hat Tip: FlowingData.com)

Jake’s comment that “you have extraordinary powers that ordinary people don’t have” reminds me of Andrew Gelman’s suggestion that “The next book to write, I guess, should be called, not Amazing Numbercrunchers or Fabulous Stat-economists, but rather something like Statistics as Your Very Own Iron Man Suit.”

Links to the statistics / data science volunteering opportunities Jake mentioned:

I also recommend Statistics Without Borders, with more of an international health focus. And if you’re here in Washington DC, Data Community DC and the related meetups are a great resource too.

Edit: Current students could also see if there is a Statistics in the Community (StatCom) Network branch at their university.

Compensating for different spatial abilities (feat. cyborgs!)

In July, I saw Iowa State’s Dr. Sarah Nusser give a presentation about spatial ability among survey field representatives and how different people interact with various geospatial technologies. This talk introduced an area of research quite new to me, and it reminded me how important it is to know your audience before designing products for them. It also touched on directly augmenting our sensory perception (more about that below).

When you hire people to collect survey data in the field (verify addresses, conduct interviews, assess land cover type, etc.), you hope they’ll be able to find their way to the sites where you’re sending them. But new hires might come in with various levels of skill or experience, as well as different mental models for maps and geography. Dr. Nusser’s work [here’s a representative article] frames this as “spatial ability” and, practically speaking, treats it as innate: rather than training adults to improve their spatial ability, she focuses on technology and interfaces that help them work better with the mental model they already have. (I can’t believe that spatial ability really is innate and static… but it’s probably cheaper to design a few user-targeted interfaces once than to train new hires indefinitely.)

How do you tell if someone has high or low spatial ability (high SA vs low SA)? One approach is the Paper Folding Test and related tests produced by the Educational Testing Service.

Where will the holes be when the paper is unfolded?

Continue reading “Compensating for different spatial abilities (feat. cyborgs!)”

The Grammar of Graphics: notes on first reading

Leland Wilkinson’s The Grammar of Graphics is a classic in the data visualization literature. Wilkinson created a framework that coherently ties together many aspects of designing, implementing, reading, and understanding a graphic. It’s a useful approach and has been fairly influential: The popular R package ggplot2 is, more or less, an implementation of Wilkinson’s ideas, and I also see their influence in the software Tableau (about which more another time). Wilkinson himself helped to build these ideas into SPSS’s Graphics Production Language (GPL) and then SPSS Visualization Designer.

So what’s so special here? One of the core ideas is to start with the raw data and think about all the transformations, summaries, etc. that go into graphing it. With a good framework, this can help us see connections between different graphs and create new ones. (The opposite extreme is a “typology” or list of graph types, like you get in Excel: do you want a bar chart, a pie chart, a line chart, or one of these other 10 types? Such a list has no deep structure.) Following Wilkinson’s approach, you’ll realize that a pie chart is basically just a stacked bar chart plotted in polar coordinates, with bar height mapped to pie-slice angle… and that can get you thinking: What if I mapped bar height to radius, not angle? What if I added a variable and moved to spherical coordinates? What if I put a scatterplot in polar coordinates too? These may turn out to be bad ideas, but at least you’re thinking — in a way that is not encouraged by Excel’s list of 10 graph types.

This is NOT the approach that Wilkinson takes.
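To make the pie-chart observation concrete, here is a minimal sketch of my own (not code from Wilkinson’s book or from ggplot2). It treats a pie chart as one stacked bar whose segment positions are rescaled to angles. The function names and the toy counts are hypothetical, chosen only for illustration.

```python
import math

def stacked_bar_segments(counts):
    """Return (label, start, end) for each segment of a single stacked bar."""
    segments = []
    running = 0.0
    for label, value in counts.items():
        segments.append((label, running, running + value))
        running += value
    return segments

def to_pie_slices(segments):
    """Map bar height to angle by rescaling positions onto [0, 2*pi]."""
    total = segments[-1][2]  # top of the stacked bar
    return [
        (label, 2 * math.pi * start / total, 2 * math.pi * end / total)
        for label, start, end in segments
    ]

# Toy data (hypothetical): three categories stacked into one bar.
counts = {"A": 1, "B": 1, "C": 2}
slices = to_pie_slices(stacked_bar_segments(counts))
for label, theta_start, theta_end in slices:
    print(label, round(math.degrees(theta_start)), round(math.degrees(theta_end)))
```

The data and the stacking step are the same for both charts; only the final coordinate mapping differs. Swapping `to_pie_slices` for a function that maps height to radius would give one of the “what if” variants mentioned above.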

But, of course, thinking is hard, and so is this book. Reading The Grammar of Graphics requires much more of a dedicated slog than, say, Edward Tufte’s books, which you can just flip through randomly for inspiration and bite-sized nuggets of wisdom. (I admire Tufte too, but I have to admit that Wilkinson’s occasional jabs at Tufte were spot-on and amused me to no end.) It’s a book full of wit and great ideas, but also full of drawn-out sections that require serious focus, and it takes a while to digest it all and put it together in your mind.

So, although I’d highly recommend this book to anyone deeply interested in visualization, I’m still digesting it. What follows is not a review but just notes-to-self from my first read-through: things to follow up on and for my own reference. It might not be particularly thrilling for other readers.

Continue reading “The Grammar of Graphics: notes on first reading”

Commandeering a map from PDF or EPS, using Inkscape and R

I love Nathan Yau’s tutorial on making choropleths from an SVG file. However, if you don’t have an SVG handy already and instead want to repurpose a map from another vector format such as PDF or EPS, there are a few extra steps that can be done in the free tool Inkscape. And while I’m at it, how could I turn down the opportunity to replicate Nathan’s Python approach in R instead?

The following was inspired by the 300-page Census Atlas of the United States, full of beautiful maps of 2000 decennial census data. I particularly liked the small multiples of state maps, which were highly generalized (i.e. the fine detail was smoothed out) but still recognizable, with DC enlarged enough to see.

I have wanted a map like this for my own purposes, when mapping a variable for all 50 states and DC. Unfortunately, I haven’t been able to track down any colleagues who know where to find the original shapefiles for this map. Fortunately, several images from the Census Atlas are available in EPS format near the bottom of this page, under “PostScript Map Files.” With access to such vector graphics, we can get started.

Continue reading “Commandeering a map from PDF or EPS, using Inkscape and R”

Making R graphics legible in presentation slides

I only visited a few JSM sessions today, as I’ve been focused on preparing for my own talk tomorrow morning. However, I went to several talks in a row which all had a common problem that made me cringe: graphics where the fonts (titles, axes, labels) are too small to read.

You used R's default settings when putting this graph in your slides? Too bad I won't be able to read it from anywhere but the front of the room.

Dear colleagues: if we’re going to the effort of analyzing our data carefully and creating a lovely graph, in R or otherwise, to convey our results in a slideshow, let’s PLEASE save our graphs in such a way that the text is legible on the slides! If the audience has to strain to read your graphics, they’re no easier to digest than a slide with dense equations or massive tables of numbers.

For those of us working in R, here are some very quick suggestions that would help me focus on the content of your graphics, not on how hard I’m squinting to read them.

Continue reading “Making R graphics legible in presentation slides”

U.S. Census Bureau releases API

The Census API, which was in the works for a while, was finally made publicly available yesterday (news release).

I’ve heard the DC dating scene is tough for single women… But especially for centenarians!

So far, two datasets are accessible:

  • the 2010 Census Summary File 1, providing counts down to the tract and block levels
  • the 2006-2010 American Community Survey five-year estimates, providing estimates down to the tract and block-group levels (but not all the way down to blocks)
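As a rough illustration of what a request for the first dataset might look like, here is a small sketch that only builds a query URL and makes no network call. The endpoint path, the variable code `P001001` (assumed to be total population), the `state:11` geography (DC’s FIPS code), and the optional `key` parameter are all assumptions on my part and should be checked against the developers page before use.

```python
from urllib.parse import urlencode

# Assumed endpoint for the 2010 Census Summary File 1; verify on the developers page.
BASE_URL = "https://api.census.gov/data/2010/dec/sf1"

def build_query(variables, geography, key=None):
    """Build a query URL from variable codes and a geography clause."""
    params = {"get": ",".join(variables), "for": geography}
    if key is not None:
        params["key"] = key  # hypothetical API key supplied by the caller
    # Keep commas, colons, and asterisks unescaped so the URL stays readable.
    return BASE_URL + "?" + urlencode(params, safe=",:*")

# Assumed variable code for total population, requested for DC (FIPS 11).
url = build_query(["P001001", "NAME"], "state:11")
print(url)
```

Passing the resulting URL to any HTTP client should return a small table of values, assuming the endpoint and variable codes above are right.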

The developers page provides more information and showcases a couple of the first apps to use the API, including one by Cornell’s Jan Vink (whose online poverty maps I’ve mentioned before).

For a handy list of the other government agencies with APIs and developers pages, check out the FCC’s developers page.

useR 2012: main conference braindump

I knew R was versatile, but DANG, people do a lot with it:

> > … I don’t think anyone actually believes that R is designed to make *everyone* happy. For me, R does about 99% of the things I need to do, but sadly, when I need to order a pizza, I still have to pick up the telephone. —Roger Peng

> There are several chains of pizzerias in the U.S. that provide for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it’s only a matter of time before you will have a pizza-ordering function available. —Doug Bates

Indeed, the GraphApp toolkit … provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). —Brian Ripley

So, heads up: the following post is super long, given how much R was covered at the conference. Much of this is a “notes-to-self” braindump of topics I’d like to follow up on. I’m writing up the invited talks, the presentation and poster sessions, and a few other notes. The conference program has links to all the abstracts, and the main website should collect most of the slides eventually.

Continue reading “useR 2012: main conference braindump”