Superheroes? Dataheroes!

Jake Porway of DataKind gave an inspiring talk comparing statisticians and data scientists to superheroes. Hear the story of how “the data scientists, statisticians, analysts were able to bend data to their will” and how these powers are being used for good or for awesome:

(Hat Tip: FlowingData.com)

Jake’s comment that “you have extraordinary powers that ordinary people don’t have” reminds me of Andrew Gelman’s suggestion that “The next book to write, I guess, should be called, not Amazing Numbercrunchers or Fabulous Stat-economists, but rather something like Statistics as Your Very Own Iron Man Suit.”

Links to the statistics / data science volunteering opportunities Jake mentioned:

I also recommend Statistics Without Borders, with more of an international health focus. And if you’re here in Washington DC, Data Community DC and the related meetups are a great resource too.

Edit: Current students could also see if there is a Statistics in the Community (StatCom) Network branch at their university.

Statistics contests

Are you familiar with Kaggle? It’s a website for hosting online data-analysis contests, like smaller-scale versions of the Netflix Prize contest.
The U.S. Census Bureau is now hosting a Kaggle contest, asking statisticians and data scientists to help predict mail return rates on surveys and census forms (more info at census.gov and kaggle.com). The ability to predict return rates will help the Census Bureau target its outreach efforts and interview follow-up (phone calls and door-to-door interviews) more efficiently. So you could win a prize and make the government more efficient, all at the same time! 🙂
The contest ends on Nov 1st, so you still have 40 days to compete.

If you prefer making videos to crunching numbers, there’s also a video contest to promote the International Year of Statistics for 2013. Help people see how statistics makes the world better, impacts current events, or gives you a fun career, and you may win a prize and be featured on their website all next year. There are special prizes for non-English-language videos and for entrants under 18 years old.
Submissions are open until Oct 31st, just a day before the Census Challenge.


The Grammar of Graphics: notes on first reading

Leland Wilkinson’s The Grammar of Graphics is a classic in the data visualization literature. Wilkinson created a framework that coherently ties together many aspects of designing, implementing, reading, and understanding a graphic. It’s a useful approach and has been fairly influential: The popular R package ggplot2 is, more or less, an implementation of Wilkinson’s ideas, and I also see their influence in the software Tableau (about which more another time). Wilkinson himself helped to build these ideas into SPSS’s Graphics Production Language (GPL) and then SPSS Visualization Designer.

So what’s so special here? One of the core ideas is to start with the raw data and think about all the transformations, summaries, etc. that go into graphing it. With a good framework, this can help us see connections between different graphs and create new ones. (The opposite extreme is a “typology” or list of graph types, like you get in Excel: do you want a bar chart, a pie chart, a line chart, or one of these other 10 types? Such a list has no deep structure.) Following Wilkinson’s approach, you’ll realize that a pie chart is basically just a stacked bar chart plotted in polar coordinates, with bar height mapped to pie-slice angle… and that can get you thinking: What if I mapped bar height to radius, not angle? What if I added a variable and moved to spherical coordinates? What if I put a scatterplot in polar coordinates too? These may turn out to be bad ideas, but at least you’re thinking — in a way that is not encouraged by Excel’s list of 10 graph types.

This is NOT the approach that Wilkinson takes.
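To make the polar-coordinates point concrete, here’s a minimal ggplot2 sketch of my own (toy data, not an example from the book): one stacked bar becomes a pie or a bullseye depending on which aesthetic gets mapped to the angle.

```r
library(ggplot2)

# Toy data: three categories with made-up counts
df <- data.frame(category = c("A", "B", "C"), count = c(30, 50, 20))

# A single stacked bar: count is mapped to bar height (y)
p <- ggplot(df, aes(x = "", y = count, fill = category)) +
  geom_bar(stat = "identity", width = 1)

p                              # a plain stacked bar chart
p + coord_polar(theta = "y")   # bar height becomes angle: a pie chart
p + coord_polar(theta = "x")   # map the other axis to angle: a bullseye chart
```

Seeing the pie fall out of a one-line coordinate change is exactly the kind of connection the grammar encourages.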

But, of course, thinking is hard, and so is this book. Reading The Grammar of Graphics requires much more of a dedicated slog than, say, Edward Tufte’s books, which you can just flip through randomly for inspiration and bite-sized nuggets of wisdom. (I admire Tufte too, but I have to admit that Wilkinson’s occasional jabs at Tufte were spot-on and amused me to no end.) It’s a book full of wit and great ideas, but also full of drawn-out sections that require serious focus, and it takes a while to digest it all and put it together in your mind.

So, although I’d highly recommend this book to anyone deeply interested in visualization, I’m still digesting it. What follows is not a review but just notes-to-self from my first read-through: things to follow up on and for my own reference. It might not be particularly thrilling for other readers. Continue reading “The Grammar of Graphics: notes on first reading”

Commandeering a map from PDF or EPS, using Inkscape and R

I love Nathan Yau’s tutorial on making choropleths from an SVG file. However, if you don’t have an SVG handy already and instead want to repurpose a map from another vector format such as PDF or EPS, there are a few extra steps you can take in the free tool Inkscape. And while I’m at it, how could I turn down the opportunity to replicate Nathan’s Python approach in R instead?

The following was inspired by the 300-page Census Atlas of the United States, full of beautiful maps of 2000 decennial census data. I particularly liked the small multiples of state maps, which were highly generalized (i.e. the fine detail was smoothed out) but still recognizable, and DC was enlarged to be big enough to see.

I have wanted a map like this for my own purposes, for mapping a variable across all 50 states and DC. Unfortunately, I haven’t been able to track down any colleagues who know where to find the original shapefiles for this map. Fortunately, several images from the Census Atlas are available in EPS format near the bottom of this page, under “PostScript Map Files.” With access to such vector graphics, we can get started.
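To give a flavor of where this is headed, here’s a rough sketch of the R half of the workflow, once Inkscape has saved the EPS map as a plain SVG. The file names, the assumption that each state’s path id is its two-letter abbreviation, and the colors are all hypothetical placeholders; the rest of the post walks through the real details.

```r
library(XML)  # parse and edit the SVG

# Hypothetical file exported from Inkscape
svg <- xmlParse("states_from_atlas.svg")

# Grab every <path> element, regardless of namespace
paths <- getNodeSet(svg, "//*[local-name()='path']")

# Pretend each path's id is a state abbreviation, and pick placeholder fills
state_colors <- setNames(colorRampPalette(c("white", "darkblue"))(51),
                         c(state.abb, "DC"))

for (p in paths) {
  id <- xmlGetAttr(p, "id")
  if (!is.null(id) && id %in% names(state_colors)) {
    # Overwrite the style so the fill reflects our (fake) data
    addAttributes(p, style = paste0("fill:", state_colors[id]))
  }
}

saveXML(svg, file = "states_colored.svg")
```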

Continue reading “Commandeering a map from PDF or EPS, using Inkscape and R”

Making R graphics legible in presentation slides

I only visited a few JSM sessions today, as I’ve been focused on preparing for my own talk tomorrow morning. However, I went to several talks in a row that all had a common problem that made me cringe: graphics where the fonts (titles, axes, labels) are too small to read.

You used R's default settings when putting this graph in your slides? Too bad I won't be able to read it from anywhere but the front of the room.

Dear colleagues: if we’re going to the effort of analyzing our data carefully, and creating a lovely graph in R or otherwise to convey our results in a slideshow, let’s PLEASE save our graphs in such a way that the text is legible on the slides! If the audience has to strain to read your graphics, they’re no easier to digest than a slide of dense equations or massive tables of numbers.

For those of us working in R, here are some very quick suggestions that would help me focus on the content of your graphics, not on how hard I’m squinting to read them.
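As a preview (not the only options, just the kind of thing I have in mind): save the graph at roughly the size it will occupy on the slide and scale the text up, either via par() settings in base graphics or via the theme’s base font size in ggplot2. The file names and sizes below are placeholders.

```r
# Base graphics: save at roughly slide size and bump up all the text
png("slide_plot.png", width = 10, height = 6, units = "in", res = 300)
par(cex.lab = 1.6, cex.axis = 1.4, cex.main = 1.8)
plot(mpg ~ wt, data = mtcars,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon",
     main = "Fuel efficiency vs. weight")
dev.off()

# ggplot2: set a large base font size in the theme before saving
library(ggplot2)
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  theme_grey(base_size = 24)
ggsave("slide_plot_gg.png", p, width = 10, height = 6, dpi = 300)
```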

Continue reading “Making R graphics legible in presentation slides”

JSM 2012: Sunday

Greetings from lovely San Diego, CA, site of this year’s Joint Statistical Meetings. I can’t believe it’s already been a year since I was inspired to start blogging during the JSM in Miami!

If you’re keeping tabs on this year’s conference, there’s a fair amount of #JSM2012 activity on Twitter. Sadly, I haven’t seen any recent posts on The Statistics Forum, which blogged JSM so actively last year.

Yesterday’s Dilbert cartoon was also particularly fitting for the start of JSM, with its focus on big data 🙂

Continue reading “JSM 2012: Sunday”

U.S. Census Bureau releases API

The Census API, which was in the works for a while, was finally made publicly available yesterday (news release).

I’ve heard the DC dating scene is tough for single women… But especially for centenarians!

So far, two datasets are accessible:

  • the 2010 Census Summary File 1, providing counts down to the tract and block levels
  • the 2006-2010 American Community Survey five-year estimates, providing estimates down to the tract and block-group levels (but not all the way down to blocks)
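For the curious, here’s a minimal query sketch in R. The endpoint format and the variable name (P0010001, total population in Summary File 1) are my reading of the developers page, and the key is a placeholder you’d request from the Census Bureau; check the developers page for the authoritative details.

```r
library(RCurl)     # fetch the URL
library(RJSONIO)   # parse the JSON response

key <- "YOUR_API_KEY"  # placeholder: request a real key from the Census Bureau
url <- paste0("http://api.census.gov/data/2010/sf1",
              "?get=P0010001&for=state:*&key=", key)

# The response is a JSON array of rows; the first row is the header
parsed <- fromJSON(getURL(url), simplify = FALSE)

# Reshape into a data frame of total population by state
pop <- as.data.frame(do.call(rbind, lapply(parsed[-1], unlist)),
                     stringsAsFactors = FALSE)
names(pop) <- unlist(parsed[[1]])
pop$P0010001 <- as.numeric(pop$P0010001)
head(pop)
```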

The developers page provides more information and showcases a few of the first apps built on the API, including one by Cornell’s Jan Vink (whose online poverty maps I’ve mentioned before).

For a handy list of the other government agencies with APIs and developers pages, check out the FCC’s developers page.

useR 2012: main conference braindump

I knew R was versatile, but DANG, people do a lot with it:

> … I don’t think anyone actually believes that R is designed to make *everyone* happy. For me, R does about 99% of the things I need to do, but sadly, when I need to order a pizza, I still have to pick up the telephone. —Roger Peng

> There are several chains of pizzerias in the U.S. that provide for Internet-based ordering (e.g. www.papajohnsonline.com) so, with the Internet modules in R, it’s only a matter of time before you will have a pizza-ordering function available. —Doug Bates

> Indeed, the GraphApp toolkit … provides one (for use in Sydney, Australia, we presume as that is where the GraphApp author hails from). —Brian Ripley

So, heads up: the following post is super long, given how much R was covered at the conference. Much of this is a “notes-to-self” braindump of topics I’d like to follow up with further. I’m writing up the invited talks, the presentation and poster sessions, and a few other notes. The conference program has links to all the abstracts, and the main website should collect most of the slides eventually.

Continue reading “useR 2012: main conference braindump”

Maps of changes in area boundaries, with R

Today a coworker needed some maps showing boundary changes. I used what I learned last week in the useR 2012 geospatial data course to make a few simple maps in R, overlaid on OpenStreetMap tiles. I’m posting my maps and my R code in case others find them useful.

A change in Census block-groups from 2000 to 2010, in Mobile, AL
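The full code is in the rest of the post; as a taste, the overlay boils down to something like the sketch below (one way to do it, not necessarily the post’s exact approach). The shapefile paths and layer names are made-up placeholders, and the bounding box is a rough guess at the Mobile area.

```r
library(OpenStreetMap)  # fetches and plots OSM tiles
library(rgdal)          # readOGR() and spTransform()

# Hypothetical shapefiles of 2000 and 2010 block-group boundaries
bg2000 <- readOGR("shapefiles", "bg2000_mobile")
bg2010 <- readOGR("shapefiles", "bg2010_mobile")

# Fetch tiles for a rough bounding box around Mobile, AL
# (upper-left and lower-right corners given as lat/lon)
basemap <- openmap(c(30.8, -88.3), c(30.5, -87.9), type = "osm")

# Tiles arrive in spherical-Mercator coordinates, so reproject the
# boundaries to match before overlaying them
bg2000_m <- spTransform(bg2000, osm())
bg2010_m <- spTransform(bg2010, osm())

plot(basemap)
plot(bg2000_m, border = "blue", add = TRUE)
plot(bg2010_m, border = "red", add = TRUE)
legend("topright", legend = c("2000 block-groups", "2010 block-groups"),
       col = c("blue", "red"), lty = 1)
```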

Continue reading “Maps of changes in area boundaries, with R”

useR 2012: impressions, tutorials

First of all, useR 2012 (the 8th International R User Conference) was, hands down, the best-organized conference I’ve had the luck to attend. The session chairs kept everything moving on time, tactfully but sternly; the catering was delicious and varied; and Vanderbilt University’s leafy green campus and comfortable facilities were an excellent setting. Many thanks to Frank Harrell and the rest of Vanderbilt’s biostatistics department for hosting!

Plus there's a giant statue of bacon. What's not to love?

Continue reading “useR 2012: impressions, tutorials”