Statistics is Applied Science Fiction

I’m enjoying the discussion coming out of Alberto Cairo’s online data visualization course.

Bryn Williams, in a comment on thinkers & creators who read comics & sci-fi for inspiration:

“…a familiarity with imagined alternative worlds makes philosophy an easier path to tread when posing counterfactuals and thought experiments…”

My response:

And not just philosophy or data visualization — I think statistics could be presented as a kind of “applied science fiction.” When you perform a hypothesis test of whether some parameter is 0, you

  1. assume it *is* 0,
  2. imagine what kinds of data you would probably have seen under that assumption, and then
  3. if the real data you *did* see is unlikely under that assumption, decide that the assumption is probably wrong.

It’s just like in SF where

  1. you imagine a possible alternate reality (say, Joe discovers a talent for dowsing),
  2. you explore the consequences if that possibility were true (Joe becomes rich from oil prospecting), and
  3. in the best cases, readers can draw lessons about our actual reality from this thought experiment (http://xkcd.com/808/).

(XKCD is, of course, a great comic for both SF and datavis. See also this recent SMBC for another amusing exploration of “If this claim were true…”)
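
To make the statistical half of the analogy concrete, here’s a minimal sketch in R of steps 1–3 above, testing whether a mean is 0 by brute-force simulation (a t-test would be the usual tool, but simulation makes the imagined-worlds logic explicit):

# Steps 1-3 above, as a brute-force simulation
set.seed(42)
observed <- rnorm(30, mean = 0.5)  # the data we actually saw

# 1. Assume the parameter (here, the population mean) *is* 0.
# 2. Imagine the kinds of data we'd probably see under that assumption.
null_means <- replicate(10000, mean(rnorm(30, mean = 0)))

# 3. If the mean we actually saw is unlikely in those imagined worlds,
#    decide the "mean is 0" assumption is probably wrong.
p_value <- mean(abs(null_means) >= abs(mean(observed)))
p_value  # a small value casts doubt on the assumed reality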

Loess and Clark

Apologies for the awful pun in the title, but it seemed to befit an exploration of the history of loess (local regression), particularly its name and codebase.

If you’re not familiar with loess, it’s basically a nonparametric algorithm that smooths the data to find the local mean of y at each x value. If you want to end up with a more traditional regression, loess can still be a useful starting point for visually finding trends in the data. Earl Glynn shows a worked example with R code that illustrates the loess fit for different values of the bandwidth.
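
If you’d rather experiment yourself, here’s a quick sketch along the same lines, using base R’s loess() and the built-in cars dataset (a stand-in for Glynn’s data), overlaying fits for several values of span, loess’s bandwidth argument:

# Overlay loess fits with several bandwidths (the span argument)
plot(dist ~ speed, data = cars)
xs <- seq(min(cars$speed), max(cars$speed), length.out = 100)
for (sp in c(0.3, 0.6, 0.9)) {
  fit <- loess(dist ~ speed, data = cars, span = sp)
  lines(xs, predict(fit, data.frame(speed = xs)))
}
# Smaller spans chase local wiggles; larger spans smooth them away.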

Today was the first session of a Machine Learning study group with my colleagues. (We’re following along with Andrew Ng’s course notes for Stanford’s CS 229, also available on Coursera.) In the first chapter, Ng mentions loess regression, and two colleagues had interesting historical comments about it. Continue reading “Loess and Clark”

Animated map of 2012 US election campaigning, with R and ffmpeg

[Embedded video: https://vimeo.com/52312754]

(Video link here, in case the embedded player doesn’t work for you.)

Idea: see if I can mimic the approach behind Ben Schmidt’s lovely video of ocean shipping routes, and apply it to another dataset. But which?
“Hmm… what’s another interesting dataset about some competitors traveling around a mostly-fixed area at the same time?… Hey friends, stop giving me election news, I need to think of an idea… Oh.” Continue reading “Animated map of 2012 US election campaigning, with R and ffmpeg”
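
The general recipe, for the impatient: draw each frame with R, then stitch the frames together with ffmpeg. A generic sketch of the approach, not the post’s actual code:

# Draw each frame as a numbered .png ...
dir.create('frames', showWarnings = FALSE)
for (i in 1:100) {
  png(sprintf('frames/frame%03d.png', i), width = 800, height = 600)
  plot(rnorm(50), rnorm(50), main = paste('Frame', i))  # placeholder plot
  dev.off()
}
# ... then stitch them at 10 frames/sec (requires ffmpeg on your PATH)
system('ffmpeg -r 10 -i frames/frame%03d.png -pix_fmt yuv420p animation.mp4')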

JavaScript and D3 for R users, part 2: running off the R server instead of Python

Thank you all for the positive responses to Basics of JavaScript and D3 for R Users! Quick update: last time, we had to dabble in a tiny bit of Python to start a local web server in order to actually run the JavaScript and D3 examples on our own computers… However, commenter Shankar had the great idea of using the R server instead. He provided some example code, but reported that it didn’t work with all the examples.

Here’s my alternative code, which works with all the D3 examples I’ve tried so far. Rather than Shankar’s approach of using lower-level functions, I found it simpler to build on Jeffrey Horner’s excellent Rook package.

# Load the Rook library
library(Rook)

# Where is your d3 directory located?
myD3dir <- 'C:/Downloads'

# Start the server
s <- Rhttpd$new()
s$start(quiet=TRUE)

# To view a different D3 example,
# change the directory and .html file names below
# and rerun s$add() and s$browse()
s$add(
  app = Builder$new(
    Static$new(
      # List all the subdirectories that contain
      # any files the page will need (.js, .css, .html, etc.)
      urls = c('/d3', '/d3/examples', '/d3/examples/choropleth'),
      root = myD3dir
    ),
    Redirect$new('/d3/examples/choropleth/choropleth.html')
  ),
  name = 'd3'
)
s$browse(2)
# browse(1) would load the default RookTest app instead

# When you're done,
# clean up by stopping and removing the server
s$stop()
s$remove(all=TRUE)
rm(s)

If I understand the Rook documentation correctly, you simply can’t browse directories with R’s local server. So you’ll have to type in the exact directory and HTML file name for each example separately. But otherwise, this should be a simple way to play with D3 for anyone who’d rather stay within R instead of installing Python.
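
Since retyping those paths gets tedious, here’s a small convenience wrapper of my own (a sketch using the same Rook calls as above, assuming the server s and myD3dir are already set up) that bundles the s$add() and s$browse() steps:

# Re-register the 'd3' app for a different example and open the browser
viewD3 <- function(exampleDir, htmlFile) {
  s$add(
    app = Builder$new(
      Static$new(
        # Add more subdirectories here if an example needs them
        urls = c('/d3', '/d3/examples', exampleDir),
        root = myD3dir
      ),
      Redirect$new(paste(exampleDir, htmlFile, sep = '/'))
    ),
    name = 'd3'  # same name as above, so rerunning swaps the example
  )
  s$browse(2)  # assuming the 'd3' app is still #2 in the server's list
}

# For example:
# viewD3('/d3/examples/choropleth', 'choropleth.html')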

Basics of JavaScript and D3 for R Users

Hadley Wickham, creator of the ggplot2 R package, has been learning JavaScript and its D3 library for the next iteration of ggplot2 (tentatively titled r2d3?)… so I suspect it’s only a matter of time before he pulls the rest of the R community along.

Below are a few things that weren’t obvious when I first tried reading JavaScript code and the D3 library in particular. (Please comment if you notice any errors.) Then there’s also a quick walkthrough for getting D3 examples running locally on your computer, and finally a list of other tutorials & resources. In a future post, we’ll explore one of the D3 examples and practice tweaking it.

Perhaps these short notes will help other R users get started more quickly than I did. Even if you’re a ways away from writing complex JavaScript from scratch, it can still be useful to take one of the plentiful D3 examples and modify it for your own purposes. Continue reading “Basics of JavaScript and D3 for R Users”

Carl Morris Symposium on Large-Scale Data Inference (2/3)

Continuing the summary of last week’s symposium on statistics and data visualization (see part 1 and part 3)… Here I describe Dianne Cook’s discussion of visual inference, and Rob Kass’ talk on statistics in cognitive neuroscience.

[Edit: I’ve added a few more related links throughout the post.]

Continue reading “Carl Morris Symposium on Large-Scale Data Inference (2/3)”

Carl Morris Symposium on Large-Scale Data Inference (1/3)

I enjoyed this week’s Symposium on Large-Scale Data Inference, which honored Harvard’s Carl Morris as the keynote speaker. This was the 2nd such symposium; last year’s honoree was Brad Efron (whose new book I also recommend after seeing it at this event).

This year’s focus was the intersection of statistics and data visualization around the question, “Can we believe what we see?” I was seriously impressed by the variety and quality of the speakers & panelists — many thanks to Social & Scientific Systems for organizing! Look for the lecture videos to be posted online in January.

See below for the first two speakers, Carl Morris and Mark Hansen. The next posts will summarize talks by Di Cook and Rob Kass (part 2), and Chris Volinsky and the final panel discussion (part 3).

Continue reading “Carl Morris Symposium on Large-Scale Data Inference (1/3)”

Superheroes? Dataheroes!

Jake Porway of DataKind gave an inspiring talk comparing statisticians and data scientists to superheroes. Hear the story of how “the data scientists, statisticians, analysts were able to bend data to their will” and how these powers are being used for good or for awesome:

(Hat Tip: FlowingData.com)

Jake’s comment that “you have extraordinary powers that ordinary people don’t have” reminds me of Andrew Gelman’s suggestion that “The next book to write, I guess, should be called, not Amazing Numbercrunchers or Fabulous Stat-economists, but rather something like Statistics as Your Very Own Iron Man Suit.”

Jake mentioned several statistics / data science volunteering opportunities in his talk, starting with DataKind itself.

I also recommend Statistics Without Borders, with more of an international health focus. And if you’re here in Washington DC, Data Community DC and the related meetups are a great resource too.

Edit: Current students could also see if there is a Statistics in the Community (StatCom) Network branch at their university.

Statistics contests

Are you familiar with Kaggle? It’s a website for hosting online data-analysis contests, like smaller-scale versions of the Netflix Prize contest.

The U.S. Census Bureau is now hosting a Kaggle contest, asking statisticians and data scientists to help predict mail return rates on surveys and census forms (more info at census.gov and kaggle.com). The ability to predict return rates will help the Census Bureau target its outreach efforts and interview follow-up (phone calls and door-to-door interviews) more efficiently. So you could win a prize and make the government more efficient, all at the same time! 🙂 The contest ends on Nov 1st, so you still have 40 days to compete.

If you prefer making videos to crunching numbers, there’s also a video contest to promote 2013 as the International Year of Statistics. Help people see how statistics makes the world better, impacts current events, or gives you a fun career, and you may win a prize and be featured on their website all next year. There are special prizes for non-English-language videos and for entrants under 18 years old. Submissions are open until Oct 31st, just a day before the Census Challenge deadline.