Victoria Stodden on Reproducible Research

Yesterday’s department seminar was by Victoria Stodden [see slides from Nov 9, 2015]. Thanks to some great Q&A during the talk, we only made it through about half the slides.

Dr Stodden spoke about several kinds of reproducibility important to science, and their links to different “flavors” of science. As I understood it, there are

  • empirical reproducibility: are the methods (lab-bench protocol, psych-test questionnaire, etc.) available, so that we could repeat the experiment or data-collection?
  • computational reproducibility: are the code and data available, so that we could repeat the processing and calculations?
  • statistical reproducibility: was the sample large enough that we can expect comparable results if we do repeat the experiment and calculations? (A quick power-analysis sketch of this question follows the list.)
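
One concrete way to pose that last question is a power calculation: given the effect size you care about, was n big enough that a repeat of the experiment would likely reach the same conclusion? Here’s a minimal sketch using base R’s power.t.test; the effect size and settings are made up purely for illustration.

```r
# Hypothetical numbers: how many subjects per group give an 80% chance of
# detecting a medium effect (Cohen's d = 0.5) at the usual alpha = 0.05?
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80)

# Conversely: if the original study had only n = 20 per group, how often would
# an exact replication be expected to reach significance?
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
```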

Her focus is on the computational piece. As more and more research involves methodological contributions primarily in the software itself (and not explained in complete detail in the paper), it’s critical for that code to be open and reproducible.

Furthermore, there have historically been two kinds of scientific evidence, each with pretty-well-understood standards: Deductive (whose evidence is a mathematical or logical proof) and Empirical (requiring statistical evidence, including appropriate data collection and analysis). People are now claiming that Computational and/or Big-Data-Driven evidence forms a new third (or fourth) branch of science… but to treat it as real science, we’ll need clear-cut standards for this kind of evidence, comparable to the old standards of Deductive proof or Empirical experiment-and-statistical-analysis.

Apart from concerns about such community standards, there’s also the plain fact that it’s a pain to make progress using old, non-reproducible, poorly-documented code. Take for instance the Madagascar project. A professor found that his grad students were taking 2 years to become productive—it took that long to understand, re-run, and add to the previous student’s code. He started requiring that all his students turn in well-packaged, completely reproducible code + data, or else he wouldn’t approve their thesis. After this change, so the story goes, it took new students only 2 weeks instead of 2 years to begin productively building on past students’ work.
Stodden’s slide 26 cites several other systems that aim to help researchers collect, document, and disseminate their code & data, including her own project Research Compendia.
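
To make “well-packaged, completely reproducible code + data” a bit more concrete, here’s a minimal sketch of the idea in R: one top-level script that regenerates every figure and table from the raw inputs, with the random seed and package versions pinned down. This is my own illustration, not Madagascar’s actual tooling, and the file names are made up.

```r
# run_all.R -- hypothetical "one command regenerates everything" script

set.seed(20151110)   # fix the RNG so any stochastic steps repeat exactly

# 1. Data: a real project would read the archived raw data, e.g.
#    dat <- read.csv("data/raw_measurements.csv")
#    Here we simulate a stand-in dataset so the sketch runs on its own.
dat <- data.frame(x = rnorm(200))
dat$y <- 2 * dat$x + rnorm(200)

# 2. Analysis: every model lives in the script, not in someone's console history.
fit <- lm(y ~ x, data = dat)

# 3. Outputs: figures and tables are written to disk, never pasted in by hand.
dir.create("output", showWarnings = FALSE)
png("output/fit.png"); plot(y ~ x, data = dat); abline(fit); dev.off()
write.csv(coef(summary(fit)), "output/fit_coefficients.csv")

# 4. Provenance: record the exact R and package versions used.
writeLines(capture.output(sessionInfo()), "output/session_info.txt")
```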

But this level of reproducibility is still hard, time-intensive, and often unappreciated. Stodden described a study in which academics reported that “time to document and clean up” was the top barrier to sharing their code and data. If we could only change the incentive structure, more people would do it—just as people are incentivized to write up their findings as research papers, even though that takes a long time too.

Some journals are finally starting to encourage or even require the submission of code & data (with obvious exceptions, such as HIPAA-privacy-restricted medical data). Even if there are other edge cases besides private data (e.g. code that takes weeks to run, or can only run on a supercomputer), code-sharing would still have a positive impact on many publications. Also, the high-impact journal Science enacted new statistical requirements (to help with statistical reproducibility) and added statisticians to its board of editors in 2014. So there are signs of positive change on the way.

One more aspect of statistical reproducibility: how can we control for multiple testing, the file drawer problem, etc., if we don’t track all the tests and comparisons attempted during the analysis? A few tools exist to do this tracking automatically; they just aren’t widely used yet. See Stodden’s slide 26, under “Workflow Tracking and Research Environments.”
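
As a tiny illustration of why the tracking matters: a correction like Benjamini-Hochberg can only be applied to the p-values that actually got recorded. Here’s a minimal sketch with base R’s p.adjust, using simulated p-values as stand-ins for “every comparison we tried” (the numbers are invented for illustration).

```r
set.seed(1)
# Suppose we ran 50 exploratory comparisons: 45 truly null, 5 with a real effect.
p_null <- runif(45)                                            # null tests: uniform p-values
p_real <- replicate(5, t.test(rnorm(30, mean = 0.8))$p.value)  # tests with a real effect
p_all  <- c(p_null, p_real)

sum(p_all < 0.05)                           # naive count of "significant" hits
sum(p.adjust(p_all, method = "BH") < 0.05)  # honest count, correcting across all 50 tests

# The file-drawer version: correcting only the tests that "looked promising"
# shrinks the BH denominator, so the adjusted p-values come out smaller than
# they honestly should.
p_kept <- p_all[p_all < 0.2]
sum(p.adjust(p_kept, method = "BH") < 0.05)
```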

I also enjoyed having lunch with Dr Stodden. We discussed blogging and how hard it is for a perfectionist academic to write up a quick post… so it takes all day to write… so the posts are few and far between. (I’m trying to dash this post off quickly, to compensate!)
She also had interesting thoughts about the Statistics vs Data Science debate (are they different? does it even matter?). Instead of working in a statistics department, she’s in a school of Library and Information Science. In a way, this strikes me as a great place for Data Science. How to house, catalogue, filter, and search your giant streams of incoming data? How to build tools that’ll help users find what they need in the data efficiently? How to communicate with your audience? Some of those tools will draw on statistics or machine learning, but it’s not the same thing as developing statistical/ML theory.
Finally, while some statisticians feel “Oh, Data Science is just Statistics!”, as if Data Science is treading on our toes or trying to replace us… she said she’s heard exactly the same complaint from folks in Machine Learning, Databases, and other fields. Again, that suggests to me that it *isn’t* merely statistics under a new name, if other fields have the same concern about it 🙂 On the other hand, all these complaints do have some validity. When a newspaper headline gushes about the promising new field of Data Science, but the article content describes exactly what statisticians have been doing for years, it’s no surprise that we feel undervalued. I’m sure it happens to the Databases folks too.

Followup for myself:

  • Read the article she recommended: Gavish & Donoho, “Three Dream Applications of Verifiable Computational Results”
  • Think/ask about her claim that “divorce of data generation from data analysis” is now more common than the older paradigm of generating/collecting the data yourself. Are there really more researchers studying “found” datasets (say, Google engineers studying whatever their trawler happens to find) than ones generated by experiment or survey (biologists, psychologists, materials engineers, agriculturalists, pollsters, etc. in the lab or the field)?
  • At lunch she suggested making good use of Science’s code-sharing-requirements policy. Find a Science paper whose statistics content could use major improvement; ask the authors for their code + data, which the journal requires them to share; fix up the stats and publish this improvement. Seems like a nice way for statistics grad students to make an impact (and beef up their CV).
  • Some of her own SparseLab demos might be nicely adapted into R and turned into Shiny demos, hosted on a Shiny server or shinyapps.io … Maybe a good project for future Stat Computing or Dataviz students?
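
For that last idea, here’s a bare-bones sketch of what such a Shiny demo could look like. It’s a generic placeholder rather than a port of any actual SparseLab demo; the soft-thresholding toy is just there to show the structure (a slider wired to a plot).

```r
library(shiny)

# Toy demo: soft-threshold a noisy, mostly-zero signal and watch the estimate
# change as the slider moves. Placeholder only, not a real SparseLab example.
ui <- fluidPage(
  titlePanel("Sparse signal denoising (toy demo)"),
  sliderInput("lambda", "Soft-threshold level:", min = 0, max = 3, value = 1, step = 0.1),
  plotOutput("signalPlot")
)

server <- function(input, output) {
  set.seed(42)
  true_signal <- c(rep(0, 90), runif(10, 2, 4))[sample(100)]  # a few spikes among zeros
  noisy <- true_signal + rnorm(100)

  output$signalPlot <- renderPlot({
    est <- sign(noisy) * pmax(abs(noisy) - input$lambda, 0)    # soft thresholding
    plot(noisy, type = "h", col = "grey",
         main = "Grey: noisy signal   Blue: thresholded estimate")
    points(est, type = "h", col = "blue", lwd = 2)
  })
}

shinyApp(ui, server)
```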