Random forests are very popular models in machine learning and data science for prediction tasks. They often have great empirical performance when all you need is a black-box algorithm, as in many Kaggle competitions. On the other hand, RFs are less commonly used for estimation tasks, because historically we could not do well at computing confidence intervals or hypothesis tests: there was no good understanding of RFs’ statistical properties, nor good estimators of variance (needed for confidence intervals). Until now.

Wager has written several papers on the statistical properties of random forests. He also has made code available for computing pointwise confidence intervals. (Confidence bands, for the whole RF-estimated regression function at once, have not been developed yet.)

Wager gave concrete examples of when this can be useful, for instance in personalized medicine. You don’t always want just point-estimate predictions for how a patient will respond to a certain treatment. Often you want some margin of error too, so you can decide on the treatment that’s most likely to help. That is, you’d like to avoid a treatment with a positive estimate but a margin of error so big that we’re not sure it helps (it might actually be harmful).

It’s great to see such work on statistical properties of (traditionally) black-box models. In general, it’s an exciting (if challenging) problem to figure out properties and estimate MOEs for such ML-flavored algorithms. Some data science or applied ML folks like to deride high-falutin’ theoretical statisticians, as did Breiman himself (the originator of random forests)… But work like Wager’s is very practical, not merely theoretically interesting. We need more of this, not less.

PS—One other nifty idea from his talk, something I hadn’t seen before: In the usual k-nearest-neighbor algorithm, you pick a target point where you want to make a prediction, then use Euclidean distance to find the k closest neighbors in the training data. Wager showed examples where it works better to train a random forest first, then use “number of trees where this data point is in the same leaf as the target point” as your distance. That is, choose as “neighbors” any points that tend to land in the same leaf as your target, regardless of their Euclidean distance. The results seem more stable than usual kNN. New predictions may be faster to compute too.
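That forest-proximity idea can be sketched with the `randomForest` package (my own toy example, not Wager’s code; the `proximity = TRUE` argument records how often pairs of points share a terminal leaf):

```r
library(randomForest)

set.seed(1)
x <- matrix(rnorm(200 * 5), ncol = 5)   # toy training data
y <- x[, 1] + rnorm(200)

# proximity[i, j] = fraction of trees in which points i and j
# land in the same terminal leaf
rf <- randomForest(x, y, proximity = TRUE)

# "Neighbors" of point 1 = the k points that share its leaf most often,
# regardless of their Euclidean distance to point 1
k <- 5
prox <- rf$proximity[1, ]
prox[1] <- -Inf                          # exclude the point itself
neighbors <- order(prox, decreasing = TRUE)[1:k]
```

For a new target point you would then predict with, say, the mean of `y[neighbors]`, just as in ordinary kNN but with leaf co-occurrence in place of Euclidean distance.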

Followup for myself:

- Ryan Tibshirani asked about using shrinkage together with random forests. I can imagine small area estimators that shrink towards a CART or random forest prediction instead of a usual regression, but Ryan sounded more like he had lasso or ridge penalties in mind. Does anyone do either of these?
- Trees and forests can only split perpendicular to the variables, but sometimes you might have “rotated” structure (i.e. interesting clustering splits are diagonal in the predictor space). So, do people ever find it useful to do PCA first, and *then* CART or RFs? Maybe even using all the PCs, so that you’re not doing it for dimension reduction, just for the sake of rotation? Or maybe some kind of sparse PCA variant where you only rotate certain variables that need it, but leave the others alone (unrotated) when you run CART or RFs?
- The “infinitesimal jackknife” sounded like a nifty proof technique, but I didn’t catch all the details. Read up more on this.

He brings up great nuggets of stats history, including insights into the history and challenges of Big Data. I also want to read his recommended books, especially Fisher’s *Design of Experiments* and Raiffa & Schlaifer’s *Applied Statistical Decision Theory*. But my favorite part was about involving intro stats students in data collection:

One of the things I’ve been able to do is teach a freshman seminar every once in a while. In 1990, I did it as a class in a very ad hoc way and then again in 2000, and again in 2010, I taught small freshman seminars on the census. Those were the census years, so I would bring real data into the classroom which we would discuss. One of the nice things about working on those seminars is that, because I personally knew many of the Census Directors, I was able to bring many of them to class as my guests. It was great fun and it really changes how students think about what they do. In 1990, we signed all students up as census enumerators and they did a shelter and homeless night and had to come back and describe their experiences and share them. That doesn’t sound like it should belong in a stat class but I can take you around here at JSM and introduce you to people who were in those classes and they’ve become statisticians!

What a great teaching idea! It reminds me of discussions in an anthropology class I took, where we learned about participant observation and communities of practice. Instead of just standing in a lecture hall talking about statistics, we’d do well to expose students to real-life statistical work “in the field”—not just analysis, but data collection too. I still feel strongly that data collection/generation is the heart of statistics (while data analysis is just icing on the cake), and Steve’s seminar is a great way to hammer that home.

Dr Stodden spoke about several kinds of reproducibility important to science, and their links to different “flavors” of science. As I understood it, there are

- empirical reproducibility: are the methods (lab-bench protocol, psych-test questionnaire, etc.) available, so that we could repeat the experiment or data-collection?
- computational reproducibility: are the code and data available, so that we could repeat the processing and calculations?
- statistical reproducibility: was the sample large enough that we can expect to get comparable results, if we do repeat the experiment and calculations?

Her focus is on the computational piece. As more and more research involves methodological contributions primarily in the software itself (and not explained in complete detail in the paper), it’s critical for that code to be open and reproducible.

Furthermore, historically there have been two kinds of scientific evidence, with pretty-well-understood standards: Deductive (whose evidence is a mathematical or logical proof), and Empirical (requiring statistical evidence incl. appropriate data collection and analysis). People are now claiming that Computational and/or Big-Data-Driven evidence are new, third/fourth branches of science… but to treat it as a real science, we’ll need clear-cut standards for this kind of evidence, comparable to the old standards of Deductive proof or Empirical experiment-and-statistical-analysis.

Apart from concerns about such community standards, there’s also the plain fact that it’s a pain to make progress using old, non-reproducible, poorly-documented code. Take for instance the Madagascar project. A professor found that his grad students were taking 2 years to become productive—it took that long to understand, re-run, and add to the previous student’s code. He started requiring that all his students turn in well-packaged, completely reproducible code + data, or else he wouldn’t approve their thesis. After this change, so the story goes, it took new students only 2 weeks instead of 2 years to begin productively building on past students’ work.

Stodden’s slide 26 cites several other systems that aim to help researchers collect, document, and disseminate their code & data, including her own project Research Compendia.

But intense reproducibility is still hard, time-intensive, and often unappreciated. Stodden described a study in which academics reported that “time to document and clean up” was the top barrier to sharing their code and data. If we could only change the incentive structure, more people would do it—just as people are incentivized to write up their findings as research papers, even though that takes a long time too.

Some journals are finally starting to encourage or even require the submission of code & data (with obvious exceptions such as HIPAA-privacy-restricted medical data). Even if there are other edge cases besides private data (e.g. code that takes weeks to run, or can only run on a supercomputer), code-sharing would still have a positive impact on many publications. Also, the high-impact journal Science enacted new statistical requirements (to help with statistical reproducibility) and added statisticians to its board of editors in 2014. So there are signs of positive change on the way.

One more aspect of statistical reproducibility: How can we control for multiple testing, the file drawer problem, etc. if we don’t track all the tests and comparisons we attempted during the analysis? No software for doing this automatically is in wide use yet, though a few such tools do exist. See Stodden’s slide 26, under “Workflow Tracking and Research Environments.”

Finally, I also enjoyed having lunch with Dr Stodden. We discussed blogging and how hard it is for a perfectionist academic to write up a quick post… so it takes all day to write… so the posts are few and far between. (I’m trying to dash this post off quickly, to compensate!)

She also had interesting thoughts about the Statistics vs Data Science debate (are they different? does it even matter?). Instead of working in a statistics department, she’s in a school of Library and Information Science. In a way, this strikes me as a great place for Data Science. How to house, catalogue, filter, and search your giant streams of incoming data? How to build tools that’ll help users find what they need in the data efficiently? How to communicate with your audience? Some of those tools will draw on statistics or machine learning, but it’s not the same thing as developing statistical/ML theory.

Finally, while some statisticians feel “Oh, Data Science is just Statistics!” as if Data Science is treading on our toes or trying to replace us… she said she’s heard exactly the same complaint from folks in Machine Learning, Databases, and other fields. Again that suggests to me that it *isn’t* merely statistics under a new name, if other fields have the same concern about it. On the other hand, all these complaints do have some validity. When a newspaper headline gushes about the promising new field of Data Science, but the article content describes exactly what statisticians have done for years, it’s no surprise that we feel undervalued. I’m sure it happens to the Databases folks too.

Followup for myself:

- Read the article she recommended: Gavish & Donoho, “Three Dream Applications of Verifiable Computational Results”
- Think/ask about her claim that “divorce of data generation from data analysis” is now more common than the older paradigm of generating/collecting the data yourself. Are there really more researchers studying “found” datasets (say, Google engineers studying whatever their trawler happens to find) than ones generated by experiment or survey (biologists, psychologists, materials engineers, agriculturalists, pollsters, etc. in the lab or the field)?
- At lunch she suggested making good use of Science’s code-sharing-requirements policy. Find a Science paper whose statistics content could use major improvement; ask the authors for their code + data, which the journal requires them to share; fix up the stats and publish this improvement. Seems like a nice way for statistics grad students to make an impact (and beef up their CV).
- Some of her own SparseLab demos might be nicely adapted into R and turned into Shiny demos, hosted on a Shiny server or shinyapps.io … Maybe a good project for future Stat Computing or Dataviz students?

Jen Christiansen and Robert Kosara have great summaries of the panel on “Vis, The Next Generation: Teaching Across the Researcher-Practitioner Gap.”

Even better, slides are available for some of the talks: Marti Hearst, Tamara Munzner, and Eytan Adar. Lots of inspiration for the next time I teach.

Finally, here are links to the syllabi or websites of various past dataviz courses. Browsing these helps me think about what to cover and how to teach it.

- Andrew Thomas, CMU 36-721 F’14
- Rebecca Nugent, CMU 36-721 F’10 and 36-315 S’14
- Ben Shneiderman, U of Maryland CMSC 734 S’15
- Hadley Wickham, Rice stat499 F’10 and stat645 S’11
- Tamara Munzner, U of British Columbia CS533C and CS547 (various terms)
- Trevor Branch, U of Washington FISH 554 A W’15
- Jeffrey Heer, U of Washington CSE512 W’14, and Stanford cs448b F’12 and earlier terms
- Pat Hanrahan, Stanford CS448B W’06 and earlier terms
- Eytan Adar, U of Michigan SI649 F’15
- Alexander Lex, Harvard CS171 S’15
- Andrew Gelman, Columbia Statistics G8307 F’15
- Kaiser Fung, NYU DATA1-CE9002 S’14 and workshop Su’15
- Kevin Quealy, Metis workshop S’15
- Alberto Cairo and Scott Murray, Knight Center MOOC F’15
- Marti Hearst, Berkeley i247 (various terms)
- Annette Greiner and Christopher Arnold, Berkeley (in development?)
- Alan Rogers, U of Utah Anth 5485 S’11
- Dan Carr, George Mason STAT 875 S’10 and STAT 663 F’09

Comment below or tweet @civilstat with any others I’ve missed, and I’ll add them to the list.

I’d been hearing about the `magrittr` package for a while now, but I couldn’t make heads or tails of all these scary `%>%` symbols. Finally I had time for a closer look, and it seems potentially handy indeed. Here’s the idea and a simple toy example.
So, it can be confusing and messy to write (and read) functions from the inside out. This is especially true when functions take multiple arguments. Instead, `magrittr` lets you write (and read) functions from left to right.

Say you need to compute the LogSumExp function, log(Σᵢ exp(xᵢ)), and you’d like your code to specify the logarithm base explicitly.

In base R, you might write

`log(sum(exp(MyData)), exp(1))`

But this is a bit of a mess to read. It takes a lot of parentheses-matching to see that the `exp(1)` is an argument to `log` and not to one of the other functions.

Instead, with `magrittr`, you program from left to right:

`MyData %>% exp %>% sum %>% log(exp(1))`

The pipe operator `%>%` takes the output from its left and uses it as the first argument of the function on its right. Now it’s very clear that the `exp(1)` is an argument to `log`.
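A quick sanity check (with made-up data) that the piped version computes exactly the same thing as the nested one:

```r
library(magrittr)

MyData <- c(0.5, 1.2, 3.0)

nested <- log(sum(exp(MyData)), exp(1))
piped  <- MyData %>% exp %>% sum %>% log(exp(1))

all.equal(nested, piped)  # TRUE
```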

There’s a lot more you can do with `magrittr`, but code with fewer nested parentheses is already a good selling point for me.

Apart from cleaning up your nested functions, this approach to programming might be helpful if you write a lot of JavaScript code, for example if you make D3.js visualizations. R’s `magrittr` pipe is similar in spirit to JavaScript’s method chaining, so it might make context-switching a little easier.

For software we used mostly R (base graphics, ggplot2, and Shiny). But we also spent some time on Tableau, Inkscape, D3, and GGobi.

We covered a LOT of ground. At each point I tried to hammer home the importance of legible, comprehensible graphics that respect human visual perception.

My course materials are below. Not all the slides are designed to stand alone, but I have no time to remake them right now. I’ll post some reflections separately.

Download all materials as a ZIP file (38 MB), or browse individual files:

- Syllabus and Suggested Readings list; our required texts were Cairo’s The Functional Art and Donahue’s Fundamental Statistical Concepts in Presenting Data
- **Homeworks:**
- **Projects:**
- **Lectures:**
- 01 Introduction slides
- 02 Legible Graphics slides, R code, R output, addendum
- 03 Visual Perception slides, R code, R output
- 04 Grammar of Graphics slides, R code, R output
- 05 Graphic Design slides, example component graphs (STEM, NonSTEM, Business) and layout
- 06 Interaction Design slides, Shiny code and data, D3 code and data
- 07 Visualization Research slides
- 08 Shiny Lab Session slides
- 09 Graphics for Statistical Analysis slides, R code, R output, data (tips, ganglion)
- 10 Mapping slides, R code, R output, data (MT, PA, USA)
- 11 High-Dimensional Data slides
- 12 Networks and Trees slides
- 13 Wrap-up slides, R code, R output
- NHANES extract dataset used in several lectures

Please note:

- The examples, papers, blogs and researchers linked here are just scratching the surface. I meant no offense to anyone left out. I’ve simply tried to link to blogs, Twitter, and researchers’ websites that are actively updated.
- I have tried my best to include attribution, citations, and links for all images (besides my own) in the lecture slides. Same for datasets in the R code. Wherever I use scans from a book, I have contacted the authors and do so with their approval (Alberto Cairo, Di Cook, Mark Monmonier, Colin Ware, & Robin Williams). However, if you are the creator or copyright holder of any images here and want them removed or the attribution revised, please let me know and I will comply.
- Most of the cited books have an Amazon Associates link. If you follow these links and buy something during that visit, I get a small advertising fee (in the form of an Amazon gift card). Each year so far, these fees have totaled under $100. I just spend it on more dataviz books!

Start with Mark Bittman’s blondie recipe (copied/adapted from here), and add some of the spices that go into chai tea.

- 8 tablespoons (1 stick, 4 ounces or 113 grams) butter, melted
- 1 cup (218 grams or 7 3/4 ounces for light; 238 grams or 8 3/8 ounces for dark) brown sugar
- 1 large egg
- 1 teaspoon vanilla
- Pinch salt
- 1 cup (4 3/8 ounces or 125 grams) all-purpose flour
- 1/2 teaspoon cardamom
- 1/2 teaspoon cinnamon
- 1/2 teaspoon ground ginger
- 1/2 teaspoon ground cloves
- scant 1/2 teaspoon fresh ground black pepper

- Preheat oven to 350°F. Butter an 8×8 pan, or line the pan with aluminum foil and grease the foil.
- Mix melted butter with brown sugar. Beat until smooth. Beat in egg and vanilla.
- Combine salt, flour, and spices. Gently stir flour mixture into butter mixture.
- Pour into prepared pan. Bake 20-25 minutes, or until barely set in the middle. Cool on a rack before cutting.

Enjoy!

Each of these maps shows a dataset with statistical estimates and their precision/uncertainty for various areas in New York state. If we use color or shading to show the estimates, like in a traditional choropleth map, how can we also show the uncertainty at the same time? The PAD examples include several variations of static maps, interaction by toggling overlays, and interaction with mouseover and sliders. Interactive map screenshots are linked to live demos on the PAD website.

I’m still fascinated by this problem. Each of these approaches has its strengths and weaknesses: Symbology Overlay uses separable dimensions, but there’s no natural order to the symbols. Pixelated Classification seems intuitively clear, but may be misleading if people (incorrectly) try to find meaning in the locations of pixels within an area. Side-by-side maps are each clear on their own, but it’s hard to see both variables at once. Dynamic Feedback gives detailed info about precision, but only for one area at a time, not all at once. And so forth. It’s an interesting challenge, and I find it really helpful to see so many potential solutions collected in one document.

The creators include Nij Tontisirin and Sutee Anantsuksomsri (both since moved on from Cornell), and Jan Vink and Joe Francis (both still there). The Pixelated Classification map is based on work by Nicholas Nagle.

For more about mapping uncertainty, see their paper:

Francis, J., Tontisirin, N., Anantsuksomsri, S., Vink, J., & Zhong, V. (2015). Alternative strategies for mapping ACS estimates and error of estimation. In Hoque, N. and Potter, L. B. (Eds.), Emerging Techniques in Applied Demography (pp. 247–273). Dordrecht: Springer Netherlands, DOI: 10.1007/978-94-017-8990-5_16 [preprint]

and my related posts:

- Localized Comparisons: my own attempts at showing uncertainty in an interactive map and in a cartogram, plus links to work by Gabriel Florit, David Sparks, Nicholas Nagle, and Nancy Torrieri & David Wong
- Nice example of a map with uncertainty: a map by Michael Wininger

See also Visualizing Attribute Uncertainty in the ACS: An Empirical Study of Decision-Making with Urban Planners. This talk by Amy Griffin is about studying how urban planners actually use statistical uncertainty on maps in their work.

A large part of the focus will be on useful principles and frameworks: human visual perception, the Grammar of Graphics, graphic design and interaction design, and more current dataviz research. As for tools, besides base R and ggplot2, I’ll introduce a bit of Tableau, D3.js, and Inkscape/Illustrator. For assessments, I’m trying a variant of “specs grading”, with a heavy use of rubrics, hoping to make my expectations clear and my TA’s grading easier.

My initial course materials are up on my department webpage.

(I’ll probably just use Blackboard during the semester, but I may post the final materials here again.)

It’s been a pleasant challenge to plan a course that can satisfy statisticians (*slice and dice data quickly to support detailed analyses! examine residuals and other model diagnostics! work with data formats from rectangular CSVs through shapefiles to social networks!*) … while also passing on lessons from the data journalism and design communities (*take design and the user experience seriously! use layout, typography, and interaction sensibly!*). I’m also trying to put into practice all the advice from teaching seminars I’ve taken at CMU’s Eberly Center.

Also, in preparation, this summer I finally enjoyed reading more of the classic visualization books on my list.

- Cleveland’s *The Elements of Graphing Data* and Robbins’ *Creating More Effective Graphs* are chock full of advice on making clear graphics that harness human visual perception correctly.
- Ware’s *Information Visualization* adds to this the latest research findings and a ton of useful detail.
- Cleveland’s *Visualizing Data* and Cook & Swayne’s *Interactive and Dynamic Graphics for Data Analysis* are a treasure trove of practical data analysis advice. Cleveland’s many case studies show how graphics are a critical part of exploratory data analysis (EDA) and model-checking. In several cases, his analysis demonstrates that previously-published findings used an inappropriate model and reached poor conclusions due to what he calls rote data analysis (RDA). Cook & Swayne do similar work with more modern statistical methods, including the first time I’ve seen graphical diagnostics for many machine learning tools. There’s also a great section on visualizing missing data. The title is misleading: you don’t need R and GGobi to learn a lot from their book.
- Monmonier’s *How to Lie with Maps* refers to dated technology, but the concepts are great. It’s still useful to know just how maps are made, and how different projections work and why it matters. Much of cartographic work sounds analogous to statistical work: making simplifications in order to convey a point more clearly, worrying about data quality and provenance (different areas on the map might have been updated by different folks at different times), setting national standards that are imperfect but necessary… The section on “data maps” is critical for any statistician working with spatial data, and the chapter on bureaucratic mapping agencies will sound familiar to my Census Bureau colleagues.

I hope to post longer notes on each book sometime later.

But I do think there is one core thing that differentiates Statisticians from these others. Having an interest in this is why you might choose to major in statistics rather than applied math, machine learning, etc. And it’s the reason you might hire a trained statistician rather than someone else fluent with data:

Statisticians use the idea of **variability due to sampling** to design good data collection processes, to quantify uncertainty, and to understand the statistical properties of our methods.

When applied statisticians design an experiment or a survey, they account for the inherent randomness and try to control it. They plan your study in such a way that’ll make your estimates/predictions as accurate as possible for the sample size you can afford. And when they analyze the data, alongside each estimate they report its precision, so you can decide whether you have enough evidence or whether you still need further study. For more complex models, they also worry about overfitting: can this model generalize well to the population, or is it too complicated to estimate with this sample, and hence just fitting noise?

When theoretical statisticians invent a new estimator, they study how well it’ll perform over repeated sampling, under various assumptions. They study its statistical properties first and foremost. Loosely speaking: How variable will the estimates tend to be? Will they be biased (i.e. tend to always overestimate or always underestimate)? How robust will they be to outliers? Is the estimator consistent (as the sample size grows, does the estimate tend to approach the true value)?
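A toy simulation (mine, not from the original post) makes the bias idea concrete: the “divide by n” variance estimator systematically underestimates over repeated sampling, while the usual “divide by n - 1” version does not.

```r
set.seed(42)
true_var <- 4      # variance of N(0, sd = 2)
n <- 10
reps <- 10000

one_sample <- function() rnorm(n, sd = 2)

# Biased estimator: mean squared deviation, dividing by n
est_biased   <- replicate(reps, { x <- one_sample(); mean((x - mean(x))^2) })
# Unbiased estimator: R's var() divides by n - 1
est_unbiased <- replicate(reps, var(one_sample()))

mean(est_biased)    # about (n - 1)/n * 4 = 3.6: biased downward
mean(est_unbiased)  # about 4: unbiased
```

Both estimators are consistent, though: rerun with a large `n` and the gap between them shrinks toward zero.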

These are not the only important things in working with data, and they’re not the only things statisticians are trained to do. But (as far as I can tell) they are a much deeper part of the curriculum in statistics training than in any other field. Statistics is their home. Without them, you can often still be a good data analyst but a poor statistician.

Certainly we need to do a better job of selling these points. (I don’t agree with everything in this article, but it really is a shame when the NSF invites 100 experts to a Big Data conference but does not include a single statistician.) But maybe it’s not really a problem that ML and Data Science are “eating our lunch.” These days there are many situations that don’t require solid understanding of statistical concepts & properties—situations where “generalizing from sample to population” isn’t the hard part:

- In some Big Data situations, you literally have **all** the data. There’s no sampling going on. If you just need descriptive summaries of what happened in the past, you have the full population—no need for a statistician to quantify uncertainty.

[*Edit: Some redditors misunderstood my point here. Yes, there are many cases where you still want statistical inference on population data (about the future, or about what else might have happened); but that’s not what I mean here. An example might help. Lawyers in a corporate fraud case may have a digital file containing every single relevant financial record, so they can literally analyze all the data. There’s no worry here that this population is a random sample from some abstract superpopulation. You just summarize what the defendant did, not what they might have done but didn’t.*]
- In other Big Data cases, you care not about the past but about estimates that’ll generalize to future data. If your sample is huge, and your data collection isn’t biased somehow, then the statistical uncertainty due to sampling will be negligible. Again, any data analyst will do—no need for statistical training.
- Other times, you don’t want parameter estimates—you need predictions. In the Netflix Prize or most Kaggle contests, you build a model on training data and evaluate your predictions’ performance on held-out test data. If both datasets are huge, then again, sampling variation may be a minor concern; you may not need to worry much about overfitting; and it really is okay to try a zillion complex, uninterpretable, black-box models and choose the one with the best score on the test data. Cross-validation or hold-out validation may be the only statistical hammer you need for every such nail.
- Finally, there are some hard problems (web search results, speech recognition, natural language translation) involving immediate give-and-take with a human, where it’s frankly okay for the model to make a lot of “mistakes.” If Google doesn’t return the very best search result for my query on page 1, I can look further or edit my query. If my speech recognition software makes a mistake, I can try again enunciating more clearly, or I can just type the word directly. Quantifying and controlling such a model’s randomness and errors would be useful, but not critical.
- Plus, there have always been problems better suited to mathematical modeling, where the uncertainty is more about how a complicated deterministic model turns its inputs to outputs. There, instead of statistical analysis you’d want sensitivity analysis, which is not usually part of our core training.
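The hold-out workflow from the Kaggle bullet above fits in a few lines of base R (synthetic data; illustrative only):

```r
set.seed(1)
n <- 1000
dat <- data.frame(x = runif(n))
dat$y <- 2 * dat$x + rnorm(n, sd = 0.3)

train <- sample(n, 0.8 * n)            # 80/20 train/test split
fit <- lm(y ~ x, data = dat[train, ])

# Score on held-out data only, never on the training rows
pred <- predict(fit, newdata = dat[-train, ])
rmse <- sqrt(mean((dat$y[-train] - pred)^2))
```

Whichever candidate model gives the lowest held-out `rmse` wins, with no need to inspect the models’ internals.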

Yes, in most of these cases a statistician would do well, but so would the other flavors of data analyst. The statistician would be most valuable at the start, in setting up the data collection process, rather than in the actual analysis.

On the other hand, when sampling is expensive and difficult, and if you care about interpretable estimates rather than black-box predictions, you can’t beat statisticians.

- What does the Census Bureau need? Someone who can design a giant nationwide survey to be as cost-effective as possible, learning as much as we can about the nation’s citizens (including breakdowns by small geographic and demographic groups) without overspending taxpayers’ money. Who does it hire? Statisticians.
- What does the FDA need? Someone who can design a clinical trial that’ll stop as soon as the evidence in favor of or against the new drug/procedure is strong enough, so that as few patients as possible are exposed to a bad new drug or are withheld from an effective new treatment. Who does it hire? Statisticians.
- Statisticians also work on a different kind of Big Data: small-ish samples but with high dimensionality. In genetics, each person’s genome is a huge dataset, even if you only have the genomes of a relatively small number of people with the disease you’re studying. Naive data mining will find a zillion spurious associations, and too often such results get published… but it doesn’t actually advance the scientific understanding of which genes really do what. A statistician’s humility (we’re not confident about these associations yet and need further study) is better than asserting unfounded, possibly harmful claims.

Finally, there are plenty of cases in between. The data’s already been collected; it’s hard to know how important the sampling variability will be; or maybe you just need to make a decision quickly, even if there’s not enough data to have strong evidence. I can imagine that in business analytics, you’d be inclined to hire the data scientist (who’ll confidently tell you “We crunched the numbers!”) over the buzzkill statistician (who’ll tell you “Still not enough evidence…”), and the market is so unpredictable that it’s hard to tell afterwards who was right anyway.

Now, I’d love it if all statisticians had broader training in other topics, including the ones that machine learning and data science have claimed for themselves. Hadley Wickham’s recent interview points out:

He observed during his statistics PhD that there was a “total disconnect between what people need to actually understand data and what was being taught.” Unlike the statisticians who were focused on abstruse ramifications of the central limit theorem, Wickham was in the business of making data analysis easier for the public.

Indeed, in many traditional statistics departments, you’d have trouble getting funded to study data analysis from a usability standpoint, even though it’s an extremely important and valuable topic of study.

But if the new Data Science departments that are popping up claim this topic, I don’t see anything wrong with that. If academic Statistics departments keep chugging away at understanding estimators’ statistical properties, that’s fine; somebody needs to be doing it. However, if Statistics departments drop the mantle of studying sampling variation, and nobody else picks it up, that’d be a real loss.

I love my department at CMU, but sometimes I wonder if we’re chasing these other data science fields too much. We only offer one class each on survey sampling and on experimental design, both at the undergrad level and never taken by our grad students. Our course on Convex Optimization was phenomenal, but we almost never discussed the statistical properties of the crazy models we fit (not even to point out that you may as well stop optimizing once the numerical precision is within your statistical precision—you don’t need predictions optimized to 7 decimal places if the standard error is at 1 decimal place.)
