One more difference between statistics and [machine learning, data science, etc.]

Statisticians have always done a myriad of different things related to data collection and analysis. Many of us are surprised (even frustrated) that Data Science is even a thing. “That’s just statistics under a new name!” we cry. Others are trying to bring Data Science, Machine Learning, Data Mining, etc. into our fold, hoping that Statistics will be the “big tent” for everyone learning from data.

But I do think there is one core thing that differentiates Statisticians from these others. Having an interest in this is why you might choose to major in statistics rather than applied math, machine learning, etc. And it’s the reason you might hire a trained statistician rather than someone else fluent with data:

Statisticians use the idea of variability due to sampling to design good data collection processes, to quantify uncertainty, and to understand the statistical properties of our methods.

When applied statisticians design an experiment or a survey, they account for the inherent randomness and try to control it. They plan your study in a way that makes your estimates/predictions as accurate as possible for the sample size you can afford. And when they analyze the data, alongside each estimate they report its precision, so you can decide whether you have enough evidence or whether you still need further study. For more complex models, they also worry about overfitting: can this model generalize well to the population, or is it too complicated to estimate with this sample, and hence just fitting noise?

When theoretical statisticians invent a new estimator, they study how well it’ll perform over repeated sampling, under various assumptions. They study its statistical properties first and foremost. Loosely speaking: How variable will the estimates tend to be? Will they be biased (i.e. tend to always overestimate or always underestimate)? How robust will they be to outliers? Is the estimator consistent (as the sample size grows, does the estimate tend to approach the true value)?
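To make "performance over repeated sampling" concrete, here is a minimal simulation sketch. It is only an illustration, not anything from a particular textbook or package: the function name, the sample size of 10, and the choice of a standard normal population are all assumptions made for the example. It draws many samples and compares the long-run average of two variance estimators, one dividing by n and one dividing by n - 1.

```python
import random


def average_variance_estimates(n=10, reps=20000, seed=0):
    """Average two variance estimators over `reps` samples of size `n` from N(0, 1)."""
    rng = random.Random(seed)
    total_div_n = 0.0
    total_div_n_minus_1 = 0.0
    for _ in range(reps):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(sample) / n
        sum_sq = sum((x - mean) ** 2 for x in sample)
        total_div_n += sum_sq / n              # divides by n
        total_div_n_minus_1 += sum_sq / (n - 1)  # divides by n - 1
    return total_div_n / reps, total_div_n_minus_1 / reps


# The true variance of the simulated population is 1.0 (an assumption of this sketch).
avg_div_n, avg_div_n_minus_1 = average_variance_estimates()
print("true variance:          1.000")
print(f"average, divide by n:   {avg_div_n:.3f}")
print(f"average, divide by n-1: {avg_div_n_minus_1:.3f}")
```

Over many repetitions the divide-by-n estimator should average close to (n - 1)/n times the true variance, so it tends to underestimate, while the divide-by-(n - 1) version should average close to the truth. That is the kind of question (biased or not, and by how much?) that the paragraph above describes.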

These are not the only important things in working with data, and they’re not the only things statisticians are trained to do. But (as far as I can tell) they are a much deeper part of the curriculum in statistics training than in any other field. Statistics is their home. Without them, you can often still be a good data analyst but a poor statistician.

Certainly we need to do a better job of selling these points. (I don’t agree with everything in this article, but it really is a shame when the NSF invites 100 experts to a Big Data conference but does not include a single statistician.) But maybe it’s not really a problem that ML and Data Science are “eating our lunch.” These days there are many situations that don’t require solid understanding of statistical concepts & properties—situations where “generalizing from sample to population” isn’t the hard part:

  • In some Big Data situations, you literally have all the data. There’s no sampling going on. If you just need descriptive summaries of what happened in the past, you have the full population—no need for a statistician to quantify uncertainty.
    [Edit: Some redditors misunderstood my point here. Yes, there are many cases where you still want statistical inference on population data (about the future, or about what else might have happened); but that’s not what I mean here. An example might help. Lawyers in a corporate fraud case may have a digital file containing every single relevant financial record, so they can literally analyze all the data. There’s no worry here that this population is a random sample from some abstract superpopulation. You just summarize what the defendant did, not what they might have done but didn’t.]
  • In other Big Data cases, you care not about the past but about estimates that’ll generalize to future data. If your sample is huge, and your data collection isn’t biased somehow, then the statistical uncertainty due to sampling will be negligible. Again, any data analyst will do—no need for statistical training.
  • Other times, you don’t want parameter estimates—you need predictions. In the Netflix Prize or most Kaggle contests, you build a model on training data and evaluate your predictions’ performance on held-out test data. If both datasets are huge, then again, sampling variation may be a minor concern; you may not need to worry much about overfitting; and it really is okay to try a zillion complex, uninterpretable, black-box models and choose the one with the best score on the test data. Cross-validation or hold-out validation may be the only statistical hammer you need for every such nail.
  • Finally, there are some hard problems (web search results, speech recognition, natural language translation) involving immediate give-and-take with a human, where it’s frankly okay for the model to make a lot of “mistakes.” If Google doesn’t return the very best search result for my query on page 1, I can look further or edit my query. If my speech recognition software makes a mistake, I can try again enunciating more clearly, or I can just type the word directly. Quantifying and controlling such a model’s randomness and errors would be useful, but not critical.
  • Plus, there have always been problems better suited to mathematical modeling, where the uncertainty is more about how a complicated deterministic model turns its inputs to outputs. There, instead of statistical analysis you’d want sensitivity analysis, which is not usually part of our core training.

Yes, in most of these cases a statistician would do well, but so would the other flavors of data analyst. The statistician would be most valuable at the start, in setting up the data collection process, rather than in the actual analysis.
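The claim in the second bullet above, that sampling uncertainty becomes negligible when the sample is huge, can be sketched with a small simulation. This is a rough illustration under assumed conditions (a standard normal population, a hypothetical helper name, and sample sizes picked for speed), and it says nothing about bias in the data collection, which does not shrink as the sample grows.

```python
import math
import random
import statistics


def spread_of_sample_means(n, reps=500, seed=0):
    """Standard deviation of the sample mean across `reps` samples of size `n` from N(0, 1)."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        for _ in range(reps)
    ]
    return statistics.stdev(means)


# Compare the simulated spread with the textbook value sigma / sqrt(n), here 1 / sqrt(n).
spreads = {n: spread_of_sample_means(n) for n in (10, 100, 1000)}
for n, spread in spreads.items():
    print(f"n={n:>5}: simulated {spread:.4f}, theory {1 / math.sqrt(n):.4f}")
```

Each hundredfold increase in sample size should cut the spread of the estimate by roughly a factor of ten, so with millions of rows the sampling error of a simple average is usually far smaller than anything you would act on.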

On the other hand, when sampling is expensive and difficult, and if you care about interpretable estimates rather than black-box predictions, you can’t beat statisticians.

  • What does the Census Bureau need? Someone who can design a giant nationwide survey to be as cost-effective as possible, learning as much as we can about the nation’s citizens (including breakdowns by small geographic and demographic groups) without overspending taxpayers’ money. Who does it hire? Statisticians.
  • What does the FDA need? Someone who can design a clinical trial that’ll stop as soon as the evidence in favor of or against the new drug/procedure is strong enough, so that as few patients as possible are exposed to a bad new drug or are denied an effective new treatment. Who does it hire? Statisticians.
  • Statisticians also work on a different kind of Big Data: small-ish samples but with high dimensionality. In genetics, each person’s genome is a huge dataset, even if you only have the genomes of a relatively small number of people with the disease you’re studying. Naive data mining will find a zillion spurious associations, and too often such results get published… but it doesn’t actually advance the scientific understanding of which genes really do what. A statistician’s humility (we’re not confident about these associations yet and need further study) is better than asserting unfounded, possibly harmful claims.
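The "zillion spurious associations" in the last bullet are easy to reproduce on purely made-up data. The sketch below is an assumption-laden toy, not a genetics analysis: 20 hypothetical people, 1000 features that are pure noise, and an outcome that is also pure noise. It counts how many features look "significant" at the usual 5% level when each one is tested on its own.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


rng = random.Random(0)
n_people, n_features = 20, 1000  # small sample, many features (illustrative sizes)

# The outcome is noise, so no feature is truly associated with it.
outcome = [rng.gauss(0.0, 1.0) for _ in range(n_people)]

# Two-sided 5% cutoff for |r| with n - 2 = 18 degrees of freedom,
# from the tabled t critical value 2.101: r = t / sqrt(t^2 + df).
t_crit, df = 2.101, n_people - 2
r_cutoff = t_crit / math.sqrt(t_crit ** 2 + df)

false_hits = 0
for _ in range(n_features):
    feature = [rng.gauss(0.0, 1.0) for _ in range(n_people)]
    if abs(pearson_r(feature, outcome)) > r_cutoff:
        false_hits += 1

print(f"|r| cutoff: {r_cutoff:.3f}")
print(f"noise features passing the 5% test: {false_hits} of {n_features}")
```

Around 5% of the noise features should clear the cutoff by chance alone, which is dozens of "discoveries" with nothing behind them. Accounting for that multiplicity, or at least admitting it, is the humility the bullet refers to.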

Finally, there are plenty of cases in between. The data’s already been collected; it’s hard to know how important the sampling variability will be; or maybe you just need to make a decision quickly, even if there’s not enough data to have strong evidence. I can imagine that in business analytics, you’d be inclined to hire the data scientist (who’ll confidently tell you “We crunched the numbers!”) over the buzzkill statistician (who’ll tell you “Still not enough evidence…”), and the market is so unpredictable that it’s hard to tell afterwards who was right anyway.

Wondermark understands the travails of being a statistician.

Now, I’d love it if all statisticians had broader training in other topics, including the ones that machine learning and data science have claimed for themselves. Hadley Wickham’s recent interview points out:

He observed during his statistics PhD that there was a “total disconnect between what people need to actually understand data and what was being taught.” Unlike the statisticians who were focused on abstruse ramifications of the central limit theorem, Wickham was in the business of making data analysis easier for the public.

Indeed, in many traditional statistics departments, you’d have trouble getting funded to study data analysis from a usability standpoint, even though it’s an extremely important and valuable topic of study.

But if the new Data Science departments that are popping up claim this topic, I don’t see anything wrong with that. If academic Statistics departments keep chugging away at understanding estimators’ statistical properties, that’s fine; somebody needs to be doing it. However, if Statistics departments drop the mantle of studying sampling variation, and nobody else picks it up, that’d be a real loss.

I love my department at CMU, but sometimes I wonder if we’re chasing these other data science fields too much. We only offer one class each on survey sampling and on experimental design, both at the undergrad level and never taken by our grad students. Our course on Convex Optimization was phenomenal, but we almost never discussed the statistical properties of the crazy models we fit (not even to point out that you may as well stop optimizing once the numerical precision is within your statistical precision—you don’t need predictions optimized to 7 decimal places if the standard error is at 1 decimal place.)
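The parenthetical point about numerical versus statistical precision can be sketched on a deliberately trivial problem. Everything here is an assumption made for illustration: the helper name, the simulated data, the step size, and the rule of stopping at one tenth of the standard error. The "model" is just a mean fitted by gradient descent, so the exact answer is known.

```python
import math
import random
import statistics


def gradient_descent_mean(data, tol, step=0.5, start=0.0, max_iter=10_000):
    """Minimize the average of (mu - x)^2 / 2 over mu; stop when an update is smaller than tol."""
    n = len(data)
    mu = start
    for iteration in range(1, max_iter + 1):
        gradient = sum(mu - x for x in data) / n  # derivative of the average loss
        update = step * gradient
        mu -= update
        if abs(update) < tol:
            return mu, iteration
    return mu, max_iter


rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(100)]  # simulated sample, not real data
std_error = statistics.stdev(data) / math.sqrt(len(data))

# Tight tolerance: optimize to about 7 decimal places.
mu_tight, iters_tight = gradient_descent_mean(data, tol=1e-7)
# Loose tolerance: stop once updates are a tenth of the standard error (an arbitrary choice).
mu_loose, iters_loose = gradient_descent_mean(data, tol=std_error / 10)

print(f"standard error of the mean: {std_error:.3f}")
print(f"tight tolerance: mu = {mu_tight:.7f} after {iters_tight} iterations")
print(f"loose tolerance: mu = {mu_loose:.7f} after {iters_loose} iterations")
```

The two answers differ by much less than one standard error, so the extra iterations buy digits that the data cannot support. On a model where each iteration is expensive, that difference in work is the practical payoff.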

7 responses to “One more difference between statistics and [machine learning, data science, etc.]”

  1. You are too generous to data science without statistics. I have never met a case in which you had all the data. You may have every row, but there are always missing records. A statistician’s training gives one experience in thinking about missingness. Also if you have all the data, but the goal is forecasting, then you don’t have all the data. Again, statistics definitely has something to offer when it comes to understanding why a model might perform better or worse out of sample. I would not say that well-trained statisticians are crucial because they can model uncertainty… But rather that they are crucial because they can model data generating processes, which ought to be central to any data science. On the flip side, real data generating processes can be tricky and not easily reduced to the abstract theoretical framework that statisticians are most comfortable in.

    • Thanks for the pushback 🙂

      “I have never met a case in which you had all the data.” Neither have I, but they exist. As I understand it, it’s common for accountants to compute descriptive statistics based on large, complete datasets.

      I actually wish missingness was better addressed in the core statistics curriculum. Too many people just give up too easily and do na.rm=TRUE or its equivalent.

      Perhaps instead of “uncertainty” there’s a better word I should use—maybe “precision with respect to randomness” or something like that.

      When you say “statisticians … are crucial because they can model data generating processes,” I think we agree. By saying statisticians understand & study sampling, I meant that statisticians can model stochastic data-generating processes.

      But we don’t study every data-generating process. Differential equations and chaos theory are not part of our core curriculum, yet they are critical for certain data sciences.

  2. anthony damico

    nice. thanks.

  3. The disconnect over uncertainty is also related to certain aspects of the domains in question. How do you derive confidence regions in settings with non-parametric models over high-dimensional and combinatorial structures? What does uncertainty mean over images or text corpora or coreference graphs, or functions of them? For example, I’d love to have some notion of uncertainty for a logistic regression with 1M features, but it’s not clear how to do it — both computationally and what it practically means. And that’s a much easier setting than cross-document coreference, machine translation, neural network models, or many other settings. In all these cases, point estimates and prediction are just easier problems than estimating (or using) parameter uncertainty, and I think some of the attention toward prediction stems from this fact. It’s a great research opportunity to bring statistical notions of uncertainty to these domains.

    • Agreed. Whether you call it uncertainty, precision, or just “What other models/estimates/predictions are plausible from this data?”, these are all fascinating open problems.

  4. Also this reminded me of this good article from Jerome Friedman back in 1997 (he uses the term “data mining”, but it is pretty similar to today’s “data science”). See especially section 5.

  5. I like the fact that this article emphasizes the importance of experimental design and sampling variability in identifying an important feature of statistics. Even when you have “all the data”, by the way, you still don’t have an understanding of the causal processes involved. You might still want to bootstrap or use other methods to learn about how easily (frequently) such and such a pattern would be expected by variability alone.