“Statistical Modeling: The Two Cultures,” Breiman

One highlight of my fall semester is going to be a statistics journal club led by CMU’s Ryan Tibshirani together with his dad Rob Tibshirani (here on sabbatical from Stanford). The journal club will focus on “Hot Ideas in Statistics”: some classic papers that aren’t covered in standard courses, and some newer papers on hot or developing areas. I’m hoping to find time to blog about several of the papers we discuss.

The first paper was Leo Breiman’s “Statistical Modeling: The Two Cultures” (2001) with discussion and rejoinder. This is a very readable, high-level paper about the culture of statistical education and practice, rather than about technical details. I strongly encourage you to read it yourself.

Breiman’s article is quite provocative, encouraging statisticians to downgrade the role of traditional mainstream statistics in favor of an approach closer to machine learning. Breiman calls the two approaches “data modeling” and “algorithmic modeling”:

  • Data modeling assumes a stochastic model for where the data came from: what is the distribution for the data or the random noise, and how do you imagine it relates to predictor variables? Then you estimate and interpret the model parameters. Breiman claims that common practice is to validate your model by goodness-of-fit tests and residual analysis.
  • Algorithmic modeling assumes almost nothing about the data, except that it’s usually i.i.d. from the population you want to learn about. You don’t start with any statistical distributions or interpretable models; just build a “black box” algorithm, like random forests or neural nets, and evaluate performance by prediction accuracy (on withheld test data, or by cross-validation). A minimal sketch of this workflow appears just below the list.
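
Here is what that second workflow looks like in practice, as a minimal sketch using scikit-learn on synthetic data (my own toy example, not anything from Breiman’s paper): no distributional assumptions, just a black box scored by cross-validated accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for "whatever population you care about."
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# No stochastic model for how the data arose: just fit a black box...
rf = RandomForestClassifier(n_estimators=500, random_state=0)

# ...and judge it purely by out-of-sample prediction accuracy.
scores = cross_val_score(rf, X, y, cv=5, scoring="accuracy")
print("5-fold CV accuracy: %.3f" % scores.mean())
```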

I absolutely agree that traditional statistics focuses on the former over the latter, and also that the latter has a lot to offer and should be a welcome addition to any statistician’s toolbox. But Breiman’s tone is pretty harsh regarding “data modeling,” apart from a few placating remarks at the end. He uses a few straw man arguments, explaining how algorithmic modeling beats poorly-done traditional statistics. (For instance, about overreliance on 5% significance of regression coefficients, he says “Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions”—but he is still presenting this “suspect way” as the standard that “most statisticians” use. So which is it? Is the majority wrong or right? If by “statisticians” he actually means “psychologists who took one stats class,” then this calls for a completely different discussion about education and service courses.) Meanwhile, Breiman neglects some important benefits of well-done data modeling.

A couple of the discussants (David Cox and Brad Efron) defend the value of data modeling. (Efron has a great way to rephrase significance tests as prediction problems: “In our sample of 20 patients drug A outperformed drug B; would this still be true if we went on to test all possible patients?”) Another discussant (Bruce Hoadley) shares some examples of early algorithmic culture from the credit scoring industry, including the importance of interpretability: “Sometimes we can’t implement them until the lawyers and regulators approve. And that requires super interpretability.” The final discussant (Emanuel Parzen) encourages us to see many cultures besides Breiman’s two: Parzen mentions maximum entropy methods, robust methods, Bayesian methods, and quantile methods, while I would add sampling theory as another underappreciated distinct statistical paradigm.

As for myself, I agree with many of Breiman’s points, especially that “algorithmic modeling” should be added to our standard applied toolbox and also become a bigger topic of theoretical study. But I don’t think “data modeling” is as bad as he implies.

Breiman’s preferred approach is strongly focused on pure prediction problems: Based on today’s weather, what will the ozone levels be tomorrow? Can you train a mass spectrometer to predict whether a new unknown compound contains chlorine? However, there are many scientific problems where the question is about understanding, not really about prediction. Even if you can never get really good predictions of who will experience liver failure and who won’t, you still want to know the approximate effects of various behaviors on your chance of liver failure. Breiman dismisses the (nicely interpretable) logistic regression for this problem, suggests a random forest instead, and shows a nifty way of estimating the relative “importance” (whatever that means) of each predictor variable. But saying “variable 12 is more important than variable 10” seems kind of pointless. What you want to know is “If you increase your exposure to carcinogen X by Y units, your risk of disease Z will double,” which is not as easy to extract from a random forest. What’s more, a data model will give you confidence intervals, which can be quite useful despite their weaknesses. Most algorithmic models seem to entirely ignore the concept of confidence intervals for effect sizes.
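
To make the contrast concrete, here is a toy comparison on simulated data (the variables and effect sizes are invented for illustration, not taken from Breiman’s example): the data model hands you a directional effect size on an interpretable scale, with a confidence interval; the black box hands you an unordered list of “importances.”

```python
import numpy as np
import statsmodels.api as sm
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 1000
exposure = rng.normal(size=n)   # hypothetical risky exposure
age = rng.normal(size=n)        # hypothetical second predictor
logit = -1.0 + 0.7 * exposure + 0.2 * age
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))

# Data modeling: an interpretable, directional effect size with a 95% CI.
X = sm.add_constant(np.column_stack([exposure, age]))
fit = sm.Logit(y, X).fit(disp=0)
odds_ratio = np.exp(fit.params[1])   # multiplicative effect of one more unit of exposure
ci = np.exp(fit.conf_int()[1])       # 95% CI on the odds-ratio scale
print("odds ratio per unit of exposure:", odds_ratio, "95% CI:", ci)

# Algorithmic modeling: a ranking of variable "importance" --
# no direction, no units, no interval.
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X[:, 1:], y)
print("random forest importances:", rf.feature_importances_)
```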

Furthermore, there are statistical problems where you cannot really do prediction in Breiman’s sense. In my work on small area estimation at the Census Bureau, we often had trouble finding good ways to validate our models, because simple cross-validation or withholding a test set just doesn’t work. When your goal is to provide poverty estimates for each of the 50 US states, you can’t just drop some of the states and cross-validate: the states are not really exchangeable. And you can’t just get more states or pretend that these 50 are a sample from a larger set of possible states: we really do care about these 50. Sure, you can imagine various ways to get around this, including evaluating prediction accuracy on synthetic data (as we started to do). But my point is that it’s not trivial and you can’t treat everything as a standard prediction problem with i.i.d. observations.
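
For the curious, here is a toy version of what a synthetic-data check might look like (a deliberately simplified Fay–Herriot-style area-level model with the variance components treated as known; this is emphatically not the actual Census Bureau methodology): simulate “states” from a known model, apply the estimator, and compare against the simulated truth, which you could never do with the real data.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50                                # areas ("states")
x = rng.normal(size=m)                # one area-level covariate
beta0, beta1, A = 1.0, 2.0, 0.5       # true regression coefficients and model variance
D = rng.uniform(0.2, 2.0, size=m)     # known sampling variance for each area

theta = beta0 + beta1 * x + rng.normal(scale=np.sqrt(A), size=m)  # true area means
y = theta + rng.normal(scale=np.sqrt(D))                          # direct survey estimates

# Fit the linking model by weighted least squares (weights 1/(A + D_i)),
# then shrink each direct estimate toward its regression prediction.
Xmat = np.column_stack([np.ones(m), x])
W = np.diag(1.0 / (A + D))
beta_hat = np.linalg.solve(Xmat.T @ W @ Xmat, Xmat.T @ W @ y)
gamma = A / (A + D)
theta_hat = gamma * y + (1 - gamma) * (Xmat @ beta_hat)

# Because the data are synthetic, the truth is known and errors can be compared.
print("direct estimator MSE:     ", np.mean((y - theta) ** 2))
print("model-based estimator MSE:", np.mean((theta_hat - theta) ** 2))
```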

That leads me to another concern, both in Breiman’s paper and in the Machine Learning classes I’ve taken here at CMU. The “lack of a generative data model” basically means that you’re assuming your training data are taken i.i.d. (or at least as a simple random sample, SRS) from the population you want to learn about. Firstly, that IS a kind of generative data model. I’ll treat this as an admission from Breiman that we’ve established you do need a data model; now we’re just quibbling over its extent 🙂  But secondly, what do Machine Learning people do when the data are not i.i.d.? If your training and test data aren’t representative of the broader population, a simple prediction accuracy rate is meaningless. There must be some research into this, but I’ve hardly seen any. For instance, I still know of only one paper (Toth & Eltinge, 2011) on how to do regression trees when your data come from a complex sample survey.
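
As one tiny illustration of why representativeness matters: if the test cases come from a complex design with unequal inclusion probabilities, a design-weighted accuracy estimate can look very different from the naive one. The numbers below are made up purely to show the calculation.

```python
import numpy as np

# Hypothetical test cases from a complex survey, with sampling weights
# w_i = 1 / (probability that case i was sampled).
y_true  = np.array([1, 0, 1, 1, 0, 1])
y_pred  = np.array([1, 0, 0, 1, 0, 0])
weights = np.array([50.0, 200.0, 800.0, 100.0, 600.0, 900.0])

correct = (y_true == y_pred).astype(float)

naive_accuracy = correct.mean()                                   # ignores the design
weighted_accuracy = np.sum(weights * correct) / np.sum(weights)   # Hajek-style estimate

print(naive_accuracy, weighted_accuracy)   # about 0.67 vs. 0.36
```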

[Edit: Ryan Tibshirani reminded me that there are plenty of Machine Learning approaches that are not i.i.d.! For instance, there’s sequential or online learning (your data arrive in a sequence that is not i.i.d., may be changing over time, and might even be given to you by an “adversary” who tries to keep it as uninformative as possible) and active learning (you get to see some data, fit a preliminary model, and then choose where to observe the next data point, aiming to maximize your chance of learning something useful). And of course time series or spatial data are another non-i.i.d. setup frequently used in ML. But it still seems that there’s not as much focus on complex survey data, which are drawn in one batch from a finite population but not as a simple random sample.]

In class I mentioned that our program has no classes covering nontrivial experimental design or survey sampling at the PhD level. Surely there would be interest in learning how to do this well for statistics and machine learning. My classmate Alex asked if I’m volunteering to teach it 🙂 Maybe not a bad idea someday?

In our class discussion, people also pointed out that many of the “algorithmic” models can be motivated by a statistical model, just like you can treat many data modeling methods as pure algorithms. It seems clear that it’s always good to know what implicit model is behind your algorithm. Then at least you have some hope of checking your model assumptions, even if the check is imperfect. In general, I think there is still a need to develop better model diagnostics for both data and algorithmic models. I don’t mean more yes-or-no goodness-of-fit tests, but better ways to decide where your model is weak and how it can be improved. Breiman cites Bill Cleveland admitting that residual analysis doesn’t help much beyond four or five dimensions, but that just cries out for more research. Breiman’s examples remind you of the importance of checking for multicollinearity before you make interpretations, but that is true of algorithmic modeling too.
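
A classic example of an implicit model (my illustration, not one from Breiman’s paper): ridge regression is usually taught as a pure penalization algorithm, yet it is exactly the posterior mode under a Gaussian prior on the coefficients, which is easy to verify numerically.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=200)

lam = 2.0

# The "algorithm": penalized least squares as implemented in scikit-learn.
algorithmic = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

# The "implicit model": with y ~ N(X beta, sigma^2 I) and prior
# beta ~ N(0, (sigma^2 / lam) I), the posterior mode has this closed form.
bayesian = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

print(np.allclose(algorithmic, bayesian))   # True, up to numerical tolerance
```

Knowing that correspondence is exactly what gives you “some hope of checking your model assumptions”: the Gaussian prior tells you what kind of coefficient behavior the algorithm is implicitly betting on.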

Yes, there are gaps in traditional statistics culture’s approach, some of which algorithmic modeling or machine learning can help to fill. There are even bigger gaps in our ability to train non-experts to use statistical models and procedures appropriately. But I doubt that non-experts will make much better use of random forests or neural nets, even if those methods could conceivably have better prediction performance (where that concept is even relevant). In the end, Breiman makes many valid points, but he does not convince me to dismiss distributional assumptions and traditional statistics as a dead-end approach.