My friend Brian Segal, at the University of Michigan, writes in response to the journal that banned statistical inference:
If nothing else, I think BASP did a great job of starting a discussion on p-values, and more generally, the role of statistical inference in certain types of research. Stepping back a bit, I think the discussion fits into a broader question of how we deal with answers that are inherently grey, as opposed to clear cut. Hypothesis testing, combined with traditional cutoff values, is a neat way to get a yes/no answer, but many reviewers want a yes/no answer, even in the absence of hypothesis tests.
As one example, I recently helped a friend in psychology validate concepts measured by a survey. In case you haven’t done this before, here’s a quick (and incomplete) summary of construct validation: based on substantive knowledge, sort the survey questions into groups, each of which measures a different underlying concept, like positive attitude or negativity. The construct validation question is then, “Do these groups of questions actually measure the concepts I believe they measure?”
In addition to making sure the groups are defensible based on their interpretation, you usually have to do a quantitative analysis to get published. The standard approach is to model the data with a structural equation model (as a side note, this includes confirmatory factor analysis, which is not factor analysis!). The goodness-of-fit statistic is useless in this context, because the null hypothesis is not aligned with the scientific question, so people use a variety of heuristics, or fit indices, to decide whether the model fits. The model is declared to either fit or not fit (and consequently the construct is either valid or not valid) depending on whether the fit index is larger or smaller than a rule-of-thumb value. This is the same mentality as hypothesis testing.
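To make the rule-of-thumb mentality concrete, here’s a minimal sketch in Python. The chi-square values, sample size, and two-factor model are invented purely for illustration, and the formulas (RMSEA, CFI) and cutoffs (CFI ≥ .95, RMSEA ≤ .06) are just the commonly cited textbook ones, not anything specific to the analysis described above:

```python
import math

def rmsea(chisq, df, n):
    """Root mean square error of approximation from a model chi-square.
    Uses the common sample formula sqrt(max(chi2 - df, 0) / (df * (n - 1)));
    some software divides by n instead of n - 1."""
    return math.sqrt(max(chisq - df, 0.0) / (df * (n - 1)))

def cfi(chisq, df, chisq_base, df_base):
    """Comparative fit index relative to the independence (baseline) model."""
    num = max(chisq - df, 0.0)
    den = max(chisq_base - df_base, chisq - df, 0.0)
    return 1.0 - num / den

# Hypothetical output from a two-factor CFA on n = 300 respondents
# (all numbers invented for illustration only).
n = 300
chisq_model, df_model = 61.4, 26   # hypothesized measurement model
chisq_base, df_base = 812.7, 36    # independence (baseline) model

fit = {"CFI": cfi(chisq_model, df_model, chisq_base, df_base),
       "RMSEA": rmsea(chisq_model, df_model, n)}

# The rule-of-thumb mentality: compare each index to a conventional cutoff
# (e.g., CFI >= .95 and RMSEA <= .06) and declare the construct valid or not.
verdict = fit["CFI"] >= 0.95 and fit["RMSEA"] <= 0.06
print(fit, "-> model 'fits'" if verdict else "-> model 'does not fit'")
```

With these particular made-up numbers, CFI clears its cutoff while RMSEA does not, which is exactly the kind of grey result that a yes/no verdict papers over.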
Setting aside the question of whether it makes sense to use structural equation models to validate constructs, the point I’m trying to make is that the p-value mentality is not restricted to statistical inference. As in any unsupervised learning situation, it’s very difficult to say how well the hypothesized groups measure the underlying constructs (or whether they even exist). Any answer is inherently grey, and yet many researchers want a yes/no answer. In these types of cases, I think it would be great if statisticians could help other researchers come to terms not just with the limits of the statistical tools, but with the limits of the inquiry itself.
I agree with Brian that we can all do a better job of helping our collaborators to think statistically. Statistics is not just a set of arbitrary yes/no hoops to jump through in the process of publishing a paper; it’s a kind of applied epistemology. As tempting as it might be to just ban all conclusions entirely, we statisticians are well-trained in probing what can be known and how that knowledge can be justified. Give us the chance, and we’d love to help you navigate the subtleties, limits, and grey areas in your research!
Thanks, Jerzy, for posting this. On reading my comment a second time, I realize that it could benefit from some clarifications. My apologies for not proofreading ahead of time.
I was trying to distinguish between two types of questions: (a) those that have inherently uncertain answers no matter how much data you collect, and (b) those that could, in theory, be answered with certainty if you collected all relevant data. A simple example of the latter would be whether the difference in means between two populations is greater than some value: if you could collect all of the population data without measurement error, you could answer the question with 100% certainty. In contrast, if you wanted to know whether a particular clustering corresponds to an underlying reality, you could collect all the population data and still not be 100% certain, though you could certainly build evidence in favor of or against the clustering.
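To make this distinction concrete, here’s a small sketch in Python (numpy and scikit-learn); the populations, the threshold of 2, the three-cluster k-means, and the silhouette score are all illustrative choices, not anything from the original comment. With the full population in hand, the type (b) question resolves to a definite yes/no, while the type (a) question still only yields graded evidence:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Type (b): with the *entire* population in hand, "is the difference in
# means greater than 2?" has a definite yes/no answer -- no inference needed.
pop_a = rng.normal(10.0, 3.0, size=100_000)
pop_b = rng.normal(12.5, 3.0, size=100_000)
print("Type (b):", pop_b.mean() - pop_a.mean() > 2.0)

# Type (a): even with every observation, "does this 3-group clustering
# correspond to an underlying reality?" only yields graded evidence,
# e.g. a silhouette score in [-1, 1], never a certain yes/no.
X = rng.normal(size=(5_000, 2))   # a population with no true clusters
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Type (a): silhouette =", round(silhouette_score(X, labels), 3))
```

Even with “all” the data, the silhouette number is evidence to weigh, not a verdict.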
I recently read Robert Kass’s article, Statistical Inference: The Big Picture, and I think this distinction between types of questions addresses one of the core concerns of statistical pragmatism. Essentially, there’s more uncertainty in the subjunctive statement associated with questions of type (a) than with questions of type (b), and I think it might be useful to remind ourselves and our collaborators of this limitation when dealing with questions of type (a). These concerns might not be the same as those of BASP, but they seem related and are relevant to psychologists, so I thought it might be worth adding to the discussion.
Thanks, Brian. Great clarification. I agree that clustering or construct validation are fundamentally different types of questions than estimating a well-defined population parameter. If the BASP kerfuffle draws attention to this distinction and these limitations, and helps both statisticians and scientists think harder about it, that’d be a good thing.
This is great! Coming from the physical sciences, it is amusing to see how often one group of statisticians will declare what another group does to be totally invalid, and perhaps even morally lapsed. You read the literature on this or that method, applied perhaps for decades, and someone comes along who says it’s been applied wrong or was never valid in the first place. The question the layman then asks is whether statistics, with such sharp epistemological divisions, is to be trusted. I’ve been around long enough to remember when the Bayesians were a small sect. Like sectarians in general, they often shared a sense of being outcast. Now they are beginning to predominate, and we witness BASP’s edict on p-values. Kind of like the contention between the utopian and the scientific socialists.