NHST ban followup

I’ve been chatting with classmates about that journal that banned Null Hypothesis Significance Testing (NHST). Some have more charitable interpretations than I did, and I thought they were worth sharing.

Similarly, a writeup on Nature’s website quoted a psychologist who sees two possibilities here:

“A pessimistic prediction is that it will become a dumping ground for results that people couldn’t publish elsewhere,” he says. “An optimistic prediction is that it might become an outlet for good, descriptive research that was undervalued under the traditional criteria.”

(Also—how does Nature, of all places, get the definition of p-value wrong? “The closer to zero the P value gets, the greater the chance the null hypothesis is false…” Argh. But that’s neither here nor there.)

Here’s our discussion, with Yotam Hechtlinger and Alex Reinhart.

Yotam:

I’ll play the devil’s advocate. If you are trying to figure out the nature of people’s emotions or thoughts, a clear finding will show up in descriptive statistics and larger sample sizes. They are actually asking for a stricter standard: the effect should be so significant that it is obvious to the naked eye. The current standard, I would guess, is to ask a small number of people a question and draw a conclusion from the fact that the p-value < 0.012. This, sadly, is how we end up with tons of psychology claims that can’t be replicated.

Jerzy:

That would be nice, but what does it mean to be “so significant that it will be obvious to the naked eye”? I have trouble imagining a good simple way to defend such a claim.
Or, if the editors say “We’ll only publish a paper where the estimated effect sizes are huge, AND the sample is huge, AND the standard deviations are tiny,” how should a prospective author decide on their sample size? Do they have to spend all their money on recruiting 1000 subjects instead of, say, making the experimental setup better?

Yotam:

You and I should not be the ones defending a claim of significance. People in the field should. Think of Paleontology, for example. They find some bone, then start arguing whether the finding agrees with current theory or not, and work on developing some consensus.

So I would argue that a significant finding is one that raises a lot of interest among the researchers in the field, enables you to draw conclusions, and provides some way to test or refute those conclusions.

They actually say that pretty nicely in the paper at the link you gave: “… we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research. We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking...” I liked the “liberating” there. In other words they are saying: make something we find interesting, convince us that it’s actually valuable, and we will get you published. Regarding the fact that it will be harder to find effects that actually meet that criterion: good. Research is hard. Publishing a philosophy paper is hard, too. But (arguably) the publication standard in psychology should be raised. I am not certain (to say the least) that p-values or CIs function as good criteria for publication quality. Editors’ interest is just as good and as useful (for the science of psychology!). They have to publish something. Convince them that you are more interesting than the rest, and you’ve got it.

Jerzy:

This discussion is great! But I’m still not convinced 🙂

(1) I agree that it’s not our job to decide what’s *interesting*. But I can come up with a ton of “interesting” findings that are spurious if I use small data sets. Or, if the editors’ only defense against spurious claims is that “we encourage larger than usual sample sizes,” I can just use a big dataset and do data-fishing.

I agree that p-values are *not* ideal for convincing me that your results are non-spurious, but I just don’t understand how the editors will decide what *is* convincing without CIs or something similar. “I found evidence of psychic powers! … in a sample of size 5” is clearly not convincing evidence, even though the effect would be interesting if it were true. So what else will you use to decide what’s convincing vs. what’s spurious, if not statistical inference tools? Throwing out all inference just seems too liberating.

(2) Instead of banning CIs, a better way to raise the publication standard would be to *require* CIs that are tight/precise. This is a better place than sample size (“Did they data-snoop in a database that I think was big enough?”) to give editors/reviewers leeway (“Is this precise enough to be useful?”) and to liberate authors/researchers. Then the reader can learn whether “That’s a precisely-measured but very small effect, so we’re sure it’s negligible” vs. “That’s a precisely-measured large effect, so we can apply this new knowledge appropriately.”

(3) “Significant” is a terrible word that historical statisticians chose. It should be replaced with “statistically demonstrable” or “defensible” or “plausible” or “non-spurious” or “convincing”. It has nothing to do with whether the claimed effect/finding is *interesting* or *large*. It only tells us whether we think the sample size was big enough for it to be worth even discussing yet, or whether more data are needed before we start to discuss it. (In that sense, I agree that the p < 0.05 cutoff is “too easy to pass.”) But you and I *should* have a say in whether something is called “statistically significant.” Our core job, as PhD level statisticians, is basically to develop this and other similar inferential properties of estimators/procedures.

(4) Of course there are cases where sample size is irrelevant. If previously everybody thought that no 2-year-old child can learn to recite the alphabet backwards, then it only takes n=1 such child to be an interesting publication. But that’s just showing that something is possible, not estimating effects or population parameters, which does require stat inference tools.
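(A minimal sketch of point (1), not something from the original conversation: it simulates studies in which the true effect is exactly zero and counts how often the sample mean still looks “large”. The function name, the 0.5 threshold, the unit-variance normal data and the sample sizes are all illustrative assumptions.)

```python
import random
import statistics

def share_of_large_sample_means(n, threshold=0.5, trials=2_000, seed=0):
    """Fraction of simulated null studies whose sample mean looks 'large'.

    Each study draws n observations from a standard normal distribution,
    so the true effect is 0; a 'hit' is a study with |mean| >= threshold.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        if abs(statistics.fmean(sample)) >= threshold:
            hits += 1
    return hits / trials

# Compare how often pure noise produces an impressive-looking mean.
for n in (5, 50, 500):
    print(n, share_of_large_sample_means(n))
```

Under these assumptions, a sizeable share of the n=5 studies report a mean of at least half a standard deviation even though nothing is there, while the larger samples almost never do. Descriptive statistics alone do not tell the reader which situation they are in.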

Alex:

There’s some work (e.g. Cumming’s book “Understanding the new statistics”) on choosing sample sizes to ensure a small CI, rather than a high power. I agree with Jerzy that, without CIs or some other inferential tool, requiring a larger sample size isn’t meaningful—the sample size needed to detect a given difference is often non-intuitive, and without making a CI or calculating power you won’t realize that your sample is inadequate.

Requiring effects to be big enough to be visually obvious also doesn’t cover the opposite problem in inference: when people conclude “I can’t see an effect, so one must not exist.” It’s much better to use a CI to quantify which effect sizes are plausible.
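(To make the idea of planning for precision concrete, here is a rough sketch. It is not Cumming’s procedure, just the textbook normal approximation for the CI of a mean with the standard deviation assumed known; the function names and the example numbers are hypothetical.)

```python
import math
from statistics import NormalDist

def ci_half_width(sd, n, confidence=0.95):
    """Half-width of a normal-approximation CI for a mean: z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sd / math.sqrt(n)

def n_for_ci_half_width(sd, half_width, confidence=0.95):
    """Smallest n whose CI half-width is at most `half_width`.

    Assumes sd is known in advance (in practice it is a guess from
    pilot data or earlier studies).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sd / half_width) ** 2)

# Pinning a mean down to +/- 0.1 sd takes far more subjects than +/- 0.5 sd.
print(n_for_ci_half_width(sd=1.0, half_width=0.5))
print(n_for_ci_half_width(sd=1.0, half_width=0.1))
```

Because the half-width shrinks only with the square root of n, a five-fold gain in precision costs roughly twenty-five times the sample, which is the kind of non-intuitive scaling Alex describes.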

Yotam:

I think the question, or the position I’m taking in this discussion, goes further than CIs. I agree with you that if we are doing statistics, it’s better to do it right, and CIs, especially tight ones, can often provide quite valuable and important information if the problem is stated right. That is, WHEN statistics is used in order to draw important conclusions.

You asked: “So what else will you use to decide what’s convincing vs. what’s spurious, if not statistical inference tools?”

And this, at least to me, is the heart of our discussion. Statistics is not the holy grail for an interesting scientific discovery in all fields. I gave Paleontology as an example. Take History, Philosophy, Math(!), CS (mostly), Chemistry(?), Microeconomics, Business(?), Law, and tons of others. Of course, statistics is widely used in most of those fields, but when people conduct research it is done inside the research community, and they develop theories and schools without statistical significance.

In Psychology, statistics has become the Sheriff for valid research. But de facto, it is not working. Some may say that this is because statistics is not being done right. I think this is a pretty big statement, as there are very smart people there. Even when done perfectly right, the nature of the discovery is different there. In my opinion (without checking), the research questions are too reliant on humans, and more often than not it feels like the research is being held back by the use of the tools.

There are alternatives to the statistical framework. Think about Geology. If some researcher finds something that doesn’t fit current theories, he will point that out and offer an alternative. For his ideas to be accepted, it’s not a matter of small CIs, but a matter of convincing the geological community that this is an important discovery.

Another example—Philosophy. Descartes says something. Hume disagrees. Philosophers can go on and on with logic and stories and claims until one school is more sound, and then move on to research some other interesting problems in the field. You can think of an experiment. You can conduct it and get some statistics—but what will it mean?

Psychology might deserve the same treatment. Freud wasn’t doing any statistics. If you want to state something about human nature, or the human mind, state that. And explain perfectly well why you think that. Whether you show it with experiments, or an interesting story (like in business when they analyze test cases), or with a strong CI is less important. The important part is that you manage to convince the people in your community that your work is interesting and important to the field.

I think that what troubles you about my position (correct me if I’m wrong) is that I state that Psychology’s CIs don’t mean a lot. But that comes not from disrespecting psychology’s research, but from understanding statistics’ limitations. I think that after thousands of research papers, stating almost everything, and exactly the opposite, in a very significant way, maybe psychologists can use a change and do exactly as the editors ask them to do:
“Convince us in a clear and creative way that you are doing something important and interesting without stating p < 0.002 with 89% power. We have enough of those types of claims. Find some other way to get our interest.” How would you do that without inference? I guess with logic, experience, knowledge and common sense.

Alex:

Perhaps another way to state that is that psychology develops theories which are not easy to test statistically. Paul Meehl wrote a great paper in the 60s, “Theory-testing in psychology and physics: A methodological paradox”, which argues that statistical tests of psychological theories typically don’t provide much evidence of anything.

Jerzy:

Yotam, I agree 100% that there is scope for other kinds of research than the numerical experiments which statistics can be applied to. Yes, more people should be encouraged to observe interesting things that are not data-driven (like digging up an unknown kind of bone) and invent new theories that have no reliance on statistical inference, just “logic, experience, knowledge and common sense.”

But in this particular journal editorial, they don’t seem to be talking about that. They say:
“Are any inferential statistical procedures required? No, because the state of the art remains uncertain. However, BASP will require strong descriptive statistics, including effect sizes… we encourage the use of larger sample sizes…”

So in their own words, they plan to keep focusing on publishing studies that rely on large samples and are interpreted in terms of statistical analysis. They are *not* talking about the studies that you describe (a business case-study, a new theory of mind, a chemical lab-bench experiment, a newly-discovered species). They *want* to publish statistics—they just don’t want to publish any rigorous inferential info along with them.

Again, I fully agree that the direction you propose is valuable. But these journal authors aren’t proposing that! They propose to keep demanding statistical evidence, but ignore the measures of quality that distinguish better vs worse statistical evidence. Right?

Yotam:

I see what you’re saying. Well, I am not certain about their publishing criteria. It can be read as if they insist on using “bad statistics” from now on since it’s simpler for the researcher, which is obviously a mistake.

But I read that a bit differently, and I think that since they are taking such a big step it’s better to give them the benefit of the doubt. I read their message as: “Forget about statistics. You are liberated from those tools. Do interesting, creative experiments, and if you find something cool, your results will follow from descriptive statistics.”

This is somewhat different from an editor requesting that the researcher do statistical research and publish statistics. The way I read it (which obviously can be wrong) they *want* to publish psychology, and statistics can be used in the process to demonstrate your claims. It might turn out to be too liberal an interpretation of this specific journal, I’m not sure, and it will depend on the type of papers they are going to publish from now on.

In any case, my main point is that by easing the statistical standard on psychology research, psychology can only benefit. By forcing the researcher to do statistics “right” you’ll end up with psychology journals publishing statistics that usually don’t mean a lot. This is why I think this is a step in the right direction (and maybe not far enough).

Of course there is quantitative work being done in psychology experiments, but I think the nature of the claims, and the interest they raise, should focus more on psychology and less on statistics. I do not know a lot about behavioral psychology or psychology in general, so I might be wrong about that, but the first experiment I can think of is the Stanford prison experiment, where they made a bunch of students prisoners and guards, and watched how the students behaved. Now this is an interesting experiment on human nature, and you do not need any p-values to discuss its meaning or results. I know this is from the 70’s and an IRB would never approve something like that again, but shouldn’t that be the type of research the researcher is encouraged and focused on doing? Why is the statistics important there?

If we agree about this general (more radical) claim about statistics in psychology, I’m fine with discussing the intentions of this specific journal at a later time, or over a beer, if that is cool with you. My main claim is that the solution to the replicability crisis in the field is not to do statistics “better”, but rather to give the researcher enough (or total) statistical slack, and focus on psychology. Claims which are statistically stupid would fail simply because it won’t be possible to spread those claims through the psychological community. Not because the CI is too loose.

Jerzy:

“discussing … over a beer” == yes!

So, feel free to continue in the comments, or find us over beers 🙂

2 responses to “NHST ban followup”

  1. The part that stood out for me is Yotam’s comment that “The important part is that you manage to convince the people in your community that your work is interesting and important to the field. … How would you do that without inference? I guess with logic, experience, knowledge and common sense.”

    After all, nobody is convinced of anything from a single study. We take our existing knowledge, and update it a little based on the study before us. This is subjective, and belief has a continuous range of values. If you have a study that finds something with 94% confidence, that’s still information that should raise our confidence in the finding, albeit 1% less than a 95%-confidence study would. On the other hand, a significant finding is still only in the context of one experimental design, and we’re back to logic, experience, knowledge, and common sense as to whether the overall study generalizes in a valid way.

    But you know who is convinced by a past-the-post p-value? Journalists. Partisans who want a paper to ‘prove’ what they already believe. People who need something to make their pop psychology book sound weightier. The people who don’t look at context and for whom a single study will do. As a newspaper reader, I get the impression that psychology journals have to deal with this crowd a lot more than researchers in other fields, and I could picture an exasperated editor throwing up his or her hands and saying ‘forget it, we’re not going to give you a shortcut to dealing with complexity.’

    I wouldn’t have done this myself if I were in their shoes, and I agree with many of the points made above and could add my own list of problems. But I think it’s great that the editors did this, because it’s the bold step that got the conversation flowing. Journals _do_ have an overreliance on p-values, and though a ban is probably not the right solution, it is at least an attempt.

  2. The arguments made in that Meehl paper are very important. I do not think that researchers realize the danger they are in by making claims based off “rejecting the null hypothesis”. Can you think of a case where there is no possible uninteresting alternative explanation for a scientific result that is inconsistent with “chance”? I have found that, in biomed, often I can also interpret the “non-chance” result in a way diametrically opposed to the favored explanation (eg the treatment effect is a “good” thing rather than a “bad” thing).

    Ruling out a number of “non-chance” explanations is much more difficult, often requiring more detailed data collection and clever experimental design, than simply ruling out chance. The thing is that a study that achieves the former will also achieve the latter, but not vice versa. For that reason, I doubt the utility of considering a null hypothesis at all for most cases and definitely see that a focus on this has caused great waste and damage.

    The answer is to return to the pre-NHST ways of doing science. Collect data regarding a topic in more and more detail until someone comes up with a model that makes predictions. Then check these predictions against new data. Most often if the data is detailed enough there will be multiple competing explanations to compare. Once again, if the data can distinguish between real explanations it will also be able to rule out chance.