The process of doing science, math, engineering, etc. is usually way messier than the polished results suggest. Abstruse Goose explains it well:

In pure math, that’s usually fine. As long as your final proof can be verified by others, it doesn’t necessarily matter how you got there yourself.

Now, verifying it might be hard, for example with computer-assisted proofs like that of the Four Color Theorem. And teaching math via the final proof might not be the best way, pedagogically, to develop problem-solving intuition.

But still, a theorem is either true or it isn’t.

However, in the experimental sciences, where real-world data is inherently variable, it’s very rare that you can really say, “I’ve proven that Theory X is true.” Usually the best you can do is to say, “I have strong evidence for Theory X,” or, “Given these results it is reasonable to believe in Theory X.”

(There’s also decision theory: “Do we have enough evidence to **think** that Theory X is true?” is a separate question from “Do we have enough evidence to **act** as if Theory X is true?”)

In these situations, **the way you reached your conclusions** really does affect **how trustworthy they are**.

Andrew Gelman reports that Cornell psychologists have written a nice paper on this topic, focusing on the statistical-testing side of the issue. It’s a quick and worthwhile read.

Some of their recommendations only make sense for limited types of analysis, but in those cases the advice is sensible. I thought that the contrast between their two descriptions of Study 2 (“standard” on p. 2, versus “compliant” on p. 6) was very effective.

I’m not sure what to think of their idea of limiting “researcher degrees of freedom.”
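To make the concern concrete, here’s a minimal simulation sketch (my own illustration, not taken from the paper) of one classic researcher degree of freedom: peeking at the data as it comes in and stopping as soon as a test is “significant.” Even when there is no effect at all, this inflates the nominal 5% false-positive rate well above 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(peek, n_max=100, step=10):
    """Draw pure-noise data (true effect is exactly zero). If `peek`,
    run a t-test after every `step` observations and stop at the
    first p < .05; otherwise test only once, at the full sample."""
    data = rng.normal(size=n_max)
    if not peek:
        return stats.ttest_1samp(data, 0).pvalue < 0.05
    for n in range(step, n_max + 1, step):
        if stats.ttest_1samp(data[:n], 0).pvalue < 0.05:
            return True  # "significant" -- stop collecting and report
    return False

n_sims = 2000
fixed = np.mean([one_experiment(peek=False) for _ in range(n_sims)])
peeking = np.mean([one_experiment(peek=True) for _ in range(n_sims)])
print(f"false-positive rate, fixed n:     {fixed:.3f}")    # near 0.05
print(f"false-positive rate, with peeking: {peeking:.3f}")  # noticeably higher
```

The fixed-sample test holds its advertised error rate; the peeking analyst, running the same t-test, does not. That asymmetry is exactly what makes unreported flexibility dangerous.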

For example, they discourage a Bayesian approach because “Bayesian statistics require making additional judgments (e.g., the prior distribution) on a case-by-case basis, providing yet more researcher degrees of freedom.”

I’m a bit hesitant to say that researchers should be pigeonholed into the standard frequentist toolkit and not allowed to use their best judgment!

If canned frequentist methods are unsuitable for the problem at hand, or underestimate uncertainty relative to a carefully-thought-out, problem-appropriate Bayesian method, then restricting researchers to those canned methods may not leave us better off after all…
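As a toy illustration of the judgment involved (my own sketch, not from the paper): in a conjugate normal model, the prior is indeed an extra choice, but it is an explicit, reportable one that a reader can inspect and challenge.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical small study: n = 8 noisy measurements of some effect.
sigma = 1.0                       # assume the measurement sd is known
data = rng.normal(0.3, sigma, size=8)

# Frequentist point estimate: the sample mean, with its standard error.
mle = data.mean()
se = sigma / np.sqrt(len(data))

# Bayesian conjugate update: prior effect ~ Normal(mu0, tau0^2).
# Choosing mu0 and tau0 is a judgment call -- a "researcher degree of
# freedom" -- but one you can state and defend in the write-up.
mu0, tau0 = 0.0, 0.5              # skeptical prior centered at no effect
post_prec = 1 / tau0**2 + len(data) / sigma**2
post_mean = (mu0 / tau0**2 + data.sum() / sigma**2) / post_prec
post_sd = post_prec ** -0.5

print(f"frequentist: {mle:.3f} +/- {se:.3f}")
print(f"bayesian:    {post_mean:.3f} +/- {post_sd:.3f}")
```

The posterior mean sits between the prior mean and the sample mean, pulled toward skepticism by an assumption the analyst wrote down in the open, which is a rather different thing from undisclosed flexibility.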

However, like the authors of this paper, I do support **better reporting of why** a certain analysis was judged to be the right tool for the job.

Ideally, more of us would know Bayesian methods and could justify the choice between frequentist and Bayesian approaches for the problem at hand, rather than always saying “the frequentist approach is standard” and stopping our thinking there.