# Category Archives: Statistics

## Forget NHST: conference bans all conclusions

Once again, CMU is hosting the illustrious notorious SIGBOVIK conference.

Not to be outdone by the journal editors who banned confidence intervals, the SIGBOVIK 2015 proceedings (p.83) feature a proposal to ban future papers from reporting any conclusions whatsoever:

In other words, from this point forward, BASP papers will only be allowed to include results that “kind of look significant”, but haven’t been vetted by any statistical processes…

This is a bold stance, and I think we, as ACH members, would be remiss if we were to take a stance any less bold. Which is why I propose that SIGBOVIK – from this day forward – should ban conclusions

Of course, even this provision may not be sufficient, since readers may draw their own conclusions from any suggestions, statements, or data presented by authors. Thus, I suggest a phased plan to remove any potential of readers being mislead…

I applaud the author’s courageous leadership. Readers of my own SIGBOVIK 2014 paper on BS inference (with Alex Reinhart) will immediately see the natural synergy between conclusion-free analyses and our own BS.

## Statistics Done Wrong, Alex Reinhart

Hats off to my classmate Alex Reinhart for publishing his first book! Statistics Done Wrong: The Woefully Complete Guide [website, publisher, Amazon] came out this month. It’s a well-written, funny, and useful guide to the most common problems in statistical practice today.

Although most of his examples are geared towards experimental science, most of it is just as valid for readers working in social science, data journalism [if Alberto Cairo likes your book it must be good!], conducting surveys or polls, business analytics, or any other “data science” situation where you’re using a data sample to learn something about the broader world.

This is NOT a how-to book about plugging numbers into the formulas for t-tests and confidence intervals. Rather, the focus is on interpreting these seemingly-arcane statistical results correctly; and on designing your data collection process (experiment, survey, etc.) well in the first place, so that your data analysis will be as straightforward as possible. For example, he really brings home points like these:

• Before you even collect any data, if your planned sample size is too small, you simply can’t expect to learn anything from your study. “The power will be too low,” i.e. the estimates will be too imprecise to be useful.
• For each analysis you do, it’s important to understand commonly-misinterpreted statistical concepts such as p-values, confidence intervals, etc.; else you’re going to mislead yourself about what you can learn from the data.
• If you run a ton of analyses overall and only publish the ones that came out significant, such data-fishing will mostly produce effects that just happened (by chance, in your particular sample) to look bigger than they really are… so you’re fooling yourself and your readers if you don’t account for this problem, leading to bad science and possibly harmful conclusions.

Admittedly, Alex’s physicist background shows in a few spots, when he implies that physicists do everything better (e.g. see my notes below on p.49, p.93, and p.122.)

Seriously though, the advice is good. You can find the correct formulas in any Stats 101 textbook. But Alex’s book is a concise reminder of how to plan a study and to understand the numbers you’re running, full of humor and meaningful, lively case studies.

Highlights and notes-to-self below the break:
Continue reading

## NHST ban followup

I’ve been chatting with classmates about that journal that banned Null Hypothesis Significance Testing (NHST). Some have more charitable interpretations than I did, and I thought they’re worth sharing.

Similarly, a writeup on Nature’s website quoted a psychologist who sees two possibilities here:

“A pessimistic prediction is that it will become a dumping ground for results that people couldn’t publish elsewhere,” he says. “An optimistic prediction is that it might become an outlet for good, descriptive research that was undervalued under the traditional criteria.”

(Also—how does Nature, of all places, get the definition of p-value wrong? “The closer to zero the P value gets, the greater the chance the null hypothesis is false…” Argh. But that’s neither here nor there.)

Here’s our discussion, with Yotam Hechtlinger and Alex Reinhart.

## Very gentle resource for speeding up R code

Nathan Uyttendaele has written a great beginner’s guide to speeding up your R code. Abstract:

Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crush the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in some fields such as statistics. The user then faces total execution times of his codes that are hard to work with: hours, days, even weeks. In this paper, how to reduce the total execution time of various codes will be shown and typical bottlenecks will be discussed. As a last resort, how to run your code on a cluster of computers (most workplaces have one) in order to make use of a larger processing power than the one available on an average computer will also be discussed through two examples.

Unlike many similar guides I’ve seen, this really is aimed at a computing novice. You don’t need to be a master of the command line or a Linux expert (Windows and Mac are addressed too). You are walked through installation of helpful non-R software. There’s even a nice summary of how hardware (hard drives vs RAM vs CPU) all interact to affect your code’s speed. The whole thing is 60 pages, but it’s a quick read, and even just skimming it will probably benefit you.

Favorite parts:

• “The strategy of opening R several times and of breaking down the calculations across these different R instances in order to use more than one core at the same time will also be explored (this strategy is very effective!)” I’d never realized this is possible. He gives some nice advice on how to do it with a small number of R instances (sort of “by hand,” but semi-automated).
• I knew about rm(myLargeObject), but not about needing to run gc() afterwards.
• I haven’t used Rprof before, but now I will.
• There’s helpful advice on how to get started combining C code with R under Windows—including what to install and how to set up the computer.
• The doSMP package sounds great — too bad it’s been removed but I should practice using the parallel and snow packages.
• P.63 has a helpful list of questions to ask when you’re ready to learn using your local cluster.

One thing Uyttendaele could have mentioned, but didn’t, is the use of databases and SQL. These can be used to store really big datasets and pass small pieces of them into R efficiently, instead of loading the whole dataset into RAM at once. Anthony Damico recommends the column-store database system MonetDB and has a nice introduction to using MonetDB with survey data in R.

## Launch party for CMU undergrad stats major programs

So here at CMU, we’re proud to have one of the “largest and fastest-growing” statistics departments in the US.

Tomorrow (March 3rd) is the launch party for several new (joint-)major programs for CMU undergrads: Statistics and Machine Learning, Statistics and Neuroscience, and Mathematical Statistics. That’s in addition to two existing programs: Statistics Core and the joint program in Economics and Statistics.

If you’re in Pittsburgh, come to the launch party at 4:30pm tomorrow. We’ll have project showcases, advising, interactive demos, etc., not to mention free food

## Journal bans null hypothesis testing and confidence intervals

So I’ve complained before about the problems with Null Hypothesis Significance Testing (NHST) and how, in many cases, it’d be more informative and more useful to report confidence intervals instead of p-values.

Well, the journal Basic and Applied Social Psychology has recently decided to ban p-values… but they’ve also tossed out confidence intervals and all the rest of classical statistical inference. And they’re not sold on Bayesian inference either. (Nor does their description of Bayes convince me that they understand it, with weird wordings like “strong grounds for assuming that the numbers really are there.”)

Apparently, instead of choosing another, less common inference flavor (such as likelihood or fiducial inference), they are doing away with rigorous inference altogether and only publishing descriptive statistics. The only measure they explicitly mention to prevent publishing spurious findings is that “we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.” That sounds to me like they know sampling error and inference are important—they just refuse to quantify them, which strikes me as bizarre.

I’m all in favor of larger-than-typical sample sizes, but I’m really curious how they will decide whether they are large enough. Sample sizes need to be planned before the experiment happens, long before you get feedback from the journal editors. If a researcher plans an experiment, hoping to publish in this journal, what guidance do they have on what sample size they will need? Even just doubling the sample size is already often prohibitively expensive, yet it doesn’t even halve the standard error; will that be convincing enough? Or will they only publish Facebook-sized studies with millions of participants (which often have other experimental-design issues)?

Conceivably, they might work out these details and this might still turn out to be a productive change making for a better journal, if the editors are more knowledgeable than the editorial makes them sound, AND if they do actually impose a stricter standard than p<0.05, AND if good research work meeting this standard is ever submitted to the journal. But I worry that, instead, it'll just end up downgrading the journal's quality and reputation, making referees unsure how to review articles without statistical evidence, and making readers unsure how reliable the published results are.

See also the American Statistical Association’s comment on the journal’s new policy, and the reddit discussion (particularly Peter’s response).

Edit: John Kruschke is more hopeful, and Andrew Gelman links to a great paper citing cases of actual harm done by NHST. Again, I’m not trying to defend overuse of p-values—but there are useful and important parts of statistical inference (such as confidence intervals) that cannot be treated rigorously with descriptive statistics alone. And reliance on the interocular trauma test alone just frees up more ways to fiddle with the data to sneak it past reviewers.

## Dataclysm, Christian Rudder

In between project deadlines and homework assignments, I enjoyed taking a break to read Christian Rudder’s Dataclysm. (That’s right, my pleasure-reading break from statistics grad school textbooks is… a different book about statistics. I think I have a problem. Please suggest some good fiction!)

So, Rudder is one of the founders of dating site OkCupid and its quirky, data-driven research blog. His new book is very readable—each short, catchy chapter was hard to put down. I like how he gently alludes to the statistical details for nerds like myself, in a way that shouldn’t overwhelm lay readers. The clean, Tufte-minimalist graphs work quite well and are accompanied by clear writeups. Some of the insights are basically repeats of material already on the blog, but with a cleaner writeup, though there’s plenty of new stuff too. Whether or not you agree with all of his conclusions [edit: see Cathy O’Neil’s valid critiques of the stats analyses here], the book sets a good example to follow for anyone interested in data- or evidence-based popular science writing.

Most of all, I loved his description of statistical precision:

Ironically, with research like this, precision is often less appropriate than a generalization. That’s why I often round findings to the nearest 5 or 10 and the words ‘roughly’ and ‘approximately’ and ‘about’ appear frequently in these pages. When you see in some article that ‘89.6 percent’ of people do x, the real finding is that ‘many’ or ‘nearly all’ or ‘roughly 90 percent’ of them do it, it’s just that the writer probably thought the decimals sounded cooler and more authoritative. The next time a scientist runs the numbers, perhaps the outcome will be 85.2 percent. The next time, maybe it’s 93.4. Look out at the churning ocean and ask yourself exactly which whitecap is ‘sea level.’ It’s a pointless exercise at best. At worst, it’s a misleading one.

I might use that next time I teach.

The description of how academics hunt for data is also spot on: “Data sets move through the research community like yeti—I have a bunch of interesting stuff but I can’t say from where; I heard someone at Temple has tons of Amazon reviews; I think L has a scrape of Facebook.

Sorry I didn’t take many notes this time, but Alberto Cairo’s post on the book links to a few more detailed reviews.

## “Statistical Modeling: The Two Cultures,” Breiman

One highlight of my fall semester is going to be a statistics journal club led by CMU’s Ryan Tibshirani together with his dad Rob Tibshirani (here on sabbatical from Stanford). The journal club will focus on “Hot Ideas in Statistics“: some classic papers that aren’t covered in standard courses, and some newer papers on hot or developing areas. I’m hoping to find time to blog about several of the papers we discuss.

The first paper was Leo Breiman’s “Statistical Modeling: The Two Cultures” (2001) with discussion and rejoinder. This is a very readable, high-level paper about the culture of statistical education and practice, rather than about technical details. I strongly encourage you to read it yourself.

Breiman’s article is quite provocative, encouraging statisticians to downgrade the role of traditional mainstream statistics in favor of a more machine-learning approach. Breiman calls the two approaches “data modeling” and “algorithmic modeling”: Continue reading

## After teaching 1st statistics course

I’ve just finished an exhausting but rewarding 6 weeks teaching a summer-session course on “Experimental Design for Behavioral and Social Sciences,” CMU course 36-309. My course materials are secreted away on Blackboard, but here is my syllabus. You can also see some materials from a previous session here, including Howard Seltman’s textbook (free online).

The students were expected to have already taken an introductory statistics course. After a short review of basic concepts and t-tests, we dove into more intermediate analyses (ANOVA and regression, contrasts, chi-square tests and logistic regression, repeated measures) and into how a good study should be designed (power, internal/external validity, etc.)

I’ve taught one-off statistics workshops before, and I’ve taught once-a-week semester-long Polish language classes, but this was my first experience teaching a full-length course in statistics. Detailed notes are below.

## After 2nd semester of Statistics PhD program

Here’s another post on life as a statistics PhD student (in the Department of Statistics, at Carnegie Mellon University, in Pittsburgh, PA).
The previous such post was After 1st semester of Statistics PhD program.

Classes:

• I feared that Advanced Probability Overview would be just dry esoteric theory, but Jing Lei ensured all the topics were really well-motivated. Although it was tough, I did better than I’d hoped (especially given that I’ve never taken a proper Real Analysis course). In Statistical Machine Learning, Larry Wasserman and Ryan Tibshirani did a great job of balancing “old” core theory with new cutting-edge research topics, including helpful homework assignments that gave us practice both in theory and in applications.
• My highlight of the semester was being able to read and digest a research paper that was way too abstract when I tried reading it a few years ago. It really hit me that I must be learning something in grad school
(The paper was Building Consistent Regression Trees from Complex Sample Data, by Toth and Eltinge. While working at Census, I wanted to try running a complex-survey-weighted regression tree, but I couldn’t get much out of this paper. Now, after a good dose of probability theory and machine learning, it’s far clearer. In fact, I have some ideas about extending this work!)
• The Statistical Machine Learning class referenced a ton of crazy math terms I wasn’t familiar with: Banach and Hilbert spaces, Lp norms, conjugate functions, etc. It terrified me at first—I’ve never even heard of this stuff, should I have taken grad-level functional analysis before I started this PhD, am I about to fail?!?—but it turns out a lot of it is just names for specific versions of general concepts that I already knew. Whew. Also, most of it got used repeatedly from topic to topic, so we did gain familiarity even without explicitly taking a functional analysis course etc. So, don’t get disheartened too easily by unfamiliar terminology!
• It was great to finally learn more about Lp norms and about splines. Also, almost everything in SML can be written as a penalized regression
• Smoothing splines and Reproducing Kernel Hilbert Space (RKHS) regression are nifty because the setup is that you want to optimize over all possible functions. So you start out with an infinite-dimensional space, for which in general there might be no simple way to search/optimize! … But in these specific setups, we can prove that the optimal solution happens to lie in a finite-dimensional subspace, where your usual optimization/search tools will work after all. Nice.
• Larry had a nice “foundations” day in SML, with examples where Bayes and Frequentist analysis differ greatly. However, I didn’t find most of his examples too convincing, since the Bayesian “loses” only due to a stupid choice of priors; or the Bayesian “loses” for finite n but in a case where n in practice would have to be ridiculously large. Still, this helped stretch my thinking about how these inference philosophies differ.
• Larry points out: you often hear that “We might as well go Bayes because if you give people a Frequentist interval, they’ll interpret it as a Bayes interval.” But the reverse is also true: Give someone a sequence of 95% Bayes intervals, and they’ll expect 95% of them to contain the true value. That is NOT necessarily going to happen with Bayes CIs (unlike Frequentist CIs).
• In addition to Subjective, Objective, Empirical, or Calibrated Bayes, let me propose “Cynical Bayes”: Don’t choose a prior because you believe it. Instead, choose one to optimize your estimator’s Frequentist properties. That way you can keep your expert Freq’ist colleagues happy, yet still call it a Bayes estimator, so you can give the usual Bayes interpretation to keep nonexperts happy
• A background in Statistics will keep you thinking about distributions and probabilities and convergences. But a background in Applied Math may be better at giving you tools and ideas for feature engineering. It’s worth having both toolsets.
• The Advanced Probability Overview course covered some measure-theoretic probability. I’m finally understanding the subtleties of how the different convergences $\xrightarrow{p}$, $\xrightarrow{as}$, $\xrightarrow{D}$, and $\xrightarrow{L^p}$ all differ, and why it matters. We saw these concepts last semester in Intermediate Statistics, but the distinctions are far clearer to me now.
• AdvProb’s measure theory section also really helped me understand why textbooks say a random variable is a “function”: intuitively it seems like just a variable or a number or something… but in fact it really is a function, from “the state of the world” i.e. an element $\omega$ of the set $\Omega$ of all possible outcomes or states of the world, to the measurement you will collect (often a number on the real line). Finally, this measure theory view of probability, as the size of a subset of $\Omega$, is helpful. Even though statisticians’ goal is to develop tools that let them work with the range of the random variable and ignore the domain $\Omega$, it’s good to remember that this domain exists.
• However, measure theory and probability theory suffer from some really poor terminology! For example, it took me far too long to realize that “integrable” means “the integral is finite”, NOT “the integral exists.”
• When we teach students R, we really should use practical examples, not the arbitrary generic examples that you see so often. Instead of just showing me list(1,"a"), it helps to give a realistic example of why you may actually need to collect together numeric and character elements in a single object.

Research:

• I started a new research project, the Advanced Data Analysis project, which will run until the end of this upcoming Fall semester (so about a year total). I am working with Rob Kass and Avniel Ghuman on using magnetoencephalography (MEG) data to study epilepsy.
• At Rob’s research group meetings, I learn a ton from the helpful questions he asks. When presenting someone else’s work (i.e. for a journal club), ask yourself, “What would you do if *your* research was based on the data from this paper?” Still, I’ve found I really do need to keep scheduling weekly 1-on-1 meetings—the group meetings are not enough to stay optimally on track.
• Neuroscience is hard! Pre-processing massive neuroscience datasets using not-fully-documented open source software is particularly hard. When I chose this project, I did not realize how much time I would have to spent on learning the subject matter, relevant specialized software tools, and data pre-processing workflow. Four months in and I’ve still barely gotten to the point of doing any “real” statistics. It’s a good project and I’m learning a lot, but it’s disheartening to see how much of that learning has been tied to debugging open-source software installations that I’ll only ever use again if I stay in this sub-field.
I would advise the next PhD cohort to choose projects that’ll primarily teach you more general-purpose, transferable skills. Maybe take an existing theoretical method that’s not implemented in software yet, and make it into an R package?

Life:

• This was a tougher semester in many ways, with harder classes and more research-related setbacks. The Cake song Tougher than it is got a lot of play time on my headphones
• I’m glad that despite my slow posting rate, the blog still kept getting regular traffic—particularly Is a Master’s degree in Statistics worthwhile? I guess it’s a burning question these days.
• A big help to my sanity this semester came from joining the All University Orchestra. After a long week of tough classes and research setbacks, it’s great to switch brain modes and play my clarinet. I’ve really missed playing for the past few years in DC, and I’m glad to get back into it.
• Pittsburgh highlights: Bayernhof museum, Pittsburgh Symphony Orchestra concerts (The Legend of Zelda, “Behind the Notes” talks), Jozsa Corner, Point Brugge Cafe, sampling all the Squirrel Hill pizzerias, MCMC Bar Crawl on the Southside Flats, riding the ridiculously steep inclines, Pittsburgh Area Theater Organ Society concerts and tours of their beautiful theater organ
• Things still on our list to do in Pittsburgh: see a CMU theater performance, Pittsburgh aviary and zoo, Kennywood amusement park, Steelers game, Penguins game
• I look forward to getting a chance to teach a whole course this summer. It’ll be 36-309, Experimental Design. I also took some Eberly Center seminars, and the department organized helpful planning meetings for those of us students who’ll teach in the summer, so I feel reasonably prepared.
I plan to have my students design a series of experiments to bake the ultimate chocolate chip cookie. It will be delicious. I baked Meg Hourihan’s mean chocolate chip cookies for a department event earlier this spring, which seems like an appropriate start.
However, ironically, as the local knitr / reproducible research fanboy… I’m supposed to teach the course using SPSS, which seems to be largely point-and-click, without much support for reproducible reports
• It was a nice difference to be on the other side of the department’s open house for admitted students this year I’m also happy to be reading Grad Cafe forums from a much more relaxed point of view this year!
• I’m surprised there’s not much crossover between the CMU and UPitt statistics departments. And the stats community outside each department doesn’t seem as vibrant as it was in DC. I attended the American Statistical Association’s Pittsburgh chapter banquet. Besides CMU and Pitt folks, most attendees seemed to be RAND employees or independent consultants. There are also some Meetup groups: the Pittsburgh Data Visualization Group and the Pittsburgh useR Group.
• I’ve updated and expanded my CMU blogroll in the sidebar. Please let me know if I missed your CMU/Pittsburgh statistics-related blog!

Other people’s helpful posts on the PhD experience: