No disrespect meant to Martin, his readers, or their families—it’s just a thought exercise that intrigued me, and I figured it may interest other people.

Also, we’ve blogged about GoT and statistics before.

In the Spring a young man’s fancy lightly turns to actuarial tables.

That’s right: Spring is the time of year when the next bloody season of *Game of Thrones* airs. This means the internet is awash with death counts from the show and survival predictions for the characters still alive.

Others, more pessimistically, wonder about the health of George R. R. Martin, author of the *A Song of Ice and Fire* (*ASOIAF*) book series (on which *Game of Thrones* is based). Some worried readers compare Martin to Robert Jordan, who passed away after writing the 11th *Wheel of Time* book, leaving 3 more books to be finished posthumously. Martin’s trilogy has become 5 books so far and is supposed to end at 7, unless it’s 8… so who really knows how long it’ll take.

(Understandably, Martin responds emphatically to these concerns. And after all, Martin and Jordan are *completely different* aging white American men who love beards and hats and are known for writing phone-book-sized fantasy novels that started out as intended trilogies but got out of hand. So, basically no similarities whatsoever.)

But besides the author and his characters, there’s another set of deaths to consider. The books will get finished eventually. But how many **readers** will have passed away waiting for that ending? Let’s take a look.

Caveat: the inputs are uncertain, the process is handwavy, and the outputs are certainly wrong. This is all purely for fun (depressing as it may be).

So, we’ll need to answer a few questions. How do we define readers? How many readers are there? What are their demographics? And what are the mortality statistics for those demographics?

**Readers:** By the fall of 2013, around 24 million ASOIAF books had been sold in North America, but that includes all 5 books (so far). Furthermore, it seems that book sales went through the roof once the HBO show began in 2011, which will make it really hard to estimate trends in readership over time after that year. 2011 is also the year when the latest book in the series, *A Dance With Dragons* (*ADWD*), was published.

The first book in the series reached its one-millionth US-paperback-edition copy by the fall of 2010. So for simplicity’s sake, let’s say that by the end of 2010, there were at least 1,000,000 US readers. [This misses people who bought the hardcover instead or who read it as a library book; and this overcounts people who bought but never read it or who didn’t like it enough to continue the series. Still, it’s a nice round number and probably the right order of magnitude.] These are the grizzled veterans who were already fans before the HBO show (a.k.a. the hipsters who liked it before it was cool). Some started when the first book appeared in 1996, others as late as 2010. Let’s call these 1,000,000 US residents our core ASOIAF readers who really want to know how the book series ends. How many of them had passed away before the HBO show began and ADWD came out?

The first US printing of the first book had around 50,000 copies according to Martin. Instead of thoroughly researching how book readership tends to grow, let’s assume it’s linear (another crass oversimplification). Then over the 15 years from 1996 to 2010 (inclusive), we’d have to add around 65,000 readers a year to reach 1 million total readers. Since that’s not too far from the first print run, let’s go with it: we have 65,000 new ASOIAF readers every year over 15 years.
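As a quick sanity check on that arithmetic (a trivial sketch using the round numbers above):

```r
# Linear growth over 1996-2010 (15 years) to reach 1,000,000 total readers
1e6 / 15   # about 66,667 new readers per year; round down to 65,000
```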

**Demographics:** I can’t find demographics for Martin’s readers specifically, but there are a few demographic summaries of fantasy readers in general. Readers of *The Magazine of Fantasy & Science Fiction* and *Lightspeed Magazine* seem to be roughly 10% ages 18-24, 50% ages 26-45, 30% ages 46-55, which I guess leaves around 10% aged 56 or older. (That ignores people under 18, but let’s face it, kids probably shouldn’t be reading the gruesome *Game of Thrones* anyway.) The sex breakdown is roughly 60% male, 40% female. Let’s assume this age/sex breakdown holds for our million ASOIAF readers, though here are several other summaries with slightly different demographics.

**Mortality:** Okay, it’s time for the morbid part. Here are some Death Rates by Age and Sex (we’ll ignore Race since I didn’t find those reader demographics). None of them have changed dramatically since around 2000, close to the first book’s publication date, so let’s just use the latest 2008 numbers. The age breakdowns here don’t match ours exactly, so let’s also average together the rates for age groups we need to combine. Table 2 here suggests 25-34 and 35-44 had roughly similar numbers of people, so we can take a simple average of their death rates to get the 26-45 rate. But for people 56+, we’ll do a weighted average, weighted by the approximate population in each death-rate category. Using very rough weights of 25 (million) in population for 55-64, 20 for 65-74, 10 for 75-84, and 5 for 85+, we get

`(25*1000 + 20*2500 + 10*6000 + 5*14000) / (25 + 20 + 10 + 5)`

or around 3500 for the 56+ male death rate. For females, it’s

`(25*1000 + 20*2000 + 10*4000 + 5*12500) / (25 + 20 + 10 + 5)`

or around 2800.
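If you want to double-check that arithmetic, here is the same weighted average as a quick R sketch, using the rough weights and rates quoted above:

```r
# Rough population weights (millions) for ages 55-64, 65-74, 75-84, and 85+,
# with the approximate death rates per 100,000 quoted above
PopWeights  = c(25, 20, 10, 5)
MaleRates   = c(1000, 2500, 6000, 14000)
FemaleRates = c(1000, 2000, 4000, 12500)

weighted.mean(MaleRates, PopWeights)    # about 3400, call it 3500
weighted.mean(FemaleRates, PopWeights)  # about 2800
```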

Finally, the table gives death rates per 100,000 population, but let’s translate them to percent of people who will pass away that year. The results are

Males: .11% for 18-25, .18% for 26-45, .53% for 46-55, 3.5% for 56+

Females: .04% for 18-25, .06% for 26-45, .23% for 46-55, 2.8% for 56+

Let’s run these numbers through R.

    # Death rates (percent of people in that group who die in a given year)
    # rounded or estimated from Census tables
    DeathRatesVec = c(.11, .18, .53, 3.5, .04, .06, .23, 2.8)
    DeathRates = matrix(DeathRatesVec, 4, 2) / 100
    colnames(DeathRates) = c("M", "F")
    rownames(DeathRates) = c("18-24", "25-44", "45-54", "55+")
    DeathRates
    ##            M      F
    ## 18-24 0.0011 0.0004
    ## 25-44 0.0018 0.0006
    ## 45-54 0.0053 0.0023
    ## 55+   0.0350 0.0280

    # Number of readers in each age/sex group
    # estimated from fantasy magazine reader polls
    AgePcts = c(.1, .5, .3, .1)
    SexPcts = c(.6, .4)
    ReadersPerYear = t(65000 * rbind(AgePcts, AgePcts) * SexPcts)
    colnames(ReadersPerYear) = colnames(DeathRates)
    rownames(ReadersPerYear) = rownames(DeathRates)
    ReadersPerYear
    ##           M     F
    ## 18-24  3900  2600
    ## 25-44 19500 13000
    ## 45-54 11700  7800
    ## 55+    3900  2600

    # Function to estimate the number of readers who die
    # within a certain number of years
    NrDeathsByYearsLeft = function(YearsLeft) {
      sum(ReadersPerYear - ReadersPerYear * (1 - DeathRates) ^ YearsLeft)
    }

    # Total number of reader deaths from 1996 through 2010
    FirstYear = 1996
    FinalYear = 2010
    TotalYears = FinalYear - FirstYear + 1
    DeathsByYearStarted = sapply(TotalYears:1, NrDeathsByYearsLeft)
    round(sum(DeathsByYearStarted))
    ## [1] 36814
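(The formula inside `NrDeathsByYearsLeft` is just compound survival: if N readers each face an annual death rate r, then about N(1 − r)^t of them should still be alive t years later, so roughly N − N(1 − r)^t of them have died by then.)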

So it looks like almost 40,000 veteran readers didn’t survive even until *ADWD* was published or the HBO show aired. This is on the order of 100 times the number of characters who’ve died, whether in the show or in the books.

Finally, let’s show the breakdown by year, since we already calculated it above:

    # Number of deaths by 2010,
    # broken out by the year in which they started reading
    plot(FirstYear:FinalYear, DeathsByYearStarted, type='h',
         xlab = 'Year', ylab = 'Deaths',
         main = 'Number of readers deceased by 2010\nwho started in a given year')

(The trend looks perfectly linear just because we assumed linear growth in the number of readers and stable demographics over time.)

No deep insights here. There’s just the stark (hah!) realization that a substantial number of Martin’s earliest readers have not survived the wait.

Let’s not worry about which characters will die; let’s not hurry Martin as he writes. Let us just savor our time on Earth before we make the same journey ourselves. After all, valar morghulis.

PS—A helpful librarian friend tells me that the Carnegie Library of Pittsburgh system currently has 102 copies of the books (acquired from 2002 onwards), with about 2,300 checkouts altogether. This could be extrapolated to estimate US readership by library patrons who didn’t actually buy the book. At some point I may also go through her data to see how readership seems to have changed over time (i.e., the number of checkouts over time for older vs. newer copies).

*Manual trackback: Partially Derivative ep. 20 (around 33:09); FlowingData*

The slides introduce a few variants of the simplest area-level (Fay-Herriot) model, analyzing the same dataset in a few different ways. The slides also explain some basic concepts behind Bayesian inference and MCMC, since the target audience wasn’t expected to be familiar with these topics.

- Part 1: the basic Frequentist area-level model; how to estimate it; model checking (pdf)
- Part 2: overview of Bayes and MCMC; model checking; how to estimate the basic Bayesian area-level model (pdf)
- All slides, data, and code (ZIP)

The code for all the Frequentist analyses is in SAS. There’s R code too, but only for a WinBUGS example of a Bayesian analysis (also repeated in SAS). One day I’ll redo the whole thing in R, but it’s not at the top of the list right now.
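In the meantime, here is a very rough base-R sketch of the basic area-level model, fit with a simple moment estimator of the model error variance. This is my own illustration of the general idea (untested against the slides’ output), not a translation of the SAS code:

```r
# Basic Fay-Herriot area-level model:
#   y_i = x_i' beta + u_i + e_i,  u_i ~ N(0, sig2u),  e_i ~ N(0, D_i) with D_i known.
# sig2u comes from a simple Prasad-Rao-style moment step, then EBLUPs follow.
fit_fh = function(y, X, D) {
  m = length(y); p = ncol(X)
  # OLS step: residuals and leverages
  XtXinv = solve(crossprod(X))
  h = diag(X %*% XtXinv %*% t(X))
  resid = y - X %*% (XtXinv %*% crossprod(X, y))
  # Moment estimator of the model error variance, truncated at zero
  sig2u = max(0, (sum(resid^2) - sum(D * (1 - h))) / (m - p))
  # Weighted least squares estimate of beta given the variance components
  w = 1 / (sig2u + D)
  beta = solve(crossprod(X, w * X), crossprod(X, w * y))
  # Shrinkage weights and EBLUP predictions for each small area
  gam = sig2u / (sig2u + D)
  list(beta = drop(beta), sig2u = sig2u,
       eblup = gam * y + (1 - gam) * drop(X %*% beta))
}
```

In practice I’d lean on an existing package (such as `sae`) rather than this by-hand version, for the same reasons the slides recommend the mixed-model route over the “ByHand” one.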

Frequentist examples:

- “ByHand” where we compute the Prasad-Rao estimator of the model error variance (just for illustrative purposes since all the steps are explicit and simpler to follow; but not something I’d usually recommend in practice)
- “ProcMixed” where we use mixed modeling to estimate the model error variance at the same time as everything else (a better way to go in practice; but the details get swept up under the hood)

Bayesian examples:

- “ProcMCMC” and “ProcMCMC_alt” where we use SAS to fit essentially the same model parameterized in a few different ways, some of whose chains converge better than others
- “R_WinBUGS” where we do the same but using R to call WinBUGS instead of using SAS

The example data comes from Mukhopadhyay and McDowell, “Small Area Estimation for Survey Data Analysis using SAS Software” [pdf].

If you try the code, I’d appreciate hearing whether it still runs.

My SAE resources page still includes a broader set of tutorials/textbooks/examples.

Not to be outdone by the journal editors who banned confidence intervals, the SIGBOVIK 2015 proceedings (p.83) feature a proposal to ban future papers from reporting any conclusions whatsoever:

In other words, from this point forward, BASP papers will only be allowed to include results that “kind of look significant”, but haven’t been vetted by any statistical processes…

This is a bold stance, and I think we, as ACH members, would be remiss if we were to take a stance any less bold. Which is why I propose that SIGBOVIK – from this day forward –

should ban conclusions…Of course, even this provision may not be sufficient, since readers may draw their own conclusions from any suggestions, statements, or data presented by authors. Thus, I suggest a phased plan to remove any potential of readers being mislead…

I applaud the author’s courageous leadership. Readers of my own SIGBOVIK 2014 paper on BS inference (with Alex Reinhart) will immediately see the natural synergy between conclusion-free analyses and our own BS.

Although most of his examples are geared toward experimental science, the advice is just as valid for readers working in social science, data journalism [if Alberto Cairo likes your book it must be good!], conducting surveys or polls, business analytics, or any other “data science” situation where you’re using a data sample to learn something about the broader world.

This is NOT a how-to book about plugging numbers into the formulas for t-tests and confidence intervals. Rather, the focus is on *interpreting* these seemingly-arcane statistical results correctly; and on *designing your data collection process* (experiment, survey, etc.) well in the first place, so that your data analysis will be as straightforward as possible. For example, he really brings home points like these:

- Before you even collect any data, if your planned sample size is too small, you simply can’t expect to learn anything from your study. “The power will be too low,” i.e. the estimates will be too imprecise to be useful.
- For each analysis you do, it’s important to understand commonly-misinterpreted statistical concepts such as p-values, confidence intervals, etc.; else you’re going to mislead yourself about what you can learn from the data.
- If you run a ton of analyses overall and only publish the ones that came out significant, such data-fishing will mostly produce effects that just happened (by chance, in your particular sample) to look bigger than they really are… so you’re fooling yourself and your readers if you don’t account for this problem, leading to bad science and possibly harmful conclusions.

Admittedly, Alex’s physicist background shows in a few spots, when he implies that physicists do everything better (e.g. see my notes below on p.49, p.93, and p.122.)

Seriously though, the advice is good. You can find the correct formulas in any Stats 101 textbook. But Alex’s book is a concise reminder of how to plan a study and to understand the numbers you’re running, full of humor and meaningful, lively case studies.

Highlights and notes-to-self below the break:

- p.7: “We will always observe *some* difference due to luck and random variation, so statisticians talk about *statistically significant* differences when the difference is larger than could easily be produced by luck.”

Larger? Well, sorta but not quite… Right on the first page of chapter one, Alex makes the same misleading implication that he’ll later spend pages and pages dispelling: “Statistically significant” isn’t really meant to imply that the difference is *large*, so much as it’s *measured precisely*. That sounds like a quibble but it’s a real problem. If our historical statistician-forebears had chosen “precise” or “convincing” (implying a focus on how well the measurement was done) instead of “significant” (which sounds like a statement about the size of what was measured), maybe we could avoid confusion and wouldn’t need books like Alex’s.

- p.8: “More likely, your medication actually works.” Another nitpick: “more plausibly” might be better, avoiding some of the statistical baggage around the word “likely,” unless you’re explicitly being Bayesian… OK, that’s enough nitpicking for now.
- p.9: “This is a counterintuitive feature of significance testing: if you want to prove that your drug works, you do so by showing that the data is *in*consistent with the drug *not* working.” Very nice and clear summary of the confusing nature of hypothesis tests.
- p.10: “This is troubling: two experiments can collect identical data but result in different conclusions. Somehow, the *p* value can read your intentions.” This is an example of how frequentist inference can violate the likelihood principle. I’ve never seen why this should bother us: the p-value is meant to inform us about how well *this* experiment *measures* the parameter, not about what the parameter itself is… so it’s not surprising that a differently-designed experiment (even if it happened to get equivalent data) would give a different p-value.
- p.19: “In the prestigious journals *Science* and *Nature*, fewer than 3% of articles calculate statistical power before starting their study.” In the “Sampling, Survey, and Society” class here at CMU, we ask undergraduates to conduct a real survey on campus. Maybe next year we should suggest that they study how many of our professors do power calculations in advance.
- p.19: “An ethical review board should not approve a trial if it knows the trial is unable to detect the effect it is looking for.” A great point—all IRB members ought to read *Statistics Done Wrong*!
- p.20-21: A good section on why underpowered studies are so common. Researchers don’t realize their studies are too small; as long as they find *something* significant among the tons of comparisons they run, they feel the study was powerful enough; they do multiple-comparisons corrections (which is good!), but don’t account for the fact that they *will* do these corrections when computing power; and even with best intentions, power calculations are hard.
- p.23: Yes! Instead of computing power, we should do the equivalent to achieve a desired confidence interval width (sufficiently narrow). I hadn’t heard of this handy term “assurance, which determines how often the confidence interval must beat our target width.”

I’ve only seen this approach rarely, mostly with survey/poll sample size planning: if you want to say “Candidate A has X% of the vote (plus or minus 3%)” then you can calculate the appropriate sample size to ensure it really will be a 3% margin of error, not 5% or 10%. (A quick sketch of that calculation follows these notes.)

I would much rather teach my students to design an experiment with high assurance than to compute power… assuming we can find or create good assurance-calculation software for them: “Sample size selection methods based on assurance have been developed for many common statistical tests, though not for all; it is a new field, and statisticians have yet to fully explore it. (These methods go by the name *accuracy in parameter estimation*, or *AIPE*.)”

- p.49-50: “Particle physicists call this the *look-elsewhere effect*… they are searching for anomalies across a large swath of energies, any one of which could have produced a false positive. Physicists have developed complicated procedures to account for this and correctly limit the false positive rate.” Are these any different from what statisticians do? The cited reference may be worth a read: Gross and Vitells, “Trial factors for the look elsewhere effect in high energy physics.”
- p.52: “One review of 241 fMRI studies found that they used 207 unique combinations of statistical methods, data collection strategies, and multiple comparison corrections, giving researchers great flexibility to achieve statistically significant results.” Sounds like some of the issues I’ve seen when working with neuroscientists on underpowered and overcomplicated (tiny n, huge p) studies are widespread. Cited reference: Carp, “The secret lives of experiments: methods reporting in the fMRI literature.” See also the classic dead-salmon study (“The salmon was asked to determine what emotion the individual in the photo must have been experiencing…”)
- p.60: “And because standard error bars are about half as wide as the 95% confidence interval, many papers will report ‘standard error bars’ that actually span *two* standard errors above and below the mean, making a confidence interval instead.” Wow—I knew this is a confusing topic for many people, but I didn’t know this actually happens in practice so often. Never show any bars without clearly labeling what they are! Standard deviation, standard error, confidence interval (which level?) or what?
- p.61: “A survey of psychologists, neuroscientists, and medical researchers found that the majority judged significance by confidence interval overlap, with many scientists confusing standard errors, standard deviations, and confidence intervals.” Yeah, no kidding. Our nomenclature is terrible. Statisticians need to hire a spin doctor. Cited reference: Belia et al., “Researchers misunderstand confidence intervals and standard error bars.”
- p.62: “Other procedures [for comparing confidence intervals] handle more general cases, but only approximately and not in ways that can easily be plotted.” Let me humbly suggest a research report I co-wrote with my (former) colleagues at the Census Bureau, covering several ways to visually compare confidence intervals or otherwise make appropriate multiple comparisons visually: Wright, Klein, and Wieczorek, “Ranking Population Based on Sample Survey Data.”
- p.90: Very helpful list of what to think about when preparing to design, implement, and analyze a study. There are many “researcher degrees of freedom” and more people are now arguing that at least some of these decisions should be made before seeing the data, to avoid excessive flexibility. I’ll use this list next time I teach experimental design: “What do I measure? Which variables do I adjust for? Which cases do I exclude? How do I define groups? What about missing data? How much data should I collect?”
- p.93-94: “particle physicists have begun performing *blind analyses*: the scientists analyzing the data avoid calculating the value of interest until after the analysis procedure is finalized.” This may be built into how the data are collected; “Other blinding techniques include adding a constant to all measurements, keeping this constant hidden from analysts until the analysis is finalized; having independent groups perform separate parts of the analysis and only later combining their results; or using simulations to inject false data that is later removed.” Examples in medicine are discussed too, such as drafting a “clinical trial protocol.”
- p.109: Nice—I didn’t know about Figshare and Dryad, which let you upload data and plots to encourage others to use and cite them. “To encourage sharing, submissions are given a digital object identifier (DOI), a unique ID commonly used to cite journal articles; this makes it easy to cite the original creators of the data when reusing it, giving them academic credit for their hard work.”
- p.122-123: Research results from physics education: “lectures do not suit how students learn,” so “How can we best teach our students to analyze data and make reasonable statistical inferences?” Use peer instruction and “force students to confront and correct their misconceptions… Forced to choose an answer and discuss why they believe it is true *before* the instructor reveals the correct answer, students immediately see when their misconceptions do not match reality, and instructors spot problems before they grow.”

I tried using such an approach when I taught last summer, and I found it very useful although a bit tricky since I couldn’t find a good bank of misconception-revealing questions for statistics that’s equivalent to the physicists’ Force Concept Inventory. But for next time I’ll check out the “Comprehensive Assessment of Outcomes in Statistics” that Alex mentions: delMas et al., “Assessing students’ conceptual understanding after a first course in statistics.”

I’ve also just heard about the LOCUS test (Levels of Conceptual Understanding in Statistics)—may also be worth a look.

- p.128: “A statistician should be a collaborator in your research, not a replacement for Microsoft Excel.” Truth.

I admit I don’t like how Alex suggests most statisticians will do work for you “in exchange for some chocolates or a beer.” I mean, yes, it’s true, but let’s not tell everybody that the going rate is so low! Surely my advice is worth at least a six-pack.
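Following up on the p.23 note above, here is that margin-of-error sample size calculation as a minimal R sketch (my own illustration, not code from the book), under simple random sampling with the usual worst-case proportion of 0.5:

```r
# Sample size needed for a given margin of error on a proportion,
# assuming simple random sampling and the conservative p = 0.5
n_for_moe = function(moe, p = 0.5, conf = 0.95) {
  z = qnorm(1 - (1 - conf) / 2)
  ceiling(z^2 * p * (1 - p) / moe^2)
}
n_for_moe(0.03)  # 1068 respondents for a +/- 3% margin of error
n_for_moe(0.05)  # 385 for +/- 5%
```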

Similarly, a writeup on Nature’s website quoted a psychologist who sees two possibilities here:

“A pessimistic prediction is that it will become a dumping ground for results that people couldn’t publish elsewhere,” he says. “An optimistic prediction is that it might become an outlet for good, descriptive research that was undervalued under the traditional criteria.”

(Also—how does Nature, of all places, get the definition of p-value wrong? “The closer to zero the P value gets, the greater the chance the null hypothesis is false…” Argh. But that’s neither here nor there.)

Here’s our discussion, with Yotam Hechtlinger and Alex Reinhart.

Yotam:

I’ll play the devil’s advocate. If you’re trying to figure out the nature of people’s emotions or thoughts, a clear finding will show up in descriptive statistics with a large enough sample size. They are actually requesting a stricter standard—it should be so significant that it is obvious to the naked eye. The usual alternative is to ask a small number of people a question and draw a conclusion from the fact that the p-value < 0.012; this, sadly, is why tons of psychology claims can't be replicated.

Jerzy:

That would be nice, but what does it mean to be “so significant that it will be obvious to the naked eye”? I have trouble imagining a good simple way to defend such a claim.

Or, if the editors say “We’ll only publish a paper where the estimated effect sizes are huge, AND the sample is huge, AND the standard deviations are tiny,” how should a prospective author decide on their sample size? Do they have to spend all their money on recruiting 1000 subjects instead of, say, making the experimental setup better?

Yotam:

You and I should not be the ones defending a claim of significance. People in the field should. Think of Paleontology, for example. They find some bone, then start arguing about whether the finding agrees with current theory or not, and work on developing some consensus.

So I would argue that a significant finding is one that raises lots of interest among researchers in the field, lets you draw conclusions, and provides some way to test or refute those conclusions.

They actually say that pretty nicely in the paper at the link you gave: “… we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research. We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking..."

I liked the “liberating” there. In other words, they are saying: make something we find interesting, convince us that it’s actually valuable, and we will publish it.

As for the fact that it will be harder to find effects that actually meet that criterion: good. Research is hard. Publishing a philosophy paper is also hard. But (arguably) the publication standard in psychology should be raised. I am not certain (to say the least) that p-values or CIs function as good criteria for publication quality. Editors’ interest is just as good and as useful (for the science of psychology!). They have to publish something; convince them that you are more interesting than the rest, and you’ve got it.

Jerzy:

This discussion is great! But I’m still not convinced.

(1) I agree that it’s not our job to decide what’s *interesting*. But I can come up with a ton of “interesting” findings that are spurious if I use small data sets. Or, if the editors’ only defense against spurious claims is that “we encourage larger than usual sample sizes,” I can just use a big dataset and do data-fishing.

I agree that p-values are *not* ideal for convincing me that your results are non-spurious, but I just don’t understand how the editors will decide what *is* convincing without CIs or something similar. “I found evidence of psychic powers! … in a sample of size 5” is clearly not convincing evidence, even though the effect would be interesting if it were true. So what else will you use to decide what’s convincing vs. what’s spurious, if not statistical inference tools? Throwing out all inference just seems too liberating.

(2) Instead of banning CIs, a better way to raise the publication standard would be to *require* CIs that are tight/precise. This is a better place to give editors/reviewers leeway (“Is this precise enough to be useful?”) than sample size (“Did they data-snoop in a database that I think was big enough?”) and liberate authors/researchers. Then the reader can learn whether “That’s a precisely-measured but very small effect, so we’re sure it’s negligible” vs. “That’s a precisely-measured large effect, so we can apply this new knowledge appropriately.”

(3) “Significant” is a terrible word that historical statisticians chose. It should be replaced with “statistically demonstrable” or “defensible” or “plausible” or “non-spurious” or “convincing”. It has nothing to do with whether the claimed effect/finding is *interesting* or *large*. It only tells us whether we think the sample size was big enough for it to be worth even discussing yet, or whether more data are needed before we start to discuss it. (In that sense, I agree that the p < 0.05 cutoff is "too easy to pass.")

But you and I *should* have a say in whether something is called "statistically significant." Our core job, as PhD level statisticians, is basically to develop this and other similar inferential properties of estimators/procedures.

(4) Of course there are cases where sample size is irrelevant. If previously everybody thought that no 2-year-old child can learn to recite the alphabet backwards, then it only takes n=1 such children to be an interesting publication. But that's just showing that something is possible, not estimating effects or population parameters, which does require stat inference tools.

Alex:

There’s some work (e.g. Cumming’s book “Understanding the new statistics”) on choosing sample sizes to ensure a small CI, rather than a high power. I agree with Jerzy that, without CIs or some other inferential tool, requiring a larger sample size isn’t meaningful—the sample size needed to detect a given difference is often non-intuitive, and without making a CI or calculating power you won’t realize that your sample is inadequate.

Requiring effects to be big enough to be visually obvious also doesn’t cover the opposite problem in inference: when people conclude “I can’t see an effect, so one must not exist.” It’s much better to use a CI to quantify which effect sizes are plausible.

Yotam:

I think the position I’m taking in this discussion goes further than CIs. I agree with you that if we are doing statistics, it’s better to do it right, and CIs, especially tight ones, can often provide quite valuable and important information if the problem is stated right. That is, WHEN statistics is being used to draw important conclusions.

You asked: “So what else will you use to decide what’s convincing vs. what’s spurious, if not statistical inference tools?”

And this, at least to me, is the heart of our discussion. Statistics is not the holy grail for an interesting scientific discovery in all fields. I gave Paleontology as an example. Take History, Philosophy, Math(!), CS (mostly), Chemistry(?), Microeconomics, Business(?), Law, and tons of others. Of course, statistics is widely used in most of those fields, but when people conduct research it’s done inside the research community, and they develop theories and schools, without statistical significance.

In Psychology, statistics has become the Sheriff for valid research. But de facto, it is not working. Some may say that this is because statistics is not being done right. I think that is a pretty big claim, as there are very smart people there. Even when it is done perfectly right, the nature of the discovery there is different. In my opinion (without checking), the research questions are too reliant on humans, and more often than not it feels like the research is being held back by the use of the tools.

There are alternatives to the statistical framework. Think about Geology. If some researcher finds something that doesn’t fit with current theories, he will point that out and offer an alternative. For his ideas to be accepted, it’s not a matter of a small CI, but a matter of convincing the geological community that this is an important discovery.

Another example—Philosophy. Descartes says something. Hume disagrees. Philosophers can go on and on with logic and stories and claims until one school is more sound, and then move on to research some other interesting problems in the field. You can think of an experiment. You can conduct it and get some statistics—but what will it mean?

Psychology might deserve the same treatment. Freud wasn’t doing any statistics. If you want to state something about human nature, or the human mind, state that, and explain perfectly well why you think so. Whether you show it with experiments, or with an interesting story (as in business case studies), or with a strong CI—that is less important. The important part is that you manage to convince the people in your community that your work is interesting and important to the field.

I think that what troubles you about my position (correct me if I’m wrong) is that I’m claiming Psychology’s CIs don’t mean a lot. But that’s not coming from disrespect for psychology research; rather, it comes from understanding statistics’ limitations. I think that after thousands of research papers stating almost everything, and the exact opposite, in a very significant way, maybe psychologists could use a change and do exactly as the editors ask them to do:

“Convince us in a clear and creative way that you are doing something important and interesting without stating p < 0.002 with 89% power. We have enough of those types of claims. Find some other way to get our interest.”

How would you do that without inference? I guess with logic, experience, knowledge, and common sense.

Alex:

Perhaps another way to state that is that psychology develops theories which are not easy to test statistically. Paul Meehl wrote a great paper in the 60s, “Theory-testing in psychology and physics: A methodological paradox”, which argues that statistical tests of psychological theories typically don’t provide much evidence of anything.

Jerzy:

Yotam, I agree 100% that there is scope for other kinds of research than the numerical experiments which statistics can be applied to. Yes, more people should be encouraged to observe interesting things that are not data-driven (like digging up an unknown kind of bone) and invent new theories that have no reliance on statistical inference, just “logic, experience, knowledge and common sense.”

But in this particular journal editorial, they don’t seem to be talking about that. They say:

“Are any inferential statistical procedures required? No, because the state of the art remains uncertain. However, BASP will require strong descriptive statistics, including effect sizes… we encourage the use of larger sample sizes…”

So in their own words, they plan to keep focusing on publishing studies that rely on large samples and are interpreted in terms of statistical analysis. They are *not* talking about the studies that you describe (a business case-study, a new theory of mind, a chemical lab-bench experiment, a newly-discovered species). They *want* to publish statistics—they just don’t want to publish any rigorous inferential info along with them.

Again, I fully agree that the direction you propose is valuable. But these journal authors aren’t proposing that! They propose to keep demanding statistical evidence, but ignore the measures of quality that distinguish better vs worse statistical evidence. Right?

Yotam:

I see what you’re saying. Well, I am not certain about their publishing criteria. It can be read as if they are insisting that, from now on, researchers use “bad statistics” because it’s simpler, which is obviously a mistake.

But I read it a bit differently, and I think that since they are taking such a big step, it’s better to give them the benefit of the doubt. I read their message as: “Forget about statistics. You are liberated from those tools. Do interesting, creative experiments, and if you find something cool, your results would follow from descriptive statistics.”

This is somewhat different from an editor asking the researcher to do statistical research and publish statistics. The way I read it (which obviously could be wrong), they *want* to publish psychology, and statistics can be used in the process to demonstrate your claims. That might turn out to be too liberal an interpretation of this specific journal, I’m not sure, and it will depend on the type of papers they publish from now on.

In any case—my main point is that by easing the statistical standard on psychology research, psychology can only benefit. By forcing the researcher to do statistics “right,” you end up with psychology journals publishing statistics that usually don’t mean a lot. This is why I think this is a step in the right direction (and maybe not far enough).

Of course there is quantitative work being done in psychology experiments, but I think the nature of the claims, and the interest they raise, should focus more on the psychology and less on the statistics. I do not know a lot about behavioral psychology or psychology in general, so I might be wrong about that—but the first experiment I can think of is the Stanford prison experiment, where they made a bunch of students prisoners and guards and watched how the students behaved. Now that is an interesting experiment about human nature—and you do not need any p-values to discuss its meaning or results. I know it is from the ’70s and an IRB would never approve something like that again—but shouldn’t that be the type of research a researcher is encouraged to focus on doing? Why are the statistics important there?

If we agree about this general (more radical) claim about statistics in psychology, I’m fine with discussing the intentions of this specific journal at a later time, or over a beer, if that is cool with you. My main claim is that the solution to the replicability crisis in the field is not to do statistics “better,” but rather to give the researcher enough (or total) statistical slack and focus on the psychology. Claims that are statistically stupid would fail simply because it won’t be possible to propagate them through the psychological community, not because the CI is too loose.

Jerzy:

“discussing … over a beer” == yes!

So, feel free to continue in the comments, or find us over beers.

Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crush the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in some fields such as statistics. The user then faces total execution times of his codes that are hard to work with: hours, days, even weeks. In this paper, how to reduce the total execution time of various codes will be shown and typical bottlenecks will be discussed. As a last resort, how to run your code on a cluster of computers (most workplaces have one) in order to make use of a larger processing power than the one available on an average computer will also be discussed through two examples.

Unlike many similar guides I’ve seen, this really is aimed at a computing novice. You don’t need to be a master of the command line or a Linux expert (Windows and Mac are addressed too). You are walked through installation of helpful non-R software. There’s even a nice summary of how hardware (hard drives vs RAM vs CPU) all interact to affect your code’s speed. The whole thing is 60 pages, but it’s a quick read, and even just skimming it will probably benefit you.

Favorite parts:

- “The strategy of opening R several times and of breaking down the calculations across these different R instances in order to use more than one core at the same time will also be explored (this strategy is very effective!)” I’d never realized this is possible. He gives some nice advice on how to do it with a small number of R instances (sort of “by hand,” but semi-automated).
- I knew about rm(myLargeObject), but not about needing to run gc() afterwards.
- I haven’t used Rprof before, but now I will.
- There’s helpful advice on how to get started combining C code with R under Windows—including what to install and how to set up the computer.
- The doSMP package sounds great — too bad it’s been removed, but I should practice using the parallel and snow packages instead (a tiny sketch with the parallel package follows this list).
- P.63 has a helpful list of questions to ask when you’re ready to learn using your local cluster.
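Along those lines, here is a minimal sketch (my own toy example, not from the guide) of spreading a loop across cores with the built-in parallel package:

```r
# Split an embarrassingly parallel loop across several cores,
# instead of opening multiple R sessions by hand
library(parallel)

slow_task = function(i) { Sys.sleep(0.1); i^2 }   # stand-in for a heavy calculation

n_cores = max(1, detectCores() - 1)
cl = makeCluster(n_cores)                   # a socket cluster works on Windows, Mac, and Linux
results = parLapply(cl, 1:100, slow_task)   # like lapply(), but spread across the cluster
stopCluster(cl)
```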

One thing Uyttendaele could have mentioned, but didn’t, is the use of databases and SQL. These can be used to store really big datasets and pass small pieces of them into R efficiently, instead of loading the whole dataset into RAM at once. Anthony Damico recommends the column-store database system MonetDB and has a nice introduction to using MonetDB with survey data in R.
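As a rough illustration of that pattern (using DBI with RSQLite as a stand-in backend, a hypothetical `my_big_data_frame`, and made-up column names; the same idea carries over to MonetDB):

```r
# Store a big dataset in an on-disk database once,
# then pull only the needed slice into RAM in later sessions
library(DBI)

con = dbConnect(RSQLite::SQLite(), "bigdata.sqlite")
dbWriteTable(con, "survey", my_big_data_frame)   # one-time load (hypothetical data frame)

subset_df = dbGetQuery(con, "SELECT age, income FROM survey WHERE state = 'PA'")
dbDisconnect(con)
```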

Tomorrow (March 3rd) is the launch party for several new (joint-)major programs for CMU undergrads: Statistics and Machine Learning, Statistics and Neuroscience, and Mathematical Statistics. That’s in addition to two existing programs: Statistics Core and the joint program in Economics and Statistics.

If you’re in Pittsburgh, come to the launch party at 4:30pm tomorrow. We’ll have project showcases, advising, interactive demos, etc., not to mention free food.

Well, the journal *Basic and Applied Social Psychology* has recently decided to ban p-values… but they’ve also tossed out confidence intervals and all the rest of classical statistical inference. And they’re not sold on Bayesian inference either. (Nor does their description of Bayes convince me that they understand it, with weird wordings like “strong grounds for assuming that the numbers really are there.”)

Apparently, instead of choosing another, less common inference flavor (such as likelihood or fiducial inference), they are doing away with rigorous inference altogether and only publishing descriptive statistics. The only measure they explicitly mention to prevent publishing spurious findings is that “we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.” That sounds to me like they know sampling error and inference are important—they just refuse to quantify them, which strikes me as bizarre.

I’m all in favor of larger-than-typical sample sizes, but I’m really curious how they will decide whether they are large enough. Sample sizes need to be planned before the experiment happens, long before you get feedback from the journal editors. If a researcher plans an experiment, hoping to publish in this journal, what guidance do they have on what sample size they will need? Even just doubling the sample size is already often prohibitively expensive, yet it doesn’t even halve the standard error; will that be convincing enough? Or will they only publish Facebook-sized studies with millions of participants (which often have other experimental-design issues)?
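(For reference, the arithmetic behind that claim: standard errors shrink like 1/sqrt(n), so doubling the sample size only multiplies the standard error by 1/sqrt(2), about 0.71. You pay twice the data-collection cost for roughly a 29% reduction in uncertainty.)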

Conceivably, they might work out these details and this might still turn out to be a productive change making for a better journal, if the editors are more knowledgeable than the editorial makes them sound, AND if they do actually impose a stricter standard than p<0.05, AND if good research work meeting this standard is ever submitted to the journal. But I worry that, instead, it'll just end up downgrading the journal's quality and reputation, making referees unsure how to review articles without statistical evidence, and making readers unsure how reliable the published results are.

See also the American Statistical Association’s comment on the journal’s new policy, and the reddit discussion (particularly Peter’s response).

*Edit: John Kruschke is more hopeful, and Andrew Gelman links to a great paper citing cases of actual harm done by NHST. Again, I’m not trying to defend overuse of p-values—but there are useful and important parts of statistical inference (such as confidence intervals) that cannot be treated rigorously with descriptive statistics alone. And reliance on the interocular trauma test alone just frees up more ways to fiddle with the data to sneak it past reviewers.*

See also the previous two posts: After 1st semester of Statistics PhD program and After 2nd semester of Statistics PhD program.

This was my last semester of required coursework. Having passed the Data Analysis Exam in May, and with all the courses under my belt, I am pretty much ready to focus on the thesis topic search and proposal. Exciting!

Classes:

- Let me elaborate on Cosma’s post: “Note to graduate students: It is important that you internalize that you are, in fact, a badass…”

Ideally you should really internalize that you’re a badass **before** you come to grad school, because this is not the place to prove to yourself that you’re a badass. There are too many opportunities to feel bad about yourself at every stumble, when you’re surrounded by high-performing classmates and faculty who seem to do everything faster and more smoothly… It can be demoralizing when, say, you learn that you had the lowest score on an exam in a required class.
- On the other hand, now that the Advanced Statistical Theory course is over, I do feel much more badass about reading and doing statistical theory. I used to see a paper with a ton of unfamiliar math and my eyes would glaze over. Now I see it as: “Well, it may take a while, but I’m capable of learning to parse that, use it, and even contribute to the field.” It feels no more daunting than other things I’ve done. Thank you, Advanced Prob and Advanced Stat Theory!

For example, I finally internalized that “hard math” is no worse than learning a new coding language. If I do an applied project and have to learn a new topic like Python, or parallel programming, or version control, it’s not an impossible task: it’s just a lot of work, like learning a foreign language. And I finally feel the same about math again: I may not have known what a Frobenius norm is, or my intuition about the difference between o(1) and O(1) may still be underdeveloped—but it’s not substantively different to get there than it is to keep track of the differences between for-loops in R vs Python vs MATLAB (like I had to do all year).

Also, if I get stuck on a theory problem, it’s my own concern. I can read previous work on it and find a solution; or if there is none, I can write one and thus make a contribution to the literature. But if I’m stuck on an applied problem because I don’t have a codebook for the variables or don’t know what preprocessing was done to the dataset, I really am stuck waiting until the data owner responds (if he/she even knows or remembers what was done, which is not a safe bet…).
- I was a bit surprised by the choice of topics in Advanced Stat Theory. We covered several historically important topics in great detail, but then the professor told us that most of them are not especially popular directions or practically useful tools in modern statistical research. (For example, Neyman-Pearson hypothesis testing in exponential families seems to be a solved problem, tackled by tools specific to that scenario alone… So why spend so much course time on it?) Maybe the course could be better focused if it were split into two parts: one on historically-important foundations, vs. one on modern techniques.
- My TA assignment this semester was for Discrete Multivariate Analysis: advanced methods for contingency tables and log-linear models. I came away with a bigger appreciation for the rich and interesting questions that can arise about what looks, on the surface, to be a simple and small corner of statistics.

Journal Club:

- My favorite course this fall was the Statistical Journal Club, led by CMU’s Ryan Tibshirani jointly with his father Rob Tibshirani (on sabbatical here from Stanford). The Tibshiranis chose a great selection of papers for us to read and discuss. Each week a pair or trio of students would present that week’s paper. It was helpful to give practice “chalk talks” as well as to see simulations illustrating each paper. (On day 1, Rob Tibshirani told us he likes to implement a small simulation whenever he reads a new paper or attends a talk: it helps gain intuition, see how well the method really works in practice, and see how sensitive it is to the authors’ particular setup and assumptions.)
- I mentioned in Journal Club that we’d benefit from an MS/PhD-level course on experimental design and sampling design for advanced stats & ML. Beyond just simple data collection for a basic psych experiment, how should one collect “big data” well, what should one watch out for, how does the data collection affect the analysis, etc.? Someone asked if I’m volunteering to teach it—maybe not a bad idea someday.
- The papers on “A kernel two-sample test” and “Brownian distance covariance” reminded me of a few moments when I saw an abstract definition in AdvProb class and thought, “Too bad this is just a technical tool for proofs and not something you can check in practice on real data…” As it turns out, the authors of these papers DID find a way to use them with real data. (For instance, there’s a very abstract definition of equality of distributions that cannot be checked directly: “for any function, the mean of that function on X is the same as the mean of that function on Y.” You can’t take a real dataset and check this for ALL functions—but the authors figured out that you can use kernel methods to get pretty close, by checking a vast infinite space of functions. So they took the abstract impractical definition and developed a nice practical test you can run on real data.) Impressive, and a good reminder to watch out for that thought again in the future—maybe a second look could turn into something useful. (A toy simulation of the kernel two-sample test appears at the end of this section.)
- Similarly, a few papers (like “Stability selection”) take an idea that seems reasonable to try in practice but without any theoretical grounding… (What if we just take random half-samples of the data, refit our lasso regression on each one, and see which variables are kept in the model on most of the half-samples?)… and develop proofs that give theoretical guarantees about how good this procedure can be.
- Still other papers (like my own team’s assigned paper, on Deep Learning) were unable to find a solid theoretical grounding for why the model does so well or any guarantees on how well it should be expected to do. But it seems like it should be tractable, if only we could hit on the right framework for looking at this problem. The Dropout paper had a nice way to look at the very top layer of a neural network, but not directly helpful for deeper networks.
- I got really excited about the “Post-selection inference” paper, which discussed conditional hypothesis testing for regression coefficients. I thought we could apply it to the simplest OLS case to do some nifty new test that would let you make inferences such as: “Beta is estimated to be positive, and our conditional one-sided test says it’s significant, so it’s significantly positive.” You’re usually told not to do this: you’re supposed to decide ahead of time if you want a two-sided test or one-sided; and if it’s one-sided, then decide which side you want to check before looking at the data. However… after some scratch work, in the Normal case it looks like the correction you do (for deciding on the direction of the one-sided test **after** observing the sign of the estimate) is exactly equivalent to doing a two-sided test instead. (Basically you double the one-sided test’s p-value, which is the same as computing the two-sided p-value for a Normal statistic.) So on the one hand, we don’t get a new better test out of this: it’s just what people do in practice anyway. On the other hand, it shows that the thing that people do, even though they’re told it’s wrong, is actually not wrong after all.

This made me wonder: Apart from this simple case of one coefficient in OLS, are there other aspects of sequential/adaptive/conditional hypothesis testing that could be simplified and spread to a wider audience? Are there common use-cases where these tools would help less-statistically-savvy users to get rigorous inference out of the missteps they normally make?
- A few of the papers were less technical, such as “Why most published research findings are false.” We discussed how to incentivize scientists to publish well-powered interesting null findings and avoid the file-drawer problem. Rob Tibshirani suggested the idea of a “PLoS Zero” (vs. the existing PLoS ONE). He also told us how he encouraged PubMed to add a comment system, the PubMed Commons. Now you can point out issues or mistakes in a paper in this public space and get the authors’ responses right there, instead of having to go back & forth through the journal editors’ gatekeeping to publish letters slowly.
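In the spirit of Rob Tibshirani’s implement-a-small-simulation advice, here is a toy version of the kernel two-sample test idea: a rough base-R sketch of a permutation test on the simplest (biased) MMD statistic with a Gaussian kernel. It is far cruder than the estimators and null approximations in the actual paper, but it shows the flavor:

```r
# Toy permutation test based on the (biased) Maximum Mean Discrepancy statistic
# with a Gaussian kernel, for two univariate samples
mmd2 = function(x, y, sigma = 1) {
  k = function(a, b) exp(-outer(a, b, "-")^2 / (2 * sigma^2))
  mean(k(x, x)) + mean(k(y, y)) - 2 * mean(k(x, y))
}

mmd_perm_test = function(x, y, B = 999, sigma = 1) {
  obs = mmd2(x, y, sigma)
  z = c(x, y); n = length(x)
  perm = replicate(B, {
    idx = sample(length(z))
    mmd2(z[idx[1:n]], z[idx[-(1:n)]], sigma)
  })
  mean(c(obs, perm) >= obs)   # permutation p-value
}

set.seed(1)
mmd_perm_test(rnorm(50), rnorm(50, mean = 1))  # small p-value: the distributions differ
mmd_perm_test(rnorm(50), rnorm(50))            # larger p-value: same distribution
```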

Research:

- Besides the year-long Advanced Data Analysis (ADA) project, I also got back into research on Small Area Estimation with Beka Steorts, which led me to attend the SAE2014 conference in Poznań, Poland (near my hometown—the first time that business travel has ever taken me anywhere near family!). Beka also got me involved in the MIDAS (Models of Infectious Disease Agent Study) project: we are developing “synthetic ecosystems,” aka artificial populations that epidemiologists can plug into agent-based models to study the spread of disease. The current version is an EXTREMELY rudimentary first pass: I’ll write a bit more about the project once we have a version we’re happier with.
- I finally sat down and learned version control (via Git), and it turned out to be a good friend. For the MIDAS project we had three of us working on Dropbox, which led to: clogging all our Dropboxes, overwriting each other’s files, trying to coordinate by email, renaming things from “blahblah” to “blahblah_temp” and “blahblah_temp_2_tmp_recent” and so on… So it became clear it’s time for a better approach. Git lets you exclude files (so you don’t need to sync everything like Dropbox does); check differences between file versions; and use branching to try out temporary versions without renaming or breaking everything. I used the helpful tutorials by Bitbucket and Karl Broman.
- MIDAS also sponsored me to attend the North American Cartographic Information Society (NACIS) 2014 conference here in Pittsburgh. That deserves its own post, but I found it nifty that the conference was co-organized by Amy Griffin… whom I met (when she came to do some research on spatial visualization of uncertainty with the Census Bureau) via Nicholas Nagle… who first reached out to me through a comment on this blog. It all comes back around!
- As for the yearlong ADA project itself: it’s almost wrapped up, but quite differently from what we expected. There turned out to be major issues in getting and combining all the required dataset pieces: We needed (1) MEG brain scans, (2) MRI brain imagery, and (3) personal covariates about the medical/neuropsychological outcomes of each patient. Each of these three datasets had a different owner, and was de-identified for privacy/security… and we were never able to get a set of patient IDs that we could use to merge the different datasets together. In the end I had to switch topics entirely, to a similar neuroscientific dataset (which **had** been successfully combined and pre-processed) but for studying Autism instead of Epilepsy. This switch finally happened in the last few months of the semester, so I had just a short time in which to address the scientific questions in appropriate statistical ways, while also learning about a new disorder, and also refreshing my knowledge of MATLAB (since this data was in that format, not Python as the previous one had been)…

Lessons learned: I should have been more proactive with collaborators about either pushing harder to get data quickly or just switching topics sooner. And for those stats students who are about to start a new applied project like this one, make sure your collaborators already have the full dataset in hand. (Of course, in general if you’re able to get in early and help to **plan** the data collection for optimal statistical efficiency, so much the better. But if you’re just a student whose goal is to practice data analysis, you’d better be sure the data has been compiled before you start.)

Life:

- Before coming to CMU, I always knew it as a strong technical school but didn’t realize how great the drama department was. We finally made it to a stage performance—actually Britten’s The Beggar’s Opera. I was wearing a sleep monitor watch that week, and the readout later claimed I was asleep during the show… It just noticed my low movement and the dim lighting, but I promise I was awake! Really, a great performance and I look forward to seeing more theater here.
- For a while I’ve been disappointed that Deschutes Brewery beers from Oregon hadn’t made it out to Pennsylvania yet. But no longer! I can finally buy my favorite Obsidian Stout down the street!
- Though I haven’t been posting much this fall, there’s been plenty of good stuff by first-year CMU student Lee Richardson. I especially like his recent post’s comments about institutional knowledge—it’s far more important than we usually give it credit for.
- Nathan Yau is many steps ahead of me again, with great posts like how to improve government data websites, as well as one on a major life event. My own household size is also expected to increase from N to N+1 shortly, and everyone tells us “Your life is about to change!”—so I thank Nathan for a data-driven view of how exactly that change may look.

So, Rudder is one of the founders of dating site OkCupid and its quirky, data-driven research blog. His new book is very readable—each short, catchy chapter was hard to put down. I like how he gently alludes to the statistical details for nerds like myself, in a way that shouldn’t overwhelm lay readers. The clean, Tufte-minimalist graphs work quite well and are accompanied by clear writeups. Some of the insights are basically repeats of material already on the blog, but with a cleaner writeup, though there’s plenty of new stuff too. Whether or not you agree with all of his conclusions *[edit: see Cathy O’Neil’s valid critiques of the stats analyses here]*, the book sets a good example to follow for anyone interested in data- or evidence-based popular science writing.

Most of all, I loved his description of statistical precision:

Ironically, with research like this, precision is often less appropriate than a generalization. That’s why I often round findings to the nearest 5 or 10 and the words ‘roughly’ and ‘approximately’ and ‘about’ appear frequently in these pages. When you see in some article that ‘89.6 percent’ of people do x, the real finding is that ‘many’ or ‘nearly all’ or ‘roughly 90 percent’ of them do it, it’s just that the writer probably thought the decimals sounded cooler and more authoritative. The next time a scientist runs the numbers, perhaps the outcome will be 85.2 percent. The next time, maybe it’s 93.4. Look out at the churning ocean and ask yourself exactly which whitecap is ‘sea level.’ It’s a pointless exercise at best. At worst, it’s a misleading one.

I might use that next time I teach.

The description of how academics hunt for data is also spot on: “Data sets move through the research community like yeti—*I have a bunch of interesting stuff but I can’t say from where; I heard someone at Temple has tons of Amazon reviews; I think L has a scrape of Facebook.*”

Sorry I didn’t take many notes this time, but Alberto Cairo’s post on the book links to a few more detailed reviews.
