As I’ve said before, I’m curious about finding better ways to draw maps which simultaneously show numerical estimates **and** their precision or uncertainty.

The April 2015 issue of *Significance* magazine includes a nice example of this [subscription link; PDF], thanks to Michael Wininger. Here is his Figure 2a (I think the labels for the red and blue areas are mistakenly swapped, but you get the idea):

Basically, Wininger is mapping the weather continuously over space, and he overlays two contours: one for where the predicted snowfall **amount** is highest, and another for where the **probability** of snowfall is highest.

I can imagine people would also enjoy an interactive version of this map, where you have sliders for the two cutoffs (how many inches of snow? what level of certainty?). You could also just show more levels of the contours on one static map, by adding extra lines, though that would get messy fast.

I think Wininger’s approach looks great and is easy to read, but it works largely because he’s mapping spatially-continuous data. The snowfall levels and their certainties are estimated at a very fine spatial resolution, unlike, say, a choropleth map of the average snowfall by county or by state. The other thing that helps here is that certainty is expressed as a probability (which most people can interpret)… not as a measure of spread or precision (standard deviation, margin of error, coefficient of variation, or what have you).

Could this also work on a choropleth map? If you only have data at the level of discrete areas, such as counties… Well, this is not a problem with weather data, but it does come up with administrative or survey data. Say you have survey estimates for the poverty rate in each county (along with MOEs or some other measure of precision). You could still use one color to fill all the counties with high estimated poverty rates. Then use another color to fill all the counties with highly precise estimates. Their overlap would show the areas where poverty is estimated to be high **and** that estimate is very precise. Sliders would let the readers set their own definition of “high poverty” and “highly precise.”
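For concreteness, here is a minimal R sketch of that overlay logic with made-up county values (the poverty rates, MOEs, and cutoffs are all hypothetical; a real version would join these categories to county shapefiles and let sliders control the two cutoffs):

```r
# Hypothetical survey estimates for five counties
poverty <- c(0.08, 0.22, 0.31, 0.12, 0.27)   # estimated poverty rates
moe     <- c(0.02, 0.09, 0.03, 0.01, 0.12)   # margins of error

high_poverty <- poverty > 0.20               # reader-chosen cutoff
precise      <- moe < 0.05                   # reader-chosen cutoff

# Four fill categories for the map; the overlap is the interesting one
category <- ifelse(high_poverty & precise, "high & precise",
            ifelse(high_poverty, "high only",
            ifelse(precise, "precise only", "neither")))
category
```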

I might be wrong, but I don’t think I’ve seen this approach before. Could be worth a try.

(I really can’t recommend the series. I enjoyed the first few books in middle school, but on a re-read last year they didn’t stand up to my childhood memories. The first is still fun but a blatant Tolkien ripoff; the rest are plodding and repetitive.)

Readers, can you recommend any good fantasy / sci-fi (or other fiction) that treats stats & math well?

**The Dragon Reborn**

A few of the characters discuss the difference between distributions that show clustering, uniformity, and randomness:

“It tells us it is all too neat,” Elayne said calmly. “What chance that thirteen women chosen solely because they were Darkfriends would be so neatly arrayed across age, across nations, across Ajahs? Shouldn’t there be perhaps three Reds, or four born in Cairhien, or just two the same age, if it was all chance? They had women to choose from or they could not have chosen so random a pattern. There are still Black Ajah in the Tower, or elsewhere we don’t know about. It must mean that.”

She’s suspicious of the very uniform distribution of demographic characteristics in the observed sample of 13 bad-guy characters. If turning evil happens at random, or at least is independent of these demographics, you’d expect some clusters to occur by chance in such a small sample—that’s why statistical theory exists, to help decide if apparent patterns are spurious. And if evil were associated with any demographic, you’d certainly expect to see some clusters. The complete absence of clustering (in fact, we see the opposite: dispersion) looks more like an experimental design, selecting observations that are as different as possible… implying there is a larger population to choose from than just these 13. Nice.
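Her intuition holds up in a quick simulation (my own illustration with made-up numbers, not from the book): draw 13 members at random from 7 equally likely Ajahs and see how often the sample comes out as evenly spread as possible, i.e. no Ajah appearing more than twice.

```r
set.seed(1)
sims <- replicate(100000, {
  ajahs <- sample(1:7, 13, replace = TRUE)
  max(table(ajahs)) <= 2   # perfectly dispersed: six Ajahs twice, one once
})
mean(sims)  # roughly 0.007: such perfect dispersion is rare by chance
```

So under pure chance, a sample this tidy happens well under 1% of the time, which is exactly Elayne’s point.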

There are also records of historical hypothesis testing of a magical artifact:

“Use unknown, save that channeling through it seems to suspend chance in some way, or twist it.” She began to read aloud. “‘Tossed coins presented the same face every time, and in one test landed balanced on edge one hundred times in a row. One thousand tosses of the dice produced five crowns one thousand times.'”

That’s a degenerate distribution right there.
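For scale, a rough back-of-envelope in R (my own arithmetic, assuming a fair coin and a fair six-sided die, since the books don’t say what “five crowns” corresponds to):

```r
p_coins <- 2 * 0.5^100           # 100 tosses all showing one face (either face)
log10(p_coins)                   # about -30

# (1/6)^1000 underflows double precision, so work on the log scale
log10_p_dice <- -1000 * log10(6) # 1000 identical results of one die face
log10_p_dice                     # about -778
```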

Mat, the lucky-gambler character, also talks of luck going in his favor more often where there’s more randomness: he always wins at dice, usually at card games, and rarely at games like “stones” (basically Go). It’d be good fodder for a short story set in our own world—a character who realizes he’s no brainiac but incredibly lucky, and so seeks out luck-based situations. What else could you do, besides the obvious lottery tickets and casinos?

**The Shadow Rising**

I was impressed by Elayne’s budding ability to think like a statistician in the previous book, but she returns to more simplistic thinking in this book. The characters ponder murder motives (p.157):

“They were killed because they talked […] Or to stop them from it […] They might have been killed simply to punish them for being captured […] Three possibilities, and only one says the Black Ajah knows they revealed a word. Since all three are equal, the chances are that they do not know.”

Oh, Elayne. There are well-known problems with the principle of insufficient reason. Your approach to logic may get you into trouble yet.
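The trouble, sketched in R for concreteness (my own illustration, not from the book): the “equal chances” answer depends entirely on how you happen to list the possibilities.

```r
# Elayne's partition: 3 motives, 1 of which implies "they know"
p_know <- 1/3
# Split that one motive into two sub-motives (say, talked willingly
# vs. talked under duress) and re-apply "equal chances":
p_know_split <- 2/4   # now 4 "equally likely" motives, 2 imply "they know"
c(p_know, p_know_split)
```

Same evidence, different carving of the hypothesis space, different answer: that is the classic objection to the principle of insufficient reason.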

**Lord of Chaos**

The description of Caemlyn’s chief clerk and census-taker Halwin Norry is hamfisted and a missed opportunity:

Rand … was not certain anything was real to Norry except the numbers in his ledgers. He recited the number of deaths during the week and the price of turnips carted in from the countryside in the same dusty tone, arranged the daily burials of penniless friendless refugees with no more horror and no more joy than he showed hiring masons to check the repair of the city walls. Illian was just another land to him, not the abode of Sammael, and Rand just another ruler.

If anything, Norry sounds like an admirable professional! Official statisticians must be as objective and politically disinterested as possible; otherwise rulers can make whatever “decisions” they like, but there will be no way to carry them out accurately without knowing what resources are actually on hand or how severe the problem really is. It’d be fascinating to see how Norry actually runs a wartime census—perhaps with scrying help from the local magic users? But here Jordan is just sneering down at him. Such a shame.

**Knife of Dreams**

There are a few ridiculous scenes of White Ajah logicians arguing; I should have noted them down. I’m not sure if Jordan really believes mathematicians and logicians talk like this, or whether his tongue is in cheek and he’s just joking, but man, it’s a grotesque caricature. Someday I’d love to see a popular book describe the kind of arguments mathematicians actually have with each other. But this isn’t it.

No disrespect meant to Martin, his readers, or their families—it’s just a thought exercise that intrigued me, and I figured it may interest other people.

Also, we’ve blogged about GoT and statistics before.

In the Spring a young man’s fancy lightly turns to actuarial tables.

That’s right: Spring is the time of year when the next bloody season of *Game of Thrones* airs. This means the internet is awash with death counts from the show and survival predictions for the characters still alive.

Others, more pessimistically, wonder about the health of George R. R. Martin, author of the *A Song of Ice and Fire* (*ASOIAF*) book series (on which *Game of Thrones* is based). Some worried readers compare Martin to Robert Jordan, who passed away after writing the 11th *Wheel of Time* book, leaving 3 more books to be finished posthumously. Martin’s trilogy has become 5 books so far and is supposed to end at 7, unless it’s 8… so who really knows how long it’ll take.

(Understandably, Martin responds emphatically to these concerns. And after all, Martin and Jordan are *completely different* aging white American men who love beards and hats and are known for writing phone-book-sized fantasy novels that started out as intended trilogies but got out of hand. So, basically no similarities whatsoever.)

But besides the author and his characters, there’s another set of deaths to consider. The books will get finished eventually. But how many **readers** will have passed away waiting for that ending? Let’s take a look.

Caveat: the inputs are uncertain, the process is handwavy, and the outputs are certainly wrong. This is all purely for fun (depressing as it may be).

So, we’ll need to answer a few questions. How do we define readers? How many readers are there? What are their demographics? And what are the mortality statistics for those demographics?

**Readers:** By the fall of 2013, around 24 million ASOIAF books had been sold in North America, but that includes all 5 books (so far). Furthermore, it seems that book sales went through the roof once the HBO show began in 2011, which will make it really hard to estimate trends in readership over time after that year. 2011 is also the year when the latest book in the series, *A Dance With Dragons* (*ADWD*), was published.

The first book in the series reached its one-millionth US-paperback-edition copy by the fall of 2010. So for simplicity’s sake, let’s say that by the end of 2010, there were at least 1,000,000 US readers. [This misses people who bought the hardcover instead or who read it as a library book; and this overcounts people who bought but never read it or who didn’t like it enough to continue the series. Still, it’s a nice round number and probably the right order of magnitude.] These are the grizzled veterans who were already fans before the HBO show (a.k.a. the hipsters who liked it before it was cool). Some started when the first book appeared in 1996, others as late as 2010. Let’s call these 1,000,000 US residents our core ASOIAF readers who really want to know how the book series ends. How many of them had passed away before the HBO show began and ADWD came out?

The first US printing of the first book had around 50,000 copies according to Martin. Instead of thoroughly researching how book readership tends to grow, let’s assume it’s linear (another crass oversimplification). Then over the 15 years from 1996 to 2010 (inclusive), we’d have to add around 65,000 readers a year to reach 1 million total readers. Since that’s not too far from the first print run, let’s go with it: we have 65,000 new ASOIAF readers every year over 15 years.

**Demographics:** I can’t find demographics for Martin’s readers specifically, but there are a few demographic summaries of fantasy readers in general. Readers of *The Magazine of Fantasy & Science Fiction* and *Lightspeed Magazine* seem to be roughly 10% ages 18-24, 50% ages 26-45, 30% ages 46-55, which I guess leaves around 10% aged 56 or older. (That ignores people under 18, but let’s face it, kids probably shouldn’t be reading the gruesome *Game of Thrones* anyway.) The sex breakdown is roughly 60% male, 40% female. Let’s assume this age/sex breakdown holds for our million ASOIAF readers, though here are several other summaries with slightly different demographics.

**Mortality:** Okay, it’s time for the morbid part. Here are some Death Rates by Age and Sex (we’ll ignore Race since I didn’t find those reader demographics). None of them have changed dramatically since around 2000, close to the first book’s publication date, so let’s just use the latest 2008 numbers. The age breakdowns here don’t match ours exactly, so let’s also average together the rates for age groups we need to combine. Table 2 here suggests 25-34 and 35-44 had roughly similar numbers of people, so we can take a simple average of their death rates to get the 26-45 rate. But for people 56+, we’ll do a weighted average, weighted by the approximate population in each death-rate category. Using very rough weights of 25 (million) in population for 55-64, 20 for 65-74, 10 for 75-84, and 5 for 85+, we get

`(25*1000 + 20*2500 + 10*6000 + 5*14000) / (25 + 20 + 10 + 5)`

or around 3500 for the 56+ male death rate. For females, it’s

`(25*1000 + 20*2000 + 10*4000 + 5*12500) / (25 + 20 + 10 + 5)`

or around 2800.
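As a sanity check, the same weighted averages computed explicitly in R (weights are the rough population counts in millions from above):

```r
weights      <- c(25, 20, 10, 5)            # ages 55-64, 65-74, 75-84, 85+
male_rates   <- c(1000, 2500, 6000, 14000)  # deaths per 100,000
female_rates <- c(1000, 2000, 4000, 12500)

sum(weights * male_rates) / sum(weights)    # about 3400, rounded up to 3500
sum(weights * female_rates) / sum(weights)  # about 2800
```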

Finally, the table gives death rates per 100,000 population, but let’s translate them to percent of people who will pass away that year. The results are

Males: .11% for 18-25, .18% for 26-45, .53% for 46-55, 3.5% for 56+

Females: .04% for 18-25, .06% for 26-45, .23% for 46-55, 2.8% for 56+

Let’s run these numbers through R.

```r
# Death rates (percent of people in that group who die in a given year)
# rounded or estimated from Census tables
DeathRatesVec = c(.11, .18, .53, 3.5, .04, .06, .23, 2.8)
DeathRates = matrix(DeathRatesVec, 4, 2) / 100
colnames(DeathRates) = c("M", "F")
rownames(DeathRates) = c("18-24", "25-44", "45-54", "55+")
DeathRates
##            M      F
## 18-24 0.0011 0.0004
## 25-44 0.0018 0.0006
## 45-54 0.0053 0.0023
## 55+   0.0350 0.0280

# Number of readers in each age/sex group
# estimated from fantasy magazine reader polls
AgePcts = c(.1, .5, .3, .1)
SexPcts = c(.6, .4)
ReadersPerYear = t(65000 * rbind(AgePcts, AgePcts) * SexPcts)
colnames(ReadersPerYear) = colnames(DeathRates)
rownames(ReadersPerYear) = rownames(DeathRates)
ReadersPerYear
##           M     F
## 18-24  3900  2600
## 25-44 19500 13000
## 45-54 11700  7800
## 55+    3900  2600

# Function to estimate the number of readers who die
# within a certain number of years
NrDeathsByYearsLeft = function(YearsLeft) {
  sum(ReadersPerYear - ReadersPerYear * (1 - DeathRates) ^ YearsLeft)
}

# Total number of reader deaths from 1996 through 2010
FirstYear = 1996
FinalYear = 2010
TotalYears = FinalYear - FirstYear + 1
DeathsByYearStarted = sapply(TotalYears:1, NrDeathsByYearsLeft)
round(sum(DeathsByYearStarted))
## [1] 36814
```

So it looks like almost 40,000 veteran readers didn’t survive even until *ADWD* was published or the HBO show aired. This is on the order of 100 times the number of characters who’ve died, whether in the show or in the books.

Finally, let’s show the breakdown by year, since we already calculated it above:

```r
# Number of deaths by 2010,
# broken out by the year in which they started reading
plot(FirstYear:FinalYear, DeathsByYearStarted, type = 'h',
     xlab = 'Year', ylab = 'Deaths',
     main = 'Number of readers deceased by 2010\nwho started in a given year')
```

(The trend looks perfectly linear just because we assumed linear growth in the number of readers and stable demographics over time.)

No deep insights here. There’s just the stark (hah!) realization that a substantial number of Martin’s earliest readers have not survived the wait.

Let’s not worry about which characters will die; let’s not hurry Martin as he writes. Let us just savor our time on Earth before we make the same journey ourselves. After all, valar morghulis.

PS—A helpful librarian friend tells me that the Carnegie Library of Pittsburgh system has 102 copies of the books currently in the system (acquired from 2002 onwards), with about 2300 total checkouts altogether. This could be extrapolated to estimate US readership by library patrons who didn’t actually buy the book. At some point I may also go through her data to see how readership seems to have changed over time (i.e., the number of checkouts over time for older vs. newer copies).

*Manual trackback: Partially Derivative ep. 20 (around 33:09); FlowingData*

The slides introduce a few variants of the simplest area-level (Fay-Herriot) model, analyzing the same dataset in a few different ways. The slides also explain some basic concepts behind Bayesian inference and MCMC, since the target audience wasn’t expected to be familiar with these topics.

- Part 1: the basic Frequentist area-level model; how to estimate it; model checking (pdf)
- Part 2: overview of Bayes and MCMC; model checking; how to estimate the basic Bayesian area-level model (pdf)
- All slides, data, and code (ZIP)

The code for all the Frequentist analyses is in SAS. There’s R code too, but only for a WinBUGS example of a Bayesian analysis (also repeated in SAS). One day I’ll redo the whole thing in R, but it’s not at the top of the list right now.

Frequentist examples:

- “ByHand” where we compute the Prasad-Rao estimator of the model error variance (just for illustrative purposes since all the steps are explicit and simpler to follow; but not something I’d usually recommend in practice)
- “ProcMixed” where we use mixed modeling to estimate the model error variance at the same time as everything else (a better way to go in practice; but the details get swept under the hood)
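Since I haven’t redone the SAS examples in R yet, here is only a minimal R sketch of the basic area-level model on simulated data: a crude moment estimator of the model error variance stands in for the Prasad-Rao “ByHand” version, followed by the usual shrinkage (EBLUP) step. The data, covariate, and variances are all made up for illustration.

```r
# Model: y_i = x_i'beta + u_i + e_i, with u_i ~ N(0, A) and known
# sampling variances D_i for the direct estimates y_i.
set.seed(42)
m  <- 50
x2 <- runif(m)                   # one area-level covariate
D  <- runif(m, 0.2, 1)           # known sampling variances
y  <- 2 + 3 * x2 + rnorm(m, 0, sqrt(0.5)) + rnorm(m, 0, sqrt(D))

fit <- lm(y ~ x2)                # regression fit for beta
# Crude method-of-moments estimate of A: E[resid^2] is roughly A + D_i
A_hat <- max(0, mean(resid(fit)^2 - D))

# Shrinkage (EBLUP): weight each direct estimate by its reliability
gamma     <- A_hat / (A_hat + D)
theta_hat <- gamma * y + (1 - gamma) * fitted(fit)
round(A_hat, 2)
```

Areas with noisier direct estimates (large `D_i`) get pulled harder toward the regression prediction, which is the whole point of the model.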

Bayesian examples:

- “ProcMCMC” and “ProcMCMC_alt” where we use SAS to fit essentially the same model parameterized in a few different ways, some of whose chains converge better than others
- “R_WinBUGS” where we do the same but using R to call WinBUGS instead of using SAS

The example data comes from Mukhopadhyay and McDowell, “Small Area Estimation for Survey Data Analysis using SAS Software” [pdf].

If you get the code to run, I’d appreciate hearing that it still works.

My SAE resources page still includes a broader set of tutorials/textbooks/examples.

Not to be outdone by the journal editors who banned confidence intervals, the SIGBOVIK 2015 proceedings (p.83) feature a proposal to ban future papers from reporting any conclusions whatsoever:

In other words, from this point forward, BASP papers will only be allowed to include results that “kind of look significant”, but haven’t been vetted by any statistical processes…

This is a bold stance, and I think we, as ACH members, would be remiss if we were to take a stance any less bold. Which is why I propose that SIGBOVIK, from this day forward, should ban conclusions… Of course, even this provision may not be sufficient, since readers may draw their own conclusions from any suggestions, statements, or data presented by authors. Thus, I suggest a phased plan to remove any potential of readers being mislead…

I applaud the author’s courageous leadership. Readers of my own SIGBOVIK 2014 paper on BS inference (with Alex Reinhart) will immediately see the natural synergy between conclusion-free analyses and our own BS.

Although most of his examples are geared towards experimental science, the advice is just as valid for readers working in social science, data journalism [if Alberto Cairo likes your book it must be good!], conducting surveys or polls, business analytics, or any other “data science” situation where you’re using a data sample to learn something about the broader world.

This is NOT a how-to book about plugging numbers into the formulas for t-tests and confidence intervals. Rather, the focus is on *interpreting* these seemingly-arcane statistical results correctly; and on *designing your data collection process* (experiment, survey, etc.) well in the first place, so that your data analysis will be as straightforward as possible. For example, he really brings home points like these:

- Before you even collect any data, if your planned sample size is too small, you simply can’t expect to learn anything from your study. “The power will be too low,” i.e. the estimates will be too imprecise to be useful.
- For each analysis you do, it’s important to understand commonly-misinterpreted statistical concepts such as p-values, confidence intervals, etc.; else you’re going to mislead yourself about what you can learn from the data.
- If you run a ton of analyses overall and only publish the ones that came out significant, such data-fishing will mostly produce effects that just happened (by chance, in your particular sample) to look bigger than they really are… so you’re fooling yourself and your readers if you don’t account for this problem, leading to bad science and possibly harmful conclusions.

Admittedly, Alex’s physicist background shows in a few spots, when he implies that physicists do everything better (e.g. see my notes below on p.49, p.93, and p.122.)

Seriously though, the advice is good. You can find the correct formulas in any Stats 101 textbook. But Alex’s book is a concise reminder of how to plan a study and to understand the numbers you’re running, full of humor and meaningful, lively case studies.

Highlights and notes-to-self below the break:

- p.7: “We will always observe *some* difference due to luck and random variation, so statisticians talk about *statistically significant* differences when the difference is larger than could easily be produced by luck.”

  Larger? Well, sorta but not quite… Right on the first page of chapter one, Alex makes the same misleading implication that he’ll later spend pages and pages dispelling: “Statistically significant” isn’t really meant to imply that the difference is *large*, so much as it’s *measured precisely*. That sounds like a quibble but it’s a real problem. If our historical statistician-forebears had chosen “precise” or “convincing” (implying a focus on how well the measurement was done) instead of “significant” (which sounds like a statement about the size of what was measured), maybe we could avoid confusion and wouldn’t need books like Alex’s.
- p.8: “More likely, your medication actually works.” Another nitpick: “more plausibly” might be better, avoiding some of the statistical baggage around the word “likely,” unless you’re explicitly being Bayesian… OK, that’s enough nitpicking for now.
- p.9: “This is a counterintuitive feature of significance testing: if you want to prove that your drug works, you do so by showing that the data is *in*consistent with the drug *not* working.” Very nice and clear summary of the confusing nature of hypothesis tests.
- p.10: “This is troubling: two experiments can collect identical data but result in different conclusions. Somehow, the *p* value can read your intentions.” This is an example of how frequentist inference can violate the likelihood principle. I’ve never seen why this should bother us: the p-value is meant to inform us about how well *this* experiment *measures* the parameter, not about what the parameter itself is… so it’s not surprising that a differently-designed experiment (even if it happened to get equivalent data) would give a different p-value.
- p.19: “In the prestigious journals *Science* and *Nature*, fewer than 3% of articles calculate statistical power before starting their study.” In the “Sampling, Survey, and Society” class here at CMU, we ask undergraduates to conduct a real survey on campus. Maybe next year we should suggest that they study how many of our professors do power calculations in advance.
- p.19: “An ethical review board should not approve a trial if it knows the trial is unable to detect the effect it is looking for.” A great point—all IRB members ought to read *Statistics Done Wrong*!
- p.20-21: A good section on why underpowered studies are so common. Researchers don’t realize their studies are too small; as long as they find *something* significant among the tons of comparisons they run, they feel the study was powerful enough; they do multiple-comparisons corrections (which is good!), but don’t account for the fact that they *will* do these corrections when computing power; and even with best intentions, power calculations are hard.
- p.23: Yes! Instead of computing power, we should do the equivalent to achieve a desired confidence interval width (sufficiently narrow). I hadn’t heard of this handy term “assurance,” which determines how often the confidence interval must beat our target width.

  I’ve only seen this approach rarely, mostly with survey/poll sample size planning: if you want to say “Candidate A has X% of the vote (plus or minus 3%)” then you can calculate the appropriate sample size to ensure it really will be a 3% margin of error, not 5% or 10%.

  I would much rather teach my students to design an experiment with high assurance than to compute power… assuming we can find or create good assurance-calculation software for them:

  “Sample size selection methods based on assurance have been developed for many common statistical tests, though not for all; it is a new field, and statisticians have yet to fully explore it. (These methods go by the name *accuracy in parameter estimation*, or *AIPE*.)”
- p.49-50: “Particle physicists call this the *look-elsewhere effect*… they are searching for anomalies across a large swath of energies, any one of which could have produced a false positive. Physicists have developed complicated procedures to account for this and correctly limit the false positive rate.” Are these any different from what statisticians do? The cited reference may be worth a read: Gross and Vitells, “Trial factors for the look elsewhere effect in high energy physics.”
- p.52: “One review of 241 fMRI studies found that they used 207 unique combinations of statistical methods, data collection strategies, and multiple comparison corrections, giving researchers great flexibility to achieve statistically significant results.” Sounds like some of the issues I’ve seen when working with neuroscientists on underpowered and overcomplicated (tiny n, huge p) studies are widespread. Cited reference: Carp, “The secret lives of experiments: methods reporting in the fMRI literature.” See also the classic dead-salmon study (“The salmon was asked to determine what emotion the individual in the photo must have been experiencing…”)
- p.60: “And because standard error bars are about half as wide as the 95% confidence interval, many papers will report ‘standard error bars’ that actually span *two* standard errors above and below the mean, making a confidence interval instead.” Wow—I knew this is a confusing topic for many people, but I didn’t know this actually happens in practice so often. Never show any bars without clearly labeling what they are! Standard deviation, standard error, confidence interval (which level?) or what?
- p.61: “A survey of psychologists, neuroscientists, and medical researchers found that the majority judged significance by confidence interval overlap, with many scientists confusing standard errors, standard deviations, and confidence intervals.” Yeah, no kidding. Our nomenclature is terrible. Statisticians need to hire a spin doctor. Cited reference: Belia et al., “Researchers misunderstand confidence intervals and standard error bars.”
- p.62: “Other procedures [for comparing confidence intervals] handle more general cases, but only approximately and not in ways that can easily be plotted.” Let me humbly suggest a research report I co-wrote with my (former) colleagues at the Census Bureau, covering several ways to visually compare confidence intervals or otherwise make appropriate multiple comparisons visually: Wright, Klein, and Wieczorek, “Ranking Population Based on Sample Survey Data.”
- p.90: Very helpful list of what to think about when preparing to design, implement, and analyze a study. There are many “researcher degrees of freedom” and more people are now arguing that at least some of these decisions should be made before seeing the data, to avoid excessive flexibility. I’ll use this list next time I teach experimental design: “What do I measure? Which variables do I adjust for? Which cases do I exclude? How do I define groups? What about missing data? How much data should I collect?”
- p.93-94: “particle physicists have begun performing *blind analyses*: the scientists analyzing the data avoid calculating the value of interest until after the analysis procedure is finalized.” This may be built into how the data are collected; “Other blinding techniques include adding a constant to all measurements, keeping this constant hidden from analysts until the analysis is finalized; having independent groups perform separate parts of the analysis and only later combining their results; or using simulations to inject false data that is later removed.” Examples in medicine are discussed too, such as drafting a “clinical trial protocol.”
- p.109: Nice—I didn’t know about Figshare and Dryad, which let you upload data and plots to encourage others to use and cite them. “To encourage sharing, submissions are given a digital object identifier (DOI), a unique ID commonly used to cite journal articles; this makes it easy to cite the original creators of the data when reusing it, giving them academic credit for their hard work.”
- p.122-123: Research results from physics education: “lectures do not suit how students learn,” so “How can we best teach our students to analyze data and make reasonable statistical inferences?” Use peer instruction and “force students to confront and correct their misconceptions… Forced to choose an answer and discuss why they believe it is true *before* the instructor reveals the correct answer, students immediately see when their misconceptions do not match reality, and instructors spot problems before they grow.”

  I tried using such an approach when I taught last summer, and I found it very useful although a bit tricky since I couldn’t find a good bank of misconception-revealing questions for statistics that’s equivalent to the physicists’ Force Concept Inventory. But for next time I’ll check out the “Comprehensive Assessment of Outcomes in Statistics” that Alex mentions: delMas et al., “Assessing students’ conceptual understanding after a first course in statistics.”

  I’ve also just heard about the LOCUS test (Levels of Conceptual Understanding in Statistics)—may also be worth a look.
- p.128: “A statistician should be a collaborator in your research, not a replacement for Microsoft Excel.” Truth.

I admit I don’t like how Alex suggests most statisticians will do work for you “in exchange for some chocolates or a beer.” I mean, yes, it’s true, but let’s not tell everybody that the going rate is so low! Surely my advice is worth at least a six-pack.
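The survey margin-of-error planning mentioned in the p.23 note above is easy to sketch in R, using the standard textbook worst-case formula (p = 0.5, 95% confidence) rather than anything from Alex’s book:

```r
# Sample size for a proportion with a target margin of error of 3 points
moe_target <- 0.03
n <- ceiling(1.96^2 * 0.25 / moe_target^2)
n  # 1068 respondents, the familiar "about a thousand" of national polls
```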

Similarly, a writeup on Nature’s website quoted a psychologist who sees two possibilities here:

“A pessimistic prediction is that it will become a dumping ground for results that people couldn’t publish elsewhere,” he says. “An optimistic prediction is that it might become an outlet for good, descriptive research that was undervalued under the traditional criteria.”

(Also—how does Nature, of all places, get the definition of p-value wrong? “The closer to zero the P value gets, the greater the chance the null hypothesis is false…” Argh. But that’s neither here nor there.)

Here’s our discussion, with Yotam Hechtlinger and Alex Reinhart.

Yotam:

I’ll play the devil’s advocate. If you’re trying to figure out the nature of people’s emotions or thoughts, a clear finding will be visible from descriptive statistics and a larger sample size. They are actually requesting a stricter standard—it should be so significant that it is obvious to the naked eye. The alternative is to ask a small number of people a question and draw a conclusion just from the fact that the p-value < 0.012. This, sadly, leads to tons of psychology statements that can’t be replicated.

Jerzy:

That would be nice, but what does it mean to be “so significant that it will be obvious to the naked eye”? I have trouble imagining a good simple way to defend such a claim.

Or, if the editors say “We’ll only publish a paper where the estimated effect sizes are huge, AND the sample is huge, AND the standard deviations are tiny,” how should a prospective author decide on their sample size? Do they have to spend all their money on recruiting 1000 subjects instead of, say, making the experimental setup better?

Yotam:

You and I should not be the ones defending a claim of significance. People in the field should. Think of paleontology, for example. They find some bone, then start arguing about whether the finding agrees with current theory, and work on developing some consensus.

So I would argue that a significant finding is one that raises lots of interest among researchers in the field, enables you to draw conclusions, and provides some way to test or refute those conclusions.

They actually say that pretty nicely in the paper at the link you gave: “… we believe that the p < .05 bar is too easy to pass and sometimes serves as an excuse for lower quality research. We hope and anticipate that banning the NHSTP will have the effect of increasing the quality of submitted manuscripts by liberating authors from the stultified structure of NHSTP thinking thereby eliminating an important obstacle to creative thinking…”

I liked the “liberating” there. In other words, they are saying: make something we find interesting, convince us that it’s actually valuable, and we will get you published.

Regarding the fact that it will be harder to find effects that actually meet those criteria: good. Research is hard. Publishing a philosophy paper is hard too. But (arguably) the publication standard in psychology should be raised. I am not certain (to say the least) that p-values or CIs function as good criteria for publication quality. The editors’ interest is just as good and as useful (for the science of psychology!). They have to publish something. Convince them that you are more interesting than the rest, and you’ve got it.

Jerzy:

This discussion is great! But I’m still not convinced.

(1) I agree that it’s not our job to decide what’s *interesting*. But I can come up with a ton of “interesting” findings that are spurious if I use small data sets. Or, if the editors’ only defense against spurious claims is that “we encourage larger than usual sample sizes,” I can just use a big dataset and do data-fishing.
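The data-fishing worry here is just the multiple-comparisons arithmetic: test enough hypotheses at the 0.05 level and a spurious “discovery” becomes near-certain, no matter how large the sample. A minimal sketch (in Python, with illustrative numbers):

```python
# Family-wise false-positive rate when "fishing" through m independent
# hypotheses, each tested at level alpha, when NO true effects exist.
def prob_spurious_hit(m, alpha=0.05):
    """P(at least one test rejects) = 1 - P(all m tests fail to reject)."""
    return 1 - (1 - alpha) ** m

# With one big dataset and, say, 20 independent outcome variables, the
# chance of finding at least one "interesting" (spurious) effect is ~64%.
print(round(prob_spurious_hit(20), 2))  # → 0.64
```

Note that nothing in this calculation involves the sample size, which is why “we encourage larger sample sizes” alone does nothing to guard against fishing.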

I agree that p-values are *not* ideal for convincing me that your results are non-spurious, but I just don’t understand how the editors will decide what *is* convincing without CIs or something similar. “I found evidence of psychic powers! … in a sample of size 5” is clearly not convincing evidence, even though the effect would be interesting if it were true. So what else will you use to decide what’s convincing vs. what’s spurious, if not statistical inference tools? Throwing out all inference just seems too liberating.

(2) Instead of banning CIs, a better way to raise the publication standard would be to *require* CIs that are tight/precise. This is a better place to give editors/reviewers leeway (“Is this precise enough to be useful?”) than sample size (“Did they data-snoop in a database that I think was big enough?”) and liberate authors/researchers. Then the reader can learn whether “That’s a precisely-measured but very small effect, so we’re sure it’s negligible” vs. “That’s a precisely-measured large effect, so we can apply this new knowledge appropriately.”

(3) “Significant” is a terrible word that historical statisticians chose. It should be replaced with “statistically demonstrable” or “defensible” or “plausible” or “non-spurious” or “convincing”. It has nothing to do with whether the claimed effect/finding is *interesting* or *large*. It only tells us whether we think the sample size was big enough for it to be worth even discussing yet, or whether more data are needed before we start to discuss it. (In that sense, I agree that the p < 0.05 cutoff is "too easy to pass.")

But you and I *should* have a say in whether something is called "statistically significant." Our core job, as PhD level statisticians, is basically to develop this and other similar inferential properties of estimators/procedures.

(4) Of course there are cases where sample size is irrelevant. If previously everybody thought that no 2-year-old child can learn to recite the alphabet backwards, then it only takes n=1 such child to be an interesting publication. But that's just showing that something is possible, not estimating effects or population parameters, which does require stat inference tools.

Alex:

There’s some work (e.g. Cumming’s book “Understanding the new statistics”) on choosing sample sizes to ensure a small CI, rather than a high power. I agree with Jerzy that, without CIs or some other inferential tool, requiring a larger sample size isn’t meaningful—the sample size needed to detect a given difference is often non-intuitive, and without making a CI or calculating power you won’t realize that your sample is inadequate.

Requiring effects to be big enough to be visually obvious also doesn’t cover the opposite problem in inference: when people conclude “I can’t see an effect, so one must not exist.” It’s much better to use a CI to quantify which effect sizes are plausible.
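Cumming’s precision-for-planning idea boils down to one line of arithmetic: pick the confidence-interval half-width you want, then solve for n. A minimal sketch under a normal approximation with a standard deviation assumed known from pilot data (the function name and numbers are illustrative, not from Cumming’s book):

```python
import math

def n_for_halfwidth(sd, halfwidth, z=1.96):
    """Smallest n whose approximate 95% CI half-width, z*sd/sqrt(n),
    is at or below the target half-width (normal approximation)."""
    return math.ceil((z * sd / halfwidth) ** 2)

# Pinning a mean down to +/- 0.1 SD takes ~385 subjects;
# settling for +/- 0.2 SD takes only ~97.
print(n_for_halfwidth(sd=1.0, halfwidth=0.1))  # → 385
print(n_for_halfwidth(sd=1.0, halfwidth=0.2))  # → 97
```

This is sample-size planning driven by precision rather than power, which is the kind of “require tight CIs” standard discussed above.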

Yotam:

I think the position I’m taking in this discussion goes further than CIs. I agree with you that if we are doing statistics, it’s better to do it right, and CIs, especially tight ones, can often provide quite valuable and important information if the problem is stated right. That is, WHEN statistics is used in order to draw important conclusions.

You asked: “So what else will you use to decide what’s convincing vs. what’s spurious, if not statistical inference tools?”

And this, at least to me, is the heart of our discussion. Statistics is not the holy grail of interesting scientific discovery in all fields. I gave Paleontology as an example. Take History, Philosophy, Math(!), CS (mostly), Chemistry(?), Microeconomics, Business(?), Law, and tons of others. Of course statistics is widely used in most of those fields, but research there is conducted inside the research community, and theories and schools develop, without statistical significance.

In Psychology, statistics has become the sheriff of valid research. But de facto, it is not working. Some may say this is because statistics is not being done right. I think that is a pretty big statement, as there are very smart people there. Even when done perfectly right, the nature of discovery there is different. In my opinion (without checking), the research questions rely too heavily on humans, and more often than not it feels like the research is being held back by the tools.

There are alternatives to the statistical framework. Think about Geology. If a researcher finds something that doesn’t fit current theories, he will point that out and offer an alternative. For his ideas to be accepted, it’s not a matter of small CIs, but a matter of convincing the geological community that this is an important discovery.

Another example: Philosophy. Descartes says something. Hume disagrees. Philosophers can go on and on with logic and stories and claims until one school is more sound, and then move on to research some other interesting problems in the field. You can think of an experiment, you can conduct it and get some statistics, but what will it mean?

Psychology might deserve the same treatment. Freud wasn’t doing any statistics. If you want to state something about human nature, or the human mind, state it, and explain perfectly well why you think so. Whether you show it with experiments, or an interesting story (as in business when they analyze test cases), or a tight CI is less important. The important part is that you manage to convince the people in your community that your work is interesting and important to the field.

I think what troubles you about my position (correct me if I’m wrong) is that I claim Psychology’s CIs don’t mean a lot. But that is not coming from disrespect for psychology’s research; it comes from understanding statistics’ limitations. I think that after thousands of research papers, stating almost everything, and exactly the opposite, in a very significant way, maybe psychologists could use a change and do exactly as the editors ask:

“Convince us in a clear and creative way that you are doing something important and interesting without stating p < 0.002 with 89% power. We have enough of those types of claims. Find some other way to get our interest.”

How would you do that without inference? I guess with logic, experience, knowledge and common sense.

Alex:

Perhaps another way to state that is that psychology develops theories which are not easy to test statistically. Paul Meehl wrote a great paper in the 60s, “Theory-testing in psychology and physics: A methodological paradox”, which argues that statistical tests of psychological theories typically don’t provide much evidence of anything.

Jerzy:

Yotam, I agree 100% that there is scope for other kinds of research than the numerical experiments which statistics can be applied to. Yes, more people should be encouraged to observe interesting things that are not data-driven (like digging up an unknown kind of bone) and invent new theories that have no reliance on statistical inference, just “logic, experience, knowledge and common sense.”

But in this particular journal editorial, they don’t seem to be talking about that. They say:

“Are any inferential statistical procedures required? No, because the state of the art remains uncertain. However, BASP will require strong descriptive statistics, including effect sizes… we encourage the use of larger sample sizes…”

So in their own words, they plan to keep focusing on publishing studies that rely on large samples and are interpreted in terms of statistical analysis. They are *not* talking about the studies that you describe (a business case-study, a new theory of mind, a chemical lab-bench experiment, a newly-discovered species). They *want* to publish statistics—they just don’t want to publish any rigorous inferential info along with them.

Again, I fully agree that the direction you propose is valuable. But these journal authors aren’t proposing that! They propose to keep demanding statistical evidence, but ignore the measures of quality that distinguish better vs worse statistical evidence. Right?

Yotam:

I see what you’re saying. Well, I am not certain about their publishing criteria. It can be read as if they now insist on “bad statistics” because it’s simpler for the researcher, which would obviously be a mistake.

But I read it a bit differently, and since they are taking such a big step, I think it’s better to give them the benefit of the doubt. I read their message as: “Forget about statistics. You are liberated from those tools. Do interesting, creative experiments, and if you find something cool, your results will follow from descriptive statistics.”

This is somewhat different from an editor requesting that the researcher do statistical research and publish statistics. The way I read it (which obviously can be wrong), they *want* to publish psychology, and statistics can be used in the process to demonstrate your claims. It might turn out to be too liberal an interpretation of this specific journal, I’m not sure, and it will depend on the type of papers they publish from now on.

In any case, my main point is that by easing the statistical standard on psychology research, psychology can only benefit. By forcing researchers to do statistics “right” you end up with psychology journals publishing statistics that usually don’t mean a lot. This is why I think this is a step in the right direction (and maybe not far enough).

Of course there is quantitative work being done in psychology experiments, but I think the nature of the claims, and the interest they raise, should focus more on the psychology and less on the statistics. I do not know a lot about behavioral psychology or psychology in general, so I might be wrong about that, but the first experiment I can think of is the Stanford prison experiment, where they made a bunch of students prisoners and guards, and watched how the students behaved. Now this is an interesting experiment about human nature, and you do not need any p-values to discuss its meaning or results. I know this is from the ’70s and an IRB would never approve something like that again, but shouldn’t that be the type of research the researcher is encouraged and focused on doing? Why is the statistics important there?

If we agree about this general (more radical) claim about statistics in psychology, I’m fine with discussing the intentions of this specific journal at a later time, or over a beer, if that is cool with you. My main claim is that the solution to the replicability crisis in the field is not to do statistics “better”, but to give the researcher enough (or total) statistical slack, and to focus on the psychology. Statistically stupid claims would fail not because the CI is too loose, but because it won’t be possible to perpetuate those claims in the psychological community.

Jerzy:

“discussing … over a beer ” == yes!

So, feel free to continue in the comments, or find us over beers.

Most calculations performed by the average R user are unremarkable in the sense that nowadays, any computer can crush the related code in a matter of seconds. But more and more often, heavy calculations are also performed using R, something especially true in some fields such as statistics. The user then faces total execution times of his codes that are hard to work with: hours, days, even weeks. In this paper, how to reduce the total execution time of various codes will be shown and typical bottlenecks will be discussed. As a last resort, how to run your code on a cluster of computers (most workplaces have one) in order to make use of a larger processing power than the one available on an average computer will also be discussed through two examples.

Unlike many similar guides I’ve seen, this really is aimed at a computing novice. You don’t need to be a master of the command line or a Linux expert (Windows and Mac are addressed too). You are walked through installation of helpful non-R software. There’s even a nice summary of how hardware (hard drives vs RAM vs CPU) all interact to affect your code’s speed. The whole thing is 60 pages, but it’s a quick read, and even just skimming it will probably benefit you.

Favorite parts:

- “The strategy of opening R several times and of breaking down the calculations across these different R instances in order to use more than one core at the same time will also be explored (this strategy is very effective!)” I’d never realized this is possible. He gives some nice advice on how to do it with a small number of R instances (sort of “by hand,” but semi-automated).
- I knew about rm(myLargeObject), but not about needing to run gc() afterwards.
- I haven’t used Rprof before, but now I will.
- There’s helpful advice on how to get started combining C code with R under Windows—including what to install and how to set up the computer.
- The doSMP package sounds great, but too bad it’s been removed; I should practice using the parallel and snow packages instead.
- P.63 has a helpful list of questions to ask when you’re ready to learn using your local cluster.
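The “open several R instances” strategy in the first bullet is ordinary data-parallelism: split independent tasks across worker processes and combine the results. Here is the same pattern sketched in Python’s multiprocessing module (not R; `slow_task` is a made-up stand-in for an expensive simulation or model fit):

```python
from multiprocessing import Pool

def slow_task(x):
    # Stand-in for an expensive computation (a simulation, a model fit...).
    return sum(i * i for i in range(x))

if __name__ == "__main__":
    inputs = [10_000] * 8
    # Serial: one process works through all eight tasks.
    serial = [slow_task(x) for x in inputs]
    # Parallel: the same tasks split across four worker processes,
    # analogous to running four R instances "by hand".
    with Pool(processes=4) as pool:
        parallel = pool.map(slow_task, inputs)
    assert serial == parallel  # identical results, less wall-clock time
```

In R itself, parallel::mclapply or parallel::parLapply now automates the same by-hand splitting.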

One thing Uyttendaele could have mentioned, but didn’t, is the use of databases and SQL. These can be used to store really big datasets and pass small pieces of them into R efficiently, instead of loading the whole dataset into RAM at once. Anthony Damico recommends the column-store database system MonetDB and has a nice introduction to using MonetDB with survey data in R.
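The access pattern here (filter in the database, pull back only a small slice) can be sketched with Python’s built-in sqlite3 standing in for MonetDB; the table and column names below are made up for illustration, echoing the poverty-rate example:

```python
import sqlite3

# An in-memory database standing in for a large on-disk store.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE survey (county TEXT, poverty_rate REAL)")
con.executemany("INSERT INTO survey VALUES (?, ?)",
                [("A", 0.12), ("B", 0.31), ("C", 0.08), ("D", 0.27)])

# Instead of loading the whole table into RAM, let the database do the
# filtering and return only the small piece the analysis needs.
rows = con.execute(
    "SELECT county FROM survey WHERE poverty_rate > 0.2 ORDER BY county"
).fetchall()
print([r[0] for r in rows])  # → ['B', 'D']
con.close()
```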

Tomorrow (March 3rd) is the launch party for several new (joint-)major programs for CMU undergrads: Statistics and Machine Learning, Statistics and Neuroscience, and Mathematical Statistics. That’s in addition to two existing programs: Statistics Core and the joint program in Economics and Statistics.

If you’re in Pittsburgh, come to the launch party at 4:30pm tomorrow. We’ll have project showcases, advising, interactive demos, etc., not to mention free food.

Well, the journal *Basic and Applied Social Psychology* has recently decided to ban p-values… but they’ve also tossed out confidence intervals and all the rest of classical statistical inference. And they’re not sold on Bayesian inference either. (Nor does their description of Bayes convince me that they understand it, with weird wordings like “strong grounds for assuming that the numbers really are there.”)

Apparently, instead of choosing another, less common inference flavor (such as likelihood or fiducial inference), they are doing away with rigorous inference altogether and only publishing descriptive statistics. The only measure they explicitly mention to prevent publishing spurious findings is that “we encourage the use of larger sample sizes than is typical in much psychology research, because as the sample size increases, descriptive statistics become increasingly stable and sampling error is less of a problem.” That sounds to me like they know sampling error and inference are important—they just refuse to quantify them, which strikes me as bizarre.

I’m all in favor of larger-than-typical sample sizes, but I’m really curious how they will decide whether they are large enough. Sample sizes need to be planned before the experiment happens, long before you get feedback from the journal editors. If a researcher plans an experiment, hoping to publish in this journal, what guidance do they have on what sample size they will need? Even just doubling the sample size is already often prohibitively expensive, yet it doesn’t even halve the standard error; will that be convincing enough? Or will they only publish Facebook-sized studies with millions of participants (which often have other experimental-design issues)?
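The “doubling doesn’t even halve it” point is the familiar 1/√n behavior of the standard error of a mean; a quick arithmetic check:

```python
import math

def standard_error(sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

# Doubling n only shrinks the SE by a factor of 1/sqrt(2) ~ 0.71;
# to actually halve the standard error you must quadruple the sample.
print(round(standard_error(1.0, 2000) / standard_error(1.0, 1000), 3))  # → 0.707
print(round(standard_error(1.0, 4000) / standard_error(1.0, 1000), 3))  # → 0.5
```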

Conceivably, they might work out these details and this might still turn out to be a productive change making for a better journal, if the editors are more knowledgeable than the editorial makes them sound, AND if they do actually impose a stricter standard than p<0.05, AND if good research work meeting this standard is ever submitted to the journal. But I worry that, instead, it'll just end up downgrading the journal's quality and reputation, making referees unsure how to review articles without statistical evidence, and making readers unsure how reliable the published results are.

See also the American Statistical Association’s comment on the journal’s new policy, and the reddit discussion (particularly Peter’s response).

*Edit: John Kruschke is more hopeful, and Andrew Gelman links to a great paper citing cases of actual harm done by NHST. Again, I’m not trying to defend overuse of p-values—but there are useful and important parts of statistical inference (such as confidence intervals) that cannot be treated rigorously with descriptive statistics alone. And reliance on the interocular trauma test alone just frees up more ways to fiddle with the data to sneak it past reviewers.*