Category Archives: Statistics

Think-aloud interviews can help you write better assessments

I’m delighted to share with you a pilot study that we ran at CMU this fall. Long story short: It’s hard to write good conceptual-level questions to test student understanding, but think-aloud interviews are a very promising tool. By asking real students to talk out loud as they solve problems, you get insight into whether students give the right/wrong answers because they really do/don’t understand the problem—or whether the question should be better written. If students answer wrong because the question is ambiguous, or if they get it right using generic test-taking strategies rather than knowledge from the course, think-alouds give you a chance to detect this and revise your test questions.

Some context:
CMU’s stats department—renamed the Department of Statistics & Data Science just this fall—is revising the traditional Introductory Statistics classes we offer. Of course, as good statisticians, we’d like to gather evidence and measure whether students are learning any better in the new curriculum. We found several pre-made standardized tests of student learning in college-level Intro Stats, but none of them quite fit what we wanted to measure: have students learned the core concepts, even if they haven’t memorized traditional formulas and jargon?

We tried writing a few such multiple-choice questions ourselves, but it was quite a challenge to see past our own expert blind spots. So, we decided to get hands-on practice with assessment in the Fall 2017 offering of our graduate course on Teaching Statistics. We read Adams and Wieman (2011), “Development and Validation of Instruments to Measure Learning of Expert-Like Thinking”—who recommended using think-aloud interviews as a core part of the test-question design and validation process. This method isn’t commonly known in Statistics, although I have related experience from a decade ago when I studied design at Olin College and then worked in consumer insights research for Ziba Design. It’s been such a delight to revisit those skills and mindsets in a new context here.

We decided to run a pilot study where everyone could get practice running think-aloud interviews. With a handful of anonymous student volunteers, we ran through the process: welcome the volunteer, describe the process, give them some warm-up questions to practice thinking aloud as they solve problems, then run through a handful of “real” Intro Stats test questions and see how they tackle them. During the first pass, the interviewer should stay silent, apart from reminders like “Please remember to think out loud” if the student stops speaking. It’s not perfect, but it gets us closer to how students would really approach this question on an in-class test (not at office hours or in a study session). At the end, we would do a second pass to follow up on anything interesting or unclear, though it’s still best to let them do most of the talking: interviewers might say “I see you answered B here. Can you explain in more detail?” rather than “This is wrong; it should be C because…”

After this pilot, we feel quite confident that a formal think-aloud study will help us write questions that really measure the concepts and (mis)understandings we want to detect. The think-aloud script was drafted based on materials from Chris Nodder’s Lynda.com course and advice from Bowen (1994), “Think-aloud methods in chemistry education”. But there are quite a few open questions remaining about how best to implement the study. We list these on the poster above, which we presented last week at CMU’s Teaching & Learning Summit.

The current plan is to revise our protocol for the rest of Fall 2017 and design a complete formal study. Next, we will run think-alouds and revise individual questions throughout Spring 2018, then pilot and validate at the test level (which set of questions works well as a whole?) in Fall 2018, with pre- and post-tests across several sections and variations of Intro Stats.

PS — I mean no disrespect towards existing Intro Stats assessments such as CAOS, ARTIST, GOALS, SCI, or LOCUS. These have all been reviewed thoroughly by expert statisticians and educators. However, in the documentation for how these tests were developed, I cannot find any mention of think-alouds or similar testing with real students. Student testing seems limited to psychometric validation (for reliability etc.) after all the questions were written. I think there is considerable value in testing question-prototypes with students early in the development process.

PPS — Apologies for the long lack of updates. It’s been a busy year of research and paper-writing, with a particularly busy fall of job applications and interviews. But I’ll have a few more projects ready for sharing here over the next month or two.

ASA watch on integrity of federal statistical data

If you follow the topics I blog about, then you may also wish to read the American Statistical Association’s recent post about monitoring possible threats to the US federal statistical system.

In short, there are concerns that the new administration may defund or eliminate valuable statistical programs. They may also insist on asking for citizenship / immigration status on the next decennial Census, despite the detrimental effect this would have on response rates and data quality. The ASA’s post has further details and links to relevant news stories.

Whatever your political views, it’s difficult to manage a country effectively and efficiently without high-quality statistical information. Public data is an important good for members of the public as well, as I argued a few years back during worries about eliminating a federal statistical survey. I’m grateful those concerns did not come to pass then. I hope for the best this time too.

Statistical Science conversations, and in memoriam

The “Conversations” sections of Statistical Science are now available for open access. These interviews offer valuable perspectives on the history of our field. But as I look over the list of names, I am sad to reflect on the influential statisticians who passed away in 2016.

Earlier this year, we lost Peter Hall and Charles Stein, both important contributors to statistical theory and practice.

This month, my department bid farewell to Steve Fienberg, a wonderful mentor, teacher, and researcher. His work on categorical data informed several of my projects back at the Census Bureau. I fondly remember the warm welcome my family received from Steve and his wife Joyce when we arrived at CMU. I regret I never took the opportunity to collaborate directly on his many fascinating projects, which included a wide range of topics like human rights, Census work, privacy & confidentiality, and forensic science.

Steve’s “Conversations” interview from 2013 contains many nuggets of wisdom on theory vs. practice, success in grad school, life in academia and beyond, etc. There was also a good interview at Statistics Views last year.
He was a pillar of the department and the broader statistical community, and we miss him dearly.

After 7th semester of statistics PhD program

I was lucky to have research grant support and minimal TAing duties this fall, so all semester I’ve felt my research was chugging along productively. Yet I have less to show for it than last semester—I went a little too far down an unrewarding rabbit-hole. Knowing when to cut your losses is an important skill to learn!

Previous posts: the 1st, 2nd, 3rd, 4th, 5th, and 6th semesters of my Statistics PhD program.

Research

Having defended my proposal this summer, I spent a lot of time this fall attacking one main sub-problem. Though I always felt I was making reasonable progress, eventually I discovered it to be a dead end with no practical solution. I had wondered why nobody had solved this problem yet; it turns out that it’s just inherently difficult, even for the simplest linear-regression case! Basically, I wanted to provide finite-sample advice for a method where (1) the commonly-used approach is far from optimal but (2) the asymptotically-optimal approach is useless in finite samples. I think we can salvage parts of my work and still publish something useful, but it’ll be much less satisfying than I had hoped.

Working on a different problem, it felt encouraging to find errors in another statistician’s relevant proof: I felt like a legitimate statistician who can help colleagues notice problems and suggest improvements. On the other hand, it was also disappointing, because I had hoped to apply the proof idea directly to my own problem, and now I cannot 🙂

On a third front, my advisor invited another graduate student, Daren Wang, to help us wrap up a research project I had started in 2015 and then abandoned. Daren is bright, fast, and friendly, a pleasure to collaborate with (except when I’m despairing that it took him only a week to whiz through and improve on the stuff that took me half a year). Quite quickly, we agreed there wasn’t much more we could do to turn this into a substantially better paper, so we decided to just package it up now and submit to a conference. It was satisfying to work on writing and submitting a paper, one of the main skills for which I came to grad school!

Finally, I was hoping to clear up some stumbling blocks in an end-of-semester meeting with several committee members. Instead, our meeting raised many fascinating new questions & possible future directions… without wrapping up any loose ends. Alas, such is research 🙂

Classes

As I’ve noted before, I audited Jordan Rodu’s Deep Learning course. I really liked the journal-club format: Read a paper or two for every class session. Write a short response before class, so the instructor can read them first. Come prepared to discuss and bring up questions of your own. I wish more of our courses were like this—compared to lecture, it seems better for the students and less laborious for the instructor.

Although it was a theory course, not hands-on, I did become intrigued enough by one of the papers to try out the ideas myself. My classmate Nicolas Kim and I are playing around with Keras on a GPU to understand some counterintuitive ideas a little better. Hopefully we’ll have something to report in a couple of weeks.

I also started to audit Kevin Kelly’s undergrad and grad-level courses on Epistemology (theory of knowing). Both were so fascinating that I had to drop them, else I would have done all the course readings at the expense of my own research 🙂 but I hope to take another stab someday. One possibly-helpful perspective I got, from my brief exposure to Epistemology, was a new-to-me (caricatured) difference between Bayesian and classical statistics.

  • Apparently most philosophy-of-science epistemologists are Bayesian. They posit that a scientist’s work goes like this: You are given a hypothesis, some data, and some prior knowledge or belief about the problem. How should we use the data to update our knowledge/belief about that hypothesis? In that case, obviously, Bayesian updating is a sensible way to go.
  • But I disagree with the premise. Often, a scientist’s work is more like this: You’re not handed a hypothesis or a dataset, but must choose them yourself. You also know your colleagues will bicker over claims of prior knowledge. If you come up with an interesting question, what data should you collect so that you’ll most likely find a strong answer? That is, an answer that most colleagues will find convincing regardless of prior belief, and that will keep you from fooling yourself? This is the classical / frequentist setting, which treats design (of a powerful, convincing experiment / survey / study) as the heart of statistics. In other words, you’re not merely evaluating “found” data—your task is to choose a design in hopes of making a convincing argument.
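As a toy illustration of the Bayesian-updating picture in the first bullet, here is the textbook conjugate example (the coin, the uniform prior, and the counts are all invented for illustration):

```python
# Conjugate Beta-Binomial updating: the "given a hypothesis, data, and
# a prior, update your belief" workflow the epistemologists describe.

def beta_binomial_update(alpha, beta, heads, tails):
    """Beta(alpha, beta) prior + Binomial data -> Beta posterior."""
    return alpha + heads, beta + tails

# Uniform prior Beta(1, 1); observe 7 heads and 3 tails.
a, b = beta_binomial_update(1, 1, heads=7, tails=3)
posterior_mean = a / (a + b)  # (1 + 7) / (1 + 7 + 1 + 3) = 2/3
```

The frequentist designer in the second bullet asks a different question before flipping anything: how many flips do I need so that my answer will convince a skeptical colleague, whatever their prior?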

Other projects

Some of my cohort-mates and I finally organized a Dissertation Writing Group, a formal setting to talk shop technically with other students whose advisors don’t already hold research-group meetings. I instigated this selfishly, wanting to have other people I can pester with theory questions or simply vent with. But my fellow students agreed it’s been useful to them too. We’re also grateful to our student government for funding coffee and snacks for these meetings.

I did not take on other new side projects this fall, but I’ve stayed in touch with former colleagues from the Census Bureau still working on assessing & visualizing uncertainty in estimate rankings. We have a couple of older reports about these ideas. We still hope to publish a revised version, and we’re working on a website to present some of the ideas interactively. Eventually, the hope is to incorporate some of this into the Census website, to help statistical-novice data users understand that estimates and rankings come with statistical uncertainty.

Finally, I heard about (but have not attended) CMU’s Web Dev Weekend. I really like the format: a grab-bag of 1- or 2-hour courses, suitable for novices, that get you up and running with a concrete project and a practical skill you can take away. Can we do something similar for statistics?

Topic ideas where a novice could learn something both interesting and useful in a 1.5h talk:

  • How not to fool yourself in A/B testing (basic experimental design and power analysis)
  • Befriend your dataset (basic graphical and numerical EDA, univariate and bivariate summaries, checking for errors and outliers)
  • Plus or minus a bit (estimating margins of error—canned methods for a few simple problems, intro to bootstrap for others)
  • Black box white belt (intro to some common data mining methods you might use as baselines in Kaggle-like prediction problems)

Many of these could be done with tools that are familiar (Excel) or novice-friendly (Tableau), instead of teaching novices to code in R at the same time as they learn statistical concepts. This would be a fun project for a spring weekend, in my copious spare time (hah!).
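For instance, the A/B-testing session above could open with a back-of-the-envelope sample-size calculation like this one (a sketch using only the Python standard library; the 10% and 12% conversion rates are invented for illustration):

```python
from statistics import NormalDist

def ab_test_sample_size(p1, p2, alpha=0.05, power=0.8):
    """Approximate per-arm sample size to detect a difference between
    two conversion rates, via the usual normal approximation."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2

# Detecting a lift from a 10% to a 12% conversion rate takes
# several thousand users per arm -- more than many novices expect.
n_per_arm = ab_test_sample_size(0.10, 0.12)
```

The point for a novice audience: intuition about “enough data” is usually off by a lot, which is exactly the fool-yourself trap such a talk would address.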

Life

Offline, we are starting to make some parent friends through daycare and playgrounds. I’m getting a new perspective on why parents tend to hang out with other parents: it’s nice to be around another person who really understands the rhythm of conversation when your brain is at best a quarter-present (half-occupied by watching kid, quarter-dysfunctional from lack of sleep). On the other hand, it’s sad to see some of these new friends moving away already, leaving the travails of academia behind for industry (with its own new & different travails but a higher salary).

So… I made the mistake of looking up average salaries myself. In statistics departments, average starting salaries for teaching faculty are well below starting salaries for research faculty. In turn, research faculty’s final salary (after decades of tenure) is barely up to the starting salaries I found for industry Data Scientists. Careers are certainly not all about the money, but the discrepancies were eye-opening, and they are good to know about in terms of financial planning going forward. (Of course, those are just averages, with all kinds of flaws. Particularly notable is the lack of cost-of-living adjustment, if a typical Data Scientist is hired in expensive San Francisco while typical teaching faculty are not.)

But let’s end on a high note. Responding to a question about which R / data science blogs to follow, Hadley Wickham cited this blog! If a Hadley citation can’t go on a statistician’s CV, I don’t know what can 🙂

“Sound experimentation was profitable”

Last time I mentioned some papers on the historical role of statistics in medicine. Here they are, by Donald Mainland:

  • “Statistics in Clinical Research: Some General Principles” (1950) [journal, pdf]
  • “The Rise of Experimental Statistics and the Problems of a Medical Statistician” (1954) [journal, pdf]

I’ve just re-read them and they are excellent. What is the heart of statistical thinking? What are the most critical parts of (applied) statistical education? At just 8-9 pages each, they are valuable reading, especially as a gentle rejoinder in this age of shifting fashions around Data Science, concerns about the replicability crisis, and misplaced hopes that Big Data will fix everything painlessly.

Some of Mainland’s key points, with which I strongly agree:

  • The heart of statistical thinking concerns data design, even more so than data analysis. How should we design the study (sampling, randomization, power, etc.) in order to gather strong evidence and to avoid fooling ourselves?

    …the methods of investigating variation are statistical methods. Investigating variation means far more than applying statistical tests to data already obtained. … Statistical ideas, to be effective, must enter at the very beginning, i.e., in the planning of an investigation.

  • Whenever possible, a well-designed experiment is highly preferred over poorly-designed experimental or observational data. It’s stronger evidence… and, as industry has long recognized, it cuts costs.

    In all the applied sciences, inefficient or wrong methods of research or production cause loss of money. Therefore, sound experimentation was profitable; and so applied chemistry and physics adopted modern biological statistics while academic chemists, physicists, and even biologists were disregarding the revolution or resisting it, largely through ignorance.

  • Yes, of course you can apply statistical methods to “found” data. Sometimes you have no alternative (macroeconomics; data journalism); sometimes it’s just substantially cheaper (Big Data). But if you gather haphazard data and merely run statistical tests after the fact, you’re missing the point.

    These unplanned observations may be the only information available as a basis for action, and they may form a useful basis for planned experiments; but we should never forget their inferior status.

    …a significance test has no useful meaning unless an experiment has been properly designed.

  • Statistical education for non-statisticians spends too little time on good data design, and too much on a slew of cookbook formulas and tests.

    …the increase in the incidence of tests—statistical arithmetic—has continued, and so also, very commonly, has the disregard of the more important contribution of statistics, the principles and methods of sound, economical experimentation and valid inference… Another obvious cause is the common human tendency to use gadgets instead of thought. Here the gadgets are the arithmetical techniques, and the statistical “cookbooks” that have presented these techniques most lucidly, without primary emphasis on experimentation and logic, have undoubtedly done much harm.

  • Statistical education for actual applied statisticians also spends too little time on good data design, and too much on mathematics.

    The most important single element in the training (and continuous education) of any statistician is practical experience—experience of investigations for which he himself is responsible, with all their difficulties and disappointments.

    …even if a mathematician specializes in the statistical branch of mathematics, he is not thereby fitted to give guidance in the application of the methods.

  • As an investigator, you must understand statistical reasoning yourself. You can (and should!) hire an applied statistician to help with the details of study design and data analysis, but you must understand their viewpoint to benefit from their help.

    If, however, he is acquainted with the requirements for valid proof, he will often see that what looked like evidence is not evidence at all…

Of course study design is not all of statistics. But it’s a hugely important component that seems underappreciated in modern statistics curricula (at least in my experience). Even if it’s not the sexiest area of current research, I’m surprised my PhD program at CMU completely omits it from our education. (The BS and MS programs here do offer one course each. But I was offered much deeper courses in my MS at Portland State, covering design of experiments and also of survey samples.)

As a bonus, Mainland also offers advice on starting and running a statistical consulting unit. It’s aimed at medical scientists but useful more broadly.

I would quote more, but you should really just read the whole thing. Then comment to tell me why I’m wrong 🙂

After 6th semester of statistics PhD program

Posting far too late again, but here’s what I remember from last Spring…

This was my first semester with no teaching, TAing, or classes (besides one I audited for fun). As much as I enjoy these things, research has finally gone much faster and smoother with no other school obligations. The fact that our baby started daycare also helped, although it’s a bittersweet transition. At the end of the summer I passed my proposal presentation, which means I am now ABD!

Previous posts: the 1st, 2nd, 3rd, 4th, and 5th semesters of my Statistics PhD program.

Thesis research and proposal

During 2015, most of my research with my advisor, Jing Lei, was a slow churn through understanding and extending his sparse PCA work with Vince Vu. At the end of the year I hadn’t gotten far and we decided to switch to a new project… which eventually became my proposal, in a roundabout way.

We’d heard about the concept of submodularity, which seems better known in CS, and wondered where it could be useful in Statistics as well. Das & Kempe (2011) used submodularity to understand when greedy variable selection algorithms like Forward Selection (FS, aka Forward Stepwise regression) can’t do too much worse than Best Subsets regression. We thought this approach might give a new proof of model-selection consistency for FS. It turned out that submodularity didn’t give us a fruitful proof approach after all… but also that (high-dimensional) conditions for model-selection consistency of FS hadn’t been derived yet. Hence, this became our goal: Find sufficient conditions for FS to choose the “right” linear regression model (when such a thing exists), with probability going to 1 as the numbers of observations and variables go to infinity. Then, compare these conditions to those known for other methods, such as Orthogonal Matching Pursuit (OMP) or the Lasso. Finally, analyze data-driven stopping rules for FS—so far we have focused on variants of cross-validation (CV), which is surprisingly not as well-understood as I thought.
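For readers who haven’t met it, Forward Selection is easy to sketch. This toy implementation is illustrative only, not the code from our project; a serious version would update the fits incrementally rather than refitting from scratch at every step:

```python
import numpy as np

def forward_selection(X, y, k):
    """Greedy Forward Stepwise regression: at each step, add the
    variable that most reduces the residual sum of squares."""
    n, p = X.shape
    selected, remaining = [], list(range(p))
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in remaining:
            cols = selected + [j]
            beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            rss = np.sum((y - X[:, cols] @ beta) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Toy example: y truly depends only on columns 0 and 2.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = 3 * X[:, 0] + 2 * X[:, 2] + 0.1 * rng.standard_normal(100)
chosen = forward_selection(X, y, 2)  # → [0, 2]
```

Model-selection consistency asks when this greedy procedure recovers the “right” set of variables, and a data-driven stopping rule (like cross-validation) replaces the fixed k above.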

One thing I hadn’t realized before: when writing the actual proposal, the intent is to demonstrate your abilities and preparedness for research, not necessarily to plan out your next research questions. As it turns out, it’s more important to prove that you can ask interesting questions and follow through on them. Proposing concrete “future work” is less critical, since we all know it’ll likely change by the time you finish the current task. Also, rewriting everything for the paper and talk was itself helpful in getting me to see the “big picture” ideas in my proofs.

Anyhow, it did feel pretty great to actually complete a proof or two for the proposal. Even if the core ideas really came from my advisor or other papers I’ve read, I did do real work to pull it all together and prepare the paper & talk.

Many thanks to everyone who attended my proposal talk. I appreciated the helpful questions and discussion; it didn’t feel like a grilling for its own sake (as every grad student fears). Now it’s time to follow through, complete the research, practice the paper-submission process, and write a thesis!

The research process

When we shifted gears to something my advisor does not already know much about, it helped me feel much more in charge and productive. Of course, he caught up and passed me quickly, but that’s only to be expected of someone who also just won a prestigious NSF CAREER award.

Other things that have helped: Getting the baby into day care. No TAing duties to divide my energy this semester. Writing up the week’s research notes for my advisor before each meeting, so that (1) the meetings are more focused & productive and (2) I build up a record of notes that we can copy-paste into papers later. Reading Cal Newport’s Deep Work book and following common-sense suggestions about keeping a better schedule and tracking better metrics. (I used to tally all my daily/weekly stats-related work hours; now I just tally thesis hours and try to hit a good target each week on those alone, undiluted by side stuff.)

I’m no smarter, but my work is much more productive, I feel much better, and I’m learning much more. Every month I look back and realize that, just a month ago, I’d have been unable to understand the work I’m doing today. So it is possible to learn and progress quite quickly, which makes me feel much better about this whole theory-research world. I just need to immerse myself, spend enough time, revisit it regularly enough, have a concrete research question that I’m asking—and then I’ll learn it and retain it far better than I did the HWs from classes I took.

Indeed, a friend asked what I’d do differently if I were starting the PhD again. I’d spend far less energy on classes, especially on homework. It feels good and productive to do HW, and being good at HW is how I got here… but it’s not really the same as making research progress. Besides, as interesting and valuable as the coursework has been, very little of it has been directly relevant to my thesis (and the few parts that were, I’ve had to relearn anyway). So I’d aim explicitly for “B equals PhD” and instead spend more time doing real research projects, wrapping them up into publications (at least conference papers). As it is, I have a pile of half-arsed never-finished class / side projects, which could instead be nice CV entries if I’d polished them instead of spending hours trying to get from a B to an A.

My advisor also pointed out that he didn’t pick up his immense store of knowledge in a class, but by reading many many papers and talking with senior colleagues. I’ve also noticed a pattern from reading a ton of papers on each of several specialized sub-topics. First new paper I encounter in an area: whoa, how’d they come up with this from scratch, what does it all mean? Next 2-3 papers: whoa, these all look so different, how will I ever follow the big ideas? Another 10-15 papers: aha, they’re actually rehashing similar ideas and reusing similar proof techniques with small variations, and I can do this too. Reassuring, but it does all take time to digest.

All that said, I still feel like a slowly-plodding turtle compared to the superstar researchers here at CMU. Sometimes it helps to follow Wondermark’s advice on how he avoided discouragement in webcomics: ignore the more-successful people already out there and make one thing at a time, for a long time, until you’ve made many things and some are even good.

    (Two years in!) I had just learned the word “webcomics” from a panel at Comic-Con. I was just starting to meet other people who were doing what I was doing.

    Let me repeat that: Two years and over a hundred strips in is when I learned there was a word for what I was doing.

    I had a precious, lucky gift that I don’t know actually exists anymore: a lack of expectations for my own success. I didn’t know any (or very few) comic creators personally; I didn’t know their audience metrics or see how many Twitter followers they had or how much they made on Patreon. My comics weren’t being liked or retweeted (or not liked, or not retweeted) within minutes of being posted.

    I had been able to just sit down and write a bunch of comics without anyone really paying attention, and I didn’t have much of a sense of impatience about it. That was a precious gift that allowed me to start finding my footing as a creator by the time anyone did notice me – when people did start to come to my site, there was already a lot of content there and some of it was pretty decent.

Such blissful ignorance is hard to achieve in a department full of high-achievers. I’ve found that stressing about the competition doesn’t help me work harder or faster. But when I cultivate patience, at least I’m able to continue (at my own pace) instead of stopping entirely.

[Update:] another take on this issue, from Jeff Leek:

    Don’t compare myself to other scientists. It is very hard to get good evaluation in science and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.

Classes

I audited Christopher Phillips’ course Moneyball Nation. This was a gen-ed course in the best possible sense, getting students to think both like historians and like statisticians. We explored how statistical/quantitative thinking entered three broad fields: medicine, law, and sports.

Reading articles by doctors and clinical researchers, I got a context for how statistical evidence fits in with other kinds of evidence. Doctors (and patients!) find it much more satisfying to get a chemist’s explanation of how a drug “really works” than a statistician’s indirect analysis showing that the drug outperforms placebo on average. Another paper confirmed for me that (traditional) Statistics’ biggest impact on science was better experimental design, not better data analysis. Most researchers don’t need to collaborate with a statistical theoretician to derive new estimators; they need an applied statistician who’ll ensure that their expensive experiments are money well spent, avoiding confounding, low power, and all the other pitfalls.

[Update:] I’ve added a whole post on these medical articles.

In the law module, we saw how difficult it is to use statistical evidence appropriately in trials, and how lawyers don’t always find it to be useful. Of course we want our trial system to get the right answers as often as possible (free the innocent and catch the guilty), so from a purely stats view it’s a decision-theory question: what legal procedures will optimize your sensitivity and specificity? But the courts, especially trial by jury, also serve a distinct social purpose: ensuring that the legal decision reflects and represents community agreement, not just isolated experts who can’t be held accountable. When you admit complicated statistical arguments that juries cannot understand, the legal process becomes hard to distinguish from quack experts bamboozling the public, which undermines trust in the whole system. That is, you have the right to a fair trial by a jury of your peers; and you can’t trample on that right in order to “objectively” make fewer mistakes. (Of course, this is also an argument for better statistical education for the public, so that statistical evidence becomes less abstruse.)

[Update:] In a bit more detail, “juries should convict only when guilt is beyond reasonable doubt. …one function of the presumption of innocence is to encourage the community to treat a defendant’s acquittal as banishing all lingering suspicion that he might have been guilty.” So reasonable doubt is meant to be a fuzzy social construct that depends on your local community. If trials devolve into computing a fungible “probability of guilt,” you lose that specificity / dependence on local community, and no numerical threshold can truly serve this purpose of being “beyond a reasonable doubt.” For more details on this ritual/pageant view of jury trials, along with many other arguments against statistics in the courtroom, see (very long but worthwhile) Tribe (1971), “Trial by Mathematics: Precision and Ritual in the Legal Process” [journal, pdf].

[Note to self: link to some of the readings described above.]

Next time I teach I’ll also use Prof. Phillips’ trick for getting to know students: require everyone to sign up for a time slot to meet in his office, in small groups (2-4 people). This helps put names to faces and discover students’ interests.

Other projects

I almost had a Tweet cited in a paper 😛 Rob Kass sent along to the department an early draft of “Ten Simple Rules for Effective Statistical Practice” which cited one of my tweets. Alas, the tweet didn’t make it into the final version, but the paper is still worth a read.

I also attended the Tapestry conference in Colorado, presenting course materials from the Fall 2015 dataviz class that I taught. See my conference notes here and here.

Even beyond that, it’s been a semester full of thought on statistical education, starting with a special issue in The American Statistician (plus supplementary material). I also attended a few faculty meetings in our college of humanities and social sciences, to which our statistics department belongs. They are considering future curricular revisions to the general-education requirements. What should it mean to have a well-rounded education, in general and specifically at this college? These chats also touch on the role of our introductory statistics course: where should statistical thinking and statistical evidence fit into the training of humanities students? This summer we started an Intro Stats working group for revising our department’s first course; I hope to have more to report there eventually.

Finally, I TA’ed for our department’s summer undergraduate research experience program. More on that in a separate post.

Life

My son is coordinated enough to play with a shape-sorter, which is funny to watch. He gets so frustrated that the square peg won’t go in the triangular hole, and so incredibly pleased when I gently nudge him to try a different hole and it works. (Then I go to meet my advisor and it feels like the same scene, with me in my son’s role…)

He’s had many firsts this spring: start of day care, first road trip, first time attending a wedding, first ER visit… Scary, joyful, bittersweet, all mixed up. It’s also becoming easier to communicate, as he can understand us and express himself better; but he also now has preferences and insists on them, which is a new challenge!

I’ve also joined some classmates in a new book club. A few picks have become new favorites; others really put me outside my comfort zone in reading things I’d never choose otherwise.

When static graphs beat interactives

William Cleveland gave a great interview in a recent Policyviz podcast. (Cleveland is a statistician and a major figure in data visualization research; I’ve reviewed his classic book The Elements of Graphing Data before.) He discussed the history of the term “data science,” his visual perception research, statistical computing advances, and more.

But Cleveland also described his work on brushing and on trellis graphics.

  • Brushing is an interactive technique for highlighting data points across linked plots. Plot Y vs X1 and Y vs X2; select some points in the first plot, and they are automatically highlighted in the second. This lets you condition on X1 on the fly, to better understand the multivariate structure between X1, X2, and Y.
  • Trellis displays are essentially Cleveland’s version of small multiples, or of faceting in the Grammar of Graphics sense. Again, you condition on one variable and see how it affects the plots of other variables. See for example slides 10 and 15 here.
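To make the conditioning idea concrete, here is a minimal sketch of a trellis-style display in R using the lattice package, which descends directly from Cleveland’s trellis work. The dataset (R’s built-in mtcars) and variable choices are just illustrative, not from Cleveland’s examples:

```r
# A minimal trellis-style display with the lattice package.
# The data (R's built-in mtcars) and variables are just illustrative.
library(lattice)

# Plot mpg vs. weight, conditioned on cylinder count: one panel per
# subset, all visible at once for simultaneous comparison.
p <- xyplot(mpg ~ wt | factor(cyl), data = mtcars,
            layout = c(3, 1),
            xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
print(p)  # lattice objects must be printed explicitly inside scripts
```

The same conditioning idea is what the Grammar of Graphics (and ggplot2) calls faceting.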

I found it fascinating that the static trellis technique evolved from interactive brushing, not vice versa!

Cleveland and colleagues noticed that although brushing let you find interesting patterns, it was too difficult to remember and compare them. You only saw one “view” of the linked plots at a time. Trellises would instead allow you to see many slices at once, making simultaneous comparisons easier.

For example, here’s a brushing view of data on housing: rent, size, year it was built, and whether or not it’s in a “good neighborhood” (figures from Interactive Graphics for Data Analysis: Principles and Examples). The user has selected a subset of years and chosen “good” neighborhoods, and now these points are highlighted in the scatterplot of size vs rent.

[Figure: brushing view of the housing data]

That’s great for finding patterns in one subset at a time, but not ideal for comparing the patterns in different subsets. If you select a different subset of years, you’ll have to memorize the old subset’s scatterplot to decide whether it differs much from the new one, or keep switching back and forth between views.

Now look at the trellis display: the rows show whether or not the neighborhood is “good,” the columns show subsets of year, and each scatterplot shows size vs rent within that data subset. All these subsets’ scatterplots are visible at once.

[Figure: trellis display of the housing data]

If there were different size-vs-rent patterns across year and neighborhood subsets, we’d be able to spot such an effect easily. I admit I don’t see any such effect here, but that’s an interesting finding in its own right, and one that’s easier to confirm with the trellis than with brushing’s one-view-at-a-time approach.

So the shinier, fancier, interactive graphic is not uniformly better than a careful redesign of the old static one. Good to remember.

Deep Learning course, and my own (outdated) DL attempts

This fall I’m enjoying auditing Jordan Rodu’s mini course on Deep Learning. He’s had us read parts of the forthcoming Deep Learning book (free draft online), finished just this year and thus presumably up-to-date.

[Figure: an artificial neuron in a neural network]

It’s fascinating to see how the core advice has changed from the literature we covered in Journal Club just a few years ago. Then, my team was assigned a 2010 paper by Erhan et al.: “Why Does Unsupervised Pre-training Help Deep Learning?” Unsupervised pre-training¹ seems to have sparked the latest neural network / deep learning renaissance in 2006, underlying some dramatic performance improvements that got people interested in this methodology again after a decade-long “neural network winter.” So, we spent a lot of time reading this paper and writing simulations to help us understand how/why/when pre-training helps. (Here are our notes on the paper, simulations, and class discussion.)

But now in 2016, the Deep Learning book’s Chapter 15 says that “Today, unsupervised pretraining has been largely abandoned” (p.535). It seems to be used only in a few specific fields where there are good reasons for it to work, such as natural language processing. How quickly this field has changed!

Obviously, larger datasets and more raw computing power helped make deep neural networks feasible and interesting again in the 2000s. But algorithmic developments have helped too. Although unsupervised pre-training is what sparked renewed interest, the recent book claims (p.226) that the most important improvements have been: (1) using cross-entropy loss functions (i.e., optimizing the negative log-likelihood) instead of always using mean squared error, and (2) using rectified linear activation functions in hidden units instead of sigmoid activation functions. Chapter 6 explains what these things mean and why they make a difference. But basically, these small tweaks (to the loss function you optimize, and to the non-linearities you work with) make large models much easier to fit, because they give you steeper gradients when your model fits poorly, so you don’t get stuck in regions of poor fit quite as often.
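Here’s a toy calculation (my own, not from the book) showing the loss-function half of that story, for a sigmoid output unit that is confidently wrong. With squared error, the chain rule multiplies in the saturated sigmoid’s tiny derivative; with cross-entropy, that derivative cancels and the gradient stays steep:

```r
# Toy example (mine, not from the book): gradient of the loss w.r.t.
# the pre-activation z of a sigmoid output unit with true label y = 1.
sigmoid <- function(z) 1 / (1 + exp(-z))

z <- -5          # the unit is confidently wrong:
y <- 1           # p = sigmoid(-5) is about 0.007, but the label is 1
p <- sigmoid(z)

# Squared-error loss (p - y)^2 / 2: the chain rule brings in the
# sigmoid's derivative p * (1 - p), which is tiny when saturated.
grad_mse  <- (p - y) * p * (1 - p)

# Cross-entropy loss -log(p): the sigmoid's derivative cancels,
# leaving the gradient p - y, which stays large while the fit is poor.
grad_xent <- p - y

c(mse = grad_mse, xent = grad_xent)  # the latter is ~150x larger here
```

So on this badly-fit example, the cross-entropy gradient is about 150 times steeper, which is exactly the “don’t get stuck in regions of poor fit” effect.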

I look forward to learning more as Jordan’s class progresses. Meanwhile, if you want to try building a deep neural network from scratch yourself, I found the Stanford Deep Learning Tutorial helpful. Here are my solutions to some of the exercises. (This doesn’t teach you to use the well-designed, optimized, pre-made Deep Learning libraries that you’d want for a real application—just to practice building their core components from scratch so you understand how they work in principle. Your resulting code isn’t meant to be optimal and you wouldn’t use it to deploy something real.)


PS—here’s also a nice post on Deep Learning from Michael Jordan (the ML expert, not the athlete). Instead of claiming ML will take over Statistics, I was glad to hear him reinforcing the importance of traditionally statistical questions:

…while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g.,

1. How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have?
2. How can I get meaningful error bars or other measures of performance on all of the queries to my database?
3. How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources?
4. How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on?
5. How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken?
6. How do I deal with non-stationarity?
7. How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.


PSA: R’s rnorm() and mvrnorm() use different spreads

Quick public service announcement for my fellow R nerds:

R has two commonly-used random-Normal generators: rnorm and MASS::mvrnorm. I was foolish and assumed that their parameterizations were equivalent when you’re generating univariate data. But nope:

  • Base R can generate univariate draws with rnorm(n, mean, sd), which uses the standard deviation for the spread.
  • The MASS package has a multivariate equivalent, mvrnorm(n, mu, Sigma), which uses the variance-covariance matrix for the spread. In the univariate case, Sigma is the variance.

I was using mvrnorm to generate a univariate random variable, but giving it the standard deviation instead of the variance. It took me two weeks of debugging to find this problem.
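Here’s a minimal demonstration of the mismatch (the seed and sample size are arbitrary):

```r
# rnorm takes the standard deviation; MASS::mvrnorm takes the variance.
library(MASS)
set.seed(42)

s <- 3                                       # target standard deviation
x_ok  <- rnorm(1e5, mean = 0, sd = s)        # spread given as the sd
x_ok2 <- mvrnorm(1e5, mu = 0, Sigma = s^2)   # spread given as the VARIANCE
x_bug <- mvrnorm(1e5, mu = 0, Sigma = s)     # my bug: passed the sd instead

sd(x_ok)   # about 3
sd(x_ok2)  # about 3
sd(x_bug)  # about sqrt(3) = 1.73, not 3!
```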

Dear reader, I hope this cautionary tale reminds you to check R function arguments carefully!

Data sanity checks: Data Proofer (and R analogues?)

I just heard about Data Proofer (h/t Nathan Yau), a test suite of sanity-checks for your CSV dataset.

It checks a few things you’d really want to know but might forget to check yourself:

  • Basics, like whether any rows are exact duplicates, or whether any columns are totally empty.
  • Things I always forget to check until they cause a bug, like whether geographic coordinates fall within -90 to 90 degrees latitude and -180 to 180 degrees longitude.
  • Things I never think to check, though I should, like whether there are exactly 65,536 rows (probably a truncated export from an older version of Excel) or whether integers sit exactly at common cutoff/overflow values.

I like the idea of automating this. It certainly wouldn’t absolve me of the need to think critically about a new dataset, but it might flag some things I wouldn’t have caught otherwise.

(They also run some statistical checks for outliers; but as a statistician, that’s one thing I do remember to do myself, and (I’d like to think) more carefully than any simple automated check.)

Does an R package like this exist already? The closest thing in spirit that I’ve seen is testdat, though I haven’t played with that yet. If not, maybe testdat could add some more of Data Proofer’s checks. It’d become an even more valuable tool to run whenever you load or import any tabular dataset for the first time.
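In the meantime, a few of these checks are easy to sketch in base R. Everything below is hypothetical: the function name and its interface are mine, not from Data Proofer or testdat.

```r
# Hypothetical sketch of a few Data-Proofer-style checks in base R;
# the function name and arguments are illustrative, not from any package.
sanity_check <- function(df, lat_col = NULL, lon_col = NULL) {
  problems <- character(0)
  if (anyDuplicated(df) > 0)
    problems <- c(problems, "exact duplicate rows")
  empty <- vapply(df, function(col) all(is.na(col)), logical(1))
  if (any(empty))
    problems <- c(problems, paste("totally empty column:", names(df)[empty]))
  if (nrow(df) %in% c(65536L, 1048576L))   # old/new Excel row limits
    problems <- c(problems, "row count equals an Excel export limit")
  if (!is.null(lat_col) && any(abs(df[[lat_col]]) > 90, na.rm = TRUE))
    problems <- c(problems, "latitude outside [-90, 90]")
  if (!is.null(lon_col) && any(abs(df[[lon_col]]) > 180, na.rm = TRUE))
    problems <- c(problems, "longitude outside [-180, 180]")
  problems
}

# Example: an all-NA column and an impossible latitude are both flagged.
sanity_check(data.frame(lat = c(12, 95), notes = NA), lat_col = "lat")
```

A fuller version would return a structured report rather than strings, but even this much would be worth running on every freshly imported CSV.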