Category Archives: Statistics

Statistical Science conversations, and in memoriam

The “Conversations” sections of Statistical Science are now available for open access. These interviews are valuable perspectives on the history of our field. But as I look over the list of names here, I am sad to reflect on the influential statisticians who passed away in 2016.

Earlier this year, I know we lost Peter Hall and Charles Stein, important contributors to statistical theory and practice.

This month, my department bid farewell to Steve Fienberg, a wonderful mentor, teacher, and researcher. His work on categorical data informed several of my projects back at the Census Bureau. I fondly remember the warm welcome my family received from Steve and his wife Joyce when we arrived at CMU. I regret I never took the opportunity to collaborate directly on his many fascinating projects, which included a wide range of topics like human rights, Census work, privacy & confidentiality, and forensic science.

Steve’s “Conversations” interview from 2013 contains many nuggets of wisdom on theory vs. practice, success in grad school, life in academia and beyond, etc. There was also a good interview at Statistics Views last year.

He was a pillar of the department and the broader statistical community, and we miss him dearly.

After 7th semester of statistics PhD program

I was lucky to have research grant support and minimal TAing duties this fall, so all semester I’ve felt my research was chugging along productively. Yet I have less to show for it than last semester—I went a little too far down an unrewarding rabbit-hole. Knowing when to cut your losses is an important skill to learn!

Previous posts: the 1st, 2nd, 3rd, 4th, 5th, and 6th semesters of my Statistics PhD program.


Having defended my proposal this summer, I spent a lot of time this fall attacking one main sub-problem. Though I always felt I was making reasonable progress, eventually I discovered it to be a dead-end with no practical solution. I had wondered why nobody’s solved this problem yet; it turns out that it’s just inherently difficult, even for the simplest linear-regression case! Basically I wanted to provide finite-sample advice for a method where (1) the commonly-used approach is far from optimal but (2) the asymptotically-optimal approach is useless in finite samples. I think we can salvage parts of my work and still publish something useful, but it’ll be much less satisfying than I had hoped.

Working on a different problem, it felt encouraging to find errors in another statistician’s relevant proof: I felt like a legitimate statistician who can help colleagues notice problems and suggest improvements. On the other hand, it was also disappointing, because I had hoped to apply the proof idea directly to my own problem, and now I cannot 🙂

On a third front, my advisor invited another graduate student, Daren Wang, to help us wrap up a research project I had started in 2015 and then abandoned. Daren is bright, fast, and friendly, a pleasure to collaborate with (except when I’m despairing that it only took him a week to whiz through and improve on the stuff that took me half a year). Quite quickly, we agreed there’s no more to be done to make this project a much-better paper—so let’s just package it up now and submit to a conference. It was satisfying to work on writing and submitting a paper, one of the main skills for which I came to grad school!

Finally, I was hoping to clear up some stumbling blocks in an end-of-semester meeting with several committee members. Instead, our meeting raised many fascinating new questions & possible future directions… without wrapping up any loose ends. Alas, such is research 🙂


As I’ve noted before, I audited Jordan Rodu’s Deep Learning course. I really liked the journal-club format: Read a paper or two for every class session. Write a short response before class, so the instructor can read them first. Come prepared to discuss and bring up questions of your own. I wish more of our courses were like this—compared to lecture, it seems better for the students and less laborious for the instructor.

Although it was a theory course, not hands-on, I did become intrigued enough by one of the papers to try out the ideas myself. Classmate Nicolas Kim and I are playing around with Keras on a GPU to understand some counterintuitive ideas a little better. Hopefully we’ll have something to report in a couple of weeks.

I also started to audit Kevin Kelly’s undergrad and grad-level courses on Epistemology (theory of knowing). Both were so fascinating that I had to drop them, else I would have done all the course readings at the expense of my own research 🙂 but I hope to take another stab someday. One possibly-helpful perspective I got, from my brief exposure to Epistemology, was a new-to-me (caricatured) difference between Bayesian and classical statistics.

  • Apparently most philosophy-of-science epistemologists are Bayesian. They posit that a scientist’s work goes like this: You are given a hypothesis, some data, and some prior knowledge or belief about the problem. How should we use the data to update our knowledge/belief about that hypothesis? In that case, obviously, Bayesian updating is a sensible way to go.
  • But I disagree with the premise. Often, a scientist’s work is more like this: You’re not handed a hypothesis or a dataset, but must choose them yourself. You also know your colleagues will bicker over claims of prior knowledge. If you come up with an interesting question, what data should you collect so that you’ll most likely find a strong answer? That is, an answer that most colleagues will find convincing regardless of prior belief, and that will keep you from fooling yourself? This is the classical / frequentist setting, which treats design (of a powerful, convincing experiment / survey / study) as the heart of statistics. In other words, you’re not merely evaluating “found” data—your task is to choose a design in hopes of making a convincing argument.

Other projects

Some of my cohort-mates and I finally organized a Dissertation Writing Group, a formal setting to talk shop technically with other students whose advisors don’t already hold research-group meetings. I instigated this selfishly, wanting to have other people I can pester with theory questions or simply vent with. But my fellow students agreed it’s been useful to them too. We’re also grateful to our student government for funding coffee and snacks for these meetings.

I did not take on other new side projects this fall, but I’ve stayed in touch with former colleagues from the Census Bureau still working on assessing & visualizing uncertainty in estimate rankings. We have a couple of older reports about these ideas. We still hope to publish a revised version, and we’re working on a website to present some of the ideas interactively. Eventually, the hope is to incorporate some of this into the Census website, to help statistical-novice data users understand that estimates and rankings come with statistical uncertainty.

Finally, I heard about (but have not attended) CMU’s Web Dev Weekend. I really like the format: a grab-bag of 1- or 2-hour courses, suitable for novices, that get you up and running with a concrete project and a practical skill you can take away. Can we do something similar for statistics?

Topic ideas where a novice could learn something both interesting and useful in a 1.5h talk:

  • How not to fool yourself in A/B testing (basic experimental design and power analysis)
  • Befriend your dataset (basic graphical and numerical EDA, univariate and bivariate summaries, checking for errors and outliers)
  • Plus or minus a bit (estimating margins of error—canned methods for a few simple problems, intro to bootstrap for others)
  • Black box white belt (intro to some common data mining methods you might use as baselines in Kaggle-like prediction problems)
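To make the "Plus or minus a bit" idea concrete, here is a minimal sketch of a percentile bootstrap interval for a sample mean, using only the Python standard library. The function name, the default settings, and the data values are all made up for illustration; a real workshop would probably start from a canned method and only then move to something like this.

```python
import random
import statistics


def bootstrap_interval(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval for the mean of `sample` (illustrative only)."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Resample n observations with replacement and record the mean.
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        means.append(statistics.fmean(resample))
    means.sort()
    alpha = 1.0 - level
    # Indices of the lower and upper percentiles of the resampled means.
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return statistics.fmean(sample), means[lo_idx], means[hi_idx]


# Hypothetical data: ten made-up monthly rents.
rents = [520, 610, 480, 750, 690, 830, 560, 640, 710, 590]
estimate, lower, upper = bootstrap_interval(rents)
print(f"mean = {estimate:.1f}, 95% interval roughly ({lower:.1f}, {upper:.1f})")
```

The interval endpoints are just percentiles of the resampled means, which is the simplest bootstrap variant and not necessarily the best one for small samples.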

Many of these could be done with tools that are familiar (Excel) or novice-friendly (Tableau), instead of teaching novices to code in R at the same time as they learn statistical concepts. This would be a fun project for a spring weekend, in my copious spare time (hah!)


Offline, we are starting to make some parent friends through daycare and playgrounds. I’m getting a new perspective on why parents tend to hang out with other parents: it’s nice to be around another person who really understands the rhythm of conversation when your brain is at best a quarter-present (half-occupied by watching kid, quarter-dysfunctional from lack of sleep). On the other hand, it’s sad to see some of these new friends moving away already, leaving the travails of academia behind for industry (with its own new & different travails but a higher salary).

So… I made the mistake of looking up average salaries myself. In statistics departments, average starting salaries for teaching faculty are well below starting salaries for research faculty. In turn, research faculty’s final salary (after decades of tenure) is barely up to the starting salaries I found for industry Data Scientists. Careers are certainly not all about the money, but the discrepancies were eye-opening, and they are good to know about in terms of financial planning going forward. (Of course, those are just averages, with all kinds of flaws. Particularly notable is the lack of cost-of-living adjustment, if a typical Data Scientist is hired in expensive San Francisco while typical teaching faculty are not.)

But let’s end on a high note. Responding to a question about which R / data science blogs to follow, Hadley Wickham cited this blog! If a Hadley citation can’t go on a statistician’s CV, I don’t know what can 🙂

“Sound experimentation was profitable”

Last time I mentioned some papers on the historical role of statistics in medicine. Here they are, by Donald Mainland:

  • “Statistics in Clinical Research: Some General Principles” (1950) [journal, pdf]
  • “The Rise of Experimental Statistics and the Problems of a Medical Statistician” (1954) [journal, pdf]

I’ve just re-read them and they are excellent. What is the heart of statistical thinking? What are the most critical parts of (applied) statistical education? At just 8-9 pages each, they are valuable reading, especially as a gentle rejoinder in this age of shifting fashions around Data Science, concerns about the replicability crisis, and misplaced hopes that Big Data will fix everything painlessly.

Some of Mainland’s key points, with which I strongly agree:

  • The heart of statistical thinking concerns data design, even more so than data analysis. How should we design the study (sampling, randomization, power, etc.) in order to gather strong evidence and to avoid fooling ourselves?

    …the methods of investigating variation are statistical methods. Investigating variation means far more than applying statistical tests to data already obtained. … Statistical ideas, to be effective, must enter at the very beginning, i.e., in the planning of an investigation.

  • Whenever possible, a well-designed experiment is highly preferred over poorly-designed experimental or observational data. It’s stronger evidence… and, as industry has long recognized, it cuts costs.

    In all the applied sciences, inefficient or wrong methods of research or production cause loss of money. Therefore, sound experimentation was profitable; and so applied chemistry and physics adopted modern biological statistics while academic chemists, physicists, and even biologists were disregarding the revolution or resisting it, largely through ignorance.

  • Yes, of course you can apply statistical methods to “found” data. Sometimes you have no alternative (macroeconomics; data journalism); sometimes it’s just substantially cheaper (Big Data). But if you gather haphazard data and merely run statistical tests after the fact, you’re missing the point.

    These unplanned observations may be the only information available as a basis for action, and they may form a useful basis for planned experiments; but we should never forget their inferior status.

    …a significance test has no useful meaning unless an experiment has been properly designed.

  • Statistical education for non-statisticians spends too little time on good data design, and too much on a slew of cookbook formulas and tests.

    …the increase in the incidence of tests—statistical arithmetic—has continued, and so also, very commonly, has the disregard of the more important contribution of statistics, the principles and methods of sound, economical experimentation and valid inference… Another obvious cause is the common human tendency to use gadgets instead of thought. Here the gadgets are the arithmetical techniques, and the statistical “cookbooks” that have presented these techniques most lucidly, without primary emphasis on experimentation and logic, have undoubtedly done much harm.

  • Statistical education for actual applied statisticians also spends too little time on good data design, and too much on mathematics.

    The most important single element in the training (and continuous education) of any statistician is practical experience—experience of investigations for which he himself is responsible, with all their difficulties and disappointments.

    …even if a mathematician specializes in the statistical branch of mathematics, he is not thereby fitted to give guidance in the application of the methods.

  • As an investigator, you must understand statistical reasoning yourself. You can (and should!) hire an applied statistician to help with the details of study design and data analysis, but you must understand their viewpoint to benefit from their help.

    If, however, he is acquainted with the requirements for valid proof, he will often see that what looked like evidence is not evidence at all…

Of course study design is not all of statistics. But it’s a hugely important component that seems underappreciated in modern statistics curricula (at least in my experience). Even if it’s not the sexiest area of current research, I’m surprised my PhD program at CMU completely omits it from our education. (The BS and MS programs here do offer one course each. But I was offered much deeper courses in my MS at Portland State, covering design of experiments and also of survey samples.)

As a bonus, Mainland also offers advice on starting and running a statistical consulting unit. It’s aimed at medical scientists but useful more broadly.

I would quote more, but you should really just read the whole thing. Then comment to tell me why I’m wrong 🙂

After 6th semester of statistics PhD program

Posting far too late again, but here’s what I remember from last Spring…

This was my first semester with no teaching, TAing, or classes (besides one I audited for fun). As much as I enjoy these things, research has finally gone much faster and smoother with no other school obligations. The fact that our baby started daycare also helped, although it’s a bittersweet transition. At the end of the summer I passed my proposal presentation, which means I am now ABD!

Previous posts: the 1st, 2nd, 3rd, 4th, and 5th semesters of my Statistics PhD program.

Thesis research and proposal

During 2015, most of my research with my advisor, Jing Lei, was a slow churn through understanding and extending his sparse PCA work with Vince Vu. At the end of the year I hadn’t gotten far and we decided to switch to a new project… which eventually became my proposal, in a roundabout way.

We’d heard about the concept of submodularity, which seems better known in CS, and wondered where it could be useful in Statistics as well. Das & Kempe (2011) used submodularity to understand when greedy variable selection algorithms like Forward Selection (FS, aka Forward Stepwise regression) can’t do too much worse than Best Subsets regression. We thought this approach might give a new proof of model-selection consistency for FS. It turned out that submodularity didn’t give us a fruitful proof approach after all… but also that (high-dimensional) conditions for model-selection consistency of FS hadn’t been derived yet. Hence, this became our goal: Find sufficient conditions for FS to choose the “right” linear regression model (when such a thing exists), with probability going to 1 as the numbers of observations and variables go to infinity. Then, compare these conditions to those known for other methods, such as Orthogonal Matching Pursuit (OMP) or the Lasso. Finally, analyze data-driven stopping rules for FS—so far we have focused on variants of cross-validation (CV), which is surprisingly not as well-understood as I thought.
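For readers who have not seen Forward Selection written out, here is a minimal sketch in plain Python. It is a toy illustration, not the code or the analysis from the proposal. At each step it adds the variable that most reduces the residual sum of squares (RSS). It uses the standard fact that, once the already-selected columns are projected out of both y and a candidate column, the RSS drop from adding that candidate is (x · y)² / ‖x‖². The fixed `max_vars` stopping rule, the function names, and the simulated data are assumptions for illustration; the data-driven stopping rules discussed above (such as cross-validation) are not implemented here.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def forward_selection(columns, y, max_vars):
    """Greedy forward selection without an intercept.

    columns: list of p predictor columns, each a list of n floats.
    Returns the indices of the selected columns, in order of selection.
    """
    resid_y = list(y)
    resid_cols = {j: list(col) for j, col in enumerate(columns)}
    selected = []
    for _ in range(min(max_vars, len(columns))):
        best_j, best_drop = None, -1.0
        for j, col in resid_cols.items():
            norm2 = dot(col, col)
            if norm2 < 1e-12:  # column is (numerically) already explained
                continue
            drop = dot(col, resid_y) ** 2 / norm2  # RSS reduction from adding j
            if drop > best_drop:
                best_j, best_drop = j, drop
        if best_j is None:
            break
        q = resid_cols.pop(best_j)
        q_norm2 = dot(q, q)
        # Project the chosen direction out of y and out of the remaining columns.
        coef_y = dot(q, resid_y) / q_norm2
        resid_y = [a - coef_y * b for a, b in zip(resid_y, q)]
        for j in list(resid_cols):
            c = dot(q, resid_cols[j]) / q_norm2
            resid_cols[j] = [a - c * b for a, b in zip(resid_cols[j], q)]
        selected.append(best_j)
    return selected


# Simulated example: only columns 2 and 7 truly matter.
rng = random.Random(0)
n, p = 200, 10
columns = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p)]
y = [3.0 * columns[2][i] - 2.0 * columns[7][i] + 0.1 * rng.gauss(0.0, 1.0)
     for i in range(n)]
chosen = forward_selection(columns, y, max_vars=2)
print(chosen)
```

In this easy low-noise setting the greedy path should recover the two true variables; the interesting theoretical question is what conditions guarantee that when the number of variables grows along with the number of observations.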

One thing I hadn’t realized before: when writing the actual proposal, the intent is to demonstrate your abilities and preparedness for research, not necessarily to plan out your next research questions. As it turns out, it’s more important to prove that you can ask interesting questions and follow through on them. Proposing concrete “future work” is less critical, since we all know it’ll likely change by the time you finish the current task. Also, rewriting everything for the paper and talk was itself helpful in getting me to see the “big picture” ideas in my proofs.

Anyhow, it did feel pretty great to actually complete a proof or two for the proposal. Even if the core ideas really came from my advisor or other papers I’ve read, I did do real work to pull it all together and prepare the paper & talk.

Many thanks to everyone who attended my proposal talk. I appreciated the helpful questions and discussion; it didn’t feel like a grilling for its own sake (as every grad student fears). Now it’s time to follow through, complete the research, practice the paper-submission process, and write a thesis!

The research process

When we shifted gears to something my advisor does not already know much about, it helped me feel much more in charge and productive. Of course, he caught up and passed me quickly, but that’s only to be expected of someone who also just won a prestigious NSF CAREER award.

Other things that have helped: Getting the baby into day care. No TAing duties to divide my energy this semester. Writing up the week’s research notes for my advisor before each meeting, so that (1) the meetings are more focused & productive and (2) I build up a record of notes that we can copy-paste into papers later. Reading Cal Newport’s Deep Work book and following common-sense suggestions about keeping a better schedule and tracking better metrics. (I used to tally all my daily/weekly stats-related work hours; now I just tally thesis hours and try to hit a good target each week on those alone, undiluted by side stuff.)

I’m no smarter, but my work is much more productive, I feel much better, and I’m learning much more. Every month I look back and realize that, just a month ago, I’d have been unable to understand the work I’m doing today. So it is possible to learn and progress quite quickly, which makes me feel much better about this whole theory-research world. I just need to immerse myself, spend enough time, revisit it regularly enough, have a concrete research question that I’m asking—and then I’ll learn it and retain it far better than I did the HWs from classes I took.

Indeed, a friend asked what I’d do differently if I were starting the PhD again. I’d spend far less energy on classes, especially on homework. It feels good and productive to do HW, and being good at HW is how I got here… but it’s not really the same as making research progress. Besides, as interesting and valuable as the coursework has been, very little of it has been directly relevant to my thesis (and the few parts that were, I’ve had to relearn anyway). So I’d aim explicitly for “B equals PhD” and instead spend more time doing real research projects, wrapping them up into publications (at least conference papers). As it is, I have a pile of half-arsed never-finished class / side projects, which could instead be nice CV entries if I’d polished them instead of spending hours trying to get from a B to an A.

My advisor also pointed out that he didn’t pick up his immense store of knowledge in a class, but by reading many many papers and talking with senior colleagues. I’ve also noticed a pattern from reading a ton of papers on each of several specialized sub-topics. First new paper I encounter in an area: whoa, how’d they come up with this from scratch, what does it all mean? Next 2-3 papers: whoa, these all look so different, how will I ever follow the big ideas? Another 10-15 papers: aha, they’re actually rehashing similar ideas and reusing similar proof techniques with small variations, and I can do this too. Reassuring, but it does all take time to digest.

All that said, I still feel like a slowly-plodding turtle compared to the superstar researchers here at CMU. Sometimes it helps to follow Wondermark’s advice on how he avoided discouragement in webcomics: ignore the more-successful people already out there and make one thing at a time, for a long time, until you’ve made many things and some are even good.

(Two years in!) I had just learned the word “webcomics” from a panel at Comic-Con. I was just starting to meet other people who were doing what I was doing.

Let me repeat that: Two years and over a hundred strips in is when I learned there was a word for what I was doing.

I had a precious, lucky gift that I don’t know actually exists anymore: a lack of expectations for my own success. I didn’t know any (or very few) comic creators personally; I didn’t know their audience metrics or see how many Twitter followers they had or how much they made on Patreon. My comics weren’t being liked or retweeted (or not liked, or not retweeted) within minutes of being posted.

I had been able to just sit down and write a bunch of comics without anyone really paying attention, and I didn’t have much of a sense of impatience about it. That was a precious gift that allowed me to start finding my footing as a creator by the time anyone did notice me – when people did start to come to my site, there was already a lot of content there and some of it was pretty decent.

Such blissful ignorance is hard to achieve in a department full of high-achievers. I’ve found that stressing about the competition doesn’t help me work harder or faster. But when I cultivate patience, at least I’m able to continue (at my own pace) instead of stopping entirely.

[Update:] another take on this issue, from Jeff Leek:

Don’t compare myself to other scientists. It is very hard to get good evaluation in science and I’m extra bad at self-evaluation. Scientists are good in many different dimensions and so whenever I pick a one dimensional summary and compare myself to others there are always people who are “better” than me. I find I’m happier when I set internal, short term goals for myself and only compare myself to them.


I audited Christopher Phillips’ course Moneyball Nation. This was a gen-ed course in the best possible sense, getting students to think both like historians and like statisticians. We explored how statistical/quantitative thinking entered three broad fields: medicine, law, and sports.

Reading articles by doctors and clinical researchers, I got a context for how statistical evidence fits in with other kinds of evidence. Doctors (and patients!) find it much more satisfying to get a chemist’s explanation of how a drug “really works,” vs. a statistician’s indirect analysis showing that a drug outperforms placebo on average. Another paper confirmed for me that (traditional) Statistics’ biggest impact on science was better experimental design, not better data analysis. Most researchers don’t need to collaborate with a statistical theoretician to derive new estimators; they need an applied statistician who’ll ensure that their experimental budget is well spent, avoiding confounding and low power and all the other issues.

[Update:] I’ve added a whole post on these medical articles.

In the law module, we saw how difficult it is to use statistical evidence appropriately in trials, and how lawyers don’t always find it to be useful. Of course we want our trial system to get the right answers as often as possible (free the innocent and catch the guilty), so from a purely stats view it’s a decision-theory question: what legal procedures will optimize your sensitivity and specificity? But the courts, especially trial by jury, also serve a distinct social purpose: ensuring that the legal decision reflects and represents community agreement, not just isolated experts who can’t be held accountable. When you admit complicated statistical arguments that juries cannot understand, the legal process becomes hard to distinguish from quack experts bamboozling the public, which undermines trust in the whole system. That is, you have the right to a fair trial by a jury of your peers; and you can’t trample on that right in order to “objectively” make fewer mistakes. (Of course, this is also an argument for better statistical education for the public, so that statistical evidence becomes less abstruse.)

[Update:] In a bit more detail, “juries should convict only when guilt is beyond reasonable doubt. …one function of the presumption of innocence is to encourage the community to treat a defendant’s acquittal as banishing all lingering suspicion that he might have been guilty.” So reasonable doubt is meant to be a fuzzy social construct that depends on your local community. If trials devolve into computing a fungible “probability of guilt,” you lose that specificity / dependence on local community, and no numerical threshold can truly serve this purpose of being “beyond a reasonable doubt.” For more details on this ritual/pageant view of jury trials, along with many other arguments against statistics in the courtroom, see (very long but worthwhile) Tribe (1971), “Trial by Mathematics: Precision and Ritual in the Legal Process” [journal, pdf].

[Note to self: link to some of the readings described above.]

Next time I teach I’ll also use Prof. Phillips’ trick for getting to know students: require everyone to sign up for a time slot to meet in his office, in small groups (2-4 people). This helps put names to faces and discover students’ interests.

Other projects

I almost had a Tweet cited in a paper 😛 Rob Kass sent along to the department an early draft of “Ten Simple Rules for Effective Statistical Practice” which cited one of my tweets. Alas, the tweet didn’t make it into the final version, but the paper is still worth a read.

I also attended the Tapestry conference in Colorado, presenting course materials from the Fall 2015 dataviz class that I taught. See my conference notes here and here.

Even beyond that, it’s been a semester full of thought on statistical education, starting with a special issue in The American Statistician (plus supplementary material). I also attended a few faculty meetings in our college of humanities and social sciences, to which our statistics department belongs. They are considering future curricular revisions to the general-education requirements. What should it mean to have a well-rounded education, in general and specifically at this college? These chats also touch on the role of our introductory statistics course: where should statistical thinking and statistical evidence fit into the training of humanities students? This summer we started an Intro Stats working group for revising our department’s first course; I hope to have more to report there eventually.

Finally, I TA’ed for our department’s summer undergraduate research experience program. More on that in a separate post.


My son is coordinated enough to play with a shape-sorter, which is funny to watch. He gets so frustrated that the square peg won’t go in the triangular hole, and so incredibly pleased when I gently nudge him to try a different hole and it works. (Then I go to meet my advisor and it feels like the same scene, with me in my son’s role…)

He’s had many firsts this spring: start of day care, first road trip, first time attending a wedding, first ER visit… Scary, joyful, bittersweet, all mixed up. It’s also becoming easier to communicate, as he can understand us and express himself better; but he also now has preferences and insists on them, which is a new challenge!

I’ve also joined some classmates in a new book club. A few picks have become new favorites; others really put me outside my comfort zone in reading things I’d never choose otherwise.

When static graphs beat interactives

William Cleveland gave a great interview in a recent Policyviz podcast. (Cleveland is a statistician and a major figure in data visualization research; I’ve reviewed his classic book The Elements of Graphing Data before.) He discussed the history of the term “data science,” his visual perception research, statistical computing advances, etc.

But Cleveland also described his work on brushing and on trellis graphics.

  • Brushing is an interactive technique for highlighting data points across linked plots. Plot Y vs X1 and Y vs X2; select some points on the first plot; and they are automatically highlighted on the second plot. You can condition on-the-fly on X1 to better understand the multivariate structure between X1, X2, and Y.
  • Trellis displays are essentially Cleveland’s version of small multiples, or of faceting in the Grammar of Graphics sense. Again, you condition on one variable and see how it affects the plots of other variables. See for example slides 10 and 15 here.

I found it fascinating that the static trellis technique evolved from interactive brushing, not vice versa!

Cleveland and colleagues noticed that although brushing let you find interesting patterns, it was too difficult to remember and compare them. You only saw one “view” of the linked plots at a time. Trellises would instead allow you to see many slices at once, making simultaneous comparisons easier.

For example, here’s a brushing view of data on housing: rent, size, year it was built, and whether or not it’s in a “good neighborhood” (figures from Interactive Graphics for Data Analysis: Principles and Examples). The user has selected a subset of years and chosen “good” neighborhoods, and now these points are highlighted in the scatterplot of size vs rent.


That’s great for finding patterns in one subset at a time, but not ideal for comparing the patterns in different subsets. If you select a different subset of years, you’ll have to memorize the old subset’s scatterplot in order to decide whether it differs much from the new subset’s scatterplot; or switch back and forth between views.

Now look at the trellis display: the rows show whether or not the neighborhood is “good,” the columns show subsets of year, and each scatterplot shows size vs rent within that data subset. All these subsets’ scatterplots are visible at once.


If there were different size-vs-rent patterns across year and neighborhood subsets, we’d be able to spot such an effect easily. I admit I don’t see any such effect—but that’s an interesting finding in its own right, and easier to confirm here than with brushing’s one-view-at-a-time.

So the shinier, fancier, interactive graphic is not uniformly better than a careful redesign of the old static one. Good to remember.

Deep Learning course, and my own (outdated) DL attempts

This fall I’m enjoying auditing Jordan Rodu’s mini course on Deep Learning. He’s had us read parts of the forthcoming Deep Learning book (free draft online), finished just this year and thus presumably up-to-date.

Illustration of artificial neuron in a neural network

It’s fascinating to see how the core advice has changed from the literature we covered in Journal Club just a few years ago. Then, my team was assigned a 2010 paper by Erhan et al.: “Why Does Unsupervised Pre-training Help Deep Learning?” Unsupervised pre-training seems to have sparked the latest neural network / deep learning renaissance in 2006, underlying some dramatic performance improvements that got people interested in this methodology again after a decade-long “neural network winter.” So, we spent a lot of time reading this paper and writing simulations to help us understand how/why/when pre-training helps. (Here are our notes on the paper, simulations, and class discussion.)

But now in 2016, the Deep Learning book’s Chapter 15 says that “Today, unsupervised pretraining has been largely abandoned” (p.535). It seems to be used only in a few specific fields where there are good reasons for it to work, such as natural language processing. How quickly this field has changed!

Obviously, larger datasets and more raw computing power helped make deep neural networks feasible and interesting again in the 2000s. But algorithmic developments have helped too. Although unsupervised pre-training is what sparked renewed interest, the recent book claims (p.226) that the most important improvements have been: (1) using cross-entropy loss functions (optimize the negative log-likelihood) instead of always using mean squared error, and (2) using rectified linear activation functions in hidden units instead of sigmoid activation functions. Chapter 6 explains what these things mean and why they make a difference. But basically, these small tweaks (to the loss function you optimize, and to the non-linearities you work with) make large models much easier to fit, because they give you steeper gradients when your model fits poorly, so you don’t get stuck in regions of poor fit quite as often.
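To make the "steeper gradients" point concrete, here is a minimal Python sketch (my own illustration, not code from the book). It assumes a single sigmoid output unit and compares the gradient of each loss with respect to the pre-activation z when the model is badly wrong. It also compares the slope of a sigmoid hidden unit with that of a rectified linear unit at a large input. All function names are made up for this example.

```python
import math

def sigmoid(z):
    """Logistic function, used here as the output non-linearity."""
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    """d/dz of 0.5 * (sigmoid(z) - y)**2."""
    p = sigmoid(z)
    return (p - y) * p * (1.0 - p)

def grad_cross_entropy(z, y):
    """d/dz of -(y*log(p) + (1-y)*log(1-p)), where p = sigmoid(z)."""
    return sigmoid(z) - y

def slope_sigmoid(z):
    """Derivative of the sigmoid activation at z."""
    p = sigmoid(z)
    return p * (1.0 - p)

def slope_relu(z):
    """Derivative of the rectified linear activation max(0, z) at z."""
    return 1.0 if z > 0 else 0.0

# A badly fit case: the true label is 1, but the pre-activation is very negative.
z_bad, y_true = -8.0, 1.0
print("MSE gradient:          ", grad_mse(z_bad, y_true))
print("cross-entropy gradient:", grad_cross_entropy(z_bad, y_true))

# A hidden unit receiving a large positive input.
z_large = 8.0
print("sigmoid slope:", slope_sigmoid(z_large))
print("ReLU slope:   ", slope_relu(z_large))
```

With squared error, the sigmoid's flat tail multiplies the gradient down to almost nothing exactly when the prediction is worst. With cross-entropy, that factor cancels and the gradient stays close to -1. Likewise, the sigmoid hidden unit is nearly flat at large inputs while the rectified linear unit keeps a slope of 1.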

I look forward to learning more as Jordan’s class progresses. Meanwhile, if you want to try building a deep neural network from scratch yourself, I found the Stanford Deep Learning Tutorial helpful. Here are my solutions to some of the exercises. (This doesn’t teach you to use the well-designed, optimized, pre-made Deep Learning libraries that you’d want for a real application—just to practice building their core components from scratch so you understand how they work in principle. Your resulting code isn’t meant to be optimal and you wouldn’t use it to deploy something real.)

PS—here’s also a nice post on Deep Learning from Michael Jordan (the ML expert, not the athlete). Instead of claiming ML will take over Statistics, I was glad to hear him reinforcing the importance of traditionally statistical questions:

…while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g.,

1. How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have?
2. How can I get meaningful error bars or other measures of performance on all of the queries to my database?
3. How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources?
4. How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on?
5. How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken?
6. How do I deal with non-stationarity?
7. How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?

Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.

PSA: R’s rnorm() and mvrnorm() use different spreads

Quick public service announcement for my fellow R nerds:

R has two commonly-used random-Normal generators: rnorm and MASS::mvrnorm. I was foolish and assumed that their parameterizations were equivalent when you’re generating univariate data. But nope:

  • Base R can generate univariate draws with rnorm(n, mean, sd), which uses the standard deviation for the spread.
  • The MASS package has a multivariate equivalent, mvrnorm(n, mu, Sigma), which uses the variance-covariance matrix for the spread. In the univariate case, Sigma is the variance.

I was using mvrnorm to generate a univariate random variable, but giving it the standard deviation instead of the variance. It took me two weeks of debugging to find this problem.
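Here is a small sketch of the same mix-up, written in Python rather than R so it only needs the standard library. The two helper functions are hypothetical stand-ins that mimic the two parameterizations (spread as a standard deviation, like rnorm, versus spread as a variance, like univariate mvrnorm); they are not the R functions themselves.

```python
import math
import random
import statistics

def draw_sd_param(n, mean, sd, rng):
    """Spread given as a standard deviation, in the style of rnorm(n, mean, sd)."""
    return [rng.gauss(mean, sd) for _ in range(n)]

def draw_var_param(n, mu, variance, rng):
    """Spread given as a variance, in the style of univariate mvrnorm(n, mu, Sigma)."""
    return [rng.gauss(mu, math.sqrt(variance)) for _ in range(n)]

rng = random.Random(2016)  # fixed seed so the sketch is reproducible
intended_sd = 4.0
n = 10_000

# Correct: square the standard deviation before passing it as a variance.
correct = draw_var_param(n, 0.0, intended_sd ** 2, rng)
# Buggy: pass the standard deviation where a variance is expected.
buggy = draw_var_param(n, 0.0, intended_sd, rng)

sd_correct = statistics.stdev(correct)
sd_buggy = statistics.stdev(buggy)
print("intended sd:", intended_sd)
print("sample sd, correct call:", round(sd_correct, 2))
print("sample sd, buggy call:  ", round(sd_buggy, 2))
```

The buggy draws have a spread near the square root of the intended standard deviation (about 2 instead of 4 here). Nothing crashes and the numbers still look plausibly Normal, which is why the problem is so easy to miss.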

Dear reader, I hope this cautionary tale reminds you to check R function arguments carefully!

Data sanity checks: Data Proofer (and R analogues?)

I just heard about Data Proofer (h/t Nathan Yau), a test suite of sanity-checks for your CSV dataset.

It checks a few basic things you’d really want to know but might forget to check yourself, like whether any rows are exact duplicates, or whether any columns are totally empty.
There are things I always forget to check until they cause a bug, like whether geographic coordinates are in range (latitude within -90 to 90 degrees, longitude within -180 to 180).
And there are things I never think to check, though I should, like whether there are exactly 65k rows (probably an error exporting from Excel) or whether integers are exactly at certain common cutoff/overflow values.
I like the idea of automating this. It certainly wouldn’t absolve me of the need to think critically about a new dataset, but it might flag some things I wouldn’t have caught otherwise.
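As a rough idea of what such automated checks could look like, here is a minimal sketch in Python using only the standard library. It is not Data Proofer's code or API: the function names, the warning messages, and the list of "suspicious" integer values are all my own assumptions for illustration, and it assumes a rectangular CSV with a header row.

```python
import csv
import io

# Illustrative thresholds chosen for this sketch (not Data Proofer's own list).
XLS_ROW_LIMIT = 65_536  # row limit of old Excel .xls sheets
OVERFLOW_STRINGS = {"32767", "65535", "2147483647", "4294967295"}

def cell(row, col):
    """Return a cell as a stripped string, treating missing values as empty."""
    return (row.get(col) or "").strip()

def sanity_check(rows, lat_col=None, lon_col=None):
    """Return a list of warning strings for a list of dict rows."""
    if not rows:
        return ["no data rows"]
    columns = list(rows[0].keys())
    found = []

    # 1. Exact duplicate rows.
    seen, dupes = set(), 0
    for row in rows:
        key = tuple(cell(row, c) for c in columns)
        if key in seen:
            dupes += 1
        seen.add(key)
    if dupes:
        found.append(f"{dupes} duplicate row(s)")

    # 2. Columns with no values at all.
    for c in columns:
        if all(cell(row, c) == "" for row in rows):
            found.append(f"column '{c}' is empty")

    # 3. Coordinates outside the valid range.
    for col, limit in ((lat_col, 90.0), (lon_col, 180.0)):
        if col is None:
            continue
        bad = 0
        for row in rows:
            try:
                if abs(float(cell(row, col))) > limit:
                    bad += 1
            except ValueError:
                pass  # non-numeric cells are ignored in this sketch
        if bad:
            found.append(f"{bad} value(s) in '{col}' outside +/-{limit:g}")

    # 4. Row count at the old Excel limit (with or without counting the header).
    if len(rows) in (XLS_ROW_LIMIT - 1, XLS_ROW_LIMIT):
        found.append("row count matches the old Excel sheet limit")

    # 5. Values sitting exactly at common integer limits.
    for c in columns:
        hits = sum(1 for row in rows if cell(row, c) in OVERFLOW_STRINGS)
        if hits:
            found.append(f"{hits} value(s) in '{c}' at an integer limit")
    return found

# Tiny made-up dataset with several problems planted in it.
sample = io.StringIO(
    "city,lat,lon,notes,count\n"
    "Pittsburgh,40.44,-79.99,,12\n"
    "Pittsburgh,40.44,-79.99,,12\n"
    "Nowhere,140.0,-79.99,,2147483647\n"
)
found = sanity_check(list(csv.DictReader(sample)), lat_col="lat", lon_col="lon")
for message in found:
    print(message)
```

Each check only flags a possible problem. Deciding whether a duplicate row or a value of 2147483647 is actually an error still takes a human who knows the data.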

(They also do some statistical checks for outliers, but as a statistician, this is one thing I do remember to do myself, and I’d like to think I do it more carefully than any simple automated check.)

Does an R package like this exist already? The closest thing in spirit that I’ve seen is testdat, though I haven’t played with that yet. If not, maybe testdat could add some more of Data Proofer’s checks. It’d become an even more valuable tool to run whenever you load or import any tabular dataset for the first time.

After 5th semester of statistics PhD program

Better late than never—here are my hazy memories of last semester. It was one of the tougher ones: an intense teaching experience, attempts to ratchet up research, and parenting a baby that’s still too young to entertain itself but old enough to get into trouble.

Previous posts: the 1st, 2nd, 3rd, and 4th semesters of my Statistics PhD program.


I’m past all the required coursework, so I only audited Topics in High Dimensional Statistics, taught by Alessandro Rinaldo as a pair of half-semester courses (36-788 and 36-789). “High-dimensional” here loosely means problems where you have more variables (p) than observations (n). For instance, in genetic or neuroscience datasets, you might have thousands of measurements each from only tens of patients. The theory here is different than in traditional statistics because you usually assume that p grows with n, so that getting more observations won’t reduce the problem to a traditional one.

This course focused on some of the theoretical tools (like concentration inequalities) and results (like minimax bounds) that are especially useful for studying properties of high-dimensional methods. Ale did a great job covering useful techniques and connecting the material from lecture to lecture.

In the final part of the course, students presented recent minimax-theory papers. It was useful to see my fellow students work through how these techniques are used in practice, as well as to get practice giving “chalk talks” without projected slides. I gave a talk too, preparing jointly with my classmate Lingxue Zhu (who is very knowledgeable, sharp, and always great to work with!) Ale’s feedback on my talk was that it was “very linear”—I hope that was a good thing? Easy to follow?

Also, as in every other stats class I’ve had here, we brought up the curse of dimensionality—meaning that, in high-dimensional data, very few points are likely to be near the joint mean. I saw a great practical example of this in a story about the US Air Force’s troubles designing fighter planes for the “average” pilot.
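A quick way to see this effect is a minimal Python sketch under strong simplifying assumptions: every coordinate is an independent standard Normal, and "near the mean" means within one standard deviation on every coordinate at once. Real measurements (such as pilots' body dimensions) are correlated, so these numbers are illustrative only, and the function names are my own.

```python
import math
import random

def prob_near_mean_all_coords(d, half_width):
    """Exact P(|X_i| < half_width for all i) for d independent N(0, 1) coordinates."""
    per_coord = math.erf(half_width / math.sqrt(2.0))
    return per_coord ** d

def simulated_fraction(d, half_width, n, rng):
    """Monte Carlo estimate of the same probability from n simulated points."""
    hits = 0
    for _ in range(n):
        if all(abs(rng.gauss(0.0, 1.0)) < half_width for _ in range(d)):
            hits += 1
    return hits / n

rng = random.Random(36788)  # fixed seed so the sketch is reproducible
for d in (1, 2, 5, 10):
    exact = prob_near_mean_all_coords(d, 1.0)
    approx = simulated_fraction(d, 1.0, 20_000, rng)
    print(f"d={d:2d}  exact={exact:.4f}  simulated={approx:.4f}")
```

In one dimension about two thirds of the points are within one standard deviation of the mean. By ten dimensions only a few percent are that close on every coordinate simultaneously, so the "average" point describes almost nobody.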


I taught a data visualization course! Check out my course materials here. There’ll be a separate post reflecting on the whole experience. But the summer before, it was fun (and helpful) to binge-read all those dataviz books I’ve always meant to read.

I’ve been able to repurpose my lecture materials for a few short talks too. I was invited to present a one-lecture intro to data viz for Seth Wiener’s linguistics students here at CMU, as well as for a seminar on Data Dashboard Design run by Matthew Ritter at my alma mater (Olin College). I also gave an intro to the Grammar of Graphics (the broader concept behind ggplot2) for our Pittsburgh useR Group.


I’m officially working with Jing Lei, still looking at sparse PCA but also some other possible thesis topics. Jing is a great instructor, researcher, and collaborator working on many fascinating problems. (I also appreciate that he, too, has a young child and is understanding about the challenges of parenting.)

But I’m afraid I made very slow research progress this fall. A lot of my time went towards teaching the dataviz course, and plenty went to parenthood (see below), both of which will be reduced in the spring semester. I also wish I had some grad-student collaborators. I’m not part of a larger research group right now, so meetings are just between my advisor and me. Meetings with Jing are very productive, but in between it’d also be nice to hash out tough ideas together with a fellow student, without taking up an advisor’s time or stumbling around on my own.

Though it’s not quite the same, I started attending the Statistical Machine Learning Reading Group regularly. Following these talks is another good way to stretch my math muscles and keep up with recent literature.


As a nice break from statistics, we got to see our friends Bryan Wright and Yuko Eguchi both defend their PhD dissertations in musicology. A defense in the humanities seems to be much more of a conversation involving the whole committee, vs. the lecture given by Statistics folks defending PhDs.

Besides home and school, I’ve been a well-intentioned but ineffective volunteer, trying to manage a few pro bono statistical projects. It turns out that virtual collaboration, managing a far-flung team of people who’ve never met face-to-face, is a serious challenge. I’ve tried reading up on advice but haven’t found any great tips—so please leave a comment if you know any good resources.

So far, I’ve learned that choosing the right volunteer team is important. Apparent enthusiasm (I’m eager to have a new project! or even eager for this particular project!) doesn’t seem to predict commitment or followup as well as apparent professionalism (whether or not I’m eager, I will stay organized and get s**t done).

Meanwhile, the baby is no longer in the “potted-plant stage” (when you can put him down and expect he’ll still be there a second later), but he was not yet in day care this semester, just as my wife was returning to part-time work. We finally got off the wait-lists and into day care after the semester ended, but until then it was much harder to juggle home and school commitments.

However, he’s an amazing little guy, and it’s fun finally taking him to outings and playdates at the park and zoo and museums (where he stares at the floor instead of exhibits… except for the model railroad, which he really loved!) We also finally made it out to Kennywood, a gorgeous local amusement park, for their holiday light show.

Here’s to more exploration of Pittsburgh as the little guy keeps growing!

Are you really moving to Canada?

It’s another presidential election year in the USA, and you know what that means: Everyone’s claiming they’ll move to Canada if the wrong candidate wins. But does anyone really follow through?

Anecdotal evidence: Last week, a Canadian told me that at least a dozen of her friends back home are former US citizens who moved, allegedly, in the wake of disappointing election results. So perhaps there’s something to this claim/threat/promise?

Statistical evidence: Take a look for yourself.


As a first pass, I don’t see evidence of consistent, large spikes in migration right after elections. The dotted vertical lines denote the years after an election year, i.e. the years where I’d expect spikes if this really happened a lot. For example: there was a US presidential election at the end of 1980, and the victor took office in 1981. So if tons of disappointed Americans moved to Canada afterwards, we’d expect a dramatically higher migration count during 1981 than 1980 or 1982. The 1981 count is a bit higher than its neighbors, but the 1985 count is not, and so on. Election-year effects alone don’t seem to drive migration more than other factors.

What about political leanings? Maybe Democrats are likely to move to Canada after a Republican wins, but not vice versa? (In the plot, blue and red shading indicate Democratic and Republican administrations, respectively.) Migration fell during the Republican administrations of the ’80s, but rose during the Republican administration of the ’00s. So, again, the victor’s political party doesn’t explain the whole story either.

I’m not an economist, political scientist, or demographer, so I won’t try to interpret this chart any further. All I can say is that the annual counts vary by a factor of 2 (5,000 in the mid-’90s, compared to 10,000 around 1980 or 2010)… So the factors behind this long-term effect seem to be much more important than any possible short-term election-year effects.

Extensions: Someone better informed than myself could compare this trend to politically-motivated migration between other countries. For example, my Canadian informant told me about the Quebec independence referendum, which lost 49.4% to 50.6%, and how many disappointed Québécois apparently moved to France afterwards.

Data notes: I plotted data on permanent immigrants (temporary migration might be another story?) from the UN’s Population Division, “International Migration Flows to and from Selected Countries: The 2015 Revision.” Of course it’s a nontrivial question to define who counts as an immigrant. The documentation for Canada says:

International migration data are derived from administrative sources recording foreigners who were granted permission to reside permanently in Canada. … The number of immigrants is subject to administrative corrections made by Citizenship and Immigration Canada.