# Category Archives: Education

## About to teach Statistical Graphics and Visualization course at CMU

I’m pretty excited for tomorrow: I’ll begin teaching the Fall 2015 offering of 36-721, Statistical Graphics and Visualization. This is a half-semester course designed primarily for students in our MSP program (Masters in Statistical Practice).

A large part of the focus will be on useful principles and frameworks: human visual perception, the Grammar of Graphics, graphic design and interaction design, and more current dataviz research. As for tools, besides base R and ggplot2, I’ll introduce a bit of Tableau, D3.js, and Inkscape/Illustrator. For assessments, I’m trying a variant of “specs grading”, with a heavy use of rubrics, hoping to make my expectations clear and my TA’s grading easier.

Classifier diagnostics from Cook & Swayne’s book

My initial course materials are up on my department webpage. Here are the

• syllabus (pdf),
• first lecture (html created with Rpres), and
• first homework (pdf) with dataset (csv).

(I’ll probably just use Blackboard during the semester, but I may post the final materials here again.)

It’s been a pleasant challenge to plan a course that can satisfy statisticians (slice and dice data quickly to support detailed analyses! examine residuals and other model diagnostics! work with data formats from rectangular CSVs through shapefiles to social networks!) … while also passing on lessons from the data journalism and design communities (take design and the user experience seriously! use layout, typography, and interaction sensibly!). I’m also trying to put into practice all the advice from teaching seminars I’ve taken at CMU’s Eberly Center.

Also, in preparation, this summer I finally enjoyed reading more of the classic visualization books on my list.

• Cleveland’s The Elements of Graphing Data and Robbins’ Creating More Effective Graphs are chock full of advice on making clear graphics that harness human visual perception correctly.
• Ware’s Information Visualization adds to this the latest research findings and a ton of useful detail.
• Cleveland’s Visualizing Data and Cook & Swayne’s Interactive and Dynamic Graphics for Data Analysis are a treasure trove of practical data analysis advice. Cleveland’s many case studies show how graphics are a critical part of exploratory data analysis (EDA) and model-checking. In several cases, his analysis demonstrates that previously published findings used an inappropriate model and reached poor conclusions due to what he calls rote data analysis (RDA). Cook & Swayne do similar work with more modern statistical methods, including the first graphical diagnostics I’ve seen for many machine learning tools. There’s also a great section on visualizing missing data. The book’s subtitle, “With R and GGobi,” is misleading: you don’t need either tool to learn a lot from it.
• Monmonier’s How to Lie with Maps refers to dated technology, but the concepts are great. It’s still useful to know just how maps are made, and how different projections work and why it matters. Much of cartographic work sounds analogous to statistical work: making simplifications in order to convey a point more clearly, worrying about data quality and provenance (different areas on the map might have been updated by different folks at different times), setting national standards that are imperfect but necessary… The section on “data maps” is critical for any statistician working with spatial data, and the chapter on bureaucratic mapping agencies will sound familiar to my Census Bureau colleagues.

I hope to post longer notes on each book sometime later.

## One more difference between statistics and [machine learning, data science, etc.]

Statisticians have always done a myriad of different things related to data collection and analysis. Many of us are surprised (even frustrated) that Data Science is even a thing. “That’s just statistics under a new name!” we cry. Others are trying to bring Data Science, Machine Learning, Data Mining, etc. into our fold, hoping that Statistics will be the “big tent” for everyone learning from data.

But I do think there is one core thing that differentiates Statisticians from these others. Having an interest in this is why you might choose to major in statistics rather than applied math, machine learning, etc. And it’s the reason you might hire a trained statistician rather than someone else fluent with data:

Statisticians use the idea of variability due to sampling to design good data collection processes, to quantify uncertainty, and to understand the statistical properties of our methods.

When applied statisticians design an experiment or a survey, they account for the inherent randomness and try to control it. They plan your study in such a way that’ll make your estimates/predictions as accurate as possible for the sample size you can afford. And when they analyze the data, alongside each estimate they report its precision, so you can decide whether you have enough evidence or whether you still need further study. For more complex models, they also worry about overfitting: can this model generalize well to the population, or is it too complicated to estimate with this sample, so that it is just fitting noise?

When theoretical statisticians invent a new estimator, they study how well it’ll perform over repeated sampling, under various assumptions. They study its statistical properties first and foremost. Loosely speaking: How variable will the estimates tend to be? Will they be biased (i.e. tend to always overestimate or always underestimate)? How robust will they be to outliers? Is the estimator consistent (as the sample size grows, does the estimate tend to approach the true value)?
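To make “performance over repeated sampling” concrete, here is a minimal simulation sketch. It is an illustration under assumed settings (standard normal data with true variance 1, a hypothetical helper named `simulate`), not anything from a particular course. It compares two estimators of a variance: one divides by n and is biased low, the other divides by n - 1 and is unbiased, and both become less variable as the sample size grows.

```python
import random
import statistics


def simulate(n, reps=5000, seed=1):
    """Draw `reps` samples of size n from N(0, 1), whose true variance is 1.

    Returns the average of the divide-by-n estimator, the average of the
    divide-by-(n - 1) estimator, and the standard deviation of the latter
    across the repeated samples.
    """
    rng = random.Random(seed)
    by_n, by_n_minus_1 = [], []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        by_n.append(statistics.pvariance(x))          # divides by n
        by_n_minus_1.append(statistics.variance(x))   # divides by n - 1
    return (
        statistics.fmean(by_n),
        statistics.fmean(by_n_minus_1),
        statistics.stdev(by_n_minus_1),
    )


for n in (5, 50):
    mean_by_n, mean_unbiased, sd_unbiased = simulate(n)
    print(f"n={n}: mean(divide by n)={mean_by_n:.3f}, "
          f"mean(divide by n-1)={mean_unbiased:.3f}, "
          f"sd across samples={sd_unbiased:.3f}")
```

The averages across samples speak to bias, the spread across samples speaks to variability, and rerunning with larger n gives a rough feel for consistency.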

These are not the only important things in working with data, and they’re not the only things statisticians are trained to do. But (as far as I can tell) they are a much deeper part of the curriculum in statistics training than in any other field. Statistics is their home. Without them, you can often still be a good data analyst but a poor statistician.

## After 4th semester of statistics PhD program

This was my first PhD semester without any required courses (more or less). That means I had time to focus on research, right?

It was also my first semester as a dad. Exhilarating, joyful, and exhausting. So, time was freed up by having less coursework, but it was reallocated largely towards diapering and sleep. Still, I did start on a new research project, about which I’m pretty excited.

Our department was also recognized as one of the nation’s fastest-growing statistics departments. I got to see some of the challenges with this first-hand as a TA for a huge 200-student class.

See also my previous posts on the 1st, the 2nd, and the 3rd semester of my Statistics PhD program.

Classes:

• Statistical Computing:
This was a revamped, semi-required, half-semester course, and we were the guinea pigs. I found it quite useful. The revamp was spearheaded by our department chair Chris Genovese, who wanted to pass on his software engineering knowledge/mindset to the rest of us statisticians. This course was not just “how to use R” (though we did cover some advanced topics from Hadley Wickham’s new books Advanced R and R Packages; and it got me to try writing homework assignment analyses as R package vignettes).
Rather, it was a mix of pragmatic coding practices (using version control such as Git; writing and running unit tests; etc.) and good-to-know algorithms (hashing; sorting and searching; dynamic programming; etc.). It’s the kind of stuff you’d pick up on the job as a programmer, or in class as a CS student, but not necessarily as a statistician even if you write code often.
The homework scheme was nice in that we could choose from a large set of assignments. We had to do two per week, but could do them in any order—so you could do several on a hard topic you really wanted to learn, or pick an easy one if you were having a rough week. The only problem is that I never had to practice certain topics if I wanted to avoid them. I’d like to try doing this as an instructor sometime, but I’d want to control my students’ coverage a bit more tightly.
This fall, Stat Computing becomes an actually-required, full-semester course and will be co-taught by my classmate Alex Reinhart.
• Convex Optimization:
Another great course with Ryan Tibshirani. Tons of work, with fairly long homeworks, but I also learned a huge amount of very practical stuff, both theory (how to prove a certain problem is convex? how to prove a certain optimization method works well?) and practice (which methods are likely to work on which problems?).
My favorite assignments were the ones in which we replicated analyses from recent papers. A great way to practice your coding, improve your optimization, and catch up with the literature all at once. One of these homeworks actually inspired in me a new methodological idea, which I’ve pursued as a research project.
Ryan’s teaching was great as usual. He’d start each class with a review from last time and how it connects to today. There were also daily online quizzes, posted after class and due at midnight, that asked simple comprehension questions—not difficult and not a huge chunk of your grade, but enough to encourage you to keep up with the class regularly instead of leaving your studying to the last minute.
• TAing for Intro to Stat Inference:
This was the 200-student class. I’m really glad statistics is popular enough to draw such crowds, but it’s the first time the department has had so many folks in the course, and we are still working out how to manage it. We had an army of undergrad- and Masters-level graders for the weekly homeworks, but just three of us PhD-level TAs to grade midterms and exams, which made for several loooong weekends.
I also regret that I often wasn’t at my best during my office hours this semester. I’ll blame it largely on baby-induced sleep deprivation, but I could have spent more time preparing too. I hope the students who came to my sessions still found them helpful.
• Next semester, I’ll be teaching the grad-level data visualization course! It will be heavily inspired by Alberto Cairo’s book and his MOOC. I’m still trying to find the right balance between the theory I think is important (how does the Grammar of Graphics work, and why does it underpin ggplot2, Tableau, D3, etc.? how does human visual perception work? what makes for a well-designed graphic?) vs. the tool-using practice that would certainly help many students too (teach me D3 and Shiny so I can make something impressive for portfolios and job interviews!)
I was glad to hear Scott Murray’s reflections on his recent online dataviz course co-taught with Alberto.

Research:

• Sparse PCA: I’ve been working with Jing Lei on several aspects of sparse PCA, extending some methodology that he’s developed with collaborators including his wife Kehui Chen (also a statistics professor, just down the street at UPitt). It’s a great opportunity to practice what I’ve learned in Convex Optimization and earlier courses. I admired Jing’s teaching when I took his courses last year, and I’m enjoying research work with him: I have plenty of independence, but he is also happy to provide direction and advice when needed.
We have some nice simulation results illustrating that our method can work in an ideal setting, so now it’s time to start looking at proofs of why it should work as well as a real dataset to showcase its use. More on this soon, I hope.
Unfortunately, one research direction that I thought could become a thesis topic turned out to be a dead end as soon as we formulated the problem more precisely. Too bad, though at least it’s better to find out now than after spending months on it.
• I still need to finish writing up a few projects from last fall: my ADA report and a Small Area Estimation paper with Rebecca Steorts (now moving from CMU to Duke). I really wish I had pushed myself to finish them before the baby came—now they’ve been on the backburner for months. I hope to wrap them up this summer. Apologies to my collaborators!

Life:

• Being a sDADistician: Finally, my penchant for terrible puns becomes socially acceptable, maybe even expected—they’re “dad jokes,” after all.
Grad school seems to be a good time to start a family. (If you don’t believe me, I heard it as well from Rob Tibshirani last semester.) I have a pretty flexible schedule, so I can easily make time to see the baby and help out, working from home or going back and forth, instead of staying all day on campus or at the office until late o’clock after he’s gone to bed. Still, it helps to make a concrete schedule with my wife, about who’s watching the baby when. Before he arrived, I had imagined we could just pop him in the crib to sleep or entertain himself when we needed to work—ah, foolish optimism…
It certainly doesn’t work for us both to work from home and be half-working, half-watching him. Neither the work nor the child care is particularly good that way. But when we set a schedule, it’s great for organization & motivation—I only have a chunk of X hours now, so let me get this task DONE, not fritter the day away.
I’ve spent less time this semester attending talks and department events (special apologies to all the students whose defenses I missed!), but I’ve also forced myself to get much better about ignoring distractions like computer games and Facebook, and I spend more of my free time on things that really do make me feel better such as exercise and reading.
• Stoicism: This semester I decided to really finish the Seneca book I’d started years ago. It is part of a set of philosophy books I received as a gift from my grandparents. Long story short, once I got in the zone I was hooked, and I’ve really enjoyed Seneca’s Letters to Lucilius as well as Practical Philosophy, a Great Courses lecture series on his contemporaries.
It turns out several of my fellow students (including Lee Richardson) have been reading the Stoics lately too. The name “Stoic” comes from “Stoa,” i.e. porch, after the place where they used to gather… so clearly we need to meet for beers at The Porch by campus to discuss this stuff.
• Podcasts: This semester I also discovered the joy of listening to good podcasts.
(1) Planet Money is the perfect length for my walk to/from campus, covers quirky stories loosely related to economics and finance, and includes a great episode with a shoutout to CMU’s Computer Science school.
(2) Talking Machines is a more academic podcast about Machine Learning. The hosts cover interesting recent ideas and hit a good balance—the material is presented deeply enough to interest me, but not so deeply I can’t follow it while out on a walk. The episodes usually explain a novel paper and link to it online, then answer a listener question, and end with an interview with a ML researcher or practitioner. They cover not only technical details, but other important perspectives as well: how do you write a ML textbook and get it published? how do you organize a conference to encourage women in ML? how do you run a successful research lab? Most of all, I love that they respect statisticians too and in fact, when they interview the creator of The Automatic Statistician, they probe him on whether this isn’t just going to make the data-fishing problem worse.
(3) PolicyViz is a new podcast on data visualization, with somewhat of a focus on data and analyses for the public: government statistics, data journalism, etc. It’s run by Jon Schwabish, whom I (think I) got to meet when I still worked in DC, and whose visualization workshop materials are a great resource.
• It’s a chore to update R with all the zillion packages I have installed. I found that Tal Galili’s installr manages updates cleanly and helpfully.
• Next time I bake brownies, I’ll add some spices and call them “Chai squares.” But we must ask, of course: what size to cut them for optimal goodness of fit in the mouth?

## Small Area Estimation 101: old materials posted

I never got around to polishing my Small Area Estimation (SAE) “101” tutorial materials that I promised a while ago. So here they are, though still unedited and not as clean / self-explanatory as I’d like.

The slides introduce a few variants of the simplest area-level (Fay-Herriot) model, analyzing the same dataset in a few different ways. The slides also explain some basic concepts behind Bayesian inference and MCMC, since the target audience wasn’t expected to be familiar with these topics.

• Part 1: the basic Frequentist area-level model; how to estimate it; model checking (pdf)
• Part 2: overview of Bayes and MCMC; model checking; how to estimate the basic Bayesian area-level model (pdf)
• All slides, data, and code (ZIP)

The code for all the Frequentist analyses is in SAS. There’s R code too, but only for a WinBUGS example of a Bayesian analysis (also repeated in SAS). One day I’ll redo the whole thing in R, but it’s not at the top of the list right now.

Frequentist examples:

• “ByHand” where we compute the Prasad-Rao estimator of the model error variance (just for illustrative purposes since all the steps are explicit and simpler to follow; but not something I’d usually recommend in practice)
• “ProcMixed” where we use mixed modeling to estimate the model error variance at the same time as everything else (a better way to go in practice; but the details get swept up under the hood)
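For readers without SAS, the “ByHand” idea can be sketched in a few lines of Python. This is not the code from the ZIP file: it is a simplified illustration that assumes an intercept-only area-level model (direct estimate = common mean + area effect + sampling error, with known sampling variances), and the function name `fay_herriot_by_hand` and the simulated data are hypothetical. It computes the Prasad-Rao moment estimate of the model error variance, then the weighted estimate of the common mean, then the shrinkage (EBLUP) estimates.

```python
import random


def fay_herriot_by_hand(y, psi):
    """Intercept-only area-level model, fitted step by step.

    y   : direct survey estimates, one per area
    psi : known sampling variances of those estimates
    Returns (model error variance, common mean, shrinkage estimates).
    """
    m = len(y)
    p = 1  # number of regression coefficients (intercept only)

    # Step 1: ordinary least squares fit, which here is just the mean.
    ybar = sum(y) / m
    rss = sum((yi - ybar) ** 2 for yi in y)

    # Step 2: Prasad-Rao moment estimator of the model error variance.
    # With only an intercept, each area's leverage is 1/m.
    sigma2_v = max(0.0, (rss - sum(psi) * (1 - 1 / m)) / (m - p))

    # Step 3: weighted least squares estimate of the common mean.
    weights = [1 / (sigma2_v + s) for s in psi]
    beta = sum(w * yi for w, yi in zip(weights, y)) / sum(weights)

    # Step 4: shrink each direct estimate towards the common mean;
    # noisier areas (larger psi) are shrunk more.
    gamma = [sigma2_v / (sigma2_v + s) for s in psi]
    eblup = [g * yi + (1 - g) * beta for g, yi in zip(gamma, y)]
    return sigma2_v, beta, eblup


# Simulated example data (assumed values, for illustration only).
rng = random.Random(4)
m = 30
psi = [rng.uniform(0.5, 2.0) for _ in range(m)]
theta = [10.0 + rng.gauss(0.0, 1.0) for _ in range(m)]
y = [t + rng.gauss(0.0, s ** 0.5) for t, s in zip(theta, psi)]

sigma2_v, beta, eblup = fay_herriot_by_hand(y, psi)
print(f"estimated model variance: {sigma2_v:.3f}, common mean: {beta:.3f}")
```

A real analysis would include covariates and an estimate of each area’s mean squared error, which is where the mixed-model software earns its keep.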

Bayesian examples:

• “ProcMCMC” and “ProcMCMC_alt” where we use SAS to fit essentially the same model parameterized in a few different ways, some of whose chains converge better than others
• “R_WinBUGS” where we do the same but using R to call WinBUGS instead of using SAS

The example data comes from Mukhopadhyay and McDowell, “Small Area Estimation for Survey Data Analysis using SAS Software” [pdf].

If you get the code to run, I’d appreciate hearing that it still works.

My SAE resources page still includes a broader set of tutorials/textbooks/examples.

## Forget NHST: conference bans all conclusions

Once again, CMU is hosting the ~~illustrious~~ notorious SIGBOVIK conference.

Not to be outdone by the journal editors who banned confidence intervals, the SIGBOVIK 2015 proceedings (p.83) feature a proposal to ban future papers from reporting any conclusions whatsoever:

> In other words, from this point forward, BASP papers will only be allowed to include results that “kind of look significant”, but haven’t been vetted by any statistical processes…
>
> This is a bold stance, and I think we, as ACH members, would be remiss if we were to take a stance any less bold. Which is why I propose that SIGBOVIK – from this day forward – should ban conclusions
>
> Of course, even this provision may not be sufficient, since readers may draw their own conclusions from any suggestions, statements, or data presented by authors. Thus, I suggest a phased plan to remove any potential of readers being mislead…

I applaud the author’s courageous leadership. Readers of my own SIGBOVIK 2014 paper on BS inference (with Alex Reinhart) will immediately see the natural synergy between conclusion-free analyses and our own BS.

## Statistics Done Wrong, Alex Reinhart

Hats off to my classmate Alex Reinhart for publishing his first book! Statistics Done Wrong: The Woefully Complete Guide [website, publisher, Amazon] came out this month. It’s a well-written, funny, and useful guide to the most common problems in statistical practice today.

Although most of his examples are geared towards experimental science, most of his advice is just as valid for readers working in social science, data journalism [if Alberto Cairo likes your book it must be good!], surveys or polls, business analytics, or any other “data science” situation where you’re using a data sample to learn something about the broader world.

This is NOT a how-to book about plugging numbers into the formulas for t-tests and confidence intervals. Rather, the focus is on interpreting these seemingly-arcane statistical results correctly; and on designing your data collection process (experiment, survey, etc.) well in the first place, so that your data analysis will be as straightforward as possible. For example, he really brings home points like these:

• Before you even collect any data, if your planned sample size is too small, you simply can’t expect to learn anything from your study. “The power will be too low,” i.e. the estimates will be too imprecise to be useful.
• For each analysis you do, it’s important to understand commonly-misinterpreted statistical concepts such as p-values, confidence intervals, etc.; else you’re going to mislead yourself about what you can learn from the data.
• If you run a ton of analyses overall and only publish the ones that came out significant, such data-fishing will mostly produce effects that just happened (by chance, in your particular sample) to look bigger than they really are… so you’re fooling yourself and your readers if you don’t account for this problem, leading to bad science and possibly harmful conclusions.
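The first and third points interact, and a small simulation can show how. The sketch below is my own illustration with assumed numbers, not an example from the book: the true effect is 0.2 standard deviations, each study has 20 subjects per group, the standard deviation is treated as known, and the helper name `significance_filter` is hypothetical. It runs many small studies and keeps only the “significant” ones, as a publication filter would.

```python
import math
import random
import statistics


def significance_filter(true_effect=0.2, n_per_group=20, studies=5000, seed=2):
    """Simulate many two-group studies and keep only the significant ones.

    Returns the share of studies reaching |z| > 1.96 (the power) and the
    average absolute effect estimate among those significant studies.
    """
    rng = random.Random(seed)
    # Standard error of a difference in means when the sd is known to be 1.
    se = math.sqrt(2.0 / n_per_group)
    significant = []
    for _ in range(studies):
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        if abs(diff) / se > 1.96:
            significant.append(diff)
    power = len(significant) / studies
    return power, statistics.fmean([abs(d) for d in significant])


power, avg_significant = significance_filter()
print(f"power: {power:.2f}")
print(f"true effect: 0.20, average significant estimate: {avg_significant:.2f}")
```

With a sample this small, a study can only reach significance when its estimate is several times larger than the true effect, so the “published” estimates are exaggerated by construction.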

Admittedly, Alex’s physicist background shows in a few spots, when he implies that physicists do everything better (e.g., see my notes below on p.49, p.93, and p.122).

Seriously though, the advice is good. You can find the correct formulas in any Stats 101 textbook. But Alex’s book is a concise reminder of how to plan a study and to understand the numbers you’re running, full of humor and meaningful, lively case studies.

Highlights and notes-to-self below the break:

## Launch party for CMU undergrad stats major programs

So here at CMU, we’re proud to have one of the “largest and fastest-growing” statistics departments in the US.

Tomorrow (March 3rd) is the launch party for several new (joint-)major programs for CMU undergrads: Statistics and Machine Learning, Statistics and Neuroscience, and Mathematical Statistics. That’s in addition to two existing programs: Statistics Core and the joint program in Economics and Statistics.

If you’re in Pittsburgh, come to the launch party at 4:30pm tomorrow. We’ll have project showcases, advising, interactive demos, etc., not to mention free food.

## After teaching 1st statistics course

I’ve just finished an exhausting but rewarding 6 weeks teaching a summer-session course on “Experimental Design for Behavioral and Social Sciences,” CMU course 36-309. My course materials are secreted away on Blackboard, but here is my syllabus. You can also see some materials from a previous session here, including Howard Seltman’s textbook (free online).

The students were expected to have already taken an introductory statistics course. After a short review of basic concepts and t-tests, we dove into more intermediate analyses (ANOVA and regression, contrasts, chi-square tests and logistic regression, repeated measures) and into how a good study should be designed (power, internal/external validity, etc.).

I’ve taught one-off statistics workshops before, and I’ve taught once-a-week semester-long Polish language classes, but this was my first experience teaching a full-length course in statistics. Detailed notes are below.

## What the Best College Teachers Do, Ken Bain

Although CMU has no school of education, it does have strong support for those of us who’d like to become better educators, not just better researchers. There’s the Eberly Center, which bridges the research-about-education that happens on campus, to the education-of-researchers for which most of us are here. And there’s the brand-new Simon Initiative—I’m not fully sure yet what it entails, but I enjoyed the inaugural lecture by Carl Wieman on improving science education.

Amidst all this, I’ve started teaching a summer course (36-309, Experimental Design). While preparing to teach, I’ve read Ken Bain’s What the Best College Teachers Do (recommended by CMU’s Sciences Teaching Club).

Much of the content is about convincing you to adopt the mindset of a good teacher: You should be interested in the students’ understanding, not just in getting them to regurgitate facts or plug & chug formulas. You should be patient with learners of different types and levels. Assessments for the sake of getting feedback should be frequent and separate from assessments for the sake of labeling the student with a final grade. You want the students to become able to learn independently, so train them to think constructively about their own learning.

Mostly, this is stuff I already agreed with. I really like Bain’s high-level ideas. But I wish there had been more concrete illustrations of how these ideas work in practice. Practical examples could have replaced a lot of the fluffy language about opening the students’ minds and hearts, etc.

Still, there are a couple of lists of explicit questions to use when planning your course. No list can cover everything you need to consider—but still, it doesn’t hurt to use such a list, to ensure that at least you haven’t overlooked what’s on it.

Bain also has some lists of “types of learners” or “developmental stages of learning.” It’s often unhelpful to pigeonhole individual students into one bucket or another… but it can be useful to treat these archetypes as if they were user personas, and consider how your lesson plan will work for these users.

Some of these lists, and other excessive notes-to-self, below the break.

## After 2nd semester of Statistics PhD program

Here’s another post on life as a statistics PhD student (in the Department of Statistics, at Carnegie Mellon University, in Pittsburgh, PA).
The previous such post was After 1st semester of Statistics PhD program.

Classes:

• I feared that Advanced Probability Overview would be just dry esoteric theory, but Jing Lei ensured all the topics were really well-motivated. Although it was tough, I did better than I’d hoped (especially given that I’ve never taken a proper Real Analysis course). In Statistical Machine Learning, Larry Wasserman and Ryan Tibshirani did a great job of balancing “old” core theory with new cutting-edge research topics, including helpful homework assignments that gave us practice both in theory and in applications.
• My highlight of the semester was being able to read and digest a research paper that was way too abstract when I tried reading it a few years ago. It really hit me that I must be learning something in grad school.
(The paper was Building Consistent Regression Trees from Complex Sample Data, by Toth and Eltinge. While working at Census, I wanted to try running a complex-survey-weighted regression tree, but I couldn’t get much out of this paper. Now, after a good dose of probability theory and machine learning, it’s far clearer. In fact, I have some ideas about extending this work!)
• The Statistical Machine Learning class referenced a ton of crazy math terms I wasn’t familiar with: Banach and Hilbert spaces, Lp norms, conjugate functions, etc. It terrified me at first—I’ve never even heard of this stuff, should I have taken grad-level functional analysis before I started this PhD, am I about to fail?!?—but it turns out a lot of it is just names for specific versions of general concepts that I already knew. Whew. Also, most of it got used repeatedly from topic to topic, so we did gain familiarity even without explicitly taking a functional analysis course etc. So, don’t get disheartened too easily by unfamiliar terminology!
• It was great to finally learn more about Lp norms and about splines. Also, almost everything in SML can be written as a penalized regression 😛
• Smoothing splines and Reproducing Kernel Hilbert Space (RKHS) regression are nifty because the setup is that you want to optimize over all possible functions. So you start out with an infinite-dimensional space, for which in general there might be no simple way to search/optimize! … But in these specific setups, we can prove that the optimal solution happens to lie in a finite-dimensional subspace, where your usual optimization/search tools will work after all. Nice.
• Larry had a nice “foundations” day in SML, with examples where Bayes and Frequentist analysis differ greatly. However, I didn’t find most of his examples too convincing, since the Bayesian “loses” only due to a stupid choice of priors; or the Bayesian “loses” for finite n but in a case where n in practice would have to be ridiculously large. Still, this helped stretch my thinking about how these inference philosophies differ.
• Larry points out: you often hear that “We might as well go Bayes because if you give people a Frequentist interval, they’ll interpret it as a Bayes interval.” But the reverse is also true: Give someone a sequence of 95% Bayes intervals, and they’ll expect 95% of them to contain the true value. That is NOT necessarily going to happen with Bayes CIs (unlike Frequentist CIs).
• In addition to Subjective, Objective, Empirical, or Calibrated Bayes, let me propose “Cynical Bayes”: Don’t choose a prior because you believe it. Instead, choose one to optimize your estimator’s Frequentist properties. That way you can keep your expert Freq’ist colleagues happy, yet still call it a Bayes estimator, so you can give the usual Bayes interpretation to keep nonexperts happy.
• A background in Statistics will keep you thinking about distributions and probabilities and convergences. But a background in Applied Math may be better at giving you tools and ideas for feature engineering. It’s worth having both toolsets.
• The Advanced Probability Overview course covered some measure-theoretic probability. I’m finally understanding the subtleties of how the different convergences $\xrightarrow{p}$, $\xrightarrow{as}$, $\xrightarrow{D}$, and $\xrightarrow{L^p}$ all differ, and why it matters. We saw these concepts last semester in Intermediate Statistics, but the distinctions are far clearer to me now.
• AdvProb’s measure theory section also really helped me understand why textbooks say a random variable is a “function”: intuitively it seems like just a variable or a number or something… but in fact it really is a function, from “the state of the world” i.e. an element $\omega$ of the set $\Omega$ of all possible outcomes or states of the world, to the measurement you will collect (often a number on the real line). Finally, this measure theory view of probability, as the size of a subset of $\Omega$, is helpful. Even though statisticians’ goal is to develop tools that let them work with the range of the random variable and ignore the domain $\Omega$, it’s good to remember that this domain exists.
• However, measure theory and probability theory suffer from some really poor terminology! For example, it took me far too long to realize that “integrable” means “the integral is finite”, NOT “the integral exists.”
• When we teach students R, we really should use practical examples, not the arbitrary generic examples that you see so often. Instead of just showing me list(1,"a"), it helps to give a realistic example of why you may actually need to collect together numeric and character elements in a single object.
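Larry’s point above about sequences of 95% Bayes intervals can be checked with a small simulation. This is a sketch under assumptions of my own choosing, not an example from his lecture: a normal mean with known standard deviation, a N(0, prior_sd²) prior, and a hypothetical helper named `coverage`. For a fixed true value, it counts how often the usual Frequentist interval and the 95% posterior interval contain that value over repeated samples.

```python
import math
import random


def coverage(theta, prior_sd=1.0, sigma=1.0, n=1, reps=20000, seed=3):
    """Repeated-sampling coverage of two 95% intervals for a normal mean.

    theta is the fixed true mean. The Bayes interval uses a N(0, prior_sd^2)
    prior, so the posterior mean shrinks the sample mean towards 0.
    Returns (Frequentist coverage, Bayes coverage).
    """
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)                      # sd of the sample mean
    w = prior_sd ** 2 / (prior_sd ** 2 + se ** 2)  # shrinkage weight
    post_sd = math.sqrt(w) * se                    # posterior sd
    freq_hits = bayes_hits = 0
    for _ in range(reps):
        xbar = rng.gauss(theta, se)
        if abs(xbar - theta) <= 1.96 * se:
            freq_hits += 1
        if abs(w * xbar - theta) <= 1.96 * post_sd:
            bayes_hits += 1
    return freq_hits / reps, bayes_hits / reps


for theta in (0.0, 3.0):
    freq, bayes = coverage(theta)
    print(f"true mean {theta}: Frequentist coverage {freq:.3f}, "
          f"Bayes coverage {bayes:.3f}")
```

When the true mean sits far from the prior mean, the Bayes interval covers it much less than 95% of the time, and when it sits at the prior mean the interval covers more often than that. The 95% figure holds only on average over true values drawn from the prior itself, which is a different guarantee from the Frequentist one.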

Research:

• I started a new research project, the Advanced Data Analysis project, which will run until the end of this upcoming Fall semester (so about a year total). I am working with Rob Kass and Avniel Ghuman on using magnetoencephalography (MEG) data to study epilepsy.
• At Rob’s research group meetings, I learn a ton from the helpful questions he asks. When presenting someone else’s work (i.e. for a journal club), ask yourself, “What would you do if *your* research was based on the data from this paper?” Still, I’ve found I really do need to keep scheduling weekly 1-on-1 meetings—the group meetings are not enough to stay optimally on track.
• Neuroscience is hard! Pre-processing massive neuroscience datasets using not-fully-documented open source software is particularly hard. When I chose this project, I did not realize how much time I would have to spend on learning the subject matter, relevant specialized software tools, and data pre-processing workflow. Four months in and I’ve still barely gotten to the point of doing any “real” statistics. It’s a good project and I’m learning a lot, but it’s disheartening to see how much of that learning has been tied to debugging open-source software installations that I’ll only ever use again if I stay in this sub-field.
I would advise the next PhD cohort to choose projects that’ll primarily teach you more general-purpose, transferable skills. Maybe take an existing theoretical method that’s not implemented in software yet, and make it into an R package?

Life:

• This was a tougher semester in many ways, with harder classes and more research-related setbacks. The Cake song Tougher than it is got a lot of play time on my headphones 😛
• I’m glad that despite my slow posting rate, the blog still kept getting regular traffic—particularly Is a Master’s degree in Statistics worthwhile? I guess it’s a burning question these days.
• A big help to my sanity this semester came from joining the All University Orchestra. After a long week of tough classes and research setbacks, it’s great to switch brain modes and play my clarinet. I’ve really missed playing for the past few years in DC, and I’m glad to get back into it.
• Pittsburgh highlights: Bayernhof museum, Pittsburgh Symphony Orchestra concerts (The Legend of Zelda, “Behind the Notes” talks), Jozsa Corner, Point Brugge Cafe, sampling all the Squirrel Hill pizzerias, MCMC Bar Crawl on the Southside Flats, riding the ridiculously steep inclines, Pittsburgh Area Theater Organ Society concerts and tours of their beautiful theater organ
• Things still on our list to do in Pittsburgh: see a CMU theater performance, Pittsburgh aviary and zoo, Kennywood amusement park, Steelers game, Penguins game
• I look forward to getting a chance to teach a whole course this summer. It’ll be 36-309, Experimental Design. I also took some Eberly Center seminars, and the department organized helpful planning meetings for those of us students who’ll teach in the summer, so I feel reasonably prepared.
I plan to have my students design a series of experiments to bake the ultimate chocolate chip cookie. It will be delicious. I baked Meg Hourihan’s mean chocolate chip cookies for a department event earlier this spring, which seems like an appropriate start.
However, ironically, as the local knitr / reproducible research fanboy… I’m supposed to teach the course using SPSS, which seems to be largely point-and-click, without much support for reproducible reports.
• It was a nice difference to be on the other side of the department’s open house for admitted students this year. I’m also happy to be reading Grad Cafe forums from a much more relaxed point of view this year!
• I’m surprised there’s not much crossover between the CMU and UPitt statistics departments. And the stats community outside each department doesn’t seem as vibrant as it was in DC. I attended the American Statistical Association’s Pittsburgh chapter banquet. Besides CMU and Pitt folks, most attendees seemed to be RAND employees or independent consultants. There are also some Meetup groups: the Pittsburgh Data Visualization Group and the Pittsburgh useR Group.
• I’ve updated and expanded my CMU blogroll in the sidebar. Please let me know if I missed your CMU/Pittsburgh statistics-related blog!

Other people’s helpful posts on the PhD experience: