After 3rd semester of Statistics PhD program

It’s time for another braindump of reflections on statistics grad school.
See also the previous two posts: After 1st semester of Statistics PhD program and After 2nd semester of Statistics PhD program.

This was my last semester of required coursework. Having passed the Data Analysis Exam in May, and with all the courses under my belt, I am pretty much ready to focus on the thesis topic search and proposal. Exciting!

Classes:

  • Let me elaborate on Cosma’s post: “Note to graduate students: It is important that you internalize that you are, in fact, a badass…”
    Ideally you should really internalize that you’re a badass before you come to grad school, because this is not the place to prove to yourself that you’re a badass. There are too many opportunities to feel bad about yourself at every stumble, when you’re surrounded by high-performing classmates and faculty who seem to do everything faster and more smoothly… It can be demoralizing when, say, you learn that you had the lowest score on an exam in a required class.
  • On the other hand, now that the Advanced Statistical Theory course is over, I do feel much more badass about reading and doing statistical theory. I used to see a paper with a ton of unfamiliar math and my eyes would glaze over. Now I see it as: “Well, it may take a while, but I’m capable of learning to parse that, use it, and even contribute to the field.” It feels no more daunting than other things I’ve done. Thank you, Advanced Prob and Advanced Stat Theory!
    For example, I finally internalized that “hard math” is no worse than learning a new programming language. If I do an applied project and have to pick up a new tool like Python, or parallel programming, or version control, it’s not an impossible task: it’s just a lot of work, like learning a foreign language. And I finally feel the same way about math again: I may not have known what a Frobenius norm is, and my intuition about the difference between o(1) and O(1) may still be underdeveloped (a quick reminder follows this list), but getting there is not substantively different from keeping track of the differences between for-loops in R vs Python vs MATLAB (like I had to do all year).
    Also, if I get stuck on a theory problem, it’s my own concern. I can read previous work on it and find a solution; or if there is none, I can write one and thus make a contribution to the literature. But if I’m stuck on an applied problem because I don’t have a codebook for the variables or don’t know what preprocessing was done to the dataset, I really am stuck waiting until the data owner responds (if he/she even knows or remembers what was done, which is not a safe bet…)
  • I was a bit surprised by the choice of topics in Advanced Stat Theory. We covered several historically important topics in great detail, but then the professor told us that most of them are not especially popular directions or practically useful tools in modern statistical research. (For example, Neyman-Pearson hypothesis testing in exponential families seems to be a solved problem, tackled by tools specific to that scenario alone… So why spend so much course time on it?) Maybe the course could be better focused if it were split into two parts: one on historically important foundations and one on modern techniques.
  • My TA assignment this semester was for Discrete Multivariate Analysis: advanced methods for contingency tables and log-linear models. I came away with a bigger appreciation for the rich and interesting questions that can arise about what looks, on the surface, to be a simple and small corner of statistics.
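(For my own future reference, and nothing specific to the course notes, here is the o(1) vs O(1) distinction I kept second-guessing above:

$$a_n = O(1) \;\iff\; \exists\, C < \infty \text{ such that } |a_n| \le C \text{ for all } n, \qquad a_n = o(1) \;\iff\; a_n \to 0 \text{ as } n \to \infty.$$

So O(1) only promises boundedness, while o(1) promises the sequence actually vanishes.)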

Journal Club:

  • My favorite course this fall was the Statistical Journal Club, led by CMU’s Ryan Tibshirani jointly with his father Rob Tibshirani (on sabbatical here from Stanford). The Tibshiranis chose a great selection of papers for us to read and discuss. Each week a pair or trio of students would present that week’s paper. It was helpful to give practice “chalk talks” as well as to see simulations illustrating each paper. (On day 1, Rob Tibshirani told us he likes to implement a small simulation whenever he reads a new paper or attends a talk: it helps gain intuition, see how well the method really works in practice, and see how sensitive it is to the authors’ particular setup and assumptions.)
  • I mentioned in Journal Club that we’d benefit from an MS/PhD-level course on experimental design and sampling design for advanced stats & ML. Beyond simple data collection for a basic psych experiment, how should one collect “big data” well, what should we watch out for, and how does the data-collection process affect the analysis? Someone asked if I’m volunteering to teach it—maybe not a bad idea someday :)
  • The papers on “A kernel two-sample test” and “Brownian distance covariance” reminded me of a few moments when I saw an abstract definition in AdvProb class and thought, “Too bad this is just a technical tool for proofs and not something you can check in practice on real data…” As it turns out, the authors of these papers DID find a way to use them with real data. (For instance, there’s a very abstract characterization of equality of distributions that cannot be checked directly: “for any function, the mean of that function on X is the same as the mean of that function on Y.” You can’t take a real dataset and check this for ALL functions—but the authors figured out that kernel methods get you pretty close, by effectively comparing means over a rich infinite-dimensional class of functions at once. So they took the abstract, impractical definition and developed a nice practical test you can run on real data; a toy sketch of the resulting statistic follows this list.) Impressive, and a good reminder to watch out for that thought again in the future—maybe a second look could turn into something useful.
  • Similarly, a few papers (like “Stability selection”) take an idea that seems reasonable to try in practice but starts out with no theoretical grounding… (What if we just take random half-samples of the data, refit our lasso regression on each one, and see which variables are kept in the model on most of the half-samples? See the rough sketch after this list.) …and then develop proofs that give theoretical guarantees about how good this procedure can be.
  • Still other papers (like my own team’s assigned paper, on Deep Learning) have not yet found a solid theoretical grounding for why the model performs so well, or any guarantees on how well it should be expected to do. But it feels like it should be tractable, if only we could hit on the right framework for looking at the problem. The Dropout paper had a nice way to analyze the very top layer of a neural network, but it wasn’t directly helpful for the deeper layers.
  • I got really excited about the “Post-selection inference” paper, which discussed conditional hypothesis testing for regression coefficients. I thought we could apply it to the simplest OLS case to get a nifty new test that would let you make inferences such as: “Beta is estimated to be positive, and our conditional one-sided test says it’s significant, so it’s significantly positive.” You’re usually told not to do this: you’re supposed to decide ahead of time whether you want a two-sided or one-sided test; and if it’s one-sided, to decide which side to check before looking at the data. However… after some scratch work, in the Normal case it looks like the correction you make (for deciding on the direction of the one-sided test after observing the sign of the estimate) is exactly equivalent to doing a two-sided test instead. (Basically you double the one-sided test’s p-value, which is the same as computing the two-sided p-value for a Normal statistic; a quick numerical check follows this list.) So on the one hand, we don’t get a new, better test out of this: it’s just what people do in practice anyway. On the other hand, it shows that the thing people do, even though they’re told it’s wrong, is actually not wrong after all :)
    This made me wonder: Apart from this simple case of one coefficient in OLS, are there other aspects of sequential/adaptive/conditional hypothesis testing that could be simplified and spread to a wider audience? Are there common use-cases where these tools would help less-statistically-savvy users get rigorous inference out of the missteps they normally make?
  • A few of the papers were less technical, such as “Why most published research findings are false.” We discussed how to incentivize scientists to publish well-powered interesting null findings and avoid the file-drawer problem. Rob Tibshirani suggested the idea of a “PLoS Zero” :) (vs. the existing PLoS ONE) He also told us how he encouraged PubMed to add a comment system, the PubMed Commons. Now you can point out issues or mistakes in a paper in this public space and get the authors’ responses right there, instead of having to go back & forth through the journal editors’ gatekeeping to publish letters slowly.
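As promised above, here is a toy sketch of the kernel two-sample idea. This is my own simplified code, not the authors’ implementation: I use a Gaussian kernel with an arbitrary bandwidth and the simple biased (V-statistic) estimate of the squared Maximum Mean Discrepancy, calibrated by a permutation test.

```r
# Toy kernel two-sample test: biased estimate of squared MMD with a Gaussian
# kernel, calibrated by a permutation test. (My sketch; bandwidth is arbitrary.)
set.seed(1)
x <- rnorm(100, mean = 0)     # sample from P
y <- rnorm(100, mean = 0.5)   # sample from Q (shifted, so the null is false)

gauss_kernel <- function(a, b, sigma = 1) {
  exp(-outer(a, b, function(u, v) (u - v)^2) / (2 * sigma^2))
}

mmd2 <- function(x, y) {
  mean(gauss_kernel(x, x)) + mean(gauss_kernel(y, y)) - 2 * mean(gauss_kernel(x, y))
}

observed <- mmd2(x, y)

# Permutation null: pool the samples, reshuffle the labels, recompute.
pooled <- c(x, y)
perm_stats <- replicate(500, {
  idx <- sample(length(pooled))
  mmd2(pooled[idx[1:100]], pooled[idx[-(1:100)]])
})
mean(perm_stats >= observed)  # approximate p-value (should be small, since P != Q here)
```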
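And here is a rough sketch of the half-sampling idea from the stability-selection item above. Again this is my own simplified toy version (one fixed lasso penalty, an ad hoc selection threshold), assuming the glmnet package is available; the actual paper is much more careful about the penalty path and error control.

```r
# Toy stability selection: refit the lasso on random half-samples and record
# how often each variable is selected. (Simplified; fixed lambda, ad hoc cutoff.)
library(glmnet)
set.seed(1)
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(2, -2, 1.5, rep(0, p - 3))          # only the first 3 variables matter
y <- drop(X %*% beta + rnorm(n))

B <- 100; lambda <- 0.3
kept <- replicate(B, {
  half <- sample(n, n %/% 2)                  # random half-sample
  fit <- glmnet(X[half, ], y[half], lambda = lambda)
  as.numeric(as.numeric(coef(fit))[-1] != 0)  # 1 if the variable stays in the model
})

selection_freq <- rowMeans(kept)              # per-variable selection proportion
which(selection_freq >= 0.8)                  # the "stable" set (threshold is ad hoc)
```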
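Finally, the quick numerical check mentioned in the post-selection inference item. This is just my own scratch work in code form, for the simplest possible setting (one Normal test statistic with known variance):

```r
# Under H0 (theta = 0), "pick the side after seeing the sign, then double the
# one-sided p-value" coincides exactly with the usual two-sided Normal p-value.
set.seed(1)
z <- rnorm(1e5)                                        # test statistics under H0
p_onesided <- ifelse(z > 0, 1 - pnorm(z), pnorm(z))    # side chosen after peeking
p_doubled  <- 2 * p_onesided                           # the "correction"
p_twosided <- 2 * (1 - pnorm(abs(z)))                  # standard two-sided p-value
all.equal(p_doubled, p_twosided)                       # TRUE: identical
mean(p_doubled < 0.05)                                 # ~0.05, so the size is right
```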

Research:

  • Besides the year-long Advanced Data Analysis (ADA) project, I also got back into research on Small Area Estimation with Beka Steorts, which led me to attend the SAE2014 conference in Poznań, Poland (near my hometown—the first time that business travel has ever taken me anywhere near family!). Beka also got me involved in the MIDAS (Models of Infectious Disease Agent Study) project: we are developing “synthetic ecosystems,” aka artificial populations that epidemiologists can plug into agent-based models to study the spread of disease. The current version is an EXTREMELY rudimentary first pass: I’ll write a bit more about the project once we have a version we’re happier with.
  • I finally sat down and learned version control (via Git), and it has turned out to be a good friend. For the MIDAS project we had three of us working on Dropbox, which led to clogging all our Dropboxes, overwriting each other’s files, trying to coordinate by email, and renaming things from “blahblah” to “blahblah_temp” and “blahblah_temp_2_tmp_recent” and so on… so it became clear it was time for a better approach. Git lets you exclude files (so you don’t need to sync everything, the way Dropbox does); check differences between file versions; and use branching to try out temporary versions without renaming or breaking everything. I used the helpful tutorials by Bitbucket and Karl Broman.
  • MIDAS also sponsored me to attend the North American Cartographic Information Society (NACIS) 2014 conference here in Pittsburgh. That deserves its own post, but I found it nifty that the conference was co-organized by Amy Griffin… whom I met (when she came to do some research on spatial visualization of uncertainty with the Census Bureau) via Nicholas Nagle… who first reached out to me through a comment on this blog. It all comes back around!
  • As for the yearlong ADA project itself: it’s almost wrapped up, but quite differently from what we expected. There turned out to be major issues in getting and combining all the required dataset pieces: We needed (1) MEG brain scans, (2) MRI brain imagery, and (3) personal covariates about the medical/neuropsychological outcomes of each patient. Each of these three datasets had a different owner, and was de-identified for privacy/security… and we were never able to get a set of patient IDs that we could use to merge the different datasets together. In the end I had to switch topics entirely, to a similar neuroscientific dataset (which had been successfully combined and pre-processed) but for studying Autism instead of Epilepsy. This switch happened only in the last few months of the semester, so I had just a short time in which to address the scientific questions in appropriate statistical ways, while also learning about a new disorder and refreshing my knowledge of MATLAB (since this dataset was in that format, rather than in Python as the previous one had been)…
    Lessons learned: I should have been more proactive with collaborators about either pushing harder to get data quickly or just switching topics sooner. And for those stats students who are about to start a new applied project like this one, make sure your collaborators already have the full dataset in hand. (Of course, in general if you’re able to get in early and help to plan the data collection for optimal statistical efficiency, so much the better. But if you’re just a student whose goal is to practice data analysis, you’d better be sure the data has been compiled before you start.)

Life:

  • Before coming to CMU, I always knew it as a strong technical school but didn’t realize how great the drama department was. We finally made it to a stage performance—actually Britten’s The Beggar’s Opera. I was wearing a sleep monitor watch that week, and the readout later claimed I was asleep during the show… It just noticed my low movement and the dim lighting, but I promise I was awake! :P Really, a great performance and I look forward to seeing more theater here.
  • For a while I’ve been disappointed that Deschutes Brewery beers from Oregon hadn’t made it out to Pennsylvania yet. But no longer! I can finally buy my favorite Obsidian Stout down the street!
  • Though I haven’t been posting much this fall, there’s been plenty of good stuff by first-year CMU student Lee Richardson. I especially like his recent post’s comments about institutional knowledge—it’s far more important than we usually give it credit for.
  • Nathan Yau is many steps ahead of me again, with great posts like how to improve government data websites, as well as one on a major life event. My own household size is also expected to increase from N to N+1 shortly, and everyone tells us “Your life is about to change!”—so I thank Nathan for a data-driven view of how exactly that change may look.

Dataclysm, Christian Rudder

In between project deadlines and homework assignments, I enjoyed taking a break to read Christian Rudder’s Dataclysm. (That’s right, my pleasure-reading break from statistics grad school textbooks is… a different book about statistics. I think I have a problem. Please suggest some good fiction!)

So, Rudder is one of the founders of dating site OkCupid and its quirky, data-driven research blog. His new book is very readable—each short, catchy chapter was hard to put down. I like how he gently alludes to the statistical details for nerds like myself, in a way that shouldn’t overwhelm lay readers. The clean, Tufte-minimalist graphs work quite well and are accompanied by clear writeups. Some of the insights are basically repeats of material already on the blog, but with a cleaner writeup, though there’s plenty of new stuff too. Whether or not you agree with all of his conclusions [edit: see Cathy O’Neil’s valid critiques of the stats analyses here], the book sets a good example to follow for anyone interested in data- or evidence-based popular science writing.

Most of all, I loved his description of statistical precision:

Ironically, with research like this, precision is often less appropriate than a generalization. That’s why I often round findings to the nearest 5 or 10 and the words ‘roughly’ and ‘approximately’ and ‘about’ appear frequently in these pages. When you see in some article that ‘89.6 percent’ of people do x, the real finding is that ‘many’ or ‘nearly all’ or ‘roughly 90 percent’ of them do it, it’s just that the writer probably thought the decimals sounded cooler and more authoritative. The next time a scientist runs the numbers, perhaps the outcome will be 85.2 percent. The next time, maybe it’s 93.4. Look out at the churning ocean and ask yourself exactly which whitecap is ‘sea level.’ It’s a pointless exercise at best. At worst, it’s a misleading one.

I might use that next time I teach.

The description of how academics hunt for data is also spot on: “Data sets move through the research community like yeti—I have a bunch of interesting stuff but I can’t say from where; I heard someone at Temple has tons of Amazon reviews; I think L has a scrape of Facebook.”

Sorry I didn’t take many notes this time, but Alberto Cairo’s post on the book links to a few more detailed reviews.

“Statistical Modeling: The Two Cultures,” Breiman

One highlight of my fall semester is going to be a statistics journal club led by CMU’s Ryan Tibshirani together with his dad Rob Tibshirani (here on sabbatical from Stanford). The journal club will focus on “Hot Ideas in Statistics”: some classic papers that aren’t covered in standard courses, and some newer papers on hot or developing areas. I’m hoping to find time to blog about several of the papers we discuss.

The first paper was Leo Breiman’s “Statistical Modeling: The Two Cultures” (2001) with discussion and rejoinder. This is a very readable, high-level paper about the culture of statistical education and practice, rather than about technical details. I strongly encourage you to read it yourself.

Breiman’s article is quite provocative, encouraging statisticians to downgrade the role of traditional mainstream statistics in favor of a more machine-learning approach. Breiman calls the two approaches “data modeling” and “algorithmic modeling”: Continue reading

After teaching 1st statistics course

I’ve just finished an exhausting but rewarding 6 weeks teaching a summer-session course on “Experimental Design for Behavioral and Social Sciences,” CMU course 36-309. My course materials are secreted away on Blackboard, but here is my syllabus. You can also see some materials from a previous session here, including Howard Seltman’s textbook (free online).

The students were expected to have already taken an introductory statistics course. After a short review of basic concepts and t-tests, we dove into more intermediate analyses (ANOVA and regression, contrasts, chi-square tests and logistic regression, repeated measures) and into how a good study should be designed (power, internal/external validity, etc.).

I’ve taught one-off statistics workshops before, and I’ve taught once-a-week semester-long Polish language classes, but this was my first experience teaching a full-length course in statistics. Detailed notes are below.

Continue reading

What the Best College Teachers Do, Ken Bain

Although CMU has no school of education, it does have strong support for those of us who’d like to become better educators, not just better researchers. There’s the Eberly Center, which bridges the research-about-education that happens on campus and the education-of-researchers for which most of us are here. And there’s the brand-new Simon Initiative—I’m not fully sure yet what it entails, but I enjoyed the inaugural lecture by Carl Wieman on improving science education.

Amidst all this, I’ve started teaching a summer course (36-309, Experimental Design). While preparing to teach, I’ve read Ken Bain’s What the Best College Teachers Do (recommended by CMU’s Sciences Teaching Club).

Much of the content is about convincing you to adopt the mindset of a good teacher: You should be interested in the students’ understanding, not just in getting them to regurgitate facts or plug & chug formulas. You should be patient with learners of different types and levels. Assessments for the sake of getting feedback should be frequent and separate from assessments for the sake of labeling the student with a final grade. You want the students to become able to learn independently, so train them to think constructively about their own learning.

Mostly, this is stuff I already agreed with. I really like Bain’s high-level ideas. But I wish there had been more concrete illustrations of how these ideas work in practice. Practical examples could have replaced a lot of the fluffy language about opening the students’ minds and hearts, etc.

Still, there are a couple of lists of explicit questions to use when planning your course. No list can cover everything you need to consider, but it doesn’t hurt to use one to ensure that at least you haven’t overlooked what’s on it.

Bain also has some lists of “types of learners” or “developmental stages of learning.” It’s often unhelpful to pigeonhole individual students into one bucket or another… but it can be useful to treat these archetypes as if they were user personas, and consider how your lesson plan will work for these users.

Some of these lists, and other excessive notes-to-self, below the break.

Continue reading

How to Listen to and Understand Great Music, Robert Greenberg

These are just notes to myself on an audio course I got from the library. Nothing about statistics or R here :)

I’ve spent the past few months listening to Robert Greenberg’s How to Listen to and Understand Great Music, 3rd Edition as I walk to and from school. I’ve played classical music for years (in school bands and orchestras as well as at home), so I’d picked up a fair bit about its history, but I hoped this survey course would fill in some gaps.

Below are some notes-to-self, though my appetite for note-taking got weaker and eventually petered out halfway through the course.
Continue reading

Winter is coming (to the Broad Street pump)

We live in an amazing future, where an offhand Twitter joke about classic data visualizations and Game of Thrones immediately turns into a real t-shirt you can buy.

You know nothing (about cholera), John Snow

Hats off to Alberto Cairo (whose book The Functional Art and blog are the best introductions to data visualization that I can recommend—but you already knew that).

If you don’t already know the story of John Snow and the Broad Street pump—or if you think you do but haven’t heard the full details—then The Ghost Map is a great telling.

Update: Alberto continues to kick this up a notch, adding two more Game Of Thrones-themed classic dataviz jokes, and making the images/captions available under the Creative Commons license. Awesome.

Winter is coming (for Napoleon)

After 2nd semester of Statistics PhD program

Here’s another post on life as a statistics PhD student (in the Department of Statistics, at Carnegie Mellon University, in Pittsburgh, PA).
The previous such post was After 1st semester of Statistics PhD program.

Classes:

  • I feared that Advanced Probability Overview would be just dry esoteric theory, but Jing Lei ensured all the topics were really well-motivated. Although it was tough, I did better than I’d hoped (especially given that I’ve never taken a proper Real Analysis course). In Statistical Machine Learning, Larry Wasserman and Ryan Tibshirani did a great job of balancing “old” core theory with new cutting-edge research topics, including helpful homework assignments that gave us practice both in theory and in applications.
  • My highlight of the semester was being able to read and digest a research paper that was way too abstract when I tried reading it a few years ago. It really hit me that I must be learning something in grad school :)
    (The paper was Building Consistent Regression Trees from Complex Sample Data, by Toth and Eltinge. While working at Census, I wanted to try running a complex-survey-weighted regression tree, but I couldn’t get much out of this paper. Now, after a good dose of probability theory and machine learning, it’s far clearer. In fact, I have some ideas about extending this work!)
  • The Statistical Machine Learning class referenced a ton of crazy math terms I wasn’t familiar with: Banach and Hilbert spaces, Lp norms, conjugate functions, etc. It terrified me at first—I’ve never even heard of this stuff, should I have taken grad-level functional analysis before I started this PhD, am I about to fail?!?—but it turns out a lot of it is just names for specific versions of general concepts that I already knew. Whew. Also, most of it got used repeatedly from topic to topic, so we did gain familiarity even without explicitly taking a functional analysis course etc. So, don’t get disheartened too easily by unfamiliar terminology!
  • It was great to finally learn more about Lp norms and about splines. Also, almost everything in SML can be written as a penalized regression :P
  • Smoothing splines and Reproducing Kernel Hilbert Space (RKHS) regression are nifty because the setup is that you want to optimize over all possible functions. So you start out with an infinite-dimensional space, for which in general there might be no simple way to search/optimize! … But in these specific setups, we can prove that the optimal solution happens to lie in a finite-dimensional subspace, where your usual optimization/search tools will work after all. Nice. (There’s a small smoothing-spline illustration after this list.)
  • Larry had a nice “foundations” day in SML, with examples where Bayes and Frequentist analysis differ greatly. However, I didn’t find most of his examples too convincing, since the Bayesian “loses” only due to a stupid choice of priors; or the Bayesian “loses” for finite n but in a case where n in practice would have to be ridiculously large. Still, this helped stretch my thinking about how these inference philosophies differ.
  • Larry points out: you often hear that “We might as well go Bayes because if you give people a Frequentist interval, they’ll interpret it as a Bayes interval.” But the reverse is also true: give someone a sequence of 95% Bayes intervals, and they’ll expect 95% of them to contain the true value. That is NOT necessarily going to happen with Bayes CIs (unlike Frequentist CIs). (A quick simulation after this list illustrates the point.)
  • In addition to Subjective, Objective, Empirical, or Calibrated Bayes, let me propose “Cynical Bayes”: Don’t choose a prior because you believe it. Instead, choose one to optimize your estimator’s Frequentist properties. That way you can keep your expert Freq’ist colleagues happy, yet still call it a Bayes estimator, so you can give the usual Bayes interpretation to keep nonexperts happy :)
  • A background in Statistics will keep you thinking about distributions and probabilities and convergences. But a background in Applied Math may be better at giving you tools and ideas for feature engineering. It’s worth having both toolsets.
  • The Advanced Probability Overview course covered some measure-theoretic probability. I’m finally understanding the subtleties of how the different convergences \xrightarrow{p}, \xrightarrow{as}, \xrightarrow{D}, and \xrightarrow{L^p} all differ, and why it matters. We saw these concepts last semester in Intermediate Statistics, but the distinctions are far clearer to me now.
  • AdvProb’s measure theory section also really helped me understand why textbooks say a random variable is a “function”: intuitively it seems like just a variable or a number… but it really is a function from “the state of the world” (an element \omega of the set \Omega of all possible outcomes or states of the world) to the measurement you will collect (often a number on the real line). The measure-theoretic view of probability itself, as the size of a subset of \Omega, is helpful too. Even though statisticians’ goal is to develop tools that let them work with the range of the random variable and ignore the domain \Omega, it’s good to remember that this domain exists.
  • However, measure theory and probability theory suffer from some really poor terminology! For example, it took me far too long to realize that “integrable” means “the integral is finite”, NOT “the integral exists.”
  • When we teach students R, we really should use practical examples, not the arbitrary generic examples that you see so often. Instead of just showing me list(1,"a"), it helps to give a realistic example of why you may actually need to collect numeric and character elements together in a single object. (One attempt at such an example is sketched after this list.)
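Here’s the promised smoothing-spline illustration. This is just my own toy example using base R’s smooth.spline(): the point is that although the penalized criterion is posed over a whole function space, the minimizer is a natural cubic spline with knots at the observed x values, so a finite-dimensional solver can find it.

```r
# Toy smoothing spline fit: an infinite-dimensional-looking problem whose
# solution lives in a finite-dimensional spline basis. (My own sketch.)
set.seed(1)
x <- sort(runif(100, 0, 2 * pi))
y <- sin(x) + rnorm(100, sd = 0.3)
fit <- smooth.spline(x, y)          # penalized least squares over a spline basis
plot(x, y, col = "grey", main = "Smoothing spline fit")
lines(predict(fit, x), lwd = 2)     # the fitted curve
curve(sin, add = TRUE, lty = 2)     # the true function, for comparison
```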
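And here is the quick simulation about Bayes intervals and repeated-sampling coverage. This is my own toy example, not Larry’s: a conjugate Normal model where the prior happens to sit far from the fixed true mean, so the 95% credible intervals cover the truth much less than 95% of the time across repeated samples.

```r
# Repeated-sampling coverage of 95% Bayes credible intervals under a
# misspecified prior. (My own sketch; one observation per dataset.)
set.seed(1)
theta_true <- 3                      # the fixed true mean
prior_mean <- 0; prior_var <- 1      # N(0, 1) prior, far from the truth
covered <- replicate(10000, {
  y <- rnorm(1, mean = theta_true, sd = 1)      # one observation, known variance 1
  post_var  <- 1 / (1 / prior_var + 1)          # conjugate Normal update
  post_mean <- post_var * (prior_mean / prior_var + y)
  ci <- post_mean + c(-1, 1) * qnorm(0.975) * sqrt(post_var)
  ci[1] <= theta_true && theta_true <= ci[2]
})
mean(covered)   # roughly 0.4 here, nowhere near the nominal 0.95
```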
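Finally, one attempt at the kind of “realistic” R list example I was asking for above (my own suggestion, not anything from a course): the natural reason to mix numeric and character elements in one object is to bundle up the mixed-type output of a single analysis.

```r
# A list as a container for mixed-type results from one analysis:
# a character formula, a named numeric coefficient vector, and a logical flag.
fit_summary <- list(
  formula   = "mpg ~ wt + hp",                          # character
  coef      = coef(lm(mpg ~ wt + hp, data = mtcars)),   # named numeric vector
  n         = nrow(mtcars),                             # integer
  converged = TRUE                                      # logical
)
fit_summary$coef["wt"]   # pull out one piece by name
str(fit_summary)         # the mixed types living together in one object
```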

Research:

  • I started a new research project, the Advanced Data Analysis project, which will run until the end of this upcoming Fall semester (so about a year total). I am working with Rob Kass and Avniel Ghuman on using magnetoencephalography (MEG) data to study epilepsy.
  • At Rob’s research group meetings, I learn a ton from the helpful questions he asks. When presenting someone else’s work (i.e. for a journal club), ask yourself, “What would you do if *your* research was based on the data from this paper?” Still, I’ve found I really do need to keep scheduling weekly 1-on-1 meetings—the group meetings are not enough to stay optimally on track.
  • Neuroscience is hard! Pre-processing massive neuroscience datasets using not-fully-documented open source software is particularly hard. When I chose this project, I did not realize how much time I would have to spend on learning the subject matter, the relevant specialized software tools, and the data pre-processing workflow. Four months in, and I’ve still barely gotten to the point of doing any “real” statistics. It’s a good project and I’m learning a lot, but it’s disheartening to see how much of that learning has been tied to debugging open-source software installations that I’ll only ever use again if I stay in this sub-field.
    I would advise the next PhD cohort to choose projects that’ll primarily teach you more general-purpose, transferable skills. Maybe take an existing theoretical method that’s not implemented in software yet, and make it into an R package?

Life:

  • This was a tougher semester in many ways, with harder classes and more research-related setbacks. The Cake song Tougher than it is got a lot of play time on my headphones :P
  • I’m glad that despite my slow posting rate, the blog still kept getting regular traffic—particularly Is a Master’s degree in Statistics worthwhile? I guess it’s a burning question these days.
  • A big help to my sanity this semester came from joining the All University Orchestra. After a long week of tough classes and research setbacks, it’s great to switch brain modes and play my clarinet. I’ve really missed playing for the past few years in DC, and I’m glad to get back into it.
  • Pittsburgh highlights: Bayernhof museum, Pittsburgh Symphony Orchestra concerts (The Legend of Zelda, “Behind the Notes” talks), Jozsa Corner, Point Brugge Cafe, sampling all the Squirrel Hill pizzerias, MCMC Bar Crawl on the Southside Flats, riding the ridiculously steep inclines, Pittsburgh Area Theater Organ Society concerts and tours of their beautiful theater organ
  • Things still on our list to do in Pittsburgh: see a CMU theater performance, Pittsburgh aviary and zoo, Kennywood amusement park, Steelers game, Penguins game
  • I look forward to getting a chance to teach a whole course this summer. It’ll be 36-309, Experimental Design. I also took some Eberly Center seminars, and the department organized helpful planning meetings for those of us students who’ll teach in the summer, so I feel reasonably prepared.
    I plan to have my students design a series of experiments to bake the ultimate chocolate chip cookie. It will be delicious. I baked Meg Hourihan’s mean chocolate chip cookies for a department event earlier this spring, which seems like an appropriate start.
    However, ironically, as the local knitr / reproducible research fanboy… I’m supposed to teach the course using SPSS, which seems to be largely point-and-click, without much support for reproducible reports :(
  • It was a nice difference to be on the other side of the department’s open house for admitted students this year :) I’m also happy to be reading Grad Cafe forums from a much more relaxed point of view this year!
  • I’m surprised there’s not much crossover between the CMU and UPitt statistics departments. And the stats community outside each department doesn’t seem as vibrant as it was in DC. I attended the American Statistical Association’s Pittsburgh chapter banquet. Besides CMU and Pitt folks, most attendees seemed to be RAND employees or independent consultants. There are also some Meetup groups: the Pittsburgh Data Visualization Group and the Pittsburgh useR Group.
  • I’ve updated and expanded my CMU blogroll in the sidebar. Please let me know if I missed your CMU/Pittsburgh statistics-related blog!

Other people’s helpful posts on the PhD experience:

‘Census Marketing’ app by Olin College students

I love having the chance to promote nifty data visualizations; good work from my former employer, the Census Bureau; and student projects from my alma mater, Olin College. So it’s a particular pleasure to highlight all three at once:

Elizabeth Duncan and Marena Richardson, students in Olin’s Data Science course, teamed up with Census staff and BusinessUSA to develop an app that helps make Census data accessible to small business owners.

[Screenshot: the Census Marketing app]

The result, Census Marketing, is a nifty and simple interface to overlay Decennial Census and American Community Survey data on Google Maps.

Imagine you’re planning to start or expand a small business, and you know the demographic you’d like to target (age, income, etc.). Where in your town is there a high concentration of your target market? And are there already competing businesses nearby?

Load up Duncan and Richardson’s website, enter your location, select demographic categories from a few drop-down menus, and give your business type. The app will go find the relevant data (through the Census API) and display it for you as a block-level heatmap on Google Maps. It’ll also highlight the locations of existing businesses that might be competitors.

For example, say you want to open a pizzeria in my Pittsburgh neighborhood of Squirrel Hill. You might want to target the undergrad and grad student populations, since they tend to order pizza pretty often. Punch in the zip code 15217, choose all races and both sexes, select age groups 20-29 and 30-39, and specify that you’re looking for other competing pizzerias:

[Screenshot: Census Marketing map of the target demographic and competing pizzerias in Squirrel Hill, zip code 15217]

Well! The student-age population is clearly concentrated around Hobart and Murray… but so are the competing pizzerias. Good to know. Maybe you need to brainstorm a new business plan, seek out a different part of town, or try marketing to a different demographic.

Besides learning about data science and creating a website, Duncan and Richardson also interviewed several actual small business owners to refine the user experience. It’s a nice example of Olin’s design-centered approach to engineering education. I can imagine a couple of further improvements to this app… But it’s already a nice use case for the Census API, and a good example of the work Olin students can do in a short time.

PS—the course instructor, Allen Downey, has a free book ThinkStats on introductory statistics from a computer scientist’s point of view. I hear that a revised second edition is on its way.

Other students’ views on CMU’s Statistics department

(1) There are a couple of nice posts on Quora answering “What is it like to be a graduate student in Statistics at CMU?”
(If you don’t want to sign in to Quora, you might be able to read the replies through these direct links: Jack, Alex, Sangwon.)

(2) When I was applying to schools, a fellow PhD student here shared his thoughts about CMU’s Statistics department. He kindly allowed me to share his comments here as a guest post, though he warns it may be a year or two out of date.

In probably all graduate programs, but at least at CMU, graduate study consists of a coursework component and a research component. (You can see the curriculum here, and while they keep tweaking it, this looks like it’s more or less up to date.) As you can see, the balance starts out tilted heavily toward coursework and gradually starts to shift toward research, so that by your fourth semester you are mostly doing research. This makes sense – it would be tough to do much research-wise without at least some foundational methodological and theoretical training.

A key component of the easing-in process is the well-designed but not particularly well-named Advanced Data Analysis (“ADA”) course, which is a yearlong project spanning your second and third semesters. In this, you choose a professor to work with (they all give presentations about their work first semester to give you a sense of whom to choose), and this professor arranges a relationship with an outside investigator — a “real scientist”, not a statistician, usually in some other department at CMU or Pitt — who has data for you to analyze. Then the three (or more) of you work on the problem of analyzing that data for a year, meeting relatively frequently to discuss progress and whatever issues may arise. You also produce reports and presentations on the project as milestones.

So I’m now at the beginning of my second year, in the midst of my ADA project as well as the Advanced Stat Theory class. To give you a sense of an ADA project, I am working with two professors from Stats and one from CMU Astrophysics on a data set of galaxy images, trying to develop predictive models for galaxy redshift purely from those images. Other ADA projects right now include applications to educational testing, the genetic basis of autism, and medical studies of dementia.

So with that said, while I’m not in the full-blown research part of the PhD, I’ve still had the opportunity to work closely with professors and it has been very fruitful. They tend to be accessible and willing to meet as often as I want to, which tends to be once a week or every other week. My experience with research is that we’ll meet and talk about stuff, then I’ll go home and try whatever new stuff is suggested, and when I have something to show or have hit a wall, we meet again to talk about it. I’ve also started going to the meetings of the Astrostatistics group, which is a collaborative research effort between CMU Stats, CMU Astrophysics, and Pitt Astronomy, and hearing about all the research that’s being done in that setting.

I think the way CMU structures the research experience speaks to how much emphasis it places on acclimating you to that environment, which is really quite different from the classroom. Regarding the coursework component, most of the classes I’ve had here have been well-taught, and the professors hold office hours and generally welcome student inquiry. I think the professors, for the most part, do an admirable job of juggling their research and teaching without short-changing one piece or the other. I’ve definitely learned a ton from classes, which is important because my background in statistics was rather weak coming in. (I had a solid foundation in Math and CS, but not a ton of exposure to Stats.)

Regarding the distinguishing qualities of the program, there are a few. Among the spectrum of theoretical vs. applied programs, it tends to skew applied — there are a few people doing theory but many more working on applications to various fields. (This could be a good thing or a bad thing depending on your taste.) But if you ask people here, they might say the distinction between theoretical and applied work is kind of silly, since advances in theory can yield new methodology and novel applications can motivate development of theory. But anyhow, given that professors do a lot of applied work, there are fertile collaborations here with quite a few disciplines — astrostatistics as I mentioned, neuroscience, CS/machine learning, genetics, even some people working on finance/economics problems. So it’s not limiting at all in terms of what you can work on.

Another good thing about the program is that it’s pretty current and (you might say) somewhat pragmatic. For instance, they just revamped our Advanced Stat Theory core course to be taught with a huge focus on nonparametric inference instead of the canonical/classical inference theory, because it turns out that most people in real-world research settings are using nonparametric methods much more. In general, it’s great when a department recognizes that a field is evolving (rapidly!) and they are willing to adapt to cover what will be useful for students rather than what they became famous for writing books about in the ’70s.

That’s all I can think of now. Best of luck with the application process!