With loss of generality

Public service announcement: Dear math and statistics students, “WLOG” means you’re about to prove something “without loss of generality.”

So please don’t copy your friend’s homework and write it as “with log” or “using log”. It’s just too easy for your grader to catch you.

♪ The more you know! ♫

Reproducible research, training wheels, and knitr

Last week I gave a short talk at CMU’s statistical computing seminar, Stat Bytes. I summarized why reproducible research (RR) and literate programming are worthwhile, not just for serious research but also for homework reports or statistical blog posts. I demonstrated how to get started with a range of RR document formats in R: from the “training wheels” R Notebook in RStudio, through the more flexible but still simple R Markdown format, to R Sweave for LaTeX articles and Beamer slides.

If you’ve wanted to get on the RR bandwagon, but found Sweave too overwhelming, these other tools are a great way to start—and useful in their own right, not just for training.

My materials are here:

  • Overview and links (html output, Rmd source)
  • R Notebook example (html output, R source)
  • R Markdown example (html output, Rmd source)
  • R Sweave / Beamer example (pdf output, Rnw source)

Extra details below.

Reproducible research story time

First, story time! I was once asked to step in and take over the statistical analysis for an article, after the primary statistician became unavailable. It sounded like a pretty straightforward analysis of survey data, with clear scientific questions, and they told me they had the previous statistician’s R code, so I thought it sounded reasonable. Hah…

Continue reading

After 1st semester of Statistics PhD program

Have you ever wondered whether the first semester of a PhD is really all that busy? My complete lack of posts last fall should prove it :)

Some thoughts on the Fall term, now that Spring is well under way [edit: added a few more points]:

  • RMarkdown and knitr are amazing. When I next teach a course using R, my students will be turning in homeworks using these tools: The output immediately shows whether the code runs and what its results are. This is much better than students copying and pasting possibly-broken code and unconnected output into a text file or (gasp) Word document.
  • I’m glad my cohort socializes outside the office, taking each other out for birthday lunches or going to see a Pirates game. Some of the older PhD students are so focused on their thesis work that they don’t take time for a social break, and I’d like to avoid getting stuck in that rut.
    However! Our lunches always lead us back to the age-old question: How many statisticians does it take to split a bill? Answer: too long. I threw together a Shiny app, DinneR, to help us answer this question :)


  • The first-year PhD courses in Statistics and in Machine Learning have rather different approaches.
    • Statistics professor: Just assume we can compute this estimator. In class we’ll prove that the estimates are reasonably good (e.g. we’ll bound the probability that an estimate is far from the true value).
    • Machine Learning professor: Just trust me that this algorithm gets useful estimates. In class we’ll prove that we can compute it in a reasonable amount of time (e.g. we’ll bound the number of steps until the algorithm converges).
    • Somewhere between these ideas, I ran into the sensible concept of optimizing only until your solution is within statistical error. For example, say you only have enough data to publish an estimate with a confidence interval of ±0.1 units. If your optimization algorithm is computer-intensive, then running it until it converges to ±0.00001 units is just a waste of time. For instance, see Bottou & Bousquet’s “The Tradeoffs of Large Scale Learning.”
  • My ML professor, midway through a classification-focused semester, finally discussing regression for 10 minutes: “…And that’s all you need to know about regression.”
    My Regression professor, at end of semester, finally discussing classification for 20 minutes: “…And that’s all you need to know about classification.” :)
  • In any class that covers proofs or other long detailed arguments, handouts+chalkboards are seriously better than slideshows. With a chalkboard, you can show the whole proof at once—so if students get lost halfway through, they can still see the claim we’re proving and all the steps we’ve made so far. But when you cram a proof onto slides, either you oversimplify to get it onto one slide; or you split it across slides, so that we lose the continuity (and may even forget what we’re trying to prove).
  • Good homeworks and quick feedback are critical. One of my classes had weekly homeworks, each directly tied to the material we just covered, each problem expanding on a good question or illustrating an interesting principle from class. Homeworks were graded within a week, every single time.
    In another class, we had just a few homeworks, very loosely tied to the lecture contents and usually at a very different level (way too easy or too hard relative to what the lecture covered). Although this class had the same number of students and TAs as the other one, we never got our homeworks back in less than 2 weeks—and one of them took a full 2 months to return!
  • TAing is a mixed bag. I enjoy holding office hours and being there during lab sessions to help students understand something they were missing. I do not so much enjoy grading homeworks and labs by those students who don’t ask questions, don’t come to office hours, and clearly don’t read the comments I leave on their assignments since I see them make the same mistakes over and over. I especially don’t like finding instances of cheating. Urgh.
  • I was a bit worried about coming back to grad school as an “older” student (the youngest guy in our 1st-year PhD cohort is almost a decade younger than me!). But it’s been great, actually:
    • My schedule seems much saner than some of my classmates’. Quite a few seem to stay in the office until late most nights, then may sleep through a morning class. For me, after years of waking at 6:30 to spend an hour on the crowded metro to work… it’s been luxurious to sleep in until 7:30 or 8, walk to school in half an hour in the fresh air, have a focused workday of reasonable length, and come home for dinner with my wife, actually relaxing in the evening instead of studying until 3am. Yes, there’s the occasional late night, but occasional is the key word there.
    • The income’s lower than my old job, of course, but Pittsburgh is much cheaper than DC, especially for housing. Besides: my previous school loans are all paid off, I have a fair chunk of retirement savings already earning interest, and my wife and I are used to budgeting. (YNAB is an excellent tool for this—I will blog about it at some point. If you’re interested, here’s a slight discount referral code, or you can wait for the big sale they seem to have every 3-4 months.)
      [My point is: despite the drop in income, we're still more financially secure (thanks to savings and paid-off loans) than if I'd gone straight into the PhD from my MSc.]
    • As Cosma Shalizi points out: “Note to graduate students: It is important that you internalize that you are, in fact, a badass…” With age and experience, I’m far more able to speak confidently when it’s called for (e.g. giving a talk), and far less intimidated about tackling new topics, talking to professors, writing papers, speaking at conferences, etc.
  • On the other hand, despite longer experience as a statistician than my classmates, I appreciate and admire that they are much better at many things. I’m really impressed by my various classmates’ command of topics like real analysis and measure theory, scientific computing, or practical knowledge about fields like physics or economics.
  • Pittsburgh is a great town. Affordable housing, decent bus system, beautiful scenic views from the inclines, friendly people, livable walkable neighborhoods, tons of good food, extensive and well-run library system… It has a lot of what I liked about Portland, without as much of the “Portlandia” over-the-top hipsters. There are also beautiful old buildings, like the Carnegie Natural History Museum (with its sweet dinosaur exhibit) and UPitt’s Cathedral of Learning. The weather right now is pretty snowy/icy, but I don’t mind—I’m honestly impressed by how well Pittsburgh just goes ahead and deals with winter weather, in comparison to DC’s city-wide shutdown every time a snowflake is sighted.
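
The “optimize only until you are within statistical error” idea from the list above can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the method from Bottou & Bousquet’s paper: it estimates a mean by gradient descent and stops once the gradient is smaller than a chosen fraction of the standard error. The names `standard_error` and `fit_mean_until_good_enough`, and the `step` and `fraction` parameters, are hypothetical.

```python
import math
import random


def standard_error(data):
    """Standard error of the sample mean (sample std. dev. / sqrt(n))."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    return math.sqrt(variance / n)


def fit_mean_until_good_enough(data, step=0.1, fraction=0.1, max_iter=10_000):
    """Estimate the mean by gradient descent on half the mean squared error.

    Toy problem: the answer has a closed form, which makes the sketch easy
    to check. Iteration stops once the gradient magnitude is at most
    `fraction` times the standard error (an assumed stopping rule).
    Returns (estimate, number_of_updates).
    """
    n = len(data)
    tolerance = fraction * standard_error(data)
    estimate = 0.0
    for iteration in range(max_iter):
        # Gradient of (1 / 2n) * sum((x - estimate)^2) w.r.t. estimate.
        # For this quadratic loss it equals the distance to the optimum.
        gradient = sum(estimate - x for x in data) / n
        if abs(gradient) <= tolerance:
            return estimate, iteration
        estimate -= step * gradient
    return estimate, max_iter


if __name__ == "__main__":
    random.seed(0)
    sample = [random.gauss(5.0, 1.0) for _ in range(100)]
    for frac in (0.1, 1e-6):
        est, iters = fit_mean_until_good_enough(sample, fraction=frac)
        print(f"fraction={frac}: estimate={est:.6f} after {iters} updates")
```

On a sample like the one in the demo, the loose stopping rule should need noticeably fewer updates than the tight one, while the two estimates differ by far less than one standard error.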

Edit: Here’s another good post on the first semester of a PhD program, from several mathematics students. I agree with most of the responses, especially the ones that conflict with each other :)

Turing-complete inversion tables, presented reasonable on your part!

I’ve not been keeping up with blogging this semester, but I had to share this beautiful spam comment my filter let through this morning:

Appreciation for the excellent writeup. This in reality was previously your fun profile it. Glimpse complex to help way presented reasonable on your part! On the other hand, the way could possibly we be in contact?

I can’t tell if it’s written by a non-native English speaker or by a Markov chain—does that mean it passes the Turing test? Either way, there’s something lovely about its broken grammar.

The author’s name was given as “buy inversion tables.” For a moment I thought this might be a real comment, by someone offering to compute large matrix inversions cheaply and quickly. But no, apparently inversion tables are these things where you strap yourself in, flip over, and hang upside down for as long as you can. Kind of like the first semester of a PhD program :)

PS—somehow the comment reminds me of when Cosma Shalizi’s students used Markov-chain generated text to fake a blog post for him, in a previous iteration of the Statistical Computing class (which I’m TA’ing this term).
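
For anyone curious what Markov-chain generated text looks like under the hood, here is a minimal sketch. It is my own illustration, not the code Cosma Shalizi’s students used: a first-order, word-level chain in which each next word is drawn from the words that followed the current word in some source text. The function names `build_chain` and `generate` and the tiny corpus are hypothetical.

```python
import random
from collections import defaultdict


def build_chain(text):
    """Map each word to the list of words that follow it in `text`."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def generate(chain, start, length, rng):
    """Random-walk the chain from `start`, emitting at most `length` words."""
    word = start
    output = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:  # dead end: this word never had a successor
            break
        word = rng.choice(followers)
        output.append(word)
    return " ".join(output)


if __name__ == "__main__":
    corpus = "the cat sat on the mat and the dog sat on the rug"
    chain = build_chain(corpus)
    print(generate(chain, "the", 8, random.Random(1)))
```

Every adjacent pair of words in the output appears somewhere in the source text, which is why the result is locally plausible but globally nonsensical, much like the spam comment.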

Data-Driven Journalism MOOC

TL;DR: The Knight Center’s free online journalism courses are great for anyone who works with data, storytelling, or both. See what’s being offered here.
My favorite links from a recent course on Data-Driven Journalism are here.
And a fellow student’s suggested reading list is here.

Last fall, a coworker and I led a study group for the Knight Center’s MOOC (massive open online course) on “Introduction to Infographics and Data Visualization”, taught by Alberto Cairo. The course and Alberto’s book were excellent, and we were actually able to bring Alberto in to the Census Bureau for a great lecture a few months later. This course is now in its 3rd offering (starting today!) and I cannot recommend it highly enough if you have any interest in data, journalism, visualization, design, storytelling, etc.!

So, this summer I was happy to see the Knight Center offering another MOOC, this time on “Data-Driven Journalism: The Basics”. What with moving cities and starting the semester, I hadn’t kept up with the class, but I’ve finally finished the last few readings & videos. Overall I found a ton of great material.

The course’s five lecturers gave an overview of data-driven journalism: from its historical roots in the 1800s and its relation to computer-assisted reporting, to how to get data in the first place, through cleaning and checking the data, and finally to building news apps and journalistic data visualizations.

In week 3 there was a particularly useful exercise of going through a spreadsheet of hunting accidents. Of course it illustrated some of the difficulties in cleaning data, and it gave concrete practice in filtering and sorting. But it was also a great illustration of how a database can lead you to potential trends or stories that you might have missed if you’d only gone out to interview a few individual hunters.

I loved some of the language that came up, such as “backgrounding the data” — analogous to checking out your sources to see how much you can trust them — or “interrogating the data,” including coming prepared to the “data interview” to ask thorough, thoughtful questions. I’d love to see a Statistics 101 course taught from this perspective. Statisticians do these things all the time, but our terminology and approach seem alien and confusing the first few times you see them. “Thinking like a journalist” and “thinking like a statistician” are not all that different, and the former might be a much more approachable path to the latter.

For those who missed the course, consider skimming the Data Journalism Handbook (free online); Stanford’s Data Journalism lectures (hour-long video); the course readings I saved on Pinboard; and my notes below.
Edit: See also fellow student Daniel Drew Turner’s suggested reading list.
Then, keep an eye out for next time it’s offered on the Knight Center MOOC page.

Below is a (very messy) braindump of notes I took during the class, in case there are any useful nuggets in there. (Messiness due to my own limited time to clean the notes up, not to any disorganization in the course itself!) I think the course videos were not for sharing outside the class, but I’ve linked to the other readings and videos.

Continue reading

Is a Master’s degree in Statistics worthwhile?

A student who is considering a Master’s degree in Statistics asks, “I’m interested in finding a job in data analysis and have been looking around, but I’m not sure if a masters is necessary to break into the field.”

Without much info about her background or job goals, here’s what I replied. Readers, do you have any additional or contradictory advice?

Continue reading

History of CMU’s Department of Statistics

As I’m about to begin my studies at CMU’s Department of Statistics, I have been curious about the department’s history. There is a nice writeup in Strength in Numbers: The Rising of Academic Statistics Departments in the U.S. Luckily, the “Carnegie Mellon University Statistics Department” chapter happens to be the free sample chapter on the publisher’s website.

Some fun facts from the chapter:

  • I hadn’t known that Frederick Mosteller went here (back when it was Carnegie Tech). I enjoy his Fifty Challenging Problems in Probability, and I’ve also been meaning to read The Pleasures of Statistics: The Autobiography of Frederick Mosteller. One of his early students at Harvard (whose stats department Mosteller founded before CMU had one) was Steve Fienberg, still at CMU.
  • Although the department was formed in 1966, it didn’t have a permanent college to call home until it joined the humanities college in 1980.
  • StatLib, the department’s online collection of downloadable datasets, started in 1989 and is still in use today.
  • CMU’s stats department was one of the first anywhere to focus on Bayesian stats, applications, and statistical computing. All of these are areas of interest for me, so it’s good to know I’m in the right place!
  • Early on, they also agreed to evaluate applied research on whether it benefits the applied area, not necessarily statistics as a field. I saw this still in effect at a thesis defense this week: the focus was on a very practical contribution to improving a neurological-data processing pipeline, even if the statistical theory was not highly novel. I’m glad to know that applied thesis topics are appreciated here.
  • The department also chose not to run a drop-in consulting center like many others do. Instead, they form long-term joint research collaborations with other departments’ scholars.
  • Journal editorship is also valued at the department. Hopefully I can pick the many experienced editors’ brains in tailoring my publication submissions to the right journals.
  • Finally, there’s a strong focus not only on research but also on teaching, and today CMU has the largest group of undergrad stats majors in the US.

I’m looking forward to working with great colleagues in such an excellent environment!

Census-related links from June

A few links to share since I’ve been away:

Article on “An American Tradition: The U.S. Census Bureau Continues to Innovate in Data Visualization” (free access after watching an ad): A colleague summarizes some of the many mapping and datavis resources provided by the Census Bureau.

Two interactive web apps by data users (these use Census Bureau data but the datavises are someone else’s products):

  • Web app “Point Context”: a data user calls the Census Bureau’s API to find the distributions of age, race, income, and education for residents of the “average” neighborhood containing an arbitrary set of latitude-longitude coordinates.
  • Interactive map of “Is the United States spending less on public education?”: A Census data user practices with D3 and tries out the lessons from datavis classes: show comparisons, offer color-blind-safe color palettes, use “catchy” headlines and informative annotations to help guide readers, etc. I particularly like the arrow indicating where the selected state falls on the colorbar.

Several tools for making maps from Excel or spreadsheet-like tools:

  • Esri, the makers of ArcGIS software, have created a Microsoft Office add-on that lets you create maps of your data in Excel. A live demo looked promising, especially if your organization is already an Esri client… but otherwise ArcGIS is not cheap!
  • If you do have ArcGIS Online access, you can try using Esri’s “Story Maps” templates. Their published examples include this simple one based on Census Bureau data.
  • JMP, a SAS product, also has mapping tools that should be fairly simple for people used to spreadsheets. But again, SAS’s products tend to be expensive.


Apologies for the lack of posts recently. I’m very excited about upcoming changes that are keeping me busy:

Let me suggest a few other blogs to follow while this one is momentarily on the back burner.

By my Census Bureau colleagues:

By members of the Carnegie Mellon statistics department:

Visual Revelations, Howard Wainer

I’m starting to recognize several clusters of data visualization books. These include:

(Of course this list calls out for a flowchart or something to visualize it!)

Howard Wainer’s Visual Revelations falls in this last category. And it’s no surprise Wainer’s book emulates Tufte’s, given how often the author refers back to Tufte’s work (including comments like “As Edward Tufte told me once…”). And The Visual Display of Quantitative Information is still probably the best introduction to the genre. But Visual Revelations is different enough to be a worthwhile read too if you enjoy such books, as I do.

Most of all, I appreciated that Wainer presents many bad graph examples found “in the wild” and follows them with improvements of his own. Not all are successful, but even so I find this approach very helpful for learning to critique and improve my own graphics. (Tufte’s classic book critiques plenty, but spends less time on before-and-after redesigns. On the other hand, Kosslyn’s book is full of redesigns, but his “before” graphs are largely made up by him to illustrate a specific point, rather than real graphics created by someone else.)

Of course, Wainer covers the classics like John Snow’s cholera map and Minard’s plot of Napoleon’s march on Russia (well-trodden by now, but perhaps less so in 1997?). But I was pleased to find some fascinating new-to-me graphics. In particular, the Mann Gulch Fire section (pp. 65-68) gave me shivers: it’s not a flashy graphic, but it tells a terrifying story and tells it well.
[Edit: I should point out that Snow's and Minard's plots are so well-known today largely thanks to Wainer's own efforts. I also meant to mention that Wainer is the man who helped bring into print an English translation of Jacques Bertin's seminal Semiology of Graphics and a replica volume of William Playfair's Commercial and Political Atlas and Statistical Breviary. He has done amazing work at unearthing and popularizing many lost gems of historical data visualization!
See also Alberto Cairo's review of a more recent Wainer book.]

Finally, Wainer’s tone overall is also much lighter and more humorous than Tufte’s. His first section gives detailed advice on how to make a bad graph, for example. I enjoyed Wainer’s jokes, though some might prefer more gravitas.

Continue reading