Category Archives: Education

After 5th semester of statistics PhD program

Better late than never—here are my hazy memories of last semester. It was one of the tougher ones: an intense teaching experience, attempts to ratchet up research, and parenting a baby that’s still too young to entertain itself but old enough to get into trouble.

Previous posts: the 1st, 2nd, 3rd, and 4th semesters of my Statistics PhD program.

Classes

I’m past all the required coursework, so I only audited Topics in High Dimensional Statistics, taught by Alessandro Rinaldo as a pair of half-semester courses (36-788 and 36-789). “High-dimensional” here loosely means problems where you have more variables (p) than observations (n). For instance, in genetic or neuroscience datasets, you might have thousands of measurements each from only tens of patients. The theory here is different from traditional statistics because you usually assume that p grows with n, so that getting more observations won’t reduce the problem to a traditional one.

This course focused on some of the theoretical tools (like concentration inequalities) and results (like minimax bounds) that are especially useful for studying properties of high-dimensional methods. Ale did a great job covering useful techniques and connecting the material from lecture to lecture.

In the final part of the course, students presented recent minimax-theory papers. It was useful to see my fellow students work through how these techniques are used in practice, as well as to get practice giving “chalk talks” without projected slides. I gave a talk too, preparing jointly with my classmate Lingxue Zhu (who is very knowledgeable, sharp, and always great to work with!) Ale’s feedback on my talk was that it was “very linear”—I hope that was a good thing? Easy to follow?

Also, as in every other stats class I’ve had here, we brought up the curse of dimensionality—meaning that, in high-dimensional data, very few points are likely to be near the joint mean. I saw a great practical example of this in a story about the US Air Force’s troubles designing fighter planes for the “average” pilot.
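
A small simulation can make this concrete. What follows is only a minimal sketch under assumptions that do not come from the class or the Air Force story: coordinates are independent standard normals (real body measurements are correlated), and “near the mean” is taken to mean within a hypothetical cutoff of half a standard deviation on every coordinate. The function names are made up for illustration.

```python
import math
import random


def prob_near_mean(p, c=0.5):
    """Exact P(all p independent N(0,1) coordinates lie within c SDs of 0)."""
    per_coordinate = math.erf(c / math.sqrt(2))  # P(|Z| < c) for one coordinate
    return per_coordinate ** p


def simulated_fraction(p, n=2000, c=0.5, seed=0):
    """Monte Carlo estimate of the same probability from n simulated points."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        # A point counts as "average" only if every coordinate is close to 0.
        if all(abs(rng.gauss(0.0, 1.0)) < c for _ in range(p)):
            hits += 1
    return hits / n


if __name__ == "__main__":
    for p in (1, 2, 5, 10):
        print(p, prob_near_mean(p), simulated_fraction(p))
```

Under these assumptions the share of “average” points shrinks geometrically with the number of dimensions, so by ten dimensions almost no simulated point is close to the mean on every coordinate at once.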

Teaching

I taught a data visualization course! Check out my course materials here. There’ll be a separate post reflecting on the whole experience. But the summer before, it was fun (and helpful) to binge-read all those dataviz books I’ve always meant to read.

I’ve been able to repurpose my lecture materials for a few short talks too. I was invited to present a one-lecture intro to data viz for Seth Wiener’s linguistics students here at CMU, as well as for a seminar on Data Dashboard Design run by Matthew Ritter at my alma mater (Olin College). I also gave an intro to the Grammar of Graphics (the broader concept behind ggplot2) for our Pittsburgh useR Group.

Research

I’m officially working with Jing Lei, still looking at sparse PCA but also some other possible thesis topics. Jing is a great instructor, researcher, and collaborator working on many fascinating problems. (I also appreciate that he, too, has a young child and is understanding about the challenges of parenting.)

But I’m afraid I made very slow research progress this fall. A lot of my time went towards teaching the dataviz course, and plenty went to parenthood (see below), both of which will be reduced in the spring semester. I also wish I had some grad-student collaborators. I’m not part of a larger research group right now, so meetings are just between my advisor and me. Meetings with Jing are very productive, but in between it’d also be nice to hash out tough ideas together with a fellow student, without taking up an advisor’s time or stumbling around on my own.

Though it’s not quite the same, I started attending the Statistical Machine Learning Reading Group regularly. Following these talks is another good way to stretch my math muscles and keep up with recent literature.

Life

As a nice break from statistics, we got to see our friends Bryan Wright and Yuko Eguchi both defend their PhD dissertations in musicology. A defense in the humanities seems to be much more of a conversation involving the whole committee, vs. the lecture given by Statistics folks defending PhDs.

Besides home and school, I’ve been a well-intentioned but ineffective volunteer, trying to manage a few pro bono statistical projects. It turns out that virtual collaboration, managing a far-flung team of people who’ve never met face-to-face, is a serious challenge. I’ve tried reading up on advice but haven’t found any great tips—so please leave a comment if you know any good resources.

So far, I’ve learned that choosing the right volunteer team is important. Apparent enthusiasm (I’m eager to have a new project! or even eager for this particular project!) doesn’t seem to predict commitment or followup as well as apparent professionalism (whether or not I’m eager, I will stay organized and get s**t done).

Meanwhile, the baby is no longer in the “potted-plant stage” (when you can put him down and expect he’ll still be there a second later), but he was not yet in day care while my wife was returning to part-time work. We finally got off the wait-lists and into day care after the semester ended, but until then it was much harder to juggle home and school commitments.

However, he’s an amazing little guy, and it’s fun finally taking him to outings and playdates at the park and zoo and museums (where he stares at the floor instead of exhibits… except for the model railroad, which he really loved!) We also finally made it out to Kennywood, a gorgeous local amusement park, for their holiday light show.

Here’s to more exploration of Pittsburgh as the little guy keeps growing!

Lunch with ASA president Jessica Utts

The president of the American Statistical Association, Jessica Utts, is speaking tonight at the Pittsburgh ASA Chapter meeting. She stopped by CMU first and had lunch with us grad students here.


First of all, I recommend reading Utts’ Comment on statistical computing, published 30 years ago. She mentioned a science-fiction story idea about a distant future (3 decades later, i.e. today!) in which statisticians are forgotten because everyone blindly trusts the black-box algorithm into which we feed our data. Of course, at some point in the story, it fails dramatically and a retired statistician has to save the day.
Utts gave good advice on avoiding that dystopian future, although some folks are having fun trying to implement it today—see for example The Automatic Statistician.
In some ways, I think that this worry (of being replaced by a computer) should be bigger in Machine Learning than in Statistics. Or, perhaps, ML has turned this threat into a goal. ML has a bigger culture of Kaggle-like contests: someone else provides data, splits it into training & test sets, asks a specific question (prediction or classification), and chooses a specific evaluation metric (percent correctly classified, MSE, etc.) David Donoho’s “50 years of Data Science” paper calls this the Common Task Framework (CTF). Optimizing predictions within this framework is exactly the thing that an Automatic Statistician could, indeed, automate. But the most interesting parts are the setup and interpretation of a CTF—understanding context, refining questions, designing data-collection processes, selecting evaluation metrics, interpreting results… All those fall outside the narrow task that Kaggle/CTF contestants are given. To me, such setup and interpretation are closer to the real heart of statistics and of using data to learn about the world. It’s usually nonsensical to even imagine automating them.
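
To make the narrowness of that task concrete, here is a minimal sketch of the contest side of a Common Task Framework. It assumes a toy dataset of (feature, label) pairs, a fixed seeded split, and accuracy as the one agreed metric; the function names and the 25% test fraction are hypothetical choices for illustration, not anything specified by Donoho or Kaggle.

```python
import random


def split_train_test(rows, test_fraction=0.25, seed=0):
    """Organizer's job: shuffle once with a fixed seed, then hold out a test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


def accuracy(predict, test_rows):
    """The single agreed metric: share of held-out labels predicted correctly."""
    correct = sum(1 for x, y in test_rows if predict(x) == y)
    return correct / len(test_rows)


if __name__ == "__main__":
    # Toy data (hypothetical): the label is True when the feature is at least 0.5.
    rows = [(i / 100, i >= 50) for i in range(100)]
    train, test = split_train_test(rows)
    # Two "contestants": a threshold rule and a constant guess.
    print(accuracy(lambda x: x >= 0.5, test))
    print(accuracy(lambda x: True, test))
```

Everything a contestant can influence here is the `predict` function. Which data to collect, what the label should mean, and whether accuracy is even the right metric were all fixed before the code ran.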

Besides statistical computing, Utts has worked on revamping statistics education more broadly. You should read her rejoinder to George Cobb’s article on rethinking the undergrad stats curriculum.

Utts is also the Chief Reader for grading the AP Statistics exams. AP Stats may need to change too, just as the undergraduate stats curriculum is changing… but it’s a much slower process, partly because high school AP Stats teachers aren’t actually trained in statistics the way that college and university professors are. There are also issues with computer access: even as colleges keep moving towards computer-intensive methods, in practice it remains difficult for AP Stats to assess fairly anything that can’t be done on a calculator.

Next, Utts told us that the recent ASA statement on p-values was prompted by the psychology journal Basic and Applied Social Psychology (BASP), which banned them. I think it’s interesting that the statement is only on p-values, even though BASP actually banned all statistical inference. Apparently it was difficult enough to get consensus on what to say about p-values alone, without agreeing on what to say about alternatives (e.g. publishing intervals, Bayesian inference, etc.) and other related statistical concepts (especially power).

Finally, we had a nice discussion about the benefits of joining the ASA: networking, organizational involvement (it’s good professional experience and looks good on your CV), attending conferences, joining chapters and sections, getting the journals… I learned that the ASA website also has lesson plans and teaching ideas, which seems quite useful. National membership is only $18 a year for students, and most local chapters or subject-matter sections are cheap or free.

The ASA has also started a website Stats.org for helping journalists understand, interpret, and report on statistical issues or analyses. If you know a journalist, tell them about this resource. If you’re a statistician willing to write some materials for the site, or to chat with journalists who have questions, go sign up.

Tapestry 2016 materials: LOs and Rubrics for teaching Statistical Graphics and Visualization

Here are the poster and handout I’ll be presenting tomorrow at the 2016 Tapestry Conference.

Poster "Statistical Graphics and Visualization: Course Learning Objectives and Rubrics"

My poster covers the Learning Objectives that I used to design my dataviz course last fall, along with the grading approach and rubric categories that I used for assessment. The Learning Objectives were a bit unusual for a Statistics department course, emphasizing some topics we teach too rarely (like graphic design). The “specs grading” approach seemed to be a success, both for student motivation and for the quality of their final projects.

The handout is a two-sided single page summary of my detailed rubrics for each assignment. By keeping the rubrics broad (and software-agnostic), it should be straightforward to (1) reuse the same basic assignments in future years with different prompts and (2) port these rubrics to dataviz courses in other departments.

I had no luck finding rubrics for these learning objectives when I was designing the course, so I had to write them myself. I’m sharing them here in the hopes that other instructors will be able to reuse them, and improve on them!

Any feedback is highly appreciated.


PolicyViz episode on teaching data visualization

When I was still in DC, I knew Jon Schwabish’s work designing information and data graphics for the Congressional Budget Office. Now I’ve run across his podcast and blog, PolicyViz. There’s a lot of good material there.

I particularly liked a recent podcast episode that was a panel discussion about teaching dataviz. Schwabish and four other experienced instructors talked about course design, assignments and assessment, how to teach implementation tools, etc.

I recommend listening to the whole thing. Below are just notes-to-self on the episode, for my own future reference.

Participant observation in statistics classes (Steve Fienberg interview)

CMU professor Steve Fienberg has a nice recent interview at Statistics Views.

He brings up great nuggets of stats history, including insights into the history and challenges of Big Data. I also want to read his recommended books, especially Fisher’s Design of Experiments and Raiffa & Schlaifer’s Applied Statistical Decision Theory. But my favorite part was about involving intro stats students in data collection:

One of the things I’ve been able to do is teach a freshman seminar every once in a while. In 1990, I did it as a class in a very ad hoc way and then again in 2000, and again in 2010, I taught small freshman seminars on the census. Those were the census years, so I would bring real data into the classroom which we would discuss. One of the nice things about working on those seminars is that, because I personally knew many of the Census Directors, I was able to bring many of them to class as my guests. It was great fun and it really changes how students think about what they do. In 1990, we signed all students up as census enumerators and they did a shelter and homeless night and had to come back and describe their experiences and share them. That doesn’t sound like it should belong in a stat class but I can take you around here at JSM and introduce you to people who were in those classes and they’ve become statisticians!

What a great teaching idea 🙂 It reminds me of discussions in an anthropology class I took, where we learned about participant observation and communities of practice. Instead of just standing in a lecture hall talking about statistics, we’d do well to expose students to real-life statistical work “in the field”—not just analysis, but data collection too. I still feel strongly that data collection/generation is the heart of statistics (while data analysis is just icing on the cake), and Steve’s seminar is a great way to hammer that home.

Teaching data visualization: approaches and syllabi

While I’m still working on my reflection on the dataviz course I just taught, there were some useful dataviz-teaching talks at the recent IEEE VIS conference.

Jen Christiansen and Robert Kosara have great summaries of the panel on “Vis, The Next Generation: Teaching Across the Researcher-Practitioner Gap.”

Even better, slides are available for some of the talks: Marti Hearst, Tamara Munzner, and Eytan Adar. Lots of inspiration for the next time I teach.

Finally, here are links to the syllabi or websites of various past dataviz courses. Browsing these helps me think about what to cover and how to teach it.

Not quite data visualization, but related:

Comment below or tweet @civilstat with any others I’ve missed, and I’ll add them to the list.
(Update: Thanks to John Stasko for links to many I missed, including his own excellent course site & resource page.)

Statistical Graphics and Visualization course materials

I’ve just finished teaching the Fall 2015 session of 36-721, Statistical Graphics and Visualization. Again, it is a half-semester course designed primarily for students in the MSP program (Masters of Statistical Practice) in the CMU statistics department. I’m pleased that we also had a large number of students from other departments taking this as an elective.

For software we used mostly R (base graphics, ggplot2, and Shiny). But we also spent some time on Tableau, Inkscape, D3, and GGobi.

We covered a LOT of ground. At each point I tried to hammer home the importance of legible, comprehensible graphics that respect human visual perception.

Remaking pie charts is a rite of passage for statistical graphics students

My course materials are below. Not all the slides are designed to stand alone, but I have no time to remake them right now. I’ll post some reflections separately.

Download all materials as a ZIP file (38 MB), or browse individual files:

About to teach Statistical Graphics and Visualization course at CMU

I’m pretty excited for tomorrow: I’ll begin teaching the Fall 2015 offering of 36-721, Statistical Graphics and Visualization. This is a half-semester course designed primarily for students in our MSP program (Masters of Statistical Practice).

A large part of the focus will be on useful principles and frameworks: human visual perception, the Grammar of Graphics, graphic design and interaction design, and more current dataviz research. As for tools, besides base R and ggplot2, I’ll introduce a bit of Tableau, D3.js, and Inkscape/Illustrator. For assessments, I’m trying a variant of “specs grading”, with a heavy use of rubrics, hoping to make my expectations clear and my TA’s grading easier.

Di Cook, LDA and CART classification boundaries on Flea Beetles dataset

Classifier diagnostics from Cook & Swayne’s book

My initial course materials are up on my department webpage.
Here are the

  • syllabus (pdf),
  • first lecture (pdf created with Rmd), and
  • first homework (pdf) with dataset (csv).

(I’ll probably just use Blackboard during the semester, but I may post the final materials here again.)

It’s been a pleasant challenge to plan a course that can satisfy statisticians (slice and dice data quickly to support detailed analyses! examine residuals and other model diagnostics! work with data formats from rectangular CSVs through shapefiles to social networks!) … while also passing on lessons from the data journalism and design communities (take design and the user experience seriously! use layout, typography, and interaction sensibly!). I’m also trying to put into practice all the advice from teaching seminars I’ve taken at CMU’s Eberly Center.

Also, in preparation, this summer I finally enjoyed reading more of the classic visualization books on my list.

  • Cleveland’s The Elements of Graphing Data and Robbins’ Creating More Effective Graphs are chock full of advice on making clear graphics that harness human visual perception correctly.
  • Ware’s Information Visualization adds to this the latest research findings and a ton of useful detail.
  • Cleveland’s Visualizing Data and Cook & Swayne’s Interactive and Dynamic Graphics for Data Analysis are a treasure trove of practical data analysis advice. Cleveland’s many case studies show how graphics are a critical part of exploratory data analysis (EDA) and model-checking. In several cases, his analysis demonstrates that previously-published findings used an inappropriate model and reached poor conclusions due to what he calls rote data analysis (RDA). Cook & Swayne do similar work with more modern statistical methods, including the first time I’ve seen graphical diagnostics for many machine learning tools. There’s also a great section on visualizing missing data. The subtitle (“With R and GGobi”) is misleading: you don’t need R and GGobi to learn a lot from their book.
  • Monmonier’s How to Lie with Maps refers to dated technology, but the concepts are great. It’s still useful to know just how maps are made, and how different projections work and why it matters. Much of cartographic work sounds analogous to statistical work: making simplifications in order to convey a point more clearly, worrying about data quality and provenance (different areas on the map might have been updated by different folks at different times), setting national standards that are imperfect but necessary… The section on “data maps” is critical for any statistician working with spatial data, and the chapter on bureaucratic mapping agencies will sound familiar to my Census Bureau colleagues.

I hope to post longer notes on each book sometime later.

One more difference between statistics and [machine learning, data science, etc.]

Statisticians have always done a myriad of different things related to data collection and analysis. Many of us are surprised (even frustrated) that Data Science is even a thing. “That’s just statistics under a new name!” we cry. Others are trying to bring Data Science, Machine Learning, Data Mining, etc. into our fold, hoping that Statistics will be the “big tent” for everyone learning from data.

But I do think there is one core thing that differentiates Statisticians from these others. Having an interest in this is why you might choose to major in statistics rather than applied math, machine learning, etc. And it’s the reason you might hire a trained statistician rather than someone else fluent with data:

Statisticians use the idea of variability due to sampling to design good data collection processes, to quantify uncertainty, and to understand the statistical properties of our methods.

When applied statisticians design an experiment or a survey, they account for the inherent randomness and try to control it. They plan your study in such a way that’ll make your estimates/predictions as accurate as possible for the sample size you can afford. And when they analyze the data, alongside each estimate they report its precision, so you can decide whether you have enough evidence or whether you still need further study. For more complex models, they also worry about overfitting: can this model generalize well to the population, or is it too complicated to estimate with this sample, and hence just fitting noise?

When theoretical statisticians invent a new estimator, they study how well it’ll perform over repeated sampling, under various assumptions. They study its statistical properties first and foremost. Loosely speaking: How variable will the estimates tend to be? Will they be biased (i.e. tend to always overestimate or always underestimate)? How robust will they be to outliers? Is the estimator consistent (as the sample size grows, does the estimate tend to approach the true value)?
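
The “repeated sampling” idea can be tried directly in a few lines. The sketch below is an illustration under assumptions not taken from the post: data are drawn from a standard normal (true variance 1), and the estimator under study is the plug-in variance that divides by n. The function names, replication count, and seed are hypothetical.

```python
import random
import statistics


def plug_in_variance(sample):
    """Variance estimate that divides by n (known to be biased downward)."""
    m = sum(sample) / len(sample)
    return sum((x - m) ** 2 for x in sample) / len(sample)


def repeated_sampling(n, reps=4000, seed=1):
    """Draw `reps` samples of size n from N(0,1) and summarize the estimates.

    Returns (average estimate, standard deviation of the estimates).
    """
    rng = random.Random(seed)
    estimates = [
        plug_in_variance([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(reps)
    ]
    return statistics.fmean(estimates), statistics.stdev(estimates)


if __name__ == "__main__":
    for n in (5, 20, 100):
        avg, spread = repeated_sampling(n)
        print(n, avg, spread)
```

In this setup the average estimate sits below the true value of 1 for small n (bias, theoretically (n-1)/n), and both the bias and the spread of the estimates shrink as n grows, which is the behavior a consistent estimator should show.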

These are not the only important things in working with data, and they’re not the only things statisticians are trained to do. But (as far as I can tell) they are a much deeper part of the curriculum in statistics training than in any other field. Statistics is their home. Without them, you can often still be a good data analyst but a poor statistician.

After 4th semester of statistics PhD program

This was my first PhD semester without any required courses (more or less). That means I had time to focus on research, right?

It was also my first semester as a dad. Exhilarating, joyful, and exhausting 🙂 So, time was freed up by having less coursework, but it was reallocated largely towards diapering and sleep. Still, I did start on a new research project, about which I’m pretty excited.

Our department was also recognized as one of the nation’s fastest-growing statistics departments. I got to see some of the challenges with this first-hand as a TA for a huge 200-student class.

See also my previous posts on the 1st, the 2nd, and the 3rd semester of my Statistics PhD program.

Classes:

  • Statistical Computing:
    This was a revamped, semi-required, half-semester course, and we were the guinea pigs. I found it quite useful. The revamp was spearheaded by our department chair Chris Genovese, who wanted to pass on his software engineering knowledge/mindset to the rest of us statisticians. This course was not just “how to use R” (though we did cover some advanced topics from Hadley Wickham’s new books Advanced R and R Packages; and it got me to try writing homework assignment analyses as R package vignettes).
    Rather, it was a mix of pragmatic coding practices (using version control such as Git; writing and running unit tests; etc.) and good-to-know algorithms (hashing; sorting and searching; dynamic programming; etc.). It’s the kind of stuff you’d pick up on the job as a programmer, or in class as a CS student, but not necessarily as a statistician even if you write code often.
    The homework scheme was nice in that we could choose from a large set of assignments. We had to do two per week, but could do them in any order—so you could do several on a hard topic you really wanted to learn, or pick an easy one if you were having a rough week. The only problem is that I never had to practice certain topics if I wanted to avoid them. I’d like to try doing this as an instructor sometime, but I’d want to control my students’ coverage a bit more tightly.
    This fall, Stat Computing becomes an actually-required, full-semester course and will be cotaught by my classmate Alex Reinhart.
  • Convex Optimization:
    Another great course with Ryan Tibshirani. Tons of work, with fairly long homeworks, but I also learned a huge amount of very practical stuff, both theory (how to prove a certain problem is convex? how to prove a certain optimization method works well?) and practice (which methods are likely to work on which problems?).
    My favorite assignments were the ones in which we replicated analyses from recent papers. A great way to practice your coding, improve your optimization, and catch up with the literature all at once. One of these homeworks actually inspired in me a new methodological idea, which I’ve pursued as a research project.
    Ryan’s teaching was great as usual. He’d start each class with a review from last time and how it connects to today. There were also daily online quizzes, posted after class and due at midnight, that asked simple comprehension questions—not difficult and not a huge chunk of your grade, but enough to encourage you to keep up with the class regularly instead of leaving your studying to the last minute.
  • TAing for Intro to Stat Inference:
    This was the 200-student class. I’m really glad statistics is popular enough to draw such crowds, but it’s the first time the department has had so many folks in the course, and we are still working out how to manage it. We had an army of undergrad- and Masters-level graders for the weekly homeworks, but just three of us PhD-level TAs to grade midterms and exams, which made for several loooong weekends.
    I also regret that I often wasn’t at my best during my office hours this semester. I’ll blame it largely on baby-induced sleep deprivation, but I could have spent more time preparing too. I hope the students who came to my sessions still found them helpful.
  • Next semester, I’ll be teaching the grad-level data visualization course! It will be heavily inspired by Alberto Cairo’s book and his MOOC. I’m still trying to find the right balance between the theory I think is important (how does the Grammar of Graphics work, and why does it underpin ggplot2, Tableau, D3, etc.? how does human visual perception work? what makes for a well-designed graphic?) vs. the tool-using practice that would certainly help many students too (teach me D3 and Shiny so I can make something impressive for portfolios and job interviews!)
    I was glad to hear Scott Murray’s reflections on his recent online dataviz course co-taught with Alberto.

Research:

  • Sparse PCA: I’ve been working with Jing Lei on several aspects of sparse PCA, extending some methodology that he’s developed with collaborators including his wife Kehui Chen (also a statistics professor, just down the street at UPitt). It’s a great opportunity to practice what I’ve learned in Convex Optimization and earlier courses. I admired Jing’s teaching when I took his courses last year, and I’m enjoying research work with him: I have plenty of independence, but he is also happy to provide direction and advice when needed.
    We have some nice simulation results illustrating that our method can work in an ideal setting, so now it’s time to start looking at proofs of why it should work 🙂 as well as a real dataset to showcase its use. More on this soon, I hope.
    Unfortunately, one research direction that I thought could become a thesis topic turned out to be a dead end as soon as we formulated the problem more precisely. Too bad, though at least it’s better to find out now than after spending months on it.
  • I still need to finish writing up a few projects from last fall: my ADA report and a Small Area Estimation paper with Rebecca Steorts (now moving from CMU to Duke). I really wish I had pushed myself to finish them before the baby came—now they’ve been on the backburner for months. I hope to wrap them up this summer. Apologies to my collaborators!

Life:

  • Being a sDADistician: Finally, my penchant for terrible puns becomes socially acceptable, maybe even expected—they’re “dad jokes,” after all.
    Grad school seems to be a good time to start a family. (If you don’t believe me, I heard it as well from Rob Tibshirani last semester.) I have a pretty flexible schedule, so I can easily make time to see the baby and help out, working from home or going back and forth, instead of staying all day on campus or at the office until late o’clock after he’s gone to bed. Still, it helps to make a concrete schedule with my wife, about who’s watching the baby when. Before he arrived, I had imagined we could just pop him in the crib to sleep or entertain himself when we needed to work—ah, foolish optimism…
    It certainly doesn’t work for us both to work from home and be half-working, half-watching him. Neither the work nor the child care is particularly good that way. But when we set a schedule, it’s great for organization & motivation—I only have a chunk of X hours now, so let me get this task DONE, not fritter the day away.
    I’ve spent less time this semester attending talks and department events (special apologies to all the students whose defenses I missed!), but I’ve also forced myself to get much better about ignoring distractions like computer games and Facebook, and I spend more of my free time on things that really do make me feel better such as exercise and reading.
  • Stoicism: This semester I decided to really finish the Seneca book I’d started years ago. It is part of a set of philosophy books I received as a gift from my grandparents. Long story short, once I got in the zone I was hooked, and I’ve really enjoyed Seneca’s Letters to Lucilius as well as Practical Philosophy, a Great Courses lecture series on his contemporaries.
    It turns out several of my fellow students (including Lee Richardson) have been reading the Stoics lately too. The name “Stoic” comes from “Stoa,” i.e. porch, after the place where they used to gather… so clearly we need to meet for beers at The Porch by campus to discuss this stuff.
  • Podcasts: This semester I also discovered the joy of listening to good podcasts.
    (1) Planet Money is the perfect length for my walk to/from campus, covers quirky stories loosely related to economics and finance, and includes a great episode with a shoutout to CMU’s Computer Science school.
    (2) Talking Machines is a more academic podcast about Machine Learning. The hosts cover interesting recent ideas and hit a good balance—the material is presented deeply enough to interest me, but not so deeply I can’t follow it while out on a walk. The episodes usually explain a novel paper and link to it online, then answer a listener question, and end with an interview with a ML researcher or practitioner. They cover not only technical details, but other important perspectives as well: how do you write a ML textbook and get it published? how do you organize a conference to encourage women in ML? how do you run a successful research lab? Most of all, I love that they respect statisticians too 🙂 and in fact, when they interview the creator of The Automatic Statistician, they probe him on whether this isn’t just going to make the data-fishing problem worse.
    (3) PolicyViz is a new podcast on data visualization, with somewhat of a focus on data and analyses for the public: government statistics, data journalism, etc. It’s run by Jon Schwabish, whom I (think I) got to meet when I still worked in DC, and whose visualization workshop materials are a great resource.
  • It’s a chore to update R with all the zillion packages I have installed. I found that Tal Galili’s installr manages updates cleanly and helpfully.
  • Next time I bake brownies, I’ll add some spices and call them “Chai squares.” But we must ask, of course: what size to cut them for optimal goodness of fit in the mouth?