Hanukkah of Data 2022

The fall semester is over. Time to kick back and relax with… data analysis puzzles? Yes, of course!

The creators of the VisiData software have put together a “Hanukkah of Data,” 8 short puzzles released one day at a time. Four have been released already, but there’s still time for you to join in. From their announcement:

If you like the concept of Advent of Code, but wish there was set of data puzzles for data nerds, well, this year you’re in luck!

We’ve been hard at work the past couple of months creating Hanukkah of Data, a holiday puzzle hunt, with 8 days of bite-sized data puzzles. Starting December 18th, we’ll be releasing one puzzle a day, over the 8 days of Hanukkah.

This is your chance to explore a fictional dataset with SQL or VisiData or Datasette or your favorite data analysis tool, to help Aunt Sarah find the family holiday tapestry before her father notices it’s missing!

Register here to receive notifications when puzzles become available.

I can’t remember where I heard about this, but I’m very glad I did. I wasn’t familiar with VisiData before this, but I look forward to giving it a try too. For now, I’m just using R and enjoying myself tremendously. The puzzles are just the right length for my end-of-semester brain, the story is sweet, and the ASCII artwork is gorgeous. Many thanks to Saul Pwanson and colleagues for putting this together.

Are there other efforts like this in the Statistics and/or R communities? Hanukkah of Data is the kind of thing I would love to assign my students to help them practice their data science skills in R. Here are closest other things I’ve seen, though none are quite the same:

Hiring a tenure-track statistician at Colby College

We’re hiring for a tenure-track faculty member in Statistics! Are you interested in teaching at a beautiful small liberal arts college in Maine? Are you looking for academic positions that value a balance of teaching & research — and provide resources to support you in both regards? Not to mention a competitive salary, good benefits, and all four seasons in a small New England town? Please do apply, and reach out to me with any questions, or share the ad with anyone you know who might be a good fit:

https://www.colby.edu/statistics/faculty-searches/

https://www.mathjobs.org/jobs/list/21000

We will start reviewing applications on October 24 and continue until the position is filled.

(And if you’re not just a solo statistician, but you are working on a two-body problem with a computationally-focused partner, then let me also note that both our Davis AI Institute and our CS department are hiring too this year.)

Some new developments since last time we had a faculty search in Statistics:

We have our own Department of Statistics — still quite rare among liberal arts colleges
We are working with Colby’s Davis Institute of Artificial Intelligence — the first such AI Institute at a liberal arts college;
In addition to our Data Science minor, we are close to approving a Data Science major in collaboration with Colby’s departments of Mathematics and of Computer Science

In terms of research, there are generous startup funds (more than I’ve been able to use so far) and plenty of other support for research materials, conference travel, etc.

The teaching load is 9 courses every 2 years. That comes out to 2 courses most semesters, and 3 every fourth semester. While we provide regular offerings of Intro Stats, Statistical Modeling, and other core courses, in a typical year each of us also gets to teach a favorite elective or two. For example, I have gotten to work on some great partnerships by planning Survey Sampling or Data Visualization courses with our Civic Engagement office. My students have shown care, respect, and insight as they help our local homeless shelter study what resources improve housing outcomes; or help our town fire department to survey citizens and local businesses to inform its five-year plan.

And frankly, it’s just plain fun to work across disciplines. I’ve help a Government major figure out how to collect & analyze a random sample of news articles for a project on public transport in Central America. I’ve helped a Biology professor figure out how to bootstrap an imbalanced experiment on amoebas, and I’ve learned nifty nuggets of data visualization history from an English professor.

Long story short: I really do enjoy teaching statistics in the liberal arts college environment. If you think you would too, come join us!

`surveyCV`: K-fold cross validation for complex sample survey designs

I’m fortunate to be able to report the publication of a paper and associated R package co-authored with two of my undergraduate students (now alums), Cole Guerin and Thomas McMahon: “K-Fold Cross-Validation for Complex Sample Surveys” (2022), Stat, doi:10.1002/sta4.454 and the surveyCV R package (CRAN, GitHub).

The paper’s abstract:

Although K-fold cross-validation (CV) is widely used for model evaluation and selection, there has been limited understanding of how to perform CV for non-iid data, including from sampling designs with unequal selection probabilities. We introduce CV methodology that is appropriate for design-based inference from complex survey sampling designs. For such data, we claim that we will tend to make better inferences when we choose the folds and compute the test errors in ways that account for the survey design features such as stratification and clustering. Our mathematical arguments are supported with simulations and our methods are illustrated on real survey data.

Long story short, traditional K-fold CV assumes that your rows of data are exchangeable, such as iid draws or simple random samples (SRS). But in survey sampling, we often use non-exchangeable sampling designs such as stratified sampling and/or cluster sampling.¹

Illustration of simple random sampling, stratified sampling, and cluster sampling

Our paper explains why in such situations it can be important to carry out CV that mimics the sampling design.² First, if you create CV folds that follow the same sampling process, then you’ll be more honest with yourself about how much precision there is in the data. Next, if on these folds you train fitted models and calculate test errors in ways that account for the sampling design (including sampling weights³), then you’ll generalize from the sample to the population more appropriately.

If you’d like to try this yourself, please consider using our R package surveyCV. For linear or logistic regression models, our function cv.svy() will carry out the whole K-fold Survey CV process:

generate folds that respect the sampling design,
train models that account for the sampling design, and
calculate test error estimates and their SE estimates that also account for the sampling design.

For more general models, our function folds.svy() will partition your dataset into K folds that respect any stratification and clustering in the sampling design. Then you can use these folds in your own custom CV loop. In our package README and the intro vignette, we illustrate how to use such folds to choose a tuning parameter for a design-consistent random forest from the rpms R package.

Finally, if you are already working with the survey R package and have created a svydesign object or a svyglm object, we have convenient wrapper functions folds.svydesign(), cv.svydesign(), and cv.svyglm() which can extract the relevant sampling design info out of these objects for you.

It was very rewarding to work with Cole and Thomas on this project. They did a lot of the heavy lifting on setting up the initial package, developing the functions, and carrying out simulations to check whether our proposed methods work the way we expect. My hat is off to them for making the paper and R package possible.

Some next steps in this work:

Find additional example datasets and give more detailed guidance around when there’s likely to be a substantial difference between usual CV and Survey CV.
Build in support for automated CV on other GLMs from the survey package beyond the linear and logistic models. Also, write more examples of how to use our R package with existing ML modeling packages that work with survey data, like those mentioned in Section 5 of Dagdoug, Goga, and Haziza (2021).
Try to integrate our R package better with existing general-purpose R packages for survey data like srvyr and for modeling like tidymodels, as suggested in this GitHub issue thread.
Work on better standard error estimates for the mean CV loss with Survey CV. For now we are taking the loss for each test case (e.g., the squared difference between prediction and true test-set value, in the case of linear regression) and using the survey package to get design-consistent estimates of the mean and SE of this across all the test cases together. This is a reasonable survey analogue to the standard practice for regular CV—but alas, that standard practice isn’t very good. Bengio and Grandvalet (2004) showed how hard it is to estimate SE well even for iid CV. Bates, Hastie, and Tibshirani (2021) have recently proposed another way to approach it for iid CV, but this has not been done for Survey CV yet.

Big Data Paradox and COVID-19 surveys

Welcome, new readers. I’m seeing an uptick in visits to my post on Xiao-Li Meng’s “Big Data Paradox,” probably due to the Nature paper that was just published: “Unrepresentative big surveys significantly overestimated US vaccine uptake” (Bradley et al., 2021).

Meng is one of the coauthors of this new Nature paper, which discusses the Big Data Paradox in context of concerns about two very large but statistically-biased US surveys related to the COVID-19 pandemic: the Delphi-Facebook survey and the Census Household Pulse survey. As someone who has worked with both the Delphi group at CMU and with the Census Bureau, I can’t help feeling a little defensive 🙂 but I do agree that both surveys show considerable statistical bias (at least nonresponse bias for the Census survey; and biases in the frame and sampling as well as nonresponse for the Delphi survey). More work is needed on how best to carry out & analyze such surveys. I don’t think I can put it any better myself than Frauke Kreuter’s brief “What surveys really say”, which describes the context for all of this and points to some of the research challenges needed in order to move ahead.

I hope my 2018 post is still a useful glimpse at the Big Data Paradox idea. That said, I also encourage you to read the Delphi team’s response to (an earlier draft of) Bradley et al.’s Nature paper. In their response, Reinhart and Tibshirani agree that the Delphi-Facebook survey does show sampling bias and that massive sample sizes don’t always drive mean squared errors to zero. But they also argue that Delphi’s survey is still appropriate for its intended uses: quickly detecting possible trends of rapid increase (say, in infections) over time, or finding possible hotspots across nearby geographies. If the bias is relatively stable over short spans of time or space, these estimated differences are still reliable. They also point out how Meng’s data defect correlation is not easily interpreted in the face of survey errors other than sampling bias (such as measurement error). Both Kreuter’s and Reinhart & Tibshirani’s overviews are well worth reading.

Colby is hiring a tenure-track statistician

Once again, we are hiring a tenure-track statistician here at Colby College, in the Department of Mathematics and Statistics! I should have posted this sooner; but if you are looking for an opportunity to teach Statistics and Data Science at a strong liberal arts college with great students and colleagues, please consider applying:

https://mathjobs.org/jobs/jobs/14358

We will give full consideration to applications received by this Monday, October 14, but review will continue until the position is filled. If you have concerns about making this deadline, or if you have any other questions, please do get in touch: jawieczo@colby.edu … I would also be happy to answer any questions about the position, Colby College, or our city of Waterville, Maine.

Some things that I am particularly excited about this year:

a recent multi-million-dollar gift that is funding Data Science courses, curricular development, speakers, and other activities — and strong encouragement to spend that gift on interdisciplinary projects with colleagues across campus (for example, I have co-organized a data visualization workshop with faculty in Biology and proposed a first-year course cluster to be co-taught with faculty in English);
our Office of Civic Engagement‘s support for my survey sampling course, in which we are working with the Waterville Fire Department to design and carry out a large-scale survey of city residents;
our department colloquium, full of engaging talks on mathematics and statistics, as well as our annual Runnals Dinner for Women in Mathematics — this year’s speaker is Dr Mahlet Tadesse.

“Concise Statistical Year-Book of Poland, 1939”

Eighty years ago this week, my grandmother and grandfather were each enthusiastic seven-year-olds, excited for September 1st — their first day of school! At the time, they lived hundreds of kilometers apart and had yet to meet. She had spent her childhood in France but was thrilled to be back in ancestral Poland, in the north-eastern city of Wilejka, where she would finally be able to study in Polish. He was a voracious reader in Poznań, the westernmost large city in Poland at the time. Still, both had laid out their best clothes and filled a satchel with notebooks and pens.

Of course, it was not to be. My grandfather’s mother woke him in the middle of the night and brought him quietly down to the cellar, in the dark, past windows blacked out with curtains and blankets, as German forces began shelling the city. In the morning his apartment still stood, but he saw the broken walls and ruined rooms of the building next door. Meanwhile, my grandmother’s long-awaited Polish school was cancelled as well, eventually replaced by a Russian school as Soviet forces occupied her city.

Somehow, they survived World War II and eventually met as teachers, committed to the critical importance of education in rebuilding their broken country. My grandfather went on to become a professor of history and a leading figure at the University of Zielona Góra, in the city where they finally settled (and where I was born). A few years ago, when he passed away, I found some of the old statistical yearbooks he must have used as research resources.

Worn cover of my grandfather's copy of the 1939 Concise Statistical Year-Book of Poland

The yearbook from 1939 is particularly touching. As a physical artifact, it has clearly been through a lot: worn from use, spine broken, pages torn, stamped and underlined and scribbled all over.

Title page of my grandfather's copy of the 1939 Concise Statistical Year-Book of Poland, with stamps and inked-out scribbles

But it’s the “Foreword to the 10th Edition,” written in April 1939, that really moves me with its premature optimism:

The current edition of the Year-Book closes the first ten years of its existence. Today I can emphatically assert the great utility of this publication … It remains only necessary to express a hope that the Concise Year-Book, completing currently the first decade of its existence and beginning in the near future its second decade… will continually and increasingly fulfill its mission as set out in 1930…

Once again, it was not to be. The statistical service could not continue its planned work, once the war began in September. The Polish government-in-exile in London did manage to publish a Concise Statistical Year-Book for 1939-1941, summarizing what was known about conditions in the German- and Soviet-occupied territories. But the regular annual compilation and publication of Polish statistical yearbooks did not resume until after the war, in 1947 — and even then it was interrupted again during 1951-1955 as the Soviets in charge did not want to risk revealing any state secrets.

First page of foreword to my grandfather's copy of the 1939 Concise Statistical Year-Book of Poland

Second page of foreword to my grandfather's copy of the 1939 Concise Statistical Year-Book of Poland

The Polish Wikipedia has a good article on these statistical yearbooks, but unfortunately it’s not yet translated into English. However, you can skim through a scanned PDF of the whole 1939 yearbook. For instance, the lovingly hand-drawn population density map reminds us that there were precursors to the (also beautiful) census dot maps based on 2010 US Census data.

Population density dot map from the 1939 Concise Statistical Year-Book of Poland

Now, on this 80th anniversary of the war, my own son is eager to start school, while I am preparing to bring the 1939 yearbook to my fall course on surveys and censuses. I am grateful that our life today is so much better than my grandparents’ was, even if it’s hard to be optimistic about the state of the world when you hear the news lately. All we can do is roll up our sleeves and get back to work, trying to leave the place better than we found it.

Another Pole, the poet Wisława Szymborska, said it well:

The End and the Beginning
After every war
someone has to clean up.
Things won’t
straighten themselves up, after all.

Someone has to push the rubble
to the side of the road,
so the corpse-filled wagons
can pass.

Someone has to get mired
in scum and ashes,
sofa springs,
splintered glass,
and bloody rags.

Someone has to drag in a girder
to prop up a wall,
Someone has to glaze a window,
rehang a door.

Photogenic it’s not,
and takes years.
All the cameras have left
for another war.

We’ll need the bridges back,
and new railway stations.
Sleeves will go ragged
from rolling them up.

Someone, broom in hand,
still recalls the way it was.
Someone else listens
and nods with unsevered head.
But already there are those nearby
starting to mill about
who will find it dull.

From out of the bushes
sometimes someone still unearths
rusted-out arguments
and carries them to the garbage pile.

Those who knew
what was going on here
must make way for
those who know little.
And less than little.
And finally as little as nothing.

In the grass that has overgrown
causes and effects,
someone must be stretched out
blade of grass in his mouth
gazing at the clouds.

After 10th semester of statistics PhD program

FINAL BOSS BATTLE! (Warning: severe nerdiness ahead.)

Previous posts: the 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, 8th, and 9th semesters of my Statistics PhD program.

Research

It’s a strange thing to finish a thesis. You came in to grad school hoping, maybe even expecting, to make a grand contribution—a deep insight into the nature of the world. But Borges puts it best:

…the tall, haughty volumes that made a golden dimness in the corner of his room were not (as his vanity had dreamed them) a mirror of the world, but just another thing added to the world’s contents.

—Jorge Luis Borges, “The Yellow Rose,” Andrew Hurley’s translation

For that matter, when do you really “finish”? The dissertation is pretty much written — then a few weeks later the slides are ready — then the defense itself takes place — then you make revisions — then there’s a graduation ceremony — and even then, there remain a few minor “suggested” revisions I haven’t done yet but should do before getting it printed & bound & archived. [That finally happened half a year later, over January break, long after the PhD was “over.”]

Not to mention, my status felt weirdly ambiguous. The faculty at Colby have already hired me and seem to treat me as basically a colleague, pending the minor formality of defending. And at CMU I’m also teaching a course and students are calling me “Professor.” But of course, everyone else in the department is used to treating me as a student, so I feel very much like one. Admittedly, during the few weeks when I was no longer a “PhD Candidate” but not yet officially “Dr Wieczorek,” I enjoyed the brief chance to introduce myself as a “Graduand” 🙂

But it’s all very anticlimactic! There’s no trumpet fanfare saying: “NOW you are done.” If anything, it felt like the first Zelda game’s end music (starting about 20 sec into the clip below):

As you’ve played the game through, you’ve died many many times along the way, and Link’s death lullaby has been playing so long in all your lowest moments that you don’t even really hear it anymore. Then here it returns, after you thought you were done. Not this again?!? Yet suddenly your ears catch a faint counterpoint that wasn’t there before. This new bass line changes the character of the music completely, getting stronger and stronger, until somehow it has become a jazzy, celebratory tune. There’s no one moment of enlightenment, but a weary transition as the world slowly starts looking better and better.

Teaching

While I didn’t finish as much of the thesis last fall as I expected, it was supposed to be my focus this spring (along with the new baby!), with only minimal grading responsibilities as a TA.

HOWEVER… a week before spring term started, a last-minute shift in staffing became necessary and I was asked to teach 36-315, Statistical Graphics and Visualization—a hundred-student undergraduate class. I really should have said No (the chair told me it’s okay to say No!), but still, the department seemed to be at a loss, and I didn’t want to let them down after all they had done for me throughout my time at CMU. So instead, I found myself focused on teaching the largest class I’ve had yet, while just scraping by on both research and parenting.

Teaching this class was a valuable experience in and of itself, and you know I LOVE teaching data visualization, but—argh!

This spring I also finished the last requirement for the Eberly Center’s Future Faculty Program. It was a bit funny to have to revise my Statement of Teaching Philosophy several times, given that I’d already written one that got me a job, but it does feel good to wrap this up formally. The program made a big impact on my teaching over the past few years, and I’m glad I can contribute to their completion rate.

Other projects

We continued our Teaching Statistics research, running many more think-aloud interviews, and presented our work at eCOTS 2018 (the Electronic Conference on Teaching Statistics).

I also traveled back to DC, to the Census Bureau, to give a talk on my paper with Tommy Wright and Martin Klein. The paper is about visualizations designed to help readers make many comparisons in a statistically-appropriate way. The talk was recorded, and now I see the video has been posted as part of a new “Census Academy” resource.

Life

The baby went through a rough patch where he wouldn’t fall sleep unless someone was holding him and moving around—sitting wasn’t good enough. Somehow I discovered that I could hold him while playing Crypt of the NecroDancer, which is basically a mashup of Dance Dance Revolution and Legend of Zelda: you use the arrow keys to move your dungeon-explorer in time with the beat of this amazing techo-ish soundtrack. I would play with one hand, standing and rocking back and forth to help myself stick to the beat, and the little guy would just fall asleep contentedly while I repeatedly ~~destroyed~~ got killed by skeletons and slime monsters.

Despite everyone’s sensible advice, we went house-hunting in Maine in mid-March—when all the houses were covered in 3 feet of snow—because spring break was the only time I could get away. Surprisingly, nobody else was house-hunting there that week 🙂 and as it turned out, we found a wonderful home at a very reasonable price. Our landlord in Pittsburgh has been wonderful, but we’re excited to become homeowners.

Comics

So, at this point you might be wondering: Is a PhD degree in Statistics worthwhile?

Someday I’ll follow up with a post on that, akin to my MS degree post. Meanwhile, Matt Might has the best post on the value of a PhD in general.

But for today, I will leave you with some comics that kept me ~~crying~~ smiling during the tough times of a PhD. Of course, PHD Comics (aka Piled Higher and Deeper) and XKCD both go without saying, but here are a few others that felt relevant.

When you’re starting down this precarious road:

When you forget why you’re even doing this:

When you have spent years with almost ~~no human company besides other academics~~ no human company:

When you are suddenly expected to interview for jobs with bona fide humans, after spending years around aforementioned academics:

When you are neeeeeearly done, and if you could just focus on your existing responsibilities you’d be able to wrap them up:

When you somehow manage to buckle down and knock out a ton of research/writing like a boss:

When you finally submit the damned dissertation:

When you are as ready as you’ll ever be for the thesis defense:

When you look back on your time as a PhD student with a note of defiant optimism:

After 9th semester of statistics PhD program

The final year! Our 2nd baby was on the way, and the job search was imminent. Oh, and by the way there’s this little thing called “finishing your dissertation”…

Previous posts: the 1st, 2nd, 3rd, 4th, 5th, 6th, 7th, and 8th semesters of my Statistics PhD program.

Research

This year my advisor was able to place me on half-TA’ing grant support. So I asked to have all my TA’ing pushed to the spring, with the intent of finishing the bulk of my thesis work this fall. I figured that in the spring I’ll be OK to grade 10 hrs/wk on my own time while helping with the new baby (due in Dec), as long as the thesis is mostly done.

However, job hunting took a HUGE amount of time. That’ll be its own post. (Also the spring TA duties turned into something larger, but that’s a story for the next post.)

In other words, thesis research progress was… minimal, to put it mildly.

Other projects

Well, OK, so there’s another reason my thesis work was slow: I jumped onto two new short-term projects that (1) were very much in my wheelhouse, and (2) seemed like great fodder for job applications and interviews. Both of them arose thanks to Ryan Tibshirani. Unsurprisingly, it turns out that it’s GREAT for a grad student to be on good terms with a well-known professor who gets too many requests for interesting projects and occasionally passes them on to students. In both cases, it was fantastic of Ryan to think of me, and although it’s been tough to be doing it just right now, this is legitimately the kind of research I want to do later on (and probably should have done for my thesis in the first place! argh…).

First, knowing that I’m interested in education, Ryan asked if I’d like to help with some consulting for Duolingo, the language-learning app company (also founded by CMU folks and still Pittsburgh-based). The folks there had some interesting questions about experimental design and suitable metrics for A/B testing their app. One of our contacts there was Burr Settles, who wrote a helpful book on Active Learning (the ML technique, not the pedagogy). We had some good discussions after digging into the details of their problem: challenges of running permutation tests on massive datasets, whether old-school tests like sign-rank tests would be better for their goals, how the data-over-time structure affects their analyses, etc. These chats even led me to an old Portland State classmate’s work regarding how to get confidence intervals from a permutation test, and I have some ideas for extensions on that particular problem.

Second, knowing that I had experience with survey data, Ryan passed on to me an email request to review some new methodology. Varun Kshirsagar, working on the Poverty Probability Index (PPI), had made some revisions to this poverty-tracking method and wanted to get feedback from someone who understood both modern statistics & machine learning tools as well as the kind of complex-survey-design datasets used to estimate the PPI models. It was a real pleasure working with Varun, and thinking about how to combine these two worlds (ML and survey data) had been a huge part of my motivation to go back to grad school in the first place and attend CMU in particular. We wrote up a short paper on the method for the NIPS 2017 workshop on ML for the Developing World, and we won the best paper award 🙂 I plan to revisit some of these ideas in future research: How do we run logistic-regression lasso or elastic net with survey-weighted data? How should we cross-validate when the survey design is not iid?

Teaching

Alex and I also continued running the Teach Stats mini. This semester we decided it was a bit embarrassing that Statistics, of all fields, doesn’t have a better standardized assessment of what students are learning in our intro courses. Without such a tool, it’s hard to do good pedagogy research and give strong evidence about whether your interventions / new methods have any impact.

There are already a couple of assessment instruments out there, but most were written by psychologists or engineers rather than by statisticians. Also, many of their questions are procedural, rather than about conceptual understanding. Even though these assessments have passed all the standard psychometric tests, there’s no guarantee that whatever they measure is actually the thing we’d *like* to measure.

So we started discussing what we’d like to measure instead, drafting questions, and planning out how we would validate these questions. Inspired by my own experiences in user-experience research at Olin and Ziba, and with the help of an article on writing good assessments for physics education by Adams and Wieman, we started planning think-aloud studies. The idea is to watch students work through our problems, talking out loud as they do so, so that we can see where they trip up. Do they get it wrong just because the question is poorly worded, even if they understand the concept? Do they get it right just because you can eliminate some choices immediately, even if they don’t understand the concept at all?

We ran a pilot study this fall, and at the end of the semester, I had my first chance to help draft an IRB application—hurrah for statisticians actually working on research design!

This summer and fall, I also kept up with an informal History of Stats reading group organized by fellow PhD student Lee Richardson. Again, these journal-clubs and reading-groups have been my favorite part of the PhD curriculum, and I wish more of our “real” classes had been run this way.

Life

?!? “Life”? Not much that I can recall this semester.

Next up

The 10th and final semester of my Statistics PhD program.

After 8th semester of statistics PhD program

I realize this is over 2 years late, but I found these drafts sitting around and perhaps they are still worth posting.

Apologies for the negativity! I have to admit this was one of the toughest semesters for me, psychologically. I thought about toning it down, especially since I’m delighted to be where I am now—tenure track faculty at a small liberal arts college—but I don’t want to pretend it’s been easy to get here.

I believe this was the semester I was going to CMU’s Counseling and Psychological Services (CaPS). If you’re a grad student, I recommend that you get to know such resources on your campus. Just about everyone faces the all-pervasive impostor syndrome, which compounds any other specific challenges you might have personally, and it’s helpful to be able to talk through it all with someone experienced.

Previous posts: the 1st, 2nd, 3rd, 4th, 5th, 6th, and 7th semesters of my Statistics PhD program.

Life

The more my son grows up, the more the PhD starts to feel like an overwhelming distraction from real life. One day I got home late and my wife told me what he’d been singing:

“Old MacDonald had a daddy, E-I-E-I-O, with a work-work here and a work-work there…” 🙁

If you have a family while in grad school, you will often feel that either the PhD doesn’t come first, or your family doesn’t come first. The academic folks around you will lean towards the PhD coming first, which of course partly makes sense—if you’re not making the fullest use out of your limited time in grad school, why bother going at all?—but it’s also hugely selection bias. Many (though not all) of the faculty who land and stay at a top research university are people who did decide that work comes first.

Thankfully, nobody has ever sneered to my face that “Ah well, not everyone’s cut out for academia” (though I’ve heard of it happening in other departments). But I feel it in my head all the time. (And I overhear younger students dismissing 30+ year olds like myself as too old to do good research… And I see the offices full every evening and weekend…) At another stage in life, my reaction might have been, “Oh yeah? I’ll show you, I’ll work my butt off and prove I’m good enough to cut it!” But now, my reaction is, “I’ve got better things to do than play this petty game.” Yes, I’ll plod along and finish the PhD I’ve started—perseverance counts for a lot—but right now I am not eager to stretch out this stage of life any longer than necessary.

Research

Reading Thinking, Fast and Slow, I am noting the constant mentions of the author’s collaboration with Amos Tversky. I think I’ve been way too focused on self-sufficiency here. It’s true that to get the thesis, I need to demonstrate I can do the work on my own… BUT in order to actually succeed afterwards (whether in academia or not), I’ll need to collaborate with others, not just myself and advisor. Plus, it’s simply more fun, spending those hours tackling a tough problem with an equally interested collaborator! So, my plan starting this summer and into next year: Add a few collaborative projects—finish FPS with Daren, finish DL with Nick and Jordan, revisit CIs/AIPE with Alex, start something (visual inference? CV with confidence?) with Justin… [Looking back, I regret that I did not follow up and make most of these planned collaborations happen while I was still a student!]
Jing still amazes me with his quick insight and intuition about how to tackle a proof. When I get stuck after spending hours on something, it takes him almost no time to see: if we back up a few steps to this other point, and tackle that instead, it’ll be much cleaner. This trait is taking me a long time to learn.
Daren argues that technical math prowess is definitely not sufficient (good ideas of *what* to study are more important), but also not necessary (even theory-wizards like Yenchi and Ale have to constantly look up things they’ve forgotten). I disagree a bit: I really think fluency in the big-picture math concepts is important—if I have memorized the definition of an eigenvalue, but not internalized its *meaning*, then I will never see big-picture ideas quickly enough, nor know how to start proving technical details, nor recall where to find those details in work I’ve seen before. It’s like when I play clarinet: I don’t need to memorize the piece I’m playing—sheet music is fine—but I *do* need to memorize the basics. I simply cannot play in real-time if I have to refer back to a fingering chart for every single note! In Joel Spolsky’s words (although see Lang’s actual article too):

Serge Lang, a math professor at Yale, used to give his Calculus students a fairly simple algebra problem on the first day of classes, one which almost everyone could solve, but some of them solved it as quickly as they could write while others took a while, and Professor Lang claimed that all of the students who solved the problem as quickly as they could write would get an A in the Calculus course, and all the others wouldn’t. The speed with which they solved a simple algebra problem was as good a predictor of the final grade in Calculus as a whole semester of homework, tests, midterms, and a final.

You see, if you can’t whiz through the easy stuff at 100 m.p.h., you’re never gonna get the advanced stuff.
It’s also hitting me how stupidly selfish I’ve been here. As much as I’d like to think so, I didn’t come here to solve practical problems and make the world a better place. If I had, I’d have started right off the bat, using (and honing) skills I *do* have, working in the Census research group to make serious progress on applied problems. Instead, I wanted to bone up on my math-theory side, just thinking about the glory of proving theorems, but without putting in the prep work. It’s true that I’ve learned a lot by working on theory problems with Jing—but I would have been so much more productive if I’d taken a few hardcore math classes first, brushing up on my weak spots *before* starting such research. (I took linear algebra over a decade before starting the PhD, and it really shows. Yes, I can multiply matrices, but the advanced stuff has been a real slog.) I’ve spent a couple of years now on work that other students could have done much faster and enjoyed more, while neglecting to make real contributions using the skills I *do* have. In other words, I wish I could go back and tell myself: *either* just take some math classes (or even get a MS in Math?), until you can do the theory on your own (or with a mentoring colleague at work), and skip the Stats PhD… *or* do the PhD in a way that builds on your strengths (and makes real contributions!), not merely papers over your weaknesses. Sadly, I probably wouldn’t have believed me. My teaching experiences and the Eberly center seminars have been wonderful, but otherwise, right now I feel I have not made good use out of my time here. (Even in my neuroscience ADA project flop, only a few logistical challenges were out of my hands—and I could have overcome most of them by gritting my teeth and learning Python well, and by sitting alongside the scientists in the lab.) Hindsight is 20/20, and everyone goes through impostor syndrome, but still…

Teaching

I was a TA for Ann Lee’s section of 36-402: Undergraduate Advanced Data Analysis, using materials developed by Cosma Shalizi.

The course largely followed Cosma’s (draft) textbook Advanced Data Analysis from an Elementary Point of View. It was good for me to be “forced” to read up a little on causal inference and related topics. I’m still no expert, but at least not clueless. I also liked his perspective of statistical modeling as “data compression,” and his view of regression as a linear smoother with *weird* weights.
Some students mentioned that having to code up cross-validation from scratch 5 times or more was a *good* part of the class. They really feel they understand it now, more so than other things which they never or rarely had to code directly—such as backfitting in GAMs. I worried that repeatedly writing CV from scratch would start to feel like busywork, but luckily not (at least for these few students). And I felt the same about taking the Convex Optimization class myself: it’s great to have repeated practice *coding up the algorithms directly* and understanding what they’re trying to do, even if it’s only practice and in reality you’d actually use a pre-canned routine that deals with subtleties such as numerical convergence issues. So, in future years, we should give more opportunities to practice coding up algorithms, not just deriving theory about them and using them for data analysis. (Not to mention the omitted issues of data collection and power calculations…)
By the end of the semester, so many students still didn’t understand the idea of additive vs interaction models. They assumed “additive model” specifically means “GAM with spline terms” and “interaction model” means “linear regression with interaction terms.” We should hit these points harder earlier: “additive” means *any* model that is additive in the terms; and you can certainly do interactions within a GAM by having a multiple-predictor spline term; and so on.
If I’m going to be strict about not accepting late HWs, I should do so from the very beginning. It’ll (hopefully) save me ages of back-and-forth emails from students with excuses over the course of the semester. Also, if the promise of no-credit-for-late-HWs only kicks in at the end of semester, all of a sudden, then some students may have already used up their free dropped-HW opportunities, so they get a much lower grade than expected even if they *do* the work (but just submit it late). That’s not *technically* unfair (the syllabus did say we’d reject late HWs)… but it *feels* unfair. Best to set up consistent and clear expectations, right?
Likewise, if going to be serious about saying that “on time” means “at the start of class,” then have a TA pick up the HWs right at that time. We saw a trickle of late students (or all showing up at end of class) dumping in HWs after the fact. (Maybe electronic submission, with the deadline enforced by your course-management software, is not so bad.)
I’m pleased that we had decent turnaround time for grading most weeks—but it was sad that so many students never bothered to pick up graded HWs. We need better incentives to figure out your mistakes and learn from them, not merely be graded on them. (Alternately, it’s tempting to say that if you *don’t* pick up X of your early HWs, then you accept “stochastic grading” for the rest—we’ll give you a random grade and save time by not grading them manually!)
The Blackboard discussion forums were painful to set up and navigate. We should have used Piazza instead.
How would I grade such a class with specs-based grading? There are so many details to demonstrate understanding of, and so many ways to lose points on current assignments. How to get around point-grubbing here?

Other projects

I made no progress on the FPS paper with Daren, nor on the DL paper with Nick 🙁 At least the FPS paper was submitted to a conference… and rejected by reviewers who didn’t understand the purpose of the paper. I should have quickly revised the introduction to reframe our goals clearly and sent it somewhere else, but instead it’s been sitting on my desk.
This semester (or maybe near the end of last term?) I volunteered to join the GenEd committee. This is a mostly-faculty committee, revising the general education requirements in the CMU college (Dietrich College of Humanities and Social Sciences) to which our Statistics department belongs. It’s been eye-opening to see how faculty meetings go behind the scenes. (In particular, it’s fascinating that even top humanities scholars at a top department have trouble concisely defending the humanities as a GenEd requirement. There are also long digressions to quibble over a single word—“competence” is like a loaded gun, and even clearly-temporary placeholders like “Off-campus experiences” hold up the meeting interminably once someone points out that undergrad research also falls under that heading and it must be renamed right now…) But seriously, it’s been great to hear some truly remarkable educators discuss the direction of our programs, from broad goals to particular pedagogical methods. As a statistician, naturally I volunteered for the Assessment subgroup, and it has been so wonderful to work with experts like Marsha Lovett.
I did well in the “Three Minute Thesis” (3MT) competition! Grad students have 3 minutes and 1 slide to summarize their work for a wide audience. I was surprised and thrilled to win my preliminary round. Unfortunately the finals were the same date as an out-of-town trip I could not miss, so I did not get to compete further, but they still kindly gave me a prelim-round prize of research funds which I used for travel to USCOTS.
I presented my dataviz course poster at the US Conference on Teaching Statistics (USCOTS), in State College, PA. During the poster session I was pleased to meet several folks from stats departments looking to hire new faculty, and they seemed pleased to meet a grad student close to defending who is interested in both pedagogy and research. I believe this was my first contact with Chris Malone at Winona State University and KB Boomer at Bucknell University.
The CMU student-run Data Science Club also invited me to give a high-level talk on principles of good data visualization. Here are my slides, a checklist of best practices to follow, R code, and the nhanes and obesity_education CSV files.
Alex Reinhart and I proposed and ran a mini (half-semester course) on Teaching Statistics, with Rebecca Nugent advising. It was run mostly as a journal club—a good chance to read and discuss interesting papers on pedagogy in general as well as (introductory) statistics education in particular.
Finally, I had my first R package RankingProject accepted on CRAN! Together with Tommy Wright and Martin Klein, my former colleagues back at the Census Bureau, I have been working on a paper about ways to visualize data that lead to correct inferences about comparing many different estimates. This package contains the code and data to implement techniques we recommend in the paper, “A Primer on Visualizations for Comparing Populations, Including the Issue of Overlapping Confidence Intervals” [which went to print in May 2019 in The American Statistician]. The package also was listed as one of RStudio’s top 40 new packages in March 2017, and it has had a respectable number of downloads so far: total , with a monthly average of .

Next up

The 9th and 10th semesters of my Statistics PhD program.

Meng’s “Big Data Paradox” and an extreme example

Update (December 2021): Welcome, new readers. I’m seeing an uptick in visits to this post, probably due to the Nature paper that was just published: “Unrepresentative big surveys significantly overestimated US vaccine uptake” (Bradley et al., 2021). I’ve added a short post with a bit more about the Bradley paper. But the main point is that I strongly encourage you to also read Frauke Kreuter’s brief “What surveys really say,” which describes the context for all of this and points to some of the research challenges needed in order to move ahead; and Reinhart & Tibshirani’s “Big data, big problems,” the Delphi-Facebook survey team’s response to (an earlier draft of) Bradley et al. That said, I hope my 2018 post below is still a useful glimpse at the Big Data Paradox idea.

Original post (October 2018): Xiao-Li Meng’s recent article on “Statistical Paradises and Paradoxes in Big Data (I)” has a fascinating approach to quantifying uncertainty in the Big Data setting. Here’s a very loose/oversimplified summary, along with an extreme example of applying some of these ideas. But I certainly encourage you to read the original paper itself.

(Hat tip: Lee Richardson, and CMU’s History of Statistics reading group.)

We exhort our Intro Stats students: Don’t trust data from a biased sample! Yet in the real world, we rarely deal with datasets collected under a formal random sampling scheme. Sometimes we have e.g. administrative data collected from many people but only those who self-selected into the sample. Even when we do take a “proper” random survey, there is usually nonignorable non-response.

Meng tackles the question: When does a small random survey really give better estimates than a large non-random dataset? He frames the problem gracefully in a way which highlights the importance of two things: (1) the correlation between *whether* a person responds and *what* their response is, and (2) the overall population size N (not just the observed sample size n). Under simple random sampling, the correlation should be essentially 0, and the usual standard error estimates are appropriate. But when we have non-random sample selection, this correlation leads to bias in our estimates, in a way such that the bias gets *substantially worse* when the population is large and we have a large sampling fraction n/N.

Meng reframes the problem several ways, but my favorite is in terms of “effective sample size” or n_eff. For a particular n, N, and correlation, quantify the uncertainty in our estimate—then figure out what sample size from a simple random sample would give the same level of uncertainty. In an extended example around the 2016 US presidential election, Meng suggests that there was a small but non-zero correlation (around 0.005) between *whether* you responded to polling surveys and *which* candidate you supported, Trump or Clinton. This tiny correlation was enough to affect the polling results substantially. All the late election poll samples added up to about 2 million respondents, which at first seems like a massive sample size—naive confidence intervals would have tiny margins of error, less than 1/10th of a percentage point. Yet due to that small correlation of 0.005 (Trump voters were a tiny bit less likely to respond to polls than Clinton voters were), the effective sample size n_eff was around 400 instead, with a margin of error around 5 percentage points.

Let’s say that again: Final estimates for the percent of popular vote going to Trump vs Clinton were based on 2 million respondents. Yet due to self-selection bias (who agreed to reply to the polls), the estimates’ quality was no better than a simple random survey of only 400 people, if you could get a 100% response rate from those 400 folks.

On the other hand, if you want to try correcting your own dataset’s estimates and margins of error, estimating this correlation is difficult in practice. You certainly can’t do it from the sample at hand, though you can try to approximate it from several samples. Still, one rough estimate of n_eff gives some insight into how bad the problem can get.

Here at Colby College, students have a choice of three dining halls: Foss, Roberts (affectionately called Bobs), and Dana. Around the start of the year, I saw this informal “survey” near the entrance to Foss:

Students walking past the bulletin board were encouraged to make a tally mark near their favorite season, favorite article of clothing, etc… and their favorite dining hall. Here, Foss is far in the lead, with 32 votes compared to 5 and 6 for Dana and Roberts respectively. But of course, since this bulletin board is at Foss, you only get passers-by who are willing to go to Foss. We talked about this in class as an example of (highly!) biased sampling.

Now, I want to go further, using one of Meng’s effective sample size estimates. Let’s say that we want to estimate the proportion of Colby students whose favorite dining hall is Foss. Meng’s equation (4.4) estimates the effective sample size as

n_eff = 4 * (n/N)^2 * (1/delta)^2

where delta is the expected difference in [probability of responding for people who prefer Foss] versus [probability of responding for people who prefer another dining hall]. The bulletin board showed about n = 43 respondents, which is roughly 1/40th of the 2000-ish student body, so let’s assume n/N = 1/40. In that case, if delta is any larger than 0.05, n_eff is estimated to be 1.

It certainly seems reasonable to believe that Prob(respond to a survey in Foss | you prefer Foss) is at least 5 percentage points higher than Prob(respond to a survey in Foss | you prefer some other dining hall). Foss is near the edge of campus, close to dorms but not classrooms. If you *don’t* prefer the food at Foss, there’s little reason to pass by the bulletin board, much less to leave a tally mark. In other words, delta = 0.05 seems reasonable… which implies that n_eff <= 1 is reasonable… which means this “survey” is about as good as asking ONE random respondent about their favorite dining hall!

[Edit: for a sanity check on delta’s order of magnitude, imagine that in fact about 1/3 of Colby’s students prefer each of the 3 dining halls. So we got around 30/600ish Foss-preferrers responding here, for around a 0.05 response rate. And we got around 12/1200ish of the other students responding here, for around a 0.01 response rate. So the difference in response rates is 0.04. That’s a bit lower than our extreme hypothesized delta = 0.05, but it is in a reasonable ballpark.]

In general, if delta is around twice n/N, we estimate n_eff = 1. Even if we waited to get a much larger sample, say 200 or 250 students leaving tally marks on the board, the survey is still useless if we believe delta >= 0.01 is plausible (which I do).

There’s plenty more of interest in Meng’s article, but this one takeaway is already enough to be fascinating for me—and a nifty in-class example for just *how* bad convenience sampling can be.