
After 3rd semester of Statistics PhD program

It’s time for another braindump of reflections on statistics grad school.
See also the previous two posts: After 1st semester of Statistics PhD program and After 2nd semester of Statistics PhD program.

This was my last semester of required coursework. Having passed the Data Analysis Exam in May, and with all the courses under my belt, I am pretty much ready to focus on the thesis topic search and proposal. Exciting!

Classes:

  • Let me elaborate on Cosma’s post: “Note to graduate students: It is important that you internalize that you are, in fact, a badass…”
    Ideally you should really internalize that you’re a badass before you come to grad school, because this is not the place to prove to yourself that you’re a badass. There are too many opportunities to feel bad about yourself at every stumble, when you’re surrounded by high-performing classmates and faculty who seem to do everything faster and more smoothly… It can be demoralizing when, say, you learn that you had the lowest score on an exam in a required class.
  • On the other hand, now that the Advanced Statistical Theory course is over, I do feel much more badass about reading and doing statistical theory. I used to see a paper with a ton of unfamiliar math and my eyes would glaze over. Now I see it as: “Well, it may take a while, but I’m capable of learning to parse that, use it, and even contribute to the field.” It feels no more daunting than other things I’ve done. Thank you, Advanced Prob and Advanced Stat Theory!
    For example, I finally internalized that “hard math” is no worse than learning a new programming language. If I do an applied project and have to learn a new tool like Python, or parallel programming, or version control, it’s not an impossible task: it’s just a lot of work, like learning a foreign language. And I finally feel the same about math again: I may not have known what a Frobenius norm is, or my intuition about the difference between o(1) and O(1) may still be underdeveloped (see the note after this list)—but it’s not substantively different to get there than it is to keep track of the differences between for-loops in R vs Python vs MATLAB (like I had to do all year).
    Also, if I get stuck on a theory problem, it’s my own concern. I can read previous work on it and find a solution; or if there is none, I can write one and thus make a contribution to the literature. But if I’m stuck on an applied problem because I don’t have a codebook for the variables or don’t know what preprocessing was done to the dataset, I really am stuck waiting until the data owner responds (if he/she even knows or remembers what was done, which is not a safe bet…)
  • I was a bit surprised by the choice of topics in Advanced Stat Theory. We covered several historically important topics in great detail, but then the professor told us that most of them are not especially popular directions or practically useful tools in modern statistical research. (For example, Neyman-Pearson hypothesis testing in exponential families seems to be a solved problem, tackled by tools specific to that scenario alone… So why spend so much course time on it?) Maybe the course could be better focused if it were split into two parts: one on historically-important foundations, vs. one on modern techniques.
  • My TA assignment this semester was for Discrete Multivariate Analysis: advanced methods for contingency tables and log-linear models. I came away with a bigger appreciation for the rich and interesting questions that can arise about what looks, on the surface, to be a simple and small corner of statistics.
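
A quick note on that o(1) vs O(1) intuition, since it turned out to be simpler than I had feared: O(1) just means “stays bounded,” while o(1) means “goes to zero,” so every o(1) sequence is O(1) but not conversely. For instance:

```latex
a_n = 5 + \tfrac{1}{n} = O(1) \ \text{but not} \ o(1),
\qquad
b_n = \tfrac{1}{n} = o(1) \ \text{(and hence also } O(1)\text{)}.
```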

Journal Club:

  • My favorite course this fall was the Statistical Journal Club, led by CMU’s Ryan Tibshirani jointly with his father Rob Tibshirani (on sabbatical here from Stanford). The Tibshiranis chose a great selection of papers for us to read and discuss. Each week a pair or trio of students would present that week’s paper. It was helpful to give practice “chalk talks” as well as to see simulations illustrating each paper. (On day 1, Rob Tibshirani told us he likes to implement a small simulation whenever he reads a new paper or attends a talk: it helps gain intuition, see how well the method really works in practice, and see how sensitive it is to the authors’ particular setup and assumptions.)
  • I mentioned in Journal Club that we’d benefit from an MS/PhD-level course on experimental design and sampling design for advanced stats & ML. Beyond just simple data collection for a basic psych experiment: how should one collect “big data” well, what should you watch out for, and how does the data collection affect the analysis? Someone asked if I’m volunteering to teach it—maybe not a bad idea someday :)
  • The papers on “A kernel two-sample test” and “Brownian distance covariance” reminded me of a few moments when I saw an abstract definition in AdvProb class and thought, “Too bad this is just a technical tool for proofs and not something you can check in practice on real data…” As it turns out, the authors of these papers DID find a way to use them with real data. (For instance, there’s a very abstract definition of equality of distributions that cannot be checked directly: “for any function, the mean of that function on X is the same as the mean of that function on Y.” You can’t take a real dataset and check this for ALL functions—but the authors figured out that kernel methods get you remarkably close, effectively checking an infinite space of functions at once. So they took the abstract, impractical definition and developed a nice practical test you can run on real data; see the first sketch after this list.) Impressive, and a good reminder to watch out for that thought again in the future—maybe a second look could turn into something useful.
  • Similarly, a few papers (like “Stability selection”) take an idea that seems reasonable to try in practice but has no obvious theoretical grounding… (What if we just take random half-samples of the data, refit our lasso regression on each one, and see which variables are kept in the model on most of the half-samples? See the second sketch after this list.) …and then develop proofs that give theoretical guarantees about how good this procedure can be.
  • Still other papers (like my own team’s assigned paper, on Deep Learning) were unable to find a solid theoretical grounding for why the model does so well, or any guarantees on how well it should be expected to do. But it seems like it should be tractable, if only we could hit on the right framework for looking at the problem. The Dropout paper had a nice way to look at the very top layer of a neural network, though it wasn’t directly helpful for the deeper layers.
  • I got really excited about the “Post-selection inference” paper, which discussed conditional hypothesis testing for regression coefficients. I thought we could apply it to the simplest OLS case to get a nifty new test that would let you make inferences such as: “Beta is estimated to be positive, and our conditional one-sided test says it’s significant, so it’s significantly positive.” You’re usually told not to do this: you’re supposed to decide ahead of time whether you want a two-sided or a one-sided test; and if it’s one-sided, to decide which side to check before looking at the data. However… after some scratch work (written out after this list), it turns out that in the Normal case the correction you make (for choosing the direction of the one-sided test after observing the sign of the estimate) is exactly equivalent to doing a two-sided test instead. (Basically you double the one-sided test’s p-value, which is the same as computing the two-sided p-value for a Normal statistic.) So on the one hand, we don’t get a new, better test out of this: it’s just what people do in practice anyway. On the other hand, it shows that the thing people do, even though they’re told it’s wrong, is actually not wrong after all :)
    This made me wonder: Apart from this simple case of one coefficient in OLS, are there other aspects of sequential/adaptive/conditional hypothesis testing that could be simplified and spread to a wider audience? Are there common use-cases where these tools would help less-statistically-savvy users to get rigorous inference out of the missteps they normally do?
  • A few of the papers were less technical, such as “Why most published research findings are false.” We discussed how to incentivize scientists to publish well-powered interesting null findings and avoid the file-drawer problem. Rob Tibshirani suggested the idea of a “PLoS Zero” :) (vs. the existing PLoS ONE) He also told us how he encouraged PubMed to add a comment system, the PubMed Commons. Now you can point out issues or mistakes in a paper in this public space and get the authors’ responses right there, instead of having to go back & forth through the journal editors’ gatekeeping to publish letters slowly.
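
In the spirit of Rob’s simulate-everything advice, here are quick sketches of a few of the ideas above. First, the kernel two-sample test: a minimal version of the MMD idea with a Gaussian kernel and a permutation null. To be clear, this is my own toy illustration (sample sizes, bandwidth, and all), not the paper’s code:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian RBF kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2 * a @ b.T)
    return np.exp(-sq_dists / (2 * sigma**2))

def mmd2(x, y, sigma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy."""
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 1))   # sample from P
y = rng.normal(0.5, 1.0, size=(200, 1))   # sample from Q (shifted mean)
observed = mmd2(x, y)

# Permutation null: reshuffle the pooled sample to see how large MMD^2
# gets when both halves really do come from the same distribution.
pooled = np.vstack([x, y])
null = np.empty(500)
for i in range(500):
    perm = rng.permutation(len(pooled))
    null[i] = mmd2(pooled[perm[:200]], pooled[perm[200:]])

print("MMD^2 =", round(observed, 4),
      "with permutation p-value =", (null >= observed).mean())
```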
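
Second, the stability-selection recipe in its simplest form: refit the lasso on random half-samples and track how often each variable survives. Again a toy sketch; the penalty level and the 60% threshold are placeholders I picked for illustration, not the paper’s recommendations:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 2.0                      # only the first 5 variables matter
y = X @ beta + rng.normal(size=n)

n_draws, lam = 100, 0.1             # number of half-samples; lasso penalty
freq = np.zeros(p)
for _ in range(n_draws):
    half = rng.choice(n, size=n // 2, replace=False)
    fit = Lasso(alpha=lam).fit(X[half], y[half])
    freq += (fit.coef_ != 0)
freq /= n_draws

# "Stable" variables: those kept in the model on most half-samples.
print("Selected in >= 60% of half-samples:", np.where(freq >= 0.6)[0])
```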
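
And here is the scratch work behind the post-selection remark. Under the null hypothesis beta = 0, the statistic Z = beta-hat / se(beta-hat) is (approximately) standard Normal. If we choose the one-sided alternative only after seeing a positive estimate, the honest thing is to report the p-value conditional on that selection event:

```latex
P\bigl(Z > z_{\mathrm{obs}} \,\big|\, Z > 0\bigr)
  = \frac{P(Z > z_{\mathrm{obs}})}{P(Z > 0)}
  = 2\,P(Z > z_{\mathrm{obs}})
  = 2\,P\bigl(Z > |z_{\mathrm{obs}}|\bigr),
```

which is exactly the usual two-sided p-value (and the same holds by symmetry when the estimate is negative). So the selection correction and the two-sided test coincide, just as claimed above.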

Research:

  • Besides the year-long Advanced Data Analysis (ADA) project, I also got back into research on Small Area Estimation with Beka Steorts, which led me to attend the SAE2014 conference in Poznań, Poland (near my hometown—the first time that business travel has ever taken me anywhere near family!). Beka also got me involved in the MIDAS (Models of Infectious Disease Agent Study) project: we are developing “synthetic ecosystems,” i.e. artificial populations that epidemiologists can plug into agent-based models to study the spread of disease. The current version is an EXTREMELY rudimentary first pass: I’ll write a bit more about the project once we have a version we’re happier with.
  • I finally sat down and learned version control (via Git), and it has turned out to be a good friend. For the MIDAS project, three of us were collaborating through Dropbox, which led to clogging all our Dropboxes, overwriting each other’s files, trying to coordinate by email, and renaming things from “blahblah” to “blahblah_temp” to “blahblah_temp_2_tmp_recent” and so on… It became clear it was time for a better approach. Git lets you exclude files (so you don’t need to sync everything the way Dropbox does); check differences between file versions; and use branching to try out temporary versions without renaming or breaking everything. I used the helpful tutorials by Bitbucket and Karl Broman.
  • MIDAS also sponsored me to attend the North American Cartographic Information Society (NACIS) 2014 conference here in Pittsburgh. That deserves its own post, but I found it nifty that the conference was co-organized by Amy Griffin… whom I met (when she came to do some research on spatial visualization of uncertainty with the Census Bureau) via Nicholas Nagle… who first reached out to me through a comment on this blog. It all comes back around!
  • As for the yearlong ADA project itself: it’s almost wrapped up, but quite differently from what we expected. There turned out to be major issues in getting and combining all the required pieces of the dataset. We needed (1) MEG brain scans, (2) MRI brain imagery, and (3) personal covariates about the medical/neuropsychological outcomes of each patient. Each of these three datasets had a different owner and was de-identified for privacy/security… and we were never able to get a set of patient IDs that we could use to merge the datasets together. In the end I had to switch topics entirely, to a similar neuroscientific dataset (which had been successfully combined and pre-processed), studying Autism instead of Epilepsy. This switch happened in the last few months of the semester, so I had just a short time in which to address the scientific questions in appropriate statistical ways, while also learning about a new disorder and refreshing my knowledge of MATLAB (since the new data was in that format, not Python as the previous one had been)…
    Lessons learned: I should have been more proactive with collaborators about either pushing harder to get data quickly or just switching topics sooner. And for those stats students who are about to start a new applied project like this one, make sure your collaborators already have the full dataset in hand. (Of course, in general if you’re able to get in early and help to plan the data collection for optimal statistical efficiency, so much the better. But if you’re just a student whose goal is to practice data analysis, you’d better be sure the data has been compiled before you start.)

Life:

  • Before coming to CMU, I always knew it as a strong technical school but didn’t realize how great the drama department is. We finally made it to a stage performance: Britten’s arrangement of The Beggar’s Opera. I was wearing a sleep-monitor watch that week, and the readout later claimed I was asleep during the show… It just registered my low movement and the dim lighting, but I promise I was awake! :P Really, it was a great performance, and I look forward to seeing more theater here.
  • For a while I’ve been disappointed that Deschutes Brewery beers from Oregon hadn’t made it out to Pennsylvania yet. But no longer! I can finally buy my favorite Obsidian Stout down the street!
  • Though I haven’t been posting much this fall, there’s been plenty of good stuff by first-year CMU student Lee Richardson. I especially like his recent post’s comments about institutional knowledge—it’s far more important than we usually give it credit for.
  • Nathan Yau is many steps ahead of me again, with great posts like how to improve government data websites, as well as one on a major life event. My own household size is also expected to increase from N to N+1 shortly, and everyone tells us “Your life is about to change!”—so I thank Nathan for a data-driven view of how exactly that change may look.

Turing-complete inversion tables, presented reasonable on your part!

I’ve not been keeping up with blogging this semester, but I had to share this beautiful spam comment my filter let through this morning:

Appreciation for the excellent writeup. This in reality was previously your fun profile it. Glimpse complex to help way presented reasonable on your part! On the other hand, the way could possibly we be in contact?

I can’t tell if it’s written by a non-native English speaker or by a Markov chain—does that mean it passes the Turing test? Either way, there’s something lovely about its broken grammar.

The author’s name was given as “buy inversion tables.” For a moment I thought this might be a real comment, by someone offering to compute large matrix inversions cheaply and quickly. But no, apparently inversion tables are these things where you strap yourself in, flip over, and hang upside down for as long as you can. Kind of like the first semester of a PhD program :)

PS—somehow the comment reminds me of when Cosma Shalizi’s students used Markov-chain generated text to fake a blog post for him, in a previous iteration of the Statistical Computing class (which I’m TA’ing this term).
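
(In case you want to grow your own spam: a bigram Markov chain takes only a few lines of Python. This is a toy sketch of the general idea, certainly not the code Cosma’s students actually used; train it on a bigger corpus for better-quality nonsense.)

```python
import random
from collections import defaultdict

def train(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for a, b in zip(words, words[1:]):
        chain[a].append(b)
    return chain

def babble(chain, n_words=25, seed=None):
    """Random-walk the chain to generate Markov nonsense."""
    rng = random.Random(seed)
    word = rng.choice(list(chain))
    out = [word]
    for _ in range(n_words - 1):
        followers = chain.get(word)
        if not followers:
            break
        word = rng.choice(followers)
        out.append(word)
    return " ".join(out)

corpus = ("Appreciation for the excellent writeup. This in reality was "
          "previously your fun profile it. Glimpse complex to help way "
          "presented reasonable on your part!")
print(babble(train(corpus), seed=42))
```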

Transitions

Apologies for the lack of posts recently. I’m very excited about upcoming changes that are keeping me busy:

Let me suggest a few other blogs to follow while this one is temporarily on the back burner.

By my Census Bureau colleagues:

By members of the Carnegie Mellon statistics department:

Too close for bells, I’m switching to tubas

So when I’m not visualizing data or crunching small area estimates, I’ve been training to run DC’s Jingle All The Way 8k.

Most people wear little jingle bells as they run this race.
I decided to carry a tuba instead.

More photos here. The photo above is thanks to a blog I found by googling the race name + tuba. Our team t-shirts said Tuba Awareness, and apparently people were indeed aware! :)

My time was super slow (although I placed 1st in the carrying-a-tuba category), but I did run the whole thing, and I had a blast playing carols along the way. I really need to find somewhere in DC to play regularly, though perhaps somewhere a bit more sedentary…

Polymath project: social problem-solving

Earlier this week, Argentina hosted the 53rd International Math Olympiad (IMO), a mathematical problem-solving contest for high school students from all over the world. That means it’s almost time for another “mini-polymath” project!

Edit: As of Friday morning (7/13/2012), the problem still has not been completely solved, so there’s time to chime in on the discussion thread!

For the past few years, mathematician Terry Tao has hosted and coordinated a social problem-solving event, where people around the world use a blog and wiki to work together on one of that year’s IMO problems. His 2009 post is a good introduction to the event and the spirit behind it. Personally, I had a blast trying to contribute (if only a tiny bit) to the 2010 event.

Dang, I almost had comment 42!

Tao will be hosting a fourth “mini-polymath” tonight (July 12, 2012), starting at UTC 22:00, which is 6pm EDT for us here on the US East Coast. If you read blogs like mine, I imagine you’d enjoy participating, or at least following along and watching the mathematical ideas going off like fireworks :)


Synaesthesia (or, This is Your Brain on Physics)

John Cook posted a fascinating Richard Feynman quote that made me wonder whether the physicist may have had synaesthesia:

I see some kind of vague, shadowy, wiggling lines — here and there an E and a B written on them somehow, and perhaps some of the lines have arrows on them — an arrow here or there which disappears when I look too closely at it. When I talk about the fields swishing through space, I have a terrible confusion between the symbols I use to describe the objects and the objects themselves. I cannot really make a picture that is even nearly like the true waves.

As it turns out, he probably did:

As I’m talking, I see vague pictures of Bessel functions from Jahnke and Emde’s book, with light-tan j’s, slightly violet-bluish n’s, and dark brown x’s flying around. And I wonder what the hell it must look like to the students.

The letter-color associations in this second quote are a fairly common type of synaesthesia. The first quote above sounds quite different, but still plausibly like synaesthesia: “I have a terrible confusion between the symbols I use to describe the objects and the objects themselves”…

I wonder whether many of the semi-mystical genius-heroes of math & physics lore (also, for example, Ramanujan) have had such neurological conditions underpinning their unusually intuitive views of their fields of study.

I love the idea of synaesthesia and am a bit jealous of people who have it. I’m not interested in drug-induced versions but I would love to experiment with other ways of experiencing synthetic synaesthesia myself. Wired Magazine has an article on such attempts, and I think I remember another approach discussed in Oliver Sacks’ book Musicophilia.

I have a friend who sees colors in letters, which helps her to remember names — I’ve heard her think out loud along these lines: “Hmm, so-and-so’s name is kind of reddish-orange, so it must start with P.” I wonder what would happen if she learned a new alphabet, say the Cyrillic alphabet (used in Russian etc.): would she associate the same colors with similar-sounding letters, even if they look different? Or similar-looking ones, even if they sound different? Or, since her current associations were formed long ago, would she never have any color associations at all with the new alphabet?

Also, my sister sees colors when she hears music; next time I see her I ought to ask for more details. (Is the color related to the mood of the song? The key? The instrument? The time she first heard it? etc. Does she see colors when practicing scales too, or just “real” songs?)

Finally, this isn’t quite synaesthesia but another natural superpower in a similar vein, suggesting that language can influence thought:

…unlike English, many languages do not use words like “left” and “right” and instead put everything in terms of cardinal directions, requiring their speakers to say things like “there’s an ant on your south-west leg”.  As a result, speakers of such languages are remarkably good at staying oriented (even in unfamiliar places or inside buildings) and perform feats of navigation that seem superhuman to English speakers. In this case, just a few words in a language make a big difference in what cognitive abilities their speakers develop. Certainly next time you plan to get lost in the woods, I recommend bringing along a speaker of Kuuk Thaayorre or Guugu Yimithirr rather than, say, Dutch or English.

The human brain, ladies and gentlemen!

Just when you thought it was safe to go back in the cubicle…

Yesterday’s earthquake in Virginia was a new experience for me. I am glad that there was no major damage and there seem to have been no serious injuries.

Most of us left the building quickly — this was not guidance, just instinct, but apparently it was the wrong thing to do: FEMA suggests that you take cover under a table until the shaking stops, as “most injuries occur when people inside buildings attempt to move to a different location inside the building or try to leave.”

After we evacuated the building, and once it was clear that nobody had been hurt, I began to wonder: how do you know when it’s safe to go back inside?
Assuming your building’s structural integrity is sound, what are the chances of experiencing major aftershocks, and how soon after the original quake should you expect them? Are you “safe” if there were no big aftershocks within, say, 15 minutes of the quake? Or should you wait several hours? Or do they continue for days afterwards?

Maybe a friendly geologist could tell me this is a pointless or unanswerable question, or that there’s a handy web app for that already. But googling did not turn up an immediate, direct answer, so let me dig into the details a bit…

FEMA does not help much in this regard: “secondary shockwaves are usually less violent than the main quake but can be strong enough to do additional damage to weakened structures and can occur in the first hours, days, weeks, or even months after the quake.”

I check the Wikipedia article on aftershocks and am surprised to learn that events in the New Madrid seismic zone (around where Kentucky, Tennessee, and Missouri meet) are still considered aftershocks to the 1811-1812 earthquake! So maybe I should wait 200 years before going back indoors…

All right, but if I don’t want to wait that long, Wikipedia gives me some good leads:
First of all, Båth’s Law tells us that the largest aftershock tends to be of magnitude about 1.1-1.2 points lower than the main shock. So in our case, the aftershocks for the 5.9 magnitude earthquake are unlikely to be of magnitude higher than 4.8. That suggests we are safe regardless of wait time, since earthquakes of magnitude below 5.0 are unlikely to cause much damage.
Actually, there are several magnitude scales, and there are other important variables too (such as the intensity and depth of the earthquake)… but just for the sake of argument, we can use 5.0 (which is about the same on the Richter and Moment Magnitude scales) as our cutoff for safety to go back inside. Except that, with that cutoff, Båth’s Law already says any aftershocks to the 5.9 quake are unlikely to be dangerous — but now I’m itching to do some more detailed analysis… And anyhow, quakes above magnitude 4.0 can still be felt, and are probably still quite scary coming right after a bigger one. So let us say we are interested in the chance of an aftershock of magnitude 4.0 or greater, and keep pressing on through Wikipedia.

We can use the Gutenberg-Richter law to estimate the relative frequency of quakes above a certain size in a given time period.
The example given states that “The constant b is typically equal to 1.0 in seismically active regions.” So if we round our recent quake up to magnitude 6.0, we should expect about 10 quakes of magnitude 5.0 or more, about 100 quakes of magnitude 4.0 or more, etc., for every 6.0 quake in this region.
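
(For reference, the law itself is simple: if N(≥M) counts the quakes of magnitude at least M in a given region and time period, then

```latex
\log_{10} N(\ge M) = a - b\,M,
\qquad\text{so}\qquad
\frac{N(\ge 5.0)}{N(\ge 6.0)} = 10^{\,b\,(6.0 - 5.0)} = 10 \ \text{ when } b = 1,
```

i.e. each step down in magnitude multiplies the expected count by a factor of 10^b.)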

But here is our first major stumper: is b=1.0 appropriate for the USA’s east coast? It’s not much of a “seismically active region”… I am not sure where to find the data to answer this question.

Also, this only says that we should expect an average of ten 5.0 quakes for every 6.0 quake. In other words, we’ll expect to see around ten 5.0 quakes some time before the next 6.0 quake, but that doesn’t mean that all (or even any) of them will be aftershocks to this 6.0 quake.

That’s where Omori’s Law comes in. Omori looked at earthquake data empirically (without implying any specific physical mechanism) and found that the aftershock frequency decreases roughly in proportion to 1/t, where t is the time since the main shock. He tweaked this a bit, and Utsu later made further modifications, leading to an equation involving the main quake’s amplitude, a “time offset parameter”, and another parameter that modifies the decay rate.
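
(In its usual modified “Omori-Utsu” form, the aftershock rate at time t after the main shock is

```latex
n(t) = \frac{K}{(c + t)^{p}},
```

where c is the time-offset parameter, p (typically near 1) modifies the decay rate, and K scales with the size of the main shock.)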

Our second major stumper: what are typical Omori parameter values for USA east coast quakes? Or where can I find data to fit them myself?

Omori’s Law gives the relationship for the total number of aftershocks, regardless of size. So if we knew the parameters for Omori’s Law, we could guess how many aftershocks total to expect in the next hour, day, week, etc. after the main quake. And if we knew the parameters for the Gutenberg-Richter law, we could guess what proportion of quakes (within each of those time periods) would be above a certain magnitude.
Combining this information (and assuming that the distribution of aftershock magnitudes is typical of the overall quake magnitude distribution for the region), we could guess the probability of a magnitude 4.0 or greater quake within the next day, week, etc. The Southern California Earthquake Center provides details on putting this all together.
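
Here is a back-of-the-envelope sketch of that combination, treating the aftershock sequence as a Poisson process whose rate follows Omori’s Law, thinned by the Gutenberg-Richter magnitude distribution (which is, if I understand the SCEC page correctly, essentially the Reasenberg-Jones style of model). Every parameter value below is a made-up placeholder; stumpers #1 and #2 are precisely that I don’t know the right values for the East Coast:

```python
import numpy as np
from scipy.integrate import quad

# All hypothetical placeholder values, NOT fitted to East Coast data!
K, c, p = 10.0, 0.05, 1.1   # Omori-Utsu parameters (for aftershocks >= M_min)
b = 1.0                     # Gutenberg-Richter slope
M_min = 2.0                 # magnitude floor that K is calibrated against
M_big = 4.0                 # magnitude we actually worry about

def omori_rate(t):
    """Modified Omori law: aftershocks (of any size >= M_min) per day."""
    return K / (c + t) ** p

def prob_big_aftershock(t1, t2):
    """P(at least one aftershock >= M_big between days t1 and t2),
    thinning the Omori rate by the Gutenberg-Richter fraction."""
    frac = 10 ** (-b * (M_big - M_min))   # share of quakes that reach M_big
    n_all, _ = quad(omori_rate, t1, t2)   # expected number of all aftershocks
    return 1 - np.exp(-frac * n_all)      # Poisson chance of one or more

for label, window in [("first hour:", (0, 1 / 24)),
                      ("first day: ", (0, 1)),
                      ("first week:", (0, 7))]:
    print(label, round(prob_big_aftershock(*window), 3))
```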

What this does not answer directly is my first question: Given a quake of magnitude X, in a region with Omori and Gutenberg-Richter parameters Y, what is the time T such that, if any aftershocks of magnitude 4.0 or greater have not occurred yet, they probably won’t happen?
If I can find typical local parameter values for the laws given above, or good data for estimating them; and if I can figure out how to put it together; then I’d like to try to find the approximate value of T.

Stumper number three: think some more about whether (and how) this question can be answered, even if only approximately, using the laws given above.
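
Here is at least a start on that, under the same hypothetical Poisson set-up as the sketch above. For p > 1, the expected number of magnitude-4.0-or-greater aftershocks arriving after time T is finite:

```latex
\Lambda(T) = f \int_{T}^{\infty} \frac{K}{(c+t)^{p}}\,dt
           = \frac{f\,K}{(p-1)\,(c+T)^{p-1}},
\qquad f = 10^{-b\,(M_{\mathrm{big}} - M_{\mathrm{min}})} .
```

Requiring P(no such aftershock after T) = exp(−Λ(T)) ≥ 1 − α and solving for T gives

```latex
T \;\ge\; \left( \frac{f\,K}{(p-1)\,\alpha^{*}} \right)^{1/(p-1)} - c,
\qquad \alpha^{*} = -\ln(1 - \alpha) \approx \alpha .
```

One wrinkle: if p = 1 exactly, the integral diverges, so under a pure 1/t decay you are never strictly “safe”; the risk just keeps shrinking without ever vanishing. And either way, the answer is only as trustworthy as the local values of K, c, p, and b, which brings us back to stumpers one and two.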

I know this is a rough idea, and my lack of background in the underlying geology might give entirely the wrong answers. Still, it’s a fun exercise to think about. Please leave any advice, critiques, etc. in the comments!