Is a Master’s degree in Statistics worthwhile?

A student who is considering a Master’s degree in Statistics asks, “I’m interested in finding a job in data analysis and have been looking around, but I’m not sure if a masters is necessary to break into the field.”

Without much info about her background or job goals, here’s what I replied. Readers, do you have any additional or contradictory advice?

Continue reading “Is a Master’s degree in Statistics worthwhile?”

History of CMU’s Department of Statistics

As I’m about to begin my studies at CMU’s Department of Statistics, I have been curious about the department’s history. There is a nice writeup in Strength in Numbers: The Rising of Academic Statistics Departments in the U. S.. Luckily, the “Carnegie Mellon University Statistics Department” chapter happens to be the free sample chapter on the publisher’s website.

Some fun facts from the chapter:

I hadn’t known that Frederick Mosteller went here (back when it was Carnegie Tech). I enjoy his Fifty Challenging Problems in Probability, and I’ve also been meaning to read The Pleasures of Statistics: The Autobiography of Frederick Mosteller. One of his early students at Harvard (whose stats department Mosteller founded before CMU had one) was Steve Fienberg, still at CMU.

Although the department was formed in 1966, it didn’t have a permanent college to call home until it joined the humanities college in 1980.
StatLib, the department’s online collection of downloadable datasets, started in 1989 and is still in use today.
CMU’s stats department was one of the first anywhere to focus on Bayesian stats, applications, and statistical computing. All of these are areas of interest for me—good to know I’m in the right place!

Early on, they also agreed to evaluate applied research on whether it benefits the applied area, not necessarily statistics as a field. I saw this still in effect at a thesis defense this week: the focus was on a very practical contribution to improving a neurological-data processing pipeline, even if the statistical theory was not highly novel. I’m glad to know that applied thesis topics are appreciated here.
The department also chose not to run a drop-in consulting center like many others do. Instead, they form long-term joint research collaborations with other departments’ scholars.
Journal editorship is also valued at the department. Hopefully I can pick the many experienced editors’ brains in tailoring my publication submissions to the right journals.
Finally, there’s a strong focus not only on research but also on teaching, and today CMU has the largest group of undergrad stats majors in the US.

I’m looking forward to working with great colleagues in such an excellent environment!

Small Area Estimation resources

Small Area Estimation is a field of statistics that seeks to improve the precision of your estimates when standard methods are not enough.

Say your organization has taken a large national survey of people’s income, and you are happy with the precision of the national estimate: The estimated national average income has a tight confidence interval around it. But then you try to use this data to estimate regional (state, county, province, etc.) average incomes, and some of the estimates are not as precise as you’d like: their standard errors are too high and the confidence intervals are too wide to be useful.

Unlike usual survey-sampling methods that treat each region’s data independently, a Small Area Estimation model makes some assumptions that let areas “borrow strength” from each other. This can lead to more precise and more stable estimates for the various regions (if the assumptions are reasonable).
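
To make “borrowing strength” a bit more concrete, here is a minimal sketch of one simple version of the idea: a composite (shrinkage) estimator for an intercept-only area-level model. It is only an illustration under stated assumptions, not the method of any particular agency. The function name `shrink_estimates`, the made-up regional numbers, and the assumption that the between-area variance is already known are all hypothetical; in practice that variance has to be estimated, and real models usually include covariates.

```python
def shrink_estimates(direct, sampling_var, between_var):
    """Pull each area's direct estimate toward a pooled mean.

    direct       -- direct survey estimates, one per area
    sampling_var -- sampling variance of each direct estimate
    between_var  -- assumed (hypothetical) variance of true area means
    """
    # Pooled mean: areas with less total variance get more weight.
    weights = [1.0 / (between_var + v) for v in sampling_var]
    overall = sum(w * y for w, y in zip(weights, direct)) / sum(weights)

    shrunk = []
    for y, v in zip(direct, sampling_var):
        # gamma near 1: trust the area's own data; near 0: lean on the pool.
        gamma = between_var / (between_var + v)
        shrunk.append(gamma * y + (1.0 - gamma) * overall)
    return overall, shrunk


# Made-up regional average incomes (in thousands) and their sampling variances.
direct = [50.0, 62.0, 41.0, 55.0]
sampling_var = [1.0, 25.0, 36.0, 4.0]

overall, shrunk = shrink_estimates(direct, sampling_var, between_var=16.0)
for y, v, s in zip(direct, sampling_var, shrunk):
    print(f"direct={y:5.1f}  sampling var={v:5.1f}  shrunk={s:6.2f}")
```

Regions with noisy direct estimates (large sampling variance) are pulled strongly toward the pooled mean, while precisely estimated regions barely move. That is the sense in which the imprecise areas borrow strength from the rest, and it only helps if the assumption that regions resemble each other is reasonable.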

Also note that it is sometimes called Small Domain Estimation because the “areas” do not have to be geographic: they can be other sub-domains of the data, such as finely cross-classified demographic categories of race by age by sex.

If you are interested in learning about the statistical techniques involved in Small Area Estimation, it can be difficult to get started. This field does not yet have as many textbooks as other statistical topics do, and there are a few competing philosophies whose proponents do not cross-pollinate much. (For example, the U.S. Census Bureau and the World Bank both use model-based small area estimation but in quite different ways.)

Recently I gave a couple of short tutorials on getting started with SAE, and I’m polishing those slides into something stand-alone I can post. [Edit: I still haven’t polished them, but I’ve posted my old slides and code.] Meanwhile, below is a list of resources I recommend if you would like to be more knowledgeable about this field.

Continue reading “Small Area Estimation resources”

Evil Queen of Numbers

Would there be any demand for a statistics class taught by M from the James Bond films?

M: You don’t like me, Bond. You don’t like my methods. You think I’m an accountant, a bean counter, more interested in my numbers than your instincts.
JB: The thought had occurred to me.
[…]
M: I’ve no compunction about sending you to your death, but I won’t do it on a whim.

Continue reading “Evil Queen of Numbers”

More names for statistics, and do they matter?

Partly continuing on from my previous post

So I think we’d all agree that applied mathematics is a venerable field of its own. But are you tired of hearing statistics distinguished from “data science”? Trying to figure out the boundaries between data science, statistics, and machine learning? What skills are needed by the people in this field (these fields?), do they also need domain expertise, and are there too many posers?

Or are you now confused about what is statistics in the first place? (Excellent article by Brown and Kass, with excellent discussion and rejoinder — deserving of its own blog post soon!)

Or perhaps you are psyched for the growth of even more of these similar-sounding fields? I’ve recently started hearing people proclaim themselves experts in info-metrics and uncertainty quantification. [Edit: here’s yet another one: cognitive informatics.]

Is there a benefit to having so many names and traditions for what should, essentially, be the same thing, if it hadn’t been historically rediscovered independently in different fields? Is it just a matter of branding, or do you think all of these really are distinct specialties?

TwoTypesOfPeople

Given the position in my last post, I might argue that you should complete Chemistry Cat’s sentence with “…and those who can quantify their uncertainty about those extrapolations.” And maybe some fields have more sophisticated traditions for tackling the first part, but statisticians are especially focused on the second.

In other words, much of a statistician’s distinctive contribution (what we think about more than an applied mathematician, data scientist, haruspicer, etc. might) is our focus on the uncertainty-related properties of our estimators. We are the first to ask: what’s your estimator’s bias and variance? Is it robust to data that doesn’t meet your assumptions? If your data were sampled again from scratch, or if you ran your experiment again, what’s the range of answers you’d expect to see? These questions are front and center in statistical training, whereas in, say, the Stanford machine learning class handouts, they often come in at the end as an afterthought.
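
As a small illustration of the “what if you sampled again?” question, here is a minimal sketch using the nonparametric bootstrap, which is just one common way to approximate that range when you cannot actually re-run the study. The helper names `bootstrap_means` and `percentile_interval`, the made-up income values, and the choice of 2000 resamples are my own assumptions for the example, and the percentile interval is a rough approximation rather than the best available method.

```python
import random
import statistics


def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement from the observed data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]


def percentile_interval(values, level=0.95):
    """Rough percentile interval from a list of resampled estimates."""
    ordered = sorted(values)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(tail * len(ordered))]
    hi = ordered[int((1.0 - tail) * len(ordered)) - 1]
    return lo, hi


# Made-up sample of incomes (in thousands); purely illustrative.
incomes = [31.0, 45.5, 52.0, 38.2, 61.7, 47.3, 29.9, 55.1, 43.0, 72.4, 36.8, 49.6]

means = bootstrap_means(incomes)
lo, hi = percentile_interval(means)
print(f"sample mean: {statistics.mean(incomes):.2f}")
print(f"approximate 95% range for the mean: ({lo:.2f}, {hi:.2f})")
```

The point estimate alone is the “modeling just the mean” answer; the interval is the statistician’s follow-up about how much that answer depends on the particular sample you happened to draw.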

So my impression is that other fields are at higher risk of modeling just the mean and stopping there (not also telling you what range of data you may see outside the mean), or overfitting to the training data and stopping there (not telling you how much your mean-predictions rely on what you saw in this particular dataset). On the other hand, perhaps traditional stats models for the mean/average/typical trend are less sophisticated than those in other communities. When statisticians limit our education to the kind of models where it’s easy to derive MSEs and compare them analytically, we miss out on the chance to contribute to the development & improvement of many other interesting approaches.

So: if you call yourself a statistician, don’t hesitate to talk with people who have a different title on their business cards, and see if your special view on the world can contribute to their work. And if you’re one of these others, don’t forget to put on your statistician hat once in a while and think deeply about the variability in the data or in your methods’ performance.

PS — I don’t mean to be adversarial here. Of course a good statistician, a good applied mathematician, a good data scientist, and presumably even a good infometrician(?) ought to have much of the same skillset and worldview. But given that people can be trained in different departments, I’m just hoping to articulate what might be gained or lost by studying Statistics rather than the other fields.

One difference between Statistics vs. Applied Math

I’ll admit it: before grad school I wasn’t fully clear on the distinction between statistics and applied mathematics. In fact (gasp!) I may have thought statistics was a branch of mathematics, rather than its own discipline. (On the contrary: see Cobb & Moore (1997) on “Mathematics, Statistics, and Teaching”; William Briggs’s blog; and many others.)

Of course the two fields overlap considerably; but clearly a degree in one area will not emphasize exactly the same concepts as a degree in the other. One such difference I’ve seen is that statisticians have a greater focus on variability. That includes not just quantifying the usual uncertainty in your estimates, but also modeling the variability in the underlying population.

In many introductory applied-math courses and textbooks I’ve seen, the goal of modeling is usually to get the equivalent of a point estimate: the system’s behavior after converging to a steady state, the maximum or minimum necessary amount of something, etc. You may eventually get around to modeling the variability in the system too, but it’s not hammered into you from the start like it is in a statistics class.

For example, I was struck by some comments on John Cook’s post about (intellectual) traffic jams. Skipping the “intellectual” part for now, here’s what Cook said:

Continue reading “One difference between Statistics vs. Applied Math”

Moore method / inquiry-based learning in statistics?

Via Dave Richeson:

For the last 10+ years I’ve taught topology using a modified Moore method, also known as inquiry-based learning (IBL). The students are given the skeleton of a textbook; then they must prove all the theorems and solve all of the problems. They are forbidden from looking at outside sources. The class types up their work as they go. At the end of the semester they have a textbook that they wrote. It is a great way to learn, and at the end of the semester the students are thrilled to hold a bound copy of the textbook that they created.

I love this idea! Wikipedia lists several universities with math courses using the Moore method, but none in probability or mathematical statistics. Google doesn’t suggest much besides this blog post with the same idea, and this article which seems to have good advice but is no longer accessible.

Have you ever seen the Moore approach used for a statistics course? Do you have any success stories or pitfalls to share?

Hot Pot recipe, and stages of learning

I use Mark Bittman’s How to Cook Everything all the time, and I can really identify with his “four stages of learning how to teach yourself to cook”:

First, you slavishly follow recipes; this is useful.

In stage two, you synthesize some of the recipes you’ve learned. […] You learn your preferences. You might, if you’re dedicated, consult two, three, four cookbooks before you tackle anything.

The third stage incorporates what you’ve learned with the preferences you’ve developed, what’s become your repertoire, your style, and leads you to search out new things. […] This is the stage at which many people bring cookbooks to bed, looking for links and inspiration; they don’t follow recipes quite as much, but sometimes begin to pull ideas from a variety of sources and simply start cooking.

Stage four is that of the mature cook, a person who consults cookbooks for fun or novelty but for the most part has both a fully developed repertoire and — far, far more importantly — the ability to start cooking with only an idea of what the final dish will look like. There’s a pantry, there’s a refrigerator, and there is a mind capable of combining ingredients from both to Make Dinner.

These phases seem to apply in other areas as well. Consider foreign languages: first, you parrot your phrasebook word-for-word. Next, you learn to plug in new words or conjugations and combine pieces of several phrases. Third, you’ve started to grasp the grammar and the structure of the language; you have enough vocabulary to get by in basic scenarios, though you enjoy learning more. Fourth, you’ve reached fluency and “the ability to start [speaking] with only an idea of what the final [sentence] will look like.”

Anyhow, when you spend most of your time in stage 2 or possibly 3, it’s a pleasure to reach stage 4 sometimes: just coming home and BAM! making something tasty with whatever’s in the fridge + pantry. That happened recently with some shaved beef my fiancée and I found at Trader Joe’s, combined with memories of a delicious hot pot restaurant in the DC area. I didn’t have any mala spice available (too bad, as it does indeed cause a delicious “neurological confusion”), and I make no claims to authenticity, but it was a seriously tasty recipe-less culinary adventure. Recipe follows, although there are no proportions; everything is “to taste”!

Continue reading “Hot Pot recipe, and stages of learning”

Superheroes? Dataheroes!

Jake Porway of DataKind gave an inspiring talk comparing statisticians and data scientists to superheroes. Hear the story of how “the data scientists, statisticians, analysts were able to bend data to their will” and how these powers are being used for good or for awesome:

(Hat Tip: FlowingData.com)

Jake’s comment that “you have extraordinary powers that ordinary people don’t have” reminds me of Andrew Gelman’s suggestion that “The next book to write, I guess, should be called, not Amazing Numbercrunchers or Fabulous Stat-economists, but rather something like Statistics as Your Very Own Iron Man Suit.”

Links to the statistics / data science volunteering opportunities Jake mentioned:

I also recommend Statistics Without Borders, with more of an international health focus. And if you’re here in Washington DC, Data Community DC and the related meetups are a great resource too.

Edit: Current students could also see if there is a Statistics in the Community (StatCom) Network branch at their university.

Statistics contests

Are you familiar with Kaggle? It’s a website for hosting online data-analysis contests, like smaller-scale versions of the Netflix Prize contest.
The U.S. Census Bureau is now hosting a Kaggle contest, asking statisticians and data scientists to help predict mail return rates on surveys and census forms (more info at census.gov and kaggle.com). The ability to predict return rates will help the Census Bureau target its outreach efforts and interview followup (phone calls and door-to-door interviews) more efficiently. So you could win a prize and make the government more efficient, all at the same time! 🙂
The contest ends on Nov 1st, so you still have 40 days to compete.

If you prefer making videos to crunching numbers, there’s also a video contest to promote the International Year of Statistics for 2013. Help people see how statistics makes the world better, impacts current events, or gives you a fun career, and you may win a prize and be featured on their website all next year. There are special prizes among non-English-language videos and among entrants under 18 years old.
Submissions are open until Oct 31st, just a day before the Census Challenge.