But I do think there is one core thing that differentiates Statisticians from these others. Having an interest in this is why you might choose to major in statistics rather than applied math, machine learning, etc. And it’s the reason you might hire a trained statistician rather than someone else fluent with data:

Statisticians use the idea of **variability due to sampling** to design good data collection processes, to quantify uncertainty, and to understand the statistical properties of our methods.

When applied statisticians design an experiment or a survey, they account for the inherent randomness and try to control it. They plan the study in such a way as to make your estimates/predictions as accurate as possible for the sample size you can afford. And when they analyze the data, alongside each estimate they report its precision, so you can decide whether you have enough evidence or whether you still need further study. For more complex models, they also worry about overfitting: can this model generalize well to the population, or is it too complicated to estimate with this sample, so that it's just fitting noise?

When theoretical statisticians invent a new estimator, they study how well it’ll perform over repeated sampling, under various assumptions. They study its statistical properties first and foremost. Loosely speaking: How variable will the estimates tend to be? Will they be biased (i.e. tend to always overestimate or always underestimate)? How robust will they be to outliers? Is the estimator consistent (as the sample size grows, does the estimate tend to approach the true value)?

These are not the only important things in working with data, and they’re not the only things statisticians are trained to do. But (as far as I can tell) they are a much deeper part of the curriculum in statistics training than in any other field. Statistics is their home. Without them, you can often still be a good data analyst but a poor statistician.

Certainly we need to do a better job of selling these points. (I don’t agree with everything in this article, but it really is a shame when the NSF invites 100 experts to a Big Data conference but does not include a single statistician.) But maybe it’s not really a problem that ML and Data Science are “eating our lunch.” These days there are many situations that don’t require solid understanding of statistical concepts & properties—situations where “generalizing from sample to population” isn’t the hard part:

- In some Big Data situations, you literally have **all** the data. There’s no sampling going on. If you just need descriptive summaries of what happened in the past, you have the full population—no need for a statistician to quantify uncertainty.

[*Edit: Some redditors misunderstood my point here. Yes, there are many cases where you still want statistical inference on population data (about the future, or about what else might have happened); but that’s not what I mean here. An example might help. Lawyers in a corporate fraud case may have a digital file containing every single relevant financial record, so they can literally analyze all the data. There’s no worry here that this population is a random sample from some abstract superpopulation. You just summarize what the defendant did, not what they might have done but didn’t.*]

- In other Big Data cases, you care not about the past but about estimates that’ll generalize to future data. If your sample is huge, and your data collection isn’t biased somehow, then the statistical uncertainty due to sampling will be negligible. Again, any data analyst will do—no need for statistical training.
- Other times, you don’t want parameter estimates—you need predictions. In the Netflix Prize or most Kaggle contests, you build a model on training data and evaluate your predictions’ performance on held-out test data. If both datasets are huge, then again, sampling variation may be a minor concern; you may not need to worry much about overfitting; and it really is okay to try a zillion complex, uninterpretable, black-box models and choose the one with the best score on the test data. Cross-validation or hold-out validation may be the only statistical hammer you need for every such nail.
- Finally, there are some hard problems (web search results, speech recognition, natural language translation) involving immediate give-and-take with a human, where it’s frankly okay for the model to make a lot of “mistakes.” If Google doesn’t return the very best search result for my query on page 1, I can look further or edit my query. If my speech recognition software makes a mistake, I can try again enunciating more clearly, or I can just type the word directly. Quantifying and controlling such a model’s randomness and errors would be useful, but not critical.
- Plus, there have always been problems better suited to mathematical modeling, where the uncertainty is more about how a complicated deterministic model turns its inputs to outputs. There, instead of statistical analysis you’d want sensitivity analysis, which is not usually part of our core training.

Yes, in most of these cases a statistician would do well, but so would the other flavors of data analyst. The statistician would be most valuable at the start, in setting up the data collection process, rather than in the actual analysis.

On the other hand, when sampling is expensive and difficult, and if you care about interpretable estimates rather than black-box predictions, you can’t beat statisticians.

- What does the Census Bureau need? Someone who can design a giant nationwide survey to be as cost-effective as possible, learning as much as we can about the nation’s citizens (including breakdowns by small geographic and demographic groups) without overspending taxpayers’ money. Who does it hire? Statisticians.
- What does the FDA need? Someone who can design a clinical trial that’ll stop as soon as the evidence in favor of or against the new drug/procedure is strong enough, so that as few patients as possible are exposed to a bad new drug or denied an effective new treatment. Who does it hire? Statisticians.
- Statisticians also work on a different kind of Big Data: small-ish samples but with high dimensionality. In genetics, each person’s genome is a huge dataset, even if you only have the genomes of a relatively small number of people with the disease you’re studying. Naive data mining will find a zillion spurious associations, and too often such results get published… but it doesn’t actually advance the scientific understanding of which genes really do what. A statistician’s humility (we’re not confident about these associations yet and need further study) is better than asserting unfounded, possibly harmful claims.

Finally, there are plenty of cases in between. The data’s already been collected; it’s hard to know how important the sampling variability will be; or maybe you just need to make a decision quickly, even if there’s not enough data to have strong evidence. I can imagine that in business analytics, you’d be inclined to hire the data scientist (who’ll confidently tell you “We crunched the numbers!”) over the buzzkill statistician (who’ll tell you “Still not enough evidence…”), and the market is so unpredictable that it’s hard to tell afterwards who was right anyway.

Now, I’d love it if all statisticians had broader training in other topics, including the ones that machine learning and data science have claimed for themselves. Hadley Wickham’s recent interview points out:

He observed during his statistics PhD that there was a “total disconnect between what people need to actually understand data and what was being taught.” Unlike the statisticians who were focused on abstruse ramifications of the central limit theorem, Wickham was in the business of making data analysis easier for the public.

Indeed, in many traditional statistics departments, you’d have trouble getting funded to study data analysis from a usability standpoint, even though it’s an extremely important and valuable topic of study.

But if the new Data Science departments that are popping up claim this topic, I don’t see anything wrong with that. If academic Statistics departments keep chugging away at understanding estimators’ statistical properties, that’s fine; somebody needs to be doing it. However, if Statistics departments drop the mantle of studying sampling variation, and nobody else picks it up, that’d be a real loss.

I love my department at CMU, but sometimes I wonder if we’re chasing these other data science fields too much. We only offer one class each on survey sampling and on experimental design, both at the undergrad level and never taken by our grad students. Our course on Convex Optimization was phenomenal, but we almost never discussed the statistical properties of the crazy models we fit (not even to point out that you may as well stop optimizing once the numerical precision is within your statistical precision—you don’t need predictions optimized to 7 decimal places if the standard error is at 1 decimal place.)

I am contacting you on behalf of the website Wikiprogress, which is currently running a Data Visualization Contest, with the prize of a paid trip to Mexico to attend the 5th OECD World Forum in Guadalajara in October this year. Wikiprogress is an open-source website, hosted by the OECD, to facilitate the exchange of information on well-being and sustainability, and the aim of the competition is to encourage participants to use well-being measurement in innovative ways to a) show how data on well-being give a more meaningful picture of the progress of societies than more traditional growth-oriented approaches, and b) to use their creativity to communicate key ideas about well-being to a broad audience.

After reading your blog, I think that you and your readers might be interested in this challenge. The OECD World Forums bring together hundreds of change-makers from around the world, from world leaders to small, grassroots projects, and the winners will have their work displayed and will be presented with a certificate of recognition during the event.

You can also visit the competition website here: http://bit.ly/1Gsso2y

It does sound like a challenge that might intrigue this blog’s readers:

- think about how to report human well-being, beyond traditional measures like GDP;
- find relevant good datasets (“official statistics” or otherwise);
- visualize these measures’ importance or insightful trends in the data; and
- possibly win a prize trip to the next OECD forum in Guadalajara, Mexico to network with others who are interested in putting data, statistics, and visualization to good use.

[*Edit: R code examples and results have been revised based on Nicholas Nagle’s comment below and advice from Ryan Tibshirani.*]

If possible, John says, you should just ask your scientific computing software to directly solve the linear system Ax = b. This is often faster and more numerically accurate than computing the matrix inverse of A and then computing x = A⁻¹b.

We’ll chug through a computation example below, to illustrate the difference between these two methods. But first, let’s start with some context: a common statistical situation where you may **think** you need matrix inversion, even though you really don’t.

[*One more edit: I’ve been guilty of inverting matrices directly, and it’s never caused a problem in my one-off data analyses. As Ben Klemens comments below, this may be overkill for most statisticians. But if you’re writing a package, which many people will use on datasets of varying sizes and structures, it may well be worth the extra effort to use solve or QR instead of inverting a matrix if you can help it.*]

(Be aware that the x above (the unknown vector in traditional linear systems notation Ax = b), and the X below (the predictor matrix in traditional regression notation), play totally different roles! It’s a shame that these notations conflict.)

**Statistical context: linear regression**

First of all, I’m used to reading and writing mathematical/statistical formulas using inverted matrices. In statistical theory courses, for example, we derive the equations behind linear regression this way all the time. If your regression model is y = Xβ + ε, with independent errors of mean 0 and variance σ², then textbooks usually write the ordinary least squares (OLS) solution as β̂ = (XᵀX)⁻¹Xᵀy.

Computing β̂ directly like this in R would be

`beta = solve(t(X) %*% X) %*% (t(X) %*% y)`

while in MATLAB it would be

`beta = inv(X' * X) * (X' * y)`

This format is handy for deriving properties of OLS analytically. But, as John says, it’s often not the best way to compute β̂. Instead, rewrite this equation as (XᵀX)β = Xᵀy and then get your software to solve for β.

In R, this would be

`beta = solve(t(X) %*% X, t(X) %*% y)`

or in MATLAB, it would be

`beta = (X' * X) \ (X' * y)`

(Occasionally, we do actually care about the values inside an inverted matrix. Analytically we can show that, in OLS, the variance-covariance matrix of the regression coefficients is σ²(XᵀX)⁻¹. If you really need to **report** these variances and covariances, I suppose you really will have to invert the matrix. But even here, if you only need them temporarily as **input** to something else, you can probably compute that “something else” directly without matrix inversion.)
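If you do need to report the coefficient variances, a middle ground is to invert XᵀX via its Cholesky factor rather than a generic `solve(A)`. Here is a small sketch with simulated data (my own toy example; the names like `beta_true` are mine, not from John's post):

```r
# Sketch (my own toy example): coefficient standard errors via a
# Cholesky-based inverse of X'X, checked against lm().
set.seed(1)
n <- 100; p <- 3
X <- cbind(1, matrix(rnorm(n * (p - 1)), n))
beta_true <- c(2, -1, 0.5)
y <- drop(X %*% beta_true + rnorm(n))

XtX <- crossprod(X)                      # t(X) %*% X, computed stably
beta_hat <- solve(XtX, crossprod(X, y))  # solve, don't invert
resid <- y - drop(X %*% beta_hat)
sigma2_hat <- sum(resid^2) / (n - p)

# chol2inv() recovers (X'X)^{-1} from its Cholesky factor
vcov_hat <- sigma2_hat * chol2inv(chol(XtX))
se <- sqrt(diag(vcov_hat))

# Same standard errors as lm():
fit <- lm(y ~ X - 1)
all.equal(unname(se), unname(coef(summary(fit))[, "Std. Error"]))
```

`chol2inv(chol(XtX))` exploits the symmetry and positive-definiteness of XᵀX, which a generic inverse ignores.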

**Numerical example of problems with matrix inversion**

The MATLAB documentation for `inv` has a nice example comparing computation times and accuracies for the two approaches.

Reddit commenter five9a2 gives an even simpler example in Octave (also works in MATLAB).

Here, I’ll demonstrate five9a2’s example in R. We’ll use their same notation of solving the system Ax = b (rather than the regression example’s notation). We’ll let A be a 7×7 Hilbert matrix. The Hilbert matrices, with elements H[i,j] = 1/(i+j−1), are known to be poorly conditioned [1] and therefore to cause trouble with matrix inversion.
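(A quick base-R aside of my own, not part of five9a2's example: you can watch the conditioning deteriorate as n grows, gaining roughly three orders of magnitude for every two extra rows.)

```r
# Base-R Hilbert matrix: H[i,j] = 1/(i+j-1)
hilbert <- function(n) outer(1:n, 1:n, function(i, j) 1 / (i + j - 1))

# Exact 2-norm condition numbers grow explosively with n:
# n = 3 is about 5e2; n = 5 about 5e5; n = 7 about 5e8
sapply(c(3, 5, 7), function(n) kappa(hilbert(n), exact = TRUE))
```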

Here’s the R code and results, with **errors** and **residuals** defined as in the MATLAB example:

```r
set.seed(13052015)
options(digits = 3)
library(Matrix)
library(matrixcalc)

# Set up a linear system Ax=b,
# and compare
#   inverting A to compute x=inv(A)b
# vs
#   solving Ax=b directly

# Generate the 7x7 Hilbert matrix
# (known to be poorly conditioned)
n = 7
(A = as.matrix(Hilbert(n)))
##       [,1]  [,2]  [,3]  [,4]   [,5]   [,6]   [,7]
## [1,] 1.000 0.500 0.333 0.250 0.2000 0.1667 0.1429
## [2,] 0.500 0.333 0.250 0.200 0.1667 0.1429 0.1250
## [3,] 0.333 0.250 0.200 0.167 0.1429 0.1250 0.1111
## [4,] 0.250 0.200 0.167 0.143 0.1250 0.1111 0.1000
## [5,] 0.200 0.167 0.143 0.125 0.1111 0.1000 0.0909
## [6,] 0.167 0.143 0.125 0.111 0.1000 0.0909 0.0833
## [7,] 0.143 0.125 0.111 0.100 0.0909 0.0833 0.0769

# Generate a random x vector from N(0,1)
x = rnorm(n)

# Find the corresponding b
b = A %*% x

# Now solve for x, both ways,
# and compare computation times
system.time({xhat_inverting = solve(A) %*% b})
##    user  system elapsed
##   0.002   0.002   0.032
system.time({xhat_solving = solve(A, b)})
##    user  system elapsed
##   0.001   0.000   0.001

# Compare errors: sum of squared (x - xhat)
(err_inverting = norm(x - xhat_inverting))
## [1] 2.44e-07
(err_solving = norm(x - xhat_solving))
## [1] 1.56e-08

# Compare residuals: sum of squared (b - bhat)
(res_inverting = norm(b - A %*% xhat_inverting))
## [1] 1.55e-08
(res_solving = norm(b - A %*% xhat_solving))
## [1] 2.22e-16
```

As you can see, even with a small Hilbert matrix:

- inverting takes more **time** than solving;
- the **error** in x when solving Ax=b directly is a little smaller than when inverting; and
- the **residuals** in the estimate of b when solving directly are many orders of magnitude smaller than when inverting.

**Repeated reuse of QR or LU factorization in R**

Finally: what if, as John suggests, you have to solve Ax=b for many different b’s? How do you encode this in R without inverting A?

I don’t know the best, canonical way to do this in R. However, here are two approaches worth trying: the QR decomposition and the LU decomposition. These are two ways to decompose the matrix A into factors with which it should be easier to solve Ax=b. (There are other decompositions too—many more than I want to go into here.)

QR decomposition is included in base R. You use the function `qr` once to create a decomposition, store the Q and R matrices with `qr.Q` and `qr.R`, then use a combination of `backsolve` and matrix multiplication to solve for x repeatedly using new b’s. (Q is chosen to be orthogonal, so we know its inverse is just its transpose. This avoids the usual problems with matrix inversion.)

For the LU decomposition, we can use the `matrixcalc` package. (Thanks to sample code on the Cartesian Faith blog.)

Imagine rewriting our problem Ax=b as LUx=b, then defining y=Ux, so that we can solve it in two stages: first solve Ly=b for y, then Ux=y for x. We can collapse this in R into a single line, in the form

`x = backsolve(U, forwardsolve(L, b))`

once we have decomposed A into L and U.

```r
# What if we need to solve Ax=b for many different b's?
# Compute the A=QR or A=LU decomposition once,
# then reuse it with different b's.

# Just once, on the same b as above:

# QR, Ryan's suggestion:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  qr_R = qr.R(qr_decomp)
  qr_Qt = t(qr.Q(qr_decomp))
  xhat_qr = backsolve(qr_R, qr_Qt %*% b)
})
##    user  system elapsed
##   0.013   0.000   0.014
(err_qr = norm(x - xhat_qr))
## [1] 5.78e-08
(res_qr = norm(b - A %*% xhat_qr))
## [1] 9.44e-16

# LU, Nicholas' suggestion:
system.time({
  lu_decomp = lu.decomposition(A)
  xhat_lu = backsolve(lu_decomp$U, forwardsolve(lu_decomp$L, b))
})
##    user  system elapsed
##   0.003   0.000   0.016
(err_lu = norm(x - xhat_lu))
## [1] 3.01e-08
(res_lu = norm(b - A %*% xhat_lu))
## [1] 3.33e-16
```

Both the QR and LU decompositions’ errors and residuals are comparable to the use of `solve` earlier in this post, and much smaller than inverting A directly.

However, what about the timing? Let’s create many new b vectors and time each approach: QR, LU, direct solving from scratch (without factorizing A), and inverting A:

```r
# Reusing decompositions with many new b's:
m = 1000
xs = replicate(m, rnorm(n))
bs = apply(xs, 2, function(xi) A %*% xi)

# QR, Ryan's suggestion:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  qr_R = qr.R(qr_decomp)
  qr_Qt = t(qr.Q(qr_decomp))
  xhats_qr = apply(bs, 2, function(bi) backsolve(qr_R, qr_Qt %*% bi))
})
##    user  system elapsed
##   0.036   0.000   0.036

# LU, Nicholas' suggestion:
system.time({
  lu_decomp = lu.decomposition(A)
  xhats_lu = apply(bs, 2, function(bi)
    backsolve(lu_decomp$U, forwardsolve(lu_decomp$L, bi)))
})
##    user  system elapsed
##   0.094   0.001   0.095

# Compare to repeated use of solve()
system.time({
  xhats_solving = apply(bs, 2, function(bi) solve(A, bi))
})
##    user  system elapsed
##   0.082   0.000   0.082

# Compare to inverting A directly
system.time({
  Ainv = solve(A)
  xhats_inverting = apply(bs, 2, function(bi) Ainv %*% bi)
})
##    user  system elapsed
##    0.01    0.00    0.01
```

It’s not surprising that inverting A is the fastest… but as we saw above, it’s also the least accurate, by far. Luckily, a well-done QR is almost as fast, and far more accurate.

However, I’m surprised that ~~QR and LU are both~~ ~~much~~ LU is a little slower than direct use of `solve(A, b)`. ~~They are~~ LU is supposed to save you work that `solve(A, b)` has to redo for every new b.

Please comment if you know why this happened! Did I make a mistake? Or does LU ~~and QR~~ give you speedups over `solve(A, b)` only for much larger A matrices?

[*Edit: Nicholas’ comment below and an email from Ryan Tibshirani show how to speed up the QR and LU approaches considerably. Even so, ~~both are~~ LU is still slower than using solve directly. See [2] for the earlier (slower) QR and LU code and results.*]

Finally, see this Revolutions post on R and Linear Algebra for more on matrix manipulation in R. They mention dealing with giant and/or sparse matrices, which is also the last situation described in John Cook’s blog post.

[1] (I usually hear of a “poorly conditioned” matrix, meaning one with a high “condition number,” being defined in terms of the ratio of largest to smallest eigenvalues. However, this nice supplement on Condition Numbers from Lay’s *Linear Algebra* has a more general definition on p.4: if A is invertible, the condition number is the norm of A times the norm of A’s inverse. This is the same as the ratio of largest to smallest eigenvalues if you’re using the spectral norm… but the general definition is more interpretable for beginners who haven’t studied eigenvalues yet, since you can use other simpler matrix norms instead.)
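(To see that the two definitions agree under the spectral norm, here is a quick base-R check of my own, not from Lay's supplement:)

```r
# A small symmetric positive-definite example:
A <- matrix(c(2, 1, 1, 3), 2, 2)

# General definition: condition number = ||A|| * ||inv(A)||.
# With the spectral norm ("2"), this matches the ratio of largest
# to smallest singular values that kappa(exact = TRUE) computes.
k1 <- norm(A, "2") * norm(solve(A), "2")
k2 <- kappa(A, exact = TRUE)
all.equal(k1, k2)
```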

[2] (Here are my original approaches to QR and LU in R, using `solve` instead of the special cases `forwardsolve` and `backsolve`.)

```r
# OLD VERSIONS OF QR AND LU
# (before using Nicholas's suggested improvements)

# For a single b:

# QR, first attempt:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  xhat_qr = qr.coef(qr_decomp, b)
})
##    user  system elapsed
##   0.004   0.000   0.003
(err_qr = norm(x - xhat_qr))
## [1] 4.18e-08
(res_qr = norm(b - A %*% xhat_qr))
## [1] 6.11e-16

# QR, Nicholas' suggestion:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  R = qr.R(qr_decomp)
  xhat_qr = backsolve(R, qr.qty(qr_decomp, b))
})
##    user  system elapsed
##   0.009   0.000   0.023
(err_qr = norm(x - xhat_qr))
## [1] 4.18e-08
(res_qr = norm(b - A %*% xhat_qr))
## [1] 6.11e-16

# LU, first attempt:
system.time({
  lu_decomp = lu.decomposition(A)
  xhat_lu = solve(lu_decomp$U, solve(lu_decomp$L, b))
})
##    user  system elapsed
##   0.002   0.000   0.003
(err_lu = norm(x - xhat_lu))
## [1] 3.99e-08
(res_lu = norm(b - A %*% xhat_lu))
## [1] 1.11e-16

# For many b's:

# QR, first attempt:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  xhats_qr = apply(bs, 2, function(bi) qr.coef(qr_decomp, bi))
})
##    user  system elapsed
##   0.179   0.002   0.182

# QR, Nicholas' suggestion:
system.time({
  qr_decomp = qr(A, tol = 1e-10)
  qr_R = qr.R(qr_decomp)
  xhats_qr = apply(bs, 2, function(bi) backsolve(qr_R, qr.qty(qr_decomp, bi)))
})
##    user  system elapsed
##   0.157   0.000   0.159

# LU, first attempt:
system.time({
  lu_decomp = lu.decomposition(A)
  xhats_lu = apply(bs, 2, function(bi)
    solve(lu_decomp$U, solve(lu_decomp$L, bi)))
})
##    user  system elapsed
##   0.167   0.001   0.173
```

My goal was to introduce two principled frameworks for thinking about data visualization: **human visual perception** and the **Grammar of Graphics**.

(We also covered some relevant R packages: `RColorBrewer`, `directlabels`, and a gentle intro to `ggplot2`.)

These are not the only “right” approaches, nor do they guarantee your graphics will be good. They are just useful tools to have in your arsenal.

The talk was also a teaser for my upcoming fall course, *36-721: Statistical Graphics and Visualization* [draft syllabus pdf].

Here are my slides.

The talk was quite interactive, so the slides aren’t designed to stand alone. Open the slides and follow along using my notes below.

(Answers are intentionally in white text, so you have a chance to think for yourself before you highlight the text to read them.)

If you want a deeper introduction to dataviz, including human visual perception, Alberto Cairo’s *The Functional Art* [website, amazon] is a great place to start.

For a more thorough intro to `ggplot2`, see creator Hadley Wickham’s own presentations at the bottom of this page.

(Apologies also to the National Statistical Service of the Republic of Armenia for using their plots on slides 4, 6, and 12. They are a group of skilled people working hard under challenging conditions (including the need to show 3 languages on most reports and graphs!). I hope they do not mind me using a few of their graphics as starting points for discussing redesigns.)

**Framework 1: human visual perception**

- (2) How many 6s can you find in this image? How long does it take you?
- (3) To compare numeral shapes alone, you have to apply conscious attention, thinking slowly. But the human brain is amazingly efficient at grouping and comparing items with contrasting colors, automatically, before the image even reaches your conscious attention. Whenever possible, your graphics should make use of the brain’s **preattentive processing** to simplify the task and to help viewers see the structure in a flash.
- (4) Consider this idea of preattentive processing. What makes this graphic difficult or slow to read, and how could it be improved? [Some answers: legend is far from plot; year-to-year comparisons are difficult; pie slice angles are hard to compare; order of slices is uninformative]
- (5) One possible redesign [year-to-year comparisons are shown directly; categories labeled directly, not with a legend; y-axis positions are easier to compare than angles; colors now provide meaning (blue for increase, red for decrease)]
- (6) What could be improved? [legend is far from plot; similar colors are hard to distinguish; semi-alphabetical ordering is uninformative; comparing marriage to divorce rates within a region is hard]
- (7) One possible redesign [direct labels; informative sorting by marriage rate]
- (8) This is not an exhaustive explanation of visual perception and preattentive processing. But using that framework, here are a few principles you can apply directly when designing graphics. Next we’ll talk more about the first two bullets and how to use them in R.
- (9) Think for a moment: How would you choose a color scheme for its usability? What would you need to know about the color palette?
- (10) Cynthia Brewer and colleagues at Penn State do research into usable color palettes (for cartography, but also useful for other graphics). Their findings are summarized pragmatically on the ColorBrewer website. Play around with the site. Most of these palettes are easily accessed within R using the `RColorBrewer` package.
- (11) Start R and play with the first half of my code, to see examples of `RColorBrewer` and `directlabels` in use.

The dataset is a small subset of the NHANES 2011-2012 survey. This kind of data is used to create those growth percentile charts you see at the doctor’s office, when your baby gets weighed and measured to see whether the child’s growth is in a normal range. My wife and I have been seeing a lot of these lately.

**Framework 2: The Grammar of Graphics**

- (12) What could be improved? [legend far from plotted values; axis/scale also far and misaligned from data; graphic shows volume, but the data is actually mapped to height]
- (13) One possible redesign [show bar heights directly without the confusing use of volume; informative sorting; direct labels]
- (14) GoG is principled because it cannot do “ungrammatical” things, like the plot on slide (12) which misleadingly shows changing volumes that do not represent a data variable. On the other hand, it’s more flexible than (say) Excel’s hard-wired templates. GoG lets you specify the graph you need from the ground up. Leland Wilkinson developed this Grammar of Graphics idea and wrote a great book about it [amazon, my review]. This influential concept has been implemented many times, serving as the basis for the data visualization tools in Tableau, SPSS, JMP, D3.js, and (as `ggplot2`) R.
- (15) What are the aes, stat, geom, facet for slide (13)? [aes: service maps to position on x-axis, percent maps to position on y-axis; stat: identity; geom: bar; facet: none]
- (16) What are the aes, stat, geom, facet here? [original charts from WHO for Boys and for Girls] [aes: age maps to position on x-axis, length maps to position on y-axis, quantile maps to color; stat: quantiles (3, 15, 50, 85, and 97%); geom: line; facet: gender]
- (17) What are the aes, stat, geom, facet here? [aes: weight maps to position on x-axis, length maps to position on y-axis, gender maps to color and shape; stat: median; geom: point; facet: age] (This example isn’t perfect, because each month also shows previous months’ data)
- (18) Go back to R and play with the second half of my code, to see examples of making similar baby-growth plots in `ggplot2`.
- (19) We discussed other plots that could be made with these commands, including a few variations that show all 6 variables at once (including both the raw data and overlaid statistical summaries). I’m not saying these are great, insightful plots—just showing the flexibility of `ggplot2` as a tool. I also find that working with `ggplot2` is very similar to coming up with a statistical regression model. Say we use facets to subgroup the data by gender and race/ethnicity, and the race-facets look very similar but the gender-facets clearly differ. That suggests our regression model should probably include a term for gender, but it’s probably OK to omit the race/ethnicity terms.
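As a minimal sketch of that aes/stat/geom/facet vocabulary in `ggplot2` code (simulated toy data of my own, not the NHANES subset from the talk):

```r
library(ggplot2)

# Toy stand-in for the growth data: age and length by gender
set.seed(42)
d <- data.frame(
  age    = rep(1:12, times = 2),
  length = c(50 + 2.0 * (1:12) + rnorm(12),
             49 + 2.1 * (1:12) + rnorm(12)),
  gender = rep(c("Boys", "Girls"), each = 12)
)

# aes: age -> x, length -> y; stat: identity; geom: line; facet: gender
p <- ggplot(d, aes(x = age, y = length)) +
  geom_line() +
  facet_wrap(~ gender)
```

Swapping `geom_line()` for `geom_point()`, or faceting on a different variable, is a one-line change—exactly the flexibility the GoG gives you.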

I was glad to hear some audience members thought this was a good intro to `ggplot2`. I tried to keep it simple by using just a few limited commands, reusing the same dataset over and over, and not bothering with the `qplot` command (which I find gives you the wrong idea about how the GoG works).

If nothing else, I think BASP did a great job of starting a discussion on p-values, and more generally, the role of statistical inference in certain types of research. Stepping back a bit, I think the discussion fits into a broader question of how we deal with answers that are inherently grey, as opposed to clear cut. Hypothesis testing, combined with traditional cutoff values, is a neat way to get a yes/no answer, but many reviewers want a yes/no answer, even in the absence of hypothesis tests.

As one example, I recently helped a friend in psychology to validate concepts measured by a survey. In case you haven’t done this before, here’s a quick (and incomplete) summary of construct validation: based on substantive knowledge, group the questions in the survey into groups, each of which measures a different underlying concept, like positive attitude, or negativity. The construct validation question is then, “Do these groups of questions actually measure the concepts I believe they measure?”

In addition to making sure the groups are defensible based on their interpretation, you usually have to do a quantitative analysis to get published. The standard approach is to model the data with a structural equation model (as a side note, this includes confirmatory factor analysis, which is not factor analysis!). The goodness-of-fit statistic is useless in this context, because the null hypothesis is not aligned with the scientific question, so people use a variety of heuristics, or fit indices, to decide whether the model fits. The model is declared to either fit or not fit (and consequently the construct is either valid or not valid) depending on whether the fit index is larger or smaller than a rule-of-thumb value. This is the same mentality as hypothesis testing.

Setting aside the question of whether it makes sense to use structural equation models to validate constructs, the point I’m trying to make is that the p-value mentality is not restricted to statistical inference. Like any unsupervised learning situation, it’s very difficult to say how well the hypothesized groups measure the underlying constructs (or if they even exist). Any answer is inherently grey, and yet many researchers want a yes/no answer. In these types of cases, I think it would be great if statisticians could help other researchers come to terms not just with the limits of the statistical tools, but with the inquiry itself.

I agree with Brian that we can all do a better job of helping our collaborators to think statistically. Statistics is not just a set of arbitrary yes/no hoops to jump through in the process of publishing a paper; it’s a kind of applied epistemology. As tempting as it might be to just ban all conclusions entirely, we statisticians are well-trained in probing what can be known and how that knowledge can be justified. Give us the chance, and we’d love to help you navigate the subtleties, limits, and grey areas in your research!

I’ve just heard about Nathan’s computer game project, DotCity. It sounds like a statistician’s minimalist take on SimCity, with a special focus on demographic shifts in your population of dots (baby booms, aging, etc.). Furthermore, he’s planning to program the internals using R.

Consider backing the game on Kickstarter (through July 8th). I’m supporting it not just to play the game itself, but to see what Nathan learns from the development process. How do you even begin to write a game in R? Will gamers need to have R installed locally to play it, or will it be running online on something like an RStudio server?

Meanwhile, do you know of any other statistics-themed computer games?

- I missed the boat on backing Timmy’s Journey, but happily it seems that development is going ahead.
- SpaceChem is a puzzle game about factory line optimization (and not, actually, about chemistry). Perhaps someone can imagine how to take it a step further and gamify statistical process control à la Shewhart and Deming.
- It’s not exactly stats, but working with data in textfiles is an important related skill. The Command Line Murders is a detective noir game for teaching this skill to journalists.
- The command line approach reminds me of Zork and other old text adventure / interactive fiction games. Perhaps, using a similar approach to the step-by-step interaction of swirl (“Learn R, in R”), someone could make an I.F. game about data analysis. Instead of OPEN DOOR, ASK TROLL ABOUT SWORD, TAKE AMULET, you would type commands like READ TABLE, ASK SCIENTIST ABOUT DATA DICTIONARY, PLOT RESIDUALS… all in the service of some broader story/puzzle context, not just an analysis by itself.
- Kim Asendorf wrote a fictional “short story” told through a series of data visualizations. (See also FlowingData’s overview.) The same medium could be used for a puzzle/mystery/adventure game.

It was also my first semester as a dad. Exhilarating, joyful, and exhausting. So, time was freed up by having less coursework, but it was reallocated largely towards diapering and sleep. Still, I did start on a new research project, about which I’m pretty excited.

Our department was also recognized as one of the nation’s fastest-growing statistics departments. I got to see some of the challenges with this first-hand as a TA for a huge 200-student class.

See also my previous posts on the 1st, the 2nd, and the 3rd semester of my Statistics PhD program.

Classes:

**Statistical Computing:**

This was a revamped, semi-required, half-semester course, and we were the guinea pigs. I found it quite useful. The revamp was spearheaded by our department chair Chris Genovese, who wanted to pass on his software engineering knowledge/mindset to the rest of us statisticians. This course was not just “how to use R” (though we did cover some advanced topics from Hadley Wickham’s new books *Advanced R* and *R Packages*; and it got me to try writing homework assignment analyses as R package vignettes).

Rather, it was a mix of pragmatic coding practices (using version control such as Git; writing and running unit tests; etc.) and good-to-know algorithms (hashing; sorting and searching; dynamic programming; etc.). It’s the kind of stuff you’d pick up on the job as a programmer, or in class as a CS student, but not necessarily as a statistician even if you write code often.
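As a small taste of the algorithms side, here is a dynamic-programming sketch of my own (not an actual course assignment): a memoized Fibonacci in R, where caching each value turns an exponential-time recursion into a linear-time one.

```r
# Memoization: keep a cache of already-computed Fibonacci numbers
# in an enclosing environment, so each value is computed only once.
fib_memo = local({
  cache = c(1, 1)  # fib(1) and fib(2)
  function(n) {
    if (n <= length(cache)) return(cache[n])
    ans = fib_memo(n - 1) + fib_memo(n - 2)
    cache[n] <<- ans  # store for future calls
    ans
  }
})

fib_memo(30)  # 832040, computed with only ~30 additions
```

The naive recursion recomputes the same subproblems exponentially often; the cache (updated via `<<-` in the closure’s environment) is exactly the dynamic-programming trick the course drilled into us.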

The homework scheme was nice in that we could choose from a large set of assignments. We had to do two per week, but could do them in any order—so you could do several on a hard topic you really wanted to learn, or pick an easy one if you were having a rough week. The only problem is that I never had to practice certain topics if I wanted to avoid them. I’d like to try doing this as an instructor sometime, but I’d want to control my students’ coverage a bit more tightly.

This fall, Stat Computing becomes an actually-required, full-semester course and will be cotaught by my classmate Alex Reinhart.

**Convex Optimization:**

Another great course with Ryan Tibshirani. Tons of work, with fairly long homeworks, but I also learned a huge amount of very practical stuff, both theory (how to prove a certain problem is convex? how to prove a certain optimization method works well?) and practice (which methods are likely to work on which problems?).

My favorite assignments were the ones in which we replicated analyses from recent papers. A great way to practice your coding, improve your optimization, and catch up with the literature all at once. One of these homeworks actually inspired in me a new methodological idea, which I’ve pursued as a research project.

Ryan’s teaching was great as usual. He’d start each class with a review from last time and how it connects to today. There were also daily online quizzes, posted after class and due at midnight, that asked simple comprehension questions—not difficult and not a huge chunk of your grade, but enough to encourage you to keep up with the class regularly instead of leaving your studying to the last minute.

**TAing for Intro to Stat Inference:**

This was the 200-student class. I’m really glad statistics is popular enough to draw such crowds, but it’s the first time the department has had so many folks in the course, and we are still working out how to manage it. We had an army of undergrad- and Masters-level graders for the weekly homeworks, but just three of us PhD-level TAs to grade midterms and exams, which made for several loooong weekends.

I also regret that I often wasn’t at my best during my office hours this semester. I’ll blame it largely on baby-induced sleep deprivation, but I could have spent more time preparing too. I hope the students who came to my sessions still found them helpful.

- Next semester, I’ll be teaching the grad-level data visualization course! It will be heavily inspired by Alberto Cairo’s book and his MOOC. I’m still trying to find the right balance between the theory I think is important (how does the Grammar of Graphics work, and why does it underpin ggplot2, Tableau, D3, etc.? how does human visual perception work? what makes for a well-designed graphic?) vs. the tool-using practice that would certainly help many students too (teach me D3 and Shiny so I can make something impressive for portfolios and job interviews!)

I was glad to hear Scott Murray’s reflections on his recent online dataviz course co-taught with Alberto.

Research:

**Sparse PCA:** I’ve been working with Jing Lei on several aspects of sparse PCA, extending some methodology that he’s developed with collaborators including his wife Kehui Chen (also a statistics professor, just down the street at UPitt). It’s a great opportunity to practice what I’ve learned in Convex Optimization and earlier courses. I admired Jing’s teaching when I took his courses last year, and I’m enjoying research work with him: I have plenty of independence, but he is also happy to provide direction and advice when needed.

We have some nice simulation results illustrating that our method *can* work in an ideal setting, so now it’s time to start looking at proofs of why it *should* work as well as a real dataset to showcase its use. More on this soon, I hope.

Unfortunately, one research direction that I thought could become a thesis topic turned out to be a dead end as soon as we formulated the problem more precisely. Too bad, though at least it’s better to find out now than after spending months on it.

- I still need to finish writing up a few projects from last fall: my ADA report and a Small Area Estimation paper with Rebecca Steorts (now moving from CMU to Duke). I really wish I had pushed myself to finish them before the baby came—now they’ve been on the backburner for months. I hope to wrap them up this summer. Apologies to my collaborators!

Life:

**Being a sDADistician**: Finally, my penchant for terrible puns becomes socially acceptable, maybe even expected—they’re “dad jokes,” after all.

Grad school seems to be a good time to start a family. (If you don’t believe me, I heard it as well from Rob Tibshirani last semester.) I have a pretty flexible schedule, so I can easily make time to see the baby and help out, working from home or going back and forth, instead of staying all day on campus or at the office until late o’clock after he’s gone to bed. Still, it helps to make a concrete schedule with my wife, about who’s watching the baby when. Before he arrived, I had imagined we could just pop him in the crib to sleep or entertain himself when we needed to work—ah, foolish optimism…

It certainly doesn’t work for us both to work from home and be half-working, half-watching him. Neither the work nor the child care is particularly good that way. But when we set a schedule, it’s great for organization & motivation—I only have a chunk of X hours now, so let me get this task DONE, not fritter the day away.

I’ve spent less time this semester attending talks and department events (special apologies to all the students whose defenses I missed!), but I’ve also forced myself to get much better about ignoring distractions like computer games and Facebook, and I spend more of my free time on things that really do make me feel better such as exercise and reading.

**Stoicism:** This semester I decided to really finish the Seneca book I’d started years ago. It is part of a set of philosophy books I received as a gift from my grandparents. Long story short, once I got in the zone I was hooked, and I’ve really enjoyed Seneca’s *Letters to Lucilius* as well as *Practical Philosophy*, a Great Courses lecture series on his contemporaries.

It turns out several of my fellow students (including Lee Richardson) have been reading the Stoics lately too. The name “Stoic” comes from “Stoa,” i.e. porch, after the place where they used to gather… so clearly we need to meet for beers at The Porch by campus to discuss this stuff.

**Podcasts:** This semester I also discovered the joy of listening to good podcasts.

(1) Planet Money is the perfect length for my walk to/from campus, covers quirky stories loosely related to economics and finance, and includes a great episode with a shoutout to CMU’s Computer Science school.

(2) Talking Machines is a more academic podcast about Machine Learning. The hosts cover interesting recent ideas and hit a good balance—the material is presented deeply enough to interest me, but not so deeply I can’t follow it while out on a walk. The episodes usually explain a novel paper and link to it online, then answer a listener question, and end with an interview with a ML researcher or practitioner. They cover not only technical details, but other important perspectives as well: how do you write a ML textbook and get it published? how do you organize a conference to encourage women in ML? how do you run a successful research lab? Most of all, I love that they respect statisticians too and in fact, when they interview the creator of The Automatic Statistician, they probe him on whether this isn’t just going to make the data-fishing problem worse.

(3) PolicyViz is a new podcast on data visualization, with somewhat of a focus on data and analyses for the public: government statistics, data journalism, etc. It’s run by Jon Schwabish, whom I (think I) got to meet when I still worked in DC, and whose visualization workshop materials are a great resource.

- It’s a chore to update R with all the zillion packages I have installed. I found that Tal Galili’s installr manages updates cleanly and helpfully.
- Next time I bake brownies, I’ll add some spices and call them “Chai squares.” But we must ask, of course: what size to cut them for optimal goodness of fit in the mouth?

As I’ve said before, I’m curious about finding better ways to draw maps which simultaneously show numerical estimates **and** their precision or uncertainty.

The April 2015 issue of *Significance* magazine includes a nice example of this [subscription link; PDF], thanks to Michael Wininger. Here is his Figure 2a (I think the labels for the red and blue areas are mistakenly swapped, but you get the idea):

Basically, Wininger is mapping the weather continuously over space, and he overlays two contours: one for where the predicted snowfall **amount** is highest, and another for where the **probability** of snowfall is highest.

I can imagine people would also enjoy an interactive version of this map, where you have sliders for the two cutoffs (how many inches of snow? what level of certainty?). You could also just show more levels of the contours on one static map, by adding extra lines, though that would get messy fast.

I think Wininger’s approach looks great and is easy to read, but it works largely because he’s mapping spatially-continuous data. The snowfall levels and their certainties are estimated at a very fine spatial resolution, unlike say a choropleth map of the average snowfall by county or by state. The other thing that helps here is that certainty is expressed as a probability (which most people can interpret)… not as a measure of spread or precision (standard deviation, margin of error, coefficient of variation, or what have you).

Could this also work on a choropleth map? If you only have data at the level of discrete areas, such as counties… Well, this is not a problem with weather data, but it does come up with administrative or survey data. Say you have survey estimates for the poverty rate in each county (along with MOEs or some other measure of precision). You could still use one color to fill all the counties with high estimated poverty rates. Then use another color to fill all the counties with highly precise estimates. Their overlap would show the areas where poverty is estimated to be high **and** that estimate is very precise. Sliders would let the readers set their own definition of “high poverty” and “highly precise.”
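Here’s a minimal sketch of that classification logic in R, with made-up county names and numbers (no actual map drawing; a real version would join these fill colors onto county polygons, and the two cutoffs would be the reader-adjustable sliders):

```r
# Toy example: classify counties by two cutoffs, then assign each one
# a fill color for an overlay-style choropleth.
counties = data.frame(
  name     = c("Adams", "Brown", "Clay", "Dixon"),
  pov_rate = c(0.22, 0.08, 0.25, 0.19),  # estimated poverty rate
  moe      = c(0.02, 0.01, 0.09, 0.03)   # margin of error for the estimate
)
high_cutoff    = 0.15  # slider 1: what counts as "high poverty"
precise_cutoff = 0.05  # slider 2: what counts as "precise" (small MOE)

high    = counties$pov_rate >= high_cutoff
precise = counties$moe <= precise_cutoff
counties$fill = ifelse(high & precise, "red",   # high poverty, trustworthy estimate
                ifelse(high,           "pink",  # high poverty, but uncertain
                                       "gray")) # not high poverty
counties
```

The “red” counties are the overlap region from the paragraph above: estimated high poverty *and* a precise estimate.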

I might be wrong, but I don’t think I’ve seen this approach before. Could be worth a try.

(I really can’t recommend the series. I enjoyed the first few books in middle school, but in a re-read last year they haven’t stood up to my childhood memories. The first is still fun but a blatant Tolkien ripoff; the rest are plodding and repetitive.)

Readers, can you recommend any good fantasy / sci-fi (or other fiction) that treats stats & math well?

**The Dragon Reborn**

A few of the characters discuss the difference between distributions that show clustering, uniformity, and randomness:

“It tells us it is all too neat,” Elayne said calmly. “What chance that thirteen women chosen solely because they were Darkfriends would be so neatly arrayed across age, across nations, across Ajahs? Shouldn’t there be perhaps three Reds, or four born in Cairhien, or just two the same age, if it was all chance? They had women to choose from or they could not have chosen so random a pattern. There are still Black Ajah in the Tower, or elsewhere we don’t know about. It must mean that.”

She’s suspicious of the very uniform distribution of demographic characteristics in the observed sample of 13 bad-guy characters. If turning evil happens at random, or at least is independent of these demographics, you’d expect some clusters to occur by chance in such a small sample—that’s why statistical theory exists, to help decide if apparent patterns are spurious. And if evil was associated with any demographic, you’d certainly expect to see some clusters. The complete absence of clustering (in fact, we see the opposite: dispersion) looks more like an experimental design, selecting observations that are as different as possible… implying there is a larger population to choose from than just these 13. Nice.
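To put a rough number on her suspicion, here’s a birthday-problem calculation, assuming (hypothetically) that the 13 women were drawn uniformly at random from 14 nations:

```r
# P(no two of 13 random draws share a nation), sampling uniformly
# from 14 equally likely nations: 14/14 * 13/14 * 12/14 * ... * 2/14
n_nations = 14
n_women   = 13
p_no_clusters = prod((n_nations - 0:(n_women - 1)) / n_nations)
p_no_clusters  # about 0.0001
```

So “no two from the same nation” would happen by chance only about once in ten thousand tries—Elayne’s instinct is statistically sound, even before piling on the matching dispersion across ages and Ajahs.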

There are also records of historical hypothesis testing of a magical artifact:

“Use unknown, save that channeling through it seems to suspend chance in some way, or twist it.” She began to read aloud. “‘Tossed coins presented the same face every time, and in one test landed balanced on edge one hundred times in a row. One thousand tosses of the dice produced five crowns one thousand times.'”

That’s a degenerate distribution right there.
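For scale, a back-of-envelope calculation of my own: the log-probability of that dice result under fair chance.

```r
# Probability that 1000 tosses of a fair six-sided die all show the same
# specified face is (1/6)^1000; stay on the log10 scale to avoid underflow.
log10_p = 1000 * log10(1 / 6)
log10_p  # about -778
```

One chance in 10^778—for comparison, the number of atoms in the observable universe is usually estimated at only around 10^80. The artifact really is suspending chance, not just bending it.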

Mat, the lucky-gambler character, also talks of luck going in his favor more often where there’s more randomness: he always wins at dice, usually at card games, and rarely at games like “stones” (basically Go). It’d be good fodder for a short story set in our own world—a character who realizes he’s no brainiac but incredibly lucky and so seeks out luck-based situations. What else could you do, besides the obvious lottery tickets and casinos?

**The Shadow Rising**

I was impressed by Elayne’s budding ability to think like a statistician in the previous book, but she returns to more simplistic thinking in this book. The characters ponder murder motives (p.157):

“They were killed because they talked […] Or to stop them from it […] They might have been killed simply to punish them for being captured […] Three possibilities, and only one says the Black Ajah knows they revealed a word. Since all three are equal, the chances are that they do not know.”

Oh, Elayne. There are well-known problems with the principle of insufficient reason. Your approach to logic may get you into trouble yet.

**Lord of Chaos**

The description of Caemlyn’s chief clerk and census-taker Halwin Norry is hamfisted and a missed opportunity:

Rand … was not certain anything was real to Norry except the numbers in his ledgers. He recited the number of deaths during the week and the price of turnips carted in from the countryside in the same dusty tone, arranged the daily burials of penniless friendless refugees with no more horror and no more joy than he showed hiring masons to check the repair of the city walls. Illian was just another land to him, not the abode of Sammael, and Rand just another ruler.

If anything, Norry sounds like an admirable professional! Official statisticians must be as objective and politically disinterested as possible; otherwise the rulers can make whatever “decisions” they like, but there’ll be no way to carry them out accurately when you don’t know what resources you actually have on hand nor how severe the problem really is. It’d be fascinating to see how Norry actually runs a war-time census—perhaps with scrying help from the local magic users? But here Jordan is just sneering down at him. Such a shame.

**Knife of Dreams**

There are a few ridiculous scenes of White Ajah logicians arguing; I should have noted them down. I’m not sure if Jordan really believes mathematicians and logicians talk like this, or whether his tongue is in cheek and he’s just joking, but man, it’s a grotesque caricature. Someday I’d love to see a popular book describe the kind of arguments mathematicians actually have with each other. But this isn’t it.

No disrespect meant to Martin, his readers, or their families—it’s just a thought exercise that intrigued me, and I figured it may interest other people.

Also, we’ve blogged about GoT and statistics before.

In the Spring a young man’s fancy lightly turns to actuarial tables.

That’s right: Spring is the time of year when the next bloody season of *Game of Thrones* airs. This means the internet is awash with death counts from the show and survival predictions for the characters still alive.

Others, more pessimistically, wonder about the health of George R. R. Martin, author of the *A Song of Ice and Fire* (*ASOIAF*) book series (on which *Game of Thrones* is based). Some worried readers compare Martin to Robert Jordan, who passed away after writing the 11th *Wheel of Time* book, leaving 3 more books to be finished posthumously. Martin’s trilogy has become 5 books so far and is supposed to end at 7, unless it’s 8… so who really knows how long it’ll take.

(Understandably, Martin responds emphatically to these concerns. And after all, Martin and Jordan are *completely different* aging white American men who love beards and hats and are known for writing phone-book-sized fantasy novels that started out as intended trilogies but got out of hand. So, basically no similarities whatsoever.)

But besides the author and his characters, there’s another set of deaths to consider. The books will get finished eventually. But how many **readers** will have passed away waiting for that ending? Let’s take a look.

Caveat: the inputs are uncertain, the process is handwavy, and the outputs are certainly wrong. This is all purely for fun (depressing as it may be).

So, we’ll need to answer a few questions. How do we define readers? How many readers are there? What are their demographics? And what are the mortality statistics for those demographics?

**Readers:** By the fall of 2013, around 24 million ASOIAF books had been sold in North America, but that includes all 5 books (so far). Furthermore, it seems that book sales went through the roof once the HBO show began in 2011, which will make it really hard to estimate trends in readership over time after that year. 2011 is also the year when the latest book in the series, *A Dance With Dragons* (*ADWD*), was published.

The first book in the series reached its one-millionth US-paperback-edition copy by the fall of 2010. So for simplicity’s sake, let’s say that by the end of 2010, there were at least 1,000,000 US readers. [This misses people who bought the hardcover instead or who read it as a library book; and this overcounts people who bought but never read it or who didn’t like it enough to continue the series. Still, it’s a nice round number and probably the right order of magnitude.] These are the grizzled veterans who were already fans before the HBO show (a.k.a. the hipsters who liked it before it was cool). Some started when the first book appeared in 1996, others as late as 2010. Let’s call these 1,000,000 US residents our core ASOIAF readers who really want to know how the book series ends. How many of them had passed away before the HBO show began and ADWD came out?

The first US printing of the first book had around 50,000 copies according to Martin. Instead of thoroughly researching how book readership tends to grow, let’s assume it’s linear (another crass oversimplification). Then over the 15 years from 1996 to 2010 (inclusive), we’d have to add around 65,000 readers a year to reach 1 million total readers. Since that’s not too far from the first print run, let’s go with it: we have 65,000 new ASOIAF readers every year over 15 years.
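A quick R sanity check on that rate:

```r
# Linear growth: equal cohorts of new readers each year, 1996-2010 inclusive.
years = 1996:2010
length(years)          # 15 years
1e6 / length(years)    # about 66667 new readers/year; rounded down to 65000
65000 * length(years)  # 975000 -- close enough to the 1 million target
```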

**Demographics:** I can’t find demographics for Martin’s readers specifically, but there are a few demographic summaries of fantasy readers in general. Readers of *The Magazine of Fantasy & Science Fiction* and *Lightspeed Magazine* seem to be roughly 10% ages 18-24, 50% ages 26-45, 30% ages 46-55, which I guess leaves around 10% aged 56 or older. (That ignores people under 18, but let’s face it, kids probably shouldn’t be reading the gruesome *Game of Thrones* anyway.) The sex breakdown is roughly 60% male, 40% female. Let’s assume this age/sex breakdown holds for our million ASOIAF readers, though here are several other summaries with slightly different demographics.

**Mortality:** Okay, it’s time for the morbid part. Here are some Death Rates by Age and Sex (we’ll ignore Race since I didn’t find those reader demographics). None of them have changed dramatically since around 2000, close to the first book’s publication date, so let’s just use the latest 2008 numbers. The age breakdowns here don’t match ours exactly, so let’s also average together the rates for age groups we need to combine. Table 2 here suggests 25-34 and 35-44 had roughly similar numbers of people, so we can take a simple average of their death rates to get the 26-45 rate. But for people 56+, we’ll do a weighted average, weighted by the approximate population in each death-rate category. Using very rough weights of 25 (million) in population for 55-64, 20 for 65-74, 10 for 75-84, and 5 for 85+, we get

`(25*1000 + 20*2500 + 10*6000 + 5*14000) / (25 + 20 + 10 + 5)`

or around 3500 for the 56+ male death rate. For females, it’s

`(25*1000 + 20*2000 + 10*4000 + 5*12500) / (25 + 20 + 10 + 5)`

or around 2800.
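The same weighted averages, done in R with `weighted.mean` (the weights are the rough populations in millions from above):

```r
# Weighted average death rate (per 100,000) for the oldest group,
# weighting the 55-64, 65-74, 75-84, and 85+ bands by population size.
weights      = c(25, 20, 10, 5)
male_rates   = c(1000, 2500, 6000, 14000)
female_rates = c(1000, 2000, 4000, 12500)
weighted.mean(male_rates, weights)    # about 3417, rounded to ~3500 above
weighted.mean(female_rates, weights)  # about 2792, i.e. ~2800
```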

Finally, the table gives death rates per 100,000 population, but let’s translate them to percent of people who will pass away that year. The results are

Males: .11% for 18-25, .18% for 26-45, .53% for 46-55, 3.5% for 56+

Females: .04% for 18-25, .06% for 26-45, .23% for 46-55, 2.8% for 56+

Let’s run these numbers through R.

```r
# Death rates (percent of people in that group who die in a given year)
# rounded or estimated from Census tables
DeathRatesVec = c(.11, .18, .53, 3.5, .04, .06, .23, 2.8)
DeathRates = matrix(DeathRatesVec, 4, 2) / 100
colnames(DeathRates) = c("M", "F")
rownames(DeathRates) = c("18-24", "25-44", "45-54", "55+")
DeathRates
##            M      F
## 18-24 0.0011 0.0004
## 25-44 0.0018 0.0006
## 45-54 0.0053 0.0023
## 55+   0.0350 0.0280

# Number of readers in each age/sex group
# estimated from fantasy magazine reader polls
AgePcts = c(.1, .5, .3, .1)
SexPcts = c(.6, .4)
ReadersPerYear = t(65000 * rbind(AgePcts, AgePcts) * SexPcts)
colnames(ReadersPerYear) = colnames(DeathRates)
rownames(ReadersPerYear) = rownames(DeathRates)
ReadersPerYear
##           M     F
## 18-24  3900  2600
## 25-44 19500 13000
## 45-54 11700  7800
## 55+    3900  2600

# Function to estimate the number of readers who die
# within a certain number of years
NrDeathsByYearsLeft = function(YearsLeft) {
  sum(ReadersPerYear - ReadersPerYear * (1 - DeathRates) ^ YearsLeft)
}

# Total number of reader deaths from 1996 through 2010
FirstYear = 1996
FinalYear = 2010
TotalYears = FinalYear - FirstYear + 1
DeathsByYearStarted = sapply(TotalYears:1, NrDeathsByYearsLeft)
round(sum(DeathsByYearStarted))
## [1] 36814
```

So it looks like almost 40,000 veteran readers didn’t survive even until *ADWD* was published or the HBO show aired. This is on the order of 100 times the number of characters who’ve died, whether in the show or in the books.

Finally, let’s show the breakdown by year, since we already calculated it above:

```r
# Number of deaths by 2010,
# broken out by the year in which they started reading
plot(FirstYear:FinalYear, DeathsByYearStarted, type = 'h',
     xlab = 'Year', ylab = 'Deaths',
     main = 'Number of readers deceased by 2010\nwho started in a given year')
```

(The trend looks perfectly linear just because we assumed linear growth in the number of readers and stable demographics over time.)

No deep insights here. There’s just the stark (hah!) realization that a substantial number of Martin’s earliest readers have not survived the wait.

Let’s not worry about which characters will die; let’s not hurry Martin as he writes. Let us just savor our time on Earth before we make the same journey ourselves. After all, valar morghulis.

PS—A helpful librarian friend tells me that the Carnegie Library of Pittsburgh system has 102 copies of the books currently in the system (acquired from 2002 onwards), with about 2300 total checkouts all together. This could be extrapolated to estimate US readership by library patrons who didn’t actually buy the book. At some point I may also go through her data to see how readership seems to have changed over time (i.e., the number of checkouts over time for older vs. newer copies).

*Manual trackback: Partially Derivative ep. 20 (around 33:09); FlowingData*