Maps of changes in area boundaries, with R

Today a coworker needed some maps showing boundary changes. I used what I learned last week in the useR 2012 geospatial data course to make a few simple maps in R, overlaid on OpenStreetMap tiles. I’m posting my maps and my R code in case others find them useful.
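My R code is in the full post, but the general recipe fits in a few lines. Here is a minimal sketch rather than my exact script: the shapefile layer names and the bounding box around Mobile are placeholders, and it leans on the OpenStreetMap and rgdal packages (one reasonable toolset, not necessarily the only one from the course).

# Minimal sketch (not the exact script from the post): overlay block-group
# boundaries on OpenStreetMap tiles. File names and coordinates are placeholders.
library(OpenStreetMap)   # downloads and plots map tiles
library(rgdal)           # readOGR() reads shapefiles, spTransform() reprojects

# Grab tiles for a rough bounding box around Mobile, AL
basemap <- openmap(upperLeft  = c(30.80, -88.30),
                   lowerRight = c(30.55, -87.95),
                   type = "osm")
plot(basemap)

# Read the 2000 and 2010 block-group boundaries (hypothetical layer names)
bg2000 <- readOGR(dsn = ".", layer = "blockgroups_2000")
bg2010 <- readOGR(dsn = ".", layer = "blockgroups_2010")

# Reproject to the tiles' Mercator projection, then draw both on top
bg2000 <- spTransform(bg2000, osm())
bg2010 <- spTransform(bg2010, osm())
plot(bg2000, border = "blue", add = TRUE)
plot(bg2010, border = "red",  add = TRUE)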

A change in Census block-groups from 2000 to 2010, in Mobile, AL

Continue reading “Maps of changes in area boundaries, with R”

useR 2012: impressions, tutorials

First of all, useR 2012 (the 8th International R User Conference) was, hands down, the best-organized conference I’ve had the luck to attend. The session chairs kept everything moving on time, tactfully but sternly; the catering was delicious and varied; and Vanderbilt University’s leafy green campus and comfortable facilities were an excellent setting. Many thanks to Frank Harrell and the rest of Vanderbilt’s biostatistics department for hosting!

Plus there's a giant statue of bacon. What's not to love?

Continue reading “useR 2012: impressions, tutorials”

API, and online mapping platforms

The Census Bureau is beta-testing a new API for developers. As I understand it, within hours of the API going live, Jan Vink incorporated it into an updated version of the interactive maps I’ve discussed before.

I think the placement of the legend on the side makes it easier to read than the previous version, where it was below. It’s a great development for the map — and a good showcase for the Census Bureau’s API, which I hope will become ready for public use in the near future.
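Pulling numbers out of the API takes only a few lines of R. The sketch below follows the pattern in the Bureau’s current public documentation, which may differ from the beta version Vink used, so treat the URL, dataset year, and table codes as assumptions rather than gospel:

# Hedged sketch: fetch state poverty counts from the Census API with jsonlite.
# The dataset year and the B17001 variable codes are examples, not the beta API.
library(jsonlite)

url <- paste0("https://api.census.gov/data/2019/acs/acs5",
              "?get=NAME,B17001_002E,B17001_001E&for=state:*")
raw <- fromJSON(url)   # returns a character matrix: header row + data rows

states <- as.data.frame(raw[-1, ], stringsAsFactors = FALSE)
names(states) <- raw[1, ]
states$poverty_rate <- as.numeric(states$B17001_002E) /
                       as.numeric(states$B17001_001E)

head(states[order(-states$poverty_rate), c("NAME", "poverty_rate")])

For anything beyond light experimentation, the Bureau asks you to append a free API key to the request.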

I’d love to see this and related approaches become available in several online/interactive mapping environments. One possibility is to build widgets for the ArcGIS Viewer for Flex platform, which works with ESRI’s ArcGIS products.

Another great environment I’m just learning about is Weave. This week the Census Bureau is hosting Dr. Georges Grinstein, of the University of Massachusetts Lowell, who leads development of this powerful open-source platform for integrating and visualizing data. Weave is being developed alongside a consortium of local governments and nonprofits who use it for information dashboards, data dissemination, and so on.
It seems to be a mix of ActionScript, JavaScript, and C++, so extending Weave’s core functionality sounds a bit daunting, but I was very glad to see that advanced users can call R scripts inside a visualization. This lets you analyze and plot data in ways that the Weave team did not explicitly foresee.

In short, there’s plenty of exciting work being done in this arena!

Updated d3 idiopleth

I’ve updated the interactive poverty map from last month, providing better labels, legends, and a clickable link to the data source. It also now compares confidence intervals correctly. I may have switched the orange and purple colors too. (I also reordered the code so that things are defined in the right order; I think that was why you’d sometimes need to reload the map before the interactivity would work.)
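For the record, the comparison I have in mind is the standard significance check for survey estimates: convert each margin of error to a standard error, combine them, and see whether the difference in estimates exceeds the combined margin. A rough R version of that logic (my paraphrase of the idea, not the map’s actual JavaScript):

# Rough R sketch of the confidence-interval comparison (an approximation of
# the logic, not the map's code). Assumes 90% margins of error.
moe_to_se <- function(moe) moe / 1.645          # 90% MOE -> standard error

sig_different <- function(est1, moe1, est2, moe2, z = 1.645) {
  se_diff <- sqrt(moe_to_se(moe1)^2 + moe_to_se(moe2)^2)
  abs(est1 - est2) > z * se_diff                # TRUE if significantly different
}

# Toy example: two state poverty rates (%) with their margins of error
sig_different(est1 = 15.2, moe1 = 0.4, est2 = 14.5, moe2 = 0.5)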

Please click the screenshot to try the interactive version (seems to work better in Firefox than Internet Explorer):

Next steps: redo the default color scheme so it shows the states relative to the national average poverty rate; figure out why there are issues in the IE browser; clean up the code and share it on GitHub.
[Edit: the IE issues seem to be caused by D3’s use of the SVG format for its graphics; older versions of IE do not support SVG. I may try to redo this map in another JavaScript library such as Raphaël, which can apparently detect old versions of IE and fall back to another graphics format when needed.]

For lack of a better term I’m still using “idiopleth”: idio as in idiosyncratic (i.e., what’s special about this area?) and pleth as in plethora (or choropleth, the standard map for a multitude of areas). Hence, together, idiopleth: one map containing a multitude of idiosyncratic views. Please leave a comment if you already know of a better term for this concept.

Tidbits of geography (and of cake)

Futility Closet has plenty of great trivia. I want to share some of my favorite geographical tidbits from there, since I’ve had maps on my mind lately.


While we’re talking about shapes and areas, here’s a more mathematical-geometrical question: What’s the most efficient way to carve up a circle to fit inside a square of slightly larger surface area?

Round peg into a square hole, er, that is, cake into pan

I baked a cake in a round 9″ pan, so its surface area is \pi r^2 = \pi \times 4.5^2 \approx 63.6 \text{ in}^2. I wanted to transport it in a pan with a lid, and I have an 8″ square pan with surface area 8^2 = 64 \text{ in}^2. What’s the best way to fit it in, with the fewest cuts and the least wasted scraps? (Well, not really wasted; I’ll eat them gladly 🙂 )
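(If you want to check the arithmetic, or try your own pans, a few lines of R will do it.)

# Quick arithmetic check: does the round cake's area fit the square pan?
r <- 9 / 2                   # radius of the 9-inch round pan
area_round  <- pi * r^2      # about 63.6 square inches
area_square <- 8^2           # 64 square inches
area_square - area_round     # roughly 0.4 square inches to spare by area alone,
                             # but the 9-inch diameter exceeds the 8-inch side,
                             # so at least one cut is unavoidable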

Localized Comparisons: Idiopleth Maps?

In which we propose a unifying theme, name, and some new prototypes for visualizations that allow “localization of comparisons,” aka “How do I relate to others?”

When Nathan Yau visited the Bureau a few months ago, he compared two world maps of gasoline prices by country. The first was your typical choropleth: different color shades correspond to different gas prices. Fair enough, but (say) an American viewing the map is most likely interested in how US gas prices compare to the rest of the world. The second map instead showed America in a neutral color (grey) and recolored the other countries relative to the US, so you could see at a glance whether their prices were higher or lower than here (for instance, red for pricier and green for cheaper gas).
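The recoloring logic itself is simple. Here is a toy R/ggplot2 sketch of the idea, using a made-up table of prices and a bar chart instead of a world map (a proper choropleth would need boundary data I won’t reproduce here):

# Toy sketch of "recolor everything relative to a reference country."
# The countries and prices below are made up for illustration.
library(ggplot2)

prices <- data.frame(
  country     = c("USA", "Canada", "Norway", "Venezuela", "Japan"),
  usd_per_gal = c(3.60, 4.80, 9.70, 0.10, 5.90)
)

reference <- "USA"
prices$diff <- prices$usd_per_gal -
  prices$usd_per_gal[prices$country == reference]

ggplot(prices, aes(x = reorder(country, diff), y = diff, fill = diff)) +
  geom_col() +
  scale_fill_gradient2(low = "darkgreen", mid = "grey70", high = "red",
                       midpoint = 0, name = "vs. reference ($/gal)") +
  coord_flip() +
  labs(x = NULL, y = "Gas price difference from the selected country")

Swapping the value of reference and redrawing is essentially all an interactive version would need to do.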

I liked this idea but wanted to take it further: Instead of a one-off map just for Americans, why not make an interactive map that recolors automatically when you select a new country?
As a statistician, I’m also interested in how to communicate uncertainty: is your local area’s estimate statistically significantly different from your neighbors’ estimates? Continue reading “Localized Comparisons: Idiopleth Maps?”

Separation of degrees

Scientific American has a short article on trends in undergraduate degrees over the past 20 years, illustrated with a great infographic by Nathan Yau. As a big fan of STEM (science, tech, engineering and math) education, I was pleased to see data on changing patterns among STEM degree earners.

However, there seemed to be a missed opportunity. The article mentioned that “More women are entering college, which in turn is changing the relative popularity of disciplines.” If the data were broken down by gender, readers could better see this fact for themselves.

I thought I could exploit the current graphic’s slight redundancy: the bar heights below and above the gray horizontal lines are exactly the same. Why not repurpose this format to show data on degrees earned by men vs. by women (below vs. above the horizontal line), in the same amount of space?

I could not find the gender breakdown for the exact same set of degrees, but a similar dataset is in the Digest of Education Statistics, tables 308 to 330. Here are my revised plots, made using R with the ggplot2 package.

Click this thumbnail to see all the data in one plot (it’s too big for the WordPress column width):

Or see the STEM and non-STEM plots separately below.

So, what’s the verdict? These new graphs do support SciAm’s conclusions: women are largely driving the increases in psychology and biology degrees (as well as “health professions and related sciences”), and to a lesser degree in the arts and communications. On the other hand, increases in business and social science degrees appear to be driven equally by males and females. The mid-’00s spike in computer science was mostly guys, it seems.

I’d also like to think that my alma mater, Olin College, contributed to the tiny increase in female engineers in the early ’00s 🙂

Technical notes:
Some of these degree categories are hard to classify as STEM vs. non-STEM. In particular, Architecture and Social Science include some sub-fields of each type… Really, I lumped them under non-STEM only because it balanced the number of items in each group.
Many thanks to a helpful Learning R tutorial on back-to-back bar charts.
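In case it saves someone a search, here is a stripped-down version of that back-to-back layout in ggplot2, with toy numbers standing in for the Digest tables (the general technique, not my exact plotting code):

# Stripped-down sketch of the back-to-back bar layout (made-up numbers, not
# the Digest of Education Statistics data): women above the axis, men below.
library(ggplot2)

degrees <- data.frame(
  year   = rep(c(1990, 2000, 2010), each = 2),
  gender = rep(c("Women", "Men"), times = 3),
  count  = c(30, 25, 45, 30, 60, 35)           # thousands of degrees (toy values)
)

# Flip the men's counts to negative so their bars hang below the zero line
degrees$signed <- ifelse(degrees$gender == "Men", -degrees$count, degrees$count)

ggplot(degrees, aes(x = factor(year), y = signed, fill = gender)) +
  geom_col() +
  geom_hline(yintercept = 0, colour = "grey40") +
  scale_y_continuous(labels = abs) +           # hide the minus signs on the axis
  labs(x = "Year", y = "Degrees earned (thousands)", fill = NULL)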

Grafixing what ain’t broken

Yesterday I had the pleasure of eating lunch with Nathan Yau of FlowingData.com, who is visiting the Census Bureau this week to talk about data visualization.
He told us a little about his PhD thesis topic (monitoring, collecting, and sharing personal data). The work sounds interesting, although until recently it had been held up by work on his new book, Visualize This.

We also talked about some recent online discussions of “information visualization vs. statistical graphics.” These conversations were sparked by the latest Statistical Computing & Graphics newsletter. I highly recommend the pair of articles on this topic: Robert Kosara made some great points about the potential of info visualization, and Andrew Gelman with Antony Unwin responded with their view from the statistics side.

In Yau’s opinion, there is not much point in drawing a distinction between the two. However, as I understand it, Gelman then continued blogging on this topic in a way that may seem critical of the info visualization community:
“Lots of work to convey a very simple piece of information,” “There’s nothing special about the top graph above except how it looks,” “sacrificing some information for an appealing look” …
Kaiser Fung, of the Junk Charts blog, pitched in on the statistics side as well. Kosara and Yau responded from the visualization point of view.
To all statisticians, I recommend Kosara’s article in the newsletter and Yau’s post, which covers the state of infovis research.

My view is this: Gelman seems intent on pointing out the differences between graphs made by statisticians with no design expertise vs. by designers with no statistical expertise, but I don’t think this latter group represents what Kosara is talking about. Kosara wants to highlight the potential benefits for a person (or team) who can combine both sets of expertise. These are two rather different discussions, though both can contribute to the question of how to train people to be fluent in both skill-sets.

Personally, I can think of examples labeled “information visualization” that nobody would call “statistical graphics” (such as the Rock Paper Scissors poster), but not vice versa. Any statistical graphic could be considered a visualization, and essentially all statisticians will make graphs at some point in their careers, so there is no harm in statisticians learning from the best of the visualization community. On the other side, a “pure” graphics designer may be focused on how to communicate rather than how to analyze the data, but can still benefit from learning some statistical concepts. And a proper information visualization expert should know both fields deeply.

I agree there is some junk out there calling itself “information visualization”… but only because there is a lot of junk, period, and the people who make it (with no expertise in design or in statistics) are more likely to call it “information visualization” than “statistical graphics.” But that shouldn’t reflect poorly on people like Kosara and Yau who have expertise in both fields. Anyone working with numerical data and wanting to take the time to:
* thoughtfully examine the data, and
* thoughtfully communicate conclusions
might as well draw on insights both from statisticians and from designers.

What are some of these insights?
Some writing about graphics, such as the Junk Charts blog and Edward Tufte’s books, reminds me of prescriptive grammar guides in the high-school English class sense, along the lines of Strunk and White: “what should you do?” They warn the reader about the equivalent of “typos” (mislabeled axes) and “poor style” (thick gridlines that obscure the data points) that can hinder communication.
Then there is the descriptive linguist’s view of grammar: the building blocks of “what can you do?” A graphics-related example is Leland Wilkinson’s book The Grammar of Graphics, applied to great success in Hadley Wickham’s R package ggplot2, which lets analysts think about graphics more flexibly than the traditional grab-bag of named plot types does (a tiny example follows below).
Neither approach to graphics is a standard part of most statistics curricula, although both are useful. Also missing are technical graphic design skills: not just using Illustrator and Photoshop, but even basic knowledge about pixels and graphics file types that can make the difference between clear and illegible graphs in a paper or presentation.
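To make the grammar point concrete, here is a tiny example (toy code, using R’s built-in mtcars data): the same data and aesthetic mapping, re-rendered by swapping out one component at a time.

# The grammar-of-graphics idea in miniature: one specification, several plots.
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl)))

base + geom_point()                      # scatterplot
base + geom_smooth(method = "lm")        # fitted lines instead of points
base + geom_point() + facet_wrap(~ cyl)  # same mapping, as small multiples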

What other info visualization insights can statisticians take away? What statistical concepts should graphic designers learn? What topics are in need of solid information visualization research? As Yau said, each viewpoint has the same sentiments at heart: make graphics thoughtfully.

PS — some of the most heated discussion (particularly about Kosara’s spiral graph) seems due to blurred distinctions between the best way to (1) answer a specific question about the data (or present a conclusion that the analyst has already reached), vs. (2) explore a dataset with few preconceptions in mind. For example, Gelman talks about redoing Kosara’s spiral graph in a more traditional way that cleanly presents a particular conclusion. But Kosara points out that his spiral graph is meant for use as an interactive tool for exploring the data, rather than a static image for conveying a single summary. So Gelman’s comments about “that puzzle solving feeling” may be misdirected: there is use for graphs that let the analyst “solve a puzzle,” even when it only confirms something you already knew. (The things you think you know are often wrong, so there’s a benefit to such confirmation.) Once you’ve used this exploratory graphical tool, you might summarize the conclusion in a very different graph that you show to your boss or publish in the newspaper.

PPS — Here is some history and “greatest hits” of data visualization.