<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Civil Statistician</title>
	<atom:link href="http://civilstat.com/?feed=rss2" rel="self" type="application/rss+xml" />
	<link>http://civilstat.com</link>
	<description>Stats, datavis, edu, brains, etc.</description>
	<lastBuildDate>Wed, 24 Apr 2013 13:25:45 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Transitions</title>
		<link>http://civilstat.com/?p=1387&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=transitions</link>
		<comments>http://civilstat.com/?p=1387#comments</comments>
		<pubDate>Tue, 09 Apr 2013 16:08:53 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1387</guid>
		<description><![CDATA[Apologies for the lack of posts recently. I&#8217;m very excited about upcoming changes that are keeping me busy: This month, Civilstat is changing civil status. (And yes, I&#8217;m aware how lucky I am that she wants to marry me despite &#8230; <a href="http://civilstat.com/?p=1387">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Apologies for the lack of posts recently. I&#8217;m very excited about upcoming changes that are keeping me busy:</p>
<ul>
<li><span style="line-height: 14px;">This month, Civilstat is changing civil status. (And yes, I&#8217;m aware how lucky I am that she wants to marry me despite such awful puns.)</span></li>
<li>This fall, I will be starting the <a href="http://www.stat.cmu.edu/academics/graduate/the-phd-program-in-statistics">PhD program in Statistics</a> at <a href="http://www.stat.cmu.edu/">Carnegie Mellon University</a>.</li>
</ul>
<p>Let me suggest a few other blogs to follow while this one is momentarily on the back burner.</p>
<p><span style="color: #444444;">By my Census Bureau colleagues:</span></p>
<ul>
<li><span style="line-height: 14px;"><a href="http://modelingwithdata.org/">Modeling With Data</a>, by <a href="http://modelingwithdata.org/about_the_author.html">Ben Klemens</a></span></li>
<li><a href="http://www.tokle.us/">tokle.us</a>, by <a href="http://www.tokle.us/about.html">Joshua Tokle</a></li>
<li><a href="http://blogs.census.gov/">Random Samplings</a>, by the <a href="http://www.census.gov/">U.S. Census Bureau</a></li>
<li><a href="http://researchmatters.blogs.census.gov/">Research Matters</a>, by <a href="http://www.census.gov/research/">U.S. Census Bureau researchers</a></li>
<li><a href="http://directorsblog.blogs.census.gov/">Director&#8217;s Blog</a>, by <a href="http://www.census.gov/newsroom/releases/archives/directors_corner/">U.S. Census Bureau Acting Director Tom Mesenbourg</a></li>
</ul>
<p>By members of the Carnegie Mellon statistics department:</p>
<ul>
<li><a href="http://www.cscs.umich.edu/~crshalizi/weblog/">Three Toed Sloth</a>, by <a href="http://www.stat.cmu.edu/~cshalizi/">Cosma Shalizi</a></li>
<li><a href="http://www.acthomas.ca/comment/">Blog</a> by <a href="http://www.acthomas.ca/">Andrew Thomas</a></li>
<li><a href="http://www.stat.cmu.edu/~nmv/blog/">Blog</a> by <a href="http://www.stat.cmu.edu/~nmv/">Nathan VanHoudnos</a></li>
<li><a href="http://normaldeviate.wordpress.com/">Normal Deviate</a>, by <a href="http://www.stat.cmu.edu/~larry/">Larry Wasserman</a></li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1387</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Visual Revelations, Howard Wainer</title>
		<link>http://civilstat.com/?p=1366&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=visual-revelations-howard-wainer</link>
		<comments>http://civilstat.com/?p=1366#comments</comments>
		<pubDate>Thu, 21 Mar 2013 21:29:25 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Books]]></category>
		<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1366</guid>
		<description><![CDATA[I&#8217;m starting to recognize several clusters of data visualization books. These include: how-to and best-practices books like Stephen Kosslyn&#8217;s Graph Design for the Eye and Mind [my notes] or Stephen Few&#8217;s Now You See It academic books like Colin Ware&#8217;s &#8230; <a href="http://civilstat.com/?p=1366">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I&#8217;m starting to recognize several clusters of data visualization books. These include:</p>
<ul>
<li>how-to and best-practices books like Stephen Kosslyn&#8217;s <a href="http://www.amazon.com/gp/product/0195311841/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0195311841&amp;linkCode=as2&amp;tag=civilstatis-20">Graph Design for the Eye and Mind</a> [<a href="http://civilstat.com/?p=974">my notes</a>] or Stephen Few&#8217;s <a href="http://www.amazon.com/gp/product/0970601980/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0970601980&amp;linkCode=as2&amp;tag=civilstatis-20">Now You See It</a></li>
<li>academic books like Colin Ware&#8217;s <a href="http://www.amazon.com/gp/product/0123814642/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0123814642&amp;linkCode=as2&amp;tag=civilstatis-20">Information Visualization: Perception for Design</a> or Leland Wilkinson&#8217;s <a href="http://www.amazon.com/gp/product/0387245448/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0387245448&amp;linkCode=as2&amp;tag=civilstatis-20">The Grammar of Graphics</a> [<a href="http://civilstat.com/?p=601">my notes</a>]</li>
<li>a category exemplified by Edward Tufte&#8217;s <a href="http://www.amazon.com/gp/product/0961392142/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0961392142&amp;linkCode=as2&amp;tag=civilstatis-20">The Visual Display of Quantitative Information</a>: lots of concrete examples, but with abstract commentary rather than how-to advice; a book that you wouldn&#8217;t mind setting on your coffee-table for the pretty pictures, though with much more content than what a &#8220;coffee-table book&#8221; usually connotes</li>
</ul>
<p>(Of course this list calls out for a flowchart or something to visualize it!)</p>
<p>Howard Wainer&#8217;s <a href="http://www.amazon.com/gp/product/0805838783/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0805838783&amp;linkCode=as2&amp;tag=civilstatis-20">Visual Revelations</a> falls in this last category. It&#8217;s no surprise that Wainer&#8217;s book emulates Tufte&#8217;s, given how often the author refers back to Tufte&#8217;s work (including comments like &#8220;As Edward Tufte told me once&#8230;&#8221;). <em>The Visual Display of Quantitative Information</em> is still probably the best introduction to the genre, but <em>Visual Revelations</em> is different enough to be a worthwhile read too if you enjoy such books, as I do.</p>
<p>Most of all, I appreciated that Wainer presents many bad graph examples found &#8220;in the wild&#8221; and follows them with improvements of his own. Not all are successful, but even so I find this approach very helpful for learning to critique and improve my own graphics. (Tufte&#8217;s classic book critiques plenty, but spends less time on before-and-after redesigns. On the other hand, Kosslyn&#8217;s book is full of redesigns, but his &#8220;before&#8221; graphs are largely made up by him to illustrate a specific point, rather than real graphics created by someone else.)</p>
<p>Of course, Wainer covers the classics like <a href="http://en.wikipedia.org/wiki/John_Snow_%28physician%29#Cholera">John Snow&#8217;s cholera map</a> and <a href="http://en.wikipedia.org/wiki/Charles_Minard#Information_graphics">Minard&#8217;s plot of Napoleon&#8217;s march on Russia</a> (well-trodden by now, but perhaps less so in 1997?). But I was pleased to find some fascinating new-to-me graphics. In particular, the <a href="http://www.nifc.gov/safety/mann_gulch/investigation/reports/Mann_Gulch_Fire_A_Race_That_Could_Not_Be_Won_May_1993.pdf">Mann Gulch Fire</a> section (p. 65-68) gave me shivers: it&#8217;s not a flashy graphic, but it tells a terrifying story and tells it well.<br />
[<em>Edit:</em> I should point out that Snow's and Minard's plots are so well-known today largely thanks to Wainer's own efforts. I also meant to mention that Wainer is the man who helped bring into print an English translation of Jacques Bertin's seminal <em><a href="http://www.amazon.com/gp/product/1589482611/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=1589482611&amp;linkCode=as2&amp;tag=civilstatis-20">Semiology of Graphics</a></em> and a replica volume of William Playfair's <a href="http://www.amazon.com/gp/product/B003GAN2IS/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B003GAN2IS&amp;linkCode=as2&amp;tag=civilstatis-20"><em>Commercial and Political Atlas</em> and <em>Statistical Breviary</em></a>. He has done amazing work at unearthing and popularizing many lost gems of historical data visualization!<br />
See also <a href="http://www.thefunctionalart.com/2012/06/graphics-uncertainty-and-flaw-of.html">Alberto Cairo's review</a> of a more recent Wainer book.]</p>
<p>Finally, Wainer&#8217;s tone overall is also much lighter and more humorous than Tufte&#8217;s. His first section gives detailed advice on how to make a bad graph, for example. I enjoyed Wainer&#8217;s jokes, though some might prefer more gravitas.</p>
<p><span id="more-1366"></span>Below are my notes-to-self, with things-to-follow-up in bold:</p>
<ul>
<li>p. 11: &#8220;When looking at a good graph, your response should never be &#8216;what a great graph!&#8217; but &#8216;what interesting data!&#8217;&#8221; It&#8217;s a matter of taste and context, but my personal interests align with Wainer&#8217;s here. I&#8217;m currently much less interested in artsy visualizations that do not aid understanding; I&#8217;m reminded of <a href="http://flowingdata.com/2013/03/15/app-shows-what-the-internet-looks-like/">one recently highlighted on FlowingData</a> with the comment, &#8220;I can&#8217;t say how accurate it is or if the described mechanisms are accurate, but it sure is fun to play with.&#8221;</li>
<li>p. 43: &#8220;after more than two hundred practice exercises with [bivariate choropleth] maps, graduate students in perception at Johns Hopkins University were unable to internalize the legend.&#8221; <strong>Read the study:</strong> <a href="http://www.jstor.org/stable/2684111">Wainer and Francolini (1980)</a></li>
<li>p. 47: Sandy Zabell used graphs to highlight &#8220;inconsistencies, clerical errors, and a remarkable amount of other information&#8221; that earlier researchers had missed in the London Bills of Mortality. I&#8217;d love to <strong>find these graphs</strong>: Zabell, 1976, &#8220;Arbuthnot, Heberden and the <em>Bills of Mortality</em>,&#8221; Technical Report #40, Department of Statistics, University of Chicago.</li>
<li>p. 47: data graphics were uncommon, even in scientific journals, before William Playfair &#8212; but <strong>how &amp; when did journals start including graphics?</strong></li>
<li>p. 52: the famous <a href="http://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster#Use_as_case_study">O-ring example</a> is a case of plotting the wrong data for the question at hand. In the plot used for decision-making, they showed failures vs. temperature only for those space shuttle flights with at least one failure. That is, if <img src='http://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x' title='x' class='latex' /> is the number of failures and <img src='http://s0.wp.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='T' title='T' class='latex' /> is the temperature, they plotted <img src='http://s0.wp.com/latex.php?latex=%28x+%7C+x%3E0%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='(x | x&gt;0)' title='(x | x&gt;0)' class='latex' /> vs <img src='http://s0.wp.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='T' title='T' class='latex' />, rather than all <img src='http://s0.wp.com/latex.php?latex=x&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='x' title='x' class='latex' /> vs <img src='http://s0.wp.com/latex.php?latex=T&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='T' title='T' class='latex' />. Hence, they had a distorted view of <img src='http://s0.wp.com/latex.php?latex=p%28x%3E0+%7C+T%29&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='p(x&gt;0 | T)' title='p(x&gt;0 | T)' class='latex' />. Perhaps a related idea is key to Wald&#8217;s study of <a href="http://www.businesspundit.com/what-bullet-holes-in-airplanes-can-teach-you-about-making-better-business-decisions/">armoring airplanes</a> (p. 58): consider not just when you&#8217;ve observed the event of interest, but also when you haven&#8217;t.</li>
<li>p. 55: &#8220;Good graphs can make difficult problems trivial.&#8221; For a great example, see the inclined-plane question on p. 71-72, which can be answered either with trigonometry and calculus&#8230; or at a glance with the right graph. Also related to Colin Ware&#8217;s focus on &#8220;external cognition&#8221;: how resources outside the mind can be used to boost the mind.</li>
<li>p. 80: &#8220;a reasonable strategy in what ought to be an iterative process. Sometimes one has a data-related question and then draws a graph to try to answer it. After drawing the graph a new question might suggest itself, and hence a different graph, better suited to this new question (perhaps with additional data), is drawn. This in turn suggests something else, and so on, until either the data or the grapher is exhausted. [...] My experience suggests that if you begin with a general-purpose plot there is a greater chance of finding what you had not expected.&#8221; This is my experience as well, and reminds me also of <a href="http://datacommunitydc.org/blog/2013/01/the-near-future-of-data-analysis-a-review/">Hadley Wickham&#8217;s description</a> of statistics as <a href="http://www.johndcook.com/blog/2013/02/07/visualization-modeling-and-surprises/">iterating between models and graphics</a>.</li>
</ul>
<p style="text-align: center;"><a href="http://datacommunitydc.org/blog/2013/01/the-near-future-of-data-analysis-a-review/"><img class="aligncenter" alt="" src="http://datacommunitydc.org/blog/wp-content/uploads/2012/12/slides-0007.jpg" width="300" height="231" /></a></p>
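<p>Returning to the O-ring note (p. 52) above: the effect of conditioning on failures can be sketched with a toy data set. The counts below are invented for illustration and are not the actual shuttle record; the grouping into &#8220;cold&#8221; and &#8220;warm&#8221; launches and the function name are likewise assumptions. The sketch compares the share of flights with at least one failure when all flights are kept against the same share when zero-failure flights are dropped first.</p>

```python
# Invented failure counts per flight, grouped by launch temperature.
# These are NOT the real shuttle data; they only illustrate the selection effect.
failures = {
    "cold": [2, 1, 1, 0],
    "warm": [0, 0, 1, 0, 0, 1, 0, 0],
}


def failure_rate(counts):
    """Share of flights with at least one failure: a crude estimate of p(x != 0 | T)."""
    return sum(1 for x in counts if x != 0) / len(counts)


for group, counts in failures.items():
    # The flawed plot effectively kept only the flights that had failures.
    only_failures = [x for x in counts if x != 0]
    print(group,
          "all flights:", failure_rate(counts),
          "failures only:", failure_rate(only_failures))
```

<p>With these made-up numbers, the full data give a failure share of 0.75 for cold launches against 0.25 for warm ones, while the failures-only subset gives 1.0 in both groups, so the temperature contrast vanishes once the zero-failure flights are left out.</p>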
<ul>
<li>p. 84: Futurism that actually came true, for once! &#8220;Indeed, it is easy to imagine a general-purpose device that might have (among many other things) all of the Los Angeles bus routes inside [...] I see no reason why <em>Streetmap<sup>TM</sup></em>-like software won&#8217;t become available eventually for cheap pocket computers of the sort now called &#8216;personal organizers.&#8217;&#8221;</li>
<li>p. 93-94: examples of misuse of double y-axes, and a comment that it would only be okay if &#8220;the same dependent variable can be represented in a transformed way. For example, plot <em>log of per pupil expenditures</em> on the left and <em>per pupil expenditures</em> on the right, the latter spaced to match the left-hand scale [...] Ironically, no graphics package I know of allows this latter use to be done easily, whereas the misuse is often a touted option.&#8221;</li>
<li>p. 97: Wainer really wants us to round the data for presentation: Readers rarely comprehend more than 2 digits easily, statisticians can rarely justify more than 2 digits of precision, and more than 2 digits are rarely of practical use.<br />
I love this part: &#8220;The standard error of any statistic is proportional to one over the square root of the sample size. God did this, and there is nothing we can do to change it.&#8221; (Say you print 2 digits of a correlation. That implies its standard error is less than 0.005, which requires a sample size on the order of 40,000 &#8212; do you really have that much data?)<br />
And then on p. 99: &#8220;Round the numbers, and if you must, insert a footnote proclaiming that the unrounded details are available from the author. Then sit back and wait for the deluge of requests.&#8221;</li>
<li>p. 101: nice example of spacing rows of a table by the values of one column, showing clusters in the data.</li>
<li>Ch. 11 and 12: he argues in favor of Nightingale roses and trilinear plots but I don&#8217;t find them of much use, except maybe the example on p. 116.</li>
<li>p. 111: people have been complaining about the size and complexity of big data for centuries! William Playfair&#8217;s classic 1786 <em><a href="http://www.amazon.com/gp/product/B003GAN2IS/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=B003GAN2IS&amp;linkCode=as2&amp;tag=civilstatis-20">Commercial and Political Atlas</a></em> was a response to these kinds of concerns.</li>
<li>p. 121-123: I love these implicit graphs or <a href="http://en.wikipedia.org/wiki/Nomogram">nomographs</a>, explicitly making handy tools out of data graphics. Jonathan Rougier has an <a href="http://www.r-bloggers.com/RUG/2011/10/user-2011-jonathan-rougier-nomograms-for-visualising-relationships-between-three-variables/">example of using nomograms</a> to turn a predictive statistical model into something easily used in the field by non-math-savvy folks.</li>
<li>p. 128: Besides graphics, Wainer has a strong interest in education and standardized testing: &#8220;Basing a characterization of an examinee&#8217;s ability to understand graphical displays on a question paired with a flawed display is akin to characterizing someone&#8217;s ability to read by asking questions about a passage full of spelling and grammatical errors. What are we really testing?&#8221;</li>
<li>p. 138: great back-to-back stem and leaf plot, instead of an unhelpful table, for comparing test scores in US states vs. international countries.</li>
<li>p. 147, 149: I&#8217;m not too pleased with either Cleveland&#8217;s clean-but-boring computer-defaults plot or with Wainer&#8217;s cheesy Playfair-style remake. This is where I and many other statisticians feel a huge gap in our data-graphics skillset: once you&#8217;re happy with the content and inherent form of your graph, how do you make it look nice too, without being either bland or tacky?</li>
<li>Ch. 20: good advice on making readable slides, still aimed at overhead transparencies but largely applicable to PowerPoint etc. too. &#8220;If you can&#8217;t read it when you are against the back wall, either redo the ineffectual overheads or have as many of the back rows of chairs removed as necessary.&#8221;<br />
And of course limit the number of fonts, colors, significant digits, and equations in your talk.</li>
</ul>
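<p>The sample-size arithmetic in the p. 97 note can be checked in a few lines. This is a rough sketch, not a calculation taken from the book: it assumes the standard error is approximately 1/sqrt(n) (roughly true for a correlation near zero) and that the standard error should be no larger than half a unit in the last printed decimal place. The function name is made up for illustration.</p>

```python
def min_sample_size(digits):
    """Approximate n needed before `digits` printed decimal places are justified.

    Assumptions: SE is about 1/sqrt(n), and the SE should be no larger than
    half a unit in the last printed place, i.e. 0.5 * 10**(-digits).
    """
    # 1 / (0.5 * 10**-digits), kept as an exact integer to avoid float rounding.
    inverse_se = 2 * 10 ** digits
    return inverse_se ** 2


for digits in (1, 2, 3):
    print(digits, "digits needs n of about", min_sample_size(digits))
```

<p>Under these assumptions, two digits call for n of about 40,000, matching the figure quoted above, and each extra digit multiplies the required sample size by 100.</p>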
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1366</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Upcoming DataKind datadive with the World Bank in DC</title>
		<link>http://civilstat.com/?p=1354&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=upcoming-datakind-datadive-with-the-world-bank-in-dc</link>
		<comments>http://civilstat.com/?p=1354#comments</comments>
		<pubDate>Sat, 16 Feb 2013 07:31:51 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1354</guid>
		<description><![CDATA[DataKind (formerly Data Without Borders) is teaming up with the World Bank to host a datadive on monitoring poverty and corruption. If you&#8217;ve never been to one of their datadives, here&#8217;s my writeup of last year&#8217;s DC event (which I &#8230; <a href="http://civilstat.com/?p=1354">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><a href="http://datakind.org/">DataKind</a> (formerly Data Without Borders) is teaming up with the <a href="http://www.worldbank.org/">World Bank</a> to host a <a href="http://datakind.org/2013/02/datadive-fight-poverty-corruption-world-bank/">datadive on monitoring poverty and corruption</a>.</p>
<p>If you&#8217;ve never been to one of their datadives, <a href="http://civilstat.com/?p=215">here&#8217;s my writeup of last year&#8217;s DC event</a> (which I thoroughly enjoyed), and <a href="http://datakind.org/2012/10/datacorps-project-launched-dc-action-children-team-awesome-dcs-youth/">DataKind&#8217;s writeup of our project results</a>. These datadives are a great way for statisticians and other data scientists to put our skills to good use, and to connect with other good folks in the field.</p>
<p>The World Bank events will take place in Washington DC in two parts: <a href="http://datakind.org/2013/02/datadive-fight-poverty-corruption-world-bank/">preliminary prep work on 2/23 (Open Data Day), and the main datadive on 3/15 to 3/17</a>. Please consider attending if you&#8217;re around! If not, keep an eye out for future DataKind events or <a href="http://civilstat.com/?p=679">other related data science volunteer opportunities</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1354</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Small Area Estimation resources</title>
		<link>http://civilstat.com/?p=1273&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=small-area-estimation-resources</link>
		<comments>http://civilstat.com/?p=1273#comments</comments>
		<pubDate>Sat, 02 Feb 2013 02:57:46 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1273</guid>
		<description><![CDATA[Small Area Estimation is a field of statistics that seeks to improve the precision of your estimates when standard methods are not enough. Say your organization has taken a large national survey of people&#8217;s income, and you are happy with &#8230; <a href="http://civilstat.com/?p=1273">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Small Area Estimation is a field of statistics that seeks to improve the precision of your estimates when standard methods are not enough.</p>
<p>Say your organization has taken a large national survey of people&#8217;s income, and you are happy with the precision of the national estimate: The estimated national average income has a tight confidence interval around it. But then you try to use this data to estimate regional (state, county, province, etc.) average incomes, and some of the estimates are not as precise as you&#8217;d like: their standard errors are too high and the confidence intervals are too wide to be useful.</p>
<p>Unlike usual survey-sampling methods that treat each region&#8217;s data as independent, a Small Area Estimation model makes some assumptions that let areas &#8220;borrow strength&#8221; from each other. This can lead to more precise and more stable estimates for the various regions.</p>
<p>Also note that it is sometimes called Small Domain Estimation because the &#8220;areas&#8221; do not have to be geographic: they can be other sub-domains of the data, such as finely cross-classified demographic categories of race by age by sex.</p>
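<p>To make &#8220;borrowing strength&#8221; concrete, here is a minimal sketch of a shrinkage (composite) estimator for an intercept-only area-level model. It is an illustration under stated assumptions, not the method used by any particular agency: the regional figures are made up, the sampling variances are treated as known, and the between-area variance <code>model_var</code> is simply supplied, whereas in practice it would have to be estimated.</p>

```python
def shrinkage_estimates(direct, sampling_var, model_var):
    """Composite estimates for an intercept-only area-level model.

    direct       : direct survey estimates, one per area
    sampling_var : sampling variances of those estimates (assumed known)
    model_var    : between-area variance (assumed known for this sketch)
    """
    # Precision weights for the overall mean: noisier areas count less.
    weights = [1.0 / (model_var + d) for d in sampling_var]
    overall = sum(w * y for w, y in zip(weights, direct)) / sum(weights)

    estimates = []
    for y, d in zip(direct, sampling_var):
        gamma = model_var / (model_var + d)  # weight on the area's own data
        estimates.append(gamma * y + (1.0 - gamma) * overall)
    return estimates


# Hypothetical average incomes (in thousands) for three regions.
direct = [40.0, 50.0, 60.0]
sampling_var = [1.0, 1.0, 25.0]  # the third region has a much noisier estimate
print(shrinkage_estimates(direct, sampling_var, model_var=4.0))
```

<p>With these made-up inputs, the third region&#8217;s noisy estimate is pulled strongly toward the overall mean, while the two precisely measured regions move only slightly.</p>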
<p>If you are interested in learning about the statistical techniques involved in Small Area Estimation, it can be difficult to get started. This field does not yet have as many textbooks as most other statistical topics do, and there are a few competing philosophies whose proponents rarely cross-pollinate. (For example, the U.S. Census Bureau and the World Bank both use model-based small area estimation but in quite different ways.)</p>
<p>Recently I gave a couple of short tutorials on getting started with SAE, and I&#8217;m polishing those slides into something stand-alone I can post. Meanwhile, below is a list of resources I recommend if you would like to be more knowledgeable about this field.<span id="more-1273"></span></p>
<p><b>Textbooks on SAE:</b></p>
<ul>
<li>Nicholas Longford, <em>Missing Data and Small-Area Estimation: Modern Analytical Equipment for the Survey Statistician</em>, Springer, 2005.</li>
<li>J.N.K. Rao, <em>Small Area Estimation</em>, Wiley-Interscience, 2003.</li>
<li>Parimal Mukhopadhyay, <em>Small Area Estimation in Survey Sampling</em>, Narosa Publishing House, 1998.</li>
<li>Platek, Rao, Särndal, and Singh (ed.), <em>Small Area Statistics: An International Symposium</em>, Wiley, 1987.</li>
</ul>
<p><b>Book sections on SAE:</b></p>
<ul>
<li>Wayne Fuller, <em>Sampling Statistics</em>, Wiley, 2009. Ch. 5.5, &#8220;Small area estimation,&#8221; pp. 311-324.</li>
<li>Peter Congdon, <em>Applied Bayesian Modelling</em>, Wiley, 2003. Ch. 4.6, &#8220;Small domain estimation,&#8221; pp. 163-167.</li>
<li>Peter Congdon, <em>Bayesian Statistical Modelling</em>, Wiley, 2001. Ch. 8.8, &#8220;Small area and survey domain estimation,&#8221; pp. 415-421.</li>
</ul>
<p><b>Classic articles:</b></p>
<ul>
<li>Bradley Efron and Carl Morris, &#8220;Data Analysis Using Stein&#8217;s Estimator and Its Generalizations,&#8221; <em>JASA</em>, vol. 70, no. 350, pp. 311-319, 1975. <a href="http://www.jstor.org/stable/10.2307/2285814">[JSTOR]<br />
</a><em>Early popularization of shrinkage methods, from the Empirical Bayes point of view.</em></li>
<li>Robert Fay and Roger Herriot, &#8220;Estimates of Income for Small Places: An Application of James-Stein Procedures to Census Data,&#8221; <em>JASA</em>, vol. 74, no. 366, pp. 269-277, 1979. <a href="http://www.jstor.org/stable/10.2307/2286322">[JSTOR]<br />
</a><em>The classic area-level model (for survey data that&#8217;s already been aggregated to the level at which you want to publish estimates).</em></li>
<li><span style="color: #444444;">George Battese, Rachel Harter, and Wayne Fuller, &#8220;An Error-Components Model for Prediction of County Crop Areas Using Survey and Satellite Data,&#8221; </span><em>JASA</em><span style="color: #444444;">, vol. 83, no. 401, pp. 28-36, 1988. </span><a href="http://www.jstor.org/stable/10.2307/2288915">[JSTOR]<br />
</a><em>The classic unit-level model (for working with disaggregated data).</em></li>
<li>Chris Elbers, Jean Lanjouw, and Peter Lanjouw, &#8220;Micro–Level Estimation of Poverty and Inequality,&#8221; <em>Econometrica</em>, vol. 71, no. 1, pp. 355-364, 2003. <a href="http://www.jstor.org/stable/10.2307/3082050">[JSTOR]<br />
</a><em>Underlies the PovMap software that is made available by the World Bank and consequently in wide use.</em></li>
</ul>
<p><b>Review articles:</b></p>
<ul>
<li>Gauri S. Datta, &#8220;Model-based approach to small area estimation,&#8221; pp. 251-288, <em>Handbook of Statistics: Sample Surveys: Inference and Analysis</em>, vol. 29B, eds.: D. Pfeffermann and C.R. Rao, North-Holland, 2009.</li>
<li>Risto Lehtonen and Ari Veijanen, &#8220;Design-based methods of estimation for domains and small areas,&#8221; pp. 219-249, <em>Handbook of Statistics: Sample Surveys: Inference and Analysis</em>, vol. 29B, eds.: D. Pfeffermann and C.R. Rao, North-Holland, 2009.</li>
<li>M. Ghosh and J. N. K. Rao, &#8220;Small area estimation: an appraisal,&#8221; <em>Statist. Sci.</em>, vol. 9, no. 1, pp. 55-76, 1994. (See also comments and rejoinder.) <a href="http://projecteuclid.org/euclid.ss/1177010647">[Project Euclid]</a></li>
</ul>
<p><b>Other resources:</b></p>
<ul>
<li>Pushpal Mukhopadhyay and Allen McDowell, &#8220;Small Area Estimation for Survey Data Analysis using SAS Software,&#8221; SAS Global Forum 2011. <a href="http://support.sas.com/resources/papers/proceedings11/336-2011.pdf">[SAS]<br />
</a><em>Examples of unit-level and area-level estimation with PROC MIXED and hierarchical Bayes estimation with PROC MCMC.</em></li>
<li>Virgilio Gómez-Rubio, &#8220;Tutorial on Small Area Estimation,&#8221; 2008 useR! Conference. <a href="http://www.bias-project.org.uk/SAE_tutorial/">[Website]</a><br />
<em>Slides and R code from tutorial session.</em></li>
<li>Arman Bidarbakht-Nia et al., &#8220;Workshop on Concepts &amp; Methods for Producing Disaggregated Statistics Using Census Data,&#8221; Bangkok, 2011. <a href="http://www.unescap.org/stat/meet/disaggregated-20-23Sep2011/">[Website]</a><br />
<em>Slides from tutorial session by UN-ESCAP staff.</em></li>
<li>World Bank staff et al. (?), &#8220;More Frequent, More Timely &amp; More Comparable Data for Better Results,&#8221; PREM 2011. <a href="http://go.worldbank.org/VP0C8U1K90">[Website]</a><br />
<em>Slides from workshop on poverty monitoring.</em></li>
<li>Bedi, Coudouel, and Simler (ed.), <em>More Than a Pretty Picture: Using Poverty Maps to Design Better Policies and Interventions</em>, The World Bank, 2007.</li>
<li>Elliott, Cuzick, English, and Stern (ed.), <em>Geographical and Environmental Epidemiology: Methods for Small-Area Studies</em>, Oxford University Press, 1992.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1273</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Evil Queen of Numbers</title>
		<link>http://civilstat.com/?p=1317&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=evil-queen-of-numbers</link>
		<comments>http://civilstat.com/?p=1317#comments</comments>
		<pubDate>Mon, 28 Jan 2013 20:18:27 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1317</guid>
		<description><![CDATA[Would there be any demand for a statistics class taught by M from the James Bond films? M: You don&#8217;t like me, Bond. You don&#8217;t like my methods. You think I&#8217;m an accountant, a bean counter, more interested in my numbers &#8230; <a href="http://civilstat.com/?p=1317">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Would there be any demand for a statistics class taught by M from the James Bond films?</p>
<blockquote><p>M: You don&#8217;t like me, Bond. You don&#8217;t like my methods. You think I&#8217;m an accountant, a bean counter, more interested in my numbers than your instincts.<br />
JB: The thought had occurred to me.<br />
[...]<br />
M: I&#8217;ve no compunction about sending you to your death, but I won&#8217;t do it on a whim.</p></blockquote>
<p><span id="more-1317"></span></p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='500' height='312' src='http://www.youtube.com/embed/CqS9jxruy-A?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
<p><span style="color: #444444;">And </span><a href="http://en.wikipedia.org/wiki/M_(James_Bond)">via Wikipedia</a> (wish I could find the clip)<span style="color: #444444;">:</span></p>
<blockquote><p>Tanner, her Chief of Staff, refers to her during the film as &#8220;the Evil Queen of Numbers&#8221;, given her reputation at that stage for relying on statistics and analysis rather than impulse and initiative.</p></blockquote>
<p><span style="color: #444444;">Yes, Bond&#8217;s instincts are great, but there&#8217;s something to be said for developing number-crunching skills too:</span></p>
<ol>
<li>Not everyone&#8217;s instincts are naturally as great as Bond&#8217;s. The rest of us benefit from structured ways to decide which apparent patterns are real and what&#8217;s just spurious noise.</li>
<li>One man can&#8217;t do it all. He operates within the context of a much bigger intelligence agency, whose several analysts help him see the big picture. This makes it possible to send Bond on targeted missions, instead of needing a million Bonds to scope out each and every possible lead individually.</li>
</ol>
<p>I imagine M&#8217;s course would be intense but rewarding. Your mission, should you choose to accept it&#8230;</p>
<ul>
<li><span style="color: #444444;">You must be careful about who you interrogate and how, if you want to trust your results: <strong>survey design, design of experiments</strong></span></li>
<li><span style="color: #444444;">Once you&#8217;ve gathered the intelligence, you must look for meaning: <strong>EDA, means, regressions, etc.</strong></span></li>
<li><span style="color: #444444;">But be distrustful of apparent patterns and don&#8217;t act on a whim: <strong>hypothesis testing</strong></span></li>
<li><span style="color: #444444;">The boys in the lab have come up with a curious measuring device. </span><span style="color: #444444;">Instead of recording length or temperature or what have you, </span><span style="color: #444444;">it measures the reliability and precision of your information: </span><strong><span style="color: #444444;">CIs / MOEs</span></strong></li>
</ul>
<p>And so on. You could start off with some visualizations, such as <a href="http://www.economist.com/news/books-and-arts/21564816-various-bonds-are-more-different-you-think">The Economist&#8217;s chart</a> of James Bond&#8217;s kills, romances, and martinis per film broken out by actor:</p>
<p><a href="http://www.economist.com/news/books-and-arts/21564816-various-bonds-are-more-different-you-think"><img class="alignnone size-full wp-image-1331" alt="EconomistJamesBondChart" src="http://civilstat.com/wp-content/uploads/2013/01/EconomistJamesBondChart.png" width="595" height="404" /></a></p>
<p>I admit I&#8217;m not too well-versed in James Bond culture, but it seems like a fun idea. Anybody know of a statistics 101 class taught with a tongue-in-cheek backstory like this?</p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1317</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Hypothesis tests will not answer your question</title>
		<link>http://civilstat.com/?p=1194&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=hypothesis-tests-will-not-answer-your-question</link>
		<comments>http://civilstat.com/?p=1194#comments</comments>
		<pubDate>Sat, 26 Jan 2013 23:02:56 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1194</guid>
		<description><![CDATA[Assume your hypothesis test concerns whether a certain effect or parameter is 0. With interval estimation, you can distinguish several options: The effect is precisely-measured and the interval includes 0, so we can ignore it. The effect is precisely-measured but &#8230; <a href="http://civilstat.com/?p=1194">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Assume your hypothesis test concerns whether a certain effect or parameter is 0. With interval estimation, you can distinguish several options:</p>
<ol>
<li>The effect is precisely-measured and the interval includes 0, so we can ignore it.</li>
<li>The effect is precisely-measured and the interval doesn&#8217;t include 0, but it&#8217;s close enough to 0 to be negligible for practical purposes, so we can ignore it.</li>
<li>The effect is precisely-measured to be far from 0, so we can keep it as is.</li>
<li>The effect is poorly-measured but we&#8217;re confident it&#8217;s not 0, so we can keep it but should still get more data to raise precision.</li>
<li>The effect is poorly-measured and might be 0, so we definitely need more data before deciding what to do.</li>
</ol>
<p>&#8230;as illustrated below:</p>
<p><img class="alignnone size-full wp-image-1199" alt="NHST" src="http://civilstat.com/wp-content/uploads/2013/01/NHST.png" width="500" height="500" /></p>
<p>Imagine you&#8217;re a scientist, making inferences about how the world works; or an engineer, building a tool that relies on knowing the size of these effects. You would like to distinguish between (1&amp;2) vs. (3) vs. (4&amp;5). Journal readers would like you to publish results for cases (1&amp;2) or (3), and should want you to collect more data before publishing in cases (4&amp;5).</p>
<p>Instead, hypothesis testing conflates (1&amp;5) vs. (2&amp;3&amp;4). That doesn&#8217;t help you much!</p>
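<p>To make the contrast concrete, here is a minimal Python sketch of the five cases. The <code>negligible</code> and <code>max_width</code> thresholds, the function names, and the example intervals are all made up for illustration; in practice they depend on what counts as practically important in your field. The interval-based rule separates all five cases, while the test-based rule only reports whether the interval excludes 0.</p>

```python
def classify_interval(lower, upper, negligible=0.1, max_width=1.0):
    """Assign an interval estimate to one of the five cases listed above.

    negligible: hypothetical bound on effects that are "practically zero".
    max_width:  hypothetical cutoff; wider intervals count as poorly measured.
    """
    precise = max_width >= (upper - lower)
    includes_zero = upper >= 0 >= lower
    if precise and includes_zero:
        return 1  # precisely measured, interval includes 0
    if precise and negligible >= max(abs(lower), abs(upper)):
        return 2  # precisely measured, nonzero but negligible
    if precise:
        return 3  # precisely measured, treated here as far from 0 (a simplification)
    if not includes_zero:
        return 4  # poorly measured, but confidently nonzero
    return 5      # poorly measured, might be 0


def rejects_zero(lower, upper):
    """What a test of 'effect = 0' reports: only whether the interval excludes 0."""
    return not (upper >= 0 >= lower)


# Made-up example intervals, one per case.
examples = [(-0.05, 0.05), (0.02, 0.08), (2.0, 2.4), (0.5, 6.0), (-3.0, 4.0)]
for lower, upper in examples:
    print((lower, upper), "case", classify_interval(lower, upper),
          "| test rejects 0:", rejects_zero(lower, upper))
```

<p>With these made-up intervals, the test gives one answer for cases 1 and 5 and another for cases 2, 3, and 4, which is the conflation described above.</p>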
<p><span style="color: #444444;">In particular, when you conflate (1&amp;5) it makes for bad science, as people in practice tend to interpret &#8220;not statistically significant&#8221; as &#8220;the effect must be spurious.&#8221; Instead they should interpret it as &#8220;</span><em>either</em><span style="color: #444444;"> the effect is spurious, </span><em>or</em><span style="color: #444444;"> it might be practically significant but not measured well enough to know.&#8221; And even if you were aware of this, and if you did collect more data to get out of situations (1&amp;5), a hypothesis test </span><em>still</em><span style="color: #444444;"> wouldn&#8217;t help you distinguish (2) vs. (3) vs. (4).</span></p>
<p>Finally, this is an issue whether your hypothesis tests are Frequentist or Bayesian. As <a href="http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/">John Kruschke&#8217;s excellent book</a> on <a href="http://www.amazon.com/gp/product/0123814855/ref=as_li_ss_tl?ie=UTF8&amp;tag=civilstatis-20&amp;linkCode=as2&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0123814855"><em>Doing Bayesian Data Analysis</em></a> points out, if you do a Bayesian model comparison of one model with a spiked prior at <img src='http://s0.wp.com/latex.php?latex=%5Ctheta%3D0&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='&#92;theta=0' title='&#92;theta=0' class='latex' /> and another with some diffuse prior for <img src='http://s0.wp.com/latex.php?latex=%5Ctheta&#038;bg=ffffff&#038;fg=333333&#038;s=0' alt='&#92;theta' title='&#92;theta' class='latex' />, &#8220;all that this model comparison tells us is which of two unbelievable models is less unbelievable. [And] that is not all we want to know, because usually we also want to know what [parameter] values <em>are</em> credible&#8221; (p.427).</p>
<p>This is in addition to all the other problems with hypothesis tests. See <a href="http://civilstat.com/?p=1076">my notes</a> from reading <a href="http://www.amazon.com/gp/product/0917227042/ref=as_li_ss_tl?ie=UTF8&amp;camp=1789&amp;creative=390957&amp;creativeASIN=0917227042&amp;linkCode=as2&amp;tag=civilstatis-20">Michael Oakes&#8217; <em>Statistical Inference</em></a> for several more.</p>
<p>Sure, traditional hypothesis testing does have its uses, but they are rather limited. I find that interval estimation does a better job of supporting careful thought about your analysis.</p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1194</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>audiolyzR: Data sonification with R</title>
		<link>http://civilstat.com/?p=740&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=audiolyzr-data-sonification-with-r</link>
		<comments>http://civilstat.com/?p=740#comments</comments>
		<pubDate>Sun, 13 Jan 2013 21:25:21 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>
		<category><![CDATA[Visualization]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=740</guid>
		<description><![CDATA[In his talk &#8220;Give Your Data A Listen&#8221; at last summer&#8217;s useR! 2012 conference, Eric Stone presented joint work with Jesse Garrison on audiolyzR, an R package for &#8220;data sonification.&#8221; I thought this was a nifty and well-executed idea. Since I haven&#8217;t seen &#8230; <a href="http://civilstat.com/?p=740">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>In his talk <a href="http://biostat.mc.vanderbilt.edu/wiki/pub/Main/UseR-2012/81-Stone.pdf">&#8220;Give Your Data A Listen&#8221;</a> at last summer&#8217;s <a href="http://biostat.mc.vanderbilt.edu/wiki/Main/UseR-2012">useR! 2012 conference</a>, <a href="https://twitter.com/theericstone">Eric Stone</a> presented joint work with <a href="https://twitter.com/goto10">Jesse Garrison</a> on audiolyzR, an R package for &#8220;data sonification.&#8221; I thought this was a nifty and well-executed idea. Since I haven&#8217;t seen Eric and Jesse post any demos online yet, I&#8217;d like to share a summary and video clip here, so that I can point to them whenever I describe audiolyzR to other folks.</p>
<p><img class="alignnone size-full wp-image-1222" alt="audiolyzR" src="http://civilstat.com/wp-content/uploads/2012/10/audiolyzR.png" width="500" height="288" /></p>
<p>In August I invited Eric to my workplace to speak, and he gave us a great talk including demos of features added since the useR session. Here&#8217;s the post-event summary:</p>
<blockquote><p>Eric Stone, a PhD student at Temple University, presented his co-authored work with Jesse Garrison on &#8220;data sonification&#8221;: using sound (other than speech) to visualize a dataset.<br />
Eric demonstrated audiolizations of scatterplots and histograms using the statistical software R and the audio toolkit <a href="http://cycling74.com/products/max/">Max/MSP</a>, as well as his ongoing research on time-series line plots. The software shows a visual display of the data and then plays an audio version, with the x-axis mapped to time and the y-axis to pitch. For instance, a positively-correlated scatterplot sounds like rising scales or arpeggios. Other variables are represented by timbre, volume, etc. to distinguish them. The analyst can also tweak the tempo and other settings while listening to the data repeatedly to help outliers stand out more clearly. A few training examples helped the audience to learn how to listen to these audiolizations and identify these outliers.<br />
Eric believes that, even if the audiolization itself is no clearer than a visual plot, activating multiple cortices in the brain makes the analyst more attuned to the data. As a musician since childhood, he succeeded in making the results sound pleasant so that they do not wear out the listener.<br />
The software will soon be released as an R package and linked to <a href="http://en.wikipedia.org/wiki/RExcel">RExcel</a> to expand its reach to Excel users. Future work includes: 1) supporting more data structures and more layers of data in the same audiolization; 2) testing the software with visually impaired users as a tool for accessibility; and 3) developing ways to embed the audiolizations into a website.</p></blockquote>
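<p>The basic mapping in that summary (x-axis to time, y-axis to pitch) can be sketched in a few lines of Python. This is only my own illustration of the idea, not audiolyzR&#8217;s actual implementation: the function name <code>sonify</code>, the four-second duration, and the MIDI note range are all assumptions, and the sketch stops at computing note onsets and frequencies rather than playing any sound.</p>

```python
def sonify(xs, ys, duration=4.0, low_note=48, high_note=84):
    """Map points to (onset_seconds, frequency_hz) pairs, sorted by onset.

    x is rescaled linearly to the interval [0, duration] seconds;
    y is rescaled linearly to MIDI note numbers, then converted to Hz.
    """
    x_min, y_min = min(xs), min(ys)
    x_span = (max(xs) - x_min) or 1.0  # avoid dividing by zero for constant data
    y_span = (max(ys) - y_min) or 1.0
    notes = []
    for x, y in zip(xs, ys):
        onset = duration * (x - x_min) / x_span
        midi = round(low_note + (high_note - low_note) * (y - y_min) / y_span)
        frequency = 440.0 * 2 ** ((midi - 69) / 12)  # MIDI note 69 is the A at 440 Hz
        notes.append((onset, frequency))
    return sorted(notes)


# A positively-correlated toy dataset: the pitches rise as time goes on.
for onset, frequency in sonify([0, 1, 2, 3], [1, 2, 3, 4]):
    print(round(onset, 2), "s", round(frequency, 1), "Hz")
```

<p>Mapping other variables to timbre or volume, as the talk described, would need an actual audio toolkit and is left out here.</p>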
<p>Eric suggested that he can imagine someone using this as part of an information dashboard or for reviewing a zillion different data views in a row, while multi-tasking: Just set it to loop through each slice of the data while you work on something else. Your ears will alert you when you hit a data slice that&#8217;s unusual and worth investigating further.</p>
<p>Eric has kindly sent me a version of the package, and below I demonstrate a few examples using <a href="http://www.cdc.gov/nchs/nhanes.htm">NHANES</a> data:</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='500' height='312' src='http://www.youtube.com/embed/gVn74OceIfw?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
<p>I&#8217;ve asked Eric if there&#8217;s a public release coming anytime soon, but it may be a while:</p>
<blockquote><p>I am nearly ready to release it, but it&#8217;s one of those situations where my advisor will come up with &#8220;just one more thing&#8221; to add, so, you know, it might be a while.. Anyway, if people are interested I can provide them with the software and everything. Just let me know if anyone is.</p></blockquote>
<p>If you want to get in touch with Eric, his contact info is in the useR talk abstract linked at the top.</p>
<p>On a very-loosely-related note, consider also John Cook&#8217;s post on <a href="http://www.johndcook.com/blog/2008/03/11/how-loud-is-the-evidence/">measuring evidence in decibels</a>. Someday I&#8217;d like to re-read this after I&#8217;ve had my morning coffee and think about whether there&#8217;s any useful way to turn this metaphor into literal sonic hypothesis testing.</p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=740</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>DC R Meetup: &#8220;Analyze US Government Survey Data with R&#8221;</title>
		<link>http://civilstat.com/?p=1246&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=dc-r-meetup-analyze-us-government-survey-data-with-r</link>
		<comments>http://civilstat.com/?p=1246#comments</comments>
		<pubDate>Fri, 11 Jan 2013 04:47:32 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[R]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1246</guid>
		<description><![CDATA[I really enjoyed tonight&#8217;s DC R Meetup, presented by the prolific Anthony Damico. [Edit: adding link to the full video of Anthony's talk; review is below.] I&#8217;ve met Anthony before to discuss whether the Census Bureau could either&#8230; publish R-readable &#8230; <a href="http://civilstat.com/?p=1246">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I really enjoyed <a href="http://www.meetup.com/R-users-DC/events/95903742/">tonight&#8217;s</a> <a href="http://www.meetup.com/R-users-DC/">DC R Meetup</a>, presented by the prolific Anthony Damico. [<em>Edit: adding link to the full video of Anthony's talk; review is below.</em>]</p>
<p><iframe src="http://player.vimeo.com/video/57238296" width="480" height="360" frameborder="0" webkitAllowFullScreen mozallowfullscreen allowFullScreen></iframe></p>
<p><a href="https://github.com/ajdamico/usgsd/blob/master/analyze%20survey%20data%20for%20free%20-%20the%20flowchart.pdf?raw=true"><img class="alignnone size-full wp-image-1247" alt="DamicoFlowchart (small)" src="http://civilstat.com/wp-content/uploads/2013/01/IMG_8807-500x375.jpg" width="500" height="375" /></a></p>
<p>I&#8217;ve met Anthony before to discuss whether the Census Bureau could either&#8230;</p>
<ul>
<li>publish R-readable input statements for flat file public datasets (instead of only the SAS input statements we publish now); or&#8230;</li>
<li>cite his R package <a href="http://blog.revolutionanalytics.com/2012/07/importing-public-data-with-sas-instructions-into-r.html"><code>sascii</code></a>, which automatically processes a SAS input file and reads data directly into R (no actual SAS installation required!). Folks agree <code>sascii</code> is an excellent tool and we&#8217;re working on the approvals to mention it on the relevant download pages.</li>
</ul>
<p>Meanwhile, Anthony&#8217;s not just waiting around. He&#8217;s put together an awesome blog, <a href="http://www.asdfree.com/">asdfree.com (&#8220;Analyze Survey Data for Free&#8221;)</a>, where he posts complete R instructions for finding, downloading, importing, and analyzing each of several publicly-available US government survey datasets. These include, in his words, &#8220;obsessively commented&#8221; R scripts that make it easy to follow his logic and understand the analysis examples. Of course, &#8220;My syntax does not excuse you from reading the technical documentation,&#8221; but the blog posts point you to the key features of the tech docs. For each dataset on the blog, he also makes sure to replicate a set of official estimates from that survey, so you can be confident that R is producing the same results that it should.<span id="more-1246"></span></p>
<p>For huge datasets that fit on his hard drive but not in memory, which R therefore can&#8217;t handle natively, he recommends the open-source database <a href="http://www.monetdb.org/Home">MonetDB</a>. With R and MonetDB, he was able to do summary statistics on 67 million records in 8 seconds on his personal laptop; I don&#8217;t know the standards in this area myself but the audience seemed to find it impressive. And Anthony&#8217;s blog gives clear step-by-step instructions for installing MonetDB, getting these huge survey datasets into it, and calling them from R.</p>
<p>(He mentioned a few other ways to deal with big data and R, but my favorite was: &#8220;If your colleagues avoid R because the data won&#8217;t fit into RAM, tell them to take the $10,000 you spend on SAS licenses per year; spend $30 on RAM; and&#8230; spend the other $9,970 on pizza parties every day.&#8221;)</p>
<p>Anthony actually started the talk with a quick illustration of why R is so useful: unlike competitors such as SAS, SPSS, Stata, and SUDAAN, which are statistical <em>packages</em> that manipulate everything as data tables, R is a statistical <em>programming language</em> that lets you subset objects, pass them to functions, stick the output directly into another function, etc. without requiring that everything be a data table. He warned it&#8217;s a difficult language, and &#8220;You will not get instant gratification from R,&#8221; but it&#8217;s certainly worth learning.</p>
<p>Furthermore, he warned of the &#8220;vanishing right to privacy&#8221; of code used for public analyses based on public data. There&#8217;s a trend afoot to make such research openly reproducible by anyone, and that&#8217;s easier with an open language like R than with its proprietary competitors. Also, if you change jobs, it&#8217;s easier to keep working in an open tool like R than to justify your bosses paying for, say, a SAS license if your new workplace is a Stata shop (for example). Plus, the R community and package system are a huge bonus.</p>
<p>Finally, if you&#8217;re psyched to use R but need help not only with these survey datasets but with learning R itself, Anthony has a great series of two-minute R tutorials at <a href="http://www.twotorials.com/">twotorials.com</a> (I wish I&#8217;d thought of the name first!). He does talk <em>fast</em> and there are over 90 of them by now, so it&#8217;s a lot of information. His <a href="https://github.com/ajdamico/usgsd/blob/master/analyze%20survey%20data%20for%20free%20-%20the%20flowchart.pdf?raw=true">flowchart handout</a> lists other resources for learning R.</p>
<p>[<em>Edit: this seems to have been fixed, but just FYI:</em> Anthony's site, "Analyze Survey Data for Free" or <a href="http://www.asdfree.com/">asdfree.com</a>, <del>is currently</del> was being blocked by some anti-virus software. The site is safe, but apparently the domain name used to belong to spammers, and on some computers this <del>is</del> was even blocking the R Meetup event page that links to <a href="http://www.asdfree.com/">asdfree.com</a>. <del>For that reason I won't hyperlink to his site yet either.</del> But he's had the site re-reviewed by the anti-virus companies, and they <del>will fix</del> fixed it in an update <del>soon</del>. So if your computer blocks it too, wait a few days and it should be back!]</p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1246</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>More names for statistics, and do they matter?</title>
		<link>http://civilstat.com/?p=911&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=more-names-for-statistics-and-do-they-matter</link>
		<comments>http://civilstat.com/?p=911#comments</comments>
		<pubDate>Sun, 06 Jan 2013 17:46:48 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=911</guid>
		<description><![CDATA[Partly continuing on from my previous post&#8230; So I think we&#8217;d all agree that applied mathematics is a venerable field of its own. But are you tired of hearing statistics distinguished from &#8220;data science&#8221;? Trying to figure out the boundaries between &#8230; <a href="http://civilstat.com/?p=911">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Partly continuing on from <a href="http://civilstat.com/?p=1165">my previous post</a>&#8230;</p>
<p>So I think we&#8217;d all agree that applied mathematics is a venerable field of its own. But are you tired of hearing statistics distinguished from &#8220;<a href="http://masi.cscs.lsa.umich.edu/~crshalizi/weblog/925.html">data science</a>&#8221;? Trying to figure out the <a href="http://www.drewconway.com/zia/?p=2378">boundaries between data science, statistics, and machine learning</a>? <a href="http://www.dataspora.com/2009/05/sexy-data-geeks/">What skills are needed</a> by the people in this field (these fields?), <a href="http://medriscoll.com/post/18784448854/the-data-science-debate-domain-expertise-or-machine">do they also need domain expertise</a>, and <a href="http://mathbabe.org/2012/07/31/statisticians-arent-the-problem-for-data-science-the-real-problem-is-too-many-posers/">are there too many posers</a>?</p>
<p>Or are you now confused about <a href="http://www.stat.cmu.edu/~kass/papers/what.pdf">what is statistics</a> in the first place? (Excellent article by Brown and Kass, with excellent discussion and rejoinder &#8212; deserving of its own blog post soon!)</p>
<p>Or perhaps you are psyched for the growth of even more of these similar-sounding fields? I&#8217;ve recently started hearing people proclaim themselves experts in <a href="http://www.american.edu/cas/economics/info-metrics/upload/Golan-abstract-10-31-12.pdf">info-metrics</a> and <a href="http://www.siam.org/news/news.php?id=1972">uncertainty quantification</a>. <em>[Edit: here's yet another one: <a href="http://www.pnl.gov/coginformatics/">cognitive informatics</a>.]</em></p>
<p>Is there a benefit to having so many names and traditions for what should, essentially, be the same thing, if it hadn&#8217;t been historically rediscovered independently in different fields? Is it just a matter of branding, or do you think all of these really are distinct specialties?</p>
<p><a href="http://icanhascheezburger.files.wordpress.com/2012/03/advice-animals-memes-animal-memes-chemistry-cat-well-finish-the-sentence.jpg"><img class="alignnone" title="Incomplete data" alt="" src="http://icanhascheezburger.files.wordpress.com/2012/03/advice-animals-memes-animal-memes-chemistry-cat-well-finish-the-sentence.jpg" width="441" height="574" /></a></p>
<p>Given the position in <a href="http://civilstat.com/?p=1165">my last post</a>, I might argue that you should complete Chemistry Cat&#8217;s sentence with &#8220;&#8230;and those who can quantify their uncertainty about those extrapolations.&#8221; And maybe some fields have more sophisticated traditions for tackling the first part, but statisticians are especially focused on the second.</p>
<p>In other words, much of a statistician&#8217;s unique contribution (what we think about more than an applied mathematician, data scientist, <a href="http://masi.cscs.lsa.umich.edu/~crshalizi/weblog/698.html">haruspicer</a>, etc. might) is our focus on the uncertainty-related properties of our estimators. We are the first to ask: what&#8217;s your estimator&#8217;s bias and variance? Is it robust to data that doesn&#8217;t meet your assumptions? If your data were sampled again from scratch, or if you ran your experiment again, what&#8217;s the range of answers you&#8217;d expect to see? These questions are front and center in statistical training, whereas in, say, <a href="http://cs229.stanford.edu/materials.html">the Stanford machine learning class handouts</a>, they often come in at the end as an afterthought.</p>
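<p>One simple way to ask those questions of an estimator is to simulate &#8220;sampling again from scratch&#8221; many times. The Python sketch below is just an illustration under assumptions of my own choosing: a normal population with variance 4, samples of size 10, 2,000 replications, and the helper names <code>resample_estimator</code> and <code>draw_normal_sample</code>. It compares the divide-by-n and divide-by-(n-1) variance estimators on the same simulated samples.</p>

```python
import random
import statistics


def resample_estimator(estimator, draw_sample, true_value, n_reps=2000, seed=1):
    """Approximate an estimator's bias, variance, and central 95% range by simulation."""
    rng = random.Random(seed)
    estimates = sorted(estimator(draw_sample(rng)) for _ in range(n_reps))
    return {
        "bias": statistics.fmean(estimates) - true_value,
        "variance": statistics.variance(estimates),
        "central_95": (estimates[int(0.025 * n_reps)],
                       estimates[int(0.975 * n_reps) - 1]),
    }


def draw_normal_sample(rng, n=10, sd=2.0):
    """Draw one fresh sample from an assumed Normal(0, sd) population."""
    return [rng.gauss(0.0, sd) for _ in range(n)]


# The same seed means both estimators see exactly the same simulated samples.
divide_by_n = resample_estimator(statistics.pvariance, draw_normal_sample, true_value=4.0)
divide_by_n_minus_1 = resample_estimator(statistics.variance, draw_normal_sample, true_value=4.0)
print("divide by n:    ", divide_by_n)
print("divide by n - 1:", divide_by_n_minus_1)
```

<p>In this setup the divide-by-n estimator should come out biased downward but with a smaller variance, which is the kind of trade-off these questions are meant to surface. With real data, where the true value is unknown, one would need something like the bootstrap instead.</p>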
<p>So my <em>impression</em> is that other fields are at higher risk of modeling just the mean and stopping there (not also telling you what range of data you may see outside the mean), or overfitting to the training data and stopping there (not telling you how much your mean-predictions rely on what you saw in this particular dataset). On the other hand, perhaps traditional stats models for the mean/average/typical trend are less sophisticated than those in other communities. When statisticians limit our education to the kind of models where it&#8217;s <em>easy</em> to derive MSEs and compare them analytically, we miss out on the chance to contribute to the development &amp; improvement of many other interesting approaches.</p>
<p>So: if you call yourself a statistician, don&#8217;t hesitate to talk with people who have a different title on their business cards, and see if your special view on the world can contribute to their work. And if you&#8217;re one of these others, don&#8217;t forget to put on your statistician hat once in a while and think deeply about the variability in the data or in your methods&#8217; performance.</p>
<p>PS &#8212; I don&#8217;t mean to be adversarial here. Of course a good statistician, a good applied mathematician, a good data scientist, and presumably even a good infometrician(?) ought to have much of the same skillset and worldview. But given that people can be trained in different departments, I&#8217;m just hoping to articulate what might be gained or lost by studying Statistics rather than the other fields.</p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=911</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>One difference between Statistics vs. Applied Math</title>
		<link>http://civilstat.com/?p=1165&#038;utm_source=rss&#038;utm_medium=rss&#038;utm_campaign=one-difference-between-statistics-vs-applied-math</link>
		<comments>http://civilstat.com/?p=1165#comments</comments>
		<pubDate>Sat, 05 Jan 2013 21:41:12 +0000</pubDate>
		<dc:creator>civilstat</dc:creator>
				<category><![CDATA[Education]]></category>
		<category><![CDATA[Statistics]]></category>

		<guid isPermaLink="false">http://civilstat.com/?p=1165</guid>
		<description><![CDATA[I&#8217;ll admit it: before grad school I wasn&#8217;t fully clear on the distinction between statistics and applied mathematics. In fact &#8212; gasp! &#8212; I may have thought statistics was a branch of mathematics, rather than its own discipline. (On the &#8230; <a href="http://civilstat.com/?p=1165">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ll admit it: before grad school I wasn&#8217;t fully clear on the distinction between statistics and applied mathematics. In fact &#8212; gasp! &#8212; I may have thought statistics was a branch of mathematics, rather than its own discipline. (On the contrary: see <a href="http://clint.sharedwing.net/research/stat/mst.pdf">Cobb &amp; Moore (1997) on &#8220;Mathematics, Statistics, and Teaching&#8221;</a>; <a href="http://wmbriggs.com/blog/?p=3169">William Briggs&#8217;s blog</a>; and many others.)</p>
<p>Of course the two fields overlap considerably; but clearly a degree in one area will not emphasize exactly the same concepts as a degree in the other. One such difference I&#8217;ve seen is that statisticians have a greater focus on variability. That includes not just quantifying the usual uncertainty in your estimates, but also modeling the variability in the underlying population.</p>
<p>In many introductory applied-math courses and textbooks I&#8217;ve seen, the goal of modeling is usually to get the equivalent of a point estimate: the system&#8217;s behavior after converging to a steady state, the maximum or minimum necessary amount of something, etc. You may eventually get around to modeling the variability in the system too, but it&#8217;s not hammered into you from the start like it is in a statistics class.</p>
<p>For example, I was struck by some comments on <a href="http://www.johndcook.com/blog/2009/04/15/intellectual-traffic-jam/">John Cook&#8217;s post about (intellectual) traffic jams</a>. Skipping the &#8220;intellectual&#8221; part for now, here&#8217;s what Cook said:<span id="more-1165"></span></p>
<blockquote><p>Imagine you’re on a highway with two lanes in each direction. Two cars are traveling side-by-side at exactly the speed limit. No one can pass, and so the cars immediately behind the lead pair go a little slower than the speed limit in order to maintain a safe distance. This process cascades until traffic slows down to a crawl miles behind the pair of cars responsible for the traffic jam.</p></blockquote>
<p>Commenter 1:</p>
<blockquote><p>That’s a flawed assumption. After driving a little bit slower for a short period of time, the second pair of cars can speed up to drive at the same speed as the leading pair. The distance, providing everyone’s driving at a constant speed, is going to remain the same.</p></blockquote>
<p>Commenter 2:</p>
<blockquote><p>Why do the cars behind have to go slower? As they approach the two lead cars that cruise abreast, they will have to slow to the speed limit at a safe following distance. They might initially overcompensate. But ultimately, the system ought to stabilize to a point where everyone’s going the speed limit (given a sufficiently long highway, such that road capacity doesn’t become the limiting factor).</p></blockquote>
<p>My first reaction was that these commenters showed an applied-math way of thinking: There&#8217;s a steady state in which all the cars could go at the same speed as the leading pair, so presumably that&#8217;s what must happen.</p>
<p>A statistician, on the other hand, is trained to think about variability from the start, and should immediately recognize that the following cars <em>won&#8217;t</em> be able to match speeds perfectly (even with cruise control you&#8217;re not likely to keep <em>exactly</em> the same speed as the leading cars), and that this is going to be a major part of the problem. Indeed, see Cook&#8217;s response:</p>
<blockquote><p>Drivers speed up and slow down for various reasons over time. Say someone’s speed varies 10 mph. On an open highway they can average 55 by driving between 50 and 60. But if someone in front of them is driving a constant 55, they will have to slow down to 50 in order to be able to maintain their usual variability.</p></blockquote>
<p>In other words, if you&#8217;re &#8220;thinking like a statistician,&#8221; you&#8217;re likely to notice certain features of the problem that you might miss by &#8220;thinking like an applied mathematician.&#8221; Now, the kind of models or simulations you&#8217;d use here might not be the ones traditionally taught in a stats class, so the statistician may in fact need an applied mathematician to help with the modeling&#8230; but that key insight may be more likely to come from the statistician in the first place.</p>
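<p>Here is a toy Python simulation of that insight. It is a rough sketch under strong assumptions of my own, not a real traffic model: the lead car holds exactly 55 mph, every follower wants to wobble uniformly within 5 mph of 55 (Cook&#8217;s numbers), the cars are bunched closely enough that nobody can go faster than the car directly ahead, and following distance and acceleration are ignored. The function name <code>simulate_platoon</code> and its parameters are hypothetical.</p>

```python
import random


def simulate_platoon(n_cars=6, lead_speed=55.0, wobble=5.0, n_steps=5000, seed=1):
    """Return each car's average speed in a no-passing, closely bunched line of cars."""
    rng = random.Random(seed)
    totals = [0.0] * n_cars
    for _ in range(n_steps):
        ahead = lead_speed  # car 0 holds a constant speed
        totals[0] += ahead
        for i in range(1, n_cars):
            desired = lead_speed + rng.uniform(-wobble, wobble)
            speed = min(desired, ahead)  # cannot go faster than the car in front
            totals[i] += speed
            ahead = speed
    return [total / n_steps for total in totals]


for position, average in enumerate(simulate_platoon()):
    print("car", position, "average speed", round(average, 2))
```

<p>Even though every driver would average 55 mph on an open road, in this sketch each car&#8217;s average speed can only be at or below that of the car ahead, so the slowdown accumulates toward the back of the line. That is the variability-driven effect the steady-state argument misses.</p>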
<p>Of course this has been a caricature &#8212; a good applied mathematician will get around to considering variability too &#8212; but it still seems to be a difference between the fields&#8217; focuses. Do you agree, or am I seeing a spurious pattern?</p>
<p>Also, in hindsight, I&#8217;m glad that I went into statistics instead of applied math, since otherwise I would likely not have gained such a strong focus on quantifying variation. But I&#8217;m sure there are major conceptual insights I missed by not getting an applied math degree instead &#8212; I wonder what they are?</p>
<p>As for the traffic jam idea itself, here&#8217;s an empirical example:</p>
<p><span class='embed-youtube' style='text-align:center; display: block;'><iframe class='youtube-player' type='text/html' width='500' height='312' src='http://www.youtube.com/embed/Suugn-p5C1M?version=3&#038;rel=1&#038;fs=1&#038;showsearch=0&#038;showinfo=1&#038;iv_load_policy=1&#038;wmode=transparent' frameborder='0'></iframe></span></p>
]]></content:encoded>
			<wfw:commentRss>http://civilstat.com/?feed=rss2&#038;p=1165</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>
