TL;DR: The Knight Center’s free online journalism courses are great for anyone who works with data, storytelling, or both. See what’s being offered here.
My favorite links from a recent course on Data-Driven Journalism are here.
And a fellow student’s suggested reading list is here.
Last fall, a coworker and I led a study group for the Knight Center’s MOOC (massive open online course) on “Introduction to Infographics and Data Visualization”, taught by Alberto Cairo. The course and Alberto’s book were excellent, and we were even able to bring Alberto to the Census Bureau for a great lecture a few months later. This course is now in its third offering (starting today!) and I cannot recommend it highly enough if you have any interest in data, journalism, visualization, design, storytelling, etc.!
So, this summer I was happy to see the Knight Center offering another MOOC, this time on “Data-Driven Journalism: The Basics”. What with moving cities and starting the semester, I hadn’t kept up with the class, but I’ve finally finished the last few readings & videos. Overall I found a ton of great material.
The course’s five lecturers gave an overview of data-driven journalism: from its historical roots in the 1800s and its relation to computer-aided reporting, to how to get data in the first place, through cleaning and checking the data, and finally to building news apps and journalistic data visualizations.
In week 3 there was a particularly useful exercise of going through a spreadsheet of hunting accidents. Of course it illustrated some of the difficulties in cleaning data, and it gave concrete practice in filtering and sorting. But it was also a great illustration of how a database can lead you to potential trends or stories that you might have missed if you’d only gone out to interview a few individual hunters.
I loved some of the language that came up, such as “backgrounding the data” — analogous to checking out your sources to see how much you can trust them — or “interrogating the data,” including coming prepared to the “data interview” to ask thorough, thoughtful questions. I’d love to see a Statistics 101 course taught from this perspective. Statisticians do these things all the time, but our terminology and approach seem alien and confusing the first few times you see them. “Thinking like a journalist” and “thinking like a statistician” are not all that different, and the former might be a much more approachable path to the latter.
For those who missed the course, consider skimming the Data Journalism Handbook (free online); Stanford’s Data Journalism lectures (hour-long video); the course readings I saved on Pinboard; and my notes below.
Edit: See also fellow student Daniel Drew Turner’s suggested reading list.
Then, keep an eye out for next time it’s offered on the Knight Center MOOC page.
Below is a (very messy) braindump of notes I took during the class, in case there are any useful nuggets in there. (Messiness due to my own limited time to clean the notes up, not to any disorganization in the course itself!) I think the course videos were not for sharing outside the class, but I’ve linked to the other readings and videos.
- Very relevant if I ever do a Stats companion to CodeWithMe (perhaps StatWithMe?)
- Compile data, then Clean it well, then find the Context, and Combine insights, and finally Communicate.
- MathStats traditionally focus on just a sliver of the Context and Combine pieces,
but there’s so much more.
- There should be a Journalism For Statisticians course too, not just Stats for Journalists!
“Data-journalists are the new punks” TEDx talk, Simon Rogers
“PDFs are where data goes to die. It’s a way for governments to release information without really releasing it.”
Week 1: Amy Schmitz Weiss, SDSU
- It’s not just charts and spreadsheets… but her examples DO rely on those underlying the story.
Not just interviewing a few people and taking their word for their impressions
(“Our response times are quick” or “Our employees are not overpaid”)
but rather taking a whole dataset and seeing if the story it tells aligns with the interviews
(response times here and here tend to be quick but that locality has major delays;
or this is the range of your salaries compared to national trends)
i.e. letting you dig deeper or find unexpected facts/features
- The data “can provide context and depth to daily stories. It includes techniques of producing tips that launch more complex stories from a broader perspective…”
- History of DDJ incl 1821 “leaked table” of school attendance & tuition payment records
and 18?? Flo Nightingale report that more died from illness than bullets
- Philip Meyer’s precision journalism in 1960s-70s: use the tools of social science
- DDJ brings sense & structure — puts your other info in context
and finds relevance of local issues by connecting to broader trends
(how’s my nbhd compare to rest of city?)
and lets you independently verify/interpret the official info/interpretations
- 5 components:
1. the story/issue/situation/event
2. curiosity surrounding it — why are things the way they are?
3. the data — incl metadata about its format, organization, and source
4. solid interviewing/reporting — healthy skepticism and use of multiple sources
5. data-driven mindset:
social science tools, relationships b/w variables, add another level to the story
study of regular patterns and irregularities at aggregate level
peel it all back to rel’n to social life
…so it’s all about layers and context
- “interviewing the data”
- what data do you seek?
how do you obtain & clean it?
how do you interview/analyze it?
how do you present it to public?
- consider the layers (ppl, places, events, situations, etc):
what are similarities & diffs?
how do they change over time and space?
how are & aren’t they associated with each other?
- data can be the source of the story, or how it’s told, or both.
so be conscious of how:
like any other source, it demands skepticism;
and like any other tool, it affects/restricts the stories created with it
- data’s a complement to your process. can help find the story, give context to it, and/or present it.
Data Journalism Handbook Introduction
DDJ can automate the gathering/combining of info,
connect thousands of documents,
tell a broad story w/infographics,
use ixive tools to show how a specific individual is affected, not just the average, etc
Journalists used to be the only ones who could “multiply and distribute what had happened over night”
but no longer… so now, instead of the journalist just being the first to tell a story,
they’re the ones to tell what a new development actually means,
to create useful personalized calculators,
to point out fallacies
But existing journalists need help getting trained in use of data
and also to help combat data-driven PR and numbers liars like Enron
DDJ helps you find unique stories (not just copied from the wires), and be a watchdog
Quoting/citing/sharing academic works is what inspired the Web’s hyperlinks,
and citing/sharing data and other source materials can help improve journalism
Friction between CAR (computer-assisted reporting) and DDJ communities
— perhaps similar to Statistics vs DataScience?
How to get started in data journalism
Start with a question
Find a community and learn from others
There are a ton of sub-fields, all useful, and you can’t learn them all
Dabble in various areas so you can learn to adapt to the needs of each new project
How to be a data journalist
Four main sub-fields to DDJ:
1. Finding data, whether through expert contacts or SQL skillz
2. Interrogating data, incl jargon, context, stats, and spreadsheet skillz
3. Visualizing data
4. Mashing data
Either start with a dataset covering some topic of interest,
or start with a question and find the data
Try using Yahoo! Pipes
“a powerful composition tool to aggregate, manipulate, and mashup content from around the web”
Week 2: Lise Olsen, Houston Chronicle
Video lecture 1, 2
- Videos suggest lots of sources for one-off data points: Facebook messages, web event posts, etc:
things useful for a journalist seeking a specific story,
but maybe less so for a statistician gathering general trends
- Useful directory of existing public records
- Don’t forget the Wayback Machine
- Corporate data across countries
- Google’s “publicdata directory”?
The Guardian’s data page?
- Document Cloud helps organize and search a whole set of documents at once
- Look at the way things should be … and look at reality,
and there will always be a difference
- Read up on background first, then talk to people around the edges,
before talking to the person at the center of the investigation,
so you are prepared with good questions
and with good sharp responses when they lie to you (as they will)
- Often you can access FOIA laws from other countries w/o being a citizen
Beyond Google: list of many other useful search engines and related sites
- Intense processing of PDFs with OCR and Excel is required before the serious work can begin!
- Several stories came through linking 1000s of docs:
– spending emergency funds on office furniture
– overspending on bodyguards etc during trips
– multiple conflicting trips (impossible at the same time)
- “Most of my investigations are based on documents. It’s hard for anyone to factually find fault with something that has been put in writing – in a public document.”
- “many times I will make a records request for some basic documents like salaries for all employees and then get an unsolicited call or letter telling me what I should be looking for. I guess word spreads when records requests are filed, and when employees hear someone is looking into their entity it is an impetus to call me.”
- “investigating obscure entities and agencies. … I have had the most success as an investigative reporter looking into those entities. The less sun that’s been shined, the more chance there is for abuse.”
- “the key to most strong investigations is that they don’t disappear after one story”
Week 3: Derek Willis, NYT
“Spreadsheets … are the gateway drug for data journalism. If you can get comfortable using them, you’ll be much more likely to be able to move on to using databases and doing more sophisticated and complex analyses.”
Video 2, Backgrounding the data
“Data should have a history. If it doesn’t, it can be hard to know whether to trust it and how far to take the answers it gives you.”
Q’s to ask the data:
* Where does it come from? If a form, get a copy of that form.
* How often & for what purpose is it collected? If exist reports based on it, get those.
* Get any documentation for the data.
* Talk to whoever’s responsible for the data.
These questions will help shape the questions you’ll ask of the data (and of any experts you interview).
Data documentation: “The Census Bureau does a really good job of this. They’re a great example of an agency that provides comprehensive documentation as a standard practice.”
Video 3, Flaws and Limitations – Sorting
“All data is bad; you just need to find out how bad.”
Video 4, Flaws and Limitations – Filtering
In Excel, Data -> Filter, then click on a header to filter by that column’s contents
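The same filter-and-sort moves translate directly to code. Here is a minimal sketch in pandas, using an invented hunting-accidents table (all column names and values are hypothetical, not from the course dataset):

```python
# Rough pandas equivalents of Excel's Data -> Filter and column sorting,
# on a small made-up hunting-accidents table.
import pandas as pd

accidents = pd.DataFrame({
    "county": ["Adams", "Brown", "Adams", "Clark"],
    "injury": ["hand", "leg", "leg", "hand"],
    "year": [2011, 2012, 2012, 2011],
})

# Filter: keep only leg injuries (like ticking one value in a column's dropdown)
legs = accidents[accidents["injury"] == "leg"]

# Sort: newest incidents first (like sorting by a column header)
newest = accidents.sort_values("year", ascending=False)
```

Filtering and sorting won’t find the story for you, but they make the “does this column have surprises?” questions cheap to ask.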
Video 5, Asking Questions
* There are no stupid questions
* Start with the obvious
* Test the conventional wisdom
* Think about the data that isn’t there
Week 4: Jeremy Bowers, NPR
pre-video readings: I already read them a few weeks ago, and pinboarded a few top ideas, particularly:
- the story may be a big-blob-of-text in the paper itself,
but your website should track it in a database so you can tag/search/browse by date, subject, etc
(the CMU Stats project database should do this too!)
- taking public-but-hard-to-use data and making it widely-available-to-browse
can have a huge impact on the people whose data it originally was
i.e. gun owners database, felon mug shots app, etc:
technically this was all publicly available before
but if you make it TOO easy for others to see then it becomes problematic
and you don’t want certain things to linger on Google forever for those poor folks
Week 4 intro video
* NewsApp w/o DataVis: ProPub Dollars for Docs
* DataVis w/o NewsApp: WaPo Olympic Athletes
* NewsApp w/ DataVises on each page: ChiTrib crime site
* NewsApp w/ one DataVis on landing page: NPR Arrested Dvlpt
Use a DataVis if you have a powerful overall story;
a NewsApp if you have powerful individual stories;
and both if both!
“A news app… is an interactive web page that uses software instead of words and pictures to do journalism”
–Scott Klein, ProPublica
“It’s like a miniature self-contained DMS for a single bucket of data”
need to figure out “atomic unit” of your news dataset
take the main nouns in your stories and make them database fields
e.g. Dollars For Docs app uses each payment as atomic units, with From (drug company) and To (doctor) fields
index page shows readers why they care about the data: trends and aggregates
list page shows list of atomic units that can be clicked on for detail
i.e. all ppl from same region, payments to same company, accidents to same body part, etc
detail pages (individual pages) show details of interest about each atomic unit
but clickable parts on these detail pages send you back to another list page
- need to have unique IDs for each atomic unit, may need to do data-cleaning or linkage to achieve this,
e.g. if 10 crimes are by “sally smith” but they’re different people, name is not a unique ID
- make sure detail pages give context to where that detail fits relative to others
[? how would I make Stats dept’s projects page into a compelling “app”?
make a paper prototype and pitch it to Chad, Carl, etc!]
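The atomic-unit idea above can be sketched in a few lines. This is a hypothetical model loosely based on the Dollars for Docs example; all field names and records are invented, and a real news app would back this with a database rather than in-memory lists:

```python
# Sketch of a news app's "atomic unit": one payment, with From/To fields
# and a synthetic unique ID, since names alone can collide.
from dataclasses import dataclass

@dataclass
class Payment:
    payment_id: int   # unique ID -- not the doctor's name
    company: str      # "From" field (drug company)
    doctor: str       # "To" field (doctor)
    amount: float

payments = [
    Payment(1, "AcmePharma", "Sally Smith", 500.0),
    Payment(2, "AcmePharma", "Sally Smith", 250.0),  # a *different* Sally Smith
]

# A list page groups atomic units, e.g. all payments from one company:
acme = [p for p in payments if p.company == "AcmePharma"]

# A detail page looks one unit up by its unique ID, never by name:
by_id = {p.payment_id: p for p in payments}
detail = by_id[2]
```

The index page would then sit on top of aggregates over `payments` (totals per company, per year), while each field on a detail page links back to the matching list page.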
Week 5: Sisi Wei, Pro Publica
dataviz includes both data art and data journalism;
both take data and show it visually, but only the latter ALSO tells a story
Understand, Translate, Display
Understand the source or data yourself (takes time!)
Translate for others so readers don’t need to do the work themselves, incl
1. reduce complexity: choose what’s important, leave out what’s not
2. reveal patterns
3. show change: “compared to what?”
WaPo exit polls 2012 example
1. good design is invisible: interact with data, not with interface
if you make the design the most exciting part, it means the data/story is harder to read
2. be true to the data: factual accuracy not enough, need visual accuracy too
not only “ppl often don’t read the y-axis, just look at bar heights”
but in fact, the POINT of a viz is so you CAN look at the bar heights not read the y-axis
so if your bars don’t start at 0, you’ve removed the benefit that a bar viz provides
(though could still be OK to use dots or whatever)
3. viz isn’t same as explanation:
don’t forget to explain the story if the graphic doesn’t show it clearly
e.g. Matt Ericson’s flooding-of-Katrina map was an interesting/nifty VIZ,
but hard to tell what the story is at a glance,
so needed to write explanation and simpler graphics/tables to actually EXPLAIN
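A quick bit of arithmetic shows why the bars-must-start-at-0 point above matters: truncating the axis changes the visual ratio between bars, which is exactly what readers judge instead of reading the y-axis. (The numbers here are made up for illustration.)

```python
# The drawn "height" of a bar is the distance from the axis baseline to the value.
def bar_height(value, axis_min):
    return value - axis_min

a, b = 100, 110  # two data values that actually differ by 10%

# Axis starting at 0: bar heights preserve the true ratio (110/100 = 1.1)
ratio_honest = bar_height(b, 0) / bar_height(a, 0)

# Axis starting at 95: the same data now LOOKS 3x larger (15/5 = 3.0)
ratio_truncated = bar_height(b, 95) / bar_height(a, 95)
```

With dots or a line chart there is no filled area encoding the value, so a truncated axis can be acceptable there, as the lecture notes.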
read Dona Wong’s book too
print: Adobe Illustrator is still the best
online, no programming skillz: Tableau (Public), ManyEyes
with coding skillz: Google Charts API, Highcharts, Raphael, D3
Highcharts: gives code examples but very customizable,
can tweak live code, explicit about how to use options
Google Fusion Tables: convert google spreadsheet etc to map or dataviz
CartoDB: beautiful; import your data and it’ll create the map for you
MapBox: download, choose base layer
Google Maps API: started charging on enterprise level recently, but still free below certain # views