Data sanity checks: Data Proofer (and R analogues?)

I just heard about Data Proofer (h/t Nathan Yau), a test suite of sanity-checks for your CSV dataset.

It checks a few basic things you’d really want to know but might forget to check yourself, like whether any rows are exact duplicates, or whether any columns are totally empty.
There are things I always forget to check until they cause a bug, like whether geographic coordinates are within valid ranges (-90 to 90 degrees latitude, -180 to 180 degrees longitude).
And there are things I never think to check, though I should, like whether there are exactly 65k rows (65,536 is the old Excel row limit, so probably an error exporting from Excel) or whether integers are exactly at certain common cutoff/overflow values.
I like the idea of automating this. It certainly wouldn’t absolve me of the need to think critically about a new dataset, but it might flag some things I wouldn’t have caught otherwise.
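
To make those checks concrete, here is a rough sketch of what a few of them might look like in plain Python. This is not Data Proofer's code or API: the function name, the column names "lat" and "lon", and the particular list of integer cutoffs are all my own assumptions for illustration.

```python
import csv
import io

# Every name below is hypothetical; this is not Data Proofer's code or API.
EXCEL_ROW_LIMITS = {65_536, 1_048_576}  # sheet row limits for .xls and .xlsx
# An assumed list of "common" integer cutoffs; adjust to taste.
INT_CUTOFFS = {255, 32_767, 65_535, 2_147_483_647, 4_294_967_295}


def _as_number(text, kind):
    """Parse text with int or float; return None if it does not parse."""
    try:
        return kind(text.strip())
    except ValueError:
        return None


def sanity_check(csv_text, lat_col="lat", lon_col="lon"):
    """Return a list of warning strings for a CSV supplied as one string."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    rows = [tuple(r) for r in reader]
    warnings = []

    # Exact duplicate rows.
    dupes = len(rows) - len(set(rows))
    if dupes:
        warnings.append(f"{dupes} exact duplicate row(s)")

    # Row count sitting on an Excel limit, counted with or without the header.
    if len(rows) in EXCEL_ROW_LIMITS or len(rows) + 1 in EXCEL_ROW_LIMITS:
        warnings.append("row count matches an Excel row limit")

    coord_limits = {lat_col: 90.0, lon_col: 180.0}
    for i, name in enumerate(header):
        # Short rows are padded with empty strings for this column.
        cells = [r[i] if i < len(r) else "" for r in rows]

        # Totally empty column.
        if rows and all(c.strip() == "" for c in cells):
            warnings.append(f"column '{name}' is entirely empty")

        # Integers exactly at a cutoff/overflow value.
        hits = sum(1 for c in cells if _as_number(c, int) in INT_CUTOFFS)
        if hits:
            warnings.append(f"column '{name}': {hits} value(s) at an integer cutoff")

        # Coordinates outside the valid range for latitude or longitude.
        limit = coord_limits.get(name)
        if limit is not None:
            values = [_as_number(c, float) for c in cells]
            bad = sum(1 for v in values if v is not None and abs(v) > limit)
            if bad:
                warnings.append(f"column '{name}': {bad} coordinate(s) out of range")
    return warnings


sample = "id,lat,lon,count,notes\n1,45.0,-93.0,7,\n1,45.0,-93.0,7,\n2,95.0,200.0,65535,\n"
for warning in sanity_check(sample):
    print(warning)
```

The function only returns warnings rather than raising errors, since a flagged value (say, a count that really is 65,535) may be perfectly legitimate; the point is to prompt a second look.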

(They also do some statistical checks for outliers, but being a statistician, this is one thing I do remember to do myself, and I’d like to think I do it more carefully than any simple automated check.)

Does an R package like this exist already? The closest thing in spirit that I’ve seen is testdat, though I haven’t played with that yet. If not, maybe testdat could add some more of Data Proofer’s checks. It’d become an even more valuable tool to run whenever you load or import any tabular dataset for the first time.

3 responses to “Data sanity checks: Data Proofer (and R analogues?)”

  1. hi, my co-author (@tylerrinker) and i (@data_steve) have been working on a package called valiData. https://github.com/data-steve/valiData It’s a fairly robust automated import validator that runs a broad swath of tests, more than any other R package we’ve seen to date.

    It was more of an internal tool for us, but we’ve been stepping up its game for broader application to the useR community over the past few weeks. It’s in active development and in beta mode, where we’re still getting the documentation up to standard for the new functionality we’ve added. But its core functionality is good to go.

    If you’d like to try it out and give feedback, that’d be great.

  2. Hi,

    Great post. Another package to check is visdat by Nicholas Tierney
    https://github.com/njtierney/visdat

    I saw Nick give a great talk about data checking at a recent conference. http://wombat2016.org/slides/nick.pdf

  3. Have a look at the ‘validate’ package.