Getting it the rightest

27
Getting it the rightest you can Thomas Hargrove, Scripps News John Perry, Atlanta Journal - Constitution Janet Roberts, Reuters Jennifer LaFleur , Reveal | The Center for Investigative Reporting IRE 2015 CAR Conference, Atlanta

Transcript of Getting it the rightest

Getting it the rightest

you can

Thomas Hargrove, Scripps NewsJohn Perry, Atlanta Journal-Constitution

Janet Roberts, ReutersJennifer LaFleur, Reveal | The Center for Investigative Reporting

IRE 2015 CAR Conference, Atlanta

Beware duplicates

Every time Saint Paul, Minn., housing inspectors made follow-up visits to check on violations, all of the data entries from the previous visit were logged again. So every violation was listed in the database multiple times.

Do integrity checks from your desk

Beware dates

Did 592,000 people in Ohio really vote before they registered?

Do integrity checks from your desk

Does it make sense?

“We select things for publication just to make available a wide scope of data to the public ... There is some burden on the public to decide whether or not to use the material.”

--Kathleen McGuire, Sourcebook of Criminal Justice Statistics

(a/k/a: The case of the disappearing lifers)

Do integrity checks from your desk

Do the data conform to the real world?

Are half of the records male, half female?

In a national data set, are about 13 percent of the records from California?

Are racial minorities adequately represented?

Do integrity checks from your desk

Check for patterns in missing data.

Do patterns render estimates inaccurate?

Do integrity checks from your desk

Think like a statistician

Do integrity checks from your desk

a/k/a: How George Will became the darling of statistics teachers

"In 1992-93, none of the five states with the highest teachers'

salaries were among the 15 states with the highest SAT scores.

And the 10 states with the lowest per pupil spending included

four . . . among the 10 states with the highest SAT scores."

--George Will, 1993

Statistical checks: From the simple to the sophisticated

Do integrity checks from your desk

R-squared = .82

ss2 = 43 + 0.95(ss1)

Descriptive statistics:

Frequency

Average

Mode

Beware the documentation

Do integrity checks with other sources

Yes, that’s Harold Spaeth’s view and mostly I think he’s right, though I’d substitute the word more “efficient” for more “accurate.”

--Lee Epstein

(Find a power user, and compare notes.)

What’s missing?

An estimated 30 percent of felony convictions are missing from the Minnesota public convictions file.

(ask the keepers of the data)

Do integrity checks with other sources

Check those codes

Do integrity checks with other sources

(a/k/a: The codes are not what they seem)

Data spanned six years. Sometime in those six years, the violation codes changed. No one in the Housing Violations Bureau knew when the switch was made, and no one had definitions for the previous codes.

(a/k/a: Why to pull some paper records)

Beware elements of change

Do integrity checks with other sources

The “feename” – name of the property owner –in the Saint Paul Housing Bureau’s code violations database is pulled in from property tax rolls. It shows the current owner. That person may not have owned the property at the time of the violation.

(a/k/a: Why to pull some paper records)

Summarize cases by institutions, then spot check results.

Do integrity checks with other sources

Is it true only 6 percent of hospital emergency cases are transferred from other hospitals?

Beware nulls!Technology bites

Null scariness from the FDA’s MAUDE database

http://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm

Beware nulls!Technology bites

We want to explore reports involving Promus heart stents , but NOT the Promus Element devices.

First, let’s see what’s in there for Promus.

Beware nulls!Technology bites

There are 50 records that mention Promus. We can see by scrolling that four are the Promus Element that we wish to exclude.

Beware nulls!Technology bites

Let’s get rid of those Elements.

Beware nulls!Technology bites

50 – 4 = …..????

Beware nulls!Technology bites

You’re supposed to have 46 records, but you got 30. What are the missing 16 records?

Beware nulls!Technology bites

Right:

Wrong:

Beware false joins in "encrypted“ data.

Technology bites

Medicare 5 percent sample: Doctors IDs were encrypted in some files, not in others.

Don’t alter original data.

As you report and just before you publish

Make a copy of the original data file. Put it somewhere and don’t touch it again.

Don’t edit an original column or field. Make a copy and edit that.

Document as you go

As you report and just before you publish

Keep track of all of your queries so you can retrace your steps or find where you went wrong.

As you integrity check your data, annotate the queries to remember what you learned.

Cross check

As you report and just before you publish

If you summed data in SQL, can you reproduce the results in a pivot table?

If you’re summing, do a list. Make sure there‘s nothing wacky in that list that would cause your count to be wrong; e.g., duplicates.

If you have various data sources that should yield the same conclusions, do they?

Beware the single case

As you report and just before you publish

Never report on one data record without pulling the paper report or talking to the person in question.

What if it was a data entry error?

What if there are circumstances you don’t understand?

Recreate the wheel

As you report and just before you publish

For every fact, number, finding in your story, write an original query or formula to support it.

Go back to your original data.

Try to arrive at the same conclusion in different ways.

Fear is your friend