Data Journalism: Data Cleansing - Open Refine

Data Cleansing - Open Refine Data Journalism InfoUma 2018-19 Andrea Marchetti


Page 1 (source: didawiki.cli.di.unipi.it › ... › dj › 05t_data_cleansing_-_open_refine.pdf)

Data Cleansing - Open Refine

Data Journalism

InfoUma 2018-19 Andrea Marchetti

Page 2

Clean the data

Page 3

Definition

Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from data.

The term refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting this dirty data.

https://en.wikipedia.org/wiki/Data_cleansing

Page 4

Data cleaning tools

Page 5

Bibliography

Open Refine Home page

Official Documentation

List of Tutorials

Using OpenRefine Ruben Verborgh, Max De Wilde September 2013

General Refine Expression Language

Jython = Python for the Java platform

Page 6

What you can do

Cleansing - Analysing and fixing data

  Fixing errors

  Remove duplicate records

  Split multi-value columns

Enrichment

  by Web API

  by Web Scraping

Page 7

Common errors you can find in the data

Strings vs numbers (“10,5432” vs 10.5432)

Different formats (01/09/2016 vs 01-09-2016)

Data inconsistencies (Piazza, P.zza, P.za)

Leading or trailing spaces (“B&B” vs “ B&B”)
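
The four kinds of error listed above can each be normalized with a few lines of code. The following is a minimal Python sketch, not part of the original slides: the function names and the abbreviation table are hypothetical, and the date parser assumes day-first order.

```python
from datetime import datetime

def parse_decimal(text):
    """Convert a numeric string that may use a decimal comma into a float."""
    return float(text.strip().replace(",", "."))

def normalize_date(text):
    """Rewrite dd/mm/yyyy or dd-mm-yyyy as ISO yyyy-mm-dd (day-first is assumed)."""
    parsed = datetime.strptime(text.strip().replace("/", "-"), "%d-%m-%Y")
    return parsed.date().isoformat()

# Hypothetical lookup table mapping abbreviations to one canonical spelling.
ABBREVIATIONS = {"p.zza": "Piazza", "p.za": "Piazza"}

def normalize_abbreviation(text):
    """Trim surrounding spaces and expand a known abbreviation, if any."""
    cleaned = text.strip()
    return ABBREVIATIONS.get(cleaned.lower(), cleaned)

print(parse_decimal("10,5432"))
print(normalize_date("01/09/2016"), normalize_date("01-09-2016"))
print([normalize_abbreviation(s) for s in ["Piazza", "P.zza", "P.za"]])
print(" B&B".strip() == "B&B")
```

In OpenRefine the same fixes are normally applied through the column menus and transforms rather than an external script.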

Page 8

Accommodations in Tuscany

Regione Toscana - Strutture ricettive (accommodation facilities)

CSV file

Page 9

First view with an editor

Page 10

utf-8


Page 11

Click the down arrow in a column heading to start working on the data

Page 12

Facet

Page 13

Text Facet on “tipologia” column

Page 14

Text Facet

In Italian: sfaccettature

Technically, it is a histogram
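
To make the histogram idea concrete, here is a minimal Python sketch of what a text facet computes: the number of rows for each distinct cell value. The sample values are hypothetical and only mimic a column such as “tipologia”.

```python
from collections import Counter

def text_facet(values):
    """Return (value, count) pairs, most frequent first, like a text facet."""
    return Counter(values).most_common()

# Hypothetical sample of a "tipologia" column with typical inconsistencies.
tipologia = ["Albergo", "Agriturismo", "albergo", "Albergo", "Agriturismo ", "Agriturismo"]

for value, count in text_facet(tipologia):
    # repr() makes stray spaces visible in the printed value.
    print(repr(value), count)
```

Values that differ only in case or spacing show up as separate entries, which is how a facet exposes inconsistencies.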

Page 15

Numeric Facet

Check the limits

9.68 - 12.36
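
Checking the limits amounts to comparing the smallest and largest values of a column with the range you expect. A minimal Python sketch follows; the sample values and the expected bounds passed to `outside_range` are hypothetical.

```python
def numeric_limits(values):
    """Return the (minimum, maximum) of a numeric column."""
    return min(values), max(values)

def outside_range(values, low, high):
    """Return the values that fall outside the expected [low, high] interval."""
    return [v for v in values if not (low <= v <= high)]

# Hypothetical numeric column containing one suspicious value.
sample = [9.68, 10.42, 11.25, 12.36, 43.71]

print(numeric_limits(sample))
print(outside_range(sample, 9.0, 13.0))
```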

Page 16

Edit Cells

Page 17

To title case

Common transforms

Page 18

Transforms

Page 19

General Refine Expression Language - GREL

Variables

value = value of the current cell

row = the current row (row.index gives its zero-based number)

Functions

split(value, "separator") = split a string into an array at each separator

round(number) = round to the nearest integer (not always up)

Page 20

round(value*100000)/100000.0
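
The expression above keeps five decimal places: it scales the value up, rounds to an integer, and scales back down. A minimal Python sketch of the same arithmetic is given here for illustration; the function name is hypothetical, and Python's built-in `round()` resolves exact ties to the nearest even integer, which may differ from GREL on tie cases.

```python
def round_to_places(value, places=5):
    """Scale up, round to an integer, scale back down."""
    factor = 10 ** places
    return round(value * factor) / factor

print(round_to_places(10.4237741))
print(round_to_places(43.7194309))
```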

Page 21

Edit columns

Page 22
Page 23

value.split(" ")[0].toLowercase()
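
Read left to right, the expression splits the cell at each space, takes the first piece, and lowercases it. A rough Python equivalent, with a hypothetical function name and sample values, is sketched below. In Python a leading space produces an empty first piece, so trimming first is safer; whether GREL behaves identically is not verified here.

```python
def first_word_lower(value):
    """Split on single spaces, keep the first piece, lowercase it."""
    return value.split(" ")[0].lower()

print(first_word_lower("Albergo Diffuso"))
# A leading space yields an empty first piece in Python, so trim first.
print(repr(first_word_lower(" B&B")))
print(first_word_lower(" B&B".strip()))
```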

Page 24

Clustering of values from facet
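
Clustering groups facet values that are probably variants of the same thing. The slides do not say which method is used; the sketch below is a simplified Python version of the "fingerprint" key-collision idea offered by OpenRefine, in which values that reduce to the same normalized key fall into one cluster. The sample values are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: trim, lowercase, ASCII-fold, drop punctuation,
    then sort and de-duplicate the tokens."""
    text = value.strip().lower()
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))

def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one variant."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]

# Hypothetical facet values.
sample = ["Casa Vacanze", "casa vacanze ", "Vacanze, Casa", "Albergo", "Affittacamere"]
print(cluster(sample))
```

Each cluster can then be merged into a single spelling, which is what the clustering dialog does interactively.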

Page 25

Data enrichment

Page 26

Data Enrichment

Web API - parseJson(string s)

Web Scraping - parseHtml(string s)

Page 27

Data Enrichment

Page 28

Data Enrichment: Open Refine

OpenRefine makes it easy to annotate datasets with data fetched from any web service that returns JSON

E.g., get coordinates from addresses

This requires a geocoding web service:

1. OpenStreetMap

2. Google Maps

Page 29

Geocoding services

OpenStreetMap

http://nominatim.openstreetmap.org/search?format=json&q=Via Moruzzi 1 Pisa

low recall

Google Maps

https://maps.googleapis.com/maps/api/geocode/json?address=Via Moruzzi 1 Pisa&key=YOUR API KEY

high recall, requires an API key
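
The example URLs above contain spaces, which a browser encodes automatically but a script must encode itself. The following Python sketch builds both request URLs with percent-encoding and makes no network call. The function names and the `YOUR_API_KEY` placeholder are hypothetical; the endpoints and parameter names are the ones shown on the slide.

```python
from urllib.parse import quote, urlencode

NOMINATIM_URL = "http://nominatim.openstreetmap.org/search"
GOOGLE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def nominatim_request_url(address):
    """Build the OpenStreetMap (Nominatim) search URL for one address."""
    params = {"format": "json", "q": address}
    return NOMINATIM_URL + "?" + urlencode(params, quote_via=quote)

def google_request_url(address, api_key):
    """Build the Google geocoding URL for one address and an API key."""
    params = {"address": address, "key": api_key}
    return GOOGLE_URL + "?" + urlencode(params, quote_via=quote)

print(nominatim_request_url("Via Moruzzi 1 Pisa"))
print(google_request_url("Via Moruzzi 1 Pisa", "YOUR_API_KEY"))
```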

Page 30

OpenStreetMap JSON Result

[
  {
    "place_id": "16952760",
    "licence": "Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright",
    "osm_type": "node",
    "osm_id": "1477804118",
    "boundingbox": ["43.7193809", "43.7194809", "10.4237241", "10.4238241"],
    "lat": "43.7194309",
    "lon": "10.4237741",
    "display_name": "Area della Ricerca del CNR di Pisa, 1, Via Giuseppe Moruzzi, Don Bosco, Pisa, PI, Tuscany, 56124, Italia",
    "class": "place",
    "type": "house",
    "importance": 0.52025
  }
]
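
In this result the coordinates arrive as strings inside a top-level list, so they must be converted before use. A minimal Python sketch of extracting them is shown here; the function name is hypothetical, and the sample text keeps only a few fields copied from the result above.

```python
import json

def first_coordinates(response_text):
    """Return (lat, lon) as floats from the first result, or None if there is none."""
    results = json.loads(response_text)
    if not results:
        return None
    first = results[0]
    return float(first["lat"]), float(first["lon"])

# Trimmed copy of the result shown above.
sample_response = '[{"lat": "43.7194309", "lon": "10.4237741", "class": "place", "type": "house"}]'

print(first_coordinates(sample_response))
print(first_coordinates("[]"))
```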

Page 31

Google JSON Result

{
  "results": [
    {
      "address_components": [],
      "formatted_address": "Via Giuseppe Moruzzi, 1, 56127 Pisa PI, Italia",
      "geometry": {
        "bounds": {},
        "location": {
          "lat": 43.7182358,
          "lng": 10.4248623
        },
        "location_type": "ROOFTOP",
        "viewport": {}
      },
      "place_id": "ChIJ09Eml8OR1RIRELeidtGcXhA",
      "types": []
    }
  ],
  "status": "OK"
}

Page 32

Google API Geocoding service

Google API geocoding with REST documentation

To use the Google Maps Geocoding API, you need an API key. Before you start developing with the Geocoding API, review the authentication requirements and the API usage limits.

● 2,500 free requests per day (no longer available)

● 50 requests per second

Google Console to manage your API keys

Page 33

Data Enrichment with Open Refine

Page 34

Data Enrichment with Open Refine

'https://maps.googleapis.com/maps/api/geocode/json?address='+escape(value,'URL')+'&key=YOUR API KEY'

escape() is a GREL string function; escape(value,'URL') URL-encodes the cell value

Page 35

Data Enrichment with Open Refine

If you try to edit this cell, you can see the formatted JSON

Page 36

Data Enrichment with Open Refine

If you try to edit this cell, you can see the formatted JSON

GREL Other Functions

Page 37

Data Enrichment with Open Refine

value.parseJson().results[0].geometry.location.lat

The Google Geocoding service always returns a list, results[]; we take the first element: results[0]
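
The same path, results[0].geometry.location, can be followed in Python. The sketch below also guards against an empty result list, which the GREL one-liner does not handle. The function name is hypothetical, and the sample text is a trimmed copy of the Google result shown on an earlier slide.

```python
import json

def google_lat_lng(response_text):
    """Return (lat, lng) from the first result, or None if nothing was found."""
    data = json.loads(response_text)
    results = data.get("results", [])
    if data.get("status") != "OK" or not results:
        return None
    location = results[0]["geometry"]["location"]
    return location["lat"], location["lng"]

# Trimmed sample response with the coordinates from the slides.
sample_google = (
    '{"results": [{"geometry": {"location": '
    '{"lat": 43.7182358, "lng": 10.4248623}}}], "status": "OK"}'
)

print(google_lat_lng(sample_google))
print(google_lat_lng('{"results": [], "status": "ZERO_RESULTS"}'))
```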

Page 38

Export Data

Page 39