#DataVizInSixWeeks, Wk 5 - Data

#DataVizInSixWeeks Copyright Anne Stevens Data


Data


Week One

What is data visualization? Historical context

Week Two

Visualization types

Week Three

Perception and cognition

Week Four

Design issues & best practices

Week Five

Big data, data management

Week Six

Synthesis

Data Viz In Six Weeks

An Introduction to Visual Analytics course taught at OCAD University, Toronto

By Anne Stevens

Get Data

Clean it

Combine


Big Data

What is it?


Invasive targeted advertising


Social media: early warning

Source: MIT, HealthMap, healthmap.org


Social media: early warning

Source: BioDiaspora, biodiaspora.com/


Data activism - Ushahidi

Source: Kenyan elections, 2010 – Ushahidi, ilissafrica.wordpress.com/2011/03/23/crowdsourcing-with-ushahidi/


Crowdsourced data - Ushahidi

Source: Haiti earthquake crisis map – Ushahidi, theextremecentrist.wordpress.com/


Small Data

What is it?


Data driven journalism

www.theguardian.com/news/datablog/2011/jul/28/data-journalism


Open data movement

www.twoviewsbeyond.com/safi/charlie-hebdo-right-say-dumb-stuff-four-interesting-dumb-commentaries/


The 3 V’s: Volume, Variety, Velocity

Source: datasciencecentral.com/forum/topics/the-3vs-that-define-big-data


Finding patterns in all the noise

[Pole] ran test after test, analyzing the data, and before long some useful patterns emerged. Lotions, for example. Lots of people buy lotion, but one of Pole’s colleagues noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date.

Source: forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/


Big Data challenges

Very informal data

Very messy data

Echo chamber effect

Not a representative sample of society

Volatility


Curation, not content, creates value

Get Data

Clean it

Combine it

Explore

Visualize/Analyse

Maintain


Open data portals, e.g. City of Toronto

Scrape social media

APIs

Scrape websites

XPath code + Scrape Similar Chrome Extension + Google Docs

RSS Feeds

Create your own data (e.g. SurveyMonkey)

Sensor data, GPS, mobile phone data (not just names and numbers)
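As a rough illustration of the XPath and RSS items above, here is a minimal Python sketch that pulls titles and links out of a feed using the XPath-style paths supported by the standard library's `xml.etree.ElementTree`. The feed text, the `example.org` links, and the function name `extract_items` are all hypothetical stand-ins, not part of the course material; a real workflow would first download the feed or page.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 snippet standing in for a downloaded feed.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Open Data Feed</title>
    <item>
      <title>New dataset: bike counts</title>
      <link>https://example.org/datasets/bike-counts</link>
    </item>
    <item>
      <title>Updated dataset: tree inventory</title>
      <link>https://example.org/datasets/tree-inventory</link>
    </item>
  </channel>
</rss>"""


def extract_items(rss_text):
    """Return a list of (title, link) tuples, one per <item> in the feed."""
    root = ET.fromstring(rss_text)
    rows = []
    # "./channel/item" is an XPath-style path: every <item> directly under <channel>.
    for item in root.findall("./channel/item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        rows.append((title, link))
    return rows


for title, link in extract_items(SAMPLE_RSS):
    print(title, "|", link)
```

The same idea (a path expression that selects repeated elements) is what the Scrape Similar extension and Google Docs' import functions rely on, although their exact XPath support differs from ElementTree's limited subset.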


Big Data is messy data

Open data is messy

Excel tools

Google Refine

Structure unstructured data

Restructure the data set

Data Wrangler

Google Refine

Excel Pivot Tables

Tableau Reshaper


Cleaning data


Excel cleaning tools

Text to Columns (a split function)

Remove Duplicates

=SUBSTITUTE(cell ref, "to be replaced", "replaced with this")

=FIND("character to find position of", cell ref)

=LEFT(cell ref, number of characters to grab from left side)

=RIGHT(cell ref, number of characters to grab from right side)

=LEN(cell ref)

=CONCATENATE(1st thing, 2nd thing, 3rd thing, …)

Paste Special -> Values

=TRIM(cell ref)

=VALUE(cell ref)
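For readers who clean data outside Excel, the functions above have close counterparts in most languages. The following is a minimal Python sketch, assuming a made-up cell value; the helper names (`excel_trim`, `excel_find`, and so on) are hypothetical and only mimic the Excel behaviour as described in the comments.

```python
def excel_trim(text):
    """Like =TRIM(): drop leading/trailing spaces, collapse inner runs to one."""
    return " ".join(part for part in text.split(" ") if part)


def excel_find(needle, text):
    """Like =FIND(): 1-based, case-sensitive position of needle in text."""
    position = text.find(needle)
    if position == -1:
        raise ValueError("not found")  # Excel would show #VALUE!
    return position + 1


def excel_left(text, count):
    """Like =LEFT(): the first `count` characters."""
    return text[:count]


def excel_right(text, count):
    """Like =RIGHT(): the last `count` characters."""
    return text[-count:] if count > 0 else ""


# Hypothetical messy cell: "Last, First" with stray spaces.
raw = "  Stevens,  Anne "
cleaned = excel_trim(raw)                                 # TRIM
comma = excel_find(",", cleaned)                          # FIND
last = excel_left(cleaned, comma - 1)                     # LEFT
first = excel_right(cleaned, len(cleaned) - comma - 1)    # RIGHT + LEN
full_name = first + " " + last                            # CONCATENATE

# SUBSTITUTE then VALUE: remove the thousands separator, convert to a number.
amount = float("1,234".replace(",", ""))

print(full_name, amount)
```

Chaining the steps this way mirrors the usual spreadsheet pattern of building helper columns and then using Paste Special -> Values to freeze the result.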


Restructuring the data set

Making it TALL


Restructure the data set

Make data as RAW as possible

One row of headers

Convert section headers to columns

Eliminate empty cells & rows
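To make "tall" concrete, here is a small Python sketch that reshapes a wide table (one column per year) into one row per observation, skipping empty cells along the way. The table contents, column names, and the function `make_tall` are invented for illustration; tools such as Data Wrangler, OpenRefine, or the Tableau reshaper do the same job interactively, and dropping empty cells is only one of several reasonable ways to handle them.

```python
import csv
import io

# Hypothetical "wide" table: a single header row, one column per year.
WIDE_CSV = """neighbourhood,2012,2013,2014
Annex,120,,140
Riverdale,95,101,99
"""


def make_tall(wide_text, id_column, variable_name, value_name):
    """Turn each non-empty (row, column) cell into its own record."""
    reader = csv.DictReader(io.StringIO(wide_text))
    tall = []
    for row in reader:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tall row
            if value is None or value.strip() == "":
                continue  # skip empty cells instead of carrying blanks forward
            tall.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value.strip(),
            })
    return tall


tall_rows = make_tall(WIDE_CSV, "neighbourhood", "year", "count")
for record in tall_rows:
    print(record["neighbourhood"], record["year"], record["count"])
```

In the tall form the former column headers (the years) become values of a single `year` column, which is the shape most charting tools expect.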


Source: Data Wrangler, http://vis.stanford.edu/wrangler/


Connect data from different sources

Make structure & syntax consistent

Structure unstructured data
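One common way to "make structure & syntax consistent" before connecting two sources is to normalize the join key on both sides. The sketch below assumes two hypothetical lists keyed by ward name with inconsistent spacing and capitalization; the data, `normalize_key`, and `join_on_name` are illustrative names, not taken from the course.

```python
def normalize_key(name):
    """Lowercase, trim, and collapse whitespace so equivalent names match."""
    return " ".join(name.lower().split())


# Hypothetical sources that spell the same wards differently.
LIBRARIES = [("Ward A ", 3), ("ward b", 5), ("Ward D", 2)]
PARKS = [("WARD A", 12), ("Ward  B", 7), ("Ward C", 4)]


def join_on_name(left_rows, right_rows):
    """Inner-join two (name, value) lists on the normalized name.

    Returns (joined, unmatched): joined rows are (key, left, right);
    unmatched lists left-hand names with no partner, for manual review.
    """
    right_lookup = {normalize_key(name): value for name, value in right_rows}
    joined, unmatched = [], []
    for name, value in left_rows:
        key = normalize_key(name)
        if key in right_lookup:
            joined.append((key, value, right_lookup[key]))
        else:
            unmatched.append(name)
    return joined, unmatched


joined, unmatched = join_on_name(LIBRARIES, PARKS)
print(joined)
print(unmatched)
```

Reporting the unmatched names rather than silently dropping them makes vocabulary mismatches between sources visible early.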


Combining data sets

Provides context that can lead to new insight

Presents a lot of challenges

Social media is typically informal and unstructured

Formal vs informal data

Structured vs unstructured data

Data viz needs structured data


Combining data

Challenges

Combining structured with unstructured data

Non-standard vocabulary, units, accuracy

$ values from different years have to be adjusted for inflation

Don’t mix weighted & unweighted data

Don’t mix raw and normalized data

MAUP (modifiable areal unit problem)

Licensing issues
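Two of the challenges above come down to simple arithmetic: restating dollar values in a common year, and normalizing raw counts before comparing them. The following Python sketch shows one way to do each, assuming a price index where the adjustment is amount × index(base year) / index(original year). The index values, populations, and function names are hypothetical, not real CPI or census figures.

```python
# Hypothetical price index by year (NOT real CPI data).
PRICE_INDEX = {2000: 80.0, 2010: 100.0, 2014: 110.0}


def to_constant_dollars(amount, year, base_year, index=PRICE_INDEX):
    """Restate `amount` from `year` dollars in `base_year` dollars."""
    return amount * index[base_year] / index[year]


def per_thousand(count, population):
    """Normalize a raw count to a rate per 1,000 people."""
    return count * 1000 / population


# A 400-dollar figure from 2000, expressed in 2014 dollars.
adjusted = to_constant_dollars(400.0, 2000, 2014)

# Raw counts look similar; the rates differ because the populations do.
rate_small_area = per_thousand(50, 20000)
rate_large_area = per_thousand(60, 120000)

print(adjusted, rate_small_area, rate_large_area)
```

Whichever convention is chosen, the point from the slide stands: apply it to every series before combining them, so that adjusted and unadjusted (or raw and normalized) values never share an axis.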


Probe into data

Histograms for variable distribution

Log vs. linear axes scales
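The difference between linear and log scales is easiest to see by binning the same skewed data both ways. This Python sketch builds a histogram with equal-width bins and with bins equally spaced in log10; the sample values and function names are made up for the example.

```python
import math


def linear_edges(low, high, bins):
    """Bin edges equally spaced between low and high."""
    step = (high - low) / bins
    return [low + i * step for i in range(bins + 1)]


def log_edges(low, high, bins):
    """Bin edges equally spaced in log10; requires low > 0."""
    log_low, log_high = math.log10(low), math.log10(high)
    step = (log_high - log_low) / bins
    return [10 ** (log_low + i * step) for i in range(bins + 1)]


def histogram(values, edges):
    """Count values per bin; bins are [edge_i, edge_i+1), last bin closed."""
    counts = [0] * (len(edges) - 1)
    last = len(counts) - 1
    for value in values:
        for i in range(len(counts)):
            in_bin = edges[i] <= value < edges[i + 1]
            if in_bin or (i == last and value == edges[-1]):
                counts[i] += 1
                break
    return counts


# Hypothetical right-skewed variable (many small values, a few large ones).
values = [1, 2, 3, 5, 8, 12, 20, 45, 90, 300, 900]

linear_counts = histogram(values, linear_edges(1, 1000, 3))
log_counts = histogram(values, log_edges(1, 1000, 3))
print("linear bins:", linear_counts)
print("log bins:   ", log_counts)
```

With equal-width bins almost every value lands in the first bin, hiding the shape of the distribution; the log-spaced bins spread the same values out, which is the usual argument for trying a log axis on skewed data.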


Use existing chart libraries (Tableau etc.)

Create original visualizations (D3.js, Processing etc.)

Test with sample data sets


Update

Maintain

Check


Resources

XPath tutorials

annielytics.com/blog/google-docs/how-to-scrape-the-web-using-google-docs/

w3schools.com/xpath/default.asp

distilled.net/blog/distilled/guide-to-google-docs-importxml/

Google Refine (now OpenRefine)

openrefine.org

Tutorial: http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial

Scraper / Scrape Similar

chrome.google.com/webstore/detail/scraper/mbigbapnjcgaffohmbkdlecaccepngjd

mnmldave.github.io/scraper/

Data cleaning

schoolofdata.org/courses


By Anne Stevens

stevensanne.com

stevensanne.com/blog/

@3_ring_binder