Big Data Technology and the Social Sciences:
A Lecture at Mannheim University
Abe Usher, CCHP, CISSP, Chief Technology Officer, HumanGeo
What’s In It For You?
Theory
•Definitions and overview
•Where data are being generated
Practice
•Google’s three secret techniques* for unlocking insights from data
•The kitchen model
•Recommended resources to build data science skills
Presentation slides: http://www.slideshare.net/abeusher/big-data-and-the-social-sciences
*Not specifically endorsed by Google. Also, not really a secret.
Background
HumanGeo is focused on digital Human Geography:
Understanding the location attributes of individuals and groups
And the social attributes of locations
Through ‘Big Data’ analysis of billions of geolocated data elements
Big Data Wake-Up Call
Berkeley University Research http://goo.gl/zjSUr1
By 2016 the rate of data growth surpasses the rate of Moore’s Law
Defining Big Data
http://knowyourmeme.com/memes/you-keep-using-that-word-i-do-not-think-it-means-what-you-think-it-means
Big Data Definition
Boring Traditional definition
“High volume, velocity and variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.”
Big Data Definition
Abe’s definition:
The Original “Big Data”
1880 US Census
•50 million people
•Data included: age, gender, number of insane people in household*
•Took 7 years to tabulate
•1890 Census estimated at 13 years to complete
*Credit to Ken Krugler for this factoid: http://www.censusrecords.com/content/1880_census
1890
•63 million people
•Additional data: citizenship and military service
•New technology: Hollerith Tabulating System
•Took 6 weeks to tabulate (76x faster)
Takeaway
• Better technology and methodology led to 76x speedup
Data Generation
Where are data created?
•Website interaction logs
•Social Media
•Cyber events
•Smartphones
What is the volume?
•3B phone calls in USA
•700M Facebook posts
•500M tweets per day
•50B WhatsApp messages per day
Takeaway
• Social media, telecommunication, and instant messaging generate an increasingly high volume of data
Traditional Model of Interpreting Observations
Tracy Marrow (aka “Ice-T”)
How can you identify a legitimate hip-hop artist (versus someone who just gets up and rhymes)?
http://www.npr.org/2005/08/30/4824690/original-gangster-rapper-and-actor-ice-t
“Game knows game, baby.”
“If you have expert knowledge, then you are capable of answering complex questions by interpreting domain specific information.” [paraphrased]
Trust Models for complex data
• August Gorman carried out a plot to grab fractions of a penny from a corporate payroll system. http://goo.gl/vAScel
IMDB: 4.9/10
Rotten Tomatoes: 26/100
Trust Models for complex data
• Peter Gibbons hatches a plot to write a computer virus that grabs fractions of a penny from a corporate retirement account. http://goo.gl/rDg1U
• Known in security circles as a salami attack.
IMDB: 7.9/10
Rotten Tomatoes: 79/100
Takeaway point: Little bits of value (information) provide deep insights in the aggregate
1. Aggregation
2. Visualization
3. Correlation
New Models of Interpreting (Big) Data
Takeaways
• Expert-based knowledge is no longer sufficient.
• Simple mathematical methods create value from captured data
Aggregation (Counting)
William Thomson, 1st Baron Kelvin
“When you can measure what you are speaking about, and express it in numbers, you know something about it.”
Takeaway
• Aggregation via counting things is the most common way to exploit Big Data
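Aggregation by counting can be sketched in a few lines of Python. This is a minimal illustration only: the `posts` list is made-up data and `count_items` is a hypothetical helper name, neither taken from the lecture. `collections.Counter` is part of the Python standard library.

```python
from collections import Counter

# Hypothetical observations: one keyword per captured record (made-up data).
posts = ["flu", "census", "flu", "movie", "flu", "census"]


def count_items(items):
    """Aggregate raw observations into per-item counts."""
    return Counter(items)


counts = count_items(posts)

# most_common() returns (item, count) pairs, highest count first.
for item, n in counts.most_common():
    print(f"{item}: {n}")
```

The same pattern scales from a handful of records to any stream of observations that can be reduced to a key.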
The book “Fearless” is much more popular than the 80s movie “Navy Seals.” It also has a more favorable distribution of reviews.
Aggregation: A Tale of Two Products
The distribution we’re looking for looks like the #1 hand: responses concentrated in the most positive category, with very few responses that were unfavorable.
Aggregation: A Tale of Two Products
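One way to compare two review distributions is to convert raw star counts into shares and look at how much weight sits in the most positive category. The sketch below assumes a 1-to-5 star scale; the counts for `product_a` and `product_b` are invented for illustration and are not the actual review data for either product on the slide.

```python
def star_shares(counts):
    """Convert {stars: number_of_reviews} into {stars: share_of_reviews}."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reviews to aggregate")
    return {stars: n / total for stars, n in counts.items()}


# Hypothetical review counts per star rating (made-up data).
product_a = {5: 180, 4: 40, 3: 10, 2: 5, 1: 5}
product_b = {5: 20, 4: 15, 3: 15, 2: 10, 1: 20}

for name, counts in (("product_a", product_a), ("product_b", product_b)):
    shares = star_shares(counts)
    print(f"{name}: {shares[5]:.0%} five-star, {shares[1]:.0%} one-star")
```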
Aggregation & Visualization: Counting with Google Trends
Aggregation & Visualization: Bing Search vs. Google Search
Aggregation: Diet Pepsi vs. Diet Coke
Aggregation & Visualization: Big Data vs. Britney Spears
Geospatial Visualization Example: Social Drift in DC
Takeaway
• Visualization provides a powerful mechanism for Exploratory Data Analysis
Correlation: Canadian Flu Research
Gunther Eysenbach
•Professor @ University of Toronto
•Focused on eHealth
•Google Ads user
Infodemiology
•2004-2005 tracked flu related searches
•54,507 Ad impressions in Canada
•High R^2 correlation to actual flu activity
http://gunther-eysenbach.blogspot.com/
Infodemiology paper: http://goo.gl/aeUZtA
Takeaway
• Human behavior in response to Google Ads related to the flu was highly correlated with “officially reported” cases of the flu.
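The correlation idea can be illustrated with a Pearson coefficient computed by hand. This is a sketch under stated assumptions: the two weekly series below are invented, not the study's data, and `pearson_r` is a hypothetical helper written for this example.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical weekly series (made-up numbers).
ad_clicks = [12, 18, 25, 40, 38, 22, 15]
flu_cases = [30, 41, 60, 95, 90, 52, 33]

r = pearson_r(ad_clicks, flu_cases)
print(f"r = {r:.3f}, R^2 = {r * r:.3f}")
```

A value of r close to 1 means the two series rise and fall together; it does not by itself show that one causes the other.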
Correlation: Google Flu Trends
“Google Flu Trends provides near real-time estimates of flu activity for a number of countries and regions around the world based on aggregated search queries.”
Process
•Map searches to regions
•Quantify “normal”
•Detect “anomalies”
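The three process steps above can be sketched as a simple z-score check. This is not Google's actual method, only an illustration: the search log is invented, `find_anomalies` is a hypothetical helper, and the choices of a four-week baseline and a threshold of 3 standard deviations are assumptions.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical search log: (region, week, flu-related query count). Made-up data.
searches = [
    ("Ontario", 1, 100), ("Ontario", 2, 104), ("Ontario", 3, 98),
    ("Ontario", 4, 102), ("Ontario", 5, 180),
    ("Quebec", 1, 80), ("Quebec", 2, 82), ("Quebec", 3, 79),
    ("Quebec", 4, 81), ("Quebec", 5, 83),
]


def find_anomalies(records, baseline_weeks=4, threshold=3.0):
    """Return (region, week) pairs whose count is far above the regional baseline."""
    # Step 1: map searches to regions, ordered by week.
    by_region = defaultdict(list)
    for region, week, count in sorted(records, key=lambda r: (r[0], r[1])):
        by_region[region].append((week, count))

    anomalies = []
    for region, series in by_region.items():
        # Step 2: quantify "normal" from the first baseline_weeks observations.
        baseline = [count for _, count in series[:baseline_weeks]]
        if len(baseline) < 2:
            continue  # not enough data to estimate spread
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma == 0:
            continue  # a flat baseline gives no usable scale
        # Step 3: detect "anomalies" as later weeks with a large z-score.
        for week, count in series[baseline_weeks:]:
            if (count - mu) / sigma > threshold:
                anomalies.append((region, week))
    return anomalies


print(find_anomalies(searches))
```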
NPR: http://goo.gl/Iv7A87
NYT: http://goo.gl/mNyAi7
Correlation: Box Office Hit Prediction
“Use of socially generated ‘big data’ to access information about collective states of the minds in human societies has become a new paradigm in the emerging field of computational social science.”
Simple factors
•number of total page views
•number of total edits made
•number of users editing
•number of revisions in the article's revision history
Early Prediction of Movie Box Office Success: http://goo.gl/BWf7H1
Counts of Wikipedia factors correlate to Box Office sales
Big Data: Significance for Social Sciences
1. Proxy variables. Digital exhaust collected for purposes other than surveys often creates ‘proxy variables’ that provide complementary insights.
2. Aggregation Insights. Combining many small observations leads to insights that we can trust.
3. Data Linking. It is possible to ‘link’ or synchronize records between digital exhaust and instrumented surveys by selecting a common dimension (e.g. location).
The future of social science will involve combining “fuzzy Big Data insights” with instrumented survey results
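Data linking on a common dimension can be sketched as a join keyed by location. Everything in this example is an assumption made for illustration: the city names, the message counts, the survey scores, and the helper `link_by_location` do not come from the lecture.

```python
# Hypothetical digital-exhaust aggregate: geolocated message counts per city.
exhaust = {"Mannheim": 1200, "Berlin": 5400, "Hamburg": 2100}

# Hypothetical instrumented survey result per city (e.g. a mean response score).
survey = {"Mannheim": 0.62, "Berlin": 0.55, "Munich": 0.71}


def link_by_location(exhaust_by_loc, survey_by_loc):
    """Link the two sources on their shared dimension, the location name."""
    shared = sorted(exhaust_by_loc.keys() & survey_by_loc.keys())
    return [(loc, exhaust_by_loc[loc], survey_by_loc[loc]) for loc in shared]


linked = link_by_location(exhaust, survey)
for loc, messages, score in linked:
    print(f"{loc}: {messages} messages, survey score {score}")
```

Only locations present in both sources are linked; cities that appear in one source alone are dropped, which is one design choice among several (an outer join would keep them with missing values).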
The kitchen model of value creation
Chef: Your Staff
Ingredients: Your Data
Utensils: Technology
Recipes: Techniques
Take Action: Experiment yourself
Exploratory Data Analysis lifecycle:
• collect - Twitter API, Datasift.com
• clean - OpenRefine
• analyze - Python or R
• visualize - Google Earth
Related data: https://s3.amazonaws.com/devbackup/germany.txt.gz
Related code: https://github.com/abeusher
Take Action: Explore
Google Trends http://goo.gl/8eJZg
Google Ngram http://goo.gl/4U09fa
Google Correlate http://goo.gl/nEhe8D
Bing Keyword Research http://goo.gl/q2V88g
Contact information
Abe Usher
Email: [email protected]
Twitter: @abeusher
LinkedIn: http://goo.gl/DUxZOP
Presentations: http://goo.gl/bCa3Qt