BIg Data is Not About Data
Transcript of BIg Data is Not About Data
-
8/10/2019 BIg Data is Not About Data
1/118
Big Data is Not About the Data!
Gary King1
Institute for Quantitative Social ScienceHarvard University
Talk at the History of Evidence class, Harvard Law School, 11/17/2014
1GaryKing.org1/10
-
8/10/2019 BIg Data is Not About Data
2/118
The Value in Big Data: the Analytics
2/10
-
8/10/2019 BIg Data is Not About Data
3/118
The Value in Big Data: the Analytics
Data:
2/10
-
8/10/2019 BIg Data is Not About Data
4/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements
2/10
-
8/10/2019 BIg Data is Not About Data
5/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized
2/10
-
8/10/2019 BIg Data is Not About Data
6/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
2/10
-
8/10/2019 BIg Data is Not About Data
7/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases
2/10
-
8/10/2019 BIg Data is Not About Data
8/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
2/10
-
8/10/2019 BIg Data is Not About Data
9/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
Output can be highly customized
2/10
-
8/10/2019 BIg Data is Not About Data
10/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
Output can be highly customized Moores Law (doubling speed/power every 18 months)
2/10
-
8/10/2019 BIg Data is Not About Data
11/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
Output can be highly customized Moores Law (doubling speed/power every 18 months)
v. Our Students (1000x speed increase in 1 day)
2/10
-
8/10/2019 BIg Data is Not About Data
12/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
Output can be highly customized Moores Law (doubling speed/power every 18 months)
v. Our Students (1000x speed increase in 1 day)
$2M computer v. 2 hours of algorithm design
2/10
-
8/10/2019 BIg Data is Not About Data
13/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
Output can be highly customized Moores Law (doubling speed/power every 18 months)
v. Our Students (1000x speed increase in 1 day)
$2M computer v. 2 hours of algorithm design Low cost; little infrastructure; mostly human capital needed
2/10
-
8/10/2019 BIg Data is Not About Data
14/118
The Value in Big Data: the Analytics
Data: easy to come by; often a free byproduct of IT improvements becoming commoditized Ignore it & every institution will have more every year
With a bit of effort: huge data production increases Where the Value is: the Analytics
Output can be highly customized Moores Law (doubling speed/power every 18 months)
v. Our Students (1000x speed increase in 1 day)
$2M computer v. 2 hours of algorithm design Low cost; little infrastructure; mostly human capital needed Innovative analytics: enormously better than off-the-shelf
2/10
-
8/10/2019 BIg Data is Not About Data
15/118
Examples of whats now possible
3/10
-
8/10/2019 BIg Data is Not About Data
16/118
Examples of whats now possible
Opinions of activists:
3/10
-
8/10/2019 BIg Data is Not About Data
17/118
Examples of whats now possible
Opinions of activists: A few thousand interviews
3/10
-
8/10/2019 BIg Data is Not About Data
18/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days)
3/10
-
8/10/2019 BIg Data is Not About Data
19/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise:
3/10
-
8/10/2019 BIg Data is Not About Data
20/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week?
3/10
-
8/10/2019 BIg Data is Not About Data
21/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
3/10
-
8/10/2019 BIg Data is Not About Data
22/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts:
3/10
-
8/10/2019 BIg Data is Not About Data
23/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends
3/10
-
8/10/2019 BIg Data is Not About Data
24/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends continuous record of phone calls, emails, textmessages, bluetooth, social media connections, address books
3/10
-
8/10/2019 BIg Data is Not About Data
25/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends continuous record of phone calls, emails, textmessages, bluetooth, social media connections, address books
Economic development in developing countries:
3/10
-
8/10/2019 BIg Data is Not About Data
26/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends continuous record of phone calls, emails, textmessages, bluetooth, social media connections, address books
Economic development in developing countries: Dubious ornonexistent governmental statistics
3/10
-
8/10/2019 BIg Data is Not About Data
27/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends continuous record of phone calls, emails, textmessages, bluetooth, social media connections, address books
Economic development in developing countries: Dubious ornonexistent governmental statistics satellite images of
human-generated light at night, road networks, otherinfrastructure
3/10
-
8/10/2019 BIg Data is Not About Data
28/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends continuous record of phone calls, emails, textmessages, bluetooth, social media connections, address books
Economic development in developing countries: Dubious ornonexistent governmental statistics satellite images of
human-generated light at night, road networks, otherinfrastructure
Many, many, more...
3/10
-
8/10/2019 BIg Data is Not About Data
29/118
Examples of whats now possible
Opinions of activists: A few thousand interviews billions of
political opinions in social media posts (1B every 2 Days) Exercise: A survey: How many times did you exercise last
week? 500K people carrying cell phones withaccelerometers
Social contacts: A survey: Please tell me your 5 bestfriends continuous record of phone calls, emails, textmessages, bluetooth, social media connections, address books
Economic development in developing countries: Dubious ornonexistent governmental statistics satellite images of
human-generated light at night, road networks, otherinfrastructure
Many, many, more...
In each: without new analytics, the data are useless
3/10
-
8/10/2019 BIg Data is Not About Data
30/118
The End of The Quantitative-Qualitative Divide
4/10
-
8/10/2019 BIg Data is Not About Data
31/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
4/10
-
8/10/2019 BIg Data is Not About Data
32/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
4/10
-
8/10/2019 BIg Data is Not About Data
33/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data
4/10
f Q Q
-
8/10/2019 BIg Data is Not About Data
34/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data Expert-vs-analytics contests: Whenever enough information is
quantified, a right answer exists, and good analytics areapplied: analytics wins
4/10
Th E d f Th Q i i Q li i Di id
-
8/10/2019 BIg Data is Not About Data
35/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data Expert-vs-analytics contests: Whenever enough information is
quantified, a right answer exists, and good analytics areapplied: analytics wins
Moral of the story:
4/10
Th E d f Th Q i i Q li i Di id
-
8/10/2019 BIg Data is Not About Data
36/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data Expert-vs-analytics contests: Whenever enough information is
quantified, a right answer exists, and good analytics areapplied: analytics wins
Moral of the story: Fully human is inadequate
4/10
Th E d f Th Q tit ti Q lit ti Di id
-
8/10/2019 BIg Data is Not About Data
37/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data Expert-vs-analytics contests: Whenever enough information is
quantified, a right answer exists, and good analytics areapplied: analytics wins
Moral of the story: Fully human is inadequate Fully automated fails
4/10
Th E d f Th Q tit ti Q lit ti Di id
-
8/10/2019 BIg Data is Not About Data
38/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data Expert-vs-analytics contests: Whenever enough information is
quantified, a right answer exists, and good analytics areapplied: analytics wins
Moral of the story: Fully human is inadequate Fully automated fails We needcomputer assisted, human controlledtechnology
4/10
The E d of The Q a titati e Q alitati e Di ide
-
8/10/2019 BIg Data is Not About Data
39/118
The End of The Quantitative-Qualitative Divide
The Quant-Qual divide exists in everyfield.
Qualitative researchers: overwhelmed by information; needhelp
Quantitative researchers: recognize the huge amounts ofinformation in qualitative analyses, starting to analyze
unstructured text, video, audio as data Expert-vs-analytics contests: Whenever enough information is
quantified, a right answer exists, and good analytics areapplied: analytics wins
Moral of the story: Fully human is inadequate Fully automated fails We needcomputer assisted, human controlledtechnology (Technically correct, & politically much easier)
4/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
40/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
41/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics:
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
42/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
43/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis Sentiment analysis via word counts
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
44/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis Sentiment analysis via word counts
Different problems, Same Analytics Solution:
5/10
-
8/10/2019 BIg Data is Not About Data
45/118
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
46/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis Sentiment analysis via word counts
Different problems, Same Analytics Solution: Key to both methods: classifying(deaths, social media posts)
Key to both goals: estimating %s
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
47/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis Sentiment analysis via word counts
Different problems, Same Analytics Solution: Key to both methods: classifying(deaths, social media posts)
Key to both goals: estimating %s Modern Data Analytics: New method led to:
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
48/118
How to Read a Billion Blog Posts& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis Sentiment analysis via word counts
Different problems, Same Analytics Solution: Key to both methods: classifying(deaths, social media posts)
Key to both goals: estimating %s Modern Data Analytics: New method led to:
1.
5/10
How to Read a Billion Blog Posts
-
8/10/2019 BIg Data is Not About Data
49/118
g& Classify Deaths without Physicians
Examples of Bad Analytics: Physicians Verbal Autopsy analysis Sentiment analysis via word counts
Different problems, Same Analytics Solution: Key to both methods: classifying(deaths, social media posts)
Key to both goals: estimating %s Modern Data Analytics: New method led to:
1.
2. Worldwide cause-of-death estimates for
5/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
50/118
y y
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
51/118
y y
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
52/118
y y
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts:
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
53/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
54/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
55/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics:
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
56/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
57/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity)
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
58/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence)
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
59/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence) Forecasts: inaccurate, inconsistent, overly optimistic
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
60/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence) Forecasts: inaccurate, inconsistent, overly optimistic
New customized analytics we developed:
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
61/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence) Forecasts: inaccurate, inconsistent, overly optimistic
New customized analytics we developed: Logical consistency (e.g., older people have higher mortality)
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
62/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence) Forecasts: inaccurate, inconsistent, overly optimistic
New customized analytics we developed: Logical consistency (e.g., older people have higher mortality) More accurate forecasts
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
63/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence) Forecasts: inaccurate, inconsistent, overly optimistic
New customized analytics we developed: Logical consistency (e.g., older people have higher mortality) More accurate forecasts Trust fund needs $800 billionmore than SSA thought
6/10
The Solvency of Social Security
-
8/10/2019 BIg Data is Not About Data
64/118
Successful: single largest government program; lifted a whole
generation out of poverty; extremely popular Solvency: depends on mortality forecasts: If retirees receive
benefits longer than expected, the Trust Fund runs out
SSA data: little change other than updates for 75 years
SSA analytics: Few statistical improvements for 75 years Ignore risk factors (smoking, obesity) Mostly informal (subject to error & political influence) Forecasts: inaccurate, inconsistent, overly optimistic
New customized analytics we developed: Logical consistency (e.g., older people have higher mortality) More accurate forecasts Trust fund needs $800 billionmore than SSA thought Other applications to insurance industry, public health, etc.
6/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
65/118
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
66/118
Example Substitution 1:
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
67/118
Example Substitution 1:
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
68/118
Example Substitution 1:
Freedom
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
69/118
Example Substitution 1:
Freedom
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
70/118
Example Substitution 1:
Freedom
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
71/118
Example Substitution 1:
Freedom Eye field
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
72/118
Example Substitution 1:
Freedom Eye field (nonsensical)
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
73/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
74/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
75/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
76/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
7/10
Following Conversations that Hide in Plain Sight
-
8/10/2019 BIg Data is Not About Data
77/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
Harmonious [Society] (official slogan)
7/10
Following Conversations that Hide in Plain Sight
E l S b i i 1 H h
-
8/10/2019 BIg Data is Not About Data
78/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
Harmonious [Society] (official slogan)
7/10
Following Conversations that Hide in Plain Sight
E l S b i i 1 H h
-
8/10/2019 BIg Data is Not About Data
79/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
Harmonious [Society] (official slogan)
7/10
Following Conversations that Hide in Plain Sight
E l S b tit ti 1 H h
-
8/10/2019 BIg Data is Not About Data
80/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
Harmonious [Society] (official slogan) River crab
7/10
Following Conversations that Hide in Plain Sight
E l S b tit ti 1 H h
-
8/10/2019 BIg Data is Not About Data
81/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2:
Harmonious [Society] (official slogan) River crab (irrelevant)
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
82/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
83/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
84/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
The same task:
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
85/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
The same task: (1) Government and industry analysts job,
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
86/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
The same task: (1) Government and industry analysts job, (2)language drift (#BostonBombings #BostonStrong),
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
87/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
The same task: (1) Government and industry analysts job, (2)language drift (#BostonBombings #BostonStrong), (3) Childpornographers,
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
88/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
The same task: (1) Government and industry analysts job, (2)language drift (#BostonBombings #BostonStrong), (3) Childpornographers, (4) Look-alike modeling,
7/10
Following Conversations that Hide in Plain Sight
Example Substitution 1: Homograph
-
8/10/2019 BIg Data is Not About Data
89/118
Example Substitution 1: Homograph
Freedom Eye field (nonsensical)
Example Substitution 2: Homophone (sound like hexie)
Harmonious [Society] (official slogan) River crab (irrelevant)
They cant follow the conversation; Our methods can!
The same task: (1) Government and industry analysts job, (2)language drift (#BostonBombings #BostonStrong), (3) Childpornographers, (4) Look-alike modeling,(5) Starting point forsophisticated automated text analysis
7/10
Computer-Assisted Reading (Consilience)
-
8/10/2019 BIg Data is Not About Data
90/118
8/10
Computer-Assisted Reading (Consilience)
To understand many documents humans create categories to
-
8/10/2019 BIg Data is Not About Data
91/118
To understand many documents, humanscreate categoriesto
represent conceptualization, insight, etc.
8/10
Computer-Assisted Reading (Consilience)
To understand many documents humans create categories to
-
8/10/2019 BIg Data is Not About Data
92/118
To understand many documents, humanscreate categoriesto
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
8/10
Computer-Assisted Reading (Consilience)
To understand many documents humans create categories to
-
8/10/2019 BIg Data is Not About Data
93/118
To understand many documents, humanscreate categoriesto
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics:
8/10
Computer-Assisted Reading (Consilience)
To understand many documents, humans create categories to
-
8/10/2019 BIg Data is Not About Data
94/118
To understand many documents, humanscreate categoriesto
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics: Unassisted Human Categorization: time consuming; huge
efforts trying notto innovate!
8/10
Computer-Assisted Reading (Consilience)
To understand many documents, humans create categories to
-
8/10/2019 BIg Data is Not About Data
95/118
To understand many documents, humanscreate categoriesto
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics: Unassisted Human Categorization: time consuming; huge
efforts trying notto innovate! Fully Automated Cluster Analysis: Many widely available,
but none work (computers dont know what you want!)
8/10
Computer-Assisted Reading (Consilience)
To understand many documents, humanscreate categoriesto
-
8/10/2019 BIg Data is Not About Data
96/118
y , g
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics: Unassisted Human Categorization: time consuming; huge
efforts trying notto innovate! Fully Automated Cluster Analysis: Many widely available,
but none work (computers dont know what you want!)
Our alternative: Computer-assisted Categorization
8/10
Computer-Assisted Reading (Consilience)
To understand many documents, humanscreate categoriesto
-
8/10/2019 BIg Data is Not About Data
97/118
y , g
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics: Unassisted Human Categorization: time consuming; huge
efforts trying notto innovate! Fully Automated Cluster Analysis: Many widely available,
but none work (computers dont know what you want!)
Our alternative: Computer-assisted Categorization You decide whats important, but with help
8/10
Computer-Assisted Reading (Consilience)
To understand many documents, humanscreate categoriesto
-
8/10/2019 BIg Data is Not About Data
98/118
y g
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics: Unassisted Human Categorization: time consuming; huge
efforts trying notto innovate! Fully Automated Cluster Analysis: Many widely available,
but none work (computers dont know what you want!)
Our alternative: Computer-assisted Categorization You decide whats important, but with help Invert effort: you innovate; the computer categorizes
8/10
Computer-Assisted Reading (Consilience)
To understand many documents, humanscreate categoriesto
-
8/10/2019 BIg Data is Not About Data
99/118
represent conceptualization, insight, etc. Most firms: impose fixed categorizations to tally customer
complaints, sort reports, retrieve information
Bad Analytics: Unassisted Human Categorization: time consuming; huge
efforts trying notto innovate! Fully Automated Cluster Analysis: Many widely available,
but none work (computers dont know what you want!)
Our alternative: Computer-assisted Categorization You decide whats important, but with help Invert effort: you innovate; the computer categorizes Insights: easier, faster, better
8/10
-
8/10/2019 BIg Data is Not About Data
100/118
Example Insights from Computer-Assisted Reading
-
8/10/2019 BIg Data is Not About Data
101/118
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do
-
8/10/2019 BIg Data is Not About Data
102/118
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
103/118
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
104/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
105/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
106/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
/
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
107/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
/
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
108/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it?
/
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
109/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
110/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
111/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
112/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down Data: We get posts before the Chinese censor them
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
113/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down Data: We get posts before the Chinese censor them We analyzed 11 million posts, about 13% censored
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases
-
8/10/2019 BIg Data is Not About Data
114/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down Data: We get posts before the Chinese censor them We analyzed 11 million posts, about 13% censored Previous understanding: they censor criticisms of the
government
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases Categorization: (1) advertising (2) position taking (3) credit
-
8/10/2019 BIg Data is Not About Data
115/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming
New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down Data: We get posts before the Chinese censor them We analyzed 11 million posts, about 13% censored Previous understanding: they censor criticisms of the
government Results:
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases Categorization: (1) advertising (2) position taking (3) credit
-
8/10/2019 BIg Data is Not About Data
116/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming
New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down Data: We get posts before the Chinese censor them We analyzed 11 million posts, about 13% censored Previous understanding: they censor criticisms of the
government Results:
Uncensored: criticism of the government
9/10
Example Insights from Computer-Assisted Reading
1. What Members of Congress Do Data: 64,000 Senators press releases Categorization: (1) advertising, (2) position taking, (3) credit
-
8/10/2019 BIg Data is Not About Data
117/118
Categorization: (1) advertising, (2) position taking, (3) creditclaiming
New Insight: partisan taunting
Joe Wilson during Obamas State of the Union: You lie!
Senator Lautenberg Blasts Republicans as Chicken Hawks
How common is it? 27% of all Senatorial press releases!
2. Reverse Engineering Chinese Censorship Previous approach: manual effort to see what is taken down Data: We get posts before the Chinese censor them We analyzed 11 million posts, about 13% censored Previous understanding: they censor criticisms of the
government Results:
Uncensored: criticism of the government
Censored: attempts at collective action
9/10
For more information
-
8/10/2019 BIg Data is Not About Data
118/118
GaryKing.org
10/10