Big Data is Not About Data

  • 8/10/2019 BIg Data is Not About Data

    Big Data is Not About the Data!

    Gary King (GaryKing.org)

    Institute for Quantitative Social Science, Harvard University

    Talk at the History of Evidence class, Harvard Law School, 11/17/2014

    The Value in Big Data: the Analytics

    Data:
      easy to come by; often a free byproduct of IT improvements
      becoming commoditized
      Ignore it & every institution will have more every year
      With a bit of effort: huge data production increases

    Where the Value is: the Analytics
      Output can be highly customized
      Moore's Law (doubling speed/power every 18 months) v. Our Students (1000x speed increase in 1 day)
      $2M computer v. 2 hours of algorithm design
      Low cost; little infrastructure; mostly human capital needed
      Innovative analytics: enormously better than off-the-shelf
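To put the Moore's Law comparison in numbers, here is a small worked calculation. It assumes only the figures on the slide (a doubling every 18 months, a 1000x speedup); the function name `years_of_doubling_needed` is made up for this illustration.

```python
import math


def years_of_doubling_needed(speedup: float, months_per_doubling: float = 18.0) -> float:
    """Years of steady doubling required to reach a given speedup factor."""
    doublings = math.log2(speedup)  # 1000x needs just under 10 doublings
    return doublings * months_per_doubling / 12.0


# A 1000x gain from hardware alone, at one doubling every 18 months
years_to_match = years_of_doubling_needed(1000.0)
print(f"Hardware alone would need about {years_to_match:.1f} years")
```

Under these assumptions, waiting for hardware takes roughly 15 years to deliver what the slide credits to a day of better algorithm design.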

    Examples of what's now possible

    Opinions of activists: A few thousand interviews → billions of political opinions in social media posts (1B every 2 days)
    Exercise: A survey: "How many times did you exercise last week?" → 500K people carrying cell phones with accelerometers
    Social contacts: A survey: "Please tell me your 5 best friends" → continuous record of phone calls, emails, text messages, bluetooth, social media connections, address books
    Economic development in developing countries: Dubious or nonexistent governmental statistics → satellite images of human-generated light at night, road networks, other infrastructure
    Many, many, more...

    In each: without new analytics, the data are useless

    The End of The Quantitative-Qualitative Divide

    The Quant-Qual divide exists in every field.
    Qualitative researchers: overwhelmed by information; need help
    Quantitative researchers: recognize the huge amounts of information in qualitative analyses, starting to analyze unstructured text, video, audio as data
    Expert-vs-analytics contests: Whenever enough information is quantified, a right answer exists, and good analytics are applied: analytics wins
    Moral of the story:
      Fully human is inadequate
      Fully automated fails
      We need computer assisted, human controlled technology
      (Technically correct, & politically much easier)

    How to Read a Billion Blog Posts & Classify Deaths without Physicians

    Examples of Bad Analytics:
      Physicians' Verbal Autopsy analysis
      Sentiment analysis via word counts

    Different problems, Same Analytics Solution:
      Key to both methods: classifying (deaths, social media posts)
      Key to both goals: estimating %s

    Modern Data Analytics: New method led to:
      1.
      2. Worldwide cause-of-death estimates for
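The gap between classifying individual items and estimating percentages can be illustrated with a toy calculation. This is a generic textbook correction for a two-category classifier with known error rates, not the method described in the talk; the error rates, the sample of 1,000 posts, and the function names are all hypothetical.

```python
def classify_and_count(predicted_labels):
    """Naive estimate: the share of items the classifier labels as positive."""
    return sum(predicted_labels) / len(predicted_labels)


def adjusted_proportion(observed_share, true_positive_rate, false_positive_rate):
    """Correct the naive share using error rates measured on a hand-coded sample.

    observed = tpr * p + fpr * (1 - p), so p = (observed - fpr) / (tpr - fpr).
    """
    if true_positive_rate == false_positive_rate:
        raise ValueError("classifier output carries no information about the category")
    p = (observed_share - false_positive_rate) / (true_positive_rate - false_positive_rate)
    return min(1.0, max(0.0, p))  # keep the estimate inside [0, 1]


# Hypothetical setup: 1,000 posts of which 20% are truly positive,
# and a classifier with a 60% true-positive rate and a 20% false-positive rate.
# It flags 0.6 * 200 + 0.2 * 800 = 280 posts as positive.
predicted = [1] * 280 + [0] * 720

naive_share = classify_and_count(predicted)
corrected_share = adjusted_proportion(naive_share, 0.6, 0.2)
print(f"naive: {naive_share:.2f}, corrected: {corrected_share:.2f}")
```

In this made-up case the per-item classifier overstates the category share (28% against a true 20%), while the aggregate correction recovers it. The point is only that a mediocre classifier can still support a good percentage estimate when the estimate targets the aggregate directly.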

    The Solvency of Social Security

    Successful: single largest government program; lifted a whole generation out of poverty; extremely popular
    Solvency: depends on mortality forecasts: If retirees receive benefits longer than expected, the Trust Fund runs out
    SSA data: little change other than updates for 75 years
    SSA analytics:
      Few statistical improvements for 75 years
      Ignore risk factors (smoking, obesity)
      Mostly informal (subject to error & political influence)
      Forecasts: inaccurate, inconsistent, overly optimistic
    New customized analytics we developed:
      Logical consistency (e.g., older people have higher mortality)
      More accurate forecasts
      Trust fund needs $800 billion more than SSA thought
      Other applications to insurance industry, public health, etc.
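As a rough illustration of what a "logical consistency" requirement might look like in code, the sketch below flags adjacent adult age groups where a forecast gives the older group a lower mortality rate. It is not the forecasting method from the talk; the age groups, the rates, and the function name are invented for the example.

```python
def find_age_inconsistencies(ages, rates):
    """Return (younger_age, older_age) pairs where the older group has the lower rate."""
    violations = []
    for i in range(1, len(ages)):
        if rates[i] < rates[i - 1]:
            violations.append((ages[i - 1], ages[i]))
    return violations


# Hypothetical forecast mortality rates for adult age groups (not real SSA figures)
adult_ages = [60, 65, 70, 75, 80]
forecast_rates = [0.010, 0.016, 0.015, 0.040, 0.065]

violations = find_age_inconsistencies(adult_ages, forecast_rates)
print(violations)
```

Here the check reports the 65 and 70 groups, because the forecast for age 70 falls below the one for age 65. A check like this only detects the problem; building the constraint into the forecasting model itself is a separate and harder step.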

    Following Conversations that Hide in Plain Sight

    Example Substitution 1: Homograph

  • 8/10/2019 BIg Data is Not About Data

    88/118

    Example Substitution 1: Homograph

    Freedom Eye field (nonsensical)

    Example Substitution 2: Homophone (sound like hexie)

    Harmonious [Society] (official slogan) River crab (irrelevant)

    They cant follow the conversation; Our methods can!

    The same task: (1) Government and industry analysts job, (2)language drift (#BostonBombings #BostonStrong), (3) Childpornographers, (4) Look-alike modeling,

    7/10

    Following Conversations that Hide in Plain Sight

    Example Substitution 1: Homograph

    Freedom → Eye field (nonsensical)

    Example Substitution 2: Homophone (sounds like "hexie")

    Harmonious [Society] (official slogan) → River crab (irrelevant)

    They can't follow the conversation; our methods can!

    The same task: (1) government and industry analysts' job, (2) language drift (#BostonBombings → #BostonStrong), (3) child pornographers, (4) look-alike modeling, (5) starting point for sophisticated automated text analysis

    7/10
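The two substitution types on this slide can be illustrated with a minimal Python sketch. It is not the method used in the talk. The Chinese characters are the commonly cited forms of these two examples and are an assumption here, since the transcript shows only the English glosses. The tables `TONELESS_PINYIN` and `LOOKALIKES` and the function `substitution_type` are hypothetical toy names; a real system would need a full pronunciation dictionary and a far larger table of visually confusable characters.

```python
# Toy pronunciation table (pinyin with tones removed). Hypothetical and tiny:
# it covers only the four words used in this illustration.
TONELESS_PINYIN = {
    "和": "he", "谐": "xie",   # "harmonious"
    "河": "he", "蟹": "xie",   # "river crab"
    "自": "zi", "由": "you",   # "freedom"
    "目": "mu", "田": "tian",  # "eye field"
}

# Toy table mapping a look-alike character to the character it imitates.
LOOKALIKES = {"目": "自", "田": "由"}


def sound_key(word):
    """Spell the word by sound; unknown characters are kept as-is."""
    return "".join(TONELESS_PINYIN.get(ch, ch) for ch in word)


def shape_key(word):
    """Replace each look-alike character with the character it imitates."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in word)


def substitution_type(candidate, keyword):
    """Classify how `candidate` could stand in for `keyword`, if at all."""
    if candidate == keyword:
        return "exact"
    if shape_key(candidate) == shape_key(keyword):
        return "homograph"
    if sound_key(candidate) == sound_key(keyword):
        return "homophone"
    return None


print(substitution_type("目田", "自由"))  # "eye field" for "freedom"
print(substitution_type("河蟹", "和谐"))  # "river crab" for "harmonious"
```

A plain keyword search for the original term misses both substitutes, which is why a reader who matches only exact strings cannot follow the conversation.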

    Computer-Assisted Reading (Consilience)

    To understand many documents, humans create categories to represent conceptualization, insight, etc.

    Most firms: impose fixed categorizations to tally customer complaints, sort reports, retrieve information

    Bad Analytics:

    Unassisted Human Categorization: time consuming; huge efforts trying not to innovate!

    Fully Automated Cluster Analysis: many widely available, but none work (computers don't know what you want!)

    Our alternative: Computer-assisted Categorization

    You decide what's important, but with help

    Invert effort: you innovate; the computer categorizes

    Insights: easier, faster, better

    8/10
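The division of labor described on this slide ("you innovate; the computer categorizes") can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method from the talk: it swaps in a simple nearest-centroid classifier over word counts. The seed texts are invented for the example, and `build_centroids` and `categorize` are hypothetical names.

```python
from collections import Counter
import math


def bag_of_words(text):
    """Lowercased word counts for one document."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_centroids(seeds):
    """Sum the word counts of the human-chosen examples in each category.

    Cosine similarity ignores vector length, so the sum points in the
    same direction as the mean and no division is needed.
    """
    centroids = {}
    for category, examples in seeds.items():
        total = Counter()
        for text in examples:
            total.update(bag_of_words(text))
        centroids[category] = total
    return centroids


def categorize(text, centroids):
    """Assign the text to the category with the most similar centroid."""
    vec = bag_of_words(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# The human invents the categories and supplies a few examples (made up here).
seeds = {
    "credit claiming": [
        "senator secures funding for new bridge",
        "senator announces grant for local hospital",
    ],
    "partisan taunting": [
        "senator blasts republicans over budget",
        "senator slams democrats for reckless spending",
    ],
}
centroids = build_centroids(seeds)

# The computer sorts the remaining documents.
print(categorize("senator blasts democrats over spending", centroids))
```

The design point is that the category scheme stays in human hands: adding a category to `seeds` changes how every remaining document is sorted, with no manual re-reading.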

    Example Insights from Computer-Assisted Reading

    1. What Members of Congress Do

    Data: 64,000 Senators' press releases

    Categorization: (1) advertising, (2) position taking, (3) credit claiming

    New Insight: partisan taunting

    Joe Wilson during Obama's State of the Union: "You lie!"

    Senator Lautenberg Blasts Republicans as "Chicken Hawks"

    How common is it? 27% of all Senatorial press releases!

    2. Reverse Engineering Chinese Censorship

    Previous approach: manual effort to see what is taken down

    Data: We get posts before the Chinese censor them

    We analyzed 11 million posts, about 13% censored

    Previous understanding: they censor criticisms of the government

    Results:

    Uncensored: criticism of the government

    Censored: attempts at collective action

    9/10
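To put the two percentages on this slide in absolute terms, the short Python calculation below multiplies them out. It uses only the figures quoted above, and because the slide says "about 13%", the second result is an approximation. The variable names are illustrative.

```python
# Figures quoted on the slide.
press_releases = 64_000
taunting_share = 0.27

posts_analyzed = 11_000_000
censored_share = 0.13  # the slide says "about 13%"

# Convert the shares into approximate document counts.
taunting_releases = round(press_releases * taunting_share)
censored_posts = round(posts_analyzed * censored_share)

print(f"Partisan-taunting press releases: about {taunting_releases:,}")
print(f"Censored posts: about {censored_posts:,}")
```

That is roughly 17,280 press releases and roughly 1.43 million posts, volumes that would be impractical to read and sort by hand.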

    For more information

    GaryKing.org

    10/10