Joost Ouwerkerk

Post on 21-Jun-2015



Big Data, big problems.

What is Big Data?

•Volume

•Velocity

•Variety

Volume

Billions of Things:

•Posts, Tweets and Likes

•Web Transactions

•Sensor Readings

Velocity

Streaming Data:

•Twitter: 500,000,000 TPD

•Walmart: 20,000,000 TPD

•Hopper: 750,000,000 TPD

Variety

Integrating Many Sources of Data:

•Unstructured Web Content

•Semi-structured Logs

•Relational Databases

•Images, Video, Audio

So What’s Changed?

•Mobile devices

•Social Web

•Sensors, Metrics

•Digitization of everything

Open Source Tools

•Hadoop: distributed processing

•R: predictive analytics for big data

•Hive, Pig: ad-hoc analytics for Hadoop

•Mahout: machine learning for Hadoop

•HBase, Cassandra: distributed databases

•ElasticSearch: distributed search engine

•Storm: distributed processing for data streams

"The best minds of my generation are thinking about how to make people click ads"

- Jeff Hammerbacher (Facebook, Accel, Cloudera)

Big Minds + Big Data

•Aggregate, Summarize

•Detect Patterns

•Model, Simulate

•Forecast, Predict

Open Data

•Reports

•Request/Response APIs

•Small Data

Hack/reduce

•Open Hackspace in Boston

•Home for Pre-seed projects, Community events

•Not-for-profit sponsored by local industry and government

Hack/reduce Cluster

240-core cluster sponsored by GoGrid, a cloud computing company.

Available for use at today’s Open Data Day.

What do you do with a 240-core Cluster?

Use the power of many machines to analyze Big Data sets.

How do you get computers to work together like that?

That’s what Hadoop is for.

An Example

Daily Hansard: transcript of Canadian parliament since 1994

Swearwords.txt (http://www.bannedwordlist.com)

Who are the most foul-mouthed Federal MPs?
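As a rough illustration of how such a question can be split into a map step and a reduce step, here is a minimal sketch in plain Python. It is not the code behind these slides and it does not use Hadoop; the function names, the speaker labels and the two-word list are hypothetical stand-ins for the Hansard statements and Swearwords.txt.

```python
import re
from collections import Counter


def map_statement(speaker, text, swearwords):
    """Map step: emit (speaker, 1) for every swearword in one statement."""
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in swearwords:
            yield speaker, 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts per speaker."""
    totals = Counter()
    for speaker, count in pairs:
        totals[speaker] += count
    return totals


# Hypothetical stand-in data, not taken from the Hansard.
swearwords = {"darn", "heck"}
statements = [
    ("MP A", "Well, darn it, this bill is a mess."),
    ("MP B", "I thank the honourable member."),
    ("MP A", "What the heck is the minister doing? Darn!"),
]

# On a cluster the map calls would run in parallel on many machines;
# here they simply run one after another.
pairs = [
    pair
    for speaker, text in statements
    for pair in map_statement(speaker, text, swearwords)
]
totals = reduce_counts(pairs)
print(totals.most_common())
```

Each statement can be mapped independently of the others, which is what lets a framework such as Hadoop spread the work across many cores before the per-speaker sums are combined.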

Results

•20 years of House of Commons statements

•511,341 Statements analyzed

•121,985,310 Words spoken

•3,839 Swearwords spoken

•1 in 133 statements has a swearword
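The "1 in 133" figure can be checked from the totals above. The short sketch below assumes it was obtained by dividing statements by swearwords, which matches only if each swearword falls in a different statement; the slides do not say how it was counted.

```python
# Totals as reported on the Results slide.
statements = 511_341
words = 121_985_310
swearwords = 3_839

statements_per_swearword = statements / swearwords
words_per_swearword = words / swearwords

print(round(statements_per_swearword))  # 133
print(round(words_per_swearword))       # 31775
```

On the same totals, that is about one swearword in every 31,775 words spoken.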

Top 5 Swearers (absolute)

Pat Martin NDP 98

Randy White Conservative 88

Alexa McDonough NDP 52

Jim Silye Conservative 50

Yvan Loubier Bloc Quebecois 49

Top 5 Swearers (relative)

Randy White Conservative 0.037% 88 299,114

Dennis Mills Liberal 0.023% 14 62,221

Gerry Ritz Conservative 0.022% 22 99,037

John McCallum Conservative 0.017% 38 226,155

John McKay Liberal 0.016% 44 268,188

Top 5 Words Spoken

Paul Szabo 1,482,106

Pat Martin 1,053,365

Don Boudria 867,204

Yvan Loubier 861,888

Peter MacKay 844,130

Prime Ministers

Jean Chrétien 11 604,431

Paul Martin 6 485,990

Stephen Harper 22 620,999
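The same relative measure can be applied to the Prime Ministers figures. This is a sketch under an assumption: the column headers did not survive, so the first number is read here as swearwords and the second as words spoken, and the helper name `swear_rate_percent` is made up for illustration.

```python
# (swearwords, words spoken) per Prime Minister, read from the slide.
# Assumption: that is what the two unlabeled columns mean.
prime_ministers = {
    "Jean Chrétien": (11, 604_431),
    "Paul Martin": (6, 485_990),
    "Stephen Harper": (22, 620_999),
}


def swear_rate_percent(swearwords, words):
    """Swearwords as a percentage of all words spoken."""
    return 100 * swearwords / words


rates = {
    name: swear_rate_percent(swears, words)
    for name, (swears, words) in prime_ministers.items()
}

# Highest relative rate first.
for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {rate:.4f}%")
```

Under that reading, all three rates come out well below those on the Top 5 Swearers (relative) slide.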

"The best minds of my generation are thinking about how to make people click ads"

- Jeff Hammerbacher (Facebook, Accel, Cloudera)