O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Analytics Workshop Pete Skomoroch, LinkedIn (@peteskomoroch) Kevin Weil, Twitter (@kevinweil) Sean Gorman, FortiusOne (@seangorman) #spatialanalytics

description

This workshop focused on quickly uncovering patterns and generating actionable insights from large datasets using spatial analytics. The three-part workshop consisted of: 1) an overview of traditional spatial analysis tools; 2) an intro to Hadoop and other tools for analyzing terabytes or more of data; 3) live code examples on combining the two to uncover patterns in location data pulled from the Twitter streaming API using Mechanical Turk, Amazon Elastic MapReduce, and Pig. Given at the O’Reilly Where 2.0 conference in March 2010. http://en.oreilly.com/where2010/public/schedule/detail/12400

Transcript of O'Reilly Where 2.0 Spatial Analytics Workshop

Page 1: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Analytics Workshop

Pete Skomoroch, LinkedIn (@peteskomoroch)

Kevin Weil, Twitter (@kevinweil)

Sean Gorman, FortiusOne (@seangorman)

#spatialanalytics

Page 2: O'Reilly Where 2.0 Spatial Analytics Workshop

Introduction

‣ The Rise of Spatial Analytics

‣ Spatial Analysis Techniques

‣ Hadoop, Pig, and Big Data

‣ Bringing the Two Together

‣ Conclusion

‣ Q&A

Page 5: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Analysis

Analytical techniques to determine the spatial distribution of a variable, the relationship between the spatial distribution of variables, and the association of the variables in an area.

Page 6: O'Reilly Where 2.0 Spatial Analytics Workshop

Pattern Analysis

Page 7: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Analysis Types

1. Spatial autocorrelation

2. Spatial interpolation

3. Spatial interaction

4. Simulation and modeling

5. Density mapping

Page 8: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Autocorrelation

Spatial autocorrelation statistics measure and analyze the degree of dependency among observations in a geographic space.

First law of geography: “everything is related to everything else, but near things are more related than distant things.”

-- Waldo Tobler

Page 9: O'Reilly Where 2.0 Spatial Analytics Workshop

Moran’s I - Random Variable

Moran’s I = .012

Moran’s I - Per Capita Income in Monroe County

Moran’s I = .66
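
Moran's I values like the two quoted above are commonly computed as I = (n / W) * sum_ij w_ij (x_i - mean)(x_j - mean) / sum_i (x_i - mean)^2, where w_ij is a spatial weight and W is the sum of all weights. The following is a minimal sketch of that formula, not the code behind the Monroe County maps. The row of six areas, the binary adjacency weights, and the two value lists are made up for illustration.

```python
def morans_i(values, weights):
    """Global Moran's I for values x_i and a spatial weights matrix w_ij."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    # W: sum of all spatial weights.
    w_sum = sum(sum(row) for row in weights)
    # Weighted cross-product of deviations between every pair of areas.
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n)
        for j in range(n)
    )
    squared = sum(d * d for d in dev)
    return (n / w_sum) * (cross / squared)


def chain_weights(n):
    """Hypothetical layout: n areas in a row; adjacent areas get weight 1."""
    return [[1 if abs(i - j) == 1 else 0 for j in range(n)] for i in range(n)]


w = chain_weights(6)
clustered = [1, 1, 1, 5, 5, 5]    # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # neighbors are always dissimilar

print("clustered:  ", morans_i(clustered, w))
print("alternating:", morans_i(alternating, w))
```

The clustered list gives a positive value and the alternating list a negative one. A spatially random variable would be expected to land near zero, as in the first map.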

Page 10: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Interpolation

Spatial interpolation methods estimate the variables at unobserved locations in geographic space based on the values at observed locations.
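
One simple interpolation method is inverse distance weighting (IDW), where each observed value counts in proportion to 1 / distance^power. The slides do not say which method produced the maps that follow, so this is only a sketch of the general idea. The sample points and the `power=2` default are assumptions.

```python
import math


def idw(samples, x, y, power=2):
    """Estimate the value at (x, y) from (sx, sy, value) observations."""
    numerator = 0.0
    denominator = 0.0
    for sx, sy, value in samples:
        distance = math.hypot(x - sx, y - sy)
        if distance == 0:
            return value  # exactly on an observed location
        weight = 1.0 / distance ** power
        numerator += weight * value
        denominator += weight
    return numerator / denominator


# Made-up observations: (x, y, observed value).
samples = [(0, 0, 10.0), (10, 0, 20.0), (0, 10, 30.0)]

print(idw(samples, 0, 0))  # an observed location returns its own value
print(idw(samples, 5, 0))  # an unobserved location gets a weighted blend
```

A larger `power` makes the estimate follow the nearest observation more closely.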

Page 11: O'Reilly Where 2.0 Spatial Analytics Workshop

Natural Gas Demand in Response to February 21, 2003 Alberta Clipper cold front

Prices shown on the map: $7.55, $14.00, $14.00

Locations labeled: Henry, NYC, Chicago

Page 12: O'Reilly Where 2.0 Spatial Analytics Workshop

Natural Gas Demand in Response to February 24, 2003 Alberta Clipper cold front

Prices shown on the map: $16.00, $30.00, $18.50

Locations labeled: Henry, NYC, Chicago

Page 13: O'Reilly Where 2.0 Spatial Analytics Workshop

Natural Gas Demand in Response to February 25, 2003 Alberta Clipper cold front

Prices shown on the map: $22.00, $37.00, $20.00

Locations labeled: Henry, NYC, Chicago

Page 14: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Interaction

Spatial interaction or “gravity models” estimate the flow of people, material, or information between locations in geographic space.
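
In its basic textbook form, a gravity model predicts the flow between two places as k * m_i * m_j / d^beta: larger "masses" (population, supply, demand) raise the flow and distance damps it. The sketch below uses that form only. The places, masses, coordinates, and the `k` and `beta` defaults are hypothetical and are not taken from the oil model on the next slide.

```python
import math


def gravity_flow(mass_i, mass_j, distance, k=1.0, beta=2.0):
    """Predicted interaction between two places: k * m_i * m_j / d^beta."""
    return k * mass_i * mass_j / distance ** beta


# Hypothetical places: name -> (mass, x, y).
places = {
    "A": (100.0, 0.0, 0.0),
    "B": (50.0, 3.0, 4.0),
    "C": (200.0, 30.0, 40.0),
}

flows = {}
names = sorted(places)
for index, origin in enumerate(names):
    for destination in names[index + 1:]:
        m_i, x_i, y_i = places[origin]
        m_j, x_j, y_j = places[destination]
        d = math.hypot(x_i - x_j, y_i - y_j)
        flows[(origin, destination)] = gravity_flow(m_i, m_j, d)

for pair, flow in sorted(flows.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(flow, 2))
```

Here the nearby pair A and B outranks the pairs involving C, even though C has the largest mass.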

Page 15: O'Reilly Where 2.0 Spatial Analytics Workshop

Global Oil Supply and Demand Gravity Model

Page 16: O'Reilly Where 2.0 Spatial Analytics Workshop

Simulation and Modeling

Simple interactions among proximal entities can lead to intricate, persistent, and functional spatial entities at aggregate levels (complex adaptive systems).

Page 17: O'Reilly Where 2.0 Spatial Analytics Workshop

Spatial Interdependency Analysis of the San Francisco Failure Simulation

Infrastructure                Total Number of Links   No. Links Congested   % Links Congested   % Volume Delay
Refined Products (National)   3,197                   1                     0.03%               0.05%
Refined Products (MSA)        8                       1                     12.50%              93%
Power Grid (Regional)         1,942                   4                     0%                  N/A
Power Grid (MSA)              16                      2                     13%                 N/A

Page 18: O'Reilly Where 2.0 Spatial Analytics Workshop

Density Mapping

Calculating the proximity and frequency of a spatial phenomenon by creating a probabilistic surface.

Page 19: O'Reilly Where 2.0 Spatial Analytics Workshop

New York City Fiber Density Map

Page 20: O'Reilly Where 2.0 Spatial Analytics Workshop

Standard GIS Architectures

Page 21: O'Reilly Where 2.0 Spatial Analytics Workshop

Distributed Analytics

Queueing analysis tasks from disparate data sources for agents to run across distributed servers to collate back to the user as answers.
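
The queue-and-agents pattern described here can be sketched on a single machine. In this sketch threads stand in for agents on distributed servers, and an in-memory queue stands in for the request queue. The task names and the `analyze` function are placeholders, not part of the system shown on the following slides.

```python
import queue
import threading


def run_agents(tasks, analyze, num_agents=3):
    """Queue tasks, let agent threads work through them, collate the answers."""
    requests = queue.Queue()
    for task in tasks:
        requests.put(task)

    results = {}
    lock = threading.Lock()

    def agent():
        while True:
            try:
                task_id, payload = requests.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            answer = analyze(payload)
            with lock:
                results[task_id] = answer

    agents = [threading.Thread(target=agent) for _ in range(num_agents)]
    for thread in agents:
        thread.start()
    for thread in agents:
        thread.join()
    return results


# Placeholder tasks: (task id, data pulled from some source).
tasks = [("counts", [3, 1, 2]), ("prices", [7.5, 14.0]), ("empty", [])]
answers = run_agents(tasks, analyze=sum)
print(answers)
```

Because every task is independent, adding agents shortens the wait without changing the answers that come back.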

Page 22: O'Reilly Where 2.0 Spatial Analytics Workshop

Diagram labels: User, Request Queue, Disparate Data, Agents, Distributed Servers, Analysis

Page 23: O'Reilly Where 2.0 Spatial Analytics Workshop

User

Request Queue

Agent

Amazon S3

Amazon EC2

1. Rasterize

2. Kernel density calc

3. Color map

(http://finder.geocommons.com/overlays/20148)
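
The three steps listed on this slide (rasterize, kernel density calc, color map) can be sketched in a few lines. This is an illustration of the general technique, not the GeoCommons implementation. The Gaussian kernel, the bandwidth, the grid size, the sample points, and the character ramp used in place of a color ramp are all assumptions. The result is an unnormalized density surface rather than a true probability surface.

```python
import math


def rasterize(points, width, height):
    """Step 1: count points per grid cell (points are in grid coordinates)."""
    grid = [[0.0] * width for _ in range(height)]
    for x, y in points:
        col, row = math.floor(x), math.floor(y)
        if 0 <= col < width and 0 <= row < height:
            grid[row][col] += 1.0
    return grid


def kernel_density(grid, bandwidth=1.0, radius=2):
    """Step 2: spread each cell's count over its neighbors with a Gaussian."""
    height, width = len(grid), len(grid[0])
    out = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if grid[r][c] == 0:
                continue
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        k = math.exp(-(dr * dr + dc * dc) / (2 * bandwidth ** 2))
                        out[rr][cc] += grid[r][c] * k
    return out


def color_map(density, ramp=" .:*#"):
    """Step 3: map each cell to a symbol, from empty (' ') to peak ('#')."""
    peak = max(max(row) for row in density) or 1.0
    top = len(ramp) - 1
    return ["".join(ramp[round(v / peak * top)] for v in row) for row in density]


# Made-up points: a small cluster plus one outlier.
points = [(2.5, 2.5), (2.2, 2.8), (3.1, 2.4), (7.5, 6.5)]
density = kernel_density(rasterize(points, 10, 8))
image = color_map(density)
for line in image:
    print(line)
```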

Page 24: O'Reilly Where 2.0 Spatial Analytics Workshop

Vector Density Mapping Demo

Page 28: O'Reilly Where 2.0 Spatial Analytics Workshop

Introduction

‣ The Rise of Spatial Analytics

‣ Spatial Analysis Techniques

‣ Hadoop, Pig, and Big Data

‣ Bringing the Two Together

‣ Conclusion

‣ Q&A

Page 29: O'Reilly Where 2.0 Spatial Analytics Workshop

Data is Getting Big

‣ NYSE: 1 TB/day

‣ Facebook: 20+ TB compressed/day

‣ CERN/LHC: 40 TB/day (15 PB/year!)

‣ And growth is accelerating

‣ Need multiple machines, horizontal scalability

Page 30: O'Reilly Where 2.0 Spatial Analytics Workshop

Hadoop

‣ Distributed file system (hard to store a PB)

‣ Fault-tolerant, handles replication, node failure, etc.

‣ MapReduce-based parallel computation (even harder to process a PB)

‣ Generic key-value based computation interface allows for wide applicability

‣ Open source, top-level Apache project

‣ Scalable: Y! has a 4000-node cluster

‣ Powerful: sorted a TB of random integers in 62 seconds

Page 31: O'Reilly Where 2.0 Spatial Analytics Workshop

MapReduce?

‣ Challenge: how many tweets per county, given tweets table?

‣ Input: key=row, value=tweet info

‣ Map: output key=county, value=1

‣ Shuffle: sort by county

‣ Reduce: for each county, sum

‣ Output: county, tweet count

‣ With 2x machines, runs close to 2x faster.

cat file | grep geo | sort | uniq -c > output
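
The map, shuffle, and reduce steps listed above can be imitated on one machine, in the same spirit as the shell pipeline. This sketch only mimics the data flow and does not use Hadoop. The rows and the `county` field are made up, and it assumes each tweet has already been assigned a county.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(rows):
    """Map: for each (row id, tweet info) pair, emit (county, 1)."""
    for _row_id, tweet in rows:
        yield tweet["county"], 1


def shuffle_phase(pairs):
    """Shuffle: sort by county so equal keys are adjacent, then group them."""
    ordered = sorted(pairs, key=itemgetter(0))
    for county, group in groupby(ordered, key=itemgetter(0)):
        yield county, [value for _key, value in group]


def reduce_phase(grouped):
    """Reduce: sum the ones for each county."""
    for county, ones in grouped:
        yield county, sum(ones)


# Made-up input: key = row id, value = tweet info.
rows = [
    (0, {"text": "at where 2.0", "county": "Santa Clara"}),
    (1, {"text": "coffee", "county": "King"}),
    (2, {"text": "pig demo", "county": "Santa Clara"}),
    (3, {"text": "deep dish", "county": "Cook"}),
]

counts = dict(reduce_phase(shuffle_phase(map_phase(rows))))
print(counts)
```

On a cluster, the map and reduce functions run on many machines at once and the framework handles the shuffle, which is where the close-to-2x speedup from 2x machines comes from.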

Page 38: O'Reilly Where 2.0 Spatial Analytics Workshop

But...

‣ Analysis typically done in Java

‣ Single-input, two-stage data flow is rigid

‣ Projections, filters: custom code

‣ Joins: lengthy, error-prone

‣ n-stage jobs: Hard to manage

‣ Prototyping/exploration requires compilation

‣ analytics in Eclipse? ur doin it wrong...

Page 39: O'Reilly Where 2.0 Spatial Analytics Workshop

Enter Pig

‣ High level language

‣ Transformations on sets of records

‣ Process data one step at a time

‣ Easier than SQL?

Page 40: O'Reilly Where 2.0 Spatial Analytics Workshop

Why Pig?

‣ Because I bet you can read the following script.

Page 41: O'Reilly Where 2.0 Spatial Analytics Workshop

A Real Pig Script

‣ Now, just for fun... the same calculation in vanilla Hadoop MapReduce.

Page 42: O'Reilly Where 2.0 Spatial Analytics Workshop

No, seriously.

Page 43: O'Reilly Where 2.0 Spatial Analytics Workshop

Pig Simplifies Analysis

‣ The Pig version is:

‣ 5% of the code, 5% of the time

‣ Within 50% of the execution time.

‣ Pig ♥ Geo:

‣ Programmable: fuzzy matching, custom filtering

‣ Easily link multiple datasets, regardless of size/structure

‣ Iterative, quick

Page 44: O'Reilly Where 2.0 Spatial Analytics Workshop

A Real Example

‣ Fire up your EMR.

‣ ... or follow along at http://bit.ly/whereanalytics

‣ Pete used Twitter’s streaming API to store some tweets

‣ Simplest thing: group by location and count with Pig

‣ http://bit.ly/where20pig

‣ Here comes some code!

Page 46: O'Reilly Where 2.0 Spatial Analytics Workshop

tweets = LOAD 's3://where20demo/sample-tweets' as (
  user_screen_name:chararray,
  tweet_id:chararray,
  ...
  user_friends_count:int,
  user_statuses_count:int,
  user_location:chararray,
  user_lang:chararray,
  user_time_zone:chararray,
  place_id:chararray,
  ...);

Page 48: O'Reilly Where 2.0 Spatial Analytics Workshop

tweets_with_location = FILTER tweets BY user_location != 'NULL';

Page 49: O'Reilly Where 2.0 Spatial Analytics Workshop

normalized_locations = FOREACH tweets_with_location GENERATE LOWER(user_location) as user_location;

Page 50: O'Reilly Where 2.0 Spatial Analytics Workshop

grouped_tweets = GROUP normalized_locations BY user_location PARALLEL 10;

Page 51: O'Reilly Where 2.0 Spatial Analytics Workshop

location_counts = FOREACH grouped_tweets GENERATE $0 as location, SIZE($1) as user_count;

Page 52: O'Reilly Where 2.0 Spatial Analytics Workshop

sorted_counts = ORDER location_counts BY user_count DESC;

Page 53: O'Reilly Where 2.0 Spatial Analytics Workshop

STORE sorted_counts INTO 'global_location_tweets';
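
For readers without a cluster handy, the Pig statements above (FILTER, FOREACH ... LOWER, GROUP, SIZE, ORDER) can be mirrored step by step in plain Python on a small sample. This is a single-machine sketch of the same logic, not a replacement for the Pig script. The sample rows are made up, and only the `user_location` field is used.

```python
from collections import Counter


def count_locations(tweets):
    """Mirror the Pig script: filter, lowercase, group, count, sort."""
    # FILTER tweets BY user_location != 'NULL'
    with_location = [t for t in tweets if t["user_location"] != "NULL"]
    # FOREACH ... GENERATE LOWER(user_location)
    normalized = [t["user_location"].lower() for t in with_location]
    # GROUP ... BY user_location, then SIZE of each group
    counts = Counter(normalized)
    # ORDER ... BY user_count DESC
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Made-up sample rows with only the field this script needs.
sample_tweets = [
    {"user_location": "Brasil"},
    {"user_location": "NULL"},
    {"user_location": "brasil"},
    {"user_location": "London"},
]

location_counts = count_locations(sample_tweets)
print(location_counts)
```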

Page 54: O'Reilly Where 2.0 Spatial Analytics Workshop

hadoop@ip-10-160-113-142:~$ hadoop dfs -cat /global_location_counts/part* | head -30

brasil 37985
indonesia 33777
brazil 22432
london 17294
usa 14564
são paulo 14238
new york 13420
tokyo 10967
singapore 10225
rio de janeiro 10135
los angeles 9934
california 9386
chicago 9155
uk 9095
jakarta 9086
germany 8741
canada 8201
東京都 7696
東京 7121
jakarta, indonesia 6480
nyc 6456
new york, ny 6331

Page 55: O'Reilly Where 2.0 Spatial Analytics Workshop

Neat, but...

‣ Wow, that data is messy!

‣ brasil, brazil at #1 and #3

‣ new york, nyc, and new york ny all in the top 30

‣ Pete to the rescue.
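
One simple way to start cleaning up duplicates like these is a hand-built alias table that folds variant spellings into one canonical name before re-counting. This is only a sketch of that idea and is not necessarily the approach used in the next part of the workshop. The alias table is hypothetical, and folding every "new york" variant into one entry is itself an assumption. The counts come from the output shown earlier.

```python
from collections import Counter

# Hypothetical alias table: variant spelling -> canonical name.
ALIASES = {
    "brasil": "brazil",
    "nyc": "new york",
    "new york, ny": "new york",
}


def merge_counts(location_counts, aliases=ALIASES):
    """Fold aliased locations together and return counts, largest first."""
    merged = Counter()
    for location, count in location_counts:
        merged[aliases.get(location, location)] += count
    return merged.most_common()


# Counts taken from the job output shown earlier.
raw_counts = [
    ("brasil", 37985),
    ("brazil", 22432),
    ("new york", 13420),
    ("nyc", 6456),
    ("new york, ny", 6331),
]

merged_counts = merge_counts(raw_counts)
print(merged_counts)
```

A hand-built table does not scale to the long tail of free-text locations, which is why fuzzy matching and custom filtering matter for this kind of data.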

Page 56: O'Reilly Where 2.0 Spatial Analytics Workshop

Introduction

‣ The Rise of Spatial Analytics

‣ Spatial Analysis Techniques

‣ Hadoop, Pig, and Big Data

‣ Bringing the Two Together

‣ Conclusion

‣ Q&A

Page 67: O'Reilly Where 2.0 Spatial Analytics Workshop

Users by County

Page 68: O'Reilly Where 2.0 Spatial Analytics Workshop

Lady Gaga

Page 69: O'Reilly Where 2.0 Spatial Analytics Workshop

Tea Party

Page 70: O'Reilly Where 2.0 Spatial Analytics Workshop

Dallas

Page 71: O'Reilly Where 2.0 Spatial Analytics Workshop

Colbert

Page 72: O'Reilly Where 2.0 Spatial Analytics Workshop

Introduction

‣ The Rise of Spatial Analytics

‣ Spatial Analysis Techniques

‣ Hadoop, Pig, and Big Data

‣ Bringing the Two Together

‣ Conclusion

‣ Q&A

Page 74: O'Reilly Where 2.0 Spatial Analytics Workshop

Questions? Follow us at

twitter.com/peteskomoroch

twitter.com/kevinweil

twitter.com/seangorman