Mining the Web for Information using Hadoop

Post on 05-Dec-2014

1

– Someday Soon (Flickr)

Mining the web with Hadoop

Steve Watt, Emerging Technologies @ HP

2

– timsnell (Flickr)

3

Gathering Data

Data Marketplaces

4

5

6

Gathering Data

Apache Nutch(Web Crawler)

7

Tech Bubble?

What does the Data Say?

– Pascal Terjan (Flickr)

8

9

10

Using Apache Nutch

Identify Optimal Seed URLs for a Seed List & Crawl to a depth of 2

For example:

http://www.crunchbase.com/companies?c=a&q=private_held
http://www.crunchbase.com/companies?c=b&q=private_held
http://www.crunchbase.com/companies?c=c&q=private_held
http://www.crunchbase.com/companies?c=d&q=private_held
. . .

Crawl data is stored in sequence files in the segments dir on the HDFS
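
A minimal sketch of how a seed list in this pattern could be generated, written in plain Python rather than anything Nutch ships. The function name seed_urls is made up for illustration, and only the four letters shown on the slide are used; whether the pattern continues through the rest of the alphabet is not stated here. The result is a plain list of URLs, one per line, which is the usual shape of a crawler seed file.

```python
from urllib.parse import urlencode

BASE = "http://www.crunchbase.com/companies"


def seed_urls(letters):
    """Build one seed URL per starting letter, following the pattern on the slide."""
    return [
        f"{BASE}?{urlencode({'c': letter, 'q': 'private_held'})}"
        for letter in letters
    ]


# Only the letters that appear on the slide; the slide elides the rest.
urls = seed_urls("abcd")

# One URL per line, ready to be saved as a seed file.
seed_file_text = "\n".join(urls) + "\n"
```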

11

ALSO

12

Making the data STRUCTURED

Retrieving HTML

Prelim Filtering on URL

Company POJO then \t Out
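
The slide names the stages but not their implementation, so what follows is only a rough Python sketch of two of them: a URL filter and a record object that writes itself out tab-delimited. The URL prefix, the FundingRound class, and its field names are assumptions chosen to mirror the columns shown later in the deck; the HTML parsing step is left out because the slides give no detail on it.

```python
from dataclasses import dataclass, astuple

# Hypothetical prefix for company pages; the slides do not give the real filter.
COMPANY_PREFIX = "http://www.crunchbase.com/company/"


def keep_url(url):
    """Preliminary filter: keep only pages that look like company pages."""
    return url.startswith(COMPANY_PREFIX)


@dataclass
class FundingRound:
    """Plain record object, standing in for the 'Company POJO' on the slide."""
    company: str
    city: str
    state: str
    country: str
    sector: str
    round_name: str
    day: int
    month: int
    year: int
    amount: int
    investors: str = ""

    def to_tsv(self):
        """Serialize the fields, in declaration order, as one tab-delimited line."""
        return "\t".join(str(value) for value in astuple(self))


row = FundingRound("InfoChimps", "Austin", "TX", "USA", "Enterprise",
                   "Angel", 14, 9, 2010, 350000, "Stage One Capital")
line = row.to_tsv()
```

Keeping the serialization on the record itself means the field order is defined in exactly one place.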

13

The Result? Tab Delimited Structured Data…

Company        City       State  Country  Sector       Round  Day  Month  Year  Amount   Investors
InfoChimps     Austin     TX     USA      Enterprise   Angel  14   9      2010  350000   Stage One Capital
InfoChimps     Austin     TX     USA      Enterprise   A      7    11     2010  1200000  DFJ Mercury
MassRelevance  Austin     TX     USA      Enterprise   A      20   12     2010  2200000  Floodgate, AV, etc
Masher         Calabasas  CA     USA      Games_Video  Seed   0    2      2009  175000
Masher         Calabasas  CA     USA      Games_Video  Angel  11   8      2009  300000   Tech Coast Angels

Note: I dropped the ZipCode because it didn’t occur consistently

14

Time to Analyze/Visualize the data…

Step 1: Select the right visual encoding for your questions

Let’s start by asking questions & seeing what we can learn from some simple Bar Charts…

*Total Tech Investments By Year

*Investment Funding By Sector
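
The totals behind bar charts like these come down to a group-and-sum over the tab-delimited rows. As a hedged illustration, here is a small Python sketch run on the five sample rows from the earlier slide; the helper names parse and total_by are made up, and this is not the tooling used for the actual charts.

```python
from collections import defaultdict

HEADER = ["Company", "City", "State", "Country", "Sector", "Round",
          "Day", "Month", "Year", "Amount", "Investors"]

# The five sample rows from the deck, tab-delimited.
SAMPLE = "\n".join([
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tAngel\t14\t9\t2010\t350000\tStage One Capital",
    "InfoChimps\tAustin\tTX\tUSA\tEnterprise\tA\t7\t11\t2010\t1200000\tDFJ Mercury",
    "MassRelevance\tAustin\tTX\tUSA\tEnterprise\tA\t20\t12\t2010\t2200000\tFloodgate, AV, etc",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tSeed\t0\t2\t2009\t175000",
    "Masher\tCalabasas\tCA\tUSA\tGames_Video\tAngel\t11\t8\t2009\t300000\tTech Coast Angels",
])


def parse(text):
    """Turn tab-delimited lines into dicts keyed by column name."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        fields += [""] * (len(HEADER) - len(fields))  # pad a missing Investors column
        rows.append(dict(zip(HEADER, fields)))
    return rows


def total_by(rows, column):
    """Sum Amount for each distinct value of the given column."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += int(row["Amount"])
    return dict(totals)


rows = parse(SAMPLE)
by_year = total_by(rows, "Year")
by_sector = total_by(rows, "Sector")
```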

18

Total Investments By Zip Code for all Sectors

$7.3 Billion in San Francisco

$2.9 Billion in Mountain View

$1.2 Billion in Boston

$1.7 Billion in Austin

20

Total Investments By Zip Code for Consumer Web

$1.2 Billion in Chicago

$600 Million in Seattle

$1.7 Billion in San Francisco

21

Total Investments By Zip Code for BioTech

$1.3 Billion in Cambridge

$528 Million in Dallas

$1.1 Billion in San Diego

22

HP Confidential

Geospatial Encoding of Data

23

Steve’s Not so Excellent Adventure

• Let’s try a Choropleth Encoding of the distribution of investment income by County

• Wait, what is GeoJSON?

• OK, the GeoJSON County is mapped to some code

• Each County code has a value that corresponds to a palette color

• So what are these codes? FIPS Codes? But Google returns 3 & 5 digit codes?!?

• I found a 5 digit code list, it has A LOT of codes in it. I’m going to assume it’s correct because there is no way I can manually verify all of them
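
On the 3 vs 5 digit question: a county FIPS code is 3 digits within its state, and the 5 digit form puts the 2 digit state code in front of it. The sketch below, in plain Python, builds the 5 digit code and maps a county total onto a palette color. The break points, the palette, and the dollar amounts are invented for illustration; only the two FIPS codes (Travis County, TX and Santa Clara County, CA) are real.

```python
from bisect import bisect_right


def county_fips(state_code, county_code):
    """Join a 2-digit state code and a 3-digit county code into a 5-digit FIPS string."""
    return f"{int(state_code):02d}{int(county_code):03d}"


# Hypothetical break points (in dollars) and palette; neither comes from the slides.
BREAKS = [1_000_000, 10_000_000, 100_000_000]
PALETTE = ["#eff3ff", "#bdd7e7", "#6baed6", "#2171b5"]


def color_for(amount):
    """Pick the palette entry for the bucket the amount falls into."""
    return PALETTE[bisect_right(BREAKS, amount)]


# Made-up totals keyed by real county codes: Travis County, TX and Santa Clara County, CA.
totals = {
    county_fips(48, 453): 5_000_000,
    county_fips(6, 85): 250_000_000,
}
colors = {fips: color_for(amount) for fips, amount in totals.items()}
```

Keeping the code as a string from the start is what preserves the leading zero in "06085".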

24

Generating Investment Income By County

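-- Lookup table: one row per (City, State) with its county FIPS code
FIPS = LOAD 'data/fips.txt' USING PigStorage('\t') AS (City, State, FIPSCode);
Amt = LOAD 'data/equity.txt' USING PigStorage('\t') AS (City, State, Amount);
AmtGroup = GROUP Amt BY (City, State);
-- Flatten the group key back into City and State, and name the sum so it can be joined and projected
SumGroup = FOREACH AmtGroup GENERATE FLATTEN(group) AS (City, State), SUM(Amt.Amount) AS Amount;
JoinGroup = JOIN SumGroup BY (City, State), FIPS BY (City, State);
Final = FOREACH JoinGroup GENERATE FIPSCode, Amount;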

RESULT: 51234 5000000

16234 1234000 (...)

ALWAYS, ALWAYS check your output…

25

But wait, why are there duplicate records?

Apparently some cities can actually belong to two counties… I guess I’ll pick one.
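
To make the duplicate problem concrete, here is a hedged Python sketch of the same join on toy data. The city names and FIPS codes are placeholders, not real lookup entries, and "keep the smallest code" is just one arbitrary way to pick one; the slide does not say which county was chosen.

```python
# Placeholder lookup rows (city, state, FIPS code); not real data.
fips_rows = [
    ("Springfield", "XX", "00001"),
    ("Springfield", "XX", "00003"),  # same city listed under a second county
    ("Shelbyville", "XX", "00005"),
]
sums = {("Springfield", "XX"): 5000000, ("Shelbyville", "XX"): 1234000}


def join(sums, fips_rows):
    """Inner join on (city, state); emits one row per matching lookup entry."""
    return [(code, sums[(city, state)])
            for city, state, code in fips_rows
            if (city, state) in sums]


def dedupe(fips_rows):
    """Keep one FIPS code per (city, state): the smallest, so the choice is repeatable."""
    best = {}
    for city, state, code in fips_rows:
        key = (city, state)
        if key not in best or code < best[key]:
            best[key] = code
    return [(city, state, code) for (city, state), code in best.items()]


before = join(sums, fips_rows)          # Springfield's total appears twice
after = join(sums, dedupe(fips_rows))   # one row per city
```

Comparing len(before) with len(sums) is a cheap way to spot that the join fanned out.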

26

Yay, no duplicates. Let’s visualize this!

• Wait, what happened to California?

• Aaargh, I stored the FIPS codes in Pig as ints instead of chararrays, which trimmed off the leading zero. OK, I add them back. Voilà! We have California.
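
California's state prefix is 06, so every one of its county codes starts with a zero and shrinks to four digits once parsed as an integer. A minimal Python sketch of adding the zero back; the function name restore_fips is made up, and this stands in for whatever fix was actually applied in Pig.

```python
def restore_fips(code):
    """Zero-pad a county FIPS code that was parsed as an integer back to 5 characters."""
    text = str(code).strip()
    if not text.isdigit() or len(text) > 5:
        raise ValueError(f"not a county FIPS code: {code!r}")
    return text.zfill(5)


santa_clara = restore_fips(6085)   # stored as an int, leading zero lost
travis = restore_fips("48453")     # already 5 digits, unchanged
```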

27

On Error Checking…

• Crowd-sourced data has LOADS of errors in it, and they actually influence your results. You need a good system that helps identify those errors.

• Santa Clara, Ca

• Santa, Clara

• Santa, Clara CA

• Track (count) input and output records. Examine the results. Something fishy?
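
One possible shape for such a check, sketched in Python: collapse the spelling variants to a single key, then compare input and output counts. The normalization rule and the names city_key and audit are assumptions for illustration, and STATE_CODES holds only the one abbreviation that appears in the examples above.

```python
from collections import Counter

# Assumption: extend this with the state abbreviations present in your data.
STATE_CODES = {"CA"}


def city_key(raw):
    """Collapse punctuation, case, and a trailing state code into one comparable key."""
    tokens = raw.replace(",", " ").upper().split()
    if len(tokens) > 1 and tokens[-1] in STATE_CODES:
        tokens = tokens[:-1]
    return " ".join(tokens)


def audit(records, transform):
    """Apply transform, drop records that map to an empty key, and report the counts."""
    keys = [transform(record) for record in records]
    kept = [key for key in keys if key]
    return {
        "input": len(records),
        "output": len(kept),
        "dropped": len(records) - len(kept),
        "distinct": Counter(kept),
    }


variants = ["Santa Clara, Ca", "Santa, Clara", "Santa, Clara CA"]
report = audit(variants, city_key)
```

If "dropped" is non-zero, or "distinct" has far more keys than expected, that is the something fishy to go and look at.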

28

29

Questions?

Steve Watt swatt@hp.com

@wattsteve

emergingafrican.com