Tech4Africa - Opportunities around Big Data


Transcript of Tech4Africa - Opportunities around Big Data

1

Big Data

Steve Watt, Technology Strategy @ HP, swatt@hp.com, @wattsteve

2

Agenda

Hardware Software Data

• Situational Applications

• Big Data

3

Situational Applications

– eaghra (Flickr)

4

Web 2.0 Era Topic Map (diagram). Nodes: Web 2.0, Situational Applications, Mashups, Social Platforms, Publishing Platforms, Enterprise SOA, LAMP, Inexpensive Storage, Produce/Process, New Information, Data Explosion.

5

6

Big Data

– blmiers2 (Flickr)

The data just keeps growing…

1024 Gigabytes = 1 Terabyte
1024 Terabytes = 1 Petabyte
1024 Petabytes = 1 Exabyte

1 Petabyte = 13.3 years of HD video

20 Petabytes = amount of data processed by Google daily

5 Exabytes = all words ever spoken by humanity

8

Web as a Platform

Web 1.0 - Connecting Machines: Infrastructure

Web 2.0 - Connecting People: API Foundation (Facebook, Twitter, LinkedIn, Google, Netflix, PayPal, eBay, Pandora, New York Times)

The Fractured Web: a Service Economy (a service for this, a service for that) and an App Economy for devices (an app for this, an app for that)

Web 2.0 data exhaust of historical and real-time data

Real-time data from mobile, set-top boxes, tablets, etc.

Sensor Web: an instrumented and monitored world, with multiple sensors in your pocket

Opportunity

9

Data Deluge! But filter patterns can help…

Kakadu (Flickr)

10

Filtering with Search

11


Filtering Socially

12

Filtering Visually

But filter patterns force you down a pre-processed path

M.V. Jantzen (Flickr)

14

What if you could ask your own questions?

– wowwzers (Flickr)

– MrB-MMX (Flickr)

And go from discovering Something about Everything…

16

To discovering Everything about Something ?

17

Gathering, Storing, Processing & Delivering Data @ Scale

How do we do this? Let's examine a few techniques for each.

18

Gathering Data

Data Marketplaces

19

20

21

Gathering Data

Apache Nutch (Web Crawler)

22

Storing, Reading and Processing - Apache Hadoop

Cluster technology with a single master that scales out with multiple slaves. It consists of two runtimes:

• The Hadoop Distributed File System (HDFS)

• Map/Reduce

As data is copied onto the HDFS, it is blocked and replicated to other machines to provide redundancy.

A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster.

Jobs run on data that is on the local disks of the machines they are sent to, ensuring data locality.

Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a job on any node in the cluster.

Want to know more? “Hadoop – The Definitive Guide (2nd Edition)”
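To make the Map/Reduce programming model concrete, here is a minimal sketch of a self-contained Hadoop job in Java: the standard word-count example, not code from this talk. The input and output paths are passed as arguments and refer to directories on the HDFS.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}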

23

Delivering Data @ Scale

• Structured Data

• Low Latency & Random Access

• Column Stores (Apache HBase or Apache Cassandra)

  • faster seeks

  • better compression

  • simpler scale out

• De-normalized – Data is written as it is intended to be queried

Want to know more? “HBase – The Definitive Guide” & “Cassandra High Performance Cookbook”
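As a rough illustration of the de-normalized, low-latency pattern, here is a minimal sketch using the HBase Java client API. The "companies" table, the "funding" column family and the row key are made-up placeholders, not a schema from this talk.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CompanyStore {
  public static void main(String[] args) throws Exception {
    // Hypothetical "companies" table with one column family, keyed by company name
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("companies"))) {

      // De-normalized write: everything needed to answer the query lives in one row
      Put put = new Put(Bytes.toBytes("acme-corp"));
      put.addColumn(Bytes.toBytes("funding"), Bytes.toBytes("round"), Bytes.toBytes("Series A"));
      put.addColumn(Bytes.toBytes("funding"), Bytes.toBytes("amount"), Bytes.toBytes("5000000"));
      table.put(put);

      // Low-latency random read by row key
      Result row = table.get(new Get(Bytes.toBytes("acme-corp")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("funding"), Bytes.toBytes("round"))));
    }
  }
}

Because the row already contains every column the query needs, a single seek by row key answers it; there is no join at read time.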

24

Storing, Processing & Delivering: Hadoop + NoSQL

(Architecture diagram)

Gather: log files land on HDFS through a Flume connector, relational data (JDBC, e.g. MySQL) through a SQOOP connector, and web data through a Nutch crawl.

Read/Transform: Apache Hadoop jobs clean and filter the data, then transform and enrich it - often multiple Hadoop jobs.

Serve: results are copied into a NoSQL repository through its connector/API, where the application queries them with low latency.

25

Some things to keep in mind…

– Kanaka Menehune (Flickr)

26

Some things to keep in mind…

• Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers

Hadoop is really great at this!

• However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing. But this is really hard.

Consider using parsing services & APIs like Open Calais

Want to know more? “Programming Pig” (O’REILLY)
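For example, handing a piece of prose to an entity-extraction service over HTTP might look like the hypothetical sketch below. The endpoint URL, header names and API key are assumptions based on the classic Open Calais REST API, so check the current Calais documentation before relying on them.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class EntityExtractionClient {
  public static void main(String[] args) throws Exception {
    // Assumed endpoint and headers for the classic Open Calais REST API (not verified here)
    URL endpoint = new URL("https://api.opencalais.com/tag/rs/enrich");
    HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    conn.setRequestProperty("x-calais-licenseID", "YOUR_API_KEY"); // placeholder key
    conn.setRequestProperty("Content-Type", "text/raw; charset=UTF-8");
    conn.setRequestProperty("Accept", "application/json");

    String prose = "HP acquired Vertica, an analytics database company based in Massachusetts.";
    try (OutputStream out = conn.getOutputStream()) {
      out.write(prose.getBytes(StandardCharsets.UTF_8));
    }

    // The response is a JSON document describing the entities (companies, places, people)
    // the service recognized in the prose.
    try (InputStream in = conn.getInputStream();
         Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
      System.out.println(scanner.hasNext() ? scanner.next() : "");
    }
  }
}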

27

Open Calais (Gnosis)

28

Statistical real-time decision making

Capture Historical information

Use Machine Learning to build decision making models (such as Classification, Clustering & Recommendation)

Mesh real-time events (such as sensor data) against Models to make automated decisions

Want to know more? “Mahout in Action”
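A minimal sketch of the capture-history, build-a-model, recommend flow using Mahout's Taste API is shown below. The ratings file, the user id and the neighborhood size are placeholders.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendFromHistory {
  public static void main(String[] args) throws Exception {
    // Historical information: userID,itemID,preference triples in a CSV file (placeholder path)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Build a user-based collaborative-filtering model from that history
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Automated decision for user 42: recommend the top 5 items
    List<RecommendedItem> recommendations = recommender.recommend(42L, 5);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}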

29

Tech Bubble?

What does the Data Say?

– Pascal Terjan (Flickr)

30

31

32

Apache Nutch

Identify optimal seed URLs & crawl to a depth of 2

http://www.crunchbase.com/companies?c=a&q=privately_held

Crawl data is stored in segment dirs on the HDFS

33

34

Making the data STRUCTURED

Retrieving the HTML, preliminary filtering on the URL, then mapping each page to a Company POJO and writing it out tab-delimited ('\t').
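A hypothetical sketch of that kind of mapper is below. The assumption that the crawl segments have already been turned into (url, html) Text pairs, the crunchbase.com URL filter and the field-extraction markers are all illustrative guesses, not the code shown in the talk.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: reads (url, html) pairs derived from the Nutch crawl segments,
// keeps only CrunchBase company pages, and emits one tab-delimited record per company.
public class CompanyExtractorMapper extends Mapper<Text, Text, Text, NullWritable> {

  @Override
  protected void map(Text url, Text html, Context context)
      throws IOException, InterruptedException {

    // Preliminary filtering on URL: only company pages are of interest
    if (!url.toString().contains("crunchbase.com/company/")) {
      return;
    }

    // Placeholder field extraction; a real job would parse the retrieved HTML properly
    String page = html.toString();
    String company = extract(page, "\"name\":\"", "\"");
    String city = extract(page, "\"city\":\"", "\"");
    String state = extract(page, "\"state_code\":\"", "\"");

    if (company != null) {
      // Company "POJO" flattened to a tab-separated line, ready for Pig or a column store
      context.write(new Text(company + "\t" + city + "\t" + state), NullWritable.get());
    }
  }

  // Naive helper: grab the substring between two markers, or null if absent
  private static String extract(String text, String start, String end) {
    int from = text.indexOf(start);
    if (from < 0) return null;
    from += start.length();
    int to = text.indexOf(end, from);
    return to < 0 ? null : text.substring(from, to);
  }
}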

35

Aargh!

My viz tool requires zipcodes to plot geospatially!

Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica

ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);

CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);

STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer', 'OSCON', '5433', 'dbadmin', '');

Total Tech Investments By Year

Investment Funding By Sector

39

Total Investments By Zip Code for all Sectors

$7.3 Billion in San Francisco

$2.9 Billion in Mountain View

$1.2 Billion in Boston

$1.7 Billion in Austin

40

Total Investments By Zip Code for Consumer Web

$1.2 Billion in Chicago

$600 Million in Seattle

$1.7 Billion in San Francisco

41

Total Investments By Zip Code for BioTech

$1.3 Billion in Cambridge

$528 Million in Dallas

$1.1 Billion in San Diego

42

Questions?