Tech4Africa - Opportunities around Big Data


Transcript of Tech4Africa - Opportunities around Big Data

1

Big Data

Steve Watt, Technology Strategy @ HP, swatt@hp.com, @wattsteve

2

Agenda

Hardware Software Data

• Situational Applications

• Big Data

3

Situational Applications

– eaghra (Flickr)

4

Web 2.0 Era Topic Map (diagram). Nodes: Web 2.0, Situational Applications, Mashups, Social Platforms, Publishing Platforms, Enterprise SOA, LAMP, Inexpensive Storage, Produce/Process, New Information, Data Explosion.

5

6

Big Data

– blmiers2 (Flickr)

The data just keeps growing…

1024 Gigabytes = 1 Terabyte
1024 Terabytes = 1 Petabyte
1024 Petabytes = 1 Exabyte

1 Petabyte = 13.3 years of HD video

20 Petabytes = amount of data processed by Google daily

5 Exabytes = all words ever spoken by humanity

8

Web as a Platform

Web 1.0 - Connecting Machines: Infrastructure

Web 2.0 - Connecting People: API Foundation (Facebook, Twitter, LinkedIn, Google, Netflix, PayPal, eBay, Pandora, New York Times)

The Fractured Web: a Service Economy (a service for this, a service for that) and an App Economy for devices (an app for this, an app for that)

Web 2.0 data exhaust of historical and real-time data

Real-time data from mobile, set-top boxes, tablets, etc.

Sensor Web: an instrumented and monitored world, with multiple sensors in your pocket

Opportunity

9

Data Deluge! But filter patterns can help…

Kakadu (Flickr)

10

Filtering with Search

11


Filtering Socially

12

Filtering Visually

But filter patterns force you down a pre-processed path

M.V. Jantzen (Flickr)

14

What if you could ask your own questions?

– wowwzers (Flickr)

– MrB-MMX (Flickr)

And go from discovering Something about Everything…

16

To discovering Everything about Something ?

17

Gathering, Storing, Processing & Delivering Data @ Scale

How do we do this? Let's examine a few techniques for each.

18

Gathering Data

Data Marketplaces

19

20

21

Gathering Data

Apache Nutch (Web Crawler)

22

Storing, Reading and Processing - Apache Hadoop

Cluster technology with a single master that scales out with multiple slaves. It consists of two runtimes:

• The Hadoop Distributed File System (HDFS)

• Map/Reduce

As data is copied onto the HDFS, it is blocked and replicated to other machines to provide redundancy.

A self-contained job (workload) is written in Map/Reduce and submitted to the Hadoop master, which in turn distributes the job to each slave in the cluster.

Jobs run on data that is on the local disks of the machines they are sent to, ensuring data locality.

Node (slave) failures are handled automatically by Hadoop, which may execute or re-execute a job on any node in the cluster.

Want to know more? “Hadoop – The Definitive Guide (2nd Edition)”
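To make the Map/Reduce programming model concrete, here is a minimal sketch of a self-contained Hadoop job in Java: the standard word-count example, not code from this talk. The input and output paths are passed as arguments and refer to directories on the HDFS.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map: emit (word, 1) for every word in the input split
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce: sum the counts for each word
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}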

23

Delivering Data @ Scale

• Structured Data

• Low Latency & Random Access

• Column Stores (Apache HBase or Apache Cassandra)

  • faster seeks

  • better compression

  • simpler scale out

• De-normalized – Data is written as it is intended to be queried

Want to know more? “HBase – The Definitive Guide” & “Cassandra High Performance Cookbook”
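As a rough illustration of the de-normalized, low-latency pattern, here is a minimal sketch using the HBase Java client API. The "companies" table, the "funding" column family and the row key are made-up placeholders, not a schema from this talk.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CompanyStore {
  public static void main(String[] args) throws Exception {
    // Hypothetical "companies" table with one column family, keyed by company name
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table table = conn.getTable(TableName.valueOf("companies"))) {

      // De-normalized write: everything needed to answer the query lives in one row
      Put put = new Put(Bytes.toBytes("acme-corp"));
      put.addColumn(Bytes.toBytes("funding"), Bytes.toBytes("round"), Bytes.toBytes("Series A"));
      put.addColumn(Bytes.toBytes("funding"), Bytes.toBytes("amount"), Bytes.toBytes("5000000"));
      table.put(put);

      // Low-latency random read by row key
      Result row = table.get(new Get(Bytes.toBytes("acme-corp")));
      System.out.println(Bytes.toString(
          row.getValue(Bytes.toBytes("funding"), Bytes.toBytes("round"))));
    }
  }
}

Because the row already contains every column the query needs, a single seek by row key answers it; there is no join at read time.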

24

Storing, Processing & Delivering: Hadoop + NoSQL

(Architecture diagram)

Gather: log files land on HDFS through a Flume connector, relational data (JDBC, e.g. MySQL) through a SQOOP connector, and web data through a Nutch crawl.

Read/Transform: Apache Hadoop jobs clean and filter the data, then transform and enrich it - often multiple Hadoop jobs.

Serve: results are copied into a NoSQL repository through its connector/API, where the application queries them with low latency.

25

Some things to keep in mind…

– Kanaka Menehune (Flickr)

26

Some things to keep in mind…

• Processing arbitrary types of data (unstructured, semi-structured, structured) requires normalizing data with many different kinds of readers

Hadoop is really great at this!

• However, readers won’t really help you process truly unstructured data such as prose. For that you’re going to have to get handy with Natural Language Processing. But this is really hard.

Consider using parsing services & APIs like Open Calais

Want to know more? “Programming Pig” (O’REILLY)
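For example, handing a piece of prose to an entity-extraction service over HTTP might look like the hypothetical sketch below. The endpoint URL, header names and API key are assumptions based on the classic Open Calais REST API, so check the current Calais documentation before relying on them.

import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class EntityExtractionClient {
  public static void main(String[] args) throws Exception {
    // Assumed endpoint and headers for the classic Open Calais REST API (not verified here)
    URL endpoint = new URL("https://api.opencalais.com/tag/rs/enrich");
    HttpURLConnection conn = (HttpURLConnection) endpoint.openConnection();
    conn.setDoOutput(true);
    conn.setRequestMethod("POST");
    conn.setRequestProperty("x-calais-licenseID", "YOUR_API_KEY"); // placeholder key
    conn.setRequestProperty("Content-Type", "text/raw; charset=UTF-8");
    conn.setRequestProperty("Accept", "application/json");

    String prose = "HP acquired Vertica, an analytics database company based in Massachusetts.";
    try (OutputStream out = conn.getOutputStream()) {
      out.write(prose.getBytes(StandardCharsets.UTF_8));
    }

    // The response is a JSON document describing the entities (companies, places, people)
    // the service recognized in the prose.
    try (InputStream in = conn.getInputStream();
         Scanner scanner = new Scanner(in, "UTF-8").useDelimiter("\\A")) {
      System.out.println(scanner.hasNext() ? scanner.next() : "");
    }
  }
}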

27

Open Calais (Gnosis)

28

Statistical real-time decision making

Capture Historical information

Use Machine Learning to build decision making models (such as Classification, Clustering & Recommendation)

Mesh real-time events (such as sensor data) against Models to make automated decisions

Want to know more? “Mahout in Action”
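A minimal sketch of the capture-history, build-a-model, recommend flow using Mahout's Taste API is shown below. The ratings file, the user id and the neighborhood size are placeholders.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommendFromHistory {
  public static void main(String[] args) throws Exception {
    // Historical information: userID,itemID,preference triples in a CSV file (placeholder path)
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // Build a user-based collaborative-filtering model from that history
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Automated decision for user 42: recommend the top 5 items
    List<RecommendedItem> recommendations = recommender.recommend(42L, 5);
    for (RecommendedItem item : recommendations) {
      System.out.println(item.getItemID() + " -> " + item.getValue());
    }
  }
}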

29

Tech Bubble?

What does the Data Say?

– Pascal Terjan (Flickr)

30

31

32

Apache Nutch

Identify optimal seed URLs & crawl to a depth of 2

http://www.crunchbase.com/companies?c=a&q=privately_held

Crawl data is stored in segment dirs on the HDFS

33

34

Making the data STRUCTURED

Retrieving the HTML, preliminary filtering on the URL, then mapping each page to a Company POJO and writing it out tab-delimited ('\t').
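A hypothetical sketch of that kind of mapper is below. The assumption that the crawl segments have already been turned into (url, html) Text pairs, the crunchbase.com URL filter and the field-extraction markers are all illustrative guesses, not the code shown in the talk.

import java.io.IOException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: reads (url, html) pairs derived from the Nutch crawl segments,
// keeps only CrunchBase company pages, and emits one tab-delimited record per company.
public class CompanyExtractorMapper extends Mapper<Text, Text, Text, NullWritable> {

  @Override
  protected void map(Text url, Text html, Context context)
      throws IOException, InterruptedException {

    // Preliminary filtering on URL: only company pages are of interest
    if (!url.toString().contains("crunchbase.com/company/")) {
      return;
    }

    // Placeholder field extraction; a real job would parse the retrieved HTML properly
    String page = html.toString();
    String company = extract(page, "\"name\":\"", "\"");
    String city = extract(page, "\"city\":\"", "\"");
    String state = extract(page, "\"state_code\":\"", "\"");

    if (company != null) {
      // Company "POJO" flattened to a tab-separated line, ready for Pig or a column store
      context.write(new Text(company + "\t" + city + "\t" + state), NullWritable.get());
    }
  }

  // Naive helper: grab the substring between two markers, or null if absent
  private static String extract(String text, String start, String end) {
    int from = text.indexOf(start);
    if (from < 0) return null;
    from += start.length();
    int to = text.indexOf(end, from);
    return to < 0 ? null : text.substring(from, to);
  }
}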

35

Aargh!

My viz tool requires zipcodes to plot geospatially!

Apache Pig Script to Join on City to get Zip Code and Write the results to Vertica

ZipCodes = LOAD 'demo/zipcodes.txt' USING PigStorage('\t') AS (State:chararray, City:chararray, ZipCode:int);

CrunchBase = LOAD 'demo/crunchbase.txt' USING PigStorage('\t') AS (Company:chararray, City:chararray, State:chararray, Sector:chararray, Round:chararray, Month:int, Year:int, Investor:chararray, Amount:int);

CrunchBaseZip = JOIN CrunchBase BY (City, State), ZipCodes BY (City, State);

STORE CrunchBaseZip INTO '{CrunchBaseZip(Company varchar(40), City varchar(40), State varchar(40), Sector varchar(40), Round varchar(40), Month int, Year int, Investor varchar(40), Amount int)}' USING com.vertica.pig.VerticaStorer('VerticaServer', 'OSCON', '5433', 'dbadmin', '');

Total Tech Investments By Year

Investment Funding By Sector

39

Total Investments By Zip Code for all Sectors

$7.3 Billion in San Francisco

$2.9 Billion in Mountain View

$1.2 Billion in Boston

$1.7 Billion in Austin

40

Total Investments By Zip Code for Consumer Web

$1.2 Billion in Chicago

$600 Million in Seattle

$1.7 Billion in San Francisco

41

Total Investments By Zip Code for BioTech

$1.3 Billion in Cambridge

$528 Million in Dallas

$1.1 Billion in San Diego

42

Questions?