Predictive Analytics San Diego

The unification of big and little data processing onto a single platform is an important requirement for Hadoop. How can this be achieved? Ted Dunning explains what is needed for three important use cases.

Transcript of Predictive Analytics San Diego

Page 1: Predictive Analytics San Diego

1©MapR Technologies - Confidential

Remembering the Future

Page 2: Predictive Analytics San Diego

My Background

University, Startups
– Aptex, MusicMatch, ID Analytics, Veoh
– big data since before it was big

Open source
– even before the internet
– Apache Hadoop, Mahout, Zookeeper, Drill
– bought the beer at first HUG

MapR

Founding member of Apache Drill

Page 3: Predictive Analytics San Diego

MapR Technologies

Silicon Valley Startup
– Top investors
– Top technical and management team
  • Google, Microsoft, EMC, NetApp, Oracle

Enterprise quality distribution for Hadoop

Many extensions to basic Hadoop function

Strong supporter of Apache Drill

Page 4: Predictive Analytics San Diego

Philosophy First

What is History?

Page 5: Predictive Analytics San Diego

The study of the past

(what came before now)

Page 6: Predictive Analytics San Diego

What is the future?

(it comes after now)

Page 7: Predictive Analytics San Diego

Page 8: Predictive Analytics San Diego

Page 9: Predictive Analytics San Diego

Page 10: Predictive Analytics San Diego

But the future also has a past!

Page 11: Predictive Analytics San Diego

Do you remember the future?

Page 12: Predictive Analytics San Diego

Page 13: Predictive Analytics San Diego

Page 14: Predictive Analytics San Diego

Page 15: Predictive Analytics San Diego

Page 16: Predictive Analytics San Diego

Page 17: Predictive Analytics San Diego

Some things turned out as expected

Page 18: Predictive Analytics San Diego

Guys wearing Fedoras

Page 19: Predictive Analytics San Diego

Many things are different!

Page 20: Predictive Analytics San Diego

Hadoop has a history

Page 21: Predictive Analytics San Diego

Hadoop also has a future

Page 22: Predictive Analytics San Diego

The Old Future of Hadoop

Map-reduce and HDFS
– more and more, but not really different

Eco-system additions
– Simpler programming (Hive and Pig)
– Key-value store
– Ad hoc query

Stands apart from other computing
– Required by HDFS and other limitations

Page 23: Predictive Analytics San Diego

The New Future of Hadoop

Real-time processing
– Combines real-time and long-time

Integration with traditional IT
– No need to stand apart

Integration with new technologies
– Solr, Node.js, Twisted all should interface directly

Fast and flexible computation
– Drill logical plan language

Page 24: Predictive Analytics San Diego

Example #1: Search Abuse

Page 25: Predictive Analytics San Diego

History matrix

One row per user

One column per thing

Page 26: Predictive Analytics San Diego

Recommendation based on cooccurrence

Cooccurrence gives item-item mapping

One row and column per thing
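
A minimal sketch of the history-to-cooccurrence step described on these two slides. The user names, thing ids, and the `cooccurrence` and `recommend` helpers are hypothetical, and raw counts are used for simplicity; a production system (for example one built on Mahout) would likely weight or filter the counts.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical toy history: one row per user, listing the things they touched.
history = {
    "u1": {"a", "b"},
    "u2": {"a", "b", "c"},
    "u3": {"b", "c"},
    "u4": {"d"},
}

def cooccurrence(rows):
    """Count, for each ordered pair of distinct things, the users who have both."""
    counts = defaultdict(lambda: defaultdict(int))
    for things in rows.values():
        for x, y in permutations(sorted(things), 2):
            counts[x][y] += 1
    return counts

def recommend(counts, seen, top_n=2):
    """Score unseen things by summed cooccurrence with the things already seen."""
    scores = defaultdict(int)
    for x in seen:
        for y, n in counts[x].items():
            if y not in seen:
                scores[y] += n
    # Highest score first; ties broken by thing id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [thing for thing, _ in ranked[:top_n]]

cooc = cooccurrence(history)
print(recommend(cooc, {"a"}))
```

The result has one row and one column per thing, so it maps items to items regardless of how many users contributed to the history.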

Page 27: Predictive Analytics San Diego

Cooccurrence matrix can also be implemented as a search index
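
One way to picture this, as a sketch only: store each item's cooccurring items in an indexed field, then use the user's history as the query. The `indicators` field name, the toy documents, and the match-count scoring are assumptions made for illustration; a real engine such as Solr would apply its own relevance ranking.

```python
from collections import defaultdict

# Hypothetical item documents: meta-data plus an "indicators" field holding
# the ids of things that cooccur with the item.
docs = {
    "a": {"title": "Item A", "indicators": ["b", "c"]},
    "b": {"title": "Item B", "indicators": ["a", "c"]},
    "c": {"title": "Item C", "indicators": ["a", "b"]},
    "d": {"title": "Item D", "indicators": []},
}

def build_index(documents):
    """Build an inverted index from indicator term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, doc in documents.items():
        for term in doc["indicators"]:
            index[term].add(doc_id)
    return index

def search(index, user_history, top_n=3):
    """Use the user's history as the query; rank documents by number of matched terms."""
    hits = defaultdict(int)
    for term in user_history:
        for doc_id in index.get(term, ()):
            if doc_id not in user_history:  # do not recommend what was already seen
                hits[doc_id] += 1
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [doc_id for doc_id, _ in ranked[:top_n]]

index = build_index(docs)
print(search(index, ["a", "b"]))
```

Because the recommendation is an ordinary query, item meta-data can sit in the same documents and be filtered or returned alongside the results.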

Page 28: Predictive Analytics San Diego

[Diagram labels: complete history, cooccurrence (Mahout), item meta-data, Solr indexing (Solr indexers), index shards]

Page 29: Predictive Analytics San Diego

[Diagram labels: user history, web tier, Solr search (Solr indexers), item meta-data, index shards]

Page 30: Predictive Analytics San Diego

Objective Results

At a very large credit card company

History is all transactions, all web interaction

Processing time cut from 20 hours per day to 3

Recommendation engine load time decreased from 8 hours to 3 minutes

Page 31: Predictive Analytics San Diego

Example #2: Web Technology

Page 32: Predictive Analytics San Diego

[Diagram labels: fast analysis (Storm), analytic output, real-time data, raw logs]

Page 33: Predictive Analytics San Diego

[Diagram labels: large analysis (map-reduce), analytic output, raw logs]

Page 34: Predictive Analytics San Diego

[Diagram labels: presentation tier (d3 + node.js), analytic output, browser query, raw logs]

Page 35: Predictive Analytics San Diego

Objective Results

Real-time + long-time analysis is seamless

Web tier can be rooted directly on Hadoop cluster

No need to move data
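
As a rough illustration of what "seamless" could mean here, the sketch below merges a long-time count and a real-time count over the same raw logs at query time. The cutoff timestamp, the `(timestamp, page)` log shape, and all function names are hypothetical; in the architecture on the slides these roles would be played by map-reduce, Storm, and the presentation tier.

```python
from collections import Counter

def batch_counts(raw_logs, cutoff):
    """Long-time analysis: count page hits for log records before the cutoff."""
    return Counter(page for ts, page in raw_logs if ts < cutoff)

def realtime_counts(raw_logs, cutoff):
    """Fast analysis: count page hits for records at or after the cutoff."""
    return Counter(page for ts, page in raw_logs if ts >= cutoff)

def answer_query(raw_logs, cutoff, page):
    """Presentation tier: merge both analytic outputs for one browser query."""
    merged = batch_counts(raw_logs, cutoff) + realtime_counts(raw_logs, cutoff)
    return merged[page]

# Hypothetical raw logs as (timestamp, page) pairs.
logs = [(1, "/home"), (2, "/buy"), (3, "/home"), (7, "/home"), (8, "/buy")]
print(answer_query(logs, cutoff=5, page="/home"))
```

Both analyses read the same logs in place, so the only coordination needed is agreeing on the cutoff.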

Page 36: Predictive Analytics San Diego

Example #3: Apache Drill

Page 37: Predictive Analytics San Diego

Big Data Processing – Hadoop

                      Batch processing
Query runtime         Minutes to hours
Data volume           TBs to PBs
Programming model     MapReduce
Users                 Developers
Google project        MapReduce
Open source project   Hadoop MapReduce

Page 38: Predictive Analytics San Diego

Big Data Processing – Hadoop and Storm

                      Batch processing   Stream processing
Query runtime         Minutes to hours   Never-ending
Data volume           TBs to PBs         Continuous stream
Programming model     MapReduce          DAG (pre-programmed)
Users                 Developers         Developers
Google project        MapReduce
Open source project   Hadoop MapReduce   Storm or Apache S4

Page 39: Predictive Analytics San Diego

Big Data Processing – The missing part

                      Batch processing   Interactive analysis      Stream processing
Query runtime         Minutes to hours                             Never-ending
Data volume           TBs to PBs                                   Continuous stream
Programming model     MapReduce                                    DAG (pre-programmed)
Users                 Developers                                   Developers
Google project        MapReduce
Open source project   Hadoop MapReduce                             Storm and S4

Page 40: Predictive Analytics San Diego

Big Data Processing – The missing part

                      Batch processing   Interactive analysis      Stream processing
Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
Data volume           TBs to PBs         GBs to PBs                Continuous stream
Programming model     MapReduce          Queries (ad hoc)          DAG (pre-programmed)
Users                 Developers         Analysts and developers   Developers
Google project        MapReduce
Open source project   Hadoop MapReduce                             Storm and S4

Page 41: Predictive Analytics San Diego

Big Data Processing

                      Batch processing   Interactive analysis      Stream processing
Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
Data volume           TBs to PBs         GBs to PBs                Continuous stream
Programming model     MapReduce          Queries                   DAG
Users                 Developers         Analysts and developers   Developers
Google project        MapReduce          Dremel
Open source project   Hadoop MapReduce                             Storm and S4

Page 42: Predictive Analytics San Diego

Big Data Processing

                      Batch processing   Interactive analysis      Stream processing
Query runtime         Minutes to hours   Milliseconds to minutes   Never-ending
Data volume           TBs to PBs         GBs to PBs                Continuous stream
Programming model     MapReduce          Queries                   DAG
Users                 Developers         Analysts and developers   Developers
Google project        MapReduce          Dremel
Open source project   Hadoop MapReduce   Apache Drill              Storm and S4

Page 43: Predictive Analytics San Diego

Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Page 44: Predictive Analytics San Diego

Simple Architecture

Page 45: Predictive Analytics San Diego

Standard Interfaces

Page 46: Predictive Analytics San Diego

Logical Plan Syntax:

query:[
  { op:"sequence",
    do:[
      { op: "scan", memo: "initial_scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} },
      { op: "transform", transforms: [ { ref: "donuts.quanity", expr: "donuts.sales"} ] },
      { op: "filter", expr: "donuts.ppu < 1.00" },
      …

Page 47: Predictive Analytics San Diego

Logical Streaming Example

{ @id: <refnum>, op: "window-frame", input: <input>, keys: [ <name>,... ], ref: <name>, before: 2, after: here }

Input:   0 1 2 3 4

Windows: 0 | 0 1 | 0 1 2 | 1 2 3 | 2 3 4
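
The figure on this slide can be reproduced with a short sketch, assuming `before: 2, after: here` means "the two preceding records plus the current one". The `window_frames` function and its integer `after` parameter are hypothetical stand-ins, not Drill's implementation.

```python
def window_frames(records, before, after=0):
    """Yield, for each record, the slice from `before` records back to `after` records ahead."""
    for i in range(len(records)):
        start = max(0, i - before)              # clip at the start of the stream
        end = min(len(records), i + after + 1)  # clip at the end of the stream
        yield records[start:end]

frames = list(window_frames([0, 1, 2, 3, 4], before=2))
print(frames)
```

The first two frames are shorter because fewer than two earlier records exist at that point.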

Page 48: Predictive Analytics San Diego

Logical Plan

Page 49: Predictive Analytics San Diego

Execution Plan

Page 50: Predictive Analytics San Diego

Representing a DAG

{ @id: 19, op: "aggregate", input: 18, type: <simple|running|repeat>, keys: [<name>,...], aggregations: [ {ref: <name>, expr: <aggexpr> },... ]}
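
The sketch below shows how `@id` and `input` references can encode a DAG of operators. It is a toy interpreter, not Drill's: the `scan` rows, the Python lambdas standing in for expression strings, and the reading of `type: simple` as "one output record per key group" are all assumptions.

```python
from collections import defaultdict

# Hypothetical plan: each operator has an "@id" and names its upstream
# operator through "input", so the list as a whole encodes a DAG.
plan = [
    {"@id": 17, "op": "scan", "rows": [
        {"shop": "x", "sales": 3}, {"shop": "y", "sales": 5}, {"shop": "x", "sales": 4}]},
    {"@id": 18, "op": "filter", "input": 17, "expr": lambda r: r["sales"] > 3},
    {"@id": 19, "op": "aggregate", "input": 18, "type": "simple",
     "keys": ["shop"],
     "aggregations": [{"ref": "total", "expr": lambda rows: sum(r["sales"] for r in rows)}]},
]

def run(plan):
    """Evaluate operators in list order, looking up each input by its @id."""
    results = {}
    for node in plan:
        if node["op"] == "scan":
            out = list(node["rows"])
        elif node["op"] == "filter":
            out = [r for r in results[node["input"]] if node["expr"](r)]
        elif node["op"] == "aggregate":
            # Group input records by key, then emit one record per group.
            groups = defaultdict(list)
            for r in results[node["input"]]:
                groups[tuple(r[k] for k in node["keys"])].append(r)
            out = []
            for key, rows in groups.items():
                rec = dict(zip(node["keys"], key))
                for agg in node["aggregations"]:
                    rec[agg["ref"]] = agg["expr"](rows)
                out.append(rec)
        else:
            raise ValueError(f"unsupported op: {node['op']}")
        results[node["@id"]] = out
    return results

print(run(plan)[19])
```

This interpreter requires every operator to appear after the operator it reads from; a fuller version would sort the nodes topologically first.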

Page 51: Predictive Analytics San Diego

Non-SQL queries

Page 52: Predictive Analytics San Diego

Design Principles

Flexible
• Pluggable query languages
• Extensible execution engine
• Pluggable data formats
  • Column-based and row-based
  • Schema and schema-less
• Pluggable data sources

Easy
• Unzip and run
• Zero configuration
• Reverse DNS not needed
• IP addresses can change
• Clear and concise log messages

Dependable
• No SPOF
• Instant recovery from crashes

Fast
• C/C++ core with Java support
  • Google C++ style guide
• Min latency and max throughput (limited only by hardware)

Page 53: Predictive Analytics San Diego

The future is not what we thought it would be

Page 54: Predictive Analytics San Diego

It is better!

Page 55: Predictive Analytics San Diego

Get Involved!

Tweet: #hcj13w #mapr

@ted_dunning

Page 56: Predictive Analytics San Diego

Get Involved!

Download these slides
– http://www.mapr.com/company/events/hcj-01-21-2013

Join the Drill project
– [email protected]
– #apachedrill

Contact me:
– [email protected]
– [email protected]
– @ted_dunning

Join MapR (in Japan!)
– [email protected]