Modern Big Data Analytics Tools: An Overview

Modern Big Data Analytics Tools: An Overview Milind Bhandarkar Chief Scientist, Pivotal (Twitter: @techmilind) (All Images Courtesy Flickr, Creative Commons Licensed)

description

Great Wide Open 2014 - Day 1 Milind Bhandarkar - Pivotal 3:30 PM - Operations 2 (Big Data)

Transcript of Modern Big Data Analytics Tools: An Overview

Page 1: Modern Big Data Analytics Tools: An Overview

Modern Big Data Analytics Tools: An Overview

Milind Bhandarkar Chief Scientist, Pivotal (Twitter : @techmilind)

(All Images Courtesy Flickr, Creative Commons Licensed)

Page 2: Modern Big Data Analytics Tools: An Overview

About Me

• http://www.linkedin.com/in/milindb

• Founding member of Hadoop team at Yahoo! [2005-2010]

• Contributor to Apache Hadoop since v0.1

• Built and led Grid Solutions Team at Yahoo! [2007-2010]

• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)

• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)

Page 3: Modern Big Data Analytics Tools: An Overview

Hadoop Midwife :-)

Page 4: Modern Big Data Analytics Tools: An Overview

Once upon a time, in a land far far away…

Page 5: Modern Big Data Analytics Tools: An Overview
Page 6: Modern Big Data Analytics Tools: An Overview

Fast forward 15 years…

Page 7: Modern Big Data Analytics Tools: An Overview
Page 8: Modern Big Data Analytics Tools: An Overview

What Happened?

Page 9: Modern Big Data Analytics Tools: An Overview
Page 10: Modern Big Data Analytics Tools: An Overview
Page 11: Modern Big Data Analytics Tools: An Overview
Page 12: Modern Big Data Analytics Tools: An Overview
Page 13: Modern Big Data Analytics Tools: An Overview
Page 14: Modern Big Data Analytics Tools: An Overview
Page 15: Modern Big Data Analytics Tools: An Overview

And, then…

[Diagram: HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

Page 16: Modern Big Data Analytics Tools: An Overview

In a blink of an eye…

[Diagram: the Hadoop ecosystem. Legend: ASF Projects, FLOSS Projects, Pivotal Products]

• HDFS, YARN, MapReduce, Tez

• Pig, Hive, Crunch, Mahout, Giraph

• Sqoop, Flume

• Coordination and workflow management: Zookeeper, Oozie

• Hadoop UI: Hue

• HBase, Phoenix, SolrCloud

• Spark: Shark, Streaming, MLlib, GraphX

• Impala

• Command Center, GemFire XD, HAWQ, SpringXD, MADlib, Hamster, PivotalR

Page 17: Modern Big Data Analytics Tools: An Overview

History (2003-2010)

Page 18: Modern Big Data Analytics Tools: An Overview

Google Papers

Page 19: Modern Big Data Analytics Tools: An Overview

Yahoo! Search

+

=

Page 20: Modern Big Data Analytics Tools: An Overview

W-1-W

• WebMap: Graph processing for WWW

• Dreadnaught: Infrastructure for WebMap

• W-1-W: WebMap In One Week

• Juggernaut: Infrastructure for W-1-W

• JFS, JMR, Condor: Abandoned for Hadoop

Page 21: Modern Big Data Analytics Tools: An Overview

Lucene, Nutch

Page 22: Modern Big Data Analytics Tools: An Overview
Page 23: Modern Big Data Analytics Tools: An Overview

Kryptonite

Page 24: Modern Big Data Analytics Tools: An Overview

Major Step Backwards?

Page 25: Modern Big Data Analytics Tools: An Overview

MapReduce is the Revenge of System Programmers on Database community. - Anonymous at XLDB, Stanford, 2010

Page 26: Modern Big Data Analytics Tools: An Overview
Page 27: Modern Big Data Analytics Tools: An Overview

O’Reilly Books 2013

Page 28: Modern Big Data Analytics Tools: An Overview

Who Uses Hadoop? (From Hadoop Summit 2010)

Page 29: Modern Big Data Analytics Tools: An Overview

Big Data Landscape - July 2012 http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/

Page 30: Modern Big Data Analytics Tools: An Overview

Hadoop Ecosystem (Jan 2013) http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html

Page 31: Modern Big Data Analytics Tools: An Overview

Game Changing Hadoop Economics

[Chart: Big Data Platform Price/TB, 2008 to 2013. Y-axis: $0 to $80,000. Series: Big Data DB, Hadoop]

Page 32: Modern Big Data Analytics Tools: An Overview
Page 33: Modern Big Data Analytics Tools: An Overview
Page 34: Modern Big Data Analytics Tools: An Overview
Page 35: Modern Big Data Analytics Tools: An Overview

Hadoop Maturity

• ETL Offload: Accommodate massive data growth with existing EDW investments

• Data Lakes: Unify Unstructured and Structured Data Access

• Big Data Apps: Build analytic-led applications impacting top line revenue

• Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture

Page 36: Modern Big Data Analytics Tools: An Overview

The Big Gap (Average Enterprises)

• 70% of data generated by customers

• 80% of data being stored

• 3% being prepared for analysis

• 0.5% being analyzed

• <0.5% being operationalized

Page 37: Modern Big Data Analytics Tools: An Overview

Storage Options

• HDFS, MapR, Quantcast QFS

• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre

• Amazon S3, EMC Atmos, OpenStack Swift

• GlusterFS, Ceph

• EMC ViPR

Page 38: Modern Big Data Analytics Tools: An Overview

SQL-on-Hadoop

• Pivotal HAWQ

• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger

• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase

• More to come...

Page 39: Modern Big Data Analytics Tools: An Overview

[Diagram: HAWQ architecture]

• HAWQ & HDFS Master Servers: Planning & dispatch

• Network Interconnect

• Segment Servers: Query execution

• Storage: HDFS, HBase …

Page 40: Modern Big Data Analytics Tools: An Overview

[Diagram: HAWQ on HDFS. Labels: Namenode; Datanodes on Rack1 and Rack2 with replication; Master host (Meta Ops); Segment hosts, each running several Segments alongside a Datanode (Read/Write); HAWQ Interconnect]

Page 41: Modern Big Data Analytics Tools: An Overview

HAWQ vs Hive

Lower is Better

Page 42: Modern Big Data Analytics Tools: An Overview

In-Database Analytics

Provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
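A minimal sketch of what "data-parallel" can look like for one such method, ordinary least squares. Each partition of the data computes small partial sums on its own, and only those summaries are merged and solved. The function names, the list-of-partitions layout, and the plain-Python solver are illustrative assumptions, not code from any product named in these slides.

```python
from functools import reduce
from typing import List, Sequence, Tuple

Row = Tuple[Sequence[float], float]            # (feature vector, target)
Sums = Tuple[List[List[float]], List[float]]   # (X^T X, X^T y)


def partial_sums(rows: Sequence[Row], d: int) -> Sums:
    """Per-partition step: accumulate X^T X and X^T y for one slice of rows."""
    xtx = [[0.0] * d for _ in range(d)]
    xty = [0.0] * d
    for x, y in rows:
        for i in range(d):
            xty[i] += x[i] * y
            for j in range(d):
                xtx[i][j] += x[i] * x[j]
    return xtx, xty


def merge(a: Sums, b: Sums) -> Sums:
    """Combine step: partial sums from two partitions simply add up."""
    xtx = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [p + q for p, q in zip(a[1], b[1])]
    return xtx, xty


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(partitions: Sequence[Sequence[Row]], d: int) -> List[float]:
    """Least-squares coefficients from independently summarized partitions."""
    xtx, xty = reduce(merge, (partial_sums(p, d) for p in partitions))
    return solve(xtx, xty)
```

Calling `fit(partitions, d)` with rows split across any number of non-empty partitions gives the same coefficients as a single pass over all rows, because the partial sums are additive. This sketch assumes the combined X^T X matrix is non-singular.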

Page 43: Modern Big Data Analytics Tools: An Overview

MADlib Algorithms

Page 44: Modern Big Data Analytics Tools: An Overview

MADlib Functions

• Linear Regression

• Logistic Regression

• Multinomial Logistic Regression

• K-Means

• Association Rules

• Latent Dirichlet Allocation

• Naïve Bayes

• Elastic Net Regression

• Decision Trees / Random Forest

• Support Vector Machines

• Cox Proportional Hazards Regression

• Descriptive Statistics

• ARIMA

Page 45: Modern Big Data Analytics Tools: An Overview

k-Means Usage

SELECT * FROM madlib.kmeanspp (
    'customers',  -- name of the input table
    'features',   -- name of the feature array column
    2             -- k : number of clusters
);

centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
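For readers unfamiliar with the technique behind the call above, here is a rough single-machine sketch of k-means with k-means++ seeding. It returns three values named after the output columns shown on the slide, on the assumption that `objective_fn` is the sum of squared distances to the assigned centroid and `frac_reassigned` is the share of points that changed cluster in the last iteration. All function names and defaults here are hypothetical; this is not MADlib's implementation.

```python
import random
from typing import List, Sequence, Tuple

Point = Sequence[float]


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def seed_kmeanspp(points: Sequence[Point], k: int, rng: random.Random) -> List[List[float]]:
    """k-means++ seeding: pick each new centroid with probability
    proportional to its squared distance from the nearest chosen one.
    Assumes the input holds at least k distinct points."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        weights = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(list(rng.choices(points, weights=weights, k=1)[0]))
    return centroids


def kmeans(points: Sequence[Point], k: int, max_iter: int = 20,
           seed: int = 0) -> Tuple[List[List[float]], float, float]:
    """Lloyd iterations after k-means++ seeding.
    Returns (centroids, objective_fn, frac_reassigned)."""
    rng = random.Random(seed)
    centroids = seed_kmeanspp(points, k, rng)
    assignment = [-1] * len(points)
    frac_reassigned = 1.0
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        changed = sum(a != b for a, b in zip(assignment, new_assignment))
        frac_reassigned = changed / len(points)
        assignment = new_assignment
        if changed == 0:
            break
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    objective = sum(sq_dist(p, centroids[a]) for p, a in zip(points, assignment))
    return centroids, objective, frac_reassigned
```

A call such as `kmeans(points, 2)` on a list of 2-D tuples plays the role of the SQL call on the slide, with the difference that all rows are pulled into client memory, which is exactly what the in-database approach avoids.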

Page 46: Modern Big Data Analytics Tools: An Overview

Accessing HAWQ Through R

Page 47: Modern Big Data Analytics Tools: An Overview

Pivotal R

• Interface is R client

• Execution is in database

• Parallelism handled by PivotalR

• Supports a portion of R

R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)

Page 48: Modern Big Data Analytics Tools: An Overview
Page 49: Modern Big Data Analytics Tools: An Overview

A wrapper of MADlib

• Linear regression

• Logistic regression

• Elastic Net

• ARIMA

• Table summary

Page 51: Modern Big Data Analytics Tools: An Overview

A wrapper of MADlib

• Linear regression

• Logistic regression

• Elastic Net

• ARIMA

• Table summary

• predict

• Categorical variable: as.factor()

And more ... (SQL wrapper)

• $ [ [[ $<- [<- [[<-

• is.na

• + - * / %% %/% ^

• & | !

• == != > < >= <=

• merge

• by

• db.data.frame

• as.db.data.frame

• preview

• sort

• c mean sum sd var min max length colMeans colSums

• db.connect db.disconnect db.list db.objects db.existsObject delete

• dim names

• content

Page 52: Modern Big Data Analytics Tools: An Overview

In-Database Execution

• All data stays in DB: R objects merely point to DB objects

• All model estimation and heavy lifting done in DB by MADlib

• R → SQL translation done in the R client

• Only strings of SQL and model output transferred across ODBC/DBI
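The pattern in the bullets above can be sketched in a few lines of Python, with the standard-library `sqlite3` module standing in for the database. The `TableRef` class and its methods are hypothetical names chosen for illustration; they are not PivotalR's API. The point is only that the client object holds a table name, translates each operation into a SQL string, and receives back just the small result.

```python
import sqlite3
from typing import List


class TableRef:
    """Client-side handle that points at a database table and holds no rows."""

    def __init__(self, conn: sqlite3.Connection, name: str) -> None:
        self.conn = conn
        self.name = name
        self.sent: List[str] = []  # every SQL string shipped to the database

    def _run(self, sql: str) -> sqlite3.Cursor:
        self.sent.append(sql)
        return self.conn.execute(sql)

    def names(self) -> List[str]:
        """Column names, read from the cursor metadata without fetching rows."""
        cur = self._run(f'SELECT * FROM "{self.name}" LIMIT 0')
        return [d[0] for d in cur.description]

    def mean(self, column: str) -> float:
        """The average is computed inside the database; one number comes back."""
        cur = self._run(f'SELECT AVG("{column}") FROM "{self.name}"')
        return cur.fetchone()[0]


def demo() -> TableRef:
    """Build a small in-memory table and return a handle to it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (assets REAL, interlocks REAL)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                     [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)])
    return TableRef(conn, "t1")
```

After `t = demo()`, a call like `t.mean("assets")` sends one SQL string and returns one value, and `t.sent` shows exactly what crossed the connection. Table and column names are interpolated directly into SQL here for brevity, so this sketch assumes trusted identifiers.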

Page 53: Modern Big Data Analytics Tools: An Overview

Beyond MapReduce with YARN

Page 54: Modern Big Data Analytics Tools: An Overview

Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)

[Diagram: multiple Single App stacks labeled BATCH, INTERACTIVE, and ONLINE, drawn over HDFS]

Page 55: Modern Big Data Analytics Tools: An Overview

MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)

Page 56: Modern Big Data Analytics Tools: An Overview

Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)

HADOOP 1.0

• Pig (data flow), Hive (sql), Others (cascading)

• MapReduce (cluster resource management & data processing)

• HDFS (redundant, reliable storage)

HADOOP 2.0

• MR (batch)

• Tez (execution engine)

• Pig (data flow), Hive (sql), Others (cascading)

• RT Stream, Graph: Storm, Giraph

• Services: HBase

• YARN (cluster resource management)

• HDFS2 (redundant, reliable storage)

Page 57: Modern Big Data Analytics Tools: An Overview

YARN Platform (Image Courtesy Arun Murthy, Hortonworks)

Applications Run Natively IN Hadoop

• BATCH (MapReduce)

• INTERACTIVE (Tez)

• STREAMING (Storm, S4, …)

• GRAPH (Giraph)

• IN-MEMORY (Spark)

• HPC MPI (OpenMPI)

• ONLINE (HBase)

• OTHER (Search) (Weave…)

YARN (Cluster Resource Management)

HDFS2 (Redundant, Reliable Storage)

Page 58: Modern Big Data Analytics Tools: An Overview

YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)

[Diagram: a ResourceManager with its Scheduler, Client2, and a grid of NodeManagers hosting AM 1 with Containers 1.1 to 1.3 and AM2 with Containers 2.1 to 2.4]

Page 59: Modern Big Data Analytics Tools: An Overview

YARN

• Yet Another Resource Negotiator

• Resource Manager

• Node Managers

• Application Masters

• Specific to paradigm, e.g. MR Application master (aka JobTracker)
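A toy model of how the three roles listed above relate to each other. The class names mirror the slide's terms, but everything else (memory-only resources, first-fit placement, the `request` method) is a simplifying assumption for illustration. It is not the YARN API, and a real scheduler also handles queues, locality, and more.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeManager:
    """One worker node: tracks free memory and the containers it hosts."""
    name: str
    free_mb: int
    containers: List[str] = field(default_factory=list)


@dataclass
class ResourceManager:
    """Cluster-wide arbiter: decides which node hosts each container."""
    nodes: List[NodeManager]

    def allocate(self, container_id: str, mb: int) -> Optional[str]:
        """First-fit placement; returns the node name, or None if nothing fits."""
        for node in self.nodes:
            if node.free_mb >= mb:
                node.free_mb -= mb
                node.containers.append(container_id)
                return node.name
        return None


@dataclass
class ApplicationMaster:
    """Per-application coordinator: asks the ResourceManager for containers."""
    app_id: int
    rm: ResourceManager
    placements: Dict[str, str] = field(default_factory=dict)

    def request(self, count: int, mb: int) -> Dict[str, str]:
        """Request `count` containers of `mb` each; record the ones granted."""
        for i in range(1, count + 1):
            container_id = f"container {self.app_id}.{i}"
            node = self.rm.allocate(container_id, mb)
            if node is not None:
                self.placements[container_id] = node
        return self.placements
```

Two `ApplicationMaster` instances sharing one `ResourceManager` compete for the same pool of node capacity, which is the separation of concerns the slide describes: resource management lives in one place, and paradigm-specific logic lives in each application master.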

Page 60: Modern Big Data Analytics Tools: An Overview

Beyond MapReduce

• Apache Giraph - BSP & Graph Processing

• Storm on Yarn - Streaming Computation

• HOYA - HBase on Yarn

• Hamster - MPI on Hadoop

• More to come ...

Page 61: Modern Big Data Analytics Tools: An Overview

Hamster

• Hadoop and MPI on the same cluster

• OpenMPI Runtime on Hadoop YARN

• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System

• Open MPI Provides: Process launching, Communication, I/O forwarding

Page 62: Modern Big Data Analytics Tools: An Overview

GraphLab + Hamster on Hadoop


Page 63: Modern Big Data Analytics Tools: An Overview

About GraphLab

• Graph-based, High-Performance distributed computation framework

• Started by Prof. Carlos Guestrin at CMU in 2009

• Recently founded GraphLab Inc. to commercialize GraphLab.org

Page 64: Modern Big Data Analytics Tools: An Overview

GraphLab Features

• Topic Modeling (e.g. LDA)

• Graph Analytics (PageRank, Triangle counting)

• Clustering (K-Means)

• Collaborative Filtering

• Linear Solvers

• etc...
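As a concrete example of one item in the list above, here is a minimal sequential PageRank by power iteration. It is a sketch of the algorithm only. The adjacency-dict input, the damping default, and the handling of dangling nodes are assumptions made for this example, and nothing here reflects GraphLab's API or its distributed execution.

```python
from typing import Dict, List


def pagerank(links: Dict[str, List[str]], damping: float = 0.85,
             iterations: int = 50) -> Dict[str, float]:
    """Power-iteration PageRank over an adjacency dict {node: [targets]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node starts each round with the teleportation share.
        nxt = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    nxt[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[v] / n
                for t in nodes:
                    nxt[t] += share
        rank = nxt
    return rank
```

For a graph such as `{"a": ["c"], "b": ["c"], "c": ["a"]}`, node `c` collects rank from both `a` and `b`, so it ends up ranked above `b`, which has no inbound links.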

Page 65: Modern Big Data Analytics Tools: An Overview

Graphs Alone are not Enough

• Full Data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving

• MapReduce excels at data wrangling

• OLTP/NoSQL Row-Based stores excel at Serving

• GraphLab should co-exist with other Hadoop frameworks

Page 66: Modern Big Data Analytics Tools: An Overview

Data Platform of the Future ?

Analytic Data Marts

SQL Services

Operational Intelligence

In-Memory Database

Run-Time Applications

Data Staging Platform

Data Mgmt. Services

Stream Ingestion

Streaming Services

Software-Defined Datacenter

New Data-fabrics

In-Memory Grid

...ETC

Page 67: Modern Big Data Analytics Tools: An Overview

Questions?