Modern Big Data Analytics Tools: An Overview
Milind Bhandarkar Chief Scientist, Pivotal (Twitter : @techmilind)
(All Images Courtesy Flickr, Creative Commons Licensed)
About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
Hadoop Midwife :-)
Once upon a time, in a land far far away…
Fast forward 15 years..
What Happened ?
And, then…
[Diagram: HDFS and MapReduce. Legend: ASF Projects, FLOSS Projects, Pivotal Products.]
In a blink of an eye…
[Diagram: the Hadoop ecosystem. Legend: ASF Projects, FLOSS Projects, Pivotal Products.
Components shown: HDFS, YARN, MapReduce, Tez, Pig, Hive, Sqoop, Flume, Zookeeper, Oozie, Command Center, GemFire XD, Giraph, Hue, SolrCloud, Phoenix, HBase, Crunch, Mahout, Spark, Shark, Streaming, MLlib, GraphX, Impala, HAWQ, SpringXD, MADlib, Hamster, PivotalR.
Group labels: Coordination and workflow management; Hadoop UI.]
History (2003-2010)
Google Papers + Yahoo! Search =
W-1-W
• WebMap : Graph processing for WWW
• Dreadnaught: Infrastructure for WebMap
• W-1-W: WebMap In One Week
• Juggernaut: Infrastructure for W-1-W
• JFS, JMR, Condor: Abandoned for Hadoop
Lucene, Nutch
Kryptonite
Major Step Backwards?
“MapReduce is the Revenge of System Programmers on Database community.” - Anonymous at XLDB, Stanford, 2010
O’Reilly Books 2013
Who Uses Hadoop? (From Hadoop Summit 2010)
Big Data Landscape - July 2012 http://www.forbes.com/sites/davefeinleib/2012/06/19/the-big-data-landscape/
Hadoop Ecosystem (Jan 2013) http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
Game Changing Hadoop Economics
[Chart: Big Data Platform Price/TB, 2008-2013. Y-axis from $0 to $80,000. Series: Big Data DB, Hadoop.]
Hadoop Maturity
• ETL Offload: Accommodate massive data growth with existing EDW investments
• Data Lakes: Unify Unstructured and Structured Data Access
• Big Data Apps: Build analytic-led applications impacting top line revenue
• Data-Driven Enterprise: App Dev and Operational Management on HDFS Data Architecture
The Big Gap
Average Enterprises:
• 70% of data generated by customers
• 80% of data being stored
• 3% being prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
Storage Options
• HDFS, MapR, Quantcast QFS
• EMC Isilon, NetApp, IBM GPFS, PanFS, PVFS, Lustre
• Amazon S3, EMC Atmos, OpenStack Swift
• GlusterFS, Ceph
• EMC ViPR
SQL-on-Hadoop
• Pivotal HAWQ
• Cloudera Impala, Facebook Presto, Apache Drill, Cascading Lingual, Optiq, Hortonworks Stinger
• Hadapt, Jethrodata, IBM BigSQL, Microsoft PolyBase
• More to come...
[Diagram: HAWQ & HDFS. Master Servers (planning & dispatch) and Segment Servers (query execution) joined by the Network Interconnect. Storage: HDFS, HBase …]
[Diagram: HAWQ deployment on HDFS. Labels: Master host, Meta Ops, Namenode, replication, Rack1, Rack2, Datanode, Read/Write, HAWQ Interconnect, Segment host, Segment.]
HAWQ vs Hive
Lower is Better
In-Database Analytics
MADlib provides data-parallel implementations of mathematical, statistical and machine-learning methods for structured and unstructured data.
MADlib Algorithms
MADlib Functions
• Linear Regression
• Logistic Regression
• Multinomial Logistic Regression
• K-Means
• Association Rules
• Latent Dirichlet Allocation
• Naïve Bayes
• Elastic Net Regression
• Decision Trees / Random Forest
• Support Vector Machines
• Cox Proportional Hazards Regression
• Descriptive Statistics
• ARIMA
k-Means Usage
SELECT * FROM madlib.kmeanspp (
    'customers',  -- name of the input table
    'features',   -- name of the feature array column
    2             -- k : number of clusters
);

centroids | objective_fn | frac_reassigned | …
------------------------------------------------------------------------+------------------+-----------------+ …
{{68.01668579784,48.9667382972952},{28.1452167573446,84.5992507653263}} | 586729.010675982 | 0.001 | …
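For readers who want to see what a call like this computes, here is a minimal single-machine sketch in Python of k-means with k-means++ seeding. It is not MADlib's implementation, which runs data-parallel inside the database. The function names and the toy data are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeanspp_seed(points, k, rng):
    """k-means++ seeding: each new seed is drawn with probability
    proportional to its squared distance from the nearest existing seed."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(squared_distance(p, c) for c in centroids) for p in points]
        if sum(weights) == 0:
            # Every point coincides with a seed already; fall back to uniform.
            centroids.append(rng.choice(points))
        else:
            centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids


def kmeans(points, k, iterations=20, seed=0):
    """Lloyd iterations after k-means++ seeding.
    Returns (centroids, objective), where objective is the sum of squared
    distances from each point to its nearest centroid."""
    rng = random.Random(seed)
    centroids = kmeanspp_seed(points, k, rng)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
                )
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    objective = sum(min(squared_distance(p, c) for c in centroids) for p in points)
    return centroids, objective


# Hypothetical toy data: two well-separated groups of four points each.
features = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, objective_fn = kmeans(features, k=2)
```

The two return values correspond loosely to the centroids and objective_fn columns in the output above; frac_reassigned is not modelled in this sketch.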
Accessing HAWQ Through R
Pivotal R
• Interface is R client
• Execution is in database
• Parallelism handled by PivotalR
• Supports a portion of R
R> x = db.data.frame("t1")
R> l = madlib.lm(interlocks ~ assets + nation, data = x)
A wrapper of MADlib
• Linear regression
• Logistic regression
• Elastic Net
• ARIMA
• Table summary
• Categorical variable: as.factor()
• $ [ [[ $<- [<- [[<-
• is.na
• + - * / %% %/% ^
• & | !
• == != > < >= <=
• merge
• by
• db.data.frame
• as.db.data.frame
• preview
• sort
• c mean sum sd var min max length colMeans colSums
• db.connect db.disconnect db.list db.objects db.existsObject delete
• dim names
• content
And more ... (SQL wrapper)
• predict
In-Database Execution
• All data stays in DB: R objects merely point to DB objects
• All model estimation and heavy lifting done in DB by MADlib
• R → SQL translation done in the R client
• Only strings of SQL and model output transferred across ODBC/DBI
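The pattern in these bullets can be sketched in a few lines of Python, with the standard-library sqlite3 module standing in for HAWQ. This is an illustration of the idea only, not PivotalR's API: the TableRef class, its methods, and the sample rows are hypothetical, and the single-predictor least-squares formula is a stand-in for the models MADlib fits.

```python
import sqlite3


class TableRef:
    """Client-side handle that names a database table but holds none of its rows."""

    def __init__(self, conn, name):
        self.conn = conn
        self.name = name

    def slope_sql(self, y, x):
        # Translation happens on the client: the result is only a SQL string.
        # Closed-form least-squares slope for one predictor, written as aggregates.
        return (
            f'SELECT (COUNT(*) * SUM("{x}" * "{y}") - SUM("{x}") * SUM("{y}")) '
            f'/ (COUNT(*) * SUM("{x}" * "{x}") - SUM("{x}") * SUM("{x}")) '
            f'FROM "{self.name}"'
        )

    def slope(self, y, x):
        # The SQL text goes to the database; a single number comes back.
        return self.conn.execute(self.slope_sql(y, x)).fetchone()[0]


# Hypothetical table t1 with made-up rows, created in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (assets REAL, interlocks REAL)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)", [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
)

t1 = TableRef(conn, "t1")
sql_text = t1.slope_sql("interlocks", "assets")
slope = t1.slope("interlocks", "assets")
```

The client never fetches the rows of t1: the aggregates are evaluated by the database engine, so the traffic stays at one SQL string out and one value back regardless of table size.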
Beyond MapReduce with YARN
[Diagram: separate single-application clusters (BATCH, INTERACTIVE, ONLINE) on HDFS.]
Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)
MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
• Pig (data flow), Hive (sql), Others (cascading)
• MapReduce (cluster resource management & data processing)
• HDFS (redundant, reliable storage)
HADOOP 2.0
• Pig (data flow), Hive (sql), Others (cascading)
• Tez (execution engine)
• MR (batch)
• RT Stream, Graph (Storm, Giraph)
• Services (HBase)
• YARN (cluster resource management)
• HDFS2 (redundant, reliable storage)
Applications Run Natively IN Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
• YARN (Cluster Resource Management)
• HDFS2 (Redundant, Reliable Storage)
YARN Platform (Image Courtesy Arun Murthy, Hortonworks)
[Diagram: a ResourceManager with its Scheduler, Client2, and a grid of NodeManagers hosting AM 1 with Containers 1.1 to 1.3 and AM2 with Containers 2.1 to 2.4.]
YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)
YARN
• Yet Another Resource Negotiator
• Resource Manager
• Node Managers
• Application Masters
• Specific to paradigm, e.g. MR Application master (aka JobTracker)
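The division of labour between these three roles can be illustrated with a toy model in Python. It is a deliberately simplified sketch, not the YARN API: the class and method names, the slot counts, and the "most free slots first" placement rule are all assumptions made for illustration.

```python
class NodeManager:
    """One worker machine offering a fixed number of container slots (toy model)."""

    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots


class ResourceManager:
    """Cluster-wide scheduler: it hands out containers and knows nothing about tasks."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, count):
        """Grant up to `count` containers; returns a list of (app_id, node name)."""
        granted = []
        for _ in range(count):
            # Assumed placement rule: pick the node with the most free slots.
            node = max(self.nodes, key=lambda n: n.free_slots)
            if node.free_slots == 0:
                break  # cluster is full; the application gets fewer containers
            node.free_slots -= 1
            granted.append((app_id, node.name))
        return granted


class ApplicationMaster:
    """Per-application, paradigm-specific logic: decides which task runs where."""

    def __init__(self, app_id, tasks):
        self.app_id = app_id
        self.tasks = tasks

    def run(self, rm):
        containers = rm.allocate(self.app_id, len(self.tasks))
        # Tasks beyond the granted containers stay unplaced in this sketch.
        return {task: node for task, (_, node) in zip(self.tasks, containers)}


rm = ResourceManager([NodeManager("n1", 2), NodeManager("n2", 2)])
am1 = ApplicationMaster("app-1", ["map-0", "map-1", "map-2"])
am2 = ApplicationMaster("app-2", ["rank-0", "rank-1"])
placement1 = am1.run(rm)
placement2 = am2.run(rm)
```

The point of the split is that the ResourceManager stays generic, so a MapReduce application master and, say, an MPI one can share the same cluster.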
Beyond MapReduce
• Apache Giraph - BSP & Graph Processing
• Storm on Yarn - Streaming Computation
• HOYA - HBase on Yarn
• Hamster - MPI on Hadoop
• More to come ...
Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System
• Open MPI Provides: Process launching, Communication, I/O forwarding
GraphLab + Hamster on Hadoop
About GraphLab
• Graph-based, High-Performance distributed computation framework
• Started by Prof. Carlos Guestrin at CMU in 2009
• GraphLab Inc. recently founded to commercialize Graphlab.org
GraphLab Features
• Topic Modeling (e.g. LDA)
• Graph Analytics (Pagerank, Triangle counting)
• Clustering (K-Means)
• Collaborative Filtering
• Linear Solvers
• etc...
Only Graphs are not Enough
• Full Data processing workflow requires ETL/Postprocessing, Visualization, Data Wrangling, Serving
• MapReduce excels at data wrangling
• OLTP/NoSQL Row-Based stores excel at Serving
• GraphLab should co-exist with other Hadoop frameworks
Data Platform of the Future ?
• Analytic Data Marts
• SQL Services
• Operational Intelligence
• In-Memory Database
• Run-Time Applications
• Data Staging Platform
• Data Mgmt. Services
• Stream Ingestion
• Streaming Services
• Software-Defined Datacenter
• New Data-fabrics
• In-Memory Grid
• ...ETC
Questions?