Introduction to Hadoop Capabilities, Accelerators and Solutions.

30
Introduction to Hadoop Capabilities, Accelerators and Solutions

Transcript of Introduction to Hadoop Capabilities, Accelerators and Solutions.

Page 1: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Introduction to Hadoop

Capabilities, Accelerators and Solutions

Page 2: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Big Data

2

Google processed over 400 PB of data on datacenters composed of thousands of machines in September 2007 alone ***

Today, every organization has it’s own big data problem and most are using Hadoop to solve it.

*** MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, vol. 51, no. 1 (2008), pp. 107-113, Jeffrey Dean and Sanjay Ghemawat

Page 3: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Where is Big Data?

3

Big Data Has Reached Every Market Sector

Source – McKinsey & Company report. Big data: The next frontier for innovation, competition and productivity. May 2011.

Page 4: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Big Data Value Creation Opportunities

Financial Services• Detect fraud • Model and manage risk• Improve debt recovery rates • Personalize banking/insurance Products

Healthcare• Optimal treatment pathways• Remote patient monitoring• Predictive modeling for new drugs• Personalized Medicine

Retail• In-store behavior analysis• Cross selling • Optimize pricing, placement, design• Optimize inventory and distribution

Web/Social/Mobile• Location-based marketing• Social segmentation• Sentiment analysis• Price comparison Services

Manufacturing• Design to value• Crowd-sourcing “Digital factory” for lean manufacturing• Improve service via product sensor Data

Government• Reduce fraud • Segment populations, customize action• Support open data initiatives• Automate decision Making

4

Page 5: Introduction to Hadoop Capabilities, Accelerators and Solutions.

What is Hadoop?

Hadoop is an open-source project overseen by the Apache Software Foundation

Originally based on papers published by Google in 2003 and 2004 Hadoop is an ecosystem, not a single product Hadoop committers work at several different organizations – Including Facebook, Yahoo!, Twitter, Cloudera, Hortonworks

5

Page 6: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop - Inspiration

6

Google calls it: Hadoop equivalent

GFS HDFS

MapReduce Hadoop MapReduce

Sawzall Hive, Pig

BigTable HBase

Chubby ZooKeeper

Pregel Giraph

You Say, “tomato…”

Google was awarded a patent for “map reduce – a system for large scale data processing” in 2010, but blessed Apache Hadoop by granting a license.

Page 7: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop Timeline

• Started for Nutch at Yahoo! by Doug Cutting in early 2006

• Hadoop 2.x, released in 2012, is basis for all current, stable Hadoop distributions Apache Hadoop 2.0.xx CDH4.* HDP2.*

7

Page 8: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Typical Data Strategy

8

  ETL Tools DW / Marts BI Analytics

Commercial

 Informatica  Teradata  Microstrategy  SAS

 Oracle Data Integrator  Oracle  OBIEE  TIBCO Spotfire IBM Datastage  DB2, Netezza  Cognos  SPSS Microsoft SSIS  SQL server  Microsoft SSRS  

EMC Greenplum  Open source  Talend  mySQL  Pentaho , Jaspersoft  R, RapidMiner

Page 9: Introduction to Hadoop Capabilities, Accelerators and Solutions.

How Hadoop fits in?

9

Hadoop can complement the existing DW environment as well replace some of the components in a traditional data strategy.

Page 10: Introduction to Hadoop Capabilities, Accelerators and Solutions.

How Hadoop fits in?

10

• Storage • HDFS – It’s a file system, not a DBMS• HBase - Columnar storage that serves low-latency read / write request

• Extract / Load• Source / Target is RDBMS - Sqoop, hiho• Stream processing - Flume, Scribe, Chukwa, S4, Storm

• Transformation• Map-reduce (Java or any other language), Pig, Hive, Oozie etc.• Talend and Informatica have built products to abstract complexity of map-reduce

• Analytics• RHadoop, Mahout

• BI – All existing players are coming up with Hadoop connectors

Page 11: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop Ecosystem

11

Page 12: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop Ecosystem Continued…

12

Page 13: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Map-reduce – Programming model

13

Single map task and a single reduce task -

Multiple map tasks with a single reduce task -

Page 14: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Map-reduce – Programming model

14

Page 15: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop Map Reduce

What happens during a Map-reduce job’s lifetime? Clients submit MapReduce jobs to the JobTracker, a daemon that resides on

“master node” The JobTracker assigns Map and Reduce tasks to other nodes on the cluster These nodes each run a software daemon known as the TaskTracker The TaskTracker is responsible for actually instantiating the Map or Reduce task,

and reporting progress back to the JobTracker

Terminology – A job is a ‘full program’ – a complete execution of Mappers and Reducers over a

dataset A task is the execution of a single Mapper or Reducer over a slice of data A task attempt is a particular instance of an attempt to execute a task

There will be at least as many task attempts as there are tasks If a task attempt fails, another will be started by the JobTracker Speculative execution can also result in more task attempts than completed tasks

15

Page 16: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Pig Latin

• Data-flow oriented language• High-level language for routing data, allows easy integration of Java

for complex tasks• Data-types include sets, associative arrays, tuples

16

• Client-side utility• Pig interpreter converts the pig-

script to Java map-reduce jobs and submits it to JobTracker

• No additional installs needed on Hadoop Cluster

• Pig performance ~ 1.4x Java MapReduce jobs, but lines of code needed ~ 1/10th

• Developed at Yahoo!

Page 17: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hive

• SQL-based data warehousing app• Feature set is similar to Pig• Language is more strictly SQL-esque

• Supports SELECT, JOIN, GROUP BY, etc.

• Uses “Schema on Read” philosophy

• Features for analyzing very large data sets• Partition columns• Sampling• Buckets

• Requires install of metastore on Hadoop cluster

• Developed at Facebook

17

Page 18: Introduction to Hadoop Capabilities, Accelerators and Solutions.

HBase

Distributed, versioned, column-oriented store on top of HDFS Goal - To store tables with billion rows and million columns Provides an option of “low-latency” (OLTP) reads/writes along with

support for batch-processing model of map-reduce HBase cluster consists of a single “HBase Master” and multiple

“RegionServers” Facebook uses HBase to drive its messaging infrastructure Stats - Chat service supports over 300 million users who send

over 120 billion messages per month Nulls are not stored by design and typical table storage looks like –

18

Row-key Column-family Column Timestamp Value

1 CF Name Ts1 Vijay

1 CF Address Ts1 Mumbai

1 CF Address Ts2 Goa

Page 19: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Sqoop

RDBMS to Hadoop Command-line tool to import any JDBC supported database into

Hadoop And also export data from Hadoop to any database

Generates map-only jobs to connect to database and read/write records

DB specific connectors contributed by vendors – Oraoop for Oracle by Quest software Teradata connector from Teradata Netezza connector from IBM

Developed at Cloudera

Oracle has come up with “Oracle Loader for Hadoop” and claim that it is optimized for “Oracle Database 11g”

19

Page 20: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Informatica HParser

Graphical interface to design data transformation jobs

Converts designed DT jobs to Hadoop Map-reduce jobs

Out-of-the-box Hadoop parsing support for industry-standard formats, including Bloomberg, SWIFT, NACHA, HIPAA, HL7, ACORD, EDI X12, and EDIFACT etc.

20

Page 21: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Flume

Flume is a distributed, reliable, available service for efficiently moving large amounts of data as it is produced

Developed at Cloudera

21

Page 22: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Machine Learning

• Apache Mahout • Scalable machine learning library most of the algorithms implemented on top

Apache Hadoop using map/reduce paradigm• Supported Algorithms –

• Recommendation mining - takes users’ behavior and find items said specified user might like.

• Clustering - takes e.g. text documents and groups them based on related document topics.

• Classification - learns from existing categorized documents what specific category documents look like and is able to assign unlabeled documents to the appropriate category.

• Frequent item set mining - takes a set of item groups (e.g. terms in a query session, shopping cart content) and identifies, which individual items typically appear together.

• RHadoop (from Revolution Analytics) and RHIPE (from Purdue University) allows executing R programs over Apache Hadoop

22

Page 23: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Graph Implementations

Graph implementations follow the bulk-synchronous parallel model, popularized by Google’s Pregel –

1) Giraph (submitted to Apache Incubator)

2) GoldernOrb

3) Apache Hama

4) More – http://www.quora.com/What-are-some-good-MapReduce-implementations-for-graphs

23

Page 24: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop Distributions

24

Page 25: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop Variants / Flavors / Distributions

Apache Hadoop – Completely open and up-to-date version of Hadoop

Cloudera’s distribution including Hadoop (CDH) Open source Hadoop tools packaged with “closed” management suite (SCM) Profits by providing support (Cost-model is per node in Cluster) & Trainings

Hortonworks Data Platform Spun-off in 2011 from Yahoo!’s core Hadoop team Open source Hadoop tools packaged with “open” management suite (Apache Ambari) Profits by providing support (Cost-model is per node in Cluster) &Trainings Signed a deal with Microsoft to develop Hadoop for Windows

MapR Claims to have developed faster version of HDFS MapR’s distribution powers EMC’s Greenplum products

Oracle Big Data Appliance & IBM BigInsights Powered by CDH

More may exist……..

25

Page 26: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop - Key Contributors

26

Page 27: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop - Key Contributors

27

Page 28: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Hadoop - Key Contributors

28

Page 29: Introduction to Hadoop Capabilities, Accelerators and Solutions.

References

Hadoop: The Definitive Guide by Tom White (Cloudera Inc.)

Hadoop in Action by Chuck Lam ()

HBase: The Definitive Guide by Lars George (Cloudera Inc.)

Mahout in Action by Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman

Programming Pig by Alan Gates (Hortonworks)

Page 30: Introduction to Hadoop Capabilities, Accelerators and Solutions.

Thank You

.