Hadoop is Happening

Hadoop is Happening May 1, 2014


 

Transcript of Hadoop is Happening

Page 1: Hadoop is Happening

Hadoop is Happening

May 1, 2014

Page 2: Hadoop is Happening

Syncsort Confidential and Proprietary - do not copy or distribute

Agenda

• Hadoop Evolution
• Use Cases
• The Hadoop Ecosystem, from open source to vendor solutions
• Tooling, implementation and skillset challenges
• Real-World Case Studies
• Future of Hadoop
• Q&A

Page 3: Hadoop is Happening

Our Guest – Chida from OpenOsmium

• 20+ years of Enterprise Application Development Experience
• Focused on Big Data & Cloud
• Founder of Big Data Solution Provider – OpenOsmium
• DC Tech Community Organizer of Meetups
  – Google Developer Group, Tech Breakfast, NoVA Hadoop User Group
• Open Source, Big Data and Cloud Advocate
• 703-568-7426, [email protected]

Page 4: Hadoop is Happening

EVOLUTION OF HADOOP

Page 5: Hadoop is Happening

Evolution of Hadoop – Data Volumes are Growing

Page 6: Hadoop is Happening

Evolution of Hadoop – Key Events

[Timeline figure. Years marked: 2000, 2004, 2008, 2010, 2012, 2013, Next?]

Events shown on the timeline:
• Search Engine Problem @ Google
• 3 White Papers: GFS, MapReduce, BigTable
• MapReduce: Simplified Data Processing on Large Clusters
• Yahoo!
• HDFS, MapReduce, HBase
• MapR
• Hortonworks
• Hadoop 2.0
• Cloudera

Page 7: Hadoop is Happening


Why Hadoop As a Data Management Platform?

• The Reliability of a Mainframe,
• The Massive Performance at Scale of an MPP appliance,
• The Storage Capacity of a SAN,
• All at a Disruptively Low Price Point

Page 8: Hadoop is Happening

The Economics of Data

Cost of managing 1TB of data

Platform     Cost per 1TB
Mainframe    $20,000 – $100,000
EDW          $15,000 – $80,000
Hadoop       $250 – $2,000

But there’s more…
• Scalability
• Performance
• Reliability
• Agility
• Skills Supply

Page 9: Hadoop is Happening


Hadoop - The Big Picture

• Computation (logical): unified computation provided by the MapReduce distributed computing framework
• Storage (logical): unified storage provided by a distributed file system called HDFS
• Physical: commodity hardware; the hardware contains a bunch of disks and cores
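To make the computation layer concrete, here is a minimal single-process sketch of the MapReduce pattern (map, shuffle, reduce) applied to word counting. It is plain Python, not the Hadoop API: the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are illustrative assumptions, and a real cluster would run the map and reduce steps in parallel over blocks stored in HDFS.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key, as the framework would."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reducer: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence on an in-memory input."""
    emitted = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


if __name__ == "__main__":
    sample = ["hadoop stores data in hdfs", "hadoop processes data with mapreduce"]
    print(word_count(sample))
```

The point of the split is that each mapper and each reducer only ever sees its own slice of the data, which is what lets the framework spread the work across commodity machines.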

Page 10: Hadoop is Happening


MapReduce – Football Stadium Analogy

Page 11: Hadoop is Happening

Yesterday’s Architecture

Page 12: Hadoop is Happening

Tomorrow’s Data Architecture

Page 13: Hadoop is Happening

HADOOP USE CASES

Page 14: Hadoop is Happening

Hadoop Use Cases


Data Lake

Offload Mainframe Data & Batch Workloads

Machine Data

Cyber Security

Fraud Detection

Offload ELT from Data Warehouse

Clickstream / Weblogs, EMR

Social Media Data

Geospatial Analysis

Video and Audio Analytics

Real-Time Processing

Predictive Analytics

Unstructured Data

Active Archive

Multi-media

Leverage “Dark Data”

Sentiment Analysis

Enterprise Data Hub

Page 15: Hadoop is Happening


Hadoop Use Cases

A Roadmap for Hadoop Success

– Offload batch & ELT workloads from data warehouse and mainframe systems into Hadoop
– Develop an active archive, shed light on dark data
– Build your Enterprise Data Hub (Data Lake!)
– Leverage new data sources
– Extend BI with data discovery & exploration
– Deliver next-generation analytics

Page 16: Hadoop is Happening

Sample Use Case: Offload

Phase I: Identify
• Identify data & workloads most suitable for offload
• Focus on those that will deliver maximum savings & performance

Phase II: Offload
• Access and move virtually any data to Hadoop with one tool
• Easily replicate existing workloads in Hadoop using a graphical user interface

Phase III: Optimize & Secure
• Deploy and optimize the new environment
• Manage & secure all your data with business class tools

Page 17: Hadoop is Happening

Phase 2: Deliver ‘Next-generation’ Applications

Advanced – ‘Next-gen’ – Applications for Hadoop

– Semi-structured data analytics
  • Clickstream/Weblog, Electronic Medical Records
– Unstructured data analytics
  • Video, audio, documents, text, social
– Predictive modeling
– Geospatial analysis
– Real-Time Processing

Page 18: Hadoop is Happening

Use Cases Across Industries

Use cases by vertical (slide columns: Refine, Explore, Enrich)

Retail & Web
• Log Analysis/Site Optimization
• Loyalty Program Optimization
• Brand and Sentiment Analysis
• Market basket analysis
• Dynamic Pricing
• Session & Content Optimization
• Product recommendation

Telco
• Customer profiling
• Equipment failure prediction
• Location based advertising

Government
• Threat Identification
• Person of Interest Discovery
• Mission work

Finance
• Risk Modeling & Fraud Identification
• Trade Performance Analytics
• Surveillance and Fraud Detection
• Customer Risk Analysis
• Real-time upsell, cross sales marketing offers

Energy
• Smart Grid: Production Optimization
• Grid Failure Prevention
• Smart Meters
• Individual Power Grid

Manufacturing
• Supply Chain Optimization
• Customer Churn Analysis
• Dynamic Delivery
• Replacement parts

Healthcare
• Electronic Medical Records (EMPI)
• Clinical decision support
• Clinical Trials Analysis
• Insurance Premium Determination

Page 19: Hadoop is Happening

IMPLEMENTATION & SKILLSET CHALLENGES

Page 20: Hadoop is Happening

Overview of Hadoop Challenges

Hardware??

Skills??

Training??

Rapid change of Hadoop Ecosystem?

Page 21: Hadoop is Happening

Example 1 - ETL in Hadoop

COLLECT (HARD)
• FS Shell Put Command
• Flume
• Sqoop

PROCESS (HARDER)
• Operations: Sort, Join, Aggregate, Copy, Merge
• Tools: Pig, HiveQL, Java

DISTRIBUTE (HARD)
• Sqoop
• FS Shell Get Command
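One way to see why the processing stage is the harder one: even a simple join plus aggregate plus sort has to be expressed as map, shuffle and reduce steps when written by hand. The sketch below imitates a reduce-side join in plain Python. The data sets `CUSTOMERS` and `ORDERS` and all function names are hypothetical, nothing here is a Hadoop, Pig or Hive API, and it assumes an inner join where every order has a matching customer.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical sample inputs standing in for two collected data sets.
CUSTOMERS = [("c1", "Acme"), ("c2", "Globex")]
ORDERS = [("c1", 120.0), ("c2", 75.5), ("c1", 30.0)]

Tagged = Tuple[str, object]


def map_tagged() -> List[Tuple[str, Tagged]]:
    """Map: tag each record with its source so the reducer can tell them apart."""
    tagged: List[Tuple[str, Tagged]] = [(cid, ("customer", name)) for cid, name in CUSTOMERS]
    tagged += [(cid, ("order", amount)) for cid, amount in ORDERS]
    return tagged


def shuffle(pairs: List[Tuple[str, Tagged]]) -> Dict[str, List[Tagged]]:
    """Shuffle: group tagged records by the join key (customer id)."""
    groups: Dict[str, List[Tagged]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_join(records: List[Tagged]) -> Tuple[str, float]:
    """Reduce: join the customer name to its orders and aggregate the amounts."""
    name = next(value for tag, value in records if tag == "customer")
    total = sum(value for tag, value in records if tag == "order")
    return str(name), float(total)


def run_job() -> List[Tuple[str, float]]:
    """Join, aggregate, then sort the result by total, largest first."""
    groups = shuffle(map_tagged())
    joined = [reduce_join(records) for records in groups.values()]
    return sorted(joined, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    print(run_job())
```

Higher-level tools such as Pig or HiveQL hide this plumbing behind a script or a query, at the cost of one more language for the team to learn.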

Page 22: Hadoop is Happening


Example 2 – Mainframe Data Ingestion

Perception: Just Call the Mainframe Guy…

Images: http://monkeestv.tripod.com/BatMonkee/

Page 23: Hadoop is Happening


Example 2 – Mainframe Data Ingestion

Reality: Every Change = Time, Cost

The slide repeats the following cycle three times, each round followed by "Call MF Guy":
• SMS Compression
• DB Tables, Flat Files
• Filtering, Reformatting
• Copy, Sort, Join, Aggregation
• EBCDIC to ASCII
• COBOL copybooks
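As a rough illustration of what the EBCDIC-to-ASCII and copybook steps involve, the sketch below decodes one fixed-width mainframe record in Python. Everything specific is an assumption for the example: the two-field layout (a 10-byte `PIC X(10)` name followed by a 4-byte `PIC S9(5)V99 COMP-3` balance) is a made-up copybook, and code page `cp037` (EBCDIC US/Canada) may not be the one a given mainframe uses.

```python
from decimal import Decimal
from typing import Tuple


def unpack_comp3(raw: bytes, scale: int) -> Decimal:
    """Decode a packed-decimal (COMP-3) field: two digits per byte, sign in the last nibble."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()  # 0xD means negative; 0xC or 0xF means positive
    value = int("".join(str(digit) for digit in nibbles))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)  # apply the implied decimal point (V99)


def parse_record(record: bytes) -> Tuple[str, Decimal]:
    """Split one 14-byte record per the assumed layout and convert each field."""
    name = record[:10].decode("cp037").rstrip()  # EBCDIC text to a Python string
    balance = unpack_comp3(record[10:14], scale=2)
    return name, balance


# Hypothetical record: the name in EBCDIC, then +123.45 as packed decimal.
SAMPLE_RECORD = "JANE DOE".ljust(10).encode("cp037") + bytes([0x00, 0x12, 0x34, 0x5C])

if __name__ == "__main__":
    print(parse_record(SAMPLE_RECORD))
```

Note that a blanket character-set conversion of the whole record would corrupt the packed field, which is why the copybook layout has to drive the conversion field by field.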

Image: bottletales.com

Page 24: Hadoop is Happening


Big Data Team

• Senior Linux/Unix Admin, Hadoop Administrators, Infrastructure Engineers
• Java Developers, Hadoop Developers, Object Oriented Developers
• Data Analysts, Functional Users, Hadoop Analytics Users
• Project Managers!, Chief Data Officer
• Executive Management

Page 25: Hadoop is Happening


Enterprise Adoption Approach

• Agile
• Ideal Use Case for the company
• Proof-of-concept or Pilot
• Tech Heavy
• Aware of Available Options – Many..
• Work with Solution Architects
• Infrastructure Analysis
• Security Options
• Testing.. Testing..
• Integrating with current Stack
• Cost.. Cost..
• Promises Vs Reality

Page 26: Hadoop is Happening

THE HADOOP ECOSYSTEMS – FROM OPEN SOURCE TO VENDOR TOOLS

Page 27: Hadoop is Happening

Hadoop Distributions

Page 28: Hadoop is Happening

Vendor Landscape

Distributions / Platforms

Data Integration/ETL

Search

Document Store

Database / Data Warehouse

Social Operational

XML Database

Graphs

Page 29: Hadoop is Happening


REAL-WORLD CASE STUDIES

Page 30: Hadoop is Happening

Understanding Mainframe Data at Major US Bank

Customer hit a wall after months of manual effort migrating Mainframe data

Before: Manual Effort (86-page copybook: ? weeks)
• Difficult to find data errors. No Mainframe application logic that matches Copybook
• Large and complex Copybooks
• Depends on Mainframe team to provide data
• Very manual-intensive; inadequate documentation
• Not scalable. Only a few Java + Mainframe experts could do the work

After: DMX-h + CDH (86-page copybook: 4 hrs)
• Easy to validate Copybooks and find data errors
• Ability to pull data directly from Mainframe without relying on Mainframe team
• No coding. No scripting. Easier to document, maintain & reuse
• Enables developers with a broader set of skills to build complex migration jobs

Page 31: Hadoop is Happening


Social Security Administration

The Challenge:
– The SSA has an expensive problem with fraudulent claims for benefits, and they need more and better data to prevent and punish that fraud. The Office of the Inspector General for the SSA reports that:
– “Nationally, in Fiscal Year 2011, there were more than 103,000 allegations of Social Security fraud, with more than 7,000 criminal investigations resulting in 1,374 convictions and more than $410 million in recoveries, fines, restitution, judgments, settlements, and savings.”

Why Hadoop?
– Data Processing Time – the job took 30 hrs on the MF; the PoC cluster completed it in 2 hrs
– Accuracy – obituary data gathered from social media is likely more accurate than the current death file

Page 32: Hadoop is Happening


Optimizing the EDW at Large Teradata Customer

• Offload ELT processing from Teradata into CDH using DMX-h
• Implement flexible architecture for staging and change data capture
• Ability to pull data directly from Mainframe
• No coding. Easier to maintain & reuse
• Enable developers with a broader set of skills to build complex ETL workflows

[Chart: Elapsed Time (min). HiveQL 360 min, DMX-h 15 min]

[Chart: Development Effort (Weeks). HiveQL 12 man-weeks, DMX-h 4 man-weeks]

Impact on Loans Application Project:
• Cut development time to 1/3 (from 12 to 4 man-weeks)
• Reduced complexity. From 140 HiveQL scripts to 12 DMX-h graphical jobs
• Eliminated need for Java user defined functions
• 24x faster! (360 min down to 15 min)

Page 33: Hadoop is Happening


Log File Processing

Page 34: Hadoop is Happening

Video - Placemeter


http://vimeo.com/69091237

Page 35: Hadoop is Happening


What to do next

No one is impartial, but it’s still worth talking to:
– Vendors
– Industry Analysts
– Industry Peers
– People at Meetups
– Practitioners like Chida

Page 36: Hadoop is Happening

Why Hadoop As a Data Management Platform?

• The Reliability of a Mainframe,
• The Massive Performance at Scale of an MPP appliance,
• The Storage Capacity of a SAN,
• All at a Disruptively Low Price Point

Page 37: Hadoop is Happening

Big Data – Projects
