Hadoop is Happening
May 1, 2014
Syncsort Confidential and Proprietary - do not copy or distribute
Agenda
• Hadoop Evolution
• Use Cases
• The Hadoop Ecosystem, from open source to vendor solutions
• Tooling, implementation and skillset challenges
• Real-World Case Studies
• Future of Hadoop
• Q&A
Our Guest – Chida from OpenOsmium
• 20+ years of Enterprise Application Development Experience
• Focused on Big Data & Cloud
• Founder of Big Data Solution Provider – OpenOsmium
• DC Tech Community Organizer of Meetups – Google Developer Group, Tech Breakfast, NoVA Hadoop User Group
• Open Source, Big Data and Cloud Advocate
• 703-568-7426, [email protected]
EVOLUTION OF HADOOP
Evolution of Hadoop – Data Volumes are Growing
Evolution of Hadoop – Key Events
Timeline markers: 2000, 2004, 2008, 2010, 2012, 2013, Next?
• Search Engine Problem @ Google
• 3 White Papers: GFS, MapReduce, BigTable
• MapReduce: Simplified Data Processing on Large Clusters
• Yahoo!: HDFS, MapReduce, HBase
• Cloudera
• MapR
• Hortonworks
• Hadoop 2.0
Why Hadoop As a Data Management Platform?
• The Reliability of a Mainframe
• The Massive Performance at Scale of an MPP appliance
• The Storage Capacity of a SAN
• All at a Disruptively Low Price Point
The Economics of Data
Cost of managing 1TB of data:
– Mainframe: $20,000 – $100,000
– EDW: $15,000 – $80,000
– Hadoop: $250 – $2,000
Scalability
Performance
Reliability
Agility
Skills Supply
But there’s more…
Hadoop - The Big Picture
Logical layer
• Computation: unified computation provided by the MapReduce distributed computing framework
• Storage: unified storage provided by a distributed file system called HDFS
Physical layer
• Commodity hardware: the hardware contains a bunch of disks and cores
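To make the computation layer concrete, here is a minimal single-process sketch of the MapReduce pattern (map, shuffle, reduce) applied to word counting. It is an illustration only: it does not use Hadoop's APIs, and the function names are assumptions chosen for this example. On a real cluster the framework runs the map and reduce steps in parallel across the nodes that hold the data.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["Hadoop is happening", "hadoop is here"]
    print(word_count(sample))
```

Each phase only sees its own slice of the data, which is what lets the same logic scale out over many disks and cores.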
MapReduce – Football Stadium Analogy
Yesterday’s Architecture
Tomorrow’s Data Architecture
HADOOP USE CASES
Hadoop Use Cases
Data Lake
Offload Mainframe Data & Batch Workloads
Machine Data
Cyber Security
Fraud Detection
Offload ELT from Data Warehouse
Clickstream / Weblogs, EMR
Social Media Data
Geospatial Analysis
Video and Audio Analytics
Real-Time Processing
Predictive Analytics
Unstructured Data
Active Archive
Multi-media
Leverage “Dark Data”
Sentiment Analysis
Enterprise Data Hub
Hadoop Use Cases
A Roadmap for Hadoop Success
– Offload batch & ELT workloads from data warehouse and mainframe systems into Hadoop
– Develop an active archive, shed light on dark data
– Build your Enterprise Data Hub (Data Lake!)
– Leverage new data sources
– Extend BI with data discovery & exploration
– Deliver next-generation analytics
Sample Use Case: Offload
Phase I: Identify
• Identify data & workloads most suitable for offload
• Focus on those that will deliver maximum savings & performance
Phase II: Offload
• Access and move virtually any data to Hadoop with one tool
• Easily replicate existing workloads in Hadoop using a graphical user interface
Phase III: Optimize & Secure
• Deploy and optimize the new environment
• Manage & secure all your data with business class tools
Phase 2: Deliver ‘Next-generation’ Applications
Advanced – ‘Next-gen’ – Applications for Hadoop
– Semi-structured data analytics
• Clickstream/Weblog, Electronic Medical Records
– Unstructured data analytics
• video, audio, documents, text, social
• Predictive modeling
– Geospatial analysis
– Real-Time Processing
Use Cases Across Industries
Use cases by vertical, grouped as Refine / Explore / Enrich:
Retail & Web
– Refine: Log Analysis/Site Optimization; Loyalty Program Optimization
– Explore: Brand and Sentiment Analysis; Market basket analysis
– Enrich: Dynamic Pricing; Session & Content Optimization; Product recommendation
Telco
– Refine: Customer profiling
– Explore: Equipment failure prediction
– Enrich: Location based advertising
Government
– Refine: Threat Identification
– Explore: Person of Interest Discovery
– Enrich: Mission work
Finance
– Refine: Risk Modeling & Fraud Identification; Trade Performance Analytics
– Explore: Surveillance and Fraud Detection; Customer Risk Analysis
– Enrich: Real-time upsell, cross sales marketing offers
Energy
– Refine: Smart Grid: Production Optimization
– Explore: Grid Failure Prevention; Smart Meters
– Enrich: Individual Power Grid
Manufacturing
– Refine: Supply Chain Optimization
– Explore: Customer Churn Analysis
– Enrich: Dynamic Delivery; Replacement parts
Healthcare
– Refine: Electronic Medical Records (EMPI)
– Explore: Clinical decision support; Clinical Trials Analysis
– Enrich: Insurance Premium Determination
IMPLEMENTATION & SKILLSET CHALLENGES
Overview of Hadoop Challenges
Hardware??
Skills??
Training??
Rapid change of Hadoop Ecosystem?
Example 1 - ETL in Hadoop
COLLECT (HARD)
• FS Shell Put Command
• Flume
• Sqoop
PROCESS (HARDER): Sort, Join, Aggregate, Copy, Merge
• Pig
• HiveQL
• Java
DISTRIBUTE (HARD)
• Sqoop
• FS Shell Get Command
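As a rough illustration of what the PROCESS step involves, the sketch below runs a join, an aggregation and a sort over a few in-memory records in plain Python. The sample data, field names and function names are all hypothetical and are not taken from the presentation. In Hadoop the same logic would be expressed in Pig, HiveQL or Java and executed across the cluster, which is where the extra difficulty comes from.

```python
from collections import defaultdict

# Hypothetical sample data; the field names are illustrative only.
customers = [
    {"cust_id": 1, "region": "East"},
    {"cust_id": 2, "region": "West"},
]
orders = [
    {"cust_id": 1, "amount": 120.0},
    {"cust_id": 2, "amount": 80.0},
    {"cust_id": 1, "amount": 30.0},
]


def join(order_rows, customer_rows):
    """Join: attach each order to its customer's region (inner hash join)."""
    region_by_id = {c["cust_id"]: c["region"] for c in customer_rows}
    return [
        {**o, "region": region_by_id[o["cust_id"]]}
        for o in order_rows
        if o["cust_id"] in region_by_id
    ]


def aggregate(rows):
    """Aggregate: total order amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def sort_desc(totals):
    """Sort: regions ordered by total amount, largest first."""
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


def process(order_rows, customer_rows):
    """Chain the three operations the way an ETL job would."""
    return sort_desc(aggregate(join(order_rows, customer_rows)))


if __name__ == "__main__":
    print(process(orders, customers))
```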
Example 2 – Mainframe Data Ingestion
Perception: Just Call the Mainframe Guy…
Images: http://monkeestv.tripod.com/BatMonkee/
Example 2 – Mainframe Data Ingestion
Reality
Every Change = Time, Cost
Call MF Guy (repeated for every change):
• SMS Compression
• DB Tables, Flat Files
• Filtering, Reformatting
• Copy, Sort, Join, Aggregation
• EBCDIC to ASCII
• COBOL copybooks
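Two of the items above, EBCDIC-to-ASCII conversion and copybook-defined fields, can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the record layout (a 5-byte text field followed by a 3-byte packed-decimal field) is hypothetical, and code page `cp037` is assumed, whereas a real mainframe extract may use a different EBCDIC code page and a far larger copybook.

```python
def unpack_comp3(raw: bytes) -> int:
    """Decode a packed-decimal (COMP-3) field: two digits per byte,
    with the final nibble holding the sign (0xD means negative)."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F
    value = int("".join(str(d) for d in digits))
    return -value if sign_nibble == 0x0D else value


def parse_record(record: bytes):
    """Parse one fixed-width record using a hypothetical layout:
    bytes 0-4 are EBCDIC text, bytes 5-7 are a COMP-3 number."""
    name = record[0:5].decode("cp037")  # assumed EBCDIC code page
    amount = unpack_comp3(record[5:8])
    return name, amount


if __name__ == "__main__":
    # "HELLO" in EBCDIC (cp037) followed by packed decimal +12345
    sample = bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6, 0x12, 0x34, 0x5C])
    print(parse_record(sample))
```

Every new field type or layout change means revisiting code like this, which is the "Every Change = Time, Cost" point of the slide.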
Image: bottletales.com
Big Data Team
• Senior Linux/Unix Admin, Hadoop Administrators, Infrastructure Engineers
• Java Developers, Object Oriented Developers, Hadoop Developers
• Data Analysts, Functional Users, Hadoop Analytics Users
• Project Managers!
• Chief Data Officer
• Executive Management
Enterprise Adoption Approach
• Agile
• Ideal Use Case for the company
• Proof-of-concept or Pilot
• Tech Heavy
• Aware of Available Options – Many..
• Work with Solution Architects
• Infrastructure Analysis
• Security Options
• Testing.. Testing..
• Integrating with current Stack
• Cost.. Cost..
• Promises Vs Reality
THE HADOOP ECOSYSTEMS – FROM OPEN SOURCE TO VENDOR TOOLS
Hadoop Distributions
Vendor Landscape
Distributions / Platforms
Data Integration/ETL
Search
Document Store
Database / Data Warehouse
Social Operational
XML Database
Graphs
REAL-WORLD CASE STUDIES
Understanding Mainframe Data at Major US Bank
Customer hit a wall after months of manual effort migrating Mainframe data
Before: Manual Effort (86-page copybook: ? weeks)
• Difficult to find data errors. No Mainframe application logic that matches Copybook
• Large and complex Copybooks
• Depends on Mainframe team to provide data
• Very manual-intensive; inadequate documentation
• Not scalable. Only a few Java + Mainframe experts could do the work
After: DMX-h + CDH (86-page copybook: 4 hrs)
• Easy to validate Copybooks and find data errors
• Ability to pull data directly from Mainframe without relying on Mainframe team
• No coding. No scripting. Easier to document, maintain & reuse
• Enables developers with a broader set of skills to build complex migration jobs.
Social Security Administration
The Challenge:
– The SSA has an expensive problem with fraudulent claims for benefits, and they need more and better data to prevent and punish that fraud. The Office of the Inspector General for the SSA reports that:
– “Nationally, in Fiscal Year 2011, there were more than 103,000 allegations of Social Security fraud, with more than 7,000 criminal investigations resulting in 1,374 convictions and more than $410 million in recoveries, fines, restitution, judgments, settlements, and savings.”
Why Hadoop?
– Data Processing Time – 30 hrs on the mainframe; the PoC cluster completed the same work in 2 hrs
– Accuracy – Obituary data from social media is likely more accurate than the current death file
Optimizing the EDW at Large Teradata Customer
• Offload ELT processing from Teradata into CDH using DMX-h
• Implement flexible architecture for staging and change data capture
• Ability to pull data directly from Mainframe
• No coding. Easier to maintain & reuse
• Enable developers with a broader set of skills to build complex ETL workflows
Chart: Elapsed Time (min): HiveQL 360 min, DMX-h 15 min
Chart: Development Effort (weeks): HiveQL 12 man-weeks, DMX-h 4 man-weeks
Impact on Loans Application Project:
• Cut development time to 1/3 (from 12 man-weeks to 4)
• Reduced complexity. From 140 HiveQL scripts to 12 DMX-h graphical jobs
• Eliminated need for Java user defined functions
• 24x faster!
Log File Processing
Video - Placemeter
http://vimeo.com/69091237
What to do next
No one is impartial, but it’s still worth talking to:
– Vendors
– Industry Analysts
– Industry Peers
– People at Meetups
– Practitioners like Chida
Why Hadoop As a Data Management Platform?
• The Reliability of a Mainframe
• The Massive Performance at Scale of an MPP appliance
• The Storage Capacity of a SAN
• All at a Disruptively Low Price Point
Big Data – Projects