
A Scalable Data Transformation Framework

using the Hadoop Ecosystem

Raj Nair, Director, Data Platform

Kiru Pakkirisamy, CTO

AGENDA

• About Penton and Serendio Inc

• Data Processing at Penton

• PoC Use Case

• Functional Aspects of the Use Case

• Big Data Architecture, Design and Implementation

• Lessons Learned

• Conclusion

• Questions

About Penton

• Professional information services company

• Provide actionable information to five core markets

– Agriculture

– Transportation

– Natural Products

– Infrastructure

– Industrial Design & Manufacturing

Success Stories

– EquipmentWatch.com: prices, specs, costs, rental

– Govalytics.com: analytics around Gov't capital spending, down to the county level

– SourceESB: vertical directory, electronic parts

– NextTrend.com: identify new product trends in the natural products industry

About Serendio

Serendio provides Big Data Science Solutions & Services for Data-Driven Enterprises.

www.serendio.com

Data Processing at Penton

What got us thinking?

• Business units process data in silos

• Heavy ETL

– Hours to process, in some cases days

• Not even using all the data we want

• Not logging what we needed to

• Can’t scale for future requirements

The Data Processing Pipeline

Assembly-line processing that turns data into business value: new features, new insights, new products.

Data Processing Pipeline

Penton examples

• Daily Inventory data, ingested throughout the day

(tens of thousands of parts)

• Auction and survey data gathered daily

• Aviation Fleet data, varying frequency

Ingest, store -> Clean, validate -> Apply Business Rules -> Map -> Analyze -> Report -> Distribute

Slow Extract, Transform, and Load = frustration + missed business SLAs

Won't scale for the future

Various data formats, mostly unstructured

Current Design

• Survey data loaded as CSV files

• Data needs to be scrubbed/mapped

• All CSV rows loaded into one table

• Once scrubbed/mapped, data is loaded into the main tables

• Not all rows are loaded; some may be used in the future

What were our options?

Adopt Hadoop Ecosystem

- M/R: Ideal for Batch Processing

- Flexible for storage

- NoSQL: scale, usability and flexibility

Expand RDBMS options

- Expensive

- Complex

HBase | Oracle | SQL Server | Drools

PoC Use Case

Primary Use Case

• Daily model data – upload and map

– Ingest data, build buckets

– Map data (batch and interactive)

– Build Aggregates (dynamic)

Issue: Mapping time

Functional Aspects

Data Scrubbing

• Standardized names for fields/columns

• Example - Country (a sketch of this kind of rule follows below)

– United States of America -> USA

– United States -> USA
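
In the PoC this standardization is rule-driven through Drools; below is only a minimal plain-Java sketch of the same idea, with an illustrative class name and rule map.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of rule-based value standardization (the PoC drives this via Drools).
// The class name and the rule entries are illustrative.
public class CountryScrubber {

    private static final Map<String, String> CANONICAL = new HashMap<String, String>();
    static {
        CANONICAL.put("united states of america", "USA");
        CANONICAL.put("united states", "USA");
        CANONICAL.put("u.s.a.", "USA");
    }

    /** Returns the standardized value, or the trimmed input when no rule matches. */
    public static String scrub(String raw) {
        if (raw == null) {
            return null;
        }
        String key = raw.trim().toLowerCase();
        String mapped = CANONICAL.get(key);
        return mapped != null ? mapped : raw.trim();
    }
}
```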

Data Mapping

• Converting fields -> IDs

– Manufacturer: Caterpillar -> 25

– Model: Caterpillar/Front Loader -> 300

• Requires lookup tables and partial/fuzzy string matching (see the sketch below)
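
A minimal sketch of lookup-based mapping with a partial-match fallback. The lookup values, IDs, and class name are illustrative; in the PoC the lookup tables themselves are imported from the RDBMS via Sqoop (see the Sqoop slide later).

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of mapping free-text field values to IDs via a lookup table, with a simple
// partial-match fallback. Lookup values, IDs, and the class name are illustrative.
public class ManufacturerMapper {

    private final Map<String, Integer> lookup = new HashMap<String, Integer>();

    public ManufacturerMapper() {
        lookup.put("caterpillar", 25);   // e.g. Manufacturer: Caterpillar -> 25
        lookup.put("john deere", 31);    // illustrative entry
    }

    /** Exact match first, then a partial match; returns null when nothing maps. */
    public Integer mapToId(String raw) {
        if (raw == null) {
            return null;
        }
        String key = raw.trim().toLowerCase();
        Integer exact = lookup.get(key);
        if (exact != null) {
            return exact;
        }
        for (Map.Entry<String, Integer> e : lookup.entrySet()) {
            if (key.contains(e.getKey()) || e.getKey().contains(key)) {
                return e.getValue();     // e.g. "Caterpillar Inc." -> 25
            }
        }
        return null;                     // leave unmapped for manual review
    }
}
```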

Data Exporting

• Move scrubbed/mapped data to main RDBMS

Key Pain Points

• CSV data table continues to grow

• Large size of the table impacts operations on rows in a single file

• CSV data could grow rapidly in the future

Criteria for New Design

• Ability to store an individual file and manipulate it easily

– No join/relationships across CSV files

• Solution should have good integration with RDBMS

• Could possibly host the complete application in future

• Technology stack should possibly have advanced analytics capabilities

A NoSQL model would allow us to quickly retrieve, address, and manipulate individual files

Big Data ArchitectureBig Data Architecture

Solution Architecture

• HBase on Hadoop HDFS as the store for CSV (survey) files; the current Oracle schema remains the master database of products/parts

• Data manipulation APIs exposed through a REST layer: CSV and rule management endpoints, called by the data upload UI and existing business applications

• Drools for rule-based data scrubbing

• Operations on individual files in the UI through HBase Get/Put

• Operations on all files, or groups of files, using MR jobs

• Accepted data pushed/inserted back into the current Oracle schema

HBase Schema Design

• One CSV row per HBase row, or

• One file per HBase row (the approach taken)

– One cell per column qualifier (simple; started the development with this approach)

– One CSV row per column qualifier (more performant approach; sketched below)
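
Below is a minimal sketch of the more performant "one CSV row per column qualifier" layout, using the 0.94-era HBase client API; the table name csv_files and the column family d are assumptions, not the PoC's actual names.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the "one CSV row per column qualifier" layout (HBase 0.94-era client API).
// Table and column family names ("csv_files", "d") are illustrative.
public class CsvFilePutExample {

    public static void storeFile(byte[] rowKey, List<String> csvLines) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "csv_files");
        try {
            Put put = new Put(rowKey);                       // one HBase row per CSV file
            for (int i = 0; i < csvLines.size(); i++) {
                // qualifier = zero-padded CSV row index, value = the raw CSV line
                put.add(Bytes.toBytes("d"),
                        Bytes.toBytes(String.format("r%08d", i)),
                        Bytes.toBytes(csvLines.get(i)));
            }
            table.put(put);
        } finally {
            table.close();
        }
    }
}
```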

HBase Rowkey Design

• Row Key

– Composite

• Created Date (YYYYMMDD)

• User

• FileType

• GUID

• Salting for better region splitting

– One byte (key construction sketched below)
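
A minimal sketch of assembling the salted composite key; the field delimiter and the way the one-byte salt is derived are assumptions beyond what the slide states.

```java
import java.util.UUID;

import org.apache.hadoop.hbase.util.Bytes;

// Sketch of the composite rowkey: [1-byte salt][YYYYMMDD][user][file type][GUID].
// The "|" delimiter and the hash-based salt are assumptions for illustration.
public class RowKeyBuilder {

    private static final int SALT_BUCKETS = 16;  // illustrative; spreads rows across regions

    public static byte[] build(String yyyymmdd, String user, String fileType, UUID guid) {
        String logical = yyyymmdd + "|" + user + "|" + fileType + "|" + guid;
        // Derive a stable salt from the logical key so reads can recompute the same prefix.
        byte salt = (byte) ((logical.hashCode() & 0x7fffffff) % SALT_BUCKETS);
        return Bytes.add(new byte[] { salt }, Bytes.toBytes(logical));
    }
}
```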

HBase Column Family Design

• Column Family

– Data separated from metadata into two or more column families

– One column family for mapping data (more later)

– One column family for analytics data (used by analytics coprocessors)

M/R Jobs

• Jobs

– Scrubbing

– Mapping

– Export

• Schedule

– Manually from the UI

– On a schedule using Oozie (a driver sketch follows below)
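
The slide doesn't show the job wiring; below is a sketch of how the scrubbing pass might be set up with TableMapReduceUtil against the HBase-resident CSV table (0.94-era API). The table and column family names and the placeholder scrub step are illustrative; in the PoC the actual scrubbing logic is driven by Drools rules.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.mapreduce.Job;

// Driver for a map-only scrubbing pass over the HBase-resident CSV files
// (HBase 0.94-era API; "csv_files" and column family "d" are illustrative).
public class ScrubJobDriver {

    /** Re-emits each row with lightly cleaned values; real scrubbing is rule-driven (Drools). */
    public static class ScrubMapper extends TableMapper<ImmutableBytesWritable, Put> {
        @Override
        protected void map(ImmutableBytesWritable key, Result row, Context ctx)
                throws IOException, InterruptedException {
            Put put = new Put(key.get());
            for (KeyValue kv : row.raw()) {
                String scrubbed = Bytes.toString(kv.getValue()).trim(); // placeholder scrub step
                put.add(kv.getFamily(), kv.getQualifier(), Bytes.toBytes(scrubbed));
            }
            ctx.write(key, put);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Job job = new Job(conf, "csv-scrub");
        job.setJarByClass(ScrubJobDriver.class);

        Scan scan = new Scan();
        scan.setCaching(500);        // bigger batches per RPC for a full scan
        scan.setCacheBlocks(false);  // don't churn the block cache from MR scans
        scan.addFamily(Bytes.toBytes("d"));

        TableMapReduceUtil.initTableMapperJob(
                "csv_files", scan, ScrubMapper.class,
                ImmutableBytesWritable.class, Put.class, job);
        TableMapReduceUtil.initTableReducerJob("csv_files", null, job); // TableOutputFormat target
        job.setNumReduceTasks(0);    // map-only: the mapper's Puts go straight to HBase

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```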

Sqoop Jobs

• One time

– FileDetailExport (current CSV)

– RuleImport (all current rules)

• Periodic

– Lookup table data import

• Manufacturer

• Model

• State

• Country

• Currency

• Condition

• Participant

Application Integration - REST

• Hide the HBase Java APIs from the rest of the application

• Language independence for PHP front-end

• REST APIs for

– CSV Management (an endpoint sketch follows below)

– Drools Rule Management
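
The deck doesn't name the REST framework; the sketch below assumes a JAX-RS stack and shows one illustrative endpoint that fetches a CSV file by row key via an HBase Get and returns its rows as JSON.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative JAX-RS endpoint that hides the HBase Java API behind REST so the
// PHP front-end only speaks HTTP/JSON. Path, table, and column family are assumptions.
@Path("/csv")
public class CsvResource {

    @GET
    @Path("/{rowKey}")
    @Produces(MediaType.APPLICATION_JSON)
    public Map<String, String> getFile(@PathParam("rowKey") String rowKey) throws IOException {
        HTable table = new HTable(HBaseConfiguration.create(), "csv_files");
        try {
            Result result = table.get(new Get(Bytes.toBytes(rowKey)));
            Map<String, String> rows = new LinkedHashMap<String, String>();
            for (KeyValue kv : result.raw()) {
                // qualifier = CSV row index, value = raw CSV line (see the schema slides)
                rows.put(Bytes.toString(kv.getQualifier()), Bytes.toString(kv.getValue()));
            }
            return rows; // serialized to JSON by the configured provider (e.g. Jackson)
        } finally {
            table.close();
        }
    }
}
```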

Lessons Learned

Performance Benefits

• Mapping

– 20,000 CSV files, 20 million records

– Time taken: one third of RDBMS processing time

• Metrics

– < 10 secs (vs Oracle Materialized View)

• Upload a file

– < 10 secs

• Delete a file

– < 10 secs

HBase Tuning

• Heap Size for

– RegionServer

– MapReduce Tasks

• Table Compression

– SNAPPY for the column family holding CSV data

• Table data caching

– IN_MEMORY for lookup tables (table setup sketched below)
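
A sketch of applying the two settings above when creating the tables, using the 0.94-era admin API; all table and column family names are illustrative.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.io.hfile.Compression;

// Sketch of the table-level tuning above (HBase 0.94-era admin API).
// Table and column family names are illustrative.
public class TableSetup {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);

        // CSV store: data CF compressed with SNAPPY, separate CF for metadata.
        HTableDescriptor csv = new HTableDescriptor("csv_files");
        HColumnDescriptor data = new HColumnDescriptor("d");
        data.setCompressionType(Compression.Algorithm.SNAPPY);
        csv.addFamily(data);
        csv.addFamily(new HColumnDescriptor("meta"));
        admin.createTable(csv);

        // Small lookup tables pinned in the block cache.
        HTableDescriptor lookup = new HTableDescriptor("lookup_manufacturer");
        HColumnDescriptor lf = new HColumnDescriptor("l");
        lf.setInMemory(true);
        lookup.addFamily(lf);
        admin.createTable(lookup);

        admin.close();
    }
}
```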

Application Design Challenges

• Pagination: implemented using the intermediate REST layer and scan.setStartRow (see the sketch after this slide)

• Translating SQL queries

– Used Scan/Filter and Java (especially in coprocessors)

– No secondary indexes; used FuzzyRowFilter

– Maybe something like Phoenix would have helped

• Some issues in mixed mode; we want to move to 0.96.0 for better per-column-family flushing, but that requires 'porting' coprocessors to protobuf
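
A sketch of the pagination approach: the REST layer hands the caller an opaque cursor (the last row key of the previous page) and resumes the scan just past it. The PageFilter and the cursor convention are assumptions beyond the slide's mention of scan.setStartRow.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.PageFilter;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of REST-layer pagination with scan.setStartRow: the caller passes back the
// last row key it saw, and the next page starts just after it. Names are illustrative.
public class PaginationExample {

    public static List<Result> nextPage(HTable table, byte[] lastSeenKey, int pageSize)
            throws IOException {
        Scan scan = new Scan();
        if (lastSeenKey != null) {
            // start immediately after the previous page's last key
            scan.setStartRow(Bytes.add(lastSeenKey, new byte[] { 0x00 }));
        }
        scan.setFilter(new PageFilter(pageSize));  // per-region limit; trim client-side too
        List<Result> page = new ArrayList<Result>();
        ResultScanner scanner = table.getScanner(scan);
        try {
            for (Result r : scanner) {
                page.add(r);
                if (page.size() >= pageSize) {
                    break;
                }
            }
        } finally {
            scanner.close();
        }
        return page;
    }
}
```

For the secondary-index gap mentioned above, FuzzyRowFilter takes a row-key template plus a per-byte mask marking which positions must match exactly and which are wildcards, which works best when the key layout is fixed-width.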

HBase Value Proposition

• Better UI response for CSV file operations: operations within a file (map, import, reject, etc.) do not depend on the overall database size

• Relieve load on RDBMS - no more CSV data tables

• Scale out batch processing performance on the cheap (vs a vertical RDBMS upgrade)

• Redundant store for CSV files

• Versioning to track data cleansing (see the sketch below)
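
A sketch of reading a cell's version history to review how a value was cleansed over time. It assumes the column family is configured to keep multiple versions; table, family, and method names are illustrative.

```java
import java.io.IOException;
import java.util.List;

import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch of using HBase cell versions to review how a value was cleansed over time.
// Assumes the column family retains more than one version; names are illustrative.
public class CleansingHistoryExample {

    public static void printHistory(HTable table, byte[] rowKey, byte[] qualifier)
            throws IOException {
        Get get = new Get(rowKey);
        get.addColumn(Bytes.toBytes("d"), qualifier);
        get.setMaxVersions(5);                       // raw upload plus later scrubbed values
        Result result = table.get(get);
        List<KeyValue> versions = result.getColumn(Bytes.toBytes("d"), qualifier);
        for (KeyValue kv : versions) {               // newest first
            System.out.println(kv.getTimestamp() + " -> " + Bytes.toString(kv.getValue()));
        }
    }
}
```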

Roadmap

• Benchmark with 0.96

• Retire Coprocessors in favor of Phoenix (?)

• Lookup data tables are small; need to find a better alternative than HBase

• Design UI for a more Big Data appropriate model

– Search-oriented paradigm rather than exploratory/paginative

– Add REST endpoints to support such UI

Wrap-Up

Conclusion

• PoC demonstrated

– value of the Hadoop ecosystem

– Co-existence of Big data technologies with current solutions

– Adoption can significantly improve scale

– New skill requirements

Thank You

Rajesh.Nair@Penton.com

Kiru@Serendio.com