Syncsort and the comScore experience report

High Performance ETL in a #BigData #Hadoop context. Steven Haddad, Senior Software Architect; Stéphane Heckel, Partner Manager. Hadoop User Group, September 12th 2012


High Performance ETL in a BigData Hadoop context

Steven Haddad - Senior Software Architect

Stéphane Heckel - Partner Manager

Hadoop User Group - September 12th 2012

Syncsort - Solving Big Data Breakpoints for 40 years

Company Track Record
• Global Software Company
• 40+ Years of Performance Innovation
• 25+ Patents related to unique and unparalleled integration technology

Large Established Customer Base
• 16,000+ deployments
• 68 Countries
• Across all verticals

Expertise & Specialism
• Leading provider of high-performance data integration solutions
• Data Integration Acceleration and Cost Optimization
• Delivering Cost Reduction Initiatives whilst delivering superior performance
• Typical TCO reduction of 50 - 75%
• Customer ROI within 12 months

• DATA SERVICES
• FINANCE
• INSURANCE & HEALTHCARE
• TRAVEL & TRANSPORT
• RETAIL
• TELECOMMUNICATIONS

A Fully Integrated Architecture for High-performance ETL

User Interface

Task Editor, Job Editor, SDK

Shared File-based Metadata Repository

Data Lineage

Metadata Interchange

Global Search

Impact Analysis

Small Footprint ETL Engine

Self-tuning Optimizer

Native Direct I/O Access

Install in Minutes, Deploy in Weeks, Never Tune Again

High Performance Connectivity

Mainframe, Files, XML

Appliances, Hadoop, Cloud, Real Time

Template-driven Design

DMExpress Server Engine

High Performance Transformations

High Performance Functions

Automatic Continuous Optimization

Syncsort's Hadoop value proposition

Syncsort Value proposition on Hadoop

Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption

• HDFS connectivity: ability to move data in & out of the Hadoop file system
• Enhanced usability: ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
• Contribute to the Open Source Community: enhance the Hadoop sort framework for everyone; make it more modular, flexible, extensible
• Accelerate Hadoop: address existing drawbacks in the Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance

Syncsort Confidential and Proprietary - do not copy or distribute

Optimizing Hadoop Deployments

DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments

[Diagram: Extract, Preprocess & Compress, Load. Sources: RDBMS, Appliances, Cloud, XML, Mainframe, Files. Operations: Sort, Aggregate, Join, Compress, Partition. Target: HDFS data nodes]

[Chart: Elapsed Processing Time. Load time (min), axis 0 to 150; HDFS Put vs. DMExpress]

• Connect to virtually any source
• Pre-process data to cleanse, validate & partition for better and faster Hadoop processing and significant storage savings
• Load data up to 6x faster
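The pre-processing step named on this slide (cleanse, validate, partition, compress before the load) can be sketched in a few lines. This is only an illustration under stated assumptions, not DMExpress behaviour: the `key,value` record layout, the function names `preprocess` and `compress_partitions`, and the choice of CRC32 hashing for partitioning are all hypothetical.

```python
import gzip
import zlib
from collections import defaultdict


def preprocess(records, num_partitions):
    """Cleanse, validate, and hash-partition raw 'key,value' lines (assumed layout)."""
    partitions = defaultdict(list)
    for line in records:
        line = line.strip()                       # cleanse: trim stray whitespace
        key, sep, value = line.partition(",")
        if not sep or not key or not value:       # validate: both fields required
            continue
        # crc32 is stable across runs, unlike Python's built-in hash() on strings
        bucket = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[bucket].append(f"{key},{value}")
    return partitions


def compress_partitions(partitions):
    """Sort each partition and gzip it, returning {bucket: compressed bytes}."""
    return {
        bucket: gzip.compress("\n".join(sorted(lines)).encode("utf-8"))
        for bucket, lines in partitions.items()
    }


raw = ["  a,1 ", "b,2", "broken-line", "a,3", ",4", "c,5"]
parts = preprocess(raw, num_partitions=4)
blobs = compress_partitions(parts)
```

Each entry in `blobs` is an independent compressed file, so the partitions could be written to the cluster concurrently; the two malformed input lines are dropped before anything is loaded.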

DMExpress - HDFS Connectivity

[Diagram: Input, DMExpress, HDFS]

Load HDFS
- Partition the output for parallel loading
- Makes full use of network bandwidth, with reduced elapsed time
- Hadoop/DMExpress can process wildcard input files from HDFS

Extract HDFS
- DMExpress can read wildcard inputs in parallel

Distributions supported
- Cloudera CDH3u3
- Hortonworks Data Platform 1.0.7
- Greenplum HD 1.1
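Reading wildcard inputs in parallel, as the Extract HDFS item describes, follows a simple pattern: expand the pattern to a file list, then hand each file to a worker. The sketch below shows the pattern only; it uses a temporary local directory as a stand-in for an HDFS path, and `count_lines` and `read_wildcard_parallel` are hypothetical names, not DMExpress or Hadoop calls.

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def count_lines(path):
    """Read one input file and return (path, number of lines)."""
    with open(path, "r", encoding="utf-8") as handle:
        return path, sum(1 for _ in handle)


def read_wildcard_parallel(pattern, max_workers=4):
    """Expand a wildcard pattern and process every matching file concurrently."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(count_lines, paths))


# Demo on a temporary local directory standing in for an HDFS directory.
with tempfile.TemporaryDirectory() as workdir:
    for i in range(3):
        name = os.path.join(workdir, f"part-{i:05d}.txt")
        with open(name, "w", encoding="utf-8") as out:
            out.write("row\n" * (i + 1))
    counts = read_wildcard_parallel(os.path.join(workdir, "part-*.txt"))
    line_counts = sorted(counts.values())
```

Threads suit this case because the work is I/O-bound; a real extract from a cluster would swap the local `open` for an HDFS client.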

DMExpress Accelerates Loading HDFS

HDFS Load
- 20 partitions
- Uncompressed input file size from 10GB to 100GB

Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH4
- HDFS block size: 256 MB

Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon X5670 x 2
- 6 disks/node
- Write: 650 MB/s
- Memory: 94 GB

[Chart: HDFS Load using DMExpress]

3x-6x Faster
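To give the test parameters some scale, the short calculation below works out how many HDFS blocks each input size occupies and how much data each of the 20 partitions carries. It assumes 1 GB = 1024 MB and a perfectly even split across partitions, neither of which the slide states; `hdfs_blocks` and `per_partition_gb` are illustrative helper names.

```python
import math

BLOCK_SIZE_MB = 256      # HDFS block size from the slide
PARTITIONS = 20          # number of partitions from the slide


def hdfs_blocks(file_size_gb, block_size_mb=BLOCK_SIZE_MB):
    """Number of HDFS blocks needed for a file of the given size (1 GB = 1024 MB assumed)."""
    return math.ceil(file_size_gb * 1024 / block_size_mb)


def per_partition_gb(file_size_gb, partitions=PARTITIONS):
    """Data volume per partition, assuming an even split."""
    return file_size_gb / partitions


for size_gb in (10, 100):
    print(size_gb, hdfs_blocks(size_gb), per_partition_gb(size_gb))
```

Under these assumptions the 10 GB input spans 40 blocks (0.5 GB per partition) and the 100 GB input spans 400 blocks (5 GB per partition).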

DMExpress Accelerates Loading HDFS

HDFS Load
- 20 partitions
- Uncompressed input file size from 100GB to 2100GB

Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH4
- HDFS block size: 256 MB

Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon X5670 x 2
- 6 disks/node
- Write: 650 MB/s
- Memory: 94 GB

[Chart: HDFS Load using DMExpress]

6x Faster

Enabling Storage Savings and Accelerating Performance with DMExpress

DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and partitioning
• Achieve higher rate of parallelism
• Realize up to 75TB of data storage savings a month

[Diagram: 32B records/day. Load files; cleanse/sort, compress, partition on the DMExpress node; load to HDFS on the Hadoop nodes; post-processing & analysis]

Michael Brown, Chief Scientist, comScore
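One plausible reason pre-sorting helps storage is that sorted records place similar values next to each other, which general-purpose compressors exploit. The slides do not state this mechanism, so the sketch below is an assumption-laden illustration on synthetic data: the `domain,user_id` record shape and the helper names are invented, and the measured gap says nothing about comScore's actual figures.

```python
import random
import zlib


def make_records(count, seed=42):
    """Generate synthetic 'domain,user_id' log lines in random order."""
    rng = random.Random(seed)
    domains = [f"site{n:03d}.example.com" for n in range(50)]
    return [
        f"{rng.choice(domains)},{rng.randrange(1_000_000):06d}"
        for _ in range(count)
    ]


def compressed_size(lines):
    """Size in bytes of the newline-joined lines after zlib compression."""
    return len(zlib.compress("\n".join(lines).encode("utf-8"), 6))


records = make_records(20_000)
unsorted_size = compressed_size(records)
sorted_size = compressed_size(sorted(records))
print(f"unsorted: {unsorted_size} bytes, sorted: {sorted_size} bytes")
```

On this data the sorted input compresses to fewer bytes than the unsorted one, because neighbouring lines share the domain and the leading digits of the identifier.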

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
- Allow external sort to be plugged in
- Improve developer productivity
• Develop MapReduce jobs via DMExpress GUI
- Aggregations, cleansing/filtering, reformatting, etc.
- Seamlessly accelerate MapReduce performance
• Replace Map output sorter
• Replace Reduce input sorter

https://issues.apache.org/jira/browse/MAPREDUCE-2454
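The idea of a pluggable sort can be shown with a toy job in which the sort step is a parameter rather than a fixed part of the framework. This is a conceptual sketch only: it does not use the Hadoop API or reproduce the MAPREDUCE-2454 patch, and `run_job`, `default_sorter` and `external_sorter` are hypothetical names.

```python
from typing import Callable, Iterable, List, Tuple

KeyValue = Tuple[str, int]
Sorter = Callable[[List[KeyValue]], List[KeyValue]]


def default_sorter(pairs: List[KeyValue]) -> List[KeyValue]:
    """Stand-in for the framework's built-in sort."""
    return sorted(pairs, key=lambda kv: kv[0])


def external_sorter(pairs: List[KeyValue]) -> List[KeyValue]:
    """Stand-in for a plugged-in sort: bucket by first letter, then sort each bucket."""
    buckets = {}
    for key, value in pairs:
        buckets.setdefault(key[:1], []).append((key, value))
    ordered: List[KeyValue] = []
    for prefix in sorted(buckets):
        ordered.extend(sorted(buckets[prefix], key=lambda kv: kv[0]))
    return ordered


def run_job(lines: Iterable[str], sorter: Sorter = default_sorter) -> List[KeyValue]:
    """Tiny word count: map, sort with the chosen sorter, then reduce."""
    mapped = [(word, 1) for line in lines for word in line.split()]
    shuffled = sorter(mapped)          # the pluggable step
    reduced: List[KeyValue] = []
    for key, value in shuffled:
        if reduced and reduced[-1][0] == key:
            reduced[-1] = (key, reduced[-1][1] + value)
        else:
            reduced.append((key, value))
    return reduced


text = ["sort merge sort", "join sort merge"]
baseline = run_job(text)
plugged = run_job(text, sorter=external_sorter)
```

Both runs produce the same word counts; only the sorting implementation differs, which is the property that lets a faster sorter be swapped in without touching the map or reduce logic.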


Accelerate Development & Remove Barriers to Adoption

Use DMExpress to Accelerate Development and Optimize MapReduce Jobs

MapReduce Development
✗ Lots of manual coding
✗ MapReduce, Pig, Java
✗ Limited skills supply
✗ Heavy learning curve

DMExpress Hadoop Edition
• No coding required
• Leverages the same skills most IT organizations already have
• New resources can be trained in just 3 days

Native MapReduce DMExpress Execution

DMExpress Hadoop is not generating code (i.e. Java, Pig, Python)

DMExpress Hadoop runs natively on each data node of the cluster
- DMExpress is installed on each data node
- Same benefits as High-performance ETL

Issues with code generation
- Requires re-compilation with every change
- May still require MR skills
- Ongoing issues with efficiency of generated code

[Diagram: Hadoop Cluster, with DMX instances on the nodes]

DMExpress Hadoop Edition Provides Significant Performance Improvements

[Chart: TPC-H - Aggregation. Elapsed Time (sec), 0 to 3000, against File Size (GB), 0 to 3000; series: Java, Pig, DMExpress]

TPC-H Benchmark
- Filter & Aggregation
- GZIP compression
- Uncompressed input file size from 100GB to 2.4TB

Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH3U2
- HDFS block size: 256 MB

Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon X5670 x 2
- 6 disks/node
- Read: 870 MB/s, Write: 660 MB/s
- Memory: 94 GB

Almost 2x faster than Java; over 2x faster than Pig
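For readers unfamiliar with the workload, a filter-and-aggregate job reduces to a map phase that drops non-matching rows and emits key/value pairs, and a reduce phase that sums per key. The sketch below is a minimal single-process illustration, not the benchmark code: the rows, the cutoff date and the column choice are hypothetical and only loosely shaped like the TPC-H lineitem table.

```python
from collections import defaultdict

# Hypothetical rows: (return_flag, quantity, ship_date)
ROWS = [
    ("A", 10, "1998-08-01"),
    ("N", 5, "1998-09-15"),
    ("A", 7, "1998-07-20"),
    ("R", 3, "1998-12-05"),
    ("N", 8, "1998-06-30"),
]


def map_phase(rows, cutoff):
    """Filter rows by ship date and emit (key, value) pairs."""
    for flag, quantity, ship_date in rows:
        if ship_date <= cutoff:          # ISO dates compare correctly as strings
            yield flag, quantity


def reduce_phase(pairs):
    """Aggregate: (sum, count) of quantity per key."""
    totals = defaultdict(lambda: [0, 0])
    for key, quantity in pairs:
        totals[key][0] += quantity
        totals[key][1] += 1
    return {key: tuple(acc) for key, acc in sorted(totals.items())}


result = reduce_phase(map_phase(ROWS, cutoff="1998-09-02"))
```

Here `result` holds one `(sum, count)` pair per return flag that survives the filter; in a real cluster the pairs emitted by the map phase would be sorted and shuffled to reducers between the two steps.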

Conclusion

Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits

High performance HDFS load and extract
- DMExpress partitioning allows taking advantage of full network bandwidth
- High performance parallel load from HDFS to GP DB

Integration with diverse set of sources
- Files, DBMS, mainframe

Ease of development (GUI vs. Java/Pig)

High performance ETL operations (MapReduce)
- Aggregation, sort, filter, copy, reformatting, join, merge

Seamless high performance sort

Thank you

Page 2: Syncsort et le retour d'expérience ComScore

Syncsort ndash Solving Big Data Breakpoints for 40 years

Company Track Record

bull Global Software Company bull 40+ Years of Performance Innovation bull 25+ Patents related to unique and

unparalleled integration technology

Large Established Customer Base

bull 16000+ deployments bull 68 Countries bull Across all verticals

2

Expertise amp Specialism

bull Leading provider of high-performance data integration solutions

bull Data Integration Acceleration and Cost Optimization

bull Delivering Cost Reduction Initiatives whilst delivering superior performance

bull Typical TCO reduction of 50 - 75 bull Customer ROI within 12 months

bull

DATA SERVICES

bull

FINANCE

bull

INSURANCE amp HEALTHCARE

TRAVEL amp TRANSPORT

bull

RETAIL

bull

TELECOMMUNICATIONS

A Fully Integrated Architecture for High-performance ETL

3

User Interface

Task Editor Job Editor SDK

Shared File-based Metadata Repository

Data Lineage

Metadata Interchange

Global Search

Impact Analysis

Small Footprint ETL Engine

Self-tuning Optimizer

Native Direct IO Access

Install in Minutes Deploy in Weeks Never Tune Again

High Performance Connectivity

Mainframe Files XML

Appliances Hadoop Cloud Real Time

Template-driven Design

DMExpress Server Engine

High Performance

Transformations

High Performance

Functions

Automatic Continuous Optimization

4

Syncsortrsquos Hadoop value proposition

Syncsort Value proposition on Hadoop

Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption

HDFS connectivity Ability to move data in amp out of Hadoop file system

Enhanced usability Ability to create jobs using DMExpress graphical user interface and run it in the Hadoop MapReduce framework

Contribute to the Open Source Community Enhance Hadoop sort framework for everyone Make it more modular flexible extensible

Accelerate Hadoop Address existing drawbacks in Hadoop native sort by providing a simple self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance

5 Syncsort Confidential and Proprietary - do not copy or distribute

Optimizing Hadoop Deployments

DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments

Extract Preprocess amp Compress Load

RDBMS

Appliances

Cloud

XML

Mainframe

Files

Data Node

Data Node

Data Node

Data Node

HD

FS

Sort Aggregate Join

Compress Partition

0

50

100

150

Load

Tim

e (

min

) Elapsed Processing Time

HDFS Put DMExpress

Connect to virtually any source

Pre-process data to cleanse validate amp partition for better and faster Hadoop processing and significant storage savings

Load data up to 6x faster

6

DMExpress ndash HDFS Connectivity

HDFS

DMExpress

Input

Load HDFS

ndash Partition the output for parallel loading

ndash Makes full use of network bandwidth with

reduced elapsed time

ndash HadoopDMExpress can process wildcard

input files from HDFS

Extract HDFS

ndash DMExpress can read wildcard inputs in

parallel 7

Distributions supported

ndash Cloudera CDH3u3

ndash Hortonworks Data Platform 107

ndash Greenplum HD 11

DMExpress Accelerates Loading HDFS

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 10GB to 100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

8

3x-6x Faster

DMExpress Accelerates Loading HDFS

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

9

6x Faster

Enabling Storage Savings and Accelerating Performance with DMExpress

bull Load data faster into HDFS bull Store twice as much data on the cluster bull Improve overall performance by pre-sorting cleansing and

partitioning bull Achieve higher rate of parallelism bull Realize up to 75TB of data storage savings a month

DMExpress is enabling comScore to

32

B r

eco

rds

d

ay

Load files Cleansesort compress partition

Load to HDFS

Post-processing amp analysis

DMExpress Node

Node

Node

Node

HD

FS

Hadoop

10

11

Michael Brown Chief Scientist comScore

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache

Hadoop (JIRA MAPREDUCE-2454)

ndash Allow external sort to be plugged in

ndash Improve developer productivity

bull Develop MapReduce jobs via DMExpress GUI

ndash Aggregations cleansingfiltering reformatting

etc

ndash Seamlessly accelerate MapReduce performance

bull Replace Map output sorter

bull Replace Reduce input sorter

httpsissuesapacheorgjirabrowseMAPREDUCE-2454

Syncsort Confidential and Proprietary - do not copy or distribute 12

DMExpress Accelerates HDFS Loading

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

13 Syncsort Confidential and Proprietary - do not copy or distribute

6x Faster

Accelerate Development amp Remove Barriers to Adoption

Use DMExpress to Accelerate Development and Optimize MapReduce Jobs

MapReduce Development

Χ Lots of manual coding

Χ MapReduce Pig Java

Χ Limited skills supply

Χ Heavy learning curve

DMExpress Hadoop Edition

No coding required

Leverages the same skills most IT organizations already have

New resources can be trained in just 3 days

Syncsort Confidential and Proprietary - do not copy or distribute 14

Native MapReduce DMExpress Execution

DMExpress Hadoop is not

generating code (ie Java Pig

Python)

DMExpress Hadoop runs native

on each data node on the cluster

ndash DMExpress is installed on each

data node

ndash Same benefits as High-performance

ETL

Issues with code generation

ndash Requires re-compilation with every

change

ndash May still require MR skills

ndash Ongoing issues with efficiency of

generated code

15 Sy

nc

so

rt

Co

nfi

de

nti

al

an

d

Pr

op

rie

tar

y -

do

no

t

co

py

or

dis

tri

bu

te

DMX DMX DMX DMX

Hadoop Cluster

DMX

0

500

1000

1500

2000

2500

3000

0 500 1000 1500 2000 2500 3000

Elap

sed

Tim

e (s

ec)

File Size (GB)

TPC-H - Aggregation

Java

Pig

DMExpress

DMExpress Hadoop Edition Provides Significant Performance Improvements

TPC-H Benchmark

ndash Filter amp Aggregation

ndash GZIP compression

ndash Uncompressed input file size from 100GB to 24TB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH3U2

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Read 870MBs Write 660MBs

ndash Memory 94 GB

TPC-H Benchmark

16 Syncsort Confidential and Proprietary - do not copy or distribute

Almost 2x Faster than

Java Over 2x Faster Pig

17

Conclusion

Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits

High performance HDFS load and extract

ndash DMExpress partitioning allows taking advantage of

full network bandwidth

ndash High performance parallel load from HDFS to GP

DB

Integration with diverse set of sources

ndash Files DBMS mainframe

Ease of development (GUI vs JavaPig)

High performance ETL operations (MapReduce)

ndash Aggregation sort filter copy reformatting join

merge

Seamless high performance sort

18 Syncsort Confidential and Proprietary - do not copy or distribute

Thank you

Page 3: Syncsort et le retour d'expérience ComScore

A Fully Integrated Architecture for High-performance ETL

3

User Interface

Task Editor Job Editor SDK

Shared File-based Metadata Repository

Data Lineage

Metadata Interchange

Global Search

Impact Analysis

Small Footprint ETL Engine

Self-tuning Optimizer

Native Direct IO Access

Install in Minutes Deploy in Weeks Never Tune Again

High Performance Connectivity

Mainframe Files XML

Appliances Hadoop Cloud Real Time

Template-driven Design

DMExpress Server Engine

High Performance

Transformations

High Performance

Functions

Automatic Continuous Optimization

4

Syncsortrsquos Hadoop value proposition

Syncsort Value proposition on Hadoop

Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption

HDFS connectivity Ability to move data in amp out of Hadoop file system

Enhanced usability Ability to create jobs using DMExpress graphical user interface and run it in the Hadoop MapReduce framework

Contribute to the Open Source Community Enhance Hadoop sort framework for everyone Make it more modular flexible extensible

Accelerate Hadoop Address existing drawbacks in Hadoop native sort by providing a simple self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance

5 Syncsort Confidential and Proprietary - do not copy or distribute

Optimizing Hadoop Deployments

DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments

Extract Preprocess amp Compress Load

RDBMS

Appliances

Cloud

XML

Mainframe

Files

Data Node

Data Node

Data Node

Data Node

HD

FS

Sort Aggregate Join

Compress Partition

0

50

100

150

Load

Tim

e (

min

) Elapsed Processing Time

HDFS Put DMExpress

Connect to virtually any source

Pre-process data to cleanse validate amp partition for better and faster Hadoop processing and significant storage savings

Load data up to 6x faster

6

DMExpress ndash HDFS Connectivity

HDFS

DMExpress

Input

Load HDFS

ndash Partition the output for parallel loading

ndash Makes full use of network bandwidth with

reduced elapsed time

ndash HadoopDMExpress can process wildcard

input files from HDFS

Extract HDFS

ndash DMExpress can read wildcard inputs in

parallel 7

Distributions supported

ndash Cloudera CDH3u3

ndash Hortonworks Data Platform 107

ndash Greenplum HD 11

DMExpress Accelerates Loading HDFS

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 10GB to 100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

8

3x-6x Faster

DMExpress Accelerates Loading HDFS

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

9

6x Faster

Enabling Storage Savings and Accelerating Performance with DMExpress

bull Load data faster into HDFS bull Store twice as much data on the cluster bull Improve overall performance by pre-sorting cleansing and

partitioning bull Achieve higher rate of parallelism bull Realize up to 75TB of data storage savings a month

DMExpress is enabling comScore to

32

B r

eco

rds

d

ay

Load files Cleansesort compress partition

Load to HDFS

Post-processing amp analysis

DMExpress Node

Node

Node

Node

HD

FS

Hadoop

10

11

Michael Brown Chief Scientist comScore

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache

Hadoop (JIRA MAPREDUCE-2454)

ndash Allow external sort to be plugged in

ndash Improve developer productivity

bull Develop MapReduce jobs via DMExpress GUI

ndash Aggregations cleansingfiltering reformatting

etc

ndash Seamlessly accelerate MapReduce performance

bull Replace Map output sorter

bull Replace Reduce input sorter

httpsissuesapacheorgjirabrowseMAPREDUCE-2454

Syncsort Confidential and Proprietary - do not copy or distribute 12

DMExpress Accelerates HDFS Loading

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

13 Syncsort Confidential and Proprietary - do not copy or distribute

6x Faster

Accelerate Development amp Remove Barriers to Adoption

Use DMExpress to Accelerate Development and Optimize MapReduce Jobs

MapReduce Development

Χ Lots of manual coding

Χ MapReduce Pig Java

Χ Limited skills supply

Χ Heavy learning curve

DMExpress Hadoop Edition

No coding required

Leverages the same skills most IT organizations already have

New resources can be trained in just 3 days

Syncsort Confidential and Proprietary - do not copy or distribute 14

Native MapReduce DMExpress Execution

DMExpress Hadoop is not

generating code (ie Java Pig

Python)

DMExpress Hadoop runs native

on each data node on the cluster

ndash DMExpress is installed on each

data node

ndash Same benefits as High-performance

ETL

Issues with code generation

ndash Requires re-compilation with every

change

ndash May still require MR skills

ndash Ongoing issues with efficiency of

generated code

15 Sy

nc

so

rt

Co

nfi

de

nti

al

an

d

Pr

op

rie

tar

y -

do

no

t

co

py

or

dis

tri

bu

te

DMX DMX DMX DMX

Hadoop Cluster

DMX

0

500

1000

1500

2000

2500

3000

0 500 1000 1500 2000 2500 3000

Elap

sed

Tim

e (s

ec)

File Size (GB)

TPC-H - Aggregation

Java

Pig

DMExpress

DMExpress Hadoop Edition Provides Significant Performance Improvements

TPC-H Benchmark

ndash Filter amp Aggregation

ndash GZIP compression

ndash Uncompressed input file size from 100GB to 24TB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH3U2

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Read 870MBs Write 660MBs

ndash Memory 94 GB

TPC-H Benchmark

16 Syncsort Confidential and Proprietary - do not copy or distribute

Almost 2x Faster than

Java Over 2x Faster Pig

17

Conclusion

Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits

High performance HDFS load and extract

ndash DMExpress partitioning allows taking advantage of

full network bandwidth

ndash High performance parallel load from HDFS to GP

DB

Integration with diverse set of sources

ndash Files DBMS mainframe

Ease of development (GUI vs JavaPig)

High performance ETL operations (MapReduce)

ndash Aggregation sort filter copy reformatting join

merge

Seamless high performance sort

18 Syncsort Confidential and Proprietary - do not copy or distribute

Thank you

Page 4: Syncsort et le retour d'expérience ComScore

4

Syncsortrsquos Hadoop value proposition

Syncsort Value proposition on Hadoop

Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption

HDFS connectivity Ability to move data in amp out of Hadoop file system

Enhanced usability Ability to create jobs using DMExpress graphical user interface and run it in the Hadoop MapReduce framework

Contribute to the Open Source Community Enhance Hadoop sort framework for everyone Make it more modular flexible extensible

Accelerate Hadoop Address existing drawbacks in Hadoop native sort by providing a simple self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance

5 Syncsort Confidential and Proprietary - do not copy or distribute

Optimizing Hadoop Deployments

DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments

Extract Preprocess amp Compress Load

RDBMS

Appliances

Cloud

XML

Mainframe

Files

Data Node

Data Node

Data Node

Data Node

HD

FS

Sort Aggregate Join

Compress Partition

0

50

100

150

Load

Tim

e (

min

) Elapsed Processing Time

HDFS Put DMExpress

Connect to virtually any source

Pre-process data to cleanse validate amp partition for better and faster Hadoop processing and significant storage savings

Load data up to 6x faster

6

DMExpress ndash HDFS Connectivity

HDFS

DMExpress

Input

Load HDFS

ndash Partition the output for parallel loading

ndash Makes full use of network bandwidth with

reduced elapsed time

ndash HadoopDMExpress can process wildcard

input files from HDFS

Extract HDFS

ndash DMExpress can read wildcard inputs in

parallel 7

Distributions supported

ndash Cloudera CDH3u3

ndash Hortonworks Data Platform 107

ndash Greenplum HD 11

DMExpress Accelerates Loading HDFS

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 10GB to 100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

8

3x-6x Faster

DMExpress Accelerates Loading HDFS

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

9

6x Faster

Enabling Storage Savings and Accelerating Performance with DMExpress

bull Load data faster into HDFS bull Store twice as much data on the cluster bull Improve overall performance by pre-sorting cleansing and

partitioning bull Achieve higher rate of parallelism bull Realize up to 75TB of data storage savings a month

DMExpress is enabling comScore to

32

B r

eco

rds

d

ay

Load files Cleansesort compress partition

Load to HDFS

Post-processing amp analysis

DMExpress Node

Node

Node

Node

HD

FS

Hadoop

10

11

Michael Brown Chief Scientist comScore

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache

Hadoop (JIRA MAPREDUCE-2454)

ndash Allow external sort to be plugged in

ndash Improve developer productivity

bull Develop MapReduce jobs via DMExpress GUI

ndash Aggregations cleansingfiltering reformatting

etc

ndash Seamlessly accelerate MapReduce performance

bull Replace Map output sorter

bull Replace Reduce input sorter

httpsissuesapacheorgjirabrowseMAPREDUCE-2454

Syncsort Confidential and Proprietary - do not copy or distribute 12

DMExpress Accelerates HDFS Loading

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

13 Syncsort Confidential and Proprietary - do not copy or distribute

6x Faster

Accelerate Development amp Remove Barriers to Adoption

Use DMExpress to Accelerate Development and Optimize MapReduce Jobs

MapReduce Development

Χ Lots of manual coding

Χ MapReduce Pig Java

Χ Limited skills supply

Χ Heavy learning curve

DMExpress Hadoop Edition

No coding required

Leverages the same skills most IT organizations already have

New resources can be trained in just 3 days

Syncsort Confidential and Proprietary - do not copy or distribute 14

Native MapReduce DMExpress Execution

DMExpress Hadoop is not

generating code (ie Java Pig

Python)

DMExpress Hadoop runs native

on each data node on the cluster

ndash DMExpress is installed on each

data node

ndash Same benefits as High-performance

ETL

Issues with code generation

ndash Requires re-compilation with every

change

ndash May still require MR skills

ndash Ongoing issues with efficiency of

generated code

15 Sy

nc

so

rt

Co

nfi

de

nti

al

an

d

Pr

op

rie

tar

y -

do

no

t

co

py

or

dis

tri

bu

te

DMX DMX DMX DMX

Hadoop Cluster

DMX

0

500

1000

1500

2000

2500

3000

0 500 1000 1500 2000 2500 3000

Elap

sed

Tim

e (s

ec)

File Size (GB)

TPC-H - Aggregation

Java

Pig

DMExpress

DMExpress Hadoop Edition Provides Significant Performance Improvements

TPC-H Benchmark

ndash Filter amp Aggregation

ndash GZIP compression

ndash Uncompressed input file size from 100GB to 24TB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH3U2

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Read 870MBs Write 660MBs

ndash Memory 94 GB

TPC-H Benchmark

16 Syncsort Confidential and Proprietary - do not copy or distribute

Almost 2x Faster than

Java Over 2x Faster Pig

17

Conclusion

Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits

High performance HDFS load and extract

ndash DMExpress partitioning allows taking advantage of

full network bandwidth

ndash High performance parallel load from HDFS to GP

DB

Integration with diverse set of sources

ndash Files DBMS mainframe

Ease of development (GUI vs JavaPig)

High performance ETL operations (MapReduce)

ndash Aggregation sort filter copy reformatting join

merge

Seamless high performance sort

18 Syncsort Confidential and Proprietary - do not copy or distribute

Thank you

Page 5: Syncsort et le retour d'expérience ComScore

Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption

HDFS connectivity Ability to move data in amp out of Hadoop file system

Enhanced usability Ability to create jobs using DMExpress graphical user interface and run it in the Hadoop MapReduce framework

Contribute to the Open Source Community Enhance Hadoop sort framework for everyone Make it more modular flexible extensible

Accelerate Hadoop Address existing drawbacks in Hadoop native sort by providing a simple self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance

5 Syncsort Confidential and Proprietary - do not copy or distribute

Optimizing Hadoop Deployments

DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments

Extract Preprocess amp Compress Load

RDBMS

Appliances

Cloud

XML

Mainframe

Files

Data Node

Data Node

Data Node

Data Node

HD

FS

Sort Aggregate Join

Compress Partition

0

50

100

150

Load

Tim

e (

min

) Elapsed Processing Time

HDFS Put DMExpress

Connect to virtually any source

Pre-process data to cleanse validate amp partition for better and faster Hadoop processing and significant storage savings

Load data up to 6x faster

6

DMExpress – HDFS Connectivity

[Diagram: Input, DMExpress, HDFS]

Load HDFS
– Partition the output for parallel loading
– Makes full use of network bandwidth with reduced elapsed time
– Hadoop/DMExpress can process wildcard input files from HDFS

Extract HDFS
– DMExpress can read wildcard inputs in parallel

Distributions supported
– Cloudera CDH3u3
– Hortonworks Data Platform 1.0.7
– Greenplum HD 1.1
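The two ideas on this slide, partitioning the output so it can be written in parallel and reading wildcard inputs in parallel, can be illustrated with a small Python sketch. This is not DMExpress or an HDFS client: a local temporary directory stands in for HDFS, and every name (`partition_records`, `parallel_load`, `read_wildcard`, the `part-NNNNN` file naming, the tab-separated record layout) is a hypothetical choice made for the illustration.

```python
import glob
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_records(records, num_partitions):
    """Split records into num_partitions lists using a stable hash of the key.

    The key is assumed to be the text before the first tab.
    """
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key = record.split("\t", 1)[0]
        # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


def write_partition(out_dir, index, records):
    """Write one partition file; stands in for one parallel output stream."""
    path = os.path.join(out_dir, f"part-{index:05d}")
    with open(path, "w", encoding="utf-8") as handle:
        handle.writelines(line + "\n" for line in records)
    return path


def parallel_load(records, out_dir, num_partitions=20):
    """Write all partitions concurrently and return the file paths."""
    partitions = partition_records(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [
            pool.submit(write_partition, out_dir, i, part)
            for i, part in enumerate(partitions)
        ]
        return [future.result() for future in futures]


def read_wildcard(pattern):
    """Read every file matching a wildcard pattern in parallel."""
    def read_one(path):
        with open(path, encoding="utf-8") as handle:
            return handle.read().splitlines()

    with ThreadPoolExecutor() as pool:
        chunks = list(pool.map(read_one, sorted(glob.glob(pattern))))
    return [line for chunk in chunks for line in chunk]


if __name__ == "__main__":
    rows = [f"user{i % 50}\tevent{i}" for i in range(1000)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = parallel_load(rows, tmp)
        loaded = read_wildcard(os.path.join(tmp, "part-*"))
        print(len(paths), "partition files,", len(loaded), "records read back")
```

The default of 20 partitions mirrors the partition count quoted in the load benchmarks of this deck; hashing on the key keeps all records for one key in the same partition, which is one common choice and not necessarily the one DMExpress makes.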

DMExpress Accelerates Loading HDFS

HDFS Load
– 20 partitions
– Uncompressed input file size from 10 GB to 100 GB

Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB

Hardware Specifications (Per Node)
– Red Hat EL 5.8
– Intel Xeon X5670 x 2
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB

HDFS Load using DMExpress: 3x-6x faster

DMExpress Accelerates Loading HDFS

HDFS Load
– 20 partitions
– Uncompressed input file size from 100 GB to 2100 GB

Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB

Hardware Specifications (Per Node)
– Red Hat EL 5.8
– Intel Xeon X5670 x 2
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB

HDFS Load using DMExpress: 6x faster
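To put the cluster figures in perspective, the following Python sketch computes a rough lower bound on load time when disk write speed is the only limit. It rests on several assumptions that the slide does not state: "10+1+1 nodes" is read as 10 data nodes, the write figure is read as 650 MB/s per node, decimal units are used (1 GB = 1000 MB), and the replication factor is a free parameter (3 is the HDFS default; the value used in the benchmark is not given). The function name `min_load_seconds` is hypothetical.

```python
def min_load_seconds(input_gb, data_nodes, write_mb_per_s, replication=1):
    """Lower bound on load time if every data node writes at its full disk rate.

    Ignores network, CPU and NameNode overhead, so real loads take longer.
    """
    total_mb = input_gb * 1000 * replication  # decimal units: 1 GB = 1000 MB
    return total_mb / (data_nodes * write_mb_per_s)


if __name__ == "__main__":
    for size_gb in (100, 2100):
        single = min_load_seconds(size_gb, 10, 650)
        triple = min_load_seconds(size_gb, 10, 650, replication=3)
        print(f"{size_gb} GB: {single:.0f} s unreplicated, {triple:.0f} s with 3 replicas")
```

A bound like this only says how fast a load could go if all disks were kept busy; the slide's claim is that partitioned parallel loading gets closer to that bound than a single-stream put.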

Enabling Storage Savings and Accelerating Performance with DMExpress

DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and partitioning
• Achieve higher rate of parallelism
• Realize up to 75TB of data storage savings a month

[Diagram: 32B records/day; Load files, then Cleanse, sort, compress, partition (DMExpress), then Load to HDFS (Hadoop nodes), then Post-processing & analysis]

Michael Brown, Chief Scientist, comScore
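One plausible reason that pre-sorting before compression saves storage is that sorting places similar records next to each other, which general-purpose compressors exploit. The Python sketch below illustrates the effect on synthetic data only; it says nothing about comScore's actual data or compression ratios. The record layout and the helper names `make_records` and `gzip_size` are assumptions made for the example.

```python
import gzip
import random


def gzip_size(lines):
    """Return the gzip-compressed size in bytes of the given text lines."""
    return len(gzip.compress("".join(lines).encode("utf-8")))


def make_records(count=20000, distinct_keys=1000, seed=0):
    """Generate synthetic log lines whose keys arrive in random order."""
    rng = random.Random(seed)
    return [f"{rng.randrange(distinct_keys):06d}\tpage_view\n" for _ in range(count)]


if __name__ == "__main__":
    records = make_records()
    unsorted_bytes = gzip_size(records)
    sorted_bytes = gzip_size(sorted(records))
    print(f"unsorted: {unsorted_bytes} bytes, pre-sorted: {sorted_bytes} bytes")
```

On data like this the pre-sorted input compresses to a smaller size because identical keys form long runs; how large the gain is on real data depends entirely on the data.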

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
– Allow external sort to be plugged in
– Improve developer productivity
  • Develop MapReduce jobs via DMExpress GUI
    – Aggregations, cleansing/filtering, reformatting, etc.
– Seamlessly accelerate MapReduce performance
  • Replace Map output sorter
  • Replace Reduce input sorter

https://issues.apache.org/jira/browse/MAPREDUCE-2454
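The idea of a pluggable sort can be shown with a toy, single-process map/sort/reduce in Python. This is only a conceptual sketch: it does not use the Hadoop API or the interfaces introduced by MAPREDUCE-2454, and `run_job`, `merge_sorter`, `word_mapper` and `count_reducer` are hypothetical names. The point is that the sort step is passed in as a parameter, so an alternative sorter can be swapped in without touching the map or reduce logic.

```python
from itertools import groupby
from operator import itemgetter


def run_job(inputs, mapper, reducer, sorter=sorted):
    """Toy map/sort/reduce; `sorter` is the pluggable step."""
    pairs = [pair for item in inputs for pair in mapper(item)]
    ordered = sorter(pairs)  # any callable that returns the pairs ordered by key
    return [
        reducer(key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def word_mapper(line):
    """Emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def count_reducer(key, values):
    """Sum the counts for one key."""
    return (key, sum(values))


def merge_sorter(pairs):
    """Alternative sorter: a simple top-down merge sort on the key."""
    if len(pairs) <= 1:
        return list(pairs)
    mid = len(pairs) // 2
    left, right = merge_sorter(pairs[:mid]), merge_sorter(pairs[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] <= right[j][0]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:]


if __name__ == "__main__":
    lines = ["b a", "a c"]
    print(run_job(lines, word_mapper, count_reducer))
    print(run_job(lines, word_mapper, count_reducer, sorter=merge_sorter))
```

Both calls produce the same word counts, which is the property a drop-in sorter has to preserve: the job's results stay the same while the sort implementation changes.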

DMExpress Accelerates HDFS Loading

HDFS Load
– 20 partitions
– Uncompressed input file size from 100 GB to 2100 GB

Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH4
– HDFS block size: 256 MB

Hardware Specifications (Per Node)
– Red Hat EL 5.8
– Intel Xeon X5670 x 2
– 6 disks/node
– Write: 650 MB/s
– Memory: 94 GB

HDFS Load using DMExpress: 6x faster

Accelerate Development & Remove Barriers to Adoption

Use DMExpress to Accelerate Development and Optimize MapReduce Jobs

MapReduce Development
✗ Lots of manual coding
✗ MapReduce, Pig, Java
✗ Limited skills supply
✗ Heavy learning curve

DMExpress Hadoop Edition
• No coding required
• Leverages the same skills most IT organizations already have
• New resources can be trained in just 3 days

Native MapReduce DMExpress Execution

DMExpress Hadoop is not generating code (i.e. Java, Pig, Python)

DMExpress Hadoop runs native on each data node on the cluster
– DMExpress is installed on each data node
– Same benefits as High-performance ETL

Issues with code generation
– Requires re-compilation with every change
– May still require MR skills
– Ongoing issues with efficiency of generated code

[Diagram: Hadoop Cluster with DMX on each node]

DMExpress Hadoop Edition Provides Significant Performance Improvements

[Chart: TPC-H - Aggregation; elapsed time (sec), axis 0 to 3000, against file size (GB), axis 0 to 3000; series: Java, Pig, DMExpress]

TPC-H Benchmark
– Filter & Aggregation
– GZIP compression
– Uncompressed input file size from 100 GB to 2.4 TB

Cluster Specifications
– Size: 10+1+1 nodes
– Hadoop distribution: CDH3U2
– HDFS block size: 256 MB

Hardware Specifications (Per Node)
– Red Hat EL 5.8
– Intel Xeon X5670 x 2
– 6 disks/node
– Read: 870 MB/s, Write: 660 MB/s
– Memory: 94 GB

TPC-H Benchmark: almost 2x faster than Java, over 2x faster than Pig
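For readers unfamiliar with the workload, a filter-and-aggregation job can be sketched in a few lines of Python. The slide does not say which TPC-H query was run, so the example below is only loosely modeled on the TPC-H lineitem table: the field names follow that table's columns, but the predicate, the grouping and the function names (`filter_map`, `aggregate`) are assumptions for illustration.

```python
from collections import defaultdict


def filter_map(row, max_quantity):
    """Filter step: emit (group key, price) only for rows that pass the predicate."""
    if row["quantity"] <= max_quantity:
        yield (row["returnflag"], row["linestatus"]), row["extendedprice"]


def aggregate(rows, max_quantity):
    """Aggregation step: sum the emitted prices per group key."""
    totals = defaultdict(float)
    for row in rows:
        for key, price in filter_map(row, max_quantity):
            totals[key] += price
    return dict(totals)


if __name__ == "__main__":
    sample = [
        {"returnflag": "A", "linestatus": "F", "quantity": 10, "extendedprice": 100.0},
        {"returnflag": "A", "linestatus": "F", "quantity": 60, "extendedprice": 900.0},
        {"returnflag": "N", "linestatus": "O", "quantity": 5, "extendedprice": 50.0},
        {"returnflag": "A", "linestatus": "F", "quantity": 20, "extendedprice": 25.0},
    ]
    print(aggregate(sample, max_quantity=24))
```

In a MapReduce setting the filter would run in the map phase and the per-key sums in the reduce phase, with the sort between them grouping rows by key.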

Conclusion

Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits

High performance HDFS load and extract
– DMExpress partitioning allows taking advantage of full network bandwidth
– High performance parallel load from HDFS to GP DB

Integration with diverse set of sources
– Files, DBMS, mainframe

Ease of development (GUI vs Java/Pig)

High performance ETL operations (MapReduce)
– Aggregation, sort, filter, copy, reformatting, join, merge

Seamless high performance sort

Thank you

Page 11: Syncsort et le retour d'expérience ComScore

11

Michael Brown Chief Scientist comScore

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache

Hadoop (JIRA MAPREDUCE-2454)

ndash Allow external sort to be plugged in

ndash Improve developer productivity

bull Develop MapReduce jobs via DMExpress GUI

ndash Aggregations cleansingfiltering reformatting

etc

ndash Seamlessly accelerate MapReduce performance

bull Replace Map output sorter

bull Replace Reduce input sorter

httpsissuesapacheorgjirabrowseMAPREDUCE-2454

Syncsort Confidential and Proprietary - do not copy or distribute 12

DMExpress Accelerates HDFS Loading

HDFS Load

ndash 20 partitions

ndash Uncompressed input file size from 100GB to 2100GB

Cluster Specifications

ndash Size 10+1+1 nodes

ndash Hadoop distribution CDH4

ndash HDFS block size 256 MB

Hardware Specifications (Per Node)

ndash Red Hat EL 58

ndash Intel Xeon x5670 2

ndash 6 disksnode

ndash Write 650MBs

ndash Memory 94 GB

HDFS Load using DMExpress

13 Syncsort Confidential and Proprietary - do not copy or distribute

6x Faster

Accelerate Development amp Remove Barriers to Adoption

Use DMExpress to Accelerate Development and Optimize MapReduce Jobs

MapReduce Development

Χ Lots of manual coding

Χ MapReduce Pig Java

Χ Limited skills supply

Χ Heavy learning curve

DMExpress Hadoop Edition

No coding required

Leverages the same skills most IT organizations already have

New resources can be trained in just 3 days

Syncsort Confidential and Proprietary - do not copy or distribute 14

Native MapReduce DMExpress Execution

DMExpress Hadoop is not

generating code (ie Java Pig

Python)

DMExpress Hadoop runs native

on each data node on the cluster

ndash DMExpress is installed on each

data node

ndash Same benefits as High-performance

ETL

Issues with code generation

ndash Requires re-compilation with every

change

ndash May still require MR skills

ndash Ongoing issues with efficiency of

generated code

15 Sy

nc

so

rt

Co

nfi

de

nti

al

an

d

Pr

op

rie

tar

y -

do

no

t

co

py

or

dis

tri

bu

te

DMX DMX DMX DMX

Hadoop Cluster

DMX

0

500

1000

1500

2000

2500

3000

0 500 1000 1500 2000 2500 3000

Elap

sed

Tim

e (s

ec)

File Size (GB)

TPC-H - Aggregation

Java

Pig

DMExpress

DMExpress Hadoop Edition Provides Significant Performance Improvements

TPC-H Benchmark
- Filter & Aggregation
- GZIP compression
- Uncompressed input file size: 100 GB to 2.4 TB

Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH3u2
- HDFS block size: 256 MB

Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon X5670 x 2
- 6 disks/node
- Read: 870 MB/s, Write: 660 MB/s
- Memory: 94 GB

TPC-H Benchmark

Almost 2x faster than Java; over 2x faster than Pig
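The slide describes the workload only as "Filter & Aggregation" on TPC-H data; it does not say which query or columns were used. To make the shape of such a job concrete, here is a minimal Python sketch. It assumes a query loosely modelled on TPC-H Q1 over `lineitem` rows: filter on ship date, then sum quantity and price per (return flag, line status). The column names come from the TPC-H schema; the cutoff date, the sample rows and the function name `filter_and_aggregate` are made up for illustration and are not the benchmark actually run.

```python
from collections import defaultdict
from datetime import date


def filter_and_aggregate(rows, cutoff):
    """Keep rows shipped on or before `cutoff`, then sum per (returnflag, linestatus)."""
    totals = defaultdict(lambda: {"sum_qty": 0, "sum_price": 0.0, "count": 0})
    for row in rows:
        if row["l_shipdate"] > cutoff:      # filter step
            continue
        key = (row["l_returnflag"], row["l_linestatus"])
        agg = totals[key]                   # aggregation step
        agg["sum_qty"] += row["l_quantity"]
        agg["sum_price"] += row["l_extendedprice"]
        agg["count"] += 1
    return dict(sorted(totals.items()))


# Made-up sample rows, for illustration only.
sample = [
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 10,
     "l_extendedprice": 100.0, "l_shipdate": date(1998, 1, 1)},
    {"l_returnflag": "A", "l_linestatus": "F", "l_quantity": 5,
     "l_extendedprice": 50.0, "l_shipdate": date(1998, 6, 15)},
    {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 7,
     "l_extendedprice": 70.0, "l_shipdate": date(1998, 8, 1)},
    {"l_returnflag": "N", "l_linestatus": "O", "l_quantity": 3,
     "l_extendedprice": 30.0, "l_shipdate": date(1998, 12, 1)},
]

result = filter_and_aggregate(sample, date(1998, 9, 2))
```

In a MapReduce job the filter would typically run in the map phase and the per-key sums in the reduce phase, with the framework's sort grouping rows by key in between.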

Conclusion

Syncsort Value proposition on Hadoop

DMExpress Hadoop Edition Benefits

High performance HDFS load and extract
- DMExpress partitioning allows taking advantage of full network bandwidth
- High performance parallel load from HDFS to GP DB

Integration with diverse set of sources
- Files, DBMS, mainframe

Ease of development (GUI vs. Java/Pig)

High performance ETL operations (MapReduce)
- Aggregation, sort, filter, copy, reformatting, join, merge

Seamless high performance sort

Thank you

Page 12: Syncsort et le retour d'expérience ComScore

DMExpress Hadoop Integration

Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
- Allow external sort to be plugged in
- Improve developer productivity
  • Develop MapReduce jobs via DMExpress GUI
    - Aggregations, cleansing/filtering, reformatting, etc.
- Seamlessly accelerate MapReduce performance
  • Replace Map output sorter
  • Replace Reduce input sorter

https://issues.apache.org/jira/browse/MAPREDUCE-2454
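The point of the JIRA item is that the framework's sort step becomes replaceable by an external implementation. The idea can be illustrated outside Hadoop with a toy job runner that takes its sorter as a parameter. The sketch below is not Hadoop's API and not DMExpress code: `run_job`, `default_sorter` and `chunked_merge_sorter` are hypothetical names, and the chunk-and-merge sorter merely stands in for "some other sort implementation". The toy also has a single sort hook, whereas the slide names two (map output and reduce input).

```python
import heapq
from itertools import groupby
from operator import itemgetter


def default_sorter(pairs):
    """Built-in in-memory sort of (key, value) pairs by key."""
    return sorted(pairs, key=itemgetter(0))


def chunked_merge_sorter(pairs, chunk_size=4):
    """Alternative sorter: sort fixed-size runs, then k-way merge them."""
    pairs = list(pairs)
    runs = [sorted(pairs[i:i + chunk_size], key=itemgetter(0))
            for i in range(0, len(pairs), chunk_size)]
    return list(heapq.merge(*runs, key=itemgetter(0)))


def run_job(records, mapper, reducer, sorter=default_sorter):
    """Toy map -> sort -> reduce pipeline with a pluggable sort step."""
    mapped = [pair for record in records for pair in mapper(record)]
    ordered = sorter(mapped)
    return [reducer(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


def word_mapper(line):
    """Emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def count_reducer(word, counts):
    """Sum the counts collected for one word."""
    return (word, sum(counts))


lines = ["b a c", "a b", "a"]
with_default = run_job(lines, word_mapper, count_reducer)
with_plugin = run_job(lines, word_mapper, count_reducer,
                      sorter=chunked_merge_sorter)
```

Because the mapper and reducer never see how the pairs were ordered, swapping the sorter changes performance characteristics but not the job's result, which is what makes the replacement "seamless" from the job author's point of view.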
