Syncsort and Lessons Learned at comScore
High Performance ETL in a BigData Hadoop context
Steven Haddad - Senior Software Architect
Stéphane Heckel - Partner Manager
Hadoop User Group - September 12th 2012
Syncsort - Solving Big Data Breakpoints for 40 years
Company Track Record
• Global Software Company
• 40+ Years of Performance Innovation
• 25+ Patents related to unique and unparalleled integration technology
Large Established Customer Base
• 16,000+ deployments
• 68 Countries
• Across all verticals
Expertise & Specialism
• Leading provider of high-performance data integration solutions
• Data Integration Acceleration and Cost Optimization
• Delivering Cost Reduction Initiatives whilst delivering superior performance
• Typical TCO reduction of 50% - 75%
• Customer ROI within 12 months
• DATA SERVICES
• FINANCE
• INSURANCE & HEALTHCARE
• TRAVEL & TRANSPORT
• RETAIL
• TELECOMMUNICATIONS
A Fully Integrated Architecture for High-performance ETL
User Interface
Task Editor, Job Editor, SDK
Shared File-based Metadata Repository
Data Lineage
Metadata Interchange
Global Search
Impact Analysis
Small Footprint ETL Engine
Self-tuning Optimizer
Native Direct I/O Access
Install in Minutes, Deploy in Weeks, Never Tune Again
High Performance Connectivity
Mainframe, Files, XML
Appliances, Hadoop, Cloud, Real Time
Template-driven Design
DMExpress Server Engine
High Performance Transformations
High Performance Functions
Automatic Continuous Optimization
Syncsort's Hadoop value proposition
Syncsort Value proposition on Hadoop
Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
HDFS connectivity: Ability to move data in & out of the Hadoop file system
Enhanced usability: Ability to create jobs using the DMExpress graphical user interface and run them in the Hadoop MapReduce framework
Contribute to the Open Source Community: Enhance the Hadoop sort framework for everyone. Make it more modular, flexible, extensible
Accelerate Hadoop: Address existing drawbacks in the Hadoop native sort by providing a simple, self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
Syncsort Confidential and Proprietary - do not copy or distribute
Optimizing Hadoop Deployments
DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments
[Diagram: Extract, Preprocess & Compress, Load. Sources: RDBMS, Appliances, Cloud, XML, Mainframe, Files. Preprocessing: Sort, Aggregate, Join, Compress, Partition. Target: HDFS Data Nodes]
[Chart: Elapsed Processing Time. Load Time (min), HDFS Put vs DMExpress]
• Connect to virtually any source
• Pre-process data to cleanse, validate & partition for better and faster Hadoop processing and significant storage savings
• Load data up to 6x faster
DMExpress - HDFS Connectivity
[Diagram: Input, DMExpress, HDFS]
Load HDFS
- Partition the output for parallel loading
- Makes full use of network bandwidth with reduced elapsed time
- Hadoop/DMExpress can process wildcard input files from HDFS
Extract HDFS
- DMExpress can read wildcard inputs in parallel
Distributions supported
- Cloudera CDH3u3
- Hortonworks Data Platform 1.0.7
- Greenplum HD 1.1
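The partitioned load and wildcard extract listed on this slide can be illustrated with a minimal Python sketch. It is not DMExpress or any HDFS API: a local temporary directory stands in for HDFS, and the function names, the `part-NNNNN` file naming, and the `key|payload` record layout are all assumptions made for the example. The idea shown is only that records are split into N partitions, each partition is written by its own worker, and the result is read back through a wildcard pattern.

```python
import glob
import os
import tempfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_records(records, num_partitions):
    """Assign each record to a partition by hashing its key (text before the first '|')."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        key = record.split("|", 1)[0]
        # zlib.crc32 is stable across runs, unlike the built-in hash() for str.
        index = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[index].append(record)
    return partitions


def write_partition(target_dir, index, records):
    """Write one partition file; stands in for one parallel load stream."""
    path = os.path.join(target_dir, f"part-{index:05d}")
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(record + "\n")
    return path


def parallel_load(records, target_dir, num_partitions=4):
    """Partition the records, then write all partitions concurrently."""
    partitions = partition_records(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [
            pool.submit(write_partition, target_dir, i, part)
            for i, part in enumerate(partitions)
        ]
        return [future.result() for future in futures]


def read_wildcard(target_dir, pattern="part-*"):
    """Read every file matching a wildcard pattern, as in a wildcard extract."""
    records = []
    for path in sorted(glob.glob(os.path.join(target_dir, pattern))):
        with open(path, encoding="utf-8") as handle:
            records.extend(line.rstrip("\n") for line in handle)
    return records


if __name__ == "__main__":
    sample = [f"user{i % 7}|event{i}" for i in range(20)]
    with tempfile.TemporaryDirectory() as tmp:
        paths = parallel_load(sample, tmp, num_partitions=4)
        print(len(paths), "partition files,", len(read_wildcard(tmp)), "records read back")
```

Hashing on the key keeps all records for one key in the same partition, which is one common choice; the slides do not say which partitioning scheme DMExpress uses.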
DMExpress Accelerates Loading HDFS
HDFS Load
- 20 partitions
- Uncompressed input file size from 10GB to 100GB
Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH4
- HDFS block size: 256 MB
Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon x5670 x 2
- 6 disks/node
- Write: 650 MB/s
- Memory: 94 GB
[Chart: HDFS Load using DMExpress]
3x-6x Faster
DMExpress Accelerates Loading HDFS
HDFS Load
- 20 partitions
- Uncompressed input file size from 100GB to 2100GB
Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH4
- HDFS block size: 256 MB
Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon x5670 x 2
- 6 disks/node
- Write: 650 MB/s
- Memory: 94 GB
[Chart: HDFS Load using DMExpress]
6x Faster
Enabling Storage Savings and Accelerating Performance with DMExpress
DMExpress is enabling comScore to:
• Load data faster into HDFS
• Store twice as much data on the cluster
• Improve overall performance by pre-sorting, cleansing and partitioning
• Achieve higher rate of parallelism
• Realize up to 75TB of data storage savings a month
[Diagram: 32 B records/day. Load files; cleanse/sort, compress, partition (DMExpress); load to HDFS (Hadoop nodes); post-processing & analysis]
Michael Brown, Chief Scientist, comScore
DMExpress Hadoop Integration
Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
- Allow external sort to be plugged in
- Improve developer productivity
  • Develop MapReduce jobs via DMExpress GUI
    - Aggregations, cleansing/filtering, reformatting, etc.
- Seamlessly accelerate MapReduce performance
  • Replace Map output sorter
  • Replace Reduce input sorter
https://issues.apache.org/jira/browse/MAPREDUCE-2454
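The idea of a pluggable sort can be sketched conceptually in Python. This is not the Hadoop plugin interface from MAPREDUCE-2454 (that work is in Java, and its API is not described in these slides); every name below is hypothetical. The sketch only shows the design point the slide makes: when the sort step is a parameter of the pipeline, a different sorter can replace the default without touching the map or reduce logic.

```python
import heapq
import itertools
from typing import Callable, Iterable, List, Tuple

KeyValue = Tuple[str, int]
Sorter = Callable[[List[KeyValue]], List[KeyValue]]


def builtin_sorter(pairs: List[KeyValue]) -> List[KeyValue]:
    """Default sorter: one in-memory sort by key."""
    return sorted(pairs, key=lambda kv: kv[0])


def chunked_merge_sorter(pairs: List[KeyValue], chunk_size: int = 4) -> List[KeyValue]:
    """Alternative sorter: sort fixed-size chunks, then k-way merge them."""
    chunks = [
        sorted(pairs[i:i + chunk_size], key=lambda kv: kv[0])
        for i in range(0, len(pairs), chunk_size)
    ]
    return list(heapq.merge(*chunks, key=lambda kv: kv[0]))


def word_count(lines: Iterable[str], sorter: Sorter = builtin_sorter) -> List[KeyValue]:
    """Tiny map -> sort -> reduce pipeline; the sort step is the plug point."""
    # Map: emit (word, 1) for every word.
    map_output = [(word, 1) for line in lines for word in line.split()]
    # Sort: delegated to whichever sorter was plugged in.
    sorted_output = sorter(map_output)
    # Reduce: sum the counts of each run of equal keys.
    return [
        (key, sum(count for _, count in group))
        for key, group in itertools.groupby(sorted_output, key=lambda kv: kv[0])
    ]


if __name__ == "__main__":
    text = ["b a c", "a b a"]
    print(word_count(text))
    print(word_count(text, sorter=chunked_merge_sorter))
```

Both calls produce the same counts; only the sorting strategy differs, which is what makes the replacement transparent to the job.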
DMExpress Accelerates HDFS Loading
HDFS Load
- 20 partitions
- Uncompressed input file size from 100GB to 2100GB
Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH4
- HDFS block size: 256 MB
Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon x5670 x 2
- 6 disks/node
- Write: 650 MB/s
- Memory: 94 GB
[Chart: HDFS Load using DMExpress]
6x Faster
Accelerate Development & Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development
✗ Lots of manual coding
✗ MapReduce, Pig, Java
✗ Limited skills supply
✗ Heavy learning curve
DMExpress Hadoop Edition
No coding required
Leverages the same skills most IT organizations already have
New resources can be trained in just 3 days
Native MapReduce DMExpress Execution
DMExpress Hadoop is not generating code (i.e. Java, Pig, Python)
DMExpress Hadoop runs native on each data node on the cluster
- DMExpress is installed on each data node
- Same benefits as High-performance ETL
Issues with code generation
- Requires re-compilation with every change
- May still require MR skills
- Ongoing issues with efficiency of generated code
[Diagram: Hadoop Cluster with DMX installed on each data node]
DMExpress Hadoop Edition Provides Significant Performance Improvements
TPC-H Benchmark
- Filter & Aggregation
- GZIP compression
- Uncompressed input file size from 100GB to 2.4TB
Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH3U2
- HDFS block size: 256 MB
Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon x5670 x 2
- 6 disks/node
- Read: 870 MB/s, Write: 660 MB/s
- Memory: 94 GB
[Chart: TPC-H Benchmark - Aggregation. Elapsed Time (sec) vs File Size (GB), both axes 0 to 3000; series: Java, Pig, DMExpress]
Almost 2x Faster than Java, Over 2x Faster than Pig
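For readers unfamiliar with the workload, the kind of "filter & aggregation over GZIP-compressed input" that the benchmark exercises can be sketched in a few lines of Python. This is a simplified illustration, not the benchmark job itself: the four-column `returnflag|linestatus|quantity|shipdate` layout is an assumption loosely modeled on the TPC-H lineitem table, and the cutoff date and sample rows are invented for the example.

```python
import gzip
import io
from collections import defaultdict
from datetime import date


def filter_and_aggregate(compressed, cutoff):
    """Filter rows by ship date, then sum quantity and count rows per group.

    `compressed` is gzip-compressed, pipe-delimited text with the assumed
    columns: returnflag|linestatus|quantity|shipdate (ISO date).
    """
    totals = defaultdict(lambda: [0.0, 0])
    with gzip.open(io.BytesIO(compressed), "rt", encoding="utf-8") as rows:
        for row in rows:
            returnflag, linestatus, quantity, shipdate = row.rstrip("\n").split("|")
            # Filter step: drop rows shipped after the cutoff.
            if date.fromisoformat(shipdate) > cutoff:
                continue
            # Aggregation step: accumulate per (returnflag, linestatus) group.
            entry = totals[(returnflag, linestatus)]
            entry[0] += float(quantity)
            entry[1] += 1
    return {key: (value[0], value[1]) for key, value in sorted(totals.items())}


if __name__ == "__main__":
    sample_rows = "\n".join([
        "A|F|10|1998-01-01",
        "A|F|5|1998-06-01",
        "N|O|7|1998-12-15",
        "R|F|3|1998-08-30",
    ]) + "\n"
    data = gzip.compress(sample_rows.encode("utf-8"))
    for group, (quantity_sum, row_count) in filter_and_aggregate(data, date(1998, 9, 2)).items():
        print(group, quantity_sum, row_count)
```

In a MapReduce setting the filter would run in the map phase and the per-group sums in the reduce phase; the single-process version above only shows the logic being measured.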
Conclusion
Syncsort Value proposition on Hadoop
DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
- DMExpress partitioning allows taking advantage of full network bandwidth
- High performance parallel load from HDFS to GP DB
Integration with diverse set of sources
- Files, DBMS, mainframe
Ease of development (GUI vs Java/Pig)
High performance ETL operations (MapReduce)
- Aggregation, sort, filter, copy, reformatting, join, merge
Seamless high performance sort
Thank you
Syncsort Goes Beyond Basic Connectivity to Enhance Hadoop and Facilitate Wider Adoption
HDFS connectivity Ability to move data in amp out of Hadoop file system
Enhanced usability Ability to create jobs using DMExpress graphical user interface and run it in the Hadoop MapReduce framework
Contribute to the Open Source Community Enhance Hadoop sort framework for everyone Make it more modular flexible extensible
Accelerate Hadoop Address existing drawbacks in Hadoop native sort by providing a simple self-tuning alternative to increase overall MapReduce performance and facilitate ongoing development and maintenance
5 Syncsort Confidential and Proprietary - do not copy or distribute
Optimizing Hadoop Deployments
DMExpress delivers high-performance connectivity and processing capabilities to optimize Hadoop environments
Extract Preprocess amp Compress Load
RDBMS
Appliances
Cloud
XML
Mainframe
Files
Data Node
Data Node
Data Node
Data Node
HD
FS
Sort Aggregate Join
Compress Partition
0
50
100
150
Load
Tim
e (
min
) Elapsed Processing Time
HDFS Put DMExpress
Connect to virtually any source
Pre-process data to cleanse validate amp partition for better and faster Hadoop processing and significant storage savings
Load data up to 6x faster
6
DMExpress ndash HDFS Connectivity
HDFS
DMExpress
Input
Load HDFS
ndash Partition the output for parallel loading
ndash Makes full use of network bandwidth with
reduced elapsed time
ndash HadoopDMExpress can process wildcard
input files from HDFS
Extract HDFS
ndash DMExpress can read wildcard inputs in
parallel 7
Distributions supported
ndash Cloudera CDH3u3
ndash Hortonworks Data Platform 107
ndash Greenplum HD 11
DMExpress Accelerates Loading HDFS
HDFS Load
ndash 20 partitions
ndash Uncompressed input file size from 10GB to 100GB
Cluster Specifications
ndash Size 10+1+1 nodes
ndash Hadoop distribution CDH4
ndash HDFS block size 256 MB
Hardware Specifications (Per Node)
ndash Red Hat EL 58
ndash Intel Xeon x5670 2
ndash 6 disksnode
ndash Write 650MBs
ndash Memory 94 GB
HDFS Load using DMExpress
8
3x-6x Faster
DMExpress Accelerates Loading HDFS
HDFS Load
ndash 20 partitions
ndash Uncompressed input file size from 100GB to 2100GB
Cluster Specifications
ndash Size 10+1+1 nodes
ndash Hadoop distribution CDH4
ndash HDFS block size 256 MB
Hardware Specifications (Per Node)
ndash Red Hat EL 58
ndash Intel Xeon x5670 2
ndash 6 disksnode
ndash Write 650MBs
ndash Memory 94 GB
HDFS Load using DMExpress
9
6x Faster
Enabling Storage Savings and Accelerating Performance with DMExpress
bull Load data faster into HDFS bull Store twice as much data on the cluster bull Improve overall performance by pre-sorting cleansing and
partitioning bull Achieve higher rate of parallelism bull Realize up to 75TB of data storage savings a month
DMExpress is enabling comScore to
32
B r
eco
rds
d
ay
Load files Cleansesort compress partition
Load to HDFS
Post-processing amp analysis
DMExpress Node
Node
Node
Node
HD
FS
Hadoop
10
11
Michael Brown Chief Scientist comScore
DMExpress Hadoop Integration
Contribute MapReduce code changes to Apache Hadoop (JIRA MAPREDUCE-2454)
- Allow external sort to be plugged in
- Improve developer productivity
  • Develop MapReduce jobs via DMExpress GUI
    - Aggregations, cleansing/filtering, reformatting, etc.
- Seamlessly accelerate MapReduce performance
  • Replace Map output sorter
  • Replace Reduce input sorter
https://issues.apache.org/jira/browse/MAPREDUCE-2454
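What "allowing an external sort to be plugged in" means can be sketched conceptually in Python. This is not the Hadoop API and not the MAPREDUCE-2454 patch: the `Sorter` type, `run_map_task`, `run_reduce_task` and `external_sorter` are hypothetical names used to show a framework that takes its map-output and reduce-input sorter as a parameter instead of hard-coding one.

```python
from operator import itemgetter
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, int]
Sorter = Callable[[List[Record]], List[Record]]


def default_sorter(records: List[Record]) -> List[Record]:
    """Built-in sorter: order records by key."""
    return sorted(records, key=lambda record: record[0])


def external_sorter(records: List[Record]) -> List[Record]:
    """Stand-in for a third-party sort implementation (assumption)."""
    return sorted(records, key=itemgetter(0))


def run_map_task(lines: Iterable[str], sorter: Sorter = default_sorter) -> List[Record]:
    """Emit (word, 1) pairs, then sort the map output with the plugged-in sorter."""
    output = [(word, 1) for line in lines for word in line.split()]
    return sorter(output)


def run_reduce_task(map_outputs: Iterable[List[Record]],
                    sorter: Sorter = default_sorter) -> Dict[str, int]:
    """Sort the combined reduce input with the plugged-in sorter, then sum per key."""
    merged = sorter([record for output in map_outputs for record in output])
    totals: Dict[str, int] = {}
    for key, count in merged:
        totals[key] = totals.get(key, 0) + count
    return totals


if __name__ == "__main__":
    splits = [["b a", "a c"], ["c c b"]]
    map_outputs = [run_map_task(split, sorter=external_sorter) for split in splits]
    print(run_reduce_task(map_outputs, sorter=external_sorter))
```

The job logic (emit pairs, sum per key) is untouched when the sorter is swapped, which is the sense in which a replacement sorter can be "seamless" for existing MapReduce jobs.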
Syncsort Confidential and Proprietary - do not copy or distribute
Accelerate Development & Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development
✗ Lots of manual coding
✗ MapReduce, Pig, Java
✗ Limited skills supply
✗ Heavy learning curve
DMExpress Hadoop Edition
No coding required
Leverages the same skills most IT organizations already have
New resources can be trained in just 3 days
Native MapReduce DMExpress Execution
DMExpress Hadoop is not generating code (i.e., Java, Pig, Python)
DMExpress Hadoop runs natively on each data node on the cluster
- DMExpress is installed on each data node
- Same benefits as High-performance ETL
Issues with code generation
- Requires re-compilation with every change
- May still require MR skills
- Ongoing issues with efficiency of generated code
[Diagram: Hadoop Cluster with DMX running on each node]
[Chart: TPC-H - Aggregation. Elapsed Time (sec) vs. File Size (GB), both axes 0 to 3000; series: Java, Pig, DMExpress]
DMExpress Hadoop Edition Provides Significant Performance Improvements
TPC-H Benchmark
- Filter & Aggregation
- GZIP compression
- Uncompressed input file size from 100GB to 2.4TB
Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH3U2
- HDFS block size: 256 MB
Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon X5670 x2
- 6 disks/node
- Read: 870 MB/s, Write: 660 MB/s
- Memory: 94 GB
Almost 2x Faster than Java, Over 2x Faster than Pig
Conclusion
Syncsort Value proposition on Hadoop
DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
- DMExpress partitioning allows taking advantage of full network bandwidth
- High performance parallel load from HDFS to GP DB
Integration with diverse set of sources
- Files, DBMS, mainframe
Ease of development (GUI vs. Java/Pig)
High performance ETL operations (MapReduce)
- Aggregation, sort, filter, copy, reformatting, join, merge
Seamless high performance sort
Thank you
DMExpress Hadoop Integration
Contribute MapReduce code changes to Apache
Hadoop (JIRA MAPREDUCE-2454)
ndash Allow external sort to be plugged in
ndash Improve developer productivity
bull Develop MapReduce jobs via DMExpress GUI
ndash Aggregations cleansingfiltering reformatting
etc
ndash Seamlessly accelerate MapReduce performance
bull Replace Map output sorter
bull Replace Reduce input sorter
httpsissuesapacheorgjirabrowseMAPREDUCE-2454
Syncsort Confidential and Proprietary - do not copy or distribute 12
DMExpress Accelerates HDFS Loading
HDFS Load
ndash 20 partitions
ndash Uncompressed input file size from 100GB to 2100GB
Cluster Specifications
ndash Size 10+1+1 nodes
ndash Hadoop distribution CDH4
ndash HDFS block size 256 MB
Hardware Specifications (Per Node)
ndash Red Hat EL 58
ndash Intel Xeon x5670 2
ndash 6 disksnode
ndash Write 650MBs
ndash Memory 94 GB
HDFS Load using DMExpress
13 Syncsort Confidential and Proprietary - do not copy or distribute
6x Faster
Accelerate Development amp Remove Barriers to Adoption
Use DMExpress to Accelerate Development and Optimize MapReduce Jobs
MapReduce Development
Χ Lots of manual coding
Χ MapReduce Pig Java
Χ Limited skills supply
Χ Heavy learning curve
DMExpress Hadoop Edition
No coding required
Leverages the same skills most IT organizations already have
New resources can be trained in just 3 days
Syncsort Confidential and Proprietary - do not copy or distribute 14
Native MapReduce DMExpress Execution
DMExpress Hadoop is not
generating code (ie Java Pig
Python)
DMExpress Hadoop runs native
on each data node on the cluster
ndash DMExpress is installed on each
data node
ndash Same benefits as High-performance
ETL
Issues with code generation
ndash Requires re-compilation with every
change
ndash May still require MR skills
ndash Ongoing issues with efficiency of
generated code
15 Sy
nc
so
rt
Co
nfi
de
nti
al
an
d
Pr
op
rie
tar
y -
do
no
t
co
py
or
dis
tri
bu
te
DMX DMX DMX DMX
Hadoop Cluster
DMX
0
500
1000
1500
2000
2500
3000
0 500 1000 1500 2000 2500 3000
Elap
sed
Tim
e (s
ec)
File Size (GB)
TPC-H - Aggregation
Java
Pig
DMExpress
DMExpress Hadoop Edition Provides Significant Performance Improvements
TPC-H Benchmark
- Filter & Aggregation
- GZIP compression
- Uncompressed input file size from 100 GB to 2.4 TB
Cluster Specifications
- Size: 10+1+1 nodes
- Hadoop distribution: CDH3U2
- HDFS block size: 256 MB
Hardware Specifications (Per Node)
- Red Hat EL 5.8
- Intel Xeon X5670 x 2
- 6 disks/node
- Read: 870 MB/s, Write: 660 MB/s
- Memory: 94 GB
TPC-H Benchmark
Almost 2x faster than Java, over 2x faster than Pig
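The slide does not say which TPC-H query was run, only that the workload is a filter followed by an aggregation. The sketch below shows that shape of computation in plain Python. The rows, field names and cutoff date are hypothetical, loosely modeled on the TPC-H lineitem table, and are not taken from the benchmark itself.

```python
from collections import defaultdict
from datetime import date

# Hypothetical sample rows; a real run would read the generated TPC-H data.
rows = [
    {"returnflag": "A", "quantity": 10, "shipdate": date(1998, 8, 1)},
    {"returnflag": "N", "quantity": 5, "shipdate": date(1998, 9, 15)},
    {"returnflag": "A", "quantity": 7, "shipdate": date(1998, 12, 5)},
    {"returnflag": "N", "quantity": 3, "shipdate": date(1998, 7, 20)},
]


def filter_and_aggregate(rows, cutoff):
    """Keep rows shipped on or before `cutoff`, then sum quantity per flag."""
    totals = defaultdict(int)
    for row in rows:
        if row["shipdate"] <= cutoff:                      # filter step
            totals[row["returnflag"]] += row["quantity"]   # aggregation step
    return dict(totals)


result = filter_and_aggregate(rows, cutoff=date(1998, 9, 2))
print(result)
```

In a MapReduce job the filter would typically run in the map phase and the per-key sum in the reduce phase, with the sort in between.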
Conclusion
Syncsort Value proposition on Hadoop
DMExpress Hadoop Edition Benefits
High performance HDFS load and extract
- DMExpress partitioning allows taking advantage of full network bandwidth
- High performance parallel load from HDFS to GP DB
Integration with diverse set of sources
- Files, DBMS, mainframe
Ease of development (GUI vs. Java/Pig)
High performance ETL operations (MapReduce)
- Aggregation, sort, filter, copy, reformatting, join, merge
Seamless high performance sort
Thank you