
Dell | Cloudera | Syncsort Data Warehouse Optimization – ETL Offload
Drive operational efficiency and lower data transformation costs with a Reference Architecture for an end-to-end optimization and offload solution.

A Dell Big Data White Paper by Armando Acosta, SME, Product Manager, Dell Big Data Hadoop Solutions

Data transformation costs are on the rise
Today’s enterprises are struggling to ingest, store, process, transform and analyze data to build insights that turn into business value. Many Dell customers have turned to Hadoop to help solve these data challenges.

At Dell, we recognize the need to help our customers better define Hadoop use case architectures to cut cost and gain operational efficiency. With those objectives in mind, we worked with our partners Intel, Cloudera and Syncsort to introduce the use case-based Reference Architecture for Data Warehouse Optimization for ETL Offload.

ETL (Extract, Transform, Load) is the process by which raw data is moved from source systems, manipulated into a consumable format, and loaded into a target system for advanced analytics, analysis and reporting. Shifting this work into Hadoop can help your organization lower costs and increase efficiency: batch windows shrink, data is fresher, and queries run faster because the EDW is no longer bogged down in data transformation jobs.
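To make the ETL pattern concrete, the following is a minimal, single-machine sketch of the three stages in Python. The file name, field layout and target table are hypothetical illustrations, not part of the Reference Architecture; in the offload scenario described in this paper, it is the transform stage that moves into the Hadoop cluster.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw rows from a source-system export (hypothetical CSV layout)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: cleanse and reshape rows into a consumable, analysis-ready format."""
    out = []
    for r in rows:
        out.append({
            "order_id": int(r["order_id"]),
            "customer": r["customer"].strip().upper(),
            "amount_usd": round(float(r["amount"]), 2),
        })
    return out

def load(rows, db_path):
    """Load: write the transformed rows into a target table for reporting."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders "
                "(order_id INTEGER, customer TEXT, amount_usd REAL)")
    con.executemany("INSERT INTO orders VALUES (:order_id, :customer, :amount_usd)", rows)
    con.commit()
    con.close()

if __name__ == "__main__":
    # Hypothetical source file and target database, for illustration only.
    load(transform(extract("orders_export.csv")), "warehouse.db")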


Build Your Hadoop

Dell Reference Architectures:
• 2011 - CDH 3 v1.4
• 2012 - CDH 3 v1.5, 1.6
• 2012 - CDH 4, 4.1
• 2013 - CDH 4.2, 4.5
• 2014 - CDH 5, 5.1
• 2015 - CDH 5.3, 5.4

Dell PowerEdge Cloudera Certified:
• 2011 - PowerEdge C2100
• 2012 - PowerEdge R720/R720XD
• 2014 - PowerEdge R730/R730XD

Traditional ETL tools have not been able to handle the data growth of the past decade, forcing organizations to shift transformation work into the enterprise data warehouse (EDW). This has caused significant pain for customers, resulting in 70 percent of all data warehouses being performance and capacity constrained.¹ EDWs are now unable to keep up with their most important demands: business reporting and analysis. Additionally, data transformation jobs are very expensive to run in an EDW, given larger data sets and a growing number of data sources, and scaling EDW environments is cost prohibitive.

¹ Source: Gartner.

Augment the EDW with Hadoop
The first use case in the big data journey typically begins with a goal to increase operational efficiency. Dell customers understand that they can use Hadoop to cut costs, yet they have asked us to make it simple. They want defined architectures that provide end-to-end solutions validated and engineered to work together.

The Dell | Cloudera | Syncsort Data Warehouse Optimization – ETL Offload Reference Architecture (RA) provides a blueprint to help your organization build an environment to augment your EDW. The RA provides the architecture, beginning from bare-metal hardware, for running ETL jobs in Cloudera Enterprise with Syncsort DMX-h software. Dell provides the cluster architecture, including configuration sizing for the edge nodes that ingest data and for the data nodes that do the data transformation work. Network configuration and setup are included in the RA to enable a ready-to-use Hadoop cluster.

Many of our customers have a skill-set gap when it comes to using Hadoop for ETL in their environments, and they don’t have time to build up Hadoop expertise. The software components of the Reference Architecture help address this challenge: they make it easy, even for non-data-scientists, to build and deploy ETL jobs in Hadoop.

The Syncsort software closes the skills gap between Hadoop and enterprise ETL, turning Hadoop into a more robust and feature-rich ETL solution. Syncsort’s high-performance ETL software enables your users to maximize the benefits of MapReduce without compromising on the capabilities and ease of use of conventional ETL tools. With Syncsort Hadoop ETL solutions, your organization can unleash Hadoop’s full potential, leveraging the only architecture that runs ETL processes natively within Hadoop. Syncsort software enables faster time to value by reducing the need to develop expertise in Pig, Hive and Sqoop, technologies that are otherwise essential for creating ETL jobs in MapReduce.
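For context on the kind of hand coding such tools abstract away, below is a minimal MapReduce-style aggregation (total amount per customer) written as Hadoop Streaming mapper and reducer scripts in Python, submitted with the standard Hadoop Streaming jar. The tab-separated input layout is an assumption made only for this sketch.

```python
#!/usr/bin/env python3
# Hand-coded MapReduce-style aggregation for Hadoop Streaming:
# run the same script as the mapper ("map") and as the reducer ("reduce").
import sys

def mapper():
    # Emit key<TAB>value pairs; assumes tab-separated input: customer, amount, ...
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            print(f"{fields[0]}\t{fields[1]}")

def reducer():
    # Streaming delivers input sorted by key; sum the values for each key.
    current, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t", 1)
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0.0
        total += float(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    mapper() if role == "map" else reducer()
```

Even this simple job assumes familiarity with key/value emission, shuffle-and-sort behavior and job submission; multi-step joins and lookups grow considerably more involved, which is the expertise gap the DMX-h graphical design approach is intended to close.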

How did we get here?
In the 1990s there was a vision of the enterprise data warehouse, a single, consistent version of the truth for all corporate data. At the core of the vision was a process through which organizations could take data from multiple transactional applications, transform it into a format suitable for analysis—with operations such as sorting, aggregating, and joining—and then load it into the data warehouse.

The continued growth of data warehousing and the rise of relational databases led to the development of ETL tools purpose built for managing the increasing complexity and variety of applications and sources involved in data warehouses. These tools usually run on dedicated systems as a back-end part of the overall data warehouse environment.

However, users got addicted to data, and early success resulted in greater demands for information:

• Data sources multiplied in number

• Data volumes grew exponentially

• Businesses demanded fresher data

• Mobile technologies, cloud computing and social media opened the doors for new types of users who demanded different, readily available views of the data


To cope with this demand, users were forced to push transformations down into the data warehouse, in many cases reverting to hand coding. Because ETL tools couldn’t cope with core operations such as sort, join and aggregation on increasing data volumes, the data warehouse architecture turned into a very different reality: a spaghetti architecture with data transformations scattered all over the place.

This has caused a major performance and capacity problem for organizations. The agility and costs of the data warehouse have been impacted by:

• An increasing number of data sources

• New, unstructured data sources

• Exponential growth in data volumes

• Demands for fresher data

• The need for increased processing capacity

The scalability and low storage cost of Hadoop are attractive to many data warehouse installations. Hadoop can be used as a complement to data warehousing activities, including batch processing, data archiving and the handling of unstructured data sources. When organizations consider Hadoop, offloading ETL workloads is one of the common starting points.

Shifting ETL processing from the EDW to Hadoop and its supporting infrastructure offers three key benefits. It helps you:

• Achieve significant improvements in business agility

• Save money and defer unsustainable costs (particularly costly EDW upgrades needed just to keep the lights on)

• Free up EDW capacity for faster queries and other workloads more suitable for the EDW

The Dell | Cloudera | Syncsort Data Warehouse Optimization – ETL Offload Reference Architecture is engineered to help our customers take the first step in the big data journey. It provides a validated architecture that helps return your data warehouse to what it was meant to do. Additionally, the Dell solution delivers faster time to value with Hadoop. Dell understands that Hadoop is not easy; without the right tools, designing, developing and maintaining a Hadoop cluster can drain time, resources and money.

Hadoop requires new skills that are in high demand (and expensive). Offloading heavy ETL processes to Hadoop provides high ROI and delivers operational savings, while allowing your organization to build the skills required to manage and maintain your enterprise data hub (EDH). The Dell | Cloudera | Syncsort solution is built to meet all these needs.


Faster time to value
The Dell | Cloudera | Syncsort Data Warehouse Optimization – ETL Offload Reference Architecture provides a blueprint to help you build an environment to augment your EDW. This Reference Architecture can help you reduce Hadoop deployment to weeks, develop Hadoop ETL jobs within hours and become fully productive within days. Dell, together with Cloudera, Syncsort and Intel, takes the hard work out of building, deploying, tuning, configuring and optimizing Hadoop environments.

The solution is based on Dell™ PowerEdge™ R730 and R730xd servers, Dell’s latest 13th-generation two-socket, 2U rack servers designed to run complex workloads using highly scalable memory, I/O capacity and flexible network options. Both systems feature the Intel® Xeon® processor E5-2600 v3 product family (Haswell-EP), up to 24 DIMMs, PCI Express® (PCIe) 3.0 expansion slots and a choice of network interface technologies. The PowerEdge R730 is a platform well suited to Hadoop, flexible enough to run balanced, CPU-intensive or memory-intensive Hadoop workloads.

Built with the Cloudera Enterprise Data Hub, the Cloudera Distribution of Hadoop (CDH) delivers the core elements of Hadoop—scalable storage and distributed computing—as well as the necessary enterprise capabilities, such as security, high availability and integration with a large set of ecosystem tools. Cloudera Enterprise also includes Cloudera Manager, a best-in-class holistic interface that provides end-to-end system management and key enterprise features, delivering granular visibility into and control over every part of an enterprise data hub. For tighter integration and ease of management, Syncsort has a dedicated tab in Cloudera Manager to monitor DMX-h.
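As one illustration of the programmatic visibility Cloudera Manager exposes, the sketch below polls its REST API for the state and health of each service in a cluster. The host, cluster name, credentials and API version shown here are placeholders; the exact endpoints and response fields should be confirmed against the Cloudera Manager API documentation for your release.

```python
import requests  # third-party HTTP client (pip install requests)

# Placeholder connection details; adjust for your Cloudera Manager deployment.
CM_HOST = "http://cm-host.example.com:7180"
CLUSTER = "cluster1"
AUTH = ("admin", "admin")

def service_health():
    """List each Hadoop service in the cluster with its reported state and health."""
    url = f"{CM_HOST}/api/v10/clusters/{CLUSTER}/services"
    resp = requests.get(url, auth=AUTH, timeout=30)
    resp.raise_for_status()
    for svc in resp.json().get("items", []):
        print(svc.get("name"), svc.get("serviceState"), svc.get("healthSummary"))

if __name__ == "__main__":
    service_health()
```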

A key piece of the architecture is the Syncsort DMX-h software. Syncsort DMX-h is designed from the ground up to remove barriers to mainstream Hadoop adoption and deliver the best end-to-end approach for shifting heavy workloads into Hadoop. DMX-h provides all the connectivity you need to build your enterprise data hub.

An intelligent execution layer allows you to design sophisticated data transformations, focusing solely on business rules, not on the underlying platform or execution framework. This unique architecture “future-proofs” the process of collecting, blending, transforming and distributing data—providing a consistent user experience while still taking advantage of the powerful native performance of the evolving compute frameworks that run on Hadoop.

Syncsort also has developed a unique utility, SILQ, which takes a SQL script as an input and then provides a detailed flow chart of the entire data flow. Using an intuitive web-based interface, you can easily drill down to get detailed information about each step within the data flow, including tables and data transformations. SILQ even offers hints and best practices to develop equivalent transformations using Syncsort DMX-h, a unique solution for Hadoop ETL that eliminates the need for custom code, delivers smarter connectivity to all your data and improves Hadoop’s processing efficiency.

One of the biggest barriers to offloading from the data warehouse into Hadoop has been a legacy of thousands of scripts built and extended over time. Understanding and documenting massive amounts of SQL code, and then mastering the advanced programming skills needed to offload these transformations, has left many organizations reluctant to move. SILQ removes this roadblock, eliminating the complexity and risk.
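SILQ itself is a proprietary utility, but a toy example helps illustrate the inventory problem it addresses: before any offload, an organization must know which tables its legacy SQL scripts read and write. The regular expressions below are a rough, illustrative heuristic only, and the script directory is hypothetical; this is not how SILQ works internally.

```python
import re
from pathlib import Path

# Very rough heuristics for table references; real SQL lineage needs a proper parser.
READ_RE = re.compile(r"\b(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
WRITE_RE = re.compile(r"\b(?:INSERT\s+INTO|UPDATE|CREATE\s+TABLE)\s+([\w.]+)", re.IGNORECASE)

def inventory(script_dir):
    """Summarize which tables each legacy SQL script appears to read and write."""
    for path in sorted(Path(script_dir).glob("*.sql")):
        sql = path.read_text()
        reads = sorted(set(READ_RE.findall(sql)))
        writes = sorted(set(WRITE_RE.findall(sql)))
        print(f"{path.name}: reads={reads} writes={writes}")

if __name__ == "__main__":
    inventory("legacy_sql")  # hypothetical directory of warehouse ETL scripts
```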

Dell Services can add further velocity to the solution through implementation services for ETL offload, as well as Hadoop Administration Services designed to support your needs from inception to steady state.

The Dell | Cloudera | Syncsort Data Warehouse Optimization – ETL Offload solution
At the foundation of the solution is the Hadoop cluster powered by Cloudera Enterprise. The Hadoop cluster is divided into infrastructure and data nodes. The infrastructure nodes are the hardware required for the core operations of the cluster. The administration node provides deployment, configuration management and monitoring of the cluster, while the name nodes provide Hadoop Distributed File System (HDFS) directory and MapReduce job tracking services.

Hadoop Cluster Architecture
The edge node acts as a gateway to the cluster, and runs the Cloudera Manager server and various Hadoop client tools. In the RA, the edge nodes are also used for data ingest, so it may be necessary to account for additional disk space for data staging or intermediate files. The data nodes are the workhorses of the cluster and make up the bulk of the nodes in a typical cluster. The Syncsort DMX-h software runs on each data node. DMX-h has been optimized to use up to 75 percent less CPU and memory and up to 90 percent less storage, so the data nodes do not require increased processing capacity or memory performance.
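A common ingest pattern on an edge node is to stage extracted files locally and then push them into HDFS for the data nodes to process. The sketch below wraps the standard `hdfs dfs` commands available on a Hadoop client node; the staging directory and HDFS landing path are hypothetical placeholders, not paths defined by the Reference Architecture.

```python
import subprocess
from pathlib import Path

STAGING_DIR = Path("/data/staging/orders")   # hypothetical edge-node staging area
HDFS_LANDING = "/etl/landing/orders"         # hypothetical HDFS landing zone

def ingest_staged_files():
    """Push staged files from the edge node into HDFS using the hdfs CLI."""
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", HDFS_LANDING], check=True)
    for path in sorted(STAGING_DIR.glob("*.csv")):
        subprocess.run(["hdfs", "dfs", "-put", "-f", str(path), HDFS_LANDING], check=True)
        print(f"ingested {path.name}")

if __name__ == "__main__":
    ingest_staged_files()
```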


The DMX-h client-server architecture enables your organization to cost-effectively solve enterprise-class data integration problems, irrespective of data volume, complexity or velocity. The key to this framework, which is optimized for a wide variety of data integration requirements, is a single processing engine that has continually evolved since its inception.

It is important to note that DMX-h has a very small-footprint architecture with no dependency on third-party applications like a relational database, compiler, or application server for design or runtime. DMX-h can be deployed virtually anywhere—on premises in Linux, Unix and Windows or even within a Hadoop cluster.

There are two major components of the DMX-h client-server platform:

Client: A graphical user interface that allows users to design, execute and control data integration jobs

Server: A combination of repository and engine:

• File-Based Metadata Repository—Using the standard file system enables seamless design and runtime version control integration with source code control systems. This also provides high availability simply by inheriting the characteristics of the underlying file system between nodes.

• Engine—A high-performance, linearly scalable and small-footprint engine includes a unique dynamic ETL Optimizer, which helps ensure maximum throughput at all times.


With traditional ETL tools, a majority of the large library of components is devoted to manually tuning performance and scalability. This forces you to make design decisions that can dramatically impact overall throughput.

Moreover, it means that performance is heavily dependent on an individual developer’s knowledge of the tool. In essence, the developer must not only code to meet the functional requirements, but also design for performance.

DMX-h is different because the dynamic ETL Optimizer handles the performance aspects of any job or task. The designer only has to learn a core set of five stages/transforms (copy, sort, merge, join and aggregate). These simple tasks are combined to meet all functional requirements. This is what makes DMX-h so unique. The designer doesn’t need to worry about performance because the Optimizer automatically delivers it to every job and task regardless of the environment. As a result, jobs have far fewer components and are easier to maintain and govern. With DMX-h, users design for functionality, and they simply inherit performance.
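As a rough illustration of how a small set of generic stages can be composed to cover most functional requirements, the sketch below implements the five transforms named above on plain Python dictionaries and chains them into a tiny pipeline. It shows only the composition idea; it is not the DMX-h engine or its optimizer, and the sample records are invented.

```python
import copy as copy_mod
from heapq import merge as heap_merge
from itertools import groupby
from operator import itemgetter

def copy_stage(rows):
    """Copy: pass records through unchanged (e.g., to fan out to another target)."""
    return copy_mod.deepcopy(rows)

def sort_stage(rows, key):
    """Sort: order records by a field."""
    return sorted(rows, key=itemgetter(key))

def merge_stage(a, b, key):
    """Merge: interleave two already-sorted streams into one sorted stream."""
    return list(heap_merge(a, b, key=itemgetter(key)))

def join_stage(left, right, key):
    """Join: match records from two inputs on a shared key."""
    lookup = {r[key]: r for r in right}
    return [{**l, **lookup[l[key]]} for l in left if l[key] in lookup]

def aggregate_stage(rows, key, field):
    """Aggregate: sum a numeric field per key."""
    return [{key: k, field: sum(r[field] for r in grp)}
            for k, grp in groupby(sort_stage(rows, key), key=itemgetter(key))]

# Example pipeline: join orders to customers, then total amount per region.
orders = [{"cust": 1, "amount": 20.0}, {"cust": 2, "amount": 5.0}, {"cust": 1, "amount": 7.5}]
customers = [{"cust": 1, "region": "EMEA"}, {"cust": 2, "region": "AMER"}]
joined = join_stage(copy_stage(orders), customers, "cust")
print(aggregate_stage(joined, "region", "amount"))
```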

Take your big data journey with Dell
You can also look to Dell for the rest of the pieces of a complete big data solution, including unique software products for data analytics, data integration and data management.

Dell offers all the tools you need to:

• Seamlessly join structured and unstructured data. Dell Statistica Big Data Analytics delivers integrated information modeling and visualization in a big data search and analytics platform. It seamlessly combines large-scale structured data with a variety of unstructured data, such as text, imagery and biometrics.

• Simplify Oracle-to-Hadoop data integration. Dell SharePlex Connector for Hadoop enables you to load and continuously replicate changes from an Oracle database to a Hadoop cluster. This toolset maintains near-real-time copies of source tables without impacting system performance or Oracle online transaction processing applications.

• Synchronize data between critical applications. Dell Boomi enables you to synchronize data between mission-critical applications—on-premises and in the cloud—without the costs of procuring appliances, maintaining software or writing custom code.

• Easily access and merge data types. Dell Toad Data Point can join data from relational and non-relational data sources, enabling you to easily share and view queries, files, objects and data sets.


To learn more, visit Dell.com/Hadoop | Dell.com/BigData | Software.Dell.com/Solutions

©2015 Dell Inc. All rights reserved. Dell, the DELL logo, the DELL badge and PowerEdge are trademarks of Dell Inc. Other trademarks and trade names may be used in this document to refer to either the entities claiming the marks and names or their products. Dell disclaims proprietary interest in the marks and names of others. Intel and the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries.

June 2015 | Version 1.0

Dell Big Data and Analytics Solutions