Shannon Holgate: Bending non-splittable data to harness distributed performance


Semi-structured Data and Hadoop

Shannon Holgate

Why am I here?

Semi-structured Data

... and Hadoop

Agenda

Motivation for Analysing XML

Solution design in Spark on Hadoop

Tuning the Solution for performance

What to take away

Thumbs up to XML processing in Hadoop

Spark can process XML quite easily

Tuning is absolutely essential

Motivation

We should care about XML

XML is incorrectly used

Extensive Schemas

SOAP

Bloat on the wire

XML is also

Human Readable

Extremely Portable

Storable in Databases

Making XML perfect for communicating Data

XML is consistently the preferred data transfer format we see in Financial Institutions

We should care about XML

Why should we care?

Financial Institutions have the data and financial backing to use Big Data technologies

Our first Big Data engagement is with a Global Insurance Provider

This customer would like to process XML at Scale within Budget

Along came an Elephant

Words from the customer:

"We want Hadoop, let's prove it can keep up with our current applications"

Create a solution which ingests, extracts and exports XML on Hadoop

And make sure this solution has the performance to replace Teradata

Cost

Hadoop provides the opportunity to pay only for what you need

This means no Oracle guy knocking on the maintenance door

And no absurd Teradata licensing fees

XML is worth analysing

Financial Institutions are ready for Big Data

Hadoop is a cost effective Big Data solution

We must find a way to process XML in a performant fashion on Hadoop

Solution

Gathering Requirements

Specifications of the Hadoop Cluster

Expected Load

Data sources

Integration points

Transformation Logic

Specifications of the Hadoop Cluster

Cluster Capacity

Installed Services

YARN

Kerberos and Sentry

                    Development                 Test                        Production
Cluster Size        6 Nodes, 128GB, 12 cores    12 Nodes, 128GB, 12 cores   12 Nodes, 128GB, 12 cores
Installed Services  Spark v1.3, Flume v1.5,     Spark v1.3, Flume v1.5,     Spark v1.3, Flume v1.5,
                    Sqoop v1.4.5 ....           Sqoop v1.4.5 ....           Sqoop v1.4.5 ....
YARN                Yes                         Yes                         Yes
Kerberos and Sentry Both                        Both                        Both

Data Sources

Batch loads vs. Streaming

Apples vs. Oranges

Flume and Spark can be used for Streaming XML messages

Batch Loads are better suited to a scheduled Oozie job

Expected Load

Messages per Second

Message scheduling

Expected Message Size

Expected Load - Streaming Messages

Messages per Second: 48 MPS, 192 MPS peak

Message Scheduling: 24/7

Expected Message Size: 15KB

Expected Load - Batch Messages

Daily Volume: 15GB

Message Scheduling: 22:00 daily

Expected Message Size: 15KB
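
As a rough sizing check, the streaming and batch load figures above can be turned into data rates with simple arithmetic. The sketch below assumes decimal units (1KB = 1,000 bytes) and treats every message as exactly the expected 15KB; neither assumption comes from the slides, and the function names are only illustrative.

```python
KB = 1_000            # assumption: decimal units, not stated in the talk
MB = 1_000 * KB
GB = 1_000 * MB

MESSAGE_SIZE = 15 * KB  # expected message size from the load figures


def streaming_bytes_per_second(messages_per_second: int) -> int:
    """Bytes arriving per second for a given message rate."""
    return messages_per_second * MESSAGE_SIZE


def messages_in_batch(batch_bytes: int) -> int:
    """Approximate number of messages contained in a batch load."""
    return batch_bytes // MESSAGE_SIZE


normal = streaming_bytes_per_second(48)      # 720,000 bytes/s
peak = streaming_bytes_per_second(192)       # 2,880,000 bytes/s
ten_minute_window = normal * 600             # 432,000,000 bytes at the normal rate
batch_messages = messages_in_batch(15 * GB)  # 1,000,000 messages

print(normal / MB, peak / MB, ten_minute_window / MB, batch_messages)
```

At the normal rate, ten minutes of streamed XML comes to roughly 432MB, and the nightly batch holds about a million messages of the same size.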

Transformation Logic

Capture User Stories for Extraction Criteria

Work with Product Owner to Create Data Mapping Spreadsheet

Integration points

XML source location

Export destination

Audit endpoints

Integration points

XML source location: JMS source

Export destination: Exadata

Audit endpoints: Exadata

Building the Pipeline

The Pipeline

Deep Dive

The Tech Stack

Worked Example - Streaming Messages

Data Flow

The Pipeline

The Tech Stack

Flume: Ingestion

Spark: XML to Avro

Spark: Avro to PSV

Sqoop: Export of PSV

Deep Dive

JMS XML message source

Flume Agent with JMS source and HDFS sink

Spark Job every 10 minutes

Reads 10 minutes of streamed XML

Converts to Avro Datafiles

Spark job running prior to export

Converts Avro to PSV for Sqoop
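
The Avro to PSV step boils down to writing each record as one pipe-separated line with a fixed column order. The sketch below shows that idea in plain Python rather than Spark; the dictionaries stand in for decoded Avro records, and the field names and the `records_to_psv` helper are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable, List


def records_to_psv(records: Iterable[Dict[str, object]], columns: List[str]) -> str:
    """Render records as pipe-separated lines, one line per record."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="|", lineterminator="\n")
    for record in records:
        # Missing or null fields become empty strings so every line has the same width.
        writer.writerow(["" if record.get(c) is None else record[c] for c in columns])
    return buffer.getvalue()


# Hypothetical records standing in for decoded Avro data.
policies = [
    {"policy_id": "P-1", "holder": "A. Smith", "premium": 120.5},
    {"policy_id": "P-2", "holder": None, "premium": 99.0},
]

psv = records_to_psv(policies, ["policy_id", "holder", "premium"])
print(psv)
```

Python's `csv` writer quotes any value that itself contains a pipe, so the export side would need to be configured to expect the same quoting.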

Sqoop export for each table in Exadata

Reads PSV versions of the Avro data

Data warehouse holding the completed pipeline data

XML Processing with Spark

Spark on Scala

Access to Java Libraries

Scala is Functional by design

Reading XML in Spark

Keep things simple

Use the Xml Input Format from Hadoop Streaming

Inputs are split from opening to closing tag
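
Splitting "from opening to closing tag" means scanning the raw text for a start tag and the next matching end tag, and treating everything in between as one record. The sketch below shows that idea in plain Python; it is not the Hadoop input format itself, and it assumes records are not nested and that the start tag carries no attributes.

```python
from typing import Iterator


def split_records(text: str, start_tag: str, end_tag: str) -> Iterator[str]:
    """Yield each substring running from start_tag to the next end_tag."""
    position = 0
    while True:
        start = text.find(start_tag, position)
        if start == -1:
            return
        end = text.find(end_tag, start + len(start_tag))
        if end == -1:
            return  # trailing record is incomplete, so it is skipped
        end += len(end_tag)
        yield text[start:end]
        position = end


# Hypothetical stream: two complete messages followed by a truncated one.
stream = (
    "<message><id>1</id></message>\n"
    "<message><id>2</id></message>\n"
    "<message><id>3</id>"
)

records = list(split_records(stream, "<message>", "</message>"))
print(records)
```

Because each record is found purely by its tags, a reader can start anywhere in a file and pick up from the next opening tag, which is what makes the XML usable as parallel input.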

Avro Support in Spark

Use Kryo Serialisation for correct Avro support

Avro Serialisation and data format

Avro Serialisation and Parquet Data Format

Design Extractors

In this case we want to turn one XML message into 5 different Avro records

5 Extraction Classes should be created

Design Extractors - cont

Use DOM/SAX if you have no definite XSD for the XML

DOM is acceptable as the data is already in memory

Spark Processing

All extractions should occur within a single Map

Map only job

Try not to cause any shuffles
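
The extractor design and the single-map advice above can be sketched together: parse each message once with DOM, run every extractor against the parsed document, and emit all outputs in the same pass. The sketch below uses plain Python with `xml.dom.minidom` instead of Spark, and shows two extractors rather than five; the class names, element names and output fields are hypothetical.

```python
from typing import Dict, List, Tuple
from xml.dom import Node
from xml.dom.minidom import Document, parseString

Record = Dict[str, str]


def text_of(doc: Document, tag: str) -> str:
    """Return the text of the first element named tag, or '' if absent or empty."""
    nodes = doc.getElementsByTagName(tag)
    if nodes and nodes[0].firstChild is not None:
        child = nodes[0].firstChild
        if child.nodeType == Node.TEXT_NODE:
            return child.data
    return ""


class PolicyExtractor:
    name = "policy"

    def extract(self, doc: Document) -> Record:
        return {"policy_id": text_of(doc, "policyId")}


class CustomerExtractor:
    name = "customer"

    def extract(self, doc: Document) -> Record:
        return {"name": text_of(doc, "customerName")}


EXTRACTORS = [PolicyExtractor(), CustomerExtractor()]


def extract_all(message: str) -> List[Tuple[str, Record]]:
    """Parse the message once and run every extractor on the same DOM."""
    doc = parseString(message)
    return [(extractor.name, extractor.extract(doc)) for extractor in EXTRACTORS]


messages = [
    "<message><policyId>P-1</policyId><customerName>A. Smith</customerName></message>"
]

# One pass over the input: each message is mapped to all of its outputs, with no regrouping.
outputs = [pair for message in messages for pair in extract_all(message)]
print(outputs)
```

Each message is handled independently of every other, so nothing in this shape of job requires data to move between workers.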

Writing the Avro output

Use the AvroJob MapReduce output format

Create Avro Datafiles

Take time to understand incoming XML

Design solution to fit the Hadoop Cluster

XML in Spark should be processed carefully

Tuning

Remember those Cluster Specs?

                    Development                 Test                        Production
Cluster Size        6 Nodes, 128GB, 12 cores    12 Nodes, 128GB, 12 cores   12 Nodes, 128GB, 12 cores
Installed Services  Spark v1.3, Flume v1.5,     Spark v1.3, Flume v1.5,     Spark v1.3, Flume v1.5,
                    Sqoop v1.4.5 ....           Sqoop v1.4.5 ....           Sqoop v1.4.5 ....
YARN                Yes                         Yes                         Yes
Kerberos and Sentry Both                        Both                        Both

Time to use them

Tuning Flume

Build Flume Cluster and Load balance

Source should read enough events for 1 block

File channel vs. Memory channel

Tuning Spark - Executors

Vital Tuning point

Tuning Spark - Executors

Memory allocation

Number of cores

Number of Executors

Memory allocation

--executor-memory

Max out but leave some for the daemons

Number of cores

Spark can run 1 task per core

HDFS Client doesn't like concurrent threads

Limit to 5 cores per executor

Number of Executors

More Executors, fewer cores

Big nodes? 3-5 Executors per node

Adjust memory and cores to match

Don't forget YARN

If the Cluster is YARN enabled it will limit memory and cores

Solution?

Number of Executors = (Cores per Node / Cores per Container) * Nodes

Or,

Number of Executors = (Memory per Node / Memory per Container) * Nodes

Use the minimum value
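
The two formulas and the "use the minimum value" rule can be worked through for the 12-node cluster (128GB and 12 cores per node). The sketch below assumes 5 cores per container, following the core limit mentioned earlier, and a purely hypothetical 32GB per container. Rounding down to whole executors per node is also an assumption, and no headroom is reserved for daemons.

```python
def executors_by_cores(nodes: int, cores_per_node: int, cores_per_container: int) -> int:
    """Executors the cluster can host when cores are the limiting resource."""
    return (cores_per_node // cores_per_container) * nodes


def executors_by_memory(nodes: int, memory_per_node_gb: int, memory_per_container_gb: int) -> int:
    """Executors the cluster can host when memory is the limiting resource."""
    return (memory_per_node_gb // memory_per_container_gb) * nodes


def number_of_executors(
    nodes: int,
    cores_per_node: int,
    cores_per_container: int,
    memory_per_node_gb: int,
    memory_per_container_gb: int,
) -> int:
    """Take the smaller of the two limits, as the slides recommend."""
    return min(
        executors_by_cores(nodes, cores_per_node, cores_per_container),
        executors_by_memory(nodes, memory_per_node_gb, memory_per_container_gb),
    )


by_cores = executors_by_cores(12, 12, 5)     # (12 // 5) * 12 = 24
by_memory = executors_by_memory(12, 128, 32)  # (128 // 32) * 12 = 48
executors = number_of_executors(12, 12, 5, 128, 32)

print(by_cores, by_memory, executors)
```

With these assumed container sizes, cores are the binding constraint, so the cluster would host 24 executors.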

Tuning Spark - Monitoring

Spark Context Web UI

Coda Hale Metrics

Tuning Sqoop Exports

Use Direct Connectors

Tweak Number of Mappers

Cluster Flume where possible

More Spark Executors > More cores

Tweak the number of Sqoop mappers

Summary

XML is absolutely a worthwhile data source to analyse

Focus on using Spark to Extract XML data and move into Avro and Parquet

Tuning should revolve around Spark allocated resources

Thanks

Shannon Holgate

Senior Analytics Engineer @ Kainos

@sholgate13