Data Engineering with Spring, Hadoop and Hive


Transcript of Data Engineering with Spring, Hadoop and Hive

Page 1: Data Engineering with Spring, Hadoop and Hive

Faster Data Flows with Hive, Spring and Hadoop

Alex Silva, Principal Data Engineer

Page 2: Data Engineering with Spring, Hadoop and Hive

DATA ADVENTURES AT RACKSPACE

• Datasets

• Data pipeline: flows and systems

• Creating a generic Hadoop ETL framework

• Integrating Hadoop with Spring

• Spring Hadoop, Spring Batch and Spring Boot

• Hive

• File formats

• Queries and performance

Page 3: Data Engineering with Spring, Hadoop and Hive

MAAS Dataset

• System and platform monitoring

• Pings, SSH, HTTP, HTTPS checks

• Remote monitoring

• CPU, file system, load average, disk, and memory

• MySQL, Apache

THE BUSINESS DOMAIN | 3

Page 4: Data Engineering with Spring, Hadoop and Hive

The Dataset

• Processing around 1.5B records/day

• Stored in Cassandra

• Exported to HDFS in batches

• TBs of uncompressed JSON (“raw data”) daily

• First dataset piped through ETL platform

DATA ENGINEERING STATS | 4

Page 5: Data Engineering with Spring, Hadoop and Hive

DATA PIPELINE

• Data flow

• Stages

• ETL

• Input formats

• Generic Transformation Layer

• Outputs

Page 6: Data Engineering with Spring, Hadoop and Hive

Data Flow Diagram

DATA FLOW | 6

[Diagram: Monitoring data is exported as JSON into HDFS. The flow starts by checking that the export is available and well-formed; if not, it stops. The ETL extract-and-transform stage then processes the JSON data: bad rows and errors are logged to CSV (collected by Flume), while good records go to a staging file. The load stage moves data from the staging table into the production Hive table, applying partitioning, bucketing, and indexing.]

Page 7: Data Engineering with Spring, Hadoop and Hive

Systems Diagram

SYSTEMS | 7

[Diagram: Monitoring events are exported as JSON into HDFS. Extract: MapReduce 1.2.0 (HDP 1.3.2.0). Load: Hive 0.12.0. Bad records are sent through the Flume Log4J appender to a Flume 1.5.0 sink. Access: end users query Hive.]

Page 8: Data Engineering with Spring, Hadoop and Hive

ETL Summary

• Extract

• JSON files in HDFS

• Transform

• Generic Java-based ETL framework

• MapReduce jobs extract features

• Quality checks

• Load

• Load data into partitioned ORC Hive tables

DATA FLOW | 8

Page 9: Data Engineering with Spring, Hadoop and Hive

HADOOP

Page 10: Data Engineering with Spring, Hadoop and Hive

Hadoop: Pros

• Dataset volume

• Data volume grows exponentially

• Integrates with existing ecosystem

• HiveQL

• Experimentation and exploration

• No expensive software or hardware to buy

TOOLS AND TECHNOLOGIES | 10

Page 11: Data Engineering with Spring, Hadoop and Hive

Hadoop: Cons

• Job monitoring and scheduling

• Data quality

• Error handling and notification

• Programming model

• Generic framework mitigates some of that

TOOLS AND TECHNOLOGIES | 11

Page 12: Data Engineering with Spring, Hadoop and Hive

CAN WE OVERCOME SOME OF THOSE?

Page 13: Data Engineering with Spring, Hadoop and Hive

Keeping the Elephant “Lean”

• Job control without the complexity of external tools

• Checks and validations

• Unified configuration model

• Integration with scripts

• Automation

• Job restartability

DATA ENGINEERING | 13

Page 14: Data Engineering with Spring, Hadoop and Hive

HEY! WHAT ABOUT SPRING?

Page 15: Data Engineering with Spring, Hadoop and Hive

SPRING DATA HADOOP

Page 16: Data Engineering with Spring, Hadoop and Hive

What is it about?

• Part of the Spring Framework

• Run Hadoop apps as standard Java apps using DI

• Unified declarative configuration model

• APIs to run MapReduce, Hive, and Pig jobs.

• Script HDFS operations using any JVM-based language.

• Supports both classic MR and YARN

TOOLS AND TECHNOLOGIES | 16

Page 17: Data Engineering with Spring, Hadoop and Hive

The Apache Hadoop Namespace

TOOLS AND TECHNOLOGIES | 17

Also supports annotation-based configuration via the @EnableHadoop annotation, as sketched below.
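The slide's code is not preserved in this transcript; the following is a minimal sketch of the annotation style, assuming the Spring for Apache Hadoop 2.x JavaConfig support (class name and URI are illustrative):

import org.springframework.context.annotation.Configuration;
import org.springframework.data.hadoop.config.annotation.EnableHadoop;
import org.springframework.data.hadoop.config.annotation.SpringHadoopConfigurerAdapter;
import org.springframework.data.hadoop.config.annotation.builders.HadoopConfigConfigurer;

// Annotation-driven equivalent of the <hdp:configuration> XML namespace.
@Configuration
@EnableHadoop
public class HadoopAppConfig extends SpringHadoopConfigurerAdapter {

    @Override
    public void configure(HadoopConfigConfigurer config) throws Exception {
        // Same effect as fs.default.name=${hd.fs} in the XML example.
        config.fileSystemUri("hdfs://localhost:9000");
    }
}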

Page 18: Data Engineering with Spring, Hadoop and Hive

Job Configuration: Standard Hadoop APIs

TOOLS AND TECHNOLOGIES | 18

Configuration conf = new Configuration();
Job job = new Job(conf, "wordcount");
job.setJarByClass(WordCountMapper.class);
job.setMapperClass(WordCountMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.waitForCompletion(true);

Page 19: Data Engineering with Spring, Hadoop and Hive

Configuring Hadoop with Spring

SPRING HADOOP | 19

<context:property-placeholder location="hadoop-dev.properties"/>

<hdp:configuration>
    fs.default.name=${hd.fs}
</hdp:configuration>

<hdp:job id="word-count-job"
         input-path="${input.path}"
         output-path="${output.path}"
         jar="hadoop-examples.jar"
         mapper="examples.WordCount.WordMapper"
         reducer="examples.WordCount.IntSumReducer"/>

<hdp:job-runner id="runner" job-ref="word-count-job" run-at-startup="true"/>

input.path=/wc/input/
output.path=/wc/word/
hd.fs=hdfs://localhost:9000

Page 20: Data Engineering with Spring, Hadoop and Hive

SPRING HADOOP | 20

Configuration Attributes

Page 21: Data Engineering with Spring, Hadoop and Hive

Creating a Job

SPRING HADOOP | 21

Page 22: Data Engineering with Spring, Hadoop and Hive

Injecting Jobs

• Use DI to obtain reference to Spring managed Hadoop job

• Perform additional validation and configuration before submitting

TOOLS AND TECHNOLOGIES | 22

public class WordService {

    @Autowired
    private Job mapReduceJob;

    public void processWords() throws Exception {
        mapReduceJob.submit();
    }
}

Page 23: Data Engineering with Spring, Hadoop and Hive

Running a Job

TOOLS AND TECHNOLOGIES | 23
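The slide's code is not preserved in this transcript. As a minimal sketch using the standard Hadoop Job API from the earlier slides, a service can block on completion rather than just submitting (names are illustrative):

import org.apache.hadoop.mapreduce.Job;
import org.springframework.beans.factory.annotation.Autowired;

public class WordRunner {

    @Autowired
    private Job mapReduceJob;

    public boolean runWords() throws Exception {
        // true = report progress to the console while the job runs.
        return mapReduceJob.waitForCompletion(true);
    }
}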

Page 24: Data Engineering with Spring, Hadoop and Hive

Distributed Cache

TOOLS AND TECHNOLOGIES | 24
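The slide's example is not preserved either. A sketch using the classic (Hadoop 1.x) DistributedCache API shows the idea of shipping a side file to every task; the path and symlink name are made up:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.Job;

public class CacheExample {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "#lookup" creates a symlink named "lookup" in each task's working directory.
        DistributedCache.addCacheFile(new URI("/cache/lookup.dat#lookup"), conf);
        DistributedCache.createSymlink(conf);
        Job job = new Job(conf, "wordcount-with-cache");
        // ...mapper/reducer setup as on the earlier slide...
    }
}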

Page 25: Data Engineering with Spring, Hadoop and Hive

Using Scripts

TOOLS AND TECHNOLOGIES | 25

Page 26: Data Engineering with Spring, Hadoop and Hive

Scripting Implicit Variables

TOOLS AND TECHNOLOGIES | 26

Page 27: Data Engineering with Spring, Hadoop and Hive

Scripting Support in HDFS

• FsShell is designed to support scripting languages

• Use these for housekeeping tasks:

• Check for files, prepare input data, clean output directories, set flags, etc. (see the sketch below)

TOOLS AND TECHNOLOGIES | 27
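A sketch of such housekeeping with Spring Hadoop's FsShell, assuming an FsShell bean is wired into the context; the paths are illustrative:

import org.springframework.data.hadoop.fs.FsShell;

public class Housekeeping {

    private final FsShell fsh;

    public Housekeeping(FsShell fsh) {
        this.fsh = fsh;
    }

    public void prepare() {
        // Remove output from a previous run, if present.
        if (fsh.test("/wc/word")) {
            fsh.rmr("/wc/word");
        }
        // Set a flag marking the start of this run.
        fsh.touchz("/wc/flags/_started");
    }
}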

Page 28: Data Engineering with Spring, Hadoop and Hive

SPRING BATCH

Page 29: Data Engineering with Spring, Hadoop and Hive

What is it about?

• Born out of collaboration with Accenture in 2007

• Fully automated processing of large volumes of data.

• Logging, transaction management, listeners, job statistics, restart, skipping, and resource management.

• Automatic retries after failure

• Synch, async and parallel processing

• Data partitioning

TOOLS AND TECHNOLOGIES | 29

Page 30: Data Engineering with Spring, Hadoop and Hive

Hadoop Workflow Orchestration

• Complex data flows

• Reuses batch infrastructure to manage Hadoop workflows.

• Steps can be any Hadoop job type or HDFS script

• Jobs can be invoked by events or scheduled.

• Steps can be sequential, conditional, split, concurrent, or programmatically determined.

• Works with flat files, XML, or databases.

TOOLS AND TECHNOLOGIES | 30

Page 31: Data Engineering with Spring, Hadoop and Hive

Spring Batch Configuration

• Jobs are composed of steps

TOOLS AND TECHNOLOGIES | 31

<job id="job1"> <step id="import" next="wordcount"> <tasklet ref=“import-tasklet"/> </step> <step id=“wc" next="pig"> <tasklet ref="wordcount-tasklet"/> </step> <step id="pig"> <tasklet ref="pig-tasklet“></step> <split id="parallel" next="hdfs"> <flow><step id="mrStep"> <tasklet ref="mr-tasklet"/> </step></flow> <flow><step id="hive"> <tasklet ref="hive-tasklet"/> </step></flow> </split> <step id="hdfs"> <tasklet ref="hdfs-tasklet"/></step> </job>

Page 32: Data Engineering with Spring, Hadoop and Hive

Spring Data Hadoop Integration

TOOLS AND TECHNOLOGIES | 32

Page 33: Data Engineering with Spring, Hadoop and Hive

SPRING BOOT

Page 34: Data Engineering with Spring, Hadoop and Hive

What is it about?

• Builds production-ready Spring applications.

• Creates a “runnable” jar with dependencies and classpath settings.

• Can embed Tomcat or Jetty within the JAR

• Automatic configuration

• Out of the box features:

• statistics, metrics, health checks and externalized configuration

• No code generation and no requirement for XML configuration.

TOOLS AND TECHNOLOGIES | 34
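The entry point Boot runs is not shown in the transcript; a minimal sketch of what such a class looks like in Spring Boot 1.x (the JobRunner name echoes the Maven configuration later, but its real contents are a guess):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.EnableAutoConfiguration;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableAutoConfiguration
public class JobRunner {

    public static void main(String[] args) {
        // Bootstraps the application context and triggers auto-configuration.
        SpringApplication.run(JobRunner.class, args);
    }
}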

Page 35: Data Engineering with Spring, Hadoop and Hive

PUTTING IT ALL TOGETHER

Page 36: Data Engineering with Spring, Hadoop and Hive

Spring Data Flow Components

TOOLS AND TECHNOLOGIES | 36

[Diagram: Spring Boot wraps the whole flow. Spring Batch 2.0 orchestrates the Extract and Load steps. Spring Hadoop 2.0/1.1.5 connects them to HDFS, MapReduce (HDP 1.3), and Hive 0.12.0.]

Page 37: Data Engineering with Spring, Hadoop and Hive

Hierarchical View

TOOLS AND TECHNOLOGIES | 37

[Diagram: Spring Boot sits on top of Spring Batch, which provides job control; beneath it, Spring Hadoop handles notifications, validation, scheduling, data flow, and callbacks.]

Page 38: Data Engineering with Spring, Hadoop and Hive

HADOOP DATA FLOWS, SPRINGIFIED

Page 39: Data Engineering with Spring, Hadoop and Hive

Spring Hadoop Configuration

• Job parameters configured by Spring

• Sensible defaults used

• Parameters can be overridden:

• External properties file.

• At runtime via system properties: -Dproperty.name=property.value

TOOLS AND TECHNOLOGIES | 39

<configuration>
    fs.default.name=${hd.fs}
    io.sort.mb=${io.sort.mb:640mb}
    mapred.reduce.tasks=${mapred.reduce.tasks:1}
    mapred.job.tracker=${hd.jt:local}
    mapred.child.java.opts=${mapred.child.java.opts}
</configuration>

Page 40: Data Engineering with Spring, Hadoop and Hive

MapReduce Jobs

• Configured via Spring Hadoop

• One job per entity

TOOLS AND TECHNOLOGIES | 40

<job id="metricsMR" input-path="${mapred.input.path}" output-path="${mapred.output.path}" mapper="GenericETLMapper" reducer="GenericETLReducer” input-format="org.apache.hadoop.mapreduce.lib.input.TextInputFormat" output-format="org.apache.hadoop.mapreduce.lib.output.TextOutputFormat" key="TextArrayWritable" value="org.apache.hadoop.io.NullWritable" map-key="org.apache.hadoop.io.Text" map-value="org.apache.hadoop.io.Text" jar-by-class="GenericETLMapper"> volga.etl.dto.class=Metric </job>

Page 41: Data Engineering with Spring, Hadoop and Hive

MapReduce Jobs

• Jobs are wrapped into Tasklet definitions

TOOLS AND TECHNOLOGIES | 41

<job-tasklet job-ref="metricsMR" id="metricsJobTasklet"/>

Page 42: Data Engineering with Spring, Hadoop and Hive

Hive Configuration

• Hive steps also defined as tasklets

• Parameters are passed from MapReduce phase to Hive phase

TOOLS AND TECHNOLOGIES | 42

<hive-client-factory host="${hive.host}" port="${hive.port:10000}"/>

<hive-tasklet id="load-notifications">
    <script location="classpath:hive/ddl/notifications-load.hql"/>
</hive-tasklet>

<hive-tasklet id="load-metrics">
    <script location="classpath:hive/ddl/metrics-load.hql">
        <arguments>INPUT_PATH=${mapreduce.output.path}</arguments>
    </script>
</hive-tasklet>

Page 43: Data Engineering with Spring, Hadoop and Hive

Spring Batch Configuration

• One Spring Batch job per entity.

TOOLS AND TECHNOLOGIES | 43

<job id="metrics" restartable="false" parent="VolgaETLJob"> <step id="cleanMetricsOutputDirectory" next="metricsMapReduce"> <tasklet ref="setUpJobTasklet"/> </step> <step id="metricsMapReduce"> <tasklet ref="metricsJobTasklet"> <listeners> <listener ref="mapReduceErrorThresholdListener"/> </listeners> </tasklet> <fail on="FAILED" exit-code="Map Reduce Step Failed"/> <end on="COMPLETED"/> <!--<next on="*" to="loadMetricsIntoHive"/>--> </step> <step id="loadMetricsIntoHive"> <tasklet ref="load-notifications"/> </step> </job>

Page 44: Data Engineering with Spring, Hadoop and Hive

Spring Batch Listeners

• Monitor job flow

• Take action on job failure

• PagerDuty notifications

• Save job counters to the audit database

• Notify team if counters are not consistent with historical audit data (based on thresholds)

TOOLS AND TECHNOLOGIES | 44
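A hypothetical shape for such a listener, using Spring Batch's JobExecutionListener callback; PagerDutyClient and AuditDao stand in for the real notification and audit integrations:

import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobExecutionListener;

public class EtlJobListener implements JobExecutionListener {

    interface PagerDutyClient { void trigger(String description); }
    interface AuditDao { void saveCounters(JobExecution execution); }

    private final PagerDutyClient pagerDuty;
    private final AuditDao auditDao;

    public EtlJobListener(PagerDutyClient pagerDuty, AuditDao auditDao) {
        this.pagerDuty = pagerDuty;
        this.auditDao = auditDao;
    }

    @Override
    public void beforeJob(JobExecution execution) {
        // Nothing to do before the job starts.
    }

    @Override
    public void afterJob(JobExecution execution) {
        // Persist counters so thresholds can be checked against history.
        auditDao.saveCounters(execution);
        if (execution.getStatus() == BatchStatus.FAILED) {
            pagerDuty.trigger("ETL job failed: "
                    + execution.getJobInstance().getJobName());
        }
    }
}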

Page 45: Data Engineering with Spring, Hadoop and Hive

Spring Boot: Pulling Everything Together

• Runnable jar created during build process

• Controlled by Maven plugin

TOOLS AND TECHNOLOGIES | 45

<plugin>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-maven-plugin</artifactId>
    <configuration>
        <finalName>maas-etl-${project.version}</finalName>
        <classifier>spring</classifier>
        <mainClass>com.rackspace....JobRunner</mainClass>
        <excludeGroupIds>org.slf4j</excludeGroupIds>
    </configuration>
</plugin>
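With the classifier above, the repackaged artifact should be written as maas-etl-<version>-spring.jar next to the plain jar, and it can then be launched directly with java -jar.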

Page 46: Data Engineering with Spring, Hadoop and Hive

HIVE

• Typical Use Cases

• File formats

• ORC

• Abstractions

• Hive in the monitoring pipeline

• Query performance

Page 47: Data Engineering with Spring, Hadoop and Hive

Overview

• Translates SQL commands into MR jobs.

• Structured and unstructured data in multiple formats

• Standard access protocols, including JDBC and Thrift (see the sketch below)

• Provides several serialization mechanisms

• Integrates seamlessly with Hadoop: HCatalog, Pig, HBase, etc.

HIVE | 47
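A sketch of JDBC access through HiveServer2; the connection URL, credentials, table, and column names are illustrative:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT check_type, COUNT(*) FROM metrics GROUP BY check_type")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}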

Page 48: Data Engineering with Spring, Hadoop and Hive

Hive vs. RDBMS

HIVE | 48

Hive | Traditional Databases
SQL interface | SQL interface
Focus on batch analytics | Mostly online, interactive analytics
No transactions | Transactions are their way of life
No random inserts; updates are not natively supported (but possible) | Random inserts and updates
Distributed processing via MR | Distributed processing capabilities vary
Scales to hundreds of nodes | Seldom scales beyond 20 nodes
Built for commodity hardware | Expensive, proprietary hardware
Low cost per petabyte | What's a petabyte?

Page 49: Data Engineering with Spring, Hadoop and Hive

Abstraction Layers in Hive

HIVE | 49

[Diagram: a database contains tables; each table is divided into partitions, which handle skewed and unskewed keys; partitions can be further divided into buckets. Partitioning and bucketing are optional.]

Page 50: Data Engineering with Spring, Hadoop and Hive

Schemas and File Formats

• We used the ORCFile format: built-in, easy to use and efficient.

• Efficient lightweight and generic compression

• Run length encoding for integers and strings, dictionary encoding, etc.

• Generic compression: Snappy, LZO, and ZLib (default)

• High performance

• Indexes value ranges within blocks of ORCFile data

• Predicate filter pushdown allows efficient scanning during queries.

• Flexible Data Model

• All Hive types are supported, including maps, structs, and unions.

HIVE | 50

Page 51: Data Engineering with Spring, Hadoop and Hive

The ORC File Format

• An ORC file contains groups of row data called stripes, along with auxiliary information in a file footer.

• Default size is 256 MB (orc.stripe.size).

• Large stripes allow efficient reads from HDFS and can be configured independently of the HDFS block size.

HIVE | 51

Page 52: Data Engineering with Spring, Hadoop and Hive

The ORC File Format: Index

• Doesn’t answer queries

• Required for skipping rows:

• Row index entries provide offsets that enable seeking

• Min and max values for each column

HIVE | 52

Page 53: Data Engineering with Spring, Hadoop and Hive

ORC File Index Skipping

HIVE | 53

Skipping works for both numeric and string types. It is done by recording min and max values inside the inline index and checking whether the lookup value falls outside that range, as illustrated below.
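A small illustration of that range test (a simplification, not ORC's actual implementation):

public final class RowGroupPruning {

    // A row group can be skipped when the lookup value cannot fall inside
    // the [min, max] recorded in the inline index for that column.
    static boolean canSkip(long lookupValue, long min, long max) {
        return lookupValue < min || lookupValue > max;
    }
}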

Page 54: Data Engineering with Spring, Hadoop and Hive

The ORC File Format: File Footer

• Lists the stripes in the file, the number of rows per stripe, and each column's data type.

• Column-level aggregates: count, min, max, and sum.

• ORC uses the file footer to locate each column's data streams.

HIVE | 54

Page 55: Data Engineering with Spring, Hadoop and Hive

Predicate Pushdowns

• “Push down” parts of the query to where the data is.

• filter/skip as much data as possible, and

• greatly reduce input size.

• Sorting a table on its secondary keys also reduces execution time.

• Sorted columns are grouped together in one area on disk, so the other sections can be skipped quickly.

HIVE | 55

Page 56: Data Engineering with Spring, Hadoop and Hive

ORC File

[Diagram: ORC file layout]

HIVE | 56

Page 57: Data Engineering with Spring, Hadoop and Hive

Query Performance

• Lower-latency Hive queries rely on two major factors:

• Sorting and skipping data as much as possible

• Minimizing data shuffle from mappers to reducers

HIVE | 57

Page 58: Data Engineering with Spring, Hadoop and Hive

Improving Query Performance

• Divide data among different files/directories

• Partitions, buckets, etc.

• Skip records using small embedded indexes.

• ORCFile format.

• Sort data ahead of time.

• Simplifies joins and makes ORCFile skipping more effective (see the sketch after this list).

HIVE | 58
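A hypothetical DDL pulling these techniques together, issued here over JDBC; the table, columns, and bucket count are illustrative, not the actual Rackspace schema:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateOrcTable {

    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "CREATE TABLE metrics_orc ("
              + "  entity_id STRING,"
              + "  check_type STRING,"
              + "  metric_value DOUBLE)"
              + " PARTITIONED BY (dt STRING)"                // divide data among directories
              + " CLUSTERED BY (entity_id)"                  // divide data among files
              + " SORTED BY (entity_id) INTO 32 BUCKETS"     // pre-sort for index skipping
              + " STORED AS ORC"
              + " TBLPROPERTIES ('orc.compress'='ZLIB')");
        }
    }
}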

Page 59: Data Engineering with Spring, Hadoop and Hive

The Big Picture

DATA ENGINEERING | 59

[Diagram: Data preprocessing starts with JSON in HDFS and runs MapReduce to produce Hive-ready files back in HDFS. Data load dynamically loads those files into a staging table and then into the production table, applying partitioning, bucketing, and indexing. Data access goes through the API, the Hive CLI, and Apache Thrift.]

Page 60: Data Engineering with Spring, Hadoop and Hive

THANK YOU!

Get in touch:

[email protected] @thealexsilva