
Faster ETL Workflows using Apache Pig & Spark

- Praveen Rachabattuni, Sigmoid Analytics

@praveenr019

About me

● Apache Pig committer and Pig on Spark project lead.

OUR CUSTOMERS

Why Pig on Spark?

● Spark shell (Scala), Spark SQL, DataFrames API

● Large MapReduce clusters already running Pig-on-MapReduce jobs

● A language already familiar to developers and analysts, and easier to debug

What steered us to Pig?

● Targeted users

○ Analysts

○ Large pig script codebase projects

○ Cost savings for organisations that would otherwise train staff on new frameworks

● Rich operator library

How Spark plugs into Pig?

Pig on MapReduce: Logical Plan → Physical Plan → MR Plan → MR Exec Engine

Pig on Spark: Logical Plan → Physical Plan → Spark Plan → Spark Exec Engine
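
Each step of the Spark pipeline is realized by per-operator converters: every Pig physical operator gets a converter that turns its predecessors' RDDs into this operator's RDD (the mapping table below lists the concrete pairings). A minimal Scala sketch of the pattern, with simplified stand-in types — Pig's real converters are Java classes working on List<RDD<Tuple>> → RDD<Tuple>:

import org.apache.spark.rdd.RDD

// Sketch of the converter pattern with simplified types; rows are modeled
// as Array[String] instead of Pig's Tuple.
trait RDDConverter[Op] {
  def convert(predecessors: Seq[RDD[Array[String]]], op: Op): RDD[Array[String]]
}

// Illustrative stand-in for Pig's POFilter physical operator.
case class FilterOp(keep: Array[String] => Boolean)

object FilterOpConverter extends RDDConverter[FilterOp] {
  def convert(predecessors: Seq[RDD[Array[String]]], op: FilterOp): RDD[Array[String]] = {
    require(predecessors.size == 1, "Filter expects exactly one predecessor")
    predecessors.head.filter(op.keep) // one Pig operator -> one Spark transformation
  }
}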

Operator Mapping

Pig Operator | Spark Operator
Load         | newAPIHadoopFile
Store        | saveAsNewAPIHadoopFile
Filter       | filter transformation
GroupBy      | groupBy & map
Join         | CoGroup
ForEach      | mapPartitions
Sort         | sortByKey + map
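
Taking the GroupBy row as an example: Pig's GROUP BY becomes a Spark groupBy on the key followed by a map that reshapes each (key, rows) pair. A minimal, hedged Scala sketch with invented sample data:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch of the GroupBy row above: groupBy shuffles rows by key, and a map
// reshapes each (key, rows) pair, here into a per-key sum of pageviews.
object GroupBySketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("groupby-sketch").setMaster("local[*]"))
    // (pcode, pageviews) pairs standing in for Pig tuples.
    val rows = sc.parallelize(Seq(("en", 242332L), ("ak", 400L), ("en", 100L)))
    val grouped = rows.groupBy(_._1)                          // Pig: GROUP A BY pcode
                      .map { case (k, vs) => (k, vs.map(_._2).sum) }
    grouped.collect().foreach(println)
    sc.stop()
  }
}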

Simple script

A = LOAD './wiki' USING PigStorage(' ') AS (hour:chararray, pcode:chararray,
    pagename:chararray, pageviews:chararray, pagebytes:chararray);
-- pageviews is loaded as chararray, so cast it before the numeric comparison
B = FILTER A BY (int)pageviews >= 50000;
DUMP B;

Input data:

en Main_Page 242332 4737756101
ak Italy 400 73160
en Main_Page 242332 4737756101
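
For intuition, here is a hedged sketch of what this script amounts to in hand-written Spark — plain RDD code, not the plan Pig actually generates; the field index follows the declared LOAD schema above:

import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Try

// Hand-written Spark equivalent of the Pig script above (a sketch):
// LOAD -> textFile + split, FILTER -> filter, DUMP -> collect + println.
object SimpleScriptSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wiki-filter").setMaster("local[*]"))
    val a = sc.textFile("./wiki").map(_.split(" "))
    // pageviews is the fourth field (index 3) in the declared LOAD schema.
    val b = a.filter(r => r.length > 3 && Try(r(3).toLong).getOrElse(0L) >= 50000L)
    b.collect().foreach(r => println(r.mkString(" ")))
    sc.stop()
  }
}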

Load operator

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessorRdds, POLoad poLoad)
        throws IOException {
    JobConf loadJobConf = SparkUtil.newJobConf(pigContext);
    configureLoader(physicalPlan, poLoad, loadJobConf);

    RDD<Tuple2<Text, Tuple>> hadoopRDD = sparkContext.newAPIHadoopFile(
            poLoad.getLFile().getFileName(),   // path from the LOAD statement
            PigInputFormatSpark.class,         // Pig's InputFormat wrapper
            Text.class, Tuple.class, loadJobConf);

    // map to get just RDD<Tuple>: drop the Text key, keep the Tuple value
    return hadoopRDD.map(TO_TUPLE_FUNCTION, SparkUtil.getManifest(Tuple.class));
}

Load operator (cont.)

private static class ToTupleFunction
        extends AbstractFunction1<Tuple2<Text, Tuple>, Tuple>
        implements Function1<Tuple2<Text, Tuple>, Tuple>, Serializable {

    @Override
    public Tuple apply(Tuple2<Text, Tuple> v1) {
        // Keep only the Tuple value; the Text key is not needed downstream.
        return v1._2();
    }
}
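
The same load-then-drop-the-key shape in plain Scala Spark, on an ordinary text file — TextInputFormat stands in for PigInputFormatSpark here, and the path is just the example input from earlier:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

// Same shape as the Load converter: newAPIHadoopFile yields (key, value)
// pairs, and a map keeps only the value, as ToTupleFunction does for Tuples.
object LoadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("load-sketch").setMaster("local[*]"))
    val kv = sc.newAPIHadoopFile("./wiki", classOf[TextInputFormat],
                                 classOf[LongWritable], classOf[Text])
    val values = kv.map(_._2.toString)   // drop the byte-offset key
    values.take(3).foreach(println)
    sc.stop()
  }
}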

Filter operator

@Override
public RDD<Tuple> convert(List<RDD<Tuple>> predecessors, POFilter physicalOperator) {
    SparkUtil.assertPredecessorSize(predecessors, physicalOperator, 1);
    RDD<Tuple> rdd = predecessors.get(0);
    FilterFunction filterFunction = new FilterFunction(physicalOperator);
    return rdd.filter(filterFunction);
}

Filter operator (cont.)

private static class FilterFunction extends AbstractFunction1<Tuple, Object>
        implements Serializable {

    private final POFilter poFilter;

    private FilterFunction(POFilter poFilter) {
        this.poFilter = poFilter;
    }

    @Override
    public Boolean apply(Tuple v1) {
        Result result;
        try {
            // Run the tuple through Pig's own POFilter operator.
            poFilter.setInputs(null);
            poFilter.attachInput(v1);
            result = poFilter.getNextTuple();
        } catch (ExecException e) {
            throw new RuntimeException("Couldn't filter tuple", e);
        }
        // Keep the tuple only when the filter reports STATUS_OK.
        return result != null && result.returnStatus == POStatus.STATUS_OK;
    }
}

Spark plan

● The MR plan is structured around the MapReduce execution engine.

● The Spark plan is a sequence of transformations, optimized for Spark execution.

● Longer term: hand the logical plan over to Spark for a much more optimized flow, since Spark is pretty good at this.

Benchmark

Setting up Pig on Spark

1. Get the code
   a. git clone https://github.com/apache/pig -b spark

2. Build the project
   a. ant -Dhadoopversion=23 jar (assumes a hadoop-2.x setup)

3. Env variables
   a. export HADOOP_USER_CLASSPATH_FIRST="true"
   b. export SPARK_MASTER="local"

4. Start the Pig grunt shell
   a. bin/pig -x spark

Issues

● Keeping the Spark plan aligned with Spark's APIs

● Performance

● Functional parity with Pig on MapReduce

Contributors

● Mayur Rustagi (Sigmoid)

● Praveen Rachabattuni (Sigmoid)

● Liyun Zhang (Intel)

● Mohit Sabharwal (Cloudera)

● Xuefu Zhang (Cloudera)

References

● Apache Pig github mirror

○ https://github.com/apache/pig/tree/spark

● Umbrella JIRA for Pig on Spark

○ https://issues.apache.org/jira/browse/PIG-4059

Thank you

Queries?