Sqoop on Spark for Data Ingestion (Veena Basavaraj and Vinoth Chandar, Uber)


Transcript of Sqoop on Spark for Data Ingestion (Veena Basavaraj and Vinoth Chandar, Uber)

Page 1

SQOOP on SPARK for Data Ingestion
Veena Basavaraj & Vinoth Chandar

@Uber

Page 2

Currently @Uber on streaming systems. Previously @Cloudera on ingestion for Hadoop, and @LinkedIn on front-end service infra.

Currently @Uber, focused on building a real-time pipeline for ingestion to Hadoop. Previously @LinkedIn, lead on Voldemort. In the past, worked on log-based replication, HPC, and stream processing.

Page 3

Agenda

• Sqoop for Data Ingestion

• Why Sqoop on Spark?

• Sqoop Jobs on Spark

• Insights & Next Steps

Page 4

Sqoop Before

[Diagram: SQL → HADOOP]

Page 5

Data Ingestion

• Data ingestion needs evolved:

– Non-SQL-like data sources

– Messaging systems as data sources

– Multi-stage pipelines

Page 6

Sqoop Now

• Generic data transfer service

– FROM ANY source

– TO ANY target

[Diagram — example FROM → TO pairs: MYSQL → KAFKA, HDFS → MONGO, FTP → HDFS, KAFKA → MEMSQL]

Page 7

Sqoop How?

• Connectors represent pluggable data sources

• Connectors are configurable (see the sketch below)

– LINK configs

– JOB configs
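Since the LINK and JOB configs are a connector's public surface, a minimal sketch of creating them through the Sqoop 2 (1.99.x) Java client may help. This is illustrative only; exact input names and method signatures differ between 1.99.x releases.

import org.apache.sqoop.client.SqoopClient;
import org.apache.sqoop.model.MJob;
import org.apache.sqoop.model.MLink;

// Illustrative only: create a LINK (how to reach the source) and a JOB
// (what to move) via the Sqoop 2 client API.
public class ConfigSketch {
  public static void main(String[] args) {
    SqoopClient client = new SqoopClient("http://sqoop-server:12000/sqoop/");

    // LINK config: connection details for the data source.
    MLink fromLink = client.createLink("generic-jdbc-connector");
    fromLink.getConnectorLinkConfig()
            .getStringInput("linkConfig.connectionString")
            .setValue("jdbc:mysql://myhost:3306/test");
    client.saveLink(fromLink);

    // JOB config: associates a FROM link with a TO link.
    long toLinkId = 2L; // assumed: a previously saved TO link
    MJob job = client.createJob(fromLink.getPersistenceId(), toLinkId);
    // ...fill the FROM/TO job configs (e.g. table name, output directory)...
    client.saveJob(job);
  }
}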

Page 8

Sqoop Connector API

[Diagram: Source → partition() → extract() → load() → Target]

**No Transform (T) stage yet!
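The SPI behind that picture, reduced to its essentials. This is a simplified rendering, not the exact Sqoop 2 signatures: in Sqoop 2 these are abstract classes (Partitioner, Extractor, Loader) parameterized by the connector's LINK and JOB config classes, which are elided here.

import java.util.List;

// Simplified sketch of the three connector callbacks.
interface SqoopPartitioner<P> {
  // Split the source data set into independent partitions.
  List<P> partition(int desiredSplits);
}

interface SqoopExtractor<P, R> {
  // Read one partition from the FROM source and emit records.
  Iterable<R> extract(P partition);
}

interface SqoopLoader<R> {
  // Write extracted records into the TO target.
  void load(Iterable<R> records);
}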

Page 9

Agenda

• Sqoop for Data Ingestion

• Why Sqoop on Spark?

• Sqoop Jobs on Spark

• Insights & Next Steps

Page 10

It turns out…

• MapReduce is slow!

• We need Connector APIs to support (T) transformations, not just EL

• Good news! The execution engine is also pluggable

Page 11

Why Apache Spark?

• Why not? ETL expressed as Spark jobs

• Faster than MapReduce

• Growing community embracing Apache Spark

Page 12

Why Not Use Spark Data Sources?

Sure we can! But…

Page 13

Why Not Spark DataSources?

• A recent addition for data sources!

• Run MR Sqoop jobs on Spark with a simple config change

• Leverage incremental EL & job management within Sqoop

Page 14

Agenda

• Sqoop for Data Ingestion

• Why Sqoop on Spark?

• Sqoop Jobs on Spark

• Insights & Next Steps

Page 15

Sqoop on Spark

• Creating a Job

• Job Submission

• Job Execution

Page 16

Sqoop Job API

• Create the Sqoop job:

– Create FROM and TO job configs

– Create a JOB associating the FROM and TO configs

• A SparkContext holds Sqoop jobs

• Invoke SqoopSparkJob.execute(conf, context) — sketched below
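Put together, the driver boils down to something like the following. This is a minimal sketch assuming the prototype's SqoopSparkJob class (github.com/vybs/sqoop-on-spark); the call shape follows this slide and may not match the repo's code exactly, and the config-building steps are left as comments.

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

// Minimal driver sketch; SqoopSparkJob comes from the prototype repo.
public class SqoopSparkDriver {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("sqoop-on-spark");
    JavaSparkContext context = new JavaSparkContext(conf);

    // 1. Build the FROM and TO job configs (see the LINK/JOB configs on page 7).
    // 2. Create the Sqoop JOB associating the FROM and TO configs.
    // 3. Run the EL pipeline on Spark:
    SqoopSparkJob.execute(conf, context);

    context.stop();
  }
}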

Page 17

Spark Job Submission

• We explored a few options!

– Invoke Spark in-process within the Sqoop server to execute the job

– Use the Remote Spark Context (used by Hive on Spark) to submit

– Use the Sqoop job as a driver for the spark-submit command

Page 18

Spark Job Submission

• Build an “uber.jar” with the driver and all the Sqoop dependencies

• Submit the driver program to YARN either programmatically via the Spark YARN client (a non-public API) or directly from the command line:

bin/spark-submit --class org.apache.sqoop.spark.SqoopJDBCHDFSJobDriver --master yarn /path/to/uber.jar --confDir /path/to/sqoop/server/conf/ --jdbcString jdbc://myhost:3306/test --u uber --p hadoop --outputDir hdfs://path/to/output --numE 4 --numL 4

Page 19

Spark Job Execution

[Diagram: MySQL → partitionRDD → .map() → extractRDD → .map() → loadRDD.collect()]

Page 20

Spark Job Execution

SqoopSparkJob.execute(…) in three steps:

// 1. Compute input partitions and parallelize them into an RDD.
List<Partition> sp = getPartitions(request, numMappers);
JavaRDD<Partition> partitionRDD = sc.parallelize(sp, sp.size());

// 2. Extract: run the FROM connector against each partition.
JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
    partitionRDD.map(new SqoopExtractFunction(request));

// 3. Load: run the TO connector; collect() forces execution.
extractRDD.map(new SqoopLoadFunction(request)).collect();

Page 21

Spark Job Execution

[Diagram: MySQL → compute partitions → .map(): extract from MySQL → .repartition(): repartition to limit files on HDFS → .mapPartition(): load into HDFS]
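A self-contained toy version of that shape, with integers standing in for Sqoop Partitions and strings for extracted records. Everything here is illustrative rather than the prototype's code, and it assumes the Spark 2.x Java API (the talk-era 1.x API returned an Iterable from mapPartitions).

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Toy pipeline mirroring the diagram above.
public class PipelineShapeSketch {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("sqoop-on-spark-shape").setMaster("local[2]");
    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // Compute partitions: one Spark task per source split.
      List<Integer> splits = Arrays.asList(0, 1, 2, 3);
      JavaRDD<Integer> partitionRDD = sc.parallelize(splits, splits.size());

      // Extract: in the real job, the FROM connector reads each split.
      JavaRDD<String> extracted = partitionRDD.map(s -> "record-from-split-" + s);

      // Repartition down so the load stage writes fewer, larger files.
      JavaRDD<String> bounded = extracted.repartition(2);

      // Load: one Loader per output partition; count() just forces
      // execution here (the real job writes to HDFS instead).
      long loaded = bounded.mapPartitions((Iterator<String> it) -> {
        ArrayList<String> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next()); // stand-in for writing
        return out.iterator();
      }).count();

      System.out.println("loaded " + loaded + " records");
    }
  }
}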

Page 22

Agenda

• Sqoop for Data Ingestion

• Why Sqoop on Spark?

• Sqoop Jobs on Spark

• Insights & Next Steps

Page 23

Micro Benchmark: MySQL to HDFS

[Chart: table w/ 300K records, numExtractors = numLoaders]

Page 24

Micro Benchmark: MySQL to HDFS

[Chart: table w/ 2.8M records, numExtractors = numLoaders — good partitioning!!]

Page 25

What was Easy?

• NO changes to the Connector API required

• Built-in support for standalone and YARN cluster modes, for quick end-to-end testing and faster iteration

• Scheduling Spark Sqoop jobs via Oozie

Page 26

What was not Easy?

• No clean public Spark job submission API; using the YARN UI for job status and health

• A bunch of Sqoop core classes, such as the IDF classes, had to be made serializable (illustrated below)

• Managing Hadoop and Spark dependencies together in Sqoop caused some pain
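The serialization issue arises because any object a Spark closure captures must survive Java serialization on its way to the executors. A minimal illustration, with a made-up stand-in class rather than the actual IDF code:

import java.io.Serializable;

// Anything referenced inside an RDD function (e.g. the job request held by
// the extract/load functions) ships to executors via Java serialization,
// so it must implement Serializable.
public class JobRequestLike implements Serializable {
  private static final long serialVersionUID = 1L;

  private final String jdbcUrl;

  // Non-serializable members (connections, etc.) must be transient and
  // re-created lazily on the executor side.
  private transient java.sql.Connection connection;

  public JobRequestLike(String jdbcUrl) {
    this.jdbcUrl = jdbcUrl;
  }

  public String jdbcUrl() {
    return jdbcUrl;
  }
}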

Page 27

Next Steps!

• Explore alternative ways for Spark Sqoop job submission with the Spark 1.4 additions

• Connector Filter API (filter, data masking)

• SQOOP-1532

– https://github.com/vybs/sqoop-on-spark

Page 28

Sqoop Connector ETL

[Diagram: Source → partition() → extract() → transform() → load() → Target]

**With Transform (T) stage!
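Continuing the page 20 snippet, the (T) stage would slot in as one more map() between extract and load. SqoopTransformFunction is a hypothetical name here, a stand-in for whatever the Connector Filter API (SQOOP-1532) ends up supplying:

// Hypothetical: extract → transform → load, reusing the page 20 pipeline.
JavaRDD<List<IntermediateDataFormat<?>>> extractRDD =
    partitionRDD.map(new SqoopExtractFunction(request));

// The (T) stage: filter / mask records in flight.
JavaRDD<List<IntermediateDataFormat<?>>> transformRDD =
    extractRDD.map(new SqoopTransformFunction(request));

transformRDD.map(new SqoopLoadFunction(request)).collect();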

Page 29

Questions!

• Thanks to the folks @Cloudera and @Uber!

• You can reach us @vybs, @byte_array