Building Realtime Data Pipelines with Kafka Connect and Spark Streaming


Transcript of Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Page 1: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

BUILDING REALTIME DATA PIPELINES WITH KAFKA CONNECT AND SPARK STREAMING

Guozhang Wang Confluent

Page 2: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

About Me: Guozhang Wang

• Engineer @ Confluent.

• Apache Kafka Committer, PMC Member.

• Before: Engineer @ LinkedIn, Kafka and Samza.

Page 3: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

What do you REALLY need for Stream Processing?

Page 4: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming! Is that All?

Page 5: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming! Is that All?

Page 6: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Spark Streaming! Is that All?

Page 7: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Data Can Come From / Go To..

Page 8: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Page 9: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Real-time Data Integration:

getting data to all the right places

Page 10: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Option #1: One-off Tools

• Tools for each specific data system

• Examples:

• jdbcRDD (see the sketch below), Cassandra-Spark connector, etc.

• Sqoop, logstash to Kafka, etc.
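
As a concrete illustration of the one-off approach, here is a minimal sketch using Spark's built-in JdbcRDD; the connection URL, credentials, table, and id bounds are made up for illustration:

import java.sql.DriverManager;
import java.sql.ResultSet;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.rdd.JdbcRDD;

public class JdbcImportSketch {
  public static void main(String[] args) {
    JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("jdbc-import"));

    // JdbcRDD partitions the query by binding the two '?' placeholders to id ranges.
    JavaRDD<String> users = JdbcRDD.create(
        sc,
        () -> DriverManager.getConnection("jdbc:mysql://localhost/mydb", "user", "pass"),
        "SELECT id, name FROM users WHERE id >= ? AND id <= ?",
        1L, 1000000L,  // lower / upper bound of the id column
        10,            // number of partitions
        (ResultSet rs) -> rs.getLong(1) + "," + rs.getString(2));

    users.saveAsTextFile("/tmp/users");  // or feed into further Spark processing
    sc.stop();
  }
}

Each such tool solves one source or sink well, but every new system means another bespoke integration.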

Page 11: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Option #2: Kitchen Sink Tools

• Generic point-to-point data copy / ETL tools

• Examples:

• Enterprise application integration tools

Page 12: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Option #3: Streaming as Copying

• Use stream processing frameworks to copy data

• Examples:

• Spark Streaming: MyRDDWriter (foreachPartition) – see the sketch after this list

• Storm, Samza, Flink, etc.
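
A minimal sketch of that pattern with the Spark 2.x Java API: open a connection per partition inside foreachPartition and write every record out. ExternalStoreClient is a hypothetical stand-in for whatever sink client would actually be used, and the socket source is just a placeholder input:

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class CopyWithSparkStreaming {
  public static void main(String[] args) throws InterruptedException {
    JavaStreamingContext ssc =
        new JavaStreamingContext(new SparkConf().setAppName("copy-job"), Durations.seconds(5));

    JavaDStream<String> lines = ssc.socketTextStream("localhost", 9999);  // placeholder input

    // The "MyRDDWriter" idea: one connection per partition, write every record out.
    lines.foreachRDD(rdd ->
        rdd.foreachPartition(records -> {
          ExternalStoreClient client = new ExternalStoreClient();  // hypothetical sink client
          while (records.hasNext()) {
            client.write(records.next());
          }
          client.close();
        }));

    ssc.start();
    ssc.awaitTermination();
  }

  // Stand-in for whatever system the data is being copied into.
  static class ExternalStoreClient {
    void write(String record) { /* send one record to the external system */ }
    void close() { }
  }
}

It works, but the copy job now carries all the burden of offsets, retries, and schema handling itself, which is exactly what Kafka Connect factors out.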

Page 13: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Real-time Integration: E, T & L

Page 14: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Page 15: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Example: LinkedIn back in 2010

Page 16: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Example: LinkedIn with Kafka

Apache Kafka

Page 17: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Page 18: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Page 19: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Large-scale streaming data import/export for Kafka

Kafka Connect

Page 20: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Page 21: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming
Page 22: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Data Model

Page 23: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Data Model

Page 24: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Parallelism Model

Page 25: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Standalone Execution

Page 26: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Distributed Execution

Page 27: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Distributed Execution

Page 28: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Distributed Execution

Page 29: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Delivery Guarantees

• Offsets automatically committed and restored

• On restart: task checks offsets & rewinds

• At least once delivery – flush data, then commit (sketched below)

• Exactly once for connectors that support it (e.g. HDFS)
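
A hedged sketch of how a sink task commonly achieves at-least-once delivery: buffer records in put(), write them out in flush(), and only then let the framework commit offsets. ExternalStoreClient and the store.url config key are hypothetical:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;

import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.sink.SinkRecord;
import org.apache.kafka.connect.sink.SinkTask;

public class BufferingSinkTask extends SinkTask {
  private final List<SinkRecord> buffer = new ArrayList<>();
  private ExternalStoreClient client;

  @Override public String version() { return "0.1"; }

  @Override public void start(Map<String, String> props) {
    client = new ExternalStoreClient(props.get("store.url"));  // hypothetical config key
  }

  @Override public void put(Collection<SinkRecord> records) {
    buffer.addAll(records);  // only accumulate; nothing is acknowledged yet
  }

  @Override public void flush(Map<TopicPartition, OffsetAndMetadata> currentOffsets) {
    // Write everything out BEFORE the framework commits these offsets.
    // If this throws, the offsets are not committed and the same records are
    // redelivered after a restart: at-least-once.
    for (SinkRecord record : buffer) {
      client.write(record);
    }
    buffer.clear();
  }

  @Override public void stop() { client.close(); }

  // Hypothetical stand-in for the target system's client.
  static class ExternalStoreClient {
    ExternalStoreClient(String url) { }
    void write(SinkRecord record) { /* write to the external system */ }
    void close() { }
  }
}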

Page 30: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Format Converters

• Abstract away serialization so connectors stay format-agnostic

• Convert between Kafka Connect Data API (Connectors) and serialized bytes

• JSON and Avro currently supported
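
As a small sketch of the Converter contract (not how it is normally used; workers are simply configured with key.converter / value.converter), the JSON converter can round-trip between the Connect data API and raw bytes:

import java.nio.charset.StandardCharsets;
import java.util.Collections;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.json.JsonConverter;

public class ConverterRoundTrip {
  public static void main(String[] args) {
    JsonConverter converter = new JsonConverter();
    converter.configure(Collections.singletonMap("schemas.enable", "true"), false /* isKey */);

    // A record as a connector sees it: schema + structured value.
    Schema schema = SchemaBuilder.struct().field("name", Schema.STRING_SCHEMA).build();
    Struct value = new Struct(schema).put("name", "alice");

    // Connect data -> serialized bytes that get written to Kafka.
    byte[] bytes = converter.fromConnectData("my-topic", schema, value);
    System.out.println(new String(bytes, StandardCharsets.UTF_8));

    // Bytes read from Kafka -> back to the Connect data API for a sink connector.
    SchemaAndValue roundTrip = converter.toConnectData("my-topic", bytes);
    System.out.println(roundTrip.value());
  }
}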

Page 31: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Connector Developer APIs

class Connector {

    abstract void start(Map<String, String> props);

    abstract void stop();

    abstract Class<? extends Task> taskClass();

    abstract List<Map<String, String>> taskConfigs(int maxTasks);   // one config map per task

}

class Source/SinkTask {

    abstract void start(Map<String, String> props);

    abstract void stop();

    abstract List<SourceRecord> poll();                   // SourceTask: pull records from the source system

    abstract void put(Collection<SinkRecord> records);    // SinkTask: push records to the target system

    abstract void commit();                               // hook invoked when offsets are committed

}
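
To make the shape of these APIs concrete, here is a hedged, toy SourceTask that emits an incrementing counter; the topic config key and the one-second sleep are purely illustrative:

import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

public class CounterSourceTask extends SourceTask {
  private String topic;
  private long next = 0;

  @Override public String version() { return "0.1"; }

  @Override public void start(Map<String, String> props) {
    topic = props.get("topic");  // illustrative config key
  }

  @Override public List<SourceRecord> poll() throws InterruptedException {
    Thread.sleep(1000);  // pretend we waited for new data to arrive
    Map<String, ?> sourcePartition = Collections.singletonMap("source", "counter");
    Map<String, ?> sourceOffset = Collections.singletonMap("position", next);
    // The framework persists the offset so the task can rewind on restart.
    return Collections.singletonList(new SourceRecord(
        sourcePartition, sourceOffset, topic, Schema.INT64_SCHEMA, next++));
  }

  @Override public void stop() { }
}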

Page 32: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Kafka Connect & Spark Streaming
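
The division of labor argued for here: Kafka Connect moves data into (and out of) Kafka, and Spark Streaming consumes those topics for the actual processing. A hedged sketch of the Spark side, assuming the spark-streaming-kafka-0-10 integration and an illustrative topic name populated by a source connector:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class ConnectPlusSparkStreaming {
  public static void main(String[] args) throws InterruptedException {
    JavaStreamingContext ssc =
        new JavaStreamingContext(new SparkConf().setAppName("process-connect-data"),
                                 Durations.seconds(10));

    Map<String, Object> kafkaParams = new HashMap<>();
    kafkaParams.put("bootstrap.servers", "localhost:9092");
    kafkaParams.put("key.deserializer", StringDeserializer.class);
    kafkaParams.put("value.deserializer", StringDeserializer.class);
    kafkaParams.put("group.id", "spark-processing");

    // Topic written by a Kafka Connect source connector (name is illustrative).
    JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(Arrays.asList("jdbc-users"), kafkaParams));

    // Transform here; results can go back to Kafka and be exported by a sink connector.
    stream.map(ConsumerRecord::value).print();

    ssc.start();
    ssc.awaitTermination();
  }
}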

Page 33: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Kafka Connect Today

• Confluent open source: HDFS, JDBC

• Connector Hub: connectors.confluent.io

• Examples: MySQL, MongoDB, Twitter, Solr, S3, MQTT, Couchbase, Vertica, Cassandra, Elasticsearch, HBase, Kudu, Attunity, JustOne, Striim, Bloomberg ..

• Improved connector control (0.10.0)

Page 34: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

THANK YOU!

Guozhang Wang | [email protected] | @guozhangwang

Confluent – Afternoon Break Sponsor for Spark Summit
• Jay Kreps – I Heart Logs book signing and giveaway
• 3:45pm – 4:15pm in Golden Gate

Kafka Training with Confluent University
• Kafka Developer and Operations Courses
• Visit www.confluent.io/training

Want more Kafka?
• Download Confluent Platform Enterprise (incl. Kafka Connect) at http://www.confluent.io/product
• Apache Kafka 0.10 upgrade documentation at http://docs.confluent.io/3.0.0/upgrade.html

Page 35: Building Realtime Data Pipelines with Kafka Connect and Spark Streaming

Separation of Concerns