Introduction to Kafka Connect


Himani Arora
Software Consultant
Knoldus Software LLP

Topics Covered

What is Kafka Connect?

Sources and Sinks

Motivation behind Kafka Connect

Use cases of Kafka Connect

Architecture

Demo

What is Kafka Connect?

Added in 0.9 release of Apache Kafka.

Tool for scalably and reliably streaming data between Apache Kafka and other data systems.

For a long time, companies did data processing as big batch jobs: CSV files dumped out of databases, log files collected at the end of the day.

But businesses operate in real time. So rather than processing data at the end of the day, why not react to it continuously as it arrives? This is where stream processing came into the picture, and this shift led to the popularity of Apache Kafka.

But even with Apache Kafka, building real-time data pipelines has required some effort.

This is why Kafka Connect was announced as a new feature in the 0.9 release of Kafka.

It abstracts away the common problems every connector to Kafka needs to solve:

schema management

fault tolerance

delivery semantics

operations, monitoring, etc.

What is Kafka Connect?

Schema management: The ability of the data pipeline to carry schema information where it is available.

In the absence of this capability, you end up having to recreate it downstream.

Furthermore, if there are multiple consumers for the same data, then each consumer has to recreate it.

Fault tolerance: Run several instances of a process and be resilient to failures

Delivery semantics: Provide strong guarantees when machines fail or processes crash

Operations and monitoring: Monitor the health and progress of every data integration process in a consistent manner
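
To make the schema-management point above concrete, here is a minimal sketch using the Kafka Connect Java API of how a source task can attach a schema to the records it produces, so downstream consumers do not have to reconstruct it. The topic name, field names, and partition/offset keys are made up for illustration.

```java
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.source.SourceRecord;

public class SchemaExample {

    public static SourceRecord buildRecord() {
        // The schema travels with the record, so every downstream consumer
        // sees the same structure instead of recreating it.
        Schema userSchema = SchemaBuilder.struct().name("example.User")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .build();

        Struct value = new Struct(userSchema)
                .put("id", 42L)
                .put("name", "alice");

        // sourcePartition/sourceOffset identify where this record came from;
        // "users" is a hypothetical destination topic.
        Map<String, ?> sourcePartition = Collections.singletonMap("table", "users");
        Map<String, ?> sourceOffset = Collections.singletonMap("id", 42L);

        return new SourceRecord(sourcePartition, sourceOffset, "users", userSchema, value);
    }
}
```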


Kafka Connect is a tool for scalably and reliably streaming data between Apache Kafka and other data systems.

It makes it simple to quickly define connectors that move large data sets into and out of Kafka.

Kafka Connect can ingest entire databases or collect metrics from all your application servers into Kafka topics, making the data available for stream processing with low latency.


Sources and Sinks

Sources import data into Kafka, and Sinks export data from Kafka.

An implementation of a Source or Sink is a Connector.

Users deploy connectors to enable data flows on Kafka

Some of the certified connectors built on the Kafka Connect framework are:
Source -> JDBC, Couchbase, Apache Ignite, Cassandra
Sink -> HDFS, Apache Ignite, Solr
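
As a sketch of what deploying a connector looks like, the configuration below uses the FileStreamSource connector that ships with Kafka to stream the lines of a file into a topic. It is shown here as a Java map for illustration; in practice it is usually a properties file (standalone mode) or JSON sent to the REST API (distributed mode). The file path, topic, and connector name are placeholders.

```java
import java.util.HashMap;
import java.util.Map;

public class FileSourceConfigExample {

    // Minimal configuration for the FileStreamSource connector bundled with Kafka.
    // Each line appended to /tmp/app.log is published as a record on "app-logs".
    public static Map<String, String> fileSourceConfig() {
        Map<String, String> config = new HashMap<>();
        config.put("name", "local-file-source");   // unique connector name
        config.put("connector.class",
                "org.apache.kafka.connect.file.FileStreamSourceConnector");
        config.put("tasks.max", "1");               // this connector only needs one task
        config.put("file", "/tmp/app.log");         // placeholder input file
        config.put("topic", "app-logs");            // placeholder destination topic
        return config;
    }
}
```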

Motivation behind Kafka Connect

Why build another framework when there are already so many to choose from?

Most of the existing solutions do not integrate optimally with a stream data platform,

where streaming, event-based data is the lingua franca and Kafka is the common medium that serves as a hub for all data.

E.g. log and metric collection/processing frameworks like Flume and Logstash:

They do not handle integration with batch systems well, and they are operationally complex for large data pipelines, where an agent runs on each server.

Gobblin, Suro: ETL for data warehousing.

They target a specific use case and work with a single sink.

Benefits of Kafka Connect

Broad copying by default

Streaming and batch

Scales to the application

Focus on copying data only

Accessible connector API

Quickly define connectors that copy vast quantities of data between systems

Support copying to and from both streaming and batch-oriented systems.

Scale down to a single process running one connector in a small production environment, and scale up to an organization-wide service for copying data between a wide variety of large-scale systems.

Focus on reliable, scalable data copying; leave transformation, enrichment, and other modifications to downstream systems.

Accessible connector API: it is easy to develop new connectors. The API and runtime model for implementing new connectors make them simple to write.

Architecture

Three major models:

Connector model

Worker model

Data model

Connector Model

The connector model defines how third-party developers create connector plugins which import or export data from another system.

The model has two key concepts:

Connectors

Tasks

Connectors are the largest logical unit of work in Kafka Connect and define where data should be copied to and from.

This might cover copying a whole database or collection of databases into Kafka.

A connector does not perform any copying itself; instead, it schedules tasks to do it.

Tasks are responsible for producing or consuming sequences of Kafka Connect records (ConnectRecords) in order to copy data.
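
A rough sketch of this two-level model in the Java connector API (compiled against a recent connect-api version): the connector only decides how to split the work into task configurations, while the task does the actual copying by returning records from poll(). The class names, topic, and partition/offset keys are illustrative, not a real connector.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.Task;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.source.SourceConnector;
import org.apache.kafka.connect.source.SourceRecord;
import org.apache.kafka.connect.source.SourceTask;

// Connector: decides what to copy and splits the job into task configurations.
public class ExampleSourceConnector extends SourceConnector {
    private Map<String, String> config;

    @Override public void start(Map<String, String> props) { this.config = props; }

    @Override public Class<? extends Task> taskClass() { return ExampleSourceTask.class; }

    @Override
    public List<Map<String, String>> taskConfigs(int maxTasks) {
        // A real connector would partition the work (tables, files, ...) across
        // up to maxTasks task configurations; here everything goes to one task.
        return Collections.singletonList(config);
    }

    @Override public void stop() { }
    @Override public ConfigDef config() { return new ConfigDef(); }
    @Override public String version() { return "0.1"; }
}

// Task: does the actual copying by producing SourceRecords.
class ExampleSourceTask extends SourceTask {
    @Override public void start(Map<String, String> props) { }

    @Override
    public List<SourceRecord> poll() throws InterruptedException {
        // A real task would read from the external system here.
        SourceRecord record = new SourceRecord(
                Collections.singletonMap("source", "example"),   // source partition
                Collections.singletonMap("position", 0L),        // source offset
                "example-topic",                                  // destination topic
                Schema.STRING_SCHEMA, "hello from a task");
        return Collections.singletonList(record);
    }

    @Override public void stop() { }
    @Override public String version() { return "0.1"; }
}
```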

Connectors, tasks and workers


Kafka Connect's core concept that users interact with is the connector.

Partitions are balanced evenly across tasks.

Each task reads from its partitions, translates the data to Kafka Connect's format, and decides the destination topic (and possibly partition) in Kafka.

Worker and Data Model

The worker model represents the runtime in which connectors and tasks execute.

The worker model allows Kafka Connect to scale to the application.

The data model addresses the remaining requirements, such as tight integration with Kafka and schema management.

This layer decouples the logical work (connectors) from the physical execution (workers executing tasks)

Workers are processes that execute connectors and tasks

Workers automatically coordinate with each other to distribute work and provide scalability and fault tolerance.

All other concerns, such as schema management and tight integration with Kafka, are handled by the data model.

Kafka Connect tracks offsets for each connector so that connectors can resume from their previous position in the event of failures or graceful restarts for maintenance.
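
As a sketch of how a source task can use those tracked offsets to resume, the Kafka Connect API exposes an OffsetStorageReader on the task context. The partition key ("file") and offset key ("position") below are hypothetical and would match whatever the task writes into its SourceRecords.

```java
import java.util.Collections;
import java.util.Map;

import org.apache.kafka.connect.source.SourceTask;

public abstract class ResumingSourceTask extends SourceTask {

    private long position = 0L;

    @Override
    public void start(Map<String, String> props) {
        // Ask the framework for the last committed offset of this task's
        // source partition so it can resume instead of starting over.
        Map<String, Object> offset = context.offsetStorageReader()
                .offset(Collections.singletonMap("file", props.get("file")));
        if (offset != null && offset.get("position") != null) {
            position = (Long) offset.get("position");
        }
        // ... seek the external system to `position` before polling ...
    }
}
```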

It has two types of workers:

Standalone

Distributed

Worker and Data Model


Standalone mode is the simplest mode, where a single process is responsible for executing all connectors and tasks. Since it is a single process, it requires minimal configuration.

In distributed mode, you start many worker processes using the same group.id, and they automatically coordinate to schedule execution of connectors and tasks across all available workers.
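
In distributed mode there is no local connector properties file; connectors are submitted to any worker's REST API (port 8083 by default) as JSON. A minimal sketch using Java's built-in HTTP client; the host, connector name, and config values are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SubmitConnectorExample {

    public static void main(String[] args) throws Exception {
        // JSON payload: connector name plus its configuration.
        String payload = "{"
                + "\"name\": \"local-file-source\","
                + "\"config\": {"
                + "  \"connector.class\": \"org.apache.kafka.connect.file.FileStreamSourceConnector\","
                + "  \"tasks.max\": \"1\","
                + "  \"file\": \"/tmp/app.log\","
                + "  \"topic\": \"app-logs\""
                + "}}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))  // any worker in the group
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```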

Balancing Work

A simple example: a cluster of three workers (processes launched via any mechanism you choose) running two connectors.

The worker processes have balanced the connectors and tasks across themselves.

Balancing Work

If a connector adds partitions, this causes it to regenerate task configurations.
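
The mechanism behind this is the connector asking the framework to recompute its task configurations. A sketch, assuming a hypothetical discoverPartitions() helper: when the connector notices new partitions, it calls requestTaskReconfiguration() on its context, taskConfigs() is invoked again, and the workers rebalance with the new set of tasks.

```java
import java.util.Set;

import org.apache.kafka.connect.source.SourceConnector;

public abstract class RebalancingSourceConnector extends SourceConnector {

    private Set<String> knownPartitions;

    // Hypothetical monitoring hook, e.g. run periodically from a background thread.
    protected void checkForNewPartitions() {
        Set<String> current = discoverPartitions();
        if (!current.equals(knownPartitions)) {
            knownPartitions = current;
            // Tells the framework to call taskConfigs() again and redistribute
            // the (possibly larger) set of tasks across the workers.
            context.requestTaskReconfiguration();
        }
    }

    // Placeholder: a real connector would query the external system here.
    protected abstract Set<String> discoverPartitions();
}
```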

Balancing Work

If one of the workers fails, the remaining workers rebalance the connectors and tasks so the work previously handled by the failed worker is moved to the other workers.

Questions

References

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767

http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines

http://docs.confluent.io/3.0.0/connect/intro.html

THANK YOU