mypipe: Buffering and consuming MySQL changes via Kafka
with -=[ Scala - Avro - Akka ]=-
Hisham Mardam-Bey
Github: mardambey Twitter: @codewarrior
Overview
● Who is this guy? + Quick Mate1 Intro
● Quick Tech Intro
● Motivation and History
● Features
● Design and Architecture
● Practical applications and usages
● System diagram
● Future work
● Q&A
Who is this guy?
● Linux and OpenBSD user and developer since 1996
● Started out with C followed by Ruby
● Working with the JVM since 2007
● “Lately” building and running distributed systems, and doing Scala
Github: mardambey
Twitter: @codewarrior
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially a team of 3, around 30 now
● Engineering team has 12 geeks / geekettes
○ Always looking for talent!
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
Super Quick Tech Intro
● MySQL: relational database
● Avro: data serialization system
● Kafka: publish-subscribe messaging rethought as a distributed commit log
● Akka: toolkit and runtime simplifying the construction of concurrent and distributed applications
● Actors: universal primitives of concurrent computation using message passing
● Schema repo / registry: holds versioned Avro schemas
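Avro records are defined by JSON schemas; as context for the slides that follow, a minimal record schema might look like this (the record and field names are illustrative, not from mypipe):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example",
  "fields": [
    { "name": "id",       "type": "int"    },
    { "name": "username", "type": "string" }
  ]
}
```

A schema repo / registry stores versioned schemas like this one so writers and readers can evolve independently.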
Motivation
● Initially, wanted:
○ MySQL triggers outside the DB
○ MySQL fan-in or fan-out replication (data cubes)
○ MySQL to “Hadoop”
● And then:
○ Cache or data store consistency with DB
○ Direct integration with big-data systems
○ Data schema evolution support
○ Turning MySQL inside out
■ Bootstrapping downstream data systems
History
● 2010: Custom Perl scripts to parse binlogs
● 2011/2012: Guzzler
○ Written in Scala, uses the mysqlbinlog command
○ Simple to start with, difficult to maintain and control
● 2014: Enter mypipe!
○ Initial prototyping begins
Feature Overview (1/2)
● Emulates a MySQL slave via the binary log
○ Writes MySQL events to Kafka
● Uses Avro to serialize and deserialize data
○ Generically, via a common schema for all tables
○ Specifically, via a per-table schema
● Modular by design
○ State saving / loading (files, MySQL, ZK, etc.)
○ Error handling
○ Event filtering
○ Connection sources
Feature Overview (2/2)
● Transaction and ALTER TABLE support
○ Includes transaction information within events
○ Refreshes schema as needed
● Can publish to any downstream system
○ Currently, we have Kafka
○ Initially, we started with Cassandra for the prototype
● Can bootstrap a MySQL table into Kafka
○ Transforms entire table into Kafka events
○ Useful with Kafka log compaction
● Configurable
○ Kafka topic names
○ whitelist / blacklist support
● Console consumer, Dockerized dev env
Project Structure
● mypipe-api: API for MySQL binlogs
● mypipe-avro: binary protocol, mutation serialization and deserialization
● mypipe-producers: push data downstream
● mypipe-kafka: Serializer & Decoder implementations
● mypipe-runner: pipes and console tools
● mypipe-snapshotter: import MySQL tables (beta)
MySQL Binary Logging
● Foundation of MySQL replication
● Statement or Row based
● Represents a journal / change log of data
● Allows applications to spy / tune in on MySQL changes
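Since mypipe consumes row events, the source server must have row-based binary logging enabled. A typical my.cnf fragment (values illustrative):

```ini
[mysqld]
server-id     = 1          # must be unique among the server and its replicas
log-bin       = mysql-bin  # enable the binary log
binlog_format = ROW        # row-based events, not statements
```

Because mypipe emulates a slave, its MySQL user also needs replication privileges (e.g. REPLICATION SLAVE).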
MySQLBinaryLogConsumer
● Uses behavior from abstract class
● Modular design; in this case, uses config-based implementations
● Uses HOCON for ease and availability
case class MySQLBinaryLogConsumer(config: Config)
extends AbstractMySQLBinaryLogConsumer
with ConfigBasedConnectionSource
with ConfigBasedErrorHandlingBehaviour
with ConfigBasedEventSkippingBehaviour
with CacheableTableMapBehaviour
AbstractMySQLBinaryLogConsumer
● Maintains connection to MySQL
● Primarily handles
○ TABLE_MAP
○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER)
○ XID
○ Mutations (INSERT, UPDATE, DELETE)
● Provides an enriched binary log API
○ Looks up table metadata and includes it
○ Scala-friendly case class and option-driven(*) API for speaking MySQL binlogs

(*) constant work in progress (=
TABLE_MAP and table metadata
● Provides table metadata
○ Precedes mutation events
○ But no column names!
● MySQLMetadataManager
○ One actor per database
○ Uses “information_schema”
○ Determines column metadata and primary key
● TableCache
○ Wraps metadata actor providing a cache
○ Refreshes tables “when needed”
Mutations

case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean)
case class PrimaryKey(columns: List[ColumnMetadata])
case class Column(metadata: ColumnMetadata, value: java.io.Serializable)
case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey: Option[PrimaryKey])
case class Row(table: Table, columns: Map[String, Column])
case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID)
case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
● Fully enriched with table metadata
● Contain column types, data and txid
● Mutations can be serialized and deserialized from and to Avro
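Downstream code can dispatch on the mutation type via pattern matching; the sketch below uses simplified stand-ins for the case classes above (table metadata and txid dropped), purely for illustration:

```scala
// Simplified stand-ins for the mutation case classes above.
object Mutations {
  sealed trait Mutation
  final case class InsertMutation(table: String, rows: List[Map[String, Any]]) extends Mutation
  final case class UpdateMutation(table: String, rows: List[(Map[String, Any], Map[String, Any])]) extends Mutation
  final case class DeleteMutation(table: String, rows: List[Map[String, Any]]) extends Mutation

  // A consumer or producer dispatches on the mutation type.
  def describe(m: Mutation): String = m match {
    case InsertMutation(t, rows) => s"INSERT into $t: ${rows.size} row(s)"
    case UpdateMutation(t, rows) => s"UPDATE on $t: ${rows.size} row(s), old -> new"
    case DeleteMutation(t, rows) => s"DELETE from $t: ${rows.size} row(s)"
  }
}

object Demo extends App {
  import Mutations._
  val m = InsertMutation("db1.users", List(Map("id" -> 1, "username" -> "alice")))
  println(describe(m)) // INSERT into db1.users: 1 row(s)
}
```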
Kafka Producers
● Two modes of operation:
○ Generic Avro beans
○ Specific Avro beans
● Producers decoupled from SerDe
○ Recently started supporting Kafka serializers and decoders
○ Currently we only support: http://schemarepo.org/
○ Very soon we can integrate with systems such as Confluent Platform’s schema registry.
Kafka Message Format

 -----------------
| MAGIC | 1 byte  |
|-----------------|
| MTYPE | 1 byte  |
|-----------------|
| SCMID | N bytes |
|-----------------|
| DATA  | N bytes |
 -----------------

● MAGIC: magic byte, for protocol version
● MTYPE: mutation type, a single byte
○ indicating insert (0x1), update (0x2), or delete (0x3)
● SCMID: Avro schema ID, N bytes
● DATA: the actual mutation data as N bytes
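An illustrative encoder for that framing; the SCMID width is an assumption (2 bytes here), since the slide only says "N bytes":

```scala
import java.nio.ByteBuffer

object WireFormat {
  val Magic: Byte  = 0x0
  val Insert: Byte = 0x1
  val Update: Byte = 0x2
  val Delete: Byte = 0x3

  def encode(mtype: Byte, schemaId: Short, avroBytes: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 1 + 2 + avroBytes.length)
    buf.put(Magic)         // MAGIC: protocol version
       .put(mtype)         // MTYPE: insert / update / delete
       .putShort(schemaId) // SCMID: Avro schema id (assumed 2 bytes)
       .put(avroBytes)     // DATA: Avro-encoded mutation
    buf.array()
  }
}

object Demo extends App {
  val msg = WireFormat.encode(WireFormat.Insert, schemaId = 7.toShort, avroBytes = Array[Byte](1, 2, 3))
  println(msg.length)    // 7 (1 + 1 + 2 + 3)
  println(msg(1) == 0x1) // true: MTYPE is insert
}
```

A consumer decodes in the reverse order: check MAGIC, branch on MTYPE, look up the schema by SCMID, then Avro-decode DATA.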
Generic Message Format
● 3 Avro beans
○ InsertMutation, DeleteMutation, UpdateMutation
○ Hold data for new and old columns (for updates)
○ Groups data by type into Avro maps
{ "name": "old_integers", "type": {"type": "map", "values": "int"} },
{ "name": "new_integers", "type": {"type": "map", "values": "int"} },
{ "name": "old_strings", "type": {"type": "map", "values": "string"} },
{ "name": "new_strings", "type": {"type": "map", "values": "string"} } ...
Specific Message Format
● Requires 3 Avro beans per table
○ Insert, Update, Delete
○ Specific fields can be used in the schema

{ "name": "UserInsert",
  "fields": [
    { "name": "id",         "type": ["null", "int"] },
    { "name": "username",   "type": ["null", "string"] },
    { "name": "login_date", "type": ["null", "long"] },
    ...
  ]
},
ALTER TABLE support
● ALTER TABLE queries intercepted
○ Producers can handle this event specifically
● Kafka serializer and deserializer
○ They inspect Avro beans and refresh schema if needed
● Avro evolution rules must be respected
○ Or mypipe can’t properly encode / decode data
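For example, Avro's evolution rules allow adding a field only if readers can fill it in, i.e. the new field carries a default (null works here because null is the first branch of the union). The "email" field below is a hypothetical new column, not from the talk:

```json
{
  "name": "UserInsert",
  "fields": [
    { "name": "id",       "type": ["null", "int"],    "default": null },
    { "name": "username", "type": ["null", "string"], "default": null },
    { "name": "email",    "type": ["null", "string"], "default": null }
  ]
}
```

Removing a field without a default, or changing a type incompatibly, is the kind of change that leaves mypipe unable to decode older data.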
Pipes
● Join consumers to producers
● Use configurable time-based checkpointing and flushing
○ File based, MySQL based, ZK based, Kafka based
schema-repo-client = "mypipe.avro.schema.SchemaRepo"

consumers {
  localhost {
    # database "host:port:user:pass" array
    source = "localhost:3306:mypipe:mypipe"
  }
}

producers {
  stdout {
    class = "mypipe.kafka.producer.stdout.StdoutProducer"
  }
  kafka-generic {
    class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer"
  }
}

pipes {
  stdout {
    consumers = ["localhost"]
    producer { stdout {} }
    binlog-position-repo {
      #class = "mypipe.api.repo.ConfigurableMySQLBasedBinaryLogPositionRepository"
      class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository"
      config {
        file-prefix = "stdout-00" # required if binlog-position-repo is specified
        data-dir = "/tmp/mypipe/data"
      }
    }
  }
  kafka-generic {
    enabled = true
    consumers = ["localhost"]
    producer {
      kafka-generic {
        metadata-brokers = "localhost:9092"
      }
    }
  }
}
Practical Applications
● Cache coherence
● Change logging and auditing
● MySQL to:
○ HDFS
○ Cassandra
○ Spark
● Once Confluent Schema Registry is integrated:
○ Kafka Connect
○ KStreams
● Other reactive applications
○ Real-time notifications
System Diagram

[Diagram: MySQL binary logs feed multiple pipes (Pipe 1, Pipe 2, … Pipe N), each pairing a MySQL Binary Log Consumer (or a Select Consumer) with a Kafka Producer. Producers publish to per-table Kafka topics (db1_tbl1, db1_tbl2, db2_tbl1, db2_tbl2), registering schemas with the Schema Registry. Event consumers then feed Hadoop, Cassandra, dashboards, and users.]
Future Work
● Finish MySQL -> Kafka snapshot support
● Move to Kafka 0.10
● MySQL global transaction identifier (GTID) support
● Publish to Maven
● More tests; we have a good amount, but you can’t have enough!
Fin!
That’s all folks (=

Thanks!
Questions?
https://github.com/mardambey/mypipe