mypipe: Buffering and consuming MySQL changes via Kafka
with -=[ Scala - Avro - Akka ]=-
Hisham Mardam-Bey
Github: mardambey Twitter: @codewarrior
Overview
● Who is this guy? + Quick Mate1 Intro
● Quick Tech Intro
● Motivation and History
● Features
● Design and Architecture
● Practical applications and usages
● System diagram
● Future work
● Q&A
Who is this guy?
● Linux and OpenBSD user and developer since 1996
● Started out with C followed by Ruby
● Working with the JVM since 2007
● “Lately” building and running distributed systems, and doing Scala
Github: mardambey
Twitter: @codewarrior
Mate1: quick intro
● Online dating, since 2003, based in Montreal
● Initially a team of 3, around 30 now
● Engineering team has 12 geeks / geekettes
○ Always looking for talent!
● We own and run our own hardware
○ fun!
○ mostly…
https://github.com/mate1
Super Quick Tech Intro
● MySQL: relational database
● Avro: data serialization system
● Kafka: publish-subscribe messaging rethought as a distributed commit log
● Akka: toolkit and runtime simplifying the construction of concurrent and distributed applications
● Actors: universal primitives of concurrent computation using message passing
● Schema repo / registry: holds versioned Avro schemas
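Avro records are defined by JSON schemas; as context for the slides that follow, a minimal record schema might look like this (the record and field names are illustrative, not from mypipe):

```json
{
  "type": "record",
  "name": "User",
  "namespace": "example",
  "fields": [
    { "name": "id",       "type": "int"    },
    { "name": "username", "type": "string" }
  ]
}
```

A schema repo / registry stores versioned schemas like this one so writers and readers can evolve independently.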
Motivation
● Initially, wanted:
○ MySQL triggers outside the DB
○ MySQL fan-in or fan-out replication (data cubes)
○ MySQL to “Hadoop”
● And then:
○ Cache or data store consistency with DB
○ Direct integration with big-data systems
○ Data schema evolution support
○ Turning MySQL inside out
■ Bootstrapping downstream data systems
History
● 2010: Custom Perl scripts to parse binlogs
● 2011/2012: Guzzler
○ Written in Scala, uses the mysqlbinlog command
○ Simple to start with, difficult to maintain and control
● 2014: Enter mypipe!
○ Initial prototyping begins
Feature Overview (1/2)
● Emulates a MySQL slave via the binary log
○ Writes MySQL events to Kafka
● Uses Avro to serialize and deserialize data
○ Generically, via a common schema for all tables
○ Specifically, via a per-table schema
● Modular by design
○ State saving / loading (files, MySQL, ZK, etc.)
○ Error handling
○ Event filtering
○ Connection sources
Feature Overview (2/2)
● Transaction and ALTER TABLE support
○ Includes transaction information within events
○ Refreshes schema as needed
● Can publish to any downstream system
○ Currently, we have Kafka
○ Initially, we started with Cassandra for the prototype
● Can bootstrap a MySQL table into Kafka
○ Transforms entire table into Kafka events
○ Useful with Kafka log compaction
● Configurable
○ Kafka topic names
○ whitelist / blacklist support
● Console consumer, Dockerized dev env
Project Structure
● mypipe-api: API for MySQL binlogs
● mypipe-avro: binary protocol, mutation serialization and deserialization
● mypipe-producers: push data downstream
● mypipe-kafka: Serializer & Decoder implementations
● mypipe-runner: pipes and console tools
● mypipe-snapshotter: import MySQL tables (beta)
MySQL Binary Logging
● Foundation of MySQL replication
● Statement or Row based
● Represents a journal / change log of data
● Allows applications to spy / tune in on MySQL changes
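Since mypipe consumes row events, the source server must have row-based binary logging enabled. A typical my.cnf fragment (values illustrative):

```ini
[mysqld]
server-id     = 1          # must be unique among the server and its replicas
log-bin       = mysql-bin  # enable the binary log
binlog_format = ROW        # row-based events, not statements
```

Because mypipe emulates a slave, its MySQL user also needs replication privileges (e.g. REPLICATION SLAVE).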
MySQLBinaryLogConsumer
● Uses behavior from abstract class
● Modular design; in this case, uses config-based implementations
● Uses HOCON for ease and availability
case class MySQLBinaryLogConsumer(config: Config)
extends AbstractMySQLBinaryLogConsumer
with ConfigBasedConnectionSource
with ConfigBasedErrorHandlingBehaviour
with ConfigBasedEventSkippingBehaviour
with CacheableTableMapBehaviour
AbstractMySQLBinaryLogConsumer
● Maintains connection to MySQL
● Primarily handles
○ TABLE_MAP
○ QUERY (BEGIN, COMMIT, ROLLBACK, ALTER)
○ XID
○ Mutations (INSERT, UPDATE, DELETE)
● Provides an enriched binary log API
○ Looks up table metadata and includes it
○ Scala-friendly case class and option-driven(*) API for speaking MySQL binlogs

(*) constant work in progress (=
TABLE_MAP and table metadata
● Provides table metadata
○ Precedes mutation events
○ But no column names!
● MySQLMetadataManager
○ One actor per database
○ Uses “information_schema”
○ Determines column metadata and primary key
● TableCache
○ Wraps metadata actor providing a cache
○ Refreshes tables “when needed”
Mutations

case class ColumnMetadata(name: String, colType: ColumnType.EnumVal, isPrimaryKey: Boolean)
case class PrimaryKey(columns: List[ColumnMetadata])
case class Column(metadata: ColumnMetadata, value: java.io.Serializable)
case class Table(id: Long, name: String, db: String, columns: List[ColumnMetadata], primaryKey: Option[PrimaryKey])
case class Row(table: Table, columns: Map[String, Column])
case class InsertMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
case class UpdateMutation(timestamp: Long, table: Table, rows: List[(Row, Row)], txid: UUID)
case class DeleteMutation(timestamp: Long, table: Table, rows: List[Row], txid: UUID)
● Fully enriched with table metadata
● Contain column types, data and txid
● Mutations can be serialized and deserialized from and to Avro
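Downstream code can dispatch on the mutation type via pattern matching; the sketch below uses simplified stand-ins for the case classes above (table metadata and txid dropped), purely for illustration:

```scala
// Simplified stand-ins for the mutation case classes above.
object Mutations {
  sealed trait Mutation
  final case class InsertMutation(table: String, rows: List[Map[String, Any]]) extends Mutation
  final case class UpdateMutation(table: String, rows: List[(Map[String, Any], Map[String, Any])]) extends Mutation
  final case class DeleteMutation(table: String, rows: List[Map[String, Any]]) extends Mutation

  // A consumer or producer dispatches on the mutation type.
  def describe(m: Mutation): String = m match {
    case InsertMutation(t, rows) => s"INSERT into $t: ${rows.size} row(s)"
    case UpdateMutation(t, rows) => s"UPDATE on $t: ${rows.size} row(s), old -> new"
    case DeleteMutation(t, rows) => s"DELETE from $t: ${rows.size} row(s)"
  }
}

object Demo extends App {
  import Mutations._
  val m = InsertMutation("db1.users", List(Map("id" -> 1, "username" -> "alice")))
  println(describe(m)) // INSERT into db1.users: 1 row(s)
}
```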
Kafka Producers
● Two modes of operation:
○ Generic Avro beans
○ Specific Avro beans
● Producers decoupled from SerDe
○ Recently started supporting Kafka serializers and decoders
○ Currently we only support: http://schemarepo.org/
○ Very soon we can integrate with systems such as Confluent Platform’s schema registry.
Kafka Message Format

 -----------------
| MAGIC | 1 byte  |
|-----------------|
| MTYPE | 1 byte  |
|-----------------|
| SCMID | N bytes |
|-----------------|
| DATA  | N bytes |
 -----------------

● MAGIC: magic byte, for protocol version
● MTYPE: mutation type, a single byte
○ indicating insert (0x1), update (0x2), or delete (0x3)
● SCMID: Avro schema ID, N bytes
● DATA: the actual mutation data as N bytes
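An illustrative encoder for that framing; the SCMID width is an assumption (2 bytes here), since the slide only says "N bytes":

```scala
import java.nio.ByteBuffer

object WireFormat {
  val Magic: Byte  = 0x0
  val Insert: Byte = 0x1
  val Update: Byte = 0x2
  val Delete: Byte = 0x3

  def encode(mtype: Byte, schemaId: Short, avroBytes: Array[Byte]): Array[Byte] = {
    val buf = ByteBuffer.allocate(1 + 1 + 2 + avroBytes.length)
    buf.put(Magic)         // MAGIC: protocol version
       .put(mtype)         // MTYPE: insert / update / delete
       .putShort(schemaId) // SCMID: Avro schema id (assumed 2 bytes)
       .put(avroBytes)     // DATA: Avro-encoded mutation
    buf.array()
  }
}

object Demo extends App {
  val msg = WireFormat.encode(WireFormat.Insert, schemaId = 7.toShort, avroBytes = Array[Byte](1, 2, 3))
  println(msg.length)    // 7 (1 + 1 + 2 + 3)
  println(msg(1) == 0x1) // true: MTYPE is insert
}
```

A consumer decodes in the reverse order: check MAGIC, branch on MTYPE, look up the schema by SCMID, then Avro-decode DATA.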
Generic Message Format
● 3 Avro beans
○ InsertMutation, DeleteMutation, UpdateMutation
○ Hold data for new and old columns (for updates)
○ Groups data by type into Avro maps
{ "name": "old_integers", "type": {"type": "map", "values": "int"} },
{ "name": "new_integers", "type": {"type": "map", "values": "int"} },
{ "name": "old_strings", "type": {"type": "map", "values": "string"} },
{ "name": "new_strings", "type": {"type": "map", "values": "string"} } ...
Specific Message Format
● Requires 3 Avro beans per table
○ Insert, Update, Delete
○ Specific fields can be used in the schema

{ "name": "UserInsert",
  "fields": [
    { "name": "id",         "type": ["null", "int"] },
    { "name": "username",   "type": ["null", "string"] },
    { "name": "login_date", "type": ["null", "long"] },
    ...
  ]
},
ALTER TABLE support
● ALTER TABLE queries intercepted
○ Producers can handle this event specifically
● Kafka serializer and deserializer
○ They inspect Avro beans and refresh schema if needed
● Avro evolution rules must be respected
○ Or mypipe can’t properly encode / decode data
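For example, Avro's evolution rules allow adding a field only if readers can fill it in, i.e. the new field carries a default (null works here because null is the first branch of the union). The "email" field below is a hypothetical new column, not from the talk:

```json
{
  "name": "UserInsert",
  "fields": [
    { "name": "id",       "type": ["null", "int"],    "default": null },
    { "name": "username", "type": ["null", "string"], "default": null },
    { "name": "email",    "type": ["null", "string"], "default": null }
  ]
}
```

Removing a field without a default, or changing a type incompatibly, is the kind of change that leaves mypipe unable to decode older data.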
Pipes
● Join consumers to producers
● Use configurable time-based checkpointing and flushing
○ File based, MySQL based, ZK based, Kafka based
schema-repo-client = "mypipe.avro.schema.SchemaRepo"

consumers {
  localhost {
    # database "host:port:user:pass" array
    source = "localhost:3306:mypipe:mypipe"
  }
}

producers {
  stdout {
    class = "mypipe.kafka.producer.stdout.StdoutProducer"
  }
  kafka-generic {
    class = "mypipe.kafka.producer.KafkaMutationGenericAvroProducer"
  }
}

pipes {
  stdout {
    consumers = ["localhost"]
    producer { stdout {} }
    binlog-position-repo {
      #class = "mypipe.api.repo.ConfigurableMySQLBasedBinaryLogPositionRepository"
      class = "mypipe.api.repo.ConfigurableFileBasedBinaryLogPositionRepository"
      config {
        file-prefix = "stdout-00" # required if binlog-position-repo is specified
        data-dir = "/tmp/mypipe/data"
      }
    }
  }
  kafka-generic {
    enabled = true
    consumers = ["localhost"]
    producer {
      kafka-generic {
        metadata-brokers = "localhost:9092"
      }
    }
  }
}
Practical Applications
● Cache coherence
● Change logging and auditing
● MySQL to:
○ HDFS
○ Cassandra
○ Spark
● Once Confluent Schema Registry is integrated:
○ Kafka Connect
○ KStreams
● Other reactive applications
○ Real-time notifications
System Diagram

[Diagram: MySQL binary logs feed multiple pipes (Pipe 1, Pipe 2, … Pipe N), each pairing a MySQL Binary Log Consumer (or a Select Consumer) with a Kafka Producer. Producers publish to per-table Kafka topics (db1_tbl1, db1_tbl2, db2_tbl1, db2_tbl2), registering schemas with the Schema Registry. Event consumers then feed Hadoop, Cassandra, dashboards, and users.]
Future Work
● Finish MySQL -> Kafka snapshot support
● Move to Kafka 0.10
● MySQL global transaction identifier (GTID) support
● Publish to Maven
● More tests; we have a good amount, but you can’t have enough!
Fin!
That’s all folks (=

Thanks!
Questions?
https://github.com/mardambey/mypipe