Efficient Schemas in Motion with Kafka and Schema Registry
Pat Patterson
Community Champion
@metadaddy
Enterprise Data DNA
Commercial Customers Across Verticals
250,000+ downloads
50+ of the Fortune 100
Doubling each quarter
Strong Partner Ecosystem Open Source Success
Mission: empower enterprises to harness their data in motion.
Who is StreamSets?
Avro
Schema Registry
Demo
Agenda
Joined ASF as a Hadoop subproject in 2009
Record-oriented serialization format
Binary (most common) and JSON (human readable) encodings
Apache Avro
Avro Prehistory
Schema defined in JSON
• Relatively readable
Schema evolution
• Can add new fields, rename fields in schema
• Existing data can still be read under the new schema
Untagged binary data
• Space-efficient!
Avro Advantages
{
"type": "record",
"namespace": "com.example",
"name": "Person",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" }
]
}
Avro Schema Definition
• null: 0 bytes
• boolean: 1 byte
• int/long: variable-length, zig-zag encoded
• float/double: 4/8 bytes
• bytes: length as long, then data
• string: length as long, then UTF-8-encoded data
Avro Binary Encoding - Simple Types
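The int/long encoding above can be sketched in a few lines. This is a minimal illustration of zig-zag plus variable-length base-128 encoding as the Avro spec describes it, not a production serializer:

```python
def encode_long(n: int) -> bytes:
    # Zig-zag: map signed ints to unsigned so small magnitudes stay small
    # (0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...), then emit 7 bits per byte,
    # little-endian, high bit set on all but the last byte.
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    # string: length as a long, then the UTF-8 bytes
    data = s.encode("utf-8")
    return encode_long(len(data)) + data
```

So `encode_long(1)` yields `b'\x02'` and `encode_long(-1)` yields `b'\x01'` — one byte each, with no type tag on the wire.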
• Record: concatenate the field encodings
• Enum: zero-based index of symbol, as int
• Array: blocks of items, each preceded by a long count; zero count terminates array
• Map: blocks of K-V pairs, each preceded by a long count; zero count terminates map
• Union: position of item in schema as a long, then the item
• Fixed: the number of bytes defined in the schema
Avro Binary Encoding - Complex Types
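The record rule is worth seeing concretely: a `Person` record is just its field encodings back to back, with no field names or tags on the wire — the schema supplies the structure. A self-contained sketch (helpers repeated from the simple-types encoding):

```python
def encode_long(n: int) -> bytes:
    # zig-zag, then variable-length base-128 (see simple types)
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        byte = z & 0x7F
        z >>= 7
        if z:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string(s: str) -> bytes:
    data = s.encode("utf-8")
    return encode_long(len(data)) + data

def encode_person(first_name: str, last_name: str) -> bytes:
    # Record: concatenate the field encodings, in schema order
    return encode_string(first_name) + encode_string(last_name)
```

`encode_person("Pat", "Patterson")` is 14 bytes: two length prefixes plus the raw UTF-8 — nothing else.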
{
"type": "record",
"namespace": "com.example",
"name": "Person",
"fields": [
{ "name": "first_name", "type": "string" },
{ "name": "last_name", "type": "string" },
{ "name": "age", "type": "int", "default": -1 }
]
}
Avro Schema Evolution
Compatibility Rules:
• New fields must have a default
• Deleted field must have had a default
• Doc/order can be added/removed/changed
• Field default can be added/changed
• Field/type aliases can be added/removed
• Non-union can be converted to union with just that type, or vice versa
General rule is that old data can be read under the new schema
Avro Schema Evolution
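The general rule can be sketched from the reader's side: data written with the old `Person` schema is resolved against the new one, and the missing `age` field takes its default. A toy resolver (not the real Avro resolution algorithm, just the default-filling step):

```python
# New schema's fields, as in the slide: "age" was added with a default of -1.
NEW_FIELDS = [
    {"name": "first_name", "type": "string"},
    {"name": "last_name", "type": "string"},
    {"name": "age", "type": "int", "default": -1},
]

def resolve(old_record: dict) -> dict:
    # Walk the reader's (new) schema; any field absent from the writer's
    # data must supply a default -- this is why new fields require one.
    out = {}
    for field in NEW_FIELDS:
        if field["name"] in old_record:
            out[field["name"]] = old_record[field["name"]]
        else:
            out[field["name"]] = field["default"]
    return out
```

Reading `{"first_name": "Pat", "last_name": "Patterson"}` under the new schema yields `age = -1`.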
Avro Schema Serialization
Various options, depending on file/message orientation, but, generally:
• Metadata, including the schema
• Data
Great for files - schema is sent just once, but what about messages?
• Send just once? Periodically?
• Send per message?
• Agree out of band?
Schema Overhead
Demo
Online schema repository
• Simple REST API
Each schema has an ID
• Unique within the repository
Schemas versioned within subjects
• Supports schema evolution
• Subject loosely corresponds to topic
• Subject + version -> ID
Schema Registry
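Registration through the REST API is a single POST to `/subjects/{subject}/versions`, with the Avro schema passed as a JSON *string* inside a JSON object. A sketch of building that request body (the host and subject name `person-value` are illustrative, not from the slides):

```python
import json

schema = {
    "type": "record",
    "namespace": "com.example",
    "name": "Person",
    "fields": [
        {"name": "first_name", "type": "string"},
        {"name": "last_name", "type": "string"},
    ],
}

# Note the double encoding: the schema document is itself serialized
# to a string before being placed in the request body.
body = json.dumps({"schema": json.dumps(schema)})

# POST body to http://<registry>:8081/subjects/person-value/versions
# (hypothetical host/subject); the registry responds with {"id": <schema-id>}
```

Retrieval is the mirror image: GET `/schemas/ids/{id}` returns the schema text for a given ID.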
Register schema, registry returns an ID
Sender passes schema ID in each message
Recipient looks up ID in registry
Solves the Avro-by-Message Problem
Schema By Reference
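On the wire, passing the schema by reference costs just five bytes per message in the Confluent wire format: a magic byte of 0, then the schema ID as a big-endian 32-bit int, then the Avro binary data. A minimal sketch:

```python
import struct

MAGIC_BYTE = 0  # Confluent wire format marker

def frame_message(schema_id: int, avro_payload: bytes) -> bytes:
    # [0x00][schema id, big-endian int32][Avro binary data]
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + avro_payload

def parse_message(msg: bytes):
    # Recipient reads the 5-byte header, then looks the ID up in the registry
    magic, schema_id = struct.unpack(">bI", msg[:5])
    if magic != MAGIC_BYTE:
        raise ValueError("not a registry-framed message")
    return schema_id, msg[5:]
```

Compared with shipping a full JSON schema per message, the fixed 5-byte header is what makes the per-message overhead negligible.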
Demo
Just register a new (compatible) schema under the same subject
Schema is assigned a new ID
Evolution with Schema Registry
Schema Evolution
Demo
Landoop schema-registry-ui
https://github.com/Landoop/schema-registry-ui
Bonus Feature: Web UI
Schema Evolution Part Deux
Demo
Conclusion
Avro: a row-oriented, self-describing format for data serialization
Avro's default serialization, which carries the full schema with the data, is inefficient in a message-passing setting
Referencing schema by ID dramatically reduces the volume of network traffic