Span Conference: Why your company needs a unified log
alexander-dean -
Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
A quick history lesson: the three eras of business data processing [1]
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The classic era of business data processing, 1996+
[Diagram: own data center. Narrow data silos (CMS, CRM, E-comm, ERP), each with a low-latency local loop, joined by point-to-point connections. A nightly batch ETL process loads a data warehouse (high latency, wide data coverage, full data history) that feeds management reporting.]
The hybrid era, 2005+
[Diagram: narrow data silos spread across a cloud vendor / own data center (Search, E-comm, ERP, CMS) and three SaaS vendors (CRM, Email marketing, Web analytics), each with a low-latency local loop. APIs and bulk exports feed stream processing (Product rec's) and micro-batch processing (Systems monitoring) at low latency, and batch processing into a data warehouse (Management reporting) and into Hadoop (Ad hoc analytics) at high latency.]
The hybrid era: a surfeit of software vendors
[Diagram: the hybrid-era architecture from the previous slide, repeated.]
The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit’s story
vs.
The wife’s story
vs.
The samurai’s story
vs.
The woodcutter’s story
The unified era, 2013+
[Diagram: silos in the cloud vendor / own data center (Search, E-comm, ERP, CMS) and at SaaS vendors (CRM, Email marketing) remain narrow, with some low-latency local loops. Streaming APIs / web hooks feed their events into a unified log (low latency, wide data coverage, few days' data history). The log is archived to Hadoop (high latency, wide data coverage, full data history) and consumed as an eventstream by systems monitoring, product rec's, ad hoc analytics, management reporting, fraud detection and churn prevention. Also labelled: APIs.]
The unified log is Amazon Kinesis, or Apache Kafka
• Amazon Kinesis, a hosted AWS service
• Extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log
• Developed at LinkedIn to serve as their organization’s unified log
“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]
[1] http://kafka.apache.org/
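To make "append-only, ordered commit log" concrete, here is a minimal in-memory sketch. It is an illustration only: the class UnifiedLog and its methods append, poll and rewind are hypothetical names, not the Kafka or Kinesis API, and it ignores distribution, partitioning and retention. It shows the two properties that matter here: records are only ever appended, and each consumer tracks its own position in the log.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class UnifiedLog:
    """Toy single-partition, append-only log (hypothetical, not a real client API)."""
    _records: List[Any] = field(default_factory=list)
    _offsets: Dict[str, int] = field(default_factory=dict)

    def append(self, record: Any) -> int:
        """Append a record and return its offset; existing records are never modified."""
        self._records.append(record)
        return len(self._records) - 1

    def poll(self, consumer: str, max_records: int = 10) -> List[Any]:
        """Return the next records for this consumer and advance only its own offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._records[start:start + max_records]
        self._offsets[consumer] = start + len(batch)
        return batch

    def rewind(self, consumer: str, offset: int = 0) -> None:
        """Move a consumer back, for example to re-process events from the start."""
        self._offsets[consumer] = offset


log = UnifiedLog()
for name in ["page_view", "add_to_basket", "checkout"]:
    log.append({"event": name})

# Two independent consumers read the same ordered records at their own pace.
reporting = log.poll("reporting")
monitoring = log.poll("monitoring", max_records=2)
```

Because each reader only holds an offset, adding a new consumer does not require any change to the producers or to the other consumers.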
So what does a unified log give us?
1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled
What does a unified log let us do that we couldn’t do before?
Populating a unified log with your company’s event streams, to enable…
• Real-time management reporting
• Holistic systems monitoring
• Re-running models from Day 0
• A/B testing end-to-end pipelines
• Shipping offline models to real time
… anything requiring low latency response / holistic view of our company’s data!
But garbage in, garbage out: it’s crucial to properly model the event streams feeding into the unified log
• We are working on a semantic model for events, an “event grammar”, at Snowplow [1]
• The event grammar borrows concepts from human language: Subject, Verb, Direct Object, Indirect Object, Prepositional Object, plus the Event itself and its Context
• A semantic model prevents business and technology assumptions leaking into the event stream, making it less brittle over time
[1] http://snowplowanalytics.com/blog/2013/08/12/towards-universal-event-analytics-building-an-event-grammar/
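One way to picture an event grammar is as a record whose fields are the grammatical roles named above. The sketch below is an assumption about how such a model could look, not Snowplow's actual schema: the types Entity and Event, the roles() helper and the example values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Entity:
    """Something that takes part in an event (hypothetical type)."""
    kind: str
    id: str


@dataclass
class Event:
    """One event, described with grammatical roles rather than business-specific fields."""
    subject: Entity
    verb: str
    direct_object: Entity
    indirect_object: Optional[Entity] = None
    prepositional_object: Optional[Entity] = None
    context: Dict[str, str] = field(default_factory=dict)

    def roles(self) -> Dict[str, Entity]:
        """Return only the entity roles that are filled in for this event."""
        candidates = {
            "subject": self.subject,
            "direct_object": self.direct_object,
            "indirect_object": self.indirect_object,
            "prepositional_object": self.prepositional_object,
        }
        return {role: entity for role, entity in candidates.items() if entity is not None}


# "A player sends a message to another player in a chat room" (made-up example).
event = Event(
    subject=Entity("player", "p1"),
    verb="send",
    direct_object=Entity("message", "m42"),
    indirect_object=Entity("player", "p2"),
    prepositional_object=Entity("chat_room", "lobby"),
    context={"timestamp": "2014-01-01T00:00:00Z", "platform": "web"},
)
```

The same structure can describe a page view, a purchase or a server alert without adding event-specific columns, which is the sense in which a grammar keeps business assumptions out of the stream.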
We also need to store and version the schemas used to describe our events, as these will change over time
[Diagram: unified log]
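A rough sketch of what storing and versioning schemas could mean in practice: every event names the schema and version it was written against, and a registry keeps every version so that old events in the log remain readable. The registry layout, the field names ("schema", "version", "data") and the validate function are assumptions for illustration, not a description of Snowplow's implementation.

```python
from typing import Any, Dict, Tuple

# A schema here is just "required field name -> expected Python type" (simplified on purpose).
Schema = Dict[str, type]

# Hypothetical registry: old versions are kept alongside new ones, never overwritten.
SCHEMAS: Dict[Tuple[str, int], Schema] = {
    ("page_view", 1): {"url": str},
    ("page_view", 2): {"url": str, "referrer": str},
}


def validate(event: Dict[str, Any]) -> bool:
    """Check an event against the exact schema version it declares."""
    schema = SCHEMAS.get((event["schema"], event["version"]))
    if schema is None:
        return False  # unknown schema or version
    data = event["data"]
    return all(isinstance(data.get(name), expected) for name, expected in schema.items())


old_event = {"schema": "page_view", "version": 1, "data": {"url": "http://example.com/"}}
new_event = {
    "schema": "page_view",
    "version": 2,
    "data": {"url": "http://example.com/", "referrer": "http://example.org/"},
}
```

Events written under version 1 still validate after version 2 is introduced, which matters when re-processing the log from its beginning.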
Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems
1. Trackers: Generate event data from any environment
2. Collectors: Log raw events from trackers
3. Enrich: Validate and enrich raw events
4. Storage: Store enriched events ready for analysis
5. Analytics: Analyze enriched events
A, B, C, D = Standardised data protocols between the subsystems
These turned out to be critical to allowing us to evolve the above stack
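The value of standardised protocols between subsystems can be sketched in a few lines: if each stage only reads and writes an agreed format, any stage can be replaced without touching its neighbours. In the sketch below JSON strings stand in for the protocols, and the function names (collect, enrich, run_pipeline, store_in_memory) and the page_host field are hypothetical, chosen only for this example.

```python
import json
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlparse


def collect(tracker_payloads: Iterable[str]) -> List[str]:
    """Collector: log raw tracker payloads unchanged."""
    return list(tracker_payloads)


def enrich(raw_lines: Iterable[str]) -> List[str]:
    """Enrich: parse each raw event and add a derived field."""
    enriched = []
    for line in raw_lines:
        event = json.loads(line)
        event["page_host"] = urlparse(event["url"]).netloc
        enriched.append(json.dumps(event))
    return enriched


def run_pipeline(payloads: Iterable[str], store: Callable[[List[str]], None]) -> None:
    """Wire the stages together; the storage stage is passed in so it can be swapped."""
    store(enrich(collect(payloads)))


warehouse: List[Dict[str, str]] = []


def store_in_memory(enriched_lines: List[str]) -> None:
    """Storage: keep parsed enriched events ready for analysis."""
    warehouse.extend(json.loads(line) for line in enriched_lines)


run_pipeline(['{"url": "http://example.com/pricing"}'], store_in_memory)
```

Swapping store_in_memory for another function that accepts the same list of JSON strings changes the storage target without any change to collect or enrich.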
Today almost all users/customers are running a batch-based Snowplow configuration
[Diagram: Snowplow event tracking SDK, HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift]
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
The Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log
[Diagram: the unified-era architecture shown earlier, repeated.]
Can we implement Snowplow on top of Kinesis/Kafka?
We are working on Amazon Kinesis support first; Apache Kafka will come later (using Apache Samza for stream processing)
[Diagram: Snowplow Trackers send events to the Scala Stream Collector, which writes a raw event stream. The Enrich Kinesis app reads it and writes an enriched event stream and a bad raw events stream. Downstream Kinesis apps: S3 sink (to S3), Redshift sink (to Redshift), Elasticsearch sink (to Elasticsearch) and an event aggregator (to DynamoDB). Some components are marked "not yet released".]
Analytics on Read (for agile exploration of event stream, ML, auditing, applying alternate models, reprocessing etc)
Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
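The flow above can be sketched without any Kinesis dependency. The code below is a simplified illustration under stated assumptions: plain Python lists stand in for the raw, enriched and bad streams, the validation rule (valid JSON object with an "event" field) is invented for the example, and the names enrich_stream and events_for_user are hypothetical. It also contrasts analytics on write (aggregates updated as events arrive) with analytics on read (events stored as-is and queried later).

```python
import json
from collections import Counter
from typing import Any, Dict, List, Tuple


def enrich_stream(raw_stream: List[str]) -> Tuple[List[Dict[str, Any]], List[Dict[str, str]]]:
    """Split raw events into an enriched stream and a bad-events stream."""
    enriched, bad = [], []
    for raw in raw_stream:
        try:
            event = json.loads(raw)
            if not isinstance(event, dict) or "event" not in event:
                raise ValueError("missing 'event' field")
        except ValueError as err:  # json.JSONDecodeError is a subclass of ValueError
            bad.append({"raw": raw, "error": str(err)})
            continue
        enriched.append(event)
    return enriched, bad


raw_stream = [
    '{"event": "page_view", "user": "u1"}',
    "not json",
    '{"user": "u2"}',
    '{"event": "page_view", "user": "u2"}',
    '{"event": "checkout", "user": "u1"}',
]
enriched, bad = enrich_stream(raw_stream)

# Analytics on write: keep a running aggregate, updated once per incoming event.
counts_on_write: Counter = Counter()
for event in enriched:
    counts_on_write[event["event"]] += 1


# Analytics on read: keep the full events and decide the question at query time.
def events_for_user(events: List[Dict[str, Any]], user: str) -> List[Dict[str, Any]]:
    return [e for e in events if e.get("user") == user]
```

Keeping the bad events in their own stream, rather than dropping them, means they can be inspected and re-processed once the cause is fixed.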
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or
Discount code: spancNw (43% off all Manning eBooks for Span :)