AWS User Group UK: Why your company needs a unified log
Introducing myself
• Alex Dean
• Co-founder and technical lead at Snowplow, the open-source event analytics platform based here in London [1]
• Weekend writer of Unified Log Processing, available on the Manning Early Access Program [2]
[1] https://github.com/snowplow/snowplow
[2] http://manning.com/dean
A quick history lesson: the three eras of business data processing [1]
1. The classic era, 1996+
2. The hybrid era, 2005+
3. The unified era, 2013+
[1] http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/
The classic era of business data processing, 1996+
[Diagram, own data center: narrow data silos (CMS, CRM, E-comm, ERP), each with its own low-latency local loop, are joined by point-to-point connections. A nightly batch ETL process loads them into a data warehouse, which offers wide data coverage and full data history at high latency and feeds management reporting.]
The hybrid era, 2005+
[Diagram, cloud vendor / own data center plus SaaS vendors #1, #2 and #3: narrow data silos (Search, E-comm, CRM, Email marketing, ERP, CMS, Web analytics), each with a low-latency local loop, are connected through APIs and bulk exports. Low-latency paths: stream processing (product rec's) and micro-batch processing (systems monitoring). High-latency paths: batch processing into the data warehouse (management reporting) and batch processing into Hadoop (ad hoc analytics).]
The hybrid era: a surfeit of software vendors
[Diagram: the hybrid-era architecture again, with its mix of own-data-center systems, SaaS vendors #1, #2 and #3, and separate stream, micro-batch and batch processing paths.]
The hybrid era: company-wide reporting and analytics ends up like Rashomon
The bandit’s story
vs.
The wife’s story
vs.
The samurai’s story
vs.
The woodcutter’s story
The unified era, 2013+
[Diagram, cloud vendor / own data center plus SaaS vendors #1 and #2: narrow data silos (Search, E-comm, CRM, Email marketing, ERP, CMS), some still with low-latency local loops, publish events through streaming APIs / web hooks into a unified log. The unified log offers low latency and wide data coverage but holds only a few days' data history. It is archived to Hadoop, which offers wide data coverage and full data history. The eventstream serves both high-latency and low-latency consumers: systems monitoring, product rec's, ad hoc analytics, management reporting, fraud detection, churn prevention and APIs.]
The unified log is Amazon Kinesis, or Apache Kafka
• Amazon Kinesis, a hosted AWS service
• Extremely similar semantics to Kafka
• Apache Kafka, an append-only, distributed, ordered commit log
• Developed at LinkedIn to serve as their organization’s unified log
“Kafka is designed to allow a single cluster to serve as the central data backbone for a large organization” [1]
[1] http://kafka.apache.org/
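To make "append-only, ordered commit log" concrete, here is a minimal in-memory sketch. It is not the Kafka or Kinesis API: the `CommitLog` and `Consumer` classes and the event shapes are hypothetical, and the sketch covers a single partition with no distribution or persistence. It shows the two properties the slide relies on: records are only ever appended, and each consumer reads in order from its own offset.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CommitLog:
    """Toy single-partition, in-memory, append-only log (hypothetical)."""
    _records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        """Append a record to the end of the log and return its offset."""
        self._records.append(record)
        return len(self._records) - 1

    def read(self, offset: int, max_records: int = 100) -> List[Any]:
        """Return up to max_records records from offset, in append order."""
        return self._records[offset:offset + max_records]


@dataclass
class Consumer:
    """Tracks its own position, so consumers never interfere with each other."""
    log: CommitLog
    offset: int = 0

    def poll(self, max_records: int = 100) -> List[Any]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


log = CommitLog()
for event in ({"type": "page_view"}, {"type": "add_to_basket"}, {"type": "purchase"}):
    log.append(event)

monitoring = Consumer(log)
reporting = Consumer(log)

first_batch = monitoring.poll(max_records=2)  # monitoring reads two events
all_events = reporting.poll()                 # reporting independently reads all three
rest = monitoring.poll()                      # monitoring resumes where it left off
```

Because reading never removes records, any number of downstream applications can consume the same stream at their own pace.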
So what does a unified log give us?
1. A single version of the truth
2. Our truth is now upstream from the data warehouse
3. The hairball of point-to-point connections has been unravelled
4. Local loops have been unbundled
What does a unified log let us do that we couldn’t do before?
Populating a unified log with your company’s event streams, to enable…
• Real-time management reporting
• Holistic systems monitoring
• Re-running models from Day 0
• A/B testing end-to-end pipelines
• Shipping offline models to RT
… anything requiring low latency response / holistic view of our company’s data!
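As a rough illustration of "re-running models from Day 0" and A/B testing pipelines, the sketch below replays the same archived events through two versions of a model and compares the results. Everything here is assumed for illustration: the `replay` helper, the two fraud rules, their thresholds and the `order_value` field are hypothetical, and a plain list stands in for the archived event history.

```python
from typing import Callable, Dict, Iterable, List

Event = Dict[str, float]


def replay(events: Iterable[Event], model: Callable[[Event], bool]) -> List[bool]:
    """Run a model over the full event history, oldest event first."""
    return [model(event) for event in events]


# Stand-in for the archived history of the unified log (hypothetical data).
history: List[Event] = [
    {"order_value": 20.0},
    {"order_value": 950.0},
    {"order_value": 120.0},
    {"order_value": 480.0},
]


def fraud_model_v1(event: Event) -> bool:
    """Current rule: flag orders above 500."""
    return event["order_value"] > 500.0


def fraud_model_v2(event: Event) -> bool:
    """Candidate rule: flag orders above 400."""
    return event["order_value"] > 400.0


flags_v1 = replay(history, fraud_model_v1)
flags_v2 = replay(history, fraud_model_v2)

# Events on which the two versions disagree are the ones worth reviewing.
disagreements = [
    event for event, a, b in zip(history, flags_v1, flags_v2) if a != b
]
```

Both versions see identical input, so any difference in output comes from the model change rather than from the data.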
Some background: early on, we decided that Snowplow should be composed of a set of loosely coupled subsystems
1. Trackers: generate event data from any environment
2. Collectors: log raw events from trackers
3. Enrich: validate and enrich raw events
4. Storage: store enriched events ready for analysis
5. Analytics: analyze enriched events
A, B, C, D = Standardised data protocols
These turned out to be critical to allowing us to evolve the above stack
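One way to picture a loosely coupled stage behind a standardised protocol is a function that accepts raw events in an agreed shape and emits either an enriched event or a rejected one. The sketch below is not Snowplow's enrichment code: the required field names, the `enrich` function and the error format are assumptions made up for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical protocol: the fields every raw event must carry.
REQUIRED_FIELDS = ("event_id", "event_type", "timestamp")

RawEvent = Dict[str, Any]


def enrich(raw_events: List[RawEvent]) -> Tuple[List[RawEvent], List[Dict[str, Any]]]:
    """Validate raw events, then enrich the valid ones.

    Returns (enriched_events, bad_events). Invalid events are kept,
    together with the reasons they failed, rather than being dropped.
    """
    enriched: List[RawEvent] = []
    bad: List[Dict[str, Any]] = []
    for raw in raw_events:
        missing = [name for name in REQUIRED_FIELDS if name not in raw]
        if missing:
            bad.append({
                "raw": raw,
                "errors": [f"missing field: {name}" for name in missing],
            })
        else:
            # Example enrichment: normalise the event type, mark as validated.
            enriched.append({
                **raw,
                "event_type": raw["event_type"].lower(),
                "validated": True,
            })
    return enriched, bad


raw_events = [
    {"event_id": "e1", "event_type": "Page_View", "timestamp": 1400000000},
    {"event_id": "e2", "timestamp": 1400000060},  # no event_type
]
enriched_events, bad_events = enrich(raw_events)
```

As long as the input and output shapes stay fixed, the stage on either side can be swapped (for example batch for streaming) without touching this one.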
Today most users are running a batch-based Snowplow configuration
[Diagram components: Snowplow event tracking SDK, HTTP-based event collector, Amazon S3, Hadoop-based enrichment, Amazon Redshift]
• Batch-based
• Normally run overnight; sometimes every 4-6 hours
The Snowplow batch-based flow uses Amazon S3 as a “poor man’s” unified log
Can we implement Snowplow on top of Kinesis/Kafka?
We are working on Amazon Kinesis support first; Apache Kafka + Samza will come later this year
[Diagram: Snowplow Trackers send events to scala-stream-collector, which writes the raw event stream. scala-kinesis-enrich reads it and writes the enriched event stream and the bad raw event stream. Downstream sinks and Kinesis apps: S3 sink Kinesis app (S3), Redshift sink Kinesis app (Amazon Redshift), kinesis-elasticsearch-sink (Elasticsearch), kinesis-bigquery-sink (Google BigQuery) and Event aggregator Kinesis app (DynamoDB). Some components are marked as not yet released.]
• Analytics on Read (for agile exploration of events, machine learning, auditing, re-processing…)
• Analytics on Write (for dashboarding, audience segmentation, RTB, etc)
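A minimal sketch of the analytics-on-write idea, assuming a hypothetical enriched event with `event_type` and an epoch-seconds `timestamp`: counts per minute and event type are updated as events arrive, so a dashboard can read the totals directly. An in-memory `Counter` stands in for a key-value store such as DynamoDB; this is not the Snowplow event aggregator.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Any, Dict, Iterable, Optional, Tuple

Key = Tuple[str, str]  # (minute bucket, event type)


def minute_bucket(epoch_seconds: int) -> str:
    """Truncate a UTC epoch timestamp to its minute, e.g. '1970-01-01T00:01'."""
    moment = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%dT%H:%M")


def aggregate(events: Iterable[Dict[str, Any]],
              counts: Optional[Counter] = None) -> Counter:
    """Fold a batch of events into running counts per (minute, event type)."""
    counts = Counter() if counts is None else counts
    for event in events:
        key: Key = (minute_bucket(event["timestamp"]), event["event_type"])
        counts[key] += 1
    return counts


events = [
    {"event_type": "page_view", "timestamp": 0},
    {"event_type": "page_view", "timestamp": 30},
    {"event_type": "purchase", "timestamp": 45},
    {"event_type": "page_view", "timestamp": 60},
]
counts = aggregate(events)
```

The trade-off against analytics on read is that the questions (here, counts per minute and type) must be chosen before the events arrive.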
Snowplow users can already write stream processing applications which leverage the Snowplow enriched event stream
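[Diagram: Snowplow Trackers send events to scala-stream-collector, which writes the raw event stream. scala-kinesis-enrich reads it and writes the enriched event stream and the bad raw event stream. Stream processing options shown: AWS Lambda, Apache Storm, Apache Samza, Apache Spark Streaming, Kinesis Client Library.]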
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To meet up or chat, @alexcrdean on Twitter or [email protected]
Discount code: ulogprugcf (43% off Unified Log Processing eBook)