Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe
flip-kromer
Patterns of the Lambda Architecture
Truth and Lies at the Edge of Scale
Pattern Set
Tradeoff Rules
PICK ANY TWO
Lambda Architecture
Search w/ Update
[Diagram, built up over several slides. Labels: Build Indexes, A Ton of Text, Historical Index, Live Indexer, More Text, Recent Index, API. Successive slides highlight: Main Index, Updates Index, Serve Result.]
Lambda Architecture
[Same diagram, annotated with the layers Batch, Speed, Serving.]
Recommender
[Diagram, built up over several slides. Labels: Train Recomm’der, Visitor ⇒ History, History ⇒ Alsobuy, Visitor:Product, Visitor ⇒ Alsobuy, Update Recommendation, Fetch/Update History, Visitor:Product History, Webserver. Successive slides highlight: Build Model, Applies Model, Serves Result. The final slide annotates the layers Batch, Speed, Serving.]
Lambda Arch Layers
• Batch layer -- Deep Global Truth -- throughput
• Speed layer -- Relevant Local Truth -- throughput
• Serving layer -- Rapid Retrieval -- latency
Lambda Arch: Technology
• Batch layer -- Hadoop, Spark, Batch DB Reports
• Speed layer -- Storm+Trident, Spark Streaming, Samza, AMQP, …
• Serving layer -- Web APIs, Static Assets, RPC, …
Lambda Architecture
[Diagram: Batch, Speed, Serving; a second slide overlays λ.]
λ(v)
• Pure Function on immutable data
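One way to read λ(v): every view is a pure function of the full, immutable record of events, so a view can be discarded and recomputed at any time. A minimal sketch, assuming a hypothetical `Event` schema and a `pageview_counts` view (neither name comes from the talk):

```python
from collections import Counter
from typing import NamedTuple, Tuple


class Event(NamedTuple):
    """An immutable fact (hypothetical schema): a visitor viewed a page."""
    visitor: str
    page: str
    ts: int


def pageview_counts(events: Tuple[Event, ...]) -> Counter:
    """A view: a pure function of the events, with no hidden state."""
    return Counter(e.page for e in events)


# The master dataset is append-only; "updating" means building a new tuple.
events = (Event("ann", "/a", 1), Event("bob", "/a", 2))
events_later = events + (Event("ann", "/b", 3),)

view = pageview_counts(events_later)
```

Because `pageview_counts` depends only on its argument, rerunning it after a crash or a code fix gives the same answer for the same events.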
Two Big Ideas
• Fine-grained control over architectural tradeoffs
• Truth lives at the edge, not the middle
Ideal Data System
• Capacity -- Can process arbitrarily large amounts of data
• Affordability -- Cheap to run
• Simplicity -- Easy to build, maintain, debug
• Resilience -- Jobs/Processes fail & restart gracefully
• Justification -- Incorporates all relevant data into result
• Comprehensiveness -- Answer questions about any subject
• Accuracy -- Few approximations or avoidable errors
• Responsiveness -- Low latency for delivering results
• Recency -- Promptly incorporates changes in world
Patterns
[Diagram: Recommender. Labels: Train Recomm’der, Visitor ⇒ History, History ⇒ Alsobuy, Visitor:Product, Visitor ⇒ Alsobuy, Update Recommendation, Fetch/Update History, Visitor:Product History, Webserver.]
Pattern: Train / React
• Model of the world lets you make immediate decisions
• World changes slowly, so we can re-build model at leisure
• Relax: Recency
• Batch layer: Train a machine learning model
• Speed layer: Apply that model
• Examples: most Machine Learning thingies
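The Train / React split can be sketched with a toy "also buy" recommender. This is only an illustration under assumptions: the co-purchase counting, the function names `train_also_buy` and `recommend`, and the sample histories are all hypothetical stand-ins for whatever model the batch layer really trains.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_also_buy(histories):
    """Batch layer: build a product -> Counter(co-purchased products) model."""
    model = defaultdict(Counter)
    for products in histories.values():
        for a, b in permutations(set(products), 2):
            model[a][b] += 1
    return model


def recommend(model, history, k=2):
    """Speed layer: apply the (possibly stale) model to one visitor's history."""
    scores = Counter()
    for product in history:
        scores.update(model.get(product, {}))
    for product in history:  # don't recommend what they already have
        scores.pop(product, None)
    return [p for p, _ in scores.most_common(k)]


histories = {
    "ann": ["tent", "stove", "lantern"],
    "bob": ["tent", "stove"],
    "cat": ["stove", "kettle"],
}
model = train_also_buy(histories)   # rebuilt at leisure
picks = recommend(model, ["tent"])  # answered immediately
```

The relaxed property is Recency: purchases made after the last training run do not influence `picks` until the model is rebuilt.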
[Diagram: Search w/ Update. Labels: Build Indexes, A Ton of Text, Historical Index, Live Indexer, More Text, Recent Index, API.]
Pattern: Baseline / Delta
• Understanding the world takes a long time
• World changes much faster than that, and you care
• Relax: Simplicity, Accuracy
• Batch layer: Process the entire world
• Speed layer: Handle any changes since last big run
• Examples: Real-time Search index; Count Distinct; other Approximate Stream Algorithms
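A rough sketch of Baseline / Delta for the search example: a historical index built in bulk, a small recent index fed by a live indexer, and a query that unions the two. The class and method names are hypothetical, and the toy inverted index ignores deletes and edits, which is one place the relaxed Accuracy shows up.

```python
from collections import defaultdict


def build_index(docs):
    """Map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


class SearchWithUpdate:
    def __init__(self, historical_docs):
        # Batch layer: process the entire world (slow, done occasionally).
        self.historical = build_index(historical_docs)
        # Speed layer: only what has changed since the last big run.
        self.recent = defaultdict(set)

    def add(self, doc_id, text):
        """Live indexer: fold one new document into the recent index."""
        for word in text.lower().split():
            self.recent[word].add(doc_id)

    def search(self, word):
        """Serving layer: union of the baseline and the delta."""
        w = word.lower()
        return self.historical.get(w, set()) | self.recent.get(w, set())

    def rebuild(self, all_docs):
        """After a new batch run, the delta is discarded."""
        self.historical = build_index(all_docs)
        self.recent = defaultdict(set)


search = SearchWithUpdate({1: "a ton of text", 2: "more text"})
search.add(3, "Fresh text")
hits = search.search("text")
```

Keeping two indexes and a merge step is the Simplicity cost the pattern accepts in exchange for Recency.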
Pagerank
[Figure: a graph whose nodes are labeled with Pagerank scores; one node is marked "?".]
Guess Using Neighbors
12÷3 = 4, 24÷5 ≈ 5, so the guess is ≈ 9
…But Don’t Fix Neighbors
meh
Batch Updates Graph
[Figure: the same graph with updated scores on every node.]
[Diagram: Pagerank. Labels: Friend Relations, Converge Pagerank, User ⇒ Pagerank, Bob, Retrieve Bob’s Facebook Ntwk, Bob’s Friends’ Pageranks, Estimate Bob’s Pagerank, API.]
But don’t bother updating Bob’s Friends (or friends’ friends or …)
Pattern: Guess Beats Blank
• You can’t calculate a good answer quickly
• But Comprehensiveness is a must
• Relaxing: Accuracy, Justification
• Batch layer: finds the correct answer
• Speed layer: makes a reasonable guess
• Examples: Any time the sparse-data case is also the most valuable
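The "guess using neighbors" step can be sketched as a single un-damped Pagerank contribution sum: each neighbor donates its batch-computed score divided by its out-degree. The function name, node names, and the omission of a damping factor are assumptions made for illustration; the numbers echo the 12÷3 and 24÷5 on the slide.

```python
def guess_pagerank(neighbors, pagerank, out_degree):
    """Speed layer: estimate a new node's rank from its in-neighbors only.

    Each neighbor contributes rank / out-degree, as in one Pagerank step
    without damping. The neighbors' own scores are left untouched.
    """
    return sum(pagerank[n] / out_degree[n] for n in neighbors)


# Batch-computed scores for two hypothetical neighbors, "u" and "v".
pagerank = {"u": 12.0, "v": 24.0}
out_degree = {"u": 3, "v": 5}

bob_guess = guess_pagerank(["u", "v"], pagerank, out_degree)
```

The guess is not the converged value, and nothing downstream of the new node is corrected; the next batch run replaces it with the real answer.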
Marine Corps’ 80% Rule
“Any decision made with more than 80% of the necessary information is hesitation”
-- “The Marine Corps Way”, Santamaria & Martino
Pattern: World/Local
• Understanding the world needs full graph
• You can tell a little white lie reading immediate graph only
• Relaxing: Accuracy, Justification
• Batch layer: uses global graph information
• Speed layer: just reads immediate neighborhood
• Examples: “Whom to follow”, Clustering, anything at 2nd-degree (friend-of-a-friend)
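A small sketch of the speed-layer half of World/Local for a "whom to follow" feature: it reads only the user's immediate neighborhood and ranks friends-of-friends by how many of the user's follows also follow them. The `follows` adjacency map and the function name are hypothetical.

```python
from collections import Counter


def whom_to_follow(user, follows, k=3):
    """Speed layer: rank friends-of-friends by how many of my follows follow them.

    Reads only the immediate neighborhood of `user`; it ignores the global
    structure that a batch job over the full graph would use.
    """
    mine = follows.get(user, set())
    counts = Counter()
    for friend in mine:
        for candidate in follows.get(friend, set()):
            if candidate != user and candidate not in mine:
                counts[candidate] += 1
    return [c for c, _ in counts.most_common(k)]


follows = {
    "bob": {"ann", "cat"},
    "ann": {"dan", "eve", "bob"},
    "cat": {"dan"},
}
suggestions = whom_to_follow("bob", follows)
```

A batch job with the whole graph could weight candidates by global signals; the local version trades that Accuracy for an answer that costs a handful of lookups.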
Network Security
[Diagram. Labels: Net Connections, Store Interaction, Connection Counts, Find Potential Evilness, Agents of Interest, Approximate Streaming Agg, Agent of Interest?, Detected Evilnesses, Dashboard.]
Pattern: Slow Boil/Flash Fire
• Two tempos of data: months vs milliseconds
• Short-term data too much to store
• Long-term data too much to handle immediately
• Often accompanies Baseline / Deltas
• Examples:
  • Trending Topics
  • Insider Security
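For the Trending Topics example, the two tempos might look like this: a baseline rate per topic that a batch job derives from months of data, and a short in-memory window that expires events as it goes. The class, its parameters, and the default baseline of 1.0 for unseen topics are assumptions, and timestamps are assumed to arrive in non-decreasing order.

```python
from collections import Counter, deque


class TrendingTopics:
    """Compare a short in-memory window against a batch-built baseline."""

    def __init__(self, baseline_per_window, window_seconds=60):
        self.baseline = baseline_per_window  # slow boil: from months of data
        self.window_seconds = window_seconds
        self.events = deque()                # flash fire: (timestamp, topic)
        self.counts = Counter()

    def observe(self, ts, topic):
        """Record one mention; assumes timestamps never go backwards."""
        self.events.append((ts, topic))
        self.counts[topic] += 1
        # Short-term data is too much to store, so expire it as we go.
        while self.events and self.events[0][0] <= ts - self.window_seconds:
            _, old = self.events.popleft()
            self.counts[old] -= 1

    def score(self, topic):
        """How many times hotter than usual is this topic right now?"""
        usual = self.baseline.get(topic, 1.0)  # assumed default for unseen topics
        return self.counts[topic] / usual


trending = TrendingTopics({"cats": 10.0, "storm": 1.0})
for ts in range(5):
    trending.observe(ts, "storm")
trending.observe(5, "cats")
storm_score = trending.score("storm")
cats_score = trending.score("cats")
```

Here "storm" scores far above its usual rate while "cats" is well below it, even though neither raw count means much without the long-term baseline.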
Banking, Oversimplified
[Diagram. Labels: Reconcile Accounts, Account Balances, Event Store, Transaction, Update Records. A second slide adds the annotations nice-to-have and essential.]
This wins over fast layer
Pattern: C-A-P Tradeoffs
• C-A-P tradeoffs:
• Can’t depend on when data will roll in (Justification)
• Can’t live in ignorance (Comprehensiveness)
• Batch layer: The final answer
• Speed layer: Actionable views
• Examples: Security (Authorization vs Auditing), lots of counting problems
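The banking example can be sketched as an account with two balances: an actionable one that the speed layer keeps current, and a reconciled one that the batch layer recomputes from the event store and that always wins. The `Account` class and its method names are hypothetical, and the sketch assumes each reconciliation covers every pending transaction.

```python
class Account:
    def __init__(self, reconciled_balance=0):
        self.reconciled = reconciled_balance  # batch layer: the final answer
        self.pending = []                     # speed layer: seen, not yet reconciled

    def record(self, amount):
        """Speed layer: accept a transaction as soon as it shows up."""
        self.pending.append(amount)

    def actionable_balance(self):
        """Good enough to act on now; may be missing late-arriving data."""
        return self.reconciled + sum(self.pending)

    def reconcile(self, event_store_total):
        """Batch layer: recompute from the full event store; this result wins.

        Assumes the batch run has seen everything in `pending`.
        """
        self.reconciled = event_store_total
        self.pending.clear()


acct = Account(reconciled_balance=100)
acct.record(-30)
before = acct.actionable_balance()
acct.reconcile(event_store_total=65)  # the batch run also saw a late -5 fee
after = acct.actionable_balance()
```

The speed layer never blocks on late data, and the system still converges on the reconciled figure once the batch run completes.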
Common Theme
The System Asymptotes to Truth over time
Two Big Ideas
• Fine-grained control over those architectural tradeoffs
• Truth lives at the edge, not the middle
Lambda Architecture for a Dinky Little Blog
Blog: Traditional Approach
• Familiar with the ORM Rails-style blog:
• Models: User, Article, Comment
• Views:
• /user/:id (user info, links to their articles and comments);
• /articles (list of articles);
• /articles/:id (article content, comments, author info)
User
id 3
name joeman
homepage http://…
photo http://…
bio “…”
Article
id 7
title The Crisis
body These are…
author_id 3
created_at 2014-08-08
Comment
id 12
body “lol”
article_id 7
author_id 3
[Mockup: user show page. Author Name, Author Bio, Author Photo, and “Joe has written 2 Articles” with a (… read more) excerpt for “A Post about My Cat” and “Welcome to my blog”.]
[Mockup: article show page. Article Title, Article Body, Author Name, Author Photo, Author Bio, and Comments: "First Post" (8/8/2014 by Commenter 1), "lol" (8/8/2014 by Commenter 2), "No comment" (8/8/2014 by Commenter 3).]
[Diagram. Labels: articles, users, comments, Webserver.]
Traditional: Assemble on Read
Changes update models
[Diagram. Labels: update article, update user, update comments; Δarticle, Δuser, Δcom’nt; models (user, com’nt, article); history.]
Models Trigger Reporters
[Diagram: the same updates, deltas, models, and history, plus report fragments: compact article, expanded article, user’s #articles, expanded user, user’s #comments, sidebar user, compact comment, micro user.]
Serve Report Fragments
[Diagram: show article, drawn against the report fragments (expanded article, compact article, user’s #articles, expanded user, sidebar user, user’s #comments, compact comment, micro user) and the article show page mockup.]
article show rendered:
{"title":"Article Title","body":"Article Body Lorem [...]","author":{ ... },"comments": [ {"comment_id":1, "body":"First Post",...}, {"comment_id":2, "body":"lol",...}, ...]}
[Diagram: show user, drawn against the same report fragments.]
Reports are Cheap
[Diagram: the same updates, deltas, models, history, and report fragments, with four views: list articles, show article, list user’s articles, show user.]
Data Engineer: What data model would you like to receive?
Web Engineer: {“title”:”…”, “body”:”…”,…}
Data Engineer: • (…hack hack hack…) /articles/v1/show.json
Web Engineer: j/k lol can I have {“title”:”…”, “body”:”…”, “snippet”:…}
Data Engineer: • (…hack hack hack…) /articles/v2/show.json
Syndicated Data
• Reports are cheap, single-concern, and faithful to the view.
• You start thinking like the customer, not the database
• All pages render in O(1):
  • Your imagination doesn’t have to fit inside a TCP timeout
• Data is immutable, flows are idempotent:
  • Interface change is safe
• Data is always _there_:
  • Asynchrony doesn’t affect consumers
• Everything is decoupled:
  • Way harder to break everything
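A minimal sketch of the "models trigger reporters" idea for the blog: a change to an article rewrites every fragment that depends on it, and the webserver answers with a single lookup. The in-memory `reports` dict stands in for a key-value store, and the function names, fragment keys, and 20-character snippet length are all hypothetical.

```python
reports = {}  # key -> pre-rendered fragment; stands in for a key-value store


def on_article_change(article, author):
    """Reporter: triggered by a model change, rewrites every affected fragment."""
    reports[("compact_article", article["id"])] = {
        "title": article["title"],
        "snippet": article["body"][:20],
    }
    reports[("expanded_article", article["id"])] = {
        "title": article["title"],
        "body": article["body"],
        "author": {"name": author["name"]},
    }


def show_article(article_id):
    """Webserver: one lookup, no joins, no assembly on read."""
    return reports[("expanded_article", article_id)]


user = {"id": 3, "name": "joeman"}
article = {"id": 7, "title": "The Crisis", "body": "These are…", "author_id": 3}
on_article_change(article, user)
page = show_article(7)
```

Running the reporter twice with the same inputs leaves `reports` unchanged, which is what makes it safe to replay flows or add a new fragment shape alongside the old one.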
Syndicated Data
• The Data is always _there_
• …but sometimes it’s more perfect than other times.
• Lambda architecture isn’t about speed layer / batch layer.
• It's about
• moving truth to the edge, not the center;
• enabling fine-grained tradeoffs against fundamental limits;
• decoupling consumer from infrastructure
• decoupling consumer from asynchrony
• …with profound implications for how you build your teams
λ Arch: Truth, not Plumbing
Lambda Architecture Entity Resolution
Scrape Product Web
[Diagram: Intake. Sources and parsers: Amazon (parse Amazon), eBay (parse eBay), Ma&Pa Electronics (parse Ma&Pa). Intake modes: Bulk, Stream, RPC Callback. Labels on Vendor Listing: keywords, mfr & model, ASIN. Output: Listings.]
Batch Layer: Resolve/Unify
[Diagram. Labels: Listings, Unify Products, Product Resolver, Unified Products.]
Improve
[Diagram adds: Vendor Listing (keywords, mfr & model, ASIN), Fetch Products.]
Update
[Diagram adds: Resolve & Update.]
Cannot have Consistency
[Same diagram as Update.]
Objections
• Three objections
  1. Why hasn't it been done before
  2. Architecture Astronaut
  3. I'm not at high scale
• Response
  1. Chef/Puppet/Docker/etc
  2. Chef/Puppet/Docker/etc
  3. Shut Up
Objections
• Two APIs? Really?
• Yes. Guilty. That’s dumb and must be fixed.
  • Spark or Samza, if you’re willing to only drink one flavor of Kool-Aid
  • EZbake.io, a CSC / 42six project to attack this
  • …but we shouldn’t be living at the low level anyhow
Objections
• Orchestration: “logical plan” (dataflow graph)
• Optimization/Allocation: “physical plan” (what goes where)
• Resource Projector: instantiates infrastructure
  • HTTP listeners, Trident streams, Oozie scheduling, ETL flows, cron jobs, etc
• Transport Machineries:
  • move data around, fulfilling locality/ordering/etc guarantees
• Data Processing: UDFs and operators