Patterns of the Lambda Architecture -- 2015 April - Hadoop Summit, Europe

Post on 16-Jul-2015


Patterns of the Lambda Architecture

Truth and Lies at the Edge of Scale

Pattern Set

Tradeoff Rules: PICK ANY TWO

Lambda Architecture

Search w/ Update

[Diagram, built up over four slides. A Ton of Text → Build Indexes → Historical Index (the main index). More Text → Live Indexer → Recent Index (the updates index). The API reads both indexes and serves the result.]

Lambda Architecture

[The same diagram with its three parts labeled: Build Indexes is Batch, Live Indexer is Speed, API is Serving.]

[Diagram, built up over four slides. Train Recommender reads Visitor ⇒ History and builds the model, History ⇒ Also-buy. The Webserver fetches and updates each Visitor:Product history; the Recommender applies the model (Update Recommendation) to produce Visitor ⇒ Also-buy, and the Webserver serves the result.]

[The same diagram with its three parts labeled Batch, Speed, and Serving.]

Lambda Arch Layers

• Batch layer: Deep Global Truth (throughput)

• Speed layer: Relevant Local Truth (throughput)

• Serving layer: Rapid Retrieval (latency)

Lambda Arch: Technology

• Batch layer: Hadoop, Spark, Batch DB Reports

• Speed layer: Storm+Trident, Spark Streaming, Samza, AMQP, …

• Serving layer: Web APIs, Static Assets, RPC, …

Lambda Architecture

[Diagram: the Batch, Speed, and Serving layers, then the same layout overlaid with a λ.]

λ(v)

• Pure Function on immutable data

Two Big Ideas

• Fine-grained control over architectural tradeoffs

• Truth lives at the edge, not the middle

Ideal Data System

• Capacity -- Can process arbitrarily large amounts of data

• Affordability -- Cheap to run

• Simplicity -- Easy to build, maintain, debug

• Resilience -- Jobs/Processes fail and restart gracefully

• Justification -- Incorporates all relevant data into result

• Comprehensiveness -- Answers questions about any subject

• Accuracy -- Few approximations or avoidable errors

• Responsiveness -- Low latency for delivering results

• Recency -- Promptly incorporates changes in world

Patterns

[Diagram: the recommender example. Train Recommender builds History ⇒ Also-buy from Visitor ⇒ History; the Webserver and Recommender apply it per visitor.]

Pattern: Train / React

• Model of the world lets you make immediate decisions

• World changes slowly, so we can re-build model at leisure

• Relax: Recency

• Batch layer: Train a machine learning model

• Speed layer: Apply that model

• Examples: most Machine Learning thingies
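A minimal sketch of Train / React, assuming the recommender example: the batch step scans all visitor histories to build an also-buy model, and the speed step only applies that frozen model to one visitor. The function names, the co-purchase counting, and the rank-based scoring are illustrative assumptions, not taken from the talk.

```python
from collections import Counter, defaultdict
from itertools import permutations


def train_also_buy(histories):
    """Batch layer: scan every visitor history and count co-purchases.

    histories maps visitor id -> list of product ids (assumed shape).
    Returns product -> list of other products, most co-purchased first.
    """
    co_counts = defaultdict(Counter)
    for products in histories.values():
        # sorted() keeps the result deterministic when counts tie.
        for a, b in permutations(sorted(set(products)), 2):
            co_counts[a][b] += 1
    return {p: [q for q, _ in c.most_common()] for p, c in co_counts.items()}


def recommend(model, history, limit=3):
    """Speed layer: apply the prebuilt model to one visitor's history."""
    seen = set(history)
    scores = Counter()
    for product in history:
        for rank, other in enumerate(model.get(product, [])):
            if other not in seen:
                # Higher-ranked also-buys contribute more.
                scores[other] += 1.0 / (rank + 1)
    return [p for p, _ in scores.most_common(limit)]
```

The model can be stale (Recency is relaxed), but `recommend` never waits on training: it reads whatever model the last batch run published.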

[Diagram: the Search w/ Update example. Build Indexes produces the Historical Index from A Ton of Text; the Live Indexer produces the Recent Index from More Text; the API reads both.]

Pattern: Baseline / Delta

• Understanding the world takes a long time

• World changes much faster than that, and you care

• Relax: Simplicity, Accuracy

• Batch layer: Process the entire world

• Speed layer: Handle any changes since last big run

• Examples: Real-time Search index; Count Distinct; other Approximate Stream Algorithms
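A minimal sketch of Baseline / Delta, assuming the search example: a batch run rebuilds the baseline index from the whole corpus, a live indexer maintains a small delta index for documents added since, and reads take the union. The class and method names are hypothetical, and whitespace tokenizing stands in for a real analyzer.

```python
from collections import defaultdict


def build_index(docs):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


class SearchService:
    def __init__(self):
        self.corpus = {}                    # everything the batch layer has seen
        self.pending = {}                   # documents added since the last batch run
        self.historical = defaultdict(set)  # baseline index
        self.recent = defaultdict(set)      # delta index

    def add_document(self, doc_id, text):
        """Speed layer: index one new document right away."""
        self.pending[doc_id] = text
        for word in text.lower().split():
            self.recent[word].add(doc_id)

    def run_batch(self):
        """Batch layer: rebuild the baseline from the whole corpus, drop the delta."""
        self.corpus.update(self.pending)
        self.historical = build_index(self.corpus)
        self.pending = {}
        self.recent = defaultdict(set)

    def search(self, word):
        """Serving layer: union of baseline and delta."""
        word = word.lower()
        return self.historical.get(word, set()) | self.recent.get(word, set())
```

The cost shows up as the relaxed Simplicity: there are two indexing paths to keep in agreement, and the delta is thrown away after every batch run.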

Pagerank

[Diagram: a graph whose nodes are labeled with their Pagerank scores; a new node joins with no score yet.]

Guess Using Neighbors

[Diagram: the new node's score is estimated from its neighbors' contributions: 12÷3 = 4 and 24÷5 ≈ 5, for a guess of 9.]

…But Don’t Fix Neighbors

[Diagram: the neighbors' scores are left unchanged. Verdict: meh.]

Batch Updates Graph

[Diagram: after the next batch run, every node carries a recomputed score.]

Pagerank

[Diagram. Batch: Friend Relations → Converge Pagerank → User ⇒ Pagerank. Speed: Bob → Retrieve Bob’s Facebook Network → Bob’s Friends’ Pageranks → Estimate Bob’s Pagerank → API.]

But don’t bother updating Bob’s friends (or friends’ friends, or …)

Pattern: Guess Beats Blank

• You can’t calculate a good answer quickly

• But Comprehensiveness is a must

• Relaxing: Accuracy, Justification

• Batch layer: finds the correct answer

• Speed layer: makes a reasonable guess

• Examples: Any time the sparse-data case is also the most valuable
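A minimal sketch of Guess Beats Blank, assuming the Pagerank example: a new user's score is estimated from the neighbors' last batch scores, each split across that neighbor's outgoing links, and nobody else's score is touched. Reading the divisor as an out-link count is an assumption, there is no damping factor, and the function names are hypothetical.

```python
def guess_pagerank(neighbors, batch_scores, out_degree):
    """Speed layer: estimate a new node's score from its in-neighbors only.

    Each neighbor contributes its last batch score split evenly across its
    outgoing links. Neighbor scores are not modified.
    """
    return sum(batch_scores[n] / out_degree[n] for n in neighbors)


def lookup(node, batch_scores, guesses):
    """Serving layer: prefer the batch answer, fall back to the guess."""
    if node in batch_scores:
        return batch_scores[node]
    return guesses.get(node)


# Two neighbors, as on the slide: score 12 with 3 links, score 24 with 5 links.
batch_scores = {"a": 12.0, "b": 24.0}
out_degree = {"a": 3, "b": 5}
bob_guess = guess_pagerank(["a", "b"], batch_scores, out_degree)
```

Once the batch layer converges Pagerank over the full graph again, Bob appears in `batch_scores` and the guess is no longer consulted.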

Marine Corps' 80% Rule

“Any decision made with more than 80% of the necessary information is hesitation”

-- “The Marine Corps Way”, Santamaria & Martino


Pattern: World/Local

• Understanding the world needs full graph

• You can tell a little white lie reading immediate graph only

• Relaxing: Accuracy, Justification

• Batch layer: uses global graph information

• Speed layer: just reads immediate neighborhood

• Examples: “Whom to follow”, Clustering, anything at 2nd-degree (friend-of-a-friend)
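A minimal sketch of the speed-layer half of World/Local, assuming a "whom to follow" feature: it reads only the user's two-hop neighborhood and ranks friends-of-friends by how many of the user's friends follow them. The adjacency-dict shape and the function name are assumptions; a batch job with the full graph could produce a better ranking.

```python
from collections import Counter


def whom_to_follow(user, follows, limit=3):
    """Rank friends-of-friends using only the immediate neighborhood.

    follows maps a user to the set of users they follow (assumed shape).
    """
    friends = follows.get(user, set())
    counts = Counter()
    # sorted() keeps the output deterministic when counts tie.
    for friend in sorted(friends):
        for candidate in sorted(follows.get(friend, set())):
            if candidate != user and candidate not in friends:
                counts[candidate] += 1
    return [c for c, _ in counts.most_common(limit)]
```

This is the little white lie: the answer ignores everything beyond the second degree, but it needs only a couple of lookups per request.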

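Network Security

[Diagram. Net Connections are stored (Store Interaction) and summarized into Connection Counts, from which Find Potential Evilness derives Agents of Interest. An Approximate Streaming Agg checks live connections against them (Agent of Interest?) and sends Detected Evilnesses to the Dashboard.]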

Pattern: Slow Boil/Flash Fire

• Two tempos of data: months vs milliseconds

• Short-term data too much to store

• Long-term data too much to handle immediately

• Often accompanies Baseline / Deltas

• Examples:
  • Trending Topics
  • Insider Security
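A minimal sketch of the two tempos, assuming a trending-topics use: the batch layer supplies per-key baseline rates computed over months, and the speed layer keeps only the current window's counts and flags keys that run far above baseline. The class name, the threshold, and the default baseline of 1.0 for unseen keys are all assumptions.

```python
from collections import Counter


class TrendDetector:
    def __init__(self, baseline_rates, threshold=5.0):
        # baseline_rates: expected events per window, computed by the batch
        # layer over long-term history (the slow boil).
        self.baseline_rates = baseline_rates
        self.threshold = threshold
        self.window = Counter()  # counts for the current window only

    def observe(self, key):
        """Speed layer: count the event; the raw event is not retained."""
        self.window[key] += 1

    def close_window(self):
        """Return keys far above their baseline (the flash fire), then reset."""
        flagged = sorted(
            key for key, n in self.window.items()
            if n >= self.threshold * self.baseline_rates.get(key, 1.0)
        )
        self.window = Counter()
        return flagged
```

Neither side holds everything: the short-term stream is reduced to counters and discarded, and the long-term history is only ever touched by the batch run.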

Banking, Oversimplified

[Diagram: Transaction Update Records flow into an Event Store; Reconcile Accounts reads the Event Store to produce Account Balances. The two paths are marked nice-to-have and essential.]

This wins over fast layer

Pattern: C-A-P Tradeoffs

• C-A-P tradeoffs:
  • Can’t depend on when data will roll in (Justification)
  • Can’t live in ignorance (Comprehensiveness)

• Batch layer: The final answer

• Speed layer: Actionable views

• Examples: Security (Authorization vs Auditing), lots of counting problems
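A minimal sketch of this tradeoff, assuming the oversimplified banking example: the speed layer keeps an actionable balance that can be wrong (here, a duplicate delivery and a late record), and the batch reconcile recomputes from the event store and overwrites the fast view. The `Ledger` class, the `late` flag, and amounts in cents are illustrative assumptions.

```python
from collections import defaultdict


class Ledger:
    def __init__(self):
        self.events = {}                       # event store: event_id -> (account, cents)
        self.fast_balances = defaultdict(int)  # speed layer view

    def record(self, event_id, account, amount_cents, late=False):
        """Store the event. Events flagged late skip the speed layer,
        standing in for data that rolls in after the fact."""
        self.events[event_id] = (account, amount_cents)
        if not late:
            # A redelivered event is double counted here until reconcile runs.
            self.fast_balances[account] += amount_cents

    def reconcile(self):
        """Batch layer: recompute every balance from the full event store."""
        totals = defaultdict(int)
        for account, amount in self.events.values():
            totals[account] += amount
        # The reconciled answer wins over the fast layer.
        self.fast_balances = defaultdict(int, totals)
        return dict(totals)

    def balance(self, account):
        """Serving: the best balance currently known."""
        return self.fast_balances.get(account, 0)
```

Between reconciles the balance is an actionable view; after each reconcile it is the final answer for everything the event store has received so far.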

Common Theme

The System Asymptotes to Truth over time

Two Big Ideas

• Fine-grained control over those architectural tradeoffs

• Truth lives at the edge, not the middle

Lambda Architecture for a Dinky Little Blog

Blog: Traditional Approach

• Familiar with the ORM Rails-style blog:

• Models: User, Article, Comment

• Views:

• /user/:id (user info, links to their articles and comments);

• /articles (list of articles);

• /articles/:id (article content, comments, author info)

User
  id          3
  name        joeman
  homepage    http://…
  photo       http://…
  bio         “…”

Article
  id          7
  title       The Crisis
  body        These are…
  author_id   3
  created_at  2014-08-08

Comment
  id          12
  body        “lol”
  article_id  7
  author_id   3

[Mockup, user show page: author name, photo, and bio; “Joe has written 2 Articles”, followed by snippets of “A Post about My Cat” and “Welcome to my blog”, each ending in “read more”.]

[Mockup, article show page: article title and body; a sidebar with the author's name, photo, and bio; and three comments: "First Post", "lol", and "No comment", each dated 8/8/2014.]

Traditional: Assemble on Read

[Diagram: the Webserver builds the article show and user show pages at request time from articles, users, and comments.]

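Changes update models

[Diagram: update article, update user, and update comments each emit a change record (Δarticle, Δuser, Δcomment). The changes are applied to the models (article, user, comment) and kept as history.]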

Models Trigger Reporters

[Diagram: each model change triggers reporters that write report fragments: expanded article, compact article, user’s #articles, expanded user, sidebar user, micro user, user’s #comments, compact comment.]

Serve Report Fragments

[Diagram: the report fragments (expanded article, compact article, user’s #articles, expanded user, sidebar user, user’s #comments, compact comment, micro user) feed the show article endpoint, pictured as the article page mockup.]

article show, rendered: {"title":"Article Title","body":"Article Body Lorem [...]","author":{ ... },"comments": [ {"comment_id":1, "body":"First Post",...}, {"comment_id":2, "body":"lol",...}, ...]}

[Diagram: the same fragments feed the show user endpoint.]

Reports are Cheap

[Diagram: the same updates, models, and report fragments, now serving four endpoints: list articles, show article, list user’s articles, show user.]

Data Engineer: What data model would you like to receive?

Web Engineer: {"title":"…", "body":"…",…}

• (…hack hack hack…)

/articles/v1/show.json

Web Engineer: j/k lol can I have {"title":"…", "body":"…", "snippet":…}

• (…hack hack hack…)

/articles/v2/show.json

Syndicated Data

• Reports are cheap, single-concern, and faithful to the view.

• You start thinking like the customer, not the database

• All pages render in O(1):
  • Your imagination doesn’t have to fit inside a TCP timeout

• Data is immutable, flows are idempotent:
  • Interface change is safe

• Data is always _there_:
  • Asynchrony doesn’t affect consumers

• Everything is decoupled:
  • Way harder to break everything
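A minimal sketch of the blog rebuilt this way, assuming the User and Article models from earlier: every model update re-runs small reporters that write precomputed fragments, so serving a page is a single key lookup. The `ReportStore` class, the fragment keys, and the trimmed-down fields are hypothetical.

```python
import json


class ReportStore:
    """Precomputed report fragments keyed by (report name, entity id)."""

    def __init__(self):
        self.users = {}
        self.articles = {}
        self.fragments = {}

    def update_user(self, user):
        self.users[user["id"]] = user
        self.fragments[("sidebar_user", user["id"])] = {"name": user["name"]}
        # Re-run the reporters that embed this user.
        for article in self.articles.values():
            if article["author_id"] == user["id"]:
                self._report_article(article)

    def update_article(self, article):
        self.articles[article["id"]] = article
        self._report_article(article)

    def _report_article(self, article):
        author = self.fragments.get(("sidebar_user", article["author_id"]), {})
        self.fragments[("expanded_article", article["id"])] = {
            "title": article["title"],
            "body": article["body"],
            "author": author,
        }

    def show_article(self, article_id):
        """Serving: one key lookup, no joins at read time."""
        fragment = self.fragments[("expanded_article", article_id)]
        return json.dumps(fragment, sort_keys=True)
```

Replaying the same update leaves the fragments unchanged, which is the idempotence the bullets rely on; the work moves from read time to write time.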

Syndicated Data

• The Data is always _there_

• …but sometimes it’s more perfect than other times.

λ Arch: Truth, not Plumbing

• Lambda architecture isn’t about speed layer / batch layer.

• It's about
  • moving truth to the edge, not the center;
  • enabling fine-grained tradeoffs against fundamental limits;
  • decoupling consumer from infrastructure
  • decoupling consumer from asynchrony

• …with profound implications for how you build your teams

Lambda Architecture Entity Resolution

Scrape Product Web

Intake

[Diagram: three sources (Amazon, eBay, Ma&Pa Electronics), each with its own parser (parse Amazon, parse eBay, parse Ma&Pa), arriving by Bulk, Stream, or RPC Callback. Each parser emits a Vendor Listing carrying keywords, mfr & model, and ASIN; these accumulate as Listings.]

Batch Layer: Resolve/Unify

[Diagram: Listings → Unify Products → Unified Products, alongside a Product Resolver.]

Improve

[Diagram: adds Fetch Products, which takes a Vendor Listing (keywords, mfr & model, ASIN) to the Product Resolver.]

Update

[Diagram: adds a Resolve & Update step between the Product Resolver and Unified Products.]

Cannot have Consistency

[Diagram: the same layout as the previous slide.]
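A minimal sketch of a resolve-and-update step over vendor listings, assuming the two structured keys named in the diagram. Trying ASIN first, then manufacturer and model, and creating a new product when neither matches is an assumed precedence, not something the slides specify; keyword matching is left out, and the class, field names, and sample values are hypothetical.

```python
class ProductResolver:
    def __init__(self):
        self.by_asin = {}
        self.by_mfr_model = {}
        self.products = {}  # product id -> list of vendor listings

    def resolve(self, listing):
        """Return the unified product id for a listing, creating one if needed."""
        asin = listing.get("asin")
        mfr_model = (listing.get("mfr"), listing.get("model"))
        has_mfr_model = None not in mfr_model

        product_id = None
        if asin is not None:
            product_id = self.by_asin.get(asin)
        if product_id is None and has_mfr_model:
            product_id = self.by_mfr_model.get(mfr_model)
        if product_id is None:
            product_id = len(self.products) + 1
            self.products[product_id] = []

        # Remember every key this listing carries so later listings can match.
        if asin is not None:
            self.by_asin.setdefault(asin, product_id)
        if has_mfr_model:
            self.by_mfr_model.setdefault(mfr_model, product_id)
        self.products[product_id].append(listing)
        return product_id
```

The result depends on arrival order: two listings that share no key become two products, even if a later listing would have linked them. A batch Unify Products pass over all Listings is what can repair that.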

Objections

• Three objections
  1. Why hasn't it been done before?
  2. Architecture Astronaut
  3. I'm not at high scale

• Response
  1. Chef/Puppet/Docker/etc
  2. Chef/Puppet/Docker/etc
  3. Shut Up

Objections

• Two APIs? Really?

• Yes. Guilty. That’s dumb and must be fixed.
  • Spark or Samza, if you’re willing to only drink one flavor of Kool-Aid
  • EZbake.io, a CSC / 42six project to attack this
  • …but we shouldn’t be living at the low level anyhow

Objections

• Orchestration: “logical plan” (dataflow graph)

• Optimization/Allocation: “physical plan” (what goes where)

• Resource Projector: instantiates infrastructure
  • HTTP listeners, Trident streams, Oozie scheduling, ETL flows, cron jobs, etc

• Transport Machineries:
  • move data around, fulfilling locality/ordering/etc guarantees

• Data Processing: UDFs and operators