Distributed Time Travel for Feature Generation at Netflix

Distributed Time Travel for Feature Generation DB Tsai March 24, 2016 at SF Big Analytics Meetup

Transcript of Distributed Time Travel for Feature Generation at Netflix

Page 1: Distributed Time Travel for Feature Generation at Netflix

Distributed Time Travel for Feature Generation

DB Tsai, March 24, 2016 at SF Big Analytics Meetup

Page 2: Distributed Time Travel for Feature Generation at Netflix

Who am I?

• I am a Senior Research Engineer at Netflix

• I am an Apache Spark Committer

Page 3: Distributed Time Travel for Feature Generation at Netflix
Page 4: Distributed Time Travel for Feature Generation at Netflix
Page 5: Distributed Time Travel for Feature Generation at Netflix
Page 6: Distributed Time Travel for Feature Generation at Netflix
Page 7: Distributed Time Travel for Feature Generation at Netflix
Page 8: Distributed Time Travel for Feature Generation at Netflix
Page 9: Distributed Time Travel for Feature Generation at Netflix
Page 10: Distributed Time Travel for Feature Generation at Netflix
Page 11: Distributed Time Travel for Feature Generation at Netflix

Turn on Netflix, and the absolute best content for you would automatically start playing

Page 12: Distributed Time Travel for Feature Generation at Netflix

Data Driven

• Try an idea offline using historical data to see if it would have made better recommendations

• If it did, deploy a live A/B test to see if it performs well in production

Page 13: Distributed Time Travel for Feature Generation at Netflix

Quickly try ideas on historical data and transition to online A/B test

Page 14: Distributed Time Travel for Feature Generation at Netflix

+ ≠

Feature Engineering

Page 15: Distributed Time Travel for Feature Generation at Netflix

Why build a Time Machine?

Page 16: Distributed Time Travel for Feature Generation at Netflix

Label Leaks

• Recommendation is an analytic learning process: learn from the past to predict the future.

• P(play_t | features_t'), where t' < t

• This has to be done very carefully; otherwise the features can contain the labels, so that offline metrics look very good but online evaluation does not perform as expected.
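
One way to make the t' < t constraint concrete is to filter every raw event by the context time before any feature is computed. The sketch below is only an illustration under stated assumptions, not the implementation from the talk: `PlayEvent`, `events_before`, and `play_count_feature` are hypothetical names, and timestamps are assumed to be epoch seconds.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PlayEvent:
    member_id: str
    video_id: str
    timestamp: int  # assumption: epoch seconds


def events_before(events: List[PlayEvent], context_time: int) -> List[PlayEvent]:
    """Keep only events strictly earlier than the context time (t' < t)."""
    return [e for e in events if e.timestamp < context_time]


def play_count_feature(events: List[PlayEvent], member_id: str, context_time: int) -> int:
    """Count a member's plays using only data visible before the context time."""
    visible = events_before(events, context_time)
    return sum(1 for e in visible if e.member_id == member_id)


events = [
    PlayEvent("m1", "v1", 100),
    PlayEvent("m1", "v2", 200),
    PlayEvent("m1", "v3", 300),  # the play to be predicted at t = 300
]

# The play at t = 300 is the label, so it must not be counted in the feature.
leak_free = play_count_feature(events, "m1", context_time=300)
```

With a non-strict comparison (`<=`) the feature would count the very play being predicted, which is the kind of leak the slide warns about.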

Page 17: Distributed Time Travel for Feature Generation at Netflix

The Past

• Generate features based on event data logged in Hive

    • Need to reimplement features for online A/B test

    • Data discrepancies between offline and online sources

• Log features online where the model will be used

    • Need to deploy each idea into production

• Feature generation calls online services and filters data past a certain time

    • Works only when a service records a log of historical events

    • Additional load on online services

Page 18: Distributed Time Travel for Feature Generation at Netflix

DeLorean image by JMortonPhoto.com & OtoGodfrey.com

Page 19: Distributed Time Travel for Feature Generation at Netflix

Time Travel using Snapshots

• Snapshot online services and use the snapshot data offline to generate features

• Share facts and features between experiments without calling live systems

Page 20: Distributed Time Travel for Feature Generation at Netflix

How to build a Time Machine

Page 21: Distributed Time Travel for Feature Generation at Netflix

Context Selection

Data Snapshots

APIs for Time Travel

Page 22: Distributed Time Travel for Feature Generation at Netflix

Context Selection

• Runs once a day

• Stratified sampling of contexts from Hive

• Contexts tagged with metadata

• Output: Context Set stored in S3
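
The stratified sampling step can be sketched as follows. This is a minimal illustration and not the production job: the function name `stratified_sample`, the choice of `country` as the stratum, and the fixed per-stratum quota are all assumptions, and contexts are represented as plain dictionaries rather than Hive rows.

```python
import random
from collections import defaultdict
from typing import Any, Dict, List


def stratified_sample(contexts: List[Dict[str, Any]], stratum_key: str,
                      per_stratum: int, seed: int = 0) -> List[Dict[str, Any]]:
    """Draw up to `per_stratum` contexts from each stratum, without replacement."""
    rng = random.Random(seed)  # seeded so a daily run is reproducible
    strata = defaultdict(list)
    for ctx in contexts:
        strata[ctx[stratum_key]].append(ctx)

    sample = []
    for key in sorted(strata):  # fixed order keeps the result deterministic
        group = strata[key]
        k = min(per_stratum, len(group))
        sample.extend(rng.sample(group, k))
    return sample


# Hypothetical contexts: many from one country, few from another.
contexts = (
    [{"profile_id": i, "country": "US"} for i in range(10)]
    + [{"profile_id": 100 + i, "country": "BR"} for i in range(2)]
)
context_set = stratified_sample(contexts, "country", per_stratum=3, seed=42)
```

Sampling per stratum keeps small groups represented in the context set, which a uniform sample over all contexts would not guarantee.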

Page 23: Distributed Time Travel for Feature Generation at Netflix

Data Snapshots

• Runs once a day

• Input: Context Set (S3)

• Snapshot data for each context, fetched through Prana (Netflix libraries) from the Viewing History Service, My List Service, and Ratings Service

• Thrift, Parquet

• Output: Snapshot stored in S3

Page 24: Distributed Time Travel for Feature Generation at Netflix

APIs for Time Travel

Page 25: Distributed Time Travel for Feature Generation at Netflix

Data Architecture

• Context Selection: runs once a day; stratified sampling from Hive; contexts tagged with metadata; produces the Context Set (S3)

• Data Snapshots: runs once a day; reads the Context Set and snapshots the Viewing History Service, My List Service, and Ratings Service through Prana (Netflix libraries) over Thrift; produces the Snapshot (S3)

• Batch APIs: RDD of snapshot objects

Page 26: Distributed Time Travel for Feature Generation at Netflix

Generating Features via Time Travel

Page 27: Distributed Time Travel for Feature Generation at Netflix

Great Scott!

• DeLorean: A time-traveling vehicle

    • uses data snapshots to travel in time

    • scales with Apache Spark

    • prototypes new ideas with Zeppelin

    • requires minimal code changes from experimentation to A/B test to production

https://en.wikipedia.org/wiki/Emmett_Brown

There’s the DeLorean!

Page 28: Distributed Time Travel for Feature Generation at Netflix

Running Time Travel Experiment

Select the destination time

Bring it up to 88 miles per hour!

Page 29: Distributed Time Travel for Feature Generation at Netflix

Running Time Travel Experiment

Offline Experiment

• Design Experiment

• Collect Label Dataset (Selected Contexts)

• DeLorean: Offline Feature Generation

• Distributed Model Training: parallel training of individual models using different executors

• Compute Validation Metrics

• Model Testing

• Choose best model

• Bad metrics: design a new experiment to test out different ideas

• Good metrics: move to the Online System

Online System

• Online A/B Testing

Page 30: Distributed Time Travel for Feature Generation at Netflix

DeLorean Input Data

• Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.)

• Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities).

• Labels: For supervised learning, this will be the label (target) for each item.

Page 31: Distributed Time Travel for Feature Generation at Netflix

Feature Encoders

• Compute features for each item in a given context

• Each type of raw data element has its own data key

• Data map is a map from data keys to data objects in a given context

• Data map is consumed by feature encoder to compute features
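
The relationship between data keys, the data map, and feature encoders can be sketched as below. This is an assumed shape for illustration only, not DeLorean's actual API: `FeatureEncoder`, `required_data_keys`, `build_data_map`, the key names, and the loader callables are all hypothetical.

```python
from typing import Any, Callable, Dict, FrozenSet, Iterable, List, Set


class FeatureEncoder:
    """Computes one feature for an item from the data objects it declares it needs."""

    def __init__(self, name: str, data_keys: Iterable[str],
                 encode_fn: Callable[[Dict[str, Any], str], float]) -> None:
        self.name = name
        self.data_keys: FrozenSet[str] = frozenset(data_keys)
        self._encode_fn = encode_fn

    def encode(self, data_map: Dict[str, Any], item_id: str) -> float:
        missing = self.data_keys - set(data_map)
        if missing:
            raise KeyError(f"data map is missing keys: {sorted(missing)}")
        return self._encode_fn(data_map, item_id)


def required_data_keys(encoders: List[FeatureEncoder]) -> Set[str]:
    """Union of the data keys declared by all encoders."""
    keys: Set[str] = set()
    for enc in encoders:
        keys |= enc.data_keys
    return keys


def build_data_map(keys: Set[str], loaders: Dict[str, Callable[[], Any]]) -> Dict[str, Any]:
    """Load only the data elements that some encoder actually needs."""
    return {key: loaders[key]() for key in keys}


encoders = [
    FeatureEncoder("in_my_list", ["my_list"],
                   lambda dm, item: 1.0 if item in dm["my_list"] else 0.0),
    FeatureEncoder("already_watched", ["viewing_history"],
                   lambda dm, item: 1.0 if item in dm["viewing_history"] else 0.0),
]

# Hypothetical loaders for one context; "ratings" is available but never requested.
loaders = {
    "my_list": lambda: {"v1"},
    "viewing_history": lambda: ["v2"],
    "ratings": lambda: {"v1": 5},
}

data_map = build_data_map(required_data_keys(encoders), loaders)
features = {enc.name: enc.encode(data_map, "v1") for enc in encoders}
```

Having each encoder declare its keys up front means the data map for a context only needs to contain the elements that the selected features use.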

Page 32: Distributed Time Travel for Feature Generation at Netflix

Two types of Data Elements

• Context-dependent data elements

    • Viewing History

    • My List

    • ...

• Context-independent data elements

    • Video Metadata

    • Genre Metadata

    • ...

Page 33: Distributed Time Travel for Feature Generation at Netflix

Video Country of Origin Matching Fraction

[Figure: for each context and its items, a context-dependent data element (Viewing History) and a context-independent data element (Video Metadata) are combined to produce features. Example values shown in the figure: 0.5, 0.5, 0.5 and 1.0, 0.0, 1.0.]
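
A feature of this kind can be sketched as below. The definition used here is an assumption made for illustration: the fraction of videos in the member's viewing history whose country of origin equals that of the item being scored. The function name `country_match_fraction` and the sample data are hypothetical and are not taken from the figure.

```python
from typing import Dict, List


def country_match_fraction(viewing_history: List[str],
                           video_country: Dict[str, str],
                           item_id: str) -> float:
    """Fraction of watched videos sharing the item's country of origin."""
    if not viewing_history:
        return 0.0  # assumption: no history means no evidence of a match
    target = video_country[item_id]
    matches = sum(1 for vid in viewing_history if video_country.get(vid) == target)
    return matches / len(viewing_history)


# Context-dependent data element: the member's viewing history.
viewing_history = ["v1", "v2"]
# Context-independent data element: video metadata (country of origin).
video_country = {"v1": "US", "v2": "FR", "v3": "US", "v4": "JP"}

features = {
    item: country_match_fraction(viewing_history, video_country, item)
    for item in ["v3", "v4"]
}
```

The viewing history changes from context to context, while the metadata lookup is the same for every context, which is why the two kinds of data element are handled separately.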

Page 34: Distributed Time Travel for Feature Generation at Netflix

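Feature Generation

[Diagram: Label Data and Data Elements from the S3 Snapshot are passed to the Feature Encoders. The Feature Model (JSON) specifies the required feature keys; the Feature Encoders specify the data keys they need; the data (in POJOs) is collected into a Data Map; the Feature Encoders turn the Data Map into Features, which together with the labels go to Model Training.]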

Page 35: Distributed Time Travel for Feature Generation at Netflix

Features

• Represented in Spark's DataFrames

• In nested structure to avoid data shuffling in ranking process

• Stored with Parquet format in S3
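
The idea behind the nested structure can be sketched in plain Python, used here as a stand-in for a Spark DataFrame with a nested array column. The names `nest_by_context` and `rank_items`, the field names, and the sample rows are hypothetical. Once all items of a context live inside one record, ranking them only touches that record and needs no grouping across records.

```python
from typing import Any, Callable, Dict, List


def nest_by_context(rows: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Group flat (context, item) rows into one record per context."""
    nested: Dict[str, Dict[str, Any]] = {}
    for row in rows:
        record = nested.setdefault(
            row["context_id"], {"context_id": row["context_id"], "items": []}
        )
        record["items"].append(
            {"item_id": row["item_id"], "label": row["label"], "features": row["features"]}
        )
    return list(nested.values())


def rank_items(record: Dict[str, Any], score: Callable[[List[float]], float]) -> List[str]:
    """Rank the items of a single context record, highest score first."""
    ordered = sorted(record["items"], key=lambda it: score(it["features"]), reverse=True)
    return [it["item_id"] for it in ordered]


rows = [
    {"context_id": "c1", "item_id": "v1", "label": 1, "features": [0.5, 1.0]},
    {"context_id": "c1", "item_id": "v2", "label": 0, "features": [0.5, 0.0]},
    {"context_id": "c2", "item_id": "v1", "label": 0, "features": [0.2, 1.0]},
]

nested = nest_by_context(rows)
# A toy scoring function (sum of features) stands in for a trained model.
ranked_c1 = rank_items(nested[0], score=sum)
```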

Page 36: Distributed Time Travel for Feature Generation at Netflix

Features

Context

Item, label, and features

Page 37: Distributed Time Travel for Feature Generation at Netflix

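Going Online

• Offline Experiment: S3 Snapshot, then DeLorean: Offline Feature Generation, then Model Training/Validation/Testing

• Online System: Viewing History Service, My List Service, and Ratings Service feed Online Feature Generation, which feeds the Online Ranking/Scoring Service

• Deploy models from the offline experiment to the online system

• Shared Feature Encoders are used by both offline and online feature generation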
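
The point of sharing feature encoders can be sketched as follows: the encoder is written once against the data map, and only the code that fills the data map differs between the offline and online paths. Everything named here (`FakeViewingHistoryService`, `offline_data_map`, `online_data_map`, `already_watched`) is hypothetical and stands in for the real snapshot reader and service clients.

```python
from typing import Any, Dict, List


class FakeViewingHistoryService:
    """Stand-in for a live service client (hypothetical)."""

    def __init__(self, history_by_profile: Dict[str, List[str]]) -> None:
        self._history = history_by_profile

    def get_history(self, profile_id: str) -> List[str]:
        return list(self._history.get(profile_id, []))


def already_watched(data_map: Dict[str, Any], item_id: str) -> float:
    """Shared encoder: identical code runs offline and online."""
    return 1.0 if item_id in data_map["viewing_history"] else 0.0


def offline_data_map(snapshot: Dict[str, Any]) -> Dict[str, Any]:
    """Offline path: fill the data map from a stored snapshot."""
    return {"viewing_history": snapshot["viewing_history"]}


def online_data_map(service: FakeViewingHistoryService, profile_id: str) -> Dict[str, Any]:
    """Online path: fill the data map by calling the live service."""
    return {"viewing_history": service.get_history(profile_id)}


snapshot = {"profile_id": "p1", "viewing_history": ["v1", "v2"]}
service = FakeViewingHistoryService({"p1": ["v1", "v2"]})
items = ["v1", "v3"]

offline = [already_watched(offline_data_map(snapshot), i) for i in items]
online = [already_watched(online_data_map(service, "p1"), i) for i in items]
```

When the snapshot and the service hold the same data, both paths yield the same feature values, so a feature does not have to be reimplemented for the A/B test.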

Page 38: Distributed Time Travel for Feature Generation at Netflix

Conclusion

Spark helped us significantly reduce the time from an idea to an A/B test

Page 39: Distributed Time Travel for Feature Generation at Netflix

Future work

Event Driven Data Snapshots

Time Travel to the Future!!

Page 40: Distributed Time Travel for Feature Generation at Netflix

We’re hiring! (come talk to us)

https://jobs.netflix.com/

Tech Blog: http://bit.ly/sparktimetravel