Distributed Time Travel for Feature Generation at Netflix
Transcript of Distributed Time Travel for Feature Generation at Netflix
Distributed Time Travel for Feature Generation
DB Tsai, March 24, 2016 at SF Big Analytics Meetup
Who am I?
• I am a Senior Research Engineer at Netflix
• I am an Apache Spark Committer
Turn on Netflix, and the absolute best content for you would automatically
start playing
Data Driven
• Try an idea offline using historical data to see if it would have made better recommendations
• If it did, deploy a live A/B test to see if it performs well in production
Quickly try ideas on historical data and transition to online A/B test
Feature Engineering
Why build a Time Machine?
Label Leaks
• Recommendation is an analytic learning process: learn from the past to predict the future.
• $P(\mathrm{play}_t \mid \mathrm{features}_{t'})$ where $t' < t$
• This has to be done very carefully; otherwise the features can contain the labels, so that offline metrics look very good but online evaluation does not perform as expected.
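The constraint t' < t can be illustrated with a minimal Python sketch. This is not Netflix's code: the `Event` record, `events_before`, and `play_count_feature` are hypothetical names introduced here, and the only idea taken from the slide is that features for a label at time t may use data from strictly earlier times only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Event:
    """Hypothetical logged event: a member played a video at a timestamp."""
    member_id: str
    video_id: str
    timestamp: int  # seconds since some epoch


def events_before(events: List[Event], label_time: int) -> List[Event]:
    """Keep only events strictly earlier than the label time (t' < t)."""
    return [e for e in events if e.timestamp < label_time]


def play_count_feature(events: List[Event], member_id: str, label_time: int) -> int:
    """Count the member's plays using only data available before label_time."""
    visible = events_before(events, label_time)
    return sum(1 for e in visible if e.member_id == member_id)


log = [
    Event("m1", "v1", 100),
    Event("m1", "v2", 200),
    Event("m1", "v3", 300),  # this play is the label we want to predict
]

# Leak-free: the play at t=300 is excluded from its own features.
safe = play_count_feature(log, "m1", label_time=300)
# Leaky: skipping the time filter lets the label into the features.
leaky = sum(1 for e in log if e.member_id == "m1")
print(safe, leaky)
```

The leaky count includes the very play being predicted, which is the kind of leak that inflates offline metrics without helping online.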
The Past
• Generate features based on event data logged in Hive
  • Need to reimplement features for online A/B test
  • Data discrepancies between offline and online sources
• Log features online where the model will be used
  • Need to deploy each idea into production
• Feature generation calls online services and filters data past a certain time
  • Works only when a service records a log of historical events
  • Additional load on online services
DeLorean image by JMortonPhoto.com & OtoGodfrey.com
Time Travel using Snapshots
• Snapshot online services and use the snapshot data offline to generate features
• Share facts and features between experiments without calling live systems
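As a rough illustration of the snapshot idea, the sketch below keeps time-ordered copies of a service's state and answers "what did the service know at time T?" without calling the live service. `SnapshotStore`, `take_snapshot`, and `travel_to` are hypothetical names for this example; the actual system stores its snapshots in S3 rather than in memory.

```python
import bisect
import copy
from typing import Any, Dict, List


class SnapshotStore:
    """Hypothetical in-memory stand-in for snapshots of one online service."""

    def __init__(self) -> None:
        self._times: List[int] = []            # snapshot times, kept sorted
        self._data: List[Dict[str, Any]] = []  # state captured at each time

    def take_snapshot(self, time: int, service_state: Dict[str, Any]) -> None:
        """Record a deep copy of the service state; times must increase."""
        if self._times and time <= self._times[-1]:
            raise ValueError("snapshots must be taken in time order")
        self._times.append(time)
        self._data.append(copy.deepcopy(service_state))

    def travel_to(self, destination_time: int) -> Dict[str, Any]:
        """Return the latest snapshot taken at or before destination_time."""
        idx = bisect.bisect_right(self._times, destination_time) - 1
        if idx < 0:
            raise LookupError("no snapshot exists at or before that time")
        return self._data[idx]


store = SnapshotStore()
my_list = {"m1": ["v1"]}          # live state of a hypothetical list service
store.take_snapshot(1, my_list)
my_list["m1"].append("v2")        # the live service keeps changing
store.take_snapshot(2, my_list)

past = store.travel_to(1)         # state as of day 1, unaffected by later edits
latest = store.travel_to(5)       # falls back to the most recent snapshot
print(past, latest)
```

Because each experiment reads the stored copies, several experiments can share the same facts without adding load to the online services.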
How to build a Time Machine
• Context Selection
• Data Snapshots
• APIs for Time Travel
Context Selection
• Runs once a day
• Stratified sampling over data in Hive
• Contexts tagged with metadata
• Context set stored in S3
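A minimal sketch of stratified sampling with metadata tagging follows. The slides do not say which properties are used as strata or how many contexts are drawn, so stratifying by `country` and drawing a fixed number per stratum are assumptions made for illustration, as is the `stratified_sample` helper itself.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(contexts: List[Dict[str, str]], strata_key: str,
                      per_stratum: int, seed: int = 0) -> List[Dict[str, str]]:
    """Draw up to per_stratum contexts from each stratum and tag each one."""
    rng = random.Random(seed)  # fixed seed so a daily run is reproducible
    groups: Dict[str, List[Dict[str, str]]] = defaultdict(list)
    for ctx in contexts:
        groups[ctx[strata_key]].append(ctx)

    sample: List[Dict[str, str]] = []
    for stratum in sorted(groups):
        members = groups[stratum]
        k = min(per_stratum, len(members))
        for ctx in rng.sample(members, k):
            # Tag the selected context with metadata about how it was chosen.
            sample.append({**ctx, "stratum": stratum})
    return sample


# 90 contexts from one country and 10 from another (made-up data).
population = (
    [{"member": f"us{i}", "country": "US"} for i in range(90)]
    + [{"member": f"br{i}", "country": "BR"} for i in range(10)]
)
context_set = stratified_sample(population, "country", per_stratum=5)
print(len(context_set))
```

A uniform random sample of 10 would usually contain about one BR context; sampling per stratum keeps the smaller group represented.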
Data Snapshots
• Runs once a day
• Reads the context set from S3
• Prana (Netflix Libraries) calls the online services: Viewing History Service, My List Service, Ratings Service
• Snapshot data for each context
• Snapshot stored in S3 (Thrift, Parquet)
APIs for Time Travel
Data Architecture
• Context Selection (runs once a day): Hive, stratified sampling, contexts tagged with metadata, context set in S3
• Data Snapshots (runs once a day): Prana (Netflix Libraries), Viewing History Service, My List Service, Ratings Service, Thrift, snapshot in S3
• Batch APIs: RDD of snapshot objects
Generating Features via Time Travel
Great Scott!
• DeLorean: a time-traveling vehicle
• uses data snapshots to travel in time
• scales with Apache Spark
• prototypes new ideas with Zeppelin
• requires minimal code changes from experimentation to A/B test to production
https://en.wikipedia.org/wiki/Emmett_Brown
There’s the DeLorean!
Running Time Travel Experiment
Select the destination time
Bring it up to 88 miles per hour!
Running Time Travel Experiment
• Design Experiment
• Collect Label Dataset
• DeLorean: Offline Feature Generation
• Distributed Model Training: parallel training of individual models using different executors
• Compute Validation Metrics
• Model Testing: choose best model
• Good metrics: move from the offline experiment to the online system (online A/B testing)
• Bad metrics: design a new experiment to test out different ideas
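The training and model-selection steps of this loop can be sketched as follows. This is only an illustration under stated assumptions: a thread pool stands in for Spark executors, the "model" is a one-dimensional ridge regression fitted in closed form, and `TrainedModel` and `train_and_validate` are hypothetical names. The talk does not describe the actual models or metrics.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TrainedModel:
    lam: float             # regularization strength used for this model
    weight: float          # fitted slope of y = weight * x
    validation_mse: float  # mean squared error on the validation set


def train_and_validate(lam: float,
                       train: Sequence[Tuple[float, float]],
                       valid: Sequence[Tuple[float, float]]) -> TrainedModel:
    """Fit 1-D ridge regression in closed form, then score it on validation data."""
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, _ in train)
    weight = sxy / (sxx + lam)
    mse = sum((y - weight * x) ** 2 for x, y in valid) / len(valid)
    return TrainedModel(lam, weight, mse)


train_set = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
valid_set = [(5.0, 10.0), (6.0, 12.0)]
candidate_lams = [0.0, 1.0, 10.0]

# Each candidate is trained independently, so the work parallelizes cleanly.
with ThreadPoolExecutor(max_workers=3) as pool:
    models: List[TrainedModel] = list(
        pool.map(lambda lam: train_and_validate(lam, train_set, valid_set),
                 candidate_lams))

# "Choose best model": the lowest validation error wins.
best = min(models, key=lambda m: m.validation_mse)
print(best)
```

Independent models are a natural unit of parallelism because no state is shared between training runs; only the validation metrics need to be collected at the end.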
Selected Contexts
DeLorean Input Data
• Contexts: The setting for evaluating a set of items (e.g. tuples of member profiles, country, time, device, etc.)
• Items: The elements to be trained on, scored, and/or ranked (e.g. videos, rows, search entities).
• Labels: For supervised learning, this will be the label (target) for each item.
Feature Encoders
• Compute features for each item in a given context
• Each type of raw data element has its own data key
• Data map is a map from data keys to data objects in a given context
• Data map is consumed by feature encoder to compute features
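The relationship between data keys, the data map, and feature encoders might look like the sketch below. The class and function names (`FeatureEncoder`, `required_data_keys`, `generate_features`) and the two example encoders are hypothetical; they are not DeLorean's actual API, only one way to express the structure described on the slide.

```python
from typing import Any, Callable, Dict, List, Set

DataMap = Dict[str, Any]  # data key -> data object for one context


class FeatureEncoder:
    """Hypothetical encoder: declares the data keys it needs and computes one feature."""

    def __init__(self, name: str, data_keys: List[str],
                 fn: Callable[[DataMap, str], float]) -> None:
        self.name = name
        self.data_keys = data_keys
        self._fn = fn

    def encode(self, data_map: DataMap, item: str) -> float:
        missing = [k for k in self.data_keys if k not in data_map]
        if missing:
            raise KeyError(f"{self.name} is missing data keys: {missing}")
        return self._fn(data_map, item)


def required_data_keys(encoders: List[FeatureEncoder]) -> Set[str]:
    """Union of data keys, so only the needed data elements are loaded."""
    return {key for enc in encoders for key in enc.data_keys}


def generate_features(encoders: List[FeatureEncoder], data_map: DataMap,
                      items: List[str]) -> Dict[str, Dict[str, float]]:
    """Compute every feature for every item in one context."""
    return {item: {enc.name: enc.encode(data_map, item) for enc in encoders}
            for item in items}


encoders = [
    FeatureEncoder("in_my_list", ["my_list"],
                   lambda dm, item: 1.0 if item in dm["my_list"] else 0.0),
    FeatureEncoder("times_watched", ["viewing_history"],
                   lambda dm, item: float(dm["viewing_history"].count(item))),
]
data_map = {"viewing_history": ["v1", "v1", "v2"], "my_list": ["v2"]}
features = generate_features(encoders, data_map, ["v1", "v2"])
print(sorted(required_data_keys(encoders)), features)
```

Having each encoder declare its data keys lets the framework decide which data elements to fetch for a context before any encoder runs.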
Two Types of Data Elements
• Context-dependent data elements
  • Viewing History
  • My List
  • ...
• Context-independent data elements
  • Video Metadata
  • Genre Metadata
  • ...
Video Country of Origin Matching Fraction
[Figure: context-item pairs combined with a context-dependent data element (Viewing History) and a context-independent data element (Video Metadata) to produce features; values shown in the figure: 0.5, 0.5, 0.5 and 1.0, 0.0, 1.0]
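One plausible reading of the "Video Country of Origin Matching Fraction" feature is sketched below: for a candidate video, the fraction of videos in the member's viewing history whose country of origin matches the candidate's. That definition is an assumption, since the slide only gives the feature's name and a figure; the function name, the metadata layout, and the example data are likewise made up and are not meant to reproduce the values in the figure.

```python
from typing import Dict, List


def country_matching_fraction(viewing_history: List[str],
                              video_country: Dict[str, str],
                              candidate: str) -> float:
    """Fraction of watched videos sharing the candidate's country of origin."""
    if not viewing_history:
        return 0.0  # no history: define the feature as 0 rather than divide by zero
    target = video_country[candidate]
    matches = sum(1 for v in viewing_history if video_country[v] == target)
    return matches / len(viewing_history)


# Context-dependent data element: one member's viewing history.
history = ["v1", "v2", "v3", "v4"]
# Context-independent data element: video metadata shared by all contexts.
metadata = {
    "v1": "US", "v2": "US", "v3": "FR", "v4": "JP",
    "c1": "US", "c2": "FR", "c3": "KR",
}

fractions = {c: country_matching_fraction(history, metadata, c)
             for c in ["c1", "c2", "c3"]}
print(fractions)
```

The example shows why both kinds of data element are needed: the viewing history changes per context, while the metadata lookup is the same for every context.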
Feature Generation
[Diagram labels: S3 Snapshot; Data Elements; Data in POJOs; Data Map; Data Keys; Feature Model (JSON); Required Feature Keys; Feature Encoders; Features; Label Data; Label + Features; Model Training]
Features
• Represented in Spark's DataFrames
• In nested structure to avoid data shuffling in ranking process
• Stored with Parquet format in S3
[Figure: Features table with a Context column and a nested column of item, label, and features]
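The benefit of the nested layout can be shown without Spark. The plain-Python sketch below is a stand-in for a DataFrame with one row per context and a nested collection of (item, label, features) entries; `nest_by_context` and `rank_items` are hypothetical helpers, and the linear scoring function is an assumption used only to have something to rank by.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Flat layout: one row per (context, item) pair.
FlatRow = Tuple[str, str, int, Dict[str, float]]  # context, item, label, features


def nest_by_context(rows: List[FlatRow]) -> Dict[str, List[dict]]:
    """Group flat rows so each context carries all of its items in one record."""
    nested: Dict[str, List[dict]] = defaultdict(list)
    for context, item, label, features in rows:
        nested[context].append({"item": item, "label": label, "features": features})
    return dict(nested)


def rank_items(items: List[dict], weights: Dict[str, float]) -> List[str]:
    """Rank one context's items by a linear score, using only that one record."""
    def score(entry: dict) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in entry["features"].items())
    return [e["item"] for e in sorted(items, key=score, reverse=True)]


rows: List[FlatRow] = [
    ("ctx1", "v1", 1, {"f": 0.2}),
    ("ctx1", "v2", 0, {"f": 0.9}),
    ("ctx2", "v3", 1, {"f": 0.5}),
]
nested = nest_by_context(rows)
ranking = rank_items(nested["ctx1"], {"f": 1.0})
print(ranking)
```

With the flat layout, ranking a context would first require bringing all of its rows together; with the nested layout, everything needed to rank a context already sits in a single record.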
Going Online
[Diagram: Offline experiment: S3 Snapshot, DeLorean: Offline Feature Generation, Model Training/Validation/Testing, deploy models. Online system: Viewing History Service, My List Service, Ratings Service, Online Feature Generation, Online Ranking/Scoring Service. Shared feature encoders are used by both offline and online feature generation.]
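The point of sharing feature encoders is that the same function runs offline against snapshot data and online against live services, so only the source of the data map changes. The sketch below illustrates this with hypothetical names (`my_list_membership`, `offline_features`, `online_features`) and a callable standing in for a live service client.

```python
from typing import Callable, Dict, List

DataMap = Dict[str, List[str]]


def my_list_membership(data_map: DataMap, video_id: str) -> float:
    """Shared encoder: 1.0 if the video is in the member's list, else 0.0."""
    return 1.0 if video_id in data_map["my_list"] else 0.0


def offline_features(snapshot: DataMap, video_ids: List[str]) -> Dict[str, float]:
    """Offline path: the data map comes straight from a stored snapshot."""
    return {v: my_list_membership(snapshot, v) for v in video_ids}


def online_features(services: Dict[str, Callable[[], List[str]]],
                    video_ids: List[str]) -> Dict[str, float]:
    """Online path: build the data map by calling live services, then reuse the encoder."""
    data_map = {key: fetch() for key, fetch in services.items()}
    return {v: my_list_membership(data_map, v) for v in video_ids}


snapshot = {"my_list": ["v2"]}
live_services = {"my_list": lambda: ["v2"]}  # stand-in for a real service call

offline = offline_features(snapshot, ["v1", "v2"])
online = online_features(live_services, ["v1", "v2"])
print(offline == online)
```

When the underlying data agree, both paths produce identical features, which removes the need to reimplement each feature for the A/B test.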
Conclusion
Spark helped us significantly reduce the time from an idea to an AB Test
Future Work
Event Driven Data Snapshots
Time Travel to the Future!!
We’re hiring! (come talk to us)
https://jobs.netflix.com/
Tech Blog: http://bit.ly/sparktimetravel