Deep Learning for Personalized Search and Recommender Systems
Ganesh Venkataraman, Airbnb
Nadia Fawaz, Saurabh Kataria, Benjamin Le, Liang Zhang, LinkedIn
Tutorial Outline
• Part I (45 min): Deep Learning Key Concepts
• Part II (45 min): Deep Learning for Search and Recommendations at Scale
• Coffee break (30 min)
• Deep Learning Case Studies
  • Part III (45 min): Jobs You May Be Interested In (JYMBII) at LinkedIn
  • Part IV (45 min): Job Search at LinkedIn
• Q&A at the end of each part
Motivation – Why Recommender Systems?
• Recommendation systems are everywhere. Some examples of impact:
  • "Netflix values recommendations at half a billion dollars to the company" [netflix recsys]
  • "LinkedIn job matching algorithms improve performance by 50%" [San Jose Mercury News]
  • "Instagram switches to using algorithmic feed" [Instagram blog]
Motivation – Why Search?
PERSONALIZED SEARCH
Query = "things to do in halifax"
• Search view – this is a classic IR problem
• Recommendations view – for this query, what are the recommended results?
Why Deep Learning? Why now?
• Many of the fundamental algorithmic techniques have existed since the 80s or before
• 2.5 exabytes of data produced per day – or 530,000,000 songs, or 150,000,000 iPhones
Why Deep Learning?
Image classification, eCommerce fraud, search, recommendations, NLP:
deep learning is eating the world.
Why Deep Learning and Recommender Systems?
• Features
  • Semantic understanding of words/sentences possible with embeddings
  • Better classification of images (identifying cats in YouTube videos)
• Modeling
  • Can we cast matching problems into a deep (and possibly wide) network and learn a family of functions?
Part I – Representation Learning and Deep Learning: Key Concepts

Deep Learning and AI
http://www.deeplearningbook.org/contents/intro.html
Part I Outline
• Shallow Models for Embedding Learning: Word2Vec
• Deep Architectures: FF, CNN, RNN
• Training Deep Neural Networks: SGD, Backpropagation, Learning Rate Schedule, Regularization, Pre-Training
Learning Embeddings
Representation learning for automated feature generation
• Natural Language Processing
  • Word embeddings: word2vec, GloVe
  • Sequence modeling using RNNs and LSTMs
• Graph inputs
  • DeepWalk
• Multiple hierarchies of features at varying granularities of semantic meaning with deep networks
Example Application of Representation Learning – Understanding Text
• One of the keys to any content-based recommender system is understanding text
• What does "understanding" mean?
  • How similar/dissimilar are any two words?
  • What does the word represent? (Named Entity Recognition)
    • "Abraham Lincoln, the 16th President..."
    • "My cousin drives a Lincoln"
How to represent a word?
• Vocabulary – run, jog, math
• Simple (one-hot) representation: [1,0,0], [0,1,0], [0,0,1]
  • No representation of meaning
• Co-occurrence in a word/document matrix (see the sketch below)
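A minimal NumPy sketch of both representations, using the slide's toy three-word vocabulary: one-hot vectors carry no notion of similarity, while co-occurrence counts start to.

```python
import numpy as np

vocab = ["run", "jog", "math"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# One-hot vectors are orthogonal: "run" is no closer to "jog" than to "math".
print(one_hot("run") @ one_hot("jog"))   # 0.0
print(one_hot("run") @ one_hot("math"))  # 0.0

# A co-occurrence matrix counts how often words appear in the same document,
# so rows for related words ("run", "jog") become similar vectors.
docs = [["run", "jog"], ["run", "jog"], ["math"]]
C = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    for w1 in doc:
        for w2 in doc:
            if w1 != w2:
                C[index[w1], index[w2]] += 1
print(C)
```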
How to represent a word?
• Trouble with the co-occurrence matrix
  • Large dimension, lots of memory
• Dimensionality reduction using SVD
  • High computational time: n×m matrix ⇒ O(mn²)
  • Adding a new word ⇒ redo everything
Word embeddings taking context into account
• Key conjecture
  • Context matters: words that convey a certain context occur together
  • "Abraham Lincoln was the 16th President of the United States"
• Bigram model
  • P("Lincoln" | "Abraham")
• Skip-gram model
  • Consider all words within the context and ignore position
  • P(Context | Word)
Word2vec
Word2Vec: Skip-Gram Model
• Basic notation:
  • w represents a word, C(w) represents all the context around the word
  • θ represents the parameter space
  • D represents all the (w, c) pairs
  • p(c | w; θ) represents the probability of context c given word w, parametrized by θ
• The probability of all the context appearing given a word is:
  ∏_{c ∈ C(w)} p(c | w; θ)
• The training objective then becomes:
  argmax_θ ∏_{(w,c) ∈ D} p(c | w; θ)
Word2vec details
• Let v_w and v_c represent the vectors for the current word and the context. Note that v_c and v_w are parameters we want to learn
• p(c | w; θ) = exp(v_c · v_w) / Σ_{c′ ∈ C} exp(v_{c′} · v_w)
• C represents the set of all available contexts
Negative Sampling – basic intuition
• The full softmax p(c | w; θ) = exp(v_c · v_w) / Σ_{c′ ∈ C} exp(v_{c′} · v_w) requires a sum over all contexts
• Sample from the unigram distribution instead of taking all contexts into account (see the sketch below)
• Word2vec itself is a shallow model and can be used to initialize a deep model
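A minimal NumPy sketch of one skip-gram training step with negative sampling, under simplifying assumptions (a uniform noise distribution rather than the unigram-based one, and toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 1000, 50, 5, 0.05               # vocab size, dim, negatives, learning rate
W_word = rng.normal(scale=0.1, size=(V, d))   # v_w: word (target) vectors
W_ctx  = rng.normal(scale=0.1, size=(V, d))   # v_c: context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos):
    """One SGD step on a (word, context) pair with k negative contexts sampled
    from a noise distribution (uniform here; unigram-based in practice)."""
    c_neg = rng.integers(0, V, size=k)
    for c, label in [(c_pos, 1.0)] + [(c, 0.0) for c in c_neg]:
        score = sigmoid(W_word[w] @ W_ctx[c])
        grad = score - label                  # d(logistic loss)/d(score)
        g_w = grad * W_ctx[c]                 # gradient w.r.t. the word vector
        W_ctx[c] -= lr * grad * W_word[w]
        W_word[w] -= lr * g_w

sgns_step(w=3, c_pos=17)
```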
Deep Architectures: FF, CNN, RNN
Neuron: Computational Unit
• Input vector: x = [x1, x2, …, xn]
• Neuron
  • Weight vector: W
  • Bias: b
  • Activation function: f
• Output: a = f(Wᵀx + b)
[Diagram: input x → neuron (W, b, f) → output a = f(Wᵀx + b)]
Activation Functions
• Tanh: ℝ → (−1, 1)
  tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
• Sigmoid: ℝ → (0, 1)
  σ(x) = 1 / (1 + e⁻ˣ)
• ReLU: ℝ → [0, +∞)
  f(x) = max(0, x)
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
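The three activations in a few lines of NumPy:

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # range (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # range (0, 1)

def relu(x):
    return np.maximum(0.0, x)                                    # range [0, +inf)

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), sigmoid(x), relu(x))
```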
Layer
• Layer l: n_l neurons
  • weight matrix: W = [W1, …, W_{n_l}]
  • bias vector: b = [b1, …, b_{n_l}]
  • activation function: f
• Output vector: a = f(Wᵀx + b)
[Diagram: input x → layer of neurons (W_i, b_i, f) → outputs a_i = f(W_iᵀx + b_i)]
Layer: Matrix Notation
• Layer l: n_l neurons
  • weight matrix: W
  • bias vector: b
  • activation function: f
• Output vector: a = f(Wᵀx + b)
• More compact notation: fast linear-algebra routines for quick computations in the network
[Diagram: input x → layer (W, b, f) → output a = f(Wᵀx + b)]
Feedforward Network
• Depth: L layers
• Activation at layer l+1: a^(l+1) = f(W^(l)ᵀ a^(l) + b^(l))
• Output: prediction in supervised learning
  • Goal: approximate y = F(x)
[Diagram: input layer 1 → hidden layers 2 and 3 with parameters (W^(1), b^(1), f^(1)), (W^(2), b^(2), f^(2)), (W^(3), b^(3), f^(3)) → output layer 4: prediction layer producing a^(L); depth L = 4]
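A minimal forward pass for such a network, with toy dimensions and depth L = 4 as in the diagram:

```python
import numpy as np

def forward(x, layers):
    """layers: list of (W, b, f) tuples, one per layer l = 1..L-1.
    Implements a^(l+1) = f(W^(l)T a^(l) + b^(l)) with a^(1) = x."""
    a = x
    for W, b, f in layers:
        a = f(W.T @ a + b)
    return a

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(3), relu),     # input (4) -> hidden (3)
    (rng.normal(size=(3, 3)), np.zeros(3), relu),     # hidden (3) -> hidden (3)
    (rng.normal(size=(3, 1)), np.zeros(1), sigmoid),  # hidden (3) -> prediction
]
print(forward(np.array([1.0, 2.0, 3.0, 4.0]), layers))
```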
Why CNN: Convolutional Neural Networks?
• Large grid-structured data
  • 1D: time series
  • 2D: images
• Convolution to extract features from the image (e.g. edges, texture)
  • Local connectivity
  • Parameter sharing
  • Equivariance to translation: small translations in the input do not affect the output
Convolution example
[Figures: edge-detect and sharpen kernels (3×3), https://docs.gimp.org/en/plug-in-convmatrix.html; 2D convolution of an input matrix with a 2×2 kernel (W1 W2 W3 W4), http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/]
• Fully connected
  • hidden unit connected to all input units
  • computationally expensive
  • large image of N×N pixels and hidden layer with K features ⇒ number of parameters ~ KN²
• Locally connected
  • hidden unit connected to some contiguous input units
  • no parameter sharing
• Convolution
  • locally connected
  • kernel: parameter sharing
  • 1D kernel vector [W1, W2] ⇒ 1D Toeplitz weight matrix W
  • scaling to large inputs, images
  • equivariance to translation
[Diagram: a fully connected weight matrix (entries W11 … W34), a locally connected matrix with zeros outside each local window, and the convolutional Toeplitz matrix with the shared kernel [W1, W2] shifted along each row]
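A small NumPy check that 1D convolution with the shared kernel [W1, W2] is exactly multiplication by the Toeplitz weight matrix in the diagram:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input of size 4
w = np.array([5.0, 7.0])             # 1D kernel [W1, W2]

# Direct 1D "valid" convolution (cross-correlation, as in most DL libraries).
direct = np.array([w @ x[i:i + 2] for i in range(3)])

# The same operation as a sparse weight matrix with the kernel parameters
# shared across rows: the Toeplitz structure from the slide.
W = np.array([[w[0], w[1], 0.0, 0.0],
              [0.0, w[0], w[1], 0.0],
              [0.0, 0.0, w[0], w[1]]])
print(direct, W @ x)                 # identical outputs
```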
Pooling
• Summary statistics: aggregate over a region
• Reduce size, less overfitting
• Translation invariance
• Max, mean
http://ufldl.stanford.edu/tutorial/supervised/Pooling/
CNN: Convolutional Neural Network
Combination of:
• Convolutional layers
• Pooling layers
• Fully connected layers
http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
[LeCun et al., 1998]

CNN example for image recognition: ImageNet [Krizhevsky et al., 2012]
Pictures courtesy of [Krizhevsky et al., 2012], http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
[Figure: the two-GPU network architecture, and the filters learned by the first CNN layer]
Why RNN: Recurrent Neural Network?
• Sequential data processing
  • e.g. predict the next word in a sentence: "I was born in France. I can speak …"
• RNN
  • Persist information through a feedback loop
    • the loop passes information from one step to the next
  • Parameter sharing across time indexes
    • the output unit depends on previous output units through the same update rule
[Diagram: recurrent cell with input x_t and hidden state h_t fed back from h_{t−1}]

Unfolded RNN
• Copies of the NN passing feedback to one another
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM: Long Short-Term Memory [Hochreiter et al., 1997]
• Avoids vanishing or exploding gradients
• Cell state updates regulated by gates
  • Forget: how much info from the cell state to let through
  • Input: which cell state components to update
  • Tanh: values to add to the cell state
  • Output: select component values to output
• Long-term dependencies
  • large gap between relevant information and where it is needed
  • Cell state: long-term memory; can remember relevant information over long periods of time
Picture courtesy of http://colah.github.io/posts/2015-08-Understanding-LSTMs/
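For concreteness, one step of a standard LSTM cell in NumPy; the stacked parameter layout and dimensions here are illustrative:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step. W, U, b stack the parameters for the
    forget (f), input (i), candidate (g), and output (o) gates."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b         # all gate pre-activations, shape (4n,)
    f = sigmoid(z[:n])                   # forget gate: what to keep from c_prev
    i = sigmoid(z[n:2 * n])              # input gate: which components to update
    g = np.tanh(z[2 * n:3 * n])          # candidate values to add to the cell state
    o = sigmoid(z[3 * n:])               # output gate: what to expose
    c = f * c_prev + i * g               # cell state: the long-term memory
    h = o * np.tanh(c)                   # new hidden state / output
    return h, c

n, d = 8, 5
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n),
                 rng.normal(size=(4 * n, d)), rng.normal(size=(4 * n, n)),
                 np.zeros(4 * n))
```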
Examples of RNN applications
• Speech recognition [Graves et al., 2013]
• Language modeling [Mikolov, 2012]
• Machine translation [Kalchbrenner et al., 2013], [Sutskever et al., 2014]
• Image captioning [Vinyals et al., 2014]
Training a Deep Neural Network
Cost Function
• m training samples (feature vector, label): (x^(1), y^(1)), …, (x^(m), y^(m))
• Per-sample cost: error between the label and the output of the prediction layer
  J(W, b; x^(i), y^(i)) = ‖a^(L)(x^(i)) − y^(i)‖²
• Minimize the cost function over the parameters: weights W and biases b
  J(W, b) = (1/m) Σ_{i=1}^{m} J(W, b; x^(i), y^(i)) + (λ/2) Σ_{l=1}^{L} ‖W^(l)‖²_F
  (average error + regularization)
Gradient Descent
• Random parameter initialization: symmetry breaking
• Gradient descent step: update for every parameter W_ij^(l) and b_i^(l)
  θ = θ − α ∇_θ 𝔼[J(θ)]
• Gradient computed by backpropagation
• High cost of backpropagation over the full training set
Stochastic Gradient Descent (SGD)
• SGD: follow the negative gradient after
  • a single sample:
    θ = θ − α ∇_θ J(θ; x^(i), y^(i))
  • a few samples: mini-batch (e.g. 256)
• Epoch: full pass through the training set
  • Randomly shuffle the data prior to each training epoch (see the sketch below)
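A minimal mini-batch SGD loop reflecting these points, on a toy 1D problem:

```python
import numpy as np

def sgd(theta, grad_J, data, lr=0.01, batch_size=256, epochs=10):
    """Mini-batch SGD: shuffle the data before each epoch, then follow the
    negative gradient estimated on each mini-batch."""
    rng = np.random.default_rng(0)
    n = len(data)
    for epoch in range(epochs):
        rng.shuffle(data)                       # reshuffle prior to each epoch
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            theta = theta - lr * grad_J(theta, batch)
    return theta

# Example: fit theta to minimize the mean of (theta - x)^2 over samples x.
data = np.random.default_rng(1).normal(loc=3.0, size=10000)
grad = lambda theta, batch: 2.0 * np.mean(theta - batch)
print(sgd(0.0, grad, data))   # approaches 3.0
```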
Backpropagation [Rumelhart et al., 1986]
Goal: compute the gradient numerically.
Recursively apply the chain rule for the derivative of a composition of functions: let y = g(x) and z = f(y) = f(g(x)); then
  dz/dx = (dz/dy)(dy/dx) = f′(g(x)) g′(x)
Backpropagation steps:
1. Feedforward pass: compute all activations
2. Output error: measures each node's contribution to the output error
3. Backpropagate the error through all layers
4. Compute partial derivatives
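The four steps for a small two-layer sigmoid network with the squared-error cost above, in NumPy:

```python
import numpy as np

def backprop(x, y, W1, b1, W2, b2):
    """Backpropagation for a 2-layer net with sigmoid activations and
    squared-error cost J = ||a3 - y||^2, following the slide's four steps."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # 1. Feedforward pass: compute all activations.
    a2 = sigmoid(W1.T @ x + b1)
    a3 = sigmoid(W2.T @ a2 + b2)
    # 2. Output error at the prediction layer (uses sigmoid'(z) = a(1 - a)).
    d3 = 2.0 * (a3 - y) * a3 * (1 - a3)
    # 3. Backpropagate the error through the hidden layer.
    d2 = (W2 @ d3) * a2 * (1 - a2)
    # 4. Partial derivatives w.r.t. each parameter: dW1, db1, dW2, db2.
    return np.outer(x, d2), d2, np.outer(a2, d3), d3

grads = backprop(np.ones(4), np.array([0.5]),
                 np.zeros((4, 3)), np.zeros(3), np.zeros((3, 1)), np.zeros(1))
```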
Training optimization
• Learning rate schedule
  • Change the learning rate as learning progresses
• Pre-training
  • Goal: train a simple model on a simple task before training the desired model to perform the desired task
  • Greedy supervised pre-training: pre-train for the task on a subset of layers as initialization for the final network
• Regularization to curb overfitting
  • Goal: reduce generalization error
  • Penalize parameter norm: L2, L1
  • Augment the dataset: train on more data
  • Early stopping: return the parameter set at the point in time with the lowest validation error
  • Dropout [Srivastava, 2013]: train an ensemble of all subnetworks formed by removing non-output units
• Gradient clipping to avoid exploding gradients (see the sketch below)
  • norm clipping
  • element-wise clipping
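Minimal sketches of two of these knobs, an exponential learning-rate schedule and both clipping variants; the decay constants and limits are illustrative:

```python
import numpy as np

def exp_decay(lr0, step, decay_rate=0.96, decay_steps=1000):
    """Exponential learning-rate schedule: shrink the rate as training progresses."""
    return lr0 * decay_rate ** (step / decay_steps)

def clip_by_norm(grad, max_norm=5.0):
    """Norm clipping: rescale the whole gradient if its norm is too large."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def clip_elementwise(grad, limit=1.0):
    """Element-wise clipping: bound each gradient component independently."""
    return np.clip(grad, -limit, limit)
```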
Part II – Deep Learning for Personalized Recommender Systems at Scale
Examples of Personalized Recommender Systems
[Example screenshots, including Job Search]
Personalized Recommender Systems
• User i with <user features, query (optional)> (e.g., industry, behavioral features, demographic features, …) visits
• Algorithm selects item j from a set of candidates
• (i, j): response y_ij (action or not, e.g. click, like, share, apply, …)
• Which item(s) should we recommend to the user?
  • The item(s) with the best expected utility
  • Utility examples: CTR, revenue, job apply rates, ads conversion rates, …
    • Can be a combination of the above for trade-offs
An Example Architecture of Personalized Recommender Systems
[Diagram: online system – (1) a user visits; (2) the user feature store is queried; (3) recommendation ranking scores items from the item store + features using the ranking model store; (4) additional re-ranking steps; (5) user interaction logs are collected. Offline system – an offline modeling workflow consumes the logs and produces user/item derived features and ranking models.]
An Example of a Personalized Search System Architecture
[Diagram: the same pipeline (steps 1–7), with two additional online steps between the user feature store and recommendation ranking – query construction, then search-based candidate selection & retrieval against a search index of items]
Key Components – Offline Modeling
• Train the model offline (e.g. on Hadoop)
• Push the model to the online ranking model store
• Pre-generate user/item derived features for online systems to consume
  • E.g. user/item embeddings from word2vec/DNNs based on the raw features
Key Components – Candidate Selection
• Personalized search (with a user query):
  • Form a query to the index based on user query annotation [Arya et al., 2016]
  • Example: Panda Express Sunnyvale → +restaurant:panda express +location:sunnyvale
• Recommender system (optional):
  • Can help dramatically reduce the number of items to score in the ranking steps [Cheng et al., 2016], [Borisyuk et al., 2016]
  • Form a query based on the user features
  • Goal: fetch only the items with at least some match with the user's features
  • Example: a user with title software engineer → +title:software engineer for job recommendations (see the sketch below)
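A hypothetical sketch of such query construction; the feature names and clause syntax are illustrative, not LinkedIn's actual retrieval grammar:

```python
def build_candidate_query(user_features):
    """Turn user features into required retrieval clauses so that only items
    with at least some feature match are fetched from the index."""
    clauses = []
    if "title" in user_features:
        clauses.append("+title:%s" % user_features["title"])
    if "location" in user_features:
        clauses.append("+location:%s" % user_features["location"])
    return " ".join(clauses)

# -> "+title:software engineer +location:sunnyvale"
print(build_candidate_query({"title": "software engineer",
                             "location": "sunnyvale"}))
```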
Key Components – Ranking
• Recommendation ranking
  • The main ML model that ranks items retrieved by candidate selection based on the expected utility
• Additional re-ranking steps
  • Often for user experience optimization related to business rules, e.g.
    • Diversification of the ranking results
    • Recency boost
    • Impression discounting
    • …
Integration of Deep Learning Models into Personalized Recommender Systems at Scale
Literature: Deep Learning for Recommendation Systems
• RBM for Collaborative Filtering [Salakhutdinov et al., 2007]
• Deep Belief Networks with Pre-training [Hinton et al., 2006]
• Neural Autoregressive Distribution Estimator (NADE) [Zheng, 2016]
• Neural Collaborative Filtering [He et al., 2017]
• Siamese networks for user-item matching [Huang et al., 2013]
• Collaborative Deep Learning [Wang et al., 2015]
[Recap: the example personalized search system architecture diagram shown earlier]
Offline Modeling + User/Item Embeddings
[Diagram: user features → user embedding vector, item features → item embedding vector, trained so that Sim(U, I) scores the match; the embeddings are pushed to the user feature store and to the item store/index with features]
Query Formulation & Candidate Selection
• Issues with using raw text: noisy or incorrect query tagging due to
  • Failure to capture semantic meaning
    • Ex. query: Apple Watch → +food:apple +product:watch, or +product:apple watch?
  • Multilingual text
    • Query: 熊猫快餐 ("Panda Express" in Chinese) → +restaurant:panda express
  • Cross-domain understanding
    • People search vs. job search
Query Formulation & Candidate Selection
• Represent the query as an embedding
• Expand the query to similar queries in a semantic space
• KNN search in the dense feature space with an inverted index [Cheng et al., 2016]
[Diagram: query Q = "Apple Watch" and documents D = "iphone", D = "ipad", D = "Orange Swatch" placed in the embedding space; see the sketch below]
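A brute-force sketch of KNN in the dense embedding space; at production scale this is served via an inverted index, per [Cheng et al., 2016]. The vectors here are random stand-ins:

```python
import numpy as np

def knn(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q                       # cosine similarity to every document
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(3, 8))         # e.g. "iphone", "ipad", "Orange Swatch"
print(knn(rng.normal(size=8), docs))   # indices of the k nearest documents
```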
Recommendation Ranking Models
• Wide and deep models to capture all possible signals [Cheng et al., 2016]
https://arxiv.org/pdf/1606.07792.pdf
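A minimal wide-and-deep sketch in tf.keras, assuming pre-computed sparse cross features for the wide part and dense features for the deep part; all dimensions here are invented:

```python
import tensorflow as tf

wide_in = tf.keras.Input(shape=(10000,), name="cross_features")  # wide: sparse crosses
deep_in = tf.keras.Input(shape=(256,), name="dense_features")    # deep: embeddings etc.

# Deep part: a small feedforward tower over the dense features.
deep = tf.keras.layers.Dense(128, activation="relu")(deep_in)
deep = tf.keras.layers.Dense(64, activation="relu")(deep)

# Wide part feeds the output layer directly, alongside the deep tower.
both = tf.keras.layers.concatenate([wide_in, deep])
out = tf.keras.layers.Dense(1, activation="sigmoid")(both)       # P(action)

model = tf.keras.Model([wide_in, deep_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```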
Challenges & Open Problems for Deep Learning in Recommender Systems
• Distributed training on very large data
  • TensorFlow on Spark (https://github.com/yahoo/TensorFlowOnSpark)
  • CNTK (https://github.com/Microsoft/CNTK)
  • MXNet (http://mxnet.io/)
  • Caffe (http://caffe.berkeleyvision.org/)
  • …
• Latency issues from online scoring
  • Pre-generation of user/item embeddings
  • Multi-layer scoring (simple models ⇒ complex)
• Batch vs. online training
Part III – Case Study: Jobs You May Be Interested In (JYMBII)
Outline
• Introduction
• Generating Embeddings via Word2vec
• Generating Embeddings via Deep Networks
• Tree Feature Transforms in the Deep + Wide Framework
Introduction: JYMBII

Introduction: Problem Formulation
• Rank jobs by P(user u applies to job j | u, j)
• Model the response given:
  • User: career history, skills, education, connections
  • Job: title, description, location, company
Introduction: JYMBII Modeling – Generalization
• The model should learn general rules to predict which jobs to recommend to a member
• Learn generalizations based on similarity in title, skill, location, etc. between the profile and the job posting
Introduction: JYMBII Modeling – Memorization
• The model should memorize exceptions to the rules
• Learn exceptions based on frequent co-occurrence of features
Introduction: Baseline Features
• Dense BoW similarity features for generalization
  • i.e.: similarity in title text is a good predictor of response
  • Vector BoW similarity feature: Sim(UserTitleBoW, JobTitleBoW)
• Sparse two-depth cross features for memorization (see the sketch below)
  • i.e.: memorize that computer science students will transition to entry engineering roles
  • Sparse cross feature: AND(user=CompSci. Student, job=Software Engineer)
  • Sparse cross feature: AND(user=In Silicon Valley, job=In Austin, TX)
  • Sparse cross feature: AND(user=ML Engineer, job=UX Designer)
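A sketch of how such two-depth cross features can be enumerated; the feature strings are illustrative:

```python
from itertools import product

def two_depth_crosses(user_feats, job_feats):
    """Enumerate sparse two-depth AND(user=..., job=...) cross features."""
    return ["AND(user=%s, job=%s)" % (u, j)
            for u, j in product(user_feats, job_feats)]

print(two_depth_crosses(["CompSci. Student", "In Silicon Valley"],
                        ["Software Engineer", "In Austin, TX"]))
```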
Introduction: Issues
• BoW features don't capture semantic similarity between user/job
  • Cosine similarity between Application Developer and Software Engineer is 0
• Generating three-depth, four-depth cross features won't scale
  • i.e. memorizing that factory workers from Detroit are applying to fracking jobs in Pennsylvania
• Hand-engineered features are time consuming and will have low coverage
  • Permutations of three-depth, four-depth cross features grow exponentially
Introduction: Deep + Wide for JYMBII
• BoW features don't capture semantic similarity between user/job
  • Generate embeddings to capture generalization through semantic similarity
  • Deep + Wide model for JYMBII [Cheng et al., 2016]
• New features:
  • Semantic similarity feature: Sim(UserEmbedding, JobEmbedding)
  • Global model cross feature: AND(user=CompSci. Student, job=Software Engineer)
  • User model cross feature: AND(user=User2, job=JobLatentFeature1)
  • Job model cross feature: AND(user=UserLatentFeature, job=Job1)
Generating Embeddings via Word2vec: Training Word Vectors
• Key ideas
  • The same users (context) apply to similar jobs (target)
  • Similar users (target) will apply to the same jobs (context)
  • Application Developer ⇒ Software Engineer
• Train word vectors via the word2vec skip-gram architecture
  • Concatenate the user's current title and the applied job's title as input: [User Title | Applied Job Title]
Generating Embeddings via Word2vec: Model Structure
[Model stack, for the user and job sides: tokenized titles (Application, Developer | Software, Engineer) → word embedding lookup from pre-trained word vectors → entity embeddings via average pooling → cosine similarity → response prediction (logistic regression)]
Generating Embeddings via Word2vec: Results and Next Steps
• Receiver Operating Characteristic – Area Under Curve (ROC AUC) for evaluation
  • Response prediction is binary classification: apply or don't apply
  • Highly skewed data: low CTR for the apply action
  • Good metric for ranking quality: focuses on the discriminatory ability of the model
• Marginal 0.87% ROC AUC gain
• How to improve the quality of the embeddings?
  • Optimize embeddings for the prediction task with supervised training
  • Leverage richer context about the user and job
Generating Embeddings via Deep Networks: Model Structure
[Model stack, for the user and job sides: sparse features (title, skill, company) → embedding layer → hidden layer → entity embedding; the two entity embeddings are combined by a Hadamard product (element-wise product) feeding response prediction (logistic regression); a sketch follows]
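A hedged tf.keras sketch of this structure, using the tuned sizes from the next slide (embedding layer 100, hidden layer 200) and invented input dimensions:

```python
import tensorflow as tf

def entity_tower(input_dim, name):
    """Sparse features -> embedding layer -> hidden layer -> entity embedding."""
    inp = tf.keras.Input(shape=(input_dim,), name=name)
    x = tf.keras.layers.Dense(100, activation="relu")(inp)   # embedding layer
    x = tf.keras.layers.Dense(200, activation="relu")(x)     # hidden layer
    return inp, x                                            # entity embedding

user_in, user_emb = entity_tower(50000, "user_sparse")       # title/skill/company
job_in, job_emb = entity_tower(50000, "job_sparse")

had = tf.keras.layers.multiply([user_emb, job_emb])          # Hadamard product
out = tf.keras.layers.Dense(1, activation="sigmoid")(had)    # response prediction

model = tf.keras.Model([user_in, job_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```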
Generating Embeddings via Deep Networks: Hyperparameters, Lots of Knobs! (chosen values in parentheses)
• Optimizer used
  • SGD w/ momentum and exponential decay vs. Adam [Kingma et al., 2015] (Adam)
• Learning rate
  • grid over powers of 10 (10⁻⁴)
• Embedding layer size
  • 50 to 200 (100)
• Dropout
  • 0% to 50% dropout (0% dropout)
• Sharing the parameter space for both user/job embeddings
  • Assumes the commutative property of recommendations (a+b = b+a) (no shared parameter space)
• Hidden layer sizes
  • 0 to 2 hidden layers (200 → 200 hidden layer sizes)
• Activation function
  • ReLU vs. Tanh (ReLU)
Generating Embeddings via Deep Networks: Training Challenges
• Millions of rows of training data: impossible to store all in memory
  • Stream data incrementally directly from files into a fixed-size example pool
  • Add shuffling by randomly sampling from the example pool for training batches (see the sketch below)
• Extreme dimensionality of the company sparse feature
  • Reduce dimensionality of the company feature from millions → tens of thousands
  • Perform feature selection by frequency in the training set
• Hyperparameter tuning
  • Distribute grid search through parallel modeling in single-driver Spark jobs
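A sketch of the fixed-size example pool with random sampling; the pool and batch sizes are illustrative:

```python
import random

def shuffled_stream(examples, pool_size=100000, batch_size=256, seed=0):
    """Stream examples from files into a fixed-size pool, and build training
    batches by sampling randomly from the pool, so the full dataset never has
    to fit in memory at once."""
    rng = random.Random(seed)
    pool, batch = [], []
    for example in examples:
        if len(pool) < pool_size:
            pool.append(example)            # fill the pool first
            continue
        i = rng.randrange(pool_size)        # pick a random pool slot
        batch.append(pool[i])               # emit its occupant to the batch
        pool[i] = example                   # replace it with the new example
        if len(batch) == batch_size:
            yield batch
            batch = []
    rng.shuffle(pool)                       # drain whatever is left
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```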
Generating Embeddings via Deep Networks: Results

Model | ROC AUC
Baseline model | 0.753
Deep + Wide model | 0.790 (+4.91%***)

*** For reference, a previous major JYMBII modeling improvement with a 20% lift in ROC AUC resulted in a 30% lift in job applications
The Current Deep + Wide Model
[Diagram: deep embedding features (feedforward NN) and wide sparse cross features (two-depth) feeding response prediction (logistic regression)]
• Generating three-depth, four-depth cross features won't scale
  • Smart feature selection required
Tree Feature Transforms: Feature Selection via Gradient Boosted Decision Trees
• Each tree outputs a path from root to leaf encoding a combination of feature crosses [He et al., 2014]
• GBDTs select the most useful combinations of feature crosses for memorization
[Example trees with yes/no splits on Member Seniority: Vice President, Member Industry: Banking, Member Location: Silicon Valley, Member Skill: Statistics, Job Seniority: CXO, and Job Title: ML Engineer]
Tree Feature Transforms: The Full Picture
[Diagram: deep embedding features (feedforward NN) and wide sparse cross features (GBDT) feeding response prediction (logistic regression)]
• How do we train both the NN model and the GBDT model jointly with each other?
Tree Feature Transforms: Joint Training via Block-wise Cyclic Coordinate Descent
• Treat the NN model and the GBDT model as separate block-wise coordinates
• Implemented by (sketched below):
  1. Training the NN until convergence
  2. Training the GBDT w/ fixed NN embeddings
  3. Training the regression layer weights w/ the generated cross features from the GBDT
  4. Training the NN until convergence w/ fixed cross features
  5. Cycling steps 2–4 until a global convergence criterion is met
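In pseudocode form, with hypothetical train_* helpers standing in for the real trainers:

```python
def train_deep_and_wide(nn, gbdt, lr_layer, data, max_cycles=4):
    """Sketch of the block-wise cyclic coordinate descent described above.
    The train_* and converged helpers are hypothetical stand-ins."""
    train_nn_to_convergence(nn, data)                      # step 1
    for cycle in range(max_cycles):
        margins = nn.predict(data)                         # NN held fixed
        train_gbdt(gbdt, data, init_margin=margins)        # step 2
        crosses = gbdt.transform(data)                     # root-to-leaf paths
        train_regression_layer(lr_layer, crosses, data)    # step 3
        train_nn_to_convergence(nn, data, fixed=crosses)   # step 4
        if converged(nn, gbdt, lr_layer, data):            # step 5
            break
```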
Tree Feature Transforms: Train NN Until Convergence
[Diagram: initially no trees are in the forest; only the feedforward NN feeds the logistic regression layer]

Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin
[Diagram, over two slides: trees are added to the GBDT while the NN section is held fixed as the initial margin]

Tree Feature Transforms: Train Regression Layer Weights
[Diagram: the logistic regression weights are retrained over both the deep embedding features and the GBDT cross features]

Tree Feature Transforms: Train NN w/ GBDT Section as Initial Margin
[Diagram: the NN is retrained while the GBDT section is held fixed as the initial margin]
Tree Feature Transforms: Block-wise Coordinate Descent Results

Model | ROC AUC
Baseline model | 0.753
Deep + Wide model | 0.790 (+4.91%)
Deep + Wide model w/ GBDT iteration 1 | 0.792 (+5.18%)
Deep + Wide model w/ GBDT iteration 2 | 0.794 (+5.44%)
Deep + Wide model w/ GBDT iteration 3 | 0.795 (+5.57%)
Deep + Wide model w/ GBDT iteration 4 | 0.796 (+5.71%)
JYMBII Deep + Wide: Future Direction
• Generating embeddings w/ LSTM networks
  • Leverage sequential career history data
  • Promising results in NEMO: Next Career Move Prediction with Contextual Embedding [Li et al., 2017]
• Semi-supervised training
  • Leverage pre-trained title, skill, and company embeddings on profile data
• Replace the Hadamard product as the entity embedding similarity function
  • Deep Crossing [Shan et al., 2016]
• Add even richer context
  • i.e. location, education, and network features
Part IV – Case Study: Deep Learning Networks for Job Search

Outline
• Introduction
• Representations via Word2vec
• Robust Representations via DSSM
Introduction: Job Search

Introduction: Search Architecture
[Diagram: user query → query understanding → top-K retrieval from the index (built by the indexer) → result ranking → results; offline training produces the ranking model]
Introduction: Query Understanding – Segmentation and Tagging
• First divide the search query into segments
• Tag query segments based on recognized entity tags

Query: Oracle Java Application Developer
Query segmentations: [Oracle] [Java] [Application Developer], [Oracle] [Java Application Developer]
Query tagging:
  COMPANY = Oracle, SKILL = Java, TITLE = Application Developer
  COMPANY = Oracle, TITLE = Java Application Developer
Introduction: Query Understanding – Expansion
• The task of adding synonyms/related entities to the query to improve recall
• Current approach: curated dictionary of common synonyms and related entities

COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR …
SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK …
TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer …
(Green – synonyms, blue – related entities in the original slide)
Introduction: Query Understanding – Retrieval and Ranking
[Diagram: the expanded COMPANY, SKILL, and TITLE clauses are matched against the title, skills, and company fields of job documents for retrieval and ranking]
Introduction: Issues – Retrieval and Ranking
• Term retrieval has limitations
  • Cross-language retrieval: Softwareentwickler ⇔ Software developer
  • Word inflections: Engineering Management ⇔ Engineering Manager
• Query expansion via a curated dictionary of synonyms is not scalable
  • Expensive to refresh and store synonyms for all possible entities
• Heavy reliance on query tagging is not robust enough
  • Novel title, skill, and company entities will not be tagged correctly
  • Errors upstream propagate to poor retrieval and ranking
Introduction: Solution – Deep Learning for Query and Document Representations
• Query and document representations
  • Map queries and document text to vectors in a semantic space
  • Robust to out-of-vocabulary words
• Addresses term retrieval limitations and unscalable dictionary-based expansion
  • Map synonyms, translations, and inflections to similar vectors in the semantic space
  • Term retrieval on cluster id, or KNN-based retrieval
• Addresses the heavy reliance on query tagging
  • Complement structured query representations with semantic representations
Representations via Word2vec: Leverage JYMBII Work
• Key ideas
  • Similar users (context) apply to the same job (target)
  • The same user (target) will apply to similar jobs (context)
  • Application Developer ⇒ Software Engineer
• Train word vectors via the word2vec skip-gram architecture
  • Concatenate the user's current title and the applied job's title as input: [User Title | Applied Job Title]
Representations via Word2vec: Word2vec in Ranking
[Model stack, for the query and job sides: tokenized text (Application, Developer | Software, Engineer) → word embedding lookup from pre-trained word vectors → entity embeddings via average pooling → cosine similarity → learning-to-rank model (NDCG loss)]
Representations via Word2vec: Ranking Model Results

Model | Normalized Discounted Cumulative Gain @5 (NDCG@5) | CTR@5 lift (%)
Baseline model | 0.582 | +0.0%
Baseline model + Word2Vec feature | 0.595 (+2.2%) | +1.6%
Representations via Word2vec: Optimize Embeddings for the Job Search Use Case
• Leverage apply and click feedback to guide the learning of embeddings
  • Fine-tune embeddings for the task using supervised feedback
• Handle out-of-vocabulary words and scale to the query vocabulary size
  • Compared to JYMBII, the query vocabulary is much larger and less well-formed
    • Misspellings
    • Word inflections
    • Free text search
  • Need to make representations more robust for these free-text queries
Robust Representations via DSSM: Deep Structured Semantic Model [Huang et al., 2013]
[Architecture: raw text for the query ("Application Developer"), the applied job (positive: "Software Engineer"), and a randomly sampled job (negative: "Hairdresser") is tri-letter hashed (#Ap, App, ppl, …; #So, Sof, oft, …; #Ha, Hai, air, …), then passed through hidden layers 1–3 into semantic vectors; query–job cosine similarities are trained with softmax w/ cross-entropy loss; a sketch follows]
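A hedged tf.keras sketch of this structure; the layer sizes follow [Huang et al., 2013], a single shared tower reflects the later "parameter sharing helps" finding, and the number of sampled negatives is an assumption:

```python
import tensorflow as tf

TRI = 75000   # bag-of-tri-letters dimension (~75K tri-letters, next slide)
K_NEG = 3     # number of randomly sampled negative jobs (an assumption)

# One tower, shared between the query and the jobs, mapping a bag of
# tri-letters to a semantic vector through hidden layers 1-3.
tower = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="tanh"),
    tf.keras.layers.Dense(300, activation="tanh"),
    tf.keras.layers.Dense(128, activation="tanh"),
])

query_in = tf.keras.Input(shape=(TRI,), name="query")
jobs_in = [tf.keras.Input(shape=(TRI,), name="job_%d" % i)
           for i in range(1 + K_NEG)]            # positive first, then negatives

q_vec = tower(query_in)
# Cosine similarity between the query vector and each job vector.
sims = [tf.keras.layers.Dot(axes=1, normalize=True)([q_vec, tower(j)])
        for j in jobs_in]
logits = tf.keras.layers.concatenate(sims)

# Softmax w/ cross-entropy over (positive, negatives); the one-hot labels put
# the applied job in slot 0.
model = tf.keras.Model([query_in] + jobs_in, logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))
```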
Robust Representations via DSSM: Tri-letter Hashing
• Tri-letter hashing example
  • Engineer → #en, eng, ngi, gin, ine, nee, eer, er#
• Benefits of tri-letter hashing
  • More compact bag of tri-letters vs. bag-of-words representation
    • 700K-word vocabulary → 75K tri-letters
  • Can generalize to out-of-vocabulary words
  • Robust to minor misspellings and inflections of words
    • Engneer → #en, eng, ngn, gne, nee, eer, er#
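Tri-letter hashing is a few lines of Python; this sketch reproduces the slide's examples:

```python
def tri_letters(word):
    """Bag of tri-letters with word-boundary markers, e.g.
    'engineer' -> ['#en', 'eng', 'ngi', 'gin', 'ine', 'nee', 'eer', 'er#']."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(tri_letters("engineer"))
# A minor misspelling shares most tri-letters with the correct word:
print(tri_letters("engneer"))  # ['#en', 'eng', 'ngn', 'gne', 'nee', 'eer', 'er#']
```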
Robust Representations via DSSM: Training Details
• Parameter sharing helps
  • Better and faster convergence
  • Model size is reduced
• Regularization
  • L2 performs better than dropout
• Toolkit comparisons (CNTK vs. TensorFlow)
  • CNTK: faster convergence and better model quality
  • TensorFlow: easy to implement and better community support; comparable model quality
[Figure: training performance with/without parameter sharing]
Robust Representations via DSSM: Lessons in a Production Environment
• Bottlenecks in the production environment
  • Latency due to extra computation
  • Latency due to GC activity
  • Fat JARs in the JVM environment
[Figure: latency increases of +40%, +70%, +100%]
• Practical lessons
  • Avoid the JVM heap while serving the model
  • Cache the most-accessed entities' embeddings
Robust Representations via DSSM: DSSM Qualitative Results
[Table of nearest head queries in the DSSM embedding space, per example query – Software Engineer: Engineer Software, Software Engineers, Software Engineering; Data Mining: Data Miner, Machine Learning Engineer; LinkedIn: Google, Microsoft Research; Softwareentwickler (German for "software developer"): Software, Software Engineer, Engineer Software]
For the qualitative results, only top head queries are taken to analyze similarity to each other.
Robust Representations via DSSM: DSSM Metric Results

Model | NDCG@5 | CTR@5 lift (%)
Baseline model | 0.582 | +0.0%
Baseline model + Word2Vec feature | 0.595 (+2.2%) | +1.6%
Baseline model + DSSM feature | 0.602 (+3.4%) | +3.2%
Robust Representations via DSSM: DSSM Future Direction
• Leverage current query understanding in the DSSM model
  • Query tag entity information for richer context embeddings
  • Query segmentation structure can be incorporated into the network design
• Deep Crossing for the similarity layer [Shan et al., 2016]
• Convolutional DSSM [Shen et al., 2014]
Conclusion
• Recommender systems and personalized search are very similar problems
• Deep learning is here to stay and can have significant impact on both
  • Understanding and constructing queries
  • Ranking
• Deep learning and more traditional techniques are *not* mutually exclusive (hint: Deep + Wide)
References
• [Rumelhart et al., 1986] Learning representations by back-propagating errors, Nature 1986
• [Hochreiter et al., 1997] Long short-term memory, Neural Computation 1997
• [LeCun et al., 1998] Gradient-based learning applied to document recognition, Proceedings of the IEEE 1998
• [Krizhevsky et al., 2012] ImageNet classification with deep convolutional neural networks, NIPS 2012
• [Graves et al., 2013] Speech recognition with deep recurrent neural networks, ICASSP 2013
• [Mikolov, 2012] Statistical language models based on neural networks, PhD thesis, Brno University of Technology, 2012
• [Kalchbrenner et al., 2013] Recurrent continuous translation models, EMNLP 2013
• [Srivastava, 2013] Improving neural networks with dropout, PhD thesis, University of Toronto, 2013
• [Sutskever et al., 2014] Sequence to sequence learning with neural networks, NIPS 2014
• [Vinyals et al., 2014] Show and tell: a neural image caption generator, arXiv 2014
• [Zaremba et al., 2015] Recurrent neural network regularization, ICLR 2015
References (continued)
• [Arya et al., 2016] Personalized Federated Search at LinkedIn, CIKM 2016
• [Cheng et al., 2016] Wide & Deep Learning for Recommender Systems, DLRS 2016
• [He et al., 2014] Practical Lessons from Predicting Clicks on Ads at Facebook, ADKDD 2014
• [Kingma et al., 2015] Adam: A Method for Stochastic Optimization, ICLR 2015
• [Huang et al., 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013
• [Li et al., 2017] NEMO: Next Career Move Prediction with Contextual Embedding, WWW 2017
• [Shan et al., 2016] Deep Crossing: Web-scale modeling without manually crafted combinatorial features, KDD 2016
• [Zhang et al., 2016] GLMix: Generalized Linear Mixed Models for Large-Scale Response Prediction, KDD 2016
• [Salakhutdinov et al., 2007] Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007
• [Zheng, 2016] http://tech.hulu.com/blog/2016/08/01/cfnade.html
• [Hinton et al., 2006] A fast learning algorithm for deep belief nets, Neural Computation 2006
• [Wang et al., 2015] Collaborative Deep Learning for Recommender Systems, KDD 2015
• [He et al., 2017] Neural Collaborative Filtering, WWW 2017
• [Borisyuk et al., 2016] CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents, KDD 2016
References (continued)
• [netflix recsys] http://nordic.businessinsider.com/netflix-recommendation-engine-worth-1-billion-per-year-2016-6/
• [San Jose Mercury News] http://www.mercurynews.com/2017/01/06/at-linkedin-artificial-intelligence-is-like-oxygen/
• [Instagram blog] http://blog.instagram.com/post/145322772067/160602-news