Deep Learning for Personalized Search and Recommender Systems
Ganesh Venkataraman, Airbnb
Nadia Fawaz, Saurabh Kataria, Benjamin Le, Liang Zhang, LinkedIn
Tutorial Outline
• Part I (45 min): Deep Learning Key Concepts
• Part II (45 min): Deep Learning for Search and Recommendations at Scale
• Coffee break (30 min)
• Deep Learning Case Studies
  • Part III (45 min): Jobs You May Be Interested In (JYMBII) at LinkedIn
  • Part IV (45 min): Job Search at LinkedIn
• Q&A at the end of each part
Motivation – Why Recommender Systems?
• Recommendation systems are everywhere. Some examples of impact:
  • "Netflix values recommendations at half a billion dollars to the company" [netflix recsys]
  • "LinkedIn job matching algorithms improve performance by 50%" [San Jose Mercury News]
  • "Instagram switches to using algorithmic feed" [Instagram blog]
Motivation – Why Search?
PERSONALIZED SEARCH
Query = "things to do in halifax"
• Search view – this is a classic IR problem
• Recommendations view – for this query, what are the recommended results?
Why Deep Learning? Why now?
• Many of the fundamental algorithmic techniques have existed since the 80s or before
• 2.5 exabytes of data produced per day – or 530,000,000 songs, or 150,000,000 iPhones
Why Deep Learning?
Image classification, eCommerce fraud, search, recommendations, NLP:
deep learning is eating the world.
Why Deep Learning and Recommender Systems?
• Features
  • Semantic understanding of words/sentences possible with embeddings
  • Better classification of images (identifying cats in YouTube videos)
• Modeling
  • Can we cast matching problems into a deep (and possibly wide) network and learn a family of functions?
Part I – Representation Learning and Deep Learning: Key Concepts

Deep Learning and AI
http://www.deeplearningbook.org/contents/intro.html
Part I Outline
• Shallow Models for Embedding Learning: Word2Vec
• Deep Architectures: FF, CNN, RNN
• Training Deep Neural Networks: SGD, Backpropagation, Learning Rate Schedule, Regularization, Pre-Training
Learning Embeddings
Representation learning for automated feature generation
• Natural Language Processing
  • Word embeddings: word2vec, GloVe
  • Sequence modeling using RNNs and LSTMs
• Graph inputs
  • DeepWalk
• Multiple hierarchies of features at varying granularities of semantic meaning with deep networks
Example Application of Representation Learning – Understanding Text
• One of the keys to any content-based recommender system is understanding text
• What does "understanding" mean?
  • How similar/dissimilar are any two words?
  • What does the word represent? (Named Entity Recognition)
    • "Abraham Lincoln, the 16th President..."
    • "My cousin drives a Lincoln"
How to represent a word?
• Vocabulary – run, jog, math
• Simple (one-hot) representation: [1,0,0], [0,1,0], [0,0,1]
  • No representation of meaning
• Co-occurrence in a word/document matrix (see the sketch below)
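A minimal NumPy sketch of both representations, using the slide's toy three-word vocabulary: one-hot vectors carry no notion of similarity, while co-occurrence counts start to.

```python
import numpy as np

vocab = ["run", "jog", "math"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# One-hot vectors are orthogonal: "run" is no closer to "jog" than to "math".
print(one_hot("run") @ one_hot("jog"))   # 0.0
print(one_hot("run") @ one_hot("math"))  # 0.0

# A co-occurrence matrix counts how often words appear in the same document,
# so rows for related words ("run", "jog") become similar vectors.
docs = [["run", "jog"], ["run", "jog"], ["math"]]
C = np.zeros((len(vocab), len(vocab)))
for doc in docs:
    for w1 in doc:
        for w2 in doc:
            if w1 != w2:
                C[index[w1], index[w2]] += 1
print(C)
```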
How to represent a word?
• Trouble with the co-occurrence matrix
  • Large dimension, lots of memory
• Dimensionality reduction using SVD
  • High computational time: n×m matrix ⇒ O(mn²)
  • Adding a new word ⇒ redo everything
Word embeddings taking context into account
• Key conjecture
  • Context matters: words that convey a certain context occur together
  • "Abraham Lincoln was the 16th President of the United States"
• Bigram model
  • P("Lincoln" | "Abraham")
• Skip-gram model
  • Consider all words within the context and ignore position
  • P(Context | Word)
Word2vec
Word2Vec: Skip-Gram Model
• Basic notation:
  • w represents a word, C(w) represents all the context around the word
  • θ represents the parameter space
  • D represents all the (w, c) pairs
  • p(c | w; θ) represents the probability of context c given word w, parametrized by θ
• The probability of all the context appearing given a word is:
  ∏_{c ∈ C(w)} p(c | w; θ)
• The training objective then becomes:
  argmax_θ ∏_{(w,c) ∈ D} p(c | w; θ)
Word2vec details
• Let v_w and v_c represent the vectors for the current word and the context. Note that v_c and v_w are parameters we want to learn
• p(c | w; θ) = exp(v_c · v_w) / Σ_{c′ ∈ C} exp(v_{c′} · v_w)
• C represents the set of all available contexts
Negative Sampling – basic intuition
• The full softmax p(c | w; θ) = exp(v_c · v_w) / Σ_{c′ ∈ C} exp(v_{c′} · v_w) requires a sum over all contexts
• Sample from the unigram distribution instead of taking all contexts into account (see the sketch below)
• Word2vec itself is a shallow model and can be used to initialize a deep model
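A minimal NumPy sketch of one skip-gram training step with negative sampling, under simplifying assumptions (a uniform noise distribution rather than the unigram-based one, and toy dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k, lr = 1000, 50, 5, 0.05               # vocab size, dim, negatives, learning rate
W_word = rng.normal(scale=0.1, size=(V, d))   # v_w: word (target) vectors
W_ctx  = rng.normal(scale=0.1, size=(V, d))   # v_c: context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos):
    """One SGD step on a (word, context) pair with k negative contexts sampled
    from a noise distribution (uniform here; unigram-based in practice)."""
    c_neg = rng.integers(0, V, size=k)
    for c, label in [(c_pos, 1.0)] + [(c, 0.0) for c in c_neg]:
        score = sigmoid(W_word[w] @ W_ctx[c])
        grad = score - label                  # d(logistic loss)/d(score)
        g_w = grad * W_ctx[c]                 # gradient w.r.t. the word vector
        W_ctx[c] -= lr * grad * W_word[w]
        W_word[w] -= lr * g_w

sgns_step(w=3, c_pos=17)
```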
Deep Architectures: FF, CNN, RNN
Neuron: Computational Unit
• Input vector: x = [x1, x2, …, xn]
• Neuron
  • Weight vector: W
  • Bias: b
  • Activation function: f
• Output: a = f(Wᵀx + b)
[Diagram: input x → neuron (W, b, f) → output a = f(Wᵀx + b)]
Activation Functions
• Tanh: ℝ → (−1, 1)
  tanh(x) = (eˣ − e⁻ˣ) / (eˣ + e⁻ˣ)
• Sigmoid: ℝ → (0, 1)
  σ(x) = 1 / (1 + e⁻ˣ)
• ReLU: ℝ → [0, +∞)
  f(x) = max(0, x)
http://ufldl.stanford.edu/tutorial/supervised/MultiLayerNeuralNetworks/
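The three activations in a few lines of NumPy:

```python
import numpy as np

def tanh(x):
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))  # range (-1, 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))                              # range (0, 1)

def relu(x):
    return np.maximum(0.0, x)                                    # range [0, +inf)

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x), sigmoid(x), relu(x))
```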
Layer
• Layer l: n_l neurons
  • weight matrix: W = [W1, …, W_{n_l}]
  • bias vector: b = [b1, …, b_{n_l}]
  • activation function: f
• Output vector: a = f(Wᵀx + b)
[Diagram: input x → layer of neurons (W_i, b_i, f) → outputs a_i = f(W_iᵀx + b_i)]
Layer: Matrix Notation
• Layer l: n_l neurons
  • weight matrix: W
  • bias vector: b
  • activation function: f
• Output vector: a = f(Wᵀx + b)
• More compact notation: fast linear-algebra routines for quick computations in the network
[Diagram: input x → layer (W, b, f) → output a = f(Wᵀx + b)]
Feedforward Network
• Depth: L layers
• Activation at layer l+1: a^(l+1) = f(W^(l)ᵀ a^(l) + b^(l))
• Output: prediction in supervised learning
  • Goal: approximate y = F(x)
[Diagram: input layer 1 → hidden layers 2 and 3 with parameters (W^(1), b^(1), f^(1)), (W^(2), b^(2), f^(2)), (W^(3), b^(3), f^(3)) → output layer 4: prediction layer producing a^(L); depth L = 4]
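A minimal forward pass for such a network, with toy dimensions and depth L = 4 as in the diagram:

```python
import numpy as np

def forward(x, layers):
    """layers: list of (W, b, f) tuples, one per layer l = 1..L-1.
    Implements a^(l+1) = f(W^(l)T a^(l) + b^(l)) with a^(1) = x."""
    a = x
    for W, b, f in layers:
        a = f(W.T @ a + b)
    return a

relu = lambda z: np.maximum(0.0, z)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(3), relu),     # input (4) -> hidden (3)
    (rng.normal(size=(3, 3)), np.zeros(3), relu),     # hidden (3) -> hidden (3)
    (rng.normal(size=(3, 1)), np.zeros(1), sigmoid),  # hidden (3) -> prediction
]
print(forward(np.array([1.0, 2.0, 3.0, 4.0]), layers))
```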
Why CNN: Convolutional Neural Networks?
• Large grid-structured data
  • 1D: time series
  • 2D: images
• Convolution to extract features from the image (e.g. edges, texture)
  • Local connectivity
  • Parameter sharing
  • Equivariance to translation: small translations in the input do not affect the output
Convolution example
[Figures: edge-detect and sharpen kernels (3×3), https://docs.gimp.org/en/plug-in-convmatrix.html; 2D convolution of an input matrix with a 2×2 kernel (W1 W2 W3 W4), http://ufldl.stanford.edu/tutorial/supervised/FeatureExtractionUsingConvolution/]
• Fully connected
  • hidden unit connected to all input units
  • computationally expensive
  • large image of N×N pixels and hidden layer with K features ⇒ number of parameters ~ KN²
• Locally connected
  • hidden unit connected to some contiguous input units
  • no parameter sharing
• Convolution
  • locally connected
  • kernel: parameter sharing
  • 1D kernel vector [W1, W2] ⇒ 1D Toeplitz weight matrix W
  • scaling to large inputs, images
  • equivariance to translation
[Diagram: a fully connected weight matrix (entries W11 … W34), a locally connected matrix with zeros outside each local window, and the convolutional Toeplitz matrix with the shared kernel [W1, W2] shifted along each row]
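A small NumPy check that 1D convolution with the shared kernel [W1, W2] is exactly multiplication by the Toeplitz weight matrix in the diagram:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # input of size 4
w = np.array([5.0, 7.0])             # 1D kernel [W1, W2]

# Direct 1D "valid" convolution (cross-correlation, as in most DL libraries).
direct = np.array([w @ x[i:i + 2] for i in range(3)])

# The same operation as a sparse weight matrix with the kernel parameters
# shared across rows: the Toeplitz structure from the slide.
W = np.array([[w[0], w[1], 0.0, 0.0],
              [0.0, w[0], w[1], 0.0],
              [0.0, 0.0, w[0], w[1]]])
print(direct, W @ x)                 # identical outputs
```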
Pooling
• Summary statistics: aggregate over a region
• Reduce size, less overfitting
• Translation invariance
• Max, mean
http://ufldl.stanford.edu/tutorial/supervised/Pooling/
CNN: Convolutional Neural Network
Combination of:
• Convolutional layers
• Pooling layers
• Fully connected layers
http://colah.github.io/posts/2014-07-Conv-Nets-Modular/
[LeCun et al., 1998]

CNN example for image recognition: ImageNet [Krizhevsky et al., 2012]
Pictures courtesy of [Krizhevsky et al., 2012], http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf
[Figure: the two-GPU network architecture, and the filters learned by the first CNN layer]
Why RNN: Recurrent Neural Network?
• Sequential data processing
  • e.g. predict the next word in a sentence: "I was born in France. I can speak …"
• RNN
  • Persist information through a feedback loop
    • the loop passes information from one step to the next
  • Parameter sharing across time indexes
    • the output unit depends on previous output units through the same update rule
[Diagram: recurrent cell with input x_t and hidden state h_t fed back from h_{t−1}]

Unfolded RNN
• Copies of the NN passing feedback to one another
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
LSTM: Long Short-Term Memory [Hochreiter et al., 1997]
• Avoids vanishing or exploding gradients
• Cell state updates regulated by gates
  • Forget: how much info from the cell state to let through
  • Input: which cell state components to update
  • Tanh: values to add to the cell state
  • Output: select component values to output
• Long-term dependencies
  • large gap between relevant information and where it is needed
  • Cell state: long-term memory; can remember relevant information over long periods of time
Picture courtesy of http://colah.github.io/posts/2015-08-Understanding-LSTMs/
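For concreteness, one step of a standard LSTM cell in NumPy; the stacked parameter layout and dimensions here are illustrative:

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One standard LSTM step. W, U, b stack the parameters for the
    forget (f), input (i), candidate (g), and output (o) gates."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    n = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b         # all gate pre-activations, shape (4n,)
    f = sigmoid(z[:n])                   # forget gate: what to keep from c_prev
    i = sigmoid(z[n:2 * n])              # input gate: which components to update
    g = np.tanh(z[2 * n:3 * n])          # candidate values to add to the cell state
    o = sigmoid(z[3 * n:])               # output gate: what to expose
    c = f * c_prev + i * g               # cell state: the long-term memory
    h = o * np.tanh(c)                   # new hidden state / output
    return h, c

n, d = 8, 5
rng = np.random.default_rng(0)
h, c = lstm_step(rng.normal(size=d), np.zeros(n), np.zeros(n),
                 rng.normal(size=(4 * n, d)), rng.normal(size=(4 * n, n)),
                 np.zeros(4 * n))
```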
Examples of RNN applications
• Speech recognition [Graves et al., 2013]
• Language modeling [Mikolov, 2012]
• Machine translation [Kalchbrenner et al., 2013], [Sutskever et al., 2014]
• Image captioning [Vinyals et al., 2014]
Training a Deep Neural Network
Cost Function
• m training samples (feature vector, label): (x^(1), y^(1)), …, (x^(m), y^(m))
• Per-sample cost: error between the label and the output of the prediction layer
  J(W, b; x^(i), y^(i)) = ‖a^(L)(x^(i)) − y^(i)‖²
• Minimize the cost function over the parameters: weights W and biases b
  J(W, b) = (1/m) Σ_{i=1}^{m} J(W, b; x^(i), y^(i)) + (λ/2) Σ_{l=1}^{L} ‖W^(l)‖²_F
  (average error + regularization)
Gradient Descent
• Random parameter initialization: symmetry breaking
• Gradient descent step: update for every parameter W_ij^(l) and b_i^(l)
  θ = θ − α ∇_θ 𝔼[J(θ)]
• Gradient computed by backpropagation
• High cost of backpropagation over the full training set
Stochastic Gradient Descent (SGD)
• SGD: follow the negative gradient after
  • a single sample:
    θ = θ − α ∇_θ J(θ; x^(i), y^(i))
  • a few samples: mini-batch (e.g. 256)
• Epoch: full pass through the training set
  • Randomly shuffle the data prior to each training epoch (see the sketch below)
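A minimal mini-batch SGD loop reflecting these points, on a toy 1D problem:

```python
import numpy as np

def sgd(theta, grad_J, data, lr=0.01, batch_size=256, epochs=10):
    """Mini-batch SGD: shuffle the data before each epoch, then follow the
    negative gradient estimated on each mini-batch."""
    rng = np.random.default_rng(0)
    n = len(data)
    for epoch in range(epochs):
        rng.shuffle(data)                       # reshuffle prior to each epoch
        for start in range(0, n, batch_size):
            batch = data[start:start + batch_size]
            theta = theta - lr * grad_J(theta, batch)
    return theta

# Example: fit theta to minimize the mean of (theta - x)^2 over samples x.
data = np.random.default_rng(1).normal(loc=3.0, size=10000)
grad = lambda theta, batch: 2.0 * np.mean(theta - batch)
print(sgd(0.0, grad, data))   # approaches 3.0
```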
Backpropagation [Rumelhart et al., 1986]
Goal: compute the gradient numerically.
Recursively apply the chain rule for the derivative of a composition of functions: let y = g(x) and z = f(y) = f(g(x)); then
  dz/dx = (dz/dy)(dy/dx) = f′(g(x)) g′(x)
Backpropagation steps:
1. Feedforward pass: compute all activations
2. Output error: measures each node's contribution to the output error
3. Backpropagate the error through all layers
4. Compute partial derivatives
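The four steps for a small two-layer sigmoid network with the squared-error cost above, in NumPy:

```python
import numpy as np

def backprop(x, y, W1, b1, W2, b2):
    """Backpropagation for a 2-layer net with sigmoid activations and
    squared-error cost J = ||a3 - y||^2, following the slide's four steps."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # 1. Feedforward pass: compute all activations.
    a2 = sigmoid(W1.T @ x + b1)
    a3 = sigmoid(W2.T @ a2 + b2)
    # 2. Output error at the prediction layer (uses sigmoid'(z) = a(1 - a)).
    d3 = 2.0 * (a3 - y) * a3 * (1 - a3)
    # 3. Backpropagate the error through the hidden layer.
    d2 = (W2 @ d3) * a2 * (1 - a2)
    # 4. Partial derivatives w.r.t. each parameter: dW1, db1, dW2, db2.
    return np.outer(x, d2), d2, np.outer(a2, d3), d3

grads = backprop(np.ones(4), np.array([0.5]),
                 np.zeros((4, 3)), np.zeros(3), np.zeros((3, 1)), np.zeros(1))
```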
Training optimization
• Learning rate schedule
  • Change the learning rate as learning progresses
• Pre-training
  • Goal: train a simple model on a simple task before training the desired model to perform the desired task
  • Greedy supervised pre-training: pre-train for the task on a subset of layers as initialization for the final network
• Regularization to curb overfitting
  • Goal: reduce generalization error
  • Penalize parameter norm: L2, L1
  • Augment the dataset: train on more data
  • Early stopping: return the parameter set at the point in time with the lowest validation error
  • Dropout [Srivastava, 2013]: train an ensemble of all subnetworks formed by removing non-output units
• Gradient clipping to avoid exploding gradients (see the sketch below)
  • norm clipping
  • element-wise clipping
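Minimal sketches of two of these knobs, an exponential learning-rate schedule and both clipping variants; the decay constants and limits are illustrative:

```python
import numpy as np

def exp_decay(lr0, step, decay_rate=0.96, decay_steps=1000):
    """Exponential learning-rate schedule: shrink the rate as training progresses."""
    return lr0 * decay_rate ** (step / decay_steps)

def clip_by_norm(grad, max_norm=5.0):
    """Norm clipping: rescale the whole gradient if its norm is too large."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

def clip_elementwise(grad, limit=1.0):
    """Element-wise clipping: bound each gradient component independently."""
    return np.clip(grad, -limit, limit)
```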
Part II – Deep Learning for Personalized Recommender Systems at Scale
Examples of Personalized Recommender Systems
[Example screenshots, including Job Search]
Personalized Recommender Systems
• User i with <user features, query (optional)> (e.g., industry, behavioral features, demographic features, …) visits
• Algorithm selects item j from a set of candidates
• (i, j): response y_ij (action or not, e.g. click, like, share, apply, …)
• Which item(s) should we recommend to the user?
  • The item(s) with the best expected utility
  • Utility examples: CTR, revenue, job apply rates, ads conversion rates, …
    • Can be a combination of the above for trade-offs
An Example Architecture of Personalized Recommender Systems
[Diagram: online system – (1) a user visits; (2) the user feature store is queried; (3) recommendation ranking scores items from the item store + features using the ranking model store; (4) additional re-ranking steps; (5) user interaction logs are collected. Offline system – an offline modeling workflow consumes the logs and produces user/item derived features and ranking models.]
An Example of a Personalized Search System Architecture
[Diagram: the same pipeline (steps 1–7), with two additional online steps between the user feature store and recommendation ranking – query construction, then search-based candidate selection & retrieval against a search index of items]
Key Components – Offline Modeling
• Train the model offline (e.g. on Hadoop)
• Push the model to the online ranking model store
• Pre-generate user/item derived features for online systems to consume
  • E.g. user/item embeddings from word2vec/DNNs based on the raw features
Key Components – Candidate Selection
• Personalized search (with a user query):
  • Form a query to the index based on user query annotation [Arya et al., 2016]
  • Example: Panda Express Sunnyvale → +restaurant:panda express +location:sunnyvale
• Recommender system (optional):
  • Can help dramatically reduce the number of items to score in the ranking steps [Cheng et al., 2016], [Borisyuk et al., 2016]
  • Form a query based on the user features
  • Goal: fetch only the items with at least some match with the user's features
  • Example: a user with title software engineer → +title:software engineer for job recommendations (see the sketch below)
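A hypothetical sketch of such query construction; the feature names and clause syntax are illustrative, not LinkedIn's actual retrieval grammar:

```python
def build_candidate_query(user_features):
    """Turn user features into required retrieval clauses so that only items
    with at least some feature match are fetched from the index."""
    clauses = []
    if "title" in user_features:
        clauses.append("+title:%s" % user_features["title"])
    if "location" in user_features:
        clauses.append("+location:%s" % user_features["location"])
    return " ".join(clauses)

# -> "+title:software engineer +location:sunnyvale"
print(build_candidate_query({"title": "software engineer",
                             "location": "sunnyvale"}))
```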
Key Components – Ranking
• Recommendation ranking
  • The main ML model that ranks items retrieved by candidate selection based on the expected utility
• Additional re-ranking steps
  • Often for user experience optimization related to business rules, e.g.
    • Diversification of the ranking results
    • Recency boost
    • Impression discounting
    • …
Integration of Deep Learning Models into Personalized Recommender Systems at Scale
Literature: Deep Learning for Recommendation Systems
• RBM for Collaborative Filtering [Salakhutdinov et al., 2007]
• Deep Belief Networks with Pre-training [Hinton et al., 2006]
• Neural Autoregressive Distribution Estimator (NADE) [Zheng, 2016]
• Neural Collaborative Filtering [He et al., 2017]
• Siamese networks for user-item matching [Huang et al., 2013]
• Collaborative Deep Learning [Wang et al., 2015]
[Recap: the example personalized search system architecture diagram shown earlier]
Offline Modeling + User/Item Embeddings
[Diagram: user features → user embedding vector, item features → item embedding vector, trained so that Sim(U, I) scores the match; the embeddings are pushed to the user feature store and to the item store/index with features]
Query Formulation & Candidate Selection
• Issues with using raw text: noisy or incorrect query tagging due to
  • Failure to capture semantic meaning
    • Ex. query: Apple Watch → +food:apple +product:watch, or +product:apple watch?
  • Multilingual text
    • Query: 熊猫快餐 ("Panda Express" in Chinese) → +restaurant:panda express
  • Cross-domain understanding
    • People search vs. job search
Query Formulation & Candidate Selection
• Represent the query as an embedding
• Expand the query to similar queries in a semantic space
• KNN search in the dense feature space with an inverted index [Cheng et al., 2016]
[Diagram: query Q = "Apple Watch" and documents D = "iphone", D = "ipad", D = "Orange Swatch" placed in the embedding space; see the sketch below]
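A brute-force sketch of KNN in the dense embedding space; at production scale this is served via an inverted index, per [Cheng et al., 2016]. The vectors here are random stand-ins:

```python
import numpy as np

def knn(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most cosine-similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = D @ q                       # cosine similarity to every document
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(3, 8))         # e.g. "iphone", "ipad", "Orange Swatch"
print(knn(rng.normal(size=8), docs))   # indices of the k nearest documents
```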
Recommendation Ranking Models
• Wide and deep models to capture all possible signals [Cheng et al., 2016]
https://arxiv.org/pdf/1606.07792.pdf
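A minimal wide-and-deep sketch in tf.keras, assuming pre-computed sparse cross features for the wide part and dense features for the deep part; all dimensions here are invented:

```python
import tensorflow as tf

wide_in = tf.keras.Input(shape=(10000,), name="cross_features")  # wide: sparse crosses
deep_in = tf.keras.Input(shape=(256,), name="dense_features")    # deep: embeddings etc.

# Deep part: a small feedforward tower over the dense features.
deep = tf.keras.layers.Dense(128, activation="relu")(deep_in)
deep = tf.keras.layers.Dense(64, activation="relu")(deep)

# Wide part feeds the output layer directly, alongside the deep tower.
both = tf.keras.layers.concatenate([wide_in, deep])
out = tf.keras.layers.Dense(1, activation="sigmoid")(both)       # P(action)

model = tf.keras.Model([wide_in, deep_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```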
Challenges & Open Problems for Deep Learning in Recommender Systems
• Distributed training on very large data
  • TensorFlow on Spark (https://github.com/yahoo/TensorFlowOnSpark)
  • CNTK (https://github.com/Microsoft/CNTK)
  • MXNet (http://mxnet.io/)
  • Caffe (http://caffe.berkeleyvision.org/)
  • …
• Latency issues from online scoring
  • Pre-generation of user/item embeddings
  • Multi-layer scoring (simple models ⇒ complex)
• Batch vs. online training
Part III – Case Study: Jobs You May Be Interested In (JYMBII)
Outline
• Introduction
• Generating Embeddings via Word2vec
• Generating Embeddings via Deep Networks
• Tree Feature Transforms in the Deep + Wide Framework
Introduction: JYMBII

Introduction: Problem Formulation
• Rank jobs by P(user u applies to job j | u, j)
• Model the response given:
  • User: career history, skills, education, connections
  • Job: title, description, location, company
Introduction: JYMBII Modeling – Generalization
• The model should learn general rules to predict which jobs to recommend to a member
• Learn generalizations based on similarity in title, skill, location, etc. between the profile and the job posting
Introduction: JYMBII Modeling – Memorization
• The model should memorize exceptions to the rules
• Learn exceptions based on frequent co-occurrence of features
Introduction: Baseline Features
• Dense BoW similarity features for generalization
  • i.e.: similarity in title text is a good predictor of response
  • Vector BoW similarity feature: Sim(UserTitleBoW, JobTitleBoW)
• Sparse two-depth cross features for memorization (see the sketch below)
  • i.e.: memorize that computer science students will transition to entry engineering roles
  • Sparse cross feature: AND(user=CompSci. Student, job=Software Engineer)
  • Sparse cross feature: AND(user=In Silicon Valley, job=In Austin, TX)
  • Sparse cross feature: AND(user=ML Engineer, job=UX Designer)
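A sketch of how such two-depth cross features can be enumerated; the feature strings are illustrative:

```python
from itertools import product

def two_depth_crosses(user_feats, job_feats):
    """Enumerate sparse two-depth AND(user=..., job=...) cross features."""
    return ["AND(user=%s, job=%s)" % (u, j)
            for u, j in product(user_feats, job_feats)]

print(two_depth_crosses(["CompSci. Student", "In Silicon Valley"],
                        ["Software Engineer", "In Austin, TX"]))
```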
Introduction: Issues
• BoW features don't capture semantic similarity between user/job
  • Cosine similarity between Application Developer and Software Engineer is 0
• Generating three-depth, four-depth cross features won't scale
  • i.e. memorizing that factory workers from Detroit are applying to fracking jobs in Pennsylvania
• Hand-engineered features are time consuming and will have low coverage
  • Permutations of three-depth, four-depth cross features grow exponentially
Introduction: Deep + Wide for JYMBII
• BoW features don't capture semantic similarity between user/job
  • Generate embeddings to capture generalization through semantic similarity
  • Deep + Wide model for JYMBII [Cheng et al., 2016]
• New features:
  • Semantic similarity feature: Sim(UserEmbedding, JobEmbedding)
  • Global model cross feature: AND(user=CompSci. Student, job=Software Engineer)
  • User model cross feature: AND(user=User2, job=JobLatentFeature1)
  • Job model cross feature: AND(user=UserLatentFeature, job=Job1)
Generating Embeddings via Word2vec: Training Word Vectors
• Key ideas
  • The same users (context) apply to similar jobs (target)
  • Similar users (target) will apply to the same jobs (context)
  • Application Developer ⇒ Software Engineer
• Train word vectors via the word2vec skip-gram architecture
  • Concatenate the user's current title and the applied job's title as input: [User Title | Applied Job Title]
Generating Embeddings via Word2vec: Model Structure
[Model stack, for the user and job sides: tokenized titles (Application, Developer | Software, Engineer) → word embedding lookup from pre-trained word vectors → entity embeddings via average pooling → cosine similarity → response prediction (logistic regression)]
Generating Embeddings via Word2vec: Results and Next Steps
• Receiver Operating Characteristic – Area Under Curve (ROC AUC) for evaluation
  • Response prediction is binary classification: apply or don't apply
  • Highly skewed data: low CTR for the apply action
  • Good metric for ranking quality: focuses on the discriminatory ability of the model
• Marginal 0.87% ROC AUC gain
• How to improve the quality of the embeddings?
  • Optimize embeddings for the prediction task with supervised training
  • Leverage richer context about the user and job
Generating Embeddings via Deep Networks: Model Structure
[Model stack, for the user and job sides: sparse features (title, skill, company) → embedding layer → hidden layer → entity embedding; the two entity embeddings are combined by a Hadamard product (element-wise product) feeding response prediction (logistic regression); a sketch follows]
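A hedged tf.keras sketch of this structure, using the tuned sizes from the next slide (embedding layer 100, hidden layer 200) and invented input dimensions:

```python
import tensorflow as tf

def entity_tower(input_dim, name):
    """Sparse features -> embedding layer -> hidden layer -> entity embedding."""
    inp = tf.keras.Input(shape=(input_dim,), name=name)
    x = tf.keras.layers.Dense(100, activation="relu")(inp)   # embedding layer
    x = tf.keras.layers.Dense(200, activation="relu")(x)     # hidden layer
    return inp, x                                            # entity embedding

user_in, user_emb = entity_tower(50000, "user_sparse")       # title/skill/company
job_in, job_emb = entity_tower(50000, "job_sparse")

had = tf.keras.layers.multiply([user_emb, job_emb])          # Hadamard product
out = tf.keras.layers.Dense(1, activation="sigmoid")(had)    # response prediction

model = tf.keras.Model([user_in, job_in], out)
model.compile(optimizer="adam", loss="binary_crossentropy")
```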
Generating Embeddings via Deep Networks: Hyperparameters, Lots of Knobs! (chosen values in parentheses)
• Optimizer used
  • SGD w/ momentum and exponential decay vs. Adam [Kingma et al., 2015] (Adam)
• Learning rate
  • grid over powers of 10 (10⁻⁴)
• Embedding layer size
  • 50 to 200 (100)
• Dropout
  • 0% to 50% dropout (0% dropout)
• Sharing the parameter space for both user/job embeddings
  • Assumes the commutative property of recommendations (a+b = b+a) (no shared parameter space)
• Hidden layer sizes
  • 0 to 2 hidden layers (200 → 200 hidden layer sizes)
• Activation function
  • ReLU vs. Tanh (ReLU)
Generating Embeddings via Deep Networks: Training Challenges
• Millions of rows of training data: impossible to store all in memory
  • Stream data incrementally directly from files into a fixed-size example pool
  • Add shuffling by randomly sampling from the example pool for training batches (see the sketch below)
• Extreme dimensionality of the company sparse feature
  • Reduce dimensionality of the company feature from millions → tens of thousands
  • Perform feature selection by frequency in the training set
• Hyperparameter tuning
  • Distribute grid search through parallel modeling in single-driver Spark jobs
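A sketch of the fixed-size example pool with random sampling; the pool and batch sizes are illustrative:

```python
import random

def shuffled_stream(examples, pool_size=100000, batch_size=256, seed=0):
    """Stream examples from files into a fixed-size pool, and build training
    batches by sampling randomly from the pool, so the full dataset never has
    to fit in memory at once."""
    rng = random.Random(seed)
    pool, batch = [], []
    for example in examples:
        if len(pool) < pool_size:
            pool.append(example)            # fill the pool first
            continue
        i = rng.randrange(pool_size)        # pick a random pool slot
        batch.append(pool[i])               # emit its occupant to the batch
        pool[i] = example                   # replace it with the new example
        if len(batch) == batch_size:
            yield batch
            batch = []
    rng.shuffle(pool)                       # drain whatever is left
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]
```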
Generating Embeddings via Deep Networks: Results

Model | ROC AUC
Baseline model | 0.753
Deep + Wide model | 0.790 (+4.91%***)

*** For reference, a previous major JYMBII modeling improvement with a 20% lift in ROC AUC resulted in a 30% lift in job applications
The Current Deep + Wide Model
[Diagram: deep embedding features (feedforward NN) and wide sparse cross features (two-depth) feeding response prediction (logistic regression)]
• Generating three-depth, four-depth cross features won't scale
  • Smart feature selection required
Tree Feature Transforms: Feature Selection via Gradient Boosted Decision Trees
• Each tree outputs a path from root to leaf encoding a combination of feature crosses [He et al., 2014]
• GBDTs select the most useful combinations of feature crosses for memorization
[Example trees with yes/no splits on Member Seniority: Vice President, Member Industry: Banking, Member Location: Silicon Valley, Member Skill: Statistics, Job Seniority: CXO, and Job Title: ML Engineer]
Tree Feature Transforms: The Full Picture
[Diagram: deep embedding features (feedforward NN) and wide sparse cross features (GBDT) feeding response prediction (logistic regression)]
• How do we train both the NN model and the GBDT model jointly with each other?
Tree Feature Transforms: Joint Training via Block-wise Cyclic Coordinate Descent
• Treat the NN model and the GBDT model as separate block-wise coordinates
• Implemented by (sketched below):
  1. Training the NN until convergence
  2. Training the GBDT w/ fixed NN embeddings
  3. Training the regression layer weights w/ the generated cross features from the GBDT
  4. Training the NN until convergence w/ fixed cross features
  5. Cycling steps 2–4 until a global convergence criterion is met
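In pseudocode form, with hypothetical train_* helpers standing in for the real trainers:

```python
def train_deep_and_wide(nn, gbdt, lr_layer, data, max_cycles=4):
    """Sketch of the block-wise cyclic coordinate descent described above.
    The train_* and converged helpers are hypothetical stand-ins."""
    train_nn_to_convergence(nn, data)                      # step 1
    for cycle in range(max_cycles):
        margins = nn.predict(data)                         # NN held fixed
        train_gbdt(gbdt, data, init_margin=margins)        # step 2
        crosses = gbdt.transform(data)                     # root-to-leaf paths
        train_regression_layer(lr_layer, crosses, data)    # step 3
        train_nn_to_convergence(nn, data, fixed=crosses)   # step 4
        if converged(nn, gbdt, lr_layer, data):            # step 5
            break
```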
Tree Feature Transforms: Train NN Until Convergence
[Diagram: initially no trees are in the forest; only the feedforward NN feeds the logistic regression layer]

Tree Feature Transforms: Train GBDT w/ NN Section as Initial Margin
[Diagram, over two slides: trees are added to the GBDT while the NN section is held fixed as the initial margin]

Tree Feature Transforms: Train Regression Layer Weights
[Diagram: the logistic regression weights are retrained over both the deep embedding features and the GBDT cross features]

Tree Feature Transforms: Train NN w/ GBDT Section as Initial Margin
[Diagram: the NN is retrained while the GBDT section is held fixed as the initial margin]
Tree Feature Transforms: Block-wise Coordinate Descent Results

Model | ROC AUC
Baseline model | 0.753
Deep + Wide model | 0.790 (+4.91%)
Deep + Wide model w/ GBDT iteration 1 | 0.792 (+5.18%)
Deep + Wide model w/ GBDT iteration 2 | 0.794 (+5.44%)
Deep + Wide model w/ GBDT iteration 3 | 0.795 (+5.57%)
Deep + Wide model w/ GBDT iteration 4 | 0.796 (+5.71%)
JYMBII Deep + Wide: Future Direction
• Generating embeddings w/ LSTM networks
  • Leverage sequential career history data
  • Promising results in NEMO: Next Career Move Prediction with Contextual Embedding [Li et al., 2017]
• Semi-supervised training
  • Leverage pre-trained title, skill, and company embeddings on profile data
• Replace the Hadamard product as the entity embedding similarity function
  • Deep Crossing [Shan et al., 2016]
• Add even richer context
  • i.e. location, education, and network features
Part IV – Case Study: Deep Learning Networks for Job Search

Outline
• Introduction
• Representations via Word2vec
• Robust Representations via DSSM
Introduction: Job Search

Introduction: Search Architecture
[Diagram: user query → query understanding → top-K retrieval from the index (built by the indexer) → result ranking → results; offline training produces the ranking model]
Introduction: Query Understanding – Segmentation and Tagging
• First divide the search query into segments
• Tag query segments based on recognized entity tags

Query: Oracle Java Application Developer
Query segmentations: [Oracle] [Java] [Application Developer], [Oracle] [Java Application Developer]
Query tagging:
  COMPANY = Oracle, SKILL = Java, TITLE = Application Developer
  COMPANY = Oracle, TITLE = Java Application Developer
Introduction: Query Understanding – Expansion
• The task of adding synonyms/related entities to the query to improve recall
• Current approach: curated dictionary of common synonyms and related entities

COMPANY = Oracle OR NetSuite OR Taleo OR Sun Microsystems OR …
SKILL = Java OR Java EE OR J2EE OR JVM OR JRE OR JDK …
TITLE = Application Developer OR Software Engineer OR Software Developer OR Programmer …
(Green – synonyms, blue – related entities in the original slide)
Introduction: Query Understanding – Retrieval and Ranking
[Diagram: the expanded COMPANY, SKILL, and TITLE clauses are matched against the title, skills, and company fields of job documents for retrieval and ranking]
Introduction: Issues – Retrieval and Ranking
• Term retrieval has limitations
  • Cross-language retrieval: Softwareentwickler ⇔ Software developer
  • Word inflections: Engineering Management ⇔ Engineering Manager
• Query expansion via a curated dictionary of synonyms is not scalable
  • Expensive to refresh and store synonyms for all possible entities
• Heavy reliance on query tagging is not robust enough
  • Novel title, skill, and company entities will not be tagged correctly
  • Errors upstream propagate to poor retrieval and ranking
Introduction: Solution – Deep Learning for Query and Document Representations
• Query and document representations
  • Map queries and document text to vectors in a semantic space
  • Robust to out-of-vocabulary words
• Addresses term retrieval limitations and unscalable dictionary-based expansion
  • Map synonyms, translations, and inflections to similar vectors in the semantic space
  • Term retrieval on cluster id, or KNN-based retrieval
• Addresses the heavy reliance on query tagging
  • Complement structured query representations with semantic representations
Representations via Word2vec: Leverage JYMBII Work
• Key ideas
  • Similar users (context) apply to the same job (target)
  • The same user (target) will apply to similar jobs (context)
  • Application Developer ⇒ Software Engineer
• Train word vectors via the word2vec skip-gram architecture
  • Concatenate the user's current title and the applied job's title as input: [User Title | Applied Job Title]
Representations via Word2vec: Word2vec in Ranking
[Model stack, for the query and job sides: tokenized text (Application, Developer | Software, Engineer) → word embedding lookup from pre-trained word vectors → entity embeddings via average pooling → cosine similarity → learning-to-rank model (NDCG loss)]
Representations via Word2vec: Ranking Model Results

Model | Normalized Discounted Cumulative Gain @5 (NDCG@5) | CTR@5 lift (%)
Baseline model | 0.582 | +0.0%
Baseline model + Word2Vec feature | 0.595 (+2.2%) | +1.6%
Representations via Word2vec: Optimize Embeddings for the Job Search Use Case
• Leverage apply and click feedback to guide the learning of embeddings
  • Fine-tune embeddings for the task using supervised feedback
• Handle out-of-vocabulary words and scale to the query vocabulary size
  • Compared to JYMBII, the query vocabulary is much larger and less well-formed
    • Misspellings
    • Word inflections
    • Free text search
  • Need to make representations more robust for these free-text queries
Robust Representations via DSSM: Deep Structured Semantic Model [Huang et al., 2013]
[Architecture: raw text for the query ("Application Developer"), the applied job (positive: "Software Engineer"), and a randomly sampled job (negative: "Hairdresser") is tri-letter hashed (#Ap, App, ppl, …; #So, Sof, oft, …; #Ha, Hai, air, …), then passed through hidden layers 1–3 into semantic vectors; query–job cosine similarities are trained with softmax w/ cross-entropy loss; a sketch follows]
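A hedged tf.keras sketch of this structure; the layer sizes follow [Huang et al., 2013], a single shared tower reflects the later "parameter sharing helps" finding, and the number of sampled negatives is an assumption:

```python
import tensorflow as tf

TRI = 75000   # bag-of-tri-letters dimension (~75K tri-letters, next slide)
K_NEG = 3     # number of randomly sampled negative jobs (an assumption)

# One tower, shared between the query and the jobs, mapping a bag of
# tri-letters to a semantic vector through hidden layers 1-3.
tower = tf.keras.Sequential([
    tf.keras.layers.Dense(300, activation="tanh"),
    tf.keras.layers.Dense(300, activation="tanh"),
    tf.keras.layers.Dense(128, activation="tanh"),
])

query_in = tf.keras.Input(shape=(TRI,), name="query")
jobs_in = [tf.keras.Input(shape=(TRI,), name="job_%d" % i)
           for i in range(1 + K_NEG)]            # positive first, then negatives

q_vec = tower(query_in)
# Cosine similarity between the query vector and each job vector.
sims = [tf.keras.layers.Dot(axes=1, normalize=True)([q_vec, tower(j)])
        for j in jobs_in]
logits = tf.keras.layers.concatenate(sims)

# Softmax w/ cross-entropy over (positive, negatives); the one-hot labels put
# the applied job in slot 0.
model = tf.keras.Model([query_in] + jobs_in, logits)
model.compile(optimizer="adam",
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True))
```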
Robust Representations via DSSM: Tri-letter Hashing
• Tri-letter hashing example
  • Engineer → #en, eng, ngi, gin, ine, nee, eer, er#
• Benefits of tri-letter hashing
  • More compact bag of tri-letters vs. bag-of-words representation
    • 700K-word vocabulary → 75K tri-letters
  • Can generalize to out-of-vocabulary words
  • Robust to minor misspellings and inflections of words
    • Engneer → #en, eng, ngn, gne, nee, eer, er#
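Tri-letter hashing is a few lines of Python; this sketch reproduces the slide's examples:

```python
def tri_letters(word):
    """Bag of tri-letters with word-boundary markers, e.g.
    'engineer' -> ['#en', 'eng', 'ngi', 'gin', 'ine', 'nee', 'eer', 'er#']."""
    padded = "#" + word.lower() + "#"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

print(tri_letters("engineer"))
# A minor misspelling shares most tri-letters with the correct word:
print(tri_letters("engneer"))  # ['#en', 'eng', 'ngn', 'gne', 'nee', 'eer', 'er#']
```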
Robust Representations via DSSM: Training Details
• Parameter sharing helps
  • Better and faster convergence
  • Model size is reduced
• Regularization
  • L2 performs better than dropout
• Toolkit comparisons (CNTK vs. TensorFlow)
  • CNTK: faster convergence and better model quality
  • TensorFlow: easy to implement and better community support; comparable model quality
[Figure: training performance with/without parameter sharing]
Robust Representations via DSSM: Lessons in a Production Environment
• Bottlenecks in the production environment
  • Latency due to extra computation
  • Latency due to GC activity
  • Fat JARs in the JVM environment
[Figure: latency increases of +40%, +70%, +100%]
• Practical lessons
  • Avoid the JVM heap while serving the model
  • Cache the most-accessed entities' embeddings
Robust Representations via DSSM: DSSM Qualitative Results
[Table of nearest head queries in the DSSM embedding space, per example query – Software Engineer: Engineer Software, Software Engineers, Software Engineering; Data Mining: Data Miner, Machine Learning Engineer; LinkedIn: Google, Microsoft Research; Softwareentwickler (German for "software developer"): Software, Software Engineer, Engineer Software]
For the qualitative results, only top head queries are taken to analyze similarity to each other.
Robust Representations via DSSM: DSSM Metric Results

Model | NDCG@5 | CTR@5 lift (%)
Baseline model | 0.582 | +0.0%
Baseline model + Word2Vec feature | 0.595 (+2.2%) | +1.6%
Baseline model + DSSM feature | 0.602 (+3.4%) | +3.2%
Robust Representations via DSSM: DSSM Future Direction
• Leverage current query understanding in the DSSM model
  • Query tag entity information for richer context embeddings
  • Query segmentation structure can be incorporated into the network design
• Deep Crossing for the similarity layer [Shan et al., 2016]
• Convolutional DSSM [Shen et al., 2014]
Conclusion
• Recommender systems and personalized search are very similar problems
• Deep learning is here to stay and can have significant impact on both
  • Understanding and constructing queries
  • Ranking
• Deep learning and more traditional techniques are *not* mutually exclusive (hint: Deep + Wide)
References
• [Rumelhart et al., 1986] Learning representations by back-propagating errors, Nature 1986
• [Hochreiter et al., 1997] Long short-term memory, Neural Computation 1997
• [LeCun et al., 1998] Gradient-based learning applied to document recognition, Proceedings of the IEEE 1998
• [Krizhevsky et al., 2012] ImageNet classification with deep convolutional neural networks, NIPS 2012
• [Graves et al., 2013] Speech recognition with deep recurrent neural networks, ICASSP 2013
• [Mikolov, 2012] Statistical language models based on neural networks, PhD thesis, Brno University of Technology, 2012
• [Kalchbrenner et al., 2013] Recurrent continuous translation models, EMNLP 2013
• [Srivastava, 2013] Improving neural networks with dropout, PhD thesis, University of Toronto, 2013
• [Sutskever et al., 2014] Sequence to sequence learning with neural networks, NIPS 2014
• [Vinyals et al., 2014] Show and tell: a neural image caption generator, arXiv 2014
• [Zaremba et al., 2015] Recurrent neural network regularization, ICLR 2015
References (continued)
• [Arya et al., 2016] Personalized Federated Search at LinkedIn, CIKM 2016
• [Cheng et al., 2016] Wide & Deep Learning for Recommender Systems, DLRS 2016
• [He et al., 2014] Practical Lessons from Predicting Clicks on Ads at Facebook, ADKDD 2014
• [Kingma et al., 2015] Adam: A Method for Stochastic Optimization, ICLR 2015
• [Huang et al., 2013] Learning Deep Structured Semantic Models for Web Search using Clickthrough Data, CIKM 2013
• [Li et al., 2017] NEMO: Next Career Move Prediction with Contextual Embedding, WWW 2017
• [Shan et al., 2016] Deep Crossing: Web-scale modeling without manually crafted combinatorial features, KDD 2016
• [Zhang et al., 2016] GLMix: Generalized Linear Mixed Models for Large-Scale Response Prediction, KDD 2016
• [Salakhutdinov et al., 2007] Restricted Boltzmann Machines for Collaborative Filtering, ICML 2007
• [Zheng, 2016] http://tech.hulu.com/blog/2016/08/01/cfnade.html
• [Hinton et al., 2006] A fast learning algorithm for deep belief nets, Neural Computation 2006
• [Wang et al., 2015] Collaborative Deep Learning for Recommender Systems, KDD 2015
• [He et al., 2017] Neural Collaborative Filtering, WWW 2017
• [Borisyuk et al., 2016] CaSMoS: A Framework for Learning Candidate Selection Models over Structured Queries and Documents, KDD 2016
References (continued)
• [netflix recsys] http://nordic.businessinsider.com/netflix-recommendation-engine-worth-1-billion-per-year-2016-6/
• [San Jose Mercury News] http://www.mercurynews.com/2017/01/06/at-linkedin-artificial-intelligence-is-like-oxygen/
• [Instagram blog] http://blog.instagram.com/post/145322772067/160602-news