Learning Neural Network Based Similarity Models from User Clicks

Xugang Ye

Transcript of Learning Neural Network Based Similarity Models from User Clicks

Page 1: Learning Neural Network Based Similarity Models from User Clicks

Learning Neural Network Based Similarity Models from User Clicks

Xugang Ye

Page 2:

Outline

• Neural network based similarity models based on term or lexical match

• Neural network based similarity models based on semantic match

• Combined models

• Results

Microsoft Research

Input text streams q = w1 w2 … wm and d = w1 w2 … wn are fed to a mapping function that produces Similarity(q, d), which in turn yields P(d|q), the relevance probability.

Page 3:

Term or Lexical Match

Input text streams q = w1 w2 … wm and d = w1 w2 … wn are converted to word count vectors, from which a word match counts vector y_{q,d} is formed; a neural network maps it to S: Similarity(q, d), which yields P(d|q), the relevance probability.
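The diagram above compresses several steps. As a rough sketch of the lexical side (the exact match-count features are not specified in this transcript, so the min-capped count scheme below is an assumption for illustration):

```python
from collections import Counter

def word_count_vec(text):
    """Bag-of-words count vector, represented sparsely as {word: count}."""
    return Counter(text.lower().split())

def word_match_count_vec(q, d):
    """For each query term, count its matches in the document,
    capped by the query-side count (illustrative scheme, not from the slides)."""
    qc, dc = word_count_vec(q), word_count_vec(d)
    return {w: min(c, dc.get(w, 0)) for w, c in qc.items()}

matches = word_match_count_vec("free ML course", "a free online course on ML basics")
print(matches)  # {'free': 1, 'ml': 1, 'course': 1}
```

In the model, a vector like this (rather than a dict) would be the input to the neural network that produces S.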


Page 4:

Semantic Match

Input text streams q = w1 w2 … wm and d = w1 w2 … wn are converted to word count vectors; a neural network per stream maps them to semantic vectors y_q and y_d, whose similarity R: Similarity(q, d) yields P(d|q), the relevance probability.

Fully connected version (DSSM): Huang, He, Gao, Deng, Heck, CIKM 2013

Convolutional version (CLSM): Shen, He, Gao, Deng, Mesnil, CIKM 2014

Generalized loss function (GDSSM1): Ye, Qi, Song, He, Massey, ICDM 2015

Page 5:

Combined Models

Input text streams q = w1 w2 … wm and d = w1 w2 … wn are mapped both to semantic vectors y_q and y_d, giving R: Similarity(q, d), and to a lexical match vector y_{q,d}, giving S: Similarity(q, d); the two similarities are combined into P(d|q), the relevance probability.

Fully connected version (GDSSM2): Ye, Qi, Massey, IEEE BigData 2015

Convolutional version: ongoing

Page 6:

Network Structure

Fully connected:

Text stream: w1 w2 w3 w4 …
Layer sizes: 50k → 300 → 300 → 128

Convolutional with max-pooling:

Text stream: w1 w2 w3 w4 …
Sliding window over word trigrams (w1 w2 w3, w2 w3 w4, …): each 50k window vector → 300
Max pooling over windows → 300 → 128
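The convolutional branch can be sketched as follows. This is a minimal illustration with toy dimensions and random weights (the slides use 50k-dim letter-trigram inputs and 300/128-dim layers; bias terms and the trained weights are omitted here):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_maxpool_features(word_vecs, W_conv, W_out):
    """Sketch of the convolutional branch: project each sliding 3-word window
    through a shared weight matrix with tanh, max-pool elementwise across
    windows, then map the pooled vector to the final semantic layer."""
    windows = [np.concatenate(word_vecs[i:i + 3]) for i in range(len(word_vecs) - 2)]
    local = np.stack([np.tanh(W_conv @ w) for w in windows])  # one row per window
    pooled = local.max(axis=0)                                # elementwise max pooling
    return np.tanh(W_out @ pooled)                            # final semantic vector

# Toy dimensions for demonstration only.
dim_w, dim_conv, dim_out = 10, 20, 8
words = [rng.standard_normal(dim_w) for _ in range(5)]
W_conv = rng.standard_normal((dim_conv, 3 * dim_w))
W_out = rng.standard_normal((dim_out, dim_conv))
y = conv_maxpool_features(words, W_conv, W_out)
```

The max pooling makes the output length-independent: any number of windows collapses to one fixed-size vector.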


Page 7:

Mathematical Formulation: Loss Function

Suppose a document $d$ has been viewed by $n$ distinct users under a query $q$, and the proportion of these users who clicked $d$ is $p$ ($0 \le p \le 1$). According to the binomial distribution, the probability of observing $p$ given $q, d, n$ is

$$P(p \mid q, d, n; \Theta) = \binom{n}{np}\, P(d \mid q; \Theta)^{np} \big(1 - P(d \mid q; \Theta)\big)^{n(1-p)},$$

where $P(d \mid q; \Theta)$ is the parameterized conditional probability that document $d$ is clicked by a user under query $q$, and $\Theta$ denotes the set of model parameters. Assume that $N$ different clicked $(q, d)$-pairs $\{(q_i, d_i) : i = 1, 2, \ldots, N\}$ are independent; then the joint probability of observing $p_i$ given $q_i, d_i, n_i$ for $i = 1, \ldots, N$ is

$$\prod_{i=1}^{N} P(p_i \mid q_i, d_i, n_i; \Theta) \;\propto\; \prod_{i=1}^{N} P(d_i \mid q_i; \Theta)^{n_i p_i} \big(1 - P(d_i \mid q_i; \Theta)\big)^{n_i (1 - p_i)}.$$

By taking the negative natural logarithm and ignoring the constants, we have the loss function

$$L(\Theta) = -\sum_{i=1}^{N} \ln P(p_i \mid q_i, d_i, n_i; \Theta) = -\sum_{i=1}^{N} n_i \big[\, p_i \ln P(d_i \mid q_i; \Theta) + (1 - p_i) \ln\big(1 - P(d_i \mid q_i; \Theta)\big) \big].$$
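The loss above is a view-count-weighted cross-entropy; a minimal sketch (model probabilities supplied directly, since the parametrization comes on the next slide):

```python
import numpy as np

def click_loss(p, n, P_model):
    """Negative log-likelihood of observed click proportions p_i over n_i
    viewers under model probabilities P(d_i|q_i), binomial constants dropped."""
    p, n, P = np.asarray(p, float), np.asarray(n, float), np.asarray(P_model, float)
    return -np.sum(n * (p * np.log(P) + (1 - p) * np.log(1 - P)))

p, n = [0.8, 0.2], [10, 5]
good = click_loss(p, n, [0.8, 0.2])  # model matches the observed proportions
bad = click_loss(p, n, [0.2, 0.8])   # model reverses them
```

As expected for a cross-entropy, the loss is minimized when the model probabilities equal the observed proportions.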


Page 8:

Mathematical Formulation: Parametrization

$$P(d \mid q; \Theta) = \frac{\exp\big(\gamma R(q, d; \Lambda) + S(q, d; \Omega, \Lambda)\big)}{\exp\big(\gamma R(q, d; \Lambda) + S(q, d; \Omega, \Lambda)\big) + \sum_{d' \in D_q^-} \exp\big(\gamma R(q, d'; \Lambda) + S(q, d'; \Omega, \Lambda)\big)},$$

$$R(q, d; \Lambda) = \frac{y_q^\top y_d}{\|y_q\| \cdot \|y_d\|}, \qquad S(q, d; \Omega, \Lambda) = \frac{\exp(\Omega^\top y_{q,d}) - 1}{\exp(\Omega^\top y_{q,d}) + 1},$$

where the parameter set $\Theta$ consists of $\Lambda, \Omega, \gamma$, and $D_q^-$ is the set of (4) unclicked documents under $q$. As in the DSSM and the CLSM models, $y_q = f(q; \Lambda_q)$ and $y_d = g(d; \Lambda_d)$ are the semantic vectors of $q$ and $d$ respectively, and they are mapped from the original vectors of term counts. The parameters $\Lambda_q$ and $\Lambda_d$ are the parts of $\Lambda$ corresponding to $q$ and $d$ respectively. Differing from the DSSM and the CLSM models, we believe that lexical match, besides semantic match, is also a reason for clicks. Therefore we added a neural network structure for $y_{q,d} = h(q, d; \Lambda_{q,d})$, the condensed vector mapped from the original vector of term match counts, where $\Lambda_{q,d}$ is the part of $\Lambda$ corresponding to the $(q, d)$-pair. For each net, the $\tanh$ function is used as the activation function. That is, if we denote the $l$-th layer as $(z_1^l, z_2^l, \ldots, z_{m_l}^l)$ and the $(l+1)$-th layer as $(z_1^{l+1}, z_2^{l+1}, \ldots, z_{m_{l+1}}^{l+1})$, then for each $i = 1, \ldots, m_{l+1}$,

$$z_i^{l+1} = \frac{1 - \exp(-2 a_i^l)}{1 + \exp(-2 a_i^l)},$$

where $a_i^l = \sum_{j=1}^{m_l} \Lambda_{j,i}^l z_j^l + \Lambda_{0,i}^l$. The parameters in $\Omega$ characterize $S$ as a function of $y_{q,d}$. In the special case where $\Omega = 0$, the whole structure of $P(d \mid q; \Theta)$ reduces to that of the DSSM model.
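A minimal numerical sketch of this parametrization, with random vectors standing in for the network outputs $y_q$, $y_d$, $y_{q,d}$ (in the real model these come from the trained nets):

```python
import numpy as np

def R(y_q, y_d):
    """Semantic similarity: cosine of the two semantic vectors."""
    return y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d))

def S(Omega, y_qd):
    """Lexical similarity: (exp(Omega.y) - 1) / (exp(Omega.y) + 1)."""
    x = Omega @ y_qd
    return (np.exp(x) - 1.0) / (np.exp(x) + 1.0)

def P_d_given_q(y_q, y_docs, y_matches, Omega, gamma):
    """Softmax over gamma*R + S; index 0 is the clicked document, the rest
    are the sampled unclicked documents in D_q^-."""
    scores = np.array([gamma * R(y_q, y_d) + S(Omega, y_m)
                       for y_d, y_m in zip(y_docs, y_matches)])
    e = np.exp(scores - scores.max())  # numerically stabilized softmax
    return e / e.sum()

rng = np.random.default_rng(0)
y_q = rng.standard_normal(8)
y_docs = [rng.standard_normal(8) for _ in range(5)]     # clicked + 4 unclicked
y_matches = [rng.standard_normal(4) for _ in range(5)]
probs = P_d_given_q(y_q, y_docs, y_matches, rng.standard_normal(4), gamma=10.0)
```

With Omega = 0 the S term vanishes and the softmax over gamma*R is exactly the DSSM posterior, matching the special case noted above.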

Page 9:

Parameter Estimation

• Gradient descent method (implemented as mini-batch SGD)

$$\Theta^{(t)} = \Theta^{(t-1)} - \eta_t \nabla_\Theta L(\Theta),$$

where $\eta_t > 0$ is the learning rate and $\nabla_\Theta L(\Theta)$ is the gradient of $L(\Theta)$ with respect to $\Theta$:

$$\nabla_\Theta L(\Theta) = -\sum_{i=1}^{N} n_i \sum_{d \in D_{q_i}^-} \big[\, \alpha_d^i - (1 - p_i)\, \beta_d^i \,\big] \cdot \nabla_\Theta \Delta_d^i,$$

where

$$\alpha_d^i = \frac{\exp(-\Delta_d^i)}{1 + \sum_{d' \in D_{q_i}^-} \exp(-\Delta_{d'}^i)}, \qquad \beta_d^i = \frac{\exp(-\Delta_d^i)}{\sum_{d' \in D_{q_i}^-} \exp(-\Delta_{d'}^i)},$$

$$\Delta_d^i = \gamma \big[ R(q_i, d_i; \Lambda) - R(q_i, d; \Lambda) \big] + S(q_i, d_i; \Omega, \Lambda) - S(q_i, d; \Omega, \Lambda).$$
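The weights alpha and beta can be computed and sanity-checked numerically; note that with the Delta-based parametrization, $P(d_i \mid q_i) = 1/(1 + \sum_d \exp(-\Delta_d^i))$, so the alphas should sum to $1 - P(d_i \mid q_i)$ and the betas to 1:

```python
import numpy as np

def alpha_beta(delta):
    """alpha_d and beta_d from the gradient formula, given Delta_d for the
    unclicked documents of one (q_i, d_i) pair."""
    e = np.exp(-np.asarray(delta, float))
    return e / (1.0 + e.sum()), e / e.sum()

delta = np.array([1.0, 2.5, 0.3, 4.0])  # toy margins for 4 unclicked docs
alpha, beta = alpha_beta(delta)
P_clicked = 1.0 / (1.0 + np.exp(-delta).sum())
```

Each alpha is strictly smaller than the corresponding beta, so the bracketed factor in the gradient shrinks as the observed click proportion p_i grows.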


Page 10:

Back Propagation

^��U��

JS�U

S�U

JN�U1�

_^`U1a

_Λ�,SU

��^`

U1a

�J`U1a

_J`U1a

_Λ�,SU

Linear

combinations

Activations _J`U1a

_Λ�,SU

�_J`

U1a

_JSU

_JSU

_Λ�,SU

�_J`

U1a

_JSU

· ^�U��

bRc�dPe

bRf�d � ∑

bRc�dPe

bRO�dPQ

bRO�dPQ

bgf�d

�gf�d

�Rf�d N � ∑

_JM�I*h

_JL�I*1 · ΛS,N

�U1� · 1 � K��JS

�U N

Microsoft Research
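The backward recursion for a small tanh network can be sketched and verified against a numerical derivative (toy objective E = sum of the final layer, random weights; not the actual model loss):

```python
import numpy as np

def forward(Lams, bs, x):
    """Forward pass of a tanh MLP: a^l = Lam^T z^l + b, z^{l+1} = tanh(a^l)."""
    zs, avs = [x], []
    for Lam, b in zip(Lams, bs):
        avs.append(Lam.T @ zs[-1] + b)
        zs.append(np.tanh(avs[-1]))
    return avs, zs

def weight_grads(Lams, avs, zs, grad_z_out):
    """Backward recursion: dE/da^l = (Lam^{l+1} dE/da^{l+1}) * (1 - tanh^2(a^l)),
    then dE/dLam_{j,i}^l = z_j^l * dE/da_i^l."""
    deltas = [None] * len(Lams)
    deltas[-1] = grad_z_out * (1 - np.tanh(avs[-1]) ** 2)
    for l in range(len(Lams) - 2, -1, -1):
        deltas[l] = (Lams[l + 1] @ deltas[l + 1]) * (1 - np.tanh(avs[l]) ** 2)
    return [np.outer(zs[l], deltas[l]) for l in range(len(Lams))]

# Check one weight against a central finite difference of E = sum(final layer).
rng = np.random.default_rng(1)
Lams = [rng.standard_normal((3, 4)), rng.standard_normal((4, 2))]
bs = [rng.standard_normal(4), rng.standard_normal(2)]
x = rng.standard_normal(3)
avs, zs = forward(Lams, bs, x)
grads = weight_grads(Lams, avs, zs, np.ones(2))

eps = 1e-6
Lp = [m.copy() for m in Lams]; Lp[0][1, 2] += eps
Lm = [m.copy() for m in Lams]; Lm[0][1, 2] -= eps
numeric = (forward(Lp, bs, x)[1][-1].sum() - forward(Lm, bs, x)[1][-1].sum()) / (2 * eps)
```

The analytic gradient and the finite-difference estimate agree to within numerical precision.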

Page 11:

Dimension Reduction

• Word-based dictionary has vocabulary size about 500k

• Letter-trigram (LTG)-based dictionary has vocabulary size about 50k

• Using letter trigrams also offers some robustness to misspellings

• An example of LTG-based vector representation:

> query = 'free ML course';
> query_LTG = Text.ToLetterTriGramSeq(query, 15);
"#fr fre ree ee# #ml ml# #co cou our urs rse se#"
> query_LTGVec = Text.StringTol3gCtVec(query, 15);
$v_idx
19218 21878 23882 34319 37701 41627 42173 43172 46711 58290 58410 58677
$v_ct
1 1 1 1 1 1 1 1 1 1 1 1
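The trigram step can be reproduced in a few lines (`Text.ToLetterTriGramSeq` is an internal tool; its second argument is not explained in this transcript and is left aside here):

```python
def letter_trigrams(text):
    """Split each word into overlapping letter trigrams, with '#' marking
    word boundaries, as in the slide's example."""
    grams = []
    for word in text.lower().split():
        padded = "#" + word + "#"
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

print(" ".join(letter_trigrams("free ML course")))
# #fr fre ree ee# #ml ml# #co cou our urs rse se#
```

Counting these grams against a ~50k trigram dictionary gives the sparse ($v_idx, $v_ct) representation shown above.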


Page 12:

A Brief View of the Models’ Ranking Performance

Improvement in Web Search

NDCG Measurements:

The average NDCG of the top $k$ positions is defined as

$$\mathrm{NDCG}_k = \frac{1}{M} \sum_{q} \left( \sum_{i=1}^{k} \frac{p_i^q}{\log_2(1+i)} \Big/ \sum_{i=1}^{k} \frac{\tilde{p}_i^q}{\log_2(1+i)} \right),$$

where $\tilde{p}_1^q \ge \tilde{p}_2^q \ge \cdots$ represent the descending order of $p_1^q, p_2^q, \ldots$, which are the observed click probabilities of the documents at positions $1, 2, \ldots$ respectively under query $q$. We require $q$ to satisfy $\max_i p_i^q - \min_i p_i^q \ge \tau > 0$, where $\tau$ is a pre-determined parameter, and $M$ is the total number of such queries in the test data set.
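For a single query this metric is straightforward to compute (assuming at least k scored positions per query; the 1/M averaging and the tau filter are applied over queries outside this function):

```python
import numpy as np

def ndcg_at_k(p, k):
    """NDCG_k for one query: DCG of the observed click probabilities in served
    order, divided by the DCG of the same values sorted descending."""
    p = np.asarray(p, float)
    disc = 1.0 / np.log2(1 + np.arange(1, k + 1))  # 1/log2(1+i), i = 1..k
    return (p[:k] * disc).sum() / (np.sort(p)[::-1][:k] * disc).sum()

perfect = ndcg_at_k([0.9, 0.5, 0.1], 3)  # already in ideal order
swapped = ndcg_at_k([0.9, 0.1, 0.5], 3)  # positions 2 and 3 swapped
```

A perfectly ordered list scores exactly 1; any misordering among the top k positions scores strictly below 1.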


Page 13:

Previous Results

*Results from the presentation in CIKM 2014


Page 14:

Other Applications besides Web Search

Task                       q                        d
Web search                 Query                    URL
Catalog search             Query                    Content
Keyword extraction         Doc in reading           Key phrase
Contextual entity search   Key phrase and context   Entity and entity page
Machine translation        Sentence in language A   Sentence in language B
Article recommendation     Article in reading       Other interesting article
Image recommendation       Image in viewing         Other interesting image
Image captioning           Image                    Caption sentence


Page 15:

Text is boring, let's have some fun with images

Caption Generation System: an image q is fed to a Computer Vision System, which produces candidate captions d1, d2, …, dk; a semantic similarity model (DSSM) ranks them, and the caption ranked the highest (e.g., d4) is returned.

Page 16:

q: image features; d (as text): "a parrot rides a tricycle". A semantic similarity model maps both to vectors y_q and y_d, computes R: Similarity(q, d), and outputs P(d|q), the relevance probability.

Page 17:

Some Interesting Results

Boy riding on horse


Page 18:

Some Interesting Results


Page 19:

Some Interesting Results


Page 20:

Some Interesting Results


Page 21:

Some Interesting Results


Page 22:

Q & A