Learning Neural Network Based Similarity Models from User Clicks
Xugang Ye
Outline
• Neural network based similarity models using term or lexical match
• Neural network based similarity models using semantic match
• Combined models
• Results
Microsoft Research
[Figure: the generic model. Input text streams q = w1 w2 … and d = w1 w2 … are fed to a mapping function that outputs Similarity(q, d), which determines P(d|q), the relevance probability.]
Term or Lexical Match
[Figure: lexical match. The input text streams q = w1 w2 … and d = w1 w2 … are converted to word-count vectors, from which a word-match-counts vector z_{q,d} is formed; a neural network maps z_{q,d} to S: Similarity(q, d), which determines P(d|q), the relevance probability.]
Semantic Match
[Figure: semantic match. The input text streams q = w1 w2 … and d = w1 w2 … are converted to word-count vectors and mapped by neural networks to semantic vectors y_q and y_d, whose similarity R: Similarity(q, d) determines P(d|q), the relevance probability.]

Fully connected version (DSSM): Huang, He, Gao, Deng, Heck, CIKM 2013
Convolutional version (CLSM): Shen, He, Gao, Deng, Mesnil, CIKM 2014
Generalized loss function (GDSSM1): Ye, Qi, Song, He, Massey, ICDM 2015
Combined Models
[Figure: combined model. Semantic networks map q and d to semantic vectors y_q and y_d, giving R: Similarity(q, d); a lexical network maps the word-match-counts vector z_{q,d} to S: Similarity(q, d); both similarities feed into P(d|q), the relevance probability.]

Fully connected version (GDSSM2): Ye, Qi, Massey, IEEE BigData 2015
Convolutional version: ongoing work
Network Structure
[Figure: network structures. Fully connected: a text stream w1 w2 w3 w4 … is mapped from a 50k-dimensional letter-trigram input layer through two 300-unit hidden layers to a 128-dimensional semantic layer. Convolutional with max-pooling: a sliding window over the stream (w1 w2 w3, w2 w3 w4, …) produces 50k-dimensional local inputs, each mapped to a 300-unit local feature vector; max-pooling over the windows feeds the final 128-dimensional semantic layer.]
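The fully connected tower can be sketched in a few lines of NumPy. The layer sizes (50k → 300 → 300 → 128) come from the slide; the random initialization and function names are purely illustrative, not the production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the slide: 50k letter-trigram input -> 300 -> 300 -> 128.
sizes = [50_000, 300, 300, 128]
weights = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Map a 50k-dim letter-trigram count vector to a 128-dim semantic vector,
    applying tanh after each fully connected layer."""
    for W, b in zip(weights, biases):
        x = np.tanh(x @ W + b)
    return x

x = np.zeros(50_000)
x[[19218, 21878, 23882]] = 1.0  # a sparse letter-trigram count vector
y = forward(x)                  # y.shape == (128,)
```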
Mathematical Formulation: Loss Function

Suppose a document d has been viewed by m distinct users under a query q, and the proportion of these users who clicked d is r (0 ≤ r ≤ 1). According to the binomial distribution, the probability of observing r given q, d, m is

P(r | q, d, m; Θ) = C(m, mr) · P(d|q; Θ)^{mr} · (1 − P(d|q; Θ))^{m − mr},

where P(d|q; Θ) is the parameterized conditional probability that document d is clicked by a user under query q, and Θ denotes the set of model parameters. Assume that n different clicked (q, d)-pairs (q_i, d_i), i = 1, 2, …, n, are independent; then the joint probability of observing r_i given q_i, d_i, m_i for i = 1, …, n is

∏_{i=1}^{n} P(r_i | q_i, d_i, m_i; Θ) ∝ ∏_{i=1}^{n} P(d_i|q_i; Θ)^{m_i r_i} · (1 − P(d_i|q_i; Θ))^{m_i − m_i r_i}.

By taking the negative natural logarithm and ignoring the constants, we have the loss function

L(Θ) = −∑_{i=1}^{n} ln P(r_i | q_i, d_i, m_i; Θ)
     = −∑_{i=1}^{n} m_i [ r_i ln P(d_i|q_i; Θ) + (1 − r_i) ln(1 − P(d_i|q_i; Θ)) ].
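As a sanity check, the loss can be written directly in NumPy. This is a minimal sketch (the function name and vectorized form are illustrative):

```python
import numpy as np

def click_loss(P, r, m):
    """Negative log-likelihood of observed click proportions under the
    binomial model, constants dropped:
        L(Theta) = -sum_i m_i * [r_i * ln P_i + (1 - r_i) * ln(1 - P_i)].

    P : array of model probabilities P(d_i | q_i; Theta)
    r : array of observed click proportions r_i in [0, 1]
    m : array of view counts m_i
    """
    P = np.clip(P, 1e-12, 1 - 1e-12)  # numerical safety near 0 and 1
    return -np.sum(m * (r * np.log(P) + (1 - r) * np.log(1 - P)))
```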
Mathematical Formulation: Parametrization
P(d|q; Θ) = exp(γ·R(q, d; Λ) + S(q, d; Ω, Λ)) / [ exp(γ·R(q, d; Λ) + S(q, d; Ω, Λ)) + ∑_{d'∈D_q} exp(γ·R(q, d'; Λ) + S(q, d'; Ω, Λ)) ],

R(q, d; Λ) = (y_q^T y_d) / (||y_q|| · ||y_d||),

S(q, d; Ω, Λ) = (exp(Ω^T z_{q,d}) − 1) / (exp(Ω^T z_{q,d}) + 1),

where the parameter set Θ consists of Λ, Ω, γ, and D_q is the set of (e.g., 4) unclicked documents under q. As in the DSSM and the CLSM models, y_q = f(q; Λ_q) and y_d = g(d; Λ_d) are the semantic vectors of q and d respectively, mapped from the original vectors of term counts. The parameters Λ_q and Λ_d are the parts of Λ corresponding to q and d respectively. Differing from the DSSM and the CLSM models, we believe that lexical match, besides semantic match, is also a reason for clicks. We therefore add a neural network structure for z_{q,d} = h(q, d; Λ_{q,d}), the condensed vector mapped from the original vector of term-match counts, where Λ_{q,d} is the part of Λ corresponding to the (q, d)-pair. For each net, the tanh function is used as the activation function. That is, if we denote the l-th layer as (x_1^l, x_2^l, …, x_{n_l}^l) and the (l+1)-th layer as (x_1^{l+1}, x_2^{l+1}, …, x_{n_{l+1}}^{l+1}), then for each i = 1, …, n_{l+1},

x_i^{l+1} = (1 − exp(−2u_i^l)) / (1 + exp(−2u_i^l)),  where u_i^l = ∑_{j=1}^{n_l} λ_{j,i}^l x_j^l + λ_{0,i}^l.

The parameters in Ω characterize S as a function of z_{q,d}. In the special case when Ω = 0, the whole structure of P(d|q; Θ) reduces to that of the DSSM model.
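A minimal NumPy sketch of this parametrization, assuming the semantic vectors y_q, y_d and the condensed lexical vector z_{q,d} have already been computed by the networks (function names are illustrative):

```python
import numpy as np

def semantic_similarity(y_q, y_d):
    """R(q, d; Lambda): cosine similarity of the semantic vectors."""
    return y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d))

def lexical_similarity(omega, z_qd):
    """S(q, d; Omega, Lambda) = (exp(w'z) - 1) / (exp(w'z) + 1),
    i.e. tanh(w'z / 2)."""
    t = omega @ z_qd
    return (np.exp(t) - 1.0) / (np.exp(t) + 1.0)

def click_probability(gamma, R_pos, S_pos, R_negs, S_negs):
    """P(d | q; Theta): softmax of gamma*R + S over the clicked document
    (first score) and the unclicked documents in D_q."""
    scores = np.concatenate(([gamma * R_pos + S_pos],
                             gamma * np.asarray(R_negs) + np.asarray(S_negs)))
    e = np.exp(scores - scores.max())  # stabilized softmax
    return e[0] / e.sum()
```

Note that with Ω = 0 the lexical score vanishes and `click_probability` reduces to a softmax over γ·R alone, the DSSM case.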
Parameter Estimation
• Gradient descent method (implemented as mini-batch SGD)
Θ^(t) = Θ^(t−1) − η_t · ∇_Θ L(Θ),

where η_t > 0 is the learning rate and ∇_Θ L(Θ) is the gradient of L(Θ) with respect to Θ:

∇_Θ L(Θ) = −∑_{i=1}^{n} m_i ∑_{d∈D_{q_i}} [ α_{d,i} − (1 − r_i)·β_{d,i} ] · ∇_Θ Δ_{d,i},

where

α_{d,i} = exp(−Δ_{d,i}) / (1 + ∑_{d'∈D_{q_i}} exp(−Δ_{d',i})),

β_{d,i} = exp(−Δ_{d,i}) / ∑_{d'∈D_{q_i}} exp(−Δ_{d',i}),

Δ_{d,i} = γ·[R(q_i, d_i; Λ) − R(q_i, d; Λ)] + S(q_i, d_i; Ω, Λ) − S(q_i, d; Ω, Λ).
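The softmax-style weights α and β can be sketched for a single (q_i, d_i) pair as follows; this is an illustration under the binomial loss, not the production code:

```python
import numpy as np

def gradient_weights(delta, r):
    """For one (q_i, d_i) pair: given delta[d] = Delta_{d,i} for each unclicked
    d in D_{q_i} and the observed click proportion r_i, return the coefficient
    of grad Delta_{d,i} in grad L (up to the factor m_i):
        -(alpha_{d,i} - (1 - r_i) * beta_{d,i}).
    """
    e = np.exp(-np.asarray(delta, dtype=float))
    alpha = e / (1.0 + e.sum())   # alpha_{d,i}
    beta = e / e.sum()            # beta_{d,i}
    return -(alpha - (1.0 - r) * beta)
```

When r_i = 1 (every viewer clicked d_i), every coefficient is negative, so the descent step Θ ← Θ − η·∇L increases each Δ_{d,i}, pushing the clicked document's score above those of the unclicked ones.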
Back Propagation
[Figure: a portion of the network, showing the inputs x_j^l, the linear combinations u_i^l, the activations x_i^{l+1} = tanh(u_i^l), and the next layers' linear combinations u^{l+1}, u^{l+2}.]

Gradients are propagated backward through the layers by the chain rule. Since a weight λ_{j,i}^l enters only the linear combination u_i^l = ∑_j λ_{j,i}^l x_j^l + λ_{0,i}^l,

∂u_k^{l+2} / ∂λ_{j,i}^l = (∂u_k^{l+2} / ∂u_i^l) · (∂u_i^l / ∂λ_{j,i}^l) = (∂u_k^{l+2} / ∂u_i^l) · x_j^l,

and, propagating across layers through the tanh activations (using tanh′(u) = 1 − tanh²(u)),

∂u_k^{l+2} / ∂u_j^l = ∑_i (∂u_k^{l+2} / ∂u_i^{l+1}) · λ_{j,i}^{l+1} · (1 − tanh²(u_j^l)).
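A generic sketch of this chain rule for a stack of tanh layers (array shapes, names, and the bias-free form are assumptions for illustration, not the paper's code):

```python
import numpy as np

def tanh_backprop(weights, activations, delta_out):
    """Back-propagate an error signal through a stack of tanh layers.

    weights[l]     : array of shape (n_l, n_{l+1})
    activations[l] : tanh output of layer l (activations[0] is the input)
    delta_out      : dLoss/du at the top linear layer
    Returns the per-layer weight gradients, bottom layer first.
    """
    grads = []
    delta = delta_out
    for l in range(len(weights) - 1, -1, -1):
        # dLoss/dlambda_{j,i}^l = x_j^l * delta_i
        grads.append(np.outer(activations[l], delta))
        if l > 0:
            # chain rule: through the weights, then through tanh'(u) = 1 - x^2
            delta = (weights[l] @ delta) * (1.0 - activations[l] ** 2)
    return grads[::-1]
```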
Dimension Reduction
• A word-based dictionary has a vocabulary size of about 500k
• A letter-trigram (LTG)-based dictionary has a vocabulary size of about 50k
• Letter trigrams also offer some robustness to misspellings
• An example of LTG-based vector representation:
> query = 'free ML course';
> query_LTG = Text.ToLetterTriGramSeq(query, 15);
"#fr fre ree ee# #ml ml# #co cou our urs rse se#"
> query_LTGVec = Text.StringTol3gCtVec(query, 15);
$v_idx
19218 21878 23882 34319 37701 41627 42173 43172 46711 58290 58410 58677
$v_ct
1 1 1 1 1 1 1 1 1 1 1 1
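A hedged Python equivalent of the internal Text.ToLetterTriGramSeq call shown above, reproducing the slide's trigram sequence (the '#' word-boundary padding convention is inferred from the example output):

```python
def letter_trigrams(text):
    """Break each word into letter trigrams after padding it with '#'
    word-boundary marks, as in the slide's example:
    'free' -> '#free#' -> '#fr', 'fre', 'ree', 'ee#'."""
    grams = []
    for word in text.lower().split():
        padded = '#' + word + '#'
        if len(padded) < 3:
            continue
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams
```

A count vector over the 50k-entry trigram dictionary is then obtained by counting each trigram's occurrences at its dictionary index.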
A Brief View of the Models’ Ranking Performance
Improvement in Web Search
NDCG Measurements:
Average NDCG at the top k positions is defined as

NDCG_k = (1/N) ∑_q [ ∑_{j=1}^{k} r_j^q / log₂(1 + j) ] / [ ∑_{j=1}^{k} r̃_j^q / log₂(1 + j) ],

where r̃_1^q ≥ r̃_2^q ≥ ⋯ represent the descending order of r_1^q, r_2^q, …, which are the observed click probabilities of the documents at positions 1, 2, …, respectively, under query q. We require q to satisfy min_j m_j^q ≥ τ > 0, where τ is a pre-determined parameter, m_j^q is the number of views of the document at position j under q, and N is the total number of such queries in the test data set.
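The per-query term of this definition can be sketched as follows (function name illustrative; a full evaluation would also average over the N qualifying queries):

```python
import numpy as np

def ndcg_at_k(r, k):
    """NDCG@k for one query: r[j] is the observed click probability of the
    document shown at position j+1. Discount at position j is 1/log2(1+j)."""
    r = np.asarray(r, dtype=float)
    disc = 1.0 / np.log2(2 + np.arange(k))       # positions 1..k -> log2(1+j)
    dcg = np.sum(r[:k] * disc)                   # DCG of the shown order
    ideal = np.sum(np.sort(r)[::-1][:k] * disc)  # DCG of the ideal (sorted) order
    return dcg / ideal if ideal > 0 else 0.0
```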
Previous Results
*Results from the presentation in CIKM 2014
Other Applications besides the Web Search
Task                      q                        d
Web search                Query                    URL
Catalog search            Query                    Content
Keyword extraction        Doc in reading           Key phrase
Contextual entity search  Key phrase and context   Entity and entity page
Machine translation       Sentence in language A   Sentence in language B
Article recommendation    Article in reading       Other interesting article
Image recommendation      Image in viewing         Other interesting image
Image captioning          Image                    Caption sentence
…
Text is boring; let's have some fun with images

[Figure: caption generation pipeline. An image q goes into a Computer Vision System, which produces candidate captions d1, d2, …, dk; a semantic similarity model (DSSM) ranks them, and the caption ranked highest (e.g., d4) is returned.]
[Figure: the semantic similarity model used for captioning. q is a vector of image features; d (as text), e.g., "a parrot rides a tricycle", is mapped to a semantic vector; R: Similarity(q, d) over the vectors y_q, y_d determines P(d|q), the relevance probability.]
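The ranking step can be sketched as cosine-similarity scoring over candidate caption vectors (a toy illustration; in the actual system the vectors come from the trained DSSM-style towers):

```python
import numpy as np

def rank_captions(image_vec, caption_vecs):
    """Rank candidate captions by cosine similarity between the image's
    semantic vector and each caption's semantic vector.
    Returns candidate indices, best-scoring first."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = [cos(image_vec, c) for c in caption_vecs]
    return list(np.argsort(scores)[::-1])
```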
Some Interesting Results
[Image with generated caption: "Boy riding on horse"]
Q & A