Learning Neural Network Based Similarity Models from User Clicks
Xugang Ye
Outline
• Neural network based similarity models using term or lexical match
• Neural network based similarity models using semantic match
• Combined models
• Results
Microsoft Research
[Figure: the generic model. Input text streams q = w1 w2 … and d = w1 w2 … are fed to a mapping function that outputs Similarity(q, d), which determines P(d|q), the relevance probability.]
Term or Lexical Match
[Figure: lexical match. The input text streams q = w1 w2 … and d = w1 w2 … are converted to word-count vectors, from which a word-match-counts vector z_{q,d} is formed; a neural network maps z_{q,d} to S: Similarity(q, d), which determines P(d|q), the relevance probability.]
Semantic Match
[Figure: semantic match. The input text streams q = w1 w2 … and d = w1 w2 … are converted to word-count vectors and mapped by neural networks to semantic vectors y_q and y_d, whose similarity R: Similarity(q, d) determines P(d|q), the relevance probability.]

Fully connected version (DSSM): Huang, He, Gao, Deng, Heck, CIKM 2013
Convolutional version (CLSM): Shen, He, Gao, Deng, Mesnil, CIKM 2014
Generalized loss function (GDSSM1): Ye, Qi, Song, He, Massey, ICDM 2015
Combined Models
[Figure: combined model. Semantic networks map q and d to semantic vectors y_q and y_d, giving R: Similarity(q, d); a lexical network maps the word-match-counts vector z_{q,d} to S: Similarity(q, d); both similarities feed into P(d|q), the relevance probability.]

Fully connected version (GDSSM2): Ye, Qi, Massey, IEEE BigData 2015
Convolutional version: ongoing work
Network Structure
[Figure: network structures. Fully connected: a text stream w1 w2 w3 w4 … is mapped from a 50k-dimensional letter-trigram input layer through two 300-unit hidden layers to a 128-dimensional semantic layer. Convolutional with max-pooling: a sliding window over the stream (w1 w2 w3, w2 w3 w4, …) produces 50k-dimensional local inputs, each mapped to a 300-unit local feature vector; max-pooling over the windows feeds the final 128-dimensional semantic layer.]
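The fully connected tower can be sketched in a few lines of NumPy. The layer sizes (50k → 300 → 300 → 128) come from the slide; the random initialization and function names are purely illustrative, not the production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from the slide: 50k letter-trigram input -> 300 -> 300 -> 128.
sizes = [50_000, 300, 300, 128]
weights = [rng.normal(scale=0.01, size=(m, n)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Map a 50k-dim letter-trigram count vector to a 128-dim semantic vector,
    applying tanh after each fully connected layer."""
    for W, b in zip(weights, biases):
        x = np.tanh(x @ W + b)
    return x

x = np.zeros(50_000)
x[[19218, 21878, 23882]] = 1.0  # a sparse letter-trigram count vector
y = forward(x)                  # y.shape == (128,)
```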
Mathematical Formulation: Loss Function

Suppose a document d has been viewed by m distinct users under a query q, and the proportion of these users who clicked d is r (0 ≤ r ≤ 1). According to the binomial distribution, the probability of observing r given q, d, m is

P(r | q, d, m; Θ) = C(m, mr) · P(d|q; Θ)^{mr} · (1 − P(d|q; Θ))^{m − mr},

where P(d|q; Θ) is the parameterized conditional probability that document d is clicked by a user under query q, and Θ denotes the set of model parameters. Assume that n different clicked (q, d)-pairs (q_i, d_i), i = 1, 2, …, n, are independent; then the joint probability of observing r_i given q_i, d_i, m_i for i = 1, …, n is

∏_{i=1}^{n} P(r_i | q_i, d_i, m_i; Θ) ∝ ∏_{i=1}^{n} P(d_i|q_i; Θ)^{m_i r_i} · (1 − P(d_i|q_i; Θ))^{m_i − m_i r_i}.

By taking the negative natural logarithm and ignoring the constants, we have the loss function

L(Θ) = −∑_{i=1}^{n} ln P(r_i | q_i, d_i, m_i; Θ)
     = −∑_{i=1}^{n} m_i [ r_i ln P(d_i|q_i; Θ) + (1 − r_i) ln(1 − P(d_i|q_i; Θ)) ].
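As a sanity check, the loss can be written directly in NumPy. This is a minimal sketch (the function name and vectorized form are illustrative):

```python
import numpy as np

def click_loss(P, r, m):
    """Negative log-likelihood of observed click proportions under the
    binomial model, constants dropped:
        L(Theta) = -sum_i m_i * [r_i * ln P_i + (1 - r_i) * ln(1 - P_i)].

    P : array of model probabilities P(d_i | q_i; Theta)
    r : array of observed click proportions r_i in [0, 1]
    m : array of view counts m_i
    """
    P = np.clip(P, 1e-12, 1 - 1e-12)  # numerical safety near 0 and 1
    return -np.sum(m * (r * np.log(P) + (1 - r) * np.log(1 - P)))
```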
Mathematical Formulation: Parametrization
P(d|q; Θ) = exp(γ·R(q, d; Λ) + S(q, d; Ω, Λ)) / [ exp(γ·R(q, d; Λ) + S(q, d; Ω, Λ)) + ∑_{d'∈D_q} exp(γ·R(q, d'; Λ) + S(q, d'; Ω, Λ)) ],

R(q, d; Λ) = (y_q^T y_d) / (||y_q|| · ||y_d||),

S(q, d; Ω, Λ) = (exp(Ω^T z_{q,d}) − 1) / (exp(Ω^T z_{q,d}) + 1),

where the parameter set Θ consists of Λ, Ω, γ, and D_q is the set of (e.g., 4) unclicked documents under q. As in the DSSM and the CLSM models, y_q = f(q; Λ_q) and y_d = g(d; Λ_d) are the semantic vectors of q and d respectively, mapped from the original vectors of term counts. The parameters Λ_q and Λ_d are the parts of Λ corresponding to q and d respectively. Differing from the DSSM and the CLSM models, we believe that lexical match, besides semantic match, is also a reason for clicks. We therefore add a neural network structure for z_{q,d} = h(q, d; Λ_{q,d}), the condensed vector mapped from the original vector of term-match counts, where Λ_{q,d} is the part of Λ corresponding to the (q, d)-pair. For each net, the tanh function is used as the activation function. That is, if we denote the l-th layer as (x_1^l, x_2^l, …, x_{n_l}^l) and the (l+1)-th layer as (x_1^{l+1}, x_2^{l+1}, …, x_{n_{l+1}}^{l+1}), then for each i = 1, …, n_{l+1},

x_i^{l+1} = (1 − exp(−2u_i^l)) / (1 + exp(−2u_i^l)),  where u_i^l = ∑_{j=1}^{n_l} λ_{j,i}^l x_j^l + λ_{0,i}^l.

The parameters in Ω characterize S as a function of z_{q,d}. In the special case when Ω = 0, the whole structure of P(d|q; Θ) reduces to that of the DSSM model.
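A minimal NumPy sketch of this parametrization, assuming the semantic vectors y_q, y_d and the condensed lexical vector z_{q,d} have already been computed by the networks (function names are illustrative):

```python
import numpy as np

def semantic_similarity(y_q, y_d):
    """R(q, d; Lambda): cosine similarity of the semantic vectors."""
    return y_q @ y_d / (np.linalg.norm(y_q) * np.linalg.norm(y_d))

def lexical_similarity(omega, z_qd):
    """S(q, d; Omega, Lambda) = (exp(w'z) - 1) / (exp(w'z) + 1),
    i.e. tanh(w'z / 2)."""
    t = omega @ z_qd
    return (np.exp(t) - 1.0) / (np.exp(t) + 1.0)

def click_probability(gamma, R_pos, S_pos, R_negs, S_negs):
    """P(d | q; Theta): softmax of gamma*R + S over the clicked document
    (first score) and the unclicked documents in D_q."""
    scores = np.concatenate(([gamma * R_pos + S_pos],
                             gamma * np.asarray(R_negs) + np.asarray(S_negs)))
    e = np.exp(scores - scores.max())  # stabilized softmax
    return e[0] / e.sum()
```

Note that with Ω = 0 the lexical score vanishes and `click_probability` reduces to a softmax over γ·R alone, the DSSM case.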
Parameter Estimation
• Gradient descent method (implemented as mini-batch SGD)
Θ^(t) = Θ^(t−1) − η_t · ∇_Θ L(Θ),

where η_t > 0 is the learning rate and ∇_Θ L(Θ) is the gradient of L(Θ) with respect to Θ:

∇_Θ L(Θ) = −∑_{i=1}^{n} m_i ∑_{d∈D_{q_i}} [ α_{d,i} − (1 − r_i)·β_{d,i} ] · ∇_Θ Δ_{d,i},

where

α_{d,i} = exp(−Δ_{d,i}) / (1 + ∑_{d'∈D_{q_i}} exp(−Δ_{d',i})),

β_{d,i} = exp(−Δ_{d,i}) / ∑_{d'∈D_{q_i}} exp(−Δ_{d',i}),

Δ_{d,i} = γ·[R(q_i, d_i; Λ) − R(q_i, d; Λ)] + S(q_i, d_i; Ω, Λ) − S(q_i, d; Ω, Λ).
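The softmax-style weights α and β can be sketched for a single (q_i, d_i) pair as follows; this is an illustration under the binomial loss, not the production code:

```python
import numpy as np

def gradient_weights(delta, r):
    """For one (q_i, d_i) pair: given delta[d] = Delta_{d,i} for each unclicked
    d in D_{q_i} and the observed click proportion r_i, return the coefficient
    of grad Delta_{d,i} in grad L (up to the factor m_i):
        -(alpha_{d,i} - (1 - r_i) * beta_{d,i}).
    """
    e = np.exp(-np.asarray(delta, dtype=float))
    alpha = e / (1.0 + e.sum())   # alpha_{d,i}
    beta = e / e.sum()            # beta_{d,i}
    return -(alpha - (1.0 - r) * beta)
```

When r_i = 1 (every viewer clicked d_i), every coefficient is negative, so the descent step Θ ← Θ − η·∇L increases each Δ_{d,i}, pushing the clicked document's score above those of the unclicked ones.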
Back Propagation
[Figure: a portion of the network, showing the inputs x_j^l, the linear combinations u_i^l, the activations x_i^{l+1} = tanh(u_i^l), and the next layers' linear combinations u^{l+1}, u^{l+2}.]

Gradients are propagated backward through the layers by the chain rule. Since a weight λ_{j,i}^l enters only the linear combination u_i^l = ∑_j λ_{j,i}^l x_j^l + λ_{0,i}^l,

∂u_k^{l+2} / ∂λ_{j,i}^l = (∂u_k^{l+2} / ∂u_i^l) · (∂u_i^l / ∂λ_{j,i}^l) = (∂u_k^{l+2} / ∂u_i^l) · x_j^l,

and, propagating across layers through the tanh activations (using tanh′(u) = 1 − tanh²(u)),

∂u_k^{l+2} / ∂u_j^l = ∑_i (∂u_k^{l+2} / ∂u_i^{l+1}) · λ_{j,i}^{l+1} · (1 − tanh²(u_j^l)).
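A generic sketch of this chain rule for a stack of tanh layers (array shapes, names, and the bias-free form are assumptions for illustration, not the paper's code):

```python
import numpy as np

def tanh_backprop(weights, activations, delta_out):
    """Back-propagate an error signal through a stack of tanh layers.

    weights[l]     : array of shape (n_l, n_{l+1})
    activations[l] : tanh output of layer l (activations[0] is the input)
    delta_out      : dLoss/du at the top linear layer
    Returns the per-layer weight gradients, bottom layer first.
    """
    grads = []
    delta = delta_out
    for l in range(len(weights) - 1, -1, -1):
        # dLoss/dlambda_{j,i}^l = x_j^l * delta_i
        grads.append(np.outer(activations[l], delta))
        if l > 0:
            # chain rule: through the weights, then through tanh'(u) = 1 - x^2
            delta = (weights[l] @ delta) * (1.0 - activations[l] ** 2)
    return grads[::-1]
```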
Dimension Reduction
• A word-based dictionary has a vocabulary size of about 500k
• A letter-trigram (LTG)-based dictionary has a vocabulary size of about 50k
• Letter trigrams also offer some robustness to misspellings
• An example of LTG-based vector representation:
> query = 'free ML course';
> query_LTG = Text.ToLetterTriGramSeq(query, 15);
"#fr fre ree ee# #ml ml# #co cou our urs rse se#"
> query_LTGVec = Text.StringTol3gCtVec(query, 15);
$v_idx
19218 21878 23882 34319 37701 41627 42173 43172 46711 58290 58410 58677
$v_ct
1 1 1 1 1 1 1 1 1 1 1 1
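A hedged Python equivalent of the internal Text.ToLetterTriGramSeq call shown above, reproducing the slide's trigram sequence (the '#' word-boundary padding convention is inferred from the example output):

```python
def letter_trigrams(text):
    """Break each word into letter trigrams after padding it with '#'
    word-boundary marks, as in the slide's example:
    'free' -> '#free#' -> '#fr', 'fre', 'ree', 'ee#'."""
    grams = []
    for word in text.lower().split():
        padded = '#' + word + '#'
        if len(padded) < 3:
            continue
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams
```

A count vector over the 50k-entry trigram dictionary is then obtained by counting each trigram's occurrences at its dictionary index.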
A Brief View of the Models’ Ranking Performance
Improvement in Web Search
NDCG Measurements:
Average NDCG at the top k positions is defined as

NDCG_k = (1/N) ∑_q [ ∑_{j=1}^{k} r_j^q / log₂(1 + j) ] / [ ∑_{j=1}^{k} r̃_j^q / log₂(1 + j) ],

where r̃_1^q ≥ r̃_2^q ≥ ⋯ represent the descending order of r_1^q, r_2^q, …, which are the observed click probabilities of the documents at positions 1, 2, …, respectively, under query q. We require q to satisfy min_j m_j^q ≥ τ > 0, where τ is a pre-determined parameter, m_j^q is the number of views of the document at position j under q, and N is the total number of such queries in the test data set.
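The per-query term of this definition can be sketched as follows (function name illustrative; a full evaluation would also average over the N qualifying queries):

```python
import numpy as np

def ndcg_at_k(r, k):
    """NDCG@k for one query: r[j] is the observed click probability of the
    document shown at position j+1. Discount at position j is 1/log2(1+j)."""
    r = np.asarray(r, dtype=float)
    disc = 1.0 / np.log2(2 + np.arange(k))       # positions 1..k -> log2(1+j)
    dcg = np.sum(r[:k] * disc)                   # DCG of the shown order
    ideal = np.sum(np.sort(r)[::-1][:k] * disc)  # DCG of the ideal (sorted) order
    return dcg / ideal if ideal > 0 else 0.0
```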
Previous Results
*Results from the presentation in CIKM 2014
Other Applications besides the Web Search
Task                      q                        d
Web search                Query                    URL
Catalog search            Query                    Content
Keyword extraction        Doc in reading           Key phrase
Contextual entity search  Key phrase and context   Entity and entity page
Machine translation       Sentence in language A   Sentence in language B
Article recommendation    Article in reading       Other interesting article
Image recommendation      Image in viewing         Other interesting image
Image captioning          Image                    Caption sentence
…
Text is boring; let's have some fun with images

[Figure: caption generation pipeline. An image q goes into a Computer Vision System, which produces candidate captions d1, d2, …, dk; a semantic similarity model (DSSM) ranks them, and the caption ranked highest (e.g., d4) is returned.]
[Figure: the semantic similarity model used for captioning. q is a vector of image features; d (as text), e.g., "a parrot rides a tricycle", is mapped to a semantic vector; R: Similarity(q, d) over the vectors y_q, y_d determines P(d|q), the relevance probability.]
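The ranking step can be sketched as cosine-similarity scoring over candidate caption vectors (a toy illustration; in the actual system the vectors come from the trained DSSM-style towers):

```python
import numpy as np

def rank_captions(image_vec, caption_vecs):
    """Rank candidate captions by cosine similarity between the image's
    semantic vector and each caption's semantic vector.
    Returns candidate indices, best-scoring first."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scores = [cos(image_vec, c) for c in caption_vecs]
    return list(np.argsort(scores)[::-1])
```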
Some Interesting Results
[Image with generated caption: "Boy riding on horse"]
Q & A