Label propagation - Semisupervised Learning with Applications to NLP
Label Propagation
David Przybilla ([email protected])
Seminar: Semi-supervised and Unsupervised Learning with Applications to NLP
Outline
● What is Label Propagation
● The Algorithm
● The motivation behind the algorithm
● Parameters of Label Propagation
● Relation Extraction with Label Propagation
Label Propagation
● Semi-supervised
● Shows good results when the amount of annotated data is small relative to what supervised methods need
● Similar to kNN
K-Nearest Neighbors (kNN)
● Shares similar ideas with Label Propagation
● Unlike kNN, Label Propagation (LP) also uses the unlabeled instances during the process of finding the labels
Idea of the Problem
We want to find a function f such that:
● L = set of labeled instances, U = set of unlabeled instances
● Unlabeled instances that are near each other should receive similar labels
The Model
● A complete graph
● Each node is an instance
● Each arc (x, y) has a weight T_xy
● T_xy is high if nodes x and y are similar
The Model
● Inside a node: soft labels (a probability distribution over the possible labels)
Variables - Model
● T is an n×n matrix holding all the weights of the graph
● N_1 ... N_l = labeled data, N_{l+1} ... N_n = unlabeled data
● T splits into four blocks:

  T = | T_ll  T_lu |
      | T_ul  T_uu |
Variables - Model
● Y is an n×k matrix holding the soft probabilities of each instance for each of the possible labels
● Y[N_a, R_b] is the probability of instance N_a being labeled R_b

The problem to solve
● R_1, R_2 ... R_k are the possible labels
● N_1, N_2 ... N_n are the instances to label
● Y splits into Y_L (rows of the labeled data) and Y_U (rows of the unlabeled data); we must find Y_U
Algorithm
● Y will change in each iteration
How to Measure T?
● Weights are based on the Euclidean distance between instances
● A common choice is a Gaussian kernel over the distance measure: T_xy = exp(-d(x, y)² / σ²)
● σ is an important parameter (ignore it at the moment; we will talk about it later)
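As a concrete illustration of the weighting step, here is a minimal NumPy sketch (the Gaussian-kernel form is a common choice; the function name is ours):

```python
import numpy as np

def weight_matrix(X, sigma):
    """Build the weight matrix T from Euclidean distances.

    T[x, y] = exp(-d(x, y)^2 / sigma^2), so T_xy is high
    for similar (nearby) instances.
    """
    # Pairwise squared Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
T = weight_matrix(X, sigma=1.0)
# Nearby points get a weight close to 1, distant points close to 0.
```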
How to Initialize Y?
● How do we correctly set the values of Y⁰?
● Fill in the known values (of the labeled data)
● How to fill in the values of the unlabeled data? → The initialization of these values can be arbitrary
Propagation Step
● Transform T into T̄ (row normalization)
● Update Y during each iteration:

  Y⁰ → Y¹ → ... → Yᵏ

● During the process Y will change
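The propagation loop above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the labeled instances occupy the first rows and a fixed iteration count instead of a convergence test:

```python
import numpy as np

def label_propagation(T, Y_labeled, n_labeled, n_iter=1000):
    """Iterative label propagation (sketch).

    T          : (n, n) weight matrix
    Y_labeled  : (n_labeled, k) one-hot labels of the labeled instances
    Returns the soft labels Y for all n instances.
    """
    # Row-normalize T into T-bar so each row sums to 1.
    T_bar = T / T.sum(axis=1, keepdims=True)
    n, k = T.shape[0], Y_labeled.shape[1]
    Y = np.zeros((n, k))
    Y[:n_labeled] = Y_labeled          # fill in the known values
    # The initialization of the unlabeled rows is arbitrary.
    for _ in range(n_iter):
        Y = T_bar @ Y                  # propagation step
        Y[:n_labeled] = Y_labeled      # clamp the labeled data
    return Y
```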
Convergence
● Partition the normalized matrix and Y:

  T̄ = | T̄_ll  T̄_lu |        Y = | Y_L |
      | T̄_ul  T̄_uu |            | Y_U |

● Y_L is clamped: it is reset to the known labels during each iteration
● So during the iteration only Y_U changes:

  Y_U¹ = T̄_uu Y_U⁰ + T̄_ul Y_L
  Y_U² = T̄_uu (T̄_uu Y_U⁰ + T̄_ul Y_L) + T̄_ul Y_L
  ...

Convergence
● Doing it n times leads to:

  Y_Uⁿ = T̄_uuⁿ Y_U⁰ + (Σ_{i=0}^{n-1} T̄_uuⁱ) T̄_ul Y_L

● Since T̄ is row-normalized and T̄_uu is a submatrix of T̄, the term T̄_uuⁿ converges to zero, so the arbitrary initialization Y_U⁰ does not matter

After convergence
● Assuming we iterate infinitely many times, one can find Y_U by solving:

  Y_U = (I − T̄_uu)⁻¹ T̄_ul Y_L
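The fixed point above can be computed directly, without iterating. A minimal sketch, again assuming the labeled instances occupy the first rows:

```python
import numpy as np

def lp_closed_form(T, Y_L):
    """Closed-form fixed point of label propagation.

    After convergence:  Y_U = (I - T_bar_uu)^(-1) T_bar_ul Y_L
    """
    n_l = Y_L.shape[0]
    T_bar = T / T.sum(axis=1, keepdims=True)   # row normalization
    T_uu = T_bar[n_l:, n_l:]
    T_ul = T_bar[n_l:, :n_l]
    I = np.eye(T_uu.shape[0])
    # Solve (I - T_uu) Y_U = T_ul Y_L instead of inverting explicitly.
    return np.linalg.solve(I - T_uu, T_ul @ Y_L)
```

Solving the linear system is numerically preferable to forming the inverse.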
Optimization Problem
● f : V → ℝ should minimize the energy function:

  E(f) = ½ Σ_{i,j} w_ij (f(i) − f(j))²

● w_ij is the similarity between i and j
● f(i) and f(j) should be similar for a high w_ij in order to minimize the energy

The Graph Laplacian
● Let D be a diagonal matrix where D_ii = Σ_j T̄_ij
● Rows of T̄ are normalized, so D = I
● The graph Laplacian is defined as:

  Δ = D − T̄ = I − T̄

● Then we can use the graph Laplacian to act on f, since the energy function can be rewritten in terms of Δ:

  E(f) = fᵀ Δ f

Back to the Optimization Problem
● Energy can be rewritten using the Laplacian; f should minimize this energy function
● Δ can be rewritten in terms of T̄:

  Δ_uu = D_uu − T̄_uu = I − T̄_uu
  Δ_ul = D_ul − T̄_ul = −T̄_ul

● Minimizing the energy gives:

  f_u = (I − T̄_uu)⁻¹ T̄_ul f_l

● The algorithm converges to the minimization of the energy function
Sigma Parameter
● Remember the σ parameter?
● It strongly influences the behavior of LP
● There can be:
  ● just one σ for the whole feature vector
  ● one σ per dimension
Sigma Parameter
● What happens as σ tends to:
  – 0:
    ● The label of an unknown instance is given by just the nearest labeled instance
  – Infinity:
    ● All the unlabeled instances receive the same influence from all labeled instances. The soft probabilities of each unlabeled instance are then given by the class frequencies in the labeled data
● There are heuristics for finding an appropriate value of σ
Sigma Parameter - MST
● Build a minimum spanning tree over the instances
● Find the minimum-weight arc connecting two components with different labels (e.g. Label1 and Label2)
● Set:

  σ = min weight(arc) / 3
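The MST heuristic can be sketched with Kruskal's algorithm: process arcs in increasing weight order and stop at the first one that would merge two components containing different known labels. A minimal sketch (function name and `None`-marks-unlabeled convention are ours):

```python
import numpy as np

def sigma_from_mst(X, labels):
    """MST heuristic for sigma (sketch).

    labels[i] is the known label of instance i, or None if unlabeled.
    Returns w / 3, where w is the weight of the minimum arc that
    connects two components holding different known labels.
    """
    n = len(X)
    parent = list(range(n))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each component remembers the set of known labels it contains.
    comp_labels = [{labels[i]} - {None} for i in range(n)]

    edges = sorted(
        (float(np.linalg.norm(np.asarray(X[i]) - np.asarray(X[j]))), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    for w, i, j in edges:             # Kruskal: increasing weight order
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        if comp_labels[ri] and comp_labels[rj] and comp_labels[ri] != comp_labels[rj]:
            return w / 3.0            # minimum arc joining differently labeled components
        parent[ri] = rj
        comp_labels[rj] |= comp_labels[ri]
    return None                       # fewer than two distinct labels present
```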
Sigma Parameter – Learning it
How to learn sigma?
● Assumption: a good sigma will do classification with confidence, and thus minimize entropy
How to do it?
● Smooth the transition matrix T
● Find the derivative of H (the entropy) w.r.t. sigma
When to do it?
● When using a sigma for each dimension: it can be used to determine irrelevant dimensions
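To make the minimize-entropy assumption concrete, here is a sketch that scores candidate sigmas by the entropy of the resulting soft labels. It uses a simple grid search rather than the gradient of H described above, and assumes labeled instances occupy the first rows:

```python
import numpy as np

def entropy_of_lp(X, Y_L, sigma):
    """Run LP (closed form) for a given sigma and return the
    entropy of the resulting soft labels Y_U."""
    n_l = Y_L.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    T = np.exp(-d2 / sigma ** 2)                 # Gaussian weights
    T_bar = T / T.sum(axis=1, keepdims=True)     # row normalization
    T_uu, T_ul = T_bar[n_l:, n_l:], T_bar[n_l:, :n_l]
    Y_U = np.linalg.solve(np.eye(len(T_uu)) - T_uu, T_ul @ Y_L)
    Y_U = np.clip(Y_U / Y_U.sum(axis=1, keepdims=True), 1e-12, 1.0)
    return float(-(Y_U * np.log(Y_U)).sum())

def pick_sigma(X, Y_L, grid):
    """A good sigma classifies with confidence, i.e. minimizes entropy."""
    return min(grid, key=lambda s: entropy_of_lp(X, Y_L, s))
```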
Labeling Approach
● Once Y_U is computed, how do we assign labels to the instances?
● Take the most likely class
● Class mass normalization
● Label bidding
Labeling Approach
● Take the most likely class
  ● Simply look at each row of Y_U and choose the label with the highest probability
  ● Problem: no control over the proportion of classes
Labeling Approach
● Class mass normalization
  ● Given some class proportions P_1, P_2 ... P_k
  ● Scale each column C of Y_U to match its proportion P_c
  ● Then simply look at each row of Y_U and choose the label with the highest probability
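A minimal sketch of class mass normalization, assuming the desired proportions are given:

```python
import numpy as np

def class_mass_normalize(Y_U, proportions):
    """Scale column c of Y_U so that its total mass matches the
    desired class proportion P_c, then pick the argmax per row."""
    mass = Y_U.sum(axis=0)                       # current mass per class
    scaled = Y_U * (np.asarray(proportions) / mass)
    return scaled.argmax(axis=1)
```

Compared with a plain argmax, this can flip instances toward under-represented classes.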
Labeling Approach
● Label bidding
  ● Given some class proportions P_1, P_2 ... P_k
  1. Estimate the number of items C_k per label
  2. Choose the label with the greatest number of items C_k; take the items whose probability of being the current label is highest, and assign them the currently selected label
  3. Iterate through all the possible labels
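The bidding steps above can be sketched greedily. This is our reading of the procedure, with C_k estimated by rounding P_k·n (rounding leftovers simply stay unassigned in this sketch):

```python
import numpy as np

def label_bidding(Y_U, proportions):
    """Label bidding (sketch): each label k gets a budget of
    C_k = round(P_k * n) instances; labels bid in decreasing budget
    order, taking the unassigned instances most likely to be theirs."""
    n, k = Y_U.shape
    counts = [int(round(p * n)) for p in proportions]   # items per label
    assigned = np.full(n, -1)                           # -1 = unassigned
    for label in sorted(range(k), key=lambda c: -counts[c]):
        free = np.flatnonzero(assigned == -1)
        # Free instances ranked by probability of the current label.
        order = free[np.argsort(-Y_U[free, label])]
        for i in order[:counts[label]]:
            assigned[i] = label
    return assigned
```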
Experiment Setup
● Artificial data
  ● Comparison: LP vs kNN (k=1)
● Character recognition
  ● Recognize handwritten digits
  ● Images 16×16 pixels, gray scale
  ● Recognizing the digits 1, 2, 3
  ● 256-dimensional vectors
Results using LP on artificial data
● LP finds the structure in the data while kNN fails
P1NN
● P1NN is a baseline for comparisons
● A simplified version of LP:
  1. During each iteration, find the unlabeled instance nearest to a labeled instance and label it
  2. Iterate until all instances are labeled
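The two P1NN steps can be sketched directly (a deliberately naive O(n³) loop, fine for a baseline):

```python
import numpy as np

def p1nn(X, labels):
    """P1NN baseline: repeatedly label the unlabeled instance closest
    to an already-labeled instance with that instance's label."""
    X = np.asarray(X, dtype=float)
    labels = list(labels)                      # None marks unlabeled
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    while any(l is None for l in labels):
        best = None                            # (distance, unlabeled i, labeled j)
        for i, li in enumerate(labels):
            if li is not None:
                continue
            for j, lj in enumerate(labels):
                if lj is None:
                    continue
                if best is None or d[i, j] < best[0]:
                    best = (d[i, j], i, j)
        _, i, j = best
        labels[i] = labels[j]                  # copy the nearest known label
    return labels
```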
Results using LP on the handwritten digits dataset
● P1NN (baseline), 1NN (kNN)
● Cne: class mass normalization, proportions from labeled data
● Lbo: label bidding with oracle class proportions
● ML: most likely labels
Relation Extraction?
● From natural language texts, detect semantic relations among entities

Example: "B. Gates married Melinda French on January 1, 1994"
→ spouse(B. Gates, Melinda French)
Why LP to do RE?
Problems:
● Supervised: needs a lot of annotated data
● Unsupervised: retrieves clusters of relations with no label
RE - Problem Definition
● Find an appropriate label for an occurrence of two entities in a context

Example: ... B. Gates married Melinda French on January 1, 1994 ...

● Idea: if two occurrences of entity pairs have similar contexts, then they have the same relation type
● Entity 1 (e1), Entity 2 (e2)
● Contexts: before e1 (Cpre), between the entities (Cmid), after e2 (Cpost)
RE Problem Definition - Features
● Words: in the contexts
● Entity types: Person, Location, Org, ...
● POS tags: of words in the contexts
● Chunking tags: mark which words in the contexts are inside chunks
● Grammatical function of words in the contexts, e.g. NP-SBJ (subject)
● Position of words:
  ● first word of e1, second word of e1, ...
  ● is there any word in Cmid
  ● first word in Cpre, Cmid, Cpost, ...
  ● second word in Cpre, ...
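Extracting the three context regions around an entity pair can be sketched as follows (the function name, span convention, and `window` cutoff for Cpre/Cpost are our assumptions):

```python
def context_windows(tokens, e1_span, e2_span, window=2):
    """Split a tokenized sentence into the context regions around an
    entity pair: Cpre (before e1), Cmid (between the entities) and
    Cpost (after e2). Spans are (start, end) token indices,
    end-exclusive; `window` limits how far Cpre/Cpost reach.
    """
    (s1, e1), (s2, e2) = e1_span, e2_span
    return {
        "Cpre": tokens[max(0, s1 - window):s1],
        "Cmid": tokens[e1:s2],
        "Cpost": tokens[e2:e2 + window],
    }

tokens = "B. Gates married Melinda French on January 1 , 1994".split()
ctx = context_windows(tokens, e1_span=(0, 2), e2_span=(3, 5))
# ctx["Cmid"] holds the words between the two entities.
```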
RE problem Definition - Labels
Experiment
● ACE 2003 data: a corpus from newspapers
● Assume all entities have already been identified
● Comparison between:
  – different amounts of labeled samples: 1%, 10%, 25%, 50%, 75%, 100%
  – different similarity functions
  – LP, SVM and bootstrapping
● LP:
  ● Similarity function: cosine, Jensen-Shannon
  ● Labeling approach: take the most likely class
  ● Sigma: average similarity between labeled classes
Jensen-Shannon Similarity Measure
● Measures the distance between two probability distributions
● JS is a smoothing of the Kullback-Leibler divergence D_KL
● D_KL (Kullback-Leibler divergence):
  – is not symmetric
  – does not always have a finite value
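The smoothing is what fixes both KL problems: each distribution is compared against the average m = (p + q) / 2, which is symmetric and never zero where p or q has mass. A minimal sketch:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q).
    Not symmetric, and infinite when q is 0 where p is not."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                     # 0 * log(0) is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """JS divergence: a smoothed, symmetric version of KL that
    compares each distribution against the average m = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```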
Results
Classifying relation subtypes: SVM vs LP
● SVM with a linear kernel
Bootstrapping
● Start from a set of seeds
● Train a classifier on them
● Update the set of seeds with instances whose confidence is high enough, and repeat

Classifying relation types: Bootstrapping vs LP
● Starting with 100 random seeds
Results
● LP performs well in general when there is little annotated data, in comparison to SVM and kNN
● Irrelevant dimensions can be identified by using LP
● Looking at the structure of the unlabeled data helps when there is little annotated data
Thank you