Label propagation - Semisupervised Learning with Applications to NLP
Label Propagation
David Przybilla ([email protected])
Seminar: Semi-supervised and Unsupervised Learning with Applications to NLP
Outline
● What is Label Propagation
● The Algorithm
● The motivation behind the algorithm
● Parameters of Label Propagation
● Relation Extraction with Label Propagation
Label Propagation
● Semi-supervised
● Shows good results when the amount of annotated data is small relative to what supervised methods need
● Similar to kNN
K-Nearest Neighbors (kNN)
● Shares similar ideas with Label Propagation
● Unlike kNN, Label Propagation (LP) also uses the unlabeled instances during the process of finding the labels
Idea of the Problem
We want to find a function f such that:
● L = set of labeled instances, U = set of unlabeled instances
● Unlabeled instances that are near each other should receive similar labels
The Model
● A complete graph
● Each node is an instance
● Each arc (x, y) has a weight T_xy
● T_xy is high if nodes x and y are similar
The Model
● Inside a node: soft labels (a probability distribution over the possible labels)
Variables - Model
● T is an n×n matrix holding all the weights of the graph
● N_1 ... N_l = labeled data, N_{l+1} ... N_n = unlabeled data
● T splits into four blocks:

  T = | T_ll  T_lu |
      | T_ul  T_uu |
Variables - Model
● Y is an n×k matrix holding the soft probabilities of each instance for each of the possible labels
● Y[N_a, R_b] is the probability of instance N_a being labeled R_b

The problem to solve
● R_1, R_2 ... R_k are the possible labels
● N_1, N_2 ... N_n are the instances to label
● Y splits into Y_L (rows of the labeled data) and Y_U (rows of the unlabeled data); we must find Y_U
Algorithm
● Y will change in each iteration
How to Measure T?
● Weights are based on the Euclidean distance between instances
● A common choice is a Gaussian kernel over the distance measure: T_xy = exp(-d(x, y)² / σ²)
● σ is an important parameter (ignore it at the moment; we will talk about it later)
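As a concrete illustration of the weighting step, here is a minimal NumPy sketch (the Gaussian-kernel form is a common choice; the function name is ours):

```python
import numpy as np

def weight_matrix(X, sigma):
    """Build the weight matrix T from Euclidean distances.

    T[x, y] = exp(-d(x, y)^2 / sigma^2), so T_xy is high
    for similar (nearby) instances.
    """
    # Pairwise squared Euclidean distances between all rows of X.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
T = weight_matrix(X, sigma=1.0)
# Nearby points get a weight close to 1, distant points close to 0.
```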
How to Initialize Y?
● How do we correctly set the values of Y⁰?
● Fill in the known values (of the labeled data)
● How to fill in the values of the unlabeled data? → The initialization of these values can be arbitrary
Propagation Step
● Transform T into T̄ (row normalization)
● Update Y during each iteration:

  Y⁰ → Y¹ → ... → Yᵏ

● During the process Y will change
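The propagation loop above can be sketched in a few lines of NumPy. This is a minimal sketch, assuming the labeled instances occupy the first rows and a fixed iteration count instead of a convergence test:

```python
import numpy as np

def label_propagation(T, Y_labeled, n_labeled, n_iter=1000):
    """Iterative label propagation (sketch).

    T          : (n, n) weight matrix
    Y_labeled  : (n_labeled, k) one-hot labels of the labeled instances
    Returns the soft labels Y for all n instances.
    """
    # Row-normalize T into T-bar so each row sums to 1.
    T_bar = T / T.sum(axis=1, keepdims=True)
    n, k = T.shape[0], Y_labeled.shape[1]
    Y = np.zeros((n, k))
    Y[:n_labeled] = Y_labeled          # fill in the known values
    # The initialization of the unlabeled rows is arbitrary.
    for _ in range(n_iter):
        Y = T_bar @ Y                  # propagation step
        Y[:n_labeled] = Y_labeled      # clamp the labeled data
    return Y
```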
Convergence
● Partition the normalized matrix and Y:

  T̄ = | T̄_ll  T̄_lu |        Y = | Y_L |
      | T̄_ul  T̄_uu |            | Y_U |

● Y_L is clamped: it is reset to the known labels during each iteration
● So during the iteration only Y_U changes:

  Y_U¹ = T̄_uu Y_U⁰ + T̄_ul Y_L
  Y_U² = T̄_uu (T̄_uu Y_U⁰ + T̄_ul Y_L) + T̄_ul Y_L
  ...

Convergence
● Doing it n times leads to:

  Y_Uⁿ = T̄_uuⁿ Y_U⁰ + (Σ_{i=0}^{n-1} T̄_uuⁱ) T̄_ul Y_L

● Since T̄ is row-normalized and T̄_uu is a submatrix of T̄, the term T̄_uuⁿ converges to zero, so the arbitrary initialization Y_U⁰ does not matter

After convergence
● Assuming we iterate infinitely many times, one can find Y_U by solving:

  Y_U = (I − T̄_uu)⁻¹ T̄_ul Y_L
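The fixed point above can be computed directly, without iterating. A minimal sketch, again assuming the labeled instances occupy the first rows:

```python
import numpy as np

def lp_closed_form(T, Y_L):
    """Closed-form fixed point of label propagation.

    After convergence:  Y_U = (I - T_bar_uu)^(-1) T_bar_ul Y_L
    """
    n_l = Y_L.shape[0]
    T_bar = T / T.sum(axis=1, keepdims=True)   # row normalization
    T_uu = T_bar[n_l:, n_l:]
    T_ul = T_bar[n_l:, :n_l]
    I = np.eye(T_uu.shape[0])
    # Solve (I - T_uu) Y_U = T_ul Y_L instead of inverting explicitly.
    return np.linalg.solve(I - T_uu, T_ul @ Y_L)
```

Solving the linear system is numerically preferable to forming the inverse.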
Optimization Problem
● f : V → ℝ should minimize the energy function:

  E(f) = ½ Σ_{i,j} w_ij (f(i) − f(j))²

● w_ij is the similarity between i and j
● f(i) and f(j) should be similar for a high w_ij in order to minimize the energy

The Graph Laplacian
● Let D be a diagonal matrix where D_ii = Σ_j T̄_ij
● Rows of T̄ are normalized, so D = I
● The graph Laplacian is defined as:

  Δ = D − T̄ = I − T̄

● Then we can use the graph Laplacian to act on f, since the energy function can be rewritten in terms of Δ:

  E(f) = fᵀ Δ f

Back to the Optimization Problem
● Energy can be rewritten using the Laplacian; f should minimize this energy function
● Δ can be rewritten in terms of T̄:

  Δ_uu = D_uu − T̄_uu = I − T̄_uu
  Δ_ul = D_ul − T̄_ul = −T̄_ul

● Minimizing the energy gives:

  f_u = (I − T̄_uu)⁻¹ T̄_ul f_l

● The algorithm converges to the minimization of the energy function
Sigma Parameter
● Remember the σ parameter?
● It strongly influences the behavior of LP
● There can be:
  ● just one σ for the whole feature vector
  ● one σ per dimension
Sigma Parameter
● What happens as σ tends to:
  – 0:
    ● The label of an unknown instance is given by just the nearest labeled instance
  – Infinity:
    ● All the unlabeled instances receive the same influence from all labeled instances. The soft probabilities of each unlabeled instance are then given by the class frequencies in the labeled data
● There are heuristics for finding an appropriate value of σ
Sigma Parameter - MST
● Build a minimum spanning tree over the instances
● Find the minimum-weight arc connecting two components with different labels (e.g. Label1 and Label2)
● Set:

  σ = min weight(arc) / 3
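The MST heuristic can be sketched with Kruskal's algorithm: process arcs in increasing weight order and stop at the first one that would merge two components containing different known labels. A minimal sketch (function name and `None`-marks-unlabeled convention are ours):

```python
import numpy as np

def sigma_from_mst(X, labels):
    """MST heuristic for sigma (sketch).

    labels[i] is the known label of instance i, or None if unlabeled.
    Returns w / 3, where w is the weight of the minimum arc that
    connects two components holding different known labels.
    """
    n = len(X)
    parent = list(range(n))

    def find(i):                      # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Each component remembers the set of known labels it contains.
    comp_labels = [{labels[i]} - {None} for i in range(n)]

    edges = sorted(
        (float(np.linalg.norm(np.asarray(X[i]) - np.asarray(X[j]))), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    for w, i, j in edges:             # Kruskal: increasing weight order
        ri, rj = find(i), find(j)
        if ri == rj:
            continue
        if comp_labels[ri] and comp_labels[rj] and comp_labels[ri] != comp_labels[rj]:
            return w / 3.0            # minimum arc joining differently labeled components
        parent[ri] = rj
        comp_labels[rj] |= comp_labels[ri]
    return None                       # fewer than two distinct labels present
```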
Sigma Parameter – Learning it
How to learn sigma?
● Assumption: a good sigma will do classification with confidence, and thus minimize entropy
How to do it?
● Smooth the transition matrix T
● Find the derivative of H (the entropy) w.r.t. sigma
When to do it?
● When using a sigma for each dimension: it can be used to determine irrelevant dimensions
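To make the minimize-entropy assumption concrete, here is a sketch that scores candidate sigmas by the entropy of the resulting soft labels. It uses a simple grid search rather than the gradient of H described above, and assumes labeled instances occupy the first rows:

```python
import numpy as np

def entropy_of_lp(X, Y_L, sigma):
    """Run LP (closed form) for a given sigma and return the
    entropy of the resulting soft labels Y_U."""
    n_l = Y_L.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    T = np.exp(-d2 / sigma ** 2)                 # Gaussian weights
    T_bar = T / T.sum(axis=1, keepdims=True)     # row normalization
    T_uu, T_ul = T_bar[n_l:, n_l:], T_bar[n_l:, :n_l]
    Y_U = np.linalg.solve(np.eye(len(T_uu)) - T_uu, T_ul @ Y_L)
    Y_U = np.clip(Y_U / Y_U.sum(axis=1, keepdims=True), 1e-12, 1.0)
    return float(-(Y_U * np.log(Y_U)).sum())

def pick_sigma(X, Y_L, grid):
    """A good sigma classifies with confidence, i.e. minimizes entropy."""
    return min(grid, key=lambda s: entropy_of_lp(X, Y_L, s))
```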
Labeling Approach
● Once Y_U is computed, how do we assign labels to the instances?
● Take the most likely class
● Class mass normalization
● Label bidding
Labeling Approach
● Take the most likely class
  ● Simply look at each row of Y_U and choose the label with the highest probability
  ● Problem: no control over the proportion of classes
Labeling Approach
● Class mass normalization
  ● Given some class proportions P_1, P_2 ... P_k
  ● Scale each column C of Y_U to match its proportion P_c
  ● Then simply look at each row of Y_U and choose the label with the highest probability
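A minimal sketch of class mass normalization, assuming the desired proportions are given:

```python
import numpy as np

def class_mass_normalize(Y_U, proportions):
    """Scale column c of Y_U so that its total mass matches the
    desired class proportion P_c, then pick the argmax per row."""
    mass = Y_U.sum(axis=0)                       # current mass per class
    scaled = Y_U * (np.asarray(proportions) / mass)
    return scaled.argmax(axis=1)
```

Compared with a plain argmax, this can flip instances toward under-represented classes.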
Labeling Approach
● Label bidding
  ● Given some class proportions P_1, P_2 ... P_k
  1. Estimate the number of items C_k per label
  2. Choose the label with the greatest number of items C_k; take the items whose probability of being the current label is highest, and assign them the currently selected label
  3. Iterate through all the possible labels
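The bidding steps above can be sketched greedily. This is our reading of the procedure, with C_k estimated by rounding P_k·n (rounding leftovers simply stay unassigned in this sketch):

```python
import numpy as np

def label_bidding(Y_U, proportions):
    """Label bidding (sketch): each label k gets a budget of
    C_k = round(P_k * n) instances; labels bid in decreasing budget
    order, taking the unassigned instances most likely to be theirs."""
    n, k = Y_U.shape
    counts = [int(round(p * n)) for p in proportions]   # items per label
    assigned = np.full(n, -1)                           # -1 = unassigned
    for label in sorted(range(k), key=lambda c: -counts[c]):
        free = np.flatnonzero(assigned == -1)
        # Free instances ranked by probability of the current label.
        order = free[np.argsort(-Y_U[free, label])]
        for i in order[:counts[label]]:
            assigned[i] = label
    return assigned
```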
Experiment Setup
● Artificial data
  ● Comparison: LP vs kNN (k=1)
● Character recognition
  ● Recognize handwritten digits
  ● Images 16×16 pixels, gray scale
  ● Recognizing the digits 1, 2, 3
  ● 256-dimensional vectors
Results using LP on artificial data
● LP finds the structure in the data while kNN fails
P1NN
● P1NN is a baseline for comparisons
● A simplified version of LP:
  1. During each iteration, find the unlabeled instance nearest to a labeled instance and label it
  2. Iterate until all instances are labeled
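The two P1NN steps can be sketched directly (a deliberately naive O(n³) loop, fine for a baseline):

```python
import numpy as np

def p1nn(X, labels):
    """P1NN baseline: repeatedly label the unlabeled instance closest
    to an already-labeled instance with that instance's label."""
    X = np.asarray(X, dtype=float)
    labels = list(labels)                      # None marks unlabeled
    d = np.sqrt(((X[:, None] - X[None, :]) ** 2).sum(-1))
    while any(l is None for l in labels):
        best = None                            # (distance, unlabeled i, labeled j)
        for i, li in enumerate(labels):
            if li is not None:
                continue
            for j, lj in enumerate(labels):
                if lj is None:
                    continue
                if best is None or d[i, j] < best[0]:
                    best = (d[i, j], i, j)
        _, i, j = best
        labels[i] = labels[j]                  # copy the nearest known label
    return labels
```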
Results using LP on the handwritten digits dataset
● P1NN (baseline), 1NN (kNN)
● Cne: class mass normalization, proportions from labeled data
● Lbo: label bidding with oracle class proportions
● ML: most likely labels
Relation Extraction?
● From natural language texts, detect semantic relations among entities

Example: "B. Gates married Melinda French on January 1, 1994"
→ spouse(B. Gates, Melinda French)
Why LP to do RE?
Problems:
● Supervised: needs a lot of annotated data
● Unsupervised: retrieves clusters of relations with no label
RE - Problem Definition
● Find an appropriate label for an occurrence of two entities in a context

Example: ... B. Gates married Melinda French on January 1, 1994 ...

● Idea: if two occurrences of entity pairs have similar contexts, then they have the same relation type
● Entity 1 (e1), Entity 2 (e2)
● Contexts: before e1 (Cpre), between the entities (Cmid), after e2 (Cpost)
RE Problem Definition - Features
● Words: in the contexts
● Entity types: Person, Location, Org, ...
● POS tags: of words in the contexts
● Chunking tags: mark which words in the contexts are inside chunks
● Grammatical function of words in the contexts, e.g. NP-SBJ (subject)
● Position of words:
  ● first word of e1, second word of e1, ...
  ● is there any word in Cmid
  ● first word in Cpre, Cmid, Cpost, ...
  ● second word in Cpre, ...
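Extracting the three context regions around an entity pair can be sketched as follows (the function name, span convention, and `window` cutoff for Cpre/Cpost are our assumptions):

```python
def context_windows(tokens, e1_span, e2_span, window=2):
    """Split a tokenized sentence into the context regions around an
    entity pair: Cpre (before e1), Cmid (between the entities) and
    Cpost (after e2). Spans are (start, end) token indices,
    end-exclusive; `window` limits how far Cpre/Cpost reach.
    """
    (s1, e1), (s2, e2) = e1_span, e2_span
    return {
        "Cpre": tokens[max(0, s1 - window):s1],
        "Cmid": tokens[e1:s2],
        "Cpost": tokens[e2:e2 + window],
    }

tokens = "B. Gates married Melinda French on January 1 , 1994".split()
ctx = context_windows(tokens, e1_span=(0, 2), e2_span=(3, 5))
# ctx["Cmid"] holds the words between the two entities.
```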
RE problem Definition - Labels
Experiment
● ACE 2003 data: a corpus from newspapers
● Assume all entities have already been identified
● Comparison between:
  – different amounts of labeled samples: 1%, 10%, 25%, 50%, 75%, 100%
  – different similarity functions
  – LP, SVM and bootstrapping
● LP:
  ● Similarity function: cosine, Jensen-Shannon
  ● Labeling approach: take the most likely class
  ● Sigma: average similarity between labeled classes
Jensen-Shannon Similarity Measure
● Measures the distance between two probability distributions
● JS is a smoothing of the Kullback-Leibler divergence D_KL
● D_KL (Kullback-Leibler divergence):
  – is not symmetric
  – does not always have a finite value
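The smoothing is what fixes both KL problems: each distribution is compared against the average m = (p + q) / 2, which is symmetric and never zero where p or q has mass. A minimal sketch:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q).
    Not symmetric, and infinite when q is 0 where p is not."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0                     # 0 * log(0) is taken as 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jensen_shannon(p, q):
    """JS divergence: a smoothed, symmetric version of KL that
    compares each distribution against the average m = (p + q) / 2."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = (p + q) / 2.0
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```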
Results
Classifying relation subtypes: SVM vs LP
● SVM with a linear kernel
Bootstrapping
● Start from a set of seeds
● Train a classifier on them
● Update the set of seeds with instances whose confidence is high enough, and repeat

Classifying relation types: Bootstrapping vs LP
● Starting with 100 random seeds
Results
● LP performs well in general when there is little annotated data, in comparison to SVM and kNN
● Irrelevant dimensions can be identified by using LP
● Looking at the structure of the unlabeled data helps when there is little annotated data
Thank you