SEMI-SUPERVISED CLASSIFICATION FOR
NATURAL LANGUAGE PROCESSING
Rushdi Shams
Department of Computer Science
University of Western Ontario, London, Canada.
PRESENTATION AT A GLANCE
• Semi-supervised learning
  – Problems solved by semi-supervised learning
  – Types
  – How it works
  – When it works
• Semi-supervised learning for NLP
  – Parsing
  – Text classification
  – Summarization
  – Biomedical tasks
• Conclusions
SEMI-SUPERVISED LEARNING
• Traditional supervised classifiers use only labeled data for their training
  – Labeled data are expensive, difficult to obtain, and time-consuming to produce
• Real-life problems have large amounts of unlabeled data
• Semi-supervised learning combines unlabeled data with labeled data
SEMI-SUPERVISED LEARNING PROBLEMS
(1) Learn from the labeled data
(2) Apply the learned model to the unlabeled data to label them
(3) If confident in the labeling, learn from (1) and (2)
(4) Apply the learned model to unseen unlabeled data

Steps (1)–(2) constitute transductive learning; steps (3)–(4) constitute inductive learning.
SEMI-SUPERVISED LEARNING PROBLEMS
• Transductive learning is like a take-home exam
  – How good is the model when applied to unlabeled data after being trained on labeled data?
• Inductive learning is like an in-class exam
  – How good is the model when applied to unseen data after being trained on labeled + unlabeled data?
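In symbols (a standard textbook formulation, not taken from the slides), the two settings differ in what the learner must output:

```latex
% Transductive: predict labels only for the specific unlabeled points
% x_{l+1}, ..., x_{l+u} that were available during training.
\text{Transductive: } \{(x_i, y_i)\}_{i=1}^{l} \cup \{x_j\}_{j=l+1}^{l+u}
  \;\longmapsto\; \hat{y}_{l+1}, \dots, \hat{y}_{l+u}

% Inductive: return a function that generalizes to any unseen x.
\text{Inductive: } \{(x_i, y_i)\}_{i=1}^{l} \cup \{x_j\}_{j=l+1}^{l+u}
  \;\longmapsto\; f : \mathcal{X} \to \mathcal{Y}
```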
SCOPE OF SEMI-SUPERVISED LEARNING
• Like traditional learning methods, semi-supervised learning can be used for
  – Classification
  – Regression, and
  – Clustering
HOW DOES SEMI-SUPERVISED CLASSIFICATION WORK?
TYPES OF SEMI-SUPERVISED LEARNING
• Generative Learning
• Discriminative Learning

How to use generative and discriminative learning:
• Self-Training
• Co-Training
• Active Learning
GENERATIVE VS DISCRIMINATIVE MODELS
[Figure: the data (x, y) as modeled by Discriminative Models vs. Generative Models]
GENERATIVE VS DISCRIMINATIVE MODELS
• Imagine your task is to identify the language of a speech sample
• You can do it by:
  1. Learning each language and then classifying the speech using the knowledge you just gained
  2. Determining the differences between the linguistic models without learning the languages, and then classifying the speech
• (1) is how generative models work, and (2) is how discriminative models work
GENERATIVE VS DISCRIMINATIVE MODELS
• Discriminative models predict the label y directly from the training example x, i.e., they model P(y|x)
• Using Bayes' Theorem, we get*

  P(y|x) = P(x|y) P(y) / P(x) ∝ P(x|y) P(y)

• This is the equation we use in generative models
* P(x) can be ignored since we are interested in finding argmax over y of P(y|x)
GENERATIVE VS DISCRIMINATIVE MODELS
Discriminative models
  – Model the conditional probability, to determine class boundaries
  – Examples: Transductive SVM, graph-based methods
  – Limitation: cannot be used without considering P(x)
Generative models
  – Model the joint probability P(x,y); for any given y, we can generate its x
  – Examples: EM algorithm, self-learning
  – Limitation: difficult because the estimates of P(x|y) are inadequate
GENERATIVE VS DISCRIMINATIVE MODELS
• The joint probability is modeled with a probability density function
• For a Gaussian distribution, it is a function of the mean vector and the covariance matrix
• The mean vector and covariance matrix can be tuned to maximize the likelihood
• Use Maximum Likelihood Estimation (MLE) to find them
• Finally, the tuning can be optimized using the EM algorithm (a sketch follows below)
• Different algorithms use different techniques according to the distribution of the data
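As an illustration of the MLE-then-EM recipe above, here is a minimal sketch (my own, not from the slides; it assumes 1-D features and two Gaussian classes, and the function name is hypothetical). Labeled data give the initial MLE fit, and EM then refines the parameters using the unlabeled points:

```python
import numpy as np

def semi_supervised_gaussian_em(X_lab, y_lab, X_unl, n_iter=50):
    """Fit a two-class 1-D Gaussian mixture with labeled + unlabeled data.
    MLE on the labeled data gives the starting point; EM then refines
    the means, variances, and class priors using the unlabeled points."""
    params = {}
    for c in (0, 1):                      # initial MLE from labeled data only
        Xc = X_lab[y_lab == c]
        params[c] = (Xc.mean(), Xc.var() + 1e-6, len(Xc) / len(X_lab))
    X_all = np.concatenate([X_lab, X_unl])
    for _ in range(n_iter):
        # E-step: responsibilities of each class for the unlabeled points
        dens = np.array([
            p * np.exp(-(X_unl - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
            for mu, var, p in (params[0], params[1])
        ])
        resp = dens / dens.sum(axis=0)    # shape: (2, n_unlabeled)
        # M-step: refit each Gaussian on labeled points (weight 1)
        # plus unlabeled points weighted by their responsibilities
        for c in (0, 1):
            w = np.concatenate([(y_lab == c).astype(float), resp[c]])
            mu = np.average(X_all, weights=w)
            var = np.average((X_all - mu) ** 2, weights=w) + 1e-6
            params[c] = (mu, var, w.sum() / len(X_all))
    return params
```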
IS THERE A FREE LUNCH?
• "Unlabeled data are abundant, therefore semi-supervised learning is a good idea"
  – Not always!
• To succeed, one needs to spend a reasonable amount of effort designing good models / features / kernels / similarity functions
IS THERE A FREE LUNCH?
• Success requires matching the problem structure with the model assumption
• P(x) is associated with the label prediction of an unlabeled data point x
• Algorithms like the Transductive SVM (TSVM) assume that the decision boundary should avoid regions with high P(x)
IS THERE A FREE LUNCH?
• If the data come from highly overlapped Gaussians, the TSVM decision boundary would go right through the densest region
  – Expectation-Maximization (EM) performs better in this case
• Another example: a Hidden Markov Model with unlabeled data also does not work!
SELF-TRAINING
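The figure on this slide is not reproduced in the transcript; the loop it depicts is the standard one: train on the labeled data, label the unlabeled pool, keep the most confident predictions, and retrain. A minimal sketch (my own illustration; scikit-learn's LogisticRegression is just a stand-in base classifier):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unl, k=10, max_rounds=20):
    """Self-training: repeatedly move the k most confidently predicted
    unlabeled examples (with their predicted labels) into the training set."""
    clf = LogisticRegression(max_iter=1000)
    while max_rounds > 0 and len(X_unl) > 0:
        clf.fit(X_lab, y_lab)
        proba = clf.predict_proba(X_unl)
        top = np.argsort(proba.max(axis=1))[-k:]   # most confident predictions
        X_lab = np.vstack([X_lab, X_unl[top]])
        y_lab = np.concatenate([y_lab, clf.classes_[proba[top].argmax(axis=1)]])
        X_unl = np.delete(X_unl, top, axis=0)      # shrink the unlabeled pool
        max_rounds -= 1
    return clf.fit(X_lab, y_lab)                   # final retrain
```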
CO-TRAINING
• Given labeled data L and unlabeled data U
• Create two labeled datasets L1 and L2 from L using views 1 and 2
CO-TRAINING
• Learn classifier f1 using L1 and classifier f2 using L2
• Apply f1 and f2 to the unlabeled data pool U to predict labels
• Predictions are made only using each classifier's own set (view) of features
• Add the K most confident predictions of f1 to L2
• Add the K most confident predictions of f2 to L1
CO-TRAINING
• Remove these examples from the unlabeled pool
• Re-train f1 using L1 and f2 using L2
• It is like self-training, but with two classifiers teaching each other
• Finally, use voting or averaging to make predictions on the test data (a sketch of the full loop follows below)
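A minimal sketch of the loop described on the last three slides (my own illustration; Naive Bayes is a stand-in for the per-view classifiers, and the variable names are hypothetical):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y, U1, U2, k=5, rounds=10):
    """Co-training: f1 sees only view 1, f2 only view 2; each round,
    each classifier labels its k most confident pool examples for the
    other, and both are retrained on their grown labeled sets."""
    L1_X, L1_y = X1.copy(), y.copy()       # labeled set for f1 (view 1)
    L2_X, L2_y = X2.copy(), y.copy()       # labeled set for f2 (view 2)
    f1, f2 = GaussianNB(), GaussianNB()
    for _ in range(rounds):
        f1.fit(L1_X, L1_y)
        f2.fit(L2_X, L2_y)
        if len(U1) == 0:
            break
        p1, p2 = f1.predict_proba(U1), f2.predict_proba(U2)
        top1 = np.argsort(p1.max(axis=1))[-k:]     # f1's K most confident
        top2 = np.argsort(p2.max(axis=1))[-k:]     # f2's K most confident
        # f1 teaches f2, and f2 teaches f1 (each using only its own view)
        L2_X = np.vstack([L2_X, U2[top1]])
        L2_y = np.concatenate([L2_y, f1.classes_[p1[top1].argmax(axis=1)]])
        L1_X = np.vstack([L1_X, U1[top2]])
        L1_y = np.concatenate([L1_y, f2.classes_[p2[top2].argmax(axis=1)]])
        drop = np.union1d(top1, top2)              # remove from the shared pool
        U1, U2 = np.delete(U1, drop, axis=0), np.delete(U2, drop, axis=0)
    return f1, f2   # at test time, combine their predictions by voting/averaging
```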
CO-TRAINING: CAVEATS
1. Each view alone is sufficient to make good classifications, given enough labeled data.
2. The two algorithms perform well, given enough labeled data.
ACTIVE LEARNING
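As with the previous slide, the figure is not in the transcript. The distinguishing step of active learning is the query: instead of keeping its most confident predictions (as in self-training), the learner asks a human annotator to label the examples it is least sure about. A minimal sketch of uncertainty sampling (my own illustration):

```python
import numpy as np

def select_queries(clf, X_unl, n=5):
    """Uncertainty sampling: return the indices of the n unlabeled
    examples the classifier is least confident about, to be sent to
    a human annotator (the 'oracle') for labeling."""
    proba = clf.predict_proba(X_unl)        # any fitted probabilistic classifier
    uncertainty = 1.0 - proba.max(axis=1)   # least-confidence measure
    return np.argsort(uncertainty)[-n:]
```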
WHICH METHOD SHOULD I USE?
• There is no direct answer!
  – Ideally, one should use a method whose assumptions fit the problem structure
• Do the classes produce well-clustered data?
  – If yes, then use EM
• Do the features naturally split into two sets?
  – If yes, then use co-training
• Is it true that two points with similar features tend to be in the same class?
  – If yes, then use graph-based methods
WHICH METHOD SHOULD I USE?
• Already using SVM?
  – If yes, then TSVM is a natural extension!
• Is the existing supervised classifier complicated and hard to modify?
  – If yes, then use self-training
SEMI-SUPERVISED CLASSIFICATION FOR NLP
• Parsing
• Text classification
• Summarization
• Biomedical tasks
EFFECTIVE SELF-TRAINING FOR PARSING
David McClosky, Eugene Charniak, and Mark Johnson
Brown University
Proceedings of HLT-NAACL, 2006
INTRODUCTION
• Self-trained a two-phase parser-reranker system with readily available data
• The self-trained model gained a 1.1% improvement in F-score over the previous best result
  – The reported F-score is 92.1%
METHODS
• The Charniak parser is used for initial parsing
• It produces the 50 best parses
• A MaxEnt re-ranker is used to re-rank the parses
  – It exploits over a million features
DATASETS
• Penn Treebank sections 2–21 for training
  – ~40k WSJ sentences
• Penn Treebank section 23 for testing
• Penn Treebank section 24 for held-out validation
• Unlabeled data were collected from the North American News Text Corpus (NANC)
  – 24 million sentences from LA Times articles
RESULTS
• The authors experimented with and without the re-ranker as they added unlabeled sentences
  – With the re-ranker, the parser performs well
• The improvement is about 1.1% in F-score
  – The self-trained parser contributes 0.8%, and
  – The re-ranker contributes 0.3%
LIMITATIONS
• The work did not restrict the training data to the more accurately parsed sentences.
• Speed is similar to the Charniak parser, but a little more memory is required.
• The unlabeled data come from one domain (LA Times) while the labeled data come from a different domain (WSJ); whether this domain difference affects self-training remains an open question.
SEMI-SUPERVISED SPAM FILTERING: DOES IT WORK?
Mona Mojdeh and Gordon V. Cormack
University of Waterloo
Proceedings of SIGIR 2008
INTRODUCTION
• "Semi-supervised learning methods work well for spam filtering when the source of the available labeled examples differs from that of the emails to be classified" [2006 ECML/PKDD challenge]
• The authors reproduced the work and found the opposite result
BACKGROUND
• ECML/PKDD Challenge
  – Delayed Feedback (sketched below):
    • The filters are trained on emails T1
    • Then they classify some test emails t1
    • Then they are trained again on the emails T1 + t1
    • This continues for the entire dataset
    • Best (1-AUC) is 0.01%
  – Cross-user Train:
    • Train on one set of emails and test on a different set of emails
    • The emails are extracted from the same dataset
    • Best (1-AUC) is 0.1%
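A sketch of the delayed-feedback loop as described above (my own illustration; `filter_` stands for any scikit-learn-style classifier, and the batch layout is hypothetical):

```python
import numpy as np

def delayed_feedback_eval(filter_, batches):
    """Delayed feedback: classify each batch *before* its labels are
    revealed, then fold the batch into the training set and continue."""
    scores = []
    X_train, y_train = batches[0]                   # initial labeled batch T1
    for X_batch, y_batch in batches[1:]:
        filter_.fit(X_train, y_train)               # train on what is labeled so far
        scores.append(filter_.predict_proba(X_batch))  # classify the unseen batch
        X_train = np.vstack([X_train, X_batch])     # labels are now revealed
        y_train = np.concatenate([y_train, y_batch])
    return scores                                   # used to compute (1 - AUC)
```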
BACKGROUND
• Best performing filters:
  – SVM and Transductive SVM (TSVM)
  – Dynamic Markov Compression (DMC)
  – Logistic regression with self-training
BACKGROUND
• TREC Spam Track Challenge
  – Filters are trained with publicly available emails
  – Filters are then tested on emails collected from user inboxes
METHODS AND MATERIALS
• TREC 2007 dataset
  – Delayed Feedback:
    • The first 10,000 messages for training
    • The next 60,000 messages divided into six batches (each containing 10,000 messages)
    • The last 5,000 messages for testing
  – Cross-user Train:
    • 30,338 messages from particular user inboxes for training
    • 45,081 messages from other users for evaluation
RESULTS: DELAYED FEEDBACK VS CROSS-USER
[Figure: side-by-side results for the Delayed Feedback and Cross-User settings]
RESULTS: CROSS-CORPUS
• The first 10,000 messages from the TREC 2005 corpus
• The TREC 2007 corpus split into 10,000-message segments
EXTRACTIVE SUMMARIZATION USING SUPERVISED AND SEMI-SUPERVISED LEARNING
Kam-Fai Wong, Mingli Wu, and Wenjie Li*
The Chinese University of Hong Kong
*The Hong Kong Polytechnic University
Proceedings of Coling 2008
INTRODUCTION
• Used co-training to combine labeled and unlabeled data
• Demonstrated that extractive summaries produced by co-training are comparable to summaries produced by supervised methods and by humans
METHOD
• The authors used four kinds of features:
  1. Surface
  2. Relevance
  3. Event, and
  4. Content
• Supervised setup
  – Support Vector Machine
• Co-training setup
  – Probabilistic SVM (PSVM)
  – Naive Bayes
DATASETS
• The DUC-2001 dataset was used
• It contains 30 clusters of documents
  – Each cluster contains documents on a particular topic
• 308 documents in total
• For each cluster, human summaries are provided
  – 50-, 100-, 200- and 400-word summaries
• For each document, human summaries are provided, too
  – 100-word summaries
RESULTS: FEATURE SELECTION
• ROUGE-1, ROUGE-2 and ROUGE-L scores were used as evaluation measures (a sketch of ROUGE-1 follows below)
• The human-summary ROUGE-1 score was 0.422
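For reference, ROUGE-1 at its simplest is unigram recall against a human reference. A minimal sketch (my own simplification, not the paper's implementation; the official ROUGE toolkit also handles stemming, stopwords, and multiple references):

```python
from collections import Counter

def rouge_1_recall(candidate, reference):
    """ROUGE-1 recall: the fraction of the reference summary's unigrams
    that also appear in the candidate summary (counts are clipped)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(cand[w], ref[w]) for w in ref)
    return overlap / max(sum(ref.values()), 1)
```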
RESULTS: EFFECT OF UNLABELED DATA
• More labeled data produced better F-scores
RESULTS: SUPERVISED VS SEMI-SUPERVISED
RESULTS: EFFECT OF SUMMARY LENGTH
LIMITATIONS
• Co-training is done on the same feature space
  – This violates the primary hypothesis of co-training
• The strength of the features was determined using PSVM only
  – We have no knowledge of how supervised Naive Bayes performs on these features
SEMI-SUPERVISED CLASSIFICATION FOR EXTRACTING PROTEIN INTERACTION SENTENCES USING DEPENDENCY PARSING
Gunes Erkan, Arzucan Ozgur, and Dragomir Radev
University of Michigan
Proceedings of EMNLP-CoNLL, 2007
INTRODUCTION
• Produces a dependency tree for each sentence
• Analyzes the paths between two protein names in the parse trees
• Using machine learning techniques, sentences are labeled according to these paths (gold standard)
• Given the paths, cosine similarity and edit distance are used to find interactions between the proteins (sketched below)
• The semi-supervised algorithms perform better than their supervised versions by a wide margin
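A minimal sketch of the two path-similarity measures named above (my own illustration; the paper defines them over dependency paths between protein mentions, reduced here to plain token lists):

```python
import numpy as np

def path_cosine(path_a, path_b):
    """Cosine similarity between two dependency paths, treated as bags
    of tokens (words and edge labels along the path)."""
    vocab = sorted(set(path_a) | set(path_b))
    va = np.array([path_a.count(t) for t in vocab], dtype=float)
    vb = np.array([path_b.count(t) for t in vocab], dtype=float)
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

def path_edit_distance(a, b):
    """Token-level Levenshtein distance between two dependency paths."""
    d = np.zeros((len(a) + 1, len(b) + 1), dtype=int)
    d[:, 0] = np.arange(len(a) + 1)
    d[0, :] = np.arange(len(b) + 1)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                              # deletion
                          d[i, j - 1] + 1,                              # insertion
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))     # substitution
    return int(d[len(a), len(b)])
```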
INTRODUCTION
• The first semi-supervised approach in the problem domain
• The first approach that utilizes information beyond syntactic parses
METHOD
• Four algorithms were used:
  1. Support Vector Machine (SVM)
  2. K-Nearest Neighbor (KNN)
  3. Transductive SVM (TSVM)
  4. Harmonic Functions
• The Stanford dependency parser is used to generate the parse trees
DATASETS
• Sentences from two datasets are annotated based on their dependency trees (using supervised techniques)
  – AIMED
  – Christine Brun (CB)
RESULTS: AIMED DATASET
RESULTS: CB DATASET
RESULTS: EFFECT OF TRAINING DATA SIZE (AIMED)
• With small training data, semi-supervised algorithms are better
• SVM performs poorly with less training data
RESULTS: EFFECT OF TRAINING DATA SIZE (CB)
• KNN performs the worst even with a lot of labeled data
• With larger training data, SVM performs comparably to the semi-supervised algorithms
LIMITATIONS
• The Transductive SVM is susceptible to the distribution of the labeled data
  – The distribution was not tested
• AIMED has a class imbalance problem
  – TSVM is affected by this problem
HOW MUCH UNLABELED DATA IS USED?
CONCLUSIONS
• Semi-supervised learning is an obvious success in domains like natural language text processing
• The success depends on
  – Matching the problem at hand with the model assumption
  – Careful observation of the distribution of the data
  – Careful selection of algorithms
CONCLUSIONS
• Apart from these fundamental conditions, to succeed with semi-supervised learning we need to examine the following:
  – The proportion of labeled to unlabeled data (no definite answer)
  – The effect of dependency among features (with fewer labeled examples, use fewer dependent features)
  – Noise in the labeled data (easier to handle) and in the unlabeled data (more difficult); overall, noise has less effect on semi-supervised learning
  – Differences between the domains of the labeled and unlabeled data (transfer learning or self-taught learning)