
Semi-Supervised Learning & Summary

Advanced Statistical Methods in NLP, Ling 572

March 8, 2012


Roadmap

Semi-supervised learning:
Motivation & perspective
Yarowsky's model
Co-training

Summary


Semi-supervised Learning


Motivation

Supervised learning:
Works really well, but needs lots of labeled training data

Unsupervised learning:
No labeled data required, but may not work well and may not learn the desired distinctions
E.g., unsupervised parsing techniques fit the data but do not correspond to linguistic intuition


Solution

Semi-supervised learning:

General idea:
Use a small amount of labeled training data
Augment with a large amount of unlabeled training data
Use information in the unlabeled data to improve models

Many different semi-supervised machine learners:
Variants of supervised techniques: semi-supervised SVMs, CRFs, etc.
Bootstrapping approaches: Yarowsky's method, self-training, co-training


Biological Example: There are more kinds of plants and animals in the rainforests than anywhere else on Earth. Over half of the millions of known species of plants and animals live in the rainforest. Many are found nowhere else. There are even plants and animals in the rainforest that we have not yet discovered.

Industrial Example: The Paulus company was founded in 1938. Since those days the product range has been the subject of constant expansions and is brought up continuously to correspond with the state of the art. We're engineering, manufacturing and commissioning world-wide ready-to-run plants packed with our comprehensive know-how. Our Product Range includes pneumatic conveying systems for carbon, carbide, sand, lime and many others. We use reagent injection in molten metal for the…

Label the First Use of "Plant"


Word Sense Disambiguation

An application of lexical semantics

Goal: Given a word in context, identify the appropriate sense
E.g., "plants and animals in the rainforest"

Crucial for real syntactic & semantic analysis: the correct sense can determine
Available syntactic structure
Available thematic roles, correct meaning, ...


Disambiguation Features

Key: What are the features?

Part of speech: of the word and its neighbors
Morphologically simplified form
Words in the neighborhood
Question: How big a neighborhood? Is there a single optimal size? Why?
(Possibly shallow) syntactic analysis: e.g., predicate-argument relations, modification, phrases
Collocation vs. co-occurrence features:
Collocation: words in a specific relation (e.g., predicate-argument), or 1 word to the left/right
Co-occurrence: bag of words
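To make the collocation vs. co-occurrence distinction concrete, here is a minimal sketch of both feature types for a single target-word occurrence; the function names, the sentinel symbols, and the window size are illustrative choices, not from the slides.

```python
from collections import Counter

def collocational_features(tokens, pos_tags, i):
    """Features in a specific positional relation to the target word at index i."""
    feats = {}
    feats["pos_0"] = pos_tags[i]                                        # POS of the target
    feats["w_-1"] = tokens[i - 1] if i > 0 else "<s>"                   # word immediately left
    feats["w_+1"] = tokens[i + 1] if i + 1 < len(tokens) else "</s>"    # word immediately right
    feats["pos_-1"] = pos_tags[i - 1] if i > 0 else "<s>"
    feats["pos_+1"] = pos_tags[i + 1] if i + 1 < len(tokens) else "</s>"
    return feats

def cooccurrence_features(tokens, i, window=10):
    """Unordered bag of words within +/- window tokens of the target."""
    left = tokens[max(0, i - window):i]
    right = tokens[i + 1:i + 1 + window]
    return Counter(w.lower() for w in left + right)

# Example: disambiguating "plants" in a short context
tokens = "plants and animals live in the rainforest".split()
pos    = ["NNS", "CC", "NNS", "VBP", "IN", "DT", "NN"]
print(collocational_features(tokens, pos, 0))
print(cooccurrence_features(tokens, 0, window=5))
```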


WSD Evaluation

Ideally, end-to-end evaluation with the WSD component:
Demonstrates the real impact of the technique in a system
Difficult, expensive, and still application-specific

Typically, intrinsic, sense-based evaluation:
Accuracy, precision, recall
SENSEVAL/SEMEVAL: all-words and lexical-sample tasks

Baseline: most frequent sense

Topline: human inter-rater agreement: 75-80% on fine-grained senses; ~90% on coarse-grained senses
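As a small, concrete version of the intrinsic setup, the sketch below implements the most-frequent-sense baseline and a plain accuracy score; the toy sense labels and function names are illustrative assumptions.

```python
from collections import Counter

def most_frequent_sense_baseline(train_labels):
    """Return the sense seen most often in the labeled training data."""
    return Counter(train_labels).most_common(1)[0][0]

def accuracy(gold, predicted):
    """Fraction of test instances whose predicted sense matches the gold sense."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)

# Toy example for the two senses of "plant"
train_labels = ["plant_living", "plant_living", "plant_factory"]
gold         = ["plant_living", "plant_factory", "plant_living"]
mfs = most_frequent_sense_baseline(train_labels)     # -> "plant_living"
print(accuracy(gold, [mfs] * len(gold)))              # -> 0.666...
```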


Minimally Supervised WSD

Yarowsky's algorithm (1995)

Bootstrapping approach: use a small labeled seed set to iteratively train

Builds on 2 key insights:

One sense per discourse:
A word appearing multiple times in a text has the same sense
Corpus of 37,232 "bass" instances: always a single sense

One sense per collocation:
Local phrases select a single sense
"fish" -> bass1 (the fish sense)
"play" -> bass2 (the music sense)


Yarowsky's Algorithm: Training Decision Lists

1. Pick seed instances & tag them
2. Find collocations: word to the left, word to the right, word within ±k words
   (A) Calculate the informativeness of each collocation on the tagged set; order the rules
   (B) Tag new instances with the rules
   (C) Apply one sense per discourse
   (D) If instances remain unlabeled, go to 2
3. Apply one sense per discourse

Disambiguation: first rule matched
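The informativeness measure referenced in step (A) is not reproduced in this transcript; Yarowsky (1995) orders decision-list rules by the log-likelihood ratio of the two sense probabilities given the collocation, log(P(sense1 | collocation) / P(sense2 | collocation)), estimated with smoothing. The sketch below illustrates that scoring plus one bootstrapping round under those assumptions; the data structures and toy collocations are mine, not from the lecture.

```python
import math
from collections import defaultdict

def score_rules(tagged, alpha=0.1):
    """Rank collocation rules by a smoothed log-likelihood ratio between two senses.
    `tagged` is a list of (set_of_collocations, sense) pairs, sense in {"s1", "s2"}."""
    counts = defaultdict(lambda: {"s1": 0, "s2": 0})
    for collocations, sense in tagged:
        for c in collocations:
            counts[c][sense] += 1
    rules = []
    for c, n in counts.items():
        p1 = (n["s1"] + alpha) / (n["s1"] + n["s2"] + 2 * alpha)
        llr = math.log(p1 / (1 - p1))
        rules.append((abs(llr), c, "s1" if llr > 0 else "s2"))
    return sorted(rules, reverse=True)                 # most informative rule first

def classify(collocations, rules):
    """Decision-list disambiguation: the first (highest-ranked) matching rule wins."""
    for _, c, sense in rules:
        if c in collocations:
            return sense
    return None                                        # stays unlabeled this round

# One bootstrapping iteration: seed set -> ordered rules -> label more instances
seed = [({"w+1=fish"}, "s1"), ({"w-1=play"}, "s2")]
rules = score_rules(seed)
unlabeled = [{"w+1=fish", "w-1=caught"}, {"w-1=play", "w+1=guitar"}, {"w-1=river"}]
newly_labeled = [(x, classify(x, rules)) for x in unlabeled if classify(x, rules) is not None]
print(newly_labeled)   # the first two instances get tagged; the third stays unlabeled
```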


Yarowsky Decision List


Iterative Updating


Biological and Industrial "plant" examples (repeated from earlier): label the first use of "plant".


Sense Choice with Collocational Decision Lists

Create an initial decision list, with rules ordered by informativeness

Check nearby word groups (collocations):
Biology: "animal" within ±2-10 words
Industry: "manufacturing" within ±2-10 words

Result: correct selection; 95% on pair-wise (two-sense) tasks


Self-Training

Basic approach:
Start off with a small labeled training set
Train a supervised classifier on the training set
Apply the new classifier to the residual unlabeled training data
Add the 'best' newly labeled examples to the labeled training set
Iterate


Self-Training

Simple, right? The devil is in the details: which instances are 'best' to add?

Highest confidence?
Probably accurate, but probably adds little new information to the classifier

Most different?
Probably adds information, but may not be accurate

Compromise: use the most different of the highly confident instances
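A minimal self-training loop along these lines is sketched below; the scikit-learn-style classifier, the confidence threshold, the per-iteration batch size, and the dissimilarity heuristic are all illustrative design choices rather than anything prescribed in the lecture.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def self_train(X_lab, y_lab, X_unlab, conf_threshold=0.9, per_iter=20, max_iter=10):
    """Self-training sketch: repeatedly add confidently self-labeled examples,
    preferring the ones least similar to what is already in the labeled set."""
    y_lab = list(y_lab)
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        clf = MultinomialNB().fit(X_lab, y_lab)
        probs = clf.predict_proba(X_unlab)
        confident = np.where(probs.max(axis=1) >= conf_threshold)[0]
        if len(confident) == 0:
            break
        # crude "most different" heuristic: low overlap with the average labeled vector
        similarity = X_unlab[confident] @ X_lab.mean(axis=0)
        chosen = confident[np.argsort(similarity)[:per_iter]]
        X_lab = np.vstack([X_lab, X_unlab[chosen]])
        y_lab += list(clf.predict(X_unlab[chosen]))      # self-assigned labels
        X_unlab = np.delete(X_unlab, chosen, axis=0)
    return MultinomialNB().fit(X_lab, y_lab)
```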


Co-Training

Blum & Mitchell, 1998

Basic intuition: "Two heads are better than one"

Ensemble classifier:
Uses results from multiple classifiers

Multi-view classifier:
Uses different views of the data, i.e., different feature subsets
Ideally, the views should be:
Conditionally independent (given the label)
Individually sufficient: each view carries enough information to learn from


Co-training Set-up

Create two views of the data:
Typically, partition the feature set by type
E.g., predicting speech emphasis:
View 1: acoustics: loudness, pitch, duration
View 2: lexicon, syntax, context

Some approaches instead use learners of different types

In practice, the views may not truly be conditionally independent, but co-training often works pretty well anyway


Co-training Approach

Create a small labeled training data set

Train two (supervised) classifiers on the current training data, using different views

Use the two classifiers to label the residual unlabeled instances

Select the 'best' newly labeled data to add to the training data, adding instances labeled by C1 to the training data for C2, and vice versa

Iterate
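Here is a minimal sketch of that loop, assuming two scikit-learn-style classifiers over pre-split feature views; the batch size, labeling policy, and stopping rule are illustrative choices (exactly the 'devilish details' raised a couple of slides below), not prescribed by Blum & Mitchell.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_train(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, per_view=5, max_iter=20):
    """Co-training sketch: each view's classifier labels confident examples
    that then also train the classifier for the other view."""
    y_lab = list(y_lab)
    for _ in range(max_iter):
        if len(X1_unlab) == 0:
            break
        c1 = LogisticRegression(max_iter=1000).fit(X1_lab, y_lab)   # view 1 (e.g. acoustics)
        c2 = LogisticRegression(max_iter=1000).fit(X2_lab, y_lab)   # view 2 (e.g. lexicon/syntax)
        conf1 = c1.predict_proba(X1_unlab).max(axis=1)
        conf2 = c2.predict_proba(X2_unlab).max(axis=1)
        # each classifier nominates and labels its most confident unlabeled instances
        pick1 = np.argsort(conf1)[-per_view:]
        pick2 = np.argsort(conf2)[-per_view:]
        chosen = {int(i): lab for i, lab in zip(pick1, c1.predict(X1_unlab[pick1]))}
        for i, lab in zip(pick2, c2.predict(X2_unlab[pick2])):
            chosen.setdefault(int(i), lab)              # keep C1's label if both picked it
        idx = np.array(sorted(chosen))
        # the newly labeled instances join the pool used to retrain both classifiers
        X1_lab = np.vstack([X1_lab, X1_unlab[idx]])
        X2_lab = np.vstack([X2_lab, X2_unlab[idx]])
        y_lab += [chosen[int(i)] for i in idx]
        X1_unlab = np.delete(X1_unlab, idx, axis=0)
        X2_unlab = np.delete(X2_unlab, idx, axis=0)
    c1 = LogisticRegression(max_iter=1000).fit(X1_lab, y_lab)
    c2 = LogisticRegression(max_iter=1000).fit(X2_lab, y_lab)
    return c1, c2
```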


Graphically

[Figure from Jeon & Liu '11]


More Devilish Details

Questions for co-training:

Which instances are 'best' to add to training? Most confident? Most different? Random? Many approaches combine these criteria.

How many instances to add per iteration? Threshold by count, or by confidence value?

How long to iterate? A fixed number of rounds? Until classifier confidence falls below a threshold? Etc.


Co-training Applications

Applied to many language-related tasks:

Blum & Mitchell's paper: academic home web page classification; 95% accuracy with 12 pages labeled and 788 classified

Sentiment analysis

Statistical parsing

Prominence recognition

Dialog classification


Learning Curves: Semi-supervised vs. Supervised

[Figure: accuracy (66-84%) vs. number of labelled examples (9, 12, 24, 50, 100, 300) for supervised and semi-supervised learners]


Semi-supervised Learning

An umbrella term for machine learning techniques that:
Use a small amount of labeled training data
Augmented with information from unlabeled data

Can be very effective:
Training on ~10 labeled samples can yield results comparable to training on thousands

Can be temperamental:
Sensitive to the data, learning algorithm, and design choices
Hard to predict the effects of the amount of labeled data, unlabeled data, etc.


Summary


Review

Introduction: entropy, cross-entropy, and mutual information

Classic machine learning algorithms: decision trees, kNN, Naïve Bayes

Discriminative machine learning algorithms: MaxEnt, CRFs, SVMs

Other models: TBL, EM, semi-supervised approaches


General Methods

Data organization: training, development, and test data splits

Cross-validation: parameter tuning, evaluation

Feature selection: wrapper methods, filtering, weighting

Beam search
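As a small illustration of the data-organization and cross-validation points above (the split ratios and fold count are illustrative, not from the course):

```python
import random

def train_dev_test_split(items, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle once, then carve out training / development / test portions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train, n_dev = int(ratios[0] * n), int(ratios[1] * n)
    return items[:n_train], items[n_train:n_train + n_dev], items[n_train + n_dev:]

def k_fold(items, k=5):
    """Yield (train, held-out) pairs for k-fold cross-validation, e.g. for parameter tuning."""
    items = list(items)
    for i in range(k):
        held_out = items[i::k]
        train = [x for j, x in enumerate(items) if j % k != i]
        yield train, held_out
```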


Tools, Data, & Tasks

Tools: Mallet, libSVM

Data: 20 Newsgroups (text classification), Penn Treebank (POS tagging)



Beyond 572

Ling 573:
'Capstone' project class: integrates material from the 57* classes
More 'real world': project teams, deliverables, repositories

Ling 575s:
Speech technology: Michael Tjalve (Th 4pm)
NLP on mobile devices: Scott Farrar (T 4pm)

Ling and other electives