Overfitting & Transformation-Based Learning
CS 371: Spring 2012
Machine Learning
• Machines can learn from examples
  – Learning modifies the agent's decision mechanisms to improve performance
• Given training data, machines analyze the data and learn rules which generalize to new examples
  – Can be sub-symbolic (the rule may be a mathematical function)
  – Or it can be symbolic (rules are in a representation similar to the representation used for hand-coded rules)
• In general, machine learning approaches allow for more tuning to the needs of a corpus, and can be reused across corpora
Training data example
• Inductive learning. Empirical error function:
  E(h) = Σ_x distance[h(x; θ), f(x)]
• Empirical learning = finding h(x), or h(x; θ), that minimizes E(h) (see the sketch below)
• Note an implicit assumption:
  – For any set of attribute values there is a unique target value
  – This in effect assumes a "no-noise" mapping from inputs to targets
• This is often not true in practice (e.g., in medicine).
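A minimal sketch of the empirical error in Python, assuming a 0/1 distance; the toy data set and candidate hypothesis are invented for illustration:

```python
# E(h): count the (x, target) pairs where the hypothesis h disagrees
# with the target function f (0/1 loss as the "distance").

def empirical_error(h, data):
    """E(h) = sum over (x, target) pairs of distance[h(x), target]."""
    return sum(1 for x, target in data if h(x) != target)

# Toy training set: attribute vector -> boolean target
data = [((1, 0), True), ((0, 0), False), ((1, 1), True), ((0, 1), True)]

h = lambda x: x[0] == 1          # candidate hypothesis: "first attribute is 1"
print(empirical_error(h, data))  # 1 -- it misclassifies ((0, 1), True)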
Learning Boolean Functions
• Given examples of the function, can we learn the function?
• 2^(2^d) different Boolean functions can be defined on d attributes
  – This is the size of our hypothesis space
• Observations:
  – Huge hypothesis spaces → directly searching over all functions is impossible
  – Given small data (n pairs), our learning problem may be underconstrained (see the sketch below)
• Ockham's razor: if multiple candidate functions all explain the data equally well, pick the simplest explanation (least complex function)
• Constrain our search to classes of Boolean functions, e.g., decision trees
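A quick way to see the 2^(2^d) count, and the underconstrained case, for d = 2; the two observed pairs are invented for illustration:

```python
# For d attributes there are 2**d input rows, and a Boolean function is any
# assignment of 0/1 to those rows -- hence 2**(2**d) functions in total.
from itertools import product

d = 2
rows = list(product([0, 1], repeat=d))               # 2**d = 4 possible inputs
functions = list(product([0, 1], repeat=len(rows)))  # one output bit per row
print(len(functions))                                # 2**(2**2) = 16

# A small data set of n pairs constrains only n of the 4 rows, so many
# functions remain consistent -- the underconstrained case from the slide.
data = {(0, 0): 0, (1, 1): 1}                        # n = 2 observed pairs
consistent = [f for f in functions
              if all(f[rows.index(x)] == y for x, y in data.items())]
print(len(consistent))                               # 4 functions still fit
```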
Decision Tree Learning
• Constrain h(·) to be a decision tree
Pseudocode for Decision tree learning
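The pseudocode itself lives on the slide image; below is a hedged Python sketch of the standard recursive learner such pseudocode usually describes. The quality measure here is a simple misclassification count standing in for the measures discussed next (information gain etc.):

```python
# Recursive decision-tree learning sketch. Examples are (attribute-dict, label)
# pairs; the returned tree is either a label or {attribute: {value: subtree}}.
from collections import Counter

def majority(examples):
    return Counter(label for _, label in examples).most_common(1)[0][0]

def choose_best_attribute(examples, attributes):
    # Placeholder quality measure: pick the attribute whose split yields the
    # fewest misclassifications under a majority vote per branch.
    def split_errors(a):
        total = 0
        for value in {x[a] for x, _ in examples}:
            subset = [y for x, y in examples if x[a] == value]
            total += len(subset) - Counter(subset).most_common(1)[0][1]
        return total
    return min(attributes, key=split_errors)

def learn_tree(examples, attributes, default=None):
    if not examples:
        return default                    # no data: inherit parent's majority
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()               # pure node: return the class
    if not attributes:
        return majority(examples)         # no attributes left: majority vote
    best = choose_best_attribute(examples, attributes)
    tree = {best: {}}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, rest, majority(examples))
    return tree
```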
Major issues
Q1: Choosing the best attribute: what quality measure to use?
Q2: Handling training data with missing attribute values
Q3: Handling training data with noise, irrelevant attributes
  – Determining when to stop splitting: avoid overfitting
Major issues
Q1: Choosing the best attribute: different quality measures: information gain, gain ratio, … (see the sketch below)
Q2: Handling training data with missing attribute values: blank value, most common value, or fractional counts
Q3: Handling training data with noise, irrelevant attributes
  – Determining when to stop splitting: ????
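A minimal sketch of information gain, with an invented toy example where one attribute is perfectly informative and the other is useless:

```python
# gain = entropy(parent) - weighted average entropy of the children.
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    labels = [y for _, y in examples]
    gain = entropy(labels)
    for value in {x[attribute] for x, _ in examples}:
        subset = [y for x, y in examples if x[attribute] == value]
        gain -= len(subset) / len(examples) * entropy(subset)
    return gain

# Toy example: 'raining' perfectly predicts the label, 'weekend' does not.
examples = [({'raining': 1, 'weekend': 0}, 'stay'),
            ({'raining': 1, 'weekend': 1}, 'stay'),
            ({'raining': 0, 'weekend': 0}, 'go'),
            ({'raining': 0, 'weekend': 1}, 'go')]
print(information_gain(examples, 'raining'))  # 1.0 bit
print(information_gain(examples, 'weekend'))  # 0.0 bits
```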
Assessing Performance
Training data performance is typically optimistic, e.g., the error rate on training data
Reasons?
  – the classifier may not have enough data to fully learn the concept (but on training data we don't know this)
  – for noisy data, the classifier may overfit the training data
In practice we want to assess performance "out of sample": how well will the classifier do on new unseen data? This is the true test of what we have learned (just like a classroom)
With large data sets we can partition our data into 2 subsets, train and test
  – build a model on the training data
  – assess performance on the test data (see the sketch below)
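A minimal sketch of that protocol; `train_model` is a hypothetical stand-in for whatever learner is being assessed:

```python
# Hold out a random test subset, fit only on the rest, and report accuracy
# on the held-out data as the out-of-sample estimate.
import random

def train_test_split(data, test_fraction=0.2, seed=0):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)

# usage (train_model is hypothetical):
#   train, test = train_test_split(data)
#   model = train_model(train)      # fit only on the training subset
#   print(accuracy(model, test))    # out-of-sample estimate
```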
Example of Test Performance
Restaurant problem
  – simulate 100 data sets of different sizes
  – train on this data, and assess performance on an independent test set
  – learning curve = plot accuracy as a function of training set size
  – typical "diminishing returns" effect
Example
[Slides 11-16: a worked example presented as a sequence of figures]
How Overfitting affects Prediction
[Figure, built up over three slides: predictive error plotted against model complexity. Error on the training data keeps decreasing as complexity grows, while error on the test data falls and then rises again. The ideal range for model complexity lies between the underfitting region and the overfitting region.]
Training and Validation Data
[Figure: the full data set partitioned into training data and validation data]
Idea: train each model on the "training data", and then test each model's accuracy on the validation data
The v-fold Cross-Validation Method
• Why just choose one particular 90/10 "split" of the data?
  – In principle we could do this multiple times
• "v-fold cross-validation" (e.g., v = 10)
  – randomly partition our full data set into v disjoint subsets (each roughly of size n/v, where n = total number of training data points)
  – for i = 1:10 (here v = 10):
    • train on 90% of the data
    • Acc(i) = accuracy on the other 10%
  – Cross-Validation-Accuracy = (1/v) Σ_i Acc(i)
  – choose the method with the highest cross-validation accuracy (see the sketch below)
  – common values for v are 5 and 10
  – can also do "leave-one-out", where v = n
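A sketch of the procedure above; again `train_model` is a hypothetical stand-in, and n is assumed to be at least v:

```python
# v disjoint folds, each used once as the held-out 10% (for v = 10);
# the final score is the average of the v held-out accuracies.
import random

def cross_validation_accuracy(data, train_model, v=10, seed=0):
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::v] for i in range(v)]   # v disjoint subsets, ~n/v each
    accs = []
    for i in range(v):
        held_out = folds[i]
        train = [pair for j, fold in enumerate(folds) if j != i for pair in fold]
        model = train_model(train)               # train on the other v-1 folds
        accs.append(sum(model(x) == y for x, y in held_out) / len(held_out))
    return sum(accs) / v                         # (1/v) * sum_i Acc(i)
```

Setting v = len(data) gives the "leave-one-out" variant mentioned above.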
Disjoint Validation Data Sets
[Figure, built up over two slides: the full data set split into training data and validation data, where each fold holds out a different disjoint validation partition (1st partition, 2nd partition, …)]
More on Cross-Validation
• Notes
  – cross-validation generates an approximate estimate of how well the learned model will do on "unseen" data
  – by averaging over different partitions it is more robust than just a single train/validate partition of the data
  – "v-fold" cross-validation is a generalization: partition the data into v disjoint validation subsets of size n/v, then train, validate, and average over the v partitions (e.g., v = 10 is commonly used)
  – v-fold cross-validation is approximately v times more computationally expensive than just fitting a model to all of the data
Let's look at another symbolic learner …
Problem Domain: POS Tagging
● What is text tagging?
  – Some sort of markup, enabling understanding of language
  – Can be word tags:
    He will race/VERB the car.
    He will not race/VERB the truck.
    When will the race/NOUN end?
Why do we care?
● Sometimes, meaning changes a lot
  – Transcribed speech lacks clear punctuation:
    "I called, John and Mary are there." → I called John and Mary are there.
    (I called John) and (Mary are there.) ??
    I called ((John and Mary) are there.)
  – We can tell, but can a computer? Here, it needs to know about verb forms and collections
  – Can be important! "Quick! Wrap the bandage on the table around her leg!"
    Imagine a robotic medical assistant with this one . . .
Where is this used?
• Any natural language task!
  – Translators: word-by-word translation does not always work; sentences need re-arranging
  – It can help with OCR or voice transcription:
    "I need to writer. I'm a good write her." ("to writer"?? "a good write"?)
    → "I need to write her. I'm a good writer."
Some terms
● Corpus
  – Big body of text, annotated (expert-tagged) or not
● Dictionary
  – List of known words, and all their possible parts of speech
● Lexical/Morphological vs. Contextual
  – Is it a word property (spelling) or its surroundings (neighboring parts of speech)?
● Semantics vs. Syntax
  – Meaning (definition) vs. structure (phrases, parsing)
● Tokenizer
  – Separates text into words or other sized blocks (idioms, phrases . . . )
● Disambiguator
  – Extra pass to reduce the possible tags to a single one
Some problems we face
Classification challenges:
  – Large number of classes
    ● English POS: varying tagsets, 48 to 195 tags
  – Often ambiguous, varying with use/context
    ● POS: There must be a way to go there; I know a person from there – see that guy there? (pron., adv., n.)
  – Varying number of relevant features
    ● Spelling, position, surrounding words, paragraph position, article topic . . .
TBL: A Symbolic Learning Method
• A method called error-driven Transformation-Based Learning (TBL) (the Brill algorithm) can be used for symbolic learning
  – The rules (actually, a sequence of rules) are learned from an annotated corpus
  – Performs about as accurately as other statistical approaches
• Can have better treatment of context compared to HMMs (as we'll see)
  – rules which use the next (or previous) POS
    • HMMs just use P(T_i | T_{i-1}) or P(T_i | T_{i-2}, T_{i-1})
  – rules which use the previous (or next) word
    • HMMs just use P(W_i | T_i)
What does it do?
● Transformation-Based Error-Driven Learning:
  – First, a dictionary tags every word with its most common POS. So, "run" is tagged as a verb in both:
    "The run lasted 30 minutes" and "We run 3 miles every day"
  – Unknown capitalized words are assumed to be proper nouns, and remaining unknown words are assigned the most common tag for their three-letter ending.
    → "blahblahous" is probably an adjective.
  – Finally, the tags are updated by a set of "patches," of the form "Change tag a to b if:"
    – The word is in context C (e.g., the pattern of surrounding tags)
    – The word, or one in a region R, has lexical property P (e.g., capitalization)
Rule Templates
• Brill's method learns transformations which fit different templates
  – Template: Change tag X to tag Y when the previous word is W
    • Transformation: NN → VB when previous word = to
  – Template: Change tag X to tag Y when the previous tag is Z
    Ex: The can rusted.
    → The (determiner) can (modal verb) rusted (verb) . (.)
    – Transformation: Modal → Noun when previous tag = DET
    → The (determiner) can (noun) rusted (verb) . (.)
  – Template: Change tag X to tag Y when the previous 1st, 2nd, or 3rd word is W
    • Transformation: VBP → VB when one of the previous 3 words = has
• The learning process is guided by a small number of templates (e.g., 26) to learn specific rules from the corpus (see the sketch below)
• Note how these rules sort of match linguistic intuition
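A hedged sketch of applying one transformation under the "previous tag is Z" template, traced on the "can rusted" example above; the rule encoding is invented for illustration (Brill's tagger has its own format):

```python
# Apply one TBL transformation left to right, so earlier changes in the
# sentence are visible when later positions are examined.

def apply_rule(tags, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding tag is prev_tag."""
    out = tags[:]
    for i in range(1, len(out)):
        if out[i] == from_tag and out[i - 1] == prev_tag:
            out[i] = to_tag
    return out

words = ['The', 'can', 'rusted', '.']
tags  = ['DET', 'MD', 'VB', '.']             # initial guess: "can" as modal
tags  = apply_rule(tags, 'MD', 'NN', 'DET')  # Modal -> Noun after a determiner
print(list(zip(words, tags)))
# [('The', 'DET'), ('can', 'NN'), ('rusted', 'VB'), ('.', '.')]
```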
Brill Algorithm (Overview)
• Assume you are given a training corpus G (for "gold standard")
• First, create a tag-free version V of it … then do steps 1-4
• Notes:
  – As the algorithm proceeds, each successive rule covers fewer examples, but potentially more accurately
  – Some later rules may change tags changed by earlier rules
1. Initial-state annotator: label every word token in V with the most likely tag for that word type from G
2. Consider every possible transformational rule; select the one that leads to the most improvement in V, using G to measure the error (see the sketch after these steps)
3. Retag V based on this rule
4. Go back to 2, until there is no significant improvement in accuracy over the previous iteration
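A hedged sketch of this greedy loop, with candidate rules represented as plain functions from tag sequences to tag sequences; the real algorithm instantiates rules from its templates rather than taking a fixed list:

```python
# Greedy error-driven rule selection: on each pass, keep the rule with the
# largest net error reduction on V measured against the gold standard G.

def errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def brill_learn(initial_tags, gold, candidate_rules, min_gain=1):
    tags, learned = initial_tags[:], []
    while True:
        best_rule, best_tags = None, None
        best_err = errors(tags, gold) - min_gain   # must beat this to qualify
        for rule in candidate_rules:               # step 2: try every candidate
            new_tags = rule(tags)                  # rule: tag seq -> tag seq
            if errors(new_tags, gold) <= best_err:
                best_err = errors(new_tags, gold)
                best_rule, best_tags = rule, new_tags
        if best_rule is None:                      # step 4: improvement too small
            return learned
        tags = best_tags                           # step 3: retag V with the rule
        learned.append(best_rule)
```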
Error-driven method
• How does one learn the rules?
• The TBL method is error-driven
  – The rule which is learned on a given iteration is the one which reduces the error rate of the corpus the most, e.g.:
  – Rule 1 fixes 50 errors but introduces 25 more → net decrease is 25
  – Rule 2 fixes 45 errors but introduces 15 more → net decrease is 30
  → Choose rule 2 in this case
• We set a stopping criterion, or threshold: once we stop reducing the error rate by a big enough margin, learning is stopped
Example of Error Reduction
[Figure from Eric Brill (1995), Computational Linguistics, 21(4), p. 7]
Rule ordering
• One rule is learned with every pass through the corpus
  – The set of final rules is the final output
  – Unlike HMMs, such a representation allows a linguist to look through the rules and make more sense of them
• Thus, the rules are learned iteratively and must be applied in an iterative fashion
  – At one stage, it may make sense to change NN to VB after "to"
  – But at a later stage, it may make sense to change VB back to NN in the same context, e.g., if the current word is "school"
Example of Learned Rule Sequence
1. NN → VB PREVTAG TO
   – to/TO race/NN → VB
2. VBP → VB PREV1OR2OR3TAG MD
   – might/MD vanish/VBP → VB
3. NN → VB PREV1OR2TAG MD
   – might/MD not/RB reply/NN → VB
4. VB → NN PREV1OR2TAG DT
   – the/DT great/JJ feast/VB → NN
5. VBD → VBN PREV1OR2OR3TAG VBZ
   – He/PP was/VBZ killed/VBD → VBN by/IN Chapman/NNP
Insights on TBL
• TBL takes a long time to train, but is relatively fast at tagging once the rules are learned
• The rules in the sequence may be decomposed into non-interacting subsets, i.e., one can focus only on VB tagging (and need only look at the rules which affect it)
• In cases where the data is sparse, the initial guess needs to be weak enough to allow for learning
• Rules become increasingly specific as you go down the sequence
  – However, the more specific rules generally don't overfit because they cover just a few cases
Relation between DT and TBL
DT and TBL
DT is a subset of TBL:
  1. Label with S
  2. If X then S → A
  3. S → B
DT is a proper subset of TBL
• There exists a problem that can be solved by TBL but not by a DT, for a fixed set of primitive queries
• Ex: Given a sequence of characters
  – Classify a char based on its position:
    if pos % 4 == 0 then "yes" else "no"
  – Input attributes available: the previous two chars
• Transformation list (labels shown after each rule is applied; traced in the sketch below):
  – Label with S:
    A/S A/S A/S A/S A/S A/S A/S
  – If there is no previous character, then S → F:
    A/F A/S A/S A/S A/S A/S A/S
  – If the char two to the left is labeled with F, then S → F:
    A/F A/S A/F A/S A/F A/S A/F
  – If the char two to the left is labeled with F, then F → S:
    A/F A/S A/S A/S A/F A/S A/S
  – F → yes
  – S → no
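A simulation of that transformation list on seven characters. Note the rules are applied left to right, so a label changed at position i is already visible when position i+2 is examined; that visibility is exactly what a fixed-depth DT over the previous two chars lacks:

```python
chars = ['A'] * 7
labels = ['S'] * 7                         # 1. label everything with S
labels[0] = 'F'                            # 2. no previous char: S -> F
for i in range(len(labels)):               # 3. two to the left is F: S -> F
    if i >= 2 and labels[i - 2] == 'F' and labels[i] == 'S':
        labels[i] = 'F'
for i in range(len(labels)):               # 4. two to the left is F: F -> S
    if i >= 2 and labels[i - 2] == 'F' and labels[i] == 'F':
        labels[i] = 'S'
print(labels)                              # ['F','S','S','S','F','S','S']
print(['yes' if l == 'F' else 'no' for l in labels])
# 'yes' exactly at positions 0 and 4, i.e., where pos % 4 == 0
```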
DT and TBL
• TBL is more powerful than DT
• The extra power of TBL comes from:
  – Transformations are applied in sequence
  – Results of previous transformations are visible to following transformations
Brill Algorithm (More Detailed)
1. Label every word token with its most likely tag (based on lexical generation probabilities)
2. List the positions of tagging errors and their counts, by comparing with "truth" (T)
3. For each error position, consider each instantiation I of X, Y, and Z in the rule template
   – If Y = T, increment improvements[I], else increment errors[I]
4. Pick the I which results in the greatest error reduction, and add it to the output
   – VB → NN PREV1OR2TAG DT improves on 98 errors, but produces 18 new errors, so a net decrease of 80 errors
5. Apply that I to the corpus
6. Go to 2, unless the stopping criterion is reached

Worked example:
Most likely tag: P(NN | race) = .98, P(VB | race) = .02
  Is/VBZ expected/VBN to/TO race/NN tomorrow/NN
Rule template: Change a word from tag X to tag Y when the previous tag is Z
Rule instantiation for the above example: NN → VB PREV1OR2TAG TO
Applying this rule yields:
  Is/VBZ expected/VBN to/TO race/VB tomorrow/NN
Handling Unknown Words
• Can also use the Brill method to learn how to tag unknown words
• Instead of using surrounding words and tags, use affix info, capitalization, etc.
  – Guess NNP if capitalized, NN otherwise
  – Or use the tag most common for words ending in the last 3 letters
  – etc. (see the sketch below)
• TBL has also been applied to some parsing tasks
[Figure: example learned rule sequence for unknown words]
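A hedged sketch of these initial-guess heuristics; the suffix table values here are invented, and would really be counted from the training corpus:

```python
# Most common tag per word ending, as described above (values are made up).
suffix_tags = {'ous': 'JJ', 'ing': 'VBG', 'ly': 'RB'}

def guess_unknown(word, default='NN'):
    if word[0].isupper():
        return 'NNP'                 # capitalized: guess proper noun
    for n in (3, 2):                 # try the longest known ending first
        tag = suffix_tags.get(word[-n:])
        if tag:
            return tag
    return default                   # otherwise fall back to NN

print(guess_unknown('blahblahous'))  # JJ -- probably an adjective
print(guess_unknown('Chapman'))      # NNP
```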