A method for unsupervised broad-coverage lexical error
detection and correction
4th Workshop on Innovative Uses of NLP for Building Educational Applications
NAACL, June 5, 2009
Nai-Lung Tsao and David Wible
National Central University, Taiwan
The Research Context
IWiLL Online Writing Platform
www.iwillnow.org
• Since 2000, with the support of MOE & Taipei Bureau of Education, IWiLL has been used in Taiwan by:
- 455 schools
- 2,804 teachers
- 161,493 students and 22,791 independent learners
• Teachers have authored 9,429 web-based lessons with the system’s authoring tool.
• The learner corpus (English TLC) has archived:
- over 32,000 English essays
- 5 million words of machine-readable running text written by Taiwan’s learners using the IWiLL writing platform
- 100,000 tokens of teacher comments on these student texts
Second Language Learners’ Error Detection and Correction
• Lexical and Lexico-grammatical errors
- an open-ended class
- driving teachers crazy
- either no rules involved or rules of very limited productivity
Two components to our system
INPUT: user-produced string, e.g. ‘on my opinion’
1. Target Language Knowledgebase: hybrid n-grams extracted from BNC
2. Edit Distance Algorithm: compares user’s string & hybrid n-grams
OUTPUT: Error Detection/Correction
The Knowledgebase of Hybrid N-grams
1. Target Language Knowledgebase: hybrid n-grams extracted from BNC
What, Why, and How
What is a hybrid n-gram?
An n-gram that admits items of different levels
- Traditional n-gram: ‘in my opinion’
- Hybrid n-gram: ‘in [dps] opinion’
Why use hybrid n-grams?
- Traditional n-grams and error precision
- POS n-grams and recall
Error Detection
True positive: Enjoy to canoe > unattested > marked as error
False positive: Enjoy canoeing > unattested > marked as error
Based on attested strings like: enjoy hiking OR like watching
We could extract the POS gram: V + VVg
But this would accept: hope exploring
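The trade-off can be made concrete with a toy check. The sketch below is illustrative only: the attested sets, the POS lookup table and the function names are hand-written assumptions, not BNC data or the system's own code.

```python
# Toy data, assumed for illustration (not extracted from the BNC).
ATTESTED_WORD_BIGRAMS = {("enjoy", "hiking"), ("like", "watching")}
ATTESTED_POS_BIGRAMS = {("V", "VVg")}                      # POS gram: V + VVg
ATTESTED_HYBRID_BIGRAMS = {("enjoy", "VVg"), ("like", "VVg")}

# Hand-written tag lookup: rough POS for the first verb, detailed POS for the -ing form.
POS = {
    "enjoy": "V", "like": "V", "hope": "V",
    "hiking": "VVg", "watching": "VVg", "canoeing": "VVg", "exploring": "VVg",
}

def accepted_by_words(w1, w2):
    """Traditional n-gram check: the exact word pair must be attested."""
    return (w1, w2) in ATTESTED_WORD_BIGRAMS

def accepted_by_pos(w1, w2):
    """POS n-gram check: only the tag pair must be attested."""
    return (POS[w1], POS[w2]) in ATTESTED_POS_BIGRAMS

def accepted_by_hybrid(w1, w2):
    """Hybrid check: lexical first slot, POS second slot."""
    return (w1, POS[w2]) in ATTESTED_HYBRID_BIGRAMS

# Word n-grams reject a correct but unattested string (false positive).
word_check = accepted_by_words("enjoy", "canoeing")
# POS n-grams accept an erroneous string (missed error).
pos_check = accepted_by_pos("hope", "exploring")
# The hybrid n-gram accepts the first and rejects the second.
hybrid_checks = (accepted_by_hybrid("enjoy", "canoeing"),
                 accepted_by_hybrid("hope", "exploring"))
```

In this toy setting only the hybrid check gets both strings right, which is the motivation for mixing levels within one n-gram.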
How hybrid n-grams are extracted for the knowledgebase
4 categories of info for each item in an n-gram
Example: enjoyed hiking
word form:       enjoyed   hiking
lexeme:          enjoy     hike
[POS detailed]:  VVd       VVg
{POS rough}:     V         V
Potential Hybrid N-grams for a string
Some hybrid n-grams for enjoyed hiking:
- enjoyed + V
- enjoy + V
- enjoyed + VVg
- enjoy + VVg
- VVd + VVg
- enjoyed + hike
- enjoy + hike
- V + hiking
- etc.
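One way to enumerate the potential hybrid n-grams for a string is to take one of the four levels for each slot and form every combination. The sketch below assumes that reading; the tuple layout and the name `hybrid_ngrams` are illustrative, and whether the actual system keeps every combination (for example, POS-only ones) is not stated in the slides.

```python
from itertools import product

def hybrid_ngrams(tokens):
    """Return every n-gram that picks one level per slot.

    `tokens` is a list of 4-tuples:
    (word form, lexeme, detailed POS, rough POS).
    """
    # dict.fromkeys drops repeated values within a slot while keeping order
    slots = [list(dict.fromkeys(levels)) for levels in tokens]
    return [tuple(combo) for combo in product(*slots)]

# Levels for the example string 'enjoyed hiking'
enjoyed_hiking = [
    ("enjoyed", "enjoy", "VVd", "V"),
    ("hiking", "hike", "VVg", "V"),
]

set_c = hybrid_ngrams(enjoyed_hiking)
```

With four distinct levels in each of two slots, `set_c` holds 4 x 4 = 16 combinations, including `("enjoy", "VVg")` and `("V", "hiking")`.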
Edit Distance Component
Steps in measuring edit distance
1. Generate all hybrid n-grams from the learner input string (Set C)
2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string (Set S)
   b. Prune Set S using filter factor or coverage
   c. Eliminate n-grams under frequency threshold
3. Rank candidates by weighted edit distance between members of C and S
We limit edit distance to ‘substitution’, so we limit search to n-grams of the same length as the learner’s input string.
Edit Distance Component: Step 1
1. Generate all hybrid n-grams from the learner input string (Set C)
Input from learner: enjoyed hiking
Set C = hybrid n-grams generated from learner string:
- enjoyed + V
- enjoy + V
- enjoyed + VVg
- enjoy + VVg
- VVd + VVg
- enjoyed + hike
- enjoy + hike
- V + hiking
- etc.
Edit Distance Component: Step 2a
2. a. Find all hybrid n-grams from the target language knowledgebase derivable from content words in the learner input string. (Set S)
Content words in the learner string enjoyed hiking: enjoy, hike
Target Knowledgebase Hybrid N-grams > Set S
Set C = hybrid n-grams generated from learner string (enjoyed + V, enjoy + V, enjoyed + VVg, enjoy + VVg, VVd + VVg, enjoyed + hike, enjoy + hike, V + hiking, etc.)
Edit Distance Component: Step 2a, lookup by content word
Target Knowledgebase Hybrid N-grams > Set S
Anchored on enjoy; other slot: hike, VVg, V, hiking
Anchored on hike; other slot: enjoy, VVd, V, enjoyed
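The lookup in step 2a could be sketched as an inverted index from items to knowledgebase n-grams, filtered to the length of the learner string. Everything concrete here is an assumption made for illustration: the toy knowledgebase entries and counts, and the names `build_index` and `find_set_s`.

```python
from collections import defaultdict

# Toy knowledgebase: hybrid n-gram -> frequency (made-up entries and counts).
KNOWLEDGEBASE = {
    ("enjoy", "VVg"): 80,
    ("enjoy", "V"): 100,
    ("V", "hiking"): 12,
    ("go", "hiking"): 7,
    ("enjoy", "the", "N"): 30,   # contains 'enjoy' but has the wrong length
    ("hope", "to", "VVi"): 50,   # unrelated to the learner string
}

def build_index(kb):
    """Map each item (word form, lexeme or POS tag) to the n-grams containing it."""
    index = defaultdict(set)
    for ngram in kb:
        for item in ngram:
            index[item].add(ngram)
    return index

def find_set_s(content_items, n, index):
    """Collect knowledgebase n-grams of length n containing any content-word item."""
    found = set()
    for item in content_items:
        found |= {g for g in index.get(item, ()) if len(g) == n}
    return found

index = build_index(KNOWLEDGEBASE)
# Word forms and lexemes of the content words in 'enjoyed hiking'
set_s = find_set_s({"enjoyed", "enjoy", "hiking", "hike"}, 2, index)
```

Restricting the result to length 2 reflects the substitution-only edit distance: only n-grams as long as the learner string can be candidates.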
Pruning Set S of Candidates
enjoy + V: 100 tokens
enjoy + VVg: 80 tokens
We prune the subsuming hybrid n-gram in cases where a subsumed one accounts for 80% or more of the subsuming set.
Here enjoy + V is pruned (X) and enjoy + VVg (80 tokens) remains.
Pruning of the knowledgebase will affect error recall.
The remaining Set S is filtered for frequency of member hybrid n-grams.
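The pruning rule can be sketched as follows. This is a minimal illustration under assumptions: subsumption is modelled only between a rough POS and its detailed POS (the `ROUGH` mapping is hand-written), and the function names are hypothetical. The 80% threshold and the token counts come from the example above.

```python
# Detailed POS -> rough POS (assumed mapping for the toy example).
ROUGH = {"VVg": "V", "VVd": "V"}

def subsumes(general, specific):
    """True if `general` differs from `specific` only in slots where it
    holds the rough POS of the detailed POS found in `specific`."""
    if len(general) != len(specific) or general == specific:
        return False
    return all(g == s or ROUGH.get(s) == g for g, s in zip(general, specific))

def prune(counts, threshold_pct=80):
    """Drop a subsuming n-gram when one subsumed n-gram accounts for
    threshold_pct percent or more of its tokens."""
    kept = dict(counts)
    for general, g_count in counts.items():
        for specific, s_count in counts.items():
            # Integer comparison avoids floating-point rounding at the boundary.
            if subsumes(general, specific) and s_count * 100 >= threshold_pct * g_count:
                kept.pop(general, None)
    return kept

counts = {("enjoy", "V"): 100, ("enjoy", "VVg"): 80}
pruned = prune(counts)
```

Here `enjoy + VVg` covers 80 of the 100 tokens of `enjoy + V`, so the more general n-gram is removed and only `enjoy + VVg` is left in `pruned`.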
Weighting of Edit Distance
Learner string: ‘enjoyed to hike’
Generate Set C of hybrid n-grams:
- enjoyed to hike
- enjoy VVt
- enjoy VV to hike
- VVd to hike
- etc.
Generate Set S of hybrid n-grams:
- enjoyed hiking
- enjoyed hike
- enjoy VVg
- VVd hiking
- V hiking
- VVd hike
- enjoy learning
Distance = 1: string c and string s are identical but for one slot
Correction candidates are those with a distance 1 or lower.
Ranking of candidates with distance = 1 from learner string
- Differing element with the same lexeme but a different word form is closer than one with a different lexeme
- Differing element with the same rough POS but a different detailed POS is closer than one with a different rough POS
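The distance and ranking rules above might be sketched like this. It is a simplified illustration, not the system's scoring: candidates are fully specified tokens rather than hybrid n-grams, the penalty values 1, 2 and 3 and the order in which the two rules are combined are assumptions, and the tags other than VVd are placeholders.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str     # word form
    lexeme: str
    pos: str      # detailed POS
    rough: str    # rough POS

def substitution_penalty(a, b):
    """Assumed penalties: smaller means closer to the learner's token."""
    if a.lexeme == b.lexeme:
        return 1          # same lexeme, different word form
    if a.rough == b.rough:
        return 2          # different lexeme, same rough POS
    return 3              # different rough POS

def rank(learner, candidates):
    """Keep same-length candidates differing in at most one slot,
    ordered by (number of differing slots, penalty)."""
    scored = []
    for cand in candidates:
        if len(cand) != len(learner):
            continue      # substitution only: lengths must match
        diffs = [(a, b) for a, b in zip(learner, cand) if a != b]
        if len(diffs) <= 1:
            penalty = sum(substitution_penalty(a, b) for a, b in diffs)
            scored.append((len(diffs), penalty, cand))
    scored.sort(key=lambda item: item[:2])
    return [cand for _, _, cand in scored]

will = Token("will", "will", "VM", "V")
learner = [will, Token("ran", "run", "VVd", "V")]
candidates = [
    [will, Token("power", "power", "NN", "N")],
    [will, Token("go", "go", "VVi", "V")],
    [will, Token("run", "run", "VVi", "V")],
    [will, Token("run", "run", "VVi", "V"), Token("away", "away", "AV", "ADV")],
]
ranked_forms = [cand[-1].form for cand in rank(learner, candidates)]
```

For the learner string "will ran", the same-lexeme candidate "will run" ranks first, a different verb second and a noun last; the three-token candidate is skipped because only substitution is allowed.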
Examples 1
C-selection
• Enjoy to swim > enjoy swimming
• Enjoy to shop > enjoy shopping
• Enjoy to canoe > enjoy canoeing
• Enjoy to learn > *need to learn; ?want to learn; enjoy learning
• Enjoy to find > *try to find; *expect to find; *fail to find; *hope to find; *want to find
• Hope finding > hope to find
• Let us to know > let us know
• Get used to say > *get used to; *have used to say
Collocation with C-selection
• Spend time to fix > spend time fixing; take time to fix
• Take time fixing > take time to fix
• Take time recuperating > take time to recuperate
• Spend time to recuperate > spend time recuperating; take time to recuperate
Examples 2
Preposition
Fixed expressions:
• On the outset > At the outset
• In different reasons > For different reasons
• In that time > at that time; by that time
• On that time > at that time; by that time
• On my opinion > in my opinion
• In my point of view > from my point of view
• I am interested of > I am interested in
• She is interested of > she is interested in
• I am interesting in > I am interested in
• She is interesting in > She is interested in
• Just on the time when > just at the time when; *just to the time when
Examples 3
Preposition/Particle:
Verb + preposition (particle)
• Discuss to each other > *discussing to each other (should be discuss WITH each other)
• Discuss this to them > discuss this with them
• Waited to her > waited for her
• Waited to them > waited for them
Noun + preposition
• His admiration to > his admiration for
• His accomplishment on > * No suggestion
• The opposite side to > the opposite side of
• A crisis on > a crisis of; a crisis in
• A crisis on his work > a crisis of his work (*a crisis on his work)
Examples 4
Content Word Choice
• Lead a miserable living > make a miserable living; *leading a miserable living; *led a miserable living; lead a miserable life
• Frame of mood > ??change of mood; frame of mind; *frame of reference
Examples 5
Morpho-syntactic
• She will ran > She will run
• She will runs > She will run
Pronoun case:
• What made she change > *what made she change (no correction; should be made HER change)
Noun countability or number errors:
• In modern time > in modern times
Number agreement in head noun and determiner:
• Too much people > too many people
• So much things > so many things
• So many thing > so many things
• One of the man > one of the men
• One of the problem > one of the problems
• In my opinions > in my opinion
• A lot of problem > a lot of problems
Complementizer selection:
• I wonder that > I wonder if; I wonder whether
Future Work
• Improving POS tagging using 2nd order model
• Machine learning of weighting for the various features determining edit distance
• Incorporation of this into our IWiLL online writing environment
• Incorporate MI for the knowledgebase’s hybrid n-grams
Thank you