Detecting Missing Hyphens in Learner Text
Aoife Cahill, Susanne Wolff, Nitin Madnani
Educational Testing Service
Martin Chodorow
Hunter College and the Graduate Center
ACL 2013
Outline
- Introduction
- Baselines
- System Description
- Evaluation
- Conclusions
Introduction
(1) Schools may have more after school sports.
(2) I went to the dentist after school today.
(3) My father like play basketball with me.
Missing Hyphens:
Baselines
(1) The hyphenated form is listed in the Collins Dictionary
(2) The hyphenated form occurs more than 1,000 times in Wikipedia
(3) The probability of the hyphenated form, as estimated from Wikipedia, is greater than 0.66
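The two Wikipedia-based baselines can be sketched as simple lookups over corpus counts. This is a minimal illustration only: the function name `wiki_baselines`, the way the probability is estimated (relative frequency of the hyphenated form versus the spaced form), and the toy counts are all assumptions, not details given on the slides.

```python
def wiki_baselines(bigram, counts, min_count=1000, min_prob=0.66):
    """Return (count_baseline, prob_baseline) decisions for a word pair.

    `counts` maps a surface form to a corpus frequency. The probability is
    assumed here to be hyphenated / (hyphenated + spaced); the slides do not
    say how it was actually estimated.
    """
    first, second = bigram
    hyphenated = counts.get(f"{first}-{second}", 0)
    spaced = counts.get(f"{first} {second}", 0)
    total = hyphenated + spaced
    prob = hyphenated / total if total else 0.0
    return hyphenated > min_count, prob > min_prob


# Made-up counts for illustration, not real Wikipedia statistics.
toy_counts = {
    "well-known": 5000, "well known": 1000,
    "after-school": 400, "after school": 1600,
}

decision_well_known = wiki_baselines(("well", "known"), toy_counts)
decision_after_school = wiki_baselines(("after", "school"), toy_counts)
```

Neither baseline looks at the surrounding sentence, so each pair of words always receives the same decision regardless of context.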
System Description
Learner text: Schools may have more after school sports.
System Description
Model: logistic regression
Threshold: a missing hyphen error is predicted only when the probability of the prediction is > 0.99
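The thresholded decision can be sketched as follows. The feature names, weights, and helper names are hypothetical and chosen only to show how a 0.99 cut-off suppresses moderately confident predictions; the slides do not list the actual features.

```python
import math


def hyphen_probability(features, weights, bias=0.0):
    """Logistic regression: sigmoid of the weighted feature sum."""
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def flag_missing_hyphen(features, weights, bias=0.0, threshold=0.99):
    """Flag an error only when the model is very confident."""
    return hyphen_probability(features, weights, bias) > threshold


# Hypothetical weights, for illustration only.
toy_weights = {"bigram=after_school": 3.0, "next_pos=NOUN": 3.0}

before_noun = {"bigram=after_school": 1.0, "next_pos=NOUN": 1.0}
elsewhere = {"bigram=after_school": 1.0}

flag_before_noun = flag_missing_hyphen(before_noun, toy_weights)
flag_elsewhere = flag_missing_hyphen(elsewhere, toy_weights)
```

With these toy weights the second case has a probability of roughly 0.95, which a default 0.5 cut-off would flag but the 0.99 threshold does not. A high threshold of this kind trades recall for precision.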
System Description
SJM-trained:
- San Jose Mercury News corpus
- For training, hyphenated words are automatically split (e.g. well-known becomes well known)
- The training data contains 1% of the positive examples and 3% of the negative examples
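The automatic splitting step might look like the sketch below, which labels every gap between adjacent tokens. The function name `split_hyphens`, the whitespace tokenization, and the restriction to two-part alphabetic words are simplifying assumptions; the sampling of positive and negative examples is not shown.

```python
def split_hyphens(sentence):
    """Split hyphenated words and label the gaps between tokens.

    Returns (tokens, labels), where labels[i] is 1 if a hyphen originally
    joined tokens[i] and tokens[i + 1], and 0 otherwise. Words with attached
    punctuation or more than one hyphen are left untouched in this sketch.
    """
    tokens, labels = [], []
    for word in sentence.split():
        parts = word.split("-")
        if len(parts) == 2 and all(p.isalpha() for p in parts):
            tokens.extend(parts)
            labels.extend([1, 0])  # hyphen gap, then gap to the next word
        else:
            tokens.append(word)
            labels.append(0)
    return tokens, labels[:-1]  # no gap after the final token


tokens, labels = split_hyphens("a well-known author")
```

Each gap labeled 1 yields a positive training example and each gap labeled 0 a candidate negative example.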
System Description
Negative examples: only contexts that occur more than 20 times are selected during training.
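The frequency filter on negative examples can be sketched with a counter. Treating a "context" as a word pair is an assumption made for illustration, as is the function name `frequent_negatives`.

```python
from collections import Counter


def frequent_negatives(negative_contexts, min_count=20):
    """Keep only negative contexts seen more than `min_count` times."""
    counts = Counter(negative_contexts)
    return [c for c in negative_contexts if counts[c] > min_count]


# Toy data: one context seen 21 times, another seen exactly 20 times.
toy_negatives = [("after", "school")] * 21 + [("went", "to")] * 20
kept = frequent_negatives(toy_negatives)
```

Because the comparison is strict, the context seen exactly 20 times is dropped and only the one seen 21 times survives.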
System Description
Wiki-revision-trained:
- Wikipedia articles
System Description
Combined:
- Combine both data sources
Evaluation
Artificial Data:
- Brown corpus
- 24,243 sentences taken from the corpus
- 2,072 hyphenated words
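Scoring a system against data of this kind can be sketched as a comparison of predicted and gold hyphen positions. The choice of precision, recall, and F1 as metrics, the (sentence index, token index) position format, and the toy values are assumptions for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Compare predicted missing-hyphen positions against gold positions."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical positions as (sentence index, token index) pairs.
toy_gold = {(0, 3), (1, 5), (2, 1), (4, 2)}
toy_predicted = {(0, 3), (1, 5), (3, 0)}

precision, recall, f1 = precision_recall_f1(toy_gold, toy_predicted)
```

In the toy case two of three predictions are correct and two of four gold positions are found.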
Evaluation
Learner Text:
- CLC-FCE
- The corpus contains 1,244 exam scripts
- A total of 173 instances of missing hyphen errors
Evaluation 1
Of the 131 true positives on the learner data, 87 are cases of a single type, the word “make-up”.
Evaluation 2
Learner Text:
- A data set of 1,000 student GRE and TOEFL essays
- Drawn from 295 prompts
- Ranging in length from 1 to 50 sentences
- Average of 378 words per essay
Evaluation
Learner Text (Cont.):
- Manually inspect a random sample of 100 instances where each system detected a missing hyphen
- Judged by two native English speakers
- Using the Chicago Manual of Style as a guide
- High agreement between the judges
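Agreement between two judges can be quantified in several ways. The slides do not say which statistic was used, so the sketch below uses Cohen's kappa purely as one common choice; the judgments are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators.

    Undefined (division by zero) when chance agreement is exactly 1,
    e.g. when both annotators use one identical label throughout.
    """
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(freq_a[l] * freq_b[l] for l in all_labels) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented judgments: 1 = detection judged correct, 0 = judged incorrect.
judge_a = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]
judge_b = [1, 1, 1, 0, 1, 1, 1, 0, 1, 1]

kappa = cohens_kappa(judge_a, judge_b)
```

Here the judges agree on 9 of 10 items, but kappa is lower than 0.9 because part of that agreement is expected by chance.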
Conclusions
1) Automatic detection of missing hyphen errors in learner text
2) The classifiers generally performed better than the baseline systems
3) Taking context into account when detecting the errors is important