Detecting Missing Hyphens in Learner Text

Aoife Cahill, Susanne Wolff, Nitin Madnani
Educational Testing Service

Martin Chodorow
Hunter College and the Graduate Center

ACL 2013

Outline

- Introduction
- Baselines
- System Description
- Evaluation
- Conclusions

Introduction

(1) Schools may have more after school sports.

(2) I went to the dentist after school today.

(3) My father like play basketball with me.

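Missing hyphens: in (1), "after school" modifies "sports" and should be written "after-school"; in (2), the same word pair is correct without a hyphen.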

Baselines

(1) The hyphenated form is listed in the Collins Dictionary

(2) The hyphenated form occurs more than 1,000 times in Wikipedia

(3) The probability of the hyphenated form, as estimated from Wikipedia, is greater than 0.66
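
The three baselines can be sketched as simple lookups. This is only an illustration: the dictionary entries and counts below are invented toy values, the function names are hypothetical, and the probability estimate (hyphenated count divided by hyphenated plus unhyphenated count) is an assumption, since the slides do not say how the probability is computed.

```python
from collections import Counter

# Toy stand-ins (invented values) for the Collins Dictionary and for
# Wikipedia frequency counts of hyphenated and unhyphenated forms.
DICTIONARY = {"after-school", "well-known"}
WIKI_COUNTS = Counter({
    "after-school": 1500, "after school": 500,
    "well-known": 9000, "well known": 1000,
    "high-school": 300, "high school": 9700,
})


def in_dictionary(w1, w2):
    """Baseline 1: the hyphenated form is a dictionary entry."""
    return f"{w1}-{w2}" in DICTIONARY


def frequent_in_wiki(w1, w2, min_count=1000):
    """Baseline 2: the hyphenated form occurs more than min_count times."""
    return WIKI_COUNTS[f"{w1}-{w2}"] > min_count


def hyphen_probability(w1, w2):
    """Assumed estimate: hyphenated / (hyphenated + unhyphenated)."""
    hyphenated = WIKI_COUNTS[f"{w1}-{w2}"]
    unhyphenated = WIKI_COUNTS[f"{w1} {w2}"]
    total = hyphenated + unhyphenated
    return hyphenated / total if total else 0.0


def probable_in_wiki(w1, w2, threshold=0.66):
    """Baseline 3: the estimated probability exceeds the threshold."""
    return hyphen_probability(w1, w2) > threshold
```

None of these lookups sees the surrounding sentence, so with the toy data all three would flag "after school" in both example (1) and example (2).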

System Description

Learner text: Schools may have more after school sports.

System Description

Model: Logistic regression model

Probability: Only predict a missing hyphen error when the probability of the prediction is > 0.99
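
A minimal sketch of the thresholded decision, assuming a binary logistic regression over named features. The feature names, weights, and bias are invented for illustration and are not the system's actual feature set; only the 0.99 threshold comes from the slide.

```python
import math

THRESHOLD = 0.99  # only flag an error above this probability


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def missing_hyphen_probability(features, weights, bias):
    """Logistic regression score for one word pair in context.

    features: dict mapping feature name to value
    weights:  dict mapping feature name to learned weight
    """
    z = bias + sum(weights.get(name, 0.0) * value
                   for name, value in features.items())
    return sigmoid(z)


def predict_missing_hyphen(features, weights, bias, threshold=THRESHOLD):
    """Flag a missing hyphen only when the model is very confident."""
    return missing_hyphen_probability(features, weights, bias) > threshold


# Hypothetical weights, for illustration only.
WEIGHTS = {"bigram=after_school": 3.0,
           "next_word=sports": 3.5,
           "next_word=today": -2.0}
BIAS = -1.0

modifier_use = {"bigram=after_school": 1.0, "next_word=sports": 1.0}
adverbial_use = {"bigram=after_school": 1.0, "next_word=today": 1.0}
```

With these toy weights, `predict_missing_hyphen(modifier_use, WEIGHTS, BIAS)` is true and `predict_missing_hyphen(adverbial_use, WEIGHTS, BIAS)` is false. A threshold this high trades recall for precision: a pair scored at 0.98 is still left unflagged.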

System Description

SJM-trained: - San Jose Mercury News corpus

- For training, hyphenated words are automatically split (e.g. "well-known" becomes "well known")

- The training data contains 1% of the positive examples and 3% of the negative examples
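
The splitting step can be sketched as follows. This assumes that every gap between two adjacent words becomes one training example, positive where a hyphen was removed and negative elsewhere; the function name and the labeling scheme are illustrative assumptions, and the sampling of positives and negatives is not shown.

```python
def split_hyphens(tokens):
    """Split hyphenated tokens and label each gap between adjacent words.

    Returns (words, labels), where labels[i] is 1 if a hyphen originally
    joined words[i] and words[i + 1], and 0 otherwise.
    """
    words, labels = [], []
    for token in tokens:
        # Tokens made only of hyphens (e.g. "--") are kept unchanged.
        parts = [p for p in token.split("-") if p] or [token]
        for i, part in enumerate(parts):
            if words:
                # A gap inside one original token marks a removed hyphen.
                labels.append(1 if i > 0 else 0)
            words.append(part)
    return words, labels


words, labels = split_hyphens(["a", "well-known", "author"])
# words  == ["a", "well", "known", "author"]
# labels == [0, 1, 0]
```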

System Description

Negative examples selected: Only contexts that occur more than 20 times are selected during training.
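
A minimal sketch of this frequency filter. The slide does not define "context", so the example assumes it is simply the pair of adjacent words; the function name is hypothetical.

```python
from collections import Counter


def filter_negatives(negatives, min_count=20):
    """Keep negative examples whose context occurs more than min_count times.

    negatives: list of hashable contexts, here (word1, word2) tuples
    (an assumption about what a context is).
    """
    counts = Counter(negatives)
    return [context for context in negatives if counts[context] > min_count]


# Toy data: one context seen 21 times, another seen exactly 20 times.
negatives = [("of", "the")] * 21 + [("rare", "pair")] * 20
kept = filter_negatives(negatives)
```

Here only the 21 occurrences of `("of", "the")` survive, because the threshold is strict.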

System Description

Wiki-revision-trained: - Wikipedia articles

System Description

Combined: - Combine both data sources

Evaluation

Artificial Data: - Brown corpus

- 24,243 sentences

- 2,072 hyphenated words
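
Scoring a system against such data could look like the sketch below. It assumes the usual precision, recall, and F-score over predicted hyphen positions; the slides do not name the metrics, and the position encoding (sentence index, gap index) and the toy values are illustrative only.

```python
def score(gold, predicted):
    """Precision, recall, and F1 over sets of hyphen positions.

    gold, predicted: iterables of (sentence_index, gap_index) pairs.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy positions, not results from the talk.
gold = {(0, 4), (1, 2), (2, 7), (3, 1)}
predicted = {(0, 4), (1, 2), (5, 3)}
precision, recall, f1 = score(gold, predicted)
```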

Evaluation

Evaluation 1, Learner Text: - CLC-FCE

- The corpus contains 1,244 exam scripts

- A total of 173 instances of missing hyphen errors

Evaluation

An analysis of the 131 true positives on the learner data reveals that 87 of them are cases of a single type, the word “make-up”.

Evaluation

Evaluation 2, Learner Text: - A data set of 1,000 student GRE and TOEFL essays

- Drawn from 295 prompts

- Ranging in length from 1 to 50 sentences

- Average of 378 words per essay

Evaluation

Learner Text (Cont.): - Manually inspect a random sample of 100 instances where each system detected a missing hyphen

- Two native English speakers judged each instance

- Using the Chicago Manual of Style as a guide

- High agreement
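
Agreement between two judges is often summarized with Cohen's kappa. The slides say only that agreement was high and do not name the measure, so the sketch below is an assumption about how it might be quantified, and the judgments are toy values.

```python
from collections import Counter


def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa for two equal-length lists of categorical judgments."""
    if len(judge_a) != len(judge_b) or not judge_a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    counts_a, counts_b = Counter(judge_a), Counter(judge_b)
    labels = set(judge_a) | set(judge_b)
    # Chance agreement from each judge's label distribution.
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy judgments (1 = hyphen really is missing, 0 = false alarm).
judge_a = [1, 1, 1, 0, 0, 0, 1, 1, 0, 0]
judge_b = [1, 1, 1, 0, 0, 0, 1, 0, 0, 0]
kappa = cohens_kappa(judge_a, judge_b)
```

In this toy case the judges agree on 9 of 10 items, chance agreement is 0.5, and kappa comes out at 0.8.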

Conclusions

(1) A system for automatically detecting missing hyphen errors in learner text

(2) The classifiers generally performed better than the baseline systems

(3) Taking context into account is important when detecting these errors