© 2010 IBM Corporation
Learning to Predict Readability using Diverse Linguistic Features
Rohit J. Kate, Xiaoqiang Luo, Siddharth Patwardhan, Martin Franz, Radu Florian, Raymond J. Mooney, Salim Roukos, Chris Welty
Presented by: Young-Suk Lee
The University of Texas at Austin IBM T. J. Watson Research Center
Outline
Problem definition and motivations
Data
System and Features
Experimental Results
Readability
DARPA machine reading program (MRP)
“Readability is defined as a subjective judgment of how easily a reader can extract the information the writer or the speaker intended to convey.”
Task: given a general document, assign a readability score (1 to 5)
Sample Passage: High Readability
Industrial agriculture has grown increasingly paradoxical, replacing natural processes with synthetic practices and treating farms as factories. Consequently, food has become a marketing entity rather than a necessity to sustain life. …
Sample Passage: Low Readability
The word of the prince of believers may Allah God him Talk of gold this at present Reflections on the word of the prince of believers may Allah pleased with him, Prince of Believers May Allah be pleased with him: …
Readability: Motivations
Remove less readable documents from web search results
Filter out less readable documents before extracting knowledge
Select reading materials
Contrast With Other Work
Predicting readability: how well the message is conveyed, vs. reading difficulty (grades 1 to 12)
Document sources: multiple genres, vs. a single domain, genre, or reader group
Outline
Problem definition and motivations
Data
System and Features
Experimental Results
Data
390 training documents
Each document has:
- 8 expert ratings in [1, ..., 5]
- 6-10 "novice" ratings in [1, ..., 5]
Ratings differ by genre:
- Newswire and wiki documents: high
- MT documents: low

Genre      #Docs  Expert Rating  Novice Rating
nwire       56        4.93           4.23
wiki        56        4.83           4.13
weblog      55        4.46           3.75
q-trans     56        4.47           3.83
news-grp    55        4.26           3.34
ccap        56        4.13           3.53
mt          56        2.38           1.92
Data
[Figure: Histogram of novice ratings (1-5) by genre (nw, wk, wl, qt, ng, cc, mt). MT documents account for most of the low ratings. ng: newsgroup; cc: closed-caption speech.]
Outline
Problem definition and motivations
Data
System and Features
Experimental Results
System Overview
Training docs → Preprocessing → features (LM score, parser score, …) → Regression (WEKA)
A test doc is preprocessed into the same features; the trained regressor assigns it a system rating.
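The pipeline above can be sketched in miniature. This is a toy illustration, not the paper's system: the paper trained WEKA regressors over many real linguistic features, while this sketch fits ordinary least squares on one made-up feature per document.

```python
# Minimal sketch of the readability pipeline: one hypothetical feature
# (e.g. an LM score) per training document, regressed onto its rating.

def fit_linear(xs, ys):
    # Ordinary least squares for a single feature: rating ≈ a * x + b.
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Toy training data: feature value and mean expert rating per document.
train_x = [0.9, 0.8, 0.7, 0.3, 0.2]
train_y = [5.0, 4.8, 4.5, 2.5, 2.0]

a, b = fit_linear(train_x, train_y)
test_rating = a * 0.6 + b  # system rating for a new test document
```

In the real system the regressor sees dozens of features at once; the single-feature version just makes the train-then-score flow concrete.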
Syntactical Features
Extracted using the Sundance [Riloff & Phillips 04] and English Slot Grammar (ESG) parsers:
- Ratio of sentences without verbs
- Avg. # clauses per sentence
- Avg. # NPs, # VPs, # PPs, # phrases per sentence
- Failure rate of the ESG parser
- …
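One of these features, the ratio of sentences without verbs, is easy to sketch. The sketch below assumes sentences arrive already POS-tagged (with Penn-style tags); the paper derived such features from full parser output, not from this toy tagging.

```python
# Sketch: "ratio of sentences without verbs" over pre-tagged sentences.
# Tags here are illustrative, not real parser output.

def no_verb_ratio(tagged_sentences):
    # A sentence "has a verb" if any token carries a VB* tag.
    no_verb = sum(
        1 for sent in tagged_sentences
        if not any(tag.startswith("VB") for _, tag in sent)
    )
    return no_verb / len(tagged_sentences)

sents = [
    [("The", "DT"), ("farm", "NN"), ("grew", "VBD")],  # has a verb
    [("Word", "NN"), ("of", "IN"), ("gold", "NN")],    # verbless fragment
]
ratio = no_verb_ratio(sents)  # → 0.5
```

Verbless fragments are common in the low-readability MT passages, which is why this ratio carries signal.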
Language Model (LM) Features
Normalized document probability under a 5-gram generic LM
Genre-specific LMs:
- Training data is readily available for these genres
- Genre is a strong predictor of readability
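Normalizing the document probability by length makes short and long documents comparable: the per-word average log-probability is the log of the geometric mean of the word probabilities. A sketch, using a made-up unigram model in place of the 5-gram LM:

```python
import math

# Sketch: length-normalized document log-probability under a toy unigram LM.
# The probabilities and OOV penalty below are invented for illustration.

lm = {"the": 0.07, "of": 0.04, "farm": 0.001, "gold": 0.0005}

def norm_log_prob(words, lm, oov_prob=1e-6):
    # Average per-word log-probability; unseen words get a small floor.
    total = sum(math.log(lm.get(w, oov_prob)) for w in words)
    return total / len(words)

doc = ["the", "farm", "of", "gold"]
score = norm_log_prob(doc, lm)  # higher (less negative) = more LM-typical text
```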
Genre-based Language Model Features
Perplexity of genre-specific LM M_j over a document D = w_1 … w_N, where h_i is the history (preceding words) of word w_i:
  PPL_j(D) = P_{M_j}(D)^(-1/N),  with  P_{M_j}(D) = ∏_{i=1}^{N} P_{M_j}(w_i | h_i)
Genre posterior (relative probability of genre j compared to all G genres, assuming uniform genre priors):
  P(M_j | D) = P(D | M_j) / Σ_{g=1}^{G} P(D | M_g)
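Both quantities can be computed from per-genre document log-probabilities. The sketch below uses toy unigram models in place of the genre-specific 5-gram LMs, and assumes uniform genre priors for the posterior; all model names and probabilities are invented.

```python
import math

# Sketch: perplexity under a genre LM, and the genre posterior over
# toy unigram "genre" models (stand-ins for genre-specific 5-gram LMs).

genres = {
    "nwire": {"the": 0.08, "report": 0.002, "said": 0.004},
    "weblog": {"the": 0.05, "report": 0.0005, "said": 0.001},
}

def log_prob(words, lm, oov=1e-6):
    return sum(math.log(lm.get(w, oov)) for w in words)

def perplexity(words, lm):
    # PPL = P(D)^(-1/N): lower means the LM finds the document more typical.
    return math.exp(-log_prob(words, lm) / len(words))

def genre_posterior(words):
    # P(M_j | D) with uniform genre priors, via a numerically stable softmax
    # over the per-genre log-probabilities.
    logs = {g: log_prob(words, lm) for g, lm in genres.items()}
    m = max(logs.values())
    exps = {g: math.exp(v - m) for g, v in logs.items()}
    z = sum(exps.values())
    return {g: v / z for g, v in exps.items()}

doc = ["the", "report", "said"]
ppl = perplexity(doc, genres["nwire"])
post = genre_posterior(doc)  # this doc looks far more "nwire" than "weblog"
```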
Lexical Features
Fraction of known words using dictionary and gazetteer of names
Out-of-vocabulary (OOV) rates using genre-based corpora
Ratio of function words (“the”, “of” etc.)
Ratio of pronouns
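Two of these lexical features are simple token-level ratios. A sketch with tiny illustrative word lists (the real system used a full dictionary, a gazetteer of names, and genre-based corpora):

```python
# Sketch: function-word ratio and OOV rate. The word lists below are
# miniature stand-ins for the real dictionary and function-word list.

FUNCTION_WORDS = {"the", "of", "and", "a", "to", "in"}
VOCABULARY = {"the", "of", "word", "prince", "believers", "gold"}

def function_word_ratio(words):
    return sum(w in FUNCTION_WORDS for w in words) / len(words)

def oov_rate(words):
    # Fraction of tokens not found in the dictionary/gazetteer.
    return sum(w not in VOCABULARY for w in words) / len(words)

doc = ["the", "word", "of", "xyzzq"]
fw = function_word_ratio(doc)  # → 0.5
oov = oov_rate(doc)            # → 0.25
```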
Experiments: Evaluation Metric
Pearson correlation coefficient, with the mean expert rating as the gold standard
To compare with novice judges:
- A sampling distribution representing the performance of novice judges was generated
- Its mean and upper critical value were computed
- If the correlation between system and mean expert ratings is above the upper critical value, the system is statistically significantly better than the novice judges
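The metric itself is standard Pearson correlation between the system's ratings and the mean expert ratings; the ratings below are invented for illustration.

```python
import math

# Sketch: Pearson correlation between system and mean expert ratings.

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

system = [4.8, 4.5, 3.9, 2.1]  # toy system ratings
expert = [4.9, 4.4, 4.0, 2.4]  # toy mean expert ratings
r = pearson(system, expert)    # close to 1: ratings track each other
```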
Outline
Problem definition and motivations
Data
System and Features
Experimental Results
Experiments: Methodology
Compared regression algorithms
Feature ablation experiments
Results from 13-fold cross-validation with balanced genre representation in each fold
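Balanced genre representation means each fold contains roughly the same mix of genres. One simple way to build such folds (a sketch with made-up document IDs; the paper does not specify its fold-construction procedure) is to deal each genre's documents round-robin across the folds:

```python
from collections import defaultdict

# Sketch: genre-balanced cross-validation folds via round-robin dealing.

def balanced_folds(docs, n_folds):
    # docs: list of (doc_id, genre) pairs.
    by_genre = defaultdict(list)
    for doc_id, genre in docs:
        by_genre[genre].append(doc_id)
    folds = [[] for _ in range(n_folds)]
    for genre_docs in by_genre.values():
        # Dealing each genre round-robin keeps genres evenly spread.
        for i, doc_id in enumerate(genre_docs):
            folds[i % n_folds].append(doc_id)
    return folds

docs = [(f"{g}{i}", g) for g in ("nwire", "mt") for i in range(6)]
folds = balanced_folds(docs, 3)  # 3 folds of 4 docs, 2 per genre each
```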
Results: Regression Algorithms
Choice of regression algorithm is not critical.
[Figure: Correlation with mean expert ratings for Bagged Decision Trees, Linear Regression, SVM Regression, Gaussian Process Regression, and Decision Trees, compared against the novice distribution mean and upper critical value.]
Results: Feature Sets
Each feature set contributes; the LM-based feature set is most useful.
[Figure: Correlation for the All, Lexical, Syntactical, Lexical + Syntactical, and LM-based feature sets, compared against the novice distribution mean and upper critical value.]
Results: Genre-based Feature Sets
Genre-independent features alone beat the novice mean; genre-based features significantly improve performance.
[Figure: Correlation for the All, Genre-independent, and Genre-based feature sets, compared against the novice distribution mean and upper critical value.]
Results: Individual Feature Sets
Posterior perplexities are the best single feature set, but no single feature set is indispensable.
[Figure: Correlation for each feature set (Sundance, ESG, Perplexity, Posterior Perplexity, OOV rates), both by itself and ablated from the all-features system, alongside the full system and the novice distribution mean and upper critical value.]
Official Evaluation
Conducted by SAIC on behalf of DARPA
Three teams participated
Evaluation task: Predict readability of 150 test documents using the 390 documents for training
Official Evaluation Results
Our system performed favorably and scored better than the upper critical value: significantly better than the novice humans at p < 0.0001.
[Figure: Correlation for Our System, System B, and System C, compared against the novice mean and upper critical value.]
Conclusions
Readability system: regression over syntactical, lexical, and language model features
All features contribute, but LM features are most useful
The system is statistically significantly better than novice human judges
Questions?
Thank You!