
Transcript of .ppt: Automatic Discourse Parsing of Dissertation Abstracts as Sentence Categorization

Page 1

Shiyan Ou, Chris Khoo, Dion Goh, Hui-Ying Heng

Division of Information Studies

School of Communication & Information

Nanyang Technological University, Singapore

Automatic Discourse Parsing of Dissertation Abstracts as Sentence Categorization

Page 2

Objective

To develop an automatic method to parse the discourse structure of sociology dissertation abstracts

To segment a dissertation abstract into five sections (macro-level structure):
1. background
2. problem statement/objectives
3. research method
4. research results
5. concluding remarks

Page 3

Approach

Discourse parsing is treated as a text categorization problem: assign each sentence to 1 of the 5 sections or categories.

Machine learning method: decision tree induction (rule induction) using C5.0 within the Clementine data mining system

Part of a broader study to develop a method for multi-document summarization of dissertation abstracts

Page 4

Previous Studies

2 approaches:

Hand-crafted algorithms using lexical and syntactic clues, e.g., Kurohashi & Nagao (1994)

Models developed using supervised machine learning, e.g., Nomoto & Matsumoto (1998), Marcu (1999), Le & Abeysinghe (2003)

Page 5

Previous Studies (cont.)

Features of the studies:

Different kinds of text and domain, e.g., news articles, scientific articles

Different discourse models: micro-level structure vs. macro-level structure; theoretical perspectives, e.g., rhetorical relations

Page 6

Data Preparation

300 sociology dissertation abstracts from 2001

Training set: 200 abstracts for constructing the classifier

Test set: 100 abstracts for evaluating the accuracy of the classifier

Abstracts were segmented into sentences with a simple program (a sketch follows this list)

Sentences were manually categorized into 1 of the 5 sections. Some problems:

Some abstracts don't follow the regular structure. These were deleted from the training and test sets (29 in the training set and 16 in the test set)

Some sentences can be assigned to multiple categories
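The slides do not show the segmentation program. As a rough illustration only (a hypothetical stand-in, not the authors' code), a naive splitter could look like this in Python:

    import re

    def split_sentences(abstract):
        # Naive splitter: break after '.', '?' or '!' when followed by
        # whitespace and an uppercase letter. Abbreviations such as
        # "U.S." will be mis-split -- acceptable for a first pass.
        parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', abstract.strip())
        return [p for p in parts if p]

    print(split_sentences(
        "This study examines racial attitudes. Data were collected in 2001."))
    # ['This study examines racial attitudes.', 'Data were collected in 2001.']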

Page 7

Data Preparation (cont.)

Sentences were tokenized and words were stemmed using the Conexor parser

Each sentence was converted into a vector of binary term weights, using 1 and 0 to indicate whether a word occurs in the sentence

The dataset was formatted as a table with sentences as rows and words as columns
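A minimal sketch of this binary bag-of-words encoding (plain Python as a stand-in for the authors' tooling; stemming is omitted here):

    def binary_vectors(sentences, vocabulary):
        # One row per sentence, one column per vocabulary word;
        # a cell is 1 if the word occurs in the sentence, else 0.
        rows = []
        for sent in sentences:
            tokens = set(sent.lower().split())
            rows.append([1 if word in tokens else 0 for word in vocabulary])
        return rows

    vocab = ["american", "control", "data", "design", "factor"]
    sents = ["The data were collected from American families"]
    print(binary_vectors(sents, vocab))  # [[1, 0, 1, 0, 0]]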

Page 8

Data Preparation (cont.)

doc_sen  american  control  data  design  factor  implication  study  racial  …
2.01     1         1        0     0       0       0            0      0       …
2.02     1         0        1     1       0       0            0      0       …
2.03     1         0        0     0       0       1            0      1       …
2.04     0         0        0     0       1       0            1      0       …
2.05     1         0        0     0       0       0            0      1       …

Page 9

Machine learning

Using C5.0 -- a decision tree induction program within the Clementine data mining system

Using 10-fold cross-validation to estimate accuracy when developing the model

Minimum records per branch set at 5 to avoid overfitting

Different amounts of pruning were tried

Boosting not employed

Evaluation using the test set of 84 abstracts
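C5.0 and Clementine are proprietary tools. As a rough analogue only (not the authors' pipeline), the same setup (decision tree, 10-fold cross-validation, a minimum of 5 records per branch) can be sketched with scikit-learn:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(200, 50))  # stand-in binary term matrix
    y = rng.integers(1, 6, size=200)        # stand-in section labels 1..5

    clf = DecisionTreeClassifier(min_samples_leaf=5)  # ~ min records per branch = 5
    scores = cross_val_score(clf, X, y, cv=10)        # 10-fold cross-validation
    print("estimated accuracy: %.1f%%" % (100 * scores.mean()))

min_samples_leaf is scikit-learn's closest analogue to C5.0's minimum-records setting; note that CART (used by scikit-learn) is not the same algorithm as C5.0.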

Page 10

Categorization models

Three models were developed:

Model 1 – indicative words present in the sentence

Model 2 – indicative words + position of sentence

Model 3 – indicative words + position of sentence + indicative words in neighboring sentences (before and after the sentence being categorized)

Page 11

Model 1
Estimated accuracy using 10-fold cross-validation

Word frequency  Number of    Pruning severity
threshold       words input  90%    95%    99%
>5              1463         53.7   53.9   53.9
>10             876          54.4   54.4   53.7
>20             454          56.4   55.6   56.3
>35             242          57.5   57.9   56.2
>50             153          56.5   56.4   55.5
>75             75           51.6   51.0   50.7
>100            44           51.1   50.8   50.1
>125            30           50.7   50.7   50.7
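The word frequency threshold amounts to a vocabulary cut: only words occurring more than N times in the training abstracts are kept as input features. A minimal sketch of such a cut (hypothetical helper):

    from collections import Counter

    def vocabulary(sentences, min_freq):
        # Keep only words occurring more than `min_freq` times overall.
        counts = Counter(word for sent in sentences for word in sent.split())
        return sorted(w for w, c in counts.items() if c > min_freq)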

Page 12

Accuracy of Model 1
Using the test abstracts

Applying the best model (word frequency threshold = 35, pruning = 95%) to the test sample:

For 100 abstracts (including 16 unstructured abstracts): accuracy = 50.04%

For 84 abstracts (not including the 16 unstructured abstracts): accuracy = 60.8%

Preprocessing to filter out unstructured abstracts can improve categorization accuracy substantially

Only high-frequency words are useful for categorizing the sentences

Only a small number of indicator words (20) were selected by the decision tree program

Page 13

Model 2
Estimated accuracy using 10-fold cross-validation

Word frequency  Number of    Sentence position        Pruning severity
threshold       words input  as additional attribute  80%    85%    90%    95%    99%
>35             242          No (Model 1)             57.0   57.9   57.5   57.9   56.2
>35             242          Yes (Model 2)            66.5   66.4   65.1   66.6   65.1

Result using the 84 test abstracts:
accuracy = 71.6% (compared to 60.8% for Model 1)

Sentence position is useful in the sentence categorization.
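The slides do not define the position feature exactly; the rule thresholds (e.g., position <= 0.44) suggest a position normalized to [0, 1]. A minimal sketch under that assumption:

    def sentence_positions(num_sentences):
        # Normalized position of each sentence in its abstract, assumed
        # here to run from 0.0 (first sentence) to 1.0 (last sentence).
        if num_sentences == 1:
            return [0.0]
        return [i / (num_sentences - 1) for i in range(num_sentences)]

    print(sentence_positions(5))  # [0.0, 0.25, 0.5, 0.75, 1.0]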

Page 14

Model 2
2 rules for Section 1: Background

Rule 1 (confidence = 0.61)
Sentence position <= 0.44
Sentence does NOT contain the words: STUDY, EXAMINE, DATA, ANALYSIS, DISSERTA, PARTICIP, INVESTIG, SHOW, SCALE, SECOND, INTERVIE, STATUS, CONDUCT, REVEAL, AGE, FORM, PERCEPTI

Rule 2 (confidence = 0.36)
Sentence position <= 0.44
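Read as a predicate, Rule 1 amounts to the following check (a hypothetical rendering; the stems and threshold are taken from the slide):

    RULE1_STEMS = {"STUDY", "EXAMINE", "DATA", "ANALYSIS", "DISSERTA",
                   "PARTICIP", "INVESTIG", "SHOW", "SCALE", "SECOND",
                   "INTERVIE", "STATUS", "CONDUCT", "REVEAL", "AGE",
                   "FORM", "PERCEPTI"}

    def rule1_background(position, sentence_stems):
        # Fires (confidence 0.61) when the sentence occurs early in the
        # abstract and contains none of the listed word stems.
        return position <= 0.44 and not (RULE1_STEMS & set(sentence_stems))

    print(rule1_background(0.2, ["THE", "ROLE", "OF", "FAMILY"]))  # True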

Page 15

Model 2
9 rules for Section 2: Problem statement

Rule 1 (confidence = 1.0)
Sentence position <= 0.44
Sentence contains: PERCEPTI
Sentence does NOT contain: INTERVIE, COMPLETE

Rule 2 (confidence = 0.88)
Sentence contains: DISSERTA
Sentence does NOT contain: METHOD

Rule 3 (confidence = 0.83)
Sentence contains: EXAMINE
Sentence position <= 0.44

Words used in other rules: INVESTIGATE, EXAMINE, EXPLORE, STUDY, STATUS, SECOND

Page 16

Model 2
Words used in rules

Section 3: Research method
COMPLETE, CONDUCT, FORM, METHOD, INTERVIE, SCALE, PARTICIP, TEST, ASSESS, DATA, ANALYSIS

Section 4: Research results (the default category)
REVEAL, SHOW

Section 5: Concluding remarks
IMPLICAT, FUTURE

Page 17

Model 3

To investigate whether indicator words in neighboring sentences can improve categorization

Model = indicator words + position of sentence + indicator words in neighboring sentences (before and after the sentence being categorized)

Procedure (a sketch follows this list):

Extract indicator words from Model 1 & Model 2 (about 30)

For each sentence, identify the nearest sentence (before and after) containing each indicator word

For each nearest sentence containing an indicator word, calculate the distance (difference in sentence position)
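A minimal sketch of that neighbor lookup (hypothetical helper, not the authors' code): for each sentence, find the nearest earlier and later sentences containing a given indicator word and return the signed distances:

    def nearest_indicator(sentences, idx, word):
        # sentences: list of token sets, one per sentence of the abstract.
        # Returns (before, after): signed sentence-ID distances to the
        # nearest occurrences of `word`, or None if absent on that side.
        before = after = None
        for j in range(idx - 1, -1, -1):          # scan backwards
            if word in sentences[j]:
                before = j - idx                  # negative distance
                break
        for j in range(idx + 1, len(sentences)):  # scan forwards
            if word in sentences[j]:
                after = j - idx                   # positive distance
                break
        return before, after

    abstract = [{"this", "study"}, set(), {"analysis"}, {"study"}]
    print(nearest_indicator(abstract, 3, "analysis"))  # (-1, None)

For the slide's example (Doc 4, Sentence 13), "analysis" occurs nearest before in Sentence 7, giving a distance of -6.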

Page 18

Model 3

Example of indicator words occurring before and after Sentence 13 in Doc 4:

DocID  SentenceID  Indicator word  Location  Neighboring sentence_id  Distance
4      13          analysis        before    7                        -6
4      13          study           before    4                        -9
4      13          study           after     14                        1

3 models investigated:
• Indicator words before the sentence being categorized
• Indicator words after the sentence being categorized
• Indicator words before and after the sentence being categorized

Page 19

Accuracy of Model 2 & Model 3
Based on 84 test abstracts

Model 3 was scored three ways: with all indicator words, with indicator words before the sentence only, and with indicator words after the sentence only.

Section  No. of     Model 2    Model 3      Model 3         Model 3
         sentences  correct    (all words)  (words before)  (words after)
1        173        71.10%     80.92%       79.77%          67.63%
2        183        55.74%     48.63%       52.46%          49.18%
3        189        49.74%     52.38%       52.38%          39.15%
4        468        87.61%     91.03%       91.03%          89.31%
5        29         58.62%     58.62%       58.62%          55.17%
Total    1042       71.59%     73.99%       74.47%          68.62%

Page 20

Examples of Model 3 rules

Rule for Section 1 (confidence = 0.64)
1. Sentence position <= 0.44
2. Sentence does NOT contain: EXAMINE, INTERVIE, EXPLORE, DISSERTA, DATA, MOTHER, COMPARE
3. STUDY and PARTICIPANT do not appear in an earlier sentence

Rule for Section 3 (confidence = 0.818)
1. Sentence position <= 0.5
2. Sentence does NOT contain: STYLE
3. PARTICIPANT appears in an earlier sentence

Page 21

Model 4
Use of sequential association rules

Use of sequential pattern mining to identify sequential associations of the form:

word1 > word2 => section 4
word1 > word2 > section 3 => section 4

Various window sizes can be specified

Initial results are not promising, probably because of the small training sample

One possibility is to use transition probabilities (e.g., section 3 => section 1, section 3 => section 2, section 3 => section 3, section 3 => section 4), applied as a second pass to refine the predictions of the decision tree (a sketch follows).
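A minimal sketch of that second pass, under the assumption that transition probabilities are estimated from label bigrams in the training abstracts (hypothetical; the slides only hint at this):

    from collections import Counter

    def transition_probs(label_sequences):
        # Estimate P(next section | current section) from training
        # abstracts, each given as its ordered list of section labels.
        pair_counts, from_counts = Counter(), Counter()
        for labels in label_sequences:
            for a, b in zip(labels, labels[1:]):
                pair_counts[(a, b)] += 1
                from_counts[a] += 1
        return {(a, b): c / from_counts[a] for (a, b), c in pair_counts.items()}

    train = [[1, 2, 3, 3, 4, 4, 5], [2, 3, 4, 4, 5]]
    probs = transition_probs(train)
    print(probs[(3, 4)])  # 2 of 3 transitions out of section 3 go to section 4

In a second pass, these probabilities could be combined with the decision tree's per-sentence predictions to discourage implausible section sequences.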

Page 22

Conclusion

Decision tree induction was used to develop a ruleset to parse the macro-level discourse structure of sociology dissertation abstracts

Discourse parsing treated as a sentence categorization task

3 models constructed:

Model 1: indicator words present in the sentence (60.8% accuracy)

Model 2: indicator words + sentence position (71.6% accuracy)

Model 3: indicator words + sentence position + indicator words in sentences before the sentence being categorized (74.5% accuracy)

Page 23

Future work

More in-depth error analysis to determine whether some kind of inferencing can improve accuracy

Obtain 2 more manual codings so that inter-indexer consistency can be calculated

"Generalize" the models by replacing word tokens with synonym sets

Investigate use of SVM (support vector machine) as an alternative to decision tree induction

Develop a method to identify and process "unstructured" abstracts

Extend the work to journal article abstracts