Automatic Discourse Parsing of Dissertation Abstracts as Sentence Categorization

Shiyan Ou, Chris Khoo, Dion Goh, Hui-Ying Heng
Division of Information Studies
School of Communication & Information
Nanyang Technological University, Singapore
Objective
To develop an automatic method to parse the discourse structure of sociology dissertation abstracts
To segment a dissertation abstract into five sections (macro-level structure):
1. background
2. problem statement/objectives
3. research method
4. research results
5. concluding remarks
Approach
Discourse parsing is treated as a text categorization problem: assign each sentence to 1 of the 5 sections (categories).
Machine learning method: decision tree induction (rule induction) using C5.0 within the Clementine data mining system
Part of a broader study to develop a method for multi-document summarization of dissertation abstracts
Previous Studies
2 approaches:
Hand-crafted algorithms using lexical and syntactic clues, e.g. Kurohashi & Nagao (1994)
Models developed using supervised machine learning, e.g. Nomoto & Matsumoto (1998), Marcu (1999), Le & Abeysinghe (2003)
Previous Studies (cont.)
Features of previous studies:
Different kinds of text and domain, e.g. news articles, scientific articles
Different discourse models: micro-level structure vs. macro-level structure
Theoretical perspectives, e.g. rhetorical relations
Data Preparation
300 sociology dissertation abstracts from 2001
Training set: 200 abstracts for constructing the classifier
Test set: 100 abstracts for evaluating the accuracy of the classifier
Abstracts were segmented into sentences with a simple program
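The slides describe the segmentation step only as "a simple program"; a minimal sketch of such a rule-based sentence splitter (the regex heuristic here is an assumption, not the authors' actual program):

```python
import re

def split_sentences(abstract: str) -> list[str]:
    """Naive sentence splitter: break on ., ! or ? that is followed by
    whitespace and an uppercase letter. A stand-in for the simple
    segmentation program mentioned in the slides."""
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', abstract.strip())
    return [p for p in parts if p]

text = ("This study examines racial attitudes. "
        "Data were collected in 2001. Results show a clear effect.")
print(split_sentences(text))
```

A heuristic like this mishandles abbreviations ("Dr. Smith"), which is one reason the slides report some segmentation and categorization problems.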
Sentences were manually categorized into 1 of the 5 sections. Some problems:
Some abstracts don't follow the regular structure; these were deleted from the training and test sets (29 from the training set, 16 from the test set)
Some sentences can be assigned to multiple categories
Data Preparation (Cont.)
Sentences were tokenized and words were stemmed using the Conexor parser
Each sentence was converted into a vector of binary term weights using 1 and 0 to indicate whether a word
occurs in the sentence
The dataset was formatted as a table with sentences as rows and words as columns
Data Preparation (cont.)

doc_sen  american  control  data  design  factor  implication  study  racial  ...
2.01     1         1        0     0       0       0            0      0       ...
2.02     1         0        1     1       0       0            0      0       ...
2.03     1         0        0     0       0       1            0      1       ...
2.04     0         0        0     0       1       0            1      0       ...
2.05     1         0        0     0       0       0            0      1       ...
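The binary term-weight representation above can be sketched in a few lines (tokenization and stemming, done in the slides with the Conexor parser, are omitted here):

```python
def binary_vectors(sentences, vocabulary):
    """Convert tokenized sentences into 0/1 vectors indicating which
    vocabulary words occur in each sentence (rows = sentences,
    columns = words)."""
    rows = []
    for tokens in sentences:
        present = set(tokens)
        rows.append([1 if word in present else 0 for word in vocabulary])
    return rows

# Toy example in the spirit of the table above
vocab = ["american", "control", "data", "design"]
sents = [["american", "control"], ["data", "design", "american"]]
print(binary_vectors(sents, vocab))  # → [[1, 1, 0, 0], [1, 0, 1, 1]]
```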
Machine Learning
Using C5.0, a decision tree induction program within the Clementine data mining system
Using 10-fold cross-validation to estimate accuracy when developing the model
Minimum records per branch set at 5 to avoid overfitting
Different amounts of pruning were tried
Boosting not employed
Evaluation using the test set of 84 abstracts
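C5.0 and Clementine are commercial tools, so as an illustration, here is a sketch of the same setup using scikit-learn's CART decision tree as a stand-in (an assumption, not the authors' exact software); min_samples_leaf mirrors the "minimum records per branch set at 5" setting:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 200 "abstract sentences" as binary term
# vectors, labeled with sections 1..5 (the real data is not public here)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 50))
y = rng.integers(1, 6, size=200)

# min_samples_leaf=5 ≈ "minimum records per branch set at 5"
clf = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)

# 10-fold cross-validation to estimate accuracy, as in the slides
scores = cross_val_score(clf, X, y, cv=10)
print(round(scores.mean(), 3))
```

CART prunes differently from C5.0's confidence-based pruning severity, so the "pruning severity" percentages in the following tables have no direct scikit-learn equivalent.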
Categorization Models
Three models were developed:
Model 1 – indicative words present in the sentence
Model 2 – indicative words + position of sentence
Model 3 – indicative words + position of sentence + indicative words in neighboring sentences (before and after the sentence being categorized)
Model 1: Estimated accuracy (%) using 10-fold cross-validation

Word frequency  Number of       Pruning severity
threshold       words input     90%     95%     99%
>5              1463            53.7    53.9    53.9
>10             876             54.4    54.4    53.7
>20             454             56.4    55.6    56.3
>35             242             57.5    57.9    56.2
>50             153             56.5    56.4    55.5
>75             75              51.6    51.0    50.7
>100            44              51.1    50.8    50.1
>125            30              50.7    50.7    50.7
Accuracy of Model 1 using the test abstracts
Applying the best model (word frequency threshold = 35, pruning severity = 95%) to the test sample:
For 100 abstracts (including 16 unstructured abstracts): accuracy = 50.04%
For 84 abstracts (excluding the 16 unstructured abstracts): accuracy = 60.8%
Preprocessing to filter out unstructured abstracts can improve categorization accuracy substantially
Only high-frequency words are useful for categorizing the sentences
Only a small number of indicator words (20) were selected by the decision tree program
Model 2: Estimated accuracy (%) using 10-fold cross-validation

Word frequency  Number of    Sentence position           Pruning severity
threshold       words input  as additional attribute     80%    85%    90%    95%    99%
>35             242          No (Model 1)                57.0   57.9   57.5   57.9   56.2
>35             242          Yes (Model 2)               66.5   66.4   65.1   66.6   65.1

Result using the 84 test abstracts:
accuracy = 71.6% (compared to 60.8% for Model 1)
Sentence position is useful in the sentence categorization.
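The rule thresholds later in the slides (e.g. sentence position <= 0.44) suggest the position attribute is a relative position within the abstract; a minimal sketch, assuming position is normalized to [0, 1] (the exact formula is an assumption):

```python
def relative_position(sentence_index: int, n_sentences: int) -> float:
    """Relative position of a sentence within its abstract, scaled to
    [0, 1] so that position thresholds are comparable across abstracts
    of different lengths (hypothetical normalization)."""
    if n_sentences <= 1:
        return 0.0
    return sentence_index / (n_sentences - 1)

# sentence at 0-based index 2 in a 10-sentence abstract
print(round(relative_position(2, 10), 3))  # → 0.222
```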
Model 2: 2 rules for Section 1 (Background)

Rule 1 (confidence = 0.61):
sentence position <= 0.44
sentence does NOT contain the words: STUDY, EXAMINE, DATA, ANALYSIS, DISSERTA, PARTICIP, INVESTIG, SHOW, SCALE, SECOND, INTERVIE, STATUS, CONDUCT, REVEAL, AGE, FORM, PERCEPTI

Rule 2 (confidence = 0.36):
sentence position <= 0.44
Model 2: 9 rules for Section 2 (Problem statement)

Rule 1 (confidence = 1.0):
sentence position <= 0.44
sentence contains: PERCEPTI
sentence does NOT contain: INTERVIE, COMPLETE

Rule 2 (confidence = 0.88):
sentence contains: DISSERTA
sentence does NOT contain: METHOD

Rule 3 (confidence = 0.83):
sentence contains: EXAMINE
sentence position <= 0.44

Words used in other rules: INVESTIGATE, EXAMINE, EXPLORE, STUDY, STATUS, SECOND
Model 2: Words used in rules

Section 3 (Research method): COMPLETE, CONDUCT, FORM, METHOD, INTERVIE, SCALE, PARTICIP, TEST, ASSESS, DATA, ANALYSIS
Section 4 (Research results): REVEAL, SHOW (default category)
Section 5 (Concluding remarks): IMPLICAT, FUTURE
Model 3
To investigate whether indicator words in neighboring sentences can improve categorization
Model = indicator words + position of sentence + indicator words in neighboring sentences (before and after the sentence being categorized)

Procedure:
Extract indicator words from Model 1 & Model 2 (about 30)
For each sentence, identify the nearest sentence (before and after) containing each indicator word
For each nearest sentence containing an indicator word, calculate the distance (difference in sentence position)
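The procedure above can be sketched as follows (the data layout is an assumption; sentences are represented as sets of stemmed words). The example reproduces the Doc 4 case from the slides: sentence 13 with the indicator words "analysis" and "study":

```python
def nearest_indicator_distances(sentences, target_idx, indicator_words):
    """For the sentence at target_idx, find the nearest earlier and
    nearest later sentence containing each indicator word, and the
    signed distance in sentence positions (negative = before)."""
    records = []
    for word in indicator_words:
        before = after = None
        for i, tokens in enumerate(sentences):
            if word in tokens:
                if i < target_idx:
                    before = i          # later hits overwrite: keeps nearest before
                elif i > target_idx and after is None:
                    after = i           # first hit after target is the nearest
        if before is not None:
            records.append((word, "before", before, before - target_idx))
        if after is not None:
            records.append((word, "after", after, after - target_idx))
    return records

# Doc 4 example: "analysis" in sentence 7, "study" in sentences 4 and 14
sents = [set() for _ in range(16)]
sents[7] = {"analysis"}
sents[4] = {"study"}
sents[14] = {"study"}
records = nearest_indicator_distances(sents, 13, ["analysis", "study"])
print(records)
# → [('analysis', 'before', 7, -6), ('study', 'before', 4, -9), ('study', 'after', 14, 1)]
```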
Model 3
Example of indicator words occurring before and after Sentence 13 in Doc 4:

DocID  SentenceID  Indicator word  Location  Neighboring sentence ID  Distance
4      13          analysis        before    7                        -6
4      13          study           before    4                        -9
4      13          study           after     14                       1

3 models investigated:
Indicator words before the sentence being categorized
Indicator words after the sentence being categorized
Indicator words before and after the sentence being categorized
Accuracy of Model 2 & Model 3, based on 84 test abstracts
(percentage of sentences correctly classified)

Section  No. of     Model 2  Model 3:           Model 3:         Model 3:
         sentences           all indicator      indicator words  indicator words
                             words              before           after
1        173        71.10%   80.92%             79.77%           67.63%
2        183        55.74%   48.63%             52.46%           49.18%
3        189        49.74%   52.38%             52.38%           39.15%
4        468        87.61%   91.03%             91.03%           89.31%
5        29         58.62%   58.62%             58.62%           55.17%
Total    1042       71.59%   73.99%             74.47%           68.62%
Examples of Model 3 rules

Rule for Section 1 (confidence = 0.64):
1. Sentence position <= 0.44
2. Sentence does NOT contain: EXAMINE, INTERVIE, EXPLORE, DISSERTA, DATA, MOTHER, COMPARE
3. STUDY and PARTICIPANT do not appear in an earlier sentence

Rule for Section 3 (confidence = 0.818):
1. Sentence position <= 0.5
2. Sentence does NOT contain: STYLE
3. PARTICIPANT appears in an earlier sentence
Model 4: Use of sequential association rules
Use of sequential pattern mining to identify sequential associations of the form:
word1 > word2 => section 4
word1 > word2 > section 3 => section 4
Various window sizes can be specified
Initial results are not promising, probably because of the small training sample
1 possibility is to use transition probabilities:
section 3 => section 1
section 3 => section 2
section 3 => section 3
section 3 => section 4
Applied as a second pass to refine the predictions of the decision tree
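The transition-probability idea can be sketched as follows: estimate P(next section | current section) from the training labels, then use those probabilities in a second pass over the decision tree's predictions. The slides only outline this, so the estimation below is a minimal reading, not the authors' implementation:

```python
from collections import defaultdict

def transition_probs(label_sequences):
    """Estimate P(next_section | current_section) from the section
    labels of training abstracts (each sequence = one abstract)."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in label_sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: n / sum(nxt.values()) for b, n in nxt.items()}
            for a, nxt in counts.items()}

# Toy training labels: two abstracts following the 5-section structure
train = [[1, 2, 3, 3, 4, 4, 5], [1, 2, 3, 4, 4, 5]]
probs = transition_probs(train)
print(probs[3])  # → {3: 0.3333333333333333, 4: 0.6666666666666666}
```

A second pass could then, for example, re-label a low-confidence prediction whose transition from the previous section is improbable.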
Conclusion
Decision tree induction used to develop a ruleset to parse the macro-level discourse structure of sociology dissertation abstracts
Discourse parsing treated as a sentence categorization task
3 models constructed:
Model 1: indicator words present in the sentence (60.8% accuracy)
Model 2: indicator words + sentence position (71.6% accuracy)
Model 3: indicator words + sentence position + indicator words in sentences before the sentence being categorized (74.5% accuracy)
Future Work
More in-depth error analysis to determine whether some kind of inferencing can improve accuracy
Obtain 2 more manual codings so that inter-indexer consistency can be calculated
"Generalize" the models by replacing word tokens with synonym sets
Investigate use of SVM (support vector machine) as an alternative to decision tree induction
Develop a method to identify and process "unstructured" abstracts
Extend the work to journal article abstracts