
Journal of Intelligent Information Systems, 23:1, 47–65, 2004. © 2004 Kluwer Academic Publishers. Printed in the United States.

An Evaluation of Passage-Based Text Categorization

JINSUK KIM [email protected]
Center for Computational Biology & Bioinformatics, Korea Institute of Science and Technology Information, P.O. Box 122, Yuseong-gu, Daejeon, Republic of Korea 305-600

MYOUNG HO KIM [email protected]
Department of Electrical Engineering & Computer Science, Korea Advanced Institute of Science and Technology, 373-1, Guseong-dong, Yuseong-gu, Daejeon, Republic of Korea 305-701

Received July 24, 2002; Revised October 2, 2003; Accepted October 24, 2003

Abstract. Research in text categorization has been confined to whole-document-level classification, probably due to a lack of full-text test collections. However, the full-length documents available today in large quantities renew interest in text classification. A document is usually written in an organized structure to present its main topic(s). This structure can be expressed as a sequence of subtopic text blocks, or passages. To reflect the subtopic structure of a document, we propose a new passage-level, or passage-based, text categorization model, which segments a test document into several passages, assigns categories to each passage, and merges the passage categories into document categories. Compared with traditional document-level categorization, this model requires two additional steps: passage splitting and category merging. Using four subsets of the Reuters text categorization test collection and a full-text test collection whose documents vary from tens of kilobytes to hundreds, we evaluate the proposed model, especially the effectiveness of various passage types and the importance of passage location in category merging. Our results show that simple windows perform best on all test collections used in these experiments. We also found that passages contribute to the main topic(s) to different degrees, depending on their location in the test document.

Keywords: text categorization, passage, non-overlapping window, overlapping window, paragraph, bounded-paragraph, page, TextTile, passage weight function

1. Introduction

Text categorization, the task of assigning one or more predefined categories to a document, is an active research field in information retrieval and machine learning. Research interest in text categorization, however, has been confined to problems such as feature extraction, feature selection, supervised learning algorithms, and hypertext classification. Traditional categorization systems, or classifiers, have treated the whole document as the categorization unit, and there is little research on the input units of classifiers. The emergence of full-length documents in large quantities today, such as word-processor files, full-text SGML/XML/HTML documents, and PDF/PostScript files, challenges traditional categorization models that process the whole document as an input unit. As an alternative access method, we regard each document as a set of passages, where a passage is a contiguous segment of text. In information retrieval, the introduction of passages dates back to the early 1990s (Hearst and Plaunt, 1993; Callan, 1994), and various types of passages have been proposed and tested for document retrieval effectiveness (Callan, 1994; Hearst, 1994; Kaszkiel et al., 1999; Kaszkiel and Zobel, 2001).

The large quantity of full-length documents available today renews interest in passages in the area of text categorization as well as in information retrieval.

In this article, we propose a new text categorization model in which a test document is split into passages, categorization is performed on each passage, and then the document's categories are merged from the passages' categories. We name the proposed model Passage-based Text Categorization, in contrast to traditional categorization tasks, which are performed at the document level. In our experimental results, we compare the effectiveness of several passage types in text categorization using a kNN (k-Nearest Neighbor) classifier (Yang, 1994). For a test collection consisting of very long documents, we find that the use of passages improves effectiveness by about 10% for all passage types used in our experiments, compared with document-level categorization. For collections of rather short documents, such as newswires, there is an improvement of about 5% as well.

This paper introduces passage-based text categorization in Section 2. The evaluation measures and the data sets used in the experiments are then explained in Sections 3 and 4, respectively. The experimental results are given in Section 5. Finally, we conclude in Section 6.

2. Passage-based text categorization model

Generally, a document is deliberately structured as a sequence of subtopical discussions that occur in the context of one or more main topic discussions (Hearst, 1994). If this is true, it is natural to treat a document as a sequence of subtopic blocks of any unit, such as sentences, paragraphs, sections, or contiguous text segments. Motivated by this, we propose a new text categorization model, passage-based text categorization, which is shown in figure 1.

The primary difference in passage usage between information retrieval and text categorization is the target documents. In information retrieval, the documents stored in the database are split into passages, and queries are evaluated for their similarities against these passages (Callan, 1994; Hearst and Plaunt, 1993; Salton et al., 1993; Zobel et al., 1995). In text categorization, on the other hand, the document to categorize is split into passages, and the categorization task is applied to these passages instead of to the parent document. As shown in figure 1, the passage-based categorization system splits the test document into several passages, classifies each passage into categories, and determines the test document's categories by merging all passage categories. This procedure includes two additional steps, passage splitting and category merging, compared with traditional document-level classification systems. These two steps are the topics of Sections 2.1 and 2.2, respectively.
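As a minimal sketch (ours, not part of any system described in this paper), the whole model reduces to three function calls in Python; split_into_passages, classify, and merge are hypothetical placeholders for the concrete passage splitter, per-passage classifier, and category merger discussed in Sections 2.1 and 2.2:

    def categorize_by_passages(document, split_into_passages, classify, merge):
        """Passage-based categorization as in figure 1 (a hypothetical sketch)."""
        passages = split_into_passages(document)              # step 1: passage splitting
        passage_categories = [classify(p) for p in passages]  # step 2: per-passage classification
        return merge(passage_categories)                      # step 3: category merging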

2.1. Passages in text categorization

In regard to passage splitting in figure 1, the definition of a passage is the key point of this step. Since a passage is defined as any sequence of text from a document (Kaszkiel and Zobel, 2001), many types of passages have been used in document retrieval.


Figure 1. Comparison of document-based (upper) and passage-based (lower) text categorization models. [In the passage-based model, a Passage Splitter divides the test document into passages 1 through n, a Text Classifier assigns categories to each passage, and a Category Merger combines the passage categories into the document categories; in the traditional document-based model, the Text Classifier assigns categories to the whole test document directly.]

Callan (1994) grouped these passage types into three classes: discourse passages, semantic passages, and window passages.

2.1.1. Discourse passages. Discourse passages are based on logical components of documents, such as sentences and paragraphs (Callan, 1994; Hearst, 1994). This passage definition is intuitive, because discourse boundaries organize material by content.

There are three problems with discourse passages. First, there is no guarantee that documents have discourse consistency across authors (Callan, 1994). Second, it is sometimes impossible to build discourse passages, because many documents are supplied without passage demarcation (Kaszkiel and Zobel, 2001). Finally, the lengths of discourse passages can vary from very long to very short (Kaszkiel and Zobel, 2001).

As an alternative solution to the first and last problems, Callan (1994) suggested a passage type known as bounded-paragraphs. As the name implies, when building bounded-paragraphs, short paragraphs are merged with subsequent paragraphs, while paragraphs longer than some minimum length are kept intact. Callan (1994) used 50 words as the minimum length of a bounded-paragraph.
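The construction can be illustrated with a short Python sketch. This is our reading of the description above, not Callan's implementation: the function name is ours, the rule of folding a trailing short block into the last block is our assumption, and the 200-word upper bound mentioned later in Section 5.3.1 is not enforced here.

    def bounded_paragraphs(paragraphs, min_words=50):
        """Merge short paragraphs into subsequent ones until each block
        holds at least min_words words; longer paragraphs stay intact."""
        blocks, current = [], []
        for para in paragraphs:
            current.extend(para.split())
            if len(current) >= min_words:
                blocks.append(" ".join(current))
                current = []
        if current and blocks:      # trailing short block: fold into the last block
            blocks[-1] += " " + " ".join(current)
        elif current:
            blocks.append(" ".join(current))
        return blocks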

2.1.2. Semantic passages. As discussed above, discourse passages may be inconsistent or may be impractical to build due to the poor structure of the source document.


An alternative approach is to split a document into semantic passages, each corresponding to a topic or subtopic. Several algorithms to partition documents into such segments have been proposed and developed (Kaszkiel and Zobel, 2001). One such algorithm, known as TextTiling (Hearst, 1994), partitions full-length documents into coherent multi-paragraph segments, called TextTiles or simply tiles, which represent the subtopic structure of a document.

TextTiling splits a document into small text blocks and computes the similarities of all adjacent blocks based on term frequencies. Boundaries between two blocks that show relatively low similarity are regarded as boundaries between two adjacent tiles, while blocks with high similarity are merged into a tile.
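A toy Python illustration of this block-comparison idea follows. It is not the TextTiling algorithm itself: the real algorithm smooths the similarity curve and places boundaries at depth minima, whereas the fixed threshold here is an assumption made only for brevity.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity of two term-frequency vectors (Counters)."""
        dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def tile_boundaries(blocks, threshold=0.1):
        """Mark a tile boundary wherever adjacent blocks are dissimilar."""
        vectors = [Counter(b.lower().split()) for b in blocks]
        return [i + 1 for i in range(len(vectors) - 1)
                if cosine(vectors[i], vectors[i + 1]) < threshold]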

2.1.3. Window passages. While discourse and semantic passages are based on structural properties of documents, an alternative approach, called window passages, is based on sequences of words. This approach is computationally simple and can be applied to documents without explicit structural properties as well as to well-structured documents.

Hearst and Plaunt (1993) segmented documents into even-sized blocks, each corresponding to a fixed-length sequence of words and starting just after the end of the previous block. Accordingly, there is no shared region between two adjacent blocks, and this passage type is thus referred to as non-overlapping windows. Callan (1994) partitioned documents into overlapping windows, where two adjacent segments share words at the boundary. In our experiments, the second half of an overlapping window is shared with the following window.
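Both window types can be stated precisely in a few lines of Python. The sketch below follows the definitions above, with the overlapping window advancing by half the window size so that each window's second half is shared with its successor; the function names are ours.

    def non_overlapping_windows(words, size=100):
        """Fixed-length word windows, each starting where the previous ended."""
        return [words[i:i + size] for i in range(0, len(words), size)]

    def overlapping_windows(words, size=100):
        """Windows whose second half is shared with the following window."""
        step = size // 2
        if len(words) <= size:          # short document: a single window
            return [words]
        return [words[i:i + size] for i in range(0, len(words) - step, step)]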

Another window passage type is the page (Moffat et al., 1994). Pages are similar to bounded-paragraphs, but pages are bounded in length by physical size in bytes, while bounded-paragraphs are bounded by number of words. A page's minimum length is 1.0 kb (Moffat et al., 1994).

2.2. Selecting document categories from passage categories

Usually a document is written in an organized manner to convey the author's intention. For example, a newspaper article may place its main topic in the title and early part to catch its readers' attention. A science article, on the other hand, may place its conclusion at the end. Therefore, depending on their location, passages from a document may contribute to the document's main topic(s) to different degrees.

In our passage-based text categorization model, a passage's degree of contribution to the document categories is expressed as a passage weight function. We chose the six passage weight functions shown in Table 1 in Section 3.3. After categories are assigned to all passages from a document, passage weights are computed by the passage weight function, and a category's weight is summed from the weights of the passages assigned to that category. Categories with weights higher than some predefined value are assigned to the document as the final result. More detailed procedures are described in Section 3.3.


Table 1. Passage weight functions.

Functions†                                  Weighting tendency

pwf_1(p) = 1                                Head = Body = Tail
pwf_2(p) = p^{-1}                           Head ≫ Body ≫ Tail
pwf_3(p) = p                                Head < Body < Tail
pwf_4(p) = sqrt( (p − n/2)^2 )              Head = Tail > Body
pwf_5(p) = sqrt( (n/2)^2 − (p − n/2)^2 )    Head = Tail < Body
pwf_6(p) = (log(p + 1))^{-1}                Head > Body > Tail

† Normalization factors are omitted for clarity. p: passage location; n: total number of passages.

3. Evaluation measures

3.1. Similarity measures

To assess the effectiveness of the various passaging methods, a kNN (k-nearest neighbor) classifier (Yang, 1994) was used as the document-level classifier. As an example-based classifier (Yang, 1994; Sebastiani, 2002), the kNN classifier has many similarities with traditional information retrieval systems. Our kNN classifier was built on an information retrieval system, KRISTAL-II, which was developed by KISTI's Information Systems Group1 to manage and retrieve semi-structured texts such as bibliographies, theses, and journal articles.

To retrieve the k top-ranked documents, the kNN classifier uses a vector-space similarity measure Sim(q, d) between query document q and target document d, which is defined to be

    Sim(q, d) = (1 / W_d) · Σ_{t ∈ q ∧ d} ( w_{q,t} · w_{d,t} + min(f_{d,t}, f_{q,t}) )    (1)

with:

    w_{d,t} = log(f_{d,t} + 1) · log(N / f_t + 1)
    w_{q,t} = log(f_{q,t} + 1) · log(N / f_t + 1)
    W_d = log( Σ_{t ∈ d} f_{d,t} )

where f_{x,t} is the frequency of term t in document x; N is the total number of documents; min(x, y) is the smaller of x and y; f_t is the number of documents in which term t occurs more than once; w_{x,t} is the weight of term t in query or document x; and W_d represents the length of document d.

Equation (1) is an empirically derived TF·IDF variant of the traditional vector-based information retrieval schemes (Witten et al., 1999), which have been widely used for their robustness and simplicity. The most noticeable modification in Eq. (1), compared with traditional vector schemes, is the introduction of the expression min(f_{d,t}, f_{q,t}), which directly reflects the term frequency in the query or document in the query-document similarity. With this expression, categorization performance is slightly better than with traditional vector-space similarity measures (data not shown). The summation of all min(f_{d,t}, f_{q,t}) in Eq. (1) is the total frequency of terms that co-occur in the query and the target document. The sum of min(f_{d,t}, f_{q,t}) thus reflects, however indirectly, the terms' co-occurrence information in the similarity measure, which may result in improved performance.
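A direct Python transcription of Eq. (1), assuming raw term-frequency Counters and a precomputed document-frequency table; the guard on W_d (which is the log of the document length and would be zero for a one-term document) is our addition, not part of the original formulation.

    import math
    from collections import Counter

    def similarity(query_tf, doc_tf, df, n_docs):
        """Eq. (1): TF.IDF similarity with the min(f_dt, f_qt) co-occurrence term.
        query_tf, doc_tf: Counters of raw term frequencies; df: term -> f_t."""
        doc_len = sum(doc_tf.values())
        w_d = math.log(doc_len) if doc_len > 1 else 1.0   # W_d; guard against log(1) = 0
        score = 0.0
        for t in query_tf.keys() & doc_tf.keys():         # terms in both q and d
            idf = math.log(n_docs / df[t] + 1)
            w_qt = math.log(query_tf[t] + 1) * idf        # w_{q,t}
            w_dt = math.log(doc_tf[t] + 1) * idf          # w_{d,t}
            score += w_qt * w_dt + min(doc_tf[t], query_tf[t])
        return score / w_d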

3.2. Performance measures

To evaluate the categorization effectiveness of various passages, we use the standard definitions of precision (p) and recall (r) as basic performance measures:

    p = (categories relevant and retrieved) / (categories retrieved)
    r = (categories relevant and retrieved) / (categories relevant)

Along with precision and recall, many other studies in text categorization have used the F1 measure as a performance measure. The F1 measure (van Rijsbergen, 1979) is the harmonic mean of precision and recall, defined as

    F1 = 2pr / (p + r)

The point at which precision equals recall is called the precision and recall break-even point, or simply the break-even point (BeP). Since, theoretically, the BeP is always less than or equal to the F1 measure at any point, the BeP is usually used to compare effectiveness across different kinds of classifiers or categorization methods (Yang, 1999; Sebastiani, 2002). We present precision, recall, and BeP as measures of categorization effectiveness; where a BeP cannot be obtained, the F1 measure is presented instead. To average precision and recall across categories, we used the micro-averaging method (Sebastiani, 2002).
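Micro-averaging pools the per-category decision counts before computing the ratios, as in the sketch below; the triple-based input format is our assumption for illustration.

    def micro_averaged(per_category_counts):
        """per_category_counts: iterable of (true_positives, retrieved, relevant)
        triples, one per category. Returns micro-averaged (p, r, F1)."""
        tp = sum(c[0] for c in per_category_counts)
        retrieved = sum(c[1] for c in per_category_counts)
        relevant = sum(c[2] for c in per_category_counts)
        p = tp / retrieved if retrieved else 0.0
        r = tp / relevant if relevant else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1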

3.3. Category relevance measure

In document-level categorization, to determine whether a document d belongs to a category c_j ∈ C = {c_1, c_2, ..., c_{|C|}}, our kNN classifier retrieves the k training documents most similar to d and computes c_j's weight by adding up the similarities between d and the retrieved documents that belong to c_j; if the weight is large enough, the decision is taken to be positive, and negative otherwise. Category c_j's weight for document d is called the category relevance score, Rel(c_j, d), and is computed as follows (Yang et al., 2002):

    Rel(c_j, d) = Σ_{d' ∈ R_k(d) ∩ D_j} Sim(d', d)    (2)

where R_k(d) is the set of the k nearest neighbors (the top-ranked training documents, from the 1st to the kth) of document d, D_j is the set of training documents assigned category c_j, and Sim(d', d) is the document-document similarity obtained by Eq. (1) in Section 3.1. For each test document, the categories with relevance scores greater than a given threshold are assigned to the document.
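In code, Eq. (2) is a rank-then-sum over the k nearest training documents. The sketch below is ours: labels[i] is assumed to hold the set of categories of the i-th training document, and sim is any query-document similarity such as Eq. (1).

    from collections import defaultdict

    def relevance_scores(test_doc, training_docs, labels, sim, k):
        """Eq. (2): sum similarities of the k nearest neighbors per category."""
        scored = sorted(((sim(test_doc, d), cats)
                         for d, cats in zip(training_docs, labels)),
                        key=lambda pair: pair[0], reverse=True)
        rel = defaultdict(float)
        for s, cats in scored[:k]:   # the k nearest neighbors R_k(d)
            for c in cats:
                rel[c] += s          # accumulate into Rel(c_j, d)
        return dict(rel)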

In passage-level categorization, the same procedure as in the document-level categorization task is applied, and categories are assigned to each passage p_i of the test document d. However, since d's categories are not yet determined, relevance scores for all candidate categories are computed from the categories of all passages as follows:

    Rel(c_j, d) = Σ_{p_i ∈ P_j} pwf_n(i)    (3)

where p_i is the i-th passage of document d, P_j is the set of passages assigned category c_j, and pwf_n() is one of the passage weight functions shown in Table 1. As in document-level categorization, categories with relevance scores greater than a given threshold are assigned to the document.

As stated in Section 2.2, we use six passage weight functions (pwfs), which are functions of passage location and return a value between 0 and 1 (for clarity, the normalization factors are omitted in Table 1). The weighting tendency of each pwf is also shown, dividing a document roughly into Head, Body, and Tail.
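The six functions of Table 1 and the merging rule of Eq. (3) translate directly into code. In this sketch the normalization factors omitted in Table 1 are omitted as well, so the weights are unnormalized; the dictionary layout and names are our assumptions.

    import math
    from collections import defaultdict

    # Table 1, without normalization: p is the 1-based passage location,
    # n the total number of passages in the document.
    PWFS = {
        1: lambda p, n: 1.0,                      # uniform
        2: lambda p, n: 1.0 / p,                  # strongly head-weighted
        3: lambda p, n: float(p),                 # tail-weighted
        4: lambda p, n: abs(p - n / 2),           # head and tail over body
        5: lambda p, n: math.sqrt(max((n / 2) ** 2 - (p - n / 2) ** 2, 0.0)),  # body over ends
        6: lambda p, n: 1.0 / math.log(p + 1),    # mildly head-weighted
    }

    def merge_passage_categories(passage_categories, pwf):
        """Eq. (3): a category's relevance is the summed weight of the
        passages assigned to it."""
        n = len(passage_categories)
        rel = defaultdict(float)
        for p, cats in enumerate(passage_categories, start=1):
            for c in cats:
                rel[c] += pwf(p, n)
        return dict(rel)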

4. Data sets

We used the Reuters version 3 collection (Apte et al., 1994), three of its subsets named GT800, GT1200, and GT1600, and the KISTI-Theses collection to verify the effectiveness of passages in text categorization. The Reuters version 3 collection was constructed by Apte et al. (1994). They removed all unlabeled documents from both the training and test sets and restricted the categories to those with a training-set frequency of at least two. After the name of its constructors, we will call this data set the Apte collection hereafter. The GTnnnn test collections were constructed from the Apte data set by removing from the test set all documents shorter than nnnn bytes. This restriction resulted in test sets of 1,109, 652, and 410 documents for GT800, GT1200, and GT1600, respectively.

Page 8: An Evaluation of Passage-Based Text Categorization

54 KIM AND KIM

Finally, the KISTI-Theses data set was constructed from master's and doctoral theses of KAIST,2 POSTECH,3 and CNU,4 which were submitted in electronic form in partial fulfillment of graduation requirements. These theses amount to 1,042 documents from 22 departments. We regarded the 22 departments as categories and selected one third of the documents as a test set (347 documents) and the remaining two thirds as a training set (695 documents). The majority of the documents are written in Hangul (Korean text).

The category distributions for the KISTI-Theses data set are listed below in 〈category, #te, #tr, sum〉 format, where #te is the number of documents in the test set, #tr is the number of documents in the training set, and sum is the sum of #te and #tr:

〈Advanced Materials Engineering, 2, 5, 7〉
〈Aerospace Engineering, 6, 11, 17〉
〈Automation & Design Technology, 1, 4, 5〉
〈Biology, 23, 39, 62〉
〈Chemical Engineering, 27, 47, 74〉
〈Chemistry, 25, 49, 74〉
〈Civil Engineering, 9, 17, 26〉
〈Computer Science, 24, 49, 73〉
〈Electrical Engineering, 48, 102, 150〉
〈Environmental Engineering, 5, 6, 11〉
〈Industrial Design, 1, 4, 5〉
〈Industrial Engineering, 14, 29, 43〉
〈Information & Communication Engineering, 6, 15, 21〉
〈Management Engineering, 44, 90, 134〉
〈Materials Science, 20, 33, 53〉
〈Mathematics, 7, 21, 28〉
〈Mechanical Engineering, 45, 98, 143〉
〈Metal Engineering, 11, 17, 28〉
〈Miscellaneous, 0, 1, 1〉
〈Nuclear Engineering, 7, 19, 26〉
〈Physics, 15, 27, 42〉
〈Steel Engineering, 7, 12, 19〉

The characteristics of the data sets used in these experiments are shown in Table 2.

Table 2. Characteristics of the test collections used in this work.

Collection                Apte    GT800   GT1200   GT1600   KISTI-Theses
Test set                  3309    1019    652      410      347
Training set              7789    7789    7789     7789     695
Category count            93      93      93       93       22
Minimal text size (kb)    0.1     0.8     1.2      1.6      14.8
Average text size (kb)    0.8     1.8     2.2      2.7      92.9


5. Experiments

5.1. Experimental settings

Terms separated by space characters were extracted as features from documents and passages. Stemming was not applied to the terms, since it hurt effectiveness in our experiments (data not shown), as shown in previous research such as Baker and McCallum (1998). Common words (also known as stopwords) were removed from the feature pool. We also removed digits and numeric values from the feature pool; removing digits slightly improved categorization effectiveness for the Apte data set and its three subsets (data not shown).

An additional step was applied to the KISTI-Theses data set. Since the majority of its documents are written in Hangul (Korean text), we applied a Hangul morpheme analyzer to the Hangul terms and included the resulting morphemes in the feature pool.5

Feature selection was applied according to the terms' document frequencies (DF). Yang and Pedersen (1997) showed that DF is a simple, effective, and reliable thresholding measure for selecting features in text categorization. In our experiments, by varying the DF range during feature selection, we chose the minimal DF (DF_min) and maximal DF (DF_max) that performed best in the document-level categorization task for each test collection. The same DF_min and DF_max were then applied to the passage-level categorization tasks.
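DF thresholding amounts to a single pass over the training documents, as in this sketch; the names and the set-based document representation are our assumptions. The cut-offs tuned at the document level are then reused unchanged for the passage-level runs.

    from collections import Counter

    def select_by_df(doc_term_sets, df_min, df_max):
        """Keep terms whose document frequency lies in [df_min, df_max].
        doc_term_sets: one set of distinct terms per training document."""
        df = Counter(t for terms in doc_term_sets for t in terms)
        return {t for t, f in df.items() if df_min <= f <= df_max}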

In regard to the k value of our kNN classifier, the best k value for document-level categorization was also chosen for the passage-level tasks. For the Apte, GT800, GT1200, and GT1600 data sets, the k value was selected to be 10, and for the KISTI-Theses collection, k = 1 was chosen. That k = 1 works best for the KISTI-Theses collection is very interesting; it is probably due to the fact that only one category is assigned to each document and each document may be long enough to describe the retrieved category. Further discussion is given in Section 5.3.5.

Table 3 shows the average number of passages per test document and the average passage length for each data set. The data for both non-overlapping and overlapping windows are shown for a passage size of 100 words. Since a page is bounded to a minimal length of 1.0 kb, we think it is meaningless to apply the page type to short documents of 1 or 2 kb. We therefore did not test the effectiveness of the page type on the Apte collection and its three subsets.

Table 3. Average passage number and length.

Passage type              Apte        GT800       GT1200      GT1600      Theses
Non-overlapping window    1.8 (0.5)   2.2 (0.5)   4.2 (0.6)   4.7 (0.6)   114.0 (0.8)
Overlapping window        2.4 (0.5)   5.1 (0.5)   6.5 (0.6)   8.0 (0.6)   226.4 (0.8)
Paragraph                 7.1 (0.1)   10.8 (0.2)  12.8 (0.2)  15.6 (0.2)  90.8 (1.0)
Bounded-paragraph         2.3 (0.4)   2.8 (0.6)   5.4 (0.4)   6.6 (0.4)   72.3 (1.3)
Page                      N/A         N/A         N/A         N/A         56.9 (1.6)
TextTile                  1.9 (0.5)   3.1 (0.6)   3.5 (0.6)   3.8 (0.7)   64.4 (1.4)

The average number of passages per document is shown, with the average passage length in kilobytes in parentheses.



5.2. Effectiveness of passage weight functions

As stated in Section 2.2, we assume that the location of a passage plays an important role in determining the parent document's main topic(s). To reflect this assumption, we introduced six different passage weight functions, which are functions of passage location and return passage weights between 0 and 1 (see Table 1). By summing these passage weights, the document's categories are assigned, as shown in figure 1 in Section 2.

Micro-averaged break-even points for the GT1600 collection are plotted against the various passage types in figure 2, where the trends for the six passage weight functions can be seen. For the GT1600 collection, passage weight functions 2 (pwf_2) and 6 (pwf_6) show the best performance. These two functions return high weights for passages from the Head part, middle weights for the Body, and low weights for the Tail (Table 1). pwf_3 has the reverse pattern and shows the worst performance. The situation is the same for the Apte, GT800, and GT1200 collections (data not shown). This means that the main topics of the Reuters news articles are determined mainly by the early part of the document.

On the other hand, the KISTI-Theses data set presents a very different picture (figure 3). pwf_2 shows the worst performance for the KISTI-Theses data set, although it was best for the newswire data, i.e., the Apte collection and its three subsets. Furthermore, the weighting tendencies of pwf_2 and pwf_6 are very similar (Table 1), yet pwf_6 performs well for several passage types while pwf_2 is poor for all passage types. This makes the interpretation very difficult. It is clear, however, that different passage weighting functions should be applied to different document types.

Figure 2. Effectiveness of the six passage weighting functions (data set: GT1600). [Micro-averaged BeP of pwf_1 through pwf_6 plotted against passage type: non-overlapping window (100), overlapping window (100), paragraph, bounded-paragraph, and TextTile.]


Figure 3. Effectiveness of the six passage weighting functions (data set: KISTI-Theses). [Micro-averaged BeP of pwf_1 through pwf_6 plotted against passage type: non-overlapping window (100), overlapping window (100), paragraph, bounded-paragraph, page, and TextTile.]

Note that pwf_1 always returns 1.0 for any passage location, while the other pwfs return variable values depending on the passage's location in the document (Table 1). In figure 2, however, pwf_2 and pwf_6 outperform pwf_1, and in figure 3, pwf_3 and pwf_6 outperform pwf_1. These results support our assumption that each passage contributes to the document's main topic(s) to a different degree.

5.3. Effectiveness of passages

5.3.1. The Apte collection. The experimental results for the Apte test collection are shown in Table 4. Micro-averaged precision, recall, and break-even point (BeP) obtained with passage weight function 2 (pwf_2) are presented as the performance measures. See Section 5.2 for a comparison of the six pwfs' effectiveness. The differences, Δ%, are performance improvements in percent compared with document-based categorization; they are based on break-even points. The best BeP among all passage types is marked with an asterisk (*).

Since the Apte, GT800, GT1200, and GT1600 test collections consist of rather short documents (see Table 2), the page type was not applied to these data sets. We also slightly modified TextTiling to yield fine-grained tiles, because standard TextTiling is too coarse-grained a segmentation algorithm to apply to these data sets. The average size of the tiles produced by the modified TextTiling algorithm is 98 words, corresponding to 454 bytes, while the average size of standard TextTiles is 1 or 2 kilobytes (see Table 3).


Table 4. Effectiveness of passages for the Apte data set with pwf_2.

                              Precision   Recall   BeP†      Δ%
Document                      0.818       0.818    0.818     0.0
Non-overlapping windows
  Window size = 50            0.817       0.817    0.817    −0.1
  Window size = 100           0.820       0.821    0.820     0.3
  Window size = 150           0.822       0.822    0.822     0.5
  Window size = 200           0.819       0.824    0.822‡    0.4
Overlapping windows
  Window/overlap = 50/25      0.820       0.820    0.820     0.3
  Window/overlap = 100/50     0.823       0.823    0.823*    0.6
  Window/overlap = 150/75     0.823       0.823    0.823*    0.6
  Window/overlap = 200/100    0.822       0.822    0.822     0.4
Paragraphs                    0.812       0.812    0.812    −0.8
Bounded-paragraphs            0.761       0.766    0.763‡   −6.7
TextTiles                     0.831       0.812    0.821‡    0.4

† Micro-averaged precision and recall break-even point. ‡ Micro-averaged F1 measure. * Best BeP among all passage types.


Though there is no significant improvement in BeP, overlapping windows with passage sizes of 100 and 150 words showed the best performance. The passage type showing the worst performance is bounded-paragraphs. A bounded-paragraph is defined as a paragraph containing at least 50 and at most 200 words (Callan, 1994). We reason that the short documents of the Apte collection produce malformed bounded-paragraphs, resulting in poor performance.

The poor performance improvement for all passage types on this data set seems to be due to the very high proportion of very short documents in the Apte collection. For example, about 60% of the documents in the test set are shorter than 100 words, which corresponds to only one non-overlapping window of size 100. The improvement from passage-level categorization is therefore overwhelmed by the inaccuracy introduced by the category-merging step of figure 1, resulting in poor overall improvement.

To eliminate the negative effect of short test documents and to verify the effectiveness of passage-level categorization, we prepared three additional subsets of the Apte collection: GT800, GT1200, and GT1600. The results are shown in the following sections.

5.3.2. The GT800 collection. The experimental results for the GT800 test collection are shown in Table 5. The GT800 collection is a subset of the Apte test collection in which documents shorter than 800 bytes are removed from the test set. The experimental environment is the same as that of the Apte collection.


Table 5. Effectiveness of passages for the GT800 data set with pwf_2.

                              Precision   Recall   BeP†      Δ%
Document                      0.690       0.690    0.690     0.0
Non-overlapping windows
  Window size = 50            0.688       0.688    0.688    −0.3
  Window size = 100           0.695       0.699    0.697     1.0
  Window size = 150           0.706       0.701    0.703     1.9
  Window size = 200           0.706       0.706    0.706     2.3
Overlapping windows
  Window/overlap = 50/25      0.690       0.690    0.690     0.0
  Window/overlap = 100/50     0.715       0.707    0.711‡*   3.0
  Window/overlap = 150/75     0.704       0.704    0.704     2.0
  Window/overlap = 200/100    0.706       0.710    0.708     2.6
Paragraphs                    0.696       0.697    0.697     0.9
Bounded-paragraphs            0.688       0.707    0.697‡    1.0
TextTiles                     0.704       0.702    0.703     1.9

† Micro-averaged precision and recall break-even point. ‡ Micro-averaged F1 measure. * Best BeP among all passage types.


There are some improvements in performance for most passage types, with overlapping windows of size 100 working best. Compared with the Apte collection, the performance improvement on this data set is clear. A detailed explanation is given in Section 5.3.4.

5.3.3. The GT1200 collection. The experimental results for the GT1200 test collection are shown in Table 6. The GT1200 collection is a subset of the Apte test collection in which documents shorter than 1,200 bytes are removed from the test set. The experimental environment is the same as that of the Apte collection.

There are significant improvements in performance for most passage types, with overlapping windows of size 100 working best. As the lengths of the test documents increase, compared with the Apte and GT800 data sets, the difference in BeP between passage-level and document-level categorization grows. A detailed explanation is given in Section 5.3.4.

5.3.4. The GT1600 collection. The experimental results for the GT1600 test collection are shown in Table 7. The GT1600 collection is a subset of the Apte test collection in which documents shorter than 1,600 bytes are removed from the test set. The experimental environment is the same as that of the Apte collection.

For the Apte and all GTnnnn collections, overlapping windows showed the best performance. This is probably because the variable length distribution of paragraphs, bounded-paragraphs, and tiles, compared with the evenly distributed lengths of overlapping windows, causes skewed categorization of passages and accordingly poor performance.


Table 6. Effectiveness of passages for the GT1200 data set with pwf_2.

                              Precision   Recall   BeP†      Δ%
Document                      0.660       0.660    0.660     0.0
Non-overlapping windows
  Window size = 50            0.674       0.673    0.673     2.1
  Window size = 100           0.689       0.683    0.686‡    4.0
  Window size = 150           0.665       0.664    0.665     0.8
  Window size = 200           0.642       0.637    0.640‡   −3.0
Overlapping windows
  Window/overlap = 50/25      0.677       0.678    0.678     2.7
  Window/overlap = 100/50     0.690       0.689    0.689*    4.5
  Window/overlap = 150/75     0.665       0.664    0.665     0.8
  Window/overlap = 200/100    0.643       0.645    0.644    −2.3
Paragraphs                    0.675       0.675    0.675     2.3
Bounded-paragraphs            0.683       0.683    0.683     3.5
TextTiles                     0.671       0.670    0.671     1.7

† Micro-averaged precision and recall break-even point. ‡ Micro-averaged F1 measure. * Best BeP among all passage types.

Table 7. Effectiveness of passages for the GT1600 data set with pwf_2.

                              Precision   Recall   BeP†      Δ%
Document                      0.636       0.636    0.636     0.0
Non-overlapping windows
  Window size = 50            0.649       0.648    0.648     1.9
  Window size = 100           0.665       0.663    0.664     4.4
  Window size = 150           0.653       0.642    0.647‡    1.8
  Window size = 200           0.626       0.621    0.623    −2.0
Overlapping windows
  Window/overlap = 50/25      0.658       0.658    0.658     3.4
  Window/overlap = 100/50     0.670       0.670    0.670*    5.4
  Window/overlap = 150/75     0.668       0.663    0.666     4.6
  Window/overlap = 200/100    0.649       0.649    0.649     2.0
Paragraphs                    0.652       0.652    0.652     2.5
Bounded-paragraphs            0.665       0.665    0.665     4.5
TextTiles                     0.659       0.658    0.658     3.4

† Micro-averaged precision and recall break-even point. ‡ Micro-averaged F1 measure. * Best BeP among all passage types.


Furthermore, the overlapping window type is superior to the non-overlapping window type for all data sets. While the other passage types, including non-overlapping windows, lose some term-locality information at passage boundaries, overlapping window passages preserve this locality information, since they overlap with the adjacent passages.

The best-performing passage size for both non-overlapping and overlapping windows is 100 words. This size is about three paragraphs for test documents of the Apte collection and its three subsets. (Note that we do not include numeric values in the word count.)

The point of the GTnnnn collections is the increasing length of the test documents, obtained by removing short documents from the test set of the Apte collection. From Tables 5–7, the performance improvements are approximately proportional to the lengths of the test documents. This is clear in figure 4.

In figure 4, performance improvements in percent are plotted against the average sizes of the test documents for the Apte, GT800, GT1200, and GT1600 data sets, which again shows that the overlapping window type is the best performer. There is a trend that the larger the average document size, the greater the performance improvement. Furthermore, there is a strong linear correlation between the average document size and the performance improvement for the overlapping window type. This means that the longer the test documents, the better the performance of all passage types.

So far we have examined the results for the longer documents among the short-document data sets. In the following section we present the passage-level classification results for the full-length documents of master's and doctoral theses.

Figure 4. Correlation between the average document size and performance improvement for various passage types. [Improvement (%) plotted against average document size (kb) for non-overlapping window (100), overlapping window (100), paragraph, bounded-paragraph, and TextTile.] The first point of bounded-paragraph is omitted due to its huge bias.


Table 8. Effectiveness of passages for the KISTI-Theses data set with pwf_3.

                              Precision   Recall   BeP†      Δ%
Document                      0.631       0.631    0.631     0.0
Non-overlapping windows
  Window size = 50            0.677       0.677    0.677     7.3
  Window size = 100           0.695       0.695    0.695    10.0
  Window size = 200           0.683       0.683    0.683     8.2
  Window size = 400           0.687       0.689    0.688‡    9.0
Overlapping windows
  Window/overlap = 50/25      0.669       0.669    0.669     5.9
  Window/overlap = 100/50     0.697       0.697    0.697*   10.5
  Window/overlap = 200/100    0.692       0.692    0.692     9.6
  Window/overlap = 400/200    0.686       0.686    0.686     8.7
Paragraphs                    0.669       0.669    0.669     5.9
Bounded-paragraphs            0.686       0.686    0.686     8.7
Pages                         0.689       0.689    0.689     9.1
TextTiles                     0.689       0.689    0.689     9.1

† Micro-averaged precision and recall break-even point. ‡ Micro-averaged F1 measure. * Best BeP among all passage types.

5.3.5. The KISTI-Theses collection. So far we have examined passage-based text categorization on data sets with short documents of one or two kilobytes (Table 2). In this section, we present the experimental results for the KISTI-Theses data set, which consists of full-length documents averaging 92.9 kb and varying from 14.8 kb to 533.5 kb. The results with passage weight function 3 (pwf_3) are shown in Table 8.

In this experiment, the k value of our kNN classifier is 1, and the feature selection condition is a document frequency of at least 2 and at most 69 (corresponding to 10% of the training documents). That k = 1 works best for this collection is very interesting. It is probably due to the fact that this collection is a rather small data set and only one category is assigned to each document. This means that the higher the k value, the higher the possibility that inadequate documents are contained among the top-ranked documents, causing performance degradation. In addition, since all the documents in the collection are very long (92,900 characters on average), even a single top-ranked document may contain enough terms to fully describe a category's features.

Categorization using any type of passage is more effective than categorization using the whole document as the categorization unit (Table 8).

Based on figure 4 and Table 8, the overlapping window with a passage size of 100 words shows the best performance for all data sets. For all data sets except the Apte collection, the bounded-paragraph type also performs better than the paragraph type. As stated in Section 2.1, the lengths of bounded-paragraphs are less skewed than those of paragraphs, because the length of a bounded-paragraph is bounded below by some minimal value, while paragraphs vary from a single sentence to tens of sentences. Skew in passage sizes seems to be harmful to passage-based text categorization.



Finally, we note briefly the speed-memory trade-off in classifying the KISTI-Theses data set. In this experiment, we find that passage-level categorization is 2 to 5 times slower than document-level categorization, but the memory required is greatly reduced, because splitting a document into passages increases the number of categorization tasks by the passage count but decreases the size of each unit of work from document size to passage size. This is a bearable trade-off, considering that a document is split into hundreds of passages in the case of the KISTI-Theses collection (see Table 3). The speed-memory trade-off will be very useful when it is impractical to apply whole-document-level categorization due to the bulky volume of the categorization unit, as with full-length digital books and whole web sites.

6. Conclusions

The advent of full-length document databases and the expansion of web sites pose new challenges to text classification, because traditional whole-document classification may be impractical for these document types. In this article, we introduced a new text categorization model, called passage-based text categorization, in which the problem size (the categorization unit) is reduced from the whole document to small passages split from the document.

We explored text categorization using various passage types: non-overlapping windows, overlapping windows, paragraphs, bounded-paragraphs, pages, and tiles. The improvement of passage-level categorization over whole-document categorization is greater than 5% for the short-document data sets and greater than 10% for the long-document data set. For all passage types, there are general improvements in categorization effectiveness. Overlapping windows, however, showed effectiveness superior to the other passage types across the five test collections.

We also introduced passage weights, which are applied to merge the passage categories into the document categories. Beyond the overall improvements observed for all passage weighting schemes tested in this article, one or more schemes proved superior to the simple summation of weights (pwf_1). Therefore, careful design of the passage weighting scheme can further improve effectiveness. Furthermore, our results showed that different data sets have different optimal passage weighting schemes.

The superior effectiveness of passages over the whole document may be the result of a few factors. First, passages can capture the subtopic structure of the test document, while whole-document categorization ignores the detailed design of the document. In passage-level text categorization, this subtopic structure can be exploited by a pool of classifiers, each of which has a local (i.e., passage-level) view of the document. Second, passages are short segments of the document, which means they embody locality: document-level classification mashes together the terms' locality information, while dividing the document into passages preserves partial locality. In this respect, it is natural that overlapping windows, which preserve locality even at the boundaries, outperform the other passage types.

Dividing a document into smaller passages is an effective mechanism for text categorization when collections of long documents are considered. Even in collections of short texts, we found that passages have the potential to improve effectiveness. Though there is a slight speed-effectiveness trade-off in passage-based text categorization, it is bearable when processing really long documents.



We suggest that passage-level text categorization is an influential method in environments such as unstructured full-length documents, XML documents, and whole web sites. For unstructured documents, any passage type discussed in this paper can be applied to categorization tasks. As for XML documents, since they are naturally composed of several passages (each appropriate element content serving as a passage), it is reasonable to expect that the passage-based categorization method is well suited to XML documents. Recently, Denoyer and Gallinari (2003) also showed that XML categorization can be further improved by using the structural information of XML documents. For web site classification, since a web site is usually composed of many web pages, the passage-based categorization model can be applied by regarding each web page as a passage.

The passage-based text categorization model resembles classifier committees (Larkey and Croft, 1996; Sebastiani, 2002) in two ways. First, the classification result is obtained by a pool of classifiers. Second, it uses a passage weight function to compute the document's categories, while classifier committees use a combination function to choose categories (Larkey and Croft, 1996; Sebastiani, 2002). We therefore expect that many research results on classifier committees can be applied easily to passage-based categorization in the near future.

Acknowledgments

We would like to thank Wonkyun Joo for some helpful comments and fruitful discussions, Hwa-muk Yoon for providing raw data for the KISTI-Theses test collection, and Changmin Kim and Jieun Chong for supporting this work.

Notes

1. More information about the KRISTAL-II information retrieval system can be obtained at http://giis.kisti.re.kr.
2. Korea Advanced Institute of Science and Technology, Daejeon, Korea, http://www.kaist.ac.kr.
3. Pohang University of Science and Technology, Pohang, Korea, http://www.postech.ac.kr.
4. Chungnam National University, Daejeon, Korea, http://www.cnu.ac.kr.
5. The Hangul morpheme analyzer used in this experiment is a component of the KRISTAL-II information retrieval system, which is used as the base of our kNN classifier.

References

Apte, C., Damerau, F., and Weiss, S.M. (1994). Towards Language Independent Automated Learning of Text Categorization Models. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 23–30).

Baker, L.D. and McCallum, A.K. (1998). Distributional Clustering of Words for Text Classification. In Proceedings of the 21st Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 96–103).

Callan, J.P. (1994). Passage Retrieval Evidence in Document Retrieval. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 302–310).

Denoyer, L. and Gallinari, P. (2003). A Belief Networks-Based Generative Model for Structured Documents: An Application to the XML Categorization. In MLDM 2003—IAPR International Conference on Machine Learning and Data Mining, Leipzig, Germany.

Hearst, M.A. and Plaunt, C. (1993). Subtopic Structuring for Full-Length Document Access. In Proceedings of the 16th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 59–68).

Hearst, M.A. (1994). Multi-Paragraph Segmentation of Expository Texts. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (pp. 9–16).

Kaszkiel, M., Zobel, J., and Sacks-Davis, R. (1999). Efficient Passage Ranking for Document Databases. ACM Transactions on Information Systems, 17(4), 406–439.

Kaszkiel, M. and Zobel, J. (2001). Effective Ranking with Arbitrary Passages. Journal of the American Society for Information Science and Technology, 52(4), 344–364.

Larkey, L.S. and Croft, W.B. (1996). Combining Classifiers in Text Categorization. In Proceedings of SIGIR-96, 19th ACM International Conference on Research and Development in Information Retrieval (pp. 289–297).

Moffat, A., Sacks-Davis, R., Wilkinson, R., and Zobel, J. (1994). Retrieval of Partial Documents. In NIST Special Publication 500-215: The Second Text REtrieval Conference (TREC 2) (pp. 181–190).

Salton, G., Allan, J., and Buckley, C. (1993). Approaches to Passage Retrieval in Full Text Information Systems. In Proceedings of the 16th Annual International Conference on Research and Development in Information Retrieval (pp. 49–58).

Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1), 1–47.

van Rijsbergen, C. (1979). Information Retrieval. London: Butterworths.

Witten, I.H., Moffat, A., and Bell, T.C. (1999). Managing Gigabytes: Compressing and Indexing Documents and Images. San Francisco: Morgan Kaufmann Publishing.

Yang, Y. (1994). Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In Proceedings of the 17th Annual International ACM/SIGIR Conference on Research and Development in Information Retrieval (pp. 13–22).

Yang, Y. and Pedersen, J.O. (1997). A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML'97) (pp. 412–420).

Yang, Y. (1999). An Evaluation of Statistical Approaches to Text Categorization. Journal of Information Retrieval, 1(1), 67–88.

Yang, Y., Slattery, S., and Ghani, R. (2002). A Study of Approaches to Hypertext Categorization. Journal of Intelligent Information Systems, 17(2), 219–241.

Zobel, J., Moffat, A., Wilkinson, R., and Sacks-Davis, R. (1995). Efficient Retrieval of Partial Documents. Information Processing and Management, 31(3), 361–377.