    Sentence Completion

Korinna Grabski
Otto-von-Guericke-Universität Magdeburg

    FIN/IWS

Universitätsplatz 2, 39106 Magdeburg, Germany

    [email protected]

Tobias Scheffer
Humboldt-Universität zu Berlin

    Department of Computer Science

Unter den Linden 6, 10099 Berlin, Germany

    [email protected]

    ABSTRACT

We discuss a retrieval model in which the task is to complete a sentence, given an initial fragment, and given an application-specific document collection. This model is motivated by administrative and call center environments, in which users have to write documents with a certain repetitiveness. We formulate the problem setting and discuss appropriate performance metrics. We present an index-based retrieval algorithm and a cluster-based approach, and evaluate our algorithms using collections of emails that have been written by two distinct service centers.

    Categories and Subject Descriptors

H.3.3 [Information search and retrieval]: Retrieval models; H.5.2 [User interfaces]: Natural language

    General Terms

    Algorithms, Experimentation

    Keywords

    Retrieval models, indexing methods, applications

1. INTRODUCTION

Information retrieval is concerned with the construction of methods that satisfy a user's information need [3]. Often, but not necessarily, the user has to explicitly encode the information need in a query. Here, we discuss a distinct retrieval model in which the information need consists of the remaining part of a sentence, typically part of a document, which the user is just beginning to enter. The query thus consists of the initial sentence fragment that has already been entered, possibly accompanied by preceding sentences. In order to be able to retrieve the sentence, we are given a domain-specific collection of documents, written by the same user or group of users.


This setting is motivated by processes such as answering questions sent by email in a call center, or writing letters in an administrative environment. Analyzing call center data, we found that outbound messages contain many similar, often semantically equivalent sentences. However, templates and maintained lists with answers to frequently asked questions are not widely used in call center environments, possibly because of a lack of the administrative structures and commitment required to maintain such templates. Our approach is therefore to integrate retrieval functionality into the human-computer interface.

Our paper is organized as follows. We review related work in Section 2. In Section 3, we discuss the problem setting and appropriate performance metrics for sentence completion methods. We present a retrieval algorithm in Section 4 and a clustering approach in Section 5. In Section 6, we conduct experiments with data collections from two call centers. Section 7 concludes.

2. RELATED WORK

Darragh and Witten [4] have developed an interactive keyboard that uses the sequence of past keystrokes to predict the most likely succeeding keystrokes. Clearly, in an unconstrained application context, keystrokes can only be predicted with limited accuracy. By contrast, in the very specific context of entering URLs, completion predictions are commonly provided by web browsers [6].

Predicting the next action of a user is a problem that has been studied in the area of human-computer interface design. Motoda and Yoshida [16] and Davison and Hirsch [5] developed a Unix shell which predicts the command stubs that a user is most likely to enter, given the current history of entered commands. Korvemaker and Greiner [12] have developed this idea into a system which predicts whole command lines. The Unix command prediction problem has also been addressed by Jacobs and Blockeel, who infer macros from frequent command sequences [8] and predict the next command using variable-memory Markov models [9, 10]. Similar approaches have been applied to predict the link that a user browsing the web is likely to follow, and to predict user actions in other domains [14, 7]. Often, the goal of such user models is to detect anomalies and fraud.

While these previous studies have inspired our work, they focus on predicting one of a given set of possible actions, given a trace of actions. By contrast, the problem which we discuss here possesses a new dimension. In the problem setting at hand, the action which is to be predicted is the remaining part of a sentence in natural language. However, we assume an environment with a relatively narrow distribution over possible texts, such as a call center, and we are given a (typically very large) application-specific corpus. Hence, we are facing an information retrieval problem.

A possible approach to this problem might be to learn a language model and to construct the most likely sentence, given its initial fragment. However, as we know from speech recognition and machine translation, constructing syntactically and semantically proper sentences automatically is a very hard problem (the decoding problem). We therefore consider sentence completion to be a problem that is most naturally and most effectively addressed by information retrieval methods.

An information retrieval approach to sentence completion involves finding, in a corpus, the sentence which is most similar to a given initial fragment. Several approaches to solving this problem are possible. Without an index structure, the problem could be solved by approximate string matching [18, 13]. This solution can be sped up by using inverted index files [15, 19, 1, 2]; here, the occurrences of the query terms are merged and approximate string matching is performed on the corresponding text positions.

Our solution extends this approach in two respects. Firstly, given a sentence fragment of length m, we search for sentences whose m initial words are most similar to the query fragment in the vector space model (rather than differing in at most k terms). Secondly, we present a pruning algorithm that interleaves accessing of indexed sentences with pruning of sentences which need not be accessed because they cannot be more similar to the query fragment than the best sentence found so far.

Processing nearest-neighbor queries with vector space similarity and index structures has been studied in the context of biological sequence data [17]. Kahveci and Singh [11] discussed a wavelet-based index structure for strings.

3. PROBLEM SETTING

The framework which we propose differs in small but significant ways from standard retrieval models. We are given a domain-specific collection of documents written by a user (or group of users). The user's information need is (the remaining part of) a sentence which he or she intends to write and is beginning to enter. The query is given by an initial sentence fragment. This information need intrinsically refers to the user's intention; it is therefore clear that the problem cannot be solved in all cases.

Our postulate, however, is that this framework is interesting because, in specific environments, the domain-specific document collection does in fact provide the information required to satisfy the user's information need in many cases. We pose the sentence completion problem as follows.

Problem Setting 1. We are given a domain-specific document collection. Furthermore, given an initial document fragment, the sentence completion problem is to identify the remaining part of the sentence that the user currently intends to write.

This problem setting can naturally be generalized along several dimensions. Firstly, it seems reasonable to consider additional attributes of the current communication process, such as other documents that are currently visible on the user's display. Secondly, instead of demanding that the current sentence be completed entirely, it may often be more natural to predict a string up to the point where the uncertainty about the remaining words of the sentence increases.

The algorithms which we present in this paper address a more specialized version of the sentence completion problem. We consider only the initial fragment of the current sentence as input, disregarding all preceding sentences and inter-sentential relationships.

What is the proper metric for evaluating the performance of sentence completion methods? Our definition of the problem setting refers to the user's intention. Hence, a perfect evaluation metric has to refer to the user's reaction (acceptance or rejection) to proposed sentence completions. At any time, a sentence completion system can, but need not, produce an output. This output is a single sentence. The user interface can, for instance, display the retrieved completion in a bubble and insert the text when the user presses the tab key. When a completion is proposed which the user accepts, he or she benefits, whereas a false proposal may unnecessarily distract the user's attention.

A typical sentence completion method will calculate a confidence measure for the completion of a given query sentence. A completion is proposed when the confidence exceeds some threshold value. We characterize the performance of a sentence completion algorithm with respect to a given test collection in terms of two quantities: precision and recall. By varying the confidence threshold, we obtain a precision-recall curve that characterizes the completion method.

For a given initial fragment for which the confidence threshold is exceeded, and for which an algorithm therefore produces a completion, the precision is 1 if the completion is accepted by the user, and 0 otherwise. The precision for a collection is averaged over all initial fragments. Hence, the precision of an algorithm for a test collection is the number of completions accepted by the user over the number of all proposed completions (Equation 1).

For a given initial fragment, a recall of 1 is achieved when a completion is produced which the user accepts. If the user rejects the produced completion, or if no completion is produced, recall for that initial fragment is zero. For the whole collection, the recall is obtained by averaging over all sentences, and over all initial fragments of each sentence. Hence, the recall of an algorithm with respect to a test collection is the number of accepted completions over the number of all queries (Equation 2). Note that our definition of recall does not refer to the set of sentences that are retrievable for a given initial fragment.

\[ \text{Precision} = \frac{\text{accepted completions}}{\text{predicted completions}} \tag{1} \]

\[ \text{Recall} = \frac{\text{accepted completions}}{\text{queries (initial word sequences)}} \tag{2} \]

Whether it is more natural to average precision and recall over all initial strings, or over all initial sequences of entire words, depends on the user interface; we consider initial sequences of entire words.
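To make Equations 1 and 2 concrete, here is a minimal Python sketch (our illustration, not code from the paper; `QueryOutcome` and `precision_recall` are hypothetical names) that computes both metrics from simulated query outcomes at a given confidence threshold.

```python
# Minimal sketch of Equations 1 and 2: each query outcome carries the
# confidence of the best retrieved completion and whether the user
# would accept it. Names and data layout are ours, not the paper's.
from dataclasses import dataclass
from typing import List

@dataclass
class QueryOutcome:
    confidence: float   # similarity of the best retrieved completion
    accepted: bool      # would the user accept this completion?

def precision_recall(outcomes: List[QueryOutcome], threshold: float):
    predicted = [o for o in outcomes if o.confidence >= threshold]
    accepted = sum(o.accepted for o in predicted)
    # With no proposals, precision is vacuous; we report 1.0 by convention.
    precision = accepted / len(predicted) if predicted else 1.0
    recall = accepted / len(outcomes) if outcomes else 0.0
    return precision, recall

# Varying the threshold traces out the precision-recall curve:
# for t in (0.6, 0.8, 1.0): print(t, precision_recall(outcomes, t))
```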

In order to obtain a performance measure that can be calculated without a human in the loop, we make a simplifying assumption: we assume that the user will accept any sentence that is semantically equivalent to the sentence which the user intends to write. We can thus measure performance using a test collection; all sentences in this collection have to be manually grouped into classes of semantically equivalent sentences. A single instance of a sentence completion problem can be simulated by drawing a target sentence from the test collection and using an initial fragment as the query. When the system's confidence exceeds the threshold and the retrieved sentence lies within the same equivalence class as the target sentence, this is counted as an accepted completion.

4. RETRIEVAL ALGORITHM

Our general approach is to search for the sentence whose initial words are most similar to the given initial fragment in vector space representation; ties are broken in favor of the more frequent sentence.

For each training sentence d_j and each length ℓ, we first calculate a TF-IDF representation of the first ℓ words: \( f_{i,j} = \mathrm{normalize}(\mathrm{TF}(t_i, d_j, \ell) \cdot \mathrm{IDF}(t_i)) \). The inverse document frequency refers to the number of sentences containing term t_i. The similarity between two vectors is defined by the cosine measure. When comparing a sentence fragment of length ℓ to a complete sentence, the respective TF-IDF representation of the first ℓ words of the sentence is used. The easiest way to find the best-fitting sentence is to sequentially compare each sentence with the fragment; however, this has a runtime of O(n).
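As a concrete illustration of this representation (our own sketch, not the authors' code; the tokenization, the IDF variant, and all names are assumptions), the following Python builds the unit-normalized TF-IDF vector of the first ℓ words and compares vectors with the cosine measure.

```python
import math
from collections import Counter

def idf(term, sentences):
    # Sentence-level IDF: the "document" frequency counts the sentences
    # (token lists) that contain the term; log(N/df) is one common variant.
    df = sum(term in s for s in sentences)
    return math.log(len(sentences) / df) if df else 0.0

def prefix_tfidf(tokens, length, sentences):
    # TF-IDF vector of the first `length` words, normalized to unit length,
    # so that the cosine measure reduces to a plain dot product.
    tf = Counter(tokens[:length])
    vec = {t: tf[t] * idf(t, sentences) for t in tf}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def cosine(u, v):
    # Dot product of two unit-normalized sparse vectors.
    return sum(w * v.get(t, 0.0) for t, w in u.items())
```

Sequential search would score every stored sentence prefix with this cosine; the index structure described next avoids most of these comparisons.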

Therefore, we develop an indexing algorithm that still ensures finding the best sentence but typically exhibits sub-linear behavior. This algorithm is based on an inverted index structure (see Figure 1) that lists, for each term, the sentences in which the term occurs (the postings). The postings lists are sorted according to a canonical ordering relation which, as discussed below, prefers more frequent sentences.
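A small sketch of building such an index (ours, under the assumption that sentence ids already encode the canonical ordering, so appending ids yields sorted postings lists):

```python
from collections import defaultdict

def build_inverted_index(sentences):
    # sentences: list of token lists, indexed by sentence id; ids are assumed
    # to be assigned in canonical order (more frequent sentences first).
    postings = defaultdict(list)
    for sid, tokens in enumerate(sentences):
        for term in set(tokens):        # one posting per term and sentence
            postings[term].append(sid)  # ids appended in increasing order
    return postings
```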

Table 1: Retrieval algorithm.

Input: inverted index, sentence fragment q of length ℓ.

1. Construct an inverted index for the query by copying the sentence lists for the terms in q from the global inverted index structure. Let index_1, ..., index_ℓ be the postings lists for terms t_1, ..., t_ℓ.

2. Let f be the TF-IDF representation of q.

3. Let s_best be the best sentence found so far. Set s_best = NULL.

4. Let sim_best be the similarity between f and s_best. Set sim_best = -1.

5. Do until break:

(a) Determine the upper bound \( sim_{UB}^{all} = \sum_{j:\, index_j \neq \emptyset} f_j \) for all sentences remaining in the index structure. If sim_UB^all ≤ sim_best, then break.

(b) Determine the first sentence s_first in the index structure, i.e., the smallest first sentence over all postings lists according to the canonical sorting criterion. If the upper bound sim_UB^first exceeds sim_best, then:

i. Determine the true similarity sim_first = sim(s_first, f).

ii. If sim_first > sim_best, then:

A. Set s_best = s_first, sim_best = sim_first.

B. If sim_best = 1, then break.

(c) Move the pointers to the next element in all index lists that include s_first, thereby removing all postings of s_first from the index. Let s_{index_j,1} denote the new first posting of term t_j.

Return s_best.
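The following Python sketch mirrors the structure of Table 1 under simplifying assumptions (the data layout and all names are ours): sentence ids are assigned in canonical order, so a smaller id means a more frequent sentence, and the per-sentence bound of Equation 5 is omitted for brevity, so every surviving list head is scored directly.

```python
def retrieve(query_vec, postings, similarity):
    """Pruning search over postings lists, in the spirit of Table 1.

    query_vec  : {term: unit-normalized TF-IDF weight} of the fragment.
    postings   : {term: list of sentence ids}, each list sorted by the
                 canonical ordering (smaller id = more frequent sentence).
    similarity : callable (sentence_id, query_vec) -> cosine similarity.
    """
    # ptr[t] is the current read position in the postings list of term t.
    ptr = {t: 0 for t in query_vec if t in postings}
    s_best, sim_best = None, -1.0
    while True:
        live = [t for t in ptr if ptr[t] < len(postings[t])]
        if not live:
            break  # index exhausted: every candidate was checked or pruned
        # Upper bound for all remaining sentences: total weight of query
        # terms whose postings lists are not yet empty (Equation 3).
        sim_ub_all = sum(query_vec[t] for t in live)
        if sim_ub_all <= sim_best:
            break  # no remaining sentence can outperform s_best
        # Canonically smallest sentence among all list heads.
        s_first = min(postings[t][ptr[t]] for t in live)
        sim_first = similarity(s_first, query_vec)
        if sim_first > sim_best:
            s_best, sim_best = s_first, sim_first
            if sim_best == 1.0:
                break  # perfect match; ties favor the more frequent sentence
        # Advance every list whose head is s_first (remove its postings).
        for t in live:
            if postings[t][ptr[t]] == s_first:
                ptr[t] += 1
    return s_best, sim_best
```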

According to Equation 3, the similarity between f and any indexed sentence can be bounded by \( sim_{UB}^{all} = \sum_{j:\, index_j \neq \emptyset} f_j \). When sim_best ≥ sim_UB^all, no remaining sentence contains sufficiently many query terms to outperform s_best. When sim_best = sim_UB^all, there may be another sentence equally similar to s_best; however, as our canonical ordering prefers frequent sentences, there can be no equally similar sentence more frequent than s_best.

Equation 5 bounds the similarity for the smallest indexed sentence. Therefore, if sim_UB^first < sim_best, then s_first is strictly worse than s_best. If sim_best = sim_UB^first, then we have already found a more frequent sentence s_best which is at least as good as s_first; in both cases, s_first does not need to be investigated. Only if neither criterion excludes the current sentence does the true similarity have to be determined. If it is higher than the best similarity, the current sentence is better and is stored; otherwise, the best sentence is not changed. This proves that the algorithm always stores the best-fitting sentence.

Now we show (2). There are three situations in which the algorithm terminates. In the first case, the index structure is empty. This implies that the best sentence has already been found, because all sentences that have not been ruled out have been checked. In the second case, sim_UB^all ≤ sim_best holds; then all sentences remaining in the index structure are, by definition of sim_UB^all, worse than the best sentence found so far. In the third case, the similarity between the best sentence and the fragment is one. Here the algorithm is allowed to terminate because one is the highest possible similarity and, because of the sorting criterion, the sentence has the highest possible frequency of all sentences with a similarity of one.

5. CLUSTERING APPROACH

Our second approach hierarchically clusters the training sentences into a tree whose leaves ideally only contain semantically equivalent sentences. The data is reduced by pruning all leaves with less than a minimum number of sentences (in our experiments, 10).

The tree can also be used to access the data more quickly. When predicting a sentence, the algorithm first decides on a cluster by greedily moving through the tree from the root until a leaf node is reached. It then chooses a sentence from the data in this cluster using the retrieval algorithm presented above.
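A minimal sketch of this two-stage prediction, assuming a cluster tree whose nodes store centroid vectors (the `ClusterNode` layout and function names are our assumptions, not the authors' code):

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ClusterNode:
    centroid: Dict[str, float]          # mean TF-IDF vector of the cluster
    children: List["ClusterNode"] = field(default_factory=list)
    sentences: List[int] = field(default_factory=list)  # ids, leaves only

def predict_with_tree(query_vec, root, cosine, retrieve_in_leaf):
    # Greedy descent: at each node, enter the child whose centroid is most
    # similar to the query fragment, until a leaf is reached.
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(query_vec, c.centroid))
    # Choose the completion among this leaf's sentences with the
    # retrieval algorithm of Section 4.
    return retrieve_in_leaf(node.sentences, query_vec)
```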

Table 2 shows characteristic sentences (translated from German into English) from the ten largest clusters found in collections of emails sent by the service centers of an online provider of education services (left-hand side) and a large online shop (right-hand side). The clusters contain variations of sentences, most of which we would intuitively expect to be used frequently by service centers. Surprisingly, the salutation lines for two individuals ("Dear Mr. X", "Dear Mr. Y") appear in the set of most frequent sentences. However, a closer look at the data reveals that, in the period in which the data were recorded, these customers did in fact ask a substantial number of questions.

6. EMPIRICAL STUDIES

In this section, we investigate the relative benefits of the retrieval and the clustering approach to sentence completion. We also want to shed light on the question of whether sentence completion can have a significant beneficial impact in practical applications.

We use two collections: collection A has been provided by an online education provider and contains emails that have been sent to students or prospective students by the service center, in response to inquiries. Collection B has been provided by a large online shop and contains emails that have been sent in reply to customer requests. Both collections have about the same size, around 10,000 sentences. They differ in the distribution over text; collection A includes more diverse sentences than collection B.

Note that both data sets contain only answer messages, not the corresponding questions. It is intuitively clear that the questions contain some information about the sentences that will most likely occur in the answer. However, the algorithms that we compare base their prediction only on the initial fragment of the sentence that the user is currently entering, and we assume the perspective of the service center that writes the answer messages.

We manually partition the collection of sentences into equivalence classes. Each element of an equivalence class can reasonably be exchanged for each other element without changing the semantics of the message. For the experiments, each data collection was split at random into a training set (75%) and a test set (25%).

The indexing and clustering algorithms are provided with the training part. In order to measure precision and recall, we iterate over all sentences x of the test set and, for each sentence, over all possible sequences that contain the first k words, for all possible values of k. Each such initial word sequence is counted as a query. We use the initial sequence as input to the studied algorithm; when the similarity between the query fragment and the most similar sentence exceeds the threshold parameter, the algorithm produces a single completion prediction, otherwise no prediction. If prediction y is returned, we check whether the actual sentence x and the prediction y lie in the same equivalence class. If this is the case, we count a correct prediction; if x and y lie in distinct classes, we count an incorrect prediction. We compute the precision of a method as the number of correct predictions over the total number of correct and incorrect predictions. Recall is calculated as the number of correct predictions over the number of queries.
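This evaluation loop can be summarized in a short Python sketch (ours, not the authors' code); `predict` stands for either completion method, returning a sentence or None, and `eq_class` maps a sentence to its manually assigned equivalence class.

```python
def evaluate(test_sentences, predict, eq_class):
    # Iterate over every test sentence and every prefix of k initial words;
    # each prefix counts as one query.
    queries = correct = incorrect = 0
    for x in test_sentences:                  # x is a list of tokens
        for k in range(1, len(x)):
            queries += 1
            y = predict(x[:k])                # a sentence, or None
            if y is None:
                continue                      # no prediction: lowers recall only
            if eq_class(y) == eq_class(x):
                correct += 1                  # same equivalence class: accepted
            else:
                incorrect += 1
    precision = correct / (correct + incorrect) if correct + incorrect else 1.0
    recall = correct / queries if queries else 0.0
    return precision, recall
```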

Figure 2 shows the precision for collection A for the retrieval algorithm using our index structure and for the algorithm extended with clustering, depending on the initial fragment length. Both algorithms are investigated for two different confidence (similarity) threshold values: a prediction is made only if the similarity is at least 0.6 or 1.0, respectively. Figure 3 shows the corresponding recall, again depending on the fragment length. Figures 4 and 5 show the same information for collection B.

[Figure 2: Precision for collection A. Precision versus fragment length, for the clustering and retrieval algorithms at thresholds 0.6 and 1.0.]

[Figure 3: Recall for collection A. Recall versus fragment length, for the clustering and retrieval algorithms at thresholds 0.6 and 1.0.]

For collection A, the retrieval algorithm always has a high precision of above 0.9. Adding the clustering reduces the precision attained with shorter initial fragments, but raises precision for longer initial fragments. In collection B, there is no real difference between the two algorithms; this is probably due to the lower diversity of this collection. In all runs, a pruning threshold of 10 sentences was used.


Table 2: Characteristic sentences of the identified clusters.

Online education provider                                     | Online shop
Please do not hesitate to ask further questions.              | (lines from the signature file)
Please let me know if you have any further questions.         | I would like to answer your question as follows.
I hope that you find this answer helpful.                     | Please contact our toll-free hotline . . .
Please let me answer your question as follows.                | Please apologize the mistake in your confirmation mail.
In addition, I would like to answer your question as follows. | We apologize for the inconvenience.
Please contact our consultation hotline . . .                 | Please apologize our mistake.
(15 minutes installation support)                             | Please apologize the delay.
Dear Mr. X,                                                   | Please apologize the incomplete delivery.
Dear Mr. Y,                                                   | Please follow the link "Create new account".
Dear Sirs,                                                    | I have shipped replacement free of charge.

[Figure 4: Precision for collection B. Precision versus fragment length, for the clustering and retrieval algorithms at thresholds 0.6 and 1.0.]

The recall depends on the threshold value used for the similarity. It is also smaller for the clustering algorithm, since the stored data is reduced. In collection A, clustering dramatically reduces the recall; for collection B, recall is only reduced by 10%. This is, again, caused by the greater diversity of collection A. On average, recall is lower for longer initial fragments. Although longer initial fragments result in fewer false completions (higher precision), the diversity among the sentences increases and the chance of a sufficiently similar sentence being available becomes smaller.

The time needed to make a prediction is also a very important factor: if the user types faster than the system retrieves proposals, the system is useless. Figure 6 compares prediction times for a sequential search, the index-based search, and the index-based search with the clustering extension. Figure 7 shows the same information measured for the second data collection.

We can clearly see that our index-based retrieval algorithm is substantially faster than the sequential algorithm. With an increasing number of sentences in the model, the gap grows further; with as few as 7,000 sentences, index-based search is already five times faster. As the sentences in collection A are more diverse, data compression by clustering additionally reduces prediction times notably. For collection B, the overhead produced when manipulating the clusters is larger than the gain, so the index algorithm behaves slightly better. However, the difference is small.

[Figure 5: Recall for collection B. Recall versus fragment length, for the clustering and retrieval algorithms at thresholds 0.6 and 1.0.]

7. CONCLUSION AND FUTURE WORK

We discussed a retrieval model in which the user expresses his or her information need by beginning to enter a sentence; the retrieval task is to complete the sentence in a way that the user accepts. A possible implementation could display the proposed completion in a bubble at the cursor position and insert the proposed completion when the user confirms it, e.g., by pressing the tab key. Applications of the proposed retrieval model include the problem of writing letters or other documents in an administrative environment, or answering emails in a service center. Rather than a concluding solution, this paper provides a starting point for a discussion of the sentence completion problem.

In our model, the query consists of the initial part of the sentence, rather than, for instance, the most characteristic part of a sentence or a special phrase that uniquely identifies a sentence. The benefit of this approach is that the user does not have to adapt his or her own behavior towards the system; in addition, the result of a failure to predict the correct sentence completion is only that the user has to complete entering the current sentence manually.

We have conducted case studies with collections of emails from two service centers, comparing a purely retrieval-based method to a clustering method. For both applications, we can conclude that a substantial fraction of sentences can be completed successfully after three initial words; we obtain up to 90% precision and 60% recall. This indicates that this problem setting may have considerable economic potential. Clearly, such high values can only be achieved in environments with a relatively limited discourse area.

[Figure 6: Prediction times for collection A. Prediction time in ms versus number of sentences in the model, for sequential search, the index structure, and clustering.]

[Figure 7: Prediction times for collection B. Prediction time in ms versus number of sentences in the model, for sequential search, the index structure, and clustering.]

Comparing the retrieval to the clustering approach, we can conclude that the retrieval method, on average, has higher precision and recall. The indexing method allows data access in typically sub-linear time. The clustering approach, however, reduces the data effectively. Further studies should investigate whether clustering becomes more interesting for very large document collections.

Our current solution ignores inter-sentential relationships as well as inter-document relationships (e.g., between a question and the sentences in the corresponding answer). It appears intuitively clear that more accurate predictions can be obtained by an algorithm which takes these important attributes into account. We will also investigate methods that may predict some succeeding words, but not necessarily the complete remainder of the current sentence. For instance, it may be possible to predict that the string "Your order" proceeds as "will be shipped on", but it may not be possible to predict whether the final word is "Monday" or "Tuesday".

Our evaluation has focused on the probability of predicting a sentence that lies within the same equivalence class as the sentence which a user is entering. An ultimate evaluation metric has to refer to the actual user benefit, and has to measure potential time savings as well as potential annoyance caused by incorrect completion predictions. We will address these issues in our future work.

    Acknowledgment

This work was supported by the German Science Foundation (DFG) under grants SCHE 540/10-1 and NU 131/1-1.

8. REFERENCES

[1] M. Araujo, G. Navarro, and N. Ziviani. Large text searching allowing errors. In Proceedings of the South American Workshop on String Processing, 1997.

[2] R. Baeza-Yates and G. Navarro. Faster approximate string matching. Algorithmica, 23(2):127-158, 1999.

[3] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.

[4] J. Darragh and I. Witten. The Reactive Keyboard. Cambridge University Press, 1992.

[5] B. Davison and H. Hirsch. Predicting sequences of user actions. In Proceedings of the AAAI/ICML Workshop on Predicting the Future: AI Approaches to Time Series Analysis, 1998.

[6] M. Debevc, B. Meyer, and R. Svecko. An adaptive short list for documents on the world wide web. In Proceedings of the International Conference on Intelligent User Interfaces, 1997.

[7] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1999.

[8] N. Jacobs and H. Blockeel. The learning shell: automated macro induction. In Proceedings of the International Conference on User Modelling, 2001.

[9] N. Jacobs and H. Blockeel. Sequence prediction with mixed order Markov chains. In Proceedings of the Belgian/Dutch Conference on Artificial Intelligence, 2003.

[10] N. Jacobs and H. Blockeel. User modelling with sequential data. In Proceedings of the HCI International, 2003.

[11] T. Kahveci and A. Singh. Efficient index structures for string databases. In Proceedings of the International Conference on Very Large Databases, pages 351-360, 2001.

[12] B. Korvemaker and R. Greiner. Predicting Unix command lines: adjusting to user patterns. In Proceedings of the National Conference on Artificial Intelligence, 2000.

[13] G. Landau and U. Vishkin. Fast string matching with k differences. Journal of Computer and System Sciences, 37:63-78, 1988.

[14] T. Lane. Hidden Markov models for human/computer interface modeling. In Proceedings of the IJCAI-99 Workshop on Learning About Users, 1999.

[15] A. Moffat and T. Bell. In situ generation of compressed inverted files. Journal of the American Society for Information Science, 46(7):537-550, 1995.

[16] H. Motoda and K. Yoshida. Machine learning techniques to make computers easier to use. In Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1997.

[17] O. Ozturk and H. Ferhatosmanoglu. Effective indexing and filtering for similarity search in large biosequence databases. In Proceedings of the IEEE International Symposium on Bioinformatics and Bioengineering, 2003.

[18] P. Sellers. The theory and computation of evolutionary distances: pattern recognition. Journal of Algorithms, 1:359-373, 1980.

[19] S. Wu and U. Manber. Fast text searching allowing errors. Communications of the ACM, 35(10):83-91, 1992.