Term Processing & Normalization Major goal: Find the best possible representation Minor goals:...

28
Term Processing & Normalization Major goal: Find the best possible representation Minor goals: Improve storage and speed First: Need to transform sequence of bytes into sequence of characters Problems involved: Encoding, language, … Different formats (some might be proprietary) Content vs. structure information etc.

Transcript of Term Processing & Normalization Major goal: Find the best possible representation Minor goals:...

Term Processing & NormalizationMajor goal:

Find the best possible representationMinor goals:

Improve storage and speed

First: Need to transform sequence of bytes into sequence of characters

Problems involved: Encoding, language, … Different formats (some might be proprietary) Content vs. structure information etc.

Lexical Analysis & TokenizationGoal: Transform stream of characters into

stream of words / terms / tokensTerms should be consistent and ‘reasonable’

Lots of potential problems here, e.g.Apostrophes

England’s: England s – Englands?haven’t: haven t – havent – have not?user’s vs. users’ vs. users’s?

Hyphenationstate-of-the-art vs. state of the artco-authors vs. coauthors vs. co authors

Lex. Analysis & Tokeniz. (Cont.)Special characters

Punctuation (Normally removed. But what about www.uni-freiburg.de?)Names: C++, A* Algorithm, M*A*S*H, …

Use just white spaces for segmentation?San Francisco? Los Angeles?Nuclear power plant?

Language specific issues, e.g.German: ‘Komposita’, e.g. Atomkraftwerk

Case sensitivity?

Conclusion: Implementation is easy, but what and how is critical (often heuristics, etc.)

Removal of Stop WordsGoal: Remove words with no relevance for any

search in order to reduce disk space, speed up search time, and improve result

Problems:What about ‘To be or not to be’?What about phrases, e.g. ‘The name of the rose’ (book/movie vs. flower names)?Usability?

Implementation:Often via manually generated stop words lists (depending on the task and data, e.g. small vs. big, general vs. topic specific, etc.)

ThesaurusThesaurus: A data structure composed of (1) a

pre-compiled list of important words in a given domain of knowledge and (2) for each word in the list, a list of related (synonym) words.(Glossary, page 453, Baeza-Yates & Ribeiro-Neto [1])

Implementation:Either manually generated (hard & expensive)or automatic (e.g. with statistical methods)

Usefulness for retrieval is controversialTopic-specific thesauri seem to work well, but limited usage with large, dynamic data base(?)

ThesaurusUsefulness for indexing unclear,but it can be useful in another way:

ThesaurusUsefulness for indexing unclear,but it can be useful in another way:

ThesaurusUsefulness for indexing unclear,but it can be useful in another way:

StemmingStemming: A technique for reducing words to

their grammatical roots.(Glossary, page 452, Baeza-Yates & Ribeiro-Neto [1])

Motivation: reduce different cases, singular and plural, etc. to the root (stem) of the word

Example: connected, connecting, connection, connections -> connect

But how about: operation research, operating system, operative dentistry -> opera???

Approaches for StemmingTable lookup

Generation is complexFinal tables are often incomplete

Affix removal Suffix vs. prefix (e.g. mega-volt)Doesn’t always work, esp. not in German

Successor variety stemming More complex than suffix removalUses (e.g.) linguistic approaches and techniques from morphology

N-gramsGeneral clustering approach which can also be used for stemming

The Porter AlgorithmSpecial algorithm for the English language

based on suffix removal Basic idea: Successive application of several rules

for suffix removal (5 phases of condition and action rules)

Example: Remove plural ‘s’ and ‘sses’Rules: sses -> ss, s -> NIL (obey order!)

Advantage: Easy algorithm with good resultsDisadvantage: Not always correct, e.g.

Same root for police – policy, execute –executive, …Different root for european – europe, search – searcher, …

Conventions for the pseudo language used to describe the - a consonant variable is represented by the symbol C which is used to refer to any letter

other than a,e,i,o,u and other than the letter y preceded by a consonant;- a vowel variable is represented by the symbol V which is used to refer to any letter which

is not a consonant; - a generic letter (consonant or vowel) is represented by the symbol L; - the symbol NIL is used to refer to an empty string (i.e., one with no letters); - combinations of C, V, and L are used to define patterns; - the symbol * is used to refer to zero or more repetitions of a given pattern; - the symbol + is used to refer to one or more repetitions of a given pattern; - matched parenthesis are used to subordinate a sequence of variables to the operators *

and +; - a generic pattern is a combination of symbols, matched parenthesis, and the operators *

and +; - the substitution rules are treated as commands which are separated by a semicolon

punctuation mark; - the substitution rules are applied to the suffixes in the current word; - a conditional if statement is expressed as ``if (pattern) rule'' and the rule is executed only

if the pattern in the condition matches the current word; - a line which starts with a % is treated as a comment; - curled brackets are used to form compound commands; - a “select rule with longest suffix” statement selects a single rule for execution among all

the rules in a compound command. The rule selected is the one with the largest matching suffix.

Porter Algorithm – Convention rules I

Porter Algorithm – Convention rules II

Thus, the expression (C)* refers to a sequence of zero or more consonants while the expression ((V)*(C)*)* refers to a sequence of zero or more vowels followed by zero or more consonants which can appear zero or more times. It is important to distinguish the above from the sequence (V*C) which states that a sequence must be present and that this sequence necessarily starts with a vowel, followed by a subsequence of zero or more letters, and finished by a consonant.

Finally, the command if (*V*L) then ed -> NILstates that the substitution of the suffix ed by nil (i.e., the removal of the suffix ed) only occurs if the current word contains a vowel and at least one additional letter.

Porter Algorithm – Phase 1

% Phase 1: Plurals and past participles. select rule with longest suffix { sses -> ss; ies -> i; ss -> ss; s -> NIL; } select rule with longest suffix { if ( (C)*((V)+(C)+)+(V)*eed) then eed -> ee; if (*V*ed or *V*ing) then { select rule with longest suffix { ed -> NIL; ing -> NIL; } select rule with longest suffix { at -> ate; bl -> ble; iz -> ize; if ((* C1C2) and ( C1 = C2) and (C1 not in {l,s,z})) then C1C2 -> C1; if (( (C)*((V)+(C)+)C1V1C2) and (C2 not in {w,x,y})) then C1V1C2 -> C1V1C2e; } } } if (*V*y) then y -> i;

Porter Algorithm – Phase 2

% Phase 2if ( (C)*((V)+(C)+)+(V)*) then select rule with longest suffix { ational -> ate; tional -> tion; enci -> ence; anci -> ance; izer -> ize; abli -> able; alli -> al; entli -> ent; eli -> e;

ousli -> ous; ization -> ize; ation -> ate; ator -> ate; alism -> al; iveness -> ive; fulness -> ful; ousness -> ous; aliti -> al; iviti -> ive; biliti -> ble; }

Porter Algorithm – Phase 3% Phase 3

if ( (C)*((V)+(C)+)+(V)*) then select rule with longest suffix { icate -> ic; ative -> NIL; alize -> al; iciti -> ic; ical -> ic; ful -> NIL; ness -> NIL; }

Porter Algorithm – Phase 4

% Phase 4if ( (C)*((V)+(C)+)((V)+(C)+)+(V)*) then select rule with longest suffix { al -> NIL; ance -> NIL; ence -> NIL; er -> NIL; ic -> NIL; able -> NIL; ible -> NIL; ant -> NIL; ement -> NIL;

ment -> NIL; ent -> NIL; ou -> NIL; ism -> NIL; ate -> NIL; iti -> NIL; ous -> NIL; ive -> NIL; ize -> NIL; if (*s or *t) then ion -> NIL; }

Porter Algorithm – Phase 5

% Phase 5select rule with longest suffix { if ( (C)*((V)+(C)+)((V)+(C)+)+(V)*) then e -> NIL; if (( (C)*((V)+(C)+)(V)*) and not (( *C1V1C2) and (C2 not in {w,x,y}))) then e -> nil; } if ( (C)*((V)+(C)+)((V)+(C)+)+V*ll) then ll -> l;

Source: Baeza-Yates [1], Appendix (available online at www.sims.berkeley.edu/~hearst/irbook/porter.html)

StemmingUsefulness for retrieval is controversialSeems useful for languages such as German

(lots of cases, compound nouns, etc.), i.e. languages where it is hard to do successfully!

Hard for large, multilingual data collections (such as the web)

Another problem: Usability (wrong results often hard to understand by the users)

Term Processing – SummaryTerm processing, e.g.

NormalizationLexical analysis & tokenizationRemoval of stop wordsThesaurusStemming (Porter’s Algorithm)

and much more

How is it usually done? Often depends on the application and data

Is it useful? How does it influence the result?We need a measure for that!

Evaluation – Precision & RecallEvaluation of IR systems is often done based on

some test data (benchmark tests), i.e. a set of (document-query-relevance)-labels

The most important measures to quantify IR performance are Precision and Recall

PRECISION =# FOUND & RELEVANT

# FOUND

RECALL =# FOUND & RELEVANT

# RELEVANT

Precision & Recall – Example

PRECISION =# FOUND & RELEVANT

# FOUND

RECALL =# FOUND & RELEVANT

# RELEVANT

RESULT: DOCUMENTS:

A CD B

F HGEJ

I

1. DOC. B

2. DOC. E

3. DOC. F

4. DOC. G

5. DOC. D

6. DOC. H

Relation Term Proc. – Prec. & Rec.How does term proc. influence IR performance?

Generally:More specific terms: Precision , RecallBroader terms: Precision , Recall

More specific:Thesaurus: Precision , RecallPhrases: Precision , RecallStemming: Precision , RecallCompound nouns: Precision , Recall

Note: Often increase in recall rather than precision. Maybe a reason why it’s hardly used for web search?

LOGICAL VIEW OF THEDOCUMENTS (INDEX)

Recap: IR System & Tasks Involved

INFORMATION NEED DOCUMENTS

User Interface

PERFORMANCE EVALUATION

QUERY

QUERY PROCESSING (PARSING & TERM

PROCESSING)

LOGICAL VIEW OF THE INFORMATION NEED

SELECT DATA FOR INDEXING

PARSING & TERM PROCESSING

SEARCHING

RANKING

RESULTS

DOCS.

RESULT REPRESENTATION

Documents: Selection & Segmentation

Which documents should be indexed?Big issue for web search (because we can’t do all of it: too big, changes too quickly, …)General problems: Confidential information, formats, etc.

What is a reasonable unit for indexing?In most cases: 1 document = 1 fileWhat about email archives w. more than one email in one file (file = folder)?What about attachments?What about one document split over several files (e.g. HTML version of PPT file)

LOGICAL VIEW OF THEDOCUMENTS (INDEX)

Recap: IR System & Tasks Involved

INFORMATION NEED DOCUMENTS

User Interface

PERFORMANCE EVALUATION

QUERY

QUERY PROCESSING (PARSING & TERM

PROCESSING)

LOGICAL VIEW OF THE INFORMATION NEED

SELECT DATA FOR INDEXING

PARSING & TERM PROCESSING

SEARCHING

RANKING

RESULTS

DOCS.

RESULT REPRESENTATION

References & Recommended Reading

[1] R. BAEZA-YATES, B. RIBEIRO-NETO: ‘MODERN INFORMATIN RETRIEVAL’, ADDISON WESLEY, 1999

CHAPTER 7 (TERMPROCESSING)CHAPTER 8.1, 8.2 (INDEX, INV. FILES)

[2] C. MANNING, P. RAGHAVAN, H. SCHÜTZ: ‘INTRODUCTION TO IR’ (TO APPEAR 2007)CHAPTER 1 (INDEX, INV. FILES)CHAPTER 2 (TERMPROCESSING)

AVAILABLE ONLINE AT http://www-csli.stanford.edu/~schuetze/ information-retrieval-book.html

Schedule

Next week:2 lectures (Tuesday & Wednesday)

The following week: Lecture & tutorials/lab will start(most likely)

I will put a schedule on the web next week