ITCS 6265 Information Retrieval and Web Mining Lecture 2: The term vocabulary and postings lists.

ITCS 6265Information Retrieval and Web

Mining

Lecture 2: The term vocabulary and postings lists

Recap of the previous lecture

Basic inverted indexes: Structure: Dictionary and Postings Key step in construction: Sorting

Boolean query processing Simple optimization Linear time merging

Overview of course topics

Plan for this lecture

Elaborate basic indexing Preprocessing to form the term

vocabulary Documents Tokenization What terms do we put in the index?

Postings Faster merges: skip lists Positional postings and phrase queries

Recall basic indexing pipeline

Tokenizer

Token stream. Friends Romans Countrymen

Linguistic modules

Normalized tokens. friend roman countryman

Indexer

Inverted index.

friend

countryman

Documents tobe indexed.

Friends, Romans, countrymen.

Parsing a document

What format is it in? pdf/word/excel/html?

What language is it in? What character set is in use?

Each of these is a classification problem, which we will study later in the course.

But these tasks are often done heuristically …

Complications: Format/language

Documents being indexed can include docs from many different languages A single index may have to contain terms of

several languages. Sometimes a document or its components

can contain multiple languages/formats French email with a German pdf attachment.

What is a unit document? A file? An email? (Perhaps one of many in an mbox.) An email with 5 attachments? A group of files (PPT or LaTeX as HTML pages)

Tokens and Terms

Tokenization

Input: “Friends, Romans and Countrymen”

Output: Tokens Friends Romans Countrymen

Each such token is now a candidate for an index entry, after further processing Described below

But what are valid tokens to emit?

Tokens, types, and terms

Type: the class of all tokens with the same character sequence (but in different positions of documents) E.g., “to go or not to go” 6 tokens, four types (two to’s and go’s)

Terms: normalized types that are included in the IR dictionary

Tokenization

Simply breaking by whitespaces/punctuation characters might not always work, e.g., Finland’s capital Finland? Finlands? Finland’s? Hewlett-Packard Hewlett and Packard

as two tokens? state-of-the-art: break up hyphenated sequence. co-education: ‘-’ used to split up vowels lowercase, lower-case, lower case ? Have user indicate possible hyphens in query (like westlaw)?

San Francisco: one token or two? How do you decide it is one token (e.g., Barnes & Noble, DVDs & Blue-rays)?

Dates, times, numbers, & domain-specific tokens

Examples: 3/12/91 Mar. 12, 1991 4:00 John123@gmail.com http://www.cs.uncc.edu/ (800) 234-2333

Good idea to treat them as single tokens, despite having spaces & punctuation characters

May choose not to index them (a lot of such tokens) But often they are very useful: think about looking in

a bug report for a particular line where errors occur If part of meta data (fields such as document

creation date), stored in separate indexes for fields

Tokenization: language issues

French E.g., ‘ for reduced article the L'ensemble one token or two? (the set)

L ? L’ ? Le ? Want l’ensemble to match with un ensemble

(a set)

German compound nouns are not segmented

Lebensversicherungsgesellschaftsangestellter ‘life insurance company employee’ German retrieval systems benefit greatly from a

compound splitter module

Chinese and Japanese have no spaces between words: 莎拉波娃现在居住在美国东南部的佛罗里达。 Not always guaranteed a unique

tokenization, e.g., 和尚 Further complicated in Japanese, with

multiple alphabets intermingled Dates/amounts in multiple formats

フォーチュン 500社は情報不足のため時間あた $500K(約 6,000万円 )

Katakana Hiragana Kanji Romaji

End-user can express query entirely in hiragana!

Arabic (or Hebrew) is basically written right to left, but with certain items like numbers written left to right

Words are separated, but letters within a word are joined to form complex ligatures

← → ← → ← start ‘Algeria achieved its independence in 1962

after 132 years of French occupation.’ With Unicode, the surface presentation is

complex, but the stored form is straightforward

Stop words

With a stop list, you exclude from dictionary entirely the commonest words. Intuition:

They have little semantic content: the, a, and, to, be There are a lot of them: ~30% of postings for top 30

But the trend is away from doing this: Good compression techniques (lecture 5) means the

space for including stopwords in a system is very small Good query optimization techniques mean you pay

little at query time for including stop words. You need them for:

Phrase queries: “King of Denmark” Various song titles, etc.: “Let it be”, “To be or not to be” “Relational” queries: “flights to London”

Token Normalization

Normalization

We want to match U.S.A. and USA Solution 1: normalize terms in both document

& query (equivalent classing) e.g., both U.S.A. and USA are mapped to USA

Solution 2: index original tokens & use expansion lists in query time

Enter: window Search: window, windows Enter: windows Search: Windows, windows,

window Enter: Windows Search: Windows (but not

windows!) The latter is potentially more flexible, but less efficient

(more postings to store & intersect)

Normalization: other languages

Diacritics/accents Marks for special pronunciation, e.g.,

résumé vs. resume May also change the meaning of word

Often need to consider how users like to write their queries for these words Even in languages that standardly have

accents, users often may not type them

German: Tuebingen = Tübingen (umlaut may be written as two-vowel digraph)

Normalization: other languages

Collection may contain documents in more than one language

Need to “normalize” indexed text as well as query terms into the same form

Character-level alphabet detection and conversion Tokenization not separable from this. Sometimes ambiguous:

7 月 30日 vs. 7/30

Morgen will ich in MIT …

Is thisGerman “mit”?

Normalization: Case folding

Reduce all letters to lower case Often this is a good idea, since users will use

lowercase regardless of ‘correct’ capitalization… Exceptions: upper case in mid-sentence?

e.g., General Motors Fed vs. fed SAIL vs. sail

Google example (as of 8/2009): C.A.T. CAT (acronym normalization) cat (lower

casing) 1st hit: Caterpiller Inc. web site (stock symbol: CAT):

2nd hit: cat (Wikipedia)

Normalization: Thesauri

Handle synonyms and homonyms Hand-constructed equivalence classes

e.g., car = automobile color = colour

Rewrite to form equivalence classes Index such equivalences

When the document contains automobile, index it under car as well (usually, also vice-versa)

Or expand query? When the query contains automobile, look

under car as well

Normalization: Soundex

Traditional class of heuristics to expand a query into phonetic equivalents Language specific – mainly for names Invented for the US Census E.g., chebyshev tchebycheff

ITCS 6265 Information Retrieval and Web Mining Lecture 2: The term vocabulary and postings lists.

Documents

Transcript of ITCS 6265 Information Retrieval and Web Mining Lecture 2: The term vocabulary and postings lists.

ITCS Annual Report 2020

· Crawler Crane Luffing Tower ... system ( 1 m 3) Fine Control . ITCS ITCS ITCS ITCS ITCS ITCS . fiE Active Safety ITCS ITCS ITCS ITCS ITCS . Next Comfort o . Machine ...

ITCS Quick REFERENCE GUIDE: · Web viewNovember 16, 2015 ITCS Quick REFERENCE GUIDE: Connect to vcl - Apple Page 6 © 2015 ITCS East Carolina University

ITCS 6265 Information Retrieval & Web Mining Lecture 15 Clustering.

Itcs 4120 introduction (c)

ITCS 6163 Data Warehousing

media.cranepedia.commedia.cranepedia.com/uploads/2018/03/7055-7065-2-catalog-ja.pdfCrawler Crane Luffing Tower ... System . 1 mg ) Fine Control . ITCS ITCS ITCS m-f¥-h ITCS ITCS ITCS

ITCS 6265/8265 Project Group 5 Gabriel Njock Tanusree Pai Ke Wang.

ITCS 6265 Information Retrieval and Web Mining Lecture 10: Web search basics.

120126 Tisa Itcs

2012 ITCS PTS

Technical Datasheet PC 6265

ITCS 6265 Lecture 11 Crawling and web indexes. This lecture Crawling Connectivity servers.

ITCS 6265 Information Retrieval & Web Mining

ITCS 6265 IR & Web Mining ITCS 6265/8265: Advanced Topics in KDD --- Information Retrieval and Web Mining Lecture 1 Boolean retrieval UNC Charlotte, Fall.

ITCS TECH TALK - initse.com

ITCS 3153 Artificial Intelligence

ITCS 6265 Information Retrieval & Web Mining Lecture 14: Support vector machines and machine learning on documents [Borrows slides from Ray Mooney]

ITCS 6265 Information Retrieval and Web Mining

ITCS Year in Review