Chapter 4: Processing Text

Transcript of ch4.pdf (students.cs.byu.edu/cs453ta/notes/ch4.pdf)

Page 1

Chapter 4

Processing Text

Page 2

Processing Text

Modifying/Converting documents to index terms

Why?

Convert the many forms of words into more consistent index terms that represent the content of a document

Matching the exact string of characters typed by the user is too restrictive, e.g., case-sensitivity, punctuation, stemming

• it doesn’t work very well in terms of effectiveness

Not all words are of equal value in a search

Sometimes not clear where words begin and end

• Not even clear what a word is in some languages, e.g., in Chinese and Korean

Page 3

Text Statistics

Huge variety of words used in text, but many statistical characteristics of word occurrences are predictable

e.g., distribution of word counts

Retrieval models and ranking algorithms depend heavily on statistical properties of words

e.g., important/significant words occur often in documents but are not high frequency in the collection

Page 4

Zipf’s Law

Distribution of word frequencies is very skewed

Few words occur very often, many hardly ever occur

e.g., “the” and “of”, two common words, make up about 10% of all word occurrences in text documents

Zipf’s law:

The frequency f of a word in a corpus is inversely proportional to its rank r (assuming words are ranked in order of decreasing frequency)

f = k / r, i.e., r × f = k

where k is a constant for the corpus
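As a quick sketch (the corpus filename is hypothetical), one can check Zipf’s law by ranking words by frequency and inspecting the product r × f, which should stay roughly constant:

    from collections import Counter

    # Count word occurrences in a corpus (the filename is an assumption).
    with open("corpus.txt") as f:
        counts = Counter(f.read().lower().split())

    # Rank words by decreasing frequency; Zipf's law predicts r * f ~ constant.
    for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
        print(rank, word, freq, rank * freq)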

Page 5

Top 50 Words from AP89

Page 6

Zipf’s Law

Example. Zipf’s law for AP89, with problems at high and low frequencies

According to [Ha 02], Zipf’s law

does not hold for rank > 5,000

is valid when considering single words as well as n-gram phrases, combined in a single curve.

[Ha 02] Ha et al. Extension of Zipf's Law to Words and Phrases. In Proc. of Int. Conf. on Computational Linguistics, 2002.

Page 7

Vocabulary Growth

Another useful prediction related to word occurrence

As corpus grows, so does vocabulary size. However, fewer new words when corpus is already large

Observed relationship (Heaps’ Law):

v = k × n^β

where
v is the vocabulary size (number of unique words)
n is the total number of words in the corpus
k, β are parameters that vary for each corpus
(typical values given are 10 ≤ k ≤ 100 and β ≈ 0.5)

Predicting that the number of new words increases very rapidly when the corpus is small
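A small sketch of the formula, assuming the AP89 parameter values k = 62.95 and β = 0.455 (chosen because they reproduce the 100,151-word prediction quoted on a later slide):

    def heaps_vocab(n, k=62.95, beta=0.455):
        """Heaps' law: predicted vocabulary size for n word occurrences."""
        return k * n ** beta

    # With the assumed AP89 parameters, this predicts roughly 100,151
    # unique words for the first 10,879,522 words of the collection.
    print(round(heaps_vocab(10_879_522)))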

Page 8

AP89 Example

[Figure: vocabulary growth for AP89 and the Heaps’ law fit v = k × n^β]

Page 9

Heaps’ Law Predictions

Number of new words increases very rapidly when the corpus is small, and continues to increase indefinitely

Predictions for TREC collections are accurate for large numbers of words, e.g.,

First 10,879,522 words of the AP89 collection scanned

Prediction is 100,151 unique words

Actual number is 100,024

Predictions for small numbers of words (i.e., < 1,000) are much worse

Page 10

Heaps’ Law on the Web

Heaps’ Law works with very large corpora

New words occurring even after seeing 30 million!

Parameter values different than typical TREC values

New words come from a variety of sources

Spelling errors, invented words (e.g., product, company names), code, other languages, email addresses, etc.

Search engines must deal with these large and growing vocabularies

Page 11

Heaps’ Law vs. Zipf’s Law

As stated in [French 02]:

The observed vocabulary growth has a positive correlation with Heaps’ law

Zipf’s law, on the other hand, is a poor predictor of high-frequency terms, i.e., Zipf’s law is adequate for predicting medium to low frequency terms

While Heaps’ law is a valid model for vocabulary growth of web data, Zipf’s law is not strongly correlated with web data

[French 02] J. French. Modeling Web Data. In Proc. of Joint Conf. on Digital Libraries (JCDL), 2002.

Page 12

Estimating Result Set Size

Word occurrence statistics can be used to estimate the size of the results from a web search

How many pages (in the results) contain all of the query terms (based on word occurrence statistics)?

For the query “a b c”:

f_abc = N × (f_a/N) × (f_b/N) × (f_c/N) = (f_a × f_b × f_c) / N²

f_abc: estimated size of the result set using joint probability

f_a, f_b, f_c: the number of documents that terms a, b, and c occur in, respectively

N is the total number of documents in the collection

Assuming that terms occur independently
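A minimal sketch of this independence-based estimate (the function name is mine; the usage numbers are GOV2 counts quoted on later slides):

    def estimate_result_size(doc_freqs, N):
        """Estimate result set size for a conjunctive query assuming
        independent terms: f = N * prod(f_t / N for each term t)."""
        est = N
        for f in doc_freqs:
            est *= f / N
        return est

    # e.g., "lincoln tropical" in GOV2 (f_lincoln, f_tropical, N from later slides):
    print(round(estimate_result_size([771_326, 120_990], 25_205_179)))  # ~3,703
    # the true co-occurrence count is 3,018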

Page 13

TREC GOV2 Example

Poor estimation due to the independence assumption

Collection size (N) is 25,205,179

Page 14

Result Set Size Estimation

Poor estimates because words are not independent

Better estimates possible if co-occurrence info. available

P(a ∩ b ∩ c) = P(a ∩ b) · P(c | a ∩ b)

f_tropical∩fish∩aquarium = f_tropical∩aquarium × f_fish∩aquarium / f_aquarium

= 1921 × 9722 / 26480

= 705

f_tropical∩fish∩breeding = f_tropical∩breeding × f_fish∩breeding / f_breeding

= 5510 × 36427 / 81885

= 2451
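A sketch of this co-occurrence-based estimate (the function name is mine; the third argument is the shared term, aquarium or breeding in the examples above):

    def estimate_with_cooccurrence(f_ac, f_bc, f_c):
        """Estimate f_abc from pairwise counts: P(a∩b∩c) = P(a∩c) * P(b|a∩c),
        approximating P(b|a∩c) by P(b|c), giving f_ac * f_bc / f_c."""
        return f_ac * f_bc / f_c

    # GOV2 numbers from the slide:
    print(round(estimate_with_cooccurrence(1921, 9722, 26480)))   # 705
    print(round(estimate_with_cooccurrence(5510, 36427, 81885)))  # 2451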

Page 15

Result Set Estimation

Even better estimates using initial result set (word frequency + current result set)

Estimate is simply C/s
• where s is the proportion of the total number of documents that have been ranked and C is the number of documents found that contain all the query words

Example. “tropical fish aquarium” in GOV2
• After processing 3,000 out of the 26,480 documents that contain “aquarium”, C = 258

f_tropical∩fish∩aquarium = 258 / (3,000 ÷ 26,480) = 2,277

• After processing 20% of the documents,

f_tropical∩fish∩aquarium = 1,778 (1,529 is the real value)
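The C/s arithmetic as a minimal sketch (the function name is mine):

    def estimate_from_sample(C, docs_ranked, docs_total):
        """Estimate result size as C/s, where s = docs_ranked / docs_total."""
        return C / (docs_ranked / docs_total)

    print(round(estimate_from_sample(258, 3_000, 26_480)))  # 2277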

Page 16

Estimating Collection Size

Important issue for Web search engines, in terms of coverage

Simple method: use the independence model, even though it is not realistic

Given two words, a and b, that are independent, and N is the estimated size of the document collection

f_ab / N = (f_a / N) × (f_b / N)

N = (f_a × f_b) / f_ab

Example. For GOV2:
f_lincoln = 771,326
f_tropical = 120,990
f_lincoln∩tropical = 3,018

N = (120,990 × 771,326) / 3,018 = 30,922,045

(actual number is 25,205,179)
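The same estimate as a sketch (the function name is mine):

    def estimate_collection_size(f_a, f_b, f_ab):
        """N = f_a * f_b / f_ab, from f_ab/N = (f_a/N) * (f_b/N)."""
        return f_a * f_b / f_ab

    # Overestimates GOV2's true size (25,205,179) because
    # "lincoln" and "tropical" are not truly independent.
    print(round(estimate_collection_size(120_990, 771_326, 3_018)))  # 30,922,045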

Page 17

Tokenizing

Forming words from sequence of characters

Surprisingly complex in English, can be harder in other languages

Early IR systems:

Any sequence of alphanumeric characters of length 3 or more

Terminated by a space or other special character

Upper-case changed to lower-case

Page 18

Tokenizing Example (Using the Early IR Approach).

“Bigcorp's 2007 bi-annual report showed profits rose 10%.” becomes

“bigcorp 2007 annual report showed profits rose”
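A minimal sketch of these early-IR rules (the exact regex is an assumption) reproduces the example above:

    import re

    def early_ir_tokenize(text):
        """Early-IR tokenizing: alphanumeric sequences of length 3+, lowercased."""
        return [t.lower() for t in re.findall(r"[A-Za-z0-9]{3,}", text)]

    print(early_ir_tokenize("Bigcorp's 2007 bi-annual report showed profits rose 10%."))
    # ['bigcorp', '2007', 'annual', 'report', 'showed', 'profits', 'rose']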

Too simple for search applications or even large-scale experiments

Why?

Too much information lost

Small decisions in tokenizing can have major impact on the effectiveness of some queries

Page 19

Tokenizing Problems

Small words can be important in some queries, usually in combinations

xp, bi, pm, cm, el paso, kg, ben e king, master p, world war II

Both hyphenated and non-hyphenated forms of many words are common

Sometimes hyphen is not needed
• e-bay, wal-mart, active-x, cd-rom, t-shirts

At other times, hyphens should be considered either as part of the word or a word separator

• winston-salem, mazda rx-7, e-cards, pre-diabetes, t-mobile, spanish-speaking

Page 20

Tokenizing Problems

Special characters are an important part of tags, URLs, code in documents

Capitalized words can have different meaning from lower case words

Bush, Apple, House, Senior, Time, Key

Apostrophes can be a part of a word/possessive, or just a mistake

rosie o'donnell, can't, don't, 80's, 1890's, men's straw hats, master's degree, england's ten largest cities, shriner's

Page 21

Tokenizing Problems

Numbers can be important, including decimals

Nokia 3250, top 10 courses, united 93, quicktime 6.5 pro, 92.3 the beat, 288358

Periods can occur in numbers, abbreviations, URLs, ends of sentences, and other situations

I.B.M., Ph.D., cs.umass.edu, F.E.A.R.

Note: tokenizing steps for queries must be identical to steps for documents

Page 22

Tokenizing Process

First step is to use a parser to identify appropriate parts of the document to tokenize

Defer complex decisions to other components

Word is any sequence of alphanumeric characters, terminated by a space or special character, with everything converted to lower-case

Everything indexed

Example: 92.3 → 92 3, but search finds document with 92 and 3 adjacent

To enhance the effectiveness of query transformation, incorporate some rules into the tokenizer to reduce dependence on other transformation components

Page 23

Tokenizing Process

Not that different from the simple tokenizing process used in the past

Examples of rules used with TREC

Apostrophes in words ignored
• o’connor → oconnor, bob’s → bobs

Periods in abbreviations ignored
• I.B.M. → ibm, Ph.D. → ph d
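A sketch of these two rules (the abbreviation pattern is my assumption, and it collapses Ph.D. to one token rather than the slide's two):

    import re

    def normalize_token(tok):
        """TREC-style cleanup: drop apostrophes; drop periods in abbreviations."""
        tok = tok.replace("'", "").replace("’", "")
        if re.fullmatch(r"(?:[A-Za-z]{1,2}\.)+", tok):  # e.g., I.B.M., Ph.D.
            tok = tok.replace(".", "")
        return tok.lower()

    print(normalize_token("o’connor"))  # oconnor
    print(normalize_token("I.B.M."))    # ibm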

Page 24

Stopping

Function words (conjunctions, prepositions, articles) have little meaning on their own

High occurrence frequencies

Treated as stopwords (i.e., removed)

Reduce index space

Improve response time

Improve effectiveness

Can be important in combinations

e.g., “to be or not to be”

Page 25

Stopping

Stopword list can be created from high-frequency words or based on a standard list

Lists are customized for applications, domains, and even parts of documents

e.g., “click” is a good stopword for anchor text

Best policy is to index all words in documents, make decisions about which words to use at query time

Page 26

Stemming

Many morphological variations of words

Inflectional (plurals, tenses)

Derivational (making verbs into nouns, etc.)

In most cases, these have the same or very similar meanings

Stemmers attempt to reduce morphological variations of words to a common stem

Usually involves removing suffixes

Can be done at indexing time/as part of query processing (like stopwords)

Page 27

Stemming

Two basic types

Dictionary-based: uses lists of related words

Algorithmic: uses a program to determine related words

Algorithmic stemmers

Suffix-s: remove ‘s’ endings assuming plural
• e.g., cats → cat, lakes → lake, wiis → wii

• Some false positives: ups → up

• Many false negatives: supplies → supplie
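A sketch of this suffix-s stemmer, showing the failure modes listed above:

    def suffix_s_stem(word):
        """Strip a trailing 's' on the assumption it marks a plural."""
        return word[:-1] if word.endswith("s") and len(word) > 1 else word

    print(suffix_s_stem("cats"))      # cat
    print(suffix_s_stem("ups"))       # up       (false positive)
    print(suffix_s_stem("supplies"))  # supplie  (false negative)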

Page 28

Porter Stemmer

Algorithmic stemmer used in IR experiments since the 70’s

Consists of a series of rules designed to strip the longest possible suffix at each step

Effective in TREC

Produces stems, not words

Makes a number of errors and is difficult to modify

Page 29

Errors of Porter Stemmer

[Table of example errors: “No Relationship” (words wrongly conflated) vs. “Fail to Find a Relationship” (related words not conflated)]

Porter2 stemmer addresses some of these issues

Approach has been used with other languages

Page 30

Link Analysis

Links are a key component of the Web

Important for navigation, but also for search

e.g., <a href="http://example.com">Example website</a>

“Example website” is the anchor text

“http://example.com” is the destination link

Both are used by search engines

Page 31

Anchor Text

Describes the content of the destination page

i.e., collection of anchor text in all links pointing to a page used as an additional text field

Anchor text tends to be short, descriptive, and similar to query text

Retrieval experiments have shown that anchor text has significant impact on effectiveness for some types of queries

i.e., more than PageRank

Page 32

PageRank

Billions of web pages, some more informative than others

Links can be viewed as information about the popularity(authority?) of a web page

Can be used by ranking algorithms

Inlink count could be used as simple measure

Link analysis algorithms like PageRank provide more reliable ratings

Less susceptible to link spam

Page 33

Random Surfer Model

Browse the Web using the following algorithm:

Choose a random number 0 ≤ r ≤ 1

If r < λ, then go to a random page

If r ≥ λ, then click a link at random on the current page

Start again

PageRank of a page is the probability that the “random surfer” will be looking at that page

Links from popular pages increase PageRank of pages they point to
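A Monte-Carlo sketch of this model (the function name and the tiny example graph are mine; lam plays the role of λ, the jump probability):

    import random

    def random_surfer_pagerank(links, steps=100_000, lam=0.15):
        """Simulate the random surfer; visit frequencies approximate PageRank.
        links maps each page to its list of outgoing links."""
        pages = list(links)
        visits = dict.fromkeys(pages, 0)
        page = random.choice(pages)
        for _ in range(steps):
            if random.random() < lam or not links[page]:
                page = random.choice(pages)        # random jump (also escapes dead ends)
            else:
                page = random.choice(links[page])  # click a link at random
            visits[page] += 1
        return {p: v / steps for p, v in visits.items()}

    # Hypothetical graph (same shape as the worked example on a later slide):
    print(random_surfer_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))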

Page 34

Dangling Links

Random jump prevents getting stuck on pages that

Do not have links

Contain only links that no longer point to other pages

Have links forming a loop

Links that point to the second type of pages are called dangling links

May also be links to pages that have not yet been crawled

Page 35

PageRank

PageRank (PR) of page C = PR(A)/2 + PR(B)/1

More generally,

PR(u) = Σ_{v ∈ B_u} PR(v) / L_v

where
u is a web page
B_u is the set of pages that point to u
L_v is the number of outgoing links from page v (not counting duplicate links)

Page 36

PageRank

Don’t know PageRank values at start

Example. Assume equal values of 1/3, then

1st iteration: PR(C) = 0.33/2 + 0.33/1 = 0.5
PR(A) = 0.33/1 = 0.33
PR(B) = 0.33/2 = 0.17

2nd iteration: PR(C) = 0.33/2 + 0.17/1 = 0.33
PR(A) = 0.5/1 = 0.5
PR(B) = 0.33/2 = 0.17

3rd iteration: PR(C) = 0.5/2 + 0.17/1 = 0.42
PR(A) = 0.33/1 = 0.33
PR(B) = 0.5/2 = 0.25

Converges to: PR(A) = 0.4
PR(B) = 0.2
PR(C) = 0.4
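The same iteration as a sketch (the graph A → B, C; B → C; C → A is inferred from the update PR(C) = PR(A)/2 + PR(B)/1 above):

    def iterate_pagerank(links, iters=50):
        """Basic power iteration (no random jump): PR(u) = sum of PR(v)/L_v
        over all pages v that link to u."""
        pr = {p: 1 / len(links) for p in links}
        for _ in range(iters):
            new = dict.fromkeys(links, 0.0)
            for v, outs in links.items():
                for u in outs:
                    new[u] += pr[v] / len(outs)
            pr = new
        return pr

    print(iterate_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
    # converges to PR(A) = 0.4, PR(B) = 0.2, PR(C) = 0.4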

Page 37

PageRank

Taking random page jump into account, 1/3 chance of going to any page when r < λ

PR(C) = λ/3 + (1 − λ) × (PR(A)/2 + PR(B)/1)

More generally,

PR(u) = λ/N + (1 − λ) × Σ_{v ∈ B_u} PR(v) / L_v

where N is the number of pages and λ is typically 0.15

Page 38

Link Quality

Link quality is affected by spam and other factors

e.g., link farms to increase PageRank

Trackback links in blogs can create loops

Links from comments section of popular blogs can be used as the source of link spam. Solution:

• Blog services modify comment links to contain the rel="nofollow" attribute

e.g., “Come visit my <a rel="nofollow" href="http://www.page.com">web page</a>.”

Page 39

Trackback Links

Page 40

Information Extraction

Automatically extract structure from text

Annotate doc using tags to identify extracted structure

Named entity recognition

Identify words that refer to something of interest in a particular application

e.g., people, companies, locations, dates, product names, prices, etc.

Page 41

Named Entity Recognition

Rule-based

Uses lexicons (lists of words & phrases) to categorize names
• e.g., locations, peoples’ names, organizations, etc.

Rules also used to verify or find new entity names

• e.g., “<number> <word> street” for addresses

• “<street address>, <city>” or “in <city>” to verify city names

• “<street address>, <city>, <state>” to find new cities

• “<title> <name>” to find new names
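A toy sketch of one such rule, casting the “<number> <word> street” pattern as a regex (the exact pattern and its strictness are assumptions):

    import re

    # "<number> <word> street" address rule as a regex (pattern is an assumption).
    ADDRESS_RULE = re.compile(r"\b\d+\s+[A-Z][a-z]+\s+Street\b")

    text = "Visit us at 123 Main Street or email info@example.com."
    for m in ADDRESS_RULE.finditer(text):
        print("address:", m.group())  # address: 123 Main Street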
