Chapter 2 Information Retrieval Part-1
![Page 1: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/1.jpg)
Chapter 2 Information Retrieval Part-1
![Page 2: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/2.jpg)
Modern Information Retrieval
• Document representation
  • Using keywords
  • Relative weight of keywords
• Query representation
  • Keywords
  • Relative importance of keywords
• Retrieval model
  • Similarity between document and query
  • Rank the documents
• Performance evaluation of the retrieval process
![Page 3: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/3.jpg)
Document Representation
Transforming a text document to a weighted list of keywords
![Page 4: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/4.jpg)
Stopwords
Figure 2.2 A partial list of stopwords
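A minimal sketch of stopword removal, assuming a tiny hypothetical stopword list (`STOPWORDS`) and a simple alphabetic tokenizer. Neither is taken from Figure 2.2; real stopword lists are much longer.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def remove_stopwords(text):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("The retrieval of documents is a ranking problem"))
# prints ['retrieval', 'documents', 'ranking', 'problem']
```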
![Page 5: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/5.jpg)
Activity: Document Representation
Transform the text in the document given into a weighted list of keywords.
![Page 6: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/6.jpg)
Stemming
• A given word may occur in a variety of syntactic forms
  • plurals
  • past tense
  • gerund forms (a noun derived from a verb)
• Example: the word connect may appear as connector, connection, connections, connected, connecting, connects, preconnection, and postconnection.
![Page 7: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/7.jpg)
Stemming
• A stem is what is left of a word after its affixes (prefixes and suffixes) are removed
• Suffixes: connector, connection, connections, connected, connecting, connects
• Prefixes: preconnection and postconnection
• Stem: connect
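The idea of removing affixes can be sketched as follows. This is a naive illustration, not Porter's algorithm: the affix lists `PREFIXES` and `SUFFIXES` are hypothetical and chosen only to cover the example words on this slide.

```python
# Hypothetical affix lists, just large enough for the "connect" example.
PREFIXES = ("pre", "post")
SUFFIXES = ("ions", "ion", "ing", "ed", "or", "s")  # longer suffixes are tried first

def naive_stem(word):
    """Strip at most one listed prefix and one listed suffix from a word."""
    word = word.lower()
    for prefix in PREFIXES:
        # Keep at least three letters so very short words are left alone.
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            word = word[:-len(suffix)]
            break
    return word

forms = ["connector", "connection", "connections", "connected",
         "connecting", "connects", "preconnection", "postconnection"]
print({naive_stem(w) for w in forms})  # prints {'connect'}
```

Unconditional stripping like this over-stems ordinary words (for example, it turns "press" into "pres"), which is why a practical stemmer attaches conditions to each rule.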
![Page 8: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/8.jpg)
Porter’s Algorithm
• The letters A, E, I, O, and U are vowels
• A consonant in a word is a letter other than A, E, I, O, or U, with the exception of Y
• The letter Y is a vowel if it is preceded by a consonant; otherwise it is a consonant
  • For example, Y in synopsis is a vowel, while in toy it is a consonant
• In the algorithm description, a consonant is denoted by c and a vowel by v
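The vowel and consonant rules above can be sketched as a small classifier. The function name `cv_pattern` is an assumption made for illustration, and the input is assumed to contain only letters.

```python
def cv_pattern(word):
    """Return a string of 'c' and 'v' marking each letter as consonant or vowel."""
    pattern = []
    for i, letter in enumerate(word.lower()):
        if letter in "aeiou":
            pattern.append("v")
        elif letter == "y" and i > 0 and pattern[i - 1] == "c":
            pattern.append("v")  # Y preceded by a consonant is a vowel
        else:
            pattern.append("c")  # includes Y at the start or after a vowel
    return "".join(pattern)

print(cv_pattern("synopsis"))  # prints cvcvccvc
print(cv_pattern("toy"))       # prints cvc
```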
![Page 9: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/9.jpg)
Porter’s algorithm: Step 1
Step 1: plurals and past participles
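The slide's rule table is not reproduced in this transcript. As a partial sketch, the plural-handling part of Step 1 (step 1a in Porter's published description) uses four suffix rules: SSES becomes SS, IES becomes I, SS stays SS, and a final S is dropped. The past-participle rules of Step 1 are not covered here, and the function name `porter_step_1a` is an assumption.

```python
# Step 1a rules, tried in order; the first matching suffix wins.
STEP_1A_RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

def porter_step_1a(word):
    """Apply the plural-handling rules of Porter's step 1a to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "caress", "cats"]:
    print(w, "->", porter_step_1a(w))
# prints:
# caresses -> caress
# ponies -> poni
# caress -> caress
# cats -> cat
```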
![Page 10: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/10.jpg)
Porter’s algorithm: Step 2
Steps 2–4: straightforward stripping of suffixes
![Page 11: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/11.jpg)
Porter’s algorithm: Step 3
Steps 2–4: straightforward stripping of suffixes
![Page 12: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/12.jpg)
Porter’s algorithm: Step 4
Steps 2–4: straightforward stripping of suffixes
![Page 13: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/13.jpg)
Porter’s algorithm: Step 5
Step 5: tidying-up
![Page 14: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/14.jpg)
Porter’s algorithm
Suffix stripping of a vocabulary of 10,000 words (http://www.tartarus.org/~martin/)
![Page 15: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/15.jpg)
For the Tutorial
• Bring your laptop/lab
• Make sure you have Java installed
• Bring any English-language text document; the extension must be .txt
• Number of words: no more than 1000
![Page 16: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/16.jpg)
Document Representation
![Page 17: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/17.jpg)
Term-Document Matrix
• A term-document matrix (TDM) is a two-dimensional representation of a document collection.
• Rows of the matrix represent various documents.
• Columns correspond to various index terms.
• Values in the matrix can be either the frequency or the weight of the index term (identified by the column) in the document (identified by the row).
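A minimal sketch of building a frequency-based term-document matrix, assuming the documents have already been tokenized (and, if desired, stopword-filtered and stemmed). The function name and the two sample documents are made up for illustration.

```python
from collections import Counter

def term_document_matrix(docs):
    """Build (terms, matrix) where matrix[i][j] is the count of terms[j] in docs[i]."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))  # one column per distinct index term
    matrix = [[c[t] for t in terms] for c in counts]  # one row per document
    return terms, matrix

docs = [
    ["retrieval", "model", "retrieval"],
    ["query", "model"],
]
terms, matrix = term_document_matrix(docs)
print(terms)   # prints ['model', 'query', 'retrieval']
print(matrix)  # prints [[1, 0, 2], [1, 1, 0]]
```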
![Page 18: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/18.jpg)
Term-Document matrix
![Page 19: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/19.jpg)
Sparse Matrices: Triples
![Page 20: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/20.jpg)
Sparse Matrices: Pairs
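The slides' own figures are not in this transcript, so the following is only one plausible reading of the two sparse layouts. It assumes that "triples" means storing each nonzero cell as (row, column, value), and that "pairs" means storing, for each document row, its nonzero cells as (column, value).

```python
def to_triples(matrix):
    """List every nonzero cell as a (row, column, value) triple."""
    return [(i, j, v)
            for i, row in enumerate(matrix)
            for j, v in enumerate(row) if v != 0]

def to_pairs(matrix):
    """For each row, list its nonzero cells as (column, value) pairs."""
    return [[(j, v) for j, v in enumerate(row) if v != 0] for row in matrix]

matrix = [[1, 0, 2], [1, 1, 0]]
print(to_triples(matrix))  # prints [(0, 0, 1), (0, 2, 2), (1, 0, 1), (1, 1, 1)]
print(to_pairs(matrix))    # prints [[(0, 1), (2, 2)], [(0, 1), (1, 1)]]
```

Both layouts skip zero entries, which matters because most index terms do not occur in most documents.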
![Page 21: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/21.jpg)
Normalization
• Raw frequency values are not useful for a retrieval model
• Normalized weights, usually between 0 and 1, are preferred for each term in a document
• Dividing all the keyword frequencies by the largest frequency in the document is a simple method of normalization.
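A minimal sketch of this max-frequency normalization, applied to each row (document) of a frequency matrix. The function name is an assumption, and so is the handling of an all-zero row, which the slide does not address.

```python
def normalize_rows(matrix):
    """Divide each row by its largest value; an all-zero row stays all zeros."""
    normalized = []
    for row in matrix:
        largest = max(row, default=0)
        if largest == 0:
            normalized.append([0.0 for _ in row])  # assumption: avoid dividing by zero
        else:
            normalized.append([v / largest for v in row])
    return normalized

print(normalize_rows([[1, 0, 2], [1, 1, 0]]))
# prints [[0.5, 0.0, 1.0], [1.0, 1.0, 0.0]]
```

After normalization, the most frequent term in each document has weight 1 and every other term has a weight between 0 and 1.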
![Page 22: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/22.jpg)
Normalized Term-Document Matrix
![Page 23: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/23.jpg)
Vector Representation of document d1
(word, frequency, normalized frequency)
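A sketch of producing such a vector for a single document, assuming its keywords are already extracted into a token list. The sample tokens are invented and are not the contents of d1.

```python
from collections import Counter

def document_vector(tokens):
    """Return (word, frequency, normalized frequency) triples, most frequent first."""
    counts = Counter(tokens)
    if not counts:
        return []
    largest = max(counts.values())
    return [(w, f, f / largest) for w, f in counts.most_common()]

print(document_vector(["retrieval", "model", "retrieval", "query"]))
# prints [('retrieval', 2, 1.0), ('model', 1, 0.5), ('query', 1, 0.5)]
```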
![Page 24: Chapter 2 Information Retrieval Part-1](https://reader035.fdocuments.net/reader035/viewer/2022062222/56816637550346895dd9a3b9/html5/thumbnails/24.jpg)
Mini project (Survey)
Arabic language stemmer design
• Survey and compare existing Arabic language stemmers and write a research paper.
• Design an Arabic language stemmer
• Reading: Hints on writing technical reports and papers