Vector Space Model
CS 652 Information Extraction and Integration
Posted: 21-Dec-2015
Introduction
[Diagram: documents and the user's information need (the query) are both represented by index terms; matching the two produces a ranking of documents.]
Introduction
A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
A ranking is based on fundamental premises regarding the notion of relevance, such as:
common sets of index terms
sharing of weighted terms
likelihood of relevance
Each set of premises leads to a distinct IR model
IR Models
User task: retrieval or browsing
Classic Models: Boolean, Vector (Space), Probabilistic
Set Theoretic: Fuzzy, Extended Boolean
Algebraic: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks
Probabilistic: Inference Network, Belief Network
Structured Models: Non-Overlapping Lists, Proximal Nodes
Browsing: Flat, Structure Guided, Hypertext
Basic Concepts
Each document is described by a set of representative keywords or index terms
Index terms are document words (e.g., nouns) that have meaning by themselves and capture the main themes of a document
However, search engines assume that all words are index terms (full-text representation)
Basic Concepts
Not all terms are equally useful for representing the document contents
The importance of an index term is represented by a weight associated with it
Let
- ki be an index term
- dj be a document
- wij be a weight associated with the pair (ki, dj), which quantifies the importance of ki for describing the contents of dj
The Vector (Space) Model
Define:
- wij > 0 whenever ki occurs in dj
- wiq >= 0, the weight associated with the pair (ki, q)
- vec(dj) = (w1j, w2j, ..., wtj), the document vector of dj
- vec(q) = (w1q, w2q, ..., wtq), the query vector of q
The unit vectors of the t index terms are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
Queries and documents are represented as t-dimensional weighted vectors
The Vector (Space) Model
Sim(q, dj) = cos(θ) = [vec(dj) · vec(q)] / (|dj| |q|)
           = Σ_{i=1..t} wij wiq / ( sqrt(Σ_{i=1..t} wij^2) × sqrt(Σ_{i=1..t} wiq^2) )
where · is the inner product operator, |q| is the length of q, and θ is the angle between the two vectors
Since wij >= 0 and wiq >= 0, 0 <= sim(q, dj) <= 1
A document is retrieved even if it matches the query terms only partially
[Diagram: the angle θ between vec(dj) and vec(q) in term space]
The Vector (Space) Model
Sim(q, dj) = Σ_{i=1..t} wij wiq / (|dj| |q|)
How do we compute the weights wij and wiq?
A good weight must take two effects into account:
- quantification of intra-document content (similarity): the tf factor, the term frequency within a document
- quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency
wij = tf(i, j) × idf(i)
The Vector (Space) Model
Let
- N be the total number of documents in the collection
- ni be the number of documents that contain ki
- freq(i, j) be the raw frequency of ki within dj
A normalized tf factor is given by
f(i, j) = freq(i, j) / max_l freq(l, j)
where the maximum is computed over all terms l that occur within the document dj
The inverse document frequency (idf) factor is
idf(i) = log(N / ni)
The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki.
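These two factors can be sketched directly from their definitions (the toy document and collection sizes below are invented for illustration):

```python
from math import log

def normalized_tf(freq_ij, doc_freqs):
    """f(i, j) = freq(i, j) / max_l freq(l, j), where doc_freqs maps
    every term occurring in d_j to its raw frequency."""
    return freq_ij / max(doc_freqs.values())

def idf(N, n_i):
    """idf(i) = log(N / n_i): the rarer a term, the more information
    it carries."""
    return log(N / n_i)

# Toy document and a collection of N = 4 documents, 2 containing "model"
doc = {"vector": 3, "space": 1, "model": 2}
print(normalized_tf(doc["model"], doc))  # 2 / 3
print(idf(4, 2))                         # log 2
```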
The Vector (Space) Model
The best term-weighting schemes use weights given by
wij = f(i, j) × log(N / ni)
This strategy is called a tf-idf weighting scheme
For the query term weights, a suggested formula is
wiq = (0.5 + 0.5 × freq(i, q) / max_l freq(l, q)) × log(N / ni)
The vector model with tf-idf weights is a good ranking strategy for general collections
The VSM is usually as good as the known ranking alternatives; it is also simple and fast to compute
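The document and query weighting formulas can be sketched as follows (the collection sizes in the example are made-up numbers, not from the slides):

```python
from math import log

def doc_weight(f_ij, N, n_i):
    """Document weight: w_ij = f(i, j) * log(N / n_i)  (tf-idf)."""
    return f_ij * log(N / n_i)

def query_weight(freq_iq, max_freq_q, N, n_i):
    """Suggested query weight:
    w_iq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / n_i).
    The 0.5 floor keeps every query term's tf component at least 0.5."""
    return (0.5 + 0.5 * freq_iq / max_freq_q) * log(N / n_i)

# Toy numbers: a collection of N = 10 documents, term k_i in n_i = 2 of them
print(doc_weight(1.0, N=10, n_i=2))      # full tf: weight = log 5
print(query_weight(1, 2, N=10, n_i=2))   # query term at half the max frequency
```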
The Vector (Space) Model
Advantages:
- term weighting improves the quality of the answer set
- partial matching allows retrieval of documents that approximate the query conditions
- the cosine ranking formula sorts documents according to their degree of similarity to the query
- it is a popular IR model because of its simplicity and speed
Disadvantages:
- it assumes mutual independence of index terms; it is not clear that this is a bad assumption in practice, though
Naïve Bayes Classifier
CS 652 Information Extraction and Integration
Bayes Theorem
The basic starting point for inference problems using probability theory as logic:
P(h | D) = P(D | h) P(h) / P(D)
where P(h) is the prior probability of hypothesis h, P(D | h) is the likelihood of the data D under h, and P(h | D) is the posterior probability of h given D
Bayes Theorem
Example: a lab test for cancer, with priors and likelihoods
P(cancer) = .008, P(~cancer) = .992
P(+ | cancer) = .98, P(- | cancer) = .02
P(+ | ~cancer) = .03, P(- | ~cancer) = .97
For a positive test result:
P(+ | cancer) P(cancer) = (.98)(.008) = .0078
P(+ | ~cancer) P(~cancer) = (.03)(.992) = .0298
so the maximum a posteriori hypothesis is ~cancer
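Plugging these numbers into Bayes' theorem gives the posterior (a quick check; the normalization step is implied rather than shown on the slide):

```python
# Slide numbers: P(cancer) = .008, P(+|cancer) = .98, P(+|~cancer) = .03
p_pos_and_cancer = 0.98 * 0.008    # = 0.00784, the slide's .0078
p_pos_and_nocancer = 0.03 * 0.992  # = 0.02976, the slide's .0298

# Normalizing the two joint probabilities gives P(cancer | +)
posterior = p_pos_and_cancer / (p_pos_and_cancer + p_pos_and_nocancer)
print(round(posterior, 4))  # about 0.21, so ~cancer remains more probable
```

Even with a positive result, the low prior keeps the posterior probability of cancer around 21%.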
Basic Formulas for Probabilities
Naïve Bayes Classifier
Naïve Bayes Algorithm
Naïve Bayes Subtleties
m-estimate of probability: (nc + m·p) / (n + m), where nc is the number of relevant observations, n the total number of observations, p a prior estimate of the probability, and m the equivalent sample size
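A minimal sketch of the m-estimate, assuming the standard formulation (nc + m·p) / (n + m):

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of probability: (n_c + m * p) / (n + m).
    p is a prior estimate and m the 'equivalent sample size' - the
    number of virtual observations the prior is worth."""
    return (n_c + m * p) / (n + m)

# With no observations the estimate falls back to the prior p;
# as n grows it approaches the raw frequency n_c / n.
print(m_estimate(n_c=0, n=0, p=0.1, m=5))
print(m_estimate(n_c=3, n=10, p=0.1, m=5))
```

This avoids zero probability estimates for attribute values never seen with a given class, which would otherwise veto the whole product in the classifier.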
Learning to Classify Text
- Classify text into manually defined groups
- Estimate probability of class membership
- Rank by relevance
- Discover groupings and relationships
  – between texts
  – between real-world entities mentioned in text
Learn_Naïve_Bayes_Text(Example, V)
Calculate_Probability_Terms
Classify_Naïve_Bayes_Text(Doc)
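The bodies of these procedures did not survive the transcript; below is a sketch in the same spirit, a standard multinomial Naïve Bayes learner and classifier (the toy examples, the add-one smoothing choice, and the function names' exact signatures are assumptions):

```python
from collections import Counter
from math import log

def learn_naive_bayes_text(examples, vocabulary):
    """examples: list of (list_of_words, label) pairs. Returns class
    priors P(v) and word probabilities P(w | v) with add-one smoothing,
    a special case of the m-estimate with p = 1/|V| and m = |V|."""
    labels = [label for _, label in examples]
    priors = {v: labels.count(v) / len(examples) for v in set(labels)}
    cond = {}
    for v in priors:
        words = [w for doc, label in examples if label == v for w in doc]
        counts = Counter(words)
        n = len(words)
        cond[v] = {w: (counts[w] + 1) / (n + len(vocabulary))
                   for w in vocabulary}
    return priors, cond

def classify_naive_bayes_text(doc, priors, cond):
    """Return argmax_v P(v) * prod_i P(a_i | v), computed in log space
    to avoid underflow; unknown words are skipped."""
    def score(v):
        return log(priors[v]) + sum(log(cond[v][w]) for w in doc
                                    if w in cond[v])
    return max(priors, key=score)

# Tiny invented training set for illustration
examples = [(["good", "great", "fun"], "pos"),
            (["bad", "boring", "awful"], "neg")]
vocab = {w for doc, _ in examples for w in doc}
priors, cond = learn_naive_bayes_text(examples, vocab)
print(classify_naive_bayes_text(["fun", "great"], priors, cond))  # the positive class wins
```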
How to Improve
- More training data
- Better training data
- Better text representation
  – the usual IR tricks (term weighting, etc.)
  – manually constructed good predictor features
- Hand off hard cases to a human being