Document ranking using Qprp with concept of Multi-Dimensional Subspace
PRESENTATION ON PROJECT TOPIC: "DOCUMENT RANKING USING QPRP WITH THE CONCEPT OF MULTI-DIMENSIONAL SUBSPACE"
Presented by: Prakash Kumar Dubey (08), Nanhen Gaurav (07), Dilip Chauhan (27)
Guided by: Mr. Sourish Dhar (Dept. of IT), Mr. Bhagaban Swain (Dept. of IT)
Overview: Introduction. Architecture of IR. Classical models of IR. Quantum probability. Document ranking using qPRP. Proposed solution. Implementation and data collection. Conclusion. Future work.
Information Retrieval
Information Retrieval (IR) is the task of searching for relevant information in large collections of data.
Examples of IR:
Q: Give me articles about Laloo Prasad Yadav and the fodder scam. R: Evidence regarding Laloo Prasad Yadav's involvement in the fodder scam (text retrieval).
Q: What does a brain tumor look like on a CT scan? R: A picture of a brain tumor (image retrieval).
Not to be confused with Data Retrieval.
Main Components
There are five main components of a basic information retrieval system:
i. Crawling. ii. Indexing. iii. User's query. iv. Ranking. v. Relevance feedback.
Basic Architecture of IR
Cont.
Crawling: the system browses the document collection and fetches documents, following i. a selection policy, ii. a revisit policy, and iii. a politeness policy.
Indexing: the system builds an index of the documents via i. tokenization, ii. stop-word elimination, iii. stemming, and iv. inverted index construction.
Ranking: when the user issues a query, the index is consulted to retrieve the most relevant documents, which are then ranked by their importance.
Relevance Feedback: a classical way of refining search-engine rankings, e.g. disambiguating "Matrix" (mathematics or the movie). Three types of relevance feedback: explicit, implicit, and pseudo.
Theoretical Models in IR
Theoretical models give us different ways of solving IR-related problems. An IR model is defined as a 4-tuple [D, Q, F, R(qi, dj)], where:
D: the document collection.
Q: the set of queries collected from the users.
F: a framework for modelling document representations, queries, and their relationships.
R(qi, dj): a ranking function which associates a score with the pair (qi, dj).
Classical Models of IR
The three main classical models of Information Retrieval are the Boolean model, the vector space model, and the probabilistic model.
Boolean Model
The model is based on set theory and Boolean algebra.
Each document is considered a bag of index terms (words or phrases from the document important to establishing its meaning).
A query is an expression using Boolean connectives such as AND, OR, and NOT.
A retrieved document must completely match the given query, and the result set is not ordered.
Boolean Query Example
Suppose we have 3 documents:
Doc1: Cricket is the most popular game of India.
Doc2: Ricky Ponting is the most successful captain of cricket Australia.
Doc3: India is ranked 5th in the latest ICC test cricket ranking.
If a user wants to know about Indian cricket, a simple query is: India AND Cricket NOT Australia.
An inverted index is formed: India is present in documents {1, 3}, Cricket in documents {1, 2, 3}, and Australia in document {2}. Intersecting the postings for India and Cricket, and excluding Australia, selects {1, 3}.
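The inverted-index lookup above can be sketched in a few lines of Python. The document numbers and text come from the slide; everything else (the lowercase whitespace tokenization, variable names) is illustrative:

```python
# Sketch of Boolean retrieval over the three example documents from the slide.
docs = {
    1: "Cricket is the most popular game of India",
    2: "Ricky Ponting is the most successful captain of cricket Australia",
    3: "India is ranked 5th in the latest ICC test cricket ranking",
}

# Build the inverted index: term -> set of document IDs (the posting set).
index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

# Boolean query "India AND Cricket": intersect the two posting sets.
result = index["india"] & index["cricket"]
print(sorted(result))  # [1, 3]
```

The result set is unordered, which is exactly the limitation the slide's "Disadvantages" list points out next.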
Pros and Cons
Advantages:
i. Simple, efficient, and easy to implement.
ii. Very precise in nature; the user gets exactly what was asked for.
Disadvantages:
i. Partial matches are not retrieved, which in many cases is not suitable, and retrieved documents are not ranked.
ii. Given a large set of documents, it retrieves either too many or very few documents.
iii. The query does not capture synonymous terms.
iv. The model does not use term weights.
Vector Space Model
In this model documents are represented as vectors of index terms, which gives the model the ability to fetch partial matches.
We no longer consider only the presence or absence of terms, so in the vector model the term weights are not binary.
Queries are also represented as vectors.
The similarity between the two vectors is calculated as the cosine similarity between them, from which we determine the relevance of the document.
dj = (w1,j, w2,j, ..., wt,j)
Some Important Terms
Modelling as a clustering method. Fixing the term weights:
i. Term Frequency (tf)
ii. Inverse Document Frequency (idf)
Similarity Measure Between Two Vectors
The most widely used method to measure the similarity between two vectors is cosine similarity. The cosine similarity of a query q and a document dj is given by:
sim(dj, q) = cos Ɵ = (Σi w(i,j) · w(i,q)) / (|dj| · |q|)
Here, Ɵ is the angle between the two vectors, w(i,j) is the term weight of the ith term of the jth document, and w(i,q) is the term weight assigned to the ith term of the query.
Cont.
The retrieved set of documents are those dj for which sim(dj, q) is greater than a threshold value.
The threshold can be lowered if, for some query, even the highest similarity is low, thereby allowing partial matches to be retrieved.
Fig: As the angle Ɵ between dj and q decreases, the value of cos Ɵ increases.
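The cosine similarity from the previous slide can be sketched as follows; the document and query weights here are hypothetical term weights, not values from the data set:

```python
import math

def cosine_similarity(d, q):
    """Cosine of the angle between a document and a query weight vector,
    both given as {term: weight} dictionaries."""
    dot = sum(d[t] * q.get(t, 0.0) for t in d)      # sum_i w(i,j) * w(i,q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q)

doc = {"cricket": 2.0, "india": 1.0, "game": 1.0}   # hypothetical weights
query = {"cricket": 1.0, "india": 1.0}
sim = cosine_similarity(doc, query)                  # about 0.866
```

A document is retrieved when `sim` exceeds the chosen threshold, matching the thresholding step described above.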
Pros and Cons
Advantages:
i. Partial matching is possible.
ii. Retrieved results can be ranked according to cosine similarity.
Disadvantages:
i. Index terms are assumed to be mutually independent, which prevents the model from capturing the semantics of the query or document.
ii. It cannot express the "clear logic view" of the Boolean model.
Probabilistic Model
We try to capture the information retrieval process within a probabilistic framework.
The idea is to retrieve documents according to the probability of each document being relevant.
Several versions of the probabilistic model are available; we will use the Robertson-Sparck Jones version.
Probabilistic Model (Why)
Other models are empirical for the most part: success is measured by experimental results, and few properties are provable. The Probability Ranking Principle, in contrast, offers a provable "minimization of risk".
Information retrieval deals with uncertain information: the system makes an uncertain guess of whether a document satisfies the query. Probability theory provides a principled foundation for such reasoning under uncertainty.
Compare the vector space model, which ranks documents according to their similarity to the query.
Probability Ranking Principle
We have a collection of documents. The user issues a query, and a set of documents needs to be returned.
Question: in what order should documents be presented to the user?
Probability Ranking Principle (cont.)
Intuitively, we want the "best" document first, the second best second, and so on.
We need a formal way to judge the "goodness" of documents with respect to queries.
Idea: the probability of relevance of the document w.r.t. the query.
The Probabilistic Ranking Principle
"If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of that data."
In other words: what is the probability of this document being relevant given this query?
Probabilistic Ranking Principle: Definitions
All index term weights are binary, i.e. w(i,j) ∈ {0, 1}.
Let R be the set of documents known to be relevant to query q, and R̄ the set of non-relevant documents.
Let P(R|dj) be the probability that document dj is relevant to query q, and P(R̄|dj) the probability that dj is non-relevant to query q.
Cont.
Here we want to rank the documents d w.r.t. query q according to the probability of each document being relevant.
Mathematically, the scoring function is P(R = 1 | d, q).
R is an indicator variable: it takes value 1 if the document d is relevant w.r.t. the query q, and 0 if d is non-relevant.
Probability Ranking Principle
Let x be a document in the collection, let R represent relevance of a document w.r.t. the given (fixed) query, and let NR represent non-relevance. By Bayes' rule:
p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)
Here, p(x|R) and p(x|NR) are the probabilities that if a relevant (respectively non-relevant) document is retrieved, it is x; p(R) and p(NR) are the prior probabilities of retrieving a (non-)relevant document. What we need to find is p(R|x), the probability that a retrieved document x is relevant.
Probability Ranking Principle (cont.)
p(R|x) = p(x|R) p(R) / p(x)
p(NR|x) = p(x|NR) p(NR) / p(x)
Ranking Principle (Bayes' Decision Rule): if p(R|x) > p(NR|x), then x is relevant; otherwise x is not relevant.
The similarity sim(dj, q) of the document dj to the query q is therefore defined, using Bayes' rule, as the ratio:
sim(dj, q) = P(R|dj) / P(R̄|dj)
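The odds-ratio score above can be computed directly from Bayes' rule. The probability values below are purely hypothetical estimates for one document, chosen only to show the arithmetic:

```python
# Hypothetical estimates for a single document d_j (not from the data set).
p_x_given_R = 0.30   # p(x|R): prob. that a retrieved relevant document is x
p_x_given_NR = 0.05  # p(x|NR): prob. that a retrieved non-relevant document is x
p_R, p_NR = 0.2, 0.8 # prior probabilities of retrieving a (non-)relevant document

# Bayes' rule: p(R|x) = p(x|R) p(R) / p(x). The evidence p(x) is the same in
# numerator and denominator of the ratio, so it cancels out.
score = (p_x_given_R * p_R) / (p_x_given_NR * p_NR)  # = 1.5 > 1, so x is relevant
```

Since the score exceeds 1, p(R|x) > p(NR|x) and the decision rule declares x relevant; ranking by this score orders documents by decreasing probability of relevance.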
Binary Independence Model
The Binary Independence Model is used for calculating the probability of relevance.
The name is "binary" because documents and queries are represented as binary (Boolean) term incidence vectors: x = (x1, ..., xn), where xi = 1 iff term i is present in document x.
"Independence" means the terms are assumed independent of each other.
Cont.
Three assumptions are made by the Binary Independence Model (BIM):
1. The documents are independent of each other.
2. The terms in a document are independent of each other.
3. The terms not present in the query are equally likely to occur in any document, i.e. they do not affect the retrieval process.
Okapi BM25 Ranking Function
The probabilistic IR model is very generic in nature, and many practical versions of it exist.
The Okapi BM25 algorithm is based on probabilistic IR.
It pays attention to term frequency and document length.
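A per-term BM25 score can be sketched as below. This is one common formulation (with the Robertson-Sparck Jones idf and the standard k1/b defaults), not necessarily the exact variant of the BM25 library used in this project:

```python
import math

def bm25_score(tf, df, N, dl, avgdl, k1=1.2, b=0.75):
    """BM25 contribution of one query term.

    tf: term frequency in the document, df: document frequency of the term,
    N: collection size, dl: document length, avgdl: average document length.
    """
    idf = math.log((N - df + 0.5) / (df + 0.5))
    # tf saturation, normalized by document length relative to the average
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avgdl))
```

The document-length normalization (the `dl / avgdl` factor) is what the slide means by "pays attention to document length": a term occurrence in a short document counts for more than one in a long document.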
Disadvantages of PRP
The PRP model does not hold when its assumptions fail:
Calibration: the probability estimated by the IR system may not match the user's assessment of relevance.
Independent relevance: the relevance of each document is assumed to be independent of the others.
Certainty in estimation: the probability of relevance of a document is reported as a single scalar by the IR system.
Quantum Probability
Quantum probability theory naturally includes interference effects between events.
We assume that this interference captures the inter-dependency of the relevance of the documents.
The outcome is a more sophisticated principle, the Quantum Probability Ranking Principle (qPRP).
To understand the difference between Kolmogorovian and quantum probability theory in terms of document relevance, we will use the double slit experiment.
Double Slit Experiment
Fig: Settings of the double slit experiment.
Cont.
Fig: Distribution of pA and pB in the double slit experiment; distribution of pkAB in the double slit experiment as estimated by Kolmogorovian probability; distribution of pAB as measured in the double slit experiment.
Cont.
Kolmogorovian probability theory:
pkAB = p(x|A) + p(x|B) = pA + pB
Quantum probability theory:
pQAB = pA + pB + 2·√pA·√pB·cos(ƟAB) = pA + pB + IAB
where ƟAB = ƟA - ƟB and IAB = 2·√pA·√pB·cos(ƟAB) is the quantum interference term.
In reality: pAB ≠ pA + pB, i.e. pAB ≠ pkAB.
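The two estimates can be compared numerically. The probabilities and phases below are arbitrary illustrative values, not measurements:

```python
import math

def p_quantum(pA, pB, theta_A, theta_B):
    """Two-slit probability with the quantum interference term I_AB."""
    theta_AB = theta_A - theta_B
    interference = 2 * math.sqrt(pA) * math.sqrt(pB) * math.cos(theta_AB)
    return pA + pB + interference

pA, pB = 0.2, 0.3
p_kolmogorov = pA + pB                        # 0.5: no interference term
p_constructive = p_quantum(pA, pB, 0.0, 0.0)  # cos(0) = 1: constructive, > 0.5
```

When the phase difference is ±π/2 the cosine vanishes and the quantum estimate coincides with the Kolmogorovian one; otherwise the interference term pushes the probability above or below pA + pB, which is what the "In reality" line expresses.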
An Analogy with Document Ranking
Here the particle corresponds to the user, who is characterized by an information need.
Each slit corresponds to a document (e.g. 2 slits means 2 documents).
The event of a particle passing from the left of the screen to the right is comparable to the user examining the set of documents.
p(x|A, B) is analogous to p(S|dA, dB), where S is the event of stopping the search with the user being satisfied.
Cont.
Fig: Analogy between the double slit experiment and the document ranking process in IR.
Ranking Documents Within the Analogy
Fig: The IR analogue of the previous figure.
Cont.
Kolmogorovian probability: pkAB = pA + pB. Choosing the next document B from the candidate set Ɓ, the following equalities hold:
argmax_{B ϵ Ɓ}(pAB) = argmax_{B ϵ Ɓ}(pkAB) = argmax_{B ϵ Ɓ}(pA + pB) = argmax_{B ϵ Ɓ}(pB)
Cont.
Quantum probability: pQAB = pA + pB + IAB. The corresponding equalities are:
argmax_{B ϵ Ɓ}(pAB) = argmax_{B ϵ Ɓ}(pQAB) = argmax_{B ϵ Ɓ}(pA + pB + IAB) = argmax_{B ϵ Ɓ}(pB + IAB)
Ranking the First Document
For the first document there is only one slit, so Kolmogorovian and quantum probability theory give the same estimation, i.e. pkA = pQA = pA.
Ranking Subsequent Documents
Slits A and B are kept fixed and the third slit is varied among the slits of the set Ƈ.
Cont.
Kolmogorovian probability: pkABC = pA + pB + pC. Choosing the third document C from the candidate set Ƈ:
argmax_{C ϵ Ƈ}(pABC) = argmax_{C ϵ Ƈ}(pkABC) = argmax_{C ϵ Ƈ}(pA + pB + pC) = argmax_{C ϵ Ƈ}(pC)
Cont.
Quantum probability:
pQABC = pA + pB + pC + 2·√pA·√pB·cos(ƟA - ƟB) + 2·√pA·√pC·cos(ƟA - ƟC) + 2·√pB·√pC·cos(ƟB - ƟC)
      = pA + pB + pC + IAB + IAC + IBC
The corresponding equalities are:
argmax_{C ϵ Ƈ}(pABC) = argmax_{C ϵ Ƈ}(pQABC) = argmax_{C ϵ Ƈ}(pA + pB + pC + IAB + IAC + IBC) = argmax_{C ϵ Ƈ}(pC + IAC + IBC)
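The sequential selection step argmax(pC + IAC + IBC) can be sketched as below. All probabilities and phases are invented for illustration; in practice the phases would have to be estimated, as discussed later:

```python
import math

def interference(pX, pY, theta):
    """Interference term I_XY = 2 * sqrt(pX * pY) * cos(theta_XY)."""
    return 2 * math.sqrt(pX * pY) * math.cos(theta)

# Documents already ranked (A, B) with their relevance probabilities.
ranked = {"A": 0.6, "B": 0.5}

# Candidates: (relevance probability, phase difference w.r.t. each ranked doc).
candidates = {"C1": (0.40, {"A": math.pi / 3, "B": math.pi / 2}),
              "C2": (0.45, {"A": math.pi,     "B": math.pi})}

def qprp_score(pC, thetas):
    # p_C plus the interference with every document ranked so far
    return pC + sum(interference(ranked[d], pC, thetas[d]) for d in ranked)

best = max(candidates, key=lambda c: qprp_score(*candidates[c]))
```

Note that plain PRP would pick C2 (0.45 > 0.40), but C2's destructive interference (cos π = -1) with the already-ranked documents makes qPRP pick C1 instead; this is exactly how interference models the inter-dependency of relevance.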
Quantum Probability Ranking Principle (qPRP)
Assumptions:
I. Ranking is performed sequentially.
II. Empirical data are best described using quantum probabilities.
III. Documents that have been ranked earlier may influence further relevance assessments.
Interpretation of Interference in qPRP
Quantum interference is central to the formalization of qPRP. Once interference is expressed in IR terms, two questions arise:
1. What does quantum interference mean in qPRP and in IR?
2. How does the quantum interference term influence document ranking?
Estimating Interference in qPRP
IdAdB = 2·√P(R|q,dA)·√P(R|q,dB)·cos ƟdAdB ≈ 2·√P(R|q,dA)·√P(R|q,dB)·β·fsim(dA, dB)
The phase Ɵ present in the interference term is approximated using a function fsim(dA, dB), where fsim* is a function used to compute the similarity between dA and dB, and β is a real-valued parameter.
Note(*): different similarity functions can be used, e.g. cosine similarity, Jaccard similarity, etc.
Constructing Document Representations
We associate each document with a vector, defined on the vector space made up of the terms present in the documents.
Each term in the collection is considered a dimension of the vector space.
Different strategies can be employed to compute the components of a document's term vector, for example a binary schema, TF-IDF, or BM25.
Proposed Solution
We do not find any major drawbacks in the qPRP approach; qPRP can be thought of as a new model for IR.
However, the existing qPRP approach treats terms present in different sections of a document equally, and we cannot give equal weight to a term in the title and a term in the body.
Our belief is that representing the document as a multi-dimensional subspace will give better results.
Reasons for Considering Documents as Multi-Dimensional Objects
Writers write the different parts of a document with different aims. Title: gives an idea of the content of the document in 3-7 words. First paragraph or abstract: an overview of the whole document. Body: the content of the document. Conclusion: its summary.
Writers also set terms in different fonts and sizes, e.g. key terms in italics.
Considering documents as multi-dimensional allows building "truly" interactive IR systems.
Reasons for Considering Documents as Multi-Dimensional Objects (cont.)
Complex aspects of the retrieval process benefit from a more sophisticated representation of documents and queries.
It reduces the length of the subspaces; hence if a word appears in any segment, the document is more likely to satisfy the user.
How is a document represented as a multi-dimensional subspace?
In the previous representation, a document consisting of a title ("School of Tech."), an abstract or first paragraph, a body (containing e.g. "School of ... Technology ..."), and a conclusion is flattened into a single binary term vector, e.g.
Doc 1 = (0, 1, 1, 1, 0, 0, 1, 1, 1, 1)
Document Fragments
To represent a document as a multi-dimensional subspace, we need to divide the document into fragments.
Choice 1: use a single fragment, the document itself.
Choice 2: use the different sections of the document (title, abstract, etc.) as fragments.
Choice 3: use paragraphs as fragments, as they seem to be an appropriate size to correspond to an information need (IN).
Choice 4: use sentences as fragments.
Fragments as Document Sections
With the same example document (title "School of Tech.", abstract or first paragraph, body, conclusion), each section now gets its own binary term vector, one column per section:

Doc 1 (rows: terms; columns: Title, Abstract, Body, Conclusion):
1 1 1 1
0 0 0 0
1 1 1 1
0 0 1 0
1 0 0 1
0 1 1 0
1 1 0 1
0 0 1 0
1 1 1 1
0 1 0 0
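Building one incidence vector per section can be sketched as follows. The vocabulary, section texts, and tokenization here are all illustrative assumptions, not the project's actual data:

```python
# Fixed vocabulary: each term is one dimension of the vector space.
vocabulary = ["school", "of", "technology", "ranking", "quantum"]

def section_vector(text):
    """Binary incidence vector for one document section."""
    terms = set(text.lower().split())
    return [1 if t in terms else 0 for t in vocabulary]

# Hypothetical document split into the four sections from the slide.
doc = {
    "title": "School of Tech",
    "abstract": "School of Technology overview",
    "body": "Quantum ranking at the School of Technology",
    "conclusion": "ranking summary",
}
subspace = {section: section_vector(text) for section, text in doc.items()}
```

The resulting per-section vectors are exactly the columns of the incidence matrix above: instead of one flattened vector, the document spans a subspace with one dimension set per section.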
Fragments as Paragraphs and Sentences
A document can be represented as a set of information needs (INs), each represented as a vector.
We can decompose paragraphs or sentences into text excerpts that are associated with one or more INs.
In the same way, a query can be broken into INs.
Representation for Each Segmentation
Three weighting schemes can be used:
1. Term Frequency-Inverse Document Frequency (TF-IDF)
2. Term Frequency (TF)
3. Binary (term presence/absence)
TF-IDF causes substantial overhead, so we can use TF or binary weights.
Implementing the Multi-Dimensional Subspace with qPRP
To decide the rank between two documents, from qPRP we know that
pQAB = pA + pB + 2·√pA·√pB·cos(ƟAB)
Different parts of a document have different weightage. There are two approaches for implementing the multi-dimensional subspace with qPRP:
1. Implementing with the whole formula.
2. Implementing only with the similarity function.
Implementing with the Whole Formula
The qPRP formula is applied to each section of the document independently.
After calculating pQAB for each part, we multiply by the respective weightage and sum over the fragments of the document. The same similarity function can be used throughout.
Suppose, for two documents A and B, we assign the weightages Title = 0.2, Abstract = 0.3, Body = 0.3, Conclusion = 0.2. Then
pQAB = w_title·(pQAB)title + w_abstract·(pQAB)abstract + w_body·(pQAB)body + w_conclusion·(pQAB)conclusion
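The weighted combination can be sketched directly. The section weights are the ones from the slide; the per-section pQAB values are hypothetical:

```python
# Section weights from the slide; they sum to 1.
weights = {"title": 0.2, "abstract": 0.3, "body": 0.3, "conclusion": 0.2}

# Hypothetical per-section qPRP probabilities pQ_AB for a document pair (A, B).
pq_ab_per_section = {"title": 0.9, "abstract": 0.6, "body": 0.5, "conclusion": 0.4}

# Weighted sum over the document fragments
pq_ab = sum(weights[s] * pq_ab_per_section[s] for s in weights)  # 0.59
```

The second approach described on the next slide is structurally identical, except that the weighted sum is taken over the per-section angles ƟAB instead of over the per-section pQAB values.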
Implementing Only with the Similarity Function
Here only the similarity function is computed over document fragments, rather than the whole formula: calculate the similarity between the respective fragments of the two documents and take the weighted sum.
ƟAB = w_title·(ƟAB)title + w_abstract·(ƟAB)abstract + w_body·(ƟAB)body + w_conclusion·(ƟAB)conclusion
Different types of formulas can be used for calculating the similarity between multi-dimensional subspaces.
Metrics for Measuring the Extent of Interference
The subspace similarity sims(Sa, Sb) between a p-dimensional subspace Sa and an r-dimensional subspace Sb is defined as sims(Sa, Sb) = 1 - ...
This formula can also be used to calculate the similarity between two semantic spaces.
Implementation and Results
For the implementation of the project and an efficient evaluation of the results, the prerequisites we have used are:
Software requirements:
Windows 7; Microsoft Office 2010 (for the project report); JDK 1.6.0 (compiler) or higher; Notepad++ (with WebEdit).
Data set requirement:
Ad-hoc standard dataset.
Cont.
Hardware requirements:
3 GB RAM; 5 GB free hard disk space; Intel Core i5 processor or higher.
Package requirements:
Lucene 2.4.0; a BM25 implementation; Apache Commons Math 2.2.0.
Data Collection
The FIRE Ad-Hoc collection of the year 2010 has been used; the queries have also been taken from the same source.
The data set contains around 130,000 documents comprising news from the leading newspaper "The Telegraph" for the period 2004-07.
We have divided each document into 3 fragments: <title></title>, <fp></fp>, and <sp></sp>.
Why Fragments?
The fragments are made so as to introduce the concept of a multi-dimensional subspace; in our case the number of subspaces is 3.
The reasons behind choosing these three fragments, in this order, are: titles are the most important part of any document, and the inverted pyramid is the standard model for news writing.
So the title is kept at the top, and the main content of the document has been divided into two parts: the first paragraph and the second paragraph.
Implementing the Proposed Solution
We have divided our implementation process into 3 modules: indexing of the data set; searching the indexed documents using cosine similarity; searching using a quantum-based similarity measure.
For implementing the proposed solution we have chosen certain libraries: DOM parser (built into Java); Apache Lucene 2.4.0; Apache Commons Math Library 2.2.0; a BM25 implementation library.
UML Diagram
Class diagram used for indexing:

Indexer
  fields: IndexWriter, Document
  methods: getIndexWriter(boolean), closeIndexWriter(), indexDocument(TryDOM), recursion(File), rebuildIndexes(String)

TryDOM
  fields: Document, NodeList
  methods: buildDocument(File), String getName(), String getDocNum(), String getTitle(), String getFirstPara(), String getSecondPara(), String getWholeDocument()

Main
  methods: public static void main(String[])
Class diagram for Searcher (Part 1):

SearchFrame
  fields: String, JPanel, JTextField, JButton
  methods: actionPerformed(ActionEvent)

Main
  methods: public static void main(String[])
Cont.
Class diagram for Searcher (Part 2):

DocVector
  fields: SparseRealVector, Map
  methods: DocVector(Map<String, Integer> terms), setEntry(String term, int freq), normalize()

MySearcher
  fields: HashMap, ArrayList, IndexSearcher, Document, double tempScore, int tempDoc, int num, IndexReader
  methods: MySearcher(), ScoreDoc[] getProbableRelvDoc(String, String), HashMap sortQPRP(String, String), double getSimilarity(int, int), double testSimilarityUsingCosine(int, int, String)
Explanation (Indexer)
Indexing starts from the Main class, which takes as input the directory path where the documents to be indexed are kept.
Main instantiates the Indexer class and calls its method rebuildIndexes(String), passing the given directory path; this in turn calls recursion(File), which recursively indexes all files available in the directory.
Each file is then parsed by the TryDOM class and passed to indexDocument(TryDOM) to be indexed.
Explanation (Searcher)
Once indexing is done, the next step is to search the indexed documents for the given query, which is done using the MySearcher class.
The program starts from the Main class, which instantiates SearchFrame and pops up a GUI.
The GUI takes two inputs: the query and a file name (where the result is stored).
On clicking the search button, the MySearcher class is instantiated and the method sortQPRP() is called. It calls getProbableRelvDoc() to get the top-k results using the BM25 model, and then sortQPRP() re-ranks those results according to the qPRP model.
Collaboration diagram for the indexer:
Fig: Main calls Indexer.rebuildIndexes(); Indexer calls TryDOM.buildDocument(), getName(), getDocNum(), getTitle(), getFirstPara(), getSecondPara(), and getWholeDocument().
Collaboration diagram for the searcher:
Fig: Main creates SearchFrame; SearchFrame calls MySearcher.sortQPRP() and receives the result; MySearcher calls DocVector.setEntry() and normalize().
Lucene Index Structure
Documents in Lucene are stored as objects in an index. We need to convert the data into Document objects and store them in the index: we break the data into parts and store each part in the Document object as a Field object.
Fields stored per document: Doc ID, Title, First Paragraph, Second Paragraph, DocNum, Document Name.
Evaluation Measures
In order to evaluate the results we need to consider two dimensions.
Recall: a measure of the ability of the system to present all relevant documents. Mathematically, recall = (number of relevant documents retrieved) / (total number of relevant documents).
Precision: a measure of the ability of the system to present only relevant documents: precision = (number of relevant documents retrieved) / (total number of documents retrieved).
Recall and precision are set-based measures.
Cont.
To evaluate a ranked list, precision is plotted against recall. Whenever a new non-relevant document is retrieved, the recall value stays the same but precision decreases.
Mean Average Precision (MAP): in recent years the TREC community has been using MAP, which provides a single figure across recall levels. For a query set Q, where query qj ∈ Q has relevant documents {d1, ..., dmj} and Rjk is the set of ranked retrieval results from the top result down to document dk:
MAP(Q) = (1/|Q|) Σj (1/mj) Σk Precision(Rjk)
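The MAP formula can be sketched as below. The two-query example at the end uses invented document IDs and relevance judgments, purely to show the arithmetic:

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant doc appears,
    divided by the total number of relevant documents m_j."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)  # Precision(R_jk) at this cutoff
    return sum(precisions) / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranking, relevant-set) pairs, one per query q_j."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Illustrative two-query example (hypothetical rankings and judgments)
runs = [([1, 2, 3, 4], {1, 3}), ([5, 6, 7], {6})]
map_score = mean_average_precision(runs)  # (5/6 + 1/2) / 2 = 2/3
```

This is the measure used below to compare the PRP and qPRP ranked lists.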
Cont.
First we retrieved the top 150 results using the BM25 model, then sorted them according to qPRP using cosine similarity.
We noted the ranked lists given by both models and calculated recall and precision for both lists whenever a new relevant document was retrieved in the list.
We plotted histograms using these recall-precision values.
Ranking of Relevant Documents

Query No. | Relevant Document Ranking (PRP) | Relevant Document Ranking (qPRP)
77  | 78, 98, 41, 69, 16, 132, 47, 134, 135 | 60, 48, 47, 46, 42, 52, 44, 49, 54
79  | 18 | 20
85  | 26, 38, 27, 1, 44, 6, 42, 2, 22, 104 | 52, 30, 25, 1, 45, 26, 87, 36, 42, 48
88  | 27, 6, 49, 44, 7, 59, 13, 20, 12, 23, 21, 43, 56 | 35, 11, 82, 54, 2, 21, 29, 3, 39, 24, 26, 18, 28
100 | 5, 19, 13, 12, 14, 33, 06, 03, 04, 29, 1, 22, 9, 18, 15, 23 | 4, 21, 12, 11, 31, 33, 10, 03, 06, 20, 1, 27, 14, 15, 7, 8
102 | 2, 37, 129, 84, 62, 147 | 2, 13, 18, 16, 17, 20
103 | 1, 17, 30, 12, 130 | 1, 6, 7, 5, 143
112 | 18, 3, 37, 14, 6, 8, 1, 73, 24 | 7, 5, 16, 3, 6, 2, 1, 13, 9
121 | 11, 16, 5, 20, 4, 1, 3, 19, 6, 13 | 11, 10, 13, 7, 1, 8, 15, 14, 16
122 | 6, 1, 4, 10, 23, 9 | 3, 1, 16, 7, 21, 14
Comparison of precision for PRP and qPRP (cosine) at the same recall values (Query 100):
Fig: Histogram of Precision(PRP) vs Precision(Cosine) at recall levels 0.062 through 1.0.
Comparison of precision for PRP and qPRP (cosine) at the same recall values (Query 112):
Fig: Histogram of Precision(PRP) vs Precision(Cosine) at recall levels 0.111 through 1.0.
MAP comparison with respect to the queries:
Fig: Histogram of MAP(PRP) vs MAP(COSINE) for queries Q77, Q79, Q85, Q88, Q100, Q102, Q103, Q112, Q121, Q122.
Ranking of Relevant Documents Using qPRP (quantum-based similarity)

Model Name | Document Ranking | Average Precision
PRP | 11, 16, 5, 20, 4, 1, 3, 19, 6, 13 | 0.66
qPRP (using cosine similarity) | 11, 10, 13, 7, 1, 8, 15, 14, 16 | 0.508
qPRP (using quantum-based similarity) | 11, 15, 19, 3, 1, 2, 18, 5, 12 | 0.68
Conclusion
We have calculated the Mean Average Precision (MAP) for both models over the set of queries:

Model Name | MAP
PRP | 0.347060
qPRP (using cosine similarity) | 0.396237

The difference between them is 0.049177: the result obtained for qPRP is 14.1% more precise than that of PRP.
The results we have obtained are better in most cases, but for a very few queries the PRP result is better than qPRP.
Future Work
From the above results we deduce that qPRP can be used to rank the Ad-Hoc data set. The following directions can be pursued to get even better results:
Alternative document representations can be used. For example, we may divide subspaces on the basis of the most informative terms, which can be detected by font, or by terms appearing near a query term in the document.
Different similarity measures can be used, for example the subspace similarity measure described earlier.
Cont.
Similarity can also be found by capturing the meaning of a document; for this we may use a HAL representation (Azzopardi, Leif, Probabilistic Hyperspace Analogue to Language).
One can also test the solution we have proposed under the section "Implementing with the whole formula".
Thank You..