CHAPTER 3
LATENT SEMANTIC INDEXING FOR HINDI-ENGLISH CLIR IRRESPECTIVE OF CONTEXT SIMILARITY
3.1 INFORMATION RETRIEVAL
Information retrieval is the process of determining the similarity between a query and a document. Information Retrieval (IR) deals with representing, storing, organizing, and accessing information [31]. This representation and organization of information should make it easy for users to access. The main goal of Information Retrieval is to retrieve the information that is relevant to the user's need.
3.1.1. INDEXING ISSUES IN INFORMATION RETRIEVAL
In Information Retrieval (IR), the basic task is to find the subset of a collection of elements that are relevant to a query. In Text Retrieval, a query is an ordered set of English words and a collection is a set of natural language English documents. Any text retrieval system must overcome the fundamental difficulty that the presence or absence of a word is insufficient to determine relevance. This is due to two intrinsic problems of natural language, namely synonymy and polysemy.
Synonymy refers to the fact that a single underlying concept or idea can be represented by many different terms or combinations of terms; e.g., "car" and "automobile" often refer to the same class of objects. Polysemy refers to the fact that a single term can refer to more than one underlying concept or idea; e.g., "car" may be an automobile or the head of a LISP cons cell. Because of synonymy, it is difficult to realize that two documents describe the same topic when they use different vocabulary, leading to relevant documents being rejected. Because of polysemy, it is difficult to realize that two documents that use some of the same terms describe different topics, leading to the retrieval of unwanted documents.
A variety of approaches have been developed for IR tasks in the face of these problems. We will focus on the popular Vector Space Model (VSM) representation for documents and queries. We will also focus on variations of latent semantic indexing, a technique designed to address synonymy and polysemy within the VSM framework, and similar in flavor to the approach that we will derive.
The demand for multilingual information is growing as the number of internet users throughout the world increases. This demand creates the problem of retrieving documents in one language by specifying the query in another language.
This increasing need for retrieval of multilingual documents gave rise to a new branch called Cross Lingual Information Retrieval (CLIR). Cross Lingual Information Retrieval takes user queries in one language (the source language) and uses them to retrieve documents in another language (the target language). For example, if the user enters a query in Hindi, relevant documents in English will be retrieved. These retrieved documents are semantically equivalent. The main problem in CLIR is the scarcity of resources.
3.1.2. INDEXING TASKS IN INFORMATION RETRIEVAL
The field of Information Retrieval is broad, and researchers have focused their efforts within several sub-areas. We focus on the task where a system is given a set of documents and, whenever a user specifies a query, those documents are ranked by relevance. This is known as the ad hoc retrieval task. Here, the goal is to find the best ranking method possible by whatever means. Documents are known beforehand and collections usually remain relatively unchanged. In the routing and filtering tasks, a user has a set of standing information requests and new documents arrive regularly. The system receives a new document, decides whether it meets the criteria of those information requests and, if so, presents the document to the user. There are interesting retrieval issues in domains that do not include text at all, such as image retrieval or sound classification. Although the tasks are similar, the structure of the data and the queries are often quite different, and each domain brings different challenges. In this work, we are mostly concerned with issues that arise from text retrieval. Further, we are particularly interested in ad hoc retrieval tasks involving short queries.
3.1.3. COMPUTATIONAL ISSUES IN REAL-WORLD IR
The data sets used in text retrieval are large. Although a real-world data set may contain only 1,000 documents consisting of about 10,000 different words, we are often more interested in 100,000 or even 1,000,000 documents consisting of hundreds of thousands of distinct words. Even the smallest data sets are beyond the feasible reach of many machine learning algorithms. Several engineering challenges must be addressed. First, simply storing and manipulating such data efficiently is difficult, so faster algorithms are absolutely essential; even algorithms that are polynomial in the size of the data can be infeasible on serial machines. Further, bringing statistical and machine learning techniques to bear introduces more complexity, since such algorithms are usually at least polynomial in the size of the data to be learned. Clearly, developing a fully working retrieval system for something as large as the World Wide Web requires a system-level engineering approach.
3.1.4. PERFORMANCE MEASURE
Determining how well a system performs is difficult. In this section we
discuss several standard evaluation metrics and provide some examples of how they
interact.
Many measures of retrieval performance have been proposed. The most commonly used are precision and recall. Precision is the ratio of relevant documents retrieved to the total number of documents retrieved. Recall is the ratio of relevant documents retrieved to the total number of relevant documents contained within the collection [33]. Because systems provide an ordering on all documents for a given query, we can calculate precision and recall for the top n documents, with n ranging over the total number of documents in the collection. Suppose we have a collection made up of ten documents; for a particular query, five are relevant and five are not. If we examine only the first document returned, we have perfect precision (1.0) with recall equal to 0.2. Looking at the first four documents returned, three are relevant, resulting in a precision of 0.75 and a recall of 0.60. Precision and recall can be calculated for a single query as in this example, or averaged over many queries. One usually wishes to measure performance in terms of both precision and recall. This is commonly done using a precision-recall graph, with precision on the y-axis and recall on the x-axis.
Figure.3.1. Precision and Recall
Precision and recall can also be combined into a single measure, the F-measure,

F_b = (b^2 + 1) P R / (b^2 P + R)

where P is precision, R is recall, and b is the ratio of the importance of recall to precision. If b = 10, recall is ten times more important than precision, but when b = 0.1, recall is only one tenth as important as precision.
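To make the arithmetic of the ten-document example concrete, the sketch below assumes a hypothetical relevance ordering consistent with the numbers above and computes precision, recall and F_b at a cutoff n:

public class PrecisionRecall {
    // hypothetical relevance judgments for the ranked list in the ten-document example:
    // true = relevant, false = not relevant
    static final boolean[] RANKED =
            {true, false, true, true, false, true, false, true, false, false};
    static final int TOTAL_RELEVANT = 5;

    static int relevantInTop(int n) {
        int count = 0;
        for (int i = 0; i < n; i++) if (RANKED[i]) count++;
        return count;
    }

    static double precisionAt(int n) { return (double) relevantInTop(n) / n; }

    static double recallAt(int n) { return (double) relevantInTop(n) / TOTAL_RELEVANT; }

    // F_b = (b^2 + 1) P R / (b^2 P + R)
    static double fMeasure(double p, double r, double b) {
        return (b * b + 1) * p * r / (b * b * p + r);
    }

    public static void main(String[] args) {
        System.out.println("P@1 = " + precisionAt(1) + ", R@1 = " + recallAt(1)); // 1.0 and 0.2
        System.out.println("P@4 = " + precisionAt(4) + ", R@4 = " + recallAt(4)); // 0.75 and 0.6
        System.out.println("F(b=1)@4 = " + fMeasure(precisionAt(4), recallAt(4), 1.0));
    }
}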
3.1.5. VECTOR SPACE MODEL
In the Vector Space Model (VSM), a document is a vector in which each dimension holds a count of occurrences of a different word. Queries are represented in the same way, so a query is no different from a document. A collection of documents is a matrix, D, where each column is a document vector di; thus, Dij is the weight of word i in document j. Classically, the similarity between a document d and a query q is defined to be the inner product of their vectors, d^T q. This approach may seem bizarre; however, the inner product is just a weighted match between the overlapping terms of two documents. Although expressed as linear algebra [32], it is essentially the same approach used by many search engines, from the library systems commonly available in universities to the wildly popular Alta Vista web search engine.
There are several advantages to this approach beyond its mathematical simplicity. Above all, it is computationally efficient to compute a histogram and very little space is required to store it. Notice that although document vectors live in a very high-dimensional space, the document matrix will be sparsely populated, that is, made up mostly of zeroes. This is true because, in general, most documents will not contain most of the possible words. Thus, algorithms for manipulating the matrix only require space and time proportional to the average number of different words that appear in a document, a number likely to be much smaller than the full dimensionality of the document matrix. Similarly, comparing a query to all the documents in a collection is efficient. These are key advantages when collections may require gigabytes to store.
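A minimal sketch of this inner-product match using plain arrays; the word counts are hypothetical:

public class VectorSpaceMatch {
    // inner product d^T q between a document vector and a query vector
    static double innerProduct(double[] d, double[] q) {
        double sum = 0;
        for (int i = 0; i < d.length; i++) sum += d[i] * q[i];
        return sum;
    }

    public static void main(String[] args) {
        // hypothetical word counts: one row per document, one column per vocabulary word
        double[][] docs = {
                {2, 0, 1, 0},   // document 1
                {0, 1, 0, 3},   // document 2
                {1, 1, 2, 0}    // document 3
        };
        double[] query = {1, 0, 1, 0};

        // only dimensions where both vectors are non-zero contribute to the sum,
        // which is why sparse storage keeps this comparison cheap in practice
        for (int j = 0; j < docs.length; j++) {
            System.out.println("similarity(doc " + (j + 1) + ", q) = " + innerProduct(docs[j], query));
        }
    }
}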
3.2 RELATED WORK
Much work has already been done on CLIR systems, and research is presently going on in many countries such as India, Japan, China, and Portugal. Most of the proposed systems are based on indexing techniques like dictionary-based indexing, inverted file systems, probabilistic latent semantic indexing, ontology indexing and language modeling, which retrieve documents based on index terms. But index terms alone may fail to retrieve the documents that are relevant to the user query.
In the method of automatic cross-language information retrieval using latent semantic indexing [35], the authors tested a language-independent representation of the documents that works irrespective of the user query, whether it is short or long. They used a French-English parallel corpus, collected from Hansard, for training and testing the system: 982 documents for training and 1,500 documents for testing, nearly 2,482 documents in total. The English documents contain 2,482 paragraphs and the French documents likewise contain 2,482 paragraphs. The success rate in finding the mate documents is 98%.
The Porter stemmer [36] is used for stemming the English documents; it removes suffixes from words. Stemming is done on the Cranfield 200 collection, and precision and recall were calculated during stemming. The Porter stemmer algorithm was tested on a vocabulary of 10,000 words. Out of these 10,000 words, 1,373 were reduced and nearly 3,650 were not reduced. So, finally, the vocabulary size after applying the Porter stemmer is nearly one third of the original.
In the method of Turkish-English cross-language information retrieval using LSI [37], the authors created a system that can find the cross-language mate of a given document. The system is trained with bilingual documents. In this phase, they parsed the documents and stemmed the words with the corresponding stemmers. A feature-space (term-by-document matrix) was created and normalized using the TF-IDF method. These operations take 10-12 minutes on a computer with an Intel Quad-core processor and 4GB of memory. The normalized feature-space matrix was then decomposed into U, Σ, and V matrices; Matlab was used for this step. After training the system, the documents in the Turkish test set were queried against the system to find their cross-language mates. Cosine similarity is used to decide the similarity of the documents. They also tested the system with different rank approximations (k values). Depending on the k value, this operation takes 8-12 minutes on the computer mentioned above.
The performance of their system is evaluated by checking whether the query document's mate is obtained in the retrieval result set. After query submission, retrieval results are ranked according to their similarity to the query vector. They submitted 1,801 test documents, one by one, as queries and expected to find each document's mate in the query results. A query is considered successful if the mate of the query document appears in the first 10 ranked retrieval results. The results show the number of successful queries according to the rank order of the mate document. For instance, in the k = 500 experiment, they obtained the mate of the query document at the first rank for 416 queries. They also report the CLIR result when a direct match between documents is made, with no LSI or TF-IDF; using TF-IDF and LSI increases the query performance approximately 3 times compared with direct matching. The results show that as the k value increases, the retrieval result gets better. However, a greater k value produces larger matrices, which need more computational time.
In this work, they experimented on LSI using Singular Value Decomposition. The parallel corpus was collected from Skylife Magazine's website and contains both Turkish and English articles, translated by interpreters. The corpus contains 1,056 Turkish documents and 1,056 English documents. Here each paragraph is taken as an individual document, paired with its cross-language mate; so, finally, there are 3,602 document pairs, each represented in a single term-by-document matrix. Out of the 3,602 documents, 1,801 are used for training the system and 1,801 are used for testing it. A longest-match stemming algorithm is used for stemming the Turkish documents, and the Porter stemmer is used for English. A MySQL 5.1.11 database server is used for storing the documents. By using Latent Semantic Indexing, the retrieval rate is 3 times higher than with direct matching. The success rate is 69%.
Reference [38] describes Portuguese-English experiments using LSI. The Los Angeles Times collection was used for the English documents only, and Systran was used to translate 20% of the English collection to Portuguese. There are 22,000 documents in the collection in total, and the success rate of the retrieval is nearly 99%. The translation is far from perfect, and many times the incorrect sense of a word was used; e.g., "branch" was translated to "filial", where the correct sense was "ramo". When the system did not have a translation for a term, the term remained in the original language. Nevertheless, they did not perform any corrections or modifications on the resulting translation. They used the Porter stemmer to stem the English documents and their own stemmer to stem the Portuguese translations. Stop words were also removed.
The next step was to run the SVD on the 22,000 dual-language documents. They used a binary version of LSI provided by Telcordia Technologies. An important aspect is the choice of the number of dimensions that compose the concept space. They chose 700 dimensions, since this is the number that gave the best performance, within reasonable indexing time, when using the previous year's query topics; it was also the highest number that their system could support. The entries in the term-by-document matrix were the local weight (frequency of a term in a document) multiplied by the global weight of the term (based on the occurrences of the term across the entire collection). The weighting scheme used was "log-entropy", which is given by the formula below. A term whose appearance tends to be equally likely among the documents is given a low weight, and a term whose appearance is concentrated in a few documents is given a higher weight. The elements of the matrix are of the form:

a(i, j) = L(i, j) * G(i), with local weighting L(i, j) = log(tf_ij + 1)
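The global weight G(i) is not written out in this summary. For completeness, a sketch of the standard log-entropy global weight from the LSI literature is given below; it matches the behaviour described above (low weight for terms spread evenly across documents, higher weight for concentrated terms), though the cited paper may differ in detail:

% p_{ij}: share of term i's occurrences that fall in document j; n: number of documents
p_{ij} = \frac{tf_{ij}}{\sum_{j'} tf_{ij'}}, \qquad
G(i) = 1 + \sum_{j=1}^{n} \frac{p_{ij}\,\log p_{ij}}{\log n}, \qquad
a_{ij} = L(i,j)\,G(i) = \log(tf_{ij}+1)\,G(i)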
The next step was to “fold in” the remaining 91,000 English-only documents
in that semantic space, which means that the vector representations were calculated
for those remaining documents. The resulting index had 70,000 unique terms,
covering both languages. They did not index terms which were less than 3 characters
long. They did not use phrases or multiword recognition, syntactic or semantic
parsing, word sense disambiguation, heuristic association, spellchecking or correction,
proper noun identification, a controlled vocabulary, a thesaurus, or any manual
indexing. All the terms and the documents from both languages are represented in the same conceptual space. Therefore, a query in one language may retrieve documents from any other language.
This situation benefits from cross-linguistic homonyms, i.e., words that have the same spelling and meaning in both languages; e.g., "singular" is represented by one vector only, accounting for both languages. On the other hand, it suffers from "false friends", i.e., words that have the same spelling across languages but different meanings; e.g., "data" in Portuguese means "date" instead of "information". The problem in this case is that false friends are wrongly represented by only one point in space, placed at the average of all meanings. The ideal scenario would be taking advantage of cross-linguistic homonyms while at the same time avoiding false friends. They are still looking for a way to do that automatically. The similarity between a term and its translation should be very high.
The Singular Value Decomposition method is tested in Indexing by Latent Semantic Analysis [39]. This work gives details about how to solve the problem of multiple terms referring to the same object; the relevant documents are characterized and identified properly. A 12-term by 9-document matrix is decomposed using SVD in their experiments. The results are modestly encouraging. They show the latent semantic indexing method to be superior to simple term matching in one standard case and equal in another. Further, for these two databases, performance with LSI is superior to that obtained with the system described by Voorhees; it performed better than SMART in one case and equally in the other when term selection differences were eliminated. In order to assess the value of the basic representational method, they have so far avoided the addition of refinements that one would consider in a real application, such as discriminative term weighting, stemming, phrase finding, or a method of handling negation or disjunction in the queries.
So far they have tested the method only with queries formulated to be used
against other retrieval methods; the method almost certainly could do better with
queries in some more appropriate format. They have projects in progress to add
standard enhancements and to incorporate them in a fully automatic indexing and
retrieval system. In addition, they are working on methods to incorporate the very low
frequency, but often highly informative words that were filtered out in the trial
analysis procedures. It seems likely that with such improvements LSI will offer a
more effective retrieval method than has previously been available.
The method of Latent Semantic Indexing is described briefly in the Latent Semantic Indexing Overview [40]. It describes some advantages of LSI such as reduced dimensionality and its handling of polysemy, synonymy and term dependence. In the analysis of LSI, about 70,000 documents and 90,000 terms were used, so the term-by-document matrix contains only 0.001% - 0.002% non-zero entries. The computation took nearly 18 hours of CPU time. In this setting, LSI gave a 16% improvement over the original keyword method.
Latent Semantic Indexing is a technique that projects queries and documents into a space with "latent" semantic dimensions. In the latent semantic space, a query and a document can have high cosine similarity even if they do not share any terms, as long as their terms are semantically similar in a sense to be described later. One can look at LSI as a similarity metric that is an alternative to word-overlap measures like tf-idf. The latent semantic space that LSI projects into has fewer dimensions than the original space, so LSI is also a method for dimensionality reduction. A dimensionality reduction technique takes a set of objects that exist in a high-dimensional space and represents them in a low-dimensional space, often a two-dimensional or three-dimensional space, for the purpose of visualization. Latent semantic indexing is the application of a particular mathematical technique, called Singular Value Decomposition or SVD, to a word-by-document matrix. LSI is a least-squares method: the projection into the latent semantic space is chosen such that the representations in the original space are changed as little as possible, as measured by the sum of the squares of the differences.
SVD takes a matrix A and represents it as Â in a lower-dimensional space such that the "distance" between the two matrices, as measured by the 2-norm, is minimized. The 2-norm for matrices is the equivalent of Euclidean distance for vectors. SVD projects an n-dimensional space onto a k-dimensional space, where n >> k. In our application, n is the number of word types in the collection; values of k that are frequently chosen are 100 and 150. The projection transforms a document's vector in n-dimensional word space into a vector in the k-dimensional reduced space. There are many different mappings from high-dimensional to low-dimensional spaces, and Latent Semantic Indexing chooses the mapping that is optimal in the sense that it minimizes this distance. This setup has the consequence that the dimensions of the reduced space correspond to the axes of greatest variation. Reference [41] describes a method for retrieving English-Greek documents using Latent Semantic Indexing for Cross Language Information Retrieval. The documents of the two languages are clustered along the X-axis and Y-axis of a two-dimensional vector space. A parsing mechanism is used, and terms must appear more than once in the database. This paper mainly focuses on query matching within the database. Folding-in is another technique, used when an LSI-generated database already exists. In the folding-in technique, each new document is represented as a weighted sum of its constituent term vectors, and this is appended to the existing documents.
Reference [42], Latent Semantic Indexing: A Fast Track Tutorial, describes how Singular Value Decomposition (SVD) is used in Latent Semantic Indexing (LSI) to score documents and queries. Do-it-yourself procedures using an online matrix calculator are described. The tutorial helps readers learn about basic LSI models, then move forward to advanced models, understand how LSI ranks documents, replicate all calculations and experiment with their own data, dispel the many SEO myths and fallacies about LSI, and stay away from "LSI-based" snake-oil search marketers.
The Singular Value Decomposition method is described in Singular Value Decomposition [43]; SVD is a mathematical technique used for reducing the dimension of a matrix. This tutorial describes how the documents are decomposed from a single matrix and gives the relation between correlated documents and uncorrelated documents. In the tutorial, two-dimensional data points are used as the illustration.
3.3 LATENT SEMANTIC INDEXING
Indexing is very simple for a single language, but for multilingual collections it is quite difficult. So, for this we propose Latent Semantic Indexing (LSI) using Singular Value Decomposition (SVD). Latent semantic indexing is the best approach for mapping each document and query vector into a reduced-dimensional space [47]. It is based on concept matching rather than matching of index terms. Latent Semantic Indexing is a variant of the vector-retrieval method in which the dependencies between terms are explicitly modeled and exploited to improve retrieval. One advantage of the LSI representation is that a query can retrieve a relevant document even if they have no words in common.
Most information-retrieval methods depend on exact matches between words in users' queries and words in documents. Typically, documents containing one or more query words are returned to the user. Such methods will, however, fail to retrieve relevant materials that do not share words with users' queries. One reason for this is that the standard retrieval models treat words as if they are independent, although it is quite obvious that they are not. A central theme of LSI is that term-term interrelationships can be automatically modeled and used to improve retrieval; this is critical in cross-language retrieval, since direct term matching is of little use.
Latent semantic indexing adds an important step to the document indexing
process. In addition to recording which keywords a document contains [46], the
method examines the document collection as a whole, to see which other documents
contain some of those same words. LSI considers documents that have many words in
common to be semantically close, and ones with few words in common to be
semantically distant. This simple method correlates surprisingly well with how a
human being, looking at content, might classify a document collection. Although the
LSI algorithm doesn't understand anything about what the words mean, the patterns it
notices can make it seem astonishingly intelligent.
When you search an LSI-indexed database, the search engine looks at
similarity values it has calculated for every content word, and returns the documents
that it thinks best fit the query. Because two documents may be semantically very
close even if they do not share a particular keyword, LSI does not require an exact
match to return useful results. Where a plain keyword search fails if there is no
exact match, LSI will often return relevant documents that don't contain the keyword
at all.
Suppose LSI is used to index a collection of mathematical articles. If the words n-dimensional [49], manifold and topology appear together in enough articles, the search algorithm will notice that the three terms are semantically close. A search for n-dimensional manifolds will therefore return a set of articles containing that phrase (the same result a regular search would give), but also articles that contain just the word topology. The search engine understands nothing about mathematics, but examining a sufficient number of documents teaches it that the three terms are related. It then uses that information to provide an expanded set of results with better recall than a plain keyword search.
LSI examines the similarity of the contexts in which words appear, and creates a reduced-dimension feature-space representation in which words that occur in similar contexts are near each other. That is, the method first creates a representation that captures the similarity of usage of terms and then uses this representation for retrieval. The derived feature space reflects these interrelationships. LSI uses a method from linear algebra, singular value decomposition (SVD), to discover the important associative relationships. It is not necessary to use any external dictionaries, thesauri [51], or knowledge bases to determine these word associations, because they are derived from a numerical analysis of existing texts. The learned associations are specific to the domain of interest, and are derived completely automatically.
The singular-value decomposition (SVD) technique is closely related to eigenvector decomposition and factor analysis. For information retrieval and filtering applications, in the first step a large term-document matrix is constructed, in much the same way as vector or Boolean methods do. This term-document matrix is decomposed into a set of, typically, k orthogonal factors from which the original matrix can be approximated by linear combination; this analysis reveals the latent structure in the matrix that is obscured by noise or by variability in word usage.
Traditional vector methods represent documents as linear combinations of orthogonal terms, as shown in the left half of the figure, so that the angle between two documents depends on the frequency with which the same terms occur in both, without regard to any correlations among the terms [50]. Here, Doc 3 contains Term 2, Doc 1 contains Term 1, and Doc 2 contains both terms. LSI, in contrast, represents terms as continuous values on each of the orthogonal indexing dimensions. Since the number of factors or dimensions k is much smaller than the number of unique terms, the terms will not be independent, as depicted in the right half of the figure. When two terms are used in similar contexts (documents), they will have similar vectors in the reduced-dimension LSI representation. LSI partially overcomes some of the deficiencies of assuming independence of words, and provides a way of dealing with synonymy automatically without the need for a manually constructed thesaurus. Detailed mathematical descriptions and examples of the underlying LSI/SVD method have been presented in the literature.
The result of the SVD is a set of vectors representing the location of each term and document in the reduced-dimension LSI representation. Retrieval proceeds by using the terms in a query to identify a point in the space; technically, the query is located at the weighted vector sum of its constituent terms. The documents are then ranked by their similarity to the query, typically using a cosine measure of similarity. While the most common retrieval scenario involves returning documents in response to a user query, the LSI representation allows for much more flexible retrieval scenarios. Since both term and document vectors are represented in the same space, similarities between any combination of terms and documents can be easily obtained. For example, a user can ask for a term's nearest documents, a term's nearest terms, a document's nearest terms, or a document's nearest documents.
New documents can be added to the LSI representation using a procedure
called folding in. This method assumes that the LSI space is a reasonable
characterization of the important underlying dimensions of similarity, and that new
items can be described in terms of the existing dimensions. Any document not used in
the construction of the semantic space is located at the weighted vector sum of its
constituent terms. This is exactly how queries are handled and has the desirable
mathematical property that a document that is already in the space is folded in at the
same location. A new term is located at the vector sum of the documents in which it
occurs. In single-language document retrieval, the LSI method has equaled or
outperformed standard vector methods in almost every case, and was as much as 30%
better in some cases.
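As a brief restatement in formulas (consistent with the query projection q = q^T U_k S_k^-1 given in Section 3.4), folding-in can be sketched as follows, where d is the raw term vector of a new document and t is the document-occurrence vector of a new term:

d_new = d^T U_k S_k^-1   (folding in a new document)
t_new = t^T V_k S_k^-1   (folding in a new term)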
3.3.1. ADVANTAGES
A. TRUE DIMENSIONS
The assumption in LSI and similarly for other forms of dimensionality
reduction like principal component analysis is that the new dimensions are a better
representation of documents and queries. The metaphor underlying the term “latent”
is that these new dimensions are the true representation. This true representation was
then obscured by a generation process that expressed a particular dimension with one
set of words in some documents and a different set of words in another document. LSI
analysis recovers the original semantic structure of the space and its original dimensions [48]. The three major advantages of using the LSI representation are described below under the labels synonymy, polysemy, and term dependence.
B. SYNONYMY
Synonymy refers to the fact that the same underlying concept can be described
using different terms. Traditional retrieval strategies have trouble discovering
documents on the same topic that use a different vocabulary. In LSI, the concept in question, as well as all documents that are related to it, are likely to be represented by a similar weighted combination of indexing variables.
C. POLYSEMY
Polysemy describes words that have more than one meaning, which is a common property of language. Large numbers of polysemous words in the query can
reduce the precision of a search significantly. By using a reduced representation in
LSI, one hopes to remove some "noise" from the data, which could be described as
rare and less important usages of certain terms. This would work only when the real
meaning is close to the average meaning. Since the LSI term vector is just a weighted
average of the different meanings of the term, when the real meaning differs from the
average meaning, LSI may actually reduce the quality of the search.
D. TERM DEPENDENCE
The traditional vector space model assumes term independence and terms
serve as the orthogonal basis vectors of the vector space. Since there are strong
associations between terms in language, this assumption is never satisfied. While term
independence represents the most reasonable first-order approximation, it should be possible to obtain improved performance by using term associations in the retrieval process. Adding common phrases as search items is a simple application of this approach. On the other hand, the LSI factors are orthogonal by definition, and terms are positioned in the reduced space in a way that reflects the correlations in their use
across documents. It is very difficult to take advantage of term associations without
dramatically increasing the computational requirements of the retrieval problem.
While the LSI solution is difficult to compute for large collections, it need only be
constructed once for the entire collection and performance at retrieval time is not
affected.
3.3.2. DISADVANTAGES
A. STORAGE
One could argue that the SVD representation is more compact. Many documents have more than 150 unique terms, so the sparse vector representation would seem to take up more storage space than the compact SVD representation if the dimensionality is reduced to 150. In reality, the opposite is actually true. For example, the document-by-term matrix for the Cranfield collection used in Hull's experiments had 90,441 non-zero entries after stemming and stop word removal. Retaining only 100 of the possible 1,399 LSI vectors requires storing 139,900 values for the documents alone, and the term vectors require the storage of roughly 400,000 additional values. In addition, the LSI values are real numbers while the original term frequencies are integers, adding to the storage costs. Using LSI vectors, one no longer takes advantage of the fact that each term occurs in a limited number of documents, which accounts for the sparse nature of the term-by-document matrix. With recent advances in electronic storage media, the storage requirements of LSI are not a critical problem, but the loss of sparseness has other, more serious implications.
B. EFFICIENCY
One of the most important speedups in vector space search comes from using
an inverted index. As a consequence, only documents that have some terms in
common with the query must be examined during the search. With LSI, however, the
query must be compared to every document in the collection. However, several
factors can reduce or eliminate this drawback. If the query has more terms than its
representation in the LSI vector space, then inner product similarity scores will take
more time to compute in term space. For example, if relevance feedback is conducted
using the full text of the relevant documents, the number of terms in the query is
likely to grow to be many times the number of LSI vectors, leading to a corresponding
increase in search time. In addition, using a data structure such as the k-d tree in
conjunction with LSI would greatly speed the search for nearest neighbors, provided
only a partial ordering of the documents is required. Most of the additional costs come
in the pre-processing stage when the SVD and the k-d tree are computed, and actual
search time should not be significantly degraded. Other query expansion techniques
suffer even more heavily from the difficulties described above, and LSI performs
relatively well for long documents due to the small number of context vectors used to
describe each document. However, implementation of LSI does require an additional
investment of storage and computing time.
C. TOWARD A THEORETICAL FOUNDATION
Although improved performance has been observed empirically, there is very little in the literature in the way of a mathematical theory that predicts this improvement. In this section we briefly describe one paper that attempts to use mathematical techniques to rigorously explain the empirically observed improved performance of LSI. Papadimitriou et al. start by citing an interesting mathematical fact due to Eckart and Young, often cited as an explanation of the improved performance of LSI, which states, informally, that LSI retains as much as possible the relative position of the document vectors while projecting them to a lower-dimensional space. This may only provide an explanation of why LSI does not deteriorate too much in performance over conventional vector-space methods; it fails to justify the observed improvement in precision and recall.
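For reference, the Eckart and Young result mentioned above can be stated as follows: if A = U S V^T and A_k = U_k S_k V_k^T keeps only the k largest singular values, then A_k is the rank-k matrix closest to A,

\| A - A_k \|_2 \;=\; \min_{\mathrm{rank}(B) \le k} \| A - B \|_2 \;=\; \sigma_{k+1},

where \sigma_{k+1} is the (k+1)-th largest singular value of A.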
3.3.3. APPLICATIONS OF LSI
A. INFORMATION RETRIEVAL
The application of Singular Value Decomposition to information retrieval was originally proposed by a group of researchers at Bellcore, who called it Latent Semantic Indexing in this context. At this point, it should be clear how to use LSI for IR. In performance reports for several information science test collections, the average precision using LSI ranged from comparable to about 30% better than that obtained using standard keyword vector methods. The LSI method performs better relative to standard vector methods when the queries and relevant documents do not share many words, and at high levels of recall.
B. RELEVANCE FEEDBACK
Most of the tests of Relevance Feedback using LSI have involved a method in
which the initial query is replaced with the vector sum of the documents the users
have selected as relevant. The use of negative information has not yet been exploited
in LSI; for example, by moving the queries away from documents that the user has
indicated are irrelevant. Replacing the users’ query with the first relevant document
improves performance by an average of 33% and replacing it with the average of the
first three relevant documents improves performance by an average of 67%.
Relevance feedback provides sizable and consistent retrieval advantages. One way of thinking about the success of these methods is that many words are added to an initial query that is usually quite impoverished. LSI does some of this kind of query expansion or enhancement even without relevance information, but can be augmented with relevance information.
C. INFORMATION FILTERING
Applying LSI to information filtering applications is straightforward. An initial sample of documents is analyzed using standard LSI/SVD tools. A user's interest is represented as one vector in this reduced-dimension LSI space. Each new document is matched against this vector, and if it is similar enough to the interest vector it is recommended to the user. Learning methods like relevance feedback can be used to improve the representation of interest vectors over time. Performance studies are encouraging.
D. TREC
Recently, LSI has been used for both information filtering and information retrieval in TREC. The queries are very long and have detailed descriptions, averaging more than 50 words in length. The fact that the TREC queries are quite rich means that the smallest advantages would be expected for LSI or any other method that attempts to enhance user queries. The big challenge in this collection was to extend the LSI tools to handle collections of this size, and the results were quite encouraging. At the time of the TREC conferences it was not reasonable to compute Â for the complete collection. Instead, a sample of about 70,000 documents and 90,000 terms was used. Such term-by-document matrices (A) are quite sparse, containing only .001 to .002 percent non-zero entries. Computing a rank-200 approximation, i.e. the 200 largest singular values and corresponding singular vectors, required about 18 hours of CPU time on a SUN SPARCstation 10 workstation.
Documents not in the original LSI analysis were folded into the reduced space. Although it is very difficult to compare across systems in any detail because of large pre-processing, representation and matching differences, LSI performance was quite good. For filtering tasks, using information about known relevant documents to create a vector for each query was beneficial. The retrieval advantage of 31% was somewhat smaller than that observed for other filtering tests and is attributable to the good initial queries in TREC. For retrieval tasks, LSI showed a 16% improvement when compared with the keyword vector methods. Again, the detailed original queries account for the somewhat smaller advantages than previously observed.
E. CROSS-LANGUAGE RETRIEVAL
It is important to note that the LSI analysis makes no use of English syntax or semantics. This means that LSI is applicable to any language. In addition, it can be used for cross-language retrieval: documents are in several languages and user queries can match documents in any language. What is required for cross-language applications is a common space in which words from many languages are represented; one method for creating such an LSI space is described here. The original term-document matrix is formed using a collection of abstracts that have versions in more than one language. Each abstract is treated as the combination of its French and English versions. The truncated SVD is computed for this term by combined-abstract matrix A.
The resulting space consists of combined-language abstracts, English words
and French words. English words and French words that occur in similar combined
abstracts will be near each other in the reduced-dimension LSI space. After this
analysis, monolingual abstracts can be folded-in: a French abstract will simply be
located at the vector sum of its constituent words that are already in the LSI space.
Queries in either French or English can be matched to French or English abstracts.
There is no difficult translation involved in retrieval from the multilingual LSI space.
Experiments showed that the completely automatic multilingual space was more
effective than single-language spaces. The retrieval of French documents in response
to English queries was as effective as first translating the queries into French and
searching a French-only database. The method has shown almost as good results for
retrieving English abstracts and Japanese Kanji ideographs, and for multilingual
translations of the Bible.
F. MATCHING PEOPLE INSTEAD OF DOCUMENTS
In a couple of applications, LSI has been used to return the best matching
people instead of documents. In these applications, people were represented by articles they had written. In one application, known as the Bellcore Advisor, a system was
developed to find local experts relevant to users' queries. A query was matched to the
nearest documents and project descriptions and the author's organization was returned
as the most relevant internal group. In another application, LSI was used to automate
the assignment of reviewers to the submitted conference papers. Several hundred
reviewers were described by means of texts they had written, and this formed the
basis of the LSI analysis. Hundreds of submitted papers were represented by their
abstracts, and matched to the closest reviewers. These LSI similarities were used to
assign papers to reviewers for a major human-computer interaction conference.
Subsequent analyses suggested that these completely automatic assignments were as
good as those of human experts.
G. NOISY INPUT
Because LSI does not depend on literal keyword matching, it is especially useful when the text input is noisy, as with OCR (Optical Character Recognition), open input, or spelling errors. If there are scanning errors and a word such as Dumais is misspelled as Duniais, many of the other words in the document will still be spelled correctly. If these correctly spelled context words also occur in documents that contained a correctly spelled version of Dumais, then Dumais will probably be near Duniais in the k-dimensional space determined by Â.
H. OTHERS
[119,120] have used SVD and related dimension reduction ideas for word sense disambiguation and information retrieval work. LSI/SVD has also been used as a first step in conjunction with statistical classification, for example discriminant analysis; using the LSI-derived dimensions effectively reduces the number of predictor variables for classification. LSI/SVD has likewise been used to reduce the training-set dimension for a neural network protein classification system used in human genome research.
I. OPEN COMPUTATIONAL OR STATISTICAL ISSUES
There are a number of computational/statistical improvements that would
make LSI even more useful, especially for large collections:
• Computing the truncated SVD of extremely large sparse matrices in an efficient way,
• Performing SVD-updating in real time for databases that change frequently, and
• Efficiently comparing queries to documents to find near neighbors in high-dimensional spaces.
3.4 SINGULAR VALUE DECOMPOSITION
The Singular Value Decomposition (SVD) is a widely used technique to
decompose a matrix into several component matrices, exposing many of the useful
and interesting properties of the original matrix. The decomposition of a matrix is
often called a factorization. Ideally, the matrix is decomposed into a set of factors that
are optimized based on some criterion. For example, a criterion might be the
reconstruction of the decomposed matrix. The decomposition of a matrix is also
useful when the matrix is not of full rank. That is, the rows or columns of the matrix
are linearly dependent.
Theoretically, one can use Gaussian elimination to reduce the matrix to row
echelon form and then count the number of nonzero rows to determine the rank.
However, this approach is not practical when working in finite precision arithmetic. A
similar case presents itself when using LU decomposition where L is in lower
triangular form with 1's on the diagonal and U is in upper triangular form. Ideally, a
rank-deficient matrix may be decomposed into a smaller number of factors than the
original matrix and still preserve all of the information in the matrix.
The SVD, in general, represents an expansion of the original data in a
coordinate system where the covariance matrix is diagonal. Using the SVD, one can
determine the dimension of the matrix range or more-often called the rank. The rank
of a matrix is equal to the number of linearly independent rows or columns. This is
often referred to as a minimum spanning set or simply a basis. The SVD can also
quantify the sensitivity of a linear system to numerical error or obtain a matrix
inverse. Additionally, it provides solutions to least-squares problems and handles
situations when matrices are either singular or numerically very close to singular.
The following are the steps involved in constructing the SVD:
Step 1: Score term weights and construct the term-document matrix A and the query matrix.
Step 2: Decompose the matrix A and find the U, S and V matrices, where A = USV^T.
Step 3: Implement a rank-k approximation by keeping the first k columns of U and V and the first k rows and columns of S.
Step 4: Find the new document vector coordinates in this reduced k-dimensional space; the rows of V_k hold the coordinates of the individual document vectors.
Step 5: Find the new query vector coordinates in the reduced k-dimensional space: q = q^T U_k S_k^-1.
Step 6: Rank documents in decreasing order of query-document cosine similarity.
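The sketch below walks through these steps with the JAMA package that is used later in this chapter; the tiny term-by-document matrix, the query vector and the value k = 2 are hypothetical and chosen only for illustration:

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class LsiSketch {
    public static void main(String[] args) {
        // Step 1: a small hypothetical term-by-document matrix A (rows = terms, columns = documents)
        Matrix A = new Matrix(new double[][] {
                {1, 0, 1, 0},
                {0, 1, 0, 1},
                {1, 1, 0, 0},
                {0, 0, 1, 1},
                {1, 0, 0, 1}
        });
        // hypothetical query vector over the same five terms
        Matrix q = new Matrix(new double[][] {{1}, {0}, {1}, {0}, {0}});

        // Step 2: decompose A = U S V^T
        SingularValueDecomposition svd = A.svd();
        Matrix U = svd.getU();
        Matrix S = svd.getS();
        Matrix V = svd.getV();

        // Step 3: rank-k approximation, keeping the first k columns of U and V
        // and the k x k upper-left corner of S
        int k = 2;
        Matrix Uk = U.getMatrix(0, U.getRowDimension() - 1, 0, k - 1);
        Matrix Sk = S.getMatrix(0, k - 1, 0, k - 1);
        Matrix Vk = V.getMatrix(0, V.getRowDimension() - 1, 0, k - 1);

        // Step 4: the rows of Vk are the document coordinates in the reduced space
        // Step 5: project the query into the same space: q_k = q^T U_k S_k^-1
        Matrix qk = q.transpose().times(Uk).times(Sk.inverse());

        // Step 6: rank documents by cosine similarity between q_k and each row of Vk
        for (int d = 0; d < Vk.getRowDimension(); d++) {
            Matrix dk = Vk.getMatrix(d, d, 0, k - 1);
            double cosine = qk.times(dk.transpose()).get(0, 0) / (qk.normF() * dk.normF());
            System.out.println("document " + (d + 1) + " similarity = " + cosine);
        }
    }
}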
Singular value decomposition (SVD) can be looked at from three mutually compatible points of view. On the one hand, it can be seen as a method for transforming correlated variables into a set of uncorrelated ones that better expose the various relationships among the original data items. At the same time, SVD is a method for identifying and ordering the dimensions along which data points exhibit the most variation. This ties in to the third way of viewing SVD, which is that once we have identified where the most variation is, it is possible to find the best approximation of the original data points using fewer dimensions. Hence, the SVD can be seen as a method for data reduction. As an illustration of these ideas, consider a set of 2-dimensional data points.
The regression line running through them shows the best approximation of the original data with a 1-dimensional object, a line. It is the best approximation in the sense that it is the line that minimizes the distance between each original point and the line. If a perpendicular line is drawn from each point to the regression line, and the intersection is taken as the approximation of the original data point, we obtain a reduced representation of the original data that captures as much of the original variation as possible.
The second regression line, perpendicular to the first, captures as much of the variation as possible along the second dimension of the original data set. It does a poorer job of approximating the original data because it corresponds to a dimension exhibiting less variation to begin with. It is possible to use these regression lines to generate a set of uncorrelated data points that will show sub-groupings in the original data not necessarily visible at first glance. These are the basic ideas behind SVD: taking a high-dimensional, highly variable set of data points and reducing it to a lower-dimensional space that exposes the substructure of the original data more clearly and orders it from most variation to least. What makes SVD practical for NLP applications is that one can simply ignore variation below a particular threshold to massively reduce the data while being assured that the main relationships of interest have been preserved. The following is an example of a full singular value decomposition.
SVD is based on a theorem from linear algebra which says that a rectangular matrix A can be broken down into the product of three matrices: an orthogonal matrix U, a diagonal matrix S, and the transpose of an orthogonal matrix V. The theorem is usually presented as

A_mn = U_mm S_mn V_nn^T

where U^T U = I and V^T V = I, the columns of U are orthonormal eigenvectors of AA^T, the columns of V are orthonormal eigenvectors of A^T A, and S is a diagonal matrix containing the square roots of the eigenvalues of AA^T (equivalently, of A^T A) in descending order. The following example merely applies this definition to a small matrix in order to compute its SVD. In the next section, we attempt to interpret the application of SVD to document classification.
A. TF-IDF
The tf–idf weight (term frequency–inverse document frequency) is a weight
often used in information retrieval and text mining. This weight is a statistical
measure used to evaluate how important a word is to a document in a collection or
corpus. The importance increases proportionally to the number of times a word
appears in the document but is offset by the frequency of the word in the corpus.
Variations of the tf–idf weighting scheme are often used by search engines as a central
tool in scoring and ranking a document's relevance given a user query.
B. TERM FREQUENCY
The number of times a term occurs in a document is called its term frequency.
By taking into account two factors, term frequency (TF) and inverse document frequency (IDF), it is possible to assign "weights" to search results and therefore order them statistically. Put another way, a search result's score ("ranking") is the product of TF and IDF:
TF-IDF = TF * IDF, where:
TF = C / T, where C is the number of times a given word appears in a document and T is the total number of words in that document, and
IDF = log(D / DF), where D is the total number of documents in the corpus and DF is the number of documents containing the given word.
Consider a query such as "the brown cow". The term "the" is so common that raw term frequency will tend to incorrectly emphasize documents which happen to use the word "the" more frequently, without giving enough weight to the more meaningful terms "brown" and "cow". The term "the" is not a good keyword for distinguishing relevant from non-relevant documents, whereas the rarer words "brown" and "cow" are good keywords for distinguishing relevant documents from non-relevant ones. Hence an inverse document frequency factor is incorporated, which diminishes the weight of terms that occur very frequently in the collection and increases the weight of terms that occur rarely. A high tf-idf weight is reached by a high term frequency in the given document and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. The tf-idf value for a term will be greater than zero if and only if the ratio inside the idf's log function is greater than 1. Depending on whether a 1 is added to the denominator, a term occurring in all documents will have either a zero or a negative idf, and if the 1 is added to the denominator, a term that occurs in all but one document will have an idf equal to zero.
C. EXAMPLE
Consider a document containing 100 words wherein the word cow appears 3
times. Following the previously defined formulas, the term frequency (TF) for cow is
then (3 / 100) = 0.03. Now, assume there are 10 million documents and cow appears
in one thousand of these. Then, the inverse document frequency is calculated as log
(10 000 000 / 1 000) = 4. The TF-IDF score is the product of these quantities: 0.03 × 4
= 0.12.
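The arithmetic of this example can be checked directly against the formulas above:

public class TfIdfExample {
    public static void main(String[] args) {
        double c = 3;            // occurrences of "cow" in the document
        double t = 100;          // total words in the document
        double d = 10_000_000;   // documents in the corpus
        double df = 1_000;       // documents containing "cow"

        double tf = c / t;                 // 3 / 100 = 0.03
        double idf = Math.log10(d / df);   // log10(10,000) = 4
        double tfidf = tf * idf;           // 0.03 * 4 = 0.12

        System.out.printf("tf = %.2f, idf = %.0f, tf-idf = %.2f%n", tf, idf, tfidf);
    }
}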
D. APPLICATIONS IN VECTOR SPACE MODEL
The tf-idf weighting scheme is often used in the vector space model together
with cosine similarity to determine the similarity between two documents.
E. JAMA
JAMA is a basic linear algebra package for Java. It provides user-level classes
for constructing and manipulating real, dense matrices. It is meant to provide
sufficient functionality for routine problems, packaged in a way that is natural and
understandable to non-experts.
F. CAPABILITIES
JAMA comprises six Java classes: Matrix, CholeskyDecomposition, LUDecomposition, QRDecomposition, SingularValueDecomposition and EigenvalueDecomposition. The Matrix class provides the fundamental operations of numerical linear algebra. Various constructors create matrices from two-dimensional arrays of double precision floating point numbers. Various get and set methods provide access to submatrices and matrix elements. The basic arithmetic operations include
matrix addition and multiplication, matrix norms and selected element-by-element
array operations. A convenient matrix print method is also included. Five
fundamental matrix decompositions, which consist of pairs or triples of matrices,
permutation vectors, and the like, produce results in five decomposition classes.
These decompositions are accessed by the Matrix class to compute solutions of
simultaneous linear equations, determinants, inverses and other matrix functions. The
five decompositions are
Cholesky Decomposition of symmetric, positive definite matrices
LU Decomposition (Gaussian elimination) of rectangular matrices
QR Decomposition of rectangular matrices
Eigenvalue Decomposition of both symmetric and nonsymmetric square
matrices
Singular Value Decomposition of rectangular matrices
The current JAMA deals only with real matrices. We expect that future versions
will also address complex matrices. This has been deferred since crucial design
decisions cannot be made until certain issues regarding the implementation of
complex matrices in the Java language are resolved.
The design of JAMA represents a compromise between the need for pure and elegant
object-oriented design and the need to enable high performance implementations.
Table 3.1: JAMA Package (Summary of Capabilities)

Object Manipulation: constructors; set elements; get elements; copy; clone
Elementary Operations: addition; subtraction; multiplication; scalar multiplication; element-wise multiplication; element-wise division; unary minus; transpose; norm
Decompositions: Cholesky; LU; QR; SVD; symmetric eigenvalue; nonsymmetric eigenvalue
Equation Solution: nonsingular systems; least squares
Derived Quantities: condition number; determinant; rank; inverse; pseudoinverse
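The capabilities summarized in Table 3.1 can be exercised with a few lines of code. The following sketch assumes the standard Jama package (package name Jama) is available on the classpath; the matrix values and variable names are arbitrary and for illustration only.

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class JamaDemo {
    public static void main(String[] args) {
        // A small rectangular matrix built from a 2-D array of doubles.
        double[][] vals = { {1.0, 2.0}, {3.0, 4.0}, {5.0, 6.0} };
        Matrix a = new Matrix(vals);

        // Elementary operations.
        Matrix at  = a.transpose();          // transpose
        Matrix ata = at.times(a);            // matrix multiplication

        // Singular Value Decomposition: A = U * S * V^T
        SingularValueDecomposition svd = a.svd();
        Matrix u = svd.getU();
        Matrix s = svd.getS();
        Matrix v = svd.getV();

        // Derived quantities.
        System.out.println("rank = " + a.rank());
        System.out.println("cond = " + a.cond());

        // Print U with column width 9 and 4 digits after the decimal point.
        u.print(9, 4);
    }
}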
3.5 EXPERIMENTAL SETUP & RESULTS
In this experiment, a parallel corpus has been created. The documents were retrieved from the India.gov web site, which contains semantically equivalent Hindi and English documents. Documents in both languages have been divided into paragraphs, and each paragraph is kept as a separate document. Each paragraph is mapped to its translation-language counterpart, and the mapping data is stored in a MySQL database server.
The corpus consists of 180 Hindi and 180 English parallel documents. For the purpose of testing, every paragraph is taken as a single document.
Figure 3.2. System overview
In this work, a system that can find the cross-language mate of a given document has been created. First, the system is trained with bilingual documents. In this stage, the English documents are stemmed with the Porter stemmer, while the Hindi documents are stemmed manually. After stemming, stop words are removed to improve retrieval performance.
By counting the frequency of each word in the documents, a term-by-document matrix (the feature space) is created. The feature space is normalized using Term Frequency-Inverse Document Frequency (TF-IDF), because longer documents may otherwise dominate the retrieval results. The normalized term-by-document matrix is then decomposed into U, S, and V matrices using singular value decomposition (SVD). For this, the JAMA package is used, which provides the classes needed to decompose the feature space.
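A rough illustration of how this feature space might be constructed is given below. The tokenization, the logarithm base and all identifiers are assumptions made for the sketch and are not taken from the implemented system.

import java.util.*;
import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class FeatureSpaceSketch {

    public static void main(String[] args) {
        // Toy corpus: each string stands for one (already stemmed, stop-word-free) paragraph.
        String[] docs = { "india foreign policy", "foreign policy peace", "economic development india" };

        // Vocabulary: assign a row index to every distinct term.
        Map<String, Integer> vocab = new LinkedHashMap<>();
        List<String[]> tokenized = new ArrayList<>();
        for (String d : docs) {
            String[] toks = d.split("\\s+");
            tokenized.add(toks);
            for (String t : toks) vocab.putIfAbsent(t, vocab.size());
        }

        int terms = vocab.size(), n = docs.length;
        double[][] counts = new double[terms][n];
        for (int j = 0; j < n; j++)
            for (String t : tokenized.get(j))
                counts[vocab.get(t)][j] += 1.0;

        // TF-IDF weighting: tf = count / docLength, idf = log10(N / df).
        double[][] w = new double[terms][n];
        for (int i = 0; i < terms; i++) {
            int df = 0;
            for (int j = 0; j < n; j++) if (counts[i][j] > 0) df++;
            double idf = Math.log10((double) n / df);
            for (int j = 0; j < n; j++)
                w[i][j] = (counts[i][j] / tokenized.get(j).length) * idf;
        }

        // Decompose the weighted term-by-document matrix with JAMA.
        SingularValueDecomposition svd = new Matrix(w).svd();
        System.out.println("Singular values: " + Arrays.toString(svd.getSingularValues()));
    }
}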
A. LSI
For multilingual collections, indexing is quite difficult. For this reason, Latent Semantic Indexing (LSI), based on Singular Value Decomposition (SVD), is proposed. LSI maps each document and query vector into a reduced-dimensional space and is based on concept matching rather than the matching of index terms. Latent Semantic Indexing is a variant of the vector-retrieval method in which the dependencies between terms are explicitly modeled and exploited to improve retrieval.
B. SVD
1. Score the term weights and construct the term-document matrix A and the query matrix.
2. Decompose the matrix A into the U, S and V matrices, where A = U S V^T.
3. Implement a rank-k approximation by keeping the first k columns of U and V and the first k rows and columns of S.
4. Find the new document vector coordinates in this reduced k-dimensional space: the rows of V_k hold the coordinates of the individual document vectors.
5. Find the new query vector coordinates in the reduced k-dimensional space: q_k = q^T U_k S_k^-1.
6. Rank the documents in decreasing order of query-document cosine similarity (a JAMA sketch of steps 3 to 5 follows this list).
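A hedged JAMA sketch of steps 3 to 5 is given below. It assumes that the TF-IDF matrix A and a weighted query vector q are already available; the index ranges passed to getMatrix, the toy data and the variable names are illustrative.

import Jama.Matrix;
import Jama.SingularValueDecomposition;

public class LsiProjectionSketch {

    /**
     * Projects documents and a query into a rank-k LSI space.
     * a: term-by-document TF-IDF matrix (terms x docs), q: term vector for the query (terms x 1).
     */
    static Matrix[] project(Matrix a, Matrix q, int k) {
        SingularValueDecomposition svd = a.svd();

        int terms = a.getRowDimension();
        int docs  = a.getColumnDimension();

        // Keep the first k columns of U and V, and the leading k x k block of S.
        Matrix uk = svd.getU().getMatrix(0, terms - 1, 0, k - 1);
        Matrix vk = svd.getV().getMatrix(0, docs - 1, 0, k - 1);
        Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);

        // Each row of Vk gives the coordinates of one document in the reduced space.
        Matrix docCoords = vk;

        // Fold the query in: q_k = q^T * U_k * S_k^-1 (a 1 x k row vector).
        Matrix queryCoords = q.transpose().times(uk).times(sk.inverse());

        return new Matrix[] { docCoords, queryCoords };
    }

    public static void main(String[] args) {
        double[][] a = { {1,0,0}, {0,1,0}, {1,1,0}, {0,0,1}, {0,1,1} };  // 5 terms x 3 docs
        double[][] q = { {1}, {0}, {1}, {0}, {0} };                      // query over the same 5 terms
        Matrix[] coords = project(new Matrix(a), new Matrix(q), 2);
        coords[1].print(9, 4);   // query coordinates in the 2-dimensional LSI space
    }
}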
C. JAMA
As described in the previous section, JAMA is a basic linear algebra package for Java. Its six classes (Matrix, CholeskyDecomposition, LUDecomposition, QRDecomposition, SingularValueDecomposition and EigenvalueDecomposition) provide the operations used here, with SingularValueDecomposition performing the decomposition of the term-by-document matrix.
As noted earlier, the corpus consists of 180 Hindi and 180 English parallel documents, and every paragraph is taken as a single document. An example document pair is shown in Table 3.2.
Table 3.2. Example Document
English Document Hindi Document
India & the World India's foreign policy
seeks to safeguard the country's enlightened
self-interest. The primary objective of
India's foreign policy is to promote and
maintain a peaceful and stable external
environment in which the domestic tasks of
inclusive economic development and
poverty alleviation can progress rapidly and
without obstacles. Given the high priority
attached by the Government of India to
socio-economic development, India has a
vital stake in a supportive external
environment both in our region and
globally.
भारत और विश्व भारत की विदेश नीति में देश के विवेकपूर्ण स्व-हित की रक्षा करने पर बल दिया जाता है। भारत की विदेश नीति का प्राथमिक उद्देश्य शांतिपूर्ण स्थिर बाहरी परिवेश को बढ़ावा देना और उसे बनाए रखना है, जिसमें समग्र आर्थिक और गरीबी उन्मूलन के घरेलू लक्ष्यों को तेजी से और बाधाओं से मुक्त माहौल में आगे बढ़ाया जा सकें। सरकार द्वारा सामाजिक-आर्थिक विकास को उच्च प्राथमिकता दिए जाने को देखते हुए, क्षेत्रीय और वैश्विक दोनों ही स्तरों पर सहयोगपूर्ण बाहरी वातावरण कायम करने में भारत की महत्वपूर्ण भूमिका है।
The Porter stemmer has been used for stemming the English documents; for the Hindi documents, manual stemming is performed. After stemming, the stop word list is as shown in Table 3.3 below.
Table 3.3. Top 20 Stop Word List

English            Hindi
Word      Count    Word      Count
The       969      में        550
Of        577      और        445
And       483      की        378
In        389      को        241
To        337      का        215
A         202      लिए       166
For       161      से         165
With      111      ने         124
Is        108      एक        110
On        105      किया      108
As        102      पर         105
By        100      है         95
Was       73       करने      75
From      63       साथ       72
Has       63       इस        69
Also      57       भी        67
At        56       द्वारा     66
An        43       यह        51
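Frequency tables of this kind can be produced by a simple word count over the stemmed corpus. The following plain Java sketch counts word frequencies and prints the most frequent words; it is only an illustration and is not the code used to build Table 3.3.

import java.util.*;
import java.util.stream.Collectors;

public class StopWordCount {

    // Counts word frequencies across a collection of documents and returns the top n entries.
    static List<Map.Entry<String, Integer>> topWords(List<String> docs, int n) {
        Map<String, Integer> freq = new HashMap<>();
        for (String doc : docs)
            for (String w : doc.toLowerCase().split("\\s+"))
                if (!w.isEmpty()) freq.merge(w, 1, Integer::sum);

        return freq.entrySet().stream()
                   .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                   .limit(n)
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the brown cow", "the cow and the dog");
        topWords(docs, 3).forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}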
As mentioned earlier, each paragraph is taken as an individual document. The paragraphs have been mapped to their cross-lingual mates in the MySQL database server, giving 360 document pairs in total; in this mapping, each paragraph is treated as a single document.
Paragraphs drawn from the same source document are semantically equivalent across the two languages. 180 documents are used for training the system and 180 documents for testing it. The document sets are shown in Table 3.4 below.
Table 3.4. Corpus Overview

Set                 Number of Documents
                    English    Hindi
Corpus              360        360
Training set        180        180
Hindi test set      0          180
English test set    180        0
After training the system, the documents in the Hindi test set have been submitted to the system as queries in order to find their cross-language mates. Cosine similarity is used to measure the similarity among the documents.
D. SIMILARITY OF THE DOCUMENTS
In CLIR systems, the similarity of two documents is measured with similarity metrics. Cosine similarity is the most commonly used metric for finding the similarity of two documents. When the documents are represented in a two-dimensional space, the angle θ between two documents defines their similarity: as the angle θ decreases, the similarity of X and Y increases.
The formula for computing the cosine similarity is given as
Sim(q, d) = (q . d) / (|q| |d|)
where q is the query vector and d is the document vector.
Figure 3.3. Cosine Similarity
After computing the similarities, the documents whose cosine value is close to 1 are retrieved, since cos(0) = 1: documents that are similar have an angle close to 0 between them.
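The cosine measure reduces to a few lines of code. The sketch below works on plain double arrays holding vectors in the reduced space; it is an illustrative implementation of the formula above rather than the system's actual code.

public class CosineSimilarity {

    // Sim(q, d) = q . d / (|q| |d|); returns 0 when either vector has zero length.
    static double cosine(double[] q, double[] d) {
        double dot = 0.0, qNorm = 0.0, dNorm = 0.0;
        for (int i = 0; i < q.length; i++) {
            dot   += q[i] * d[i];
            qNorm += q[i] * q[i];
            dNorm += d[i] * d[i];
        }
        if (qNorm == 0.0 || dNorm == 0.0) return 0.0;
        return dot / (Math.sqrt(qNorm) * Math.sqrt(dNorm));
    }

    public static void main(String[] args) {
        double[] q = { 0.5, 0.8 };
        double[] d = { 0.4, 0.9 };
        System.out.println(cosine(q, d));  // close to 1 when the angle between q and d is small
    }
}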
The system is also tested with different ranks (k values); the results for each k value are shown in Table 3.5.
Table 3.5. Cross Language Mate Retrieval Results
A query is evaluated as successful if the mate of the query document is found in the retrieval results. After a query is submitted, the retrieval results are ranked according to their similarity to the query document. The 180 test documents are given one by one as queries, and the system is expected to find each document's mate in the query results. A query is counted as successful if the mate of the query document appears in the first 10 ranked retrieval results. Table 3.5 shows the number of successful queries according to the rank at which the mate document was retrieved. For example, in the k = 40 experiment, the mate of the query document was obtained at the first rank for 40 documents. The first row of the table shows the results of CLIR when a direct match between documents is made, with neither LSI nor TF-IDF used. The table also shows that using TF-IDF and LSI increases the query performance by approximately three times compared with direct matching. Table 3.5 further shows that as the k value increases the results improve, but the computation time also increases.
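The success criterion described above can be expressed compactly. The following sketch, with assumed data structures and identifiers, counts the queries whose mate appears within the first 10 ranked results; it is an illustration, not the evaluation code used in the experiments.

import java.util.List;

public class MateRetrievalEvaluation {

    /**
     * rankedResults.get(i) holds the document ids ranked by similarity for query i;
     * mates.get(i) is the id of that query's cross-language mate.
     * Returns the number of queries whose mate appears in the top 10 results.
     */
    static int successfulQueries(List<List<Integer>> rankedResults, List<Integer> mates) {
        int success = 0;
        for (int i = 0; i < mates.size(); i++) {
            List<Integer> ranking = rankedResults.get(i);
            int cutoff = Math.min(10, ranking.size());
            if (ranking.subList(0, cutoff).contains(mates.get(i))) success++;
        }
        return success;
    }
}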
3.6 SUMMARY
An index is a critical data structure that allows fast searching over large volumes of data. It holds a set of keywords, and these index terms are also useful for ranking documents using some measure of relevance. Indexing is straightforward for a single language, but for multilingual collections it is considerably more difficult. Traditional methods such as dictionary-based query translation translate the query into the target language and then search the index, but they suffer from problems such as polysemy and synonymy; these can be addressed by latent semantic indexing, since it does not need to perform query translation.
This experiment focused on improving Hindi-English cross-language information retrieval using latent semantic indexing. For this purpose, a parallel corpus was created from the India.gov.in web site and singular value decomposition was performed to obtain a CLIR system. The tests performed on the proposed system showed that latent semantic indexing improves the results by about three times compared with the direct matching method. It is also observed that as the value of k increases, performance improves, but the complexity and computation time also increase.