
Ability of Co-selection to Group Words Semantically

by

Satit Chaprasit

Supervisors

Associate Professor Helen Ashman

Associate Supervisor

Gavin Smith

A thesis submitted for the degree of

Master of Computer and Information Science

School of Computer and Information Science

Division of Information Technology, Engineering and the Environment

University of South Australia

October 2010


Table of Contents

List of Figures........................................................................................................................................ iii

List of Tables......................................................................................................................................... iv

Declaration............................................................................................................................................v

Acknowledgements..............................................................................................................................vi

Abstract................................................................................................................................................vii

Chapter 1 Introduction..........................................................................................................................1

1.1 Motivation...................................................................................................................................1

1.2 Thesis background.......................................................................................................................2

1.3 The thesis fields...........................................................................................................................3

1.4 Research Questions.....................................................................................................................3

1.5 Explanation of research questions................................................................................................3

1.6 Contributions...............................................................................................................................4

Chapter 2 Literature Survey...................................................................................................................5

2.1 Background..................................................................................................................................5

2.1.1 Fundamental Knowledge on Click-Through Data as Implicit Feedback.................................5

2.1.2 Clustering with Co-Selection Method...................................................................................6

2.2 Related Works.............................................................................................................................7

2.2.1 Word Sense Disambiguation.................................................................................................7

2.2.2 Query Clustering...................................................................................................................8

Chapter 3 Methodology.......................................................................................................................10

3.1 Method Outline.........................................................................................................................10

3.2 Identifying ambiguous and unambiguous terms........................................................................10

3.3 Data Preprocessing....................................................................................................................11

3.4 Data Selection for Word Sense Discrimination Experiment.......................................................12

3.5 Data Selection for Query Clustering Experiment.......................................................................15

3.6 Expected Outcomes...................................................................................................................16

Chapter 4 Results.................................................................................................................................17

4.1 Results of Word Sense Discrimination by Co-selection Method................................................17

4.2 The results from the experiment of query clustering on ambiguous dataset............................19

Chapter 5 Discussion...........................................................................................................................22

5.1 Discussion on the experiment on word sense discrimination....................................................22

5.2 Discussion on outcome from query clustering experiment.......................................................23


5.3 Scope and Limitation.................................................................................................................25

5.4 Future work...............................................................................................................................25

Chapter 6 Conclusion...........................................................................................................................27

Appendix A - Complete Explanation of the Query Clustering Methodology...........................30

A.1 – Section outline........................................................................................................................30

A.2 – Generating identification of connected component...............................................................30

A.3 – The selection of 10 truly ambiguous query terms..................................................................31

A.4 – The extraction of related queries...........................................................................................33

A.5 – Generating query pairs...........................................................................................................33

A.6 – Word sense evaluation...........................................................................................................35

A.7 – The approaches of working out the result..............................................................................36

References...........................................................................................................................................38


List of Figures

Figure 3-2 stage of selecting truly ambiguous and unambiguous queries............................................12
Figure 3-3 cluster generated by the query graph..................................................................................14
Figure 4-1 numbers of clusters generated for each of 20 ambiguous and unambiguous queries..........17
Figure 4-2 overall proportion of semantically similar queries of method0..........................................20
Figure 4-3 overall proportion of semantically similar queries of method1..........................................20
Figure 4-4 comparison of individual proportions of semantically similar queries...............................20


List of Tables

Table 3-1 typing errors of the ambiguous indicator found during data preprocessing.........................11
Table 4-1 numbers of clusters generated for ambiguous and unambiguous query terms.....................18
Table 4-2 basic statistics for the WSD experiment...............................................................................18
Table 4-3 level of agreement between participants..............................................................................19
Table 4-4 the proportion of semantically similar pairs as rated by 11 participants..............................19


Declaration

I declare that this thesis does not incorporate without acknowledgment any material previously submitted for a degree or diploma in any university; and that to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text.

Satit Chaprasit

20th October 2010


Acknowledgements

I would like to thank my supervisors Helen Ashman, Gavin Smith, and Mark Truran for their time, valuable suggestions, and support during this year. I would also like to thank the members of the security lab for their help and support, and my friends and the participants who helped me to complete my thesis. Finally, I would like to thank my family in Thailand and my cousin here for their encouragement and support.


Abstract

The meanings of ambiguous words vary across contexts. Using machines to identify the meaning of ambiguous words is called Word Sense Disambiguation (WSD). There are, however, open problems in this area. A major problem for word sense disambiguation is automatically determining word senses, because meanings can vary with context and new meanings can emerge at any time. Another problem is automatically constructing the sets of synonyms used in word sense disambiguation.

The co-selection method, which exploits users' agreement on clickthrough data as a similarity measure, could be used to address these WSD problems. However, previous studies of the co-selection method have so far failed to demonstrate its performance because their datasets were unsuitable. This study aims to redress that situation by carefully selecting datasets to evaluate the ability of co-selection to discriminate word senses and to construct sets of synonyms. The dataset was established using Wikipedia article titles as a source for a list of ambiguous and unambiguous query terms, and the query terms selected for the experiments were checked to confirm whether they were truly ambiguous or unambiguous.

In the word sense discrimination experiment, the numbers of clusters generated by the co-selection method for the selected ambiguous and unambiguous queries were used to evaluate whether the two are correlated. In the experiment on the ability to construct sets of synonyms, human judgements were used to evaluate the query pairs generated by query clustering on single-click data and by query clustering on co-selected data.

The outcomes of the experiments indicate that it is difficult to use the co-selection method on web search to discriminate word senses effectively, but query clustering on co-selected data can help to create unambiguous clusters, which could be used to automatically construct sets of synonyms.


Chapter 1 Introduction

People increasingly use web search to satisfy their information needs. This means that a great number of users interact with search engines by submitting queries and then selecting the returned results that match the information need they have in mind. Such selections are viewed in two forms in this study:

Clickthrough data: a collection of results selected after submitting a query. A selection indicates that the user judges the selected (clicked) result to be relevant to the search term.

Co-selection data: when a user selects more than one returned result for a submitted query, the selected results can be regarded as mutually relevant to each other. The assumption is that the user has only one sense in mind and will only select the results that match that purpose. Under this assumption, co-selected data can help to deal with ambiguous returned results, because users will select multiple results that share the same meaning.

This study will reveal the ability of co-selection to group words semantically.

1.1 Motivation

Two problems in Word Sense Disambiguation (WSD) remain unsolved. The first is identifying the senses of ambiguous words: a word's meaning can change when it occurs in a different context, and new meanings can emerge at any time. The other problem is how to build sets of synonyms automatically, since manually updating synonym sets requires considerable human effort.

However, a relatively new technique, called co-selection, could address these problems. It could be used to discriminate word senses and to create unambiguous clusters from ambiguous query terms automatically. Previous studies of co-selection, however, have failed to reach a conclusion about the method's performance because the datasets in their experiments were unsuitable. This study will therefore redress that situation with appropriate datasets for new experiments.


1.2 Thesis background

Ambiguity is widespread in human language. Many words can be interpreted with different meanings depending on the context in which they occur. Word sense disambiguation is the discovery of the meanings of words in different contexts. However, dealing with the different meanings of an ambiguous word is complicated for machines [15]. For example, search systems face the problem of lexical ambiguity: for a given query, it is difficult for a system to return exactly the information the user needs, owing to low query quality and ambiguity [6].

The method proposed in this study, the co-selection method, aims to deal with such problems by exploiting users' judgements from click-through data as a similarity function to automatically determine the senses of an ambiguous query. The method depends neither on potentially outdated external knowledge sources nor on extensive preprocessing of datasets, since it requires only a simple signal: user consensus [25]. The assumption of the method is that when submitting a query, the user has only one sense (meaning) of the query in mind. Different users will therefore select different search results matching the sense in their mind. When a user selects more than one result, those results are called "co-selected data". When a number of users co-select the same results for the same query, this reinforces that the co-selected results are mutually relevant. For this reason, a group of similar results indicates a distinct sense of an ambiguous word. In short, the method exploits users' agreement as a similarity function to discriminate word senses.

Although the result of the first study on the co-selection method [25] indicated that the method was feasible for discriminating the senses of ambiguous queries, that study focused only on image search. Additionally, it used an artificial dataset: the experiment was small-scale and was performed by volunteers under artificial controls. It is therefore difficult to conclude that the co-selection method can discriminate word senses on the basis of that experiment alone.

Another study based on the co-selection method, from [22] in 2009, aimed to construct unambiguous clusters, rather than to discriminate word senses, by applying query clustering to co-selected data. That study used a real-world dataset based on web document search rather than image search. However, the result of the experiment was not useful for assessing co-selection because randomly extracting data from the entire dataset produced an unambiguous dataset. A dataset containing no ambiguous terms is inappropriate for evaluating the ability of the co-selection method to create unambiguous clusters.

Since the datasets of previous studies on the co-selection method were unsuitable for drawing conclusions about its ability to group words semantically, this study performs both a word sense discrimination experiment and the query clustering experiment on co-selected data from [22], but with an appropriate, carefully established dataset. We then discuss the results, potential contributing factors, and potential future work in order to draw conclusions about the ability of the co-selection method.

1.3 The thesis fields

Word Sense Disambiguation and Information Retrieval

1.4 Research Questions

Is there a correlation between the numbers of clusters generated by the co-selection method from ambiguous and unambiguous queries?

Can query clustering on co-selected data help to create unambiguous clusters?

1.5 Explanation of research questions

The capability of the co-selection method has not yet been conclusively evaluated, owing to inappropriate datasets. This study therefore aims to assess its performance by carefully selecting the dataset for new experiments. There are two questions in this study. The first concerns word sense discrimination: it asks whether there is a correlation between the numbers of clusters generated for ambiguous and unambiguous queries; intuitively, the more clusters, the more meanings the query has. The second concerns the ability of co-selection to construct sets of synonyms: it asks whether query clustering on co-selected data can help to create unambiguous clusters.


For the first question, according to the assumption of the co-selection method introduced earlier in this chapter, users will mostly click on results matching their information needs, so an ambiguous query should generate a greater number of clusters than an unambiguous one. If the outcome matches this expectation, it will indicate that there is a correlation between the numbers of clusters generated from ambiguous and unambiguous queries.

For the second question, the performance of query clustering is reduced on ambiguous queries because the multiple senses of such queries prevent query clustering from grouping related queries effectively. The co-selection method could help to perform this task more effectively, because query clustering on co-selected data also considers the information need of each individual user when grouping related queries. We therefore examine whether query clustering on co-selected data can help to create unambiguous clusters.

1.6 Contributions

In summary, the potential contributions of this study are the following:

This study could strengthen the credibility of the co-selection method for grouping query terms semantically by carefully selecting datasets for new experiments.

This study will identify problems with using the co-selection method on web search.

This study will investigate whether click-through data on web search is reliable enough for the co-selection method to discriminate word senses.


Chapter 2 Literature Survey

This chapter is divided into two parts: background and related work. The background first provides essential information on click-through data as implicit feedback, and then covers previous experiments on the co-selection method. The second part surveys related work in word sense disambiguation and query clustering.

2.1 Background

2.1.1 Fundamental Knowledge on Click-Through Data as Implicit Feedback

With explicit feedback, users are asked to assess which results are relevant to a search query so that the retrieval function can be optimized using this feedback. However, users are usually unwilling to provide such assessments [10]. In addition, according to [11], manually developing a retrieval function is not only time-consuming but sometimes also impractical.

In contrast, implicit feedback in the form of click-through data can be accumulated at very low cost, in great quantities, and without requiring any additional user activity [11]. Click-through data can be extracted from a search engine log, which indicates which search-result URLs were clicked by users after submitting a search query. Although clicked URLs are not exactly equivalent to relevance judgements, they at least indicate a relationship between users' judgements and the documents clicked for each query [28].

Since [14] first proposed this form of implicit feedback for a web search system, the concept has been used in a number of ways, such as developing information retrieval functions [10], optimizing web search [30], investigating user search behaviour [2, 5, 18], and clustering web documents and queries [3].

Even though click-through data seems a promising form of implicit feedback on web search, many studies have raised questions about it. There are three major difficulties for studies mining click-through data: noise and incompleteness, sparseness, and newly emerging queries and documents [30]. Several studies have also found biases in which documents users select. According to [1], users assume the top result is relevant even when the results are randomly ordered. In addition, [11] point out that the high ranking of a search result (trust bias) can influence users to click on it even when it is less relevant to their information needs than other results. The overall quality of the retrieval system (quality bias) also influences how users select documents: if the system can provide only results of low relevance for a query, users will select those less relevant results anyway. Furthermore, ambiguous queries are a major challenge for information retrieval systems exploiting click-through data as implicit relevance feedback, since they are difficult to handle. Users usually submit short search terms, which makes the keywords ambiguous, and IS-A relationships can also lead to ambiguity [20].

Although the accuracy of relevant documents identified from click-through data is quite low, approximately 52% [18], users' judgements of the relevance of result abstracts (snippets taken from the document and containing the search terms) are 82.6% accurate [12]. The co-selection method is based on how people judge the abstracts of documents rather than the documents themselves. For this reason, the co-selection method on web search could be reliable enough for these experiments.

2.1.2 Clustering with Co-Selection Method

The co-selection method was first proposed by [25]. That study introduced how to exploit users' consensus as a similarity measure on click-through data. It aimed to evaluate the assumption of the co-selection method on an image search system, SENSAI, to determine whether images co-selected for the same query indicated mutual relevance. For each term, the clustering method collected the clicked images by initially placing each URL at a random position on a single axis, and then moving URLs closer together on the axis whenever they were co-selected. The experimental result showed that exploiting users' consensus could automatically separate the senses of ambiguous search terms without relying on labour-intensive gathering and maintenance of knowledge resources. However, the limitation of this study was its artificial data, since the dataset was collected by asking users to perform search tasks under constrained conditions.
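The axis-based clustering described above can be illustrated with a rough sketch. The update rule, pull factor, and number of rounds here are assumptions for illustration only; the original SENSAI parameters are not given in this summary:

```python
import random

def cluster_on_axis(co_selections, urls, pull=0.5, rounds=3, seed=0):
    """Place each URL at a random position on a single axis, then
    repeatedly pull co-selected URLs toward their midpoint so that
    co-selected URLs end up close together (illustrative sketch)."""
    rng = random.Random(seed)
    pos = {u: rng.random() for u in urls}
    for _ in range(rounds):
        for u1, u2 in co_selections:
            mid = (pos[u1] + pos[u2]) / 2
            pos[u1] += pull * (mid - pos[u1])
            pos[u2] += pull * (mid - pos[u2])
    return pos

# Hypothetical URLs: "a" and "b" are co-selected, "c" is not.
positions = cluster_on_axis([("a", "b")], ["a", "b", "c"])
```

With a pull factor of 0.5, each round halves the gap between a co-selected pair, so clusters of mutually relevant URLs emerge as tight groups on the axis.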


Another previous co-selection study came from [22]. It aimed to evaluate the performance of the co-selection method for creating unambiguous clusters via query clustering, and it used a real-world dataset. However, the result was unexpected: an alternative method with no capability for disambiguation gave the best semantic coherence, rather than query clustering on co-selected data. The researchers concluded that randomly extracting queries from the real-world dataset had been a mistake, because [17, 29] report that only about 4% of the words submitted to a general search engine are ambiguous. Randomly sampling the clickthrough data therefore yielded an essentially unambiguous dataset, which is not appropriate for evaluating the performance of the co-selection method.

Since an appropriate dataset is needed to verify the capability of the co-selection method, this thesis carefully extracts data from a real-world dataset so that it contains a proportion of known ambiguous terms. The credibility of co-selection for grouping words semantically can then be assessed in this study.

2.2 Related Works

2.2.1 Word Sense Disambiguation

Since dealing with ambiguity is complicated for machines, a process is required for transforming unstructured textual information into analysed data structures in order to identify the underlying meaning. This process, the computational discovery of the meanings of words in context, is called Word Sense Disambiguation (WSD) [15].

To achieve this, a number of studies have tried different techniques, which usually rely on external resources [25]. For example, the study in [9] performed text categorization using natural language processing patterns derived from knowledge-based rules. Another example is the WSD study [8], which proposed a method to find words related to a given word using the Longman Dictionary of Contemporary English (LDOCE). The problem with external resources is that human effort and specialist involvement are required to maintain them manually.

However, [19] shows that word sense disambiguation can be divided into two subtasks: discrimination and labelling. Word sense discrimination partitions the senses of an ambiguous word into different groups, under the assumption that each group represents only one sense. Most word sense discrimination approaches are based on clustering. After this process, the clusters of senses can not only be sent to lexicographers for the labelling task, but can also be used in the field of information access without any requirement for term definitions. As [19] illustrates, a system can show examples from each cluster for users to decide which sense they want. For example, regardless of the dictionary definitions of words, query suggestion can present examples of the senses of a given query so that users can choose the sense they intend.

Word sense discovery is a significantly difficult task because the meanings of words vary across contexts, and new meanings can be introduced at any time [13]. To deal with this problem, the studies in [19], [26], and [24] attempted to discriminate word senses with complex approaches. The co-selection method, in contrast, relies only on a simple signal, users' consensus, to discriminate word senses.

2.2.2 Query Clustering

[3] first used query clustering based on click-through data to gather similar queries together. In this method, a bipartite graph captures the relationships between distinct query nodes and distinct URL nodes. The algorithm, an agglomerative hierarchical algorithm, relies only on content-ignorant clustering, discovering similar groups by iteratively merging queries and documents one step at a time. A noticeable disadvantage of the method is that this per-iteration clustering makes it quite slow [3]. The study used half a million click-through log records from the Lycos search engine to evaluate query suggestions. However, according to [4], the algorithm is vulnerable to noisy clicks, since users sometimes select document results by mistake or through poor interpretation of result captions. [4] therefore extended the method by adapting the similarity function to detect and eliminate noisy clicks.
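A minimal sketch of content-ignorant agglomerative clustering in this spirit follows. The similarity measure (Jaccard overlap of clicked-URL sets), the stopping threshold, and the example queries are illustrative assumptions; the actual functions in [3] may differ:

```python
def jaccard(a, b):
    """Jaccard overlap of two sets."""
    return len(a & b) / len(a | b)

def cluster_queries(query_urls, threshold=0.5):
    """Agglomerative clustering over the query side of a query-URL
    bipartite graph: repeatedly merge the two most similar clusters
    (by clicked-URL overlap) until no pair exceeds the threshold."""
    clusters = [({q}, set(urls)) for q, urls in query_urls.items()]
    while True:
        best, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = jaccard(clusters[i][1], clusters[j][1])
                if s > best:
                    best, best_pair = s, (i, j)
        if best_pair is None:
            return [queries for queries, _ in clusters]
        i, j = best_pair
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)

# Hypothetical clicked-URL sets per query.
query_urls = {
    "car hire": {"u1", "u2"},
    "rent a car": {"u1", "u2", "u3"},
    "python docs": {"u9"},
}
```

The exhaustive pairwise scan per merge also illustrates why the method is slow on large logs, as noted above.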

Another query clustering method is session-based query clustering, proposed by [28], which was based on click-through data and used a combination of query content similarity and query session similarity. A query session was defined as the sequence of activities performed by a user after submitting a query. The clustering algorithm used in that work is DBSCAN [7], because it can handle large datasets efficiently and can incorporate new queries incrementally. In addition, DBSCAN does not require manually setting the number of clusters or the maximum cluster size. However, in the reported experiment, only 20,000 queries were randomly extracted from the entire dataset, because the full dataset was too large, approximately 22 GB. Also, since the dataset came from the Encarta website rather than from a search engine, [22] point out that users might not interact with that system in the same way as with search engine systems.
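The combination of query content similarity and query session similarity might be sketched as below. The Jaccard measures and the weighting parameter `alpha` are illustrative assumptions, not the exact functions used in [28]:

```python
def keyword_similarity(q1, q2):
    """Content similarity: Jaccard overlap of the queries' keywords."""
    w1, w2 = set(q1.lower().split()), set(q2.lower().split())
    return len(w1 & w2) / len(w1 | w2)

def session_similarity(docs1, docs2):
    """Session similarity: Jaccard overlap of the documents clicked
    in the sessions that followed each query."""
    if not docs1 or not docs2:
        return 0.0
    return len(docs1 & docs2) / len(docs1 | docs2)

def combined_similarity(q1, docs1, q2, docs2, alpha=0.5):
    """Weighted combination of content and session similarity."""
    return (alpha * keyword_similarity(q1, q2)
            + (1 - alpha) * session_similarity(docs1, docs2))
```

Such a combined measure can then serve as the distance function for a density-based algorithm like DBSCAN, which forms clusters without a preset cluster count.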

In addition, according to [22], the agglomerative hierarchical clustering method proposed by [3] differs from the session-based method in that it has no capability for discriminating word senses; it can only gather related information into the same cluster. Although query clustering has potential for use in word sense disambiguation, namely for constructing sets of synonyms automatically, these studies have not evaluated its ability to create unambiguous clusters. It is therefore reasonable for this project to evaluate that ability by applying query clustering to co-selected data.


Chapter 3 Methodology

This study carried out an exploratory methodology to evaluate the potential capability of the co-selection method to discriminate word senses and to cluster unambiguous queries. The first experiment evaluated whether there is a difference between the numbers of clusters generated from ambiguous versus unambiguous queries; the second evaluated whether query clustering on co-selected data can create unambiguous clusters. The following stages of the methodology achieve the research goal.

3.1 Method Outline

Identifying ambiguous and unambiguous terms (3.2)

Co-selected data preprocessing (3.3)

Data selection for the word sense discrimination experiment (3.4)

Data selection for the query clustering experiment (3.5)

Expected outcomes (3.6)

3.2 Identifying ambiguous and unambiguous terms

There were two experiments in this study: a comparison between the clusters generated from co-selected data for ambiguous and unambiguous query terms, and query clustering to create unambiguous clusters. A known ambiguous dataset and a known unambiguous dataset were therefore required. According to [17], Wikipedia can be used as a source for identifying ambiguous and unambiguous terms, since it lets us separate article titles into ambiguous and unambiguous titles. For this reason, Wikipedia was used to generate the known ambiguous and unambiguous terms.

Number  Ambiguous indicator       Number  Ambiguous indicator
1       _(Disambiguation)         9       (disambiguation_page
2       (disambiguation)          10      _(disambigation)
3       (Disambiguation)          11      _(disambigaiton)
4       (disambiguation           12      _(disambigutaion)
5       _(disambig)               13      _(disambiguatuion)
6       _disambiguation           14      _(disambiguaton)
7       _(disambiguation_page)    15      _(disambiguatiuon)
8       (disambiguation_page)     16      _(disambigauation)

Table 3-1 Typing errors of the ambiguous indicator found during data preprocessing

As mentioned above, all article titles from Wikipedia, downloaded from http://download.wikimedia.org/enwiki/20100622/ (the file described as 'List of page titles', all-titles-in-ns0.gz, approximately 41 Megabytes), were used to generate a group of ambiguous titles and a group of unambiguous titles. According to the Wikipedia rule, all titles containing “_(disambiguation)” are ambiguous titles, so ambiguous titles could be extracted from the all-titles text file. However, because Wikipedia is contributed to by a great number of people, the downloaded text file also contained some misspellings of the indicator that marks a title as ambiguous. For example, some titles contained “_(disambiguatoin)” instead of “_(disambiguation)” (see Table 3-1 for the misspellings found during data preprocessing). We noticed that the errors occurred in the last part of the word “disambiguation” (“guation”) rather than in the first part (“disambig”). For this reason, to extract the known ambiguous titles properly, “disambig” was used as the filter to select ambiguous titles out of all the Wikipedia titles. The titles remaining after the extraction were considered the known unambiguous titles. As a result, lists of ambiguous and unambiguous titles were created, and they were then imported into the database as the “ALL-AMBIGUOUS” and “ALL-UNAMBIGUOUS” tables.
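The title-splitting step above can be sketched as follows. This is an illustrative sketch rather than the thesis code; the function name and the use of Python are our own, and only the “disambig” substring filter and the dump filename come from the text.

```python
import gzip

# "disambig" catches "_(disambiguation)" and all the misspellings in
# Table 3-1, since the errors corrupt the tail of the word, not its start.
AMBIG_FILTER = "disambig"

def split_titles(titles):
    """Partition article titles into known-ambiguous and known-unambiguous lists."""
    ambiguous, unambiguous = [], []
    for title in titles:
        (ambiguous if AMBIG_FILTER in title.lower() else unambiguous).append(title)
    return ambiguous, unambiguous

# Applying it to the downloaded dump would look roughly like:
# with gzip.open("all-titles-in-ns0.gz", "rt", encoding="utf-8") as f:
#     ambig, unambig = split_titles(line.strip() for line in f)
```

Matching on the lower-cased prefix “disambig” is what makes the filter robust to the misspellings in Table 3-1, because those all corrupt the later syllables of the word.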

3.3 Data Preprocessing

Clickthrough data needs to be preprocessed before experiments based on the co-selection method can be performed. The preprocessing steps are described below:

1. Queries were normalized to lower case so that each query would gather a greater number of clicks, because too few clicks for a query would be inadequate for good experiments.

2. Session identification was required to represent a person submitting a query, which is essential for the co-selection method. That is, a unique session represents one submission of a query. Therefore, “SessionIDs” were generated by sorting on query, time, and URL respectively. The same query occurring in different 30-minute periods was assigned a different “SessionID”.


3. SessionIDs having fewer than 2 records (2 clicks) were filtered out, because 1 record (1 click) per session is not co-selected data.

4. Any co-selected data generated by only one SessionID was also considered unusual co-selected data, so it too was filtered out.

After performing these steps, we had clickthrough data in which every SessionID has more than 1 click, called “CT_SSG1”. This means that we now had the “ALL-AMBIGUOUS”, “ALL-UNAMBIGUOUS”, and “CT_SSG1” tables.
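The preprocessing steps can be sketched as below, assuming clickthrough records of the form (query, timestamp, url). This covers steps 1-3 only; step 4, dropping co-selected sets that occur in just one session, would be a further pass. All names here are illustrative, not the thesis implementation.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)

def preprocess(records):
    """records: (query, timestamp, url) tuples.
    Returns {(query, session_id): [urls]}, keeping only sessions with
    at least 2 clicks, i.e. the sessions that contain co-selected data."""
    ordered = sorted(records, key=lambda r: (r[0].lower(), r[1], r[2]))
    sessions, sid = {}, 0
    last_query, last_time = None, None
    for query, ts, url in ordered:
        query = query.lower()                       # step 1: case normalization
        if query != last_query or ts - last_time > SESSION_GAP:
            sid += 1                                # step 2: new SessionID
        last_query, last_time = query, ts
        sessions.setdefault((query, sid), []).append(url)
    return {k: v for k, v in sessions.items() if len(v) >= 2}  # step 3
```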

3.4 Data Selection for Word Sense Discrimination Experiment

For word sense discrimination using the co-selection method, we aimed to evaluate whether there was a difference between the number of clusters generated from unambiguous and from ambiguous query terms. Given how the co-selection method works, the number of clusters generated from ambiguous query terms should be greater than from unambiguous terms, as the method is expected to partition clicks for the same search term into distinct clusters in which the different meanings of the search term are manifested. For this reason, this section describes the steps for selecting 20 truly ambiguous and 20 truly unambiguous query terms for the word sense discrimination experiment.


Figure 3-1 stages of selecting truly ambiguous and unambiguous queries

The “CT_SSG1” table was joined with the “ALL-AMBIGUOUS” and “ALL-UNAMBIGUOUS” tables in order to create a table, named “AMBIGALL_SELECTED”, containing only ambiguous queries, and another table, named “UNAMBIGALL_SELECTED”, containing only unambiguous queries. These 2 tables were then sorted from the highest click count to the lowest click count of the query terms.

However, the imbalance of click counts between ambiguous and unambiguous query terms could bias the query selection for the experiment. That is, once we had the tables listing queries with their total clicks, the highest total click count in the unambiguous table (“UNAMBIGALL_SELECTED”) was significantly higher than the highest in the ambiguous one (“AMBIGALL_SELECTED”). This means that simply choosing the top 20 queries from each table would bias the selection. For this reason, we decided to choose the 20 ambiguous query terms first, and then use them to select unambiguous query terms with similar click counts (this is explained in more detail below).

In addition, both the ambiguous and the unambiguous queries needed to be verified as truly ambiguous or truly unambiguous. In the case of truly ambiguous queries, we dealt with this issue by checking them manually as follows:

Whether or not one sense of the ambiguous term was dominant, using the Wikipedia website, as it is the original source.

Whether or not each sense of the ambiguous term accounted for at least 20 percent of the ambiguous search results in the top 10 results at Google, Bing, and Yahoo. Had we failed to do this, it is possible that there would be no co-selection data for the minority senses of the term present in the data, and thus the distinct clusters would not be manifested.

We then chose the top 20 verified ambiguous queries.

In the case of truly unambiguous queries, we checked manually only against search results (they were not required to be checked against Wikipedia, as they were unambiguous in Wikipedia in the first place). The selection criteria were as follows:


The unambiguous term must not occur in the ambiguous table (it is possible for a term, such as “google” or “aol”, to occur in both the ambiguous and the unambiguous tables).

The unambiguous term must have only one sense on the first page of search results. This indicated that the search engine had not identified sufficiently representative ambiguity in the term.

Then we chose the 20 queries by comparing them one by one with the total clicks of the 20 selected ambiguous queries (not more than 10% higher or lower). For example, if the 1st selected ambiguous query has 1,000 clicks, the 1st selected unambiguous query must have total clicks between 900 and 1,100. According to these criteria, 20 truly ambiguous and 20 truly unambiguous queries were selected into tables named “AMBIGALL_SELECTED_20” and “UNAMBIGALL_SELECTED_20” respectively.
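The ±10% click-matching rule above can be sketched as a greedy pass over the candidate lists; the function and data shapes are hypothetical illustrations of the rule, not the selection code used in the study.

```python
def match_by_clicks(ambiguous, candidates, tol=0.10):
    """ambiguous: [(term, clicks)] in selection order; candidates: [(term, clicks)].
    Greedily pairs each ambiguous term with the first unused candidate
    whose total clicks lie within +/- tol of the ambiguous term's clicks."""
    chosen, used = [], set()
    for _, clicks in ambiguous:
        for term, c in candidates:
            if term not in used and abs(c - clicks) <= tol * clicks:
                chosen.append(term)
                used.add(term)
                break
    return chosen
```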

“SenseIDs” for query terms were also required in order to indicate how many clusters of co-selected data occurred for a query term. This is essential because they were used for comparing the number of clusters between ambiguous and unambiguous query terms. The “query graph” was chosen to achieve this task, as used in [23]. The principle of the query graph method is to cluster the URLs carrying the same sense of an ambiguous query into the same group. A bipartite graph is used in the query graph to visualize the relevance between queries and the documents clicked by users. A node represents a clicked document, with a count showing how many times users selected that document; an edge, also carrying a count, is generated between nodes when a user selects multiple results (see Figure 3-2). Different connected groups of nodes and edges represent distinct clusters, which in turn represent potentially distinct meanings. For this reason, the query graph was used to generate the identification of sense-distinct clusters (the SenseID) for each record in the experiment.
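The query-graph construction described above can be sketched as follows, assuming the clicks for one query term arrive grouped by session. The helper is a hypothetical reconstruction, not the thesis code: each connected component of the co-selection graph corresponds to one SenseID.

```python
from collections import defaultdict
from itertools import combinations

def sense_components(sessions):
    """sessions: list of clicked-URL lists for one query term.
    Returns (components, edge_weight): the connected components of the
    co-selection graph (one candidate SenseID each) and the number of
    sessions supporting each edge, as drawn in Figure 3-2."""
    graph = defaultdict(set)
    edge_weight = defaultdict(int)
    for urls in sessions:
        for a, b in combinations(sorted(set(urls)), 2):   # co-selected pair
            graph[a].add(b)
            graph[b].add(a)
            edge_weight[(a, b)] += 1
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:                                      # depth-first walk
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(graph[u] - comp)
        seen |= comp
        components.append(comp)
    return components, edge_weight
```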


Figure 3-2 cluster generated by the query graph

The following are the methods that we used to work the results out:

A paired t-test was used to evaluate whether the numbers of clusters generated from truly ambiguous and truly unambiguous query terms were statistically different (paired t-test from http://faculty.vassar.edu/lowry/VassarStats.htm).

The mean (X̄) of the number of clusters generated for the 20 truly ambiguous and the 20 truly unambiguous terms was calculated to compare the average number of clusters.

The standard deviation of the number of clusters for both groups of 20 terms was also calculated to support the outcome.
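The paired t-test on cluster counts can be computed as below; this is a minimal sketch using the standard formula, not the VassarStats implementation the study used.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys):
    """Paired-samples t statistic for two equal-length lists of cluster counts."""
    diffs = [x - y for x, y in zip(xs, ys)]
    # mean difference divided by the standard error of the differences
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
```

The resulting statistic is then compared against the t distribution with n - 1 degrees of freedom to obtain the two-tailed P-value.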

3.5 Data Selection for Query Clustering Experiment

For the second experiment, query clustering, we needed to evaluate whether query clustering on co-selected data could help to create unambiguous clusters. To achieve this evaluation, we only needed query terms that have more than 1 sense; therefore, only the list of ambiguous query terms was used for this experiment. We decided to make a comparison between query clustering on normal clickthrough data (single clicks), called method0 in this study, and query clustering on co-selected data, called method1. We then used human judgement to evaluate the semantic relationships of query pairs randomly generated by both methods, and used Fleiss free-marginal kappa and standard statistics to work the results out.

Since a complete explanation of the methodology for this experiment would be too long for this section, it is provided in Appendix A. The following are brief steps of the methodology for this experiment.

ConnectedIDs were generated for the queries in both method0 and method1 by the simple query clustering of [3], which does not require setting any parameter.

The input data differed between the two methods: for method0 the input was (query, URL); for method1 it was ((query, SessionID), URL). For this reason, method0 would have only 1 cluster per query, while method1 would be expected to have multiple clusters per query.

10 truly ambiguous queries were selected, and the queries having the same ConnectedID as those 10 were extracted for generating query pairs. To select the 10 truly ambiguous queries, all ambiguous queries were sorted by 2 additional fields: number of clusters (method1) and size of clusters (method0), respectively.

Unique query pairs from both method0 and method1 were randomly generated for the evaluation.

Human judgement was used to evaluate the semantic relationship between the queries in each pair.

Fleiss free-marginal kappa was used to work out the level of agreement between participants, and basic statistics were used to compare the performance of method0 and method1.
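The difference between the two input representations can be sketched with a small union-find over the click records. This is an illustrative reconstruction of parameter-free connected-component clustering in the spirit of [3], not its published algorithm.

```python
from collections import defaultdict

def connected_ids(pairs):
    """pairs: (key, url) click records. Keys linked through shared URLs
    receive the same ConnectedID. For method0 a key is the query string;
    for method1 a key is the (query, SessionID) tuple."""
    parent = {}
    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]   # path halving
            k = parent[k]
        return k
    by_url = defaultdict(list)
    for key, url in pairs:
        by_url[url].append(key)
        parent.setdefault(key, key)
    for keys in by_url.values():
        for a, b in zip(keys, keys[1:]):    # union keys sharing a URL
            parent[find(a)] = find(b)
    ids = {}
    return {k: ids.setdefault(find(k), len(ids)) for k in list(parent)}
```

Calling it with plain query keys reproduces method0's one-cluster-per-query behaviour; calling it with (query, SessionID) keys lets one query fall into several clusters, as method1 requires.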

3.6 Expected Outcomes

For the first experiment (word sense discrimination), we expected to see a distinction between the clusters generated from co-selected data for unambiguous queries and for ambiguous queries. It was expected that the number of clusters generated for unambiguous queries would be fewer than for ambiguous queries. If the outcome turned out this way, it would indicate that the co-selection method on web search can help to discriminate word senses.

For the second experiment (query clustering), we expected to see that query clustering on co-selected data could help to create unambiguous clusters. This means that, when comparing query clustering on co-selected data (method1) with query clustering on single clicks (method0), the number of semantically similar pairs for method1 would be significantly greater than for method0, and the level of agreement between raters would not be merely by chance. If the outcome turned out this way, it would indicate that query clustering on co-selected data performs better than the basic clustering algorithm on ambiguous queries, i.e. that method1 can distinguish between senses of an ambiguous term, while method0 cannot.


Chapter 4 Results

This chapter presents the results of both experiments; discussion of the results follows in the next chapter. Since there were 2 experiments in this study, the chapter is divided into 2 parts: the first covers the results of word sense discrimination by the co-selection method, and the second covers query clustering on the ambiguous dataset.

4.1 Results of Word Sense Discrimination by Co-selection Method

The numbers of clusters for the 20 unambiguous queries and the 20 ambiguous queries were generated to compare whether there was a difference between these 2 groups (see Table 4-2). The numbers of clusters were then submitted to a paired t-test to validate whether the 2 groups were statistically different, and basic statistics were also used to support the outcome.

Figure 4-3 numbers of clusters generated for each of 20 ambiguous and unambiguous queries


#    Unambiguous query term   Number of clusters   Ambiguous query term   Number of clusters
1    gmail                    1                    pogo                   2
2    clip art                 2                    ups                    2
3    american airlines        4                    amazon                 5
4    youtube                  1                    aim                    1
5    wedding cakes            4                    juno                   2
6    tori spelling            1                    chase                  2
7    google earth             3                    monster                5
8    delta airlines           2                    southwest              1
9    howard stern             8                    delta                  6
10   google maps              1                    people                 7
11   Cymbalta                 1                    aaa                    1
12   Ipod                     8                    gap                    2
13   Itunes                   5                    whirlpool              6
14   Screensavers             5                    time                   1
15   Swimsuits                3                    hallmark               1
16   birthday cards           5                    continental            2
17   jennifer aniston         8                    Fox                    6
18   paintball guns           1                    nwa                    3
19   Fonts                    5                    e3                     6
20   wedding vows             1                    mls                    2

Table 4-2 numbers of clusters generated for ambiguous and unambiguous query terms

Statistic                    Unambiguous group   Ambiguous group
Mean                         3.45                3.15
Standard deviation           2.50                2.13
Paired t-test (two-tailed)   P-value: 0.68

Table 4-3 basic statistics for the WSD experiment

Based on the data, there was no distinction between the unambiguous and the ambiguous queries. Firstly, the means and standard deviations of the 2 groups were not significantly different. Secondly, the P-value from the paired t-test (calculated from http://faculty.vassar.edu/lowry/VassarStats.htm) was 0.68, which is greater than 0.05, meaning that the numbers of clusters generated for these 2 groups were not statistically different.

4.2 Results of Query Clustering on the Ambiguous Dataset

In the case of the query clustering experiment, 11 participants completed the evaluation. The ratings from these participants were used to indicate whether query clustering on co-selected data (method1) can cluster ambiguous queries better than query clustering on single clicks (method0). The agreement between participants was measured with Fleiss free-marginal kappa [16], calculated from http://justusrandolph.net/kappa, which was produced by the author of [16], and basic statistics were also used to compare the performance of method1 and method0.

Type of result from online kappa calculator   Value
Percent of overall agreement                  0.71
Fixed-marginal kappa                          0.39
Free-marginal kappa                           0.43

Table 4-4 level of agreement between participants

As mentioned above, Fleiss free-marginal kappa was used in this study to indicate the level of agreement between raters, because the raters could rate freely: they were not limited in the number of items they could assign to each category. The result is relatively positive, at 0.43, which means moderate inter-rater agreement (positive agreement of kappa starts at 0), according to [27].
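Randolph's free-marginal kappa corrects the observed agreement for the chance level 1/k, where k is the number of rating categories. A sketch follows; the number of categories is assumed to be 2 (similar / not similar), and fed the rounded agreement of 0.71 it returns 0.42, consistent with the reported 0.43 given rounding of the inputs.

```python
def free_marginal_kappa(overall_agreement, n_categories):
    """Randolph's free-marginal kappa: observed agreement corrected for
    the chance agreement 1/k that arises when raters may use categories freely."""
    chance = 1.0 / n_categories
    return (overall_agreement - chance) / (1.0 - chance)
```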

The following are the proportions of semantically similar query pairs as rated by the participants for both methods.

Participant          Method0     Method1
1                    0.45        0.675
2                    0.508333    0.758333
3                    0.458333    0.675
4                    0.616667    0.9
5                    0.358333    0.791667
6                    0.458333    0.733333
7                    0.383333    0.633333
8                    0.533333    0.925
9                    0.316667    0.558333
10                   0.533333    0.833333
11                   0.641667    0.875
Standard deviation   0.101944    0.117604
Overall              0.47803     0.759848

Table 4-5 the proportion of semantically similar pairs as rated by 11 participants

Figure 4-4 overall proportion of semantically similar queries for method0

Figure 4-5 overall proportion of semantically similar queries for method1

Figure 4-6 comparison of individual proportions of semantically similar queries


Based on our data, method1 clusters ambiguous queries better than method0, because the overall value for method1 is significantly higher than for method0, while the standard deviations differ only a little. As mentioned at the start of this chapter, further discussion of these results is provided in the next chapter.


Chapter 5 Discussion

As presented in the previous chapter, the outcome of the word sense discrimination experiment is contrary to our expectation, whereas the outcome of the query clustering experiment went in the direction we expected. On analysis, there are several potential factors involved in these outcomes, and this chapter discusses those factors for both experiments.

5.1 Discussion on the experiment on word sense discrimination

Once the outcomes of this experiment were known, we considered why they behaved in this unexpected way. Looking into the data and the selected dataset, there were several potential factors affecting the ability of the co-selection method, especially in web document search: changes in the ranking positions of search results, the potential difference between our dataset and a present-day dataset, the scope of ambiguous query terms, and noise filtering. These factors are discussed in this section.

Firstly, changes in search result ranking during the period over which the clickthrough data was collected (approximately a month) could be one cause of the many clusters generated for unambiguous query terms. That is, people could co-select different search results when ranking positions changed. For example, at the time data collection started, the co-selected results for a query term might be in the top 5 of the first page, but their rank could change at any time after that, depending on a number of factors such as contemporary trends and competition over Search Engine Optimization (SEO). Other research by colleagues in the Security Lab is currently finding that some classes of queries are highly volatile, and that the overlap between search results from one five-day period to another for the same search term can drop significantly. For this reason, different users could select different results even though they have the same information need. Hence, ranking change could be one factor making the numbers of clusters similar.

Secondly, the entire dataset used in this study was several years old. This means that although we carefully selected ambiguous and unambiguous queries, it was still difficult to judge whether those queries were truly ambiguous or unambiguous, given how trends have shifted between then and the present. As a result, the selected queries might not have been entirely appropriate. In short, comparing the senses of queries from a dataset several years old with present search results could be another factor leading to the unexpected outcome.

Thirdly, the scope of an ambiguous term is vague: it is unclear what the user's information need is. For example, “ipod” is an unambiguous term at the meaning level, but users could use “ipod” for different information needs, such as looking for ipod news, ipod reviews, or different versions of the ipod; such a term would be considered 'weakly ambiguous' [22]. As a result, users with different purposes for the “ipod” term would click on different search results, and that could be one reason why a number of clusters were generated for some of the unambiguous terms. In short, users could use a term that appears unambiguous to pursue different information needs, owing to the weak ambiguity of the term.

Finally, the fundamental noise filtering for co-selected data might also be a factor. The filtering only considers a set of co-selected data to be unusual when there is “one” person whose co-selected results differ from everyone else's. This means that if 2 or 3 people accidentally co-selected similar results unrelated to their information needs, a new cluster would still be generated from those accidental co-selections, so this could be a factor generating more clusters. However, this factor would generate more clusters for both unambiguous and ambiguous queries alike. For this reason, we are relatively sure that it was not a significant contributor to the unexpected outcome.

To summarize, potential changes in search results, an out-of-date dataset, the scope of ambiguous terms, and the noise filtering for co-selected data are the potential key factors in the unexpected outcome of this experiment. Although they can suggest why there was no difference between the clusters generated from ambiguous and unambiguous queries, it is difficult to control all of these factors when discriminating word senses from clickthrough data on web search. For this reason, it is rational to focus future work on the co-selection method for discriminating word senses in image search instead of web document search.

5.2 Discussion on outcome from query clustering experiment

The outcome of this experiment was as we expected: query clustering on co-selected data clusters ambiguous query terms better than the single-click query clustering from [3]. On analysing the outcome, there are a few areas worth discussing.

Firstly, based on the outcome calculated with basic statistics, method1 outperformed method0. As mentioned in the previous chapter, the overall proportion of semantically similar queries for method1, 0.76, was significantly higher than the overall proportion for method0, 0.47, and the standard deviations of the two methods were not significantly different. Looking further at the proportions at the rater level, it is also noticeable that every participant rated method1 better than method0. This could be because method1, which was based on co-selected data, presumably clustered sets of queries containing potentially different senses, so that pairs selected from within its clusters were more likely to reflect only one sense of the search term, while method0 did not distinguish potentially different senses before creating the query pairs for the evaluation, so there was a greater chance of selecting pairs of clicked data from a single cluster spanning distinct senses. Furthermore, the selected ambiguous queries were sorted from a greater to a smaller number of clusters as the priority; in other words, the queries containing the most potentially different senses were used to compare the performance of the two methods. Although this selection criterion could be seen as a significant advantage for method1, it was suitable for answering whether method0 or method1 performs query clustering well on ambiguous queries, so it was rational to use this approach to select ambiguous queries for generating pairs. In short, on an ambiguous dataset, method1 clusters ambiguous queries better than method0.

Secondly, the level of agreement between participants was relatively good, but there were only 11 participants. As shown in the results chapter, the level of agreement indicated that the agreement was not purely by chance, because the Fleiss free-marginal kappa was 0.43 (positive agreement starts at 0). However, the participants did not agree completely. This might be because some query pairs were difficult to judge where their meaning was unclear, or because the raters genuinely disagreed about the relationship of some pairs. Additionally, only a small number of participants, 11, performed the evaluation, so it was difficult to analyze potential factors behind the level of agreement.

To summarize, based on the data, the results indicate that method1 is better than method0 at both the overall proportion level and the individual proportion level. Although the steps for selecting ambiguous queries could advantage method1, they were suitable for the purpose of the experiment, which assessed the performance of query clustering on ambiguous queries. The level of agreement between participants was relatively high, even though there was only a small number of participants, and the overall results show that co-selection is a promising avenue for sense-sensitive clustering.

5.3 Scope and Limitation

There are several areas that were not covered in this study. The following are out of its scope:

This study did not identify the meanings of the distinct clusters generated from co-selected data, because the meanings were not required for word sense discrimination (the system can give examples from each cluster for users to decide which sense they want (Schütze, 1998)), and they were not needed for constructing sets of synonyms either.

This study did not include an experiment on image search, but focused on web document search instead.

Since no contemporary dataset has been published, we instead used a real-world dataset from several years ago.

Query clustering from [4] was not used to generate ConnectedIDs because it requires a parameter, which is an artificial factor that varies with the kind of work. Although it could reduce noise more effectively than [3], we were not sure whether noise filtering while generating ConnectedIDs would affect the performance of the co-selection method.

We used only basic statistics to work out the results of the experiments. More advanced statistics might provide more reliable outcomes, but with limited time this was also out of scope.

5.4 Future work

For word sense discrimination, we suggest that it would be better to work on image search instead of web document search, because image search is a more reliable source for analyzing how users click on results. Although the number of people using image search is significantly smaller than for web search, it could still support the task of automatically discriminating word senses.

For query clustering on co-selected data, a contemporary dataset is required for future experiments. This is because present search results from search engines seem to be becoming more diversified, which could affect the performance of the co-selection method, resulting in either higher or lower performance.

Another potential piece of future work for query clustering with the co-selection method is a comparison between method1 and the query clustering from [4], because this comparison was not performed in this study. The performance of [4] is higher than that of [3] because its query clustering reduces noise effectively. However, it is unclear whether that clustering can perform better than query clustering on co-selected data (method1). Additionally, if the performance of the two methods turns out to be similar, our method would be the more attractive because it does not require setting an artificial parameter.


Chapter 6 Conclusion

Co-selection is a relatively new method for clustering search terms semantically by exploiting users’ judgement as the similarity function. There is potential to use this similarity function to discriminate word senses and to construct sets of synonyms by query clustering on co-selected data. Previous studies of the co-selection method have failed to conclusively demonstrate its performance because of unsuitable datasets; this study therefore selected its datasets carefully in order to perform new experiments.

A literature survey was done in order to develop the background knowledge and establish the current state of research on the co-selection method. It showed that previous studies have not used the co-selection method on web search to determine word senses, and that previous studies on query clustering have not evaluated the ability to create unambiguous clusters, except for [21], which attempted such an evaluation but with an inappropriate dataset: randomly extracting data from the entire dataset resulted in an unambiguous dataset.

There were 2 objectives for this study:

This study aimed to evaluate whether there was a difference between the number of clusters generated from co-selected data for unambiguous queries and for ambiguous queries.

This study aimed to evaluate whether query clustering on co-selected data could help to create unambiguous clusters.

The first objective was to find out whether the number of clusters generated from co-selected data for ambiguous queries was statistically higher than for unambiguous queries. To answer this question, the experiment needed to identify both ambiguous and unambiguous query terms in order to generate clusters for the evaluation.

The second objective, on the other hand, was to find out whether query clustering on co-selected data could cluster ambiguous queries better than basic query clustering [3] on an ambiguous dataset. For this reason, only ambiguous query terms were required for this experiment.


To achieve the objectives of the study, the stages of our methodology were as follows:

Identifying ambiguous and unambiguous query terms, using Wikipedia article titles as the source.

Data preprocessing for the co-selection experiments: queries were normalized so that each gathered more clicks, SessionIDs were generated to represent a unique user clicking on search results, and sessions having fewer than 2 clicks were filtered out.

For word sense discrimination, 20 truly ambiguous and 20 truly unambiguous queries were identified. SenseIDs were then generated to indicate how many clusters each query has. After that, the numbers of clusters generated for ambiguous and for unambiguous queries were compared to establish whether there was a difference between the 2 groups, using a paired t-test and basic statistics.

For query clustering, ConnectedIDs for method0 (with (query, URL) as the input data) and for method1 (with ((query, SessionID), URL) as the input data) were generated by simple query clustering [3], because it does not require any parameter. 10 truly ambiguous queries were identified for use in extracting queries for both methods. Query pairs for both method0 and method1 were then randomly generated for the evaluation, and human judgement was used to evaluate the relationship within each query pair. To calculate the results, Fleiss free-marginal kappa was used to find the level of agreement between participants, and basic statistics were used to work out the proportions of semantically similar queries for the 2 methods.

Unfortunately, the outcome of the word sense discrimination experiment indicates that there is no difference between the numbers of clusters generated from the selected ambiguous and unambiguous queries. This could be because changes in search results, a dataset several years old, the scope of ambiguous terms, and the noise filtering for co-selected data all affected the ability of the co-selection method to discriminate word senses in this study.

On the other hand, the outcome of the query clustering experiment was as we expected. Method1 performed query clustering on the ambiguous dataset better than method0 in both the overall and the individual proportions of semantically similar pairs. Additionally, the level of agreement between participants was relatively high. Therefore, based on these data, query clustering with the co-selection method is able to create unambiguous clusters.

Finally, for word sense discrimination by the co-selection method, future work should focus on image search instead of web search. For query clustering by the co-selection method, future work should use a contemporary dataset, or could compare query clustering on co-selected data against query clustering on single-click data with noise filtering, which performs well as shown in [4].


Appendix A – Complete Explanation of the Query Clustering Methodology

Since the explanation of the query clustering experiment is lengthy, it would be too much detail for the methodology chapter. It is therefore given here in Appendix A. The following are the complete stages of the methodology for the query clustering experiment.

A.1 – Section outline

Generating identification of connected component (A.2)

The selection of 10 truly ambiguous query terms (A.3)

The extraction of related queries (A.4)

Generating query pairs (A.5)

Word sense evaluation (A.6)

The approaches of working out the result (A.7)

A.2 – Generating identification of connected component

Firstly, Identification of connected component (ConnectedID) was required to

indicate queries clustered together. Fundamental query clustering [3] was chosen to generate

connected-id because it does not require setting a parameter. Additionally, although query

clustering from [4] can perform better noise filtering compared to [3], it requires setting a

certain threshold for the noise filtering, and the threshold is an artificial factor, which is

difficult to know which threshold is the best for our data. Furthermore, it was unclear whether

noise filtering during the process of generating ConnectedIDs could affect to the performance

of co-selection method. For example, it might filter out potential queries that is not useful for

query clustering on single click, but useful for query clustering on co-selected data. Thus,

query clustering from [3] is rational for this experiment because it does not require any

parameter, and it would not affect to ConnectedIDs assigned for query clustering on co-

selection method.

In this study, there were two methods for generating unique ConnectedIDs, in order to compare single-click data with co-selected data. Method0 was for single-click data, whereas method1 was for co-selected data. The table used for generating ConnectedIDs was CT-SSG1, introduced in Section 3.3 (data preprocessing). The difference between method0 and method1 was the input data used to generate the ConnectedIDs: for method0 the input pairs were (Query, URL), while for method1 they were ((Query, SessionID), URL). For this reason, the ConnectedIDs generated for method0 and method1 differed; another difference was that method0 had only one cluster per query, while method1 could have more than one cluster, depending on the co-selected data. After the ConnectedIDs were assigned, the resulting tables were named "METHOD0" and "METHOD1" respectively. In short, ConnectedIDs for method0 and method1 were generated from different input data in order to compare query clustering on single-click data with query clustering on co-selected data.
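The clustering in [3] amounts to finding connected components of the bipartite click graph. A minimal union-find sketch of both methods follows; the queries, URLs and sessions are invented, and this is an illustration of the idea rather than the thesis's actual implementation over the CT-SSG1 table:

```python
def connected_components(edges):
    """Assign a ConnectedID to every query-side node of the bipartite
    click graph, using union-find over (query-side node, URL) edges."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for q, u in edges:
        parent[find(('Q', q))] = find(('U', u))   # union the two endpoints

    roots = {}   # component root -> ConnectedID
    ids = {}
    for node in list(parent):
        ids[node] = roots.setdefault(find(node), len(roots))
    return {q: cid for (side, q), cid in ids.items() if side == 'Q'}

# method0: input pairs are (Query, URL)
method0 = connected_components([
    ('apple', 'apple.com'), ('iphone', 'apple.com'),
    ('apple', 'fruit.org'), ('pear', 'fruit.org'),
])

# method1: input pairs are ((Query, SessionID), URL)
method1 = connected_components([
    (('apple', 's1'), 'apple.com'), (('iphone', 's1'), 'apple.com'),
    (('apple', 's2'), 'fruit.org'), (('pear', 's2'), 'fruit.org'),
])
```

In this toy example method0 merges everything sharing a URL into one cluster, while method1 keeps the two sessions apart, so the query "apple" receives two different ConnectedIDs, one per session, exactly the one-cluster versus multiple-clusters contrast described above.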

A.3 – The selection of 10 truly ambiguous query terms

After the ConnectedIDs were generated, a list of ambiguous terms needed to be created for use in selecting truly ambiguous query terms for the experiment. The list was generated by mapping both tables (METHOD0 and METHOD1) to the "ALL-AMBIGUOUS" table, which was generated in Section 3.1. In fact, the queries occurring in METHOD0 and METHOD1 were the same (only the assigned ConnectedIDs differed), so it was not strictly necessary to map both tables to "ALL-AMBIGUOUS". However, mapping them was a simple task, and we also wanted to make sure that the number of ambiguous queries from method0 and method1 was the same. We therefore first mapped "METHOD0" against "ALL-AMBIGUOUS" and recorded the number of ambiguous queries after mapping, which was 1,090. Then we mapped both "METHOD0" and "METHOD1" against "ALL-AMBIGUOUS" and checked the number of ambiguous queries again. As expected, the number was the same, 1,090. Thus, we created a new table, called "AMBIGUOUS_COMMON", from the mapped tables just mentioned.

However, for this experiment we needed to select only 10 truly ambiguous queries, because of two concerns: the time participants would spend on the word sense evaluation, and generating adequate query pairs from each ambiguous query. The time concern arose because if we selected too many query pairs for evaluation, participants would take too long to complete it, and their performance in evaluating the pairs would also drop as the evaluation dragged on. On the other hand, the concern about adequate query pairs arose because if we chose too few truly ambiguous queries to generate pairs from, it might be difficult to interpret the results properly. For this reason, we decided to choose 10 truly ambiguous queries and to randomly select 24 query pairs (12 for method0 and 12 for method1) for each of them, giving 240 query pairs in total. We assumed participants would spend 10 seconds per pair, so completing 240 pairs would take 40 minutes. This might still be slightly too long for participants to complete in one sitting but, as mentioned above, we were also concerned about having too few pairs for interpreting the results. We therefore allowed participants to pause the evaluation whenever they wanted and to return later as existing participants. Ten truly ambiguous queries were thus a suitable compromise between having inadequate query pairs and the time spent completing the evaluation.

To find 10 rational ambiguous queries for this experiment, we added two fields, the cluster size from method0 and the number of clusters from method1, to the "AMBIGUOUS_COMMON" table. These two fields were used to sort the ambiguous queries. Although there were 1,090 common ambiguous queries that we could have used, as mentioned earlier in this section, some of them might not have been rational choices, because their cluster size (from method0) was too small and/or their number of clusters (from method1) was only one, the same as method0, so they could not be compared with method0. For this reason, we first sorted the queries by number of clusters, since the difference between the methods becomes more significant when method1 produces more clusters. We then sorted the queries by the cluster size of method0, to make sure that each ambiguous query could generate 12 pairs (the cluster had to contain at least 6 unique queries); moreover, the greater the cluster size, the more potentially ambiguous queries it would contain. After sorting by number of clusters and cluster size respectively, the candidate queries were checked for true ambiguity by inspecting whether each sense of the ambiguous term accounted for at least 20 percent of the search results on the first page of Google, Bing and Yahoo. When it was unclear whether a top-ranked query was truly ambiguous, we did not choose it and moved on to the next one, iterating until 10 truly ambiguous queries had been selected. The 10 selected queries were therefore rational enough for the experiment. In short, sorting by number of clusters (of method1) and cluster size (of method0), together with checking true ambiguity against search results, were the criteria for finding 10 rational ambiguous queries.
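The two-level sort described above can be sketched as a single keyed sort; the terms and per-term statistics below are invented for illustration:

```python
# hypothetical tuples: (term, number of clusters in method1, cluster size in method0)
candidates = [
    ('jaguar', 3, 40),
    ('apple',  3, 55),
    ('bank',   2, 80),
    ('java',   4, 30),
]

# sort by number of method1 clusters first, then by method0 cluster size,
# both descending, before manually checking true ambiguity in search results
ranked = sorted(candidates, key=lambda t: (t[1], t[2]), reverse=True)
```

The top of the ranked list is then walked manually, skipping any term whose ambiguity cannot be confirmed from the search results, until 10 terms remain.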

A.4 – The extraction of related queries

The ConnectedIDs of the 10 selected rational ambiguous queries were used to extract queries related to them. These ConnectedIDs occurred in the "METHOD0" and "METHOD1" tables and were used to extract the related queries, i.e. those sharing a ConnectedID with one of the 10 selected queries, into the "METHOD0_SELECTED" and "METHOD1_SELECTED" tables. For method0, there was only one ConnectedID per selected ambiguous query, because method0, based on single-click data, treats all queries sharing the same URLs as related. For method1, by contrast, there could be multiple clusters, i.e. more than one ConnectedID, per selected ambiguous query, because method1, based on co-selected data, incorporates each user's judgement when clustering related queries. For example, if there were two clusters for the query term "apple", there would be two ConnectedIDs, each representing a different set of related queries. Thus, sets of queries were extracted into the tables "METHOD0_SELECTED" and "METHOD1_SELECTED" based on the ConnectedIDs of the selected ambiguous queries occurring in "METHOD0" and "METHOD1".

A.5 – Generating query pairs

Query pairs for the evaluation then needed to be randomly selected from "METHOD0_SELECTED" and "METHOD1_SELECTED", because we needed only 12 query pairs for each of the 10 selected ambiguous query terms. This means we needed 120 query pairs for method0 and another 120 for method1, 240 in total, as discussed earlier in this chapter. As mentioned above, the difference between the related queries clustered by method0 (one cluster) and by method1 (multiple clusters) was the number of clusters per selected ambiguous query. For method0, we randomly selected 12 query pairs as follows:


1. We sorted the related queries by query term.

2. We created a list of the potential unique query pairs for the selected ambiguous query, in order to select from it randomly. For example, if a term had four members, A, B, C and D, the list of potential unique query pairs would be as in Table A-1. (Note that our data always had more than 4 members, due to the safeguard against inadequate query pairs explained above; we use 4 members here to keep the example easy to understand.)

Index   Potential pair        Index   Potential pair
1       A-B                   4       B-C
2       A-C                   5       B-D
3       A-D                   6       C-D

Table A-1: An example list of potential unique query pairs

3. We randomly generated 12 unique indexes to select 12 unique pairs for the ambiguous term.

4. We repeated steps 1 to 3 with a different selected ambiguous query until all 10 selected ambiguous queries were completed.

From these steps, we finally had 120 unique query pairs for the evaluation of method0.
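The method0 steps above can be sketched as exhaustive pair enumeration followed by sampling without replacement; the toy cluster and the seed are invented for illustration:

```python
import itertools
import random

def sample_pairs(queries, k=12, seed=0):
    """Enumerate every unique unordered pair (as in Table A-1), then
    draw k of them at random without replacement."""
    pairs = list(itertools.combinations(sorted(queries), 2))
    return random.Random(seed).sample(pairs, min(k, len(pairs)))

cluster = ['A', 'B', 'C', 'D']   # toy cluster; real clusters had at least 6 queries
pairs = sample_pairs(cluster, k=3)
```

Because the pairs are enumerated once and sampled without replacement, no duplicate pairs can occur for a given term, which is why method0 needs no rejection step.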

For method1, the process differed slightly from method0 because of the multiple clusters per ambiguous query term. The following are the steps for randomly selecting query pairs for the evaluation:

1. We randomly selected one of the multiple clusters of a selected ambiguous query.

2. We sorted the related queries by query term.

3. We created a list of the potential unique query pairs for the randomly selected cluster, as in the method0 example (see Table A-1).

4. We randomly generated one index to select one pair from the list.

5. If the selected pair duplicated a previously selected pair, it was not recorded and the process restarted from step 1. Otherwise it was recorded, and the process repeated from step 1 until 12 unique pairs had been selected for the ambiguous term.

6. We repeated this to select 12 unique pairs for each of the other selected ambiguous queries.

From these steps, we finally had 120 unique query pairs for the evaluation of method1 as well. Thus all 240 query pairs for the experiment had been randomly selected for the evaluation.
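The method1 steps amount to rejection sampling across clusters: each round picks a cluster, then a pair within it, and duplicates are discarded. A minimal sketch with two invented toy clusters:

```python
import random

def sample_pairs_method1(clusters, k=12, seed=0):
    """Repeat steps 1-5 above: pick a random cluster, pick a random pair
    within it, and reject duplicates until k unique pairs are recorded."""
    rng = random.Random(seed)
    chosen = set()
    while len(chosen) < k:
        cluster = rng.choice(clusters)
        pair = tuple(sorted(rng.sample(sorted(cluster), 2)))
        chosen.add(pair)   # adding to a set silently rejects duplicates
    return sorted(chosen)

clusters = [['A', 'B', 'C'], ['D', 'E', 'F']]   # two toy clusters for one term
pairs = sample_pairs_method1(clusters, k=4)
```

Sampling within a randomly chosen cluster each round means every pair respects a cluster boundary, which is what lets the evaluation compare method1's per-user clusters against method0's single cluster.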

A.6 – Word sense evaluation

Figure A-1: An example of the evaluation of a query pair

To use human judgement to evaluate the relationships between query pairs, a simple user interface was built as a small web application. Each feature of this application served a purpose, as follows:

1. There was a login page for both new participants and existing participants (for those who preferred not to finish the evaluation in one sitting). This helped us validate which participants completed the evaluation.

2. The evaluation page presented query pairs to the participant in random order, one at a time, until no pair was left. This means the participant did not know whether a query pair came from method0 or method1 (see Figure A-1). The features of this page were the following:

a. There were 8 choices to help participants decide the relationship between a query pair:

i. [query 1] is the same concept as [query 2]

ii. [query 1] is a sibling concept of [query 2]

iii. [query 1] is the opposite of [query 2]

iv. [query 1] is not related to [query 2]

v. [query 1] is an example of [query 2]

vi. [query 1] is more generic than [query 2]

vii. [query 1] is a part of [query 2]

viii. [query 1] contains the part [query 2]

b. If participants had no idea of the meaning of a query, they could click on it to see whether its search results related to the other query.

c. A history of evaluated query pairs was provided so that participants could change an evaluated result, in case they changed their mind or accidentally chose the wrong option.

d. If the session timed out because a participant had not interacted with the page for a period of time, the application forced the participant to log in again; otherwise, results recorded after the timeout would not contain the participant's name, making it more difficult for us to calculate the results.

e. Although there were 240 pairs for the evaluation (120 unique pairs from method0 and 120 from method1), some query pairs were duplicated across method0 and method1. This did not affect the results, but we did not want participants to evaluate the same pair twice; therefore, when a participant evaluated a pair that was duplicated across the methods, the result was assigned to the other pair as well.

These were all the features the application provided to participants performing the evaluation task. In short, participants evaluated randomly ordered query pairs one by one until all pairs were completed.

A.7 – The approaches of working out the result

The participants' evaluations, one of the 8 choices per pair (see Table A-2), were transformed to 1, meaning related, or 0, meaning not related.


#   Relationship between a query pair             Transformed value
1   [query 1] is the same concept as [query 2]    1 (Related)
2   [query 1] is a sibling concept of [query 2]   0 (Not related)
3   [query 1] is the opposite of [query 2]        0 (Not related)
4   [query 1] is not related to [query 2]         0 (Not related)
5   [query 1] is an example of [query 2]          1 (Related)
6   [query 1] is more generic than [query 2]      1 (Related)
7   [query 1] is a part of [query 2]              1 (Related)
8   [query 1] contains the part [query 2]         1 (Related)

Table A-2: Transformation of the relationship choices into binary values

Then, these values were used to calculate the level of agreement between raters and

the proportion of semantically similar pairs from method0 and method1.
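The transformation in Table A-2 amounts to a simple lookup from choice number to binary value; a minimal sketch:

```python
# Table A-2 as a lookup: choice number -> related (1) / not related (0)
RELATED = {1: 1, 2: 0, 3: 0, 4: 0, 5: 1, 6: 1, 7: 1, 8: 1}

def transform(choices):
    """Map each participant's relationship choice to its binary value."""
    return [RELATED[c] for c in choices]

binary = transform([1, 2, 3, 4, 5, 6, 7, 8])
```

The resulting binary lists per pair and per rater are the inputs to both the agreement and the proportion calculations below.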

The level of agreement between participants was evaluated because the results would not be reliable if agreement occurred only by chance. Fleiss free-marginal kappa, rather than fixed-marginal kappa, was used for this task, since free-marginal kappa is suitable when raters are not restricted in how many items they may assign to each category [16]. An online tool by the author of [16], available at http://justusrandolph.net/kappa/, was used to calculate the kappa in this study. In short, Fleiss free-marginal kappa was used to measure the level of agreement between participants.
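The statistic computed by the online tool can also be calculated directly: free-marginal kappa corrects the mean observed agreement by a chance agreement of 1/k, where k is the number of categories [16]. A sketch with invented ratings (3 raters, 4 query pairs, 2 categories):

```python
def free_marginal_kappa(counts, k=2):
    """Free-marginal multirater kappa.
    counts: one list per rated item, giving how many of the n raters
    chose each of the k categories for that item."""
    n = sum(counts[0])                     # number of raters per item
    p_obs = sum(
        sum(c * (c - 1) for c in item) / (n * (n - 1)) for item in counts
    ) / len(counts)
    p_exp = 1.0 / k                        # free-marginal chance agreement
    return (p_obs - p_exp) / (1 - p_exp)

# hypothetical: 3 raters judging 4 query pairs as (related, not related)
kappa = free_marginal_kappa([[3, 0], [2, 1], [3, 0], [0, 3]])
```

For these invented counts, three items have full agreement and one has two raters against one, giving a kappa of 2/3.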

The proportion of semantically similar pairs was measured to indicate which method was better. This was done using basic statistics, calculated at both the overall level and the individual level. The measurement should therefore be able to indicate whether method1 is better than method0.


References

[1] A. K. Agrahri, D. A. T. Manickam and J. Riedl, "Can people collaborate to improve the relevance of search results?," in Proceedings of the 2008 ACM conference on Recommender systems Lausanne, Switzerland: ACM, 2008, pp. 283-286.

[2] R. Baeza-Yates, C. Hurtado, M. Mendoza, and G. Dupret, "Modeling User Search Behavior," in Proceedings of the Third Latin American Web Congress: IEEE Computer Society, 2005, p. 242.

[3] D. Beeferman and A. Berger, "Agglomerative clustering of a search engine query log," in Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining Boston, Massachusetts, United States: ACM, 2000, pp. 407-416.

[4] W. S. Chan, W. T. Leung and D. L. Lee, "Clustering Search Engine Query Log Containing Noisy Clickthroughs," in Proceedings of the 2004 International Symposium on Applications and the Internet (SAINT'04), 2004, p. 4.

[5] C. L. A. Clarke, E. Agichtein, S. Dumais, and R. W. White, "The influence of caption features on clickthrough patterns in web search," in Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval Amsterdam, The Netherlands: ACM, 2007, pp. 135 - 142.

[6] B. Croft, R. Cook and D. Wilder, "Providing Government Information on the Internet: Experiences with THOMAS," University of Massachusetts, 1995.

[7] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise," in Proceedings of 2nd Int. Conf. on Knowledge Discovery and Data Mining, 1996, pp. 226-231.

[8] J. A. Guthrie, L. Guthrie, Y. Wilks, and H. Aidinejad, "Subject-dependent co-occurrence and word sense disambiguation," in Proceedings of the 29th annual meeting on Association for Computational Linguistics Berkeley, California: Association for Computational Linguistics, 1991, pp. 146 - 152.

[9] P. J. Hayes, L. E. Knecht and M. J. Cellio, "A news story categorization system," in Proceedings of the second conference on Applied natural language processing Austin, Texas: Association for Computational Linguistics, 1988.

[10] T. Joachims, "Optimizing search engines using clickthrough data," in Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining Edmonton, Alberta, Canada: ACM, 2002, pp. 133 - 142.

[11] T. Joachims, L. Granka, B. Pan, H. Hembrooke, and G. Gay, "Accurately interpreting clickthrough data as implicit feedback," in Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval Salvador, Brazil: ACM, 2005, pp. 154 - 161.

[12] T. Joachims, L. Granka, B. Pan, H. Hembrooke, F. Radlinski, and G. Gay, "Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search," ACM Trans. Inf. Syst., vol. 25, p. 7, 2007.

[13] A. Kilgarriff, "Word Senses," in Word Sense Disambiguation, 2006, pp. 29-46.

[14] H. Lieberman, "Letizia: an agent that assists web browsing," in Proceedings of the 14th international joint conference on Artificial intelligence - Volume 1 Montreal, Quebec, Canada: Morgan Kaufmann Publishers Inc., 1995, pp. 924-929.

[15] R. Navigli, "Word sense disambiguation: A survey," ACM Comput. Surv., vol. 41, pp. 1-69, 2009.


[16] J. Randolph, "Free-Marginal Multirater Kappa (multirater K[free]): An Alternative to Fleiss' Fixed-Marginal Multirater Kappa," in Joensuu University Learning and Instruction Symposium, Joensuu, Finland, 2005.

[17] M. Sanderson, "Ambiguous queries: test collections need more sense," in Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval Singapore, Singapore: ACM, 2008, pp. 499-506.

[18] F. Scholer, M. Shokouhi, B. Billerbeck, and A. Turpin, "Using Clicks as Implicit Judgments: Expectations Versus Observations," in Advances in Information Retrieval, 2008, pp. 28-39.

[19] H. Schütze, "Automatic word sense discrimination," Comput. Linguist., vol. 24, pp. 97-123, 1998.

[20] D. Shen, M. Qin, W. Chen, Q. Yang, and Z. Chen, "Mining web query hierarchies from clickthrough data," in Proceedings of the 22nd national conference on Artificial intelligence - Volume 1 Vancouver, British Columbia, Canada: AAAI Press, 2007, pp. 341-346.

[21] G. Smith and H. Ashman, "Evaluating implicit judgements from Image search interactions," in Proceedings of the WebSci ' 09: Society On-Line, 2009.

[22] G. Smith, T. Brailsford, C. Donner, D. Hooijmaijers, M. Truran, J. Goulding, and H. Ashman, "Generating unambiguous URL clusters from web search," in Proceedings of the 2009 workshop on Web Search Click Data Barcelona, Spain: ACM, 2009, pp. 28-34.

[23] K. Spärck Jones, S. E. Robertson and M. Sanderson, "Ambiguous requests: implications for retrieval tests, systems and theories," SIGIR Forum, vol. 41, pp. 8-17, 2007.

[24] N. Tomasz, "Word Sense Discovery for Web Information Retrieval," 2008, pp. 267-274.

[25] M. Truran, J. Goulding and H. Ashman, "Co-active intelligence for image retrieval," in Proceedings of the 13th annual ACM international conference on Multimedia Hilton, Singapore: ACM, 2005, pp. 547-550.

[26] J. Véronis, "HyperLex: lexical cartography for information retrieval," Computer Speech & Language, vol. 18, pp. 223-252, 2004.

[27] A. J. Viera and J. M. Garrett, "Understanding interobserver agreement: the kappa statistic," Family medicine, vol. 37, pp. 360-363, 2005.

[28] J.-R. Wen and H.-J. Zhang, Query Clustering in the Web Context: Kluwer Academic Publishers, 2002.

[29] J. R. Wen, J. Y. Nie and H. J. Zhang, "Query clustering using user logs," ACM Trans. Inf. Syst., vol. 20, pp. 59-81, 2002.

[30] G.-R. Xue, H.-J. Zeng, Z. Chen, Y. Yu, W.-Y. Ma, W. Xi, and W. Fan, "Optimizing web search using web click-through data," in Proceedings of the thirteenth ACM international conference on Information and knowledge management Washington, D.C., USA: ACM, 2004, pp. 118 - 126.