
Developing an Interactive Web Information Retrieval and Visualization System

Research Project
Master Artificial Intelligence

Department of Knowledge Engineering

R. Atachiants J. Meyer J. Steinhauer T. Stocksmeier

June 24, 2009


Abstract

Finding the needed information (images, articles or other) is not always as simple as going to a search engine. This paper aims at developing an interactive presentation system able to cope with live presentation challenges: speech recognition, information retrieval, filtering unstructured data and its visualization. Speech recognition is achieved using Microsoft SAPI; information is retrieved using various text mining techniques in order to derive the most relevant keywords, and next the search using expanded Google queries is performed. The information is then filtered: HTML tags are cleared and text summarization is performed in order to extract the most relevant information. Such a system has been tested and performs satisfactorily; such interactive systems can be built with modern hardware and faster internet connections, but several challenges still need to be faced and some improvement is required in order to create a smooth presentation experience. This paper also presents a novel approach to query expansion using Cyc concepts, based on the CycFoundation.org knowledge base; it presents various query expansion patterns, such as Or-Concept and Or-Most-Relevant-Alias, and their respective results. It also presents two word-matching algorithms which try to match a word to the most relevant concept, using WordNet similarity measurement to accomplish the goal at the semantic level. It shows that the Or-Concept pattern improved query precision by around a factor of 3.


Contents

1 Introduction
2 Web Retrieval
  2.1 Architectural Overview
  2.2 Speech Recognition
  2.3 Keyword Extraction
  2.4 Query Construction
    2.4.1 WordNet Similarity Measurement
    2.4.2 Cyc Concepts Matching
    2.4.3 Query Expansion
    2.4.4 Query for Google Search
  2.5 Results Filtering
    2.5.1 Retrieving Unstructured Web Pages
    2.5.2 Summarizing the Web Content
  2.6 Visualization
3 Tests and Results
  3.1 Speech Recognition
  3.2 Keyword Extraction
  3.3 Query Construction and Expansion
  3.4 Results Filtering
4 Discussion
5 Conclusions
A Appendix: Query Expansion using Cyc
  A.1 WordNet Similarity Measurement
  A.2 CycFoundation.org REST API
  A.3 Cyc Concepts Matching
    A.3.1 DisplayName Matching Algorithm
    A.3.2 Comment Matching Algorithm
    A.3.3 Experiments with both algorithms
  A.4 Query Expansion Patterns
  A.5 Results and Discussion
    A.5.1 Google Search on Wikipedia KB
    A.5.2 Lemur Search in AP Corpus


1 Introduction

Why is there a need for an Interactive Web Information Retrieval and Visualization System?

A person who is teaching at school, at university or somewhere else is often required to present various types of information (his own knowledge, texts written by others, pictures, diagrams etc.) more or less at once. In order to do this, he will collect his data beforehand, arrange it and set up a presentation, using an ordinary presentation tool such as Microsoft PowerPoint. During his lecture, he will start the presentation, showing one slide after another and explaining what can be seen on it.

This is, compared to what lessons were like some decades ago, of course a great increase in the possibilities of how information can be presented and how the listeners’ attention can be kept. But still, there are a lot of problems with this approach. First, all the work has to be done beforehand, resulting in a very long time one has to spend on the presentation before one can show something useful. Another problem is that the prepared presentation is very static. Presentations about topics in which changes can occur over time will have to be updated again and again with new information and media from the web. Further, during the presentation one may learn that the audience’s knowledge about some parts of the covered subjects is different from what one had thought before. If the listeners know more than expected, that won’t be a real problem: some slides can be skipped. But what if they know less than expected? It is impossible to make new slides during the presentation, so either the problem has to be ignored or one has to search for the missing information during the presentation.

Another hardship with the usual presentations is the research itself. Finding information is nowadays very easy: one just goes to Google or another search engine, enters what he is looking for and looks at the first results. But it is not always so easy, because one doesn’t always know what he is searching for and where he can find it. In addition, one often receives a lot of results, and the precision is too low to find the relevant information quickly.

In our project, we want to meet all these problems and difficulties. Our aim is to develop a system that both makes the research easier and more efficient and also increases the quality and changeability of the presentation.

To enhance the research itself, the program should increase the precision of the search by searching for concepts instead of keywords. If done properly, the results of the search are all relevant for the topic and can be used in some way. This can reduce the time spent on evaluating each of the search results. In addition, the texts on the result pages can be summarized, further reducing the time needed to evaluate the results. Another possibility is to do the usual web search and the image search at once, because e.g. tables can be found in both formats.

To provide the possibility of using the system also during the presentation itself, the search query should be constructed not solely from keyboard input, but also directly from spoken text, in a way that lets the user choose the currently wanted way of communication. The results have to be displayed in such a way that the important parts are directly visible, while other things are suppressed. The user of the system has to be able to decide which contents (texts, images etc.) are shown and which are not.

To meet all these requirements, the system has to consist of the following steps:

1. Recognize the spoken text

2. Retrieve keywords from the text

3. Create a powerful search query

4. Filter and summarize the results

5. Visualize the results in a proper way

These steps will be further described in the next chapters.
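The five steps above can be sketched as a single pipeline. The sketch below is illustrative only: every function name is a hypothetical placeholder for the module described in the following chapters (the original system is written in C#; Python is used here purely to show the data flow).

```python
def recognize_speech(audio):
    """Placeholder for the SAPI dictation output (section 2.2)."""
    return "the human brain stores memories"

def extract_keywords(text):
    """Placeholder for the SharpNLP-based extraction (section 2.3)."""
    return [w for w in text.split() if len(w) > 4]

def build_query(keywords):
    """Placeholder for Cyc-based query expansion (section 2.4)."""
    return " ".join(keywords) + " site:en.wikipedia.org"

def search_filter_summarize(query):
    """Placeholder for search, filtering and summarization (section 2.5)."""
    return [f"summary of a page about: {query}"]

def visualize(results):
    """Placeholder for the WPF visualization (section 2.6)."""
    return "\n".join(results)

text = recognize_speech(audio=None)
output = visualize(search_filter_summarize(build_query(extract_keywords(text))))
print(output)
```

Each stage only passes plain data (text, keyword lists, result lists) to the next, which is what allows the stages to be swapped out or parallelized independently, as described later.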

2 Web Retrieval

2.1 Architectural Overview

The spoken words of the presentation are recognized and translated by a speech recognition engine. The output of the speech recognition is fed to a keyword extractor to create the keywords of the text. These are used by a query construction module to produce an internet search query. The picture results of the search are directly transferred to the visualisation module. The text results are summarized by a summarizer module and then also displayed by the visualisation module. The workflow is shown in figure 1.

Figure 1: Workflow

2.2 Speech Recognition

The user of the information retrieval system has several possibilities of interaction with the system, but the most important way of communication is speech. It is used to get a starting point as well as to define the wanted content more precisely. Thus, our application needs a well-working speech recognition engine, to ”understand” as much of the spoken text as possible. It would be interesting to write our own speech recognition engine, but the effort for this is too high for our project, so the main task here is to choose a feasible speech recognition engine and to use it in a proper way. There are several speech recognition engines available, with different strengths and weaknesses. For example, the program ”Dragon Naturally Speaking” is an established commercial software product that works quite well. But due to the fact that it is commercial, it is very expensive and thus not the right solution for a research project. Beside the one mentioned, there are several other commercial products that include a speech recognition engine. Another group of engines are open-source freeware tools, found spread across the web. Unfortunately, the tools tested turned out not to have the efficiency and the quality that we would need for the project. So we decided to use a product of the third group: speech recognition engines built into operating systems. These engines are likely to have a high quality and a good performance, while they don’t have to be bought for a lot of money, because they are included in the operating system that everyone has to buy anyway.

After some tests and some problems with the usage of some of the mentioned software products, we chose the Windows Speech Recognition API (SAPI 5.1) [9], [10] as a part of our project. The Windows speech module offers different kinds of functionality, including a dictation mode (beside others such as voice commands and text-to-speech). In this dictation mode, the spoken words are simply put into plain text, which can be used by other programs via the programming interface [8].

2.3 Keyword Extraction

For the keyword generation from an input text, the open-source natural language processing toolkit SharpNLP is used [3]. SharpNLP is a collection of natural language processing tools written in C#. It provides the following NLP tools [4]:

• a sentence splitter (to identify sentence boundaries)

• a tokenizer (to find tokens or word segments)

• a part-of-speech tagger

• a chunker (to find non-recursive syntactic annotation such as noun phrase chunks)

• a parser

• a name finder (to find proper names and numeric amounts)


• a coreference tool (to perform coreference resolution)

• an interface to the WordNet lexical database [14]

This toolkit turned out to be tricky to set up, but subsequently offers good performance and useful features. To generate keywords, the following features are used:

• Sentence Splitting: splitting the text based on sentence-final punctuation and additional rules (it will not split ”and Mr. Smith said” into two sentences); uses a model trained on English data

• Tokenizer: resolves the sentences into independent tokens, based on Maximum Entropy Modeling [5]

• POS-Tagging (Part-Of-Speech Tagging): associating tokens with corresponding tags, denominating what grammatical part of a sentence each token constitutes, based on a model trained on English data from the Wall Street Journal and the Brown corpus [7]

• POS-Filtering: filtering relevant word categories (nouns and foreign words) based on the POS tags (NN, NNS, NNP, NNPS, FW)

• Stemming [13]: reducing the different morphological versions of a word to their common word stem, based on the Porter stemmer algorithm [2]

The relevance rating of the keywords is calculated based on word count, actuality and word length.
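The paper does not spell out the exact weighting of these three signals. The sketch below is one hypothetical way such a rating could combine word count, actuality (how recently the word occurred) and word length; the 0.5/0.3/0.2 weights are illustrative and not taken from the system.

```python
def relevance(word, tokens):
    """Toy relevance rating combining the three signals named in the
    text: word count, actuality and word length. The weights here are
    illustrative assumptions, not the paper's values."""
    count = tokens.count(word)
    # actuality: position of the last occurrence, normalized to [0, 1]
    last = len(tokens) - 1 - tokens[::-1].index(word)
    actuality = last / max(len(tokens) - 1, 1)
    return 0.5 * count + 0.3 * actuality + 0.2 * len(word)

tokens = "speech recognition turns speech into text".split()
scores = {w: relevance(w, tokens) for w in set(tokens)}
best = max(scores, key=scores.get)
print(best)  # → recognition
```

With these particular weights, long technical terms can outrank frequent short words; tuning the weights trades off frequency against recency and specificity.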

2.4 Query Construction

In order to construct the query which the system uses, several steps are performed:

1. each keyword is mapped to its context sentence;

2. using the keyword and its context, each keyword is matched with a particular Cyc concept;

3. the query is expanded using keywords and concepts;

4. the query is submitted to the Google search engine.

A more detailed description of concept matching and expansion can be found in Appendix A.

2.4.1 WordNet Similarity Measurement

In order to compare two words or sentences, a semantic similarity measure was needed; WordNet similarity measurement was used [11].
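The WordNet measure itself is detailed in Appendix A. As an illustration of how a word-level similarity can be lifted to sentence level, the sketch below uses a common aggregation that is assumed here, not taken from the paper: for each word of one sentence, take its best match in the other sentence and average. The `word_similarity` stand-in is a trivial exact-match function where the real system would query WordNet.

```python
def word_similarity(a: str, b: str) -> float:
    """Stand-in for a WordNet-based word similarity in [0, 1];
    exact match only, purely for illustration."""
    return 1.0 if a == b else 0.0

def sentence_similarity(s1: str, s2: str) -> float:
    """Average, over the words of s1, of each word's best similarity
    to any word of s2 (one common way to lift word similarity to
    sentences; an assumption, not the paper's exact formula)."""
    w1, w2 = s1.lower().split(), s2.lower().split()
    if not w1 or not w2:
        return 0.0
    return sum(max(word_similarity(a, b) for b in w2) for a in w1) / len(w1)

print(sentence_similarity("human adult", "adult human person"))  # → 1.0
```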

2.4.2 Cyc Concepts Matching

Cyc [12] is a very large Knowledge Base (KB) which tries to gather a formalized representation of a vast quantity of fundamental human knowledge: facts, rules, etc. The Cyc KB contains thousands of different assertions and can be used in many different ways. For the system described in this paper, a particular subset of Cyc has been used: concepts and different relations. In Cyc, every concept is linked to additional knowledge, such as:

• a display name (readable name of the concept)

• a comment (short description)

• general concepts, for example: HumanAdult is linked to AdultAnimal and HomoSapiens

• specific concepts, for example: HumanAdult is linked to Professional-Adult, SoftwareEngineer, ...

• aliases, for example: for HumanAdult there are ”adult homo sapiens”, ”grownup”, ...

Unfortunately, the whole Cyc KB was unavailable to the public when our research was done; therefore the CycFoundation.org [6] REST APIs have been used in order to interact with Cyc. More about the actual REST API implementation, as well as the explanation of the different algorithms and specific results, can be found in Appendix A. Since Cyc contains semantic knowledge about concepts and not words, a word-to-concept matching algorithm was created. The algorithm is built in such a way that it operates only at the semantic level; it therefore uses the WordNet similarity measurement [11] (also described in section 2.4.1) to compute a similarity score between two sentences. The algorithm takes a keyword and its corresponding sentence; then, using the CycFoundation APIs, a set of relevant Cyc concepts (aka constants) is retrieved. The retrieval only returns concepts containing the keyword in their name; for example, for the keyword ”human”, the set of concepts would contain ”HumanAdult”, ”HumanActivity” and ”HumanBody”. Next, for each item in the set of concepts, the similarity score with the keyword’s context (sentence) is computed, and the best one is used.

The following straightforward pseudo-code implementation illustrates the approach (more about the matching types in A.3):

Input: Keyword, its context (sentence) and a matching type
Output: Matched Cyc concept (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    Distance ⇐ ∞
    if MatchingType = DisplayNameMatching then
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC) / 2
    end
    if MatchingType = CommentMatching then
        dCK ⇐ 0
        Comment ⇐ GetComment(Constant)
        Keywords ⇐ GetKeywords(Comment)
        foreach CK in Keywords do
            dCK ⇐ dCK + GetDistance(Keyword, CK)
        end
        dK ⇐ GetDistance(Keyword, Constant.DisplayName)
        dC ⇐ GetDistance(Context, Constant.DisplayName)
        Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    end
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end
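The DisplayName branch of this matching can be sketched in runnable form. Everything below is illustrative: `get_distance` is a crude token-overlap stand-in for the WordNet-based semantic distance, and the constants are plain display-name strings rather than Cyc objects.

```python
import math

def get_distance(a: str, b: str) -> float:
    """Stand-in for the WordNet-based semantic distance: a crude
    token-overlap distance (0 = identical vocabulary, 1 = disjoint)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 1.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def match_concept(keyword, context, display_names):
    """DisplayName matching: average the keyword-to-name and
    context-to-name distances, keep the closest constant."""
    best, best_distance = None, math.inf
    for name in display_names:
        d_k = get_distance(keyword, name)
        d_c = get_distance(context, name)
        distance = (d_k + d_c) / 2
        if distance < best_distance:
            best_distance, best = distance, name
    return best

concepts = ["Human Adult", "Human Activity", "Human Body"]
print(match_concept("human", "the adult human needs sleep", concepts))
# → Human Adult
```

The context sentence is what disambiguates: with the same keyword ”human”, a sentence about activities would pull the match toward ”Human Activity” instead.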

2.4.3 Query Expansion

After the keywords and their respective contexts are matched to particular Cyc concepts, the actual query expansion can be done. One can think of many different possible query expansions using the additional Cyc knowledge. In our research we have chosen several expansion methods, most of them quite straightforward. After some experiments, one particular structured query expansion was chosen to be used in the system: Or-Concept.


The algorithm constructs a structured query, which can be described using the following formula, where K is the set of keywords, k ∈ K and c(k) is a function which gets the matched concept for a keyword (using a matching algorithm):

Q(K, n) = Σ_{i=1}^{n} max(k_i, c(k_i))    (1)

The following pseudo-code illustrates the approach used in the system in order to construct the structured query:

Input: Keywords tagged with Cyc concepts
Output: Set of expanded keywords for Google

foreach TaggedWord in Query do
    Pattern ⇐ ”( [0] OR [1] ) + ,”
    PatternEmpty ⇐ ”[0] + ”
    Keyword ⇐ TaggedWord.Word
    BestGeneral ⇐ ””
    BestDistance ⇐ ∞
    if TaggedWord.Concept ≠ null then
        GeneralConcepts ⇐ GetGenerals(TaggedWord.Concept)
        foreach General in GeneralConcepts do
            Distance ⇐ GetDistance(TaggedWord.Word, General.DisplayName)
            if Distance < BestDistance then
                BestGeneral ⇐ General.DisplayName
                BestDistance ⇐ Distance
            end
        end
    end
    if BestGeneral.Length = 0 then
        Pattern ⇐ PatternEmpty
    end
    Expanded.Add(String.Format(Pattern, Keyword, BestGeneral))
end

2.4.4 Query for Google Search

In order to receive the real data, we need to choose a search engine as well as a corpus. For several reasons, we quickly arrived at the decision to use the Google web search for our application. First of all, Google is the most popular search engine on the web. This is a great advantage, because an application programming interface (API) is required in order to make interaction between our program and the search engine possible, and the bigger and more popular a search engine is, the more likely it is that a good API can be found for it. This turned out to be only partially true: Google provides a lot of programming interfaces, most of them specialized for a specific context. So it isn’t easy to find the best-fitting programming interface for an application, in addition to the ”usual” problems one may encounter during the implementation.

Another reason to choose the Google web search is the fact that it can easily be restricted to one or a set of several web pages. This can be done by simply adding the tag ”site:”, followed by the desired web page, to the keywords one wants to search for. As a result, the search only finds results that are somewhere on the indicated pages. As we don’t want to find arbitrary results, but results from specific, serious pages, we really want to use this feature of the Google web search. If one does a web search (either with the Google engine or somewhere else), a lot of results are found on very different pages and types of pages. This is of course important, since that is what most people want when they search the web. But with the information retrieval system, the aim is different: we don’t want to display content from any website that contains the keywords, but only from websites with trustworthy content. So a white list with the sites that will be searched is the best way of avoiding useless results. Due to the fact that the results of the web search contain lots of HTML tags and similar things that are (for our purpose) useless, some knowledge about the structure of the found results is required. By limiting the search to defined web pages, we can use our information about these pages to parse the results, depending on the web site they were found on.

In a first approach, we limited the search to articles on en.wikipedia.org. Wikipedia is a very well-known and well-accepted site that contains useful information about a lot of domains. Retrieving data from only one site may seem insufficient, but it is enough in order to set up a working and useful system. In addition, the system can easily be enhanced by adding other sites.
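A minimal sketch of combining expanded keywords with the ”site:” restriction described here (the helper name, and the OR-of-sites form used for multiple pages, are assumptions for illustration, not the paper's code):

```python
def build_site_query(expanded_keywords, sites=("en.wikipedia.org",)):
    """Join Or-Concept expanded keywords and append a site restriction,
    so results come only from white-listed pages."""
    terms = " ".join(expanded_keywords)
    restriction = " OR ".join(f"site:{s}" for s in sites)
    return f"{terms} {restriction}"

q = build_site_query(['( human OR "adult person" )', "memory"])
print(q)  # → ( human OR "adult person" ) memory site:en.wikipedia.org
```

Extending the white list is then just a matter of passing more entries in `sites`.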


2.5 Results Filtering

2.5.1 Retrieving Unstructured Web Pages

The search on the web engine gives a list of URLs as its result. Each of the pages can be downloaded as text, but it is an HTML-formatted web page, so there is some work left to do to receive only the real content. First, we want to get rid of the HTML tags and any other formatting data. This task has to be performed by applications of very different types; for example, every web browser has to use functionality that distinguishes between content that will be displayed and content that will not. It is usually best to use existing, tested and working packages, which can be found in web browsers. The Microsoft Internet Explorer performs this task using a dynamic link library (DLL) called ”mshtml”. This library can easily be used to parse a complex HTML file into plain text that contains only those parts that would be displayed when one visits this web page with the Internet Explorer (or another browser). This does the main part of this step.

Many web pages don’t only contain the ”real” information, but also a lot of other things, such as navigation links, menus, advertisements, and so on. These are displayed by the browser and thus seen by the user. If the user of the information retrieval system chooses to display one of these sites, he may want to see this content. But due to the fact that the texts from the results are summarized in the next step, we have to remove them from the text, because they don’t contain any information that is relevant to the information contained in the page itself. The format of these menus etc. depends on the web page where the result was found: the results on Wikipedia are formatted in a different way from the results on other lexicographical web pages. Thus, it is useful to handle the results depending on the web page they were found on.

As the web search in this approach has been limited to only one site, namely en.wikipedia.org, we have to consider only one page format in order to remove ”waste” like links and menus. An analysis of the structure revealed an easy way of parsing the Wikipedia articles. All articles on Wikipedia start with the article itself (beside some HTML tags and a lot of scripts), and the menus and navigation links follow after it. That means that we can simply work on the first part of the article and skip everything that follows. The boundary between article and links is clearly defined by the HTML tag ’<div class="printfooter">’, which is contained in every article on Wikipedia. So we can simply search the results from Wikipedia for the tag mentioned above and remove everything that follows it. It should be mentioned that this parsing has to be done before the mshtml parsing step, because otherwise one would not be able to find the HTML tag used as the boundary.
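The truncation at the printfooter tag can be sketched as follows (the function name is hypothetical and the sample page is a toy string; the real input is a full Wikipedia HTML page):

```python
def strip_wikipedia_footer(html: str) -> str:
    """Cut a Wikipedia article's HTML at the printfooter div, dropping
    the navigation links and menus that follow the article body."""
    marker = '<div class="printfooter">'
    pos = html.find(marker)
    return html if pos == -1 else html[:pos]

page = '<p>Article text.</p><div class="printfooter">...</div><ul>menus</ul>'
print(strip_wikipedia_footer(page))  # → <p>Article text.</p>
```

Running this before the HTML-to-text conversion preserves the marker tag that the cut depends on, matching the ordering constraint noted above.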

2.5.2 Summarizing the Web Content

A document contains important, but also a lot of irrelevant information. Additionally, the screen space to present the information is intentionally limited to allow a quick evaluation by the user. Because of this, the documents are summarized in order to present only the important information to the user. For this we use the Open Text Summarizer [1]. This tool doesn’t abstract the document in a natural way, because it does not rephrase the text; it just produces a condensed version of the original by keeping only the important sentences. However, it shows good results for non-fictional text and can be used with unformatted and HTML-formatted text. It has received favourable mention in several academic publications and is at least as good as commercial tools such as Copernic and Subject Search Summarizer.

The Open Text Summarizer parses the text and utilizes Porter stemming; an XML file with the parsing and stemming rules is used. It calculates the term frequency for each word and stores this in a list. After this, stop-word filtering is performed: all redundant common words are removed from the list using a stop-word dictionary, which is also stored in the XML file. After sorting the list by frequency, the keywords are determined, which are the most frequently occurring words. The original sentences are scored based on these keywords: a sentence that holds many important words (the keywords) is given a high grade. The result is a text with only the highest-scored sentences. The limiting factor can be set to either a percentage of the original text or a number of sentences.
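The steps described above (frequency counting, stop-word filtering, sentence scoring, extraction) can be sketched as a toy summarizer; this version omits stemming and uses a tiny illustrative stop-word list instead of the XML-configured dictionary:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "it"}  # toy list

def summarize(text: str, keep: int = 1) -> str:
    """Frequency-based extraction in the spirit of the Open Text
    Summarizer: count term frequencies, ignore stop words, score each
    sentence by the frequencies of its words, and keep the top-scoring
    sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    freq = Counter(words)
    scored = sorted(sentences, key=lambda s: -sum(
        freq[w] for w in re.findall(r"[a-z]+", s.lower())))
    chosen = set(scored[:keep])
    return " ".join(s for s in sentences if s in chosen)

text = "Cats hunt mice. Cats sleep. Mice hide from cats."
print(summarize(text, keep=1))  # → Mice hide from cats.
```

The `keep` parameter plays the role of the limiting factor: a sentence count here, where the real tool also accepts a percentage of the original text.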

2.6 Visualization

Visualization is probably one of the most important parts of the system. It consists of two main parts:

1. The ”Presenter-View”, the view that the presenter sees on his computer during the presentation. The presenter should be able to see the dataflow, manipulate it and give simple commands such as ”show this image” or ”show that article”. The presenter also needs a preview of what is shown to the public.

2. The ”Public-View”, the simplified view that shows only the needed information and mirrors the lecturer’s interactions with the presenter view’s pre-rendering zone.

Figure 2: Presenter’s view layout structure

For the Graphical User Interface (GUI), several crucial specifications were defined, most importantly:

• rendering and screen mirroring should be done in real time;

• the layout of the GUI should be as simple as possible, presenting the retrieved data in the simplest and fastest way for the presenter;

• it should also be interactive: the images and articles should be manipulable in real time.

Figure 3: Real-time parallelization workflow

In order to achieve a smooth experience, several things must be done at the same time. As figure 3 shows, the two main components, rendering and information extraction, were completely separated and run in several threads. Moreover, the speech recognition part and the Google search were also separated, and all searches are done asynchronously. Therefore, the system leverages modern multi-core architectures in order to achieve a significant speedup. In order to find the simplest way of presenting the data, several layout prototypes have been tested. Finally, the one shown in figure 2 and screenshots 4 and 6 proved to be a good way of presenting the information. The system supports rendering to several screens, and the actual mirroring and the development of the user interfaces have been achieved using the Microsoft Windows Presentation Foundation (WPF) technology. WPF is especially designed for rich user interface development and is built on top of the .Net Framework. It uses a markup language known as XAML to provide a clear separation between the code and the design definition, which also greatly helped in the parallelization process. It also features 3D and video/audio rendering capabilities and animation storyboards (similar to Flash). Therefore our system can be extended in order to perform


Figure 4: Screenshot of the GUI; the rendering view is split automatically between several images and an article

video search or do 3D-enhanced presentations. The rendering engine takes advantage of modern hardware (dedicated GPUs ...). One of the features of

Figure 5: Screenshot of the GUI, the public view

our system is the dynamic layout management of the render view. As shown in figures 4 and 6, the rendering zone automatically adjusts itself depending on the quantity of content presented. With such capabilities, the system ensures use of the whole screen without leaving much empty space.
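The separation of rendering from asynchronous searching can be illustrated with a small thread-pool sketch; the searcher functions below are hypothetical stand-ins (the original system uses .Net threads and WPF, not Python):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for the system's search modules; the real
# system runs Google web and image searches on worker threads while
# the render loop stays responsive.
def web_search(query):
    return f"results for {query}"

def image_search(query):
    return f"images for {query}"

with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit both searches asynchronously ...
    futures = [pool.submit(web_search, "human memory"),
               pool.submit(image_search, "human memory")]
    # ... a render thread would poll these and draw whatever is ready;
    # here we simply collect the completed results.
    results = [f.result() for f in futures]

print(results)
```

The key design point mirrored here is that the rendering side never blocks on a search: it only consumes results as they complete.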

Figure 6: Screenshot of the GUI; the rendering view shows only images since no articles have been selected by the lecturer

3 Tests and Results

3.1 Speech Recognition

In the context of this project, it is important that the speech recognition engine has a high quality of recognition and works fast. For this reason these two aspects have been evaluated.

After a two-hour training phase, the engine recognized about 70% of the spoken words correctly. The recognition rate can be improved by further training. To achieve a good recognition rate it is necessary to speak loudly and articulately and to pronounce the words always in the same way. Additionally, punctuation marks are not recognized automatically but have to be spoken explicitly. This is, however, not a natural way to speak at a presentation.

After speaking some sentences, a break of a few seconds has to be made to initiate the recognition process for these sentences. The recognition process also needs a few seconds, depending on the number of words. The tests showed a recognition rate of about four words per second. Again, these frequent breaks of a few seconds are not a natural way to speak at a presentation.

3.2 Keyword Extraction

The keyword extraction module has been tested with different features and numbers of words. The computing time and the quality of the extracted keywords have been evaluated.

The results are shown in figures 7 and 8. The tests have been made for the complete feature set (explained in chapter 2.3) and with the stemmer disregarded. Due to the limitations of the speech recognition engine (no punctuation marks if not pronounced explicitly), the keyword extraction has also been evaluated for texts without punctuation marks. It shows that the quality of the keyword extraction without the stemmer is generally lower than with the complete feature set. Especially if the keywords will be used in the query for the internet search, the use of the stemmer shows advantages: it avoids, for example, searching for both the singular and the plural of the same keyword, and the quality of the relevancy of the keywords is improved. It also shows that the gain in computation time by disregarding the stemmer is minimal: with 1200 words the gain is only 120 ms. For the expected relatively short sentences usually provided by the speech recognition engine, the gain is even less.

The tests also showed that for text without punctuation marks the quality is nearly the same as for the same text with punctuation marks.
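The stemming effect described above can be illustrated with a minimal sketch. This is not the system's implementation (which used the Porter stemmer [2] and a richer feature set); the stop word list, the crude plural-stripping stemmer and the pure frequency ranking below are simplifying assumptions:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "are",
              "for", "it", "this", "that", "with", "as", "on", "be", "by", "over"}

def crude_stem(word):
    # Crude plural stripping; a stand-in for the real Porter stemmer.
    for suffix in ("ies", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def extract_keywords(text, n=10, use_stemmer=True):
    # Rank content words by frequency, optionally collapsing plurals first.
    tokens = re.findall(r"[a-z]+", text.lower())
    tokens = [t for t in tokens if t not in STOP_WORDS]
    if use_stemmer:
        tokens = [crude_stem(t) for t in tokens]
    return [w for w, _ in Counter(tokens).most_common(n)]

sentence = "The keyword changes; keywords change over time."
print(extract_keywords(sentence, n=2))                    # plural and singular merge
print(extract_keywords(sentence, n=2, use_stemmer=False)) # counted separately
```

With the stemmer, "keyword"/"keywords" and "change"/"changes" each collapse into one keyword, which is exactly what avoids duplicate singular/plural query terms.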

3.3 Query Construction and Expansion

As figure 9 shows, the Or-Concept query expansion algorithm significantly improves the precision. In that figure, a 10-word query is compared to an expanded 10-word query, and at N = 5 one can see an improvement of almost a factor of 3. Cyc-enhanced queries give better results, but should be evaluated further (more about the different experiments and query enhancements can be found in Appendix A).
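Precision at N, the measure used throughout these figures, is simply the fraction of the top N returned results that are relevant. A one-line sketch (the result and relevance lists are made-up examples):

```python
def precision_at_n(results, relevant, n):
    # Fraction of the first n results that appear in the relevant set.
    return sum(1 for r in results[:n] if r in relevant) / n

results = ["Evolution", "Species", "Cats", "Natural selection"]
relevant = {"Evolution", "Species", "Natural selection"}
print(precision_at_n(results, relevant, 4))  # → 0.75
```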

Figure 7: Time results for an extraction of ten keywords (with different numbers of words)

Figure 8: Quality of results for an extraction of ten keywords (with different numbers of words)

3.4 Results Filtering

The summarizer module has been tested with and without the stemmer for different summarizing percentages of the original unformatted text. The quality of the results represents their usability, which has been evaluated from the meaningfulness of the summary and its length. A shorter length was considered better usability because it allows a quicker evaluation by the user.

Figure 9: Precision at N, 10-word query with no keyword stemming

Figure 10 shows that, in these tests with the stemmer, the best-quality results were achieved for a 15% summary. Without the stemmer the quality was usually a little lower. The calculation time with and without the stemmer for different summarizer percentages is shown in figure 11; the difference between the configurations is minimal.
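The percentage-based extractive summarization evaluated here can be sketched as follows. The frequency-based sentence scoring is a simplifying assumption standing in for the Open Text Summarizer [1] the system actually used:

```python
import re
from collections import Counter

def summarize(text, keep=0.15):
    # Keep roughly `keep` of the sentences, scored by average word frequency.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z]+", text.lower()))

    def score(s):
        words = re.findall(r"[a-z]+", s.lower())
        return sum(freq[w] for w in words) / max(len(words), 1)

    n_keep = max(1, round(len(sentences) * keep))
    best = set(sorted(sentences, key=score, reverse=True)[:n_keep])
    # Emit the kept sentences in their original order.
    return " ".join(s for s in sentences if s in best)
```

With the default keep=0.15 a seven-sentence text is reduced to its single most theme-dense sentence, mirroring the 15% setting that scored best in the tests.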

When comparing unformatted text and HTML-formatted text (with a similar amount of content), the tests show that the HTML-formatted text can take significantly longer to process.
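Clearing the HTML tags before summarization can be done with a stream parser. A minimal sketch using Python's standard library (the system's own filtering implementation is not shown in the paper, so this is only illustrative):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style content."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html):
    # Parse, then collapse runs of whitespace left behind by removed tags.
    p = TextExtractor()
    p.feed(html)
    return " ".join(" ".join(p.parts).split())

print(strip_html("<p>Hello <b>world</b>!</p><script>var x;</script>"))
# → Hello world !
```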

4 Discussion

As explained in the preceding chapters, the system works in its main parts. It is clearly possible to enrich web search as well as the presentation of contents with different tools related to text mining, information retrieval, and others. However, many problems turned up that had either to be solved or to be worked around.

Figure 10: Quality of results for different summary percentages for a text with 1200 content words

Figure 11: Calculation times for different summary percentages for a text with 1200 content words


The speech recognition module did work, but still had some problems, which lie in the implementation of the software used. The speech recognition itself works fine, but the chosen engine has some problems, e.g. in recognizing accents, breaks and the like. These often result in an incorrectly recognized word or sentence. So either another, better engine or another solution has to be found in order to meet the requirements.

The keyword extraction from the spoken text works well and is also very efficient. Still, this step can be enhanced. For example, the user may want to influence the chosen keywords more directly than he does by speaking. It would therefore be a useful feature if the user could decide manually which keywords are good and important and which keywords are not relevant. The keywords provided by the system could then be viewed as suggestions that the user can accept or decline. Another nice feature, related to the first one mentioned, is a self-learning algorithm that tries to predict the user's decision.

Another part that works fine but can still be improved is the summarization of the found texts. It produces good summaries of the found web pages without consuming too much computation time. But the result could perhaps be enhanced by using features such as coreference resolution, leading to a better estimation of which referents are important and which are not. Before putting a big effort into this, however, it should be checked whether it is worth the effort.

5 Conclusions

In this paper the design and implementation of an Interactive Web Retrieval and Visualization System have been discussed. The system has proven robust and satisfies the real-time constraints of a live presentation. Tests and results have shown that both keyword extraction and query expansion return satisfying results while keeping good precision. During the experiments it has also been shown that several parts of such a system still need improvement: speech recognition needs to be enhanced and properly pre-trained, unstructured web data should be properly cleared of all noise, and summarization can be enhanced to create more concise articles.

Overall, such systems can be built with modern hardware and parallelization, and hopefully more of them will be seen as commercial products in the near future.


References

[1] Open Text Summarizer. http://libots.sourceforge.net.

[2] Porter stemmer. http://tartarus.org/~martin/PorterStemmer/.

[3] SharpNLP - open source natural language processing tools. http://www.codeplex.com/sharpnlp.

[4] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 60. Cambridge University Press, 2006.

[5] The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data, page 318. Cambridge University Press, 2006.

[6] Cyc Foundation. http://www.cycfoundation.org/.

[7] W.N. Francis and H. Kucera. Brown corpus manual.

[8] M. Harrington. Giving computers a voice. http://blogs.msdn.com/coding4fun/archive/2006/10/31/909044.aspx.

[9] J. Moskowitz. Speech recognition with Windows XP. http://www.microsoft.com/windowsxp/using/setup/expert/moskowitz 02september23.mspx.

[10] J. Moskowitz. Windows speech recognition. http://www.microsoft.com/windows/windows-vista/features/speech-recognition.aspx.

[11] T. Ngoc Dao and T. Simpson. Measuring similarity between sentences. CodeProject article.

[12] OpenCyc Project. http://www.cyc.com/opencyc.

[13] J. C. Scholtes. 5 Text mining: Preprocessing techniques, part 1.

[14] Princeton University. WordNet, a lexical database for the English language. http://wordnet.princeton.edu/.

A Appendix: Query Expansion using Cyc

A.1 WordNet Similarity Measurement

In order to be able to compare two words or sentences, a semantic similarity measurement was needed. For this the WordNet similarity measurement was used. The following steps are performed to compute the semantic similarity between two sentences [11]:

• each sentence is partitioned into a list of tokens and the stop words are removed;

• words are stemmed;

• part of speech tagging is performed;

• the most appropriate sense for every word in a sentence is found (word sense disambiguation). To find the most appropriate sense of a word, the original Lesk algorithm was used, expanded with the hypernym, hyponym, meronym and troponym relations from WordNet. The possible senses are scored with a new scoring mechanism based on Zipf's law, and the sense with the highest score is chosen;

• the similarity of the sentences is computed based on the similarity of the pairs of words. In order to do this, a semantic similarity relative matrix is created, consisting of the pairwise similarities between the most appropriate senses of the words. The Hungarian method is used to get the semantic similarity between the sentences; its match results are combined into a single similarity value for the two sentences. The matching average is used to compute the semantic similarity between two word senses: the sum of the similarity values of all match candidates of both sentences is divided by the total number of tokens.
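The matching-average step can be sketched as follows. The `word_sim` function is injectable (a stand-in for the WordNet-based word-sense similarity), and the exhaustive assignment below stands in for the Hungarian method; it is only feasible for the short sentences this pipeline deals with:

```python
from itertools import permutations

def best_assignment(sim):
    # Exhaustive one-to-one matching over a small similarity matrix:
    # returns the maximum total similarity (Hungarian-method stand-in).
    rows, cols = len(sim), len(sim[0])
    k = min(rows, cols)
    best = 0.0
    for perm in permutations(range(cols), k):
        best = max(best, sum(sim[i][perm[i]] for i in range(k)))
    return best

def sentence_similarity(words_a, words_b, word_sim):
    # Matching average: twice the matched similarity mass divided by
    # the total number of tokens in both sentences.
    if len(words_a) > len(words_b):          # keep rows <= cols
        words_a, words_b = words_b, words_a
    sim = [[word_sim(a, b) for b in words_b] for a in words_a]
    return 2.0 * best_assignment(sim) / (len(words_a) + len(words_b))
```

With an exact-match `word_sim`, identical token lists score 1.0 and disjoint ones 0.0; a WordNet-based measure fills in the values in between.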

A.2 CycFoundation.org REST API

As mentioned in section 2.4.2, the system uses REST APIs in order to access the CycFoundation.org Cyc KB. REST stands for Representational State Transfer, a way to build a service-oriented architecture based on HTTP and XML, generally using the GET or POST methods of the HTTP protocol. The CycFoundation web services expose only a subset of Cyc's capabilities, therefore the API implemented in our system is rather small. It contains the following queries:

• GetConstants(Keyword) - performs a search for Cyc concepts for a keyword

• GetComment(Concept) - returns a comment for a particular concept

• GetCanonicalPrettyString(Concept) - returns a simplified name for a concept

• GetDenotation(Concept) - returns a denotation for a particular concept

• GetGenerals(Concept) - returns a set of general concepts for a particular concept

• GetSpecifics(Concept) - returns a set of specific concepts for a particular concept

• GetInstances(Concept) - returns a set of instances (concepts) for a particular concept

• GetIsA(Concept) - returns a set of 'Is A' concepts for a particular concept

• GetAliases(Concept) - returns a set of aliases (words or phrases) for a particular concept

• GetSiblings(Concept) - returns a set of sibling concepts for a particular concept
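A thin client for such calls might look as follows. The base URL, parameter name and JSON response format are assumptions for illustration only, not the documented CycFoundation.org interface; the fetch function is injectable so the sketch can be exercised offline:

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

class CycClient:
    """Tiny REST wrapper; endpoint details are illustrative guesses."""

    def __init__(self, base_url="http://www.cycfoundation.org/webservices", fetch=None):
        self.base_url = base_url
        # `fetch` can be injected for testing; defaults to a real HTTP GET.
        self.fetch = fetch or (lambda url: urlopen(url).read().decode("utf-8"))

    def _get(self, method, arg):
        url = "%s/%s?arg=%s" % (self.base_url, method, quote(arg))
        return json.loads(self.fetch(url))

    def get_constants(self, keyword):
        return self._get("GetConstants", keyword)

    def get_comment(self, concept):
        return self._get("GetComment", concept)

# Offline usage with a stubbed transport:
client = CycClient(fetch=lambda url: '["HumanAdult", "AdultContent"]')
print(client.get_constants("adult"))
```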

A.3 Cyc Concepts Matching

The general algorithm described in section 2.4.2 was actually derived from two algorithms: display name matching and comment matching. These algorithms have been tested separately.

A.3.1 DisplayName Matching Algorithm

The display name used in Cyc is the shortest concept description. For example, the display name for the "HumanAdult" concept is simply "a human adult". The matching algorithm therefore computes the distance between the keyword context and this description, as follows:

Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC) / 2
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end

This algorithm has proven quite efficient, since less computation needs to be done, but on the other hand it tends to give quite poor results and should be enhanced for proper use.
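The pseudocode above translates directly to, e.g., Python. Here difflib's string similarity is only a stand-in for the WordNet-based semantic distance the system actually used:

```python
from difflib import SequenceMatcher

def get_distance(a, b):
    # Stand-in string distance in [0, 1]; the real system used a
    # WordNet-based semantic distance instead.
    return 1.0 - SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_display_name(keyword, context, constants):
    # `constants` is a list of (name, display_name) pairs, e.g. the
    # result of a GetConstants call.
    best, best_distance = None, float("inf")
    for name, display in constants:
        d_k = get_distance(keyword, display)
        d_c = get_distance(context, display)
        distance = (d_k + d_c) / 2
        if distance < best_distance:
            best, best_distance = name, distance
    return best

constants = [("HumanAdult", "a human adult"), ("AdultContent", "adult-only content")]
print(match_display_name("adult", "every human adult participates", constants))
```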

A.3.2 Comment Matching Algorithm

The comment matching algorithm takes into account additional knowledge about the Cyc concept, called the comment. This is a long description which tends to be as specific as possible. For the "HumanAdult" concept, for example, the comment is:

"A specialization of Person, and an instance of HumanTypeByLifeStageType. Each instance of this collection is a person old enough to participate as an independent, mature member of society. In most modern Western contexts it is assumed that anyone over 18 is an adult. However, in many cultures, adulthood occurs when one reaches puberty. Adulthood is contiguousAfter (q.v.) childhood. Notable specializations of this collection include AdultMaleHuman, AdultFemaleHuman, MiddleAgedHuman and OldHuman."

The pseudocode of this algorithm, where the GetComment method is an actual REST API call, is as follows:


Input: Keyword, its context (sentence)
Output: Cyc concept matched (BestConstant)

Constants ⇐ GetConstants(Keyword)
BestConstant ⇐ new CycConstant()
BestDistance ⇐ ∞
foreach Constant in Constants do
    dCK ⇐ 0
    Comment ⇐ GetComment(Constant)
    Keywords ⇐ GetKeywords(Comment)
    foreach CK in Keywords do
        dCK ⇐ dCK + GetDistance(Keyword, CK)
    end
    dK ⇐ GetDistance(Keyword, Constant.DisplayName)
    dC ⇐ GetDistance(Context, Constant.DisplayName)
    Distance ⇐ (dK + dC + (dCK / Keywords.Count)) / 3
    if Distance < BestDistance then
        BestDistance ⇐ Distance
        BestConstant ⇐ Constant
    end
end

A.3.3 Experiments with both algorithms

After running several experiments (figure 13) comparing the comment matching and display name algorithms, we found that comment matching gives significantly better results, but it can still be enhanced to obtain even better matching. This may be done by using general/specific concepts in the distance calculation formula or by learning weights.

A.4 Query Expansion Patterns

In our research, several different query expansion algorithms have been implemented, all with a quite straightforward implementation as explained in section 2.4.3. The algorithms, which can be described by the following formulas, where K is the set of keywords and k ∈ K, have been implemented and evaluated:

Or-Concept Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, c(k_i))   (2)

where c(k) is a concept linked with the keyword.

Or-Aliases Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, ∑_{j=1}^{m} a_j(k_i))   (3)

where a(k) is an alias linked with the keyword's concept.

Or-Most-Relevant-Alias Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, a(k_i))   (4)

where a(k) is the most relevant alias linked with the keyword's concept.

Or-Most-Relevant-General Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, g(k_i))   (5)

where g(k) is the most relevant general concept linked with the keyword's concept.

Or-Most-Relevant-Specific Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, s(k_i))   (6)

where s(k) is the most relevant specific concept linked with the keyword's concept.

Or-Is-A-Concept Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, isa(k_i))   (7)

where isa(k) is the most relevant 'Is A ...' concept linked with the keyword's concept.

Or-Most-Relevant-AGS Expansion:

Q(K, n) = ∑_{i=1}^{n} max(k_i, a(k_i), g(k_i), s(k_i))   (8)

where:
a(k) is the most relevant alias linked with the keyword's concept,
g(k) is the most relevant general concept linked with the keyword's concept,
s(k) is the most relevant specific concept linked with the keyword's concept.
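As a concrete illustration, the Or-Concept pattern (equation 2) amounts to a simple string rewrite over the keyword list: each keyword is OR-ed with the pretty name of its matched concept, if any. The keyword-to-concept dictionary below is a made-up stand-in for the Cyc matching step:

```python
def or_concept_query(keywords, concepts):
    # Build a Google-style query: each keyword is OR-ed with the
    # pretty name of its matched Cyc concept (if any).
    parts = []
    for k in keywords:
        c = concepts.get(k)
        parts.append("( %s OR %s )" % (k, c) if c else k)
    return " + ".join(parts)

concepts = {"evolution": "biological evolution", "species": "species immunity"}
print(or_concept_query(["species", "theory", "evolution"], concepts))
# → ( species OR species immunity ) + theory + ( evolution OR biological evolution )
```

The output matches the shape of the Or-Concepts Wikipedia query shown in section A.5.1.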

A.5 Results and Discussion

A.5.1 Google Search on Wikipedia KB

Our system has been tested and is able to generate queries for the Google search engine; however, it is very difficult to evaluate such results, therefore this section of the paper only gives an example. Using as input an introductory text of the Stanford Encyclopedia for the topic of 'evolution', the system derived several keywords:

• ”species”, matched to ”speciesImmunity”

• ”theory”, matched to ”theoryOfBeliefSystem”

• ”evolution”, matched to ”Evolution”

• ”change”, matched to ”changesSlot”

• ”term”, matched to ”termExternalIDString”

Next, several queries have been constructed and restricted to the Wikipedia KB:

• Google Non-Expanded Query: "species theory evolution change term site:en.wikipedia.org"

• Google Or-Concepts Query: "( species OR species immunity ) + ( theory OR Theory Of Belief System ) + ( evolution OR biological evolution ) + ( change OR Changes Slot ) + ( term OR Term External ID String ) + site:en.wikipedia.org"

• Google Or-Aliases Query: "species + theory + ( evolution OR (biologically will have evolved OR biologically had evolved OR biologically will evolve OR biologically has evolved OR biologically have evolved OR biologically evolving OR biologically evolves OR biologically evolved OR biologically evolve OR most evolutionary OR more evolutionary OR evolutionary OR evolution) ) + change + term + site:en.wikipedia.org"

• Google Or-Most-Relevant-General Query: "species + theory + ( evolution OR "development" ) + change + term + site:en.wikipedia.org"

• ...

Figure 12: Precision at N, results of Or-Concept expanded query with and without stemming compared

After analyzing the results, we compared several of the results returned by Google. For the normal, non-expanded query the results are quite satisfying:

* Evolution
* Punctuated equilibrium
* Evolution as theory and fact
* Macroevolution
* History of evolutionary thought
* On the Origin of Species
* Species
* Hopeful Monster

Figure 13: Precision at N, results of Or-Concept expanded query using different matching

However, the source article mentioned Charles Darwin, and his name was not among the derived keywords, so no direct Wikipedia link to his theory of natural selection was found in the non-expanded results. On deeper analysis, it was actually found in the results of the expanded query (Or-Most-Relevant-General):

* Evolution
* Punctuated equilibrium
* Macroevolution
* Evolution as theory and fact
* Charles Darwin
* On the Origin of Species
* Hopeful Monster
* Natural selection

A.5.2 Lemur Search in AP Corpus

Figure 14: Precision at N, expansions of a 5-word query compared (part 1)

In order to perform an evaluation of the query expansion methods, structured query generation for the Lemur search engine has also been implemented. The queries have been automatically derived from the narratives of topics 101 to 115 of the AP Corpus, and a batch evaluation of those queries has been performed. Several different sets of results have been produced by the IREval function of Lemur, among them:

• 5-10 word query with no keyword stemming;

• 5-10 word query with keyword stemming;

• 5-10 word query using name matching.

Most of the experiments have been performed with the comment matching algorithm and the whole set of query expansion patterns. Some experiments have been performed in order to compare the display name and comment matching algorithms. The keyword stemming part was especially tested in order to suppress plurals, since the keywords derived from the AP Corpus topics usually kept them. For example, "changes" was stemmed to "change" and therefore gives a larger set of possible concepts to match. During our experiments (figure 12) we found that using keyword stemming in these algorithms significantly decreases the precision for larger queries. One possible explanation is that the matching algorithms make more errors, since the set of possible concepts is larger.

Figure 15: Precision at N, expansions of a 5-word query compared (part 2)

Figure 16: Precision at N, expansions of a 10-word query compared (part 1)

In figures 14 and 15 one can see that only the Or-Concept algorithm gives better precision at N = 5 and beyond. The Or-Most-Relevant-Alias and Or-Most-Relevant-General patterns give a precision improvement only at N > 20.

Figure 17: Precision at N, expansions of a 10-word query compared (part 2)

Finally, it is in figures 16 and 17 that one can really see the improvement made by stemming and especially by Or-Concept query expansion. While the 5-word query expansion (at N = 5) with this particular algorithm gives an improvement of almost 50 percent, with a 10-word query the improvement is almost of the order of 200 percent (a significant increase in precision).

The results speak for themselves; we think that Cyc-enabled query expansion is definitely the way to go, but better patterns still need to be built. In our experiments, out of 7 different patterns only one actually yields a good precision increase: the Or-Concept pattern.
