
On the Effectiveness and Optimization of Information Retrieval for Cross Media Content

Pierfrancesco Bellini, Daniele Cenni, Paolo Nesi

DISIT Lab, Department of Systems and Informatics, Faculty of Engineering, University of Florence, Florence, Italy

{pbellini, cenni, nesi}@dsi.unifi.it, http://www.disit.dsi.unifi.it

Abstract

In recent years, the growth of Social Network communities has posed new challenges for content providers and distributors. Digital contents and their rich multilingual metadata sets need improved solutions for efficient content management. There is an increasing need for services that scale digital delivery, searching and indexing. Despite the huge amount of content, users want to easily find relevant unstructured documents in large repositories, on the basis of multilingual queries, with a limited waiting time. At the same time, digital archives have to remain fully accessible, without significant downtime, even while a major restructuring is in progress. Evaluating the effectiveness of retrieval systems plays a fundamental role in the process of system assessment and optimization. This paper presents an indexing and searching solution for cross media content, developed for a Social Network in the domain of Performing Arts. The research aims to cope with the complexity of a heterogeneous indexing semantic model, with tuning techniques for the discrimination of relevant metadata terms. An effectiveness and optimization analysis of the retrieval solution is presented with relevant metrics. The research was conducted in the context of the ECLAP project (http://www.eclap.eu).

Keywords: cross media content; indexing; searching; search engines; information retrieval; cultural heritage; stochastic optimization; test collections; social networks

1. Introduction

In recent years, the dramatic growth of digital items on the Internet has opened new scenarios and raised several issues for content providers and distributors. Large digital repositories, with heterogeneous formats and metadata, face several issues, especially in the context of content retrieval and management. The cross media resources that are populating the Web need efficient indexing and searching solutions, along with tools for metadata management, mapping and extraction, and content translation and certification services. Users want fast and relevant results in response to their queries, robust to typos or inflections. Moreover, searching services are expected to offer faceting and sorting capabilities, for easy and intuitive search refinement. Rich text documents require smart processing tools for realtime full-text parsing, extraction and indexing, capable of supporting different types of content descriptors. Also, large content repositories are expected to be fully accessible, without significant downtime, in case of service maintenance or major updates of their indexing structure (e.g., redefinition of the index schema, index corruption, etc.).

Social Networks, while shortening the distances between users, are making possible a rapid integration of multilingual resources, with different metadata sets and encodings that are difficult to deal with. Topics such as query or metadata translation are becoming even more relevant in the context of Information Retrieval (IR), with their pros and cons. Also, the evaluation of retrieval effectiveness plays a determining role when assessing a system; hence it is crucial to perform a detailed IR analysis, especially for huge multilingual archives. Ranking a retrieval system involves human assessors, and can help to identify weaknesses and issues that prevent a satisfactory and compelling search experience. A comparative evaluation of IR systems usually follows the Cranfield model; in this context, the effectiveness of a retrieval strategy is calculated as a function of document rankings, and each metric is obtained by averaging over the queries.

Typically, the effectiveness evaluation starts by collecting information needs from a set of topics; from these needs, a set of queries is derived, and then a list of relevance judgments that map queries to the corresponding relevant documents. Since people often disagree about a document's relevance, collecting relevance judgments is a


difficult task; in many cases, with an acceptable approximation, relevance is assumed to be a binary variable, even if it is actually defined over a range of values [10]. To overcome these limitations, some approaches start the retrieval evaluation without relevance judgments, making use of pseudo relevance judgments [9, 1, 11]. Ranking strategies are often assessed by comparing rank correlation coefficients (e.g., Spearman [2], Kendall's Tau) with the official TREC rankings. The effectiveness of IR can be assessed by computing relevant estimation metrics such as precision, recall, mean average precision, R-precision, F-measure and normalized discounted cumulative gain (NDCG). IR test collections and evaluation series are used for comparative studies of retrieval effectiveness; examples include TREC, GOV2, NTCIR and CLEF.

This paper describes a Social Network infrastructure, developed in the scope of the ECLAP project, and presents the effectiveness evaluation and optimization of an indexing and searching solution in the field of Performing Arts. The research was conducted to overcome several issues in the context of cross media content indexing for the ECLAP social service portal. The proposed solution is robust with respect to typos, runtime exceptions and index schema updates; the above Information Retrieval metrics are calculated on the basis of relevant Performing Arts topics. The solution deals with the different metadata sets and content types of the ECLAP information model, thus enhancing the user experience with full text multilingual search, advanced metadata and fuzzy search facilities, faceted query refinement, and content browsing and sorting solutions. The ECLAP portal includes a huge number of contents such as: MPEG-21 items, web pages, forums, comments, blog posts, images, rich text documents (doc, pdf), collections, and playlists.

The paper is structured as follows: Section 2 gives an overview of ECLAP; Section 3 introduces the Searching and Indexing Tools implemented in the portal; Section 4 discusses IR evaluation and effectiveness, and describes a test strategy developed for a fine tuning of index fields; Section 5 reports conclusions and future work.

2. ECLAP Overview

The ECLAP project aims to create an online digital archive in the field of the European Performing Arts; the archive is indexed and searchable through the Europeana portal, using the so-called EDM (Europeana Data Model). ECLAP's main goals include: making available on Europeana a large amount of Performing Arts digital cross media content (e.g., performances, lessons, master classes, video lessons, audio, documents, images, etc.); and bringing together European Performing Arts institutions, in order to provide their metadata and content for Europeana, thus resulting in a Best Practice Network of European Performing Arts institutions.

ECLAP provides solutions and services for Performing Arts institutions and final users (teachers, students, actors, researchers, etc.). ECLAP is developing technologies and tools to provide continuous access to digital content, and to increase the number of online collected materials. ECLAP acts as a support tool for content aggregators and for working groups on Best Practice reports and articles, intellectual property and business models, and digital libraries and archives. ECLAP services and facilities include user groups, discussion forums, mailing lists, integration with other Social Networks, and suggestions and recommendations to users. Content distribution is available toward several channels: PC/Mac, iPad and mobiles. ECLAP includes smart back office solutions for automated ingestion and refactoring of metadata and content; multilingual indexing and querying; content and metadata enrichment; IPR modeling and assignment tools; content aggregation and annotations; and e-learning support.

3. Searching and Indexing Tools

The ECLAP content model deals with different types of digital contents and metadata; at the core of the content model there is a metadata mapping schema, used to index resources in the same index instance. Resource metadata share the same set of indexing fields, with a separate set for advanced search purposes. The indexing schema has a flexible and updatable hierarchy that describes the whole set of heterogeneous contents. The metadata schema is organized into four main categories (see Table 2): Dublin Core (e.g., title, creator, subject, description), Dublin Core Terms (e.g., alternative, conformsTo, created, extent), Technical (e.g., type of content, ProviderID, ProviderName, ProviderContentID) and Performing Arts (e.g., FirstPerformance Place, PerformingArtsGroup, Cast, Professional), plus ECLAP Distribution and Thematic Groups, and taxonomical content-related terms. Multilingual support and management is available at the metadata and resource content levels (e.g., multilingual web pages, and taxonomy terms for which a metadata language may be mandatory, optional or not necessary). Performing Arts metadata are mapped to Dublin Core metadata before submission to Europeana. The multilingual index covers the ECLAP partners' languages (Catalan, Greek, English, Spanish, French, Hungarian, Italian, Dutch, Portuguese, Slovenian, Polish, German, Danish). Multilingual metadata are automatically translated from any source language, and then mapped into their index fields. This approach avoids query translation issues.


Table 1: ECLAP Indexing Model

| Media Types                                          | DC (ML) | Technical | Performing Arts | Full Text | Tax, Group (ML) | Comments, Tags (ML) | Votes |
|------------------------------------------------------|---------|-----------|-----------------|-----------|-----------------|---------------------|-------|
| # of Index Fields*                                   | 468     | 10        | 23              | 13        | 26              | 13                  | 1     |
| Cross Media: html, MPEG-21, animations, etc.         | Yn      | Y         | Y               | Y         | Yn              | Ym                  | Yn    |
| Info text: blog, web pages, events, forum, comments  | T       | N         | N               | N         | N               | Ym                  | N     |
| Document: pdf, doc, ePub                             | Yn      | Y         | Y               | Y         | Yn              | Ym                  | Y     |
| Audio, video, image                                  | Yn      | Y         | Y               | N         | Yn              | Ym                  | Yn    |
| Aggregations: play lists, collections, courses, etc. | Yn      | Y         | Y               | Y/N       | Yn              | Ym                  | Yn    |

* = (# of fields per metadata type) x (# of languages); ML: Multilingual; DC: Dublin Core; Tax: Taxonomy

Metadata editing tools are at the user's disposal in the ECLAP portal, for metadata assessment, correction and certification.

Notation used in Table 1: Yn means yes, with n possible languages (i.e., n metadata sets); Y, only one metadata set; Y/N, metadata set not complete; T, only the title of the metadata set; Ym, m different comments can be provided, each in a specific language. Comments may be nested, thus producing a hierarchically organized forum discussion.

The ECLAP Index Model meets the metadata requirements of any digital content, while the indexing service follows a metadata ingestion schema. Twenty different partners are providing their digital contents, each with custom metadata partially fulfilling the standard DC schema. A single multilanguage index has been developed for faster access, easy management and optimization.

The index schema includes a set of catchall fields, automatically populated at indexing time for some relevant metadata; it provides a more compact way to build the boolean queries to be sent to the searching service. In this model, the catchall field body includes the full text of each content element. Relevant metadata are only indexed and not stored in the index, so as to keep the index smaller, for fast access and management; technical metadata are not analyzed and hence are stored verbatim (e.g., video resolution, quality, format); descriptive metadata are fully processed (e.g., using token analyzers, lowercasing, etc.). Metadata and text parsed from resources (i.e., doc, docx, ppt, pptx, xls, xlsx, pdf, html, txt) are extracted with content identification techniques (e.g., file extension, content type and encoding, MIME detection). Multilingual taxonomies are stored in a MySQL database with full references to each resource; each content-related taxonomy path is indexed together with the other content metadata. Multilingual metadata can be enriched and edited on the portal, and then validated and certified before publishing on Europeana. Correction of multiple metadata sets at the same time is also available.
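As an illustration of the indexing approach just described, the following minimal Python sketch shows how per-language metadata could be mapped into language-suffixed index fields and copied into the catchall body field at indexing time. It is only a sketch under stated assumptions: the field naming convention (e.g., title_en) and the build_index_document helper are hypothetical, not taken from the ECLAP codebase.

```python
# Minimal sketch (hypothetical field naming, not the ECLAP implementation):
# map translated metadata into language-suffixed index fields and build
# the catchall "body" field used for generic full text queries.

CATCHALL_SOURCES = {"title", "description", "subject", "contributor"}

def build_index_document(resource_id, translations):
    """translations: {language: {metadata_field: value}}."""
    doc = {"id": resource_id}
    body = []
    for lang, metadata in translations.items():
        for field, value in metadata.items():
            doc[f"{field}_{lang}"] = value      # e.g. title_en, title_it
            if field in CATCHALL_SOURCES:
                body.append(value)              # populated at indexing time
    doc["body"] = " ".join(body)                # catchall full-text field
    return doc

doc = build_index_document("eclap:42", {
    "en": {"title": "Hamlet", "subject": "tragedy"},
    "it": {"title": "Amleto", "subject": "tragedia"},
})
print(doc["body"])  # "Hamlet tragedy Amleto tragedia"
```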

The index structure may be regenerated automatically from scratch, for example in case of corruption or major schema updates. The production service keeps running while a separate instance of the indexing service processes the index structure, together with accessory tasks (e.g., taxonomy/group extraction related to each digital resource). This strategy avoids significant service downtime, and new contents are synchronized with the newly built index after indexing completion. A detailed logging and notification system keeps track of every thrown exception and emails it to the administrator.
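The rebuild strategy can be sketched as follows; this is only a schematic reading of the description above, assuming a shadow index directory that is swapped in on completion (build_full_index is a placeholder for the long-running reindexing job, not an ECLAP function).

```python
import os
import shutil

def rebuild_index(build_full_index, live_dir="index", shadow_dir="index.rebuild"):
    """Regenerate the index from scratch in a shadow directory while the
    production service keeps serving queries from live_dir."""
    if os.path.exists(shadow_dir):
        shutil.rmtree(shadow_dir)
    build_full_index(shadow_dir)   # full reindex, incl. taxonomy/group extraction
    old_dir = live_dir + ".old"
    os.rename(live_dir, old_dir)   # swap: downtime limited to two renames
    os.rename(shadow_dir, live_dir)
    shutil.rmtree(old_dir)
    # contents ingested while rebuilding would then be synchronized
    # against the newly built index (not shown)
```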

The searching service aims to help users to easily find and sort digital contents in the ECLAP portal, and to refine queries and results. Main features include:

• Relevance is calculated by taking into account different weights for each indexed metadata field, and it is correlated with term frequency in documents.

• Queries are analyzed, tokenized and lowercased, to build a query expression that is a combination of weighted boolean clauses; in the case of advanced search, the user defines an arbitrary number of clauses.

• Search results are paginated, and presented with relevant metadata by descending relevance.

• Automatic suggestion can be enabled from the portal settings, and provides realtime query suggestions relevant to the typed query.

A fine tuning of term boosting, giving more relevance to certain fields with respect to others, is a major requirement for the system, in order to achieve optimal IR performance.

4. Effectiveness and Optimization

The ECLAP Metadata Schema, summarized in Table 2, consists of 541 metadata fields, divided into 8 categories; some important multilingual metadata (i.e., text, title, body, description, contributor, subject, taxonomy, and Performing Arts metadata) are mapped into a set of 8 catchall fields for searching purposes.


Figure 1: MAP vs test runs. (Plot omitted; MAP, from 0.2 to 0.8, on the vertical axis; run number, up to 8000, on the horizontal axis.)

The scoring system implements a Lucene combination of the Boolean Model and the Vector Space Model, with boosting of terms applied at query time. Documents matching a clause get their score multiplied by a weight factor. A boolean clause b, in the weighted search model, can be defined as

$$b := (\mathrm{title}:q)^{w_1} \lor (\mathrm{body}:q)^{w_2} \lor (\mathrm{description}:q)^{w_3} \lor (\mathrm{subject}:q)^{w_4} \lor (\mathrm{taxonomy}:q)^{w_5} \lor (\mathrm{contributor}:q)^{w_6} \lor (\mathrm{text}:q)^{w_7}$$

where $w_1, w_2, \ldots, w_7$ are the boosting weights of the query fields: title (DC resource name); body (parsed html resource content); description (DC account of the resource content, e.g., abstract, table of contents, reference); subject (DC topic of the resource content, e.g., keywords, key phrases, classification codes); taxonomy (hierarchy term associated with the content); contributor (contributions to the resource content, e.g., persons, organizations, services); text (full text parsed from the resource, e.g., doc, pdf, etc.); q is the query; DC: Dublin Core.
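In Lucene query syntax such a weighted clause is expressed with the ^ boost operator; the sketch below assembles the disjunction b for a query string q. The weighted_boolean_query helper is illustrative (not an ECLAP function); its default weights are the optimal estimates reported later in this section.

```python
FIELD_WEIGHTS = {  # optimal weights found by the minimization test (Section 4)
    "title": 68.4739, "body": 31.7873, "description": 0.2459,
    "subject": 9.8633, "taxonomy": 13.2306, "contributor": 2.1720,
    "text": 3.9720,
}

def weighted_boolean_query(q, weights=FIELD_WEIGHTS):
    """Build a Lucene-syntax disjunction of boosted field clauses,
    e.g. title:(hamlet)^68.4739 OR body:(hamlet)^31.7873 OR ..."""
    return " OR ".join(f"{field}:({q})^{w}" for field, w in weights.items())

print(weighted_boolean_query("hamlet"))
```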

The effectiveness of the retrieval system was evaluated with the aid of the trec_eval tool. For this purpose, a set of 50 topics was initially collected in the field of Performing Arts. The set of relevant topics was built starting from a list of popular queries, obtained through a query log analysis. The chosen number of topics is above the threshold generally considered suitable for obtaining reliable results, as pointed out by Zobel [12] and by Sanderson and Zobel [8].

For each topic a query was formulated, together with a set of relevance judgments. The judgments were collected using a pooling strategy, which helps retrieve the relevant items by judging only a limited subset of the whole collection. The method is reliable with a pool depth of 100; limiting the pool depth to 10 may change absolute precision results, but does not affect the relative performance of IR systems [12]. Moreover, a precise analysis of IR performance is possible even with a relatively short list of relevance judgments [3].
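For reference, the Average Precision underlying the MAP figures used in the rest of this section can be computed from a ranked result list and binary relevance judgments as in the following sketch (the judgments shown are illustrative, not drawn from the ECLAP pool):

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of precision@k over the ranks k of the relevant documents,
    normalized by the total number of relevant documents for the topic."""
    hits, precision_sum = 0, 0.0
    for k, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant_ids) if relevant_ids else 0.0

def mean_average_precision(runs):
    """runs: one (ranked_ids, relevant_ids) pair per topic."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# two illustrative topics: MAP = (AP1 + AP2) / 2
print(mean_average_precision([
    (["d3", "d1", "d7"], {"d1", "d7"}),   # AP = (1/2 + 2/3) / 2
    (["d2", "d9"], {"d2"}),               # AP = 1
]))
```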

Figure 2: Precision-Recall graph. (Plot omitted; precision, from 0.70 to 0.90, on the vertical axis; recall, from 0.0 to 1.0, on the horizontal axis.)

Table 2: ECLAP Metadata Schema

| Metadata Type     | # fields | Multilingual | Index fields | # fields/item |
|-------------------|----------|--------------|--------------|---------------|
| Performing Arts   | 23       | N            | 23           | n             |
| Dublin Core       | 15       | Y            | 182          | n             |
| Dublin Core Terms | 22       | Y            | 286          | n             |
| Technical         | 10       | N            | 10           | 10            |
| Full Text         | 1        | Y            | 13           | 1             |
| Thematic Groups   | 1        | Y            | 13           | 20            |
| Taxonomy Terms    | 1        | Y            | 13           | 231           |
| Pages Comments    | 1        | N            | 1            | n             |
| Total             | 74       | -            | 541          | -             |

Table 3: Estimated IR Metrics for the optimal run

| Metric                                            | Value  |
|---------------------------------------------------|--------|
| # of queries                                      | 50     |
| # of docs retrieved for topic                     | 4312   |
| # of relevant docs for topic                      | 85     |
| # of relevant docs retrieved for topic            | 84     |
| MAP                                               | 0.8223 |
| Geometric MAP                                     | 0.7216 |
| Precision after retrieving R docs                 | 0.7658 |
| Main binary preference measure                    | 0.9886 |
| Reciprocal Rank of the 1st relevant retrieved doc | 0.8728 |


Full text searches on the ECLAP portal are performed through 7 relevant index fields (i.e., title, body, subject, description, text, taxonomy and contributor). In order to find optimal estimates of each index field weight, a minimization test was designed and implemented. Due to the high number of variables, the test implemented a simulated annealing strategy [6], a stochastic minimization method introduced after the work of Metropolis et al. [7]. In order to minimize the energy function $f(\vec{w})$, one can either randomly select and increase/decrease a single variable at each iteration, or perturb all of the variables. The first option is likely to require a longer simulation time; considering also the relatively limited excursion range of each random variable, the second approach was preferred.

Different annealing schedules, initial state conditions and numbers of allowed transitions per temperature were tested. Simulations took place by defining the system state as a vector of field weights $\vec{w}_i = (w_1, w_2, \ldots, w_7)$. A run of 50 queries was performed for each state condition, to get the corresponding search results and relevant metrics. For each run, the Mean Average Precision ($MAP$) was computed, and $(1 - MAP)$ was assumed as the energy of the current state. Since $MAP$ is defined as the arithmetic mean of the average precision over the information needs, it can be thought of as an approximation of the area under the Precision-Recall curve. Following the Metropolis criterion, the probability $p_t$ of a state transition is defined by

$$p_t = \begin{cases} 1 & \text{if } E_{i+1} < E_i \\ e^{-\Delta E/T} & \text{otherwise} \end{cases}$$

where $E_{i+1}$ and $E_i$ are the energies of states $\vec{w}_{i+1}$ and $\vec{w}_i$ respectively, $T$ is the synthetic temperature, and $\Delta E = E_{i+1} - E_i$ is the energy difference; in practice, an uphill transition is accepted when a random number $r$, drawn uniformly in the interval $[0,1]$, satisfies $r < e^{-\Delta E/T}$. The annealing schedule was defined as $T(i+1) = \alpha T(i)$, with $\alpha = 0.8$.
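The overall loop can be sketched as follows, assuming an energy function that runs the 50-query batch with the candidate weights and returns $1 - MAP$ (passed here as a placeholder argument); this is a minimal reading of the procedure described above, not the actual ECLAP test harness.

```python
import math
import random

def anneal(energy, w0, t0=1.0, alpha=0.8, moves_per_temp=200, t_min=1e-3):
    """Minimize energy(w), e.g. 1 - MAP(w): perturb all weights at each move
    and accept uphill moves with the Metropolis probability exp(-dE/T)."""
    w, e = list(w0), energy(w0)
    temp = t0
    while temp > t_min:
        for _ in range(moves_per_temp):
            # perturb every variable within a limited excursion range
            cand = [max(0.0, wi + random.uniform(-1.0, 1.0)) for wi in w]
            d_e = energy(cand) - e
            if d_e < 0 or random.random() < math.exp(-d_e / temp):
                w, e = cand, e + d_e        # accept the transition
        temp *= alpha                       # geometric schedule T(i+1) = alpha*T(i)
    return w, e
```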

200 random transitions were attempted for each temperature iteration. A smoother annealing schedule is more likely to exhibit convergence, but generally requires a longer simulation time; alternative popular choices include logarithmic schedules such as $T(i) = c/\log(1+i)$ [5, 4]. Stopping conditions were defined by counting the number of successful transitions occurring during each iteration. The best simulation schema, showing convergence and system equilibrium, is reported in Fig. 1. Some semantically relevant index fields (i.e., subject, taxonomy and contributor) were found to give a limited contribution to the relevance scoring system; reducing the number of boolean clauses to be processed by the retrieval system would thus result in a higher search speed.

Scatter plots of field weights vs $MAP$, obtained during the test, exhibited a relevant dispersion across a considerable range of high energy values (see for example Fig. 3).

Figure 3: Scatter plots of the Title, Body, Text and Description weights vs MAP. (Plots omitted; panels (a) Title, (b) Body, (c) Text and (d) Description each show MAP, from 0.2 to 0.8, against the corresponding field weight.)


The observed behavior reasonably suggests a high sensitivity to initial conditions and random seeds. The minimization process resulted in an energy minimum at $w_1 = 68.4739$, $w_2 = 31.7873$, $w_3 = 0.2459$, $w_4 = 9.8633$, $w_5 = 13.2306$, $w_6 = 2.1720$, $w_7 = 3.9720$, with $MAP = 0.8223$ (see the Precision-Recall graph in Fig. 2 and the IR metrics in Table 3). The relatively high $MAP$ estimate is a direct consequence of an assessment pool highly polarized toward Performing Arts topics.

5. Conclusions and Future Work

In this paper, an integrated searching and indexing solution for the ECLAP portal has been presented, together with an IR evaluation analysis and assessment. The index scales efficiently to thousands of contents and accesses; the ECLAP solution aims to enhance the user experience by speeding up and simplifying the information retrieval process. Further analysis, simulation and tuning of the index field weights are being conducted, with different optimization approaches. A user behavior study is in progress, in order to understand user preferences and satisfaction.

6. Acknowledgments

The authors want to thank all the partners involved in ECLAP, and the European Commission for funding the project. ECLAP has been funded in the Theme CIP-ICT-PSP.2009.2.2, Grant Agreement No. 250481.

References

[1] Javed A. Aslam and Robert Savell. On the effectiveness of evaluating retrieval systems in the absence of relevance judgments. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '03, pages 361-362, New York, NY, USA, 2003. ACM.

[2] Jamie Callan, Margaret Connell, and Aiqun Du. Automatic discovery of language models for text databases. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, SIGMOD '99, pages 479-490, New York, NY, USA, 1999. ACM.

[3] Ben Carterette, James Allan, and Ramesh Sitaraman. Minimal test collections for retrieval evaluation. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '06, pages 268-275, New York, NY, USA, 2006. ACM.

[4] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721-741, November 1984.

[5] Bruce Hajek. Cooling schedules for optimal annealing. Mathematics of Operations Research, 13(2):311-329, May 1988.

[6] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated annealing. Science, 220:671-680, 1983.

[7] Nicholas Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and Edward Teller. Equation of state calculations by fast computing machines. The Journal of Chemical Physics, 21(6):1087-1092, 1953.

[8] Mark Sanderson and Justin Zobel. Information retrieval system evaluation: effort, sensitivity, and reliability. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '05, pages 162-169, New York, NY, USA, 2005. ACM.

[9] Ian Soboroff, Charles Nicholas, and Patrick Cahan. Ranking retrieval systems without relevance judgments. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 66-73, New York, NY, USA, 2001. ACM.

[10] Amanda Spink and Howard Greisdorf. Regions and levels: Measuring and mapping users' relevance judgments. Journal of the American Society for Information Science and Technology, 52(2):161-173, 2001.

[11] Shengli Wu and Fabio Crestani. Methods for ranking information retrieval systems without relevance judgments. In Proceedings of the 2003 ACM Symposium on Applied Computing, SAC '03, pages 811-816, New York, NY, USA, 2003. ACM.

[12] Justin Zobel. How reliable are the results of large-scale information retrieval experiments? In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '98, pages 307-314, New York, NY, USA, 1998. ACM.