Attempting to Use Wikipedia Categories to Improve Retrieval INEX Linked Data Ad Hoc Track 2012 David Massey ABI Kveik, 1st March 2013

Attempting to Use Wikipedia Categories to Improve Retrieval

INEX Linked Data Ad Hoc Track 2012

David Massey

ABI Kveik, 1st March 2013

Task Description

• 3.2M documents from English language Wikipedia

• 140 queries

• Return a ranked list of 1000 documents for each query

• Use Linked Data

Document Collection

• Approx. 30% of the files describe deleted files, images, etc.

• XML-like documents - Regex

• Missing documents

• Each document consists of three parts:

1. Wikipedia article
2. DBpedia properties
3. YAGO properties

<lodxml xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xml:lang='en' xmlns:xhtml='http://www.w3.org/1999/xhtml' encoding='UTF-8'><article title='73425'><wikipedia><paragraph><template type='Metadata'><arg></arg><tag name='id'>73425</tag><tag name='title'>The_Deer_Hunter</tag></template>

<template type='Other uses'><arg></arg><arg>Deer Hunter (disambiguation)</arg></template>

<infobox type='film'><tag name='name'>The Deer Hunter</tag>...

<tag name='director'><link><wikilink href='./f4/05/522346.xml'>Michael Cimino</wikilink><dbpedia href='http://dbpedia.org/resource/Michael_Cimino'></dbpedia><yago ref='Michael_Cimino'></yago></link></tag>

<dbpediaproperties><property name='http://dbpedia.org/ontology/thumbnail'><object name='http://upload.wikimedia.org/wikipedia/commons/thumb/5/57/The_Deer_Hunter_poster.jpg/200px-The_Deer_Hunter_poster.jpg'></object></property>

...

<yagoproperties><property name='hasDuration'><object name='10920.0#s'></object></property><property name='isCalled'><object name='A szarvasvad\u00e1sz'></object></property>

The_Deer_Hunter
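Because the files are only XML-like, a regular expression is one way to pull simple fields out of them. The following is a minimal sketch, assuming an abbreviated copy of the sample above; the pattern and the `extract_tags` helper are illustrative and not taken from the talk.

```python
import re

# Abbreviated copy of the sample document shown above.
SAMPLE = (
    "<article title='73425'><wikipedia><paragraph>"
    "<template type='Metadata'><arg></arg>"
    "<tag name='id'>73425</tag>"
    "<tag name='title'>The_Deer_Hunter</tag></template>"
    "<infobox type='film'><tag name='name'>The Deer Hunter</tag>"
)

# Matches <tag name='NAME'>VALUE</tag> where VALUE contains no nested markup.
TAG_PATTERN = re.compile(r"<tag name='([^']+)'>([^<]*)</tag>")


def extract_tags(document: str) -> dict[str, str]:
    """Return the first plain-text value found for each <tag name='...'>."""
    tags: dict[str, str] = {}
    for name, value in TAG_PATTERN.findall(document):
        tags.setdefault(name, value)
    return tags


print(extract_tags(SAMPLE))
# {'id': '73425', 'title': 'The_Deer_Hunter', 'name': 'The Deer Hunter'}
```

Tags whose value contains nested markup, such as the `director` tag in the sample, are skipped by this pattern and would need separate handling.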

Queries

guitar chord tuning
guitar chord minor
guitar classical flamenco
guitar classical bach
guitar origin Russia
guitar origin blues
tango culture movies
tango culture countries
tango music composers
tango music instruments
tango dance styles
tango dance history

vietnam war movie
vietnam war facts
vietnam food recipes
vietnamese food blog
vietnam travel national park
vietnam travel airports
bicycle sport races
bicycle sport disciplines
bicycle holiday nature
bicycle benefits health
bicycle benefits environment
female rock singers
south korean girl groups
electronic music genres
digital music notation formats
music conferences
intellectual property rights lobby

Two stage approach

1. Traditional retrieval
2. Improve by:
– Using links between documents
– Using categories
– Using Linked Data

Stage One

• Extract title, headings and categories from documents

• Index using Indri – Krovetz stemming, stopword list

• Weighted search – Title (10), Category (5), H2 (2), H3 (1)

• Smoothing (ask Michael)
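To illustrate what the field weights do, here is a toy scorer that counts query terms per field and multiplies by the weights from the slide. It is only a sketch: Indri ranks with a smoothed language model and the index used Krovetz stemming, neither of which is reproduced here, and the sample document is invented for illustration.

```python
# Field weights taken from the slide; the scoring itself is a toy stand-in
# for Indri's language-model ranking.
FIELD_WEIGHTS = {"title": 10, "category": 5, "h2": 2, "h3": 1}


def weighted_score(query: str, fields: dict[str, str]) -> int:
    """Sum, over fields, weight * number of query terms present in the field."""
    terms = query.lower().split()
    score = 0
    for field, weight in FIELD_WEIGHTS.items():
        tokens = set(fields.get(field, "").lower().split())
        score += weight * sum(term in tokens for term in terms)
    return score


# Hypothetical document, invented for illustration.
doc = {
    "title": "Vietnam War in film",
    "category": "Vietnam War films",
    "h2": "Documentaries",
}
print(weighted_score("vietnam war movie", doc))  # 2*10 + 2*5 = 30
```

Without stemming or synonyms, "movie" matches neither "film" nor "films", which is the gap the second stage tries to close.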

Result After Stage One

Query: Vietnam War Movie

Vietnam_War_Crimes_Working_Group
Vietnam_War_in_film
Operation_Sunrise_(Vietnam_War)
Vietnam_War_Story_II
Book:Vietnam_War
Vietnam_during_World_War_I
Vietnam_War_casualties
Vietnam_War_Crimes_Working_Group_Files
Puerto_Ricans_Missing_in_Action_in_the_Vietnam_War
Star_Wars_Mini_Movie_Awards
Vietnam_War_Memorial,_Hanoi
17th_Parallel:_Vietnam_in_War
Vietnam:_The_Camera_At_War
March_Against_the_Vietnam_War
1960_in_the_Vietnam_War
1961_in_the_Vietnam_War
List_of_Vietnam_War_flying_aces
List_of_wars_involving_Vietnam
Outline_of_the_Vietnam_War
The_War_Within:_America's_Battle_over_Vietnam
Vietnam:_The_Ten_Thousand_Day_War
Protests_against_the_Vietnam_War
Matterhorn:_A_Novel_of_the_Vietnam_War
List_of_bombs_in_the_Vietnam_War
Puerto_Ricans_in_the_Vietnam_War
Military_history_of_Australia_during_the_Vietnam_War

Stage Two

• Links between documents

• Categories

• Linked Data

• …

Expand Query with WordNet Synonyms

Original query: vietnam war movie

vietnam -> annam
war -> warfare
movie -> film flick pic picture

Expanded query: vietnam annam war warfare film flick movie pic picture
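The expansion step can be sketched as follows. To keep the example self-contained, the synonyms are hard-coded from the slide rather than looked up in WordNet; the `SYNONYMS` table and `expand_query` helper are assumptions for illustration only.

```python
# Synonyms copied from the slide; a stand-in for a real WordNet lookup.
SYNONYMS = {
    "vietnam": ["annam"],
    "war": ["warfare"],
    "movie": ["film", "flick", "pic", "picture"],
}


def expand_query(query: str) -> list[str]:
    """Return each query term followed by its synonyms, without duplicates."""
    expanded: list[str] = []
    for term in query.lower().split():
        for candidate in [term, *SYNONYMS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


print(" ".join(expand_query("vietnam war movie")))
# vietnam annam war warfare movie film flick pic picture
```

The term order differs slightly from the slide, but the set of terms is the same.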

Calculate Text Similarity between Expanded Query and Category Name

Levenshtein distance: "The smallest number of insertions, deletions, and substitutions required to change one string or tree into another." NIST (http://xlinux.nist.gov/dads/HTML/Levenshtein.html)

Original query: Vietnam War Movie
Expanded query: Vietnam Annam War Warfare Film Flick Movie Pic Picture

         vietnam  annam  war  warfare  film  flick  movie  pic  picture
vietnam        0      4    6        7     5      7      6    6        5
war            6      4    0        4     4      5      5    3        6
films          6      5    5        7     1      4      5    4        6

Category: Vietnam War Films = (1 + (0*0 + 0*0 + 1*1)) / 3 = 2 / 3 = 0.66
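The worked example above can be reproduced with a short sketch. The scoring formula here, one plus the sum of squared minimum distances, divided by the number of words in the category name, is inferred from the two worked examples on the slides (Vietnam War Films and Star Wars); the slides do not spell it out, so treat it as an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def category_score(category: str, expanded_query: list[str]) -> float:
    """Lower is better: (1 + sum of squared minimum distances) / word count."""
    words = category.lower().split()
    total = sum(min(levenshtein(w, t) for t in expanded_query) ** 2
                for w in words)
    return (1 + total) / len(words)


EXPANDED = "vietnam annam war warfare film flick movie pic picture".split()
print(category_score("Vietnam War Films", EXPANDED))  # 2/3, shown as 0.66
print(category_score("Star Wars", EXPANDED))          # 3.0
```

Squaring the per-word distance penalizes one badly matching word more than several near matches.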

Categories ranked by similarity to expanded query

Query: Vietnam War Movie

Score  Category
0.66   Vietnam War films
1      War films
2.33   Star Wars films
2.75   Star Wars fan films
3      Fan films
3      Punic Wars
3      Star Wars
3.66   Gulf War films
3.75   World War I films
5.5    Barbary Wars
5.5    Boer Wars
5.5    Civil wars
5.5    Guild Wars
5.5    Opium Wars
5.66   Vietnam War books
5.66   Vietnam War novels
5.66   Vietnam War sites
5.75   World War II media
6.33   Flags of Vietnam
6.33   Laws of war
6.33   Media of Vietnam
6.33   MTV Movie Awards
6.8    Women in World War I
7      Floods in Vietnam
7      Songs of the Vietnam War
7.33   Star Wars books
7.33   Star Wars comics
7.5    War crimes in Vietnam
7.5    World War I games
7.5    World War II comics

Threshold < 1
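Selecting categories under the threshold might look like the sketch below. The scores are a few values copied from the ranked list; the `select_categories` helper is hypothetical.

```python
# A few (score, category) pairs copied from the ranked list above.
SCORED = [
    (3.0, "Star Wars"),
    (0.66, "Vietnam War films"),
    (1.0, "War films"),
    (2.33, "Star Wars films"),
]

THRESHOLD = 1.0  # "Threshold < 1" on the slide


def select_categories(scored, threshold=THRESHOLD):
    """Rank by score (lower = closer to the query) and keep those below the threshold."""
    ranked = sorted(scored)
    return [name for score, name in ranked if score < threshold]


print(select_categories(SCORED))  # ['Vietnam War films']
```

With a strict threshold of 1, "War films" (score exactly 1) is excluded and only "Vietnam War films" survives for this query.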

Problems

Expanded Query: Vietnam Annam War Warfare Film Flick Movie Pic Picture

         vietnam  annam  war  warfare  film  flick  movie  pic  picture
star           5      4    2        5     4      5      5    4        5
wars           7      5    1        4     4      5      5    4        6

Category: Star Wars = (1 + (2*2 + 1*1)) / 2 = 6 / 2 = 3

Frosker = Forsker? (Norwegian: "frogs" vs. "researcher")

Homonyms – Vietnam War Picture

Missing categories

Result After Stage Two

Query: Vietnam War Movie

We_Were_Soldiers
A_Better_Tomorrow_3
The_War_(film)
Faith_of_My_Fathers_(film)
Combat_Shock
The_Killing_Fields_(film)
A_Bright_Shining_Lie
Apocalypse_Now_Redux
Flight_of_the_Intruder
Dead_Presidents
R-Point
The_Last_Hunter
There_Is_No_13
Deceit_(2009_film)
The_Ballad_of_Andy_Crocker
Some_Kind_of_Hero
The_Deer_Hunter
A_Rumor_of_War_(miniseries)
Platoon_(film)
The_Crazy_World_of_Julius_Vrooder
Thou_Shalt_Not_Kill..._Except
1969_(film)
A.W.O.L._(2006_film)
The_Siege_of_Firebase_Gloria
Alamo_Bay
Rolling_Thunder_(film)

Original Query                      Result Stage 1  Result Stage 2
vietnam war movie                   11-1------      11--------
vietnam war facts                   11--11111-      11--11111-
vietnam food recipes                ------1---      ------1---
vietnamese food blog                ----------      ----------
vietnam travel national park        1111111111      1111111111
vietnam travel airports             -1-111----      -1-11-1---
guitar chord tuning                 111111111-      111111111-
guitar chord minor                  11111--11-      11111--11-
guitar classical flamenco           ----1-----      ----1-----
guitar classical bach               1-11--11--      1-11--11--
guitar origin Russia                ----------      ----------
guitar origin blues                 1-1-------      1-1-------
tango culture movies                ---1--1---      ---1--1---
tango culture countries             ---1-1---1      ---1-1---1
tango music composers               -1--------      ---1-1---1
tango music instruments             ----------      ----------
tango dance styles                  11--------      ----------
tango dance history                 111-------      111-------
bicycle sport races                 111-1--1--      ---1111-1-
bicycle sport disciplines           ----1-----      ----1-----
bicycle holiday nature              ----------      ----------
bicycle benefits health             ------1---      ------1---
bicycle benefits environment        ---------1      ---------1
female rock singers                 1-------1-      1-1--1--1-
south korean girl groups            ----------      111111-111
electronic music genres             1-1-1-----      -1-1------
digital music notation formats      -111-1111-      -111-1111-
music conferences                   11111-1---      --11111-1-
intellectual property rights lobby  111-1-1111      111-1-1111
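One way to compare the two stages is to turn each result string into a single number. The sketch below assumes, which the slide does not state, that each string covers the top ten results in rank order and that `1` marks a relevant document; under that reading the fraction of `1` characters is precision at 10.

```python
def precision_at_k(marks: str) -> float:
    """Fraction of positions marked '1' (assumed to mean 'relevant')."""
    return marks.count("1") / len(marks)


# Three rows copied from the table above: (stage one, stage two).
ROWS = {
    "vietnam war movie": ("11-1------", "11--------"),
    "south korean girl groups": ("----------", "111111-111"),
    "tango dance styles": ("11--------", "----------"),
}

for query, (stage_one, stage_two) in ROWS.items():
    print(query, precision_at_k(stage_one), precision_at_k(stage_two))
# vietnam war movie 0.3 0.2
# south korean girl groups 0.0 0.9
# tango dance styles 0.2 0.0
```

Under this assumption the second stage helps some queries considerably (south korean girl groups) and hurts others (tango dance styles).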

Precision and Recall

[Plot: precision (P) and recall (R), both axes from 0 to 1, with three curves: Stage One, Stage Two, Best]

Literature

Kaptein, R., Koolen, M., & Kamps, J. (2009, July). Using Wikipedia categories for ad hoc search. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp. 824-825). ACM.

Vercoustre, A. M., Pehcevski, J., & Thom, J. (2008). Using wikipedia categories and links in entity ranking. Focused Access to XML Documents, 321-335.

Illustration

http://www.flickr.com/photos/pasukaru76/6196321318/