
Seeking information: IR + AI approaches ECAI 2002 Lyon


Seeking Information: Methods from Information Retrieval and Artificial Intelligence

Alison Cawsey
Heriot-Watt University, Edinburgh, Scotland

Mounia Lalmas and Thomas Roelleke
Queen Mary University of London, London, England

http://www.dcs.qmul.ac.uk/~mounia/ECAI2002.html

Tutorial Outline

1. Introduction
2. Information Retrieval Approaches
3. Artificial Intelligence Approaches
4. Conclusions and Future
5. Demos


Introduction: Information need

Example of information need in the context of the world wide web:

“Find information on sailing charters that: (1) can be skippered from the Greek Islands, and (2) are registered with the RYA. To be useful, the information must include boat specification, price per week, and e-mail and phone number for contact purposes.”

⇒ Information Retrieval (IR)
⇒ Artificial Intelligence (AI)

Introduction: Information Seeking

[Figure: users and sources]


Introduction: Three main components

expressing the information need

extracting information from sources

matching

Introduction: Information Seeking

• Libraries and Bibliographic Systems
• World-wide-web
• Digital libraries
• (Knowledge Management)

• Areas: medical, journalism, broadcast, geographical and satellite systems, learning, leisure, …


Seeking for Information: Information Retrieval

Mounia Lalmas and Thomas Roelleke
Department of Computer Science
Queen Mary University of London
London, E1 4NS, England
{mounia,thor}@dcs.qmul.ac.uk

IR Approaches: Outline

1. Introduction
2. Basics
   1. Indexing mechanisms
   2. Retrieval Models
   3. Evaluation
3. Topics
   1. Query reformulation
   2. Web IR
   3. Structured document retrieval
   4. The use of AI in IR


Introduction: Information need

Example of information need in the context of the world wide web:

“Find all documents containing information on computer courses which: (1) are offered by universities in South England, and (2) are accredited by the BCS/IEE bodies. To be relevant, the document must include information on admission requirements.”

⇒ Information Retrieval

Introduction: Information retrieval (IR) system

“Retrieve all the documents which are relevant to a user query, while retrieving as few non-relevant documents as possible.”


Introduction: The (Basic) IR Process

[Figure: the basic IR process. Labels: information need, query, documents, index, match (search / retrieval engine), retrieved documents, query reformulation (+ / −)]

Introduction: Topics in IR

• Query reformulation
• Web IR and link analysis
• Structured document retrieval
• Thesaurus construction
• Parallel and distributed IR
• Text categorisation
• Filtering
• Hypertext and hypermedia
• Metadata and ontologies
• Integration of IR and DB technologies
• Agent-based technology
• Metasearch and data fusion
• Summarisation, abstraction
• Interface and visualisation
• Information-seeking and user modelling
• User studies
• Multimedia IR
• Multilingual IR
• Index structure
• Text compression


Introduction: IR is a multi-disciplinary approach

[Figure: information retrieval at the centre, surrounded by related disciplines: artificial intelligence, human computer interaction, linguistics, vision, machine learning, cognitive science, mathematics, information and library studies]

Information Retrieval: Basics

1. Indexing
2. Retrieval
3. Evaluation


Basics: Indexing

1. What is a document?
2. Representing the content of documents
   1. Conflation
   2. Weighting
3. Inverted files (Index)

Indexing: What is a document?

[Figure: four views of the same document]
content: sailing, greece, mediterranean, fish, sunset
fact: Author = “B. Smith”, Crdate = “14.12.96”, Ladate = “11.07.02”
layout: Sailing in Greece / B. Smith
structure: head (title, author), chapter (section, section)


Indexing: Conflation

[Figure: conflation pipeline. Stages: documents, tokens, stop words, stems, index terms, phrases. Techniques: stop list, suffix stripping, linguistic resources, NLP. Controlled vocabulary: catalogue, thesaurus]

Indexing: Weighting

weight(t,d) = tf(t,d) × idf(t)

d: document
t: term
N: number of documents in collection
n(t): number of documents in which term t occurs
idf(t): inverse document frequency
occ(t,d): occurrence of term t in document d
tmax: term in document d with highest occurrence
tf(t,d): term frequency of term t in document d

• high frequency in a document (tf) leads to a high term weight
• lower document frequency (high idf) leads to a high term weight

$$\mathrm{idf}(t) = \log\frac{N}{n(t)}$$

$$\mathrm{tf}(t,d) = a + (1-a)\,\frac{\mathrm{occ}(t,d)}{\mathrm{occ}(t_{\max},d)}$$
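The weighting scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the tutorial's implementation: the value a = 0.5 and the choice to give absent terms a weight of 0 are assumptions, and the function names are made up for the example.

```python
import math
from collections import Counter


def tf(term, doc_tokens, a=0.5):
    """Term frequency: a + (1 - a) * occ(t,d) / occ(tmax,d).

    Assumption: a term that does not occur in d gets tf = 0
    (the formula alone would give it the constant a).
    """
    counts = Counter(doc_tokens)
    if counts[term] == 0:
        return 0.0
    return a + (1 - a) * counts[term] / max(counts.values())


def idf(term, collection):
    """Inverse document frequency: log(N / n(t))."""
    n_t = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / n_t) if n_t else 0.0


def weight(term, doc_tokens, collection, a=0.5):
    """weight(t,d) = tf(t,d) * idf(t)."""
    return tf(term, doc_tokens, a) * idf(term, collection)
```

A term occurring in every document gets idf = log 1 = 0, so it contributes nothing to the weight regardless of its frequency.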


Indexing: Inverted file

Word-oriented mechanism for indexing document collections to speed up searching

Searching:
• vocabulary search (query terms)
• retrieval of occurrences
• manipulation of occurrences

[Figure: inverted file with columns TERM, IDF, DOC, TF]
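The three search steps can be illustrated with a small in-memory inverted file. This is a sketch under assumptions: the dictionary layout (term mapped to its idf and a postings dictionary of raw counts) and the tf × idf scoring are chosen for illustration and are not taken from the slides.

```python
import math
from collections import Counter, defaultdict


def build_inverted_file(docs):
    """docs: dict doc_id -> list of index terms.

    Returns term -> (idf, postings), where postings maps doc_id -> raw count.
    """
    postings = defaultdict(dict)
    for doc_id, terms in docs.items():
        for term, count in Counter(terms).items():
            postings[term][doc_id] = count
    n_docs = len(docs)
    return {t: (math.log(n_docs / len(p)), p) for t, p in postings.items()}


def search(index, query_terms):
    """Score documents by summing tf * idf over the query terms."""
    scores = defaultdict(float)
    for term in query_terms:
        if term in index:                      # vocabulary search
            idf, plist = index[term]           # retrieval of occurrences
            for doc_id, tf in plist.items():   # manipulation of occurrences
                scores[doc_id] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Only the postings lists of the query terms are touched, which is where the speed-up over scanning every document comes from.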

Basics: Retrieval

1. Boolean model
2. Vector space model
3. Probabilistic model
4. Other models


Retrieval: Boolean Model

Retrieve documents which are “true” for the query

• Query: logical combination of index terms
  Q = (K1 and K2) or (K3 and (not K4))
  “Retrieve all documents indexed by K1 AND K2, OR documents indexed by K3 AND (but) NOT indexed by K4”

• Inverted file for document collection {D1, D2, D3, D4}
  K1-list: D1, D2, D3, D4
  K2-list: D1, D2
  K3-list: D1, D2, D3
  K4-list: D1

• Result: {D1, D2, D3}

• Issue of normalisation ⇒ set-based models (Dice, Jaccard, …)
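The example query can be evaluated directly with set operations on the posting lists. The sketch below uses the lists from the slide; the variable and function names are illustrative only.

```python
# Posting lists copied from the slide's inverted file.
all_docs = {"D1", "D2", "D3", "D4"}
inverted_file = {
    "K1": {"D1", "D2", "D3", "D4"},
    "K2": {"D1", "D2"},
    "K3": {"D1", "D2", "D3"},
    "K4": {"D1"},
}


def docs_for(term):
    """Posting list of a term (empty set for unknown terms)."""
    return inverted_file.get(term, set())


# Q = (K1 and K2) or (K3 and (not K4))
# AND -> intersection, OR -> union, NOT -> complement w.r.t. the collection.
result = (docs_for("K1") & docs_for("K2")) | (
    docs_for("K3") & (all_docs - docs_for("K4"))
)
```

Here K1 AND K2 gives {D1, D2}, K3 AND NOT K4 gives {D2, D3}, and their union is the result stated on the slide.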

Retrieval: Vector Space Model (1)

• Set of terms {t1, t2, … , tn}
• Document vector D = <d1, d2, … , dn>
• Query vector Q = <q1, q2, … , qn>

di = term frequency of term ti in document
qi = query formulated with term ti

Retrieval status value:

$$\frac{\sum_{i=1}^{n} d_i q_i}{\left(\sum_{i=1}^{n} d_i^2\right)^{1/2}\left(\sum_{i=1}^{n} q_i^2\right)^{1/2}} = \cos\theta$$
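As a small illustration, the retrieval status value can be computed as follows. Returning 0 for an all-zero vector is an assumption made here to avoid division by zero; the slides do not cover that case.

```python
import math


def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(di * qi for di, qi in zip(d, q))
    norm = math.sqrt(sum(di * di for di in d)) * math.sqrt(sum(qi * qi for qi in q))
    # Assumption: an empty (all-zero) vector matches nothing.
    return dot / norm if norm else 0.0
```

Because of the normalisation, a document that repeats the query's term proportions exactly scores 1 regardless of its length.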


Retrieval : Vector Space Model (2)

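[Figure: document vector D = <d1, d2> and query vector Q = <q1, q2> drawn against the axes term t1 and term t2, separated by the angle θ; here n = 2 (two index terms)]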

Retrieval: Probabilistic Model

“Given a user query q and a document d, estimate the probability that the user will find d relevant.”

• Binary independence model (BIR)
  - Index terms in relevant and non-relevant documents
  - Assume feedback information
  - BIR without user feedback
  - BIR with within-document frequency
• Use of polynomial functions and logistic regression
• Others


Retrieval: Binary Independence Model (BIR)

• Document described by presence/absence of terms
• D = <d1, d2, …, dn> where n is number of terms.

$$d_i = \begin{cases} 1 & \text{if the document is indexed by } t_i \\ 0 & \text{otherwise} \end{cases}$$

R: relevant; ¬R: not relevant

Compute P(R|D) and P(¬R|D) to decide whether the document represented by D is relevant.

BIR: Bayes’ Decision Rule

if P(R|D) > P(¬R|D) then D is relevant; otherwise D is not relevant.

• Minimises the average probability of error: assigning a relevant document as non-relevant or vice versa (Probability Ranking Principle)
• Need to compute P(R|D) and P(¬R|D)


BIR: Bayes' theorem (1)

• P(D): probability of observing a description D at random, i.e., probability of D irrespective of whether it is relevant or not.
• P(D|R): probability of observing D given that it is relevant.
• P(D|¬R): probability of observing D given that it is not relevant.
• P(R): prior probability of observing a relevant document.
• P(¬R): prior probability of observing a non-relevant document.
• Note: P(D) = P(D|R)P(R) + P(D|¬R)P(¬R).

$$P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)} \qquad P(\neg R \mid D) = \frac{P(D \mid \neg R)\,P(\neg R)}{P(D)}$$

BIR: Bayes’ theorem (2)

if P(D|R)P(R) > P(D|¬R)P(¬R) then D is relevant; otherwise D is not relevant

• From the above decision rule, we derive a retrieval function g(D) using independence assumptions:

P(D|R) = P(d1|R) P(d2|R) … P(dn|R)
P(D|¬R) = P(d1|¬R) P(d2|¬R) … P(dn|¬R)

Presence                   Absence
pi = P(di = 1 | R)         1 - pi = P(di = 0 | R)
qi = P(di = 1 | ¬R)        1 - qi = P(di = 0 | ¬R)

• pi (qi): probability that if the document is relevant (non-relevant) then the ith term ti is present in the document.


BIR: Retrieval function g(D)

$$g(D) = \sum_{i=1}^{n} c_i d_i \qquad \text{where} \qquad c_i = \log\frac{p_i(1-q_i)}{q_i(1-p_i)}$$

• The ci are weights associated with terms ti, e.g. discrimination power.
• Simple addition: add the coefficients ci for those terms ti present in document.
• Rank documents using g(D).
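For reference, a short derivation sketch of how g(D) follows from the decision rule under the independence assumptions. The constant C is introduced here only for illustration and does not appear on the slides.

```latex
\begin{align*}
\log\frac{P(D \mid R)}{P(D \mid \neg R)}
  &= \sum_{i=1}^{n} \log\frac{p_i^{d_i}(1-p_i)^{1-d_i}}{q_i^{d_i}(1-q_i)^{1-d_i}} \\
  &= \sum_{i=1}^{n} d_i \log\frac{p_i(1-q_i)}{q_i(1-p_i)}
     + \sum_{i=1}^{n} \log\frac{1-p_i}{1-q_i} \\
  &= g(D) + C
\end{align*}
```

Since C and the prior ratio P(R)/P(¬R) do not depend on the document, ranking by g(D) gives the same order as ranking by the odds P(R|D)/P(¬R|D).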

BIR: Estimating the ci (1)

For each term ti:

ni: number of documents with term ti
ri: number of relevant documents with term ti
R: number of relevant documents
N: number of documents
  • not total number of documents in system
  • some subset specially chosen to enable ci to be estimated
  • relevance feedback data: number of displayed documents.

            relevant    non-relevant         total
di = 1      ri          ni – ri              ni
di = 0      R – ri      N – ni – R + ri      N – ni
total       R           N – R                N


BIR: Estimating the ci (2)

• pi: probability that a relevant document contains the term ti
• qi: probability that a non-relevant document contains the term ti

$$p_i = \frac{r_i}{R} \qquad q_i = \frac{n_i - r_i}{N - R}$$

• So

$$c_i = \log\frac{r_i/(R-r_i)}{(n_i-r_i)/(N-n_i-R+r_i)}$$

extent to which the ith term can discriminate between the relevant and non-relevant documents.
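The estimate can be turned into a ranking function as sketched below. The smoothing constant k is an assumption added here (a common correction that avoids division by zero and log 0 when a count is 0); with k = 0 the function reduces to the formula above. Function names are illustrative.

```python
import math


def term_weight(r_i, n_i, R, N, k=0.5):
    """c_i from the contingency table.

    k is a smoothing constant (an assumption, not on the slides);
    k = 0 gives the unsmoothed formula.
    """
    relevant_odds = (r_i + k) / (R - r_i + k)
    non_relevant_odds = (n_i - r_i + k) / (N - n_i - R + r_i + k)
    return math.log(relevant_odds / non_relevant_odds)


def g(doc_terms, weights):
    """g(D): sum the weights c_i of the terms present in the document."""
    return sum(c for term, c in weights.items() if term in doc_terms)
```

A term that occurs mostly in relevant documents gets a positive weight, while one spread evenly or concentrated in non-relevant documents gets a weight near zero or negative.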

Retrieval: Other Models

• Set theoretical models
  - fuzzy set model
  - extended Boolean model
• Algebraic models
  - latent semantic indexing model
  - neural network model
• Probabilistic models
  - inference network
  - belief network
  - language model
• IR viewed as a logical inference


Other Models: IR viewed as a logical inference (1)

• document and query: logical formulae d and q
• retrieval: search for documents which imply the query: d → q
• Advantage: from term-based retrieval to knowledge-based retrieval
• d = square, q = rectangle, and thesaurus: square → rectangle: document d relevant to query q

d = {t1, t2, t3}, q = {t1, t3}
logical view: d = t1 ∧ t2 ∧ t3, q = t1 ∧ t3, so d → q

Other Models: IR viewed as a logical inference (2)

• d = quadrangle, q = rectangle: document d may be relevant to query q
• Uncertainty: quadrangle → rectangle with uncertainty 0.3
• Retrieval: estimating the “probability” that the document implies the query: P(d → q)
• Logical Uncertainty Principle: “Given any two sentences x and y; a measure of uncertainty of y → x related to a given data set is determined by the minimal extent to which we have to add information to the data set, to establish that y → x”
  - Use of non-classical logics and theories of uncertainty


Information Retrieval: Evaluation

1. Background
2. System-centred evaluation
3. User-centred evaluation

Evaluation: Background

• What to evaluate?
  - coverage of the text collection: extent to which the system includes relevant material
  - time lag (efficiency): average interval between the time request is made and the time answer is given
  - presentation of the output
  - effort involved by user in obtaining answers to request
  - recall of the system: proportion of relevant documents retrieved
  - precision of the system: proportion of the retrieved documents that are actually relevant


Evaluation: Background

• Originally
  - Batch IR systems
  - Small, textual collections
  - Queries formulated by searchers
• Today
  - Interactive IR systems
  - Large collections of different or mixed media
  - Queries formulated by end-users

Evaluation: System-centred evaluation

• (Comparative) evaluation of technical performance of IR system(s)
• Relevant = “having significant and demonstrable bearing on the matter at hand”
  - Objectivity, Topicality, Binary nature, Independence
• Effectiveness = the ability of the IR system to retrieve relevant documents and suppress non-relevant documents
  - Test collections: document collection, queries, relevance judgements


Effectiveness: Recall / Precision

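[Figure: the document collection with two overlapping sets, Retrieved and Relevant; the overlap is “Retrieved and relevant”]

$$\text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{number of relevant documents}}$$

$$\text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{number of documents retrieved}}$$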
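Both measures follow directly from the two sets in the diagram. A minimal sketch, in which returning 0 for an empty set is an assumption made here:

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: documents returned by the system
    relevant:  documents judged relevant in the test collection
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # retrieved and relevant
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

Retrieving the whole collection always gives recall 1 but usually very low precision, which is why the two are reported together.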

Effectiveness: Recall / Precision

• For each system / system version
  - For each query in the test collection
    - Run query against system to obtain ranking
    - Use ranking and relevance judgements to calculate recall/precision (r/p) pairs at each recall point
    - Interpolate to standard recall points if necessary
  - Average r/p values across all queries

[Figure: recall/precision graph comparing system 1 and system 2; both axes run from 0 to 1, horizontal axis: recall]
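The procedure above can be sketched as follows. Two assumptions are made that the slides leave open: the standard recall points are taken to be 0.0, 0.1, …, 1.0, and the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level.

```python
def rp_points(ranking, relevant):
    """(recall, precision) pairs at each rank where a relevant document appears."""
    relevant = set(relevant)
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points


def interpolate(points, levels=None):
    """Interpolated precision at each standard recall level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]
    # Small tolerance guards against floating-point noise in recall values.
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def average_curves(curves):
    """Average the interpolated precision values across queries, level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]
```

Interpolating first gives every query a value at the same recall levels, which is what makes the per-level average across queries meaningful.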


Effectiveness: Example of test collections

• TREC (Text REtrieval Conference)
  - Started in 1992, run by National Institute of Standards and Technology (NIST)
  - Components
    - Huge document collection (several GB), taken from Wall Street Journal, Financial Times, etc
    - New documents, topics (i.e. requests, including description and narrative fields) and relevance judgements (performed by retired civil servants) each year
  - Tracks
    - Interactive, cross-lingual, Web, spoken document, short query, video, question answering (factoid)

Evaluation: User-centred evaluation

• Evaluation of interface and user interaction
  - Usability, task performance, user satisfaction
• Methodology based on interactive experiments, ethnographic studies
  - No standard user-centred methodology
  - Elements often borrowed from other areas, e.g. human computer interaction, experimental psychology


Information Retrieval: Topics in information retrieval

1. Query reformulation

2. Web information retrieval

3. Structured document retrieval

4. The use of AI in IR

Query reformulation

1. Introduction
2. Relevance feedback
3. Automatic local analysis
4. Automatic global analysis
5. Evaluation
6. Issues


Query reformulation: Introduction

No detailed knowledge of collection and retrieval environment
  - difficult to formulate queries well designed for retrieval
  - Need many formulations of queries for good retrieval

First formulation: naïve attempt to retrieve relevant information

Documents initially retrieved:
  - Examined for relevance information
  - Improved query formulations for retrieving additional relevant documents

Query reformulation:
  - Expanding original query with new terms
  - Reweighting the terms in expanded query

Query reformulation: Three approaches

1. Relevance feedback
   1. Approaches based on feedback from users
      1. Rocchio
      2. Binary Independence Model (BIR)
2. Local analysis (pseudo-relevance feedback)
   1. Approaches based on information derived from set of initially retrieved documents (local set of documents)
3. Global analysis
   1. Approaches based on global information derived from document collection


Relevance feedback

• Cycle
  - User presented with list of retrieved documents
  - User marks those which are relevant
    • In practice: top 10-20 ranked documents are examined
    • Incremental
  - Select important terms from documents assessed relevant by users
  - Enhance importance of these terms in a new query
    1. Query expansion: Add new terms from relevant documents
    2. Term reweighting: Modify term weights based on user relevance judgements
    3. Query expansion + term reweighting

Relevance feedback: Rocchio

For query q

Dr: set of relevant documents among retrieved documents
Dn: set of non-relevant documents among retrieved documents
α, β, γ: tuning constants

• Usually information in relevant documents more important than in non-relevant documents (γ << β)
• Positive relevance feedback (γ = 0)

$$q_{i+1} = \alpha\, q_i + \frac{\beta}{|D_r|}\sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_n|}\sum_{d_j \in D_n} d_j$$
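A minimal sketch of the Rocchio update, with queries and documents represented as dictionaries from term to weight. The default values for β and γ are assumptions chosen for illustration (they respect γ << β); the slides do not prescribe values.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """One Rocchio iteration: returns q_{i+1} given q_i, Dr and Dn.

    query:        dict term -> weight (q_i)
    relevant:     list of dicts, the documents in Dr
    non_relevant: list of dicts, the documents in Dn
    """
    new_query = {term: alpha * w for term, w in query.items()}
    # Add the centroid of Dr scaled by beta, subtract that of Dn scaled by gamma.
    for docs, coeff in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, w in doc.items():
                new_query[term] = new_query.get(term, 0.0) + coeff * w / len(docs)
    return new_query
```

Terms that appear only in the judged documents enter the new query, which gives query expansion and term reweighting in a single step.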


Relevance feedback: Rocchio in practice (SMART)

• α = 1
• Terms
  - Original query
  - Appear in more relevant documents than non-relevant documents
  - Appear in more than half the relevant documents
• Negative weights ignored

$$q_{i+1} = q_i + \frac{\beta}{|D_r|}\sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_n|}\sum_{d_j \in D_n} d_j$$

Relevance feedback: Binary independence model (BIR)

• Probabilistic based (Bayes’ Theorem and Probability Ranking Principle)
• Document D = <d1, …, dn>
  - n terms t1, … , tn in the collection
  - di = 1 if document D indexed by ti, otherwise = 0
    - ci = discrimination power of term ti at retrieving relevant documents and ignoring non-relevant documents
    - Predict relevance
    - Several formulations for ci

$$g(D) = \sum_{i=1}^{n} c_i d_i$$


BIR: formulations for ci

• Independence assumptions
  - I1: distribution of terms in relevant documents is independent and their distribution in all documents is independent
  - I2: distribution of terms in relevant documents is independent and their distribution in irrelevant documents is independent
• Ordering principle
  - O1: probable relevance based on presence of search terms in documents
  - O2: probable relevance based on presence of search terms in documents and their absence from documents

                         Independence Assumption I1    Independence Assumption I2
Ordering Principle O1    F1                            F2
Ordering Principle O2    F3                            F4

BIR: Various combinations

R = number of relevant documents
N = number of documents in collection
ri = number of relevant documents containing ti
ni = number of documents containing ti

$$\text{F1 (I1 + O1):}\quad c_i = \log\frac{r_i/R}{n_i/N}$$

$$\text{F2 (I2 + O1):}\quad c_i = \log\frac{r_i/R}{(n_i-r_i)/(N-R)}$$

$$\text{F3 (I1 + O2):}\quad c_i = \log\frac{r_i/(R-r_i)}{n_i/(N-n_i)}$$

$$\text{F4 (I2 + O2):}\quad c_i = \log\frac{r_i/(R-r_i)}{(n_i-r_i)/(N-n_i-R+r_i)}$$
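For comparison, the four formulations can be written side by side. This is an unsmoothed sketch: it assumes every count in a denominator or inside a logarithm is non-zero, which real feedback data often violates.

```python
import math


def f1(r, n, R, N):
    """I1 + O1: proportion in relevant documents vs. in all documents."""
    return math.log((r / R) / (n / N))


def f2(r, n, R, N):
    """I2 + O1: proportion in relevant vs. in non-relevant documents."""
    return math.log((r / R) / ((n - r) / (N - R)))


def f3(r, n, R, N):
    """I1 + O2: odds in relevant documents vs. odds in all documents."""
    return math.log((r / (R - r)) / (n / (N - n)))


def f4(r, n, R, N):
    """I2 + O2: odds in relevant vs. odds in non-relevant documents."""
    return math.log((r / (R - r)) / ((n - r) / (N - n - R + r)))
```

For example, with r = 8, n = 20, R = 10 and N = 100 the four weights are log 4, log 6, log 16 and log 26.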


BIR: Experiments

• F1, F2, F3 and F4 outperform no relevance weighting and ranking by IDF
• F1 and F2; F3 and F4 perform in the same range
• F3 and F4 > F1 and F2
• F4 slightly > F3
  - O2 is correct (looking at presence and absence of terms)
• No conclusion with respect to I1 and I2, although I2 seems a more realistic assumption.

Query reformulation: Local analysis

• Examine documents retrieved for query to determine query expansion
• No user assistance
• Two strategies
  - Local clustering (synonyms, stemming variations)
  - Local context analysis (terms close to query terms in text)
• Two issues:
  - Query “drift”
  - Computation cost (on-line)


Query reformulation: Global analysis

• Expand query using information from whole set of documents in collection
• No user assistance
• Thesaurus-like structure using all documents
• Two issues:
  - Approach to build thesaurus (e.g. term co-occurrence)
  - Approach to select terms for query expansion (e.g. the top 20 terms ranked according to IDF value)

Query reformulation: Evaluation

• Use qi and compute precision and recall graph
• Use qi+1 and compute precision and recall graph
  - Use all documents in the collection
    • Spectacular improvements
    • Also due to relevant documents ranked higher
    • Documents known to user
    • Must evaluate with respect to documents not seen by user
  - (For example) Use documents in residual collection = set of documents minus those assessed relevant
    • Measures lower than for original query
    • More realistic evaluation
    • But result not comparable with original ranking (fewer relevant documents)


Query reformulation: Issues

• Relevance feedback
  - Often users are not reliable in making relevance assessments
  - Positive, negative, neutral, partial relevance assessments
  - Why is a document relevant?
• Interface and visualisation
  - Allow user to quickly identify relevant and non-relevant documents (e.g. the use of summary)
  - What happens with 2D and 3D visualisation?
• Interactive query expansion (as opposed to automatic)
  - User chooses the terms to be added

Web information retrieval

1. Introduction
2. Tasks of web search engines
   1. Gathering
   2. Indexing
   3. Searching
   4. Document and query management
3. Metasearch
4. Issues


Introduction: Queries on the web

Measure                     Average value    Range
Number of words             2.35             0 - 393
Number of operators         0.41             0 - 958
Repetitions of queries      3.97             1 - 1.5 million
Queries per user session    2.02             1 - 173325
Screens per query           1.39             1 - 78496

Introduction: Users and the web

• Main purpose: research, leisure, business, education,
  - products and services (e-commerce)
  - people and company names and home pages
  - factoids (from any one of a number of documents)
  - entire, broad documents
  - mp3, image, video, audio
• Some statistics
  - 80% do not modify query
  - 85% look at the first screen only
  - 64% of queries are unique
  - 25% of users use single keywords (problem for polysemic words and synonyms)
  - 10% of queries are empty!


Web IR: Tasks of a web search engine

• Document gathering
  - select the documents to be indexed
• Document indexing
  - represent the content of the selected documents
  - often 2 indices maintained (full + small for frequent queries)
• Searching
  - represent the user information need as a query
  - retrieval process (search algorithms, ranking of web pages)
• Document and query management
  - display the results
  - virtual collection (documents discarded after indexing) vs. physical collection (documents maintained after indexing)

Tasks: Document indexing

• Document indexing = building the indices
• Indices are variants of inverted files
  - metatag analysis
  - stop words removal + stemming
  - position data (for phrase searches)
  - weights
    - tf x idf
    - downweight long URLs (not important page)
    - upweight terms appearing at the top of the documents, or emphasised terms
  - use de-spamming techniques
• hyperlink information
  - count link popularity
  - anchor text from source links
  - hub and authority value of a page


Tasks: Searching

• Querying
  - 1 word or all words must be in the retrieved pages
  - normalisation (stop words removal, stemming, etc)
  - complex queries (date, structure, region, etc)
  - Boolean expressions (advanced search)
  - metadata
• Ranking algorithms: use of web links
  - Anchor text
  - web page authority analysis
    - PageRank (Google)
    - HITS (Hyperlink Induced Topic Search)

Ranking: Use of web links

• Web link: represents a relationship between the connected pages
• The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links
  - web links are a source of evidence but also a source of noise
  - classical IR: citation-based IR
  - web track in TREC, 2000, TREC-9: Small Web task (2GB of web data); Large Web task (100GB of web data, 18.5 million documents)


Ranking: Anchor text

• Represent referenced document
  - why?
    - provides more accurate and concise description than the page itself
    - (probably) contains more significant terms than the page itself
  - used by ‘WWW Worm’, one of the first search engines (1994)
  - representation of images, programs, …
• Generate page descriptions from anchor text

Ranking: PageRank (1)

• Designed by Brin and Page at Stanford University and used to implement Google
  - a page has a high rank if the sum of the ranks of its in-links is high
    - in-link of page p: a link from a page to page p
    - out-link of a page p: a link from page p to a page
  - a high PageRank page has many in-links or a few highly ranked in-links
• Retrieval: use cosine product (content, feature, term weight) combined with PageRank value


Ranking: PageRank (2)

• Random Surfer Model: user randomly navigates
  - Initially the surfer is at a random page
  - At each step the surfer proceeds
    - to a randomly chosen Web page with probability d called the “damping factor” (e.g. probability of random jump = 0.2)
    - to a randomly chosen page linked to the current page with probability 1-d (e.g. probability of following a random outlink = 0.8)
• Process modelled by Markov Chain
  - PageRank PR of a page a = probability that the surfer is at page a at a given time

PR(a) = Kd + K(1-d) ∑i=1,n PR(ai)/C(ai)

d: set by system
a: page pointed to by ai for i = 1, …, n
K: normalisation factor
C(ai): number of outlinks of ai
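The formula above can be iterated until the ranks stabilise. The sketch below follows the slide's convention, in which d is the probability of a random jump, and makes three assumptions of its own: K is recomputed at every iteration so that the ranks sum to 1, every page has at least one outlink, and every link target is itself a key of the input dictionary.

```python
def pagerank(outlinks, d=0.2, iterations=50):
    """Iterate PR(a) = K*d + K*(1-d) * sum(PR(ai) / C(ai)).

    outlinks: dict page -> list of pages it links to.
    Assumes every page has outlinks and every target is a key of `outlinks`.
    """
    pages = list(outlinks)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        raw = {p: d for p in pages}  # random-jump contribution
        for src, targets in outlinks.items():
            for dst in targets:
                # src passes its rank on equally to its C(src) outlinks
                raw[dst] += (1 - d) * pr[src] / len(targets)
        k = 1.0 / sum(raw.values())  # normalisation factor K
        pr = {p: k * value for p, value in raw.items()}
    return pr
```

Brin and Page's own paper attaches the name "damping factor" to the probability of following a link, so values quoted elsewhere (typically 0.85) correspond to 1-d in the notation used here.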

Ranking: HITS = Hyperlink Induced Topic Search

• Originated from Kleinberg, 1997 (also referred to as “The Connectivity Analysis Approach”)
• Broad topic queries produce large sets of retrieved results
  - abundance problem ⇒ too many relevant documents
  - new type of quality measure needed ⇒ distinguish the most “authoritative” pages ⇒ high-quality response to a broad query
• HITS: for a certain topic, it identifies
  - good authorities
    - pages that contain relevant information (good sources of content)
  - good hubs
    - pages that point to useful pages (good sources of links)


Ranking: HITS (2)

• Intuition
  - authority comes from inlinks
  - being a good hub comes from outlinks
  - better authority comes from inlinks from good hubs
  - being a better hub comes from outlinks to good authorities
• Mutual reinforcement between hubs and authorities
  - a good authority page is pointed to by many hub pages
  - a good hub page points to many authority pages
• Use the set of pages S that are retrieved (e.g. k = 200 top-ranked pages) + set of pages T that point to or are pointed to by retrieved set of pages S

Ranking: HITS (3)! Computation of hub and authority value of a page through the

iterative propagation of “authority weight” and “hub weight”

! Initially all values equal to 1! Authority weight of page x(p)

"if p is pointed to by many pages with large y-values, then it should receive a large x-value

x(p) = Σqi→p y(qi)! Hub weight of page y(p)

"if p points to many pages with large x-values, then it should receive a large y-value

y(p) = Σp→qi x(qi)! After each computation (iteration), weights are normalised
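The two update rules and the normalisation step can be sketched as a short loop. This is an illustration only: the function names, the Euclidean normalisation and the tiny link graph are assumptions, not part of the tutorial.

```python
import math


def normalise(weights):
    """Scale the weights so that their squares sum to 1 (assumed norm)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {p: (w / norm if norm else 0.0) for p, w in weights.items()}


def hits(links, iterations=20):
    """links maps each page to the list of pages it points to."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    hub = {p: 1.0 for p in pages}              # initially all values are 1
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # x(p) = sum of y(q) over all q that point to p
        auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                auth[p] += hub[q]
        # y(p) = sum of x(q) over all q that p points to
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        auth = normalise(auth)
        hub = normalise(hub)
    return auth, hub


# Hypothetical base set: three hub-like pages pointing to two content pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
authority, hub = hits(links)
```

In this toy graph "a1" ends up as the strongest authority, and "h1" and "h2" are better hubs than "h3" because they point to both authorities.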


Web IR: Metasearch (1)

• Problems of Web search engines:
  - limited coverage of the publicly indexable Web
  - index different, overlapping sections of the Web
  - based on different IR models
  - different results for the same query
  ⇒ users do not have the time or knowledge to select the most appropriate search engines for their information need
• Metasearch engines
  - Send the query to several search engines, Web directories, databases
  - Collect the results
  - Unify (merge) them: data fusion

Web IR: Metasearch (2)

• Divided into phases
  - search engine selection
      - topic-dependent, past queries, network traffic, etc.
  - document selection
      - how many documents from each search engine?
  - merging algorithm
      - utilise rank positions, document retrieval scores, titles & abstracts, etc.

Metasearcher   URL                   Sources used
MetaCrawler    www.metacrawler.com   13
Dogpile        www.dogpile.com       25
SavvySearch    www.search.com        > 1000
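As an illustration of the merging phase, the sketch below fuses result lists using rank positions only. The scoring rule (sum of reciprocal ranks), the function name `merge_by_rank` and the example URLs are assumptions chosen for brevity; the slides do not prescribe a particular fusion formula.

```python
from collections import defaultdict


def merge_by_rank(result_lists, top_k=10):
    """Merge ranked lists: each engine contributes 1/position for a URL."""
    scores = defaultdict(float)
    for results in result_lists:
        for position, url in enumerate(results, start=1):
            scores[url] += 1.0 / position
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))[:top_k]


# Hypothetical answers from three search engines to the same query.
engine_a = ["u1", "u2", "u3"]
engine_b = ["u2", "u4"]
engine_c = ["u2", "u1"]
merged = merge_by_rank([engine_a, engine_b, engine_c])
```

A URL returned near the top by several engines ("u2" here) rises above one that a single engine ranks first.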


Web IR: Issues

• Modelling
• Querying
• Distributed architecture
• Ranking
• Indexing
• Dynamic pages
• Browsing
• User interface
• Duplicated data
• Multimedia
• Context

Web IR: Context

• Results of search engines are identical, independent of
  - the user
  - the context in which the user made the request
• Adding context information to improve search results ⇒ focus on the user need and answer it directly
  - explicit context
      - query + category
  - implicit context
      - based on documents edited or viewed by the user
  - personalised search
      - previous requests and interests, user profile


Structured document retrieval

• In standard IR, documents are considered as atomic information units whatever their type or size
  - Indexed as a whole
      - Indexes do not express the internal organisation of the discourse set by the author(s)
  - Retrieved as a whole
      - Users cannot retrieve independent components of documents that might be better adapted (more focussed) to their information needs
• New standards (SGML, XML, HTML, ODA…)
• MPEG-7 for audio-visual data

Structured document


SDR: Impact of structure

• Searching = querying and browsing
  - Complementary advantages and limitations
  - Both based on explicit manipulation of structure
      - Querying: attributes, logical structure
      - Browsing: links
  - Disorientation

SDR: Approaches

• Hypermedia
• Passage retrieval
• Indexing/retrieving hierarchical structure (aggregation)


SDR: Passage retrieval

• Apply IR techniques to parts, “passages”, rather than whole documents
  - Return ranked documents based on passages
  - Return ranked passages
      - Combination of evidence (local + global)
• Three types of structure:
  - Discourse: sentence, section, …
  - Semantic: subject or content of text
  - Window: based on a fixed number of words
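Window passages are the easiest of the three types to illustrate. The sketch below cuts a text into overlapping fixed-size word windows and ranks them by the number of query-term occurrences. The window size, the overlap, the scoring and the sample text are all assumptions made for the example.

```python
def window_passages(text, size=8, step=4):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    passages = []
    for start in range(0, len(words), step):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):     # last window reached the end
            break
    return passages


def rank_passages(text, query, size=8, step=4):
    """Return (score, passage) pairs, best first; score = query-term hits."""
    terms = set(query.lower().split())
    scored = []
    for passage in window_passages(text, size, step):
        score = sum(1 for word in passage.lower().split() if word in terms)
        scored.append((score, passage))
    return sorted(scored, key=lambda pair: -pair[0])


# Hypothetical document text and query.
text = ("the harbour was quiet in winter but in summer many charter boats "
        "leave from the greek islands every week")
ranking = rank_passages(text, "greek charter")
best_score, best_passage = ranking[0]
```

Returning ranked documents instead of ranked passages would then be a matter of combining each document's best passage score with its whole-document score.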

SDR: Aggregation-based approaches

[Figure: object o1 with object representation R1 and object o2 with object representation R2 are aggregated into object o, whose object representation = R1 ⊕ R2. Factors listed: type of links, number of children, type of child, …]


SDR: Aggregation-based approaches

? = weight of a term in the document, i.e. an estimate of how well the term represents the document

• term weight of term in document
• aggregated term weight of term from related components
• importance associated with components (abstract vs. conclusion)
• link type (hierarchy, linear, semantic, popularity)
• portion of related components indexed by term
• distribution of related components indexed by term

[Figure: a document with three sections (section1, section2, section3) carrying the term weights {0.7 wine}, {0.9 cheese} and {0.3 wine}; the document-level weights {? wine, ? cheese} are to be estimated]
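The slide leaves the document-level weights open. Purely as an illustration, the sketch below fills them with an importance-weighted average of the section weights. The averaging rule, the equal default importances and the assignment of the three weights to particular sections are assumptions; the other factors listed above (link type, distribution of components) are ignored.

```python
def aggregate(components, importance=None):
    """Estimate document term weights from the weights of its components."""
    if importance is None:
        importance = {name: 1.0 for name in components}   # all sections equal
    total = sum(importance[name] for name in components)
    document = {}
    for name, weights in components.items():
        for term, weight in weights.items():
            share = importance[name] * weight / total
            document[term] = document.get(term, 0.0) + share
    return document


# Section weights from the figure; which section holds which weight is assumed.
sections = {
    "section1": {"wine": 0.7},
    "section2": {"cheese": 0.9},
    "section3": {"wine": 0.3},
}
doc_weights = aggregate(sections)
```

Under these assumptions "wine", indexed in two of three sections, and "cheese", indexed strongly in one, receive similar document-level weights, which shows why the portion and distribution of indexed components matter.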

Aggregation: Focussed Retrieval

[Figure: document structure with relevant components marked r and Best Entry Points (BEPs) from which the user browses to them]

Focus retrieval to Best Entry Points (BEPs)


SDR: Issues

• Information-seeking process for SDR
• Interfaces that support browsing and querying
• IR models
  - Retrieve at the appropriate level of granularity
  - Focus retrieval to best entry points
  - Structured queries
  - XML retrieval
• Evaluation and test collections for SDR
• Index structure for SDR

Areas of AI used in IR

• Natural language processing
• Knowledge representation
  - Expert systems
  - Logical formalisms, conceptual graphs, etc.
• Machine learning
  - Short term: over a single session
  - Long term: over multiple searches by multiple users
• Computer vision
  - OCR
• Reasoning under uncertainty
  - Dempster-Shafer, Bayesian networks, probability theory, etc.
• Cognitive theory
  - User modelling


Areas of AI used in IR: Three main roles

1. Information characterisation

2. Search formulation in information seeking

3. Support functions

Information characterisation - Approach 1

• Replace document text (natural language) with a knowledge base in an artificial language
• Directly manipulating the information available => knowledge-base retrieval
• Allows for question-answering queries
• Much of the (textual) information is lost
  - What will be put in the knowledge base?
  - Issue of information extraction
• Problematic for large collections, but shown to be successful in a specific domain (SCISOR)


Information characterisation - Approach 2

• Keep documents and use the knowledge base as an access tool (query formulation)
  - Semantic-based access, concept-based access
  - Interface and presentation
• Better classification of document text and better access
• Criticism: problems of (automatic) linkage (documents differ in style, language and level of discussion)

Information characterisation - Approach 3

• Abandon the knowledge base but use AI (syntactic level) to characterise document content
• Sophisticated matching
• Use NLP to derive
  - Noun phrases: “The mother of Jane” <=> “Jane’s mother”
  - Sentences: “The boy ate the apple” <=> “The apple was eaten by the boy”
  - Normalisation is necessary!
• Little evidence of success (so far)


Information characterisation - Approach 4

• Use AI to select good natural language index terms
  - Thesaurus construction
  - Compound terms
• Use world knowledge and a bit of linguistics (e.g. noun vs. verb, discourse)

Information seeking

• Characterisation of the user’s information need (and not the actual matching)
• User modelling: “automating the intermediary”, giving the user an intelligent front-end
• Over iterative searching and dialogue, determine the user’s real information need
  - Medical doctor vs. medical student
  - Student and general topic: look for a survey document
• BUT: users have difficulty expressing their information need ⇒ difficulty of manually or automatically deriving rules for systems
• Based on expert systems technologies


Example of Rules

• Data abstraction rules (levels: 1 = very low … 5 = very high)
  - if precision <= 20% then precision level is 1
  - if precision > 80% then precision level is 5
  - if retrieval size is 101-200 then retrieval level is 4
• Heuristic matching rules
  - if precision level is 2 or 3 and retrieval level > 2 then use narrowing strategy
• Refinement rules
  - if a narrowing strategy is needed then select strategy “use terms that have high frequency in relevant records” with weight 0.8
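The three rule layers can be chained as plain functions. In the sketch below only the thresholds quoted on the slide are taken from the source: the intermediate precision bands, the other retrieval-size bands and all function names are assumptions added so that the example runs.

```python
def precision_level(precision_pct):
    """Data abstraction: map precision (in %) to a level from 1 to 5."""
    if precision_pct <= 20:      # from the slide
        return 1
    if precision_pct > 80:       # from the slide
        return 5
    if precision_pct <= 40:      # assumed band
        return 2
    if precision_pct <= 60:      # assumed band
        return 3
    return 4                     # assumed band


def retrieval_level(retrieved):
    """Data abstraction: map the number of retrieved records to a level."""
    # Only "101-200 -> 4" is from the slide; the other bounds are assumed.
    for upper, level in [(10, 1), (50, 2), (100, 3), (200, 4)]:
        if retrieved <= upper:
            return level
    return 5


def advise(precision_pct, retrieved):
    """Heuristic matching followed by refinement."""
    p, r = precision_level(precision_pct), retrieval_level(retrieved)
    if p in (2, 3) and r > 2:    # heuristic matching rule
        strategy = "use terms that have high frequency in relevant records"
        return ("narrow", strategy, 0.8)   # refinement rule and its weight
    return None


advice = advise(35, 150)
```

A search with 35% precision and 150 retrieved records triggers the narrowing strategy; a search with 90% precision does not.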

Support functions

1. Information extraction
2. Abstracting and summarising
3. Cataloguing (Ontology)
4. Automatically linking parts of texts (Hypertext)
5. Thesaurus/dictionary building (Linguistics)
6. Story telling (News)


IR: Bibliography

• [AA97] M Agosti and J Allan. Introduction to the Special Issue on Methods and Tools for the Automatic Construction of Hypertext. IP&M 33(2):129-131, 1997.
• [ACP00] M Agosti, F Crestani and G Pasi. Lectures on Information Retrieval, Third European Summer-School, ESSIR 2000, Varenna, Italy, September 11-15, 2000, Revised Lectures, 2001.
• [AS96] M Agosti and AF Smeaton (eds). Information Retrieval and Hypertext. Kluwer Academic Publishers, 1996.
• [Bel00] RK Belew. Finding Out About: Search Engine Technology from a Cognitive Perspective. Cambridge University Press, 2000.
• [BR99] R Baeza-Yates and B Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999.
• [Cro87] WB Croft. Approaches to intelligent information retrieval. IP&M 23(4):249-254, 1987.
• [CLR98] F Crestani, M Lalmas and CJ van Rijsbergen (eds). Information Retrieval: Uncertainty and Logics - Advanced models for the representation and retrieval of information. Kluwer Academic Publishers, Boston et al., 1998.
• [CLRC98] F Crestani, M Lalmas, CJ van Rijsbergen and I Campbell. “Is This Document Relevant? ... Probably”: A Survey of Probabilistic Models in Information Retrieval. ACM Computing Surveys 30(4):528-552, 1998.
• [FB92] W Frakes and R Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
• [FR97] N Fuhr and T Roelleke. A Probabilistic Relational Algebra for the Integration of Information Retrieval and Database Systems. TOIS 15(1):32-66, 1997.
• [GF98] DA Grossman and O Frieder. Information Retrieval: Algorithms and Heuristics. Kluwer Academic Publishers, 1998.
• [Ing92] P Ingwersen. Information Retrieval Interaction. Taylor Graham, London, 1992.
• [Kor97] RR Korfhage. Information Storage and Retrieval. Wiley, 1997.
• [Kow97] G Kowalski. Information Retrieval Systems, Theory and Implementation. Kluwer Academic Publishers, Boston, USA, 1997.
• [KP93] CSG Khoo and DCC Poo. An expert system approach to online catalog subject searching. IP&M 30(2):223-238, 1993.
• [Kra86] DH Kraft. Research into Fuzzy Extensions of Information Retrieval. SIGIR Forum 20(1-4):12-13, 1986.
• [JR90] P Jacobs and L Rau. SCISOR, Extracting information from on-line news. Communications of the ACM 33(11):88-97, 1990.

IR: Bibliography

• [LLD+02] RWP Luk, HV Leong, TS Dillon, ATS Chan, WB Croft and J Allan. A survey in indexing and searching XML documents. JASIST 53(6):415-437, 2002.
• [Pet01] C Peters. Cross-Language Information Retrieval and Evaluation, Workshop of Cross-Language Evaluation Forum, CLEF 2000, Lisbon, Portugal, September 21-22, 2000, Revised Papers. Springer, 2001.
• [Rij79] CJ van Rijsbergen. Information Retrieval. Butterworths, 1979. http://www.dcs.glasgow.ac.uk/Keith/Preface.html
• [Rij86a] CJ van Rijsbergen. A New Theoretical Framework for Information Retrieval. ACM SIGIR’86, pp 194-200, 1986.
• [Rij86b] CJ van Rijsbergen. A Non-Classical Logic for Information Retrieval. The Computer Journal 29(6):481-485, 1986.
• [Rij92] CJ van Rijsbergen (ed). The Computer Journal, Special Issue on Information Retrieval, 35(3), 1992.
• [SKCT88] T Saracevic, P Kantor, AY Chamis and D Trivison. A study of information seeking and retrieving. I. Background and methodology. Journal of the American Society for Information Science 39(3):161-176, 1988.
• [SM83] G Salton and MJ McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York, 1983.
• [Sme92] AF Smeaton. Progress in the Application of Natural Language Processing to Information Retrieval Tasks. The Computer Journal 36(3), 1992.
• [Spa91] K Sparck Jones. The role of Artificial Intelligence in Information Retrieval. JASIS 42(8):558-565, 1991.
• [Spa99] K Sparck Jones. Information retrieval and artificial intelligence. Artificial Intelligence 114:257-281, 1999.
• [Spa00] K Sparck Jones. Further reflections on TREC. Information Processing and Management 36(1):37-85, 2000.
• [SW92a] K Sparck Jones and P Willet. Readings in Information Retrieval. Morgan Kaufman, 1997.
• [SW92b] C Stanfill and DL Waltz. Statistical Methods, Artificial Intelligence, and Information Retrieval. In Text-based intelligent systems: Current research and practice in Information Extraction and Retrieval (ed PS Jacobs). Lawrence Erlbaum Associates, 1992.
• [TC90] H Turtle and WB Croft. Inference networks for document retrieval. ACM SIGIR, pp 1-24, Brussels, Belgium, 1990.
• [WMB94] IH Witten, A Moffat and TC Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Van Nostrand Reinhold, New York, 1994.


IR: Conferences and journals

• Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (www.acm.org/sigir/)
• The Text REtrieval Conference (TREC), NIST Special Publication (trec.nist.gov/)
• International Conference on Information and Knowledge Management (CIKM)
• Information Retrieval Colloquium, British Computer Society (BCS IRSG), now ECIR
• Hypermedia - Information Retrieval - Multimedia (HIM)
• RIAO Conference, Content-Based Multimedia Information Access
• ECDL and ADL (Digital libraries)
• IP&M (Information Processing and Management), Elsevier
• TOIS (Transactions On Information Systems), ACM
• JASIS (Journal of the American Society for Information Science), ASIS
• JDOC (Journal of DOCumentation), ASLIB
• IR (Information Retrieval), Kluwer
• JIIS (Journal of Intelligent Information Systems), Kluwer
• International Journal of Digital Library (IJODL), Springer CIMIC
• Journal of Digital Information, BCS and OUP
• British Computer Society, Information Retrieval Specialist Group (irsg.eu.org/)

Seeking for Information: Artificial Intelligence

Alison Cawsey
Department of Computing and Electrical Engineering
Heriot-Watt University
Edinburgh, EH14 4AS, [email protected]


Introduction

• How can we improve retrieval by applying methods from Artificial Intelligence (AI)?
• Start by reviewing:
  - What is AI?
  - What is the retrieval task?

Artificial Intelligence

• Concerned with automating or modelling intelligent and commonsense behaviour.
• Represent and reason with information at the level of “meaning” (not surface strings).
• Use knowledge of the world, of people, of typical situations.


Revisiting Retrieval

• Goal: provide the user with the information that they need.
• How might an intelligent assistant do this?
  - Analyse the user’s requirements.
  - Collect information from many sources.
  - Read, interpret, filter.
  - Create a report or summary.

Intelligent Retrieval

• About finding information, not documents.
• Work at the level of knowledge, not text.
• AI provides methods to extract knowledge from text, reason with it, and communicate results to the user.


AI Approaches: Outline

1. Speech and language technology: to extract meaning from text.
2. Ontologies and rich metadata: to represent document and domain semantics.
3. Intelligent filtering and presentation.
4. Practical examples using XML.

Speech and Language Technology

• Involves: analysis and synthesis of speech and language.
  - Speech recognition and synthesis.
  - Natural language understanding and generation.

[Figure: the sentence “John loves Mary.” mapped to the meaning representation loves(john, mary)]


SLT: Uses in IR

• If we can extract the meaning of a document and a query we can:
  - Use more semantically significant query terms.
  - Retrieve resources which use terms different from those in the query.
  - Create coherent summaries and tables of key data.
  - Improve cross-language retrieval.

[17]

Example: Information Extraction

• SLT can be used to extract key data from texts, using robust analysis techniques.

[Figure: IE applied to “Celtic played Rangers in a 2-2 draw..” fills a template: team1 = Celtic, team2 = Rangers, score = 2-2]

[9]
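A very small template filler for the football example might look as follows. It uses a single hand-written pattern, which is an assumption made for illustration; real IE systems rely on much more robust linguistic analysis than one regular expression.

```python
import re

# Hypothetical pattern for one sentence shape: "<Team> played <Team> in a <n-n> draw".
MATCH_PATTERN = re.compile(
    r"(?P<team1>[A-Z]\w+) played (?P<team2>[A-Z]\w+) in a (?P<score>\d+-\d+) draw"
)


def extract_match(text):
    """Fill the template {team1, team2, score}, or return None if no match."""
    found = MATCH_PATTERN.search(text)
    return found.groupdict() if found else None


template = extract_match("Celtic played Rangers in a 2-2 draw..")
```

The filled template can then be stored in a table of key data or passed to a text generator.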


Summarisation

• IE techniques can be combined with text generation techniques to create high-quality summaries.

[Figure: source texts (“Lots of text”, “Even more”) → extracted template (Team 1, Team 2, Score) → generated summary: Celtic played Rangers with a score of 2-2 in a match described variously as “exciting” and “deadly dull”.]

[16] [26] [19]

Multilingual Retrieval

• Can also improve cross-language retrieval:
• Simple word-by-word translation may be inadequate.

[Figure: query (language 1): “fromage”, “Anglais”, “histoire”; document (language 2): “A tale of a mouse and his cheddar”]

[23]


Meaningful Index Terms

• SLT can be used to try to extract more meaningful query terms and document indexes.
• Stemming (categorise -> category)
• Disambiguate words with many meanings (e.g., bank, table)
• Use noun phrases as index terms (e.g., “learning support centre”)
• The latter two have had limited success

[25][11]

Multimedia Retrieval

• SLT used to aid retrieval of video and speech (e.g., TV news)
• If there is no audio transcript, speech recognition can be used.
• Speech recognition doesn’t have to be perfect: matching with the query can proceed probabilistically.

[1][20]


Summary: Speech and Language Technology

• Work at the level of information and meaning rather than words.
• Analyse whole documents:
  - Extract information, translate..
• Improve query/index terms:
  - e.g., extract noun phrases
• Recognise spoken language

Ontologies and Metadata

• SLT handles retrieval by extracting structure and meaning from text.
• The other approach is for authors to provide more structure.
• Add metadata to a resource, describing the resource using set categories.
• Provide ontologies, defining concepts and relations.


Metadata

• Simple example:
  - Title: Essence of AI
  - Creator: Alison Cawsey
  - Subject: AI
  - Publisher: Prentice Hall
  - Date: 1997
• Standard fields can be used - “Dublin Core” is one standard.
• Can then search on metadata: subject = AI AND Date = 1997

[32][33]
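Searching on metadata amounts to matching field values rather than text. The sketch below stores records as dictionaries keyed by Dublin Core style field names and answers the query “subject = AI AND date = 1997”. The `search` helper and the second record are hypothetical, added only so that the filter has something to exclude.

```python
records = [
    {"title": "Essence of AI", "creator": "Alison Cawsey", "subject": "AI",
     "publisher": "Prentice Hall", "date": "1997"},
    # Hypothetical second record, not from the tutorial.
    {"title": "Hypothetical Database Primer", "creator": "A. N. Other",
     "subject": "Databases", "publisher": "Example Press", "date": "1997"},
]


def search(records, **criteria):
    """Return the records whose fields equal all given values (AND query)."""
    return [record for record in records
            if all(record.get(field) == value for field, value in criteria.items())]


hits_1997_ai = search(records, subject="AI", date="1997")
```

Because matching is exact, the quality of the results depends entirely on authors using the agreed field names and vocabulary.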

Rich Metadata

• Metadata can be based on a relational model providing rich descriptions of resources:

[Figure: a resource (http://..) linked by “creator” to another resource (http://..), which has name “Alison” and email “alison@cee”]

[29]


Semantic Web

• This leads to the idea of the semantic web.
• Rich network of resources with meaningful descriptions and relations.
• Augmented with an ontology giving relations between concepts (e.g., AI is part of CS).
• Search on knowledge/meaning, not text. Inference on concepts used in search.

[5]

Resource Description Framework (RDF)

• RDF provides a rich metadata scheme; it can be written in XML.

    <rdf:Description rdf:about="http://mydoc">
      <dc:creator rdf:resource="http://me"/>
    </rdf:Description>

• RDF schemas provide simple vocabularies/ontologies; other systems (e.g., OIL) augment them.
• Logics (e.g., description logic) are used for inferencing on these.

[29][31][10][12]
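RDF statements are subject-predicate-object triples, so a toy version of concept-level search fits in a few lines. In the sketch below the `dc:subject` statement, the `partOf` relation name and the helper functions are assumptions; real systems would use an RDF library and a description-logic reasoner rather than this hand-written closure.

```python
# Triples: (subject, predicate, object). The first mirrors the RDF example.
TRIPLES = {
    ("http://mydoc", "dc:creator", "http://me"),
    ("http://mydoc", "dc:subject", "AI"),      # assumed statement
    ("AI", "partOf", "CS"),                    # assumed ontology relation
}


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is stated."""
    return {o for s, p, o in TRIPLES if s == subject and p == predicate}


def broader(concept):
    """All concepts reachable from `concept` by following partOf links."""
    seen, frontier = set(), [concept]
    while frontier:
        current = frontier.pop()
        for parent in objects(current, "partOf"):
            if parent not in seen:
                seen.add(parent)
                frontier.append(parent)
    return seen


def resources_about(concept):
    """Resources whose subject is the concept or a part of it."""
    return sorted(s for s, p, o in TRIPLES
                  if p == "dc:subject" and (o == concept or concept in broader(o)))


cs_resources = resources_about("CS")
```

A query for CS finds the document even though its metadata only mentions AI, because the ontology states that AI is part of CS.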


Summary: Metadata

• Authors create meaningful descriptions of resources.
• Schemas/ontologies provide a reference point for the vocabularies and concepts used.
• Then reason at the level of concepts (e.g., subject=AI ⇒ subject=CS).
• But metadata authoring and maintenance is a load!

Intelligent Filtering and Presentation

• Finding information is much more than matching a query to a document:

[Figure: resources → Filter → Assemble → presentation, driven by user profile, context, task, query]


Adaptive web sites

• New-generation web sites can adapt content and display to the user, generating pages dynamically.
• Personalised web pages involve information selection, assembly and presentation to suit the user.
• Source data can be extracted from text, or be available in structured form (XML or DBs).

[3][6][7][18]

User profile and context

• An adaptive site needs to know about the context:
  - user’s preferences, task, location, time, hardware (display)..
• Context may be determined:
  - automatically (e.g., browser)
  - by asking the user
  - by monitoring user behaviour
• Consider: Travel Guide [15]


Examples

• Personalised Travel Guide
  - Adapts content to the user’s location, time and interests, so nearby, open attractions are highlighted.
• Personalised Health Information
  - Adapts content to highlight information of relevance to the user’s problems and treatments.
• Privacy issues may be paramount.

[8][4]

Netbots

• Some virtual web sites use intelligent agents.
• Agent acts on behalf of the user to collect, assemble and comment on information retrieved.
• Agent may hold information about the user, and negotiate with other agents for services/information to support the user’s task.

[13]


Example: PPP

• PPP (personalised plan-based presenter) collects and assembles information, using an animated agent to present it with respect to the user’s information needs. Follow-on: MIAU, SmartKom.

[28][24][2]

Presentation, Filtering and Search

• All concerned with supplying the user with needed information.
• Search is generally based on a query, often run once.
• Filtering is based on a profile or filter acting for an extended period.
• Presentation may allow adjusting the selection and emphasis of information from given basic data.

[21][22][27]


Dangers of Personalisation

Personalisation may:
• Result in less coherent resources, if automatically constructed (e.g., summarisation systems).
• Result in misleading documents, if aspects included by the author are omitted in third-party adaptation.

[35]

Summary: Intelligent Presentation

• Providing the right information may involve selection, assembly and presentation.
• All may “intelligently” take into account aspects of context.
• Danger, though, of losing the coherence of purpose-written, human-authored documents.


XML (eXtensible Markup Language)

• XML provides a basis for fairly simple illustrations of some ideas and techniques.
• This last section therefore introduces XML and gives some “try-this-at-home” examples of personalised presentation.

XML

• XML allows us to mark up text using tags which represent meaningful domain concepts.

    <library-list>
      <book>
        <author> Alison </author>
        …

• Allows search and presentation based on domain concepts.
  - e.g., Find all resources containing books written by Alison.
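With domain-level tags in place, “find all books written by Alison” becomes a query on structure rather than on text. The sketch below uses Python's standard `xml.etree.ElementTree`; the `title` elements and the second book are assumptions added to complete the fragment shown on the slide.

```python
import xml.etree.ElementTree as ET

# Completed version of the slide's fragment (titles and second book assumed).
LIBRARY = """
<library-list>
  <book><author> Alison </author><title>Essence of AI</title></book>
  <book><author> A.N.Other </author><title>Something else</title></book>
</library-list>
"""

root = ET.fromstring(LIBRARY)

# Select <book> elements by the content of their <author> child.
titles_by_alison = [
    book.findtext("title")
    for book in root.findall("book")
    if book.findtext("author", default="").strip() == "Alison"
]
```

A plain text search for “Alison” would also match books that merely mention her; the tag-based query only matches books where she is the author.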


Presenting XML

• Structure, content and presentation of XML documents are kept separate.

[Figure: structure (<book>, <author>) + content (“Alison Cawsey”, “Essence of AI”) → presentation (Book list: • “Essence of AI” by Alison Cawsey)]

Presenting XML

• Presentation is controlled by stylesheets (usually XSL - eXtensible Stylesheet Language)
• Define templates describing how to present/transform part of a document

    <xsl:template match="book">
      <li>
        <xsl:value-of select="title"/> “<xsl:value-of select="author"/>”
      </li>
    </xsl:template>

[30]


Presenting XML

• Using two different stylesheets we get two very different presentations.

Presentation 1 (list):

Booklist:
• “Essence of AI” by Alison Cawsey.
• “Something else” by A.N.Other

Presentation 2 (table):

Title            Author
Essence of AI    Alison Cawsey
Something else   A.N.Other

Personalised Presentation

• We can keep data about the user in another XML file (so we have a basic User Model + Domain Knowledge).
• The stylesheet can contain rules so that the output depends on the user:

    <xsl:if test="document($UP)/user/interest[. = 'books']">
      Special for book lovers..
    </xsl:if>

• But awkward for serious reasoning/inference.
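The same user-dependent rule can be written outside XSL once the XML has been parsed. The sketch below mirrors the `xsl:if` test in Python with the standard `xml.etree.ElementTree` module; the user profile documents, the book list and the `render` function are hypothetical stand-ins for the files the slide refers to.

```python
import xml.etree.ElementTree as ET

# Hypothetical user model and domain data.
USER_PROFILE = "<user><interest>books</interest><interest>sailing</interest></user>"
OTHER_PROFILE = "<user><interest>sailing</interest></user>"
BOOKS = ("<library-list><book><title>Essence of AI</title>"
         "<author>Alison Cawsey</author></book></library-list>")


def render(books_xml, user_xml):
    """Produce a text book list, adding a section for users who like books."""
    user = ET.fromstring(user_xml)
    interests = {e.text.strip() for e in user.findall("interest") if e.text}
    lines = ["Book list:"]
    for book in ET.fromstring(books_xml).findall("book"):
        title = book.findtext("title")
        author = book.findtext("author")
        lines.append(f"- {title} by {author}")
    if "books" in interests:          # same test as interest[. = 'books']
        lines.append("Special for book lovers..")
    return "\n".join(lines)


page_for_book_lover = render(BOOKS, USER_PROFILE)
page_for_sailor = render(BOOKS, OTHER_PROFILE)
```

Once the data is in ordinary program structures like this, richer reasoning over the user model becomes easier than inside a stylesheet.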


XML and Resource Descriptions

• XML is used for RDF (Resource Description Framework).
• May support better querying, but also flexible description of a resource to the user, e.g.:

Query: type = lesson plan and grade = 1-5

Description: “Astronomy” by Jim Bloggs is a lesson plan for primary school teachers. A large black umbrella is required.

[36]

Summary: XML

• XML allows authors to create documents with meaningful markup.
• Simple adaptations of presentation can easily be done using XSL.
• But for real “intelligence” in presentation we have to parse the XML and create a structured data format suitable for reasoning systems.


Summary and Issues

• The challenge is to give users the information they need.
• Given a resource, we may want to:
  - describe it
  - retrieve it, given a query/context
  - extract information from it
  - create a new integrated presentation
• But there are costs in terms of coherence of resources and user effort.

Problems

• Metadata approaches require the author to create and maintain descriptions.
• Information extraction approaches require configuring for the domain.
• Personalising and assembling a presentation risks creating pages that do not reflect the source document author’s intentions.


The Future

• More people will use XML, enabling powerful querying and presentation control.
• More dynamic pages:
  - but they have to be done well to be better than good human-authored documents!
• May see expansion of metadata:
  - if the human cost of creation is reduced.
• The role of SLT may be in supporting metadata extraction.

AI: References

• [1] J Allan. “Perspectives on Information Retrieval and Speech”, in Information Retrieval Techniques for Speech Applications, Coden, Brown and Srinivasan (eds), pp. 1-10, 2001.
• [2] E. André, J. Müller and T. Rist. WIP/PPP: Automatic Generation of Personalized Multimedia Presentations. In ACM Multimedia 96, pages 407-408. ACM Press, November 1996.
• [3] D Bental, L MacKinnon, H Williams, D Marwick, D Pacey, E Dempster and A Cawsey. Dynamic Information Presentation through Web-based Personalisation and Adaptation - An Initial Review. In Joint Proceedings of HCI 2001 and IHM 2001, A Blandford, J. Vanderdonck and P Gray (eds), pp 485-500, Springer, 2001.
• [4] Bental, D.S., Williams, W.H., Pacey, D., Cawsey, A.J., McKinnon, L.M. and Marwick, D.H. 2001a. Dynamic personalization of Web resources for presenting healthcare information. In Proceedings of MEDICON 2001, Croatia, June 2001, IFMBE Proceedings, 86-89.
• [5] Tim Berners-Lee, James Hendler and Ora Lassila. The Semantic Web. Scientific American, May 2001.
• [6] Bordegoni, M., Faconti, G., Feiner, S., Maybury, M., Rist, T., Ruggieri, S., Trahanias, P. and Wilson, M. 1997. A standard Reference Model for Intelligent Multimedia Presentation Systems. Computer Standards and Interfaces 18, 477-496.
• [7] Brusilovsky, P. 1996. Methods and Techniques of Adaptive Hypermedia. User Modelling and User-Adapted Interaction, 6(2-3), 87-129.

AI: References

[8] Cheverst, K., Davies, N., Mitchell, K., and Smith, P. 2000. Providing tailored (context-aware) information to city visitors. In Adaptive Hypermedia and Adaptive Web-Based Systems, 2000, P. Brusilovsky, O. Stock, and C. Strapparava, Eds. Springer, 73-85.

[9] J. Cowie and Y. Wilks. Information Extraction. In R. Dale, H. Moisl and H. Somers (eds.), Handbook of Natural Language Processing. New York: Marcel Dekker, 2000.

[10] Stefan Decker, Dan Brickley, Janne Saarela, Jürgen Angele. A Query and Inference Service for RDF. In QL'98 - The Query Languages Workshop, 1998.

[11] Feng, F. and Croft, W.B. (2000). "Probabilistic Techniques for Phrase Extraction," in Information Processing and Management, March 2001, vol. 37, no. 2, pp. 199-220.

[12] D. Fensel, I. Horrocks, F. Van Harmelen, S. Decker, M. Erdmann, and M. Klein. OIL in a Nutshell. In Knowledge Acquisition, Modeling, and Management. Proceedings of the European Knowledge Acquisition Conference (EKAW-2000). Lecture Notes in Artificial Intelligence, LNAI, Springer-Verlag, October 2000.

[13] Klusch, M. Information Agent Technology for the Internet: A Survey. Journal on Data and Knowledge Engineering, Special Issue on Intelligent Information Integration, D. Fensel (Ed.), Vol. 36(3), Elsevier Science, 2001.

[14] Kobsa, A., Koenemann, J., and Pohl, W. 2001. Personalized Hypermedia Presentation Techniques for Improving Online Customer Relationships. The Knowledge Engineering Review 16(2), 111-155.

[15] Kobsa, A., and Koychev, I. 2000. Learning about Users from Observation. In Adaptive User Interfaces: Papers from the 2000 AAAI Spring Symposium. Menlo Park, CA: AAAI Press.

AI: References

[16] Julian M. Kupiec, Jan Pedersen, and Francine Chen. A Trainable Document Summarizer. In Edward A. Fox, Peter Ingwersen, and Raya Fidel, editors, Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 68-73, Seattle, Washington, July 1995.

[17] David D. Lewis and Karen Sparck Jones. Natural language processing for information retrieval. Communications of the ACM, 39(1):92-101, January 1996.

[18] Manber, U., Patel, A., and Robison, J. 2000. Experience with Personalization of Yahoo. Communications of the ACM, Vol. 43, No. 8, 35-39.

[19] Inderjeet Mani and Mark T. Maybury (Editors). Advances in Automatic Text Summarization. The MIT Press, 1999.

[20] Pedro J. Moreno, J.M. Van Thong, Beth Logan, Gareth J.F. Jones. From Multimedia Retrieval to Knowledge Management. Computer, Vol. 35, No. 4, 2002, pp. 58-66.

[21] Mooney, R.J., and Roy, L. 1999. Content-based book recommending using learning for text categorization. In SIGIR'99 Workshop on Recommender Systems: Algorithms and Evaluation, 1999.

[22] Murthy, K.R.K., and Keerthi, S.S. 1999. Context Filters for Document-Based Information Filtering. In Proceedings of International Conference on Document Analysis and Recognition'99 (ICDAR '99), Bangalore, India, 1999.

AI: References

[23] Douglas W. Oard. Serving Users in Many Languages: Cross-Language Information Retrieval for Digital Libraries. D-Lib Magazine, December 1997. http://www.dlib.org/dlib/december97/oard/12oard.html

[24] T. Rist, E. André, and J. Müller. Adding Animated Presentation Agents to the Interface. In Proceedings of the 1997 International Conference on Intelligent User Interfaces, pages 79-86, Orlando, Florida, 1997.

[25] Mark Sanderson. Word Sense Disambiguation and Information Retrieval (1997). Proceedings of SIGIR-94, 17th ACM International Conference on Research and Development in Information Retrieval.

[26] Sanderson, M. Accurate user directed summarization from existing tools. Proceedings of the 7th International Conference on Information and Knowledge Management (CIKM 98), pp. 45-51, 1998.

[27] Shardanand, U., and Maes, P. 1995. Social Information Filtering: Algorithms for Automating “Word of Mouth”. In Proceedings CHI'95, Denver CO, May 1995, ACM Press, 210-217.

AI: References

[28] Wahlster, W., Reithinger, N., Blocher, A. (2001): SmartKom: Towards Multimodal Dialogues with Anthropomorphic Interface Agents. In: Wolf, G., Klein, G. (eds.), Proceedings of International Status Conference "Human-Computer Interaction", DLR, Berlin, Germany, October 2001, pp. 23-34.

[29] World Wide Web Consortium. Resource Description Framework (RDF) Model and Syntax, http://www.w3.org/TR/REC-rdf-syntax/, 1999.

[30] World Wide Web Consortium. XSL Transformations (XSLT), W3C Recommendation, http://www.w3.org/TR/xslt, 1999.

[31] World Wide Web Consortium. Resource Description Framework (RDF) Schema Specification (W3C Proposed Recommendation), http://www.w3.org/TR/PR-rdf-schema/, 1999.

[32] The Dublin Core Metadata Initiative, http://www.purl.org/DC

[33] IMS Metadata Specification, http://www.imsproject.org/

[34] World Wide Web Consortium. XML Schema Part 1: Structures, W3C Working Draft, http://www.w3.org/TR/xmlschema-1/

[35] Cawsey, A., “Presenting tailored resource descriptions: will XSLT do the job?”, in Computer Networks (3) 713-722, 2000.

[36] Cawsey, A., et al, “Preventing misleading presentations of XML documents: Some initial proposals”, in Proc 2nd International Conference on Adaptive Hypermedia, Aix-en-provence, 2002.

AI: Useful Links

• Cross language IR: http://raven.umd.edu/dlrg/clir/
• Summarisation: http://www.cs.columbia.edu/~radev/summarization/
• Information Extraction: http://www.isi.edu/~muslea/RISE/Resources.html
• Intelligent IR (at UMASS): http://ciir.cs.umass.edu/
• Multimedia retrieval: http://www-sal.cs.uiuc.edu/~sharad/cs491/readinglist.html
• IR and Natural Language Processing: http://web.syr.edu/~diekemar/ir.html
• Language Technology: http://cslu.cse.ogi.edu/HLTsurvey/HLTsurvey.html
• OIL: http://xml.coverpages.org/oil.html
• Semantic Web: http://www.w3.org/2001/sw/
• Netbots (intelligent information agents): http://www.dbgroup.unimo.it/IIA/
• Information filtering: http://www.clis2.umd.edu/dlrg/filter/
• Adaptive hypermedia: http://wwwis.win.tue.nl/ah/
• XML: http://xml.coverpages.org/

IR+AI: Conclusions

• Information seeking: “everyday” users looking for information to satisfy their information needs

• Information retrieval approaches
  - Return “information containers”
  - Mostly at word level, but it works well, although context is needed for web retrieval.

• Artificial intelligence approaches
  - Return “answers”
  - Attempt to capture meaning, but it is hard, in particular with large data sets (efficiency)

IR+AI: Four scenarios

[Figure: four diagrams combining IR and AI in different arrangements; one label reads “application dependent”.]

IR+AI: Demos

1. HySpirit: Experimental platform for developing, implementing and evaluating IR systems.

2. Information personalisation using XML and XSLT.

Demo: HySpirit

• Retrieval platform extending database models with probability theory
  - relational, logical and object-oriented layers modelling hypermedia and knowledge retrieval
  - uncertainty, incompleteness and inconsistency
  - aggregation of uncertain evidence

• Knowledge-oriented retrieval in semi-structured and heterogeneous data sources
  - spatial, temporal, semantic relationship
  - fact-oriented and content-oriented searching and browsing

• Easy parameter setting to support retrieval experiments and evaluation.

(qmir.dcs.qmul.ac.uk)
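The phrase “aggregation of uncertain evidence” can be illustrated with a small sketch. This is not HySpirit code and does not use HySpirit's interface; it assumes, for illustration, that each piece of evidence is an independent probabilistic event and combines the events with the standard disjunction rule 1 - (1 - p1)(1 - p2)... The tuple layout, document ids and probabilities are all invented.

```python
from collections import defaultdict
from math import prod

# (document id, term, probability that the term describes the document).
# All values are made up for illustration.
evidence = [
    ("doc1", "sailing", 0.5),
    ("doc1", "greece", 0.5),
    ("doc2", "sailing", 0.2),
]


def aggregate(rows):
    """Combine per-term probabilities into one probability per document.

    Assumes the events are independent, so
    P(at least one holds) = 1 - product of (1 - p).
    """
    by_doc = defaultdict(list)
    for doc, _term, p in rows:
        by_doc[doc].append(p)
    return {doc: 1 - prod(1 - p for p in ps) for doc, ps in by_doc.items()}


scores = aggregate(evidence)
# Rank documents by aggregated probability, highest first.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Independence is only one possible assumption; if the events were treated as disjoint instead, the probabilities would simply be summed, and the ranking could differ.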