Lemur Toolkit Introduction wbia 彭波 [email protected] 北京大学信息科学技术学院...

99
Lemur Toolkit Introduction http://net.pku.edu.cn/~wbia 彭彭 [email protected] 彭彭彭彭彭彭彭彭彭彭彭彭 3/21/2011

Transcript of Lemur Toolkit Introduction wbia 彭波 [email protected] 北京大学信息科学技术学院...

Page 1: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Toolkit Introduction

http://net.pku.edu.cn/~wbia

彭波[email protected]

北京大学信息科学技术学院3/21/2011

Page 2: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Recap

Information Retrieval Models Vector Space Model Probabilistic models Language model

Page 3: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Some formulas for Some formulas for Sim(VSM)Sim(VSM)

Dot product

Cosine

Dice

Jaccard

3

i i iiiii

iii

i iii

iii

i iii

iii

iii

baba

ba

QDSim

ba

ba

QDSim

ba

ba

QDSim

baQDQDSim

) * (

) * (

),(

) * (2

),(

*

) * (

),(

) * (),(

22

22

22

t 1

D

Q

t 3

t 2

θ

Page 4: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

BM25 (Okapi system) – BM25 (Okapi system) – Robertson et al.Robertson et al.

k1, k2, k3, b: parametersqtf: query term frequencydl: document lengthavdl: average document length

4

Doc. length normalizationTF

factors

Consider tf, qtf, document length

Page 5: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Standard Probabilistic IR

query

d1

d2

dn

Information need

document collection

matchingmatching

5

Page 6: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

IR based on Language Model (LM)

query

d1

d2

dn

Information need

document collection

generationgeneration

)|( dMQP 1dM

2dM

ndM

A query generation process For an information need, imagine an ideal

document Imagine what words could appear in that

document Formulate a query using those words

6

Page 7: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Language Modeling for IR

Estimate a multinomial probability distribution from the text

P(w|D) = (1-) P(w|D)+ P(w|C)

Smooth the distributionwith one estimated from the entire collection

Page 8: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Estimate probability that document generated the query terms

Query Likelihood

?P(Q|D) = P(q|D)

Page 9: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Kullback-Leibler Divergence

Estimate models for document and query and compare

=?

KL(Q|D) = P(w|Q) log(P(w|Q) / P(w|D))

Page 10: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Question

Among the three classic information retrieval model, which one is your best choice in designing your retrieval system?

How can you tune the model parameters to achieve optimized performance?

When you have a new idea on retrieval problem, how can you prove it?

Page 11: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

A Brief History of IR

Slides from Prof. Ray Larson University of California, Berkeley

School of Informationhttp://courses.sims.berkeley.edu/i240/s11/

Page 12: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

IS 240 – Spring 2011

Experimental IR systems

Probabilistic indexing – Maron and Kuhns, 1960

SMART – Gerard Salton at Cornell – Vector space model, 1970’s

SIRE at Syracuse I3R – Croft Cheshire I (1990) TREC – 1992 Inquery Cheshire II (1994) MG (1995?) Lemur (2000?)

Page 13: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

IS 240 – Spring 2011

Historical Milestones in IR Research

1958 Statistical Language Properties (Luhn) 1960 Probabilistic Indexing (Maron & Kuhns) 1961 Term association and clustering

(Doyle) 1965 Vector Space Model (Salton) 1968 Query expansion (Roccio, Salton) 1972 Statistical Weighting (Sparck-Jones) 1975 2-Poisson Model (Harter, Bookstein,

Swanson) 1976 Relevance Weighting (Robertson,

Sparck-Jones) 1980 Fuzzy sets (Bookstein) 1981 Probability without training (Croft)

Page 14: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

IS 240 – Spring 2011

Historical Milestones in IR Research (cont.)

1983 Linear Regression (Fox) 1983 Probabilistic Dependence (Salton,

Yu) 1985 Generalized Vector Space Model

(Wong, Rhagavan) 1987 Fuzzy logic and RUBRIC/TOPIC (Tong,

et al.) 1990 Latent Semantic Indexing (Dumais,

Deerwester) 1991 Polynomial & Logistic Regression

(Cooper, Gey, Fuhr) 1992 TREC (Harman) 1992 Inference networks (Turtle, Croft) 1994 Neural networks (Kwok) 1998 Language Models (Ponte, Croft)

Page 15: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

IS 240 – Spring 2011

Information Retrieval – Historical View

Boolean model, statistics of language (1950’s)

Vector space model, probablistic indexing, relevance feedback (1960’s)

Probabilistic querying (1970’s)

Fuzzy set/logic, evidential reasoning (1980’s)

Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)

DIALOG, Lexus-Nexus, STAIRS (Boolean based) Information industry

(O($B)) Verity TOPIC (fuzzy

logic)

Internet search engines (O($100B?)) (vector space, probabilistic)

Research Industry

Page 16: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

IS 240 – Spring 2011

Research Systems Software

INQUERY (Croft) OKAPI (Robertson) PRISE (Harman)

http://potomac.ncsl.nist.gov/prise SMART (Buckley) MG (Witten, Moffat) CHESHIRE (Larson)

http://cheshire.berkeley.edu LEMUR toolkit Lucene Others

Page 17: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Project

Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman

Page 18: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Zoology 101

Lemurs are primates found only in Madagascar

50 species (17 are endangered)

Ring-tailed lemurs lemur catta

Page 19: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Zoology 101

The indri is the largest type of lemur

When first spotted the natives yelled “Indri! Indri!”

Malagasy for "Look!  Over there!"

Page 20: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

About The Lemur Project

The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project.

The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search.

Page 21: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Installation

http://www.lemurproject.org Linux, OS/X:

Extract software/lemur-4.12.tar.gz ./configure --prefix=/install/path ./make ./make install

Windows Run software/lemur-4.12-install.exe Documentation in windoc/index.html

Page 22: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Installation

Use Lemur-4.12 instead~ JAVA Runtime(JDK 6) need for evaluation tool. Environment Variable : PATH

Linux: modify ~/.bash_profile Windows: MyComputer/Properties…

Page 23: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing

Document Preparation Indexing Parameters Time and Space Requirements

Page 24: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Two Index Formats

KeyFile Term Positions Metadata Offline Incremental InQuery Query

Language

Indri Term Positions Metadata Fields / Annotations Online Incremental InQuery and Indri

Query Languages

Page 25: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing – Document Preparation

TREC Text TREC Web Plain Text Microsoft Word(*)

Microsoft PowerPoint(*)

Document Formats:The Lemur Toolkit can inherently deal with several different document format types without any modification:

HTML XML PDF Mbox

(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.

Page 26: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing – Document Preparation

If your documents are not in a format that the Lemur Toolkit can inherently process:

1. If necessary, extract the text from the document.2. Wrap the plaintext in TREC-style wrappers:

<DOC>

<DOCNO>document_id</DOCNO>

<TEXT>

Index this document text.

</TEXT>

</DOC>

– or –For more advanced users, write your own parser to extend the Lemur Toolkit.

Page 27: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing - Parameters

Basic usage to build index: IndriBuildIndex <parameter_file>

Parameter file includes options for Where to find your data files Where to place the index How much memory to use Stopword, stemming, fields Many other parameters.

Page 28: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing – Parameters

Standard parameter file specification an XML document:

<parameters>

<option></option>

<option></option>

<option></option>

</parameters>

Page 29: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing – Parameters

where to find your source files and what type to expect

BuildIndex <dataFiles> name of file containing list of datafiles to index.

IndriBuildIndex<parameters><corpus>

<path>/path/to/source/files</path><class>trectext</class>

</corpus></parameters>

Page 30: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing - Parameters

The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index If index does not exist, it will create a new one If index already exists, it will append new

documents into the index.

<parameters>

<index>/path/to/the/index</index>

</parameters>

Page 31: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing - Parameters

<memory> - used to define a “soft-limit” of the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for

gigabytes.

<parameters>

<memory>256M</memory>

</parameters>

Page 32: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing - Parameters

Stopwords defined within <stopwords>filename</stopwords>

IndriBuildIndex<parameters>

<stopper>

<word>first_word</word>

<word>next_word</word>

<word>final_word</word>

</stopper>

</parameters>

Page 33: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indexing – Parameters

Term stemming can be used while indexing as well via the <stemmer> tag. Specify the stemmer type via the <name> tag

within. Stemmers included with the Lemur Toolkit

include the Krovetz Stemmer and the Porter Stemmer.

<parameters>

<stemmer>

<name>krovetz</name>

</stemmer>

</parameters>

Page 34: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval

Parameters Query Formatting Interpreting Results

Page 35: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval - Parameters

Basic usage for retrieval: IndriRunQuery/RetEval <parameter_file>

Parameter file includes options for Where to find the index The query or queries How much memory to use Formatting options Many other parameters.

Page 36: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval - Parameters

The <index> parameter tells IndriRunQuery/RetEval where to find the repository.

<parameters>

<index>/path/to/the/index</index>

</parameters>

Page 37: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval - Parameters

The <query> parameter specifies a query plain text or using the

Indri query language<parameters>

<query> <number>1</number> <text>this is the first

query</text> </query>

<query> <number>2</number> <text>another query to

run</text> </query></parameters>

Query file format<DOC><DOCNO> 1 </DOCNO> What articles exist which

deal with TSS (Time Sharing System), anoperating system for IBM computers?

</DOC><DOC><DOCNO> 2 </DOCNO> I am interested in articles

written either by Prieve or Udo PoochPrieve, B.Pooch, U.

</DOC>

Page 38: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval – Query Formatting

TREC-style topics are not directly able to be processed via IndriRunQuery/RetEval. Format the queries accordingly:

Format by hand Write a script to extract the fields ( 可爱的

Python~)

Page 39: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval – Parameters

To specify a maximum number of results to return, use the <count> tag:

<parameters>

<count>50</count>

</parameters>

Page 40: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval - Parameters

Result formatting options:

IndriRunQuery/RetEval has built in formatting specifications for TREC and INEX retrieval tasks

Page 41: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval – Parameters

TREC – Formatting directives: <runID>: a string specifying the id for a query

run, used in TREC scorable output. <trecFormat>: true to produce TREC scorable

output, otherwise use false (default).

<parameters>

<runID>runName</runID>

<trecFormat>true</trecFormat>

</parameters>

Page 42: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Outputting INEX Result Format

Must be wrapped in <inex> tags <participant-id>: specifies the participant-id attribute

used in submissions. <task>: specifies the task attribute (default

CO.Thorough). <query>: specifies the query attribute (default

automatic). <topic-part>: specifies the topic-part attribute (default

T). <description>: specifies the contents of the description

tag.

<parameters><inex>

<participant-id>LEMUR001</participant-id></inex>

</parameters>

Page 43: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Retrieval - Evaluation

To use trec_eval: format IndriRunQuery results with appropriate

trec_eval formatting directives in the parameter file:

<runID>runName</runID> <trecFormat>true</trecFormat>

Resulting output will be in standard TREC format ready for evaluation:

<queryID> Q0 <DocID> <rank> <score> <runID>

150 Q0 AP890101-0001 1 -4.83646 runName150 Q0 AP890101-0015 2 -7.06236 runName

Page 44: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Use RetEval for TF.IDF

First run ParseToFile to convert doc formatted queries into queries

<parameters>

<docFormat>web</docFormat>

<outputFile>filename</outputFile>

<stemmer>stemmername</stemmer>

<stopwords>stopwordfile</stopwords>

</parameters>

ParseToFile paramfile queryfile http://www.lemurproject.org/lemur/

parsing.html#parsetofile

Page 45: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Use RetEval for TF.IDF

Then run RetEval

<parameters> <index>index</index> <retModel>0</retModel> // 0 for TF-IDF, 1 for Okapi, // 2 for KL-divergence, // 5 for cosine similarity <textQuery>querie filename</textQuery> <resultCount>1000</resultCount> <resultFile>tfidf.res</resultFile></parameters>

RetEval paramfile http://www.lemurproject.org/lemur/

retrieval.html#RetEval

Page 46: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Evluate Results

TREC qrels Ground Truth: judge by human assessors.

Page 47: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Ireval tool

java -jar “D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar”result qrels >pr.result

Page 48: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Use Lemur API & Make Extension to Lemur

Page 49: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Task

When you have a new idea on retrieval problem, how can you prove it?

Page 50: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Introducing the API

Lemur “Classic” API Many objects, highly customizable May want to use this when you want to change

how the system works Support for clustering, distributed IR,

summarization Indri API

Two main objects Best for integrating search into larger applications Supports Indri query language, XML retrieval,

“live” incremental indexing, and parallel retrieval

Page 51: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur “Classic” API

Primarily useful for retrieval operations Most indexing work in the toolkit has

moved to the Indri API Indri indexes can be used with Lemur

“Classic” retrieval applications Extensive documentation and tutorials on

the website (more are coming)

Page 52: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Index Browsing

The Lemur API gives access to the index data (e.g. inverted lists, collection statistics)

IndexManager::openIndex Returns a pointer to an index object Detects what kind of index you wish to open, and

returns the appropriate kind of index class Index::

docInfoList (inverted list), termInfoList (document vector), termCount, documentCount

Page 53: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur: DocInfoList

Index::docInfoList( int termID )

Returns an iterator to the inverted list for termID.The list contains all documents that containtermID, including the positions where termIDoccurs.

Page 54: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur: TermInfoList

Index::termInfoList( int docID )

Returns an iterator to the direct list for docID.The list contains term numbers for every termcontained in document docID, and the number of times each word occurs.(use termInfoListSeq to get word positions)

Page 55: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Index Browsing

Index::termCounttermCount() : Total number of terms indexedtermCount( int id ) : Total number of occurrences of term number id.

Index::documentCount

docCount() : Number of documents indexed

docCount( int id ) : Number of documents that contain term number id.

Page 56: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Index Browsing

Index::termterm( char* s ) : convert term string to a numberterm( int id ) : convert term number to a string

Index::documentdocument( char* s ) : convert doc string to a numberdocument( int id ) : convert doc number to a string

Page 57: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Index Browsing

Index::docLength( int docID )The length, in number of terms, of document number docID.

Index::docLengthAvgAverage indexed document length

Index::termCountUnique

Size of the index vocabulary

Page 58: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Browsing Posting List

Page 59: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Retrieval

Class Name Description

TFIDFRetMethod BM25

SimpleKLRetMethod KL-Divergence

InQueryRetMethod Simplified InQuery

CosSimRetMethod Cosine

CORIRetMethod CORI

OkapiRetMethod Okapi

IndriRetMethod Indri (wraps QueryEnvironment)

Page 60: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Retrieval

RetMethodManager::runQuery query: text of the query index: pointer to a Lemur index modeltype: “cos”, “kl”, “indri”, etc. stopfile: filename of your stopword list stemtype: stemmer datadir: not currently used func: only used for Arabic stemmer model->scoreCollection()

RetMethodManager :: createModel

Page 61: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur Retrieval Method

lemur::api::RetrievalMethod Class Reference

Page 62: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

TextQueryMethod

A text query retrieval method is determined by specifying the following elements: A method to compute the query representation A method to compute the doc representation The scoring function A method to update the query representation

based on a set of (relevant) documents

Page 63: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

TextQueryMethod Scoring Function

Given a query q =(q1,q2,...,qN) and a document d=(d1,d2,...,dN), where q1,...,qN and d1,...,dN are terms:

s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d) function w gives the weight of each matched

term function g makes it possible to perform any

further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length).

Page 64: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

TextQueryMethod Scoring Function

ScoreFunction::matchedTermWeight compute the score contribution of a matched

term ScoreFunction:: adjustedScore

score adjustment (e.g., appropriate length normalization)

Page 65: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Lemur: Other tasks

Clustering: ClusterDB Distributed IR: DistMergeMethod Language models: UnigramLM,

DirichletUnigramLM, etc.

Page 66: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

More Stories about Indri

Page 67: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Why Can’t We All Get Along?Structured Data and Information Retrieval

Prof. Bruce CroftCenter for Intelligent Information Retrieval (CIIR)

University of Massachusetts Amherst

Page 68: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Similarities and Differences

Common interest in providing efficient access to information on a very large scale indexing and optimization key topics

Until recently, concern about effectiveness (accuracy) of access was domain of IR

Focus on structured vs. unstructured data is historically true but less relevant today

Statistical inference and ranking are central to IR, becoming more important in DB

Page 69: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Similarities and Differences

IR systems have focused on providing access to information rather than answers e.g. Web search evaluation typically based on topical relevance

and user relevance rather than correctness (except QA)

IR works with multiple databases but not multiple relations

IR query languages more like calculus than algebra

Integrity, security, concurrency are central for DB, less so in IR

Page 70: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

What is the Goal?

An IR system with extended capability for structured data i.e. extend IR model to include combination of

evidence from structured and unstructured components of complex objects (documents)

backend database system used to store objects (cf. “one hand clapping”)

many applications look like this (e.g. desktop search, web shopping)

users seem to prefer this approach (simple queries or forms and ranking)

Page 71: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Combination of Evidence

Example: Where was George Washington born? Returns a ranked list of sentences

containing the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity

“Structured Query” .vs. TextQuery

Page 72: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indri – A Candidate IR System

Indri is a separate, downloadable component of the Lemur Toolkit

Influences INQUERY [Callan, et. al. ’92]

Inference network framework Structured Query language

Lemur [http://www.lemurproject.org] Language modeling (LM) toolkit

Lucene [http://jakarta.apache.org/lucene/docs/index.html] Popular off the shelf Java-based IR system Based on heuristic retrieval models

Designed for new retrieval environments i.e. GALE, CALO, AQUAINT, Web retrieval, and XML retrieval

Page 73: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Zoology 101

The indri is the largest type of lemur

When first spotted the natives yelled “Indri! Indri!”

Malagasy for "Look!  Over there!"

Page 74: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Design Goals

Off the shelf (Windows, *NIX, Mac platforms) Simple to set up and use Fully functional API w/ language wrappers for Java, etc…

Robust retrieval model Inference net + language modeling [Metzler and Croft ’04]

Powerful query language Designed to be simple to use, yet support complex

information needs Provides “adaptable, customizable scoring”

Scalable Highly efficient code Distributed retrieval Incremental update

Page 75: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Model

Based on original inference network retrieval framework [Turtle and Croft ’91]

Casts retrieval as inference in simple graphical model

Extensions made to original model Incorporation of probabilities based on language modeling

rather than tf.idf Multiple language models allowed in the network (one per

indexed context)

Page 76: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Model

D

θtitle θbody θh1

r1 rN… r1 rN

… r1 rN…

I

q1 q2

α,βtitle

α,βbody

α,βh1

Document node (observed)

Model hyperparameters (observed)

Context language models

Representation nodes(terms, phrases, etc…)

Belief nodes(#combine, #not, #max)Information need node

(belief node)

Page 77: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Model

I

D

θtitle θbody θh1

r1 rN… r1 rN

… r1 rN…

q1 q2

α,βtitle

α,βbody

α,βh1

Page 78: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

P( r | θ )

Probability of observing a term, phrase, or feature given a context language model ri nodes are binary

Assume r ~ Bernoulli( θ ) “Model B” – [Metzler, Lavrenko, Croft ’04]

Nearly any model may be used here tf.idf-based estimates (INQUERY) Mixture models

Page 79: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Model

I

D

θtitle θbody θh1

r1 rN… r1 rN

… r1 rN…

q1 q2

α,βtitle

α,βbod

yα,βh1

Page 80: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

P( θ | α, β, D )

Prior over context language model determined by α, β

Assume P( θ | α, β ) ~ Beta( α, β ) Bernoulli’s conjugate prior αr = μP( r | C ) + 1 βr = μP( ¬ r | C ) + 1 μ is a free parameter

||

)|(),,|()|(),,|( ,

D

CrPtfDPrPDrP Dr

ii

Page 81: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Model

I

D

θtitle θbody θh1

r1 rN… r1 rN

… r1 rN…

q1 q2

α,βtitle

α,βbod

yα,βh1

Page 82: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

P( q | r ) and P( I | r )

Belief nodes are created dynamically based on query

Belief node estimates are derived from standard link matrices Combine evidence from parents in various

ways Allows fast inference by making

marginalization computationally tractable Information need node is simply a

belief node that combines all network evidence into a single value

Documents are ranked according to P( I | α, β, D)

Page 83: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Belief Inference

p1 p2

q

Pn ……

Page 84: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Example: #AND

A B

Q

BA

BABABABA

BABABABA

baand

pp

pppppppp

pptttPppfttPpptftPppfftP

bBPaAPbBaAtrueQPtrueQP

1)1(0)1(0)1)(1(0

),|()1(),|()1)(,|()1)(1)(,|(

)()(),|()(,

#

P(Q=true|a,b) A B

0 false false

0 false true

0 true false

1 true true

Page 85: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Other Belief Operators

Page 86: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Document Representation

<html><head><title>Department Descriptions</title></head><body>The following list describes … <h1>Agriculture</h1> …<h1>Chemistry</h1> … <h1>Computer Science</h1> …<h1>Electrical Engineering</h1> ……<h1>Zoology</h1></body></html>

<title>department descriptions</title>

<h1>agriculture</h1> <h1>chemistry</h1>… <h1>zoology</h1>

.

.

.

<body>the following list describes … <h1>agriculture</h1> … </body>

<title> context

<body> context

<h1> context 1. agriculture

2. chemistry…36. zoology

<h1>extents

1. the following list describes <h1>agriculture</h1> …

<body>extents

1. department descriptions

<title>extents

Page 87: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Terms

Type Example Matches

Stemmed term dog All occurrences of dog (and its stems)

Surface term “dogs” Exact occurrences of dogs (without stemming)

Term group (synonym group)

<”dogs” canine>

All occurrences of dogs (without stemming) or canine (and its stems)

POS qualified term <”dogs” canine>.NNS

Same as previous, except matches must also be tagged with the NNS POS tag

Page 88: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Proximity

Type Example Matches

#odN(e1 … em) or#N(e1 … em)

#od5(dog cat) or#5(dog cat)

All occurrences of dog and cat appearing ordered within a window of 5 words

#uwN(e1 … em) #uw5(dog cat) All occurrences of dog and cat that appear in any order within a window of 5 words

#phrase(e1 … em) #phrase(#1(willy wonka)#uw3(chocolate factory))

System dependent implementation (defaults to #odm)

#syntax:xx(e1 … em)

#syntax:np(fresh powder)

System dependent implementation

Page 89: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Context Restriction

Example Matches

dog.title All occurrences of dog appearing in the title context

dog.title,paragraph All occurrences of dog appearing in both a title and paragraph contexts (may not be possible)

<dog.title dog.paragraph>

All occurrences of dog appearing in either a title context or a paragraph context

#5(dog cat).head All matching windows contained within a head context

Page 90: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Context Evaluation

Example Evaluated

dog.(title) The term dog evaluated using the title context as the document

dog.(title, paragraph)

The term dog evaluated using the concatenation of the title and paragraph contexts as the document

dog.figure(paragraph)

The term dog restricted to figure tags within the paragraph context.

Page 91: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Belief Operators

INQUERY INDRI

#sum / #and #combine

#wsum* #weight

#or #or

#not #not

#max #max

* #wsum is still available in INDRI, but should be used with discretion

Page 92: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Extent Retrieval

Example Evaluated

#combine[section](dog canine)

Evaluates #combine(dog canine) for each extent associated with the section context

#combine[title, section](dog canine)

Same as previous, except is evaluated for each extent associated with either the title context or the section context

#sum(#sum[section](dog)) Returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent

#max(#sum[section](dog)) Same as previous, except returns the maximum score

Page 93: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Ad Hoc Retrieval

Flat documents Query likelihood retrieval:

q1 … qN ≡ #combine( q1 … qN)

SGML/XML documents Can either retrieve documents or extents Context restrictions and context evaluations

allow exploitation of document structure

Page 94: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Web Search

Homepage / known-item finding Use mixture model of several document

representations [Ogilvie and Callan ’03]

Example query: Yahoo!#combine( #wsum( 0.2 yahoo.(body) 0.5 yahoo.(inlink) 0.3 yahoo.(title) ) )

)|()|()|(),|( titletitleinlinkinlinkbodybody wPλwPwPwP

Page 95: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Example Indri Web Query

#weight( 0.1 #weight( 1.0 #prior(pagerank) 0.75 #prior(inlinks) ) 1.0 #weight( 0.9 #combine( #wsum( 1 stellwagen.(inlink) 1 stellwagen.(title) 3 stellwagen.(mainbody) 1 stellwagen.(heading) ) #wsum( 1 bank.(inlink) 1 bank.(title) 3 bank.(mainbody) 1 bank.(heading) ) ) 0.1 #combine( #wsum( 1 #uw8( stellwagen bank ).(inlink) 1 #uw8( stellwagen bank ).(title) 3 #uw8( stellwagen bank ).(mainbody) 1 #uw8( stellwagen bank ).(heading)) ) ) )

Page 96: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Question Answering

More expressive passage- and sentence-level retrieval

Example: Where was George Washington born?

#combine[sentence]( #1( george washington )

born #any:place) Returns a ranked list of sentences containing

the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity

Page 97: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Indri Examples

Paragraphs from news feed articles published between 1991 and 2000 that mention a person, a monetary amount, and the company InfoCom

#filreq(#band( NewsFeed.doctype #date:between(1991 2000) ) #combine[paragraph]( #any:person #any:money InfoCom

) )

Page 98: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Getting Help

http://www.lemurproject.org Central website, tutorials, documentation, news

http://www.lemurproject.org/phorum Discussion board, developers read and respond

to questions README file in the code distribution

Paul Ogilvie Trevor Strohman Don Metzler

Page 99: Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 3/21/2011.

Thank You!

Q&A