Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院...

Lemur Toolkit Introduction

http://net.pku.edu.cn/~wbia

彭波pb@net.pku.edu.cn

北京大学信息科学技术学院3/21/2011

Information Retrieval Models Vector Space Model Probabilistic models Language model

Some formulas for Some formulas for Sim(VSM)Sim(VSM)

Dot product

Cosine

Jaccard

i i iiiii

baQDQDSim

) * (2

) * (),(

BM25 (Okapi system) – BM25 (Okapi system) – Robertson et al.Robertson et al.

k1, k2, k3, b: parametersqtf: query term frequencydl: document lengthavdl: average document length

Doc. length normalizationTF

factors

Consider tf, qtf, document length

Standard Probabilistic IR

Information need

document collection

matchingmatching

IR based on Language Model (LM)

Information need

document collection

generationgeneration

)|( dMQP 1dM

A query generation process For an information need, imagine an ideal

document Imagine what words could appear in that

document Formulate a query using those words

Language Modeling for IR

Estimate a multinomial probability distribution from the text

P(w|D) = (1-) P(w|D)+ P(w|C)

Smooth the distributionwith one estimated from the entire collection

Estimate probability that document generated the query terms

Query Likelihood

?P(Q|D) = P(q|D)

Kullback-Leibler Divergence

Estimate models for document and query and compare

KL(Q|D) = P(w|Q) log(P(w|Q) / P(w|D))

Question

Among the three classic information retrieval model, which one is your best choice in designing your retrieval system?

How can you tune the model parameters to achieve optimized performance?

When you have a new idea on retrieval problem, how can you prove it?

A Brief History of IR

Slides from Prof. Ray Larson University of California, Berkeley

School of Informationhttp://courses.sims.berkeley.edu/i240/s11/

IS 240 – Spring 2011

Experimental IR systems

Probabilistic indexing – Maron and Kuhns, 1960

SMART – Gerard Salton at Cornell – Vector space model, 1970’s

SIRE at Syracuse I3R – Croft Cheshire I (1990) TREC – 1992 Inquery Cheshire II (1994) MG (1995?) Lemur (2000?)

Historical Milestones in IR Research

1958 Statistical Language Properties (Luhn) 1960 Probabilistic Indexing (Maron & Kuhns) 1961 Term association and clustering

(Doyle) 1965 Vector Space Model (Salton) 1968 Query expansion (Roccio, Salton) 1972 Statistical Weighting (Sparck-Jones) 1975 2-Poisson Model (Harter, Bookstein,

Swanson) 1976 Relevance Weighting (Robertson,

Sparck-Jones) 1980 Fuzzy sets (Bookstein) 1981 Probability without training (Croft)

Historical Milestones in IR Research (cont.)

1983 Linear Regression (Fox) 1983 Probabilistic Dependence (Salton,

Yu) 1985 Generalized Vector Space Model

(Wong, Rhagavan) 1987 Fuzzy logic and RUBRIC/TOPIC (Tong,

et al.) 1990 Latent Semantic Indexing (Dumais,

Deerwester) 1991 Polynomial & Logistic Regression

(Cooper, Gey, Fuhr) 1992 TREC (Harman) 1992 Inference networks (Turtle, Croft) 1994 Neural networks (Kwok) 1998 Language Models (Ponte, Croft)

Information Retrieval – Historical View

Boolean model, statistics of language (1950’s)

Vector space model, probablistic indexing, relevance feedback (1960’s)

Probabilistic querying (1970’s)

Fuzzy set/logic, evidential reasoning (1980’s)

Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)

DIALOG, Lexus-Nexus, STAIRS (Boolean based) Information industry

(O($B)) Verity TOPIC (fuzzy

logic)

Internet search engines (O($100B?)) (vector space, probabilistic)

Research Industry

Research Systems Software

INQUERY (Croft) OKAPI (Robertson) PRISE (Harman)

http://potomac.ncsl.nist.gov/prise SMART (Buckley) MG (Witten, Moffat) CHESHIRE (Larson)

http://cheshire.berkeley.edu LEMUR toolkit Lucene Others

Lemur Project

Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman

Zoology 101

Lemurs are primates found only in Madagascar

50 species (17 are endangered)

Ring-tailed lemurs lemur catta

Zoology 101

The indri is the largest type of lemur

When first spotted the natives yelled “Indri! Indri!”

Malagasy for "Look! Over there!"

About The Lemur Project

The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project.

The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search.

Installation

http://www.lemurproject.org Linux, OS/X:

Extract software/lemur-4.12.tar.gz ./configure --prefix=/install/path ./make ./make install

Windows Run software/lemur-4.12-install.exe Documentation in windoc/index.html

Installation

Use Lemur-4.12 instead~ JAVA Runtime(JDK 6) need for evaluation tool. Environment Variable : PATH

Linux: modify ~/.bash_profile Windows: MyComputer/Properties…

Indexing

Document Preparation Indexing Parameters Time and Space Requirements

Two Index Formats

KeyFile Term Positions Metadata Offline Incremental InQuery Query

Language

Indri Term Positions Metadata Fields / Annotations Online Incremental InQuery and Indri

Query Languages

Indexing – Document Preparation

TREC Text TREC Web Plain Text Microsoft Word(*)

Microsoft PowerPoint(*)

Document Formats:The Lemur Toolkit can inherently deal with several different document format types without any modification:

HTML XML PDF Mbox

(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.

Indexing – Document Preparation

If your documents are not in a format that the Lemur Toolkit can inherently process:

1. If necessary, extract the text from the document.2. Wrap the plaintext in TREC-style wrappers:

<DOCNO>document_id</DOCNO>

<TEXT>

Index this document text.

</TEXT>

</DOC>

– or –For more advanced users, write your own parser to extend the Lemur Toolkit.

Indexing - Parameters

Basic usage to build index: IndriBuildIndex <parameter_file>

Parameter file includes options for Where to find your data files Where to place the index How much memory to use Stopword, stemming, fields Many other parameters.

Indexing – Parameters

Standard parameter file specification an XML document:

</parameters>

where to find your source files and what type to expect

BuildIndex <dataFiles> name of file containing list of datafiles to index.

IndriBuildIndex<parameters><corpus>

<path>/path/to/source/files</path><class>trectext</class>

</corpus></parameters>

The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index If index does not exist, it will create a new one If index already exists, it will append new

documents into the index.

<index>/path/to/the/index</index>

</parameters>

<memory> - used to define a “soft-limit” of the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for

gigabytes.

</parameters>

Stopwords defined within <stopwords>filename</stopwords>

IndriBuildIndex<parameters>

<word>first_word</word>

<word>final_word</word>

</stopper>

</parameters>

Term stemming can be used while indexing as well via the <stemmer> tag. Specify the stemmer type via the <name> tag

within. Stemmers included with the Lemur Toolkit

include the Krovetz Stemmer and the Porter Stemmer.

<name>krovetz</name>

</stemmer>

</parameters>

Retrieval

Parameters Query Formatting Interpreting Results

Retrieval - Parameters

Basic usage for retrieval: IndriRunQuery/RetEval <parameter_file>

Parameter file includes options for Where to find the index The query or queries How much memory to use Formatting options Many other parameters.

The <index> parameter tells IndriRunQuery/RetEval where to find the repository.

<index>/path/to/the/index</index>

</parameters>

The <query> parameter specifies a query plain text or using the

Indri query language<parameters>

<query> <number>1</number> <text>this is the first

query</text> </query>

<query> <number>2</number> <text>another query to

run</text> </query></parameters>

Query file format<DOC><DOCNO> 1 </DOCNO> What articles exist which

deal with TSS (Time Sharing System), anoperating system for IBM computers?

</DOC><DOC><DOCNO> 2 </DOCNO> I am interested in articles

written either by Prieve or Udo PoochPrieve, B.Pooch, U.

</DOC>

Retrieval – Query Formatting

TREC-style topics are not directly able to be processed via IndriRunQuery/RetEval. Format the queries accordingly:

Format by hand Write a script to extract the fields ( 可爱的

Python~)

Retrieval – Parameters

To specify a maximum number of results to return, use the <count> tag:

</parameters>

Result formatting options:

IndriRunQuery/RetEval has built in formatting specifications for TREC and INEX retrieval tasks

Retrieval – Parameters

TREC – Formatting directives: <runID>: a string specifying the id for a query

run, used in TREC scorable output. <trecFormat>: true to produce TREC scorable

output, otherwise use false (default).

<runID>runName</runID>

</parameters>

Outputting INEX Result Format

Must be wrapped in <inex> tags <participant-id>: specifies the participant-id attribute

used in submissions. <task>: specifies the task attribute (default

CO.Thorough). <query>: specifies the query attribute (default

automatic). <topic-part>: specifies the topic-part attribute (default

T). <description>: specifies the contents of the description

<participant-id>LEMUR001</participant-id></inex>

</parameters>

Retrieval - Evaluation

To use trec_eval: format IndriRunQuery results with appropriate

trec_eval formatting directives in the parameter file:

<runID>runName</runID> <trecFormat>true</trecFormat>

Resulting output will be in standard TREC format ready for evaluation:

150 Q0 AP890101-0001 1 -4.83646 runName150 Q0 AP890101-0015 2 -7.06236 runName

Use RetEval for TF.IDF

First run ParseToFile to convert doc formatted queries into queries

<outputFile>filename</outputFile>

<stemmer>stemmername</stemmer>

<stopwords>stopwordfile</stopwords>

</parameters>

ParseToFile paramfile queryfile http://www.lemurproject.org/lemur/

parsing.html#parsetofile

Use RetEval for TF.IDF

Then run RetEval

<parameters> <index>index</index> <retModel>0</retModel> // 0 for TF-IDF, 1 for Okapi, // 2 for KL-divergence, // 5 for cosine similarity <textQuery>querie filename</textQuery> <resultCount>1000</resultCount> <resultFile>tfidf.res</resultFile></parameters>

RetEval paramfile http://www.lemurproject.org/lemur/

retrieval.html#RetEval

Evluate Results

TREC qrels Ground Truth: judge by human assessors.

Ireval tool

java -jar “D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar”result qrels >pr.result

Use Lemur API & Make Extension to Lemur

When you have a new idea on retrieval problem, how can you prove it?

Introducing the API

Lemur “Classic” API Many objects, highly customizable May want to use this when you want to change

how the system works Support for clustering, distributed IR,

summarization Indri API

Two main objects Best for integrating search into larger applications Supports Indri query language, XML retrieval,

“live” incremental indexing, and parallel retrieval

Lemur “Classic” API

Primarily useful for retrieval operations Most indexing work in the toolkit has

moved to the Indri API Indri indexes can be used with Lemur

“Classic” retrieval applications Extensive documentation and tutorials on

the website (more are coming)

Lemur Index Browsing

The Lemur API gives access to the index data (e.g. inverted lists, collection statistics)

IndexManager::openIndex Returns a pointer to an index object Detects what kind of index you wish to open, and

returns the appropriate kind of index class Index::

docInfoList (inverted list), termInfoList (document vector), termCount, documentCount

Lemur: DocInfoList

Index::docInfoList( int termID )

Returns an iterator to the inverted list for termID.The list contains all documents that containtermID, including the positions where termIDoccurs.

Lemur: TermInfoList

Index::termInfoList( int docID )

Returns an iterator to the direct list for docID.The list contains term numbers for every termcontained in document docID, and the number of times each word occurs.(use termInfoListSeq to get word positions)

Index::termCounttermCount() : Total number of terms indexedtermCount( int id ) : Total number of occurrences of term number id.

Index::documentCount

docCount() : Number of documents indexed

docCount( int id ) : Number of documents that contain term number id.

Index::termterm( char* s ) : convert term string to a numberterm( int id ) : convert term number to a string

Index::documentdocument( char* s ) : convert doc string to a numberdocument( int id ) : convert doc number to a string

Index::docLength( int docID )The length, in number of terms, of document number docID.

Index::docLengthAvgAverage indexed document length

Index::termCountUnique

Size of the index vocabulary

Browsing Posting List

Lemur Retrieval

Class Name Description

TFIDFRetMethod BM25

SimpleKLRetMethod KL-Divergence

InQueryRetMethod Simplified InQuery

CosSimRetMethod Cosine

CORIRetMethod CORI

OkapiRetMethod Okapi

IndriRetMethod Indri (wraps QueryEnvironment)

Lemur Retrieval

RetMethodManager::runQuery query: text of the query index: pointer to a Lemur index modeltype: “cos”, “kl”, “indri”, etc. stopfile: filename of your stopword list stemtype: stemmer datadir: not currently used func: only used for Arabic stemmer model->scoreCollection()

RetMethodManager :: createModel

Lemur Retrieval Method

lemur::api::RetrievalMethod Class Reference

TextQueryMethod

A text query retrieval method is determined by specifying the following elements: A method to compute the query representation A method to compute the doc representation The scoring function A method to update the query representation

based on a set of (relevant) documents

TextQueryMethod Scoring Function

Given a query q =(q1,q2,...,qN) and a document d=(d1,d2,...,dN), where q1,...,qN and d1,...,dN are terms:

s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d) function w gives the weight of each matched

term function g makes it possible to perform any

further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length).

TextQueryMethod Scoring Function

ScoreFunction::matchedTermWeight compute the score contribution of a matched

term ScoreFunction:: adjustedScore

score adjustment (e.g., appropriate length normalization)

Lemur: Other tasks

Clustering: ClusterDB Distributed IR: DistMergeMethod Language models: UnigramLM,

DirichletUnigramLM, etc.

Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院...

Documents

Transcript of Lemur Toolkit Introduction wbia 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院...

IEEE TRANSACTIONS ON CLOUD COMPUTING 1 PISCES: …net.pku.edu.cn/vc/papers/tcc2016.pdf · Transactions on Cloud Computing ... PISCES: Optimizing Multi-job Application Execution in

WBIA Project 2 – Retrieval & Evaluation

PORADNIK ZASAD PISANIA PRACY DYPLOMOWEJ - …wbia.zut.edu.pl/fileadmin/pliki/wbia/zalaczniki/studenci/Poradnik... · Tre ść pracy dyplomowej powinna by ć przedstawiona mo żliwie

Wisconsin Power Piping - DSPS Home · Wisconsin Power Piping April 13-14, 2016 12th Annual DSPS \ WBIA 2016 Industry Days Country Springs Hotel & Conference Center ... ASHRAE 15 +

Szymon Skibicki, KBO WBiA, przykładowe zadania – obci ąż ...skibicki.zut.edu.pl/fileadmin/zadania_obciazenia/z... · Szymon Skibicki, KBO WBiA, przykładowe zadania – obci

Hongzhi Yin Bin Cui Spatio-Temporal Recommendation in ...net.pku.edu.cn/~cuibin/Papers/2016Springerbook.pdfSpatio-Temporal Recommendation in Social Media. SpringerBriefs in Computer

Text Summarization wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 12/17/2013.

STREAMING RANKING BASED RECOMMENDER …net.pku.edu.cn/daim/hongzhi.yin/slides/SIGIR18-Weiqing.pdfSTREAMING RANKING BASED RECOMMENDER SYSTEMS Weiqing Wang, Hongzhi Yin, Zi Huang, Qinyong

Recommender Systems with Big Data (RSBD) --Spatial Item ...net.pku.edu.cn/daim/hongzhi.yin/RSBD/RSBD-L7-Spatial Item... · •Differences between traditional GPS trajectories and

Cloud Transcoder: Bridging the Format and Resolution Gap between Internet Videos …net.pku.edu.cn/~lzh/papers/[NOSSDAV'12] Cloud Transcoder... · 2012-06-08 · Cloud Transcoder:

SizeSpotSigs: An Eﬀective Deduplicate Algorithm ...net.pku.edu.cn/~webg/TianwangLiterature/2011/2011...SizeSpotSigs: An Eﬀective Deduplicate Algorithm 541 As we know, in fact,

Henry Wadsworth Longfellow and his Paul Revere's Ride ...net.pku.edu.cn/~yhf/English/TOEFL/0001-0408ListeningComprehensio… · Listening Comprehension Scripts for TOEFL 0401, by

Probabilistic Use Cases: Discovering Behavioral Patterns ...net.pku.edu.cn/dlib/MOOCStudy/MOOCPaper/engagement...such as genetics [15] and web page recommendation [22]. In the latter,

Introduction to Cloud Computing course/cs402/2009/ 彭波 pb@net.pku.edu.cn 北京大学信息科学技术学院 6/30/2009.

HyperSight: Towards Scalable, High-coverage, and Dynamic ...net.pku.edu.cn/~yangtong/uploads/HyperSight_JSAC_20_CR.pdfAbstract—Performing ﬁne-grained and real-time network monitoring

Wisconsin · 2017-05-16 · Wisconsin DSPS / WBIA Industry Days - 2016 Department of Safety & Professional Services Industry Services Division …. Boiler & Elevator Safety …. Programs

MapReduce & BigTable wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 12/10/2013.

Prezentacja programu PowerPoint - Informacje bieżące WBIA ROAD MONUMENT THE WORLD OF LAMBORGHINI Monument serves as a gothic cathedral portal - a gate to the different world - \NorId

Information Retrieval Evaluation wbia 黄连恩 hle@net.pku.edu.cn 北京大学信息工程学院 10/21/2014.

WBIA Review