Lemur Toolkit Introduction wbia 彭波 [email protected] 北京大学信息科学技术学院...
-
Upload
randall-parker -
Category
Documents
-
view
280 -
download
0
Transcript of Lemur Toolkit Introduction wbia 彭波 [email protected] 北京大学信息科学技术学院...
Recap
Information Retrieval Models Vector Space Model Probabilistic models Language model
Some formulas for Some formulas for Sim(VSM)Sim(VSM)
Dot product
Cosine
Dice
Jaccard
3
i i iiiii
iii
i iii
iii
i iii
iii
iii
baba
ba
QDSim
ba
ba
QDSim
ba
ba
QDSim
baQDQDSim
) * (
) * (
),(
) * (2
),(
*
) * (
),(
) * (),(
22
22
22
t 1
D
Q
t 3
t 2
θ
BM25 (Okapi system) – BM25 (Okapi system) – Robertson et al.Robertson et al.
k1, k2, k3, b: parametersqtf: query term frequencydl: document lengthavdl: average document length
4
Doc. length normalizationTF
factors
Consider tf, qtf, document length
Standard Probabilistic IR
query
d1
d2
dn
…
Information need
document collection
matchingmatching
5
IR based on Language Model (LM)
query
d1
d2
dn
…
Information need
document collection
generationgeneration
)|( dMQP 1dM
2dM
…
ndM
A query generation process For an information need, imagine an ideal
document Imagine what words could appear in that
document Formulate a query using those words
6
Language Modeling for IR
Estimate a multinomial probability distribution from the text
P(w|D) = (1-) P(w|D)+ P(w|C)
Smooth the distributionwith one estimated from the entire collection
Estimate probability that document generated the query terms
Query Likelihood
?P(Q|D) = P(q|D)
Kullback-Leibler Divergence
Estimate models for document and query and compare
=?
KL(Q|D) = P(w|Q) log(P(w|Q) / P(w|D))
Question
Among the three classic information retrieval model, which one is your best choice in designing your retrieval system?
How can you tune the model parameters to achieve optimized performance?
When you have a new idea on retrieval problem, how can you prove it?
A Brief History of IR
Slides from Prof. Ray Larson University of California, Berkeley
School of Informationhttp://courses.sims.berkeley.edu/i240/s11/
IS 240 – Spring 2011
Experimental IR systems
Probabilistic indexing – Maron and Kuhns, 1960
SMART – Gerard Salton at Cornell – Vector space model, 1970’s
SIRE at Syracuse I3R – Croft Cheshire I (1990) TREC – 1992 Inquery Cheshire II (1994) MG (1995?) Lemur (2000?)
IS 240 – Spring 2011
Historical Milestones in IR Research
1958 Statistical Language Properties (Luhn) 1960 Probabilistic Indexing (Maron & Kuhns) 1961 Term association and clustering
(Doyle) 1965 Vector Space Model (Salton) 1968 Query expansion (Roccio, Salton) 1972 Statistical Weighting (Sparck-Jones) 1975 2-Poisson Model (Harter, Bookstein,
Swanson) 1976 Relevance Weighting (Robertson,
Sparck-Jones) 1980 Fuzzy sets (Bookstein) 1981 Probability without training (Croft)
IS 240 – Spring 2011
Historical Milestones in IR Research (cont.)
1983 Linear Regression (Fox) 1983 Probabilistic Dependence (Salton,
Yu) 1985 Generalized Vector Space Model
(Wong, Rhagavan) 1987 Fuzzy logic and RUBRIC/TOPIC (Tong,
et al.) 1990 Latent Semantic Indexing (Dumais,
Deerwester) 1991 Polynomial & Logistic Regression
(Cooper, Gey, Fuhr) 1992 TREC (Harman) 1992 Inference networks (Turtle, Croft) 1994 Neural networks (Kwok) 1998 Language Models (Ponte, Croft)
IS 240 – Spring 2011
Information Retrieval – Historical View
Boolean model, statistics of language (1950’s)
Vector space model, probablistic indexing, relevance feedback (1960’s)
Probabilistic querying (1970’s)
Fuzzy set/logic, evidential reasoning (1980’s)
Regression, neural nets, inference networks, latent semantic indexing, TREC (1990’s)
DIALOG, Lexus-Nexus, STAIRS (Boolean based) Information industry
(O($B)) Verity TOPIC (fuzzy
logic)
Internet search engines (O($100B?)) (vector space, probabilistic)
Research Industry
IS 240 – Spring 2011
Research Systems Software
INQUERY (Croft) OKAPI (Robertson) PRISE (Harman)
http://potomac.ncsl.nist.gov/prise SMART (Buckley) MG (Witten, Moffat) CHESHIRE (Larson)
http://cheshire.berkeley.edu LEMUR toolkit Lucene Others
Lemur Project
Some slides from Don Metzler, Paul Ogilvie & Trevor Strohman
Zoology 101
Lemurs are primates found only in Madagascar
50 species (17 are endangered)
Ring-tailed lemurs lemur catta
Zoology 101
The indri is the largest type of lemur
When first spotted the natives yelled “Indri! Indri!”
Malagasy for "Look! Over there!"
About The Lemur Project
The Lemur Project was started in 2000 by the Center for Intelligent Information Retrieval (CIIR) at the University of Massachusetts, Amherst, and the Language Technologies Institute (LTI) at Carnegie Mellon University. Over the years, a large number of UMass and CMU students and staff have contributed to the project.
The project's first product was the Lemur Toolkit, a collection of software tools and search engines designed to support research on using statistical language models for information retrieval tasks. Later the project added the Indri search engine for large-scale search, the Lemur Query Log Toolbar for capture of user interaction data, and the ClueWeb09 dataset for research on web search.
Installation
http://www.lemurproject.org Linux, OS/X:
Extract software/lemur-4.12.tar.gz ./configure --prefix=/install/path ./make ./make install
Windows Run software/lemur-4.12-install.exe Documentation in windoc/index.html
Installation
Use Lemur-4.12 instead~ JAVA Runtime(JDK 6) need for evaluation tool. Environment Variable : PATH
Linux: modify ~/.bash_profile Windows: MyComputer/Properties…
Indexing
Document Preparation Indexing Parameters Time and Space Requirements
Two Index Formats
KeyFile Term Positions Metadata Offline Incremental InQuery Query
Language
Indri Term Positions Metadata Fields / Annotations Online Incremental InQuery and Indri
Query Languages
Indexing – Document Preparation
TREC Text TREC Web Plain Text Microsoft Word(*)
Microsoft PowerPoint(*)
Document Formats:The Lemur Toolkit can inherently deal with several different document format types without any modification:
HTML XML PDF Mbox
(*) Note: Microsoft Word and Microsoft PowerPoint can only be indexed on a Windows-based machine, and Office must be installed.
Indexing – Document Preparation
If your documents are not in a format that the Lemur Toolkit can inherently process:
1. If necessary, extract the text from the document.2. Wrap the plaintext in TREC-style wrappers:
<DOC>
<DOCNO>document_id</DOCNO>
<TEXT>
Index this document text.
</TEXT>
</DOC>
– or –For more advanced users, write your own parser to extend the Lemur Toolkit.
Indexing - Parameters
Basic usage to build index: IndriBuildIndex <parameter_file>
Parameter file includes options for Where to find your data files Where to place the index How much memory to use Stopword, stemming, fields Many other parameters.
Indexing – Parameters
Standard parameter file specification an XML document:
<parameters>
<option></option>
<option></option>
…
<option></option>
</parameters>
Indexing – Parameters
where to find your source files and what type to expect
BuildIndex <dataFiles> name of file containing list of datafiles to index.
IndriBuildIndex<parameters><corpus>
<path>/path/to/source/files</path><class>trectext</class>
</corpus></parameters>
Indexing - Parameters
The <index> parameter tells IndriBuildIndex where to create or incrementally add to the index If index does not exist, it will create a new one If index already exists, it will append new
documents into the index.
<parameters>
<index>/path/to/the/index</index>
</parameters>
Indexing - Parameters
<memory> - used to define a “soft-limit” of the amount of memory the indexer should use before flushing its buffers to disk. Use K for kilobytes, M for megabytes, and G for
gigabytes.
<parameters>
<memory>256M</memory>
</parameters>
Indexing - Parameters
Stopwords defined within <stopwords>filename</stopwords>
IndriBuildIndex<parameters>
<stopper>
<word>first_word</word>
<word>next_word</word>
…
<word>final_word</word>
</stopper>
</parameters>
Indexing – Parameters
Term stemming can be used while indexing as well via the <stemmer> tag. Specify the stemmer type via the <name> tag
within. Stemmers included with the Lemur Toolkit
include the Krovetz Stemmer and the Porter Stemmer.
<parameters>
<stemmer>
<name>krovetz</name>
</stemmer>
</parameters>
Retrieval
Parameters Query Formatting Interpreting Results
Retrieval - Parameters
Basic usage for retrieval: IndriRunQuery/RetEval <parameter_file>
Parameter file includes options for Where to find the index The query or queries How much memory to use Formatting options Many other parameters.
Retrieval - Parameters
The <index> parameter tells IndriRunQuery/RetEval where to find the repository.
<parameters>
<index>/path/to/the/index</index>
</parameters>
Retrieval - Parameters
The <query> parameter specifies a query plain text or using the
Indri query language<parameters>
<query> <number>1</number> <text>this is the first
query</text> </query>
<query> <number>2</number> <text>another query to
run</text> </query></parameters>
Query file format<DOC><DOCNO> 1 </DOCNO> What articles exist which
deal with TSS (Time Sharing System), anoperating system for IBM computers?
</DOC><DOC><DOCNO> 2 </DOCNO> I am interested in articles
written either by Prieve or Udo PoochPrieve, B.Pooch, U.
</DOC>
Retrieval – Query Formatting
TREC-style topics are not directly able to be processed via IndriRunQuery/RetEval. Format the queries accordingly:
Format by hand Write a script to extract the fields ( 可爱的
Python~)
Retrieval – Parameters
To specify a maximum number of results to return, use the <count> tag:
<parameters>
<count>50</count>
</parameters>
Retrieval - Parameters
Result formatting options:
IndriRunQuery/RetEval has built in formatting specifications for TREC and INEX retrieval tasks
Retrieval – Parameters
TREC – Formatting directives: <runID>: a string specifying the id for a query
run, used in TREC scorable output. <trecFormat>: true to produce TREC scorable
output, otherwise use false (default).
<parameters>
<runID>runName</runID>
<trecFormat>true</trecFormat>
</parameters>
Outputting INEX Result Format
Must be wrapped in <inex> tags <participant-id>: specifies the participant-id attribute
used in submissions. <task>: specifies the task attribute (default
CO.Thorough). <query>: specifies the query attribute (default
automatic). <topic-part>: specifies the topic-part attribute (default
T). <description>: specifies the contents of the description
tag.
<parameters><inex>
<participant-id>LEMUR001</participant-id></inex>
</parameters>
Retrieval - Evaluation
To use trec_eval: format IndriRunQuery results with appropriate
trec_eval formatting directives in the parameter file:
<runID>runName</runID> <trecFormat>true</trecFormat>
Resulting output will be in standard TREC format ready for evaluation:
<queryID> Q0 <DocID> <rank> <score> <runID>
150 Q0 AP890101-0001 1 -4.83646 runName150 Q0 AP890101-0015 2 -7.06236 runName
Use RetEval for TF.IDF
First run ParseToFile to convert doc formatted queries into queries
<parameters>
<docFormat>web</docFormat>
<outputFile>filename</outputFile>
<stemmer>stemmername</stemmer>
<stopwords>stopwordfile</stopwords>
</parameters>
ParseToFile paramfile queryfile http://www.lemurproject.org/lemur/
parsing.html#parsetofile
Use RetEval for TF.IDF
Then run RetEval
<parameters> <index>index</index> <retModel>0</retModel> // 0 for TF-IDF, 1 for Okapi, // 2 for KL-divergence, // 5 for cosine similarity <textQuery>querie filename</textQuery> <resultCount>1000</resultCount> <resultFile>tfidf.res</resultFile></parameters>
RetEval paramfile http://www.lemurproject.org/lemur/
retrieval.html#RetEval
Evluate Results
TREC qrels Ground Truth: judge by human assessors.
Ireval tool
java -jar “D:\Program Files\Lemur\Lemur 4.12\bin\ireval.jar”result qrels >pr.result
Use Lemur API & Make Extension to Lemur
Task
When you have a new idea on retrieval problem, how can you prove it?
Introducing the API
Lemur “Classic” API Many objects, highly customizable May want to use this when you want to change
how the system works Support for clustering, distributed IR,
summarization Indri API
Two main objects Best for integrating search into larger applications Supports Indri query language, XML retrieval,
“live” incremental indexing, and parallel retrieval
Lemur “Classic” API
Primarily useful for retrieval operations Most indexing work in the toolkit has
moved to the Indri API Indri indexes can be used with Lemur
“Classic” retrieval applications Extensive documentation and tutorials on
the website (more are coming)
Lemur Index Browsing
The Lemur API gives access to the index data (e.g. inverted lists, collection statistics)
IndexManager::openIndex Returns a pointer to an index object Detects what kind of index you wish to open, and
returns the appropriate kind of index class Index::
docInfoList (inverted list), termInfoList (document vector), termCount, documentCount
Lemur: DocInfoList
Index::docInfoList( int termID )
Returns an iterator to the inverted list for termID.The list contains all documents that containtermID, including the positions where termIDoccurs.
Lemur: TermInfoList
Index::termInfoList( int docID )
Returns an iterator to the direct list for docID.The list contains term numbers for every termcontained in document docID, and the number of times each word occurs.(use termInfoListSeq to get word positions)
Lemur Index Browsing
Index::termCounttermCount() : Total number of terms indexedtermCount( int id ) : Total number of occurrences of term number id.
Index::documentCount
docCount() : Number of documents indexed
docCount( int id ) : Number of documents that contain term number id.
Lemur Index Browsing
Index::termterm( char* s ) : convert term string to a numberterm( int id ) : convert term number to a string
Index::documentdocument( char* s ) : convert doc string to a numberdocument( int id ) : convert doc number to a string
Lemur Index Browsing
Index::docLength( int docID )The length, in number of terms, of document number docID.
Index::docLengthAvgAverage indexed document length
Index::termCountUnique
Size of the index vocabulary
Browsing Posting List
Lemur Retrieval
Class Name Description
TFIDFRetMethod BM25
SimpleKLRetMethod KL-Divergence
InQueryRetMethod Simplified InQuery
CosSimRetMethod Cosine
CORIRetMethod CORI
OkapiRetMethod Okapi
IndriRetMethod Indri (wraps QueryEnvironment)
Lemur Retrieval
RetMethodManager::runQuery query: text of the query index: pointer to a Lemur index modeltype: “cos”, “kl”, “indri”, etc. stopfile: filename of your stopword list stemtype: stemmer datadir: not currently used func: only used for Arabic stemmer model->scoreCollection()
RetMethodManager :: createModel
Lemur Retrieval Method
lemur::api::RetrievalMethod Class Reference
TextQueryMethod
A text query retrieval method is determined by specifying the following elements: A method to compute the query representation A method to compute the doc representation The scoring function A method to update the query representation
based on a set of (relevant) documents
TextQueryMethod Scoring Function
Given a query q =(q1,q2,...,qN) and a document d=(d1,d2,...,dN), where q1,...,qN and d1,...,dN are terms:
s(q,d) = g(w(q1,d1,q,d) + ... + w(qN,dN,q,d),q,d) function w gives the weight of each matched
term function g makes it possible to perform any
further transformation of the sum of the weight of all matched terms based on the "summary" information of a query or a document (e.g., document length).
TextQueryMethod Scoring Function
ScoreFunction::matchedTermWeight compute the score contribution of a matched
term ScoreFunction:: adjustedScore
score adjustment (e.g., appropriate length normalization)
Lemur: Other tasks
Clustering: ClusterDB Distributed IR: DistMergeMethod Language models: UnigramLM,
DirichletUnigramLM, etc.
More Stories about Indri
Why Can’t We All Get Along?Structured Data and Information Retrieval
Prof. Bruce CroftCenter for Intelligent Information Retrieval (CIIR)
University of Massachusetts Amherst
Similarities and Differences
Common interest in providing efficient access to information on a very large scale indexing and optimization key topics
Until recently, concern about effectiveness (accuracy) of access was domain of IR
Focus on structured vs. unstructured data is historically true but less relevant today
Statistical inference and ranking are central to IR, becoming more important in DB
Similarities and Differences
IR systems have focused on providing access to information rather than answers e.g. Web search evaluation typically based on topical relevance
and user relevance rather than correctness (except QA)
IR works with multiple databases but not multiple relations
IR query languages more like calculus than algebra
Integrity, security, concurrency are central for DB, less so in IR
What is the Goal?
An IR system with extended capability for structured data i.e. extend IR model to include combination of
evidence from structured and unstructured components of complex objects (documents)
backend database system used to store objects (cf. “one hand clapping”)
many applications look like this (e.g. desktop search, web shopping)
users seem to prefer this approach (simple queries or forms and ranking)
Combination of Evidence
Example: Where was George Washington born? Returns a ranked list of sentences
containing the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity
“Structured Query” .vs. TextQuery
Indri – A Candidate IR System
Indri is a separate, downloadable component of the Lemur Toolkit
Influences INQUERY [Callan, et. al. ’92]
Inference network framework Structured Query language
Lemur [http://www.lemurproject.org] Language modeling (LM) toolkit
Lucene [http://jakarta.apache.org/lucene/docs/index.html] Popular off the shelf Java-based IR system Based on heuristic retrieval models
Designed for new retrieval environments i.e. GALE, CALO, AQUAINT, Web retrieval, and XML retrieval
Zoology 101
The indri is the largest type of lemur
When first spotted the natives yelled “Indri! Indri!”
Malagasy for "Look! Over there!"
Design Goals
Off the shelf (Windows, *NIX, Mac platforms) Simple to set up and use Fully functional API w/ language wrappers for Java, etc…
Robust retrieval model Inference net + language modeling [Metzler and Croft ’04]
Powerful query language Designed to be simple to use, yet support complex
information needs Provides “adaptable, customizable scoring”
Scalable Highly efficient code Distributed retrieval Incremental update
Model
Based on original inference network retrieval framework [Turtle and Croft ’91]
Casts retrieval as inference in simple graphical model
Extensions made to original model Incorporation of probabilities based on language modeling
rather than tf.idf Multiple language models allowed in the network (one per
indexed context)
Model
D
θtitle θbody θh1
r1 rN… r1 rN
… r1 rN…
I
q1 q2
α,βtitle
α,βbody
α,βh1
Document node (observed)
Model hyperparameters (observed)
Context language models
Representation nodes(terms, phrases, etc…)
Belief nodes(#combine, #not, #max)Information need node
(belief node)
Model
I
D
θtitle θbody θh1
r1 rN… r1 rN
… r1 rN…
q1 q2
α,βtitle
α,βbody
α,βh1
P( r | θ )
Probability of observing a term, phrase, or feature given a context language model ri nodes are binary
Assume r ~ Bernoulli( θ ) “Model B” – [Metzler, Lavrenko, Croft ’04]
Nearly any model may be used here tf.idf-based estimates (INQUERY) Mixture models
Model
I
D
θtitle θbody θh1
r1 rN… r1 rN
… r1 rN…
q1 q2
α,βtitle
α,βbod
yα,βh1
P( θ | α, β, D )
Prior over context language model determined by α, β
Assume P( θ | α, β ) ~ Beta( α, β ) Bernoulli’s conjugate prior αr = μP( r | C ) + 1 βr = μP( ¬ r | C ) + 1 μ is a free parameter
||
)|(),,|()|(),,|( ,
D
CrPtfDPrPDrP Dr
ii
Model
I
D
θtitle θbody θh1
r1 rN… r1 rN
… r1 rN…
q1 q2
α,βtitle
α,βbod
yα,βh1
P( q | r ) and P( I | r )
Belief nodes are created dynamically based on query
Belief node estimates are derived from standard link matrices Combine evidence from parents in various
ways Allows fast inference by making
marginalization computationally tractable Information need node is simply a
belief node that combines all network evidence into a single value
Documents are ranked according to P( I | α, β, D)
Belief Inference
p1 p2
q
Pn ……
Example: #AND
A B
Q
BA
BABABABA
BABABABA
baand
pp
pppppppp
pptttPppfttPpptftPppfftP
bBPaAPbBaAtrueQPtrueQP
1)1(0)1(0)1)(1(0
),|()1(),|()1)(,|()1)(1)(,|(
)()(),|()(,
#
P(Q=true|a,b) A B
0 false false
0 false true
0 true false
1 true true
Other Belief Operators
Document Representation
<html><head><title>Department Descriptions</title></head><body>The following list describes … <h1>Agriculture</h1> …<h1>Chemistry</h1> … <h1>Computer Science</h1> …<h1>Electrical Engineering</h1> ……<h1>Zoology</h1></body></html>
<title>department descriptions</title>
<h1>agriculture</h1> <h1>chemistry</h1>… <h1>zoology</h1>
.
.
.
<body>the following list describes … <h1>agriculture</h1> … </body>
<title> context
<body> context
<h1> context 1. agriculture
2. chemistry…36. zoology
<h1>extents
1. the following list describes <h1>agriculture</h1> …
<body>extents
1. department descriptions
<title>extents
Terms
Type Example Matches
Stemmed term dog All occurrences of dog (and its stems)
Surface term “dogs” Exact occurrences of dogs (without stemming)
Term group (synonym group)
<”dogs” canine>
All occurrences of dogs (without stemming) or canine (and its stems)
POS qualified term <”dogs” canine>.NNS
Same as previous, except matches must also be tagged with the NNS POS tag
Proximity
Type Example Matches
#odN(e1 … em) or#N(e1 … em)
#od5(dog cat) or#5(dog cat)
All occurrences of dog and cat appearing ordered within a window of 5 words
#uwN(e1 … em) #uw5(dog cat) All occurrences of dog and cat that appear in any order within a window of 5 words
#phrase(e1 … em) #phrase(#1(willy wonka)#uw3(chocolate factory))
System dependent implementation (defaults to #odm)
#syntax:xx(e1 … em)
#syntax:np(fresh powder)
System dependent implementation
Context Restriction
Example Matches
dog.title All occurrences of dog appearing in the title context
dog.title,paragraph All occurrences of dog appearing in both a title and paragraph contexts (may not be possible)
<dog.title dog.paragraph>
All occurrences of dog appearing in either a title context or a paragraph context
#5(dog cat).head All matching windows contained within a head context
Context Evaluation
Example Evaluated
dog.(title) The term dog evaluated using the title context as the document
dog.(title, paragraph)
The term dog evaluated using the concatenation of the title and paragraph contexts as the document
dog.figure(paragraph)
The term dog restricted to figure tags within the paragraph context.
Belief Operators
INQUERY INDRI
#sum / #and #combine
#wsum* #weight
#or #or
#not #not
#max #max
* #wsum is still available in INDRI, but should be used with discretion
Extent Retrieval
Example Evaluated
#combine[section](dog canine)
Evaluates #combine(dog canine) for each extent associated with the section context
#combine[title, section](dog canine)
Same as previous, except is evaluated for each extent associated with either the title context or the section context
#sum(#sum[section](dog)) Returns a single score that is the #sum of the scores returned from #sum(dog) evaluated for each section extent
#max(#sum[section](dog)) Same as previous, except returns the maximum score
Ad Hoc Retrieval
Flat documents Query likelihood retrieval:
q1 … qN ≡ #combine( q1 … qN)
SGML/XML documents Can either retrieve documents or extents Context restrictions and context evaluations
allow exploitation of document structure
Web Search
Homepage / known-item finding Use mixture model of several document
representations [Ogilvie and Callan ’03]
Example query: Yahoo!#combine( #wsum( 0.2 yahoo.(body) 0.5 yahoo.(inlink) 0.3 yahoo.(title) ) )
)|()|()|(),|( titletitleinlinkinlinkbodybody wPλwPwPwP
Example Indri Web Query
#weight( 0.1 #weight( 1.0 #prior(pagerank) 0.75 #prior(inlinks) ) 1.0 #weight( 0.9 #combine( #wsum( 1 stellwagen.(inlink) 1 stellwagen.(title) 3 stellwagen.(mainbody) 1 stellwagen.(heading) ) #wsum( 1 bank.(inlink) 1 bank.(title) 3 bank.(mainbody) 1 bank.(heading) ) ) 0.1 #combine( #wsum( 1 #uw8( stellwagen bank ).(inlink) 1 #uw8( stellwagen bank ).(title) 3 #uw8( stellwagen bank ).(mainbody) 1 #uw8( stellwagen bank ).(heading)) ) ) )
Question Answering
More expressive passage- and sentence-level retrieval
Example: Where was George Washington born?
#combine[sentence]( #1( george washington )
born #any:place) Returns a ranked list of sentences containing
the phrase George Washington, the term born, and a snippet of text tagged as a PLACE named entity
Indri Examples
Paragraphs from news feed articles published between 1991 and 2000 that mention a person, a monetary amount, and the company InfoCom
#filreq(#band( NewsFeed.doctype #date:between(1991 2000) ) #combine[paragraph]( #any:person #any:money InfoCom
) )
Getting Help
http://www.lemurproject.org Central website, tutorials, documentation, news
http://www.lemurproject.org/phorum Discussion board, developers read and respond
to questions README file in the code distribution
Paul Ogilvie Trevor Strohman Don Metzler
Thank You!
Q&A