Page 1: IRTools Software Overview Gregory B. Newby UNC Chapel Hill gbnewby@ils.unc.edu.

IRTools Software Overview

Gregory B. Newby

UNC Chapel Hill

gbnewby@ils.unc.edu

Download & Participate

IRTools is a work in progress. Check back in the spring for more software and test cases.

Currently, only some parts work

Want to help? We use CVS for distributed development

Our project page: http://sourceforge.net/projects/irtools

Design Principles

For IR Researchers

A programming toolkit, not an IR system

Implements major approaches to IR (Boolean, VSM, Probabilistic & LSI)

Scalable to billions of documents

High performance algorithms and structures

Expandable

Documented: http://ils.unc.edu/tera

Major Components

Spider: Gathers documents on the live Web

Indexer: Builds internal representations of documents

Retrieval Engine: Processes queries and generates results

Implementation

Mostly in C++, using the GNU compiler

Uses the Standard Template Library

Tested on Solaris & Linux (Alpha & 386)

Designed for modularity, so IR researchers can add their own components

Why Might You Use IRTools?

If you have your own IR software, there’s probably no need

If you are looking for experimental IR software, this might be a good alternative (goal: to be suitable for general use in mid-2002)

IRTools should be useful for classroom use and demonstration

For production use, consider ht://dig

Design Snippet: Word List

Berkeley DB is used to store the term-to-termID lookup table

A single file, accessed by hash in a B+ tree

struct term_termID {
    char *term;
    irt_int termID;
};
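The lookup described above can be sketched in memory. This is only an illustration: it uses `std::unordered_map` as a stand-in for the Berkeley DB file, and the `WordList` class, its method names, and the 32-bit `irt_int` typedef are assumptions, not part of IRTools.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Assumption: irt_int is a 32-bit integer; the real IRTools typedef may differ.
typedef std::int32_t irt_int;

// Hypothetical in-memory stand-in for the on-disk term-to-termID table.
class WordList {
public:
    // Return the termID for `term`, assigning the next unused ID if it is new.
    irt_int lookup_or_add(const std::string& term) {
        auto it = ids_.find(term);
        if (it != ids_.end()) return it->second;
        irt_int id = static_cast<irt_int>(ids_.size());
        ids_.emplace(term, id);
        return id;
    }

    // Return the termID for `term`, or -1 if the term is unknown.
    irt_int lookup(const std::string& term) const {
        auto it = ids_.find(term);
        return it == ids_.end() ? -1 : it->second;
    }

    // Number of distinct terms seen so far.
    irt_int size() const { return static_cast<irt_int>(ids_.size()); }

private:
    std::unordered_map<std::string, irt_int> ids_;
};
```

Assigning IDs densely from 0 is a choice made for this sketch; it keeps a termID usable as a record number in a fixed-length file.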

Design Snippet: 1st Inverted Index File

Binary file with fixed-length records

Accessed by offset: termID * sizeof(struct)

Gives basic info needed for weighting

Points to more files for inverted entries (the actual documents for this term)

Some duplication (e.g., meantf) to prevent additional I/O

Design Snippet: 1st Inverted Index File

struct inv_file1 {
    irt_int termID;
    irt_int term_doccount;    // Frequency
    irt_int meantf;           // For weighting
    irt_int nt;               // # terms in this doc
    irt_int file2_location;   // File for entries
    irt_int starting_offset;  // File 2 loc
    irt_int entry_count;      // # occurrences of this term in file 2
};
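Because the records are fixed-length, a term's record can be reached with a single seek. The following is a minimal sketch of that access pattern using C stdio; `write_record` and `read_record` are hypothetical helper names, and the 32-bit `irt_int` is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Assumption: irt_int is a 32-bit integer; the real IRTools typedef may differ.
typedef std::int32_t irt_int;

struct inv_file1 {
    irt_int termID;
    irt_int term_doccount;
    irt_int meantf;
    irt_int nt;
    irt_int file2_location;
    irt_int starting_offset;
    irt_int entry_count;
};

// Byte offset of the record for termID: termID * sizeof(struct).
static long record_offset(irt_int termID) {
    assert(termID >= 0);
    return static_cast<long>(termID) * static_cast<long>(sizeof(inv_file1));
}

// Hypothetical helper: store rec at the slot selected by rec.termID.
bool write_record(std::FILE* f, const inv_file1& rec) {
    if (std::fseek(f, record_offset(rec.termID), SEEK_SET) != 0) return false;
    return std::fwrite(&rec, sizeof rec, 1, f) == 1;
}

// Hypothetical helper: load the record for termID with one seek and one read.
// Returns false if the slot is missing or holds a different termID.
bool read_record(std::FILE* f, irt_int termID, inv_file1* out) {
    if (std::fseek(f, record_offset(termID), SEEK_SET) != 0) return false;
    if (std::fread(out, sizeof *out, 1, f) != 1) return false;
    return out->termID == termID;  // sanity check on the stored ID
}
```

Writing raw structs ties the file to one machine's byte order and struct layout, which is a simplification made here for brevity.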

Design Snippet: 2nd Inverted Index File

Info about documents with this term

Using Page Rank, best docs can be listed earliest (avoiding subsequent disk I/O)

Multiple 2nd files for larger collections

struct inv_file2 {
    irt_int termID;           // Sanity check
    irt_int file_location;    // Next file
    irt_int starting_offset,
            num_entries;      // As for file1
};

Design Snippet: 2nd Inverted Index File

For each document with this term:

struct inv_docentry {
    irt_int term_in_doc_count;
    // For weighting:
    irt_int doc_unique_terms;
    irt_int doc_total_terms;
    // 3rd file offset
    irt_int file3_location;
};

Design Snippet: 3rd Inverted Index File

This lists a term’s locations in documents

irt_int termID // Sanity check

Followed by term_in_doc_count irt_ints indicating the positions of this term in this document

Usable for a NEAR operator
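One way position lists could support NEAR is a two-pointer scan over the positions of two terms in the same document. This is a sketch, not the IRTools implementation: `near_match` is a hypothetical name, and it assumes each position list is stored in ascending order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption: irt_int is a 32-bit integer; the real IRTools typedef may differ.
typedef std::int32_t irt_int;

// Return true if some position in `a` lies within `window` positions of some
// position in `b`. Both lists must be sorted in ascending order.
bool near_match(const std::vector<irt_int>& a,
                const std::vector<irt_int>& b,
                irt_int window) {
    assert(window >= 0);
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        irt_int gap = a[i] - b[j];
        if (gap < 0) gap = -gap;
        if (gap <= window) return true;
        // Advance whichever list is behind; that is the only move that can
        // shrink the gap.
        if (a[i] < b[j]) ++i; else ++j;
    }
    return false;
}
```

For example, with positions {3, 40, 95} and {10, 44, 200}, a window of 5 matches (40 and 44) while a window of 3 does not.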

Planned & Current Components

Current:

Various stemmers and stoplists

Various weighting schemes

Sparse matrix formats for LSI etc.

Boolean AND & OR

TREC output

Visual interfaces

Designed & Planned:

Page Rank

Integrated spider

Boolean NEAR

Update & delete entries

Concurrent retrieval engine clients

Concurrent indexers

Global Collection Variables

maxn: highest # of terms in any doc

maxUn: highest # of unique terms in any doc

Nterms: total known terms

Ndocs: total known documents
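These four values could be maintained incrementally as each document is indexed. The sketch below shows one way to do that; the `CollectionStats` struct, its `add_document` method, and the 32-bit `irt_int` are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Assumption: irt_int is a 32-bit integer; the real IRTools typedef may differ.
typedef std::int32_t irt_int;

// Hypothetical holder for the global collection variables.
struct CollectionStats {
    irt_int maxn = 0;    // highest # of terms in any doc
    irt_int maxUn = 0;   // highest # of unique terms in any doc
    irt_int Nterms = 0;  // total known terms
    irt_int Ndocs = 0;   // total known documents

    // Fold one indexed document into the totals. `new_terms` is the number
    // of terms in this document that the collection had not seen before.
    void add_document(irt_int doc_total_terms,
                      irt_int doc_unique_terms,
                      irt_int new_terms) {
        assert(doc_unique_terms <= doc_total_terms);
        assert(new_terms <= doc_unique_terms);
        maxn = std::max(maxn, doc_total_terms);
        maxUn = std::max(maxUn, doc_unique_terms);
        Nterms += new_terms;
        ++Ndocs;
    }
};
```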

Design Snippet: Boolean Candidate Merging

Works for OR or AND

Minimal disk I/O (needed for inverted index only)

Doesn’t require inverted index to be sorted in docID order

The STL map can be problematic for more than about 20K candidates; using documents that are Page Rank’ed can help shrink the candidate set (and speed up everything)

Start with terms with the lowest frequency; we only continue until we have enough hits

Design Snippet: Boolean Candidate Merging

irt_int NFULL = 0;                        // stop with enough hits
vector<irt_int> full;                     // docIDs w. all q terms
map<irt_int, candidate_info> candidates;  // Candidates, keyed by docID

struct candidate_info {        // For each doc
    irt_int docID;             // this doc’s ID
    irt_int nt;                // # terms in this doc, for weighting
    irt_int meantf;            // mean tf in this doc, for weighting
    float tf[NQUERYTERMS];     // For weighting
    irt_short qtcount;         // # query terms in doc
};

The map eliminates sorting!

We must allocate memory for every candidate
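The merging idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the IRTools code: `merge_candidates` is a hypothetical function, the weighting fields are omitted from `candidate_info`, posting lists are plain in-memory vectors in which each docID appears at most once, and the integer typedefs are guesses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Assumptions: the real IRTools typedefs may differ.
typedef std::int32_t irt_int;
typedef std::int16_t irt_short;

struct candidate_info {
    irt_int docID;      // this doc's ID
    irt_short qtcount;  // # query terms seen in this doc so far
};

// postings[t] holds the docIDs containing query term t, in any order.
// AND (require_all == true): return docs containing every term, stopping
// early once enough_hits have been found.
// OR (require_all == false): return every candidate in docID order; this
// sketch does not stop early in OR mode.
std::vector<irt_int> merge_candidates(
        std::vector<std::vector<irt_int> > postings,
        bool require_all, std::size_t enough_hits) {
    assert(enough_hits > 0);
    // Start with the lowest-frequency term (shortest posting list).
    std::sort(postings.begin(), postings.end(),
              [](const std::vector<irt_int>& a, const std::vector<irt_int>& b) {
                  return a.size() < b.size();
              });
    std::map<irt_int, candidate_info> candidates;
    std::vector<irt_int> full;  // docIDs with all query terms
    const std::size_t nterms = postings.size();
    for (std::size_t t = 0; t < nterms; ++t) {
        for (irt_int doc : postings[t]) {
            auto it = candidates.find(doc);
            if (it == candidates.end()) {
                // For AND, a doc missing from the rarest term's list can
                // never match, so it is not worth allocating a candidate.
                if (require_all && t > 0) continue;
                candidate_info info = {doc, 0};
                it = candidates.insert(std::make_pair(doc, info)).first;
            }
            ++it->second.qtcount;
            if (require_all &&
                static_cast<std::size_t>(it->second.qtcount) == nterms) {
                full.push_back(doc);
                if (full.size() >= enough_hits) return full;
            }
        }
    }
    if (require_all) return full;
    std::vector<irt_int> any;
    for (const auto& kv : candidates) any.push_back(kv.first);
    return any;
}
```

Because `std::map` keeps its keys ordered, the OR result comes back in docID order without a separate sort, which is the property the slide points out.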

Design Snippet: LSI & Information Space

We use a modified Harwell-Boeing sparse matrix format on disk (modified = binary files)

Berry’s SVDPACKC has been integrated

We’re doing scaling experiments now. Scaling is a major challenge for LSI

One solution: do smaller eigensystem problems on a candidate subset on the fly, rather than pre-computing the entire collection’s semantic space. But this eliminates possibly interesting documents!

Hyperlink Map

The hyperlink map is a sparse asymmetric matrix of size D x D

We use a modified Harwell-Boeing format to store the matrix

A similar index file structure to the inverted index gives us rapid access to any document’s link list

We must store both sides of the matrix
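Harwell-Boeing is a compressed-column layout: one array of start offsets per column plus one array of row indices. The sketch below builds that layout in memory for a link list and shows why both directions are stored: out-links and in-links each need their own index. The `LinkIndex` struct and function names are hypothetical, and the on-disk details of the modified format are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Assumption: irt_int is a 32-bit integer; the real IRTools typedef may differ.
typedef std::int32_t irt_int;

// Hypothetical compressed index: the link list of document d is
// docs[start[d]] .. docs[start[d + 1] - 1].
struct LinkIndex {
    std::vector<irt_int> start;  // ndocs + 1 offsets into docs
    std::vector<irt_int> docs;   // concatenated link lists
};

// Build an index from (source, target) pairs. With outgoing == true the
// lists hold each document's out-links; otherwise they hold its in-links.
LinkIndex build_link_index(
        irt_int ndocs,
        const std::vector<std::pair<irt_int, irt_int> >& links,
        bool outgoing) {
    assert(ndocs >= 0);
    LinkIndex idx;
    idx.start.assign(static_cast<std::size_t>(ndocs) + 1, 0);
    // Pass 1: count the entries per document.
    for (const auto& l : links) {
        assert(l.first >= 0 && l.first < ndocs);
        assert(l.second >= 0 && l.second < ndocs);
        irt_int key = outgoing ? l.first : l.second;
        ++idx.start[key + 1];
    }
    // The prefix sum turns counts into start offsets.
    for (irt_int d = 0; d < ndocs; ++d) idx.start[d + 1] += idx.start[d];
    // Pass 2: drop each link into the next free slot of its document.
    idx.docs.resize(links.size());
    std::vector<irt_int> next(idx.start.begin(), idx.start.end() - 1);
    for (const auto& l : links) {
        irt_int key = outgoing ? l.first : l.second;
        irt_int val = outgoing ? l.second : l.first;
        idx.docs[next[key]++] = val;
    }
    return idx;
}

// Copy out the link list of one document.
std::vector<irt_int> link_list(const LinkIndex& idx, irt_int doc) {
    assert(doc >= 0 && static_cast<std::size_t>(doc) + 1 < idx.start.size());
    return std::vector<irt_int>(idx.docs.begin() + idx.start[doc],
                                idx.docs.begin() + idx.start[doc + 1]);
}
```

Building the index twice, once with `outgoing` set and once without, gives fast access to both "links from" and "links to" a document, at the cost of storing every link twice.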

Web Document Metadata

Items stored during spidering. These are kept in a Berkeley DB B+ hash file, with the document URL (or name) as key

Docname // key

docID

HTTP last update as reported

Our last visit/update

HTTP-reported size

Checksum (simple)

# links out

Design Snippet: tokenizer

The tokenizer reads files (via spider or local disk)

Goal: Few passes through the file

Goal: Any character set

Process:

Keep a static array of word boundaries

Keep a static array of tag delimiters (<)

Fold everything to lower case

termID lookup can happen now or later

Simple transformations (like ditching extra white space) can happen now
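The process above can be sketched as a single pass over the text. This is an illustration under stated assumptions rather than the IRTools tokenizer: it handles single-byte characters only, treats every non-alphanumeric byte as a word boundary, skips anything between `<` and `>`, and the names `BoundaryTable` and `tokenize` are hypothetical.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Static lookup table: is_boundary[c] is true when byte c ends a word.
// Assumption: alphanumeric bytes are word characters, all others boundaries.
struct BoundaryTable {
    bool is_boundary[256];
    BoundaryTable() {
        for (int c = 0; c < 256; ++c) is_boundary[c] = !std::isalnum(c);
    }
};
static const BoundaryTable kBoundaries;

// Split text into lower-cased tokens in one pass, skipping markup tags.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    bool in_tag = false;
    for (std::size_t i = 0; i < text.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        if (in_tag) {                  // inside <...>: ignore until '>'
            if (c == '>') in_tag = false;
            continue;
        }
        if (c == '<') in_tag = true;   // tag delimiter; also ends the word
        if (c == '<' || kBoundaries.is_boundary[c]) {
            if (!current.empty()) {
                tokens.push_back(current);
                current.clear();
            }
            continue;                  // runs of boundaries collapse here
        }
        current += static_cast<char>(std::tolower(c));
    }
    if (!current.empty()) tokens.push_back(current);
    return tokens;
}
```

Swapping in a different table is how this sketch would adapt to another single-byte character set; multi-byte encodings would need a different approach.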
