File organization
File Organizations (Indexes)
• Choices for accessing data during query evaluation
• Scan the entire collection
– Typical in early (batch) retrieval systems
– Computational and I/O costs are O(characters in collection)
– Practical for only “small” text collections
– Large-memory systems make scanning feasible
• Use indexes for direct access
– Evaluation time O(query term occurrences in collection)
– Practical for “large” collections
– Many opportunities for optimization
• Hybrids: use a small index, then scan a subset of the collection
Indexes
• What should the index contain?
• Database systems index primary and secondary keys
– This is the hybrid approach
– Index provides fast access to a subset of database records
– Scan the subset to find the solution set
• IR problem: cannot predict the keys that people will use in queries
– Every word in a document is a potential search term
• IR solution: index by all keys (words), i.e., full-text indexes
Indexes
• Index is accessed by the atoms of a query language; the atoms are called “features”, “keys”, or “terms”
• Most common feature types:
– Words in text, punctuation
– Manually assigned terms (controlled and uncontrolled vocabulary)
– Document structure (sentence and paragraph boundaries)
– Inter- or intra-document links (e.g., citations)
• Composed features:
– Feature sequences (phrases, names, dates, monetary amounts)
– Feature sets (e.g., synonym classes)
• Indexing and retrieval models drive choices
– Must be able to construct all components of those models
• Indexing choices (there is no “right” answer):
– Tokenization, case folding (e.g., New vs. new, Apple vs. apple), stopwords (e.g., the, a, its), morphology (e.g., computer, computers, computing, computed)
• Index granularity has a large impact on speed and effectiveness
Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
– Boolean
– Statistical (tf, df, ctf, doclen, maxtf)
– Often about 10% of the size of the raw data, compressed
• Positional
– Feature location within document
– Granularities include word, sentence, paragraph, etc.
– Coarse granularities are less precise, but take less space
– Word-level granularity is about 20–30% of the size of the raw data, compressed
Indexes: Implementation
• Common implementations of indexes
– Bitmaps
– Signature files
– Inverted files
• Common index components
– Dictionary (lexicon)
– Postings
• document ids
• word positions
(Figure: example index; no positional data indexed)
Indexes: Bitmaps
• Bag-of-words index only
• For each term, allocate a vector with one bit per document
• If the feature is present in document n, set the nth bit to 1, otherwise 0
• Boolean operations very fast
• Space efficient for common terms (why?)
• Space inefficient for rare terms (why?)
• Good compression with run-length encoding (why?)
• Not widely used
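The bitmap idea above can be sketched in a few lines. This is a minimal illustration over an assumed toy collection (the class and documents are invented for the example, not from the slides); each term's bit vector is stored as a Python integer, so a Boolean AND is a single bitwise operation.

```python
# Sketch of a bitmap index: one bit vector per term, one bit per document.
# The toy collection below is assumed for illustration.

class BitmapIndex:
    def __init__(self, num_docs):
        self.num_docs = num_docs
        self.bitmaps = {}          # term -> int used as a bit vector

    def add(self, term, doc_id):
        # Set bit doc_id in the term's vector (1 = term present in doc).
        self.bitmaps[term] = self.bitmaps.get(term, 0) | (1 << doc_id)

    def query_and(self, t1, t2):
        # Boolean AND of two terms is one bitwise AND over the vectors.
        bits = self.bitmaps.get(t1, 0) & self.bitmaps.get(t2, 0)
        return [d for d in range(self.num_docs) if bits >> d & 1]

idx = BitmapIndex(4)
for doc_id, text in enumerate(
        ["cold porridge", "porridge pot", "big pot", "pot of porridge"]):
    for term in text.split():
        idx.add(term, doc_id)

print(idx.query_and("porridge", "pot"))   # -> [1, 3]
```

Note the space behavior the slide asks about: a common term's vector has many 1-bits (dense, so the space is well used), while a rare term still pays one bit per document (mostly 0-bits, hence run-length encoding compresses well).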
Indexes: Signature Files
• Bag-of-words only
• For each term, allocate a fixed-size s-bit vector (signature)
• Define hash functions:
– Multiple functions: word → 1..s [each selects one bit to set]
• Each term has an s-bit signature
– may not be unique!
• OR the term signatures to form the document signature
• Long documents are a problem (why?)
– Usually segment them into smaller pieces / blocks
Signature File Example
Indexes: Signature Files
• At query time:
– Look up the signature for the query (how?)
– If all corresponding 1-bits are “on” in the document signature, the document probably contains that term
• Vary s to control P(false positive)
– Note the space tradeoff
• Optimal s changes as the collection grows (why?)
• Widely studied but not widely used
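A minimal sketch of the scheme described above, with assumed parameters (s = 16 bits, 2 hash functions built from `hashlib.md5`; real systems would tune both). It shows why a match is only probabilistic: ORing term signatures together can accidentally turn on all the bits of an absent term.

```python
# Signature-file sketch (assumed parameters; illustration only).
import hashlib

S = 16          # signature width in bits
K = 2           # hash functions per term

def term_signature(term):
    # K hash functions word -> 0..S-1; each selects one bit to set.
    sig = 0
    for i in range(K):
        h = int(hashlib.md5(f"{i}:{term}".encode()).hexdigest(), 16)
        sig |= 1 << (h % S)
    return sig

def doc_signature(terms):
    # OR the term signatures together to form the document signature.
    sig = 0
    for t in terms:
        sig |= term_signature(t)
    return sig

def probably_contains(doc_sig, term):
    # All of the term's 1-bits must be on; this can be a false positive.
    t = term_signature(term)
    return doc_sig & t == t

doc = "the cat sat on the mat".split()
sig = doc_signature(doc)
print(probably_contains(sig, "cat"))   # -> True (term really is present)
# An absent term usually yields False, but may collide (false positive).
```

Long documents are a problem precisely because each extra term ORs more 1-bits into the same s-bit signature, driving the false-positive rate up; segmenting into blocks keeps each signature sparse.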
Indexes: Inverted Lists
• Inverted lists are currently the most common indexing technique
• Source file: collection, organized by document
• Inverted file: collection, organized by term
– one record per term, listing the locations where the term occurs
• During evaluation, traverse the lists for each query term
– OR: the union of component lists
– AND: the intersection of component lists
– Proximity: an intersection of component lists
– SUM: the union of component lists; each entry has a score
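The OR/AND operations above reduce to union and intersection of sorted postings lists. A sketch, using small postings lists assumed for illustration (doc ids ascending, as an inverted file would store them):

```python
# OR and AND over sorted inverted lists (doc ids in ascending order).

def intersect(p1, p2):
    # AND: merge-style intersection of two sorted postings lists.
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out

def union(p1, p2):
    # OR: union of the two lists, kept sorted and deduplicated.
    return sorted(set(p1) | set(p2))

porridge = [1, 2]      # assumed toy postings
pot = [2, 5]
print(intersect(porridge, pot))   # -> [2]
print(union(porridge, pot))       # -> [1, 2, 5]
```

The merge-style intersection touches each list once, so an AND costs O(|p1| + |p2|) rather than scanning the collection.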
Inverted Files
Word-Level Inverted File
How big is the index?
• For an n-word collection:
• Lexicon
– Heaps’ Law: V = O(n^β), 0.4 < β < 0.6
– TREC-2: 1 GB text, 5 MB lexicon
• Postings
– at most one per occurrence of a word in the text: O(n)
Inverted Search Algorithm
1. Find query elements (terms) in the lexicon
2. Retrieve postings for each lexicon entry
3. Manipulate postings according to the retrieval model
Word-Level Inverted File
Queries:
1. porridge & pot (BOOL)
2. “porridge pot” (BOOL)
3. porridge pot (VSM)
(Figure: lexicon and postings, with the answer)
Lexicon Data Structures
• According to Heaps’ Law, the lexicon may be very large
• Hash table
– O(1) lookup, with constant-time h() and collision handling
– Supports exact-match lookup
– May be complex to expand
• B-Tree
– On-disk storage with fast retrieval and good caching behavior
– Supports exact-match and range-based lookup
– O(log n) lookups to find a list
– Usually easy to expand
In-memory Inversion Algorithm
1. Create an empty lexicon
2. For each document d in the collection:
   1. Read the document, parse it into terms
   2. For each indexing term t:
      1. f_{d,t} = frequency of t in d
      2. If t is not in the lexicon, insert it
      3. Append <d, f_{d,t}> to the postings list for t
3. Output each postings list into the inverted file:
   1. For each term, start a new file entry
   2. Append each <d, f_{d,t}> to the entry
   3. Compress the entry
   4. Write the entry out to the file
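The steps above can be sketched directly (compression and the file-writing pass are omitted; the toy documents are assumed for illustration):

```python
# In-memory inversion sketch following the algorithm above:
# one pass over the documents, appending <d, f_dt> to each term's list.
from collections import Counter

def invert(docs):
    lexicon = {}                      # term -> postings list of (d, f_dt)
    for d, text in enumerate(docs, start=1):
        for t, f_dt in Counter(text.split()).items():
            # Insert t into the lexicon if absent, then append <d, f_dt>.
            lexicon.setdefault(t, []).append((d, f_dt))
    return lexicon

docs = ["pease porridge hot",
        "pease porridge cold",
        "porridge in the pot"]
inv = invert(docs)
print(inv["porridge"])    # -> [(1, 1), (2, 1), (3, 1)]
print(inv["pease"])       # -> [(1, 1), (2, 1)]
```

Because documents are processed in order, each postings list comes out already sorted by document id, which is what the later merge and compression steps rely on.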
Complexity of In-memory Inv.
• Time: O(n) for n-byte text
• Space
– Lexicon: space for unique words + offsets
– Postings: 10 bytes per entry
• document number: 4 bytes
• frequency count: 2 bytes (allows 65,535 max occurrences)
• “next” pointer: 4 bytes
• Is this affordable?
– For a 5 GB collection, at 10 bytes/entry, for 400 M entries, we need 4 GB of main memory
Idea 1: Partition the text
• Invert a chunk of the text at a time
• Then merge the sub-indexes into one complete index (multi-way merge)
(Figure: chunks merged into the main inverted file)
Idea 2: Sort-based Inversion
Invert in two passes:
1. Output records <t, d, f_{d,t}> to a temporary file
2. Sort the records using external merge sort:
   1. read a chunk of the temp file
   2. sort it using Quicksort
   3. write it back into the same place
   4. then merge-sort the chunks in place
3. Read the sorted file, and write the inverted file
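The sort-based scheme can be sketched as follows; the external merge sort is simulated by an in-memory sort here (real systems sort chunks on disk and merge), and the toy documents are assumed for illustration:

```python
# Sort-based inversion sketch: emit <t, d, f_dt> records, sort them,
# then group by term to build the postings lists.
from collections import Counter
from itertools import groupby

def sort_based_invert(docs):
    records = []                                   # pass 1: <t, d, f_dt>
    for d, text in enumerate(docs, start=1):
        for t, f in Counter(text.split()).items():
            records.append((t, d, f))
    records.sort()                                 # pass 2: sort by (t, d)
    inverted = {}
    for t, group in groupby(records, key=lambda r: r[0]):
        inverted[t] = [(d, f) for _, d, f in group]  # pass 3: write lists
    return inverted

inv = sort_based_invert(["to be or not to be", "be here now"])
print(inv["be"])    # -> [(1, 2), (2, 1)]
```

The point of the two-pass design is that pass 1 needs only sequential writes and pass 2 only chunk-sized memory, so the 4 GB in-memory requirement from the previous slide disappears.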
Access Optimizations
• Skip lists:
– A table of contents to the inverted list
– Embedded pointers that jump ahead n documents (why is this useful?)
• Separating presence information from location information
– Many operators only need presence information
– Location information takes substantial space (I/O)
– If split:
• reduced I/O for presence operators
• increased I/O for location operators (or a larger index)
– Common in CD-ROM implementations
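A sketch of how skip pointers speed up an AND. Instead of embedded pointers, this simulation jumps a fixed stride k through the array (a simplifying assumption; √(list length) is a commonly cited choice of stride):

```python
# Intersection with simulated skip pointers over sorted postings lists.
import math

def intersect_with_skips(p1, p2, k=None):
    k = k or max(1, int(math.sqrt(len(p1))))   # assumed skip length
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            # Safe to skip: every entry jumped over is < p2[j],
            # so none of them can be in the intersection.
            if i + k < len(p1) and p1[i + k] <= p2[j]:
                i += k
            else:
                i += 1
        else:
            if j + k < len(p2) and p2[j + k] <= p1[i]:
                j += k
            else:
                j += 1
    return out

print(intersect_with_skips([2, 4, 6, 8, 10, 12, 14, 16], [3, 16], k=3))
# -> [16]
```

This is why skips help: when one list is much shorter, long runs of non-matching documents in the longer list are jumped over k at a time instead of one by one.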
Inverted file compression
• Inverted lists are usually compressed
• Inverted files with word locations are about the size of the raw data
• Distribution of numbers is skewed
– Most numbers are small (e.g., word locations, term frequency)
• Distribution can easily be made more skewed
– Delta encoding: 5, 8, 10, 17 → 5, 3, 2, 7
• Simple compression techniques are often the best choice
– Simple algorithms are nearly as effective as complex algorithms
– Simple algorithms are much faster than complex algorithms
– Goal: time saved by reduced I/O > time required to uncompress
Compression: Huffman encoding (figure)
Inverted file compression
• The longest lists, which take up the most space, contain the most frequent (probable) words
• Compressing the longest lists would save the most space
• The longest lists should compress easily, because they contain the least information (why?)
• Algorithms:
– Delta encoding
– Variable-length encoding
– Unary codes
– Gamma codes
– Delta codes
– Variable-byte code
– Golomb code
Inverted List Indexes: Compression
Delta Encoding (“Storing Gaps”)
• Reduces the range of numbers
– keep d in ascending order
– 3, 5, 20, 21, 23, 76, 77, 78
– becomes: 3, 2, 15, 1, 2, 53, 1, 1
• Produces a more skewed distribution
• Increases the probability of smaller numbers
• Stemming also increases the probability of smaller numbers (why?)
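Gap encoding is a two-line transform; the sketch below reproduces the slide's own example list:

```python
# Delta (gap) encoding over an ascending postings list.

def encode_gaps(doc_ids):
    # Store each id as the difference from its predecessor.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def decode_gaps(gaps):
    # Running sum restores the original ascending doc ids.
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [3, 5, 20, 21, 23, 76, 77, 78]     # example from the slide
gaps = encode_gaps(ids)
print(gaps)                              # -> [3, 2, 15, 1, 2, 53, 1, 1]
print(decode_gaps(gaps) == ids)          # -> True
```

The gaps themselves are no smaller in total, but their distribution is skewed toward small values, which is exactly what variable-length codes (gamma, variable-byte, Golomb) exploit.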
Variable-Byte Code
• Binary, but uses the minimum number of bytes
• 7 bits store the value, 1 bit flags whether there is another byte
– 0 ≤ x < 128: 1 byte
– 128 ≤ x < 16384: 2 bytes
– 16384 ≤ x < 2097152: 3 bytes
• Integral byte sizes for easy coding
• Very effective for medium-sized numbers
• A little wasteful for very small numbers
• Good trade-off between small index size and fast query time
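A sketch of the byte layout described above. The continuation-bit convention (high bit = "another byte follows") and the little-endian group order are assumptions; implementations vary:

```python
# Variable-byte code: 7 data bits per byte, high bit = continuation flag.

def vbyte_encode(x):
    out = bytearray()
    while x >= 128:
        out.append((x & 0x7F) | 0x80)   # low 7 bits, continuation bit on
        x >>= 7
    out.append(x)                       # final byte, continuation bit off
    return bytes(out)

def vbyte_decode(data):
    # Reassemble the 7-bit groups (assumes data holds one number).
    x = shift = 0
    for b in data:
        x |= (b & 0x7F) << shift
        shift += 7
    return x

for n in (5, 127, 128, 16383, 16384):
    enc = vbyte_encode(n)
    print(n, "->", len(enc), "byte(s)")   # 1, 1, 2, 2, 3 bytes
    assert vbyte_decode(enc) == n
```

This shows the trade-off the slide names: decoding is just shifts and masks (fast), but even the gap 1 costs a full byte, where a bit-level code like gamma would spend one bit.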
Index/IR toolkits
• The Lemur Toolkit for Language Modeling and Information Retrieval
– http://www.lemurproject.org/
• Lucene/Nutch: Java-based indexing and search technology; provides web search application software
– http://lucene.apache.org/nutch/
Summary
• Common implementations of indexes
– Bitmap, signature file, inverted file
• Common index components
– Lexicon & postings
• Inverted search algorithm
• Inversion algorithms
• Inverted file access optimizations
– compression
The End
Vector Space Model
• Document d and query q are represented in the vector space as two m-dimensional vectors. Each dimension is weighted by tf·idf, and their similarity is measured by the cosine of the angle between the vectors (using the raw tf and idf formulas):
$$\mathrm{Cos}(Q,D)=\frac{\sum_{t\in Q} w_{t,q}\,w_{t,d}}{\sqrt{\sum_{t\in Q} w_{t,q}^{2}}\;\sqrt{\sum_{t\in D} w_{t,d}^{2}}}$$

where

$$w_{t,d}=\bigl(1+\log f_{t,d}\bigr)\cdot\log\frac{N}{df_t}$$

and similarly $w_{t,q}=\bigl(1+\log f_{t,q}\bigr)\cdot\log\frac{N}{df_t}$.
Vocabulary Growth (Heaps’ Law)
• How does the size of the overall vocabulary (number of unique words) grow with the size of the corpus?
– Vocabulary has no upper bound, due to proper names, typos, etc.
– New words occur less frequently as the vocabulary grows
• If V is the size of the vocabulary and n is the length of the corpus in words:
– V = K·n^β (0 < β < 1)
• Typical constants:
– K ≈ 10–100
– β ≈ 0.4–0.6 (approximately the square root of n)
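Plugging in assumed constants from the typical ranges above (K = 30, β = 0.5) gives a feel for how slowly the lexicon grows relative to the corpus:

```python
# Heaps' law sketch: V = K * n**beta, with constants assumed from the
# typical ranges on the slide (K = 30, beta = 0.5).
K, beta = 30.0, 0.5

def vocab_size(n):
    return K * n ** beta

for n in (10**6, 10**8, 10**10):
    print(f"n = {n:>14,}   predicted vocabulary ~ {vocab_size(n):,.0f}")
# A 10,000x larger corpus yields only a 100x larger vocabulary.
```

This sublinear growth is why a 1 GB collection can have a lexicon of only a few MB (the TREC-2 figure cited earlier), and why the lexicon can often be held in memory even when the postings cannot.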
Query Answer
• 1. porridge & pot (BOOL)
– d2
• 2. “porridge pot” (BOOL)
– null
• 3. porridge pot (VSM)
– d2 > d1 > d5
3. porridge pot (VSM): N = 6; f(d1) denotes the term frequency within d1

Term       df   f(d1)   f(d2)   f(d5)
porridge    2     2       1       0
pot         2     0       1       1

Document vectors (raw tf over the full lexicon; since df is the same for both query terms, the idf factor is identical and omitted):

D1 = (0, 0, 0, 0, 0, 0, 2, 2, 1, 0, 1, 0, 0)
D2 = (1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0)
D5 = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1)

Document lengths: |D1| = √10, |D2| = √5, |D5| = √6

Sim(q, d1) = 2/√10
Sim(q, d2) = 2/√5
Sim(q, d5) = 1/√6

Ranking: d2 > d1 > d5
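The worked example can be checked numerically. This sketch uses the tf values and document lengths from the slides' example, with query weights of 1 and the idf factor dropped (as in the example, since df is equal for both terms) and normalization by document length only:

```python
# Numeric check of the VSM worked example (values from the slides).
import math

docs = {
    "d1": {"porridge": 2},                 # porridge: 2, pot: 0
    "d2": {"porridge": 1, "pot": 1},
    "d5": {"pot": 1},
}
# Document lengths |D| from the full 13-term vectors in the example.
lengths = {"d1": math.sqrt(10), "d2": math.sqrt(5), "d5": math.sqrt(6)}
query = ["porridge", "pot"]

def sim(q, d):
    # Dot product with unit query weights, normalized by |D| only.
    dot = sum(docs[d].get(t, 0) for t in q)
    return dot / lengths[d]

for d in ("d1", "d2", "d5"):
    print(d, round(sim(query, d), 3))
# Ranking comes out d2 > d1 > d5, matching the slides.
```

The values are 2/√10 ≈ 0.632, 2/√5 ≈ 0.894, and 1/√6 ≈ 0.408, confirming the ranking d2 > d1 > d5.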