Collections Management Museums
EMu Searching
EMu Searching Explained
(What’s going on under the hood!)
Bernard Marshall, Chief Technical Officer
KE Software
Overview
• The basic theory
• Tools and tuning
• Searching issues
EMu search mechanism
• Two level superimposed coding scheme for partial match retrieval
• Developed from research at the University of Melbourne (early 1980s)
• Designed to provide very high speed retrieval from very large datasets
• The more search terms provided, the faster the search
• One set of indexes for all searching (except key searches)
Record Descriptor
• Encodes the contents of one record into a single bit string
• Descriptors are stored sequentially in the rec file
• Each record descriptor has the data offset (from the data file) appended
[Figure: the rec file holds rec descriptors 1 to 5, each followed by its offset; each offset points to the corresponding record data in the data file, shown there in the order record data 1, 3, 2.]
Field            Terms          Bits set (k = 2)   Descriptor (b = 15)
First Name       Boris          3, 10              00010 00000 10000
Surname          Badenov        1, 4               01001 00000 00000
City             Frostbite      3, 7               00010 00100 00000
                 Falls          8, 14              00000 00010 00001
Country          Pottsylvania   4, 9               00001 00001 00000
Rec Descriptor                                     01011 00111 10001
[Figure: the term, k, b and column number are fed to a pseudo-random number generator, which produces the bit numbers to set.]
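The table above can be reproduced with a short Python sketch. It takes the bit numbers straight from the slide rather than generating them, because the actual pseudo-random number generator used by EMu is not described here; the helper names (`descriptor`, `superimpose`, `show`) are made up for illustration. The bit pairs for "Frostbite" and "Falls" are read off the two City descriptors.

```python
# Superimposed coding sketch: every term sets k bits in a b-bit descriptor,
# and the record descriptor is the OR of all term descriptors.
B = 15  # descriptor length in bits (b on the slide)

# Bit numbers copied from the slide. In a real system they would come from a
# pseudo-random number generator fed with the term, k, b and column number.
SLIDE_BITS = {
    "Boris": (3, 10),
    "Badenov": (1, 4),
    "Frostbite": (3, 7),
    "Falls": (8, 14),
    "Pottsylvania": (4, 9),
}

def descriptor(bit_numbers):
    """Return a descriptor as a list of 0/1 values, with bit 0 on the left."""
    bits = [0] * B
    for n in bit_numbers:
        bits[n] = 1
    return bits

def superimpose(descriptors):
    """OR several descriptors together, position by position."""
    return [int(any(column)) for column in zip(*descriptors)]

def show(bits):
    """Format a descriptor in groups of five, as on the slide."""
    text = "".join(str(bit) for bit in bits)
    return " ".join(text[i:i + 5] for i in range(0, len(text), 5))

record_descriptor = superimpose([descriptor(b) for b in SLIDE_BITS.values()])
print(show(record_descriptor))  # 01011 00111 10001
```

Note that "Boris" and "Frostbite" both set bit 3. Overlaps like this are expected with superimposed coding and are the reason false matches can occur.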
Record descriptor (searching)
• Generate the record descriptor for the search term(s)
• AND with all record descriptors to find matching record(s)

Field              Terms   Bits set (k = 2)   Descriptor (b = 15)
First Name         Boris   3, 10              00010 00000 10000
Query Descriptor                              00010 00000 10000

    00010 00000 10000   Boris query descriptor
AND 01011 00111 10001   record descriptor
  = 00010 00000 10000   resultant descriptor
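The match test is a bitwise AND followed by a comparison: a record is a candidate when the result equals the query descriptor. A minimal Python sketch using the descriptors from the slide follows. The function names are illustrative, and the second query is a made-up bit pattern (bits 2 and 5) added only to show a non-match.

```python
def parse(text):
    """Turn a slide-style bit string into an integer bitmask (spaces ignored)."""
    return int(text.replace(" ", ""), 2)

def may_match(query, record):
    """A record is a candidate when every query bit is also set in the record."""
    return (query & record) == query

record = parse("01011 00111 10001")   # record descriptor from the slide
boris = parse("00010 00000 10000")    # query descriptor for "Boris"
other = parse("00100 10000 00000")    # hypothetical term setting bits 2 and 5

print(may_match(boris, record))  # True
print(may_match(other, record))  # False
```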
False matches
• Query descriptor matches a record descriptor that does not contain the search term
Field              Terms     Bits set (k = 2)   Descriptor (b = 15)
First Name         Natasha   7, 9               00000 00101 00000
Query Descriptor                                00000 00101 00000

    00000 00101 00000   Natasha query descriptor
AND 01011 00111 10001   record descriptor
  = 00000 00101 00000   resultant descriptor
False matches
• The chance of a false match is related to bit density
• The lower the bit density, the lower the probability of a false match
• EMu uses a bit density of < 25%; that is, less than 25% of bits are one
• Probability of a false match with k = 5 is 1 in 1,024 record descriptors checked for a single term query
• Probability for a two term query is 1 in 1,048,576
• Lower bit density requires more disk space and produces longer record descriptors
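The two figures quoted above follow from simple arithmetic if we assume bits are set independently and use the 25% upper bound as the density: a false match needs all k query bits to be one already, so the chance is density to the power k per term. The function below is an illustrative sketch of that calculation, not a formula taken from EMu.

```python
from fractions import Fraction

def false_match_probability(density, k, terms=1):
    """Chance that all k bits of every query term are already set by other terms.

    Assumes each bit is set independently with probability `density`.
    """
    return Fraction(density) ** (k * terms)

density = Fraction(1, 4)  # the 25% upper bound quoted on the slide

print(false_match_probability(density, k=5))           # 1/1024
print(false_match_probability(density, k=5, terms=2))  # 1/1048576
```

Because the real density is below 25%, these values are upper bounds under the same independence assumption.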
Segment descriptor
• Encodes the contents of multiple records into a bit string
• Descriptors are stored sequentially in the seg file (bitsliced)
[Figure: rec descriptors 1 to 7 and onwards are grouped, and each group is summarised by one seg descriptor (seg descriptor 1, seg descriptor 2, ...).]
Segment descriptor
• For each group of records (Nr) a single descriptor is calculated as for a record descriptor
• Segment level has its own values for k (number of bits to set) and b (length of bit string)
Segment descriptor (searching)
• Segment searching checks Nr records per descriptor
• For efficient disk access when searching, “flip” the seg file (bitslicing)
• Penalty is slower record insertions / updates (use oflow file)
00001 00000 00100 00000 01000 seg query descriptor
10011 00010 00111 00001 11001 seg descriptor 1
00011 10000 00001 01100 00100 seg descriptor 2
01000 00110 11000 00011 01001 seg descriptor 3
01001 00100 01100 00101 01000 seg descriptor 4
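Segment descriptor (bitsliced)
[Figure: the seg file stored as bit slices, one bit per segment in each slice:
1000 …, 0011 …, 0000 …, 1100 …, 1101 …, 0100 …, 0000 …, 0011 …, 1010 …, 0000 …, 0010 …, 0011 …, 1001 …, …
The slices selected by the query are ANDed together, giving 1001 …]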
• Each bit slice is ANDed to determine matching segments
• Matching segments are given by bit positions with a value of one
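The bit-sliced search can be sketched in Python using the four segment descriptors and the segment query descriptor shown on the segment searching slide. The code transposes the descriptors into slices and ANDs only the slices for the bits set in the query. The function names are illustrative, and real slices would be read from the seg file rather than held in lists.

```python
SEGMENTS = [
    "10011 00010 00111 00001 11001",  # seg descriptor 1
    "00011 10000 00001 01100 00100",  # seg descriptor 2
    "01000 00110 11000 00011 01001",  # seg descriptor 3
    "01001 00100 01100 00101 01000",  # seg descriptor 4
]
QUERY = "00001 00000 00100 00000 01000"  # seg query descriptor

def bits(text):
    """Parse a slide-style bit string into a list of 0/1 values."""
    return [int(ch) for ch in text.replace(" ", "")]

def bitslice(descriptors):
    """Transpose: slice i holds bit i of every segment descriptor."""
    return [list(column) for column in zip(*(bits(d) for d in descriptors))]

def matching_segments(slices, query):
    """AND together only the slices selected by the one bits of the query."""
    wanted = [i for i, bit in enumerate(bits(query)) if bit]
    result = [1] * len(slices[0])
    for i in wanted:
        result = [r & s for r, s in zip(result, slices[i])]
    # Positions still set to one are the matching segments (1-based numbers).
    return [n + 1 for n, bit in enumerate(result) if bit]

slices = bitslice(SEGMENTS)
print("".join(map(str, slices[0])))      # 1000
print(matching_segments(slices, QUERY))  # [1, 4]
```

Only the slices for query bits 4, 12 and 21 are touched here, which is why the number of slices read depends on the number of bits set per term and not on how many segments exist.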
Complete search sequence
• Build segment query descriptor for query terms
• Search bitslice segment file for list of matching segments
• Build record query descriptor for query terms
• Search record descriptors in matching segments for matching records
• Exact match each candidate record before showing it to the user
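The whole sequence can be sketched end to end in Python. Everything here is an assumption made for illustration: the SHA-256 based `term_bits` stands in for the unspecified pseudo-random generator, the `TwoLevelIndex` class and its default k, b and Nr values are invented, the sample records are made up, and the segment level is scanned directly instead of being bit-sliced.

```python
import hashlib

def term_bits(term, k, b):
    """Hypothetical stand-in for the generator: up to k bit numbers in [0, b)."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    return {int.from_bytes(digest[2 * i:2 * i + 2], "big") % b for i in range(k)}

def descriptor(terms, k, b):
    """Superimpose the bits of every term into one integer bitmask."""
    mask = 0
    for term in terms:
        for n in term_bits(term, k, b):
            mask |= 1 << n
    return mask

class TwoLevelIndex:
    def __init__(self, records, nr=2, rec_kb=(2, 64), seg_kb=(2, 128)):
        self.records = [{t.lower() for t in r} for r in records]
        self.nr, self.rec_kb, self.seg_kb = nr, rec_kb, seg_kb
        self.rec_desc = [descriptor(r, *rec_kb) for r in self.records]
        # One segment descriptor per group of nr consecutive records.
        self.seg_desc = [
            descriptor(set().union(*self.records[i:i + nr]), *seg_kb)
            for i in range(0, len(self.records), nr)
        ]

    def search(self, query_terms):
        wanted = {t.lower() for t in query_terms}
        seg_q = descriptor(wanted, *self.seg_kb)
        rec_q = descriptor(wanted, *self.rec_kb)
        hits = []
        for s, seg in enumerate(self.seg_desc):
            if (seg & seg_q) != seg_q:
                continue  # no record in this segment can match
            start = s * self.nr
            for i in range(start, min(start + self.nr, len(self.records))):
                if (self.rec_desc[i] & rec_q) != rec_q:
                    continue  # record descriptor rules this record out
                if wanted <= self.records[i]:  # exact check drops false matches
                    hits.append(i)
        return hits

index = TwoLevelIndex([
    {"Boris", "Badenov", "Frostbite", "Falls", "Pottsylvania"},
    {"Natasha", "Pottsylvania"},
    {"Rocky", "Frostbite", "Falls"},
])
print(index.search(["Frostbite", "Falls"]))  # [0, 2]
print(index.search(["Natasha"]))             # [1]
```

The descriptors only ever rule records out; the final exact check is what guarantees that no false match reaches the user.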
Number of disk accesses (logical)
• For a single search term with one matching record:
  ks – bits set per term (segment level)
  1 – disk read to read segment to match record descriptor
• Number of logical reads is independent of the table size
• Number of physical reads increases as the table grows (but disk read ahead helps here)
Client query evaluation
• Attachment searches performed and matching IRNs on reference column added to query statement
• Reverse attachment searches performed and matching reference values added to query statement
• Local search terms added to query statement
• Also search columns added to query statement
• Search performed
What is a term?
Type       Term                 Query examples
Text       word                 Frostbite, falls
Float      number               9.12
Integer    number               12
Date       day, month, year     12-10-2010
Time       hour, min, sec       13:12:10.0
Lat/Long   deg, min, sec, dir   120 12 10.43 N
String     value                A1-124/7
• A term is the basic index component
Term modifiers
Modifier   Applicable types   Query examples
Null       all types          *, !*
Partial    text, string       ab*, a{a-z}*
Stem       text               ~electric
Phonetic   text               @smythe
Phrase     text               “Red house”
• Modifiers alter how the term is indexed
Indexing tools
• texdensity
  Prints out the bit density for segment and record descriptors
• texanalyse
  Prints the number of terms per record
• texconf
  Calculate a suitable index configuration
  Adjust configuration parameters manually
Configuration parameters
• params file in table directory
  Override default configuration parameters
• Bit density (rec/seg)
• File system block size
• False match probability (rec/seg)
• Minimum number of records per segment
  XML based file
Searching Issues – false matches
• Issue
  Some queries are slow but disk activity is high
• Diagnose
  texadmin database usage shows a high number of index false matches
  texdensity shows high density or large standard deviation with high maximum density (check seg and rec)
  texanalyse shows a large standard deviation for the number of index terms (check seg and rec)
• Fix
  Reconfigure table
  Set configuration parameters manually
Searching Issues – common terms
• Issue
  Some queries containing common terms are slow
  “false” segment matches
• Diagnose
  Querying on each term individually results in a large number of matches (query is quick)
  Querying on the combination of terms becomes slow
• Fix
  Cluster table on a common term
  Sort data before indexing
Searching Issues – block size mismatch
• Issue
  Overall searching is slow but disk activity is high
  Using zfs with large record size
• Diagnose
  Determine the block size of the file system used to hold index files
  Use texconf to determine the block size used for indexing
• Fix
  Set blocksize configuration parameter manually
  Adjust zfs record size to 16K
Searching Issues – RAID configuration
• Issue
  Record updates are very slow
  Fast disks but performance less than optimal
• Diagnose
  Disk controller or driver is configured to use RAID 5 or 6
• Fix
  Optimal performance in a RAID environment is RAID 1+0 (RAID 10) (stripe/mirror)
  Ensure striping agrees with block size of file system
  Enable striping where possible
Searching Issues – Unindexed fields
• Issue
  Wildcard / stem / phonetic based queries are extremely slow
• Diagnose
  Use emuindexing to check indexing of fields being queried
• Fix
  Add Registry entries to enable the indexing required:
  • System|Setting|Table|table|Stem Index|colname;colname;...
  • System|Setting|Table|table|Phonetic Index|colname;colname;...
  • System|Setting|Table|table|Null Index|colname;colname;...
  • System|Setting|Table|table|Partial Index|colname=parts;...
Searching Issues – Range queries slow
• Issue
  Queries containing ranges are slow
• Diagnose
  Use emuindexing to check if range indexing is enabled
• Fix
  Use emurangeupdate to optimise range based searching
  Add Registry entries to enable the indexing required:
  • System|Setting|Table|table|Range Buckets|colname|bucket;...
Searching Issues – Large attachment queries
• Issue
  Query is very slow when performing a query containing attachments and other terms
• Diagnose
  “Optimising query” status is displayed for a long time
• Cause
  The search engine is re-organising the query
  (a AND b) AND (c OR d OR e OR f OR g) becomes
  (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g)
• Fix
  Rewrite the query optimiser
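The re-organisation described under Cause is the distribution of AND over OR. The Python sketch below reproduces the rewrite from the slide; `distribute` and `render` are illustrative names, not part of the EMu search engine. It shows that the rewritten query has one AND-group per OR alternative, so if an attachment search contributes many matching IRNs as alternatives, the rewritten statement grows with that count, which may explain the long “Optimising query” phase.

```python
def distribute(required, alternatives):
    """Rewrite (r1 AND r2 ...) AND (a1 OR a2 ...) as a list of AND-groups."""
    return [list(required) + [alt] for alt in alternatives]

def render(groups):
    """Format the AND-groups as a single OR expression."""
    return " OR ".join("(" + " AND ".join(group) + ")" for group in groups)

groups = distribute(["a", "b"], ["c", "d", "e", "f", "g"])
print(render(groups))
# (a AND b AND c) OR (a AND b AND d) OR (a AND b AND e) OR (a AND b AND f) OR (a AND b AND g)
print(len(groups))  # 5
```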
References
• EMu 4.0.01 Release Notes
  System Tuning
• Configuration
  www.kesoftware.com/downloads/EMu/documents/configuration.pdf
• Range Indexing
  www.kesoftware.com/downloads/EMu/documents/Range Indexing/rangeindexing.pdf