An Evaluation of Index Architectures for DB-IR Integration in an Open-Source IRMS, KRISTAL
Jinsuk Kim
Information System Development [email protected]. 6. 1.
2
DB+IR vs. KRISTAL-IRMS
IRMS = Information Retrieval & Management System
3
Contents
• The Great Divide in DB and IR
• Approaches in DB-IR Integration
• Strategies for Dynamic Index Maintenance
– Direct Index Update
– Stand-alone Auxiliary Index Strategy
– Pulsing Auxiliary Index Strategy
• Experimental Result
• Conclusion
• Discussion
4
The Great Divide in DB and IR
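• Database Systems: structured data; complex and structured queries
• Information Retrieval Systems: unstructured data; ranked keyword search
• The Great Data Divide: structured vs. unstructured data
• The Great Query Divide: complex and structured queries vs. ranked keyword search
Source: Dr. Jayavel Shanmugasundaram, Cornell University, SIGMOD 2005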
5
RDB vs. IR
Source: Dr. R. Baeza-Yates & Dr. M. Consens, VLDB 2004
RDB
• DBs allow structured querying
• Queries and results (tuples) are different objects
• Soundness & completeness expected
• All results are equally good
• User is expected to know the structure
IR
• IR only supports unstructured querying
• Queries and results are both documents
• Results are usually imprecise and incomplete
• Some results are more relevant than others
• User is expected to be dumb
Web Search Engines ?
6
RDB vs. IR vs. KRISTAL
RDB
• Structured querying
• Queries and results (tuples) are different objects
• Soundness & completeness expected
• All results are equally good
• User is expected to know the structure
IR
• Unstructured querying
• Queries and results are both documents
• Results are usually imprecise and incomplete
• Some results are more relevant than others
• User is expected to be dumb
KRISTAL
• Structured & unstructured querying
• Structured & unstructured data types
• Boolean, Vector, & Vector Boolean Models
Example document:
…
<presentation date="1 June 2007">
  <title>Index Maintenance in DB-IR Integration</title>
  <author>Jinsuk Kim</author>
  <abstract>Index maintenance strategies in DB-IR … stand-alone and pulsing auxiliary index architectures … </abstract>
</presentation>
…
Example request: Find presentations between the years 2005 and 2007 whose author is "Jinsuk" and whose abstract is about "pulsing auxiliary index".
KRISTAL Query Formula:
(DATE: 2005 ~ 2007) AND (AUTHOR: Jinsuk) AND (ABSTRACT: pulsing auxiliary index)
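The fielded query formula above combines a range condition with two text conditions. A minimal sketch of how such a formula might be evaluated against a single record follows. The `Presentation` record, the helper names, and the matching rules (an inclusive year range for DATE, case-insensitive substring matching for the text fields) are assumptions made for illustration; they are not KRISTAL's actual API or matching semantics.

```python
from dataclasses import dataclass


@dataclass
class Presentation:
    # Hypothetical record layout mirroring the XML example.
    year: int
    title: str
    author: str
    abstract: str


def in_year_range(rec: Presentation, lo: int, hi: int) -> bool:
    """Structured condition: inclusive range check on the year."""
    return lo <= rec.year <= hi


def field_contains(text: str, phrase: str) -> bool:
    """Unstructured condition: naive case-insensitive phrase match (assumption)."""
    return phrase.lower() in text.lower()


def matches(rec: Presentation) -> bool:
    # (DATE: 2005 ~ 2007) AND (AUTHOR: Jinsuk) AND (ABSTRACT: pulsing auxiliary index)
    return (
        in_year_range(rec, 2005, 2007)
        and field_contains(rec.author, "Jinsuk")
        and field_contains(rec.abstract, "pulsing auxiliary index")
    )


talk = Presentation(
    year=2007,
    title="Index Maintenance in DB-IR Integration",
    author="Jinsuk Kim",
    abstract="Index maintenance strategies in DB-IR ... stand-alone and "
             "pulsing auxiliary index architectures ...",
)
print(matches(talk))
```

The point of the sketch is only that one query mixes a DB-style predicate (the date range) with IR-style text matching in the same Boolean expression.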
7
Strategies in DB-IR Integration (1/2)
• DB-IR Middleware Approach
– Glue existing DB and IR engines at the application level
  • DBMS for data management and IRS for text search facilities
  • Inevitable document-index gap
• DB-IR loose coupling
– Extend DBMS by SQL-level IR interface
  • Examples: Oracle ConText, DB2 Text Extender, QUIQ, TopX, MonetDB/X100
• DB-IR tight coupling
– Extend IR facilities in DBMS storage level (IR on DB)
  • Example: Odysseus ORDBMS
– Extend DB management facilities in IR storage level (DB on IR: IRMS)
  • Example: KRISTAL-IRMS
• Novel architecture for DB-IR unification
– Still under discussion
– "The storage-level core system with RISC-style functionality in DB-IR integration" suggested by Chaudhuri et al.
8
Strategies in DB-IR Integration (2/2)
[Figure: DB-IR integration approaches positioned between DB features and IR features. Labels: DBMS; DB Extenders; DB2 Cartridge; Odysseus; IR on DB Tight Coupling; IRMS (DB on IR Tight Coupling); IRS; Web Search Engines; Web Crawlers; DB-IR middleware approach.]
9
As a Text Management System
Sample Input Document
Author: Jinsuk Kim
Title: Evaluation of index maintenance in DB-IR integration
Keywords: pulsing auxiliary index, postings list
DB
• 2 disk accesses: 1 for the record, 1 for the transaction log
• Fast input, rollback, crash recovery, slow retrieval
IR
• 1 disk access: 1 for the record (index update in in-memory structure)
• Fast input, no rollback, no crash recovery, fast retrieval
DB-IR Integration
• 17 disk accesses: 1 for the record, 15 for all index terms, 1 for the transaction log
• Slow input, rollback, crash recovery, fast retrieval
How to solve this problem?
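The disk-access counts on this slide follow a simple additive cost model: one access for the record, one per index term when the index is updated on disk, and one for the transaction log when logging is enabled. The sketch below only restates that arithmetic; the function name and parameters are hypothetical, and real systems would batch or cache some of these accesses.

```python
def disk_accesses(n_index_terms: int, on_disk_index: bool, transaction_log: bool) -> int:
    """Count disk accesses for inserting one document under the slide's cost model."""
    accesses = 1  # writing the record itself
    if on_disk_index:
        accesses += n_index_terms  # one access per index term's postings list
    if transaction_log:
        accesses += 1  # writing the transaction log entry
    return accesses


N_TERMS = 15  # number of index terms in the sample input document, per the slide

db = disk_accesses(N_TERMS, on_disk_index=False, transaction_log=True)
ir = disk_accesses(N_TERMS, on_disk_index=False, transaction_log=False)
db_ir = disk_accesses(N_TERMS, on_disk_index=True, transaction_log=True)
print(db, ir, db_ir)
```

With hundreds of index terms per document instead of 15, the on-disk index term dominates the total, which is the problem the following slides address.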
10
A Basic Problem in DB-IR Integration
• Index Maintenance for Incoming Documents
– As a document usually contains hundreds of terms to be indexed, an index update involves hundreds of disk accesses, which makes it extremely time-consuming.
– Traditional IR systems store the incoming postings lists from a block of new documents in in-memory structures. When no additional memory is available, the in-memory postings lists are merged into the on-disk main index.
– However, the in-memory postings lists are volatile and can be lost under certain crash conditions.
– For DB-IR integration, the index update for each document should guarantee document-index integrity, as a DB typically does. We call this document-level transactional update "per-document transactional index maintenance".
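The traditional IR approach described above (buffer postings in memory, merge when memory runs out) can be sketched as follows. This is a toy model under stated assumptions: the class name, the posting-count memory budget, and the use of plain dictionaries as stand-ins for the in-memory buffer and the on-disk main index are all hypothetical. The `crash` method simply discards the buffer to show why the buffered postings are volatile.

```python
class BufferedIndexer:
    """Toy model of block-based indexing with a volatile in-memory buffer."""

    def __init__(self, max_buffered_postings: int):
        self.max_buffered_postings = max_buffered_postings  # assumed memory budget
        self.buffer = {}       # term -> doc ids, in memory (volatile)
        self.buffered = 0      # number of postings currently buffered
        self.main_index = {}   # term -> doc ids, stand-in for the on-disk main index
        self.merges = 0

    def add_document(self, doc_id: int, terms) -> None:
        for term in set(terms):
            self.buffer.setdefault(term, []).append(doc_id)
            self.buffered += 1
        if self.buffered >= self.max_buffered_postings:
            self.merge()

    def merge(self) -> None:
        """Merge all buffered postings lists into the main index."""
        for term, plist in self.buffer.items():
            self.main_index.setdefault(term, []).extend(plist)
        self.buffer.clear()
        self.buffered = 0
        self.merges += 1

    def crash(self) -> int:
        """Simulate a crash: buffered postings are lost. Returns how many."""
        lost = self.buffered
        self.buffer.clear()
        self.buffered = 0
        return lost


idx = BufferedIndexer(max_buffered_postings=4)
idx.add_document(1, ["a", "b"])
idx.add_document(2, ["a", "c"])   # budget reached, buffer is merged
idx.add_document(3, ["a"])        # stays in the volatile buffer
lost = idx.crash()
print(idx.main_index, lost)
```

After the simulated crash, document 3 has no postings in the main index, which is exactly the document-index inconsistency that per-document transactional maintenance is meant to rule out.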
11
How to Solve the Basic Problem?
• Requirements
(1) Updating the index for an incoming document should be fast. (How fast?) Avoiding relocations of long postings lists is essential to speeding up index maintenance tasks.
(2) The task should be rolled back if an error occurs.
(3) The result of the task should remain consistent even across system crashes.
• To cope with (1), separate the index update for incoming documents into a supplementary or auxiliary index storage area.
– Directly updating the on-disk main index is time-consuming because of heavy disk access.
– Instead, write index updates to a smaller auxiliary storage area.
• To cope with (2), transaction logs should be written to an on-disk area.
• To cope with (3), the auxiliary index should be stored on disk, not in memory.
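Requirements (2) and (3) can be illustrated with a small sketch of a per-document transactional update: each change is logged before it is applied, and a failure part-way through undoes the changes already made for that document. Everything here is an assumption for illustration (class and method names, the `fail_at` fault-injection parameter, and in-memory dictionaries and lists standing in for the on-disk auxiliary index and transaction log); it is not KRISTAL's implementation.

```python
class TransactionalAuxIndex:
    """Toy per-document transactional index update with an undo on failure."""

    def __init__(self):
        self.postings = {}   # term -> doc ids; stand-in for the on-disk auxiliary index
        self.documents = {}  # doc id -> terms; stand-in for the document table
        self.log = []        # stand-in for the on-disk transaction log

    def add_document(self, doc_id, terms, fail_at=None) -> bool:
        """Index one document atomically. fail_at simulates an I/O error at that term."""
        applied = []
        self.log.append(("begin", doc_id))
        try:
            for i, term in enumerate(terms):
                if fail_at == i:
                    raise OSError("simulated I/O failure")
                self.log.append(("add", doc_id, term))  # log before applying
                self.postings.setdefault(term, []).append(doc_id)
                applied.append(term)
            self.documents[doc_id] = list(terms)
            self.log.append(("commit", doc_id))
            return True
        except OSError:
            # Roll back: undo the postings added for this document, newest first.
            for term in reversed(applied):
                plist = self.postings[term]
                plist.pop()
                if not plist:
                    del self.postings[term]
            self.log.append(("abort", doc_id))
            return False


index = TransactionalAuxIndex()
ok_first = index.add_document(1, ["pulsing", "index"])
ok_second = index.add_document(2, ["pulsing", "postings"], fail_at=1)
print(ok_first, ok_second, index.postings)
```

After the failed second insertion, the index contains exactly the postings of document 1, so the document table and the index stay in agreement.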
12
KRISTAL: Index Maintenance Strategies (1/2)
• Direct Index Update (as baseline)
– The postings list for each term in a new document is appended to the main index
– Relocation of postings lists severely degrades performance
• Stand-alone Auxiliary Index
– Postings lists are updated in a small on-disk auxiliary index
– Relocation size in the auxiliary structure is usually smaller than in the main index
– As the auxiliary index grows, the relocation size grows too
• Pulsing Auxiliary Index
– As new documents arrive, an auxiliary postings list longer than a given threshold is updated in place into the main index; this keeps the auxiliary index size nearly constant throughout the addition of new documents
– Every relocation in the auxiliary index is smaller than the given threshold
– Relocations of long postings lists are dispersed among insertions of new documents
  • Example: high-frequency terms such as ‘the’, ‘on’, and ‘of’ do not always co-occur, so their long postings lists are rarely relocated during the same insertion
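The pulsing rule described above can be sketched in a few lines: postings accumulate in the auxiliary index, and any list that grows beyond the threshold is moved to the main index. The class, its attribute names, and the dictionary-based storage are hypothetical stand-ins; the real structures are on-disk B+-trees, and the in-place update of the main index is reduced here to a list extension.

```python
class PulsingIndex:
    """Toy pulsing auxiliary index: long auxiliary lists are flushed to the main index."""

    def __init__(self, threshold: int):
        self.threshold = threshold  # maximum auxiliary postings list length
        self.main = {}              # term -> doc ids, stand-in for the main index
        self.aux = {}               # term -> doc ids, stand-in for the auxiliary index
        self.pulses = 0             # number of lists flushed so far

    def add_document(self, doc_id: int, terms) -> None:
        for term in set(terms):
            plist = self.aux.setdefault(term, [])
            plist.append(doc_id)
            if len(plist) > self.threshold:
                # "Pulse": move the whole auxiliary list into the main index.
                self.main.setdefault(term, []).extend(plist)
                del self.aux[term]
                self.pulses += 1

    def postings(self, term: str):
        """A term's full postings list is the main part followed by the auxiliary part."""
        return self.main.get(term, []) + self.aux.get(term, [])


idx = PulsingIndex(threshold=2)
for doc_id in range(1, 6):
    # 'the' occurs in every document; each 'rareN' term occurs once.
    idx.add_document(doc_id, ["the", f"rare{doc_id}"])
print(idx.postings("the"), idx.pulses)
```

Because a list leaves the auxiliary index as soon as it exceeds the threshold, no auxiliary list is ever longer than the threshold, which is what bounds relocation sizes in this strategy.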
13
KRISTAL: Index Maintenance Strategies (2/2)
[Figure: three panels. (A) Direct Index Update: a document table (doc 1, doc 2, …, doc 5, doc 6, …) with a main index B+-tree. (B) Stand-alone Auxiliary Index: the document table and main index B+-tree, plus an auxiliary index B+-tree with a delete list and an update list. (C) Pulsing Auxiliary Index: the same structures as (B), with in-place update from the auxiliary index to the main index.]
14
Experimental Setting
• Hardware
– Dual Pentium CPUs (Clock Speed = 3GHz)
– 8GB of RAM
– RAID-5 SCSI HDD
• Software
– OS: RedHat Enterprise Linux 4
– Storage and Retrieval Engine: KRISTAL-IRMS
• Test Data
– Bibliographic texts
  • 10,000, 100,000, and 1,000,000 records for base data
  • Additional 10,000 documents for appending experiments
• Query Evaluation
– Three sets of single terms with varying document frequencies
– Complex queries used in the real bibliography service at KISTI
15
Experiment – A sample document
@DOCUMENT (1296)
#TITLE=Regression with Doubly Censored Current Status Data
#AUTHOR=Rabinowitz, Daniel ; Jewell, Nicholas P.
#JOURNAL=Journal of the Royal Statistical Society. Series B (Methodological)
#VOLUME=58
#NUMBER=3
#PAGE START=541
#PAGE END=550
#PUBDATE=20010324
#ABSTRACT=Data from settings in which an initiating event and a subsequent event occur in sequence are called doubly censored current status data if the time of neither event is observed directly, but instead it is determined at a random monitoring time whether either the initiating or subsequent event has yet occurred. This paper is concerned with using doubly censored current status data to estimate the regression coefficient in an accelerated failure time model for the length of time between the initiating event and the subsequent event. Motivated by a problem in the epidemiology of acquired immune deficiency syndrome, attention here is focused on a special case, the case in which the initiating event, given that it has occurred before the monitoring time, may be assumed to follow a uniform distribution. The main result is that the likelihood in the special case has the same structure as the likelihood in a simpler setting, the setting in which the time of the initiating event is known. The result allows methods developed for the simpler setting to be applied in the special case. The results of the application of the approach to real data are reported.
#KEYWORDS=Accelerated Failure Time ; Acquired Immune Deficiency Syndrome ; Current Status Data ; Double Censoring ; Survival Analysis
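The sample record suggests a simple tagged format: an `@DOCUMENT (id)` header followed by `#FIELD=value` entries, with ` ; ` separating multiple values. A minimal parsing sketch under those assumptions follows. The one-field-per-line layout, the function names, and the treatment of `;` as a value separator are inferred from this single sample, not taken from KRISTAL's documentation.

```python
import re


def parse_record(text: str) -> dict:
    """Parse one '@DOCUMENT (id)' record with one '#FIELD=value' entry per line."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = re.match(r"@DOCUMENT\s*\((\d+)\)$", lines[0])
    if header is None:
        raise ValueError("record must start with '@DOCUMENT (id)'")
    record = {"id": int(header.group(1))}
    for ln in lines[1:]:
        if not ln.startswith("#") or "=" not in ln:
            raise ValueError(f"not a '#FIELD=value' line: {ln!r}")
        name, value = ln[1:].split("=", 1)  # split on the first '=' only
        record[name.strip()] = value.strip()
    return record


def split_multi(value: str) -> list:
    """Split a multi-valued field such as AUTHOR or KEYWORDS on ';'."""
    return [part.strip() for part in value.split(";")]


sample = """
@DOCUMENT (1296)
#TITLE=Regression with Doubly Censored Current Status Data
#AUTHOR=Rabinowitz, Daniel ; Jewell, Nicholas P.
#VOLUME=58
#PAGE START=541
#PUBDATE=20010324
"""

rec = parse_record(sample)
print(rec["id"], split_multi(rec["AUTHOR"]))
```

Field names containing spaces, such as `PAGE START`, are kept verbatim as dictionary keys.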
16
Experiment – DB schema and Index Statistics
17
Experiment – Appending 10,000 Documents
[Figure: 10,000 new documents are appended to each of three pre-built tables: 10K (10,000 docs), 100K (100,000 docs), and 1M (1,000,000 docs).]
18
Experiment – 10K Table
• 10K + 10K
– Appending 10,000 new documents to a base table with 10,000 existing documents
• Results
– Direct update shows poor performance
– The stand-alone auxiliary index is better than direct update but worse than the pulsing auxiliary index
– The pulsing auxiliary strategy behaves consistently across all 10,000 insertions
[Figure: insert time per document (s) vs. number of documents inserted (0 to 10,000) for Direct Index Update, Stand-alone Auxiliary, and Pulsing Auxiliary.]
19
Experiment – 100K Table
• 100K + 10K
– Appending 10,000 new documents to a base table with 100,000 existing documents
• Results
– The pulsing auxiliary index strategy is better than the stand-alone auxiliary index
– However, the pulsing strategy shows many outlying points (spikes) throughout the insertion
[Figure: insert time per document (s) vs. number of documents inserted (0 to 10,000) for Stand-alone Auxiliary and Pulsing Auxiliary.]
20
Experiment – 1M Table
• 1M + 10K
– Appending 10,000 new documents to a base table with 1,000,000 existing documents
• Results
– The pulsing auxiliary index strategy is better than the stand-alone auxiliary index
– However, the pulsing strategy shows many large outlying points (spikes) throughout the insertion
– For larger base tables, pulsing may be inferior to the stand-alone strategy
[Figure: insert time per document (s) vs. number of documents inserted (0 to 10,000) for Stand-alone Auxiliary and Pulsing Auxiliary.]
21
Experiment – Overall Result
• Average Processing Time per Document for 10,000 Insertions
– Overall, the pulsing auxiliary strategy outperforms the stand-alone auxiliary index
– The stand-alone auxiliary index shows nearly constant performance, since the main index and the auxiliary index are independent of each other
– The pulsing strategy's performance degrades as the base table grows
Average insert time per document (s)
Strategy                 10K    100K   1M
Stand-alone Auxiliary    4.4    4.6    4.7
Pulsing Auxiliary        2.0    2.3    4.4
22
Experiment – Postings Access
• Boolean mode access for terms with varying DF ranges after adding 10,000 new documents to the 1M table
• The pulsing auxiliary index shows performance comparable to the re-built table
• cf.) Re-build = 1.1 million table built in bulk mode
Average retrieval time (ms)
Strategy           100 <= DF < 1,000   1,000 <= DF < 10,000   DF >= 10,000
Re-Build           3.8                 6.5                    19.9
Pulsing Aux        5.6                 7.5                    21.1
Stand-alone Aux    18.2                33.4                   84.1
23
Experiment – Query Evaluation (1/2)
• Target tables
– 10K, 100K, and 1M tables after adding 10,000 new documents
• Queries
– 2994 subject queries used in the KISTI bibliography database service
– Examples:
  • yellow* /N8 (polyurethane* OR urethane*)
  • silicon AND (optic* /N8 signal*) AND module*
  • food* /N3 (wastewater* OR (waste /W1 water*)) AND treat*
  • ceramic* AND (bulletproof* OR (bullet /W1 proof*) OR (bullet /W1 resist*) OR (bullet* /N2 (protect* OR resist*)))
  • wood* /N5 (substitut* OR replacement*)
  • (catalyst* OR catalyzer*) /N5 (regenerat* OR ((precious OR valu* OR noble*) /N2 metal* /N5 recover*))
– Heavy truncations reflect B+-tree performance by exploiting leaf nodes of the tree
– Within/Near operations reflect the performance of positional information
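Truncated terms such as `resist*` stress the B+-tree because every key sharing the prefix must be visited in key order at the leaf level. The sketch below imitates that access pattern with a sorted Python list and `bisect`: locate the first key not smaller than the prefix, then scan forward while keys still start with it. The sorted list is only a stand-in for the chained leaf nodes of a B+-tree, and the function name and sample vocabulary are hypothetical.

```python
from bisect import bisect_left


def prefix_scan(sorted_terms, prefix):
    """Return all terms starting with prefix, scanning sorted keys in order."""
    i = bisect_left(sorted_terms, prefix)  # first key >= prefix
    matches = []
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        matches.append(sorted_terms[i])
        i += 1
    return matches


vocabulary = sorted([
    "protect", "recover", "regenerate", "resist",
    "resistance", "resistor", "treat", "urethane",
])

print(prefix_scan(vocabulary, "resist"))
print(prefix_scan(vocabulary, "bullet"))
```

The cost of a truncation is the initial lookup plus one step per matching key, so queries with many broad truncations are dominated by the length of these leaf scans and the postings lists they pull in.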
24
Experiment – Query Evaluation (2/2)
• Average query performance for complex queries shows that the re-built table performs best
• However, on the 1M table the pulsing auxiliary index is only about 18% worse than the re-built table (120 ms vs. 147 ms), while the stand-alone auxiliary index is degraded by about 44% (120 ms vs. 212 ms). These percentages appear to express the gap relative to the slower time; measured against the 120 ms baseline, the slowdowns are about 22.5% and 77%.
Average query performance (ms) by database cardinality
Strategy                 10K    100K   1M
Re-Build                 10     18     120
Stand-alone Auxiliary    24     59     212
Pulsing Auxiliary        15     34     147
25
Conclusion
• Index Maintenance
– The Pulsing Auxiliary Index is superior to the Stand-alone Auxiliary Index strategy in index maintenance for newly arriving documents
• cf.) For larger base tables, pulsing may be inferior to the stand-alone auxiliary strategy
• Query Evaluation
– Query evaluation performance of the pulsing auxiliary index is comparable to that of the re-built table
• The pulsing auxiliary index can be a candidate index architecture for DB-IR integration
26
Discussion (1/4)
• Recent implementation of a new index maintenance strategy in KRISTAL
– Postings segmentation (4.4 seconds to 1.5 seconds per document for the 1M table)
[Figure: insert time per document (s) vs. number of documents inserted (0 to 10,000) for Pulsing Auxiliary Index and Postings Segmentation of Main Index.]
27
Discussion (2/4)
• This approach is still inferior to IR's block-of-documents (batch) approaches
– IR: 0.1 seconds per document
– KRISTAL: 1.5~4.4 seconds per document
• Overallocation
– Overallocation of postings lists in the auxiliary index may relieve the relocation problem
• Index Compression
– Compression of postings lists will reduce relocation sizes
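The effect of overallocation on relocations can be estimated with a small counting model: a postings list is relocated whenever an append finds its allocated space full, and the new allocation is the old capacity times a growth factor. This is a sketch under assumed parameters (the function name, the unit of one slot per posting, and the geometric growth policy are all hypothetical); the slides do not specify which overallocation policy KRISTAL would use.

```python
def count_relocations(n_postings: int, growth_factor: float) -> int:
    """Count relocations while appending n_postings to one postings list.

    growth_factor == 1.0 models exact-fit allocation (no overallocation):
    capacity grows by a single slot each time the list is full.
    """
    capacity = 1
    size = 0
    relocations = 0
    for _ in range(n_postings):
        if size == capacity:
            # List is full: allocate a larger area and move the list there.
            capacity = max(capacity + 1, int(capacity * growth_factor))
            relocations += 1
        size += 1
    return relocations


exact_fit = count_relocations(1000, growth_factor=1.0)
doubling = count_relocations(1000, growth_factor=2.0)
print(exact_fit, doubling)
```

In this model, exact-fit allocation relocates on almost every append, while doubling needs only a logarithmic number of relocations at the cost of unused space, which is the trade-off overallocation introduces.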
28
Discussion (3/4)
• KRISTAL toward DB-IR Integration from an IRMS viewpoint
– Solved Problems (Intra-table operations)
  • Structured query evaluation
  • Structured data processing
  • XML repository
  • Dynamic index maintenance (?)
– To be solved (Inter-table operations)
  • Table Join
  • View and Materialized View
  • Trigger
  • Query optimization (and SQL-like query language?)
29
Discussion (4/4)
• KRISTAL toward Open-Source IRMS
– Aiming at Open Source Initiative
  • Currently, KRISTAL's source is open for educational and research purposes
  • However, KRISTAL-IRMS is intended to reach OSI-level open-source status sooner or later
  • Porting KRISTAL to other languages such as Uzbek and Mongolian is in progress as an open-source effort
– Download KRISTAL at http://www.kristalinfo.com
Thank you
http://www.yeskisti.net
http://www.kristalinfo.com