An Evaluation of Index Architectures for DB-IR Integration in an Open-Source IRMS, KRISTAL Jinsuk...

An Evaluation of Index Architectures for DB-IR Integration in an Open-Source IRMS, KRISTAL

Jinsuk Kim
Information System Development [email protected]. 6. 1.


DB+IR vs. KRISTAL-IRMS

IRMS = Information Retrieval & Management System


Contents

• The Great Divide in DB and IR
• Approaches in DB-IR Integration
• Strategies for Dynamic Index Maintenance
  – Direct Index Update
  – Stand-alone Auxiliary Index Strategy
  – Pulsing Auxiliary Index Strategy
• Experimental Result
• Conclusion
• Discussion


The Great Divide in DB and IR

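[Figure: Database Systems versus Information Retrieval Systems]
• The Great Data Divide: Database Systems hold structured data; Information Retrieval Systems hold unstructured data
• The Great Query Divide: Database Systems answer complex and structured queries; Information Retrieval Systems answer ranked keyword searches

Source: Dr. Jayavel Shanmugasundaram, Cornell University, SIGMOD 2005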


RDB vs. IR

Source: Dr. R. Baeza-Yates & Dr. M. Consens, VLDB 2004

RDB
• DBs allow structured querying
• Queries and results (tuples) are different objects
• Soundness & completeness expected
• All results are equally good
• User is expected to know the structure

IR
• IR only supports unstructured querying
• Queries and results are both documents
• Results are usually imprecise and incomplete
• Some results are more relevant than others
• User is expected to be dumb

Web Search Engines ?


RDB vs. IR vs. KRISTAL

RDB
• Structured querying
• Queries and results (tuples) are different objects
• Soundness & completeness expected
• All results are equally good
• User is expected to know the structure

IR
• Unstructured querying
• Queries and results are both documents
• Results are usually imprecise and incomplete
• Some results are more relevant than others
• User is expected to be dumb

KRISTAL
• Structured & unstructured querying
• Structured & unstructured data types
• Boolean, Vector, & Vector Boolean Models

Sample document:

…
<presentation date="1 June 2007">
  <title>Index Maintenance in DB-IR Integration</title>
  <author>Jinsuk Kim</author>
  <abstract>Index maintenance strategies in DB-IR … stand-alone and pulsing auxiliary index architectures … </abstract>
</presentation>
…

Request: Find presentations dated between 2005 and 2007 whose author is “Jinsuk” and whose abstract is about “pulsing auxiliary index”.

KRISTAL Query Formula:

(DATE: 2005 ~ 2007) AND (AUTHOR: Jinsuk) AND (ABSTRACT: pulsing auxiliary index)
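The query formula can be read as a conjunction of one structured (range) condition and two keyword conditions on named fields. The following is a minimal Python sketch of that reading. The record layout, the `tokens` and `matches` helpers, and the "all terms must appear" interpretation of the ABSTRACT condition are illustrative assumptions, not KRISTAL's actual API or matching semantics.

```python
import re

# Hypothetical in-memory record mirroring the sample <presentation> element.
record = {
    "DATE": 2007,
    "AUTHOR": "Jinsuk Kim",
    "ABSTRACT": "Index maintenance strategies in DB-IR ... stand-alone and "
                "pulsing auxiliary index architectures ...",
}

def tokens(text):
    """Lowercase word tokens; hyphenated words are kept whole."""
    return set(re.findall(r"[a-z0-9-]+", text.lower()))

def matches(rec):
    """(DATE: 2005 ~ 2007) AND (AUTHOR: Jinsuk) AND (ABSTRACT: pulsing auxiliary index)"""
    in_range = 2005 <= rec["DATE"] <= 2007            # structured range condition
    author_ok = "jinsuk" in tokens(rec["AUTHOR"])     # keyword condition on one field
    # Assumption: every query term must occur somewhere in the abstract.
    abstract_ok = {"pulsing", "auxiliary", "index"} <= tokens(rec["ABSTRACT"])
    return in_range and author_ok and abstract_ok

print(matches(record))
```

The point of the sketch is only that a single formula mixes a DB-style predicate (the date range) with IR-style text predicates on the same record.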


Strategies in DB-IR Integration (1/2)

• DB-IR Middleware Approach
  – Glue existing DB and IR engines at the application level
    • DBMS for data management and IRS for text search facilities
    • Inevitable document-index gap
• DB-IR loose coupling
  – Extend DBMS by SQL-level IR interface
    • Examples: Oracle ConText, DB2 Text Extender, QUIQ, TopX, MonetDB/X100
• DB-IR tight coupling
  – Extend IR facilities in DBMS storage level (IR on DB)
    • Example: Odysseus ORDBMS
  – Extend DB management facilities in IR storage level (DB on IR: IRMS)
    • Example: KRISTAL-IRMS
• Novel architecture for DB-IR unification
  – Still under discussion
  – “The storage-level core system with RISC-style functionality in DB-IR integration” suggested by Chaudhuri et al.


Strategies in DB-IR Integration (2/2)

[Figure: systems placed by DB Features and IR Features. Labels: DBMS, DB Extenders, DB2 Cartridge, IR on DB Tight Coupling, Odysseus, DB-IR middleware approach, IRS, Web Crawlers, Web Search Engines, IRMS (DB on IR Tight Coupling), DB-IR Integration.]


As a Text Management System

Sample Input Document
  Author: Jinsuk Kim
  Title: Evaluation of index maintenance in DB-IR integration
  Keywords: pulsing auxiliary index, postings list

DB
• 2 Disk Accesses: 1 for the record, 1 for the transaction log
• Fast Input, Rollback, Crash Recovery, Slow Retrieval

IR
• 1 Disk Access: 1 for the record (index update in in-memory structure)
• Fast Input, No Rollback, No Crash Recovery, Fast Retrieval

DB-IR Integration
• 17 Disk Accesses: 1 for the record, 15 for all index terms, 1 for the transaction log
• Slow Input, Rollback, Crash Recovery, Fast Retrieval
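The three disk-access counts follow a simple additive model: one access for the record, one per index term when the index is updated on disk, and one for the transaction log when the insert is transactional. A small Python sketch of that arithmetic follows. The function name and its parameters are hypothetical, and the model is a simplification for illustration only.

```python
def disk_accesses(n_index_terms, transactional, index_on_disk):
    """Disk accesses needed to insert one document under a simplified model."""
    record = 1                                      # write the record itself
    index = n_index_terms if index_on_disk else 0   # one access per indexed term
    log = 1 if transactional else 0                 # write the transaction log
    return record + index + log

# The sample document is counted as having 15 index terms.
db = disk_accesses(15, transactional=True, index_on_disk=False)
ir = disk_accesses(15, transactional=False, index_on_disk=False)
db_ir = disk_accesses(15, transactional=True, index_on_disk=True)
print(db, ir, db_ir)
```

Under this model, the cost of DB-IR integration is dominated by the per-term index accesses, which is why the index update path is the part worth optimizing.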

How to solve this problem?


A Basic Problem in DB-IR Integration
• Index Maintenance for Incoming Documents
  – As a document usually contains hundreds of terms to be indexed, an index update involves hundreds of disk accesses. This is an extremely time-consuming task.
  – Traditional IR systems store the incoming postings lists from a block of new documents in in-memory structures. When no more memory is available, the in-memory postings lists are merged into the on-disk main index.
  – However, the in-memory postings lists are volatile and can be lost under certain crash conditions.
  – For DB-IR integration, the index update for each document should guarantee document-index integrity, as a DB typically does. We call such a document-level transaction per-document transactional index maintenance.
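The traditional IR behaviour described above (buffer postings in memory, merge when the buffer is full, lose the buffer on a crash) can be sketched as follows. This is an illustrative Python model only: the class name, the postings-count limit used as a stand-in for "memory is full", and the dictionary standing in for the on-disk main index are all assumptions.

```python
from collections import defaultdict

class BufferedIndexer:
    """Batch-style IR indexing: postings accumulate in memory, then merge to 'disk'."""

    def __init__(self, max_buffered_postings):
        self.max_buffered = max_buffered_postings
        self.buffer = defaultdict(list)      # term -> doc ids (volatile)
        self.buffered = 0
        self.main_index = defaultdict(list)  # stands in for the on-disk main index
        self.merges = 0

    def add_document(self, doc_id, terms):
        for term in set(terms):
            self.buffer[term].append(doc_id)
            self.buffered += 1
        if self.buffered >= self.max_buffered:   # "memory is full"
            self.flush()

    def flush(self):
        """Merge all buffered postings into the main index."""
        for term, doc_ids in self.buffer.items():
            self.main_index[term].extend(doc_ids)
        self.buffer.clear()
        self.buffered = 0
        self.merges += 1

    def crash(self):
        """Simulate a crash: postings not yet merged are lost."""
        lost = self.buffered
        self.buffer.clear()
        self.buffered = 0
        return lost

idx = BufferedIndexer(max_buffered_postings=4)
idx.add_document(1, ["a", "b"])
idx.add_document(2, ["a", "c"])   # buffer reaches 4 postings and is merged
idx.add_document(3, ["a"])        # stays in memory only
lost = idx.crash()
print(lost, dict(idx.main_index))
```

After the simulated crash, document 3 is stored but absent from the index, which is exactly the document-index gap that per-document transactional maintenance is meant to rule out.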


How to Solve the Basic Problem?
• Requirements
  (1) Updating the index for an incoming document should be fast. (How fast?) Avoiding relocations of long postings lists is essential to speed up index maintenance tasks.
  (2) The task should be rolled back if an error occurs.
  (3) The result of the task should remain consistent even across system crashes.
• To cope with (1), direct the index update for incoming documents to a supplementary or auxiliary index storage area.
  – Updating the on-disk main index directly is time-consuming due to heavy disk accesses.
  – Instead, update the index in a smaller auxiliary storage area.
• To cope with (2), transaction logs should be written to an on-disk area.
• To cope with (3), the auxiliary index should be stored on disk, not in memory.
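Requirements (2) and (3) can be illustrated with a small log-then-apply sketch in Python. It is not KRISTAL's implementation: the class, the log record format, and the `fail_after` parameter (used only to inject a failure) are hypothetical, and plain Python containers stand in for the on-disk auxiliary index and transaction log.

```python
class TransactionalAuxIndex:
    """Per-document index update with an undo log (illustrative model)."""

    def __init__(self):
        self.aux = {}   # term -> doc ids; stands in for the on-disk auxiliary index
        self.log = []   # stands in for the on-disk transaction log

    def add_document(self, doc_id, terms, fail_after=None):
        self.log.append(("BEGIN", doc_id))
        try:
            for i, term in enumerate(terms):
                if fail_after is not None and i == fail_after:
                    raise IOError("simulated write failure")
                self.log.append(("ADD", doc_id, term))       # log first, then apply
                self.aux.setdefault(term, []).append(doc_id)
            self.log.append(("COMMIT", doc_id))
            return True
        except IOError:
            self.rollback(doc_id)
            return False

    def rollback(self, doc_id):
        """Undo this document's logged postings, newest first."""
        for entry in reversed(self.log):
            if entry[0] == "BEGIN" and entry[1] == doc_id:
                break
            if entry[0] == "ADD" and entry[1] == doc_id:
                postings = self.aux[entry[2]]
                postings.remove(doc_id)
                if not postings:
                    del self.aux[entry[2]]
        self.log.append(("ABORT", doc_id))

    def recover(self):
        """After a crash, undo documents that began but never committed or aborted."""
        finished = {e[1] for e in self.log if e[0] in ("COMMIT", "ABORT")}
        pending = [e[1] for e in self.log if e[0] == "BEGIN" and e[1] not in finished]
        for doc_id in pending:
            self.rollback(doc_id)

idx = TransactionalAuxIndex()
ok = idx.add_document(1, ["pulsing", "index"])
failed = idx.add_document(2, ["pulsing", "postings"], fail_after=1)
print(ok, failed, idx.aux)
```

Because every posting is logged before it is applied, a failed or interrupted insert leaves enough information to remove the partial postings, so the index never references a document that was not fully stored.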


KRISTAL: Index Maintenance Strategies (1/2)

• Direct Index Update (as baseline)
  – Postings list for each term in a new document is appended to the main index
  – Relocation of postings lists severely degrades the performance
• Stand-alone Auxiliary Index
  – Postings lists are updated in a small auxiliary on-disk index
  – Relocation size in the auxiliary structure is usually smaller than in the main index
  – As the auxiliary index grows, relocation size grows too
• Pulsing Auxiliary Index
  – As new documents arrive, an auxiliary postings list longer than a given threshold is moved to the main index by in-place update; this keeps the auxiliary index size nearly constant as new documents are added
  – Every relocation in the auxiliary index is smaller than the given threshold
  – Relocations of long postings lists are dispersed among insertions of new documents
    • Example: high-frequency terms such as ‘the’, ‘on’, and ‘of’ do not co-occur exactly, so their postings lists do not all cross the threshold at the same insertion
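The pulsing behaviour can be sketched in a few lines of Python. This is a toy in-memory model under stated assumptions: the class and method names are hypothetical, dictionaries stand in for the two on-disk B+-trees, and the threshold is counted in postings rather than bytes.

```python
class PulsingIndex:
    """Toy model of a pulsing auxiliary index."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.main = {}     # term -> doc ids; stands in for the main index
        self.aux = {}      # term -> doc ids; stands in for the auxiliary index
        self.pulses = 0

    def add_document(self, doc_id, terms):
        for term in set(terms):
            postings = self.aux.setdefault(term, [])
            postings.append(doc_id)
            if len(postings) > self.threshold:
                # "Pulse": move the over-long auxiliary list into the main index.
                self.main.setdefault(term, []).extend(postings)
                del self.aux[term]
                self.pulses += 1

    def postings(self, term):
        """A query sees main-index postings followed by auxiliary postings."""
        return self.main.get(term, []) + self.aux.get(term, [])

idx = PulsingIndex(threshold=2)
docs = {1: ["the", "index"], 2: ["the"], 3: ["the", "index"], 4: ["the"], 5: ["the"]}
for doc_id, terms in docs.items():
    idx.add_document(doc_id, terms)
print(idx.postings("the"), idx.aux, idx.pulses)
```

In this model no auxiliary list ever grows past the threshold, so the auxiliary structure stays small however many documents are added, while each frequent term pays for a main-index update only once per threshold crossing.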


KRISTAL: Index Maintenance Strategies (2/2)

[Figure: index structures for the three strategies.
(A) Direct Index Update: Main Index (B+-tree) whose keys point to postings for the documents in the Document Table.
(B) Stand-alone Auxiliary Index: Main Index (B+-tree) plus a separate Auxiliary Index (B+-tree), with a Delete list and an Update list.
(C) Pulsing Auxiliary Index: the same components as (B), with In-Place Update from the Auxiliary Index to the Main Index.]


Experimental Setting
• Hardware
  – Dual Pentium CPUs (Clock Speed = 3 GHz)
  – 8 GB of RAM
  – RAID-5 SCSI HDD
• Software
  – OS: RedHat Enterprise Linux 4
  – Storage and Retrieval Engine: KRISTAL-IRMS
• Test Data
  – Bibliographic texts
    • 10,000, 100,000, and 1,000,000 records for base data
    • Additional 10,000 documents for appending experiments
• Query Evaluation
  – Three sets of single terms with varying document frequencies
  – Complex queries used in real bibliography service in KISTI


Experiment – A sample document

@DOCUMENT (1296)
#TITLE=Regression with Doubly Censored Current Status Data
#AUTHOR=Rabinowitz, Daniel ; Jewell, Nicholas P.
#JOURNAL=Journal of the Royal Statistical Society. Series B (Methodological)
#VOLUME=58
#NUMBER=3
#PAGE START=541
#PAGE END=550
#PUBDATE=20010324
#ABSTRACT=Data from settings in which an initiating event and a subsequent event occur in sequence are called doubly censored current status data if the time of neither event is observed directly, but instead it is determined at a random monitoring time whether either the initiating or subsequent event has yet occurred. This paper is concerned with using doubly censored current status data to estimate the regression coefficient in an accelerated failure time model for the length of time between the initiating event and the subsequent event. Motivated by a problem in the epidemiology of acquired immune deficiency syndrome, attention here is focused on a special case, the case in which the initiating event, given that it has occurred before the monitoring time, may be assumed to follow a uniform distribution. The main result is that the likelihood in the special case has the same structure as the likelihood in a simpler setting, the setting in which the time of the initiating event is known. The result allows methods developed for the simpler setting to be applied in the special case. The results of the application of the approach to real data are reported.
#KEYWORDS=Accelerated Failure Time ; Acquired Immune Deficiency Syndrome ; Current Status Data ; Double Censoring ; Survival Analysis


Experiment – DB schema and Index Statistics


Experiment – Appending 10,000 Documents

[Figure: 10,000 new sample text documents are appended to each of three pre-built tables: 10K (10,000 docs), 100K (100,000 docs), and 1M (1,000,000 docs).]


Experiment – 10K Table
• 10K + 10K
  – Appending 10,000 new documents to a base table with 10,000 existing documents
• Results
  – Direct update shows poor performance
  – Stand-alone auxiliary index is better than direct update but worse than pulsing auxiliary index
  – Pulsing auxiliary strategy behaves consistently over all 10,000 insertions

[Figure: insert time per document (s), 0 to 18, versus number of documents inserted, 0 to 10,000. Series: Direct Index Update, Stand-alone Auxiliary, Pulsing Auxiliary.]


Experiment – 100K Table

• 100K + 10K
  – Appending 10,000 new documents to a base table with 100,000 existing documents
• Results
  – Pulsing auxiliary index strategy is better than stand-alone auxiliary index.
  – However, the pulsing strategy shows many outlying points throughout the insertion

[Figure: insert time per document (s), 0 to 18, versus number of documents inserted, 0 to 10,000. Legend entry shown: Pulsing Auxiliary.]


Experiment – 1M Table

• 1M + 10K
  – Appending 10,000 new documents to a base table with 1,000,000 existing documents
• Results
  – Pulsing auxiliary index strategy is better than stand-alone auxiliary index.
  – However, the pulsing strategy shows many large outlying points throughout the insertion
  – For larger base tables, pulsing may be inferior to the stand-alone strategy

[Figure: insert time per document (s), 0 to 30, versus number of documents inserted, 0 to 10,000. Legend entry shown: Pulsing Auxiliary.]


Experiment – Overall Result
• Average Processing Time per Document for 10,000 Insertions
  – Overall performance of the pulsing auxiliary strategy is superior to that of the stand-alone auxiliary index.
  – Stand-alone auxiliary index shows nearly constant performance, since the main index and the auxiliary index are independent of each other.
  – Pulsing auxiliary index shows degraded performance as the base table grows.

Average insert time per document (s)

Base table    Stand-alone Auxiliary    Pulsing Auxiliary
10K           4.4                      2.0
100K          4.6                      2.3
1M            4.7                      4.4


Experiment – Postings Access
• Boolean mode access for terms with varying DF ranges after adding 10,000 new documents to the 1M table
• Pulsing auxiliary index shows performance comparable to that of the re-built table.
• cf) Re-build = table built in bulk mode from all documents (the 1M base table plus the 10,000 additions)

Average retrieval time (ms)

                   100 <= DF < 1000    1000 <= DF < 10,000    DF >= 10,000
Re-Build           3.8                 6.5                    19.9
Pulsing Aux        5.6                 7.5                    21.1
Stand-alone Aux    18.2                33.4                   84.1


Experiment – Query Evaluation(1/2)

• Target tables
  – 10K, 100K, and 1M tables after adding 10,000 new documents
• Queries
  – 2994 subject queries used in KISTI bibliography database service
  – Examples:
    • yellow* /N8 (polyurethane* OR urethane*)
    • silicon AND (optic* /N8 signal*) AND module*
    • food* /N3 (wastewater* OR (waste /W1 water*)) AND treat*
    • ceramic* AND (bulletproof* OR (bullet /W1 proof*) OR (bullet /W1 resist*) OR (bullet* /N2 (protect* OR resist*)))
    • wood* /N5 (substitut* OR replacement*)
    • (catalyst* OR catalyzer*) /N5 (regenerat* OR ((precious OR valu* OR noble*) /N2 metal* /N5 recover*))
  – Heavy truncations reflect B+-tree performance by exploiting leaf nodes of the tree
  – Within/Near operations reflect the performance of positional information
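The two operator families in these queries stress different parts of the index. The Python sketch below illustrates both under explicit assumptions: a sorted term list stands in for the B+-tree leaf level, `/N` is read as "within k positions, in either order", and `/W` as "within k positions, in the written order". The slides do not define the operators, so these readings, the function names, and the sample sentence are all hypothetical.

```python
from bisect import bisect_left

def expand_prefix(sorted_terms, prefix):
    """Truncation (term*): range scan over sorted keys, like walking B+-tree leaves."""
    start = bisect_left(sorted_terms, prefix)
    matched = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break                      # keys are sorted, so the matching run has ended
        matched.append(term)
    return matched

def near(positions_a, positions_b, k):
    """Assumed /Nk: some occurrences of a and b lie within k positions, any order."""
    return any(abs(pa - pb) <= k for pa in positions_a for pb in positions_b)

def within(positions_a, positions_b, k):
    """Assumed /Wk: some occurrence of b follows an occurrence of a by at most k."""
    return any(0 < pb - pa <= k for pa in positions_a for pb in positions_b)

# Build a tiny positional index for one document.
text = "food processing wastewater treatment and waste water reuse"
positions = {}
for pos, word in enumerate(text.split()):
    positions.setdefault(word, []).append(pos)
terms = sorted(positions)

waste_terms = expand_prefix(terms, "waste")
food_near_wastewater = near(positions["food"], positions["wastewater"], 3)
waste_then_water = within(positions["waste"], positions["water"], 1)
print(waste_terms, food_near_wastewater, waste_then_water)
```

Truncation cost grows with the number of keys that share the prefix, while proximity cost grows with the number of stored positions, which is why the query set exercises both.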


Experiment – Query Evaluation(2/2)

• For complex queries, the re-built table gives the best average query performance
• However, the pulsing auxiliary index is only 18% worse than the re-built table (for the 1M table, 120 ms versus 147 ms), while the stand-alone auxiliary index is degraded by 44% (120 ms versus 212 ms)

[Figure: average query performance (ms), 0 to 250, versus database cardinality (x1000), 0 to 1000. Series: Re-Build, Stand-alone Auxiliary, Pulsing Auxiliary. At 1M: Re-Build 120, Pulsing Auxiliary 147, Stand-alone Auxiliary 212. Other data labels in the chart: 10, 18, 24, 59, 34, 15.]
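The slide does not state how the 18% and 44% figures were derived from the 1M timings. As a hedged check, the Python sketch below computes the degradation two ways: relative to the re-built baseline, and relative to the slower measured time. The function names are hypothetical, and the conclusion about which definition was used is an inference from the arithmetic, not something the source confirms.

```python
def slowdown_vs_baseline(baseline_ms, measured_ms):
    """Extra time as a percentage of the baseline (re-built) time."""
    return (measured_ms - baseline_ms) / baseline_ms * 100

def loss_vs_measured(baseline_ms, measured_ms):
    """Extra time as a percentage of the slower, measured time."""
    return (measured_ms - baseline_ms) / measured_ms * 100

REBUILD, PULSING, STANDALONE = 120, 147, 212   # ms, 1M table, from the slide

for name, ms in (("pulsing", PULSING), ("stand-alone", STANDALONE)):
    print(name,
          round(slowdown_vs_baseline(REBUILD, ms), 1),
          round(loss_vs_measured(REBUILD, ms), 1))
```

The second definition gives roughly 18% and 43%, close to the quoted figures, so the slide's percentages were probably taken relative to the slower time. Relative to the re-built baseline, the same timings correspond to slowdowns of about 22.5% and 77%.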


Conclusion

• Index Maintenance
  – The Pulsing Auxiliary Index is superior to the Stand-alone Auxiliary Index strategy in index maintenance for newly arriving documents
    • cf) For larger base tables, pulsing may be inferior to the stand-alone auxiliary strategy
• Query Evaluation
  – Query evaluation performance of the pulsing auxiliary index is comparable to that of the re-built table
• The pulsing auxiliary index can be a candidate index architecture for DB-IR integration


Discussion (1/4)
• Recent implementation of a new index maintenance strategy in KRISTAL
  – Postings segmentation (4.4 seconds to 1.5 seconds for the 1M table)

[Figure: insert time per document (s), 0 to 30, versus number of documents inserted, 0 to 10,000. Series: Pulsing Auxiliary Index, Postings Segmentation of Main Index.]


Discussion (2/4)

• This approach is still inferior to IR’s block-of-documents approaches

– IR: 0.1 seconds per document

– KRISTAL: 1.5~4.4 seconds per document

• Overallocation

– Overallocation of postings lists in the auxiliary index may relieve the relocations problem

• Index Compression

– Compression of postings lists will reduce relocation sizes
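To make the compression point concrete, here is a minimal Python sketch of one common postings compression scheme: gap encoding followed by variable-byte coding. The slides do not say which scheme KRISTAL would use, so this technique is swapped in purely as an example, and the function names are hypothetical.

```python
def encode_postings(doc_ids):
    """Gap-encode a sorted list of doc ids, then variable-byte encode each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev
        prev = doc_id
        while gap >= 128:
            out.append(gap & 0x7F)   # low 7 bits; more bytes follow
            gap >>= 7
        out.append(gap | 0x80)       # high bit set marks the last byte of a gap
    return bytes(out)

def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids, value, shift, prev = [], 0, 0, 0
    for byte in data:
        if byte & 0x80:              # last byte of this gap
            value |= (byte & 0x7F) << shift
            prev += value
            doc_ids.append(prev)
            value, shift = 0, 0
        else:
            value |= byte << shift
            shift += 7
    return doc_ids

ids = [3, 7, 130, 131, 100000]
packed = encode_postings(ids)
print(len(packed), "bytes instead of", 4 * len(ids), "with fixed 4-byte ids")
```

Shorter encoded lists mean that when a postings list does have to be relocated, fewer bytes are moved, which is the effect the slide refers to.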


Discussion (3/4)

• KRISTAL toward DB-IR Integration from an IRMS viewpoint

  – Solved Problems (Intra-table operations)
    • Structured query evaluation
    • Structured data processing
    • XML repository
    • Dynamic index maintenance (?)
  – To be solved (Inter-table operations)
    • Table Join
    • View and Materialized View
    • Trigger
    • Query optimization (and SQL-like query language?)


Discussion (4/4)

• KRISTAL toward Open-Source IRMS

  – Aiming at Open Source Initiative
    • Currently KRISTAL’s source is open for educational and research purposes
    • However, KRISTAL-IRMS is intended to reach OSI level sooner or later
    • Porting KRISTAL to other languages such as Uzbek and Mongolian is in progress at the open-source level

– Download KRISTAL at http://www.kristalinfo.com

Thank you

http://www.yeskisti.net

http://www.kristalinfo.com
