Is Your Index Reader Really Atomic or Maybe Slow?

66
Is Your IndexReader Really Atomic or Maybe Slow? Uwe Schindler SD DataSolutions GmbH, [email protected] 1

description

Presented by Uwe Schindler | SD DataSolutions GmbH - See conference video - http://www.lucidimagination.com/devzone/events/conferences/lucene-revolution-2012 Since the first day, Apache Lucene exposed the two fundamental concepts of reading and writing an index directly through IndexReader & IndexWriter. However, the API did not reflect reality; from the IndexWriter perspective this was desirable but when reading the index this caused several problems in the past. In reality a Lucene index is not a single index while logically treated as a such. This talk will introduce the new API classes AtomicReader and CompositeReader added in Lucene 4.0 as very general interfaces, and DirectoryReader, which most people know as the segment-based “Lucene index on disk”. The talk will also cover more changes and improvements to the search API like reader contexts that allow to convert local document ids to global ones from IndexSearcher. Lucene changed all IndexReaders to be read-only, so it’s no longer possible to modify indexes using those classes. Finally, Uwe Schindler will show migration paths from custom norm values to the various new ranking models that were added to Lucene; this includes using Similarity with Lucene 4.0’s DocValues as replacement for norms.

Transcript of Is Your Index Reader Really Atomic or Maybe Slow?

Page 1: Is Your Index Reader Really Atomic or Maybe Slow?

Is Your IndexReader Really Atomic or Maybe Slow?

Uwe Schindler

SD DataSolutions GmbH, [email protected]

1

Page 2: Is Your Index Reader Really Atomic or Maybe Slow?

My Background

• I am committer and PMC member of Apache Lucene and Solr. My main focus is on development of Lucene Java.

• Implemented fast numerical search and maintaining the new attribute-based text analysis API. Well known as Generics and Sophisticated Backwards Compatibility Policeman.

• Working as consultant and software architect for SD DataSolutions GmbH in Bremen, Germany. The main task is maintaining PANGAEA (Publishing Network for Geoscientific & Environmental Data) where I implemented the portal's geo-spatial retrieval functions with Apache Lucene Core.

• Talks about Lucene at various international conferences like the previous Lucene Revolution, Lucene Eurocon, ApacheCon EU/US, Berlin Buzzwords, and various local meetups.

2

Page 3: Is Your Index Reader Really Atomic or Maybe Slow?

Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

3

Page 4: Is Your Index Reader Really Atomic or Maybe Slow?

Lucene Index Structure

• Lucene was the first full text search engine that supported document additions and updates

• Snapshot isolation ensures consistency

Segmented index structure

Committing changes creates new segments

4

Page 5: Is Your Index Reader Really Atomic or Maybe Slow?

Segments in Lucene

• Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing.

5 *) The term “optimal” does not mean all indexes must be optimized!

Ima

ge:

Do

ug

Cu

ttin

g

Page 6: Is Your Index Reader Really Atomic or Maybe Slow?

Segments in Lucene

• Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing.

• Lucene writes segments incrementally and can merge them.

6 *) The term “optimal” does not mean all indexes must be optimized!

Ima

ge:

Do

ug

Cu

ttin

g

Page 7: Is Your Index Reader Really Atomic or Maybe Slow?

Segments in Lucene

• Each index consists of various segments placed in the index directory. All documents are added to new new segment files, merged with other on-disk files after flushing.

• Lucene writes segments incrementally and can merge them.

• Optimized* index consists of one segment.

7 *) The term “optimal” does not mean all indexes must be optimized!

Ima

ge:

Do

ug

Cu

ttin

g

Page 8: Is Your Index Reader Really Atomic or Maybe Slow?

Lucene merges while indexing all of English Wikipedia

8 Video by courtesy of:

Mike McCandless’ blog, http://goo.gl/kI53f

Page 9: Is Your Index Reader Really Atomic or Maybe Slow?

Indexes in Lucene (up to version 3.6)

• Each segment (“atomic index”) is a completely functional index: – SegmentReader implements the IndexReader

interface for single segments

• Composite indexes – DirectoryReader implements the IndexReader

interface on top of a set of SegmentReaders

– MultiReader is an abstraction of multiple IndexReaders combined to one virtual index

9

Page 10: Is Your Index Reader Really Atomic or Maybe Slow?

Composite Indexes (up to version 3.6)

Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term

index (priority queue for TermEnum,…)

– Postings: posting lists for each term appended, convert document IDs to be global

– Metadata: doc frequency, term frequency,…

– Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search)

– FieldCache: duplicate instances for single segments and composite view (memory!!!)

10

Page 11: Is Your Index Reader Really Atomic or Maybe Slow?

Merging Term Index and Postings

Term 1

Term 2

Term 4

Term 7

Term 8

Term 9

Term 1

Term 3

Term 5

Term 6

Term 8

Term 9

Term 1

Term 2

Term 3

Term 4

Term 5

Term 6

Term 7

Term 8

Term 9

11

Page 12: Is Your Index Reader Really Atomic or Maybe Slow?

Merging Term Index and Postings

Term 1

Term 2

Term 4

Term 7

Term 8

Term 9

Term 1

Term 3

Term 5

Term 6

Term 8

Term 9

Term 1

Term 2

Term 3

Term 4

Term 5

Term 6

Term 7

Term 8

Term 9

12

Page 13: Is Your Index Reader Really Atomic or Maybe Slow?

Composite Indexes (up to version 3.6)

Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term

index (priority queue for TermEnum,…)

– Postings: posting lists for each term appended, convert document IDs to be global

– Metadata: doc frequency, term frequency,…

– Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search)

– FieldCache: duplicate instances for single segments and composite view (memory!!!)

13

Page 14: Is Your Index Reader Really Atomic or Maybe Slow?

Searching before Version 2.9

• IndexSearcher used the underlying index always as a “single” (atomic) index: – Queries are executed on the atomic view of a

composite index

– Slowdown for queries that scan term dictionary (MultiTermQuery) or hit lots of documents, facetting => recommendation to “optimize” index

– On every index change, FieldCache used for sorting had to be reloaded completely

14

Ima

ge: ©

The W

alt Disn

ey Co

mp

any

Page 15: Is Your Index Reader Really Atomic or Maybe Slow?

Lucene 2.9 and later:

Per segment search

Search is executed separately on each index segment:

Atomic view no longer used!

15

Page 16: Is Your Index Reader Really Atomic or Maybe Slow?

Per Segment Search: Pros

• No live term dictionary merging

• Possibility to parallelize

– ExecutorService in IndexSearcher since Lucene 3.1+

– Do not optimize to make this work!

• Sorting only needs per-segment FieldCache

– Cheap reopen after index changes!

• Filter cache per segment

– Cheap reopen after index changes!

16

Page 17: Is Your Index Reader Really Atomic or Maybe Slow?

Per Segment Search: Cons

• Query/Filter API changes: – Scorer / Filter’s DocIdSet no longer use global

document IDs

• Slower sorting by string terms – Term ords are only comparable inside each

segment

– String comparisons needed after segment traversal

– Use numeric sorting if possible!!! (Lucene supports missing values since version 3.4 [buggy], corrected 3.5+)

17

Page 18: Is Your Index Reader Really Atomic or Maybe Slow?

Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

18

Page 19: Is Your Index Reader Really Atomic or Maybe Slow?

Composite Indexes (up to version 3.6)

Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term

index (priority queue for TermEnum,…)

– Postings: posting lists for each term appended, convert document IDs to be global

– Metadata: doc frequency, term frequency,…

– Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search)

– FieldCache: duplicate instances for single segments and composite view (memory!!!)

19

Page 20: Is Your Index Reader Really Atomic or Maybe Slow?

Composite Indexes (version 4.0)

Atomic “views” on multi-segment index: – Term Dictionary: on-the-fly merged & sorted term

index (priority queue for TermEnum,…)

– Postings: posting lists for each term appended, convert document IDs to be global

– Metadata: doc frequency, term frequency,…

– Stored fields, term vectors, deletions: delegate global document IDs -> segment document IDs (binary search)

– FieldCache: duplicate instances for single segments and composite view (memory!!!)

20

Page 21: Is Your Index Reader Really Atomic or Maybe Slow?

Early Lucene 4.0

• Only “historic” IndexReader interface available since Lucene 1.0

• 80% of all methods of composite IndexReaders throwed UnsupportedOperationException – This affected all user-facing APIs

(SegmentReader was hidden / marked experimental)

• No compile time safety! – Query Scorers and Filters need term dictionary and

postings, throwing UOE when executed on composite reader

21

Ima

ge:

© T

he

Wal

t D

isn

ey C

om

pan

y

Page 22: Is Your Index Reader Really Atomic or Maybe Slow?

Heavy Committing™ !!!

22

Ima

ge:

© T

he

Wal

t D

isn

ey C

om

pan

y

Page 23: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

23

Page 24: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

• Does not know anything about “inverted index” concept – no terms, no postings!

24

Page 25: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

• Does not know anything about “inverted index” concept – no terms, no postings!

• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!

25

Page 26: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

• Does not know anything about “inverted index” concept – no terms, no postings!

• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!

• Some very limited metadata – number of documents,…

26

Page 27: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

• Does not know anything about “inverted index” concept – no terms, no postings!

• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!

• Some very limited metadata – number of documents,…

• Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!!

27

Page 28: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

• Does not know anything about “inverted index” concept – no terms, no postings!

• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!

• Some very limited metadata – number of documents,…

• Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!!

• Passed to IndexSearcher – just as before!

28

Page 29: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader

• Most generic (abstract) API to an index – superclass of more specific types – cannot be subclassed directly (no public constructor)

• Does not know anything about “inverted index” concept – no terms, no postings!

• Access to stored fields (and term vectors) by document ID – to display / highlight search results only!

• Some very limited metadata – number of documents,…

• Unmodifiable: No more deletions / changing of norms possible! – Applies to all readers in Lucene 4.0 !!!

• Passed to IndexSearcher – just as before!

• Allows IndexReader.open() for backwards compatibility (deprecated)

29

Page 30: Is Your Index Reader Really Atomic or Maybe Slow?

AtomicReader

• Inherits from IndexReader

30

Page 31: Is Your Index Reader Really Atomic or Maybe Slow?

AtomicReader

• Inherits from IndexReader

• Access to “atomic” indexes (single segments)

31

Page 32: Is Your Index Reader Really Atomic or Maybe Slow?

AtomicReader

• Inherits from IndexReader

• Access to “atomic” indexes (single segments)

• Full term dictionary and postings API

32

Page 33: Is Your Index Reader Really Atomic or Maybe Slow?

AtomicReader

• Inherits from IndexReader

• Access to “atomic” indexes (single segments)

• Full term dictionary and postings API

33

Page 34: Is Your Index Reader Really Atomic or Maybe Slow?

AtomicReader

• Inherits from IndexReader

• Access to “atomic” indexes (single segments)

• Full term dictionary and postings API

• Access to DocValues (new in Lucene 4.0) and norms

34

Page 35: Is Your Index Reader Really Atomic or Maybe Slow?

CompositeReader

• No additional functionality on top of IndexReader

35

Page 36: Is Your Index Reader Really Atomic or Maybe Slow?

CompositeReader

• No additional functionality on top of IndexReader

• Provides getSequentialSubReaders() to retrieve all child readers

36

Page 37: Is Your Index Reader Really Atomic or Maybe Slow?

CompositeReader

• No additional functionality on top of IndexReader

• Provides getSequentialSubReaders() to retrieve all child readers

• DirectoryReader and MultiReader implement this class

37

Page 38: Is Your Index Reader Really Atomic or Maybe Slow?

DirectoryReader

• Abstract class, defines interface for:

38

Page 39: Is Your Index Reader Really Atomic or Maybe Slow?

DirectoryReader

• Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class) – access to commit points, index metadata, index

version, isCurrent() for reopen support – defines abstract openIfChanged (for cheap reopening

of indexes) – child readers are always AtomicReader instances

39

Page 40: Is Your Index Reader Really Atomic or Maybe Slow?

DirectoryReader

• Abstract class, defines interface for: – access to on-disk indexes (on top of Directory class)

– access to commit points, index metadata, index version, isCurrent() for reopen support

– defines abstract openIfChanged (for cheap reopening of indexes)

– child readers are always AtomicReader instances

• Provides static factory methods for opening indexes – well-known from IndexReader in Lucene 1 to 3

– factories return internal DirectoryReader implementation (StandardDirectoryReader with SegmentReaders as childs)

40

Page 41: Is Your Index Reader Really Atomic or Maybe Slow?

Basic Search Example

Looks familiar, doesn’t it?

41

Ima

ge:

© T

he

Wal

t D

isn

ey C

om

pan

y

Page 42: Is Your Index Reader Really Atomic or Maybe Slow?

“Arrgh! I still need terms and postings on my DirectoryReader!!! What should I do? Optimize to only have one segment? Help me please!!!”

(example question on [email protected])

42

Page 43: Is Your Index Reader Really Atomic or Maybe Slow?

What to do?

• Calm down

• Take a break and drink a beer*

• Don’t optimize / force merge your index!!!

43 *) Robert’s solution

Page 44: Is Your Index Reader Really Atomic or Maybe Slow?

Solutions

• Most efficient way:

– Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves()

– Iterate over sub-readers, do the work (possibly parallelized)

– Merge results

44

Page 45: Is Your Index Reader Really Atomic or Maybe Slow?

Solutions

• Most efficient way:

– Retrieve atomic leaves from your composite: reader.getTopReaderContext().leaves()

– Iterate over sub-readers, do the work (possibly parallelized)

– Merge results

• Otherwise: wrap your CompositeReader:

45

Page 46: Is Your Index Reader Really Atomic or Maybe Slow?

(Slow) Solution

• Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too

46

AtomicReader

SlowCompositeReaderWrapper.wrap(IndexReader r)

Page 47: Is Your Index Reader Really Atomic or Maybe Slow?

(Slow) Solution

• Wraps IndexReaders of any kind as atomic reader, providing terms, postings, deletions, doc values – Internally uses same algorithms like previous Lucene readers – Segment-merging uses this to merge segments, too

• Solr always provides an AtomicReader for convenience through SolrIndexSearcher. Plugin writers should use:

47

AtomicReader

SlowCompositeReaderWrapper.wrap(IndexReader r)

AtomicReader

rd = mySolrIndexSearcher.getAtomicReader()

Page 48: Is Your Index Reader Really Atomic or Maybe Slow?

Other readers

• FilterAtomicReader

– Was FilterIndexReader in 3.x (but now solely works on atomic readers)

– Allows to filter terms, postings, deletions

– Useful for index splitters (e.g., PKIndexSplitter, MultiPassIndexSplitter) (provide own getLiveDocs() method, merge to IndexWriter)

• ParallelAtomicReader, -CompositeReader

48

Page 49: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader Reopening

• Reopening solely provided by directory-based DirectoryReader instances

49

Page 50: Is Your Index Reader Really Atomic or Maybe Slow?

IndexReader Reopening

• Reopening solely provided by directory-based DirectoryReader instances

• No more reopen for:

– AtomicReader: they are atomic, no refresh possible

– MultiReader: reopen child readers separately, create new MultiReader on top of reopened readers

– Parallel*Reader, FilterAtomicReader: reopen wrapped readers, create new wrapper afterwards

50

Page 51: Is Your Index Reader Really Atomic or Maybe Slow?

Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

51

Page 52: Is Your Index Reader Really Atomic or Maybe Slow?

WTF ?!?

IndexReaderContext

AtomicReaderContext

CompositeReaderContext

52

Page 53: Is Your Index Reader Really Atomic or Maybe Slow?

The Problem

53

Doc0

Doc1

Doc2

Doc3

Doc0

Doc1

Doc2

Doc3

Doc0

Doc1

Doc2

Doc3

Atomic0 Atomic1 Atomic2

CompositeA: DirectoryReader

Doc0

Doc1

Doc2

Doc3

Doc0

Doc1

Doc2

Doc3

Doc0

Doc1

Doc2

Doc3

Atomic0 Atomic1 Atomic2

CompositeB: DirectoryReader

SuperComposite: MultiReader

Page 54: Is Your Index Reader Really Atomic or Maybe Slow?

The Problem

54

Doc0

Doc1

Doc2

Doc3

Doc4

Doc5

Doc6

Doc7

Doc8

Doc9

Doc10

Doc11

Atomic0 Atomic1 Atomic2

CompositeA: DirectoryReader

Doc0

Doc1

Doc2

Doc3

Doc0

Doc1

Doc2

Doc3

Doc0

Doc1

Doc2

Doc3

Atomic0 Atomic1 Atomic2

CompositeB: DirectoryReader

SuperComposite: MultiReader

Page 55: Is Your Index Reader Really Atomic or Maybe Slow?

The Problem

55

Doc0

Doc1

Doc2

Doc3

Doc4

Doc5

Doc6

Doc7

Doc8

Doc9

Doc10

Doc11

Atomic0 Atomic1 Atomic2

CompositeA: DirectoryReader

Doc12

Doc13

Doc14

Doc15

Doc16

Doc17

Doc18

Doc19

Doc20

Doc21

Doc22

Doc23

Atomic0 Atomic1 Atomic2

CompositeB: DirectoryReader

SuperComposite: MultiReader

Page 56: Is Your Index Reader Really Atomic or Maybe Slow?

Solution

• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()

56

Page 57: Is Your Index Reader Really Atomic or Maybe Slow?

Solution

• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders)

57

Page 58: Is Your Index Reader Really Atomic or Maybe Slow?

Solution

• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders)

• Each atomic leave has a docBase (document ID offset)

58

Page 59: Is Your Index Reader Really Atomic or Maybe Slow?

Solution

• Each IndexReader instance provides: IndexReaderContext getTopReaderContext()

• The context provides a “view” on all childs (direct descendants) and leaves (down to lowest level AtomicReaders)

• Each atomic leave has a docBase (document ID offset)

• IndexSearcher passes a context instance relative to its own top-level reader to each Query-Scorer / Filter – allows to access complete reader tree up to the current top-

level reader

– Allows to get the “global” document ID

59

Page 60: Is Your Index Reader Really Atomic or Maybe Slow?

Agenda

• Motivation / History of Lucene

• AtomicReader & CompositeReader

• Reader contexts

• Wrap up

60

Page 61: Is Your Index Reader Really Atomic or Maybe Slow?

Summary

• Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface

• Lucene 4.0 will split the IndexReader class into several abstract interfaces

• IndexReaderContexts will support per-segment search preserving top-level document IDs

61

Page 62: Is Your Index Reader Really Atomic or Maybe Slow?

Summary

• Lucene moved from global searches to per-segment search in Lucene 2.9

• Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface

• Lucene 4.0 will split the IndexReader class into several abstract interfaces

• IndexReaderContexts will support per-segment search preserving top-level document IDs

62

Page 63: Is Your Index Reader Really Atomic or Maybe Slow?

Summary

• Up to Lucene 3.6 indexes are still accessible on any hierarchy level with same interface

• Lucene 4.0 will split the IndexReader class into several abstract interfaces

• IndexReaderContexts will support per-segment search preserving top-level document IDs

63

Page 64: Is Your Index Reader Really Atomic or Maybe Slow?

Summary

• Lucene 4.0 will split the IndexReader class into several abstract interfaces

• IndexReaderContexts will support per-segment search preserving top-level document IDs

64

Page 65: Is Your Index Reader Really Atomic or Maybe Slow?

Summary

• Lucene 4.0 will split the IndexReader class into several abstract interfaces

• IndexReaderContexts will support per-segment search preserving top-level document IDs

65

Page 66: Is Your Index Reader Really Atomic or Maybe Slow?

Contact Uwe Schindler

[email protected] http://www.thetaphi.de

@thetaph1

SD DataSolutions GmbH Wätjenstr. 49

28213 Bremen, Germany +49 421 40889785-0

http://www.sd-datasolutions.de

66