Parallel and Distributed Information Retrieval System
Special Topics in Computer Science
Advanced Topics in Information Retrieval
Lecture 7 (book chapter 9): Parallel and Distributed IR
Alexander Gelbukh
www.Gelbukh.com
Previous Chapter: Conclusions
- How to accelerate search? Same results as sequential
- Ideas:
  - Quick-and-dirty rejection of bad objects, 100% recall
  - Fast data structure for search (based on clustering)
  - Careful check of all found candidates
- Solution: mapping into a lower-dimensional feature space
- Condition: lower-bounding of the distance
- Assumption: skewed spectrum distribution
  - Few coefficients concentrate energy, rest are less important
Previous Chapter: Research topics
- Object detection (pattern and image recognition)
- Automatic feature selection
- Spatial indexing data structures (more than 1D)
- New types of data
  - What features to select? How to determine them?
  - Mixed-type data (e.g., webpages, or images with sound and description)
  - What clustering/IR methods are better suited for what features? (What features for what methods?)
  - Similar methods in data mining, ...
The problem
- Very large document collections
  - Google: 4,000,000,000 pages
  - Slow response?
- Solution: parallel computing
  - Google: 10,000 computers
Parallel architectures

                       Data stream
  Instruction stream   Single             Multiple
  Single               SISD (classical)   SIMD (simple)
  Multiple             MISD (rare)        MIMD (many SISD)
MIMD architecture
- The most common
- Can be
  - tightly coupled
  - loosely coupled
- Distributed
  - Many computers interacting via network
  - PC clusters
  - Similar to MIMD computers, but greater cost of communication: very loosely coupled
  - More coarse-grained programs
Performance improvement
- Time: speedup S
  - Ideally, N times (number of processors)
  - In practice impossible
    - The problem does not decompose into N equal parts
    - Communication and control overhead
  - S < 1 / f, where f is the largest separable fraction of the problem
- Cost
  - Per processor: S / N
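The bound S < 1 / f can be illustrated with Amdahl's law. This is only a sketch under one assumption that the slide does not spell out: f is read here as the fraction of the work that cannot be spread over several processors, so that S = 1 / (f + (1 - f) / N). The function names and the value f = 0.05 are illustrative.

```python
def speedup(f: float, n: int) -> float:
    """Amdahl's law: speedup on n processors when a fraction f of the work is sequential."""
    return 1.0 / (f + (1.0 - f) / n)


def efficiency(f: float, n: int) -> float:
    """Speedup per processor, S / N."""
    return speedup(f, n) / n


# With f = 0.05 the speedup can never reach 1 / f = 20, however many processors are added.
for n in (1, 10, 100, 1000):
    print(n, round(speedup(0.05, n), 2), round(efficiency(0.05, n), 3))
```

The per-processor figure S / N falls as N grows, which is why adding processors gets progressively less cost-effective.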
Two approaches to parallelism
- Build new algorithms
  - E.g., neural nets
  - Naturally parallel
  - Problem: to define the retrieval task
- Adapt the existing techniques to parallelism
  - Allows relying on well-studied approaches
  - We will consider this option
Ways to use parallelism
- Multitasking
  - N search engines
  - Good for processing many queries
  - Problems:
    - A single query is not sped up
    - Bottleneck: disk access (index)
    - Possible solution: replicating (part of) the data. RAIDs
- Parallel algorithms
  - IR = data. Main question: how to partition the data
  - Document / index term matrix (terms can be LSI dimensions, signature bits, etc.)
Possible partitionings
- Horizontal: document partitioning. Union of results
- Vertical: term partitioning. Basically, intersect results
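The two partitionings can be sketched on a toy document/term matrix. Everything here (the matrix, the block sizes, the function names) is made up for illustration; a real system would partition an index structure, not a dense matrix.

```python
# Toy document/term matrix: rows are documents, columns are index terms.
MATRIX = [
    [1, 1, 0, 0],  # doc 0
    [0, 1, 1, 0],  # doc 1
    [1, 0, 1, 1],  # doc 2
    [0, 0, 1, 1],  # doc 3
]
ROWS_PER_PROC = 2  # illustrative block sizes
COLS_PER_PROC = 2


def horizontal(matrix):
    """Document partitioning: each processor gets a block of rows (whole documents)."""
    return [matrix[i:i + ROWS_PER_PROC] for i in range(0, len(matrix), ROWS_PER_PROC)]


def vertical(matrix):
    """Term partitioning: each processor gets a block of columns (whole terms)."""
    width = len(matrix[0])
    return [[row[j:j + COLS_PER_PROC] for row in matrix]
            for j in range(0, width, COLS_PER_PROC)]


def search_horizontal(parts, query_cols):
    """Each processor answers the whole AND query on its own rows; answers are united."""
    hits, offset = set(), 0
    for part in parts:
        hits |= {offset + i for i, row in enumerate(part)
                 if all(row[c] for c in query_cols)}
        offset += len(part)
    return hits


def search_vertical(parts, query_cols):
    """Each processor returns the documents for the query terms it owns; answers are intersected."""
    hits = None
    for c in query_cols:
        proc, local = divmod(c, COLS_PER_PROC)  # owner of column c and its local index
        docs = {i for i, row in enumerate(parts[proc]) if row[local]}
        hits = docs if hits is None else hits & docs
    return hits if hits is not None else set()


print(search_horizontal(horizontal(MATRIX), [0, 2]))
print(search_vertical(vertical(MATRIX), [0, 2]))
```

Both calls answer the same AND query over terms 0 and 2; only the way the partial answers are combined differs.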
Inverted files: Logical partitioning
- Logical vs. physical document partitioning
- Logical: for each term, use pointers into inverted file data for each processor, to indicate its portion
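One way to picture the per-processor pointers is as offsets into a shared posting list. The sketch below assumes posting lists sorted by document id and a contiguous interval of document ids per processor; the names INVERTED, BOUNDS and portion are hypothetical.

```python
from bisect import bisect_left

# One shared inverted file: term -> posting list sorted by document id.
INVERTED = {
    "parallel": [0, 3, 4, 7, 9],
    "index": [1, 2, 3, 8],
}
# Hypothetical assignment: processor p owns doc ids in [BOUNDS[p], BOUNDS[p + 1]).
BOUNDS = [0, 4, 8, 10]


def build_pointers(inverted, bounds):
    """For each term, record where each processor's portion of the posting list starts."""
    return {term: [bisect_left(postings, b) for b in bounds]
            for term, postings in inverted.items()}


POINTERS = build_pointers(INVERTED, BOUNDS)


def portion(term, proc):
    """Slice of the shared posting list that processor `proc` is responsible for."""
    start, end = POINTERS[term][proc], POINTERS[term][proc + 1]
    return INVERTED[term][start:end]


print(POINTERS["parallel"])
print(portion("parallel", 1))
```

The inverted file itself is not split; only the small pointer table differs from a sequential index.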
Inverted files: Logical partitioning. Construction and updating
- Also parallel
- Construction
  - Assign docs to processors
  - Order docs such that each processor has an interval
  - Process in parallel
  - Merge. Each piece is ordered already
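The construction steps can be sketched as follows. Threads stand in for processors purely for illustration (in CPython they do not give real CPU parallelism), and the toy collection and function names are assumptions, not part of the lecture.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

DOCS = ["parallel index", "index query", "parallel query",
        "query signature", "signature index", "parallel signature"]


def index_interval(start, end):
    """Build a partial inverted file for documents start..end-1 (one processor's interval)."""
    partial = defaultdict(list)
    for doc_id in range(start, end):
        for term in set(DOCS[doc_id].split()):
            partial[term].append(doc_id)  # doc ids arrive in increasing order
    return partial


def build_index(n_workers):
    """Split the docs into contiguous intervals, index them in parallel, then merge."""
    size = -(-len(DOCS) // n_workers)  # ceiling division
    intervals = [(s, min(s + size, len(DOCS))) for s in range(0, len(DOCS), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(lambda iv: index_interval(*iv), intervals))
    merged = defaultdict(list)
    for partial in partials:  # intervals are in doc-id order, so appending keeps lists sorted
        for term, postings in partial.items():
            merged[term].extend(postings)
    return dict(merged)


INDEX = build_index(3)
print(INDEX["parallel"])
```

Because each worker covers an interval of document ids, the merge is a plain concatenation per term and no re-sorting is needed.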
Inverted files: Physical document partitioning
- Several separate collections, one per processor
- Separate indices
- Then the lists are merged (they are already ordered)
- Priority queue is used
  - The result is not sorted; insertion is quick
  - The maximal element can be found quickly
  - First k elements can be found rather quickly
  - Details in the book
- Consistent scores are needed
  - Global statistics are needed. Can be computed at index time
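The merge step can be sketched with Python's heap-based merge. The per-processor result lists are invented, and the scores are assumed to be comparable across processors (that is, already computed with global statistics, as the slide requires).

```python
import heapq
from itertools import islice

# Hypothetical per-processor results: (score, doc_id), each list sorted by decreasing score.
RESULTS = [
    [(0.9, "a3"), (0.4, "a1")],
    [(0.8, "b7"), (0.7, "b2"), (0.1, "b5")],
    [(0.6, "c4")],
]


def top_k(ranked_lists, k):
    """Merge already-ordered lists with a heap and stop after the first k hits."""
    merged = heapq.merge(*ranked_lists, reverse=True)  # lazy: pulls only what is consumed
    return list(islice(merged, k))


print(top_k(RESULTS, 3))
```

Only as many elements as needed are taken from each list, which is what makes the first k answers cheap to obtain.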
Logical or physical partitioning?
- Logical requires less communication
  - Faster
- Physical is more flexible
  - Simpler implementation
  - Simpler conversion of existing systems
Inverted files: Term partitioning
- Each processor processes a part of the inverted file
- The results are intersected (for AND), or combined as appropriate for the other Boolean operations, OR and NOT
- When term distribution in user queries is skewed, document partitioning is better
- When uniform, term partitioning is better: by a factor of two for long queries, 5 to 10 for short (Web-like) ones
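A minimal sketch of term partitioning, assuming each processor holds the complete posting lists of its own terms and a broker combines the partial answers. The data and the names PROCESSORS, owner, search_and and search_or are illustrative.

```python
# Hypothetical split of the inverted file by term: each processor holds whole posting lists.
PROCESSORS = [
    {"index": {1, 2, 3, 8}, "parallel": {0, 3, 4, 7}},
    {"query": {2, 3, 5, 7}, "signature": {3, 6, 7}},
]


def owner(term):
    """Find the processor holding the posting list of `term` (None if the term is unknown)."""
    for proc_id, postings in enumerate(PROCESSORS):
        if term in postings:
            return proc_id
    return None


def search_and(terms):
    """AND query: fetch each term's list from its owner, then intersect at the broker."""
    result = None
    for term in terms:
        proc_id = owner(term)
        docs = PROCESSORS[proc_id][term] if proc_id is not None else set()
        result = set(docs) if result is None else result & docs
    return result if result is not None else set()


def search_or(terms):
    """OR query: the same routing, but the partial lists are united."""
    result = set()
    for term in terms:
        proc_id = owner(term)
        if proc_id is not None:
            result |= PROCESSORS[proc_id][term]
    return result


print(search_and(["parallel", "query"]))
print(search_or(["index", "signature"]))
```

A query whose terms all live on one processor keeps the others idle, which hints at why a skewed term distribution favours document partitioning.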
Suffix arrays
- Array construction can be parallelized
  - Merges are parallel
- Document partitioning is applied straightforwardly
  - Each processor maintains its own suffix array
- Term partitioning can be applied
  - Each processor owns a branch of the tree (lexicographic interval)
  - Bottleneck: all processors need access to the entire text
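The document-partitioned case can be sketched as below: every processor keeps its own text and its own suffix array, and a query is sent to all of them. The construction is deliberately naive (sorting whole suffixes), and the texts and function names are assumptions for illustration only.

```python
def build_suffix_array(text):
    """Naive construction: sort the suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find(text, suffix_array, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    hits = []
    while lo < len(suffix_array) and text.startswith(pattern, suffix_array[lo]):
        hits.append(suffix_array[lo])
        lo += 1
    return sorted(hits)


# Document partitioning: each processor keeps its own text and its own suffix array.
TEXTS = ["parallel retrieval", "distributed retrieval", "signature files"]
ARRAYS = [build_suffix_array(t) for t in TEXTS]


def search(pattern):
    """Send the pattern to every processor and collect (processor, position) pairs."""
    return [(p, pos)
            for p, (text, sa) in enumerate(zip(TEXTS, ARRAYS))
            for pos in find(text, sa, pattern)]


print(search("retrieval"))
```

Each processor needs only its own text here, which avoids the bottleneck noted for term partitioning.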
Signature files
- Document partitioning: straightforward
  - Create query signature, distribute to each processor
  - Merge results (using Boolean operations if needed)
- Term partitioning: shorter signatures
  - Merging and eliminating false drops is slow
  - This method is not recommended
SIMD computers
- Single Instruction, Multiple Data
- Uncommon
- Good for simple operations
  - Bit operations in signature files
  - Details in the book
- Ranking is supported in hardware in some computers
- If the signature file does not fit into memory, it can be processed in batches
  - I/O overhead
  - Use multiple queries with the same batch
  - This improves throughput, but not response time
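The batching idea can be sketched in plain Python. The inner bitwise test is the operation a SIMD machine would apply to many signatures at once; here it is an ordinary loop, and the 4-bit signatures and function names are made up.

```python
def batches(signatures, batch_size):
    """Yield the signature file in chunks that fit into memory (simulated by slicing)."""
    for start in range(0, len(signatures), batch_size):
        yield start, signatures[start:start + batch_size]


def match_queries(signatures, query_sigs, batch_size):
    """Load each batch once and test every pending query against it."""
    hits = {q: [] for q in range(len(query_sigs))}
    loads = 0
    for start, batch in batches(signatures, batch_size):
        loads += 1  # one simulated I/O operation per batch, shared by all queries
        for q, query_sig in enumerate(query_sigs):
            for offset, doc_sig in enumerate(batch):
                if doc_sig & query_sig == query_sig:  # all query bits present
                    hits[q].append(start + offset)
    return hits, loads


DOC_SIGS = [0b1011, 0b0110, 0b1110, 0b0011, 0b1111]
QUERY_SIGS = [0b0010, 0b1010]
hits, loads = match_queries(DOC_SIGS, QUERY_SIGS, batch_size=2)
print(hits, loads)
```

The number of batch loads does not depend on the number of queries, so throughput improves, while each individual query still waits for the full pass.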
… SIMD computers
- Inverted files are difficult to adapt to SIMD
- The inverted file is restructured
- Details in the book
Distributed IR
- MIMD with
  - Slow communication
  - Not all nodes are used for a given query
  - Encryption issues
- Document partitioning is usually used
  - Term partitioning imposes greater communication overhead
- Document clustering can be useful (to distribute docs by processors)
  - Index clusters and then search only the best ones
  - Another approach: use training queries, then similarity of the user query to these
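Searching only the best clusters can be sketched as follows. It assumes one cluster per node, a bag-of-words centroid per cluster and cosine similarity for ranking the nodes; the node names, documents and functions are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical nodes, each holding one document cluster.
NODES = {
    "node-a": ["parallel computing cluster", "parallel processors speedup"],
    "node-b": ["signature files bit operations", "signature false drops"],
    "node-c": ["suffix arrays text search", "suffix tree branch"],
}
# Cluster description: summed term counts of the node's documents.
CENTROIDS = {n: Counter(w for d in docs for w in d.split()) for n, docs in NODES.items()}


def select_nodes(query, k):
    """Rank nodes by similarity of the query to their centroid; keep the best k."""
    q = Counter(query.split())
    ranked = sorted(NODES, key=lambda n: cosine(q, CENTROIDS[n]), reverse=True)
    return ranked[:k]


def search(query, k=1):
    """Query only the selected nodes; a document matches if it shares any query word."""
    words = set(query.split())
    return [(n, d) for n in select_nodes(query, k)
            for d in NODES[n] if words & set(d.split())]


print(select_nodes("signature files", 1))
print(search("signature files"))
```

Nodes outside the top k are never contacted, which saves communication at the risk of missing relevant documents stored elsewhere.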
Research topics
- How to evaluate the speedup
- New algorithms
- Adaptation of existing algorithms
- Merging the results is a bottleneck
  - Meta search engines
- Creating large collections with judgements
  - Is recall important?
Conclusions
- Parallel computing can improve
  - response time for each query and/or
  - throughput: number of queries processed with same speed
- Document partitioning is simple
  - good for distributed computing
- Term partitioning is good for some data structures
- Distributed computing is MIMD computing with slow communication
- SIMD machines are good for signature files
  - Both are out of favor now
Thank you!
Till May 17? 18?, 6 pm