Building a Distributed Full-Text Index for the Web

S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina

• Introduction.

• Testbed architecture.

• Design of the indexer.

• Distributed indexing.

• Introduction.

Inverted list

Cat -> (1,2), (1,4), (3,2)

Dog -> (2,2), (3,1), (3,4)

Fish -> (1,3), (3,3)

Pig -> (1,1), (2,3)

(Figure labels: inverted index, location)

An inverted index consists of one inverted list per index term, with the terms kept in sorted order.

An inverted list consists of locations in sorted order.

A location consists of (page identifier, position in the page).

A posting consists of (index term, location).

Building an inverted index over a collection of web pages involves:

1. Processing each page to extract postings.

2. Building an inverted list for each term.

3. Writing the inverted lists out to disk.
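The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the indexer described in the talk. The sample pages are hypothetical, chosen so that the resulting lists match the Cat/Dog/Fish/Pig example. The names `extract_postings`, `build_index`, and `write_index`, and the use of JSON as the on-disk format, are assumptions made for the example.

```python
import json
from collections import defaultdict

# Hypothetical sample pages, keyed by page identifier.
PAGES = {
    1: "pig cat fish cat",
    2: "fly dog pig",
    3: "dog cat fish dog",
}


def extract_postings(page_id, text):
    """Step 1: yield (term, location) postings; positions start at 1."""
    for position, term in enumerate(text.lower().split(), start=1):
        yield term, (page_id, position)


def build_index(pages):
    """Step 2: group postings into one sorted inverted list per term."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for term, location in extract_postings(page_id, text):
            lists[term].append(location)
    # Terms are kept in sorted order, and so are the locations of each term.
    return {term: sorted(lists[term]) for term in sorted(lists)}


def write_index(index, path):
    """Step 3: write the inverted lists to disk (JSON is an assumption)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(index, f)


index = build_index(PAGES)
```

Here `index["cat"]` holds the locations (1,2), (1,4), (3,2). A web-scale indexer cannot hold all postings in memory, which is why the talk goes on to discuss pipelining and distribution.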

Important problems when building a web-scale inverted index:

1. Scale and growth rate.

2. Rate of change.

Testbed architecture

• Distributors.

• Indexers.

• Query servers.

Distributed inverted index organization:

1. Local inverted files.

2. Global inverted files.

Global inverted files

Query server 1 (terms a-e)

Cat -> (1,2), (1,4), (3,2)

Dog -> (2,2), (3,1), (3,4)

Query server 2 (terms f-z)

Fish -> (1,3), (3,3)

Pig -> (1,1), (2,3)

Local inverted files

Query server 1

Cat -> (1,2), (1,4)

Dog -> (2,2)

Fish -> (1,3)

Fly -> (2,1)

Pig -> (1,1), (2,3)

Query server 2

Cat -> (3,2)

Dog -> (3,1), (3,4)

Fish -> (3,3)
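The two organizations can be contrasted with a small sketch. It assumes an index held as a Python dictionary and two query servers. The function names `global_partition` and `local_partition` are hypothetical, and the term ranges (a-e, f-z) and page sets ({1, 2}, {3}) are taken from the example above. The term Fly is left out because it does not appear in the global example.

```python
INDEX = {
    "cat": [(1, 2), (1, 4), (3, 2)],
    "dog": [(2, 2), (3, 1), (3, 4)],
    "fish": [(1, 3), (3, 3)],
    "pig": [(1, 1), (2, 3)],
}


def global_partition(index, term_ranges):
    """Global inverted files: each server holds the complete inverted
    lists for the terms whose first letter falls in its range."""
    servers = [{} for _ in term_ranges]
    for term, locations in index.items():
        for server, (low, high) in zip(servers, term_ranges):
            if low <= term[0] <= high:
                server[term] = list(locations)
                break
    return servers


def local_partition(index, page_sets):
    """Local inverted files: each server holds an index over its own
    disjoint subset of pages, so one term may appear on several servers."""
    servers = [{} for _ in page_sets]
    for term, locations in index.items():
        for server, pages in zip(servers, page_sets):
            own = [loc for loc in locations if loc[0] in pages]
            if own:
                server[term] = own
    return servers


global_servers = global_partition(INDEX, [("a", "e"), ("f", "z")])
local_servers = local_partition(INDEX, [{1, 2}, {3}])
```

With the global organization a single-term query touches one server. With the local organization it must be sent to every server, but losing one server only removes that server's pages from the results.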

Local vs. Global

• Resilience to failures.

• Network load.

Testbed environment:

The indexers and the query servers are single-processor PCs with 350-500 MHz processors and 300-500 MB of main memory, each equipped with multiple disks.

All the machines are interconnected by a 100 Mbps Ethernet LAN.

The WebBase collection:

To study some properties of web pages that are relevant to text indexing, we analyzed five samples of 100,000 pages each, drawn from different portions of the WebBase repository.

Property                                              Value
Average number of words per page                      438
Average number of distinct words per page             171
Average size of each page (as HTML)                   8650
Average size of each page after removing HTML tags    (not given)
Average size of a word in the vocabulary              8

Table 1: Properties of the WebBase collection

Design of the Indexer

• Software pipeline.

• Storage of the inverted files generated by the process.

Software pipeline

The indexing process can logically be split into three phases:

• Processing -> CPU intensive.

• Flushing -> disk intensive.

• Loading -> network intensive.

The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time.
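To see why overlapping the phases helps, the following sketch compares a strictly sequential schedule with an idealized pipelined one. It is a simplified model, not the analysis from the talk. It assumes that every buffer of pages passes through loading, processing, and flushing in that order, that each phase uses a separate resource (network, CPU, disk), and that phase times are constant. The phase times and the function names are hypothetical.

```python
def sequential_time(n_runs, phase_times):
    """Each run finishes all phases before the next run starts."""
    return n_runs * sum(phase_times)


def pipelined_time(n_runs, phase_times):
    """Simulate a pipeline in which each phase handles one run at a time
    and a run enters a phase as soon as both the run and the phase are free."""
    free_at = [0.0] * len(phase_times)  # when each resource is next free
    for _ in range(n_runs):
        ready = 0.0  # when this run has finished its previous phase
        for k, duration in enumerate(phase_times):
            start = max(ready, free_at[k])
            ready = free_at[k] = start + duration
    return free_at[-1]


# Hypothetical per-run phase times in seconds: load, process, flush.
PHASES = (3.0, 4.0, 2.0)
seq = sequential_time(10, PHASES)
pipe = pipelined_time(10, PHASES)
```

Under these assumptions the pipelined schedule is bounded by the slowest phase: its total time is the sum of the phase times for the first run plus the slowest phase time for each later run.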

Examples:

Execution of the pipeline

Pipeline time

Theoretical analysis vs. experimental results

Storage schemes:

We considered three storage schemes for storing inverted files as sets of (key, value) pairs in a B-tree:

1. Full list.

2. Single payload.

3. Mixed list.

A qualitative comparison of these storage schemes:

• Index size

• Zig-zag joins

• Hot updates

Zig-zag join using ordered indexes

1 2 3 4 7 9 18

1 7 9 11 12 17 19
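A zig-zag join intersects two sorted lists by alternately skipping ahead in whichever list is behind, which an ordered index supports with a seek. The sketch below is one possible illustration using binary search on in-memory lists. The function name `zigzag_join` is hypothetical, and the second list is assumed to be sorted as 1 7 9 11 12 17 19.

```python
from bisect import bisect_left


def zigzag_join(left, right):
    """Intersect two sorted lists, seeking forward in the lagging list
    instead of scanning it element by element."""
    i = j = 0
    matches = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            matches.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            # Seek in `left` to the first entry >= right[j].
            i = bisect_left(left, right[j], i + 1)
        else:
            # Seek in `right` to the first entry >= left[i].
            j = bisect_left(right, left[i], j + 1)
    return matches


result = zigzag_join([1, 2, 3, 4, 7, 9, 18], [1, 7, 9, 11, 12, 17, 19])
```

For these two lists the join skips over 2, 3, and 4 in a single seek and yields the common entries 1, 7, and 9.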

Experimental results (using mixed list)

Number of pages (million)   Input size (GB)   Index size (GB)   Index size (% of input)
0.1                         0.81              0.05              6.17
0.5                         4.03              0.27              6.70
2.0                         16.11             1.13              7.01
5.0                         40.28             2.78              6.90

Table 5: Mixed-list scheme index sizes

Note: only one posting was generated for all the occurrences of a word in a page.

Distributed indexing

Two problems must be addressed when building an inverted index on a distributed architecture:

• Page distribution: when and how to distribute pages to the indexing nodes.

• Collecting global statistics: where, when, and how to compute and distribute global statistics.

Two strategies for page distribution:

• A priori distribution.

• Runtime distribution.

Three advantages of runtime distribution:

• Space.

• Load balancing.

• Effective pipelining.

Collecting global statistics

Global statistics are computed by a dedicated server known as the statistician.

• Parallel computation.

• Minimize the number of conversations among servers.

• Avoid extra disk I/O.

• Reduce network overhead.

Two strategies for sending information to the statistician:

• ME strategy: sending local information during merging.

• FL strategy: sending local information during flushing.
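The role of the statistician can be illustrated with a small sketch. It assumes that the global statistic of interest is the document frequency of each term, that is, the number of pages containing it. The `Statistician` class, its method names, and the sample pages are hypothetical. The sketch does not model the difference between the ME and FL strategies, which lies in when the indexers send their local information.

```python
from collections import Counter


def local_document_frequencies(pages):
    """Run on each indexer: count, per term, the local pages containing it."""
    df = Counter()
    for text in pages:
        df.update(set(text.lower().split()))  # count each term once per page
    return df


class Statistician:
    """Dedicated server that aggregates local statistics into global ones."""

    def __init__(self):
        self.global_df = Counter()

    def receive(self, local_df):
        # Counter.update adds the incoming counts to the running totals.
        self.global_df.update(local_df)

    def report(self):
        return dict(self.global_df)


# Two hypothetical indexers, each holding a disjoint set of pages.
indexer_pages = [
    ["pig cat fish cat", "fly dog pig"],
    ["dog cat fish dog"],
]
statistician = Statistician()
for pages in indexer_pages:
    statistician.receive(local_document_frequencies(pages))
global_df = statistician.report()
```

Each indexer sends one compact summary, so the number of conversations stays at one per indexer per batch, in line with the goals listed above.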

Comparison
