
1

Conducting a Web Search: Problems & Algorithms

Anna Rumshisky

2

Motivation

Web as a distributed knowledge base

Utilization of this knowledge base may be improved through efficient search utilities

3

Basic Functionality of a Web Search Engine (I)

Main functionality components
– crawler module, page repository, query engine

Algorithmically interesting (“behind the scenes”) components
– crawler control, indexer/collection analysis, and ranking modules

4

Basic Functionality of a Web Search Engine (II)

Crawler module (many running in parallel)
– takes an initial set of URLs

– retrieves and caches the pages

– extracts URLs from the retrieved pages

Page repository
– the cached collection is stored and used to create indexes as new pages are retrieved and processed

Query engine
– processes a query

5

Basic Functionality of a Web Search Engine (III)

Indexer module
– builds a content index (“inverted” index)

» for each term, a sorted list of locations where it appears

» where a “location” is a tuple: document URL, offset within the document, and weight of the given occurrence (e.g. occurrences in headings and titles may be assigned higher weights)

– builds a link index

» in-links and out-links for each URL, stored in an adjacency matrix
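To make these data structures concrete, here is a minimal sketch (not from the slides) of a content index and a link index; the Posting tuple, the title/body field weights, and the toy pages are illustrative assumptions.

```python
from collections import defaultdict, namedtuple

# One posting per term occurrence: (document URL, token offset, weight).
# Occurrences in titles get a higher weight, as the slide suggests.
Posting = namedtuple("Posting", ["url", "offset", "weight"])

def build_indexes(pages):
    """pages: dict url -> {"title": str, "body": str, "links": [urls]}."""
    content_index = defaultdict(list)   # term -> sorted list of postings
    out_links = defaultdict(set)        # link index: url -> pages it points to
    in_links = defaultdict(set)         # link index: url -> pages pointing to it
    for url, page in pages.items():
        for field, weight in (("title", 2.0), ("body", 1.0)):
            for offset, term in enumerate(page.get(field, "").lower().split()):
                content_index[term].append(Posting(url, offset, weight))
        for target in page.get("links", []):
            out_links[url].add(target)
            in_links[target].add(url)
    for postings in content_index.values():
        postings.sort(key=lambda p: (p.url, p.offset))
    return content_index, in_links, out_links

pages = {
    "http://a.example": {"title": "web search", "body": "crawler and indexer",
                         "links": ["http://b.example"]},
    "http://b.example": {"title": "ranking", "body": "web page ranking",
                         "links": ["http://a.example"]},
}
content_index, in_links, out_links = build_indexes(pages)
print(content_index["web"])      # postings for the term "web"
print(dict(in_links))            # who points to whom
```

A real indexer would stream postings to disk and compress them; this in-memory version only illustrates the term -> (URL, offset, weight) layout and the in-/out-link maps.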

6

Basic Functionality of a Web Search Engine (IV)

Collection analysis module
– creates specialized utility indexes

» e.g. indexes based on global page ranking algorithms

– maintains a lexicon of terms and term-level statistics

» e.g. the total number of documents in which a term occurs, etc.

Ranking module
– assigns a rank to each document relative to a specific query, using utility indexes created by the collection analysis module

7

Basic Functionality of a Web Search Engine (V)

Crawler control module
– prioritizes retrieved URLs to determine the order in which they should be visited by the crawler

– uses utility indexes created by the collection analysis module

– may use historical data on queries received by the query engine

– may use heuristics-based URL analysis (e.g. prefer URLs with fewer slashes)

– may use analysis of anchor text and its location

– may use global ranking based on link structure analysis, etc.
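As a rough illustration of such prioritization, the sketch below scores frontier URLs with two of the heuristics mentioned above (fewer slashes, anchor text matching query terms); the scoring weights, the Frontier class, and the example URLs are assumptions, not part of the talk.

```python
import heapq
from urllib.parse import urlparse

def url_priority(url, anchor_text="", query_terms=frozenset()):
    """Heuristic score for a frontier URL (higher = visit sooner).
    The weights below are illustrative, not values from the talk."""
    score = 0.0
    score -= urlparse(url).path.count("/")     # prefer URLs with fewer slashes
    # reward anchor text that shares terms with popular queries
    score += 2.0 * len(query_terms & set(anchor_text.lower().split()))
    return score

class Frontier:
    """Priority queue of URLs to crawl (heapq is a min-heap, so scores are negated)."""
    def __init__(self):
        self._heap, self._seen = [], set()

    def push(self, url, anchor_text="", query_terms=frozenset()):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-url_priority(url, anchor_text, query_terms), url))

    def pop(self):
        return heapq.heappop(self._heap)[1]

f = Frontier()
f.push("http://site.example/a/b/c/deep-page.html", "misc links")
f.push("http://site.example/search.html", "web search engines", {"web", "search"})
print(f.pop())   # the shallower, anchor-matching URL is crawled first
```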

8

Some of the Issues

Refresh strategy - metrics determining the refresh strategy:
– Freshness of the collection

» % of the cached pages that are up to date

– Age of the collection

» average age of the cached pages; the age of a local copy of a single page is, e.g., the time elapsed since it was last current

– choosing to visit only the more frequently updated pages will lead to the age of the collection growing indefinitely

Scalability of all techniques
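A hedged sketch of how the two metrics might be computed for a cached collection; the timestamp fields and example values are assumptions.

```python
def freshness(pages):
    """Fraction of cached pages that are still up to date.
    Each page records when we cached it and when the live copy last changed
    (both timestamps are assumptions for the sake of the example)."""
    up_to_date = sum(1 for p in pages if p["changed_at"] <= p["cached_at"])
    return up_to_date / len(pages)

def age(page, now):
    """Age of one cached page: 0 if it is still current, otherwise the time
    elapsed since the copy was last current (i.e. since the live page changed)."""
    return 0.0 if page["changed_at"] <= page["cached_at"] else now - page["changed_at"]

def collection_age(pages, now):
    """Average age of the cached pages."""
    return sum(age(p, now) for p in pages) / len(pages)

pages = [
    {"cached_at": 100.0, "changed_at": 90.0},    # still fresh
    {"cached_at": 100.0, "changed_at": 150.0},   # stale since t = 150
]
print(freshness(pages))                   # 0.5
print(collection_age(pages, now=200.0))   # (0 + 50) / 2 = 25.0
```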

9

Remainder of the Talk - Outline

Structure of the Web Graph
– the actual structure of the web graph should influence crawling and caching strategies

– design of link-based ranking algorithms

Relevance Ranking Algorithms
– used by the query engine and by the crawler control module

– link-based algorithms, such as PageRank, HITS

– content-based algorithms, such as TF*IDF, Latent Semantic Indexing

10

Modeling the Web Graph (I)

Web as a directed graph

– each page is a vertex, each hypertext link is an edge

– may or may not consider each link “weighted” based on

» where the anchor text occurs

» how many times a given link occurs on a page, etc.

“Bow-tie” link structure
– 28% of pages constitute the strongly connected core

– 44% fall onto the sides of the bow-tie that can be reached from the core, but not vice versa

– 22% can reach the core, but cannot be reached from it

11

Modeling the Web Graph (II)

Web as a probabilistic graph
– the web graph is constantly modified: both nodes and edges are added and removed

Traditional (Erdös-Rényi) random graph model: G(n, p)
– n nodes; p is the probability that a given edge exists

– the number of in-links follows some standard distribution, e.g. binomial:

» Prob(in-degree of a node = k) = n! / (k! (n-k)!) * p^k * (1-p)^(n-k)

– used to model sparse random graphs
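A small sketch evaluating the binomial in-degree probability above; the values n = 1000 and p = 0.005 are illustrative assumptions.

```python
from math import comb

def binomial_in_degree_prob(n, p, k):
    """Prob(in-degree of a node = k) under G(n, p):
    n! / (k! (n-k)!) * p^k * (1-p)^(n-k), as on the slide."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Illustrative values (not from the talk): a sparse graph with expected in-degree n*p = 5.
n, p = 1000, 0.005
print([round(binomial_in_degree_prob(n, p, k), 4) for k in range(8)])
```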

12

Modeling the Web Graph (III)

Web graph properties (empirically)
– evolving nature of the web graph

– disjoint bipartite cliques

» two subsets, with i nodes and j nodes; each node in the first subset connected to each node in the second (total of i*j edges)

– distribution of in- and out-degrees follows a power law

» Prob(in-degree of a node = k) ∝ 1 / k^β

» experimentally, β ≈ 2.1
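For comparison with the binomial case, here is a sketch of the power-law in-degree probability with the quoted exponent; normalizing over a finite range 1..k_max is an assumption made only to obtain a proper distribution.

```python
def power_law_in_degree_prob(k, beta=2.1, k_max=10**5):
    """Prob(in-degree = k) proportional to 1 / k^beta, normalized over 1..k_max.
    beta = 2.1 is the empirical exponent quoted on the slide; the finite
    cutoff k_max is only an assumption needed for normalization."""
    norm = sum(1.0 / j ** beta for j in range(1, k_max + 1))
    return (1.0 / k ** beta) / norm

# The heavy tail: in-degree 100 is rare, but vastly more likely than a
# binomial model with a small mean would predict.
print(power_law_in_degree_prob(1), power_law_in_degree_prob(100))
```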

13

“Evolving” Graph Models (Kumar et al.)

Graph models with stochastic copying

– new vertices and new edges added to the graph at discrete intervals of time

– allow dependencies between edges

» some vertices choose outgoing edges at random, independently

» others replicate existing linkage patterns by copying edges from a randomly chosen vertex

– two “evolving” graph models

» linear growth model (a constant number of nodes added at each interval)

» exponential growth (current number of vertices multiplied by a constant factor)

14

“Evolving” Graph Models (II)

Defining Gt = (Vt, Et) - the state of the graph at time t

– fv(Vt, t) - returns # of vertices added to graph at time = t+1

– fe(Gt, t) - returns the set of edges added to graph at time = t+1

– |Vt+1| = |Vt| + fv(Vt, t)

– Et+1 = Et U fe(Gt, t)

– Edge selection

» new edges may lead from new vertices to old vertices, or be added between old vertices

» origin and destination for each edge are chosen randomly

» destination selection method (replicated or random destination) is chosen randomly
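The following is a rough, simplified simulation of the linear growth model with stochastic copying (in the spirit of Kumar et al., not their exact construction); the copy probability, out-degree, and seed graph are assumptions.

```python
import random

def linear_growth_copying_model(steps, nodes_per_step=1, out_degree=3,
                                copy_prob=0.5, seed=0):
    """Rough sketch of the linear growth model: at each time step a constant
    number of new vertices arrives, and each new vertex gets `out_degree`
    out-edges whose destinations are either chosen uniformly at random
    (prob. 1 - copy_prob) or copied from the out-links of a randomly chosen
    existing "prototype" vertex. All parameter values are assumptions."""
    rng = random.Random(seed)
    out_links = {0: [], 1: [0], 2: [0, 1]}               # tiny seed graph
    for _ in range(steps):
        for _ in range(nodes_per_step):
            v = len(out_links)
            prototype = rng.randrange(v)                  # vertex whose linkage may be copied
            targets = []
            for i in range(out_degree):
                if rng.random() < copy_prob and i < len(out_links[prototype]):
                    targets.append(out_links[prototype][i])   # replicate existing linkage
                else:
                    targets.append(rng.randrange(v))          # independent random choice
            out_links[v] = targets
    return out_links

g = linear_growth_copying_model(steps=1000)
in_degree = {}
for targets in g.values():
    for t in targets:
        in_degree[t] = in_degree.get(t, 0) + 1
print(sorted(in_degree.values(), reverse=True)[:5])   # a few heavily linked vertices emerge
```

Even this crude version tends to produce a handful of heavily linked vertices, which is the qualitative behaviour the copying mechanism is meant to capture.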

15

“Evolving” Graph Models (III)

Evaluation

– Very rough assumptions

» constant number of edges assumed for each added vertex

» no deletion, etc.

» creating a new site generates a lot of links between new nodes in that site; since no edges are added between the new nodes, their assumptions appear to collapse each new site into a single vertex

– Claim: these models show the desired properties,

» the power law distribution for in- and out-degrees

» the presence of directed bi-partite cliques

16

Link-Based Relevance Ranking: PageRank (I)

Goal: Assign global rank to each page on the Web

Basic idea: Each page’s rank is a sum, over all pages that point to it (= referrers), of the rank of each referrer divided by the out-degree of that referrer

17

Link-Based Relevance Ranking: PageRank (II)

Description: For the total of n web pages, the goal is to obtain a rank vector

r = <r1, r2, ..., rn>, where ri = Σ over j in referrers(i) of rj / outdegree(j)

Consider matrix A (n x n), with the elements

– ai,j = 1/outdegree(i) if page i points to page j

– ai,j = 0 otherwise

– ai,j is the rank contribution of i to j

By our definition of the rank vector, we must then have

r = AT r

– r is the eigenvector of matrix AT corresponding to the eigenvalue 1

18

Link-Based Relevance Ranking: PageRank (III)

Mathematical apparatus

If the graph is strongly connected (every node reachable from every node), the eigenvector r for the eigenvalue 1 of such an “adjacency” matrix is uniquely defined

The principal eigenvector of a matrix (here corresponding to the eigenvalue 1) can be computed using the power iteration method

» initialize a vector s with random values

» apply the linear transformation repeatedly, until it converges to the principal eigenvector:

r = AT s

r = r / ||r||, where ||r|| is the vector length (normalization)

||r - s|| < epsilon (stop condition); otherwise set s = r and repeat

» essentially, r = AT (AT ... (AT s)) - with normalization at each step
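A minimal sketch of the power iteration just described, applied to a tiny 3-page example; numpy, the epsilon, and the example matrix are assumptions.

```python
import numpy as np

def power_iteration(M, eps=1e-9, max_iter=1000, seed=0):
    """Principal eigenvector of M by repeated multiplication and normalization,
    mirroring the slide: r = M (M ... (M s)). eps and max_iter are assumptions."""
    rng = np.random.default_rng(seed)
    s = rng.random(M.shape[0])              # initialize with random values
    s /= np.linalg.norm(s)
    for _ in range(max_iter):
        r = M @ s
        r /= np.linalg.norm(r)              # normalize to unit length
        if np.linalg.norm(r - s) < eps:     # stop condition from the slide
            break
        s = r
    return r

# Tiny 3-page example: a[i, j] = 1/outdegree(i) if page i points to page j.
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
r = power_iteration(A.T)                    # solves r = AT r up to scaling
print(r / r.sum())                          # relative ranks of the three pages
```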

19

Link-Based Relevance Ranking: PageRank (IV)

Mathematical apparatus (cont’d)

The PageRank vector r, as defined above, is proportional to the stationary probability distribution of a random walk on the graph

» traverse the graph, choosing at random which link to follow at each vertex

The power iteration method is guaranteed to converge only if the graph is aperiodic (i.e. the lengths of its cycles have no common divisor greater than 1)

The speed of convergence of the power iteration depends on the eigenvalue gap (difference between the two largest eigenvalues)

20

Link-Based Relevance Ranking: PageRank (V)

Practical application of the algorithm

The actual algorithm is merely applying the power iteration method to the matrix AT to obtain r

In practice, there are problems with the assumptions needed for this algorithm to work

» the Web graph is NOT guaranteed to be aperiodic, so the stationary distribution might not be reached

» the Web is NOT strongly connected; there are pages with no outward links at all

– slight modifications to the formula for ri take care of that

We don’t really need the actual rank of each page; we just need to sort the pages correctly
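The slide only says that “slight modifications” fix the aperiodicity and dangling-page problems; the usual modification is a damping (random jump) factor, sketched below as an assumption rather than as the talk's exact formula. The value 0.85 is the conventional choice.

```python
import numpy as np

def pagerank(out_links, n, damping=0.85, eps=1e-9, max_iter=100):
    """PageRank by power iteration with the usual "random surfer" fix:
    with probability `damping` follow a link, otherwise jump to a random page.
    This copes with dangling pages and periodicity; damping = 0.85 is the
    conventional choice, not a value from the talk."""
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        new_r = np.full(n, (1.0 - damping) / n)
        for i, targets in out_links.items():
            if targets:                                  # spread rank over out-links
                for j in targets:
                    new_r[j] += damping * r[i] / len(targets)
            else:                                        # dangling page: spread evenly
                new_r += damping * r[i] / n
        if np.abs(new_r - r).sum() < eps:
            return new_r
        r = new_r
    return r

# Pages: 0 -> 1, 0 -> 2, 1 -> 0, and page 2 has no outward links at all.
print(pagerank({0: [1, 2], 1: [0], 2: []}, n=3))
```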

21

Link-Based Relevance Ranking: HITS (Hypertext-Induced Topic Search)

Goal: Rank all pages with respect to a given query (obtain both hub and authority score for each page)

Motivation: Two types of web pages: hubs (pages with large collections of links: web directories, link lists, etc.) and authorities (pages well referred to by other pages)

Each page gets two scores, a hub score and an authority score

A good authority has a high in-degree and is pointed to by many good hubs

A good hub has a high out-degree and points to many good authorities

Consider the sites of Toyota and Honda: though they will not point to each other, good hubs would point to both

22

Link-Based Relevance Ranking: HITS (II)

Basic idea: An authority score of a page is obtained by summing up the hub scores of the pages that point to it

A hub score of a page is obtained by summing up the authority scores of the pages it points to

At query time, a small subgraph of the Web graph is identified, and link analysis is run on it

23

Link-Based Relevance Ranking: HITS (III)

Description: Selecting a limited subset of pages:

– The query string determines the initial root set of pages

» up to t pages containing the same terms as the query string

– Root set is expanded

» by all pages linked from the root set

» by up to d pages pointing to each page in the root set

this is to prevent an over-popular page in the root set - to which everybody points - from forcing you to add a large portion of the Web graph to your set

24

Link-Based Relevance Ranking: HITS (IV)

Description (cont’d): Link analysis:

– here we wish to obtain two rank vectors a and h:

a = <a1, a2, ..., an > and h = <h1, h2, ..., hn >

– we obtain them using the following iteration method:

» initialize both vectors to random values

» for each page i, set the authority score ai equal to the sum of the hub scores of all pages within the subgraph that refer to it (= referrers(i))

» for each page i, set the hub score hi equal to the sum of the authority scores of all pages within the subgraph that it points to (= referred(i))

» normalize the resulting vectors a and h to have length 1 (divide each ai by ||a|| and each hi by ||h||)

» repeat until a and h converge
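A compact sketch of this iteration on a toy focused subgraph; the dense 0/1 adjacency matrix, the fixed iteration count, and the example links are assumptions.

```python
import numpy as np

def hits(adj, iterations=50):
    """Hub and authority scores on a focused subgraph.
    adj[i][j] = 1 if page i links to page j. The dense 0/1 matrix and the
    fixed iteration count are simplifying assumptions."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    a = np.ones(n)                  # authority scores
    h = np.ones(n)                  # hub scores
    for _ in range(iterations):
        a = A.T @ h                 # a_i = sum of hub scores of referrers(i)
        h = A @ a                   # h_i = sum of authority scores of referred(i)
        a /= np.linalg.norm(a)      # normalize to length 1
        h /= np.linalg.norm(h)
    return h, a

# Pages 0 and 1 act as hubs pointing at pages 2 and 3 (the authorities).
adj = [[0, 0, 1, 1],
       [0, 0, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
h, a = hits(adj)
print(np.round(h, 3), np.round(a, 3))
```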

25

Link-Based Relevance Ranking: HITS (V)

Link analysis (cont’d):

– consider the adjacency matrix A for our focused subgraph

» h = A a

» a = AT h

Normalize: hi = hi / ||h||, ai = ai / ||a||

» substituting one update into the other: hnorm = c1 A AT hnorm and anorm = c2 AT A anorm

– thus, vectors h and a are the principal eigenvectors of the matrices A AT and AT A, respectively, and we are essentially using the power iteration method (which, as we know, will converge)

[Worked example on the slide: h = A a written out componentwise, hi = Σj Ai,j aj, for a small 0/1 adjacency matrix.]

26

Content-Based Relevance Ranking: TF*IDF (I)

Goal: Rank all pages with respect to a given query (compute similarity between the query and each document)

Background: This is a traditional IR technique, used on collections of documents since the 1960s; originally proposed by Gerard Salton

Basic idea: Use the vector-space model to represent documents

Compute the similarity score using a distance metric

27

Content-Based Relevance Ranking: TF*IDF (II)

Description

Each document is represented as a vector <w1, w2, ..., wk>

– k is the total number of terms (lemmas) in a document collection

– wi is the weight of the ith term; depends on the number of occurrences of this term in this document

A query is thought of as just another document

There are different schemes for computing term weights

– the choice of a particular scheme is usually empirically motivated

– TF * IDF is the most common one

28

Content-Based Relevance Ranking: TF*IDF (III)

Description (cont’d)

TF*IDF weighting scheme:

– wi = term frequencyi * inverse document frequencyi

– term frequencyi = # of times the term i occurs in a document

– inverse document frequencyi = log(N / ni), where

» N is the total # of documents in collection

» ni is the number of documents in which term i occurs

» since N is usually large, N / ni is “squashed” with log

» 1 <= N / ni <= N

» 0 <= log(N / ni) <= log N

– the lowest IDF of 0 (= log 1) is assigned to terms that occur in all documents, so such terms contribute no weight

29

Content-Based Relevance Ranking: TF*IDF (IV)

Description (cont’d)

Distance metric
– the cosine of the angle between the two vectors

– obtained using the scalar product:

cos α = ( Σi=1..k ai * bi ) / ( ||a|| * ||b|| )

– this similarity score is independent of the size of each document

30

Content-Based Relevance Ranking: TF*IDF (V)

Practical application of the algorithm

Since a query is typically very short (about 2.3 words on average), raw term frequency is not useful

Augmented TF*IDF is used for weighting query terms:

– wi = [0.5 + (0.5 * tfi/max tf)] * idfi

» max tf is the frequency of the most frequent term

» for terms not found in the query, the weight would be 0.5 * idfi

» for most terms found in the query, the weight would be 1 * idfi
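Pulling the last few slides together, here is a small sketch of TF*IDF document vectors, the augmented query weighting, and cosine ranking; the toy documents and query are assumptions.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: tf*idf} dict per document
    plus the idf table, with idf_i = log(N / n_i) as on the slides."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))      # n_i
    idf = {term: math.log(n / df[term]) for term in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs], idf

def augmented_query_vector(query, idf):
    """Augmented TF*IDF for a short query: w_i = (0.5 + 0.5 * tf_i / max_tf) * idf_i."""
    tf = Counter(t for t in query if t in idf)
    max_tf = max(tf.values(), default=1)
    return {t: (0.5 + 0.5 * f / max_tf) * idf[t] for t, f in tf.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors represented as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

docs = [["web", "search", "engine", "ranking"],
        ["latent", "semantic", "indexing"],
        ["web", "graph", "crawler"]]
doc_vecs, idf = tf_idf_vectors(docs)
q = augmented_query_vector(["web", "search"], idf)
print([round(cosine(q, d), 3) for d in doc_vecs])   # the first document ranks highest
```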

31

Content-Based Relevance Ranking: Latent Semantic Indexing

Goal: Rank all documents with respect to a given query

Background: This is also a technique developed for traditional IR with a static document collection (introduced in 1990)

Basic idea: Construct a term x document matrix, using a vector representation similar to the one used in TF*IDF

– Matrix Am x n (m terms, n documents), of rank r

– Am x n is typically sparse

Using SVD (Singular Value Decomposition), obtain a rank-k approximation to A

– Matrix Ak (m terms, n documents); the approximation is similar in spirit to the least squares method of fitting a line to a set of points

32

Content-Based Relevance Ranking: LSI (II)

Mathematical apparatus:

rank(A)
– the number of linearly independent columns of Am x n: R^n -> R^m

– linear transform of rank r maps the basis vectors of the pre-image into r linearly independent basis vectors of the image

Singular Value Decomposition
– Matrix A can be represented as

» A = U Σ VT, where

columns of U and V are the left and right singular vectors (eigenvectors of A AT and AT A, respectively)

U and V are orthogonal (V VT = I), and Σ is a diagonal matrix diag(σ1, ..., σn), where the σi are the nonnegative square roots of the eigenvalues of A AT,

and σ1 >= σ2 >= ... >= σr > σr+1 = ... = σn = 0

33

Content-Based Relevance Ranking: LSI (III)

Mathematical apparatus (cont’d):

Ak = Um x k Σk x k VTk x n - keep only the k largest singular values and the corresponding columns of U and V

Among all rank-k matrices, Ak minimizes the approximation error ||A - Ak||F^2 (Frobenius norm)

34

Content-Based Relevance Ranking: LSI (IV)

Practical application of the algorithm

The SVD computation on the term x document matrix is performed in advance, not at query processing time

Each document is represented as a column of the Ak matrix

A scalar-product-based metric is used for distances between document vectors

Query vector <a1q, a2q, ..., amq>
– a pseudo-document, added to Ak post factum

– Vq = AqT U Σ^-1

– values from AkT Ak give the scalar products of the document vectors

» AkT Ak = V Σ^2 VT
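A sketch of the whole LSI pipeline (truncated SVD, query folding, scalar-product ranking) using numpy's SVD; the tiny term x document matrix and the choice k = 2 are assumptions.

```python
import numpy as np

def lsi(A, k):
    """Rank-k LSI decomposition of a term x document matrix A.
    Returns U_k, the k largest singular values, and the k-dimensional
    document vectors (columns of Sigma_k @ V_k^T)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)     # A = U diag(s) V^T
    return U[:, :k], s[:k], np.diag(s[:k]) @ Vt[:k, :]

def project_query(q, U_k, s_k):
    """Fold a query (pseudo-document) into the k-dim space:
    q_k = q^T U_k Sigma_k^{-1}, mirroring the slide's Vq = AqT U Sigma^-1."""
    return q @ U_k @ np.diag(1.0 / s_k)

# Tiny term x document matrix (5 terms, 4 documents); the values are illustrative.
A = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)
U_k, s_k, doc_k = lsi(A, k=2)
q_k = project_query(np.array([1.0, 1.0, 0.0, 0.0, 0.0]), U_k, s_k)   # query uses terms 0 and 1
sims = [float(q_k @ d) / (np.linalg.norm(q_k) * np.linalg.norm(d)) for d in doc_k.T]
print(np.round(sims, 3))   # documents that share those terms score highest
```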

35

References

Arasu, Cho, Garcia-Molina, Paepcke, Raghavan (2001). Searching the Web.

Kleinberg, J. (1999). Authoritative Sources in a Hyperlinked Environment. Journal of the ACM.

Kumar, Raghavan, Rajagopalan, Sivakumar (2000). Stochastic Models for the Web Graph. IEEE.

Berry, Dumais & O'Brien (1994). Using Linear Algebra for Intelligent Information Retrieval.

Deerwester, Dumais, Furnas, Landauer, Harshman (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science.

Manning & Schütze (1999). Foundations of Statistical Natural Language Processing.

36

37

Content-Based Relevance Ranking: LSI (III)

Mathematical apparatus (cont’d):

Vq = AqT U Σ^-1 - derivation of the query projection:

» A = U Σ VT

» AT = V ΣT UT = V Σ UT (ΣT = Σ, since Σ is diagonal)

» AT (UT)^-1 = V Σ

» AT (UT)^-1 Σ^-1 = V; since U is orthogonal, (UT)^-1 = U, so V = AT U Σ^-1