
Page 1:

Information Retrieval

Web Search

Page 2:

Goal of a Search Engine

Retrieve docs that are “relevant” for the user query.

Doc: a Word or PDF file, a web page, an email, a blog post, an e-book, …

Query: the “bag of words” paradigm.

Relevant ?!?

Page 3:

Two main difficulties

The Web:

Size: more than tens of billions of pages.

Language and encodings: hundreds of them.

Distributed authorship: spam, format-less content, …

Dynamic: in one year only 35% of the pages survive, and just 20% remain untouched.

Extracting “significant data” is difficult !!

The User:

Query composition: short (2.5 terms on average) and imprecise.

Query results: 85% of users look at just one result page.

Several needs: informational, navigational, transactional.

Matching “user needs” is difficult !!

Page 4:

Evolution of Search Engines

First generation (1995-1997: AltaVista, Excite, Lycos, …): use only on-page, web-text data, i.e. word frequency and language.

Second generation (1998: Google): use off-page, web-graph data, i.e. link (connectivity) analysis and anchor text (how people refer to a page).

Third generation (Google, Yahoo, MSN, ASK, …): answer “the need behind the query”. Focus on the “user need” rather than on the query; integrate multiple data sources; exploit click-through data.

Fourth generation: Information Supply. [Andrei Broder, VP emerging search tech, Yahoo! Research]

Pages 5-11: (figures only)

This is a search engine!!!

Page 12: (figure: -$ / +$ money flows)

Page 13:

Two new approaches

Sponsored search: ads driven by the search keywords (and the profile of the user issuing them). This is AdWords.

Context match: ads driven by the content of a web page (and the profile of the user reaching that page). This is AdSense.

Page 14: (figure only)

Page 15:

Information Retrieval

The structure of a Search Engine

Page 16:

The structure

(Diagram: the Web feeds a Crawler, governed by a Control module, which stores pages in the Page archive; a Page analyzer extracts text and structure, which the Indexer turns into the index plus auxiliary data structures; at query time a Query resolver and a Ranker answer the user query.)

Page 17: (figure only)

Page 18:

Information Retrieval

The Web Graph

Page 19:

The Web’s Characteristics

Size: about 1 trillion pages available (Google, 7/2008); at 5-40 KB per page, that is hundreds of terabytes; and the size grows every day!!

Change: 8% new pages and 25% new links every week; average page lifetime is about 10 days.

Page 20:

The Bow Tie

Page 21:

Some definitions

Weakly connected component (WCC): a set of nodes such that any node can reach any other via an undirected path.

Strongly connected component (SCC): a set of nodes such that any node can reach any other via a directed path.

Page 22:

On observing the Web graph

We do not know what fraction of it we actually know.

The only way to discover the graph structure of the Web as hypertext is via large-scale crawls.

Warning: the picture might be distorted by the size limits of the crawl, by the crawling rules, and by perturbations of the “natural” process of birth and death of nodes and links.

Page 23:

Why is it interesting?

It is the largest artifact ever conceived by humans.

Exploiting the structure of the Web helps with: crawl strategies, search, spam detection, discovering communities on the web, classification/organization.

It also helps to predict the evolution of the Web and supports sociological understanding.

Page 24:

Many other large graphs…

Internet graph: V = routers, E = communication links.

“Cosine” graph (undirected, weighted): V = static web pages, E = tf-idf distance between pages.

Query-log graph (bipartite, weighted): V = queries and URLs, E = (q,u) if u is a result for q that has been clicked by some user who issued q.

Social graph (undirected, unweighted): V = users, E = (x,y) if x knows y (Facebook, address book, email, …).

Page 25:

Definition

Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs (no in-links and no out-links) are ignored.

Three key properties. The first is the skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1.

Page 26:

The In-degree distribution

AltaVista crawl (1999) and WebBase crawl (2001): the in-degree follows a power-law distribution,

Pr[in-degree(u) = k] ∝ 1/k^α, with α ≈ 2.1

Page 27:

Definition (continued)

Directed graph G = (V,E): V = URLs, E = (u,v) if u has a hyperlink to v. Isolated URLs (no in-links and no out-links) are ignored.

Three key properties:

Skewed distribution: the probability that a node has x links is proportional to 1/x^α, with α ≈ 2.1.

Locality: most of the hyperlinks (about 80%) point to other URLs on the same host.

Similarity: pages that are close in lexicographic order tend to share many outgoing links.

Page 28:

A Picture of the Web Graph

(Figure: plot of the Web graph, axes i and j; 21 million pages, 150 million links.)

Page 29:

URL-sorting

(Figure: the Web graph after URL-sorting, highlighting the Stanford and Berkeley hosts.)

Page 30:

Information Retrieval

Crawling

Page 31:

Spidering

“Walking” over a graph, 24 hours a day, 7 days a week.

Recall that the Web graph is a directed graph G = (N, E):

N changes (insertions, deletions): more than 50·10^9 nodes.

E changes (insertions, deletions): more than 10 links per node.

Bow-tie structure.

Page 32:

Crawling Issues

How to crawl? Quality: “best” pages first. Efficiency: avoid duplication (or near-duplication). Etiquette: robots.txt, server-load concerns (minimize load).

How much to crawl? How much to index? Coverage: how big is the Web? How much do we cover? Relative coverage: how much do competitors have?

How often to crawl? Freshness: how much has changed?

How to parallelize the process?

Page 33:

Crawling picture

(Figure: seed pages; URLs crawled and parsed; the URLs frontier; the unseen Web.)

Sec. 20.2

Page 34:

Updated crawling picture

(Figure: several crawling threads consume the URL frontier; seed pages; URLs crawled and parsed; the unseen Web.)

Sec. 20.1.1

Page 35:

Robots.txt

A protocol, dating from 1994, for giving spiders (“robots”) limited access to a website: www.robotstxt.org/wc/norobots.html

The website announces what can(not) be crawled by placing a file of restrictions at URL/robots.txt.

Sec. 20.2.1

Page 36:

Robots.txt example

No robot should visit any URL starting with “/yoursite/temp/”, except the robot called “searchengine”:

User-agent: *

Disallow: /yoursite/temp/

User-agent: searchengine

Disallow:

Sec. 20.2.1
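These rules can be checked with Python’s standard-library robots.txt parser; a minimal sketch (the example.com URLs are hypothetical):

    import urllib.robotparser

    rules = [
        "User-agent: *",
        "Disallow: /yoursite/temp/",
        "",
        "User-agent: searchengine",
        "Disallow:",
    ]

    rp = urllib.robotparser.RobotFileParser()
    rp.parse(rules)  # in production: rp.set_url(".../robots.txt"); rp.read()

    # Any other robot is kept out of /yoursite/temp/ ...
    print(rp.can_fetch("somebot", "http://example.com/yoursite/temp/x.html"))       # False
    # ... while "searchengine" may fetch anything.
    print(rp.can_fetch("searchengine", "http://example.com/yoursite/temp/x.html"))  # True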

Page 37:

Processing steps in crawling

Pick a URL from the frontier (which one first is the page-selection problem, below).

Fetch the document at that URL.

Parse the document and extract links from it to other docs (URLs).

For each extracted URL: ensure it passes certain URL filter tests (e.g., only crawl .edu, obey robots.txt, etc.); check whether it is already in the frontier (duplicate-URL elimination); check whether its content has already been seen (duplicate-content elimination).

A minimal sketch of this loop follows.

Sec. 20.2.1
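A minimal sequential sketch of this loop in Python, where fetch, extract_links and url_filter are caller-supplied stand-ins (none of them is specified in the slides):

    import hashlib
    from collections import deque
    from urllib.parse import urljoin

    def crawl(seeds, fetch, extract_links, url_filter):
        """fetch(url) -> page text; extract_links(page) -> iterable of hrefs;
        url_filter(url) -> bool. All three are hypothetical helpers."""
        frontier = deque(seeds)
        seen_urls = set(seeds)      # duplicate-URL elimination
        seen_content = set()        # duplicate-content elimination (fingerprints)
        while frontier:
            url = frontier.popleft()                     # pick a URL from the frontier
            page = fetch(url)                            # fetch the document
            fp = hashlib.sha1(page.encode()).hexdigest()
            if fp in seen_content:                       # content already seen?
                continue
            seen_content.add(fp)
            for href in extract_links(page):             # parse and extract links
                link = urljoin(url, href)
                if url_filter(link) and link not in seen_urls:
                    seen_urls.add(link)
                    frontier.append(link)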

Page 38:

Basic crawl architecture

(Figure: the fetcher takes URLs from the URL frontier, resolves hosts via DNS, fetches from the WWW, and parses the page; a “content seen?” test against doc fingerprints (Doc FP’s), a URL filter, robots filters, and duplicate-URL elimination against the URL set feed the surviving links back into the frontier.)

Sec. 20.2.1

Page 39:

Page selection

Given a page P, define how “good” P is. Several metrics: BFS, DFS, random; popularity-driven (PageRank, full vs. partial); topic-driven (focused crawling); combinations of these.

Page 40:

BFS

“…BFS-order discovers the highest quality pages during the early stages of the crawl” [Najork 01], on a testbed of 328 million URLs.

Page 41:

Is this page new?

Check whether the file has been parsed or downloaded before. After 20 million pages crawled, we have “seen” over 200 million URLs; at 100 bytes per URL on average, that is about 20 GB of URLs.

Options: compress the URLs in main memory, or go to disk: a Bloom filter (Archive), or disk access with caching (Mercator, AltaVista).
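A minimal Bloom-filter sketch in Python for the URL-seen test; the sizes and the SHA-1-based hash family are illustrative choices, not what Archive or Mercator actually used:

    import hashlib

    class BloomFilter:
        """No false negatives, tunable false-positive rate, and far less
        space than storing the URLs themselves."""
        def __init__(self, num_bits, num_hashes):
            self.m, self.k = num_bits, num_hashes
            self.bits = bytearray((num_bits + 7) // 8)

        def _positions(self, url):
            # k hash functions derived from SHA-1 with different salts
            for i in range(self.k):
                h = hashlib.sha1(f"{i}:{url}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, url):
            for p in self._positions(url):
                self.bits[p // 8] |= 1 << (p % 8)

        def __contains__(self, url):
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(url))

    seen = BloomFilter(num_bits=8 * 2**30, num_hashes=7)  # 1 GiB instead of ~20 GB
    seen.add("http://www.di.unipi.it/")
    print("http://www.di.unipi.it/" in seen)  # True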

Page 42:

Parallel Crawlers

The Web is too big to be crawled by a single crawler; the work should be divided while avoiding duplication.

Dynamic assignment: a central coordinator dynamically assigns URLs to crawlers; discovered links are handed to the central coordinator.

Static assignment: the Web is statically partitioned and assigned to crawlers; each crawler only crawls its part of the Web.

Page 43:

Two problems

Let D be the number of downloaders; hash(URL) maps a URL to {0, ..., D-1}, and downloader x fetches the URLs U such that hash(U) = x.

Load balancing the number of URLs assigned to downloaders: static schemes based on hosts may fail (compare www.geocities.com/… with www.di.unipi.it/…), and dynamic “relocation” schemes may be complicated.

Managing fault tolerance: what about the death of a downloader? D → D-1 means a new hash function!!! What about a new downloader? D → D+1, a new hash again!!!

Page 44:

A nice technique: Consistent Hashing

A tool for: spidering, web caches, P2P, routers, load balancing, distributed file systems.

Items and servers are mapped to the unit circle; item K is assigned to the first server N such that ID(N) ≥ ID(K). This answers both questions above: what if a downloader goes down? what if a new downloader appears? Each server gets replicated log S times around the circle. Properties:

[monotone] adding a new server moves items only from old servers to the new one.

[balance] the probability that an item goes to a given server is ≤ c/S, for a small constant c.

[load] any server gets ≤ (I/S) log S items w.h.p.

[scale] you can replicate each server more times to smooth the load.
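A minimal sketch of consistent hashing in Python, where a 64-bit hash of URL and downloader names stands in for the unit circle:

    import bisect
    import hashlib

    def _h(key):
        # 64-bit position on the "circle" (the 64-bit hash space)
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    class ConsistentHash:
        def __init__(self, downloaders, replicas=32):
            # each downloader is mapped to `replicas` points (the log S copies)
            self.ring = sorted((_h(f"{d}#{r}"), d)
                               for d in downloaders for r in range(replicas))
            self.keys = [k for k, _ in self.ring]

        def assign(self, url):
            # first downloader clockwise from the item's position, wrapping around
            i = bisect.bisect_left(self.keys, _h(url)) % len(self.ring)
            return self.ring[i][1]

    ch = ConsistentHash(["d0", "d1", "d2"])
    print(ch.assign("http://www.di.unipi.it/"))

When a downloader dies, only the URLs on its arcs move to the next downloader clockwise; the rest of the assignment is untouched, which is exactly the monotonicity property above.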

Page 45:

Examples: Open Source

Nutch, also used by WikiSearch: http://www.nutch.org

Heritrix, used by Archive.org: http://archive-crawler.sourceforge.net/index.html

Consistent hashing: Amazon’s Dynamo.

Page 46:

Connectivity Server

Support for fast queries on the web graph: Which URLs point to a given URL? Which URLs does a given URL point to?

Stores in memory the mappings from each URL to its outlinks and from each URL to its inlinks.

Sec. 20.4

Page 47:

Currently the best

WebGraph: a set of algorithms and a Java implementation.

Fundamental goal: maintain the node adjacency lists in memory. For this, compressing the adjacency lists is the critical component.

Sec. 20.4

Page 48:

Adjacency lists

The set of neighbors of a node. Assume each URL is represented by an integer: with 4 billion pages, that is 32 bits per node, and 64 bits per hyperlink if each link is stored as a (source, destination) pair.

Sec. 20.4

Page 49:

Adjacency list compression

Properties exploited in compression: similarity (between lists); locality (many links from a page go to “lexicographically nearby” pages); gap encodings in sorted lists, exploiting the distribution of gap values.

Result: about 3 bits/link.

Sec. 20.4
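A minimal sketch of gap encoding with Elias γ codes; WebGraph’s actual codes (the ζ codes mentioned below) are more refined, this only illustrates the idea:

    def gaps(adj):
        """Sorted adjacency list -> positive gaps (node ids assumed >= 1)."""
        return [x - p for p, x in zip([0] + adj, adj)]

    def gamma(n):
        """Elias gamma code for n >= 1: (len-1) zeros, then n in binary.
        Small gaps, which dominate under a power law, get short codes."""
        b = bin(n)[2:]
        return "0" * (len(b) - 1) + b

    code = "".join(gamma(g) for g in gaps([3, 7, 8, 20]))
    print(code)  # 011 00100 1 0001100 -> 16 bits instead of 4 * 32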

Page 50:

Main ideas

Consider the lexicographically ordered list of all URLs, e.g.:

www.stanford.edu/alchemy
www.stanford.edu/biology
www.stanford.edu/biology/plant
www.stanford.edu/biology/plant/copyright
www.stanford.edu/biology/plant/people
www.stanford.edu/chemistry

Sec. 20.4

Page 51:

Copy lists

Each of these URLs has an adjacency list. Main idea: due to templates, the adjacency list of a node is often similar to that of one of the 7 preceding URLs in the lexicographic ordering (why 7? a small fixed window keeps the reference chains short), so the adjacency list is expressed in terms of one of these.

E.g., consider these adjacency lists:

1, 2, 4, 8, 16, 32, 64
1, 4, 9, 16, 25, 36, 49, 64
1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144
1, 4, 8, 16, 25, 36, 49, 64

The last list is encoded as: reference (-2), i.e. the list two positions back; copy-bits 11011111, i.e. keep every element of the reference except 9; and add the extra node 8. A sketch follows.

Sec. 20.4
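A minimal sketch of this copy-list encoding, reproducing the example above (the helper copy_encode is illustrative, not WebGraph’s API):

    def copy_encode(reference, target):
        """Express `target` as copy-bits over `reference` plus extra nodes."""
        tgt = set(target)
        copy_bits = "".join("1" if v in tgt else "0" for v in reference)
        extras = sorted(set(target) - set(reference))
        return copy_bits, extras

    bits, extras = copy_encode(
        [1, 4, 9, 16, 25, 36, 49, 64],   # the list at reference distance -2
        [1, 4, 8, 16, 25, 36, 49, 64])   # the list to encode
    print(bits, extras)  # 11011111 [8]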

Page 52:

Extra nodes and binary arrays

Several tricks: use RLE over the binary (copy) arrays; use a succinct encoding for the intervals created by the extra nodes; use special integer codes for the remaining integers (ζ codes, good for integers drawn from a power law).

Sec. 20.4

Page 53:

Main advantages

Adjacency queries can be answered very efficiently: to fetch the out-neighbors of a node, trace back the chain of prototypes. This chain is typically short in practice (since similarity is mostly intra-host), and its length can also be explicitly limited during encoding.

The encoding is an easy-to-implement, one-pass algorithm.

Sec. 20.4

Page 54:

Duplicate documents

The web is full of duplicated content.

Strict dup-detection (exact match) is not that common.

There are many cases of near duplicates, e.g., two copies of a page whose only difference is the last-modified date.

Sec. 19.6

Page 55:

Duplicate/Near-Duplicate Detection

Duplication: an exact match can be detected with fingerprints.

Near-duplication: approximate match. Overview: compute syntactic similarity with an edit-distance-like measure, and use a similarity threshold to detect near-duplicates; e.g., similarity > 80% ⇒ the documents are “near duplicates”.

Sec. 19.6

Page 56:

Computing Similarity

Approach: shingles (word q-grams). For q = 4:

a rose is a rose is a rose → a_rose_is_a, rose_is_a_rose, is_a_rose_is, a_rose_is_a

Similarity measure between two docs: sets of shingles + set intersection. A sketch follows.

Sec. 19.6
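A minimal Python sketch of shingling and set-based (multiplicity-ignoring) Jaccard similarity:

    def shingles(text, q=4):
        """Word q-grams of a document, as a set (multiplicity ignored)."""
        words = text.split()
        return {"_".join(words[i:i + q]) for i in range(len(words) - q + 1)}

    def jaccard(a, b):
        return len(a & b) / len(a | b)

    A = "a rose is a rose is a rose"
    B = "a rose is a flower which is a rose"
    print(shingles(A))                              # the 3 distinct 4-shingles above
    print(jaccard(shingles(A, 2), shingles(B, 2)))  # 0.5, matching the q=2,
                                                    # multiplicity-disregarding
                                                    # value in the example below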

Page 57:

Efficient shingle management

(Figure: Documents → shingling → multiset of shingles → fingerprinting → multiset of fingerprints, i.e. sets of 64-bit fingerprints.)

Fingerprints: use Karp-Rabin fingerprints of 64 bits, so that Prob[collision] << 1.

Page 58:

Similarity of Documents

(Figure: documents DocA and DocB with shingle sets SA and SB.)

Jaccard measure of the similarity of SA and SB, which are sets of integers:

sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|

Claim: A and B are near-duplicates if sim(SA, SB) is close to 1.

Page 59:

Remarks

Multiplicities of q-grams could be retained or ignored: a trade-off between efficiency and precision.

Shingle size: q ∈ [4 … 10]. Short shingles increase the similarity of unrelated documents: with q = 1, sim(SA, SB) = 1 iff A is a permutation of B, so a larger q is needed to be sensitive to permutation changes. With long shingles, small random changes have a larger impact.

Similarity measure: similarity is non-transitive and not a metric, but the dissimilarity 1 - sim(SA, SB) is a metric. [Ukkonen 92] relates q-grams and edit distance.

Page 60:

Example

A = “a rose is a rose is a rose”
B = “a rose is a flower which is a rose”

Preserving multiplicity:
q=1: sim(SA, SB) = 0.7, with SA = {a, a, a, is, is, rose, rose, rose} and SB = {a, a, a, is, is, rose, rose, flower, which}
q=2: sim(SA, SB) = 0.5
q=3: sim(SA, SB) = 0.3

Disregarding multiplicity:
q=1: sim(SA, SB) = 0.6
q=2: sim(SA, SB) = 0.5
q=3: sim(SA, SB) = 0.4285

Page 61:

Efficiency: Sketches

Create a “sketch vector” (of size ~200) for each document; docs that share ≥ t (say 80%) of the elements in their sketches are near duplicates.

For doc D, sketchD[i] is computed as follows:
let f map all shingles in the universe to 0..2^m (e.g., f = fingerprinting);
let π_i be a random permutation on 0..2^m;
pick sketchD[i] = MIN {π_i(f(s))} over all shingles s in D.

A sketch of this construction follows.

Sec. 19.6
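A minimal MinHash sketch in Python; the random permutations π_i are approximated by random affine maps modulo a prime (a common stand-in, not what the slides prescribe), and SHA-1 stands in for the Karp-Rabin fingerprint f:

    import hashlib
    import random

    P = 2**89 - 1  # a prime larger than the 64-bit fingerprint universe

    def fingerprint(shingle):
        """Stand-in for f: 64-bit fingerprints (the slides use Karp-Rabin)."""
        return int.from_bytes(hashlib.sha1(shingle.encode()).digest()[:8], "big")

    def sketch(shingle_set, num_perm=200, seed=0):
        """sketch[i] = MIN over shingles s of pi_i(f(s)); each permutation
        pi_i is approximated by a random affine map x -> (a*x + b) mod P."""
        rnd = random.Random(seed)
        maps = [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(num_perm)]
        fps = [fingerprint(s) for s in shingle_set]
        return [min((a * x + b) % P for x in fps) for a, b in maps]

    def estimated_sim(sk1, sk2):
        """The fraction of agreeing components estimates sim(SA, SB)."""
        return sum(x == y for x, y in zip(sk1, sk2)) / len(sk1)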

Page 62:

Computing Sketch[i] for Doc1

(Figure: start with the 64-bit values f(shingle) on the line 0..2^64; permute the number line with π_i; pick the min value.)

Sec. 19.6

Page 63:

Test if Doc1.Sketch[i] = Doc2.Sketch[i]

(Figure: the permuted values of Document 1 and Document 2 on the line 0..2^64.) Are these equal, i.e. is MIN {π(f(A))} = MIN {π(f(B))}? Test this for 200 random permutations π_1, π_2, …, π_200.

Sec. 19.6

Page 64:

Notice that…

(Figure: the permuted shingle values of Document 1 and Document 2 on the line 0..2^64.)

The two minima are equal iff the shingle with the MIN value in the union of Doc1 and Doc2 lies in their intersection.

Claim: this happens with probability |intersection| / |union|.

Sec. 19.6

Page 65:

All signature pairs

This is an efficient method for estimating the similarity (Jaccard coefficient) of one pair of documents. But we would have to estimate N^2 similarities, where N is the number of web pages: still too slow.

One solution: locality-sensitive hashing (LSH). Another solution: sorting.

Sec. 19.6

Page 66:

Information Retrieval

Link-based Ranking (2nd generation)

Page 67:

Query-independent ordering

First generation: use link counts as simple measures of popularity.

Undirected popularity: each page gets a score equal to its number of in-links plus its number of out-links (e.g., 3 + 2 = 5).

Directed popularity: score of a page = number of its in-links (e.g., 3).

Both are easy to SPAM.

Page 68:

Second generation: PageRank

Each link has its own importance!!

PageRank is independent of the query, and has many interpretations…

Page 69:

Basic Intuition…

What about nodes with no in/out links?

Page 70:

Google’s Pagerank

r = [ α T + (1-α) e e^T ] × r

where T is the transition matrix, T_{ij} = 1/#out(j) if j links to i, and 0 otherwise;
B(i): set of pages linking to i;
#out(j): number of outgoing links from j;
e: vector of components 1/sqrt{N}, so that the rank-1 matrix e e^T (the random jump) has all entries equal to 1/N.

r is the principal eigenvector of this matrix. Componentwise: r(i) = α Σ_{j ∈ B(i)} r(j)/#out(j) + (1-α)/N. A power-iteration sketch follows.
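A minimal power-iteration sketch in Python; the uniform (1-α)/N term plays the role of the e e^T random jump, and dangling nodes (no out-links) are handled by spreading their rank uniformly, a common convention the slide does not fix:

    def pagerank(out_links, alpha=0.85, iters=50):
        """Power iteration for r = [alpha*T + (1-alpha)/N * ones] r.
        out_links: dict node -> list of successors; every node is a key."""
        nodes = list(out_links)
        n = len(nodes)
        r = dict.fromkeys(nodes, 1.0 / n)
        for _ in range(iters):
            nxt = dict.fromkeys(nodes, (1.0 - alpha) / n)   # the random jump
            for u in nodes:
                succ = out_links[u]
                if succ:
                    for v in succ:
                        nxt[v] += alpha * r[u] / len(succ)  # r(j)/#out(j)
                else:
                    for v in nodes:                          # dangling node:
                        nxt[v] += alpha * r[u] / n           # jump uniformly
            r = nxt
        return r

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": []}))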

Page 71:

Three different interpretations

Graph (intuitive interpretation): co-citation.

Matrix (easy for computation): an eigenvector computation, or the solution of a linear system.

Markov Chain (useful to prove convergence): a sort of usage simulation; from any node, move to a random neighbor (or jump anywhere at random). “In the steady state” each page has a long-term visit rate: use this as the page’s score.

Page 72:

Pagerank: use in Search Engines

Preprocessing: given the graph, build the matrix α T + (1-α) e e^T and compute its principal eigenvector r; r[i] is the pagerank of page i. We are interested in the relative order.

Query processing: retrieve the pages containing the query terms and rank them by their pagerank. The final order is query-independent.

Page 73:

HITS: Hypertext Induced Topic Search

Page 74:

Calculating HITS

It is query-dependent and produces two scores per page:

Authority score: a good authority page for a topic is pointed to by many good hubs for that topic.

Hub score: a good hub page for a topic points to many authoritative pages for that topic.

Page 75:

Authority and Hub scores

(Figure: node 1 is pointed to by nodes 2, 3, 4 and points to nodes 5, 6, 7.)

a(1) = h(2) + h(3) + h(4)
h(1) = a(5) + a(6) + a(7)

Page 76:

HITS: Link Analysis Computation

h = A a    a = A^T h

⇒ h = A A^T h    a = A^T A a

where a is the vector of authority scores, h the vector of hub scores, and A the adjacency matrix, with a_{ij} = 1 if i → j.

Thus h is an eigenvector of A A^T and a is an eigenvector of A^T A, which are symmetric matrices. A power-iteration sketch follows.
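A minimal power-iteration sketch of HITS in Python, iterating a = A^T h and h = A a with normalization at each step:

    def hits(adj, iters=50):
        """adj: dict u -> iterable of v with an edge u -> v (a_uv = 1)."""
        nodes = set(adj)
        for vs in adj.values():
            nodes.update(vs)
        radj = {x: [] for x in nodes}          # reverse edges: who points to x
        for u, vs in adj.items():
            for v in vs:
                radj[v].append(u)
        h = dict.fromkeys(nodes, 1.0)
        for _ in range(iters):
            a = {x: sum(h[y] for y in radj[x]) for x in nodes}         # a = A^T h
            h = {x: sum(a[y] for y in adj.get(x, ())) for x in nodes}  # h = A a
            for vec in (a, h):                 # L2-normalize to avoid blow-up
                norm = sum(v * v for v in vec.values()) ** 0.5 or 1.0
                for k in vec:
                    vec[k] /= norm
        return a, h

On the example above, with 2, 3, 4 pointing to 1 and 1 pointing to 5, 6, 7, the first iteration indeed computes a(1) from h(2), h(3), h(4) and h(1) from a(5), a(6), a(7).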

Page 77:

Weighting links

h(x) = Σ_{x→y} a(y)    a(x) = Σ_{y→x} h(y)

Weight a link more if the query occurs in the neighborhood of the link (e.g., in the anchor text):

h(x) = Σ_{x→y} w(x,y) a(y)    a(x) = Σ_{y→x} w(y,x) h(y)