Search Engine Technology


Page 1: Search Engine Technology

Search Engine Technology

These slides are a revised version of the ones taken from http://panda.cs.binghamton.edu/~meng/

Page 2: Search Engine Technology

Search Engine Technology

Two general paradigms for finding information on the Web:
• Browsing: From a starting point, navigate through hyperlinks to find desired documents.
  – Yahoo's category hierarchy facilitates browsing.
• Searching: Submit a query to a search engine to find desired documents.
  – Many well-known search engines on the Web: AltaVista, Excite, HotBot, Infoseek, Lycos, Google, Northern Light, etc.

Page 3: Search Engine Technology

Browsing Versus Searching

• A category hierarchy is built mostly manually, whereas search engine databases can be created automatically.
• Search engines can index many more documents than a category hierarchy.
• Browsing is good for finding some desired documents; searching is better for finding many desired documents.
• Browsing is more accurate (less junk will be encountered) than searching.

Page 4: Search Engine Technology

Search Engine

A search engine is essentially a text retrieval system for web pages plus a Web interface.

So what’s new???

Page 5: Search Engine Technology

Start on 2/19/01

Page 6: Search Engine Technology

Some Characteristics of the Web

• Web pages are
  – very voluminous and diversified
  – widely distributed on many servers
  – extremely dynamic/volatile
• Web pages
  – have more structure (extensively tagged)
  – are extensively linked
  – may often have other associated metadata
• Web users are
  – ordinary folks ("dolts"?) without special training
    • they tend to submit short queries
  – a very large user community

Use the links and tags and meta-data!
Use the social structure of the web.
Standard content-based IR methods may not work.

Page 7: Search Engine Technology

Overview

Discuss how to take the special characteristics of the Web into consideration for building good search engines.

Specific subtopics:
• The use of tag information
• The use of link information
• Robot/Crawling
• Clustering/Collaborative Filtering

Page 8: Search Engine Technology

Use of TAG information

Page 9: Search Engine Technology

Use of Tag Information (1)

• Web pages are mostly HTML documents (for now).
• HTML tags allow the author of a web page to
  – control the display of page contents on the Web.
  – express their emphases on different parts of the page.
• HTML tags provide additional information about the contents of a web page.
• Can we make use of the tag information to improve the effectiveness of a search engine?

Page 10: Search Engine Technology

Use of Tag Information (2)

Two main ideas of using tags:
• Associate different importance to term occurrences in different tags.
• Use anchor text to index referenced documents.

[Figure: Page 1 contains the anchor text ". . . airplane ticket and hotel . . ." in a hyperlink to Page 2 (http://travelocity.com/); the anchor text can be used to index Page 2.]

Page 11: Search Engine Technology

Use of Tag Information (3)

Many search engines are using tags to improve retrieval effectiveness.

• Associating different importance with term occurrences is used in AltaVista, HotBot, Yahoo, Lycos, LASER, and SIBRIS.

• WWWW and Google use terms in anchor tags to index a referenced page.

• Shortcomings:
  – very few tags are considered
  – relative importance of tags not studied
  – lacks rigorous performance study

Page 12: Search Engine Technology

Use of Tag Information (4)

The Webor Method (Cutler 97, Cutler 99)
• Partition HTML tags into six ordered classes:
  – title, header, list, strong, anchor, plain
• Extend the term frequency value of a term in a document into a term frequency vector (TFV). Suppose term t appears in the ith class tf_i times, i = 1..6. Then TFV = (tf1, tf2, tf3, tf4, tf5, tf6).

Example: If for page p the term "binghamton" appears 1 time in the title, 2 times in headers, and 8 times in the anchors of hyperlinks pointing to p, then for this term in p:
TFV = (1, 2, 0, 0, 8, 0).

Page 13: Search Engine Technology

Use of Tag Information (5)

The Webor Method (continued)
• Assign different importance values to term occurrences in different classes. Let civ_i be the importance value assigned to the ith class. We have
  CIV = (civ1, civ2, civ3, civ4, civ5, civ6)
• Extend the tf term weighting scheme:
  – tfw = TFV · CIV = tf1·civ1 + … + tf6·civ6  (see the sketch below)

When CIV = (1, 1, 1, 1, 0, 1), the new tfw becomes the tfw in traditional text retrieval.

How to find the optimal CIV?
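A minimal sketch of the tfw computation described above (the function name and the second CIV are my own; the TFV is the "binghamton" example from the previous slide):

```python
# Webor-style weighted term frequency: tfw = TFV . CIV
# Classes, in order: title, header, list, strong, anchor, plain

def tfw(tfv, civ):
    """Dot product of a term-frequency vector and a class-importance vector."""
    assert len(tfv) == len(civ) == 6
    return sum(tf * iv for tf, iv in zip(tfv, civ))

# "binghamton" example: 1 title, 2 header, 8 anchor occurrences
tfv = (1, 2, 0, 0, 8, 0)

# With CIV = (1, 1, 1, 1, 0, 1) this reduces to the traditional tf
# (anchor occurrences are in pages pointing to p, so they are ignored)
print(tfw(tfv, (1, 1, 1, 1, 0, 1)))   # -> 3
# A hypothetical CIV that emphasizes titles and anchors
print(tfw(tfv, (8, 4, 1, 2, 4, 1)))   # -> 8 + 8 + 32 = 48
```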

Page 14: Search Engine Technology

Use of Tag Information (6)

The Webor Method (continued)
Challenge: How to find the optimal CIV = (civ1, civ2, civ3, civ4, civ5, civ6) such that the retrieval performance can be improved the most?

One solution: Find the optimal CIV experimentally, using a hill-climbing search in the space of CIVs.

Details skipped.

Page 15: Search Engine Technology

Use of LINK information

Page 16: Search Engine Technology

Use of Link Information (1)

Hyperlinks among web pages provide new document retrieval opportunities.

Selected examples:
• Anchor text can be used to index a referenced page (e.g., Webor, WWWW, Google).
• The ranking score (similarity) of a page with a query can be spread to its neighboring pages.
• Links can be used to compute the importance of web pages based on citation analysis.
• Links can be combined with a regular query to find authoritative pages on a given topic.

Page 17: Search Engine Technology

Connection to Citation Analysis

• Mirror, mirror on the wall, who is the biggest computer scientist of them all?
  – The guy who wrote the most papers
    • That are considered important by most people
      – By citing them in their own papers
        » "Science Citation Index"
  – Should I write survey papers or original papers?

Informetrics; Bibliometrics

Page 18: Search Engine Technology

What Citation Index says about Rao's papers

Page 19: Search Engine Technology

Desiderata for ranking
• A page that is referenced by a lot of important pages (has more backlinks) is more important.
  – A page referenced by a single important page may be more important than one referenced by five unimportant pages.
• A page that references a lot of important pages is also important.
• "Importance" can be propagated:
  – Your importance is the weighted sum of the importance conferred on you by the pages that refer to you.
  – The importance you confer on a page may be inversely proportional to how many other pages you refer to (cite).
    • (Also what you say about them when you cite them!)

Different notions of importance

Page 20: Search Engine Technology

Authority and Hub Pages (1)

The basic idea:
• A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query.
• A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query.
• Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other.

Page 21: Search Engine Technology

Authority and Hub Pages (2)

• Authorities and hubs related to the same query tend to form a bipartite subgraph of the web graph.

• A web page can be a good authority and a good hub.

[Figure: a bipartite graph, with hub pages on the left pointing to authority pages on the right.]

Page 22: Search Engine Technology

Authority and Hub Pages (3)

Main steps of the algorithm for finding good authorities and hubs related to a query q.

1. Submit q to a regular similarity-based search engine. Let S be the set of top n pages returned by the search engine. (S is called the root set; n is often in the low hundreds.)
2. Expand S into a larger set T (base set):
   • Add pages that are pointed to by any page in S.
   • Add pages that point to any page in S. If a page has too many parent pages, only the first k parent pages will be used, for some k.

Page 23: Search Engine Technology

Authority and Hub Pages (4)

3. Find the subgraph SG of the web graph that is induced by T.

[Figure: the root set S drawn inside the larger base set T.]

Page 24: Search Engine Technology
Page 25: Search Engine Technology

Authority and Hub Pages (5)

Steps 2 and 3 can be made easy by storing the link structure of the Web in advance.

Link structure table (see the sketch below):

  parent_url | child_url
  -----------+----------
  url1       | url2
  url1       | url3
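A small sketch of how steps 2 and 3 can use such a table (here plain Python dicts stand in for the parent_url/child_url table; k caps the parents used per page, as in step 2):

```python
# Expand the root set S into the base set T using a precomputed
# link-structure table.

def expand_root_set(S, children, parents, k=50):
    """children[u] = pages u points to; parents[u] = pages pointing to u."""
    T = set(S)
    for url in S:
        T.update(children.get(url, []))     # pages pointed to by pages in S
        T.update(parents.get(url, [])[:k])  # first k pages pointing to S
    return T

children = {"url1": ["url2", "url3"]}
parents  = {"url1": ["url9"]}
print(sorted(expand_root_set({"url1"}, children, parents)))
# -> ['url1', 'url2', 'url3', 'url9']
```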

Page 26: Search Engine Technology

USER(41): aaa                           ;; an adjacency matrix
#2A((0 0 1) (0 0 1) (1 0 0))
USER(42): x                             ;; an initial vector
#2A((1) (2) (3))
USER(43): (apower-iterate aaa x 2)      ;; authority computation, two iterations
[1] USER(44): (apower-iterate aaa x 3)  ;; after three iterations
#2A((0.041630544) (0.0) (0.99913305))
[1] USER(45): (apower-iterate aaa x 15) ;; after 15 iterations
#2A((1.0172524e-5) (0.0) (1.0))
[1] USER(46): (power-iterate aaa x 5)   ;; hub computation, 5 iterations
#2A((0.70641726) (0.70641726) (0.04415108))
[1] USER(47): (power-iterate aaa x 15)  ;; 15 iterations
#2A((0.7071068) (0.7071068) (4.3158376e-5))
[1] USER(48): Y                         ;; a new initial vector
#2A((89) (25) (2))
[1] USER(49): (power-iterate aaa Y 15)  ;; magic… same answer after 15 iterations
#2A((0.7071068) (0.7071068) (7.571644e-7))

[Figure: the three-node graph behind the adjacency matrix: A → C, B → C, C → A. C ends up the dominant authority; A and B the dominant hubs.]
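The Lisp trace above is garbled in the transcript; here is a rough Python equivalent (function and variable names are mine). Authorities are iterated with AᵀA and hubs with AAᵀ, normalizing at each step:

```python
import numpy as np

def power_iterate(M, x, k):
    """k steps of power iteration on M, with L2 normalization each step."""
    for _ in range(k):
        x = M @ x
        x = x / np.linalg.norm(x)
    return x

A = np.array([[0, 0, 1],
              [0, 0, 1],
              [1, 0, 0]])       # the adjacency matrix aaa from the trace
x = np.array([1.0, 2.0, 3.0])   # the initial vector x

print(power_iterate(A.T @ A, x, 15))  # authorities: ~[1.0e-05, 0, 1.0]
print(power_iterate(A @ A.T, x, 15))  # hubs: ~[0.7071, 0.7071, 4.3e-05]

# Any start vector with a nonzero component along the principal
# eigenvector converges to the same direction:
print(power_iterate(A @ A.T, np.array([89.0, 25.0, 2.0]), 15))
```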

Page 27: Search Engine Technology

Start of 2/21 lecture

Page 28: Search Engine Technology

Authority and Hub Pages (6)

4. Compute the authority score and hub score of each web page in T based on the subgraph SG(V, E).

Given a page p, let
  a(p) be the authority score of p,
  h(p) be the hub score of p, and
  (p, q) denote a directed edge in E from p to q.

Two basic operations:
• Operation I: Update each a(p) to the sum of the hub scores of all web pages that point to p.
• Operation O: Update each h(p) to the sum of the authority scores of all web pages pointed to by p.

Page 29: Search Engine Technology

Authority and Hub Pages (7)

Operation I: for each page p:
  a(p) = Σ h(q), over all q such that (q, p) ∈ E

Operation O: for each page p:
  h(p) = Σ a(q), over all q such that (p, q) ∈ E

[Figure: for Operation I, pages q1, q2, q3 all point to p; for Operation O, p points to q1, q2, q3.]

Page 30: Search Engine Technology

Authority and Hub Pages (8)

Matrix representation of Operations I and O.
Let A be the adjacency matrix of SG: entry (p, q) is 1 if p has a link to q, else the entry is 0.
Let A^T be the transpose of A.
Let h_i be the vector of hub scores after i iterations.
Let a_i be the vector of authority scores after i iterations.

Operation I: a_i = A^T h_{i-1}
Operation O: h_i = A a_i

Page 31: Search Engine Technology

Authority and Hub Pages (9)

After each iteration of applying Operations I and O, normalize all authority and hub scores.

Repeat until the scores for each page converge (the convergence is guaranteed).

5. Sort pages in descending order of authority score.
6. Display the top authority pages.

The normalization used after each iteration:

  a(p) ← a(p) / sqrt( Σ_{q ∈ V} a(q)^2 )
  h(p) ← h(p) / sqrt( Σ_{q ∈ V} h(q)^2 )

Page 32: Search Engine Technology

Authority and Hub Pages (10)

Algorithm (summary):

  submit q to a search engine to obtain the root set S;
  expand S into the base set T;
  obtain the induced subgraph SG(V, E) using T;
  initialize a(p) = h(p) = 1 for all p in V;
  repeat until the scores converge {
    apply Operation I;
    apply Operation O;
    normalize a(p) and h(p);
  }
  return pages with top authority scores;

Page 33: Search Engine Technology

Authority and Hub Pages (11)

Example: Initialize all scores to 1.

1st iteration:
  I operation: a(q1) = 1, a(q2) = a(q3) = 0, a(p1) = 3, a(p2) = 2
  O operation: h(q1) = 5, h(q2) = 3, h(q3) = 5, h(p1) = 1, h(p2) = 0
  Normalization: a(q1) = 0.267, a(q2) = a(q3) = 0, a(p1) = 0.802, a(p2) = 0.535,
                 h(q1) = 0.645, h(q2) = 0.387, h(q3) = 0.645, h(p1) = 0.129, h(p2) = 0

[Figure: example graph with edges q1→p1, q1→p2, q2→p1, q3→p1, q3→p2, and p1→q1, consistent with the scores above.]

Page 34: Search Engine Technology

Authority and Hub Pages (12)

After 2 iterations:
  a(q1) = 0.061, a(q2) = a(q3) = 0, a(p1) = 0.791, a(p2) = 0.609,
  h(q1) = 0.656, h(q2) = 0.371, h(q3) = 0.656, h(p1) = 0.029, h(p2) = 0

After 5 iterations:
  a(q1) = a(q2) = a(q3) = 0, a(p1) = 0.788, a(p2) = 0.615,
  h(q1) = 0.657, h(q2) = 0.369, h(q3) = 0.657, h(p1) = h(p2) = 0

[Figure: the same example graph as on the previous slide.]
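For reference, a compact sketch of the whole procedure run on this example (the edge set is inferred from the scores on the previous slide; a fixed iteration count stands in for a convergence test):

```python
import numpy as np

def hits(edges, nodes, iters=5):
    """Operations I and O with L2 normalization, as on the slides."""
    idx = {n: i for i, n in enumerate(nodes)}
    A = np.zeros((len(nodes), len(nodes)))
    for p, q in edges:                 # A[p][q] = 1 if p links to q
        A[idx[p], idx[q]] = 1.0
    a = np.ones(len(nodes))
    h = np.ones(len(nodes))
    for _ in range(iters):
        a = A.T @ h                    # Operation I
        h = A @ a                      # Operation O
        a /= np.linalg.norm(a)         # normalize both score vectors
        h /= np.linalg.norm(h)
    return a, h

nodes = ["q1", "q2", "q3", "p1", "p2"]
edges = [("q1", "p1"), ("q1", "p2"), ("q2", "p1"),
         ("q3", "p1"), ("q3", "p2"), ("p1", "q1")]
a, h = hits(edges, nodes, iters=5)
print(dict(zip(nodes, a.round(3))))   # a(p1) ≈ 0.788, a(p2) ≈ 0.615
print(dict(zip(nodes, h.round(3))))   # h(q1) = h(q3) ≈ 0.657, h(q2) ≈ 0.369
```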

Page 35: Search Engine Technology

(why) Does the procedure converge?

The authority computation is a power iteration:

  x1 = M x0
  x2 = M x1 = M^2 x0
  …
  xk = M^k x0,   where M = A^T A (for authorities; M = A A^T for hubs)

M is symmetric, so it has an eigendecomposition M = U Λ U^T with Λ = diag(λ1, λ2, …, λn), and

  M^k = (U Λ U^T)(U Λ U^T)…(U Λ U^T) = U Λ^k U^T,  with Λ^k = diag(λ1^k, …, λn^k)

As long as λ1 > λ2 ≥ … ≥ λn, M^k x0 converges to the direction of the principal eigenvector of M, provided x0 has a nonzero component in that direction.

As we multiply repeatedly with M, the component of x in the direction of the principal eigenvector gets stretched relative to the other directions, so we finally converge to the direction of the principal eigenvector.

Page 36: Search Engine Technology

What about non-principal eigen vectors?

• The principal eigenvector gives the authorities (and hubs).
• What do the other ones do?
  – They may be able to show the clustering in the documents (see page 23 in the Kleinberg paper).
• The clusters are found by looking at the positive and negative ends of the secondary eigenvectors (the principal eigenvector has only a positive end…).

Page 37: Search Engine Technology

Authority and Hub Pages (13)

Should all links be treated equally? Two considerations:
• Some links may be more meaningful/important than other links.
• Web site creators may trick the system into making their pages more authoritative by adding dummy pages pointing to their cover pages (spamming).

Domain name: the first level of the URL of a page.
Example: the domain name for "panda.cs.binghamton.edu/~meng/meng.html" is "panda.cs.binghamton.edu".

Page 38: Search Engine Technology

Authority and Hub Pages (14)

• Transverse link: links between pages with different domain names.

• Intrinsic link: links between pages with the same domain name.

• Transverse links are more important than intrinsic links.

Two ways to incorporate this:
1. Use only transverse links and discard intrinsic links.
2. Give lower weights to intrinsic links.

Page 39: Search Engine Technology

Authority and Hub Pages (15)

How to give lower weights to intrinsic links?

In adjacency matrix A, entry (p, q) should be assigned as follows:

• If p has a transverse link to q, the entry is 1.

• If p has an intrinsic link to q, the entry is c, where 0 < c < 1.

• If p has no link to q, the entry is 0.

Page 40: Search Engine Technology

Authority and Hub Pages (16)

For a given link (p, q), let V(p, q) be the vicinity (e.g., 50 characters) of the link.

• If V(p, q) contains terms in the user query (topic), then the link should be more useful for identifying authoritative pages.

• To incorporate this: in adjacency matrix A, make the weight associated with link (p, q) equal to 1 + n(p, q), where n(p, q) is the number of terms in V(p, q) that appear in the query (see the sketch below).
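A sketch combining this vicinity weighting with the intrinsic/transverse distinction from the previous slide (the combination and the helper names are my own; the slides present the two ideas separately):

```python
from urllib.parse import urlparse

def link_weight(p_url, q_url, vicinity_text, query_terms, c=0.5):
    """Weight for the edge (p, q) in the adjacency matrix.

    Intrinsic links (same domain) get base weight c (0 < c < 1);
    transverse links get 1. Then add n(p, q) = number of query-term
    occurrences in the text around the link (e.g., 50 characters).
    """
    same_domain = urlparse(p_url).netloc == urlparse(q_url).netloc
    base = c if same_domain else 1.0
    words = vicinity_text.lower().split()
    n = sum(words.count(t.lower()) for t in query_terms)
    return base + n

print(link_weight("http://a.edu/x", "http://b.com/y",
                  "great resource on computer games", ["computer"]))
# -> 2.0 (transverse link, one query term in the vicinity)
```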

Page 41: Search Engine Technology

Authority and Hub Pages (17)

Sample experiments:
• Rank based on large in-degree (backlinks), query: game

  Rank  In-degree  URL
  1     13         http://www.gotm.org
  2     12         http://www.gamezero.com/team-0/
  3     12         http://ngp.ngpc.state.ne.us/gp.html
  4     12         http://www.ben2.ucla.edu/~permadi/gamelink/gamelink.html
  5     11         http://igolfto.net/
  6     11         http://www.eduplace.com/geo/indexhi.html

• Only pages 1, 2 and 4 are authoritative game pages.

Page 42: Search Engine Technology

Authority and Hub Pages (18)

Sample experiments (continued):
• Rank based on large authority score, query: game

  Rank  Authority  URL
  1     0.613      http://www.gotm.org
  2     0.390      http://ad.doubleclick.net/jump/gamefan-network.com/
  3     0.342      http://www.d2realm.com/
  4     0.324      http://www.counter-strike.net
  5     0.324      http://tech-base.com/
  6     0.306      http://www.e3zone.com

• All pages are authoritative game pages.

Page 43: Search Engine Technology

Authority and Hub Pages (19)

Sample experiments (continued):
• Rank based on large authority score, query: free email

  Rank  Authority  URL
  1     0.525      http://mail.chek.com/
  2     0.345      http://www.hotmail.com/
  3     0.309      http://www.naplesnews.net/
  4     0.261      http://www.11mail.com/
  5     0.254      http://www.dwp.net/
  6     0.246      http://www.wptamail.com/

• All pages are authoritative free email pages.

Page 44: Search Engine Technology

Cora thinks Rao is authoritative on Planning; Citeseer has him down at 90th position… How come???

-- Planning has two clusters
   -- Planning & reinforcement learning
   -- Deterministic planning
-- The first is a bigger cluster
-- Rao is big in the second cluster

Page 45: Search Engine Technology

Start of 2/26

Page 46: Search Engine Technology

Announcements

• Project task 2 given
• Project task 1 progress report due this Friday

• Questions??

Page 47: Search Engine Technology
Page 48: Search Engine Technology

Authority and Hub Pages (20)

For a given query, the induced subgraph may have multiple dense bipartite communities, due to:
• multiple meanings of query terms
• multiple web communities related to the query

[Figure: examples of densely linked page sets, one involving an ad page, another an obscure web page.]

Page 49: Search Engine Technology

Authority and Hub Pages (21)

Multiple Communities (continued)
• If a page is not in a community, then it is unlikely to have a high authority score, even when it has many backlinks.

Example: Suppose initially all hub and authority scores are 1.

[Figure: community G1 has five hub pages (q's) all pointing to a single page p; community G2 has three hub pages (q's) each pointing to the same three pages (p's).]

1st iteration for G1: a(q) = 0, a(p) = 5, h(q) = 5, h(p) = 0
1st iteration for G2: a(q) = 0, a(p) = 3, h(q) = 9, h(p) = 0

Page 50: Search Engine Technology

Authority and Hub Pages (22)

Example (continued):
1st normalization (suppose the normalization factors are H1 for hubs and A1 for authorities):
  for pages in G1: a(q) = 0, a(p) = 5/A1, h(q) = 5/H1, h(p) = 0
  for pages in G2: a(q) = 0, a(p) = 3/A1, h(q) = 9/H1, h(p) = 0

After the nth iteration (with Hn and An the normalization factors of that iteration):
  for pages in G1: a(p) = 5^n / (H1…Hn-1·An)        ---- (a)
  for pages in G2: a(p) = 3·9^(n-1) / (H1…Hn-1·An)  ---- (b)

Note that (a)/(b) approaches 0 when n is sufficiently large; that is, (a) is much, much smaller than (b).

Page 51: Search Engine Technology

Authority and Hub Pages (23)

Multiple Communities (continued)
• If a page is not in the largest community, then it is unlikely to have a high authority score.
  – The reason is similar to that regarding pages not in a community.

[Figure: a larger community and a smaller community side by side.]

Page 52: Search Engine Technology

Authority and Hub Pages (24)

Multiple Communities (continued)
• How to retrieve pages from smaller communities?

A method for finding pages in the nth largest community:
  – Identify the next largest community using the existing algorithm.
  – Destroy this community by removing links associated with pages having large authorities.
  – Reset all authority and hub values back to 1 and calculate all authority and hub values again.
  – Repeat the above n − 1 times; the next largest community will be the nth largest community.

Page 53: Search Engine Technology

Authority and Hub Pages (25)

Query: House (first community)

Page 54: Search Engine Technology

Authority and Hub Pages (26)

Query: House (second community)

Page 55: Search Engine Technology

PageRank

Page 56: Search Engine Technology

PageRank (Ranking the whole darned web)

Basic idea:
-- Think of the Web as a big graph. A random surfer keeps randomly clicking on the links. The importance of a page is the probability that the surfer finds herself on that page.
-- Talk of a transition matrix instead of an adjacency matrix. The transition matrix is derived from the adjacency matrix: if there are F(u) forward links from a page u, then the probability that the surfer clicks on any one of them is 1/F(u). (Columns sum to 1: a stochastic matrix.)
-- But even a dumb user may once in a while do something other than follow URLs on the current page…
-- Idea: Put a small probability that the user goes off to a page not pointed to by the current page.

The principal eigenvector gives the stationary distribution!

Page 57: Search Engine Technology

Computing PageRank (4)

Matrix representation:
Let M be an N×N matrix and m_uv be the entry at the u-th row and v-th column.
  m_uv = 1/N_v if page v has a link to page u
  m_uv = 0 if there is no link from v to u
Let R_i be the N×1 rank vector for the i-th iteration and R_0 be the initial rank vector. Then
  R_i = M R_{i-1}

Page 58: Search Engine Technology

Computing PageRank (5)

If the ranks converge, i.e., there is a rank vector R such that R = M R, then R is an eigenvector of matrix M with eigenvalue 1.

Convergence is guaranteed only if:
• M is aperiodic (the Web graph is not one big cycle). This is practically guaranteed for the Web.
• M is irreducible (the Web graph is strongly connected). This is usually not true.

(The principal eigenvalue of a stochastic matrix is 1.)

Page 59: Search Engine Technology

Computing PageRank (6)

Rank sink: a page or a group of pages is a rank sink if it can receive rank propagation from its parents but cannot propagate rank to other pages.

A rank sink causes the loss of total rank.

Example:

[Figure: four pages A, B, C, D in which C and D receive rank from A and B but link only to each other; (C, D) is a rank sink.]

Page 60: Search Engine Technology

Computing PageRank (7)

A solution to the non-irreducibility and rank-sink problem:
• Conceptually add a link from each page v to every page (including itself).
• If v has no forward links originally, make all entries in the corresponding column in M be 1/N.
• If v has forward links originally, replace 1/N_v in the corresponding column by c·1/N_v and then add (1 − c)·1/N to all entries, 0 < c < 1.

The motivation comes also from the random-surfer model.

Page 61: Search Engine Technology

Computing PageRank (8)

M* = c (M + Z) + (1 − c) K

• M* is irreducible.
• M* is stochastic: the sum of all entries of each column is 1 and there are no negative entries.

Therefore, if M is replaced by M* as in R_i = M* R_{i-1}, then convergence is guaranteed and there will be no loss of total rank (which is 1).

(Z has 1/N in the columns of sink pages and 0 elsewhere; K has 1/N for all entries.)

Page 62: Search Engine Technology

Computing PageRank (9)

Interpretation of M* based on the random walk model.

• If page v has no forward links originally, a web surfer at v can jump to any page in the Web with probability 1/N.

• If page v has forward links originally, a surfer at v can either follow a link to another page with probability c·1/N_v, or jump to any page with probability (1 − c)·1/N.

Page 63: Search Engine Technology

Start of 2/28

Task 1 submissions to [email protected]

(NO hardcopy submissions)

Page 64: Search Engine Technology

Computing PageRank (10)

Example: Suppose the Web graph is:

[Figure: four pages A, B, C, D with links A→C, B→C, C→D, D→A, D→B.]

              A    B    C    D
       A   [  0    0    0    ½  ]
  M =  B   [  0    0    0    ½  ]
       C   [  1    1    0    0  ]
       D   [  0    0    1    0  ]

(Rows and columns are ordered A, B, C, D; column v holds 1/N_v for each page that v links to.)

Page 65: Search Engine Technology

Computing PageRank (11)

Example (continued): Suppose c = 0.8. All entries in Z are 0 and all entries in K are ¼.

  M* = 0.8 (M + Z) + 0.2 K =

       [ 0.05  0.05  0.05  0.45 ]
       [ 0.05  0.05  0.05  0.45 ]
       [ 0.85  0.85  0.05  0.05 ]
       [ 0.05  0.05  0.85  0.05 ]

After 30 iterations: R(A) = R(B) = 0.176, R(C) = 0.332, R(D) = 0.316

Actually, my calculation says: A: .11; B = .11; C = .97; D = .18
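A quick sketch to check the iteration yourself (not the slides' code): since M* is column-stochastic, the iteration preserves the sum of the rank vector, so starting from a uniform vector that sums to 1 needs no extra normalization:

```python
import numpy as np

# M from the slides: column v holds 1/N_v for each page that v links to
M = np.array([[0, 0, 0, 0.5],
              [0, 0, 0, 0.5],
              [1, 1, 0, 0  ],
              [0, 0, 1, 0  ]])

c = 0.8
N = 4
K = np.full((N, N), 1.0 / N)   # uniform jump matrix
Z = np.zeros((N, N))           # no sink pages in this graph
M_star = c * (M + Z) + (1 - c) * K

R = np.full(N, 1.0 / N)        # initial rank vector, sums to 1
for _ in range(30):
    R = M_star @ R

print(R.round(3))              # -> [0.176 0.176 0.332 0.316]
```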

Page 66: Search Engine Technology

Computing PageRank (12)

Incorporate the ranks of pages into the ranking function of a search engine.
• The ranking score of a web page can be a weighted sum of its regular similarity with a query and its importance:

  ranking_score(q, d) = w·sim(q, d) + (1 − w)·R(d),  if sim(q, d) > 0
                      = 0,                           otherwise
  where 0 < w < 1.

  – Both sim(q, d) and R(d) need to be normalized to [0, 1].
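A one-function sketch of this combination (names are mine; assumes sim and R are already normalized to [0, 1]):

```python
def ranking_score(sim_qd, rank_d, w=0.5):
    """Weighted combination of query similarity and page importance.

    sim_qd and rank_d are assumed normalized to [0, 1];
    w (0 < w < 1) trades off the two signals.
    """
    return w * sim_qd + (1 - w) * rank_d if sim_qd > 0 else 0.0

print(ranking_score(0.6, 0.3))   # -> 0.45
```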

Page 67: Search Engine Technology

Make sure to talk about stability of PageRank.

Page 68: Search Engine Technology
Page 69: Search Engine Technology

Query complexity

• Complex queries (966 trials)
  – Average words: 7.03
  – Average operators (+*–"): 4.34
• Typical AltaVista queries are much simpler [Silverstein, Henzinger, Marais and Moricz]
  – Average query words: 2.35
  – Average operators (+*–"): 0.41
• Forcibly adding a hub or authority node helped in 86% of the queries.

Page 70: Search Engine Technology

Crawling

Page 71: Search Engine Technology
Page 72: Search Engine Technology

Robot/Spider/Crawler (1)

• A robot (also known as a spider, crawler, or wanderer) is a program for fetching web pages from the Web.
• 243 registered spiders
  – Inktomi Slurp, AltaVista Scooter
    • Standard search engines
  – Citeseer; Cora
    • Just the CS papers…
  – Digimarc
    • Downloads just images, looking for watermarks
  – AdRelevance
    • Just looking for those annoying banner ads (!!!)

General purpose vs. topic specific

Page 73: Search Engine Technology

Robot/Spider/Crawler (2)

A robot (also known as a spider, crawler, or wanderer) is a program for fetching web pages from the Web.

Main idea:
1. Place some initial URLs into a URL queue.
2. Repeat the steps below until the queue is empty:
   – Take the next URL from the queue and fetch the web page using HTTP.
   – Extract new URLs from the downloaded web page and add them to the queue (see the sketch below).
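A minimal sketch of this loop in Python (single-threaded, crude href extraction, no politeness delay or robots.txt handling; those issues come up on later slides):

```python
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seed_urls, max_pages=10):
    queue, seen = deque(seed_urls), set(seed_urls)
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except OSError:
            continue                               # bad or down link
        for href in re.findall(r'href="([^"]+)"', html):
            u = urljoin(url, href)                 # resolve relative paths
            if u.startswith("http") and u not in seen:
                seen.add(u)
                queue.append(u)
    return seen
```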

Page 74: Search Engine Technology

Web Crawling (Search) Strategy
• Starting location(s)
• Traversal order
  – Depth first
  – Breadth first
  – Or ???
• Cycles?
• Coverage?
• Load?

[Figure: example traversal tree over pages b, c, d, e, f, g, h, i, j.]

Page 75: Search Engine Technology

Storing Summaries

• Can't store complete page text
  – The whole WWW doesn't fit on any server
• Stop words
• Stemming
• What (compact) summary should be stored?
  – Per URL: title, snippet
  – Per word: URL, word number

Page 76: Search Engine Technology
Page 77: Search Engine Technology

Robot (2)

Some specific issues:
1. What initial URLs to use? The choice depends on the type of search engine to be built.
   • For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.
   • For local search engines covering one or several organizations, use URLs of the home pages of these organizations. In addition, use appropriate domain constraints.

Page 78: Search Engine Technology

Robot (3)

Examples:
To create a search engine for Binghamton University, use initial URL www.binghamton.edu and domain constraint "binghamton.edu".
• Only URLs containing "binghamton.edu" will be used.
To create a search engine for the Watson School, use initial URL "watson.binghamton.edu" and domain constraints "watson.binghamton.edu", "cs.binghamton.edu", "ee.binghamton.edu", "me.binghamton.edu" and "ssie.binghamton.edu".

Page 79: Search Engine Technology

Robot (4)

2. How to extract URLs from a web page?
Need to identify all possible tags and attributes that hold URLs:
• Anchor tag: <a href="URL" …> … </a>
• Option tag: <option value="URL" …> … </option>
• Map: <area href="URL" …>
• Frame: <frame src="URL" …>
• Link to an image: <img src="URL" …>
• Relative path vs. absolute path: <base href=…>

Page 80: Search Engine Technology

Robot (5)

3. How fast should we download web pages from the same server?
   • Downloading web pages consumes the web server's local resources.
   • Be considerate to the web servers used (e.g., one page per minute from the same server).
4. Other issues:
   • Handling bad links and down links
   • Handling duplicate pages [page signatures…]
   • Robot exclusion protocol

Page 81: Search Engine Technology

Robot (6)

Robot Exclusion Standard:
• Use the file /robots.txt to tell robots what can be accessed.

Examples:

  User-agent: webcrawler
  Disallow:              # no restriction for webcrawler

  User-agent: lycra
  Disallow: /            # no access for robot lycra

  User-agent: *
  Disallow: /tmp         # all other robots can index
  Disallow: /logs        # docs not under /tmp, /logs

Voluntary adherence; no enforcement…
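Python's standard library can interpret such files; a small sketch using a subset of the example rules above (the URLs are illustrative):

```python
from urllib import robotparser

rules = """\
User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rp = robotparser.RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("lycra", "http://example.com/page.html"))   # False
print(rp.can_fetch("mybot", "http://example.com/tmp/x.html"))  # False
print(rp.can_fetch("mybot", "http://example.com/page.html"))   # True
```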

Page 82: Search Engine Technology

Robot (7)

Several research issues about robots:
• Fetching more important pages first with limited resources.
  – Can use measures of page importance.
• Fetching web pages in a specified subject area, such as movies and sports, for creating domain-specific search engines.
  – Focused crawling.
• Efficient re-fetching of web pages to keep the web page index up to date.
  – Keeping track of the change rate of a page.

Page 83: Search Engine Technology

Robot (8)

Efficient Crawling through URL Ordering [Cho 98]
• Default ordering is based on breadth-first search.
• Efficient crawling fetches important pages first.

Importance definitions:
• Similarity of a page to a driving query
• Backlink count of a page
• Forward link count of a page
• PageRank of a page
• Domain of a page (.edu is better than .com)
• A combination of the above
  – w1·Apples + w2·Oranges + …

Page 84: Search Engine Technology

Robot (9)

A method for fetching pages related to a driving query first [Cho 98]:
• Suppose the query is "computer".
• A page is related (hot) if "computer" appears in the title or appears 10 times in the body of the page.
• Some heuristics for finding a hot page:
  – The anchor of its URL contains "computer".
  – Its URL contains "computer".
  – Its URL is within 3 links of a hot page.
Call such a URL a hot URL.

Page 85: Search Engine Technology

Robot (10)

Crawling algorithm:

  hot_queue = url_queue = empty;   /* initialization */
  /* hot_queue stores hot URLs and url_queue stores other URLs */
  enqueue(url_queue, starting_url);
  while (hot_queue or url_queue is not empty) {
    url = dequeue2(hot_queue, url_queue);
        /* dequeue hot_queue first if it is not empty */
    page = fetch(url);
    if (page is hot) then hot[url] = true;
    enqueue(crawled_urls, url);
    url_list = extract_urls(page);
    for each u in url_list
      if (u not in url_queue and u not in hot_queue
          and u not in crawled_urls)   /* if u is a new URL */
        if (u is a hot URL) enqueue(hot_queue, u);
        else enqueue(url_queue, u);
  }
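A rough Python rendering of this two-queue strategy, where fetch, extract_urls, is_hot_url, and is_hot_page are placeholders for the pieces the slide leaves abstract:

```python
from collections import deque

def crawl_hot_first(starting_url, fetch, extract_urls,
                    is_hot_url, is_hot_page):
    """Two-queue crawl: hot URLs are always dequeued before others."""
    hot_queue, url_queue = deque(), deque([starting_url])
    crawled, hot = set(), {}
    while hot_queue or url_queue:
        url = (hot_queue or url_queue).popleft()   # hot_queue first
        page = fetch(url)
        hot[url] = is_hot_page(page)
        crawled.add(url)
        for u in extract_urls(page):
            if u not in crawled and u not in hot_queue and u not in url_queue:
                (hot_queue if is_hot_url(u) else url_queue).append(u)
    return crawled, hot
```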

Page 86: Search Engine Technology

Focused Crawling
• Classifier: Is crawled page P relevant to the topic?
  – An algorithm that maps a page to relevant/irrelevant
    • Semi-automatic
    • Based on page vicinity…
• Distiller: Is crawled page P likely to lead to relevant pages?
  – An algorithm that maps a page to likely/unlikely
    • Could be just the A/H computation, taking the hubs
  – The distiller determines the priority of following links off of P

Page 87: Search Engine Technology

Measuring Crawler efficiency

Page 88: Search Engine Technology
Page 89: Search Engine Technology

Feedback & Prediction

• Traditional IR has a single user, probably working in single-shot mode
  – Relevance feedback…
• Web search engines have:
  – Many users
    • Propagate user preferences to other users…
  – Users working continually
    • Relevance feedback
    • User profiling

Page 90: Search Engine Technology

Feedback opportunities

When a user submits a query to a search engine, the user may have some of the following behaviors or reactions to the returned web pages:
• Click certain pages in a certain order while ignoring most pages.
• Read some clicked pages longer than other clicked pages.
• Save/print certain clicked pages.
• Follow some links in clicked pages to reach more pages.

Page 91: Search Engine Technology

Feedback patterns
• The behavior of a user u on the result of a query q can be considered a piece of knowledge associated with the user-query pair (u, q).
• The same user may use the search engine many times with many queries. Each time, the user reacts to the retrieved results.
• Many users may submit different queries to the search engine.
  – Many users may have common information needs.
  – The same query or a similar query may be submitted by different users.

Page 92: Search Engine Technology

Prediction opportunities

The reactions of users to the retrieval results of many past queries can be collected and stored in a knowledge base.

User reaction knowledge can be used in at least three different ways to improve retrieval:

1. Use the knowledge immediately to benefit the current search needs of the user (user feedback).

2. Use the knowledge in the future to benefit the future search needs of the user (user profile).

3. Use the knowledge in the future to benefit the future search needs of all users (collaborative filtering).

Page 93: Search Engine Technology

Relevance feedback

Implicit User Feedback:
1. Derive likely relevant documents from the returned documents based on the user's behavior.
   – Saved/printed documents can be considered relevant.
   – Documents that are viewed for a longer time can be considered more likely to be relevant.
2. Modify the query to a new query q* and submit q* to the search engine for another round of search.
   • Relevance feedback

Page 94: Search Engine Technology

User profiling

User Profile:
A profile of a user is a collection of information that documents the user's information needs and/or access patterns.

Different types of user profiles exist:
• Static profiles describing user information needs.
• Dynamic profiles that change according to the user's recent access behaviors and patterns.
• Specialized profiles (e.g., navigational patterns).
• Server-side profiles.
• Client-side profiles.

Page 95: Search Engine Technology

User profiling

User Profile (continued):
• User profiles are widely used for text filtering: find documents that are similar to a user profile.
• Profile-based filtering is also known as content-based recommendation.
• A user profile can be used in combination with a query for better information retrieval and filtering.

Page 96: Search Engine Technology

Collaborative Filtering

Collaborative Filtering:
From (Miller 96): Collaborative filtering systems make use of the reactions and opinions of people who have already seen a piece of information to make predictions about the value of that piece of information for people who have not yet seen it.

• Collaborative filtering systems often recommend to a user (a query) documents that are liked (found useful) by similar users (e.g., users who have similar profiles) (for similar queries).

Page 97: Search Engine Technology

Collaborative Filtering (8)

Main components:
• Recommendation gathering: e.g., record user behaviors on retrieved documents.
• Recommendation aggregation: combine multiple recommendations into a useful measure.
• Recommendation usage: apply recommendation measures to recommend documents.

Some interesting issues:
• What recommendations are useful?
• How to do recommendation aggregation?
• How to combine recommendations with other usefulness measures?

Page 98: Search Engine Technology

Collaborative Filtering (9)

Example Systems:
PHOAKS (People Helping One Another Know Stuff)
• For recommending URLs.
• Uses each mention of a URL in a news article as a recommendation.
  – Not counting URLs in headers and quoted sections.
  – Not using articles posted to too many newsgroups.
  – Not counting URLs in announcements or ads.
• Recommendation aggregation: compute the number of distinct recommenders of each URL.
• Recommendation is based on the number of distinct recommenders (see the sketch below).
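A toy sketch of the aggregation step, counting distinct recommenders per URL (the data is made up for illustration):

```python
from collections import defaultdict

# (poster, mentioned_url) pairs that survived the filters above
mentions = [
    ("alice", "http://example.org/faq"),
    ("bob",   "http://example.org/faq"),
    ("alice", "http://example.org/faq"),   # same recommender, counted once
    ("carol", "http://example.net/tool"),
]

recommenders = defaultdict(set)
for poster, url in mentions:
    recommenders[url].add(poster)

# Rank URLs by number of distinct recommenders
for url, who in sorted(recommenders.items(), key=lambda kv: -len(kv[1])):
    print(len(who), url)
# 2 http://example.org/faq
# 1 http://example.net/tool
```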

Page 99: Search Engine Technology

Collaborative Filtering (10)

Example Systems:
Fab (http://fab.stanford.edu)
• Combines content-based recommendation and collaborative recommendation.
  – Retains the advantages of each approach while avoiding the weaknesses of each.
• Users are required to rank each recommended document explicitly on a 7-point scale.
• The ranking is used to update a user's profile, and highly ranked documents are also recommended to users with similar profiles.

Page 100: Search Engine Technology

Collaborative Filtering (11)

Example Systems:
DirectHit (http://www.directhit.com)
• Author-controlled search engines versus editor-controlled directories.
• DirectHit aims at achieving the breadth of a regular search engine with the accuracy of editor-controlled directories by adopting a user-controlled method.
• DirectHit uses user viewing time of documents and other behavior information to identify useful hits, and uses collaborative filtering to help find documents for new queries.

Page 101: Search Engine Technology

Indexing and Retrieval Issues