
How to Build a Search Engine

樂倍達數位科技股份有限公司 范綱岷 (Kung-Ming Fung)

2008/04/01

樂倍達數位科技股份有限公司, http://www.doubleservice.com/

Outline

Introduction
  Different Kinds of Search Engine
Architecture
  Robot, Spider, Crawler
  HTML and HTTP
  Indexing
  Keyword Search
Evaluation Criteria
Related Work
Discussion
About Google
Ajax: A New Approach to Web Applications
References


Introduction


Different Kinds of Search Engine

Directory Search
Full Text Search
  Web pages
  News
  Images
  …
Meta Search


Number of pages: Directory < Full text < Meta

Directory Search (directory-based)
  ODP: Open Directory Project, http://dmoz.org/
Full-Text Search (full-text retrieval)
  Google, http://www.google.com/


Meta Search (aggregated)
  MetaCrawler, http://www.metacrawler.com/
  愛幫 (Aibang), http://www.aibang.com/


Simplified control flow of the meta search engine

Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html


Architecture

Simple Architecture (figure). Components shown: WWW; Robot, Spider, Crawler; Indexing; Database; Keyword Search.


Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

Typical high-level architecture of a Web crawler


Typical anatomy of a large-scale crawler.

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.


High Level Google Architecture

Reference: A Survey On Web Information Retrieval Technologies


The architecture of a standard meta search engine.

Reference: Web Search – Your Way


Reference: Web Search – Your Way

The architecture of a meta search engine.



Cyclic architecture for search engines


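Robot, Spider, Crawler

A robot is the software in a search engine that is responsible for collecting data; it is also called a spider or a crawler. It automatically collects web pages from sites on a configured schedule. It usually starts from a set of predefined seed sites and follows their links to other sites, repeating the process recursively.

DNS lookup is a major performance bottleneck.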


Goal

Resolving the hostname in the URL to an IP address using DNS (Domain Name System).
Connecting a socket to the server and sending the request.
Receiving the requested page in response.
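The three steps above can be sketched with Python's standard library. This is a minimal illustration, not the implementation behind the slides: the function names `build_request`, `resolve`, and `fetch` are hypothetical, only plain `http://` URLs are handled, and the small in-process cache on `resolve` is an assumed way to reduce repeated DNS lookups.

```python
import socket
from functools import lru_cache
from urllib.parse import urlsplit


def build_request(url):
    """Split a URL into (host, port, raw HTTP/1.0 request bytes)."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return host, port, request.encode("ascii")


@lru_cache(maxsize=10000)
def resolve(host):
    """Step 1: resolve the hostname to an IPv4 address (cached per process)."""
    return socket.gethostbyname(host)


def fetch(url, timeout=10.0):
    """Fetch a page over plain HTTP and return the raw response bytes."""
    host, port, request = build_request(url)
    ip_address = resolve(host)
    # Step 2: connect a socket to the server and send the request.
    with socket.create_connection((ip_address, port), timeout=timeout) as sock:
        sock.sendall(request)
        # Step 3: receive the response until the server closes the connection.
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


# Example (needs network access, so it is left commented out):
# raw_response = fetch("http://www.ntust.edu.tw/")
```

Caching resolved addresses is only one option; large crawlers often run a dedicated DNS resolver instead.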



Amount of static and dynamic pages at a given depth

Dynamic pages: 5 levels
Static pages: 15 levels


Policy

A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading Web sites.
A parallelization policy that states how to coordinate distributed Web crawlers.

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.


The view of Web Crawler

Reference: Structural abstractions of hypertext documents for Web-based retrieval


Flow of a basic sequential crawler

Reference: Crawling the Web.
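The figure for this slide is not part of the transcript. As a rough sketch of what a basic sequential crawler loop usually looks like, the code below keeps a frontier queue and a set of URLs already seen. The names `crawl`, `fetch`, `extract_links`, and the toy in-memory web are assumptions for illustration; a real crawler would plug in network fetching and HTML parsing.

```python
from collections import deque


def crawl(seeds, fetch, extract_links, max_pages=100):
    """Breadth-first sequential crawl starting from the seed URLs.

    fetch(url) returns the page body; extract_links(url, body) returns
    the URLs found on that page. Both are supplied by the caller.
    """
    frontier = deque(seeds)   # URLs waiting to be fetched
    seen = set(seeds)         # URLs already queued, to avoid re-adding them
    pages = {}                # url -> body for every page fetched so far
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            body = fetch(url)
        except OSError:
            continue          # skip pages that cannot be fetched
        pages[url] = body
        for link in extract_links(url, body):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return pages


# Toy in-memory "web": each page body is simply its list of outgoing links.
TOY_WEB = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}


def toy_fetch(url):
    return TOY_WEB[url]


def toy_links(url, body):
    return body


print(list(crawl(["a"], toy_fetch, toy_links)))
```

Passing `fetch` and `extract_links` in as parameters keeps the loop itself independent of networking, which also makes it easy to try on a toy graph.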



A multi-threaded crawler model


HTML and HTTP

HTML – Hypertext Markup Language
HTTP – Hypertext Transfer Protocol
TCP – Transmission Control Protocol
HTTP is built on top of TCP.

Hyperlink
A hyperlink is expressed as an anchor tag with an href attribute.
<a href="http://www.ntust.edu.tw/">NTUST</a>
URL – Uniform Resource Locator (http://www.ntust.edu.tw/)


GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Sat, 13 Jan 2001 09:01:02 GMT
Server: Apache/1.3.0 (Unix) PHP/3.0.4
Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
Accept-Ranges: bytes
Content-Length: 5437
Connection: Close
Content-Type: text/html

<html><head><title>NTUST</title></head><body>…</body></html>



For checking a URL


Operation of a crawler

Reference: Crawling a Country: Better Strategies than Breadth-First for Web Page Ordering.


Reference: Crawling on the World Wide Web.

Get new URLs
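One common way to get new URLs from a fetched page is to collect the href attribute of every anchor tag and resolve it against the page's own URL. The sketch below uses Python's standard `html.parser`; the class name `LinkExtractor` and the decision to keep only http/https links and drop fragments are assumptions for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute http(s) URLs from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                absolute = urljoin(self.base_url, value)
                # Drop "#fragment" parts: they point into the same page.
                absolute, _fragment = urldefrag(absolute)
                if absolute.startswith(("http://", "https://")):
                    self.links.append(absolute)


def extract_links(base_url, html):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


sample = (
    '<a href="http://www.ntust.edu.tw/">NTUST</a> '
    '<a href="/about.html#top">About</a> '
    '<a href="mailto:someone@example.com">Mail</a>'
)
print(extract_links("http://example.com/dir/page.html", sample))
```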


HTML Tag Tree



Strategies

Breadth-first
Backlink-count
Batch-pagerank
Partial-pagerank
OPIC (On-line Page Importance Computation)
Larger-sites-first

Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.


Re-visit policy

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:
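The formula did not survive in the slide text. The Wikipedia article cited on this slide gives the definition below; the symbol F_p(t) follows that article and is not taken from the slide itself.

```latex
F_p(t) =
\begin{cases}
1 & \text{if } p \text{ is equal to the local copy at time } t \\
0 & \text{otherwise}
\end{cases}
```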

Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.


Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository, at time t is defined as:
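The formula is again missing from the slide text. The Wikipedia article cited on this slide gives the definition below; the symbol A_p(t) follows that article and is not taken from the slide itself.

```latex
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ is not modified at time } t \\
t - \text{modification time of } p & \text{otherwise}
\end{cases}
```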

Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.


Robot Exclusion

http://www.robotstxt.org/wc/exclusion.html
The robots exclusion protocol
The robots META tag


The Robots Exclusion Protocol - /robots.txt

Where to create the robots.txt file?

EX:

Site URL              Corresponding Robots.txt URL
http://www.w3.org/    http://www.w3.org/robots.txt


URLs are case sensitive, and "/robots.txt" must be all lower-case.

Examples:

To exclude all robots from the entire server
User-agent: *
Disallow: /

To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/


To exclude a single robot
User-agent: BadBot
Disallow: /

To allow a single robot
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /


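To exclude all files except one
Either move every file to be excluded into a separate directory and disallow that directory:
User-agent: *
Disallow: /~joe/docs/

Or disallow each excluded page explicitly:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html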
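A crawler can check such rules before fetching a page. Python ships a parser for this in `urllib.robotparser`; the sketch below feeds it two of the rule sets shown above as strings. The helper name `make_parser` and the robot name "MyBot" are made up for illustration, and in practice the rules would be downloaded from the site's /robots.txt.

```python
from urllib.robotparser import RobotFileParser

# "To allow a single robot", copied from the examples above.
ALLOW_ONE_ROBOT = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""

# "To exclude all robots from part of the server".
EXCLUDE_PART = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""


def make_parser(robots_txt):
    """Build a parser from robots.txt text that is already in memory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


single = make_parser(ALLOW_ONE_ROBOT)
print(single.can_fetch("WebCrawler", "http://www.w3.org/index.html"))
print(single.can_fetch("BadBot", "http://www.w3.org/index.html"))

partial = make_parser(EXCLUDE_PART)
print(partial.can_fetch("MyBot", "http://www.w3.org/tmp/cache.html"))
print(partial.can_fetch("MyBot", "http://www.w3.org/index.html"))
```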


A sample robots.txt file

# AltaVista Search
User-agent: AltaVista Intranet V2.0 W3C Webreq
Disallow: /Out-Of-Date/

# Exclude some access-controlled areas
User-agent: *
Disallow: /Team/
Disallow: /Project/
Disallow: /Sytems/


The Robots META Tag

<meta name="robots" content="noindex,nofollow">

Like any META tag it should be placed in the HEAD section of an HTML page:

<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...


Examples:

<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

Index: if an indexing robot should index the page
Follow: if a robot is to follow links on the page

The defaults are INDEX and FOLLOW.
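A crawler can read these directives while parsing a page. The sketch below uses Python's standard `html.parser` and starts from the INDEX and FOLLOW defaults stated above; the names `RobotsMetaParser` and `robots_meta` are hypothetical, and only the four directives from the examples are handled.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Track the index/follow permissions given by <meta name="robots">."""

    def __init__(self):
        super().__init__()
        self.index = True    # default: INDEX
        self.follow = True   # default: FOLLOW

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        for directive in attributes.get("content", "").lower().split(","):
            directive = directive.strip()
            if directive == "noindex":
                self.index = False
            elif directive == "index":
                self.index = True
            elif directive == "nofollow":
                self.follow = False
            elif directive == "follow":
                self.follow = True


def robots_meta(html):
    """Return (may_index, may_follow) for an HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
print(robots_meta(page))
```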


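Indexing

In general, an index is built by storing every word or phrase of a web page in a keyword index file. Besides the page content itself, the keywords that the page author defines in the meta tags are often indexed as well.

TF, IDF, Reverse (Inverted) Index
Stop words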


Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing

(b) is an inverted index of (a)


The number in parentheses after each word is its position in the document.

d1: My(1) care(2) is(3) loss(4) of(5) care(6) with(7) old(8) care(9) done(10).
d2: Your(1) care(2) is(3) gain(4) of(5) care(6) with(7) new(8) care(9) won(10).

tid: token ID   did: document ID   pos: position

tid    did   pos
my     1     1
care   1     2
is     1     3
…
new    2     8
care   2     9
won    2     10

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.


Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

d1: My care is loss of care with old care done.
d2: Your care is gain of care with new care won.

Variant 1 (documents only):
my -> d1
care -> d1; d2
is -> d1; d2
loss -> d1
of -> d1; d2
with -> d1; d2
old -> d1
done -> d1
your -> d2
gain -> d2
new -> d2
won -> d2

Variant 2 (documents and positions):
my -> d1/1
care -> d1/2,6,9; d2/2,6,9
is -> d1/3; d2/3
loss -> d1/4
of -> d1/5; d2/5
with -> d1/7; d2/7
old -> d1/8
done -> d1/10
your -> d2/1
gain -> d2/4
new -> d2/8
won -> d2/10

Two variants of the inverted index data structure.
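The positional variant can be built in a few lines. The sketch below is an in-memory illustration using the two example documents; the function names `tokenize`, `build_positional_index`, and `search` are assumptions, and the tokenizer simply lowercases the text and keeps runs of ASCII letters.

```python
import re
from collections import defaultdict

DOCS = {
    "d1": "My care is loss of care with old care done.",
    "d2": "Your care is gain of care with new care won.",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_positional_index(docs):
    """Map each token to {document ID: [1-based positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text), start=1):
            index[token].setdefault(doc_id, []).append(position)
    return dict(index)


def search(index, *terms):
    """Return the IDs of the documents that contain every query term."""
    postings = [set(index.get(term.lower(), {})) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_positional_index(DOCS)
print(index["care"])
print(search(index, "care", "old"))
```

Dropping the position lists and keeping only the document IDs gives the first variant.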


Usually stored on disk
Implemented using a B-tree or a hash table


Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.

Reference: Soumen Chakrabarti, Mining the Web: Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.


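Keyword Search

The query and retrieval software is the key factor in whether a search engine becomes widely used. Users can mostly judge a system only by its search speed and the quality of its results, and both belong to the retrieval software.

Artificial intelligence, natural language
Ranking: PageRank, HITS
Query Expansion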


WAIS: Wide Area Information Servers (WAIS) is software that builds full-text indexes and provides full-text retrieval of network resources. It consists of three main parts: server, client, and protocol.

Query methods:
Keyword
Concept-based
Fuzzy
Natural language


PageRank

A page can have a high PageRank if there are many pages pointing to it, or if there are some pages that point to it and have a high PageRank.



We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Google usually sets d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
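The formula is missing from the slide text. The PageRank paper listed in the References states it as PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)). The sketch below applies that formula repeatedly to a small graph. The function name `pagerank`, the dict-of-lists graph format, the starting value of 1.0, and the fixed number of iterations are assumptions for illustration, not how Google computes it at scale.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the pages it links to; C(T) is the number of
    distinct outgoing links of T.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    outgoing = {page: set(links.get(page, ())) for page in pages}
    incoming = {page: [] for page in pages}
    for source, targets in outgoing.items():
        for target in targets:
            incoming[target].append(source)

    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Build the new ranks from the previous iteration's values.
        rank = {
            page: (1 - d) + d * sum(
                rank[source] / len(outgoing[source]) for source in incoming[page]
            )
            for page in pages
        }
    return rank


# A and B both cite C; nothing cites A or B.
ranks = pagerank({"A": ["C"], "B": ["C"], "C": []})
for page in sorted(ranks):
    print(page, round(ranks[page], 3))
```

In this form the ranks are not normalized to sum to 1; a page with no incoming links ends up at 1 - d.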



HITS: Hyperlink-Induced Topic Search

A good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.

Authorities: good sources of content
Hubs: good sources of links
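The mutual definition of hubs and authorities can be turned into an iteration: authority scores are summed from the hubs that point to a page, hub scores are summed from the authorities a page points to, and both are rescaled each round. The sketch below is a minimal version under assumed names (`hits`, a dict-of-lists graph) and an assumed fixed iteration count. It runs on whatever graph it is given and leaves out the query-specific subgraph selection that HITS normally starts with.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) score dicts for a link graph.

    links maps each page to the pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    authority = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # A good authority is pointed to by many good hubs.
        authority = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in set(targets):
                authority[target] += hub[source]
        # A good hub points to many good authorities.
        hub = {
            page: sum(authority[target] for target in set(links.get(page, ())))
            for page in pages
        }
        # Rescale both score vectors to unit length so they stay bounded.
        for scores in (authority, hub):
            norm = math.sqrt(sum(value * value for value in scores.values())) or 1.0
            for page in scores:
                scores[page] /= norm
    return hub, authority


toy_links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub_scores, authority_scores = hits(toy_links)
print(sorted(authority_scores, key=authority_scores.get, reverse=True))
```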



Query Expansion

Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing


Evaluation Criteria

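Recall: the proportion of the relevant documents that a query returns

\[
\text{recall} = \frac{\text{Number of items retrieved that are relevant}}{\text{Total number of relevant documents in the database}}
\]

EX:

A query is run against a database that contains 80 relevant documents. Of the items retrieved, only 20 are relevant and 30 are not relevant, so recall = 20/80 = 0.25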


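Precision: how accurate the retrieved results are

\[
\text{precision} = \frac{\text{Number of items retrieved that are relevant}}{\text{Total number of documents retrieved}}
\]

From the example above, 20 + 30 = 50 items were retrieved, so:

precision = 20/50 = 0.4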
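Both measures are easy to compute once the retrieved and relevant documents are available as sets. The sketch below reproduces the worked example; the function name `precision_recall` and the made-up document IDs are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    relevant_retrieved = len(retrieved & relevant)
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall


# 80 relevant documents in the database; the query returns 20 of them
# plus 30 documents that are not relevant.
relevant_docs = {f"rel{i}" for i in range(80)}
retrieved_docs = {f"rel{i}" for i in range(20)} | {f"other{i}" for i in range(30)}

precision, recall = precision_recall(retrieved_docs, relevant_docs)
print(precision, recall)  # 0.4 0.25
```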


Related Work

Robot, Spider, Crawler
  Performance issues
  URL path optimization
  Robot Exclusion
Indexing
  TF: Term Frequency
  IDF: Inverse Document Frequency


The term frequency (TF) is the number of times the word appears in a document divided by the total number of words in the document.

If a document contains 100 total words and the word cow appears 3 times, then the term frequency of the word cow in the document is 0.03 (3/100).

One way of calculating document frequency (DF) is to divide the number of documents that contain the word cow by the total number of documents in the collection.

So if cow appears in 1,000 documents out of a total of 10,000,000, then the document frequency is 0.0001 (1,000/10,000,000).

Reference: tf–idf, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Tf%E2%80%93idf.


The final tf-idf score is then calculated by dividing the term frequency by the document frequency.

For our example, the tf-idf score for cow in the collection would be 300 (0.03/0.0001).

An alternative is to multiply the term frequency by the logarithm of the inverse document frequency. The natural logarithm is commonly used. In this example we would have idf = ln(10,000,000/1,000) = 9.21, so tf-idf = 0.03 * 9.21 ≈ 0.28.

Reference: tf–idf, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
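Both variants of the score can be checked with a few lines of Python. The sketch below works on raw counts; the function names are hypothetical and the numbers are the ones from the cow example.

```python
import math


def term_frequency(term_count, total_words):
    """Occurrences of the term divided by the document length."""
    return term_count / total_words


def document_frequency(docs_with_term, total_docs):
    """Fraction of the documents in the collection that contain the term."""
    return docs_with_term / total_docs


def tf_idf_ratio(tf, df):
    """First variant: term frequency divided by document frequency."""
    return tf / df


def tf_idf_log(tf, docs_with_term, total_docs):
    """Second variant: tf times the natural log of the inverse document frequency."""
    return tf * math.log(total_docs / docs_with_term)


tf = term_frequency(3, 100)                    # 0.03
df = document_frequency(1_000, 10_000_000)     # 0.0001
print(round(tf_idf_ratio(tf, df), 2))
print(round(tf_idf_log(tf, 1_000, 10_000_000), 4))
```

The logarithm damps the effect of very rare terms, which is why the second variant yields a far smaller score (about 0.276) than the plain ratio (300).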


Keyword Search
  Ranking
  Query Expansion
Clustering and Classification
Caching
Information Retrieval
Information Extraction


Discussion

Image Search
Voice Search
Video Search
Multimedia Search
…


About Google …

Google query-serving architecture

Reference: Web Search for a Planet The Google Cluster Architecture


About Google …

Maps - http://maps.google.com/
Product - http://www.google.com/products
Blog - http://blogsearch.google.com/
News - http://news.google.com/
Images - http://images.google.com/
Desktop - http://desktop.google.com/
Code - http://www.google.com/codesearch
Catalogs - http://catalogs.google.com/
More, more, more … http://www.google.com/intl/en/options/index.html


Ajax: A New Approach to Web Applications
http://www.adaptivepath.com/publications/essays/archives/000385.php


Ajax


Web P2P Search Model

Reference: Search Engine-Crawler Symbiosis.


References

Search Engine Strategies 2000, http://www.jupiterevents.com/sew/sf00/index.html

Google Technology, http://www.google.com/technology/pigeonrank.html

Teoma, http://www.teoma.com/

WiseNut, http://www.wisenut.com/

Architectural design and evaluation of an efficient Web-crawling System, http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=CONTENTS&_method=citationSearch&_piikey=S0164121201000917&_version=1&md5=398c9045272cc2249d9323b1418af198

Searching the World Wide Web, http://www.neci.nec.com/~lawrence/papers.html

A Survey On Web Information Retrieval Technologies, http://citeseer.nj.nec.com/336617.html

ASPSeek, http://www.aspseek.org/

Wen-Syan Li, Divyakant Agrawal, "Supporting web query expansion efficiently using multi-granularity indexing and query processing," Data and Knowledge Engineering, Vol. 35 (3), pp. 239-257, 2000.

Web Search – Your Way, http://citeseer.nj.nec.com/glover00web.html

Web Search for a Planet: The Google Cluster Architecture, http://www.computer.org/micro/mi2003/m2022.pdf

The PageRank Citation Ranking: Bringing Order to the Web, http://citeseer.nj.nec.com/368196.html

Structural abstractions of hypertext documents for Web-based retrieval, http://citeseer.ist.psu.edu/140117.html

Crawling the Web, http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf

Effective Web Crawling, http://www.chato.cl/534/article-63160.html#h2_2
