How to Build a Search Engine
樂倍達數位科技股份有限公司 范綱岷 (Kung-Ming Fung...
Date posted: 15-Jan-2016
2 樂倍達數位科技股份有限公司http://www.doubleservice.com/
Outline
  Introduction
  Different Kinds of Search Engine
  Architecture
  Robot, Spider, Crawler
  HTML and HTTP
  Indexing
  Keyword Search
  Evaluation Criteria
  Related Work
  Discussion
  About Google
  Ajax: A New Approach to Web Applications
  References
Introduction
Different Kinds of Search Engine
  Directory Search
  Full-Text Search
    Web pages
    News
    Images
    …
  Meta Search
Number of pages: Directory < Full-text < Meta
Directory Search
  ODP: Open Directory Project, http://dmoz.org/
Full-Text Search
  Google, http://www.google.com/
Meta Search
  MetaCrawler, http://www.metacrawler.com/
  愛幫 (Aibang), http://www.aibang.com/
Simplified control flow of the meta search engine
Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html
Architecture
Simple Architecture (diagram components: WWW; Robot, Spider, Crawler; Indexing; Database; Keyword Search)
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Typical high-level architecture of a Web crawler
Typical anatomy of a large-scale crawler.
Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
High-Level Google Architecture
Reference: A Survey On Web Information Retrieval Technologies
The architecture of a standard meta search engine.
Reference: Web Search – Your Way
Reference: Web Search – Your Way
The architecture of a meta search engine.
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Cyclic architecture for search engines
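Robot, Spider, Crawler
A robot is the software in a search engine that is responsible for collecting data; it is also called a spider or a crawler. It automatically gathers web pages from sites on a configured schedule, usually starting from a set of predefined seed sites and traversing the sites they link to, repeating this collection recursively.
A major performance bottleneck is DNS lookup.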
Goal
Resolve the hostname in the URL to an IP address using DNS (Domain Name System).
Connect a socket to the server and send the request.
Receive the requested page in response.
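The three steps above can be sketched in Python with the standard socket module. This is a minimal illustration, not a production fetcher: the function names build_request and fetch are hypothetical, plain HTTP on port 80 is assumed, and redirects, HTTPS, and error handling are left out.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Split a URL into (host, port, raw HTTP/1.0 request bytes)."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # assumption: plain HTTP
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    return host, port, request.encode("ascii")


def fetch(url, timeout=10.0):
    """Fetch one page: DNS lookup, socket connect, send request, read reply."""
    host, port, request = build_request(url)
    ip_address = socket.gethostbyname(host)            # step 1: DNS lookup
    with socket.create_connection((ip_address, port), timeout=timeout) as sock:
        sock.sendall(request)                          # step 2: send request
        chunks = []
        while True:                                    # step 3: read response
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

Calling fetch("http://www.ntust.edu.tw/") would perform a real network request; because the DNS lookup is a separate blocking call, a crawler that fetches many pages would typically cache its results.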
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Amount of static and dynamic pages at a given depth
Dynamic pages: 5 levels
Static pages: 15 levels
Policy
A selection policy that states which pages to download.
A re-visit policy that states when to check for changes to the pages.
A politeness policy that states how to avoid overloading Web sites.
A parallelization policy that states how to coordinate distributed Web crawlers.
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
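As one possible illustration of a politeness policy, the sketch below enforces a minimum delay between two requests to the same host. The class name PolitenessGate and the fixed per-host delay are assumptions made for this example; the slides do not prescribe a particular mechanism.

```python
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks the earliest time each host may be contacted again."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}   # host -> earliest allowed fetch time

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before fetching url."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url, now):
        """Note that url's host was contacted at time now."""
        host = urlsplit(url).hostname
        self.next_allowed[host] = now + self.min_delay


gate = PolitenessGate(min_delay_seconds=2.0)
gate.record_fetch("http://a.example/x", now=10.0)
```

The clock value is passed in explicitly so the behaviour is easy to check; a real crawler would pass time.monotonic().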
The view of Web Crawler
Reference: Structural abstractions of hypertext documents for Web-based retrieval
Flow of a basic sequential crawler
Reference: Crawling the Web.
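A basic sequential crawler loop can be sketched as follows. This is a simplified illustration under stated assumptions: the frontier is a FIFO queue (which gives breadth-first order), and fetch_links is a hypothetical callback standing in for "download the page and extract its links". A small in-memory graph, TOY_WEB, replaces the real Web so the example runs without a network.

```python
from collections import deque


def crawl(seeds, fetch_links, max_pages=100):
    """Visit pages one at a time, starting from the seed URLs."""
    frontier = deque(seeds)      # URLs waiting to be fetched
    seen = set(seeds)            # URLs already queued or visited
    visited = []                 # URLs in the order they were fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:         # never queue the same URL twice
                seen.add(link)
                frontier.append(link)
    return visited


# Hypothetical toy link graph: page -> pages it links to.
TOY_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
order = crawl(["a"], lambda url: TOY_WEB.get(url, []))
```

Swapping the deque for a priority queue is one way to move from breadth-first ordering to a score-based ordering.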
Reference: Crawling the Web.
A multi-threaded crawler model
HTML and HTTP
HTML: Hypertext Markup Language
HTTP: Hypertext Transfer Protocol
TCP: Transmission Control Protocol
HTTP is built on top of TCP.
Hyperlink: a hyperlink is expressed as an anchor tag with an href attribute.
  <a href="http://www.ntust.edu.tw/">NTUST</a>
URL: Uniform Resource Locator (http://www.ntust.edu.tw/)
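To illustrate how a crawler can pull hyperlinks out of anchor tags, here is a small sketch using Python's standard html.parser module. The class name LinkExtractor, the helper extract_links, and the base URL http://www.example.com/ are assumptions for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Turn relative links into absolute URLs.
                    self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links


links = extract_links(
    '<a href="http://www.ntust.edu.tw/">NTUST</a> <a href="/about">About</a>',
    "http://www.example.com/index.html",
)
```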
Request:
  GET / HTTP/1.0

Response:
  HTTP/1.1 200 OK
  Date: Sat, 13 Jan 2001 09:01:02 GMT
  Server: Apache/1.3.0 (Unix) PHP/3.0.4
  Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
  Accept-Ranges: bytes
  Content-Length: 5437
  Connection: Close
  Content-Type: text/html

  <html><head><title>NTUST</title></head><body>…</body></html>
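A crawler has to separate the status line, the headers, and the body of such a response. The sketch below shows one simple way to do that; the function name parse_response is hypothetical, the sample text is an abbreviated version of the response above, and real responses need more care (repeated headers, chunked encoding, character sets).

```python
def parse_response(raw):
    """Split a raw HTTP response into (status code, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")   # blank line ends the headers
    lines = head.split("\r\n")
    _version, status, _reason = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")    # split at the first colon only
        headers[name.strip().lower()] = value.strip()
    return int(status), headers, body


RAW = (
    "HTTP/1.1 200 OK\r\n"
    "Date: Sat, 13 Jan 2001 09:01:02 GMT\r\n"
    "Content-Length: 5437\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html><head><title>NTUST</title></head><body>...</body></html>"
)
status, headers, body = parse_response(RAW)
```

Header names are lower-cased because HTTP header names are case-insensitive.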
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
For checking a URL
Operation of a crawler
Reference: Crawling a Country Better Strategies than Breadth-First for Web Page Ordering.
Reference: Crawling on the World Wide Web.
Get new URLs
HTML Tag Tree
Reference: Crawling the Web.
HTML Tag Tree
Reference: Crawling the Web.
Strategies
  Breadth-first
  Backlink-count
  Batch-pagerank
  Partial-pagerank
  OPIC (On-line Page Importance Computation)
  Larger-sites-first
Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.
Re-visit policy
Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:
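The binary definition can be written as follows. This is a reconstruction based on the cited Wikipedia article; the symbol F_p(t) is the notation assumed here.

```latex
F_p(t) =
\begin{cases}
1 & \text{if } p \text{ is equal to the local copy at time } t \\
0 & \text{otherwise}
\end{cases}
```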
Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository, at time t is defined as:
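The age measure can be written as follows. This is a reconstruction based on the cited Wikipedia article; the symbol A_p(t) is the notation assumed here.

```latex
A_p(t) =
\begin{cases}
0 & \text{if } p \text{ is not modified at time } t \\
t - \text{modification time of } p & \text{otherwise}
\end{cases}
```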
Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
Robot Exclusion
http://www.robotstxt.org/wc/exclusion.html
  The robots exclusion protocol
  The robots META tag
The Robots Exclusion Protocol - /robots.txt
Where to create the robots.txt file?
Example:
  Site URL              Corresponding robots.txt URL
  http://www.w3.org/    http://www.w3.org/robots.txt
URLs are case sensitive, and "/robots.txt" must be all lower-case.
Examples:
To exclude all robots from the entire server
  User-agent: *
  Disallow: /
To exclude all robots from part of the server
  User-agent: *
  Disallow: /cgi-bin/
  Disallow: /tmp/
  Disallow: /private/
To exclude a single robot
  User-agent: BadBot
  Disallow: /
To allow a single robot
  User-agent: WebCrawler
  Disallow:

  User-agent: *
  Disallow: /
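To exclude all files except one
Because the original protocol has no "Allow" field, move the files to be excluded into a separate directory (here "docs"), leave the one permitted file outside it, and disallow the directory:
  User-agent: *
  Disallow: /~joe/docs/
Alternatively, explicitly disallow each excluded page:
  User-agent: *
  Disallow: /~joe/private.html
  Disallow: /~joe/foo.html
  Disallow: /~joe/bar.html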
A sample robots.txt file
  # AltaVista Search
  User-agent: AltaVista Intranet V2.0 W3C Webreq
  Disallow: /Out-Of-Date/

  # Exclude some access-controlled areas
  User-agent: *
  Disallow: /Team/
  Disallow: /Project/
  Disallow: /Sytems/
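A crawler can check these rules before fetching a URL. The sketch below uses Python's standard urllib.robotparser module on the "allow a single robot" rules; the helper name make_parser and the host www.example.com are assumptions for illustration. The rules are parsed from a string, so no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# Rules for "allow a single robot": WebCrawler may fetch everything,
# every other robot is excluded from the whole server.
ROBOTS_TXT = """\
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /
"""


def make_parser(text):
    """Build a parser from robots.txt content held in a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = make_parser(ROBOTS_TXT)
allowed_for_webcrawler = parser.can_fetch(
    "WebCrawler", "http://www.example.com/index.html")
allowed_for_badbot = parser.can_fetch(
    "BadBot", "http://www.example.com/index.html")
```

In a live crawler, set_url() followed by read() would download the site's /robots.txt instead of parsing a string.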
The Robots META Tag
  <meta name="robots" content="noindex,nofollow">
Like any META tag, it should be placed in the HEAD section of an HTML page:
  <html>
  <head>
  <meta name="robots" content="noindex,nofollow">
  <meta name="description" content="This page ....">
  <title>...</title>
  </head>
  <body>
  ...
Examples:
  <meta name="robots" content="index,follow">
  <meta name="robots" content="noindex,follow">
  <meta name="robots" content="index,nofollow">
  <meta name="robots" content="noindex,nofollow">
INDEX: whether an indexing robot should index the page
FOLLOW: whether a robot is to follow links on the page
The defaults are INDEX and FOLLOW.
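A robot might read these directives as in the sketch below, which uses Python's standard html.parser module. The class name RobotsMetaParser and the helper robots_directives are hypothetical; only the four values listed above are handled, and pages without the tag fall back to the INDEX and FOLLOW defaults.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Reads <meta name="robots" ...>; defaults are INDEX and FOLLOW."""

    def __init__(self):
        super().__init__()
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag != "meta" or (attr.get("name") or "").lower() != "robots":
            return
        for token in (attr.get("content") or "").lower().split(","):
            token = token.strip()
            if token == "noindex":
                self.index = False
            elif token == "index":
                self.index = True
            elif token == "nofollow":
                self.follow = False
            elif token == "follow":
                self.follow = True


def robots_directives(html):
    """Return (may_index, may_follow) for one HTML page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.index, parser.follow


page = '<html><head><meta name="robots" content="noindex,follow"></head></html>'
may_index, may_follow = robots_directives(page)
```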
Indexing
In general, an index is built by storing every word or phrase found in a web page in a keyword index file. Besides the page content itself, keywords that the page author defines in meta tags are often indexed as well.
  TF, IDF, inverted (reverse) index
  Stop words
Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing
(b) is an inverted index of (a)
d1: My(1) care(2) is(3) loss(4) of(5) care(6) with(7) old(8) care(9) done(10).
d2: Your(1) care(2) is(3) gain(4) of(5) care(6) with(7) new(8) care(9) won(10).
tid: token ID; did: document ID; pos: position

  tid    did   pos
  my     1     1
  care   1     2
  is     1     3
  …
  new    2     8
  care   2     9
  won    2     10
Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
d1: My care is loss of care with old care done.
d2: Your care is gain of care with new care won.

Variant 1 (document IDs only):
  my -> d1
  care -> d1; d2
  is -> d1; d2
  loss -> d1
  of -> d1; d2
  with -> d1; d2
  old -> d1
  done -> d1
  your -> d2
  gain -> d2
  new -> d2
  won -> d2

Variant 2 (document IDs with positions):
  my -> d1/1
  care -> d1/2,6,9; d2/2,6,9
  is -> d1/3; d2/3
  loss -> d1/4
  of -> d1/5; d2/5
  with -> d1/7; d2/7
  old -> d1/8
  done -> d1/10
  your -> d2/1
  gain -> d2/4
  new -> d2/8
  won -> d2/10
Two variants of the inverted index data structure.
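Both variants can be built in a few lines of Python. This is an in-memory sketch for the two example documents only; the names tokenize and build_indexes are hypothetical, and tokenization is simply lower-casing and keeping runs of letters.

```python
import re
from collections import defaultdict

DOCS = {
    "d1": "My care is loss of care with old care done.",
    "d2": "Your care is gain of care with new care won.",
}


def tokenize(text):
    """Lower-case the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def build_indexes(docs):
    """Return (token -> set of doc IDs, token -> doc ID -> positions)."""
    doc_index = defaultdict(set)
    pos_index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        # Positions are 1-based, matching the table in the slides.
        for position, token in enumerate(tokenize(text), start=1):
            doc_index[token].add(doc_id)
            pos_index[token][doc_id].append(position)
    return doc_index, pos_index


doc_index, pos_index = build_indexes(DOCS)
```

The positional variant costs more space but makes phrase and proximity queries possible, since the positions of adjacent query terms can be compared.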
The inverted index is usually stored on disk.
It is implemented using a B-tree or a hash table.
Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled.
Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.
Keyword Search
The retrieval software is the key factor in whether a search engine becomes widely used, because users can usually judge a system only by its search speed and its search results, and both fall within the scope of the retrieval software.
  Artificial intelligence, natural language
  Ranking: PageRank, HITS
  Query Expansion
WAIS: Wide Area Information Server (WAIS) is software that builds full-text indexes and provides full-text search over network resources. It consists of three main parts: a server, a client, and a protocol.
Query methods:
  Keyword
  Concept-based
  Fuzzy
  Natural language
PageRank
A page can have a high PageRank if many pages point to it, or if some of the pages that point to it have a high PageRank themselves.
Reference: A Survey On Web Information Retrieval Technologies
We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. Google usually sets d to 0.85. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:
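With the definitions above, the formula can be written as follows. This is a reconstruction using the form given in the original PageRank paper by Brin and Page; PR(·) is the notation assumed here for the PageRank of a page.

```latex
PR(A) = (1 - d) + d \left( \frac{PR(T_1)}{C(T_1)} + \cdots + \frac{PR(T_n)}{C(T_n)} \right)
```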
Reference: A Survey On Web Information Retrieval Technologies
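The recursive definition of PageRank is usually evaluated by iteration: start with an equal score for every page and repeatedly set each page's score to (1 - d) plus d times the sum of score/out-degree over the pages that link to it. The sketch below is a small illustration under stated assumptions: the three-page graph LINKS is made up, the iteration count is fixed, and pages without outgoing links simply pass no score on.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    rank = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_rank = {}
        for page in pages:
            # Each page T that points here contributes PR(T) / C(T).
            incoming = sum(
                rank[src] / len(links[src])
                for src in pages
                if page in links[src]
            )
            new_rank[page] = (1 - d) + d * incoming
        rank = new_rank
    return rank


# Hypothetical toy graph: A -> B, C; B -> C; C -> A.
LINKS = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(LINKS)
```

In this toy graph page C ends up above page B, because C is cited by both A and B while B is cited only by A.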
HITS: Hyperlink-Induced Topic Search
A good hub is a page that points to many good authorities; a good authority is a page that is pointed to by many good hubs.
  Authorities: good sources of content
  Hubs: good sources of links
Reference: A Survey On Web Information Retrieval Technologies
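The mutual reinforcement between hubs and authorities can be sketched as an iteration: a page's authority score is the sum of the hub scores of the pages pointing to it, and its hub score is the sum of the authority scores of the pages it points to, with both vectors normalized each round. The graph LINKS and the fixed iteration count are assumptions for this example.

```python
import math


def hits(links, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to this page.
        auth = {
            page: sum(hub[src] for src, targets in links.items() if page in targets)
            for page in pages
        }
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {page: v / norm for page, v in auth.items()}
        # Hub: sum of authority scores of the pages this page links to.
        hub = {page: sum(auth[t] for t in links.get(page, [])) for page in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {page: v / norm for page, v in hub.items()}
    return hub, auth


# Hypothetical toy graph: two pages that link out, two that are linked to.
LINKS = {"portal": ["news", "docs"], "blog": ["docs"], "news": [], "docs": []}
hub, auth = hits(LINKS)
```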
Query Expansion
Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing
Evaluation Criteria
Recall: the proportion of the relevant documents that a query returns
\[ \text{recall} = \frac{\text{Number of retrieved items that are relevant}}{\text{Total number of relevant documents in the database}} \]
Example:
Suppose the database contains 80 documents relevant to a query, and the query retrieves 50 items, of which 20 are relevant and 30 are not. Then recall = 20/80 = 0.25.
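Precision: the accuracy of the retrieved results
\[ \text{precision} = \frac{\text{Number of retrieved items that are relevant}}{\text{Total number of documents retrieved}} \]
From the example above:
precision = 20/50 = 0.4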
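The two measures from the example (80 relevant documents in the database; 20 relevant and 30 irrelevant items retrieved) can be checked with a few lines of Python. The function and variable names are chosen for this sketch only.

```python
def recall(relevant_retrieved, total_relevant):
    """Share of all relevant documents that the query returned."""
    return relevant_retrieved / total_relevant


def precision(relevant_retrieved, total_retrieved):
    """Share of the returned items that are relevant."""
    return relevant_retrieved / total_retrieved


relevant_retrieved = 20      # retrieved items that are relevant
irrelevant_retrieved = 30    # retrieved items that are not relevant
total_relevant = 80          # relevant documents in the database

r = recall(relevant_retrieved, total_relevant)
p = precision(relevant_retrieved, relevant_retrieved + irrelevant_retrieved)
```

Returning more items can only keep recall the same or raise it, but it usually lowers precision, which is why the two are reported together.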
Related Work
Robot, Spider, Crawler
  Performance issues
  URL path optimization
  Robot Exclusion
Indexing
  TF: Term Frequency
  IDF: Inverse Document Frequency
The term frequency (TF) is the number of times the word appears in a document divided by the number of total words in the document.
If a document contains 100 total words and the word cow appears 3 times, then the term frequency of the word cow in the document is 0.03 (3/100).
One way of calculating document frequency (DF) is to determine how many documents contain the word cow divided by the total number of documents in the collection.
So if cow appears in 1,000 documents out of a total of 10,000,000 then the document frequency is 0.0001 (1000/10,000,000).
Reference: tf–idf, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
The final tf-idf score is then calculated by dividing the term frequency by the document frequency.
For our example, the tf-idf score for cow in the collection would be 300 (0.03/0.0001).
An alternative to this formula is to take the logarithm of the inverse document frequency; the natural logarithm is commonly used. In this example we would have idf = ln(10,000,000/1,000) = 9.21, so tf-idf = 0.03 * 9.21 ≈ 0.28.
Reference: tf–idf, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Tf%E2%80%93idf.
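The arithmetic of the cow example can be reproduced as follows. This is a sketch of the two scoring variants described above, the plain ratio tf/df and the logarithmic form tf * ln(N/n); the function names are chosen for this example.

```python
import math


def term_frequency(term_count, total_words):
    """Occurrences of the term divided by the document length."""
    return term_count / total_words


def document_frequency(docs_with_term, total_docs):
    """Share of documents in the collection that contain the term."""
    return docs_with_term / total_docs


tf = term_frequency(3, 100)                    # cow: 3 of 100 words
df = document_frequency(1_000, 10_000_000)     # cow: 1,000 of 10,000,000 docs

ratio_score = tf / df                          # plain ratio variant
idf = math.log(10_000_000 / 1_000)             # natural log of N / n
log_score = tf * idf                           # logarithmic variant
```

The logarithm compresses the range of the score, so a very rare word does not dominate the ranking as strongly as it does under the plain ratio.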
Keyword Search
  Ranking
  Query Expansion
Clustering and Classification
Caching
Information Retrieval
Information Extraction
Discussion
  Image Search
  Voice Search
  Video Search
  Multimedia Search
  …
About Google …
Google query-serving architecture
Reference: Web Search for a Planet The Google Cluster Architecture
About Google …
Maps - http://maps.google.com/
Product - http://www.google.com/products
Blog - http://blogsearch.google.com/
News - http://news.google.com/
Images - http://images.google.com/
Desktop - http://desktop.google.com/
Code - http://www.google.com/codesearch
Catalogs - http://catalogs.google.com/
More, more, more … http://www.google.com/intl/en/options/index.html
Ajax: A New Approach to Web Applications
http://www.adaptivepath.com/publications/essays/archives/000385.php
Ajax
Web P2P Search Model
Reference: Search Engine-Crawler Symbiosis.
References
Search Engine Strategies 2000, http://www.jupiterevents.com/sew/sf00/index.html
Google Technology, http://www.google.com/technology/pigeonrank.html
Teoma, http://www.teoma.com/
WiseNut, http://www.wisenut.com/
Architectural design and evaluation of an efficient Web-crawling System, http://www.sciencedirect.com/science?_ob=GatewayURL&_origin=CONTENTS&_method=citationSearch&_piikey=S0164121201000917&_version=1&md5=398c9045272cc2249d9323b1418af198
Searching the World Wide Web, http://www.neci.nec.com/~lawrence/papers.html
A Survey On Web Information Retrieval Technologies, http://citeseer.nj.nec.com/336617.html
ASPSeek, http://www.aspseek.org/
Wen-Syan Li, Divyakant Agrawal, "Supporting web query expansion efficiently using multi-granularity indexing and query processing," Data and Knowledge Engineering, Vol. 35 (3), pp. 239-257, 2000.
Web Search – Your Way, http://citeseer.nj.nec.com/glover00web.html
Web Search for a Planet The Google Cluster Architecture, http://www.computer.org/micro/mi2003/m2022.pdf
The PageRank Citation Ranking: Bringing Order to the Web, http://citeseer.nj.nec.com/368196.html
Structural abstractions of hypertext documents for Web-based retrieval, http://citeseer.ist.psu.edu/140117.html
Crawling the Web, http://dollar.biz.uiowa.edu/~pant/Papers/crawling.pdf
Effective Web Crawling, http://www.chato.cl/534/article-63160.html#h2_2