1
Web Search for X-Informatics
Spring Semester 2002 MW 6:00 pm – 7:15 pm Indiana Time
Geoffrey Fox and Bryan Carpenter
PTLIU Laboratory for Community Grids
Informatics (Computer Science, Physics)
Indiana University
Bloomington IN 47404
2
References I
• Here is a set of references addressing Web search as one approach to information retrieval
• http://umiacs.umd.edu/~bonnie/cmsc723-00/CMSC723/CMSC723.ppt
• http://img.cs.man.ac.uk/stevens/workshop/goble.ppt
• http://www.isi.edu/us-uk.gridworkshop/talks/goble_-_grid_ontologies.ppt
• http://www.cs.man.ac.uk/~carole/cs3352.htm has several interesting sub-talks in it:
– http://www.cs.man.ac.uk/~carole/IRintroduction.ppt
– http://www.cs.man.ac.uk/~carole/SearchingtheWeb.ppt
– http://www.cs.man.ac.uk/~carole/IRindexing.ppt
– http://www.cs.man.ac.uk/~carole/metadata.ppt
– http://www.cs.man.ac.uk/~carole/TopicandRDF.ppt
• http://www.isi.edu/us-uk.gridworkshop/talks/jeffery.ppt from the excellent 2001 e-Science meeting
3
References II: Discussion of “real systems”
• General review stressing the “hidden web” (content stored in databases): http://www.press.umich.edu/jep/07-01/bergman.html
• IBM “Clever Project”, Hypersearching the Web: http://www.sciam.com/1999/0699issue/0699raghavan.html
• Google, Anatomy of a Web Search Engine: http://www.stanford.edu/class/cs240/readings/google.pdf
• Peking University Search Engine Group: http://net.cs.pku.edu.cn/~webg/refpaper/papers/jwang-log.pdf
• A huge set of links can be found at: http://net.cs.pku.edu.cn/~webg/refpaper/
4
WebGather: towards quality and scalability of a Web search service
LI Xiaoming • Department of Computer Science and Technology, Peking Univ.
A presentation at Supercomputing 2001 through a constellation site in China
November 15, 2001
This lecture is built around this presentation by Xiaoming Li
We have inserted material from the other cited references
5
How many search engines out there?
• Yahoo!
• AltaVista
• Lycos
• Infoseek
• OpenFind
• Baidu
• Google
• WebGather (天网)
• … there are more than 4000 in the world! (Complete Planet White Paper: http://www.press.umich.edu/jep/07-01/bergman.html)
6
http://e.pku.edu.cn
7
WebGather
8
Our System
9
Agenda
• Importance of Web search service
• Three primary measures/goals of a Web search service
• Our approaches to the goals
• Related work
• Future work
10
Importance of Web Search Service
• Rapid growth of web information
– >40 million Chinese web pages under .cn
• The second most popular application on the web (after email)
• Information access: from address-based to content-based
– who can remember all those URLs?!
– search engine: a first step towards content-based web information access
• There are 4/24 sessions, 15/78 papers at WWW10!
11
How the Web is growing in China

                               Jun. 30, 1998  Dec. 31, 1998  Jun. 30, 1999  Dec. 31, 1999  Jun. 30, 2000
Number of Internet
users in China                 1,175,000      2,100,000      4,000,000      8,900,000      16,900,000
Number of websites
under .CN in China             3,700          5,300          9,906          15,153         27,289

* source: CNNIC
12
Primary Measures/Goals of a Search Engine
• Scale
– volume of indexed web information, ...
• Performance
– “real time” constraint
• Quality
– does the end user like the result returned?

They are at odds with one another!
13
Scale: go for massive!
• the amount of information that is indexed by the system (e.g. number of web pages, number of ftp file entries, etc.)
• the number of websites it covers
• coverage: percentages of the above with respect to the totals out there on the Web
• the number of information formats that are fetched and managed by the system (e.g. html, txt, asp, xml, doc, ppt, pdf, ps, Big5 as well as GB, etc.)
14
Primary measures/goals of a search engine
• Scale
– volume of indexed information, ...
• Performance
– “real time” constraint
• Quality
– does the end user like the result returned?

They are at odds with one another!
15
Performance: “real time” requirement
• fetch the targeted amount of information within a time frame, say 15 days
– otherwise the information may be obsolete
• deliver the results to a query within a time limit (response time), say 1 second
– otherwise users may turn away from your service and never come back!

Larger scale may imply degradation of performance
16
Primary measures/goals of a search engine
• Scale
– volume of information indexed, ...
• Performance
– “real time” constraint
• Quality
– does the end user like the result returned?

They are at odds with one another!
17
Quality: do the users like it?
• recall rate
– can it return information that should be returned?
– high recall rate requires high coverage
• accuracy
– percentage of returned results that are relevant to the query
– high accuracy requires better coverage
• ranking (a special measure of accuracy)
– are the most relevant results appearing before those less relevant?
18
Our approach
• Parallel and distributed processing: reaches for large scale and scalability
• User behavior analysis: yields mechanisms for performance
• Making use of the content of web pages: suggests innovative algorithms for quality
19
Towards scalability
• WebGather 1.0: a million-page level system, in operation since 1998, with a single crawler
• WebGather 2.0: a 30-million-page level system, in operation since 2001, with a fully parallel architecture
– not only boosts the scale
– but also improves performance
– and delivers better quality
20
Architecture of typical search engines

[Diagram] A scheduler drives robots (the crawler), which fetch pages from the Internet into a raw database; an indexer builds the index database from it; a searcher answers queries over the index through the user interface.
21
Architecture of WebGather 2.0

[Diagram] Crawling: crawlers 1..n fetch pages from the Internet over a LAN into per-crawler raw databases. Indexing: indexers 1..n build per-partition index databases, backed by a document database. Searching: searchers 1..n answer queries, helped by a query cache. UI: a single User Interface front-ends the searchers.
22
Towards scalability: main technical issues
• how to assign crawling tasks to multiple crawlers for parallel processing
– granularity of the tasks: URL or IP address?
– maintenance of a task pool: centralized or distributed?
– load balance
– low communication overhead
• dynamic reconfiguration
– in response to failure of crawlers, … (remembering that the crawling process usually takes weeks)
23
Parallel Crawling in WebGather

[Diagram] Schedulers 1..N, each driving its own set of robots inside a crawler, coordinate through a CR (crawler registry).
24
Task Generation and Assignment
• granularity of parallelism: URL or domain name
• task pool: distributed, and tasks are dynamically created and assigned
• a hash function is used for task assignment and load balance (see the sketch below):

H(URL) = F(URL's domain part) mod N
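A minimal sketch of that assignment rule, assuming F can be any stable string hash of the domain part (the talk does not say which hash WebGather uses; CRC32 is a stand-in):

```python
# Hypothetical sketch of H(URL) = F(URL's domain part) mod N.
from urllib.parse import urlparse
import zlib

def assign_crawler(url: str, n_crawlers: int) -> int:
    domain = urlparse(url).netloc.lower()             # domain-name granularity
    return zlib.crc32(domain.encode()) % n_crawlers   # F: CRC32 stand-in

# Every URL on a site maps to the same crawler, which keeps per-site
# crawling local to one machine and communication overhead low.
print(assign_crawler("http://www.a.com/index.html", 16))
print(assign_crawler("http://www.a.com/about.html", 16))  # same crawler
```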
25
Simulation result: load balance

Load-balance measure over crawling time:

Number of crawlers   2 hours    4 hours    6 hours    8 hours    10 hours
2                    0.001454   0.000309   6.18E-05   1.25E-05   8.24E-06
4                    0.00059    0.000375   0.000465   0.000672   0.000568
8                    7.04E-05   4.98E-05   4.18E-05   7.44E-05   5.79E-05
16                   1.57E-05   1.11E-05   1.42E-05   1.51E-05   1.82E-05
26
Simulation result: scalability

[Plot] Speedup vs. number of crawlers (2 through 16) for 2, 4, 8, and 16 main-controllers; speedup axis runs 0 to 12.
27
Experimental result: scalability

[Plot] Speedup vs. number of crawlers.
28
Our Approach
• Parallel and distributed processing: reaches for large scale and scalability
• User behavior analysis: yields mechanisms for performance
• Making use of the content of web pages: suggests innovative algorithms for quality
29
Towards high performance
• “parallel processing”, of course, is a plus to performance, and
• more importantly, user behavior analysis suggests critical mechanisms for improved performance
– a search engine not only maintains web information, but also logs user queries
– a good understanding of the queries gives rise to cache design and performance tuning approaches
30
What do you keep?
• So you gather data from the web, storing
– documents and, more importantly, words extracted from documents
• After removing dull words (stop words), you store the document# for each word together with additional data
– position, and meta-information such as font and enclosing tag (i.e. whether it is in the meta-data section)
• Position is needed to be able to respond to multiword queries with adjacency requirements (see the sketch below)
• There is a lot of important research in the best way to get, store and retrieve information
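A minimal sketch of such a positional inverted index, with an illustrative stop-word list (production structures are far more compact than Python lists):

```python
# word -> [(doc_id, position), ...] postings, enough to answer
# multiword queries with adjacency requirements.
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and", "in", "is"}   # illustrative "dull words"

index = defaultdict(list)

def add_document(doc_id: int, text: str) -> None:
    for pos, word in enumerate(text.lower().split()):
        if word not in STOP_WORDS:
            index[word].append((doc_id, pos))

def adjacent(w1: str, w2: str) -> set:
    """Doc ids where w2 occurs immediately after w1."""
    later = set(index[w2])
    return {d for (d, p) in index[w1] if (d, p + 1) in later}

add_document(1, "web search engines index the web")
add_document(2, "search the web")
print(adjacent("web", "search"))   # {1}: only doc 1 has the exact phrase
```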
31
What Pages should one get?
• A Web search is an Information, not a Knowledge, retrieval engine
• It looks at a set of text pages with certain additional characteristics
– URL, titles, fonts, meta-data
• and matches a query to these pages, returning pages in a certain order
• This order, and the choices made by the user in dealing with this order, can be thought of as “knowledge”
– e.g. the user tries different queries and decides which of the returned set to explore
• People complain about the “number of pages” returned, but I think this is a GOOD model for knowledge and it is good to combine people with the computer
32
How do you Rank Pages
• One can find at least 4 criteria:
• Content of document, i.e. nature of occurrence of query terms in the document (author)
• Nature of links to and from this document; this is characteristic of a Web page (other authors)
– Google and the IBM Clever project emphasized this
• Occurrence of documents in compiled directories (editors)
• Data on what users of the search service have done (users)
33
Document Content Ranking
• Here the TF*IDF method is typical
– TF: Term (query word) Frequency
– IDF: Inverse Document Frequency
• This gives a crude ranking which can be refined by other schemes
• If you have multiple terms then you can add their values of TF*IDF
• The next slides come from the earlier courses by Goble (Manchester) and Maryland cited at the start
34
IR (Information Retrieval) as Clustering
• A query is a vague spec of a set of objects, A
• IR is reduced to the problem of determining which documents are in set A and which ones are not
• Intra-cluster similarity:
– What are the features that best describe the objects in A?
• Inter-cluster dissimilarity:
– What are the features that best distinguish the objects in A from the remaining objects in C?

[Diagram] A (retrieved documents) as a subset of C (document collection), with individual documents marked as x.
35
Index term weighting

Weight(t,d) = tf(t,d) × idf(t)

N          number of documents in the collection
n(t)       number of documents in which term t occurs
idf(t)     inverse document frequency
occ(t,d)   occurrences of term t in document d
tmax       term in document d with the highest occurrence count
tf(t,d)    term frequency of t in document d
36
Index term weighting
• Intra-clustering similarity
– the raw frequency of a term t inside a document d
– a measure of how well the term describes the document contents

Normalised frequency of term t in document d:

    tf(t,d) = occ(t,d) / occ(tmax, d)

• Inter-cluster dissimilarity
– inverse document frequency
– the inverse of the frequency of a term t among the documents in the collection
– terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one

Inverse document frequency:

    idf(t) = log( N / n(t) )

    Weight(t,d) = tf(t,d) × idf(t)
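A short runnable sketch of these formulas on a hypothetical two-document corpus (base-10 log, matching the idf values in the example two slides below):

```python
# Sketch of tf(t,d) = occ(t,d)/occ(tmax,d) and idf(t) = log10(N/n(t)).
import math
from collections import Counter

docs = {
    1: "nuclear fallout contaminated siberia",   # hypothetical toy corpus
    2: "information retrieval is interesting",
}

def tfidf(doc_text: str, all_docs: dict) -> dict:
    occ = Counter(doc_text.split())
    occ_tmax = max(occ.values())                 # highest-occurrence term in d
    N = len(all_docs)
    weights = {}
    for t, c in occ.items():
        n_t = sum(1 for d in all_docs.values() if t in d.split())
        weights[t] = (c / occ_tmax) * math.log10(N / n_t)
    return weights

print(tfidf(docs[1], docs))   # every term unique to doc 1: idf = log10(2)
```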
37
Term weighting schemes
• Best known (term frequency × inverse document frequency):

    weight(t,d) = ( occ(t,d) / occ(tmax, d) ) × log( N / n(t) )

• Variation for query term weights:

    weight(t,q) = ( 0.5 + 0.5 × occ(t,q) / occ(tmax, q) ) × log( N / n(t) )
38
TF*IDF Example

[Table] Raw term frequencies tf(i,j) and the resulting weights w(i,j) for eight terms (nuclear, fallout, siberia, contaminated, interesting, complicated, information, retrieval) across four documents, with idf(i) values 0.301, 0.125, 0.125, 0.125, 0.602, 0.301, 0.000, 0.602 respectively.

• Unweighted query: contaminated retrieval. Result: 2, 3, 1, 4
• Weighted query: contaminated(3) retrieval(1). Result: 1, 3, 2, 4
• IDF-weighted query: contaminated retrieval. Result: 2, 3, 1, 4
39
Document Length Normalization

Let w'(i,j) be the unnormalized weight of term i in document j, and w(i,j) the normalized weight. Then:

    w(i,j) = w'(i,j) / sqrt( Σ_i w'(i,j)² )

• Long documents have an unfair advantage
– They use a lot of terms, so they get more matches than short documents
– And they use the same words repeatedly, so they have much higher term frequencies
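A minimal sketch of that normalization, assuming a document is given as a term-to-weight map:

```python
# Cosine (document length) normalization: divide each weight by the
# Euclidean length of the document's weight vector.
import math

def cosine_normalize(weights: dict) -> dict:
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}

doc = {"contaminated": 0.38, "nuclear": 1.20, "fallout": 0.63}  # illustrative
print(cosine_normalize(doc))   # the result has unit Euclidean length
```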
40
Cosine Normalization Example

[Table] The tf(i,j), idf(i), and w(i,j) values from the TF*IDF example, together with the document lengths 1.70, 0.97, 2.67, 0.87 for documents 1 through 4 and the cosine-normalized weights w(i,j) / length(j).

• Unweighted query: contaminated retrieval. Result: 2, 4, 1, 3 (compare to 2, 3, 1, 4 without normalization)
41
Google Page Rank
• This exploits the nature of links to a page, which is a measure of “citations” for the page
• Page A has pages T1, T2, T3, …, Tn which point to it
• d is a fudge factor (say 0.85)

    PR(A) = (1-d) + d × ( PR(T1)/C(T1) + PR(T2)/C(T2) + … + PR(Tn)/C(Tn) )

• where C(Tk) is the number of links from page Tk
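A compact sketch of that recurrence iterated to a fixed point on a hypothetical three-page link graph (the slide's simple form, without the (1-d)/N variant):

```python
# PR(A) = (1-d) + d * sum over in-neighbors T of PR(T)/C(T).
def pagerank(links: dict, d: float = 0.85, iters: int = 50) -> dict:
    """links maps each page to the list of pages it points to."""
    pr = {p: 1.0 for p in links}
    for _ in range(iters):
        pr = {p: (1 - d) + d * sum(pr[t] / len(links[t])   # PR(T)/C(T)
                                   for t in links if p in links[t])
              for p in links}
    return pr

toy = {"A": ["B"], "B": ["A", "C"], "C": ["A"]}   # hypothetical link graph
print(pagerank(toy))   # A collects rank from both B and C
```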
42
HITS: Hypertext Induced Topic Search
• The ranking scheme depends on the query
• Considers the set of pages that point to, or are pointed at by, pages in the answer S
• Implemented in IBM's Clever prototype
• Scientific American article: http://www.sciam.com/1999/0699issue/0699raghavan.html
43
HITS (2)
• Authorities:
– pages that have many links pointing to them in S
• Hubs:
– pages that have many outgoing links
• Positive two-way feedback:
– better authority pages come from incoming edges from good hubs
– better hub pages come from outgoing edges to good authorities
44
Authorities and Hubs

[Diagram] A link graph with authorities (blue) and hubs (red).
45
HITS two-step iterative process
• Assign initial scores to candidate hubs and authorities on a particular topic in a set of pages S
1. use the current guesses about the authorities to improve the estimates of hubs: locate all the best authorities
2. use the updated hub information to refine the guesses about the authorities: determine where the best hubs point most heavily and call these the good authorities
• Repeat until the scores eventually converge to the principal eigenvector of the link matrix of S, which can then be used to determine the best authorities and hubs

    H(p) = Σ_{u ∈ S, p→u} A(u)
    A(p) = Σ_{v ∈ S, v→p} H(v)
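A sketch of the two-step update on a hypothetical set S, with the usual per-round normalization that makes the scores converge:

```python
# H(p) = sum of A(u) over pages u that p points to;
# A(p) = sum of H(v) over pages v that point to p.
import math

def hits(links: dict, iters: int = 50):
    """links maps each page in S to the list of pages it points to."""
    hub = {p: 1.0 for p in links}
    auth = {p: 1.0 for p in links}
    for _ in range(iters):
        hub = {p: sum(auth[u] for u in links[p]) for p in links}   # step 1
        auth = {p: sum(hub[v] for v in links if p in links[v])     # step 2
                for p in links}
        for scores in (hub, auth):                # normalize to unit length
            norm = math.sqrt(sum(x * x for x in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return auth, hub

toy = {"A": ["B", "C"], "B": ["C"], "C": []}   # hypothetical root set S
print(hits(toy))   # C emerges as the authority, A as the hub
```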
46
Cybercommunities

[Diagram] HITS clustering the web into communities.
47
Google vs Clever
• Google
1. assigns initial rankings and retains them independently of any queries; this enables faster response
2. looks only in the forward direction, from link to link
• Clever
1. assembles a different root set for each search term and then prioritizes those pages in the context of that particular query
2. also looks backward from an authoritative page to see what locations are pointing there; humans are innately motivated to create hub-like content expressing their expertise on specific topics
48
Peking University user behavior analysis
• taking 3 months' worth of real user queries (about 1 million queries)
• each query consists of <keywords, time, IP address, …>
• keywords distribution: we observe that high-frequency keywords dominate
• grouping the queries in 1000s and examining the difference between consecutive groups: we observe a quite stable process (the difference is quite small)
• doing the above for different group sizes: we observe a strong self-similarity structure
49
Distribution of user queries
• Only 160,000 different keywords in 960,000 queries
• 20% of the high-frequency queries account for 80% of the total visits:

    Y = ( Σ_{i=1..x·m/100} C_i ) / ( Σ_{j=1..m} C_j )

where C_i is the visit count of the i-th most frequent query, m is the number of distinct queries, and Y is the share of all searching covered by the top x% of query terms.

[Plot] Queries (as share of searching time) vs. terms (query words as a fraction), marking the 0.2 / 0.8 point.
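A tiny sketch computing that statistic from a list of per-query visit counts (toy numbers, purely illustrative):

```python
# Share of all visits covered by the top x% most frequent queries.
def top_share(counts: list, x_percent: float) -> float:
    counts = sorted(counts, reverse=True)        # C_1 >= C_2 >= ... >= C_m
    k = int(len(counts) * x_percent / 100)       # top x% of distinct queries
    return sum(counts[:k]) / sum(counts)

toy_counts = [400, 200, 100, 50, 25, 10, 5, 5, 3, 2]   # hypothetical visits
print(top_share(toy_counts, 20))   # 0.75: the 20/80 shape on toy data
```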
50
Towards high performance
• Query caching improves system performance dramatically
– more than 70% of user queries can be satisfied in less than 1 millisecond
– almost all queries are answered in 1 second
• User behavior may also be used for other purposes
– evaluation of various ranking metrics, e.g., the link popularity and replica popularity of a URL have a positive influence on its importance
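Because high-frequency queries dominate (previous slides), even a simple LRU cache in front of the searcher absorbs most of the load. A minimal sketch, with run_full_search as a hypothetical stand-in for the real index lookup:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)            # repeat queries never touch the index
def cached_search(query: str) -> tuple:
    return run_full_search(query)

def run_full_search(query: str) -> tuple:
    # Hypothetical stand-in for the searcher over the index database.
    return ("results for", query)

print(cached_search("mp3"))            # miss: runs the full search
print(cached_search("mp3"))            # hit: served from the cache
```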
51
Our approach
• Parallel and distributed processing: reaches for large scale and scalability
• User behavior analysis: yields mechanisms for performance
• Making use of the content of web pages: suggests innovative algorithms for quality
52
Towards good quality
• Do not miss those important pages: keep the recall rate high
• Clever algorithm for removing near-replicas: better accuracy
• New metrics to evaluate pages' relevance: improved ranking
– anchor-text based, instead of PageRank based
53
Fetch the “important” pages first
• crawling is normally done within a time frame, thus not missing important pages is a practical issue for guaranteeing good search quality later on
• besides picking “good” seed URLs, we use a formula to determine the importance of a page
54
Removing near-replicas
• vector based vs. fingerprint based

[Diagram] Url1 (http://www.a.com/index.html) with term frequencies (词频) computer 45, network 33, server 9, …; Url2 (http://www.b.com/gbindex.html) with term frequencies computer 45, network 30, server 16, …. The two pages are plotted as vectors along the computer/network/server axes and judged near-replicas when their difference is small relative to the vector sizes, e.g. 3/(a+b) < 0.01 in the sketch.
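One plausible reading of that test, sketched below; the exact WebGather measure is not given, so this compares the total term-count difference against the combined size, the 3/(a+b) shape in the diagram:

```python
# Vector-based near-replica test over term-frequency maps.
def near_replicas(tf1: dict, tf2: dict, threshold: float = 0.01) -> bool:
    terms = set(tf1) | set(tf2)
    diff = sum(abs(tf1.get(t, 0) - tf2.get(t, 0)) for t in terms)
    size = sum(tf1.values()) + sum(tf2.values())     # a + b in the diagram
    return diff / size < threshold

url1 = {"computer": 45, "network": 33, "server": 9}   # truncated slide data
url2 = {"computer": 45, "network": 30, "server": 16}
print(near_replicas(url1, url2))   # False on these truncated vectors;
                                   # whole pages compare many more terms
```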
55
Related work
• Harvest
– good academic ideas, but complicated design, not sustained
• Google
– the most famous search engine in the world at the moment, but little exposure of the technology used after 1998 (Brin, 1998, WWW-7)
– character-based, instead of word-based, Chinese processing?
– more hardware than necessary (10,000 PCs were reported)?
56
17 Most Popular Day Time Queries

[Chart] Top 17 query terms on WebGather on ordinary days (845,113 queries over half a year): travel agency (旅行社), hotels (宾馆饭店), search (搜索), police (警察), lover (情人), travel and transportation (旅游交通), download (下载), sex, love at first sight (一见钟情), mp3, pictures (图片), oicq, Beijing (北京), pornography (色情), people (人民), proxy server (代理服务器), movies (电影); y-axis: visit count, 0 to 140,000.
57
10 Most Popular Day Time Queries: 70%

[Pie chart] Share of total query volume taken by the top ten terms: travel agency (旅行社) 16%, hotels (宾馆饭店) 15%, search (搜索) 9%, police (警察) 7%, lover (情人) 6%, travel and transportation (旅游交通) 6%, download (下载) 4%, sex 4%, love at first sight (一见钟情) 2%, mp3 1%, other (其它) 30%.
58
11 Most Popular Leisure Time Queries

[Chart] Top holiday query terms (245,704 queries over half a year): sex and romance (色与情), law (法律), download (下载), entertainment (娱乐), search (搜索), pictures (图片), Beijing (北京), oicq, proxy server (代理服务器), entropy (熵), cloning and ethics (克隆与伦理); y-axis: 0 to 50,000.
59
edu access vs non-edu access: we may have a lot to say about the curve!

[Line chart] Daily query counts from the education network (教育网) and the non-education network (非教育网), with the total (总计), from 2000-11-2 to 2001-4-5; y-axis 0 to 160,000.