Data.Mining.C.8(Ii).Web Mining 570802461
![Page 1: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/1.jpg)
1
Chapter 8.
Mining Complex Types of Data (II)
--Web Mining--
![Page 2: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/2.jpg)
2
Chapter 8. Mining Complex Types of Data (II)
• Introduction to Web mining
• Web Structure Analysis
• PageRank
• HITS Approach
• Summary
![Page 3: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/3.jpg)
3
Mining the World-Wide Web
• The WWW is a huge, widely distributed, global information service centre for
– Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
– Rich and dynamic hyperlink information
– Access and usage information for Web pages
• The WWW provides rich sources for data/text mining
• Challenges
– Too huge for effective data/text warehousing and mining
– Too complex and heterogeneous: no standards and structure
![Page 4: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/4.jpg)
4
Web Mining: A Challenging Task
• Research areas
– Web access patterns
– Web structures and regularity
– Web contents
• Problems
– The “abundance” problem
– Limited coverage of the Web: hidden Web sources, majority of data in DBMS
– Limited query interface based on keyword-oriented search
– Etc.
![Page 5: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/5.jpg)
5
Web Mining Taxonomy
• Web Mining
– Web Structure Mining
– Web Content Mining: Web Page Content Mining, Search Result Mining
– Web Usage Mining: General Access Pattern Tracking, Customized Usage Tracking
![Page 6: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/6.jpg)
6
Mining the World-Wide Web
Web Mining taxonomy: Web Content Mining, Web Page Content Mining
• WebLog (Lakshmanan et al. 1996), WebOQL (Mendelzon et al. 1998) …: can identify information within given Web pages
• Ahoy! (Etzioni et al. 1997): uses heuristics to distinguish personal home pages from other Web pages
• ShopBot (Etzioni et al. 1997): looks for product prices within Web pages
![Page 7: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/7.jpg)
7
Mining the World-Wide Web
Web Mining taxonomy: Web Content Mining, Search Result Mining
• Search engine result summarization
• Clustering search results (Leouski and Croft, 1996; Zamir and Etzioni, 1997): categorizes documents using phrases in titles
![Page 8: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/8.jpg)
8
Mining the World-Wide Web
Web Mining taxonomy: Web Usage Mining, General Access Pattern Tracking
• Web Log Mining (Zaïane, Xin and Han, 1998)
– Uses DM techniques to understand general access patterns and trends.
– Can shed light on better structure and grouping of resource providers.
![Page 9: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/9.jpg)
9
Mining the World-Wide Web
Web Mining taxonomy: Web Usage Mining, Customized Usage Tracking
• Adaptive Sites (Perkowitz and Etzioni, 1997)
– Analyses the access patterns of one user at a time.
– The Web site restructures itself automatically by learning from user access patterns.
![Page 10: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/10.jpg)
10
Mining the World-Wide Web
Web Mining taxonomy: Web Structure Mining
• Web structure mining using links
– HITS (Kleinberg, 1998)
– PageRank (Sergey Brin and Larry Page, 1998)
• The large amount of Web linkage information provides rich information about the relevance, the quality and the structure of the Web’s content.
• Use interconnections between Web pages to give weight to pages.
![Page 11: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/11.jpg)
11
Chapter 8. Mining Complex Types of Data (II)
• Introduction to Web mining
• Web Structure Analysis
• PageRank
• HITS Approach
• Summary
![Page 12: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/12.jpg)
12
Introduction
• Early search engines mainly compared the similarity of the query and the indexed pages.
– That is, they used information retrieval methods such as cosine similarity, ...
• From 1996, it became clear that similarity alone was no longer sufficient.
– The number of pages grew rapidly in the mid-to-late 1990s.
• For a single query, Google may estimate about 10 million relevant pages.
• How to choose only 30-40 pages and rank them suitably to present to the user?
![Page 13: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/13.jpg)
13
Web Structure Analysis
• Starting around 1996, researchers began to work on the problem. They resorted to hyperlinks.
• Web pages are connected through hyperlinks, which carry important information.
– Some hyperlinks organize information within the same site.
– Other hyperlinks point to pages on other Web sites. Such out-going hyperlinks often indicate an implicit conveyance of authority to the pages being pointed to.
• Pages that are pointed to by many other pages are likely to contain authoritative information.
![Page 14: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/14.jpg)
14
Web Structure Analysis
• During 1997-1998, the two most influential hyperlink-based search algorithms, PageRank and HITS, were reported.
• Both algorithms exploit the hyperlinks of the Web to rank pages.
– PageRank: Sergey Brin and Larry Page, PhD students at Stanford University, at the Seventh International World Wide Web Conference (WWW) in April 1998.
– HITS: Jon Kleinberg (Cornell University), at the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, January 1998.
![Page 15: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/15.jpg)
15
Chapter 8. Mining Complex Types of Data (II)
• Introduction to Web mining
• Web Structure Analysis
• PageRank
• HITS Approach
• Summary
![Page 16: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/16.jpg)
16
Background: Social Network Analysis
• Social network analysis: the study of social entities (people in an organization), called actors, and their interactions and relationships.
• Interactions/relationships: represented by a network or graph,
– each vertex (or node): an actor
– each link: a relationship.
• From the network, we can study
– properties of its structure
– for each actor: its role, position and prestige
• Communities: various kinds of sub-graphs, formed by groups of actors.
![Page 17: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/17.jpg)
17
Social Network and the Web
• Web: viewed as a virtual social network
– each page: an actor
– each hyperlink: a relationship.
• Results from social network analysis can be adapted and extended for use in the Web context.
• Two types of social network analysis, centrality and prestige, are closely related to hyperlink analysis and search on the Web.
![Page 18: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/18.jpg)
18
Centrality
• An actor with extensive contacts (links) or communications with many other actors in the organization is considered more important than an actor with relatively fewer contacts.
• Central actor: one involved in many links.
![Page 19: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/19.jpg)
19
Measure of Centrality
• Network: viewed as a directed graph
• In-links of actor i: links pointing to i
• Out-links of actor i: links pointing out from i
• The simple degree centrality of actor i:
$$C(i) = \frac{d_{out}(i)}{n-1}$$
where $d_{out}(i)$ is the number of out-links of actor i and n is the total number of actors in the network.
• Dividing by n-1 standardizes the centrality value into the range [0, 1].
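A minimal Python sketch of the degree centrality defined on this slide. The function name `degree_centrality`, the 0-based actor numbering and the small edge list are hypothetical, chosen only for illustration.

```python
def degree_centrality(n, edges):
    """Return C(i) = d_out(i) / (n - 1) for actors 0..n-1.

    `edges` is a list of directed (source, target) pairs; duplicates are ignored.
    """
    out_links = [set() for _ in range(n)]
    for src, dst in edges:
        if src != dst:              # a self-link is not a contact with another actor
            out_links[src].add(dst)
    return [len(targets) / (n - 1) for targets in out_links]


# Hypothetical 4-actor network: actor 0 links to every other actor.
edges = [(0, 1), (0, 2), (0, 3), (1, 2)]
centrality = degree_centrality(4, edges)
print(centrality)
```

Actor 0 reaches all n-1 other actors, so its centrality is the maximum value 1; actors without out-links get 0.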
![Page 20: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/20.jpg)
20
Prestige
• Prestige: a more refined measure of the prominence of an actor than centrality.
• Prestigious actor: one who is the object of extensive ties as a recipient; prestige is computed using only in-links.
• Difference between centrality and prestige:
– centrality focuses on out-links
– prestige focuses on in-links
![Page 21: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/21.jpg)
21
Measure of Prestige
• In-links of actor i: links pointing to i
• The simple degree prestige of actor i:
$$P(i) = \frac{d_{in}(i)}{n-1}$$
where $d_{in}(i)$ is the number of in-links of actor i and n is the total number of actors in the network.
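The same idea can be sketched for degree prestige, counting in-links instead of out-links. As before, the function name `degree_prestige` and the toy edge list are assumptions made for the example.

```python
def degree_prestige(n, edges):
    """Return P(i) = d_in(i) / (n - 1) for actors 0..n-1."""
    in_links = [set() for _ in range(n)]
    for src, dst in edges:
        if src != dst:              # ignore self-links
            in_links[dst].add(src)
    return [len(sources) / (n - 1) for sources in in_links]


# Hypothetical network: every other actor points to actor 2.
edges = [(0, 2), (1, 2), (3, 2), (2, 0)]
prestige = degree_prestige(4, edges)
print(prestige)
```

Actor 2 is chosen by all three other actors and therefore has prestige 1, even though it has only one out-link.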
![Page 22: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/22.jpg)
22
Rank Prestige
• Rank prestige forms the basis of most Web page link analysis algorithms, such as PageRank.
• In the real world, a person i chosen by an important person is more prestigious than one chosen by a less important person.
– For example, a vote for a person from a company CEO is much more important than a vote from a worker.
• If one’s circle of influence is full of prestigious actors, then one’s own prestige is also high.
– Thus one’s prestige is affected by the ranks or statuses of the involved actors.
![Page 23: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/23.jpg)
23
Measure of Rank Prestige
• Rank prestige PRank(i): a linear combination of links that point to i:
$$PRank(i) = A_{1i}\,PRank(1) + A_{2i}\,PRank(2) + \dots + A_{ni}\,PRank(n)$$
where $A_{ji} = 1$ if j points to i and 0 otherwise.
• We have n equations for n actors; with P the column vector of the n rank prestige values, they can be written as
$$P = A^T P$$
• A: the adjacency matrix of the network (graph), where $A_{ij} = 1$ if i points to j and 0 otherwise
![Page 24: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/24.jpg)
24
Intuitive Idea for Rank Prestige
• A hyperlink from a page to another page is an implicit conveyance of authority to the target page.
– The more in-links a page i receives, the more prestige page i has.
• Pages that point to page i also have their own prestige scores.
– A page of higher prestige pointing to i is more important than a page of lower prestige pointing to i.
– In other words, a page is important if it is pointed to by other important pages.
• This is exactly the idea of rank prestige in social networks.
![Page 25: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/25.jpg)
25
PageRank Algorithm
• According to rank prestige, the importance of page i (i’s PageRank score) is the sum of the PageRank scores of all pages that point to i, each divided by the number of out-links of the pointing page.
• The Web as a directed graph G = (V, E). Let the total number of pages be n. The PageRank score of page i (denoted by P(i)) is defined by:
$$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$
where $O_j$ is the number of out-links of page j.
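To make the definition concrete, here is a small Python sketch that applies the formula once to a given score vector. The function name `pagerank_update`, the 0-based page numbering and the three-page graph are hypothetical.

```python
def pagerank_update(n, edges, P):
    """Apply P(i) = sum over (j, i) in E of P(j) / O_j once.

    `edges` holds directed (j, i) pairs without duplicates; `P` is the current score list.
    """
    out_degree = [0] * n
    for j, _ in edges:
        out_degree[j] += 1          # O_j: number of out-links of page j
    new_P = [0.0] * n
    for j, i in edges:
        new_P[i] += P[j] / out_degree[j]
    return new_P


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
edges = [(0, 1), (0, 2), (1, 2), (2, 0)]
scores = pagerank_update(3, edges, [1 / 3, 1 / 3, 1 / 3])
print(scores)
```

Page 0 splits its score between pages 1 and 2, so page 2, which also receives the full score of page 1, ends up with the largest value after this update.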
![Page 26: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/26.jpg)
26
Matrix Notation
• Let P be an n-dimensional column vector of PageRank values, i.e., $P = (P(1), P(2), \dots, P(n))^T$.
• Let A be the adjacency matrix of our graph with
$$A_{ij} = \begin{cases} \dfrac{1}{O_i} & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$$
Here we use $O_i$ to denote the number of out-links of node i.
• Each transition probability is $1/O_i$ if we assume the Web surfer will click the hyperlinks in page i uniformly at random.
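The matrix A defined here can be built directly from an edge list. This is a sketch under the assumption that pages are numbered from 0 and that the edge list contains no duplicate links; `transition_matrix` is a hypothetical helper name, and exact fractions are used only to keep the entries readable.

```python
from fractions import Fraction


def transition_matrix(n, edges):
    """Build A with A[i][j] = 1 / O_i if (i, j) is a link, and 0 otherwise."""
    out_degree = [0] * n
    for i, _ in edges:
        out_degree[i] += 1
    A = [[Fraction(0)] * n for _ in range(n)]
    for i, j in edges:
        A[i][j] = Fraction(1, out_degree[i])
    return A


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = transition_matrix(3, [(0, 1), (0, 2), (1, 2), (2, 0)])
for row in A:
    print([str(x) for x in row])
```

Every page in this toy graph has at least one out-link, so each row of A sums to 1; a page without out-links would produce an all-zero row.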
![Page 27: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/27.jpg)
27
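Transition Probability Matrix
• Let A be the state transition probability matrix
$$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{pmatrix}$$
• $A_{ij}$: the transition probability that the surfer in state i (page i) will move to state j (page j).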
![Page 28: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/28.jpg)
28
Let us start…
• Given an initial probability distribution vector that a surfer is at each state (or page)
– $p_0 = (p_0(1), p_0(2), \dots, p_0(n))^T$ (a column vector) and
– an n × n transition probability matrix A,
we have
$$\sum_{i=1}^{n} p_0(i) = 1$$
$$\sum_{j=1}^{n} A_{ij} = 1$$
![Page 29: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/29.jpg)
29
Random Surfer
• State transition:
$$p_1(j) = \sum_{i=1}^{n} A_{ij}(1)\, p_0(i)$$
• where $A_{ij}(1)$ is the probability of going from i to j after 1 transition. In matrix form we can write
$$P_1 = A^T P_0$$
• In general, the probability distribution after k steps/transitions:
$$P_k = A^T P_{k-1}$$
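The state transition can be traced step by step in a few lines of Python. The helper name `surf` and the three-page transition matrix are hypothetical; the surfer is assumed to start on page 0 with probability 1.

```python
def surf(A, p0, k):
    """Return p_k, obtained by applying p_k = A^T p_{k-1} k times to p_0."""
    n = len(A)
    p = list(p0)
    for _ in range(k):
        # p_new(j) = sum over i of A[i][j] * p(i)
        p = [sum(A[i][j] * p[i] for i in range(n)) for j in range(n)]
    return p


# Hypothetical transition matrix for the graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 0.5, 0.5],
     [0, 0, 1],
     [1, 0, 0]]
p1 = surf(A, [1, 0, 0], 1)
p2 = surf(A, [1, 0, 0], 2)
print(p1, p2)
```

After one step the surfer is on page 1 or page 2 with equal probability; after two steps the probability mass sits on pages 0 and 2.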
![Page 30: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/30.jpg)
30
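An Example Web Hyperlink Graph
$$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$$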
![Page 31: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/31.jpg)
31
Improved PageRank
• At a page, the random surfer has two options
– With probability d, he randomly chooses an out-link to follow.
– With probability 1-d, he jumps to a random page
• Improved model:
$$P = \left( (1-d)\frac{E}{n} + d A^T \right) P$$
where E is $ee^T$ (e is a column vector of all 1’s) and thus E is an n × n square matrix of all 1’s.
![Page 32: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/32.jpg)
32
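Follow the Above Example
$$(1-d)\frac{E}{n} + d A^T = \begin{pmatrix} 1/60 & 7/15 & 1/60 & 1/60 & 1/6 & 1/60 \\ 7/15 & 1/60 & 11/12 & 1/60 & 1/6 & 1/60 \\ 7/15 & 7/15 & 1/60 & 19/60 & 1/6 & 1/60 \\ 1/60 & 1/60 & 1/60 & 1/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 1/60 \end{pmatrix}$$
The entries correspond to d = 0.9, with the all-zero row 5 of A replaced by 1/6 in every entry so that each row of A sums to 1.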
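The entries of this matrix can be checked with exact fractions. The sketch below rests on two assumptions that are inferred from the entries rather than stated on the slide: d = 0.9, and the all-zero row of the example matrix A (page 5 has no out-links) is replaced by 1/n in every position before the matrix is formed.

```python
from fractions import Fraction as F

# Example transition matrix A; row 5 (index 4) is all zero.
A = [
    [0, F(1, 2), F(1, 2), 0, 0, 0],
    [F(1, 2), 0, F(1, 2), 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, F(1, 3), 0, F(1, 3), F(1, 3)],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, F(1, 2), F(1, 2), 0],
]
n = len(A)
d = F(9, 10)                        # assumed damping factor

# Assumption: an all-zero row is replaced by the uniform distribution 1/n.
A_fixed = [row if any(row) else [F(1, n)] * n for row in A]

# M = (1 - d) E / n + d A^T, where E is the all-ones matrix.
M = [[(1 - d) / n + d * A_fixed[j][i] for j in range(n)] for i in range(n)]

for row in M:
    print("  ".join(str(x) for x in row))
```

Each column of M sums to 1, which is what allows M to be read as the transposed transition matrix of the random surfer with jumps.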
![Page 33: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/33.jpg)
33
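Final PageRank Algorithm
• Scaling P so that $e^T P = n$ gives $\frac{E}{n} P = e$, and the improved model becomes
$$P = (1-d)\,e + d A^T P$$
• PageRank for each page i is
$$P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji}\, P(j)$$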
![Page 34: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/34.jpg)
34
Final PageRank Algorithm
• This is equivalent to the formula given in the original PageRank algorithm:
$$P(i) = (1-d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$$
• The parameter d is called the damping factor and can be set to a value between 0 and 1. d = 0.85 was used in the original PageRank algorithm.
![Page 35: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/35.jpg)
35
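Compute PageRank
• Use the iteration method
PageRank-Iterate(G)
  $P_0 \leftarrow e/n$
  $k \leftarrow 0$
  repeat
    $k \leftarrow k + 1$
    $P_k \leftarrow (1-d)\,e + d A^T P_{k-1}$
  until $\lVert P_k - P_{k-1} \rVert_1 < \varepsilon$
  return $P_k$
where ε is a small pre-specified convergence threshold.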
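The iteration can be sketched in plain Python as follows. The function name `pagerank`, the default threshold `eps`, the safety cap `max_iter` and the three-page transition matrix are assumptions for illustration; A is expected to have rows that sum to 1.

```python
def pagerank(A, d=0.85, eps=1e-10, max_iter=1000):
    """Iterate P_k = (1 - d) e + d A^T P_{k-1} until the L1 change drops below eps."""
    n = len(A)
    P = [1.0 / n] * n                       # P_0 = e / n
    for _ in range(max_iter):
        new_P = [(1 - d) + d * sum(A[j][i] * P[j] for j in range(n))
                 for i in range(n)]
        change = sum(abs(x - y) for x, y in zip(new_P, P))
        P = new_P
        if change < eps:
            break
    return P


# Hypothetical transition matrix for the graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
A = [[0, 0.5, 0.5],
     [0, 0, 1],
     [1, 0, 0]]
ranks = pagerank(A)
print(ranks)
```

With this form of the update the scores converge to values that sum to n rather than to 1, which is consistent with the (1-d)e term in the formula.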
![Page 36: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/36.jpg)
36
Advantages of PageRank
• PageRank is a global measure and is query independent.
– PageRank values of all the pages are computed and saved off-line rather than at query time.
• Criticism: query independence. It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.
• Nie, et al. Topical Link Analysis for Web Search, SIGIR 2006
![Page 37: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/37.jpg)
37
Chapter 8. Mining Complex Types of Data (II)
• Introduction to Web mining
• Web Structure Analysis
• PageRank
• HITS Approach
• Summary
![Page 38: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/38.jpg)
38
Another Aim: Web Structure Analysis
• Hyperlinks are also useful for finding Web communities.
– A Web community is a cluster of densely linked pages representing a group of people with a special interest.
• Beyond explicit hyperlinks on the Web, links in other contexts are useful too, e.g.,
– for discovering communities of named entities (e.g., people and organizations)
– for analyzing social phenomena in emails.
![Page 39: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/39.jpg)
39
Background: Co-citation and Bibliographic Coupling
• A typical area of research concerned with links is citation analysis of scholarly publications.
– A scholarly publication cites related prior work to acknowledge the origins of some ideas and to compare the new proposal with existing work.
• When a paper cites another paper, a relationship is established between the publications.
• We discuss two types of citation analysis, co-citation and bibliographic coupling. The HITS algorithm is related to these two types of analysis.
![Page 40: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/40.jpg)
40
Co-citation
• If papers i and j are both cited by paper k, then they may be related in some sense to one another.
• The more papers they are co-cited by, the stronger their relationship is.
Fig. Paper i and paper j are co-cited by paper k
![Page 41: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/41.jpg)
41
Co-citation
• Let L be the citation matrix. Each cell of the matrix is defined as follows:
– $L_{ij} = 1$ if paper i cites paper j, and 0 otherwise.
• Co-citation (denoted by $C_{ij}$) is a similarity measure defined as the number of papers that co-cite i and j,
$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj}$$
• A square matrix C can be formed with $C_{ij}$, and it is called the co-citation matrix.
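A short Python sketch of the co-citation count. The function name `cocitation` and the four-paper citation matrix are hypothetical; papers are numbered from 0.

```python
def cocitation(L):
    """Return C with C[i][j] = sum over k of L[k][i] * L[k][j]."""
    n = len(L)
    return [[sum(L[k][i] * L[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citation matrix: papers 2 and 3 both cite papers 0 and 1.
L = [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
C = cocitation(L)
print(C)
```

Papers 0 and 1 are co-cited by two papers, so C[0][1] is 2; the diagonal entry C[i][i] is simply the number of papers that cite paper i.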
![Page 42: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/42.jpg)
42
Bibliographic Coupling
• Bibliographic coupling operates on a similar principle.
• Bibliographic coupling links papers that cite the same articles
– if papers i and j both cite paper k, they may be related.
• The more papers they both cite, the stronger their similarity is.
Fig. Both paper i and paper j cite paper k
![Page 43: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/43.jpg)
43
Bibliographic Coupling
• $B_{ij}$ represents the number of papers that are cited by both paper i and paper j
$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk}$$
• The bibliographic coupling matrix B formed with $B_{ij}$ is symmetric and is regarded as a similarity measure of two papers in clustering
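The coupling count differs from co-citation only in which index is summed over. A minimal sketch, again with a hypothetical function name and a hypothetical four-paper citation matrix:

```python
def bibliographic_coupling(L):
    """Return B with B[i][j] = sum over k of L[i][k] * L[j][k]."""
    n = len(L)
    return [[sum(L[i][k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citation matrix: papers 2 and 3 both cite papers 0 and 1.
L = [[0, 0, 0, 0],
     [0, 0, 0, 0],
     [1, 1, 0, 0],
     [1, 1, 0, 0]]
B = bibliographic_coupling(L)
print(B)
```

Papers 2 and 3 share two references, so B[2][3] is 2, while papers 0 and 1 cite nothing and have coupling 0.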
![Page 44: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/44.jpg)
44
HITS
• HITS stands for Hypertext Induced Topic Search.
• HITS is search-query dependent; it is also used for finding Web communities.
• When the user issues a search query,
– HITS first expands the list of relevant pages returned by a search engine and
– then produces two rankings of the expanded set of pages, i.e., authority pages and hub pages.
![Page 45: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/45.jpg)
45
Authorities and Hubs
• Authority: Roughly, an authority is a page with many in-links.
– The idea is that the page may have good or authoritative content on some topic and
– thus many people trust it and link to it.
• Hub: A hub is a page with many out-links.
– The page serves as an organizer of the information on a particular topic and
– points to many good authority pages on the topic.
![Page 46: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/46.jpg)
46
Mining the Web's Link Structures
• Finding authoritative Web pages
– Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
• Hyperlinks can be used to infer the notion of authority
– A hyperlink pointing to another Web page can be considered the author’s endorsement of that page
• Hub pages: Web pages that provide collections of links to authorities
![Page 47: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/47.jpg)
47
Mining the Web's Link Structures
• Mutually reinforcing relationship:
– a good hub is a page that points to many good authorities;
– a good authority is a page that is pointed to by many good hubs
Fig. Hub pages (yellow) point to authority pages (red)
![Page 48: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/48.jpg)
48
Define Authority and Hub Weight for Each Page
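For the page p: authority weight $a_p$; hub weight $h_p$
$$a_p = \sum_{q \to p} h_q$$
$$h_p = \sum_{p \to q} a_q$$
Fig. Pages q1, q2, q3 point to page p: a[p] := sum of h[q], for all q with q → p
Fig. Page p points to pages q1, q2, q3: h[p] := sum of a[q], for all q with p → q
Better authority (hub) pages have larger a (h) values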
![Page 49: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/49.jpg)
49
The HITS Algorithm
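• HITS works on the pages in S (Web space), and assigns every page in S an authority score and a hub score.
• Let the number of pages in S be n.
• We again use G = (V, E) to denote the hyperlink graph of S.
• We use L to denote the adjacency matrix of the graph.
$$L_{ij} = \begin{cases} 1 & \text{if } (d_i, d_j) \in E \\ 0 & \text{otherwise} \end{cases}$$
Fig. Example graph with pages d1, d2, d3, d4 and its adjacency matrix
$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$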
![Page 50: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/50.jpg)
50
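The HITS Algorithm
• Let the authority score of page i be $a(d_i)$, and the hub score of page i be $h(d_i)$.
• The mutually reinforcing relationship of the two scores is represented as follows:
$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j)$$
$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j)$$
where $OUT(d_i)$ is the set of pages that $d_i$ points to and $IN(d_i)$ is the set of pages that point to $d_i$.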
![Page 51: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/51.jpg)
51
HITS in Matrix Form
• We use a to denote the column vector with all the authority scores, $a = (a(d_1), a(d_2), \dots, a(d_n))^T$, and
• use h to denote the column vector with all the hub scores, $h = (h(d_1), h(d_2), \dots, h(d_n))^T$.
• Then,
$$a = L^T h$$
$$h = L a$$
![Page 52: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/52.jpg)
52
Computation of HITS
• The authority scores and hub scores are computed using power iteration.
• If we use $a_k$ and $h_k$ to denote the authority and hub vectors at the kth iteration, the iterations for generating the final solutions are
$$a_k = L^T L\, a_{k-1}$$
$$h_k = L L^T h_{k-1}$$
starting with
$$a_0 = h_0 = (1, 1, \dots, 1)$$
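A minimal Python sketch of the two power iterations. The helper names, the fixed iteration count and the four-page graph are hypothetical. Normalizing the vectors to unit length after every step is an addition here that keeps the values bounded; it does not change the ranking.

```python
import math


def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def transpose(M):
    return [list(col) for col in zip(*M)]


def normalize(v):
    """Scale v so that the squares of its entries sum to 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def hits(L, iterations=50):
    Lt = transpose(L)
    a = [1.0] * len(L)                              # a_0 = (1, ..., 1)
    h = [1.0] * len(L)                              # h_0 = (1, ..., 1)
    for _ in range(iterations):
        a = normalize(mat_vec(Lt, mat_vec(L, a)))   # a_k = L^T L a_{k-1}
        h = normalize(mat_vec(L, mat_vec(Lt, h)))   # h_k = L L^T h_{k-1}
    return a, h


# Hypothetical graph: page 0 links to pages 2 and 3, page 1 links to page 2.
L = [[0, 0, 1, 1],
     [0, 0, 1, 0],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
authority, hub = hits(L)
print(authority, hub)
```

Page 2 is pointed to by both hubs and receives the highest authority score; page 0 points to both authorities and receives the highest hub score.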
![Page 53: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/53.jpg)
53
Relationships with Co-citation and Bibliographic Coupling
• Recall that co-citation of pages i and j, denoted by $C_{ij}$, is
$$C_{ij} = \sum_{k=1}^{n} L_{ki} L_{kj} = (L^T L)_{ij}$$
– the authority matrix ($L^T L$) of HITS is the co-citation matrix C
• Bibliographic coupling of two pages i and j, denoted by $B_{ij}$, is
$$B_{ij} = \sum_{k=1}^{n} L_{ik} L_{jk} = (L L^T)_{ij}$$
– the hub matrix ($L L^T$) of HITS is the bibliographic coupling matrix B
![Page 54: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/54.jpg)
54
HITS (Hyperlink-Induced Topic Search)
• Explore interactions between hubs and authoritative pages
• Use a term-index search engine to form the root set
– Many of these pages are presumably relevant to the search topic (query)
– Some of them should contain links to most of the prominent authorities
• Expand the root set into a base set
– all of the pages that the root-set pages link to, and
– all of the pages that link to a page in the root set, up to a designated size cutoff
![Page 55: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/55.jpg)
55
Root Set and Base Set
• Properties of the base set (ideally)
– Relatively small
– Rich in relevant pages
– Contains most (or many) of the strongest authorities
Fig. The root set is contained in the base set
![Page 56: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/56.jpg)
56
Step 1 of HITS: Create Base Set from Root Set
Subgraph(σ, ℰ, t, d)
  σ: a query string
  ℰ: a text-based search engine
  t, d: natural numbers  // t = 200; d = 50
  Let R denote the top t results of ℰ on σ  // R: root set
  Set S := R
  For each page p ∈ R  // html_content ← get_url(url)
    Let W(p) denote the set of all pages p points to
    Let V(p) denote the set of all pages pointing to p
    Add all pages in W(p) to S
    If |V(p)| ≤ d, then add all pages in V(p) to S
    Else add an arbitrary set of d pages from V(p) to S
  End
  Return S  // S: base set, ca. 1000 – 5000 pages
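The Subgraph procedure can be sketched in Python as below. The search engine and the link lookups are passed in as callables; `search`, `out_links`, `in_links` and the toy link table are hypothetical stand-ins, not a real search API.

```python
def build_base_set(query, search, out_links, in_links, t=200, d=50):
    """Expand the root set (top-t search results) into the base set S."""
    root = search(query, t)                 # R: the root set
    base = set(root)                        # S := R
    for p in root:
        base.update(out_links(p))           # add every page p points to
        pointing = list(in_links(p))
        base.update(pointing[:d])           # add at most d pages that point to p
    return base


# Toy web used only for illustration: page -> pages it links to.
links = {"a": ["b", "c"], "b": ["c"], "c": [],
         "x": ["a"], "y": ["a"], "z": ["a"]}


def toy_search(query, t):
    return ["a"][:t]                        # pretend "a" is the only result


def toy_out(p):
    return links.get(p, [])


def toy_in(p):
    return sorted(q for q, targets in links.items() if p in targets)


base = build_base_set("data mining", toy_search, toy_out, toy_in, t=1, d=2)
print(sorted(base))
```

Slicing the in-link list to d entries covers both branches of the pseudocode: all in-linking pages are added when there are at most d of them, otherwise an arbitrary subset of size d (here simply the first d).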
![Page 57: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/57.jpg)
57
Step 1 of HITS: Create Base Set from Root Set
For instance,
http://search.yahoo.com/bin/search?p=Data+Mining&ei=UTF-8
http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=21
http://search.yahoo.com/search?p=Data+Mining&ei=UTF-8&b=41
… …
• Two types of links in S:
– transverse: between pages with different domain names;
– intrinsic: between pages with the same domain name
(domain name: the first level of the URL string of a page)
• G: the graph obtained by deleting all intrinsic links from S
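One way to separate the two link types is to compare host names. The sketch below assumes that "domain name" means the host part of the URL as returned by `urllib.parse.urlparse`; the function names and the example URLs are hypothetical.

```python
from urllib.parse import urlparse


def is_transverse(source_url, target_url):
    """True if the two pages are on different hosts (assumed meaning of 'domain name')."""
    return urlparse(source_url).hostname != urlparse(target_url).hostname


def drop_intrinsic(links):
    """Keep only transverse links; intrinsic (same-host) links are deleted."""
    return [(s, t) for s, t in links if is_transverse(s, t)]


links = [
    ("http://example.com/a.html", "http://example.com/b.html"),      # intrinsic
    ("http://example.com/a.html", "http://example.org/index.html"),  # transverse
]
kept = drop_intrinsic(links)
print(kept)
```

Dropping intrinsic links removes purely navigational links within one site, so they do not count as endorsements.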
![Page 58: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/58.jpg)
58
The HITS Algorithm
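Fig. Example graph with pages d1, d2, d3, d4 and its “adjacency matrix”
$$L = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \end{pmatrix}$$
Initial values: a = h = 1
Iterate:
$$h(d_i) = \sum_{d_j \in OUT(d_i)} a(d_j), \qquad h = L a$$
$$a(d_i) = \sum_{d_j \in IN(d_i)} h(d_j), \qquad a = L^T h$$
so that
$$a = L^T L\, a, \qquad h = L L^T h$$
Normalize:
$$\sum_i a(d_i)^2 = \sum_i h(d_i)^2 = 1$$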
![Page 59: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/59.jpg)
59
Step 2 of HITS: Calculate Authority and Hub Weight for Each Page
Iterate(G)
  G: a collection of n linked pages
  $a_0 = h_0 = (1, 1, 1, \dots, 1)$
  k = 1
  Repeat
    $a_k = L^T L\, a_{k-1}$
    $h_k = L L^T h_{k-1}$
    normalize $a_k$, $h_k$
    k = k + 1
  Until $a_k$ and $h_k$ do not change significantly
  Return ($a_k$, $h_k$).
![Page 60: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/60.jpg)
60
Step 3 of HITS: Filter out the top authorities and hubs
Filter(G, c)
  G: a collection of n linked pages
  k, c: natural numbers
  (x_k, y_k) := Iterate(G)  // x_k: authority weights, y_k: hub weights
  Report the pages with the c largest coordinates in x_k as authorities.
  Report the pages with the c largest coordinates in y_k as hubs.
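The filtering step amounts to picking the indices of the c largest scores. A minimal sketch, assuming the score vectors are plain Python lists indexed by page number; the function name `top_c` and the example scores are hypothetical.

```python
def top_c(scores, c):
    """Return the indices of the c largest scores, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:c]


# Hypothetical authority and hub weights for four pages.
authority = [0.0, 0.0, 0.85, 0.53]
hub = [0.85, 0.53, 0.0, 0.0]

print("authorities:", top_c(authority, 2))
print("hubs:", top_c(hub, 2))
```

Sorting indices rather than values keeps the link back to the pages, which is what the report needs.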
![Page 61: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/61.jpg)
61
Strengths and Weaknesses of HITS
• Strength: its ability to rank pages according to the query topic, which may provide more relevant authority and hub pages.
• Weaknesses:
– It is quite easy to influence HITS, since adding out-links to one’s own page is so easy.
– Inefficiency at query time: query-time evaluation is slow. Collecting the root set, expanding it and performing the eigenvector computation are all expensive operations.
• Reference
Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment, Journal of the ACM, Vol. 46(5), 1999, pp. 604-632 http://www.cs.cornell.edu/home/kleinber/kleinber.html
![Page 62: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/62.jpg)
62
Chapter 8. Mining Complex Types of Data (II)
• Introduction to Web mining
• Web Structure Analysis
• PageRank
• HITS Approach
• Summary
![Page 63: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/63.jpg)
63
Summary
• Web mining includes mining Web link structures to identify authoritative Web pages, Web content mining and Web usage mining
• We introduced
– PageRank & social network analysis, centrality and prestige
– HITS & co-citation and bibliographic coupling
![Page 64: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/64.jpg)
64
Summary
• Important to note: hyperlink-based ranking is not the only algorithm used in search engines. In fact, it is combined with many content-based factors to produce the final ranking presented to the user.
• Links can also be used to find communities, which are groups of content creators or people sharing some common interests.
– Web communities
– Email communities
– Named entity communities, etc.
![Page 65: Data.Mining.C.8(Ii).Web Mining 570802461](https://reader035.fdocuments.net/reader035/viewer/2022081403/5550b714b4c90504628b4c80/html5/thumbnails/65.jpg)
65