Web Mining - Sharifce.sharif.edu/courses/85-86/1/ce925/resources/root/class... · December 24, 2006...

Web Mining

Kyumars Sheykh Esmaili

Data Mining Course
Sharif University of Technology

Fall 2006

December 24, 2006

Table of Contents

• Introduction
• Web Content Mining
  • Feature Selection and Similarity Measures
• Web Structure Mining
  • Web as Social Network
  • Features and Similarity Measures
  • Social Network Analysis Algorithms
    • PageRank
  • Cyber-Communities
    • HITS
    • CT
• Web Content-Structure Clustering
• Web Usage Mining
• Some Concrete Applications of Web Mining
  • Focus Crawling
  • Web Search Result Clustering
• Summary








Introduction

Information overload on the Web

Size:
• 2001: 6 exabytes of new information created (1 exabyte = 10^18 bytes); 10 billion (non-spam) e-mail messages sent per day.
• 2002: 12 exabytes of new information created.
• 2003: the public Internet contained about 1 trillion pages and was growing by approximately 8 million pages per day.
• 2005: 35 billion messages per day.


Challenges on WWW Interactions

• Finding relevant information
• Creating knowledge from the information available
• Personalization of the information
• Learning about customers / individual users

Web Mining can play an important Role!


Introduction

Web mining: the use of data mining techniques to automatically discover and extract information from Web documents and services.

Web mining research integrates research from several communities:
• Database (DB)
• Information retrieval (IR)
• The sub-areas of machine learning (ML) and natural language processing (NLP)


Web Data

• Web pages
• Intra-page structures
• Inter-page structures
• Usage data
• Supplemental data
  • Profiles
  • Registration information
  • Cookies


Web Data Categories

Web Data
• Content Data: Free Texts, HTML Files, XML Files, Dynamic Content, Multimedia
• Structure Data: Static Link, Dynamic Link
• Usage Data
• User Profile Data


Web Mining Taxonomy

Web Mining
• Web Structure Mining
• Web Content Mining
• Web C-S (Content-Structure) Mining
• Web Usage Mining


Web Mining: Subtasks

• Resource finding: the task of retrieving the intended Web documents
• Information selection and pre-processing: automatically selecting and pre-processing specific information from the retrieved Web resources
• Generalization: automatic discovery of patterns in Web sites
• Analysis: validation and/or interpretation of the mined patterns








Feature Selection for Web Mining

For the purposes of automated text classification, text features should be:
• Relatively few in number
• Moderate in frequency of assignment
• Low in redundancy
• Low in noise
• Related in semantic scope to the classes to be assigned
• Relatively unambiguous in meaning


Feature Selection

Potential features:
• BODY
• META
• TITLE
• Snippet: the sentences attached to URL u in search results
• Anchor Window: the anchor text and the text around the hyperlink v->u in the source page v
• MT: the union of META and TITLE content
• BMT: the union of BODY, META and TITLE content


Percentage of Web Pages With Words in HTML Tags

Feature Selection for Content Mining


Feature Selection For Web Pages

Classification performance for various representations of web pages


Vector Space Model for Content-Similarity

• IR systems usually adopt index terms to process queries
• Index term:
  • a keyword or group of selected words
  • any word (more general)
• Stemming might be used: connect: connecting, connection, connections
• An inverted file is built for the chosen index terms


Vector Space Model - Basic Concepts

• ki is an index term
• dj is a document
• t is the number of index terms
• K = (k1, k2, …, kt) is the set of all index terms
• wij >= 0 is a weight associated with (ki, dj)
• wij = 0 indicates that the term ki does not occur in document dj
• vec(dj) = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj
• gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)


The Vector Space Model

sim(dk, dj) = cos(Θ) = [vec(dk) • vec(dj)] / (|dk| * |dj|) = [Σi wik * wij] / (|dk| * |dj|)

Since wij >= 0 and wik >= 0, 0 <= sim(dk, dj) <= 1.

A document is retrieved even if it matches the target document terms only partially.

[Figure: document vectors dj and dk with the angle Θ between them]


The Vector Space Model: Example

[Figure: documents d1 to d7 placed in the space of index terms k1, k2, k3]

      k1   k2   k3   q • dj   |dj|   sim(dj, q)
d1     1    0    1     2      1.41     0.82
d2     1    0    0     1      1        0.58
d3     0    1    1     2      1.41     0.82
d4     1    0    0     1      1        0.58
d5     1    1    1     3      1.73     1
d6     1    1    0     2      1.41     0.82
d7     0    1    0     1      1        0.58

q      1    1    1            |q| = 1.73
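The similarity column of the example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `cosine_similarity` and the dictionary layout are illustrative choices, not part of the slides, and the weights are the binary values from the table.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Binary term weights (k1, k2, k3) taken from the example table.
docs = {
    "d1": (1, 0, 1), "d2": (1, 0, 0), "d3": (0, 1, 1), "d4": (1, 0, 0),
    "d5": (1, 1, 1), "d6": (1, 1, 0), "d7": (0, 1, 0),
}
query = (1, 1, 1)

similarities = {name: round(cosine_similarity(vec, query), 2)
                for name, vec in docs.items()}
# Documents ordered from most to least similar to the query.
ranking = sorted(similarities, key=similarities.get, reverse=True)
```

Sorting by similarity puts d5 first, since it is the only document containing all three query terms.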


The Vector Space Model - Weighting

sim(q, dj) = [Σi wij * wiq] / (|dj| * |q|)

How should the weights wij and wiq be computed? A good weight must take into account two effects:
• quantification of intra-document content (similarity): the tf factor, the term frequency within a document
• quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency

wij = tf(i,j) * idf(i)


The Vector Model - Weighting

Example:
• A collection includes 10,000 documents
• The term A appears 20 times in a particular document
• The maximum number of appearances of any term in this document is 50
• The term A appears in 2,000 of the collection's documents
• f(i,j) = freq(i,j) / max_l freq(l,j) = 20/50 = 0.4
• idf(i) = log2(N/ni) = log2(10,000/2,000) = log2(5) = 2.32
• wij = f(i,j) * log2(N/ni) = 0.4 * 2.32 = 0.93
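The arithmetic of this example can be checked with a short sketch. The function name `tf_idf` and its parameter names are illustrative. The base-2 logarithm is an inference from the slide's value log(5) = 2.32, which only holds in base 2.

```python
import math

def tf_idf(freq, max_freq, n_docs, doc_freq, log_base=2):
    """Weight of a term in a document: normalized tf times idf.

    freq      -- occurrences of the term in the document
    max_freq  -- occurrences of the most frequent term in the document
    n_docs    -- number of documents in the collection (N)
    doc_freq  -- number of documents containing the term (ni)
    """
    tf = freq / max_freq
    idf = math.log(n_docs / doc_freq, log_base)
    return tf * idf

# Values from the example: 20 occurrences, maximum 50, N = 10,000, ni = 2,000.
weight = tf_idf(20, 50, 10_000, 2_000)
```

Rounded to two decimals, `weight` agrees with the 0.93 on the slide.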








Social network analysis

Social network analysis is the study of social entities (people in an organization, called actors) and their interactions and relationships. The interactions and relationships can be represented with a network or graph: each vertex (or node) represents an actor and each link represents a relationship.

From the network, we can study the properties of its structure, and the role, position and prestige of each social actor. We can also find various kinds of sub-graphs, e.g., communities formed by groups of actors.


Social network and the Web

Social network analysis is useful for the Web because the Web is essentially a virtual society, and thus a virtual social network: each page is a social actor and each hyperlink is a relationship.

Many results from social network analysis can be adapted and extended for use in the Web context.


Web Structure Mining

• The Web consists not only of pages, but also of hyperlinks pointing from one page to another
• These hyperlinks contain an enormous amount of latent human annotation
• Assumption: a link from page A to page B is a recommendation of page B by the author of A
• If A and B are connected by a link, there is a higher probability that they are on the same topic


Web Link Analysis

Used for:
• Ordering documents matching a user query: ranking
• Deciding what pages to add to a collection: crawling
• Page categorization
• Finding related pages
• Finding duplicated Web sites








Structural Similarity Measures

We must define the similarity of two nodes.

Method I: for page A and page B, A is related to B if there is a hyperlink from A to B, or from B to A.

Not so good. Consider the home pages of IBM and Microsoft: the two are clearly related, yet neither is likely to link to the other.


Structural Similarity Measures

Method II (from bibliometrics):
• Co-citation: the similarity of A and B is measured by the number of pages that cite both A and B.
• Bibliographic coupling: the similarity of A and B is measured by the number of pages cited by both A and B.
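Both bibliometric measures reduce to simple set operations on a link graph. The sketch below assumes the graph is stored as a dictionary mapping each page to the set of pages it links to. The page names in `links` are hypothetical.

```python
def co_citation(links, a, b):
    """Number of pages that link to (cite) both a and b."""
    return sum(1 for targets in links.values() if a in targets and b in targets)

def bibliographic_coupling(links, a, b):
    """Number of pages that both a and b link to (cite)."""
    return len(links.get(a, set()) & links.get(b, set()))

# Hypothetical toy graph: page -> set of pages it links to.
links = {
    "P1": {"A", "B"},
    "P2": {"A", "B"},
    "P3": {"A"},
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
}

cocited = co_citation(links, "A", "B")            # P1 and P2 cite both
coupled = bibliographic_coupling(links, "A", "B")  # both cite Y
```

Unlike Method I, neither measure requires a direct link between A and B.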








Using link structure of web (cont.)

There are two famous link-structure-based algorithms for ranking:
• PageRank
• HITS

Nearly all other algorithms are based on these two:
• SALSA, Clever, etc.


PageRank

• Introduced by Page et al. (1998)
• An offline algorithm (query independent)
• The weight of a page is assigned by the rank of its parents (the pages that link to it)
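The idea that a page's weight comes from the rank of its parents can be sketched as a power iteration. The slides give no formula, so the details here are assumptions: the common damped formulation with a damping factor of 0.85, a fixed number of iterations, and an even redistribution of the rank of pages without out-links. The graph `toy_links` is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively distribute each page's rank over the pages it links to.

    links -- dict mapping a page to the list of pages it links to
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (assumed damped model).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, ())
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without out-links: spread its rank over all pages.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: C is linked to by A, B and D.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_links)
```

Because the computation uses only the link graph, it can run offline, before any query arrives.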


A Practical Example for PageRank








What is a cyber-community?

A community on the Web is a group of Web pages sharing a common interest
• E.g., a group of Web pages talking about pop music
• E.g., a group of Web pages interested in data mining

Main properties:
• Pages in the same community should be similar to each other in content
• The pages in one community should differ from the pages in another community
• Similar to a cluster


Cyber Communities


Two different types of communities

• Explicitly defined communities: well-known ones, such as the resources listed by Yahoo!
  E.g., Arts > Music (Classic, Pop), Painting
• Implicitly defined communities: communities that are unexpected or invisible to most users
  E.g., the group of Web pages interested in a particular singer


Different types of communities

• The explicit communities are easy to identify, e.g., Yahoo!, InfoSeek, Clever System
• In order to extract the implicit communities, we need to analyze the Web graph objectively
• In research, people are more interested in the implicit communities


Methods of clustering

• Clustering methods based on co-citation analysis
  • Methods derived from HITS (Kleinberg)
  • Using the co-citation matrix
• CT method








HITS: Hubs and Authority

• Hub: a Web page that links to a collection of prominent sites on a common topic
• Authority: a prominent page on a topic; a Web page pointed to by many hubs
• Mutually reinforcing relationship: a good authority is a page that is pointed to by many good hubs, while a good hub is a page that points to many good authorities


Authority and Hubness

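[Figure: pages 2, 3 and 4 point to page 1; page 1 points to pages 5, 6 and 7]

x(1) = y(2) + y(3) + y(4)
y(1) = x(5) + x(6) + x(7)

Here x denotes the authority weight of a page and y its hub weight.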


HITS Steps (1)

Creating root and base sets


HITS Steps (2)

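Calculating weights

• Authority weight: x(p) = Σ y(q), summed over all pages q that point to p
• Hub weight: y(p) = Σ x(q), summed over all pages q that p points to
• Matrix notation: A is the adjacency matrix, with A(i, j) = 1 if the i-th page points to the j-th page (and 0 otherwise)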
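The mutually reinforcing weight updates can be sketched as a simple iteration over a link graph. This is an illustration under assumptions: the graph is taken to be the already constructed base set, weights are normalized to unit Euclidean length after each round, and the iteration count is fixed. The page names in `links` are hypothetical.

```python
import math

def _normalize(scores):
    """Scale the scores so that their squares sum to 1."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    return {p: (v / norm if norm else 0.0) for p, v in scores.items()}

def hits(links, iterations=50):
    """Return (authority, hub) weights for a graph given as page -> out-links."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of the hub weights of the pages pointing to p.
        new_auth = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_auth[p] += hub[q]
        # Hub weight: sum of the authority weights of the pages p points to.
        new_hub = {p: sum(new_auth[q] for q in links.get(p, ())) for p in pages}
        authority = _normalize(new_auth)
        hub = _normalize(new_hub)
    return authority, hub

# Hypothetical base set: three hub-like pages pointing to two authorities.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
authority, hub = hits(links)
```

In this toy graph a1 ends up as the strongest authority, and h1 and h2 score higher as hubs than h3 because they point to both authorities.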


Final Result of HITS


HITS Results – 3D perspective


A Practical Example for HITS


Difference between PageRank and HITS

• PageRank is computed for all Web pages stored in the database, prior to the query; HITS is performed on the set of retrieved Web pages, for each query.
• HITS computes authorities and hubs; PageRank computes authorities only.
• PageRank is non-trivial to compute; HITS is easy to compute, but real-time execution is hard.








A cheaper method

Previous methods are expensive.

There is another, simpler method called community trawling (CT).

It has been implemented on a graph of 200 million pages, where it worked very well.


Basic idea of CT

Definition of communities: dense directed bipartite subgraphs
• Bipartite graph: nodes are partitioned into two sets, F and C
• Every directed edge in the graph is directed from a node u in F to a node v in C
• Dense: many of the possible edges between F and C are present

[Figure: fans (F) pointing to centers (C)]


Basic idea of CT

Bipartite cores
• A complete bipartite subgraph with at least i nodes from F and at least j nodes from C
• i and j are tunable parameters
• Such a subgraph is called an (i, j) bipartite core

Every community has such a core for some i and j.

[Figure: an (i=3, j=3) bipartite core]
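To make the definition concrete, the sketch below enumerates (i, j) bipartite cores by brute force on a tiny graph. It is not the trawling algorithm itself: it tries every group of i fans, so it is only usable on toy inputs. It reports cores with exactly i fans and j centers. The function name and the graph `links` are hypothetical.

```python
from itertools import combinations

def find_bipartite_cores(links, i, j):
    """Return all (fans, centers) pairs where each of the i fans links to
    each of the j centers.

    links -- dict mapping a page to the set of pages it links to
    """
    cores = []
    for fan_group in combinations(sorted(links), i):
        # Centers must be linked to by every fan in the group.
        common = set.intersection(*(set(links[f]) for f in fan_group))
        common -= set(fan_group)  # keep the two sides disjoint
        for center_group in combinations(sorted(common), j):
            cores.append((fan_group, center_group))
    return cores

# Hypothetical toy graph: f1, f2 and f3 all point to c1, c2 and c3.
links = {
    "f1": {"c1", "c2", "c3"},
    "f2": {"c1", "c2", "c3"},
    "f3": {"c1", "c2", "c3", "c4"},
    "f4": {"c1"},
}
cores = find_bipartite_cores(links, 3, 3)
```

Here f4 cannot belong to any (3, 3) core because it links to a single page, which is the kind of observation that pruning exploits at scale.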


Basic idea of CT

A bipartite core is the identity of a community.

To extract all the communities is to enumerate all the bipartite cores on the Web.

The authors devised an efficient algorithm to enumerate the bipartite cores. Its main idea is iterative pruning: elimination-generation pruning.








Content Link Clustering

In CLC, each Web page q in data set D is represented as 3 vectors:
• qOut
• qIn
• qKword
with M, N and L as the respective vector dimensions.

The i-th item of vector qOut (and likewise qIn) indicates whether q has the i-th of the M out-links (N in-links). If yes, the i-th item is 1, else 0.

The k-th item of vector qKword is the frequency with which the k-th of the L terms appears in page q.


Similarity Measure

The similarity of two pages Q and R is the linear combination of three parts:

Sim(Q, R) = pout * S(QOut, ROut) + pin * S(QIn, RIn) + pKword * S(QKword, RKword)

pout + pin + pKword = 1

S(QOut, ROut) is defined as the cosine of the two out-link vectors; S(QIn, RIn) and S(QKword, RKword) are defined analogously.
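The weighted combination can be sketched as follows. The page representation (a dictionary with `out`, `in` and `kword` vectors) and the two example pages are hypothetical. Using the cosine for the in-link and keyword parts as well is an assumption, since the slide defines only the out-link part.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def clc_similarity(q, r, p_out, p_in, p_kword):
    """Linear combination of out-link, in-link and keyword similarity."""
    if not math.isclose(p_out + p_in + p_kword, 1.0):
        raise ValueError("weighting factors must sum to 1")
    return (p_out * cosine(q["out"], r["out"])
            + p_in * cosine(q["in"], r["in"])
            + p_kword * cosine(q["kword"], r["kword"]))

# Hypothetical pages with M = 3 out-links, N = 2 in-links and L = 3 terms.
page_q = {"out": [1, 1, 0], "in": [1, 0], "kword": [2, 0, 1]}
page_r = {"out": [1, 0, 0], "in": [1, 0], "kword": [1, 0, 0]}

term_only = clc_similarity(page_q, page_r, 0.0, 0.0, 1.0)
link_only = clc_similarity(page_q, page_r, 0.5, 0.5, 0.0)
coupled = clc_similarity(page_q, page_r, 0.2, 0.3, 0.5)
```

Setting a weighting factor to 0 switches the corresponding evidence off, which is how the term-only and link-only variants are obtained from the same function.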


Tuning the similarity measure

By varying the weighting factors in the similarity formula, it is possible to study the effects of out-links, in-links and terms on the clustering process.

The results of term-based clustering are rather coarse and usually consist of very general groups that are semantically very different from each other.

E.g., for the topic "jaguar", the "car" group and the "animal" group are two very general groups with very different semantic topics.

So term-based clustering can only roughly separate pages into general semantic groups and fails to handle finer cases, such as "racing car" and "car driver club", since both kinds of page may include terms like "car", "model", etc.

The main reasons for the poor "purity" of clusters produced by term-based clustering are:
• Noise pages are included in clusters instead of being removed, since noise pages share some unimportant terms with other pages.
• Pages on different finer topics (but the same general topic) are mixed together.



Hyperlinks represent the authors' view of the relationships among Web pages.

Hyperlink-based clustering expresses the "association" of pages.

Therefore, we could say that clusters produced by link-based clustering are of finer granularity.

The problem with link-based clustering is that some similar pages (e.g., newly created pages) may not have enough co-citations/citations to be grouped together. That is to say, recall is somewhat low.



"T", "L" and "CLC" denote the term-based, link-based and content-link coupled clustering approaches, with (pout, pin, pKword) set to (0, 0, 1), (0.5, 0.5, 0) and (0.2, 0.3, 0.5) respectively.

The parameters are the similarity threshold and the weighting factors.

The label of each cluster is identified automatically from the term vector of the cluster's centroid.


Content Link Mining








Web Usage Mining

• Web usage mining is also known as Web log mining
• It applies mining techniques to discover interesting usage patterns from the secondary data derived from the interactions of users while surfing the Web
• This data includes Web log data, click-stream data, cookies, user queries, and any other data resulting from users' interactions with the Web


Web Usage Mining Applications

• Target potential customers for electronic commerce
• Enhance the quality and delivery of Internet information services to the end user
• Improve Web server system performance
• Identify potential prime advertisement locations
• Facilitate personalization/adaptive sites
• Improve site design
• Fraud/intrusion detection
• Predict user's actions (allows prefetching)


Web Log Clustering Applications

• Association rules: find pages that are often viewed together
• Clustering
  • Cluster users based on browsing patterns
  • Cluster pages based on content

Server Logs


Fields

• Client IP: 128.101.228.20
• Authenticated User ID: - -
• Time/Date: [10/Nov/1999:10:16:39 -0600]
• Request: "GET / HTTP/1.0"
• Status: 200
• Bytes: -
• Referrer: "-"
• Agent: "Mozilla/4.61 [en] (WinNT; I)"
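A log entry with these fields can be split apart with a regular expression. The sketch below assumes the fields appear on one line in the common "combined" server log layout; the `line` value is assembled from the field values listed above and is not quoted from an actual log file. The names `LOG_PATTERN` and `parse_log_line` are illustrative.

```python
import re
from datetime import datetime

# One named group per field, in the assumed combined log format order.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of typed fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    entry["status"] = int(entry["status"])
    # A "-" in the bytes field means no size was recorded.
    entry["bytes"] = None if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

line = ('128.101.228.20 - - [10/Nov/1999:10:16:39 -0600] "GET / HTTP/1.0" '
        '200 - "-" "Mozilla/4.61 [en] (WinNT; I)"')
entry = parse_log_line(line)
```

Lines that do not match return `None`, so malformed entries can be dropped during data cleaning.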


WUM – Pre-Processing

• Data cleaning: removes log entries that are not needed for the mining process
• Data integration: synchronizes data from multiple server logs
• User identification: associates page references with different users
• Session/episode identification: groups a user's page references into user sessions
• Path completion: fills in page references missing due to browser and proxy caching
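Session identification is often done with a simple inactivity timeout. The sketch below makes two assumptions that are not stated in the slides: users are identified by client IP, and a gap of more than 30 minutes starts a new session. The request data, apart from the first IP address and timestamp, is hypothetical.

```python
from datetime import datetime, timedelta

def split_sessions(requests, timeout=timedelta(minutes=30)):
    """Group (user, timestamp, url) records into per-user sessions.

    Returns a dict mapping each user to a list of sessions, where a
    session is the list of URLs requested without a gap above the timeout.
    """
    sessions = {}
    last_seen = {}
    for user, timestamp, url in sorted(requests, key=lambda r: (r[0], r[1])):
        # Start a new session for a new user or after a long pause.
        if user not in sessions or timestamp - last_seen[user] > timeout:
            sessions.setdefault(user, []).append([])
        sessions[user][-1].append(url)
        last_seen[user] = timestamp
    return sessions

requests = [
    ("128.101.228.20", datetime(1999, 11, 10, 10, 16, 39), "/"),
    ("128.101.228.20", datetime(1999, 11, 10, 10, 20, 0), "/products/"),
    ("128.101.228.20", datetime(1999, 11, 10, 11, 30, 0), "/"),
    ("10.0.0.7", datetime(1999, 11, 10, 10, 17, 0), "/about.html"),
]
sessions = split_sessions(requests)
```

The third request of the first user arrives 70 minutes after the second, so it opens a second session.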


WUM – Association Rule Generation

• Discovers the correlations between pages that are most often referenced together in a single server session
• Provides information such as:
  • What sets of pages are frequently accessed together by Web users?
  • What page will be fetched next?
  • What paths are frequently accessed by Web users?
• Association rule: A => B [Support = 60%, Confidence = 80%]
• Example: "50% of visitors who accessed URLs /infor-f.html and labo/infos.html also visited situation.html"
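Support and confidence of a rule A => B can be computed directly from a list of sessions. In this sketch a session is simply the collection of URLs it contains. The four sessions are hypothetical and were chosen so that the rule from the example reaches 50% confidence; `/index.html` is an invented URL.

```python
def rule_metrics(sessions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent.

    support    -- fraction of sessions containing antecedent and consequent
    confidence -- fraction of sessions with the antecedent that also
                  contain the consequent
    """
    antecedent, consequent = set(antecedent), set(consequent)
    n_total = len(sessions)
    n_antecedent = sum(1 for s in sessions if antecedent <= set(s))
    n_both = sum(1 for s in sessions if (antecedent | consequent) <= set(s))
    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence

# Hypothetical sessions built around the URLs of the example.
sessions = [
    ["/infor-f.html", "labo/infos.html", "situation.html"],
    ["/infor-f.html", "labo/infos.html"],
    ["situation.html"],
    ["/index.html"],
]
support, confidence = rule_metrics(
    sessions, ["/infor-f.html", "labo/infos.html"], ["situation.html"]
)
```

With these sessions the rule holds in one of four sessions overall, and in one of the two sessions that contain both antecedent pages.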


WUM – Clustering

• Groups together a set of items having similar characteristics
• User clusters: discover groups of users exhibiting similar browsing patterns
• Page recommendation
  • A user's partial session is classified into a single cluster
  • The links contained in this cluster are recommended


Web Usage Clustering – Sample Results

• Clients who often access /products/software/webminer.html tend to be from educational institutions.
• Clients who placed an online order for software tend to be students in the 20-25 age group and live in the United States.
• 75% of clients who download software from /products/software/demos/ visit between 7:00 and 11:00 pm on weekends.








Focused Crawling

• Only visit links from a page if that page is determined to be relevant.
• The classifier is static after the learning phase.
• Components:
  • Classifier, which assigns a relevance score to each page based on the crawl topic.
  • Distiller, which identifies hub pages.
  • Crawler, which visits pages based on the classifier and distiller scores.


Focused Crawling

• The classifier also determines how useful outgoing links are.
• Hub pages contain links to many relevant pages. They must be visited even if their relevance score is not high.
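The core loop of a focused crawler can be sketched on an in-memory "web". Everything concrete here is an assumption made for illustration: the classifier is replaced by a keyword-overlap score, the distiller is left out entirely, the relevance threshold is 0.5, and the pages in `web` are invented.

```python
import heapq

def relevance(text, topic_terms):
    """Toy stand-in for the trained classifier: share of topic terms present."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)

def focused_crawl(web, seeds, topic_terms, threshold=0.5, max_pages=100):
    """Visit pages in order of their parent's relevance; follow links only
    from pages judged relevant.

    web -- dict mapping a URL to (page text, list of out-links)
    """
    frontier = [(-1.0, url) for url in seeds]  # max-heap via negated scores
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        if url not in web:
            continue
        text, out_links = web[url]
        visited.append(url)
        score = relevance(text, topic_terms)
        if score < threshold:
            continue  # irrelevant page: do not follow its links
        for link in out_links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical five-page web.
web = {
    "seed": ("web mining data mining tutorial", ["a", "b"]),
    "a": ("mining web usage logs", ["c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("web structure mining", []),
    "d": ("more recipes", []),
}
visited = focused_crawl(web, ["seed"], {"web", "mining"})
```

Page b is fetched because a relevant page links to it, but it is judged irrelevant, so its link to d is never followed. A distiller would additionally boost pages identified as hubs.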


Focused Crawling








Motivation

• In the Web search context: organizing Web pages (search results) into groups, so that different groups correspond to different user needs
• E.g., the query "engine": search engine, car part, Engine Corp.
• Why not other data mining techniques?


(1) Using Contents of Documents

• Create clusters based on the snippets returned by Web search engines.
• Clusters based on snippets are almost as good as clusters created using the full text of Web documents.
• Suffix Tree Clustering (STC): an incremental, O(n)-time algorithm
  • Linear
  • Incremental
  • Overlapping
  • Can be extended to hierarchical


STC algorithm

• Step 1: Cleaning
  • Stemming
  • Sentence boundary identification
  • Punctuation elimination
• Step 2: Suffix tree construction
  • Produces base clusters (internal nodes)
  • Base clusters are scored based on size and phrase score (which depends on length and word "quality")
• Step 3: Merging base clusters
  • Highly overlapping clusters are merged
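The three steps can be illustrated with a deliberately simplified sketch. It does not build a suffix tree: shared phrases are found by enumerating word n-grams, which gives the same kind of base clusters on small inputs but is not linear-time. Stemming and cluster scoring are omitted. The merge rule (two base clusters are merged when each shares more than half of its documents with the other) and the snippets are assumptions made for the example.

```python
import re
from collections import defaultdict

def clean(snippet):
    """Step 1 (simplified): lowercase and drop punctuation; no stemming."""
    return re.findall(r"[a-z0-9]+", snippet.lower())

def base_clusters(snippets, max_len=3):
    """Step 2 (simplified): phrases shared by at least two snippets,
    each mapped to the set of snippet indices containing it."""
    phrase_docs = defaultdict(set)
    for doc_id, snippet in enumerate(snippets):
        words = clean(snippet)
        for n in range(1, max_len + 1):
            for start in range(len(words) - n + 1):
                phrase_docs[tuple(words[start:start + n])].add(doc_id)
    return {p: docs for p, docs in phrase_docs.items() if len(docs) >= 2}

def merge_clusters(clusters, overlap=0.5):
    """Step 3: merge base clusters whose document sets overlap heavily."""
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for idx, a in enumerate(phrases):
        for b in phrases[idx + 1:]:
            shared = len(clusters[a] & clusters[b])
            if (shared / len(clusters[a]) > overlap
                    and shared / len(clusters[b]) > overlap):
                parent[find(a)] = find(b)

    merged = defaultdict(set)
    for p in phrases:
        merged[find(p)] |= clusters[p]
    return sorted(sorted(docs) for docs in merged.values())

# Hypothetical search-result snippets for the query "jaguar".
snippets = [
    "Jaguar car dealer",
    "New jaguar car models",
    "Jaguar animal habitat",
    "The jaguar animal in the wild",
]
clusters = merge_clusters(base_clusters(snippets))
```

The result contains a car cluster, an animal cluster and an overall jaguar cluster. A snippet can belong to more than one of them, which is the overlapping property listed for STC.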


(2) Using user’s usage logs

• Advantage: relevancy information is objectively reflected by the usage logs
• An experimental result on www.nasa.gov/:

Cluster 1: /shuttle/missions/41-c/news, /shuttle/missions/61-b…
Cluster 2: /history/apollo/sa-2/news/, /history/apollo/sa-2/images…
Cluster 3: /software/winvn/userguide/3_3_2.htm, /software/winvn/userguide/3_3_4.htm…
…


(3) Using hyperlinks

• For each URL P in the search results R, we extract all its out-links as well as its top n in-links, using the AltaVista service.
• This gives N distinct out-links and M distinct in-links over all URLs in R.
• Each page P in R (the result set) is represented as 2 vectors:
  • POut (N-dimensional)
  • PIn (M-dimensional)


(3) Using Hyperlinks: continued


Concerns on current methods

Each method has pros and cons:
• Using hyperlinks: the best accuracy, with still some room to improve
• STC: best for browsing and for incrementality


Sample systems

• Scatter/Gather
• Grouper
• Carrot2
• Vivisimo
• Mapuccino
• SHOC


Grouper

• Online
• Operates on query result snippets
• Clusters together documents with large common subphrases
• Suffix Tree Clustering (STC)
• STC induces labeling








Summary

Web Mining
• Web Structure Mining
• Web Content Mining
  • Web Page Content Mining
  • Search Result Mining
• Web Usage Mining
  • General Access Pattern Tracking
  • Customized Usage Tracking


Thank You
