© Prentice Hall 1
ADVANCED TOPICS IN DATA MINING
CSE 8331, Spring 2008
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Companion slides for the text by Dr. M. H. Dunham, Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.
© Prentice Hall 2
Data Mining Outline
Temporal Mining
Spatial Mining
Web Mining
© Prentice Hall 3
Temporal Mining Outline
Goal: Examine some temporal data mining issues and approaches.
Introduction
Modeling Temporal Events
Time Series
Pattern Detection
Sequences
Temporal Association Rules
© Prentice Hall 4
Temporal Database
Snapshot – Traditional database
Temporal – Multiple time points
Ex: [figure]
© Prentice Hall 5
Temporal Queries
Intersection Query
Inclusion Query
Containment Query
Point Query – Tuple retrieved is valid at a particular point in time.
[Figure: timelines comparing the query interval (ts^q to te^q) with the database tuple interval (ts^d to te^d) for each query type]
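The four query types can be sketched as predicates over closed time intervals. The slide's figure did not survive, so the definitions below are assumptions: intersection means the tuple's interval overlaps the query interval, inclusion means the tuple's interval lies entirely inside the query interval, and containment means the tuple's interval covers the whole query interval. The `Interval` type and function names are illustrative only.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    ts: int  # start time
    te: int  # end time (closed interval, an assumption)


def intersection(q: Interval, d: Interval) -> bool:
    """Tuple interval d overlaps query interval q (assumed definition)."""
    return d.ts <= q.te and q.ts <= d.te


def inclusion(q: Interval, d: Interval) -> bool:
    """Tuple interval d lies entirely within q (assumed definition)."""
    return q.ts <= d.ts and d.te <= q.te


def containment(q: Interval, d: Interval) -> bool:
    """Tuple interval d covers all of q (assumed definition)."""
    return d.ts <= q.ts and q.te <= d.te


def point(t: int, d: Interval) -> bool:
    """Tuple is valid at the single time point t."""
    return d.ts <= t <= d.te
```

For a query interval of (5, 10), a tuple valid over (8, 12) satisfies only the intersection predicate, while a tuple valid over (1, 20) also satisfies containment.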
© Prentice Hall 6
Types of Databases
Snapshot – No temporal support
Transaction Time – Supports time when transaction inserted data
– Timestamp
– Range
Valid Time – Supports time range when data values are valid
Bitemporal – Supports both transaction and valid time.
© Prentice Hall 7
Modeling Temporal Events
Techniques to model temporal events.
Often based on earlier approaches
Finite State Recognizer (Machine) (FSR)
– Each event recognizes one character
– Temporal ordering indicated by arcs
– May recognize a sequence
– Require precisely defined transitions between states
Approaches
– Markov Model
– Hidden Markov Model
– Recurrent Neural Network
© Prentice Hall 8
FSR
© Prentice Hall 9
Markov Model (MM)
Directed graph
– Vertices represent states
– Arcs show transitions between states
– Arc has probability of transition
– At any time one state is designated as current state.
Markov Property – Given a current state, the transition probability is independent of any previous states.
Applications: speech recognition, natural language processing
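As a minimal sketch of the Markov property, the probability of a state sequence is the start probability of the first state times the product of one-step transition probabilities. The function name and the two-state weather model in the usage note are made up for illustration and do not come from the slides.

```python
def sequence_probability(start, trans, states):
    """Probability of a state sequence under a first-order Markov model.

    start:  dict mapping state -> probability of starting there
    trans:  dict mapping state -> dict of next state -> transition probability
    states: the sequence of states to score
    """
    p = start[states[0]]
    # Markov property: each step depends only on the current state.
    for prev, cur in zip(states, states[1:]):
        p *= trans[prev].get(cur, 0.0)
    return p
```

With hypothetical start probabilities {"sunny": 0.6, "rainy": 0.4} and transitions sunny→sunny 0.8, sunny→rainy 0.2, the sequence sunny, sunny, rainy has probability 0.6 × 0.8 × 0.2.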
© Prentice Hall 10
Markov Model
© Prentice Hall 11
Hidden Markov Model (HMM)
Like MM, but states need not correspond to observable states.
HMM models process that produces as output a sequence of observable symbols.
HMM will actually output these symbols.
Associated with each node is the probability of the observation of an event.
Train HMM to recognize a sequence.
Transition and observation probabilities learned from training set.
© Prentice Hall 12
Hidden Markov Model
Modified from [RJ86]
© Prentice Hall 13
HMM Algorithm
© Prentice Hall 14
HMM Applications
Given a sequence of events and an HMM, what is the probability that the HMM produced the sequence?
Given a sequence and an HMM, what is the most likely state sequence which produced this sequence?
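The two questions above are commonly answered with the forward algorithm and the Viterbi algorithm. The slides do not name them, so treat this as one standard way to do it, not necessarily the deck's own algorithm. In this sketch, `start`, `trans` and `emit` are plain dictionaries of probabilities, and all names are illustrative.

```python
def forward(obs, states, start, trans, emit):
    """Probability that the HMM produced the observation sequence."""
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        # Sum over every state the model could have been in one step earlier.
        alpha = {
            s: emit[s][o] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


def viterbi(obs, states, start, trans, emit):
    """Most likely hidden state sequence and its probability."""
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Keep only the best-scoring path that ends in state s.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda pair: pair[0],
            )
            new_best[s] = (prob * emit[s][o], path + [s])
        best = new_best
    return max(best.values(), key=lambda pair: pair[0])
```

`forward` sums over all state paths, whereas `viterbi` keeps only the best path into each state, which is why the two return different probabilities for the same observations.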
© Prentice Hall 15
Recurrent Neural Network (RNN)
Extension to basic NN
Neuron can obtain input from any other neuron (including output layer).
Can be used for both recognition and prediction applications.
Time to produce output unknown
Temporal aspect added by backlinks.
© Prentice Hall 16
RNN
© Prentice Hall 17
Time Series
Set of attribute values over time
Time Series Analysis – finding patterns in the values.
– Trends
– Cycles
– Seasonal
– Outliers
© Prentice Hall 18
Analysis Techniques
Smoothing – Moving average of attribute values.
Autocorrelation – relationships between different subseries
– Yearly, seasonal
– Lag – Time difference between related items.
– Correlation Coefficient r
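Both techniques can be sketched in a few lines. The moving average below is a simple unweighted window, and the lag correlation is the Pearson coefficient r between the series and a copy of itself shifted by `lag` positions. The function names, the window size and the choice of Pearson's r are assumptions for illustration.

```python
import math


def moving_average(values, window):
    """Smooth a series with an unweighted moving average."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def lag_correlation(values, lag):
    """Pearson r between the series and itself shifted by `lag` (lag >= 1)."""
    x = values[:len(values) - lag]
    y = values[lag:]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant subseries")
    return cov / math.sqrt(var_x * var_y)
```

A series that repeats every three steps, such as 1, 2, 3, 1, 2, 3, 1, 2, 3, correlates perfectly with itself at a lag of 3 and poorly at a lag of 1.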
© Prentice Hall 19
Smoothing
© Prentice Hall 20
Correlation with Lag of 3
© Prentice Hall 21
Similarity
Determine similarity between a target pattern, X, and sequence, Y: sim(X,Y)
Similar to Web usage mining
Similar to earlier word processing and spelling corrector applications.
Issues:
– Length
– Scale
– Gaps
– Outliers
– Baseline
© Prentice Hall 22
Longest Common Subseries
Find longest subseries they have in common.
Ex:
– X = <10,5,6,9,22,15,4,2>
– Y = <6,9,10,5,6,22,15,4,2>
– Output: <22,15,4,2>
– Sim(X,Y) = l/n = 4/9
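The example can be reproduced with a short dynamic-programming sketch. Two assumptions are read off the example, since the slide does not state them: the common subseries must be contiguous (a non-contiguous match would be longer than 4), and n is the length of the longer series (Y has 9 values). Function names are illustrative.

```python
def longest_common_subseries(x, y):
    """Longest contiguous run of values that appears in both series."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                # Extend the common run that ended at the previous positions.
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


def sim(x, y):
    """l/n, with n taken as the length of the longer series (assumption)."""
    return len(longest_common_subseries(x, y)) / max(len(x), len(y))


X = [10, 5, 6, 9, 22, 15, 4, 2]
Y = [6, 9, 10, 5, 6, 22, 15, 4, 2]
print(longest_common_subseries(X, Y), sim(X, Y))
```

On the slide's X and Y this returns <22,15,4,2> and a similarity of 4/9.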
© Prentice Hall 23
Similarity based on Linear Transformation
Linear transformation function f
– Convert a value from one series to a value in the second
– Tolerated difference in results
– Time value difference allowed
© Prentice Hall 24
Prediction
Predict future value for time series
Regression may not be sufficient
Statistical Techniques
– ARMA
– ARIMA
NN
© Prentice Hall 25
Pattern Detection
Identify patterns of behavior in time series
Speech recognition, signal processing
FSR, MM, HMM
© Prentice Hall 26
String Matching
Find given pattern in sequence
Knuth-Morris-Pratt: Construct FSM
Boyer-Moore: Construct FSM
© Prentice Hall 27
Distance between Strings
Cost to convert one to the other
Transformations
– Match: Current characters in both strings are the same
– Delete: Delete current character in input string
– Insert: Insert current character of target string into input string
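A minimal sketch of this distance, using only the three transformations listed. The costs are assumptions (match costs 0, delete and insert cost 1 each), and there is no substitution step, so replacing a character counts as one delete plus one insert.

```python
def string_distance(source, target):
    """Cheapest way to turn source into target with match/delete/insert."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete every remaining input character
    for j in range(n + 1):
        dist[0][j] = j  # insert every remaining target character
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = min(
                dist[i - 1][j] + 1,  # delete current input character
                dist[i][j - 1] + 1,  # insert current target character
            )
            if source[i - 1] == target[j - 1]:
                best = min(best, dist[i - 1][j - 1])  # match, no cost
            dist[i][j] = best
    return dist[m][n]
```

Under these costs "abc" and "abd" are at distance 2, not 1, because the final character must be deleted and then inserted.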
© Prentice Hall 28
Distance between Strings
© Prentice Hall 29
Frequent Sequence
© Prentice Hall 30
Frequent Sequence Example
Purchases made by customers
s(<{A},{C}>) = 1/3
s(<{A},{D}>) = 2/3
s(<{B,C},{D}>) = 2/3
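Support for a sequence can be sketched as the fraction of customers whose ordered purchases contain the pattern's itemsets in order. The slide's purchase table did not survive, so the three customer histories below are hypothetical, chosen only so that they give the same support values as the slide.

```python
from fractions import Fraction


def contains(sequence, pattern):
    """True if the pattern's itemsets occur, in order, within the sequence."""
    pos = 0
    for itemset in sequence:
        # Greedily match the next pattern itemset against this transaction.
        if pos < len(pattern) and pattern[pos] <= itemset:
            pos += 1
    return pos == len(pattern)


def support(db, pattern):
    """Fraction of customers whose purchase sequence contains the pattern."""
    hits = sum(contains(seq, pattern) for seq in db.values())
    return Fraction(hits, len(db))


# Hypothetical data, not the slide's table: customer id -> ordered itemsets.
customers = {
    1: [{"A", "B"}, {"B", "C"}, {"D"}],
    2: [{"B", "C"}, {"A"}, {"D"}],
    3: [{"A"}, {"B"}],
}

print(support(customers, [{"A"}, {"C"}]))       # 1/3
print(support(customers, [{"A"}, {"D"}]))       # 2/3
print(support(customers, [{"B", "C"}, {"D"}]))  # 2/3
```

Customer 2 does not support <{A},{C}> because C is bought before A, which is where sequence support differs from ordinary itemset support.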
© Prentice Hall 31
Frequent Sequence Lattice
© Prentice Hall 32
SPADE
Sequential Pattern Discovery using Equivalence classes
Identifies patterns by traversing lattice in a top down manner.
Divides lattice into equivalence classes and searches each separately.
ID-List: Associates customers and transactions with each item.
© Prentice Hall 33
SPADE Example
ID-List for Sequences of length 1:
Count for <{A}> is 3
Count for <{A},{D}> is 2
© Prentice Hall 34
Equivalence Classes
© Prentice Hall 35
SPADE Algorithm
© Prentice Hall 36
Temporal Association Rules
Transaction has time: <TID, CID, I_1, I_2, …, I_m, t_s, t_e>
[t_s, t_e] is range of time the transaction is active.
Types:
– Inter-transaction rules
– Episode rules
– Trend dependencies
– Sequence association rules
– Calendric association rules
© Prentice Hall 37
Inter-transaction Rules
Intra-transaction association rules
– Traditional association rules
Inter-transaction association rules
– Rules across transactions
– Sliding window – How far apart (time or number of transactions) to look for related itemsets.
© Prentice Hall 38
Episode Rules
Association rules applied to sequences of events.
Episode – set of event predicates and partial ordering on them
© Prentice Hall 39
Trend Dependencies
Association rules across two database states based on time.
Ex: (SSN, =) ⇒ (Salary, ≤)
Confidence = 4/5
Support = 4/36
© Prentice Hall 40
Sequence Association Rules
Association rules involving sequences
Ex: <{A},{C}> ⇒ <{A},{D}>
Support = 1/3
Confidence = 1
© Prentice Hall 41
Calendric Association Rules
Each transaction has a unique timestamp.
Group transactions based on time interval within which they occur.
Identify large itemsets by looking at transactions only in this predefined interval.
© Prentice Hall 42
Spatial Mining Outline
Goal: Provide an introduction to some spatial mining techniques.
Introduction
Spatial Data Overview
Spatial Data Mining Primitives
Generalization/Specialization
Spatial Rules
Spatial Classification
Spatial Clustering
© Prentice Hall 43
Spatial Object
Contains both spatial and nonspatial attributes.
Must have a location-type attribute:
– Latitude/longitude
– Zip code
– Street address
May retrieve object using either (or both) spatial or nonspatial attributes.
© Prentice Hall 44
Spatial Data Mining Applications
Geology
GIS Systems
Environmental Science
Agriculture
Medicine
Robotics
May involve both spatial and temporal aspects
© Prentice Hall 45
Spatial Queries
Spatial selection may involve specialized selection comparison operations:
– Near
– North, South, East, West
– Contained in
– Overlap/intersect
Region (Range) Query – find objects that intersect a given region.
Nearest Neighbor Query – find object close to identified object.
Distance Scan – find object within a certain distance of an identified object where distance is made increasingly larger.
© Prentice Hall 46
Spatial Data Structures
Data structures designed specifically to store or index spatial data.
Often based on B-tree or Binary Search Tree
Cluster data on disk based on geographic location.
May represent complex spatial structure by placing the spatial object in a containing structure of a specific geographic shape.
Techniques:
– Quad Tree
– R-Tree
– k-D Tree
© Prentice Hall 47
MBR
Minimum Bounding Rectangle
Smallest rectangle that completely contains the object
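As a small sketch, the MBR of an object given as a list of (x, y) vertices is the extent of its coordinates along each axis. The axis-aligned orientation and the (min_x, min_y, max_x, max_y) tuple layout are assumptions made here for illustration.

```python
def mbr(points):
    """Axis-aligned minimum bounding rectangle of a set of (x, y) points.

    Returns (min_x, min_y, max_x, max_y).
    """
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a, b):
    """Cheap filter: can the objects inside rectangles a and b overlap at all?"""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]
```

Two objects whose MBRs do not intersect cannot intersect themselves, which is what makes the rectangle useful as a cheap first test before exact geometry is compared.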
© Prentice Hall 48
MBR Examples
© Prentice Hall 49
Quad Tree
Hierarchical decomposition of the space into quadrants (MBRs)
Each level in the tree represents the object as the set of quadrants which contain any portion of the object.
Each level is a more exact representation of the object.
The number of levels is determined by the degree of accuracy desired.
© Prentice Hall 50
Quad Tree Example
© Prentice Hall 51
R-Tree
As with Quad Tree the region is divided into successively smaller rectangles (MBRs).
Rectangles need not be of the same size or number at each level.
Rectangles may actually overlap.
Lowest level cell has only one object.
Tree maintenance algorithms similar to those for B-trees.
© Prentice Hall 52
R-Tree Example
© Prentice Hall 53
K-D Tree
Designed for multi-attribute data, not necessarily spatial
Variation of binary search tree
Each level is used to index one of the dimensions of the spatial object.
Lowest level cell has only one object
Divisions not based on MBRs but successive divisions of the dimension range.
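A minimal sketch of the idea: each tree level splits on one dimension, cycling through the dimensions as the depth grows. This variant stores one point at every node and splits at the median, which is a common textbook construction and an assumption here, not necessarily the exact layout the slide has in mind. The dictionary node layout is illustrative.

```python
def build(points, depth=0):
    """Build a k-D tree; each level splits on one dimension in turn."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2  # median point becomes this node
    return {
        "point": points[mid],
        "axis": axis,
        "left": build(points[:mid], depth + 1),
        "right": build(points[mid + 1:], depth + 1),
    }


def contains(node, target):
    """Exact-match search that follows the splitting dimension at each level."""
    if node is None:
        return False
    if node["point"] == target:
        return True
    axis = node["axis"]
    if target[axis] < node["point"][axis]:
        return contains(node["left"], target)
    if target[axis] > node["point"][axis]:
        return contains(node["right"], target)
    # Equal on the split dimension: the point could be on either side.
    return contains(node["left"], target) or contains(node["right"], target)
```

Search discards half of the remaining points at most levels, much as a binary search tree does, but compares a different dimension at each level.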
© Prentice Hall 54
k-D Tree Example
© Prentice Hall 55
Topological Relationships
Disjoint
Overlaps or Intersects
Equals
Covered by or inside or contained in
Covers or contains
© Prentice Hall 56
Distance Between Objects
Euclidean
Manhattan
Extensions:
© Prentice Hall 57
Progressive Refinement
Make approximate answers prior to more accurate ones.
Filter out data not part of answer
Hierarchical view of data based on spatial relationships
Coarse predicate recursively refined
© Prentice Hall 58
Progressive Refinement
© Prentice Hall 59
Spatial Data Dominant Algorithm
© Prentice Hall 60
STING
STatistical Information Grid-based
Hierarchical technique to divide area into rectangular cells
Grid data structure contains summary information about each cell
Hierarchical clustering
Similar to quad tree
© Prentice Hall 61
STING
© Prentice Hall 62
STING Build Algorithm
© Prentice Hall 63
STING Algorithm
© Prentice Hall 64
Spatial Rules
Characteristic Rule
– The average family income in Dallas is $50,000.
Discriminant Rule
– The average family income in Dallas is $50,000, while in Plano the average income is $75,000.
Association Rule
– The average family income in Dallas for families living near White Rock Lake is $100,000.
© Prentice Hall 65
Spatial Association Rules
Either antecedent or consequent must contain spatial predicates.
View underlying database as set of spatial objects.
May create using a type of progressive refinement
© Prentice Hall 66
Spatial Association Rule Algorithm
© Prentice Hall 67
Spatial Classification
Partition spatial objects
May use nonspatial attributes and/or spatial attributes
Generalization and progressive refinement may be used.
© Prentice Hall 68
ID3 Extension
Neighborhood Graph
– Nodes – objects
– Edges – connects neighbors
Definition of neighborhood varies
ID3 considers nonspatial attributes of all objects in a neighborhood (not just one) for classification.
© Prentice Hall 69
Spatial Decision Tree
Approach similar to that used for spatial association rules.
Spatial objects can be described based on objects close to them – Buffer.
Description of class based on aggregation of nearby objects.
© Prentice Hall 70
Spatial Decision Tree Algorithm
© Prentice Hall 71
Spatial Clustering
Detect clusters of irregular shapes
Use of centroids and simple distance approaches may not work well.
Clusters should be independent of order of input.
© Prentice Hall 72
Spatial Clustering
© Prentice Hall 73
CLARANS Extensions
Remove main memory assumption of CLARANS.
Use spatial index techniques.
Use sampling and R*-tree to identify central objects.
Change cost calculations by reducing the number of objects examined.
Voronoi Diagram
© Prentice Hall 74
Voronoi
© Prentice Hall 75
SD(CLARANS)
Spatial Dominant
First clusters spatial components using CLARANS
Then iteratively replaces medoids, but limits number of pairs to be searched.
Uses generalization
Uses a learning tool to derive description of cluster.
© Prentice Hall 76
SD(CLARANS) Algorithm
© Prentice Hall 77
DBCLASD
Extension of DBSCAN
Distribution Based Clustering of LArge Spatial Databases
Assumes items in cluster are uniformly distributed.
Identifies distribution satisfied by distances between nearest neighbors.
Objects added if distribution is uniform.
© Prentice Hall 78
DBCLASD Algorithm
© Prentice Hall 79
Aggregate Proximity
Aggregate Proximity – measure of how close a cluster is to a feature.
Aggregate proximity relationship finds the k closest features to a cluster.
CRH Algorithm – uses different shapes:
– Encompassing Circle
– Isothetic Rectangle
– Convex Hull
© Prentice Hall 80
CRH
© Prentice Hall 81
Web Mining Outline
Goal: Examine the use of data mining on the World Wide Web
Introduction
Web Content Mining
Web Structure Mining
Web Usage Mining
© Prentice Hall 82
Web Mining Issues
Size
– >350 million pages (1999)
– Grows at about 1 million pages a day
– Google indexes 3 billion documents
Diverse types of data
© Prentice Hall 83
Web Data
Web pages
Intra-page structures
Inter-page structures
Usage data
Supplemental data
– Profiles
– Registration information
– Cookies
© Prentice Hall 84
Web Mining Taxonomy
Modified from [zai01]
© Prentice Hall 85
Web Content Mining
Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
© Prentice Hall 86
Crawlers
Robot (spider) traverses the hypertext structure in the Web.
Collect information from visited pages
Used to construct indexes for search engines
Traditional Crawler – visits entire Web (?) and replaces index
Periodic Crawler – visits portions of the Web and updates subset of index
Incremental Crawler – selectively searches the Web and incrementally modifies index
Focused Crawler – visits pages related to a particular subject
© Prentice Hall 87
Focused Crawler
Only visit links from a page if that page is determined to be relevant.
Classifier is static after learning phase.
Components:
– Classifier which assigns relevance score to each page based on crawl topic.
– Distiller to identify hub pages.
– Crawler visits pages based on classifier and distiller scores.
© Prentice Hall 88
Focused Crawler
Classifier relates documents to topics
Classifier also determines how useful outgoing links are
Hub Pages contain links to many relevant pages. Must be visited even if relevance score is not high.
© Prentice Hall 89
Focused Crawler
© Prentice Hall 90
Context Focused Crawler
Context Graph:
– Context graph created for each seed document.
– Root is the seed document.
– Nodes at each level show documents with links to documents at next higher level.
– Updated during crawl itself.
Approach:
1. Construct context graph and classifiers using seed documents as training data.
2. Perform crawling using classifiers and context graph created.
© Prentice Hall 91
Context Graph
© Prentice Hall 92
Virtual Web View
Multiple Layered DataBase (MLDB) built on top of the Web.
Each layer of the database is more generalized (and smaller) and centralized than the one beneath it.
Upper layers of MLDB are structured and can be accessed with SQL type queries.
Translation tools convert Web documents to XML.
Extraction tools extract desired information to place in first layer of MLDB.
Higher levels contain more summarized data obtained through generalizations of the lower levels.
© Prentice Hall 93
Personalization
Web access or contents tuned to better fit the desires of each user.
Manual techniques identify user’s preferences based on profiles or demographics.
Collaborative filtering identifies preferences based on ratings from similar users.
Content based filtering retrieves pages based on similarity between pages and user profiles.
© Prentice Hall 94
Web Structure Mining
Mine structure (links, graph) of the Web
Techniques
– PageRank
– CLEVER
Create a model of the Web organization.
May be combined with content mining to more effectively retrieve important pages.
© Prentice Hall 95
PageRank
Used by Google
Prioritize pages returned from search by looking at Web structure.
Importance of page is calculated based on number of pages which point to it – Backlinks.
Weighting is used to provide more importance to backlinks coming from important pages.
© Prentice Hall 96
PageRank (cont’d)
PR(p) = c (PR(1)/N_1 + … + PR(n)/N_n)
– PR(i): PageRank for a page i which points to target page p.
– N_i: number of links coming out of page i
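The formula can be applied iteratively: start with equal ranks, recompute every page's rank from its backlinks, and repeat. In this sketch the constant c is treated as a normalizing factor that keeps the ranks summing to 1. That reading, the absence of a damping term, and the requirement that every page has at least one outgoing link are all assumptions. The `links` dictionary maps each page to the pages it points to.

```python
def pagerank(links, iterations=100):
    """Iterate PR(p) = c * sum(PR(i) / N_i) over the pages i that link to p.

    links: dict mapping each page to a non-empty list of pages it links to;
           every linked page must also appear as a key (assumption).
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for i, targets in links.items():
            share = rank[i] / len(targets)  # PR(i) / N_i
            for p in targets:
                new_rank[p] += share
        total = sum(new_rank.values())
        # c is taken as the constant that rescales the ranks to sum to 1.
        rank = {p: value / total for p, value in new_rank.items()}
    return rank
```

For a hypothetical three-page graph where A links to B and C, B links to C, and C links to A, the ranks settle at 0.4, 0.2 and 0.4 for A, B and C.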
© Prentice Hall 97
CLEVER
Identify authoritative and hub pages.
Authoritative Pages:
– Highly important pages.
– Best source for requested information.
Hub Pages:
– Contain links to highly important pages.
© Prentice Hall 98
HITS
Hyperlink-Induced Topic Search
Based on a set of keywords, find a set of relevant pages, R.
Identify hub and authority pages for these.
– Expand R to a base set, B, of pages linked to or from R.
– Calculate weights for authorities and hubs.
Pages with highest ranks in R are returned.
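The weight calculation step can be sketched as a mutual reinforcement loop: a page's authority weight is the sum of the hub weights of pages linking to it, and its hub weight is the sum of the authority weights of pages it links to. This is a simplified illustration, assuming the base set B is already given as a link dictionary; the function name `hits`, the iteration count, and the Euclidean normalization are illustrative choices.

```python
import math


def hits(links, iterations=50):
    """Return (hub, authority) weight dicts for a base set of pages.

    links: dict mapping a page to the list of pages it links to.
    """
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of pages pointing to the page.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub weight: sum of authority weights of pages the page points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize so the weights do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth


hub, auth = hits({"h1": ["a1", "a2"], "h2": ["a1"]})
print(max(hub, key=hub.get), max(auth, key=auth.get))
```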
© Prentice Hall 99
HITS Algorithm
© Prentice Hall 100
Web Usage Mining
Extends work of basic search engines
Search Engines
– IR application
– Keyword based
– Similarity between query and document
– Crawlers
– Indexing
– Profiles
– Link analysis
© Prentice Hall 101
Web Usage Mining Applications
Personalization
Improve structure of a site’s Web pages
Aid in caching and prediction of future page references
Improve design of individual pages
Improve effectiveness of e-commerce (sales and advertising)
© Prentice Hall 102
Web Usage Mining Activities
Preprocessing Web log
– Cleanse
– Remove extraneous information
– Sessionize
Session: Sequence of pages referenced by one user at a sitting.
Pattern Discovery
– Count patterns that occur in sessions
– Pattern is a sequence of page references in a session.
– Similar to association rules
» Transaction: session
» Itemset: pattern (or subset)
» Order is important
Pattern Analysis
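The pattern discovery step can be sketched as counting, for each candidate pattern, how many sessions contain it. The example below is a minimal illustration that restricts patterns to consecutive page references of a fixed length; the function name `count_patterns` and that restriction are assumptions made for brevity.

```python
from collections import Counter


def count_patterns(sessions, length):
    """Count the sessions in which each consecutive page pattern occurs.

    sessions: list of sessions, each a list of page identifiers.
    A pattern is counted at most once per session (session = transaction).
    """
    counts = Counter()
    for session in sessions:
        seen = {
            tuple(session[i:i + length])
            for i in range(len(session) - length + 1)
        }
        counts.update(seen)
    return counts


sessions = [["A", "B", "C"], ["A", "B", "D"], ["B", "C"]]
print(count_patterns(sessions, 2).most_common())
```

Because order is important, the pattern (A, B) is counted separately from (B, A), unlike an itemset in ordinary association rule mining.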
© Prentice Hall 103
ARs in Web Mining
Web Mining:
– Content
– Structure
– Usage
Frequent patterns of sequential page references in Web searching.
Uses:
– Caching
– Clustering users
– Develop user profiles
– Identify important pages
© Prentice Hall 104
Web Usage Mining Issues
Identification of exact user not possible.
Exact sequence of pages referenced by a user not possible due to caching.
Session not well defined
Security, privacy, and legal issues
© Prentice Hall 105
Web Log Cleansing
Replace source IP address with unique but non-identifying ID.
Replace exact URL of pages referenced with unique but non-identifying ID.
Delete error records and records containing non-page data (such as figures and code)
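The three cleansing steps can be sketched as a single pass over the log. This is an illustrative sketch only: the record layout (dictionaries with `ip`, `url`, `status`, and `time` keys), the ID formats `U1`/`P1`, the status threshold, and the suffix list used to detect non-page data are all assumptions, not part of any particular Web server's log format.

```python
# Assumption: these suffixes mark figures and code rather than pages.
NON_PAGE_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")


def cleanse(records):
    """Drop error and non-page records; replace IPs and URLs with opaque IDs."""
    ip_ids, url_ids = {}, {}
    cleaned = []
    for rec in records:
        if rec["status"] >= 400:  # error record (assumed HTTP status codes)
            continue
        if rec["url"].lower().endswith(NON_PAGE_SUFFIXES):
            continue
        # Assign the next unused ID the first time an IP or URL is seen.
        user = ip_ids.setdefault(rec["ip"], f"U{len(ip_ids) + 1}")
        page = url_ids.setdefault(rec["url"], f"P{len(url_ids) + 1}")
        cleaned.append({"user": user, "page": page, "time": rec["time"]})
    return cleaned


log = [
    {"ip": "1.1.1.1", "url": "/a.html", "status": 200, "time": 0},
    {"ip": "1.1.1.1", "url": "/logo.gif", "status": 200, "time": 1},
    {"ip": "2.2.2.2", "url": "/a.html", "status": 200, "time": 2},
    {"ip": "2.2.2.2", "url": "/missing.html", "status": 404, "time": 3},
]
print(cleanse(log))
```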
© Prentice Hall 106
Sessionizing
Divide Web log into sessions.
Two common techniques:
– Number of consecutive page references from a source IP address occurring within a predefined time interval (e.g. 25 minutes).
– All consecutive page references from a source IP address where the interclick time is less than a predefined threshold.
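Both techniques can be sketched for the click stream of a single source. The function names, the `(timestamp, page)` tuple layout, and the use of minutes as the time unit are illustrative assumptions; the clicks are assumed to be sorted by time already.

```python
def sessionize_by_window(clicks, window):
    """Start a new session once a click falls outside `window` time units
    measured from the first click of the current session."""
    sessions, start = [], None
    for t, page in clicks:
        if start is None or t - start > window:
            sessions.append([page])
            start = t
        else:
            sessions[-1].append(page)
    return sessions


def sessionize_by_gap(clicks, max_gap):
    """Start a new session whenever the interclick time reaches `max_gap`."""
    sessions, last_t = [], None
    for t, page in clicks:
        if last_t is not None and t - last_t < max_gap:
            sessions[-1].append(page)
        else:
            sessions.append([page])
        last_t = t
    return sessions


clicks = [(0, "A"), (10, "B"), (20, "C"), (30, "D")]  # minutes, one source
print(sessionize_by_window(clicks, 25))
print(sessionize_by_gap(clicks, 15))
```

The two techniques can disagree on the same log: with steady clicks every 10 minutes, the gap rule keeps one long session while the fixed window splits it.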
© Prentice Hall 107
Data Structures
Keep track of patterns identified during Web usage mining process
Common techniques:
– Trie
– Suffix Tree
– Generalized Suffix Tree
– WAP Tree
© Prentice Hall 108
Trie vs. Suffix Tree
Trie:
– Rooted tree
– Edges labeled with a character (page) from the pattern
– Path from root to leaf represents pattern.
Suffix Tree:
– Single child collapsed with parent. Edge contains labels of both prior edges.
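The relationship between the two structures can be sketched with nested dictionaries: build a trie of all suffixes of a session, then collapse every single-child chain into one edge. This is a simple illustration rather than an efficient suffix tree construction; the function names, the one-character page labels, and the `$` end marker are assumptions.

```python
def build_trie(patterns):
    """Each edge is labeled with one character (page) of a pattern."""
    root = {}
    for pattern in patterns:
        node = root
        for page in pattern:
            node = node.setdefault(page, {})
    return root


def collapse(node):
    """Merge each single-child node with its parent edge, joining the labels."""
    out = {}
    for label, child in node.items():
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = collapse(child)
    return out


session = "ABAC$"  # "$" marks the end so no suffix is a prefix of another
suffixes = [session[i:] for i in range(len(session))]
trie = build_trie(suffixes)
print(collapse(trie))
```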
© Prentice Hall 109
Trie and Suffix Tree
© Prentice Hall 110
Generalized Suffix Tree
Suffix tree for multiple sessions.
Contains patterns from all sessions.
Maintains count of frequency of occurrence of a pattern in the node.
WAP Tree: Compressed version of generalized suffix tree
© Prentice Hall 111
Types of Patterns
Algorithms have been developed to discover different types of patterns.
Properties:
– Ordered: Characters (pages) must occur in the exact order in the original session.
– Duplicates: Duplicate characters are allowed in the pattern.
– Consecutive: All characters in pattern must occur consecutively in given session.
– Maximal: Not a subsequence of another pattern.
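The ordered, consecutive, and maximal properties can each be expressed as a small check. The helper names below are hypothetical, and sessions and patterns are represented as strings of one-character page labels purely for brevity.

```python
def occurs_in_order(pattern, session):
    """Ordered: the pattern's pages appear in the session in the same order,
    possibly with other pages in between (a subsequence test)."""
    remaining = iter(session)
    return all(page in remaining for page in pattern)


def occurs_consecutively(pattern, session):
    """Consecutive: the pattern's pages appear back to back in the session."""
    n = len(pattern)
    return any(
        list(session[i:i + n]) == list(pattern)
        for i in range(len(session) - n + 1)
    )


def maximal(patterns):
    """Maximal: keep only patterns that are not a subsequence of another."""
    return [
        p for p in patterns
        if not any(p != q and occurs_in_order(p, q) for q in patterns)
    ]


print(occurs_in_order("AC", "ABC"), occurs_consecutively("AC", "ABC"))
print(maximal(["AB", "ABC", "D"]))
```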
© Prentice Hall 112
Pattern Types
Association Rules
– None of the properties hold
Episodes
– Only ordering holds
Sequential Patterns
– Ordered and maximal
Forward Sequences
– Ordered, consecutive, and maximal
Maximal Frequent Sequences
– All properties hold
© Prentice Hall 113
Episodes
Partially ordered set of pages
Serial episode: totally ordered with time constraint
Parallel episode: partially ordered with time constraint
General episode: partially ordered with no time constraint
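A serial episode check can be sketched as follows. The sketch interprets the time constraint as a maximum span between the first and last page of the episode; that interpretation, the function name, and the `(timestamp, page)` event layout are assumptions made for illustration.

```python
def serial_episode_occurs(events, episode, window):
    """Return True if the pages of `episode` occur in order within `window`
    time units of the episode's first page.

    events: list of (timestamp, page) tuples sorted by timestamp.
    """
    for start_idx, (t0, page) in enumerate(events):
        if page != episode[0]:
            continue
        pos = 1  # index of the next episode page to match
        for t, p in events[start_idx + 1:]:
            if t - t0 > window:
                break  # outside the time constraint for this start
            if pos < len(episode) and p == episode[pos]:
                pos += 1
        if pos == len(episode):
            return True
    return False


events = [(0, "A"), (2, "B"), (9, "C"), (20, "A"), (21, "C")]
print(serial_episode_occurs(events, ["A", "C"], window=5))
```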
© Prentice Hall 114
DAG for Episode