A Brief Overview of Data Mining
- IR Group Meeting
04/11/2006
Qiaozhu Mei
Outline
• Introduction
• Functionalities
• Hot topics
• Research Groups
• Useful Resources
Part 1: Introduction
• Introduction
  – What is data mining?
  – General Process
  – Related Fields
  – Different Views
• Functionalities
• Hot topics
• Research Groups
• Useful Resources
What is Data Mining?
• (From Prof. Jiawei Han’s slides): Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• (From Prof. Sunita Sarawagi’s slides): Process of semi-automatically analyzing large databases to find patterns that are
  – valid: hold on new data with some certainty
  – novel: non-obvious to the system
  – useful: should be possible to act on the item
  – understandable: humans should be able to interpret the pattern
• (From Prof. Vipin Kumar’s slides): Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
What is Data Mining? (cont.)
• Under these definitions:
  – What is not Data Mining?
    • Looking up a phone number in a phone directory
    • Querying a Web search engine for information about “Amazon”
  – What is Data Mining?
    • Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
    • Grouping together similar documents returned by a search engine according to their context
- Tan, Steinbach, Kumar, Introduction to Data Mining
General Process of KDD
– Data mining—core of knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
- Han & Kamber, Data Mining: Concepts and Techniques
Related Fields
• Confluence of Multiple Disciplines
[Figure: Data Mining at the confluence of Database Technology, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
- Han & Kamber, Data Mining: Concepts and Techniques
[Figure: Data Mining overlapping Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• But different…
- Tan, Steinbach, Kumar, Introduction to Data Mining
Differences to Related Fields
• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
• Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but with more stress on
  – scalability in the number of features and instances
  – algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  – automation for handling large, heterogeneous data
- From Prof. Vipin Kumar’s slides
- From Prof. Sunita Sarawagi’s slides
Different Views of Data Mining
• A data mining task can be categorized from different views
• By general functionality and operations:
  – Descriptive data mining
    • Find human-interpretable patterns that describe the data.
    • Clustering / similarity matching
    • Association rules and variants
    • Deviation detection
  – Predictive data mining
    • Use some variables to predict unknown or future values of other variables.
    • Regression
    • Classification
    • Collaborative Filtering
Different Views of Data Mining (II)
• By data to be mined
  – Relational, data warehouse, transactional, stream, object-oriented, sequence, graph, social network, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• By knowledge to be discovered
  – Characterization, discrimination, frequent patterns, association, classification, clustering, trend/deviation, outlier analysis, etc.
• By techniques utilized
  – Database-oriented, data warehouse (OLAP), combinational algorithms, machine learning, statistics, visualization, etc.
• By application adapted
  – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
- Han & Kamber, Data Mining: Concepts and Techniques
Part 2: Functionalities
• Introduction
• Functionalities
  – Data Warehousing and OLAP
  – Frequent patterns, association, correlation and causality
  – Classification and prediction
  – Clustering
  – Outlier analysis, trend and evolution analysis
• Hot topics
• Research Groups
• Useful Resources
Data Warehousing and OLAP
• Data Warehousing:
  – “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• OLAP: on-line analytical processing
  – Major task of a data warehouse system
  – Data analysis and decision making
  – Drill-down, roll-up, exception/discovery driven
• Methodology
  – Data Cubing
  – Iceberg cube
  – Multi-way, BUC, Star, MM, shell, close-cube, etc.
[Figure: lattice of cuboids for the dimensions product, date, and country: all; (product), (date), (country); (product, date), (product, country), (date, country); (product, date, country)]
- Han & Kamber, Data Mining: Concepts and Techniques
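The lattice of group-bys over product, date, and country can be enumerated mechanically. The following is a minimal sketch, not the method from the cited book: the function names `cuboids` and `aggregate`, the `sales` measure, and the sample rows are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import combinations


def cuboids(dimensions):
    """List every group-by (cuboid) of the lattice, from the apex 'all' to the base."""
    lattice = []
    for size in range(len(dimensions) + 1):
        for combo in combinations(dimensions, size):
            lattice.append(combo)
    return lattice


def aggregate(rows, dims, measure="sales"):
    """Sum the (hypothetical) measure for each distinct combination of values of dims."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)


# Hypothetical fact table, for illustration only.
rows = [
    {"product": "tv", "date": "2006-04", "country": "US", "sales": 3},
    {"product": "tv", "date": "2006-05", "country": "US", "sales": 2},
    {"product": "pc", "date": "2006-04", "country": "CA", "sales": 5},
]

for dims in cuboids(["product", "date", "country"]):
    print(dims if dims else "all", aggregate(rows, dims))
```

With n dimensions the lattice has 2^n cuboids, which is why methods such as iceberg cubing avoid materializing all of them.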
Frequent Patterns and Associations
• Frequent pattern: a pattern (itemset, subsequence, substructure, etc.) that occurs frequently in a data set
  – Compare with n-grams, phrases, etc.
• Motivation: Finding inherent regularities in data
• Applications: Basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
• Association rule mining:
  – Given a set of records, each of which contains some number of items from a given collection,
  – produce dependency rules that predict the occurrence of an item based on occurrences of other items.
  – Frequent patterns → association rules → correlations
Mining Frequent Patterns
• Types of data:
  – Itemsets, sequences, graphs.
• Scalable mining methods: three major approaches
  – Apriori (Agrawal & Srikant @VLDB’94)
  – FPgrowth (Han, Pei & Yin @SIGMOD’00)
    • PrefixSpan, CloSpan, gSpan, CloseGraph, etc.
  – Vertical data format approach (CHARM, Zaki & Hsiao @SDM’02)
• Apriori:
  – Candidate pattern generation and pruning
  – Breadth-first search over pattern space
• FPgrowth:
  – Pattern growth through FP-tree, no candidate generation
  – Depth-first search with smart pruning
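The Apriori bullets (candidate generation, pruning, breadth-first search level by level) can be sketched as follows. This is a simplified illustration, not the optimized algorithm from the VLDB’94 paper: the function name `apriori`, the absolute `min_support` count, and the toy baskets are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) != k:
                    continue
                # Pruning: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in frequent
                       for sub in combinations(union, k - 1)):
                    candidates.add(union)
        # Support counting: one pass over the data per level (breadth-first).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market baskets, for illustration only.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
for itemset, support in sorted(frequent_itemsets.items(),
                               key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), support)
```

The pruning step is what keeps the candidate set small: an itemset is only counted if all of its subsets already passed the support threshold.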
Classification and Prediction
• Supervised learning, already discussed in Machine Learning.
  – Classification: constructs a model from the training set and the values (categorical class labels) of a classifying attribute, then uses the model to classify new data
  – Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
• Algorithms:
  – Decision tree based: C4.5, ID3, RainForest, etc.
  – Bayesian methods: Naïve Bayes, Bayesian networks, and many others covered in Machine Learning
  – Discriminative: Perceptron/Winnow, NN, SVM, CB-SVM, etc.
  – Rule-based, associative, k-NN, etc.
  – Prediction: Regression
• Bagging, Boosting, Model Selection, Cross-Validation
Clustering
• Unsupervised learning, as discussed in Machine Learning
  – Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
    • Data points in one cluster are more similar to one another.
    • Data points in separate clusters are less similar to one another.
  – Similarities/distances: many!
• Algorithms:
  – Partition based: K-means, K-Medoids, CLARA, etc.
  – Hierarchical: bottom-up (single/complete/average link), top-down, BIRCH
  – Density-based/Grid-based: DBSCAN, DENCLUE, CLIQUE, etc.
  – Model-based: EM, COBWEB, SOM, etc.
  – High-dimensional, constraint based
Outlier, Trend and Evolution
• Outliers: the set of objects that are considerably dissimilar from the remainder of the data
  – Statistical: hypothesis testing, bug mining
  – Density based
  – Clustering based, etc.
• Deviation/Anomaly Detection
• Fraud Detection
• Trend and Evolution:
  – Usually coupled with outlier analysis
  – Basic functionalities in temporal data mining
  – Trend, cycle, seasonal, irregular patterns
Part 3: Hot Topics
• Introduction
• Functionalities
• Hot topics
  – Mining data streams, mining time series, spatiotemporal data mining, mining social networks, sequential data mining, graph mining, biology data mining, privacy-preserving data mining
  – Text and Web mining
• Research Groups
• Useful Resources
Mining Data Streams
• Data: Data streams: continuous, ordered, fast changing, huge in volume
• Characteristics and Challenges:
  – Huge volumes
  – Fast changing; requires fast, real-time response
  – Random access is expensive, so single-scan algorithms are needed
  – Difficult to store the entire stream, so approximations are needed
• Basic problems:
  – Multi-dimensional on-line analysis of streams
  – Mining outliers and unusual patterns in stream data
  – Clustering data streams
  – Classification of stream data
Mining Data Streams (II)
• Methods:
  – Basic: sliding windows, tilted time frames
  – Counting (FP mining, etc.):
    • Random sampling
    • Approximate counting
  – OLAP:
    • Keep critical layers in stream cube computation
    • Partial materialization
    • Outlier: exception-based exploration
  – Clustering:
    • Online microclustering and offline macroclustering
• Text Related Applications:
  – Web logs and Web page click streams
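As an illustration of the "random sampling" idea under the single-scan constraint, the sketch below uses reservoir sampling. The slides do not name a specific sampling method, so this choice is an assumption; the function name `reservoir_sample` and the `seed` parameter are likewise hypothetical.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of k items from a stream in a single pass.

    Only k items are held in memory, regardless of the stream length.
    """
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k/n.
            j = rng.randrange(n)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 5 click events from a (simulated) stream of 10,000.
sample = reservoir_sample(range(10_000), k=5, seed=42)
print(sample)
```

The stream is never revisited and its total length need not be known in advance, which matches the "single scan" and "approximation" requirements listed for stream data.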
Mining Time series
• Data: Time-series database
  – Consists of sequences of values or events changing with time
  – Data is recorded at regular intervals
• Characteristics and Challenges:
  – Characteristic time-series components: trend, cycle, seasonal, irregular patterns
• Basic Problems:
  – Trend discovery, similarity search, outlier detection, prediction, and clustering
Mining Time series (II)
• Methods:
  – Statistical modeling (regression, spline, mixture model, etc.)
  – Data transformation (DFT, DWT)
  – Sliding windows, atomic matching, window stitching, subsequence ordering
  – Clustering
• Text Related Applications:
  – Transliteration mining, temporal text mining, word bursting, etc.
- Han & Kamber, Data Mining: Concepts and Techniques
Spatiotemporal data mining
• Data: object data sets, spatial/spatiotemporal databases and data warehouses
• Characteristics and Challenges:
  – Generalizing detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
  – Handling objects in space that have identity and well-defined extents, locations, and relationships
  – Merging sets of geographic areas by spatial operations
• Basic Problems:
  – Querying objects; distribution/cluster/correlation/evolution/trend analysis
Spatiotemporal data mining (II)
• Methods
  – GIS (Geographic Information System): analysis and visualization of geographic data
    • Search, location analysis, terrain analysis, distribution, spatial analysis/statistics, measurement
  – Indexing spatial data (R-tree, etc.)
  – Modeling single objects with points, lines, and regions
  – Modeling spatially related collections of objects: plane partitions and networks
  – Spatiotemporal patterns, correlations, trend analysis, clustering…
• Text Related Applications:
  – Spatiotemporal text mining; community evolution in weblogs
  – Information diffusion; web evolution
Special topics in Frequent Pattern Mining
• Association rule mining and frequent itemset mining are pretty old topics
• However, some special topics of frequent pattern mining are still hot
  – Sequential pattern mining
  – Graph mining
  – Pattern post-processing
Sequential pattern mining
• Data: sequence database
• Basic problems:
  – Discovery of frequent subsequences (gaps allowed, in contrast to n-grams); closed subsequences
  – Sequence similarity search, sequence alignment
• Methods:
  – Apriori: GSP
  – FP-Growth: PrefixSpan, CloSpan
  – BLAST, Hidden Markov models, CRF, etc.
• Text Related Applications:
  – Most text patterns are sequential patterns
  – Phrase extraction, entity/relation extraction, opinion mining, etc.
  – Biology sequence modeling
- Han & Kamber, Data Mining: Concepts and Techniques
Graph Mining
• Data: graph databases (like social networks, but with multiple graphs and more general); examples include
  – Chemical component, protein structure, program flow, XML/Web
  – Directed, undirected, labeled/unlabeled, weighted, 2-D/3-D, etc.
• Characteristics and Challenges:
  – Theoretically, most problems are of high complexity, but in practice the graphs are solvable
  – Too many substructures to index
  – …
• Basic problems
  – Frequent subgraph mining
  – Closed subgraph mining
  – Graph indexing by substructures
  – Similarity search
- Han & Kamber, Data Mining: Concepts and Techniques
Graph Mining (II)
• Methods:
  – Subgraph mining: Apriori (e.g. FSG), pattern growth (e.g. gSpan)
  – gSpan: pattern growth, depth-first search, active elimination of duplicated subgraphs; flatten a graph into a sequence using depth-first search; enumerate graphs using right-most extension
  – CloseGraph: mining closed subgraph patterns
  – gIndex: identify frequent structures, prune redundancy to maintain discriminative structures, create an index on such structures
  – Similarity search: indexing; feature-based similarities; estimating missing features
• Text Related Applications:
  – Multi-resolution topic map, entity-relation network, pathway extraction, etc.
Graph Mining (III): Graph Indexing & Querying
• More on Graph Indexing and Similarity Search
• Comparing to Text Retrieval:
Aspect | Text Retrieval | Graph Indexing & Search
Objects | Documents | Graphs
Basic Units | Words | Frequent structures
Pruning | Stopwords | Need to mine frequent subgraphs
Redundancy? | Stemming | Need discriminative structures
Representation | Term vectors | Feature vectors
Dimensions | Terms | Substructures
Relevance | Vector similarity | Vector similarity
Approximation | No | Yes, need to estimate missing features (relax substructures)
Graph Mining (IV): Graph Indexing & Querying
• What if we want to index on phrases instead of words?
  – Need to extract phrases first
  – N-grams/sequential patterns; have to remove redundancy
    • E.g. “natural language processing” vs. “language processing”
  – Substructures are like phrases…
• Can IR help?
  – Representation and similarity measures? (Vector space models, probabilistic models…)
  – How to weight features? (TF-IDF, …)
  – Generative models?
  – Query expansion? Feedback?
Pattern Post-processing
• Data: frequent patterns extracted by mining algorithms
• Challenge:
  – Mining algorithms output an explosively large number of patterns
  – How to interpret the extracted frequent patterns
• Basic Problems:
  – Pattern summarization
  – Mining compressed patterns
  – Top-K patterns
  – Pattern annotation
  – User-oriented ranking
• Methods:
  – Modeling pattern profiles, coverage, and contexts
  – Using clustering to summarize and compress patterns
  – Bridging IR/NLP and frequent pattern mining: profile, context, ranking, feedback, filtering, summarization, MMR, etc.
Mining Social Networks
• Data: Graphs/networks with nodes and links
  – Examples: communication networks, webpages, citations, biological pathways, etc.
• Characteristics and Challenges:
  – Connected components: few
  – Network diameter: small
  – Clustering: high degree
  – Degree distribution: heavy-tailed
  – Modeling logical/statistical dependencies
• Basic Problems:
  – Model the generation of graphs/networks
  – Link-based object ranking, classification, identification, clustering, entity resolution
  – Link prediction, querying, community discovery
H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)
Mining Social Networks (II)
• Methods:
  – Graph generation models: trying to derive generative models that explain the characteristics and evolution of social networks/graphs
  – Vertex ranking: PageRank, HITS, etc.
  – Community detection: hierarchical clustering, spectral clustering, stochastic modeling, etc.
  – Link-based classification: semi-supervised learning, propagation
  – Entity resolution: duplicate prediction, collective resolution, probabilistic models
  – Link prediction: binary classification problem, local conditional probabilistic models
  – Substructure mining: graph pattern mining, indexing
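To make the vertex-ranking item concrete, here is a minimal power-iteration sketch of PageRank. It is an illustrative simplification, not a production implementation: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, the handling of dangling nodes, and the toy link graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each node to the list of nodes it points to.
    """
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among its out-links.
                share = damping * rank[node] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node (assumption): spread its rank over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical four-page link graph.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 4))
```

Page "c" collects links from three pages and ends up ranked highest, while "d", which nobody links to, keeps only the random-jump share.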
Mining Social Networks (III)
• Generative models of social network/graph generation and evolution
• Random graphs (Erdős–Rényi models)
  – Fix vertices, generate each edge independently with probability p
  – N(N-1)/2 trials of a biased coin flip, p ~ 1/N
  – Degree distribution is approximately Poisson for large N, E[d] = p(N-1); E[# of edges] = pN(N-1)/2
  – Parameter: p
• Graph process model:
  – Starting with no edges, keep adding one edge at a time
  – Always choose the next edge randomly from among all missing edges
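The random graph model above amounts to N(N-1)/2 independent coin flips. A minimal sketch follows; the function names `erdos_renyi`, `expected_degree`, and `expected_edges` and the `seed` parameter are hypothetical, and vertices are numbered from 0 for convenience.

```python
import random


def erdos_renyi(n, p, seed=None):
    """Generate an undirected random graph: each of the n(n-1)/2 possible
    edges is included independently with probability p."""
    rng = random.Random(seed)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:  # one biased coin flip per vertex pair
                edges.append((u, v))
    return edges


def expected_degree(n, p):
    """E[d] = p(N-1)."""
    return p * (n - 1)


def expected_edges(n, p):
    """E[# of edges] = pN(N-1)/2."""
    return p * n * (n - 1) / 2


n, p = 1000, 1 / 1000
edges = erdos_renyi(n, p, seed=0)
print("edges generated:", len(edges), "expected:", expected_edges(n, p))
print("expected degree:", expected_degree(n, p))
```

Comparing the generated edge count with `expected_edges` over several seeds is a quick sanity check on the formulas in the slide.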
Mining Social Networks (IV)
• α-model (Watts-Strogatz models, small-world)
  – For vertices u, v, define m(u,v) to be the number of common neighbors (so far)
  – Define the propensity R(u,v) of u to connect to v
    • if m(u,v) >= k, R(u,v) = 1 (share too many friends, must connect)
    • if m(u,v) = 0, R(u,v) = p (no mutual friends, no bias to connect)
    • else, R(u,v) = p + (m(u,v)/k)^α (1-p) (biased to connect)
  – Generate the network incrementally, with R(u,v) as the edge probability
  – As α → ∞, the model becomes similar to Erdős–Rényi models
  – Need to tune parameters α, p, k
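The propensity rule can be written as a small function. This sketch assumes the intermediate case raises m(u,v)/k to the power α, which is the usual statement of the α-model and is what makes the α → ∞ limit collapse to the constant probability p; the function name `propensity` is hypothetical.

```python
def propensity(m, k, p, alpha):
    """Propensity R(u, v) of u to connect to v in the alpha-model.

    m     -- number of common neighbors of u and v so far
    k     -- threshold of shared friends above which they must connect
    p     -- baseline connection probability
    alpha -- exponent controlling how fast shared friends raise the propensity
             (assumed form: p + (m/k)**alpha * (1 - p))
    """
    if m >= k:
        return 1.0  # share too many friends, must connect
    if m == 0:
        return p    # no mutual friends, no bias to connect
    return p + (m / k) ** alpha * (1.0 - p)


# With a large alpha, anything short of k mutual friends is treated almost
# like no mutual friends at all, which resembles a plain random graph.
for alpha in (0.5, 1, 2, 10, 100):
    print(alpha, round(propensity(m=2, k=4, p=0.01, alpha=alpha), 4))
```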
Mining Social Networks (V)
• Scale-free models: N (# of vertices) is not fixed
  – Start with (say) two vertices connected by an edge
  – Let Z = Σ d(j), where d(j) = degree of vertex j so far
  – Add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z
  – The rich get richer…
• Evaluation of generative models
  – Can they explain all the characteristics of social networks?
  – Parameter tuning?
• Other models for social network analysis
  – Copying model: leads to communities
  – Forest Fire Model
  – Electricity network (not a generative model, but interesting)
Mining Social Networks (VI)
• Text Related Applications: quite a lot!
  – Ranking webpages
  – Multi-resolution concept/topic map
  – Citation impact of scientific literature
  – Entity-relation extraction
  – Bioinformatics: pathway extraction
  – Reference reconciliation
  – Web structure evolution
  – Community discovery in weblogs
Text and Web mining
• Data: text, unstructured/semi-structured; webpages with linkages, user logs
  – E.g. webpages, news, email, weblogs, scientific literature, citations, customer reviews, forums, search logs, chat logs, legal documents, etc.
• Challenges:
  – Modeling unstructured/semi-structured data
  – Coupling with Natural Language Processing
  – Handling high dimensionality
  – Handling data sparseness and ambiguity
  – The Web is too complicated!
Text and Web mining (II)
• Selected Problems:
  – Text categorization/clustering (already covered in NLP and ML)
  – Word sense disambiguation (covered in NLP)
  – Information extraction (covered in NLP)
  – Dimension reduction (overlapping with ML and IR)
  – Collaborative filtering, user-interest modeling
  – Topic detection and tracking
  – Comparative text mining, theme-based text mining
  – Transliteration mining
  – Email clustering / spam detection
  – Opinion mining (overlapping with NLP)
  – Social networks related (already covered)
  – Temporal text mining
  – Vision-based page segmentation / block-based search
Text and Web mining (III)
• Methods: Confluence of Multiple Disciplines
  – Database: data integration, schema matching, XML
  – Data mining: sequential pattern mining, association rule mining, …
  – IR: search, language models, feedback, …
  – Machine Learning: SVD, supervised/unsupervised learning, semi-supervised learning, topic models, …
  – NLP: POS tagging, parsing, context modeling, sentiment extraction, entity extraction, …
  – Statistical Learning: Bayesian methods, word bursting, time-series analysis, hypothesis testing, other statistical models, …
Text and Web mining (IV)
• Resolution:
  – Word level: word sense disambiguation, word bursting, transliteration mining
  – Entity level: information extraction, entity-relation network
  – Pattern level: opinion mining, relation extraction
  – Document level: document classification/clustering
  – Theme level: PLSI, LDA, comparative text mining, temporal text mining/spatiotemporal text mining
  – Topic level: topic detection and tracking, email threading
  – Web level: social network, weblog mining, block-based search
• Selected topics will be discussed in the next meeting.
Part 4: Research Groups
• Introduction
• Functionalities
• Hot topics
• Research Groups
  – Stanford, CMU, UIUC, Wisc, Helsinki, UMN
  – IBM, Microsoft, MSRA, Yahoo!
  – Others
• Useful Resources
Research Groups
• Rakesh Agrawal
  – One of the leaders in data mining
  – Frequent patterns, privacy-preserving data mining
• Stanford: Jerome H. Friedman
  – http://www-stat.stanford.edu/~jhf/
  – Strong statistical flavor, machine learning, boosting
• CMU: Christos Faloutsos
  – http://www.cs.cmu.edu/~christos/
  – Graph mining, social networks, stream data mining, image/multimedia mining, time-series mining
• UIUC: Jiawei Han
  – http://www-sal.cs.uiuc.edu/~hanj/
  – Many! Frequent pattern mining, graph mining, OLAP/cubing, stream data mining, classification, clustering, …
Research Groups (II)
• University of Helsinki: Heikki Mannila
  – http://www.cs.helsinki.fi/research/fdk/
  – http://www.cs.helsinki.fi/u/mannila/
  – Frequent itemset mining, computational biology
• Wisconsin: Raghu Ramakrishnan
  – http://www.cs.wisc.edu/dmi/
  – http://www.cs.wisc.edu/~raghu/
  – Data warehousing, cubing, classification/clustering
• Minnesota: Vipin Kumar
  – http://www-users.cs.umn.edu/~kumar/
  – Spatiotemporal data mining
• IBM T.J. Watson: Philip S. Yu
  – http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html
  – http://www.research.ibm.com/people/p/psyu/index.html
  – Frequent pattern mining, graph mining, data streams
Research Groups (III)
• Microsoft Research Redmond: Surajit Chaudhuri
  – http://research.microsoft.com/dmx/
  – Database related, data cleaning, etc.
• Microsoft Research Redmond: Eric Brill
  – http://research.microsoft.com/tmsn/
  – http://research.microsoft.com/~brill/
  – Text mining, search and navigation research, NLP
• Microsoft Research Asia:
  – http://research.microsoft.com/wsm/
  – Web search, web/text mining
• Yahoo! Research: Prabhakar Raghavan
  – http://research.yahoo.com/researcher.shtml
  – http://theory.stanford.edu/~pragh/
  – Web/text mining, social networks
Research Groups (IV)
• IBM WebFountain
  – http://www.almaden.ibm.com/webfountain/
• UIC: Bing Liu
  – http://www.cs.uic.edu/~liub/
  – Association rule mining, web/text mining
• UNC: Wei Wang
  – http://www.cs.unc.edu/~weiwang/
  – Biology data mining, frequent pattern mining
• Simon Fraser: Jian Pei
  – http://www.cs.sfu.ca/~jpei/
  – Sequential pattern mining, OLAP
• National University of Singapore: Anthony K.H. Tung
  – http://www.comp.nus.edu.sg/~atung/
  – Spatial data mining, biology data mining
• …
Part 5: Useful Resources
• Introduction
• Functionalities
• Hot topics
• Research Groups
• Useful Resources
  – Text Books
  – Toolkits
  – Conferences
  – Others
Text Books
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005
- From Prof. Jiawei Han’s slides
Toolkits
• Weka: Data mining software in Java
  – http://www.cs.waikato.ac.nz/%7Eml/weka/
• IlliMine (Illinois Data Mining System)
  – http://illimine.cs.uiuc.edu/
  – Data cubing
  – Frequent pattern mining
  – Sequential pattern mining
  – Graph pattern mining
  – Classification
• Collected by Vipin Kumar:
  – http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm
Conferences
• KDD Conferences
  – ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  – SIAM Data Mining Conf. (SDM)
  – (IEEE) Int. Conf. on Data Mining (ICDM)
  – Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  – Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
• Other related conferences
  – ACM SIGMOD
  – VLDB
  – (IEEE) ICDE
  – WWW, SIGIR
  – ICML, CVPR, NIPS
• Journals
  – Data Mining and Knowledge Discovery (DAMI or DMKD)
  – IEEE Trans. on Knowledge and Data Eng. (TKDE)
  – KDD Explorations
  – ACM Trans. on KDD
- From Prof. Jiawei Han’s slides
Others
• KDnuggets
  – http://www.kdnuggets.com/
• Tutorial: Machine Learning Techniques for Data Mining (WEKA) slides
  – Eibe Frank, University of Waikato
  – http://books.elsevier.com/companions/1558605525?country=United+States
• Ideas for course projects in data mining
  – Collected by Vipin Kumar
  – http://www-users.cs.umn.edu/~kumar/dmbook/projects.htm
End of the presentation
Thanks!