A Brief Overview of Data Mining

A Brief Overview of Data Mining - IR Group Meeting 04/11/2006 Qiaozhu Mei


Page 1: A Brief Overview of Data Mining

A Brief Overview of Data Mining

- IR Group Meeting

04/11/2006

Qiaozhu Mei

Page 2: A Brief Overview of Data Mining

Outline

• Introduction
• Functionalities
• Hot topics
• Research Groups
• Useful Resources

Page 3: A Brief Overview of Data Mining

Part 1: Introduction

• Introduction
  – What is data mining?
  – General Process
  – Related Fields
  – Different Views
• Functionalities
• Hot topics
• Research Groups
• Useful Resources

Page 4: A Brief Overview of Data Mining

What is Data Mining?

• (From Prof. Jiawei Han’s slides): Data mining (knowledge discovery from data)
  – Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• (From Prof. Sunita Sarawagi’s slides): Process of semi-automatically analyzing large databases to find patterns that are
  – valid: hold on new data with some certainty
  – novel: non-obvious to the system
  – useful: should be possible to act on the item
  – understandable: humans should be able to interpret the pattern
• (From Prof. Vipin Kumar’s slides): Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Page 5: A Brief Overview of Data Mining

What is Data Mining? (cont.)

• Under these definitions:
  – What is not data mining?
    • Looking up a phone number in a phone directory
    • Querying a Web search engine for information about “Amazon”
  – What is data mining?
    • Discovering that certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
    • Grouping together similar documents returned by a search engine according to their context
- Tan, Steinbach, Kumar, Introduction to Data Mining

Page 6: A Brief Overview of Data Mining

General Process of KDD

– Data mining: core of the knowledge discovery process
[Figure: KDD pipeline. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
- Han & Kamber, Data Mining: Concepts and Techniques

Page 7: A Brief Overview of Data Mining

Related Fields

• Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Technology, Statistics, Machine Learning, Visualization, Algorithm, and Other Disciplines]
- Han & Kamber, Data Mining: Concepts and Techniques
[Figure: Data Mining at the overlap of Statistics/AI, Machine Learning/Pattern Recognition, and Database systems]
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• But different…
- Tan, Steinbach, Kumar, Introduction to Data Mining

Page 8: A Brief Overview of Data Mining

Differences to Related Fields

• Traditional techniques may be unsuitable due to
  – Enormity of data
  – High dimensionality of data
  – Heterogeneous, distributed nature of data
• Overlaps with machine learning, statistics, artificial intelligence, databases, and visualization, but puts more stress on
  – scalability in the number of features and instances
  – algorithms and architectures, whereas the foundations of methods and formulations are provided by statistics and machine learning
  – automation for handling large, heterogeneous data
- From Prof. Vipin Kumar’s slides
- From Prof. Sunita Sarawagi’s slides

Page 9: A Brief Overview of Data Mining

Different Views of Data Mining

• Categorize a data mining task from different views
• By general functionality and operations:
  – Descriptive data mining
    • Find human-interpretable patterns that describe the data.
    • Clustering / similarity matching
    • Association rules and variants
    • Deviation detection
  – Predictive data mining
    • Use some variables to predict unknown or future values of other variables.
    • Regression
    • Classification
    • Collaborative Filtering

Page 10: A Brief Overview of Data Mining

Different Views of Data Mining (II)

• By data to be mined
  – Relational, data warehouse, transactional, stream, object-oriented, sequence, graph, social network, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
• By knowledge to be discovered
  – Characterization, discrimination, frequent patterns, association, classification, clustering, trend/deviation, outlier analysis, etc.
• By techniques utilized
  – Database-oriented, data warehouse (OLAP), combinational algorithms, machine learning, statistics, visualization, etc.
• By application adapted
  – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
- Han & Kamber, Data Mining: Concepts and Techniques

Page 11: A Brief Overview of Data Mining

Part 2: Functionalities

• Introduction
• Functionalities
  – Data Warehousing and OLAP
  – Frequent patterns, association, correlation and causality
  – Classification and prediction
  – Clustering
  – Outlier analysis, Trend and evolution analysis
• Hot topics
• Research Groups
• Useful Resources

Page 12: A Brief Overview of Data Mining

Data Warehousing and OLAP

• Data Warehousing:
  – “A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.” (W. H. Inmon)
• OLAP: on-line analytical processing
  – Major task of data warehouse system
  – Data analysis and decision making
  – Drill-down, roll-up, exception/discovery driven
• Methodology
  – Data Cubing
  – Iceberg cube
  – Multi-way, BUC, Star, MM, shell, close-cube, etc.
[Figure: cuboid lattice over three dimensions. all → product, date, country → (product, date), (product, country), (date, country) → (product, date, country)]
- Han & Kamber, Data Mining: Concepts and Techniques
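The cuboid lattice on this slide can be sketched in a few lines. This is a naive full materialization, not one of the named cubing algorithms (Multi-way, BUC, Star); the function names, the `sales` measure, and the sample rows are all made up for illustration.

```python
from collections import defaultdict
from itertools import combinations

def cuboids(dimensions):
    """Yield every group-by (cuboid), from the apex () up to the base cuboid."""
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            yield dims

def compute_cube(rows, dimensions, measure):
    """Sum `measure` for every cell of every cuboid (naive: one pass per cuboid)."""
    cube = {}
    for dims in cuboids(dimensions):
        totals = defaultdict(int)
        for row in rows:
            key = tuple(row[d] for d in dims)
            totals[key] += row[measure]
        cube[dims] = dict(totals)
    return cube

# Hypothetical fact table with the three dimensions shown in the lattice.
rows = [
    {"product": "tv", "date": "2006-04", "country": "US", "sales": 3},
    {"product": "tv", "date": "2006-05", "country": "CA", "sales": 2},
    {"product": "pc", "date": "2006-04", "country": "US", "sales": 5},
]
cube = compute_cube(rows, ("product", "date", "country"), "sales")
```

With three dimensions the lattice has 2^3 = 8 cuboids; `cube[()]` is the "all" apex and `cube[("product", "date", "country")]` is the base cuboid. Rolling up corresponds to moving to a cuboid with fewer dimensions, drilling down to one with more.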

Page 13: A Brief Overview of Data Mining

Frequent Patterns and Associations

• Frequent pattern: a pattern (itemset, subsequence, substructure, etc.) that occurs frequently in a data set
  – Compare with n-grams, phrases, etc.
• Motivation: finding inherent regularities in data
• Applications: basket data analysis, cross-marketing, catalog design, sales campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
• Association rule mining:
  – Given a set of records, each of which contains some number of items from a given collection,
  – produce dependency rules that predict the occurrence of an item based on occurrences of other items.
  – Frequent patterns → association rules → correlations
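As a small illustration of what a dependency rule is measured by, the sketch below computes support and confidence, the two standard rule measures. The basket data and function names are invented for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market baskets; each transaction is a set of items.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]

rule_support = support(baskets, {"diapers", "beer"})          # 3 of 4 baskets
rule_confidence = confidence(baskets, {"diapers"}, {"beer"})  # every diaper basket has beer
```

A rule such as {diapers} → {beer} is usually reported only if both measures clear user-chosen thresholds.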

Page 14: A Brief Overview of Data Mining

Mining Frequent Patterns

• Types of data:
  – Itemsets, sequences, graphs.
• Scalable mining methods: three major approaches
  – Apriori (Agrawal & Srikant @VLDB’94)
  – FPgrowth (Han, Pei & Yin @SIGMOD’00)
    • PrefixSpan, CloSpan, gSpan, CloseGraph, etc.
  – Vertical data format approach (CHARM, Zaki & Hsiao @SDM’02)
• Apriori:
  – Candidate pattern generation and pruning
  – Breadth-first search over pattern space
• FPgrowth:
  – Pattern growth through FP-tree, no candidate generation
  – Depth-first search, doing pruning smartly
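The Apriori bullets (candidate generation, pruning, breadth-first search) can be sketched as follows. This is a minimal teaching version under simplifying assumptions: it recounts supports by rescanning all transactions at every level and uses an absolute count threshold, with none of the optimizations of the published algorithm. The sample baskets are made up.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item that appears anywhere.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate with one scan over the data (breadth-first level k).
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
]
frequent = apriori(baskets, min_count=3)
```

The prune step relies on the Apriori property: no superset of an infrequent itemset can be frequent, so such candidates never need to be counted.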

Page 15: A Brief Overview of Data Mining

Classification and Prediction

• Supervised learning, already discussed in Machine Learning.
  – Classification: constructs a model from the training set and the values (categorical class labels) of a classifying attribute, and uses it to classify new data
  – Prediction: models continuous-valued functions, i.e., predicts unknown or missing values
• Algorithms:
  – Decision tree based: C4.5, ID3, RainForest, etc.
  – Bayesian methods: Naïve Bayes, Bayesian networks, and many others covered in Machine Learning
  – Discriminative: Perceptron/Winnow, NN, SVM, CB-SVM, etc.
  – Rule-based, associative, k-NN, etc.
  – Prediction: regression
• Bagging, Boosting, Model Selection, Cross-Validation

Page 16: A Brief Overview of Data Mining

Clustering

• Unsupervised learning, as discussed in Machine Learning
  – Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
    • data points in one cluster are more similar to one another;
    • data points in separate clusters are less similar to one another.
  – Similarities/distances: many!
• Algorithms:
  – Partition based: K-means, K-Medoids, CLARA, etc.
  – Hierarchical: bottom-up (single/complete/average link), top-down, BIRCH
  – Density-based/Grid-based: DBSCAN, DENCLUE, CLIQUE, etc.
  – Model-based: EM, COBWEB, SOM, etc.
  – High-dimensional, constraint based
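To make the partition-based family concrete, here is a minimal K-means sketch (Lloyd's iteration) in plain Python. It assumes squared Euclidean distance and caller-supplied initial centroids; the points and starting guesses are illustrative only.

```python
def kmeans(points, centroids, iterations=20):
    """Lloyd's algorithm on tuples of numbers; `centroids` are the initial guesses."""
    centroids = [tuple(c) for c in centroids]
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its members
        # (an empty cluster keeps its previous centroid).
        updated = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else c
            for members, c in zip(clusters, centroids)
        ]
        if updated == centroids:  # assignments can no longer change
            break
        centroids = updated
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

The result depends on the initial centroids, which is one reason K-means is usually run several times from different starting points.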

Page 17: A Brief Overview of Data Mining

Outlier, Trend and Evolution

• Outliers: the set of objects that are considerably dissimilar from the remainder of the data
  – Statistical: hypothesis testing, bug mining
  – Density based
  – Clustering based, etc.
• Deviation/Anomaly Detection
• Fraud Detection
• Trend and Evolution:
  – Usually coupled with outlier analysis
  – Basic functionalities in temporal data mining
  – Trend, cycle, seasonal, irregular patterns
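One of the simplest statistical outlier tests is a z-score rule, sketched below. It assumes roughly normal data; the threshold and the sample values are arbitrary choices for illustration, not something the slides prescribe.

```python
from statistics import mean, stdev

def zscore_outliers(values, threshold=2.5):
    """Return the values lying more than `threshold` sample std devs from the mean."""
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical: nothing can deviate
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
flagged = zscore_outliers(readings, threshold=2.5)
```

Note that the outlier itself inflates the mean and standard deviation, so on small samples a strict threshold such as 3 can miss it; density-based and clustering-based methods avoid this dependence on a single global distribution.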

Page 18: A Brief Overview of Data Mining

Part 3: Hot Topics

• Introduction
• Functionalities
• Hot topics
  – Mining data streams, mining time series, spatiotemporal data mining, mining social networks, sequential data mining, graph mining, biology data mining, privacy-preserving data mining
  – Text and Web mining
• Research Groups
• Useful Resources

Page 19: A Brief Overview of Data Mining

Mining Data Streams

• Data: data streams (continuous, ordered, fast changing, huge in volume)
• Characteristics and Challenges:
  – Huge volumes
  – Fast changing; requires fast, real-time response
  – Random access is expensive: need single-scan algorithms
  – Difficult to keep the whole stream (the universe): need approximations
• Basic problems:
  – Multi-dimensional on-line analysis of streams
  – Mining outliers and unusual patterns in stream data
  – Clustering data streams
  – Classification of stream data

Page 20: A Brief Overview of Data Mining

Mining Data Streams (II)

• Methods:
  – Basic: sliding windows, tilted time frames
  – Counting (FP mining, etc.):
    • Random sampling
    • Approximate counting
  – OLAP:
    • Keep critical layers in stream cube computation
    • Partial materialization
    • Outlier: exception-based exploration
  – Clustering:
    • Online microclustering and offline macroclustering
• Text Related Applications:
  – Web logs and Web page click streams
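The "random sampling" bullet is commonly realized with reservoir sampling, which keeps a uniform sample of k items in a single scan without knowing the stream length in advance. The sketch below is one standard variant (Algorithm R); the slides do not say which sampling scheme they have in mind, and the stream here is a stand-in.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of up to k items from an iterable, in one pass."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)   # inclusive on both ends
            if j < k:               # happens with probability k / (i + 1)
                reservoir[j] = item
    return reservoir

# Hypothetical stream: a generator we can only read once.
sample = reservoir_sample((x for x in range(1000)), k=10, rng=random.Random(0))
```

Memory use is O(k) regardless of how long the stream runs, which matches the single-scan constraint listed under the stream challenges.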

Page 21: A Brief Overview of Data Mining

Mining Time series

• Data: time-series database
  – Consists of sequences of values or events changing with time
  – Data is recorded at regular intervals
• Characteristics and Challenges:
  – Characteristic time-series components: trend, cycle, seasonal, irregular patterns
• Basic Problems:
  – Trend discovery, similarity search, outlier detection, prediction and clustering

Page 22: A Brief Overview of Data Mining

Mining Time series (II)

• Methods:
  – Statistical modeling (regression, spline, mixture model, etc.)
  – Data transformation (DFT, DWT)
  – Sliding windows, atomic matching, window stitching, subsequence ordering
  – Clustering
• Text Related Applications:
  – Transliteration mining, temporal text mining, word bursting, etc.
- Han & Kamber, Data Mining: Concepts and Techniques

Page 23: A Brief Overview of Data Mining

Spatiotemporal data mining

• Data: object data sets, spatial/spatiotemporal databases and data warehouses

• Characteristics and Challenges:
  – Generalizing detailed geographic points into clustered regions, such as business, residential, industrial, or agricultural areas, according to land usage
  – Handling objects in space that have identity and well-defined extents, locations, and relationships
  – Requires merging sets of geographic areas by spatial operations
• Basic Problems:
  – Querying objects; distribution/cluster/correlation/evolution/trend analysis

Page 24: A Brief Overview of Data Mining

Spatiotemporal data mining (II)

• Methods
  – GIS (Geographic Information System): analysis and visualization of geographic data
    • Search, location analysis, terrain analysis, distribution, spatial analysis/statistics, measurement
  – Indexing spatial data (R-tree, etc.)
  – Modeling single objects with points, lines and regions
  – Modeling spatially related collections of objects: plane partitions and networks
  – Spatiotemporal patterns, correlations, trend analysis, clustering…
• Text Related Applications:
  – Spatiotemporal text mining; community evolution in weblogs
  – Information diffusion; web evolution

Page 25: A Brief Overview of Data Mining

Special topics in Frequent Pattern Mining

• Association rule mining and frequent itemset mining are pretty old topics

• However, some special topics of frequent pattern mining are still hot
  – Sequential pattern mining
  – Graph mining
  – Pattern post-processing

Page 26: A Brief Overview of Data Mining

Sequential pattern mining

• Data: sequence databases
• Basic problems:
  – Discovery of frequent subsequences (gaps allowed, in contrast to n-grams); closed subsequences
  – Sequence similarity search, sequence alignment
• Methods:
  – Apriori: GSP
  – FP-Growth: PrefixSpan, CloSpan
  – BLAST, Hidden Markov models, CRF, etc.
• Text Related Applications:
  – Most text patterns are sequential patterns
  – Phrase extraction, entity/relation extraction, opinion mining, etc.
  – Biology sequence modeling
- Han & Kamber, Data Mining: Concepts and Techniques
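The difference between a sequential pattern (gaps allowed) and an n-gram (contiguous) can be shown by counting support both ways. This only illustrates the support definition; it is not GSP or PrefixSpan, and the tiny sequence database is invented.

```python
def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, with gaps allowed."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear left to right.
    return all(item in it for item in pattern)

def contains_ngram(pattern, sequence):
    """True if `pattern` occurs in `sequence` as a contiguous run."""
    n = len(pattern)
    return any(
        list(sequence[i:i + n]) == list(pattern)
        for i in range(len(sequence) - n + 1)
    )

def sequence_support(pattern, database, matcher=is_subsequence):
    """Number of sequences in the database that contain the pattern."""
    return sum(matcher(pattern, seq) for seq in database)

database = [
    ["data", "mining", "is", "fun"],
    ["data", "stream", "mining"],
    ["mining", "data"],
]
gapped = sequence_support(["data", "mining"], database)
contiguous = sequence_support(["data", "mining"], database, matcher=contains_ngram)
```

Here the gapped pattern is supported by two sequences while the bigram is supported by only one, which is why sequential patterns can capture phrases that n-grams miss.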

Page 27: A Brief Overview of Data Mining

Graph Mining

• Data: graph databases (like social networks, but with multiple graphs and more general); examples include
  – Chemical compounds, protein structures, program flows, XML/Web
  – Directed, undirected, labeled/unlabeled, weighted, 2-D/3-D, etc.
• Characteristics and Challenges:
  – Theoretically, most problems are of high complexity, but in practice the graphs are solvable.
  – Too many substructures to index
  – …
• Basic problems
  – Frequent subgraph mining
  – Closed subgraph mining
  – Graph indexing by substructures
  – Similarity search
- Han & Kamber, Data Mining: Concepts and Techniques

Page 28: A Brief Overview of Data Mining

Graph Mining (II)

• Methods:
  – Subgraph mining: Apriori (e.g. FSG), pattern growth (e.g. gSpan)
  – gSpan: pattern growth, depth-first search, active elimination of duplicated subgraphs; flattens a graph into a sequence using depth-first search; enumerates graphs using right-most extension.
  – CloseGraph: mining closed subgraph patterns
  – gIndex: identify frequent structures, prune redundancy to maintain discriminative structures, create an index on such structures.
  – Similarity search: indexing; feature-based similarities; estimating missing features
• Text Related Applications:
  – Multi-resolution topic map, entity-relation network, pathway extraction, etc.

Page 29: A Brief Overview of Data Mining

Graph Mining (III): Graph Indexing & Querying

• More on Graph Indexing and Similarity Search
• Comparing to Text Retrieval:

               | Text Retrieval    | Graph Indexing & Search
Objects        | Documents         | Graphs
Basic Units    | Words             | Frequent structures
Pruning        | Stopwords         | Need to mine frequent subgraphs
Redundancy?    | Stemming          | Need discriminative structures
Representation | Term vectors      | Feature vectors
Dimensions     | Terms             | Substructures
Relevance      | Vector similarity | Vector similarity
Approximation  | No                | Yes, need to estimate missing features (relax substructures)

Page 30: A Brief Overview of Data Mining

Graph Mining (IV): Graph Indexing & Querying

• What if we want to index on phrases instead of words?
  – Need to extract phrases first
  – N-grams/sequential patterns, have to remove redundancy
    • E.g. “natural language processing” vs. “language processing”
  – Substructures are like phrases…
• Can IR help?
  – Representation and similarity measures? (Vector space models, probabilistic models…)
  – How to weight features? (TF-IDF, …)
  – Generative models?
  – Query expansion? Feedback?

Page 31: A Brief Overview of Data Mining

Pattern Post-processing

• Data: frequent patterns extracted by mining algorithms
• Challenge:
  – Mining algorithms output an explosively large number of patterns
  – How to interpret the frequent patterns extracted
• Basic Problems:
  – Pattern summarization
  – Mining compressed patterns
  – Top-K patterns
  – Pattern annotation
  – User-oriented ranking
• Methods:
  – Modeling pattern profiles, coverage and contexts
  – Using clustering to summarize and compress patterns
  – Bridging IR/NLP and frequent pattern mining: profile, context, ranking, feedback, filtering, summarization, MMR, etc.

Page 32: A Brief Overview of Data Mining

Mining Social Networks

• Data: graphs/networks with nodes and links
  – Examples: communication networks, webpages, citations, biological pathways, etc.
• Characteristics and Challenges:
  – Connected components: few
  – Network diameter: small
  – Degree of clustering: high
  – Degree distribution: heavy-tailed
  – Modeling logical/statistical dependencies
• Basic Problems:
  – Model the generation of graphs/networks
  – Link-based object ranking, classification, identification, clustering, entity resolution
  – Link prediction, querying, community discovery
H. Jeong, S.P. Mason, A.-L. Barabasi, Z.N. Oltvai, Nature 411, 41-42 (2001)

Page 33: A Brief Overview of Data Mining

Mining Social Networks (II)

• Methods:
  – Graph generation models: trying to derive generative models that explain the characteristics and evolution of social networks/graphs
  – Vertex ranking: PageRank, HITS, etc.
  – Community detection: hierarchical clustering, spectral clustering, stochastic modeling, etc.
  – Link-based classification: semi-supervised learning, propagation
  – Entity resolution: duplicate prediction, collective resolution, probabilistic models
  – Link prediction: binary classification problem, local conditional probabilistic models
  – Substructure mining: graph pattern mining, indexing
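As an example of vertex ranking, the sketch below runs PageRank by power iteration on a small adjacency dictionary. The damping factor 0.85, the fixed iteration count, and the treatment of dangling nodes (their rank is spread evenly over all nodes) are common conventions assumed here, and the four-node graph is made up.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {node: [nodes it links to]}. Returns {node: score}; scores sum to 1."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            # A node without out-links spreads its rank over all nodes.
            targets = links.get(u) or list(nodes)
            share = damping * rank[u] / len(targets)
            for v in targets:
                new_rank[v] += share
        rank = new_rank
    return rank

# Hypothetical link graph: c is pointed to by a, b and d.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
```

In this graph `c` ends up with the highest score and `d`, which nobody links to, with the lowest.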

Page 34: A Brief Overview of Data Mining

Mining Social Networks (III)

• Generative Models of social network/graph generation and evolution

• Random graphs (Erdős-Rényi models)
  – Fix the N vertices; generate each edge independently with probability p
  – N(N-1)/2 trials of a biased coin flip, p ~ 1/N
  – Degree distribution is approximately Poisson; E[d] = p(N-1); E[# of edges] = pN(N-1)/2
  – Parameter: p
• Graph process model:
  – Start with no edges and keep adding one edge at a time
  – Always choose the next edge uniformly at random from among all missing edges
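The random graph model translates almost directly into code: one biased coin flip per vertex pair. The function names below are illustrative, and the generator takes an optional seeded `random.Random` so that runs can be reproduced.

```python
import random
from itertools import combinations

def erdos_renyi(n, p, rng=None):
    """Return the edge list of a G(n, p) graph: each pair is linked with probability p."""
    rng = rng or random.Random()
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def expected_edges(n, p):
    """E[# of edges] = p * N(N-1)/2, one coin flip per vertex pair."""
    return p * n * (n - 1) / 2

def expected_degree(n, p):
    """E[d] = p * (N-1), since each vertex has N-1 potential neighbors."""
    return p * (n - 1)

edges = erdos_renyi(200, 0.05, rng=random.Random(1))
```

For N = 200 and p = 0.05 the expected edge count is 995 and the expected degree is 9.95; the number of edges in any one sample fluctuates around that value.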

Page 35: A Brief Overview of Data Mining

Mining Social Networks (IV)

• α-model (Watts-Strogatz models, small-world)
  – For vertices u, v, define m(u,v) to be the number of common neighbors (so far)
  – Define the propensity R(u,v) of u to connect to v:
    • if m(u,v) >= k, R(u,v) = 1 (share too many friends, must connect)
    • if m(u,v) = 0, R(u,v) = p (no mutual friends → no bias to connect)
    • else, R(u,v) = p + (m(u,v)/k)^α (1-p) (biased to connect)
  – Generate the network incrementally, with R(u,v) as the edge probability
  – α → ∞ is similar to Erdős-Rényi models
  – Need to tune parameters α, p, k
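The propensity function can be written out as a short sketch. One assumption to flag: the middle case is coded with the ratio m/k raised to the power α, which is how the α-model is usually stated and what makes large α behave like a random graph; the extracted slide text shows no explicit exponent.

```python
def propensity(m, k, p, alpha):
    """Propensity R(u, v) given m = number of common neighbors of u and v so far."""
    if m >= k:
        return 1.0  # share too many friends: must connect
    if m == 0:
        return p    # no mutual friends: baseline probability only
    # Assumed form: the exponent alpha controls how quickly mutual friends matter.
    return p + (m / k) ** alpha * (1.0 - p)

weak_bias = propensity(m=2, k=4, p=0.1, alpha=1000)  # close to the baseline p
strong_bias = propensity(m=2, k=4, p=0.1, alpha=0.01)  # close to 1
```

With a very large α the term (m/k)^α vanishes for any m below k, so almost every pair connects with probability p, which is the Erdős-Rényi regime noted on the slide.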

Page 36: A Brief Overview of Data Mining

Mining Social Networks (V)

• Scale-free models: N (# of vertices) is not fixed
  – Start with (say) two vertices connected by an edge
  – Let Z = Σ d(j), where d(j) = degree of vertex j so far
  – Add new vertex i with k edges back to {1, …, i-1}: i is connected back to j with probability d(j)/Z
  – The rich get richer…
• Evaluation of generative models
  – Can they explain all the characteristics of social networks?
  – Parameter tuning?
• Other models for social network analysis
  – Copying model: leads to communities
  – Forest Fire Model
  – Electricity network (not a generative model, but interesting)
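The scale-free growth rule (connect back to j with probability d(j)/Z) can be sketched as below. Vertices are numbered from 0 rather than 1, and the k targets are drawn with replacement, so repeated edges to the same vertex are possible when k > 1; both are simplifications made for this example.

```python
import random

def preferential_attachment(n, k=1, rng=None):
    """Grow a graph to n vertices; each new vertex adds k degree-proportional edges."""
    rng = rng or random.Random()
    edges = [(0, 1)]        # start with two vertices joined by one edge
    degree = {0: 1, 1: 1}
    for i in range(2, n):
        existing = list(degree)
        weights = [degree[j] for j in existing]  # d(j); choices() normalizes by Z
        targets = rng.choices(existing, weights=weights, k=k)
        degree[i] = 0
        for j in targets:
            edges.append((i, j))
            degree[i] += 1
            degree[j] += 1
    return edges, degree

edges, degree = preferential_attachment(500, k=1, rng=random.Random(0))
```

Early vertices accumulate degree faster than late ones, which is the "rich get richer" effect behind the heavy-tailed degree distribution.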

Page 37: A Brief Overview of Data Mining

Mining Social Networks (VI)

• Text Related Applications: quite a lot!
  – Ranking webpages
  – Multi-resolution Concept/Topic Map
  – Citation impact of scientific literature
  – Entity-relation extraction
  – Bioinformatics: pathway extraction
  – Reference reconciliation
  – Web structure evolution
  – Community discovery in Weblogs

Page 38: A Brief Overview of Data Mining

Text and Web mining

• Data: text, unstructured/semi-structured; webpages with linkages, user logs
  – E.g. webpages, news, email, weblogs, scientific literature, citations, customer reviews, forums, search logs, chat logs, legal documents, etc.
• Challenges:
  – Modeling unstructured/semi-structured data
  – Coupling with Natural Language Processing
  – Handling high dimensionality
  – Handling data sparseness and ambiguity
  – The Web is too complicated!

Page 39: A Brief Overview of Data Mining

Text and Web mining (II)

• Selected Problems:
  – Text categorization/clustering (already covered in NLP and ML)
  – Word sense disambiguation (covered in NLP)
  – Information extraction (covered in NLP)
  – Dimension reduction (overlapping with ML and IR)
  – Collaborative filtering, user-interest modeling
  – Topic detection and tracking
  – Comparative text mining, theme-based text mining
  – Transliteration mining
  – Email clustering / spam detection
  – Opinion mining (overlapping with NLP)
  – Social networks related (already covered)
  – Temporal text mining
  – Vision-based page segmentation / block-based search

Page 40: A Brief Overview of Data Mining

Text and Web mining (III)

• Methods: Confluence of Multiple Disciplines
  – Database: data integration, schema matching, XML
  – Data mining: sequential pattern mining, association rule mining, …
  – IR: search, language models, feedback, …
  – Machine Learning: SVD, supervised/unsupervised learning, semi-supervised learning, topic models, …
  – NLP: POS tagging, parsing, context modeling, sentiment extraction, entity extraction, …
  – Statistical Learning: Bayesian methods, word bursting, time-series analysis, hypothesis testing, other statistical models, …

Page 41: A Brief Overview of Data Mining

Text and Web mining (IV)

• Resolution:
  – Word level: word sense disambiguation, word bursting, transliteration mining
  – Entity level: information extraction, entity-relation network
  – Pattern level: opinion mining, relation extraction
  – Document level: document classification/clustering
  – Theme level: PLSI, LDA, comparative text mining, temporal text mining/spatiotemporal text mining
  – Topic level: topic detection and tracking, email threading
  – Web level: social network, weblog mining, block-based search
• Selected topics will be discussed in the next meeting.

Page 42: A Brief Overview of Data Mining

Part 4: Research Groups

• Introduction
• Functionalities
• Hot topics
• Research Groups
  – Stanford, CMU, UIUC, Wisc, Helsinki, UMN
  – IBM, Microsoft, MSRA, Yahoo!
  – Others
• Useful Resources

Page 43: A Brief Overview of Data Mining

Research Groups

• Rakesh Agrawal
  – One of the leaders in data mining
  – Frequent patterns, privacy-preserving data mining
• Stanford: Jerome H. Friedman
  – http://www-stat.stanford.edu/~jhf/
  – Strong statistical flavor, machine learning, boosting
• CMU: Christos Faloutsos
  – http://www.cs.cmu.edu/~christos/
  – Graph mining, social networks, stream data mining, image/multimedia mining, time-series mining
• UIUC: Jiawei Han
  – http://www-sal.cs.uiuc.edu/~hanj/
  – Many! Frequent pattern mining, graph mining, OLAP/cubing, stream data mining, classification, clustering, …

Page 44: A Brief Overview of Data Mining

Research Groups (II)

• University of Helsinki: Heikki Mannila
  – http://www.cs.helsinki.fi/research/fdk/
  – http://www.cs.helsinki.fi/u/mannila/
  – Frequent itemset mining, computational biology
• Wisconsin: Raghu Ramakrishnan
  – http://www.cs.wisc.edu/dmi/
  – http://www.cs.wisc.edu/~raghu/
  – Data warehousing, cubing, classification/clustering
• Minnesota: Vipin Kumar
  – http://www-users.cs.umn.edu/~kumar/
  – Spatiotemporal data mining
• IBM T.J. Watson: Philip S. Yu
  – http://domino.research.ibm.com/comm/research.nsf/pages/r.kdd.html
  – http://www.research.ibm.com/people/p/psyu/index.html
  – Frequent pattern mining, graph mining, data streams

Page 45: A Brief Overview of Data Mining

Research Groups (III)

• Microsoft Research Redmond: Surajit Chaudhuri
  – http://research.microsoft.com/dmx/
  – Database related, data cleaning, etc.
• Microsoft Research Redmond: Eric Brill
  – http://research.microsoft.com/tmsn/
  – http://research.microsoft.com/~brill/
  – Text Mining, Search and Navigation Research, NLP
• Microsoft Research Asia:
  – http://research.microsoft.com/wsm/
  – Web search, web/text mining
• Yahoo! Research: Prabhakar Raghavan
  – http://research.yahoo.com/researcher.shtml
  – http://theory.stanford.edu/~pragh/
  – Web/text mining, social networks

Page 46: A Brief Overview of Data Mining

Research Groups (IV)

• IBM WebFountain
  – http://www.almaden.ibm.com/webfountain/
• UIC: Bing Liu
  – http://www.cs.uic.edu/~liub/
  – Association rule mining, web/text mining
• UNC: Wei Wang
  – http://www.cs.unc.edu/~weiwang/
  – Biology data mining, frequent pattern mining
• Simon Fraser: Jian Pei
  – http://www.cs.sfu.ca/~jpei/
  – Sequential pattern mining, OLAP
• National University of Singapore: Anthony K.H. Tung
  – http://www.comp.nus.edu.sg/~atung/
  – Spatial data mining, biology data mining

• …

Page 47: A Brief Overview of Data Mining

Part 5: Useful Resources

• Introduction
• Functionalities
• Hot topics
• Research Groups
• Useful Resources
  – Text Books
  – Toolkits
  – Conferences
  – Others

Page 48: A Brief Overview of Data Mining

Text Books
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
• J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
• D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
• T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
• T. M. Mitchell. Machine Learning. McGraw Hill, 1997
• G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
• P.-N. Tan, M. Steinbach and V. Kumar. Introduction to Data Mining. Wiley, 2005
• S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
• I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005
- From Prof. Jiawei Han’s slides

Page 49: A Brief Overview of Data Mining

Toolkits

• Weka: Data mining software in Java
  – http://www.cs.waikato.ac.nz/%7Eml/weka/
• IlliMine (Illinois Data Mining System)
  – http://illimine.cs.uiuc.edu/
  – Data Cubing
  – Frequent Pattern Mining
  – Sequential Pattern Mining
  – Graph Pattern Mining
  – Classification
• Collected by Vipin Kumar:
  – http://www-users.cs.umn.edu/~kumar/dmbook/resources.htm

Page 50: A Brief Overview of Data Mining

Conferences
• KDD Conferences

– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)

– SIAM Data Mining Conf. (SDM)

– (IEEE) Int. Conf. on Data Mining (ICDM)

– European Conf. on Principles and Practice of Knowledge Discovery in Databases (PKDD)

– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)

• Other related conferences– ACM SIGMOD

– VLDB

– (IEEE) ICDE

– WWW, SIGIR

– ICML, CVPR, NIPS

• Journals

– Data Mining and Knowledge Discovery (DAMI or DMKD)

– IEEE Trans. On Knowledge and Data Eng. (TKDE)

– KDD Explorations

– ACM Trans. on KDD
- From Prof. Jiawei Han’s slides

Page 51: A Brief Overview of Data Mining

Others

• KDnuggets
  – http://www.kdnuggets.com/
• Tutorial: Machine Learning Techniques for Data Mining (WEKA) Slides
  – Eibe Frank, University of Waikato
  – http://books.elsevier.com/companions/1558605525?country=United+States
• Ideas for course projects in data mining
  – Collected by Vipin Kumar
  – http://www-users.cs.umn.edu/~kumar/dmbook/projects.htm

Page 52: A Brief Overview of Data Mining

End of the presentation

Thanks!