Datamining
Data Mining at UVA
New Horizons in Teaching and Learning Conference
May 21-24, 2007
Kathy Gerber, ITC Research Computing
Why Mine Data? Commercial Viewpoint
• Lots of data is being collected and warehoused
– Web data, e-commerce
– purchases at department/grocery stores
– Bank/Credit Card transactions
• Computers have become cheaper and more powerful
• Competitive pressure is strong
– Provide better, customized services for an edge (e.g. in Customer Relationship Management)
Why Mine Data? Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– remote sensors on a satellite
– telescopes scanning the skies
– microarrays generating gene expression data (e.g., GEOSS)
– scientific simulations generating terabytes of data
• Traditional techniques infeasible for raw data
• Data mining may help scientists
– in classifying and segmenting data
– in hypothesis formation
Mining Large Data Sets - Motivation
• There is often information “hidden” in the data that is not readily evident
• Human analysts may take weeks to discover useful information
• Much of the data is never analyzed at all
[Figure: The Data Gap. Total new disk (TB) since 1995 compared with the number of analysts, 1995-1999; vertical axis from 0 to 4,000,000.]
From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
What is Data Mining?
• Many Definitions
– Non-trivial extraction of implicit, previously unknown and potentially useful information from data
– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Summary of SAS DM Process - SEMMA
• Sample the data by creating one or more data tables. The sample should be large enough to contain the significant information, yet small enough to process.
• Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.
• Modify the data by creating, selecting, and transforming the variables to focus the model selection process.
• Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.
• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
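The five SEMMA steps can be sketched in a few lines of plain Python. This is only an illustration, not how SAS Enterprise Miner implements them: the "warehouse" records are synthetic, and the variable names, the derived ratio, and the threshold model are all assumptions made for the example.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical warehouse: (income in $K, purchases per year, responded to offer?)
warehouse = []
for _ in range(10_000):
    income = random.uniform(20, 200)
    purchases = random.randint(0, 50)
    responded = (purchases / income) > 0.2  # synthetic ground truth
    warehouse.append((income, purchases, responded))

# Sample: a table small enough to process, large enough to be informative.
sample = random.sample(warehouse, 1_000)
train, holdout = sample[:700], sample[700:]

# Explore: summary statistics to look for trends and anomalies.
mean_income = statistics.mean(r[0] for r in train)
response_rate = sum(r[2] for r in train) / len(train)

# Modify: derive a new variable (purchases per $K of income).
def ratio(record):
    return record[1] / record[0]

# Model: pick the ratio threshold that best separates responders in training.
def accuracy(records, threshold):
    hits = sum((ratio(r) > threshold) == r[2] for r in records)
    return hits / len(records)

candidates = [i / 100 for i in range(1, 100)]
best_threshold = max(candidates, key=lambda t: accuracy(train, t))

# Assess: measure reliability on records the model has not seen.
holdout_accuracy = accuracy(holdout, best_threshold)
print(f"mean income {mean_income:.1f}K, response rate {response_rate:.2f}")
print(f"threshold {best_threshold:.2f}, holdout accuracy {holdout_accuracy:.2f}")
```

Holding back part of the sample for the Assess step keeps the accuracy estimate from being measured on the same records that were used to choose the threshold.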
What is (not) Data Mining?
What is Data Mining?
– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)
– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
– Look up phone number in phone directory
– Query a Web search engine for information about “Amazon”
Origins of Data Mining
• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
• Traditional techniques may be unsuitable due to
– Enormity of data
– High dimensionality of data
– Heterogeneous, distributed nature of data
[Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
Classification: Definition
• Given a collection of records (training set)
– Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
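The train/test procedure in this definition can be sketched as follows. The records are toy data invented for the example, the function names are hypothetical, and the "model" is deliberately the simplest possible one (always predict the majority class), so the sketch shows only the mechanics of splitting and measuring accuracy.

```python
import random
from collections import Counter

def split_records(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and divide it into (training, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def learn_majority_class(training):
    """Baseline model: always predict the most common training class."""
    most_common = Counter(cls for _, cls in training).most_common(1)[0][0]
    return lambda attributes: most_common

def accuracy(model, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(model(attrs) == cls for attrs, cls in test)
    return correct / len(test)

# Toy records invented for this sketch: (attributes, class).
records = [({"x": i}, "yes" if i % 4 == 0 else "no") for i in range(40)]

training, test = split_records(records)
model = learn_majority_class(training)
print(f"{len(training)} training records, {len(test)} test records")
print(f"test accuracy of the baseline: {accuracy(model, test):.2f}")
```

A majority-class baseline is a useful reference point: any real classifier should score higher than this on the same test set.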
Examples of Classification Task
• Predicting tumor cells as benign or malignant
• Classifying credit card transactions as legitimate or fraudulent
• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
• Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
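Of the techniques listed, memory-based reasoning is perhaps the shortest to illustrate. The sketch below is a minimal k-nearest-neighbor classifier, assuming numeric attributes and Euclidean distance; the training points and class labels are invented for the example, and `knn_predict` is a hypothetical helper rather than a function from any of the tools demonstrated.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """Classify `query` by majority vote among its k nearest training records."""
    by_distance = sorted(training, key=lambda rec: math.dist(rec[0], query))
    votes = Counter(cls for _, cls in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy training set invented for this sketch: ((feature1, feature2), class).
training = [
    ((1.0, 1.0), "benign"), ((1.5, 2.0), "benign"), ((2.0, 1.5), "benign"),
    ((6.0, 6.5), "malignant"), ((7.0, 6.0), "malignant"), ((6.5, 7.5), "malignant"),
]

print(knn_predict(training, (1.2, 1.4)))
print(knn_predict(training, (6.8, 6.9)))
```

No model is built in advance here: the training records themselves are the "memory", and all the work happens when a new record is classified.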
Illustrating Classification Task
Training Set
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes
Test Set
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
[Figure: the Training Set is fed to a learning algorithm, which learns a Model (induction); the Model is then applied to the Test Set (deduction).]
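The induction and deduction steps can be tried on the slide's own tables. The sketch below uses a one-rule (1R-style) learner, chosen here only because it is short; it is not the algorithm used in the demonstrations. It considers just the two categorical attributes and ignores Attrib3, which is a simplifying assumption, and `learn_one_rule` is a hypothetical helper.

```python
from collections import Counter, defaultdict

# Training set from the slide: (Attrib1, Attrib2, Attrib3 in $K, Class).
training = [
    ("Yes", "Large", 125, "No"), ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"), ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"), ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"), ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"), ("No", "Small", 90, "Yes"),
]

# Test set from the slide; the class is unknown.
test = [
    ("No", "Small", 55), ("Yes", "Medium", 80), ("Yes", "Large", 110),
    ("No", "Small", 95), ("No", "Large", 67),
]

def learn_one_rule(records, attribute_index):
    """Induction: map each attribute value to its majority class; count errors."""
    classes_by_value = defaultdict(Counter)
    for record in records:
        classes_by_value[record[attribute_index]][record[-1]] += 1
    rule = {v: c.most_common(1)[0][0] for v, c in classes_by_value.items()}
    errors = sum(rule[r[attribute_index]] != r[-1] for r in records)
    return rule, errors

# Learn one rule per categorical attribute and keep the one with fewest errors.
# (Attrib3 is numeric and is left out to keep the sketch short.)
candidates = {i: learn_one_rule(training, i) for i in (0, 1)}
best_index = min(candidates, key=lambda i: candidates[i][1])
rule, training_errors = candidates[best_index]

# Deduction: apply the learned model to the unlabeled test records.
predictions = [rule[record[best_index]] for record in test]

print(f"chosen attribute: Attrib{best_index + 1}, training errors: {training_errors}")
for tid, prediction in zip(range(11, 16), predictions):
    print(tid, prediction)
```

On this table the sketch selects Attrib2, which misclassifies two of the ten training records. A learner that also used Attrib3 could do better, since all three "Yes" records fall between 85K and 95K.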
Software Demonstrations
SAS Enterprise Miner
R Rattle
Weka
SAS Enterprise Miner
Screenshot – EM Tutorial Workflow
R Rattle
• Install R 2.5.0
• > source("http://www.ggobi.org/downloads/install.r")
• > install("rattle", dep=TRUE)
Weka
Slide Credits
• R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”
• SAS Enterprise Miner tutorial
• Eibe Frank, Machine Learning with Weka
• Tan, Steinbach, Kumar “Introduction to Data Mining”
Versions and References for Software Used Today
• SAS 9.1.3 EAS with Enterprise Miner
– UVA licensed software
– http://rescomp.virginia.edu
• R 2.5.0 with Rattle (open source)
• Weka (open source)
– Ian Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)
• Not demonstrated but also see Insightful Miner and Orange