Datamining

Data Mining at UVA

New Horizons in Teaching and Learning Conference

May 21-24, 2007

Kathy Gerber, ITC Research Computing


Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused

– Web data, e-commerce

– Purchases at department/grocery stores

– Bank/credit card transactions

• Computers have become cheaper and more powerful

• Competitive pressure is strong

– Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)

– remote sensors on a satellite

– telescopes scanning the skies

– microarrays generating gene expression data (e.g., GEOSS)

– scientific simulations generating terabytes of data

• Traditional techniques infeasible for raw data

• Data mining may help scientists

– in classifying and segmenting data

– in hypothesis formation

Mining Large Data Sets - Motivation

• There is often information “hidden” in the data that is not readily evident

• Human analysts may take weeks to discover useful information

• Much of the data is never analyzed at all

The Data Gap

[Figure: chart for 1995 to 1999 comparing total new disk (TB) since 1995 with the number of analysts; vertical axis from 0 to 4,000,000.]

From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

What is Data Mining?

• Many Definitions

– Non-trivial extraction of implicit, previously unknown and potentially useful information from data

– Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns

Summary of SAS DM Process - SEMMA

• Sample the data by creating one or more data tables. The sample should be large enough to contain the significant information, yet small enough to process.

• Explore the data by searching for anticipated relationships, unanticipated trends, and anomalies in order to gain understanding and ideas.

• Modify the data by creating, selecting, and transforming the variables to focus the model selection process.

• Model the data by using the analytical tools to search for a combination of the data that reliably predicts a desired outcome.

• Assess the data by evaluating the usefulness and reliability of the findings from the data mining process.
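The five SEMMA steps can be walked through on a toy problem. The following is only a minimal sketch in plain Python, not SAS Enterprise Miner: the synthetic "warehouse", the variable names (`spend`, `visits`, `responded`, `spend_per_visit`) and the simple threshold model are all assumptions made for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is repeatable

# Hypothetical synthetic "warehouse" of records. The outcome is constructed
# to depend on spend per visit, so that the Modify step has something to find.
def make_record():
    spend = rng.uniform(0, 200)
    visits = rng.randint(1, 20)
    return {"spend": spend, "visits": visits, "responded": spend / visits > 10}

warehouse = [make_record() for _ in range(5000)]

# Sample: draw a table that is small enough to process.
sample = rng.sample(warehouse, 500)

# Explore: compare average spend between the two outcome groups.
mean_spend = {
    outcome: statistics.mean(r["spend"] for r in sample if r["responded"] == outcome)
    for outcome in (True, False)
}

# Modify: derive a new variable from the existing ones.
for r in sample:
    r["spend_per_visit"] = r["spend"] / r["visits"]

# Keep part of the sample aside for the Assess step.
train, holdout = sample[:350], sample[350:]

def accuracy(threshold, records):
    """Fraction of records where 'spend_per_visit > threshold' matches the outcome."""
    hits = sum((r["spend_per_visit"] > threshold) == r["responded"] for r in records)
    return hits / len(records)

# Model: choose the threshold that best predicts the outcome on the training part.
candidates = range(1, 31)
threshold = max(candidates, key=lambda t: accuracy(t, train))

# Assess: check how well the chosen model does on records it has not seen.
holdout_accuracy = accuracy(threshold, holdout)
print(mean_spend, threshold, holdout_accuracy)
```

A real project would replace each step with something richer (stratified sampling, plots, variable selection, several competing models), but the order of the steps stays the same.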

What is (not) Data Mining?

What is Data Mining?

– Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in Boston area)

– Group together similar documents returned by search engine according to their context (e.g. Amazon rainforest, Amazon.com)

What is not Data Mining?

– Look up phone number in phone directory

– Query a Web search engine for information about “Amazon”
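The contrast between the two lists can be made concrete with a small sketch. The toy directory below is entirely made up, and the function names `lookup_phone` and `prefix_share_by_city` are hypothetical. The first function answers a question whose answer is stored explicitly; the second summarizes a pattern that no single record states.

```python
from collections import Counter

# Hypothetical toy directory: (surname, city, phone). All entries are made up.
DIRECTORY = [
    ("O'Brien", "Boston", "555-0101"),
    ("O'Reilly", "Boston", "555-0102"),
    ("O'Rourke", "Boston", "555-0103"),
    ("Smith", "Boston", "555-0104"),
    ("Smith", "Denver", "555-0105"),
    ("Garcia", "Denver", "555-0106"),
    ("O'Brien", "Denver", "555-0107"),
    ("Nguyen", "Denver", "555-0108"),
]

def lookup_phone(surname, city):
    """Not data mining: retrieve a value that is stored explicitly."""
    for s, c, phone in DIRECTORY:
        if (s, c) == (surname, city):
            return phone
    return None

def prefix_share_by_city(prefix):
    """Closer to data mining: what share of each city's surnames start with `prefix`?"""
    totals, matches = Counter(), Counter()
    for surname, city, _ in DIRECTORY:
        totals[city] += 1
        if surname.startswith(prefix):
            matches[city] += 1
    return {city: matches[city] / totals[city] for city in totals}

print(lookup_phone("Smith", "Denver"))
print(prefix_share_by_city("O'"))
```

On real data the second kind of question would be asked across many candidate patterns at once, which is where automatic or semi-automatic methods become necessary.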

Origins of Data Mining

• Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems

• Traditional techniques may be unsuitable due to

– Enormity of data

– High dimensionality of data

– Heterogeneous, distributed nature of data

[Figure: Data Mining shown at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems.]

Classification: Definition

• Given a collection of records (training set)

– Each record contains a set of attributes; one of the attributes is the class.

• Find a model for the class attribute as a function of the values of the other attributes.

• Goal: previously unseen records should be assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
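The split into training and test sets, and the accuracy measurement on the test set, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the records are synthetic, the helper names (`train_test_split`, `majority_class_model`, `accuracy`) are hypothetical, and the "model" is just a baseline that always predicts the most common class in the training set.

```python
import random
from collections import Counter

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and divide it into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def majority_class_model(training_set):
    """Baseline model: ignore the attributes, always predict the most common class."""
    most_common_label = Counter(label for _, label in training_set).most_common(1)[0][0]
    return lambda attributes: most_common_label

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(1 for attributes, label in test_set if model(attributes) == label)
    return correct / len(test_set)

# Synthetic records, assumed for illustration: (attributes, class).
records = [((i,), "pos" if i % 3 == 0 else "neg") for i in range(30)]

training_set, test_set = train_test_split(records)
baseline = majority_class_model(training_set)
print(len(training_set), len(test_set), accuracy(baseline, test_set))
```

Any classifier from the techniques listed later can be dropped in where `majority_class_model` is used; comparing its test-set accuracy with this baseline shows whether it has learned anything beyond the class frequencies.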

Examples of Classification Task

• Predicting tumor cells as benign or malignant

• Classifying credit card transactions as legitimate or fraudulent

• Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil

• Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques

• Decision Tree based Methods

• Rule-based Methods

• Memory-based Reasoning

• Neural Networks

• Naïve Bayes and Bayesian Belief Networks

• Support Vector Machines

Illustrating Classification Task

Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?

[Figure: a learning algorithm learns a model from the Training Set (induction); the model is then applied to the Test Set (deduction).]
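The learn-then-apply cycle on this slide can be tried on the slide's own records. The sketch below uses a very simple rule-based learner in the spirit of 1R (one rule on one attribute); this technique is swapped in for illustration and is not necessarily what the demonstrations used. It considers only the two categorical attributes and leaves out the numeric Attrib3; the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Training records transcribed from the slide: (Attrib1, Attrib2, Class).
# Attrib3 is numeric and is left out of this categorical-only sketch.
TRAIN = [
    ("Yes", "Large", "No"),
    ("No", "Medium", "No"),
    ("No", "Small", "No"),
    ("Yes", "Medium", "No"),
    ("No", "Large", "Yes"),
    ("No", "Medium", "No"),
    ("Yes", "Large", "No"),
    ("No", "Small", "Yes"),
    ("No", "Medium", "No"),
    ("No", "Small", "Yes"),
]

# Test records 11 to 15 from the slide: (Attrib1, Attrib2), class unknown.
TEST = [
    ("No", "Small"),
    ("Yes", "Medium"),
    ("Yes", "Large"),
    ("No", "Small"),
    ("No", "Large"),
]

def learn_one_rule(records, n_attributes):
    """Induction: for each attribute, map every value to its majority class,
    then keep the attribute whose mapping makes the fewest training errors."""
    best = None
    for index in range(n_attributes):
        counts = defaultdict(Counter)
        for record in records:
            counts[record[index]][record[-1]] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(1 for r in records if rule[r[index]] != r[-1])
        if best is None or errors < best[2]:
            best = (index, rule, errors)
    return best

def apply_rule(index, rule, record, default):
    """Deduction: classify one record, falling back to `default` for unseen values."""
    return rule.get(record[index], default)

index, rule, errors = learn_one_rule(TRAIN, 2)
default = Counter(r[-1] for r in TRAIN).most_common(1)[0][0]
predictions = [apply_rule(index, rule, r, default) for r in TEST]
print(index, rule, errors, predictions)
```

On these ten records the learner settles on Attrib2 with two training errors. Because the test records on the slide have no known class, the predictions cannot be scored here; with labeled test data they would be compared against the true classes to estimate accuracy.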

Software Demonstrations

SAS Enterprise Miner

R Rattle

Weka

SAS Enterprise Miner

Screenshot – EM Tutorial Workflow

R Rattle

• Install R 2.5.0

• > source("http://www.ggobi.org/downloads/install.r")

• > install("rattle", dep=TRUE)

Slide Credits

• R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications”

• SAS Enterprise Miner tutorial

• Eibe Frank, Machine Learning with Weka

• Tan, Steinbach, Kumar “Introduction to Data Mining”

Versions and References for Software Used Today

• SAS 9.1.3 EAS with Enterprise Miner

– UVA licensed software

– http://rescomp.virginia.edu

• R 2.5.0 with Rattle (open source)

• Weka (open source)

– Ian Witten, Eibe Frank: Data Mining: Practical Machine Learning Tools and Techniques (Second Edition)

• Not demonstrated but also see Insightful Miner and Orange

http://www.sas.com/technologies/analytics/datamining/miner/

http://rattle.togaware.com/

http://www.cs.waikato.ac.nz/ml/weka/

http://www.insightful.com/products/iminer/default.asp

http://www.ailab.si/orange
