Introduction to Clementine


Page 1: Introduction to Clementine

Introduction to Clementine

Tutors: Cecia Chan & Gabriel Fung

Data Mining Tutorial

Page 2: Introduction to Clementine

A Brief Review of Data Mining (I)

Data mining is…
– A process of extracting previously unknown, valid and actionable knowledge from large databases

A rule of thumb:
– If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining

Page 3: Introduction to Clementine

A Brief Review of Data Mining (II)

Therefore, data mining is not…
– SQL queries against any number of disparate databases or data warehouses
– SQL queries in a parallel or massively parallel environment
– Information retrieval, for example, through intelligent agents
– Multidimensional database analysis (MDA)
– OLAP
– Exploratory data analysis (EDA)
– Graphical visualization
– Traditional statistical processing against a data warehouse

However, they are all related to data mining

Page 4: Introduction to Clementine

Data Mining Process

1. Business objective(s) determination
– What is your goal?

2. Data collection
– You can learn nothing without data…

3. Data preprocessing (or data preparation)
– Remove outliers / filter noise / modify fields / etc.

4. Modeling
– The core part of data mining

5. Evaluation
– See what you have learned!

Page 5: Introduction to Clementine

Data Mining Software

Existing data mining software:
– Clementine from SPSS (we have this software), Enterprise Miner from SAS (we have this software), Intelligent Miner from IBM (we have this software), MineSet from Silicon Graphics, K-wiz from Compression Sciences Ltd., DBMiner from DBMiner Tech. Inc., PolyAnalyst from Megaputer Intelligence, StatServer from Mathsoft, …

Page 6: Introduction to Clementine

Problem Statement

Situation:
– You are a researcher compiling data for a medical study
– You have collected data about a set of patients, all of whom suffered from the same illness
– Each patient responded to one of five drug treatments

Page 7: Introduction to Clementine

Step 1: Business objective

Figure out which drug might be appropriate for a future patient with the same illness

Here are the data collected:
– Age
– Sex (M or F)
– BP (Blood pressure: High, normal, or low)
– Weight (The weight of the patient)
– Cholesterol (Blood cholesterol: Normal or high)
– Na (Blood sodium concentration)
– K (Blood potassium concentration)
– Drug (Drug to which the patient responded)

Page 8: Introduction to Clementine

Using Clementine (1)

Clementine is located in…
– Start → All Programs → Clementine 6.0.2

[Screenshot of the Clementine window with labeled areas: Models, Nodes, Work-Space]

Page 9: Introduction to Clementine

Using Clementine (2)

Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions.

Page 10: Introduction to Clementine

Step 2: Data Collection (1)

Double Click

Nodes for inputting the collected data

Page 11: Introduction to Clementine

Step 2: Data Collection (2)

Location of your file

How many columns to use from the file

Whether the first row specifies the names of the fields

Other details
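
The options above (file location, number of columns, header row) have rough equivalents when a delimited file is read in code. The following is a minimal Python sketch, not Clementine's own behavior: the function name `read_records`, its parameters, the placeholder names `field1`, `field2`, … and the sample values are all assumptions made for illustration; only the field names come from the tutorial's data description.

```python
import csv
import io

# Hypothetical sample rows. The field names follow the tutorial's data
# description; the values are made up for illustration.
SAMPLE = """Age,Sex,BP,Weight,Cholesterol,Na,K,Drug
23,F,HIGH,55.0,HIGH,0.79,0.03,drugA
47,M,LOW,72.5,HIGH,0.74,0.06,drugB
"""

def read_records(text, has_header=True, max_columns=None):
    """Parse delimited text into a list of dicts, one per data row."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        # The first row specifies the names of the fields.
        names, data = rows[0], rows[1:]
    else:
        # No header row: generate placeholder names field1, field2, ...
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
        data = rows
    if max_columns is not None:
        # Use only the first max_columns columns from the file.
        names = names[:max_columns]
    # zip() stops at the shorter sequence, so extra columns are dropped.
    return [dict(zip(names, row)) for row in data]

records = read_records(SAMPLE)
print(records[0]["Drug"])
```

In practice the text would come from a file opened at the chosen location rather than from an in-memory string.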

Page 12: Introduction to Clementine

Step 3: Data Preparation – Explore the Data (1)

Nodes for exploration/visualization:
– Table (in the Output panel)
– Plot (in the Graphs panel)
– Histogram (in the Graphs panel)
– Distribution (in the Graphs panel)
– Web (in the Graphs panel)
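
As a rough idea of what a distribution view summarizes for a categorical field, the sketch below counts how often each value occurs and prints a simple text bar per value. It is an illustration in plain Python, not the Distribution node itself; the function name `distribution` and the sample drug labels are hypothetical.

```python
from collections import Counter

# Hypothetical values of the Drug field.
drugs = ["drugA", "drugB", "drugA", "drugC", "drugA", "drugB"]

def distribution(values):
    """Return (value, count, proportion) tuples, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(value, n, n / total) for value, n in counts.most_common()]

for value, count, share in distribution(drugs):
    # One '#' per occurrence gives a crude horizontal bar chart.
    print(f"{value:<6} {count:>3} {'#' * count} {share:.0%}")
```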

Page 13: Introduction to Clementine

Step 3: Data Preparation – Explore the Data (2)

Note: Connect the nodes by clicking and dragging with the middle mouse button

Double Click

Connect the nodes:

Page 14: Introduction to Clementine

Step 3: Data Preparation – Explore the Data (3)

Execution:

Note: Right click on the table node to display this menu

Page 15: Introduction to Clementine

Step 3: Data Preparation – Explore the Data (4)

Other nodes (please try the other nodes yourself):
– Histogram:

Page 16: Introduction to Clementine

Step 3: Data Preparation – Modify the Data (1)

Replacing values:
– Use Filler node:
» Suppose we want to transform all weights to their log values (Note: we usually only transform a variable to log when it is highly skewed):
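
The effect of this replacement can be sketched outside Clementine as follows. This is a minimal Python illustration, assuming the natural logarithm (the slide does not say which base is used); the function name `log_transform` and the weight values are hypothetical.

```python
import math

# Hypothetical patient weights; one large value makes the data right-skewed.
weights = [52.0, 58.5, 61.0, 64.2, 70.3, 148.0]

def log_transform(values):
    """Replace each value with its natural logarithm."""
    if any(v <= 0 for v in values):
        # The logarithm is undefined for zero and negative values.
        raise ValueError("log is undefined for non-positive values")
    return [math.log(v) for v in values]

log_weights = log_transform(weights)
print([round(w, 3) for w in log_weights])
```

After the transform, the largest value sits much closer to the rest, which is why the log is a common choice for highly skewed fields.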

Page 17: Introduction to Clementine

Step 3: Data Preparation – Modify the Data (2)

Derive a new value:
– Use Derive node:
» Suppose we want to combine Na and K:
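
The tutorial later names the derived field Na_Over_K, which suggests the combination is the ratio of Na to K. Under that assumption, deriving the field for one record might look like the Python sketch below; the function name `derive_na_over_k` and the patient values are hypothetical.

```python
def derive_na_over_k(record):
    """Return a copy of the record with an added Na_Over_K field (Na / K)."""
    derived = dict(record)  # copy, so the input record is left unchanged
    derived["Na_Over_K"] = record["Na"] / record["K"]
    return derived

# Hypothetical patient record with made-up concentrations.
patient = {"Age": 23, "Na": 0.5, "K": 0.25}
print(derive_na_over_k(patient)["Na_Over_K"])
```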

Page 18: Introduction to Clementine

Step 3: Data Preparation – Modify the Data (3)

Remove some fields
– Use Filter node
» Suppose we have derived a new field Na_Over_K; now we need to remove the fields Na and K:
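
Removing fields amounts to keeping every field except the ones listed. A minimal Python sketch of that idea, with a hypothetical function name `drop_fields` and made-up record values:

```python
def drop_fields(record, fields_to_drop):
    """Return a copy of the record without the named fields."""
    return {name: value for name, value in record.items()
            if name not in fields_to_drop}

# Hypothetical record that already contains the derived field.
patient = {"Age": 23, "Na": 0.5, "K": 0.25, "Na_Over_K": 2.0}
print(drop_fields(patient, {"Na", "K"}))
```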

Page 19: Introduction to Clementine

Step 4: Modeling – Define fields

Define the fields
– Use Type node:

Page 20: Introduction to Clementine

Step 4: Modeling – Build a Model (1)

It is the core part of data mining.

Supervised Learning:
– Train Net (Neural Network)
– C5.0 (C5.0 Decision Tree)
– Linear Reg. (Linear regression)
– C & R Tree (Classification and Regression Tree, CART)

Unsupervised Learning:
– Train Kohonen (Self-Organizing Map, SOM)
– Train KMeans (K-means Clustering)
– TwoStep (A kind of Hierarchical Clustering)

Others:
– GRI (Association Rule mining)
– Apriori (Association Rule mining)
– Factor / PCA (Factor analysis, attribute selection technique)

Page 21: Introduction to Clementine

Step 4: Modeling – Build a Model (2)

Build what model?
– Recall that our objective is to determine which type of drug is suitable for a specific patient.
– Thus, it is a classification problem (supervised learning)

In this tutorial, we use:
– C5.0 and C & R Tree
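
To make "classification" concrete, the sketch below learns a one-level rule from labeled records: for each value of a single input field, it predicts the drug seen most often with that value. This is deliberately much simpler than C5.0 or C & R Tree, which choose and nest their splits automatically; it is not an implementation of either. The function names, the choice of BP as the input field and the training records are all hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(records, field, target):
    """Learn, for each value of `field`, the most common value of `target`."""
    by_value = defaultdict(Counter)
    for record in records:
        by_value[record[field]][record[target]] += 1
    rule = {value: counts.most_common(1)[0][0]
            for value, counts in by_value.items()}
    # Fallback for field values never seen in training: overall majority class.
    default = Counter(r[target] for r in records).most_common(1)[0][0]
    return rule, default

def predict(rule, default, record, field):
    """Predict the target for one record using the learned rule."""
    return rule.get(record[field], default)

# Hypothetical training records (input field BP, target field Drug).
training = [
    {"BP": "HIGH", "Drug": "drugA"},
    {"BP": "HIGH", "Drug": "drugA"},
    {"BP": "HIGH", "Drug": "drugA"},
    {"BP": "HIGH", "Drug": "drugB"},
    {"BP": "LOW", "Drug": "drugC"},
    {"BP": "LOW", "Drug": "drugC"},
    {"BP": "NORMAL", "Drug": "drugX"},
]

rule, default = train_one_rule(training, "BP", "Drug")
print(predict(rule, default, {"BP": "LOW"}, "BP"))
```

The same pattern (learn from records whose answer is known, then predict for a new record) is what makes the problem supervised.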

Page 22: Introduction to Clementine

Step 4: Modeling – Build a Model (3)

Note:
– There are many complex settings for each model
– In this tutorial, we use the default settings
– Fine-tuning a model requires solid experience in data mining

Page 23: Introduction to Clementine

Step 5: Evaluation (1)

Even if we have learned SOMETHING, it means NOTHING until the knowledge that we have learned is ACTIONABLE and VALID

Remember:
– The data sets for training and testing are ALWAYS different (why?)
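
One common way to obtain two different data sets is to hold out part of the collected records for testing, since a model scored on the records it was trained on tends to look better than it really is. The Python sketch below shows such a holdout split; the function name `train_test_split`, the 30% test fraction and the fixed seed are assumptions for illustration, not settings taken from the tutorial.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Stand-in for ten patient records.
records = list(range(10))
train, test = train_test_split(records)
print(len(train), len(test))
```

Every record ends up in exactly one of the two sets, so no test record has been seen during training.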

Page 24: Introduction to Clementine

Step 5: Evaluation (2)

Create the following flow

Note: Must have the same flow as the training stage

Page 25: Introduction to Clementine

Step 5: Evaluation (3)

Different results:
– Different models can yield completely different results
– Choosing and tuning a good model is a difficult job
– In this tutorial, we only introduce the process of data mining
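
A simple way to compare the results of two models on the same testing data is the fraction of records each one classifies correctly. The sketch below computes that accuracy in plain Python; the function name `accuracy`, the drug labels and both sets of predictions are made up for illustration and are not output from Clementine.

```python
def accuracy(actual, predicted):
    """Return the fraction of predictions that match the actual labels."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy of an empty data set")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

# Hypothetical test labels and the predictions of two different models.
actual = ["drugA", "drugB", "drugA", "drugC"]
model_1 = ["drugA", "drugB", "drugB", "drugC"]
model_2 = ["drugA", "drugA", "drugA", "drugA"]

print(accuracy(actual, model_1), accuracy(actual, model_2))
```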

Page 26: Introduction to Clementine

Assignment 1

Page 27: Introduction to Clementine

Assignment 1 – Problem Statement

Situation:
– You are a financial analyst of a bank
– You have to predict whether a customer is Good or Bad based on some demographic information

Data Set:
– A data set about your past customers has been collected
– Each customer is either Good or Bad

Page 28: Introduction to Clementine

Assignment 1 – Field definitions

VARIABLE    ROLE     DEFINITION   DESCRIPTION
CHECKING    input    Nominal      Checking account status
HISTORY     input    Nominal      Credit history
AMOUNT      input    Interval     Amount in Bank
SAVINGS     input    Nominal      No. of Savings (bonds, stocks, etc.)
EMPLOYED    input    Nominal      Employment Type (Gov., private, etc.)
INSTALLP    input    Nominal      Type of installment rate
MARITAL     input    Nominal      Marital status
PROPERTY    input    Nominal      Type of Property
AGE         input    Interval     Age in years
OTHER       input    Nominal      Type of other installment plan
HOUSING     input    Nominal      Type of House
EXISTCR     input    Interval     Number of existing credits
JOB         input    Nominal      Job Nature
FOREIGN     input    Binary       Foreign worker or Local worker
GOOD_BAD    output   Binary       Good or bad credit rating

Page 29: Introduction to Clementine

Assignment 1 – Data Mining Process

Data Collection
– Please download the CreditRisk data set from http://www.se.cuhk.edu.hk/~ect7470/
– Two data sets:
(i) creditRisk1.csv is for training
(ii) creditRisk2.csv is for testing

Data Preprocessing
– Please explore the data and think critically about whether any data preprocessing is necessary
» Hint: Two of the interval variables are highly skewed
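
Besides looking at a histogram, skewness can be checked numerically. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second), which is close to 0 for symmetric data and clearly positive for a long right tail. It is an illustration in plain Python; the function name `skewness` and both sample lists are hypothetical and are not taken from the CreditRisk data.

```python
def skewness(values):
    """Return the moment-based skewness coefficient m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0  # all values identical: no spread, no skew
    return m3 / m2 ** 1.5

symmetric = [1, 2, 3, 4, 5]      # hypothetical, evenly spread values
right_skewed = [1, 1, 1, 2, 10]  # hypothetical, one large value in the tail

print(skewness(symmetric), skewness(right_skewed))
```

A field with a large positive coefficient is a candidate for the log transform described earlier in the tutorial.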

Page 30: Introduction to Clementine

Assignment 1 – Data Mining Process

Modeling
– Please build a prediction model using default settings:
» C5.0 Decision Tree

Model Assessment
– Please use the testing data set to evaluate the performance of the prediction model

Page 31: Introduction to Clementine

Assignment 1 – Submission

Save the stream as "id.str"
– E.g., 00123456.str

Upload your stream to the course account

Deadline:
– 4 April 2004

This is an individual assignment

Note: We strongly encourage you to submit this assignment during the class!!!