Data mining chapter04and5-best



Chapter 4 - Classification

Classification vs. Prediction

Classification:

– predicts categorical (discrete and unordered) class labels

– constructs a model from the training set and the class labels in a classifying attribute, then uses that model to classify new data

Prediction:

– models continuous-valued functions to predict unknown or missing values

Classification vs. Prediction (cont.)

Prediction is similar to classification

– First, construct a model

– Second, use model to predict unknown value

Major method for prediction is regression

– Linear and multiple regression

– Non-linear regression

Prediction is different from classification

– Classification predicts categorical class labels

– Prediction models continuous-valued functions
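As a rough illustration of prediction by linear regression, the sketch below fits a straight line by least squares and uses it to predict a continuous value. The data (years of experience against salary) and the function name fit_line are made up for this example and do not come from the slides.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical training data: years of experience vs. salary (in thousands).
years = [1, 2, 3, 4, 5]
salary = [30, 35, 40, 45, 50]

# Step 1: construct the model.
a, b = fit_line(years, salary)

# Step 2: use the model to predict an unknown continuous value.
predicted = a + b * 6
print(predicted)
```

The output is a number rather than a class label, which is the distinction the slide draws between prediction and classification.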

Recall: Data Mining Models and Tasks

Classification

The classification process involves two steps:

1. Model construction: describing a set of predetermined classes using a training data set

– The training data is a set of tuples; each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

– The model is represented as classification rules, decision trees, or mathematical formulae

2. Model usage: using the model to classify future or unknown objects, or to explain some scenario with some accuracy after testing

Classification: A Two-Step Process (cont.)

Model construction: describing a set of predetermined classes

– Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute

– The set of tuples used for model construction is the training set

– The model is represented as classification rules, decision trees, or mathematical formulae

Model usage: classifying future or unknown objects

– Estimate the accuracy of the model

– The known label of each test sample is compared with the model's classification of that sample

– The accuracy rate is the percentage of test set samples that are correctly classified by the model

– The test set must be independent of the training set; otherwise the accuracy estimate is overly optimistic and over-fitting goes undetected
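The accuracy rate described above can be sketched in a few lines. The function name accuracy_rate and the label lists are illustrative only; any classifier's output could take the place of the predicted list.

```python
def accuracy_rate(known_labels, predicted_labels):
    """Percentage of test samples whose predicted label equals the known label."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(k == p for k, p in zip(known_labels, predicted_labels))
    return 100.0 * correct / len(known_labels)


# Hypothetical test-set labels and model output.
known = ["yes", "no", "no", "yes", "no"]
predicted = ["yes", "no", "yes", "yes", "no"]
print(accuracy_rate(known, predicted))  # prints 80.0
```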


Classification Process (1): Model Construction

Training data:

NAME    RANK             YEARS   TENURED
Mike    Assistant Prof   3       no
Mary    Assistant Prof   7       yes
Bill    Professor        2       yes
Jim     Associate Prof   7       yes
Dave    Assistant Prof   6       no
Anne    Associate Prof   3       no

A classification algorithm learns the classifier (model) from the training data:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification Process (2): Use the Model in Prediction

The classifier is first applied to the testing data:

NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

It is then applied to unseen data: (Jeff, Professor, 4). Tenured?
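The two slides above can be replayed in code. This is a minimal sketch: the rule and the tuples are taken from the slides, while the function and variable names are chosen here for illustration.

```python
def classify_tenured(rank, years):
    """The learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"


# Testing data from the slide: (name, rank, years, known label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Compare the known label of each test sample with the model's output.
misclassified = [
    name
    for name, rank, years, label in testing_data
    if classify_tenured(rank, years) != label
]
accuracy = 100.0 * (len(testing_data) - len(misclassified)) / len(testing_data)

# Apply the model to the unseen tuple (Jeff, Professor, 4).
jeff = classify_tenured("Professor", 4)

print(misclassified, accuracy, jeff)
```

With this rule, Merlisa is the one test tuple that is misclassified (7 years but not tenured), so the accuracy on the testing data is 75%, and Jeff is classified as tenured.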

Supervised vs. Unsupervised Learning

Supervised learning (classification)

– Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations

– New data is classified based on the training set

Unsupervised learning (clustering)

– The class labels of the training data are unknown

– Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

– Classification usually follows clustering

Issues regarding classification and prediction: Data Preparation

Data cleaning

– Preprocess data in order to reduce noise and handle missing values

Relevance analysis (feature selection)

– Remove the irrelevant or redundant attributes

Data transformation

– Generalize and/or normalize data

Issues regarding classification and prediction: Evaluating Classification Methods

Predictive accuracy

– Measures how accurately the classifier predicts the class label of an object

Speed

– The computational cost of generating and using the given classifier or predictor

– time to construct the model

– time to use the model

Scalability

– The ability to construct the classifier or predictor efficiently given large amounts of data

Robustness

– The ability of the classifier or predictor to make correct predictions given noisy data or data with missing values

Interpretability

– The level of understanding and insight provided by the classifier or predictor

Classification: Technical Definition

Given a collection of records (training set)

– Each record contains a set of attributes; one of the attributes is the class.

Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible.

– A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
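One simple way to divide a data set into training and test sets is a random holdout split. The sketch below is only one possible scheme; the 30% test fraction, the fixed seed and the function name holdout_split are assumptions made for this example, not values given in the slides.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (training_set, test_set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical record ids 1..10.
records = list(range(1, 11))
training_set, test_set = holdout_split(records)
print(len(training_set), len(test_set))
```

No record appears in both sets, which keeps the test set independent of the data used to build the model.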


Thus, Classification

Classification is a data mining (machine learning) technique used to predict group membership for data instances.

Given a collection of records (training set), each record contains a set of attributes; one of the attributes is the class.

– Find a model for the class attribute as a function of the values of the other attributes.

Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model.

– Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

For example, one may use classification to predict whether the weather on a particular day will be "sunny", "rainy" or "cloudy".

Illustrating Classification Task

Training set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

[Figure: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction).]

Classification methods

Goal: predict class Ci = f(x1, x2, ..., xn)

There are various classification methods. Popular classification techniques include the following.

– Decision tree classifier: divides the decision space into piecewise constant regions

– Rule-based and association-based classifiers

– K-Nearest Neighbour: classifies based on a similarity measurement

– Neural networks: partition by non-linear boundaries

– Bayesian network: a probabilistic model

– Support vector machine: can solve non-linearly separable problems
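To make one of these methods concrete, here is a minimal K-Nearest Neighbour sketch. It assumes numeric features, Euclidean distance as the (dis)similarity measurement and a simple majority vote; the training points and the function name knn_classify are invented for the example.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """training is a list of (feature_vector, class_label) pairs."""
    # Keep the k training instances closest to the query point.
    neighbours = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    # Predict the class label that is most common among those neighbours.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical two-class training set.
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.0), "A"),
    ((6.0, 6.0), "B"), ((7.0, 7.5), "B"), ((6.5, 8.0), "B"),
]

print(knn_classify(training, (2.0, 2.0)))
print(knn_classify(training, (6.0, 7.0)))
```

Here Ci = f(x1, x2) is not an explicit formula: the "model" is the stored training set plus the distance function.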

Simple classification using decision tree

Decision tree classifier

A decision tree performs classification by constructing a tree based on training instances, with leaves having class labels.

– The tree is traversed for each test instance to find a leaf, and the class of the leaf is the predicted class. This is directed knowledge discovery in the sense that there is a specific field whose value we want to predict.
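The traversal step can be sketched as follows. The tree here is written by hand (not induced by a learning algorithm) and simply encodes the tenure rule from the earlier slide; the nested-dictionary representation and the name predict are assumptions made for this illustration.

```python
# Internal nodes are dicts holding a test and two subtrees; leaves are class labels.
tree = {
    "test": lambda record: record["rank"] == "Professor",
    "true": "yes",
    "false": {
        "test": lambda record: record["years"] > 6,
        "true": "yes",
        "false": "no",
    },
}


def predict(node, record):
    """Walk from the root to a leaf; the leaf's class is the predicted class."""
    while isinstance(node, dict):
        node = node["true"] if node["test"](record) else node["false"]
    return node


print(predict(tree, {"rank": "Professor", "years": 4}))
print(predict(tree, {"rank": "Assistant Prof", "years": 3}))
```

Each root-to-leaf path corresponds to one if-then rule, which is why decision trees are easy to convert into rule form.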

Decision trees are a widely used learning method. They have been applied to classify:

– medical patients by disease,

– equipment malfunctions by cause,

– loan applicants by likelihood of payment,

– accidents by severity.

Pros and Cons of decision trees

Pros

• Reasonable training time

• Fast application

• Easy to interpret

• Easy to implement

• Can handle large number of features

Cons

– Cannot handle complicated relationships between features

– Simple decision boundaries

– Problems with lots of missing data

Why decision tree induction in data mining?

• Relatively faster learning speed (than other classification methods)

• Convertible to simple and easy-to-understand if-then-else classification rules

• Comparable classification accuracy with other methods

• Does not require any prior knowledge of the data distribution and works well on noisy data


Chapter 5 - Cluster Analysis

What is Cluster Analysis?

Cluster: a collection of data objects

– Similar to one another within the same cluster

– Dissimilar to the objects in other clusters

Cluster analysis

– Grouping a set of data objects into clusters

Clustering is unsupervised classification: no predefined classes

Typical applications

– As a stand-alone tool to get insight into data distribution

– As a preprocessing step for other algorithms

Clustering (cont.)

Given a set of points, with a notion of distance between points, group the points into some number of clusters, so that members of a cluster are in some sense as close to each other as possible.

While data points in the same cluster are similar, those in separate clusters are dissimilar to one another.


• Clustering is a data mining (machine learning) technique that finds similarities between data objects according to the characteristics found in the data and groups similar data objects into one cluster

Cluster Analysis (cont.)

Thus, cluster analysis means

– Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

– Inter-cluster distances are maximized

– Intra-cluster distances are minimized
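The two distances can be measured in several ways. The sketch below uses one simple choice, the mean pairwise Euclidean distance, which is an assumption made here rather than a definition from the slides; the points are also made up.

```python
import math
from itertools import combinations, product


def mean_intra_distance(cluster):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(cluster_1, cluster_2):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(cluster_1, cluster_2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)


# Hypothetical clusters.
cluster_a = [(0, 0), (0, 1), (1, 0)]
cluster_b = [(10, 10), (10, 11), (11, 10)]

print(mean_intra_distance(cluster_a))
print(mean_inter_distance(cluster_a, cluster_b))
```

For a good grouping the first number is small relative to the second.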

Quality: What Is Good Clustering?

• A good clustering method will produce high quality clusters with

– high intra-class similarity

– low inter-class similarity

• The quality of a clustering result depends on both the similarity measure used by the method and its implementation

– Key requirement of clustering: a good measure of similarity between instances

• The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns in the given datasets


Requirements of Clustering in Data Mining

Scalability

– Highly scalable algorithms are needed for clustering on large databases such as a data warehouse (DW)

Ability to deal with different types of attributes

– Clustering may be performed also on binary, categorical and ordinal data

Discovery of clusters with arbitrary shape

– Most algorithms tend to find spherical clusters

Minimal requirements for domain knowledge to determine input parameters

– Clustering results are quite sensitive to the input parameters

– Parameters are often difficult to determine

Able to deal with noise and outliers

Insensitive to order of input records

High dimensionality

– A data warehouse can contain many dimensions

Incorporation of user-specified constraints

Interpretability and usability

Example: Clustering Application

• Text/Document Clustering:

– Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.

– Approach: identify frequently occurring terms in each document, form a similarity measure based on the frequencies of different terms, and use it to cluster documents.

– Gain: information retrieval can use the clusters to relate a new document or search term to clustered documents.
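The approach above can be sketched with raw term counts and cosine similarity. Cosine similarity is one common choice of frequency-based measure and is an assumption here, since the slide does not name a specific measure; the three documents are invented.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Count how often each (lower-cased, whitespace-separated) term occurs."""
    return Counter(text.lower().split())


def cosine_similarity(tf_a, tf_b):
    """Cosine of the angle between two term-frequency vectors."""
    shared_terms = set(tf_a) & set(tf_b)
    dot = sum(tf_a[t] * tf_b[t] for t in shared_terms)
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents.
docs = {
    "d1": "data mining finds patterns in data",
    "d2": "mining patterns from large data",
    "d3": "the cat sat on the mat",
}
tf = {name: term_frequencies(text) for name, text in docs.items()}

print(cosine_similarity(tf["d1"], tf["d2"]))
print(cosine_similarity(tf["d1"], tf["d3"]))
```

Documents d1 and d2 share several terms and score well above d3, which shares none with d1; a clustering algorithm would use these pairwise scores to group d1 with d2.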

Clustering Applications (cont.)

Cluster analysis can be applied for

– Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

– Summarization: reduce the size of large data sets

What is not Cluster Analysis?

Supervised classification

– Has class label information

Simple segmentation

– Dividing students into different registration groups alphabetically, by last name

Results of a query

– Groupings are a result of an external specification

Types of Clusters

Major types: well-separated clusters and center-based clusters

Well-separated clusters

– A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.

[Figure: 3 well-separated clusters]

Types of Clusters: Center-Based

– A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster

– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster

[Figure: 4 center-based clusters]
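The difference between the two kinds of center can be seen in a small sketch. It assumes Euclidean distance and takes the medoid to be the member with the smallest total distance to the other members, which is one common reading of "most representative"; the points are made up.

```python
import math


def centroid(points):
    """Coordinate-wise average of all points (need not be an actual data point)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def medoid(points):
    """The actual data point with the smallest total distance to all others."""
    return min(points, key=lambda p: sum(math.dist(p, q) for q in points))


# Hypothetical cluster with one far-away member.
points = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (9.0, 9.0)]

print(centroid(points))
print(medoid(points))
```

The centroid is pulled toward the far-away point and is not itself a member of the cluster, while the medoid is always one of the original points.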


Types of data in cluster analysis

Variables can have different data types

Each type needs its own distance computation for cluster analysis

Some of the types of data we have are:

– Interval-scaled variables

– Binary variables

– Nominal and ordinal variables

– Mixed types

Interval-valued variables

These are variables measured on a continuous scale, such as height, weight, or age

Because the unit of measurement affects the distances between objects, the data needs preprocessing that removes the effect of the unit

This is called standardization
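A minimal sketch of standardization is the z-score, which subtracts the mean and divides by the standard deviation. Using the standard deviation is an assumption here (other spread measures, such as the mean absolute deviation, are also used), and the height values are invented.

```python
import statistics


def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation.

    Assumes the values are not all identical (standard deviation > 0).
    """
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# The same hypothetical heights expressed in two different units.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

z_cm = standardize(heights_cm)
z_m = standardize(heights_m)
print(z_cm)
print(z_m)
```

Both lists of z-scores agree (up to floating-point rounding), so distances computed on the standardized values no longer depend on whether height was recorded in centimetres or metres.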

Binary Variables

A binary variable is a variable which has only two possible values (1 or 0, yes or no, etc.)

– For example: smoker, educated, Ethiopian, IsFemale

Nominal Variables

A nominal variable is a generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Ordinal Variables

An ordinal variable can be discrete or continuous

Order is important, e.g., rank

Variables of Mixed Types

A database may contain different types of variables

– symmetric binary, asymmetric binary, nominal, ordinal, interval

One may

– use a weighted formula to combine their effects

– or preprocess the data so that it fits the technique's requirements

Major Clustering Approaches

Partitioning clustering approach:

– Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors

– Typical methods:

distance-based: K-means clustering

model-based: expectation maximization (EM) clustering
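As a rough sketch of the distance-based partitioning method, the code below runs the basic K-means loop: assign each point to its nearest center, then move each center to the mean of its points. The starting centers are passed in explicitly so the run is repeatable; that choice, the sample points and the function names are assumptions made for this example.

```python
import math


def k_means(points, initial_centers, max_iterations=100):
    """Basic K-means; returns (centers, clusters)."""
    centers = list(initial_centers)
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centers stopped moving
            break
        centers = new_centers
    return centers, clusters


def sum_of_squared_errors(centers, clusters):
    """The criterion K-means tries to minimize."""
    return sum(
        math.dist(p, centers[i]) ** 2
        for i, cluster in enumerate(clusters)
        for p in cluster
    )


# Hypothetical 2-D points forming two groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centers, clusters = k_means(points, initial_centers=[(1.0, 1.0), (9.0, 8.0)])
print(centers)
print(sum_of_squared_errors(centers, clusters))
```

The result depends on the starting centers, so in practice the algorithm is often run several times and the partition with the lowest sum of squared errors is kept.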

Hierarchical clustering approach:

– Create a hierarchical decomposition of the set of data (or objects) using some criterion

– Typical methods:

agglomerative vs. divisive

single link vs. complete link
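The sketch below illustrates the agglomerative (bottom-up) variant only: every point starts as its own cluster and the two closest clusters are merged repeatedly. Single link measures the distance between two clusters by their closest pair of points, complete link by their farthest pair. The one-dimensional points and the function names are made up for this example.

```python
import math
from itertools import product


def single_link(cluster_1, cluster_2):
    """Distance between the closest pair of points across the two clusters."""
    return min(math.dist(p, q) for p, q in product(cluster_1, cluster_2))


def complete_link(cluster_1, cluster_2):
    """Distance between the farthest pair of points across the two clusters."""
    return max(math.dist(p, q) for p, q in product(cluster_1, cluster_2))


def agglomerative(points, n_clusters, linkage):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical points on a line with slowly growing gaps.
points = [(0.0,), (1.0,), (2.2,), (3.5,), (5.0,)]

print(agglomerative(points, 2, single_link))
print(agglomerative(points, 2, complete_link))
```

On this data single link chains the first four points together and leaves the last point alone, while complete link produces the more balanced split of two and three points, which shows that the linkage criterion can change the hierarchy.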