
Classification & Regression

COSC 526 Class 7

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov

2

Last Class

• Streaming Data Analytics:
– sampling, counting, frequent item sets, etc.

• Berkeley Data Analytics Stack (BDAS)

3

Project Deliverables and Deadlines

Deliverable                              Due date           % of Grade
Initial selection of topics [1, 2]       Jan 27, 2015       10
Project Description and Approach         Feb 20, 2015       20
Initial Report                           Mar 20, 2015       10
Project Demonstration                    Apr 16-19, 2015    10
Final Project Report (10 pages)          Apr 21, 2015       25
Poster (12-16 slides)                    Apr 23, 2015       25

• [1] Projects can come with their own data (e.g., from your project) or can be provided

• [2] Datasets need to be open! Please don’t use datasets that have proprietary limitations

• All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles

4

This class and next…

• Classification

• Decision Tree Methods

• Instance based Learning

• Support Vector Machines (SVM)

• Latent Dirichlet Allocation (LDA)

• How do we make these work with large-scale datasets?

5

Part I: Classification and a few new terms…

6

Classification

• Given a collection of records (training):

– Each record has a set of attributes [x1,x2, …, xn]

– One attribute is referred to as the class [y]

• Training Goal: Build a model for the class attribute y as a function of the other attributes: y = f(x1, x2, …, xn)

• Testing Goal: Previously unseen records should be assigned the class attribute as accurately as possible

7

Illustration of Classification

Training data:

ID   Attrib1   Attrib2   Attrib3   Class
1    Yes       Large     120,000   No
2    No        Medium    100,000   No
3    No        Small     70,000    No
4    Yes       Medium    120,000   No
5    No        Large     95,000    Yes
6    Yes       Large     220,000   Yes
…    …         …         …         …

Test data:

ID   Attrib1   Attrib2   Attrib3   Class
20   No        Small     55,000    ?
21   Yes       Medium    90,000    ?
22   Yes       …         …         ?

(Workflow: a learning algorithm learns a model from the training data; the model is then applied to the test data to predict the unknown class labels. A minimal sketch of this workflow follows.)
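As a concrete illustration of this learn/apply workflow, here is a minimal sketch in Python. It assumes scikit-learn is available; the numeric encoding of the categorical attributes and the choice of DecisionTreeClassifier are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the learn-model / apply-model workflow (assumes scikit-learn).
# The records mirror the Attrib1 / Attrib2 / Attrib3 / Class tables above.
from sklearn.tree import DecisionTreeClassifier

train_X = [
    ["Yes", "Large", 120_000], ["No", "Medium", 100_000], ["No", "Small", 70_000],
    ["Yes", "Medium", 120_000], ["No", "Large", 95_000], ["Yes", "Large", 220_000],
]
train_y = ["No", "No", "No", "No", "Yes", "Yes"]

# Illustrative numeric encoding of the categorical attributes.
yes_no = {"Yes": 1, "No": 0}
size = {"Small": 0, "Medium": 1, "Large": 2}

def encode(row):
    return [yes_no[row[0]], size[row[1]], row[2]]

# Learn a model for the class attribute as a function of the other attributes.
model = DecisionTreeClassifier().fit([encode(r) for r in train_X], train_y)

# Apply the model to previously unseen records (IDs 20 and 21 of the test table).
test_X = [["No", "Small", 55_000], ["Yes", "Medium", 90_000]]
print(model.predict([encode(r) for r in test_X]))
```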

8

Examples of classification

• Predicting whether tumor cells are benign or cancerous

• Classifying credit card transactions as legitimate or fraudulent

• Classifying whether a protein sequence region is a helix, a beta-sheet, or a random coil

• Classifying whether a tweet is talking about sports, finance, terrorism, etc.

9

Two problem Settings

• Supervised Learning:
– Training and testing data
– Training data includes labels (the class attribute y)
– Testing data consists of all the other attributes

• Unsupervised Learning:
– Training data does not usually include labels
– We will cover unsupervised learning later (mid-February)

10

Two more terms..

• Parametric:
– A particular functional form is assumed, e.g., Naïve Bayes
– Simplicity: easy to interpret and understand
– High bias: real data might not follow the assumed functional form

• Non-parametric:
– Estimation is purely data-driven
– No functional form is assumed

11

Part II: Classification with Decision Trees

Examples illustrated from Tan, Steinbach and Kumar, Introduction to Data Mining textbook

12

Decision Trees

• One of the more popular learning techniques

• Uses a tree-structured plan of a set of attributes to test in order to predict the output

• The decision of which attribute to test is based on information gain

13

Example of a Decision Tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model (a decision tree; the internal nodes are the splitting attributes):

Refund = Yes → NO
Refund = No:
  Marital Status = Married → NO
  Marital Status = Single or Divorced:
    Taxable Income < 80K → NO
    Taxable Income ≥ 80K → YES

14

Another Example of a Decision Tree

Using the same training data as before, a different tree also fits:

Marital Status = Married → NO
Marital Status = Single or Divorced:
  Refund = Yes → NO
  Refund = No:
    Taxable Income < 80K → NO
    Taxable Income ≥ 80K → YES

There could be more than one tree that fits the same data!

15

Decision Tree Classification Task

Training set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

(Induction: a tree-induction algorithm learns the model, a decision tree, from the training set. Deduction: the learned model is applied to the test set to predict the missing class labels.)

16

Apply Model to Test Data

Test record:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Using the decision tree from before (root: Refund; then Marital Status; then Taxable Income), start from the root of the tree and follow the branch that matches the record at each node:

– Refund = No → take the No branch to the Marital Status node
– Marital Status = Married → take the Married branch, which is a leaf labeled NO

Assign Cheat to “No”

22

Decision Tree Classification Task (revisited)

(Same induction/deduction framework and training/test tables as slide 15: the tree-induction algorithm learns the decision tree from the training set, and the model is then applied to the test set.)

23

Decision Tree Induction

• Many algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

24

General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t

• General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled with the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
– Recursively apply the procedure to each subset (a minimal sketch of this recursion is given below)

(The slide shows the Refund / Marital Status / Taxable Income / Cheat training data from before, with Dt indicating the records that reach the current node and a “?” for the attribute test still to be chosen.)
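A minimal Python sketch of this recursive procedure, under the assumption that an attribute-test selector (choose_split) is supplied; the helper names are illustrative, not from the slides.

```python
# Minimal sketch of Hunt's algorithm; choose_split is an abstract attribute-test
# selector (e.g., pick the test with the best Gini). Names are illustrative.
from collections import Counter

def hunt(records, labels, choose_split, default_class):
    """records: list of attribute dicts; labels: their class values.
    choose_split(records, labels) returns (test_fn, branch_values) or None."""
    if not records:                        # empty D_t -> leaf labeled with default class y_d
        return ("leaf", default_class)
    if len(set(labels)) == 1:              # all records in the same class y_t -> leaf
        return ("leaf", labels[0])
    split = choose_split(records, labels)  # attribute test used to split D_t
    if split is None:                      # no test separates the classes -> majority leaf
        return ("leaf", Counter(labels).most_common(1)[0][0])
    test_fn, branch_values = split
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for v in branch_values:                # partition D_t and recurse on each subset
        idx = [i for i, r in enumerate(records) if test_fn(r) == v]
        children[v] = hunt([records[i] for i in idx], [labels[i] for i in idx],
                           choose_split, majority)
    return ("node", test_fn, children)
```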

25

Hunt’s Algorithm

Applied to the Refund / Marital Status / Taxable Income / Cheat training data, the tree grows step by step:

1. Start with a single leaf predicting the majority class: Don’t Cheat.
2. Split on Refund: Refund = Yes → Don’t Cheat; Refund = No → Don’t Cheat (this branch is still impure).
3. Split the Refund = No branch on Marital Status: Married → Don’t Cheat; Single, Divorced → still impure.
4. Split the Single, Divorced branch on Taxable Income: < 80K → Don’t Cheat; >= 80K → Cheat.

26

Tree Induction

• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion

• Issues:
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting


28

How to Specify Test Condition?

• Depends on the attribute type:
– Nominal
– Ordinal
– Continuous

• Depends on the number of ways to split:
– 2-way split
– Multi-way split

29

Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  CarType → {Family}, {Sports}, {Luxury}

• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} vs. {Family}   OR   CarType → {Family, Luxury} vs. {Sports}

30

Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  Size → {Small}, {Medium}, {Large}

• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs. {Large}   OR   Size → {Medium, Large} vs. {Small}

• What about Size → {Small, Large} vs. {Medium}? (This grouping does not preserve the order.)

31

Splitting Based on Continuous Attributes

• Different ways of handling:
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute-intensive

32

Splitting Based on Continuous Attributes

(i) Binary split:   Taxable Income > 80K?  → Yes / No

(ii) Multi-way split:   Taxable Income?  → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


34

How to determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate splits:
– Own Car?  → Yes: (C0: 6, C1: 4);  No: (C0: 4, C1: 6)
– Car Type?  → Family: (C0: 1, C1: 3);  Sports: (C0: 8, C1: 0);  Luxury: (C0: 1, C1: 7)
– Student ID?  → one branch per ID (c1 … c20), each containing a single record, (C0: 1, C1: 0) or (C0: 0, C1: 1)

Which test condition is the best?

35

How to determine the Best Split

• Greedy approach:
– Nodes with a homogeneous class distribution are preferred

• Need a measure of node impurity:
– (C0: 5, C1: 5) → non-homogeneous, high degree of impurity
– (C0: 9, C1: 1) → homogeneous, low degree of impurity

36

Measures of Node Impurity

• Gini Index

• Entropy

• Misclassification error
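These three measures can be sketched directly from class counts; the entropy and misclassification-error formulas below are the standard ones (they are named but not written out on the slide).

```python
# Standard node-impurity measures computed from class counts (a sketch).
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# Matches the Gini examples on the following slides:
print(gini([0, 6]), gini([1, 5]), gini([2, 4]), gini([3, 3]))
# -> approximately 0.0, 0.278, 0.444, 0.5
```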

37

How to Find the Best Split

Before splitting, the parent node has class counts (C0: N00, C1: N01) and impurity M0.

Attribute A splits it into Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), with impurities M1 and M2 and weighted combination M12.

Attribute B splits it into Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), with impurities M3 and M4 and weighted combination M34.

Compare Gain = M0 – M12 vs. M0 – M34, and choose the attribute with the larger gain.

38

Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/nc, for nc classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples (class counts at a node):
  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500

39

Examples for computing GINI

GINI(t) = 1 − Σ_j [p(j | t)]²

C1: 0, C2: 6 →  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5 →  P(C1) = 1/6,  P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4 →  P(C1) = 2/6,  P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444

40

Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.

• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.

41

Binary Attributes: Computing the GINI Index

• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought

Example: attribute B splits the parent node (C1: 6, C2: 6; Gini = 0.500) into

  Node N1 (B = Yes):  C1 = 5, C2 = 2
  Node N2 (B = No):   C1 = 1, C2 = 4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

(A minimal sketch that reproduces this computation follows.)
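A quick sketch that reproduces the weighted-children computation above from the slide’s counts (N1: 5/2, N2: 1/4), using the corrected denominators of 7 and 5.

```python
# Weighted Gini of a split, GINI_split = sum_i (n_i / n) * GINI(i)  (a sketch).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: list of per-child class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Node N1: C1=5, C2=2;  Node N2: C1=1, C2=4  (parent: C1=6, C2=6, Gini = 0.5)
print(round(gini([5, 2]), 3),            # -> 0.408
      round(gini([1, 4]), 3),            # -> 0.32
      round(gini_split([[5, 2], [1, 4]]), 3))   # -> 0.371
```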

42

Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:

  CarType   Family   Sports   Luxury
  C1        1        2        1
  C2        4        1        1
  Gini = 0.393

Two-way split (find the best partition of values):

  CarType   {Sports, Luxury}   {Family}
  C1        3                  1
  C2        2                  4
  Gini = 0.400

  CarType   {Sports}   {Family, Luxury}
  C1        2          2
  C2        1          5
  Gini = 0.419

43

Continuous Attributes: Computing the Gini Index

• Use a binary decision based on one value (e.g., Taxable Income > 80K? → Yes / No)

• Several choices for the splitting value:
– number of possible splitting values = number of distinct values

• Each splitting value v has a count matrix associated with it:
– class counts in each of the partitions, A < v and A ≥ v

• Simple method to choose the best v:
– for each v, scan the database to gather the count matrix and compute its Gini index
– computationally inefficient! Repetition of work.

(The slide shows the Refund / Marital Status / Taxable Income / Cheat training data from before.)

44

Continuous Attributes: Computing Gini Index...

• For efficient computation, for each attribute:
– sort the attribute on its values
– linearly scan these values, each time updating the count matrix and computing the Gini index
– choose the split position that has the least Gini index

Sorted values (Taxable Income) with class labels, candidate split positions, and the Gini of each split:

  Income:  60   70   75   85   90   95   100   120   125   220
  Cheat:   No   No   No   Yes  Yes  Yes  No    No    No    No

  Split position   Yes (<=, >)   No (<=, >)   Gini
  55               0, 3          0, 7         0.420
  65               0, 3          1, 6         0.400
  72               0, 3          2, 5         0.375
  80               0, 3          3, 4         0.343
  87               1, 2          3, 4         0.417
  92               2, 1          3, 4         0.400
  97               3, 0          3, 4         0.300
  110              3, 0          4, 3         0.343
  122              3, 0          5, 2         0.375
  172              3, 0          6, 1         0.400
  230              3, 0          7, 0         0.420

Best split: Taxable Income <= 97, with Gini = 0.300. (A minimal sketch of this scan follows.)
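A minimal sketch of this sort-and-scan procedure on the Taxable Income column from the slide. Candidate thresholds here are midpoints between consecutive sorted values, so the best cut is reported as 97.5 rather than the slide’s rounded 97; the Gini value matches.

```python
# Sort-and-scan search for the best binary split on a continuous attribute (a sketch).
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # in K, from the slide
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

pairs = sorted(zip(incomes, cheat))                       # one sort: O(n log n)
classes = ["Yes", "No"]
best = None
for i in range(len(pairs) - 1):                           # linear scan of split positions
    v = (pairs[i][0] + pairs[i + 1][0]) / 2               # midpoint between consecutive values
    left  = [sum(1 for x, y in pairs[: i + 1] if y == c) for c in classes]
    right = [sum(1 for x, y in pairs[i + 1:] if y == c) for c in classes]
    g = (sum(left) * gini(left) + sum(right) * gini(right)) / len(pairs)
    if best is None or g < best[1]:
        best = (v, g)

print(best)   # best threshold ~97.5 with Gini = 0.300 (matches the slide's minimum at 97)
```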

45

How do we do this with large datasets?

• GINI computation is expensive:
– sorting along numeric attributes is O(n log n)!
– needs an in-memory hash table to split based on attributes

• Empirical observation:
– GINI scores tend to increase or decrease slowly
– the minimum Gini score for the splitting attribute is lower than at other data points

• CLOUDS (Classification for Large OUt-of-core DataSets) exploits these observations

46

How to hasten GINI computation? (1)

• Sampling of Splitting points (SS):
– use quantiles to divide the points into q intervals
– evaluate the GINI index at each interval boundary
– determine the minimum value, GINImin
– split along the attribute with GINImin

• Pro:
– only one pass over the entire dataset

• Cons:
– greedy
– time depends on the number of points in each interval

(A minimal sketch of this quantile-based sampling follows.)
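A minimal sketch of the SS idea, evaluating the Gini index only at quantile boundaries; the helpers are illustrative and not the actual CLOUDS implementation.

```python
# SS sketch: evaluate the Gini index only at q quantile boundaries instead of at
# every distinct value (illustrative, not the actual CLOUDS implementation).
import numpy as np

def gini_of_split(values, labels, v):
    """Weighted Gini of the binary split: attribute <= v vs. attribute > v."""
    labels = np.asarray(labels)
    def gini(mask):
        if mask.sum() == 0:
            return 0.0
        _, counts = np.unique(labels[mask], return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    left = np.asarray(values) <= v
    return left.mean() * gini(left) + (~left).mean() * gini(~left)

def ss_best_split(values, labels, q=8):
    # q - 1 interior quantile boundaries are the only candidate split points.
    boundaries = np.quantile(values, np.linspace(0, 1, q + 1)[1:-1])
    scores = [(gini_of_split(values, labels, v), v) for v in boundaries]
    return min(scores)            # (GINI_min, split value) over the sampled boundaries only

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(ss_best_split(incomes, cheat, q=4))
```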

47

How to hasten GINI computation? (2)

• Sampling Splitting points with Estimation (SSE):
– find GINImin as outlined in SS
– estimate a lower bound GINIest for each interval
– eliminate (prune) all intervals where GINIest ≥ GINImin
– derive the list of (alive) intervals in which the correct split can still lie

• How do we determine what to prune?
– Goal: estimate the minimum GINI within an interval [vl, vu]
– Approach: a hill-climbing heuristic
– Inputs: n – number of data points (records); c – number of classes
– Compute: xi (yi) – number of elements of class i that are ≤ vl (vu); ci – total number of elements of class i; nl (nu) – number of elements that are ≤ vl (vu)
– Take the partial derivative along the subset of classes c; the class with the minimum gradient determines the split point retained (for the next split)

• Pros:
– instead of being dependent on n, the hill-climbing heuristic is dependent on c
– other optimizations can include prioritizing alive intervals based on the estimated GINImin
– better accuracy than SS in estimating the splitting points

• Cons:
– paying more for splitting the data (I/O costs are higher)

48

Summary

• Decision Trees are an elegant way to design classifiers:
– simple, with intuitive meaning associated with the splits

• More than one tree representation is possible for the same data:
– many heuristic measures are available for deciding how to construct a tree

• For Big Data, we need to design heuristics that provide approximate (yet nearly accurate) solutions

49

Part III: Classification with k-Nearest Neighbors

Examples illustrated from Tom Mitchell, Andrew Moore, Jure Leskovec, …

50

Instance based Learning

• Given a number of examples {(x, y)} and a query vector q:
– find the closest example(s) x*
– predict y*

• Learning:
– store all training examples

• Prediction:
– classification problem: return the majority class among the k closest examples
– regression problem: return the average y value of the k closest examples

(A minimal k-NN sketch follows.)
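A minimal sketch of these two prediction rules using only the Python standard library; the example data and the choice of k are made up for illustration.

```python
# k-nearest-neighbor prediction: majority vote (classification) or mean (regression).
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(examples, q, k=3, regression=False):
    """examples: list of (x, y) pairs; q: query vector."""
    neighbors = sorted(examples, key=lambda xy: euclidean(xy[0], q))[:k]
    ys = [y for _, y in neighbors]
    if regression:
        return sum(ys) / len(ys)                    # average y of the k examples
    return Counter(ys).most_common(1)[0][0]         # majority class among the k examples

examples = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.1], "B"), ([4.8, 5.3], "B")]
print(knn_predict(examples, [1.1, 1.0], k=3))       # -> "A"
```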

51

Application Examples

• Collaborative filtering:
– find the k people most similar to user x who have rated movies in a similar way
– predict the rating yx of x as an average of the neighbors’ ratings yk

52

Visual Interpretation of how k-Nearest Neighbors works

(Figure: neighborhoods and decision boundaries for k = 1 and k = 3.)

53

1-Nearest Neighbor

• Distance metric:
– e.g., Euclidean

• How many neighbors to look at?
– One

• Weighting function (optional):
– Not used

• How to fit with local points?
– Predict the same output as the nearest neighbor

54

k-Nearest Neighbor (k = 9)

• Distance metric:
– e.g., Euclidean

• How many neighbors to look at?
– k (e.g., k = 9)

• Weighting function (optional):
– Not used

• How to fit with local points?
– Predict the output from the k nearest neighbors (majority vote or average)

55

Generalizing k-NN: Kernel Regression

• Instead of k neighbors, look at all the points

• Weight the points using a Gaussian function of distance: the weight w_i is largest where d(x_i, q) = 0 and decays as the distance grows

• Fit the local points with the weighted average (a sketch follows)
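A minimal sketch of such a Gaussian-weighted average over all training points. The exact weight formula w_i = exp(−d(x_i, q)² / K_w²) and the kernel width K_w are assumptions, since the slide’s formula is not reproduced in the transcript.

```python
# Kernel regression sketch: every training point contributes, weighted by a Gaussian
# of its distance to the query (the kernel width K_w is an assumed parameter).
import math

def kernel_regress(examples, q, K_w=1.0):
    """examples: list of (x, y) pairs with scalar y; q: query vector."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    weights = [math.exp(-(dist(x, q) ** 2) / (K_w ** 2)) for x, _ in examples]
    return sum(w * y for w, (_, y) in zip(weights, examples)) / sum(weights)

examples = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0)]
print(kernel_regress(examples, [0.5], K_w=0.5))   # dominated by the two nearest points
```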

56

Distance Weighted k-NN

• Points that are close by may be weighted more heavily than points farther away

• Prediction rule: a weighted vote (or weighted average) over the neighbors,
  ŷ = Σ_i w_i · y_i / Σ_i w_i

• where w_i can be, e.g., an inverse function of the distance, such as w_i = 1 / d(x, x_i)²

• where d is the distance between x and x_i

57

How to compute d(x,xi)?

• One approach: Shepard’s method

58

How to find nearest neighbors?

• Given: a set P of n points in R^d

• Goal: given a query point q,
– NN: find the nearest neighbor p of q in P
– Range search: find one/all points in P within distance r from q

59

Issues with large datasets

• Which distance measure to use

• Curse of dimensionality:
– in high dimensions, the nearest neighbor might not be near at all

• Irrelevant attributes:
– many attributes (in big data) are not informative

• k-NN wants the data in memory!

• It must make a pass over the entire data set to do classification

60

Distance Metrics

• D(x, xi) must be non-negative

• D(x, xi) = 0 iff x = xi (reflexivity)

• D(x, xi) = D(xi, x) (symmetry)

• For any other data point y, D(x, xi) + D(xi, y) >= D(x, y) (triangle inequality)

Euclidean distance is one of the most commonly used metrics.

61

But… Euclidean distance is not always appropriate!

• It makes sense when all attributes are expressed in the same units (e.g., SI)

• It can be generalized to the Lk norm, also called the Minkowski distance (see the sketch below):
– L1: Manhattan distance (L1 norm)
– L2: Euclidean distance (L2 norm)
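A minimal sketch of the Lk (Minkowski) family, with the L1 and L2 special cases.

```python
# Minkowski (Lk) distance; k = 1 gives Manhattan, k = 2 gives Euclidean (a sketch).
def minkowski(a, b, k=2):
    return sum(abs(ai - bi) ** k for ai, bi in zip(a, b)) ** (1.0 / k)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, k=1))   # Manhattan (L1): 7.0
print(minkowski(a, b, k=2))   # Euclidean (L2): 5.0
```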

62

Curse of Dimensionality…

• What happens if the dimensionality d is very large?

• Say we have 10,000 points uniformly distributed in a unit hypercube and we want to find k = 5 neighbors of a query at the origin (0_1, 0_2, …, 0_d):
– in d = 1: we have to go 5/10000 = 0.0005 on average to capture our 5 neighbors
– in d = 2: we have to go (0.0005)^(1/2) ≈ 0.022
– in general d: we have to go (0.0005)^(1/d), which approaches the full edge of the cube as d grows (see the sketch below)

• Very expensive and can break down!
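The edge length (0.0005)^(1/d) grows toward the full unit edge very quickly; a quick sketch of the arithmetic:

```python
# Edge length (0.0005)^(1/d) needed to capture 5 of 10,000 uniform points in [0,1]^d.
for d in (1, 2, 3, 10, 100):
    print(d, round(0.0005 ** (1.0 / d), 4))
# -> 1: 0.0005   2: 0.0224   3: 0.0794   10: 0.4676   100: 0.9268
# In high dimensions the "neighborhood" spans almost the whole cube.
```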

63

Irrelevant features…

• A large number of features are uninformative about the data:
– they do not discriminate across classes (but may vary within a class)

• We may be left with a classic “hole” in the data when computing the neighbors…

64

Efficient indexing (for computing neighbors)

• A kd-tree is an efficient data structure:
– split along the median value of the dimension with the highest variance
– points are stored in the leaves… (a usage sketch follows)
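A minimal usage sketch, assuming SciPy is installed (scipy.spatial.cKDTree); the random data is purely illustrative.

```python
# Nearest-neighbor and range queries with a kd-tree (assumes SciPy is installed).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
P = rng.random((10_000, 3))                   # n points in R^d (here d = 3)
tree = cKDTree(P)                             # build the kd-tree over P

q = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(q, k=1)                # NN: nearest neighbor of q in P
within_r = tree.query_ball_point(q, r=0.05)   # range search: all points within r of q
print(dist, idx, len(within_r))
```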

65

Space and Time Complexity of Kd-trees

• Space: O(n)!!!
– storage can be reduced with other efficient data structures
– e.g., a ball-tree structure (does better with higher-dimensional datasets)

• Time:
– O(n) – worst-case scenario
– O(log n) – normal scenarios…

66

Now, how do we make this online?

• Instead of assuming a static k-NN classifier, how about making our k-NN adapt to incoming data?
– Locally Weighted Linear Regression (LWLR)

• Form an explicit approximation f̂ for a region surrounding the query q:
– called a piece-wise approximation
– minimize the error over the k nearest neighbors of q
– minimize the error over all training examples (weighting by distances)
– or combine the two

67

LWLR: Mathematical form

• Linear regression: approximate the target locally with a linear function f̂(x) of the attributes

• Error: the squared error between f̂ and the training targets

• Option 1: minimize the error over the k nearest neighbors of q

• Option 2: minimize the error over the entire set of examples, weighting each example by its distance to q

• Option 3: combine the two – minimize the distance-weighted error over the k nearest neighbors of q

(A minimal LWLR sketch follows.)
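A minimal sketch of the combined option (distance-weighted least squares over the k nearest neighbors of q) using NumPy; the Gaussian kernel, its width, and the toy data are assumptions, not read off the slide.

```python
# Locally weighted linear regression (LWLR) sketch: fit a weighted least-squares line
# around the query q, using Gaussian weights over its k nearest neighbors.
import numpy as np

def lwlr_predict(X, y, q, k=20, kernel_width=0.3):
    X, y, q = np.asarray(X, float), np.asarray(y, float), np.asarray(q, float)
    d = np.linalg.norm(X - q, axis=1)                 # distances to the query
    nn = np.argsort(d)[:k]                            # indices of the k nearest neighbors
    w = np.exp(-(d[nn] ** 2) / kernel_width ** 2)     # Gaussian distance weighting
    A = np.hstack([np.ones((k, 1)), X[nn]])           # design matrix with intercept
    sw = np.sqrt(w)                                   # row scaling for weighted least squares
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[nn] * sw, rcond=None)
    return np.concatenate(([1.0], q)) @ beta          # evaluate the local line at q

# Toy 1-D example: noisy sine curve; the local fit adapts to the region around q.
xs = np.linspace(0, 6, 200).reshape(-1, 1)
ys = np.sin(xs).ravel() + 0.1 * np.random.default_rng(1).standard_normal(200)
print(lwlr_predict(xs, ys, q=[1.5]))                  # close to sin(1.5) ~ 0.997
```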

68

Example of LWLR

(Figure: a locally weighted fit compared with a simple global regression.)

69

Large-scale Classification & Regression

• Decision Trees:
– querying and inference is easy
– induction needs special constructs for handling large-scale datasets
– can adapt to incoming data streams

• k-NN:
– intuitive algorithm
– needs efficient data structures to handle large datasets
– can be made online (e.g., via locally weighted regression)

• Next class: more classification & regression!