
Classification & Regression

COSC 526 Class 7

Arvind Ramanathan
Computational Science & Engineering Division
Oak Ridge National Laboratory, Oak Ridge
Ph: 865-576-7266
E-mail: ramanathana@ornl.gov

2

Last Class

• Streaming Data Analytics:
– sampling, counting, frequent item sets, etc.

• Berkeley Data Analytics Stack (BDAS)

3

Project Deliverables and Deadlines

Deliverable                              Due date           % of Grade
Initial selection of topics [1, 2]       Jan 27, 2015       10
Project Description and Approach         Feb 20, 2015       20
Initial Report                           Mar 20, 2015       10
Project Demonstration                    Apr 16-19, 2015    10
Final Project Report (10 pages)          Apr 21, 2015       25
Poster (12-16 slides)                    Apr 23, 2015       25

• [1] Projects can come with their own data (e.g., from your project) or can be provided

• [2] Datasets need to be open! Please don’t use datasets that have proprietary limitations

• All reports will be in NIPS format: http://nips.cc/Conferences/2013/PaperInformation/StyleFiles

4

This class and next…

• Classification

• Decision Tree Methods

• Instance based Learning

• Support Vector Machines (SVM)

• Latent Dirichlet Allocation (LDA)

• How do we make these work with large-scale datasets?

5

Part I: Classification and a few new terms…

6

Classification

• Given a collection of records (training):

– Each record has a set of attributes [x1,x2, …, xn]

– One attribute is referred to as the class [y]

• Training Goal: Build a model for the class attribute y as a function of the other attributes: y = f(x1, x2, …, xn)

• Testing Goal: Previously unseen records should be assigned the class attribute as accurately as possible

7

Illustration of Classification

Training data:

ID   Attrib1   Attrib2   Attrib3   Class
1    Yes       Large     120,000   No
2    No        Medium    100,000   No
3    No        Small     70,000    No
4    Yes       Medium    120,000   No
5    No        Large     95,000    Yes
6    Yes       Large     220,000   Yes
…    …         …         …         …

Test data:

ID   Attrib1   Attrib2   Attrib3   Class
20   No        Small     55,000    ?
21   Yes       Medium    90,000    ?
22   Yes       …         …         ?

(Workflow: a learning algorithm learns a model from the training data; the model is then applied to the test data to predict the unknown class labels. A minimal sketch of this workflow follows.)
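As a concrete illustration of this learn/apply workflow, here is a minimal sketch in Python. It assumes scikit-learn is available; the numeric encoding of the categorical attributes and the choice of DecisionTreeClassifier are illustrative assumptions, not part of the slides.

```python
# Minimal sketch of the learn-model / apply-model workflow (assumes scikit-learn).
# The records mirror the Attrib1 / Attrib2 / Attrib3 / Class tables above.
from sklearn.tree import DecisionTreeClassifier

train_X = [
    ["Yes", "Large", 120_000], ["No", "Medium", 100_000], ["No", "Small", 70_000],
    ["Yes", "Medium", 120_000], ["No", "Large", 95_000], ["Yes", "Large", 220_000],
]
train_y = ["No", "No", "No", "No", "Yes", "Yes"]

# Illustrative numeric encoding of the categorical attributes.
yes_no = {"Yes": 1, "No": 0}
size = {"Small": 0, "Medium": 1, "Large": 2}

def encode(row):
    return [yes_no[row[0]], size[row[1]], row[2]]

# Learn a model for the class attribute as a function of the other attributes.
model = DecisionTreeClassifier().fit([encode(r) for r in train_X], train_y)

# Apply the model to previously unseen records (IDs 20 and 21 of the test table).
test_X = [["No", "Small", 55_000], ["Yes", "Medium", 90_000]]
print(model.predict([encode(r) for r in test_X]))
```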

8

Examples of classification

• Predicting whether tumor cells are benign or cancerous

• Classifying credit card transactions as legitimate or fraudulent

• Classifying whether a protein sequence region is a helix, a beta-sheet, or a random coil

• Classifying whether a tweet is talking about sports, finance, terrorism, etc.

9

Two problem Settings

• Supervised Learning:
– Training and testing data
– Training data includes labels (the class attribute y)
– Testing data consists of all the other attributes

• Unsupervised Learning:
– Training data does not usually include labels
– We will cover unsupervised learning later (mid-February)

10

Two more terms..

• Parametric:
– A particular functional form is assumed, e.g., Naïve Bayes
– Simplicity: easy to interpret and understand
– High bias: real data might not follow the assumed functional form

• Non-parametric:
– Estimation is purely data-driven
– No functional form is assumed

11

Part II: Classification with Decision Trees

Examples illustrated from Tan, Steinbach and Kumar, Introduction to Data Mining textbook

12

Decision Trees

• One of the more popular learning techniques

• Uses a tree-structured plan of a set of attributes to test in order to predict the output

• The decision of which attribute to test is based on information gain

13

Example of a Decision Tree

Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Model (a decision tree; the internal nodes are the splitting attributes):

Refund = Yes → NO
Refund = No:
  Marital Status = Married → NO
  Marital Status = Single or Divorced:
    Taxable Income < 80K → NO
    Taxable Income ≥ 80K → YES

14

Another Example of a Decision Tree

Using the same training data as before, a different tree also fits:

Marital Status = Married → NO
Marital Status = Single or Divorced:
  Refund = Yes → NO
  Refund = No:
    Taxable Income < 80K → NO
    Taxable Income ≥ 80K → YES

There could be more than one tree that fits the same data!

15

Decision Tree Classification Task

Training set:

Tid   Attrib1   Attrib2   Attrib3   Class
1     Yes       Large     125K      No
2     No        Medium    100K      No
3     No        Small     70K       No
4     Yes       Medium    120K      No
5     No        Large     95K       Yes
6     No        Medium    60K       No
7     Yes       Large     220K      No
8     No        Small     85K       Yes
9     No        Medium    75K       No
10    No        Small     90K       Yes

Test set:

Tid   Attrib1   Attrib2   Attrib3   Class
11    No        Small     55K       ?
12    Yes       Medium    80K       ?
13    Yes       Large     110K      ?
14    No        Small     95K       ?
15    No        Large     67K       ?

(Induction: a tree-induction algorithm learns the model, a decision tree, from the training set. Deduction: the learned model is applied to the test set to predict the missing class labels.)

16

Apply Model to Test Data

Test record:

Refund   Marital Status   Taxable Income   Cheat
No       Married          80K              ?

Using the decision tree from before (root: Refund; then Marital Status; then Taxable Income), start from the root of the tree and follow the branch that matches the record at each node:

– Refund = No → take the No branch to the Marital Status node
– Marital Status = Married → take the Married branch, which is a leaf labeled NO

Assign Cheat to “No”

22

Decision Tree Classification Task (revisited)

(Same induction/deduction framework and training/test tables as slide 15: the tree-induction algorithm learns the decision tree from the training set, and the model is then applied to the test set.)

23

Decision Tree Induction

• Many algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART
– ID3, C4.5
– SLIQ, SPRINT

24

General Structure of Hunt’s Algorithm

• Let Dt be the set of training records that reach a node t

• General procedure:
– If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt
– If Dt is an empty set, then t is a leaf node labeled with the default class, yd
– If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets
– Recursively apply the procedure to each subset (a minimal sketch of this recursion is given below)

(The slide shows the Refund / Marital Status / Taxable Income / Cheat training data from before, with Dt indicating the records that reach the current node and a “?” for the attribute test still to be chosen.)
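A minimal Python sketch of this recursive procedure, under the assumption that an attribute-test selector (choose_split) is supplied; the helper names are illustrative, not from the slides.

```python
# Minimal sketch of Hunt's algorithm; choose_split is an abstract attribute-test
# selector (e.g., pick the test with the best Gini). Names are illustrative.
from collections import Counter

def hunt(records, labels, choose_split, default_class):
    """records: list of attribute dicts; labels: their class values.
    choose_split(records, labels) returns (test_fn, branch_values) or None."""
    if not records:                        # empty D_t -> leaf labeled with default class y_d
        return ("leaf", default_class)
    if len(set(labels)) == 1:              # all records in the same class y_t -> leaf
        return ("leaf", labels[0])
    split = choose_split(records, labels)  # attribute test used to split D_t
    if split is None:                      # no test separates the classes -> majority leaf
        return ("leaf", Counter(labels).most_common(1)[0][0])
    test_fn, branch_values = split
    majority = Counter(labels).most_common(1)[0][0]
    children = {}
    for v in branch_values:                # partition D_t and recurse on each subset
        idx = [i for i, r in enumerate(records) if test_fn(r) == v]
        children[v] = hunt([records[i] for i in idx], [labels[i] for i in idx],
                           choose_split, majority)
    return ("node", test_fn, children)
```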

25

Hunt’s Algorithm

Applied to the Refund / Marital Status / Taxable Income / Cheat training data, the tree grows step by step:

1. Start with a single leaf predicting the majority class: Don’t Cheat.
2. Split on Refund: Refund = Yes → Don’t Cheat; Refund = No → Don’t Cheat (this branch is still impure).
3. Split the Refund = No branch on Marital Status: Married → Don’t Cheat; Single, Divorced → still impure.
4. Split the Single, Divorced branch on Taxable Income: < 80K → Don’t Cheat; >= 80K → Cheat.

26

Tree Induction

• Greedy strategy:
– Split the records based on an attribute test that optimizes a certain criterion

• Issues:
– Determine how to split the records
  • How to specify the attribute test condition?
  • How to determine the best split?
– Determine when to stop splitting


28

How to Specify Test Condition?

• Depends on the attribute type:
– Nominal
– Ordinal
– Continuous

• Depends on the number of ways to split:
– 2-way split
– Multi-way split

29

Splitting Based on Nominal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  CarType → {Family}, {Sports}, {Luxury}

• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  CarType → {Sports, Luxury} vs. {Family}   OR   CarType → {Family, Luxury} vs. {Sports}

30

Splitting Based on Ordinal Attributes

• Multi-way split: use as many partitions as there are distinct values.
  Size → {Small}, {Medium}, {Large}

• Binary split: divides the values into two subsets; need to find the optimal partitioning.
  Size → {Small, Medium} vs. {Large}   OR   Size → {Medium, Large} vs. {Small}

• What about Size → {Small, Large} vs. {Medium}? (This grouping does not preserve the order.)

31

Splitting Based on Continuous Attributes

• Different ways of handling:
– Discretization to form an ordinal categorical attribute
  • Static – discretize once at the beginning
  • Dynamic – ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
– Binary decision: (A < v) or (A ≥ v)
  • consider all possible splits and find the best cut
  • can be more compute-intensive

32

Splitting Based on Continuous Attributes

(i) Binary split:   Taxable Income > 80K?  → Yes / No

(ii) Multi-way split:   Taxable Income?  → < 10K, [10K, 25K), [25K, 50K), [50K, 80K), > 80K


34

How to determine the Best Split

Before splitting: 10 records of class C0 and 10 records of class C1.

Candidate splits:
– Own Car?  → Yes: (C0: 6, C1: 4);  No: (C0: 4, C1: 6)
– Car Type?  → Family: (C0: 1, C1: 3);  Sports: (C0: 8, C1: 0);  Luxury: (C0: 1, C1: 7)
– Student ID?  → one branch per ID (c1 … c20), each containing a single record, (C0: 1, C1: 0) or (C0: 0, C1: 1)

Which test condition is the best?

35

How to determine the Best Split

• Greedy approach:
– Nodes with a homogeneous class distribution are preferred

• Need a measure of node impurity:
– (C0: 5, C1: 5) → non-homogeneous, high degree of impurity
– (C0: 9, C1: 1) → homogeneous, low degree of impurity

36

Measures of Node Impurity

• Gini Index

• Entropy

• Misclassification error
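These three measures can be sketched directly from class counts; the entropy and misclassification-error formulas below are the standard ones (they are named but not written out on the slide).

```python
# Standard node-impurity measures computed from class counts (a sketch).
import math

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def misclassification_error(counts):
    n = sum(counts)
    return 1.0 - max(counts) / n

# Matches the Gini examples on the following slides:
print(gini([0, 6]), gini([1, 5]), gini([2, 4]), gini([3, 3]))
# -> approximately 0.0, 0.278, 0.444, 0.5
```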

37

How to Find the Best Split

Before splitting, the parent node has class counts (C0: N00, C1: N01) and impurity M0.

Attribute A splits it into Node N1 (C0: N10, C1: N11) and Node N2 (C0: N20, C1: N21), with impurities M1 and M2 and weighted combination M12.

Attribute B splits it into Node N3 (C0: N30, C1: N31) and Node N4 (C0: N40, C1: N41), with impurities M3 and M4 and weighted combination M34.

Compare Gain = M0 – M12 vs. M0 – M34, and choose the attribute with the larger gain.

38

Measure of Impurity: GINI

• Gini Index for a given node t:

  GINI(t) = 1 − Σ_j [p(j | t)]²

  (NOTE: p(j | t) is the relative frequency of class j at node t.)

– Maximum (1 − 1/nc, for nc classes) when records are equally distributed among all classes, implying the least interesting information
– Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples (class counts at a node):
  C1: 0, C2: 6 → Gini = 0.000
  C1: 1, C2: 5 → Gini = 0.278
  C1: 2, C2: 4 → Gini = 0.444
  C1: 3, C2: 3 → Gini = 0.500

39

Examples for computing GINI

GINI(t) = 1 − Σ_j [p(j | t)]²

C1: 0, C2: 6 →  P(C1) = 0/6 = 0,  P(C2) = 6/6 = 1
  Gini = 1 − P(C1)² − P(C2)² = 1 − 0 − 1 = 0

C1: 1, C2: 5 →  P(C1) = 1/6,  P(C2) = 5/6
  Gini = 1 − (1/6)² − (5/6)² = 0.278

C1: 2, C2: 4 →  P(C1) = 2/6,  P(C2) = 4/6
  Gini = 1 − (2/6)² − (4/6)² = 0.444

40

Splitting Based on GINI

• Used in CART, SLIQ, SPRINT.

• When a node p is split into k partitions (children), the quality of the split is computed as

  GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

  where n_i = number of records at child i, and n = number of records at node p.

41

Binary Attributes: Computing the GINI Index

• Splits into two partitions
• Effect of weighting partitions: larger and purer partitions are sought

Example: attribute B splits the parent node (C1: 6, C2: 6; Gini = 0.500) into

  Node N1 (B = Yes):  C1 = 5, C2 = 2
  Node N2 (B = No):   C1 = 1, C2 = 4

Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

(A minimal sketch that reproduces this computation follows.)
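A quick sketch that reproduces the weighted-children computation above from the slide’s counts (N1: 5/2, N2: 1/4), using the corrected denominators of 7 and 5.

```python
# Weighted Gini of a split, GINI_split = sum_i (n_i / n) * GINI(i)  (a sketch).
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """children: list of per-child class-count lists."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Node N1: C1=5, C2=2;  Node N2: C1=1, C2=4  (parent: C1=6, C2=6, Gini = 0.5)
print(round(gini([5, 2]), 3),            # -> 0.408
      round(gini([1, 4]), 3),            # -> 0.32
      round(gini_split([[5, 2], [1, 4]]), 3))   # -> 0.371
```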

42

Categorical Attributes: Computing Gini Index

• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

Multi-way split:

  CarType   Family   Sports   Luxury
  C1        1        2        1
  C2        4        1        1
  Gini = 0.393

Two-way split (find the best partition of values):

  CarType   {Sports, Luxury}   {Family}
  C1        3                  1
  C2        2                  4
  Gini = 0.400

  CarType   {Sports}   {Family, Luxury}
  C1        2          2
  C2        1          5
  Gini = 0.419

43

Continuous Attributes: Computing the Gini Index

• Use a binary decision based on one value (e.g., Taxable Income > 80K? → Yes / No)

• Several choices for the splitting value:
– number of possible splitting values = number of distinct values

• Each splitting value v has a count matrix associated with it:
– class counts in each of the partitions, A < v and A ≥ v

• Simple method to choose the best v:
– for each v, scan the database to gather the count matrix and compute its Gini index
– computationally inefficient! Repetition of work.

(The slide shows the Refund / Marital Status / Taxable Income / Cheat training data from before.)

44

Continuous Attributes: Computing Gini Index...

• For efficient computation, for each attribute:
– sort the attribute on its values
– linearly scan these values, each time updating the count matrix and computing the Gini index
– choose the split position that has the least Gini index

Sorted values (Taxable Income) with class labels, candidate split positions, and the Gini of each split:

  Income:  60   70   75   85   90   95   100   120   125   220
  Cheat:   No   No   No   Yes  Yes  Yes  No    No    No    No

  Split position   Yes (<=, >)   No (<=, >)   Gini
  55               0, 3          0, 7         0.420
  65               0, 3          1, 6         0.400
  72               0, 3          2, 5         0.375
  80               0, 3          3, 4         0.343
  87               1, 2          3, 4         0.417
  92               2, 1          3, 4         0.400
  97               3, 0          3, 4         0.300
  110              3, 0          4, 3         0.343
  122              3, 0          5, 2         0.375
  172              3, 0          6, 1         0.400
  230              3, 0          7, 0         0.420

Best split: Taxable Income <= 97, with Gini = 0.300. (A minimal sketch of this scan follows.)
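A minimal sketch of this sort-and-scan procedure on the Taxable Income column from the slide. Candidate thresholds here are midpoints between consecutive sorted values, so the best cut is reported as 97.5 rather than the slide’s rounded 97; the Gini value matches.

```python
# Sort-and-scan search for the best binary split on a continuous attribute (a sketch).
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]   # in K, from the slide
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

pairs = sorted(zip(incomes, cheat))                       # one sort: O(n log n)
classes = ["Yes", "No"]
best = None
for i in range(len(pairs) - 1):                           # linear scan of split positions
    v = (pairs[i][0] + pairs[i + 1][0]) / 2               # midpoint between consecutive values
    left  = [sum(1 for x, y in pairs[: i + 1] if y == c) for c in classes]
    right = [sum(1 for x, y in pairs[i + 1:] if y == c) for c in classes]
    g = (sum(left) * gini(left) + sum(right) * gini(right)) / len(pairs)
    if best is None or g < best[1]:
        best = (v, g)

print(best)   # best threshold ~97.5 with Gini = 0.300 (matches the slide's minimum at 97)
```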

45

How do we do this with large datasets?

• GINI computation is expensive:
– sorting along numeric attributes is O(n log n)!
– needs an in-memory hash table to split based on attributes

• Empirical observation:
– GINI scores tend to increase or decrease slowly
– the minimum Gini score for the splitting attribute is lower than at other data points

• CLOUDS (Classification for Large OUt-of-core DataSets) exploits these observations

46

How to hasten GINI computation? (1)

• Sampling of Splitting points (SS):
– use quantiles to divide the points into q intervals
– evaluate the GINI index at each interval boundary
– determine the minimum value, GINImin
– split along the attribute with GINImin

• Pro:
– only one pass over the entire dataset

• Cons:
– greedy
– time depends on the number of points in each interval

(A minimal sketch of this quantile-based sampling follows.)
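A minimal sketch of the SS idea, evaluating the Gini index only at quantile boundaries; the helpers are illustrative and not the actual CLOUDS implementation.

```python
# SS sketch: evaluate the Gini index only at q quantile boundaries instead of at
# every distinct value (illustrative, not the actual CLOUDS implementation).
import numpy as np

def gini_of_split(values, labels, v):
    """Weighted Gini of the binary split: attribute <= v vs. attribute > v."""
    labels = np.asarray(labels)
    def gini(mask):
        if mask.sum() == 0:
            return 0.0
        _, counts = np.unique(labels[mask], return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    left = np.asarray(values) <= v
    return left.mean() * gini(left) + (~left).mean() * gini(~left)

def ss_best_split(values, labels, q=8):
    # q - 1 interior quantile boundaries are the only candidate split points.
    boundaries = np.quantile(values, np.linspace(0, 1, q + 1)[1:-1])
    scores = [(gini_of_split(values, labels, v), v) for v in boundaries]
    return min(scores)            # (GINI_min, split value) over the sampled boundaries only

incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat   = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(ss_best_split(incomes, cheat, q=4))
```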

47

How to hasten GINI computation? (2)

• Sampling Splitting points with Estimation (SSE):
– find GINImin as outlined in SS
– estimate a lower bound GINIest for each interval
– eliminate (prune) all intervals where GINIest ≥ GINImin
– derive the list of (alive) intervals in which the correct split can still lie

• How do we determine what to prune?
– Goal: estimate the minimum GINI within an interval [vl, vu]
– Approach: a hill-climbing heuristic
– Inputs: n – number of data points (records); c – number of classes
– Compute: xi (yi) – number of elements of class i that are ≤ vl (vu); ci – total number of elements of class i; nl (nu) – number of elements that are ≤ vl (vu)
– Take the partial derivative along the subset of classes c; the class with the minimum gradient determines the split point retained (for the next split)

• Pros:
– instead of being dependent on n, the hill-climbing heuristic is dependent on c
– other optimizations can include prioritizing alive intervals based on the estimated GINImin
– better accuracy than SS in estimating the splitting points

• Cons:
– paying more for splitting the data (I/O costs are higher)

48

Summary

• Decision Trees are an elegant way to design classifiers:
– simple, with intuitive meaning associated with the splits

• More than one tree representation is possible for the same data:
– many heuristic measures are available for deciding how to construct a tree

• For Big Data, we need to design heuristics that provide approximate (yet nearly accurate) solutions

49

Part III: Classification with k-Nearest Neighbors

Examples illustrated from Tom Mitchell, Andrew Moore, Jure Leskovec, …

50

Instance based Learning

• Given a number of examples {(x, y)} and a query vector q:
– find the closest example(s) x*
– predict y*

• Learning:
– store all training examples

• Prediction:
– classification problem: return the majority class among the k closest examples
– regression problem: return the average y value of the k closest examples

(A minimal k-NN sketch follows.)
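A minimal sketch of these two prediction rules using only the Python standard library; the example data and the choice of k are made up for illustration.

```python
# k-nearest-neighbor prediction: majority vote (classification) or mean (regression).
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def knn_predict(examples, q, k=3, regression=False):
    """examples: list of (x, y) pairs; q: query vector."""
    neighbors = sorted(examples, key=lambda xy: euclidean(xy[0], q))[:k]
    ys = [y for _, y in neighbors]
    if regression:
        return sum(ys) / len(ys)                    # average y of the k examples
    return Counter(ys).most_common(1)[0][0]         # majority class among the k examples

examples = [([1.0, 1.0], "A"), ([1.2, 0.9], "A"), ([5.0, 5.1], "B"), ([4.8, 5.3], "B")]
print(knn_predict(examples, [1.1, 1.0], k=3))       # -> "A"
```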

51

Application Examples

• Collaborative filtering:
– find the k people most similar to user x who have rated movies in a similar way
– predict the rating yx of x as an average of the neighbors’ ratings yk

52

Visual Interpretation of how k-Nearest Neighbors works

(Figure: neighborhoods and decision boundaries for k = 1 and k = 3.)

53

1-Nearest Neighbor

• Distance metric:
– e.g., Euclidean

• How many neighbors to look at?
– One

• Weighting function (optional):
– Not used

• How to fit with local points?
– Predict the same output as the nearest neighbor

54

k-Nearest Neighbor (k = 9)

• Distance metric:
– e.g., Euclidean

• How many neighbors to look at?
– k (e.g., k = 9)

• Weighting function (optional):
– Not used

• How to fit with local points?
– Predict the output from the k nearest neighbors (majority vote or average)

55

Generalizing k-NN: Kernel Regression

• Instead of k neighbors, look at all the points

• Weight the points using a Gaussian function of distance: the weight w_i is largest where d(x_i, q) = 0 and decays as the distance grows

• Fit the local points with the weighted average (a sketch follows)
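A minimal sketch of such a Gaussian-weighted average over all training points. The exact weight formula w_i = exp(−d(x_i, q)² / K_w²) and the kernel width K_w are assumptions, since the slide’s formula is not reproduced in the transcript.

```python
# Kernel regression sketch: every training point contributes, weighted by a Gaussian
# of its distance to the query (the kernel width K_w is an assumed parameter).
import math

def kernel_regress(examples, q, K_w=1.0):
    """examples: list of (x, y) pairs with scalar y; q: query vector."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    weights = [math.exp(-(dist(x, q) ** 2) / (K_w ** 2)) for x, _ in examples]
    return sum(w * y for w, (_, y) in zip(weights, examples)) / sum(weights)

examples = [([0.0], 1.0), ([1.0], 2.0), ([2.0], 3.0)]
print(kernel_regress(examples, [0.5], K_w=0.5))   # dominated by the two nearest points
```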

56

Distance Weighted k-NN

• Points that are close by may be weighted more heavily than points farther away

• Prediction rule: a weighted vote (or weighted average) over the neighbors,
  ŷ = Σ_i w_i · y_i / Σ_i w_i

• where w_i can be, e.g., an inverse function of the distance, such as w_i = 1 / d(x, x_i)²

• where d is the distance between x and x_i

57

How to compute d(x,xi)?

• One approach: Shepard’s method

58

How to find nearest neighbors?

• Given: a set P of n points in R^d

• Goal: given a query point q,
– NN: find the nearest neighbor p of q in P
– Range search: find one/all points in P within distance r from q

59

Issues with large datasets

• Which distance measure to use

• Curse of dimensionality:
– in high dimensions, the nearest neighbor might not be near at all

• Irrelevant attributes:
– many attributes (in big data) are not informative

• k-NN wants the data in memory!

• It must make a pass over the entire data set to do classification

60

Distance Metrics

• D(x, xi) must be non-negative

• D(x, xi) = 0 iff x = xi (reflexivity)

• D(x, xi) = D(xi, x) (symmetry)

• For any other data point y, D(x, xi) + D(xi, y) >= D(x, y) (triangle inequality)

Euclidean distance is one of the most commonly used metrics.

61

But… Euclidean distance is not always appropriate!

• It makes sense when all attributes are expressed in the same units (e.g., SI)

• It can be generalized to the Lk norm, also called the Minkowski distance (see the sketch below):
– L1: Manhattan distance (L1 norm)
– L2: Euclidean distance (L2 norm)
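A minimal sketch of the Lk (Minkowski) family, with the L1 and L2 special cases.

```python
# Minkowski (Lk) distance; k = 1 gives Manhattan, k = 2 gives Euclidean (a sketch).
def minkowski(a, b, k=2):
    return sum(abs(ai - bi) ** k for ai, bi in zip(a, b)) ** (1.0 / k)

a, b = [0.0, 0.0], [3.0, 4.0]
print(minkowski(a, b, k=1))   # Manhattan (L1): 7.0
print(minkowski(a, b, k=2))   # Euclidean (L2): 5.0
```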

62

Curse of Dimensionality…

• What happens if the dimensionality d is very large?

• Say we have 10,000 points uniformly distributed in a unit hypercube and we want to find k = 5 neighbors of a query at the origin (0_1, 0_2, …, 0_d):
– in d = 1: we have to go 5/10000 = 0.0005 on average to capture our 5 neighbors
– in d = 2: we have to go (0.0005)^(1/2) ≈ 0.022
– in general d: we have to go (0.0005)^(1/d), which approaches the full edge of the cube as d grows (see the sketch below)

• Very expensive and can break down!
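The edge length (0.0005)^(1/d) grows toward the full unit edge very quickly; a quick sketch of the arithmetic:

```python
# Edge length (0.0005)^(1/d) needed to capture 5 of 10,000 uniform points in [0,1]^d.
for d in (1, 2, 3, 10, 100):
    print(d, round(0.0005 ** (1.0 / d), 4))
# -> 1: 0.0005   2: 0.0224   3: 0.0794   10: 0.4676   100: 0.9268
# In high dimensions the "neighborhood" spans almost the whole cube.
```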

63

Irrelevant features…

• A large number of features are uninformative about the data:
– they do not discriminate across classes (but may vary within a class)

• We may be left with a classic “hole” in the data when computing the neighbors…

64

Efficient indexing (for computing neighbors)

• A kd-tree is an efficient data structure:
– split along the median value of the dimension with the highest variance
– points are stored in the leaves… (a usage sketch follows)
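A minimal usage sketch, assuming SciPy is installed (scipy.spatial.cKDTree); the random data is purely illustrative.

```python
# Nearest-neighbor and range queries with a kd-tree (assumes SciPy is installed).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
P = rng.random((10_000, 3))                   # n points in R^d (here d = 3)
tree = cKDTree(P)                             # build the kd-tree over P

q = np.array([0.5, 0.5, 0.5])
dist, idx = tree.query(q, k=1)                # NN: nearest neighbor of q in P
within_r = tree.query_ball_point(q, r=0.05)   # range search: all points within r of q
print(dist, idx, len(within_r))
```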

65

Space and Time Complexity of Kd-trees

• Space: O(n)!!!
– storage can be reduced with other efficient data structures
– e.g., a ball-tree structure (does better with higher-dimensional datasets)

• Time:
– O(n) – worst-case scenario
– O(log n) – normal scenarios…

66

Now, how do we make this online?

• Instead of assuming a static k-NN classifier, how about making our k-NN adapt to incoming data?
– Locally Weighted Linear Regression (LWLR)

• Form an explicit approximation f̂ for a region surrounding the query q:
– called a piece-wise approximation
– minimize the error over the k nearest neighbors of q
– minimize the error over all training examples (weighting by distances)
– or combine the two

67

LWLR: Mathematical form

• Linear regression: approximate the target locally with a linear function f̂(x) of the attributes

• Error: the squared error between f̂ and the training targets

• Option 1: minimize the error over the k nearest neighbors of q

• Option 2: minimize the error over the entire set of examples, weighting each example by its distance to q

• Option 3: combine the two – minimize the distance-weighted error over the k nearest neighbors of q

(A minimal LWLR sketch follows.)
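A minimal sketch of the combined option (distance-weighted least squares over the k nearest neighbors of q) using NumPy; the Gaussian kernel, its width, and the toy data are assumptions, not read off the slide.

```python
# Locally weighted linear regression (LWLR) sketch: fit a weighted least-squares line
# around the query q, using Gaussian weights over its k nearest neighbors.
import numpy as np

def lwlr_predict(X, y, q, k=20, kernel_width=0.3):
    X, y, q = np.asarray(X, float), np.asarray(y, float), np.asarray(q, float)
    d = np.linalg.norm(X - q, axis=1)                 # distances to the query
    nn = np.argsort(d)[:k]                            # indices of the k nearest neighbors
    w = np.exp(-(d[nn] ** 2) / kernel_width ** 2)     # Gaussian distance weighting
    A = np.hstack([np.ones((k, 1)), X[nn]])           # design matrix with intercept
    sw = np.sqrt(w)                                   # row scaling for weighted least squares
    beta, *_ = np.linalg.lstsq(A * sw[:, None], y[nn] * sw, rcond=None)
    return np.concatenate(([1.0], q)) @ beta          # evaluate the local line at q

# Toy 1-D example: noisy sine curve; the local fit adapts to the region around q.
xs = np.linspace(0, 6, 200).reshape(-1, 1)
ys = np.sin(xs).ravel() + 0.1 * np.random.default_rng(1).standard_normal(200)
print(lwlr_predict(xs, ys, q=[1.5]))                  # close to sin(1.5) ~ 0.997
```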

68

Example of LWLR

(Figure: a locally weighted fit compared with a simple global regression.)

69

Large-scale Classification & Regression

• Decision Trees:
– querying and inference is easy
– induction needs special constructs for handling large-scale datasets
– can adapt to incoming data streams

• k-NN:
– intuitive algorithm
– needs efficient data structures to handle large datasets
– can be made online (e.g., via locally weighted regression)

• Next class: more classification & regression!