Lecture 8 - Data - Feature Selection & Extraction


Machine Learning

Lecture 8

Data Processing

Feature & Sample Selection

Dr. Patrick Chan ([email protected])

South China University of Technology, China


Agenda

Feature Selection

Search

Criterion

Wrapper / Filter / Embedded Method

Feature Extraction

PCA

LDA

Active Learning


Feature Space

Designing a suitable feature space is often more important than the choice of classifier

Some collected features may not be useful; removing them gives:

Lower storage complexity

Lower model complexity

Accuracy may increase

Better understanding of the data and the model


Curse of Dimensionality

The number of examples required to estimate a function accurately grows exponentially with the dimensionality

For a given sample size, there is a maximum number of features that yields the best performance

Beyond that point, a classifier degrades rather than improves as more features are used

The information lost by discarding some features is compensated by a more accurate mapping in the lower-dimensional space


Feature Selection & Extraction

Feature Selection

Selecting a subset of features without a transformation

Feature Extraction

Transforming existing features into a lower dimensional space


Feature Selection

Given a feature set F = {x1, x2, …, xd}

Aim to maximize a selection criterion by selecting S, where S ⊆ F

Major components:

Search

Criterion


Feature Selection

Search

Exhaustive Search

Explore all possible feature subsets

Impractical in applications with many features (d is usually large; otherwise no feature selection is needed)

Optimal Feature Subset

With d features there are 2^d candidates, e.g. 2^20 = 1,048,576

Optimal Feature Subset with given subset size

For a fixed m, the number of selected features, there are dCm candidates, e.g. 20C10 = 184,756


Feature Selection

Search

Heuristic Search

Avoids brute-force search

Yields only a sub-optimal solution

Aims to find a subset of features close to the optimal one

Naïve Search

Sequential Search

Randomized Search


Feature Selection

Search: Heuristic Search

Naïve Search

Evaluate each feature individually

Select the m features with highest scores

Example:

S(x2) = S(x4) > S(x3) > S(x1)

But the best combination is {x2, x3} or {x4, x3}: x2 and x4 score highest individually yet carry similar information, so selecting both is redundant


Feature Selection

Search: Heuristic Search

Sequential Search

Searches for the answer by adding or removing one feature at a time

Greedy method

Selects the best move each time

Forward Selection

Backward Elimination


Search space for d = 4 features (each bit indicates whether a feature is selected):

0000
0001 0010 0100 1000
0011 0101 0110 1001 1010 1100
0111 1011 1101 1110
1111


Feature Selection

Search: Heuristic Search

Sequential Forward Selection (SFS)

Start from the empty set

Add the feature which most improves the result each time


F : full feature set
S : selected feature set
C : evaluation criterion

S = {}
Repeat
    For each f in F
        score(f) = Evaluation(S ∪ {f}, C)
    f* = argmax score(f)
    S = S ∪ {f*}
    F = F − {f*}
Until F = {}
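A minimal runnable sketch of SFS, using scikit-learn's cross-validated accuracy as the evaluation criterion C; the estimator and the 5-fold setup are illustrative assumptions, not part of the slides:

from sklearn.model_selection import cross_val_score

def sfs(X, y, model, m):
    """Greedily grow a feature subset S up to size m (SFS)."""
    remaining = list(range(X.shape[1]))   # F : full feature set (as indices)
    selected = []                         # S = {}
    while remaining and len(selected) < m:
        # score(f) = Evaluation(S U {f}, C) for every candidate f
        scores = {f: cross_val_score(model, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)  # f* = argmax score(f)
        selected.append(best)               # S = S U {f*}
        remaining.remove(best)              # F = F - {f*}
    return selected

Stopping after m additions, rather than emptying F, is the common practical variant of the pseudocode above.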


Feature Selection

Search: Heuristic Search

Sequential Backward Elimination (SBE)

Start from full set

Remove the feature whose removal least affects the result each time


F : full feature set
S : selected feature set
C : evaluation criterion

S = F
Repeat
    For each f in S
        score(f) = Evaluation(S − {f}, C)
    f* = argmax score(f)
    S = S − {f*}
Until S = {}
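The mirror-image sketch for SBE, reusing the imports and assumptions of the SFS sketch above:

def sbe(X, y, model, m):
    """Greedily shrink the feature set from F down to size m (SBE)."""
    selected = list(range(X.shape[1]))      # S = F
    while len(selected) > m:
        # score(f) = Evaluation(S - {f}, C): quality after dropping f
        scores = {f: cross_val_score(model,
                                     X[:, [g for g in selected if g != f]],
                                     y, cv=5).mean()
                  for f in selected}
        best = max(scores, key=scores.get)  # f* whose removal hurts least
        selected.remove(best)               # S = S - {f*}
    return selected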


Feature Selection

Search: Heuristic Search

Comparison: SFS and SBE

First step:

SFS: |S| = 0, d candidates

SBE: |S| = d, d candidates

Last step:

SFS: |S| = d − 2, 2 candidates

SBE: |S| = 2, 2 candidates

The time complexity of SBE is usually higher: with k features selected, k candidate subsets of roughly k features each must be evaluated, and SBE works with large subsets from the start

SBE is nevertheless commonly used in practice, e.g. removing 10% of the features at a time


Feature Selection

Search

Randomized Heuristic Selection

Generates better subsets iteratively from the existing candidate pool

Keeps improving the quality of the selected features

The next subset is generated randomly

One cannot tell when the optimal set has been found, so there is usually no need to wait until the search ends


Feature Selection: Search

Randomized Heuristic Selection

Example: Genetic Algorithm

Random initial population
Repeat
    Evaluate the fitness of each candidate in the population
    Remove some bad candidates
    Create a new population by
        Mutation(c): change a candidate slightly
        Crossover(c1, c2): generate a candidate containing elements of both c1 and c2
Until a good candidate is found
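A minimal sketch of GA-based feature selection, assuming a fitness(mask) function such as cross-validated accuracy on X[:, mask]; the population size, generation count and bit-flip mutation are illustrative choices:

import numpy as np

rng = np.random.default_rng(0)

def mutate(c):
    """Change a candidate slightly: flip one random bit."""
    c = c.copy()
    i = rng.integers(len(c))
    c[i] = ~c[i]
    return c

def crossover(c1, c2):
    """Generate a candidate taking each bit from one of the two parents."""
    take = rng.random(len(c1)) < 0.5
    return np.where(take, c1, c2)

def ga_select(fitness, d, pop_size=20, generations=50):
    # fitness(mask) should penalize the all-False mask
    pop = rng.random((pop_size, d)) < 0.5            # random initial population
    for _ in range(generations):
        fit = np.array([fitness(c) for c in pop])
        keep = pop[np.argsort(fit)[pop_size // 2:]]  # remove the worse half
        children = [crossover(keep[rng.integers(len(keep))],
                              keep[rng.integers(len(keep))])
                    for _ in range(pop_size - len(keep))]
        pop = np.vstack([keep] + [mutate(c) for c in children])
    fit = np.array([fitness(c) for c in pop])
    return pop[fit.argmax()]                         # best feature mask found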


Feature Selection

Criterion

Dependence Measures

Quantify how strongly a feature and the class label are correlated or dependent

Pearson correlation coefficient:

ρ(x, y) = cov(x, y) / (σx σy)

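A small sketch of this dependence criterion, assuming the class ID y is encoded numerically:

import numpy as np

def pearson_scores(X, y):
    """|corr(x_i, y)| for every feature column x_i."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ (y - y.mean()) / len(y)          # cov(x_i, y)
    return np.abs(cov / (X.std(axis=0) * y.std()))

# Naive search on top of this criterion: keep the m highest-scoring features
# m_best = np.argsort(pearson_scores(X, y))[-m:]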


Feature Selection

Criterion

Information Gain

Measures uncertainty

Entropy:

H(X) = −Σi ( |Xi| / |X| ) log ( |Xi| / |X| )

Xi: the set of samples in class i

Information gain: the current entropy minus the entropy after splitting on feature A

IG(A) = H(X) − Σv ( |Xv| / |X| ) H(Xv), where Xv contains the samples with value v of feature A
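A minimal sketch of entropy and information gain for a discrete feature column a (base-2 logarithm assumed):

import numpy as np

def entropy(y):
    """H(X) = -sum_i p_i log2 p_i over the class proportions p_i."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(a, y):
    """IG(A) = H(X) - sum_v (|X_v|/|X|) H(X_v), splitting on A's values."""
    h_after = sum((a == v).mean() * entropy(y[a == v]) for v in np.unique(a))
    return entropy(y) - h_after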


Feature Selection

Criterion

Accuracy Measures

Classifier-dependent

Evaluates a feature subset by the performance of a classifier trained on that subset

Any accuracy or error measure can be used


Feature Selection

Criterion

Consistency Measures

Classifier-independent

Aims to achieve P( C | FullSet ) = P( C | Subset )

Rather than accuracy, only the consistency of the outputs is measured

Finds a minimum set of features that separates the classes as well as the full feature set does


Feature Selection

Selection Type

Filter

Depends only on the data, not on a classifier

No bias toward a particular model

No training is involved, so the time complexity is low

Can handle larger datasets

Wrapper

Evaluates a feature set according to the performance of a model

The selected features yield better results for that model

Time-consuming


Feature Selection

Selection Type

Embedded method

Features are selected for a model (similar to wrapper)

Features are selected during training

Low time complexity:

Avoids re-training for each feature subset

No splitting of the data into a training and a test set is needed


Feature Extraction

Given a feature space x ∈ Rᵈ, find a mapping z = f(x): Rᵈ → Rᵐ with m < d, such that z preserves (most of) the information in x

d: original number of features

m: number of extracted features

In the optimal case, no information is lost

The mapping f may not be linear


Feature Extraction

Principal Components Analysis (PCA)

Unsupervised

Maximizes the variance of the projected data

Linear Discriminant Analysis (LDA)

Supervised

Maximizes class separability (accuracy)


Feature Extraction

PCA

What is the characteristic of important features?

Variance

If the values of all samples are very similar, the samples cannot be separated by this feature

E.g. x1 is better than x2


Feature Extraction

PCA

Principal Components Analysis (PCA) linearly projects the data along the directions in which the data varies most

The first axis has the greatest variance, the second the next greatest, and so on

Dimensionality can be reduced by discarding the later principal components


Feature Extraction

PCA

(Figure: the same data projected along three candidate directions, Projection 1–3; the projection with the larger variance is preferred.)


Feature Extraction

PCA

The projection directions are determined by the eigenvectors of the covariance matrix corresponding to the largest eigenvalues

The magnitude of the eigenvalues corresponds to the variance of the data along the eigenvector directions


Feature Extraction

PCA

Vectors v whose direction is unchanged by C (Cv points in the same direction as v) are called eigenvectors of C

Cv = λv

C : a d × d covariance matrix

v : an eigenvector of C

λ : the eigenvalue associated with v

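As a small numeric example (the 2 × 2 matrix here is made up for illustration), numpy can verify the definition:

import numpy as np

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eigh(C)            # eigh: for symmetric matrices
print(vals)                               # [1. 3.]
v = vecs[:, -1]                           # eigenvector of the largest eigenvalue
assert np.allclose(C @ v, vals[-1] * v)   # check Cv = lambda v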


Feature Extraction

PCA

Covariance Matrix Calculation

1. Scalar operation, entry by entry:

cij = E[ (xi − μi)(xj − μj) ]

C =
[ σ1²  c12  ⋯  c1d
  c21  σ2²  ⋯  c2d
  ⋮    ⋮   ⋱   ⋮
  cd1  cd2  ⋯  σd² ]

2. Vector operation:

C = E[ (x − μ)(x − μ)ᵀ ]
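A short sketch checking both conventions against numpy (using the 1/(n−1) sample-covariance divisor that np.cov uses; the three data points are made up):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9]])   # n samples x d features
Xc = X - X.mean(axis=0)                              # centre the data
C = Xc.T @ Xc / (len(X) - 1)                         # vector operation
assert np.allclose(C, np.cov(X, rowvar=False))       # matches numpy's own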


Feature Extraction

PCA

Calculation of eigenvalues λ and eigenvectors v

Solve the eigenvalues from

Cv = λv  ⇒  (C − λI)v = 0  ⇒  det(C − λI) = 0

Expanding the determinant gives a polynomial of degree d in λ, i.e. at most d distinct roots

For each eigenvalue λ, solve (C − λI)v = 0 to obtain the eigenvectors v

For a 2 × 2 matrix:

det [ a  b ; c  d ] = ad − bc

Larger determinants expand by cofactors along the first row: each entry times the determinant of the submatrix with that entry's row and column removed, with alternating signs


Feature Extraction

PCA

The eigenvector with the largest absolute eigenvalue is called the First Principal Component (PC1); the data have the largest variance along this eigenvector

PC2: the direction with the maximum variation left in the data, orthogonal to PC1

PCi: the direction with the maximum variation left in the data, orthogonal to all previous PCj, j = 1, …, i−1


Feature Extraction

PCA

How to choose m?

Preserve a given percentage of the information (variance) in the data

If m = d, all information is preserved

(Figure: preserved variance as a function of the selected dimensionality m.)
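A minimal sketch of this rule, choosing the smallest m whose leading eigenvalues preserve a target fraction of the variance (the 95% target is an illustrative assumption):

import numpy as np

def choose_m(C, target=0.95):
    """Smallest m whose top-m eigenvalues preserve `target` of the variance."""
    vals = np.linalg.eigvalsh(C)[::-1]          # eigenvalues, descending
    preserved = np.cumsum(vals) / vals.sum()    # preserved-variance curve
    return int(np.searchsorted(preserved, target) + 1)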


Feature Extraction

PCA: Example

x1    x2
2.5   2.4
0.5   0.7
2.2   2.9
1.9   2.2
3.1   3.0
2.3   2.7
2.0   1.6
1.0   1.1
1.5   1.6
1.1   0.9

Mean: 1.81  1.91

Covariance matrix of the centred data:

C ≈ [ 0.6166  0.6154 ; 0.6154  0.7166 ]


Feature Extraction

PCA: Example

Solve det(C − λI) = 0:

Cv = λv  ⇒  (C − λI)v = 0  ⇒  det(C − λI) = 0

Eigenvalues: λ1 = 0.0491, λ2 = 1.2840


Feature Extraction

PCA: Example

Eigenvectors (from (C − λI)v = 0):

λ = 0.0491: v ≈ (−0.735, 0.678)ᵀ

λ = 1.2840: v ≈ (0.678, 0.735)ᵀ


Feature Extraction

PCA: Example

The two eigenvectors are orthogonal

Since 1.2840 > 0.0491, the eigenvector of λ = 1.2840 is PC1 and the eigenvector of λ = 0.0491 is PC2
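The whole example can be reproduced with numpy (a sketch, using the sample-covariance convention):

import numpy as np

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
C = np.cov(X, rowvar=False)
vals, vecs = np.linalg.eigh(C)
print(vals)                       # [0.0491 1.2840]
pc1 = vecs[:, -1]                 # PC1: eigenvector of the largest eigenvalue
z = (X - X.mean(axis=0)) @ pc1    # data projected onto PC1 (m = 1)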


Feature Extraction

PCA: Limitation

PCA may be unsuitable for classification

The data is more spread along x1 than along x2

Eigenvalue of x1 > eigenvalue of x2, so PCA keeps x1

But x2 is more useful for classification


Feature Extraction

LDA

Linear Discriminant Analysis (LDA) linearly projects the original data to a new space, aiming to preserve as much discriminatory information as possible

Seeks to find directions along which the classes are best separated


Feature Extraction

LDA: Two-Class Problem

Define a measure for class separation

Between-Class Scatter (S_B)

A class far away from the others is preferable

Distance between the means of the classes

Within-Class Scatter (S_W)

A compact class is preferable

Variance of a class


Feature Extraction

LDA: Two-Class Problem

Between-Class Scatter (S_B)

Distance between the means of the classes, where the mean of class i is

μi = (1 / ni) Σx∈class i x

S_B = (μ1 − μ2)(μ1 − μ2)ᵀ

After projection onto w, the separation of the projected means is

(wᵀμ1 − wᵀμ2)² = wᵀ S_B w


Feature Extraction

LDA: Two-Class Problem

Within-Class Scatter (S_W)

Variance of a class:

Si = Σx∈class i (x − μi)(x − μi)ᵀ

S_W = S_1 + S_2

After projection onto w, the within-class scatter becomes wᵀ S_W w


Feature Extraction

LDA: Two-Class Problem

Objective: maximize the between-class scatter of the projected data while minimizing its within-class scatter

J(w) = (wᵀ S_B w) / (wᵀ S_W w)


Feature Extraction

LDA: Two-Class Problem

Maximize J(w) by setting its derivative to zero:

d/dw [ (wᵀ S_B w) / (wᵀ S_W w) ] = 0  ⇒  S_B w = J(w) S_W w

J(w) is a scalar, so let λ = J(w):  S_W⁻¹ S_B w = λw

S_B w = (μ1 − μ2)(μ1 − μ2)ᵀ w always points in the direction μ1 − μ2, so

w ∝ S_W⁻¹ (μ1 − μ2)


Feature Extraction

LDA: Two-Class Problem

Class 1                Class 2

x1  x2  y              x1  x2  y
 4   2  1               9  10  2
 2   4  1               6   8  2
 2   3  1               9   5  2
 3   6  1               8   7  2
 4   4  1              10   8  2

Compute μ1, μ2, S_W and S_B from the data


Feature Extraction

LDA: Two-Class Problem

μ1 = (3, 3.8)ᵀ,  μ2 = (8.4, 7.6)ᵀ

S_1 = [ 4.0  −1.0 ; −1.0  8.8 ],  S_2 = [ 9.2  −0.2 ; −0.2  13.2 ]

S_W = S_1 + S_2 = [ 13.2  −1.2 ; −1.2  22.0 ]

Since w ∝ S_W⁻¹ (μ1 − μ2) and μ1 − μ2 = (−5.4, −3.8)ᵀ:

w ≈ (−0.427, −0.196)ᵀ
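A short numpy sketch reproducing these numbers:

import numpy as np

X1 = np.array([[4, 2], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)
mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)      # within-class scatter of class 1
S2 = (X2 - mu2).T @ (X2 - mu2)
Sw = S1 + S2                        # S_W = S_1 + S_2
w = np.linalg.solve(Sw, mu1 - mu2)  # w proportional to S_W^{-1}(mu1 - mu2)
print(w)                            # approx. [-0.427 -0.196]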


Feature Extraction

LDA: Multi-Class Problem

How about multi-class?

Within-Class Scatter

2-class: S_W = S_1 + S_2

The 2-class definition can be generalized to multi-class

Multi-class (c classes): S_W = Σ_{i=1..c} S_i

Page 25: Lecture08 - Data - Feature Selection & Extraction€¦ · Dr. Patrick Chan @ SCUT Feature Selection & Extraction Feature Selection Selectinga subset of features without a transformation

Dr. Patrick Chan @ SCUT

Feature Extraction

LDA: Multi-Class Problem

How about multi-class?

Between-Class Scatter

2-class: S_B = (μ1 − μ2)(μ1 − μ2)ᵀ

Define μ as the mean of the means of all classes

Multi-class: S_B = Σ_{i=1..c} n_i (μ_i − μ)(μ_i − μ)ᵀ


Feature Extraction

LDA: Multi-Class Problem

Scatter in the original space x vs. the mapped space z = Wᵀx:

S_W becomes Wᵀ S_W W after the mapping

S_B becomes Wᵀ S_B W after the mapping


Feature Extraction

LDA: Multi-Class Problem

The detailed proof is omitted (it is not difficult)

The loss function of the multi-class problem is

J(W) = | Wᵀ S_B W | / | Wᵀ S_W W |

Eigenvalues and eigenvectors are obtained by solving

S_W⁻¹ S_B w = λw
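A minimal sketch using SciPy's generalized symmetric eigensolver, assuming S_B and S_W have already been computed as above:

import numpy as np
from scipy.linalg import eigh

def lda_directions(Sb, Sw, m):
    """Top-m discriminant directions from the generalized eigenproblem."""
    vals, vecs = eigh(Sb, Sw)          # solves Sb v = lambda Sw v
    order = np.argsort(vals)[::-1]     # largest eigenvalues first
    return vecs[:, order[:m]]          # d x m projection matrix W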


Feature Extraction

LDA: Limitation

LDA assumes unimodal Gaussian likelihoods

Performs badly if this assumption is wrong


Feature Extraction

LDA: Limitation

LDA also fails when the discriminatory information is not in the means but in the variances of the classes


Feature Extraction

Artificial Neural Network

An ANN aims to extract a meaningful feature space in some settings

E.g. Deep Learning

Will be discussed later


Feature Space

Summary

Feature Selection

Selecting a subset

No transformation

Selected features are understandable

Feature Extraction

Transforming features into another space

Only for numeric features

The meaning of the original features is lost

Suitable for visualization

Each extracted feature can carry more information


Active Learning

The models discussed in this course so far use Passive Learning

Samples are pre-collected

Some samples may not be useful for learning

Active Learning: training samples are selected according to the needs of the current model

A learner in different learning states needs different samples

Label information is queried only for the selected samples


Active Learning

Incremental learning framework

Algorithm

Given a set of unlabeled samples

Initialization: query the labels of some randomly selected samples

Repeat:

Train a model using the labelled samples queried so far

Query the label of the most useful sample for the current model

How to quantify usefulness?

1. Uncertainty Sampling

2. Query-By-Committee

3. Expected Model Change
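A minimal pool-based sketch of the loop above, using the least-confident measure from the next slide as the usefulness score; the oracle function (e.g. a human annotator) and the logistic-regression learner are illustrative assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

def active_learn(X_pool, oracle, n_init=10, n_queries=50):
    rng = np.random.default_rng(0)
    # Initialization: query some randomly selected samples
    # (assumes the initial random queries hit at least two classes)
    labeled = list(rng.choice(len(X_pool), n_init, replace=False))
    y = {i: oracle(X_pool[i]) for i in labeled}
    model = LogisticRegression(max_iter=1000)
    for _ in range(n_queries):
        model.fit(X_pool[labeled], [y[i] for i in labeled])
        proba = model.predict_proba(X_pool)
        usefulness = 1 - proba.max(axis=1)   # least-confidence measure
        usefulness[labeled] = -np.inf        # never re-query a sample
        i = int(usefulness.argmax())
        labeled.append(i)                    # query the most useful sample
        y[i] = oracle(X_pool[i])
    return model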


Active Learning: Sample Evaluation

Uncertainty Sampling

An active learner queries the instances on whose decision it is least certain

Three strategies

Least confident

Margin sampling

Entropy


Active Learning: Sample Evaluation

Uncertainty Sampling

Least Confident

x* = argmax_x [ 1 − g_max(x) ]

Focuses only on the most probable class

Margin Sampling

x* = argmin_x [ g_max(x) − g_max2(x) ]

g_max and g_max2 are the largest and 2nd-largest outputs g

Ignores the output distribution over the remaining classes

Entropy

x* = argmax_x [ −Σ_i g_i(x) log g_i(x) ]

Considers all outputs
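A small sketch computing the three measures for one sample's output vector g:

import numpy as np

def least_confident(g):
    return 1 - g.max()

def margin(g):
    g1, g2 = np.sort(g)[-1], np.sort(g)[-2]
    return g1 - g2                 # a small margin means high uncertainty

def entropy(g):
    g = g[g > 0]                   # avoid log(0)
    return -(g * np.log(g)).sum()

g = np.array([0.5, 0.3, 0.2])      # current model outputs for one x
print(least_confident(g), margin(g), entropy(g))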


Active Learning: Sample Evaluation

Uncertainty Sampling

Example: heat maps illustrating the query behavior for a three-label classification problem

(Figure: heat maps of the three measures over the class-probability simplex; at (0.33, 0.33, 0.33) every measure signals maximum uncertainty, while distributions such as (0.5, 0.5, 0) distinguish the measures.)

φ(x) = 1 − g_max(x)    φ(x) = g_max(x) − g_max2(x)    φ(x) = −Σ_i g_i(x) log g_i(x)


Active Learning: Sample Evaluation

Query-By-Committee (QBC)

A committee contains m diverse models trained on the current labeled set

Each committee member votes on the labeling of the query candidates

Pick the instances generating the most disagreement among the hypotheses

Vote entropy:

x* = argmax_x [ −Σ_i ( V(y_i) / m ) log ( V(y_i) / m ) ]

V(y_i) is the number of votes for class y_i

m is the committee size
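A minimal sketch, assuming committee is a list of fitted classifiers exposing predict():

import numpy as np

def vote_entropy(committee, x):
    votes = [clf.predict(x.reshape(1, -1))[0] for clf in committee]
    _, counts = np.unique(votes, return_counts=True)
    p = counts / len(committee)        # V(y_i) / m
    return -(p * np.log(p)).sum()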


Active Learning: Sample Evaluation

Expected Model Change

The instance yielding the greatest change to the current model will be queried

Expected Gradient Length (EGL)

For gradient-based training models

Query the instance x which, if labeled and added to D, would result in the new training gradient of the largest magnitude:

x* = argmax_x Σ_i g_i(x) ‖ ∇J( D ∪ {(x, y_i)}; θ ) ‖

∇J(D; θ): the gradient of the objective function J with respect to the parameters θ

∇J(D ∪ {(x, y)}; θ): the new gradient after adding the training tuple (x, y) to D

The true label is unknown, so the expectation is taken over the current model's outputs g_i(x)

Computationally expensive

