Feature Subset Selection using Minimum Cost Spanning Trees
Mike Farah - 18548059
Supervisor: Dr. Sid Ray

Page 1

Feature Subset Selection using Minimum Cost Spanning Trees

Mike Farah - 18548059
Supervisor: Dr. Sid Ray

Page 2

Outline

• Introduction
  - Pattern Recognition
  - Feature subset selection
• Current methods
• Proposed method
• IFS
• Results
• Conclusion

Page 3

Introduction: Pattern Recognition

• The classification of objects into groups by learning from a small sample of objects
  Example: apples and strawberries
    Classes: apples and strawberries
    Features: colour, size, weight, texture
• Applications: character recognition, voice recognition, oil mining, weather prediction, …

Page 4

Introduction: Pattern Recognition

• Pattern representation
  Measuring and recording features: size, colour, weight, texture, …
• Feature set reduction
  Reducing the number of features used, either by selecting a subset or by applying transformations
• Classification
  The resulting features are used for classification of unknown objects

Page 5

Introduction: Feature subset selection

• Can be split into two processes:
  Feature subset searching
    It is not usually feasible to exhaustively try all feature subset combinations
  Criterion function
    The main issue of feature subset selection (Jain et al. 2000), and the focus of our research
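The infeasibility of exhaustive search is easy to quantify: a set of F features has 2^F − 1 non-empty subsets. A minimal illustration (the feature names are only for this example):

```python
from itertools import combinations

def all_subsets(features):
    """Enumerate every non-empty subset of a feature set."""
    subsets = []
    for k in range(1, len(features) + 1):
        subsets.extend(combinations(features, k))
    return subsets

features = ["colour", "size", "weight", "texture"]
print(len(all_subsets(features)))  # 2**4 - 1 = 15 candidate subsets for only 4 features
```

With the 20 features of the Character data set from the experimental framework, the count is already 2^20 − 1 = 1,048,575 subsets, which is why criterion-guided search is used instead of exhaustive evaluation.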

Page 6

Current methods

• Euclidean distance
  Statistical properties of the classes are not considered:
    J(x) = min_{i,j} { (μ_i − μ_j)′ (μ_i − μ_j) }
• Mahalanobis distance
  Variances and co-variances of the classes are taken into account:
    J(x) = min_{i,j} { (μ_i − μ_j)′ Σ_ij⁻¹ (μ_i − μ_j) }
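The two criteria can be sketched for a two-feature case as follows. This is an illustrative reading of the formulas, not the project's implementation; treating Σ_ij as the pooled covariance of the class pair, and taking the minimum over all class pairs, are assumptions about the intended definitions:

```python
def mean(samples):
    # component-wise mean of a list of 2-D points
    n = len(samples)
    return [sum(p[i] for p in samples) / n for i in range(2)]

def pooled_cov(a, b):
    # pooled 2x2 covariance of two classes of 2-D points
    ma, mb = mean(a), mean(b)
    c = [[0.0, 0.0], [0.0, 0.0]]
    for samples, m in ((a, ma), (b, mb)):
        for p in samples:
            d = [p[0] - m[0], p[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    c[i][j] += d[i] * d[j]
    n = len(a) + len(b) - 2
    return [[c[i][j] / n for j in range(2)] for i in range(2)]

def inv2(m):
    # inverse of a 2x2 matrix
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def j_euclidean(classes):
    # min over class pairs of (mu_i - mu_j)'(mu_i - mu_j)
    best = float("inf")
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            mi, mj = mean(classes[i]), mean(classes[j])
            d = [mi[0] - mj[0], mi[1] - mj[1]]
            best = min(best, d[0] ** 2 + d[1] ** 2)
    return best

def j_mahalanobis(classes):
    # min over class pairs of (mu_i - mu_j)' inv(Sigma_ij) (mu_i - mu_j)
    best = float("inf")
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            mi, mj = mean(classes[i]), mean(classes[j])
            d = [mi[0] - mj[0], mi[1] - mj[1]]
            s = inv2(pooled_cov(classes[i], classes[j]))
            q = sum(d[a] * s[a][b] * d[b] for a in range(2) for b in range(2))
            best = min(best, q)
    return best
```

The Mahalanobis version rescales the mean difference by the inverse covariance, which is why it accounts for feature variances and co-variances while the Euclidean version does not.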

Page 7

Limitations of Current Methods

Page 8

Limitations of Current Methods

Page 9

Friedman and Rafsky’s two sample test

• A minimum spanning tree (MST) approach for determining whether two sets of data originate from the same source

• An MST is built across the data from the two sources; edges which connect samples of different data sets are removed

• If many edges are removed, the two sets of data are likely to originate from the same source

Page 10

Friedman and Rafsky’s two sample test

• The method can be used as a criterion function
• An MST is built across the sample points
• Edges which connect samples of different classes are removed
• A good subset is one that provides discriminatory information about the classes; therefore, the fewer edges removed the better
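The steps above can be sketched end to end: build an MST over all samples (Prim's algorithm on the complete Euclidean graph), then count the edges joining samples of different classes. A minimal pure-Python sketch, not the project's IFS code:

```python
def mst_edges(points):
    """Prim's algorithm on the complete Euclidean graph.
    Returns the MST as a list of (i, j) index pairs."""
    n = len(points)

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        best = None
        # cheapest edge from the tree to a point outside it
        for i in in_tree:
            for j in range(n):
                if j not in in_tree:
                    d = dist(points[i], points[j])
                    if best is None or d < best[0]:
                        best = (d, i, j)
        edges.append((best[1], best[2]))
        in_tree.add(best[2])
    return edges

def between_class_edges(points, labels):
    """Number of MST edges connecting samples of different classes.
    Under this criterion, fewer such edges = better class separation."""
    return sum(1 for i, j in mst_edges(points) if labels[i] != labels[j])
```

For two well-separated clusters the MST crosses between classes only once (a spanning tree must cross at least once), whereas interleaved classes produce many crossing edges.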

Page 11

Limitations of Friedman and Rafsky’s technique

Page 12

Our Proposed Method

• Use the number of edges and edge lengths in determining the suitability of a subset

• A good subset will have a large number of short edges connecting samples of the same class, and a small number of long edges connecting samples of different classes

Page 13

Our Proposed Method

• We experimented with using average edge length and weighted average - weighted average was expected to perform better

J(x) = 0.5 × ( (1 − betweenNum / (betweenNum + withinNum)) + Avg(betweenEdges) / (Avg(betweenEdges) + Avg(withinEdges)) )

where betweenNum and withinNum count the MST edges connecting samples of different and the same classes respectively, and betweenEdges and withinEdges are the lengths of those edges.
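Given an MST whose edges have already been labelled as within-class or between-class, the criterion reduces to a few lines. This is a sketch of our reading of the formula, with edges supplied as hypothetical (length, is_between) pairs:

```python
def j_proposed(edges):
    """edges: list of (length, is_between) pairs for each MST edge,
    where is_between is True if the edge joins samples of different
    classes. Higher J(x) indicates a better subset."""
    between = [length for length, b in edges if b]
    within = [length for length, b in edges if not b]
    between_num, within_num = len(between), len(within)
    avg_between = sum(between) / between_num
    avg_within = sum(within) / within_num
    # few between-class edges -> first term close to 1
    count_term = 1 - between_num / (between_num + within_num)
    # long between-class edges relative to within-class -> second term close to 1
    length_term = avg_between / (avg_between + avg_within)
    return 0.5 * (count_term + length_term)
```

A subset with many short within-class edges and one long between-class edge scores close to 1; a subset whose edges cross classes as often as not scores near 0.5.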

Page 14

IFS - Interactive Feature Selector

• Developed to allow users to experiment with various feature selection methods
• Automates the execution of experiments
• Allows visualisation of data sets and results
• Extensible: developers can easily add criterion functions, feature selectors and classifiers into the system

Page 15

IFS - Screenshot

Page 16

IFS - Screenshot

Page 17

Experimental Framework

Data set         No. Samples   No. Feats   No. Classes
Iris                     150           4             3
Crab                     200           7             2
Forensic Glass           214           9             7
Diabetes                 332           8             2
Character                757          20             8
Synthetic                750           7             5

Page 18

Experimental Framework

• Spearman's rank correlation
  A good criterion function will have good correlation with the classifier: subsets which are ranked highly should achieve high accuracy levels
• Subset chosen
  Final subsets selected by the criterion functions are compared to the optimal subset chosen by the classifier
• Time
  Algorithm completion times are compared
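The rank-correlation evaluation can be sketched as follows: rank the subsets by criterion score and by classifier accuracy, then correlate the two rankings. A pure-Python sketch assuming no tied values (the example scores are hypothetical):

```python
def spearman(xs, ys):
    """Spearman's rank correlation between two score lists (no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    # classic formula: 1 - 6 * sum(d^2) / (n(n^2 - 1))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# hypothetical criterion scores and classifier accuracies for 5 subsets
scores = [0.9, 0.4, 0.7, 0.2, 0.5]
accs = [0.95, 0.60, 0.80, 0.55, 0.70]
print(spearman(scores, accs))  # identical orderings give correlation 1.0
```

A correlation near 1 means the criterion function orders subsets the same way the classifier does, which is exactly the property a good criterion function should have.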

Page 19

Forensic glass data set results

Page 20

Forensic glass data set results

Page 21

Synthetic data set

Page 22

Synthetic data set

Page 23

Algorithm completion times

Page 24

Algorithm completion times

Page 25

Algorithm complexities

• K-NN: O(N²(F + log(N) + K))

• MST criterion functions: O(N² · F)

• Mahalanobis distance: O(C² · F² · (N + F))

• Euclidean distance: O(C² · F)

where N is the number of samples, F the number of features, C the number of classes, and K the number of neighbours.

Page 26

Conclusion

• MST-based approaches generally achieved higher accuracy values and rank correlations, in particular with the K-NN classifier

• The criterion function based on Friedman and Rafsky's two-sample test performed the best

Page 27

Conclusion

• MST approaches are closely related to the K-NN classifier

• The Mahalanobis criterion function is suited to data sets with Gaussian distributions and strong feature interdependence

• Future work:
  Construct a classifier based on K-NN which gives closer neighbours higher priority
  Improve IFS

Page 28