Support Vector Machines for Classification of Flow ...

Post on 27-Jan-2022


Support Vector Machines for Classification of Flow Data

Funded by SBIR Grant # R43 RR024094-01A1
FlowCap 2010
John Quinn, Ph.D.
Treestar
john@treestar.com

Our Objective
• Demonstrate that supervised training algorithms can effectively replicate user-created gates
– Very useful for high-throughput settings
– Can increase robustness
• We believe this will be the first application in which algorithmic gate placement becomes the norm.

Selected Algorithm
• Support Vector Machine (SVM)
– Radial kernel
• Supervised linear classifier that solves an optimization problem to find the hyperplane(s) that separate classes with the maximum distance between classes
– With non-linear mapping, data that is not linearly separable can be classified

SVM Operation
Optimization:
• Determine which elements of the training data mark the boundary of maximum distance between two classes
[Figure: Class 1 and Class 2 points with the support vectors marked; D is the maximum separation between the classes]

SVM Operation
• Optimization problem
For data:

A hyperplane that separates any two classes can be defined as:
For ci = 1
For ci = -1

Knowing that the data points should be outside of the margin, we can impose the constraint:
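
The equations for this slide are not present in the transcript. As a hedged reconstruction, the standard hard-margin formulation that the text appears to follow is sketched below. The symbols x_i (data points), c_i (class labels), w (normal vector) and b (offset), and the sign convention w · x - b, are assumptions rather than the author's confirmed notation. The block assumes the amsmath and amssymb packages.

```latex
\begin{align*}
% Training data: n points x_i in p dimensions, each with a class label c_i
\mathcal{D} &= \{(\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\}\}_{i=1}^{n} \\
% Separating hyperplane, and the two margin hyperplanes on either side of it
\mathbf{w} \cdot \mathbf{x} - b &= 0 \\
\mathbf{w} \cdot \mathbf{x} - b &= 1 && \text{for } c_i = 1 \\
\mathbf{w} \cdot \mathbf{x} - b &= -1 && \text{for } c_i = -1 \\
% Constraint that keeps every training point outside the margin
c_i(\mathbf{w} \cdot \mathbf{x}_i - b) &\ge 1 && \text{for all } i
\end{align*}
```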

SVM Operation
We know that the support vectors will have a perpendicular distance from the hyperplane of:

and

The distance D between SVs can then be expressed as:

So optimization is the minimization of
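
The expressions for this slide are also missing from the transcript. Under the same assumed convention (margin hyperplanes w · x - b = ±1), the textbook quantities would be as follows. This is a sketch of the standard result, not a copy of the slide.

```latex
\begin{align*}
% Signed perpendicular distance of a support vector from the hyperplane
\frac{\mathbf{w} \cdot \mathbf{x}_i - b}{\lVert \mathbf{w} \rVert} &= \frac{1}{\lVert \mathbf{w} \rVert}
  && \text{for support vectors with } c_i = 1 \\
\frac{\mathbf{w} \cdot \mathbf{x}_i - b}{\lVert \mathbf{w} \rVert} &= -\frac{1}{\lVert \mathbf{w} \rVert}
  && \text{for support vectors with } c_i = -1 \\
% Separation between the two margin hyperplanes
D &= \frac{2}{\lVert \mathbf{w} \rVert} \\
% Maximizing D is equivalent to minimizing the norm of w
&\min_{\mathbf{w},\, b}\ \lVert \mathbf{w} \rVert
  \quad \text{subject to } c_i(\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 \text{ for all } i
\end{align*}
```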

SVM Operation
We then use the inequality,

as a constraint to fix a critical point and use Lagrange multipliers αi to express w as a linear combination of the training vectors:

The NSV support vectors are then the Xi associated with non-zero (strictly positive) Lagrange multipliers. All multipliers are non-negative, so only the strictly positive ones pick out support vectors.
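
A hedged sketch of the Lagrangian step, again in assumed notation (w · x - b convention, labels c_i, multipliers α_i); the objective is written as ½‖w‖², the usual equivalent of minimizing ‖w‖:

```latex
\begin{align*}
% Lagrangian of the constrained problem, one multiplier per training point
L(\mathbf{w}, b, \boldsymbol{\alpha}) &= \tfrac{1}{2}\lVert \mathbf{w} \rVert^2
  - \sum_{i=1}^{n} \alpha_i \left[ c_i(\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right],
  \qquad \alpha_i \ge 0 \\
% Setting the gradient with respect to w to zero gives w as a
% linear combination of the training vectors
\frac{\partial L}{\partial \mathbf{w}} = 0 \ &\Rightarrow\
  \mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i
\end{align*}
```

Only the training points with α_i > 0 contribute to the sum, which is why they are the support vectors.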

SVM Operation
Once w is known, and the support vectors have been identified, b can be solved as:
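
The formula itself is missing from the transcript. Under the assumed convention c_i(w · x_i - b) = 1 for every support vector, and using c_i = ±1, each support vector gives b = w · x_i - c_i. A common choice, shown here as a sketch, is to average over all N_SV support vectors:

```latex
\begin{equation*}
b = \frac{1}{N_{SV}} \sum_{i \in SV} \left( \mathbf{w} \cdot \mathbf{x}_i - c_i \right)
\end{equation*}
```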

If there are more than two classes, the operation remains the same, but the hyperplanes are determined either as one versus all or pairwise
• We chose a one versus all format
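
A minimal sketch of one-versus-all prediction, assuming each class already has its own trained decision function (here plain linear functions w · x - b with made-up weights). The names `one_vs_all_predict` and `linear_decision` and the class labels are hypothetical and are not taken from the SVM code used for the talk.

```python
from typing import Callable, Dict, Sequence

Point = Sequence[float]
DecisionFn = Callable[[Point], float]


def linear_decision(w: Sequence[float], b: float) -> DecisionFn:
    """Build a decision function x -> w . x - b (assumed sign convention)."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) - b


def one_vs_all_predict(decision_fns: Dict[str, DecisionFn], x: Point) -> str:
    """Return the class whose 'this class versus the rest' score is largest."""
    return max(decision_fns, key=lambda label: decision_fns[label](x))


# Illustrative classifiers only; the weights are invented for this example.
classifiers = {
    "A": linear_decision([1.0, 0.0], 0.5),
    "B": linear_decision([-1.0, 0.0], 0.5),
    "C": linear_decision([0.0, 1.0], 0.5),
}

print(one_vs_all_predict(classifiers, [2.0, 0.0]))
```

With k classes this trains k classifiers, whereas the pairwise alternative trains k(k-1)/2.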

SVM Operation
• Data not linearly separable? Map it to a space where it is!
– We assume that flow data will have a Gaussian distribution and selected a Gaussian mapping
[Figure: Input Space and Mapped Space]
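
To illustrate the mapping idea, here is a toy sketch with invented 1-D data: one class sits near the origin and the other on both sides of it, so no single threshold separates them in the input space. Mapping each value to its Gaussian similarity to an assumed center makes them separable. The function `gaussian_feature` and the values of `center` and `gamma` are hypothetical choices for illustration, not the mapping used in the talk.

```python
import math


def gaussian_feature(x: float, center: float = 0.0, gamma: float = 1.0) -> float:
    """Map a 1-D value to its Gaussian similarity to `center`."""
    return math.exp(-gamma * (x - center) ** 2)


# Invented toy data: class 1 near the origin, class 2 on both sides of it.
inner = [-0.5, 0.0, 0.4]
outer = [-2.0, -1.5, 1.8, 2.5]

# Input space: a single threshold works only if one class lies entirely
# to one side of the other, which is not the case here.
separable_in_input = max(outer) < min(inner) or min(outer) > max(inner)

# Mapped space: every class 1 value maps higher than every class 2 value,
# so one threshold on the mapped feature separates the classes.
mapped_inner = [gaussian_feature(x) for x in inner]
mapped_outer = [gaussian_feature(x) for x in outer]
separable_in_mapped = min(mapped_inner) > max(mapped_outer)

print(separable_in_input, separable_in_mapped)
```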

Why use an SVM?
• SVMs are deterministic
• Find the global maximum and not a local maximum
– If the training data are representative of the real data, you cannot do better.
• SVMs are fast
– They solve a maximization problem, as opposed to doing an iterative fitting

Preprocessing
• To prepare the training data, we:
– Normalized the data to a range of -1 to 1
– Identified the training data set with the largest number of clusters
• Used this data set as the reference set
– Calculated the centroid of each cluster in the reference set
– In all other training data, calculated the Euclidean distance of each cluster to the clusters in the reference set and assigned them cluster IDs matching the reference cluster with the smallest distance measure
– Took a sample of each training data set and combined them into one training vector to present to the SVM
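
The normalization and cluster-matching steps above can be sketched as follows. This is an illustration under assumptions: clusters are compared by the Euclidean distance between their centroids, and the helper names (`normalize`, `centroid`, `match_to_reference`) and the sample points are hypothetical. It needs Python 3.8 or later for `math.dist`.

```python
import math
from typing import Dict, List, Sequence

Point = Sequence[float]


def normalize(values: List[float]) -> List[float]:
    """Linearly rescale one parameter to the range -1 to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant parameter: place every value at the midpoint
        return [0.0 for _ in values]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def centroid(points: List[Point]) -> List[float]:
    """Mean position of a cluster's points, one value per dimension."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def match_to_reference(
    clusters: Dict[str, List[Point]], reference: Dict[str, List[Point]]
) -> Dict[str, str]:
    """Give each cluster the ID of the nearest reference cluster (by centroid)."""
    ref_centroids = {rid: centroid(pts) for rid, pts in reference.items()}
    mapping = {}
    for cid, pts in clusters.items():
        c = centroid(pts)
        mapping[cid] = min(
            ref_centroids, key=lambda rid: math.dist(c, ref_centroids[rid])
        )
    return mapping


# Invented example: a reference set with two clusters and one other data set.
reference = {"1": [(-0.8, -0.8), (-0.6, -0.6)], "2": [(0.7, 0.7), (0.9, 0.5)]}
other = {"a": [(0.6, 0.8), (0.8, 0.6)], "b": [(-0.7, -0.9)]}

print(normalize([0.0, 5.0, 10.0]))
print(match_to_reference(other, reference))
```

Nearest-centroid matching can assign two clusters to the same reference ID; whether the original preprocessing handled that case is not stated in the slides.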

Algorithm choice
Matlab has a free file-sharing repository
Someone has already put almost any algorithm you can think of into code
I used the SVM coded by Junshui Ma and Yi Zhao of Ohio St. University
It received 5 stars

Training Data
• Example training data
– Showing parameters 1 & 2, and 3 & 4 of the stem cell data set

Results
Speed:
Data set     Training time   Classification time
CFSE         4 sec           2 min 48 sec (13 files)
DLBCL        5 sec           67 sec (30 files)
GvHD         5 sec           38 sec (12 files)
NDD          11 sec          27 min 28 sec (30 files)
Stem cell    4 sec           19 sec (30 files)
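
Because the file counts differ between data sets, the classification times are easier to compare per file. The short sketch below divides each reported time by its file count, assuming the times on the slide are totals across the listed files.

```python
# (data set, total classification time in seconds, number of files),
# transcribed from the speed table.
timings = [
    ("CFSE", 2 * 60 + 48, 13),
    ("DLBCL", 67, 30),
    ("GvHD", 38, 12),
    ("NDD", 27 * 60 + 28, 30),
    ("Stem cell", 19, 30),
]

# Average classification time per file for each data set.
per_file = {name: total / files for name, total, files in timings}

for name, seconds in per_file.items():
    print(f"{name}: {seconds:.1f} s per file")
```

On these figures the per-file time ranges from about 0.6 s (stem cell) to about 55 s (NDD).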

Room for improvement…
• The SVMs are highly dependent on identifying a transform that maps the data to a linearly separable space.
• We could experiment with a number of different transforms
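
As a sketch of what experimenting with different transforms could look like, the block below defines three commonly used kernel functions and a helper that builds the kernel (Gram) matrix an SVM solver would consume. The parameter defaults (`degree`, `coef0`, `gamma`) and all names are illustrative assumptions; the slides do not say which alternatives were considered.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def dot(x: Vector, y: Vector) -> float:
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x: Vector, y: Vector) -> float:
    """No mapping: plain inner product in the input space."""
    return dot(x, y)


def polynomial_kernel(x: Vector, y: Vector, degree: int = 2, coef0: float = 1.0) -> float:
    """Implicit mapping to polynomial features up to `degree`."""
    return (dot(x, y) + coef0) ** degree


def gaussian_kernel(x: Vector, y: Vector, gamma: float = 1.0) -> float:
    """Gaussian (radial) kernel: similarity decays with squared distance."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


KERNELS: Dict[str, Kernel] = {
    "linear": linear_kernel,
    "polynomial": polynomial_kernel,
    "gaussian": gaussian_kernel,
}


def gram_matrix(kernel: Kernel, points: List[Vector]) -> List[List[float]]:
    """Pairwise kernel values for a set of points."""
    return [[kernel(p, q) for q in points] for p in points]


points = [[1.0, 2.0], [3.0, 4.0]]
for name, kernel in KERNELS.items():
    print(name, gram_matrix(kernel, points))
```

Swapping the kernel changes only the Gram matrix; the optimization itself stays the same, which is what makes this kind of experiment cheap to set up.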

FlowCap Feedback
• What went well
– Data easily available
– Submission process easy
– Questions answered immediately!
• What could be improved
– Wider publicity, particularly out of our domain

Questions?