Support Vector Machines for Classification of Flow Data
Funded by SBIR Grant # R43 RR024094-01A1
FlowCap 2010
John Quinn, Ph.D.

Our Objective
• Demonstrate that supervised training algorithms can effectively replicate user-created gates
  – Very useful for high-throughput settings
  – Can increase robustness
• We believe this will be the first application in which algorithmic gate placement becomes the norm.

Selected Algorithm
• Support Vector Machine (SVM)
  – Radial kernel
• Supervised linear classifier that solves an optimization problem to find the hyperplane(s) that separate classes with the maximum distance between classes
  – With a non-linear mapping, data that is not linearly separable can be classified

SVM Operation
Optimization:
• Determine which elements of the training data mark the boundary of maximum distance between two classes, or support vectors
[Figure: Class 1 and Class 2, with the support vectors marking a maximum separation D]
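
As a minimal sketch of what this optimization selects, consider one-dimensional, linearly separable data: there the support vectors are simply the closest pair of points from opposite classes, and the boundary sits midway between them. The function name and toy values are illustrative assumptions, not part of the presentation.

```python
def max_margin_1d(class1, class2):
    """Return (boundary, D, support_vectors) for separable 1-D data.

    Assumes every value in class1 is smaller than every value in class2.
    """
    sv1 = max(class1)  # class-1 point closest to class 2
    sv2 = min(class2)  # class-2 point closest to class 1
    if sv1 >= sv2:
        raise ValueError("classes overlap; not linearly separable in 1-D")
    boundary = (sv1 + sv2) / 2.0  # midpoint is equally far from both classes
    separation = sv2 - sv1        # D: maximum separation between the classes
    return boundary, separation, (sv1, sv2)


boundary, D, svs = max_margin_1d([0.5, 1.0, 2.0], [4.0, 5.5, 7.0])
print(boundary, D, svs)
```

Only the two points nearest the other class affect the result; moving any other point (without crossing them) leaves the boundary unchanged, which is the sense in which support vectors "mark" the boundary.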

SVM Operation
• Optimization problem
For data:
A hyperplane that separates any two classes can be defined as:
For ci = 1
For ci = -1
Knowing that the data points should be outside of the margin, we can impose the constraint:
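
The equations themselves are not present in this transcript. A standard formulation that fits the surrounding text is sketched below. It assumes the sign convention w·x − b = 0 and introduces the symbols n (number of training points) and p (number of measured parameters), neither of which comes from the slides.

```latex
% Training data: n points x_i with class labels c_i
\mathcal{D} = \{ (\mathbf{x}_i, c_i) \mid \mathbf{x}_i \in \mathbb{R}^p,\ c_i \in \{-1, 1\} \}_{i=1}^{n}

% Separating hyperplane
\mathbf{w} \cdot \mathbf{x} - b = 0

% Points must lie on or outside the margin
\mathbf{w} \cdot \mathbf{x}_i - b \ge 1 \quad \text{for } c_i = 1
\mathbf{w} \cdot \mathbf{x}_i - b \le -1 \quad \text{for } c_i = -1

% Both cases combined into one constraint
c_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1 \quad \text{for all } i
```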

SVM Operation
We know that the support vectors will have a perpendicular distance from the hyperplane of:
and
The distance between SVs can then be expressed as:
So optimization is the minimization of
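
The expressions on this slide are missing from the transcript. Assuming the hyperplane is written as w·x − b = 0 and the support vectors lie on w·x − b = ±1, the standard forms would be as follows. The factor of one half and the squared norm in the last line are a common convenience, not something the slides confirm.

```latex
% Signed perpendicular distance of the support vectors on either side
\frac{1}{\lVert \mathbf{w} \rVert} \quad \text{and} \quad \frac{-1}{\lVert \mathbf{w} \rVert}

% Distance between the two sets of support vectors
D = \frac{2}{\lVert \mathbf{w} \rVert}

% Maximizing D is therefore equivalent to minimizing the norm of w
\min_{\mathbf{w},\, b} \ \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad c_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1
```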

SVM Operation
We then use the inequality,
as a constraint to fix a critical point and use Lagrangian multipliers αi to express w as a linear combination of the training vectors:
The support vectors (NSV of them) are then the xi associated with non-zero Lagrange multipliers
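
A sketch of this step in standard notation, again assuming the convention w·x − b = 0 (the slide's own equations are not in the transcript). The multipliers are constrained to be non-negative, and only the training points with strictly positive multipliers contribute to w.

```latex
% Constraint attached to each training point
c_i (\mathbf{w} \cdot \mathbf{x}_i - b) \ge 1

% Lagrangian with multipliers \alpha_i \ge 0
L(\mathbf{w}, b, \boldsymbol{\alpha}) =
  \tfrac{1}{2} \lVert \mathbf{w} \rVert^{2}
  - \sum_{i=1}^{n} \alpha_i \left[ c_i (\mathbf{w} \cdot \mathbf{x}_i - b) - 1 \right]

% Setting \partial L / \partial \mathbf{w} = 0 gives w as a combination of training vectors
\mathbf{w} = \sum_{i=1}^{n} \alpha_i c_i \mathbf{x}_i

% Setting \partial L / \partial b = 0 gives
\sum_{i=1}^{n} \alpha_i c_i = 0
```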

SVM Operation
Once w is known, and the support vectors have been identified, b can be solved as:
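
The formula is not in the transcript. Under the assumed convention w·x − b = 0, every support vector satisfies w·xi − b = ci, so one common choice is to average b = w·xi − ci over the NSV support vectors. The sketch below applies that to a toy problem; the function name, the tolerance, and the data and multiplier values are all illustrative assumptions.

```python
def solve_w_and_b(points, labels, alphas, tol=1e-12):
    """Recover w and b from Lagrange multipliers (convention: w.x - b = 0)."""
    dim = len(points[0])
    # w is a linear combination of the training vectors: sum(alpha_i * c_i * x_i)
    w = [sum(a * c * x[d] for x, c, a in zip(points, labels, alphas))
         for d in range(dim)]
    # Support vectors are the points whose multiplier is strictly positive
    svs = [(x, c) for x, c, a in zip(points, labels, alphas) if a > tol]
    # For each support vector w.x_i - b = c_i, so b = w.x_i - c_i; average them
    b = sum(sum(wd * xd for wd, xd in zip(w, x)) - c for x, c in svs) / len(svs)
    return w, b, svs


# Toy problem: the third point lies far from the boundary, so its multiplier is 0
points = [(2.0, 0.0), (0.0, 0.0), (5.0, 1.0)]
labels = [1, -1, 1]
alphas = [0.5, 0.5, 0.0]
w, b, svs = solve_w_and_b(points, labels, alphas)
print(w, b, len(svs))
```

Averaging over all support vectors rather than using a single one is a numerical-stability choice; with exact multipliers every support vector gives the same b.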
If there are more than two classes, the operation remains the same, but the hyperplanes are determined either as one versus all or pairwise
• We chose a one versus all format
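
A minimal sketch of how a one versus all scheme can assign a class: one hyperplane is trained per class against all the others, and a new event goes to the class whose hyperplane scores it highest. The slides do not say how the per-class outputs were combined, so the highest-score rule, the function name, and the toy hyperplanes are assumptions.

```python
def one_vs_all_predict(x, classifiers):
    """Pick a class for x; classifiers maps label -> (w, b) for 'label vs the rest'."""
    def score(label):
        w, b = classifiers[label]
        # Signed decision value w.x - b; larger means further on the positive side
        return sum(wd * xd for wd, xd in zip(w, x)) - b
    return max(classifiers, key=score)


# Hypothetical hyperplanes for three clusters in a 2-parameter space
classifiers = {
    "cluster_1": ((1.0, 0.0), 0.5),
    "cluster_2": ((-1.0, 0.0), 0.5),
    "cluster_3": ((0.0, 1.0), 0.5),
}
print(one_vs_all_predict((0.9, 0.1), classifiers))
```

One versus all needs only one classifier per class, whereas a pairwise scheme needs one per pair of classes.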

SVM Operation
• Data not linearly separable? Map it to a space where it is!
  – We assume that flow data will have a Gaussian distribution and selected a Gaussian mapping
[Figure: Input Space, Mapped Space]
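
A Gaussian (radial) mapping is normally applied through its kernel function, which returns the inner product of two points in the mapped space without computing the mapping explicitly. A minimal sketch follows; the width parameter gamma and its default value are assumptions, since the slides do not state the value used.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2); gamma is an assumed width parameter."""
    sq_dist = sum((xd - yd) ** 2 for xd, yd in zip(x, y))
    return math.exp(-gamma * sq_dist)


same = gaussian_kernel((0.0, 0.0), (0.0, 0.0))  # identical points
near = gaussian_kernel((0.0, 0.0), (0.5, 0.0))  # nearby points
far = gaussian_kernel((0.0, 0.0), (1.0, 1.0))   # distant points
print(same, near, far)
```

The kernel value falls from 1 toward 0 as points move apart, so events close to a support vector dominate the decision for that region.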

Why use an SVM?
• SVMs are deterministic
• They find the global maximum, not a local maximum
  – If the training data are representative of the real data, you cannot do better.
• SVMs are fast
  – They solve a maximization problem, as opposed to doing an iterative fitting

Preprocessing
• To prepare the training data, we:
  – Normalized the data to a range of -1 to 1
  – Identified the training data set with the largest number of clusters
    • Used this data set as the reference set
  – Calculated the centroid of each cluster in the reference set
  – In all other training data, calculated the Euclidean distance of each cluster to the clusters in the reference set and assigned them cluster IDs matching the reference cluster with the smallest distance measure
  – Took a sample of each training data set and combined them into one training vector to present to the SVM
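
The normalization and cluster-matching steps above can be sketched as follows. This is an illustration under assumptions: the slides do not say how the normalization was done (a per-parameter min-max scaling is assumed here) or how a cluster's position was summarized when measuring distance (its centroid is assumed). All function names are hypothetical.

```python
import math


def normalize(events):
    """Scale each parameter (column) linearly to the range [-1, 1]."""
    columns = list(zip(*events))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple(2.0 * (v - lo) / (hi - lo) - 1.0 if hi > lo else 0.0
              for v, lo, hi in zip(event, lows, highs))
        for event in events
    ]


def centroid(events):
    """Mean position of the events in one cluster."""
    return tuple(sum(col) / len(col) for col in zip(*events))


def match_to_reference(clusters, reference_centroids):
    """Map each cluster ID to the ID of the nearest reference centroid."""
    mapping = {}
    for cluster_id, events in clusters.items():
        c = centroid(events)
        # Euclidean distance to every reference centroid; keep the smallest
        mapping[cluster_id] = min(
            reference_centroids,
            key=lambda ref_id: math.dist(c, reference_centroids[ref_id]),
        )
    return mapping


scaled = normalize([(0, 10), (5, 20), (10, 30)])
reference = {"A": (-0.5, -0.5), "B": (0.5, 0.5)}
clusters = {1: [(0.4, 0.6), (0.6, 0.4)], 2: [(-0.6, -0.4), (-0.4, -0.6)]}
mapping = match_to_reference(clusters, reference)
print(scaled, mapping)
```

Note that nearest-centroid matching as written allows two clusters in one file to map to the same reference cluster; whether and how that case was handled is not stated in the slides.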

Algorithm choice
Matlab has a free file share repository
Someone has already put almost any algorithm you can think of into code
I used the SVM coded by Junshui Ma and Yi Zhao of Ohio St. University
It received 5 stars

Training Data
• Example training data
  – Showing parameters 1 & 2, and 3 & 4 of the stem cell data set

Results

Results
Speed:
Data set     Training time   Classification time
CFSE         4 sec           2 min 48 sec (13 files)
DLBCL        5 sec           67 sec (30 files)
GvHD         5 sec           38 sec (12 files)
NDD          11 sec          27 min 28 sec (30 files)
Stem cell    4 sec           19 sec (30 files)
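
Because the data sets contain different numbers of files, a per-file figure makes the classification times easier to compare. The short calculation below uses only the numbers in the table and assumes each classification time is the total for all files listed in parentheses.

```python
# (training seconds, classification seconds, number of files) from the table above
results = {
    "CFSE": (4, 2 * 60 + 48, 13),
    "DLBCL": (5, 67, 30),
    "GvHD": (5, 38, 12),
    "NDD": (11, 27 * 60 + 28, 30),
    "Stem cell": (4, 19, 30),
}

# Average classification time per file, in seconds
per_file = {name: secs / files for name, (_, secs, files) in results.items()}
for name, secs in per_file.items():
    print(f"{name}: {secs:.1f} sec per file")
```

On this basis NDD is the slowest at roughly 55 seconds per file and the stem cell set the fastest at under 1 second per file.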

Room for improvement…
• SVMs are highly dependent on identifying a transform that maps the data to a linearly separable space.
• We could experiment with a number of different transforms

FlowCap Feedback
• What went well
  – Data easily available
  – Submission process easy
  – Questions answered immediately!
• What could be improved
  – Wider publicity, particularly outside of our domain

Questions?