Poster source: cs229.stanford.edu/proj2017/final-posters/5143905.pdf

Clustering and Classifying Autism

Rachael Aikens ([email protected]) and Brianna Kozemzak ([email protected])
Stanford University Department of Biomedical Informatics, Wall Lab

The iHart Consortium has helped to collect one of the largest Autism Spectrum Disorder (ASD) datasets ever, including genetic and behavioral data for several thousand ASD cases and controls. This offers us an unprecedented opportunity to apply machine learning approaches to two major autism research problems:

Aim 1: Clustering Autism Subtypes

Goal: Develop a cluster validation tool kit and use it to analyze clustering results.

The Problem: ASD can manifest over a broad spectrum of symptoms, from great intellectual and communication disability to near-normal ‘high-functioning’ forms. As a result, it is often asked whether ASD is in fact composed of some number of autism ‘sub-types’ that are best diagnosed, studied, and treated in different ways.

Data:
• 13,493 individuals
• 123 features from ADOS and ADI-R instruments
• Diagnostic, medical, demographic, etc. labels

Prior Work in Wall Lab:
• Imputed missing values and clustered data using a generalized low rank model with logistic loss
• Crisp and soft k-means clusterings were created for k = 1, 2, ..., 6.

Feature Heat Maps:
[Figure: heat map of all examples] Features on the x-axis and examples on the y-axis, sorted by cluster. This was too complex to be useful, so we looked only at the centroids of the clusters (a low rank representation of the examples) instead.
[Figure: heat map of centroids] Features on the x-axis and centroids on the y-axis. Lighter feature values usually indicate more neurotypical behavior. We see separation of neurotypical individuals from atypical individuals and then a mixed cluster.

Label Pie Charts:
[Figure: pie charts of ADOS Diagnosis and ADI-R Diagnosis for Cluster 1 (3980), Cluster 2 (2683), and Cluster 3 (6830)] Pie charts were generated for 29 different labels, including diagnostic, demographic, and computed ADOS/ADI-R labels. The control group appears to separate from the ASD individuals.

Individual Movement:
[Figure: Cluster 1 (3980), Cluster 2 (2683), Cluster 3 (6830); axis label: Cluster Moved To] Movement between clusters was not random. This indicates some common underlying features driving cluster formation for all k values.

Conclusions and Future Work:
Conclusions
• “Best” clustering result was soft k-means with k = 3, where each individual is assigned to a single cluster based on maximum partial membership
• Why? Clusters are separated by diagnosis, medical history, and computed ADOS/ADI-R labels without creating indistinguishable extra clusters
Future work will:
• Employ methods to work directly with the soft clustering results by using enrichment tests developed for soft clustering [1] and implementing weighted membership for pie charts
• Apply other clustering methods to the data set and compare with the k-means and soft k-means results

Aim 2: An Autism Genetic Risk Score

Goal: Build a genetic risk predictor for ASD.

The Problem: Autism is a complex disease: it is determined about 50% by genetics and 50% by a person’s environment. As a result, it is impossible to perfectly predict autism from genetics. However, an imperfect classifier can:
• give us a measure of a person’s genetic risk of autism
• provide intuition about which genetic features are most predictive of disease.

Genotype + Environment = Phenotype

The Feature Set: Each genome is represented as a 1109 × 1 binary vector (entries such as 0, 0, 1) describing where each person has a loss-of-function in a gene.

A Logistic Regression Classifier: We first trained a logistic regression classifier because these models are often simple to interpret.
F1 score: 0.634
Area under ROC: 0.565

A Gradient Boosted Classifier: We also trained a gradient boosted tree classifier to capture non-linear gene-gene relationships.
F1 score: 0.647
Area under ROC: 0.580

Conclusions and Future Work: Our best performance is achieved by averaging the predictions from the two classifiers above.
F1 score: 0.642
Area under ROC: 0.602
This classifier outperforms previous methods (best AU-ROC = 0.54 [2]), showing promise as a genetic risk score predictor for ASD.
Future work will:
• Continue to optimize ensemble and non-linear classification models
• Analyze feature importance to infer which genetic variants are most predictive

Acknowledgements
This work would not have been possible without the help of Dennis Wall, Kelley Paskov, and the other members of the Wall lab, as well as the funding and computing resources of the Wall Lab and the Stanford University School of Medicine. Thanks also to the Machine Learning instructors and TAs.

1. Phillips, R. D. et al. Enrichment Procedures for Soft Clusters: A Statistical Test and its Applications. (2010).
2. Unpublished work from the Wall lab, Stanford University.
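As a rough illustration of the Aim 1 validation steps described above (assigning each individual to the cluster with maximum partial membership, counting labels per cluster for the pie charts, and tracking movement between clusterings at different k), the following is a minimal sketch in plain Python. The membership values, the labels, and the function names `hard_assign`, `label_composition`, and `movement_table` are all hypothetical and chosen only for illustration; they are not the poster's data or the Wall lab's code.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple


def hard_assign(memberships: Sequence[Sequence[float]]) -> List[int]:
    """Return, for each individual, the index of the cluster with the
    largest partial membership (ties go to the lowest cluster index)."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


def label_composition(assignments: Sequence[int],
                      labels: Sequence[Hashable]) -> Dict[int, Counter]:
    """Count how often each label occurs in each cluster.
    These counts are the data behind one pie chart per cluster."""
    composition: Dict[int, Counter] = {}
    for cluster, label in zip(assignments, labels):
        composition.setdefault(cluster, Counter())[label] += 1
    return composition


def movement_table(assign_from: Sequence[int],
                   assign_to: Sequence[int]) -> Counter:
    """Count individuals by (cluster in first clustering, cluster in second)."""
    pairs: List[Tuple[int, int]] = list(zip(assign_from, assign_to))
    return Counter(pairs)


# Hypothetical soft memberships for five individuals (rows sum to 1).
memberships_k2 = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3], [0.3, 0.7]]
memberships_k3 = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
# Hypothetical diagnostic labels for the same five individuals.
labels = ["control", "ASD", "ASD", "control", "ASD"]

clusters_k2 = hard_assign(memberships_k2)
clusters_k3 = hard_assign(memberships_k3)

print("k=3 assignments:", clusters_k3)
print("label counts per cluster:", label_composition(clusters_k3, labels))
print("movement from k=2 to k=3:", movement_table(clusters_k2, clusters_k3))
```

A movement table concentrated in a few cells would be consistent with the non-random movement the poster reports; the weighted-membership pie charts listed under future work would instead sum the partial memberships per label rather than counting hard assignments.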

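The Aim 2 result rests on averaging the predictions of two classifiers and scoring the average by F1 and area under the ROC curve. The sketch below shows one way those three steps could look in plain Python, assuming each classifier outputs a probability per individual and that F1 is computed at a 0.5 threshold (the poster does not state its threshold). The toy labels and probabilities are invented for illustration and are unrelated to the poster's reported scores.

```python
from typing import List, Sequence


def average_predictions(p_a: Sequence[float],
                        p_b: Sequence[float]) -> List[float]:
    """Average two classifiers' predicted probabilities, example by example."""
    if len(p_a) != len(p_b):
        raise ValueError("prediction lists must have the same length")
    return [(a + b) / 2.0 for a, b in zip(p_a, p_b)]


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve, computed as the fraction of (case, control)
    pairs in which the case has the higher score; ties count as one half."""
    cases = [s for y, s in zip(y_true, scores) if y == 1]
    controls = [s for y, s in zip(y_true, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5
    return wins / (len(cases) * len(controls))


def f1_score(y_true: Sequence[int], scores: Sequence[float],
             threshold: float = 0.5) -> float:
    """F1 of the positive class after thresholding the scores."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 1:
            fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Invented labels (1 = ASD case, 0 = control) and classifier probabilities.
y = [1, 1, 1, 0, 0, 0]
p_logistic = [0.875, 0.375, 0.625, 0.5, 0.25, 0.75]
p_boosted = [0.625, 0.875, 0.25, 0.375, 0.5, 0.125]
p_ensemble = average_predictions(p_logistic, p_boosted)

for name, p in [("logistic", p_logistic), ("boosted", p_boosted),
                ("ensemble", p_ensemble)]:
    print(name, "AUROC:", round(auroc(y, p), 3), "F1:", round(f1_score(y, p), 3))
```

On this toy data the averaged scores happen to rank cases above controls more often than either input does, but that is a property of the invented numbers, not a general guarantee for averaging.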