Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & Spark MLLib/ML & SparkR

Deborah Siegel, Northwest Genomics Center, UW Center for Mendelian Genomics. Twitter: @dsiegel. Coauthoring a book on data analysis in SparkR with Amanda Casari. Seattle Spark Meetup, October 16, 2015.

Page 1

Page 2

xkcd

Page 3

Representation

Reference Genome

Humans have 2 copies of most of our chromosomes, thus we have two alleles for each position of our genome.

AATCATGTGTGGCTACTTACTGTCACT

AATCATGTGTGGCTACTTACTGTCACT
AATCATGTGTAGCTACTTACTGTCACT

We will represent our data as the Manhattan distance from the reference: for each person in our sample, how many alternate alleles are present in their genotype at that position.

homozygous ref allele: GG {0}

heterozygous: GA {1}

homozygous alt allele: AA {2}
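The encoding on this slide can be sketched in a few lines of Python. This is only an illustration of the counting rule (GG = 0, GA = 1, AA = 2); the function name `alt_allele_count` and the two-character genotype strings are assumptions made for the example, not code from the talk.

```python
def alt_allele_count(genotype, ref_allele):
    """Count the alleles in a diploid genotype that differ from the reference.

    `genotype` is a two-character string such as "GA" (hypothetical format
    chosen for this sketch); `ref_allele` is the reference base.
    """
    return sum(1 for allele in genotype if allele != ref_allele)


# The three genotypes from the slide, with G as the reference allele.
genotypes = ["GG", "GA", "AA"]
counts = [alt_allele_count(g, "G") for g in genotypes]
print(counts)  # [0, 1, 2]
```

Each person then becomes a vector of such counts, one entry per genomic position kept in the analysis.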

Page 4

Motivation

Manolio et al. Finding the missing heritability of complex diseases. Nature. 2009 Oct 8. Used with permission.

Page 5

Motivation: Genes Mirror Geography within Europe

John Novembre, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R. Boyko, Adam Auton, Amit Indap, Karen S. King, Sven Bergmann, Matthew R. Nelson, Matthew Stephens, Carlos D. Bustamante (2008). Genes mirror geography within Europe. Nature. DOI: 10.1038/nature07331

Used with permission.

Page 6

Feature Space

ℝ^3,000,000,000: genomic positions

ℝ^50,818,468: chr22

ℝ^493,782: biallelic SNP variants with complete data in our sample

ℝ^1838: common variants in our sample

ℝ^2: dimension reduced

Data: 1000 Genomes Project, publicly available on AWS
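The two filtering steps that shrink the feature space (complete data, then common variants) can be sketched on a toy matrix of alt allele counts. This is a minimal pure-Python illustration, not the Spark/ADAM code from the talk. The slides do not say how "common" was defined, so the `min_alt_freq` threshold and the symmetric frequency test below are assumptions, and `filter_variants` is a hypothetical helper name.

```python
def filter_variants(matrix, min_alt_freq=0.05):
    """Return the column indices of variants worth keeping.

    `matrix` holds one row per sample and one column per variant; each cell
    is an alt allele count (0, 1, 2) or None when the genotype is missing.
    The 0.05 threshold is an assumption made for this sketch.
    """
    n_samples = len(matrix)
    kept = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in matrix]
        # Filter 1: complete data, so every sample must have a genotype call.
        if any(count is None for count in column):
            continue
        # Filter 2: common variants, judged by alt allele frequency
        # (each diploid sample contributes two alleles).
        alt_freq = sum(column) / (2 * n_samples)
        if min_alt_freq <= alt_freq <= 1 - min_alt_freq:
            kept.append(j)
    return kept


toy = [
    [0, 1, None, 0],
    [1, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 1, 0],
]
kept = filter_variants(toy)
print(kept)  # [0, 1]
```

In the toy data, the third variant is dropped for a missing call and the fourth because no sample carries the alternate allele.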

Page 7

Representation

Page 8

workflow

Spark/ADAM

vcf file

parquet file of genotype objects

Filter - Populations

Filter - Complete data

Filter - Common Variants

Case Class of Sample Variants

create RDD[Row] with sample ID and ordered array of alt allele counts

create schema

create DataFrame

Spark ML

ML VectorAssembler

ML fit PCA (model)

ML transform (projection)

ML VectorSlicer

df to rdd

convert vector slices to strings

rdd to df

write to parquet on hdfs

SparkR

read parquet from hdfs to local data.frame

plot PC1 & PC2
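To make the "fit PCA" and "transform (projection)" steps concrete, here is a small pure-Python stand-in. It is not the Spark ML code from the talk: it swaps in power iteration with deflation on the sample covariance matrix, which is workable only for toy inputs. The sample IDs, the data, and the helper names (`center`, `covariance`, `power_iteration`, `pca_project`) are all hypothetical.

```python
import math


def center(rows):
    """Subtract each column's mean so every feature has mean zero."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(centered):
    """Sample covariance matrix (d x d) of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def power_iteration(matrix, iterations=200):
    """Approximate the dominant eigenvalue and eigenvector of a symmetric matrix."""
    d = len(matrix)
    v = [float(i + 1) for i in range(d)]  # fixed, non-uniform starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(matrix[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, v


def pca_project(samples, k=2):
    """Project each sample's alt allele counts onto the top k components."""
    ids = sorted(samples)
    centered = center([samples[s] for s in ids])
    cov = covariance(centered)
    components = []
    for _ in range(k):
        eigenvalue, v = power_iteration(cov)
        components.append(v)
        # Deflate: remove the component just found before finding the next.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    return {s: [sum(r[j] * c[j] for j in range(len(c))) for c in components]
            for s, r in zip(ids, centered)}


# Toy input: sample ID -> ordered array of alt allele counts.
samples = {
    "s1": [0, 0, 1, 2],
    "s2": [0, 1, 0, 2],
    "s3": [2, 2, 1, 0],
    "s4": [2, 1, 2, 0],
}
projections = pca_project(samples)
for sample_id, (pc1, pc2) in sorted(projections.items()):
    print(sample_id, round(pc1, 3), round(pc2, 3))
```

The sign of each component is arbitrary, so only the relative positions of samples along PC1 and PC2 are meaningful. In this toy data, s1 and s2 land on one side of PC1 and s3 and s4 on the other.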

Page 9

Page 10

Also:

Northwest Genomics Center

Amanda Casari

Frank Nothaft & Big Data Genomics http://bdgenomics.org/

Neil Ferguson for his excellent blog post - http://bdgenomics.org/blog/2015/07/10/genomic-analysis-using-adam/

Thank You!

Call to Action: Talk with folks in the development community about PCA on Spark and SparkR