Dimensionality Reduction of Genomic Variation with Big Data Genomics ADAM & Spark MLLib/ML & SparkR
Dimensionality Reduction of Genomic Variation
with Big Data Genomics ADAM & Spark MLLib/ML & SparkR
Deborah Siegel
Northwest Genomics Center
UW Center for Mendelian Genomics
twitter: @dsiegel
coauthoring a book on data analysis in SparkR with Amanda Casari
Seattle Spark Meetup October 16, 2015
xkcd
Reference Genome
Humans have two copies of most of our chromosomes, so we have two alleles at each position in our genome.
AATCATGTGTGGCTACTTACTGTCACT
AATCATGTGTGGCTACTTACTGTCACT
AATCATGTGTAGCTACTTACTGTCACT
We will represent our data as a Manhattan distance from the reference: for each person in our sample, the number of alternate alleles present in their genotype at that position.
homozygous ref allele: GG {0}
heterozygous: GA {1}
homozygous alt allele: AA {2}
Representation
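The encoding on this slide can be sketched in a few lines of plain Python. This is an illustration only, not the ADAM or Spark code used in the talk; the function name `alt_allele_count` and the two-character genotype strings are assumptions made for the example.

```python
def alt_allele_count(genotype, ref_allele):
    """Count the alleles in a diploid genotype that differ from the reference allele."""
    if len(genotype) != 2:
        raise ValueError("expected a diploid genotype with exactly two alleles")
    return sum(1 for allele in genotype if allele != ref_allele)


# The three genotypes from the slide, with G as the reference allele.
encoded = {g: alt_allele_count(g, "G") for g in ("GG", "GA", "AA")}
print(encoded)  # {'GG': 0, 'GA': 1, 'AA': 2}
```

Counting alleles rather than comparing strings means the order of the two alleles does not matter: "GA" and "AG" both encode to 1.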
Motivation
Manolio et al. Finding the missing heritability of complex diseases. Nature. 2009 Oct 8. Used with permission.
John Novembre, Toby Johnson, Katarzyna Bryc, Zoltán Kutalik, Adam R. Boyko, Adam Auton, Amit Indap, Karen S. King, Sven Bergmann, Matthew R. Nelson, Matthew Stephens, Carlos D. Bustamante (2008). Genes mirror geography within Europe. Nature. DOI: 10.1038/nature07331. Used with permission.
Motivation Genes Mirror Geography within Europe
ℝ^3,000,000,000: genomic positions
ℝ^50,818,468: chr22
ℝ^493,782: biallelic SNP variants with complete data in our sample
ℝ^1838: common variants in our sample
ℝ^2: dimension reduced
Data: 1000 Genomes Project, publicly available on AWS
Feature Space
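The two filtering steps that shrink the feature space (complete data, then common variants) might look like the following plain-Python sketch. The slides do not say how "common" was defined, so the `min_alt_freq` threshold, the function name `filter_variants`, and the use of `None` for a missing genotype are all assumptions made for illustration.

```python
def filter_variants(genotypes, min_alt_freq=0.05):
    """Return the ids of variants with complete data and a common alternate allele.

    genotypes maps a variant id to a list of per-sample alt allele counts
    (0, 1 or 2), with None marking a missing genotype (an assumed convention).
    """
    kept = []
    for variant_id, counts in genotypes.items():
        # Complete-data filter: drop variants with no samples or any missing call.
        if not counts or any(c is None for c in counts):
            continue
        # Each diploid sample contributes two alleles to the denominator.
        alt_freq = sum(counts) / (2 * len(counts))
        # Common-variant filter, using a hypothetical frequency threshold.
        if alt_freq >= min_alt_freq:
            kept.append(variant_id)
    return kept


toy = {
    "var_a": [0, 1, 2, 1],     # complete, alt allele frequency 0.5
    "var_b": [0, None, 1, 0],  # one missing genotype
    "var_c": [0, 0, 0, 0],     # complete, but the alt allele never appears
}
print(filter_variants(toy))  # ['var_a']
```

Each surviving variant becomes one dimension of the feature space, which is how the filters bring chr22 down from 493,782 to 1838 dimensions before PCA.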
Representation
vcf file
Filter - Complete data
Filter - Common Variants
create RDD[Row] with sample ID and ordered array of alt allele counts
Filter - Populations
Case Class of Sample Variants
parquet file of genotype objects
ML VectorAssembler
Spark ML
create DataFrame
df to rdd
create schema
Spark/ADAM
ML fit PCA (model)
ML transform (projection)
write to parquet on hdfs
SparkR
read parquet from hdfs to local data.frame
plot PC1 & PC2
workflow
ML VectorSlicer
convert vector slices to strings
rdd to df
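To make the "fit PCA" and "transform (projection)" steps concrete, here is a small plain-Python sketch of the underlying computation on a toy matrix of alt allele counts. It is not the Spark ML API and not the code from the talk: it swaps in power iteration with deflation on the covariance matrix, and the sample ids, data, and function names are hypothetical.

```python
import math


def center_columns(rows):
    """Subtract each column's mean so every variant has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix (variants x variants) of centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_eigenpair(matrix, iterations=200):
    """Power iteration: approximate the largest eigenvalue and its unit eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d  # fixed start; assumed not orthogonal to the answer
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * matrix[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue, v


def pca(rows, k=2):
    """Fit k principal components and project the centered rows onto them."""
    centered = center_columns(rows)
    cov = covariance(centered)
    d = len(cov)
    components = []
    for _ in range(k):
        eigenvalue, v = leading_eigenpair(cov)
        components.append(v)
        # Deflate: remove the component just found so the next one can emerge.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(x * c for x, c in zip(row, comp)) for comp in components]
              for row in centered]
    return components, scores


# Hypothetical samples: one row per person, one column per variant (0, 1 or 2).
sample_ids = ["s1", "s2", "s3", "s4"]
counts = [[0, 0, 1], [0, 1, 1], [2, 2, 0], [2, 1, 0]]
components, scores = pca(counts, k=2)
for sample_id, (pc1, pc2) in zip(sample_ids, scores):
    print(sample_id, round(pc1, 3), round(pc2, 3))
```

The two numbers printed per sample play the role of the PC1 and PC2 coordinates that the workflow writes out and plots. At the scale in the talk (1838 variants) a distributed implementation is the practical choice; this sketch only shows the arithmetic.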
Also:
Northwest Genomics Center
Amanda Casari
Frank Nothaft & Big Data Genomics http://bdgenomics.org/
Neil Ferguson for his excellent blog post - http://bdgenomics.org/blog/2015/07/10/genomic-analysis-using-adam/
Thank You!
Call to Action: Talk with folks in the development community about PCA on Spark and SparkR