Post on 22-Dec-2015
Amazon Web Services for Bioinformatics
June 2, 2015
Overview
• Cloud Service Providers
• Amazon Web Services Offerings
• Hands-on
  – Setting up an AWS account
  – Initiating a Cloud Server for Galaxy
  – Running Analysis on Galaxy
• Break
• Cloud Use Case: 1000 Genomes Project
  – Accessing and analyzing 1000 Genomes data on AWS
  – Terminate AWS cluster
• AWS usage costs and terminating services
• Break
• Cloud Use Case: Million Veterans Program
Introductions and Workshop Considerations
• Introduction
  – What’s your name?
  – Where are you from?
  – What do you do?
  – Tell us something interesting about yourself!
• Workshop Considerations
  – Content only requires basic computing skills, so don’t get discouraged if you don’t understand something
  – Follow along with your computer
  – Help thy neighbor
  – Ask questions
  – Engage and enjoy
Cloud Service Providers (CSP)
• Amazon Web Services (AWS)
• Verizon Terremark
• Microsoft Azure
• IBM
• HP
• Apple
• CenturyLink
Amazon Web Services (AWS) Offerings
• EC2 – Elastic Compute Cloud (virtual servers)
• S3 – Simple Storage Service (object storage)
• EMR – Elastic MapReduce
• IAM – Identity and Access Management
• RDS – Relational Database Service
• Glacier – Archival Storage
• AWS Zones – Fees apply to data transfer between zones
• Free Usage Tier
Getting Started: Setting up an AWS Account
This document contains Booz Allen Hamilton Inc. proprietary and confidential business information.
• Access Amazon Web Services
https://352634094794.signin.aws.amazon.com/console
• Logging in
User Name: user[user number] (e.g. user1, user2, user37)
Password: hpcc
[Note: “I have an MFA token” should be left unchecked]
Getting Started: Setup an EC2 Instance
• What’s an AMI? Amazon Machine Image
• Two ways to launch an EC2 server instance
  o AWS console
  o AMI
    » Amazon Marketplace
    » Public URL
• Launch through public Galaxy AMI: https://usegalaxy.org/cloudlaunch
• Locate Key ID and Secret Key
  – AWS Console > Identity and Access Management > Rotate your access keys
  – Click “Manage your access keys”
  – Scroll down and click “Manage access keys”
  – Click “Create access key”
  – Click “Show user security credentials”
  – Click “Download Credentials”
Getting Started: Launch EC2 Instance
1. Bring back the “Launch a Galaxy Cloud Instance” screen
2. Copy and paste the Key ID into the “Enter Key ID” field
3. Copy and paste the Secret Key into the “Enter Secret Key” field
4. Choose a name for your Galaxy server
5. Choose a simple password (e.g. hpcc)
6. Key Pair: “Create New”
7. Instance Type: “Compute optimized Large (2 vCPU/4GB RAM)”
8. Click Submit
   [Takes a few minutes to launch an instance – check the console]
9. Click the instance URL to access the CloudMan interface
10. Username “admin”, password “hpcc”, select “Transient Storage”
    [Takes a few minutes to launch Galaxy]
11. Click “Access Galaxy”
    [Galaxy can also be accessed by typing the URL from the console]
The 1000 Genomes Project
• Goal is to study genetic variants with at least 1% frequency in populations
• Phase I started in 2010 with 4 populations and 1000 genomes
• Phases II and III were completed in 2013 with 2500 genomes from 25 populations
1000 Genomes Project Data, Analysis, and Results
• Data is stored by EBI, NCBI, and AWS
• 2500 whole genomes sequenced at 28x
• Genome Wide Association Studies
• Focus on common and rare genetic conditions, population genetics, evolution and ancestry
Create an S3 Bucket and Add Data
Create S3 bucket
• Return to the AWS console and click “S3”
• Create new S3 data bucket – Name: “user[x]data”
  [Note: bucket name should be unique, lowercase, and alphanumeric]
• Create new folder in your bucket – Name: “user[x]folder”
Find 1000 Genomes Data
• Go to 1000 Genomes Data Browser: http://browser.1000genomes.org/tools.html
• Select “Data Slicer > Online Version”
• Select genome location on Chr 7: “7:50000-100000”
• Select VCF Filters “By Population”
• Select CLM and download file to your local computer
Upload to S3 bucket
• Upload the file to your S3 bucket – Rename it to: “CLM.vcf.gz”
• Change permissions of your file to “everyone”
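The naming note on this slide (lowercase and alphanumeric) can be checked before a bucket is created, and the public address of an uploaded file follows directly from the bucket and folder names. The following is a minimal Python sketch, assuming only the simplified rule stated here rather than the full S3 naming rules. The helper names `is_simple_bucket_name` and `public_object_url` are hypothetical, and uniqueness across AWS cannot be checked locally.

```python
import re

# Rule from the slide: lowercase letters and digits only.
# Assumption: the 3 to 63 character length limit that S3 applies to bucket names.
_SIMPLE_NAME = re.compile(r"[a-z0-9]{3,63}")


def is_simple_bucket_name(name: str) -> bool:
    """Return True if the name is lowercase alphanumeric and 3-63 characters long."""
    return _SIMPLE_NAME.fullmatch(name) is not None


def public_object_url(bucket: str, key: str) -> str:
    """Build the bucket-as-hostname URL for a publicly readable object."""
    if not is_simple_bucket_name(bucket):
        raise ValueError(f"unsuitable bucket name: {bucket!r}")
    return f"http://{bucket}.s3.amazonaws.com/{key}"


if __name__ == "__main__":
    # "user7data" stands in for the workshop pattern user[x]data.
    print(is_simple_bucket_name("user7data"))
    print(is_simple_bucket_name("User7_Data"))
    print(public_object_url("user7data", "user7folder/CLM.vcf.gz"))
```

The URL only resolves once the file's permissions have been opened to “everyone” as described in the upload step.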
Command Line Access to EC2 Server and S3 Bucket
Command line access to your server
• Windows – Download “PuTTY” or any other SSH client
• Mac – Open “Terminal”
• Go to CloudMan console and copy server address for command line access
    ssh -i cloudman_key_pair.pem ubuntu@ec2-52-5-185-118.compute-1.amazonaws.com
Access your S3 Data Bucket
• Download the file from your S3 bucket
    wget http://user[x]data.S3.amazonaws.com/user[x]folder/CLM.vcf.gz
• Unzip and view your VCF file
    gunzip CLM.vcf.gz
    head CLM.vcf
Access 1000 Genomes Data [Public Bucket on S3]
• Download 1000 Genomes XML file
    wget http://S3.amazonaws.com/1000genomes
• Download populations file
    wget http://1000genomes.S3.amazonaws.com/20131219.populations.tsv
• View 1000 Genomes populations
    head 20131219.populations.tsv
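The “unzip and view” step on this slide shows the top of the VCF file, which usually begins with many `##` meta-information lines before any variant appears. As an alternative, here is a small Python sketch that reads the file directly (compressed or not), skips the meta lines, and returns the column names plus the first few records. The function name `peek_vcf` is hypothetical, and the sketch assumes a standard tab-separated VCF with a `#CHROM` header line.

```python
import gzip
from typing import List, Tuple


def peek_vcf(path: str, n_records: int = 5) -> Tuple[List[str], List[List[str]]]:
    """Return the column names and the first n_records data rows of a VCF file.

    Lines starting with '##' are meta-information and are skipped. The line
    starting with a single '#' holds the tab-separated column names.
    """
    # gzip.open handles .gz files; plain open handles an already unzipped file.
    opener = gzip.open if path.endswith(".gz") else open
    columns: List[str] = []
    records: List[List[str]] = []
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##"):
                continue
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#"):
                # Drop the leading '#' from the first column name (#CHROM).
                columns = [fields[0].lstrip("#")] + fields[1:]
            else:
                if len(records) >= n_records:
                    break
                records.append(fields)
    return columns, records
```

For example, `columns, records = peek_vcf("CLM.vcf.gz")` would work on the downloaded file without running `gunzip` first.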
AWS Usage Costs and terminating services
• Usage costs are calculated and billed monthly
• Usage is counted by the clock hour, beginning with the hour during which an instance starts
  – E.g. an EC2 instance running from 2:55 PM to 4:05 PM will be billed for 3 hours (the 2 PM, 3 PM, and 4 PM hours)
• Be sure to stop or terminate instances when not in use
  – EC2
    o Server Instance
    o Storage Volume
  – S3
• Terminating our instance
  – Go to CloudMan webpage and click “Terminate Cluster”
  – Terminate EC2 storage volumes
  – Delete S3 buckets and folders
  – Check console to ensure all services have been stopped
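The billing example on this slide (2:55 PM to 4:05 PM billed as 3 hours) can be reproduced by counting every clock hour the instance touches. The sketch below is a minimal Python illustration of that rule as the slide states it, not a statement of how any particular AWS price plan is metered. The function name `billed_clock_hours` is hypothetical.

```python
from datetime import datetime, timedelta


def billed_clock_hours(start: datetime, stop: datetime) -> int:
    """Count the distinct clock hours touched by the interval from start to stop."""
    if stop <= start:
        return 0
    first_hour = start.replace(minute=0, second=0, microsecond=0)
    # Step back one microsecond so a stop exactly on the hour
    # does not count the hour that is just beginning.
    last_instant = stop - timedelta(microseconds=1)
    last_hour = last_instant.replace(minute=0, second=0, microsecond=0)
    return (last_hour - first_hour) // timedelta(hours=1) + 1


if __name__ == "__main__":
    # The slide's example: 2:55 PM to 4:05 PM on the workshop date.
    print(billed_clock_hours(datetime(2015, 6, 2, 14, 55), datetime(2015, 6, 2, 16, 5)))
```

Under this rule, starting an instance a few minutes before the top of the hour costs a full extra hour, which is one reason to terminate services promptly.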
The Million Veterans Program (MVP)
• National voluntary research program funded by the Department of Veterans Affairs Office of Research & Development
• Goal is to study how genes and environmental factors affect veterans’ health
• Building one of the world's largest medical databases containing biological samples and health information from one million veterans
  • Blood samples for genomic profiling
    – Single Nucleotide Polymorphism (SNP) Array Analysis
    – Next Generation Sequencing (NGS) Analysis
  • Personal health surveys and military deployment history
  • Electronic health records
• Genomic Informatics for Integrative Science (GenISIS) comprises hardware, platform, and tools to manage, store, and analyze MVP data
• Current recruitment has passed 400K samples with a goal of 1 Million samples in 5 years
• Total Data Volume expected to exceed 10 Petabytes in 5 years
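As a rough sense of scale, the two targets above (1 million samples, more than 10 petabytes) imply an average data footprint per sample. The Python sketch below works through that arithmetic. It assumes decimal (SI) units, which the slide does not specify, and the function name `average_gb_per_sample` is hypothetical.

```python
PETABYTE = 10**15  # bytes; decimal (SI) petabyte is an assumption
GIGABYTE = 10**9   # bytes; decimal (SI) gigabyte is an assumption


def average_gb_per_sample(total_petabytes: float, samples: int) -> float:
    """Average storage per sample in gigabytes for a given total volume."""
    return total_petabytes * PETABYTE / samples / GIGABYTE


if __name__ == "__main__":
    # 10 PB spread over 1,000,000 samples.
    print(average_gb_per_sample(10, 1_000_000))
```

Since the total is expected to exceed 10 petabytes, the result is a lower bound on the average of about 10 GB per sample.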
Overview
MVP Data Warehouse
• Metadata extracted from vendor-generated genomic data (SNP array genotyping, whole genome sequencing, and whole exome sequencing) will be cataloged in a Metadata Database
• Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system
• Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations to integrate genomic, survey, and clinical data
• Query Mart will enable researchers to build cohorts and subset data using clinical and genomic information and export to the Data Mart for further analysis
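Cohort building of the kind described for the Query Mart amounts to filtering de-identified records on a combination of clinical and genomic fields. The Python sketch below illustrates the idea only. The field names, the `build_cohort` helper, and the three example rows are all hypothetical and made up for illustration; they are not MVP data or the actual Query Mart interface.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]


def build_cohort(records: Iterable[Record], **criteria: str) -> List[Record]:
    """Return the records whose fields equal every keyword criterion."""
    return [
        record
        for record in records
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Hypothetical, made-up rows for illustration only.
EXAMPLE_ROWS: List[Record] = [
    {"participant": "P1", "condition": "Diabetes Type II", "gene": "TCF7L2", "genotype": "T"},
    {"participant": "P2", "condition": "Diabetes Type II", "gene": "TCF7L2", "genotype": "C"},
    {"participant": "P3", "condition": "None recorded", "gene": "TCF7L2", "genotype": "T"},
]

if __name__ == "__main__":
    cohort = build_cohort(EXAMPLE_ROWS, condition="Diabetes Type II", genotype="T")
    print([row["participant"] for row in cohort])
```

A cohort selected this way would then be exported as a subset for further analysis, which is the role the slide assigns to the Data Mart.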
Cloud Broker
• Cloud Portal manages access control for different types of data and users
• Cloud Engine co-locates data with analytical tools
• Intelligent Orchestration Tool maps data and processes to storage and compute clusters to efficiently manage resources
• Geographically distributed computational resources pooled through a virtual private cloud
Data Lake – Key Value Data Store
[Figure: linked key-value pairs organized under Access Control in Tier 1, Tier 2, and Tier 3. Pairs shown, in the order extracted:]
• SNP rs4362914 | Gene TCF7L2
• Sample SHIP000675221 | Patient PT-00589A
• Patient PT-00589A | Condition Diabetes Type II
• SNP rs4362914 | Genome Loc Chr7:4344859978
• Sample SHIP000675221 | SNP rs4362914
• SNP rs4362914 | Condition Diabetes Type II
• Survey S-2014-06-18-A3288 | Deployment Vietnam War
• Genome Loc Chr7:4344859978 | Genotype T
• Sample SHIP000675221 | Survey S-2014-06-18-A3288
• Gene TCF7L2 | Condition Diabetes Type II
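The key-value pairs on this slide link typed identifiers (SNP, gene, sample, patient, and so on), with access controlled by tier. The Python sketch below shows one way such links could be stored and queried. It is an illustration under stated assumptions, not the Data Lake's actual design: the `LinkStore` class is hypothetical, and which pair belongs to which tier is an assumption, since that mapping is not recoverable from the slide.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Node = Tuple[str, str]  # (key type, value), e.g. ("SNP", "rs4362914")


class LinkStore:
    """Tiny in-memory store of two-way links between typed identifiers."""

    def __init__(self) -> None:
        self._links: Dict[Node, List[Tuple[Node, int]]] = defaultdict(list)

    def link(self, a: Node, b: Node, tier: int) -> None:
        """Store a two-way link that requires at least the given tier to read."""
        self._links[a].append((b, tier))
        self._links[b].append((a, tier))

    def neighbours(self, node: Node, clearance: int) -> Set[Node]:
        """Return the linked nodes visible to a user with the given clearance."""
        return {other for other, tier in self._links.get(node, []) if tier <= clearance}


if __name__ == "__main__":
    store = LinkStore()
    # Tier numbers below are assumptions chosen for the example.
    store.link(("SNP", "rs4362914"), ("Gene", "TCF7L2"), tier=1)
    store.link(("Sample", "SHIP000675221"), ("SNP", "rs4362914"), tier=2)
    store.link(("Sample", "SHIP000675221"), ("Patient", "PT-00589A"), tier=3)
    print(store.neighbours(("SNP", "rs4362914"), clearance=1))
```

Storing each link in both directions keeps lookups simple from either end, at the cost of writing every pair twice.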
Challenges and Lessons Learned
• Petabyte-scale genomics data poses storage, transfer, and processing challenges
  • Cloud computing offers optimal solutions for data storage and analytics
  • Next generation algorithms with built-in scalability features (e.g. Apache Hadoop/MapReduce)
  • Co-locating data and analytical tools to reduce data replication and transfer bottlenecks
• Genomic data is PHI and should be protected using Data-in-Motion and Data-at-Rest best practices
  • Encryption and decryption of genomic datasets constitute a significant fraction of data transfer and analysis time (your mileage may vary)
  • Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks
• Data integration and metadata annotation are critical in deriving knowledge from data
  • Lack of unified standard formats in genomics necessitates substantial effort in highly specialized analytical pipelines
  • Data integration can be powered by annotation using multiple ontologies
  • Data annotation upon ingest is crucial in a rapidly changing genomic sequencing landscape
Questions