Amazon Web Services for Bioinformatics June 2, 2015.

Overview

• Cloud Service Providers
• Amazon Web Services Offerings
• Hands-on
– Setting up an AWS account
– Initiating a Cloud Server for Galaxy
– Running Analysis on Galaxy
• Break
• Cloud Use Case: 1000 Genomes Project
– Accessing and analyzing 1000 Genomes data on AWS
– Terminate AWS cluster
• AWS usage costs and terminating services
• Break
• Cloud Use Case: Million Veteran Program

Introductions and Workshop Considerations

• Introduction
– What’s your name?
– Where are you from?
– What do you do?
– Tell us something interesting about yourself!

• Workshop Considerations
– Content only requires basic computing skills, so don’t get discouraged if you don’t understand everything
– Follow along with your computer
– Help thy neighbor
– Ask questions
– Engage and enjoy

Cloud Service Providers (CSP)

• Amazon Web Services (AWS)

• Verizon Terremark

• Microsoft Azure

• Google

• IBM

• HP

• Apple

• CenturyLink

Amazon Web Services (AWS) Offerings

• EC2 – Elastic Compute Cloud

• S3 – Simple Storage Service

• EMR – Elastic MapReduce

• IAM – Identity and Access Management

• RDS – Relational Database Service

• Glacier – Archival Storage

• AWS Regions and Availability Zones – data transfer between them incurs fees

• Free Usage Tier

Getting Started: Setting up an AWS Account

This document contains Booz Allen Hamilton Inc. proprietary and confidential business information.

• Access Amazon Web Services

https://352634094794.signin.aws.amazon.com/console

• Logging in

User Name: user[number] (e.g. user1, user2, user37)

Password: hpcc

[Note: “I have an MFA token” should be left unchecked]

Getting Started: Setup an EC2 Instance

• What’s an AMI? Amazon Machine Image
• Two ways to launch an EC2 server instance
o AWS console
o AMI
» Amazon Marketplace
» Public URL

• Launch through public Galaxy AMI: https://usegalaxy.org/cloudlaunch

• Locate Key ID and Secret Key
– AWS Console > Identity and Access Management > Rotate your access keys
– Click “Manage your access keys”
– Scroll down and click “Manage access keys”
– Click “Create access key”
– Click “Show user security credentials”
– Click “Download Credentials”

Getting Started: Launch EC2 Instance

1. Bring back the “Launch a Galaxy Cloud Instance” screen
2. Copy and paste Key ID into “Enter Key ID” field
3. Copy and paste Secret Key into “Enter Secret Key” field
4. Choose a name for your Galaxy server
5. Choose a simple password (e.g. hpcc)
6. Key Pair: “Create New”
7. Instance Type: “Compute optimized Large (2 vCPU/4GB RAM)”
8. Click Submit
[Takes a few minutes to launch an instance – check the console]
9. Click the instance URL to access the CloudMan interface
10. Username “admin”, password “hpcc”, select “Transient Storage”
[Takes a few minutes to launch Galaxy]
11. Click “Access Galaxy”
[Galaxy can also be accessed by typing the URL from the console]

The 1000 Genomes Project

• Goal is to study genetic variants with at least 1% frequency in populations

• Phase I started in 2010 with 4 populations and 1000 genomes

• Phase II and III completed in 2013 with 2500 genomes from 25 populations
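The 1% frequency threshold mentioned above can be illustrated with a minimal sketch. It assumes genotypes are given as VCF-style GT strings (for example "0|1"), which is an assumption for illustration and not a description of the project's own pipeline; the function names are hypothetical.

```python
from collections import Counter


def alt_allele_frequency(genotypes):
    """Return the fraction of non-reference alleles among called alleles.

    Each genotype is a VCF-style GT string such as "0|1" or "1/1".
    Missing alleles (".") are skipped.
    """
    counts = Counter()
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # uncalled allele, not counted
            counts["ref" if allele == "0" else "alt"] += 1
    called = counts["ref"] + counts["alt"]
    return counts["alt"] / called if called else 0.0


def is_common(genotypes, threshold=0.01):
    """True when the alternate allele frequency reaches the threshold (1% by default)."""
    return alt_allele_frequency(genotypes) >= threshold
```

For example, 2 alternate alleles among 100 diploid samples (200 alleles) sits exactly at the 1% threshold.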

1000 Genomes Project Data, Analysis, and Results

• Data is stored by EBI, NCBI, and AWS

• 2500 whole genomes sequenced at 28x

• Genome Wide Association Studies

• Focus on common and rare genetic conditions, population genetics, evolution and ancestry

Create an S3 Bucket and Add Data

Create S3 bucket
• Return to the AWS console and click “S3”
• Create new S3 data bucket – Name: “user[x]data”
[Note: bucket name should be unique, lowercase, and alphanumeric]
• Create new folder in your bucket – Name: “user[x]folder”
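The naming note above can be checked locally before creating the bucket. This is a minimal sketch of the workshop convention only (lowercase and alphanumeric), combined with the 3 to 63 character length limit that S3 applies to bucket names. S3 itself also allows hyphens and dots, and global uniqueness can only be confirmed by S3. The helper names are hypothetical.

```python
import re

# Workshop convention: lowercase letters and digits only, 3 to 63 characters.
_WORKSHOP_BUCKET = re.compile(r"[a-z0-9]{3,63}")


def is_workshop_bucket_name(name):
    """Check a candidate bucket name against the workshop convention."""
    return _WORKSHOP_BUCKET.fullmatch(name) is not None


def workshop_names(user_number):
    """Return the (bucket, folder) names used in this exercise for one user."""
    return f"user{user_number}data", f"user{user_number}folder"
```

For user 7 this yields the bucket name `user7data` and the folder name `user7folder`.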

Find 1000 Genomes Data
• Go to 1000 Genomes Data Browser: http://browser.1000genomes.org/tools.html
• Select “Data Slicer > Online Version”
• Select genome location on Chr 7 “7:50000-100000”
• Select VCF Filters “By Population”
• Select CLM and download file to your local computer
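The region string entered in the Data Slicer follows a `chromosome:start-end` pattern. A minimal sketch of how such a string breaks down, assuming 1-based inclusive coordinates (the usual VCF convention, assumed here and not stated on the slide); the function names are hypothetical.

```python
def parse_region(region):
    """Split a region such as "7:50000-100000" into (chromosome, start, end)."""
    chrom, _, span = region.partition(":")
    start_text, _, end_text = span.partition("-")
    start, end = int(start_text), int(end_text)  # raises ValueError if malformed
    if not chrom or start < 1 or end < start:
        raise ValueError(f"invalid region: {region!r}")
    return chrom, start, end


def region_length(region):
    """Number of positions covered, assuming both ends are inclusive."""
    _, start, end = parse_region(region)
    return end - start + 1
```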

Upload to S3 bucket
• Upload a file in your S3 bucket – Rename it to: “CLM.vcf.gz”
• Change permissions of your file to “everyone”

Command Line Access to EC2 Server and S3 Bucket

Command line access to your server
• Windows – Download “PuTTY” or any other SSH client
• Mac – Open “Terminal”
• Go to CloudMan console and copy server address for command line access

ssh -i cloudman_key_pair.pem ubuntu@ec2-52-5-185-118.compute-1.amazonaws.com

Access your S3 Data Bucket
• Download your file from your S3 bucket

wget http://user[x]data.S3.amazonaws.com/user[x]folder/CLM.vcf.gz

• Unzip and view your VCF file

gunzip CLM.vcf.gz
head CLM.vcf
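The unzip-and-view step can also be done without decompressing the file on disk. A minimal sketch using only the Python standard library; the helper names are hypothetical, and the split into "##" meta lines, the "#CHROM" header line, and data records follows the general VCF layout.

```python
import gzip
from itertools import islice


def head_gz(path, n=10):
    """Return the first n lines of a gzip-compressed text file."""
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]


def split_vcf_head(lines):
    """Separate VCF meta lines (##), the column header (#), and data records."""
    meta = [line for line in lines if line.startswith("##")]
    header = [line for line in lines
              if line.startswith("#") and not line.startswith("##")]
    records = [line for line in lines if not line.startswith("#")]
    return meta, header, records
```

Calling `head_gz("CLM.vcf.gz")` on the downloaded file gives roughly the same view as `gunzip` followed by `head`, while leaving the compressed file in place.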

Access 1000 Genomes Data [Public Bucket on S3]
• Download 1000 Genomes XML file

wget http://S3.amazonaws.com/1000genomes

• Download populations file

wget http://1000genomes.S3.amazonaws.com/20131219.populations.tsv

• View 1000 Genomes populations

head 20131219.populations.tsv
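The `wget` addresses on this slide follow two S3 URL patterns: the bucket name as a hostname prefix for objects, and the bucket name as a path for the bucket listing. A minimal sketch of how they are assembled; the function names are hypothetical, the lowercase `s3` hostname is equivalent to the `S3` spelling above because hostnames are case-insensitive, and the download helper only works for objects that are publicly readable.

```python
from urllib.parse import quote
from urllib.request import urlretrieve


def s3_object_url(bucket, key):
    """Build the URL of one object, with the bucket as a hostname prefix."""
    return f"http://{bucket}.s3.amazonaws.com/{quote(key)}"


def s3_bucket_listing_url(bucket):
    """Build the URL that returns the bucket listing (an XML document)."""
    return f"http://s3.amazonaws.com/{bucket}"


def download(url, destination):
    """Fetch a public object to a local file (requires network access)."""
    urlretrieve(url, destination)
    return destination
```

For example, `s3_object_url("1000genomes", "20131219.populations.tsv")` reproduces the populations file address used above.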

AWS Usage Costs and terminating services

• Usage costs are calculated and billed monthly
• Usage is determined by the hour during which an instance starts
• E.g. EC2 instance running from 2:55 PM - 4:05 PM will be billed for 3 hours
• Be sure to stop or terminate instances when not in use
– EC2: Server Instance, Storage Volume
– S3
• Terminating our instance
– Go to CloudMan webpage and click “Terminate Cluster”
– Terminate EC2 storage volumes
– Delete S3 buckets and folders
– Check console to ensure all services have been stopped
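The 2:55 PM to 4:05 PM example on this slide can be reproduced with a short sketch. It models billing exactly as the slide describes it, counting every clock hour the instance touches as one billed hour; this is an assumption taken from the slide, and AWS's actual billing rules may differ from this model and have changed over time. The function names and the hourly rate are hypothetical.

```python
from datetime import datetime, timedelta


def billed_clock_hours(start, end):
    """Count the clock hours touched between start and end (slide's model)."""
    if end <= start:
        return 0
    # Step through clock-hour boundaries, beginning with the hour of launch.
    cursor = start.replace(minute=0, second=0, microsecond=0)
    hours = 0
    while cursor < end:
        hours += 1
        cursor += timedelta(hours=1)
    return hours


def estimated_cost(start, end, hourly_rate):
    """Billed hours multiplied by a hypothetical hourly rate."""
    return billed_clock_hours(start, end) * hourly_rate


launch = datetime(2015, 6, 2, 14, 55)   # 2:55 PM
stop = datetime(2015, 6, 2, 16, 5)      # 4:05 PM
hours = billed_clock_hours(launch, stop)  # 3, matching the slide's example
```

The instance runs for only 70 minutes, but it touches the 2 PM, 3 PM, and 4 PM clock hours, which is why terminating promptly matters.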

The Million Veteran Program (MVP)

• National voluntary research program funded by the Department of Veterans Affairs Office of Research & Development

• Goal is to study how genes and environmental factors affect veterans’ health

• Building one of the world's largest medical databases containing biological samples and health information from one million veterans

• Blood samples for genomic profiling
– Single Nucleotide Polymorphism (SNP) Array Analysis
– Next Generation Sequencing (NGS) Analysis

• Personal health surveys and military deployment history

• Electronic health records

• Genomic Informatics for Integrative Science (GenISIS) comprises hardware, platform, and tools to manage, store, and analyze MVP data

• Current recruitment has passed 400K samples with a goal of 1 Million samples in 5 years

• Total Data Volume expected to exceed 10 Petabytes in 5 years
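The two targets above imply a rough average data volume per sample. A back-of-the-envelope sketch, assuming decimal units (1 PB = 10^15 bytes, 1 GB = 10^9 bytes) and an even spread across samples, neither of which is stated in the slides:

```python
PB = 10**15  # bytes, decimal petabyte (assumption)
GB = 10**9   # bytes, decimal gigabyte (assumption)


def average_bytes_per_sample(total_bytes, samples):
    """Average storage per sample if the volume were spread evenly."""
    return total_bytes / samples


# 10 PB across 1 million samples works out to about 10 GB per sample.
avg_gb = average_bytes_per_sample(10 * PB, 1_000_000) / GB

# Samples still to recruit, given roughly 400K collected so far.
remaining = 1_000_000 - 400_000
```

Since the total is expected to exceed 10 PB, 10 GB per sample is a lower bound on the average under these assumptions.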

Overview

MVP Data Warehouse

• Metadata extracted from vendor generated genomic data using SNP Arrays Genotyping, Whole Genome Sequencing, and Whole Exome Sequencing will be cataloged in a Metadata Database

• Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system

• Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations to integrate genomic, survey, and clinical data

• Query Mart will enable researchers to build cohorts and subset data using clinical and genomic information and export to the Data Mart for further analysis

Cloud Broker

• Cloud Portal manages access control for different types of data and users

• Cloud Engine co-locates data with analytical tools

• Intelligent Orchestration Tool maps data and processes to storage and compute clusters to efficiently manage resources

• Geographically distributed computational resources pooled through a virtual private cloud

Data Lake – Key Value Data Store

[Figure: key-value pairs linking genomic, clinical, and survey records, grouped under Access Control tiers (Tier 1, Tier 2, Tier 3). Labels recoverable from the diagram:]

• rs4362914
• TCF7L2
• Sample: SHIP000675221
• Patient: PT-00589A
• Condition: Diabetes Type II
• Genome Loc: Chr7:4344859978
• Survey: S-2014-06-18-A3288
• Deployment: Vietnam War
• Genotype
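The key-value idea behind this slide can be sketched with a plain in-memory store. The identifiers come from the diagram, but which values the original diagram actually linked, and which access tier each pair belongs to, cannot be recovered from the text, so the pairing below is an assumption for illustration only. The function names are hypothetical and this is not the GenISIS implementation.

```python
from collections import deque

# Pairs read off the slide; the exact linkage is an assumption.
LINKS = [
    ("Patient:PT-00589A", "Sample:SHIP000675221"),
    ("Patient:PT-00589A", "Condition:Diabetes Type II"),
    ("Sample:SHIP000675221", "rs4362914"),
    ("rs4362914", "TCF7L2"),
    ("rs4362914", "Genome Loc:Chr7:4344859978"),
    ("Sample:SHIP000675221", "Survey:S-2014-06-18-A3288"),
    ("Survey:S-2014-06-18-A3288", "Deployment:Vietnam War"),
]


def build_store(links):
    """Index every pair in both directions so either side can act as the key."""
    store = {}
    for key, value in links:
        store.setdefault(key, set()).add(value)
        store.setdefault(value, set()).add(key)
    return store


def linked_records(store, start):
    """Follow key-value links breadth-first and return everything reachable."""
    seen = {start}
    queue = deque([start])
    while queue:
        for neighbor in store.get(queue.popleft(), ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen - {start}
```

Starting from a patient key, `linked_records(build_store(LINKS), "Patient:PT-00589A")` gathers the sample, condition, variant, and survey entries that chain off it, which is the kind of integration the diagram appears to depict.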

Challenges and Lessons Learned

• Petabyte-scale genomics data poses storage, transfer, and processing challenges
– Cloud computing offers optimal solutions for data storage and analytics
– Next generation algorithms with built-in scalability features (e.g. Apache Hadoop/MapReduce)
– Co-locating data and analytical tools to reduce data replication and transfer bottlenecks

• Genomic data is PHI and should be protected using Data-in-Motion and Data-at-Rest best practices
– Encryption and decryption of genomic datasets constitute a significant fraction of data transfer and analysis time – YMMV
– Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks

• Data integration and metadata annotation are critical in deriving knowledge from data
– Lack of unified standard formats in genomics necessitates substantial effort in highly specialized analytical pipelines
– Data integration can be powered by annotation using multiple ontologies
– Data annotation upon ingest is crucial in a rapidly changing genomic sequencing landscape
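The MapReduce pattern named above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce steps applied to VCF lines, not Hadoop itself; the function names are hypothetical and the input is assumed to be tab-separated VCF text.

```python
from collections import defaultdict


def map_phase(vcf_line):
    """Emit (chromosome, 1) for each data record; skip headers and blanks."""
    if vcf_line.startswith("#") or not vcf_line.strip():
        return []
    chromosome = vcf_line.split("\t", 1)[0]
    return [(chromosome, 1)]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)


def count_variants_per_chromosome(lines):
    """Run map, shuffle, and reduce in sequence on an iterable of VCF lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

Because each line is mapped independently and each key is reduced independently, the same logic can be spread across many machines, which is the scalability property the slide refers to.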

Questions
