SCALABLE PUBLIC HEALTH BIOINFORMATICS IN THE CLOUD
GenomeTrakr - 2018
www.fda.gov
Agenda
• Part 1 (Scalable Cloud Solution)
  – GalaxyTrakr Overview
  – Challenge / Solution / Benefits
  – Under the Hood
• Part 2 (Curated Tools for Public Health)
  – JIFSAN
  – Galaxy / UseGalaxy vs. GalaxyTrakr
  – GalaxyTrakr Target Audience
  – GalaxyTrakr
Scalable Cloud Computing
Jimmy Sanders • FDA-CFSAN
Scalable public health bioinformatics in the cloud
- part 1 -
GalaxyTrakr Overview
U.S. East 1 – Northern VA
https://galaxytrakr.org
External Users
Bioinformatics Tools:
• SeqSero
• SPAdes
• CFSAN SNP
• Etc.
Auto Scaling Compute Clusters to Meet Demand and Save Cost!
Data Upload Options:
• SFTP
• Web Upload
• Download via SRA/ENA accession from NCBI
GalaxyTrakr Overview
• Delivers a public instance of Galaxy (galaxyproject.org) in AWS with implemented tools and workflows for analysis of foodborne bacteria
• Enhances collaboration between FDA-CFSAN and the GenomeTrakr partners
• Leverages an elastically scalable compute cluster in AWS
• Provides basic WGS analysis tools to labs within FDA and external to FDA
Current Statistics
• Over 80,000 jobs processed in just over one year
• 30+ tools and workflows available
• 240 registered users
• 53 connected laboratories:
  • Public and State Health Labs
  • Academic Institutions
  • International Health Laboratories (Italy, Chile, Dublin, South Africa)
  • Other Federal Organizations (CDC, USDA)
  • Labs within FDA
Challenge
• FDA-CFSAN needed to meet the FSMA requirement for the FDA to collaborate with state and local food safety laboratories
• The network of laboratories routinely sequences more than 1000 isolates each month from food, environmental, and clinical sources
• FDA-CFSAN’s capacity for providing bioinformatics support to these laboratories has not kept pace with the large volume of data being generated
• Non-Federal partners can’t access FDA-hosted resources (tools, data, compute power)
• Required an environment that could scale on demand
Solution
• Galaxy (https://usegalaxy.org/)
• Amazon Web Services (AWS) (https://aws.amazon.com/)
• CloudFormation Cluster (https://cfncluster.readthedocs.io/en/latest/)
Amazon Web Services U.S. East Northern VA
AWS Availability Zone
• CloudFormation: “Automation”
• EC2: “Master Server”
• EFS: “Network File Share”
• EC2: “Compute Nodes”
• Auto Scaling
• CloudWatch: “Monitoring and Event Trigger”
Benefits
• Galaxy
  – Collaborative, open-source web-based platform for bioinformatics
  – Large community support
• Amazon Web Services
  – Ability to deploy a scalable solution and present to public users
  – Infrastructure services available to run a platform like Galaxy
  – Pay as you go
• CloudFormation Cluster
  – Cost savings through auto-scaling compute cluster to process jobs submitted from Galaxy (some days < 100 jobs, some days > 1000 jobs)
  – Easily modified to meet various bioinformatics computation requirements
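The cost argument for auto-scaling can be illustrated with a rough calculation. The sketch below compares a fleet permanently sized for the busiest day with one resized daily. The daily job counts and the per-node throughput are hypothetical values chosen only to echo the "some days < 100, some days > 1000" pattern; they are not GalaxyTrakr figures.

```python
import math

# Hypothetical numbers for illustration only; none come from the slides.
JOBS_PER_NODE_DAY = 100                  # assumed throughput of one compute node
DAILY_JOBS = [80, 1200, 300, 50, 950]    # assumed mix of quiet and busy days


def fixed_fleet_node_days(daily_jobs, jobs_per_node_day):
    """Node-days used when the fleet is permanently sized for the busiest day."""
    peak_nodes = math.ceil(max(daily_jobs) / jobs_per_node_day)
    return peak_nodes * len(daily_jobs)


def scaled_fleet_node_days(daily_jobs, jobs_per_node_day):
    """Node-days used when the fleet is resized each day to match that day's jobs."""
    return sum(math.ceil(jobs / jobs_per_node_day) for jobs in daily_jobs)


fixed = fixed_fleet_node_days(DAILY_JOBS, JOBS_PER_NODE_DAY)
scaled = scaled_fleet_node_days(DAILY_JOBS, JOBS_PER_NODE_DAY)
savings = 1 - scaled / fixed
print(f"fixed={fixed} node-days, scaled={scaled} node-days, savings={savings:.0%}")
```

Under these assumed inputs the scaled fleet uses fewer than half the node-days of the fixed fleet. The real saving depends on the actual job mix, instance pricing, and node start-up time.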
Under the Hood
Master Node
https://galaxytrakr.org
Publish Metric: Pending Tasks
Galaxy Submits Tasks to Cluster
CloudWatch
Auto Scaling Service
Monitors: Pending Tasks
Auto Scaling Compute Fleet
Deploy or Decommission Compute Node(s) Based on Pending Tasks
SQS
Elastic File System
User Executes Tool (e.g. SPAdes)
Job Completes, Data Available to User
User Retrieves Result
Nodes Shut Down
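The scaling decision in this flow (deploy or decommission compute nodes based on pending tasks) can be sketched in a few lines. This is a minimal illustration, not the CloudFormation Cluster implementation: the function names, the one-task-per-node default, and the fleet limits are all assumptions made for the example.

```python
import math


def desired_nodes(pending_tasks, tasks_per_node=1, min_nodes=0, max_nodes=10):
    """Fleet size needed to work off the pending-task queue, within limits.

    tasks_per_node, min_nodes and max_nodes are hypothetical settings.
    """
    if pending_tasks < 0:
        raise ValueError("pending_tasks must be non-negative")
    needed = math.ceil(pending_tasks / tasks_per_node)
    return max(min_nodes, min(max_nodes, needed))


def scaling_action(current_nodes, pending_tasks, **limits):
    """Describe the change that moves the fleet from its current to its desired size."""
    target = desired_nodes(pending_tasks, **limits)
    if target > current_nodes:
        return f"deploy {target - current_nodes} node(s)"
    if target < current_nodes:
        return f"decommission {current_nodes - target} node(s)"
    return "no change"


# A user runs a tool and six tasks are queued while two nodes are up.
print(scaling_action(current_nodes=2, pending_tasks=6, tasks_per_node=2))
# The queue has drained, so the idle nodes can be shut down.
print(scaling_action(current_nodes=3, pending_tasks=0))
```

With a minimum of zero nodes, an empty queue lets the whole compute fleet shut down, which matches the "Nodes Shut Down" step once the user's job has completed.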
Curated tools for public health
Justin Payne • FDA-CFSAN
Scalable public health bioinformatics in the cloud
- part 2 -
JIFSAN Bioinformatics Training
The Struggle is Real
JIFSAN Bioinformatics Training
…But the Results Were Less So
• JIFSAN bioinformatics training feedback:
  • “is this something you expect us to do?”
  • “I didn’t know my Mac had a command line”
  • “what should we buy?”
• Status quo might be easy to improve on.
Galaxy
Blankenberg et al. Curr. Protoc. Mol. Biol. 89:19.10.1-19.10.21. (2010)
Galaxy (workflows)
UseGalaxy vs GalaxyTrakr
GalaxyTrakr
Who’s the Intended User?
1. “My web browser? Google, I think”
2. “Is the internet down? I can’t get to the Facebook”
3. “Oh, there’s an Excel formula for that.”
4. “You mean you didn’t change your router’s default password?”
5. “Where’s Terminal on this thing?”
6. “There’s a funny easter egg in echo’s man page.”
7. “Arch really screams on my dev box since I re-compiled the kernel.”
GalaxyTrakr
Who do we target?
1. “My web browser? Google, I think”
2. “Is the internet down? I can’t get to the Facebook”
3. “Oh, there’s an Excel formula for that.”
4. “You mean you didn’t change your router’s default password?”
5. “Where’s Terminal on this thing?”
6. “There’s a funny easter egg in echo’s man page.”
7. “Arch really screams on my dev box since I re-compiled the kernel.”
A scurrilous falsehood
GalaxyTrakr
Curation and Layout Philosophy
• “One (or two) ways to do things”
• “Obvious path forward”
• “Dogfooding”
  – SNP-Pipeline
  – SeqSero
  – ECTyper
GalaxyTrakr
Expansive vs. Reductive Analyses
“Expansive” analysis
• Tissue Sequencing
• HG Alignment
• Function Prediction
• Structural variation
• Expression analysis
• Other stuff
VS
“Reductive” analysis
• Sequenced Isolate (×3)
• Seq Typing
• Low-res Clustering
• Population Summary
GalaxyTrakr
Challenges to Population Analysis
• Large-scale sample handling
• Tool support for complex collections