Xu Xing: EasyGenomics – Next Generation Bioinformatics on the Cloud
Uploaded by gigascience-bgi-hong-kong
Contact [email protected]
http://www.easygenomics.com
Next Generation Bioinformatics on the Cloud
Xing Xu, Ph.D.
Director of Cloud Computing Product
Topics for Today
Behind the cloud product
- BGI
- The team
The product: EasyGenomics
- Why are we building this product?
- What can this product do?
Future direction and open questions
BGI
The world's largest genome sequencing center
- Started with the Human Genome Project in 1999 with only a few sequencers.
- Now more than 150 sequencers, with 6 TB/day sequencing throughput.
MODEL          ABI 3730XL   Roche 454   ABI SOLiD 4   Solexa GA IIx   Illumina HiSeq 2000
INSTALLATION   16           1           27            6               135
BGI
The largest computing and storage center for genomics in China
- 20,000+ CPU cores
- 19 NVIDIA GPUs
- 220+ Tflops peak performance
- 17 PB data storage
- Storage and computation capability has increased 10,000-fold!
- Still increasing …
BGI
One of the world's leading research institutes in genomics
Since 2007:
- 253 papers in high-impact journals, including 47 in Nature and its sub-journals, 9 in Science, 2 in Cell, and 1 in NEJM, with 42 first and/or corresponding authors
- 369 patent applications
- 254 software authorships
BGI
BGI has the sequencing capacity, hardware resources and software proficiency to be one of the strongest end-to-end service providers in the world for NGS sequencing, data analysis and data interpretation.
Team for the Cloud Platform
Run like a software company
Managers are from leading technology companies, such as HP, Microsoft, and Lenovo.
Team members are young, energetic, and ambitious.
Fully supported by BGI in-house algorithm development teams.
Product
Development
Testing
Operation
BGI Support
Team for the Cloud Platform
Development Team
- Dev: Ming Jiang, Yongsheng Chen, Can Long, Jiasheng Wu, etc.
- Flex Lab: Yan Li, Shengchang Gu, etc.
- GPU Lab: Bingqiang Wang, etc.
- Pipeline: Liang Wang, etc.
Test & QA Team
- Xin Guan, Jingjuan Liu, etc.
PMO & IT Operation
- Wenjun Zeng, Litong Lai, Jing Tian, etc.
Product Team
- Xing Xu, Jing Guo, Fang Fang, etc.
Other BGI Teams
Topics for Today
Behind the cloud product
- BGI
- The team
The product: EasyGenomics
- Why are we building this product?
- What can this product do?
Future direction and open questions
Trend of Volume and Cost
Geographical side of the problem
Sequencing happens EVERYWHERE.
Images from omicsmaps.com
BGI
Difficulties of Analysis
Primary analysis: base calling
- Data throughput
- Data storage
Secondary analysis: mapping
- Computation intensive
- Data storage
Tertiary analysis: variant calling
- Complicated algorithms
- Computation intensive
Post-tertiary analysis: in-depth annotation
- Lack of knowledge
Problems and Solutions
Problems:
• Big genomic data
• Geographical distribution
• Algorithm integration
• Computational demand
Solutions:
• Cloud
• High Speed Data Exchange
• Pipelines
• Distributed Workloads
EasyGenomics™
EasyGenomics is a Software as a Service (SaaS) bioinformatics platform for research and applications.
Bioinformatics Workflows: algorithms, workflows, reports
Data Management: computational resources, database, data management
High Speed Connection: web portal, simple UI, high speed connection
Key Features
Bioinformatics Workflow
Four steps: Upload, Create a Sample, Perform Analyses, Download Results
Algorithms: Carefully chosen, tested and optimized
Workflows: Whole Genome Resequencing, Exome Resequencing, RNA-Seq, small RNA, ncRNA, and De novo Assembly
Homepage
Four task portals
Status of recent works
Warning and Logging
Navigation Tabs
Bioinformatics Workflow---Pipelines
Exome Resequencing
RNA-Seq Transcriptome
Bioinformatics Workflow---Comprehensive Reports
Data Management
“Sample”, “Analysis”, “Project”
Mimicking real research procedure
Automatic management of underlying data structure
Raw Data
Sample A, Sample B
Analysis I, Analysis II, Analysis X
Project I
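The Sample / Analysis / Project hierarchy on this slide can be sketched as a small data model. This is only an illustration and not EasyGenomics code: the class names, fields, file names, workflow labels and the `add_analysis` and `samples` helpers are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Sample:
    """A sample groups the raw data files that belong together."""
    name: str
    raw_files: List[str] = field(default_factory=list)


@dataclass
class Analysis:
    """An analysis applies one workflow, with its parameters, to one sample."""
    name: str
    workflow: str
    sample: Sample
    parameters: dict = field(default_factory=dict)


@dataclass
class Project:
    """A project collects related analyses."""
    name: str
    analyses: List[Analysis] = field(default_factory=list)

    def add_analysis(self, analysis: Analysis) -> None:
        self.analyses.append(analysis)

    def samples(self) -> List[str]:
        """Names of the distinct samples used in this project, in first-use order."""
        seen: List[str] = []
        for analysis in self.analyses:
            if analysis.sample.name not in seen:
                seen.append(analysis.sample.name)
        return seen


# Hypothetical file names and workflow labels, mirroring the diagram labels.
sample_a = Sample("Sample A", ["lane1.fq.gz", "lane2.fq.gz"])
sample_b = Sample("Sample B", ["lane3.fq.gz"])

project = Project("Project I")
project.add_analysis(Analysis("Analysis I", "exome-resequencing", sample_a))
project.add_analysis(Analysis("Analysis II", "exome-resequencing", sample_a, {"min_depth": 10}))
project.add_analysis(Analysis("Analysis X", "rna-seq", sample_b))

print(project.samples())
```

Keeping the sample separate from the analysis is what allows the same sample to be reused with different parameters, as in Analysis I and Analysis II here.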
Create a Sample
Add read groups
Sample Page
Individual report for each lane
Summarized report for all lanes
Data management---Security
Access
• Username/Password
• Biometric access
• HTTPS, Aspera fasp™
Multi-tenancy
• Trusted database connection
• ACL, Data encryption
Isolation
• Physical isolation
• Virtual isolation
Compliance
• ISO27000
High Speed Data Exchange
Aspera’s patented fasp™ high-speed file transfer technology
10 to 100 times faster than FTP
Transfer 24 GB in 30 Seconds
Demonstrated 10 Gbps ultra-high-speed data exchange with UC Davis and NCBI in June.
A 24 GB file was transferred from China to the US in 30 seconds (~8 Gbit/s).
Amount of Data that can be transferred in 24hr
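The relationship between link speed and daily transfer volume is simple arithmetic, sketched below. The function names and the sample rates are assumptions chosen for illustration, and decimal units are used throughout (1 GB = 8 Gbit, 1 TB = 1000 GB). With these units, 24 GB in 30 seconds corresponds to an average of 6.4 Gbit/s, so the ~8 Gbit/s figure quoted for the demonstration may refer to a peak rate rather than the average.

```python
def average_rate_gbit_per_s(size_gb: float, seconds: float) -> float:
    """Average rate in Gbit/s for transferring size_gb gigabytes in the given time."""
    return size_gb * 8 / seconds


def data_per_day_tb(rate_gbit_per_s: float) -> float:
    """Terabytes moved in 24 hours at a sustained rate given in Gbit/s."""
    seconds_per_day = 24 * 60 * 60
    return rate_gbit_per_s * seconds_per_day / 8 / 1000


print(f"24 GB in 30 s: {average_rate_gbit_per_s(24, 30):.1f} Gbit/s on average")

# Illustrative sustained rates, not measurements from the talk.
for rate in (0.1, 1.0, 10.0):
    print(f"{rate:>5} Gbit/s -> {data_per_day_tb(rate):.2f} TB per 24 h")
```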
Easy-to-Use UI
Reusability
- Reuse the same sample for different analyses (different parameters)
- Reuse all parameter settings for different analyses
Simple UI and interactive features
- As easy as online shopping
- Shortcuts for predefined settings, while remaining fully customizable for advanced users
- Handle batch analyses in one setting
Create an Analysis
Selected sample(s)
• One selected sample => Single Analysis
• Multiple selected samples => Batch Analyses
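The single-versus-batch rule can be sketched as a function that expands one set of settings into one analysis request per selected sample. This is a hypothetical illustration: the `create_analyses` function, the dictionary keys and the parameter names are invented here and do not describe the EasyGenomics interface.

```python
from typing import Dict, List


def create_analyses(samples: List[str], workflow: str, parameters: Dict[str, object]) -> List[dict]:
    """Return one analysis request per selected sample, all sharing the same settings."""
    if not samples:
        raise ValueError("select at least one sample")
    mode = "single" if len(samples) == 1 else "batch"
    return [
        # Copy the parameters so later edits to one request do not affect the others.
        {"sample": s, "workflow": workflow, "parameters": dict(parameters), "mode": mode}
        for s in samples
    ]


settings = {"aligner": "example-aligner", "min_quality": 20}  # placeholder values
single = create_analyses(["Sample A"], "exome-resequencing", settings)
batch = create_analyses(["Sample A", "Sample B", "Sample C"], "exome-resequencing", settings)
print(single[0]["mode"], len(batch), batch[0]["mode"])
```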
Selectable modules
Predefined Settings
Shortcut
Customizable
Project Table
Add/Remove Project
Operation shortcuts
Project list table
Filter and search box
Analysis Table
Sample Table
A typical use case
Topics for Today
Behind the cloud product
- BGI
- The team
The product: EasyGenomics
- Why are we building this product?
- What can this product do?
Future direction and open questions
Future directions
What is the market? Which direction to go?
- Cloud on public infrastructure vs cloud on private infrastructure
- SaaS vs PaaS
- Data analysis is only one step of the whole process.
- What will be the sustainable model for the cloud service?
Market Position
Cloud Service Providers
Annotation Providers
Sequencing Service Providers
Instrument Manufacturers
Personal Genetic Testing Providers
Illumina
Software Providers
NOW
Challenge and Solution
Provider: DNAnexus | BaseSpace (Illumina) | GenomeSpace | EasyGenomics | Ingenuity/NextBio
Cloud: Public | Public | Public | Private | Private
Reasoning: Great demand on space and computation resources | Security and privacy issues
Positioning: Infrastructure (PaaS) | App store | Platform for accessing available tools | SaaS solution | Information (they work with the results from NGS, not the raw reads)
Advantage: Funding; advances in the field; sequencing service; community of partners; strong connection to academia; sequencing service; development capability; experience
Public vs Private Cloud
Public Cloud
Pros:
− “Limitless” resources
− Share data with a wide range of people
− Offers a nice platform
Cons:
− Security and reliability
− Short-term cost saving vs long-term cost nightmare
Private Cloud
Pros:
− Flexibility
− Security and privacy control
− Long-term cost saving
Cons:
− Big initial investment
− Maintaining the infrastructure and software on the cloud
But the line between public and private cloud is blurring.
A sustainable model for cloud service?
Key components of cost
- Storage
- Computational resource
- Data transfer
- Software usage
App store or Cell phone plan
Long term cost vs Short term cost
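The four cost components and the long-term versus short-term trade-off can be made concrete with a small calculation. This is a sketch under loudly stated assumptions: every price, usage figure and investment amount below is an invented placeholder, and the `MonthlyUsage` class and both functions are hypothetical names, not part of any EasyGenomics pricing model.

```python
import math
from dataclasses import dataclass
from typing import Optional


@dataclass
class MonthlyUsage:
    storage_tb: float       # data kept on the platform
    cpu_core_hours: float   # computational resource consumed
    transfer_tb: float      # data moved in or out
    software_runs: int      # number of paid pipeline runs


# Placeholder prices, invented for this example only.
PRICES = {
    "storage_tb": 50.0,
    "cpu_core_hour": 0.0625,
    "transfer_tb": 20.0,
    "software_run": 2.0,
}


def monthly_pay_per_use(usage: MonthlyUsage, prices: dict = PRICES) -> float:
    """Sum the four cost components for one month of usage."""
    return (
        usage.storage_tb * prices["storage_tb"]
        + usage.cpu_core_hours * prices["cpu_core_hour"]
        + usage.transfer_tb * prices["transfer_tb"]
        + usage.software_runs * prices["software_run"]
    )


def break_even_months(upfront: float, monthly_own: float, monthly_rent: float) -> Optional[int]:
    """Smallest number of months after which owning costs no more than renting.

    Owning costs upfront + monthly_own per month; renting costs monthly_rent per month.
    Returns None when renting is never more expensive per month.
    """
    saving = monthly_rent - monthly_own
    if saving <= 0:
        return None
    return math.ceil(upfront / saving)


usage = MonthlyUsage(storage_tb=10, cpu_core_hours=20000, transfer_tb=5, software_runs=100)
rent = monthly_pay_per_use(usage)
print(rent, break_even_months(upfront=60000, monthly_own=550, monthly_rent=rent))
```

With these placeholder numbers, pay-per-use is cheaper at first and the upfront investment only pays off after a few years, which is the short-term versus long-term tension the slide points at.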
Data analysis is NOT ALL!
EPM
Project Management, Sample Center, Wet Lab Operation, Bioinformatics Data Analysis
EPM Management System
Budgeting, Tasking, Receipt/Storage, Handover, Sample QC, Sample prep, Workflow, Sequencing, Data analysis, Data QC
Sales, Billing
Web-based Interface
Management, Interfacing, Query, Statistics
Roadmap of EasyGenomics
EG1.1 (Jun 2012)
• New result reports
• Fully integrated data exchange interface
EG1.2 (Aug 2012)
• New read filtering step, speed up 20x
EG1.3 (Sep 2012)
• Data import from BGI sequencing service
EG1.5 (est. Dec 2012)
• QC indicator, QC module
• New sample report
• Transcriptome workflows
• Reference management
EG2.0 (est. Apr 2013)
• iRODS data management
• Data sharing, collaboration
• Users' own applications
• Comparison, filtering tools
• Visualization
www.EasyGenomics.com
Free Beta Trial is ongoing!
Analysis and Interpretation are the KEY
Enabling Technology
Best Practice Award for IT Infrastructure
Human Genome      SOAPdenovo     EasyGenomics™ (192 cores)
Genome Coverage   86%            86%
Assembly Time     70 h           55 h
No. of Servers    1              15
Memory Size       500 GB x 1     24 GB x 15
Mode              Centralized    Distributed
Hadoop-based Flexible Computing
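The comparison figures can be checked with a few lines of arithmetic. The numbers are taken from the table; the dictionary layout and the `total_memory_gb` helper are assumptions made for this sketch.

```python
# Figures from the comparison table; the key names are illustrative.
centralized = {"servers": 1, "memory_gb_per_server": 500, "assembly_hours": 70}
distributed = {"servers": 15, "memory_gb_per_server": 24, "assembly_hours": 55}


def total_memory_gb(config: dict) -> int:
    """Aggregate memory across all servers in a configuration."""
    return config["servers"] * config["memory_gb_per_server"]


speedup = centralized["assembly_hours"] / distributed["assembly_hours"]

print(f"Centralized memory: {total_memory_gb(centralized)} GB")
print(f"Distributed memory: {total_memory_gb(distributed)} GB")
print(f"Assembly time ratio: {speedup:.2f}x")
```

The distributed setup reaches the same 86% coverage in less time using less aggregate memory, spread over 15 commodity-sized servers instead of one 500 GB machine.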
Enabling Technology
SOAP Hadoop (Gaea)
GPU