Post on 09-Jan-2017
Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences
Dr. Matthieu-P. Schapranow Festival of Genomics, London, U.K.
Jan 19, 2016
Schapranow, Festival of Genomics, Jan 19, 2016
Recap: we.analyzegenomes.com Real-time Analysis of Big Medical Data
2
In-Memory Database
Extensions for Life Sciences
Data Exchange, App Store
Access Control, Data Protection
Fair Use
Statistical Tools
Real-time Analysis
App-spanning User Profiles
Combined and Linked Data
Genome Data
Cellular Pathways
Genome Metadata
Research Publications
Pipeline and Analysis Models
Drugs and Interactions
Challenges of Big Medical Data
Drug Response Analysis
Pathway Topology Analysis
Medical Knowledge Cockpit Oncolyzer
Clinical Trial Recruitment
Cohort Analysis
...
Indexed Sources
Combined column and row store
Map/Reduce Single and multi-tenancy
Lightweight compression
Insert only for time travel
Real-time replication
Working on integers
SQL interface on columns and rows
Active/passive data store
Minimal projections
Group key Reduction of software layers
Dynamic multi-threading
Bulk load of data
Object-relational mapping
Text retrieval and extraction engine
No aggregate tables
Data partitioning Any attribute as index
No disk
On-the-fly extensibility
Analytics on historical data
Multi-core/ parallelization
Our Technology In-Memory Database Technology
+
+++
+
P
v
+++t
SQL
xx
T
disk
3
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
In-Memory Data Management Overview
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
4
Advances in Hardware
64 bit address space – 4 TB in current server boards
4 MB/ms/core data throughput
Cost-performance ratio rapidly declining
Multi-core architecture (6 x 12 core CPU per blade)
Parallel scaling across blades
1 blade ≈50k USD = 1 enterprise class server
Advances in Software
Row and Column Store Compression Partitioning Insert Only
A
Parallelization
+++
++
P
Active & Passive Data Stores
■ Main memory access is the new bottleneck
■ Lightweight compression can reduce this bottleneck, i.e.
□ Lossless
□ Improved usage of data bus capacity
□ Work directly on compressed data
Lightweight Compression
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
5
Attribute Vector
RecId ValueId 1 C18.0 2 C32.0 3 C00.9 4 C18.0 5 C20.0 6 C20.0 7 C50.9 8 C18.0
Inverted Index
ValueId RecIdList 1 2 2 3 3 5,6 4 1,4,8 5 7
Data Dictionary
ValueId Value 1 Larynx 2 Lip 3 Rectum 4 Colon 5 Mama Table
… … … C18.0 Colon 646470 C50.9 Mama 167898 C20.0 Rectum 647912 C20.0 Rectum 215678 C18.0 Colon 998711 C00.9 Lip 123489 C32.0 Larynx 357982 C18.0 Colon 091487
RecId 1 RecId 2 RecId 3 RecId 4 RecId 5 RecId 6 RecId 7 RecId 8 …
• Typical compression factor of 10:1 for enterprise software
• In financial applications up to 50:1
■ Row stores are designed for operative workload, e.g.
□ Create and maintain meta data for tests
□ Access a complete record of a trial or test series
■ Column stores are designed for analytical work, e.g.
□ Evaluate the number of positive test results
□ Identification of correlations or test candidates
■ In-Memory approach: Combination of both stores
□ Increased performance for analytical work
□ Operative performance remains interactively
Combined Column and Row Store
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
6
+
■ Traditional databases allow four data operations:
□ INSERT, SELECT and
□ DELETE, UPDATE
■ Insert-only requires only INSERT, SELECT to maintain a complete history (bookkeeping systems)
■ Insert-only enables time travelling, e.g. to
□ Trace changes and reconstruct decisions
□ Document complete history of changes, therapies, etc.
□ Enable statistical observations
Insert-Only / Append-Only
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
7
+++
+
■ Horizontal Partitioning
□ Cut long tables into shorter segments
□ E.g. to group samples with same relevance
■ Vertical Partitioning
□ Split off columns to individual resources
□ E.g. to separate personalized data from experiment data
■ Partitioning is the basis for
□ Parallel execution of database queries
□ Implementation of data aging and data retention management
Partitioning
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
8
■ Modern server systems consist of x CPUs, e.g.
■ Each CPU consists of y CPU cores, e.g. 12
■ Consider each of the x*y CPU core as individual workers, e.g. 6x12
■ Each worker can perform one task at the same time in parallel
■ Full table scan of database table w/ 1M entries results in 1/x*1/y search time when traversing in parallel
□ Reduced response time
□ No need for pre-aggregated totals and redundant data
□ Improved usage of hardware
□ Instant analysis of data
Multi-core and Parallelization
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
9
■ Consider two categories of data stores: active and passive
■ Active data are accessed frequently & updates are expected, e.g.
□ Most recent experiment results, e.g. last two weeks
□ Samples that have not been processed, yet
■ Passive data are used for analytical & statistical purposes, e.g.
□ Samples that were processed 5 years ago
□ Meta data about seeds that are not longer produced
■ Moving passive data on slower storages
□ Reduces main memory demands
□ Improves performance for active data
Active and Passive Data Store
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
10
P
■ Layers are introduced to abstract from complexity
■ Each layer offers complete functionality, e.g. meta data of samples
■ Less layer result in
□ Less code to maintain
□ More specific code
□ Reduced resource demands
□ Improves performance of applications due to eliminating obsolete processing steps
Reduction of Application Layers
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
11 xx
■ 1,000 core cluster at Hasso Plattner Institute with 25 TB main memory
■ 25 nodes, each consists of:
□ 40 cores
□ 1 TB main memory
□ Intel® Xeon® E7- 4870
□ 2.40GHz
□ 30 MB Cache
In-Memory Database Technology Hardware Characteristics at HPI FSOC Lab
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
12
In-Memory Database Technology Use Case: Analysis of Genomic Data
Schapranow, Festival of Genomics, Jan 19, 2016
Analyze Genomes: Real-world Examples
13
Analysis of Genomic Data
Alignment and Variant Calling Analysis of Annotations in World-wide DBs
Bound To CPU Performance Memory Capacity
Duration Hours – Days Weeks
HPI Minutes Real-time
In-Memory Technology
Multi-Core
Partitioning & Compression
Federated In-Memory Database (FIMDB) Incorporating Local Compute Resources
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
14 S ite B
Federated In -M em oryD atabase Instance ,
A lgorithm s, andApp lications M anaged
by Service P rovider
Clou
d Ser
vice
Prov
ider
S ite A
FIMD
BA.
1
FIMD
BA.
2
FIMD
BA.
3
FIMD
BA.
4
FIMD
BA.
5
FIMD
BB.
1
FIMD
BB.
2
FIMD
BB.
3
FIMD
BC.
1
Federated In -M em oryD atabase Instances
M aster D ataM anaged by
Service P rovider
Sensitive D atareside a t S ite
Multiple Sites Forming the Federated In-Memory Database System
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
15
Federated In-M em ory D atabase System
M aster D ata andS hared A lgorithm s
S ite A S ite BC loud Provider
C loud IM D BInstance
Local IM D BInstance
S ensitive D ata,e.g . P atient D ata
R
Local IM D BInstance
Sensitive D ata,e .g. P atien t D ata
R
■ Front-End
□ Any user interface supported, e.g. UI5, JavaScript, iOS
□ Communication protocol: Ajax/JSON
■ API Server
□ Ruby on Rails
□ Translates into database queries
■ Platform
□ IMDB with extensions for Life Sciences, e.g. worker framework, scheduler, third-party tools, etc.
Software Architecture
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
16
H igh-Perform ance In-M em ory G enom e C loud P la tfo rm
R esearcherw ith W ebBrow ser
C lin ic ianw ith M ob ileD evice
Ana lytica l D ataP rocessing
Mas
ter D
ata,
e.g
.R
efer
ence
Gen
omes
,A
nnot
atio
ns, a
ndC
linic
al T
rials
R
G raph P rocessing
Specific C loudApp lication
Specific C loudApp lica tion s
Specific M obileApplication
RR
Tran
sact
iona
l Dat
a,e.
g. G
enom
e R
eads
,Pat
ient
EM
R D
ata,
and
Wor
ker Sta
tus
In -M em ory D atabase InstancesIn-M em ory D atabase Instances
RR
Scheduling andW orker F ram ew ork
Annota tion andU pdater Fram ew ork
Inte
rnet
Intra
net
Dat
a La
yer
Pla
tform
Lay
erA
pplic
atio
n La
yer
Accounting Security Extensions
Keep in contact with us!
Hasso Plattner Institute Enterprise Platform & Integration Concepts (EPIC)
Program Manager E-Health Dr. Matthieu-P. Schapranow
August-Bebel-Str. 88 14482 Potsdam, Germany
Dr. Matthieu-P. Schapranow schapranow@hpi.de http://we.analyzegenomes.com/
Schapranow, Festival of Genomics, Jan 19, 2016
Challenges of Big Medical Data
17