Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life...

Analyze Genomes: A Federated In-Memory Computing Platform for Life Sciences

Dr. Matthieu-P. Schapranow Festival of Genomics, London, U.K.

Jan 19, 2016

Schapranow, Festival of Genomics, Jan 19, 2016

Recap: we.analyzegenomes.com Real-time Analysis of Big Medical Data

In-Memory Database

Extensions for Life Sciences

Data Exchange, App Store

Access Control, Data Protection

Fair Use

Statistical Tools

Real-time Analysis

App-spanning User Profiles

Combined and Linked Data

Genome Data

Cellular Pathways

Genome Metadata

Research Publications

Pipeline and Analysis Models

Drugs and Interactions

Challenges of Big Medical Data

Drug Response Analysis

Pathway Topology Analysis

Medical Knowledge Cockpit Oncolyzer

Clinical Trial Recruitment

Cohort Analysis

Indexed Sources

Combined column and row store

Map/Reduce Single and multi-tenancy

Lightweight compression

Insert only for time travel

Real-time replication

Working on integers

SQL interface on columns and rows

Active/passive data store

Minimal projections

Group key Reduction of software layers

Dynamic multi-threading

Bulk load of data

Object-relational mapping

Text retrieval and extraction engine

No aggregate tables

Data partitioning Any attribute as index

No disk

On-the-fly extensibility

Analytics on historical data

Multi-core/ parallelization

Our Technology In-Memory Database Technology

In-Memory Data Management Overview

Analyze Genomes: Real-world Examples

Advances in Hardware

64 bit address space – 4 TB in current server boards

4 MB/ms/core data throughput

Cost-performance ratio rapidly declining

Multi-core architecture (6 x 12 core CPU per blade)

Parallel scaling across blades

1 blade ≈50k USD = 1 enterprise class server

Advances in Software

Row and Column Store Compression Partitioning Insert Only

Parallelization

Active & Passive Data Stores

■  Main memory access is the new bottleneck

■  Lightweight compression can reduce this bottleneck, i.e.

□  Lossless

□  Improved usage of data bus capacity

□  Work directly on compressed data

Lightweight Compression

Attribute Vector

RecId ValueId 1  C18.0 2  C32.0 3  C00.9 4  C18.0 5 C20.0 6 C20.0 7 C50.9 8 C18.0

Inverted Index

ValueId RecIdList 1  2 2  3 3  5,6 4  1,4,8 5  7

Data Dictionary

ValueId Value 1 Larynx 2 Lip 3 Rectum 4 Colon 5 Mama Table

… … … C18.0 Colon 646470 C50.9 Mama 167898 C20.0 Rectum 647912 C20.0 Rectum 215678 C18.0 Colon 998711 C00.9 Lip 123489 C32.0 Larynx 357982 C18.0 Colon 091487

RecId 1 RecId 2 RecId 3 RecId 4 RecId 5 RecId 6 RecId 7 RecId 8 …

•  Typical compression factor of 10:1 for enterprise software

•  In financial applications up to 50:1

■  Row stores are designed for operative workload, e.g.

□  Create and maintain meta data for tests

□  Access a complete record of a trial or test series

■  Column stores are designed for analytical work, e.g.

□  Evaluate the number of positive test results

□  Identification of correlations or test candidates

■  In-Memory approach: Combination of both stores

□  Increased performance for analytical work

□  Operative performance remains interactively

Combined Column and Row Store

■  Traditional databases allow four data operations:

□  INSERT, SELECT and

□  DELETE, UPDATE

■  Insert-only requires only INSERT, SELECT to maintain a complete history (bookkeeping systems)

■  Insert-only enables time travelling, e.g. to

□  Trace changes and reconstruct decisions

□  Document complete history of changes, therapies, etc.

□  Enable statistical observations

Insert-Only / Append-Only

■  Horizontal Partitioning

□  Cut long tables into shorter segments

□  E.g. to group samples with same relevance

■  Vertical Partitioning

□  Split off columns to individual resources

□  E.g. to separate personalized data from experiment data

■  Partitioning is the basis for

□  Parallel execution of database queries

□  Implementation of data aging and data retention management

Partitioning

■  Modern server systems consist of x CPUs, e.g.

■  Each CPU consists of y CPU cores, e.g. 12

■  Consider each of the x*y CPU core as individual workers, e.g. 6x12

■  Each worker can perform one task at the same time in parallel

■  Full table scan of database table w/ 1M entries results in 1/x*1/y search time when traversing in parallel

□  Reduced response time

□  No need for pre-aggregated totals and redundant data

□  Improved usage of hardware

□  Instant analysis of data

Multi-core and Parallelization

■  Consider two categories of data stores: active and passive

■  Active data are accessed frequently & updates are expected, e.g.

□  Most recent experiment results, e.g. last two weeks

□  Samples that have not been processed, yet

■  Passive data are used for analytical & statistical purposes, e.g.

□  Samples that were processed 5 years ago

□  Meta data about seeds that are not longer produced

■  Moving passive data on slower storages

□  Reduces main memory demands

□  Improves performance for active data

Active and Passive Data Store

■  Layers are introduced to abstract from complexity

■  Each layer offers complete functionality, e.g. meta data of samples

■  Less layer result in

□  Less code to maintain

□  More specific code

□  Reduced resource demands

□  Improves performance of applications due to eliminating obsolete processing steps

Reduction of Application Layers

■  1,000 core cluster at Hasso Plattner Institute with 25 TB main memory

■  25 nodes, each consists of:

□  40 cores

□  1 TB main memory

□  Intel® Xeon® E7- 4870

□  2.40GHz

□  30 MB Cache

In-Memory Database Technology Hardware Characteristics at HPI FSOC Lab

In-Memory Database Technology Use Case: Analysis of Genomic Data

Analysis of Genomic Data

Alignment and Variant Calling Analysis of Annotations in World-wide DBs

Bound To CPU Performance Memory Capacity

Duration Hours – Days Weeks

HPI Minutes Real-time

In-Memory Technology

Multi-Core

Partitioning & Compression

Federated In-Memory Database (FIMDB) Incorporating Local Compute Resources

14 S ite B

Federated In -M em oryD atabase Instance ,

A lgorithm s, andApp lications M anaged

by Service P rovider

S ite A

Federated In -M em oryD atabase Instances

M aster D ataM anaged by

Service P rovider

Sensitive D atareside a t S ite

Multiple Sites Forming the Federated In-Memory Database System

Federated In-M em ory D atabase System

M aster D ata andS hared A lgorithm s

S ite A S ite BC loud Provider

C loud IM D BInstance

Local IM D BInstance

S ensitive D ata,e.g . P atient D ata

Local IM D BInstance

Sensitive D ata,e .g. P atien t D ata

■  Front-End

□  Any user interface supported, e.g. UI5, JavaScript, iOS

□  Communication protocol: Ajax/JSON

■  API Server

□  Ruby on Rails

□  Translates into database queries

■  Platform

□  IMDB with extensions for Life Sciences, e.g. worker framework, scheduler, third-party tools, etc.

Software Architecture

H igh-Perform ance In-M em ory G enom e C loud P la tfo rm

R esearcherw ith W ebBrow ser

C lin ic ianw ith M ob ileD evice

Ana lytica l D ataP rocessing

G raph P rocessing

Specific C loudApp lication

Specific C loudApp lica tion s

Specific M obileApplication

ker Sta

In -M em ory D atabase InstancesIn-M em ory D atabase Instances

Scheduling andW orker F ram ew ork

Annota tion andU pdater Fram ew ork

Accounting Security Extensions

Keep in contact with us!

Hasso Plattner Institute Enterprise Platform & Integration Concepts (EPIC)

Program Manager E-Health Dr. Matthieu-P. Schapranow

August-Bebel-Str. 88 14482 Potsdam, Germany

Dr. Matthieu-P. Schapranow schapranow@hpi.de http://we.analyzegenomes.com/

Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life...

Technology

Transcript of Festival of Genomics 2016 London: Analyze Genomes: A Federated In-Memory Computing Platform for Life...

Eukaryotic genomes Genomics 260.605.01 J. Pevsner November 17, 2010.

Orthology predictions for whole mammalian genomes Leo Goodstadt MRC Functional Genomics Unit Oxford University.

Plant Genomes Central: Integrated Resources for Plant …Plant Genomes Central: Integrated Resources for Plant Genomics Plant Genomes Central (PGC) is an integrated, Web-based portal

GENETICS - CLUTCH CH.15 GENOMES AND GENOMICSlightcat-files.s3.amazonaws.com/...clutch...genomes-and-genomics-1… · Traditional whole genome sequencing (WGS) require cellular reactions

20.1 Structural Genomics Determines the DNA Sequences of Entire Genomes

Lecture 6. Functional Genomics: DNA microarrays and re-sequencing individual genomes by hybridization.

The 100,000 genomes project - ClinGen · The 100,000 genomes project . Tim Hubbard @ timjph . Genomics England . King’s College London, King’s Health Partners . Wellcome Trust

Microbial genomics Genomics: study of entire genomes Logical next step after genetics: study of genes Genomics: 1) “Structural genomics” * Determine and.

Genomics & Proteomics I.What is genomics? A. GOALS of Genomics B. How are genomes mapped? C. Methods for sequencing genomes D. The Human Genome Project.

Spring 2007 Bioinformatiatics Ch. 6 - Genomics. Completed genomes Spring 2009 Bioinformatiatics .

Genomics study of molecular organization of genomes, their information content, and gene products they encode divided into three areas –structural genomics.

Genomics: 1 Genomics-sequencing of microbial genomes This lecture illustrates the strategies used in microbial genome sequencing projects, compares genome.

Trends in Genomics Big Data, NCBI perspective, and 1,000 Genomes in the Cloud

Genomics-sequencing of microbial genomes · Genomics: 37 Post Genome Sequence •Comparative genomics –comparing genome organisation and content –genome size –genome repeats/Tn/phages

Genomes and Their Evolution - MyTeacherSite.org...– Comparisons of genomes among organisms provide information about the evolutionary history of genes and taxonomic groups • Genomics

Genomics, Proteomics, and Systems Biology 5. 5 Genomics, Proteomics, and Systems Biology Genomes and Transcriptomes Proteomics Systems Biology.

Genomics-sequencing of microbial genomes · Genomics-sequencing of microbial genomes This lecture illustrates the strategies used in microbial genome sequencing projects, compares

Structural Genomics • Functional genomics - …...Structural Genomics. The analysis of the physical structure of genomes. Mapping Genomes Cytogenetic and Genetic maps were very helpful

Comparative genomics of the core and accessory genomes of ... … · RESEARCH Open Access Comparative genomics of the core and accessory genomes of 48 Sinorhizobium strains comprising

Genomics and personalised medicine · •Tom – Deputy Chief Scientist & Director of Public Health, 100,000 Genomes Project •100,000 Genomes Project •Sequencing 100,000 genomes