j.liu Current status of trans-mart development (1)

Deloitte Consulting LLP Current Status of tranSMART Development Jinlei Liu

Page 1: j.liu  Current status of trans-mart development (1)

Deloitte Consulting LLP

Current Status of tranSMART Development

Jinlei Liu

Page 2: j.liu  Current status of trans-mart development (1)

Objectives

• Core problem to solve

• Current development status and challenges

• tranSMART platform revisit and enhancement ideas

• Community development

Page 3: j.liu  Current status of trans-mart development (1)

Core problem: Scalable Analyses of Integrated Scientific Data

Collaborative analysis of the medical research data sets needed to make data-driven decisions in translational research is not scalable today. Groups lack standard integration within and between data sets across disparate domains, including 'omics, clinical research, and outcomes, linked with scientifically meaningful semantics.

A platform is needed that lets scientists share high-quality data across experimental data sets, with standardized storage, query, analytics, and visualization models, to enable integrative, informatics-driven analyses.

Page 4: j.liu  Current status of trans-mart development (1)

tranSMART – Knowledge Management Platform

Page 5: j.liu  Current status of trans-mart development (1)

tranSMART - Adoption and Emerging Community

Adoption

Emerging Community

GitHub Activity since Jan 2012

Page 6: j.liu  Current status of trans-mart development (1)

Features in the Open Source Releases

Q4 2011 eTRIKS review

Feb 2012

0.9 GPL

Feb 2013

1.1 Beta RC1

Dec 2012

1.1 Alpha

July 2012

1.0 GA, 1.0 RC2, 1.0 RC1

Initial Release
• Search, Dataset Explorer, Sample Explorer and Gene Signature
• Gene Expression, RBM and Clinical Trial Data
• GenePattern Integration
• ETL scripts based on Oracle Technology
• Legacy i2b2

GA Release
• i2b2 upgrade to 1.6
• R analytical plugin with 8+ pipelines
• R native interface
• Advanced data export
• SNP data support
• Updated ETL scripts – some in Kettle
• Documentation
• Data Curation Tool

Postgres Migration
• i2b2 – Postgres support
• tranSMART Postgres migration
• Integration tests
• Community build tools
• Updated ETL scripts – more Kettle jobs

Page 7: j.liu  Current status of trans-mart development (1)

Yet More Features on Private or Forked Versions of tranSMART

Faceted Search (3 versions!)

Gene Signature UI Enhancements

New data visualization in search

Integrated DSE and Faceted Search API

GWAS, eQTL, Genetic Variation (VCF) data

New analytic pipelines in R

Across Study pilot

Study Data and Metadata tagging

Data Upload UI and Tools

Enrichment Analysis and MetaCore integration

NCIBI tool integration

Installation scripts

New ETL pipelines and BioPortal integration

New grid view

Saved Reports

Page 8: j.liu  Current status of trans-mart development (1)

Knowledge Sharing Requires Collaborative Development Effort

Master Branch

Feature left on branch

Forked Development Branch

Private Repo 1

Private Repo 2

Page 9: j.liu  Current status of trans-mart development (1)

Feedback From the Community Requires Platform Revamp

Developers
• Better architecture: extension and customization currently require significant core code changes
• UI and code cleanup: mixed ExtJS and jQuery
• Better system integration via a service API
• Better data curation and ETL, ideally with automated pipelines
• Better packaging
• Better code management and testing

Users
• Intuitive UI to visualize data
• Powerful data export tool
• Support for NGS and other new data types
• Better performance
• Self-service data management capability
• Meta-analysis
• More analytic pipeline integration
• Integration with other systems

Page 10: j.liu  Current status of trans-mart development (1)

tranSMART Platform Revisit – Architecture Overview

Internal Application

Page 11: j.liu  Current status of trans-mart development (1)

tranSMART Platform Revisit – Data Categories and Storage

Each entry below gives Category / Type, followed by Description, Example, Usage, and Storage.

Level 1 / Raw
• Description: Raw data from the source platform; not normalized
• Example: Affymetrix CEL files
• Usage: Data processing pipeline
• Storage: File system

Level 2 / Processed
• Description: Normalized data produced through curation or data processing pipelines
• Example: Clinical trial data; RMA or MAS5 normalized gene expression data; SNP data with calls and CNV
• Usage: Dataset Explorer
• Storage: Database: DeApp, i2b2DemoData

Level 3 / Interpreted
• Description: Interpreted or aggregated data derived from processed data
• Example: Z-scores for gene expression data; ANOVA analysis results
• Usage: Dataset Explorer; Search
• Storage: Database: DeApp, BioMart

Level 4 / Summary and Findings
• Description: Quantified association and analysis across multiple samples; published results
• Example: Across-trial analysis; data association results from publications
• Usage: Search
• Storage: Database: BioMart

Master Data / Slow-changing data
• Description: Data about key business entities in the system; it may come from internal or external data sources
• Example: Study design, platform specification, subject demographics, ontology trees, user-defined gene lists
• Usage: Dataset Explorer; Search
• Storage: Database: i2b2Metadata, i2b2DemoData, BioMart, SearchApp

Reference Data / Slow-changing data used as reference
• Description: Data from another system that is used as an identifier or as a reference to other systems
• Example: Affymetrix annotation files, GeneID from Entrez
• Usage: Dataset Explorer; Search
• Storage: Database: DeApp, BioMart

MetaData, Structural / Metadata
• Description: Data that describes data structure
• Example: Data dictionary, schema guide
• Usage: Documentation
• Storage: File

MetaData, Administrative (Operational) / Metadata
• Description: Data associated with application/data access and operation
• Example: ETL auditing and QC results, application access results
• Usage: Search
• Storage: Database: searchApp, rdc_cz
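The category table on this slide can be read as a small lookup structure. The following is a minimal sketch in Python, not part of tranSMART itself: the `DataCategory` class and the two helper functions are hypothetical names introduced for illustration, while the category names, usages, and storage locations are transcribed from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataCategory:
    """One row of the slide's table (class name is illustrative only)."""
    name: str
    data_type: str
    usage: Tuple[str, ...]
    storage: Tuple[str, ...]


# Rows transcribed from the slide; only the Python identifiers are invented.
CATALOGUE = (
    DataCategory("Level 1", "Raw", ("Data processing pipeline",), ("File system",)),
    DataCategory("Level 2", "Processed", ("Dataset Explorer",), ("DeApp", "i2b2DemoData")),
    DataCategory("Level 3", "Interpreted", ("Dataset Explorer", "Search"), ("DeApp", "BioMart")),
    DataCategory("Level 4", "Summary and Findings", ("Search",), ("BioMart",)),
    DataCategory("Master Data", "Slow-changing data", ("Dataset Explorer", "Search"),
                 ("i2b2Metadata", "i2b2DemoData", "BioMart", "SearchApp")),
    DataCategory("Reference Data", "Slow-changing reference data",
                 ("Dataset Explorer", "Search"), ("DeApp", "BioMart")),
    DataCategory("Structural Metadata", "Metadata", ("Documentation",), ("File",)),
    DataCategory("Administrative Metadata", "Metadata", ("Search",), ("searchApp", "rdc_cz")),
)


def categories_stored_in(store: str) -> List[str]:
    """Return the categories kept in a given store (case-insensitive match)."""
    wanted = store.lower()
    return [c.name for c in CATALOGUE if wanted in {s.lower() for s in c.storage}]


def categories_used_by(feature: str) -> List[str]:
    """Return the categories that a given application feature consumes."""
    wanted = feature.lower()
    return [c.name for c in CATALOGUE if wanted in {u.lower() for u in c.usage}]
```

For example, `categories_stored_in("BioMart")` lists Level 3, Level 4, Master Data, and Reference Data, which makes visible how many categories share a single schema in the current layout.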

Page 12: j.liu  Current status of trans-mart development (1)

tranSMART Platform Revisit - Data Storage

• BIOMART: Core data warehouse and data mart with master data (study, platform, etc.) and analyzed and curated summary data
• SEARCHAPP: Application user data such as user accounts, the queries users have run, gene signatures, and study permissions
• DEAPP: Omics mart that stores high-dimensional data (Gex/SNP/Proteomics), subject and sample associations, and the security extension for clinical trials
• I2B2DEMODATA: Clinical, subject, and low-dimensional data in a star schema
• I2B2METADATA: Clinical trial ontology and security
• I2B2HIVE: i2b2 project and user database
• BIOMART_USER: Single access point for the tranSMART app; contains database synonyms
• TM_LZ: Landing zone where data is stored in its original format
• TM_WZ: Working zone that contains intermediate ETL results
• TM_CZ: ETL job control, QC, and auditing zone

Keys linking the schemas: UID, subject, study; Projects/ontology; subject, sample, concept_cd, trial; concept_cd, ontology; Biomarker UIDs

Page 13: j.liu  Current status of trans-mart development (1)

Data Store Redesign

• Transactional: User and Application Data in RDBMS
• Clinical and Finding: Level 3, Level 4 and Clinical Data in RDBMS
• High Dimension / Big Data: Level 2 and 3 Data in NoSQL DB
• Files and External Links: Documentation and Indexing on File System

Meta data and Master Data

Reference and Operational Data

Page 14: j.liu  Current status of trans-mart development (1)

tranSMART Platform Revisit – Data Curation and ETL

• Define study/data to be loaded: Determine which study to load into the system. This is decided by the Principal Scientist / System Product Manager.
• Data Source: Original source research data is copied as the preliminary process step.
• Common Data Format: The curation process begins by converting data from original sources into a common format.
• Common Ontology: Data is then organized into a common structure and a common ontology or vocabulary.
• Metadata Tagging: Data is tagged for future referencing and searching, at the record level, by concepts (disease, tissue, platform).
• Quality Control Process: Data is analyzed and compared against similarly tagged data, and any unusual features are noted.
• ETL Process: Quality-approved data is sent through the ETL pipeline.
• tranSMART: Data is available in tranSMART for analysis by end users.

Roles shown: Principal Scientist, Data Steward, Analyst, ETL Engineer

Feedback Loop
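The curation stages on this slide (common format, common ontology, metadata tagging, quality control, load) can be sketched as a chain of small functions. This is a hedged illustration in Python, not the tranSMART ETL code: every function name, the `COMMON_TERMS` vocabulary, and the required-field rule used for quality control are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Record:
    values: Dict[str, object]                      # column name -> value
    tags: Dict[str, str] = field(default_factory=dict)


# Hypothetical vocabulary mapping source column names to common terms.
COMMON_TERMS = {"sex": "Gender", "gender": "Gender", "age_yrs": "Age"}


def to_common_format(raw_rows: List[dict]) -> List[Record]:
    """Common Data Format: normalize column names (trim, lowercase)."""
    return [Record({k.strip().lower(): v for k, v in row.items()}) for row in raw_rows]


def apply_common_ontology(records: List[Record]) -> List[Record]:
    """Common Ontology: rename columns to the shared vocabulary."""
    for r in records:
        r.values = {COMMON_TERMS.get(k, k): v for k, v in r.values.items()}
    return records


def tag_metadata(records: List[Record], disease: str, tissue: str, platform: str) -> List[Record]:
    """Metadata Tagging: attach record-level concept tags."""
    for r in records:
        r.tags.update(disease=disease, tissue=tissue, platform=platform)
    return records


def quality_control(records: List[Record],
                    required: Tuple[str, ...] = ("Gender", "Age")
                    ) -> Tuple[List[Record], List[Record]]:
    """Quality Control: split records into approved and rejected (assumed rule)."""
    approved, rejected = [], []
    for r in records:
        complete = all(r.values.get(k) not in (None, "") for k in required)
        (approved if complete else rejected).append(r)
    return approved, rejected


def load(approved: List[Record], target: List[Record]) -> int:
    """ETL: only quality-approved records reach the target store."""
    target.extend(approved)
    return len(approved)
```

In this sketch the rejected list plays the part of the feedback loop: records that fail quality control go back to the curator instead of being loaded.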

Page 15: j.liu  Current status of trans-mart development (1)

Curation and ETL Enhancement

• Data ingestion templates and services
• Curation tool with metadata integration
• Data upload and services
• Automated data processing pipelines
• Data security
• Data sharing API and services

Page 16: j.liu  Current status of trans-mart development (1)

tranSMART Platform Revisit - N tier Architecture

Presentation tier: Web-based user interface and programming API.

Business tier: Data processing and business logic evaluation. Moves and transforms data between the presentation and data tiers.

Data tier: Data is stored and retrieved in the database or file system.
• Oracle/Postgres
• File Storage

Other components labeled in the diagram:
• Ajax JavaScript Framework, GSP/JSP, JSON/XML, Web Services
• Controller, Model, GORM/Hibernate
• Security (with Plugins), Plugins
• Service: i2b2 PM, i2b2 CRC, i2b2 Ontology, Data Export, Data Import, Data Retrieve, Plugin Reg, Async Job
• Analysis, Search, Filter, Doc Index
• RModule, Container
• External Systems: SOAP, RESTful
• Data Integration, Web Service, Knowledge Inventory

Page 17: j.liu  Current status of trans-mart development (1)

tranSMART Platform Revisit - Analytic Integration via R Plugin

Analytic Server
• Rserve
• R backend
• Packages
• Modules
• ROracle
• tranSMART RInterface

App Server
• RModule Plugin
• Data Retrieval
• Plugin
• Output Render
• Plugin Reg
• Data Export
• Async Job

Data Server
• Biomart
• Clinical Mart (i2b2)
• Doc Store

Connections
• Send data / manage analytic job
• Direct access to Data Store via OCI
• Request and retrieve data
• Register module
• Input
• Render module response
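The flow labeled on this slide (register a module, send data as an asynchronous job, render the module response) can be sketched in a few lines. This is an illustrative Python stand-in under stated assumptions: in tranSMART the analysis runs in R behind Rserve, whereas here a thread pool and a plain Python function take its place, and the names `AnalyticServer`, `summarize`, and `render_summary` are hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from statistics import mean
from typing import Callable, Dict, Tuple


class AnalyticServer:
    """Toy registry of analysis modules that run as asynchronous jobs."""

    def __init__(self, max_workers: int = 2) -> None:
        self._modules: Dict[str, Tuple[Callable, Callable]] = {}
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def register_module(self, name: str, run: Callable, render: Callable) -> None:
        # "Register module": pair an analysis function with its output renderer.
        self._modules[name] = (run, render)

    def submit(self, name: str, data) -> Future:
        # "Send data / manage analytic job": run in the background, then render.
        run, render = self._modules[name]   # raises KeyError for unknown modules
        return self._pool.submit(lambda: render(run(data)))

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def summarize(values):
    """Stand-in for an R pipeline: a simple numeric summary."""
    return {"n": len(values), "mean": mean(values)}


def render_summary(result) -> str:
    """'Render module response': turn the raw result into display text."""
    return f"n={result['n']}, mean={result['mean']:.2f}"
```

A caller would register `summarize` with `render_summary` under a module name, call `submit` with the retrieved data, and collect the rendered text from the returned future once the job completes.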

Page 18: j.liu  Current status of trans-mart development (1)

Service and Plugin Based Architecture

Data Ingestion and Export

Data Visualization and Explorer

Data Analysis

Data Integration and Storage

SERVICES

CORE

KEY PLUGINS

PLUGINS

Ideas
• Leverage the Grails plugin framework
• tranSMART core as a Grails plugin
• Service and plugin registration in Core
• Extensions as Grails plugins
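The idea of service and plugin registration in a core can be illustrated with a minimal registry. The sketch below is in Python purely for illustration; the slide proposes doing this with the Grails plugin framework, and none of the names here (`Core`, `CsvExportPlugin`, the service keys `data.retrieve` and `export.csv`) come from tranSMART.

```python
from typing import Callable, Dict, List


class Core:
    """Hypothetical core: holds named services and installed plugins."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable] = {}
        self._plugins: Dict[str, object] = {}

    def register_service(self, name: str, fn: Callable) -> None:
        if name in self._services:
            raise ValueError(f"service already registered: {name}")
        self._services[name] = fn

    def service(self, name: str) -> Callable:
        return self._services[name]

    def register_plugin(self, plugin) -> None:
        # A plugin extends the system only through the core's public registry,
        # so adding one requires no change to core code.
        plugin.install(self)
        self._plugins[plugin.name] = plugin

    def plugins(self) -> List[str]:
        return sorted(self._plugins)


class CsvExportPlugin:
    """Example extension: builds an export service on top of a core service."""

    name = "csv-export"

    def install(self, core: Core) -> None:
        retrieve = core.service("data.retrieve")

        def export(study_id: str) -> str:
            rows = retrieve(study_id)
            return "\n".join(",".join(str(v) for v in row) for row in rows)

        core.register_service("export.csv", export)
```

The design point is the one raised in the developer feedback: when extensions register themselves against a small core API, customization no longer requires edits to the core itself.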

Page 19: j.liu  Current status of trans-mart development (1)

Great Opportunity - Knowledge Sharing and Community Development

Forming, Storming, Norming, Performing

Knowledge Unknown, Knowledge Silo, Knowledge Sharing, Knowledge Creation

No Trust, Limited Trust, Collaboration, Synergy

Page 20: j.liu  Current status of trans-mart development (1)

Another Popular Knowledge Management Community!

tranSMART

Page 21: j.liu  Current status of trans-mart development (1)

Thank You