j.liu Current status of trans-mart development (1)
Deloitte Consulting LLP
Current Status of tranSMART Development
Jinlei Liu
Objectives
• Core problem to solve
• Current development status and challenges
• tranSMART platform revisit and enhancement ideas
• Community development
- 3 -
Core problem: Scalable Analyses of Integrated Scientific Data
Collaborative analysis of the medical research data sets needed to make data-driven decisions in translational research is not scalable today. Groups lack standard integration within and between data sets across disparate domains, including ‘omics, clinical research, and outcomes, linked with scientifically meaningful semantics.
A platform is needed that lets scientists share high-quality data across experimental data sets, with standardized storage, query, analytics, and visualization models, to enable integrative, informatics-driven analyses.
- 4 -
tranSMART – Knowledge Management Platform
- 5 -
tranSMART - Adoption and Emerging Community
Adoption
Emerging Community
GitHub Activity since Jan 2012
- 6 -
Features in the Open Source Releases
Release timeline:
• Q4 2011
• Feb 2012: 0.9 GPL
• July 2012: 1.0 RC1, 1.0 RC2, 1.0 GA
• Dec 2012: 1.1 Alpha
• Feb 2013: 1.1 Beta RC1
• eTRIKS review
Initial Release
• Search, Dataset Explorer, Sample Explorer and Gene Signature
• Gene Expression, RBM and Clinical Trial Data
• GenePattern integration
• ETL scripts based on Oracle technology
• Legacy i2b2
GA Release
• i2b2 upgrade to 1.6
• R analytical plugin with 8+ pipelines
• R native interface
• Advanced data export
• SNP data support
• Updated ETL scripts, some in Kettle
• Documentation
• Data Curation Tool
Postgres Migration
• i2b2 Postgres support
• tranSMART Postgres migration
• Integration tests
• Community build tools
• Updated ETL scripts, more Kettle jobs
- 7 -
Yet More Features on Private or Forked Versions of tranSMART
Faceted Search (3 versions!)
Gene Signature UI Enhancements
New data visualization in search
Integrated DSE and Faceted Search API
GWAS, eQTL, Genetic Variation (VCF) data
New analytic pipelines in R
Across Study pilot
Study Data and Metadata tagging
Data Upload UI and Tools
Enrichment Analysis and MetaCore integration
NCIBI tool integration
Installation scripts
New ETL pipelines and BioPortal integration
New grid view
Saved Reports
…
- 8 -
Knowledge Sharing Requires Collaborative Development Effort
Master Branch
Feature left on branch
Forked Development Branch
Private Repo 1
Private Repo 2
Feedback From the Community Requires Platform Revamp
Developers
• Better architecture: extension and customization currently require significant core code changes
• UI and code cleanup: mixed ExtJS and jQuery
• Better system integration via a service API
• Better data curation and ETL, ideally automated pipelines
• Better packaging
• Better code management and testing
Users
• Intuitive UI to visualize data
• Powerful data export tool
• Support for NGS and other new data types
• Better performance
• Self-service data management capability
• Meta-analysis
• More analytic pipeline integration
• Integration with other systems
- 10 -
tranSMART Platform Revisit – Architecture Overview
Internal Application
- 11 -
tranSMART Platform Revisit – Data Categories and Storage
Level 1 (Type: Raw)
• Description: Raw data from source platform; not normalized
• Example: Affymetrix CEL files
• Usage: Data processing pipeline
• Storage: File system
Level 2 (Type: Processed)
• Description: Normalized data through curation or data processing pipelines
• Example: Clinical trial data; RMA or MAS5 normalized gene expression data; SNP data with calls and CNV
• Usage: Dataset Explorer
• Storage: Database: DeApp, i2b2DemoData
Level 3 (Type: Interpreted)
• Description: Interpreted or aggregated data from processed data
• Example: Z-scores for gene expression data; ANOVA analysis results
• Usage: Dataset Explorer, Search
• Storage: Database: DeApp, BioMart
Level 4 (Type: Summary and Findings)
• Description: Quantified association and analysis across multiple samples; published results
• Example: Across-trial analysis; data association results from publications
• Usage: Search
• Storage: Database: BioMart
Master Data (Type: Slow changing data)
• Description: Data about key business entities in the system; the data might come from internal or external data sources
• Example: Study design, platform specification, subject demographics, ontology trees, user-defined gene lists
• Usage: Dataset Explorer, Search
• Storage: Database: i2b2Metadata, i2b2DemoData, BioMart, SearchApp
Reference Data (Type: Slow changing data used as reference)
• Description: Data from another system that is used as an identifier or as a reference to other systems
• Example: Affymetrix annotation files, GeneID from Entrez
• Usage: Dataset Explorer, Search
• Storage: Database: DeApp, BioMart
MetaData - Structural (Type: Metadata)
• Description: Data that describes data structure
• Example: Data dictionary, schema guide
• Usage: Documentation
• Storage: File
MetaData - Administrative (Operational) (Type: Metadata)
• Description: Data associated with application/data access and operation
• Example: ETL auditing and QC results, application access results
• Usage: Search
• Storage: Database: searchApp, rdc_cz
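The category-to-storage mapping above can be captured as a small lookup structure. The following is only an illustrative Python sketch: the class and function names (`DataCategory`, `categories_stored_in`, `stores_used_by`) are hypothetical and not part of tranSMART; the category, usage, and storage values are copied from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class DataCategory:
    """One row of the data category slide (hypothetical structure)."""
    name: str
    data_type: str
    usage: Tuple[str, ...]
    storage: Tuple[str, ...]


# Values transcribed from the slide; nothing here is read from a real system.
CATEGORIES = (
    DataCategory("Level 1", "Raw", ("Data processing pipeline",), ("File system",)),
    DataCategory("Level 2", "Processed", ("Dataset Explorer",), ("DeApp", "i2b2DemoData")),
    DataCategory("Level 3", "Interpreted", ("Dataset Explorer", "Search"), ("DeApp", "BioMart")),
    DataCategory("Level 4", "Summary and Findings", ("Search",), ("BioMart",)),
    DataCategory("Master Data", "Slow changing data", ("Dataset Explorer", "Search"),
                 ("i2b2Metadata", "i2b2DemoData", "BioMart", "SearchApp")),
    DataCategory("Reference Data", "Slow changing data used as reference",
                 ("Dataset Explorer", "Search"), ("DeApp", "BioMart")),
    DataCategory("MetaData - Structural", "Metadata", ("Documentation",), ("File",)),
    DataCategory("MetaData - Administrative (Operational)", "Metadata",
                 ("Search",), ("searchApp", "rdc_cz")),
)


def categories_stored_in(store: str) -> List[str]:
    """Return the categories kept in a given store (case-insensitive match)."""
    wanted = store.lower()
    return [c.name for c in CATEGORIES
            if any(s.lower() == wanted for s in c.storage)]


def stores_used_by(usage: str) -> List[str]:
    """Return the distinct stores behind one usage, e.g. 'Dataset Explorer'."""
    stores = {s for c in CATEGORIES if usage in c.usage for s in c.storage}
    return sorted(stores, key=str.lower)
```

For example, `categories_stored_in("BioMart")` lists every category that ends up in BioMart, which gives a quick view of how many kinds of data a single schema currently has to serve.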
- 12 -
tranSMART Platform Revisit - Data Storage
Schemas:
• BIOMART: Core data warehouse and data mart with master data (study, platform, etc.) and analyzed and curated summary data
• DEAPP: ‘Omic mart that stores high-dimensional data (Gex/SNP/Proteomics), subject and sample associations, and the security extension for clinical trials
• SEARCHAPP: Application user data such as user accounts, the queries users have run, gene signatures, and study permissions
• I2B2DEMODATA: Clinical, subject, and low-dimensional data in a star schema
• I2B2METADATA: Clinical trial ontology and security
• I2B2HIVE: i2b2 project and user database
• BIOMART_USER: Single access point for the tranSMART app; contains database synonyms
• TM_LZ: Landing zone where data is stored in its original format
• TM_WZ: Working zone containing intermediate ETL results
• TM_CZ: ETL job control, QC, and auditing zone
Keys on the links between schemas: UID, subject, study; projects/ontology; subject, sample, concept_cd, trial; concept_cd, ontology; biomarker UIDs
- 13 -
Data Store Redesign
• Transactional: user and application data in RDBMS
• Clinical and Finding: Level 3, Level 4 and clinical data in RDBMS
• High Dimension / Big Data: Level 2 and 3 data in NoSQL DB
• Files and External links: documentation and indexing on file system
• Metadata and Master Data
• Reference and Operational Data
- 14 -
tranSMART Platform Revisit – Data Curation and ETL
• Define study/data to be loaded: Determine which study to load into the system. This is decided by the Principal Scientist / System Product Manager.
• Data Source: Original source research data is copied as the preliminary process step.
• Common Data Format: The curation process begins by converting data from original sources into a common format.
• Common Ontology: Data is then organized into a common structure and a common ontology or vocabulary.
• Metadata Tagging: Data is tagged for future referencing and searching, at the record level, by concepts (disease, tissue, platform).
• Quality control: Data is analyzed and compared against similarly tagged data, and any unusual features are noted.
• ETL: Quality-approved data is sent through the ETL pipeline.
• tranSMART: Data is available in tranSMART for analysis by end users.
• Feedback Loop
Roles: Principal Scientist, Data Steward, Analyst, ETL Engineer
Process groups: ETL Process, Quality Control Process
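The curation steps above (common format, common vocabulary, concept tagging, quality control) can be sketched as a chain of small functions. This is a minimal Python illustration under stated assumptions, not tranSMART's ETL code: every function name is hypothetical, and the quality-control step is a simple completeness check standing in for the comparison against similarly tagged data described on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One curated record; `concepts` holds the record-level tags."""
    values: dict
    concepts: dict = field(default_factory=dict)


def to_common_format(raw_rows):
    """Convert source rows to a common format: lower-case keys, trimmed text."""
    records = []
    for row in raw_rows:
        values = {key.strip().lower(): (val.strip() if isinstance(val, str) else val)
                  for key, val in row.items()}
        records.append(Record(values))
    return records


def apply_common_vocabulary(records, vocabulary):
    """Replace known synonyms with the preferred term of a shared vocabulary."""
    for rec in records:
        rec.values = {key: (vocabulary.get(val.lower(), val) if isinstance(val, str) else val)
                      for key, val in rec.values.items()}
    return records


def tag_concepts(records, concepts):
    """Tag every record with study-level concepts (disease, tissue, platform)."""
    for rec in records:
        rec.concepts.update(concepts)
    return records


def quality_control(records, required_fields):
    """Stand-in QC: approve complete records, flag the rest for review."""
    approved, flagged = [], []
    for rec in records:
        missing = [f for f in required_fields if rec.values.get(f) in (None, "")]
        (flagged if missing else approved).append(rec)
    return approved, flagged


def run_pipeline(raw_rows, vocabulary, concepts, required_fields):
    """Run the curation steps in order; only approved records go on to ETL."""
    records = to_common_format(raw_rows)
    records = apply_common_vocabulary(records, vocabulary)
    records = tag_concepts(records, concepts)
    return quality_control(records, required_fields)
```

Flagged records are returned rather than dropped so that they can feed the feedback loop back to the data owner.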
- 15 -
Curation and ETL Enhancement
• Data ingestion templates and services
• Curation tool with metadata integration
• Data upload and services
• Automated data processing pipelines
• Data security
• Data sharing API and services
- 16 -
tranSMART Platform Revisit - N tier Architecture
Presentation tier: Web-based user interface.
Business tier: Data processing and business logic evaluation. Moves and transforms data between the presentation and data tiers.
Data tier: Data is stored in and retrieved from the database (Oracle/Postgres) or the file system (File Storage).
Diagram components: Controller; Model; Ajax JavaScript Framework; GORM/Hibernate; GSP/JSP; JSON/XML; Web Services (SOAP, RESTful); Security (with Plugins); Plugins; RModule; Container
Services: i2b2 PM; i2b2 CRC; i2b2 Ontology; Data Export; Data Import; Data Retrieve; Plugin Reg; Async Job; Analysis; Search; Filter; Doc Index
External Systems
Other diagram labels: Programming API; Data Integration; Web Service; Knowledge Inventory
- 17 -
tranSMART Platform Revisit - Analytic Integration via R Plugin
Analytic Server: Rserve; R backend; Packages; Modules; ROracle; tranSMART RInterface
App Server: RModule Plugin; Data Retrieval; Plugin; Output Render; Plugin Reg; Data Export; Async Job
Data Server: Biomart; Clinical Mart (i2b2); Doc Store
Flows:
• Send data / manage analytic job
• Direct access to data store via OCI
• Request and retrieve data
• Register module
• Input
• Render module response
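The flow on this slide (register a module, send data as an asynchronous analytic job, then render the module response) can be sketched as follows. This is a hedged Python illustration only: `AnalysisJobService` and `summary_module` are hypothetical names, and a local thread pool stands in for the R backend, which the slide shows behind Rserve.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


class AnalysisJobService:
    """Hypothetical stand-in for the module registry plus async job handling."""

    def __init__(self, max_workers=2):
        self._modules = {}   # module name -> (run function, render function)
        self._jobs = {}      # job id -> (module name, future)
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def register_module(self, name, run, render):
        """Register an analysis module and the renderer for its output."""
        self._modules[name] = (run, render)

    def submit(self, module_name, data):
        """Send data to a module as a background job and return a job id."""
        run, _ = self._modules[module_name]
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = (module_name, self._pool.submit(run, data))
        return job_id

    def status(self, job_id):
        """Report 'running' (queued or executing), 'failed' or 'completed'."""
        _, future = self._jobs[job_id]
        if not future.done():
            return "running"
        return "failed" if future.exception() else "completed"

    def result(self, job_id, timeout=None):
        """Wait for the job and pass its raw output through the renderer."""
        module_name, future = self._jobs[job_id]
        _, render = self._modules[module_name]
        return render(future.result(timeout=timeout))

    def shutdown(self):
        self._pool.shutdown(wait=True)


def summary_module(values):
    """Toy analysis used for illustration; a real module would call into R."""
    return {"n": len(values), "mean": mean(values)}
```

A caller would register `summary_module` with a renderer, call `submit`, and poll `status` until the job completes before fetching the rendered result, which mirrors the register, send, and render arrows in the diagram.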
- 18 -
Service and Plugin Based Architecture
Functional areas: Data Ingestion and Export; Data Visualization and Explorer; Data Analysis; Data Integration and Storage
Layers: SERVICES; CORE; KEY PLUGINS; PLUGINS
Ideas
• Leverage the Grails plugin framework
• tranSMART core as a Grails plugin
• Service and plugin registration in Core
• Extensions as Grails plugins
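The idea of registering services and plugins in a core can be illustrated with a language-neutral sketch. The Python below is not the Grails plugin API and not tranSMART code; `Core`, `CsvExportPlugin`, and the service names are hypothetical, chosen only to show an extension adding a feature without changes to core code.

```python
from typing import Callable, Dict, List


class Core:
    """Hypothetical core that keeps a registry of services and plugins."""

    def __init__(self) -> None:
        self._services: Dict[str, Callable] = {}
        self._plugins: Dict[str, object] = {}

    def register_service(self, name: str, service: Callable) -> None:
        """Expose a callable under a unique service name."""
        if name in self._services:
            raise ValueError(f"service already registered: {name}")
        self._services[name] = service

    def service(self, name: str) -> Callable:
        """Look up a registered service by name."""
        return self._services[name]

    def register_plugin(self, plugin) -> None:
        """Let the plugin install itself, then record it in the registry."""
        plugin.install(self)
        self._plugins[plugin.name] = plugin

    def plugins(self) -> List[str]:
        return sorted(self._plugins)


class CsvExportPlugin:
    """Example extension: adds an export service on top of a core service."""

    name = "csv-export"

    def install(self, core: Core) -> None:
        retrieve = core.service("data-retrieve")  # assumed to exist in the core

        def export(study_id: str) -> str:
            rows = retrieve(study_id)
            return "\n".join(",".join(str(v) for v in row) for row in rows)

        core.register_service("csv-export", export)
```

Because the plugin only talks to the core through named services, it can be developed and shipped separately, which is the property the feedback from developers asks for.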
- 19 -
Great Opportunity - Knowledge Sharing and Community Development
Team stages: Forming, Storming, Norming, Performing
Knowledge states: Knowledge Unknown, Knowledge Silo, Knowledge Sharing, Knowledge Creation
Trust levels: No Trust, Limited Trust, Collaboration, Synergy
- 20 -
Another Popular Knowledge Management Community!
tranSMART
- 21 -
Thank You