Challenges and Solutions for Managing the Complexities of a Genomic Core Facility GNomEx.

Challenges and Solutions for Managing the Complexities

of a Genomic Core Facility

GNomEx

Tony Di Sera

• Passionate about Software, fascinated by Molecular Biology.

• Over 20 years in the software field

Intel Human Genome Project

LIMS Software Startup

Huntsman Cancer Institute at the University of Utah

Original author of GNomEx, Project Lead

University of Utah and The Huntsman Cancer Institute

3 Genomics Cores

Bioinformatics Core

HCI Research Informatics

+ Sys Admin

Our Job is…

To deliver clean, beautiful data to the Researcher as quickly as possible…

GNomEx at a Glance

LIMS: Order Tracking, Workflow, Email Notification, Results Delivery

Data Repository: Analysis Project Center, Configurable Annotations, Private to Public Visibility

Submit Experiment Workflow

Results Delivery

Automated Billing

Analysis Visualization

GNomEx Overview: Data Flow

Experiments Analysis Visualization

GNomEx Experiments

Experiments cont…

GNomEx Analysis

Visualizing your Data

Visualizing your Data

Three Challenges

Data

Demand

People

Challenge #1

Complicated Data

BIG Data

Sensitive Data

Big Data

If you don’t have slack in the system, your throughput drops to a crawl.

Big Data

Build as you go

Buy BIG upfront

The Cloud

For Computationally Intense processes

2x Faster than Slow Disk

Fast RPM SCSI

Files have short life

Fast Disk

For Large Storage

Infrequent Disk Hits

Files have long life span

Mountable to GNomEx and Analysis Pipeline

Slow Disk

If you store your Data In-house….

Hire a talented, fearless, focused Sys Admin

xkcd

Transferring BIG Data - FDT by Caltech

Pool of directly mapped buffers

Data Transfer Socket

Connection & Control Management

Pool of directly mapped buffers

Restore Multiple Files Concurrently

Independent Threads per Device
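
The ideas named on this slide (independent threads per device, a pool of buffers, a single transfer socket) can be illustrated with a small Python sketch. This is not FDT code or its API; FDT is a separate Java application. The names `read_device`, `transfer`, `BUFFER_POOL_SIZE` and `CHUNK_SIZE` are hypothetical. A bounded queue stands in for the buffer pool and an in-memory dictionary stands in for the receiving end of the socket.

```python
import queue
import threading

BUFFER_POOL_SIZE = 4   # assumed pool size, for illustration only
CHUNK_SIZE = 4         # tiny chunks so the example is easy to follow


def read_device(files, pool):
    """Reader thread for one device: cut each file into chunks and queue them."""
    for name, data in files:
        for offset in range(0, len(data), CHUNK_SIZE):
            # put() blocks while the pool is full, which throttles fast readers
            pool.put((name, offset, data[offset:offset + CHUNK_SIZE]))


def transfer(devices):
    """Run one reader per device and drain the pool from a single writer."""
    pool = queue.Queue(maxsize=BUFFER_POOL_SIZE)
    received = {}

    def writer():
        # Stand-in for the transfer socket: store chunks by file and offset
        while True:
            item = pool.get()
            if item is None:          # sentinel: all readers are done
                break
            name, offset, chunk = item
            received.setdefault(name, {})[offset] = chunk

    readers = [threading.Thread(target=read_device, args=(files, pool))
               for files in devices.values()]
    sink = threading.Thread(target=writer)
    sink.start()
    for t in readers:
        t.start()
    for t in readers:
        t.join()
    pool.put(None)
    sink.join()

    # Reassemble each file from its chunks in offset order
    return {name: b"".join(chunks[o] for o in sorted(chunks))
            for name, chunks in received.items()}


# Hypothetical devices and file contents
devices = {
    "disk_a": [("run1.fastq", b"ACGTACGTAC")],
    "disk_b": [("run2.fastq", b"TTTTGGGG"), ("run3.fastq", b"CC")],
}
result = transfer(devices)
```

Because each device has its own reader, a slow disk does not stall reads from the others, and the bounded pool caps memory use.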

Big Data, Big Processing

Illumina Data Pipeline

FastQ Image Processing FastQ File Splitting

GNomEx Barcode Tags Experiment Info Run Info

Experiment Folders

Images
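
The file-splitting step on this slide, driven by barcode tags from GNomEx, can be sketched as below. This is an illustration only, not the pipeline used at the core. It assumes the barcode is the first few bases of each read, which may not match the actual run setup. The names `parse_fastq`, `split_by_barcode` and the sample labels are hypothetical.

```python
from collections import defaultdict


def parse_fastq(text):
    """Yield (header, sequence, quality) from FASTQ text with 4-line records."""
    lines = text.strip().splitlines()
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        yield header, seq, qual


def split_by_barcode(text, barcode_to_sample):
    """Group reads by sample; reads with unknown barcodes go to 'unassigned'."""
    # Assumption: all barcodes have the same length and lead the read
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for header, seq, qual in parse_fastq(text):
        sample = barcode_to_sample.get(seq[:barcode_len])
        if sample is None:
            by_sample["unassigned"].append((header, seq, qual))
        else:
            # Trim the barcode from both the sequence and the quality string
            by_sample[sample].append(
                (header, seq[barcode_len:], qual[barcode_len:]))
    return dict(by_sample)


# Hypothetical reads and barcode table
fastq = """@read1
ACGTTTGACC
+
IIIIIIIIII
@read2
TGCAGGGTTA
+
IIIIIIIIII
@read3
NNNNACGTAC
+
IIIIIIIIII
"""
barcodes = {"ACGT": "sample_A", "TGCA": "sample_B"}
split = split_by_barcode(fastq, barcodes)
```

In a real pipeline each group would be written to its own experiment folder rather than kept in memory.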

Sequencing Analysis

• Align: Novoalign

• Recalibrate: Align around known Indels

• SNP/Indel: GATK

• Annotate: Annovar, VAAST, VarScan

• ChIP-Seq: Identify likely ChIP Peaks

• RNA-Seq: Find differentially expressed genes from transcripts

Automated Analysis Pipeline

# run novoalign with default parameters
#e [email protected]
#a A1325
@align -g hg19 -i *.txt.gz

#map, recalibrate and call SNP/INDEL w/ GATK
@snpindel -g hg19 -i A*.txt.gz

#map, recalibrate, call SNP/INDEL, annotate
@annot -g hg19 -i control_A*.gz case_B*.gz -vaast -annovar

Simplifies running analyses on cluster

Fully versioned

Customizable
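
To show how a command file like the one above might be read before jobs go to the cluster, here is a minimal Python sketch. It is not the pipeline's actual parser. It assumes `#e` carries a notification email and `#a` an analysis identifier, which the slide does not state, and it treats any other `#` line as a comment. The function name `parse_command_file` and the placeholder address are hypothetical.

```python
import shlex


def parse_command_file(text):
    """Return (metadata, commands) parsed from a command file."""
    meta, commands = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#e "):        # assumed: notification email
            meta["email"] = line[3:].strip()
        elif line.startswith("#a "):      # assumed: analysis identifier
            meta["analysis"] = line[3:].strip()
        elif line.startswith("#"):        # plain comment
            continue
        elif line.startswith("@"):
            # Split like a shell would, but leave wildcards unexpanded
            tokens = shlex.split(line[1:])
            if tokens:
                commands.append((tokens[0], tokens[1:]))
    return meta, commands


# Example input; the email is a placeholder, not a value from the slides
example = """# run novoalign with default parameters
#e user@example.org
#a A1325
@align -g hg19 -i *.txt.gz
"""
meta, commands = parse_command_file(example)
```

Keeping the wildcard as a literal string lets the cluster job expand it in the analysis folder, where the files actually live.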

Complicated Data

• Configurable Annotations: Sample Annotations

• Workflow: Experimental parameters, multiplexing, and protocols

• Links from Experiment to Analysis: Data Genealogy

• Topics: Tie many experiments and analyses together

Experiment

Protocols

Lanes

Sample

The Data Model
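
The entities on this slide, together with the experiment-to-analysis links and topics mentioned earlier, might be modeled roughly as follows. This is a sketch for illustration, not the GNomEx schema. All class and field names, and the `genealogy` helper, are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sample:
    name: str
    annotations: dict = field(default_factory=dict)  # configurable annotations


@dataclass
class Lane:
    number: int
    samples: list = field(default_factory=list)      # several when multiplexed


@dataclass
class Experiment:
    experiment_id: str
    protocol: str
    lanes: list = field(default_factory=list)


@dataclass
class Analysis:
    analysis_id: str
    source_experiments: list = field(default_factory=list)  # data genealogy


@dataclass
class Topic:
    name: str
    experiments: list = field(default_factory=list)
    analyses: list = field(default_factory=list)


def genealogy(analysis):
    """Return the sorted sample names an analysis ultimately derives from."""
    return sorted({sample.name
                   for exp in analysis.source_experiments
                   for lane in exp.lanes
                   for sample in lane.samples})


# Hypothetical records
tumor = Sample("tumor_1", {"tissue": "colon"})
normal = Sample("normal_1", {"tissue": "colon"})
exp = Experiment("E1", "exome capture", [Lane(1, [tumor, normal])])
analysis = Analysis("A1", [exp])
topic = Topic("colon study", [exp], [analysis])
lineage = genealogy(analysis)
```

Following the links from an analysis back to its samples is what lets a researcher trace a result to the lanes that produced it.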

Folder 1

File 1

File 2

File 3

File 4

Folder 2

File A

File B

Folder 4

File 1

File 2

The File System

Sensitive Data

Physical: Server Room Access

OS: Server, File Permissions, Network

Application: Authorization

Application: Authentication

Who can Access the Data?

Collaborators

Visibility

Owner

Lab Members

Institution

Public
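
The visibility levels listed above suggest an access check along these lines. This is a sketch under assumptions, not GNomEx's authorization code. It assumes each level widens the previous one and that collaborators always have access. The names `Visibility`, `User`, `DataSet` and `can_access` are hypothetical.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Visibility(IntEnum):
    OWNER = 0
    LAB_MEMBERS = 1
    INSTITUTION = 2
    PUBLIC = 3


@dataclass
class User:
    name: str
    lab: Optional[str] = None
    institution: Optional[str] = None


@dataclass
class DataSet:
    owner: User
    visibility: Visibility
    collaborators: set = field(default_factory=set)  # user names


def can_access(user, ds):
    """Return True if the user (or None for an anonymous visitor) may see ds."""
    if ds.visibility == Visibility.PUBLIC:
        return True
    if user is None:
        return False
    if user.name == ds.owner.name or user.name in ds.collaborators:
        return True
    same_lab = user.lab is not None and user.lab == ds.owner.lab
    same_inst = (user.institution is not None
                 and user.institution == ds.owner.institution)
    if ds.visibility >= Visibility.LAB_MEMBERS and same_lab:
        return True
    if ds.visibility >= Visibility.INSTITUTION and same_inst:
        return True
    return False


# Hypothetical users and data sets
owner = User("ana", lab="lab1", institution="utah")
labmate = User("ben", lab="lab1", institution="utah")
colleague = User("cy", lab="lab2", institution="utah")
outsider = User("dee", lab="lab9", institution="elsewhere")
lab_data = DataSet(owner, Visibility.LAB_MEMBERS, collaborators={"dee"})
public_data = DataSet(owner, Visibility.PUBLIC)
```

Moving a data set from private to public then amounts to raising its `visibility` value, with no change to the files themselves.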

Three Challenges

Data

Demand

People

Challenge #2: The Demand

① More Researchers

② More Experiments

③ More Samples per Lane

④ Push for Faster Results

Slower Response Times

It is a shame

To ANNOY the user… in the first 20 seconds

Addressing the Bottlenecks

App Server

Data Transfers

File Conversions

More Hardware

Faster Authentication

Apache Tomcat

Workload Balancing

Efficient Database Queries

Offload to Batch Processing

Thinner Client

GNomEx Image Processing Analysis

How many servers are we talking about?

Tomcat

FDT

Database Server

File Server

Data Pipeline

Analysis

Fast Disk

High Performance Clusters

Slow Disk: The Repository

Fast Disk

Fast Disk

Fast Disk

Biggest Bottleneck is… getting the features implemented and bugs fixed in GNomEx.

[Chart: Backlog by year, Year 1 through Year 7; vertical axis from 0 to 350]

Three Challenges

Data

Demand

People

Different Users, Different Perspectives

3 Core Facilities

Bioinformatics

Researchers at your Institution

Outside Researchers

Accounting

Three Kinds of Users

Submit Experiment Workflow

Results Delivery

Automated Billing

Analysis Visualization

Submit, Annotate, Preapprove

Authorize, Register

Track

Record

Data Pipeline

Review, Split, Invoice

Analysis Pipeline, Upload, Annotate, Organize

Link, Organize, Browse

Browse

Download, Pay

Researcher

Core

Bioinformatics

Download

We Don’t Always Speak the Same Language

Core Facility

Bioinformatics

Software Developers

System Admin

JDK

SQL

P-Value

FDR

Cluster Nodes

Hibernate

Eclipse

Ant

Case/Control

NICs

NFS

REFS

ImageCopy

Cluster density

Molarity

Adapters

5’ vs 3’

CpG Islands

Optical Error

Linux Kernel

Interface

Inheritance

Spike in

But We Share the Same Goal

Deliver clean, beautiful data to the Researcher as quickly as possible…

Agile Development: Reducing Risk by Shortening the Delivery Window

Agile Manifesto: Value… More Than…

Individuals and Interactions more than Processes and Tools

Working Software more than Comprehensive Documentation

Customer Collaboration more than Contract Negotiation

Responding to Change more than Following a Plan

Iteration Incrementing Iterating

Our Scrum Board

In Summary

Data

Demand

People

Housing Big Data requires $ and expertise

System performance is multi-faceted

Work towards Shared Understanding. Build a team and process that embraces change.

Plans

Reporting

Mobile app for Work lists

Usability, Simplify Interface

Translational Research

Abstracting and Mining Genomic Findings

Special Thanks

Parting Thoughts

Privileged to work in this field

Working with bright, interesting, fun, and nice people

In an area exploding with new advancements

That will ultimately lead to important scientific discoveries

http://www.sourceforge.net/projects/gnomex

http://hci-scrum.hci.utah.edu/[email protected]