Smith T Bio Hdf Bosc2008

Transcript of Smith T Bio Hdf Bosc2008

Page 1: Smith T Bio Hdf Bosc2008

Confidential© Copyright 2008 Geospiza, Inc. All Rights Reserved. Page 1

Todd Smith(1), Christian Chilan (2), Rishi Sinha(3), Elena Pourmal(2), Mike Folk(2).

1. Geospiza, Inc. 100 West Harrison St. North Tower #330, Seattle WA 98119. 2. The HDF Group 1901 S. First St., Suite C-2 Champaign, IL 61820. 3. Microsoft Corporation, Redmond WA.

BioHDF: Open binary file formats for large-scale data management

Page 2: Smith T Bio Hdf Bosc2008

Laboratory and data workflow management for genetic analysis

Overview

• Driver: Next Generation DNA Sequencing

• What is HDF5

• BioHDF Project

Page 3: Smith T Bio Hdf Bosc2008

Next Generation DNA Sequencing

• Next Gen Sequencing platforms produce ~1,500× more data than CE (Sanger) platforms

• A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments

• In Sequence quotes - July 2007

– Toby Bloom, Broad Institute “Next-gen sequencing impacts all aspects of informatics.”

– Phil Butcher, Sanger “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.

– Eugen Clark, Harvard “[community] needs to start talks about data retention.”

– Kelly Carpenter, Wash U “these sequencers are going to totally screw you.”

• Nature Methods July 2008: “Byte-ing off more than you can chew”

Page 4: Smith T Bio Hdf Bosc2008

Three Phases of Data Production

• Primary Data Analysis - Images to bases

– Sequences + Quality values

– Run quality

• Secondary Data Analysis (Ref Seq + Aligner)

– Gene lists, Read Density, Variant list

– Sample, run quality

• Tertiary Data Analysis (One or more Data sets)

– Differential expression, Methylation sites, Gene association, Genomic structure

– Experiment, science

• Secondary Data Production

– De novo assembly => Assembler, Contigs + Annotation

Page 5: Smith T Bio Hdf Bosc2008

Proliferation of files, formats, formatters

Example: MAQ - http://maq.sourceforge.net

Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing

Additional files and formats needed for tertiary analysis

Page 6: Smith T Bio Hdf Bosc2008

Challenges

• Complexity

– Numerous programs, scripts, files, and formats

– Redundant data

• Computational overhead

– All data typically reside in RAM during computation

– Output and input formats differ, so data must be frequently reprocessed

• Space, time, and bandwidth efficiency

– Increased storage

– Computation times increase disproportionately

– Large data sets must be transported for processing

Page 7: Smith T Bio Hdf Bosc2008

What Needs to Be Done

• Reduce complexity

– Decrease numbers and kinds of files

– Eliminate data duplication (performance)

– API and tools for data access

• Improve resource utilization

– Reduce redundancy, work with compressed data

– Improve program access to data, random reads and writes, map disk to computer memory

– Parallel I/O, Remote access

– Facilitate data sharing, preservation

• Adopt a standard from other data intensive fields

– Benefit from history and experience

– Benefit from refinement

– Build on a proven, widely accepted platform
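
The goals of working with compressed data while keeping random reads can be illustrated with a minimal sketch. It uses only the Python standard library (`zlib`, `array`) as a stand-in for the idea of chunked, compressed storage; it is not the HDF5 API, and the names `CHUNK`, `write_chunked`, and `read_value` are hypothetical, chosen for illustration.

```python
import io
import zlib
from array import array

CHUNK = 1024  # values per chunk (hypothetical size chosen for illustration)


def write_chunked(values, stream):
    """Compress values chunk by chunk; return one (offset, length) pair per chunk."""
    index = []
    for start in range(0, len(values), CHUNK):
        raw = array("i", values[start:start + CHUNK]).tobytes()
        blob = zlib.compress(raw)
        index.append((stream.tell(), len(blob)))
        stream.write(blob)
    return index


def read_value(stream, index, i):
    """Read element i by decompressing only the chunk that contains it."""
    offset, length = index[i // CHUNK]
    stream.seek(offset)
    chunk = array("i")
    chunk.frombytes(zlib.decompress(stream.read(length)))
    return chunk[i % CHUNK]


values = [n % 4 for n in range(10_000)]  # low-entropy toy data, e.g. encoded bases
stream = io.BytesIO()                     # stands in for a file on disk
index = write_chunked(values, stream)
compressed_size = stream.tell()
raw_size = len(values) * array("i").itemsize
```

Because each chunk is compressed independently, a single lookup touches one chunk rather than the whole data set, so the data never has to be fully decompressed or held in RAM.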

Page 8: Smith T Bio Hdf Bosc2008

HDF5: Single Platform / Multiple Uses

• A file format for managing any kind of data

• Software system to manage data in the format

• Designed for high volume or complex data

• Designed for every size and type of system

• Open format and software

• One library, with

– Options to adapt I/O and storage to data needs

– Layers on top and below

• Ability to interact well with other technologies

• Attention to past, present, future compatibility

Page 9: Smith T Bio Hdf Bosc2008

HDF5 - 20 yrs in Physical Sciences

• Gain multiple “working with data” efficiencies: slice, recombine …

• Arrays, sets, organizations, compression already there

• Server and remote access

• Quick access to data via HDFView, MATLAB, other tools

• Widely used - MATLAB, Mathematica, IDL, NASA-EOS,

Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data

Page 10: Smith T Bio Hdf Bosc2008

HDF Software

Tools, Applications, Libraries (e.g. BioHDF)

HDF I/O Library

HDF File

Page 11: Smith T Bio Hdf Bosc2008

BioHDF

• SBIR Funded Project

• Phase I - Feasibility for genotyping

• Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications

– Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model

– Develop prototype BioHDF software applications that support common activities utilizing DNA

– Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics

Page 12: Smith T Bio Hdf Bosc2008

Phase I - Pilot Project

Combined view of HapMap, chromosome LD, PolyPhred details

A 53,000 × 53,000 LD array

BioHDF file structure

53,000 row, 100+ column HapMap table

PolyPhred data table, graphs, and chromats

Page 13: Smith T Bio Hdf Bosc2008

Benefits

• Separated the model, implementation, and view of the data

• Multiple levels of data in a single view

• HapMap: convert, display, and scroll through 100,000s of genotypes

• Compressed 5.2 GB LD data into 300 MB (17x)

• Quickly and randomly access subsets of data

• Made use of standard features and a data viewer (HDFView)

Only had to build the model and data importer
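
As a quick check on the compression figure above, the arithmetic can be sketched as follows. The unit convention is an assumption (decimal GB and MB; the slide does not say), and `compression_ratio` is a hypothetical helper named for illustration.

```python
def compression_ratio(original_bytes, compressed_bytes):
    """Return how many times smaller the compressed data is."""
    return original_bytes / compressed_bytes


GB = 10**9  # assumption: decimal units; the slide does not say GB vs GiB
MB = 10**6

ratio = compression_ratio(5.2 * GB, 300 * MB)
saved_fraction = 1 - (300 * MB) / (5.2 * GB)

print(f"ratio: {ratio:.1f}x, space saved: {saved_fraction:.0%}")
```

Under these assumptions the ratio comes out at roughly 17×, matching the slide, which corresponds to about 94% less storage.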

Page 14: Smith T Bio Hdf Bosc2008

Phase II

• Primary Data Analysis

– Models for storing and accessing primary data

– Implement and test models, develop compression methods

– Create research tools to access and work with the data

• Secondary Data Analysis

– Models for storing common data structures (assembly graphs, density plots, variants)

– APIs to work with programs, enable out-of-core processing

– Develop research level applications utilizing HDFView, current and emerging genome browsers
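
Out-of-core processing, mentioned above, means computing over data that stays on disk and is read one block at a time. The sketch below is a minimal illustration of that idea for a read-density count, using only the Python standard library. The file layout (a flat binary file of 64-bit read start positions) and the name `density_out_of_core` are assumptions made for the example, not part of BioHDF.

```python
import os
import tempfile
from array import array


def density_out_of_core(path, bin_size, n_bins, block=4096):
    """Count read start positions per bin while holding only one block in RAM."""
    counts = [0] * n_bins
    item = array("q").itemsize
    with open(path, "rb") as fh:
        while True:
            buf = fh.read(block * item)
            if not buf:
                break
            positions = array("q")
            positions.frombytes(buf)
            for pos in positions:
                # Clamp to the last bin so out-of-range positions are still counted
                counts[min(pos // bin_size, n_bins - 1)] += 1
    return counts


# Toy data: 10,000 positions spread evenly over 0..999 (hypothetical values)
toy = array("q", (i * 7 % 1000 for i in range(10_000)))
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(toy.tobytes())
    path = tmp.name

counts = density_out_of_core(path, bin_size=100, n_bins=10)
os.remove(path)
```

Memory use is bounded by the block size rather than the file size, which is the property that matters when the input is far larger than RAM.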

Page 15: Smith T Bio Hdf Bosc2008

Collaborations

• Planned

– Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)

– Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems

• Emerging

– Additional Sequencing Vendors, Microsoft Research, Intel, Institute for Systems Biology

• Seeking

– Algorithm developers

– Application developers

– Frameworks, Bio*

– Data sets

Page 16: Smith T Bio Hdf Bosc2008

Summary

• Data challenges for Next Gen sequencing

– Manage high volumes of data

– Workflow complexity

– Computational performance

• BioHDF will be built on existing, available, and proven HDF5 technology

• Geospiza and The HDF Group are seeking collaborations

• Funding - NIH STTR 1R41HG003792-02

• Interested? Contact [email protected]