Smith T Bio Hdf Bosc2008
Confidential© Copyright 2008 Geospiza, Inc. All Rights Reserved. Page 1
Todd Smith(1), Christian Chilan (2), Rishi Sinha(3), Elena Pourmal(2), Mike Folk(2).
1. Geospiza, Inc., 100 West Harrison St., North Tower #330, Seattle WA 98119. 2. The HDF Group, 1901 S. First St., Suite C-2, Champaign, IL 61820. 3. Microsoft Corporation, Redmond WA.
BioHDF : Open binary file formats for large scale data management
Laboratory and data workflow management for genetic analysis
Overview
• Driver: Next Generation DNA Sequencing
• What is HDF5
• BioHDF Project
Next Generation DNA Sequencing
• Next Gen Sequencing platforms produce ~1500 X more data than CE (Sanger)
• A single Next Gen instrument can produce 20 times more data in a single run than a day’s operation of a genome center with 100 CE instruments
• In Sequence quotes - July 2007
– Toby Bloom, Broad Institute “Next-gen sequencing impacts all aspects of informatics.”
– Phil Butcher, Sanger “The best way to move terabytes of data is still disk.” Want to process data closer to the machine.
– Eugen Clark, Harvard “[community] needs to start talks about data retention.”
– Kelly Carpenter, Wash U “these sequencers are going to totally screw you.”
• Nature Methods July 2008: “Byte-ing off more than you can chew”
Three Phases of Data Production
• Primary Data Analysis - Images to bases
– Sequences + Quality values
– Run quality
• Secondary Data Analysis (Ref Seq + Aligner)
– Gene lists, Read Density, Variant list
– Sample, run quality
• Tertiary Data Analysis (One or more data sets)
– Differential expression, Methylation sites, Gene association, Genomic structure
– Experiment, science
Secondary Data Production
• De novo assembly => Assembler => Contigs + Annotation
Proliferation of files, formats, formatters
Example: MAQ - http://maq.sourceforge.net
Secondary Analysis for: Tag profiling, ChIP-Seq, Resequencing
Additional files and formats needed for tertiary analysis
Challenges
• Complexity
– Numerous programs, scripts, files, and formats
– Redundant data
• Computational overhead
– All data typically reside in RAM during computation
– Output and input formats differ, so data must be frequently reprocessed
• Space, time, and bandwidth efficiency
– Increased storage
– Computation times increase disproportionately
– Large data sets must be transported for processing
What Needs to Be Done
• Reduce complexity
– Decrease numbers and kinds of files
– Eliminate data duplication (performance)
– API and tools for data access
• Improve resource utilization
– Reduce redundancy, work with compressed data
– Improve program access to data, random reads and writes, map disk to computer memory
– Parallel I/O, Remote access
– Facilitate data sharing, preservation
• Adopt a standard from other data intensive fields
– Benefit from history and experience
– Benefit from refinement
– Build on a proven, widely accepted platform
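The "work with compressed data" and "random reads" goals can be illustrated with a minimal sketch. It uses only Python's standard `zlib` module and is not the HDF5 or BioHDF API; HDF5 provides chunked, compressed storage natively. The names `write_chunked`, `read_value`, and `CHUNK_ROWS`, and the toy quality values, are assumptions made for illustration.

```python
import zlib

CHUNK_ROWS = 1000  # hypothetical chunk size, in one-byte records per chunk


def write_chunked(values, chunk_rows=CHUNK_ROWS):
    """Compress a sequence of byte values (0-255) chunk by chunk.

    Each chunk is compressed independently, so any one chunk can later
    be decompressed without touching the others.
    """
    data = bytes(values)
    return [
        zlib.compress(data[start:start + chunk_rows])
        for start in range(0, len(data), chunk_rows)
    ]


def read_value(chunks, index, chunk_rows=CHUNK_ROWS):
    """Random read: decompress only the chunk that holds `index`."""
    chunk_no, offset = divmod(index, chunk_rows)
    return zlib.decompress(chunks[chunk_no])[offset]


# Toy data: 10,000 quality values in the range 0-40 (made up for illustration).
quality = [(i * 7) % 41 for i in range(10_000)]
chunks = write_chunked(quality)

compressed_size = sum(len(c) for c in chunks)
print(f"{len(quality)} bytes stored in {compressed_size} compressed bytes")
print("value at 4321:", read_value(chunks, 4321))
```

Reading one value decompresses a single 1,000-record chunk rather than the whole data set, which is the property that lets a program work directly on compressed data.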
HDF5: Single Platform / Multiple Uses
• A file format for managing any kind of data
• Software system to manage data in the format
• Designed for high volume or complex data
• Designed for every size and type of system
• Open format and software
• One library, with
– Options to adapt I/O and storage to data needs
– Layers on top and below
• Ability to interact well with other technologies
• Attention to past, present, future compatibility
HDF5 - 20 yrs in Physical Sciences
• Gain multiple “working with data efficiencies” slice, recombine …
• Arrays, sets, organizations, compression already there
• Server and remote access
• Quick access to data via HDFView, MATLAB, other tools
• Widely used - MATLAB, Mathematica, IDL, NASA-EOS,
Significantly reduce programming efforts needed to develop and maintain formats and software to explore scientific questions in your data
HDF Software
Software layers, top to bottom:
• Tools, Applications, Libraries (e.g. BioHDF)
• HDF I/O Library
• HDF File
BioHDF
• SBIR Funded Project
• Phase I - Feasibility for genotyping
• Phase II - Open source technologies to support computation in Next Gen DNA sequencing applications
– Support diverse types of data from multiple sequencing technologies by extending the BioHDF data model
– Develop prototype BioHDF software applications that support common activities utilizing DNA
– Develop methods for incorporating BioHDF into enterprise applications for clinical research and diagnostics
Phase I - Pilot Project
Combined view of HapMap, chromosome LD, PolyPhred details
A 53,000 x 53,000 LD array
BioHDF file structure
53,000 row, 100+ column HapMap table
PolyPhred data table, graphs, and chromats
Benefits
• Separated the model, implementation, and view of the data
• Multiple levels of data in a single view
• HapMap: convert, display, and scroll 100,000s of genotypes
• Compressed 5.2 GB LD data into 300 MB (17x)
• Quickly and randomly access subsets of data
• Made use of standard features and a data viewer (HDFView)
Only had to build the model and data importer
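The compression figure can be checked with a short calculation. This sketch assumes decimal units (1 GB = 10^9 bytes) and assumes that the 5.2 GB of LD data corresponds to the full 53,000 x 53,000 LD array from the pilot project; the slides do not state either point explicitly.

```python
# Figures quoted on the slides.
ld_rows = ld_cols = 53_000
raw_bytes = 5.2e9          # 5.2 GB, assuming decimal gigabytes
compressed_bytes = 300e6   # 300 MB, assuming decimal megabytes

ratio = raw_bytes / compressed_bytes
elements = ld_rows * ld_cols  # assumes the 5.2 GB covers the full array

print(f"compression ratio: {ratio:.1f}x")
print(f"LD array elements: {elements:,}")
print(f"raw bytes per element: {raw_bytes / elements:.2f}")
print(f"compressed bytes per element: {compressed_bytes / elements:.3f}")
```

Under these assumptions the ratio comes out at roughly 17x, consistent with the figure quoted above.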
Phase II
• Primary Data Analysis
– Models for storing and accessing primary data
– Implement and test models, develop compression methods
– Create research tools to access and work with the data
• Secondary Data Analysis
– Models for storing common data structures (assembly graphs, density plots, variants)
– APIs to work with programs, enable out-of-core processing
– Develop research level applications utilizing HDFView, current and emerging genome browsers
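Out-of-core processing means computing over data that stays on disk, with only a small window held in RAM at a time. A minimal sketch of the idea, using only the Python standard library rather than any BioHDF API, follows. The function `mean_quality`, the `BLOCK_SIZE` value, and the toy file of one-byte quality values are assumptions made for illustration.

```python
import os
import tempfile

BLOCK_SIZE = 64 * 1024  # bytes read per step; a hypothetical tuning value


def mean_quality(path, block_size=BLOCK_SIZE):
    """Compute the mean of one-byte quality values without loading the file.

    At most `block_size` bytes are held in memory at any time.
    """
    total = 0
    count = 0
    with open(path, "rb") as handle:
        while True:
            block = handle.read(block_size)
            if not block:
                break
            total += sum(block)   # iterating bytes yields integers 0-255
            count += len(block)
    return total / count if count else 0.0


# Build a toy file of 1,000,000 quality values (made-up data).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(i % 41 for i in range(1_000_000)))
    toy_path = tmp.name

result = mean_quality(toy_path)
os.remove(toy_path)
print(f"mean quality: {result:.3f}")
```

Memory use depends on the block size, not the file size, so the same loop works whether the file holds a megabyte or a terabyte.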
Collaborations
• Planned
– Software: SRF working group (A. Siddiqui), AMOS project (M. Pop), Assembly formats (G. Marth), Consed (D. Gordon)
– Applications and data: University of Washington, University of Florida, Johns Hopkins University, Applied Biosystems
• Emerging
– Additional Sequencing Vendors, Microsoft Research, Intel, Institute for Systems Biology
• Seeking
– Algorithm developers
– Application developers
– Frameworks, Bio*
– Data sets
Summary
• Data challenges for Next Gen sequencing
– Manage high volumes of data
– Workflow complexity
– Computational performance
• BioHDF will be built on existing, available, and proven HDF5 technology
• Geospiza and The HDF Group are seeking collaborations
• Funding - NIH STTR 1R41HG003792-02
• Interested? Contact [email protected]