NCSA-NARA investigations of HDF5 in support of EXPRESS-Driven data
Mike Folk
The HDF NARA Project
PDES, Inc. Offsite Meeting, September 24-29, 2006
Acknowledgement
This report is based upon work supported by the National Archives and Records Administration (NARA) through the grant NARA NSF 0202 GPG. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of NARA.
Participants
• Mike Folk, Vailin Choi, Elena Pourmal – The HDF Group
• Mark Conrad and Bob Chadduck – NARA
• David Price – EuroSTEP
• Keith Hunten – Lockheed-Martin
• Steve Cooper and Denny Moore – Electric Boat
• Others
1. What is HDF5?
HDF5 is
• A file format for managing any kind of data
• Software system to manage data in the format
• Suited especially to large volume or complex data
• Suited for every size and type of system
• Open file format, open software
Definitions
• “HDF” – Hierarchical Data Format
  • Originated in 1988
  • NCSA at University of Illinois at Urbana-Champaign
• “HDF5”
  • Successor to HDF, introduced in 1998
An HDF5 file is a container…
…into which you can put your data objects.
[Figure: example objects in the container, including palettes and a table]
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
HDF5 data model
• HDF5 file – container for data objects
• Primary Objects
  • Groups
  • Datasets
• Additional ways to organize data
  • Attributes for metadata
  • Sharable objects
  • Storage and access properties
Everything else is built from these parts.
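The parts listed above can be pictured with a small in-memory model. The sketch below is not the HDF5 library or its API. The `Group` and `Dataset` classes and the `lookup` method are hypothetical names invented for illustration, and they only show how groups, datasets, and attributes relate to one another.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """Hypothetical stand-in for an HDF5 dataset: data plus metadata."""
    data: list
    attrs: dict = field(default_factory=dict)


@dataclass
class Group:
    """Hypothetical stand-in for an HDF5 group: a named collection of objects."""
    members: dict = field(default_factory=dict)
    attrs: dict = field(default_factory=dict)

    def lookup(self, path):
        """Resolve a UNIX-style path such as '/foo/table' from this group."""
        node = self
        for name in path.strip("/").split("/"):
            if name:  # skip empty parts so that '/' returns the group itself
                node = node.members[name]
        return node


# Build a tiny "file": a root group holding one group, which holds one dataset.
root = Group()
root.members["foo"] = Group(attrs={"description": "example group"})
root.members["foo"].members["table"] = Dataset(
    data=[(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)],
    attrs={"columns": ("lat", "lon", "temp")},
)

table = root.lookup("/foo/table")
print(table.attrs["columns"], len(table.data))
```

In this toy model both groups and datasets carry an `attrs` dictionary, mirroring the point that attributes are the place for metadata.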
HDF “groups” for organizing objects in files
[Figure: a file hierarchy with a root group “/” and a group “/foo”, containing a palette, raster images, a 3-D array, a 2-D array, and a table (lat | lon | temp)]
HDF5 “dataset” for holding the data
[Figure: a dataset consists of data plus metadata]
• Dataspace: rank = 3; dimensions Dim_1 = 4, Dim_2 = 5, Dim_3 = 7
• Datatype: IEEE 32-bit float
• Attributes: time = 32.4, pressure = 987, temp = 56
• Storage info: chunked, compressed
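As a worked example of how the pieces of a dataset fit together, the sketch below takes the numbers from the figure (rank 3, dimensions 4 × 5 × 7, 32-bit floats) and computes the element count, the raw data size, and the position of one element. It assumes the elements are laid out contiguously in row-major (C) order. It uses only the Python standard library, not the HDF5 library, and `linear_index` is a hypothetical helper name.

```python
import math
import struct

dims = (4, 5, 7)                        # Dim_1, Dim_2, Dim_3 from the figure
element_size = struct.calcsize("<f")    # IEEE 32-bit float: 4 bytes

n_elements = math.prod(dims)            # 4 * 5 * 7
raw_bytes = n_elements * element_size


def linear_index(index, dims):
    """Row-major position of a multi-dimensional index (last dim varies fastest)."""
    position = 0
    for i, extent in zip(index, dims):
        if not 0 <= i < extent:
            raise IndexError(f"index {index} is outside dimensions {dims}")
        position = position * extent + i
    return position


print(n_elements, raw_bytes)                           # 140 560
print(linear_index((1, 2, 3), dims) * element_size)    # byte offset 208
```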
Datatypes (array elements)
• Datatype – how to interpret a data element
• Two classes: atomic and compound
Datatypes
• HDF5 atomic types
  • normal integer & float
  • user-definable (e.g. 13-bit integer)
  • fixed length and variable length multiples (e.g. strings)
  • references to objects/dataset regions
  • enumeration - names mapped to integers
  • array
• HDF5 compound types
  • Records with fields – comparable to C structs
  • Members can be atomic or compound types
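The comparison to C structs can be made concrete without the HDF5 library. The sketch below uses Python's standard `struct` module to pack records with `lat`, `lon`, and `temp` fields (borrowed from the table example earlier in the deck) into fixed-size binary records. The field widths chosen here are assumptions for illustration, not a layout prescribed by HDF5.

```python
import struct
from collections import namedtuple

# One record: two 32-bit integers and one 32-bit float, little-endian, no padding.
Record = namedtuple("Record", ["lat", "lon", "temp"])
RECORD_FORMAT = struct.Struct("<iif")

rows = [Record(12, 23, 3.1), Record(15, 24, 4.2), Record(17, 21, 3.6)]

# Pack every record back to back, as an array of compound elements would be.
buffer = b"".join(RECORD_FORMAT.pack(*row) for row in rows)

# Read the records back; 32-bit floats only approximate values such as 3.1.
decoded = [Record(*fields) for fields in RECORD_FORMAT.iter_unpack(buffer)]

print(RECORD_FORMAT.size, len(buffer))
print(decoded[0].lat, decoded[0].lon, round(decoded[0].temp, 1))
```

Because every record has the same size, the n-th record starts at byte `n * RECORD_FORMAT.size`, which is what makes direct access to individual elements cheap.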
“Groups”
• A mechanism for collections of related objects
• Every file starts with a root group
• Similar to UNIX directories
• Can have attributes
[Figure: a root group “/” with members tom, dick, and harry, and further objects a, b, and c beneath them]
Special Storage Options
• chunked: better subsetting access time; extendable
• compressed: improves storage efficiency, transmission speed
• extendable: arrays can be extended in any direction
• split file: metadata in one file, raw data in another
[Figure: dataset “Fred” split across File A and File B, with the metadata for Fred stored separately from the data for Fred]
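Why chunking helps subsetting can be shown with a little arithmetic. The sketch below uses only the Python standard library, not HDF5 itself. It counts which chunks a rectangular selection touches and compresses one chunk independently with `zlib`. The array size, the chunk shape, the synthetic values, and the helper names `chunks_touched` and `chunk_bytes` are all assumptions chosen for illustration.

```python
import struct
import zlib

ROWS, COLS = 100, 100          # hypothetical 2-D array of 32-bit integers
CHUNK_ROWS, CHUNK_COLS = 10, 10


def chunks_touched(row_start, row_stop, col_start, col_stop):
    """Chunk coordinates intersecting the half-open selection [start, stop)."""
    return [
        (r, c)
        for r in range(row_start // CHUNK_ROWS, (row_stop - 1) // CHUNK_ROWS + 1)
        for c in range(col_start // CHUNK_COLS, (col_stop - 1) // CHUNK_COLS + 1)
    ]


def chunk_bytes(chunk_row, chunk_col):
    """Serialize one chunk; each element's value is its row-major position."""
    values = [
        r * COLS + c
        for r in range(chunk_row * CHUNK_ROWS, (chunk_row + 1) * CHUNK_ROWS)
        for c in range(chunk_col * CHUNK_COLS, (chunk_col + 1) * CHUNK_COLS)
    ]
    return struct.pack(f"<{len(values)}i", *values)


# A 5 x 5 selection in the corner of the array touches only 1 of 100 chunks,
# so only that chunk has to be read and decompressed.
needed = chunks_touched(0, 5, 0, 5)
total_chunks = (ROWS // CHUNK_ROWS) * (COLS // CHUNK_COLS)
print(len(needed), "of", total_chunks, "chunks")

# Each chunk is compressed on its own, so it can be decompressed on its own.
raw = chunk_bytes(0, 0)
packed = zlib.compress(raw)
print(len(raw), "->", len(packed), "bytes")
assert zlib.decompress(packed) == raw
```

The trade-off this sketch hints at is chunk shape: a selection that straddles chunk boundaries (for example rows 5 to 15 and columns 5 to 15) touches four chunks instead of one.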
Mesh Example, in HDFView
HDF5 Software
[Figure: layers labeled Tools & Applications, HDF I/O Library, and HDF File]
Features of library
• Ability to create and access complex data structures
• Fast, flexible I/O
• Data transformation and filtering during I/O
• Flexible API for power users
• Compatibility with common data models
  • Able to represent all common data structures
  • Supports key language models – C, Fortran, Java, etc.
Other info
• Library and tools run almost anywhere
• Other software from THG
  • Java viewer
  • Command-line utilities
• Other software
  • Commercial (IDL, Matlab, Labview, etc.)
  • Community (EOS, ASCI, etc.)
  • Integration with other software (SRB, databases, etc.)
Making HDF useful for your application
• There are many ways to organize and access data in HDF5
• How do we apply these capabilities to a particular domain, such as product data?
• We have to decide how we will organize and access our data in a way that best addresses our needs.
• And create data models, APIs and tools as appropriate to support our applications.
• Or adapt existing data models, APIs and tools as appropriate to support our applications.
Sample uses of HDF
1. NASA Earth Observing System (EOS)
[Figure: EOS satellites and their instruments]
• Aura: TES, HIRDLS, MLS, OMI
• Terra: CERES, MISR, MODIS, MOPITT
• Aqua (6/01): CERES, MODIS, AMSR
2. Advanced Simulation & Computing (ASC)
Question: How do we maintain a nuclear stockpile in the absence of testing?
Answer: Very large simulations
ASC Data requirements
• Large datasets (> a terabyte)
• Fast I/O on massively parallel systems
• Complex data and extensive metadata
• Availability on leading edge systems
3. Bioinformatics
Managing genomic data
caacaagccaaaactcgtacaa
Cgagatatctcttggaaaaact
gctcacaatattgacgtacaag
gttgttcatgaaactttcggta
Acaatcgttgacattgcgacct
aatacagcccagcaagcagaat
DNA sequencing workflows are complex
• Diverse formats
• Highly redundant data
• Multiple levels of information
• Complex associations
• Repeated file processing
• Non-scalable storage
• Lack of persistence
HDF5 as binary exchange format for bioinformatics
4. Flight test data
Boeing flight test
HDF role in the Software Stack
[Figure: the HDF5 software stack, top to bottom]
• Apps: simulation, visualization, remote sensing…
  • Examples: thermonuclear simulations, product modeling, data mining tools, visualization tools, climate models
• Common application-specific data models, with app-specific API or GUI
  • BioHDF, SAF, HDF-Packet, HDF-EOS, Matlab
  • Labels shown: LANL, LLNL, SNL, Grids, COTS, NASA
• HDF5 data model & API
• HDF5 virtual file layer (I/O drivers), HDF5 serial & parallel I/O
  • MPI I/O, Split Files, Stdio, Custom, Stream
• HDF5 format
• Storage
  • File
  • File on parallel file system
  • Split metadata and raw data files
  • User-defined device
  • Across the network or to/from another application or library
2. Why is there interest in HDF5 for product data?
(Courtesy of David Price, EuroSTEP)
Needs
• STEP and related models exist using EXPRESS
• ASCII, XML STEP formats defined, software developed
• But ASCII/XML don’t adapt well for highly voluminous, complex data
  • Finite element analysis
  • Computational fluid dynamics
  • Heterogeneous product data
EuroSTEP project
• VIVACE: “Value Improvement through a Virtual Aeronautical Collaborative Enterprise”
• Deliverable: EXPRESS-driven Large Volume Binary Data Representation
Survey of State of the Art
• Candidates
  • ASN.1 : Abstract Syntax Notation 1
  • HDF5 : Hierarchical Data Format
  • XML/Binary
  • CGNS : CFD General Notation System
  • SDAI implementation by LKSoft
• Found HDF5 most suitable for very large scientific datasets and complex relationships
Goal: Create open-source toolkit mapping EXPRESS to HDF5
[Figure: the HDF5 software stack from “HDF role in the Software Stack”, repeated with three added labels: Product model Applications, STEP data models, and STEP-HDF5]
NARA-sponsored work
NCSA-THG NARA Research
• Investigate the viability of scientific data formats, such as HDF5, for long-term preservation of engineering data in the federal archives
Heterogeneous data aggregation, with HDF5
• Goal: Using NARA’s TWR collection, investigate the possibilities and limitations of using HDF5 as a container for archiving heterogeneous collections of records, with special attention to STEP data.
Activities
• Use files, datatypes, structures in NARA TWR collection – STEP files, photos, schematics, etc.
• Map these to HDF5 objects and structures, exploiting features of HDF5
• Assess benefits and costs in terms of storage efficiency and accessibility
• Investigate use of HDF5 as container for collection
Relationship with EuroSTEP, Electric Boat, et al.
• Working together to develop mappings from EXPRESS to HDF5
• Sharing data for testing
• Periodic meetings to share information and coordinate research
• Some involvement with standardization
Investigating I/O efficiency and size
• Explore different datatypes and storage options for b-spline surface models (later: finite element models)
• Two types of data – b-splines themselves and cartesian points
• Variables
  • Different HDF5 datatypes
  • Dataset compression
  • Use of extra indexes in HDF5 for fast access
Some results
• Small files
  • HDF5 not appreciably better than STEP, sometimes worse
• Large files
  • Compression always made HDF5 files smaller
  • Even without compression, HDF5 storage better
  • Indexing approach also tended to save space
• Lessons
  • HDF5 can provide very efficient storage for cartesian points
  • Choice of data types and data storage is important
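One plausible reason binary storage does well for cartesian points is that a coordinate written as text costs more bytes than the same coordinate stored as a fixed-width binary number. The sketch below is not the project's benchmark and does not use HDF5 or a STEP toolkit. It generates hypothetical points, writes them as STEP-style text lines and as packed 64-bit floats, and compares sizes with and without `zlib` compression. The point values, the text layout, and all names are assumptions for illustration.

```python
import random
import struct
import zlib

random.seed(0)  # fixed seed so the hypothetical data is reproducible
points = [tuple(random.uniform(-1000.0, 1000.0) for _ in range(3))
          for _ in range(10_000)]

# Text form: one STEP-style instance line per point (layout is an assumption).
text = "".join(
    f"#{i}=CARTESIAN_POINT('',({x!r},{y!r},{z!r}));\n"
    for i, (x, y, z) in enumerate(points, start=1)
).encode("ascii")

# Binary form: three IEEE 64-bit floats per point, packed back to back.
binary = b"".join(struct.pack("<3d", *p) for p in points)

for label, blob in (("text", text), ("binary", binary)):
    print(f"{label:6s} raw={len(blob):8d}  compressed={len(zlib.compress(blob)):8d}")
```

Random coordinates are a hard case for compression, so the compressed sizes printed here say nothing about the measurements summarized above. The sketch only illustrates why the choice of datatype and storage matters.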
HDF5 as container
HDFView Demo
Thank you
HDF Information
• HDF Information Center
  • http://hdfgroup.org/
• HDF Help email address
  • [email protected]/
• HDF users mailing list
  • [email protected]/