1 Overview of HDF5 HDF Summit Boeing Seattle The HDF Group (THG) September 19, 2006.
1
Overview of HDF5 HDF Summit
Boeing Seattle
The HDF Group (THG)
September 19, 2006
2
Topics
• What is HDF?
• Sample uses of HDF
• THG the Company
3
What is HDF?
4
Answering big questions …
Matter & the universe
Weather and climate
[Figure: Total Column Ozone (Dobson), August 24, 2001 and August 24, 2002; color scale 60, 385, 610]
Life and nature
5
involves big data …
6
varied data…
caacaagccaaaactcgtacaaatatgaccgcacttcgctataaagaacacggcttgtggcgagatatctcttggaaaaactttcaagagcaactcaatcaactttctcgagcattgcttgctcacaatattgacgtacaagataaaatcgccatttttgcccataatatggaacgttgggttgttcatgaaactttcggtatcaaagatggtttaatgaccactgttcacgcaacgactacaatcgttgacattgcgaccttacaaattcgagcaatcacagtgcctatttacgcaaccaatacagcccagcaagcagaatttatcctaaatcacgccgatgtaaaaattctcttcgtcggcgatcaagagcaatacgatcaaacattggaaattgctcatcattgtccaaaattacaaaaaattgtagcaatgaaatccaccattcaattacaacaagatcctctttcttgcacttgg
7
and complex relationships…
Contig Summaries
Discrepancies
Contig Qualities
Coverage Depth
Read quality
Aligned bases
Contig
Reads
Percent match
SNP Score
Trace
8
on big computers…
9
and on little computers.
10
How do we…
• Describe the data?
• Read it? Store it? Find it? Share it? Mine it?
• Move it into, out of, and between computers and repositories?
11
HDF is
• A file format for managing any kind of data
• Software to store and access data in the format
• Suited especially to large or complex data collections
• Suited for every size of system
• Platform independent – runs almost anywhere
• Open – both file formats and software
12
HDF solution
I/O software & tools
Common data models
Standard APIs
Scientific data file format
Efficient storage, I/O
13
An HDF file is a container…
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
palette
…into which you can put your data objects.
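To make the idea of putting a table into a binary container concrete, here is a minimal sketch that packs the three lat/lon/temp rows from the slide into fixed-size records with an explicit byte order, so the bytes read back the same on any machine. It uses only Python's standard `struct` module. It is not the HDF5 file format or the HDF5 API, and the record layout and the helper names `pack_table` and `unpack_table` are assumptions made up for this illustration.

```python
import struct

# Assumed layout for illustration only: little-endian int32 lat, int32 lon,
# float64 temp. This is NOT how HDF5 lays out data on disk.
RECORD = struct.Struct("<iid")

rows = [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)]


def pack_table(table_rows):
    """Serialize each row into one fixed-size binary record."""
    return b"".join(RECORD.pack(*row) for row in table_rows)


def unpack_table(blob):
    """Read fixed-size records back into a list of tuples."""
    return [
        RECORD.unpack_from(blob, offset)
        for offset in range(0, len(blob), RECORD.size)
    ]


blob = pack_table(rows)
restored = unpack_table(blob)
```

Because the byte order is fixed in the format string, `restored` matches `rows` regardless of the native byte order of the machine that reads the bytes.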
14
HDF structures for organizing objects in files
palette
Raster image
3-D array
2-D array
Raster image
Table
lat | lon | temp
----|-----|-----
 12 |  23 |  3.1
 15 |  24 |  4.2
 17 |  21 |  3.6
“/” (root)
“/foo”
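The grouping on this slide, a root "/" holding objects and a nested "/foo", can be sketched as a small in-memory tree addressed by slash-separated paths. This is a toy model in plain Python, not the HDF5 library. The `Group` class, its method names, and the object names `table` and `array2d` are hypothetical and chosen only for this illustration.

```python
class Group:
    """Toy stand-in for a group: a named collection of groups and datasets."""

    def __init__(self):
        self.children = {}

    def create_group(self, name):
        """Add an empty child group and return it."""
        group = Group()
        self.children[name] = group
        return group

    def create_dataset(self, name, data):
        """Store any Python object under the given name."""
        self.children[name] = data
        return data

    def get(self, path):
        """Resolve a path such as "/foo/array2d" starting from this group."""
        node = self
        for part in path.strip("/").split("/"):
            if part:  # skip the empty component produced by "/"
                node = node.children[part]
        return node


root = Group()
foo = root.create_group("foo")
root.create_dataset("table", [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)])
foo.create_dataset("array2d", [[0, 1], [2, 3]])

first_row = root.get("/table")[0]
nested = root.get("/foo/array2d")
```

Objects of different kinds (a table, an array) live side by side in one tree and are found by path, much as files are found in a directory hierarchy.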
16
Mesh Example, in HDFView
17
HDF5 Software
Tools & Applications
HDF File
HDF I/O Library
18
Goals of HDF5 Library
• Flexible API to support a wide range of operations on data
• High performance access in serial and parallel computing environments
• Compatibility with common data models and programming languages
19
Features
• Ability to create complex data structures
• Complex subsetting
• Efficient storage
• Flexible I/O (parallel, remote, etc.)
• Ability to transform data during I/O
• Support for key language models
  • OO compatible
  • C & Fortran primarily
  • Also Java, C++
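As an illustration of what "complex subsetting" can mean, the sketch below picks a regular block of elements from a 2-D array using a start position, a stride, and a count per dimension. HDF5 calls selections of this kind hyperslabs. The code is a pure-Python model of the idea and not the HDF5 API, and the function name `select` and the sample `grid` are assumptions for this example.

```python
def select(data, start, stride, count):
    """Return a count[0] x count[1] subset of a 2-D list of lists.

    Element (i, j) of the result is taken from
    data[start[0] + i * stride[0]][start[1] + j * stride[1]].
    """
    (r0, c0), (rs, cs), (nr, nc) = start, stride, count
    return [
        [data[r0 + i * rs][c0 + j * cs] for j in range(nc)]
        for i in range(nr)
    ]


# A 4 x 6 grid whose values encode their position: value = 10 * row + column.
grid = [[10 * r + c for c in range(6)] for r in range(4)]

# Every second row starting at row 1, every second column starting at column 0.
subset = select(grid, start=(1, 0), stride=(2, 2), count=(2, 3))
```

Here `subset` is `[[10, 12, 14], [30, 32, 34]]`. In a real I/O library the point of such a selection is that only the chosen elements need to be read from storage.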
20
Sample uses of HDF
21
1. NASA Earth Observing System (EOS)
Terra: CERES, MISR, MODIS, MOPITT
Aqua (6/01): CERES, MODIS, AMSR
Aura: TES, HIRDLS, MLS, OMI
22
2. Advanced Simulation & Computing (ASC)
Question: How do we maintain a nuclear stockpile in the absence
of testing?
23
Answer: Very large simulations
on very large computers
24
ASC data requirements
• Large datasets (> a terabyte)
• Good I/O performance on massively parallel systems
• Complex data and extensive metadata
25
26
3. Bioinformatics
Managing genomic data
caacaagccaaaactcgtacaa
Cgagatatctcttggaaaaact
gctcacaatattgacgtacaag
gttgttcatgaaactttcggta
Acaatcgttgacattgcgacct
aatacagcccagcaagcagaat
27
DNA sequencing workflows
• Diverse formats
• Highly redundant data
• Repeated file processing
• Disconnected programs
• Non-scalable storage
• Lack of persistence
28
Multiple levels and relationships
Contig Summaries
Discrepancies
Contig Qualities
Coverage Depth
Read quality
Aligned bases
Contig
Reads
Percent match
SNP Score
Trace
29
HDF5 as binary format for bioinformatics
30
4. Flight test data
31
4. Boeing flight test
32
Flight test data requirements
• Fast data acquisition from 1000s of sources
• Wide variety of data types
• Active archive
• Standardization for data/software exchange
• Special features
35
THG the Company
36
What is the HDF Group?
• 18 years at the National Center for Supercomputing Applications (NCSA) at the University of Illinois
• Recent spin-off from U of I
• Non-profit 501(c)(3)
• 17 scientific, technology, and professional staff
• 5 students
• 2+ million product users world-wide
• Users across industry sectors and disciplines
37
THG mission
To support the vast community of HDF users and to ensure the sustainable development of HDF technologies and the ongoing accessibility of HDF-stored data.
38
Business model
• Non-profit: mission driven
• Intellectual property:
  • U of I plans to assign ownership to THG
  • The HDF formats will remain free, and HDF software will remain open source.
• Continue close ties to U of I and NCSA.
39
Income-generating activities
• Major client support
• Targeted HDF development
• Grant-supported R&D
• Consulting
40
Thank you
41
HDF Information
• HDF Information Center
  • http://hdfgroup.org/
• HDF Help email address
  • [email protected]/
• HDF users mailing list
  • [email protected]/