Data formats in e-Science Two key requirements Two key requirements –Interoperability and...
-
Upload
sydney-west -
Category
Documents
-
view
212 -
download
0
Transcript of Data formats in e-Science Two key requirements Two key requirements –Interoperability and...
Data formats in e-Data formats in e-ScienceScience Two key requirementsTwo key requirements
– Interoperability and ScalabilityInteroperability and Scalability– XMLXML is flexible, but verbose is flexible, but verbose– Binary formatsBinary formats are compact, but are compact, but
specificspecific– e.g. VOTable vs FITS issue in the VOe.g. VOTable vs FITS issue in the VO
Two possible solutions:Two possible solutions:– BinX from the BinX from the edikt edikt project at NeSCproject at NeSC– VX from the School of InformaticsVX from the School of Informatics
BinX: Binary in XMLBinX: Binary in XML
A language:A language:– Uses XML to describe the data types and Uses XML to describe the data types and
structures in a binary data filestructures in a binary data file A library:A library:
– For manipulating XML and binary filesFor manipulating XML and binary files BinX files:BinX files:
– SchemaBinX: XML descriptor of binary fileSchemaBinX: XML descriptor of binary file– DataBinX: SchemaBinX + data valuesDataBinX: SchemaBinX + data values
BinX will allow you to interact with a binary BinX will allow you to interact with a binary file as if it were XML – e.g. run XPath queries file as if it were XML – e.g. run XPath queries
Astronomical testbed: Astronomical testbed: format conversion with format conversion with BinXBinX
……and back againand back again
This way is harder, due to ASCII text in FITS header.
More about BinXMore about BinX
Download the BinX code & play with itDownload the BinX code & play with it– See See www.edikt.org/binxwww.edikt.org/binx
After BinX comes DFDL (After BinX comes DFDL (daffodil daffodil ))– Data Format Description LanguageData Format Description Language– Developing through GGF Working GroupDeveloping through GGF Working Group– BinX might morph into a DFDL implementationBinX might morph into a DFDL implementation
Basic idea behind DFDL:Basic idea behind DFDL:– We can’t have a single data format, but we We can’t have a single data format, but we
can have a single way of describing data can have a single way of describing data formatsformats
VX: Vectorizing XMLVX: Vectorizing XML
XML seems especially verbose for XML seems especially verbose for data files with simple, repeating data files with simple, repeating structuresstructures– e.g. VOTable – lots of <TR>s and e.g. VOTable – lots of <TR>s and
<TD>s<TD>s Vectorize it:Vectorize it:
– Decompose the XML document into a Decompose the XML document into a skeletonskeleton describing the structure and describing the structure and vectorsvectors containing data values containing data values
Example: bibliography in Example: bibliography in XMLXML
Astronomical VX Astronomical VX applicationapplication Export the PhotoObj Table from the SDSS Export the PhotoObj Table from the SDSS
EDR database into VOTableEDR database into VOTable– 360 columns and 10,000,000 rows360 columns and 10,000,000 rows
The Skeleton here is trivial The Skeleton here is trivial Querying the decomposed XML version can Querying the decomposed XML version can
be as fast as querying the SkyServer be as fast as querying the SkyServer databasedatabase– For queries where SkyServer doesn’t make heavy For queries where SkyServer doesn’t make heavy
use of indexes, which are not in VX yetuse of indexes, which are not in VX yet
More: More: http://homepages.inf.ed.ac.uk/v1bchoi/paper.pdfhttp://homepages.inf.ed.ac.uk/v1bchoi/paper.pdf