Data formats in e-Science


Page 1

Data formats in e-Science

Two key requirements
– Interoperability and Scalability
– XML is flexible, but verbose
– Binary formats are compact, but specific
– e.g. VOTable vs FITS issue in the VO
Two possible solutions:
– BinX from the edikt project at NeSC
– VX from the School of Informatics

Page 2

BinX: Binary in XML

A language:
– Uses XML to describe the data types and structures in a binary data file
A library:
– For manipulating XML and binary files
BinX files:
– SchemaBinX: XML descriptor of binary file
– DataBinX: SchemaBinX + data values
BinX will allow you to interact with a binary file as if it were XML, e.g. run XPath queries
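To make the idea concrete, here is a minimal Python sketch of describing a binary record layout in XML, exposing the decoded values as an XML tree, and querying that tree with an XPath expression. The descriptor vocabulary (`record`, `field`, `byteOrder`), the type names and the helper functions are all invented for illustration; this is not SchemaBinX syntax and not the BinX library API.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical descriptor: NOT real SchemaBinX syntax, just the same idea.
DESCRIPTOR = """
<record byteOrder="big">
  <field name="id" type="int32"/>
  <field name="ra" type="float64"/>
  <field name="dec" type="float64"/>
</record>
"""

# Map the (assumed) descriptor type names onto struct format codes.
STRUCT_CODES = {"int32": "i", "float64": "d"}


def record_format(descriptor_xml):
    """Turn the XML descriptor into field names plus a struct layout."""
    root = ET.fromstring(descriptor_xml)
    order = ">" if root.get("byteOrder") == "big" else "<"
    fields = [(f.get("name"), f.get("type")) for f in root.findall("field")]
    fmt = order + "".join(STRUCT_CODES[t] for _, t in fields)
    return [name for name, _ in fields], struct.Struct(fmt)


def binary_to_xml(descriptor_xml, data):
    """Decode every fixed-size record and expose it as an XML tree."""
    names, rec = record_format(descriptor_xml)
    dataset = ET.Element("dataset")
    for values in rec.iter_unpack(data):
        row = ET.SubElement(dataset, "row")
        for name, value in zip(names, values):
            ET.SubElement(row, name).text = repr(value)
    return dataset


# Two records packed as raw big-endian bytes stand in for a binary file.
raw = struct.pack(">idd", 1, 10.5, -3.25) + struct.pack(">idd", 2, 200.0, 45.0)
tree = binary_to_xml(DESCRIPTOR, raw)

# ElementTree supports a limited XPath subset; select rows whose id is 2.
matches = tree.findall("./row[id='2']")
ra_values = [row.findtext("ra") for row in matches]
```

The descriptor plays the role of SchemaBinX, and the generated tree plays the role of DataBinX (descriptor plus data values). Unlike this sketch, a real library need not materialise the whole XML tree in memory.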

Page 3

Astronomical testbed: format conversion with BinX

Page 4

…and back again

This way is harder, due to ASCII text in FITS header.
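One way to see the difficulty: a FITS header is not typed binary data but fixed-width ASCII text, made of 80-character records ("cards") padded with blanks to a multiple of 2880 bytes. It therefore has to be parsed or generated as text. The Python sketch below handles only simple `KEYWORD = value` cards; the helper names are hypothetical, and quoted string values, `COMMENT` cards and continuation cards are deliberately ignored.

```python
CARD = 80      # every FITS header record ("card") is 80 ASCII characters
BLOCK = 2880   # headers are padded with blanks to a multiple of 2880 bytes


def make_card(keyword, value):
    """Format one simplified value card: keyword, '= ', right-justified value."""
    return f"{keyword:<8}= {value:>20}".ljust(CARD)


def parse_header(raw):
    """Read keyword/value pairs from the cards that precede the END card."""
    text = raw.decode("ascii")
    header = {}
    for start in range(0, len(text), CARD):
        card = text[start:start + CARD]
        keyword = card[:8].strip()
        if keyword == "END":
            break
        if card[8:10] == "= ":
            # Drop a trailing "/ comment"; string values that contain "/"
            # are not handled in this sketch.
            header[keyword] = card[10:].split("/")[0].strip()
    return header


cards = [
    make_card("SIMPLE", "T"),
    make_card("BITPIX", "-32"),
    make_card("NAXIS", "2"),
    make_card("NAXIS1", "100"),
    make_card("NAXIS2", "50"),
    "END".ljust(CARD),
]
raw_header = "".join(cards).encode("ascii").ljust(BLOCK)
header = parse_header(raw_header)
```

Going from XML back to FITS means producing these cards with exact column positions and padding, which a description of binary types and array shapes alone does not capture.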

Page 5

More about BinX

Download the BinX code & play with it
– See www.edikt.org/binx
After BinX comes DFDL (daffodil)
– Data Format Description Language
– Developing through GGF Working Group
– BinX might morph into a DFDL implementation
Basic idea behind DFDL:
– We can’t have a single data format, but we can have a single way of describing data formats
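The basic idea can be sketched in a few lines of Python: two different physical formats, one binary and one text, are each described with the same small vocabulary, and a single generic reader interprets either description. The dictionary keys and the `read_records` helper are assumptions made for this illustration; real DFDL is far richer and is not written this way.

```python
import struct

# Hypothetical description vocabulary, invented for this sketch.
BINARY_FORMAT = {
    "representation": "binary",
    "byte_order": ">",
    "fields": [("id", "i"), ("flux", "f")],      # struct codes per field
}
TEXT_FORMAT = {
    "representation": "text",
    "separator": ",",
    "fields": [("id", int), ("flux", float)],    # converters per field
}


def read_records(description, payload):
    """One reader for every format that the description vocabulary covers."""
    names = [name for name, _ in description["fields"]]
    kinds = [kind for _, kind in description["fields"]]
    if description["representation"] == "binary":
        rec = struct.Struct(description["byte_order"] + "".join(kinds))
        rows = rec.iter_unpack(payload)
    else:
        sep = description["separator"]
        lines = payload.decode("ascii").splitlines()
        rows = (
            [convert(token) for convert, token in zip(kinds, line.split(sep))]
            for line in lines
        )
    return [dict(zip(names, row)) for row in rows]


binary_payload = struct.pack(">if", 7, 1.5) + struct.pack(">if", 8, 2.25)
text_payload = b"7,1.5\n8,2.25\n"

from_binary = read_records(BINARY_FORMAT, binary_payload)
from_text = read_records(TEXT_FORMAT, text_payload)
```

Both payloads stay in their native format; only the descriptions differ, and the application sees the same records either way.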

Page 6

VX: Vectorizing XML

XML seems especially verbose for data files with simple, repeating structures
– e.g. VOTable: lots of <TR>s and <TD>s
Vectorize it:
– Decompose the XML document into a skeleton describing the structure and vectors containing data values
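The decomposition can be sketched in Python as follows. The sketch assumes every leaf element holds exactly one text value and groups values into one vector per element path. The function names are hypothetical, and the skeleton is kept as a plain uncompressed tree, whereas a real system would presumably also compress the repeated structure.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DOC = (
    "<TABLE>"
    "<TR><TD>1</TD><TD>10.5</TD></TR>"
    "<TR><TD>2</TD><TD>200.0</TD></TR>"
    "</TABLE>"
)


def vectorize(elem, path, vectors):
    """Return a text-free skeleton; append each leaf value to its path's vector."""
    path = f"{path}/{elem.tag}"
    skeleton = ET.Element(elem.tag)
    if len(elem) == 0:                      # leaf element: move its text out
        vectors[path].append(elem.text or "")
    for child in elem:
        skeleton.append(vectorize(child, path, vectors))
    return skeleton


def devectorize(skeleton, path, cursors, vectors):
    """Rebuild the document by refilling leaf values in document order."""
    path = f"{path}/{skeleton.tag}"
    elem = ET.Element(skeleton.tag)
    if len(skeleton) == 0:
        elem.text = vectors[path][cursors[path]]
        cursors[path] += 1
    for child in skeleton:
        elem.append(devectorize(child, path, cursors, vectors))
    return elem


vectors = defaultdict(list)
skeleton = vectorize(ET.fromstring(DOC), "", vectors)
rebuilt = devectorize(skeleton, "", defaultdict(int), vectors)
rebuilt_text = ET.tostring(rebuilt, encoding="unicode")
```

After `vectorize`, the skeleton carries only tags, and all the data sits in the vector for the path `/TABLE/TR/TD`. `devectorize` shows that nothing was lost.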

Page 7

Example: bibliography in XML

Page 8

Astronomical VX application

Export the PhotoObj Table from the SDSS EDR database into VOTable
– 360 columns and 10,000,000 rows
The Skeleton here is trivial
Querying the decomposed XML version can be as fast as querying the SkyServer database
– For queries where SkyServer doesn’t make heavy use of indexes, which are not in VX yet
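A rough Python sketch of why this can work: with a trivial skeleton, a selection only has to scan the vectors for the columns it mentions, much like a table scan without an index. The column names and values below are made up for illustration, and this is not the VX query engine.

```python
from array import array

# Hypothetical column vectors for a tiny table (not real SDSS data).
obj_id = array("q", [101, 102, 103, 104])
mag = array("d", [17.2, 21.9, 15.4, 19.0])


def select_brighter_than(limit):
    """Scan the magnitude vector and return the ids of matching rows."""
    return [obj_id[i] for i, m in enumerate(mag) if m < limit]


bright = select_brighter_than(18.0)
```

The scan touches two vectors and never the remaining columns or the XML markup, but without an index it still has to read every value in those two vectors.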

More: http://homepages.inf.ed.ac.uk/v1bchoi/paper.pdf