Status of LOFAR Data in HDF5 Format · 2011-11-23 · IDL, VisIt, HDFView, h5py, PyTables, DAL...

1
Abstract The Low Frequency Array (LOFAR) project is solving the challenge of data size and complexity using the Hierarchical Data Format, version 5 (HDF5). Most ofLOFAR's standard data products will be stored using the HDF5 format; the Beam‐Formed (aka Pulsar) and Transient Buffer Board (TBB) Time Series Data have transiQoned from project‐specific binary to HDF5 format. It supports an unlimited variety of datatypes, and is designed for flexible and efficient I/O and for high volume and complex data. HDF5 is portable, extensible, and parallelizable, allowing applicaQons to evolve in their use of HDF5. We report on our effort to pave the way towards new astronomical data encapsulaQon using HDF5, which can be used by future ground and space projects. The LOFAR project has formed a collaboraQon with NROA, the VOA and the HDF Group to obtain funding for a full‐Qme staff member to work on documenQng and developing standards for astronomical data wri[en in HDF5. We hope our effort will enhance HDF5 visibility and usage within the community, specifically for SKA Pathfinders and other major new radio telescopes such as LOFAR, EVLA, ALMA, ASKAP, MeerKAT, MWA, LWA, and eMERLIN. The LOFAR Radio Telescope WWW.LOFAR.ORG LOFAR Data in HDF5: Solves Variety, Complexity & Size Issues Time Series data share similar issues to all datasets produced by LOFAR observaQons ‐‐ they vary tremendously in type, size and complexity. Instead of using many different file formats to encapsulate LOFAR data, the Hierarchical Data Format, version 5 (HDF5) was chosen as a viable soluQon for potenQally massive LOFAR data products. The LOFAR “Superterp”, Exloo, NL (Fig. 1) LOFAR data staQsQcs: Data rates up to 8GB/sec File sizes, 10s to 100s TB Datasets ranging from 1 up to 6 dimensions Single dataset can be spread over mulQple disks Data wriQng must be mulQ‐threaded and parallelizable Most astronomical data containers (CASA MS or FITS, or ~30 pulsar data formats) could not meet all our needs The “LOw Frequency ARray” (LOFAR) Currently, the largest radio telescope in the world 48 staQons (40 NL, 8 InternaQonal); complete by end of 2011 (Figure 1 on right shows the center staBons in NL) Baselines from 1 ‐ 1500 km; ulQmately achieve sub‐arcsecond resoluQon over much of the band Low Band Antenna bandpass: 30‐80 MHz High Band Antenna bandpass: 120‐240 MHz Total Recordable Bandwidth: 48MHz; 100MHz TBBs Data CorrelaQon: IBM Blue Gene/P supercomputer, Groningen, NL (see photo in Figure 2 on right) Offline processing cluster has 100 nodes, each with: 24 cores, 64GB RAM, 21TB storage Long Term Archive (LTA) has: 2.2PB disk, 5PB tape Access to 22,600 cores via BigGrid and JUROPA LOFAR Beam‐Formed Time Series Data Format SpecificaQons in HDF5 1 Documents available from the LOFAR Wiki (“Data Products” link), h<p://lus.lofar.org/wiki 2 UV Visibility and Near Field Imaging are under development Each LOFAR dipole antenna has a ring buffer, transient buffer board, that can store up to 1.3 seconds of raw Qme series data sampled every 5ns (LOFAR’s highest Qme resoluQon). LOFAR acts as a “Qme machine” – a transient event triggers the buffered data to be wri[en to HDF5, providing up to 1.3 seconds of historical informaQon at full temporal and spaQal resoluQon. TBB’s are used to study: air showers produced by cosmic rays, lightning on Earth and other planetary bodies within our own solar system, and unknown sub‐second Qmescale astronomical transients. Transient Buffer Board (TBB) Data: look‐back into the “past” Summary, CollaboraQons And Future ConsideraQons HDF5‐friendly Tools: IDL, VisIt, HDFView, h5py, PyTables, DAL (LOFAR), PyDAL (LOFAR) LOFAR has started work on the next generaQon of its C++ Data Access Library (DAL + PyDAL), based on current Qme series data use cases and use pa[erns, with the goal of speed‐up, simplificaQon of use and be[er python bindings. We also intend to expand the DAL for use with addiQonal LOFAR data formats and to be able write converters to/ from FITS, the Casa Tables Measurement Sets and HDF5. Improvements coming in 2012! LOFAR has teamed up with NRAO , the IVOA and The HDF Group in requesQng funding for defining and developing astronomical data standards in HDF5, based on LOFARs ICD and DAL ground work. This effort is nicknamed “AstroHDF ”. Future work also involves developing converts to/from HDF5/FITS and collaboraQons on interfaces in DS9 and VisIt for HDF5 (WCS‐compliant) astronomical data. We believe that the HDF5 format is well‐suited for complex and large astronomical data. For some of LOFAR’s data format challenges, HDF5 is the only viable soluQon. In HDF5, LOFAR can easily map 10s of TB of complex, 6‐dimensional data structures with intricate header and logging informaQon throughout the data‐files. We feel that HDF5 can meet the data needs and demands of future projects such as LSST and the SKA, where size and complexity will be even greater of an issue than faced by LOFAR. The Blue Gene/P supercomputer at Groningen (Fig. 2) LOFAR Interface Control Documents 1 (ICDs) provide detailed descripQons of all expected LOFAR Data Products in HDF5 format: Beam‐Formed (BF) Data [shown on the leF] Transient Buffer Board (TBB) Time Series [bo<om leF of poster] Radio Sky Image Cubes Dynamic Spectrum Data RotaRon Measure (RM) Synthesis Cubes UV Visibility Data 2 Near‐field Images 2 Element Beam Sub‐Array PoinRng LOFAR StaRons Tied‐Array Beams DATA (I) TIME FREQUENCY DATA (Q) TIME FREQUENCY DATA (U) TIME FREQUENCY DATA (V) TIME FREQUENCY Binary Data Container HDF5 Container: Header & Data Structure Layout Header + Data == HDF5 “File” (Beam Formed) LOFAR Data Access Library (DAL): C++ and Python Data I/O Linux Cluster “Offline” Known Pulsar Pipeline Converter step to be removed when DAL layer is completed. Data Blue Gene/P “Online” Tied‐Array Beam Pipeline For the LOFAR project, a data product is defined within the context of HDF5. HDF5 allows for storage, not only of the data, but for the associated and related meta‐data describing the data’s contents, condiQons of observaQons, logs, etc.. As an “all‐in‐one” wrapper, the HDF5 format simplifies the management of complex datasets. Describes Layout LOFAR Data Format ICD Beam-Formed Data Document ID: LOFAR-USG-ICD-003.tex ISSUE 2.04.08 SVN Repository Revision: 8333 A. Alexov, K. Anderson, L. B¨ahren, J.-M. Grießmeier,J.W.T. Hessels, J.S. Masters, B.W. Stappers Contents Change record 3 1. Introduction 8 1.1. Purpose and Scope .......................................... 8 1.2. Context and Motivation ....................................... 8 1.3. Applicable documents ........................................ 8 2. Overview 9 3. Organization of the data 10 3.1. High level LOFAR BF file structure ................................ 10 3.2. Overview of BF Groups ....................................... 10 3.3. Hierarchical Structure of the BF H5 File.............................. 13 4. Detailed Data Specification 14 4.1. The Root Group (ROOT) ....................................... 14 4.1.1. Common LOFAR Attributes ................................ 15 4.1.2. Additional Beam-Formed Root Attributes ......................... 18 4.2. The System Logs Group (SYS LOG) ................................. 18 4.3. The Sub-Array Pointing Group (SUB ARRAY POINTING {NNN}) .................. 18 4.4. The Processing History Group (PROCESS HISTORY) ........................ 18 4.5. The Beam Group (BEAM {NNN}) ................................... 22 4.6. The Coordinates Group (COORDINATES) .............................. 24 4.6.1. The Time coordinate..................................... 24 4.6.2. The Spectral coordinate ................................... 25 4.7. The Stokes Datasets (STOKES {N}) ................................. 27 4.8. The Subband/Channel tables/arrays ................................ 27 4.9. TiedArray Observing Sub-Modes & Data Format ......................... 28 4.9.1. Standard Observing Mode for TiedArray ......................... 28 4.9.2. Sub-Observing Modes for TiedArray ............................ 28 1 Why use HDF5? Format for managing and storing large and complex scienQfic data Unlimited variety of datatypes No inherent file size limitaQons Parallel reading and wriQng (MPI) Runs on massively parallel/ distributed systems Built‐in compression Library API in C/C++/Java/Fortran Self‐describing and portable Free & has 20+ year history @NASA Defines SW SpecificaQon Status of LOFAR Data in HDF5 Format Anastasia Alexov 1,2 , Pim Schellart 3 , Sander ter Veen 3 and the LOFAR ICD Group 1: University of Amsterdam (UvA) [[email protected]], 2: Netherlands InsQtute for Radio Astronomy (ASTRON); 3: Radboud University Nijmegen StaQon Correlator Transient Buffer Board Ring Buffer, 1.3s LOFAR antenna signal LOFAR Correlator Images Beam‐Formed Data (HDF5) Time Series Data Writer (HDF5) Pipeline Soxware (HDF5) A/D Converter 48 MHz 100 MHz LOFAR Data Access Library (DAL): C++ and Python Data I/O LOFAR TBB Data Flow Hear more at: HDF5 BoF Wed Nov 9 th @ 17:15 LOFAR project is developing the C++ Data Access Library (DAL), which provides full scope constructors for creaQng and accessing LOFAR data products, using the ICD specificaQons. TBB and Beam‐Formed have been implemented. Crab Giant Pulse (@center), as seen by the TBB boards at 0.015 sec sampling Bme CollaboraQve mailing list info: nextgen‐[email protected] Sign up: [email protected] w/message body “subscribed nextgen‐astrodata” TBB HDF5 Data Stats: 100 MHz of bandwidth 1.3 sec of look-back time 40MB - 48GB dataset/station 2GB - 2.25TB dataset sizes Data maps easily to HDF5 2012 Upgrade: 5 sec look-back 2012 Upgrade: 8.6 TB dataset AS T R A U N IJ E N U I R O M T T D E N H Y S I C S D B O D V ER G N E M P E A S T Y I P R F O www.hdfgroup.org Hear more at: HDF5 BoF Wed Nov 9 th @ 17:15

Transcript of Status of LOFAR Data in HDF5 Format · 2011-11-23 · IDL, VisIt, HDFView, h5py, PyTables, DAL...

Page 1: Status of LOFAR Data in HDF5 Format · 2011-11-23 · IDL, VisIt, HDFView, h5py, PyTables, DAL (LOFAR), PyDAL(LOFAR) LOFAR has started work on the next generaon of its C++ Data Access

Abstract

TheLowFrequencyArray(LOFAR)projectissolvingthechallengeofdatasizeandcomplexityusingtheHierarchicalDataFormat,version5(HDF5).MostofLOFAR'sstandarddataproductswillbestoredusingtheHDF5format;theBeam‐Formed(akaPulsar)andTransientBufferBoard(TBB)TimeSeriesDatahavetransiQonedfromproject‐specificbinarytoHDF5format.Itsupportsanunlimitedvarietyofdatatypes,andisdesignedforflexibleandefficientI/Oandforhighvolumeandcomplexdata. HDF5isportable,extensible,andparallelizable,allowingapplicaQonstoevolveintheiruseofHDF5.WereportonourefforttopavethewaytowardsnewastronomicaldataencapsulaQonusingHDF5,whichcanbeusedbyfuturegroundandspaceprojects.TheLOFARprojecthasformedacollaboraQonwithNROA,theVOAandtheHDFGrouptoobtainfundingforafull‐QmestaffmembertoworkondocumenQnganddevelopingstandardsforastronomicaldatawri[eninHDF5.WehopeoureffortwillenhanceHDF5visibilityandusagewithinthecommunity,specificallyforSKAPathfindersandothermajornewradiotelescopessuchasLOFAR,EVLA,ALMA,ASKAP,MeerKAT,MWA,LWA,andeMERLIN.

TheLOFARRadioTelescope

WWW.LOFAR.ORG

LOFARDatainHDF5:SolvesVariety,Complexity&SizeIssuesTimeSeriesdatasharesimilarissuestoalldatasetsproducedbyLOFARobservaQons‐‐theyvarytremendouslyintype,sizeandcomplexity.InsteadofusingmanydifferentfileformatstoencapsulateLOFARdata,theHierarchicalDataFormat,version5(HDF5)waschosenasaviablesoluQonforpotenQallymassiveLOFARdataproducts.

TheLOFAR“Superterp”,Exloo,NL(Fig.1)

LOFARdatastaQsQcs:•  Dataratesupto8GB/sec•  Filesizes,10sto100sTB•  Datasetsrangingfrom1upto6dimensions

•  SingledatasetcanbespreadovermulQpledisks

•  DatawriQngmustbemulQ‐threadedandparallelizable

•  Mostastronomicaldatacontainers(CASAMSorFITS,or~30pulsardataformats)couldnotmeetallourneeds

• The“LOwFrequencyARray”(LOFAR)• Currently,thelargestradiotelescopeintheworld• 48staQons(40NL,8InternaQonal);completebyendof2011(Figure1onrightshowsthecenterstaBonsinNL)• Baselinesfrom1‐1500km;ulQmatelyachievesub‐arcsecondresoluQonovermuchoftheband• LowBandAntennabandpass:30‐80MHz• HighBandAntennabandpass:120‐240MHz• TotalRecordableBandwidth:48MHz;100MHzTBBs• DataCorrelaQon:IBMBlueGene/Psupercomputer,Groningen,NL(seephotoinFigure2onright)• Offlineprocessingclusterhas100nodes,eachwith:24cores,64GBRAM,21TBstorage• LongTermArchive(LTA)has:2.2PBdisk,5PBtape• Accessto22,600coresviaBigGridandJUROPA

LOFARBeam‐FormedTimeSeriesDataFormatSpecificaQonsinHDF5

1 DocumentsavailablefromtheLOFARWiki(“DataProducts”link),h<p://lus.lofar.org/wiki2 UVVisibilityandNearFieldImagingareunderdevelopment

EachLOFARdipoleantennahasaringbuffer,transientbufferboard,thatcanstoreupto1.3secondsofrawQmeseriesdatasampledevery5ns(LOFAR’shighestQmeresoluQon).LOFARactsasa“Qmemachine”–atransienteventtriggersthebuffereddatatobewri[entoHDF5,providingupto1.3secondsofhistoricalinformaQonatfulltemporalandspaQalresoluQon.

TBB’s are used to study: air showers producedby cosmic rays, lightning onEarthandotherplanetarybodieswithinourownsolarsystem,andunknown

sub‐secondQmescaleastronomicaltransients.

TransientBufferBoard(TBB)Data:look‐backintothe“past” Summary,CollaboraQonsAndFutureConsideraQons

HDF5‐friendlyTools:IDL,VisIt,HDFView, h5py,PyTables,DAL(LOFAR),PyDAL(LOFAR)

LOFARhasstartedworkonthenextgeneraQonof itsC++DataAccessLibrary(DAL+PyDAL),basedoncurrentQmeseriesdatausecasesandusepa[erns,withthegoalofspeed‐up,simplificaQonofuseandbe[erpythonbindings. WealsointendtoexpandtheDALforusewithaddiQonalLOFARdataformatsandtobeablewriteconvertersto/from FITS, the Casa TablesMeasurement Sets and HDF5. Improvements coming in2012!

LOFARhasteamedupwithNRAO,theIVOAandTheHDFGroupinrequesQngfundingfordefininganddevelopingastronomicaldatastandardsinHDF5,basedonLOFARsICDandDALgroundwork.Thiseffortisnicknamed“AstroHDF”.Futureworkalsoinvolvesdeveloping converts to/from HDF5/FITS and collaboraQons on interfaces inDS9 andVisItforHDF5(WCS‐compliant)astronomicaldata.

We believe that the HDF5 format is well‐suited for complex and large astronomicaldata.ForsomeofLOFAR’sdataformatchallenges,HDF5istheonlyviablesoluQon.InHDF5,LOFARcaneasilymap10sofTBofcomplex,6‐dimensionaldatastructureswithintricateheaderandlogginginformaQonthroughoutthedata‐files. WefeelthatHDF5canmeet thedata needs anddemandsof futureprojects such as LSST and the SKA,wheresizeandcomplexitywillbeevengreaterofanissuethanfacedbyLOFAR.

TheBlueGene/PsupercomputeratGroningen(Fig.2)

LOFARInterfaceControlDocuments1(ICDs)providedetaileddescripQonsofallexpectedLOFARDataProductsinHDF5format:

• Beam‐Formed(BF)Data[shownontheleF]• TransientBufferBoard(TBB)TimeSeries[bo<omleFofposter]• RadioSkyImageCubes• DynamicSpectrumData• RotaRonMeasure(RM)SynthesisCubes• UVVisibilityData2• Near‐fieldImages2

ElementBeam

Sub‐ArrayPoinRng

LOFARStaRons

Tied‐ArrayBeams

DATA(I)

TIME

FREQUENCY

DATA(Q)

TIME

FREQUENCY

DATA(U)

TIME

FREQUENCY

DATA(V)

TIME

FREQUENCY

BinaryDataContainer

HDF5Container:Header&DataStructureLayout

Header+Data==

HDF5“File”(BeamFormed)

LOFARDataAccessLibrary(DAL):C++andPythonDataI/O

LinuxCluster“Offline”KnownPulsarPipeline

ConvertersteptoberemovedwhenDALlayeriscompleted.

Data

BlueGene/P“Online”Tied‐ArrayBeamPipeline

FortheLOFARproject,adataproductisdefinedwithinthecontextofHDF5.HDF5allowsforstorage,notonlyofthedata, but for the associated and related meta‐datadescribingthedata’scontents,condiQonsofobservaQons,logs, etc.. As an “all‐in‐one” wrapper, the HDF5 formatsimplifiesthemanagementofcomplexdatasets.DescribesLayout

LOFAR Data Format ICDBeam-Formed DataDocument ID: LOFAR-USG-ICD-003.tex

ISSUE 2.04.08

SVN Repository Revision: 8333

A. Alexov, K. Anderson, L. Bahren, J.-M. Grießmeier, J.W.T. Hessels,

J.S. Masters, B.W. Stappers

SVN Date: 2011-09-02

Contents

Change record 3

1. Introduction 81.1. Purpose and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.2. Context and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81.3. Applicable documents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2. Overview 9

3. Organization of the data 103.1. High level LOFAR BF file structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2. Overview of BF Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.3. Hierarchical Structure of the BF H5 File . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

4. Detailed Data Specification 144.1. The Root Group (ROOT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

4.1.1. Common LOFAR Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.1.2. Additional Beam-Formed Root Attributes . . . . . . . . . . . . . . . . . . . . . . . . . 18

4.2. The System Logs Group (SYS LOG) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184.3. The Sub-Array Pointing Group (SUB ARRAY POINTING {NNN}) . . . . . . . . . . . . . . . . . . 184.4. The Processing History Group (PROCESS HISTORY) . . . . . . . . . . . . . . . . . . . . . . . . 184.5. The Beam Group (BEAM {NNN}) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224.6. The Coordinates Group (COORDINATES) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

4.6.1. The Time coordinate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244.6.2. The Spectral coordinate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4.7. The Stokes Datasets (STOKES {N}) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.8. The Subband/Channel tables/arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274.9. TiedArray Observing Sub-Modes & Data Format . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.9.1. Standard Observing Mode for TiedArray . . . . . . . . . . . . . . . . . . . . . . . . . 284.9.2. Sub-Observing Modes for TiedArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

1

WhyuseHDF5?•  FormatformanagingandstoringlargeandcomplexscienQficdata•  Unlimitedvarietyofdatatypes•  NoinherentfilesizelimitaQons•  ParallelreadingandwriQng(MPI)•  Runsonmassivelyparallel/distributedsystems•  Built‐incompression•  LibraryAPIinC/C++/Java/Fortran•  Self‐describingandportable•  Free&has20+yearhistory@NASA

DefinesSWSpecificaQon

StatusofLOFARDatainHDF5FormatAnastasiaAlexov1,2,PimSchellart3,SanderterVeen3andtheLOFARICDGroup

1:UniversityofAmsterdam(UvA)[[email protected]],2:NetherlandsInsQtuteforRadioAstronomy(ASTRON);3:RadboudUniversityNijmegen

StaQonCorrelator

TransientBufferBoardRingBuffer,1.3s

LOFARantennasignal

LOFARCorrelator

ImagesBeam‐FormedData(HDF5)

TimeSeriesDataWriter(HDF5)

PipelineSoxware(HDF5)

A/DConverter

48MHz

100MHz

LOFARDataAccessLibrary(DAL):C++andPythonDataI/O

LOFARTBBDataFlow

Hearmoreat:HDF5BoFWedNov9th@17:15

LOFARprojectisdevelopingtheC++DataAccessLibrary(DAL),whichprovidesfullscopeconstructorsforcreaQngandaccessingLOFARdataproducts,usingtheICDspecificaQons.TBBandBeam‐Formedhavebeenimplemented.

CrabGiantPulse(@center),asseenbytheTBBboardsat0.015secsamplingBme

CollaboraQvemailinglistinfo:nextgen‐[email protected]:[email protected]/messagebody“subscribednextgen‐astrodata”

TBB HDF5 Data Stats: •  100 MHz of bandwidth •  1.3 sec of look-back time •  40MB - 48GB dataset/station •  2GB - 2.25TB dataset sizes •  Data maps easily to HDF5 •  2012 Upgrade: 5 sec look-back •  2012 Upgrade: 8.6 TB dataset

AST

RA

U

NIJ

E

NUI

R

O

M

T

T

D

EN

HY

SIC

S

DBO

D

VER

GN

EM

PEA

S TYI

P

R

FO

www.hdfgroup.org

Hearmoreat:HDF5BoF

WedNov9th@17:15