Stephen Gwyn Canadian Astronomy Data Centre Aggregating Metadata from Multiple Archives: a Non-VO...

Post on 01-Apr-2015

Stephen Gwyn, Canadian Astronomy Data Centre

Aggregating Metadata from Multiple Archives: a Non-VO Approach

- Astronomy is using more and more archival data
  - More than 50% of HST papers are archival
  - Similar trends for other telescopes
- Harder for solar system astronomy

SSOIS: Solar System Object Image Search allows users to search for images of moving targets

Initially, only data from CFHT/MegaCam was searched

Next, data from external telescope archives were added:

NEAT, CFHT, Subaru, ESO, Gemini, AAT, SDSS, NOAO, ING

Scraping external archives:

For each image, we need:

- position (RA, Dec)
- field of view
- MJD of mid-exposure
- filter
- exposure time
- target name
- URL to data
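The per-image metadata listed on this slide maps naturally onto a small record type. The following is only a sketch: the field names, the units (degrees, seconds) and the helper `mid_exposure_mjd` are assumptions made for illustration, not the schema SSOIS actually uses. The helper covers the common case where an archive reports the exposure start time rather than the mid-exposure time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """One harvested image; field names and units are illustrative assumptions."""
    ra_deg: float       # position: right ascension, degrees
    dec_deg: float      # position: declination, degrees
    fov_deg: float      # field of view, degrees
    mjd_mid: float      # MJD of mid-exposure
    filter_name: str    # filter
    exptime_s: float    # exposure time, seconds
    target: str         # target name
    url: str            # URL to data


def mid_exposure_mjd(mjd_start, exptime_s):
    """Shift an exposure start time to mid-exposure (86400 seconds per day)."""
    return mjd_start + 0.5 * exptime_s / 86400.0


if __name__ == "__main__":
    rec = ImageRecord(
        ra_deg=12.87, dec_deg=13.52, fov_deg=1.0,
        mjd_mid=mid_exposure_mjd(57323.0, 300.0),
        filter_name="r", exptime_s=300.0,
        target="example target", url="http://astronomydata.edu/image/1",
    )
    print(rec)
```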

There are a variety of data archive interfaces....

- In an ideal world: one query to get all metadata
- In real life: row limits
- As the archives are updated, they need to be re-scraped periodically
- Programmatic retrieval is required
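One possible way to cope with row limits, sketched here under assumptions: query a time window and, if the result is as long as the row limit (and so may have been truncated), split the window in half and query each half. The `fetch(start, end)` callable is hypothetical; it stands for whatever archive-specific query returns the rows with `start <= MJD < end`. This is an illustration of the idea, not the harvesting code behind SSOIS.

```python
def harvest(fetch, start, end, row_limit, min_width=1e-4):
    """Collect all rows in [start, end), splitting windows that hit the row limit."""
    rows = fetch(start, end)
    # Fewer rows than the limit means the result cannot have been truncated.
    # min_width stops the recursion if a tiny window is still saturated.
    if len(rows) < row_limit or (end - start) <= min_width:
        return rows
    mid = (start + end) / 2.0
    return (harvest(fetch, start, mid, row_limit, min_width)
            + harvest(fetch, mid, end, row_limit, min_width))


if __name__ == "__main__":
    # Stand-in for a real archive: 1000 observations, at most 100 rows per query.
    archive = [57000 + i * 0.01 for i in range(1000)]

    def fake_fetch(a, b):
        return [m for m in archive if a <= m < b][:100]

    print(len(harvest(fake_fetch, 57000, 57010, 100)))
```

Half-open windows keep the two halves from returning the same row twice.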

Use SIAP?

Advantages:
- A single tool can scrape multiple archives

Disadvantages:
- Not all archives have an SIAP interface
- Many SIAP services do not conform to the VO standard
- Not all SIAP services contain all the necessary metadata
- Most archives have at least one heavily observed patch of sky: hit the row limit again
- SIAP services vary in ability for positional queries
  - maximum search area
  - search is circle or box
- May require 10^5 queries: may be perceived as a DoS attack

Far better off scraping by day/night/MJD:
- Almost all telescopes take <10000 observations per 24 hours
- Can re-scrape with fewer queries

Scraping by RA/Dec
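To get a feel for why scraping by position takes so many queries, here is a rough count of the boxes needed to tile the whole sky. The 0.5 degree box side is an assumed maximum search size chosen for illustration; real services differ, and the estimate ignores overlap between tiles and the distortion of boxes near the poles.

```python
import math

# Area of the full sky: 4*pi steradians, about 41253 square degrees.
SKY_AREA_SQ_DEG = 4.0 * math.pi * (180.0 / math.pi) ** 2


def queries_to_tile_sky(box_side_deg):
    """Lower bound on the number of square search boxes needed to cover the sky."""
    return math.ceil(SKY_AREA_SQ_DEG / box_side_deg ** 2)


if __name__ == "__main__":
    print(queries_to_tile_sky(0.5))
```

With that assumed box size the count comes out above one hundred thousand, before any re-scraping.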

Scraping by Date
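Scraping by date can be sketched as one query per night, each covering a one-day MJD window. In this sketch the parameter names `mjd_min` and `mjd_max` are hypothetical, and the base URL is the illustrative `astronomydata.edu` address used elsewhere in these slides; a real archive will have its own parameters.

```python
import datetime
from urllib.parse import urlencode

MJD_EPOCH = datetime.date(1858, 11, 17)  # MJD 0 falls on this calendar date


def mjd_of(day):
    """Integer MJD at 0h UT of the given calendar date."""
    return (day - MJD_EPOCH).days


def nightly_queries(base_url, first_day, last_day):
    """Build one query URL per 24-hour window from first_day to last_day inclusive."""
    urls = []
    day = first_day
    while day <= last_day:
        start = mjd_of(day)
        # Hypothetical parameter names; substitute the archive's own.
        params = {"mjd_min": start, "mjd_max": start + 1}
        urls.append(base_url + "?" + urlencode(params))
        day += datetime.timedelta(days=1)
    return urls


if __name__ == "__main__":
    for url in nightly_queries("http://astronomydata.edu/query",
                               datetime.date(2015, 10, 28),
                               datetime.date(2015, 10, 30)):
        print(url)
```

Re-scraping then only needs the windows since the last harvest rather than the whole sky again.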

Older archive interfaces:
- Query page + simple CGI result page
- view source on the query page
- get form inputs
- issue repeated queries to CGI result page using GET or POST with wget/curl/scripting API
- Easy

http://astronomydata.edu/query?ra=12.87&dec=13.52&mjd=57323
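The "view source, get form inputs, issue repeated queries" recipe can be sketched with the Python standard library. The HTML snippet below is a made-up stand-in for an archive query page, and the class and function names are hypothetical; the resulting URL mirrors the illustrative one above.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormInputCollector(HTMLParser):
    """Record the form action, method and named inputs of a query page."""

    def __init__(self):
        super().__init__()
        self.action = None
        self.method = "get"
        self.inputs = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.action = attrs.get("action")
            self.method = (attrs.get("method") or "get").lower()
        elif tag in ("input", "select") and attrs.get("name"):
            # Keep the page's default value; empty string if none is given.
            self.inputs[attrs["name"]] = attrs.get("value") or ""


def build_get_url(base_url, defaults, **overrides):
    """Combine the form's default inputs with per-query values into a GET URL."""
    params = dict(defaults)
    params.update(overrides)
    return base_url + "?" + urlencode(params)


# Made-up query page, standing in for the HTML seen with "view source".
SAMPLE_PAGE = (
    '<form action="/query" method="get">'
    '<input name="ra" value=""><input name="dec" value="">'
    '<input name="mjd" value=""></form>'
)

if __name__ == "__main__":
    collector = FormInputCollector()
    collector.feed(SAMPLE_PAGE)
    print(build_get_url("http://astronomydata.edu/query", collector.inputs,
                        ra=12.87, dec=13.52, mjd=57323))
```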

Newer archive interfaces:
- AJAX/HTML5/etc. page
- Download JavaScript and run through de-obfuscator
- locate relevant XMLHttpRequest
- determine if cookies are necessary
- issue repeated queries to XMLHttpRequest URLs
- Much harder
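Once the relevant XMLHttpRequest has been located, replaying it might look like the sketch below. Everything specific here is an assumption: the endpoint path, the JSON payload and the `X-Requested-With` header are typical of AJAX back ends but must be read off the actual page's JavaScript. The cookie-aware opener covers the case where cookies turn out to be necessary.

```python
import json
import urllib.request
from http.cookiejar import CookieJar


def make_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_xhr_request(endpoint, payload):
    """Build a POST request that imitates the page's own XMLHttpRequest."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Requested-With": "XMLHttpRequest",
        },
        method="POST",
    )


if __name__ == "__main__":
    opener, jar = make_session()
    # Hypothetical endpoint and payload; a real run would first call
    # opener.open(<query page URL>) to pick up cookies, then opener.open(req).
    req = build_xhr_request("http://astronomydata.edu/api/search",
                            {"mjd_min": 57323, "mjd_max": 57324})
    print(req.get_method(), req.full_url)
```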

Easiest of all...

http://smoka.nao.ac.jp/status/obslog/SUP_2007.txt

A script to get all Subaru/SuprimeCam metadata...

#!/bin/bash
wget http://smoka.nao.ac.jp/status/obslog/SUP_1999.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2000.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2001.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2002.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2003.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2004.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2005.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2006.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2007.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2008.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2009.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2010.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2011.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2012.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2013.txt
wget http://smoka.nao.ac.jp/status/obslog/SUP_2014.txt

The second easiest: CADC's Advanced Search

http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync?LANG=ADQL&REQUEST=doQuery&QUERY=SELECT%20Observation.observationURI%20AS%20%22Preview%22%2C%20Observation.collection%20AS%20%22Collection%22%2C%20Observation.observationID%20AS%20%22Obs.%20ID%22%2C%20COORD1(CENTROID(Plane.position_bounds))%20AS%20%22RA%20(J2000.0)%22%2C%20COORD2(CENTROID(Plane.position_bounds))%20AS%20%22Dec.%20(J2000.0)%22%2C%20Plane.time_bounds_cval1%20AS%20%22Start%20Date%22%2C%20Observation.instrument_name%20AS%20%22Instrument%22%2C%20Plane.time_exposure%20AS%20%22Int.%20Time%22%2C%20Observation.target_name%20AS%20%22Target%20Name%22%2C%20Plane.energy_bandpassName%20AS%20%22Filter%22%2C%20Plane.calibrationLevel%20AS%20%22Cal.%20Lev.%22%2C%20Observation.type%20AS%20%22Obs.%20Type%22%2C%20Plane.energy_bounds_cval1%20AS%20%22Min.%20Wavelength%22%2C%20Plane.energy_bounds_cval2%20AS%20%22Max.%20Wavelength%22%2C%20Observation.proposal_id%20AS%20%22Proposal%20ID%22%2C%20Observation.proposal_pi%20AS%20%22P.I.%20Name%22%2C%20Plane.productID%20AS%20%22Product%20ID%22%2C%20Plane.dataRelease%20AS%20%22Data%20Release%22%2C%20AREA(Plane.position_bounds)%20AS%20%22Field%20of%20View%22%2C%20Plane.position_sampleSize%20AS%20%22Pixel%20Scale%22%2C%20Plane.dataProductType%20AS%20%22Data%20Type%22%2C%20Plane.position_timeDependent%20AS%20%22Moving%20Target%22%2C%20Plane.provenance_name%20AS%20%22Provenance%20Name%22%2C%20Plane.provenance_keywords%20AS%20%22Provenance%20Keywords%22%2C%20Observation.intent%20AS%20%22Intent%22%2C%20Observation.target_type%20AS%20%22Target%20Type%22%2C%20Observation.target_standard%20AS%20%22Target%20Standard%22%2C%20Plane.metaRelease%20AS%20%22Meta%20Release%22%2C%20Observation.sequenceNumber%20AS%20%22Sequence%20Number%22%2C%20Observation.algorithm_name%20AS%20%22Algorithm%20Name%22%2C%20Observation.proposal_title%20AS%20%22Proposal%20Title%22%2C%20Observation.proposal_keywords%20AS%20%22Proposal%20Keywords%22%2C%20Observation.proposal_project%20AS%20%22Proposal%20Project%22%2C%20Plane.position_bounds%20AS%20%22Polygon%22%2C%20Plane.energy_emBand%20AS%20%22Band%22%2C%20Plane.provenance_reference%20AS%20%22Prov.%20Reference%22%2C%20Plane.provenance_version%20AS%20%22Prov.%20Version%22%2C%20Plane.provenance_project%20AS%20%22Prov.%20Project%22%2C%20Plane.provenance_producer%20AS%20%22Prov.%20Producer%22%2C%20Plane.provenance_runID%20AS%20%22Prov.%20Run%20ID%22%2C%20Plane.provenance_lastExecuted%20AS%20%22Prov.%20Last%20Executed%22%2C%20Plane.provenance_inputs%20AS%20%22Prov.%20Inputs%22%2C%20Plane.energy_restwav%20AS%20%22Rest-frame%20Spectral%20Coverage%22%2C%20Plane.planeID%20AS%20%22planeID%22%2C%20isDownloadable(Plane.planeURI)%20AS%20%22DOWNLOADABLE%22%2C%20Plane.planeURI%20AS%20%22CAOM%20Plane%20URI%22%2C%20Observation.instrument_keywords%20AS%20%22Instrument%20Keywords%22%2C%20Plane.energy_transition_species%20AS%20%22Molecule%22%2C%20Plane.energy_transition_transition%20AS%20%22Transition%22%2C%20Plane.position_resolution%20AS%20%22IQ%22%20FROM%20caom2.Plane%20AS%20Plane%20JOIN%20caom2.Observation%20AS%20Observation%20ON%20Plane.obsID%20%3D%20Observation.obsID%20WHERE%20%20(%20Observation.instrument_name%20%3D%20%27MegaPrime%27%20AND%20Observation.collection%20%3D%20%27CFHT%27%20)&FORMAT=tsv

SELECT Observation.observationURI AS "Preview",
       Observation.collection AS "Collection",
       Observation.observationID AS "Obs. ID",
       COORD1(CENTROID(Plane.position_bounds)) AS "RA (J2000.0)",
       COORD2(CENTROID(Plane.position_bounds)) AS "Dec. (J2000.0)",
       Plane.time_bounds_cval1 AS "Start Date",
       Observation.instrument_name AS "Instrument",
       Plane.time_exposure AS "Int. Time",
       Observation.target_name AS "Target Name",
       Plane.energy_bandpassName AS "Filter",
       Plane.calibrationLevel AS "Cal. Lev.",
       Observation.type AS "Obs. Type",
       Plane.energy_bounds_cval1 AS "Min. Wavelength",
       Plane.energy_bounds_cval2 AS "Max. Wavelength",
       Observation.proposal_id AS "Proposal ID",
       Observation.proposal_pi AS "P.I. Name",
       Plane.productID AS "Product ID",
       Plane.dataRelease AS "Data Release",
       AREA(Plane.position_bounds) AS "Field of View",
       Plane.position_sampleSize AS "Pixel Scale",
       Plane.dataProductType AS "Data Type",
       Plane.position_timeDependent AS "Moving Target",
       Plane.provenance_name AS "Provenance Name",
       Observation.intent AS "Intent",
       Observation.target_type AS "Target Type",
       Observation.target_standard AS "Target Standard",
       Observation.sequenceNumber AS "Sequence Number",
       Observation.algorithm_name AS "Algorithm Name",
       Observation.proposal_title AS "Proposal Title",
       Observation.proposal_keywords AS "Proposal Keywords",
       Plane.energy_emBand AS "Band",
       Plane.provenance_version AS "Prov. Version",
       Plane.provenance_project AS "Prov. Project",
       Plane.provenance_runID AS "Prov. Run ID",
       Plane.provenance_lastExecuted AS "Prov. Last Executed",
       Plane.energy_restwav AS "Rest-frame Spectral Coverage",
       isDownloadable(Plane.planeURI) AS "DOWNLOADABLE",
       Plane.planeURI AS "CAOM Plane URI",
       Observation.instrument_keywords AS "Instrument Keywords",
       Plane.energy_transition_species AS "Molecule",
       Plane.energy_transition_transition AS "Transition",
       Plane.position_resolution AS "IQ"
FROM caom2.Plane AS Plane
JOIN caom2.Observation AS Observation ON Plane.obsID = Observation.obsID
WHERE ( Observation.collection = 'CFHT' )
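A URL of this kind can be assembled programmatically instead of copied from the browser. The sketch below reuses the base URL and the `LANG`, `REQUEST`, `QUERY` and `FORMAT` parameters seen in the slides; the trimmed column list is my own selection of the fields SSOIS needs, and the endpoint may have changed since the talk, so treat it as illustrative.

```python
from urllib.parse import urlencode, quote, urlsplit, parse_qs

# Base URL as it appears in the slides; it may no longer be current.
TAP_SYNC = "http://www1.cadc-ccda.hia-iha.nrc-cnrc.gc.ca/tap/sync"

# Columns and WHERE clause taken from the query above, trimmed to the
# per-image metadata (position, time, exposure, filter, target, footprint).
ADQL = (
    "SELECT Observation.observationID, "
    "COORD1(CENTROID(Plane.position_bounds)), "
    "COORD2(CENTROID(Plane.position_bounds)), "
    "Plane.time_bounds_cval1, Plane.time_exposure, "
    "Plane.energy_bandpassName, Observation.target_name, "
    "AREA(Plane.position_bounds) "
    "FROM caom2.Plane AS Plane "
    "JOIN caom2.Observation AS Observation ON Plane.obsID = Observation.obsID "
    "WHERE ( Observation.instrument_name = 'MegaPrime' "
    "AND Observation.collection = 'CFHT' )"
)


def tap_url(adql, fmt="tsv"):
    """Percent-encode an ADQL query into a synchronous TAP request URL."""
    params = {"LANG": "ADQL", "REQUEST": "doQuery", "QUERY": adql, "FORMAT": fmt}
    # quote_via=quote encodes spaces as %20, as in the URL from the slides.
    return TAP_SYNC + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    print(tap_url(ADQL))
```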

The other hard part:

- Parsing downloaded metadata

- Which observations are images?

- Quality control
  - Is MJD right?
  - Are coordinates 2000.0 or 1950.0?

- Sorting out filters:
  - remove narrow band filter data
  - remove bad filters
  - remove grism data
  - maybe homogenize filter names (B vs Bj vs Bjohnson vs Johnson B vs ...)

- Telescope footprint not typically part of the metadata

- Work out links back to original images
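Two of the parsing chores above, homogenizing filter names and sanity-checking MJDs, can be sketched as follows. The alias table contains only the B-band variants quoted on the slide, and the function names and date bounds are assumptions; a real harvester would need a table per instrument.

```python
import re

# Hypothetical alias table built from the variants named on the slide.
FILTER_ALIASES = {
    "b": "B",
    "bj": "B",
    "bjohnson": "B",
    "johnsonb": "B",
}


def normalize_filter(name):
    """Map known spellings of a filter to one canonical name; pass others through."""
    key = re.sub(r"[^a-z0-9]", "", name.lower())
    return FILTER_ALIASES.get(key, name.strip())


def plausible_mjd(mjd, first_light_mjd, today_mjd):
    """Reject dates outside the telescope's lifetime (e.g. a JD stored as an MJD)."""
    return first_light_mjd <= mjd <= today_mjd


if __name__ == "__main__":
    for raw in ["B", "Bj", "Bjohnson", "Johnson B", "r.MP9601"]:
        print(raw, "->", normalize_filter(raw))
    print(plausible_mjd(2457323.5, 51544, 60000))
```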

SSOIS saves the Earth....

Summary:

- SSOIS allows multi-archive searches for moving objects
- Metadata is harvested from external archives
- Lessons learned:
  - SIAP is not useful for metadata harvesting
  - multiple queries by time not by position
  - older interfaces are easier to scrape
  - parsing metadata often harder than retrieving it
