
Managing Data Quality in a Terabyte-scale Sensor Archive

Bryce Cutt, Ramon Lawrence
University of British Columbia Okanagan
Kelowna, British Columbia, Canada
[email protected]
http://people.ok.ubc.ca/rlawrenc/


Data-Driven Scientific Discovery

Modern scientific discovery uses vast quantities of data generated from instruments, sensors, and experimental systems. The quality and impact of the research depend heavily on the quality of the data collection and analysis.

Challenges: The amount of data required for research is exploding. The types and sources of data are increasing, and data generated by different experiments or devices must be integrated.

These two factors require scientists to be concerned with how their experimental data is collected, archived, analyzed, and integrated to produce research contributions.


Fundamental Sensor Data Archive Issue

Sensors produce vast amounts of data that is valuable for historical as well as real-time applications and analysis.

Due to the number of sensors and the volume of data collected, manual curation and data validation are difficult.

By their nature, sensors are prone to failures, inaccuracies, and periods of intermittent or substandard performance.

Despite these device limitations, the historical data record should be as clean and accurate as possible.


Key Question (and Answer)

Question: How can we achieve high-quality historical archives of sensor data?

Answer: In addition to operational monitoring of the data archive system, the data stream should be analyzed using metadata properties to detect errors.

Operational monitoring – Are the system components and workflow functioning properly?
Metadata validation – Does the data stream conform to known ranges? Can data cleansing and correction be performed?


NEXRAD Archive System Overview

Our goal is to provide the science community with ready access to the vast archives and real-time information collected by the national network of NEXRAD radars. [This requires hiding the numerous data management issues.]

We will briefly overview:

The data collected by the NEXRAD system and its scientific value.
The current state of NEXRAD data archiving and its use in scientific discovery, including its data quality limitations.
An extension of the system that uses metadata properties to validate and clean archived data.


NEXRAD System and Generated Data

There are over 150 NEXt generation RADars (NEXRAD) that collect real-time precipitation data across the United States. The system has been operational for over 10 years, and the amount of collected data is continually expanding.

A radar emits a coherent train of microwave pulses and processes the reflected pulses.

Each processed pulse corresponds to a bin. There are multiple bins in a ray (beam). Rotating the radar 360º is a sweep. After a sweep, the radar elevation angle is increased and another sweep is performed. All sweeps together form a volume.
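To make the scan hierarchy concrete, here is a minimal Python sketch of bins, rays, sweeps, and volumes. The class and field names are illustrative only, not the actual NEXRAD data format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Ray:
    azimuth_deg: float                                  # beam pointing direction within a sweep
    bins: List[float] = field(default_factory=list)     # one value per processed pulse (range bin)

@dataclass
class Sweep:
    elevation_deg: float                                # antenna elevation for one 360-degree rotation
    rays: List[Ray] = field(default_factory=list)

@dataclass
class Volume:
    radar_id: str                                       # station identifier
    start_time: str                                     # scan start time
    sweeps: List[Sweep] = field(default_factory=list)   # all sweeps at increasing elevation angles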


Usefulness of NEXRAD Data

Although the NEXRAD system was designed for severe weather forecasting, the collected data has been used in many other areas, including flood prediction, bird and insect migration, and rainfall estimation.

The value of this data has been noted by an NRC report which labeled it a "critical resource."

Enhancing Access to NEXRAD Data—A Critical National Resource. National Academy Press, Washington, D.C. ISBN 0-309-06636-0, 1999.


Archiving NEXRAD Data

Researchers have two options for acquiring NEXRAD data:

1) Retrieve RAW data from the National Climatic Data Center (NCDC) tape archive.

2) Capture real-time data distributed by the University Corporation for Atmospheric Research (UCAR) using their Unidata Internet Data Distribution (IDD) system.

Acquiring, archiving, and analyzing the data requires significant computational and operational knowledge, which makes it impractical for many researchers.


NEXRAD Archive System

The NEXRAD archive system is an NSF-funded project that aims to simplify the analysis of NEXRAD data for researchers. The NEXRAD archive:

Collects and archives RAW data from the real-time stream.
Analyzes and indexes data for retrieval by metadata properties.
Performs data cleansing such as removing ground clutter.
Allows researchers access to historical and real-time data in RAW form.
Provides an analysis workflow system that generates derived products (such as rainfall maps) using the RAW data, known algorithms, and researcher parameters.

The NEXRAD archive is hosted at the University of Iowa, and development is done in conjunction with NCDC, Unidata, and Princeton University.


NEXRAD Archive Architecture

Files are added from the real-time stream or from other sources.
A metadata extractor produces an XML description of each data file that is used for indexing (see the sketch below).
Clients can access the archive directly using a C library and their own programs.
All data files are web accessible.
The metadata directory can be queried using a web services interface.
Most clients use the pre-constructed web workflow system and do not access RAW data.
Data and metadata will be replicated at a supercomputing center and eventually at NCDC.
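As a rough illustration of the metadata extraction step, the sketch below builds an XML description of one data file in Python. The element names, fields, and example values are assumptions for illustration, not the archive's actual metadata schema.

import xml.etree.ElementTree as ET

def describe_file(path, radar_id, scan_time, sweep_count, max_dbz):
    # Build a small XML document summarizing one archived data file.
    root = ET.Element("radarFile", attrib={"path": path})
    ET.SubElement(root, "radarId").text = radar_id
    ET.SubElement(root, "scanTime").text = scan_time          # ISO-8601 timestamp
    ET.SubElement(root, "sweepCount").text = str(sweep_count)
    ET.SubElement(root, "maxReflectivityDbz").text = str(max_dbz)
    return ET.tostring(root, encoding="unicode")

# Example (hypothetical file name and values):
print(describe_file("KDVN_20020615_0130.raw", "KDVN", "2002-06-15T01:30:00Z", 14, 52.5))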


Metadata Archive (User/Client's View)

Example query: "Find all the 2002 storms over the Ralston Creek watershed with mean areal precipitation greater than X mm, and with a spatial extent of more than Z km2, with a duration of less than N hours. I want the data in GeoTIFF."

[Diagram: the user/client queries the metadata archive through a web services interface, receives URIs for the matching data, and then retrieves the data over HTTP from the distributed data archive (NCDC, Iowa, etc.).]
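The sketch below illustrates this two-step access pattern in Python: query the metadata directory, then fetch the returned URIs over HTTP. The endpoint URL, parameter names, placeholder values for X/Z/N, and the assumption of a JSON response are hypothetical, not the archive's actual web services API.

import json
import urllib.parse
import urllib.request

# Hypothetical metadata query endpoint; not the archive's real URL.
METADATA_ENDPOINT = "http://archive.example.edu/metadata/query"

# Parameters mirroring the example query; the numeric values stand in for X, Z, and N.
params = urllib.parse.urlencode({
    "watershed": "Ralston Creek",
    "year": 2002,
    "minMeanArealPrecipMm": 25,     # "greater than X mm"
    "minSpatialExtentKm2": 100,     # "more than Z km2"
    "maxDurationHours": 6,          # "less than N hours"
    "outputFormat": "GeoTIFF",
})

# Step 1: query the metadata directory and receive URIs of matching data.
with urllib.request.urlopen(METADATA_ENDPOINT + "?" + params) as resp:
    uris = json.load(resp)          # assume the service returns a JSON list of URIs

# Step 2: retrieve each matching file over HTTP from the distributed data archive.
for uri in uris:
    urllib.request.urlretrieve(uri, uri.rsplit("/", 1)[-1])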


NEXRAD System Current Statistics

The NEXRAD archive system:

Has been running for over 2 years.
Collects data for 30 of the 150 radars.
Has indexed over 8 million radar scans.
Has RAW data that is over 8 TB in compressed form.
Processes a real-time data stream of 10-20 GB/day.
Supports a sophisticated workflow system that produces derived data products (e.g. rainfall maps) for users on demand.
Has an operational monitoring system (is the archive workflow pipeline functioning properly?) but only simple data validation checks.

Question: What is the quality of the data being archived?


Archive Monitoring System

We developed a new archive monitoring system that:

Explicitly tracks all archive workflow events in logs that are stored and queried using a database.
Detects data corruption using metadata properties as well as pipeline failures.
Produces reports on a web interface to simplify administration of the archive.

The monitoring system was developed and operated separately from the main archive to compare performance and to prevent issues with the operational system.


Archive System with Monitor

The basic archive workflow components are unchanged except for logging:

Converter – translates the RAW form to compressed RLE.
Metadata Extractor – analyzes data properties and checks for inconsistencies.
Loader – loads metadata into the database and files onto web servers.

Monitoring system:

Loads XML log records from each archive component into a database (see the sketch below).
Provides metadata ranges for checking data validity.
Tracks files through the pipeline (lineage) and handles corrupt files.
Has a separate log database that is accessed using a web front-end.
Can restart any workflow software.
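A minimal sketch of the XML log records that workflow components emit as a file moves through the pipeline. The element names and status values are assumptions for illustration, not the monitoring system's actual log format.

import datetime
import xml.etree.ElementTree as ET

def log_event(component, filename, status, detail=""):
    # One log record describing what happened to one file at one pipeline stage.
    event = ET.Element("logEvent")
    ET.SubElement(event, "component").text = component    # "converter", "extractor", or "loader"
    ET.SubElement(event, "file").text = filename           # lineage key: the RAW file being tracked
    ET.SubElement(event, "status").text = status           # e.g. "ok", "fixed", "warning", "dropped"
    ET.SubElement(event, "detail").text = detail
    ET.SubElement(event, "time").text = datetime.datetime.now(datetime.timezone.utc).isoformat()
    return ET.tostring(event, encoding="unicode")

# Example: the metadata extractor flags a file with a suspect timestamp.
print(log_event("extractor", "KDVN_20020615_0130.raw", "warning", "timestamp outside expected range"))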


Validating Sensor Data using Metadata

The operational ranges of data produced by sensors are commonly known. For example, the timestamp of sensor readings for the radars should be close to the current time, and reflectivity readings fall within known ranges given the weather conditions.

The monitoring system provides these operational ranges to the metadata extractor component, which can verify that data is within the accepted ranges.

Data outside the ranges causes files to be dropped if not recoverable, fixed if possible (e.g. date changes), or flagged as warnings of potential corruption otherwise.

The goal is to get as much data through the pipeline as possible while making sure compromised data is flagged.
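A minimal sketch of metadata range validation along the lines described above. The thresholds and decision rules are assumptions for illustration, not the archive's actual policy.

import datetime

def validate(scan_time, max_dbz,
             max_clock_skew=datetime.timedelta(hours=2),
             plausible_dbz=(-32.0, 95.0)):
    # Returns "drop" (unrecoverable), "fix" (correctable, e.g. a date change),
    # "flag" (kept but marked as potentially corrupt), or "accept".
    now = datetime.datetime.now(datetime.timezone.utc)

    if scan_time is None or max_dbz is None:
        return "drop"                      # required metadata missing: unrecoverable

    if abs(now - scan_time) > max_clock_skew:
        return "fix"                       # timestamp far from current time: correct the date

    lo, hi = plausible_dbz
    if not (lo <= max_dbz <= hi):
        return "flag"                      # outside the known range: warn of potential corruption

    return "accept"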


Monitoring System Implementation

The monitoring system required each workflow component to be changed to use XML log records instead of separate files.

Each XML log record is loaded into a Postgres database by a log processor. The log processor and logging are separate from the archive system to ensure that logging does not slow the archiving.

As a RAW file proceeds through the pipeline, log events for it are recorded at each stage. Files that do not make it through the pipeline are not "lost" to the archive as before.

The administrator has a web front-end to control archive processes, monitor events at a per-file or per-process level, and track operational characteristics across the entire workflow.
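A minimal sketch of the log processor's load step into Postgres. The table name, columns, and use of psycopg2 as the driver are assumptions for illustration, not the monitoring system's actual schema or code.

import psycopg2  # any Postgres client library would do

SCHEMA = """
CREATE TABLE IF NOT EXISTS log_events (
    id         SERIAL PRIMARY KEY,
    component  TEXT NOT NULL,        -- converter, extractor, loader, ...
    filename   TEXT NOT NULL,        -- lineage key for tracking a file through the pipeline
    status     TEXT NOT NULL,        -- ok, fixed, warning, dropped, ...
    detail     TEXT,
    logged_at  TIMESTAMPTZ NOT NULL
);
"""

def init(conn):
    # Create the log table if it does not already exist.
    with conn.cursor() as cur:
        cur.execute(SCHEMA)
    conn.commit()

def load_event(conn, component, filename, status, detail, logged_at):
    # Insert one parsed XML log record as a row in the log database.
    with conn.cursor() as cur:
        cur.execute(
            "INSERT INTO log_events (component, filename, status, detail, logged_at) "
            "VALUES (%s, %s, %s, %s, %s)",
            (component, filename, status, detail, logged_at),
        )
    conn.commit()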


Monitor System Administrator Interface


Operational Results

A duplicate NEXRAD archive system that included the monitoring system processed two radars in parallel with the live system for six months.

Key results:

Data errors occur in about 5% of input files. The expectation was less than 1% given the sophistication of the sensors.
One radar had data corruption for a two-week period that went unnoticed in the live system because the radar was indicating good operational status.
Most errors were not fixable, but there was a significant number of correctable date errors.
Administrator time was reduced dramatically compared to manual log investigation.

The cost of logging a sensor stream is high. Storing log records in a database is a bottleneck and must be separated from the archive. (Database loading is also a bottleneck in the archive itself.)


Future Work and Conclusions

Archiving sensor data is going to be an increasing challenge. Ensuring high-quality archives requires more than operational monitoring and should also include data validation using metadata properties.

The live archive system is being updated to use the monitoring system.

The bottleneck in archiving and monitoring sensor data is the database system. The monitor should be separated from the archive. Loading the metadata into the archive takes an order of magnitude longer than generating it.

The metadata is growing beyond the capabilities of a single database. It will be replicated and distributed for performance and political reasons.

Logging to the database provides easy access to information, but you must be aware of the performance issues.


Project Participants

The University of Iowa (Lead): W.F. Krajewski (PI), A.A. Bradley, A. Kruger, R. Lawrence

Princeton University: J.A. Smith (PI), M. Steiner, M.L. Baeck

National Climatic Data Center: S.A. Delgreco (PI), S. Ansari

UCAR/Unidata Program Center: M.K. Ramamurthy (PI), W.J. Weber

Research supported by NSF ITR Grant ATM 0427422: "A Comprehensive Framework for Use of NEXRAD Data in Hydrometeorology and Hydrology".



Thank You!