
Handling Datasets

Michelle Gierach, PO.DAAC Project Scientist

Eric Tauer, PO.DAAC Project System Engineer

Introduction

• We saw a theme spanning several 2011 UWG recommendations (6, 14, 19, 20, 23).

• The theme spoke to a fundamental need/goal:

Approach and handle datasets consistently, and accept and/or deploy them only when it makes sense to do so.

• This is worth solving!

We want to provide the right datasets, and we want users to be able to easily connect with the right datasets.

• We enthusiastically agree with the UWG recommendations! Therefore…

Our intent is to capture the lifecycle policy (including how we accept, handle, and characterize datasets) to ensure:
• Consistency in our approach
• Soundness in our decisions
• Availability of descriptive measures to our users

In the next two discussions, we will address those five UWG Recommendations (6, 14, 19, 20, 23) via the following talks:

1. The proposed end-to-end lifecycle phases (enabling consistency), and assessment criteria (describing our data to our users) (Eric)

2. The results of the Gap Analysis, and the corresponding Dataset Assessment (Michelle)


Recommendations Covered

Recommendation 6. Carry out the dataset gap analyses and create a reporting structure that categorizes what is available, what could be created, the potential costs involved, estimates of user needs, and other data management factors. This compilation should enable prioritization of efforts that will fill the most significant data voids.

Recommendation 14. There needs to be a clear path for all datasets generated outside of PO.DAAC to be accepted and hosted by the PO.DAAC. The PSTs have a role in determining whether a dataset is valuable and of good quality. The processes and procedures should be published and readily available to potential dataset developers. All datasets should go through the same data acceptance path. A metric exclusively based on the number of peer-reviewed papers using the dataset is NOT recommended.

Recommendation 19. The UWG has previously recommended that the PO.DAAC work on providing climatologies, anomalies, indices, and various dataset statistics for selected datasets. This does not include developing CDRs as part of the core PO.DAAC mission. This recommendation is repeated, because it could be partially complementary to the IPCC/CMIP5 efforts, e.g., these climatologists prefer to work with global monthly mean data fields. Contributions of CDR datasets to PO.DAAC from outside research should be considered.

Recommendation 20. Better up front planning is required if NASA research program results are to be directed toward PO.DAAC. Datasets must meet format and metadata standards, and contribute to body of core data types. The Dataset Lifecycle Management plans are a framework for these decisions. Software must be designed to integrate with and beneficially augment the PO.DAAC systems. PO.DAAC should not accept orphan datasets or software projects.

Recommendation 23. Guiding users to data: explain and use Dataset Lifecycle Management vocabulary with appropriate linkages; clarify what Sort by 'Popularity...' means.

The Dataset Lifecycle

Related 2011 UWG Recommendations

The specification and documentation of the Dataset Lifecycle Policy stems from UWG Recommendations: 14, 20, 23

• “There needs to be a clear path for all datasets generated outside of PO.DAAC to be accepted and hosted by the PO.DAAC”

• “All datasets should go through the same data acceptance path”

• “Better up front planning is required if NASA research program results are to be directed toward PO.DAAC”

• “Dataset Lifecycle Management plans are a framework for these decisions”

• “Explain and use Dataset Lifecycle Management vocabulary with appropriate linkages”

Why a Lifecycle Policy?

• Consistency in our approach

• Match users to data

Major Goal: Better describe our data to better map it to our users.

First: Define Lifecycle Phases to control consistency.

Existing Work

Dataset Lifecycle work is underway both internal and external to PO.DAAC.

Internal:

• Significant research and work performed by Chris Finch (UWG 2010 Presentation)

• Work within PO.DAAC to streamline process; Mature teams with a very solid understanding of their roles

• Existing exit-criteria checklist for product release

External:

• Quite a bit of reference available via Industry efforts and progress

• Models can be leveraged from implementations at other DAACs, Big Data, Data One

Question:

Any specific recommendations regarding lifecycle models appropriate to PO.DAAC?


Proposed PO.DAAC Lifecycle Phases*

1. Identify a Dataset of Interest: controls the identification of a dataset and its submission as a candidate, performing some order of cost/benefit analysis.

2. Green-Light the Dataset: review the information on a candidate dataset, indicating a go/no-go for inclusion at PO.DAAC.

3. Tailor the Dataset Policy: set the expectations with respect to the policy; identify areas for waiver or non-applicability, if any.

4. Ingest the Dataset: determine and verify the route to obtain data under this dataset; collect dataset-related metadata.

5. Archive the Dataset: determine and verify how to process the data; identify reformatting needs, metadata extraction, and the initial validation strategy.

6. Register/Catalog the Dataset: perform the preparatory steps required to ultimately enroll this dataset into the catalogs, FTP site(s), etc.

7. Distribute the Dataset (was “Integrate”): identify and complete work to tie the dataset into search engines, visualization tools, and services.

8. Verify the Dataset: identify the test approach, plans, and procedures for verifying any and all PO.DAAC manipulation of this dataset’s granules; define and document the level of verification to be performed. (Roll up all validation from prior steps.)

9. Rollout the Dataset: finalize the dataset information page; review the dataset for readiness; deploy to operations and notify the community of availability.

10. Maintain the Dataset: controls actions that may be needed to maintain the data in-house over the longer term, including re-processing, superseding, and versioning.

*Additionally, we include “Retire the Dataset”, but these are the primary operational phases.
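To make the ordering concrete, here is a minimal sketch (illustrative only; the phase names come from the table above, while the class and function names are hypothetical and not part of any PO.DAAC system) of the phases as an ordered sequence a dataset advances through, with the non-operational Retire phase appended:

```python
from enum import IntEnum
from typing import Optional

class LifecyclePhase(IntEnum):
    """Proposed PO.DAAC lifecycle phases, in order (names from the table above)."""
    IDENTIFY = 1           # Identify a Dataset of Interest
    GREEN_LIGHT = 2        # Green-Light the Dataset
    TAILOR_POLICY = 3      # Tailor the Dataset Policy
    INGEST = 4             # Ingest the Dataset
    ARCHIVE = 5            # Archive the Dataset
    REGISTER_CATALOG = 6   # Register/Catalog the Dataset
    DISTRIBUTE = 7         # Distribute the Dataset (was "Integrate")
    VERIFY = 8             # Verify the Dataset
    ROLLOUT = 9            # Rollout the Dataset
    MAINTAIN = 10          # Maintain the Dataset
    RETIRE = 11            # Retire the Dataset (not a primary operational phase)

def next_phase(current: LifecyclePhase) -> Optional[LifecyclePhase]:
    """Advance to the next phase; a retired dataset has no next phase."""
    return None if current is LifecyclePhase.RETIRE else LifecyclePhase(current + 1)

# Example: a newly green-lit dataset moves on to policy tailoring.
assert next_phase(LifecyclePhase.GREEN_LIGHT) is LifecyclePhase.TAILOR_POLICY
```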

Lifecycle Policy: A Means to an End

[Diagram: the Lifecycle Policy sits among ESDIS Goals, User Goals, and PO.DAAC Dataset Goals, and controls how we do business (Procedures, a Consistent Approach) and how we describe our data (Maturity).]

Second: Define measurements related to Maturity Index.

Better-Described Data

• We want to quantitatively evaluate our datasets
• We don’t want to claim datasets are “good” or “bad”
• NASA and NOAA call their evaluation “Maturity”

Question (rhetorical, at this point): What does “maturity” mean to you? Do you prefer it to “Assessment and Characterization”?

Constant Collection

• Over the lifecycle, various data points are collected:

• Decisional (e.g., Uniqueness: Rare or hard-to-find data)

• Descriptive (e.g., Spatial Resolution)

• Those data points might control decisions or flow (exit criteria) and/or might be used to describe the “maturity” to the user.

• We think “maturity” means:

A quantified characterization of dataset features.

A higher number means more “mature”

[Diagram: the Dataset Lifecycle Phases, from Identify a Dataset of Interest through Maintain the Dataset, with increasing knowledge of maturity built up through constant collection across the phases.]
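As a minimal sketch of the “constant collection” idea (illustrative only; the class and field names are hypothetical, not PO.DAAC’s implementation), decisional data points can gate a phase’s exit criteria while descriptive ones feed the user-facing maturity characterization:

```python
from dataclasses import dataclass, field
from typing import Any, List

@dataclass
class DataPoint:
    """One piece of information collected during a lifecycle phase."""
    name: str          # e.g. "Uniqueness" or "Spatial Resolution"
    value: Any         # whatever was recorded for this dataset
    decisional: bool   # True if it helps gate a go/no-go or exit criterion
    phase: str         # lifecycle phase in which it was collected

@dataclass
class DatasetRecord:
    """Accumulates data points across the whole lifecycle (constant collection)."""
    short_name: str
    points: List[DataPoint] = field(default_factory=list)

    def collect(self, point: DataPoint) -> None:
        self.points.append(point)

    def exit_criteria(self, phase: str) -> List[DataPoint]:
        """Decisional points for one phase; these control decisions or flow."""
        return [p for p in self.points if p.decisional and p.phase == phase]

    def descriptive(self) -> List[DataPoint]:
        """Descriptive points; these describe the 'maturity' to the user."""
        return [p for p in self.points if not p.decisional]

# Example with hypothetical values:
record = DatasetRecord("EXAMPLE_SST_L3")
record.collect(DataPoint("Uniqueness", "rare, hard-to-find data", True,
                         "Identify a Dataset of Interest"))
record.collect(DataPoint("Spatial Resolution", "0.25 degrees", False,
                         "Ingest the Dataset"))
```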

A Dataset Maturity Model

Related 2011 UWG Recommendations

The creation of a PO.DAAC Dataset Maturity Model stems from UWG Recommendations: 6, 14, 20, 23

• [Identify the] “potential costs involved, estimates of user needs, and other data management factors”

• “The PSTs have a role in determining whether a dataset is valuable and of good quality. The processes and procedures should be published and readily available to potential dataset developers”

• “A metric exclusively based on the number of peer-reviewed papers using the dataset is NOT recommended.”

• “Datasets must meet format and metadata standards”

• “PO.DAAC should not accept orphan datasets”

• “Clarify what Sort by 'Popularity...' means”

We adhere to the lifecycle for consistency, but a key outcome of the lifecycle must be maturity measures.

Dataset Maturity

Maturity
• Community Assessment: 3
• Technical Quality: 4
• Processing: 3
• Provenance: 3
• Documentation: 5
• Adherence to Process Guidelines: 5
• Toolkits: 5
• Relationships: 4
• Specification: 4
• Overarching Maturity Index: 4

Ref: NASA Data Maturity Levels

• Beta

• Products intended to enable users to gain familiarity with the parameters and the data formats.

• Provisional

• Product was defined to facilitate data exploration and process studies that do not require rigorous validation. These data are partially validated and improvements are continuing; quality may not be optimal since validation and quality assurance are ongoing.

• Validated

• Products are high quality data that have been fully validated and quality checked, and that are deemed suitable for systematic studies such as climate change, as well as for shorter term, process studies. These are publication quality data with well-defined uncertainties, but they are also subject to continuing validation, quality assurance, and further improvements in subsequent versions. Users are expected to be familiar with quality summaries of all data before publication of results; when in doubt, contact the appropriate instrument team.

• Stage 1 Validation: Product accuracy is estimated using a small number of independent measurements obtained from selected locations and time periods and ground-truth/field program efforts.

• Stage 2 Validation:  Product accuracy is estimated over a significant set of locations and time periods by comparison with reference in situ or other suitable reference data. Spatial and temporal consistency of the product and with similar products has been evaluated over globally representative locations and time periods. Results are published in the peer-reviewed literature.

• Stage 3 Validation: Product accuracy has been assessed.  Uncertainties in the product and its associated structure are well quantified from comparison with reference in situ or other suitable reference data. Uncertainties are characterized in a statistically robust way over multiple locations and time periods representing global conditions. Spatial and temporal consistency of the product and with similar products has been evaluated over globally representative locations and periods. Results are published in the peer-reviewed literature.

• Stage 4 Validation: Validation results for stage 3 are systematically updated when new product versions are released and as the time-series expands.

Ref: NOAA Maturity Model

Each maturity level (1-6) is characterized across seven aspects: Sensor Use; Algorithm Stability (including ancillary inputs); Metadata & QA; Documentation; Validation; Public Release; Science and Applications.

Maturity 1
• Sensor Use: Research Mission
• Algorithm Stability: Significant changes likely
• Metadata & QA: Incomplete
• Documentation: Draft ATBD
• Validation: Minimal
• Public Release: Limited data availability to develop familiarity
• Science and Applications: Little or none

Maturity 2
• Sensor Use: Research Mission
• Algorithm Stability: Some changes expected
• Metadata & QA: Research grade (extensive)
• Documentation: ATBD version 1+
• Validation: Uncertainty estimated for select locations or times
• Public Release: Data available but of unknown accuracy; caveats required for data use
• Science and Applications: Limited or ongoing

Maturity 3
• Sensor Use: Research Mission
• Algorithm Stability: Minimal change expected
• Metadata & QA: Research grade (extensive); meets international standards
• Documentation: Public ATBD; peer-reviewed algorithm and product descriptions
• Validation: Uncertainty estimated over widely distributed times/locations by multiple investigators; differences understood
• Public Release: Data available but of unknown accuracy; caveats required for data use
• Science and Applications: Provisionally used in applications and assessments demonstrating positive value

Maturity 4
• Sensor Use: Operational Mission
• Algorithm Stability: Minimal change expected
• Metadata & QA: Stable; allows provenance tracking and reproducibility; meets int'l standards
• Documentation: Public ATBD; draft Operational Algorithm Description (OAD); peer-reviewed algorithm and product descriptions
• Validation: Uncertainty estimated over widely distributed times/locations by multiple investigators; differences understood
• Public Release: Data available but of unknown accuracy; caveats required for data use
• Science and Applications: Provisionally used in applications and assessments demonstrating positive value

Maturity 5
• Sensor Use: All relevant research and operational missions; unified and coherent record demonstrated across different sensors
• Algorithm Stability: Stable and reproducible
• Metadata & QA: Stable; allows provenance tracking and reproducibility; meets int'l standards
• Documentation: Public ATBD, OAD, and validation plan; peer-reviewed algorithm, product, and validation articles
• Validation: Consistent uncertainties estimated over most environmental conditions by multiple investigators
• Public Release: Multi-mission record is publicly available with associated uncertainty estimate
• Science and Applications: Used in various published applications and assessments by different investigators

Maturity 6
• Sensor Use: All relevant research and operational missions; unified and coherent record over complete series; record is considered scientifically irrefutable following extensive scrutiny
• Algorithm Stability: Stable and reproducible; homogeneous and published error budget
• Metadata & QA: Stable; allows provenance tracking and reproducibility; meets int'l standards
• Documentation: Product, algorithm, validation, processing, and metadata described in peer-reviewed literature
• Validation: Observation strategy designed to reveal systematic errors through independent cross-checks, open inspection, and continuous interrogation
• Public Release: Multi-mission record is publicly available from long-term archive
• Science and Applications: Used in various published applications and assessments by different investigators

See: ftp://ftp.ncdc.noaa.gov/pub/data/sds/ms-privette-P1.3.conf.header.pdf

Laundry List of Criteria

• Community Assessment: Papers written / number of citations; # of Likes; # of downloads/views
• Technical Quality: QQC+; Latency / gappiness; Accuracy; Sampling issues? Caveats/known issues identified?
• Processing: Has it been manipulated? Cal/Val state? Verification state?
• Provenance: Maturity of platform/instrument/sensor; Maturity of program; Parent datasets identified (if applicable); Is the sensor fully described? Is the context of the reading(s) fully described? State-of-the-art technology?
• Documentation: What is the state of the documentation? Is the documentation captured (archived)?
• Adherence to Process Guidelines: Did it get fast-tracked? Tons of waivers? Were all exit criteria met satisfactorily? Consistent use of units?
• Access: Readily available? Foreign repository? Behind firewalls or open FTP?
• Toolkits: Data visualization routine? Data reader? Verified reader/subroutine?
• Relationships: Sibling/child datasets identified? Motivation/justification identified?
• Rarity: Hard-to-find data? Atypical sensor/resolution/etc.?
• Specification: Resolution (spatial / temporal); Spatial coverage; Start time; End time; Data format? Exotic structure? Sizing / volume expectation?

Used for Maturity Index


Maturity
• Community Assessment: 3
• Technical Quality: 4
• Processing: 3
• Provenance: 3
• Documentation: 5
• Adherence to Process Guidelines: 5
• Toolkits: 5
• Relationships: 4
• Specification: 4
• Overarching Maturity Index: 4

Proposed Presentation

Users would be presented with layers of information:

• Scores derived from the various criteria categories

• An ultimate maturity index (a simple mathematical average) from the combined values

• Weighting could ultimately be allowed, but at this point it seems that would overcomplicate things
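For illustration, here is a minimal sketch of that roll-up (the category scores are the example values from the maturity mock-up above; the rounding step and the optional weighting hook are our assumptions, not settled design):

```python
# Example category scores taken from the maturity mock-up above.
scores = {
    "Community Assessment": 3,
    "Technical Quality": 4,
    "Processing": 3,
    "Provenance": 3,
    "Documentation": 5,
    "Adherence to Process Guidelines": 5,
    "Toolkits": 5,
    "Relationships": 4,
    "Specification": 4,
}

def maturity_index(scores, weights=None):
    """Overarching maturity index: a simple mathematical average of the
    category scores. A per-category `weights` dict would allow weighting
    later, but weighting is not proposed at this point."""
    if weights is None:
        return sum(scores.values()) / len(scores)
    total = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total

print(round(maturity_index(scores)))  # -> 4, matching the mock-up's overall index
```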


Question:

What does “maturity” mean to you? Do you prefer it to “Assessment and Characterization”?

Does this provide better described datasets and better mapping of data to our users?

Wrap Up

The lifecycle document, while capturing process, becomes a means to an even greater end.

The driving current is consistency, and as our goals hinge on matching users to datasets, the lifecycle becomes the means to ensuring fully characterized datasets.

We hope the approach is reasonable, and that we are accurate in our assessment that the policy aspects of the Dataset Lifecycle can and will help ensure conformity to process, and consistent availability of maturity data across all PO.DAAC holdings.

Next steps:

Need to ultimately identify (and if necessary, implement) the infrastructure needed to guide us through this lifecycle

Still need to resolve some key questions, such as: How does the lifecycle morph with respect to different types of datasets? (Remote Datasets? Self-generated Datasets?)

[Slide: recap of the Dataset Lifecycle Phases, from Identify a Dataset of Interest through Maintain the Dataset.]

Michelle’s discussion starts here…