VTL (Validation and Transformation Language)

A new standard for data validation and processing

Marco Pellegrino, Eurostat

Acknowledgements: Bank of Italy, SDMX Technical Working Group, DDI Alliance, Bryan Fitzpatrick, Arofan Gregory, and others…

Eurostat

Background

Data validation: a critical issue for the ESS (European Statistical System)

Eurostat and Member States: double work or "no work"?

Inefficiencies:

• Lack of coordination
• Lack of documentation
• Lack of formalisation of validation procedures and rules
• Low harmonisation of software solutions

Need for a comprehensive solution: a portfolio of actions in the framework of the ESS Vision 2020

SDMX originally focused on data collection and dissemination

Current trend: support more stages of the statistical production process

Approach

GSBPM (Generic Statistical Business Process Model)

Data Validation Process

Before/During Transmission (“First Level”) - Covered by SDMX today

• Format Check (SDMX-ML)
• Code Check (SDMX DSD)

After Transmission (“Second Level”) - Not yet covered by SDMX; scope of SDMX-VTL

• Detailed value check
• Mirror check
• …
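
To make the two levels concrete, here is a minimal Python sketch (not VTL, and not part of the original slides). The code list, the records, the trade figures and the 5% tolerance are all hypothetical; the mirror check is assumed to compare what one reporter declares as exports with what its partner declares as imports.

```python
# Hypothetical code list and data; none of these values come from the slides.
VALID_COUNTRIES = {"IT", "FR", "DE"}


def code_check(records):
    """First level: return the records whose country code is not in the code list."""
    return [r for r in records if r["country"] not in VALID_COUNTRIES]


def mirror_check(exports, imports, tolerance=0.05):
    """Second level: compare A's exports to B with B's imports from A.

    Both arguments map (reporter, partner) to a value. A pair is reported
    when the two figures differ by more than `tolerance` (relative).
    """
    problems = []
    for (reporter, partner), exported in exports.items():
        mirrored = imports.get((partner, reporter))
        if mirrored is None:
            continue  # nothing to compare against
        if abs(exported - mirrored) > tolerance * max(exported, mirrored):
            problems.append((reporter, partner, exported, mirrored))
    return problems


records = [{"country": "IT", "value": 10}, {"country": "XX", "value": 7}]
exports = {("IT", "FR"): 100.0, ("DE", "FR"): 50.0}
imports = {("FR", "IT"): 98.0, ("FR", "DE"): 80.0}

bad_codes = code_check(records)
mismatches = mirror_check(exports, imports)
```

The first check needs only the structure definition (the code list), which is why it can run before or during transmission; the second needs data from two reporters, so it can only run after both have transmitted.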

Main goals:

Define and preserve validation rules (document and preserve the validation know-how)

Exchange and share validation rules (with reporting institutions & other correspondents)

Apply validation rules in the collection and production processes (aiming at an industrialized processing of statistical data)

At a later stage:

Improve the VTL to support more complex algorithms for data compilation and estimation

The VTL initiative

What is VTL 1.0?

• A reference framework for the creation of rules for data validation and transformation

• It maps to a clear and generic information model

• It aligns with relevant statistical information standards such as SDMX and GSIM

SDMX

VTL: part 1 - part 2

EBNF (Extended Backus-Naur Form) technical notation

Main VTL features

• User orientation

• Integrated approach

• IT implementation independence

• Active role for processing

• Extensibility and customizability

• Language effectiveness

Proper governance is needed

The VTL Information Model

• VTL is a “stand-alone” specification
• It can be used with SDMX, DDI, or potentially anything else
• It can be used on its own

• Because different standards have different information models, VTL must establish its own information model
• Other information models can be mapped against it
• VTL uses GSIM as a basis

VTL Data Model

• Organizes Data Points into Data Sets

• Describes Data Structures using Structure Components
– Measures
– Attributes
– Identifiers

• Very similar to GSIM

[Figure: a logical Data Set made up of Data Points, with two Identifier Components and one Measure Component]
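
As a rough illustration of this data model, the Python sketch below represents a Data Set as a structure (a list of components with roles) plus a list of Data Points. The class names, the component names (Country, Year, Population) and the values are hypothetical and are not taken from the VTL specification; the rule that identifier values must be unique within a Data Set is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    name: str
    role: str  # "identifier", "measure" or "attribute"


@dataclass
class DataSet:
    structure: list                      # list of Component
    points: list = field(default_factory=list)  # each Data Point is a dict

    def identifiers(self):
        return [c.name for c in self.structure if c.role == "identifier"]

    def measures(self):
        return [c.name for c in self.structure if c.role == "measure"]

    def add_point(self, **values):
        """Add a Data Point; identifier values must be unique (sketch assumption)."""
        missing = {c.name for c in self.structure} - set(values)
        if missing:
            raise ValueError(f"missing components: {sorted(missing)}")
        key = tuple(values[n] for n in self.identifiers())
        for p in self.points:
            if tuple(p[n] for n in self.identifiers()) == key:
                raise ValueError(f"duplicate identifiers: {key}")
        self.points.append(values)


# Hypothetical structure: two identifiers and one measure, as in the figure.
ds = DataSet([
    Component("Country", "identifier"),
    Component("Year", "identifier"),
    Component("Population", "measure"),
])
ds.add_point(Country="IT", Year=2014, Population=1)
ds.add_point(Country="FR", Year=2014, Population=2)
```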

Transformation Model

• Takes a set of Transformation Expressions and organizes them into a Transformation Scheme

• Each Expression has an Operand, an Operator, and a Result
– Operands can have Parameters
– Operators and Results are identified by the Expression when it is executed
– VTL specifies the Operators and the types of Parameters

• VTL uses the SDMX Transformation model
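
The relationship between Expressions, Operands, Operators and Results can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: operands are scalars rather than Data Sets, the operator table is a made-up subset, and the names `Expression`, `run_scheme` and `OPERATORS` are hypothetical, not part of VTL or SDMX.

```python
import operator
from dataclasses import dataclass

# Hypothetical subset of operators; VTL defines its own, much larger, set.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


@dataclass
class Expression:
    result: str      # name under which the result is stored
    op: str          # operator symbol, looked up in OPERATORS
    operands: tuple  # names of inputs or earlier results, or numeric constants


def run_scheme(scheme, inputs):
    """Evaluate the expressions of a scheme in the order given."""
    env = dict(inputs)
    for expr in scheme:
        args = [env[o] if isinstance(o, str) else o for o in expr.operands]
        env[expr.result] = OPERATORS[expr.op](*args)
    return env


scheme = [
    Expression("total", "+", ("a", "b")),
    Expression("doubled", "*", ("total", 2)),
]
env = run_scheme(scheme, {"a": 3, "b": 4})
```

Here the result of one expression ("total") is the operand of the next, which is what ties individual expressions together into a scheme.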

Transformations and Process models

Transformation model

It exists in SDMX, but not in GSIM or DDI

It allows defining calculations through mathematical expressions

It does not allow cycles (the same structure as a spreadsheet)

Process model

It exists in SDMX, GSIM, DDI and other standards (e.g. BPM)

It allows defining calculations through a process

It allows cycles (like a procedural programming language)
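
The "no cycles" property of the transformation model means that the expressions form a directed acyclic graph, so an evaluation order can always be derived from the dependencies, as a spreadsheet does. A minimal sketch using Python's standard `graphlib` module (Python 3.9+) follows; the result names and dependencies are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def evaluation_order(dependencies):
    """Return an order in which results can be computed.

    `dependencies` maps each result name to the set of names it is
    computed from. Raises CycleError if the definitions are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Acyclic: C is computed from A and B, D from C.
acyclic = {"C": {"A", "B"}, "D": {"C"}}
order = evaluation_order(acyclic)

# Cyclic: A depends on B and B on A, which a transformation scheme forbids.
cyclic = {"A": {"B"}, "B": {"A"}}
try:
    evaluation_order(cyclic)
    has_cycle = False
except CycleError:
    has_cycle = True
```

A process model, by contrast, may loop back to an earlier step, so its execution order cannot in general be derived this way.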

GSIM Process Model

Process Method and Rules

Governance and Standards Alignment

• VTL will be maintained by the SDMX TWG
• Extensions will be considered for inclusion in future versions

• The VTL work has already produced some feedback to GSIM for its next version
• VTL can be mapped against SDMX
• VTL can be directly utilized by DDI in those places where computations are included
• VTL could be used in CSPA services where processing is performed
– As GSIM processing Rules

What's next?

• More operators and features + bug-fixing + fine-tuning = VTL 1.1

• Reuse of rules, structural validation?

• SDMX specifications (e.g. for exchanging VTL rules in SDMX messages, for storing rules and for requesting validation rules from web services) in progress

• Implementation tests with some pilot domains

• Integration within the ESS Validation Architecture (Validation project with national statistical institutes).

Conclusions

• A formal, unambiguous and standard language was needed for encoding validation rules so that these can be translated into specific data editing systems

• Use of generic software services provided within the ESS community is foreseen

• Great achievement, led by a task-force with experts from statistical institutes, central banks, international organisations and (a few) private experts

Thanks for your attention

Examples

Is the total = 100?

check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)

VTL Grammar: A Simple Example

Steps

ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]

check (ds1[keep (Country,Year, Percentage)][aggregate sum(Percentage)]=100, imbalance(Percentage), all)
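
For readers unfamiliar with VTL, the same logic can be approximated in plain Python. This is a sketch under assumptions, not a VTL implementation: it assumes that the aggregation sums Percentage over the remaining identifiers (Country and Year), that "imbalance" reports the difference from the expected total, and that "all" returns every Data Point rather than only the failing ones. The contents of ds1, including the Sector column, are hypothetical.

```python
from collections import defaultdict

# Hypothetical content for ds1; the slides do not show the data.
ds1 = [
    {"Country": "IT", "Year": 2014, "Sector": "A", "Percentage": 60.0},
    {"Country": "IT", "Year": 2014, "Sector": "B", "Percentage": 40.0},
    {"Country": "FR", "Year": 2014, "Sector": "A", "Percentage": 55.0},
    {"Country": "FR", "Year": 2014, "Sector": "B", "Percentage": 40.0},
]


def keep(rows, *names):
    """Keep only the named components of each Data Point."""
    return [{n: r[n] for n in names} for r in rows]


def aggregate_sum(rows, measure):
    """Sum `measure` over all Data Points sharing the other components."""
    totals = defaultdict(float)
    for r in rows:
        key = tuple((k, v) for k, v in r.items() if k != measure)
        totals[key] += r[measure]
    return [{**dict(key), measure: total} for key, total in totals.items()]


def check_total(rows, measure, expected):
    """Flag each Data Point and report its imbalance against `expected`."""
    return [
        {**r, "imbalance": r[measure] - expected, "passed": r[measure] == expected}
        for r in rows
    ]


# Step 1: keep the relevant components and aggregate.
totals = aggregate_sum(keep(ds1, "Country", "Year", "Percentage"), "Percentage")
# Step 2: check each total against 100.
results = check_total(totals, "Percentage", 100.0)
```

With this made-up data, the IT total passes and the FR total fails with an imbalance of -5.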

VTL Grammar: Another Example

• We have two Data Sets (D1 and D2) with the same structure:

VTL Grammar: A Simple Example (cont.)

• We want to create a table (Dresult) which provides totals, combining the values for the US and the European Union:

Dresult := D1 + D2

Results

Dresult is a Data Set containing the United States plus the European Union:
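
The tables from the original slides are not reproduced here. As a stand-in, the Python sketch below shows what `D1 + D2` does conceptually: Data Points with the same identifier values are matched and their measures are added. The component names (Year, Indicator, Value) and all figures are hypothetical, and matching only the identifier combinations present in both Data Sets is an assumption of this sketch.

```python
def add_datasets(d1, d2, identifiers, measure):
    """Add the measures of Data Points that share the same identifier values."""
    index = {tuple(r[i] for i in identifiers): r for r in d2}
    result = []
    for r in d1:
        key = tuple(r[i] for i in identifiers)
        if key in index:  # only combinations present in both Data Sets
            row = {i: r[i] for i in identifiers}
            row[measure] = r[measure] + index[key][measure]
            result.append(row)
    return result


# Hypothetical figures: D1 stands for the United States, D2 for the European Union.
D1 = [
    {"Year": 2013, "Indicator": "X", "Value": 10},
    {"Year": 2013, "Indicator": "Y", "Value": 20},
]
D2 = [
    {"Year": 2013, "Indicator": "X", "Value": 1},
    {"Year": 2013, "Indicator": "Y", "Value": 2},
]

Dresult = add_datasets(D1, D2, identifiers=("Year", "Indicator"), measure="Value")
```

The point of the VTL notation is that this whole function collapses into the single expression `Dresult := D1 + D2`, because the operator works on entire Data Sets and uses their structure to decide what to match and what to add.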