PPDM data loading best practice

www.etlsolutions.com Data Loading Best Practice PPDM Association 2012 Data Symposium

description

A review of data loading best practice, based on PPDM data from the Oil & Gas industry.

Transcript of PPDM data loading best practice

Page 1: PPDM data loading best practice

www.etlsolutions.com

Data Loading Best Practice

PPDM Association 2012 Data Symposium

Page 2: PPDM data loading best practice

Agenda

Data loading challenges

Best practices in tools and methodology

We’ll be taking a look at loading data into PPDM, but much of this applies to generic data loading too

Page 3: PPDM data loading best practice

Agenda

Data loading challenges

Best practices in tools and methodology

Page 4: PPDM data loading best practice

We’ve been listening to the Data Manager’s perspective

o PPDM Conference, Houston

o PNEC, Houston

o Data Managers’ challenges:

– Education

– Certification

– Preserving knowledge

– Process

Page 5: PPDM data loading best practice

Data management is difficult and important

Page 6: PPDM data loading best practice

Different data movement scenarios

Data

Migration

Loading

Integration

But all require mapping rules for best practice

Page 7: PPDM data loading best practice

The business view of data migration can be an issue

• Often started at the end of a programme

• Seen as a business issue (moving the filing cabinet), not technical

• However, the documents in the filing cabinet need to be read, understood and translated to the new system; obsolete files need to be discarded

Page 8: PPDM data loading best practice

Different data migration methodologies are available

PDM (Practical Data Migration)

• Johny Morris

• Training course

• PDM certification

• Abstract

• V2 due soon

Providers

• Most companies providing data migration services/products have a methodology

• Ours is PDM-like, but more concrete and less abstract

Page 9: PPDM data loading best practice

Agenda

Data loading challenges

Best practices in tools and methodology

Methodology

Page 10: PPDM data loading best practice

Core migration

As an example, our methodology

Landscape analysis

Data discovery

Data modelling

Data assurance

Data review

Data cleansing

Requirements analysis

Project scoping

Configuration

Migration design

Testing design

Migration development

Testing development

Execution

Review

Legacy decommissioning

Page 11: PPDM data loading best practice

Firstly, review the legacy landscape

Legacy

SAP

Application

Report

Archive

Access DB

Excel

VBA

Satellites

Page 12: PPDM data loading best practice

Eradicate failure points

Beware the virtual waterfall process

Signoff

Migrate

Agile Development

Requirements

Page 13: PPDM data loading best practice

Agenda

Data loading challenges

Best practices in tools and methodology

Rules

Page 14: PPDM data loading best practice

• In data migration, integration or loading, one area of commonality is the link between source and target

• This requires design, definition, testing, implementation and documentation

• The aim is automated loading of external data into a common store

• This requires best practice

Rules are required

PPDM

Page 15: PPDM data loading best practice

Best practice: A single version of truth

PPDM 3.8

• So for each of these data loaders we want a single version of truth

• Whatever artifacts are required, we want to remove duplication, because duplication means errors, inconsistency and additional work

• We want to remove boiler plate components that are only indirectly related to the business rules by which data is loaded

• Let’s look at what goes into a data loader and where the duplication and unnecessary work comes from...

Page 16: PPDM data loading best practice

The PPDM physical model

• PPDM comes to us as a physical projection, rather than a logical model – maps directly to a relational database

• Access is therefore via SQL and PL/SQL; low-level detail is important, i.e. how relationships are implemented (e.g. well header to borehole)

• Access considerations: primary keys; foreign keys; data types (conversions, maximum lengths); the load order required by FKs (PPDM “Load of the Rings”); relationships (cardinality etc.)

• SQL: errors only become known at runtime, so turnaround can be slow

• All of this metadata is available in machine readable format, so we should use it
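As an illustration of using machine-readable metadata, the sketch below derives a load order from foreign-key dependencies with a topological sort. It is a minimal sketch only: the table names and the FK relationships in `FOREIGN_KEYS` are assumptions made up for the example, not taken from the PPDM DDL, and a real loader would read them from the database catalogue.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical foreign-key metadata: child table -> set of parent tables.
# These relationships are illustrative assumptions, not the real PPDM model.
FOREIGN_KEYS = {
    "BUSINESS_ASSOCIATE": set(),
    "WELL": {"BUSINESS_ASSOCIATE"},
    "WELL_LOG": {"WELL"},
    "WELL_LOG_CURVE": {"WELL_LOG"},
}


def load_order(foreign_keys):
    """Return tables ordered so that every parent precedes its children."""
    # TopologicalSorter treats each mapped set as the node's predecessors,
    # so static_order() yields parents first.
    return list(TopologicalSorter(foreign_keys).static_order())


order = load_order(FOREIGN_KEYS)
print(order)
```

If the metadata contains circular foreign keys, `static_order()` raises `graphlib.CycleError`, which is a useful early warning that the load needs deferred constraints or a two-pass approach.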

Page 17: PPDM data loading best practice

External data sources

• Looking at the external files, we need a variety of skills: text manipulation, XML processing, Excel, database

• The data model is unlikely to be as rich as PPDM, but there is some definition of the content e.g. Excel workbooks have a tabular layout with column titles, worksheets are named

• It can be hard to find people with the relevant skills - you sometimes see ad hoc, non-standard implementations because the developer used whatever skills he/she had: perl, python, xslt, sql

• So the next clue is that we should use the model information: what elements, attributes and relationships are defined, rather than details of how we access it

• Abstract out the data access layer; don’t mix data access with the business rules required to move them into PPDM

PPDM 3.8
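A minimal sketch of separating data access from business rules, assuming a delimited text source. The `RecordSource` protocol, the `CsvSource` class, the source field names (`well_id`, `name`) and the choice of target columns are all hypothetical and used only for illustration.

```python
import csv
import io
from typing import Dict, Iterator, List, Protocol


class RecordSource(Protocol):
    """Anything that can yield records as plain dictionaries."""

    def records(self) -> Iterator[Dict[str, str]]: ...


class CsvSource:
    """Data access for delimited text; knows nothing about the target model."""

    def __init__(self, text: str) -> None:
        self._text = text

    def records(self) -> Iterator[Dict[str, str]]:
        yield from csv.DictReader(io.StringIO(self._text))


def map_well(record: Dict[str, str]) -> Dict[str, str]:
    """Business rule only: source elements to target columns (illustrative)."""
    return {
        "UWI": record["well_id"].strip(),
        "WELL_NAME": record["name"].strip().upper(),
    }


def transform(source: RecordSource) -> List[Dict[str, str]]:
    """Apply the rule to every record, whatever the underlying format."""
    return [map_well(r) for r in source.records()]


SAMPLE = "well_id,name\n 100 ,Alpha 1\n101,beta 2\n"
rows = transform(CsvSource(SAMPLE))
print(rows)
```

An Excel or XML reader could be swapped in by providing another class with the same `records()` method; `map_well` would not change.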

Page 18: PPDM data loading best practice

Challenges with domain expert mapping rules

• A common step for defining how a data source is to be loaded is for a domain expert to write it up in Excel

• Not concerned with data access, but some details will creep in, e.g. specifying an xpath

• When lookups, merging/splitting values, string manipulation, conditional logic appear, the description can become ambiguous

• Also note the duplication: the model metadata is being written in the spreadsheet; if the model changes, the spreadsheet needs to be manually updated

Page 19: PPDM data loading best practice

Challenges with developer mapping rules

• The example here probably wouldn’t pass a code inspection, but it does illustrate the type of issues that can arise

• Firstly, duplication: this is reiterating the Excel rules – they need to match up, but while a domain expert might follow the simple example previously, low level code can be tricky to discuss

• Secondly, metadata is again duplicated: the names of the tables and columns appear in the SQL statements, the max length of the name column is checked

• Thirdly, boiler plate code: select/update/insert conditional logic

• Fourthly, data access code appears in the rules

• Finally, the code becomes hard to maintain as the developer moves on to other roles
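One way to avoid repeating table names, column lengths and select/update/insert logic in every rule is to drive that boiler plate from metadata. The sketch below is an assumption-laden illustration: `WELL_META` and its maximum lengths are invented for the example, and an in-memory SQLite database stands in for the relational target.

```python
import sqlite3

# Hypothetical table metadata; a real loader would read this from the model.
WELL_META = {
    "table": "WELL",
    "key": ["UWI"],
    "columns": {"UWI": 20, "WELL_NAME": 66},  # column -> assumed max length
}


def upsert(conn, meta, row):
    """Insert the row, or update it if its primary key already exists."""
    for col, max_len in meta["columns"].items():
        if len(row[col]) > max_len:
            raise ValueError(f"{col} exceeds {max_len} characters")
    cols = list(meta["columns"])
    where = " AND ".join(f"{k} = ?" for k in meta["key"])
    key_vals = [row[k] for k in meta["key"]]
    # Identifiers come from trusted metadata; values are always parameters.
    exists = conn.execute(
        f"SELECT 1 FROM {meta['table']} WHERE {where}", key_vals
    ).fetchone()
    if exists:
        non_key = [c for c in cols if c not in meta["key"]]
        sets = ", ".join(f"{c} = ?" for c in non_key)
        conn.execute(
            f"UPDATE {meta['table']} SET {sets} WHERE {where}",
            [row[c] for c in non_key] + key_vals,
        )
    else:
        marks = ", ".join("?" for _ in cols)
        conn.execute(
            f"INSERT INTO {meta['table']} ({', '.join(cols)}) VALUES ({marks})",
            [row[c] for c in cols],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WELL (UWI TEXT PRIMARY KEY, WELL_NAME TEXT)")
upsert(conn, WELL_META, {"UWI": "100", "WELL_NAME": "ALPHA 1"})
upsert(conn, WELL_META, {"UWI": "100", "WELL_NAME": "ALPHA 1A"})
rows = conn.execute("SELECT UWI, WELL_NAME FROM WELL").fetchall()
print(rows)
```

The mapping rules then only need to produce the row; the length check and the insert-or-update decision live in one place.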

Page 20: PPDM data loading best practice

Documentation of mapping rules

Word document for sign-off

Data Management record

How data was loaded

Stored in your MDM data store

Can be queried

PPDM mapping tables

Page 21: PPDM data loading best practice

• Here is where you do require some duplication

• Tests are stories:

• Define what the system should do

• If it does, the system is good enough if the tests are complete

• If we use a single version of truth to generate tests, the tests will duplicate errors, not find them

Test artifacts
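A minimal sketch of the point about test independence: the expected values are written by hand from the requirement rather than generated from the mapping definition. The rule `derive_well_name` is a hypothetical stand-in for a mapping rule under test.

```python
def derive_well_name(raw_name: str) -> str:
    """Stand-in for a mapping rule under test (hypothetical rule)."""
    return " ".join(raw_name.split()).upper()


# Expected values are written by hand from the requirement,
# not produced by running the rule itself.
CASES = [
    ("  alpha   1 ", "ALPHA 1"),
    ("Beta 2", "BETA 2"),
    ("", ""),
]


def run_cases():
    """Return a list of (input, expected, actual) for every failing case."""
    failures = []
    for raw, expected in CASES:
        actual = derive_well_name(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


failures = run_cases()
print(failures)
```

Had the expected column been generated from the same definition as the rule, a mistake in that definition would appear on both sides and the test would still pass.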

Page 22: PPDM data loading best practice

Agenda

Data loading challenges

Best practices in tools and methodology

Tools

Page 23: PPDM data loading best practice

Use tools

• Use available metadata

• Abstract out data access layer

• Higher level DSL for the mapping rules:

• Increase team communication – developer/business

• Reduce boiler plate code

• One definition:

• Replace Excel and code

• Generate documentation
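The “one definition” idea can be sketched as a single mapping structure that is both executed and rendered as documentation. This is an illustration only, not how any particular tool works; the `MAPPING` entries, rule names and field names are hypothetical.

```python
# One mapping definition (all names hypothetical): target, source, rule, note.
MAPPING = [
    {"target": "UWI", "source": "well_id", "rule": "trim",
     "note": "Unique well identifier"},
    {"target": "WELL_NAME", "source": "name", "rule": "upper",
     "note": "Name in upper case"},
]

# Named rules shared by the executable and documentation views.
RULES = {
    "trim": lambda v: v.strip(),
    "upper": lambda v: v.strip().upper(),
}


def apply_mapping(record):
    """Executable view of the mapping."""
    return {m["target"]: RULES[m["rule"]](record[m["source"]]) for m in MAPPING}


def document_mapping():
    """Documentation view of the same mapping, as plain text lines."""
    return [
        f"{m['source']} -> {m['target']} ({m['rule']}): {m['note']}"
        for m in MAPPING
    ]


result = apply_mapping({"well_id": " 100 ", "name": "alpha 1"})
doc = document_mapping()
print(result)
print("\n".join(doc))
```

Because both views read the same `MAPPING` list, a change to a rule cannot leave the sign-off document out of step with the loader.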

Page 24: PPDM data loading best practice

An example of a graphical tool: Altova MapForce

• Tools such as Talend, Mule DataMapper and Altova MapForce take a predominantly graphical approach

• The metadata is loaded on the left and right (source/target), with connecting lines between them

• In addition to the logic gates for more complex processing, code snippets can be added to implement most business logic

• Issues:

• Is it really very easy to read? The example here is a simple mapping; imagine PPDM well log curve, reference data tables etc

• It isn’t easy to see what really happens: a+b versus an “adder” – e.g. follow the equal() to Customers – what does that actually do?

• But: can generate documentation and executable from that single definitive mapping definition

• Typing errors etc are mostly eliminated

Page 25: PPDM data loading best practice

ETL Solutions’ Transformation Manager

• An alternative is to use a textual DSL: again the metadata has been loaded

• No data access code

• Metadata is used extensively: for example warnings, primary key for identification; relationships

• Typing errors are checked at design time, and model or element changes affecting the code are quickly detected e.g. PPDM 3.8 to 3.9

• Relationships are used to link transforms: a more logical view with no need to understand underlying constraints; complexity of the model doesn’t matter, as the project becomes structured naturally

• FK constraints used to determine load order

• Metadata pulled in directly from the source e.g. PPDM, making use of all the hard work put in by the PPDM Association

Page 26: PPDM data loading best practice

Generated documentation

Page 27: PPDM data loading best practice

One of the many questions a data manager has about the data he/she manages:

Keeping the PPDM data manager happy

PPDM 3.8

Data lineage: How did this data get here?

Page 28: PPDM data loading best practice

PPDM provides tables to record data lineage

Page 29: PPDM data loading best practice

Transformation Manager can generate documentation for the PPDM metadata module

Page 30: PPDM data loading best practice

Agenda

Data loading challenges

Best practices in tools and methodology

Project management

Page 31: PPDM data loading best practice

Key points

• Be aware

• Look at data migration methodologies

• Select appropriate components

• Look for and remove large risky steps

• Start early

• Ensure correct resources will be available

• No nasty budget surprises

• Use tools

• Build a happy virtual team

Page 32: PPDM data loading best practice

• Did you know about these tables?

• Who uses them?

• How do you use them?

• What features would be truly useful in a data loader tool?

Questions

Page 33: PPDM data loading best practice

www.etlsolutions.com

Raising data management standards


Contact us for more information:

Karl Glenn, Business Development

[email protected]

+44 (0) 1912 894040

Read more on our website:

http://www.etlsolutions.com/what-we-do/oil-and-gas/

Images from Free Digital Photos freedigitalphotos.net