Building a successful agile data transformation stack

Data Transformation made easy Building a successful agile data transformation stack Martin Magdinier March 2014

description

With today's abundance of data (big or small), an organization's ability to capture, understand and process new content is key to its success. Martin Magdinier has developed a custom data transformation stack to integrate over a hundred eclectic data feeds into a single repository. His process goes through three stages: data discovery and exploration, rapid data transformation prototyping, and automation of the data cleaning and transformation process. This presentation reviews the challenges specific to each step of the integration process and describes the tools used (OpenRefine, Talend, Crowdflower) and the processes developed to address them, while keeping the agility and flexibility of the overall stack in mind.

Transcript of Building a successful agile data transformation stack

Page 1: Building a successful agile data transformation stack

Data Transformation made easy

Building a successful agile data transformation stack

Martin Magdinier

March 2014

Page 2: Building a successful agile data transformation stack

Building an Agile Data Transformation Stack - Martin Magdinier

Agile Data Transformation Stack is the Key for Success

Page 3: Building a successful agile data transformation stack

If Data is the new oil, where are the gas stations?!

● Data is not (yet?) a standardized good:

– Environment with evolving technology and formats

– Unique needs:

● Industry

● Department

● Business case

Page 4: Building a successful agile data transformation stack

The Data Transformation Process

Your data transformation stack should help you to:

– Explore and search new data

– Identify and Extract relevant data

– Refine/Turn data into usable information

– Store & distribute to business users

Page 5: Building a successful agile data transformation stack

The Agile Data Transformation Stack

● Is a combination of complementary tools, technology and processes,

● Supporting rapid iteration of ideas, processes and products

● Focused on value creation for the customer (internal or external)

Page 6: Building a successful agile data transformation stack

The Data Transformation Stack

● Data processing platform

● Storage solutions

● Free, open source

● Suits your needs

● All software is cross-platform

Page 7: Building a successful agile data transformation stack

Agile Data Transformation Iteration

Progress in small incremental steps:

● Data Discovery & Profiling

– Mine existing data

– Add new data

● Data Transformation

– Process & Code

– Prototype (MVP)

– Semi automated

– Automation

● Track / Measure

– Collect feedback

– Learn from your experience

● Data Consumption

– Create value

– Generate new need

Page 8: Building a successful agile data transformation stack

Data Discovery & Profiling

[Figure: iteration cycle in small incremental steps: Data Discovery & Profiling, Data Transformation, Track / Measure, Data Consumption]

Page 9: Building a successful agile data transformation stack

Data Discovery

● Seek:

– New data sources

– New usage for existing data

● Validate:

– Does the data match my quality criteria?

– Can I create value out of it?

Page 10: Building a successful agile data transformation stack

Data Profiling

● Understand your data and make sense of it:

– Mine

– Explore

– Interact

– Transform

● Combine with visualization and reporting tools

● Iterate and explore various vantage points
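
As a rough illustration of what profiling a column can mean in practice, here is a minimal Python sketch that counts blank and distinct values, similar in spirit to a text facet. The function name `profile_column` and its output keys are hypothetical and do not come from the talk or from any of the tools it names.

```python
from collections import Counter


def profile_column(values, top=3):
    """Summarize one column: row count, blanks, distinct values, top values."""
    # Treat None and whitespace-only cells as blank; trim the rest.
    non_blank = [v.strip() for v in values if v and v.strip()]
    counts = Counter(non_blank)
    return {
        "rows": len(values),
        "blank": len(values) - len(non_blank),
        "distinct": len(counts),
        "most_common": counts.most_common(top),
    }


if __name__ == "__main__":
    cities = ["Toronto", "toronto", "", " Toronto", None]
    print(profile_column(cities))
```

Even a summary this small surfaces questions worth iterating on, such as whether "Toronto" and "toronto" should count as the same value.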

Page 11: Building a successful agile data transformation stack

Data Transformation

[Figure: iteration cycle in small incremental steps: Data Discovery & Profiling (mine existing data, add new data, refine requirements); Data Transformation (process & code, prototype (MVP), semi automated, automation); Track / Measure (collect feedback, learn from your experience); Data Consumption (create value, generate new need)]

Page 12: Building a successful agile data transformation stack

Role of a Working Prototype

● Minimize project cost and development time

● Focus on core functions of the transformation process (packaging will come later)

● Define your transformation strategy in a sandbox mode

– Validate your assumptions

– Identify road blocks on the path to automation

Page 13: Building a successful agile data transformation stack

Iterate - Iterate - Iterate

● Improve and grow by incremental steps

● Start feeding your business with data

– Validate if there is value in this data

– Collect feedback from the users

● Iterate as much as necessary

Page 14: Building a successful agile data transformation stack

Discovery, Profiling & Prototyping

● Designed for technical and business users

● Support a variety of input formats

● Allow easy and safe interaction with the data:

– Somewhere between Excel

● Point and click user friendly interface

● Changes preview

● Undo / Redo functions

– and SQL

● Query oriented language

● Handling large amounts of data

Page 15: Building a successful agile data transformation stack

OpenRefine Interface

Facet for fast filtering

Expression builder

Instant preview of the transformation

Page 16: Building a successful agile data transformation stack

Prototyping & Automation

● Extract – Transform – Load solution

● Process focus with

– Drag and drop component graphical interface

– Java based

● Compile your job to run it on your server

– Java (Talend Open Studio)

– Map reduce (Talend for Big Data)

● Connect to anything

● Open Source: Ease of addition / customizing your own components / library

Page 17: Building a successful agile data transformation stack

Talend Open Studio Interface

Drag, drop, connect and configure components

Process oriented interface

List of components available

Page 18: Building a successful agile data transformation stack

Semi Automated Cleaning

● Intelligent Meta Crowd-sourcing Platform

● Build your job for data:

– clean up

– analysis

– categorization

– collection ...

● Ensure quality output

– Check consistency of results

– Select best worker

● Web Interface to

– Build Prototype

– Test job

● API for automation

– OpenRefine extension

– Talend Internet component

Page 19: Building a successful agile data transformation stack

Lessons Learned

[Figure: iteration cycle in small incremental steps: Data Discovery & Profiling, Data Transformation, Track / Measure, Data Consumption]

Page 20: Building a successful agile data transformation stack

Don't repeat yourself

● 1 process = 1 independent component / job

● Reuse your existing components

● Maintain your code in one place

● Add a few new items at each iteration

Page 21: Building a successful agile data transformation stack

Name Splitting

● Split FullName into FirstName and LastName

– John Doe / John Van de Doe / John Della Doe

1. Define logic and exception list in OpenRefine

2. Translate the logic into a Talend component (tJavaRow)

3. Move the Talend component to a routine
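
The splitting logic with an exception list could look roughly like the Python sketch below. It is not the Talend tJavaRow code from the talk. The `PARTICLES` list and the function name `split_full_name` are hypothetical, and the rule assumed here is that surname particles such as "Van", "de" or "Della" belong to the last name.

```python
# Hypothetical exception list: lowercase particles treated as part of the last name.
PARTICLES = {"van", "de", "della", "der", "von"}


def split_full_name(full_name):
    """Split a full name into (first_name, last_name)."""
    tokens = full_name.split()
    if not tokens:
        return "", ""
    if len(tokens) == 1:
        return tokens[0], ""
    # By default the last token is the last name.
    start = len(tokens) - 1
    # Extend the last name leftwards over particles,
    # always leaving at least one token for the first name.
    while start > 1 and tokens[start - 1].lower() in PARTICLES:
        start -= 1
    return " ".join(tokens[:start]), " ".join(tokens[start:])


if __name__ == "__main__":
    for name in ["John Doe", "John Van de Doe", "John Della Doe"]:
        print(name, "->", split_full_name(name))
```

Keeping the exception list as data rather than as branching logic means new particles found during a later iteration can be added without touching the function.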

Page 22: Building a successful agile data transformation stack

Garbage in - Garbage out

● Catch errors early

– The sooner, the easier

– Do not build the next step on erroneous data

● Independent process

– Make it easier to track and debug

– When the bug is fixed, every process / job benefits from it
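
One way to catch errors early is to validate each record before it reaches the next step and to set rejected records aside with a reason. The Python sketch below illustrates that idea only. The field names (`full_name`, `email`) and both rules are hypothetical, not taken from the talk.

```python
def validate(record):
    """Return a list of problems; an empty list means the record may move on."""
    problems = []
    if not (record.get("full_name") or "").strip():
        problems.append("missing full_name")
    if "@" not in (record.get("email") or ""):
        problems.append("invalid email")
    return problems


def partition(records):
    """Split records into accepted ones and (record, problems) rejects."""
    accepted, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            accepted.append(record)
    return accepted, rejected


if __name__ == "__main__":
    rows = [
        {"full_name": "John Doe", "email": "john@example.com"},
        {"full_name": "", "email": "not-an-email"},
    ]
    ok, bad = partition(rows)
    print(len(ok), "accepted;", len(bad), "rejected:", bad)
```

Because the checks live in one function, a fix to a rule benefits every job that calls it.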

Page 23: Building a successful agile data transformation stack

Know where the value is

● A poorly planned data cleaning process is a never ending job (and a depressing experience)

● Prototyping helps to

– Anticipate how dirty the data is

● Plan appropriate strategy

● Discard the source early on if too dirty

– Set quality level of acceptance

● Level of granularity

● Data format

● ...

Page 24: Building a successful agile data transformation stack

Example: Address parsing

91 King Street East

305 – 1055, 20 TH ST SW

● Option A:

– Address Line 1

– Address Line 2

● Option B:

– Street Number

– Street Name

– Unit / PO Box

– Unit / PO Box Number
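
To give a feel for the extra effort Option B implies, here is a minimal Python sketch that handles only the two sample addresses. It assumes that "305 – 1055" means unit 305 at street number 1055, which the slide does not state. The pattern and the output field names are hypothetical, and real address data would need many more rules.

```python
import re

# Assumed layout: optional "<unit> – " prefix, street number, optional comma, street name.
ADDRESS = re.compile(
    r"^(?:(?P<unit>\d+)\s*[–-]\s*)?(?P<number>\d+)\s*,?\s+(?P<street>.+)$"
)


def parse_address(line):
    """Return the Option B fields as a dict, or None if the line does not fit."""
    match = ADDRESS.match(line.strip())
    if match is None:
        return None
    return {
        "street_number": match["number"],
        "street_name": match["street"].strip(),
        "unit_number": match["unit"],  # None when there is no unit prefix
    }


if __name__ == "__main__":
    for sample in ["91 King Street East", "305 – 1055, 20 TH ST SW"]:
        print(sample, "->", parse_address(sample))
```

Lines that return None are candidates for manual or semi-automated cleaning rather than for more regular expressions.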

Page 25: Building a successful agile data transformation stack

Know when to stop

● Plan your process keeping in mind the effort to

– Build

– Operate

– Maintain

● Balance fully automated vs semi-automated process

– Manual Cleaning - Crowdflower API

– OpenRefine Redo / Apply function

– Talend job

Page 26: Building a successful agile data transformation stack

Undo / redo in OpenRefine

History to undo previous steps

Extract and reapply transformation steps on a different project

JSON code to copy / paste in a different project

Page 27: Building a successful agile data transformation stack

Know when to stop

Build your job in Crowdflower

Page 28: Building a successful agile data transformation stack

Cleaning Typo

● How do you spell:

– Mississagua

– mississauga

– Mississauga

– Mississuaga

– Misssisauga

● Algorithms

– Levenshtein

– Fingerprint

– n-gram

– Metaphone

– PPM

● Process followed

– Test and explore various algorithms in OpenRefine

– Automate in Talend with tFuzzyMatch

– Add human validation over a certain threshold
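
Two of the listed algorithms can be sketched in a few lines of Python. The `fingerprint` function below follows the general key-collision idea (trim, lowercase, strip punctuation, sort unique tokens) and is not OpenRefine's exact implementation. `levenshtein` is the standard edit distance. The helper `match_city` and its `review_above` threshold are hypothetical stand-ins for the human validation step.

```python
import string
import unicodedata


def fingerprint(value):
    """Key that ignores case, accents, punctuation and word order."""
    ascii_value = (
        unicodedata.normalize("NFKD", value).encode("ascii", "ignore").decode("ascii")
    )
    cleaned = ascii_value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def levenshtein(a, b):
    """Number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def match_city(value, reference, review_above=1):
    """Return (best reference name, distance, needs_human_review)."""
    key = fingerprint(value)
    best = min(reference, key=lambda name: levenshtein(key, fingerprint(name)))
    distance = levenshtein(key, fingerprint(best))
    return best, distance, distance > review_above


if __name__ == "__main__":
    cities = ["Mississauga", "Toronto"]
    for raw in ["mississauga", "Mississagua", "Misssisauga"]:
        print(raw, "->", match_city(raw, cities))
```

The fingerprint alone resolves case differences such as "mississauga", while the transposed spellings sit at an edit distance of 2 from the reference and are flagged for review under the threshold assumed here.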

Page 29: Building a successful agile data transformation stack

Cleaning Typo

1. OpenRefine cluster interface to test different algorithms

2. tFuzzyMatch in Talend to automate transformation

Page 30: Building a successful agile data transformation stack

Conclusion

● Think Agile!

● Iterate as often as you can

– Start small and build on it

– Confirm your assumption

– Focus on value creation

● Build a data friendly environment

– Choose your tools carefully

– Leave room for learning and growing

Page 31: Building a successful agile data transformation stack

Contact

Ask me questions!

Martin Magdinier

● LinkedIn: www.linkedin.com/in/magdinier/en

● Twitter: @magdmartin

● Email

[email protected]

[email protected]