"Beyond the Data Lake", Matthias Korn, Technical Consultant at datavirtuality

Posted on 16-Apr-2017

US Office: 1355 Market Street, #488, San Francisco, CA 94103

German Office: Katharinenstr. 15, 04109 Leipzig, Germany

Beyond the Data Lake

Simplifying data integration for the modern age

Matthias Korn | Technical Consultant

matthias.korn@datavirtuality.de

The Challenge

Gartner 2014: “VARIETY is the biggest challenge.”

“When asked about the dimensions of data organizations struggle with most, 49% answered variety, while 35% answered volume and 16% velocity.”

Integration using the Data Warehouse

Data is integrated by copying it into a central repository

Approach: ETL process

Structure is applied in the repository

BI users query Data Marts

Why do so many DWH projects fail?

Inflexible; costly modifications

Labour-intensive setup and maintenance

77% failure rate*

Slow data-to-actionable-insights (6 to 9+ months)

Data Lake – getting data in is pretty easy…

Databases

Web API

Sensor Data

Server logs

Clickstream Data

Unique identifier provided

Metadata tags provided

Original data structure
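The three properties on this slide can be illustrated with a minimal sketch. It assumes a plain dictionary as the "lake"; the names `ingest` and `find_by_label` and the tag layout are hypothetical and not taken from any product.

```python
import uuid
from datetime import datetime, timezone


def ingest(lake, payload, source, labels):
    """Store a payload unchanged and return the identifier assigned to it."""
    object_id = str(uuid.uuid4())  # unique identifier provided on ingest
    lake[object_id] = {
        # metadata tags provided alongside the data
        "tags": {
            "source": source,
            "labels": sorted(labels),
            "ingested_at": datetime.now(timezone.utc).isoformat(),
        },
        # original data structure kept exactly as delivered
        "raw": payload,
    }
    return object_id


def find_by_label(lake, label):
    """Return the identifiers of all objects carrying the given label."""
    return [oid for oid, obj in lake.items() if label in obj["tags"]["labels"]]


lake = {}
log_id = ingest(lake, "GET /index.html 200", "webserver", ["server-log"])
click_id = ingest(lake, {"user": 7, "page": "/pricing"}, "shop", ["clickstream"])
print(find_by_label(lake, "clickstream") == [click_id])
```

Nothing is parsed or validated on the way in, which is why loading is easy and interpretation is deferred to whoever reads the data later.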

…but making sense of it is the challenge

Business User

?

Approaches to data fishing

Situation improved with YARN

Apache Mahout, HBase, Hive, Pig and MapReduce

Data Marts are created

BI users' reporting tools query Data Marts

Wait, didn't they do this before already?

"Transform" just changed its position: ETL -> ELT
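A minimal sketch of the difference, assuming an in-memory SQLite database stands in for the target store. The table names `raw_sales` and `daily_revenue` and the sample rows are made up for illustration.

```python
import sqlite3

# Raw records as they arrive from a source: amounts are still strings.
rows = [("2017-04-01", "12.50"), ("2017-04-01", "7.50"), ("2017-04-02", "20.00")]


def etl(rows):
    """ETL: transform in application code, then load only the result."""
    totals = {}
    for day, amount in rows:
        totals[day] = totals.get(day, 0.0) + float(amount)
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE daily_revenue (day TEXT, revenue REAL)")
    db.executemany("INSERT INTO daily_revenue VALUES (?, ?)", sorted(totals.items()))
    return db


def elt(rows):
    """ELT: load the raw rows first, then transform inside the store."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE raw_sales (day TEXT, amount TEXT)")
    db.executemany("INSERT INTO raw_sales VALUES (?, ?)", rows)
    db.execute(
        "CREATE TABLE daily_revenue AS "
        "SELECT day, SUM(CAST(amount AS REAL)) AS revenue "
        "FROM raw_sales GROUP BY day"
    )
    return db


query = "SELECT day, revenue FROM daily_revenue ORDER BY day"
etl_result = etl(rows).execute(query).fetchall()
elt_result = elt(rows).execute(query).fetchall()
print(etl_result == elt_result)
```

Both paths end with the same data mart table; only the place where the transform step runs has moved.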

Data Marts have to be created by Data Scientists

BI users can't do new things

No permission concept

A lot of the stored data is never used, eating up the savings from low storage costs

The Logical Data Warehouse

Introduced by Gartner in 2012

New data management architecture for analytics

Uses repositories just like the EDW

Adds distributed processes

Adds virtualization of data sources

Logical Data Warehouse (LDW)

What does the Logical Data Warehouse do?

LDW knows where the data is stored instead of copying it

Repositories are used for data sources that are too slow

Presents all data in a single virtual database

Quickly reacts to changes in data models of source systems
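The "single virtual database" idea can be approximated in a few lines. This is only a sketch: it assumes two attached in-memory SQLite databases stand in for independent source systems, and the schema names `crm` and `shop`, the tables, and the view are hypothetical.

```python
import sqlite3

# One connection plays the role of the virtual layer; each attached
# in-memory database stands in for a separate source system.
ldw = sqlite3.connect(":memory:")
ldw.execute("ATTACH DATABASE ':memory:' AS crm")
ldw.execute("ATTACH DATABASE ':memory:' AS shop")

ldw.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
ldw.execute("CREATE TABLE shop.orders (customer_id INTEGER, total REAL)")
ldw.executemany("INSERT INTO crm.customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
ldw.executemany(
    "INSERT INTO shop.orders VALUES (?, ?)", [(1, 30.0), (1, 12.0), (2, 5.0)]
)

# A view holds only the definition of the join, not a copy of the rows.
# It is a TEMP view because SQLite lets temporary views span attached databases.
ldw.execute(
    "CREATE TEMP VIEW customer_revenue AS "
    "SELECT c.name AS name, SUM(o.total) AS revenue "
    "FROM crm.customers AS c JOIN shop.orders AS o ON o.customer_id = c.id "
    "GROUP BY c.name"
)

revenue = ldw.execute(
    "SELECT name, revenue FROM customer_revenue ORDER BY name"
).fetchall()
print(revenue)
```

The consumer queries `customer_revenue` as if it were one table. If a source model changes, only the view definition has to be adjusted, not a copy pipeline.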

Advantages of the Logical Data Warehouse

Real-time data available and ready for analysis

Immediately productive

Logical Data Model

Permission concept

Web services

Write to connected systems

Example data flow in an LDW

Distributed query

BI frontend aware of all data sources - creates SQL statement

Performance optimization engine replicates data only if needed
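The slide does not say how the optimization engine decides when to replicate. The sketch below assumes one simple rule, a declared latency threshold per source. The classes `Source` and `VirtualLayer` and the parameter `slow_threshold_ms` are hypothetical and do not describe a product's API.

```python
class Source:
    """A stand-in source system with a declared typical latency."""

    def __init__(self, name, rows, latency_ms):
        self.name = name
        self.rows = rows
        self.latency_ms = latency_ms
        self.reads = 0  # number of round trips made to this source

    def fetch(self):
        self.reads += 1
        return list(self.rows)


class VirtualLayer:
    """Answers joins across sources, replicating only the slow ones."""

    def __init__(self, sources, slow_threshold_ms=500):
        self.sources = {s.name: s for s in sources}
        self.slow_threshold_ms = slow_threshold_ms
        self.replicas = {}

    def scan(self, name):
        source = self.sources[name]
        if source.latency_ms <= self.slow_threshold_ms:
            return source.fetch()  # fast source: always query it live
        if name not in self.replicas:
            self.replicas[name] = source.fetch()  # slow source: copy once
        return self.replicas[name]

    def join(self, left, right, key):
        """Hash join of two sources on a shared key column."""
        index = {}
        for row in self.scan(right):
            index.setdefault(row[key], []).append(row)
        return [
            {**l_row, **r_row}
            for l_row in self.scan(left)
            for r_row in index.get(l_row[key], [])
        ]


crm = Source("crm", [{"customer_id": 1, "name": "Ada"}], latency_ms=20)
archive = Source("archive", [{"customer_id": 1, "total": 42.0}], latency_ms=3000)
layer = VirtualLayer([crm, archive])

first = layer.join("crm", "archive", "customer_id")
second = layer.join("crm", "archive", "customer_id")
print(first == second, crm.reads, archive.reads)
```

The fast source is read on every query, while the slow one is read once and served from its replica afterwards. A real engine would also have to refresh or invalidate replicas, which this sketch leaves out.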

Conclusion

Logical Data Warehouse holds enormous promise

Flexibility and real-time access give an advantage

Use Hadoop for batch jobs rather than integration

We dataconomy!

DataVirtuality

Thanks for your attention!

Visit our stand in the exhibition area.

matthias.korn@datavirtuality.de