Best Practices for Unleashing the Power of Data Lakes

Isabelle Nuage & Christophe Toum, Big Data Products, Talend

Transcript of Best Practices for Unleashing the Power of Data Lakes

Page 1: Best Practices for Unleashing the Power of Data Lakes

#TalendConnect

Best practices for unleashing the power of data lakes
Isabelle Nuage & Christophe Toum, Big Data Products, Talend

Page 2: Best Practices for Unleashing the Power of Data Lakes

Self-service data lake, cafeteria style

Using sensor data collected in real time to improve gas turbine reliability and operational performance, and to extend lifetime value.

Page 3: Best Practices for Unleashing the Power of Data Lakes

Why Do We Need a Data Lake?

“Data lakes are enterprise-wide data management platforms for analyzing disparate sources of data in its native format.” (Gartner)

Business Value

Reducing cost
• ETL offload
• EDW offload/optimization
• Data archiving

Generating new opportunities
• Customer acquisition, retention…
• Real-time engagement
• Pricing optimization
• Demand forecasting
• Risk and fraud
• Predictive maintenance
• Smart products…

Page 4: Best Practices for Unleashing the Power of Data Lakes

But Data Lakes Bring New Challenges

The rest of us

High-end users

Complexity, poor governance and control, no reuse

Page 5: Best Practices for Unleashing the Power of Data Lakes

Data Lake – Conceptual Architecture

Acquire / Ingest

Understand & Improve

Curate & Govern

Deliver / Self-service

SCALE

Page 6: Best Practices for Unleashing the Power of Data Lakes

Best Practices to a Successful Data Lake

Accelerate Data Ingestion

Understand & Govern Your Data

Remove Silos

Unify Data Management

Deliver Data to a Wide Audience

Continuously refreshed data

Continuous data delivery and data processes

Page 7: Best Practices for Unleashing the Power of Data Lakes

Best Practices to a Successful Data Lake

Accelerate Data Ingestion

• Wide connectivity
• Batch & streaming ubiquity
• Scale with volume and variety

Pitfalls:
o Hand coding
o Fragmented tools

Page 8: Best Practices for Unleashing the Power of Data Lakes

Best Practices to a Successful Data Lake

Understand & Govern Your Data

• Add context on data (provenance, semantics…)
• Optimize data with curation, stewardship, preparation…
• Use a collaborative process

Pitfalls:
o Authoritative governance
o Inconsistent framework

Page 9: Best Practices for Unleashing the Power of Data Lakes

Best Practices to a Successful Data Lake

Remove Silos, Unify Data Management

• Pervasive DQ, masking…
• Consistent operationalization
• Single platform for all use cases & personas

Pitfalls:
o Fragmented tools
o Hand coding
o Shadow IT

Page 10: Best Practices for Unleashing the Power of Data Lakes

Best Practices to a Successful Data Lake

Deliver Data to a Wide Audience

• Make data accessible
• Governed self-service
• Scalable operationalization

Pitfalls:
o Unmanaged autonomy
o Self-service tools for the tech savvy

Page 11: Best Practices for Unleashing the Power of Data Lakes

Best Practices to a Successful Data Lake

GET READY FOR CHANGE

Page 12: Best Practices for Unleashing the Power of Data Lakes

Ingestion Best Practices

Transactions

Messages & Events

Logs

Sensors

Data Analytics & Data Science

Real-time Data Visualization

Real-time Indicators / Scorecard

Collect - Distribute

Track

Streaming

Windowing

Alert

NYC Taxi Data Streaming
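The windowing and alert steps named on this slide can be illustrated with a minimal Python sketch. It is not the Talend demo itself: the `Ride` record, its `pickup_ts` and `zone` fields, the 60-second window and the alert threshold are all hypothetical, chosen only to show the idea of counting events per fixed (tumbling) window and raising an alert when a count crosses a limit. It also assumes events arrive in timestamp order.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Ride:
    """Hypothetical taxi event: pickup time (epoch seconds) and a zone label."""
    pickup_ts: int
    zone: str


def tumbling_windows(
    rides: Iterable[Ride], window_s: int
) -> Iterator[Tuple[int, Counter]]:
    """Yield (window_start, rides-per-zone) for fixed, non-overlapping windows.

    Assumes rides arrive ordered by pickup_ts; late events are not handled.
    """
    current_start = None
    counts: Counter = Counter()
    for ride in rides:
        # Align the event to the start of the window it falls into.
        start = ride.pickup_ts - ride.pickup_ts % window_s
        if current_start is not None and start != current_start:
            # The previous window is complete: emit it and start a new one.
            yield current_start, counts
            counts = Counter()
        current_start = start
        counts[ride.zone] += 1
    if current_start is not None:
        yield current_start, counts


def alerts(
    rides: Iterable[Ride], window_s: int, threshold: int
) -> List[Tuple[int, str, int]]:
    """Return (window_start, zone, count) for zones at or above the threshold."""
    raised = []
    for start, counts in tumbling_windows(rides, window_s):
        for zone, n in sorted(counts.items()):
            if n >= threshold:
                raised.append((start, zone, n))
    return raised


if __name__ == "__main__":
    sample = [
        Ride(0, "Midtown"), Ride(10, "Midtown"), Ride(20, "SoHo"),
        Ride(50, "Midtown"), Ride(65, "SoHo"), Ride(70, "SoHo"),
    ]
    print(alerts(sample, window_s=60, threshold=3))
```

In a real streaming engine the same logic would run continuously over an unbounded stream and would also need a policy for late or out-of-order events, which this sketch leaves out.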

Page 13: Best Practices for Unleashing the Power of Data Lakes

NYC Taxi Data Streaming

Page 14: Best Practices for Unleashing the Power of Data Lakes

Disclaimer

• The future features described in this presentation are under consideration by Talend and are not commitments for future products, technologies, or services.
• The roadmap is subject to change and Talend does not guarantee the features or release dates.

Page 15: Best Practices for Unleashing the Power of Data Lakes

Roadmap 2017

Addressing the needs of large enterprises

Big Data

1st on Spark 2.0 & Data Prep on Big Data

Data Prep & Data Ingestion

Cloud Self-service

Data Stewardship & Self-service connectors

Governance

Apache Atlas

Page 16: Best Practices for Unleashing the Power of Data Lakes

Key Takeaways

• Analyze far more data to find more opportunities for innovation and transformation
• Real-time data streaming brings increased agility
• To unleash data lakes, data governance is essential

Page 17: Best Practices for Unleashing the Power of Data Lakes

Free Trial: Talend Big Data Sandbox

• A ready-to-run Docker environment

• A step-by-step expert guide

• Real-world scenarios using Spark, Kafka, MapReduce & NoSQL

www.talend.com/BigDataSandbox

Hit the Easy Button for Hadoop, Spark and Machine Learning

Page 18: Best Practices for Unleashing the Power of Data Lakes

Thank You