Implementing a Data Lake with Enterprise Grade Data Governance

Transcript of Implementing a Data Lake with Enterprise Grade Data Governance

Page 1: Implementing a Data Lake with Enterprise Grade Data Governance

© Hortonworks Inc. 2011 – 2014. All Rights Reserved

Implementing a Data Lake with Enterprise Grade Data Governance

We do Hadoop.

Page 2: Implementing a Data Lake with Enterprise Grade Data Governance

Your speakers

Andrew Ahn, Governance Product Manager, Hortonworks

Oliver Claude, CMO at Waterline

Page 3: Implementing a Data Lake with Enterprise Grade Data Governance

HDP: Data Governance We Do Hadoop

Page 4: Implementing a Data Lake with Enterprise Grade Data Governance

Enterprise Data Governance Goals

GOAL: Provide a common approach to data governance across all systems and data within the organization

•  Transparent: Governance standards and protocols must be clearly defined and available to all

•  Reproducible: Recreate the relevant data landscape at a point in time

•  Auditable: All relevant events and assets must be traceable, with appropriate historical lineage

•  Consistent: Compliance practices must be consistent

[Diagram: Governance Framework, shown with ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, and Archive]

Page 5: Implementing a Data Lake with Enterprise Grade Data Governance

Data Governance Challenges WITHIN Hadoop

•  No comprehensive governance within the Hadoop stack

•  Mostly disjoint as each project defines its own future and there is no common framework

•  Disparate tools, such as HCatalog, Ranger, and Falcon, provide pieces of the overall solution

•  No integration with external governance frameworks

•  Difficult to get right because each project is autonomous and you need insight into traditional IT

[Diagram: Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm]

Page 6: Implementing a Data Lake with Enterprise Grade Data Governance

Data Governance Initiative for Hadoop

[Diagram: Data Governance Initiative and Common Governance Framework, shown with enterprise systems (ETL/DQ, BPM, Business Analytics, Visualization & Dashboards, ERP, CRM, SCM, MDM, Archive) and Hadoop projects (Apache Pig, Apache Hive, Apache HBase, Apache Accumulo, Apache Solr, Apache Spark, Apache Storm)]

TWO Requirements

1.  Hadoop must snap in to the existing frameworks and be a good citizen

2.  Hadoop must also provide governance within its own stack of technologies

A group of companies dedicated to meeting these requirements in the open

Page 7: Implementing a Data Lake with Enterprise Grade Data Governance

Common Data Governance Use Cases

Financial: Reporting Chain of custody, Lineage Narratives

Telco: Device log management, Correlation, Analysis, and Mitigation

Retail: Point of sale analysis, Price optimization

Healthcare: 30 day measures reporting

Page 8: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview We Do Hadoop

Page 9: Implementing a Data Lake with Enterprise Grade Data Governance

New Project Proposal: Apache Atlas

Apache Atlas: a proposed open source project aimed at solving the Hadoop data governance challenge in the open.

Key Capabilities

•  Data Classification

•  Metadata Exchange

•  Centralized Auditing

•  Search & Lineage (Browse)

•  Security & Policy Engine

[Diagram: Apache Atlas architecture. Components: Knowledge Store (Models, Type-System, Policy Rules, Taxonomies), Audit Store, Tag Based Policies, Data Lifecycle Management, Real Time Tag Based Access Control, REST API, and Services (Search, Lineage, Exchange). Industry models: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Energy (PPDM), Retail (PCI, PII), Other (CWM)]

Essential Timeline

1H 2015 GA

•  REST API

•  Centralized Taxonomy

•  Import / export metadata

•  Basic Policy Rules Engine

•  Real-time access control

•  Column Level Tagging

Phase 2

•  Advanced audit reporting

•  Advanced Policy Engine

•  Row / Column Masking

•  3rd party Metadata exchange

Phase 3

•  Collaboration Features

•  Self Service

•  Steward Delegation

•  Profiling & Pattern Analysis

•  Visualization

Page 10: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Capabilities: Overview

Data Classification

•  Import or define business-oriented taxonomy annotations for data

•  Define, annotate, and automate capture of relationships between data sets and underlying elements, including source, target, and derivation processes

•  Export metadata to third-party systems

Centralized Auditing

•  Capture security access information for every application, process, and interaction with data

•  Capture the operational information for execution, steps, and activities

Search & Lineage (Browse)

•  Pre-defined navigation paths to explore the data classification and audit information

•  Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately

•  Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance-related information

Security & Policy Engine

•  Rationalize compliance policy at runtime based on data classification schemes

•  Advanced definition of policies for preventing data derivation based on classification (i.e., re-identification)
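The lineage capability listed above comes down to recording source, target, and derivation process for each data set, then walking those relationships. A minimal sketch in Python, assuming a hypothetical in-memory edge list; the data set names and the `upstream` helper are illustrative and are not part of any Atlas API:

```python
from collections import defaultdict

# Hypothetical lineage records: (source data set, derivation process, target data set).
LINEAGE_EDGES = [
    ("raw.pos_sales", "cleanse_job", "staging.pos_sales"),
    ("raw.store_master", "cleanse_job", "staging.stores"),
    ("staging.pos_sales", "join_job", "mart.sales_by_store"),
    ("staging.stores", "join_job", "mart.sales_by_store"),
]


def upstream(target, edges):
    """Return every data set the target was derived from, directly or indirectly."""
    parents = defaultdict(set)
    for source, _process, dest in edges:
        parents[dest].add(source)

    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        for parent in parents[node]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    print(sorted(upstream("mart.sales_by_store", LINEAGE_EDGES)))
```

Walking the edges in the opposite direction would answer the impact question instead: which downstream data sets are affected when a source changes.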

Page 11: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

Knowledge Store

Knowledge store categorized with appropriate business-oriented taxonomy

•  Data sets & objects

•  Tables / Columns

•  Logical context

•  Source, destination

Support exchange of metadata between foundation components and third-party applications/governance tools

Leverages existing Hadoop metastores

[Diagram: Apache Atlas architecture. Components: Knowledge Store (Models, Type-System, Policy Rules, Taxonomies), Audit Store, Policy Engine, Data Lifecycle Management, Security, REST API, and Services (Search, Lineage, Exchange). Industry models: Healthcare (HIPAA, HL7), Financial (SOX, Dodd-Frank), Custom (CWM), Retail (PCI, PII), Other]

Page 12: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

Data Lifecycle Management

Leverage existing investment in Apache Falcon, with a focus on:

•  Provenance

•  Multi-cluster replication

•  Data set retention/eviction

•  Late data handling

•  Automation
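Retention/eviction and late data handling can be pictured as two date comparisons. The sketch below is an illustration only, assuming date-partitioned data sets; the function names and the `retention_days` and `late_cutoff_days` parameters are hypothetical and do not reflect Falcon's configuration or API:

```python
from datetime import date, timedelta


def partitions_to_evict(partition_dates, today, retention_days):
    """Return the partition dates that fall outside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_dates if d < cutoff)


def accepts_late_data(partition_date, arrival_date, late_cutoff_days):
    """Return True if data arriving after its partition date is still inside the late-arrival window."""
    return arrival_date - partition_date <= timedelta(days=late_cutoff_days)


if __name__ == "__main__":
    partitions = [date(2015, 1, 1), date(2015, 1, 20), date(2015, 2, 1)]
    print(partitions_to_evict(partitions, today=date(2015, 2, 10), retention_days=30))
    print(accepts_late_data(date(2015, 2, 1), date(2015, 2, 3), late_cutoff_days=2))
```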

Page 13: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

Audit Store

Historical repository for all governance events

•  Security: Access Grant & Deny

•  Operational: Data Provenance & Metrics

•  Indexed and Searchable
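To make "indexed and searchable" concrete, here is a small sketch of an append-only event store with an index by asset. The `AuditEvent` fields and the `AuditStore` class are assumptions made for illustration; they are not the Atlas audit schema:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditEvent:
    timestamp: str  # ISO 8601 string, so lexical order matches time order
    user: str
    asset: str
    category: str   # "security" or "operational"
    outcome: str    # e.g. "grant", "deny", "completed"


class AuditStore:
    """Append-only list of events with a simple index by asset."""

    def __init__(self):
        self._events = []
        self._by_asset = defaultdict(list)

    def record(self, event):
        self._events.append(event)
        self._by_asset[event.asset].append(event)

    def search(self, asset, outcome=None):
        """Return events for an asset, optionally filtered by outcome."""
        hits = self._by_asset.get(asset, [])
        return [e for e in hits if outcome is None or e.outcome == outcome]


if __name__ == "__main__":
    store = AuditStore()
    store.record(AuditEvent("2015-03-01T10:00:00", "alice", "sales.orders", "security", "grant"))
    store.record(AuditEvent("2015-03-01T10:05:00", "bob", "sales.orders", "security", "deny"))
    for event in store.search("sales.orders", outcome="deny"):
        print(event)
```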

Page 14: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

Security

Integration with HDP Advanced Security investments to ensure compliance.

Establish global security policies based on data classification.

Leverages Ranger plug-in architecture for policy enforcement

Page 15: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

Policy Engine

Runtime rationalization of policy rules with respect to data asset combinations and time. Fully extensible.

•  Metadata based

•  Geo based rules

•  Time-based rules

•  Hive Column Prohibitions

•  Preview: Hive Row and Column Masking
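One way to picture metadata-based, geo-based, and time-based rules working together is a check that looks up a column's tags and then tests each matching rule. This is a sketch under stated assumptions: the tag names, the rule fields, and `is_allowed` are hypothetical and do not represent the Atlas or Ranger policy model:

```python
from datetime import time

# Hypothetical tag assignments for Hive columns.
COLUMN_TAGS = {
    "patients.ssn": {"PII"},
    "patients.zip": {"PII", "GEO"},
    "visits.cost": set(),
}

# Hypothetical rules: each names a tag and the conditions under which access is allowed.
RULES = [
    {
        "tag": "PII",
        "allowed_roles": {"steward"},
        "allowed_regions": {"US"},
        "start": time(8, 0),
        "end": time(18, 0),
    },
]


def is_allowed(column, role, region, at):
    """Deny if any rule matching one of the column's tags is not satisfied."""
    tags = COLUMN_TAGS.get(column, set())
    for rule in RULES:
        if rule["tag"] not in tags:
            continue  # rule does not apply to this column
        if role not in rule["allowed_roles"]:
            return False
        if region not in rule["allowed_regions"]:
            return False
        if not (rule["start"] <= at <= rule["end"]):
            return False
    return True


if __name__ == "__main__":
    print(is_allowed("patients.ssn", "steward", "US", time(9, 0)))
    print(is_allowed("patients.ssn", "analyst", "US", time(9, 0)))
```

Because the rule is attached to the tag rather than to the column, tagging a new column as PII brings it under the same policy without writing a new rule.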

Page 16: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

RESTful interface

•  Extensible enterprise classification of data assets, relationships, and policies, organized in a meaningful way and aligned to the business organization

•  Supports exploration via user interface

•  Supports extensibility via API and CLI exposure

Page 17: Implementing a Data Lake with Enterprise Grade Data Governance

Coming 2H 2015

Page 18: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Overview

Enhanced Audit Store

Historical repository for all governance events

•  Immutable file format

•  Events Metadata Taggable

•  Advanced Reporting

•  Security: Access Grant & Deny

•  Operational: Data Provenance & Metrics

•  Indexed and Searchable

Page 19: Implementing a Data Lake with Enterprise Grade Data Governance

Summary

Page 20: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Capabilities: Overview

Data Classification

•  Import or define business-oriented taxonomy annotations for data

•  Define, annotate, and automate capture of relationships between data sets and underlying elements, including source, target, and derivation processes

•  Export metadata to third-party systems

Centralized Auditing

•  Capture security access information for every application, process, and interaction with data

•  Capture the operational information for execution, steps, and activities

Search & Lineage (Browse)

•  Pre-defined navigation paths to explore the data classification and audit information

•  Text-based search features locate relevant data and audit events across the Data Lake quickly and accurately

•  Browse visualization of data set lineage, allowing users to drill down into operational, security, and provenance-related information

Security & Policy Engine

•  Rationalize compliance policy at runtime based on data classification schemes

•  Advanced definition of policies for preventing data derivation based on classification (i.e., re-identification)

Page 21: Implementing a Data Lake with Enterprise Grade Data Governance

Governance Ready Certification Program

•  Curated group of vendor partners to provide rich and complete features

•  Customers choose the features they want to deploy, a la carte

•  Low switching costs

•  HDP at the core to provide stability and interoperability

Discovery Tagging

Prep / Cleanse

ETL

Governance BPM

Self Service

Visualization

Page 22: Implementing a Data Lake with Enterprise Grade Data Governance

Waterline Data improves speed to value and compliance

Data Warehouse Offload

Data Science/Analytics Sandbox

Data Lake

VALUE CREATION

COST SAVINGS

Deliver a Business-Ready Data Lake

Accelerate Data Prep Process

Govern Data in Hadoop

Page 23: Implementing a Data Lake with Enterprise Grade Data Governance

Find, understand and govern data in Hadoop

Page 24: Implementing a Data Lake with Enterprise Grade Data Governance

The Modern Data Architecture

Page 25: Implementing a Data Lake with Enterprise Grade Data Governance

Apache Atlas Capabilities: Overview

REST API

Business Glossary

Automated Classification (Tagging)

Automated Lineage Discovery

Profiling and Data Quality

Schema Discovery

Change Detection and Audit

•  Glossary

•  Tags

•  Lineage

•  Models

Page 26: Implementing a Data Lake with Enterprise Grade Data Governance

Governance Ready Certification Program

Discovery Tagging

Prep / Cleanse

ETL

Governance BPM

Self Service

Visualization

Page 27: Implementing a Data Lake with Enterprise Grade Data Governance

Imagine shopping on Amazon.com

GOVERNANCE

Inventory

Find and Understand

Provision

Page 28: Implementing a Data Lake with Enterprise Grade Data Governance

Waterline Data is like Amazon.com for data in Hadoop

GOVERNANCE

Inventory

Find and Understand

Provision

Page 29: Implementing a Data Lake with Enterprise Grade Data Governance

Inventory

Page 30: Implementing a Data Lake with Enterprise Grade Data Governance

Find and Understand

Page 31: Implementing a Data Lake with Enterprise Grade Data Governance

Provision

Page 32: Implementing a Data Lake with Enterprise Grade Data Governance

Governance

Page 33: Implementing a Data Lake with Enterprise Grade Data Governance

Find, understand and govern data in Hadoop

Big Data IT Architect

Deliver a Business-Ready Data Lake

Data Engineer/Data Scientist

Accelerate Data Prep Process

CDO/Data Steward

Govern Data in Hadoop

Page 34: Implementing a Data Lake with Enterprise Grade Data Governance

Deliver a business-ready data lake

“It’s easy to get data into Hadoop, but it’s not necessarily easy to get data out of Hadoop. There is a need for data as a service to help the business find, understand, and govern data in Hadoop.”

Joe DosSantos, EMC Big Data Practice Leader

Page 36: Implementing a Data Lake with Enterprise Grade Data Governance

Accelerate data prep process

“80% of Big Data analytics is data prep, and 80% of data prep is inventorying data.”

Data Engineering Director, Financial Services

Page 37: Implementing a Data Lake with Enterprise Grade Data Governance

Accelerate data prep process

“Waterline Data fills a critical gap in big data exploratory analytics by automating the tagging and cataloging of data, which in turn can help analytic teams provision the right data for their analyses.”

Tony Baer, Principal Analyst, Ovum

Page 38: Implementing a Data Lake with Enterprise Grade Data Governance

Govern data in Hadoop

“Data lakes therefore carry substantial risks. The most important is the inability to determine data quality or the lineage of findings by other analysts or users that have found value, previously, in using the same data in the lake. By its definition, a data lake accepts any data, without oversight or governance. Without descriptive metadata and a mechanism to maintain it, the data lake risks turning into a data swamp. And without metadata, every subsequent use of data means analysts start from scratch.”

“Gartner Says Beware of the Data Lake Fallacy” post on the Gartner website

Page 39: Implementing a Data Lake with Enterprise Grade Data Governance

Govern data in Hadoop

“The first step to governing Big Data is to build an inventory.”

Sunil Soares, Managing Partner, Information Asset

Page 40: Implementing a Data Lake with Enterprise Grade Data Governance

Best practice approach to implement an enterprise grade data lake

1. Create and populate landing area

2. Build inventory of data

3. Integrate with enterprise metadata repository

4. Protect sensitive data

5. Open up to users

6. Monitor and maintain

Page 41: Implementing a Data Lake with Enterprise Grade Data Governance

Best practices in deployment landscape

1. Create and populate landing area

•  Create landing directory structure

•  Set up ETL processes, using Falcon to orchestrate

•  Implement ETL jobs using ETL tools (Syncsort, Talend, Informatica, etc.), Hadoop tools (Sqoop, Flume, etc.), or FTP

[Diagram: Falcon]
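As an illustration of the first bullet, the sketch below generates `hdfs dfs -mkdir -p` commands for a landing area. The directory layout (one folder per source system, each with `incoming`, `processed`, and `rejected` zones) and the `landing_mkdir_commands` helper are assumptions, not a layout prescribed by the presenters. The script only builds the command strings; running them against a cluster is left to the operator:

```python
from itertools import product


def landing_mkdir_commands(root, sources, zones=("incoming", "processed", "rejected")):
    """Build one `hdfs dfs -mkdir -p` command per source/zone combination."""
    return [
        f"hdfs dfs -mkdir -p {root}/{source}/{zone}"
        for source, zone in product(sources, zones)
    ]


if __name__ == "__main__":
    # Hypothetical root path and source systems.
    for command in landing_mkdir_commands("/data/landing", ["crm", "pos"]):
        print(command)
```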

Page 42: Implementing a Data Lake with Enterprise Grade Data Governance

Best practices in deployment landscape

2. Build inventory of data

•  Crawl the cluster

•  Profile files

•  Automatically discover technical, business, and compliance metadata at a field level

•  Create Hive tables as needed

•  Import lineage

•  Export to Atlas

[Diagram: Falcon, HCatalog, Atlas]
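Field-level profiling can be sketched as inferring a type for each column and tagging columns whose values all match a known pattern. The example below is a toy illustration, not Waterline Data's method; the `PATTERNS` table, the sample data, and the `profile` function are assumptions, and real compliance discovery needs far more than two regular expressions:

```python
import csv
import io
import re

# Illustrative patterns only.
PATTERNS = {
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
}

SAMPLE = (
    "id,email,ssn,amount\n"
    "1,a@example.com,123-45-6789,9.50\n"
    "2,b@example.com,987-65-4321,12\n"
)


def infer_type(values):
    """Classify a column as integer, float, or string (unknown if it has no values)."""
    if not values:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for value in values:
                cast(value)
            return name
        except ValueError:
            continue
    return "string"


def profile(csv_text):
    """Return {field: {"type": ..., "tags": [...]}} for CSV text with a header row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    fields = list(rows[0].keys()) if rows else []
    report = {}
    for field in fields:
        values = [row[field] for row in rows if row[field]]
        tags = [
            name
            for name, pattern in PATTERNS.items()
            if values and all(pattern.match(v) for v in values)
        ]
        report[field] = {"type": infer_type(values), "tags": tags}
    return report


if __name__ == "__main__":
    for field, info in profile(SAMPLE).items():
        print(field, info)
```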

Page 43: Implementing a Data Lake with Enterprise Grade Data Governance

Best practices in deployment landscape

3. Integrate with enterprise metadata repository

•  Import business glossary terms and export new tags and updated definitions

•  Synchronize Atlas and Waterline Data Inventory

•  Export metadata and lineage from Hadoop to Enterprise repository

[Diagram: Falcon, HCatalog, Atlas]

Page 44: Implementing a Data Lake with Enterprise Grade Data Governance

Best practices in deployment landscape

4. Protect sensitive data

•  Use Waterline Data Inventory to find sensitive data

•  Create access privileges in Ranger

•  Encrypt or de-identify

[Diagram: HCatalog, Ranger, Falcon, Atlas]
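For the "de-identify" option, one common approach is to replace sensitive values with a keyed hash, so that records can still be joined on the field without exposing the original value. The slides do not say which technique is used; the sketch below assumes HMAC-SHA256 from the Python standard library, and the `deidentify` helper, field names, and key are hypothetical:

```python
import hashlib
import hmac


def pseudonymize(value, secret_key):
    """Replace a value with its keyed SHA-256 digest (same input and key give the same output)."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def deidentify(record, sensitive_fields, secret_key):
    """Return a copy of the record with the sensitive fields pseudonymized."""
    return {
        key: pseudonymize(value, secret_key) if key in sensitive_fields else value
        for key, value in record.items()
    }


if __name__ == "__main__":
    # Placeholder key for the example; a real key would come from a secrets store.
    key = b"example-key-not-for-production"
    patient = {"name": "Ann", "ssn": "123-45-6789"}
    print(deidentify(patient, {"ssn"}, key))
```

Which fields count as sensitive would come from the inventory step; here the set is passed in by hand.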

Page 45: Implementing a Data Lake with Enterprise Grade Data Governance

Best practices in deployment landscape

5. Open up to users

•  Create account with Kerberos, LDAP, etc.

•  Set up ACLs (leverage Ranger)

•  Users can browse securely through Waterline Data Inventory

[Diagram: HCatalog, Ranger, Falcon, Atlas]

Page 46: Implementing a Data Lake with Enterprise Grade Data Governance

Best practices in deployment landscape

6. Monitor and maintain

•  Continue profiling new or changed files and sync with Atlas

•  Continue monitoring for sensitive data, use Ranger to protect

•  Build a folksonomy and synchronize with business glossary in Atlas and Enterprise Business Glossary

[Diagram: HCatalog, Ranger, Falcon, Atlas]
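Finding "new or changed files" can be sketched as a comparison of two listings of the cluster taken at different times. The snapshot format below (path mapped to a size and modification time) and the `diff_snapshots` helper are assumptions for illustration; how the listings are collected is outside the sketch:

```python
def diff_snapshots(previous, current):
    """Compare two {path: (size, mtime)} snapshots and return new, changed, and removed paths."""
    new = sorted(p for p in current if p not in previous)
    changed = sorted(p for p in current if p in previous and current[p] != previous[p])
    removed = sorted(p for p in previous if p not in current)
    return {"new": new, "changed": changed, "removed": removed}


if __name__ == "__main__":
    # Hypothetical listings; each value is (size in bytes, modification time).
    yesterday = {"/data/landing/crm/a.csv": (100, 1), "/data/landing/crm/b.csv": (200, 1)}
    today = {"/data/landing/crm/a.csv": (150, 2), "/data/landing/crm/c.csv": (300, 2)}
    print(diff_snapshots(yesterday, today))
```

Only the paths reported as new or changed would need to be profiled again and synchronized.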

Page 47: Implementing a Data Lake with Enterprise Grade Data Governance

Find, understand and govern data in Hadoop

Discover lineage and business metadata automatically, and manage metadata

CDO/Data Steward

Automate cataloging of data assets at scale, with secure provisioning to business users

Big Data Architect

Find and understand best-suited and most trusted data without having to explore every file manually

Data Engineer/Data Scientist/Business Analyst

Page 48: Implementing a Data Lake with Enterprise Grade Data Governance

Questions and Answers

Page 49: Implementing a Data Lake with Enterprise Grade Data Governance

Next Steps…

Download the Hortonworks Sandbox

Learn Hadoop

Build Your Analytic App

Try Hadoop 2

More about Waterline Data & Hortonworks: http://hortonworks.com/partner/waterline-data

Joint tutorial: bit.ly/DataLakeTutorial

Modern Data Architecture Paper: go.waterlinedata.com/hw-mda

Page 50: Implementing a Data Lake with Enterprise Grade Data Governance

The Largest Hadoop Community Events in Europe and North America

BRUSSELS April 15-16

•  Deep-dive technical content

•  65+ sessions and 5 tracks

•  1,000 attendees

•  Sponsorships available

•  Including pre- and post-event community meetups and BOFs

•  Hadoop training available

SAN JOSE June 9-11

•  100+ sessions and 7 tracks

•  Deep-dive technical content

•  5,000 attendees

•  Sponsorships available

•  Including pre- and post-event community meetups and BOFs

•  Hadoop training available

www.hadoopsummit.org