Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise


Transcript of Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Page 1: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

June 28, 2016

Apache Atlas

Page 2: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Disclaimer

This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.

Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.

This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.

Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.

Page 3: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Atlas Data Governance

Organizations need data governance to understand their information and to answer questions such as:

• What do we know about our information?
• Where did this data come from and who can use it?
• Does this data adhere to company policies and rules?

Page 4: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Vision - Enterprise Data Governance Across Platforms

[Diagram labels: Structured, Unstructured, Traditional RDBMS, MPP Appliances, Data Lake, Streaming, Metadata, Projects 1, 3, 4, 5 and 6]

Atlas: Metadata Truth in Hadoop

• Data Management: along the entire data lifecycle with integrated provenance and lineage capability
• Modeling with Metadata: enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities
• Interoperable Solutions: across the Hadoop ecosystem, through a common metadata store

Page 5: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Apache Atlas Overview

Page 6: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Atlas Data Governance

Data governance practices provide a holistic approach to managing, improving and leveraging information to help you gain insight and build confidence in business decisions and operations.

Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.

Page 7: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Atlas timeline: from DGI to present

• Dec 2014: DGI* group kickoff
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3, Foundation GA release
• Jan 2016: HDP 2.4, Kafka/Storm, Sqoop, Falcon, tag-based security
• Summer 2016: HDP 2.5, Business Catalog, AD integration, versioning

First kickoff to GA in 7 months

Global Financial Company

* DGI: Data Governance Initiative

Key Benefits:

• Co-Dev = Built for real customer use cases

• Faster & Safer = Customers know business + HWX knows Hadoop

Page 8: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Big Data Management Through Metadata

Management Scalability

Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to have very large windows (years). Can traditional EDW tools manage 100 million entities effectively with room to grow?

Metadata Tools

Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels.

Tags for Management, Discovery and Security

Proper metadata is the foundation for business taxonomy, stewardship, attribute-based security and self-service.

Key Benefits:

Modern Data Lakes need new ways to govern because:

• Cost – Traditional staff ratio to data size not possible

• Diversity – Only way to manage velocity of new datasets

• Agility – Quick change based on tags / taxonomy

Page 9: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

High Level Architecture: 4 Key points

[Architecture diagram labels: Type System, Repository, Search DSL, Bridge, Hive, Storm, Falcon, Custom, REST API, Graph DB, Search, Kafka, Sqoop, Connectors, Messaging Framework]

1 Data Lineage: Only product that captures lineage across Hadoop components at platform level.

2 Agile Data Modeling: Type system allows custom metadata structures in a hierarchy taxonomy.

3 REST API: Modern, flexible access to Atlas services, HDP components, UI & external tools.

4 Exchange: Leverage existing metadata / models by importing it from current tools. Export metadata to downstream systems.
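To make the "custom metadata structures in a hierarchy" idea concrete, here is a minimal sketch in plain Python. It is not the Atlas type system or its API; the `TypeDef` class and the type names (`DataSet`, `Table`, `SalesTable`) are hypothetical and only illustrate how a subtype could inherit attribute definitions from its supertypes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TypeDef:
    """Hypothetical metadata type; NOT an Atlas class."""
    name: str
    # attribute name -> data type name, declared directly on this type
    attributes: Dict[str, str] = field(default_factory=dict)
    supertype: Optional["TypeDef"] = None

    def all_attributes(self) -> Dict[str, str]:
        """Own attributes merged over everything inherited from supertypes."""
        inherited = self.supertype.all_attributes() if self.supertype else {}
        return {**inherited, **self.attributes}

    def ancestry(self) -> List[str]:
        """Type names from this type up to the root of the hierarchy."""
        names = [self.name]
        if self.supertype is not None:
            names.extend(self.supertype.ancestry())
        return names


# Illustrative hierarchy: a custom type extends more generic ones.
data_set = TypeDef("DataSet", {"name": "string", "owner": "string"})
table = TypeDef("Table", {"database": "string"}, supertype=data_set)
sales_table = TypeDef("SalesTable", {"region": "string"}, supertype=table)
```

In this sketch `sales_table.all_attributes()` includes `name`, `owner`, `database` and `region`, so a custom business type can be added without redefining the generic attributes.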

Page 10: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Governance Ready Certification Program

Discovery / Tagging

Prep / Cleanse / ETL

Governance / BPM

Self Service / Visualization

Choice: Customers choose the features they want to deploy, a la carte versus vendor lock-in

Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy

Agile: Low switching costs, Faster deployment and innovation

Centralized: Common SLA & common open metadata store

Flexibility: Interoperability of products through Atlas metadata

Safe: HDP at core to provide stability and interoperability

Page 11: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Governance Ready Certification Program

Completed:

• Waterline
• Dataguise
• Attivo

Next:

• SAP ILM, VORA
• IBM IGC

Work in progress:

• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Trifacta

Page 12: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Near Term Roadmap: Summer 2016

Page 13: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Summer 2016 Release Summary

• Dynamic Access Policies
• Cross component lineage
• Enterprise Readiness
• Business Catalog

Differentiator

Differentiator

Differentiator

Page 14: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Dynamic Access Policy

Apache Ranger + Atlas Integration

Page 15: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Summary of Dynamic Access Policies

• Basic Tag policy – PII example. Permission is mapped to a reusable tag, not to a resource.

• Geo-based policy – Policy based on IP address mappings. Rule enforcement is dynamically geo-aware.

• Time-based policy – Timer for data access for resource management, compliance reporting

• Prohibitions – Prevention of toxic combinations of Hive tables or columns that may pose a risk together.

Key Benefits:

New scalable metadata based security paradigm

Dynamic, real-time policy

Automatically updates in response to changes in metadata

Centralized and simple to manage policy
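The policy kinds listed on this slide can be illustrated with a small, self-contained Python sketch. This is not Ranger's policy engine or its API: `TagPolicy`, `is_allowed` and `violates_prohibition` are hypothetical names, and the sketch covers only a tag policy with an optional expiry (time-based) and a "toxic combination" check (prohibition), under the assumption that each asset carries a set of tags.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import FrozenSet, Optional, Set


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy: groups allowed to access anything carrying `tag`."""
    tag: str
    allowed_groups: FrozenSet[str]
    expires_at: Optional[datetime] = None  # time-based policy: no access after this


def is_allowed(policy: TagPolicy, user_groups: Set[str],
               asset_tags: Set[str], now: datetime) -> bool:
    """True if the policy grants access to an asset with the given tags."""
    if policy.tag not in asset_tags:
        return False  # policy does not apply to this asset
    if policy.expires_at is not None and now >= policy.expires_at:
        return False  # policy window has closed
    return bool(policy.allowed_groups & user_groups)


def violates_prohibition(requested_tags: Set[str],
                         toxic_combination: FrozenSet[str]) -> bool:
    """True if a request touches every tag of a prohibited combination."""
    return toxic_combination <= requested_tags


# Illustrative values only.
pii_policy = TagPolicy("PII", frozenset({"hr-analysts"}),
                       expires_at=datetime(2016, 12, 31))
```

Because the permission is attached to the tag rather than to a table or file, the same `pii_policy` object applies to any asset that is later tagged `PII`.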

Page 16: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

How does Atlas work with Ranger at scale?

Atlas provides: Metadata

• Business Classification (taxonomy): Company > HR > Driver

• Hierarchy with inheritance of attributes to child objects: a sensitive “PII” tag on department HR will be inherited by group HR > Driver

• Atlas will notify Ranger of changes via a Kafka topic

[Diagram labels: Apache Atlas, Hive, Ranger, Falcon, Kafka, Storm]

Atlas provides the metadata tag to create policies

Ranger provides: Access & Entitlements

• Ranger will cache tags and asset mappings for performance

• Ranger will have a policy based on tags instead of roles.

• Example: PII = <group>. This can work for many assets.
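The inheritance rule on this slide (a "PII" tag on HR flowing down to HR > Driver) can be sketched as follows. This is an illustrative model only, not Atlas code; `TaxonomyNode` and its methods are hypothetical names.

```python
from typing import Optional, Set


class TaxonomyNode:
    """Hypothetical node in a business taxonomy such as Company > HR > Driver."""

    def __init__(self, name: str, parent: Optional["TaxonomyNode"] = None) -> None:
        self.name = name
        self.parent = parent
        self.own_tags: Set[str] = set()  # tags assigned directly to this node

    def effective_tags(self) -> Set[str]:
        """Tags assigned here plus every tag inherited from ancestors."""
        inherited = self.parent.effective_tags() if self.parent else set()
        return inherited | self.own_tags

    def path(self) -> str:
        """Render the position in the hierarchy, e.g. 'Company > HR > Driver'."""
        if self.parent is None:
            return self.name
        return f"{self.parent.path()} > {self.name}"


company = TaxonomyNode("Company")
hr = TaxonomyNode("HR", parent=company)
driver = TaxonomyNode("Driver", parent=hr)

# Tagging the department once is enough; children pick the tag up.
hr.own_tags.add("PII")
```

Here `driver.effective_tags()` contains `PII` although the tag was only assigned to `hr`, while `company.effective_tags()` stays empty because inheritance flows downward only.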

Page 17: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Scalable Access Control – Reusable Tag Policy

User group:

• AD
• Linux

Resources:

• Files
• Tables
• Topologies

Atlas Tag:

• PII

ANY asset PII:

• Files
• Tables
• Topologies

Single Admin Group Assigns

Many Stewards Tag + Single point of enforcement and audit

All future tagging is covered by existing policy

Page 18: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Automatic update of policies – active protection

Atlas Metastore:

• Tags
• Assets
• Entities

Notification Framework: Kafka Topics

Atlas Client (in Ranger):

• Subscribes to Topic
• Gets Metadata Updates

Ranger: PDP, Resource Cache

Notification: Metadata updates

Message durability

Optimized for Speed

Event driven updates
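The event-driven flow on this slide (metadata changes published to a topic, a subscriber updating a local cache that policy decisions read from) can be sketched in Python. This is a simplified stand-in: a `deque` plays the role of the Kafka topic, and `TagCache`, `drain` and `can_read` are hypothetical names, not Atlas or Ranger APIs.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Set, Tuple

# (operation, asset, tag); the operation is "add" or "remove" in this sketch
Event = Tuple[str, str, str]


class TagCache:
    """Hypothetical local cache of asset -> tags, kept current by events."""

    def __init__(self) -> None:
        self._tags: Dict[str, Set[str]] = defaultdict(set)

    def apply(self, event: Event) -> None:
        operation, asset, tag = event
        if operation == "add":
            self._tags[asset].add(tag)
        elif operation == "remove":
            self._tags[asset].discard(tag)

    def tags_for(self, asset: str) -> Set[str]:
        return set(self._tags.get(asset, set()))


def drain(topic: Deque[Event], cache: TagCache) -> int:
    """Consume all pending events in order; return how many were applied."""
    applied = 0
    while topic:
        cache.apply(topic.popleft())
        applied += 1
    return applied


def can_read(cache: TagCache, asset: str, user_groups: Set[str],
             policy: Dict[str, Set[str]]) -> bool:
    """Allow access if any tag on the asset maps to one of the user's groups."""
    return any(policy.get(tag, set()) & user_groups
               for tag in cache.tags_for(asset))


# Illustrative data: one tag policy, written before the asset is ever tagged.
policy = {"PII": {"hr-stewards"}}
topic: Deque[Event] = deque()
cache = TagCache()
```

Once a steward's tagging event such as `("add", "hive:hr.drivers", "PII")` is drained into the cache, the existing `PII` policy covers the new asset without any policy change, which is the sense in which future tagging is covered by existing policy.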

Page 19: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Hadoop Cross Component Data Lineage

Page 20: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Apache Atlas Component Integration

• Cross-component dataset lineage. Centralized location for all metadata inside HDP

• Single Interface point for Metadata Exchange with platforms outside of HDP

[Diagram: Apache Atlas at the center, integrating with Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi and HBase; components are grouped by release: HDP 2.3, HDP 2.5, Beyond HDP 2.5]

Page 21: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Summary of Data Lineage

Users in the upcoming release of HDP 2.5 will be able to track lineage across the following components using Atlas:

• Sqoop – Import from and export to relational databases, and any additional package that leverages Sqoop. ATLAS-184, SQOOP-2609

• Hive – Dataset lineage with entity versioning (including schema changes). ATLAS-75, ATLAS-183, ATLAS-492

• Kafka / Storm – IoT event-level processing, such as syslogs or sensor data. ATLAS-181, ATLAS-183, STORM-1381

• Falcon – Data lifecycle at Feed and Process entity level for replication and repeating workflows. Tracks periodicity, throttling, eviction. ATLAS-69, FALCON-1570

Key Benefits:

Enterprises need open solutions, not single app vendor

More native connectors than anyone else with more coming

Hardened metadata infrastructure
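Cross-component lineage can be thought of as a directed graph whose edges are processes (a Sqoop import, a Hive query, a Storm topology, a Falcon replication) connecting datasets. The following Python sketch is an assumption-laden illustration, not how Atlas stores or queries lineage; the dataset names, process labels and the `upstream` / `downstream` helpers are all hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# (source dataset, process that ran, target dataset); illustrative data only
Edge = Tuple[str, str, str]

EDGES: List[Edge] = [
    ("rdbms:orders", "sqoop import", "hdfs:/raw/orders"),
    ("hdfs:/raw/orders", "hive CTAS", "hive:sales.orders_clean"),
    ("kafka:sensor_events", "storm topology", "hive:iot.readings"),
    ("hive:sales.orders_clean", "falcon replication", "hive:dr.orders_clean"),
]


def _reachable(edges: List[Edge], start: str, forward: bool) -> Set[str]:
    """Breadth-first search along (forward) or against (backward) the edges."""
    adjacency: Dict[str, List[str]] = {}
    for source, _process, target in edges:
        origin, destination = (source, target) if forward else (target, source)
        adjacency.setdefault(origin, []).append(destination)
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen


def upstream(edges: List[Edge], dataset: str) -> Set[str]:
    """Provenance: every dataset the given one was derived from."""
    return _reachable(edges, dataset, forward=False)


def downstream(edges: List[Edge], dataset: str) -> Set[str]:
    """Impact analysis: every dataset derived from the given one."""
    return _reachable(edges, dataset, forward=True)
```

With these sample edges, `upstream(EDGES, "hive:dr.orders_clean")` walks back through the Falcon, Hive and Sqoop steps to the relational source, while `downstream(EDGES, "rdbms:orders")` answers the reverse question of what would be affected by a change at the source.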

Page 22: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Expanded Native Connector: Dataset Lineage

[Diagram labels: Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, Metadata Repository, RDBMS]

Any process using Sqoop is covered

No other tool tracks IoT out of the box

Page 23: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Summer 2016 Release Summary

• Dynamic Access Policies
• Cross component lineage
• Enterprise Readiness
• Business Catalog

Differentiator

Differentiator

Differentiator

Page 24: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Enterprise Readiness

Page 25: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Security/Enterprise Readiness

• Highly reliable and scalable components
• Authorization with AD via Ranger
• Rolling upgrade support HDP 2.5 +
• BC & DR capabilities
• Improved performance of 5x from previous version

Page 26: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Enterprise Readiness: Scalable and Highly Reliable Components

[Architecture diagram labels: SolrCloud, Kafka Quorum, HBase, Type System, Repository, Search DSL, Bridge, Hive, Storm, Falcon, Custom, REST API, Graph DB, Search, Kafka, Sqoop, Connectors, Messaging Framework]

Page 27: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Summer 2016 Release Summary

• Dynamic Access Policies
• Cross component lineage
• Enterprise Readiness
• Business Catalog

Differentiator

Differentiator

Differentiator

Page 28: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Business Taxonomy (Catalog)

Page 29: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Key Concepts

Business Taxonomy (Catalog)

The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.

Data Lineage (Provenance)

Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.

Tags: Traits vs. Labels vs. Business Taxonomy

Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag PII can be used in HR as well as Finance or Sales.

Benefits:

A view of data assets organized by business language

Impact analysis, Compliance, Acceptable use

Common tag through Hadoop components

Page 30: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Taxonomies Benefits:

• Search / Discovery – Business catalog of conceptual, logical and physical assets

• Security – Dynamic metadata-based access control

Page 31: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

We conduct open-ended user interviews so that we can learn more about who our users are and what their needs are. This helps us validate whether or not we’re solving the right problem.

Research: Focused on Hadoop

Page 32: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

We test our prototype in InVision, a click-through prototyping tool that allows users to interact with static mockups.

Usability Testing

Page 33: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Principal Roles & Activities

• Data Steward – Curator, responsible for catalog veracity

• Data Scientist – Analyst, primary consumer of Business Catalog

• Administrator – Role management only

• Data Engineer – Data ingress and egress, semantic data quality

• 50% - 80%+ of time spent looking for data

• Profit Center

• Primary User of Atlas

• Enables Scientist

Goal: < 25% of time spent on finding data = empowering scientists to spend their time uncovering insights, faster

Page 34: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Atlas Value

• Designed for Hadoop at platform, not application level
• High Confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – ensures information is current by native connectors
• Dynamic protection with Ranger in simple audited policies

Page 35: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Additional Atlas Sessions

• Extend Governance in Hadoop with the Atlas Ecosystem: integrations with partners Waterline, Trifacta and Attivo:

Thursday 4:10PM @ Room 210A

• BOF: Apache Knox and Apache Ranger provide Hadoop security while Atlas provides a Hadoop metadata store and enterprise compliance. Come learn and discuss security & governance innovations and future directions.

Thursday 5-7 PM @ Room 210A

Page 36: Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise

Learn More:

• Hortonworks links: http://hortonworks.com/solutions/security-and-governance/

• Tutorials: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview