Unleashing the power of Apache Atlas with Apache Ranger - Virtual Data Connector

Unleashing the power of Apache Atlas with Apache Ranger Virtual Data Connector Project NIGEL JONES [email protected] DATAWORKS, MUNICH, APRIL 2017 Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.


Page 1: Unleashing the power of apache atlas with apache  - virtual dataconnector

Unleashing the power of Apache Atlas with Apache Ranger

Virtual Data Connector Project
NIGEL JONES [email protected]
DATAWORKS, MUNICH, APRIL 2017

Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.

Page 2: Unleashing the power of apache atlas with apache  - virtual dataconnector

About Me – Nigel Jones

• https://www.linkedin.com/in/nigelljones/
• [email protected] (Anyone still use email?)
• @planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life accounts didn’t work for me!
• And of course the Apache Atlas & Ranger mailing lists & JIRA!
• Science fan at school, uni. It was cloud chambers back then… now just the cloud
• IBM Hursley, UK since 1990
• Last 3 years focus on Data Lake, Information Governance, Open Metadata

Page 3: Unleashing the power of apache atlas with apache  - virtual dataconnector

The Problem…
WHY ARE WE HERE…

Page 4: Unleashing the power of apache atlas with apache  - virtual dataconnector

Data?

• What data do I have?
• What does it mean?
• Where is it?
• Who has access to it?
• Who owns it?
• What quality is it?
• How does it relate to other data?
• How do I control, audit & understand access?

Page 5: Unleashing the power of apache atlas with apache  - virtual dataconnector

Regulatory needs

• Adhere to regulations like BCBS-239 and GDPR
• Need to know meaning, value of the data
• Demonstrate processes in place to govern access
• Audit
• Significant fines if rules breached
• Whilst ensuring easy, ready access to appropriate data for data professionals to support an agile business

Page 6: Unleashing the power of apache atlas with apache  - virtual dataconnector

So what do we need to address this?

Page 7: Unleashing the power of apache atlas with apache  - virtual dataconnector

Metadata..

• Metadata enables data to be used outside of the application that created it.
  • Analytics and decision making
  • New business applications
  • Reporting and compliance
• Metadata describes the format and content of data allowing people to judge which dataset to use for a new project
  • Structure
  • Meaning
  • Origin
  • Valid values and quality
  • Usage and ownership
  • Regulations and classifications that apply
• Metadata describes the business context and classification of data allowing automated governance processes to operate.

Page 8: Unleashing the power of apache atlas with apache  - virtual dataconnector

Which can support…

An enterprise data catalogue that lists all data including where it is, what it is, who owns it, its meaning, quality, where it came from, and can fully describe its business context & how the data should be governed….

Subject Matter experts searching, collaborating, feeding back about their data needs and use

Automated governance actions to protect and manage including auditing, monitoring, quality control, rights management

Page 9: Unleashing the power of apache atlas with apache  - virtual dataconnector

But easily…

• Open frameworks & APIs
• Automatic collection & discovery of metadata in a dynamic heterogeneous environment
• Using predefined standards for glossaries, schemas, rules, regulations to reduce cost
• Cheap to integrate new tools
• No proprietary lock-in & assumptions that all tools are from one suite or vendor
• Avoiding silos
• Distributed and Open

Page 10: Unleashing the power of apache atlas with apache  - virtual dataconnector

The vision

Open and Unified Metadata

Page 11: Unleashing the power of apache atlas with apache  - virtual dataconnector

Virtualization Data Connector project

Page 12: Unleashing the power of apache atlas with apache  - virtual dataconnector

Data virtualization project

• Collaboration – IBM, several banks & open community
• A Data Lake environment
  • Not just Hadoop, but other sources too
  • Business Terms, Classifications, Metadata rich
• Offer virtualized views. Expose relational data with business terms
• Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
• Open, pluggable
• Working through use cases, design, initial MVP (this year)
• Critique, feedback is welcomed. We’re looking for guidance and support from the Atlas & Ranger communities, as well as to contribute our ideas
• Proposed changes all go through mailing list and JIRA for feedback

Page 13: Unleashing the power of apache atlas with apache  - virtual dataconnector

Apache Atlas

“Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.” …. http://www.apache.org

• Open Community -- Apache Incubator since May 2015
• Type agnostic metadata store
• REST API & UI
• Supports many Hadoop components including HBase, Hive, Sqoop, Storm & others

Page 14: Unleashing the power of apache atlas with apache  - virtual dataconnector

Apache Ranger

• Centralized security administration to manage all security related tasks in a central UI or using REST APIs.
• Fine grained authorization to do a specific action and/or operation with Hadoop component/tool and managed through a central administration tool
• Standardize authorization method across all Hadoop components.
• Enhanced support for different authorization methods - Role based access control, attribute based access control etc.
• Centralize auditing of user access and administrative actions (security related) within all the components of Hadoop.

… from http://ranger.apache.org

Page 15: Unleashing the power of apache atlas with apache  - virtual dataconnector

Project Interactions

[Diagram. Components: Search/Report, GaianDB, Apache Atlas, Apache Ranger, Apache Solr, RDBMS, Hadoop.]

Search/Report:
• Search for list of assets by metadata
• Search for data
• Reporting tool obtains data to draw report

Diagram labels:
• Underlying data: sql, hive, HDFS, Oracle, Netezza etc
• Manages logical views
• Deploys rules, pushes classifications, source for user roles (not users)
• + Ranger plugin to permit/deny, mask etc
• Pulls rules, classifications

Page 16: Unleashing the power of apache atlas with apache  - virtual dataconnector

Why Atlas and Ranger?

• Open Source essential to forming an active ecosystem
• Vision, active community & evolving – ability to contribute & work with others to provide the best solution
• Already have good core capabilities
  • Atlas type system is very flexible
  • Ranger offers a range of policy types and provides a pluggable framework
• Already cross project integration
  • Use of tag based policies in Ranger sourced from Atlas
• Can be used independently of full Hadoop stack
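To make the idea of tag based policies concrete, here is a minimal sketch of how a decision could be derived from the classifications on a column and the roles of the user. It is an illustration only: `TagPolicy`, `decide` and the role and tag names are hypothetical, and this is not Ranger's policy model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy shape, not Ranger's own policy format."""
    tag: str                  # classification on the asset, e.g. "SPI"
    allowed_roles: frozenset  # roles that may see the raw value
    mask_for_others: bool     # True: mask for other roles, False: deny


def decide(column_tags, user_roles, policies):
    """Return 'permit', 'mask' or 'deny' for one column; the strictest outcome wins."""
    outcome = "permit"
    for policy in policies:
        if policy.tag not in column_tags:
            continue  # policy does not apply to this column
        if policy.allowed_roles & set(user_roles):
            continue  # user holds a role that is allowed to see the data
        if not policy.mask_for_others:
            return "deny"
        outcome = "mask"
    return outcome


policies = [
    TagPolicy("SPI", frozenset({"hr_manager"}), mask_for_others=True),
    TagPolicy("Restricted", frozenset({"auditor"}), mask_for_others=False),
]
```

The point of the sketch is that the policy names a tag rather than a table or column, so newly classified assets are covered without writing new policies.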

Page 17: Unleashing the power of apache atlas with apache  - virtual dataconnector

Refined virtual connector scope

[Architecture diagram. Labels recovered from the slide:]
• Data Lake Virtualization: GaianDB, Ranger Plugin, Virtualizer, Manage Logical Tables, Pre / Post Create View Metadata, Extract physical metadata
• Data Lake Repositories: Oracle, Netezza, Hive Tables
• Metadata: Atlas, Titan (GraphDB, Metadata Repository), OMAS, OMRS, IGC, Navigator, Datameer, Mapper
• Ranger: Ranger Server, Ranger Config, Config (eg Policies, Audit log location), Poll Policies, tag-sync, rule-sync, LDAP, Audit Log
• Flows: Search for data/reporting, Retrieve metadata, Push metadata, Push and query metadata

Page 18: Unleashing the power of apache atlas with apache  - virtual dataconnector

GaianDB & Virtualizer

• GaianDB
  • Open Source
  • Federated, self learning, dynamic configuration
  • Based on Apache Derby
  • Already had “policy” support – we’re plugging in Ranger for this project
• Virtualizer
  • Listens to event notifications on assets etc
  • Creates view definitions in GaianDB, and uses new Atlas APIs to store metadata.
  • Could use different virtual engine..
• Designed to be open to other virtualization technologies.

[Diagram labels: LT1, LT2; DS1, DS2, DS3; Policy Plugin (ranger); Virtualizer; Atlas]

GaianDB supports federation – not used for MVP
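As a rough illustration of the Virtualizer step, the sketch below turns an asset notification into a view definition that exposes physical columns under business term names. Everything here is assumed for the example: the event shape, the `build_view_sql` helper and the use of generic SQL `CREATE VIEW` text. GaianDB's own mechanism for defining logical tables may look quite different.

```python
def build_view_sql(view_name, source_table, column_terms):
    """Build CREATE VIEW text that renames physical columns to business terms.

    column_terms maps physical column name -> business term (hypothetical input).
    """
    def quote(identifier):
        # Standard SQL identifier quoting; embedded double quotes are doubled.
        return '"' + identifier.replace('"', '""') + '"'

    select_list = ", ".join(
        f"{quote(physical)} AS {quote(term)}"
        for physical, term in column_terms.items()
    )
    return (
        f"CREATE VIEW {quote(view_name)} AS "
        f"SELECT {select_list} FROM {quote(source_table)}"
    )


# Hypothetical notification payload for a newly catalogued table.
event = {
    "table": "EMPLOYEE",
    "columns": {"EMP_NAME": "Employee Name", "EMP_SALARY": "Salary"},
}
sql = build_view_sql("EMPLOYEE_VIEW", event["table"], event["columns"])
```

Keeping the view generation in one small function is what would let a different virtualization engine be swapped in, as the slide suggests.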

Page 19: Unleashing the power of apache atlas with apache  - virtual dataconnector

Atlas – glossary enhancements

• Get Atlas closer to parity with commercial offerings
• Business Terms – categories, category hierarchies
• Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
• Assets mapped to Business Terms
• Classifications
  • Hierarchy
• Navigable mappings to retain ability to flatten tags to Ranger
  • Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY -> SPI …
• Used to drive governance
• ATLAS-1410
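The flattening mentioned above (EMP_SALARY -> SALARY -> SPI collapsing to EMP_SALARY -> SPI) can be sketched as a walk over the mappings. This is only a sketch under assumptions: `flatten_tags`, the `links` dictionary and the `classifications` set are hypothetical stand-ins, not the API proposed in ATLAS-1410.

```python
def flatten_tags(asset, links, classifications):
    """Follow asset -> term -> classification links and return the classifications reached.

    links maps a node name to the nodes it points at; classifications is the
    set of node names that are tags rather than business terms.
    """
    found = []
    seen = set()
    queue = list(links.get(asset, []))
    while queue:
        node = queue.pop(0)
        if node in seen:
            continue  # guards against cycles such as synonym loops
        seen.add(node)
        if node in classifications:
            found.append(node)
        queue.extend(links.get(node, []))
    return found


links = {"EMP_SALARY": ["SALARY"], "SALARY": ["SPI"]}
classifications = {"SPI"}
tags = flatten_tags("EMP_SALARY", links, classifications)
```

With the data above, `tags` holds only `SPI`: the intermediate business term is traversed but not emitted, which is what lets an enforcement engine keep working with flat tags.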

Page 20: Unleashing the power of apache atlas with apache  - virtual dataconnector

Atlas – other enhancements

• Consumer Centric APIs
  • Open Metadata Access Services (OMAS)
  • REST & more Kafka notifications
  • Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions, Information View, Roles and Access
• Repository level APIs
  • Open Metadata Repository Services (OMRS)
  • REST & more Kafka notifications
  • Pluggability through an Open Connector Framework to other metadata repositories – distributed and Open
• Standard data model/core
  • Enhancement to core model – versioning, external linkage etc
  • More standard types, i.e. for all relational databases, to ease sharing

Page 21: Unleashing the power of apache atlas with apache  - virtual dataconnector

Ranger areas being looked at

• Building a plugin for GaianDB
  • Access control, simple masking. More later
• User synchronization (large #users, role of Atlas)
• Changes to tag sync process for new glossary proposal
• As more metadata goes into Atlas, it becomes source for generation of some kinds of policies.
  • Where is the master?
  • Generating Ranger rules from governance definitions
• How about control of access to Atlas itself?

Aside: Interfaces used by enforcement engines (such as to get classification data) need to be efficient – these should work for projects like Apache Sentry as well as Atlas

Page 22: Unleashing the power of apache atlas with apache  - virtual dataconnector

Beyond the MVP

• Open Discovery Framework
• Consider other security enforcement engines – such as Apache Sentry & driving more capability around rules & governance actions from Atlas metadata
• Work on standard models to support different domains
• Lineage
  • From high level design lineage through to operational detail. Logs vs graph….
• API metadata
• Infrastructure – JanusGraph…
  • Abstraction added by IBM in last few months for titan 1

Page 23: Unleashing the power of apache atlas with apache  - virtual dataconnector

The vision

• An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
  • Spanning systems both on premise and cloud providers
  • Hosted locally to your data platforms but integrated to provide the enterprise view
• New data tools (from any vendor) connect to your data catalog out of the box
  • No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
• Metadata is added automatically to the catalog as new data is created
  • Extensible discovery processes characterise and classify the data
  • Interested parties and processes are notified
• Subject matter experts collaborating around the data
  • Locate the data they need, quickly and efficiently
  • Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
• Automated governance processes protect and manage your data
  • Metadata-driven access control
  • Auditing, metering and monitoring
  • Quality control and exception management
  • Rights management
• Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
• Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics

Page 24: Unleashing the power of apache atlas with apache  - virtual dataconnector

Summary

• Atlas can help us have an industry wide common metadata platform around which a vibrant ecosystem can evolve
  • Not only in Hadoop but more broadly
• Metadata driven governance can be scalable & enable us to manage our data better, and be compliant with regulations
• The ideas presented here resonate with many people we’ve spoken to
• Get involved! I’d love to hear the feedback on this approach!
  • Comment on the JIRAs, ask questions, contribute, disagree… ;-)
  • Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
  • Atlas wiki
• “Innovation happens best not in isolation but in collaboration” (keynote)

THANKS!

Page 25: Unleashing the power of apache atlas with apache  - virtual dataconnector

Questions

After this [email protected]:50 Room 4 – Security & Governance BOF

zzzzzzz

Questions?

Page 26: Unleashing the power of apache atlas with apache  - virtual dataconnector

Backup charts

Page 27: Unleashing the power of apache atlas with apache  - virtual dataconnector

Architecture

[Architecture diagram. Labels recovered from the slide:]
• Data sources: Oracle Data, HDFS Data, Netezza Data, each reached through P-JDBC
• Virtualization: “gaiandb”, ATLAS Virtualizer, GAF Pre, GAF Post
• Atlas: graphDB, OMRS, GAF OMAS, VirtualAsset OMAS, Catalog OMAS, Connector Framework
• External: IGC, IGC REST API, Search, Search/Explore UI
• Legend: Atlas boundaries; Developed in POC; May not be in POC initially
• * May be hardcoded at first

Page 28: Unleashing the power of apache atlas with apache  - virtual dataconnector

Metadata areas and types

[Diagram of metadata areas, the roles that use them, and rollout markers numbered 1 to 7. Labels recovered from the slide:]

Metadata areas:
• Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
• Governance Actions and Processes
• Business Objects and Relationships, Taxonomies and Ontologies
• Business Attributes
• Teaming Metadata (people profiles, communities, projects, notebooks, …)
• Models and Schemas
• Physical Asset Descriptions (Data stores, APIs, models and components)
• Asset Collections (Sets, Typed Sets, Type Organized Sets)
• Information Views
• Rights Management
• Reference Data
• Feedback Metadata (tags, comments, ratings, …)
• Classification Schemes
• Subject Area Definition
• Campaigns and Projects
• Infrastructure and systems
• Connector Directories
• Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
• Information Process Instrumentation (design lineage)

Roles:
• Information Auditor, Integration Developer, Business Analyst, Data Scientist, Information Worker, Information Owner, Information Governor, Information Steward, Data Quality Analyst, Information Curator

Other labels: Augmentation, Mapping, Implementation, Access, Organization, Classification, Strategy, Rollout, Instrument, Association

Page 29: Unleashing the power of apache atlas with apache  - virtual dataconnector

User & Group/Role synchronization

UserSync2

• LDAP holds role-membership (LDAP groups) – could also be Active Directory
• ATLAS manages definitive list of roles <that are used for atlas managed sources>
• Corporate LDAP has a huge number of users/groups
• Ranger currently needs to sync all
• In future perhaps we establish group/role membership during authentication
• Capability for alternative source could be merged in to base UserSync

[Diagram labels: Apache Ranger; LDAP; Apache Atlas; LDAP lookup -> group:member; Governance Action OMAS - getRoles]
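One way to picture the proposed user sync is as a filter: take the group memberships from LDAP, but keep only the groups that Atlas lists as roles. The sketch below assumes both inputs are already in memory; `memberships_to_sync` and the sample names are hypothetical, and no LDAP, Ranger or Atlas API is called.

```python
def memberships_to_sync(ldap_groups, atlas_roles):
    """Return user -> sorted roles, keeping only LDAP groups that Atlas manages as roles.

    ldap_groups maps group name -> list of member user names (the group:member lookup).
    atlas_roles is the set of role names Atlas considers definitive.
    """
    by_user = {}
    for group, members in ldap_groups.items():
        if group not in atlas_roles:
            continue  # skip the bulk of corporate groups that Atlas does not use
        for user in members:
            by_user.setdefault(user, set()).add(group)
    return {user: sorted(roles) for user, roles in sorted(by_user.items())}


ldap_groups = {
    "hr_manager": ["alice"],
    "analyst": ["bob", "alice"],
    "canteen_users": ["carol"],
}
atlas_roles = {"hr_manager", "analyst"}
synced = memberships_to_sync(ldap_groups, atlas_roles)
```

Users who belong only to groups outside the Atlas role list never reach the policy engine, which is the saving the slide is after when the corporate directory is very large.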

Page 30: Unleashing the power of apache atlas with apache  - virtual dataconnector

Atlas Glossary v2: Tag Sync to Ranger

TagSync2

• ATLAS glossary manages a sophisticated enterprise glossary structure
• Atlas Glossary v2 proposed in ATLAS-1410 (David Radley)
• Sync builds on existing tagsync approach
• New API in Atlas will flatten classification structure
• No changes to Ranger – but exposing richer classification could be area of future work

[Diagram labels: Apache Atlas side: Confidential, Salary (Business Term), emp_renum (Hive Column); Governance Action OMAS; Apache Ranger side: Confidential (Tag), emp_renum (Hive Column)]

Page 31: Unleashing the power of apache atlas with apache  - virtual dataconnector

Policy (Rule) synchronization

RuleSync

• Generate policies in Ranger based off entities in Atlas
• Currently designing how this works
• Scoped by policy service so existing Ranger UI approach still works

[Diagram labels: Apache Atlas; Role; Classifications; Asset; Governance Action OMAS - getRules; Ranger Rule; Action; Apache Ranger]
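Since the slide notes that the design is still in progress, the following is only a guess at the shape of the idea: rules that combine a classification, a role and an action are turned into policies, and regeneration replaces policies only within one dedicated service so hand-written policies elsewhere are left alone. `sync_policies`, the dictionary fields and the service names are all hypothetical and do not reflect Ranger's policy format or REST API.

```python
def sync_policies(existing, rules, generated_service):
    """Regenerate policies for one service, leaving other services untouched.

    existing: list of policy dicts, each with at least "service" and "name".
    rules: list of dicts with "classification", "role" and "action" keys.
    """
    # Keep everything that lives outside the generated service.
    kept = [p for p in existing if p["service"] != generated_service]
    generated = [
        {
            "service": generated_service,
            "name": f"{rule['classification']}:{rule['role']}:{rule['action']}",
            "tag": rule["classification"],
            "role": rule["role"],
            "action": rule["action"],
        }
        for rule in rules
    ]
    return kept + generated


existing = [
    {"service": "hive_manual", "name": "hand-written"},
    {"service": "atlas_generated", "name": "stale"},
]
rules = [{"classification": "SPI", "role": "hr_manager", "action": "permit"}]
result = sync_policies(existing, rules, "atlas_generated")
```

Scoping by service in this way is what would let administrators keep using the existing Ranger UI for their own policies while generated ones are refreshed from Atlas.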

Page 32: Unleashing the power of apache atlas with apache  - virtual dataconnector

VirtualDataConnector JIRAS 20170402

RANGER-1488  Create Ranger plugin for gaiandb
RANGER-1487  generate rules from Governance definitions in Atlas
RANGER-1486  New usersync alternative for Atlas (vdc)
RANGER-1485  Ranger support for Virtual Data Connector Project (ATLAS)
RANGER-1464  Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
RANGER-1454  Support of Atlas v2 glossary API proposal for tag source
RANGER-1234  Post-evaluation phase user extensions
RANGER-1186  Ranger Source: eclipse
RANGER-1168  Add data masking for tag based policies
ATLAS-1696   Governance Action Framework OMAS
ATLAS-1694   Sample assets to support Virtual Connector Project
ATLAS-1691   OMAS Interfaces for Atlas
ATLAS-1158   Build ATLAS using Docker
ATLAS-520    Temporal / Versioning support for types, traits, entities ....
ATLAS-519    metrics
ATLAS-455    Timeouts in tests should be configurable from system property
ATLAS-197    Add build instructions in top level dir

Page 33: Unleashing the power of apache atlas with apache  - virtual dataconnector

References

• Apache Atlas - http://atlas.apache.org/
  • Top level JIRA for this activity: https://issues.apache.org/jira/browse/ATLAS-1689
• Apache Ranger - http://ranger.apache.org/
• GaianDB
  • https://github.com/gaiandb/gaiandb
  • https://developer.ibm.com/open/openprojects/gaian-database/
• The case for open metadata – A.M.Chessell
  • http://www.ibmbigdatahub.com/blog/case-open-metadata