Unleashing the power of Apache Atlas with Apache Ranger
Virtual Data Connector Project
NIGEL [email protected], MUNICH, APRIL 2017
Apache®, Apache Atlas, Apache Ranger & other Apache project names referenced are either registered trademarks or trademarks of the Apache Software Foundation in the United States and/or other countries. No endorsement by The Apache Software Foundation is implied by the use of these marks.
About Me – Nigel Jones
• https://www.linkedin.com/in/nigelljones/
• [email protected] (Anyone still use email?)
• @planetf1 – noisy, f1, electric vehicles, food & drink …. A split of work/life accounts didn’t work for me!
• And of course the Apache Atlas & Ranger mailing lists & JIRA!
• Science fan at school and uni. It was cloud chambers back then… now just the cloud
• IBM Hursley, UK since 1990
• Last 3 years focus on Data Lake, Information Governance, Open Metadata
The Problem…..WHY ARE WE HERE…..
Data?
• What data do I have?
• What does it mean?
• Where is it?
• Who has access to it?
• Who owns it?
• What quality is it?
• How does it relate to other data?
• How do I control, audit & understand access?
Regulatory needs
• Adhere to regulations like BCBS-239 and GDPR
• Need to know meaning, value of the data
• Demonstrate processes in place to govern access
• Audit
• Significant fines if rules breached
• Whilst ensuring easy, ready access to appropriate data for data professionals to support an agile business
So what do we need to address this?
Metadata..
• Metadata enables data to be used outside of the application that created it.
  • Analytics and decision making
  • New business applications
  • Reporting and compliance
• Metadata describes the format and content of data allowing people to judge which dataset to use for a new project
  • Structure
  • Meaning
  • Origin
  • Valid values and quality
  • Usage and ownership
  • Regulations and classifications that apply
• Metadata describes the business context and classification of data allowing automated governance processes to operate.
Which can support…
An enterprise data catalogue that lists all data including where it is, what it is, who owns it, its meaning, quality, where it came from, and can fully describe its business context & how the data should be governed….
Subject Matter experts searching, collaborating, feeding back about their data needs and use
Automated governance actions to protect and manage including auditing, monitoring, quality control, rights management
But easily…
• Open frameworks & APIs
• Automatic collection & discovery of metadata in a dynamic heterogeneous environment
• Using predefined standards for glossaries, schemas, rules, regulations to reduce cost
• Cheap to integrate new tools
• No proprietary lock-in & assumptions that all tools are from one suite or vendor
• Avoiding silos
• Distributed and Open
The vision
Open and Unified Metadata
Virtualization Data Connector project
Data virtualization project
• Collaboration – IBM, several banks & open community
• A Data Lake environment
  • Not just Hadoop, but other sources too
  • Business Terms, Classifications, Metadata rich
• Offer virtualized views. Expose relational data with business terms
• Manage Access to resources – permit, deny, log, filter/mask …. THROUGH METADATA
• Open, pluggable
• Working through use cases, design, initial MVP (this year)
• Critique, feedback is welcomed. We’re looking for guidance and support from the Atlas & Ranger communities, as well as to contribute our ideas
• Proposed changes all go through mailing list and JIRA for feedback
Apache Atlas
“Atlas is a scalable and extensible set of core foundational governance services – enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.” …. http://www.apache.org
• Open Community -- Apache Incubator since May 2015
• Type agnostic metadata store
• REST API & UI
• Supports many Hadoop components including HBase, Hive, Sqoop, Storm & others
Apache Ranger
• Centralized security administration to manage all security related tasks in a central UI or using REST APIs.
• Fine grained authorization to do a specific action and/or operation with Hadoop component/tool and managed through a central administration tool
• Standardize authorization method across all Hadoop components.
• Enhanced support for different authorization methods - Role based access control, attribute based access control etc.
• Centralize auditing of user access and administrative actions (security related) within all the components of Hadoop.
… from http://ranger.apache.org
Project Interactions
[Diagram: interactions between Search/Report tools, GaianDB, Apache Atlas, Apache Ranger, Apache Solr and the underlying data (RDBMS, Hadoop)]
• Search/Report: search for list of assets by metadata, search for data, reporting tool obtains data to draw report
• Underlying data: SQL, Hive, HDFS, Oracle, Netezza etc
• Other labels: manages logical views; deploys rules, pushes classifications, source for user roles (not users); + Ranger plugin to permit/deny, mask etc; pulls rules, classifications
Why Atlas and Ranger?
• Open Source essential to forming an active ecosystem
• Vision, active community & evolving – ability to contribute & work with others to provide the best solution
• Already have good core capabilities
  • Atlas type system is very flexible
  • Ranger offers a range of policy types and provides a pluggable framework
• Already cross project integration
  • Use of tag based policies in Ranger sourced from Atlas
• Can be used independently of full Hadoop stack
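To make the idea of tag based policies concrete, here is a minimal Python sketch of an access decision driven purely by classifications. It is not Ranger's policy model or API: `TagPolicy`, `decide`, `enforce`, the role names and the "deny when nothing matches" default are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical policy record; this is not Ranger's policy model.
@dataclass(frozen=True)
class TagPolicy:
    tag: str      # classification attached to the data, e.g. "SPI"
    role: str     # role the policy applies to
    action: str   # "permit", "mask" or "deny"

# Higher number means more restrictive.
RESTRICTIVENESS = {"permit": 0, "mask": 1, "deny": 2}

def decide(column_tags, user_roles, policies):
    """Return the most restrictive action of all matching policies.

    Assumption: when no policy matches, access is denied.
    """
    matched = [
        p.action for p in policies
        if p.tag in column_tags and p.role in user_roles
    ]
    if not matched:
        return "deny"
    return max(matched, key=RESTRICTIVENESS.__getitem__)

def enforce(value, action):
    """Apply a decision to a single column value."""
    if action == "permit":
        return value
    if action == "mask":
        return "****"
    raise PermissionError("access denied")

policies = [
    TagPolicy("SPI", "hr_manager", "permit"),
    TagPolicy("SPI", "analyst", "mask"),
]
print(enforce(52000, decide({"SPI"}, {"analyst"}, policies)))
```

The point of the sketch is that the policy never names a table or column: it only refers to the tag, so newly classified data is covered without writing a new policy.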
Refined virtual connector scope
[Architecture diagram. Data Lake Virtualization: GaianDB with Ranger plugin, Virtualizer (manages logical tables), Mapper. Ranger: Ranger Server, Ranger config (eg policies, audit log location), LDAP, audit log, tag-sync, rule-sync. Atlas: OMAS, OMRS, Titan (graph DB, metadata repository). Other metadata sources: IGC, Navigator, Datameer. Data Lake repositories: Oracle, Netezza, Hive tables. Labelled flows: poll policies, extract physical metadata, retrieve metadata, push metadata, push and query metadata, pre/post create view metadata, search for data/reporting.]
GaianDB & Virtualizer
• GaianDB
  • Open Source
  • Federated, self learning, dynamic configuration
  • Based on Apache Derby
  • Already had “policy” support – we’re plugging in Ranger for this project
• Virtualizer
  • Listens to event notifications on assets etc
  • Creates view definitions in GaianDB, and uses new Atlas APIs to store metadata.
  • Could use different virtual engine..
• Designed to be open to other virtualization technologies.
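As a rough illustration of what the Virtualizer does, the Python sketch below turns a notification about an asset into a view definition that exposes physical columns under their business terms. The event shape, the `ASSET_CREATED` operation name and the use of a plain SQL `CREATE VIEW` are assumptions for illustration; the deck does not describe the actual notification format or how GaianDB logical tables are defined.

```python
def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def build_view_sql(view_name, source_table, column_terms):
    """Build a CREATE VIEW statement that renames columns to business terms.

    column_terms maps physical column name -> business term.
    """
    if not column_terms:
        raise ValueError("at least one column is required")
    select_list = ", ".join(
        f"{quote_ident(col)} AS {quote_ident(term)}"
        for col, term in column_terms.items()
    )
    return (
        f"CREATE VIEW {quote_ident(view_name)} AS "
        f"SELECT {select_list} FROM {quote_ident(source_table)}"
    )

def handle_event(event):
    """React to a (hypothetical) asset notification; return SQL or None."""
    if event.get("operation") != "ASSET_CREATED":
        return None
    return build_view_sql(event["view"], event["table"], event["column_terms"])

# Hypothetical event payload, for illustration only.
event = {
    "operation": "ASSET_CREATED",
    "view": "EMPLOYEE",
    "table": "EMP",
    "column_terms": {"EMP_NAME": "Name", "EMP_SALARY": "Salary"},
}
print(handle_event(event))
```

Keeping the SQL generation separate from the event handling mirrors the stated goal of being open to other virtualization engines: only `build_view_sql` would need to change.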
[Diagram: logical tables LT1 and LT2 defined over data sources DS1, DS2 and DS3, with the policy plugin (Ranger), the Virtualizer and Atlas]
GaianDB supports federation – not used for MVP
Atlas – glossary enhancements
• Get Atlas closer to parity with commercial offerings
• Business Terms – categories, category hierarchies
• Has-a, is-a, type-of, synonym, antonym, arbitrary relationships
• Assets mapped to Business Terms
• Classifications
  • Hierarchy
  • Navigable mappings to retain ability to flatten tags to ranger
  • Instead of hive column EMP_SALARY -> SPI, now can be EMP_SALARY -> SALARY -> SPI …
• Used to drive governance
• ATLAS-1410
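The EMP_SALARY -> SALARY -> SPI example can be sketched as a simple graph walk: follow the mappings from the asset and collect every classification reached, which is the flat tag set an enforcement engine needs. This is only an illustration in Python; the function name and the dictionary representation are assumptions, not the API proposed in ATLAS-1410.

```python
def flatten_tags(asset, mappings, classifications):
    """Collect every classification reachable from an asset.

    mappings: dict of node -> list of nodes it maps to
              (asset -> term, term -> term, term -> classification)
    classifications: set of node names that are classifications (tags)
    """
    seen, stack, tags = set(), [asset], set()
    while stack:
        node = stack.pop()
        if node in seen:        # guards against cycles in the glossary
            continue
        seen.add(node)
        if node in classifications:
            tags.add(node)
        stack.extend(mappings.get(node, ()))
    return tags

# Hypothetical glossary content based on the example above.
mappings = {
    "EMP_SALARY": ["SALARY"],   # hive column -> business term
    "SALARY": ["SPI"],          # business term -> classification
}
print(flatten_tags("EMP_SALARY", mappings, {"SPI"}))
```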
Atlas – other enhancements
• Consumer Centric APIs
  • Open Metadata Access Services (OMAS)
  • REST & more Kafka notifications
  • Asset, Catalog, Connector, Glossary, Governance Action, Governance Definitions, Information View, Roles and Access
• Repository level APIs
  • Open Metadata Repository Services (OMRS)
  • REST & more Kafka notifications
  • Pluggability through an Open Connector Framework to other metadata repositories – distributed and Open
• Standard data model/core
  • Enhancement to core model – versioning, external linkage etc
  • More standard types ie for all relational databases to ease sharing
Ranger areas being looked at
• Building a plugin for GaianDB
  • Access control, simple masking. More later
• User synchronization (large #users, role of Atlas)
• Changes to tag sync process for new glossary proposal
• As more metadata goes into Atlas, it becomes source for generation of some kinds of policies. Where is the master?
  • Generating ranger rules from governance definitions
• How about control of access to Atlas itself?
Aside: Interfaces used by enforcement engines (such as to get classification data) need to be efficient – these should work for projects like Apache Sentry as well as Atlas
Beyond the MVP
• Open Discovery Framework
• Consider other security enforcement engines – such as Apache Sentry & driving more capability around rules & governance actions from Atlas metadata
• Work on standard models to support different domains
• Lineage
  • From high level design lineage through to operational detail. Logs vs graph….
• API metadata
• Infrastructure – JanusGraph…
  • Abstraction added by IBM in last few months for titan 1
The vision
• An enterprise data catalog that lists all of your data, where it is located, its origin (lineage), owner, structure, meaning, classification and quality
  • Spanning systems both on premise and cloud providers
  • Hosted locally to your data platforms but integrated to provide the enterprise view
• New data tools (from any vendor) connect to your data catalog out of the box
  • No vendor lock-in; nor expensive population of yet another proprietary siloed metadata repository
• Metadata is added automatically to the catalog as new data is created
  • Extensible discovery processes characterise and classify the data
  • Interested parties and processes are notified
• Subject matter experts collaborating around the data
  • Locate the data they need, quickly and efficiently
  • Feed back their knowledge about the data and the uses they have made of it to help others and support economic evaluation of data
• Automated governance processes protect and manage your data
  • Metadata-driven access control
  • Auditing, metering and monitoring
  • Quality control and exception management
  • Rights management
• Predefined standards for glossaries, data schemas, rules and regulations that reduce the cost of doing business
• Open frameworks and APIs for collaborating with universities, traditional vendors and new innovators around data and advanced analytics
Summary
• Atlas can help us have an industry wide common metadata platform around which a vibrant ecosystem can evolve
  • Not only in Hadoop but more broadly
• Metadata driven governance can be scalable & enable us to manage our data better, and be compliant with regulations
• The ideas presented here resonate with many people we’ve spoken to
• Get involved! I’d love to hear the feedback on this approach!
  • Comment on the JIRAs, ask questions, contribute, disagree… ;-)
  • Look at JIRA Tag “VirtualDataConnector” or start at ATLAS-1689
  • Atlas wiki
• “Innovation happens best not in isolation but in collaboration” (keynote)
THANKS!
Backup charts
Architecture
[Architecture diagram. Components: Search/Explore UI; Atlas with graph DB, Catalog OMAS, Virtual Asset OMAS, GAF OMAS (GAF Pre, GAF Post) and OMRS; Connector Framework; IGC via the IGC REST API; Virtualizer; “gaiandb” reaching Oracle, HDFS and Netezza data over P-JDBC. Legend: Atlas boundaries; developed in POC; may not be in POC initially; * may be hardcoded at first.]
Metadata areas and types
[Diagram: metadata areas (numbered 1 to 7 in the original figure) and the roles that use them]
• Metadata types shown:
  • Policy Metadata (Principles, Regulations, Standards, Approaches, Rule Specifications, Roles and Metrics)
  • Governance Actions and Processes
  • Business Objects and Relationships, Taxonomies and Ontologies
  • Business Attributes
  • Organization
  • Teaming Metadata (people profiles, communities, projects, notebooks, …)
  • Models and Schemas
  • Physical Asset Descriptions (Data stores, APIs, models and components)
  • Asset Collections (Sets, Typed Sets, Type Organized Sets)
  • Information Views
  • Rights Management
  • Reference Data
  • Feedback Metadata (tags, comments, ratings, …)
  • Classification Schemes
  • Strategy
  • Subject Area Definition
  • Campaigns and Projects
  • Infrastructure and systems
  • Rollout
  • Discovery Metadata (profile data, technical classification, data classification, data quality assessment, …)
  • Information Process Instrumentation (design lineage)
  • Connector Directories
• Relationship labels: Augmentation, Mapping, Implementation, Access, Classification, Instrument, Association
• Roles: Information Auditor, Integration Developer, Business Analyst, Data Scientist, Information Worker, Information Owner, Information Governor, Information Steward, Data Quality Analyst, Information Curator
User & Group/Role synchronization
• LDAP holds role-membership (LDAP groups) – could also be Active Directory
• ATLAS manages definitive list of roles <that are used for atlas managed sources>
• Corporate LDAP has a huge number of users/groups
• Ranger currently needs to sync all
• In future perhaps we establish group/role membership during authentication
• Capability for alternative source could be merged in to base UserSync
[Diagram: UserSync2 feeds Apache Ranger from two sources: Apache Atlas (Governance Action OMAS - getRoles) and LDAP (LDAP lookup -> group:member)]
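The user sync idea above amounts to a filter: take the list of roles that Atlas manages, then look up membership in LDAP only for those groups instead of syncing the whole directory. A minimal Python sketch follows, using in-memory data in place of the real LDAP lookup and the getRoles call; both function names and all sample data are hypothetical.

```python
def groups_to_sync(ldap_groups, atlas_roles):
    """Keep only the LDAP groups that Atlas lists as roles.

    ldap_groups: dict of group name -> iterable of member user names
    atlas_roles: iterable of role names (stand-in for a getRoles result)
    """
    wanted = set(atlas_roles)
    return {
        group: sorted(set(members))
        for group, members in ldap_groups.items()
        if group in wanted
    }

def users_to_sync(synced_groups):
    """Users that belong to at least one synced group."""
    return sorted({user for members in synced_groups.values() for user in members})

# Hypothetical directory content and role list.
ldap_groups = {
    "hr_manager": ["alice"],
    "analyst": ["bob", "carol"],
    "canteen_staff": ["dave", "erin"],   # not an Atlas role, so skipped
}
synced = groups_to_sync(ldap_groups, ["hr_manager", "analyst"])
print(synced)
print(users_to_sync(synced))
```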
Atlas Glossary v2: Tag Sync to Ranger
• ATLAS glossary manages a sophisticated enterprise glossary structure
• Atlas Glossary v2 proposed in ATLAS-1410 (David Radley)
• Sync builds on existing tagsync approach
• New API in Atlas will flatten classification structure
• No changes to ranger – but exposing richer classification could be area of future work
[Diagram: TagSync2 reads from Apache Atlas through the Governance Action OMAS. In Apache Atlas, Hive column emp_renum maps to business term Salary, which maps to business term Confidential. In Apache Ranger this is flattened to Hive column emp_renum with tag Confidential.]
Policy (Rule) synchronization
• Generate policies in Ranger based off entities in Atlas
• Currently designing how this works
• Scoped by policy service so existing Ranger UI approach still works
[Diagram: RuleSync feeds Apache Ranger from Apache Atlas via Governance Action OMAS - getRules. Atlas side: Role, Classifications, Asset, Action. Ranger side: Ranger Rule.]
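Since the rule synchronization design was still open at the time of the talk, the following Python sketch is only one possible reading of it: rules that link a role, a classification and an action are grouped into one policy per classification and action, scoped to a named policy service. The rule and policy dictionaries are hypothetical shapes, not the getRules response or Ranger's policy JSON.

```python
def rules_to_policies(rules, service_name):
    """Group governance rules into one policy per (classification, action).

    rules: iterable of dicts with "role", "classification" and "action" keys
    service_name: the policy service the generated policies are scoped to
    """
    grouped = {}
    for rule in rules:
        key = (rule["classification"], rule["action"])
        grouped.setdefault(key, set()).add(rule["role"])
    return [
        {
            "service": service_name,
            "name": f"{classification}:{action}",  # stable name, so a re-sync can update in place
            "tag": classification,
            "action": action,
            "roles": sorted(roles),
        }
        for (classification, action), roles in sorted(grouped.items())
    ]

# Hypothetical rules, for illustration only.
rules = [
    {"role": "analyst", "classification": "SPI", "action": "mask"},
    {"role": "hr_manager", "classification": "SPI", "action": "permit"},
    {"role": "auditor", "classification": "SPI", "action": "mask"},
]
for policy in rules_to_policies(rules, "vdc_tags"):
    print(policy)
```

Deriving the policy name from the classification and action keeps the output deterministic, which matters if the sync runs repeatedly against the same service.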
VirtualDataConnector JIRAS 20170402
RANGER-1488 – Create Ranger plugin for gaiandb
RANGER-1487 – generate rules from Governance definitions in Atlas
RANGER-1486 – New usersync alternative for Atlas (vdc)
RANGER-1485 – Ranger support for Virtual Data Connector Project (ATLAS)
RANGER-1464 – Support Atlas v2 glossary in Atlas plugin (for access control to terms etc)
RANGER-1454 – Support of Atlas v2 glossary API proposal for tag source
RANGER-1234 – Post-evaluation phase user extensions
RANGER-1186 – Ranger Source: eclipse
RANGER-1168 – Add data masking for tag based policies
ATLAS-1696 – Governance Action Framework OMAS
ATLAS-1694 – Sample assets to support Virtual Connector Project
ATLAS-1691 – OMAS Interfaces for Atlas
ATLAS-1158 – Build ATLAS using Docker
ATLAS-520 – Temporal / Versioning support for types, traits, entites
ATLAS-519 – .... metrics
ATLAS-455 – Timeouts in tests should be configurable from system property
ATLAS-197 – Add build instructions in top level dir
References
• Apache Atlas - http://atlas.apache.org/
• Top level JIRA for this activity - https://issues.apache.org/jira/browse/ATLAS-1689
• Apache Ranger - http://ranger.apache.org/
• GaianDB - https://github.com/gaiandb/gaiandb and https://developer.ibm.com/open/openprojects/gaian-database/
• The case for open metadata – A.M.Chessell - http://www.ibmbigdatahub.com/blog/case-open-metadata