Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise
Uploaded by: hadoop-summit
Category: Technology
Transcript of Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise
1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Top Three Big Data Governance Issues and How Apache ATLAS resolves it for the Enterprise
June 28, 2016
Apache Atlas
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
Atlas Data Governance
Organizations need data governance to understand their information and to answer questions such as:
• What do we know about our information?
• Where did this data come from and who can use it?
• Does this data adhere to company policies and rules?
Vision - Enterprise Data Governance Across Platforms
[Diagram labels: STRUCTURED, UNSTRUCTURED; TRADITIONAL RDBMS, MPP APPLIANCES, DATA LAKE, STREAMING; METADATA; Projects 1, 3, 4, 5, 6]
Atlas: Metadata Truth in Hadoop
• Data Management along the entire data lifecycle with integrated provenance and lineage capability
• Modeling with Metadata enables comprehensive data lineage through a hybrid approach with enhanced tagging and attribute capabilities
• Interoperable Solutions across the Hadoop ecosystem, through a common metadata store
Apache Atlas Overview
Atlas Data Governance
Data governance practices provide a holistic approach to managing, improving and leveraging information to help you gain insight and build confidence in business decisions and operations.
Atlas helps customers discover information about data objects, their meaning, location, characteristics, and usage.
Atlas timeline: from DGI to present
• Dec 2014: DGI group kickoff
• May 2015: Apache Atlas incubation
• July 2015: HDP 2.3, Foundation GA release
• Jan 2016: HDP 2.4, Kafka/Storm, Sqoop, Falcon, tag-based security
• Summer 2016: HDP 2.5, Business Catalog, AD integration, versioning
First kickoff to GA in 7 months
Global Financial Company
* DGI: Data Governance Initiative
Key Benefits:
• Co-Dev = Built for real customer use cases
• Faster & Safer = Customers know business + HWX knows Hadoop
Big Data Management Through Metadata
Management Scalability
Many traditional tools and patterns do not scale when applied to multi-tenant data lakes. Many enterprises have siloed data and metadata stores that collide in the data lake. This is compounded by the ability to have very large windows (years). Can traditional EDW tools manage 100 million entities effectively with room to grow?
Metadata Tools
Scalable, decoupled, decentralized management driven through metadata is the only viable solution. This allows quick integration with automation and other metamodels.
Tags for Management, Discovery and Security
Proper metadata is the foundation for business taxonomy, stewardship, attribute-based security and self-service.
Key Benefits:
Modern Data Lakes need new ways to govern because:
• Cost – Traditional staff ratio to data size not possible
• Diversity – Only way to manage velocity of new datasets
• Agility – Quick change based on tags / taxonomy
High Level Architecture: 4 Key points
[Diagram labels: Type System, Repository, Search DSL, Bridge, REST API, Graph DB, Search, Messaging Framework; Connectors: Hive, Storm, Falcon, Kafka, Sqoop, Custom]
1 Data Lineage: Only product that captures lineage across Hadoop components at platform level.
2 Agile Data Modeling: Type system allows custom metadata structures in a hierarchical taxonomy.
3 REST API: Modern, flexible access to Atlas services, HDP components, UI & external tools.
4 Exchange: Leverage existing metadata / models by importing them from current tools. Export metadata to downstream systems.
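To make the "agile data modeling" point concrete, here is a minimal sketch of a type registry in which custom types inherit attributes from supertypes. It is a toy model in plain Python, not the Atlas type system or its API; the names `TypeDef`, `TypeRegistry`, `DataSet`, `Table` and `SensorFeed` are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TypeDef:
    """A hypothetical metadata type with optional supertypes."""
    name: str
    attributes: dict = field(default_factory=dict)  # attribute name -> data type
    super_types: tuple = ()


class TypeRegistry:
    """Holds type definitions and resolves inherited attributes."""

    def __init__(self):
        self._types = {}

    def register(self, type_def):
        # Refuse a type whose supertype has not been registered yet.
        for parent in type_def.super_types:
            if parent not in self._types:
                raise KeyError(f"unknown supertype: {parent}")
        self._types[type_def.name] = type_def

    def all_attributes(self, name):
        """Return own plus inherited attributes; own definitions win."""
        type_def = self._types[name]
        merged = {}
        for parent in type_def.super_types:
            merged.update(self.all_attributes(parent))
        merged.update(type_def.attributes)
        return merged


registry = TypeRegistry()
registry.register(TypeDef("DataSet", {"name": "string", "owner": "string"}))
registry.register(TypeDef("Table", {"columns": "array<string>"}, ("DataSet",)))
registry.register(TypeDef("SensorFeed", {"device_id": "string"}, ("DataSet",)))

table_attrs = registry.all_attributes("Table")
```

In this sketch a new kind of asset (here `SensorFeed`) is added by registering one more type, without changing the registry itself, which is the property the slide attributes to a hierarchical type system.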
Governance Ready Certification Program
Discovery / Tagging
Prep / Cleanse / ETL
Governance / BPM
Self Service Visualization
Choice: Customers choose the features they want to deploy, a la carte rather than vendor lock-in
Curated & Fast: Selected group of vendor partners to provide rich, complementary and complete features ready to deploy
Agile: Low switching costs, faster deployment and innovation
Centralized: Common SLA & common open metadata store
Flexibility: Interoperability of products through Atlas metadata
Safe: HDP at core to provide stability and interoperability
Governance Ready Certification Program
Completed:
• Waterline
• Dataguise
• Attivo
Next:
• SAP ILM, VORA
• IBM IGC
Work in progress:
• Collibra
• Alation
• Meta Integration (Miti)
• Paxata
• Syncsort
• Trifacta
Near Term Roadmap: Summer 2016
Summer 2016 Release Summary
• Dynamic Access Policies
• Cross component lineage
• Enterprise Readiness
• Business Catalog
[Three of these items carry a "Differentiator" label on the slide.]
Dynamic Access Policy
Apache Ranger + Atlas Integration
Summary of Dynamic Access Policies
• Basic Tag policy – PII example. Permission is mapped to a reusable tag, not to a resource.
• Geo-based policy – Policy based on IP address mappings. Rule enforcement dynamically geo aware.
• Time-based policy – Timer for data access for resource management, compliance reporting
• Prohibitions – Prevention of toxic combinations of Hive tables or columns that may pose a risk together.
Key Benefits:
New scalable metadata based security paradigm
Dynamic, real-time policy
Automatically updates to changes in metadata
Centralized and simple to manage policy
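The four policy kinds listed above can be sketched as a single decision function keyed by tag instead of by resource. This is a toy illustration in plain Python under stated assumptions, not Ranger's policy engine or policy format: the `Request` type, the policy tables, the group name `hr-analysts`, the `HEALTH` tag and the expiry date are all hypothetical, and the geo rule assumes the country has already been resolved from the caller's IP address.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Request:
    groups: frozenset      # groups of the requesting user
    asset_tags: frozenset  # tags on every asset touched by the request
    country: str           # location already resolved from the IP address
    on: date               # date of the access


# Hypothetical policy data, keyed by tag rather than by resource.
TAG_GROUPS = {"PII": {"hr-analysts"}}                # basic tag policy
TAG_COUNTRIES = {"PII": {"US"}}                      # geo-based policy
TAG_EXPIRY = {"PII": date(2016, 12, 31)}             # time-based policy
TOXIC_COMBINATIONS = [frozenset({"PII", "HEALTH"})]  # prohibition


def is_allowed(req):
    """Return True only if every rule attached to the request's tags passes."""
    # Prohibition: deny if the request combines tags that are risky together.
    if any(combo <= req.asset_tags for combo in TOXIC_COMBINATIONS):
        return False
    for tag in req.asset_tags:
        if tag in TAG_GROUPS and not (req.groups & TAG_GROUPS[tag]):
            return False
        if tag in TAG_COUNTRIES and req.country not in TAG_COUNTRIES[tag]:
            return False
        if tag in TAG_EXPIRY and req.on > TAG_EXPIRY[tag]:
            return False
    return True


hr = frozenset({"hr-analysts"})
pii = frozenset({"PII"})
in_policy = is_allowed(Request(hr, pii, "US", date(2016, 7, 1)))
wrong_country = is_allowed(Request(hr, pii, "DE", date(2016, 7, 1)))
expired = is_allowed(Request(hr, pii, "US", date(2017, 1, 1)))
toxic = is_allowed(Request(hr, frozenset({"PII", "HEALTH"}), "US", date(2016, 7, 1)))
```

Because the rules name only tags, any asset that later receives the PII tag falls under the same checks without a new policy being written.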
How does Atlas work with Ranger at scale?
Atlas provides: Metadata
• Business Classification (taxonomy): Company > HR > Driver
• Hierarchy with inheritance of attributes to child objects: the sensitive "PII" tag of department HR will be inherited by group HR > Driver
• Atlas will notify Ranger of changes via a Kafka topic
[Diagram labels: Apache Atlas, Hive, Ranger, Falcon, Kafka, Storm]
Atlas provides the metadata tag to create policies
Ranger provides: Access & Entitlements
• Ranger will cache tags and asset mapping for performance
• Ranger will have a policy based on tags instead of roles.
• Example: PII = <group>. This can work for many assets.
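The inheritance example above (the PII tag on HR flowing down to HR > Driver) can be sketched with a small tree where each node collects tags from its ancestors. This is an illustrative model only; `TaxonomyNode` and its methods are hypothetical names, not Atlas classes.

```python
class TaxonomyNode:
    """One term in a hypothetical business taxonomy."""

    def __init__(self, name, parent=None, tags=()):
        self.name = name
        self.parent = parent
        self.own_tags = set(tags)

    def effective_tags(self):
        """Own tags plus every tag inherited from ancestors."""
        tags = set(self.own_tags)
        node = self.parent
        while node is not None:
            tags |= node.own_tags
            node = node.parent
        return tags

    def path(self):
        """Render the position in the hierarchy, root first."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " > ".join(reversed(parts))


company = TaxonomyNode("Company")
hr = TaxonomyNode("HR", parent=company, tags={"PII"})
driver = TaxonomyNode("Driver", parent=hr)
```

Here `driver.effective_tags()` includes PII even though the tag was applied only to HR, while `company.effective_tags()` stays empty because tags flow downward, not upward.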
Scalable Access Control – Reusable Tag Policy
User group: AD, Linux
Resources: Files, Tables, Topologies
Atlas Tag: PII
ANY asset tagged PII: Files, Tables, Topologies
• Single admin group assigns; many stewards tag
• Single point of enforcement and audit
• All future tagging is covered by existing policy
Automatic update of policies – active protection
[Diagram: Atlas metastore (tags, assets, entities) publishes to the notification framework (Kafka topics); the Atlas client in Ranger subscribes to the topic, gets metadata updates and feeds the PDP resource cache]
Notification: metadata updates
• Message durability
• Optimized for speed
• Event-driven updates
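The event-driven update flow on this slide can be sketched as a publisher that emits tag-change events and a subscriber that applies them to a local cache. The sketch below uses Python's in-process `queue.Queue` as a stand-in for a Kafka topic, so it shows the pattern only; the event layout, the asset naming (`hive:hr.employees`) and the `TagCache` class are assumptions, not the actual Atlas notification format or Ranger's cache.

```python
import json
import queue

topic = queue.Queue()  # in-process stand-in for a Kafka topic


def publish_tag_change(asset, tags):
    """Metadata side: emit one event per change to an asset's tags."""
    topic.put(json.dumps({"asset": asset, "tags": sorted(tags)}))


class TagCache:
    """Policy side: a local asset -> tags cache kept current by events."""

    def __init__(self):
        self._tags = {}

    def drain(self, source):
        """Apply every pending event and return how many were applied."""
        applied = 0
        while True:
            try:
                event = json.loads(source.get_nowait())
            except queue.Empty:
                return applied
            # Later events for the same asset overwrite earlier ones.
            self._tags[event["asset"]] = set(event["tags"])
            applied += 1

    def tags_for(self, asset):
        return self._tags.get(asset, set())


cache = TagCache()
publish_tag_change("hive:hr.employees", {"PII"})
publish_tag_change("hive:hr.employees", {"PII", "SENSITIVE"})
applied = cache.drain(topic)
```

Policy checks then read from the cache rather than calling the metadata store on every request, which is the performance point the slide makes about caching tags and asset mappings.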
Hadoop Cross Component Data Lineage
Apache Atlas Component Integration
• Cross-component dataset lineage. Centralized location for all metadata inside HDP
• Single interface point for metadata exchange with platforms outside of HDP
[Diagram: Apache Atlas integrated with Hive, Ranger, Falcon, Sqoop, Storm, Kafka, Spark, NiFi and HBase; legend: HDP 2.3, HDP 2.5, Beyond HDP 2.5]
Summary of Data Lineage
Users in the upcoming release of HDP 2.5 will be able to track lineage across the following components using Atlas:
Sqoop - Import from and export to relational databases, and additional packages that leverage Sqoop. ATLAS-184, SQOOP-2609
Hive - Dataset lineage with entity versioning (including schema changes). ATLAS-75, ATLAS-183, ATLAS-492
Kafka / Storm - IoT event-level processing, such as syslogs or sensor data. ATLAS-181, ATLAS-183, STORM-1381
Falcon - Data lifecycle at Feed and Process entity level for replication and repeating workflows. Tracks periodicity, throttling, eviction. ATLAS-69, FALCON-1570
Key Benefits:
Enterprises need open solutions, not a single app vendor
More native connectors than anyone else, with more coming
Hardened metadata infrastructure
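As a rough illustration of cross-component lineage, the sketch below records which process read and wrote which dataset, then walks the graph upstream (origins) or downstream (impact). It is a self-contained toy in plain Python, not Atlas's repository or lineage API; the `LineageGraph` class and the dataset and process names are hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Dataset-level lineage: edges run from inputs to outputs of a process."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def record(self, process, inputs, outputs):
        """Record that `process` read `inputs` and wrote `outputs`."""
        for src in inputs:
            for dst in outputs:
                self._downstream[src].add((process, dst))
                self._upstream[dst].add((process, src))

    def _walk(self, start, edges):
        # Breadth-first traversal; `seen` also guards against cycles.
        seen, todo = set(), deque([start])
        while todo:
            for _process, neighbor in edges.get(todo.popleft(), ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    todo.append(neighbor)
        return seen

    def origins(self, dataset):
        """Every dataset that `dataset` was derived from."""
        return self._walk(dataset, self._upstream)

    def impacted(self, dataset):
        """Every dataset derived from `dataset`."""
        return self._walk(dataset, self._downstream)


graph = LineageGraph()
graph.record("sqoop import", ["rdbms:crm.customers"], ["hive:raw.customers"])
graph.record("hive CTAS", ["hive:raw.customers"], ["hive:mart.customer_summary"])
graph.record("storm topology", ["kafka:sensor-events"], ["hive:raw.sensor_events"])

summary_origins = graph.origins("hive:mart.customer_summary")
crm_impact = graph.impacted("rdbms:crm.customers")
```

The same graph answers both questions the deck raises: where a dataset came from, and what is affected when a source changes, even when the hops cross Sqoop, Hive and Storm.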
Expanded Native Connector: Dataset Lineage
[Diagram labels: Sqoop, Teradata Connector, RDBMS, Apache Kafka, Custom Activity Reporter, Metadata Repository]
• Any process using Sqoop is covered
• No other tool tracks IoT out of the box
Enterprise Readiness
Security/Enterprise Readiness
• Highly reliable and scalable components
• Authorization with AD via Ranger
• Rolling upgrade support HDP 2.5+
• BC & DR capabilities
• Improved performance of 5x from previous version
Enterprise Readiness: Scalable and Highly Reliable Components
[Diagram: the Atlas architecture (Type System, Repository, Search DSL, Bridge, REST API, Graph DB, Search, Messaging Framework; Connectors: Hive, Storm, Falcon, Kafka, Sqoop, Custom) shown with SolrCloud, Kafka Quorum and HBase]
Business Taxonomy (Catalog)
Key Concepts
Business Taxonomy (Catalog)
The practice and science of classification of things or concepts, including the principles that underlie such classification. The business organization model is hierarchical, making it authoritative with no duplication.
Data Lineage (Provenance)
Data lineage is defined as a data life cycle that includes the data's origins and where it moves over time. It describes what happens to data as it goes through diverse processes. It helps provide visibility into the analytics pipeline and simplifies tracing errors back to their sources.
Tags: Traits vs. Labels vs. Business Taxonomy
Atlas has tags that are authoritative and prevent duplication. A tag can span different parts of the business taxonomy. A tag such as PII can be used in HR as well as in Finance or Sales.
Benefits:
A view of data assets organized by business language
Impact analysis, Compliance, Acceptable use
Common tags throughout Hadoop components
Taxonomies Benefits:
• Search / Discovery – Business catalog of conceptual, logical and physical assets
• Security – Dynamic metadata-based access control
Research: Focused on Hadoop
We conduct open-ended user interviews so that we can learn more about who our users are and what their needs are. This helps us validate whether or not we’re solving the right problem.
Usability Testing
We test our prototype in InVision, a click-through prototyping tool that allows users to interact with static mockups.
Principal Roles & Activities
• Data Steward – Curator, responsible for catalog veracity
• Data Scientist – Analyst, primary consumer of Business Catalog
• Administrator – Role management only
• Data Engineer – Data ingress and egress, semantic data quality
• 50% - 80%+ of time spent looking for data
• Profit Center
• Primary User of Atlas
• Enables Scientist
Goal: < 25% of time spent on finding data = empowering scientists to spend their time uncovering insights, faster
Atlas Value
• Designed for Hadoop at platform, not application level
• High Confidence data in Hadoop for regulated verticals
• Compliance and business objectives aligned to data organization
• Faster discovery for analysts – reduce time to value
• Agile and adaptable – ensures information is current by native connectors
• Dynamic protection with Ranger in simple audited policies
Additional Atlas Sessions
• Extend Governance in Hadoop with the Atlas Ecosystem: integrations with partners Waterline, Trifacta and Attivo:
Thursday 4:10PM @ Room 210A
• BOF: Apache Knox and Apache Ranger provide Hadoop security while Atlas provides a Hadoop metadata store and enterprise compliance. Come learn and discuss security & governance innovations and future directions.
Thursday 5-7 PM @ Room 210A
Learn More:
• Hortonworks links: http://hortonworks.com/solutions/security-and-governance/
• Tutorials: https://github.com/hortonworks/tutorials/tree/atlas-ranger-tp/tutorials/hortonworks/atlas-ranger-preview