Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data


Transcript of Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Page 1: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Hortonworks DataFlow
Enterprise Data Flow powered by Apache NiFi

Mats Johansson
Solutions Engineer - EMEA

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

Page 2: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Disclaimer

This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.

Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.

This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.

Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.

Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.

Page 3: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

IoAT Data Grows Faster Than We Consume It

Much of the new data exists in-flight, between systems and devices, as part of the Internet of Anything.

The Opportunity: Unlock transformational business value from a full fidelity of data and analytics for all data.

TRADITIONAL: Traditional Data Sources
• ERP, CRM, SCM

NEW: Internet of Anything
• Sensors and machines
• Geolocation
• Server logs
• Clickstream
• Social media
• Files & emails

Page 4: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Internet of Anything is Driving New Requirements

Need trusted insights from data at the very edge to the data lake in real-time with full-fidelity
– Data generated by sensors, machines, geo-location devices, logs, clickstreams, social feeds, etc.

Modern applications need access to both data-in-motion and data-at-rest

IoAT data flows are multi-directional and point-to-point
– Very different from existing ETL, data movement, and streaming technologies, which are generally one-directional

The perimeter is outside the data center and can be very jagged
– This “Jagged Edge” creates new opportunity for security, data protection, data governance and provenance

Page 5: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Architectural Limitations Today

• Traditional data movement software has been built for the world of standardized data and one-way flows

• Tools built for newer types of data tend to be custom, difficult to manage, and architecturally disjoint

• Businesses cannot easily collect, conduct, and curate secure multi-directional and point-to-point IoAT data flows

• IoAT data flows are not optimized, use costly/limited bandwidth, and cannot dynamically prioritize the most valuable data

• Difficult to gain actionable insights from the combination of data-in-motion and data-at-rest

Page 6: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Introducing Hortonworks DataFlow

[Diagram: The IoAT Data Flow. Labels: Internet of Anything; Hortonworks DataFlow powered by Apache NiFi; Hortonworks Data Platform powered by Apache Hadoop; Store Data and Metadata; Enrich Context; Perishable Insights; Historical Insights]

Hortonworks DataFlow and the Hortonworks Data Platform deliver the industry’s most complete solution for management of Big Data.

Page 7: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Simplistic View of IoAT & Data Flow

[Diagram labels: Acquire Data; The Data Flow Thing; Process and Analyze Data; Store Data]

Page 8: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Realistic View of IoAT and Data Flow

Global interactions with customers, business partners, and things spanning different volume, velocity, bandwidth, and latency needs

Page 9: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Meeting IoAT Edge Requirements

GATHER, DELIVER, PRIORITIZE

• Small Footprints: operate with very little power
• Limited Bandwidth: can create high latency
• Data Availability: exceeds transmission bandwidth
• Data Must Be Secured: throughout its journey
• Track from the edge through to the datacenter

Page 10: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Dataflow requirements within the Data Center

• Understanding: Ability to observe precisely how systems exchange data in real-time and historically

• Agility: Ability to interact with and alter live flows and iterate on new ones

• Dynamic Access Controls: The entitlements of users and systems and sensitivity of data can change frequently

• Cross Cutting Concerns: Address common needs once, like enrichment, filtering, transformation

• Enable architecture transition: Legacy vs modern is an ‘always’ event. Format, schema, protocol conversion is routine

Page 11: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Apache NiFi: Collect, Conduct, Curate

Collect: Bring Together
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent
• Logs
• Files
• Feeds
• Sensors

Conduct: Mediate the Data Flow
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
• Deliver
• Secure
• Govern
• Audit

Curate: Gain Insights
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights
• Parse
• Filter
• Transform
• Fork
• Clone

Page 12: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

A Brief History of Apache NiFi

• 2006: NiagaraFiles (NiFi) was first conceived by Joe Witt at the National Security Agency (NSA)

• November 2014: NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator

• July 2015: NiFi reaches ASF top-level project status

Page 13: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Apache NiFi: Three key concepts

• Manage the flow of information

• Data Provenance

• Secure the control plane and data plane

Page 14: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Apache NiFi – Key Features

• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
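
As a rough illustration of two of the features listed above, data buffering with backpressure and prioritized queuing, the sketch below models a bounded queue in plain Python. It is not NiFi code: the class name `BoundedPriorityQueue`, the `threshold` parameter, and the numeric priority scheme are all assumptions made for this example.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Toy connection queue: bounded, and drained highest-priority first."""

    def __init__(self, threshold):
        self.threshold = threshold        # hypothetical backpressure limit
        self._heap = []
        self._order = itertools.count()   # tie-breaker keeps FIFO order

    def is_full(self):
        return len(self._heap) >= self.threshold

    def offer(self, item, priority):
        """Return False instead of queuing when backpressure applies."""
        if self.is_full():
            return False
        # Lower number means higher priority.
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Remove and return the highest-priority item, or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = BoundedPriorityQueue(threshold=2)
accepted = [queue.offer("routine log", 5),
            queue.offer("alarm", 1),
            queue.offer("another log", 5)]  # refused: the queue is full
drained = [queue.poll(), queue.poll()]
print(accepted, drained)
```

In this toy model the producer is simply told "no" when the queue is full; a real system would typically stop scheduling the upstream component until space frees up.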

Page 15: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Common Apache NiFi Use Cases

• Predictive Analytics: Ensure the highest value data is captured and available for analysis

• Compliance: Gain full transparency into provenance and flow of data

• IoT Optimization: Secure, Prioritize, Enrich and Trace data at the edge

• Fraud Detection: Move sales transaction data in real time to analyze on demand

• Big Data Ingest: Easily and efficiently ingest data into Hadoop

• Value Resources: Gain visibility into how data sources are used to determine value

Page 16: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Flow Based Programming (FBP)

FBP Term | NiFi Term | Description

Information Packet | FlowFile | Each object moving through the system.

Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.

Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.

Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.

Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
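
To make the FBP vocabulary above concrete, here is a minimal sketch in plain Python of an information packet, a bounded buffer, a black box, and a scheduler. It only mirrors the concepts; it does not use or reproduce the NiFi API. The names `UppercaseProcessor`, `on_trigger`, and `run`, and the idea of carrying attributes in a dictionary, are assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Buffer between processors (here a simple FIFO queue)."""

    def __init__(self):
        self._queue = deque()

    def put(self, flowfile):
        self._queue.append(flowfile)

    def get(self):
        return self._queue.popleft() if self._queue else None


class UppercaseProcessor:
    """Black box: transforms each FlowFile it takes from its input."""

    def __init__(self, inbound, outbound):
        self.inbound = inbound
        self.outbound = outbound

    def on_trigger(self):
        flowfile = self.inbound.get()
        if flowfile is None:
            return False                  # nothing to do this round
        flowfile.content = flowfile.content.upper()
        flowfile.attributes["transformed"] = "true"
        self.outbound.put(flowfile)
        return True


def run(processors):
    """Stand-in for the scheduler: trigger processors until all are idle."""
    while any([p.on_trigger() for p in processors]):
        pass


source, sink = Connection(), Connection()
source.put(FlowFile(b"hello", {"filename": "a.txt"}))
run([UppercaseProcessor(source, sink)])
result = sink.get()
print(result.content, result.attributes)
```

Because processors only talk to connections, never to each other, they can be rearranged or composed into larger groups without changing their code, which is the point of the "black box" term.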

Page 17: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Hortonworks DataFlow, Powered by Apache NiFi

• Visual User Interface: HTML 5, drag and drop, for agile execution

• High Throughput, Low Bandwidth: for any data, big or small

• Provenance Metadata: for governance and compliance

• Secure End-to-End Data Routing: with encryption and compression

Page 18: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Basics of Connecting Systems

For every connection, these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance

[Diagram: Producer (P1) connected directly to Consumer (C1)]
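
The eight points of agreement can be read as a checklist to compare between a producer and a consumer. The sketch below, in plain Python, does exactly that and reports where the two sides differ. The `Endpoint` class, its field names, and the sample values are hypothetical and exist only for this illustration.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of what one side of a connection expects."""
    protocol: str
    format: str
    schema: str
    priority: str
    event_size: str
    event_frequency: str
    authorization: str
    relevance: str


def mismatches(producer, consumer):
    """Return the names of the properties on which the two sides differ."""
    return [f.name for f in fields(Endpoint)
            if getattr(producer, f.name) != getattr(consumer, f.name)]


# Made-up sample values: the two sides disagree on format and schema only.
p1 = Endpoint("http", "json", "v1", "normal", "1 KB", "10/s", "token", "all")
c1 = Endpoint("http", "avro", "v2", "normal", "1 KB", "10/s", "token", "all")
print(mismatches(p1, c1))
```

Every name returned is something that has to be converted or negotiated somewhere between the two systems.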

Page 19: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Using Messaging

Only a subset agree using messaging:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance

[Diagram: producer P1 connected through Messaging to consumers C1 through CN]

More issues to consider:
• How do you know what the data flow looks like?
• How is it managed?
• How is it working – today, yesterday?

Page 20: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Using an Enterprise Service Bus (ESB)

Still, only a subset agree using an ESB:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance

[Diagram: producer P1 connected through a Broker and Messaging to consumers C1 through CN]

Even more issues to consider:
• Remote procedure calls (RPC) and throughput issues are introduced
• Design and deploy management – slow setup, not interactive
• You can scale out, but not up or down
• You still don’t know what the data flow looks like

Page 21: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Architecture

[Diagram: a NiFi cluster. Master: the NiFi Cluster Manager (NCM), shown as an OS/Host running a JVM that contains a Web Server and the NiFi Cluster Manager – Request Replicator. Slaves: three NiFi Nodes, each shown as an OS/Host running a JVM that contains a Web Server and a Flow Controller with Processor 1 through Extension N, backed by a FlowFile Repository, Content Repository, and Provenance Repository on Local Storage.]

High Availability: Control plane vs Data plane…

Page 22: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Define A Hortonworks DataFlow

• Easy to use drag and drop UI
• Flexible to define the Data Flow

Page 23: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

HDF – Powered by Apache NiFi

Page 24: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Add processor for data intake

1. Drag and drop processor icon from the top menu

Page 25: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Choose the specific processor

2. Choose one of the processors – currently 90 available – designed for extension

Page 26: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Example: Pick Twitter Processor

Page 27: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Configure the processor

3. Select processor and choose option to Configure

4. Adjust parameters as required

Page 28: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Another processor for data output

5. Drag and drop processor icon from the top menu

6. Example: choose PutHDFS processor

Page 29: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Configure second processor

7. Configure 2nd processor

Page 30: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

8. Connect processors, configure connection

Page 31: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

9. Click Start to begin processing

Page 32: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

See processors update with real time changes

10. As data flows, the GUI updates in real time.

Page 33: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Dynamically adjust and tune data flow as needed

11. Dynamically adjust and tune the dataflow as needed, in real time. Data can also be replicated for testing and comparison.

Page 34: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Understand the data path with Data Provenance

14. Select Data Provenance

Page 35: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Trace lineage of a particular piece of data

15. Icon for Data Lineage

Page 36: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Every change to data is tracked: processing, views

16. Provenance event is tracked

Page 37: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Updates as changes happen

17. Updates as data flows

Page 38: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Easily access and trace changes to dataflow

Page 39: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Audit trail of Hortonworks DataFlow User Actions

Page 40: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

NiFi is complementary to Hadoop

Operations: Deployment flexibility from devices to data center. Delivers data flow QoS across dimensions such as loss tolerant vs. guaranteed delivery, low latency vs. high throughput, and priority-based queuing.

Governance: Starting at the source, captures fine-grained metadata regarding all data received, forked, joined, cloned, modified, sent, and ultimately dropped as data reaches its configured end-state, delivering comprehensive governance (aka provenance, chain of custody).

Security: Secures the data movement from beginning to end. Allows for fine-grained data authorization policies to be enforced at the flow-level.

Page 41: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Operations

• Reporting tasks (push)
• Statistics / status (pull)
• Dynamic flow changes
  - Push new business rules via REST API (closed loop)
  - Pull updates periodically from web services
• Site-to-site
  - Stay at the ‘flow level’ not suddenly doing file transfer protocols
• Extensible
• Optimized user experience – log hunts should be the exception

Scale down, up, and out – in containers and on virtual machines

Page 42: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

The Need for Data Provenance

For Operators
• Traceability, lineage
• Recovery and replay

For Compliance
• Audit trail

For Business
• Value sources
• Value IT investment

[Diagram: LINEAGE, from BEGIN to END]
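
Traceability and lineage can be pictured as walking a log of recorded events back from a piece of data to its origin. The sketch below shows that idea in plain Python. It is a toy model rather than NiFi's provenance repository: the `ProvenanceEvent` fields, the event type strings, and the `lineage` function are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    """One recorded step in the life of a piece of data (toy model)."""
    event_type: str      # e.g. "RECEIVE", "CLONE", "SEND"
    item_id: str         # the data item the event is about
    parent_id: str = ""  # set when the item was derived from another item


def lineage(events, item_id):
    """Return the events for item_id and all its ancestors, oldest first."""
    chain = []
    current = item_id
    while current:
        own = [e for e in events if e.item_id == current]
        chain = own + chain
        # Follow the parent link, if any event for this item recorded one.
        current = next((e.parent_id for e in own if e.parent_id), "")
    return chain


log = [
    ProvenanceEvent("RECEIVE", "a"),
    ProvenanceEvent("CLONE", "b", parent_id="a"),
    ProvenanceEvent("RECEIVE", "x"),   # unrelated item, not part of the trace
    ProvenanceEvent("SEND", "b"),
]
trace = [(e.event_type, e.item_id) for e in lineage(log, "b")]
print(trace)
```

The same log serves all three audiences named above: operators replay from it, compliance reads it as an audit trail, and the business can count which sources actually feed delivered data.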

Page 43: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Extending Data Governance from the Edge to Hadoop

Data Governance Requirements

• Transparent: Governance standards and protocols must be clearly defined and available to all

• Reproducible: Recreate the relevant data landscape at a given point in time

• Auditable: Trace all relevant events and assets with appropriate historical lineage

• Consistent: Compliance practices must be consistent

Hadoop Data Platform: Must snap into existing data governance frameworks and openly exchange metadata

[Diagram: Holistic Data Governance. Labels: Internet of Anything; Traditional Data Systems (SCM, CRM, ERP); ETL / DQ; MDM; ARCHIVE; Business Analytics; Visualization & Dashboards]

Page 44: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

The Need for Fine-grained Security and Compliance

It’s not enough to say you have encrypted communications

• Enterprise authorization services – entitlements change often

• People and systems with different roles require different access levels

• Tagged/classified data

Page 45: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Security

Administration: Central management and consistent security
• NiFi Cluster Manager

Authentication: Authenticate users and systems
• 2-Way SSL support out of the box; additional types coming

Authorization: Provision access to data
• Pluggable authorization designed to fit any Identity and Access Management (IAM) scheme
• File-based authority provider out of the box
• Multi-role

Audit: Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the Lake

Data Protection: Protect data at rest and in motion
• Support a variety of SSL/encrypted protocols
• Tag and utilize tags on data for fine grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms

Roles:
• Administrator: Configure system threads, user accounts, and flow audit history
• Data Flow Manager: Manipulate the dataflow
• Read Only: View the dataflow only
• NiFi: Configure system threads, user accounts, and flow audit history
• Proxy: Manipulate the dataflow
• Provenance: Query the provenance repository and download content
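
A multi-role scheme like the one described above comes down to a mapping from roles to permitted actions, with a user allowed to act if any assigned role grants the action. The sketch below shows this in plain Python for four of the roles. The permission names are invented for illustration, and the assumption that a Data Flow Manager can also view the flow is ours; none of this reflects how NiFi actually stores or checks authorities.

```python
# Hypothetical permission names derived from the role descriptions above.
ROLE_PERMISSIONS = {
    "Administrator": {"configure_threads", "manage_users",
                      "manage_audit_history"},
    "Data Flow Manager": {"view_flow", "modify_flow"},  # view is assumed
    "Read Only": {"view_flow"},
    "Provenance": {"query_provenance", "download_content"},
}


def is_allowed(user_roles, permission):
    """A user may act if any of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in user_roles)


operator = ["Data Flow Manager", "Provenance"]
auditor = ["Read Only"]
print(is_allowed(operator, "modify_flow"), is_allowed(auditor, "modify_flow"))
```

Keeping the mapping in one place means entitlements can change without touching the code that enforces them.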

Page 47: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Operations: Planned

Page 50: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Planned Apache NiFi Enhancements

• IN PROGRESS: Enhanced configuration management of flows
• STARTED: Extension and template registry
• TARGETED TO NIFI 0.4.0 RELEASE: First-class Avro support
• STARTED: Interactive queue management
• STARTED: Multi-tenant data flow
• FUTURE: Pluggable authentication
• FUTURE: Reference-able process groups
• FUTURE: Variable registry

https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals

Page 51: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Try It Yourself

Download NiFi and HDP Sandbox from hortonworks.com/sandbox

Tweet: #hadooproadshow

Page 52: Hortonworks DataFlow & Apache Nifi @Oslo Hadoop Big Data

Thank you!

Mats Johansson

[email protected]

@matsjo66

https://se.linkedin.com/in/matsjo66