Beyond Messaging: Enterprise Dataflow powered by Apache NiFi

Aldrin Piri, 19 January 2016

© Hortonworks Inc. 2011 – 2015. All Rights Reserved

About me

Senior Member of Technical Staff

Project Management Committee and Committer

@aldrinpiri

Simplistic View of Enterprise Data Flow

The Data Flow Thing

Acquire Data

Process and Analyze Data

Store Data

Realistic View of Data Flow

Global interactions with customers, business partners, and things, spanning different volume, velocity, bandwidth, and latency needs

Meeting Edge Requirements

GATHER

DELIVER

PRIORITIZE

Track from the edge through to the datacenter

Small Footprints operate with very little power

Limited Bandwidth can create high latency

Data Availability exceeds transmission bandwidth

Data Must Be Secured throughout its journey

Where do we find data flow?

• Remote sensor delivery (Internet of Things - IoT)
• Intra-site / Inter-site / global distribution (Enterprise)
• Ingest for driving analytics (Big Data)
• Data Processing (Simple Event Processing)

Basics of Connecting Systems

For every connection, these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance

P1 (Producer) → C1 (Consumer)
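
The eight points of agreement on this slide can be checked mechanically. The following is a minimal sketch, assuming a hypothetical `Endpoint` descriptor; the class, its field names, and the sample values are illustrative only and are not part of NiFi or of any messaging API.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Endpoint:
    """Hypothetical description of what one side of a connection expects."""
    protocol: str
    format: str
    schema: str
    priority: str
    event_size: str
    event_frequency: str
    authorization: str
    relevance: str


def mismatches(producer: Endpoint, consumer: Endpoint) -> list[str]:
    """Return the names of the attributes on which the two sides disagree."""
    return [
        f.name
        for f in fields(Endpoint)
        if getattr(producer, f.name) != getattr(consumer, f.name)
    ]


# Illustrative values only: P1 and C1 disagree on three of the eight points.
p1 = Endpoint("http", "json", "v1", "fifo", "small", "streaming", "role-a", "all")
c1 = Endpoint("sftp", "avro", "v1", "fifo", "batch", "streaming", "role-a", "all")

print(mismatches(p1, c1))
```

Any non-empty result means something between P1 and C1 has to bridge the difference, which is the gap the rest of the talk is concerned with.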

Challenges of dataflow in the enterprise

• Messaging addresses only a small subset of the problem space
• Needed to understand the big picture
• Needed the ability to make immediate changes
• Must maintain chain of custody for data
• Rigorous security and compliance requirements

Messaging Systems as Dataflow

Great options including:
• Kafka
• ActiveMQ
• Tibco

Let us consider the perfect messaging system for this talk:
• It has zero latency
• It has perfect data durability
• It supports unlimited consumers and producers

Dataflow as a messaging problem

“But my system needs…”

• A different format and/or schema
• To use a different protocol
• The highest priority information first
• Large objects (event batches) / small objects (streams)
• Authorization down to the data level
• Only a subset of the data on a topic
• Data that is enriched/sanitized before it arrives
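
Each need in this list implies a step that sits between the producer and the consumer. The following is a minimal sketch of one such step, assuming a hypothetical event shape (dicts with `topic`, `region`, and `payload` keys) and a hypothetical consumer that wants only one region, with one field stripped, delivered as JSON text. None of these names come from the talk.

```python
import json


def adapt(events, wanted_region, drop_fields=("ssn",)):
    """Filter, sanitize, and re-encode events for one consumer (illustrative only)."""
    for event in events:
        # Relevance: this consumer wants only a subset of the topic.
        if event["region"] != wanted_region:
            continue
        # Sanitize: strip fields the consumer is not authorized to see.
        payload = {k: v for k, v in event["payload"].items() if k not in drop_fields}
        # Format: the consumer expects JSON text rather than Python dicts.
        yield json.dumps(payload, sort_keys=True)


# Hypothetical sample events on a single topic.
events = [
    {"topic": "orders", "region": "eu", "payload": {"id": 1, "ssn": "x"}},
    {"topic": "orders", "region": "us", "payload": {"id": 2, "ssn": "y"}},
]

for line in adapt(events, wanted_region="eu"):
    print(line)
```

Even with a perfect messaging system, a function like this has to live somewhere, and every consumer with different needs requires its own variant.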

Using Messaging

With messaging, only a subset of these agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance

P1 → Messaging → C1 through CN

More issues to consider:
• How do you know what the data flow looks like?
• How is it managed?
• How is it working – today, yesterday?

How these issues are typically solved

• Add new systems to handle the protocol differences
• Add new systems to convert the data
• Add new systems to reorder the data
• Add new systems to filter the unauthorized data
• Add new topics to represent ‘stages of the flow’

Which leads to latency, complexity, and limited retention

Ultimately, the operations teams who handle data at flow boundaries become responsible for managing the flow.

Real-time Data Flow

It’s not just how quickly you move data – it’s about how quickly you can change behavior and seize new opportunities

Introducing Apache NiFi

• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
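
Prioritized queuing means that the order in which queued data leaves a buffer is decided by a rule rather than by arrival order alone. The following is a minimal sketch using Python's standard `heapq` module. The `PrioritizedQueue` class and the numeric priorities are assumptions made for illustration and do not reflect how NiFi implements its prioritizers.

```python
import heapq
import itertools


class PrioritizedQueue:
    """Illustrative queue: the lowest priority number is dequeued first."""

    def __init__(self):
        self._heap = []
        # Monotonic counter used as a tie-breaker, so items with equal
        # priority leave in the order they arrived.
        self._order = itertools.count()

    def put(self, item, priority):
        heapq.heappush(self._heap, (priority, next(self._order), item))

    def get(self):
        _priority, _seq, item = heapq.heappop(self._heap)
        return item

    def __len__(self):
        return len(self._heap)


q = PrioritizedQueue()
q.put("routine telemetry", priority=5)
q.put("alarm", priority=0)
q.put("more telemetry", priority=5)

# Drain the queue: the alarm jumps ahead of data that arrived before it.
drained = [q.get() for _ in range(len(q))]
print(drained)
```

When bandwidth is limited, a rule like this determines which data gets through first, which ties back to the earlier point about sending the highest priority information first.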

A Brief History

2006: NiagaraFiles (NiFi) was first conceived by Joe Witt at the National Security Agency (NSA).

November 2014: NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator.

July 2015: NiFi reaches ASF top-level project status.

Flow Based Programming (FBP)

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as a queue and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
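
The bounded buffer is what lets two black boxes run at different rates: when the buffer fills, the upstream side waits. The following is a minimal sketch of that idea using only Python's standard `queue` and `threading` modules. The names `connection`, `source_processor`, `sink_processor`, and the capacity of 2 are assumptions for illustration; this is not NiFi code.

```python
import queue
import threading

CAPACITY = 2  # hypothetical backpressure threshold for the connection
connection = queue.Queue(maxsize=CAPACITY)  # the bounded buffer
DONE = object()  # sentinel marking the end of the flow
received = []


def source_processor(n):
    """Upstream black box: emits n items, waiting whenever the buffer is full."""
    for i in range(n):
        connection.put(f"flowfile-{i}")  # blocks while the connection is full
    connection.put(DONE)


def sink_processor():
    """Downstream black box: drains the connection at its own pace."""
    while True:
        item = connection.get()
        if item is DONE:
            break
        received.append(item.upper())  # stand-in for a transformation


producer = threading.Thread(target=source_processor, args=(5,))
consumer = threading.Thread(target=sink_processor)
producer.start()
consumer.start()
producer.join()
consumer.join()

print(received)
```

The producer never holds more than `CAPACITY` items ahead of the consumer, so a slow downstream step slows the upstream step instead of exhausting memory.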

Architecture

[Diagram, single node: an OS/Host runs a JVM containing the Web Server and the Flow Controller, which hosts Processor 1 through Extension N. The FlowFile Repository, Content Repository, and Provenance Repository sit on Local Storage.]

[Diagram, cluster: the Master is the NiFi Cluster Manager (NCM), an OS/Host running a JVM with a Web Server and the Request Replicator. The Slaves are NiFi Nodes, each with the single-node layout: Web Server, Flow Controller with Processor 1 through Extension N, and the FlowFile, Content, and Provenance Repositories on Local Storage.]

Live Demonstration

Learn more and join us!

Apache NiFi site: http://nifi.apache.org

Subscribe to and collaborate [email protected]@nifi.apache.org

Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI

Follow us on Twitter: @apachenifi

Thank you!