Hortonworks DataFlow & Apache NiFi @Oslo Hadoop Big Data
Hortonworks DataFlow
Enterprise Data Flow powered by Apache NiFi
Mats Johansson
Solutions Engineer - EMEA
© Hortonworks Inc. 2011 – 2015. All Rights Reserved
Disclaimer
This document may contain product features and technology directions that are under development, may be under development in the future or may ultimately not be developed.
Project capabilities are based on information that is publicly available within the Apache Software Foundation project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility, market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from Hortonworks to deliver these features in any generally available product.
Product features and technology directions are subject to change, and must not be included in contracts, purchase orders, or sales agreements of any kind.
Since this document contains an outline of general product development plans, customers should not rely upon it when making purchasing decisions.
IoAT Data Grows Faster Than We Consume It
Much of the new data exists in-flight, between systems and devices, as part of the Internet of Anything
NEW
TRADITIONAL
The Opportunity
Unlock transformational business value from a full fidelity of data and analytics for all data.
Geolocation
Server logs
Files & emails
ERP, CRM, SCM
Traditional Data Sources
Internet of Anything
Sensors and machines
Clickstream
Social media
Internet of Anything is Driving New Requirements
Need trusted insights from data at the very edge to the data lake in real-time with full-fidelity
– Data generated by sensors, machines, geo-location devices, logs, clickstreams, social feeds, etc.
Modern applications need access to both data-in-motion and data-at-rest
IoAT data flows are multi-directional and point-to-point
– Very different from existing ETL, data movement, and streaming technologies, which are generally one-directional
The perimeter is outside the data center and can be very jagged
– This “Jagged Edge” creates new opportunity for security, data protection, data governance and provenance
Architectural Limitations Today
• Traditional data movement software has been built for the world of standardized data and one-way flows
• Tools built for newer types of data tend to be custom, difficult to manage, and architecturally disjoint
• Businesses cannot easily collect, conduct, and curate secure multi-directional and point-to-point IoAT data flows
• IoAT data flows are not optimized: they use costly, limited bandwidth and cannot dynamically prioritize the most valuable data
• Difficult to gain actionable insights from the combination of data-in-motion and data-at-rest
The IoAT Data Flow
Hortonworks Data Platform powered by Apache Hadoop
Enrich Context
Store Data and Metadata
Internet of Anything
Hortonworks DataFlow powered by Apache NiFi
Perishable Insights
Historical Insights
Introducing Hortonworks DataFlow
Hortonworks DataFlow and the Hortonworks Data Platform deliver the industry’s most complete solution for management of Big Data.
Simplistic View of IoAT & Data Flow
The Data Flow Thing
Process and Analyze Data
Acquire Data
Store Data
Realistic View of IoAT and Data Flow
Global interactions with customers, business partners, and things spanning different volume, velocity, bandwidth, and latency needs
Meeting IoAT Edge Requirements
GATHER
DELIVER
PRIORITIZE
Track from the edge through to the datacenter
Small Footprints operate with very little power
Limited Bandwidth can create high latency
Data Availability exceeds transmission bandwidth
Data Must Be Secured throughout its journey
Dataflow requirements within the Data Center
Understanding: Ability to observe precisely how systems exchange data in real-time and historically
Agility: Ability to interact with and alter live flows and iterate on new ones
Dynamic Access Controls: The entitlements of users and systems and sensitivity of data can change frequently
Cross Cutting Concerns: Address common needs once, like enrichment, filtering, transformation
Enable architecture transition: Legacy vs modern is an ‘always’ event. Format, schema, protocol conversion is routine
Apache NiFi: Collect, Conduct, Curate
Collect: Bring Together
• Logs
• Files
• Feeds
• Sensors
Aggregate all IoAT data from sensors, geo-location devices, machines, logs, files, and feeds via a highly secure lightweight agent
Conduct: Mediate the Data Flow
• Deliver
• Secure
• Govern
• Audit
Mediate point-to-point and bi-directional data flows, delivering data reliably to real-time applications and storage platforms such as HDP
Curate: Gain Insights
• Parse
• Filter
• Transform
• Fork
• Clone
Parse, filter, join, transform, fork, and clone data in motion to empower analytics and perishable insights
A Brief History of Apache NiFi
2006: NiagaraFiles (NiFi) was first conceived by Joe Witt at the National Security Agency (NSA)
November 2014: NiFi is donated to the Apache Software Foundation (ASF) through NSA’s Technology Transfer Program and enters ASF’s incubator
July 2015: NiFi reaches ASF top-level project status
Apache NiFi: Three key concepts
• Manage the flow of information
• Data Provenance
• Secure the control plane and data plane
Apache NiFi – Key Features
• Guaranteed delivery
• Data buffering
- Backpressure
- Pressure release
• Prioritized queuing
• Flow specific QoS
- Latency vs. throughput
- Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
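To make backpressure and prioritized queuing concrete, here is a minimal Python sketch of a bounded queue that refuses new items once a threshold is reached and hands out the most urgent item first. It is an illustration only, not NiFi code: the class name `BoundedPriorityQueue`, the `backpressure_threshold` parameter, and the convention that a lower number means higher priority are all assumptions made for this example.

```python
import heapq
import itertools


class BoundedPriorityQueue:
    """Illustrative bounded, prioritized queue (hypothetical, not the NiFi API)."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        # Counter used as a tie-breaker so equal priorities stay in arrival order.
        self._order = itertools.count()

    def is_full(self):
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority):
        """Enqueue item; return False (backpressure) once the threshold is reached."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Dequeue the most urgent item (lowest priority number), or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = BoundedPriorityQueue(backpressure_threshold=2)
accepted = [
    queue.offer("routine reading", priority=5),
    queue.offer("alarm", priority=1),
    queue.offer("late reading", priority=5),  # refused: queue is at its threshold
]
drained = [queue.poll(), queue.poll(), queue.poll()]
```

In this sketch the producer learns about backpressure from the `False` return value and can slow down, buffer, or drop data depending on how loss-tolerant the flow is.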
Common Apache NiFi Use Cases
Predictive Analytics: Ensure the highest value data is captured and available for analysis
Compliance: Gain full transparency into provenance and flow of data
IoT Optimization: Secure, prioritize, enrich and trace data at the edge
Fraud Detection: Move sales transaction data in real time to analyze on demand
Big Data Ingest: Easily and efficiently ingest data into Hadoop
Value Resources: Gain visibility into how data sources are used to determine value
Flow Based Programming (FBP)
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows the creation of an entirely new component simply by composing its components.
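The flow-based programming terms can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how NiFi is implemented (NiFi runs on the JVM): the class and method names below, including `UppercaseProcessor`, `on_trigger`, and `run_once`, are hypothetical and chosen only to mirror the terms used in this deck.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information packet: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Bounded buffer in FBP terms; here simply a queue linking two processors."""

    def __init__(self):
        self.queue = deque()


class UppercaseProcessor:
    """Black box: transforms every FlowFile waiting on its input connection."""

    def __init__(self, source, destination):
        self.source = source
        self.destination = destination

    def on_trigger(self):
        while self.source.queue:
            flow_file = self.source.queue.popleft()
            transformed = FlowFile(flow_file.content.upper(), dict(flow_file.attributes))
            self.destination.queue.append(transformed)


class FlowController:
    """Scheduler: knows the processors and decides when each one runs."""

    def __init__(self, processors):
        self.processors = processors

    def run_once(self):
        for processor in self.processors:
            processor.on_trigger()


inbound, outbound = Connection(), Connection()
controller = FlowController([UppercaseProcessor(inbound, outbound)])
inbound.queue.append(FlowFile(b"sensor reading", {"source": "edge-1"}))
controller.run_once()
result = outbound.queue[0]
```

Because processors only talk to each other through connections, they can run at different rates, and a group of them can be wrapped and reused as a single component.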
Hortonworks DataFlow
Visual User Interface: HTML 5, drag and drop, for agile execution
High Throughput, Low Bandwidth for any data, big or small
Provenance Metadata for governance and compliance
Secure End-to-End Data Routing with encryption and compression
Powered by Apache NiFi
Basics of Connecting Systems
For every connection, these must agree:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
[Diagram labels: P1 (Producer), C1 (Consumer)]
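One way to picture the eight agreement points is as a checklist compared across the two endpoints. The Python sketch below is a hedged illustration: the key names and the sample producer and consumer values are invented for this example and do not come from any real system.

```python
# The eight agreement points from the slide; the key names are illustrative.
AGREEMENT_POINTS = (
    "protocol", "format", "schema", "priority",
    "event_size", "event_frequency", "authorization", "relevance",
)


def disagreements(producer, consumer):
    """Return the agreement points on which producer and consumer differ."""
    return [p for p in AGREEMENT_POINTS if producer.get(p) != consumer.get(p)]


# Hypothetical endpoint descriptions, invented for this example.
producer = {
    "protocol": "syslog", "format": "text", "schema": "v1", "priority": "high",
    "event_size": "1 KB", "event_frequency": "100/s",
    "authorization": "tls-cert", "relevance": "all events",
}
consumer = dict(producer, protocol="http", format="json")

gaps = disagreements(producer, consumer)
```

Every entry in `gaps` is a conversion or mediation step that something between the producer and the consumer has to perform.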
Using Messaging
Only a subset agree using messaging:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
[Diagram labels: P1, Messaging, C1, CN]
More issues to consider:
• How do you know what the data flow looks like?
• How is it managed?
• How is it working – today, yesterday?
Using an Enterprise Service Bus (ESB)
Still, only a subset agree using an ESB:
1. Protocol
2. Format
3. Schema
4. Priority
5. Size of event
6. Frequency of event
7. Authorization access
8. Relevance
[Diagram labels: P1, Broker, Messaging, C1, CN]
Even more issues to consider:
• Remote procedure calls (RPC) and throughput issues are introduced
• Design and deploy management – slow setup, not interactive
• You can scale out, but not up or down
• You still don’t know what the data flow looks like
Architecture
[Diagram: clustered NiFi deployment]
Master: NiFi Cluster Manager (NCM)
• OS/Host running a JVM
• The JVM hosts a Web Server and the NiFi Cluster Manager – Request Replicator
Slaves: NiFi Nodes (three shown, each with the same layout)
• OS/Host running a JVM
• The JVM hosts a Web Server and a Flow Controller, which runs Processor 1 through Extension N
• Local Storage holds the FlowFile Repository, Content Repository, and Provenance Repository
High Availability: Control plane vs Data plane…
Define A Hortonworks DataFlow
• Easy to use drag and drop UI
• Flexible to define the Data Flow
HDF – Powered by Apache NiFi
Add processor for data intake
1. Drag and drop processor icon from the top menu
Choose the specific processor
2. Choose one of the processors – currently 90 available – designed for extension
Example: Pick Twitter Processor
Configure the processor
3. Select processor and choose option to Configure
4. Adjust parameters as required
Another processor for data output
5. Drag and drop processor icon from the top menu
6. Example: choose PutHDFS processor
Configure second processor
7. Configure 2nd processor
8. Connect processors, configure connection
9. Click Start to begin processing
See processors update with real time changes
10. As data flows, the GUI updates in real time.
Dynamically adjust and tune data flow as needed
11. Dynamically adjust and tune dataflow as needed, in real time. Can also replicate data for testing and comparison.
Understand the data path with Data Provenance
14. Select Data Provenance
Trace lineage of a particular piece of data
15. Icon for Data Lineage
Every change to data is tracked: processing, views
16. Provenance event is tracked
Updates as changes happen
17. Updates as data flows
Easily access and trace changes to dataflow
Audit trail of Hortonworks DataFlow User Actions
NiFi is complementary to Hadoop
Operations: Deployment flexibility from devices to data center. Delivers data flow QoS across dimensions such as: loss tolerant vs. guaranteed delivery, low latency vs. high throughput, and priority-based queuing.
Governance: Starting at the source, captures fine-grained metadata regarding all data received, forked, joined, cloned, modified, sent, and ultimately dropped as data reaches its configured end-state, delivering comprehensive governance (aka provenance, chain of custody).
Security: Secures the data movement from beginning to end. Allows for fine-grained data authorization policies to be enforced at the flow level.
Operations
• Reporting tasks (push)
• Statistics / status (pull)
• Dynamic flow changes
- Push new business rules via REST API (closed loop)
- Pull updates periodically from web services
• Site-to-site
- Stay at the ‘flow level’ rather than suddenly dropping down to file transfer protocols
• Extensible
• Optimized user experience – log hunts should be the exception
Scale down, up, and out – in containers and on virtual machines
The Need for Data Provenance
For Operators
• Traceability, lineage
• Recovery and replay
For Compliance
• Audit trail
For Business
• Value sources
• Value IT investment
[Diagram: LINEAGE from BEGIN to END]
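Lineage tracing can be sketched as a walk over an append-only event log. The Python below is a simplified model and not NiFi's provenance repository: the `ProvenanceEvent` fields, the `lineage` helper, and the sample identifiers are assumptions made for illustration, with event names chosen to echo the received, forked, sent, and dropped stages mentioned in this deck.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str                   # e.g. RECEIVE, FORK, SEND, DROP
    flow_file_id: str
    parent_id: Optional[str] = None   # set when this item was derived from another


def lineage(events, flow_file_id):
    """Follow parent links back to the origin, then return matching events oldest first."""
    parents = {e.flow_file_id: e.parent_id for e in events if e.parent_id}
    chain = [flow_file_id]
    while chain[-1] in parents:
        chain.append(parents[chain[-1]])
    wanted = set(chain)
    return [e for e in events if e.flow_file_id in wanted]


# Hypothetical event log, appended in the order the events happened.
log = [
    ProvenanceEvent("RECEIVE", "a"),
    ProvenanceEvent("FORK", "b", parent_id="a"),
    ProvenanceEvent("RECEIVE", "x"),
    ProvenanceEvent("SEND", "b"),
    ProvenanceEvent("DROP", "b"),
]
trail = [(e.event_type, e.flow_file_id) for e in lineage(log, "b")]
```

The same log serves all three audiences named above: operators replay from a recorded event, compliance reads it as an audit trail, and the business can see which sources actually feed downstream systems.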
Extending Data Governance from the Edge to Hadoop
Data Governance Requirements
Transparent: Governance standards and protocols must be clearly defined and available to all
Reproducible: Recreate the relevant data landscape at a given point in time
Auditable: Trace all relevant events and assets with appropriate historical lineage
Consistent: Compliance practices must be consistent
Hadoop Data Platform: Must snap into existing data governance frameworks and openly exchange metadata
[Diagram labels: Internet of Anything; Traditional Data Systems; SCM; CRM; ERP; ETL / DQ; MDM; ARCHIVE; Holistic Data Governance; Business Analytics; Visualization & Dashboards]
The Need for Fine-grained Security and Compliance
It’s not enough to say you have encrypted communications
• Enterprise authorization services – entitlements change often
• People and systems with different roles require different access levels
• Tagged/classified data
Security
Administration: Central management and consistent security
• NiFi Cluster Manager
Authentication: Authenticate users and systems
• 2-Way SSL support out of the box; additional types coming
Authorization: Provision access to data
• Pluggable authorization designed to fit any Identity and Access Management (IAM) scheme
• File-based authority provider out of the box
• Multi-role
Audit: Maintain a record of data access
• Detailed logging of all user actions
• Detailed logging of key system behaviors
• Data Provenance enables unparalleled tracking from the edge through the Lake
Data Protection: Protect data at rest and in motion
• Support a variety of SSL/encrypted protocols
• Tag and utilize tags on data for fine grained access controls
• Encrypt/decrypt content using pre-shared key mechanisms
Role | Description
Administrator | Configure system threads, user accounts, and flow audit history
Data Flow Manager | Manipulate the dataflow
Read Only | View the dataflow only
NiFi | Assigned to machines that interact with a NiFi instance via Site-to-Site
Proxy | Allows a system to make requests on behalf of a user
Provenance | Query the provenance repository and download content
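A multi-role check of this kind can be sketched as a lookup from roles to permissions. The Python below is a hypothetical illustration, not NiFi's authority provider: the permission names and the `is_allowed` helper are invented for this example, and only the role names and their rough responsibilities are taken from the list above.

```python
# Role names come from the deck; the permission names are invented for illustration.
ROLE_PERMISSIONS = {
    "Administrator": {"configure_threads", "manage_users", "manage_audit_history"},
    "Data Flow Manager": {"view_flow", "modify_flow"},
    "Read Only": {"view_flow"},
    "Provenance": {"query_provenance", "download_content"},
}


def is_allowed(user_roles, permission):
    """A user may act if any one of their roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in user_roles)


# A hypothetical operator who can look at the flow and inspect provenance,
# but cannot change anything.
operator = ["Read Only", "Provenance"]
can_view = is_allowed(operator, "view_flow")
can_modify = is_allowed(operator, "modify_flow")
can_query = is_allowed(operator, "query_provenance")
```

Keeping roles and permissions in a table like this is what makes the scheme pluggable: swapping the dictionary for a lookup against an external identity service would not change the calling code.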
Operations: Planned
Planned Apache NiFi Enhancements
IN PROGRESS: Enhanced configuration management of flows
STARTED: Extension and template registry
TARGETED TO NIFI 0.4.0 RELEASE: First-class Avro support
STARTED: Interactive queue management
STARTED: Multi-tenant data flow
FUTURE: Pluggable authentication
FUTURE: Reference-able process groups
FUTURE: Variable registry
https://cwiki.apache.org/confluence/display/NIFI/NiFi+Feature+Proposals
Try It Yourself
Download NiFi and HDP Sandbox from hortonworks.com/sandbox
Tweet: #hadooproadshow
Thank you!
Mats Johansson
@matsjo66
https://se.linkedin.com/in/matsjo66