Dataflow with Apache NiFi - Crash Course - HS16SJ

Dataflow with Apache NiFi Aldrin Piri - @aldrinpiri Apache NiFi Crash Course Hadoop Summit 2016 – San Jose 29 June 2016

Transcript of Dataflow with Apache NiFi - Crash Course - HS16SJ

Page 1: Dataflow with Apache NiFi - Crash Course - HS16SJ

Dataflow with Apache NiFi
Aldrin Piri - @aldrinpiri
Apache NiFi Crash Course
Hadoop Summit 2016 – San Jose
29 June 2016

Page 2: Dataflow with Apache NiFi - Crash Course - HS16SJ

© Hortonworks Inc. 2011–2016. All Rights Reserved

Key: 'Apache NiFi' Value: 'PMC Member'
Key: 'Work' Value: 'Sr. Member of Technical Staff @ Hortonworks'
Key: 'Working with NiFi Since' Value: '2010'

Page 3: Dataflow with Apache NiFi - Crash Course - HS16SJ

Agenda

• What is dataflow and what are the challenges?
• Apache NiFi
• Architecture
• Live Demo
• Community

Page 5: Dataflow with Apache NiFi - Crash Course - HS16SJ

Let's Connect A to B

Producers, A.K.A. Things
Anything AND Everything

Internet!

Consumers
• User
• Storage
• System
• … More Things

Page 6: Dataflow with Apache NiFi - Crash Course - HS16SJ

Moving data effectively is hard

Standards: http://xkcd.com/927/

Page 7: Dataflow with Apache NiFi - Crash Course - HS16SJ

Why is moving data effectively hard?

→ Standards
→ Formats
→ "Exactly Once" Delivery
→ Protocols
→ Veracity of Information
→ Validity of Information
→ Ensuring Security
→ Overcoming Security
→ Compliance
→ Schemas
→ Consumers Change
→ Credential Management
→ "That [person|team|group]"
→ Network
→ "Exactly Once" Delivery

Page 8: Dataflow with Apache NiFi - Crash Course - HS16SJ

Let's Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Let's consider the needs of a courier service

[Diagram labels: Physical Store; Gateway Server; Mobile Devices; Registers; Server Cluster; Distribution Center; Core Data Center at HQ; Server Cluster; On Delivery Routes; Trucks; Deliverers]

Image credits:
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/

Page 9: Dataflow with Apache NiFi - Crash Course - HS16SJ

Great! I am collecting all this data! Let's use it!
Finding our needles in the haystack

[Diagram labels: Physical Store; Gateway Server; Mobile Devices; Registers; Server Cluster; Distribution Center; Kafka; Core Data Center at HQ; Server Cluster; Others; Storm / Spark / Flink / Apex; Kafka; Storm / Spark / Flink / Apex; On Delivery Routes; Trucks; Deliverers]

Image credits:
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/

Page 10: Dataflow with Apache NiFi - Crash Course - HS16SJ

Why is moving data effectively hard when scoped internally?

→ Standards
→ Formats
→ "Exactly Once" Delivery
→ Protocols
→ Veracity of Information
→ Validity of Information
→ Ensuring Security
→ Overcoming Security
→ Compliance
→ Schemas
→ Consumers Change
→ Credential Management
→ "That [person|team|group]"
→ Network
→ "Exactly Once" Delivery

Page 11: Dataflow with Apache NiFi - Crash Course - HS16SJ

Let's Connect Lots of As to Bs to As to Cs to Bs to Δs to Cs to ϕs
Oh, that courier service is global

Page 12: Dataflow with Apache NiFi - Crash Course - HS16SJ

Why is moving data effectively hard when scoped globally?

→ Standards
→ Formats
→ "Exactly Once" Delivery
→ Protocols
→ Veracity of Information
→ Validity of Information
→ Ensuring Security
→ Overcoming Security
→ Compliance
→ Schemas
→ Consumers Change
→ Credential Management
→ "That [person|team|group]"
→ Network
→ "Exactly Once" Delivery

Page 13: Dataflow with Apache NiFi - Crash Course - HS16SJ

The Unassuming Line: A Case Study
We've seen a few lines show up in the wild thus far

Internet!
Inter- & Intra- connections in our global courier enterprise

Spotlight: Arthur Lacôte, https://thenounproject.com/turo/

Page 14: Dataflow with Apache NiFi - Crash Course - HS16SJ

Dataflow Line Anatomy 101
Let's dissect what this line typically represents

Fig 1. Lineus Worldwidewebus. Common Name: Internet!

[Diagram labels: Script or Application (at each end); Data (on each side); Disparate Transport Mechanisms]

Page 15: Dataflow with Apache NiFi - Crash Course - HS16SJ

Dataflow Line Anatomy 201
Sometimes that transport is just more lines

Fig 1. Lineus Worldwidewebus. Common Name: Internet!

[Diagram labels: Script or Application (at each end); Line Inception; Data (on each side)]

Page 16: Dataflow with Apache NiFi - Crash Course - HS16SJ

Dataflow Line Anatomy 301
But those lines could also have components…

Fig 1. Lineus Worldwidewebus. Common Name: Internet!
Fig 2. Good Recursion Joke

NoSuchJokeException
footage not found

Page 17: Dataflow with Apache NiFi - Crash Course - HS16SJ

Agenda

• What is dataflow and what are the challenges?
• Apache NiFi
• Architecture
• Live Demo
• Community

Page 18: Dataflow with Apache NiFi - Crash Course - HS16SJ

Apache NiFi Key Features

• Guaranteed delivery
• Data buffering
  - Back pressure
  - Pressure release
• Prioritized queuing
• Flow-specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Supports push and pull models
• Recovery / recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable / multi-role security
• Designed for extension
• Clustering
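
Two of the features listed above, back pressure and prioritized queuing, can be illustrated with a small toy queue. This is a sketch in Python under stated assumptions, not NiFi's implementation: the class name `Connection`, the `backpressure_threshold` parameter, and the convention that a lower number means higher priority are all invented for this illustration.

```python
import heapq
import itertools


class Connection:
    """Toy queue between two processing steps (hypothetical, not NiFi's API)."""

    def __init__(self, backpressure_threshold):
        self.backpressure_threshold = backpressure_threshold
        self._heap = []
        # Tie-breaker so items with equal priority leave in arrival order.
        self._order = itertools.count()

    def is_full(self):
        return len(self._heap) >= self.backpressure_threshold

    def offer(self, item, priority):
        """Enqueue the item, or return False so the producer knows to back off."""
        if self.is_full():
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Return the most urgent item (lowest priority number), or None if empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = Connection(backpressure_threshold=2)
accepted = [
    queue.offer("routine scan", priority=5),
    queue.offer("truck breakdown", priority=1),
    queue.offer("another scan", priority=5),  # refused: threshold reached
]
first_out = queue.poll()  # the urgent item leaves first
```

Once the consumer drains an item, the producer's next `offer` succeeds again, which is the essence of back pressure: the slow side sets the pace instead of the queue growing without bound.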

Page 19: Dataflow with Apache NiFi - Crash Course - HS16SJ

Apache NiFi Subproject: MiNiFi

→ Let me get the key parts of NiFi close to where data begins and provide bidirectional communication
→ NiFi lives in the data center. Give it an enterprise server or a cluster of them.
→ MiNiFi lives as close as possible to where data is born and is a guest on that device or system

Page 20: Dataflow with Apache NiFi - Crash Course - HS16SJ

Let's revisit our courier service from the perspective of NiFi

[Diagram labels: Physical Store; Gateway Server; Mobile Devices; Registers; Server Cluster; Distribution Center; Kafka; Core Data Center at HQ; Server Cluster; Others; Storm / Spark / Flink / Apex; Kafka; Storm / Spark / Flink / Apex; On Delivery Routes; Trucks; Deliverers. Overlaid on the systems: Client Libraries (three places), MiNiFi (two places), NiFi (six places)]

Image credits:
Delivery Truck: Creative Stall, https://thenounproject.com/creativestall/
Deliverer: Rigo Peter, https://thenounproject.com/rigo/
Cash Register: Sergey Patutin, https://thenounproject.com/bdesign.by/
Hand Scanner: Eric Pearson, https://thenounproject.com/epearson001/

Page 21: Dataflow with Apache NiFi - Crash Course - HS16SJ

Apache NiFi Managed Dataflow

[Diagram columns: SOURCES; REGIONAL INFRASTRUCTURE; CORE INFRASTRUCTURE]

Page 22: Dataflow with Apache NiFi - Crash Course - HS16SJ

NiFi is based on Flow Based Programming (FBP)

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
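
To make the mapping in the table above concrete, here is a minimal Python sketch of the same roles. It is a toy model, not NiFi code: the functions `uppercase` and `tag_size`, the `run_once` method, and the single-threaded scheduling are assumptions made for illustration, and the queues here are unbounded for brevity.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """FBP 'information packet': attributes plus content."""
    attributes: dict = field(default_factory=dict)
    content: bytes = b""


def uppercase(flowfile):
    """FBP 'black box': a processor that transforms content."""
    return FlowFile(dict(flowfile.attributes), flowfile.content.upper())


def tag_size(flowfile):
    """Another processor: adds an attribute, leaves content untouched."""
    attributes = dict(flowfile.attributes, fileSize=str(len(flowfile.content)))
    return FlowFile(attributes, flowfile.content)


class FlowController:
    """FBP 'scheduler': knows how processors are connected and runs them."""

    def __init__(self, processors):
        self.processors = processors
        # One connection (queue) in front of each processor, plus an output queue.
        self.connections = [deque() for _ in range(len(processors) + 1)]

    def submit(self, flowfile):
        self.connections[0].append(flowfile)

    def run_once(self):
        """Give each processor one turn to handle one queued FlowFile."""
        for index, processor in enumerate(self.processors):
            inbound = self.connections[index]
            if inbound:
                self.connections[index + 1].append(processor(inbound.popleft()))

    @property
    def output(self):
        return self.connections[-1]


controller = FlowController([uppercase, tag_size])
controller.submit(FlowFile({"filename": "a.txt"}, b"hello"))
controller.run_once()
result = controller.output[0]
```

Wrapping a `FlowController` with its processors behind a single callable would play the part of a process group: a new component built purely by composing existing ones.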

Page 23: Dataflow with Apache NiFi - Crash Course - HS16SJ

FlowFiles & Data Agnosticism

→ NiFi is data agnostic!
→ But, NiFi was designed understanding that users can care about specifics and provides tooling to interact with specific formats, protocols, etc.

ISO 8601 - http://xkcd.com/1179/

Robustness principle
"Be conservative in what you do, be liberal in what you accept from others"
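
As one small illustration of the robustness principle applied to dates, the sketch below accepts several input spellings and always emits ISO 8601. It is an assumption-laden example, not something from the talk or from NiFi: the function name `to_iso8601` and the list of accepted formats are invented here, and slash-separated dates are assumed to be month/day/year.

```python
from datetime import datetime

# Formats we are willing to accept (liberal input). Hypothetical list.
ACCEPTED_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d %B %Y", "%Y%m%d"]


def to_iso8601(text):
    """Parse a date in any accepted format and return it as YYYY-MM-DD."""
    cleaned = text.strip()
    for fmt in ACCEPTED_FORMATS:
        try:
            # Conservative output: always the single ISO 8601 form.
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


normalized = [to_iso8601(s) for s in ["29 June 2016", "06/29/2016", "20160629"]]
```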

Page 24: Dataflow with Apache NiFi - Crash Course - HS16SJ

FlowFiles are like HTTP data

HTTP Data

Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html

Content:
Hello world!

FlowFile

Header:
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
FlowFile Attribute Map Content
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'

Content:
Binary Content*
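
The analogy above can be sketched in a few lines of Python: split an HTTP response into header and body, then carry the headers as attributes and the body as opaque content. This is only an illustration of the analogy. The function `split_http_message`, the shortened sample response, and the dictionary layout of the FlowFile-like object are assumptions, not NiFi's data model or API.

```python
def split_http_message(raw):
    """Split a raw HTTP response into (status line, header dict, body bytes)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return status_line, headers, body


raw_response = (
    b"HTTP/1.1 200 OK\r\n"
    b"Connection: close\r\n"
    b"Content-Type: text/html\r\n"
    b"\r\n"
    b"Hello world!"
)

status_line, headers, body = split_http_message(raw_response)

# FlowFile-like object: metadata travels as string attributes,
# while the content stays an uninterpreted run of bytes.
flowfile = {
    "attributes": {**headers, "fileSize": str(len(body))},
    "content": body,
}
```

Keeping the content as raw bytes is what lets the rest of a flow stay data agnostic: routing decisions can be made from attributes alone.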

Page 25: Dataflow with Apache NiFi - Crash Course - HS16SJ

Agenda

• What is dataflow and what are the challenges?
• Apache NiFi
• Architecture
• Live Demo
• Community

Page 26: Dataflow with Apache NiFi - Crash Course - HS16SJ

Extension / Integration Points

NiFi Term | Description
FlowFile Processor | Push/Pull behavior. Custom UI
Reporting Task | Used to push data from NiFi to some external service (metrics, provenance, etc.)
Controller Service | Used to enable reusable components / shared services throughout the flow
REST API | Allows clients to connect to pull information, change behavior, etc.

Page 27: Dataflow with Apache NiFi - Crash Course - HS16SJ

Architecture*

[Diagram: a NiFi cluster.
Master: NiFi Cluster Manager (NCM). OS/Host running a JVM that contains a Web Server and the NiFi Cluster Manager – Request Replicator.
Slaves: NiFi Nodes (three shown). Each is an OS/Host running a JVM that contains a Web Server and a Flow Controller (Processor 1 … Extension N), with a FlowFile Repository, Content Repository, and Provenance Repository on Local Storage.]

Page 28: Dataflow with Apache NiFi - Crash Course - HS16SJ

NiFi Architecture – Repositories - Pass by reference

Excerpt of demo flow… What's happening inside the repositories…

BEFORE
FlowFile: F1 → C1
Content: C1
Provenance: P1 → F1

AFTER
FlowFile: F2 → C1; F1 → C1
Content: C1
Provenance: P3 → F2 – Clone (F1); P2 → F1 – Route; P1 → F1 – Create
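
The before/after state on this slide can be mimicked with three small in-memory structures. The sketch below is a toy model in Python, with names such as `Repositories`, `create`, `route`, and `clone` chosen for illustration only. It shows the point of pass by reference: routing and cloning touch pointers and provenance, never the content itself.

```python
class Repositories:
    """Toy model of the three repositories (hypothetical, not NiFi's classes)."""

    def __init__(self):
        self.content = {}     # claim id -> bytes
        self.flowfiles = {}   # FlowFile id -> claim id (a pointer, not the bytes)
        self.provenance = []  # append-only list of (event type, FlowFile id)

    def create(self, flowfile_id, data):
        claim_id = f"C{len(self.content) + 1}"
        self.content[claim_id] = data
        self.flowfiles[flowfile_id] = claim_id
        self.provenance.append(("CREATE", flowfile_id))

    def route(self, flowfile_id):
        # Routing moves the FlowFile between queues; content is not touched.
        self.provenance.append(("ROUTE", flowfile_id))

    def clone(self, source_id, new_id):
        # The clone points at the same content claim as its source.
        self.flowfiles[new_id] = self.flowfiles[source_id]
        self.provenance.append(("CLONE", new_id))


repos = Repositories()
repos.create("F1", b"payload")
repos.route("F1")
repos.clone("F1", "F2")
```

After these three calls there are two FlowFiles and three provenance events, but still only one copy of the content, matching the AFTER column.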

Page 29: Dataflow with Apache NiFi - Crash Course - HS16SJ

NiFi Architecture – Repositories – Copy on Write

Excerpt of demo flow… What's happening inside the repositories…

BEFORE
FlowFile: F1 → C1
Content: C1
Provenance: P1 → F1 - CREATE

AFTER
FlowFile: F1 → C1; F1.1 → C2
Content: C2 (encrypted); C1 (plaintext)
Provenance: P2 → F1.1 - MODIFY; P1 → F1 - CREATE
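
Copy on write can be sketched the same way. In the toy Python model below, modifying a FlowFile writes a new content claim and leaves the original in place. The names `modify` and `toy_encrypt` are invented for this illustration, and the XOR transform is only a stand-in for the encryption step shown on the slide; it is not real encryption.

```python
content_repo = {"C1": b"plaintext payload"}   # claim id -> bytes
flowfile_repo = {"F1": "C1"}                  # FlowFile id -> claim id
provenance_repo = [("CREATE", "F1")]          # append-only event list


def toy_encrypt(data):
    """Placeholder transform (XOR with a constant). NOT real encryption."""
    return bytes(b ^ 0x5A for b in data)


def modify(flowfile_id, new_version_id, transform):
    """Write transformed content to a new claim; never overwrite the old one."""
    old_claim = flowfile_repo[flowfile_id]
    new_claim = f"C{len(content_repo) + 1}"
    content_repo[new_claim] = transform(content_repo[old_claim])
    flowfile_repo[new_version_id] = new_claim
    provenance_repo.append(("MODIFY", new_version_id))


modify("F1", "F1.1", toy_encrypt)
```

Because C1 survives the modification, the earlier state of the data remains available alongside the new version, which is what makes it possible to look back along a FlowFile's history.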

Page 30: Dataflow with Apache NiFi - Crash Course - HS16SJ

Agenda

• What is dataflow and what are the challenges?
• Apache NiFi
• Architecture
• Demo
• Community

Page 31: Dataflow with Apache NiFi - Crash Course - HS16SJ

Learn, Share at Birds of a Feather
Streaming, DataFlow & Cybersecurity

Thursday June 30, 6:30 pm, Ballroom C

Page 32: Dataflow with Apache NiFi - Crash Course - HS16SJ

Why NiFi?

→ Moving data is multifaceted in its challenges and these are present in different contexts at varying scopes
  - Think of our courier example and organizations like it: inter vs intra, domestically, internationally
→ Provide common tooling and extensions that are commonly needed but be flexible for extension
  - Leverage existing libraries and expansive Java ecosystem for functionality
  - Allow organizations to integrate with their existing infrastructure
→ Empower folks managing your infrastructure to make changes and reason about issues that are occurring
  - Data Provenance to show context and data's journey
  - User Interface/Experience a key component

Page 33: Dataflow with Apache NiFi - Crash Course - HS16SJ

Learn more and join us!

Apache NiFi site: http://nifi.apache.org
Subproject MiNiFi site: http://nifi.apache.org/minifi/
Subscribe to and collaborate at: [email protected]@nifi.apache.org
Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI
Follow us on Twitter: @apachenifi

Page 34: Dataflow with Apache NiFi - Crash Course - HS16SJ

Our Lab for Today

→ We will be exploring some examples to work through creating a dataflow with Apache NiFi
→ Use Case: An urban planning board is evaluating the need for a new highway, dependent on current traffic patterns, particularly as other roadwork initiatives are under way. Integrating live data poses a problem because traffic analysis has traditionally been done using historical, aggregated traffic counts. To improve traffic analysis, the city planner wants to leverage real-time data to get a deeper understanding of traffic patterns. NiFi was selected for this real-time data integration.
→ Labs are available at http://tinyurl.com/nificrashcourse

Page 35: Dataflow with Apache NiFi - Crash Course - HS16SJ

Thank You