Back-Pressure in Action: Handling High-Burst Workloads with Akka Streams & Kafka

Posted 07-Jan-2017 | Category: Software

Back-Pressure in Action
Handling High-Burst Workloads with Akka Streams & Kafka

Akara Sucharitakul, PayPal
Anil Gursel, PayPal

Intro & Agenda
- Crawler Intro & Problem Statements
- Crawler Architecture
- Infrastructure: Akka Streams, Kafka, etc.
- The Goodies

High-Level View
[Diagram: Crawl Jobs and a Job DB feed a pipeline of Validate, URL Cache, Download, and Process stages; URLs and timestamps flow between the stages and the Job DB]

Requirements
- Ever-expanding # of URLs
- Can't crawl all URLs at once
- Control over concurrent web GETs
- Efficient resource usage
- Resilient under high bursts
- Scales horizontally & vertically

Sizing the Crawl Job
Let:
- i = number of crawl URLs in a job
- n = average number of links per page
- d = the crawl depth (how many layers to follow links)
- u = the max number of URLs to process
Then:
- u = i × n^d
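Making the formula concrete: a minimal plain-Scala sketch (the helper name `maxUrls` is ours, not from the talk) showing that the burst parameters quoted later in the deck -- 100 initial URLs, an average of 5 out-links per page, depth 5 -- give exactly the 312,500 totalCrawled reported on the back-pressure slide:

```scala
// Maximum number of URLs to process in a crawl job: u = i * n^d
//   i = number of crawl URLs in the job
//   n = average number of links per page
//   d = crawl depth (how many layers of links to follow)
def maxUrls(i: Long, n: Long, d: Int): Long = {
  require(d >= 0, "depth cannot be negative")
  i * math.pow(n.toDouble, d.toDouble).toLong
}

// 100 seed URLs, ~5 out-links on average (midpoint of 0-10), depth 5
val totalCrawled = maxUrls(100, 5, 5) // 312500
```

The exponential in d is why bounded concurrency matters: a hundred seed URLs legitimately become hundreds of thousands of downloads within a few link levels.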

The Reactive Manifesto
- Responsive
- Message Driven
- Elastic
- Resilient


Why Does It Matter?
- Responsive: responds in a deterministic, timely manner
- Resilient: stays responsive in the face of failure, even cascading failures
- Elastic: stays responsive under workload spikes
- Message Driven: the basic building block for responsive, resilient, and elastic systems

The Right Ingredients
- Kafka
  - Huge persistent buffer for the bursts
  - Load distribution to a very large number of processing nodes
  - Enables horizontal scalability
- Akka Streams
  - High-performance, highly efficient processing pipeline
  - Resilient, with end-to-end back-pressure
  - Fully asynchronous: uses mapAsyncUnordered with an async HTTP client
- Async HTTP client
  - Non-blocking; consumes no threads while waiting
  - Integrates with Akka Streams for a high-parallelism, low-resource solution
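mapAsyncUnordered is the Akka Streams operator doing the heavy lifting here: it caps in-flight requests at a fixed parallelism and emits results in completion order, so one slow URL never holds up fast ones. A framework-free sketch of that contract (plain scala.concurrent; the function name and simulated fetch are ours, not the talk's code):

```scala
import java.util.concurrent.Semaphore
import scala.collection.mutable.ListBuffer
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

implicit val ec: ExecutionContext = ExecutionContext.global

// At most `parallelism` futures in flight; results gathered as they
// complete, not in input order -- the mapAsyncUnordered contract.
def boundedUnordered[A, B](parallelism: Int)(in: Seq[A])(f: A => Future[B]): Future[Seq[B]] = {
  val permits   = new Semaphore(parallelism)   // caps concurrent calls
  val completed = ListBuffer.empty[B]          // filled in completion order
  val futures = in.map { a =>
    permits.acquire()                          // back-pressure: wait for a free slot
    f(a).map { b =>
      completed.synchronized { completed += b }
      permits.release()
      b
    }
  }
  Future.sequence(futures).map(_ => completed.synchronized(completed.toList))
}

// Simulated fetch: higher inputs finish faster, so they surface first.
val results = Await.result(
  boundedUnordered(4)(1 to 8)(i => Future { Thread.sleep((9 - i) * 5L); i * 2 }),
  30.seconds)
```

In Akka Streams itself, the equivalent stage is the built-in `mapAsyncUnordered(parallelism)(fetch)`, which with a non-blocking HTTP client consumes no threads while requests are in flight.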

Adding Kafka & Akka Streams
[Diagram: the same pipeline (Crawl Jobs, Job DB, Validate, URL Cache, Download, Process), now with Kafka carrying the URLs between stages and the processing stages running as Akka Streams]

Akka Streams, what???
- High-performance, pure async stream processing
- Conforms to Reactive Streams
- Simple yet powerful: GraphDSL allows clear stream topology declaration
- Central point to understand the processing pipeline

Crawl Stream: Actual Stream Declaration in Code

prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
                                                       bCast ~> dataSinkFlow ~> kafkaDataSink
                                                       bCast ~> hdfsDataSink
                                                       bCast ~> graphFlow ~> merge ~> graphSink
                                   bCast0 ~> maxPage ~> merge
                                   bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
                                                      bCastRetry ~> errorSink

[Diagram of the stream topology: a PrioritizedSource fans out into CrawlResult, MaxPageReached, and Retry branches; CheckFail and CheckErr guard the retry path; OutLinks, Data, and Graph outputs land in the OutLinksSink, Kafka DataSink, HDFS DataSink, GraphSink, and ErrorSink]

Resulting Characteristics
- Efficient
  - Low thread count, controlled by Akka and pure non-blocking async HTTP
  - High-latency URLs do not block low-latency URLs, using mapAsyncUnordered
  - Well-controlled download concurrency using mapAsyncUnordered
  - Thread per concurrent crawl job
- Resilient
  - Processes only what can be processed: no resource overload
  - Kafka as a short-term, persistent queue
- Scale
  - Kafka feeds the next batch of URLs to the available node cluster
  - Pull model: only processors that have capacity will get the load
  - Kafka distributes work to a large number of processing nodes in the cluster

Back-Pressure
- initialURLs: 100
- parallelism: 1000
- processTime: 1 - 5 s
- outLinks: 0 - 10
- depth: 5
- totalCrawled: 312,500
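These numbers tie back to the sizing slide: with the 0-10 out-links averaging 5, 100 × 5^5 = 312,500. What lets parallelism stay pinned at 1000 while 312,500 URLs flow through is back-pressure: a full buffer makes the producer wait instead of queueing unboundedly. A framework-free sketch of that idea with a bounded queue (Akka Streams signals demand instead of blocking threads, and Kafka absorbs bursts on disk; all names here are ours):

```scala
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.atomic.AtomicInteger

// A bounded buffer: put() blocks when the buffer is full, throttling a
// fast producer to the consumer's pace -- the essence of back-pressure.
val buffer   = new ArrayBlockingQueue[Int](10)  // burst budget: 10 items
val produced = new AtomicInteger(0)

val producer = new Thread(() => {
  (1 to 1000).foreach { i =>
    buffer.put(i)            // blocks while 10 items are already queued
    produced.set(i)
  }
})
producer.start()

var consumed = 0
var maxLead  = 0
while (consumed < 1000) {
  buffer.take()              // the consumer sets the pace
  consumed += 1
  maxLead = maxLead max (produced.get - consumed)
}
producer.join()
// The producer never ran unboundedly ahead; its lead stayed near capacity.
assert(maxLead <= 11)
```

Kafka plays the same role at a different scale: its "buffer" is a persistent, partitioned log, so a burst that outruns the consumers lands on disk rather than in heap, and consumers pull at their own pace.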

Challenges
- Training
  - Developers not used to E2E stream definitions
  - More familiar with deeply nested function calls
- Maturity of infrastructure
  - Kafka 0.9 uses fetch as the heartbeat
  - Slow nodes cause timeouts & rebalances
  - Solved in 0.10

What It Would Have Been (without this architecture)
- Bloated, ineffective concurrency control
- Lack of a well-thought-out and visible processing pipeline
- Clumsy code, hard to manage & understand
- Low training cost, but high project TCO (dev / support / maintenance)

Bottom Line
Crawl time reduced to 1/10th (compared to the thread-based architecture)

Standardized Reactive Platform for Large-Scale Internet Deployments


Efficiency & Resilience Meets Standardization
- Monitoring: need to collect metrics, consistently
- Logging: correlation across services; uniformity in logs
- Security: need to apply standard security configuration
- Environment resolution: staging, production, etc.
- Consistency in the face of heterogeneity

squbs is not
- A framework of its own
- A programming model: use Akka's
- Take-all-or-none: components/patterns can mostly be used independently

squbs: Akka for large-scale deployments
- Bootstrap
- Lifecycle management
- Loosely-coupled module system
- Integration hooks for logging, monitoring, and ops integration

squbs: Akka for large-scale deployments (continued)
- JSON console
- HttpClient with pluggable resolver and monitoring/logging hooks
- Test tools and interfaces
- Goodies:
  - Activators for Scala & Java
  - Programming patterns and helpers for Akka and Akka Streams use cases, and growing

PerpetualStream
- A non-stop stream that starts and stops with the system
- A convenience trait to help write streams controlled by the system lifecycle, with minimal or no message loss
- Register the PerpetualStream to make the stream start/stop with the system
- Customization hooks, especially for how to stop the stream
- Provides a killSwitch (from Akka) to be embedded into the stream
- Implementers just provide their stream!

class MyStream extends PerpetualStream[Future[Done]] {

  def generator = Iterator.iterate(0) { p => if (p == Int.MaxValue) 0 else p + 1 }

  val source = Source.fromIterator(generator _)
  val ignoreSink = Sink.ignore

  override def streamGraph = RunnableGraph.fromGraph(
    GraphDSL.create(ignoreSink) { implicit builder => sink =>
      import GraphDSL.Implicits._
      source ~> killSwitch.flow[Int] ~> sink
      ClosedShape
    })
}

PersistentBuffer / BroadcastBuffer
- Data & indexes in rotating memory-mapped files
- Off-heap rotating file buffer: allows very large buffers
- Restarts gracefully with no or minimal message loss
- Not as durable as a remote data store, but much faster
- Does not back-pressure upstream beyond data/index writes
- Similar usage to Buffer and Broadcast
- BroadcastBuffer is a FanOutShape: it decouples each output port, making each downstream independent
  - Useful if a downstream stage is blocked or unavailable, e.g. Kafka is unavailable/rebalancing but the system cannot back-pressure/deny incoming traffic
- Optional commit stage for at-least-once delivery semantics
- Implementation based on Chronicle Queue

A buffer of virtually unlimited size

Summary
- Kafka + Akka Streams + Async I/O = ideal architecture for high bursts & high efficiency
- Akka Streams: clear view of the stream topology
- Back-pressure & Kafka allow buffering of load bursts
- Standardization: walk like a duck, quack like a duck, and manage it like a duck
- squbs: have the cake and eat it too, with goodies like PerpetualStream, PersistentBuffer, and BroadcastBuffer


Q&A: Feedback Appreciated
Join us via the link from https://github.com/paypal/squbs

@squbs, @S_Akara, @anilgursel