Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu


Jeremy Beard | Senior Solutions Architect, Cloudera
May 2016

Myself

Jeremy Beard
Senior Solutions Architect at Cloudera
3.5 years at Cloudera
6 years data warehousing before that
jeremy@cloudera.com

Agenda

What do we mean by near-real-time analytics?
Which components can we use from the Cloudera stack?
How do these components fit together?
How do we implement the Spark Streaming to Kudu path?
What if I don't want to write all that code?

Defining near-real-time analytics (for this talk)

Ability to analyze events happening right now in the real world
And in the context of all the history that has gone before it

By "near" we mean human scale (seconds), not machine scale (ns/µs)
Closer to real time is possible in CDH, but requires more custom development

SQL is the lingua franca of analytics
Millions of people know it, or the tools that run on it
You say what you want to get, not how you want to get it

Components from the Cloudera stack

Four components come together to make this possible:
Apache Kafka
Apache Spark
Apache Kudu (incubating)
Apache Impala (incubating)

First we'll discuss what they are, then how they fit together

Apache Kafka

Publish-subscribe system
Publish messages into topics
Subscribe to messages arriving in topics

Very high throughput
Very low latency
Distributed for fault tolerance and scale
Supported by Cloudera

Apache Spark

Modern distributed data processing engine
Makes heavy use of memory for speed
Rich and intuitive API

Spark Streaming

Module for running a continuous loop of Spark transformations
Each iteration is a micro-batch, usually in the single-digit seconds

Supported by Cloudera (with some exceptions for experimental features)
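As a minimal sketch of that micro-batch loop (assuming Spark 1.x APIs; the application name and the 5-second batch interval are illustrative, not from the talk):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("nrt-pipeline")

// Each micro-batch covers 5 seconds of arriving data
val ssc = new StreamingContext(conf, Seconds(5))

// ... DStream transformations are defined here, once, up front ...

ssc.start()            // begin the continuous micro-batch loop
ssc.awaitTermination() // run until explicitly stopped
```

Everything defined on the DStream before start() is then re-executed once per micro-batch.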

Apache Kudu (incubating)

New open-source columnar storage layer
Data model of tables with finite, typed columns
Very fast random I/O
Very fast scans
Developed from scratch in C++
Client APIs for C++, Java, and Python

First developed at Cloudera, now at the Apache Software Foundation
Currently in beta; not yet supported by Cloudera and not production ready

Apache Impala (incubating)

Open-source SQL query engine
Built for one purpose: really fast analytics SQL
High concurrency
Queries data stored in HDFS, HBase, and now Kudu
Standard JDBC/ODBC interface for SQL editors and BI tools
Uses JIT query compilation and modern CPU instructions

First developed at Cloudera, now at the Apache Software Foundation
Fully supported by Cloudera and in production at many of our customers

Near-real-time analytics on the Cloudera stack


Implementing Spark Streaming to Kudu

We define what we want Spark to do each micro-batch
Spark then takes care of running the micro-batches for us

We have limited time to process a micro-batch
Storage lookups must be key lookups or very short scans

A lot of repetitive boilerplate code to get up and running

Typical stages of a Spark Streaming to Kudu pipeline

Sourcing from a queue of data
Translating into a structured format
Deriving the storage records
Planning how to update the storage layer
Applying the planned mutations to the storage layer

Queue sourcing

Each micro-batch we first have to bring in data to process
This is near-real-time, so we expect a queue of messages waiting to be processed

Kafka fits this requirement very well
Native no-data-loss integration with Spark Streaming
Partitioned topics automatically parallelize across Spark executors
Fault recovery is simple because Kafka does not drop consumed messages

In Spark Streaming this is the creation of a DStream object
For Kafka, use KafkaUtils.createDirectStream()
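A minimal sketch of the direct stream creation (Spark 1.x Kafka integration; the broker address and topic name are placeholders):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Direct stream: one Spark partition per Kafka partition, no receiver needed
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)
// messages is a DStream of (key, value) pairs, ready for the translation stage
```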

Translation

Arriving messages could be in any format (XML, CSV, binary, proprietary, etc.)

We need them in a common structured record format to transform them effectively

When messages arrive, translate them first
Avro's GenericRecord is a widely adopted in-memory record format

In a Spark Streaming job, use DStream.map() to define the translation
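As an illustrative sketch, assuming a simple two-field CSV feed (the schema and field names are invented for this example):

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

// Hypothetical schema for an "id,amount" CSV feed
val schemaJson =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "string"},
    |  {"name": "amount", "type": "double"}]}""".stripMargin

val records = messages.map { case (_, csv) =>
  // Avro Schema objects are not serializable in older versions,
  // so parse inside the closure (cache per executor in practice)
  val schema = new Schema.Parser().parse(schemaJson)
  val fields = csv.split(",")
  val record: GenericRecord = new GenericData.Record(schema)
  record.put("id", fields(0))
  record.put("amount", fields(1).toDouble)
  record
}
```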

Derivation

We need to create the records that we want to write to the storage layer
These are often not identical to the arriving records

Derive the storage records from the arriving records
Spark SQL can define the transformation, but much more plumbing code is required

May also require deriving from existing records in the storage layer
Enrichment using reference data is a common example
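A sketch of what this stage might look like with Spark 1.x SQL; the Event case class, the reference_data table, and the column names are assumptions for illustration:

```scala
import org.apache.spark.sql.SQLContext

case class Event(id: String, amount: Double)

records.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Convert the translated GenericRecords of this micro-batch to a DataFrame
  val arriving = rdd.map(r =>
    Event(r.get("id").toString, r.get("amount").asInstanceOf[Double])).toDF()
  arriving.registerTempTable("arriving")

  // reference_data is assumed to already be registered, e.g. from a Kudu scan
  val derived = sqlContext.sql(
    """SELECT a.id, a.amount, r.customer_name
      |FROM arriving a
      |JOIN reference_data r ON a.id = r.id""".stripMargin)

  // ... hand `derived` to the planning stage ...
}
```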

Planning

With derived storage records in hand, we need to plan the storage mutations

When existing records are never updated it is straightforward
Just plan inserts

When updates for a key can occur it is a bit harder
Plan an insert if the key does not exist, plan an update if it does

When all versions of a key are kept it can be a lot more complicated
Insert the arriving record, update metadata on existing records (e.g. end date)
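An illustrative sketch of the middle case (upserts); this is hand-rolled planning logic for the hypothetical Event records above, not Envelope's actual planner:

```scala
// Planned mutation types for a storage record
sealed trait Mutation
case class PlannedInsert(record: Event) extends Mutation
case class PlannedUpdate(record: Event) extends Mutation

// Decide the mutation per record, given the keys already present in storage
// (existingKeys would come from key lookups against Kudu for this micro-batch)
def plan(record: Event, existingKeys: Set[String]): Mutation =
  if (existingKeys.contains(record.id)) PlannedUpdate(record)
  else PlannedInsert(record)
```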


Storing

With the planned mutations for the micro-batch in hand, we apply them to the storage layer

For Kudu this requires using the Kudu client Java API
Applied mutations are immediately visible to Impala users
Use RDD.foreachPartition() so that you can open one Kudu connection per JVM
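A hedged sketch of the apply step using the Kudu Java client from Scala (the master address, table, and column names are illustrative; early betas shipped the client as org.kududb.client rather than org.apache.kudu.client):

```scala
import org.apache.kudu.client.KuduClient

// Assumes derivedRdd is an RDD[Event] produced by the planning stage
derivedRdd.foreachPartition { partition =>
  // One client (and connection) per partition, reused for the whole batch
  val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
  try {
    val table = client.openTable("events")
    val session = client.newSession()
    partition.foreach { event =>
      val insert = table.newInsert()
      insert.getRow.addString("id", event.id)
      insert.getRow.addDouble("amount", event.amount)
      session.apply(insert)
    }
    session.flush() // applied mutations become visible to Impala
    session.close()
  } finally {
    client.close()
  }
}
```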

Alternatively, write to Solr, which can be a good option where SQL is not required
Alternatively, write to HBase, but its storage is too slow for analytics queries
Alternatively, write to HDFS, but its storage does not support updates or deletes

Performance considerations

Repartition the arriving records across all the cores of the Spark job
If using Spark SQL, lower the number of shuffle partitions from the default 200
Use Spark Streaming backpressure to optimize micro-batch size
If using Kafka, also use spark.streaming.kafka.maxRatePerPartition
Experiment with micro-batch lengths to balance latency vs. throughput
Ensure storage lookup predicates are at least by key, or face full table scans
Avoid connecting and disconnecting from storage every micro-batch
The singleton pattern can help to keep one connection per JVM
Avoid instantiating objects for each record where they could be reused
Batch mutations for higher throughput
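Several of these considerations map directly to job configuration; an illustrative sketch with arbitrary starting values, to be tuned per workload:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("nrt-pipeline")
  // Let Spark adapt ingest rates to observed processing capacity
  .set("spark.streaming.backpressure.enabled", "true")
  // Cap records read per Kafka partition per second
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  // Lower Spark SQL shuffle parallelism from the default 200
  .set("spark.sql.shuffle.partitions", "24")
```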

New on Cloudera Labs: Envelope

A pre-developed Spark Streaming application that implements these stages

Pipelines are defined as simple configuration using a properties file
Custom implementations of stages can be referenced in the configuration

Available on Cloudera Labs (cloudera.com/labs)
Not supported by Cloudera; not production ready


Envelope built-in functionality

Queue source for Kafka
Translators for delimited text, key-value pairs, and binary Avro
Lookup of existing storage records
Deriver for Spark SQL transformations
Planners for appends, upserts, and history tracking
Storage system for Kudu
Support for many of the described performance considerations

All stage implementations are also pluggable with user-provided classes

Example pipeline: Traffic

Example pipeline: FIX

Thank you

jeremy@cloudera.com
