Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu


Transcript of Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Page 1: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu
Jeremy Beard | Senior Solutions Architect, Cloudera
May 2016

Page 2: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

About me

• Jeremy Beard
• Senior Solutions Architect at Cloudera
• 3.5 years at Cloudera
• 6 years data warehousing before that
• [email protected]

Page 3: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Agenda

• What do we mean by near-real-time analytics?
• Which components can we use from the Cloudera stack?
• How do these components fit together?
• How do we implement the Spark Streaming to Kudu path?
• What if I don’t want to write all that code?

Page 4: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Defining near-real-time analytics (for this talk)

• Ability to analyze events happening right now in the real world
• And in the context of all the history that came before them

• By “near” we mean human scale (seconds), not machine scale (ns/µs)
• Closer to real time is possible in CDH, but requires more custom development

• SQL is the lingua franca of analytics
• Millions of people know it, or the tools that run on it
• Say what you want to get, not how to get it

Page 5: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Components from the Cloudera stack

• Four components come together to make this possible:
• Apache Kafka
• Apache Spark
• Apache Kudu (incubating)
• Apache Impala (incubating)

• First we’ll discuss what they are, then how they fit together

Page 6: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Apache Kafka

• Publish-subscribe system
• Publish messages into topics
• Subscribe to messages arriving in topics

• Very high throughput
• Very low latency
• Distributed for fault tolerance and scale
• Supported by Cloudera

Page 7: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Apache Spark

• Modern distributed data processing engine
• Heavy use of memory for speed
• Rich and intuitive API

• Spark Streaming
• Module for running a continuous loop of Spark transformations
• Each iteration is a micro-batch, usually single-digit seconds long (see the sketch below)

• Supported by Cloudera (with some exceptions for experimental features)
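
A minimal sketch of that micro-batch loop in Scala, assuming a 5-second batch interval; the application name and interval are illustrative, not from the talk:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// One StreamingContext drives the continuous loop of micro-batches
val conf = new SparkConf().setAppName("NRTAnalytics")
val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches

// ... DStream transformations are defined here, before the loop starts ...

ssc.start()            // begin running micro-batches
ssc.awaitTermination() // block until the job is stopped
```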

Page 8: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Apache Kudu (incubating)

• New open-source columnar storage layer
• Data model of tables with finite typed columns
• Very fast random I/O
• Very fast scans
• Developed from scratch in C++
• Client APIs for C++, Java, and Python

• First developed at Cloudera, now at the Apache Software Foundation
• Currently in beta; not yet supported by Cloudera and not production ready

Page 9: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Apache Impala (incubating)

• Open-source SQL query engine
• Built for one purpose: really fast analytics SQL
• High concurrency
• Queries data stored in HDFS, HBase, and now Kudu
• Standard JDBC/ODBC interface for SQL editors and BI tools
• Uses JIT query compilation and modern CPU instructions

• First developed at Cloudera, now at the Apache Software Foundation
• Fully supported by Cloudera and in production at many of our customers

Page 10: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Near-real-time analytics on the Cloudera stack

Page 11: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Implementing Spark Streaming to Kudu

• We define what we want Spark to do each micro-batch
• Spark then takes care of running the micro-batches for us

• We have limited time to process each micro-batch
• Storage lookups must be key lookups or very short scans

• A lot of repetitive boilerplate code is needed to get up and running

Page 12: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Typical stages of a Spark Streaming to Kudu pipeline

• Sourcing from a queue of data
• Translating into a structured format
• Deriving the storage records
• Planning how to update the storage layer
• Applying the planned mutations to the storage layer

Page 13: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Queue sourcing

• In each micro-batch, we first have to bring in data to process
• This is near-real-time, so we expect a queue of messages waiting to be processed

• Kafka fits this requirement very well
• Native no-data-loss integration with Spark Streaming
• Partitioned topics automatically parallelize across Spark executors
• Fault recovery is simple because Kafka does not drop consumed messages

• In Spark Streaming this is the creation of a DStream object
• For Kafka, use KafkaUtils.createDirectStream() (a sketch follows below)
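
A minimal sketch against the Spark 1.6-era Kafka direct stream API; the broker addresses and topic name are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("NRTPipeline")
val ssc = new StreamingContext(conf, Seconds(5))

// Direct stream: no receiver, and one Spark partition per Kafka topic partition
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("events"))
```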

Page 14: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Translation

• Arriving messages could be in any format (XML, CSV, binary, proprietary, etc.)

• We need them in a common structured record format to transform them effectively

• When messages arrive, translate them first
• Avro’s GenericRecord is a widely adopted in-memory record format

• In the Spark Streaming job, use DStream.map() to define the translation (a sketch follows below)
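
A minimal sketch, assuming the messages are comma-delimited text; the Avro schema and field names are illustrative:

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}

val schemaJson =
  """{"type": "record", "name": "Event", "fields": [
    |  {"name": "id", "type": "string"},
    |  {"name": "amount", "type": "double"}
    |]}""".stripMargin

// Translate each raw Kafka message value into an Avro GenericRecord
val translated = stream.map { case (_, message) =>
  // For brevity the schema is parsed per record; in practice cache it per JVM
  val schema = new Schema.Parser().parse(schemaJson)
  val fields = message.split(",")
  val record: GenericRecord = new GenericData.Record(schema)
  record.put("id", fields(0))
  record.put("amount", fields(1).toDouble)
  record
}
```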

Page 15: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Derivation

• We need to create the records that we want to write to the storage layer
• These are often not identical to the arriving records

• Derive the storage records from the arriving records
• Spark SQL can define the transformation, but much more plumbing code is required

• May also require deriving from existing records in the storage layer
• Enrichment using reference data is a common example (a sketch follows below)
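
A minimal sketch of a Spark SQL derivation using the 1.6-era API, assuming the translated records carry id and amount fields and that a reference table has already been registered; all table and column names are illustrative:

```scala
import org.apache.spark.sql.SQLContext

translated.foreachRDD { rdd =>
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._

  // Expose this micro-batch's records to SQL
  rdd.map(r => (r.get("id").toString, r.get("amount").asInstanceOf[Double]))
    .toDF("id", "amount")
    .registerTempTable("arriving")

  // Derive storage records by enriching with reference data
  val derived = sqlContext.sql(
    """SELECT a.id, a.amount, ref.category
      |FROM arriving a
      |JOIN reference ref ON a.id = ref.id""".stripMargin)
}
```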

Page 16: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Planning

• With the derived storage records in hand, we need to plan the storage mutations

• When existing records are never updated, it is straightforward
• Just plan inserts

• When updates for a key can occur, it is a bit harder
• Plan an insert if the key does not exist; plan an update if it does (sketched below)

• When all versions of a key are kept, it can be a lot more complicated
• Insert the arriving record, and update metadata on existing records (e.g. end date)
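
A minimal sketch of the upsert case; the Event record and the Mutation types are hypothetical names for illustration:

```scala
// Hypothetical record type for the derived storage records
case class Event(id: String, amount: Double)

// Hypothetical planned-mutation types
sealed trait Mutation
case class Insert(record: Event) extends Mutation
case class Update(record: Event) extends Mutation

// existingKeys would come from key lookups against the storage layer
def plan(arriving: Event, existingKeys: Set[String]): Mutation =
  if (existingKeys.contains(arriving.id)) Update(arriving)
  else Insert(arriving)
```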

Page 17: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Storing

• With the planned mutations for the micro-batch in hand, we apply them to the storage layer

• For Kudu this requires using the Kudu client Java API (a sketch follows below)
• Applied mutations are immediately visible to Impala users
• Use RDD.foreachPartition() so that you can open a Kudu connection per JVM

• Alternatively, write to Solr; this can be a good option where SQL is not required
• Alternatively, write to HBase, but its storage is too slow for analytics queries
• Alternatively, write to HDFS, but its storage does not support updates or deletes
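
A minimal sketch with the Kudu Java client from Scala, assuming planned is an RDD[Event] of the hypothetical records from the planning sketch; the master address, table name, and columns are placeholders, and the beta-era client lived in the org.kududb.client package rather than org.apache.kudu.client:

```scala
import org.apache.kudu.client.{KuduClient, SessionConfiguration}

planned.foreachPartition { partition =>
  // One connection per partition, not per record
  val client = new KuduClient.KuduClientBuilder("kudu-master:7051").build()
  val table = client.openTable("events")
  val session = client.newSession()
  session.setFlushMode(SessionConfiguration.FlushMode.AUTO_FLUSH_BACKGROUND)

  partition.foreach { event =>
    val upsert = table.newUpsert() // or newInsert()/newUpdate(), per the plan
    val row = upsert.getRow
    row.addString("id", event.id)
    row.addDouble("amount", event.amount)
    session.apply(upsert)
  }

  session.close() // flushes any pending operations
  client.close()
}
```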

Page 18: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Performance considerations

• Repartition the arriving records across all the cores of the Spark job
• If using Spark SQL, lower the number of shuffle partitions from the default of 200
• Use Spark Streaming backpressure to optimize micro-batch size
• If using Kafka, also set spark.streaming.kafka.maxRatePerPartition

• Experiment with micro-batch lengths to balance latency vs. throughput
• Ensure storage lookup predicates are at least by key, or face full table scans
• Avoid connecting to and disconnecting from storage every micro-batch
• The singleton pattern can help to keep one connection per JVM (sketched below)

• Avoid instantiating objects for each record where they could be reused
• Batch mutations for higher throughput
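
A minimal sketch of a few of these tips; the configuration values are illustrative:

```scala
import org.apache.kudu.client.KuduClient
import org.apache.spark.SparkConf

// Backpressure, Kafka rate limiting, and shuffle partitions on the job's SparkConf
val conf = new SparkConf().setAppName("NRTPipeline")
conf.set("spark.streaming.backpressure.enabled", "true")
conf.set("spark.streaming.kafka.maxRatePerPartition", "10000")
conf.set("spark.sql.shuffle.partitions", "20") // down from the default 200

// Singleton pattern: a Scala object is initialized once per executor JVM,
// so the connection is reused across micro-batches and partitions
object KuduConnection {
  lazy val client: KuduClient =
    new KuduClient.KuduClientBuilder("kudu-master:7051").build()
}
```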

Page 19: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

New on Cloudera Labs: Envelope

• A pre-developed Spark Streaming application that implements these stages

• Pipelines are defined as simple configuration using a properties file (an illustrative sketch follows below)
• Custom implementations of stages can be referenced in the configuration

• Available on Cloudera Labs (cloudera.com/labs)
• Not supported by Cloudera; not production ready
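
A hypothetical illustration of what a properties-file pipeline definition could look like; these key names are purely illustrative and are not Envelope's actual configuration schema (see the Envelope documentation on Cloudera Labs for the real one):

```properties
# Hypothetical key names for illustration only
application.name=traffic-pipeline
source=kafka
source.brokers=broker1:9092
source.topics=traffic
translator=delimited
translator.delimiter=,
deriver=sql
deriver.query=SELECT id, AVG(speed) AS avg_speed FROM arriving GROUP BY id
planner=upsert
storage=kudu
storage.connection=kudu-master:7051
storage.table=traffic_conditions
```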

Page 20: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Envelope built-in functionality

• Queue source for Kafka
• Translators for delimited text, key-value pairs, and binary Avro
• Lookup of existing storage records
• Deriver for Spark SQL transformations
• Planners for appends, upserts, and history tracking
• Storage system for Kudu
• Support for many of the described performance considerations

• All stage implementations are also pluggable with user-provided classes

Page 21: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Example pipeline: Traffic

Page 22: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Example pipeline: FIX

Page 23: Building Effective Near-Real-Time Analytics with Spark Streaming and Kudu

Thank [email protected]