Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

84
@wip Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark AJUG, Jan 17, 2017 Todd Fritz Cox Automotive, Inc.

Transcript of Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Page 1: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

@wipBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

AJUG, Jan 17, 2017Todd Fritz

Cox Automotive, Inc.

Page 2: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

2

This presentation is a draft of what will be presented, next month, at DevNexus.

“Sneak peek”

Reactions and questions may influence evolution of content

Disclaimer

Page 3: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

3

Page 5: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

5License: CC BY-SA 3.0

• Senior Solutions Architect @ Cox Automotive, Inc.• Strategic Data Services• The opinions contained herein may not represent my employer, but I

believe they should. • Background is building platforms, middleware, MoM, EIP, EDA, etc• DevOps mentality• Exposed to many environments, technologies, people• Life-long learner and always curious• Novice bass player• Scuba diver

About Me

Page 6: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

6

DevNexus 2015 http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015

Great Wide Open - Atlanta (April 3, 2014)http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud

AJUG (April 15, 2014)http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-groupVideo - https://vimeo.com/94556976

Previous Presentations

Page 7: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

7

• Forward• The Briefest History• Background: Reactive Systems, Patterns, Implementations• The Enterprise• Fast Data • The Data Lake for Analytics, App Dev• Presentation Improvements Planned for DevNexus• Questions• Resources

Agenda

Page 8: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

8

Forward“Our greatest glory is not in never failing,

but in rising every time we fall.”-Confucius

Page 9: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

9

• Why is Reactive Important?• Reactive Systems and Programming != Reactive Management• Reactive underpins every use case, every business capability, every product

feature• Tendency for companies to survey market and select products to match

perceived business need• Importance of vision, governance, tenancy, entitlements, security• Both Process and Technology

• Use Case: Build a system that can scale thousands to millions of users, handling millions to billions of messages• Real time data, for application components, middleware, data processing,

analytics

• A journey of innovation and successive refinement

Onward

Page 10: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

10

The Briefest HistoryHistory is the sum total of things that could have been

avoided. - Konrad Adenauer

Page 11: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

11

• “Reactive” is not new

• Underlying principles go back almost 50 years, to the days of punch cards

• Erlang• Built to scale, handle extremely high volume• Extensive use in Telco for decades; many billions of messages• Actors

• Bedrock is messaging• We’ve been using this technique for decades, via many technologies• Improvements over time around component isolation, decoupling to benefit scalability and

concurrency

The Briefest History

Page 12: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

12

Reactive Systems, Patterns, Implementations“People who think they know everything

are a great annoyance to those of us who do.”- Isaac Asimov

Page 13: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

13

• The Reactive Manifesto • http://www.reactivemanifesto.org

• Many organizations independently building software to new, and similar, patterns• Increasing pressures to simplify, scale, innovate and improve customer experience• Increasing proliferation and interoperability of system environments and connected

devices• More data with contextual use cases

• Yesterday’s architectures just don’t cut it• Need more flexible, resilient, robust systems• Solution through evolution

• Spawn of Actors (Akka, Erlang)• Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns

Reactive Systems

Page 14: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

14

“…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of thinking about systems architecture and design in a distributed environment where implementation techniques, tooling, and design patterns are components of a larger whole.”*

“A Reactive System is based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each … scale up/down, load balance and even ...” * (proactive steps)

Components may qualify as reactive, but when combined, does not guarantee a Reactive System

What is Reactive? A Set of Design Principles

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 15: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

15

1. Reactive Systems (architecture, design)2. Reactive Programming (declarative event-based)3. Functional Reactive Programming (FRP)

NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html

Reactive Begets*

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 16: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

16

• Reactive Programming != Functional Reactive Programming• Subset of Asynch Programming• New information drives logic flow vs. control flow driven by thread-of-execution

• Avoids resource contention (Amdahl’s Law) that impedes scalability.• Decompose into multiple steps that are asynch and nonblocking

• Combine into a composed workflow• Reactive Systems very rarely block• Reactive API libraries are either declarative (functional composition,

combinators) or callback-based (attached to events, executed during dataflow chain) with stream-based operators (windowing, triggers, etc)

• Reactive programming is event-driven• Reactive systems are message driven• Wait? What? More about this distinction in a few slides.

Reactive Programming*

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 17: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

17

• Reactive Programming is related to Dataflow Programming• Both emphasize flow of data vs. flow of control

• Examples• Futures / Promises• (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured

pipelines connecting sources and destinations.• Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value

change can trigger dependent functions to produce new values (state)• Technologies that do this, include• Akka Streams• RxJava• Vert.X

• Reactive Streams Specification• The standard for interoperability amongst Reactive Programming libraries on the JVM• “…an initiative to provide a standard for asynchronous stream processing with non-

blocking back pressure.”

About Dataflow Programming*

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 18: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

18

• Notable benefits• Increased (efficient) utilization of compute resources (incl. multi-core)• Increased performance via serialization reduction• Amdahl’s Law• Neil Günter’s Universal Scalability Law. To quantify the effects of

contention and coordination in concurrent, distributed systems. This explains how the cost of coherency in a system can lead to negative results, as new resources are added to the system.

• Productivity. Reactive libraries handle complexities such as dealing with asynch, nonblocking compute, IO, coordination between components.

• Great for creating components that are composed to workflows, that are back-pressured, scalable, high-performance

Why Reactive Programming*

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 19: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

19

• Reactive Programming - event-driven• Computation via dataflow chains• Events are not directed; “addressable event sources”• Events are mere facts that can be observed• Emitted by changes in state (state machine)• Listeners attach to even sources, which in turn, react to them• Emitted locally

• Reactive Systems - message-driven• Basis of communication across network/components; prefer asynch• Sender/Receiver are decoupled• Focus on resilience and elasticity via communication/communication inherent to distributed systems• Long-lived, addressable components• Waits for messages to be sent, then reacts to them• Messages are ”directed”• A message has a clear destination; “addressable recipient”

Event-Driven vs. Message-Driven*

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 20: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

20

• Common pattern is to us Messaging as means to communicate Events across network/components • Events within Messages

• Examples• AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub)

• Pros• Abstraction and simplicity

• Cons• Lose some control• Messaging forces developers to deal with complex realities of distributed programming• Failure detection• Message delivery contracts (dupes, retry, ordering)• Consistency guarantees• Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC,

etc)

Event-Driven & Message-Driven*

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 21: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

21 Reactive Systems: Characteristics*

Responsive• Low latency• Consistent &

predictable• Foundation of

Usability• Essential for Utility• Expose problems

quickly• Happier Customers

Resilient• Responsive through failure• Resilient (H/A) through

replication• Failure isolated to

Component (bulkheads)• Delegate recovery to

external Component (supervisor hierarchies)

• Client does not handle failures

Elastic• Responsive as

workload varies• Devoid of bottlenecks

or hot spots• Perf metrics drive

predictive / reactive autonomic scaling

• Favor commodity infrastructure

Message Driven• Asynchronous

• Loose Coupling• Isolation• Location Transparency

• Non Blocking• Recipients can passivate

• Message Passing• Flow Control• Exception Management• Elasticity• Back Pressure

* Source: The Reactive Manifesto

Page 22: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

22 Reactive Systems: PatternsArchitecture Pattern:Single Component • Component does one thing, fully and well• Single Responsibility Principle• Max cohension, min coupling

Architecture Pattern:Let-it-Crash• Prefer a full component restart to complex internal

error handling• Design for failure• Leads to more reliable components• Avoids hard to find/fix errors• Failure is unavoidable

Implementation PatternsCircuit Breaker• Protect services by breaking connections

during failures• From EE• Protects clients from timeouts• Allows time for service to recover

Source: https://www.lightbend.com/blog/architect-reactive-design-patterns

Implementation PatternsSaga• Divide long-lived, distributed transactions into

quick local ones with compensating actions for recovery.

• Compensating txns run during saga rollback• Concurrent sagas can see intermediate state• Sags need to be persistent to recover from

hardware failures. Save points.

Page 23: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

23

• Still need to use the brain and learn• Not going to be able to build reactive systems with just

paper certifications• No substitute for experience!

• The “new” is based on evolution (old techniques and patterns)

• Paradigms and patterns not isolated to components or technology

Sound Complicated?

Source: The Reactive Manifesto

Page 24: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

24

The Enterprise“Even if you are on the right track,

you’ll get run over if you just sit there.”- Will Rogers

Page 25: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

25

Page 26: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• We all work for 1..n companies that have a size, complexity and age• Young companies (e.g. start-ups)• Less legacy overhead• More able to adopt newer technology; attempting to innovate, disrupt, find a niche• Less able to leverage expensive, enterprise-class solutions; value through IP or a

unique product• More likely to build cloud native (various reasons)• Staff typically has larger sphere of influence• Sometimes filled with Unicorns

• Mid-size companies• May have legacy overhead• Complexity if growth through acquisition vs. growth through product evolution• Strategic use of of Enterprise products • May be cloud native, or hybrid• Increased division of labor, defined roles, paying customers to keep happy

The Enterprise

26

Page 27: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• Generalizing large companies• Probably where the terms “technical debt” and “culture debt” came from• Certainly has legacy overhead, very complex environments• Fear of change, less room to fail – backed by valid business reasons• Likely has complexity due to both acquisition and product evolution• Purchase a company, adding a ”different” tech stack• Money to fully absorb? Risk? Or purchase to evolve existing lines of business?

• Use of of Enterprise products, likely prefers “supported” flavors of open source• A blend of on-premises, hybrid, cloud• More politics and cats to herd• Slower to see value from new technology• Matrixed division of labor, defined roles, paying customers to keep happy• Transformative change is more difficult; planning, budgeting, process, management

structure

The Enterprise

27

Page 28: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• An Enterprise has a variety of software applications• Customer facing, revenue generating• “Supporting” applications for operation, development, maintenance activities• Datamarts used for analytics• Data typically moved from system of record, into one or more centralized data

“hubs”• OLAP / Data Warehouses• Data Lake, common use of Hadoop

• Older application architectures focused on the application, not on enterprise interoperability• Drop in ETL, ESB, SOA, iPaaS to do so ($$$)• Perhaps refactor service layers into more modern middleware

• Enterprises becoming ”real time”• The old, batch-oriented, application silos are too complex and just can’t do this well

Breaking Down the Enterprise

28

Page 29: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• (Not a complete list)• Products (and supporting software) operates in Real Time• Opens door to do things to increase customer satisfaction/retention

• Save money, become more agile (speed to market) – productivity enhancer• Less reliant on batch or pull (streaming batch to interop legacy)• Data flows to system that need it, where it is acted on in real time• More powerful data processing• Easier to manage; reduced TCO• Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service)• Leverage cloud solutions, auto scaling, resiliency, high availability• Analysts get to use latest tools, oftentimes in the cloud• Enables autonomic automation

How the Enterprise Benefits from Reactive

29

Page 30: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• Friends don’t let their friend-analysts do map-reduce• Analysts are not highly technical (some do code in R)• Larger companies may have communities of SAS users• They just want to connect a friendly BI tool (e.g. Tableau) to run their

SQL against data.• Can’t expect people to switch careers or skill sets to accommodate

new technology.

And Again, those Analysts

30

Page 31: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

31Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

Page 32: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

32

Fast Data“If you are in a spaceship that is traveling at the speed of

light, and you turn on the headlights,

does anything happen?”- Steven Wright

Page 33: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

33

• Big Data Fast Data• Constant stream of inbound data, at incredible rates• The old way is to store it, then analyze

• Say hello to Map Reduce!• Hadoop did reinvent how to process petabytes of data on commodity hardware

• Why not analyze and act on data as it is received?• Using (proven) technology built to scale to massive volume?• Akka, Kafka, Spark, Flink• Use case reminder: millions of concurrent actors handling billions of

messages, in real-time• Fast Data means acting on data in real-time, and sending it to destinations

• Data Lake• Other systems

Fast Data

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 34: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

34 Time is Gold

* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)

Page 35: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

35

• Fast Data is the obvious future• Many of us have been using these techs for years

• The easy part• Build new systems using Reactive patterns, architectures and technology• You start ups have it easy…• The ability of a company to adopt disruptive architectures and patterns is inversely related to its

size• In a real world where SDLC costs money

• How to reconcile with legacy architectures and implementations?• How to interoperate with disparate systems with varying capabilities?• Does it make sense to refactor “what works”?• Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter

of perspective, when viewed in the rear-facing mirror of innovation?• What is a cost efficient approach to adopt and adapt to Reactive?

Reality Check

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 36: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

36

• Enough background, time to talk tech• For this discussion, the reference platform is Lightbend’s Fast

Data platform• (A few deviations mentioned)• Core techs will be• Kafka - open source or Confluent• Spark - open source or Databricks (a nice managed service in AWS)• Akka (and Akka-HTTP) (open source or Lightbend stack)• Alpakka

Now, the Fun Stuff

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

Page 37: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

37Source: Lightbend, Inc.

Page 38: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

38

• JMS -> AMQP –> Kafka• Streaming platform• Process messages when provided• Fault-tolerant storage• Pub/Sub capability

• Common use cases• Data pipelines (messaging) between a source and destination• Real-time action / transformation• Metrics, weblogs, stream processing, event sourcing

• A commit log• Spread across a cluster• Record streams stored in topics• Each record has a key, value, timestamp• Each topic has offsets and a retention policy

Kafka 101

Page 39: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

39 Kafka Performance

Source: “Introduction to Kafka”, Ducas Francis

30k/s 1.8M/min 108M/hr 2.7B/day

Page 40: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

40

• Producer API• To publish a record to Kafka

• Consumer API• To subscribe to a topic(s) • Consumer groups• Handle records pushed to topics

• Streams API• Stream processing• Consume from Stream An, do processing,

publish to Stream Bn

• E.g. aggregation, joining• Connector API• To build custom Producers/Consumers• Purpose built integration components

Kafka APIs

Page 41: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

41

• > Messaging systems (RabbitMQ, AMQ, etc) • Just don’t scale well and become complex• Need to use other abstractions for batching• Lacks replay ability (reset offset, etc)

• > Log forwarders, e.g. Scribe or Flume• Push architecture• High performance • Scales well• Sensitive to business logic in endpoints, because needs to push data fast• Assumes data is pushed to large sink (e.g. Hadoop)• Oh, and then queries later. So much for real-time.

• Supports Polyglot• Python, Go, .NET, Node.js, C/C++, etc

• Robust ecosystem• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

Why Use Kafka?

Page 42: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

42

• “…a fast and general engine for large-scale data processing.”• Hadoop / YARN• Map-Reduce – mapper, reducer, disk IO, queue, fetch resource• Great for parallel file processing of large files• Synchronization barrier during persistence

• Spark• In-memory data processing• Interactive/iterative data query• Better supports more complex, interactive (real-time) apps

• 100x faster than Hadoop MR (in memory), 10x faster on disk• Microbatching

Spark: What is it?

Source: spark.apache.org

Page 43: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

43

• Combine SQL, streaming, complex analytics• SQL• Dataframes• MLlib• GraphX• Spark Streaming

Spark

Source: spark.apache.org

Page 44: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

44

• Run it • Standalone• Hadoop• Mesos• Cloud

• Access data• HDFS• Cassandra• Hbase• S3• Hive• Tachyon, and more

• Write code• Scala• Java• Python, Clojure, R

• Interactive query shell (notebooks)

Spark Execution Modes

Source: spark.apache.org

Page 45: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

45

• Slow due to replication, serialization, filesystem IO• Inefficient use cases:• Iterative algorithms (ML, Graphs, Network analysis)• Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries)

Spark: Hadoop MR

Source: “Spark Overview”, Lisa Hua

Page 46: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

46 Spark: Hadoop & Spark (in Hadoop)

Source: “Spark Overview”, Lisa Hua

• Spark has additional features, such as interop with S3 storage

Page 47: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Spark: A Clustered Application

47Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

Page 48: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Spark: Execution Terminology

48Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

• Job – a set of tasks to be executed as a result of an action• Stage – a set of tasks in a job that can be run in parallel• Task – a individual unit of work sent to a single executor

Page 49: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Spark: SQL

49Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

• Spark SQL is a module for structured data querying• Supports basic SQL and HiveQL• Can act as distributed query engine via JDBC/ODBC, or CLI

Page 50: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Spark: Dataframe, Datasets, RDDs

50Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

• Dataframe is a distributed assembly of data into named columns• Analogous to a relational table, or data frame in R/Python (with richer

optimizations)• Dataset was added in 1.6 to provide benefit of RDDs and Spark

SQL’s execution engine• Build datasets from JVM objects and then manipulate with functional

transformations

Page 51: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• Scalable, high-throughput, fault tolerant processing of real time streams and use cases.

• Microbatched. Think of in terms of EDA (e.g. Esper), for “windowing” (etc) vs. handling a single message/event.

Spark Streaming

51Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

Page 52: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• Can merge inbound data with historical data• Write code in Scala, Python, Java, etc• Lower level access via DStream, obtained via StreamingContext.• Create RDD’s from the DStream• Two primary metrics to monitor and tune:• Processing time (per batch)• Scheduling delay (processed upon arrival?)

• Use Kyro for serialization

Spark Streaming

52Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

Page 53: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Spark Streaming - comparison with other techs

53Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

Page 54: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Building a Spark Application

54Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

• Scala or Java, compiled to JAR (in turn uploaded to worker nodes)• Running a spark app is as easy as pie (submit options can expand the

experience):

$ spark-submit myAwesomePythonScript.py theFileURL

$ spark-submit –class SkyNetInScala skynet2017.jar theFileURL

• Spark uses log4j (beware)• YARN can aggregate worker logs

Page 55: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

55 Akka, who uses it?

Source: “Akka Actor Introduction”, Gene

Page 56: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

56 Akka

Source: “Introducing Akka”, Jonas Bonér

• Vision• Simpler: Concurrency, Scalability, Fault-Tolerance• With a single unified• Programming Model• Managed Runtime• Open Source Distribution

• Manage System Overload (backpressure)• Scale up & Scale out• Program to a higher level• No more shared state, state visibility, threads, locks, concurrent collections, thread

notifications• Low level concurrency built into the plumbing; it becomes simple workflow just

deal with messages• Increases CPU utilization, lowers latency, high throughput, scalable!• Superior, Proven model to detect and recover from errors

Page 57: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

57 Akka: Perfect for the Cloud

Source: “Introducing Akka”, Jonas Bonér

• Elastic and dynamic• Fault tolerant & self healing (autonomic)• Adaptive load-balancing, cluster rebalancing & Actor migration• Build loosely coupled systems that can dynamically adapt at runtime

Page 58: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

58 Akka 101

Source: “Introducing Akka”, Jonas Bonér

• Akka’s unit of code is called an Actor• The vehicle to create concurrent, scalable, fault-tolerant apps atop the

fabric• Encapsulates code like servlets or session beans; policy decisions

separated from biz logic• Actors have been around since 1973, and if you’ve ever used a

phone, that software helped make it work. 9 nines of uptime!• Think of an actor as a VM in the cloud (it isn’t, but)• Encapsulated, decoupled• Managing own memory and behavior• Communicates asynchronously with non-blocking messages• Elastic – grow/shrink on demand• Hot deploy, change runtime behavior

Page 59: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

59 Akka: How to use Actors?

Source: “Introducing Akka”, Jonas Bonér

• Alternative to:• A thread• An object instance or component• Callback or listener• Singleton or service• Router, load-balancer or pool• Session bean or MDB• Out of process service• A Finite State Machine (FSM)

Page 60: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

60 Akka: What is it?

Source: http://bit.ly/hewitt-on-actors

• Carl Hewitt’s definition• Fundamental unit of computation that embodies:• Processing• Storage• Communication

• 3 Axioms – When an actor receives a message it can:• Create new Actors• Send messages to Actors it knows• Designate how it should handle the next message it receives

Page 61: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

61 Akka: Core Actor Operations

Source: “Introducing Akka”, Jonas Bonér

0. Define1. Create2. Send3. Become4. Supervise

Page 62: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

62 Akka: Define Operation

Source: “Introducing Akka”, Jonas Bonér

0. Define

• Define the message (class) the actor should respond to, and Actor class

Page 63: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

63 Akka: Create Operation

Source: “Introducing Akka”, Jonas Bonér

1. Create

• Yes, creates new actor. From ActorSystem, then ActorRef.• Lightweight, 2.6M per Gb RAM• Strong encapsulation of: state/behavior (indistinguishable), message queue

Page 64: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

64 Akka: Send Operation

Source: “Introducing Akka”, Jonas Bonér

2. Send

• Sends a message to an Actor• Asynch and on-blocking (fire & forget)• Everything is Reactive• Actor is passivated until receiving a message, which triggers it to awaken• Messages are energy

• Everything is asynch and lockless

Page 65: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

65 Akka: Performance

Source: “Introducing Akka”, Jonas Bonér

+50 million messages per second !!!

Page 66: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

66 Akka: Remote Deployment

Source: “Introducing Akka”, Jonas Bonér

Page 67: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

67 Akka: Become Operation

Source: “Introducing Akka”, Jonas Bonér

3. Become

• Dynamically redefine Actor’s behavior• Reactively triggered by receipt of a message• Will not react differently to messages it receives• Behaviors are stacked – can by pushed and popped…

(Think in terms of the object changed it’s type – interface, protocol, implementation)

Page 68: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

68 Akka: Become Operation – Why?

Source: “Introducing Akka”, Jonas Bonér

Why do this?• A busy actor can become an Actor Pool or Router!• Implement FSM (Finite State Machine)• Implement graceful degradation• Spawn empty workers that can ”Become” whatever the Master

desires• Very useful. Limited only by your imagination.

Page 69: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

69 Akka: Failure Management

Source: “Introducing Akka”, Jonas Bonér

In Java/C/C+ etc.• Single thread of control• If that thread blows up you are screwed• Only option to do explicit error handling within your single thread• Errors isolated within thread; other threads have no clue• Results in tons of defensive code, scattered throughout the codebase

and entangled in your business logic

Page 70: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

70 Akka: Supervise Operation

Source: “Introducing Akka”, Jonas Bonér

4. Supervise

• Manage another Actor’s failures (or the person sitting next to you)• Let actors monitor (supervise) each other for failure, then respond• Notification sent to Actor’s supervisor if failure occurs. • Neat separation of processing from error handling

Page 71: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

71 Akka Streaming

Source: Akka Documentation

• Akka Streams API is completely decoupled from Reactive Streams interfaces• Implementation details to pass stream data between processing stages• Akka Streams is interoperable with any conformant Reactive Streams i

• Principles• All features explicit in the API• Compositionality; combined pieces retain the function of each part• Model of domain of distributed bounded stream processing

• Reactive Streams -> JDK9 Flow APIs

Page 72: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

72 Akka Streaming

Source: Akka Documentation

• Immutable building blocks / blueprints enabled for libraries, include• Source - something with exactly one output stream• Sink - something with exactly one input stream• Flow - something with exactly one input and one output stream• BidiFlow - something with exactly two input streams and two output streams that conceptually

behave like two Flows of opposite direction• Graph - a packaged stream processing topology that exposes a certain set of input and output ports,

characterized by an object of type Shape.• Built in backpressure capability• No stage can push downstream unless it received a pull beforehand

• Difference between error and failure• Error is accessible within the stream as a data element (signaled via onNext)• Failure means the stream itself has collapsed (signaled via onError).• Want failure to propagate faster than data (essential, to deal with backpressure)• Data elements emitted before a failure can still be lost of the onError overtakes them• Recovery element acts as bulkhead to confine a stream collapse to a given region of stream

topology, to isolate outside from impact of collapsed region (e.g. buffered elements)

Page 73: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

73 Akka HTTP

Source: Akka Documentation

• Built atop Akka Streams• Can expose an incoming connection in form of a Source instance• To start listening on network with Akka HTTP, create a Route and bind it to a port

(similar syntax to Spray).• Backpressure on source? Akka HTTP stops consuming data from network; eventually

leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor)• Rules• Libraries shall provide their users with reusable pieces, i.e. expose factories that return

graphs, allowing full compositionality• Avoid destruction of compositionality.• Express functionality of a library such that materialization can be done by user

outside of library’s control.• Libraries may optionally and additionally provide facilities that consume and

materialize graphs• Allows a library to provide convenience “sugar” for use cases

Page 74: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

74

• Akka Streams Integration• https://github.com/akka/alpakka• http://developer.lightbend.com/docs/alpakka/current/

• Adds interesting capabilities to Akka Streams• Modern alternative to Apache Camel (EIP

implementation)• Camel en ze Akka

• Community driven, focused on connectors to external libraries, integration patterns and conversions.• ”A call to arms”• https://github.com/akka/alpakka/releases/tag/v0.1

Alpakka 101

Page 75: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

75

The Data Lake for Analytics, App Dev“Lake Wobegon, the little town that time forgot

and the decades cannot improve.”- Garrison Keillor

Page 76: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

76 The Data Lake

• Note: This section will be expanded for DevNexus.

Page 77: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

77 The Data Lake

• The Data Lake is where a copy of much of the data from source sytems “ends up”, via Fast Data, etc.

• Easily accessible, massive repository of data built on commodity hardware (or Cloud).

• Data is not stored in a way that is optimized for data analysis (S3)• Data Lake retains all attributes• Beware the Data Lake fallacy: http://

www.gartner.com/newsroom/id/2809117• Let’s combine all this data to drive increased information sharing, usage,

while reducing cost through consolidation / tech simplication.• Does it really work?• Has the ideal of Enterprise-wide data management been realized?• Deriving value from data still in hands of business end user (enter: Fast Data Platforms)

Page 78: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

78 The Dark Side of the Data Lake

Source: “Gartner Says Beware of the Data Lake Fallacy”

• Many companies tend to vacuum data into a Hadoop for later use• Many companies use overlapping tools within the same ecosystem, that do not

interoperate• Data lakes ignore how/why data is used, governed, defined and secured.• Does this sound like a good solution?• Data Lake solves old problem of siloing data. Great, so now it’s all in comingled. • Federated query? AWS Athena? Why move data if not necessary?

• Inability to quantitatively measure data quality• Accepts any data without governance or oversight• Accepts any data without metadata (description)• Inability to share lineage of findings by other analysis to share found value• Security, access control (and tracking of both)• Data Ownership, Entitlements?• Tenancy?• Regard for regulatory controls, compliance issues?• What to do?

Page 79: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

79 More to Come…

Source: “Gartner Says Beware of the Data Lake Fallacy”

• At DevNexus 2017• Thurs, Feb 23 @ 2:30pm• http://devnexus.com/s/devnexus2017/presentations/17212

Page 80: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

80

Presentation Improvements for DevNexus“Build something 100 people love,

not something 1 million people kind of like.”- Brian Chesky

Page 81: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

• More diagrams, fewer words• Refine, refine, refine• Mix in coding examples• Improve contrast between older architectures and reactive, for the

Enterprise• Content that contrasts different streaming options (Akka, Spark,

Kafka)• Add specific performance details• Incorporate additional, interesting content (incl. Data Lake related)

Planned Improvements

81

Page 82: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

82

Questions?“I'm sorry, if you were right, I'd agree with you.”

- Robin Williams

Page 83: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

83

• The Reactive Manifesto (http://www.reactivemanifesto.org/)• Chaos Monkey? Use Linear Fault Driven Testing instead.• https://www.lightbend.com/blog/architect-reactive-design-patterns• http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html• https://

www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka• https://kafka.apache.org• http://www.slideshare.net/ducasf/introduction-to-kafka• http://www.slideshare.net/SparkSummit/grace-huang-49762421• http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms• https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/• https://github.com/akka/alpakka/releases/tag/v0.1• http://www.slideshare.net/LisaHua/spark-overview-37479609• http://spark.apache.org/• https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/• http://www.slideshare.net/gene7299/akka-actor-presentation• http://www.slideshare.net/jboner/introducing-akka• http://bit.ly/hewitt-on-actors• http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html• https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/

Resources

Page 84: Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Ideas

84