Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

@wipBuilding Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

AJUG, Jan 17, 2017Todd Fritz

Cox Automotive, Inc.

2

This presentation is a draft of what will be presented, next month, at DevNexus.

“Sneak peek”

Reactions and questions may influence evolution of content

Disclaimer

4

[email protected]@coxautoinc.com

www.linkedin.com/in/tfritz

http://www.slideshare.net/ToddFritz

https://github.com/todd-fritz

mailto:[email protected]

mailto:[email protected]

http://www.linkedin.com/in/tfritz





5License: CC BY-SA 3.0

• Senior Solutions Architect @ Cox Automotive, Inc.• Strategic Data Services• The opinions contained herein may not represent my employer, but I

believe they should. • Background is building platforms, middleware, MoM, EIP, EDA, etc• DevOps mentality• Exposed to many environments, technologies, people• Life-long learner and always curious• Novice bass player• Scuba diver

About Me

http://creativecommons.org/licenses/by-sa/3.0/



6

DevNexus 2015 http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015

Great Wide Open - Atlanta (April 3, 2014)http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud

AJUG (April 15, 2014)http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-groupVideo - https://vimeo.com/94556976

Previous Presentations

http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015

http://www.slideshare.net/ToddFritz/2015-03-11_Todd_Fritz_Devnexus_2015

http://www.slideshare.net/ToddFritz/2014-04-03legacytocloud

http://www.slideshare.net/ToddFritz/2014-april-15-atlanta-java-users-group

https://vimeo.com/94556976

7

• Forward• The Briefest History• Background: Reactive Systems, Patterns, Implementations• The Enterprise• Fast Data • The Data Lake for Analytics, App Dev• Presentation Improvements Planned for DevNexus• Questions• Resources

Agenda

8

Forward“Our greatest glory is not in never failing,

but in rising every time we fall.”-Confucius

9

• Why is Reactive Important?• Reactive Systems and Programming != Reactive Management• Reactive underpins every use case, every business capability, every product

feature• Tendency for companies to survey market and select products to match

perceived business need• Importance of vision, governance, tenancy, entitlements, security• Both Process and Technology

• Use Case: Build a system that can scale thousands to millions of users, handling millions to billions of messages• Real time data, for application components, middleware, data processing,

analytics

• A journey of innovation and successive refinement

Onward

10

The Briefest HistoryHistory is the sum total of things that could have been

avoided. - Konrad Adenauer

11

• “Reactive” is not new

• Underlying principles go back almost 50 years, to the days of punch cards

• Erlang• Built to scale, handle extremely high volume• Extensive use in Telco for decades; many billions of messages• Actors

• Bedrock is messaging• We’ve been using this technique for decades, via many technologies• Improvements over time around component isolation, decoupling to benefit scalability and

concurrency

The Briefest History

12

Reactive Systems, Patterns, Implementations“People who think they know everything

are a great annoyance to those of us who do.”- Isaac Asimov

13

• The Reactive Manifesto • http://www.reactivemanifesto.org

• Many organizations independently building software to new, and similar, patterns• Increasing pressures to simplify, scale, innovate and improve customer experience• Increasing proliferation and interoperability of system environments and connected

devices• More data with contextual use cases

• Yesterday’s architectures just don’t cut it• Need more flexible, resilient, robust systems• Solution through evolution

• Spawn of Actors (Akka, Erlang)• Good starting point: https://www.lightbend.com/blog/architect-reactive-design-patterns

Reactive Systems

http://www.reactivemanifesto.org/



https://www.lightbend.com/blog/architect-reactive-design-patterns


14

“…“Reactive” is a set of design principles for creating cohesive systems. It’s a way of thinking about systems architecture and design in a distributed environment where implementation techniques, tooling, and design patterns are components of a larger whole.”*

“A Reactive System is based on an architectural style that allows … multiple individual services to coalesce as a single unit and react to its surroundings while remaining aware of each … scale up/down, load balance and even ...” * (proactive steps)

Components may qualify as reactive, but when combined, does not guarantee a Reactive System

What is Reactive? A Set of Design Principles

* Source: “Reactive Programming versus Reactive Systems”, Jonas Bonér and Viktor Klang

15

1. Reactive Systems (architecture, design)2. Reactive Programming (declarative event-based)3. Functional Reactive Programming (FRP)

NOTE: The inventor of this term, Conal Elliott, says this term is misapplied today (e.g. RxJS, RxJava, Bacon.js, etc). Refer to his presentation (July 22, 2015) for the details: https://begriffs.com/posts/2015-07-22-essence-of-frp.html

Reactive Begets*


https://begriffs.com/posts/2015-07-22-essence-of-frp.html



16

• Reactive Programming != Functional Reactive Programming• Subset of Asynch Programming• New information drives logic flow vs. control flow driven by thread-of-execution

• Avoids resource contention (Amdahl’s Law) that impedes scalability.• Decompose into multiple steps that are asynch and nonblocking

• Combine into a composed workflow• Reactive Systems very rarely block• Reactive API libraries are either declarative (functional composition,

combinators) or callback-based (attached to events, executed during dataflow chain) with stream-based operators (windowing, triggers, etc)

• Reactive programming is event-driven• Reactive systems are message driven• Wait? What? More about this distinction in a few slides.

Reactive Programming*


17

• Reactive Programming is related to Dataflow Programming• Both emphasize flow of data vs. flow of control

• Examples• Futures / Promises• (Reactive) Streams – unbounded data processing. Asynch, non-blocking, back-pressured

pipelines connecting sources and destinations.• Dataflow Variables – single assignment variables (AKA a cell in Excel) whereby a value

change can trigger dependent functions to produce new values (state)• Technologies that do this, include• Akka Streams• RxJava• Vert.X

• Reactive Streams Specification• The standard for interoperability amongst Reactive Programming libraries on the JVM• “…an initiative to provide a standard for asynchronous stream processing with non-

blocking back pressure.”

About Dataflow Programming*


18

• Notable benefits• Increased (efficient) utilization of compute resources (incl. multi-core)• Increased performance via serialization reduction• Amdahl’s Law• Neil Günter’s Universal Scalability Law. To quantify the effects of

contention and coordination in concurrent, distributed systems. This explains how the cost of coherency in a system can lead to negative results, as new resources are added to the system.

• Productivity. Reactive libraries handle complexities such as dealing with asynch, nonblocking compute, IO, coordination between components.

• Great for creating components that are composed to workflows, that are back-pressured, scalable, high-performance

Why Reactive Programming*


19

• Reactive Programming - event-driven• Computation via dataflow chains• Events are not directed; “addressable event sources”• Events are mere facts that can be observed• Emitted by changes in state (state machine)• Listeners attach to even sources, which in turn, react to them• Emitted locally

• Reactive Systems - message-driven• Basis of communication across network/components; prefer asynch• Sender/Receiver are decoupled• Focus on resilience and elasticity via communication/communication inherent to distributed systems• Long-lived, addressable components• Waits for messages to be sent, then reacts to them• Messages are ”directed”• A message has a clear destination; “addressable recipient”

Event-Driven vs. Message-Driven*


20

• Common pattern is to us Messaging as means to communicate Events across network/components • Events within Messages

• Examples• AWS Lambda, distributed streaming (Spark Streaming, Flink, Kafka, Akka Streams, Pub/Sub)

• Pros• Abstraction and simplicity

• Cons• Lose some control• Messaging forces developers to deal with complex realities of distributed programming• Failure detection• Message delivery contracts (dupes, retry, ordering)• Consistency guarantees• Can’t hide behind “leaky” abstractions that pretend a network does not exist (EJB, XA, RPC,

etc)

Event-Driven & Message-Driven*


21 Reactive Systems: Characteristics*

Responsive• Low latency• Consistent &

predictable• Foundation of

Usability• Essential for Utility• Expose problems

quickly• Happier Customers

Resilient• Responsive through failure• Resilient (H/A) through

replication• Failure isolated to

Component (bulkheads)• Delegate recovery to

external Component (supervisor hierarchies)

• Client does not handle failures

Elastic• Responsive as

workload varies• Devoid of bottlenecks

or hot spots• Perf metrics drive

predictive / reactive autonomic scaling

• Favor commodity infrastructure

Message Driven• Asynchronous

• Loose Coupling• Isolation• Location Transparency

• Non Blocking• Recipients can passivate

• Message Passing• Flow Control• Exception Management• Elasticity• Back Pressure

* Source: The Reactive Manifesto

22 Reactive Systems: PatternsArchitecture Pattern:Single Component • Component does one thing, fully and well• Single Responsibility Principle• Max cohension, min coupling

Architecture Pattern:Let-it-Crash• Prefer a full component restart to complex internal

error handling• Design for failure• Leads to more reliable components• Avoids hard to find/fix errors• Failure is unavoidable

Implementation PatternsCircuit Breaker• Protect services by breaking connections

during failures• From EE• Protects clients from timeouts• Allows time for service to recover

Source: https://www.lightbend.com/blog/architect-reactive-design-patterns

Implementation PatternsSaga• Divide long-lived, distributed transactions into

quick local ones with compensating actions for recovery.

• Compensating txns run during saga rollback• Concurrent sagas can see intermediate state• Sags need to be persistent to recover from

hardware failures. Save points.


23

• Still need to use the brain and learn• Not going to be able to build reactive systems with just

paper certifications• No substitute for experience!

• The “new” is based on evolution (old techniques and patterns)

• Paradigms and patterns not isolated to components or technology

Sound Complicated?

Source: The Reactive Manifesto

24

The Enterprise“Even if you are on the right track,

you’ll get run over if you just sit there.”- Will Rogers

• We all work for 1..n companies that have a size, complexity and age• Young companies (e.g. start-ups)• Less legacy overhead• More able to adopt newer technology; attempting to innovate, disrupt, find a niche• Less able to leverage expensive, enterprise-class solutions; value through IP or a

unique product• More likely to build cloud native (various reasons)• Staff typically has larger sphere of influence• Sometimes filled with Unicorns

• Mid-size companies• May have legacy overhead• Complexity if growth through acquisition vs. growth through product evolution• Strategic use of of Enterprise products • May be cloud native, or hybrid• Increased division of labor, defined roles, paying customers to keep happy

The Enterprise

26

• Generalizing large companies• Probably where the terms “technical debt” and “culture debt” came from• Certainly has legacy overhead, very complex environments• Fear of change, less room to fail – backed by valid business reasons• Likely has complexity due to both acquisition and product evolution• Purchase a company, adding a ”different” tech stack• Money to fully absorb? Risk? Or purchase to evolve existing lines of business?

• Use of of Enterprise products, likely prefers “supported” flavors of open source• A blend of on-premises, hybrid, cloud• More politics and cats to herd• Slower to see value from new technology• Matrixed division of labor, defined roles, paying customers to keep happy• Transformative change is more difficult; planning, budgeting, process, management

structure

The Enterprise

27

• An Enterprise has a variety of software applications• Customer facing, revenue generating• “Supporting” applications for operation, development, maintenance activities• Datamarts used for analytics• Data typically moved from system of record, into one or more centralized data

“hubs”• OLAP / Data Warehouses• Data Lake, common use of Hadoop

• Older application architectures focused on the application, not on enterprise interoperability• Drop in ETL, ESB, SOA, iPaaS to do so ($$$)• Perhaps refactor service layers into more modern middleware

• Enterprises becoming ”real time”• The old, batch-oriented, application silos are too complex and just can’t do this well

Breaking Down the Enterprise

28

• (Not a complete list)• Products (and supporting software) operates in Real Time• Opens door to do things to increase customer satisfaction/retention

• Save money, become more agile (speed to market) – productivity enhancer• Less reliant on batch or pull (streaming batch to interop legacy)• Data flows to system that need it, where it is acted on in real time• More powerful data processing• Easier to manage; reduced TCO• Decoupled, enables greater interoperability, CI/CD (Infrastructure as a Service)• Leverage cloud solutions, auto scaling, resiliency, high availability• Analysts get to use latest tools, oftentimes in the cloud• Enables autonomic automation

How the Enterprise Benefits from Reactive

29

• Friends don’t let their friend-analysts do map-reduce• Analysts are not highly technical (some do code in R)• Larger companies may have communities of SAS users• They just want to connect a friendly BI tool (e.g. Tableau) to run their

SQL against data.• Can’t expect people to switch careers or skill sets to accommodate

new technology.

And Again, those Analysts

30

31Source: “Rapid Cluster Computing with Apache Spark”, Zohar Elkayam

32

Fast Data“If you are in a spaceship that is traveling at the speed of

light, and you turn on the headlights,

does anything happen?”- Steven Wright

33

• Big Data Fast Data• Constant stream of inbound data, at incredible rates• The old way is to store it, then analyze

• Say hello to Map Reduce!• Hadoop did reinvent how to process petabytes of data on commodity hardware

• Why not analyze and act on data as it is received?• Using (proven) technology built to scale to massive volume?• Akka, Kafka, Spark, Flink• Use case reminder: millions of concurrent actors handling billions of

messages, in real-time• Fast Data means acting on data in real-time, and sending it to destinations

• Data Lake• Other systems

Fast Data


34 Time is Gold

* Source: “Towards Benchmarking Modern Distributed Systems”, Grace Huang (Intel)

35

• Fast Data is the obvious future• Many of us have been using these techs for years

• The easy part• Build new systems using Reactive patterns, architectures and technology• You start ups have it easy…• The ability of a company to adopt disruptive architectures and patterns is inversely related to its

size• In a real world where SDLC costs money

• How to reconcile with legacy architectures and implementations?• How to interoperate with disparate systems with varying capabilities?• Does it make sense to refactor “what works”?• Are Frankenstein / Stove-piped solutions the norm, or (partially) a matter

of perspective, when viewed in the rear-facing mirror of innovation?• What is a cost efficient approach to adopt and adapt to Reactive?

Reality Check


36

• Enough background, time to talk tech• For this discussion, the reference platform is Lightbend’s Fast

Data platform• (A few deviations mentioned)• Core techs will be• Kafka - open source or Confluent• Spark - open source or Databricks (a nice managed service in AWS)• Akka (and Akka-HTTP) (open source or Lightbend stack)• Alpakka

Now, the Fun Stuff


37Source: Lightbend, Inc.

38

• JMS -> AMQP –> Kafka• Streaming platform• Process messages when provided• Fault-tolerant storage• Pub/Sub capability

• Common use cases• Data pipelines (messaging) between a source and destination• Real-time action / transformation• Metrics, weblogs, stream processing, event sourcing

• A commit log• Spread across a cluster• Record streams stored in topics• Each record has a key, value, timestamp• Each topic has offsets and a retention policy

Kafka 101

39 Kafka Performance

Source: “Introduction to Kafka”, Ducas Francis

30k/s 1.8M/min 108M/hr 2.7B/day

40

• Producer API• To publish a record to Kafka

• Consumer API• To subscribe to a topic(s) • Consumer groups• Handle records pushed to topics

• Streams API• Stream processing• Consume from Stream An, do processing,

publish to Stream Bn

• E.g. aggregation, joining• Connector API• To build custom Producers/Consumers• Purpose built integration components

Kafka APIs

41

• > Messaging systems (RabbitMQ, AMQ, etc) • Just don’t scale well and become complex• Need to use other abstractions for batching• Lacks replay ability (reset offset, etc)

• > Log forwarders, e.g. Scribe or Flume• Push architecture• High performance • Scales well• Sensitive to business logic in endpoints, because needs to push data fast• Assumes data is pushed to large sink (e.g. Hadoop)• Oh, and then queries later. So much for real-time.

• Supports Polyglot• Python, Go, .NET, Node.js, C/C++, etc

• Robust ecosystem• https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

Why Use Kafka?

https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem



42

• “…a fast and general engine for large-scale data processing.”• Hadoop / YARN• Map-Reduce – mapper, reducer, disk IO, queue, fetch resource• Great for parallel file processing of large files• Synchronization barrier during persistence

• Spark• In-memory data processing• Interactive/iterative data query• Better supports more complex, interactive (real-time) apps

• 100x faster than Hadoop MR (in memory), 10x faster on disk• Microbatching

Spark: What is it?

Source: spark.apache.org

43

• Combine SQL, streaming, complex analytics• SQL• Dataframes• MLlib• GraphX• Spark Streaming

Spark


44

• Run it • Standalone• Hadoop• Mesos• Cloud

• Access data• HDFS• Cassandra• Hbase• S3• Hive• Tachyon, and more

• Write code• Scala• Java• Python, Clojure, R

• Interactive query shell (notebooks)

Spark Execution Modes


45

• Slow due to replication, serialization, filesystem IO• Inefficient use cases:• Iterative algorithms (ML, Graphs, Network analysis)• Interactive / Ad-hoc data mining (R, Excel, Searching, analyst queries)

Spark: Hadoop MR

Source: “Spark Overview”, Lisa Hua

46 Spark: Hadoop & Spark (in Hadoop)

Source: “Spark Overview”, Lisa Hua

• Spark has additional features, such as interop with S3 storage

Spark: A Clustered Application


Spark: Execution Terminology


• Job – a set of tasks to be executed as a result of an action• Stage – a set of tasks in a job that can be run in parallel• Task – a individual unit of work sent to a single executor

Spark: SQL


• Spark SQL is a module for structured data querying• Supports basic SQL and HiveQL• Can act as distributed query engine via JDBC/ODBC, or CLI

Spark: Dataframe, Datasets, RDDs


• Dataframe is a distributed assembly of data into named columns• Analogous to a relational table, or data frame in R/Python (with richer

optimizations)• Dataset was added in 1.6 to provide benefit of RDDs and Spark

SQL’s execution engine• Build datasets from JVM objects and then manipulate with functional

transformations

• Scalable, high-throughput, fault tolerant processing of real time streams and use cases.

• Microbatched. Think of in terms of EDA (e.g. Esper), for “windowing” (etc) vs. handling a single message/event.

Spark Streaming


• Can merge inbound data with historical data• Write code in Scala, Python, Java, etc• Lower level access via DStream, obtained via StreamingContext.• Create RDD’s from the DStream• Two primary metrics to monitor and tune:• Processing time (per batch)• Scheduling delay (processed upon arrival?)

• Use Kyro for serialization

Spark Streaming


Spark Streaming - comparison with other techs


Building a Spark Application


• Scala or Java, compiled to JAR (in turn uploaded to worker nodes)• Running a spark app is as easy as pie (submit options can expand the

experience):

$ spark-submit myAwesomePythonScript.py theFileURL

$ spark-submit –class SkyNetInScala skynet2017.jar theFileURL

• Spark uses log4j (beware)• YARN can aggregate worker logs

55 Akka, who uses it?

Source: “Akka Actor Introduction”, Gene

56 Akka

Source: “Introducing Akka”, Jonas Bonér

• Vision• Simpler: Concurrency, Scalability, Fault-Tolerance• With a single unified• Programming Model• Managed Runtime• Open Source Distribution

• Manage System Overload (backpressure)• Scale up & Scale out• Program to a higher level• No more shared state, state visibility, threads, locks, concurrent collections, thread

notifications• Low level concurrency built into the plumbing; it becomes simple workflow just

deal with messages• Increases CPU utilization, lowers latency, high throughput, scalable!• Superior, Proven model to detect and recover from errors

57 Akka: Perfect for the Cloud


• Elastic and dynamic• Fault tolerant & self healing (autonomic)• Adaptive load-balancing, cluster rebalancing & Actor migration• Build loosely coupled systems that can dynamically adapt at runtime

58 Akka 101


• Akka’s unit of code is called an Actor• The vehicle to create concurrent, scalable, fault-tolerant apps atop the

fabric• Encapsulates code like servlets or session beans; policy decisions

separated from biz logic• Actors have been around since 1973, and if you’ve ever used a

phone, that software helped make it work. 9 nines of uptime!• Think of an actor as a VM in the cloud (it isn’t, but)• Encapsulated, decoupled• Managing own memory and behavior• Communicates asynchronously with non-blocking messages• Elastic – grow/shrink on demand• Hot deploy, change runtime behavior

59 Akka: How to use Actors?


• Alternative to:• A thread• An object instance or component• Callback or listener• Singleton or service• Router, load-balancer or pool• Session bean or MDB• Out of process service• A Finite State Machine (FSM)

60 Akka: What is it?

Source: http://bit.ly/hewitt-on-actors

• Carl Hewitt’s definition• Fundamental unit of computation that embodies:• Processing• Storage• Communication

• 3 Axioms – When an actor receives a message it can:• Create new Actors• Send messages to Actors it knows• Designate how it should handle the next message it receives

http://bit.ly/hewitt-on-actors

61 Akka: Core Actor Operations


0. Define1. Create2. Send3. Become4. Supervise

62 Akka: Define Operation


0. Define

• Define the message (class) the actor should respond to, and Actor class

63 Akka: Create Operation


1. Create

• Yes, creates new actor. From ActorSystem, then ActorRef.• Lightweight, 2.6M per Gb RAM• Strong encapsulation of: state/behavior (indistinguishable), message queue

64 Akka: Send Operation


2. Send

• Sends a message to an Actor• Asynch and on-blocking (fire & forget)• Everything is Reactive• Actor is passivated until receiving a message, which triggers it to awaken• Messages are energy

• Everything is asynch and lockless

65 Akka: Performance


+50 million messages per second !!!

66 Akka: Remote Deployment


67 Akka: Become Operation


3. Become

• Dynamically redefine Actor’s behavior• Reactively triggered by receipt of a message• Will not react differently to messages it receives• Behaviors are stacked – can by pushed and popped…

(Think in terms of the object changed it’s type – interface, protocol, implementation)

68 Akka: Become Operation – Why?


Why do this?• A busy actor can become an Actor Pool or Router!• Implement FSM (Finite State Machine)• Implement graceful degradation• Spawn empty workers that can ”Become” whatever the Master

desires• Very useful. Limited only by your imagination.

69 Akka: Failure Management


In Java/C/C+ etc.• Single thread of control• If that thread blows up you are screwed• Only option to do explicit error handling within your single thread• Errors isolated within thread; other threads have no clue• Results in tons of defensive code, scattered throughout the codebase

and entangled in your business logic

70 Akka: Supervise Operation


4. Supervise

• Manage another Actor’s failures (or the person sitting next to you)• Let actors monitor (supervise) each other for failure, then respond• Notification sent to Actor’s supervisor if failure occurs. • Neat separation of processing from error handling

71 Akka Streaming

Source: Akka Documentation

• Akka Streams API is completely decoupled from Reactive Streams interfaces• Implementation details to pass stream data between processing stages• Akka Streams is interoperable with any conformant Reactive Streams i

• Principles• All features explicit in the API• Compositionality; combined pieces retain the function of each part• Model of domain of distributed bounded stream processing

• Reactive Streams -> JDK9 Flow APIs

72 Akka Streaming


• Immutable building blocks / blueprints enabled for libraries, include• Source - something with exactly one output stream• Sink - something with exactly one input stream• Flow - something with exactly one input and one output stream• BidiFlow - something with exactly two input streams and two output streams that conceptually

behave like two Flows of opposite direction• Graph - a packaged stream processing topology that exposes a certain set of input and output ports,

characterized by an object of type Shape.• Built in backpressure capability• No stage can push downstream unless it received a pull beforehand

• Difference between error and failure• Error is accessible within the stream as a data element (signaled via onNext)• Failure means the stream itself has collapsed (signaled via onError).• Want failure to propagate faster than data (essential, to deal with backpressure)• Data elements emitted before a failure can still be lost of the onError overtakes them• Recovery element acts as bulkhead to confine a stream collapse to a given region of stream

topology, to isolate outside from impact of collapsed region (e.g. buffered elements)

73 Akka HTTP


• Built atop Akka Streams• Can expose an incoming connection in form of a Source instance• To start listening on network with Akka HTTP, create a Route and bind it to a port

(similar syntax to Spray).• Backpressure on source? Akka HTTP stops consuming data from network; eventually

leads to 0 TCP window – applying backpressure to sending party itself (e.g. a sensor)• Rules• Libraries shall provide their users with reusable pieces, i.e. expose factories that return

graphs, allowing full compositionality• Avoid destruction of compositionality.• Express functionality of a library such that materialization can be done by user

outside of library’s control.• Libraries may optionally and additionally provide facilities that consume and

materialize graphs• Allows a library to provide convenience “sugar” for use cases

74

• Akka Streams Integration• https://github.com/akka/alpakka• http://developer.lightbend.com/docs/alpakka/current/

• Adds interesting capabilities to Akka Streams• Modern alternative to Apache Camel (EIP

implementation)• Camel en ze Akka

• Community driven, focused on connectors to external libraries, integration patterns and conversions.• ”A call to arms”• https://github.com/akka/alpakka/releases/tag/v0.1

Alpakka 101

https://github.com/akka/alpakka

https://github.com/akka/alpakka

http://developer.lightbend.com/docs/alpakka/current/

http://developer.lightbend.com/docs/alpakka/current/

https://github.com/akka/alpakka/releases/tag/v0.1


75

The Data Lake for Analytics, App Dev“Lake Wobegon, the little town that time forgot

and the decades cannot improve.”- Garrison Keillor

76 The Data Lake

• Note: This section will be expanded for DevNexus.

77 The Data Lake

• The Data Lake is where a copy of much of the data from source sytems “ends up”, via Fast Data, etc.

• Easily accessible, massive repository of data built on commodity hardware (or Cloud).

• Data is not stored in a way that is optimized for data analysis (S3)• Data Lake retains all attributes• Beware the Data Lake fallacy: http://

www.gartner.com/newsroom/id/2809117• Let’s combine all this data to drive increased information sharing, usage,

while reducing cost through consolidation / tech simplication.• Does it really work?• Has the ideal of Enterprise-wide data management been realized?• Deriving value from data still in hands of business end user (enter: Fast Data Platforms)

http://www.gartner.com/newsroom/id/2809117

http://www.gartner.com/newsroom/id/2809117

78 The Dark Side of the Data Lake

Source: “Gartner Says Beware of the Data Lake Fallacy”

• Many companies tend to vacuum data into a Hadoop for later use• Many companies use overlapping tools within the same ecosystem, that do not

interoperate• Data lakes ignore how/why data is used, governed, defined and secured.• Does this sound like a good solution?• Data Lake solves old problem of siloing data. Great, so now it’s all in comingled. • Federated query? AWS Athena? Why move data if not necessary?

• Inability to quantitatively measure data quality• Accepts any data without governance or oversight• Accepts any data without metadata (description)• Inability to share lineage of findings by other analysis to share found value• Security, access control (and tracking of both)• Data Ownership, Entitlements?• Tenancy?• Regard for regulatory controls, compliance issues?• What to do?

79 More to Come…

Source: “Gartner Says Beware of the Data Lake Fallacy”

• At DevNexus 2017• Thurs, Feb 23 @ 2:30pm• http://devnexus.com/s/devnexus2017/presentations/17212

80

Presentation Improvements for DevNexus“Build something 100 people love,

not something 1 million people kind of like.”- Brian Chesky

• More diagrams, fewer words• Refine, refine, refine• Mix in coding examples• Improve contrast between older architectures and reactive, for the

Enterprise• Content that contrasts different streaming options (Akka, Spark,

Kafka)• Add specific performance details• Incorporate additional, interesting content (incl. Data Lake related)

Planned Improvements

81

82

Questions?“I'm sorry, if you were right, I'd agree with you.”

- Robin Williams

83

• The Reactive Manifesto (http://www.reactivemanifesto.org/)• Chaos Monkey? Use Linear Fault Driven Testing instead.• https://www.lightbend.com/blog/architect-reactive-design-patterns• http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html• https://

www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka• https://kafka.apache.org• http://www.slideshare.net/ducasf/introduction-to-kafka• http://www.slideshare.net/SparkSummit/grace-huang-49762421• http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms• https://github.com/akka/alpakka • http://developer.lightbend.com/docs/alpakka/current/• https://github.com/akka/alpakka/releases/tag/v0.1• http://www.slideshare.net/LisaHua/spark-overview-37479609• http://spark.apache.org/• https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/• http://www.slideshare.net/gene7299/akka-actor-presentation• http://www.slideshare.net/jboner/introducing-akka• http://bit.ly/hewitt-on-actors• http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html• https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/

Resources

http://www.reactivemanifesto.org/)

http://www.reactivemanifesto.org/)



http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html

http://www.infoworld.com/article/2608040/big-data/fast-data--the-next-step-after-big-data.html

https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka

https://www.lightbend.com/blog/lessons-learned-from-paypal-implementing-back-pressure-with-akka-streams-and-kafka

https://kafka.apache.org/

https://kafka.apache.org/

http://www.slideshare.net/ducasf/introduction-to-kafka

http://www.slideshare.net/ducasf/introduction-to-kafka

http://www.slideshare.net/SparkSummit/grace-huang-49762421

http://www.slideshare.net/SparkSummit/grace-huang-49762421

http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms

http://www.slideshare.net/HadoopSummit/performance-comparison-of-streaming-big-data-platforms






http://www.slideshare.net/LisaHua/spark-overview-37479609

http://www.slideshare.net/LisaHua/spark-overview-37479609

http://spark.apache.org/

http://spark.apache.org/

https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/

https://www.realdbamagic.com/intro-to-apache-spark-2016-slides/

http://www.slideshare.net/gene7299/akka-actor-presentation

http://www.slideshare.net/gene7299/akka-actor-presentation

http://www.slideshare.net/jboner/introducing-akka

http://www.slideshare.net/jboner/introducing-akka



http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html

http://tech.measurence.com/2016/06/01/a-dive-into-akka-streams.html

https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/

https://infocus.emc.com/rachel_haines/is-the-data-lake-the-best-architecture-to-support-big-data/

Ideas

84

Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark

Software

Transcript of Building Reactive Fast Data & the Data Lake with Akka, Kafka, Spark