A Practical Guide to Selecting a Stream Processing Technology

Post on 08-Jan-2017

Michael G. Noll
Product Manager, Confluent

Kafka Talk Series

Date    Title
Sep 27  Introduction to Streaming Data and Stream Processing with Apache Kafka
Oct 06  Deep Dive into Apache Kafka
Oct 27  Data Integration with Apache Kafka
Nov 17  Demystifying Stream Processing with Apache Kafka
Dec 01  A Practical Guide to Selecting a Stream Processing Technology
Dec 15  Streaming in Practice: Putting Apache Kafka in Production

https://www.confluent.io/apache-kafka-talk-series

Agenda

• Recap: What is Stream Processing?
• The Three Pillars of Stream Processing in Practice
• Key Selection Criteria
  • Organizational/Non-Technical Dimensions
  • Technical Dimensions
• Summary

Powered by Kafka (thousands more)

Spark Streaming API (2.0)

Kafka's Streams API (0.10)

Example: Streams and Tables in Kafka

Word    Count
hello   2
kafka   1
world   1
…       …
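The relationship between a stream and a table can be sketched in a few lines of plain Python. This is an illustration only, not Kafka code: the function names and the two input lines are assumptions chosen so that the result matches the word counts shown on the slide. Each incoming line updates a running count (the table), and every update is also emitted as a changelog record (a stream) from which the table can be rebuilt.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def word_count_changelog(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume a stream of text lines and yield one (word, new_count) update per word."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            counts[word] += 1
            yield word, counts[word]


def materialize(changelog: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Rebuild the table by keeping only the latest value per key."""
    table = {}
    for key, value in changelog:
        table[key] = value
    return table


# Hypothetical input stream, chosen to reproduce the counts on the slide.
stream = ["hello world", "hello kafka"]
updates = list(word_count_changelog(stream))
table = materialize(updates)
```

Here `updates` is the stream view (every change, in order) and `table` is the table view (latest count per word).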

Streams & Databases

• A stream processing technology must have first-class support for Streams and Tables
  • With scalability, fault tolerance, …
• Why? Because most use cases require not just one, but both!
• Support, or lack thereof, strongly impacts the resulting technical architecture and development efforts
  • No support means:
    • Painful Do-It-Yourself
    • Increased complexity, more moving pieces to juggle

Organizational/Non-Tech Dimensions

• Can your org understand and leverage the technology?
  • Familiarity with languages; intuitive concepts and APIs; trainings
• Are you permitted to use it in your organization?
  • Security features, licensing, open source vs. proprietary
• Can you continue to use it in the future?
  • Longevity of technology, licensing, vendor strength

Organizational/Non-Tech Dimensions

• Do you believe in the long-term vision?
  • Switching technologies in an organization is often expensive/slow: legacy migration, re-training, resistance to change, etc.
• What is the path and time to success?
  • Can you move smoothly and quickly from proof-of-concept to production?
• Areas and range of applicability in your organization
  • General-purpose vs. niche technology
  • Viable for S/M/L/XL use cases vs. for XL use cases only
  • Building core business apps vs. doing backend analytics

Organizational/Non-Tech Dimensions

• Licensing
• Vision/Roadmap
• ROI
• Impact on Organization
• Broad vs. Niche Applicability
• Time to Market
• Professional Services
• Documentation
• Examples
• User Community
• Learning Curve
• Impact on Tools, Infrastructure, …

Technical Dimensions

• State
• Abstractions
• Time Model
• Windowing
• Out of Order Data
• Reprocessing
• Scalability & Elasticity
• Fault Tolerance
• Security
• Processing Model
• API
• Dev/Ops Lifecycle

State

• Stateful processing of any kind requires…state
  • Many (most?) use cases for stream processing are stateful
  • Joins, aggregations, windowing, counting, ...
• Is state performant? Local vs. remote state?
• Is state fault-tolerant? How fast is recovery/failover?
• Is state interactively queryable?
  • Kafka: ready for use (GA)
  • Spark, Flink: under development (alpha)
  • Storm, Samza, and others: not available
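The three questions about state (local, fault-tolerant, queryable) can be made concrete with a small sketch. It is not taken from any of the frameworks named above: `LocalStateStore` is a hypothetical class, and an in-memory list stands in for a durable changelog. Writes go to the changelog first, reads are served from local memory, and a replacement instance recovers by replaying the changelog.

```python
class LocalStateStore:
    """Hypothetical key-value state store backed by a changelog (illustration only)."""

    def __init__(self, changelog=None):
        # The list stands in for a durable, replicated log (an assumption of this sketch).
        self.changelog = changelog if changelog is not None else []
        self._data = {}

    def put(self, key, value):
        self.changelog.append((key, value))  # log the change first so it can be replayed
        self._data[key] = value              # then update fast local state

    def get(self, key, default=None):
        # "Interactive query": read the current state directly from the local store.
        return self._data.get(key, default)

    @classmethod
    def restore(cls, changelog):
        """Rebuild local state after a failure by replaying the changelog."""
        store = cls(changelog)
        for key, value in changelog:
            store._data[key] = value
        return store


# Count events per user, then simulate failover to a fresh instance.
store = LocalStateStore()
for user in ["alice", "bob", "alice"]:
    store.put(user, store.get(user, 0) + 1)

recovered = LocalStateStore.restore(list(store.changelog))
```

Recovery time in this model grows with the length of the changelog, which is one reason "how fast is recovery/failover?" is worth asking of any candidate technology.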

Abstractions

• What are the data model and the available abstractions?
• Most common abstraction: stream of records, events
  • Kafka, Spark, Storm, Samza, Flink, Apex, ...
• New, very powerful: table of records
  • Currently unique to Kafka
  • Represents latest state and materialized views
  • State must have a first-class abstraction because, as we just saw in the previous section, state is crucial for stream processing!

Time Model

• Different use cases require different time semantics
  • Great majority of use cases require event-time semantics
  • Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
• A stream processing technology should, at a minimum, support event-time to cover most use cases in practice
  • Examples: Kafka, Beam, Flink
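A small sketch shows why the choice of time semantics changes results. The events, field names, and the 5-minute window size below are assumptions made for illustration. The second event was created just before a window boundary but processed just after it, so counting by event-time and counting by processing-time disagree.

```python
from collections import Counter

WINDOW_SECONDS = 300  # 5-minute windows (illustrative choice)


def window_start(timestamp: int) -> int:
    """Start of the fixed-size window that contains the timestamp."""
    return timestamp - timestamp % WINDOW_SECONDS


def count_per_window(events, time_field):
    """Count events per window, using the given timestamp field as the notion of time."""
    return dict(Counter(window_start(event[time_field]) for event in events))


# Hypothetical events; timestamps are seconds since an arbitrary start.
events = [
    {"user": "alice", "event_time": 10, "processing_time": 12},
    {"user": "bob", "event_time": 290, "processing_time": 305},  # delayed in transit
    {"user": "dave", "event_time": 320, "processing_time": 321},
]

by_event_time = count_per_window(events, "event_time")
by_processing_time = count_per_window(events, "processing_time")
```

With event-time the first window holds two events; with processing-time it holds only one, because the delayed event is attributed to the window in which it happened to be processed.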

Windowing

• Windowing is an operation that groups events
• Most commonly needed: time windows, session windows
• Examples:
  • Real-time monitoring: 5-minute averages
  • Reader behavior on a website: user browsing sessions

[Figure: input data, where colors represent different users' events (alice, bob, dave); rectangles denote different event-time windows; axes: processing-time and event-time]
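The two window types named above can be sketched as follows. This is a plain-Python illustration under stated assumptions: the function names are hypothetical, timestamps are seconds, and the 30-minute inactivity gap for sessions is an arbitrary choice rather than anything the talk prescribes.

```python
from collections import defaultdict


def tumbling_averages(readings, size=300):
    """Average (timestamp, value) readings per fixed window of `size` seconds."""
    sums = defaultdict(lambda: [0.0, 0])  # window start -> [running total, count]
    for timestamp, value in readings:
        start = timestamp - timestamp % size
        sums[start][0] += value
        sums[start][1] += 1
    return {start: total / count for start, (total, count) in sums.items()}


def session_windows(timestamps, gap=1800):
    """Group one user's event timestamps into sessions split by more than `gap` seconds of inactivity."""
    sessions = []
    for timestamp in sorted(timestamps):
        if sessions and timestamp - sessions[-1][-1] <= gap:
            sessions[-1].append(timestamp)  # still within the current session
        else:
            sessions.append([timestamp])    # inactivity gap exceeded: start a new session
    return sessions


averages = tumbling_averages([(0, 10.0), (60, 20.0), (310, 30.0)])
sessions = session_windows([0, 600, 5000, 5100])
```

Time windows have boundaries fixed in advance, whereas session windows are data-driven: their start and end depend on when each user was active.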

Out-of-order and late-arriving data

• Is very common in practice, not a rare corner case
  • Related to time model discussion

[Figure: (1) Users with mobile phones enter airplane, lose Internet connectivity. (2) Emails are being written during the 10h flight. (3) Internet connectivity is restored, phones will send queued emails now.]

• We want control over how out-of-order data is handled
  • Example:
    • We process data in 5-minute windows, e.g. compute statistics
    • When event arrives 1 minute late: update the original result!
    • When event arrives 2 hours late: discard it!
• Handling must be efficient because it happens so often
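The "update if slightly late, discard if very late" policy can be sketched with a grace period. Everything here is an assumption for illustration: `WindowedCounter` is a hypothetical class, lateness is measured against the highest event time seen so far, and the 10-minute grace period is arbitrary.

```python
class WindowedCounter:
    """Counts events per 5-minute event-time window, tolerating bounded lateness."""

    def __init__(self, size=300, grace=600):
        self.size = size        # window length in seconds
        self.grace = grace      # how long after its end a window still accepts updates
        self.counts = {}        # window start -> event count
        self.stream_time = 0    # highest event time observed so far
        self.dropped = 0        # events discarded for being too late

    def process(self, event_time):
        """Return True if the event updated a window, False if it was discarded."""
        self.stream_time = max(self.stream_time, event_time)
        start = event_time - event_time % self.size
        if start + self.size + self.grace <= self.stream_time:
            self.dropped += 1   # window closed for good: discard
            return False
        self.counts[start] = self.counts.get(start, 0) + 1  # update the original result
        return True


counter = WindowedCounter()
results = [
    counter.process(100),   # on time, window [0, 300)
    counter.process(330),   # on time, window [300, 600)
    counter.process(270),   # 1 minute behind stream time: window [0, 300) is updated
    counter.process(7500),  # stream time jumps ahead by about two hours
    counter.process(300),   # 2 hours late: window [300, 600) is closed, event discarded
]
```

The check is a single comparison per event, which matters because, as noted above, late data is the common case rather than the exception.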

Reprocessing

• Re-process data by rewinding a stream back in time
• Use cases in practice include
  • Correcting output data after fixing a bug
  • Facilitating iterative and explorative development
  • A/B testing
  • Processing historical data
  • Walking through "What If?" scenarios
• Also: often used behind-the-scenes for fault tolerance
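Reprocessing after a bug fix can be sketched as replaying a retained input log from an earlier offset with corrected logic. The log contents, the parsing functions, and the `run` helper are all hypothetical; the only assumption that matters is that the input is retained and can be re-read from a chosen position.

```python
# Retained input stream; a list stands in for a replayable log (an assumption of this sketch).
log = ["3", "4", "oops", "5"]


def run(records, parse, from_offset=0):
    """Process the log from the given offset and return the sum of parsed values."""
    total = 0
    for record in records[from_offset:]:
        total += parse(record)
    return total


def buggy_parse(record):
    # Bug: malformed records are counted as -1 instead of being ignored.
    return int(record) if record.isdigit() else -1


def fixed_parse(record):
    return int(record) if record.isdigit() else 0


first_output = run(log, buggy_parse)                     # wrong result produced by the buggy app
corrected_output = run(log, fixed_parse, from_offset=0)  # rewind to the start and re-process
```

The same rewind mechanism serves the other use cases listed above: a "What If?" run is simply a replay with a different `parse` or a different starting offset.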

Scalability, Elasticity, Fault Tolerance

• Can the technology scale according to your needs?
  • Desired latency, throughput?
  • Able to process millions of messages per second?
  • What is the minimum footprint?
• Expand/shrink capacity dynamically during operations?
  • Helps with resource utilization because most stream apps run continuously
• Resilience and fault tolerance
  • Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
  • Failover behavior and recovery time? Automated or manual?
  • Any negative impact of fault tolerance features on performance?
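The difference between these guarantees shows up as soon as a record is redelivered after a failure. The sketch below is illustrative only: the deliveries are hypothetical (offset, amount) pairs, and deduplicating by offset assumes records arrive in offset order. It shows one common way, not the only one, to get an effectively-once result on top of at-least-once delivery.

```python
def apply_at_least_once(deliveries):
    """Apply every delivered record; a redelivered record is counted twice."""
    balance = 0
    for _offset, amount in deliveries:
        balance += amount
    return balance


def apply_effectively_once(deliveries):
    """Skip records whose offset was already applied, so retries do not change the result."""
    balance = 0
    last_applied_offset = -1
    for offset, amount in deliveries:
        if offset <= last_applied_offset:
            continue  # duplicate caused by a retry: ignore
        balance += amount
        last_applied_offset = offset
    return balance


# Offset 1 is delivered twice, e.g. because an acknowledgement was lost.
deliveries = [(0, 10), (1, 5), (1, 5), (2, 7)]
naive_balance = apply_at_least_once(deliveries)
deduplicated_balance = apply_effectively_once(deliveries)
```

In a real system `last_applied_offset` would have to be stored together with the state it protects, which is why the guarantee for data delivery and the guarantee for state need to be evaluated together.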

Security

• To meet internal security policies, legal compliance, etc.
• Typical base requirements for stream processing applications:
  • Encrypt data-in-transit (e.g. from/to Kafka)
  • Authentication: "only some applications may talk to production"
  • Authorization: "access to sensitive data such as PII is restricted"
• The easier it is to use security features, the more likely they are to be used in practice

Processing Model

• True stream processing is record-at-a-time processing
  • Benefits include low latency (millisecs), dealing efficiently with out-of-order data
  • Can provide both low latency and high throughput via internal optimizations
  • Examples: Kafka, Storm, Samza, Flink, Beam
• Some processing technologies opt for (micro)batching
  • Micro-batching has no true benefits: consider it a technical workaround to shoehorn stream-like functionality into a tool
  • Suffers from significant overhead when dealing with e.g. out-of-order/late-arriving data, when performing windowed analyses (e.g. session windows)
  • Typically a strong blocker for use cases such as fraud detection or anything where "a few seconds" of latency is prohibitive
  • Examples: Spark, Storm (Trident), Hadoop*

API

• Choice of API is a subjective matter – skills, preference, …
• Typical options
  • Declarative, expressive API: operations like map(), filter()
  • Imperative, lower-level API: callbacks like process(event)
  • Streaming SQL: STREAM SELECT … FROM … WHERE …
  • In the best case you get not just one, but all three
• "Abstractions are great!"
• "Abstractions considered harmful!"
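The first two API styles can be contrasted on the same task. This sketch uses Python's built-in `map()` and `filter()` and a hypothetical `Processor` class with a `process(event)` callback. It mirrors the shape of the two styles only and does not reproduce any particular framework's API.

```python
# Hypothetical input events.
events = ["error: disk full", "info: started", "error: timeout"]


def declarative(stream):
    """Declarative style: describe the pipeline as a chain of map/filter operations."""
    return list(map(str.upper, filter(lambda event: event.startswith("error"), stream)))


class Processor:
    """Imperative style: the runtime calls process() once per event."""

    def __init__(self):
        self.output = []

    def process(self, event):
        if event.startswith("error"):
            self.output.append(event.upper())


declarative_result = declarative(events)

processor = Processor()
for event in events:  # stands in for the runtime's event loop
    processor.process(event)
```

Both produce the same output here. The imperative form is longer but leaves room for custom state and control flow, which is the trade-off behind the two opposing quotes about abstractions.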

Developer/Operations Lifecycle

• What should your daily work look and feel like?
  • "I like to do quick, iterative development" (modify/test/repeat)
  • "I want to decouple team roadmaps, project schedules"
• Big difference between App Model <-> Cluster Model
  • Testing, packaging, deployment, monitoring, operations
  • "Do I need to know Java (app) or YARN (cluster) for this?"
  • "I want reactive processing in containers that run on Mesos!"
• Rolling, no-downtime upgrades?
• Integration with existing Ops infra, tools, processes?

Summary

• What we covered is a good starting point
• But, no free lunch!
  • Understand what you need, and weigh criteria appropriately
  • Think end-to-end: idea, development, operations, troubleshooting
  • Think big-picture: future use cases, architecture, security, training, …
  • Do your own internal hackathons, proof-of-concepts
  • Do your own benchmarks
• If in doubt: simplicity beats complexity
  • Faster to learn, easier to understand, less likely to fail, …

Q&A Session


Coming Up Next

Date    Title                                                      Speaker
Dec 15  Streaming in Practice: Putting Apache Kafka in Production  Roger Hoover

https://www.confluent.io/apache-kafka-talk-series