Apache Spark and Oracle Stream Analytics

Copyright © 2014 Oracle and/or its affiliates. All rights reserved. | Oracle Confidential – Internal/Restricted/Highly Restricted 1 Oracle Stream Analytics Complex Event Processing for Apache Spark Streaming

Transcript of Apache Spark and Oracle Stream Analytics

Page 1: Apache Spark and Oracle Stream Analytics

Oracle Stream Analytics

Complex Event Processing for Apache Spark Streaming

Page 2: Apache Spark and Oracle Stream Analytics

Disclaimer

Opinions and views expressed here are my own and do not reflect the official position of Oracle Inc.

Page 3: Apache Spark and Oracle Stream Analytics

Underpinning Technologies

• Oracle Continuous Query Engine
  – Event-by-event processing
    • Every event is assigned a unique timestamp

• Apache Spark
  – Distributed computing framework for scale-out and fault tolerance

Spark Streaming + Oracle Stream Analytics = Highly Scalable and Fault-Tolerant Complex Event Processing Engine

Page 4: Apache Spark and Oracle Stream Analytics

Spark-CQP Runtime Architecture

Events From Kafka

Continuous Query Engine: Query Plans and Operator State

Finite State Automaton for Pattern Detection across Events

HDFS

Journaled CQP Engine State serialized and persisted to HDFS after computing each partition

CQP Engine State restored on Executor Restart and Recompute of a partition

Geo Sensing Cartridge for Real-time Spatial Analytics

RETE Rules for Conditional logic

Complex Pattern Detection, Temporal Queries, Spatial Queries, Contextual Queries, and Conditional Logic

Spark Executor JVMs

Page 5: Apache Spark and Oracle Stream Analytics

Why Oracle Continuous Query Processor?

• Complex event processing requires events to be processed one at a time
  – Real-world events originate at different points in time and must be processed as such
  – Each event must be processed individually, as identified by its own timestamp

• Micro-batching with Spark Streaming
  – All events in a batch are identified by the same batch timestamp and treated as one
  – There is no progression of time between events in the same batch
  – CEP applications seek correlations and patterns across events in time

Page 6: Apache Spark and Oracle Stream Analytics

How CQP Complements Spark Streaming

1. Continuous, event-by-event, and stateful query processing
2. Flexible temporal windows
3. Event ordering and application timestamps
4. Automatic progression of time
  – Automatic heartbeat generation to advance time
5. Pattern detection
  – Built-in finite state automaton
6. Geo Sensing for spatial data
7. Integrated rules engine based on RETE
8. Derived from ANSI SQL 99

Page 7: Apache Spark and Oracle Stream Analytics

1. Continuous and Stateful Query Processing

ORDER_ID  STATUS      VALUE
1001      PROCESSING  9000
1002      SHIPPED     7000
1003      OPEN        6000
1004      PROCESSING  2000
1005      SHIPPED     5000
1006      OPEN        4000
1007      PROCESSING  5000
1008      PREPARING   2000
1009      SHIPPED     7000

ISTREAM (
  SELECT STATUS, COUNT(*) AS STATUS_COUNT, SUM(VALUE) AS STATUS_TOTAL
  FROM ORDER_STREAM
  GROUP BY STATUS
)

STATUS      STATUS_COUNT  STATUS_TOTAL
PROCESSING  1             9000
SHIPPED     1             7000
OPEN        1             6000
PROCESSING  2             11000
SHIPPED     2             12000
OPEN        2             10000
PROCESSING  3             16000
PREPARING   1             2000
SHIPPED     3             19000

Input: Order Stream
Output: Continuous and stateful

Yes, fully fault tolerant, with accurate results even after Spark executor crashes and restarts!

Business Query: Continuously show the order count and order total, grouped by order status.

The output is also stateful. For example, a query that computes an average emits no output when a new event leaves the average unchanged, so no duplicates are sent downstream.
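The continuous, stateful behaviour of the query on this slide can be imitated in a few lines of plain Python. This is only an illustrative sketch, not the CQP engine or its API: the function name `continuous_group_by` and the in-memory `ORDERS` list are assumptions made for the example. It keeps a running count and total per status and emits the updated row after every order event.

```python
from collections import defaultdict

# The order events from the slide: (ORDER_ID, STATUS, VALUE).
ORDERS = [
    (1001, "PROCESSING", 9000), (1002, "SHIPPED", 7000), (1003, "OPEN", 6000),
    (1004, "PROCESSING", 2000), (1005, "SHIPPED", 5000), (1006, "OPEN", 4000),
    (1007, "PROCESSING", 5000), (1008, "PREPARING", 2000), (1009, "SHIPPED", 7000),
]


def continuous_group_by(events):
    """Yield (status, count, total) after each event (hypothetical helper)."""
    state = defaultdict(lambda: [0, 0])  # status -> [running count, running total]
    for _order_id, status, value in events:
        row = state[status]
        row[0] += 1
        row[1] += value
        # Every event changes its group's count, so every event produces output.
        yield status, row[0], row[1]


if __name__ == "__main__":
    for status, count, total in continuous_group_by(ORDERS):
        print(status, count, total)
```

The rows produced follow the output table on this slide, one row per input event, because the per-status state survives from one event to the next.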

Page 8: Apache Spark and Oracle Stream Analytics

2. Flexible Windows

• Windows based on the number of events, not just a time interval
  • Example: a window of the last 5 IBM stock quotes

• Dynamic windows based on the value of an event attribute
  • Example: a window based on the event attribute “Campaign Duration”

• Windows based on current intervals
  • Example: a window based on the current hour, current day, etc.
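To make the first window type concrete, here is a small Python sketch of a count-based window over the last 5 quotes. It is a generic illustration rather than CQP syntax: the helper name `last_n_average` and the choice of an average as the aggregate are assumptions.

```python
from collections import deque


def last_n_average(prices, n=5):
    """Yield the average of the most recent n prices after each new quote."""
    window = deque(maxlen=n)  # the oldest quote is evicted once n quotes are held
    for price in prices:
        window.append(price)
        yield sum(window) / len(window)


if __name__ == "__main__":
    for avg in last_n_average([10, 20, 30, 40, 50, 60]):
        print(avg)
```

The window advances by event count, so its span in wall-clock time varies with how quickly quotes arrive.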

Page 9: Apache Spark and Oracle Stream Analytics

3. Event-by-Event Processing

• Example Business Query
  • Flag a credit card with three or more transactions in a 3 second interval

• Spark Streaming
  • All events in the same batch have the same time
  • There is no progression of time between events in the same batch

• Oracle Continuous Query Engine
  • Every event in a batch is automatically assigned a timestamp on ingestion and treated as discrete

• Results
  • Spark Streaming could fail to raise alerts for the above query
  • Oracle CQP will execute correctly and flag such cards
  • Please see the graphic on the next slide

Page 10: Apache Spark and Oracle Stream Analytics

• Flag credit cards with three or more transactions in a 3 second interval
• Spark batch interval = 1 second
• Assume the event arrival rate is 2 events/second/batch
• Window length is 3 seconds

[Diagram: event timeline over batch times t1, t2, t3, t4, with the windows seen by each engine]

Spark Streaming windows:
E1, CC1  E2, CC2  E3, CC3  E4, CC4
E1, CC1  E2, CC2  E3, CC3  E4, CC4  E5, CC5  E6, CC2
E3, CC3  E4, CC4  E5, CC5  E6, CC2  E7, CC2  E8, CC6

Spark assigns batch time t4 to both events E7 and E8 and by default slides the window by the batch interval. Sliding by 1 second, Spark misses the fact that there were 3 transactions for CC2 at t=3.5.

CQP window:
E2, CC2  E3, CC3  E4, CC4  E5, CC5  E6, CC2  E7, CC2

CQP in Spark assigns a unique time to each individual event in the batch, and by default the window slides in nanoseconds rather than in multiples of the batch interval.
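The difference can be reproduced with a small plain-Python simulation. This is a sketch of the behaviour described on this slide, not Spark or CQP code. The arrival times in `EVENTS` are assumed for illustration (two events per 1-second batch, in milliseconds), and both function names are hypothetical.

```python
from collections import Counter, deque

# Assumed arrival times in milliseconds, two events per 1-second batch.
EVENTS = [
    (200, "CC1"), (700, "CC2"),    # batch ending at t1
    (1200, "CC3"), (1700, "CC4"),  # batch ending at t2
    (2200, "CC5"), (2700, "CC2"),  # batch ending at t3
    (3200, "CC2"), (3700, "CC6"),  # batch ending at t4
]
WINDOW_MS = 3000
BATCH_MS = 1000
THRESHOLD = 3


def flag_with_batch_time(events):
    """Treat every event as occurring at its batch end; slide once per batch."""
    flagged = set()
    last_batch = max(t for t, _ in events) // BATCH_MS + 1
    for b in range(1, last_batch + 1):
        batch_end = b * BATCH_MS
        counts = Counter(
            card for t, card in events
            # An event belongs to the window if its batch end time is inside it.
            if batch_end - WINDOW_MS < (t // BATCH_MS + 1) * BATCH_MS <= batch_end
        )
        flagged |= {card for card, n in counts.items() if n >= THRESHOLD}
    return flagged


def flag_with_event_time(events):
    """Give each event its own timestamp; re-evaluate the window per event."""
    window = deque()
    flagged = set()
    for t, card in events:
        window.append((t, card))
        # Drop events that fell out of the 3-second window ending at this event.
        while window[0][0] <= t - WINDOW_MS:
            window.popleft()
        if sum(1 for _, c in window if c == card) >= THRESHOLD:
            flagged.add(card)
    return flagged


if __name__ == "__main__":
    print("batch time:", flag_with_batch_time(EVENTS))
    print("event time:", flag_with_event_time(EVENTS))
```

With these assumed timestamps, the batch-time version never sees three CC2 transactions in one window, while the per-event version flags CC2 when E7 arrives, because its window spans E2 through E7.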

Page 11: Apache Spark and Oracle Stream Analytics

4. Application Timestamps and Event Ordering

• Progression of time can be based on an event field set by the application itself instead of system time
  • E.g. the OrderSubmissionTime field in an order event

• Late events could be flagged as out-of-band events

• Out-of-band events can be logged and processed by different applications
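One simple way to picture application timestamps and out-of-band events is the Python sketch below. It is a hedged illustration only: the helper `split_out_of_band`, the dictionary event shape, and the rule "an event is late if its timestamp is older than the latest one already seen" are assumptions, and the real engine may apply a different policy.

```python
def split_out_of_band(events, time_field="OrderSubmissionTime"):
    """Separate in-order events from late ones using an application time field."""
    in_order, out_of_band = [], []
    app_time = None  # application time advances only with in-order events
    for event in events:
        ts = event[time_field]
        if app_time is not None and ts < app_time:
            out_of_band.append(event)  # arrived after time had already moved past it
        else:
            in_order.append(event)
            app_time = ts
    return in_order, out_of_band


if __name__ == "__main__":
    orders = [
        {"id": "A", "OrderSubmissionTime": 10},
        {"id": "B", "OrderSubmissionTime": 20},
        {"id": "C", "OrderSubmissionTime": 15},  # late
        {"id": "D", "OrderSubmissionTime": 30},
    ]
    ok, late = split_out_of_band(orders)
    print([e["id"] for e in ok], [e["id"] for e in late])
```

The late list could then be logged or handed to a separate application, as the bullets above suggest.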

Page 12: Apache Spark and Oracle Stream Analytics

5. Automatic Progression of Time

• In Spark, there is no progression of time when there are no incoming events

• In Continuous Query Engine, time progresses even when there are no incoming events and application state is automatically adjusted via heartbeats.

• E.g., Raise an alert when order received event is not followed by order shipped event in 2 hours
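The order example can be sketched in Python to show why heartbeats matter: the alert has to fire at a moment when no event arrives. The class `ShipmentMonitor` and its method names are hypothetical, and times are plain seconds supplied by the caller, not the engine's own clock.

```python
TIMEOUT_S = 2 * 60 * 60  # two hours, in seconds


class ShipmentMonitor:
    """Tracks received orders and reports those not shipped within the timeout."""

    def __init__(self):
        self.pending = {}  # order_id -> time the order was received

    def on_heartbeat(self, now):
        """Advance time without an event; return order ids that are now overdue."""
        overdue = sorted(o for o, t in self.pending.items() if now - t > TIMEOUT_S)
        for order_id in overdue:
            del self.pending[order_id]  # alert once, then stop tracking
        return overdue

    def on_event(self, now, order_id, kind):
        """Handle a 'received' or 'shipped' event; time advances first."""
        alerts = self.on_heartbeat(now)
        if kind == "received":
            self.pending[order_id] = now
        elif kind == "shipped":
            self.pending.pop(order_id, None)
        return alerts


if __name__ == "__main__":
    monitor = ShipmentMonitor()
    monitor.on_event(0, "A", "received")
    monitor.on_event(3600, "B", "received")
    monitor.on_event(5000, "A", "shipped")
    print(monitor.on_heartbeat(10801))  # B has waited more than two hours
```

Without the `on_heartbeat` call, order B would only be noticed when some later event happened to arrive.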

Page 13: Apache Spark and Oracle Stream Analytics

6. Pattern Detection

Checks whether temperature readings from a transmission sensor are wobbling during a certain time interval.

The continuous query below checks for a W-pattern in temperature readings during a 10 minute interval and selects the support levels as output.

SELECT LAST(A.value), LAST(C.value)
FROM TEMP_STREAM
MATCH_RECOGNIZE (
  PARTITION BY DEVICE_ID
  PATTERN (A+ B+ C+ D+) DURATION OF 10 MINUTES
  DEFINE
    A AS (value < PREV(value)),
    B AS (value > PREV(value)),
    C AS (value < PREV(value)),
    D AS (value > PREV(value))
)

[Figure: W-shaped curve with segments A, B, C, D over a 10 minute span]
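The fall, rise, fall, rise matching expressed by the pattern A+ B+ C+ D+ can be approximated in Python for one device's readings. This is a simplified sketch, not the engine's finite state automaton: `find_w_pattern` is a hypothetical helper that uses a regular expression over the direction of each step, and it leaves out the 10 minute bound and the per-device partitioning.

```python
import re


def find_w_pattern(values):
    """Return (last A value, last C value) for the first W-pattern, else None."""
    # Classify each reading after the first by comparing it with its predecessor:
    # 'd' = lower than previous, 'u' = higher than previous, '-' = unchanged.
    steps = "".join(
        "d" if cur < prev else "u" if cur > prev else "-"
        for prev, cur in zip(values, values[1:])
    )
    match = re.search(r"(d+)(u+)(d+)(u+)", steps)
    if match is None:
        return None
    # steps[j] describes values[j + 1], so the last reading of a group that ends
    # at steps index end - 1 is values[end].
    return values[match.end(1)], values[match.end(3)]


if __name__ == "__main__":
    readings = [70, 65, 60, 66, 72, 64, 58, 63, 69]
    print(find_w_pattern(readings))
```

The two returned values are the bottoms of the two dips, which correspond to the support levels selected by LAST(A.value) and LAST(C.value).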

Page 14: Apache Spark and Oracle Stream Analytics

7. Geo Sensing for Spatial Streams

• OOTB Pattern Recognition
  • Near, Enter, Re-enter, Exit, Stay, Contains, Within Distance

• Continuous Proximity Monitoring
  • Continuous output of distance between a moving and a stationary object
    • E.g. Prepare dockyard for incoming freight
  • Continuous output of distance between two moving objects
    • E.g. Moving away or on a collision course?
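As a rough illustration of continuous proximity monitoring against a stationary object, the Python sketch below emits a distance and a "near" flag for every position update. It does not use the Geo Sensing cartridge: the haversine great-circle formula, the Earth radius constant, and the names `haversine_km` and `monitor_proximity` are all assumptions chosen for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def monitor_proximity(positions, dock, near_km):
    """Yield (distance_km, is_near) for each position of the moving object."""
    for lat, lon in positions:
        distance = haversine_km(lat, lon, dock[0], dock[1])
        yield distance, distance <= near_km


if __name__ == "__main__":
    freight_track = [(1.0, 0.0), (0.5, 0.0), (0.1, 0.0)]
    for distance, is_near in monitor_proximity(freight_track, dock=(0.0, 0.0), near_km=50):
        print(round(distance, 1), is_near)
```

For two moving objects, the same distance function could be applied to the latest position of each, and a shrinking distance over successive updates would suggest a converging course.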

Page 15: Apache Spark and Oracle Stream Analytics

8. RETE based Business Rules

• Integrated RETE engine for complex conditional business logic
  – Simple nested IF-THEN-ELSE statements
  – Write rules in any order; a state change will fire all affected rules

Page 16: Apache Spark and Oracle Stream Analytics

9. Derived from ANSI SQL 99

• Well-defined syntax and semantics
• Domain-specific language for stream processing
  – SQL-like, with very little learning curve

Page 17: Apache Spark and Oracle Stream Analytics

In short…

• Oracle Stream Analytics reduces application development and delivery times
  – Without Continuous Query Engine, Spark Streaming application developers would have to build pattern detection, state management, and fault tolerance into every streaming application they write!

• Wait, it’s mutual… Spark provides benefits to Continuous Query Engine too!
  – Leverages Spark for data ingestion
    • Kafka, Flume, JMS, etc.
  – Leverages Spark for scalability and HA

Page 18: Apache Spark and Oracle Stream Analytics

Leveraging scalability from Spark

• No CQP-specific tuning parameters
• Increased throughput via standard Spark parameters, e.g. number of Spark executors, total executor cores, etc.
• CQP exploits data locality with help from Spark
• CQP leverages the elasticity framework from Spark

Page 19: Apache Spark and Oracle Stream Analytics

CQP Linear Scalability with Spark (From our Labs…)

[Chart 1: Processing Time (seconds) for 40 Million Records; x-axis: w1(c3), w2(c5), w3(c7), w4(c9); y-axis: 0 to 70000]

[Chart 2: Avg. Processing Time Per Batch (milliseconds); x-axis: w1(c3), w2(c5), w3(c7), w4(c9); y-axis: 0 to 20000]

[Chart 3: Number of Batches Processed over 10 Minutes; x-axis: w1(c3), w2(c5), w3(c7), w4(c9); y-axis: 0 to 120]

Legend

wN = N workers (executors)

w2(c5) means 2 executors and a total of 5 cores across both executors

Page 20: Apache Spark and Oracle Stream Analytics

Fault Tolerance in OSA

• Automatic application state recovery via journaled (incremental) snapshots on worker failure
  • State of a Spark-CQP app is automatically fault tolerant

• Piggybacks on Spark’s ability to detect an executor crash and restart
  – CQP engine lifecycle is the same as the Spark executor lifecycle

• Piggybacks on Spark’s ability to “re-compute” the partition on failures
  – CQP engine is aware of a partition re-compute and initializes its state from persisted snapshots
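The journaling idea can be sketched in Python as follows. This is a toy model under stated assumptions, not the OSA implementation: a local directory stands in for HDFS, the state is a simple running total per key, and the class `JournaledEngine`, the JSON file layout, and the rule for detecting a recompute are all invented for the example.

```python
import json
import os
import tempfile


class JournaledEngine:
    """Keeps per-key totals and journals one incremental snapshot per partition."""

    def __init__(self, directory):
        self.directory = directory  # stand-in for an HDFS path
        self.state = {}
        self.applied = -1  # id of the last partition reflected in self.state

    def _restore(self, before):
        """Rebuild state from journal entries of partitions earlier than `before`."""
        self.state = {}
        for name in sorted(os.listdir(self.directory)):
            with open(os.path.join(self.directory, name)) as handle:
                entry = json.load(handle)
            if entry["partition"] >= before:
                break
            self.state.update(entry["delta"])
        self.applied = before - 1

    def process_partition(self, partition_id, events):
        # A restart or a recompute shows up as a partition that is not the next
        # one expected, so state is first rebuilt from the journal.
        if partition_id != self.applied + 1:
            self._restore(before=partition_id)
        delta = {}  # only the keys touched by this partition
        for key, value in events:
            delta[key] = delta.get(key, self.state.get(key, 0)) + value
        self.state.update(delta)
        path = os.path.join(self.directory, f"{partition_id:08d}.json")
        with open(path, "w") as handle:
            json.dump({"partition": partition_id, "delta": delta}, handle)
        self.applied = partition_id


if __name__ == "__main__":
    journal_dir = tempfile.mkdtemp()
    engine = JournaledEngine(journal_dir)
    engine.process_partition(0, [("SHIPPED", 7000)])
    engine.process_partition(1, [("SHIPPED", 5000), ("OPEN", 4000)])

    restarted = JournaledEngine(journal_dir)  # simulates an executor restart
    restarted.process_partition(1, [("SHIPPED", 5000), ("OPEN", 4000)])  # recompute
    print(restarted.state)
```

Because the recomputed partition starts from the state journaled before it, replaying partition 1 after the simulated restart does not double count its events.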

Page 21: Apache Spark and Oracle Stream Analytics

Spark Streaming + Oracle Stream Analytics = Highly Scalable and Fault-Tolerant Complex Event Processing System