AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics (BDM304)

Page 1: AWS re:Invent 2016: Analyzing Streaming Data in Real-time with Amazon Kinesis Analytics (BDM304)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

November 2016

BDM304

Analyzing Streaming Data in Real-Time with Amazon Kinesis Analytics

Allan MacInnis, Senior Solutions Architect, AWS

Page 2

What to Expect from the Session

• Streaming data overview
• Amazon Kinesis platform review
• Amazon Kinesis Analytics overview
• Amazon Kinesis Analytics patterns
• Streaming data end-to-end example and walk-through
• Amazon Kinesis Analytics best practices

Page 3

Streaming Data Overview

Page 4

Most data is produced continuously

• Mobile apps
• Web clickstream
• Application logs
• Metering records
• IoT sensors
• Smart buildings

Sample log entry:

[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test

Page 5

The diminishing value of data

Recent data is highly valuable
• If you act on it in time
• Perishable Insights (M. Gualtieri, Forrester)

Old + recent data is more valuable
• If you have the means to combine them

Page 6

Processing real-time, streaming data

What are the key requirements?
• Durable
• Continuous
• Fast
• Correct
• Reactive
• Reliable

Stages: Ingest, Transform, Analyze, React, Persist

Page 7

Amazon Kinesis Platform Overview

Page 8

Amazon Kinesis makes it easy to work with real-time streaming data

Amazon Kinesis Streams
• For technical developers
• Collect and stream data for ordered, replayable, real-time processing

Amazon Kinesis Firehose
• For all developers, data scientists
• Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service

Amazon Kinesis Analytics
• For all developers, data scientists
• Easily analyze data streams using standard SQL queries

Page 9

Amazon Kinesis Streams

• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data

Page 10

Amazon Kinesis Firehose

• Reliably ingest and deliver batched, compressed, and encrypted data to S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES)
• Point-and-click setup with zero administration and seamless elasticity

Page 11

Amazon Kinesis Analytics

• Interact with streaming data in real time using SQL
• Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms

Page 12

Amazon Kinesis Analytics: Service Overview

Page 13

Amazon Kinesis Analytics

• Pay for only what you use
• Automatic elasticity
• Standard SQL for analytics
• Real-time processing
• Easy to use

Page 14

Use SQL to build real-time applications

• Connect to streaming source
• Easily write SQL code to process streaming data
• Continuously deliver SQL results

Page 15

Connect to streaming source

• Streaming data sources include Amazon Kinesis Firehose or Amazon Kinesis Streams
• Input formats include JSON, CSV, variable column, or unstructured text
• Each input has a schema; the schema is inferred, but you can edit it
• Reference data sources (S3) for data enrichment

Page 16

Write SQL code

• Build streaming applications with one to many SQL statements
• Robust SQL support and advanced analytic functions
• Extensions to the SQL standard to work seamlessly with streaming data
• Support for at-least-once processing semantics

Page 17

Continuously deliver SQL results

• Send processed data to multiple destinations
  • S3, Amazon Redshift, Amazon ES (through Firehose)
  • Streams (with AWS Lambda integration for custom destinations)
• End-to-end processing speed as low as sub-second
• Separation of processing and data delivery

Page 18

What are common uses for Amazon Kinesis Analytics?

Page 19

Generate time series analytics

• Compute key performance indicators over time periods
• Combine with static or historical data in S3 or Amazon Redshift

[Diagram: Streams and Firehose as sources into Analytics; outputs via Firehose to Amazon Redshift and S3, and via Streams to custom, real-time destinations]

Page 20

Feed real-time dashboards

• Validate and transform raw data, and then process to calculate meaningful statistics
• Send processed data downstream for visualization in BI and visualization services

[Diagram: sources (Streams, Firehose), Analytics, Amazon ES, Amazon Redshift, Amazon RDS, Amazon QuickSight]

Page 21

Create real-time alarms and notifications

• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications

[Diagram: sources (Streams, Firehose), Analytics, output Streams, Lambda, Amazon SNS, Amazon CloudWatch]

Page 22

Example: NHL Tweet Analysis

Page 23

Example Scenario Requirements

Data to capture
• Filter for NHL-related tweets
• Total number of tweets per hour that contain hashtags for NHL teams
• Top 5 NHL-related hashtags per hour

Output requirements
• Filtered tweets are saved to Amazon S3
• Hourly aggregate count is saved to Amazon ES
• Full team names of the top 5 hashtags are saved to Amazon ES

Page 24

Why use Amazon Kinesis Analytics for this solution?

Challenges
• Twitter stream can be noisy
• Tweet structure is complex, with several levels of nested JSON
• NHL-related tweet volume is cyclical

With Amazon Kinesis Analytics:
• Easily filter out unwanted tweets
• Normalize tweet schema for simple SQL queries
• Automatically scale to meet demand

Page 25

End-to-End Architecture

[Diagram: EC2 instance → Amazon Kinesis Streams → Amazon Kinesis Analytics (with reference data) → Amazon Kinesis Firehose → Amazon S3 and Amazon Elasticsearch Service]

Page 26

Twitter Data Input

EC2 instance → Amazon Kinesis stream

Source JSON data
• Hundreds of attributes per tweet
• ~5 KB per tweet
• Most attributes are unnecessary for the use case

Amazon Kinesis Streams input
• 5 attributes per record
• ~250 B per record
• Relevant attributes only

Page 27

How are tweets mapped to a schema?

Amazon Kinesis stream record:

{
  "id": 795296435386388500,
  "text": "#Canes game tonight! #NHL",
  "created_at": "11-06-2016 16:07:00",
  "tags": [{
    "tag": "Canes"
  }, {
    "tag": "NHL"
  }]
}

Source data for Amazon Kinesis Analytics:

id     text      created_at    tag
795…   #Canes…   11-06-2016…   Canes
795…   #Canes…   11-06-2016…   NHL
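The mapping from one nested record to one row per tag can be sketched in plain Python. This is only an illustration of the row shape shown on the slide, not how the service performs the mapping; the function name `flatten_tweet` is hypothetical.

```python
import json

# The sample record from the slide, as it would sit in the stream.
SAMPLE_RECORD = """
{
  "id": 795296435386388500,
  "text": "#Canes game tonight! #NHL",
  "created_at": "11-06-2016 16:07:00",
  "tags": [{"tag": "Canes"}, {"tag": "NHL"}]
}
"""


def flatten_tweet(record_json):
    """Return one flat row per entry in the record's "tags" array.

    Hypothetical helper: it repeats the top-level columns for each tag,
    which matches the two-row table shown on the slide.
    """
    record = json.loads(record_json)
    base = {key: record[key] for key in ("id", "text", "created_at")}
    return [dict(base, tag=entry["tag"]) for entry in record.get("tags", [])]


rows = flatten_tweet(SAMPLE_RECORD)
for row in rows:
    print(row["id"], row["text"], row["created_at"], row["tag"])
```

Each nested tag becomes its own row, so later queries can filter and group on a plain `tag` column.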

Page 28

How is streaming data accessed with SQL?

STREAM
• Analogous to a TABLE
• Represents continuous data flow

CREATE OR REPLACE STREAM "NHL_TWEET_STREAM" (
    ID BIGINT, TWEET_TEXT VARCHAR(140), HASHTAG VARCHAR(140));

PUMP
• Continuous INSERT query
• Inserts data from one in-application stream to another

CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
    INSERT INTO "NHL_TWEET_STREAM"
    SELECT STREAM * FROM . . .

Page 29

How do we model our data?

Amazon Kinesis stream → SOURCE_STREAM
• id
• text
• tag

SELECT STREAM *
FROM "SOURCE_STREAM"
WHERE "HASHTAG" NOT IN <exclude list>

NHL_TWEET_STREAM
• ID
• TEXT
• HASHTAG

TeamName (reference data)
• hashtag
• team

TOTAL_TWEETS_STREAM → Amazon Kinesis Firehose
• TWEET_COUNT

MENTION_COUNT_STREAM → Amazon Kinesis Firehose
• TEAMNAME
• MENTIONCOUNT

Page 30

How do we filter unwanted tweets?

Use PUMP to insert filtered data into STREAM

SOURCE_STREAM
• id
• text
• tag

NHL_TWEET_STREAM
• ID
• TEXT
• HASHTAG

CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
    INSERT INTO "NHL_TWEET_STREAM"
    SELECT STREAM "id", "text", LOWER("tag")
    FROM "SOURCE_STREAM"
    WHERE LOWER("tag") NOT IN ('nfl', 'mlb', 'nba');
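For readers less familiar with pumps, the filter above behaves roughly like a Python generator that reads rows from one stream and yields the surviving rows into another. This is a loose analogy under that assumption, not the service's execution model; `nhl_tweet_pump` and the sample rows are made up for illustration.

```python
# Tags excluded by the WHERE clause on the slide.
EXCLUDED_TAGS = {"nfl", "mlb", "nba"}


def nhl_tweet_pump(source_stream):
    """Continuously pull rows from the source and emit filtered rows.

    Mirrors the slide's query: lower-case the tag, drop excluded tags,
    and rename columns to the NHL_TWEET_STREAM schema.
    """
    for row in source_stream:
        tag = row["tag"].lower()
        if tag not in EXCLUDED_TAGS:
            yield {"ID": row["id"], "TEXT": row["text"], "HASHTAG": tag}


# Hypothetical sample rows standing in for SOURCE_STREAM.
source_rows = [
    {"id": 1, "text": "#Canes game tonight! #NHL", "tag": "Canes"},
    {"id": 2, "text": "Sunday football #NFL", "tag": "NFL"},
    {"id": 3, "text": "Go #Oilers", "tag": "Oilers"},
]

filtered = list(nhl_tweet_pump(source_rows))
for row in filtered:
    print(row)
```

Because the generator is lazy, it processes rows as they arrive rather than waiting for a complete data set, which is the property the pump analogy is meant to convey.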

Page 31

How do we get team name from the hashtag?

• Create a CSV file in S3 that maps hashtags to team names
• Configure the Amazon Kinesis Analytics application to import the file as reference data
• Reference data appears as a table
• Join streaming data on reference data

s3://mybucket/team_map.csv

hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks

Page 32

Use Reference Data in Query

NHL_TWEET_STREAM
• ID
• TEXT
• HASHTAG

TeamName (s3://mybucket/team_map.csv)
• hashtag
• team

hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks

SELECT STREAM tn."team"
FROM "NHL_TWEET_STREAM" tweets
INNER JOIN "TeamName" tn
    ON tweets."HASHTAG" = LOWER(tn."hashtag")

Result:

New York Rangers
New York Rangers
Carolina Hurricanes
Edmonton Oilers
San Jose Sharks
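The join against reference data amounts to a lookup keyed on the lower-cased hashtag. A minimal Python sketch of that idea follows, assuming the CSV content shown on the slide; the helper names `load_team_map` and `join_team_names` are hypothetical, and the service itself loads the file from S3 rather than from a string.

```python
import csv
import io

# Contents of the reference file shown on the slide.
TEAM_MAP_CSV = """hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks
"""


def load_team_map(csv_text):
    """Build a lookup from lower-cased hashtag to full team name."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["hashtag"].lower(): row["team"] for row in reader}


def join_team_names(tweets, team_map):
    """Inner join: emit a team name only for hashtags found in the map."""
    for tweet in tweets:
        team = team_map.get(tweet["HASHTAG"])
        if team is not None:
            yield team


team_map = load_team_map(TEAM_MAP_CSV)
# HASHTAG values are already lower-cased by the filtering pump.
tweets = [{"HASHTAG": "canes"}, {"HASHTAG": "nhl"}, {"HASHTAG": "nyr"}]
teams = list(join_team_names(tweets, team_map))
print(teams)
```

Rows whose hashtag has no entry in the map (such as `nhl` here) are dropped, which is the inner-join behavior the query relies on.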

Page 33

How do we aggregate streaming data?

• A common requirement in streaming analytics is to perform set-based operations (count, average, max, min, ...) over events that arrive within a specified period of time
• Cannot simply aggregate over an entire table, as with a typical static database
• How do we define a subset in a potentially infinite stream?
  • Windowing functions!

Page 34

Windowing Concepts

• Windows can be tumbling or sliding
• Windows are fixed length

Output record will have the timestamp of the end of the window

[Diagram: events with values 1, 5, 4, 2, 6, 8, 6, 4 arrive along a time axis (t1 to t6) and fall into Window 1, Window 2, and Window 3; the aggregate function (sum) produces output events 18 and 14]
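A tumbling-window sum can be sketched in a few lines of Python. The timestamps below are hypothetical (the slide does not give them); they are chosen so that the first two windows sum to 18 and 14, as in the diagram. The sketch works on a finished list, whereas a streaming engine emits each result when its window closes.

```python
def tumbling_window_sum(events, window_size):
    """Sum values per fixed, non-overlapping window.

    events: iterable of (timestamp_seconds, value) pairs.
    Returns (window_end, total) pairs ordered by time, stamping each
    output with the end of its window as the slide describes.
    """
    sums = {}
    for timestamp, value in events:
        # Every timestamp belongs to exactly one window.
        window_start = (timestamp // window_size) * window_size
        sums[window_start] = sums.get(window_start, 0) + value
    return [(start + window_size, total) for start, total in sorted(sums.items())]


# Hypothetical timestamps, 10-second windows.
events = [(1, 1), (3, 5), (5, 4), (7, 2), (9, 6), (12, 8), (16, 6), (21, 4)]
window_sums = tumbling_window_sum(events, window_size=10)
print(window_sums)
```

The third window appears in this batch version because all events are already known; in a live stream it would not be emitted until that window ended.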

Page 35

Comparing Types of Windows

• Output created at the end of the window
• The output of the window will be a single event based on the aggregate function used

Tumbling window: aggregate per time interval
Sliding window: windows constantly re-evaluated
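To contrast with a tumbling window, here is a rough Python sketch of one common reading of a sliding window: the aggregate is re-evaluated on every arriving event over the trailing interval. The function name, sample events, and the exact boundary rule (events older than or equal to `timestamp - window_size` are evicted) are assumptions for illustration, not the service's definition.

```python
from collections import deque


def sliding_window_sums(events, window_size):
    """Emit a running sum for each event over the trailing window.

    events: (timestamp_seconds, value) pairs, assumed ordered by time.
    For each event, the sum covers events with timestamps in
    (timestamp - window_size, timestamp].
    """
    window = deque()
    running_total = 0
    results = []
    for timestamp, value in events:
        window.append((timestamp, value))
        running_total += value
        # Evict events that have slid out of the trailing interval.
        while window[0][0] <= timestamp - window_size:
            _, expired_value = window.popleft()
            running_total -= expired_value
        results.append((timestamp, running_total))
    return results


# Hypothetical events, 10-second window.
sliding = sliding_window_sums([(1, 1), (3, 5), (5, 4), (12, 8)], window_size=10)
print(sliding)
```

Unlike the tumbling case, every input event produces an output, and consecutive windows overlap.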

Page 36

How do we aggregate team mentions per hour?

• Use TOP_K_ITEMS_TUMBLING function
• Pass cursor to team name stream
• Define window size of 3600 seconds

INSERT INTO "MENTION_COUNT_STREAM"
SELECT STREAM *
FROM TABLE(TOP_K_ITEMS_TUMBLING(
    CURSOR(SELECT STREAM tn."team"... ),
    'team', -- name of column to aggregate
    5,      -- number of top items
    3600    -- tumbling window size in seconds
));
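Conceptually, a top-K query over a tumbling window counts occurrences per window and keeps the K most frequent values. The Python sketch below does an exact in-memory count to show the idea; it is not the service's implementation of TOP_K_ITEMS_TUMBLING, and the function name, timestamps, and sample rows are hypothetical.

```python
from collections import Counter, defaultdict


def top_k_per_window(rows, k, window_size):
    """Return (window_end, item, count) for the k most frequent items
    in each tumbling window.

    rows: iterable of (timestamp_seconds, item) pairs.
    """
    counts_by_window = defaultdict(Counter)
    for timestamp, item in rows:
        counts_by_window[timestamp // window_size][item] += 1

    results = []
    for window_index in sorted(counts_by_window):
        window_end = (window_index + 1) * window_size
        for item, count in counts_by_window[window_index].most_common(k):
            results.append((window_end, item, count))
    return results


# Hypothetical team mentions; the last one falls into the second hour.
mentions = [
    (10, "New York Rangers"),
    (20, "New York Rangers"),
    (30, "Edmonton Oilers"),
    (3700, "Edmonton Oilers"),
]
top_teams = top_k_per_window(mentions, k=1, window_size=3600)
print(top_teams)
```

With `k=5` and a 3600-second window, this would yield at most five rows per hour, which matches the "5 records, every hour" output described for this example.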

Page 37

Output to Amazon Kinesis Firehose

MENTION_COUNT_STREAM
• TEAMNAME
• MENTIONCOUNT

[Diagram: MENTION_COUNT_STREAM → Amazon Kinesis Firehose → Amazon Elasticsearch Service → Kibana]

{
  "aggregationtime": "2016-11-06T14:42:03.335",
  "teamname": "New York Rangers",
  "mentioncount": 604
}

5 records, every hour

Page 38

Visualize Results with Kibana

Page 39

Amazon Kinesis Analytics Best Practices

Page 40

Managing Applications

Set up Amazon CloudWatch alarms
• The MillisBehindLatest metric tracks how far the application is behind the source
• Alarm on the MillisBehindLatest metric. Consider triggering when the application is 1 hour behind, on a 1-minute average. Adjust accordingly for applications that need lower end-to-end processing latency.
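The alarm suggestion translates into a threshold of 3,600,000 ms evaluated on a 60-second average. The sketch below only builds a parameter dictionary in the shape accepted by CloudWatch's PutMetricAlarm operation; it makes no AWS call. The namespace string, the dimension name, the alarm name, and the application name are assumptions that should be checked against the service documentation before use.

```python
def millis_behind_alarm_params(application_name, hours_behind=1.0):
    """Build PutMetricAlarm-style parameters for MillisBehindLatest.

    Hypothetical helper: converts the "1 hour behind, 1-minute average"
    guidance into a millisecond threshold and a 60-second period.
    """
    threshold_ms = int(hours_behind * 60 * 60 * 1000)
    return {
        "AlarmName": f"{application_name}-millis-behind-latest",  # made-up naming scheme
        "Namespace": "AWS/KinesisAnalytics",  # assumption: verify the namespace
        "MetricName": "MillisBehindLatest",
        # Assumption: the metric may require additional dimensions.
        "Dimensions": [{"Name": "Application", "Value": application_name}],
        "Statistic": "Average",
        "Period": 60,  # 1-minute average, in seconds
        "EvaluationPeriods": 1,
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    }


params = millis_behind_alarm_params("nhl-tweet-analysis")  # hypothetical app name
print(params["Threshold"], params["Period"])
```

For an application with tighter latency needs, passing a smaller `hours_behind` (for example `0.25` for 15 minutes) lowers the threshold accordingly.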

Page 41

Managing Applications

Increase input parallelism to improve performance
• By default, a single source in-application stream is created
• If the application is not keeping up with the input stream, consider increasing input parallelism to create multiple source in-application streams

Page 42

Managing Applications

Limit number of applications reading from same source
• Avoid ReadProvisionedThroughputExceeded exceptions
• For an Amazon Kinesis Streams source, limit to 2 total applications
• For an Amazon Kinesis Firehose source, limit to 1 application

Page 43

Defining Input Schema

• Review and adequately test inferred input schema
• Manually update schema to handle nested JSON with greater than 2 levels of depth
• Use SQL functions in your application for unstructured data

Page 44

Authoring Application Code

• Avoid time-based windows greater than one hour
• Keep window sizes small during development
• Use smaller SQL queries, with multiple in-application streams, rather than a single, large query

Page 45

Limits

• Maximum row size in an in-application stream is 50 KB.
• Maximum input parallelism is 10 in-application streams.
• Each application supports one streaming source and one reference data source. The reference data source can be no larger than 1 GB in size.

Page 46

Pricing

• Pay only for what you use.
• Charged an hourly rate, based on the average number of Amazon Kinesis Processing Units (KPU) used to run your application.
• A single KPU provides one vCPU and 4 GB of memory.
• $0.11 per KPU-hour (US East).
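As a quick worked example of the pricing model, cost is the average number of KPUs times hours run times the hourly rate. The sketch uses the US East rate quoted on the slide (as of November 2016); the function name and the sample workload of 2 KPUs for 720 hours are made up for illustration.

```python
# Rate quoted on the slide for US East, as of the 2016 session.
KPU_HOURLY_RATE_USD = 0.11


def estimate_cost_usd(average_kpus, hours):
    """Estimate the charge for an application, rounded to cents."""
    return round(average_kpus * hours * KPU_HOURLY_RATE_USD, 2)


# Hypothetical workload: 2 KPUs on average over a 30-day month (720 hours).
monthly_cost = estimate_cost_usd(average_kpus=2, hours=720)
print(f"${monthly_cost:.2f}")
```

Because billing is based on the average KPU count, an application that scales down during quiet periods costs proportionally less than one pinned at its peak size.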

Page 47

Thank you!

Page 48

Remember to complete your evaluations!