Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Kinesis & Big Data: Meld real-time streaming with EMR (Hadoop) & Redshift (Data Warehousing)

Adi Krishnan, AWS Product Management, @adityak

Daniel Mintz, Director of BI, Upworthy, @danielmintz

July 10, 2014

Amazon Kinesis & Big Data

o Motivations for Stream Processing

Origins: Internal metering capability

Expanding the big data processing landscape

o Customer view on streaming data

o Amazon Kinesis Overview

Amazon Kinesis Architecture

Kinesis concepts & Demo

o Amazon Elastic MapReduce and Kinesis

EMR connector morphs Kinesis streamed data into Hadoop framework

Applying Hadoop frameworks to streaming data

o Amazon Kinesis and Redshift:

Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”

Presented by Daniel Mintz, Director of Business Intelligence, Upworthy

The Motivation for Continuous Processing

Origins: Internal AWS Metering Capability

Workload

• 10s of millions of records/sec

• Multiple TB per hour

• 100,000s of sources

Pain points

• Doesn’t scale elastically

• Customers want real-time alerts

• Expensive to operate

• Relies on eventually consistent storage

Expanding the Big Data Processing Landscape

Traditional Data Warehousing

• Query engine approach

• Pre-computations such as indices and dimensional views improve performance

• Historical, structured data

Hadoop Style Processing

• Hive / SQL-on-Hadoop / MapReduce / Spark

• Batch programs, or other abstractions that break down into MapReduce-style computations

• Historical, semi-structured data

Stream Processing

• Custom computations of relatively simple complexity

• Continuous processing (filters, sliding windows, aggregates) on infinite data streams

• Semi-structured or structured data, generated continuously in real time

A Generalized Data Flow

Many different technologies, at different stages of evolution

Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting

Our Big Data Transition

Old Posture

• Capture huge amounts of data and process it in hourly or daily batches

New Requirements

• Make decisions faster, sometimes in real-time

• Scale entire system elastically

• Make it easy to “keep everything”

• Multiple applications can process data in parallel

Foundation for Data Streams Ingestion, Continuous Processing

Right Toolset for the Right Job

Real-time Ingest

• Highly Scalable

• Durable

• Elastic

• Replay-able Reads

Continuous Processing FX

• Load-balancing incoming streams

• Fault-tolerance, Checkpoint / Replay

• Elastic

• Enable multiple apps to process in parallel

Enable data movement into Stores/ Processing Engines

Managed Service

Low end-to-end latency

Continuous, real-time workloads

Customer View

Customer Scenarios across Industry Segments

Scenarios: (1) Accelerated Ingest-Transform-Load; (2) Continual Metrics/KPI Extraction; (3) Responsive Data Analysis

Data Types: IT infrastructure, application logs, social media, financial market data, web clickstreams, sensors, geo/location data

Software/Technology

• (1) IT server and app log ingestion

• (2) IT operational metrics dashboards

• (3) Device/sensor operational intelligence

Digital Ad Tech/Marketing

• (1) Advertising data aggregation

• (2) Advertising metrics like coverage, yield, conversion

• (3) Analytics on user engagement with ads, optimized bid/buy engines

Financial Services

• (1) Market/financial transaction order data collection

• (2) Financial market data metrics

• (3) Fraud monitoring, Value-at-Risk assessment, auditing of market order data

Consumer Online/E-Commerce

• (1) Online customer engagement data aggregation

• (2) Consumer engagement metrics like page views, CTR

• (3) Customer clickstream analytics, recommendation engines

Big streaming data comes from the small

Metering Record:

{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}

Common Log Entry:

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326

Syslog Entry:

<165>1 2003-10-11T22:14:15.003Z mymachine.example.com evntslog - ID47 [exampleSDID@32473 iut="3" eventSource="Application" eventID="1011"][examplePriority@32473 class="high"]

MQTT Record:

“SeattlePublicWater/Kinesis/123/Realtime” – 412309129140

NASDAQ OMX Record:

<R,AMZN ,T,G,R1>

What Biz. Problem needs to be solved?

Mobile/Social Gaming

Deliver continuous/real-time delivery of game insight data by 100’s of game servers

Custom-built solutions operationally complex to manage, & not scalable

• Delay with critical business data delivery

• Developer burden in building reliable, scalable platform for real-time data ingestion/processing

• Slow-down of real-time customer insights

Accelerate time to market of elastic, real-time applications, while minimizing operational overhead

Digital Advertising Tech.

Generate real-time metrics, KPIs for online ad performance for advertisers/publishers

Store + Forward fleet of log servers, and Hadoop based processing pipeline

• Lost data with Store/Forward layer

• Operational burden in managing reliable, scalable platform for real-time data ingestion/processing

• Batch-driven real-time customer insights

Generate freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients

Amazon Kinesis

Managed service for streaming data ingestion and processing

Amazon Kinesis Architecture

[Architecture diagram: Amazon Web Services, spanning three Availability Zones (AZ)]

• Millions of sources producing 100s of terabytes per hour

• Front end: authentication, authorization

• Durable, highly consistent storage replicates data across three data centers (availability zones)

• Ordered stream of events supports multiple readers

• Readers: real-time dashboards and alarms; machine learning algorithms or sliding window analytics; aggregate analysis in Hadoop or a data warehouse; aggregate and archive to S3

• Inexpensive: $0.028 per million puts

Kinesis Stream:

Managed ability to capture and store data

• Streams are made of Shards

• Each Shard ingests data up to 1 MB/sec, and up to 1,000 TPS

• Each Shard emits up to 2 MB/sec

• All data is stored for 24 hours

• Scale Kinesis streams by splitting or merging Shards

• Replay data inside of the 24-hour window
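The per-shard limits above translate directly into a sizing calculation. A rough sketch, using only the figures quoted on this slide (1 MB/sec and 1,000 records/sec in, 2 MB/sec out, per shard); the function name and the example workload are hypothetical:

```python
import math

# Per-shard limits as quoted on the slide (2014 figures).
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0


def shards_needed(ingest_mb_per_sec, records_per_sec, egress_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SEC),
    )


# Hypothetical workload: 2,500 records/sec of about 2 KB each,
# read by two independent consuming applications.
ingest = 2500 * 2 / 1024  # roughly 4.9 MB/sec
print(shards_needed(ingest, 2500, 2 * ingest))  # 5
```

Whichever limit is hit first determines the shard count; here ingest bandwidth and egress bandwidth both require five shards, while the record rate alone would need only three.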

Putting Data into Kinesis

Simple Put interface to store data in Kinesis

• Producers use a PUT call to store data in a Stream

• PutRecord {Data, PartitionKey, StreamName}

• A Partition Key is supplied by the producer and used to distribute the PUTs across Shards

• Kinesis MD5-hashes the supplied partition key and maps the result onto the hash key range of a Shard

• A unique Sequence # is returned to the Producer upon a successful PUT call
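The partition-key routing described above can be illustrated offline. This is a sketch only: it assumes the hash key space is divided evenly between shards (real streams can have uneven ranges after splits and merges), and both function names are hypothetical rather than part of any SDK:

```python
import hashlib

MAX_HASH = 2**128 - 1  # MD5 produces a 128-bit value


def even_shard_ranges(shard_count):
    """Split the 128-bit hash key space into contiguous, equal ranges.

    Assumption for illustration: ranges are even. Each tuple is an
    inclusive (start, end) hash key range owned by one shard.
    """
    size = (MAX_HASH + 1) // shard_count
    ranges = []
    for i in range(shard_count):
        start = i * size
        end = MAX_HASH if i == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose range contains the key's MD5 hash."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    for index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return index
    raise ValueError("hash key ranges do not cover the key space")


ranges = even_shard_ranges(4)
print(shard_for_key("Joe", ranges))
```

The same partition key always lands on the same shard, which is why records sharing a key stay ordered relative to each other, and why a low-cardinality key can overload a single shard.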

Building Kinesis Processing Apps: Kinesis Client Library

Open source library for fault-tolerant, continuous processing apps

• Java client library, source available on GitHub

• Build app with KCL on your EC2 instance(s)

• KCL is the intermediary between your application & stream

• Automatically starts a Kinesis Worker for each shard

• Simplifies reading by abstracting individual shards

• Increase / decrease Workers as # of shards changes

• Checkpoints to keep track of a Worker’s location in the stream; restarts Workers if they fail

• Deploy app on your EC2 instances

• Integrates with Auto Scaling groups to redistribute workers to new instances

Amazon Kinesis Connector Library

Open source code to connect Kinesis with S3, Redshift, DynamoDB

Kinesis → S3, DynamoDB, Redshift

ITransformer

• Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model

IFilter

• Excludes irrelevant records from the processing.

IBuffer

• Buffers the set of records to be processed by specifying a size limit (# of records) & total byte count

IEmitter

• Makes client calls to other AWS services and persists the records stored in the buffer.
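How the four interfaces fit together can be sketched in a few lines. The Connector Library itself is Java; the Python classes below are hypothetical stand-ins that only mirror the roles named above, and the emitter collects batches in a list instead of calling S3, Redshift, or DynamoDB:

```python
import json


class JsonTransformer:
    """ITransformer role: decode a raw stream record into a dict."""

    def to_model(self, raw_bytes):
        return json.loads(raw_bytes.decode("utf-8"))


class OperationFilter:
    """IFilter role: drop records that are irrelevant to this pipeline."""

    def __init__(self, operation):
        self.operation = operation

    def keep(self, record):
        return record.get("operation") == self.operation


class BoundedBuffer:
    """IBuffer role: flush on a record-count limit or a total byte count."""

    def __init__(self, max_records, max_bytes):
        self.max_records = max_records
        self.max_bytes = max_bytes
        self.records = []
        self.byte_count = 0

    def add(self, record, size):
        self.records.append(record)
        self.byte_count += size

    def should_flush(self):
        return (len(self.records) >= self.max_records
                or self.byte_count >= self.max_bytes)

    def drain(self):
        drained, self.records, self.byte_count = self.records, [], 0
        return drained


class ListEmitter:
    """IEmitter role: stands in for a write to S3, Redshift, or DynamoDB."""

    def __init__(self):
        self.batches = []

    def emit(self, records):
        self.batches.append(records)


def run_pipeline(raw_records, transformer, record_filter, buffer, emitter):
    """Transform, filter, buffer, and emit each raw record in order."""
    for raw in raw_records:
        record = transformer.to_model(raw)
        if not record_filter.keep(record):
            continue
        buffer.add(record, len(raw))
        if buffer.should_flush():
            emitter.emit(buffer.drain())
    remaining = buffer.drain()  # flush the tail so nothing is left behind
    if remaining:
        emitter.emit(remaining)
```

Feeding it three PUT records and one GET record with a two-record buffer yields two emitted batches: the first two PUTs, then the last one. The final flush at the end of input is a simplification made for this sketch.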

Sending & Reading Data from Kinesis Streams

Sending: HTTP Post, AWS SDK, AWS Mobile SDK, LOG4J, Flume, Fluentd

Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce

Amazon Kinesis & Elastic MapReduce

Amazon Elastic MapReduce (EMR)

Managed service for Hadoop based data processing

• Managed service

• Easy to tune clusters and trim costs

• Support for multiple data stores

• Unique features that ensure customer success on AWS

Applying batch processing to streamed data

[Diagram: Client/Sensor → Recording Service → Aggregator/Sequencer → Continuous processor for dashboard → Storage → Analytics and Reporting. Stage labels: Streaming Data Ingestion, Amazon Kinesis, Amazon EMR]

What would this look like?

[Diagram: input from users and developers reaches My Website; a Log4J Appender pushes to Kinesis; EMR pulls from Kinesis for processing with Hive, Pig, Cascading, or MapReduce]

Features and Functionality

• Features offered starting EMR AMI 3.0.4 – Simply spin up the EMR cluster like normal

• Logical names – Labels that define units of work (Job A vs Job B)

• Iterations – Provide idempotency (pessimistic locking of the Logical name)

• Checkpoints – Create input start and end points to allow batch processing
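The way logical names, iterations, and checkpoints combine can be modelled in miniature. This is an illustrative sketch, not the connector's implementation: a dict stands in for the DynamoDB metadata store, sequence numbers are plain integers, locking is not modelled, and the class and method names are hypothetical:

```python
class CheckpointStore:
    """In-memory stand-in for the metadata table that records iteration bounds."""

    def __init__(self):
        # (logical_name, iteration) -> (start_seq, end_seq)
        self._bounds = {}

    def bounds_for(self, logical_name, iteration, trim_horizon, latest):
        """Return the fixed input boundaries for one iteration of one job.

        A rerun of a known iteration gets the boundaries recorded the first
        time, which is what makes reruns idempotent. A new iteration starts
        where the previous one ended, or at the trim horizon if there is none.
        """
        key = (logical_name, iteration)
        if key in self._bounds:
            return self._bounds[key]
        previous = self._bounds.get((logical_name, iteration - 1))
        start = previous[1] if previous else trim_horizon
        self._bounds[key] = (start, latest)
        return self._bounds[key]


store = CheckpointStore()
print(store.bounds_for("JobA", 0, trim_horizon=100, latest=250))  # (100, 250)
print(store.bounds_for("JobA", 1, trim_horizon=100, latest=400))  # (250, 400)
# Rerunning iteration 0 later still sees the original window.
print(store.bounds_for("JobA", 0, trim_horizon=100, latest=999))  # (100, 250)
# A different logical name keeps its own, independent position.
print(store.bounds_for("JobB", 0, trim_horizon=100, latest=400))  # (100, 400)
```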

Iterations – the run of a Job

[Diagram: a stream timeline from the Trim Horizon seqID (-24 hours) to the Latest seqID (NOW). A Logical Name processes it as Iteration 1 (1:00 – 7:00), Iteration 2 (7:00 – 13:00), Iteration 3 (13:00 – 19:00), and Iteration 4 (19:00 – 1:00), followed by Next.]

Logical Names & Checkpointing – allows efficient batching

[Diagram: a Kinesis stream from the Trim Horizon seqID (-24 hours) to the Latest seqID (NOW). Logical Name A and Logical Name B each read the stream with their own Mappers 1 to 4. DynamoDB is the metadata storage.]

Each Kinesis shard maps 1:1 to a Hadoop map task

[Diagram: for each iteration window of a Logical Name (1:00 – 7:00, 7:00 – 13:00, 13:00 – 19:00, 19:00 – 1:00, then Next), Shard 1 in Kinesis is read by Mapper 1 in Hadoop and Shard 2 by Mapper 2, between a start seq ID and an end seq ID. The stream runs from -24 hours to the Latest seqID (NOW).]

Handling stream scaling events

[Diagram: the same iteration windows (1:00 – 7:00, 7:00 – 13:00, 13:00 – 19:00, 19:00 – 1:00, then Next) between the Trim Horizon seqID (-24 hours) and the Latest seqID (NOW). The stream starts with Shard 1 and Shard 2, read by Mapper 1 and Mapper 2. After a Split, Shard 3 appears and is read by Mapper 3. A Merge follows later.]

Handling errors

• InputFormat handles service errors

– Throttling: 400

– Service unavailable errors: 503

– Internal server errors: 500

– HTTP client exceptions: socket connection timeout

• Hadoop handles retry of failed map tasks

• Iterations allow retries

– Fixed input boundaries on a stream (idempotency for reruns)

– Enable multiple queries on the same input boundaries
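The error classes listed on this slide suggest a retry wrapper along the following lines. This is a sketch, not the InputFormat's code: the `ServiceError` exception and `call_with_retries` helper are hypothetical, and exponential backoff is an assumption, since the slide does not say how retries are spaced:

```python
import socket
import time

# Status codes the slide lists as handled: throttling, internal error, unavailable.
RETRYABLE_STATUS = {400, 500, 503}


class ServiceError(Exception):
    """Hypothetical exception carrying the HTTP status of a failed call."""

    def __init__(self, status):
        super().__init__(f"service returned HTTP {status}")
        self.status = status


def call_with_retries(operation, max_retries=3, backoff_seconds=0.1,
                      sleep=time.sleep):
    """Run operation(), retrying retryable failures with exponential backoff."""
    attempt = 0
    while True:
        try:
            return operation()
        except (ServiceError, socket.timeout) as error:
            status = getattr(error, "status", None)  # None for socket timeouts
            retryable = status is None or status in RETRYABLE_STATUS
            if not retryable or attempt >= max_retries:
                raise
            sleep(backoff_seconds * 2 ** attempt)
            attempt += 1
```

Passing `sleep` as a parameter keeps the helper testable without real delays. Once retries are exhausted the exception propagates, which is the point where Hadoop's own retry of the failed map task takes over.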

Hadoop Ecosystem Implementation

Implementations

• Hadoop Input format

• Hive Storage Handler

• Pig Load Function

• Cascading Scheme and Tap

Use Cases

• Join multiple data sources for analysis

• Filter and preprocess streams

• Export and archive streaming data

Writing to Kinesis using Log4J

Option (default): Description

• log4j.appender.KINESIS.streamName (AccessLogStream): Stream name to which data is to be published.

• log4j.appender.KINESIS.encoding (UTF-8): Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.

• log4j.appender.KINESIS.maxRetries (3): Maximum number of retries when calling Kinesis APIs to publish a log message.

• log4j.appender.KINESIS.backoffInterval (100ms): Milliseconds to wait before a retry attempt.

• log4j.appender.KINESIS.threadCount (20): Number of parallel threads for publishing logs to configured Kinesis stream.

• log4j.appender.KINESIS.bufferSize (2000): Maximum number of outstanding log messages to keep in memory.

• log4j.appender.KINESIS.shutdownTimeout (30): Seconds to send buffered messages before application JVM quits normally.

.error("Cannot find resource XYX… go do something about it!");

Run the Ad-hoc Hive Query

Amazon Kinesis & Redshift


24 Hours to 10 Minutes

How Upworthy’s Data Pipeline uses Kinesis

Daniel Mintz, Director Business Intelligence, Upworthy, @danielmintz

What’s Upworthy

• We’ve been called

– “Social media with a mission” by our About Page

– “The fastest growing media site of all time” by Fast Company

– “The Fastest Rising Startup” by The Crunchies

– “That thing that’s all over my newsfeed” by my annoyed friends

– “The most data-driven media company in history” by me, optimistically

What We Do

• We aim to drive massive amounts of attention to things that really matter.

• We do that by finding, packaging, and distributing great, meaningful content.

Our Use Case

When We Started

• Had built a data warehouse from scratch

• Hadoop-based batch workflow

• Nightly ETL cycle

• 2.5 Engineers

• Wanted to do all three:

– Comprehensive

– Ad Hoc

– Real-Time

The Decision

• Speed up our current system, rather than building a parallel one

• Had looked at alternative stream processors

– Cost

– Maintenance

• Comfortable with concept of application log stream

How It Works

• Log Drain receives, formats, batches and zips

• PUTs 50k GZIP batches on Kinesis stream

• Three types of Kinesis consumers:

1. Archiver – Batch and write permanent record

2. Stats – Filter, sample and count; Report to StatHat

3. Transformer – Filter, batch, validate; writes temporary BSVs to S3

• Database Importer handles manifest files.

• S3 handles garbage collection.
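The Log Drain's "batch and zip" step might look roughly like the sketch below. It rests on assumptions the slide does not state: that "50k" means a 50 KB cap on each compressed batch, and that events are newline-delimited text lines. The function names are hypothetical, and recompressing on every line is chosen for clarity rather than speed:

```python
import gzip

MAX_BATCH_BYTES = 50 * 1024  # assumed reading of "50k"


def _compress(lines):
    """Gzip a list of text lines as one newline-joined payload."""
    return gzip.compress("\n".join(lines).encode("utf-8"))


def gzip_batches(lines, max_bytes=MAX_BATCH_BYTES):
    """Yield gzip payloads, each holding as many lines as fit in max_bytes.

    A single line whose compressed size exceeds max_bytes is still emitted
    on its own; a real producer would need to reject or split it.
    """
    batch = []
    for line in lines:
        candidate = batch + [line]
        if batch and len(_compress(candidate)) > max_bytes:
            yield _compress(batch)
            batch = [line]
        else:
            batch = candidate
    if batch:
        yield _compress(batch)
```

Each yielded payload would then be the Data of one PUT, and a consumer recovers the events by decompressing and splitting on newlines.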

Our system now

• Stats:

– Average: ~1085 events/second

– Peak: ~2500 events/second

• Data is available in Redshift < 10 min

• Kinesis has been cheap, stable, and gives us redundancy and resiliency.

• Computation model that’s easy to reason about

Resiliency

• When something goes wrong, you have 24 hours.

• Timestamp at outset. Track lag at each step.

• Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.

What We’ve Learned

Some Lessons

• You can use one pipeline for everything.

• High-cardinality fact data belongs in Kinesis.

• EDN works well with Kinesis.

• We prefer explicit checkpointing. (Your mileage may vary.)

• Languages that run on the JVM can take advantage of AWS Client Libraries.

Kinesis Pricing

Simple, pay-as-you-go, & no up-front costs

Pricing Dimension: Value

Hourly Shard Rate: $0.015

Per 1,000,000 PUT transactions: $0.028

• Customers specify throughput requirements in shards, that they control

• Each Shard delivers 1 MB/s on ingest, and 2MB/s on egress

• Inbound data transfer is free

• EC2 instance charges apply for Kinesis processing applications
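The two pricing dimensions combine into a simple estimate. A sketch using only the rates on this slide (2014 prices); the workload in the example and the 720-hour month are hypothetical, and EC2 charges for processing applications are not included:

```python
SHARD_HOUR_USD = 0.015    # hourly shard rate from the slide
PUT_MILLION_USD = 0.028   # per 1,000,000 PUT transactions


def monthly_cost(shards, puts_per_second, hours=720):
    """Estimate the Kinesis charge in USD for a month of the given length."""
    shard_cost = shards * hours * SHARD_HOUR_USD
    puts = puts_per_second * 3600 * hours
    put_cost = puts / 1_000_000 * PUT_MILLION_USD
    return round(shard_cost + put_cost, 2)


# Hypothetical stream: 2 shards receiving a steady 1,000 PUTs per second.
print(monthly_cost(2, 1000))  # 94.18
```

In this example the shards cost $21.60 and the 2.592 billion PUTs cost about $72.58, so batching several events into one PUT reduces the larger part of the bill.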

Canonical Data flows with Amazon Kinesis

• Continuous Metric Extraction

• Incremental Stats Computation

• Record Archiving

• Live Dashboard

Try out Amazon Kinesis

• Try out Amazon Kinesis

– http://aws.amazon.com/kinesis/

• Thumb through the Developer Guide

– http://aws.amazon.com/documentation/kinesis/

• Test drive the sample app

– https://github.com/awslabs/amazon-kinesis-data-visualization-sample

• Kinesis Connector Framework

– https://github.com/awslabs/amazon-kinesis-connectors

• Read EMR-Kinesis FAQs

– http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector

• Visit, and Post on Kinesis Forum

– https://forums.aws.amazon.com/forum.jspa?forumID=169#



Thank You!

Adi Krishnan, Product Management, AWS
