Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

50
© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc. Amazon Kinesis & Big Data Meld real-time streaming with EMR (Hadoop), & Redshift (Data Warehousing) Adi Krishnan, AWS Product Management, @adityak Daniel Mintz, Director of BI, Upworthy, @danielmintz July 10, 2014

description

Originally, Hadoop was used as a batch analytics tool; however, this is rapidly changing, as applications move towards real-time processing and streaming. Amazon Elastic MapReduce has made running Hadoop in the cloud easier and more accessible than ever. Each day, tens of thousands of Hadoop clusters are run on the Amazon Elastic MapReduce infrastructure by users of every size — from university students to Fortune 50 companies. We recently launched Amazon Kinesis – a managed service for real-time processing of high volume, streaming data. Amazon Kinesis enables a new class of big data applications which can continuously analyze data at any volume and throughput, in real-time. Adi will discuss each service, dive into how customers are adopting the services for different use cases, and share emerging best practices. Learn how you can architect Amazon Kinesis and Amazon Elastic MapReduce together to create a highly scalable real-time analytics solution which can ingest and process terabytes of data per hour from hundreds of thousands of different concurrent sources. Forever change how you process web site click-streams, marketing and financial transactions, social media feeds, logs and metering data, and location-tracking events.

Transcript of Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Page 1: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Kinesis & Big Data Meld real-time streaming with EMR (Hadoop), &

Redshift (Data Warehousing)

Adi Krishnan, AWS Product Management, @adityak

Daniel Mintz, Director of BI, Upworthy, @danielmintz

July 10, 2014

Page 2: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon Kinesis & Big Data

o Motivations for Stream Processing

Origins: Internal metering capability

Expanding the big data processing landscape

o Customer view on streaming data

o Amazon Kinesis Overview

Amazon Kinesis Architecture

Kinesis concepts & Demo

o Amazon Elastic MapReduce and Kinesis

EMR connector morphs Kinesis streamed data into Hadoop framework

Applying Hadoop frameworks to streaming data

o Amazon Kinesis and Redshift:

Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”

Presented by Daniel Mintz, Director of Business Intelligence, Upworthy

Page 3: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

The Motivation for Continuous Processing

Page 4: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Origins: Internal AWS Metering Capability

Workload• 10s of millions records/sec

• Multiple TB per hour

• 100,000s of sources

Pain points• Doesn’t scale elastically

• Customers want real-time alerts

• Expensive to operate

• Relies on eventually consistent

storage

Page 5: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Expanding the Big Data Processing Landscape

• Query Engine Approach

• Pre-computations such as

indices and dimensional views

improve performance

• Historical, structured data

• HIVE/SQL-on-Hadoop/ M-R/

Spark

• Batch programs, or other

abstractions breaking down

into MR style computations

• Historical, Semi-structured

data

• Custom computations of relative simple complexity

• Continuous Processing –filters, sliding windows, aggregates – on infinite data streams

• Semi/Structured data, generated continuously in real-time

Traditional Data Warehousing Hadoop Style Processing Stream Processing

Page 6: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

A Generalized Data FlowMany different technologies, at different stages of evolution

Client/Sensor Aggregator Continuous Processing

Storage Analytics + Reporting

Page 7: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Our Big Data Transition

Old Posture

• Capture huge amounts of data

and process it in hourly or daily

batches

New Requirements

• Make decisions faster,

sometimes in real-time

• Scale entire system elastically

• Make it easy to “keep

everything”

• Multiple applications can

process data in parallel

Page 8: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Foundation for Data Streams Ingestion, Continuous Processing

Right Toolset for the Right Job

Real-time Ingest

• Highly Scalable

• Durable

• Elastic

• Replay-able Reads

Continuous Processing FX

• Load-balancing incoming streams

• Fault-tolerance, Checkpoint / Replay

• Elastic

• Enable multiple apps to process in parallel

Enable data movement into Stores/ Processing Engines

Managed Service

Low end-to-end latency

Continuous, real-time workloads

Page 9: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Customer View

Page 10: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Scenarios Accelerated Ingest-Transform-Load Continual Metrics/ KPI Extraction Responsive Data Analysis

Data Types IT infrastructure, Applications logs, Social media, Fin. Market data, Web Clickstreams, Sensors, Geo/Location data

Software/ Technology

IT server , App logs ingestion IT operational metrics dashboards Devices / Sensor Operational Intelligence

Digital Ad Tech./Marketing

Advertising Data aggregation Advertising metrics like coverage, yield, conversion

Analytics on User engagement with Ads, Optimized bid/ buy engines

Financial Services Market/ Financial Transaction order data collection

Financial market data metrics Fraud monitoring, and Value-at-Risk assessment, Auditing of market order data

Consumer Online/E-Commerce

Online customer engagement data aggregation

Consumer engagement metrics like page views, CTR

Customer clickstream analytics, Recommendation engines

Customer Scenarios across Industry Segments

1 2 3

Page 11: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Big streaming data comes from the small{

"payerId": "Joe",

"productCode": "AmazonS3",

"clientProductCode": "AmazonS3",

"usageType": "Bandwidth",

"operation": "PUT",

"value": "22490",

"timestamp": "1216674828"

}

Metering Record

127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]

"GET /apache_pb.gif HTTP/1.0" 200 2326

Common Log Entry

<165>1 2003-10-11T22:14:15.003Z

mymachine.example.com evntslog - ID47

[exampleSDID@32473 iut="3"

eventSource="Application"

eventID="1011"][examplePriority@32473

class="high"]

Syslog Entry“SeattlePublicWater/Kinesis/123/Realtime”

– 412309129140

MQTT Record <R,AMZN ,T,G,R1>

NASDAQ OMX Record

Page 12: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

What Biz. Problem needs to be solved?

Mobile/ Social Gaming Digital Advertising Tech.

Deliver continuous/ real-time delivery of game

insight data by 100’s of game servers

Generate real-time metrics, KPIs for online ad

performance for advertisers/ publishers

Custom-built solutions operationally complex to

manage, & not scalable

Store + Forward fleet of log servers, and Hadoop based

processing pipeline

• Delay with critical business data delivery

• Developer burden in building reliable, scalable

platform for real-time data ingestion/ processing

• Slow-down of real-time customer insights

• Lost data with Store/ Forward layer

• Operational burden in managing reliable, scalable

platform for real-time data ingestion/ processing

• Batch-driven real-time customer insights

Accelerate time to market of elastic, real-time

applications – while minimizing operational

overhead

Generate freshest analytics on advertiser performance

to optimize marketing spend, and increase

responsiveness to clients

Page 13: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon KinesisManaged Service for streaming data ingestion, and processing

Page 14: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon Kinesis Architecture

Amazon Web Services

AZ AZ AZ

Durable, highly consistent storage replicates dataacross three data centers (availability zones)

Aggregate andarchive to S3

Millions ofsources producing100s of terabytes

per hour

FrontEnd

AuthenticationAuthorization

Ordered streamof events supportsmultiple readers

Real-timedashboardsand alarms

Machine learningalgorithms or

sliding windowanalytics

Aggregate analysisin Hadoop or adata warehouse

Inexpensive: $0.028 per million puts

Page 15: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Kinesis Stream:

Managed ability to capture and store data

• Streams are made of Shards

• Each Shard ingests data up to

1MB/sec, and up to 1000 TPS

• Each Shard emits up to 2 MB/sec

• All data is stored for 24 hours

• Scale Kinesis streams by splitting

or merging Shards

• Replay data inside of 24Hr.

Window

Page 16: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Putting Data into Kinesis

Simple Put interface to store data in Kinesis

• Producers use a PUT call to store data in a Stream

• PutRecord {Data, PartitionKey, StreamName}

• A Partition Key is supplied by producer and used to

distribute the PUTs across Shards

• Kinesis MD5 hashes supplied partition key over the

hash key range of a Shard

• A unique Sequence # is returned to the Producer

upon a successful PUT call

Page 17: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Building Kinesis Processing Apps: Kinesis Client LibraryOpen Source library for fault-tolerant, continuous processing apps

• Java client library, source available on Github

• Build app with KCL on your EC2 instance(s)

• KCL is intermediary b/w your application & stream

• Automatically starts a Kinesis Worker for each shard

• Simplifies reading by abstracting individual shards

• Increase / Decrease Workers as # of shards changes

• Checkpoints to keep track of a Worker’s location in

the stream, Restarts Workers if they fail

• Deploy app on your EC2 instances

• Integrates with AutoScaling groups to redistribute workers

to new instances

Page 18: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon Kinesis Connector LibraryOpen Source code to Connect Kinesis with S3, Redshift, DynamoDB

S3

DynamoDB

Redshift

Kinesis

ITransformer

• Defines the transformation of records from the Amazon Kinesis stream in order to suit the user-defined data model

IFilter

• Excludes irrelevant records from the processing.

IBuffer

• Buffers the set of records to be processed by specifying size limit (# of records)& total byte count

IEmitter

• Makes client calls to other AWS services and persists the records stored in the buffer.

Page 19: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Sending & Reading Data from Kinesis Streams

HTTP Post

AWS SDK

LOG4J

Flume

Fluentd

Get* APIs

Kinesis Client

Library

+

Connector Library

Apache

Storm

Amazon Elastic

MapReduce

Sending Consuming

AWS Mobile

SDK

Page 20: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon Kinesis & Elastic MapReduce

Page 21: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon Elastic MapReduce (EMR)Managed Service for Hadoop based data processing

• Managed service

• Easy to tune clusters and trim

costs

• Support for multiple data stores

• Unique features that ensure

customer success on AWS

Page 22: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Applying batch processing to streamed data

Client/ Sensor Recording Service

Aggregator/ Sequencer

Continuous processor for dashboard

Storage

Analytics and Reporting

Amazon Kinesis Amazon EMR

Streaming Data Ingestion

Page 23: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

What would this look like?

Processing

Input

• User

• Dev

My Website

Kinesis

Log4J

Appender

push to

Kinesis

EMR

HivePig

CascadingMapReduce

pull from

Page 24: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

• Features offered starting EMR AMI 3.0.4– Simply spin up the EMR cluster like normal

• Logical names – Labels that define units of work (Job A vs Job B)

• Iterations – Provide idempotency (pessimistic locking of the Logical name)

• Checkpoints– Creating an input start and end points to allow batch processing

Features and Functionality

Page 25: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Iterations – the run of a Job

Iteration 1 Iteration 2 Iteration 3 Iteration 4

Trim Horizon seqID

1:00 – 7:00 7:00 – 13:00 13:00 – 19:00 19:00 – 1:00

-24 hours

Logical Name

Stream

NOW

Latest seqID

Next

Page 26: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Logical Names & Checkpointing – allows

efficient batching

Kinesis Stream

NOW

Latest seqIDTrim Horizon seqID

-24 hours

Logical Name

Stream

Page 27: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

• Dynamo DB

Metadata StorageLogical Name A

Mapper 1

Mapper 2

Mapper 3

Mapper 4

Logical Name B

Mapper 1

Mapper 2

Mapper 3

Mapper 4

Page 28: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Each Kinesis shard maps 1:1 to a Hadoop

map task

1:00 – 7:00 7:00 – 13:00 13:00 – 19:00 19:00 – 1:00

Mapper 2

Kinesis

Hadoop

Next

Logical Name

Mapper 1

Shard 2

Shard 1

Mapper 2

Mapper 1

Mapper 2

Mapper 1

Mapper 2

Mapper 1

-24 hoursStart seq ID End seq ID

NOW

Latest seqID

Page 29: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Handling stream scaling events

Trim Horizon seqID

1:00 – 7:00 7:00 – 13:00 13:00 – 19:00 19:00 – 1:00

Mapper 2

Kinesis

Hadoop

Logical Name

Mapper 1

Shard 2

Shard 1

Mapper 2

Mapper 1

Mapper 2

Mapper 13

Mapper 3

1

2

4

5

Split

Shard 2

Shard 1

Shard 2

Shard 1

Shard 3

Shard 2

Shard 1

Shard 3

Split Merge-24 hours

Latest seqID

NOW

Next

Page 30: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

• InputFormat handles service errors– Throttling: 400

– Service unavailable errors : 503

– Internal server 500

– Http Client exceptions : socket connection timeout

• Hadoop handles retry of failed map tasks

• Iterations allow retrys– Fixed input boundaries on a stream (idempotency for reruns)

– Enable multiple queries on the same input boundaries

Handling errors

Page 31: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Hadoop Ecosystem Implementation

• Hadoop Input format

• Hive Storage Handler

• Pig Load Function

• Cascading Scheme and

Tap

• Join multiple data

sources for analysis

• Filter and preprocess

streams

• Export and archive

streaming data

Use CasesImplementations

Page 32: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Writing to Kinesis using Log4J

Option Default Description

log4j.appender.KINESIS.streamName AccessLogStream

Stream name to which data is to be published.

log4j.appender.KINESIS.encoding UTF-8 Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.

log4j.appender.KINESIS.maxRetries 3 Maximum number of retries when calling Kinesis APIs to publish a log message.

log4j.appender.KINESIS.backoffInterval 100ms Milliseconds to wait before a retry attempt.

log4j.appender.KINESIS.threadCount 20 Number of parallel threads for publishing logs to configured Kinesis stream.

log4j.appender.KINESIS.bufferSize 2000 Maximum number of outstanding log messages to keep in memory.

log4j.appender.KINESIS.shutdownTimeout 30 Seconds to send buffered messages before application JVM quits normally.

.error("Cannot find resource XYX… go do something about it!");

Page 33: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Run the Ad-hoc Hive Query

Page 34: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Run the Ad-hoc Hive Query

Page 35: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Amazon Kinesis & Redshift

Page 36: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

24 Hours to 10 MinutesHow Upworthy’s Data Pipeline uses Kinesis

Daniel Mintz, Director Business Intelligence, Upworthy, @danielmintz

Page 37: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

What’s Upworthy

• We’ve been called– “Social media with a mission” by our About Page

– “The fastest growing media site of all time” by Fast Company

– “The Fastest Rising Startup” by The Crunchies

– “That thing that’s all over my newsfeed” by my annoyed friends

– “The most data-driven media company in history” by me,

optimistically

Page 38: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

What We Do

• We aim to drive massive amounts of attention to things that really matter.

• We do that by finding, packaging, and distributing great, meaningful content.

Page 39: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Our Use Case

Page 40: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

When We Started

• Had built a data warehouse from scratch

• Hadoop-based batch workflow

• Nightly ETL cycle

• 2.5 Engineers

• Wanted to do all three:– Comprehensive

– Ad Hoc

– Real-Time

Page 41: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

The Decision

• Speed up our current system, rather than

building a parallel one

• Had looked at alternative stream processors– Cost

– Maintenance

• Comfortable with concept of application log

stream

Page 42: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

How It Works

• Log Drain receives, formats, batches and zips

• PUTs 50k GZIP batches on Kinesis stream

• Three types of Kinesis consumers:1. Archiver – Batch and write permanent record

2. Stats – Filter, sample and count; Report to StatHat

3. Transformer – Filter, batch, validate; writes temporary BSVs to S3

• Database Importer handles manifest files.

• S3 handles garbage collection.

Page 43: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Our system now

• Stats:– Average: ~1085 events/second

– Peak: ~2500 events/second

• Data is available in Redshift < 10 min

• Kinesis has been cheap, stable, and gives us

redundancy and resiliency.

• Computation model that’s easy to reason about

Page 44: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Resiliency

• When something goes wrong, you have 24 hours.

• Timestamp at outset. Track lag at each step.

• Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.

Page 45: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

What We’ve Learned

Page 46: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Some Lessons

• You can use one pipeline for everything.

• High-cardinality fact data belongs in Kinesis.

• EDN works well with Kinesis.

• We prefer explicit checkpointing. (Your mileage may

vary.)

• Languages that run on the JVM can take advantage of

AWS Client Libraries.

Page 47: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Kinesis PricingSimple, Pay-as-you-go, & no up-front costs

Pricing Dimension Value

Hourly Shard Rate $0.015

Per 1,000,000 PUT

transactions:

$0.028

• Customers specify throughput requirements in shards, that they control

• Each Shard delivers 1 MB/s on ingest, and 2MB/s on egress

• Inbound data transfer is free

• EC2 instance charges apply for Kinesis processing applications

Page 48: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Canonical Data flows with Amazon Kinesis

Continuous Metric

Extraction

Incremental Stats

Computation

Record Archiving

Live Dashboard

Page 49: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

Try out Amazon Kinesis• Try out Amazon Kinesis

– http://aws.amazon.com/kinesis/

• Thumb through the Developer Guide

– http://aws.amazon.com/documentation/kinesis/

• Test drive the sample app

– https://github.com/awslabs/amazon-kinesis-data-visualization-sample

• Kinesis Connector Framework

– https://github.com/awslabs/amazon-kinesis-connectors

• Read EMR-Kinesis FAQs

– http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector

• Visit, and Post on Kinesis Forum

– https://forums.aws.amazon.com/forum.jspa?forumID=169#

Page 50: Real-time Streaming and Querying with Amazon Kinesis and Amazon Elastic MapReduce

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Thank You!

Adi Krishnan, Product Management, AWS