© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Amazon Kinesis & Big Data: Meld real-time streaming with EMR (Hadoop) & Redshift (Data Warehousing)
Adi Krishnan, AWS Product Management, @adityak
Daniel Mintz, Director of BI, Upworthy, @danielmintz
July 10, 2014
Amazon Kinesis & Big Data
o Motivations for Stream Processing
Origins: Internal metering capability
Expanding the big data processing landscape
o Customer view on streaming data
o Amazon Kinesis Overview
Amazon Kinesis Architecture
Kinesis concepts & Demo
o Amazon Elastic MapReduce and Kinesis
EMR connector makes Kinesis stream data available to Hadoop frameworks
Applying Hadoop frameworks to streaming data
o Amazon Kinesis and Redshift:
Upworthy presents “Shrinking Redshift data load times from 24 hours to 10 minutes”
Presented by Daniel Mintz, Director of Business Intelligence, Upworthy
The Motivation for Continuous Processing
Origins: Internal AWS Metering Capability
Workload
• 10s of millions of records/sec
• Multiple TB per hour
• 100,000s of sources
Pain points
• Doesn't scale elastically
• Customers want real-time alerts
• Expensive to operate
• Relies on eventually consistent storage
Expanding the Big Data Processing Landscape
Traditional Data Warehousing
• Query engine approach
• Pre-computations such as indices and dimensional views improve performance
• Historical, structured data
Hadoop Style Processing
• Hive / SQL-on-Hadoop / MapReduce / Spark
• Batch programs, or other abstractions that break down into MapReduce-style computations
• Historical, semi-structured data
Stream Processing
• Custom computations of relatively simple complexity
• Continuous processing – filters, sliding windows, aggregates – on infinite data streams
• Semi-structured and structured data, generated continuously in real time
A Generalized Data Flow: Many different technologies, at different stages of evolution
Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting
Our Big Data Transition
Old Posture
• Capture huge amounts of data and process it in hourly or daily batches
New Requirements
• Make decisions faster, sometimes in real time
• Scale the entire system elastically
• Make it easy to "keep everything"
• Multiple applications can process data in parallel
Foundation for Data Streams Ingestion, Continuous Processing
Right Toolset for the Right Job
Real-time Ingest
• Highly scalable
• Durable
• Elastic
• Replay-able reads
Continuous Processing Framework
• Load-balancing of incoming streams
• Fault tolerance, checkpoint/replay
• Elastic
• Enables multiple apps to process in parallel
Both enable data movement into stores/processing engines, as a managed service, with low end-to-end latency, for continuous, real-time workloads.
Customer View
Customer Scenarios across Industry Segments
Scenarios: (1) Accelerated Ingest-Transform-Load; (2) Continual Metrics/KPI Extraction; (3) Responsive Data Analysis
Data types: IT infrastructure and application logs, social media, financial market data, web clickstreams, sensors, geo/location data
Software/Technology: (1) IT server and app log ingestion; (2) IT operational metrics dashboards; (3) device/sensor operational intelligence
Digital Ad Tech/Marketing: (1) advertising data aggregation; (2) advertising metrics like coverage, yield, and conversion; (3) analytics on user engagement with ads, optimized bid/buy engines
Financial Services: (1) market/financial transaction order data collection; (2) financial market data metrics; (3) fraud monitoring, Value-at-Risk assessment, auditing of market order data
Consumer Online/E-Commerce: (1) online customer engagement data aggregation; (2) consumer engagement metrics like page views and CTR; (3) customer clickstream analytics, recommendation engines
Big streaming data comes from the small
{
"payerId": "Joe",
"productCode": "AmazonS3",
"clientProductCode": "AmazonS3",
"usageType": "Bandwidth",
"operation": "PUT",
"value": "22490",
"timestamp": "1216674828"
}
Metering Record
127.0.0.1 user-identifier frank [10/Oct/2000:13:55:36 -0700]
"GET /apache_pb.gif HTTP/1.0" 200 2326
Common Log Entry
<165>1 2003-10-11T22:14:15.003Z
mymachine.example.com evntslog - ID47
[exampleSDID@32473 iut="3"
eventSource="Application"
eventID="1011"][examplePriority@32473
class="high"]
Syslog Entry
"SeattlePublicWater/Kinesis/123/Realtime" – 412309129140
MQTT Record
<R,AMZN ,T,G,R1>
NASDAQ OMX Record
What Business Problem needs to be solved?
Mobile/Social Gaming
• Deliver continuous, real-time game insight data from 100's of game servers
• Custom-built solutions are operationally complex to manage, and not scalable
• Pain: delays in critical business data delivery; developer burden in building a reliable, scalable platform for real-time data ingestion/processing; slow-down of real-time customer insights
• Goal: accelerate time to market of elastic, real-time applications, while minimizing operational overhead
Digital Advertising Tech.
• Generate real-time metrics and KPIs for online ad performance for advertisers/publishers
• Store-and-forward fleet of log servers, plus a Hadoop-based processing pipeline
• Pain: lost data in the store/forward layer; operational burden in managing a reliable, scalable platform for real-time data ingestion/processing; batch-driven rather than real-time customer insights
• Goal: generate the freshest analytics on advertiser performance to optimize marketing spend, and increase responsiveness to clients
Amazon Kinesis: Managed service for streaming data ingestion and processing
Amazon Kinesis Architecture
[Architecture diagram] Millions of sources producing 100s of terabytes per hour send data through a front end (authentication, authorization) into durable, highly consistent storage that replicates data across three data centers (Availability Zones). The ordered stream of events supports multiple readers: aggregate and archive to S3; real-time dashboards and alarms; machine learning algorithms or sliding-window analytics; aggregate analysis in Hadoop or a data warehouse.
Inexpensive: $0.028 per million PUTs
Kinesis Stream: Managed ability to capture and store data
• Streams are made of Shards
• Each Shard ingests up to 1 MB/sec, and up to 1,000 TPS
• Each Shard emits up to 2 MB/sec
• All data is stored for 24 hours
• Scale Kinesis streams by splitting or merging Shards
• Replay data inside the 24-hour window
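The per-shard limits above imply a simple sizing rule: a stream needs enough shards to cover ingest bandwidth, ingest record rate, and egress bandwidth all at once. A minimal sketch (the function name is illustrative, not part of any Kinesis API):

```python
import math

# Per-shard limits from the slide above.
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0

def shards_needed(mb_in_per_sec, records_per_sec, mb_out_per_sec):
    """Smallest shard count that satisfies all three per-shard limits."""
    return max(
        math.ceil(mb_in_per_sec / INGEST_MB_PER_SEC),
        math.ceil(records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(mb_out_per_sec / EGRESS_MB_PER_SEC),
        1,
    )

# Example: 5 MB/s in, 3,500 records/s, 12 MB/s out (e.g., several consumers).
print(shards_needed(5, 3500, 12))  # 6, driven by the egress limit
```

Whichever dimension dominates (often egress, since every consuming application reads the full stream) sets the shard count you split or merge toward.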
Putting Data into Kinesis
Simple PUT interface to store data in Kinesis
• Producers use a PUT call to store data in a Stream: PutRecord {Data, PartitionKey, StreamName}
• A Partition Key, supplied by the producer, is used to distribute PUTs across Shards
• Kinesis MD5-hashes the supplied partition key to map each record into a Shard's hash key range
• A unique sequence number is returned to the producer upon a successful PUT call
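The partition-key-to-shard mapping can be sketched with the standard library: MD5 places each key somewhere in a 128-bit key space, and each shard owns a contiguous slice of that space. This is a conceptual stand-in, not the Kinesis service itself (the even four-way split is illustrative):

```python
import hashlib

# The 128-bit MD5 key space, split evenly here among four shards.
MAX_HASH = 2**128
NUM_SHARDS = 4
RANGES = [(i * MAX_HASH // NUM_SHARDS, (i + 1) * MAX_HASH // NUM_SHARDS - 1)
          for i in range(NUM_SHARDS)]

def shard_for(partition_key: str) -> int:
    """Map a partition key onto a shard via its MD5 hash."""
    h = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= h <= hi:
            return i
    raise AssertionError("unreachable: the ranges cover the whole key space")

print(shard_for("payer-Joe"))  # the same key always lands on the same shard
```

This is why a well-distributed partition key (e.g., a customer ID) matters: a single hot key pins all of its traffic to one shard's 1 MB/sec limit.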
Building Kinesis Processing Apps: Kinesis Client Library
Open-source library for fault-tolerant, continuous processing apps
• Java client library; source available on GitHub
• Build your app with the KCL and deploy it on your EC2 instance(s)
• The KCL is the intermediary between your application and the stream:
  – Automatically starts a Kinesis Worker for each shard
  – Simplifies reading by abstracting individual shards
  – Increases/decreases Workers as the number of shards changes
  – Checkpoints to keep track of a Worker's location in the stream; restarts Workers if they fail
• Integrates with Auto Scaling groups to redistribute workers to new instances
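The checkpoint-and-restart behavior is the heart of the KCL. A minimal conceptual sketch, with a plain dict standing in for the DynamoDB table the real KCL uses (names and shapes are illustrative, not the KCL API):

```python
# KCL-style checkpointing: a worker records how far it has read in each
# shard, so a restarted worker resumes from the checkpoint instead of
# re-processing the whole stream.
checkpoints = {}  # shard_id -> last processed sequence number

def process_shard(shard_id, records):
    """Process records after the checkpoint, then advance the checkpoint."""
    start_after = checkpoints.get(shard_id, -1)
    processed = [seq for seq, _data in records if seq > start_after]
    if processed:
        checkpoints[shard_id] = processed[-1]
    return processed

records = [(1, "a"), (2, "b"), (3, "c")]
print(process_shard("shard-0", records))                # [1, 2, 3]
print(process_shard("shard-0", records + [(4, "d")]))   # [4] only, after restart
```

Because the checkpoint lives outside the worker, a replacement worker (on the same or another instance) picks up exactly where the failed one stopped.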
Amazon Kinesis Connector Library: Open-source code to connect Kinesis with S3, Redshift, DynamoDB
S3
DynamoDB
Redshift
Kinesis
ITransformer
• Defines the transformation of records from the Amazon Kinesis stream to suit the user-defined data model
IFilter
• Excludes irrelevant records from processing
IBuffer
• Buffers the set of records to be processed, by size limit (number of records) and total byte count
IEmitter
• Makes client calls to other AWS services and persists the records stored in the buffer
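The four interfaces compose into a single loop: transform, filter, buffer, and emit when the buffer fills. A Python paraphrase of the Java interfaces (method names and the byte-count limit are simplified away; this is a sketch, not the connector library's actual API):

```python
# Connector pipeline sketch: transform -> filter -> buffer -> emit.
class Pipeline:
    def __init__(self, transform, keep, flush_at, emit):
        self.transform, self.keep = transform, keep    # ITransformer, IFilter
        self.flush_at, self.emit = flush_at, emit      # IBuffer limit, IEmitter
        self.buffer = []

    def handle(self, raw_record):
        record = self.transform(raw_record)            # ITransformer
        if not self.keep(record):                      # IFilter
            return
        self.buffer.append(record)                     # IBuffer
        if len(self.buffer) >= self.flush_at:
            self.emit(list(self.buffer))               # IEmitter (e.g., write to S3)
            self.buffer.clear()

emitted = []
p = Pipeline(transform=str.upper, keep=lambda r: r != "SKIP",
             flush_at=2, emit=emitted.append)
for raw in ["a", "skip", "b", "c", "d"]:
    p.handle(raw)
print(emitted)  # [['A', 'B'], ['C', 'D']]
```

In the real library the emitter is where the S3, Redshift, or DynamoDB client call happens; everything upstream is store-agnostic.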
Sending & Reading Data from Kinesis Streams
Sending: HTTP POST, AWS SDK, Log4J, Flume, Fluentd, AWS Mobile SDK
Consuming: Get* APIs, Kinesis Client Library + Connector Library, Apache Storm, Amazon Elastic MapReduce
Amazon Kinesis & Elastic MapReduce
Amazon Elastic MapReduce (EMR): Managed service for Hadoop-based data processing
• Managed service
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features that ensure customer success on AWS
Applying batch processing to streamed data
Client/Sensor → Recording Service → Aggregator/Sequencer → Continuous processor for dashboards → Storage → Analytics and Reporting
Amazon Kinesis handles the streaming data ingestion; Amazon EMR handles the batch processing.
What would this look like?
Input: users and developers hit My Website; a Log4J appender pushes log events to Kinesis.
Processing: EMR pulls from Kinesis and processes with Hive, Pig, Cascading, or MapReduce.
Features and Functionality
• Offered starting with EMR AMI 3.0.4: simply spin up the EMR cluster as normal
• Logical names: labels that define units of work (Job A vs. Job B)
• Iterations: provide idempotency (pessimistic locking of the logical name)
• Checkpoints: create input start and end points to allow batch processing
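The interplay of logical names, iterations, and checkpoints can be sketched as a bookkeeping table keyed by (logical name, iteration), with a dict standing in for the DynamoDB table the connector actually uses (names are illustrative):

```python
# Sketch of the EMR connector's checkpoint bookkeeping: each (logical name,
# iteration) pins a fixed [start, end] sequence-number window, so a re-run
# of the same iteration reads exactly the same slice of the stream.
checkpoint_table = {}  # (logical_name, iteration) -> (start_seq, end_seq)

def bounds_for(logical_name, iteration, latest_seq):
    key = (logical_name, iteration)
    if key not in checkpoint_table:  # first run: pin the window at 'latest'
        prev_end = checkpoint_table.get((logical_name, iteration - 1), (0, 0))[1]
        checkpoint_table[key] = (prev_end + 1, latest_seq)
    return checkpoint_table[key]     # re-runs get the identical window

print(bounds_for("JobA", 1, 500))   # (1, 500)
print(bounds_for("JobA", 1, 999))   # still (1, 500): idempotent re-run
print(bounds_for("JobA", 2, 999))   # (501, 999): next iteration continues
```

This is what lets a failed Hive query be re-run safely, and lets multiple queries share the same input boundaries, within the 24-hour retention window.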
Iterations – the run of a Job
[Diagram: a stream's 24-hour retention window, from the Trim Horizon sequence ID to the latest sequence ID at NOW, divided under one logical name into four six-hour iterations: 1:00–7:00, 7:00–13:00, 13:00–19:00, 19:00–1:00.]
Logical Names & Checkpointing – allows efficient batching
[Diagram: checkpoint metadata is stored in DynamoDB; two logical names, A and B, each fan out over the same 24-hour stream window to their own set of mappers (Mapper 1–4).]
Each Kinesis shard maps 1:1 to a Hadoop map task
[Diagram: within each six-hour iteration window, Shard 1 and Shard 2 are read by Mapper 1 and Mapper 2 respectively; every iteration is bounded by a start and end sequence ID inside the 24-hour window.]
Handling stream scaling events
[Diagram: when a shard split occurs mid-stream, Shard 2 splits into Shards 2 and 3, so the next iteration runs three mappers instead of two; a later merge reduces the shard and mapper count again. The 1:1 shard-to-mapper mapping is re-derived at each iteration boundary.]
Handling errors
• The InputFormat handles service errors:
  – Throttling (HTTP 400)
  – Service unavailable (HTTP 503)
  – Internal server error (HTTP 500)
  – HTTP client exceptions, e.g. socket connection timeouts
• Hadoop handles retry of failed map tasks
• Iterations allow retries:
  – Fixed input boundaries on a stream (idempotency for reruns)
  – Enable multiple queries on the same input boundaries
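The retryable-error handling above amounts to a retry loop with backoff around each service call. A simplified stand-in, with no real Kinesis calls (function names and the backoff policy are illustrative):

```python
import time

RETRYABLE = {400, 500, 503}  # throttling, internal error, service unavailable

def call_with_retry(call, max_retries=3, backoff_s=0.1):
    """Retry a call that returns (status, body), backing off between tries."""
    for attempt in range(max_retries + 1):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        time.sleep(backoff_s * (2 ** attempt))  # exponential backoff
    return status, body  # give up; Hadoop would then retry the whole map task

responses = iter([(503, None), (503, None), (200, "records")])
print(call_with_retry(lambda: next(responses), backoff_s=0.001))  # (200, 'records')
```

Note the layering: transient errors are absorbed inside the InputFormat, whole-task failures are absorbed by Hadoop's map-task retry, and iteration boundaries make even a full re-run idempotent.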
Hadoop Ecosystem Implementation
Implementations: Hadoop InputFormat; Hive Storage Handler; Pig Load Function; Cascading Scheme and Tap
Use Cases: join multiple data sources for analysis; filter and preprocess streams; export and archive streaming data
Writing to Kinesis using Log4J
Option (log4j.appender.KINESIS.*) | Default | Description
streamName | AccessLogStream | Stream name to which data is published.
encoding | UTF-8 | Encoding used to convert log message strings into bytes before sending to Amazon Kinesis.
maxRetries | 3 | Maximum number of retries when calling Kinesis APIs to publish a log message.
backoffInterval | 100ms | Milliseconds to wait before a retry attempt.
threadCount | 20 | Number of parallel threads for publishing logs to the configured Kinesis stream.
bufferSize | 2000 | Maximum number of outstanding log messages to keep in memory.
shutdownTimeout | 30 | Seconds allowed for sending buffered messages before the application JVM exits normally.
.error("Cannot find resource XYX… go do something about it!");
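Putting the table's options together, a minimal appender configuration might look like the following sketch. The appender class name is an assumption taken from the awslabs kinesis-log4j-appender sample and may differ in your version; the logger name is illustrative:

```properties
# Route a logger to the Kinesis appender (logger name is illustrative).
log4j.logger.KinesisLogger=INFO, KINESIS

# Appender class name assumed from the awslabs kinesis-log4j-appender sample.
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
log4j.appender.KINESIS.streamName=AccessLogStream
log4j.appender.KINESIS.encoding=UTF-8
log4j.appender.KINESIS.maxRetries=3
log4j.appender.KINESIS.backoffInterval=100
log4j.appender.KINESIS.threadCount=20
log4j.appender.KINESIS.bufferSize=2000
log4j.appender.KINESIS.shutdownTimeout=30
```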
Run the Ad-hoc Hive Query
Amazon Kinesis & Redshift
24 Hours to 10 Minutes: How Upworthy's Data Pipeline uses Kinesis
Daniel Mintz, Director Business Intelligence, Upworthy, @danielmintz
What’s Upworthy
• We've been called:
  – "Social media with a mission" by our About Page
  – "The fastest growing media site of all time" by Fast Company
  – "The Fastest Rising Startup" by The Crunchies
  – "That thing that's all over my newsfeed" by my annoyed friends
  – "The most data-driven media company in history" by me, optimistically
What We Do
• We aim to drive massive amounts of attention to things that really matter.
• We do that by finding, packaging, and distributing great, meaningful content.
Our Use Case
When We Started
• Had built a data warehouse from scratch
• Hadoop-based batch workflow
• Nightly ETL cycle
• 2.5 Engineers
• Wanted to do all three:
  – Comprehensive
  – Ad Hoc
  – Real-Time
The Decision
• Speed up our current system, rather than building a parallel one
• Had looked at alternative stream processors:
  – Cost
  – Maintenance
• Comfortable with the concept of an application log stream
How It Works
• Log Drain receives, formats, batches and zips events
• PUTs 50k GZIP batches onto the Kinesis stream
• Three types of Kinesis consumers:
  1. Archiver – batches and writes the permanent record
  2. Stats – filters, samples and counts; reports to StatHat
  3. Transformer – filters, batches, validates; writes temporary BSVs to S3
• Database Importer handles manifest files
• S3 handles garbage collection
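The Log Drain's batch-and-zip step can be sketched with the standard library. Batch sizes and the JSON wire format here are illustrative, not Upworthy's actual format:

```python
import gzip
import json

def make_batches(events, batch_size=3):
    """Group events into fixed-size batches and GZIP each batch for a PUT."""
    for i in range(0, len(events), batch_size):
        batch = events[i:i + batch_size]
        payload = gzip.compress(json.dumps(batch).encode("utf-8"))
        yield payload  # each compressed payload becomes one Kinesis record

events = [{"page": f"/post/{n}"} for n in range(7)]
payloads = list(make_batches(events))
print(len(payloads))  # 3 batches: 3 + 3 + 1 events
print(json.loads(gzip.decompress(payloads[0])))  # round-trips losslessly
```

Batching and compressing before the PUT stretches each record toward Kinesis's per-record size limit, which cuts both the PUT transaction count and the shard bandwidth consumed.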
Our system now
• Stats:
  – Average: ~1,085 events/second
  – Peak: ~2,500 events/second
• Data is available in Redshift in < 10 min
• Kinesis has been cheap and stable, and gives us redundancy and resiliency
• A computation model that's easy to reason about
Resiliency
• When something goes wrong, you have 24 hours.
• Timestamp at outset. Track lag at each step.
• Bigger workers (more CPU, RAM, deeper queues) can catch us up very fast.
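"Timestamp at outset, track lag at each step" can be made concrete in a few lines; field names and the alert threshold are illustrative, not Upworthy's implementation:

```python
import time

def stamp(event):
    """Attach the ingest-time timestamp when the event first enters the pipe."""
    event["ingested_at"] = time.time()
    return event

def lag_seconds(event, now=None):
    """How far behind real time this event is at the current pipeline step."""
    return (now if now is not None else time.time()) - event["ingested_at"]

e = stamp({"page": "/post/42"})
# Later, at the Redshift-import step, alert if the pipeline is falling behind:
if lag_seconds(e) > 600:  # the 10-minute target from the talk title
    print("pipeline lagging; scale up workers")
```

Because every event carries its origin timestamp, each consumer can report its own lag independently, which is what makes the "you have 24 hours" budget actionable.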
What We’ve Learned
Some Lessons
• You can use one pipeline for everything.
• High-cardinality fact data belongs in Kinesis.
• EDN works well with Kinesis.
• We prefer explicit checkpointing. (Your mileage may vary.)
• Languages that run on the JVM can take advantage of the AWS client libraries.
Kinesis Pricing: Simple, pay-as-you-go, no up-front costs
Pricing Dimension | Value
Hourly Shard Rate | $0.015
Per 1,000,000 PUT transactions | $0.028
• Customers specify throughput requirements in shards, which they control
• Each Shard delivers 1 MB/s on ingest and 2 MB/s on egress
• Inbound data transfer is free
• EC2 instance charges apply for Kinesis processing applications
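A quick cost sketch from the table's two dimensions (rates as quoted in this 2014 talk; the 730-hour month is an assumption):

```python
def monthly_cost(shards, puts_per_sec, hours=730):
    """Estimate monthly Kinesis cost: shard-hours plus PUT transactions."""
    shard_cost = shards * 0.015 * hours                          # $0.015/shard-hour
    put_cost = puts_per_sec * 3600 * hours / 1_000_000 * 0.028   # $0.028/million PUTs
    return round(shard_cost + put_cost, 2)

# Example: 2 shards at a sustained 1,000 PUTs/sec for a 730-hour month.
print(monthly_cost(2, 1000))  # 95.48
```

At sustained high record rates the PUT charge dominates the shard charge, which is one more reason to batch small events into larger records before the PUT.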
Canonical Data Flows with Amazon Kinesis
• Continuous metric extraction
• Incremental stats computation
• Record archiving
• Live dashboards
Try out Amazon Kinesis
• Try out Amazon Kinesis
– http://aws.amazon.com/kinesis/
• Thumb through the Developer Guide
– http://aws.amazon.com/documentation/kinesis/
• Test drive the sample app
– https://github.com/awslabs/amazon-kinesis-data-visualization-sample
• Kinesis Connector Framework
– https://github.com/awslabs/amazon-kinesis-connectors
• Read EMR-Kinesis FAQs
– http://aws.amazon.com/elasticmapreduce/faqs/#kinesis-connector
• Visit and post on the Kinesis Forum
– https://forums.aws.amazon.com/forum.jspa?forumID=169#
Thank You!
Adi Krishnan, Product Management, AWS