Real-Time Analytics at Scale in the AWS Cloud
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Olivier Klein
Solutions Architect, AWS
23rd June 2015
Cloud & Big Data Analytics Summit 2015
Hong Kong
Real-Time Analytics at Scale in the AWS Cloud
Three Types of Data Analytics
• Retrospective analysis and reporting
• Here-and-now real-time processing and dashboards
• Predictions to enable smart apps
“There’s no such thing as real time.
There’s only near-real time. Typically
when we talk about real-time, what
we mean is architectures that allow
you to respond to data without
persisting it to a database first!”
John Akred
CTO, Silicon Valley Data Science
So what is near real-time?
• Ability to process data as it arrives
• Roughly speaking, process data in “the present” rather than “the future”
• But what is “the present”?
  • eCommerce – Attention span of a potential customer
  • Options Trader – Milliseconds
  • Guided Missile – Microseconds
Solution: Stream Processing
• Stream “storage” that allows processing events as they come in and reacting accordingly
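A minimal sketch of the idea, assuming a hypothetical in-memory event source: each event is handled the moment it arrives, and only a small running aggregate is kept rather than persisting anything to a database first. The names `react_to_stream` and `threshold`, and the `key` field on each event, are illustrative and not part of any AWS API.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def react_to_stream(
    events: Iterable[Dict[str, str]], threshold: int
) -> Iterator[Tuple[str, int]]:
    """Yield (key, count) as soon as a key has been seen `threshold` times."""
    counts: Counter = Counter()
    for event in events:  # events are consumed one by one, as they arrive
        key = event["key"]
        counts[key] += 1
        if counts[key] == threshold:
            # React immediately instead of waiting for a batch job.
            yield key, counts[key]


# Hypothetical sample events standing in for a live stream.
sample = [{"key": "checkout"}, {"key": "search"}, {"key": "checkout"}]
alerts = list(react_to_stream(sample, threshold=2))
```

Because `react_to_stream` is a generator, it works the same way on an endless source as on this short list.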
Real-Time Data Stream Expectations
• What do we expect from a real-time data stream?
  • Highly Available
  • Fully Scalable
  • Fault Tolerant
  • (Temporarily) Durable
• How can we achieve this?
  • Multiple Datacenter Facilities
  • Auto-Scalable Server Infrastructure
  • Global Load-Balancers
  • etc.
AWS Global Infrastructure
• 11 Regions
• 29 Availability Zones
• 53 Edge Locations
• Continuous Expansion
Regions: N. Virginia, Northern California, Oregon, GovCloud, São Paulo, Ireland, Frankfurt, Singapore, Tokyo, Sydney, Beijing
Amazon Web Services
• Infrastructure: Regions, Availability Zones, Edge Locations
• Core Services: Compute, Storage, Database, Networking
• Platform Services: Analytics, App Deployment, Mobile
• Security: Access Control, Auditing, Monitoring, Encryption
• Applications: Virtual Desktops, Collaboration & Sharing, App Delivery, E-Mail
• API & SDKs
Ingest, Store, Process, Visualize
Services shown: Amazon Kinesis, AWS Import/Export, Amazon Mobile Analytics, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon Glacier, Amazon Redshift, Amazon CloudSearch, Amazon EC2, Amazon EMR, Amazon Lambda, Amazon Machine Learning, AWS Data Pipeline
Stream in Real Time: Amazon Kinesis
• Real-time data processing over large distributed streams
• Elastic capacity that scales to millions of events per second
• React in real time to incoming stream events
• Reliable stream storage replicated across 3 facilities
Amazon Kinesis: Produce and Consume
• Producers: HTTP Post, AWS SDKs, LOG4J, Flume, Fluentd, Kinesis Producer Library (IoT)
• Stream: Amazon Kinesis
• Consumers:
  • App.1 [Aggregate & De-Duplicate]
  • App.2 [Metric Extraction]
  • App.3 [Decision Making Tree]
  • App.4 [Machine Learning]
• Also shown: Amazon S3, Amazon DynamoDB, Apache Storm, Amazon EMR
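On the producer side, the AWS SDK route might look like the sketch below. It assumes a client that behaves like `boto3.client("kinesis")`, whose `put_record` call takes `StreamName`, `Data` and `PartitionKey`. The stream name `demo-stream`, the event fields, and the `FakeKinesis` stand-in are hypothetical; the stand-in is only there so the sketch runs without AWS credentials.

```python
import json
from typing import Any, Dict, List


def send_event(client: Any, stream_name: str, event: Dict[str, Any]) -> Dict[str, Any]:
    """Put one JSON-encoded event on a Kinesis stream via a boto3-style client."""
    return client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records that share a partition key are routed to the same shard.
        PartitionKey=str(event["user_id"]),
    )


class FakeKinesis:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self) -> None:
        self.records: List[Dict[str, Any]] = []

    def put_record(self, **kwargs: Any) -> Dict[str, Any]:
        self.records.append(kwargs)
        return {
            "ShardId": "shardId-000000000000",
            "SequenceNumber": str(len(self.records)),
        }


fake = FakeKinesis()
response = send_event(fake, "demo-stream", {"user_id": 42, "action": "click"})
```

Against a real stream, the same function would be called with `boto3.client("kinesis")` in place of `fake`, assuming the stream already exists and credentials are configured.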
React in Real-Time: Amazon Lambda
• Run your code in the cloud, fully managed and highly available
• Triggered through invocation or state changes in your setup
• Scales automatically to match the incoming event rate
• Can be connected to an Amazon Kinesis stream to react to every incoming event
• Charged per 100 ms of execution time
Amazon DynamoDB
• Schemaless Data Model
• Seamless scalability
• No storage or throughput limits
• Consistent low latency performance
• High durability and availability
• Replicated across 3 facilities
Fully Managed NoSQL Database Service
Data model: a DynamoDB table holds items, and each item holds attributes
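The schemaless model can be illustrated without touching AWS. In the sketch below, `InMemoryTable` is a hypothetical stand-in whose `put_item(Item=...)` and `get_item(Key=...)` mirror the method names of a boto3 table resource. It is not the DynamoDB implementation, and the `tweet_id` key and other attribute names are invented for the example. The point is that only the key attribute is required, so two items in one table can carry different attributes.

```python
from typing import Any, Dict


class InMemoryTable:
    """Hypothetical stand-in for a DynamoDB table keyed on a single attribute."""

    def __init__(self, key_name: str) -> None:
        self.key_name = key_name
        self.items: Dict[Any, Dict[str, Any]] = {}

    def put_item(self, Item: Dict[str, Any]) -> None:
        # Only the key attribute is mandatory; all other attributes are free-form.
        if self.key_name not in Item:
            raise ValueError(f"missing key attribute {self.key_name!r}")
        self.items[Item[self.key_name]] = dict(Item)

    def get_item(self, Key: Dict[str, Any]) -> Dict[str, Any]:
        item = self.items.get(Key[self.key_name])
        return {"Item": item} if item is not None else {}


table = InMemoryTable(key_name="tweet_id")
table.put_item(Item={"tweet_id": "1", "text": "hello", "lang": "en"})
# Second item has a different set of attributes; no schema change is needed.
table.put_item(Item={"tweet_id": "2", "text": "hi", "retweets": 3})
found = table.get_item(Key={"tweet_id": "2"})
```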
500,000 writes / second to their Amazon DynamoDB tables
200 additional servers during the Super Bowl
0 additional servers right after
Demo: Live Twitter Feed Analysis
Components shown: Twitter Stream, Amazon Kinesis, Amazon Lambda, Amazon DynamoDB, Amazon SNS, Amazon S3, Visualization with D3.js
Cost of running this demo?
Kinesis Shard: $0.15/h
DynamoDB: $0.0065/h + $0.25/GB
Lambda: $0.000000208/100ms
S3: $0.03/GB
Total: $0.436502080 ~ $0.43
Highly available with virtually unlimited scalability.
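The total can be checked by multiplying each unit price by a usage quantity. The slide does not state the usage behind its figure, so the quantities below are assumptions, chosen because they reproduce it: one hour of each hourly service, 1 GB each in DynamoDB and S3, and ten 100 ms Lambda billing units.

```python
from decimal import Decimal

# Unit prices as listed on the slide (USD).
PRICES = {
    "kinesis_shard_hour": Decimal("0.15"),
    "dynamodb_hour": Decimal("0.0065"),
    "dynamodb_gb": Decimal("0.25"),
    "lambda_100ms": Decimal("0.000000208"),
    "s3_gb": Decimal("0.03"),
}

# Assumed usage; these quantities are NOT given on the slide.
USAGE = {
    "kinesis_shard_hour": 1,
    "dynamodb_hour": 1,
    "dynamodb_gb": 1,
    "lambda_100ms": 10,  # ten billing units, i.e. one second of execution
    "s3_gb": 1,
}

# Decimal avoids binary floating-point rounding in the sub-cent digits.
total = sum(PRICES[name] * USAGE[name] for name in PRICES)
```

With a single 100 ms Lambda unit instead of ten, the same sum comes to $0.436500208, so the Lambda share is negligible under either assumption.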
What’s next?
• Many AWS services can help your Big Data roadmap
• Talk to us at the AWS and Masterson booth to learn how to build a cost-effective data analytics platform on AWS
• US$50 AWS Credits to get you started