Building a Big Data & Analytics Platform using AWS
Chris Hampartsoumian
Technology Evangelist - ASEAN
End to End Data Flows on the Cloud
Structured, Unstructured & Streaming
July 2015
How is Cloud Computing important for Big Data Applications?
How did Amazon get into cloud computing?
AWS Global Infrastructure
• 11 Regions
• 30 Availability Zones
• 53 Edge locations
Why are customers adopting cloud computing?
• Variable expense: Replace capital expenditure with variable expense
• Elastic capacity: No need to guess capacity requirements and over-provision
• Speed and agility: Infrastructure in minutes not weeks
• Global Reach: Go global in minutes and reach a global audience
Your Applications, built on AWS services, built on the AWS Global Infrastructure (11 Regions, 30 Availability Zones, 53 Edge Locations)
• Mobile: Push Notifications, Mobile Analytics, Cognito, Cognito Sync
• Network: VPC, Direct Connect, Route 53
• API
• Human Interaction: Support, Web Console
• Interaction: Command Line, Libraries, SDKs
• Database: DynamoDB, RDS, ElastiCache
• Deployment & Management: Elastic Beanstalk, OpsWorks, CloudFormation, CodeDeploy, CodePipeline, CodeCommit
• Security & Administration: CloudWatch, Config, CloudTrail, IAM, Directory, KMS
• Application: SQS, SWF, AppStream, Elastic Transcoder, SES, CloudSearch, SNS
• Enterprise Applications: WorkSpaces, WorkMail, WorkDocs
• Compute: EC2, ELB, Auto Scaling, Lambda, ECS
• Analytics: Kinesis, Data Pipeline, Redshift, EMR, Machine Learning
• Storage: EBS, Glacier, CloudFront, EFS, S3
[Figure: data stores plotted by structure (low to high) and size (small to large): Traditional Database, Hadoop, NoSQL, MPP Database]
• Structured: MPP Databases (Amazon Redshift)
• Unstructured: Hadoop (Amazon EMR)
• Streaming: Real-time Analysis (Amazon Kinesis)
MPP Databases
• Standard SQL
• Optimized for fast analysis
• Very scalable
Amazon Redshift
Q1. What is it?
• MPP SQL Database
• Optimised for Analytics
• Gigabytes to Petabytes
• Fully relational
• Fully managed
Q2. How does it work?
[Figure: clients connect over JDBC/ODBC; the rows of a table are spread across the cluster]
 ID | Name
----+---------------
 1  | John Smith
 2  | Jane Jones
 3  | Peter Black
 4  | Pat Partridge
 5  | Sarah Cyan
 6  | Brian Snail
Rows as grouped in the figure: (1, 4), (2, 5), (3, 6)
Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything
 ID  | Age | State | Amount
-----+-----+-------+--------
 123 | 20  | CA    | 500
 345 | 25  | WA    | 250
 678 | 40  | FL    | 125
 957 | 37  | WA    | 375
• With column storage, you only read the data you need
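A minimal Python sketch of the row versus column comparison, not Redshift's actual storage engine. It holds the four rows from the slide in both layouts and counts how many stored values each layout touches to compute the average Amount by State. The names `ROWS`, `COLUMNS`, and the `values_read` counter are hypothetical, introduced only for illustration.

```python
from collections import defaultdict

# The four example rows from the slide: (ID, Age, State, Amount).
ROWS = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]

# The same data laid out one column at a time.
COLUMNS = {
    "ID": [row[0] for row in ROWS],
    "Age": [row[1] for row in ROWS],
    "State": [row[2] for row in ROWS],
    "Amount": [row[3] for row in ROWS],
}


def average_by_state(pairs):
    """Average the amounts for each state from (state, amount) pairs."""
    amounts = defaultdict(list)
    for state, amount in pairs:
        amounts[state].append(amount)
    return {state: sum(vals) / len(vals) for state, vals in amounts.items()}


def avg_amount_row_store(rows):
    """Row layout: every value of every row is read, even ID and Age."""
    values_read = 0
    pairs = []
    for row in rows:
        values_read += len(row)
        pairs.append((row[2], row[3]))
    return average_by_state(pairs), values_read


def avg_amount_column_store(columns):
    """Column layout: only the State and Amount columns are read."""
    states = columns["State"]
    amounts = columns["Amount"]
    values_read = len(states) + len(amounts)
    return average_by_state(zip(states, amounts)), values_read


row_result, row_reads = avg_amount_row_store(ROWS)
col_result, col_reads = avg_amount_column_store(COLUMNS)
print(row_result, row_reads)
print(col_result, col_reads)
```

Both layouts give the same averages, but the column layout touches half as many values on this four-column table. The gap widens as tables gain columns that a query does not need.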
analyze compression listing;
 Table   | Column         | Encoding
---------+----------------+----------
 listing | listid         | delta
 listing | sellerid       | delta32k
 listing | eventid        | delta32k
 listing | dateid         | bytedict
 listing | numtickets     | bytedict
 listing | priceperticket | delta32k
 listing | totalprice     | mostly32
 listing | listtime       | raw
• COPY compresses automatically
• You can analyze and override
• More performance, less cost
Zone maps
Block 1: 10 | 13 | 14 | 26 | … | 100 | 245 | 324 (min 10, max 324)
Block 2: 375 | 393 | 417 | … | 512 | 549 | 623 (min 375, max 623)
Block 3: 637 | 712 | 809 | … | 834 | 921 | 959 (min 637, max 959)
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data
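The block-skipping idea can be sketched in a few lines of Python. This is an illustration of the technique, not Redshift's implementation. The three blocks hold only the values visible on the slide (the elided ones are left out), and `build_zone_map` and `scan_range` are hypothetical helper names.

```python
# Three sorted blocks, using only the values visible on the slide.
BLOCKS = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]


def build_zone_map(blocks):
    """Record the (minimum, maximum) value of every block."""
    return [(min(block), max(block)) for block in blocks]


def scan_range(blocks, zone_map, low, high):
    """Return the values in [low, high] and how many blocks were read.

    A block is skipped without being read when its min/max range
    cannot overlap the requested range.
    """
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        if block_max < low or block_min > high:
            continue  # zone map proves nothing in this block matches
        blocks_read += 1
        matches.extend(value for value in block if low <= value <= high)
    return matches, blocks_read


zone_map = build_zone_map(BLOCKS)
print(zone_map)
print(scan_range(BLOCKS, zone_map, 400, 600))
```

A query for values between 400 and 600 only has to read the middle block, because the min/max pairs of the other two blocks rule them out.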
Q3. What’s good about it?
Performance, Scalability, Ease of Use, Cost
Performance Evaluation on 2B Rows
Aggregate by month: 02:08:35 | 00:35:46 | 00:00:12
Traditional SQL Database, Amazon Redshift
[Figure: scaling from a single 160 GB DW2.L node to a cluster of one hundred 8XL nodes holding 2 PB]
Q4. How do I integrate with Redshift?
Works with your existing analysis tools
• JDBC/ODBC connections to Amazon Redshift
Loading data
• [Figure: data loaded into Redshift from S3, DynamoDB, EMR, and Linux]
• Source Systems, through ETL, into Amazon Redshift
[Figure: Input File, processed by Functions on a Hadoop cluster, produces Output]
1. Very Flexible
2. Very Scalable
3. Often Transient
Amazon Elastic MapReduce (EMR)
Q1. What is it?
Managed Hadoop
[Figure: Input File, processed by Functions on an EMR cluster of EC2 instances, produces Output]
Q2. How does it work?
1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
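As a rough sketch of launching through the SDK, the snippet below builds the request for a cluster and, separately, submits it. It uses boto3's `run_job_flow` call rather than the older `boto` package that appears elsewhere in this deck. The cluster name, bucket, release label, instance type, region, and the helper names `build_cluster_request` and `launch_cluster` are all assumptions chosen for illustration. The two role names are the EMR default role names and must already exist in the account.

```python
def build_cluster_request(name, node_count, node_type, apps, log_uri):
    """Collect the choices from step 2 into a run_job_flow request."""
    return {
        "Name": name,
        "LogUri": log_uri,
        # Assumed release label; pick the one you actually need.
        "ReleaseLabel": "emr-4.0.0",
        "Applications": [{"Name": app} for app in apps],
        "Instances": {
            "MasterInstanceType": node_type,
            "SlaveInstanceType": node_type,
            "InstanceCount": node_count,
            # Transient cluster: it shuts down once no steps remain,
            # so add processing steps before launching for real.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch_cluster(request, region="ap-southeast-1"):
    """Step 3: submit the request. Needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder works without boto3

    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]


request = build_cluster_request(
    name="example-cluster",
    node_count=5,
    node_type="m3.xlarge",
    apps=["Hive", "Pig"],
    log_uri="s3://examplebucket/logs/",
)
print(request["Instances"])
```

Keeping the request builder separate from the API call makes the cluster definition easy to inspect or version before anything is launched.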
• You can easily resize the cluster
• And launch parallel clusters using the same data
• Use Spot nodes to save time and money
• When processing is complete, you can terminate the cluster (and stop paying)
Q3. What’s good about it?
Scalability, Cost & Ease of Use
EMR with spot instances
Scenario #1 (without Spot), Duration: 14 Hours
Cost: 4 instances * 14 hrs * $0.50 = $28
Scenario #2 (with Spot), Duration: 7 Hours
Cost: 4 instances * 7 hrs * $0.50 = $14 + 5 instances * 7 hrs * $0.25 = $8.75
Total = $22.75
Time Savings: 50% Cost Savings: ~19%
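The arithmetic for the two scenarios can be checked with a short Python calculation. The instance counts, hours, and hourly prices are the ones on the slide. The helper name `cluster_cost` is hypothetical, and savings are measured relative to Scenario #1.

```python
def cluster_cost(groups):
    """Sum instances * hours * hourly price over each instance group."""
    return sum(count * hours * price for count, hours, price in groups)


# Scenario #1: 4 on-demand instances for 14 hours at $0.50/hr.
without_spot = cluster_cost([(4, 14, 0.50)])

# Scenario #2: the same 4 on-demand instances for 7 hours,
# plus 5 Spot instances for 7 hours at $0.25/hr.
with_spot = cluster_cost([(4, 7, 0.50), (5, 7, 0.25)])

time_savings = (14 - 7) / 14
cost_savings = (without_spot - with_spot) / without_spot

print(f"Without Spot: ${without_spot:.2f}")
print(f"With Spot:    ${with_spot:.2f}")
print(f"Time savings: {time_savings:.0%}")
print(f"Cost savings: {cost_savings:.2%}")
```

The saving of $5.25 on a $28 baseline works out to 18.75%.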
[Figure: an EMR cluster with a master instance group, a core instance group, and a task instance group, using HDFS and Amazon S3]
Great for Spot Instances
The Hadoop Ecosystem
Amazon Kinesis
Q1. What is it?
A fully managed service for real-time processing of high-volume, streaming data.
Q2. How does it work?
[Figure: multiple data sources put records into a Kinesis Stream that spans three Availability Zones; applications for logging, metrics, analysis, and machine learning read from the stream, alongside S3, DynamoDB, Redshift, and EMR]
Putting data into Kinesis
• Each shard:
  • 1000 Tx Per Second
  • 1 MB Per Second
  • 50 KB Payload Per Tx
• Messages kept for 24 hours
• Simple PUT interface to store data in Kinesis
• A Partition Key is used to distribute the PUTs across Shards
• A unique Sequence # is created
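The per-shard limits and the role of the partition key can be illustrated with a small Python sketch. The limits are the ones listed above. Kinesis maps a partition key to a shard by taking the MD5 hash of the key as a 128-bit integer and finding the shard whose hash key range contains it. The even split of the hash space, the 1 MB = 1024 KB conversion, and the helper names `shards_needed` and `shard_for_key` are assumptions made for this example. A real stream reports each shard's actual hash key range.

```python
import hashlib
import math

RECORDS_PER_SECOND_PER_SHARD = 1000  # from the slide
KB_PER_SECOND_PER_SHARD = 1024       # 1 MB per second, assuming 1 MB = 1024 KB


def shards_needed(records_per_second, avg_record_kb):
    """Smallest shard count that satisfies both per-shard limits."""
    by_count = math.ceil(records_per_second / RECORDS_PER_SECOND_PER_SHARD)
    by_volume = math.ceil(
        records_per_second * avg_record_kb / KB_PER_SECOND_PER_SHARD
    )
    return max(1, by_count, by_volume)


def shard_for_key(partition_key, shard_count):
    """Pick a shard index for a partition key.

    The key is hashed with MD5 into a 128-bit integer. This sketch
    assumes the hash space is divided evenly between the shards.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    range_size = 2 ** 128 // shard_count
    return min(hash_value // range_size, shard_count - 1)


count = shards_needed(records_per_second=5000, avg_record_kb=2)
print("shards needed:", count)
for key in ["sensor-1", "sensor-2", "sensor-3"]:
    print(key, "-> shard", shard_for_key(key, count))
```

Because the same key always hashes to the same value, all records for one key land on one shard. A low-cardinality partition key can therefore leave some shards idle while others hit their limits.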
Getting data out of Kinesis
Kinesis Client Library (KCL):
• Abstracts code from individual shards
• Starts a Kinesis Worker for each shard
• Increases and decreases workers
• Tracks a Worker’s location in the stream
Q3. What’s good about it?
• Easy Administration
• Real-time Performance
• High Throughput, Elastic
• Integration: S3, Redshift, DynamoDB, Storm, ElasticSearch
• Build Real-time Applications
• Low Cost
Amazon Machine Learning
A Legacy of Machine Learning at Amazon
“Customers who bought this also bought…”
Why Did We Build Amazon Machine Learning?
Three types of data-driven development
• Retrospective analysis and reporting: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR
• Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda
• Predictions to enable smart applications
Machine learning and smart applications
• Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available
Your data + machine learning = smart applications
Smart applications by example
• Based on what you know about the user: Will they use your product?
• Based on what you know about an order: Is this order fraudulent?
• Based on what you know about a news article: What other articles are interesting?
Challenges to Building Smart Applications Today
 Expertise                         | Technology                  | Operationalization
-----------------------------------+-----------------------------+----------------------------------------
 Limited supply of data scientists | Many choices, few mainstays | Complex and error-prone data workflows
 Expensive to hire or outsource    | Difficult to use and scale  | Custom platforms and APIs
What is Amazon Machine Learning?
Amazon Machine Learning
• Easy to use, managed machine learning service built for developers
• Robust, powerful machine learning technology based on Amazon’s internal systems
• Create models using your data already stored in the AWS cloud
• Deploy models to production in seconds
Easy to use and developer-friendly
• Use the intuitive, powerful service console to build and explore your initial models
  • Data retrieval
  • Model training, quality evaluation, fine-tuning
  • Deployment and management
• Automate model lifecycle with fully featured APIs and SDKs
  • Java, Python, .NET, JavaScript, Ruby, PHP
• Easily create smart iOS and Android applications with AWS Mobile SDK
Powerful machine learning technology
• Based on Amazon’s battle-hardened internal systems
• Not just the algorithms:
  • Smart data transformations
  • Input data and model quality alerts
  • Built-in industry best practices
• Grows with your needs
  • Train on up to 100 GB of data
  • Generate billions of predictions
  • Obtain predictions in batches or real-time
Integrated with AWS Data Ecosystem
• Access data that is stored in Amazon S3, Amazon Redshift, or MySQL databases in RDS
• Output predictions to Amazon S3 for easy integration with your data flows
• Use AWS Identity and Access Management (IAM) for fine-grained data-access permission policies
Fully-managed model and prediction services
• End-to-end service, with no servers to provision and manage
• One-click production model deployment
• Programmatically query model metadata to enable automatic retraining workflows
• Monitor prediction usage patterns with Amazon CloudWatch metrics
Pay-as-you-go and inexpensive
• Data analysis, model training, and evaluation: $0.42/instance hour
• Batch predictions: $0.10/1000
• Real-time predictions: $0.10/1000
• + hourly capacity reservation charge
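To get a feel for these prices, here is a small Python estimate. The rates are the ones on the slide. The hourly capacity reservation charge for real-time predictions is left out because the slide does not give its amount. The function name `estimate_cost` and the example workload are hypothetical.

```python
COMPUTE_PRICE_PER_HOUR = 0.42      # data analysis, training, evaluation
PRICE_PER_1000_PREDICTIONS = 0.10  # same rate for batch and real-time


def estimate_cost(compute_hours, batch_predictions, realtime_predictions):
    """Estimate charges in dollars, excluding the capacity reservation."""
    compute = compute_hours * COMPUTE_PRICE_PER_HOUR
    predictions = (
        (batch_predictions + realtime_predictions)
        / 1000
        * PRICE_PER_1000_PREDICTIONS
    )
    return round(compute + predictions, 2)


# Hypothetical month: 10 compute hours, one million batch predictions,
# and fifty thousand real-time predictions.
print(estimate_cost(10, 1_000_000, 50_000))
```

For the example workload, prediction volume dominates the bill: about $105 of predictions against $4.20 of compute.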
Three Supported Types of Predictions
• Binary Classification
  • Predict the answer to a Yes/No question
• Multi-class classification
  • Predict the correct category from a list
• Regression
  • Predict the value of a numeric variable
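One way to think about the three types is in terms of the target column you want to predict. The sketch below is a simplified heuristic written for this example, not part of Amazon ML. It inspects a sample of target values and returns the matching `ml_model_type` string (`BINARY`, `MULTICLASS`, or `REGRESSION`). The function name `choose_model_type` is hypothetical.

```python
def choose_model_type(target_values):
    """Suggest a model type from a sample of target values.

    Heuristic only: two distinct values suggest a yes/no question,
    numeric values suggest regression, anything else is treated as
    a list of categories.
    """
    distinct = set(target_values)
    if len(distinct) < 2:
        raise ValueError("need at least two distinct target values")
    if len(distinct) == 2:
        return "BINARY"
    if all(
        isinstance(value, (int, float)) and not isinstance(value, bool)
        for value in distinct
    ):
        return "REGRESSION"
    return "MULTICLASS"


print(choose_model_type(["yes", "no", "yes"]))           # fraud or not
print(choose_model_type(["sports", "politics", "tech"]))  # article category
print(choose_model_type([13.2, 7.5, 21.0]))               # numeric value
```

In practice the choice follows from the question being asked. A numeric column with only a handful of codes, for example, may really be a category and is better treated as multi-class.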
How Do I Get Started Using Amazon Machine Learning?
Get Started Quickly
• Create, access, and manage all Amazon ML entities through the AWS Management Console
• Easily learn to build a model with the tutorial dataset provided
• Add prediction capabilities to your iOS and Android applications with AWS Mobile SDK
• Use Amazon ML APIs, CLIs, or SDKs
Building smart applications with Amazon ML: (1) Train model, (2) Evaluate and optimize, (3) Retrieve predictions
1. Train model
- Create a Datasource object pointing to your data
- Explore and understand your data
- Transform data and train your model
Explore and understand your data
Train your model
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')
2. Evaluate and optimize
- Understand model quality
- Adjust model interpretation
Explore model quality
Fine-tune model interpretation
3. Retrieve predictions
- Batch predictions
- Real-time predictions
Batch predictions
• Asynchronous, large-volume prediction generation
• Request through service console or API
• Best for applications that deal with batches of data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> batch_prediction = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')
Real-time predictions
• Synchronous, low-latency, high-throughput prediction generation
• Request through service API or server or mobile SDKs
• Best for interactive applications that deal with individual data records
>>> import boto
>>> ml = boto.connect_machinelearning()
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
  'Prediction': {
    'predictedValue': 13.284348,
    'details': {
      'Algorithm': 'SGD',
      'PredictiveModelType': 'REGRESSION'
    }
  }
}
Architecture Patterns for Smart Applications
Batch predictions with Amazon EMR
1. Raw data in Amazon S3
2. Process data with Amazon EMR
3. Aggregated data in Amazon S3
4. Query for predictions with Amazon ML batch API
5. Predictions in Amazon S3
6. Your application
Batch predictions with Amazon Redshift
1. Structured data in Amazon Redshift
2. Query for predictions with Amazon ML batch API
3. Predictions in Amazon S3
4. Load predictions into Amazon Redshift, or read prediction results directly from Amazon S3
5. Your application
Real-time predictions for interactive applications
1. Your application
2. Query for predictions with Amazon ML real-time API
Thank You!
aws.amazon.com/big-data
@AWSCloudSEAsia
Chris Hampartsoumian
Technology Evangelist ASEAN