Building a Big Data & Analytics Platform using AWS


Chris Hampartsoumian

Technology Evangelist - ASEAN

End to End Data Flows on the Cloud

Structured, Unstructured & Streaming

July 2015

How is Cloud Computing important for Big Data Applications?

How did Amazon get into cloud computing?

11 Regions

30 Availability Zones

53 Edge locations

AWS Global Infrastructure

Why are customers adopting cloud computing?

• Variable expense: Replace capital expenditure with variable expense
• Elastic capacity: No need to guess capacity requirements and over-provision
• Speed and agility: Infrastructure in minutes, not weeks
• Global reach: Go global in minutes and reach a global audience

Your Applications
Mobile: Push Notifications, Mobile Analytics, Cognito, Cognito Sync
Network: VPC, Direct Connect, Route 53
API
Human Interaction: Support, Web Console
Interaction: Command Line, Libraries, SDKs
Database: DynamoDB, RDS, ElastiCache
Deployment & Management: Elastic Beanstalk, OpsWorks, CloudFormation, CodeDeploy, CodePipeline, CodeCommit
Security & Administration: CloudWatch, Config, CloudTrail, IAM, Directory, KMS
Application: SQS, SWF, AppStream, Elastic Transcoder, SES, CloudSearch, SNS
Enterprise Applications: WorkSpaces, WorkMail, WorkDocs
Compute: EC2, ELB, Auto Scaling, Lambda, ECS
Analytics: Kinesis, Data Pipeline, Redshift, EMR, Machine Learning
Storage: EBS, Glacier, CloudFront, EFS, S3
AWS Global Infrastructure: 11 Regions, 30 Availability Zones, 53 Edge Locations

[Figure: data stores positioned by Structure (Low to High) and Size (Small to Large): Traditional Database, Hadoop, NoSQL, MPP Database]

Structured, Unstructured, Streaming

MPP Databases

Amazon Redshift

Hadoop

Amazon EMR

Real-time Analysis

Amazon Kinesis

• Standard SQL
• Optimized for fast analysis
• Very scalable

Amazon Redshift

Q1. What is it?

MPP SQL Database

Optimised for Analytics

Gigabytes to Petabytes

Fully relational

Fully managed

Amazon Redshift


Q2. How does it work?

JDBC/ODBC

ID  Name
1   John Smith
2   Jane Jones
3   Peter Black
4   Pat Partridge
5   Sarah Cyan
6   Brian Snail

The same rows, in the groupings shown on the slide:
1 John Smith, 4 Pat Partridge
2 Jane Jones, 5 Sarah Cyan
3 Peter Black, 6 Brian Snail
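The second listing of the six rows (1 and 4, then 2 and 5, then 3 and 6) is consistent with rows being dealt out in turn across three nodes. The sketch below illustrates that idea only. The node count of three and the round-robin rule are assumptions read off the slide, not a statement of how Redshift places rows for any particular table.

```python
from collections import defaultdict

# The six example rows from the slide.
rows = [
    (1, "John Smith"),
    (2, "Jane Jones"),
    (3, "Peter Black"),
    (4, "Pat Partridge"),
    (5, "Sarah Cyan"),
    (6, "Brian Snail"),
]


def distribute_round_robin(rows, node_count):
    """Assign rows to nodes in turn: first row to node 0, second to node 1, ..."""
    nodes = defaultdict(list)
    for position, row in enumerate(rows):
        nodes[position % node_count].append(row)
    return dict(nodes)


# node_count=3 is an assumption taken from the grouping on the slide.
placement = distribute_round_robin(rows, node_count=3)
for node, node_rows in sorted(placement.items()):
    print(node, [row_id for row_id, _name in node_rows])
```

Each node can then scan its own share of the rows in parallel, which is the point of an MPP design.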

Dramatically reduces I/O
• Column storage
• Data compression
• Zone maps

• With row storage you do unnecessary I/O
• To get average Amount by State, you have to read everything

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375

• With column storage, you only read the data you need

ID   Age  State  Amount
123  20   CA     500
345  25   WA     250
678  40   FL     125
957  37   WA     375
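To make the I/O difference concrete, here is a minimal sketch using the four rows from the slide. It models a column store as one list per column and counts how many values each layout has to touch to compute the average Amount by State. The in-memory lists and the "values touched" count are illustrative assumptions, not Redshift's on-disk format.

```python
# Row layout: each record holds every column.
rows = [
    {"ID": 123, "Age": 20, "State": "CA", "Amount": 500},
    {"ID": 345, "Age": 25, "State": "WA", "Amount": 250},
    {"ID": 678, "Age": 40, "State": "FL", "Amount": 125},
    {"ID": 957, "Age": 37, "State": "WA", "Amount": 375},
]

# Column layout: one list per column, built from the same data.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def average_amount_by_state(states, amounts):
    """Group amounts by state and return the mean for each state."""
    totals = {}
    for state, amount in zip(states, amounts):
        total, count = totals.get(state, (0, 0))
        totals[state] = (total + amount, count + 1)
    return {state: total / count for state, (total, count) in totals.items()}


# The query only needs two of the four columns.
averages = average_amount_by_state(columns["State"], columns["Amount"])

# A row store reads every value of every row; a column store reads two columns.
values_read_row_store = len(rows) * len(rows[0])
values_read_column_store = len(columns["State"]) + len(columns["Amount"])

print(averages)
print(values_read_row_store, values_read_column_store)
```

With four columns the column layout touches half the values; the gap widens as tables get wider.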

Zone maps
• Track the minimum and maximum value for each block
• Skip over blocks that don’t contain relevant data

Block values                              Min   Max
10 | 13 | 14 | 26 | … | 100 | 245 | 324    10    324
375 | 393 | 417 | … | 512 | 549 | 623      375   623
637 | 712 | 809 | … | 834 | 921 | 959      637   959
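The block-skipping idea can be sketched in a few lines. The example below uses only the values visible on the slide (the elided values are left out), keeps a (min, max) pair per block, and skips any block whose range cannot overlap the query range. The function name `scan_range` and the query bounds are hypothetical and chosen for illustration.

```python
# Three sorted blocks, using only the values visible on the slide.
blocks = [
    [10, 13, 14, 26, 100, 245, 324],
    [375, 393, 417, 512, 549, 623],
    [637, 712, 809, 834, 921, 959],
]

# Zone map: the minimum and maximum value of each block.
zone_map = [(min(block), max(block)) for block in blocks]


def scan_range(low, high):
    """Return values in [low, high] and the number of blocks actually read."""
    matches = []
    blocks_read = 0
    for block, (block_min, block_max) in zip(blocks, zone_map):
        # Skip blocks whose [min, max] range cannot contain a match.
        if block_max < low or block_min > high:
            continue
        blocks_read += 1
        matches.extend(value for value in block if low <= value <= high)
    return matches, blocks_read


matches, blocks_read = scan_range(400, 600)
print(zone_map)
print(matches, blocks_read)
```

Only the middle block is read for this query; the other two are ruled out from their min and max alone.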



Q3. What’s good about it?

Performance, Scalability, Ease of Use, Cost


Performance Evaluation on 2B Rows
Systems compared: Traditional SQL Database, Amazon Redshift
Aggregate by month: 02:08:35 | 00:35:46 | 00:00:12

Scalability: from 160 GB (DW2.L) to 2 PB (8XL nodes)

Q4. How do I integrate with Redshift?


Works with your existing analysis tools

JDBC/ODBC → Amazon Redshift

Loading data
S3, DynamoDB, EMR, Linux → Redshift
Source Systems → ETL → Amazon Redshift


MPP Databases

Amazon Redshift

Hadoop

Amazon EMR

Real-time Analysis

Amazon Kinesis

Input File → Hadoop cluster (Functions) → Output

1. Very Flexible
2. Very Scalable
3. Often Transient

Amazon Elastic MapReduce (EMR)

Q1. What is it?

Managed Hadoop

Input File → EMR cluster of EC2 instances (Functions) → Output

EMR: S3 → EMR Cluster → S3

1. Put the data into S3
2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase
3. Launch the cluster using the EMR console, CLI, SDK, or APIs
4. Get the output from S3
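The "choose" and "launch" steps amount to filling in a cluster description and handing it to the service. The sketch below only builds that description as a plain dictionary, shaped like the keyword arguments of boto3's EMR `run_job_flow` call; it does not contact AWS. The cluster name, bucket, release label, and instance type are hypothetical values chosen for illustration, and the two role names are the EMR defaults, which may not exist in a given account.

```python
def build_cluster_request(name, node_count, instance_type, applications,
                          log_uri, release_label="emr-4.0.0"):
    """Describe a transient cluster: one master node plus core nodes."""
    if node_count < 2:
        raise ValueError("need one master node and at least one core node")
    return {
        "Name": name,
        "ReleaseLabel": release_label,      # assumed Hadoop distribution release
        "LogUri": log_uri,                  # hypothetical S3 location for logs
        "Applications": [{"Name": app} for app in applications],
        "Instances": {
            "InstanceGroups": [
                {
                    "Name": "master",
                    "InstanceRole": "MASTER",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": 1,
                },
                {
                    "Name": "core",
                    "InstanceRole": "CORE",
                    "Market": "ON_DEMAND",
                    "InstanceType": instance_type,
                    "InstanceCount": node_count - 1,
                },
            ],
            # Transient cluster: shut down once there are no steps left to run.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_cluster_request(
    name="example-cluster",
    node_count=5,
    instance_type="m3.xlarge",
    applications=["Hive", "Pig"],
    log_uri="s3://examplebucket/logs/",
)
print(request["Instances"]["InstanceGroups"][1]["InstanceCount"])
```

With credentials configured, a dictionary like this could be passed as `boto3.client("emr").run_job_flow(**request)`; processing steps that read from and write to S3 would be added under a `Steps` key.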


EMR

EMR Cluster

S3

You can easily resize the cluster

And launch parallel clusters using the same data

EMR

EMR Cluster

S3

Use Spot nodes to save time and money

EMR Cluster, S3

When processing is complete, you can terminate the cluster (and stop paying)


Scalability, Cost & Ease of Use


EMR with spot instances

Scenario #1 (without Spot): Duration 14 hours
Cost: 4 instances × 14 hrs × $0.50 = $28

Scenario #2 (with Spot): Duration 7 hours
Cost: 4 instances × 7 hrs × $0.50 = $14, plus 5 instances × 7 hrs × $0.25 = $8.75
Total = $22.75

Time Savings: 50%. Cost Savings: ~19% ($5.25 saved on $28)
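The arithmetic for the two scenarios can be checked with a short script. It uses only the instance counts, durations, and hourly rates quoted above; the helper name `cluster_cost` is hypothetical. Measured against the $28 baseline, the saving works out to 18.75%.

```python
ON_DEMAND_RATE = 0.50  # dollars per instance hour, from the slide
SPOT_RATE = 0.25       # dollars per instance hour, from the slide


def cluster_cost(groups, hours):
    """Sum count * hourly rate * hours over each (count, rate) instance group."""
    return sum(count * rate * hours for count, rate in groups)


# Scenario 1: four on-demand instances for 14 hours.
cost_without_spot = cluster_cost([(4, ON_DEMAND_RATE)], hours=14)

# Scenario 2: the same four instances plus five Spot instances for 7 hours.
cost_with_spot = cluster_cost([(4, ON_DEMAND_RATE), (5, SPOT_RATE)], hours=7)

time_saving = 1 - 7 / 14
cost_saving = 1 - cost_with_spot / cost_without_spot

print(f"without Spot: ${cost_without_spot:.2f}")
print(f"with Spot:    ${cost_with_spot:.2f}")
print(f"time saving:  {time_saving:.0%}")
print(f"cost saving:  {cost_saving:.2%}")
```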

EMR cluster
• Master instance group
• Core instance group (HDFS)
• Task instance group (great for Spot Instances)
Storage: HDFS, Amazon S3

The Hadoop Ecosystem


MPP Databases

Amazon Redshift

Hadoop

Amazon EMR

Real-time Analysis

Amazon Kinesis

Q1. What is it?

Kinesis

A fully managed service for real-time processing of high-volume, streaming data.


[Figure: Data Sources → Kinesis Stream (spanning three Availability Zones) → Logging, Metrics, Analysis, Machine Learning → S3, DynamoDB, Redshift, EMR]

Putting data into Kinesis

• Each shard supports:
  • 1,000 transactions (PUTs) per second
  • 1 MB per second
  • 50 KB payload per transaction
• Messages kept for 24 hours
• Simple PUT interface to store data in Kinesis
• A Partition Key is used to distribute the PUTs across Shards
• A unique Sequence # is created
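How a partition key spreads PUTs across shards can be sketched as follows. Kinesis hashes the partition key with MD5 to a 128-bit integer, and each shard owns a range of that hash space. The even split of the range and the example keys below are assumptions for illustration; a real stream's shard ranges come from the service.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_ranges(shard_count):
    """Split the hash space into contiguous, evenly sized ranges (an assumption)."""
    size = HASH_SPACE // shard_count
    ranges = []
    for index in range(shard_count):
        start = index * size
        # The last shard absorbs any remainder so the whole space is covered.
        end = HASH_SPACE - 1 if index == shard_count - 1 else start + size - 1
        ranges.append((start, end))
    return ranges


def shard_for_key(partition_key, ranges):
    """Return the index of the shard whose hash range contains the key's MD5."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    hash_value = int(digest, 16)
    for shard_index, (start, end) in enumerate(ranges):
        if start <= hash_value <= end:
            return shard_index
    raise ValueError("hash value outside the configured ranges")


ranges = shard_ranges(4)
keys = ["device-1", "device-2", "device-3", "user-42"]  # hypothetical keys
assignments = {key: shard_for_key(key, ranges) for key in keys}
print(assignments)
```

Records with the same partition key always land on the same shard, so per-key ordering is preserved, while many distinct keys spread the load.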


Getting data out of Kinesis

Kinesis Client Library (KCL):

• Abstracts code from individual shards

• Starts a Kinesis Worker for each shard

• Increases and decreases workers

• Tracks a Worker’s location in the stream
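The bookkeeping the KCL does on your behalf can be illustrated with a toy model. This is not the KCL and uses none of its API; it is a sketch, under simplifying assumptions, of one worker per shard that records a checkpoint after each record so a restarted worker resumes where the previous one stopped. Small integers stand in for sequence numbers and a dictionary stands in for the checkpoint store.

```python
class ShardWorker:
    """Toy worker that processes one shard and checkpoints its position."""

    def __init__(self, shard_id, checkpoints):
        self.shard_id = shard_id
        self.checkpoints = checkpoints  # shared store: shard id -> last sequence number

    def process(self, records):
        last_seen = self.checkpoints.get(self.shard_id, -1)
        handled = []
        for sequence_number, data in records:
            if sequence_number <= last_seen:
                continue  # already processed before a restart
            handled.append(data)
            self.checkpoints[self.shard_id] = sequence_number
        return handled


def run_workers(stream, checkpoints):
    """Start one worker per shard and process whatever each shard holds."""
    workers = {shard_id: ShardWorker(shard_id, checkpoints) for shard_id in stream}
    return {shard_id: workers[shard_id].process(records)
            for shard_id, records in stream.items()}


checkpoints = {}
stream = {
    "shard-0": [(1, "a"), (2, "b")],
    "shard-1": [(1, "c")],
}

first_pass = run_workers(stream, checkpoints)

# Simulate a restart: new workers, same checkpoint store, one new record.
stream["shard-0"].append((3, "d"))
second_pass = run_workers(stream, checkpoints)

print(first_pass)
print(second_pass)
print(checkpoints)
```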


• Easy Administration
• Real-time Performance
• High Throughput, Elastic
• Integration: S3, Redshift, DynamoDB, Storm, Elasticsearch
• Build Real-time Applications
• Low Cost

Amazon Machine Learning

A Legacy of Machine Learning at Amazon

“Customers who bought this also bought…”

Why Did We Build Amazon Machine Learning?

Three types of data-driven development

• Retrospective analysis and reporting: Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR
• Here-and-now real-time processing and dashboards: Amazon Kinesis, Amazon EC2, AWS Lambda
• Predictions to enable smart applications

Machine learning and smart applications

• Machine learning is the technology that automatically finds patterns in your data and uses them to make predictions for new data points as they become available


Your data + machine learning = smart applications


Smart applications by example

• Based on what you know about the user: Will they use your product?
• Based on what you know about an order: Is this order fraudulent?
• Based on what you know about a news article: What other articles are interesting?

Challenges to Building Smart Applications Today

• Expertise: Limited supply of data scientists; expensive to hire or outsource
• Technology: Many choices, few mainstays; difficult to use and scale
• Operationalization: Complex and error-prone data workflows; custom platforms and APIs

What is Amazon Machine Learning?


Amazon Machine Learning

• Easy to use, managed machine learning service built for developers

• Robust, powerful machine learning technology based on Amazon’s internal systems

• Create models using your data already stored in the AWS cloud

• Deploy models to production in seconds


Easy to use and developer-friendly

• Use the intuitive, powerful service console to build and explore your initial models
  • Data retrieval
  • Model training, quality evaluation, fine-tuning
  • Deployment and management
• Automate model lifecycle with fully featured APIs and SDKs
  • Java, Python, .NET, JavaScript, Ruby, PHP
• Easily create smart iOS and Android applications with AWS Mobile SDK

Powerful machine learning technology

• Based on Amazon’s battle-hardened internal systems
• Not just the algorithms:
  • Smart data transformations
  • Input data and model quality alerts
  • Built-in industry best practices
• Grows with your needs
  • Train on up to 100 GB of data
  • Generate billions of predictions
  • Obtain predictions in batches or real-time

Integrated with AWS Data Ecosystem

• Access data that is stored in Amazon S3, Amazon Redshift, or MySQL databases in RDS

• Output predictions to Amazon S3 for easy integration with your data flows

• Use AWS Identity and Access Management (IAM) for fine-grained data-access permission policies


Fully-managed model and prediction services

• End-to-end service, with no servers to provision and manage

• One-click production model deployment

• Programmatically query model metadata to enable automatic retraining workflows

• Monitor prediction usage patterns with Amazon CloudWatch metrics


Pay-as-you-go and inexpensive

• Data analysis, model training, and evaluation: $0.42/instance hour

• Batch predictions: $0.10/1000

• Real-time predictions: $0.10/1000

• + hourly capacity reservation charge
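As a rough worked example of the pay-as-you-go pricing, the sketch below adds up a bill from the rates listed above. The usage figures are made up for illustration. The hourly capacity reservation charge for real-time predictions is not quantified on the slide, so it is left as an optional parameter defaulting to zero.

```python
COMPUTE_RATE_PER_HOUR = 0.42     # data analysis, training, evaluation (from the slide)
PREDICTION_RATE_PER_1000 = 0.10  # batch and real-time predictions (from the slide)


def estimate_cost(compute_hours, batch_predictions, realtime_predictions,
                  capacity_reservation_charge=0.0):
    """Estimate a bill in dollars; the reservation charge is supplied by the caller."""
    compute = compute_hours * COMPUTE_RATE_PER_HOUR
    batch = batch_predictions / 1000 * PREDICTION_RATE_PER_1000
    realtime = realtime_predictions / 1000 * PREDICTION_RATE_PER_1000
    return compute + batch + realtime + capacity_reservation_charge


# Hypothetical usage: 10 compute hours, 1,000,000 batch and 50,000 real-time predictions.
total = estimate_cost(compute_hours=10,
                      batch_predictions=1_000_000,
                      realtime_predictions=50_000)
print(f"${total:.2f}")
```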


Three Supported Types of Predictions

• Binary classification: Predict the answer to a Yes/No question
• Multi-class classification: Predict the correct category from a list
• Regression: Predict the value of a numeric variable

How Do I Get Started Using Amazon Machine Learning?

Get Started Quickly
• Create, access, and manage all Amazon ML entities through the AWS Management Console
• Easily learn to build a model with the tutorial dataset provided
• Add prediction capabilities to your iOS and Android applications with AWS Mobile SDK
• Use Amazon ML APIs, CLIs, or SDKs

Building smart applications with Amazon ML
1. Build model
2. Evaluate and optimize
3. Retrieve predictions

Step 1: Train model
- Create a Datasource object pointing to your data
- Explore and understand your data
- Transform data and train your model

Explore and understand your data


Train your model

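>>> import boto
>>> # Connect to Amazon ML using the credentials boto is configured with
>>> ml = boto.connect_machinelearning()
>>> # Train a regression model from an existing datasource
>>> model = ml.create_ml_model(
...     ml_model_id='my_model',
...     ml_model_type='REGRESSION',
...     training_data_source_id='my_datasource')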


Step 2: Evaluate and optimize
- Understand model quality
- Adjust model interpretation

Explore model quality


Fine-tune model interpretation


Step 3: Retrieve predictions
- Batch predictions
- Real-time predictions

Batch predictions

• Asynchronous, large-volume prediction generation

• Request through service console or API

• Best for applications that deal with batches of data records

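>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Score every record in the datasource and write the results to S3
>>> result = ml.create_batch_prediction(
...     batch_prediction_id='my_batch_prediction',
...     batch_prediction_data_source_id='my_datasource',
...     ml_model_id='my_model',
...     output_uri='s3://examplebucket/output/')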

Real-time predictions

• Synchronous, low-latency, high-throughput prediction generation

• Request through service API or server or mobile SDKs

• Best for interactive applications that deal with individual data records

>>> import boto
>>> ml = boto.connect_machinelearning()
>>> # Request a single prediction from the model's real-time endpoint
>>> ml.predict(
...     ml_model_id='my_model',
...     predict_endpoint='example_endpoint',
...     record={'key1': 'value1', 'key2': 'value2'})
{
    'Prediction': {
        'predictedValue': 13.284348,
        'details': {
            'Algorithm': 'SGD',
            'PredictiveModelType': 'REGRESSION'
        }
    }
}

Architecture Patterns for Smart Applications

Batch predictions with Amazon EMR
Raw data in Amazon S3 → Process data with Amazon EMR → Aggregated data in Amazon S3 → Query for predictions with Amazon ML batch API → Predictions in Amazon S3 → Your application

Batch predictions with Amazon Redshift
Structured data in Amazon Redshift → Query for predictions with Amazon ML batch API → Predictions in Amazon S3 → Load predictions into Amazon Redshift, or read prediction results directly from Amazon S3 → Your application

Real-time predictions for interactive applications
Your application → Query for predictions with Amazon ML real-time API

Thank You!

aws.amazon.com/big-data


@AWSCloudSEAsia
