(BDT310) Big Data Architectural Patterns and Best Practices on AWS
![Page 1: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/1.jpg)
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Siva Raghupathy, Principal Solutions Architect, Amazon Web Services
October 2015
BDT310: Big Data Architectural Patterns and Best Practices on AWS
![Page 2: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/2.jpg)
What to Expect from the Session
Big data challenges
How to simplify big data processing
What technologies should you use?
• Why?
• How?
Reference architecture
Design patterns
![Page 3: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/3.jpg)
Ever Increasing Big Data
Volume
Velocity
Variety
![Page 4: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/4.jpg)
Big Data Evolution
Batch
Report
Real-time
Alerts
Prediction
Forecast
![Page 5: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/5.jpg)
Plethora of Tools
Amazon Glacier, S3, DynamoDB, RDS, EMR, Amazon Redshift, Data Pipeline, Amazon Kinesis, Cassandra, CloudSearch, Kinesis-enabled app, Lambda, ML, SQS, ElastiCache, DynamoDB Streams
![Page 6: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/6.jpg)
Is there a reference architecture?
What tools should I use?
How?
Why?
![Page 7: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/7.jpg)
Architectural Principles
• Decoupled “data bus”
  • Data → Store → Process → Answers
• Use the right tool for the job
  • Data structure, latency, throughput, access patterns
• Use Lambda architecture ideas
  • Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
  • No/low admin
• Big data ≠ big cost
![Page 8: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/8.jpg)
Simplify Big Data Processing
Ingest / Collect → Store → Process / Analyze → Consume / Visualize
Time to Answer (Latency)
Throughput
Cost
![Page 9: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/9.jpg)
Collect / Ingest
![Page 10: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/10.jpg)
Types of Data
• Transactional
  • Database reads & writes (OLTP)
  • Cache
• Search
  • Logs
  • Streams
• File
  • Log files (/var/log)
  • Log collectors & frameworks
• Stream
  • Log records
  • Sensors & IoT data
[Diagram: Collect → Store. Sources shown: web apps, mobile apps (iOS, Android), logging (Logstash), and IoT applications. They produce transactional data, search data, file data, and stream data, which flow into database, search, file storage, and stream storage.]
![Page 11: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/11.jpg)
Store
![Page 12: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/12.jpg)
Stream Storage
[Diagram: the Collect → Store reference diagram with stream storage highlighted. Store-tier services shown: Amazon ES (search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), Apache Kafka, Amazon Kinesis, and Amazon DynamoDB (stream storage), Amazon S3 and Amazon Glacier (file storage).]
![Page 13: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/13.jpg)
Stream Storage Options
• AWS managed services
  • Amazon Kinesis → streams
  • DynamoDB Streams → table + streams
  • Amazon SQS → queue
  • Amazon SNS → pub/sub
• Unmanaged
  • Apache Kafka → stream
![Page 14: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/14.jpg)
Why Stream Storage?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• Preserve client ordering
• Streaming MapReduce
• Parallel consumption
[Diagram: records numbered 1–4 in four colors flow into a Kinesis stream, DynamoDB stream, or Kafka topic split into Shard 1 / Partition 1 and Shard 2 / Partition 2. Consumer 1 reports Count of Red = 4 and Count of Violet = 4; Consumer 2 reports Count of Blue = 4 and Count of Green = 4.]
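The ordering and parallel-consumption points above can be sketched in a few lines of plain Python. This is only an illustration, not how any of the listed services is implemented: the two-shard layout, the CRC32 hash, and the names `shard_for`, `produce`, and `consume` are assumptions made for the example.

```python
from collections import Counter, defaultdict
import zlib

NUM_SHARDS = 2  # hypothetical shard/partition count


def shard_for(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a partition key to a shard with a stable hash (CRC32 is a stand-in)."""
    return zlib.crc32(partition_key.encode("utf-8")) % num_shards


def produce(records, num_shards: int = NUM_SHARDS):
    """Append (key, sequence) records to per-shard logs in arrival order."""
    shards = defaultdict(list)
    for key, seq in records:
        shards[shard_for(key, num_shards)].append((key, seq))
    return shards


def consume(shard_log):
    """One consumer per shard: count records per key (the 'reduce' step)."""
    return Counter(key for key, _ in shard_log)


# Four colors, four records each, interleaved as they might arrive.
records = [(color, seq) for seq in range(1, 5)
           for color in ("red", "violet", "blue", "green")]
shards = produce(records)

counts = Counter()
for shard_id in sorted(shards):
    counts += consume(shards[shard_id])
```

Because every record with the same key lands in the same shard, each consumer sees that key's records in the order the producer sent them, and separate shards can be counted independently.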
![Page 15: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/15.jpg)
What About Queues & Pub/Sub ?
• Decouple producers & consumers/subscribers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No parallel consumption for Amazon SQS
• Amazon SNS can route to multiple queues or λ (AWS Lambda) functions
• No streaming MapReduce
[Diagram: producers → Amazon SQS queue → consumers; producers → Amazon SNS topic → Amazon SQS queue, AWS Lambda function (λ), and subscriber.]
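The fan-out behavior described for pub/sub can be pictured with an in-memory stand-in. The `Topic` class below is hypothetical and does not use the Amazon SNS or Amazon SQS APIs; it only shows one published message being delivered to several queues and to a function.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a pub/sub topic (not the Amazon SNS API)."""

    def __init__(self):
        self.subscribers = []

    def subscribe(self, deliver):
        """Register a callable that receives every published message."""
        self.subscribers.append(deliver)

    def publish(self, message):
        """Deliver one message to every subscriber."""
        for deliver in self.subscribers:
            deliver(message)


orders_queue = deque()   # stand-in for one queue subscription
audit_queue = deque()    # stand-in for a second queue subscription
handled = []             # results of a function-style subscriber

topic = Topic()
topic.subscribe(orders_queue.append)
topic.subscribe(audit_queue.append)
topic.subscribe(lambda msg: handled.append(msg.upper()))

for msg in ("a", "b"):
    topic.publish(msg)
```

Each subscriber gets its own copy of every message, which is the decoupling property the slide lists; nothing here models ordering guarantees or retries.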
![Page 16: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/16.jpg)
Which stream storage should I use?

| | Amazon Kinesis | DynamoDB Streams | Amazon SQS / Amazon SNS | Kafka |
|---|---|---|---|---|
| Managed | Yes | Yes | Yes | No |
| Ordering | Yes | Yes | No | Yes |
| Delivery | at-least-once | exactly-once | at-least-once | at-least-once |
| Lifetime | 7 days | 24 hours | 14 days | Configurable |
| Replication | 3 AZ | 3 AZ | 3 AZ | Configurable |
| Throughput | No limit | No limit | No limit | ~ Nodes |
| Parallel clients | Yes | Yes | No (SQS) | Yes |
| MapReduce | Yes | Yes | No | Yes |
| Record size | 1 MB | 400 KB | 256 KB | Configurable |
| Cost | Low | Higher (table cost) | Low–Medium | Low (+ admin) |
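One way to use the comparison above is as a lookup that filters options by hard requirements. The sketch below copies a few rows from the slide as of October 2015; the dictionary layout and the `candidates` helper are illustrative assumptions, and the values may no longer match current service limits.

```python
# Selected rows from the slide's comparison; None marks "Configurable".
STREAM_STORES = {
    "Amazon Kinesis": {
        "managed": True, "ordering": True,
        "parallel_clients": True, "max_record_kb": 1024,
    },
    "DynamoDB Streams": {
        "managed": True, "ordering": True,
        "parallel_clients": True, "max_record_kb": 400,
    },
    "Amazon SQS / Amazon SNS": {
        "managed": True, "ordering": False,
        "parallel_clients": False, "max_record_kb": 256,
    },
    "Kafka": {
        "managed": False, "ordering": True,
        "parallel_clients": True, "max_record_kb": None,
    },
}


def candidates(**required):
    """Return the stores whose properties equal every required value."""
    return [
        name for name, props in STREAM_STORES.items()
        if all(props.get(key) == value for key, value in required.items())
    ]


managed_and_ordered = candidates(managed=True, ordering=True)
```

For example, requiring a managed service with ordering leaves Amazon Kinesis and DynamoDB Streams, which matches reading the first two rows of the table by eye.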
![Page 17: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/17.jpg)
File Storage
[Diagram: the Collect → Store reference diagram with file storage (Amazon S3, Amazon Glacier) highlighted.]
![Page 18: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/18.jpg)
Why Is Amazon S3 Good for Big Data?
• Natively supported by big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters for storage (unlike HDFS)
• Can run transient Hadoop clusters & Amazon EC2 Spot instances
• Multiple distinct (Spark, Hive, Presto) clusters can use the same data
• Unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, IA, Amazon Glacier) via lifecycle policy
• Secure – SSL, client/server-side encryption at rest
• Low cost
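The tiered-storage bullet can be made concrete with a lifecycle configuration. This is a hedged sketch: the rule ID, prefix, and day counts are made-up values, and `apply_rule` assumes the boto3 S3 client's `put_bucket_lifecycle_configuration` call and valid AWS credentials, so it is defined but not invoked here.

```python
def tiering_rule(prefix: str = "", ia_after_days: int = 30,
                 glacier_after_days: int = 90) -> dict:
    """Build a lifecycle configuration: Standard -> Standard-IA -> Glacier."""
    if glacier_after_days <= ia_after_days:
        raise ValueError("Glacier transition must come after the IA transition")
    return {
        "Rules": [
            {
                "ID": "tier-down-cold-data",        # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"},
                ],
            }
        ]
    }


def apply_rule(bucket: str, config: dict) -> None:
    """Send the configuration to S3 (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch runs without boto3 installed

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )


config = tiering_rule(prefix="logs/")
```

Calling `apply_rule("my-bucket", config)` with a real bucket name would attach the rule; the bucket name here is a placeholder.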
![Page 19: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/19.jpg)
What about HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard – IA for infrequently accessed data
• Use Amazon Glacier for archiving cold data
![Page 20: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/20.jpg)
Database + Search Tier
[Diagram: the Collect → Store reference diagram with the database and search tier highlighted: Amazon ES (search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache).]
![Page 21: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/21.jpg)
Database + Search Tier Anti-pattern
![Page 22: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/22.jpg)
Best Practice — Use the Right Tool for the Job
Data Tier
• Search: Amazon Elasticsearch Service, Amazon CloudSearch
• Cache: Redis, Memcached
• SQL: Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server
• NoSQL: Cassandra, Amazon DynamoDB, HBase, MongoDB
![Page 23: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/23.jpg)
Materialized Views
![Page 24: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/24.jpg)
What Data Store Should I Use?
• Data structure → Fixed schema, JSON, key-value
• Access patterns → Store data in the format in which you will access it
• Data / access characteristics → Hot, warm, cold
• Cost → Right cost
![Page 25: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/25.jpg)
Data Structure and Access Patterns
| Access patterns | What to use? |
|---|---|
| Put/Get (key, value) | Cache, NoSQL |
| Simple relationships → 1:N, M:N | NoSQL |
| Cross-table joins, transactions, SQL | SQL |
| Faceting, search | Search |

| Data structure | What to use? |
|---|---|
| Fixed schema | SQL, NoSQL |
| Schema-free (JSON) | NoSQL, Search |
| (Key, value) | Cache, NoSQL |
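The two lookups above can be combined: a store type is a reasonable candidate when it appears for both the access pattern and the data structure. The dictionaries below copy the slide's entries; the key spellings and the `shortlist` helper are illustrative assumptions, not part of the talk.

```python
ACCESS_PATTERN_TO_STORE = {
    "put/get (key, value)": ["Cache", "NoSQL"],
    "simple relationships (1:N, M:N)": ["NoSQL"],
    "cross-table joins, transactions, SQL": ["SQL"],
    "faceting, search": ["Search"],
}

DATA_STRUCTURE_TO_STORE = {
    "fixed schema": ["SQL", "NoSQL"],
    "schema-free (JSON)": ["NoSQL", "Search"],
    "(key, value)": ["Cache", "NoSQL"],
}


def shortlist(access_pattern: str, data_structure: str) -> list:
    """Store types recommended for both the access pattern and the structure."""
    by_structure = set(DATA_STRUCTURE_TO_STORE[data_structure])
    return [store for store in ACCESS_PATTERN_TO_STORE[access_pattern]
            if store in by_structure]


json_key_value = shortlist("put/get (key, value)", "schema-free (JSON)")
```

An empty result, such as faceted search over a fixed schema, signals that the two tables disagree and the choice needs more context than this lookup captures.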
![Page 26: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/26.jpg)
What Is the Temperature of Your Data / Access ?
![Page 27: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/27.jpg)
Data / Access Characteristics: Hot, Warm, Cold
| Characteristic | Hot | Warm | Cold |
|---|---|---|---|
| Volume | MB–GB | GB–TB | PB |
| Item size | B–KB | KB–MB | KB–TB |
| Latency | ms | ms, sec | min, hrs |
| Durability | Low–High | High | Very High |
| Request rate | Very High | High | Low |
| Cost/GB | $$-$ | $-¢¢ | ¢ |
![Page 28: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/28.jpg)
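[Diagram: Cache, SQL, NoSQL, Search, and Glacier positioned along a hot data → warm data → cold data axis. From hot to cold: request rate high → low, cost/GB high → low, latency low → high, data volume low → high. A second axis, structure, runs low → high.]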
![Page 29: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/29.jpg)
What Data Store Should I Use?
| | Amazon ElastiCache | Amazon DynamoDB | Amazon Aurora | Amazon Elasticsearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier |
|---|---|---|---|---|---|---|---|
| Average latency | ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~ size) | hrs |
| Data volume | GB | GB–TBs (no limit) | GB–TB (64 TB max) | GB–TB | GB–PB (~ nodes) | MB–PB (no limit) | GB–PB (no limit) |
| Item size | B–KB | KB (400 KB max) | KB (64 KB) | KB (1 MB max) | MB–GB | KB–GB (5 TB max) | GB (40 TB max) |
| Request rate | High – Very High | Very High (no limit) | High | High | Low – Very High | Low – Very High (no limit) | Very Low |
| Storage cost (GB/month) | $$ | ¢¢ | ¢¢ | ¢¢ | ¢ | ¢ | ¢/10 |
| Durability | Low – Moderate | Very High | Very High | High | High | Very High | Very High |

The stores run from hot data (left) through warm data to cold data (right).
![Page 30: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/30.jpg)
Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase
my team’s use of Amazon S3. Hoping you could answer
some questions. The current iteration of the design calls for
many small files, perhaps up to a billion during peak. The
total size would be on the order of 1.5 TB per month…”
| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|
| 300 | 2,048 | 1,483 | 777,600,000 |
![Page 31: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/31.jpg)
Cost Conscious Design Example: Should I use Amazon S3 or Amazon DynamoDB?
https://calculator.s3.amazonaws.com/index.html
![Page 32: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/32.jpg)
| Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|
| 300 | 2,048 | 1,483 | 777,600,000 |

Amazon S3 or Amazon DynamoDB?
![Page 33: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/33.jpg)
| | Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month |
|---|---|---|---|---|
| Scenario 1 | 300 | 2,048 | 1,483 | 777,600,000 |
| Scenario 2 | 300 | 32,768 | 23,730 | 777,600,000 |

Slide labels: “use Amazon S3”, “use Amazon DynamoDB”
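The figures in the scenarios above follow from simple arithmetic. The sketch below reproduces them under two assumptions that are not stated on the slides but do match the printed numbers: a 30-day month and 1 GB = 2^30 bytes. The function name `monthly_volume` is made up for the example.

```python
SECONDS_PER_MONTH = 30 * 24 * 60 * 60   # assumes a 30-day month
BYTES_PER_GB = 2 ** 30                  # assumes binary gigabytes


def monthly_volume(writes_per_sec: int, object_size_bytes: int):
    """Return (objects per month, total size in GB per month)."""
    objects = writes_per_sec * SECONDS_PER_MONTH
    total_gb = objects * object_size_bytes / BYTES_PER_GB
    return objects, total_gb


scenario_1 = monthly_volume(300, 2_048)    # small objects
scenario_2 = monthly_volume(300, 32_768)   # 16x larger objects
```

Both scenarios write the same number of objects, so request costs are identical and only the stored volume differs by a factor of 16. That is why object size, not request rate, drives the comparison between the two services.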
![Page 34: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/34.jpg)
Process / Analyze
![Page 35: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/35.jpg)
Analyze
[Diagram: the reference diagram extended to Collect → Store → Analyze. The Analyze column adds Amazon Elastic MapReduce (Impala, Pig, Streaming), Amazon Redshift, Amazon ML, Amazon Kinesis, and AWS Lambda, grouped under stream processing, batch, interactive, and ML. Stores are labeled hot, warm, or cold.]
![Page 36: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/36.jpg)
Process / Analyze
Analysis of data is a process of inspecting, cleaning,
transforming, and modeling data with the goal of discovering
useful information, suggesting conclusions, and supporting
decision-making.
Examples
• Interactive dashboards → Interactive analytics
• Daily/weekly/monthly reports → Batch analytics
• Billing/fraud alerts, 1 minute metrics → Real-time analytics
• Sentiment analysis, prediction models → Machine learning
![Page 37: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/37.jpg)
Interactive Analytics
Takes a large amount of (warm/cold) data
Takes seconds to get answers back
Example: Self-service dashboards
![Page 38: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/38.jpg)
Batch Analytics
Takes a large amount of (warm/cold) data
Takes minutes or hours to get answers back
Example: Generating daily, weekly, or monthly reports
![Page 39: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/39.jpg)
Real-Time Analytics
Takes a small amount of hot data and asks questions of it
Takes a short amount of time (milliseconds or seconds) to get the answer back
• Real-time (event)
  • Real-time response to events in data streams
  • Example: Billing/fraud alerts
• Near real-time (micro-batch)
  • Near real-time operations on small batches of events in data streams
  • Example: 1-minute metrics
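The micro-batch idea behind 1-minute metrics can be sketched as grouping events into fixed one-minute windows and counting per window. This is a plain-Python illustration, not Spark Streaming or KCL code; the event shape `(epoch_seconds, metric_name)` and the sample events are invented for the example.

```python
from collections import Counter

WINDOW_SECONDS = 60


def one_minute_counts(events):
    """Count events per (window start, metric name) in 60-second windows."""
    counts = Counter()
    for timestamp, name in events:
        window_start = int(timestamp) - int(timestamp) % WINDOW_SECONDS
        counts[(window_start, name)] += 1
    return dict(counts)


# Hypothetical events: (seconds since some epoch, metric name).
sample_events = [
    (0, "login"), (59, "login"),      # first minute
    (60, "login"), (61, "error"),     # second minute
    (125, "login"),                   # third minute
]
metrics = one_minute_counts(sample_events)
```

A real-time (event) system would instead react to each record as it arrives, for example by raising an alert, rather than waiting for the window to close.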
![Page 40: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/40.jpg)
Predictions via Machine Learning
ML gives computers the ability to learn without being explicitly programmed
Machine Learning Algorithms:
- Supervised Learning ← “teach” the program
  - Classification ← Is this transaction fraud? (Yes/No)
  - Regression ← Customer lifetime value?
- Unsupervised Learning ← let it learn by itself
  - Clustering ← Market segmentation
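To make the supervised-learning case concrete, here is a minimal nearest-centroid classifier in plain Python. It is a toy stand-in chosen for brevity, not the algorithm used by Amazon ML, Mahout, or Spark ML; the features (amount, transactions in the last hour), the labels, and the training rows are all invented.

```python
def centroid(points):
    """Component-wise mean of a list of equal-length feature tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def train(examples):
    """'Teach' the program: compute one centroid per label."""
    by_label = {}
    for features, label in examples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(points) for label, points in by_label.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the features."""
    def squared_distance(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical labeled transactions: (amount, transactions in last hour).
training = [
    ((20.0, 1.0), "ok"), ((35.0, 2.0), "ok"),
    ((900.0, 9.0), "fraud"), ((1200.0, 8.0), "fraud"),
]
model = train(training)
```

Calling `predict(model, (1000.0, 7.0))` answers the slide's yes/no fraud question for a new transaction; a clustering method would instead group unlabeled rows without being given the `"ok"` and `"fraud"` labels.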
![Page 41: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/41.jpg)
Analysis Tools and Frameworks
Machine Learning
• Mahout, Spark ML, Amazon ML
Interactive Analytics
• Amazon Redshift, Presto, Impala, Spark
Batch Processing
• MapReduce, Hive, Pig, Spark
Stream Processing
• Micro-batch: Spark Streaming, KCL, Hive, Pig
• Real-time: Storm, AWS Lambda, KCL
[Diagram: the Analyze stage, showing Amazon Redshift, Amazon Elastic MapReduce (Impala, Pig, Streaming), Amazon Machine Learning, Amazon Kinesis, and AWS Lambda grouped under stream processing, batch, interactive, and ML.]
![Page 42: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/42.jpg)
What Stream Processing Technology Should I Use?

| | Spark Streaming | Apache Storm | Amazon Kinesis Client Library | AWS Lambda | Amazon EMR (Hive, Pig) |
|---|---|---|---|---|---|
| Scale / throughput | ~ Nodes | ~ Nodes | ~ Nodes | Automatic | ~ Nodes |
| Batch or real-time | Real-time | Real-time | Real-time | Real-time | Batch |
| Manageability | Yes (Amazon EMR) | Do it yourself | Amazon EC2 + Auto Scaling | AWS managed | Yes (Amazon EMR) |
| Fault tolerance | Single AZ | Configurable | Multi-AZ | Multi-AZ | Single AZ |
| Programming languages | Java, Python, Scala | Any language via Thrift | Java, via MultiLangDaemon (.NET, Python, Ruby, Node.js) | Node.js, Java | Hive, Pig, Streaming languages |
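As an illustration of the AWS Lambda column, the sketch below processes a batch of Kinesis records the way a Lambda function receives them: a `Records` list whose `kinesis.data` field is base64-encoded. The slide lists Node.js and Java as the supported languages at the time; Python is used here only to stay consistent with the other examples. The transaction fields, the `ALERT_THRESHOLD` value, and the `fake_event` helper are assumptions made for local testing.

```python
import base64
import json

ALERT_THRESHOLD = 1000  # hypothetical amount above which to raise an alert


def handler(event, context):
    """Lambda-style entry point: decode each Kinesis record and flag large amounts."""
    alerts = []
    for record in event.get("Records", []):
        payload = base64.b64decode(record["kinesis"]["data"])
        transaction = json.loads(payload)
        if transaction.get("amount", 0) > ALERT_THRESHOLD:
            alerts.append(transaction["id"])
    return {"alerts": alerts}


def fake_event(transactions):
    """Build a minimal Kinesis-shaped event for local testing."""
    return {
        "Records": [
            {"kinesis": {"data": base64.b64encode(
                json.dumps(t).encode("utf-8")).decode("ascii")}}
            for t in transactions
        ]
    }


result = handler(
    fake_event([{"id": "t1", "amount": 50}, {"id": "t2", "amount": 5000}]),
    None,
)
```

In a deployed function the alert IDs would typically be forwarded to a notification service rather than returned, and the scaling noted as "Automatic" in the table is handled by the service, not by this code.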
![Page 43: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/43.jpg)
What Data Processing Technology Should I Use?
| | Amazon Redshift | Impala | Presto | Spark | Hive |
|---|---|---|---|---|---|
| Query latency | Low | Low | Low | Low | Medium (Tez) – High (MapReduce) |
| Durability | High | High | High | High | High |
| Data volume | 1.6 PB max | ~ Nodes | ~ Nodes | ~ Nodes | ~ Nodes |
| Managed | Yes | Yes (EMR) | Yes (EMR) | Yes (EMR) | Yes (EMR) |
| Storage | Native | HDFS / S3A* | HDFS / S3 | HDFS / S3 | HDFS / S3 |
| SQL compatibility | High | Medium | High | Low (SparkSQL) | Medium (HQL) |
![Page 44: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/44.jpg)
What about ETL?
[Diagram: ETL sits between Store and Analyze.]
https://aws.amazon.com/big-data/partner-solutions/
![Page 45: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/45.jpg)
Consume / Visualize
![Page 46: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/46.jpg)
Collect → Store → Analyze → Consume
[Diagram: the reference diagram extended with a Consume column: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, and apps & APIs. ETL appears alongside the Store and Analyze stages. Paths are labeled fast or slow, and stores hot, warm, or cold.]
![Page 47: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/47.jpg)
Consume
• Predictions
• Analysis and Visualization
• Notebooks
• IDE
• Applications & API
[Diagram: Store → ETL → Analyze → Consume. Consume shows analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, and apps & APIs. Audiences shown: business users; data scientists and developers.]
![Page 48: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/48.jpg)
Putting It All Together
![Page 49: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/49.jpg)
Reference Architecture
[Diagram: the complete Collect → Store → Analyze → Consume reference architecture.
• Collect: web apps, mobile apps (iOS, Android), logging (Logstash), and IoT applications producing transactional, search, file, and stream data
• Store: Amazon ES (search), Amazon RDS (SQL), Amazon DynamoDB (NoSQL), Amazon ElastiCache (cache), Amazon S3 and Amazon Glacier (file storage), Apache Kafka, Amazon Kinesis, and Amazon DynamoDB (stream storage)
• Analyze: Amazon Elastic MapReduce (Impala, Pig, Streaming), Amazon Redshift, Amazon ML, Amazon Kinesis, AWS Lambda, grouped under stream processing, batch, interactive, and ML, with ETL alongside
• Consume: analysis & visualization (Amazon QuickSight), notebooks, IDE, predictions, apps & APIs]
![Page 50: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/50.jpg)
Design Patterns
![Page 51: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/51.jpg)
Multi-Stage Decoupled “Data Bus”
• Multiple stages
• Storage decoupled from processing
[Diagram: Store → Process → Store → Process]
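The multi-stage pattern can be sketched with in-memory queues standing in for the stores. Each processing stage only reads from one store and writes to the next, so stages can be replaced or scaled without touching each other. The stage names and the word-count task are invented for illustration; in the talk's architecture the stores would be services such as Amazon Kinesis or Amazon S3.

```python
from collections import deque

raw_store = deque()     # first store: records as ingested
clean_store = deque()   # second store: output of the first processing stage


def ingest(lines):
    """Collect: append raw records to the first store."""
    raw_store.extend(lines)


def stage_clean():
    """Process 1: read raw records, normalize them, write to the next store."""
    while raw_store:
        line = raw_store.popleft().strip()
        if line:
            clean_store.append(line.lower())


def stage_count():
    """Process 2: read cleaned records and produce the answer."""
    counts = {}
    while clean_store:
        word = clean_store.popleft()
        counts[word] = counts.get(word, 0) + 1
    return counts


ingest([" Red ", "", "blue", "RED"])
stage_clean()
answer = stage_count()
```

Because `stage_count` never touches `raw_store`, a different cleaner could be swapped in, or a second consumer of `clean_store` added, without changing it.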
![Page 52: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/52.jpg)
Multiple Processing Applications (or Connectors) Can Read from or Write to Multiple Data Stores
[Diagram: store: Amazon Kinesis, Amazon DynamoDB, Amazon S3; process: AWS Lambda, Amazon Kinesis S3 Connector.]
![Page 53: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/53.jpg)
Processing Frameworks (KCL, Storm, Hive, Spark, etc.) Could Read from Multiple Data Stores
[Diagram: store: Amazon Kinesis, Amazon S3, Amazon DynamoDB; process: AWS Lambda, Amazon Kinesis S3 Connector, Hive, Spark, Storm.]
![Page 54: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/54.jpg)
Data Temperature vs Processing Latency
[Diagram: data flows to answers. Horizontal axis: data temperature, hot → cold. Vertical axis: processing latency, low → high. Stores shown: Amazon Kinesis, Apache Kafka, Amazon DynamoDB, Amazon S3, Amazon EMR (HDFS). Processing shown: Spark Streaming, Apache Storm, AWS Lambda, and KCL; Amazon Redshift, Spark, Impala, Presto, and Hive; batch; native.]
![Page 55: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/55.jpg)
Real-time Analytics
[Diagram: a producer writes to stream storage (Apache Kafka, Amazon Kinesis, DynamoDB Streams). Processing: KCL, AWS Lambda, Spark Streaming, Apache Storm, and Amazon ML for real-time prediction. Outputs: Amazon SNS notifications (alerts); app state and KPIs in Amazon ElastiCache (Redis), Amazon DynamoDB, Amazon RDS, and Amazon ES.]
![Page 56: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/56.jpg)
Interactive & Batch Analytics
[Diagram: a producer writes to Amazon S3 (store). Batch processing: Amazon EMR (Hive, Pig, Spark). Interactive processing: Amazon Redshift, Amazon EMR (Presto, Impala, Spark). Amazon ML provides batch prediction and real-time prediction. Results go to Consume.]
![Page 57: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/57.jpg)
Lambda Architecture
[Diagram: data enters through Amazon Kinesis and answers flow to applications.
• Batch layer: Amazon Kinesis S3 Connector → Amazon S3, processed by Amazon Redshift and Amazon EMR (Presto, Hive, Pig, Spark)
• Speed layer: KCL, AWS Lambda, Spark Streaming, Storm
• Serving layer: Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES
Amazon ML also appears in the diagram.]
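The three layers can be sketched in plain Python: an append-only log feeds a periodically recomputed batch view, a speed view covers whatever arrived since the last batch run, and the serving step merges the two. The class name `LambdaSketch` and the page-view events are invented; this shows the idea only and is not tied to any of the services in the diagram.

```python
from collections import Counter


class LambdaSketch:
    """Toy batch/speed/serving split over an append-only event log."""

    def __init__(self):
        self.log = []                  # immutable, append-only master data
        self.batch_view = Counter()    # recomputed from the full log
        self.batch_offset = 0          # log position covered by the batch view

    def append(self, event):
        """Record a new event; existing entries are never modified."""
        self.log.append(event)

    def run_batch(self):
        """Batch layer: recompute the view from the entire log."""
        self.batch_view = Counter(self.log)
        self.batch_offset = len(self.log)

    def speed_view(self):
        """Speed layer: count only events newer than the last batch run."""
        return Counter(self.log[self.batch_offset:])

    def serve(self, key):
        """Serving layer: merge batch and speed results into one answer."""
        return self.batch_view[key] + self.speed_view()[key]


arch = LambdaSketch()
for event in ("page_a", "page_b", "page_a"):
    arch.append(event)
arch.run_batch()
arch.append("page_a")          # arrives after the batch run
total = arch.serve("page_a")
```

The batch view alone would report 2 views of `page_a`; merging in the speed view gives the up-to-date total without waiting for the next batch run.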
![Page 58: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/58.jpg)
Summary
• Build decoupled “data bus”
  • Data → Store ↔ Process → Answers
• Use the right tool for the job
  • Latency, throughput, access patterns
• Use Lambda architecture ideas
  • Immutable (append-only) log, batch/speed/serving layer
• Leverage AWS managed services
  • No/low admin
• Be cost conscious
  • Big data ≠ big cost
![Page 59: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/59.jpg)
Remember to complete your evaluations!
![Page 60: (BDT310) Big Data Architectural Patterns and Best Practices on AWS](https://reader034.fdocuments.net/reader034/viewer/2022051709/586e8c9f1a28aba0038b8537/html5/thumbnails/60.jpg)
Thank you!