Cloud Analytics
Data Warehousing
Marco Serafini
COMPSCI 532
Lecture 18
Trivia
• How does Amazon make money?
  • Selling books?
  • Entertainment?
Migrating to the Cloud
• ELASTICITY
  • Pay-as-you-go
  • Unlimited scale
• COST
  • HW procurement at scale
  • Cluster management at scale
Cloud Computing
• Shared resources
  • Multiple tenants sharing resources (with isolation)
  • Economy of scale
• Elastic provisioning
  • Can easily add and remove resources on the fly
  • Pay as you go only when used
• Different flavors
  • IaaS, PaaS, SaaS
  • Public, private cloud
Cloud Offerings
• Computing nodes
  • Example: AWS EC2
  • Full nodes with local storage and pre-installed OS
  • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…
• Storage services
  • Example: AWS S3
  • Key-value stores (put/get), file systems
• Higher-level services
  • Example: DBMS
Storage Disaggregation
• Computing nodes (e.g. EC2)
  • Feature-rich machines
• Storage services (e.g. S3)
  • On cheaper, storage-heavy machines
  • Limited read/write interface
• Advantages for cloud provider
  • Provision storage and computation independently
• Advantages for users
  • Storage services cheaper
  • Network bandwidth ~ I/O bandwidth
Cloud Storage Types
STORAGE             PERFORMANCE  ACCESS        APPENDS  AVAILABILITY  PRICE
OBJECT (S3)         --           Shared        X        ✓             Low
FILE SYSTEM (EFS)   -            Shared        ✓        ✓             High
BLOCK (EBS)         +            Instance (*)  ✓        X             Mid
INSTANCE-LOCAL      ++           Instance      ✓        X             High (**)
(*) Can be detached from an instance and reattached to another
(**) Storage-heavy instances are expensive
From Shared-Nothing Architecture…
[Figure: four compute nodes, each with its own local storage (LS)]
Principle: move computation to data
…To Hybrid Architectures
[Figure: four compute nodes, each with local storage (LS), on top of a shared storage service]
• Compute nodes: arbitrary computation
• Storage service: read/write only
• Cannot move computation to data!
Scheduling Low-Priority Tasks
• Helps increase hardware utilization
• Spot instances
  • Allocated in real-time based on live bidding
  • Can be revoked any time (with notice)
• Serverless computing
  • Example: AWS Lambda
• Each of these services comes with its own pricing
Goals: Push-Button Analytics
• Easily parallelize single-threaded code
• Eliminate cluster management overhead
  • Deployment of nodes
  • Installation
  • Configuration
• Even cloud offerings have their complexities
  • Many instance types
  • Many services
• Solution: Serverless functions
Goal: Push-Button Analytics
• Use "serverless" components
  • No need to select a specific cluster size
  • System auto-scales up and down on demand
• Building blocks
  • Serverless functions (AWS Lambdas)
  • Cloud storage services (AWS S3)
• This paper implements MapReduce in this setting
Serverless Functions
• Single-threaded code
• Invoked through HTTP requests
• Cloud platform takes care of
  • Deployment
  • Load balancing
  • Performance isolation
• No need to
  • Deploy servers
  • Configure clusters
Challenges with Lambdas
• No local storage, need to use remote cloud storage
  • For example S3
• No function-to-function communication
  • Again need remote storage to share data between functions
• Short maximum running time
Remote vs. Local Storage
State and Fault Tolerance
• State is lost after execution
• Inputs and outputs need to be persisted
• Fault tolerance
  • Re-execute function
  • Requires atomic writes to check what has succeeded
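A minimal sketch of re-execution with atomic writes, assuming an in-memory `ObjectStore` class as a stand-in for cloud storage. All names here (`ObjectStore`, `square_task`, `run_with_retries`) are hypothetical and not part of any AWS API. The only property relied on is that a single-key write is all-or-nothing, so the presence of an output object tells the driver that the task succeeded.

```python
import random


class ObjectStore:
    """In-memory stand-in for a cloud object store (hypothetical, not the S3 API)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # A single-key put is all-or-nothing: readers see either no object
        # or the complete value, never a partial write.
        self._objects[key] = value

    def get(self, key):
        return self._objects[key]

    def exists(self, key):
        return key in self._objects


def square_task(store, task_id, rng, failure_rate=0.5):
    """Stateless task: read persisted input, compute, persist output."""
    x = store.get(f"input/{task_id}")
    if rng.random() < failure_rate:
        raise RuntimeError("simulated crash before the output was written")
    store.put(f"output/{task_id}", x * x)


def run_with_retries(store, task_ids, rng, max_attempts=50):
    """Re-execute every task whose output object is still missing."""
    for task_id in task_ids:
        attempts = 0
        while not store.exists(f"output/{task_id}"):
            attempts += 1
            if attempts > max_attempts:
                raise RuntimeError(f"task {task_id} kept failing")
            try:
                square_task(store, task_id, rng)
            except RuntimeError:
                pass  # function state is gone; simply run the function again


store = ObjectStore()
for i in range(5):
    store.put(f"input/{i}", i)
run_with_retries(store, range(5), random.Random(0))
results = [store.get(f"output/{i}") for i in range(5)]
```

Because the task is deterministic and writes a single object, running it twice is harmless, which is what makes blind re-execution safe in this sketch.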
Registering Functions
• Registering a new Lambda function is slow
• Solution
  • Register a single generic Lambda function
  • Serialize the code that needs to be executed
  • Store the code (and the input data) on S3
  • Generic Lambda function loads code and executes it
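The generic-function approach can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: a plain dictionary `storage` stands in for an S3 bucket, the names `submit` and `generic_handler` are hypothetical, and the standard-library `marshal` module is used to serialize the function's code object, which only works for simple functions without closures or module-level dependencies.

```python
import builtins
import marshal
import pickle
import types

storage = {}  # stand-in for an S3 bucket: key -> bytes


def submit(func, arg, job_id):
    """Client side: serialize the code and its input, store both in 'S3'."""
    storage[f"{job_id}/code"] = marshal.dumps(func.__code__)
    storage[f"{job_id}/input"] = pickle.dumps(arg)


def generic_handler(event):
    """The single pre-registered function: load code, run it, store the result."""
    job_id = event["job_id"]
    code = marshal.loads(storage[f"{job_id}/code"])
    # Rebuild a callable from the code object with a fresh global namespace.
    func = types.FunctionType(code, {"__builtins__": builtins})
    arg = pickle.loads(storage[f"{job_id}/input"])
    storage[f"{job_id}/output"] = pickle.dumps(func(arg))


def featurize(text):
    return len(text.split())


submit(featurize, "serverless functions read code from storage", "job-0")
generic_handler({"job_id": "job-0"})
result = pickle.loads(storage["job-0/output"])
```

The handler is registered once; submitting new work only requires writing two objects to storage and invoking the handler with the job identifier.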
Remote Storage Scalability
Semantics
• Map is easy
  • Execute one function per element of the list
• Map + single Reducer
  • E.g. parallel featurization + single-server ML
• MapReduce
  • Many Lambdas needed, many small intermediate files
  • Use Redis, an in-memory key-value store
• Parameter server
  • Use Redis
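To make the "many small intermediate files" point concrete, here is a hedged word-count sketch. It assumes a dictionary `kv` as a stand-in for the shared store (Redis or S3) and uses threads in place of Lambda invocations; `map_task`, `reduce_task`, and `partition_of` are illustrative names. With M mappers and R reducers, the shuffle writes M × R intermediate objects.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

kv = {}  # stand-in for the shared store (Redis or S3)
NUM_REDUCERS = 2


def partition_of(word):
    # Deterministic partitioning (Python's built-in hash() is salted per process).
    return sum(word.encode()) % NUM_REDUCERS


def map_task(map_id, line):
    """Mapper: emit (word, 1) pairs, one intermediate object per reducer."""
    buckets = defaultdict(list)
    for word in line.split():
        buckets[partition_of(word)].append((word, 1))
    for r in range(NUM_REDUCERS):
        kv[f"shuffle/{map_id}/{r}"] = buckets[r]


def reduce_task(reduce_id, num_maps):
    """Reducer: read its object from every mapper and sum the counts per word."""
    counts = defaultdict(int)
    for m in range(num_maps):
        for word, n in kv[f"shuffle/{m}/{reduce_id}"]:
            counts[word] += n
    kv[f"result/{reduce_id}"] = dict(counts)


lines = ["map is easy", "reduce needs a shuffle", "map then reduce"]
with ThreadPoolExecutor() as pool:  # each call stands in for one Lambda invocation
    list(pool.map(map_task, range(len(lines)), lines))
    list(pool.map(reduce_task, range(NUM_REDUCERS), [len(lines)] * NUM_REDUCERS))

word_counts = {}
for r in range(NUM_REDUCERS):
    word_counts.update(kv[f"result/{r}"])
```

Mappers and reducers never talk to each other directly; all communication goes through the shared store, which is why its latency and request rate matter so much.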
The Cost of Scaling Up
• Using more nodes does not always imply higher cost
• Lower latency → lower cost per node
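A small worked example of this trade-off, as a sketch. It assumes pay-per-node-hour billing and models runtime with Amdahl's law, which is an assumption introduced here for illustration; the function name `job_cost` and all numbers are hypothetical.

```python
def job_cost(nodes, single_node_hours, price_per_node_hour, parallel_fraction=1.0):
    """Return (runtime_hours, total_cost) for a job spread over `nodes` nodes.

    Runtime follows Amdahl's law: only `parallel_fraction` of the work
    speeds up with more nodes (an assumed model, not a measured one).
    """
    serial = single_node_hours * (1.0 - parallel_fraction)
    parallel = single_node_hours * parallel_fraction / nodes
    runtime = serial + parallel
    return runtime, runtime * nodes * price_per_node_hour


# Perfectly parallel job: 8x the nodes, 1/8 the time, same total cost.
one_node = job_cost(1, 8.0, 2.0)
eight_nodes = job_cost(8, 8.0, 2.0)

# With a 10% serial part, the extra nodes sit idle part of the time and cost rises.
eight_nodes_amdahl = job_cost(8, 8.0, 2.0, parallel_fraction=0.9)
```

Under perfect scaling the bill is unchanged while latency drops; the total cost only grows to the extent that the added nodes are not kept busy.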
Data Warehousing Architectures
Data Warehousing
• Analytical (OLAP) relational queries
• Different architectures
  • Snowflake: shared-disk + caching at compute nodes
  • Redshift: shared-nothing, store all data at compute nodes
  • Redshift Spectrum: serverless workers executing on-demand and reading from S3
• Let's discuss these architectures and compare them
Snowflake
• Shared-disk architecture
  • Data is stored on S3, all nodes can access it
  • But nodes keep a distributed cache
• Challenges
  • Heterogeneous workloads
    • No one-size-fits-all hardware configuration
  • Membership changes
    • Large data shuffles when a node fails/is removed
  • Online upgrade
    • It is similar to changing all the nodes in the system
Snowflake Architecture
• Data Storage
  • Based on S3: high throughput, high latency
  • Used also for intermediate data
• Virtual Warehouses
  • Responsible for query execution
  • Stateless (restarted in their entirety)
  • Shared cache (low latency on hot data, most data cold)
• Cloud Services
  • Query parsing, access control, optimization
  • Snapshot isolation with multi-versioning
  • Metadata on external key-value store
Snowflake Advantages
• Storage on S3 is cheaper
• Use expensive local disk only for hot data
• All services (except storage) are stateless
  • Simpler fault tolerance and membership change
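The idea of keeping only hot data on local disk while the authoritative copy stays in remote storage can be sketched with a small cache. This is an illustration only: the class name `CachingNode` is hypothetical, a dictionary stands in for S3, and the LRU eviction policy is an assumption, since the slides do not say which policy Snowflake uses.

```python
from collections import OrderedDict


class CachingNode:
    """Compute node with a bounded local cache in front of remote storage."""

    def __init__(self, remote, capacity):
        self.remote = remote          # stand-in for S3: key -> data
        self.capacity = capacity      # how many objects fit on local disk
        self.cache = OrderedDict()    # ordered from least to most recently used
        self.remote_reads = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)   # mark as recently used
            return self.cache[key]
        self.remote_reads += 1            # cache miss: pay the remote latency
        value = self.remote[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used object
        return value

    def restart(self):
        # The node is stateless: losing the cache costs performance, not data.
        self.cache.clear()


remote = {f"file{i}": f"contents of file{i}" for i in range(100)}
node = CachingNode(remote, capacity=2)
for key in ["file0", "file1", "file0", "file0"]:
    node.read(key)
reads_before_restart = node.remote_reads

node.restart()
value_after_restart = node.read("file0")
```

Since the cache holds no data that is not also in remote storage, a failed or replaced node can simply start with an empty cache, which is the sense in which fault tolerance and membership change become simpler.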
Redshift
• Classical shared-nothing architecture
  • Initially based on PostgreSQL but heavily re-optimized for OLAP
  • Runs on EC2, explicit provisioning
  • All data pre-loaded on instance storage
  • Query compilation
• S3 for backup only
Redshift Spectrum
• Serverless query executor
  • Number of workers dynamically assigned
  • Stateless
• Reads data directly from S3
  • Scale out to leverage storage and computation bandwidth
Comparison Setup
• Benchmark: TPC-H
  • 1 TB uncompressed data
  • 1 execution of the query suite
• Configuration
  • Default: 4 nodes, memory optimized (r4.8xlarge)
  • Redshift: analogous node that offers SSD storage (dc2)
  • Athena: opaque
Comparison: Initialization Time
• Paid every time we shut down and restart the system
• Load metadata and (optionally) data
Comparison: Runtime
• Pre-loading pays off
  • Initialization delay is easily amortized
• Caching less helpful
• Cost
  • Athena: pay data scan only
  • Other systems: mainly running time
  • Spectrum: scan + running time
Comparison: Execution Cost
• Redshift (RS) can amortize loading costs
• Athena
  • Serverless
  • Pay per amount of data scanned
• RS Spectrum
  • Similar scheme as Athena
  • But must add RS cluster cost
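The three pricing schemes can be written down as simple formulas. This is a sketch under stated assumptions: the function names are illustrative and the prices below are placeholders chosen for the arithmetic, not actual AWS rates.

```python
def scan_cost(tb_scanned, price_per_tb):
    """Athena-style charge: proportional to the amount of data scanned."""
    return tb_scanned * price_per_tb


def cluster_cost(nodes, hours, price_per_node_hour):
    """Provisioned-cluster charge: proportional to running time."""
    return nodes * hours * price_per_node_hour


def spectrum_cost(tb_scanned, price_per_tb, nodes, hours, price_per_node_hour):
    """Spectrum-style charge: data scanned plus the cluster that issues the query."""
    return scan_cost(tb_scanned, price_per_tb) + cluster_cost(
        nodes, hours, price_per_node_hour
    )


def amortized_cluster_cost(nodes, load_hours, query_hours, runs, price_per_node_hour):
    """Per-run cost when one data load is shared by `runs` executions."""
    total_hours = load_hours + runs * query_hours
    return cluster_cost(nodes, total_hours, price_per_node_hour) / runs


PRICE_PER_TB = 5.0          # placeholder value
PRICE_PER_NODE_HOUR = 2.0   # placeholder value

one_run = amortized_cluster_cost(4, 1.0, 0.5, 1, PRICE_PER_NODE_HOUR)
ten_runs = amortized_cluster_cost(4, 1.0, 0.5, 10, PRICE_PER_NODE_HOUR)
spectrum = spectrum_cost(1.0, PRICE_PER_TB, 4, 0.5, PRICE_PER_NODE_HOUR)
```

With these placeholder numbers the per-run cost of the pre-loaded cluster drops as the load is shared across more runs, while the scan-based charge stays the same for every execution.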
Storage Cost Per Day
• EBS very expensive
• Instance storage + S3 backup cheaper
Pushing Down Computation?
• One should always move computation to data
• But disaggregated storage cannot compute!
[Figure: four compute nodes, each with local storage (LS), on top of a shared storage service]
• Compute nodes: arbitrary computation
• Storage service: read/write only
S3 Select
• Computation on the storage layer
  • Simple selection and projection queries on structured data (e.g. CSV or Parquet)
  • Simple aggregations (e.g. sum)
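The benefit of evaluating selection and projection at the storage layer can be illustrated with a local simulation. This sketch does not call the real S3 Select API: `get_object` and `select_object` are hypothetical stand-ins, and the CSV data is made up. It only compares how many bytes would cross the network with and without pushdown.

```python
import csv
import io

# Made-up object contents standing in for a CSV file stored in S3.
CSV_OBJECT = "id,region,amount\n1,EU,10\n2,US,25\n3,EU,5\n4,US,40\n"


def get_object():
    """Plain GET: the whole object crosses the network."""
    return CSV_OBJECT


def select_object(columns, predicate):
    """Storage-side selection and projection; returns only the matching fields."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.DictReader(io.StringIO(CSV_OBJECT)):
        if predicate(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


# Without pushdown: fetch everything, then filter and aggregate at the compute node.
fetched = get_object()
total_no_pushdown = sum(
    int(r["amount"])
    for r in csv.DictReader(io.StringIO(fetched))
    if r["region"] == "EU"
)
bytes_no_pushdown = len(fetched)

# With pushdown: storage returns only the 'amount' column of the EU rows.
pushed = select_object(["amount"], lambda r: r["region"] == "EU")
total_pushdown = sum(int(line) for line in pushed.splitlines())
bytes_pushdown = len(pushed)
```

Both paths compute the same total, but the pushed-down version transfers only the selected column of the matching rows.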
PushdownDB
• Stateless query execution with S3 Select
• Example: Bloom join
  • Standard hash join, but push down a Bloom filter to filter out rows that will not join
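A minimal sketch of a Bloom join, assuming in-memory lists of dictionaries as tables. The `BloomFilter` class, `storage_scan`, and `bloom_join` are illustrative names; `storage_scan` stands in for a scan executed at the storage layer, and how the filter would actually be encoded for the storage service is not modeled here.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits |= 1 << p

    def might_contain(self, key):
        return all((self.bits >> p) & 1 for p in self._positions(key))


def storage_scan(table, key_col, bloom):
    """Stand-in for a pushed-down scan: only rows passing the filter are returned."""
    return [row for row in table if bloom.might_contain(row[key_col])]


def bloom_join(small, large, key_col):
    """Hash join that ships a Bloom filter of the build side to the scan."""
    bloom = BloomFilter()
    hash_table = {}
    for row in small:  # build phase on the smaller table
        bloom.add(row[key_col])
        hash_table.setdefault(row[key_col], []).append(row)
    candidates = storage_scan(large, key_col, bloom)  # filtered at "storage"
    joined = [
        {**s, **l} for l in candidates for s in hash_table.get(l[key_col], [])
    ]
    return joined, len(candidates)


customers = [{"cust": 1, "name": "a"}, {"cust": 2, "name": "b"}]
orders = [{"cust": c % 50, "order": c} for c in range(200)]
joined, rows_shipped = bloom_join(customers, orders, "cust")
```

False positives from the filter only cause a few extra rows to be shipped; the probe against the hash table still removes them, so the join result is exact.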
TPC-H Results
• Great speedups with S3 Select