Cloud Analytics Data Warehousing - Marco Serafini · 2020. 9. 4.

Cloud Analytics Data Warehousing

Marco Serafini

COMPSCI 532
Lecture 18

Trivia
• How does Amazon make money?
  • Selling books?
  • Entertainment?

Migrating to the Cloud
• ELASTICITY
  • Pay-as-you-go
  • Unlimited scale
• COST
  • HW procurement at scale
  • Cluster management at scale

Cloud Computing
• Shared resources
  • Multiple tenants sharing resources (with isolation)
  • Economy of scale
• Elastic provisioning
  • Can easily add and remove resources on the fly
  • Pay as you go only when used
• Different flavors
  • IaaS, PaaS, SaaS
  • Public, private cloud

Cloud Offerings
• Computing nodes
  • Example: AWS EC2
  • Full nodes with local storage and pre-installed OS
  • Very large number of instance types: compute optimized, memory optimized, storage optimized, with GPUs, burstable…
• Storage services
  • Example: AWS S3
  • Key-value stores (put/get), file systems
• Higher-level services
  • Example: DBMS

Storage Disaggregation
• Computing nodes (e.g. EC2)
  • Feature-rich machines
• Storage services (e.g. S3)
  • On cheaper, storage-heavy machines
  • Limited read/write interface
• Advantages for cloud provider
  • Provision storage and computation independently
• Advantages for users
  • Storage services cheaper
  • Network bandwidth ~ I/O bandwidth

Cloud Storage Types

STORAGE            PERFORMANCE  ACCESS        APPENDS  AVAILABILITY  PRICE
OBJECT (S3)        --           Shared        X        ✓             Low
FILE SYSTEM (EFS)  -            Shared        ✓        ✓             High
BLOCK (EBS)        +            Instance (*)  ✓        X             Mid
INSTANCE-LOCAL     ++           Instance      ✓        X             High (**)

(*) Can be detached from an instance and reattached to another
(**) Storage-heavy instances are expensive

From Shared-Nothing Architecture…
[Figure: four COMPUTE nodes, each with its own LS (local storage)]
Principle: move computation to data

…To Hybrid Architectures
[Figure: compute nodes, each with its own LS, on top of a shared STORAGE SERVICE]
• Compute nodes: arbitrary computation
• Storage service: read/write only
Cannot move computation to data!

Scheduling Low-Priority Tasks
• Helps increase hardware utilization
• Spot instances
  • Allocated in real-time based on live bidding
  • Can be revoked any time (with notice)
• Serverless computing
  • Example: AWS Lambda
• Each of these services comes with its own pricing

Goals: Push-Button Analytics
• Easily parallelize single-threaded code
• Eliminate cluster management overhead
  • Deployment of nodes
  • Installation
  • Configuration
• Even cloud offerings have their complexities
  • Many instance types
  • Many services
• Solution: Serverless functions

Goal: Push-Button Analytics
• Use "serverless" components
  • No need to select a specific cluster size
  • System auto-scales up and down on demand
• Building blocks
  • Serverless functions (AWS Lambdas)
  • Cloud storage services (AWS S3)
• This paper implements MapReduce in this setting

Serverless Functions
• Single-threaded code
• Invoked through HTTP requests
• Cloud platform takes care of
  • Deployment
  • Load balancing
  • Performance isolation
• No need to
  • Deploy servers
  • Configure clusters

Challenges with Lambdas
• No local storage, need to use remote cloud storage
  • For example S3
• No function-to-function communication
  • Again need remote storage to share remote memory
• Short maximum running time


Remote vs. Local Storage

State and Fault Tolerance
• State is lost after execution
• Inputs and outputs need to be persisted
• Fault tolerance
  • Re-execute function
  • Requires atomic writes to check what has succeeded
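
The re-execution idea above can be sketched as follows. This is a minimal illustration, assuming a hypothetical in-memory `ObjectStore` that stands in for S3 and whose `put` is atomic (an object is either fully visible or absent). The names `ObjectStore`, `run_task`, and `flaky_square` are made up for this example and do not come from the lecture.

```python
class ObjectStore:
    """Toy stand-in for a remote object store such as S3 (assumption)."""

    def __init__(self):
        self._objects = {}

    def put(self, key, value):
        # A single dict assignment models an atomic write:
        # readers see either the whole object or nothing.
        self._objects[key] = value

    def get(self, key):
        return self._objects.get(key)


def run_task(store, task_id, func, arg, max_attempts=3):
    """Re-execute a stateless function until its output object exists."""
    out_key = f"out/{task_id}"
    for _ in range(max_attempts):
        if store.get(out_key) is not None:
            break  # an earlier attempt already committed its output
        try:
            store.put(out_key, func(arg))
        except RuntimeError:
            continue  # the failed attempt left no state; simply try again
    return store.get(out_key)


attempts = {"count": 0}


def flaky_square(x):
    # Fails on the first two invocations to simulate crashed workers.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated worker failure")
    return x * x


store = ObjectStore()
result = run_task(store, "t0", flaky_square, 6)
print(result, attempts["count"])
```

Because the output key is written in one step, its presence is enough to tell that the task succeeded, so calling `run_task` again for the same task does not run the function a second time.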

Registering Functions
• Registering a new Lambda function is slow
• Solution
  • Register a single generic Lambda function
  • Serialize the code that needs to be executed
  • Store the code (and the input data) on S3
  • Generic Lambda function loads code and executes it
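
A rough sketch of the generic-function trick, under loud assumptions: a Python dict named `fake_s3` plays the role of S3, and the standard-library `marshal` module serializes the function's code object. This only works for simple functions without closures or references to module globals, and the code must be loaded by the same Python version that produced it. The helper names `submit` and `generic_handler` are hypothetical.

```python
import builtins
import marshal
import pickle
import types

fake_s3 = {}  # hypothetical stand-in for an S3 bucket: key -> bytes


def submit(job_id, func, data):
    """Client side: serialize the code and its input, store both remotely."""
    fake_s3[f"{job_id}/code"] = marshal.dumps(func.__code__)
    fake_s3[f"{job_id}/input"] = pickle.dumps(data)


def generic_handler(job_id):
    """The single registered function: load code, run it, persist the output."""
    code = marshal.loads(fake_s3[f"{job_id}/code"])
    func = types.FunctionType(code, {"__builtins__": vars(builtins)})
    data = pickle.loads(fake_s3[f"{job_id}/input"])
    fake_s3[f"{job_id}/output"] = pickle.dumps(func(data))


def word_lengths(words):
    return [len(w) for w in words]


submit("job-1", word_lengths, ["cloud", "s3"])
generic_handler("job-1")
print(pickle.loads(fake_s3["job-1/output"]))
```

Only `generic_handler` would need to be registered with the platform; every new user function travels as data through the storage service.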


Remote Storage Scalability

Semantics
• Map is easy
  • Execute one function per element of the list
• Map + single Reducer
  • E.g. parallel featurization + single-server ML
• MapReduce
  • Many Lambdas needed, many small intermediate files
  • Use Redis, an in-memory key-value store
• Parameter server
  • Use Redis
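
To make the MapReduce case concrete, here is a small word-count sketch. It is an illustration only: threads stand in for Lambda invocations, and a plain dict of lists (`kv`) stands in for the in-memory key-value store that holds the intermediate data. All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(word, num_reducers):
    # Deterministic partitioner (Python's built-in str hash is salted per run).
    return sum(word.encode()) % num_reducers


def map_task(kv, line, num_reducers):
    """One invocation per input element: emit partial counts per reducer."""
    partials = [Counter() for _ in range(num_reducers)]
    for word in line.split():
        partials[partition(word, num_reducers)][word] += 1
    for r, counts in enumerate(partials):
        kv[f"shuffle/{r}"].append(counts)  # small intermediate object


def reduce_task(kv, r):
    """One invocation per partition: merge all partial counts for it."""
    total = Counter()
    for counts in kv[f"shuffle/{r}"]:
        total.update(counts)
    return total


def word_count(lines, num_reducers=2):
    kv = {f"shuffle/{r}": [] for r in range(num_reducers)}
    with ThreadPoolExecutor(max_workers=4) as pool:
        # list() forces the map phase to finish before reducers start.
        list(pool.map(lambda line: map_task(kv, line, num_reducers), lines))
        parts = list(pool.map(lambda r: reduce_task(kv, r), range(num_reducers)))
    result = {}
    for part in parts:
        result.update(part)
    return result


print(word_count(["a b a", "b c", "a"]))
```

Each mapper writes one small object per reducer, which is why the number of intermediate objects grows with mappers times reducers and a low-latency store becomes attractive.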

The Cost of Scaling Up
• Using more nodes does not always imply higher cost
• Lower latency → lower cost per node
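
A back-of-the-envelope model of this point, assuming pay-per-node-hour billing and a simple speedup model. The function `job_cost`, the `efficiency` parameter, and all numbers are illustrative assumptions, not figures from the lecture.

```python
def job_cost(nodes, single_node_hours, price_per_node_hour, efficiency=1.0):
    """Return (runtime_hours, total_cost) under a simple speedup model.

    efficiency=1.0 means perfect linear speedup; lower values model
    parallelization overhead.
    """
    runtime = single_node_hours / (nodes * efficiency)
    return runtime, nodes * runtime * price_per_node_hour


for n in (1, 4, 16):
    runtime, cost = job_cost(n, single_node_hours=16.0, price_per_node_hour=2.0)
    print(f"{n:2d} nodes: {runtime:5.2f} h, ${cost:.2f}")
```

With perfect speedup the total stays the same while the runtime shrinks, because each node is paid for a shorter time. Once efficiency drops below 1, the extra nodes do start to cost more.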


Data Warehousing Architectures

Data Warehousing
• Analytical (OLAP) relational queries
• Different architectures
  • Snowflake: shared-disk + caching at compute nodes
  • Redshift: shared-nothing, store all data at compute nodes
  • Redshift Spectrum: serverless workers executing on-demand and reading from S3
• Let's discuss these architectures and compare them

Snowflake
• Shared-disk architecture
  • Data is stored on S3, all nodes can access it
  • But nodes keep a distributed cache
• Challenges
  • Heterogeneous workloads
    • No one-size-fits-all hardware configuration
  • Membership changes
    • Large data shuffles when a node fails/is removed
  • Online upgrade
    • It is similar to changing all the nodes in the system

Snowflake Architecture
• Data Storage
  • Based on S3: high throughput, high latency
  • Used also for intermediate data
• Virtual Warehouses
  • Responsible for query execution
  • Stateless (restarted in their entirety)
  • Shared cache (low latency on hot data, most data cold)
• Cloud Services
  • Query parsing, access control, optimization
  • Snapshot isolation with multi-versioning
  • Metadata on external key-value store
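
The caching idea at the compute nodes can be illustrated with a read-through cache. This sketch assumes an LRU eviction policy and uses a dict as a stand-in for S3. The class `CachingNode` and its fields are hypothetical, and the slides do not state which policy the real system uses.

```python
from collections import OrderedDict


class CachingNode:
    """Compute node with a small local LRU cache in front of remote storage."""

    def __init__(self, remote, capacity):
        self.remote = remote          # dict standing in for S3 (assumption)
        self.capacity = capacity      # maximum number of cached objects
        self.cache = OrderedDict()
        self.remote_reads = 0

    def read(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        self.remote_reads += 1               # slow path: fetch from remote
        value = self.remote[key]
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict least recently used
        return value


remote = {f"file-{i}": f"data-{i}" for i in range(100)}
node = CachingNode(remote, capacity=2)
for key in ["file-1", "file-2", "file-1", "file-1", "file-3", "file-1"]:
    node.read(key)
print(node.remote_reads)
```

Since the cache holds only copies of data that lives in the storage service, a node can be restarted or removed without losing anything; it just starts cold.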

Snowflake Advantages
• Storage on S3 is cheaper
• Use expensive local disk only for hot data
• All services (except storage) are stateless
  • Simpler fault tolerance and membership change

Redshift
• Classical shared-nothing architecture
  • Initially based on PostgreSQL but heavily re-optimized for OLAP
  • Runs on EC2, explicit provisioning
  • All data pre-loaded on instance storage
  • Query compilation
  • S3 for backup only

Redshift Spectrum
• Serverless query executor
  • Number of workers dynamically assigned
  • Stateless
• Reads data directly from S3
  • Scale out to leverage storage and computation bandwidth

Comparison Setup
• Benchmark: TPC-H
  • 1 TB uncompressed data
  • 1 execution of the query suite
• Configuration
  • Default: 4 nodes, memory optimized (r4.8xlarge)
  • Redshift: analogous node that offers SSD storage (dc2)
  • Athena: opaque

Comparison: Initialization Time
• Paid every time we shut down and restart the system
• Load metadata and (optionally) data

Comparison: Runtime
• Pre-loading pays off
  • Initialization delay is easily amortized
• Caching less helpful
• Cost
  • Athena: pay data scan only
  • Other systems: mainly running time
  • Spectrum: scan + running time

Comparison: Execution Cost
• RS can amortize loading costs
• Athena
  • Serverless
  • Pay per amount of data scanned
• RS Spectrum
  • Similar scheme as Athena
  • But must add RS cluster cost
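
The three billing schemes above can be written down as simple formulas. This is only a sketch: the function names are hypothetical and the prices in the example are placeholders, not the measured or published prices behind the comparison.

```python
def scan_cost(tb_scanned, price_per_tb):
    """Athena-style billing: pay only for the data scanned."""
    return tb_scanned * price_per_tb


def cluster_cost(nodes, hours, price_per_node_hour):
    """Provisioned-cluster billing: pay for node running time."""
    return nodes * hours * price_per_node_hour


def spectrum_cost(tb_scanned, price_per_tb, nodes, hours, price_per_node_hour):
    """Spectrum-style billing: scan charge plus the cluster that issues the query."""
    return scan_cost(tb_scanned, price_per_tb) + cluster_cost(
        nodes, hours, price_per_node_hour
    )


# Placeholder numbers chosen only to exercise the formulas.
print(scan_cost(2.0, 5.0))
print(cluster_cost(4, 0.5, 3.0))
print(spectrum_cost(2.0, 5.0, 4, 0.5, 3.0))
```

Under these formulas a scan-priced system is cheap for queries that touch little data, while a provisioned cluster is cheap for queries that finish quickly regardless of how much data they read.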

Storage Cost Per Day
• EBS very expensive
• Instance storage + S3 backup cheaper

Pushing Down Computation?
• One should always move computation to data
• But disaggregated storage cannot compute!
[Figure: compute nodes, each with its own LS, on top of a shared STORAGE SERVICE]
• Compute nodes: arbitrary computation
• Storage service: read/write only

S3 Select
• Computation on the storage layer
  • Simple selection and projection queries on structured data (e.g. CSV or Parquet)
  • Simple aggregations (e.g. sum)
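
The effect of pushing selection and projection into the storage layer can be simulated locally. In this sketch the "storage service" is just two Python functions over an in-memory CSV string, and a Python predicate stands in for the query that the real service would receive. `get_object` and `select_object` are hypothetical names, not the S3 API.

```python
import csv
import io

CSV_OBJECT = "id,region,amount\n1,EU,10\n2,US,25\n3,EU,5\n4,US,40\n"


def get_object():
    """Plain GET: the storage service returns the whole object."""
    return CSV_OBJECT


def select_object(columns, predicate):
    """Pushed-down scan: filter rows and project columns at the storage side."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.DictReader(io.StringIO(CSV_OBJECT)):
        if predicate(row):
            writer.writerow([row[c] for c in columns])
    return out.getvalue()


# Without pushdown: transfer everything, then filter at the compute node.
full = get_object()
local = [r["amount"] for r in csv.DictReader(io.StringIO(full)) if r["region"] == "EU"]

# With pushdown: only matching rows and the requested column are transferred.
pushed = select_object(["amount"], lambda r: r["region"] == "EU")

print(local, pushed.split(), len(full), len(pushed))
```

Both paths yield the same values, but the pushed-down request moves fewer bytes over the network, which is where the benefit comes from when storage and compute are disaggregated.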

PushdownDB
• Stateless query execution with S3 Select
• Example: Bloom join
  • Standard hash join, but push down a Bloom filter to filter out rows that will not join
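
A compact sketch of the Bloom join idea, with loud assumptions: the Bloom filter is a toy implementation built on `hashlib`, its size and number of hash functions are arbitrary, and `storage_scan` merely simulates the filter being evaluated at the storage layer. None of these names come from PushdownDB itself.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def storage_scan(table, key_col, bloom):
    """Simulates the storage layer: drop rows whose key cannot join."""
    return [row for row in table if bloom.might_contain(row[key_col])]


def bloom_join(build, probe, key_col):
    bloom = BloomFilter()
    hash_table = {}
    for row in build:  # build phase of a standard hash join
        hash_table.setdefault(row[key_col], []).append(row)
        bloom.add(row[key_col])
    shipped = storage_scan(probe, key_col, bloom)  # pushed-down filter
    joined = [{**b, **p} for p in shipped for b in hash_table.get(p[key_col], [])]
    return joined, len(shipped)


customers = [{"cid": 1, "name": "Ann"}, {"cid": 2, "name": "Bob"}]
orders = [{"cid": c, "oid": o} for o, c in enumerate([1, 2, 3, 1, 7, 9])]
joined, rows_shipped = bloom_join(customers, orders, "cid")
print(len(joined), rows_shipped, len(orders))
```

The probe phase still checks the exact hash table, so false positives from the filter only cost some extra transfer and never produce wrong join results.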

TPC-H Results
• Great speedups with S3 Select