SDN + Storage

Outline

• Measurement of storage traffic

• Network aware placement

• Control of resources

• SDN + Resource allocation
  – Predicting resource utilization
  – Bringing it all together

HDFS Storage Patterns

– Maps read from HDFS
  • Local read versus non-local read
  • Rack locality or not

80% locality!


– Maps read from HDFS
  • Local read versus non-local read
  • Rack locality or not

80% cross-rack traffic


– Reducers write to HDFS
  • 3 copies of each file are written to HDFS
  • 2 rack-local and 1 non-rack-local
  • Fault tolerance and good performance

THERE MUST BE CROSS-RACK TRAFFIC

Ideal Goal: Minimize Congestion

Real Life Traces

• Analyze Facebook traces:
  – 33% of time spent in network
  – Network links are highly utilized; why?
  – Determine cause of network traffic
    1. Job output
    2. Job input
    3. Pre-processing

Current Ways To Improve HDFS Transfers

• Change network paths
  – Hedera, MicroTE, c-Through, Helios

• Change network rates
  – Orchestra, D3

• Increase network capacity
  – VL2, PortLand (Fat-Tree)

The case for Flexible Endpoints

• The traffic matrix limits the benefits
  – of techniques that change paths
  – of techniques that change network rates

• The ability to change the traffic matrix is important

Flexible Endpoints in HDFS

• Recall: constraints placed by HDFS
  – 3 replicas
  – 2 fault domains
  – It doesn't matter where replicas go, as long as the constraints are met

• The source of the transfer is fixed!
  – However, the destination (the location of the 3 replicas) is not fixed

Sinbad

• Determine placement for each block replica
  – Place replicas to avoid hotspots
  – Constraints:
    • 3 copies
    • Spread across 2 fault domains

• Benefits
  – Faster writes
  – Faster transfers

Sinbad: Ideal Algorithm

• Input:
  – Blocks of different sizes
  – Links of different capacities

• Objective:
  – Minimize write time (transfer time)

• Challenges: lack of future knowledge
  – Location and duration of hotspots
  – Size and arrival times of new replicas

Sinbad Heuristic

• Assumptions
  – Link utilizations are stable
    • True for 5-10 seconds
  – All blocks have the same size
    • Fixed-size large blocks

• Heuristic:
  – Pick the least-loaded link/path
  – Send a block from the file with the least amount left to send
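
The heuristic above can be sketched in a few lines. This is only an illustration under stated assumptions, not Sinbad's actual implementation: the `Node` type, the `link_load` field, and both function names are hypothetical, and each rack is assumed to be one fault domain.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    rack: str          # assumption: one rack = one fault domain
    link_load: float   # assumption: measured link utilization, 0.0 to 1.0

def place_replicas(nodes, copies=3, fault_domains=2):
    """Greedily pick the least-loaded nodes while keeping the
    fault-domain constraint satisfiable."""
    chosen, racks = [], set()
    for node in sorted(nodes, key=lambda n: n.link_load):
        if len(chosen) == copies:
            break
        racks_after = racks | {node.rack}
        slots_left = copies - (len(chosen) + 1)
        # Skip this node if the remaining picks could no longer
        # reach the required number of distinct racks.
        if fault_domains - len(racks_after) > slots_left:
            continue
        chosen.append(node)
        racks = racks_after
    if len(chosen) < copies:
        raise ValueError("not enough nodes or racks to satisfy the constraints")
    return chosen

def next_file_to_send(pending_bytes):
    """Pick the file with the least amount left to send."""
    return min(pending_bytes, key=pending_bytes.get)
```

With three lightly loaded nodes in one rack and a busy node in another, the sketch takes the two least-loaded nodes first and then accepts the busy node, because the 2-fault-domain constraint outranks link load for the last replica.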

Sinbad Architecture

• Recall: the original DFS has a master-slave architecture

• Sinbad has a similar architecture

Orchestrating the Entire Cluster

• How to control Compute, Network, Storage?

• Challenges from Sinbad
  – How to determine future replica demands?
    • You can't control job arrival
    • You can control task scheduling
    • If you predict job characteristics, you can determine future replica demands
  – How to determine future hotspots?
    • Control all network traffic (SDN)
    • Use future

Ideal Centralized Entity

• Controls:
  – Storage, CPU, N/W

• Determines:
  – Which task to run
  – Where to run the task
  – When to start a network transfer
    • What rate to transfer at
    • Which network path

Predicting Job Characteristics

• To predict the resources that a job needs to complete, what do you need?
  – Job's DAG (job's trace history)
  – Computation time for each node
  – Data transfer size between nodes
  – Transfer time between nodes

Things you absolutely know!

• Input data
  – Size of input data
  – Location of all replicas
  – Split of input data

• Job's DAG
  – # of Maps
  – # of Reduces

[Figure: example job. HDFS (200 GB input) → 3 mappers → 2 reducers → HDFS]

Approaches to Prediction: Input/Intermediate/Output Data

• Assumption:
  – Map and Reduce run the same code over and over
  – The code gives the same ratio of reduction
    • E.g. 50% reduction from Map input to intermediate data
    • E.g. 90% reduction from intermediate data to output

• Implication:
  – Given the size of the input, you can determine the size of future transfers

• Problem:
  – Not always true!
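
As a worked example of the ratio assumption, the sketch below propagates an input size through fixed reduction ratios. The function name and parameters are made up for illustration; the 50% and 90% defaults are the example ratios from the bullets above.

```python
def predict_sizes(input_gb, map_reduction=0.5, reduce_reduction=0.9):
    """Predict intermediate and output data sizes from the input size,
    assuming each phase always shrinks its data by a fixed ratio."""
    intermediate_gb = input_gb * (1 - map_reduction)
    output_gb = intermediate_gb * (1 - reduce_reduction)
    return intermediate_gb, output_gb

# A 200 GB input gives roughly 100 GB of intermediate data
# and roughly 10 GB of output.
intermediate_gb, output_gb = predict_sizes(200)
```

The prediction is only as good as the ratios: a job whose reduction depends on the data itself breaks the assumption.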

[Figure: HDFS → 3 maps → 2 reduces → HDFS, with 200 GB of input, 100 GB of intermediate data, and 10 GB of output]

Approaches to Prediction: Task Run Time

• Assumption:
  – A task is dominated by reading its input
  – The time to run a task is essentially the time to read its input
    • If local: time to read from disk
    • If non-local: time to read across the network

• Implication:
  – If you can model read time, you can determine task run time

• Problems:
  – How do you model disk I/O?
  – How do you model I/O interrupt contention?
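
A minimal sketch of this read-dominated model, assuming a single constant bandwidth per medium. The function name and the bandwidth figures are placeholders, not measurements, and the model ignores exactly the contention effects listed under Problems.

```python
def predict_task_time(input_mb, local, disk_mb_per_s=100.0, net_mb_per_s=50.0):
    """Estimate task run time in seconds as input size / read bandwidth.
    The bandwidth defaults are hypothetical placeholder values."""
    bandwidth = disk_mb_per_s if local else net_mb_per_s
    return input_mb / bandwidth

local_time = predict_task_time(1000, local=True)     # read from local disk
remote_time = predict_task_time(1000, local=False)   # read across the network
```

Under these placeholder numbers a non-local read takes twice as long as a local one, which is the kind of difference a scheduler would need to know before placing the task.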

Predict Job Runs

• Given:
  – Predictions of tasks, of transfers, and of the DAG

• Can you predict job completion time?
  – How do you account for interleaving between jobs?
  – How do you determine the optimal # of slots?
  – How do you determine the optimal network bandwidth?

• Really easy, right?
  – But what happens if the network only has 2 slots?
    • You can't run all the maps in parallel

[Figure: job DAG (HDFS → 3 maps → 2 reduces → HDFS; 200 GB → 100 GB → 10 GB) annotated with per-task times, with a timeline marked 0 sec, 10 sec, and 40 sec]

• Which tasks to run in which order?
• How many slots to assign?
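
To see why order and slot count both matter, here is a small simulation of independent tasks started in list order on a fixed number of slots. It is a sketch with made-up durations, it ignores network transfers and the map/reduce dependency, and `makespan` is a hypothetical helper name.

```python
import heapq

def makespan(durations, slots):
    """Finish time when tasks start in list order, each on the
    slot that becomes free first."""
    free_at = [0.0] * slots      # min-heap of times at which slots free up
    heapq.heapify(free_at)
    finish = 0.0
    for duration in durations:
        start = heapq.heappop(free_at)   # earliest available slot
        end = start + duration
        finish = max(finish, end)
        heapq.heappush(free_at, end)
    return finish

all_parallel = makespan([10, 10, 10], slots=3)   # every task gets its own slot
two_slots = makespan([10, 10, 10], slots=2)      # third task must wait
short_first = makespan([5, 5, 10], slots=2)      # long task starts late
long_first = makespan([10, 5, 5], slots=2)       # long task starts first
```

With 2 slots, running the long task first finishes sooner than running it last, so the scheduler's ordering decision changes the completion time even when the set of tasks is fixed.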

[Figure: the same job DAG under a different schedule, with a timeline marked 0 sec, 3 sec, 13 sec, and 33 sec]

Approaches to Prediction: Job Run Times

• Assumption:
  – Job runtime is a function of the # of slots

• Implication:
  – Given N slots, I can predict completion time

• Jockey approach [EuroSys'12]
  – Track job progress: fraction of completed tasks
  – Build a map of {% done, # of slots} → time to complete
  – Use a simulator to build the map
  – Iterate through all possible combinations of # of slots and % done

• Problems:
  – Ignores network transfers: network congestion
  – Cross-job contention on a server can impact completion time
  – Not all tasks are equal: # of tasks done isn't a good representation of progress
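
The table-building step can be sketched as follows. This is a toy stand-in, not Jockey's simulator: it assumes identical tasks that run in waves, and every name and constant here (`simulate_remaining`, `build_table`, 100 tasks of 10 seconds) is hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division."""
    return -(-a // b)

def simulate_remaining(percent_done, slots, total_tasks=100, task_time=10.0):
    """Toy simulator: identical tasks run in waves of `slots` tasks."""
    remaining = ceil_div(total_tasks * (100 - percent_done), 100)
    return ceil_div(remaining, slots) * task_time

def build_table(slot_options, step=10):
    """Map every (% done, # of slots) pair to a predicted time to complete."""
    table = {}
    for slots in slot_options:
        for percent_done in range(0, 101, step):
            table[(percent_done, slots)] = simulate_remaining(percent_done, slots)
    return table

table = build_table([5, 10, 20])
time_left = table[(50, 10)]   # lookup: job half done, running on 10 slots
```

At run time the lookup replaces simulation: observe the job's progress, then read off the predicted remaining time for each candidate slot count. Because progress is measured as the fraction of tasks done, the toy model shares the last problem listed above.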

Open Questions

• What about background traffic?
  – Control messages
  – Other bulk transfers

• What about unexpected events?
  – Failures?
  – Loss of data?

• What about protocol inefficiencies?
  – Hadoop scheduling
  – TCP inefficiencies
  – Server scheduling