Monte carlo and network cmg'14

Sources of Traffic Demand Variability and Use of Monte Carlo for Network Capacity Planning

Performance and Capacity 2014 by CMG

November 05, 2014

Alex Gilgur & Brian Eck

Views and opinions expressed in this presentation are views and opinions of its authors.

If found to be in contradiction with views and policies of Google, Inc., the latter take precedence.

Select images are reproduced with permission from Google, Inc.

Moore’s Law in Reverse: Drinking from a firehose?

http://www.kpcb.com/internet-trends


“Matter and energy had ended and with it, space and time...

“All collected data had come to a final end. Nothing was left to be collected.

“But all collected data had yet to be completely correlated and put together in all possible relationships.

“A timeless interval was spent in doing that.

“And it came to pass that AC learned how to reverse the direction of entropy.

“But there was now no man to whom AC might give the answer of the last question.”

Isaac Asimov. “The Last Question”. 1956

What does it cost to own a network?

“... ‘THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER.’”

http://www.multivax.com/last_question.html


We don’t have the time for all this!

Guesstimate!


Ahah! But how sure are you?

It depends on:

● number of servers

● topology

● policies

● traffic patterns

● network protocols

What does a network cost?

What is the confidence interval of your “guesstimate” of Total Cost of Ownership of a network?

The Fishbone Diagram

Network Cost depends on:

● Demand

● Topology

● Policies

● Construction

● Node & Link Reliability

● Hardware & Software

Sizing the Network

Network SIZE -> Network Cost

Network size is where we bring value

Network SIZE depends on:

● Demand

● Topology

● Node & Link Reliability

Demand Fishbone

Demand depends on:

● Usage

● QoS

● Topology

Branches shown on the diagram: Destination, Source, Guarantees, Latency, Flow

Demand Variability

● Noise & Gaps in data

● Non-stationarity & Outliers

● Variation by O & D Nodes

o Node A

o Node Z

● Variation by QoS

o latency

o Pr{delivery}

● Variation within QoS

o other factors

● Distribution:

o Bursty

o Wide Amplitude

o Complex Patterns

o Congestion Control

Demand Forecastability: Noise & Gaps

● Noise & Gaps in data




● Distribution:

o “from feast to famine”

o Bursts

o Congestion Control

Demand Forecastability: Non-Stationarity





Demand Variability: Non-stationarity





Demand Variability: QoS Variation

Figure labels: SC1, SC2

Demand Variability: Other Factors

Demand Variability: Signal Distribution





Demand Predictability

● Not all forecasting tools were created equal:

○ Non-Gaussian distributions

○ Non-stationarity

○ Congestion Control

“All models are wrong. Some models are useful” - G.E.P. Box

● Time-series analysis (TSA) is not the only way to forecast Demand:

○ Explanatory variables:

■ Timestamp is one of them

■ Power

■ CPU

■ Business Metrics

Forecast

From Demand to Capacity

Demand, QoS, Topology -> Capacity

QoS = what’s important to user

1. Low Latency: QoS = 1 / Latency

2. High Probability of Delivery and Accuracy: QoS = “Goodput” = Throughput * Pr{delivery}

Routing for Low Latency: SPF (Shortest Path First): “Travelling Salesman”

Find shortest path from Node 1 to Node 2

Legend: 4 = Node 4; 2 = “Latency of this link = 2 units”

Cost = Latency

QoS = 1/Cost = 1/Latency

Find shortest path from Node 1 to Node 2 IF Node 4 is down

Cost = Latency






Find shortest path from Node 1 to Node 2 IF Node 4 is down and Link 3-5 is losing packets

Cost = Latency
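The three shortest-path questions above can be sketched with Dijkstra's algorithm, a common way to compute SPF. This is a minimal sketch, not the authors' implementation: the slide's actual graph did not survive extraction, so `GRAPH` and its latencies are hypothetical, and treating a lossy link as unusable (`down_links`) is an assumption made for illustration.

```python
import heapq

INF = float("inf")

# Hypothetical topology: node -> {neighbor: latency of the link}.
GRAPH = {
    1: {3: 2, 4: 1},
    2: {3: 6, 4: 1, 5: 2},
    3: {1: 2, 2: 6, 5: 3},
    4: {1: 1, 2: 1},
    5: {2: 2, 3: 3},
}


def shortest_path(graph, source, target, down_nodes=frozenset(), down_links=frozenset()):
    """Return (total latency, path); skips failed nodes and excluded links."""
    if source in down_nodes or target in down_nodes:
        return INF, []
    best = {source: 0}
    previous = {}
    queue = [(0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            break
        if cost > best.get(node, INF):
            continue  # stale queue entry
        for neighbor, latency in graph[node].items():
            if neighbor in down_nodes or frozenset((node, neighbor)) in down_links:
                continue
            new_cost = cost + latency
            if new_cost < best.get(neighbor, INF):
                best[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(queue, (new_cost, neighbor))
    if target not in best:
        return INF, []
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


healthy = shortest_path(GRAPH, 1, 2)
node4_down = shortest_path(GRAPH, 1, 2, down_nodes={4})
lossy_link = shortest_path(GRAPH, 1, 2, down_nodes={4}, down_links={frozenset((3, 5))})
```

On this hypothetical graph each failure pushes the flow onto a longer path, which is why capacity has to be planned for alternative paths as well as primary ones.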



QoS = what’s important to user

1. Low Latency: QoS = 1 / Latency (“Travelling Salesman”)

2. High Probability of Delivery and Accuracy: QoS = “Goodput” = Throughput * Pr{delivery} (Non-linear optimization)

Routing for “Goodput”: Nonlinear optimization

Routing for “Goodput”: Can it be simplified?

Assume:

● No Queueing

○ No Blocking

Redefine:

Can be pseudo-linearized
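The redefinition shown on the slide did not survive extraction. One standard way to turn a multiplicative delivery probability into an additive link cost is to take logarithms; it is offered here as an assumption, not as the authors' formula. The symbols p_l (delivery probability of link l) and c_l (link cost) are illustrative, and link losses are assumed independent with no queueing or blocking.

```latex
\Pr\{\text{delivery on path } P\} = \prod_{l \in P} p_l
\quad\Longrightarrow\quad
c_l = -\log p_l, \qquad
\mathrm{Cost}(P) = \sum_{l \in P} c_l = -\log \Pr\{\text{delivery on path } P\}
```

Under these assumptions, the path with the smallest additive cost is the path with the highest delivery probability, so a shortest-path algorithm can be reused.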

Routing As a Process

Figure sequence: SPF, Draining, SPF, Draining, … (repeating)

“Whack-a-Mole!”

Routing is updated all the time via:

● Protocol (e.g., TCP)

● SDN Control

We need to accommodate each Flow’s:

● Primary Paths

● Alternative Paths

Network Demand & Throughput

Demand, Topology, Node & Link Reliability -> Link Throughput -> Link Size

Demand_i -> Throughput_j

Throughput_j, Connex Traversal Time (Latency) -> Concurrency_j -> Capacity

From Demand to Capacity:

To account for Queueing & StatMux, …

Demand_i -> Throughput_j

Throughput_j, Link Traversal Time (Latency) -> Concurrency_j

Concurrency_j, QoS (P_B) -> Erl^-1 (N, P_B) -> Capacity

Demand -> Throughput -> Concurrency for Flow_i -> Erl^-1 (N, P_B) -> Capacity

Inputs: Connex Traversal Time (Latency); QoS (P_B)

For Long-Haul Networks, it reduces to…

L_Propagation >> L_Queueing

Latency ~ const

Concurrency = const * Throughput

Demand -> Throughput -> Capacity (via Bandwidth Fill Factor)

Can’t forget the stochastic element
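The chain Throughput -> Concurrency -> Erl^-1 (N, P_B) -> Capacity can be sketched as follows. This is a sketch under assumptions, not the authors' code: it reads "Erl" as the Erlang B blocking formula, treats concurrency as the offered load in Erlangs, and uses Little's Law for Concurrency = Throughput * Latency. All function names and the sample numbers are hypothetical.

```python
def concurrency(throughput: float, latency: float) -> float:
    """Little's Law: units in flight = arrival rate * time in the system."""
    return throughput * latency


def erlang_b(servers: int, offered_load: float) -> float:
    """Blocking probability P_B for `servers` channels at `offered_load` Erlangs."""
    blocking = 1.0
    for n in range(1, servers + 1):
        # Standard numerically stable Erlang B recursion.
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking


def inverse_erlang_b(offered_load: float, target_blocking: float) -> int:
    """Smallest number of channels whose blocking is <= target_blocking (> 0)."""
    servers, blocking = 0, 1.0
    while blocking > target_blocking:
        servers += 1
        blocking = offered_load * blocking / (servers + offered_load * blocking)
    return servers


# Hypothetical flow: 200 requests per second, 50 ms traversal time.
in_flight = concurrency(throughput=200.0, latency=0.05)
capacity = inverse_erlang_b(in_flight, target_blocking=0.01)
```

When latency is dominated by propagation and is roughly constant, `concurrency` is just a constant times throughput, which matches the long-haul simplification above.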

We can forecast demand

Demand:

● A1 -> Z1 : X11 Gbps

● A1 -> Z2 : X12 Gbps

● A2 -> Z3 : X23 Gbps

Demand -> Throughput on each Link -> Capacity for each Link


Throughput is combinatorial

Demand is NOT Deterministic

Demand:

● A1 -> Z1 : X11 Gbps

● A1 -> Z2 : X12 Gbps

● A2 -> Z3 : X23 Gbps

Throughput on each Link

Neither is Throughput

From Deterministic Demand to Throughput

Demand:

N1_N4: 100 Gbps

N2_N4: 200 Gbps

Throughput: L12 = ?, L24 = ?, L43 = ?, L31 = ?, L14 = ?

Throughput:

L12 = 100 G

L21 = 200 G

L24 = 300 G

L14 = 300 G

L41 = 0

L43 = 0

L31 = 0

Figure: four-node network (N1, N2, N3, N4) with links L12, L24, L43, L31, L14 and per-link weights.
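The link numbers in the deterministic example can be reproduced by sizing every link for all flows that may use it, on either a primary or an alternative path. The sketch below is an illustration only: the path assignments in `PATHS` are inferred so that the totals match the slide, and they are not stated in the source.

```python
from collections import defaultdict

# Demand per flow in Gbps (from the slide).
FLOWS = {"N1_N4": 100, "N2_N4": 200}

# Assumed primary and alternative paths per flow (hypothetical, inferred).
PATHS = {
    "N1_N4": [["L14"], ["L12", "L24"]],
    "N2_N4": [["L24"], ["L21", "L14"]],
}


def link_throughput(flows, paths):
    """Sum, for each link, the demand of every flow that can traverse it."""
    load = defaultdict(float)
    for flow, gbps in flows.items():
        # A flow counts once per link even if several of its paths share it.
        links = {link for path in paths[flow] for link in path}
        for link in links:
            load[link] += gbps
    return dict(load)


loads = link_throughput(FLOWS, PATHS)
```

Because each link's total depends on which flows and paths touch it, the mapping from demand to throughput grows combinatorially with the number of flows and failure scenarios.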

From Gaussian Demand to Throughput:

Demand:

N1_N4: N (100, 10) Gbps

N2_N4: N (200, 15) Gbps

Throughput: L12 = ?, L24 = ?, L43 = ?, L31 = ?, L14 = ?

Throughput:

L12 = N (100, 10) G

L21 = N (200, 15) G

L24 = N (300, 18) G

L14 = N (300, 18) G

L41 = 0

L43 = 0

L31 = 0

Figure: four-node network (N1, N2, N3, N4) with links L12, L24, L43, L31, L14 and per-link weights.
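The N (300, 18) figures follow from the rule for adding independent Gaussian variables. The derivation below assumes that the second parameter of N(., .) is a standard deviation and that the two flows are independent; neither assumption is stated on the slide.

```latex
X_1 \sim N(100,\,10), \quad X_2 \sim N(200,\,15), \quad X_1 \perp X_2
\;\Longrightarrow\;
X_1 + X_2 \sim N\!\left(100 + 200,\ \sqrt{10^2 + 15^2}\right) = N(300,\ \approx 18.03)
```

Means add directly, while standard deviations add in quadrature, so the combined link is relatively less variable than either flow alone.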

From Generic Random Demand to Throughput: Monte-Carlo

Demand:

N1_N4: G (100, ...) Gbps

N2_N4: G (200, ...) Gbps

Throughput: L12 = ?, L24 = ?, L43 = ?, L31 = ?, L14 = ?

Figure: four-node network (N1, N2, N3, N4) with links L12, L24, L43, L31, L14; the throughput distributions are marked “?”.

Every Demand VALUE is a REALIZATION of a RANGE of possible values

Demand Forecast: replace point estimates with probability distributions

Link Throughput: Monte-Carlo Forecasting

Replace point estimates with probability distributions

Slice the timeline

For each timestamp:
    For each Flow:
        roll the dice N times

For each timestamp:
    For each of the N dice rolls:
        Throughput = sum (Flows)

Monte Carlo works with any Transfer Function
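The dice-rolling loop for a single link and a single timestamp can be sketched as below. It is a minimal sketch, not the authors' implementation: the sampler functions, the seed, and the number of rolls are hypothetical, and the Gaussian parameters are borrowed from the earlier example so the result can be compared with N (300, 18).

```python
import random
import statistics


def simulate_link(flow_samplers, n_rolls=10_000, seed=0):
    """Roll the dice n_rolls times; each roll sums one draw from every flow."""
    rng = random.Random(seed)
    return [sum(sampler(rng) for sampler in flow_samplers) for _ in range(n_rolls)]


# One sampler per flow sharing the link. Any distribution can be plugged in here.
samplers = [
    lambda rng: rng.gauss(100, 10),  # N1_N4 demand in Gbps
    lambda rng: rng.gauss(200, 15),  # N2_N4 demand in Gbps
]

totals = simulate_link(samplers)
mean = statistics.fmean(totals)
spread = statistics.stdev(totals)
p95 = statistics.quantiles(totals, n=100)[94]  # 95th percentile of link throughput
```

Swapping `rng.gauss` for a skewed or bursty sampler needs no other change, which is the practical meaning of Monte Carlo working with any transfer function.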

Demand (A-Z) -> Monte Carlo -> Throughput on each Link -> Capacity for each Link

Use Case (a case study)

● Hundreds of links

● Thousands of demand flows forecasted

o 95th percentile

o Unspecified Prediction Intervals

● Establish optimal Inventory Size & Policies

o Account for Demand Predictability

● Estimate demand variability effect on:

o Network Size

o TCO

Forecast

Approach

1. Quantify Demand Distributions (use Biases)

2. Use Monte-Carlo to forecast Throughput Distributions

3. Use Monte-Carlo to compute Capacity Prediction Intervals

4. Use Monte-Carlo to optimize Inventory Size & Policies

Biases = Forecast - Observed

Biases != Residuals

Quantify Demand Ranges & Prepare MC “Forecasts”

Start

For Each Time Slice:
    For Each Flow:
        Compute: Bias = Projected - Observed
        Build: Bias Distribution
        Roll the dice N = 100 times
        Apply the rolled-out numbers to the baseline forecast for each flow
        Save the N Demand scenarios
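For one flow and one time slice, the steps above might look like the sketch below. It assumes the bias distribution is the empirical one (past biases are resampled with equal weight) and that biases are absolute rather than relative; the slides do not specify either choice. The function name and sample numbers are hypothetical.

```python
import random


def demand_scenarios(projected, observed, baseline_forecast, n_rolls=100, seed=0):
    """Build n_rolls demand scenarios by resampling historical forecast biases."""
    rng = random.Random(seed)
    # Bias = Projected - Observed, one value per historical forecast.
    biases = [p - o for p, o in zip(projected, observed)]
    # Observed = Projected - Bias, so a plausible outcome is baseline - bias.
    return [baseline_forecast - rng.choice(biases) for _ in range(n_rolls)]


# Hypothetical history for one flow, in Gbps.
past_projected = [100.0, 110.0, 120.0]
past_observed = [95.0, 112.0, 118.0]
scenarios = demand_scenarios(past_projected, past_observed, baseline_forecast=130.0)
```

Each saved scenario is one possible realization of demand for that flow; repeating this per flow and per time slice yields the F flows * N forecasts that feed the next step.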

Run the Pseudo-Random Demands through MC

Input: F flows * N forecasts

Map (Map1, Map2, …, MapN-1, MapN): Compute Capacities (N)

Reduce: Analyze the N Capacity Forecasts

Output: L links: Capacity Prediction Intervals (Capacity Forecasts for each Link)
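The map and reduce stages can be sketched in a single process as below. This is illustrative only: the capacity rule (throughput divided by a bandwidth fill factor of 0.8) and the synthetic scenarios are assumptions, and a production job would distribute the map step across workers.

```python
import statistics


def compute_capacity(link_throughputs, fill_factor=0.8):
    """Map step: capacity needed on each link for one demand scenario."""
    return {link: gbps / fill_factor for link, gbps in link_throughputs.items()}


def prediction_intervals(capacity_forecasts):
    """Reduce step: (5th percentile, median, 95th percentile) for each link."""
    intervals = {}
    for link in capacity_forecasts[0]:
        values = [forecast[link] for forecast in capacity_forecasts]
        cuts = statistics.quantiles(values, n=20)  # 19 cut points: 5%, 10%, ..., 95%
        intervals[link] = (cuts[0], statistics.median(values), cuts[-1])
    return intervals


# N = 100 synthetic per-link throughput scenarios in Gbps (hypothetical).
scenarios = [{"L24": 280.0 + i, "L14": 290.0 + 2 * i} for i in range(100)]
capacities = list(map(compute_capacity, scenarios))
intervals = prediction_intervals(capacities)
```

The width of each interval shows how much of the capacity requirement is driven by demand uncertainty rather than by the baseline forecast.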

In Conclusion

● Range forecasting is cool!

● Network Demand varies in many ways

● For WAN, it is OK to use throughput

o still it’s better to use concurrency

● Demand ≠ Throughput

o Demand -> Throughput -> Capacity

● Monte-Carlo is a model

o Therefore it is wrong

o But it is useful

Acknowledgements

● Google’s NetOps Division

● Google’s NetCap & ODS Teams

● Josep Ferrandiz

● Mike Perka

● Leonid Kats

● C. Steven Gunn

● Matthew Mathis

● Kevin J. Mitchell

● Linda Eck

● Sophia Shtilman

● Leora Gilgur


THANK YOU!!!


Backup Slides

Biases != Residuals. Why?

How good are forecasts at predicting demand N days from “now” ???

H/W Availability: Fault Trees

Reliability Function:

Failure is a memoryless (Poisson) process

F(C|t) = F ((1 OR 2)|t) = 1- (R(1|t) * R(2|t))

F(D|t) = F ((3 AND 4 AND 5)|t) = F(3|t) * F(4|t) * F(5|t)

F(E|t) = F ((7 AND 8) | t) = F (7|t) * F(8|t)

F(F|t) = F ((6 OR E) | t) = 1 - (1 - F(7|t) * F(8|t)) * R(6|t)

F(B|t) = F ((C OR D OR F)|t) = 1 - R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t)

⇒ R(A|t) = R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t)

Figure: fault tree with events B, C, D, E, F

There’s got to be a cleaner way!

Fault Trees and Monte-Carlo


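import random

class Component:
    """Fault-tree node: a leaf (rule NONE) or an AND/OR gate over elements."""

    def __init__(self, rule="NONE", mtbf=None, mttr=None, elements=()):
        self.rule = rule                    # AND, OR, or NONE
        self.mtbf, self.mttr = mtbf, mttr   # mean time between failures / to repair
        self.elements = list(elements)
        self.state = "run"                  # run or fail
        # A leaf's first failure arrives after an Exp(mtbf) interval.
        self.next_update_time = (
            random.expovariate(1 / mtbf) if rule == "NONE" else float("inf"))

    def apply_rule(self):
        # An AND gate fails only if all elements failed; an OR gate if any did.
        failed = [element.state == "fail" for element in self.elements]
        gate_failed = all(failed) if self.rule == "AND" else any(failed)
        self.state = "fail" if gate_failed else "run"

    def update(self, time):
        for element in self.elements:
            element.update(time)
        if self.rule != "NONE":
            self.apply_rule()
        elif time >= self.next_update_time:
            if self.state == "fail":
                self.state = "run"          # repaired; next event is a failure
                self.next_update_time += random.expovariate(1 / self.mtbf)
            else:
                self.state = "fail"         # failed; next event is the repair
                self.next_update_time += random.expovariate(1 / self.mttr)

def leaves(component):
    if component.rule == "NONE":
        return [component]
    return [leaf for element in component.elements for leaf in leaves(element)]

def simulate(top, horizon):
    """Return the fraction of simulated time the top event spends in fail."""
    clock, downtime = 0.0, 0.0
    while clock < horizon:
        top.update(clock)
        # Jump the clock to the next scheduled failure or repair.
        next_clock = min(min(leaf.next_update_time for leaf in leaves(top)), horizon)
        if top.state == "fail":
            downtime += next_clock - clock
        clock = next_clock
    return downtime / horizon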

Probability distributions

Simplest - Uniform:

● Least relevant to anything real

● Convenient building block for any distribution

Most standard - Gaussian:

● Mathematically the simplest

● Does not describe the IT world

Most Relevant - Poisson & Exponential:

● Relatively simple mathematically

● Accurately describe arrival counts (Poisson) and times between arrivals and service times (Exponential) for a memoryless process.

F(x) = Pr (X ≤ x) - CDF

f (x) = F’(x) - PDF
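The remark that the Uniform distribution is a building block for any distribution can be illustrated with inverse-transform sampling: draw u from Uniform(0, 1) and solve F(x) = u for x. The sketch below does this for the Exponential distribution, where F(x) = 1 - exp(-x / mean); the function name, seed, and sample size are hypothetical.

```python
import math
import random


def exponential_from_uniform(mean, rng):
    """Draw one Exponential(mean) value from a single Uniform(0, 1) draw."""
    u = rng.random()  # uniform on [0, 1)
    # Invert the CDF: u = 1 - exp(-x / mean)  =>  x = -mean * log(1 - u).
    # 1 - u lies in (0, 1], so the logarithm is always defined.
    return -mean * math.log(1.0 - u)


rng = random.Random(0)
samples = [exponential_from_uniform(2.0, rng) for _ in range(100_000)]
sample_mean = sum(samples) / len(samples)
```

The same recipe works for any distribution whose CDF can be inverted, analytically or numerically, which is how the dice rolls in a Monte Carlo run are usually generated.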
