Monte carlo and network cmg'14
Transcript of Monte carlo and network cmg'14
Sources of Traffic Demand Variability and Use
of Monte Carlo for Network Capacity Planning
Performance and Capacity 2014 by CMG
November 05, 2014
Alex Gilgur & Brian Eck
Views and opinions expressed in this presentation are views and opinions of its authors.
If found to be in contradiction with views and policies of Google, Inc., the latter take precedence.
Select images are reproduced with permission from Google, Inc.
Moore’s Law in Reverse: Drinking from a firehose?
http://www.kpcb.com/internet-trends
“Matter and energy had ended and with it, space and time...
“All collected data had come to a final end. Nothing was left to be collected.
“But all collected data had yet to be completely correlated and put together
in all possible relationships.
“A timeless interval was spent in doing that.
“And it came to pass that AC learned how to reverse the direction of entropy.
“But there was now no man to whom AC might give the answer of the last
question.”
Isaac Asimov. “The Last Question”. 1956
What does it cost to own a network?
“... ‘THERE IS AS YET INSUFFICIENT DATA FOR A MEANINGFUL ANSWER.’”
We don’t have the time for all this! Guesstimate!
Ahah! But how sure are you?
It depends on:
● number of servers
● topology
● policies
● traffic patterns
● network protocols
What does a network cost?
What is the confidence interval of your “guesstimate” of
Total Cost of Ownership of a network?
The Fishbone Diagram
[Figure: fishbone diagram. Network Cost depends on Demand, Topology, Policies, Construction, Node & Link Reliability, and Hardware & Software.]
Sizing the Network
[Figure: the Network Cost fishbone again (Demand, Topology, Policies, Construction, Node & Link Reliability, Hardware & Software), now also showing Network SIZE.]
Network size is where we bring value
[Figure: Network SIZE fishbone with branches Topology, Demand, and Node & Link Reliability.]
Demand Fishbone
[Figure: Demand fishbone with labels Usage, QoS, Topology, Destination, Source, Guarantees, Latency, and Flow.]
Demand Variability
● Noise & Gaps in data
● Non-stationarity & Outliers
● Variation by O & D Nodes
o Node A
o Node Z
● Variation by QoS
o latency
o Pr{delivery}
● Variation within QoS
o other factors
● Distribution:
o Bursty
o Wide Amplitude
o Complex Patterns
o Congestion Control
Demand Forecastability: Noise & Gaps
● Noise & Gaps in data
● Non-stationarity & Outliers
● Variation by O & D Nodes
o Node A
o Node Z
● Variation by QoS
o latency
o Pr{delivery}
● Variation within QoS
o other factors
● Distribution:
o “from feast to famine”
o Bursts
o Congestion Control
Demand Forecastability: Non-Stationarity
Demand Variability: Non-stationarity
Demand Variability: QoS Variation
SC1
SC2
Demand Variability: Other Factors
Demand Variability: Signal Distribution
Demand Predictability
● Not all forecasting tools were created equal:
○ Non-Gaussian distributions
○ Non-stationarity
○ Congestion Control
“All models are wrong. Some models are useful” - G.E.P. Box
● Time series analysis (TSA) is not the only way to forecast Demand:
○ Explanatory variables:
■ Timestamp is one of them
■ Power
■ CPU
■ Business Metrics
From Demand to Capacity
[Figure: Demand, QoS, and Topology lead to Capacity.]
QoS = what’s important to user
1. QoS = 1 / Latency
2. QoS = “Goodput” = Throughput * Pr{delivery}
1. Low Latency
2. High Probability of:
a. Delivery
b. Accuracy
Routing for Low Latency: SPF: “Travelling Salesman”
Cost = Latency
QoS = 1/Cost = 1/Latency
[Figure: network graph. Legend: 4 = Node 4; 2 = “Latency of this link = 2 units”.]
● Find shortest path from Node 1 to Node 2
● Find shortest path from Node 1 to Node 2 IF Node 4 is down
● Find shortest path from Node 1 to Node 2 IF Node 4 is down and Link 3-5 is losing packets
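The three routing questions above can be sketched with Dijkstra's shortest-path algorithm, using link latency as the cost. This is only an illustration: the topology in `LINKS` is hypothetical (the slide's graph is not reproduced in this transcript), and treating a packet-losing link as simply unusable is an assumption made for the sketch.

```python
import heapq

def shortest_path(links, source, target, down_nodes=(), bad_links=()):
    """Dijkstra's algorithm with link latency as the cost.

    links: dict mapping an undirected link (u, v) to its latency.
    down_nodes / bad_links: elements removed from the graph before routing.
    Returns (total_latency, path), or (inf, []) if the target is unreachable.
    """
    excluded = {frozenset(link) for link in bad_links}
    graph = {}
    for (u, v), latency in links.items():
        if u in down_nodes or v in down_nodes or frozenset((u, v)) in excluded:
            continue
        graph.setdefault(u, []).append((v, latency))
        graph.setdefault(v, []).append((u, latency))

    queue = [(0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbor, latency in graph.get(node, []):
            if neighbor not in visited:
                heapq.heappush(queue, (cost + latency, neighbor, path + [neighbor]))
    return float("inf"), []

# Hypothetical topology (NOT the one on the slide): link -> latency units.
LINKS = {(1, 4): 1, (4, 2): 2, (1, 3): 2, (3, 5): 1, (5, 2): 2, (3, 2): 6}

print(shortest_path(LINKS, 1, 2))                                      # (3, [1, 4, 2])
print(shortest_path(LINKS, 1, 2, down_nodes={4}))                      # (5, [1, 3, 5, 2])
print(shortest_path(LINKS, 1, 2, down_nodes={4}, bad_links=[(3, 5)]))  # (8, [1, 3, 2])
```

Each failure removes the currently best path, so the achievable latency (and therefore QoS = 1/Latency) degrades step by step.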
“Travelling Salesman” Non-linear optimization
Routing for “Goodput”: Nonlinear optimization
Non-linear optimization
Routing for “Goodput”: Can it be simplified?
Assume:
● No Queueing
○ No Blocking
Redefine:
Can be pseudo-linearized
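The redefinition shown on the slide is not preserved in this transcript. As an illustration only, one common way to make a product of probabilities additive is a logarithmic link cost. It assumes that each link l on a path P delivers a packet independently with probability p_l (the symbols P, l, p_l and c_l are introduced here, not taken from the slides):

```latex
\[
\Pr\{\text{delivery on } P\} = \prod_{l \in P} p_l
\quad\Longrightarrow\quad
-\log \Pr\{\text{delivery on } P\} = \sum_{l \in P} c_l ,
\qquad c_l = -\log p_l \ge 0 .
\]
```

Under that assumption, maximizing the delivery probability is the same as minimizing an additive cost, so a shortest-path search can be reused. The authors' own redefinition may differ.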
Routing As a Process
[Figure sequence: successive network states labelled SPF and Draining.]
“Whack-a-Mole!”
Routing is updated all the time via:
● Protocol (e.g., TCP)
● SDN Control
We need to accommodate each Flow’s:
● Primary Paths
● Alternative Paths
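Network Demand & Throughput
[Figure: fishbone relating Demand, Topology, and Node & Link Reliability to Link Throughput and Link Size.]
Demand_i → Throughput_j → Concurrency_j → Capacity
Throughput_j and the Connex Traversal Time (Latency) together give Concurrency_j.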
From Demand to Capacity:
Demand_i → Throughput_j → Concurrency_j → Erl⁻¹(N, P_B) → Capacity
Throughput_j and the Link Traversal Time (Latency) together give Concurrency_j.
QoS → P_B
To account for Queueing & StatMux, …
Demand → Throughput → Concurrency for Flow_i → Erl⁻¹(N, P_B) → Capacity
Throughput and the Connex Traversal Time (Latency) together give the Concurrency for Flow_i.
QoS → P_B
For Long-Haul Networks, it reduces to…
Demand → Throughput → Capacity (via the Bandwidth Fill Factor)
L_Propagation >> L_Queueing
Latency ~ const
Concurrency = const * Throughput
Can’t forget the stochastic element
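The chain from throughput to capacity can be sketched in a few lines. The sketch assumes that concurrency follows Little's Law (throughput × latency) and that Erl⁻¹ means inverting the Erlang-B blocking formula to find the smallest number of channels that meets a target blocking probability P_B. The function names and the example numbers are hypothetical, not taken from the talk.

```python
def concurrency(throughput, latency):
    """Little's Law: average number in flight = arrival rate * time in system."""
    return throughput * latency

def erlang_b(channels, offered_load):
    """Erlang-B blocking probability for `channels` servers and a load in erlangs."""
    blocking = 1.0
    for n in range(1, channels + 1):
        blocking = offered_load * blocking / (n + offered_load * blocking)
    return blocking

def min_channels(offered_load, max_blocking):
    """Inverse Erlang-B: smallest N whose blocking is <= max_blocking (must be > 0)."""
    channels = 0
    while erlang_b(channels, offered_load) > max_blocking:
        channels += 1
    return channels

# Hypothetical numbers: 50 requests/s held for 0.2 s each, target P_B = 1%.
load = concurrency(throughput=50.0, latency=0.2)
print(load, min_channels(load, max_blocking=0.01))
```

When latency is roughly constant, as on a long-haul link dominated by propagation delay, `concurrency` is just a constant multiple of throughput, which is why the talk allows sizing directly from throughput in that case.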
We can forecast demand
Demand:
● A1 -> Z1 : X11 Gbps
● A1 -> Z2 : X12 Gbps
● A2 -> Z3 : X23 Gbps
Throughput
on each Link
Capacity
for each Link
Throughput is combinatorial
Demand is NOT Deterministic
Demand:
● A1 -> Z1 : X11 Gbps
● A1 -> Z2 : X12 Gbps
● A2 -> Z3 : X23 Gbps
Throughput
on each Link
Neither is Throughput
From Deterministic Demand to Throughput
Demand:
N1_N4: 100 Gbps
N2_N4: 200 Gbps
Throughput:
L12 = 100 G
L21 = 200 G
L24 = 300 G
L14 = 300 G
L41 = 0
L43 = 0
L31 = 0
[Figure: four-node network (N1, N2, N3, N4) with links labelled L12, L24, L43, L31, and L141, annotated with 100 G and 200 G flows.]
From Gaussian Demand to Throughput:
Demand:
N1_N4: N (100, 10) Gbps
N2_N4: N (200, 15) Gbps
Throughput:
L12 = N (100, 10) G
L21 = N (200, 15) G
L24 = N (300, 18) G
L14 = N (300, 18) G
L41 = 0
L43 = 0
L31 = 0
[Figure: the same four-node network (N1, N2, N3, N4).]
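The N (300, 18) figure can be checked with a short sketch. It assumes that the second parameter of N(·, ·) is a standard deviation and that the two flows are independent. Both are assumptions, but they are consistent with the slide, since sqrt(10² + 15²) ≈ 18. The variable names are illustrative.

```python
import math
import random
import statistics

def sum_of_gaussians(flows):
    """Mean and standard deviation of a sum of independent N(mean, sd) flows."""
    mean = sum(m for m, _ in flows)
    sd = math.sqrt(sum(s ** 2 for _, s in flows))
    return mean, sd

flows_on_l24 = [(100, 10), (200, 15)]   # N1_N4 and N2_N4, in Gbps
mean, sd = sum_of_gaussians(flows_on_l24)
print(round(mean), round(sd))           # 300 18

# Cross-check by simulation (seeded so that the run is repeatable).
rng = random.Random(2014)
samples = [sum(rng.gauss(m, s) for m, s in flows_on_l24) for _ in range(20000)]
print(statistics.mean(samples), statistics.stdev(samples))
```

The closed form only works because sums of independent Gaussians are again Gaussian. For other distributions there is generally no such shortcut, which is where simulation comes in.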
From Generic Random Demand to Throughput:
Demand:
N1_N4: G (100, ...) Gbps
N2_N4: G (200, ...) Gbps
Throughput:
L12 = ?
L24 = ?
L43 = ?
L31 = ?
L141 = ?
[Figure: the same four-node network (N1, N2, N3, N4).]
Monte-Carlo
Every Demand VALUE is a REALIZATION of a RANGE of possible values
Demand Forecast: replace point estimates with probability distributions
Link Throughput: Monte-Carlo Forecasting
● Replace point estimates with probability distributions
● Slice the timeline
● For each timestamp:
    For each Flow:
        roll the dice N times
● For each timestamp:
    For each of the N dice rolls:
        Throughput = sum (Flows)
Monte Carlo works with any Transfer Function
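The loop described above might look like the following minimal sketch. The flow names, the distributions and their parameters are made up for illustration. Any sampler can be plugged in, which is the point of using Monte Carlo here.

```python
import random
import statistics

def simulate_link_throughput(flow_samplers, timestamps, n_rolls, seed=0):
    """For each timestamp, draw n_rolls demand values per flow and sum them.

    flow_samplers: dict flow name -> function(rng, timestamp) returning Gbps.
    Returns dict timestamp -> list of n_rolls simulated link throughputs.
    """
    rng = random.Random(seed)
    result = {}
    for t in timestamps:
        # Roll the dice N times for every flow at this timestamp.
        rolls = {name: [sampler(rng, t) for _ in range(n_rolls)]
                 for name, sampler in flow_samplers.items()}
        # Throughput of roll k is the sum of the k-th draw of every flow.
        result[t] = [sum(rolls[name][k] for name in rolls) for k in range(n_rolls)]
    return result

# Hypothetical non-Gaussian flows sharing one link (parameters are made up).
samplers = {
    "A1->Z1": lambda rng, t: rng.lognormvariate(4.6, 0.2),
    "A1->Z2": lambda rng, t: rng.triangular(150, 300, 200),
}
throughput = simulate_link_throughput(samplers, timestamps=[0, 1, 2], n_rolls=1000)
# 95th percentile of simulated throughput per timestamp.
p95 = {t: statistics.quantiles(v, n=20)[-1] for t, v in throughput.items()}
print(p95)
```

Here the transfer function is a plain sum over the flows on one link. A routing model or a queueing formula could replace the `sum(...)` line without changing the rest of the loop.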
Demand (A-Z) → Monte Carlo → Throughput on each Link → Capacity for each Link
Use Case (a case study)
● Hundreds of links
● Thousands of demand flows forecasted
o 95th percentile
o Unspecified Prediction Intervals
● Establish optimal Inventory Size & Policies
o Account for Demand Predictability
● Estimate demand variability effect on:
o Network Size
o TCO
Forecast
Approach
● Quantify Demand Distributions (use Biases)
● Use Monte-Carlo to forecast Throughput Distributions
● Use Monte-Carlo to compute Capacity Predictive Intervals
● Use Monte-Carlo to optimize Inventory Size & Policies
Biases = Forecast - Observed
Biases != Residuals
Quantify Demand Ranges & Prepare MC “Forecasts”
Start
For Each Time Slice:
    For Each Flow:
        Compute: Bias = Projected - Observed
        Build: Bias Distribution
        Roll the dice N = 100 times
        Apply the rolled-out numbers to the baseline forecast for each flow
        Save the N Demand scenarios
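A minimal sketch of this flowchart for a single flow follows. It assumes the bias is additive and that "applying" a rolled bias means subtracting it from the baseline forecast, since Bias = Projected - Observed. The slides do not spell this out, so the function name, its arguments and the sample numbers are all hypothetical.

```python
import random

def demand_scenarios(projected, observed, baseline, n=100, seed=0):
    """Build n demand scenarios for one flow from its historical forecast bias.

    projected, observed: past forecasts and the matching observations.
    baseline: the current baseline forecast, one value per time slice.
    """
    # Bias = Projected - Observed, one value per historical point.
    biases = [p - o for p, o in zip(projected, observed)]
    rng = random.Random(seed)
    scenarios = []
    for _ in range(n):
        # Roll the dice: draw one bias per time slice from the empirical
        # bias distribution and remove it from the baseline forecast.
        scenarios.append([value - rng.choice(biases) for value in baseline])
    return scenarios

# Made-up history and baseline, in Gbps.
scenarios = demand_scenarios(
    projected=[110, 95, 120], observed=[100, 100, 105], baseline=[200, 210])
print(len(scenarios), scenarios[0])
```

Sampling from the empirical biases avoids assuming any particular distribution shape. A fitted distribution could be substituted if the history is too short.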
Run the Pseudo-Random Demands through MC
Input: F flows * N forecasts
Map (Map1, Map2, …, MapN-1, MapN): Compute Capacities (N)
Reduce: Analyze the N Capacity Forecasts
Output: L links: Capacity Prediction Intervals (Capacity Forecasts for each Link)
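The map and reduce steps can be sketched on a single machine as follows. The routing table, the `fill_factor` value and the 5th/95th percentile bounds are illustrative assumptions rather than values from the case study.

```python
import random
import statistics

def map_capacity(scenario, routes, fill_factor=0.8):
    """Map step: one demand scenario -> required capacity on each link.

    scenario: dict flow -> demand; routes: dict flow -> list of links used.
    fill_factor is a hypothetical target utilization, not a value from the talk.
    """
    throughput = {}
    for flow, demand in scenario.items():
        for link in routes[flow]:
            throughput[link] = throughput.get(link, 0.0) + demand
    return {link: load / fill_factor for link, load in throughput.items()}

def reduce_intervals(capacities, lower=5, upper=95):
    """Reduce step: N per-link capacities -> (lower, upper) percentile bounds."""
    intervals = {}
    for link in capacities[0]:
        cuts = statistics.quantiles([c[link] for c in capacities], n=100)
        intervals[link] = (cuts[lower - 1], cuts[upper - 1])
    return intervals

# Hypothetical routing: N1_N4 goes via N2, so both flows share link L24.
routes = {"N1_N4": ["L12", "L24"], "N2_N4": ["L24"]}
rng = random.Random(1)
scenarios = [{"N1_N4": rng.gauss(100, 10), "N2_N4": rng.gauss(200, 15)}
             for _ in range(100)]
capacities = [map_capacity(s, routes) for s in scenarios]
intervals = reduce_intervals(capacities)
print(intervals)
```

Because each scenario is processed independently, the map step parallelizes trivially. Only the reduce step needs all N results for a link at once.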
In Conclusion
What does it cost to own a network?
● Range forecasting is cool!
● Network Demand varies in many ways
● For WAN, it is OK to use throughput
o still it’s better to use concurrency
● Demand ≠ Throughput
o Demand -> Throughput -> Capacity
● Monte-Carlo is a model
o Therefore it is wrong
o But it is useful
Acknowledgements
● Google’s NetOps Division
● Google’s NetCap & ODS Teams
● Josep Ferrandiz
● Mike Perka
● Leonid Kats
● C. Steven Gunn
● Matthew Mathis
● Kevin J. Mitchell
● Linda Eck
● Sophia Shtilman
● Leora Gilgur
Backup Slides
Biases != Residuals. Why?
How good are forecasts
at predicting demand
N days from “now” ???
H/W Availability: Fault Trees
Reliability Function:
Failure is a memoryless (Poisson) process
F(C|t) = F ((1 OR 2)|t) = 1- (R(1|t) * R(2|t))
F(D|t) = F ((3 AND 4 AND 5)|t) = F(3|t) * F(4|t) * F(5|t)
F(E|t) = F ((7 AND 8) | t) = F (7|t) * F(8|t)
F(F|t) = F ((6 OR E) | t) = 1 - (1 - F(7|t) * F(8|t)) * R(6|t)
F(B|t) = F ((C OR D OR F)|t) = 1 - R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t)
⇒ R(A|t) = R(1|t) * R(2|t) * (1 - F(3|t) * F(4|t) * F(5|t)) * (1 - F(7|t) * F(8|t)) * R(6|t)
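The closed-form expression can be evaluated gate by gate. The sketch below assumes independent components and, for the memoryless case, R(t) = exp(-t / MTBF). The reliability function itself is not preserved in this transcript, so that form is an assumption, as are the function names.

```python
import math

def reliability(t, mtbf):
    """R(t) for a memoryless component: probability of no failure by time t."""
    return math.exp(-t / mtbf)

def failure_of_b(R):
    """F(B|t) from the fault tree, given R[i] = R(i|t) for components 1..8."""
    F = {i: 1.0 - r for i, r in R.items()}
    f_c = 1.0 - R[1] * R[2]            # C = 1 OR 2
    f_d = F[3] * F[4] * F[5]           # D = 3 AND 4 AND 5
    f_e = F[7] * F[8]                  # E = 7 AND 8
    f_f = 1.0 - (1.0 - f_e) * R[6]     # F = 6 OR E
    return 1.0 - (1.0 - f_c) * (1.0 - f_d) * (1.0 - f_f)   # B = C OR D OR F

# Hypothetical example: every component has an MTBF of 1000 hours.
R = {i: reliability(t=100.0, mtbf=1000.0) for i in range(1, 9)}
print(failure_of_b(R))
```

Expanding the last line gives the same product as the slide's F(B|t). Every change to the tree means re-deriving the formula, which is what motivates the simulation approach.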
[Figure: fault tree with top event B, gates C, D, and F, and gate E under F.]
There’s got to be a cleaner way!
Fault Trees and Monte-Carlo
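import random

RUN, FAIL = "run", "fail"

class Component:
    """A leaf (rule NONE) or a gate (rule AND / OR) over child elements."""

    def __init__(self, rule="NONE", mtbf=None, mttr=None, elements=()):
        self.rule = rule                  # "AND", "OR", or "NONE"
        self.mtbf = mtbf                  # mean time between failures (leaf)
        self.mttr = mttr                  # mean time to repair (leaf)
        self.elements = list(elements)    # child Components (gate)
        self.state = RUN
        # A leaf schedules its first failure; a gate has no events of its own.
        self.next_update_time = (
            random.expovariate(1.0 / mtbf) if rule == "NONE" else float("inf"))

    def run(self):
        if self.rule == "NONE":
            self.state = RUN

    def fail(self):
        if self.rule == "NONE":
            self.state = FAIL

    def failed(self):
        """Apply the rule to the elements: AND needs all failed, OR needs any."""
        if self.rule == "NONE":
            return self.state == FAIL
        combine = all if self.rule == "AND" else any
        return combine(e.failed() for e in self.elements)

    def update(self, time):
        """Toggle a leaf whose scheduled event is due, then schedule the next one."""
        if self.rule != "NONE" or time < self.next_update_time:
            return
        if self.state == FAIL:
            self.run()
            self.next_update_time += random.expovariate(1.0 / self.mtbf)
        else:
            self.fail()
            self.next_update_time += random.expovariate(1.0 / self.mttr)

def simulate(top, leaves, horizon):
    """Return the fraction of simulated time during which `top` is failed."""
    clock, downtime = 0.0, 0.0
    while clock < horizon:
        for component in leaves:
            component.update(time=clock)
        # Jump the clock to the next scheduled event (or to the horizon).
        next_clock = min(min(c.next_update_time for c in leaves), horizon)
        if top.failed():
            downtime += next_clock - clock
        clock = next_clock
    return downtime / horizon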
Probability distributions
Simplest - Uniform:
Least relevant to anything real
Convenient building block for any distribution
Most standard - Gaussian:
Mathematically the simplest
Does not describe the IT world
Most Relevant - Poisson & Exponential:
Relatively simple mathematically
Accurately describes times between arrivals and service times
for a memoryless process.
F(x) = Pr (X ≤ x) - CDF
f (x) = F’(x) - PDF
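The remark that the uniform distribution is a building block for any distribution can be illustrated with inverse-transform sampling: draw u uniformly, then solve F(x) = u for x. The sketch uses the exponential CDF F(x) = 1 - exp(-x / mean). The function name and the sample mean of 5.0 are chosen only for illustration.

```python
import math
import random

def exponential_from_uniform(u, mean):
    """Invert F(x) = 1 - exp(-x / mean): x = -mean * ln(1 - u), for u in [0, 1)."""
    return -mean * math.log(1.0 - u)

rng = random.Random(42)
samples = [exponential_from_uniform(rng.random(), mean=5.0) for _ in range(50000)]
print(sum(samples) / len(samples))   # close to 5.0
```

The same recipe applies to any distribution whose CDF can be inverted, analytically or numerically.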