DonDon t’t Lose Sleep Over Availability - USENIX · DonDon t’t Lose Sleep Over Availability:...

Post on 17-Apr-2018

232 views 6 download

Transcript of DonDon t’t Lose Sleep Over Availability - USENIX · DonDon t’t Lose Sleep Over Availability:...

Don’t Lose Sleep Over Availability:Don t Lose Sleep Over Availability: The GreenUp Decentralized Wakeup Service

Siddhartha Sen, Princeton University

J b R L h Ri h d H h C l G J SJacob R. Lorch, Richard Hughes, Carlos G. J. Suarez, Brian Zill, Weverton Cordeiro, and Jitendra Padhye

Enterprise networksEnterprise networks

Enterprise networksEnterprise networks

WAN

users, IT admins

Enterprise networksEnterprise networks

WAN

users, IT admins

Despite the cloud, this is a common scenario

Enterprise networksEnterprise networks

St St

Energy savings

Stay Green!

Availability

Stay Up!

Energy savings Availability

Enterprise networksEnterprise networks

St St

Energy savings

Stay Green!

Availability

Stay Up!

GreenUp!Energy savings AvailabilityGreenUp!

Sleep proxySleep proxy

WAN

Machine(active)

Sleep proxySleep proxy

WAN

Machine(active) Sleep proxy

Sleep proxySleep proxy

WAN

Machine(asleep) Sleep proxy

Sleep proxySleep proxy

WAN

Send traffic to me!Machine(asleep) Sleep proxy

Sleep proxySleep proxy

WAN

Send traffic to me!

Machine(asleep) Sleep proxy

Sleep proxySleep proxy

WAN

Machine(asleep) Machine

(asleep)Sleep proxy

Sleep proxySleep proxy

Remote requestq(TCP SYN)

WAN

Machine(asleep) Machine

(asleep)Sleep proxy

Sleep proxySleep proxy

WAN

Machine(asleep) Machine

(asleep)Sleep proxyRemote request

(TCP SYN)

Sleep proxySleep proxy

WAN

Wake up!(WoL)Machine

(asleep) Machine(asleep)

Sleep proxy(WoL)

Sleep proxySleep proxy

WAN

Machine(asleep) Machine

(asleep)Sleep proxyWake up!

(WoL)

Sleep proxySleep proxy

WAN

Machine(active) Sleep proxy

Sleep proxySleep proxy

Remote requestq(TCP SYN)

WAN

Machine(active) Sleep proxy

Sleep proxySleep proxy

WAN

Machine(active) Sleep proxyRemote request

(TCP SYN)

Sleep proxySleep proxy

WAN

Response

Machine(active) Sleep proxy

Sleep proxySleep proxy

RResponse

WAN

Machine(active) Sleep proxy

Sleep proxySleep proxy

RResponse

Pros Cons

WAN• No special hardware• No envir. changesN h

• Dedicated server per subnet

• No app changes

Machine(active) Sleep proxy

Dedicated servers are a problemDedicated servers are a problem

• High deployment andHigh deployment and management cost

• Single point of failure

Dedicated servers are a problemDedicated servers are a problem

• High deployment andHigh deployment and management cost

• Single point of failure

• High availability becomes expensive!

GreenUp A decentralized minimalGreenUp: A decentralized, minimal software‐only sleep proxy

GreenUp A decentralized minimalGreenUp: A decentralized, minimal software‐only sleep proxy

Any machine can act as a proxy (manager) forAny machine can act as a proxy (manager) for sleeping machines on the subnet

OutlineOutline1. How does GreenUp work?2. What can I learn from GreenUp?

3. How effective is GreenUp?– Evaluation on 100 user machines, currently y

deployed on 1,100machines

OutlineOutline1. How does GreenUp work?2. What can I learn from GreenUp?

Machine State… …… …… …

Distributed management

Subnet state coordination

Guardians

3. How effective is GreenUp?– Evaluation on 100 user machines, currently y

deployed on 1,100machines

OutlineOutline1. How does GreenUp work?2. What can I learn from GreenUp?

Machine State… …… …… …

Distributed management

Subnet state coordination

Guardians

3. How effective is GreenUp?– Evaluation on 100 user machines, currently y

deployed on 1,100machines

GreenUp’s environmentGreenUp s environment

• Subnet domainsSubnet domains

• Load‐sensitive, unreliable machinesLoad sensitive, unreliable machines

• Single administrative domainSingle administrative domain

• Availability most importanty p

Running example (not to scale)Running example (not to scale)

M1 M8

M5

M2

M5

M6M3

M6

M7

M9M4

M7

Running example (not to scale)Running example (not to scale)

M1 M8

M5

M2

M5

M6awake

M3M6

M7

M9M4

M7

Running example (not to scale)Running example (not to scale)

M1 M8

M5

M2

M5

M6awake

M3M6

M7

M9M4

M7

asleep + unmanaged

Running example (not to scale)Running example (not to scale)manager

M1 M8 asleep + managedM5

M2 awake

managedM5

M6M3

M6

M7

M9M4asleep + 

unmanaged

M7

Distributed management: hWho manages M9?

b f l– No guarantees before sleep– M1 failure abandons M8M1 M8

M5• Probe randomly, repeat since machines unreliable

M2

M5

M6

• Load‐sensitive machines, di ib bi

M3M6

M7 so distribute probing– Robust to manager issuesM4

M9

M7

Distributed management: hWho manages M9?

• Wait for notification?– No guarantees before sleepM1 f il b d M8

M1 M8

M5 – M1 failure abandons M8

• Probe randomly, repeatM2

M5

M6 Probe randomly, repeat since machines unreliableM3

M6

M7• Load‐sensitive machines, so distribute probing

b

M4M9

M7

– Robust to manager issues

Distributed management: hWho manages M9?

• Wait for notification?– No guarantees before sleepM1 f il b d M8

M1 M8

M5 – M1 failure abandons M8

• Probe randomly, repeatM2

M5

M6 Probe randomly, repeat since machines unreliableM3

M6

M7• Load‐sensitive machines, so distribute probing

b

M4M9

M7

– Robust to manager issues

Distributed management: hWho manages M9?

• Wait for notification?– No guarantees before sleepM1 f il b d M8

M1 M8

M5 – M1 failure abandons M8

• Probe randomly, repeatM2

M5

M6 Probe randomly, repeat since machines unreliableM3

M6

M7• Load‐sensitive machines, so distribute probing

b

M4M9

M7

– Robust to manager issues

Distributed management: hWho manages M9?

M1 M8total # 

machines

M5

machinesawake#nM2

M5

M6 machinesawake#M3

M6

M7M4

M9

M7

Distributed management: hWho manages M9?

M5

M1 M8total # 

machines# managed 

by i

M5

M6 machinesawake#imn M2

M6

M7

machinesawake#M3

M7M4

M9

Distributed management: hWho manages M9?

M5

M1 M8total # 

machines# managed 

by i

M5

M6 machinesawake#imn M2

M6

M7

machinesawake#M3

M7M4

M9

Distributed management: hWho manages M9?

M5

M1 M8

M5

machinesawake#ln 1

1pimn M2

M6

M7 • Coupon collector analysis

machinesawake#M3

M6

M7 Coupon collector analysisM4

M9

Distributed management: hWho manages M9?

M5

M1 M8

M5

machinesawake#ln 1

1pimn M2

M6

M7 • Coupon collector analysis

machinesawake#M3

M6

M7 Coupon collector analysisM4

M9

Distributed management: hWho manages M9?

M5

M1 M8p = Pr(machine probed)

M5

machinesawake#ln 1

1pimn M2

M6

M7 • Coupon collector analysis

machinesawake#M3

M6

M7 Coupon collector analysisM4

M9

Distributed management: hWho manages M9?

M5

M1 M8p = Pr(machine probed)

M9

M5

machinesawake#ln 1

1pimn M2

M6

M7 • Coupon collector analysis

machinesawake#M3

M6

M7 Coupon collector analysisM4

Multiple managersMultiple managers

M1 M8

M5

M2 machinesawake#ln 1

1pimn

M5

M6

M7 • Availability most importantM3

machinesawake#M6

M7 Availability most important

• Simple resolution protocolM4

M9M9

Multiple managersMultiple managers

M1 M8

M5

M9

M2 machinesawake#ln 1

1pimn

M5

M6M9

M7 • Availability most importantM3

machinesawake#M6M9

M7 Availability most important

• Simple resolution protocolM4

Multiple managersMultiple managers

M1 M8

M5

M9

M2 machinesawake#ln 1

1pimn

M5

M6M9

M7 • Availability most importantM3

machinesawake#M6M9

M7 Availability most important

• Simple resolution protocolM4

Multiple managersMultiple managers

M1 M8

M5

M9

M2 machinesawake#ln 1

1pimn

M5

M6M9

M7 • Availability most importantM3

machinesawake#M6M9

M7 Availability most important

• Simple resolution protocolM4

Multiple managersMultiple managers

M1 M8

M5

M9

M2 machinesawake#ln 1

1pimn

M5

M6

M7 • Availability most importantM3

machinesawake#M6

M7 Availability most important

• Simple resolution protocolM4

Load balanceLoad balance

M8 M9M1

M5

M2

M5

M6M3

M6

M7M4

M7

Load balanceLoad balance

M5

M2 M8

M5

M6M3 M9

M6

M7M4 M1

M7

Load balanceLoad balance

M5

• Induction analysis: equivalent to balls‐in‐bins!

M2 M8

M5

M6)2/ln(ln

)2/ln(nn after n/2 sleeps

M3 M9M6

M7M4 M1

M7

Load balanceLoad balance

M5

• Induction analysis: equivalent to balls‐in‐bins!

Distributed management elects leaders in aM2 M8

M5

M6)2/ln(ln

)2/ln(nn after n/2 sleeps

Distributed management elects leaders in a robust and load‐balanced way, assuming temporary conflicts are tolerable

M3 M9M6

M7

temporary conflicts are tolerable.

M4 M1M7

Subnet state coordinationSubnet state coordination

• Distributed management relies on global state 

M1 M8

M5 g– Who to probe?– How to manage?

M2

M5

M6

• IP address, MAC addressM3

M6

M7• TCP listen ports

M9M4

M7

Subnet state coordinationSubnet state coordination

• Distributed management relies on global state 

M1 M8

M5 g– Who to probe?– How to manage?

M2Machine StateM8 …M9

M5

M6

• IP address, MAC addressM3M9 …M5 …… …

M6

M7• TCP listen ports

M9M4

M7

Subnet state coordinationSubnet state coordination

• Replicated state machine?– Unreliable machines, correlated behavior

M1 M8

M5 correlated behavior– Strong consistency overkillM2Machine State

M8 …M9

M5

M6• External database?– Lose instant deployability

M3M9 …M5 …… …

M6

M7

• Exploit subnet and weaker M9M4

M7

consistency

Subnet state coordinationSubnet state coordination

• Replicated state machine?– Unreliable machines, correlated behavior

M1 M8

M5 correlated behavior– Strong consistency overkillM2Machine State

M8 …M9

M5

M6• External database?– Lose instant deployability

M3M9 …M5 …… …

M6

M7

• Exploit subnet and weaker M9M4

M7

consistency

Subnet state coordinationSubnet state coordination

• Replicated state machine?– Unreliable machines, correlated behavior

M1 M8

M5 correlated behavior– Strong consistency overkillM2Machine State

M8 …M9

M5

M6• External database?– Lose instant deployability

M3M9 …M5 …… …

M6

M7

• Exploit subnet and weaker M9M4

M7

consistency

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9M9

3. Daily roll call to garbage‐collect stateM4

M9 state

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9

3. Daily roll call to garbage‐collect state

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 ? managers while asleep

3 Daily roll call toM3

M6

M7

?

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 ? managers while asleep

3 Daily roll call toM3

M6

M7

?

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8M8 state

M5

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8M8 state

M5

2. Rebroadcast by h l l

M2

M5

M6M8’ state

managers while asleep

3 Daily roll call toM3

M6

M7

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

M8’ state

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5

2. Rebroadcast by h l l

M2

M5

M6 managers while asleep

3 Daily roll call toM3

M6

M7

M9

3. Daily roll call to garbage‐collect stateM4

M7

Subnet state coordinationSubnet state coordination

1. Periodic broadcast while awake

M1 M8

M5Subnet state coordination distributes2. Rebroadcast by 

h l l

M2

M5

M6

Subnet state coordination distributes per‐machine state on a subnet when strong consistency is not requiredmanagers while asleep

3 Daily roll call toM3

M6

M7

strong consistency is not required.

M9

3. Daily roll call to garbage‐collect stateM4

M7

OutlineOutline1. How does GreenUp work?2. What can I learn from GreenUp?

Machine State… …… …… …

Distributed management

Subnet state coordination

Guardians

3. How effective is GreenUp?– Evaluation on 100 user machines, currently y

deployed on 1,100machines

OutlineOutline1. How does GreenUp work?2. What can I learn from GreenUp?

P iMachine State… …… …… …

• Protects against simultaneous sleep

• Caps the max loadDistributed management

Subnet state coordination

Guardians• Caps the max load

3. How effective is GreenUp?– Evaluation on 100 user machines, currently y

deployed on 1,100machines

OutlineOutline1. How does GreenUp work?2. What can I learn from GreenUp?

Machine State… …… …… …

Distributed management

Subnet state coordination

Guardians

3. How effective is GreenUp?– Evaluation on 100 user machines, currently y

deployed on 1,100machines

Deployment in MicrosoftDeployment in Microsoft

• C# codeC# code– Interfaces with packet sniffer/network driver

• Client GUI for users and d l teasy deployment

• Pilot on 1 100 machinesPilot on 1,100 machines

EvaluationEvaluation

• Logs from 101 Windows 7 machines Feb –Logs from 101 Windows 7 machines, Feb. Sep. 2011

• Questions:– Does GreenUp consistently wake machines whenDoes GreenUp consistently wake machines when accessed?

– Does it do so in time to meet user patience?Does it do so in time to meet user patience?– Can GreenUp scale to large subnets?

GreenUp wakes machines reliablyGreenUp wakes machines reliably

GreenUp wakes machines reliablyGreenUp wakes machines reliably

• Connect to machines usingConnect to machines using Samba (TCP port 139) 

• 11 different days (weekends, evenings):– 496 already awake, 278 woken, 5 unwakeable

– Most failures due to WoL– Most failures due to WoL

• 99.4% success rate

GreenUp wakes machines reliablyGreenUp wakes machines reliably

• Connect to machines usingConnect to machines using Samba (TCP port 139) 

• 11 different days (weekends, evenings):

WoL is availability bottleneck!

– 496 already awake, 278 woken, 5 unwakeable

– Most failures due to WoL

bott e ec

– Most failures due to WoL

• 99.4% success rate

GreenUp wakes machines quicklyGreenUp wakes machines quickly

• GreenUp relies on someGreenUp relies on some user patience– Wakeup delay– User retry logic

• Side‐effect of WoLfailure: manager logs how long user waitshow long user waits– 48 events 

GreenUp wakes machines quicklyGreenUp wakes machines quickly

• GreenUp relies on someGreenUp relies on some user patience– Wakeup delay– User retry logic

• Side‐effect of WoLfailure: manager logs how long user waitshow long user waits– 48 events 

GreenUp wakes machines quicklyGreenUp wakes machines quickly

• Convolving: GreenUp wakes machines before userConvolving: GreenUp wakes machines before user gives up 85% of the time

GreenUp wakes machines quicklyGreenUp wakes machines quickly

87% of wakeups take < 9 sec

• Convolving: GreenUp wakes machines before userConvolving: GreenUp wakes machines before user gives up 85% of the time

GreenUp wakes machines quicklyGreenUp wakes machines quickly

87% of wakeups take < 9 sec

• Convolving: GreenUp wakes machines before userConvolving: GreenUp wakes machines before user gives up 85% of the time

GreenUp wakes machines quicklyGreenUp wakes machines quickly

87% of wakeups take < 9 sec

13% of users give up after 3 sec (port scanners?)

• Convolving: GreenUp wakes machines before userConvolving: GreenUp wakes machines before user gives up 85% of the time

GreenUp scales to large subnetsGreenUp scales to large subnets

• Sources of manager loadSources of manager load– Intercept traffic for asleep machinesBroadcast state– Broadcast state

– Probe/respond to probes

GreenUp scales to large subnetsGreenUp scales to large subnets

GreenUp scales to large subnetsGreenUp scales to large subnets

GreenUp scales to large subnetsGreenUp scales to large subnets

GreenUp scales to large subnetsGreenUp scales to large subnets

Good load balance +h k henough awake machines 

few managed machines!

GreenUp scales to large subnetsGreenUp scales to large subnets

• Simulated probing load on 2.4‐GHz, dual‐coreSimulated probing load on 2.4 GHz, dual core Windows 7 machine w/ 4GB memory and 1Gb/s NIC:

# of managed machines CPU utilization100 12%200 21%300 29%

• Guardians ensure max load is 100

GreenUp scales to large subnetsGreenUp scales to large subnets

• Simulated probing load on 2.4‐GHz, dual‐coreSimulated probing load on 2.4 GHz, dual core Windows 7 machine w/ 4GB memory and 1Gb/s NIC:

# of managed machines CPU utilization100 12%200 21%300 29%

• Guardians ensure max load is 100

Does GreenUp save more energy?Does GreenUp save more energy?

• Energy savings depends on sleep timeEnergy savings depends on sleep time 

Does GreenUp save more energy?Does GreenUp save more energy?

• Energy savings depends on sleep timeEnergy savings depends on sleep time 

Does GreenUp save more energy?Does GreenUp save more energy?

• Energy savings depends on sleep timeEnergy savings depends on sleep time 

Average 31% $17.50/machine/year

Does GreenUp save more energy?Does GreenUp save more energy?

• Energy savings depends on sleep timeEnergy savings depends on sleep time

• IT enforces sleep policy at Microsoft, so hardIT enforces sleep policy at Microsoft, so hard to tell

Extension: Higher availability via l l d h d ffexplicit load hand‐off

M5

M8 M9M1

M2

M5

M6M3

M6

M7M4

M7

Extension: Higher availability via l l d h d ffexplicit load hand‐off

M5

M2

M5

M6M8

M3M6

M7M9

M4M7

M1

Extension: Higher availability via l l d h d ffexplicit load hand‐off

M5

M8 M9M1

M2

M5

M6M3

M6

M7M4

M7

Extension: Higher availability via l l d h d ffexplicit load hand‐off

M5

M2

M5

M6M3

M6

M7M4

M7M8 M9M1

Extension: Higher availability via l l d h d ffexplicit load hand‐off

+M5

M2

M5

M6M3

M6

M7M4

M7M8 M9M1

Extension: Higher availability via l l d h d ffexplicit load hand‐off

+M5

• Theorem Expected maxM2

M5

M6 • Theorem. Expected max load = n/d Hd.M3

M6

M7M4

# awake machines

Harmonic numbers

M7M8 M9M1

Other solutionsOther solutions

• Sleep proxy idea: Christensen & Gulledge ’98Sleep proxy idea: Christensen & Gulledge 98• Recently:

System TechniqueSomniloquy, NSDI ’09  augmented NICsq y, gLiteGreen, ATC ’10Jettison, EuroSys ’12  VM migration

SleepServer, ATC ’10  application stubsNedevschi et al., NSDI ’08 Reich et al ATC ’10 dedicated serversReich et al., ATC  10

Other solutionsOther solutions

• Sleep proxy idea: Christensen & Gulledge ’98Barriers to Sleep proxy idea: Christensen & Gulledge 98• Recently:

deployment

System TechniqueSomniloquy, NSDI ’09  augmented NICsq y, gLiteGreen, ATC ’10Jettison, EuroSys ’12  VM migration

SleepServer, ATC ’10  application stubsNedevschi et al., NSDI ’08 Reich et al ATC ’10 dedicated serversReich et al., ATC  10

GreenUpGreenUp

• Completely decentralized software‐onlyCompletely decentralized, software only  sleep proxy

• Useful distributed systems techniques

• High availability at low cost, even as   machines sleep!machines sleep!

http://research.microsoft.com/en‐us/projects/greenup/http://research.microsoft.com/en us/projects/greenup/