Computer Networks: Architecture and Protocols

CS 4450

Lecture 23: TCP and Congestion Control

Spring 2018, Rachit Agarwal

Recap: the WHYs behind TCP design

• Started from first principles
• Correctness condition for reliable transport
• … to understanding why feedback from the receiver is necessary (sol-v1)
• … to understanding why timers may be needed (sol-v2)
• … to understanding why a window-based design may be needed (sol-v3)
• … to understanding why cumulative ACKs may be a good idea
• Very close to modern TCP

Recap: Transport layer

• The transport layer offers a “pipe” abstraction to applications
  • Data goes in one end of the pipe and emerges from the other
• Pipes are between processes, not hosts
• Two basic pipe abstractions:
  • Unreliable packet delivery (UDP)
    • Unreliable (application responsible for resending)
    • Messages limited to a single packet
  • Reliable byte stream delivery (TCP)
    • Bytes inserted into the pipe by the sender
    • They emerge, in order, at the receiver (to the app)

Recap: Transmission Control Protocol (TCP)

• Reliable, in-order delivery
  • Ensures the byte stream (eventually) arrives intact
  • In the presence of corruption, delays, reordering, loss
• Connection oriented
  • Explicit set-up and tear-down of a TCP session
• Full-duplex stream-of-bytes service
  • Sends and receives a stream of bytes, not messages
• Flow control
  • Ensures the sender does not overwhelm the receiver
• Congestion control
  • Dynamic adaptation to the network path’s capacity

Any Questions?

From design to implementation: major notation change

• Previously we focused on packets
  • Packets had numbers
  • ACKs referred to those numbers
  • Window sizes expressed in terms of # of packets
• TCP focuses on bytes, thus
  • Packets identified by the bytes they carry
  • ACKs refer to the bytes received
  • Window size expressed in terms of # of bytes

Basic Components of TCP

• Segments, Sequence numbers, ACKs
  • TCP uses byte sequence numbers to identify payloads
  • ACKs refer to sequence numbers
  • Window sizes expressed in terms of # of bytes
• Retransmissions
  • Can’t be correct without retransmitting lost/corrupted data
  • TCP retransmits based on timeouts and duplicate ACKs
  • Timeouts based on an estimate of the RTT
• Flow Control
• Congestion Control

Segments, Sequence Numbers and ACKs

TCP “Stream of Bytes” Service

[Figure: the application at Host A writes bytes 0, 1, 2, 3, …, 80 into the stream; the application at Host B reads the same bytes in the same order. Between the hosts, the bytes travel as TCP data segments.]

• A segment is sent when:
  1) the segment is full (Max Segment Size), or
  2) it is not full, but a timer expires

TCP Segment

• IP packet
  • No bigger than the Maximum Transmission Unit (MTU)
  • E.g., up to 1500 bytes with Ethernet
• TCP packet
  • IP packet with a TCP header and data inside
  • TCP header is >= 20 bytes long
• TCP segment
  • No more than MSS (Maximum Segment Size) bytes
  • E.g., up to 1460 consecutive bytes from the stream
  • MSS = MTU - IP header - TCP header

[Figure: an IP packet is an IP header followed by IP data (the datagram); the IP data is a TCP header followed by TCP data (the segment).]
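The MSS formula above can be checked with a few lines of Python. This is only a sketch: the function name `max_segment_size` is made up for illustration, and the 20-byte defaults assume an IPv4 header and a TCP header that carry no options.

```python
def max_segment_size(mtu: int, ip_header: int = 20, tcp_header: int = 20) -> int:
    """MSS = MTU - IP header - TCP header (all sizes in bytes).

    The 20-byte defaults are an assumption: headers without options.
    """
    mss = mtu - ip_header - tcp_header
    if mss <= 0:
        raise ValueError("headers do not fit inside the MTU")
    return mss


# Ethernet example from the slide: 1500 - 20 - 20
ethernet_mss = max_segment_size(1500)
print(ethernet_mss)
```

With TCP options present the header grows, so the payload that fits in one segment shrinks accordingly, e.g. `max_segment_size(1500, tcp_header=32)`.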

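Sequence Numbers

[Figure: Host A sends a segment (TCP header + TCP data) whose data starts k bytes into the stream. Host B replies with a segment carrying an ACK.]

• Sequence number = 1st byte in segment = ISN + k, where ISN is the Initial Sequence Number
• ACK sequence number = next expected byte = seqno + length(data)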

ACKing and Sequence Numbers

• Sender sends segments (byte stream)
  • Data starts with Initial Sequence Number (ISN): X
  • Packet contains B bytes: X, X+1, X+2, …, X+B-1
• Upon receipt of a segment, the receiver sends an ACK
  • If all data prior to X has already been received:
    • ACK acknowledges X+B (because that is the next expected byte)
  • If the highest contiguous byte received is a smaller value Y:
    • ACK acknowledges Y+1
    • Even if this has been ACKed before
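The receiver rule above can be sketched as a small class. The name `CumulativeAckReceiver` and the simplification that segments never partially overlap are assumptions made for illustration; a real receiver also has to handle overlapping data and sequence-number wraparound.

```python
class CumulativeAckReceiver:
    """Tracks the next expected byte and returns it as the ACK value."""

    def __init__(self, isn: int):
        self.next_expected = isn   # first byte we have not yet received in order
        self.buffered = {}         # seqno -> length of out-of-order segments

    def on_segment(self, seqno: int, length: int) -> int:
        if seqno > self.next_expected:
            # A hole precedes this segment: hold it, ACK stays where it was.
            self.buffered[seqno] = length
        elif seqno == self.next_expected:
            self.next_expected += length
            # Drain any buffered segments that are now contiguous.
            while self.next_expected in self.buffered:
                self.next_expected += self.buffered.pop(self.next_expected)
        # seqno < next_expected: duplicate data, nothing to update.
        return self.next_expected


rx = CumulativeAckReceiver(isn=100)
acks = [rx.on_segment(seqno, 100) for seqno in (100, 300, 200)]
print(acks)
```

The second segment arrives out of order, so its ACK repeats the previous value; once the hole is filled the ACK jumps past both segments.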

TCP Header

[Figure: TCP header layout. Fields: Source Port, Destination Port; Sequence Number; Acknowledgement; HdrLen, 0, Flags, Advertised Window; Checksum, Urgent Pointer; Options (variable); Data.]

• Sequence Number: starting byte offset of the data carried in this segment

TCP Header

[Figure: the TCP header layout again, with the Acknowledgement field annotated.]

• Acknowledgement: gives the sequence number just beyond the highest sequence number received in order (“What byte is next”)

TCP Connection Establishment and Initial Sequence Numbers

Initial Sequence Number (ISN)

• Sequence number for the very first byte
  • E.g., why not just use ISN = 0?
• Practical issue
  • IP addresses and port #s uniquely identify a connection
  • Eventually, though, these port #s do get used again
  • … small chance an old packet is still in flight
• TCP therefore requires changing the ISN
  • Set from a 32-bit clock that ticks every 4 microseconds
  • … only wraps around once every 4.55 hours
• To establish a connection, hosts exchange ISNs
  • How does this help?

Establishing a TCP Connection

• Three-way handshake to establish a connection
  • Host A sends a SYN (open; “synchronize sequence numbers”) to host B
  • Host B returns a SYN acknowledgement (SYN ACK)
  • Host A sends an ACK to acknowledge the SYN ACK

[Figure: timeline between A and B. A sends SYN, B replies with SYN + ACK, A sends ACK, after which data flows in both directions.]

Each host tells its ISN to the other host.

TCP Header

[Figure: the TCP header layout again, with the Flags field annotated.]

• Flags: SYN, ACK, FIN, RST, PSH, URG
• See /usr/include/netinet/tcp.h on Unix systems

Step 1: A’s Initial SYN Packet

[Figure: TCP header of A’s SYN. Source port: A’s port; destination port: B’s port; Sequence Number: A’s Initial Sequence Number; Acknowledgement: irrelevant since ACK is not set; HdrLen: 5 (= 20 bytes); Flags: SYN, ACK, FIN, RST, PSH, URG.]

A tells B it wants to open a connection…

Step 2: B’s SYN-ACK Packet

[Figure: TCP header of B’s SYN-ACK. Source port: B’s port; destination port: A’s port; Sequence Number: B’s Initial Sequence Number; Acknowledgement: A’s ISN plus 1; HdrLen: 20 bytes; Flags: SYN, ACK, FIN, RST, PSH, URG.]

B tells A it accepts and is ready to hear the next byte…

… upon receiving this packet, A can start sending data

Step 3: A’s ACK of the SYN-ACK

[Figure: TCP header of A’s ACK. Source port: A’s port; destination port: B’s port; Sequence Number: A’s Initial Sequence Number; Acknowledgement: B’s ISN plus 1; HdrLen: 20 bytes; Flags: SYN, ACK, FIN, RST, PSH, URG.]

A tells B it’s likewise okay to start sending

… upon receiving this packet, B can start sending data

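Timing Diagram: 3-Way Handshaking

[Figure: timeline between the client (initiator, active open) and the server (passive open). The socket calls listen() and accept() appear on the server side, connect() on the client side.]

• Client -> server: SYN, SeqNum = x
• Server -> client: SYN + ACK, SeqNum = y, Ack = x + 1
• Client -> server: ACK, Ack = y + 1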
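The numbers exchanged in the handshake can be written out as a short sketch. The function name `three_way_handshake` and the dictionary layout are hypothetical; the modulo arithmetic assumes the 32-bit sequence-number space, so an ISN at the top of the range wraps to 0.

```python
SEQ_SPACE = 2 ** 32  # assumption: sequence numbers are 32 bits wide


def three_way_handshake(x: int, y: int) -> list:
    """Return the three handshake messages for client ISN x and server ISN y."""
    return [
        {"from": "A", "flags": {"SYN"}, "seq": x},
        {"from": "B", "flags": {"SYN", "ACK"}, "seq": y,
         "ack": (x + 1) % SEQ_SPACE},
        {"from": "A", "flags": {"ACK"}, "ack": (y + 1) % SEQ_SPACE},
    ]


# The server ISN is chosen at the very top of the range to show the wraparound.
messages = three_way_handshake(x=1000, y=SEQ_SPACE - 1)
for m in messages:
    print(m)
```

Each side acknowledges the other's ISN plus one, which is how both ends confirm they agree on where the byte stream starts.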

Any Questions?

TCP Retransmission

Two Mechanisms for Retransmissions

• Duplicate ACKs
• Timeouts

Loss with Cumulative ACKs

• Sender sends packets with 100 B each and seqnos
  • 100, 200, 300, 400, 500, 600, 700, 800, 900
• Assume the 5th packet (seqno 500) is lost, but no others
• Stream of ACKs will be
  • 200, 300, 400, 500, 500, 500, 500, 500

Loss with Cumulative ACKs

• Duplicate ACKs are a sign of an isolated loss
  • The lack of ACK progress means 500 hasn’t been delivered
  • The stream of ACKs means some packets are being delivered
• Therefore, could trigger a resend upon receiving k duplicate ACKs
  • TCP uses k = 3
• We will revisit this in congestion control
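The ACK stream in the example, and the point at which the third duplicate arrives, can be reproduced with a short simulation. The helper names `ack_stream` and `fast_retransmit_point` are invented for this sketch, and it assumes equally sized packets that arrive in sending order.

```python
def ack_stream(seqnos, size, lost):
    """Cumulative ACKs for equally sized packets, skipping the lost ones."""
    next_expected = seqnos[0]
    received = set()
    acks = []
    for seqno in seqnos:
        if seqno in lost:
            continue                      # dropped: the receiver never sees it
        received.add(seqno)
        while next_expected in received:  # advance over contiguous data
            next_expected += size
        acks.append(next_expected)
    return acks


def fast_retransmit_point(acks, k=3):
    """Index of the ACK that is the k-th duplicate in a row, or None."""
    dup = 0
    for i in range(1, len(acks)):
        if acks[i] == acks[i - 1]:
            dup += 1
            if dup == k:
                return i
        else:
            dup = 0
    return None


acks = ack_stream(list(range(100, 1000, 100)), size=100, lost={500})
print(acks)
print(fast_retransmit_point(acks))
```

With k = 3 the sender would resend packet 500 on the ACK triggered by packet 800, well before any timer would normally fire.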

Timeouts and Retransmissions

• Reliability requires retransmitting lost data
• Involves setting timers and retransmitting on timeouts
• TCP only has a single timer
• TCP resets the timer whenever new data is ACKed
  • Retransmit the packet containing the “next byte” when the timer expires
• RTO (Retransmit TimeOut) is the basic timeout value

Setting the Timeout Value (RTO)

[Figure: two sender/receiver timelines for packet 1, each comparing the timeout against the RTT.]

• Timeout too long -> inefficient
• Timeout too short -> duplicate packets

Setting RTO value

• Many ideas
  • See backup slides for some examples (not needed for exams)
• Implementations often use a coarser-grained timer
  • 500 msec is typical
• Incurring a timeout is expensive
  • So we rely on duplicate ACKs

TCP Flow Control

Flow Control (Sliding Window)

• Advertised Window: W
  • Can send W bytes beyond the next expected byte
• Receiver uses W to prevent the sender from overflowing its buffer
• Limits the number of bytes the sender can have in flight

TCP Header

[Figure: the TCP header layout again; flow control uses its Advertised Window field.]

Implementing Sliding Window

• Sender maintains a window
  • Data that has been sent out but not yet ACKed
• Left edge of window:
  • Beginning of unacknowledged data
  • Moves when data is ACKed
• Window size = maximum amount of data in flight
  • Receiver sets this amount, based on its available buffer space
  • If it has not yet sent data up to the app, this might be small

Advertised Window Limits Rate

• Sender can send no faster than W/RTT bytes/sec
• In original TCP, that was the sole protocol mechanism controlling the sender’s rate
• What’s missing?
  • Congestion control: how to adjust W to avoid network congestion
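A quick calculation shows what the W/RTT bound means in practice. The function name `window_limited_rate` and the example window of 64,000 bytes are illustrative choices, not values from the lecture.

```python
def window_limited_rate(window_bytes: int, rtt_ms: int) -> float:
    """Upper bound on throughput in bytes/sec: at most one window per round trip."""
    if rtt_ms <= 0:
        raise ValueError("RTT must be positive")
    return window_bytes * 1000 / rtt_ms


# Hypothetical numbers: a 64,000-byte window over a 100 ms path.
rate = window_limited_rate(window_bytes=64_000, rtt_ms=100)
print(rate, "bytes/sec =", rate * 8 / 1e6, "Mbps")
```

The same window over a path with half the RTT allows twice the rate, which is why a fixed window says nothing about whether the network itself can carry the traffic.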

Any Questions?

TCP Congestion Control

TCP congestion control: high-level idea

• End hosts adjust their sending rate
• Based on implicit feedback from the network
  • Implicit: a router drops packets because its buffer overflows, not because it’s trying to send a message
• Hosts probe the network to test the level of congestion
  • Speed up when there is no congestion (i.e., no packet drops)
  • Slow down when there is congestion (i.e., packet drops)
• How to do this efficiently?
  • Extend TCP’s existing window-based protocol…
  • Adapt the window size in response to congestion

All These Windows…

• Flow control window: Advertised Window (RWND)
  • How many bytes can be sent without overflowing the receiver’s buffers
  • Determined by the receiver and reported to the sender
• Congestion Window (CWND)
  • How many bytes can be sent without overflowing routers
  • Computed by the sender using a congestion control algorithm
• Sender-side window = minimum{CWND, RWND}
  • Assume for this lecture that RWND >> CWND

Note

• This lecture will talk about CWND in units of MSS
  • Recall MSS: Maximum Segment Size, the amount of payload data in a TCP packet
  • This is only for pedagogical purposes
• Keep in mind that real implementations maintain CWND in bytes

Basics of TCP Congestion

• Congestion Window (CWND)
  • Maximum # of unacknowledged bytes to have in flight
  • Rate ~ CWND/RTT
• Adapting the congestion window
  • Increase upon lack of congestion: optimistic exploration
  • Decrease upon detecting congestion
• But how do you detect congestion?

Not All Losses the Same

• Duplicate ACKs: isolated loss
  • Still getting ACKs
• Timeout: possible disaster
  • Not enough duplicate ACKs
  • Must have suffered several losses

How to Adjust CWND?

• Consequences of an over-sized window are much worse than those of an under-sized window
  • Over-sized window: packets dropped and retransmitted
  • Under-sized window: somewhat lower throughput
• Approach
  • Gentle increase when un-congested (exploration)
  • Rapid decrease when congested

Additive Increase, Multiplicative Decrease (AIMD)

• Additive increase
  • On success of the last window of data, increase by one MSS
  • If W packets in a row have been ACKed, increase W by one
  • i.e., +1/W per ACK
• Multiplicative decrease
  • On loss of packets detected by dupACKs, halve the congestion window
  • Special case: on timeout, reduce the congestion window to one MSS

AIMD

• ACK: CWND -> CWND + 1/CWND
  • When CWND is measured in MSS
  • Note: after a full window, CWND increases by 1 MSS
  • Thus, CWND increases by 1 MSS per RTT
• 3rd dupACK: CWND -> CWND/2
• Special case of timeout: CWND -> 1 MSS
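The three update rules can be applied to a sequence of events to see how the window evolves. This is a minimal sketch with CWND in units of MSS; the function name `aimd_trace` and the event labels are made up, and the floor of 1 MSS on a halving follows the lecture's rule that CWND never drops below 1.

```python
def aimd_trace(cwnd: float, events: list) -> list:
    """Apply per-event AIMD updates and return every CWND value seen."""
    trace = [cwnd]
    for event in events:
        if event == "ack":
            cwnd += 1.0 / cwnd            # additive increase, spread over a window
        elif event == "dupack3":
            cwnd = max(cwnd / 2.0, 1.0)   # multiplicative decrease
        elif event == "timeout":
            cwnd = 1.0                    # special case: back to one MSS
        else:
            raise ValueError(f"unknown event: {event}")
        trace.append(cwnd)
    return trace


# One window's worth of ACKs starting from CWND = 4, then a loss.
trace = aimd_trace(4.0, ["ack"] * 4 + ["dupack3"])
print([round(c, 3) for c in trace])
```

Because each ACK adds 1/CWND using the already-increased window, four ACKs take CWND from 4 to slightly under 5, which is the "about 1 MSS per RTT" behaviour described above.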

Leads to the TCP Sawtooth

[Figure: window size versus time t. The window is halved at each loss and then climbs again, giving a sawtooth.]

Any Questions?

Slow Start

AIMD Starts Too Slowly

[Figure: window size versus time t under AIMD alone.]

• It could take a long time to get started!
• Need to start with a small CWND to avoid overloading the network

Bandwidth Discovery with Slow Start

• Goal: estimate available bandwidth
  • Start slow (for safety)
  • But ramp up quickly (for efficiency)
• Consider
  • RTT = 100 ms, MSS = 1000 bytes
  • Window size to fill 1 Mbps of BW = 12.5 MSS
  • Window size to fill 1 Gbps = 12,500 MSS
  • With just AIMD, it takes about 12,500 RTTs to get to this window size!
  • ~21 mins
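The arithmetic behind these numbers is worth spelling out. The sketch below uses hypothetical helper names and assumes the window starts at 1 MSS, grows by 1 MSS per RTT under additive increase, and doubles per RTT under slow start.

```python
import math


def window_to_fill_mss(bandwidth_bps: int, rtt_ms: int, mss_bytes: int) -> float:
    """Window (in MSS) equal to the bandwidth-delay product."""
    return bandwidth_bps * rtt_ms / (8 * 1000 * mss_bytes)


def rtts_additive(target_mss: float, start_mss: float = 1) -> int:
    """RTTs needed when the window grows by 1 MSS per RTT."""
    return max(0, math.ceil(target_mss - start_mss))


def rtts_slow_start(target_mss: float, start_mss: float = 1) -> int:
    """RTTs needed when the window doubles every RTT."""
    rtts, window = 0, start_mss
    while window < target_mss:
        window *= 2
        rtts += 1
    return rtts


rtt_ms, mss = 100, 1000
print(window_to_fill_mss(1_000_000, rtt_ms, mss))        # 1 Mbps
target = window_to_fill_mss(1_000_000_000, rtt_ms, mss)  # 1 Gbps
print(target)
print(rtts_additive(target) * rtt_ms / 1000 / 60, "minutes with additive increase")
print(rtts_slow_start(target) * rtt_ms / 1000, "seconds with doubling")
```

Doubling reaches the same window in 14 round trips instead of roughly 12,500, which is the motivation for the slow-start phase.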

“Slow Start” Phase

• Start with a small congestion window
  • Initially, CWND is 1 MSS
  • So, the initial sending rate is MSS/RTT
• That could be pretty wasteful
  • Might be much less than the actual bandwidth
  • Linear increase takes a long time to accelerate
• Slow-start phase (actually “fast start”)
  • Sender starts at a slow rate (hence the name)
  • … but increases exponentially until the first loss

Slow Start in Action

[Figure: timeline between Src and Dst showing data (D) and ACK (A) exchanges over successive round trips, labelled 1, 2, 3, 4, 8.]

• Double CWND per round-trip time
• Simple implementation: on each ACK, CWND += MSS

Slow Start and the TCP Sawtooth

[Figure: window size versus time t. An initial exponential “slow start” ramp precedes the sawtooth.]

Why is it called slow-start? Because TCP originally had no congestion control mechanism. The source would just start by sending a whole window’s worth of data.

Slow-Start vs AIMD

• When does a sender stop Slow-Start and start Additive Increase?
• Introduce a “slow start threshold” (ssthresh)
  • Initialized to a large value
  • On timeout, ssthresh = CWND/2
• When CWND > ssthresh, the sender switches from slow-start to AIMD-style increase

Timeouts

Loss Detected by Timeout

• Sender starts a timer that runs for RTO seconds
• Restart the timer whenever an ACK for new data arrives
• If the timer expires
  • Set SSTHRESH <- CWND/2 (“Slow Start Threshold”)
  • Set CWND <- 1 (MSS)
  • Retransmit the first lost packet
  • Execute Slow Start until CWND > SSTHRESH
  • After which switch to Additive Increase

Summary of Increase

• “Slow start”: increase CWND by 1 (MSS) for each ACK
  • A factor of 2 per RTT
• Leave the slow-start regime when either:
  • CWND > SSTHRESH
  • Packet drop detected by dupACKs
• Enter the AIMD regime
  • Increase by 1 (MSS) for each window’s worth of ACKed data

Summary of Decrease

• Cut CWND in half on loss detected by dupACKs
  • Fast retransmit to avoid overreacting
• Cut CWND all the way to 1 (MSS) on timeout
  • Set ssthresh to CWND/2
• Never drop CWND below 1 (MSS)
  • Our correctness condition: always try to make progress

TCP Congestion Control Details

Implementation

• State at sender
  • CWND (initialized to a small constant)
  • ssthresh (initialized to a large constant)
  • dupACKcount
  • Timer, as before
• Events at sender
  • ACK (new data)
  • dupACK (duplicate ACK for old data)
  • Timeout
• What about the receiver? Just send ACKs upon arrival

Event: ACK (new data)

• If in slow start
  • CWND += 1

CWND packets are sent per RTT. Hence, after one RTT with no drops: CWND = 2 x CWND

Event: ACK (new data)

• If CWND <= ssthresh (slow start phase)
  • CWND += 1
• Else (congestion avoidance phase, additive increase)
  • CWND = CWND + 1/CWND

CWND packets are sent per RTT. Hence, after one RTT with no drops in congestion avoidance: CWND = CWND + 1

Event: Timeout

• On Timeout
  • ssthresh <- CWND/2
  • CWND <- 1

Event: dupACK

• dupACKcount++
• If dupACKcount = 3 /* fast retransmit */
  • ssthresh <- CWND/2
  • CWND <- CWND/2

Remains in congestion avoidance after fast retransmission
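The three event handlers can be collected into one small sender model. This is a sketch under stated assumptions: the class name `RenoLikeSender` is invented, CWND and ssthresh are in units of MSS, and the phase is tracked in an explicit `state` attribute (rather than re-testing CWND against ssthresh on every ACK) so that the sender stays in congestion avoidance after a fast retransmit, as the slide notes. Fast recovery is not modelled here.

```python
class RenoLikeSender:
    """Congestion-control state at the sender; windows in units of MSS."""

    def __init__(self, cwnd: float = 1.0, ssthresh: float = 64.0):
        self.cwnd = cwnd
        self.ssthresh = ssthresh          # 64 is an arbitrary "large" constant
        self.dup_acks = 0
        self.state = "slow_start"

    def on_new_ack(self) -> None:
        self.dup_acks = 0
        if self.state == "slow_start":
            self.cwnd += 1.0              # doubles CWND once per RTT
            if self.cwnd > self.ssthresh:
                self.state = "cong_avoid"
        else:
            self.cwnd += 1.0 / self.cwnd  # about +1 MSS per RTT

    def on_dup_ack(self) -> bool:
        """Return True when the caller should fast-retransmit."""
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = self.cwnd / 2.0
            self.cwnd = max(self.cwnd / 2.0, 1.0)   # never below 1 MSS
            self.state = "cong_avoid"
            return True
        return False

    def on_timeout(self) -> None:
        self.dup_acks = 0
        self.ssthresh = self.cwnd / 2.0
        self.cwnd = 1.0
        self.state = "slow_start"


sender = RenoLikeSender(cwnd=1.0, ssthresh=4.0)
for _ in range(5):
    sender.on_new_ack()
print(sender.state, round(sender.cwnd, 2))
```

Starting from CWND = 1 with ssthresh = 4, four ACKs take the window to 5 in slow start, and the fifth ACK adds only 1/5 because the sender has switched to congestion avoidance.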

Time Diagram

[Figure: window size versus time t, annotated with a fast retransmission, a timeout, and the point where SSThresh is set.]

• Slow start is in operation until the window reaches half of the previous CWND, i.e., SSThresh
• Slow-start restart: go back to a CWND of 1 MSS, but take advantage of knowing the previous value of CWND

TCP Flavors

• TCP Tahoe
  • CWND = 1 on triple dupACK
• TCP Reno
  • CWND = 1 on timeout
  • CWND = CWND/2 on triple dupACK
• TCP-newReno
  • TCP-Reno + improved fast recovery
• TCP-SACK
  • Incorporates selective acknowledgements

Our default assumption

Done!

Next lecture: Critical Analysis of TCP

TCP Backup slides

Could Base RTO on RTT Estimation

• Use exponential averaging of RTT samples

SampleRTT = AckRcvdTime - SendPktTime
EstimatedRTT = α x EstimatedRTT + (1-α) x SampleRTT
0 < α <= 1

[Figure: SampleRTT and EstimatedRTT plotted against time.]

Exponential Averaging Example

EstimatedRTT = α x EstimatedRTT + (1-α) x SampleRTT
(Assume RTT is constant => SampleRTT = RTT)

[Figure: EstimatedRTT versus time (0 to 9) against the constant RTT, with one curve for α = 0.8 and one for α = 0.5.]
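The effect of α can be seen by running the averaging formula on a constant RTT. The function name `ewma_rtt`, the starting estimate of 0, and the RTT of 100 (in arbitrary time units) are assumptions chosen only to make the convergence visible.

```python
def ewma_rtt(samples, alpha: float, initial: float) -> list:
    """EstimatedRTT after each sample, using the lecture's update rule."""
    estimate = initial
    history = []
    for sample in samples:
        estimate = alpha * estimate + (1 - alpha) * sample
        history.append(estimate)
    return history


constant_rtt = [100.0] * 10
slow = ewma_rtt(constant_rtt, alpha=0.8, initial=0.0)
fast = ewma_rtt(constant_rtt, alpha=0.5, initial=0.0)
print([round(v, 1) for v in slow])
print([round(v, 1) for v in fast])
```

A larger α gives more weight to history, so the estimate with α = 0.8 approaches the true RTT more slowly than the one with α = 0.5.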

Exponential Averaging in Action

• Set Timeout Estimate (ETO) = 2 x EstimatedRTT

[Figure from Jacobson and Karels, SIGCOMM 1988]

Jacobson/Karels Algorithm

• Problem: need to better capture variability in RTT
  • Directly measure deviation
• Deviation = |SampleRTT - EstimatedRTT|
• EstimatedDeviation: exponential average of Deviation
• ETO = EstimatedRTT + 4 x EstimatedDeviation
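The estimator above fits in a few lines. This is a sketch, not the lecture's implementation: the class name `RttEstimator`, the smoothing weights α = 0.875 and β = 0.75, and the initial deviation of 0 are all assumptions; the slides only give the shape of the formulas.

```python
class RttEstimator:
    """Tracks EstimatedRTT and EstimatedDeviation and derives ETO from them."""

    def __init__(self, first_sample: float, alpha: float = 0.875, beta: float = 0.75):
        self.alpha = alpha             # weight on the old RTT estimate (assumed)
        self.beta = beta               # weight on the old deviation estimate (assumed)
        self.est_rtt = first_sample
        self.est_dev = 0.0             # assumed starting point

    def update(self, sample_rtt: float) -> float:
        # Deviation is measured against the estimate before it is updated.
        deviation = abs(sample_rtt - self.est_rtt)
        self.est_dev = self.beta * self.est_dev + (1 - self.beta) * deviation
        self.est_rtt = self.alpha * self.est_rtt + (1 - self.alpha) * sample_rtt
        return self.eto()

    def eto(self) -> float:
        return self.est_rtt + 4 * self.est_dev


estimator = RttEstimator(first_sample=100.0)
print(estimator.update(100.0))   # steady RTT: ETO equals the estimate
print(estimator.update(200.0))   # one spike widens the timeout
```

A single outlier moves EstimatedRTT only slightly but raises ETO substantially through the deviation term, which is the point of measuring variability directly.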

With Jacobson/Karels

Problem: Ambiguous Measurements

• How do we differentiate between the real ACK (for the original transmission) and the ACK of the retransmitted packet?

[Figure: two timelines, each with an original transmission, a retransmission, and an ACK, showing two different ways the sampled RTT could be measured.]

TCP Timers

• Two important quantities
  • RTO: value you set the timer to for timeouts
  • ETO: current estimate of the appropriate “raw” timeout
• Use exponential averaging to estimate
  • RTT
  • Deviation = |EstimatedRTT - SampleRTT|
• ETO = EstimatedRTT + 4 x EstimatedDeviation

Use Only “Clean” Samples for ETO

• Only update ETO when you get a clean sample
  • Where clean means the ACK includes no retransmitted segments

Example

• Send 100, 200, 300
  • 100 means the packet whose first byte is 100 and last byte is 199
• Receive A200
  • A200 means bytes up to 199 received, expecting 200 next
  • Clean sample
• 200 times out, resend 200, receive A300
  • No clean samples
• Send 400, 500, receive A600
  • Clean samples

Setting RTO

• Every time the RTO timer expires, set RTO <- 2 x RTO
  • Up to a maximum >= 60 sec
• Every time a clean sample arrives, set RTO to ETO
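The two RTO rules interact as a back-off that is reset by clean samples. In this sketch the class name `RtoTimer` is hypothetical and the cap of exactly 60 seconds is an assumption; the slide only says the maximum is at least 60 seconds.

```python
class RtoTimer:
    """Holds the current RTO value; does not model the actual clock."""

    MAX_RTO = 60.0   # assumed cap in seconds (the slide requires >= 60)

    def __init__(self, eto: float):
        self.rto = eto

    def on_timer_expired(self) -> None:
        # Exponential back-off: double the timeout, up to the cap.
        self.rto = min(2 * self.rto, self.MAX_RTO)

    def on_ack(self, eto: float, clean: bool) -> None:
        # Only an ACK that covers no retransmitted segment resets RTO.
        if clean:
            self.rto = eto


timer = RtoTimer(eto=1.0)
for _ in range(3):
    timer.on_timer_expired()
print(timer.rto)                    # backed off after three expirations
timer.on_ack(eto=1.5, clean=False)  # ambiguous sample: RTO stays backed off
print(timer.rto)
timer.on_ack(eto=1.5, clean=True)   # clean sample: RTO returns to ETO
print(timer.rto)
```

Keeping the backed-off value until a clean sample arrives avoids resetting the timer from a measurement that might belong to either the original packet or its retransmission.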

Example

• First arriving ACK expects 100 (adv. window = 500)
  • Initialize ETO; RTO = ETO
  • Restart timer for RTO seconds (new data ACKed)
  • Remember, TCP only has one timer, not a timer per packet
• Send packets 100, 200, 300, 400 and 500
• Arriving ACK expects 300 (A300)
  • Update ETO; RTO = ETO
  • Restart timer for RTO seconds (new data ACKed)
• Send packets 600, 700
• Arriving ACK expects 300 (A300)

Example (cont’d)

• Timer goes off
  • RTO = 2*RTO (back off timer)
  • Restart timer for RTO seconds (it had expired)
  • Resend packet 300
• Arriving ACK expects 800
  • Don’t update ETO (ACK includes a retransmission)
  • Restart timer for RTO seconds (new data ACKed)
• Send packets 800, 900, 1000, 1100, 1200

Example (cont’d)

• Arriving ACK expects 1000
  • Update ETO; RTO = ETO
  • Restart timer for RTO seconds (new data ACKed)
• Send packets 1300, 1400
• … Connection continues…

Example

• Consider a TCP connection with:
  • CWND = 10 packets
  • Last ACK was for packet #101
    • i.e., receiver expecting the next packet to have seqno 101
• 10 packets [101, 102, 103, …, 110] are in flight
  • Packet 101 is dropped
  • What ACKs do they generate?
  • And how does the sender respond?

Timeline

• ACK 101 (due to 102) CWND=10 dupACK#1 (no xmit)
• ACK 101 (due to 103) CWND=10 dupACK#2 (no xmit)
• ACK 101 (due to 104) CWND=10 dupACK#3 (no xmit)
• RETRANSMIT 101 ssthresh=5 CWND=5
• ACK 101 (due to 105) CWND=5 (no xmit)
• ACK 101 (due to 106) CWND=5 (no xmit)
• ACK 101 (due to 107) CWND=5 (no xmit)
• ACK 101 (due to 108) CWND=5 (no xmit)
• ACK 101 (due to 109) CWND=5 (no xmit)
• ACK 101 (due to 110) CWND=5 (no xmit)
• ACK 111 (due to 101) <- only now can we transmit new packets
• Plus no packets in flight, so no ACKs for another RTT

Note that you do not restart the dupACK counter on the same packet!

Solution: Fast Recovery

• Idea: grant the sender temporary “credit” for each dupACK so as to keep packets in flight (each ACK is due to an arriving packet)
• If dupACKcount = 3
  • ssthresh = CWND/2
  • CWND = ssthresh + 3
• While in fast recovery
  • CWND = CWND + 1 for each additional duplicate ACK
• Exit fast recovery after receiving a new ACK
  • Set CWND = ssthresh (which had been set to CWND/2 after the loss)

Example

• Consider a TCP connection with:
  • CWND = 10 packets
  • Last ACK was for packet #101
    • i.e., receiver expecting the next packet to have seqno 101
• 10 packets [101, 102, 103, …, 110] are in flight
  • Packet 101 is dropped

Timeline

• ACK 101 (due to 102) CWND=10 dupACK#1 (no xmit)
• ACK 101 (due to 103) CWND=10 dupACK#2 (no xmit)
• ACK 101 (due to 104) CWND=10 dupACK#3 (no xmit)
• RETRANSMIT 101 ssthresh=5 CWND=8 (5+3)
• ACK 101 (due to 105) CWND=9 (no xmit)
• ACK 101 (due to 106) CWND=10 (no xmit)
• ACK 101 (due to 107) CWND=11 (xmit 111)
• ACK 101 (due to 108) CWND=12 (xmit 112)
• ACK 101 (due to 109) CWND=13 (xmit 113)
• ACK 101 (due to 110) CWND=14 (xmit 114)
• ACK 111 (due to 101) CWND=5 (xmit 115) <- exiting fast recovery
  • Packets 111-114 already in flight (and now sending 115)
• ACK 112 (due to 111) CWND = 5 + 1/5 <- back to congestion avoidance
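The CWND values in this timeline can be reproduced by a small simulation of the fast-recovery rules. The function name `fast_recovery_timeline` is hypothetical, and the sketch assumes CWND is a whole number of packets, that only the first packet is lost, and that a new packet may be sent whenever CWND exceeds the number of unacknowledged packets.

```python
def fast_recovery_timeline(cwnd: int = 10, first_seq: int = 101) -> list:
    """Return (arriving packet, CWND afterwards, packets sent) per ACK."""
    window = cwnd
    outstanding = window              # 101..110 sent, none ACKed yet
    next_new = first_seq + window     # first never-sent packet: 111
    ssthresh = None
    dup = 0
    log = []
    # Packets 102..110 arrive; each one triggers a duplicate ACK for 101.
    for pkt in range(first_seq + 1, first_seq + window):
        dup += 1
        sent = []
        if dup == 3:
            ssthresh = cwnd // 2
            cwnd = ssthresh + 3       # enter fast recovery
            sent.append(first_seq)    # fast retransmit of the lost packet
        elif dup > 3:
            cwnd += 1                 # temporary credit per extra dupACK
            while cwnd > outstanding:
                sent.append(next_new)
                next_new += 1
                outstanding += 1
        log.append((pkt, cwnd, sent))
    # The retransmission arrives: the new ACK covers the original window.
    outstanding -= window
    cwnd = ssthresh                   # exit fast recovery
    sent = []
    while cwnd > outstanding:
        sent.append(next_new)
        next_new += 1
        outstanding += 1
    log.append((first_seq, cwnd, sent))
    return log


for entry in fast_recovery_timeline():
    print(entry)
```

The temporary credit keeps four new packets (111 to 114) in flight during recovery, so ACKs keep arriving after the window is deflated back to ssthresh.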

TCP “Phases”

• Slow-start
  • Enter on timeout
  • Leave when CWND > ssthresh (to Cong. Avoid.)
  • The > only applies here…
• Congestion Avoidance
  • Leave when timeout
• Fast recovery
  • Enter when dupACK = 3
  • Leave when New ACK or Timeout

TCP State Machine

[Figure: state machine with three states: slow start, congestion avoidance, and fast recovery. Edge labels: Timeout, CWND > ssthresh, new ACK, dupACK, dupACK=3.]
