Post on 04-Mar-2018

CS 61C: Great Ideas in Computer Architecture (Machine Structures)

Caches Part 2

Instructors: John Wawrzynek & Vladimir Stojanovic
http://inst.eecs.berkeley.edu/~cs61c/

Typical Memory Hierarchy

[Figure: levels of the hierarchy, from the processor outward: On-Chip Components (Control, Datapath, RegFile, Instr Cache, Data Cache), Second-Level Cache (SRAM), Third-Level Cache (SRAM), Main Memory (DRAM), Secondary Memory (Disk or Flash)]

Speed (cycles): ½’s, 1’s, 10’s, 100’s, 1,000,000’s
Size (bytes): 100’s, 10K’s, M’s, G’s, T’s
Cost/bit: highest to lowest

• Principle of locality + memory hierarchy presents programmer with ≈ as much memory as is available in the cheapest technology at the ≈ speed offered by the fastest technology

Review: Adding Cache to Computer

[Figure: Processor (Control; Datapath with PC, Registers, and Arithmetic & Logic Unit (ALU)) connected through a Cache to Memory (Program, Data, Bytes) over the Processor-Memory Interface (Enable?, Read/Write, Address, Write Data, Read Data); Input and Output attach through the I/O-Memory Interfaces]

Memory Block-addressing example

[Figure: the 32 five-bit addresses 00000 through 11111 shown three times, as Byte addresses, Word addresses (2 LSBs are 0), and 8-Byte Block addresses (3 LSBs are 0). The upper two address bits give the Block # (0 to 3) and the low three bits give the Byte offset in block (0 to 7).]

10/20/15

Block number aliasing example

12-bit memory addresses, 16 Byte blocks

Address        Block #   Block # mod 8   Block # mod 2
010100100000   82        2               0
010100110000   83        3               1
010101000000   84        4               0
010101010000   85        5               1
010101100000   86        6               0
010101110000   87        7               1
010110000000   88        0               0
010110010000   89        1               1
010110100000   90        2               0
010110110000   91        3               1
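The aliasing in this example can be reproduced with a few lines of Python. This is only a minimal sketch: the helper name `block_number` and the constant `BLOCK_SIZE` are made up for illustration and do not come from the lecture.

```python
BLOCK_SIZE = 16  # bytes per block, as in the example (assumed constant name)

def block_number(address: int, block_size: int = BLOCK_SIZE) -> int:
    """Return the memory block number that contains a byte address."""
    return address // block_size

# The ten 12-bit addresses from the example, 16 bytes apart.
addresses = [0b010100100000 + i * BLOCK_SIZE for i in range(10)]

for addr in addresses:
    blk = block_number(addr)
    # Blocks with the same (block number mod number of cache blocks) alias.
    print(f"{addr:012b}  block {blk}  mod 8 = {blk % 8}  mod 2 = {blk % 2}")
```

With a modulus of 8, blocks 82 and 90 land on the same value (2); with a modulus of 2, every other block collides.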

Caches Review

• Principle of Locality
  – Temporal Locality and Spatial Locality
• Hierarchy of Memories (speed/size/cost per bit) to Exploit Locality
• Cache – copy of data in lower level of memory hierarchy
• Direct Mapped to find block in cache using Tag field and Valid bit for Hit
• Cache design organization choices:
  – Fully Associative, Set-Associative, Direct-Mapped

Cache Organizations

• “Fully Associative”: Block can go anywhere
  – First design in lecture
  – Note: No Index field, but 1 comparator/block
• “Direct Mapped”: Block goes one place
  – Note: Only 1 comparator
  – Number of sets = number of blocks
• “N-way Set Associative”: N places for a block
  – Number of sets = number of blocks / N
  – N comparators
  – Fully Associative: N = number of blocks
  – Direct Mapped: N = 1

Processor Address Fields used by Cache Controller

• Block Offset: Byte address within block
• Set Index: Selects which set
• Tag: Remaining portion of processor address

• Size of Index = log2(number of sets)
• Size of Tag = Address size – Size of Index – log2(number of bytes/block)

Processor Address (32-bits total): | Tag | Set Index | Block offset |
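As a rough illustration of how a cache controller might carve up an address, the sketch below splits an address into its three fields. The function name `split_address` and its parameters are assumptions made for this example, and it only works when the number of sets and the block size are powers of two.

```python
def split_address(address: int, num_sets: int, block_bytes: int):
    """Split an address into (tag, set index, block offset).

    Assumes num_sets and block_bytes are powers of two.
    """
    offset_bits = block_bytes.bit_length() - 1   # log2(bytes per block)
    index_bits = num_sets.bit_length() - 1       # log2(number of sets)
    offset = address & (block_bytes - 1)                 # lowest bits
    index = (address >> offset_bits) & (num_sets - 1)    # middle bits
    tag = address >> (offset_bits + index_bits)          # remaining upper bits
    return tag, index, offset

# Hypothetical address, 1024 sets, 4-byte blocks: 20-bit tag, 10-bit index.
tag, index, offset = split_address(0x12345678, num_sets=1024, block_bytes=4)
print(hex(tag), hex(index), offset)
```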

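Direct-Mapped Cache Review

• One word blocks, cache size = 1K words (or 4KB)

[Figure: the 32-bit address is split into a 20-bit Tag (bits 31 to 12), a 10-bit Index (bits 11 to 2), and a 2-bit Byte offset (bits 1 to 0). The Index selects one of 1024 entries (0 to 1023), each holding Valid, Tag, and Data. A Comparator checks the stored 20-bit Tag against the address to produce Hit, and the entry supplies 32 bits of Data.]

• Valid bit ensures something useful in cache for this index
• Compare Tag with upper part of Address to see if a Hit
• Read data from cache instead of memory if a Hit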

Four-Way Set-Associative Cache

• 2^8 = 256 sets each with four ways (each with one block)

[Figure: the 32-bit address is split into a 22-bit Tag, an 8-bit Set Index, and a 2-bit Byte offset. The Set Index selects one entry (0 to 255) in each of Way 0, Way 1, Way 2, and Way 3; each entry holds V, Tag, and Data. The tag comparisons produce Hit and drive a 4x1 select that outputs 32 bits of Data.]

Handling Stores with Write-Through

• Store instructions write to memory, changing values
• Need to make sure cache and memory have same values on writes: 2 policies

1) Write-Through Policy: write cache and write through the cache to memory
  – Every write eventually gets to memory
  – Too slow, so include Write Buffer to allow processor to continue once data in Buffer
  – Buffer updates memory in parallel to processor

Write-Through Cache

• Write both values in cache and in memory
• Write buffer stops CPU from stalling if memory cannot keep up
• Write buffer may have multiple entries to absorb bursts of writes
• What if store misses in cache?

[Figure: Processor connected to Cache by 32-bit Address and 32-bit Data; Cache connected to Memory by 32-bit Address and 32-bit Data through a Write Buffer holding Addr/Data entries]

Handling Stores with Write-Back

2) Write-Back Policy: write only to cache and then write cache block back to memory when evict block from cache
  – Writes collected in cache, only single write to memory per block
  – Include bit to see if wrote to block or not, and then only write back if bit is set
    • Called “Dirty” bit (writing makes it “dirty”)

Write-Back Cache

• Store/cache hit, write data in cache only & set dirty bit
  – Memory has stale value
• Store/cache miss, read data from memory, then update and set dirty bit
  – “Write-allocate” policy
• Load/cache hit, use value from cache
• On any miss, write back evicted block, only if dirty. Update cache with new block and clear dirty bit.

[Figure: Processor connected to Cache by 32-bit Address and 32-bit Data; Cache connected to Memory by 32-bit Address and 32-bit Data; each cache entry carries a Dirty Bit (D)]

Write-Through vs. Write-Back

• Write-Through:
  – Simpler control logic
  – More predictable timing simplifies processor control logic
  – Easier to make reliable, since memory always has copy of data (big idea: Redundancy!)
• Write-Back:
  – More complex control logic
  – More variable timing (0, 1, 2 memory accesses per cache access)
  – Usually reduces write traffic
  – Harder to make reliable, sometimes cache has only copy of data
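To make the write-traffic difference concrete, here is a small sketch that counts memory writes for a sequence of stores under both policies. It is a simplification under stated assumptions: a direct-mapped, write-allocate cache, loads ignored, and dirty blocks flushed at the end so the two counts are comparable. The name `memory_writes` and its defaults are hypothetical.

```python
def memory_writes(stores, num_blocks=4, block_bytes=4):
    """Count writes to memory for a list of store byte addresses.

    Returns (write_through_writes, write_back_writes) for a direct-mapped,
    write-allocate cache. The write-back count includes a final flush of
    blocks that are still dirty.
    """
    write_through = len(stores)        # every store also goes to memory
    held = [None] * num_blocks         # block number held by each cache line
    dirty = [False] * num_blocks
    write_back = 0
    for addr in stores:
        block = addr // block_bytes
        line = block % num_blocks
        if held[line] != block:        # miss: evict, then allocate
            if dirty[line]:
                write_back += 1        # write back the evicted dirty block
            held[line] = block
        dirty[line] = True             # the store makes the block dirty
    write_back += sum(dirty)           # flush remaining dirty blocks
    return write_through, write_back

print(memory_writes([0, 0, 0, 0, 4, 4]))   # repeated stores to two words
print(memory_writes([0, 16, 0, 16]))       # two words that share a cache line
```

The second call shows why the slide says "usually": when stores keep evicting each other, write-back saves nothing.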

Administrivia

• Project 3-1 due date Wednesday 10/21.
• Project 3-2 due date now 10/28 (release 10/21)
• Midterm 1:
  – grades posted

Cache (Performance) Terms

• Hit rate: fraction of accesses that hit in the cache
• Miss rate: 1 – Hit rate
• Miss penalty: time to replace a block from lower level in memory hierarchy to cache
• Hit time: time to access cache memory (including tag comparison)
• Abbreviation: “$” = cache (A Berkeley innovation!)

Average Memory Access Time (AMAT)

• Average Memory Access Time (AMAT) is the average time to access memory considering both hits and misses in the cache

AMAT = Time for a hit + Miss rate × Miss penalty

Clickers/Peer Instruction

AMAT = Time for a hit + Miss rate × Miss penalty

A: ≤ 200 psec
B: 400 psec
C: 600 psec

Given a 200 psec clock, a miss penalty of 50 clock cycles, a miss rate of 0.02 misses per instruction and a cache hit time of 1 clock cycle, what is AMAT?
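The formula can be checked by plugging the numbers from the question straight in. The sketch below assumes the stated miss rate can be used directly as the miss rate per access; the function name `amat` and the constant `CLOCK_PS` are illustrative only.

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

CLOCK_PS = 200  # clock period in picoseconds, from the question

# Hit time and miss penalty are in clock cycles, so the result is in cycles.
cycles = amat(hit_time=1, miss_rate=0.02, miss_penalty=50)
print(cycles, "cycles =", cycles * CLOCK_PS, "psec")
```

Under that assumption the average access takes 1 + 0.02 × 50 = 2 cycles, which is 400 psec at a 200 psec clock.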

Example: Direct-Mapped Cache with 4 Single-Word Blocks, Worst-Case Reference String

• Consider the main memory address reference string of word numbers: 0 4 0 4 0 4 0 4

Start with an empty cache - all blocks initially marked as not valid

Reference:  0     4     0     4     0     4     0     4
Result:     miss  miss  miss  miss  miss  miss  miss  miss

[Figure: after each reference, the same cache block alternates between tag 00 holding Mem(0) and tag 01 holding Mem(4)]

• 8 requests, 8 misses
• Ping-pong effect due to conflict misses - two memory locations that map into the same cache block
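The ping-pong behaviour of this reference string can be replayed with a tiny simulation. This is a sketch under the slide's assumptions (four one-word blocks, initially invalid); the function name `direct_mapped_trace` is made up, and each line stores the whole word number rather than a separate tag, which gives the same hit/miss outcome.

```python
def direct_mapped_trace(word_addresses, num_blocks=4):
    """Return 'hit' or 'miss' per access for a direct-mapped cache of one-word blocks."""
    lines = [None] * num_blocks        # None plays the role of a clear valid bit
    results = []
    for word in word_addresses:
        index = word % num_blocks      # low-order bits pick the cache block
        if lines[index] == word:
            results.append("hit")
        else:
            results.append("miss")
            lines[index] = word        # replace whatever was there
    return results

trace = direct_mapped_trace([0, 4, 0, 4, 0, 4, 0, 4])
print(trace, trace.count("miss"), "misses")
```

Words 0 and 4 both map to index 0 (4 mod 4 = 0), so each access evicts the word the next access needs.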

Alternative Block Placement Schemes

• DM placement: mem block 12 in 8 block cache: only one cache block where mem block 12 can be found: (12 modulo 8) = 4
• SA placement: four sets x 2-ways (8 cache blocks), memory block 12 in set (12 mod 4) = 0; either element of the set
• FA placement: mem block 12 can appear in any cache block
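The three placement rules are the same calculation with a different number of ways. The sketch below lists the cache slots where a memory block may live; the function name `candidate_blocks` and the slot numbering (set × ways + way) are assumptions chosen for this illustration.

```python
def candidate_blocks(mem_block, num_cache_blocks, ways):
    """Return the cache slots where a memory block may be placed.

    Slots are numbered set * ways + way (a numbering assumed for this sketch).
    """
    num_sets = num_cache_blocks // ways
    set_index = mem_block % num_sets            # which set the block maps to
    return [set_index * ways + way for way in range(ways)]

print(candidate_blocks(12, 8, ways=1))   # direct mapped: one slot
print(candidate_blocks(12, 8, ways=2))   # 2-way set associative: set 0
print(candidate_blocks(12, 8, ways=8))   # fully associative: any slot
```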

Example: 2-Way Set Associative $ (4 words = 2 sets x 2 ways per set)

[Figure: a Cache with Sets 0 and 1, each with Ways 0 and 1 holding V, Tag, and Data, beside a Main Memory of 16 one-word blocks at addresses 0000xx through 1111xx]

• One word blocks. Two low order bits define the byte in the word (32b words)

Q: How do we find it?
Use next 1 low order memory address bit to determine which cache set (i.e., modulo the number of sets in the cache)

Q: Is it there?
Compare all the cache tags in the set to the high order 3 memory address bits to tell if the memory block is in the cache

Example: 4-Word 2-Way SA $ Same Reference String

• Consider the main memory word reference string 0 4 0 4 0 4 0 4

Start with an empty cache - all blocks initially marked as not valid

Reference:  0     4     0     4
Result:     miss  miss  hit   hit    (first four references shown)

[Figure: after the first two references, one way of the set holds tag 000 with Mem(0) and the other holds tag 010 with Mem(4)]

• 8 requests, 2 misses
• Solves the ping-pong effect in a direct-mapped cache due to conflict misses since now two memory locations that map into the same cache set can co-exist!
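The same reference string can be replayed against a set-associative cache. The sketch below assumes LRU replacement within each set and one-word blocks; the function name `set_associative_trace` is hypothetical.

```python
def set_associative_trace(word_addresses, num_sets=2, ways=2):
    """Return 'hit' or 'miss' per access for an LRU set-associative cache."""
    sets = [[] for _ in range(num_sets)]   # each list is ordered LRU -> MRU
    results = []
    for word in word_addresses:
        blocks = sets[word % num_sets]     # low-order bit(s) pick the set
        if word in blocks:
            results.append("hit")
            blocks.remove(word)            # re-appended below as most recent
        else:
            results.append("miss")
            if len(blocks) == ways:
                blocks.pop(0)              # evict the least recently used block
        blocks.append(word)
    return results

sa_trace = set_associative_trace([0, 4, 0, 4, 0, 4, 0, 4])
print(sa_trace, sa_trace.count("miss"), "misses")
```

Words 0 and 4 still map to the same set, but the two ways let them coexist, so only the first two accesses miss.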


Different Organizations of an Eight-Block Cache

Total size of $ in blocks is equal to number of sets × associativity. For fixed $ size and fixed block size, increasing associativity decreases number of sets while increasing number of elements per set. With eight blocks, an 8-way set-associative $ is same as a fully associative $.

Range of Set-Associative Caches

• For a fixed-size cache and fixed block size, each increase by a factor of two in associativity doubles the number of blocks per set (i.e., the number of ways) and halves the number of sets: it decreases the size of the index by 1 bit and increases the size of the tag by 1 bit

Address fields: | Tag | Index | Word offset | Byte offset |
  – Tag: used for tag compare
  – Index: selects the set
  – Word offset: selects the word in the block

• Decreasing associativity: Direct mapped (only one way). Smaller tags, only a single comparator
• Increasing associativity: Fully associative (only one set). Tag is all the bits except block and byte offset

Total Cache Capacity = Associativity × # of sets × block_size
Bytes = blocks/set × sets × Bytes/block

C = N × S × B

Address fields: | Tag | Index | Byte Offset |

address_size = tag_size + index_size + offset_size
             = tag_size + log2(S) + log2(B)
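The two relations above are enough to derive every field width from the capacity, associativity, and block size. The following sketch assumes power-of-two sizes and a 32-bit address by default; the function name `field_widths` is invented for this example.

```python
def field_widths(capacity_bytes, ways, block_bytes, address_bits=32):
    """Return (tag_bits, index_bits, offset_bits) using C = N * S * B."""
    num_sets = capacity_bytes // (ways * block_bytes)   # S = C / (N * B)
    index_bits = num_sets.bit_length() - 1              # log2(S)
    offset_bits = block_bytes.bit_length() - 1          # log2(B)
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# A 4 KB cache with one-word (4-byte) blocks at increasing associativity.
for n in (1, 2, 4):
    print(n, "way:", field_widths(4096, ways=n, block_bytes=4))
```

Each doubling of the number of ways moves one bit from the index to the tag, which matches the 20-bit/10-bit and 22-bit/8-bit splits in the direct-mapped and four-way examples.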

Clickers/Peer Instruction

• For a cache with constant total capacity, if we increase the number of ways by a factor of 2, which statement is false:
  A: The number of sets could be doubled
  B: The tag width could decrease
  C: The block size could stay the same
  D: The block size could be halved
  E: Tag width must increase

Total Cache Capacity = Associativity × # of sets × block_size

C = N × S × B

address_size = tag_size + index_size + offset_size = tag_size + log2(S) + log2(B)

Clicker Question: C remains constant, S and/or B can change such that C = 2N × (SB)’ => (SB)’ = SB/2

Before: tag_size = address_size – (log2(S) + log2(B)) = address_size – log2(SB)
After:  tag_size’ = address_size – log2((SB)’) = address_size – log2(SB/2) = address_size – (log2(SB) – 1) = tag_size + 1

Costs of Set-Associative Caches

• N-way set-associative cache costs
  – N comparators (delay and area)
  – MUX delay (set selection) before data is available
  – Data available after set selection (and Hit/Miss decision). DM $: block is available before the Hit/Miss decision
    • In Set-Associative, not possible to just assume a hit and continue and recover later if it was a miss
• When miss occurs, which way’s block selected for replacement?
  – Least Recently Used (LRU): one that has been unused the longest (principle of temporal locality)
    • Must track when each way’s block was used relative to other blocks in the set
    • For 2-way SA $, one bit per set → set to 1 when a block is referenced; reset the other way’s bit (i.e., “last used”)

Cache Replacement Policies

• Random Replacement
  – Hardware randomly selects a cache entry to evict
• Least-Recently Used
  – Hardware keeps track of access history
  – Replace the entry that has not been used for the longest time
  – For 2-way set-associative cache, need one bit for LRU replacement
• Example of a Simple “Pseudo” LRU Implementation
  – Assume 64 Fully Associative entries
  – Hardware replacement pointer points to one cache entry
  – Whenever access is made to the entry the pointer points to:
    • Move the pointer to the next entry
  – Otherwise: do not move the pointer
  – (example of “not-most-recently used” replacement policy)

[Figure: Entry 0, Entry 1, ..., Entry 63, with the Replacement Pointer pointing at one entry]
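The replacement-pointer scheme can be sketched in a few lines. The slide only states when the pointer moves; the choices here that a miss fills the entry under the pointer, and that this fill counts as an access to it, are assumptions of this sketch, as is the class name `NotMRUCache`.

```python
class NotMRUCache:
    """Fully associative cache with a single replacement pointer."""

    def __init__(self, num_entries=64):
        self.entries = [None] * num_entries
        self.pointer = 0                   # entry that would be replaced next

    def access(self, block):
        """Access a block; return True on a hit, False on a miss."""
        if block in self.entries:
            slot = self.entries.index(block)
            hit = True
        else:
            slot = self.pointer            # assumed: the victim is under the pointer
            self.entries[slot] = block
            hit = False
        if slot == self.pointer:           # the pointed-to entry was just used
            self.pointer = (self.pointer + 1) % len(self.entries)
        return hit

cache = NotMRUCache(num_entries=2)
outcomes = [cache.access(b) for b in (1, 2, 2, 3)]
print(outcomes, cache.entries, cache.pointer)
```

Because the pointer steps past any entry that was just touched, the most recently used entry is never the next victim, which is all this policy guarantees.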

Benefits of Set-Associative Caches

• Choice of DM $ versus SA $ depends on the cost of a miss versus the cost of implementation
• Largest gains are in going from direct mapped to 2-way (20%+ reduction in miss rate)

Understanding Cache Misses: The 3Cs

• Compulsory (cold start or process migration, 1st reference):
  – First access to block impossible to avoid; small effect for long running programs
  – Solution: increase block size (increases miss penalty; very large blocks could increase miss rate)
• Capacity:
  – Cache cannot contain all blocks accessed by the program
  – Solution: increase cache size (may increase access time)
• Conflict (collision):
  – Multiple memory locations mapped to the same cache location
  – Solution 1: increase cache size
  – Solution 2: increase associativity (may increase access time)

How to Calculate 3C’s using Cache Simulator

1. Compulsory: set cache size to infinity and fully associative, and count number of misses
2. Capacity: Change cache size from infinity, usually in powers of 2, and count misses for each reduction in size
  – 16 MB, 8 MB, 4 MB, … 128 KB, 64 KB, 16 KB
3. Conflict: Change from fully associative to n-way set associative while counting misses
  – Fully associative, 16-way, 8-way, 4-way, 2-way, 1-way
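The three-step procedure can be mimicked with a toy simulator. This is a sketch, not the simulator used in the course: it assumes LRU replacement, works on block numbers rather than byte addresses, and the names `lru_misses` and `three_cs` are invented for the example.

```python
def lru_misses(blocks, num_sets, ways):
    """Count misses for an LRU cache; ways=None means unlimited capacity."""
    sets = [[] for _ in range(num_sets)]   # each list is ordered LRU -> MRU
    misses = 0
    for block in blocks:
        resident = sets[block % num_sets]
        if block in resident:
            resident.remove(block)
        else:
            misses += 1
            if ways is not None and len(resident) == ways:
                resident.pop(0)            # evict the least recently used block
        resident.append(block)
    return misses

def three_cs(blocks, num_blocks, ways):
    """Split the misses of a cache into compulsory, capacity, and conflict."""
    compulsory = lru_misses(blocks, 1, None)            # step 1: infinite, fully associative
    fully_assoc = lru_misses(blocks, 1, num_blocks)     # step 2: real size, fully associative
    actual = lru_misses(blocks, num_blocks // ways, ways)  # step 3: real associativity
    return {"compulsory": compulsory,
            "capacity": fully_assoc - compulsory,
            "conflict": actual - fully_assoc}

print(three_cs([0, 4, 0, 4, 0, 4, 0, 4], num_blocks=4, ways=1))
```

For the 0 4 0 4 reference string on a four-block direct-mapped cache, all misses beyond the two first references come out as conflict misses.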

3Cs Analysis

• Three sources of misses (SPEC2000 integer and floating-point benchmarks)
  – Compulsory misses 0.006%; not visible
  – Capacity misses, function of cache size
  – Conflict portion depends on associativity and cache size

Improving Cache Performance

AMAT = Time for a hit + Miss rate × Miss penalty

• Reduce the time to hit in the cache
  – E.g., Smaller cache
• Reduce the miss rate
  – E.g., Bigger cache
• Reduce the miss penalty
  – E.g., Use multiple cache levels

Impact of Larger Cache on AMAT?

• 1) Reduces misses (what kind(s)?)
• 2) Longer Access time (Hit time): smaller is faster
  – Increase in hit time will likely add another stage to the pipeline
• At some point, increase in hit time for a larger cache may overcome the improvement in hit rate, yielding a decrease in performance
• Computer architects expend considerable effort optimizing organization of cache hierarchy – big impact on performance and power!

Clickers: Impact of longer cache blocks on misses?

• For fixed total cache capacity and associativity, what is effect of longer blocks on each type of miss:
  – A: Decrease, B: Unchanged, C: Increase
• Compulsory?
• Capacity?
• Conflict?

Clickers: Impact of longer blocks on AMAT

• For fixed total cache capacity and associativity, what is effect of longer blocks on each component of AMAT:
  – A: Decrease, B: Unchanged, C: Increase
• Hit Time?
• Miss Rate?
• Miss Penalty?

Clickers/Peer Instruction: For fixed capacity and fixed block size, how does increasing associativity affect AMAT?

Cache Design Space

• Several interacting dimensions
  – Cache size
  – Block size
  – Associativity
  – Replacement policy
  – Write-through vs. write-back
  – Write allocation
• Optimal choice is a compromise
  – Depends on access characteristics
    • Workload
    • Use (I-cache, D-cache)
  – Depends on technology / cost
• Simplicity often wins

[Figure: design space with axes Cache Size, Associativity, and Block Size; a second plot runs from Bad to Good against Less to More for Factor A and Factor B]

And, In Conclusion …

• Name of the Game: Reduce AMAT
  – Reduce Hit Time
  – Reduce Miss Rate
  – Reduce Miss Penalty
• Balance cache parameters (Capacity, associativity, block size)