CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips
Ciprian Seiculescu, Stavros Volos, Naser Khosro Pour, Babak Falsafi, Giovanni De Micheli
LSI (Integrated Systems Laboratory)

NoCs: Major Power Consumer
Move towards manycore: tiled architectures
Network-on-Chip (NoC) is a significant power consumer: 40% in MIT RAW, 30% in Intel Tera-scale
Cache-coherent CMP, server workloads
[Figure: tiled CMP with cores, caches (C$), and a crossbar]

Proposals to Reduce NoC Power
Multiple networks: better area and power [Balfour & Dally ICS 2006]
Commercial server workloads: traffic patterns are different
Run on cache-coherent CMPs: strong relation between coherence protocol and NoC
Not optimized for commercial server workload traffic

Contributions
Commercial server workloads: optimized for reuse in L1, little sharing
Full-blown coherence protocol in CMPs: only some transitions are frequent
Duality in request/response message size
CCNoC: full advantage of heterogeneity
Same number of buffers
16% less power, same performance as Mesh

Outline
Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions

Dual Router is More Efficient
Dual router: two crossbars per routing node
Wires are less expensive on-chip: use more wires for better performance
Area and power grow faster than connectivity [Balfour & Dally ICS 2006]
Dual router: better performance, power and area
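The point that area and power grow faster than connectivity can be illustrated with a first-order model. The sketch below assumes, rather than takes from the slides, that crossbar area scales with the square of (ports x channel width). The function name, the 5-port router, and the 128-bit width are hypothetical values chosen only for illustration.

```python
def crossbar_area(ports: int, width_bits: int) -> int:
    """First-order crossbar area in arbitrary units.

    Assumption (not from the slides): area scales as (ports * width)^2.
    """
    return (ports * width_bits) ** 2


PORTS = 5  # hypothetical mesh router: four neighbours plus a local port
N = 128    # hypothetical total channel width in bits

# One N-bit-wide crossbar versus two N/2-bit-wide crossbars.
single = crossbar_area(PORTS, N)
dual = 2 * crossbar_area(PORTS, N // 2)
ratio = dual / single

print(f"single: {single}, dual: {dual}, dual/single: {ratio}")
```

Under this assumed quadratic model, splitting one wide crossbar into two half-width crossbars halves the total crossbar area while keeping the same aggregate wire count.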
[Figure: router widths labeled N bit wide, N/2 bit wide, N/2 bit wide]

Right Dual Router Design
Avoid protocol-level deadlock: separate requests and responses
Use virtual channels
CCNoC sub-networks: request / response
No VCs needed
Same number of buffers
Buffers are power hungry [H.S. Wang & L.S. Peh, MICRO 2003]

Protocol Activity
CMPs implement a full-blown coherence protocol
Some transitions are frequent [Hardavellas ISCA 2009]: read clean block, evict clean block, write to unshared block
Other transitions needed for correctness (infrequent): read dirty block, evict dirty, write to shared block

Frequent Read Protocol Activity
[Figure: message flow among Reader, Directory, and Writer: Read Req, Read Resp, Evict Clean Req. Legend: Short Req, Short Resp, Long Resp]

Frequent Write Protocol Activity
[Figure: message flow between Writer and Directory: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Read Protocol Activity
[Figure: message flow among Reader, Directory, and Writer: Read Req, Read Resp, Downgrade Req, Downgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Write Protocol Activity
[Figure: message flow among Writer, Directory, Reader 1, and Reader 2: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, two Inv Req, two Inv Resp, Evict Dirty Req. Legend: Short Req, Short Resp, Long Resp]

Traffic Analysis
Request: 93% short
Response: 86% long

CCNoC Router
Request network is narrow: optimized for short messages
Response network is wide: optimized for long messages
[Figure: router with Request Switch, Response Switch, and NI]

Previous Work
Balfour et al. ICS 2006: better than single large router; read/write traffic; same number of reads and writes
Yoon et al. DAC 2010: physical channel better than virtual channel
Not optimized for cache-coherent CMPs running commercial server workloads

Outline
Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions

Evaluation Methodology
FLEXUS: full-system simulation
16 or 8 UltraSPARC III ISA cores
Split I/D, 64KB L1
1 or 2 MB L2
ORION 2.0: power estimation, area estimation
Workloads
OLTP: TPC-C on IBM DB2 and Oracle
DSS: TPC-H on IBM DB2 (Q1, Q6, Q13, Q16)
Web: SPECweb99 on Apache and Zeus
Scientific: EM3D
Multiprogrammed: SPEC2K 2x: gcc, twolf, art, mcf

Evaluation NoCs
Mesh-128 (baseline): 128-bit flit width
Torus (reference): 128-bit flit width
Mesh-176 (high performance): 176-bit flit width
CCNoC: request 48-bit flit width, response 128-bit flit width
Switches: wormhole flow control, input queued
Transmission protocol: On/Off
Input buffers: 2 entries
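The flit widths above, combined with the traffic split from the Traffic Analysis slide (93% of requests short, 86% of responses long), suggest why a narrow request network costs little. The sketch below is a rough illustration, not the authors' methodology: the message sizes (a 48-bit short message, and a long message made of a 48-bit header plus a 64-byte cache block) and the function names are assumptions, since the slides do not state them.

```python
from math import ceil

# Flit widths in bits, as listed on the "Evaluation NoCs" slide.
FLIT_WIDTH = {
    "Mesh-128": 128,
    "Mesh-176": 176,
    "CCNoC request": 48,
    "CCNoC response": 128,
}

# Assumed message sizes in bits (NOT given in the slides).
SHORT_BITS = 48            # hypothetical control message
LONG_BITS = 48 + 64 * 8    # hypothetical header plus a 64-byte block


def flits(message_bits: int, flit_bits: int) -> int:
    """Number of flits needed to carry one message."""
    return ceil(message_bits / flit_bits)


def mean_flits(flit_bits: int, short_fraction: float) -> float:
    """Average flits per message for a given short/long traffic mix."""
    return (short_fraction * flits(SHORT_BITS, flit_bits)
            + (1 - short_fraction) * flits(LONG_BITS, flit_bits))


# Traffic mix from the slides: 93% of requests are short, 86% of responses are long.
request_mean = mean_flits(FLIT_WIDTH["CCNoC request"], 0.93)
response_mean = mean_flits(FLIT_WIDTH["CCNoC response"], 1 - 0.86)

for name, width in FLIT_WIDTH.items():
    print(name, "short:", flits(SHORT_BITS, width), "long:", flits(LONG_BITS, width))
print("mean flits per request:", round(request_mean, 2))
print("mean flits per response:", round(response_mean, 2))
```

Under these assumed sizes, a long message on the 48-bit request network needs many more flits than on a 128-bit network, but because long requests are rare the average request still takes fewer than two flits.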
Performance
Performance loss: 2% against Torus, 8% against Mesh-176

Power Savings
Power savings: 16% against Mesh-128, 22% against Torus, 38% against Mesh-176

Conclusions
Duality in request/response traffic
Request: dominated by short messages
Response: dominated by long messages
Proposed CCNoC: narrow request network, wide response network
Showed significant power savings: 22% against Torus, 38% against Mesh-176

Thank you! Q&A