Post on 03-Feb-2016
Design Issues for Memory Architecture in Embedded Systems
Chia-Lin Yang
Dept. of Computer Science and Information Engineering
National Taiwan University
Memory Hierarchy in an Embedded System
[Figure: block diagram. Labels: on-chip interconnection network; CPU with I-cache and D-cache; DSP with local memory or cache; IP cores with private memory and DMA; shared memory; memory controller and DRAM memory banks; flash-based storage with a flash controller and multi-banked flash memory chips.]
On-Chip Memory Design Issue
- Memory/data allocation
  - Decide the configuration of the on-chip memory architecture
  - Which data are allocated to on-chip memories?
    - Consider data access frequency and lifetime
  - How much on-chip memory is required? How many DMAs are sufficient?
- Private or shared memory?
  - Private memory: lower contention, lower storage efficiency due to duplicated copies
  - Shared memory: higher contention, higher storage efficiency
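The allocation question above can be illustrated with a small sketch. It is not the method from the slides: it assumes a simple greedy heuristic that ranks data objects by profiled accesses per byte, and it ignores lifetime entirely. The `DataObject` type, the object names, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataObject:
    name: str
    size_bytes: int
    accesses: int  # profiled access count (hypothetical numbers)


def allocate_on_chip(objects, capacity_bytes):
    """Greedy heuristic: place objects with the highest accesses-per-byte first."""
    chosen = []
    free = capacity_bytes
    ranked = sorted(objects, key=lambda o: o.accesses / o.size_bytes, reverse=True)
    for obj in ranked:
        if obj.size_bytes <= free:
            chosen.append(obj.name)
            free -= obj.size_bytes
    return chosen, free


objects = [
    DataObject("coeff_table", 1024, 50_000),
    DataObject("frame_buffer", 16384, 40_000),
    DataObject("scratch", 2048, 30_000),
    DataObject("lookup", 512, 1_000),
]
chosen, free = allocate_on_chip(objects, capacity_bytes=4096)
print(chosen, free)
```

A lifetime-aware allocator could let two objects with disjoint lifetimes share the same on-chip region, which this sketch does not attempt.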
On-Chip Memory Issue (contd)
- Low-leakage memory
  - State-destroying technique
  - State-preserving technique
- When to switch a memory line into a low-leakage mode
  - Hardware-controlled cache
    - Periodically turn off a cache line, or
    - Turn off a cache line when it has not been accessed for a period of time
  - Software-managed addressable memory
    - Data lifetime analysis at compile time
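The hardware-controlled option "turn off a line when it has not been accessed for a period of time" can be sketched as a per-line idle counter. This is a minimal model, assuming a state-preserving (drowsy) mode in which a wake-up costs extra latency but loses no data. The class name `DecayCache` and the parameter `decay_interval` are made up for illustration.

```python
class DecayCache:
    """Per-line idle counters; a line idle for `decay_interval` ticks goes drowsy."""

    def __init__(self, num_lines, decay_interval):
        self.decay_interval = decay_interval
        self.idle = [0] * num_lines
        self.drowsy = [False] * num_lines
        self.wakeups = 0

    def access(self, line):
        # Accessing a drowsy line first restores full voltage (a wake-up).
        if self.drowsy[line]:
            self.drowsy[line] = False
            self.wakeups += 1
        self.idle[line] = 0

    def tick(self):
        # Advance time by one interval and put long-idle lines into drowsy mode.
        for i in range(len(self.idle)):
            if not self.drowsy[i]:
                self.idle[i] += 1
                if self.idle[i] >= self.decay_interval:
                    self.drowsy[i] = True


cache = DecayCache(num_lines=4, decay_interval=3)
for line in [0, 1, 0, 0, 0, 1]:
    cache.access(line)
    cache.tick()
print(cache.drowsy, cache.wakeups)
```

In this trace, lines 2 and 3 are never touched and end up drowsy, while line 1 goes drowsy once and is woken by its second access.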
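[Figure: drowsy cache line circuit. Labels: row decoder, word line driver, word line gate, word line, drowsy bit with drowsy (set) and wake-up (reset) inputs, drowsy signal, voltage controller switching the power line between 1 V and 0.3 V, SRAM cells.]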
DRAM Management
[Figure: the memory hierarchy block diagram, repeated.]
Memory Controller Design Issue
- Challenges in memory controller design for MPSoC
  - Concurrent main memory accesses with different access patterns
    - Multiple streams, random accesses, etc.
- Limitations of a conventional DRAM controller
  - Unaware of DRAM status
  - Lack of control over bandwidth allocation among different PEs
  - Significant access latencies due to fair scheduling policies
[Figure: memory controller containing request buffers and a memory access scheduler.]
Smart Memory Controller Design
- Stream prefetching
  - Identify streams at run time or compile time, and perform stream prefetching
- Address pre-computation
  - Multimedia processing units usually have regular address patterns
  - 1-D linear address generator: audio codec
  - 2-D block-based address generator: MPEG-2 motion compensation, DCT
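The two address generators named above can be sketched in a few lines. The function names, the row-major frame layout, and the one-address-per-pixel assumption are illustrative choices, not details taken from the slides.

```python
def linear_addresses(base, count, stride=1):
    """1-D linear generator: base, base + stride, base + 2 * stride, ..."""
    return [base + i * stride for i in range(count)]


def block_addresses(base, frame_width, x, y, block_w, block_h):
    """2-D block generator: row-major addresses of a block_w x block_h block
    whose top-left element is (x, y) in a frame with `frame_width` elements per row."""
    return [
        base + (y + r) * frame_width + (x + c)
        for r in range(block_h)
        for c in range(block_w)
    ]


audio = linear_addresses(base=0x100, count=4, stride=2)
block = block_addresses(base=0, frame_width=16, x=2, y=1, block_w=2, block_h=2)
print(audio, block)
```

Because both sequences are fully determined by a handful of parameters, a controller that knows them can compute upcoming addresses ahead of the requests.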
Smart Memory Controller Design (contd)
- Row buffer management
  - Close page policy
    - Precharge as soon as possible
    - Good for random accesses
  - Open page policy
    - Precharge as late as possible
    - Good for accesses with high locality
  - Close page or open page policy?
    - Different access patterns within and among tasks => dynamic row buffer management
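The trade-off between the two policies can be made concrete with a toy single-bank latency model. It assumes precharge and activation take 3 cycles each and a column access takes 1 cycle, and that under the close page policy the precharge is completely hidden in idle time. These are simplifications chosen for illustration, not a timing-accurate DRAM model.

```python
T_PRE, T_ACT, T_COL = 3, 3, 1  # illustrative cycle counts


def access_latency(rows, policy):
    """Total critical-path cycles for a sequence of row addresses on one bank."""
    open_row = None
    total = 0
    for row in rows:
        if policy == "open":
            if open_row == row:
                total += T_COL                  # row buffer hit
            elif open_row is None:
                total += T_ACT + T_COL          # bank already precharged
            else:
                total += T_PRE + T_ACT + T_COL  # row conflict
            open_row = row
        elif policy == "close":
            # Row is precharged right after each access (assumed hidden),
            # so every access pays activation plus column access.
            total += T_ACT + T_COL
        else:
            raise ValueError(f"unknown policy: {policy}")
    return total


local = [5, 5, 5, 5]      # high row locality
scattered = [1, 2, 3, 4]  # no row locality
for name, trace in [("local", local), ("scattered", scattered)]:
    print(name, access_latency(trace, "open"), access_latency(trace, "close"))
```

Under these assumptions the open page policy wins on the local trace and loses on the scattered one, which is the motivation for switching policy dynamically.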
Smart Memory Controller Design (contd)
- Memory access scheduling
  - Schedule accesses to different banks at the same time
    - Provides high utilization of DRAM bandwidth
  - Schedule accesses according to the state of the DRAM
- DRAM operations
  - P: bank precharge (3-cycle occupancy)
  - A: row activation (3-cycle occupancy)
  - C: column access (1-cycle occupancy)
[Figure: timing diagram over DRAM cycles 1 to 17; with access scheduling the sequence completes in 17 DRAM cycles.]
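One way to "schedule according to the state of the DRAM" is to serve requests that hit an already open row before older requests that would force a new activation. The sketch below assumes that row-hit-first rule with no starvation control and counts row activations only. It does not model bank-level overlap or reproduce the 17-cycle example from the figure, and the request list is hypothetical.

```python
def count_activations(requests, reorder):
    """requests: list of (bank, row) in arrival order.
    Returns how many row activations are needed to serve all of them."""
    open_rows = {}           # bank -> currently open row
    pending = list(requests)
    activations = 0
    while pending:
        idx = 0              # default: oldest request
        if reorder:
            # Prefer the oldest request that hits an open row.
            for i, (bank, row) in enumerate(pending):
                if open_rows.get(bank) == row:
                    idx = i
                    break
        bank, row = pending.pop(idx)
        if open_rows.get(bank) != row:
            activations += 1
            open_rows[bank] = row
    return activations


requests = [(0, 1), (0, 2), (0, 1), (1, 7), (0, 2), (1, 7)]
in_order = count_activations(requests, reorder=False)
scheduled = count_activations(requests, reorder=True)
print(in_order, scheduled)
```

Each activation avoided also avoids the precharge that would precede it, so fewer activations means both shorter service time and less row access energy.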
Smart Memory Controller Design (contd)
- Task-aware scheduling policy
  - Schedule the accesses of a task together
[Figure: timelines for task1 and task2 showing memory access, computation, and interference, comparing memory contention with task-aware scheduling.]
Power Management in DRAMs
- DDR/DDR2 power management
  - Four different power modes
    - Active standby, active powerdown, precharge standby, precharge powerdown
  - State transition events
    - CKE: clock enable signal
      - CKE must be high to serve requests
    - Sense amplifiers with or without data

DDR power states (power is normalized to the active state):

  State                  Power
  Active                 1.0x mW
  Active standby         0.2x mW
  Active powerdown       0.11x mW
  Precharge standby      0.18x mW
  Precharge powerdown    0.04x mW

Driving CKE low moves a standby state to the corresponding powerdown state, and driving CKE high moves it back. The active states keep data in the sense amplifiers; the precharge states do not.
DRAM Power Reduction Strategy
- Open page vs. close page
  - Row buffer hit
    - Open page is more energy-efficient than close page
    - Reduces row access power
  - Row buffer miss
    - Open page is less energy-efficient than close page
    - Energy is wasted by staying in active standby mode until the next row access
- Close page or open page policy?
  - Different access patterns within and among tasks => dynamic row buffer management
[Figure: DDR power-state diagram (active, active standby, active powerdown, precharge standby, precharge powerdown).]
DRAM Power Reduction Strategy (contd)
- Increasing the idle period
  - Schedule the operations to active banks first
  - Request batching
    - Cluster the requests in the memory controller
  - Memory access pattern reshaping
    - Compiler approach: array interleaving (fine-grain power-aware data allocation)
    - Run-time approach: popularity layout
[Figure: temporal concentration of accesses to A[0], A[1], A[2] and B[0], B[1], B[2], lengthening the idle time.]
Sample code: for(i=0;i
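The effect of array interleaving on bank activity can be sketched as follows. The example assumes a loop that reads A[i] and B[i] in the same iteration, a hypothetical bank size of 8 elements, and two layouts: the arrays stored back to back, and the arrays interleaved element by element. None of these parameters come from the slides.

```python
BANK_SIZE = 8  # elements per bank (hypothetical)


def bank_of(offset):
    return offset // BANK_SIZE


def separate(name, i, n):
    """A occupies offsets [0, n), B occupies [n, 2n)."""
    return i if name == "A" else n + i


def interleaved(name, i, n):
    """A[i] is stored at offset 2i and B[i] at offset 2i + 1."""
    return 2 * i + (0 if name == "A" else 1)


def banks_per_iteration(layout, n):
    """Set of banks touched by each iteration of a loop reading A[i] and B[i]."""
    return [
        {bank_of(layout("A", i, n)), bank_of(layout("B", i, n))}
        for i in range(n)
    ]


n = 8
sep = banks_per_iteration(separate, n)
inter = banks_per_iteration(interleaved, n)
print(sep)
print(inter)
```

With separate arrays every iteration touches two banks, so neither bank gets a long idle period. With interleaving the first half of the loop uses only bank 0 and the second half only bank 1, so each bank can stay in a low-power mode for half of the loop.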
NTU-CoSim for Memory System Evaluation
[Figure: simulation methodology. Labels: application, ISS, micro-architectural modeling, RTL/behavior blocks, DRAM controller, DRAM-Sim, each attached through a SystemC wrapper.]
Back-up Slides
Shared/Private Memory Allocation
- Customize the on-chip memory configuration by capturing the privately accessed and shared data across processors
NTU-CoSim for Memory System Evaluation
[Figure: NTU-CoSim flow. Labels: architecture specification; HW/SW partition (SW in C; parameterized IP (f, w) at RTL level or Timed-Fun level); HW/SW co-simulation; power/performance monitor tool reporting the power/performance breakdown, I/D cache miss rate, component activation, and bus utilization/contention.]
- System configuration
  1. CPU frequency, voltage
  2. Cache architecture (associativity, line size, capacity)
  3. Interconnection
  4. SDRAM configuration (RAS, CAS, etc.)
- Features of NTU-CoSim
  - Cycle-accurate power/performance model
  - Detailed SDRAM model
  - Tunable simulation platform for easy design space exploration
Target Architecture
[Figure: target architecture. Labels: embedded processor 1, embedded processor 2, DMA, SRAM, on-chip bus, off-chip bus interface, off-chip bus, DDR/SDRAM memory controller, SDRAM, I/O 1, I/O 2.]
Power Management Challenge
- State transition overhead
  - Precharge powerdown -> active standby ~= 5 ns
  - The overhead is relatively small, but not negligible
    - Read ~= 75 ns, write ~= 85 ns
- Break-even time
  - Resynchronization overhead
[Figure: power vs. time plot across the DDR power states (active, active standby, active powerdown, precharge standby, precharge powerdown), contrasting idle power with low power and marking the 5 ns transition.]
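A break-even time can be estimated with a simple energy balance. The sketch assumes a common textbook model rather than the formula from the slide, which did not survive extraction: staying idle for T costs p_idle * T, while powering down costs p_low * (T - t_trans) plus p_trans * t_trans for the resynchronization. The value used for `p_trans` (power drawn while resynchronizing) is an assumption; the other numbers are the normalized figures quoted in the slides.

```python
def break_even_time_ns(p_idle, p_low, p_trans, t_trans_ns):
    """Idle length at which powering down costs the same energy as staying idle.

    Solves p_idle * T = p_low * (T - t_trans) + p_trans * t_trans for T.
    """
    if p_idle <= p_low:
        raise ValueError("low-power mode must draw less power than the idle mode")
    return t_trans_ns * (p_trans - p_low) / (p_idle - p_low)


# Active standby (0.2x) vs. precharge powerdown (0.04x), 5 ns resynchronization,
# and an assumed 1.0x power draw while resynchronizing.
t_be = break_even_time_ns(p_idle=0.2, p_low=0.04, p_trans=1.0, t_trans_ns=5.0)
print(round(t_be, 3))
```

Under these assumptions, powering down pays off only for idle periods longer than about 30 ns, which is why lengthening idle periods matters.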
Flash-Based Storage in an Embedded System
[Figure: the memory hierarchy block diagram, repeated.]
Challenges in Adopting Flash-Based Solid State Drive
- Unique features of flash memory
  - Write-once, out-of-place update, garbage collection, limited write/erase cycles
- The need to revisit designs that assume disk as the storage
  - e.g., virtual memory system
- Reliability issues
  - As technology shrinks to smaller geometries, new quality and reliability issues arise in addition to the existing flash memory reliability issues
  - Adopting ECC, wear leveling
- Single flash chip bandwidth < disk bandwidth
  - Multiple flash chip system
[Figure labels: Flash memory chips, Flash Memory Controller, Host Inter]
Multiple flash chip system
- Imbalance between host interface bandwidth and flash memory bus/chip bandwidth
  - Host interface: 3 Gb/s
  - Single flash chip: 10 MB/s
  - Flash memory bus: 33 MB/s
- To increase the bandwidth of the flash subsystem
  - Parallel reads/writes
  - Data interleaving in each flash chip
- Problem: data locality is destroyed
  - Inefficient garbage collection
- Parallelized garbage collection
[Figure: parallelized garbage collection across flash chips. Live data is read from the victim block into the buffer and written to free blocks, then the block is erased. Other labels: Flash Memory Bus, Flash Memory Controller, Host Inter.]
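Data interleaving across chips can be sketched as page-level round-robin striping. The mapping below is one simple possibility, assumed for illustration; the slides do not say which mapping is used, and the function name `stripe` is made up.

```python
def stripe(logical_page, num_chips):
    """Map a logical page to (chip, page within chip) by round-robin striping."""
    if num_chips <= 0:
        raise ValueError("num_chips must be positive")
    return logical_page % num_chips, logical_page // num_chips


placement = [stripe(p, num_chips=4) for p in range(6)]
print(placement)
```

Consecutive logical pages land on different chips, so a sequential request can be served by several chips in parallel. The same property scatters related data across chips, which is the loss of locality that makes garbage collection less efficient.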