RAMP Gold Wrap




Krste Asanovic
RAMP Wrap, Stanford, CA
August 25, 2010

RAMP Gold Team

Graduate students: Zhangxi Tan, Andrew Waterman, Rimas Avizienis, Yunsup Lee, Henry Cook, Sarah Bird
Faculty: Krste Asanovic, David Patterson

Multicore Architecture Simulation Challenge

- Bigger, more complex target systems: many cores, more components, cache coherence
- Need to run an OS scheduler or a multithreaded runtime
- Sequential software simulator performance is no longer scaling: we get more cores, not faster cores
- Detailed software simulators don't parallelize well: cycle-by-cycle synchronization kills parallel performance
- Parallel code has non-deterministic performance, so multiple runs are needed to get error bars
- Software is more dynamic (adaptive, autotuned, JIT compiled), so sampling can't be used
- ~No legacy parallel software: the entire stack must be written/ported
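To make the synchronization point concrete, here is a minimal Python sketch (not RAMP Gold code, and deliberately simplified): each host thread simulates one target core, but a barrier forces every thread to rendezvous once per target cycle, so the barrier cost is paid every simulated cycle no matter how little per-core work that cycle holds.

```python
# Minimal sketch of why cycle-by-cycle synchronization kills parallel
# simulator performance: the barrier is crossed once per target cycle,
# so its cost dominates when per-cycle core work is small.
import threading

NUM_CORES = 4
TARGET_CYCLES = 10_000

barrier = threading.Barrier(NUM_CORES)

def simulate_core(core_id, state):
    for cycle in range(TARGET_CYCLES):
        state[core_id] += 1   # stand-in for one cycle of core work
        barrier.wait()        # every core waits for every other core,
                              # every single target cycle

state = [0] * NUM_CORES
threads = [threading.Thread(target=simulate_core, args=(i, state))
           for i in range(NUM_CORES)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(state)  # [10000, 10000, 10000, 10000]
```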

RAMP Blue, July 2007

- 1,008 modified MicroBlaze cores, each with a 64-bit FPU
- RTL directly mapped to FPGAs, running at 90 MHz
- Runs the UPC version of the NAS parallel benchmarks
- Message-passing cluster, no MMU
- Requires lots of hardware: 21 BEE2 boards / 84 FPGAs
- Difficult to modify
- An FPGA computer, not a simulator!

Dimensions in FAME (FPGA Architecture Model Execution)

- Direct: one target cycle executed in one FPGA host cycle
- Decoupled: one target cycle takes one or more FPGA host cycles
- Full RTL: complete RTL of the target machine modeled
- Abstract RTL: partial/simplified RTL, split functional/timing

- Host single-threaded: one target model per host pipeline
- Host multi-threaded: multiple target models per host pipeline

Host Multithreading

[Figure: a multithreaded emulation engine on the FPGA models four target CPUs (CPU1-CPU4) with a single hardware pipeline holding multiple copies of CPU state: PCs, register files, I$ and D$.]

- A multithreaded emulation engine reduces FPGA resource use and improves emulator throughput
- Hides emulation latencies (e.g., communicating across FPGAs)
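The host-multithreading idea can be sketched in software. This is a hypothetical Python analogue of the SystemVerilog engine, not the actual design: one host loop owns a copy of target CPU state per target core and round-robins among them, so a long-latency event for one target CPU never idles the host pipeline.

```python
# Software sketch of host multithreading: a single host execution loop
# (the "pipeline") holds N copies of architectural state and issues one
# target instruction from a different target CPU each host cycle.
from dataclasses import dataclass, field

@dataclass
class TargetCpuState:
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)
    stalled_until: int = 0   # host cycle when an outstanding miss resolves

def run(states, host_cycles):
    for host_cycle in range(host_cycles):
        s = states[host_cycle % len(states)]  # round-robin thread select
        if host_cycle < s.stalled_until:
            continue  # this target CPU waits; a real engine would issue
                      # from another ready thread instead of idling the slot
        # "execute" one target instruction; pretend every 10th instruction
        # misses the cache and stalls only this target CPU
        s.regs[1] += 1
        s.pc += 4
        if s.regs[1] % 10 == 0:
            s.stalled_until = host_cycle + 8  # other CPUs keep running

cpus = [TargetCpuState() for _ in range(4)]
run(cpus, 1000)
print([c.pc for c in cpus])
```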

RAMP Gold, November 2009

- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 cores of SPARC V8 with a shared memory system, on a $750 board
- Hardware FPU and MMU; boots an OS

                     Cost            Performance (MIPS)   Simulations per day
Simics (software)    $2,000          0.1 - 1              1
RAMP Gold            $2,000 + $750   50 - 100             100
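As a rough back-of-envelope reading of the table (assuming a representative run of about $9\times10^{10}$ target instructions, a number chosen to match the table's rates, not taken from the slides):

\[
\frac{9\times10^{10}}{1\ \text{MIPS}} \approx 9\times10^{4}\,\mathrm{s} \approx 1\ \text{day},
\qquad
\frac{9\times10^{10}}{100\ \text{MIPS}} = 900\,\mathrm{s} = 15\ \text{min},
\]

which is how 0.1-1 MIPS yields roughly one simulation per day while 50-100 MIPS yields roughly a hundred.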

RAMP Gold Target Machine

[Figure: 64 SPARC V8 cores, each with its own I$ and D$, connected through a shared L2$/interconnect to DRAM.]

RAMP Gold Model

[Figure: split functional model pipeline (holding architectural state) and timing model pipeline (holding timing state).]

- RAMP emulation model for the Par Lab manycore: SPARC V8 ISA, single-socket manycore target
- Split functional/timing model, both in hardware (sketched below)
  - Functional model: executes the ISA
  - Timing model: captures pipeline timing detail (can be cycle accurate)
- Host multithreading of both functional and timing models
- Built for Virtex-5 systems (ML505 or BEE3)

Case Study: Manycore OS Resource Allocation

- Spatial resource allocation in a manycore system is hard: combinatorial explosion in the number of apps and resources
- Idea: use predictive models of app performance to make allocation easier for the OS
- Hardware partitioning for performance isolation (so the models still work when apps run together)
- Problem: evaluating the effectiveness of the resulting scheduling decisions requires running hundreds of schedules for billions of cycles each
- Simulation-bound: 8.3 CPU-years in Simics!
- See the ISCA 2010 FAME paper for details
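A minimal Python sketch of the functional/timing split referenced above (the instruction set and latencies are invented for illustration; the real models are hardware pipelines): only the functional model touches architectural state, while the timing model just charges target cycles, so timing detail can change without re-verifying correctness.

```python
# Sketch of a split functional/timing simulator. The functional model
# changes architectural state; the timing model only decides how many
# target cycles each event costs.
class FunctionalModel:
    def __init__(self):
        self.pc, self.regs, self.mem = 0, [0] * 32, {}

    def step(self, program):
        op, rd, a, b = program[self.pc]  # fetch/decode one instruction
        if op == "add":
            self.regs[rd] = self.regs[a] + self.regs[b]
        elif op == "load":
            self.regs[rd] = self.mem.get(self.regs[a] + b, 0)
        self.pc += 1
        return op

class TimingModel:
    LATENCY = {"add": 1, "load": 2}  # target cycles per op (made up)
    def __init__(self):
        self.target_cycles = 0
    def account(self, op):
        self.target_cycles += self.LATENCY[op]

func, timing = FunctionalModel(), TimingModel()
prog = [("add", 1, 0, 0), ("load", 2, 1, 0), ("add", 3, 1, 2)]
while func.pc < len(prog):
    timing.account(func.step(prog))
print(timing.target_cycles)  # 4
```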


Case Study: Manycore OS Resource Allocation (continued)

- The technique appears to perform very well on synthetic or reduced-input workloads, but is lackluster in reality!

RAMP Gold Distribution

- http://sites.google.com/site/rampgold/
- BSD/GNU licenses
- Many (~100?) downloads

- Used by the Xilinx tools group as an exemplar SystemVerilog design

RAMP Gold Lessons

- Architects will use it! In fact, they don't want to use software simulators now
- Make everything a run-time parameter, to avoid resynthesis + place & route: highly tuned FPGA designs are difficult to modify (see the sketch below)
- The functional/timing split should be at the architectural-block level: really, build a microfunctional model
- A standard ISA only gets you so far in research: research immediately changes the ISA/ABI, so compatibility is lost anyway
- Target machine design is a vital part of architecture research: you have to understand the target to build a model of it, and you need area/cycle-time/energy numbers from a VLSI design
- FAME-7 simulators are very complex hardware designs, much more complicated than a processor
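A hypothetical sketch of the run-time-parameter lesson, with invented register names and offsets: timing knobs live in configuration registers that host software writes between runs, so exploring a design point costs a register write rather than hours of resynthesis and place & route.

```python
# Hypothetical sketch of run-time parameterization: timing-model knobs
# live in memory-mapped configuration registers on the FPGA. Changing
# the simulated machine is a register write, not a rebuild of the
# bitstream. All offsets and names are invented for illustration.
TIMING_REGS = {
    "l1_hit_latency":   0x00,   # target cycles
    "l2_hit_latency":   0x04,
    "dram_latency":     0x08,
    "num_target_cores": 0x0C,
}

class SimConfig:
    def __init__(self, mmio_write):
        self.mmio_write = mmio_write  # function(offset, value)
    def set(self, name, value):
        self.mmio_write(TIMING_REGS[name], value)

# Sweep DRAM latency across runs of the same bitstream:
regs = {}
cfg = SimConfig(lambda off, val: regs.__setitem__(off, val))
for dram_lat in (50, 100, 200):
    cfg.set("dram_latency", dram_lat)
    # ... reset target, run workload, collect stats ...
```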

Midas Goals

- Richer set of target machines
  - In-order scalar cores, possibly threaded, plus various kinds of vector unit
  - Various hardware-managed plus software-managed memory hierarchies
  - Cross-chip and off-chip interconnect structures
- More modular design: trade some FPGA performance to make modifications easier
- Better/more timing models, especially interconnect and memory hierarchy
- Better I/O for the target system
  - Timed I/O to model real-time I/O accurately
  - Paravirtual-style network/video/graphics/audio/haptic I/O devices
- Better scaling of simulation performance
  - Weak scaling: more FPGA pipelines allow a bigger system to run faster
  - Strong scaling: more FPGA pipelines allow the same system to run faster
- First target machine running apps on the Par Lab stack in 2010

Midas ISA Choice

- RISC-V: a new home-grown RISC ISA ("V" for five, for vector, for variants)
- Inspired by MIPS but much cleaner
- Supports 32-bit or 64-bit address spaces
- 32-bit instruction format plus optional variable-length (16/32/>32-bit) instructions
- Designed to support easy extension
- Completely open (BSD)
- Are we crazy not to use a standard ISA?
  - A standard ISA is not very helpful, but a standard ABI is
  - A standard ABI effectively means running the same OS, and running the same OS means modeling the hardware platform: too much work for a university
  - Also, the only interesting standard ISAs are x86 and ARM (+GPU), and both are too complex for a university project to implement in full
  - We're going to change the ISA plus the OS model anyway
- Midas modularity should make it easy to add other ISAs as well

Midas Target Core Design Space

- Scalar cores
  - In-order (out-of-order possible later)
  - Different issue widths and functional-unit latencies
  - Decoupled memory system (non-blocking cache)
  - Optional multithreading
  - New translation/protection structures for OS support
  - Performance counters
- Attached data-parallel accelerator options
  - Traditional vector (a la Cray)
  - SIMD (a la SSE/AVX)
  - SIMT (a la NVIDIA GPUs)
  - Vector-threading (a la Scale/Maven)
  - All with one or more lanes
- Should be possible to model a mix of cores for asymmetric platforms (same ISA, different microarchitecture)
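One way to read the design-space slide is as a configuration product. A hypothetical Python sketch (names are illustrative, not the actual Midas configuration interface) of enumerating core configurations, including an asymmetric same-ISA platform:

```python
# Hypothetical sketch of the Midas core design space as a configuration
# product: each simulated core combines scalar-pipeline parameters with
# an optional data-parallel accelerator. Names and value ranges are
# invented for illustration.
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class CoreConfig:
    issue_width: int   # 1- or 2-issue in-order
    threads: int       # optional multithreading
    accelerator: str   # none / vector / simd / simt / vt
    lanes: int         # accelerator lanes

design_space = [
    CoreConfig(w, t, acc, lanes)
    for w, t, acc, lanes in product(
        (1, 2), (1, 2, 4),
        ("none", "vector", "simd", "simt", "vt"),
        (1, 2, 4))
]
print(len(design_space))  # 90 candidate core configurations

# Asymmetric platform: same ISA, different configurations on one chip
platform = [CoreConfig(2, 1, "none", 1)] * 4 + [CoreConfig(1, 4, "vt", 4)] * 2
```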

Midas Uncore Design Space

- Distributed coherent caches
  - Initially using a reverse-map tag directory (sketched at the end of this transcript)
  - Flexible placement, replication, migration, and eviction policies
- Virtual local stores: software-managed, with DMA engines
- Multistage cross-chip interconnect: rings/torus
- Partitioning/QoS hardware for cache capacity and for on-chip and off-chip bandwidth
- DRAM access schedulers
- Performance counters
- All fully parameterized for latency/bandwidth settings

Platforms of Interest

- 1 W handheld
- 10 W laptop/set-top/TV/games/car
- 100 W servers

Midas Software

- gcc + binutils as efficiency-level compiler tools; already up and running for the draft RISC-V spec
- Par Lab OS + software stack for the rest
- Par Lab applications
- Pattern-specific compilers built with SEJITS/autotuning help get good app performance quickly on a large set of new architectures

Not Just Simulators

- New DoE-funded Project Isis (with John Wawrzynek) on developing a high-level hardware design capability
- Integrated approach spanning VLSI design and simulation
- Building ASIC designs for various scalar+vector cores: yes, we will fab chips
- Mapping RTL to FPGAs: FPGA computers help software developers and help debug the RTL
- Why? For designing (not characterizing) simple cores, power-model abstractions don't work (e.g., 2x difference based on data values)

Acknowledgements

Sponsors:
- Par Lab: Intel, Microsoft, UC Discovery, National Instruments, NEC, Nokia, NVIDIA, Samsung, Sun Microsystems
- Xilinx, IBM, SPARC International
- DARPA, NSF, DoE, GSRC, BWRC
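For the reverse-map tag directory named on the uncore slide, here is a hypothetical Python sketch (geometry and interface invented for illustration): the directory holds a duplicate copy of each core's L1 tags, and the sharers of a block are found by probing those duplicates, so no per-memory-block sharer list is needed.

```python
# Hypothetical sketch of a reverse-map (duplicate-tag) directory:
# instead of a sharer list per memory block, the directory mirrors every
# private cache's tag array. Geometry below is made up.
NUM_CORES, NUM_SETS, WAYS, BLOCK = 4, 256, 2, 64

def index(addr):
    return (addr // BLOCK) % NUM_SETS

def tag(addr):
    return addr // (BLOCK * NUM_SETS)

# dup_tags[core][set] mirrors the tags cached by that core's L1
dup_tags = [[[None] * WAYS for _ in range(NUM_SETS)]
            for _ in range(NUM_CORES)]

def record_fill(core, addr, way):
    """Mirror an L1 fill into the duplicate tags."""
    dup_tags[core][index(addr)][way] = tag(addr)

def sharers(addr):
    """Find every core caching this block, e.g., to invalidate on a write."""
    s, t = index(addr), tag(addr)
    return [c for c in range(NUM_CORES) if t in dup_tags[c][s]]

record_fill(0, 0x1000, 0)
record_fill(2, 0x1000, 1)
print(sharers(0x1000))  # [0, 2]
```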
