UC Regents Spring 2014 © UCBCS 152 L3: Metrics 2014-1-28 John Lazzaro (not a prof - “John” is...

UC Regents Spring 2014, UCB CS 152 L3: Metrics. John Lazzaro (not a prof - "John" is always OK). CS 152 Computer Architecture and Engineering. www-inst.eecs.berkeley.edu/~cs152/ TA: Eric Love. Lecture 3: Metrics.

Topics for today's lecture (L3: Metrics + Microcode). Metrics: Estimating the "goodness" of a CPU design... so that we can redesign the CPU to be "better". A case study in microcode control: the Motorola 68000, the CPU that powered the original Macintosh. [See Lecture 5 slides for this topic.] Short Break. Administrivia: Will announce office hours soon...

On the drawing board... (Todd Hamilton, iWatch concept.) Gray-scale computer graphics model. Color computer graphics model... Animated model...

Then the baton is passed to us. We use models to do stepwise refinement of the silicon that powers the consumer product.

Four metrics. Performance: Execution time of a program. Cost: How many dollars to manufacture. Energy: Joules required to execute a program. Time to Market: Will we ship a product before our competitors? [Slide labels: Today's Focus; For a later lecture...]

UC Regents Fall 2006, UCB CS 152 L6: Performance.

Performance Measurement (as seen by the customer).

Who (sensibly) upgrades CPUs often? A professional who turns CPU cycles into money, and who is cycle-limited. Artist tool: animation, video special effects.

How to decide to buy a new machine? Measure After Effects execution time on a representative render workload. "Night flight": City map and clouds computed on the fly with fractals. CPU intensive. Trivial I/O. (Still shot from the movie.)

Interpreting Execution Time. $\text{Performance} = \frac{1}{\text{Execution Time}} = 2.85$ renders/hour. 1.5 GHz PB (Y) is N times faster than 1.25 GHz PB (X). N is ?
$N = \frac{\text{Performance}(Y)}{\text{Performance}(X)} = \frac{\text{Execution Time}(X)}{\text{Execution Time}(Y)}$

PB 1.5 GHz: 3.4 renders/hour. PB 1.25 GHz: 2.85 renders/hour. Might make the difference in meeting a deadline... Execution Time: 1265 seconds Power Book G GHz

2 CPUs: Execution Time vs Throughput. Execution Time: time for one job to complete. Throughput: # of independent jobs/hour completed. 2 CPUs vs 1 CPU, otherwise similar: 1.8x faster. Implies parallel code on a Mac. Assume G5 MP execution time is faster because AE isn't parallelized on Opteron CPUs. However, G5 and Opteron may have the same throughput.

Performance Measurement (as seen by a CPU designer). Q. Why do we care about After Effects performance? A. We want the CPU we are designing to run it well!

Step 1: Analyze the right measurement! CPU Time: Time the CPU spends running the program under measurement. Guides CPU design. Response Time: Total time: CPU Time + time spent waiting (for disk, I/O, ...). Guides system design. Measuring CPU time (Unix): % time 25.77u 0.72s 0: %

CPU time: Proportional to Instruction Count. $\frac{\text{CPU time}}{\text{Program}} \propto \frac{\text{Machine Instructions}}{\text{Program}}$ Rationale: Every additional instruction you execute takes time. Q. Static count? (lines of program printout) Or dynamic count? (trace of execution) A. Dynamic. Q. How does an architect influence the number of machine instructions needed to run an algorithm? A. Create new instructions: instruction set architect. Q. Once the ISA is set, who can influence instruction count? A. Compiler writer, application developer.

CPU time: Proportional to Clock Period. $\frac{\text{Time}}{\text{Program}} \propto \frac{\text{Time}}{\text{One Clock Period}}$ Q. What ultimately limits an architect's ability to reduce clock period? Q. How can architects (not technologists) reduce clock period?
We will revisit these questions later in lecture... Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period for each cycle, we shorten execution time.

CPU time: Proportional to Clock Period. $\frac{\text{Time}}{\text{Program}} \propto \frac{\text{Time}}{\text{One Clock Period}}$ Q. What ultimately limits an architect's ability to reduce clock period? A. Clock-to-Q, setup times. Q. How can architects (not technologists) reduce clock period? A. Shorten the machine's critical path. Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period for each cycle, we shorten execution time.

Completing the performance equation.

$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$

We need all three terms, and only these terms, to compute CPU Time! Cycles/Instruction is the CPI -- the average number of clock cycles per instruction for the program. Q. When is it OK to compare clock rates? A. When the other RHS terms are equal. What factors make different programs have different CPIs? Instruction mix varies. Cache behavior varies. Branch prediction varies.

UC Regents Fall 2008, UCB CS 152 L5: Pipelining.

Consider the Lecture 2 single-cycle CPU... All instructions take 1 cycle to execute every time they run. CPI of any program running on the machine? 1.0. "Average CPI for the program" is a more useful concept for more complicated machines...

Recall Lecture 2: Multiflow VLIW CPU. [Figure: two 32-bit instruction slots, each with fields opcode, rs, rt, rd, funct, shamt.] Syntax: ADD $8 $9 $10. Semantics: $8 = $9 + $10. Syntax: ADD $7 $8 $9. Semantics: $7 = $8 + $9. N x 32-bit VLIW yields factor of N speedup! Multiflow: N = 7, 14, or 28 (3 CPUs in product family).

$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$

Q. Which right-hand-side term decreases with N? [Labels on individual terms of the equation:] A. This one gets smaller. A. We hope this one doesn't grow.
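The performance equation above can be sketched as a few lines of Python. This is a minimal illustration, not part of the lecture: the function names, the instruction mix, the instruction counts, and the 1 ns clock period are all hypothetical values chosen for the example.

```python
def average_cpi(mix):
    """Weighted-average CPI from (instruction_fraction, cycles) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    return sum(fraction * cycles for fraction, cycles in mix) / total_fraction


def cpu_time_seconds(instruction_count, cpi, clock_period_s):
    """Seconds/Program = Instructions/Program * Cycles/Instruction * Seconds/Cycle."""
    return instruction_count * cpi * clock_period_s


# Hypothetical mix: 70% one-cycle instructions, 30% five-cycle loads.
mix = [(0.70, 1), (0.30, 5)]
cpi = average_cpi(mix)

# Hypothetical program: 1e9 dynamic instructions, 1 ns clock period (1 GHz).
baseline = cpu_time_seconds(1_000_000_000, cpi, 1e-9)

# Same CPI and clock period, half the dynamic instruction count.
halved = cpu_time_seconds(500_000_000, cpi, 1e-9)

print(f"CPI = {cpi:.2f}, baseline = {baseline:.2f} s, halved = {halved:.2f} s")
```

Halving the instruction count halves CPU time only because the other two terms are held fixed here, which is the same caveat the slide raises about comparing clock rates.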
Consider a machine with a data cache...

$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$

A program's load instructions "stride" through every memory address. The cache never hits, so every load goes to DRAM (100x slower than loads that go to cache). Thus, the average number of cycles for load instructions is higher for this program. Thus, the average number of cycles for all instructions (Cycles/Instruction) is higher for this program. Thus, the program takes longer to run (Seconds/Program)!

CPI as an analytical tool to guide design. Machine CPI (throughput, not latency). [Table: machine CPI per instruction class weighted by the program's instruction mix; the weighted sum is 2.7 cycles/instruction. A "where the program spends its time" column gives each class's share, e.g. 20/270.] Now we know how to optimize the design...

Final thoughts: Performance Equation.

$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$

Goal is to optimize execution time, not individual equation terms. Cycles/Instruction: The CPI of the program. Reflects the program's instruction mix. Seconds/Cycle: Clock period. Optimize jointly with machine CPI. Machines are optimized with respect to program workloads.

[Photo caption:] Invented the "one ISA, many implementations" business model.

Amdahl's Law (of Diminishing Returns). If enhancement "E" makes multiply infinitely fast, but other instructions are unchanged, what is the maximum speedup "S"? [Table: where the program spends its time.]

$S = \frac{1}{(\text{post-enhancement \%}) / 100\%}$

Attributed to Gene Amdahl -- "Amdahl's Law". What is the lesson of Amdahl's Law? Must enhance computers in a balanced way!

Program We Wish To Run On N CPUs. The program spends 30% of its time running code that cannot be recoded to run in parallel.
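The 30% serial case can be worked through numerically with the standard form of Amdahl's Law. The sketch below is illustrative only: the function name `amdahl_speedup` and the choice of CPU counts are assumptions, not taken from the slides.

```python
def amdahl_speedup(serial_fraction, n_cpus):
    """Speedup when only the (1 - serial_fraction) part is split across n_cpus."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cpus)


SERIAL = 0.30  # 30% of run time cannot be parallelized (from the slide)

for n in (2, 3, 4, 5):
    print(f"{n} CPUs: speedup = {amdahl_speedup(SERIAL, n):.2f}")

# With unlimited CPUs the parallel part takes no time, so speedup tends to 1/0.30.
limit = 1.0 / SERIAL
print(f"upper bound as N grows: {limit:.2f}")
```

Going from 2 to 4 CPUs in this model raises the speedup from about 1.54 to about 2.11, well short of doubling, and no CPU count can exceed the 1/0.30 bound.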
$S = \frac{1}{(30\% + (70\% / N)) / 100\%}$

[Table: speedup for 2, 3, 4, and 5 CPUs.] Amdahl's Law in Action. [Plot: speedup S vs. # CPUs.]

Real-world 2006: 2 CPUs vs 4 CPUs. 20-inch iMac, Core 2 Duo, 2.16 GHz: $1500. Mac Pro, 2 Dual-Core Xeons, 2.66 GHz: $3200 w/ 20-inch display.

Real-world 2006: 2 CPUs vs 4 CPUs. 2 cores on one die. 4 cores on two dies. Caveat: Mac Pro CPUs are "server-class" and have architectural advantages (better I/O, ECC DRAM, etc.). ZIPing a file: very difficult to parallelize. Simple audio and video tasks: easier to parallelize. Amdahl's Law + Real-World Legacy Code Issues in action. Source: MACWORLD.

Break.

Timing.

CPU time: Proportional to Clock Period. $\frac{\text{Time}}{\text{Program}} \propto \frac{\text{Time}}{\text{One Clock Period}}$ Q. What ultimately limits an architect's ability to reduce clock period? Q. How can architects (not technologists) reduce clock period? In this part of lecture: we answer these questions... Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period for each cycle, we shorten execution time.

UC Regents Fall 2008, UCB CS 152 L6: Timing.

Goal: Determine minimum clock period. [Figure: single-cycle datapath with register file (rs1, rs2, ws, wd, rd1, rd2, WE), extender (Ext), and control lines RegDest, ALUsrc, ExtOp, ALUctr, MemToReg, MemWr, RegWr, Equal, driven by combinational logic.]

UC Regents Fall 2013, UCB CS 250 L3: Timing.

A Logic Circuit Primer. "Models should be as simple as possible, but no simpler..." Albert Einstein.

Inverters: A simple transistor model. pFET: A switch. "On" if gate is grounded. nFET: A switch. "On" if gate is at Vdd. [Figure: inverter with input 1 giving output 0, and input 0 giving output 1.] Correctly predicts logic output for simple static CMOS circuits. Extensions to model subtler circuit families, or to predict timing, have not worked well...

Transistors as water valves.
If electrons are water molecules, transistor strengths (W/L) are pipe diameters, and capacitors are buckets... An "on" p-FET fills up the capacitor with charge. An "on" n-FET empties the bucket. [Plots: water level vs. time for a 1-to-0 and a 0-to-1 input transition.] This model is often good enough... (Cartoon physics)

What is the bucket? A gate's fan-out. Driving other gates slows a gate down. Driving wires slows a gate down. Driving its own parasitics slows a gate down. "Fan-out": The number of gate inputs driven by a gate's output.

A closer look at fan-out... Linear model works for reasonable fan-out. FO4: Fanout of four delay. Delay time of an inverter driving 4 inverters. Driving more gates adds delay.

Propagation delay graphs. [Plot: inverter transfer function, showing the 1->0 and 0->1 transitions.]

Worst-case delay through combinational logic. x = g(a, b, c, d, e, f). If d going 0-to-1 switches x 0-to-1, delay is T1. If a going 0-to-1 switches x 0-to-1, delay is T2. T2 "might" be the worst-case delay path (critical path). It would be surprising if T1 > T2.

Why "might"? Wires have delay too... Looks benign, but...

Clocked Logic Circuits.

From Delay Models to Timing Analysis. Clock frequency f and period T: 1 MHz, 1 µs; 10 MHz, 100 ns; 100 MHz, 10 ns; 1 GHz, 1 ns. Timing Analysis: What is the smallest T that produces correct operation?

Timing Analysis and Logic Delay. If our clock period T > worst-case delay through CL, does this ensure correct operation? [Figure: Register (an array of flip-flops) feeding combinational logic.]

Flip Flops have internal delays... [Figure: flip-flop with D, Q, and CLK.] Value of D is sampled on positive clock edge. Q outputs sampled value for rest of cycle.
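The sampling behavior just described can be mimicked with a small behavioral model. This is a sketch under simplifying assumptions: it ignores all delays (setup, hold, clock-to-Q), and the class name `DFlipFlop` and method `tick` are hypothetical names made up for illustration.

```python
class DFlipFlop:
    """Idealized positive-edge-triggered D flip-flop (no timing modeled)."""

    def __init__(self, q=0):
        self.q = q            # value currently driven on Q
        self._prev_clk = 0    # last clock level seen, used to detect rising edges

    def tick(self, clk, d):
        """Apply the current clk and d levels; return the value on Q."""
        rising_edge = self._prev_clk == 0 and clk == 1
        if rising_edge:
            self.q = d        # D is sampled only on the 0 -> 1 clock transition
        self._prev_clk = clk
        return self.q


ff = DFlipFlop()
trace = [ff.tick(clk, d) for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0)]]
print(trace)
```

Changes on D while the clock is high or low are ignored; Q only updates at the next rising edge, which is why the real device's delays around that edge matter for the time budget.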
[Waveform labels: D, Q, t_setup, t_clk-to-Q.]

Flip-Flop delays eat into "time budget". [Figure: ALU "time budget" between registers; combinational logic.]

Clock skew also eats into "time budget". As T → 0, which circuit fails first? [Figure label: CLKd.]

UC Regents Fall 2006, UCB CS 152 L5: Timing.

Clocks have dedicated wires (low skew). From: Xilinx Spartan 3 data sheet. Virtex is similar. Clock tree: Flip-flop clock inputs are the leaves of the tree. Gold wires form the clock tree. Die photo: Xilinx Virtex Pro.

Clock Tree Delays, IBM Power CPU. [Figure: clock tree delay.]

Some Flip Flops have "hold" time... [Waveform: D, CLK, t_setup, t_hold. D must stay stable here.] [Figure: flip-flop (D, Q, CLK) with an inverter of delay t_inv.] What is the intended function of this circuit? Does flip-flop hold time affect operation of this circuit? Under what conditions? For correct operation: t_clk-to-Q + t_inv > t_hold.

Searching for processor critical path. [Figure: single-cycle datapath with register file (rs1, rs2, ws, wd, rd1, rd2, WE), extender (Ext), and control lines RegDest, ALUsrc, ExtOp, ALUctr, MemToReg, MemWr, RegWr, Equal, driven by combinational logic.]

Searching for processor critical path. Timing Analysis: What is the smallest T that produces correct operation? Must consider all connected register pairs. Q. Why might I suspect this one? A. Very long wire on the path.

Combinational paths for IBM Power 4 CPU. Most wires have hundreds of picoseconds to spare. The critical path. From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

Power 4: Timing Estimation, Closure. Timing Estimation: Predicting a processor's clock rate early in the project. From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D.
Warnock et al.

Power 4: Timing Estimation, Closure. Timing Closure: Meeting (or exceeding!) the timing estimate. From "The circuit and physical design of the POWER4 microprocessor", IBM J Res and Dev, 46:1, Jan 2002, J.D. Warnock et al.

Floorplanning: Essential to meet timing. (Intel XScale 80200)

CPU time: Proportional to Clock Period. $\frac{\text{Time}}{\text{Program}} \propto \frac{\text{Time}}{\text{One Clock Period}}$ Q. What ultimately limits an architect's ability to reduce clock period? A. Clock-to-Q, setup times, 2-D floorplanning geometry. Q. How can architects (not technologists) reduce clock period? A. Shorten the machine's critical path. Rationale: We measure each instruction's execution time in "number of cycles". By shortening the period for each cycle, we shorten execution time.

On Thursday: Pipeline design - with enough detail to do a design. Have fun in section!
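The timing-budget ideas from this part of the lecture (clock-to-Q, worst-case logic delay, setup time, clock skew, and the hold condition t_clk-to-Q + t_inv > t_hold) can be collected into a small check. This is a sketch under a common textbook model that simply adds the delays along one register-to-register path; the function names and every picosecond value are hypothetical, not measurements from the lecture.

```python
def min_clock_period_ps(t_clk_to_q, t_logic_worst, t_setup, t_skew=0):
    """Smallest period for one register-to-register path (all times in ps).

    Assumes the simple additive model: data launches t_clk_to_q after the
    edge, crosses the worst-case logic path, and must arrive t_setup before
    the next edge, with t_skew of clock uncertainty taken from the budget.
    """
    return t_clk_to_q + t_logic_worst + t_setup + t_skew


def hold_time_met(t_clk_to_q, t_logic_fastest, t_hold):
    """Hold check from the slide: t_clk-to-Q + fastest logic delay > t_hold."""
    return t_clk_to_q + t_logic_fastest > t_hold


# Hypothetical path: 50 ps clock-to-Q, 800 ps critical path, 40 ps setup, 30 ps skew.
period_ps = min_clock_period_ps(50, 800, 40, 30)
max_freq_ghz = 1000.0 / period_ps  # 1000 ps per ns, so 1000/period_ps is in GHz

print(f"minimum period = {period_ps} ps, max clock = {max_freq_ghz:.2f} GHz")
print("hold met with 20 ps inverter:", hold_time_met(50, 20, 60))
print("hold met with 5 ps inverter:", hold_time_met(50, 5, 60))
```

A full timing analysis would repeat the first calculation for every connected register pair and take the largest result, since the slowest pair sets the clock period for the whole machine.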
