Rosetta Demostrator Project MASC, Adelaide University and ... · Rosetta Demostrator Project MASC,...
Transcript of Rosetta Demostrator Project MASC, Adelaide University and ... · Rosetta Demostrator Project MASC,...
LECTURE 1:UNDERSTANDING COMPUTER
ARCHITECTURE
Abridged version of Patterson & Hennessy (2017):Ch.1
Hennessy & Patterson (2017): Ch.1
1
The Computer Revolution
Progress in computer technology Underpinned by Moore’s Law
Makes novel applications feasible Computers in automobiles Cell phones Human genome project World Wide Web Search Engines
Computers are pervasive
2
Classes of Computers Personal Mobile Device (PMD)
e.g. start phones, tablet computers Emphasis on energy efficiency and real-time
Desktop Computing Emphasis on price-performance
Servers Emphasis on availability, scalability, throughput
Clusters / Warehouse Scale Computers Used for “Software as a Service (SaaS)” Emphasis on availability and price-performance Sub-class: Supercomputers, emphasis: floating-point
performance and fast internal networks Embedded Computers
Emphasis: price
3
The PostPC Era
4
Opening the BoxCapacitive multitouch LCD screen
3.8 V, 25 Watt-hour battery
Computer board
5
Inside the Processor
E.g., Apple A5 Contain
Datapath Control Cache I/O
6
Below Your Program Application software
Written in high-level language System software
Compiler: translates HLL code to machine code
Operating System: service code Handling input/output Managing memory and storage Scheduling tasks & sharing resources
Hardware Processor, memory, I/O controllers
7
Levels of Program Code High-level language
Level of abstraction closer to problem domain
Provides for productivity and portability
Assembly language Textual representation of
instructions Hardware representation
Binary digits (bits) Encoded instructions and
data
8
Defining Computer Architecture
“Old” view of computer architecture: Instruction Set Architecture (ISA) design i.e. decisions regarding:
registers, memory addressing, addressing modes, instruction operands, available operations, control flow instructions, instruction encoding
“Real” computer architecture: Specific requirements of the target machine Design to maximize performance within
constraints: cost, power, and availability Includes ISA, microarchitecture, hardware
9
Understanding Performance
Algorithm Determines number of operations executed
Programming language, compiler, architecture Determine number of machine instructions executed
per operation Processor and memory system
Determine how fast instructions are executed I/O system (including OS)
Determines how fast I/O operations are executed
10
Eight Great Ideas
Design for Moore’s Law
Use abstraction to simplify design
Make the common case fast
Performance via parallelism
Performance via pipelining
Performance via prediction
Hierarchy of memories
Dependability via redundancy
11
Abstractions
Abstraction helps us deal with complexity Hide lower-level detail
Instruction set architecture (ISA) The hardware/software interface
Application binary interface The ISA plus system software interface
Implementation The details underlying and interface
The BIG Picture
12
Intel Core i7 Wafer
300mm wafer, 280 chips, 32nm technology Each chip is 20.7 x 10.5 mm
13
Transistors and Wires
Feature size Minimum size of transistor or wire in x or y
dimension 10 microns in 1971 to .011 microns in 2017 Transistor performance scales linearly
Wire delay does not improve with feature size! Integration density scales quadratically
14
Technology Trends Electronics
technology continues to evolve
Increased capacity and performance
Reduced cost
Year Technology Relative performance/cost
1951 Vacuum tube 1
1965 Transistor 35
1975 Integrated circuit (IC) 900
1995 Very large scale IC (VLSI) 2,400,000
2013 Ultra large scale IC 250,000,000,000
DRAM capacity
15
Trends in Technology
Integrated circuit technology Transistor density: 35%/year Die size: 10-20%/year Integration overall: 40-55%/year
DRAM capacity: 25-40%/year (slowing) Flash capacity: 50-60%/year
15-20X cheaper/bit than DRAM Magnetic disk technology: 40%/year
15-25X cheaper/bit then Flash 300-500X cheaper/bit than DRAM
16
Measuring Performance Typical performance metrics:
Response time Throughput
Speedup of X relative to Y Execution timeY / Execution timeX
Execution time Wall clock time: includes all system overheads CPU time: only computation time
Benchmarks Kernels (e.g. matrix multiply) Toy programs (e.g. sorting) Synthetic benchmarks (e.g. Dhrystone) Benchmark suites (e.g. SPEC06fp, TPC-C)
17
Bandwidth and Latency
Bandwidth or throughput Total work done in a given time 10,000-25,000X improvement for processors 300-1200X improvement for memory and disks
Latency or response time Time between start and completion of an event 30-80X improvement for processors 6-8X improvement for memory and disks
18
Bandwidth and Latency
Log-log plot of bandwidth and latency milestones
19
Relative Performance
Define Performance = 1/Execution Time “X is n time faster than Y”
n XY
YX
time Executiontime Execution
ePerformancePerformanc
Example: time taken to run a program 10s on A, 15s on B Execution TimeB / Execution TimeA
= 15s / 10s = 1.5 So A is 1.5 times faster than B
20
Measuring Execution Time
Elapsed time Total response time, including all aspects
Processing, I/O, OS overhead, idle time Determines system performance
CPU time Time spent processing a given job
Discounts I/O time, other jobs’ shares Comprises user CPU time and system CPU
time Different programs are affected differently by
CPU and system performance
21
Performance Summary
Performance depends on Algorithm: affects IC, possibly CPI Programming language: affects IC, CPI Compiler: affects IC, CPI Instruction set architecture: affects IC, CPI, Tc
The BIG Picture
cycle Clock
Seconds
nInstructio
cycles Clock
Program
nsInstructioTime CPU
22
SPEC CPU Benchmark Programs used to measure performance
Supposedly typical of actual workload Standard Performance Evaluation Corp (SPEC)
Develops benchmarks for CPU, I/O, Web, … SPEC CPU2006
Elapsed time to execute a selection of programs Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine Summarize as geometric mean of performance ratios
CINT2006 (integer) and CFP2006 (floating-point)
n
n
1iiratio time Execution
23
CINT2006 for Intel Core i7 920
24
Pitfall: Amdahl’s Law
Improving an aspect of a computer and expecting a proportional improvement in overall performance
unaffectedaffected
improved Tfactor timprovemen
TT
Corollary: make the common case fast Speed-up S(N) = [(1 – f) + f/N]-1
25
Pitfall: MIPS as a Performance Metric
MIPS: Millions of Instructions Per Second Doesn’t account for
Differences in ISAs between computers Differences in complexity between instructions
66
6
10CPI
rate Clock
10rate Clock
CPIcount nInstructiocount nInstructio
10time Execution
count nInstructioMIPS
CPI varies between programs on a given CPU
26
Power and Energy Problem: Get power in, get power out Thermal Design Power (TDP)
Characterizes sustained power consumption Used as target for power supply and cooling system Lower than peak power (1.5X higher), higher than
average power consumption Clock rate can be reduced dynamically to limit
power consumption Energy per task is often a better measurement
27
Power Intel 80386
consumed ~ 2 W 3.3 GHz Intel
Core i7 consumes 130 W
Heat must be dissipated from 1.5 x 1.5 cm chip
This is the limit of what can be cooled by air
28
Dynamic Energy and Power Dynamic energy
Transistor switch from 0 -> 1 or 1 -> 0 ½ x Capacitive load x Voltage2
Dynamic power ½ x Capacitive load x Voltage2 x Frequency switched
Reducing clock rate reduces power, not energy
29
Reducing Power Techniques for reducing power:
Do nothing well Dynamic Voltage-Frequency Scaling
Low power state for DRAM, disks Overclocking, turning off cores
30
Static Power Static power consumption
25-50% of total power Currentstatic x Voltage Scales with number of transistors To reduce: power gating
31
Power Trends
In CMOS IC technology
FrequencyVoltageload CapacitivePower 2
×1000×30 5V → 1V
32
SPEC Power Benchmark
Power consumption of server at different workload levels Performance: ssj_ops/sec Power: Watts (Joules/sec)
10
0ii
10
0ii powerssj_ops Wattper ssj_ops Overall
33
SPECpower_ssj2008 for Xeon X5650
34
Reducing Power
Suppose a new CPU has 85% of capacitive load of old CPU 15% voltage and 15% frequency reduction
0.520.85FVC
0.85F0.85)(V0.85C
P
P 4
old2
oldold
old2
oldold
old
new
The power wall We can’t reduce voltage further We can’t remove more heat
How else can we improve performance?
35
Fallacy: Low Power at Idle
Look back at i7 power benchmark At 100% load: 258W At 50% load: 170W (66%) At 10% load: 121W (47%)
Google data center Mostly operates at 10% – 50% load At 100% load less than 1% of the time
Consider designing processors to make power proportional to load
Can it be done?36
Single Processor Performance
RISC
Move to multi-processor
37
Current Trends in Architecture
Cannot continue to leverage Instruction-Level parallelism (ILP) Single processor performance improvement ended in
2003 New models for performance:
Data-level parallelism (DLP) Thread-level parallelism (TLP) Request-level parallelism (RLP)
These require explicit restructuring of the application
38
Multiprocessors
Multicore microprocessors More than one processor per chip
Requires explicitly parallel programming Compare with instruction level parallelism
Hardware executes multiple instructions at once Hidden from the programmer
Hard to do Programming for performance Load balancing Optimizing communication and synchronization
39
Fallacies and Pitfalls
Microprocessors are a silver bullet Performance is now a programmer’s burden
Falling prey to Amdahl’s Law A single point of failure Hardware enhancements that increase
performance also improve energy efficiency, or are at worst energy neutral
Benchmarks remain valid indefinitely Compiler optimizations target benchmarks
40
Fallacies and Pitfalls
All exponential laws must come to an end Dennard scaling (constant power density)
Stopped by threshold voltage Disk capacity
30-100% per year to 5% per year Moore’s Law
Most visible with DRAM capacity ITRS disbanded Only four foundries left producing state-of-the-art
logic chips 11 nm, 3 nm might be the limit
41