CS211 Advanced Computer Architecture L01 IntroductionShanghaiTech/Fall-2020/... · CS211 Advanced...
Transcript of CS211 Advanced Computer Architecture L01 IntroductionShanghaiTech/Fall-2020/... · CS211 Advanced...
CS211Advanced Computer Architecture
L01 Introduction
Chundong WangSeptember 7th, 2020
CS211@ShanghaiTech 1
Welcome to CS211• Course information
• Advanced topics related to computer architecture• Prerequisite
• CS110 (Computer Architecture I) of ShanghaiTech University, or equivalents• Reference Textbook
• 计算机体系结构:量化研究方法(英文版·原书第6版). 约翰·L. 亨尼斯(John L. Hennessy),戴维·A. 帕特森 (David A. Patterson)、机械工业出版社、2019年07月、
ISBN 9787111631101
• Instructor• Chundong Wang (王春东)• Office: SIST 1A-504D• Email: wangchd• Office hour: 14:00-16:00, every Thursday in the Fall semester of 2020
• Teaching assistant (TA) of Fall 2020• Ms. Chunyan Rong (荣春燕)• Email: rongchy
• Online• https://toast-lab.gitee.io/courses/CS211@ShanghaiTech/Fall-2020/• Blackboard, Gradescope• WeChat group, in the charge of TA
CS211@ShanghaiTech 2
Why CS211?
• To become a computer architect• Alumni of this class helped design your computer• To make the frequent cases fast and the rare case correct
• To learn what is under the hood of a computer• Innate curiosity• To better understand when things break• To write better code/applications• To write better system software (OS, compiler, DB, etc.)
• Because it is intellectually fascinating!• What is the most complex man-made single device?
Mikko Lipasti -- University of Wisconsin 3
Topics covered by CS211
• Microcode, instructions, ISA• Pipeline• Memory hierarchy
• CPU cache/main memory/disk
• In-order execution and out-of-order execution
• Speculative execution• Branch prediction
• VLIW• Very long instruction word
• Multi-threading
• Vectors• GPU
• Graphics processing unit
• Cache coherence• Memory consistency• Synchronization• Virtual machines• Cluster and distributed
computing• Security and Privacy
CS211@ShanghaiTech 4
Evaluation
CS211@ShanghaiTech 5
Item Percentage Note
Participation and quizzes 10% In the middle of some live lectures
Homework 15% Four or more homeworks to be assigned
Mid-term examination 15% 5th November, Wednesday (tentative)
Literature reading 15% Four (tentative) papers, details to be issued
Labs 20% Five (tentative) labs, details to be announced
Final examination 25% To be decided
Basic Rules
• No Abuse• Abusive words or behaviors are NOT allowed.• Be graceful and respectful.
• No Plagiarism• Do not copy ANYTHING from ANYWHERE at ANYTIME.• Be yourself.
CS211@ShanghaiTech 6
Policy on Assignments and Independent Work• With the exception of laboratories and assignments that explicitly permit you to work in groups, all
homework and projects are to be YOUR work and your work ALONE.• PARTNER TEAMS MAY NOT WORK WITH OTHER PARTNER TEAMS• You can discuss your assignments with other students, and credit will be assigned to students who help
others by answering questions (participation), but we expect that what you hand in is yours.
• Level of detail allowed to discuss with other students: Concepts (Material taught in the class/ in the text book)! Pseudocode is NOT allowed!
• Use the Office Hours of the TA and the Prof. if you need help with your homework/ project!• Rather submit an incomplete homework with maybe 0 points than risking an F!
• It is NOT acceptable to copy solutions from other students.• You can never look at homework/ project code not by you/your team!• You cannot give your code to anybody else –> secure your computer when not around it
• It is NOT acceptable to copy (or start your) solutions from the Web. • It is NOT acceptable to use PUBLIC github/gitlab/gitee archives (giving your answers away)• We have tools and methods, developed over many years, for detecting this. You WILL be caught, and the
penalties WILL be severe.
• At the minimum F in the course, and a letter to your university record documenting the incidence of cheating.
• Both Giver and Receiver are equally culpable and suffer equal penalties7CS211@ShanghaiTech
Timetable before 1st October, 2020
CS211@ShanghaiTech 8
Week Date Lecture Task
1 7 Sept., Monday Introduction & Review (I) Survey
9 Sept., Wednesday Introduction & Review (II) Lab 0 (with credit)
2 14 Sept., Monday Microcode, ISA, etc. HW1, Paper Reading 1
3 21 Sept., Monday Pipeline (I) Paper Reading 2, Lab 0 due,HW2 (tentative)
23 Sept., Wednesday Pipeline (II)
4 28 Sept., Monday Memory Hierarchy (I) Lab 1, HW1 due
L01 Survey
CS211@ShanghaiTech 9
https://www.wjx.cn/jq/89700035.aspx
Now, let’s dive into CA.
CS211@ShanghaiTech 10
Computer Architecture
CS211@ShanghaiTech 11
Algorithm, e.g., KMP, Quicksort
Gates/Register-Transfer Level (RTL)
Application, e.g., WeChat, Meituan
Instruction Set Architecture (ISA)
Operating System/Virtual Machines
Microarchitecture
Devices
Programming Language, e.g., C/C++/Python
Circuits
Physics
Copyright © 2016 Elsevier Ltd. All rights reserved.
12
Computing Devices Long Long ago
EDSAC (Electronic delay storage automatic calculator), University of Cambridge, UK, May 1949
CS211@ShanghaiTech
The 1st computer of China
CS211@ShanghaiTech 13
http://www.ict.cas.cn/kxcb/kxtp/200909/t20090922_2514368.html
http://news.sciencenet.cn/sbhtmlnews/2019/9/349501.shtm
Computer 103 (103机) Kick-off: On August 1st 1958, four instructions were executed on Computer 103 who was with
1KB memory and 30 instructions per second. Products: In all, 41 machines were manufactured. The remaining intact one: Fudan University (复旦大学) purchased it in 1964. In 1973, Fudan
donated it to Qufu Normal University (曲阜师范大学). It was delivered using two trucks with three instructors accompanied. Qufu Normal University built a two-story building for it.
14
Computing Devices Now
Robots
SupercomputersAutomobiles
Laptops
Set-top boxes
Smart phones
ServersMedia Players
Sensor Nets
Routers
CamerasGames
CS211@ShanghaiTech
Cost of software development makes compatibility a major force in market
Architecture continually changing
15
Applications
Technology
Applications suggest how to improve technology, provide revenue to fund development
Improved technologies make new applications possible
CS211@ShanghaiTech
Moore’s Law
16
Gordon MooreIntel Cofounder
Predicts: 2X Transistors / chip
every 2 years
"The number of transistors incorporated in a chip will approximately double every 24 months."—Gordon Moore, Intel co-founder
Single-Thread Processor Performance
17
[ Hen
ness
y &
Pat
ters
on, 2
017
]
CS211@ShanghaiTech
1
10
100
1999 2003 2006 2008 2011 2013 2014 2015 2016 2017
DRAM
Impr
ovem
ent (
log)
Capacity Bandwidth Latency
DRAM Capacity, Bandwidth, Latency Trends
128x
20x
1.3x
Upheaval in Computer Design
• Most of last 50 years, Moore’s Law ruled• Technology scaling allowed continual performance/energy improvements without
changing software model• Last decade, technology scaling slowed/stopped
• Dennard (voltage) scaling over (supply voltage ~fixed)• Moore’s Law (cost/transistor) over?• No competitive replacement for CMOS anytime soon• Energy efficiency constrains everything
• No “free lunch” for software developers, must consider:• Parallel systems
• e.g., AMD released their first dual-core processor, the Athlon 64 X2 3800+ (2.0 GHz, 512 KB L2 cache per core), on April 21, 2005.
• Heterogeneous systems• e.g., ARM’s big.LITTLE• ”LITTLE” processors (e.g., Cortex-A53) are designed for maximum power efficiency while ”big”
processors (e.g., Cortex-A73) are designed to provide maximum compute performance.
19CS211@ShanghaiTech
Today’s Dominant Target Systems• Mobile (smartphone/tablet)
• >1 billion sold/year• Market dominated by ARM-ISA-compatible general-purpose processor in
system-on-a-chip (SoC)• Plus sea of custom accelerators (radio, image, video, graphics, audio,
motion, location, security, etc.)
• Warehouse-Scale Computers (WSCs)• 100,000’s cores per warehouse• Market dominated by x86-compatible server chips• Dedicated apps, plus cloud hosting of virtual machines• Now seeing increasing use of GPUs, FPGAs, custom hardware to accelerate
workloads
• Embedded computing• Wired/wireless network infrastructure, printers• Consumer TV/Music/Games/Automotive/Camera/MP3• Internet of Things!
20CS211@ShanghaiTech
A toy example to review CA
CS211@ShanghaiTech 21
What are not covered by CA?
CS211@ShanghaiTech 22
Programming, and language design Compiling
Data structure and Algorithm
Process scheduling
File operations
What are covered by CA?
CS211@ShanghaiTech 23
Instructions and micro-codesInstruction execution: pipeline, in-order or out-of-order, speculation, etc.
Memory hierarchy: cache, main memory, disk, etc.
Exception, interrupt, etc.
I/OSingle-threaded or multi-threaded
Conventional Machine Structure
24
CA
I/O systemProcessor
CompilerOperatingSystem(Mac OSX)
Application (ex: browser)
Digital Design
Circuit Design
Instruction SetArchitecture
Datapath & Control
transistors
MemoryHardware
Software Assembler
Today: ISA, Pipeline, etc.Next: memory hierarchy, I/O, threading, WSC, etc.
Instruction Set Architecture
CS211@ShanghaiTech 25
ISA: instruction set architecture
9/9/2020 26
instruction set
software
hardware
ISA is the actual programmer-visible instruction set, a critical interface/boundary/contract between software and hardware
An instruction is a single operation, with an opcode and zero or more operands, of a processor, defined by the processor instruction set.
An example, add of RISC-V
CS211@ShanghaiTech 27
add rd, rs1, rs2• add: opcode• rd, rs1, rs2: registers as operands• [rd] [rs1] + [rs2]• No memory access• 32-bit long
0000000 rs2 rs1 000 rd 0110011
Reg-Reg OPrdaddadd rs2 rs1
7 5 5 3 75
31 25 20 15 71224 19 14 11 6 0funct7 rs2 rs1 funct3 rd opcode
7 Dimensions of ISA
• D1: Class of ISA• General-purpose register architectures
• Operands are either registers or memory locations• Two versions
• Register-memory ISA, e.g., 80x86, in which many instructions can access memory• Load-store ISA, e.g., ARMv8 and RISC-V, in which only load or store can access memory
• D2: Memory addressing• Byte-level addressing to access memory operands.
• D3: Addressing modes• How memory addresses are interpreted and specified• Basic: Register, Immediate (constant), Displacement (constant + register)• RISC-V supports these three modes• Other ISAs have more variations of displacement
• e.g., 80x86 has three variations of displacementCS211@ShanghaiTech 28
1. 80x86 has 16 general-purpose registers and 16 registers for floating-point data.
2. RISC-V has 32 general-purpose registers and 32 registers for floating-point data
ARMv8 demands address alignment in load/store. An access to an object of size 𝑠𝑠 bytes at byte address 𝐴𝐴 is aligned if 𝐴𝐴 % 𝑠𝑠 = 0. 80x86 and RISC-V do not require such alignment.
7 Dimensions of ISA (cont.)
• D4: Types and sizes of operands• 8-bit (character), 16-bit (half word), 32-bit (word), and 64-bit (double word) • IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision)
• D5: Operations• Data transfer, arithmetic, control, floating point• Floating point operations are complicated
• Versus fixed point data• Using binary (0/1) to represent real numbers is inexact as precision matters!!!• https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html• https://software.intel.com/sites/default/files/article/164389/fp-consistency-102511.pdf
CS211@ShanghaiTech 29
Data transfer
Arithmetic
Control
7 Dimensions of ISA (cont.)
• D4: Types and sizes of operands• 8-bit (character), 16-bit (half word), 32-bit (word), and 64-bit (double word) • IEEE 754 floating point in 32-bit (single precision) and 64-bit (double precision)
• D5: Operations• Data transfer, arithmetic, control, floating point• Floating point operations are complicated
• Versus fixed point data• Using binary (0/1) to represent real numbers is inexact as precision matters!!!• https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html• https://software.intel.com/sites/default/files/article/164389/fp-consistency-102511.pdf
CS211@ShanghaiTech 30
7 Dimensions of ISA (cont.)
• D6: Control flow instructions• Conditional branch• Unconditional jump• Procedure call• Return
• D7: Encoding an ISA• Fixed length or variable length• ARMv8 and RISC-V instructions are 32-bit long• 80x86 encoding is variable length, ranging from 1 to 18 bytes
CS211@ShanghaiTech 31
Implementing the add instruction of RISC-V
add rd, rs1, rs2
• Instruction makes two changes to machine’s state:• Reg[rd] = Reg[rs1] + Reg[rs2]• PC = PC + 4 %%% To get next instruction
0000000 rs2 rs1 000 rd 0110011
Reg-Reg OPrdaddadd rs2 rs1
7 5 5 3 75
31 25 20 15 71224 19 14 11 6 0funct7 rs2 rs1 funct3 rd opcode
32
Datapath for add
+4
Add
clk
addrinst
IMEM
PCpc+4
Inst[24:20]ALU+
clk
Reg [ ]
Inst[19:15]
Inst[11:7]
AddrB
AddrA DataA
DataB
AddrD
DataD
aluReg[rs1]
Reg[rs2]
Inst[31:0]
Control logic
RegWriteEnable (RegWEn)=1
add 5 5 add Reg-Reg OP5
31 25 20 15 71224 19 14 11 6 00000000 rs2 rs1 000 rd opcode
Reg[rd] = Reg[rs1] + Reg[rs2]PC = PC + 4
33
From architecture to microarchitecture
• Instructions are visible to programmers• Two processors may have the same ISA but different microarchitectures
• e.g., AMD Opteron and Intel Core i7, with the same 80x86 ISA, have very different pipelines and cache organizations.
• An instruction is partitioned into multiple stages• Five classic stages of executing an instruction
• Instruction fetch• Instruction decode/register fetch• Execute• Memory access• Write back
CS211@ShanghaiTech 34
Stages of Execution on Datapath
inst
ruct
ion
mem
ory
+4
rs2rs1rd
regi
ster
s
ALU
Dat
am
emor
y
imm
1. InstructionFetch
2. Decode/RegisterRead
3. Execute 4. Memory 5. Write Back
PC
35
Pipeline
CS211@ShanghaiTech 36
Instruction execution one by one
CS211@ShanghaiTech 37
Inst. No. 1 2 3 4 5 6 7 8 9 10 11 12 13
i (add) F D X M W
i+1 F D X
i+2 F D X M
i+3 F
i+4
Low utilization of hardware
Idle hardware
inst
ruct
ion
mem
ory
+4
rs2rs1rd
regi
ster
s
ALU
Dat
am
emor
y
imm
1. InstructionFetch
2. Decode/RegisterRead
3. Execute 4. Memory 5. Write Back
PC
38
Pipeline: instruction-level parallelism
CS211@ShanghaiTech 39
Inst. No. 1 2 3 4 5 6 7 8 9 10 11 12 13
i F D X M W
i+1 F D X
i+2 F D X M
i+3 F D X M W
i+4 F D X M W
Well, an imperfect world
• Hazards• A situation that prevents starting the next instruction in the next clock cycle• A hazard may stall the pipeline
• Structural hazard• A required hardware resource is busy
• Data hazard• An instruction depends on the result(s) of a previous instruction
• Control hazard• The pipelining of braches and other instructions that change the PC
CS211@ShanghaiTech 40
Structural Hazard
add t0, t1, t2
lw t5, 0(t4)
sll t6, t0, t3
instruction sequence
sw t0, 4(t3)
lw t0, 8(t3)
• Instruction and data memory used simultaneously Use two separate
memories
41
A required hardware resource is busy
Data hazard
CS211@ShanghaiTech 42
An instruction depends on the result(s) of a previous instruction
add t0, t1, t2
lw t5, 0(t4)
sll t6, t0, t3
instruction sequence
sw t0, 4(t3)
lw t0, 8(t3)
Solution 1: stall the pipeline Solution 2: register register, fast enough to do write then read in one clock cycle? Solution 3: forwarding
Control Hazard
• Branches• We need to calculate next PC
• Exception• Something unusual that breaks the normal flow of instruction execution• Exception handling
CS211@ShanghaiTech 43
Pipeline may stall due to branch
Jump to next instruction, with
a new PC
“assert” may terminate the program
Control Hazard
• Branches• We need to calculate next PC
• Exception• Something unusual that breaks the normal flow of instruction execution• Exception handling
CS211@ShanghaiTech 44
Control Hazard
• Branches• We need to calculate next PC
• Exception• Something unusual that breaks the normal flow of instruction execution• Exception handling, for synchronous and asynchronous events
• Jump to exception handler program, and return back if possible
CS211@ShanghaiTech 45
PCInst. Mem D Decode E M
Data Mem W+
Illegal Opcode Overflow Data address
ExceptionsPC address Exception
Asynchronous Interrupts
Conclusion
• The domain of CA• Many interesting topics• CA is a set of rules and methods that describe the functionality, organization,
and implementation of computer systems (from wikipedia)• CA describes the capabilities and programming model of a computer but not
a particular implementation (also from wikipedia)
• The evolution of CA• New challenges emerge from time to time
• Review of ISA and Pipeline
CS211@ShanghaiTech 46
47
Acknowledgements
• These slides contain materials developed and copyright by:• Prof. Krste Asanovic (UC Berkeley)• Prof. Xuehai Zhou (USTC)• Prof. Onur Mutlu (ETH Zurich)• Prof. Mikko Lipasti (UW-Madison)• Prof. Sören Schwertfeger (ShanghaiTech) • Prof. Kenji Kise (Tokyo Tech)• Prof. Wenzhi Chen (ZJU)
CS211@ShanghaiTech