POWER PC
Where can we find POWER-PC?
What do the world’s fastest supercomputers, network and communications equipment such as Internet routers and switches, the Mars Rover, consumer electronics such as set-top boxes, and game consoles all have in common?
They are powered by microprocessors based on IBM’s Power Instruction Set Architecture (POWER-ISA).
POWER is a RISC instruction set architecture designed by IBM.
The name is an acronym for Performance Optimization With Enhanced RISC.
Topical Outline
Introduction to POWER-PC
* History
* Current Status
* PowerPC Architecture
* How Instruction execution differs from other Microprocessors?
* Design principles
* Registers
* Data Types
* Instruction Types
MPC8640D Dual core PowerPC Processor
* e600 PowerPC Core Features
* PowerPC e600 Core Pipeline Stages
AltiVec Vector Engine in e600
* What is vector Processing?
* SIMD Intra element Instructions
* The Four Vector Engines in e600
* AltiVec Characteristics
* AltiVec Software Enablement for Vector Signal and Image Processing
* Key Areas of Bandwidth, Performance and Computation Abilities
Quad Power-PC Processor Card
Introduction to POWER-PC
History
* POWER-PC stands for Performance Optimization With Enhanced RISC - Performance Computing.
* IBM introduced POWER-ISA in 1990 with the RS/6000.
* In 1991, a group from IBM, Motorola and Apple decided to design a new architecture based on POWER-ISA, which led to the development of POWER-PC.
* The aim was to form the basis of a new generation of high-performance, low-cost superscalar products ranging from embedded controllers to massively parallel supercomputers.
* The first products were delivered near the end of 1993. Early implementations include the PowerPC 601, 603 and 604.
Current Status
* PowerPC e200: 32-bit Power Architecture microprocessor, speeds up to 600 MHz, ideal for embedded applications.
* PowerPC e300: similar to the e200, with speeds up to 667 MHz.
* PowerPC e600: speeds up to 2 GHz, ideal for high-performance routing and telecommunications applications.
* POWER5: IBM dual-core microprocessor.
* POWER6: IBM dual-core microprocessor. A notable difference from POWER5 is that the POWER6 executes instructions in-order instead of out-of-order.
* PowerPC G3: used in Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks and several desktops, including both the Beige and the Blue and White Power Macintosh G3s.
* PowerPC G4: a designation used by Apple Computer for a fourth generation of 32-bit PowerPC microprocessors.
* PowerPC G5: 64-bit Power Architecture processors.
* Xenon: based on IBM’s PowerPC ISA, used in the Xbox 360 game console.
* Broadway: based on IBM’s PowerPC ISA, used in the Nintendo Wii game console.
* Blue Gene/L: dual-core PowerPC 440, 700 MHz, 2004.
* Blue Gene/P: quad-core PowerPC 450, 850 MHz, 2007.
PowerPC Architecture
POWER-PC is a high-performance superscalar design supporting multiple independent execution units, including an Integer Unit, a Floating Point Unit and a Branch Processing Unit.
* Standard, fixed instruction format.
* Single-cycle execution of most instructions.
* Memory access is available only to load and store instructions.
* Other instructions are register-to-register operations; because of this, the execution units can run faster.
* A small number of machine instructions and instruction formats.
* A large number of general-purpose registers.
* A small number of addressing modes.
How Instruction execution differs from other Microprocessors?
General processor based on CISC Architecture
To multiply two operands in memory, as shown in the figure:
Assembly Instructions
MUL 2:3, 5:2
Here the Execution Units access memory directly.
POWER-PC based on RISC Architecture
To multiply two operands in memory, as shown in the figure:
Assembly Instructions
LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A
* The Load and Store unit fetches the operand at 2:3 from memory into Reg A.
* The Load and Store unit fetches the operand at 5:2 from memory into Reg B.
* The Integer unit computes the product of A and B and stores the result in A.
* The Load and Store unit stores the result held in Reg A to 2:3 in memory.
Here the Execution Units operate only on processor registers, with single-cycle throughput.
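The contrast between the two instruction sequences can be sketched with a toy model. This is only an illustration: the dictionary-based memory, the register names, and the function names `cisc_mul` and `risc_mul` are assumptions made for this sketch and do not correspond to real PowerPC mnemonics.

```python
# Toy model: memory is a dict keyed by the "row:column" labels used on the slide.

def cisc_mul(memory, dst, src):
    """CISC style (MUL 2:3, 5:2): one instruction reads and writes memory directly."""
    memory[dst] = memory[dst] * memory[src]


def risc_mul(memory, dst, src):
    """RISC load/store style: only LOAD and STORE touch memory."""
    regs = {}
    regs["A"] = memory[dst]            # LOAD A, 2:3
    regs["B"] = memory[src]            # LOAD B, 5:2
    regs["A"] = regs["A"] * regs["B"]  # PROD A, B  (register-to-register)
    memory[dst] = regs["A"]            # STORE 2:3, A
    return regs


cisc_memory = {"2:3": 6, "5:2": 7}
risc_memory = {"2:3": 6, "5:2": 7}
cisc_mul(cisc_memory, "2:3", "5:2")
risc_regs = risc_mul(risc_memory, "2:3", "5:2")
```

Both variants leave the same product at 2:3. The RISC variant needs four simple instructions instead of one complex one, but its arithmetic step never waits on memory.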
Design principles
Simplicity favors regularity
* Standard 32-bit instruction format for all instructions.
* Fixed-length instructions, register-to-register architecture, three-operand instruction format.
Smaller is faster
* Three categories of registers, each handling specific instructions, so presumably faster access time.
Make the common case fast
* Integer and floating point instructions.
Good design demands good compromises
* To align with RISC principles, many instructions that required three source operands were eliminated.
* Many complex instructions were curtailed to conform with RISC principles, but this is compensated by a large number of mnemonics that increase the number of instructions.
General
All PowerPC processors run the same core PowerPC instruction set.
It is independent of implementation aspects.
It allows anyone to design and fabricate compatible PowerPC processors independent of implementation differences as the technology advances.
They differ primarily in the degree of dedicated hardware support for multiple execution units, cache size and capability, length of pipeline, and interface busses.
These differences result in different tradeoffs in processing performance, die area, and power dissipation.
Initialization
When the processor is first initialized, it is in supervisor (also called privileged) mode. In this mode, all processor resources, including registers and instructions are accessible.
The processor can limit access to certain privileged registers and instructions by placing itself in user mode.
This protection limits application code from being able to modify global and sensitive resources, such as the caches, memory management system, and timers.
The architecture defines five types of registers:
* Special Purpose Registers (SPRs)
* General Purpose Registers (GPRs)
* Floating Point Registers (FPRs)
* Device Control Registers (DCRs)
* Machine State Register (MSR)
Registers
SPRs give status and control of resources within the processor core.
Five important user mode SPRs are:
The Fixed-Point Exception Register (XER) is used for indicating conditions for integer operations, such as carries and overflows.
The Floating-Point Status and Control Register (FPSCR) is a 32-bit register used to store the status and control of the floating-point operations.
The Count Register (CTR) is used to hold a loop count that can be decremented during the execution of branch instructions.
The Condition Register (CR) is a 32-bit register grouped into eight fields, where each field is 4 bits that signify the result of an instruction’s operation: Equal (EQ), Greater Than (GT), Less Than (LT), and Summary Overflow (SO).
The Link Register (LR) contains the address to return to at the end of a function call.
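As a rough illustration of the Condition Register layout described above, the sketch below extracts one 4-bit field from a 32-bit CR value. It assumes the Power ISA convention that field CR0 occupies the most significant four bits and that the bits within a field are ordered LT, GT, EQ, SO; the helper name `cr_field` is hypothetical.

```python
CR_FLAGS = ("LT", "GT", "EQ", "SO")  # assumed bit order within each 4-bit field


def cr_field(cr, n):
    """Return the flags of CR field n (0..7) from a 32-bit CR value."""
    if not 0 <= n <= 7:
        raise ValueError("CR has eight fields, numbered 0 to 7")
    # CR0 is the most significant nibble, CR7 the least significant.
    nibble = (cr >> (28 - 4 * n)) & 0xF
    return {name: bool(nibble & (8 >> i)) for i, name in enumerate(CR_FLAGS)}


cr0_equal = cr_field(0x20000000, 0)  # only the EQ bit of CR0 is set
cr7_less = cr_field(0x00000008, 7)   # only the LT bit of CR7 is set
```

A conditional branch typically tests a single one of these bits, which is why a compare instruction can target any of the eight fields independently.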
General Purpose Registers :
The Architecture specifies that all implementations have 32 GPRs (GPR0 - GPR31).
GPRs are the source and destination of all fixed-point operations and load/store operations. They also provide access to SPRs and DCRs.
They are all available for use in every instruction with one exception: In certain instructions, GPR0 simply means “0” and no lookup is done for GPR0’s contents.
Floating Point Registers :
The PowerPC architecture provides thirty-two 64-bit floating-point registers.
Device Control Registers :
DCRs are similar to SPRs in that they give status and control information, but DCRs are for resources outside the processor core.
DCRs allow for memory-mapped I/O control without using up portions of the memory address space.
Machine State Register :
MSR represents the state of the machine.
It is accessed only in supervisor mode, and contains the settings for things such as memory translation, cache settings, interrupt enables, user/privileged state, and floating point availability. Exact control bits vary by implementation.
The MSR does not readily fit into the SPR/GPR classification, as it has its own pair of instructions to move its contents to and from a GPR.
Data Types
PowerPC can handle data of 8 bits (byte), 16 bits (half word), 32 bits (word) and 64 bits (double word) in length. It can use either little-endian or big-endian byte ordering; that is, the least significant byte is stored at the lowest or the highest address, respectively.
Fixed-point data types include:
* Unsigned byte
* Unsigned half word
* Signed half word
* Unsigned word
* Signed word
* Unsigned double word
* Byte Strings: From 0 – 128 bytes in length
Floating-point data types include IEEE-754 single- and double-precision types.
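The two byte orderings and the signed/unsigned distinction can be illustrated on any host with Python's standard `struct` module. This is a sketch of the concepts only, not PowerPC code; the sample values are arbitrary.

```python
import struct

value = 0x11223344

# Big-endian: most significant byte at the lowest address.
big_endian_word = struct.pack(">I", value)
# Little-endian: least significant byte at the lowest address.
little_endian_word = struct.pack("<I", value)

# The same two bytes read as a signed and as an unsigned half word.
raw_half_word = b"\xff\xfe"
signed_half = struct.unpack(">h", raw_half_word)[0]
unsigned_half = struct.unpack(">H", raw_half_word)[0]

# IEEE-754 sizes: single precision is 4 bytes, double precision is 8 bytes.
single_size = struct.calcsize(">f")
double_size = struct.calcsize(">d")
```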
Instruction Format
The architecture encodes all instructions in 32 bits and aligns them on word address boundaries in memory.
Instructions are first decoded by the upper 6 bits, in a field called the primary opcode. The remaining 26 bits contain operands and/or reserved fields.
The instruction types defined are: ALU, Floating Point, Load/Store, Branch, Condition and Synchronization instructions.
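The primary-opcode decoding step can be sketched as a few shifts and masks. The field split after the opcode (a 5-bit target register, a 5-bit base register and a 16-bit signed immediate) is the Power ISA "D-form" layout, which is an assumption brought in for this example since the slides do not list the individual formats; the function name `decode_d_form` is hypothetical.

```python
def decode_d_form(word):
    """Split a 32-bit instruction word into (opcode, rt, ra, immediate)."""
    word &= 0xFFFFFFFF
    opcode = word >> 26          # upper 6 bits: primary opcode
    rt = (word >> 21) & 0x1F     # next 5 bits: target register
    ra = (word >> 16) & 0x1F     # next 5 bits: base/source register
    imm = word & 0xFFFF          # lower 16 bits: immediate
    if imm & 0x8000:             # sign-extend the 16-bit immediate
        imm -= 0x10000
    return opcode, rt, ra, imm


# 0x38600001 is the encoding of "addi r3, 0, 1" (primary opcode 14).
decoded = decode_d_form(0x38600001)
```

Because every instruction is exactly one aligned word, a decoder can always find the opcode in the same six bits, which is part of what keeps the front end simple.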
Instruction Types
Addressing Modes
Three types of operand addressing:
Memory operand addressing:
* Indirect addressing: base address in a GPR + a 16-bit sign-extended literal.
* Indirect-indexed addressing: base address in a GPR + displacement from another GPR.
ALU and Floating-point instruction operand addressing:
* Three-register format.
Branch operand addressing:
* Absolute: use the literal as the absolute address.
* Relative: use the literal as the displacement from the branch instruction address.
* Indirect: take the target address from the LR or CTR registers.
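The two memory addressing modes and the relative branch mode can be sketched as simple address arithmetic. The sketch assumes 32-bit addresses and applies the rule mentioned earlier that GPR0 used as a base register means the value 0; all function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumption: 32-bit effective addresses


def sign_extend16(literal):
    """Interpret the low 16 bits of literal as a signed value."""
    literal &= 0xFFFF
    return literal - 0x10000 if literal & 0x8000 else literal


def base_value(gprs, ra):
    """As a base register, GPR0 means the constant 0, not the register contents."""
    return 0 if ra == 0 else gprs[ra]


def ea_indirect(gprs, ra, literal16):
    """Indirect addressing: base GPR + 16-bit sign-extended literal."""
    return (base_value(gprs, ra) + sign_extend16(literal16)) & MASK32


def ea_indexed(gprs, ra, rb):
    """Indirect-indexed addressing: base GPR + displacement from another GPR."""
    return (base_value(gprs, ra) + gprs[rb]) & MASK32


def branch_relative(branch_address, displacement):
    """Relative branch: displacement from the branch instruction address."""
    return (branch_address + displacement) & MASK32


gprs = [0] * 32
gprs[0] = 0xDEAD   # ignored when GPR0 is used as a base
gprs[4] = 0x1000
gprs[5] = 0x20
```

With these register values, `ea_indirect(gprs, 4, 0xFFFC)` steps back four bytes from the base because the literal is sign-extended to -4.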
Power PC MPC8640D Dual core PowerPC Processor
Fig: Block Diagram of MPC8641D Dual core PowerPC Processor
MPC8640D PowerPC Processor Specifications
e600 PowerPC Core Features
* High-performance, 32-bit superscalar microprocessor that implements the PowerPC architecture
* Eleven independent execution units and three register files:
* Branch processing unit (BPU)
* Four integer units (IUs) that share 32 GPRs for integer operands
* 64-bit floating-point unit (FPU)
* Four vector units and a 32-entry vector register file (VRs)
* Three-stage load/store unit (LSU)
* Three issue queues, FIQ, VIQ, and GIQ, can accept as many as one, two, and three instructions, respectively, in a cycle
* Dispatch unit
* Completion unit
* Two separate 32-Kbyte instruction and data level 1 (L1) caches
* Integrated 1-Mbyte, eight-way set-associative unified instruction and data level 2 (L2) cache with ECC
* 36-bit real addressing
* Separate memory management units (MMUs) for instructions and data
* Multiprocessing support features
* Power and thermal management
* Performance monitor
* In-system testability and debugging feature
MPC8640D PowerPC
Superscalar Microprocessor with Instruction Level Parallelism and Seven Stage Pipeline Execution: allows multiple instructions to be executed in parallel
High-performance superscalar e600 core
* As many as 4 instructions can be fetched from the instruction cache at a time.
* As many as 3 instructions can be dispatched to the issue queues at a time.
* As many as 12 instructions can be in the instruction queue (IQ).
* As many as 16 instructions can be at some stage of execution simultaneously.
* Single-cycle execution for most instructions
* One-instruction throughput per clock cycle for most instructions
* Seven-stage pipeline control
Execution Units
* BPU : Branch Processing Unit
* VPU : Vector Permute Unit
* VIU : Vector Integer Unit
* VFPU : Vector Floating Point Unit
* FPU : Floating Point Unit
* IU : Integer Unit
* LSU : Load/Store Unit
MPC8640D with e600 PPC core microarchitecture, with emphasis on the pipeline stages of the front end and the functional units.
Fig: e600 POWER-PC Core
PowerPC e600 Core Pipeline Stages
Stages 1 and 2 - Instruction Fetch:
These two stages are both dedicated primarily to grabbing an instruction from the L1 cache.
The e600 can fetch four instructions per clock cycle from the L1 cache and send them on to the next stage
Stage 3 - Decode/Dispatch:
Once an instruction has been fetched, it goes into a 12-entry instruction queue to be decoded.
The e600's decoder can dispatch up to three instructions
per clock cycle to the next stage.
Stage 4 - Issue:
The first queue is the Floating-Point Issue Queue (FIQ), which holds floating-point (FP) instructions that are waiting to be executed.
The second is the Vector Issue Queue (VIQ), which holds vector operations.
The third queue is the General Instruction Queue (GIQ), which holds everything else.
Once the instruction leaves its issue queue, it goes to the execution engine to be executed.
Stage 5 - Execute:
The instructions can pass out-of-order from their issue queues into their respective functional units and be executed.
Stage 6 and 7 - Complete and Write-Back :
In these two stages, the instructions are put back into the order in which they came into the processor, and their results are written back to the register files (or, for stores, to memory).
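The interaction between the fetch width, the 12-entry instruction queue and the dispatch width can be explored with a very rough cycle-counting model. This sketch covers only the front end described above and ignores cache misses, branches, dependencies and the per-queue issue limits, so its cycle counts are illustrative and not a prediction of real e600 timing. The function name and parameters are assumptions.

```python
def simulate_front_end(n_instructions, fetch_width=4, dispatch_width=3,
                       queue_capacity=12):
    """Count cycles until n_instructions have been dispatched.

    Each cycle: dispatch up to dispatch_width instructions that were already
    in the instruction queue, then fetch up to fetch_width new ones if the
    queue has room.
    """
    remaining_to_fetch = n_instructions
    in_queue = 0
    dispatched = 0
    cycles = 0
    while dispatched < n_instructions:
        cycles += 1
        # Decode/dispatch stage: drain from the instruction queue.
        leaving = min(dispatch_width, in_queue)
        in_queue -= leaving
        dispatched += leaving
        # Fetch stage: refill the queue from the L1 cache.
        arriving = min(fetch_width, queue_capacity - in_queue, remaining_to_fetch)
        in_queue += arriving
        remaining_to_fetch -= arriving
    return cycles


cycles_for_12 = simulate_front_end(12)
cycles_for_3 = simulate_front_end(3)
```

In this model the dispatch width of three, not the fetch width of four, bounds steady-state throughput, and the queue absorbs the difference.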
AltiVec Vector Engine in e600
AltiVec is a floating-point and integer SIMD (Single Instruction, Multiple Data) instruction set designed and owned by Apple, IBM and Freescale Semiconductor (formerly the Semiconductor Products Sector of Motorola), together known as the AIM alliance, and implemented on versions of the PowerPC.
The Vector Processing Unit is branded with several names:
* IBM: Vector Multimedia Extension (VMX)
* Apple: Velocity Engine
* Freescale: AltiVec
What is vector Processing?
* A vector architecture allows the simultaneous processing of multiple data items in parallel.
* Operations are performed on multiple data elements by a single instruction.
* This is referred to as Single Instruction Multiple Data (SIMD) parallel processing.
* For example, the vector addition instruction VT = (VA + VB) is computed with single-cycle latency and single-cycle throughput.
* The multiply-and-accumulate instruction VT = (VA * VB) + VC is computed with 5-cycle latency and single-cycle throughput.
Vectors are 128 bits in size and can hold:
* an array of 16 characters
* an array of 8 short integers
* an array of 4 long integers
* an array of 4 SP floating-point numbers
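The idea that one 128-bit register holds 16, 8 or 4 independent elements can be sketched in plain Python by treating a vector as a 128-bit integer split into lanes. This models only the lane-wise behaviour of a modulo vector add, not real AltiVec intrinsics; the helper names and the choice of element 0 as the most significant lane are assumptions.

```python
VECTOR_BITS = 128


def join_lanes(lanes, lane_bits):
    """Pack a list of lane values into one 128-bit integer (element 0 first)."""
    assert len(lanes) * lane_bits == VECTOR_BITS
    mask = (1 << lane_bits) - 1
    value = 0
    for lane in lanes:
        value = (value << lane_bits) | (lane & mask)
    return value


def split_lanes(vector, lane_bits):
    """Unpack a 128-bit integer into its lanes (element 0 first)."""
    count = VECTOR_BITS // lane_bits
    mask = (1 << lane_bits) - 1
    return [(vector >> (VECTOR_BITS - lane_bits * (i + 1))) & mask
            for i in range(count)]


def vector_add(va, vb, lane_bits):
    """VT = VA + VB per lane, modulo arithmetic, no carry between lanes."""
    mask = (1 << lane_bits) - 1
    sums = [(a + b) & mask
            for a, b in zip(split_lanes(va, lane_bits), split_lanes(vb, lane_bits))]
    return join_lanes(sums, lane_bits)


va = join_lanes([1, 2, 3, 4], 32)
vb = join_lanes([10, 20, 30, 40], 32)
word_sums = split_lanes(vector_add(va, vb, 32), 32)

# Sixteen 8-bit lanes: 255 + 1 wraps to 0 in every lane independently.
byte_sums = vector_add(join_lanes([255] * 16, 8), join_lanes([1] * 16, 8), 8)
```

The same 128 bits give different results depending on the lane width chosen by the instruction, which is why each AltiVec operation comes in byte, half word and word variants.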
SIMD Intra element Instructions
The Four Vector Engines in e600
AltiVec Vector Permute Unit (VPU)
The VPU executes permutation instructions such as pack, unpack, merge, splat, and permute on vector operands.
AltiVec Vector Integer Unit 1 (VIU1)
The VIU1 executes simple vector integer computational instructions, such as addition, subtraction, maximum and minimum comparisons, averaging, rotation, shifting, comparisons, and Boolean operations.
AltiVec Vector Integer Unit 2 (VIU2)
The VIU2 executes longer-latency vector integer instructions, such as multiplication, multiplication/addition, and sum-across with saturation.
AltiVec Vector Floating-Point Unit (VFPU)
The VFPU executes all vector floating-point instructions.
AltiVec computational instructions are executed in four independent, pipelined AltiVec execution units. A maximum of two AltiVec instructions can be issued out-of-order to any combination of AltiVec execution units per clock cycle from the bottom two VIQ entries (VIQ1–VIQ0). This means an instruction in VIQ1 does not have to wait for an instruction in VIQ0 that is waiting for operand availability. Moreover, the VIU2, VFPU, and VPU are pipelined, so they can operate on multiple instructions. The VPU has a two-stage pipeline; the VIU2 and VFPU each have four-stage pipelines. As many as ten AltiVec instructions can be executing concurrently.
AltiVec Characteristics
* 128b vector size: 4, 8 or 16 data elements
* Separate register file with 32 register namespace
* Vector-element data type of 8-, 16-, 32-bit signed / unsigned int, and IEEE SP float
* 162 instructions
* Intra- and inter-element arithmetic instructions
* Intra- and inter-element conditional instructions
* Powerful Permute, Shift and Rotate, Splat, Pack/Unpack and Merge instructions
* Saturation or modulo arithmetic
* Four-operand, nondestructive instruction format: three sources, one destination
* Modeless operation for zero-overhead use of AltiVec instructions
* Simultaneous dispatch of one ALU-class vector and one permute-class vector, or either paired with a vector load/store
* Peak throughput of 2 instructions per cycle
* All instructions fully pipelined with single-cycle throughput
* Simple ops: 1 cycle latency
* Compound ops: 3-4 cycle latency
* No restriction on issue with scalar instructions
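The difference between saturation and modulo arithmetic, listed among the characteristics above, is easiest to see on 8-bit elements. The functions below are a plain-Python sketch of the two behaviours applied lane by lane; the names are hypothetical and do not correspond to AltiVec mnemonics.

```python
def add_modulo_u8(a, b):
    """Unsigned 8-bit add that wraps around on overflow."""
    return (a + b) & 0xFF


def add_saturate_u8(a, b):
    """Unsigned 8-bit add that clamps at 255 instead of wrapping."""
    return min(a + b, 0xFF)


def add_saturate_s8(a, b):
    """Signed 8-bit add that clamps to the range -128..127."""
    return max(-128, min(127, a + b))


def lanewise(op, va, vb):
    """Apply a two-operand element operation across two equal-length vectors."""
    return [op(a, b) for a, b in zip(va, vb)]


pixels = [200, 10, 255, 0] * 4        # sixteen 8-bit elements
brighten = [100] * 16
wrapped = lanewise(add_modulo_u8, pixels, brighten)
clamped = lanewise(add_saturate_u8, pixels, brighten)
```

Saturation is usually what image and audio code wants: a bright pixel made brighter stays at the maximum value instead of wrapping around to a dark one.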
AltiVec Software Enablement for Vector Signal and Image Processing
Tool and product support
* Compilers: GCC, GHS, WR
* Conversion tools: SSE to AltiVec
* Vectorization: linear to parallel code
AltiVec libraries
* Intrinsic optimized libraries from Freescale
Ecosystem libraries
* OpenSAL, VSIPL, OpenCV, OpenGL ES
Multi-core/multi-threading
* VSIPL++, multicore SAL
Key Areas of Bandwidth, Performance and Computation Abilities
Summary of floating point calculations for e600 @ 1 GHz:
* 0.8 SP or DP GFLOPS (one core only, regular FP instructions).
* 1.6 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
* 4.8 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
* 9.6 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).
Summary of floating point calculations for e600 @ 1.5 GHz:
* 1.2 SP or DP GFLOPS (one core only, regular FP instructions).
* 2.4 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
* 7.2 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
* 14.4 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).
Along with this, each core has 4 Integer Units (IU1, IU2, IU3, IU4), one Vector Permute Unit (VPU), and two Vector Integer Units (VIU1, VIU2) to meet computation requirements.
The computation time required to perform a 1024-point complex-to-complex FFT is 10 microseconds.
Summary of floating point calculations for MPC8640D @ 1 GHz:
* 1.6 SP or DP GFLOPS (dual core, regular FP instructions).
* 3.2 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
* 9.6 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
* 19.2 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).
Summary of floating point calculations for MPC8640D @ 1.5 GHz:
* 2.4 SP or DP GFLOPS (dual core, regular FP instructions).
* 4.8 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
* 14.4 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
* 28.8 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).
Along with this, each core has 4 Integer Units (IU1, IU2, IU3, IU4), one Vector Permute Unit (VPU), and two Vector Integer Units (VIU1, VIU2) to meet computation requirements.
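The GFLOPS figures listed above follow a simple pattern that can be reproduced arithmetically. The per-cycle rates used below (0.8 floating-point operations per cycle for the scalar FPU, 4 single-precision lanes for the AltiVec VFPU, and a factor of 2 when every instruction is a multiply-add) are not stated in the slides; they are inferred from the quoted numbers and should be read as assumptions.

```python
SCALAR_FLOPS_PER_CYCLE = 0.8   # assumption inferred from "0.8 GFLOPS @ 1 GHz"
VECTOR_LANES = 4               # four SP floats per 128-bit vector
MULTIPLY_ADD_FACTOR = 2        # a multiply-add counts as two operations


def peak_gflops(clock_ghz, cores=1, use_altivec=False, multiply_add=False):
    """Peak GFLOPS under the assumed per-cycle rates above."""
    per_cycle = SCALAR_FLOPS_PER_CYCLE
    if use_altivec:
        per_cycle += VECTOR_LANES
    if multiply_add:
        per_cycle *= MULTIPLY_ADD_FACTOR
    return round(per_cycle * clock_ghz * cores, 1)


single_core_1ghz = peak_gflops(1.0, use_altivec=True, multiply_add=True)
dual_core_1_5ghz = peak_gflops(1.5, cores=2, use_altivec=True, multiply_add=True)
```

These are peak numbers: sustained throughput depends on keeping both the FPU and the VFPU supplied with multiply-add instructions and data every cycle.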
COTS VPX Card with Quad Power-PC Processors (MPC8640D)
Quad Power-PC Processor Card
SMP OS running on a dual-core processor
Node A
Node B
Node C
Node D
Embedded Controller
Node A, Node B and Node C will be used for signal processing.
Quad Processor card each running SMP OS
Conclusion
Overall, the PowerPC is a better architecture: it is capable of handling more instructions, it supports more operations for branching and floating point, and it handles the various complexities in data and memory more efficiently.