POWER PC

Where we can find POWER-PC

What do the world’s fastest supercomputer, network and communications equipment such as Internet routers and switches, the Mars Rover, consumer electronics such as set-top boxes, and game consoles all have in common?

They are powered by microprocessors based on IBM’s Power Instruction Set Architecture (POWER-ISA).

POWER is a RISC instruction set architecture designed by IBM.

The name is an acronym for Performance Optimization With Enhanced RISC.

Topical Outline

* Introduction to POWER-PC
  * History
  * Current Status
  * PowerPC Architecture
  * How Instruction execution differs from other Microprocessors?
  * Design principles
  * Registers
  * Data Types
  * Instruction Types
* MPC8640D Dual core PowerPC Processor
  * e600 PowerPC Core Features
  * PowerPC e600 Core Pipeline Stages
* AltiVec Vector Engine in e600
  * What is vector Processing?
  * SIMD Intra element Instructions
  * The Four Vector Engines in e600
  * AltiVec Characteristics
  * AltiVec Software Enablement for Vector Signal and Image Processing
  * Key Areas of Bandwidth, Performance and Computation Abilities
* Quad Power-PC Processor Card

Introduction to POWER-PC

History

* POWER-PC stands for Performance Optimization With Enhanced RISC - Performance Computing.
* IBM introduced POWER-ISA in 1990 with the RS/6000.
* In 1991, a group from IBM, Motorola and Apple decided to design a new architecture based on POWER-ISA, which led to the development of POWER-PC.
* The aim was to form the basis of a new generation of high-performance, low-cost superscalar products ranging from embedded controllers to massively parallel supercomputers.
* The first products were delivered near the end of 1993.
* Early implementations include the PowerPC 601, 603 and 604.

Current Status

* PowerPC e200 - 32-bit Power Architecture microprocessor - speeds ranging up to 600 MHz - ideal for embedded applications.
* PowerPC e300 - similar to e200, with an increase in speed up to 667 MHz.
* PowerPC e600 - speeds up to 2 GHz - ideal for high-performance routing and telecommunications applications.
* POWER5 - IBM - dual-core microprocessor.
* POWER6 - IBM - dual-core microprocessor. A notable difference from POWER5 is that the POWER6 executes instructions in-order instead of out-of-order.
* PowerPC G3 - used in Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks and several desktops, including both the Beige and Blue and White Power Macintosh G3s.
* PowerPC G4 - a designation used by Apple Computer to describe a fourth generation of 32-bit PowerPC microprocessors.
* PowerPC G5 - 64-bit Power Architecture processors.
* Xenon - based on IBM’s PowerPC ISA - Xbox 360 game console.
* Broadway - based on IBM’s PowerPC ISA - Nintendo Wii gaming console.
* Blue Gene/L - dual-core PowerPC 440, 700 MHz, 2004.
* Blue Gene/P - quad-core PowerPC 450, 850 MHz, 2007.

PowerPC Architecture

POWER-PC is a high-performance superscalar design supporting multiple independent execution units, including an Integer Unit, a Floating Point Unit and a Branch Processing Unit.

* Standard, fixed instruction format.
* Single-cycle execution of most instructions.
* Memory access is available only through load and store instructions.
* Other instructions are register-to-register operations; because of this, the execution units can run faster.
* A small number of machine instructions and instruction formats.
* A large number of general-purpose registers.
* A small number of addressing modes.

How Instruction execution differs from other Microprocessors?

General processor based on CISC architecture

To multiply two operands in memory, as shown in the figure:

Assembly Instructions

MUL 2:3, 5:2

Here the execution units access memory directly.

POWER-PC based on RISC architecture

To multiply the same two operands in memory, as shown in the figure:

Assembly Instructions

LOAD A, 2:3
LOAD B, 5:2
PROD A, B
STORE 2:3, A

* The load/store unit fetches the operand at 2:3 from memory into Reg A.
* The load/store unit fetches the operand at 5:2 from memory into Reg B.
* The integer unit computes the product of A and B and stores the result in A.
* The load/store unit stores the result held in Reg A at 2:3 in memory.

Here the execution units operate only on processor registers, with single-cycle throughput.
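
The contrast between the two instruction sequences can be sketched as a toy model. This is only an illustration, not how either processor is implemented: memory is a Python dictionary keyed by the slide's "row:column" labels, the operand values 6 and 7 are made up, and the function names are hypothetical.

```python
# Toy model of the two styles; values and names are illustrative assumptions.

def cisc_mul(memory, dst, src):
    """MUL dst, src: the execution unit reads and writes memory directly."""
    memory[dst] = memory[dst] * memory[src]

def risc_mul(memory, dst, src):
    """LOAD/LOAD/PROD/STORE: only the load and store steps touch memory."""
    regs = {}
    regs["A"] = memory[dst]            # LOAD A, dst
    regs["B"] = memory[src]            # LOAD B, src
    regs["A"] = regs["A"] * regs["B"]  # PROD A, B (register-to-register)
    memory[dst] = regs["A"]            # STORE dst, A
    return regs

mem_cisc = {"2:3": 6, "5:2": 7}
mem_risc = dict(mem_cisc)
cisc_mul(mem_cisc, "2:3", "5:2")
final_regs = risc_mul(mem_risc, "2:3", "5:2")
```

Both versions leave the same product at 2:3; the RISC version simply makes each memory access an explicit instruction.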

Design principles

* Simplicity favors regularity
  * Standard 32-bit instruction format for all instructions
  * Fixed-length instructions, register-to-register architecture, three-operand instruction format
* Smaller is faster
  * Three categories of registers, each handling specific instructions, so access time is presumably faster
* Make the common case fast
  * Integer and floating point instructions
* Good design demands good compromises
  * To align with RISC principles, many instructions that required three source operands were eliminated
  * Many complex instructions were curtailed to conform with RISC principles, but this is compensated by a large number of mnemonics that increase the number of instructions

General

All PowerPC processors run the same core PowerPC instruction set.

It is independent of implementation aspects.

It allows anyone to design and fabricate compatible PowerPC processors independent of implementation differences as the technology advances.

They differ primarily in the degree of dedicated hardware support for multiple execution units, cache size and capability, length of pipeline, and interface busses.

These differences result in different tradeoffs in processing performance, die area, and power dissipation.

Initialization

When the processor is first initialized, it is in supervisor (also called privileged) mode. In this mode, all processor resources, including registers and instructions, are accessible.

The processor can limit access to certain privileged registers and instructions by placing itself in user mode.

This protection limits application code from being able to modify global and sensitive resources, such as the caches, memory management system, and timers.

Registers

The architecture defines five types of registers:

* Special Purpose Registers (SPRs)
* General Purpose Registers (GPRs)
* Floating Point Registers (FPRs)
* Device Control Registers (DCRs)
* Machine State Register (MSR)

Special Purpose Registers :

SPRs give status and control of resources within the processor core.

Five important user mode SPRs are:

The Fixed-Point Exception Register (XER) is used for indicating conditions for integer operations, such as carries and overflows.

The Floating-Point Status and Control Register (FPSCR) is a 32-bit register used to store the status and control of the floating-point operations.

The Count Register (CTR) is used to hold a loop count that can be decremented during the execution of branch instructions.

The Condition Register (CR) is a 32-bit register grouped into eight fields, where each field is 4 bits that signify the result of an instruction’s operation: Equal (EQ), Greater Than (GT), Less Than (LT), and Summary Overflow (SO).

The Link Register (LR) contains the address to return to at the end of a function call.
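
As a minimal sketch of how the Condition Register fields described above could be read, the function below extracts one 4-bit field from a 32-bit CR value. It assumes the usual PowerPC bit numbering, in which field 0 is the most significant four bits and each field is ordered LT, GT, EQ, SO from its most significant bit; the function name is hypothetical.

```python
# Order of flags inside each 4-bit CR field, most significant bit first
# (assumed PowerPC bit numbering).
CR_BIT_NAMES = ("LT", "GT", "EQ", "SO")

def cr_field(cr, n):
    """Return the flags of CR field n (0..7) from a 32-bit CR value."""
    if not 0 <= n <= 7:
        raise ValueError("CR has eight fields, numbered 0 to 7")
    nibble = (cr >> (28 - 4 * n)) & 0xF  # field 0 is the top nibble
    return {name: bool(nibble & (0b1000 >> i))
            for i, name in enumerate(CR_BIT_NAMES)}
```

For instance, a CR value of 0x20000000 has only the EQ flag set in field 0.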

General Purpose Registers :

The Architecture specifies that all implementations have 32 GPRs (GPR0 - GPR31).

GPRs are the source and destination of all fixed-point operations and load/store operations. They also provide access to SPRs and DCRs.

They are all available for use in every instruction with one exception: In certain instructions, GPR0 simply means “0” and no lookup is done for GPR0’s contents.

Floating Point Registers :

The PowerPC architecture provides thirty-two 64-bit floating-point registers.

Device Control Registers :

DCRs are similar to SPRs in that they give status and control information, but DCRs are for resources outside the processor core.

DCRs allow for memory-mapped I/O control without using up portions of the memory address space.

Machine State Register :

MSR represents the state of the machine.

It is accessed only in supervisor mode, and contains the settings for things such as memory translation, cache settings, interrupt enables, user/privileged state, and floating point availability. Exact control bits vary by implementation.

The MSR does not readily fit into the SPR/GPR classification, as it has its own pair of instructions (mfmsr and mtmsr) to copy its contents to and from a GPR.

Data Types

PowerPC can deal with data types of 8 bits (byte), 16 bits (half word), 32 bits (word) and 64 bits (double word) in length. It can use either little-endian or big-endian byte order; that is, the least significant byte is stored at the lowest address (little-endian) or at the highest address (big-endian).

Fixed-point data types include:

* Unsigned byte

* Unsigned half word

* Signed half word

* Unsigned word

* Signed word

* Unsigned double word

* Byte Strings: From 0 – 128 bytes in length

Floating-point data types include IEEE-754 single- and double-precision types.
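
The two byte orders and the four data widths mentioned above can be illustrated with Python's standard struct module. This is only a sketch on a host machine, not PowerPC code; the value 0x12345678 is an arbitrary example.

```python
import struct

value = 0x12345678  # arbitrary 32-bit word chosen for illustration

big = struct.pack(">I", value)     # big-endian: most significant byte at lowest address
little = struct.pack("<I", value)  # little-endian: least significant byte at lowest address

# Sizes in bytes of the byte, half word, word and double word types.
SIZES = {
    "byte": struct.calcsize(">B"),
    "half word": struct.calcsize(">H"),
    "word": struct.calcsize(">I"),
    "double word": struct.calcsize(">Q"),
}
```

Index 0 of each packed buffer corresponds to the lowest memory address, so `big[0]` is 0x12 while `little[0]` is 0x78.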

Instruction Format

The architecture encodes all instructions in 32 bits and aligns them on word address boundaries in memory.

Instructions are first decoded by the upper 6 bits, in a field called the primary opcode. The remaining 26 bits contain operands and/or reserved fields.
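
A minimal sketch of this first decoding step is shown below. Splitting off the upper 6 bits follows the description above; the further split of the remaining 26 bits into a 5-bit target register, a 5-bit base register and a 16-bit signed immediate is an assumption that applies to one common layout (the D-form) and is not stated in the text. The function name is hypothetical.

```python
def decode_d_form(word):
    """Split a 32-bit instruction word into primary opcode and D-form fields."""
    word &= 0xFFFFFFFF
    primary_opcode = word >> 26   # upper 6 bits
    rest = word & 0x03FFFFFF      # remaining 26 bits
    rt = (rest >> 21) & 0x1F      # assumed: 5-bit target register
    ra = (rest >> 16) & 0x1F      # assumed: 5-bit base register
    imm = rest & 0xFFFF           # assumed: 16-bit immediate
    if imm & 0x8000:              # sign-extend the immediate
        imm -= 0x10000
    return primary_opcode, rt, ra, imm
```

As an example, the word 0x38600001 (taken here as the encoding of addi r3,0,1) decodes to primary opcode 14, target register 3, base register 0 and immediate 1.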

Instruction Types

The instruction types defined are: ALU, floating-point, load/store, branch, condition, and synchronization instructions.

Addressing Modes

Three types of operand addressing:

* Memory operand addressing:
  * Indirect addressing: base address in a GPR + a 16-bit sign-extended literal
  * Indirect-indexed addressing: base address in a GPR + displacement from another GPR
* ALU and floating-point instruction operand addressing:
  * Three-register format
* Branch operand addressing:
  * Absolute: use the literal as the absolute address.
  * Relative: use the literal as the displacement from the branch instruction address.
  * Indirect: take the target address from the LR or CTR register.
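
The address calculations above can be sketched as follows. This is a simplified model under stated assumptions: addresses are 32 bits wide, the register file is a plain list of 32 integers, GPR0 used as a base is read as the value 0 (as described in the Registers section), and any alignment handling of branch targets is ignored. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit effective addresses

def sign_extend16(literal):
    """Interpret the low 16 bits of literal as a signed value."""
    literal &= 0xFFFF
    return literal - 0x10000 if literal & 0x8000 else literal

def ea_indirect(gprs, ra, literal):
    """Base address in a GPR + 16-bit sign-extended literal."""
    base = 0 if ra == 0 else gprs[ra]  # GPR0 as base means "0"
    return (base + sign_extend16(literal)) & MASK32

def ea_indexed(gprs, ra, rb):
    """Base address in a GPR + displacement from another GPR."""
    base = 0 if ra == 0 else gprs[ra]
    return (base + gprs[rb]) & MASK32

def branch_target(mode, literal=0, branch_addr=0, lr=0, ctr=0):
    """Absolute, relative, or indirect (via LR or CTR) branch target."""
    if mode == "absolute":
        return literal & MASK32
    if mode == "relative":
        return (branch_addr + literal) & MASK32
    if mode == "lr":
        return lr & MASK32
    if mode == "ctr":
        return ctr & MASK32
    raise ValueError("unknown branch mode: " + mode)

gprs = [0] * 32
gprs[0], gprs[4], gprs[5] = 0xDEAD, 0x1000, 0x20
```

With this register file, `ea_indirect(gprs, 4, 0xFFFC)` steps back four bytes from 0x1000, and `ea_indirect(gprs, 0, 8)` ignores the contents of GPR0 entirely.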

MPC8640D Dual core PowerPC Processor

Fig: Block Diagram of MPC8641D Dual core PowerPC Processor

MPC8640D PowerPC Processor Specifications

e600 PowerPC Core Features

* High-performance, 32-bit superscalar microprocessor that implements the PowerPC architecture
* Eleven independent execution units and three register files
  * Branch processing unit (BPU)
  * Four integer units (IUs) that share 32 GPRs for integer operands
  * 64-bit floating-point unit (FPU)
  * Four vector units and a 32-entry vector register file (VRs)
  * Three-stage load/store unit (LSU)
* Three issue queues, FIQ, VIQ, and GIQ, can accept as many as one, two, and three instructions, respectively, in a cycle
* Dispatch unit
* Completion unit
* Two separate 32-Kbyte instruction and data level 1 (L1) caches
* Integrated 1-Mbyte, eight-way set-associative unified instruction and data level 2 (L2) cache with ECC
* 36-bit real addressing
* Separate memory management units (MMUs) for instructions and data
* Multiprocessing support features
* Power and thermal management
* Performance monitor
* In-system testability and debugging features

MPC8640D PowerPC

Superscalar microprocessor with instruction-level parallelism and seven-stage pipeline execution, which allows multiple instructions to be executed in parallel.

* High-performance superscalar e600 core
  * As many as 4 instructions can be fetched from the instruction cache at a time.
  * As many as 3 instructions can be dispatched to the issue queues at a time.
  * As many as 12 instructions can be in the instruction queue (IQ).
  * As many as 16 instructions can be at some stage of execution simultaneously.
  * Single-cycle execution for most instructions
  * One-instruction throughput per clock cycle for most instructions
  * Seven-stage pipeline control

Execution Units

* BPU : Branch Processing Unit
* VPU : Vector Permute Unit
* VIU : Vector Integer Unit
* VFPU : Vector Floating Point Unit
* FPU : Floating Point Unit
* IU : Integer Unit
* LSU : Load/Store Unit

Fig: e600 POWER-PC Core. MPC8640D with e600 PPC core microarchitecture, with emphasis on the pipeline stages of the front end and the functional units.

PowerPC e600 Core Pipeline Stages

Stages 1 and 2 - Instruction Fetch:

These two stages are both dedicated primarily to grabbing an instruction from the L1 cache.

The e600 can fetch four instructions per clock cycle from the L1 cache and send them on to the next stage.

Stage 3 - Decode/Dispatch:

Once an instruction has been fetched, it goes into a 12-entry instruction queue to be decoded.

The e600's decoder can dispatch up to three instructions per clock cycle to the next stage.


Stage 4 - Issue:

Dispatched instructions wait in one of three issue queues. The first is the Floating-Point Issue Queue (FIQ), which holds floating-point (FP) instructions that are waiting to be executed.

The second is the Vector Issue Queue (VIQ), which holds vector operations.

The third queue is the General Instruction Queue (GIQ), which holds everything else.

Once the instruction leaves its issue queue, it goes to the execution engine to be executed.


Stage 5 - Execute:

The instructions can pass out-of-order from their issue queues into their respective functional units and be executed.

Stages 6 and 7 - Complete and Write-Back :

In these two stages, the instructions are put back into the order in which they came into the processor, and their results are committed to the architected registers and memory.
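
The dispatch limits quoted for stages 3 and 4 can be explored with a rough sketch. It assumes strictly in-order dispatch, at most three instructions dispatched per cycle, and per-cycle acceptance limits of one (FIQ), two (VIQ) and three (GIQ) instructions; queue capacities, dependencies and everything after dispatch are ignored. The instruction kinds and function names are hypothetical.

```python
ACCEPT_PER_CYCLE = {"FIQ": 1, "VIQ": 2, "GIQ": 3}  # per-cycle limits from the text
DISPATCH_WIDTH = 3                                 # up to three instructions per cycle

def queue_for(kind):
    """Route an instruction kind to its issue queue."""
    return {"fp": "FIQ", "vector": "VIQ"}.get(kind, "GIQ")

def dispatch_schedule(kinds):
    """Group instructions (in program order) into per-cycle dispatch groups."""
    cycles = []
    i = 0
    while i < len(kinds):
        accepted = {q: 0 for q in ACCEPT_PER_CYCLE}
        group = []
        while i < len(kinds) and len(group) < DISPATCH_WIDTH:
            q = queue_for(kinds[i])
            if accepted[q] == ACCEPT_PER_CYCLE[q]:
                break  # assumed in-order dispatch: later instructions wait too
            accepted[q] += 1
            group.append((kinds[i], q))
            i += 1
        cycles.append(group)
    return cycles

schedule = dispatch_schedule(["int", "fp", "fp", "vector", "vector", "load"])
```

In this example the second floating-point instruction has to wait a cycle because the FIQ accepts only one instruction per cycle, so the six instructions are dispatched in three groups.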

AltiVec Vector Engine in e600

AltiVec is a floating-point and integer SIMD (Single Instruction, Multiple Data) instruction set designed and owned by Apple, IBM and Freescale Semiconductor (formerly the Semiconductor Products Sector of Motorola), together known as the AIM alliance, and implemented on versions of the PowerPC.

The vector processing unit is branded with several names:

* IBM: Vector Multimedia Extension (VMX)
* Apple: Velocity Engine
* Freescale: AltiVec

What is vector Processing?

* A vector architecture allows the simultaneous processing of multiple data items in parallel.
* Operations are performed on multiple data elements by a single instruction.
* This is referred to as Single Instruction Multiple Data (SIMD) parallel processing.
* For example:
  * The vector addition instruction VT = (VA + VB) is computed with single-cycle latency and single-cycle throughput.
  * The multiply-and-accumulate instruction VT = (VA * VB) + VC is computed with 5-cycle latency and single-cycle throughput.
* Vectors are 128 bits in size and can hold:
  * an array of 16 characters
  * an array of 8 short integers
  * an array of 4 long integers
  * an array of 4 SP floating-point numbers
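
The element counts and the two example operations above can be modeled in a few lines. This sketch only reproduces the lane-by-lane semantics; Python evaluates the lanes one after another, whereas the hardware processes them in parallel. Wrap-around (modulo) behavior for the integer add is an assumption, and the function names are hypothetical.

```python
VECTOR_BITS = 128  # vector size given in the text

def lanes(element_bits):
    """Number of elements of the given width that fit in one vector."""
    return VECTOR_BITS // element_bits

def vec_add(va, vb, element_bits):
    """VT = VA + VB, lane by lane, wrapping at the element width (assumed)."""
    n = lanes(element_bits)
    if len(va) != n or len(vb) != n:
        raise ValueError("operands must fill the whole vector")
    mask = (1 << element_bits) - 1
    return [(a + b) & mask for a, b in zip(va, vb)]

def vec_madd(va, vb, vc):
    """VT = (VA * VB) + VC on vectors of four floats."""
    return [a * b + c for a, b, c in zip(va, vb, vc)]
```

For 8-bit elements, `vec_add([250] * 16, [10] * 16, 8)` produces sixteen results from what the hardware would treat as one instruction.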

SIMD Intra element Instructions

The Four Vector Engines in e600

* AltiVec Vector Permute Unit (VPU): executes permutation instructions such as pack, unpack, merge, splat, and permute on vector operands.
* AltiVec Vector Integer Unit 1 (VIU1): executes simple vector integer computational instructions, such as addition, subtraction, maximum and minimum, averaging, rotation, shifting, comparisons, and Boolean operations.
* AltiVec Vector Integer Unit 2 (VIU2): executes longer-latency vector integer instructions, such as multiplication, multiplication/addition, and sum-across with saturation.
* AltiVec Vector Floating-Point Unit (VFPU): executes all vector floating-point instructions.

AltiVec computational instructions are executed in four independent, pipelined AltiVec execution units. A maximum of two AltiVec instructions can be issued out-of-order to any combination of AltiVec execution units per clock cycle from the bottom two VIQ entries (VIQ1–VIQ0). This means an instruction in VIQ1 does not have to wait for an instruction in VIQ0 that is waiting for operand availability. Moreover, the VIU2, VFPU, and VPU are pipelined, so they can operate on multiple instructions. The VPU has a two-stage pipeline; the VIU2 and VFPU each have four-stage pipelines. As many as ten AltiVec instructions can be executing concurrently.

AltiVec Characteristics

* 128-bit vector size
  * 4, 8 or 16 data elements
* Separate register file with a 32-register namespace
* Vector-element data types of 8-, 16-, 32-bit signed/unsigned int, and IEEE SP float
* 162 instructions
  * Intra- and inter-element arithmetic instructions
  * Intra- and inter-element conditional instructions
  * Powerful Permute, Shift and Rotate, Splat, Pack/Unpack and Merge instructions
* Saturation or modulo arithmetic
* Four-operand, nondestructive instruction format
  * Three sources, one destination
* Modeless operation for zero-overhead use of AltiVec instructions
* Simultaneous dispatch of one ALU-class vector and one permute-class vector, or either paired with a vector load/store
  * Peak throughput of 2 instructions per cycle
* All instructions fully pipelined with single-cycle throughput
  * Simple ops: 1-cycle latency
  * Compound ops: 3-4 cycle latency
  * No restriction on issue with scalar instructions
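
The difference between saturation and modulo arithmetic mentioned above can be shown on single elements. This is a small illustrative sketch with hypothetical function names, covering unsigned 8-bit and signed 16-bit elements only.

```python
def add_modulo_u8(a, b):
    """Unsigned 8-bit add that wraps around on overflow."""
    return (a + b) & 0xFF

def add_saturate_u8(a, b):
    """Unsigned 8-bit add that clamps at the largest value, 255."""
    return min(a + b, 0xFF)

def add_saturate_s16(a, b):
    """Signed 16-bit add that clamps to the range -32768..32767."""
    return max(-32768, min(32767, a + b))
```

Adding 200 and 100 as unsigned bytes wraps to 44 in modulo mode but clamps to 255 in saturating mode, which is often the preferred behavior for pixel and audio data.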

AltiVec Software Enablement for Vector Signal and Image Processing

* Tool and product support
  * Compilers: GCC, GHS, WR
  * Conversion tools
    * SSE to AltiVec
    * Vectorization: linear to parallel code
* AltiVec libraries
  * Intrinsic optimized libraries from Freescale
  * Ecosystem libraries: OpenSAL, VSIPL, OpenCV, OpenGL ES
  * Multi-core/multi-threading: VSIPL++, multicore SAL

Key Areas of Bandwidth, Performance and Computation Abilities

Summary of floating point calculations for e600 @ 1 GHz:

* 0.8 SP or DP GFLOPS (one core only, regular FP instructions).
* 1.6 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
* 4.8 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
* 9.6 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).

Summary of floating point calculations for e600 @ 1.5 GHz:

* 1.2 SP or DP GFLOPS (one core only, regular FP instructions).
* 2.4 SP or DP GFLOPS (one core only, only multiply-add FP instructions).
* 7.2 SP GFLOPS (one core + AltiVec VFPU, regular FP instructions).
* 14.4 SP GFLOPS (one core + AltiVec VFPU, only multiply-add FP instructions).

In addition, each core has four Integer Units (IU1, IU2, IU3, IU4), one Vector Permute Unit (VPU), and two Vector Integer Units (VIU1, VIU2) to meet computation requirements.

The computation time required to perform a 1024-point complex-to-complex FFT is 10 microseconds.

Summary of floating point calculations for MPC8640D @ 1 GHz:

* 1.6 SP or DP GFLOPS (dual core, regular FP instructions).
* 3.2 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
* 9.6 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
* 19.2 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).

Summary of floating point calculations for MPC8640D @ 1.5 GHz:

* 2.4 SP or DP GFLOPS (dual core, regular FP instructions).
* 4.8 SP or DP GFLOPS (dual core, only multiply-add FP instructions).
* 14.4 SP GFLOPS (dual core + dual AltiVec VFPU, regular FP instructions).
* 28.8 SP GFLOPS (dual core + dual AltiVec VFPU, only multiply-add FP instructions).

In addition, each core has four Integer Units (IU1, IU2, IU3, IU4), a Vector Permute Unit (VPU), and two Vector Integer Units (VIU1, VIU2) to meet computation requirements.
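
The GFLOPS figures in these summaries follow a simple pattern, which the sketch below reproduces. The constants are inferred from the quoted numbers rather than taken from a datasheet: 0.8 scalar floating-point operations per cycle per core, four additional single-precision lanes when the AltiVec VFPU is used, and a factor of two when every instruction is a multiply-add. The function name is hypothetical.

```python
SCALAR_FLOP_PER_CYCLE = 0.8  # inferred from 0.8 GFLOPS at 1 GHz (assumption)
VECTOR_LANES = 4             # four SP floats per 128-bit AltiVec vector

def peak_gflops(clock_ghz, cores=1, altivec=False, multiply_add=False):
    """Peak GFLOPS under the inferred per-cycle rates."""
    per_cycle = SCALAR_FLOP_PER_CYCLE + (VECTOR_LANES if altivec else 0)
    if multiply_add:
        per_cycle *= 2  # a multiply-add counts as two operations
    return round(per_cycle * clock_ghz * cores, 1)
```

For example, `peak_gflops(1.5, cores=2, altivec=True, multiply_add=True)` corresponds to the dual-core MPC8640D case at 1.5 GHz with only multiply-add instructions.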

COTS VPX Card with Quad Power-PC Processors (MPC8640D)

Quad Power-PC Processor Card

Fig: Quad processor card, with an SMP OS running on each dual-core processor (labels: Node A, Node B, Node C, Node D, Embedded Controller)

Node A, Node B and Node C will be used for signal processing.

Conclusion

Overall, the PowerPC compares favorably with the CISC approach described earlier: it can handle more instructions at once, it offers more capability for branching and floating-point operations, and it handles the various complexities of data and memory more efficiently.
