Sharc Processor

Transcript of Sharc Processor

Page 1: Sharc Processor

2.4 SHARC Processor

Page 2: Sharc Processor

Why DSP
A DSP is a special class of microprocessor optimized for computing the real-time calculations used in signal processing.
DSPs have an architecture that simplifies application designs and makes low-cost signal processing a reality.

Page 3: Sharc Processor

Characteristics
- fast, flexible computation units
- unconstrained data flow to and from the computation units
- extended precision and dynamic range in the computation units
- dual address generators
- efficient program sequencing and looping mechanisms

Page 4: Sharc Processor

SHARC family of DSPs
- Harvard architecture
- one instruction per line
- each instruction ends with a semicolon (;)
- a label ends with a colon (:)
- comments start with an exclamation point (!)

Page 5: Sharc Processor

Instructions example
R1 = DM(M0,I0), R2 = PM(M8,I8); ! a comment
Label: R3 = R1 + R2;

Page 6: Sharc Processor

2.4.1 Memory Organization

Page 7: Sharc Processor

Memory
SHARC uses different word sizes and address space sizes for instructions and data:
- an instruction consists of 48 bits
- the basic data word is 32 bits
- an address is 32 bits

Page 8: Sharc Processor

On-chip memory
The 21061, the smallest member of the family, has 1 Mbit of on-chip memory.
Internal memory is divided into program memory (PM) and data memory (DM).

Page 9: Sharc Processor

Types of data
- 32-bit IEEE single-precision floating-point
- 40-bit IEEE extended-precision floating-point
- 32-bit integers

Page 10: Sharc Processor

SHARC memory
The SHARC allows the program memory to hold both data and instructions. This arrangement:
- allows extra data to be squeezed into the on-chip memory
- allows data to be fetched from both memories in parallel

Page 11: Sharc Processor

SHARC memory
The PM bus is used to access either instructions or data.
During a single cycle the processor can access two data operands, one over the PM bus and one over the DM bus.

Page 12: Sharc Processor

SHARC memory
The register file has two sets (primary and alternate) of sixteen registers each.
The data address generators (DAGs) provide memory addresses when data is transferred between memory and registers.
DAG1 supplies 32-bit addresses to data memory.
DAG2 supplies 24-bit addresses to program memory for program memory data accesses.

Page 13: Sharc Processor

SHARC memory
Each DAG keeps track of up to eight address pointers, eight modifiers, and eight length values.
A pointer used for indirect addressing can be modified by a value in a specified register.

Page 14: Sharc Processor

2.4.2 Data Operations

Page 15: Sharc Processor

SHARC programming model
The primary data registers are named r0-r15 or f0-f15:
- R0-R15 are used for integer operations
- F0-F15 are used for floating-point operations
Registers are 40 bits long, which accommodates both kinds of data:
- a 40-bit extended-precision floating-point value
- 32-bit data types, held in the most-significant bits

Page 16: Sharc Processor
Page 17: Sharc Processor

CPU
The CPU has three major data function units: an ALU, a multiplier, and a shifter.
The three most significant registers for data operations are:
- arithmetic status (ASTAT)
- sticky (STKY)
- mode 1 (MODE1)

Page 18: Sharc Processor

The ALU updates seven status flags in the ASTAT register at the end of each operation.
The ALU also updates four "sticky" status flags in the STKY register.
Once set, a sticky flag remains high until explicitly cleared.
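The difference between an ordinary status flag and a sticky one can be sketched in C. This is a minimal model under stated assumptions, not the SHARC hardware: the `alu_flags` struct and the functions `alu_add` and `clear_sticky` are hypothetical names, and only one flag pair (an ASTAT-style overflow bit and its STKY-style twin) is modeled.

```c
#include <stdint.h>

/* Hypothetical model: one ASTAT-style flag and its sticky STKY-style twin. */
typedef struct {
    int av;   /* overflow flag, rewritten by every operation */
    int aos;  /* sticky fixed-point overflow flag, only ever set here */
} alu_flags;

/* 32-bit add that updates both flags. */
int32_t alu_add(alu_flags *f, int32_t x, int32_t y) {
    int64_t wide = (int64_t)x + (int64_t)y;
    int overflow = (wide > INT32_MAX) || (wide < INT32_MIN);
    f->av = overflow;            /* reflects only this operation */
    if (overflow) f->aos = 1;    /* sticky: stays high afterwards */
    return (int32_t)((uint32_t)x + (uint32_t)y);  /* wrapped result */
}

/* A sticky flag goes low only when software clears it explicitly. */
void clear_sticky(alu_flags *f) { f->aos = 0; }
```

After an overflowing add followed by a normal add, `av` is back to 0 while `aos` is still 1, so a program can check once at the end of a long computation whether any step overflowed.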

Page 19: Sharc Processor

ASTAT
Bit    Name  Definition
0      AZ    ALU result zero or floating-point underflow
1      AV    ALU overflow
2      AN    ALU result negative
3      AC    ALU fixed-point carry
4      AS    ALU X input sign (ABS, MANT operations)
5      AI    ALU floating-point invalid operation
10     AF    Last ALU operation was a floating-point operation
31-24  CACC  Compare Accumulation register (results of last 8 compare operations)

Page 20: Sharc Processor

STKY
Bit  Name  Definition
0    AUS   ALU floating-point underflow
1    AVS   ALU floating-point overflow
2    AOS   ALU fixed-point overflow
5    AIS   ALU floating-point invalid operation

Page 21: Sharc Processor

SHARC arithmetic
Rn, Rx, and Ry are arbitrary data registers R0-R15.
Operations set various status bits in the ASTAT and STKY registers.
COMP compares two values without modifying any data registers.

Page 22: Sharc Processor

Rn = Rx + Ry           Add
Rn = Rx - Ry           Subtract
Rn = Rx + Ry + CI      Add with carry
Rn = Rx - Ry + CI - 1  Subtract with borrow
Rn = (Rx + Ry)/2       Average
COMP(Rx,Ry)            Compare

Page 23: Sharc Processor

Rn = Rx + CI      Add carry
Rn = Rx + CI - 1  Add borrow
Rn = Rx + 1       Increment
Rn = Rx - 1       Decrement
Rn = -Rx          Negate
Rn = ABS Rx       Absolute value
Rn = PASS Rx      Copy Rx to Rn

Page 24: Sharc Processor

Rn = Rx AND Ry      Logical AND
Rn = Rx OR Ry       Logical OR
Rn = Rx XOR Ry      Logical exclusive OR
Rn = NOT Rx         Logical negate
Rn = MIN(Rx,Ry)     Minimum of Rx, Ry
Rn = MAX(Rx,Ry)     Maximum of Rx, Ry
Rn = CLIP Rx by Ry  Clip Rx within range [-Ry,Ry]

Page 25: Sharc Processor

All the ALU operations set the AZ (ALU result zero), AN (ALU result negative), AV (ALU result overflow), AC (ALU fixed-point carry), and AI (floating-point invalid) bits in the ASTAT register.
The STKY register is a sticky version of the ASTAT register.
STKY bits are set along with the ASTAT register bits, but are not cleared by later operations; they remain set until cleared by an instruction.

Page 26: Sharc Processor

Saturation Mode
The SHARC can perform saturation arithmetic on fixed-point values.
All positive fixed-point overflows cause the maximum positive fixed-point number (0x7FFF FFFF) to be returned, and all negative overflows cause the maximum negative number (0x8000 0000) to be returned.

Page 27: Sharc Processor

Saturation Mode
In saturation arithmetic, an overflow results in the maximum-range value, not the result of wrapping around the numeric range.
Saturation mode is controlled by the ALUSAT bit in the MODE1 register.
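The contrast between saturating and wrapping results can be illustrated in C. This is a sketch of the arithmetic only, assuming 32-bit two's-complement integers; the function names `add_saturating` and `add_wrapping` are illustrative and do not model the ALUSAT bit itself.

```c
#include <stdint.h>

/* Saturating add: clamp to the ends of the 32-bit range on overflow. */
int32_t add_saturating(int32_t x, int32_t y) {
    int64_t wide = (int64_t)x + (int64_t)y;   /* exact sum, cannot overflow */
    if (wide > INT32_MAX) return INT32_MAX;   /* 0x7FFFFFFF */
    if (wide < INT32_MIN) return INT32_MIN;   /* 0x80000000 */
    return (int32_t)wide;
}

/* Wrapping add: the result wraps around the numeric range. */
int32_t add_wrapping(int32_t x, int32_t y) {
    return (int32_t)((uint32_t)x + (uint32_t)y);
}
```

For signal samples, clamping at the rail usually distorts far less than a wrapped value that flips sign, which is one reason saturation is common in DSP code.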

Page 28: Sharc Processor

SHARC doesn't have a divide instruction.
Iterative algorithms are used to compute both reciprocals and square roots.
The RECIPS and RSQRTS operations are used to start these iterative algorithms.
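One common iterative scheme of this kind is Newton-Raphson refinement of a reciprocal, sketched below in C. The slides do not say which algorithm is used, so treat this as an assumption. The helper `crude_seed` is a hypothetical stand-in for a hardware seed such as the one RECIPS provides; it only scales by powers of two so that the first guess is within a factor of two of 1/d. The divisor is assumed to be finite and nonzero.

```c
/* Hypothetical stand-in for a hardware seed: a power of two s with
   0.5 <= |d| * |s| < 1, carrying the sign of d. */
static float crude_seed(float d) {
    float a = d < 0 ? -d : d;
    float s = 1.0f;
    while (a * s >= 1.0f) s *= 0.5f;
    while (a * s < 0.5f) s *= 2.0f;
    return d < 0 ? -s : s;
}

/* Newton-Raphson: x <- x * (2 - d * x) squares the relative error each step. */
float reciprocal(float d, int iterations) {
    float x = crude_seed(d);
    for (int i = 0; i < iterations; i++)
        x = x * (2.0f - d * x);
    return x;
}

/* Division built from a reciprocal and one multiply. */
float divide(float n, float d) {
    return n * reciprocal(d, 5);
}
```

Because the error is squared on every step, a better seed means fewer iterations; with this crude seed, five steps are enough for single precision.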

Page 29: Sharc Processor

Floating-Point Rounding Modes
If the TRUNC bit is set, the ALU rounds a result to zero (truncation). If the TRUNC bit is cleared, the ALU rounds to nearest.
The rounding modes used for floating-point arithmetic are controlled by two bits in the MODE1 register.

Page 30: Sharc Processor

Multiplication sets the MN (multiplier result negative), MV (multiplier overflow), MU (multiplier floating-point underflow), and MI (multiplier floating-point invalid operation) bits in the ASTAT register.

Page 31: Sharc Processor

Fn = Fx + Fy       Add
Fn = Fx - Fy       Subtract
Fn = ABS(Fx + Fy)  Absolute value of sum
Fn = ABS(Fx - Fy)  Absolute value of difference
Fn = (Fx + Fy)/2   Average
COMP(Fx,Fy)        Compare
Fn = -Fx           Negate

Page 32: Sharc Processor

Fn = ABS Fx                          Absolute value
Fn = PASS Fx                         Copy Fx to Fn
Fn = RND Fx                          Round
Fn = SCALB Fx by Ry                  Scale exponent of Fx by Ry
Rn = MANT Fx                         Extract mantissa of Fx
Rn = LOGB Fx                         Convert exponent of Fx to integer
Rn = FIX Fx, Rn = TRUNC Fx           Convert floating-point to integer
Fn = FLOAT Rx by Ry, Fn = FLOAT Rx   Convert integer to floating-point

Page 33: Sharc Processor

Fn = RECIPS Fx        Create seed for reciprocal
Fn = RSQRTS Fx        Create seed for reciprocal square root
Fn = Fx COPYSIGN Fy   Copy sign of Fy to Fx
Fn = MIN(Fx,Fy)       Minimum of Fx, Fy
Fn = MAX(Fx,Fy)       Maximum of Fx, Fy
Fn = CLIP Fx by Fy    Clip Fx within range [-Fy,Fy]

Page 34: Sharc Processor

The multiplier performs fixed-point and floating-point multiplication.
It can also perform saturation, rounding, and setting the result to 0.
Fixed-point multiplication produces an 80-bit result.

Page 35: Sharc Processor

Logical shifts fill with zeroes, while arithmetic shifts copy sign bits.
The distance to shift, supplied by the Ry register, may be positive for a left shift or negative for a right shift.
Shift operations set the SZ (shifter zero), SV (shifter overflow), and SS (shifter input sign) bits in the ASTAT register.
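The signed shift distance and the two fill rules can be sketched in C. This is an illustrative model on 32-bit values only; the names `lshift` and `ashift` are borrowed from the mnemonics for readability, the status bits are not modeled, and the handling of distances of 32 or more is an assumption of this sketch.

```c
#include <stdint.h>

/* Logical shift: positive distance shifts left, negative shifts right,
   vacated bits are filled with zeroes. */
uint32_t lshift(uint32_t x, int dist) {
    if (dist >= 32 || dist <= -32) return 0;
    return dist >= 0 ? x << dist : x >> -dist;
}

/* Arithmetic shift: a right shift copies the sign bit into vacated bits. */
int32_t ashift(int32_t x, int dist) {
    if (dist >= 0)
        return dist >= 32 ? 0 : (int32_t)((uint32_t)x << dist);
    if (dist <= -32)
        return x < 0 ? -1 : 0;
    uint32_t u = (uint32_t)x >> -dist;            /* zero-filled shift */
    if (x < 0) u |= ~(UINT32_MAX >> -dist);       /* then copy the sign in */
    return (int32_t)u;
}
```

For example, shifting 0xFFFFFFF8 (that is, -8) right by one gives 0x7FFFFFFC with a logical shift but -4 with an arithmetic shift.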

Page 36: Sharc Processor

Rn = LSHIFT Rx by Ry           Logical shift distance Ry
Rn = Rn OR LSHIFT Rx by Ry     Logical shift and logical OR
Rn = ASHIFT Rx by Ry           Arithmetic shift
Rn = Rn OR ASHIFT Rx by Ry     Arithmetic shift and logical OR
Rn = ROT Rx by Ry              Rotate distance Ry
Rn = BCLR Rx by Ry             Clear one bit in Rx
Rn = BSET Rx by Ry             Set one bit in Rx
Rn = BTGL Rx by Ry             Toggle one bit in Rx

Page 37: Sharc Processor

BTST Rx by Ry                    Test one bit in Rx
Rn = FDEP Rx by Ry               Deposit field from Rx into Rn
Rn = Rn OR FDEP Rx by Ry         Deposit field from Rx using OR
Rn = FDEP Rx by Ry (SE)          Deposit and sign extend field from Rx
Rn = Rn OR FDEP Rx by Ry (SE)    Deposit and sign extend using OR
Rn = FEXT Rx by Ry               Extract field from Rx
Rn = FEXT Rx by Ry (SE)          Extract and sign extend field from Rx
Rn = EXP Rx                      Extract exponent field

Page 38: Sharc Processor

Rn = EXP Rx (EX)   Extract exponent field from ALU
Rn = LEFTZ Rx      Extract number of leading 0s
Rn = LEFTO Rx      Extract number of leading 1s
Rn = FPACK Fx      Convert 32-bit floating-point to 16-bit floating-point
Fx = FUNPACK Rn    Convert 16-bit floating-point to 32-bit floating-point

Page 39: Sharc Processor

Ex2-7 Data Operation Status Bits in the SHARC
For the fixed-point ALU calculation -1 + 1 = 0, the ASTAT status bits are set as follows: AZ = 1, AU = 0, AN = 0, AV = 0, AC = 1, and AI = 0.
For the floating-point operation -1E0 + 1E0 = 0E0, the status bits are set similarly; the sticky bit AOS (ALU fixed-point overflow, per the STKY table) stays clear because no overflow occurs.
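The fixed-point case can be checked with a small C model of a 32-bit adder. It is a sketch under stated assumptions: the struct `astat_bits` and the function `add_flags` are hypothetical names, only four of the flags are computed, and the carry flag is taken to be the carry out of bit 31.

```c
#include <stdint.h>

/* Hypothetical subset of the ALU status flags for a fixed-point add. */
typedef struct { int az, an, av, ac; } astat_bits;

astat_bits add_flags(int32_t x, int32_t y) {
    uint32_t ux = (uint32_t)x, uy = (uint32_t)y;
    uint32_t sum = ux + uy;                 /* wraps modulo 2^32 */
    astat_bits f;
    f.az = (sum == 0);                      /* result is zero */
    f.an = (int)((sum >> 31) & 1u);         /* sign bit of result */
    f.ac = (sum < ux);                      /* carry out of bit 31 */
    /* signed overflow: operands share a sign that the result lacks */
    f.av = (int)(((~(ux ^ uy) & (ux ^ sum)) >> 31) & 1u);
    return f;
}
```

Calling `add_flags(-1, 1)` adds 0xFFFFFFFF and 1: the 32-bit result is 0 with a carry out, so AZ = 1, AN = 0, AV = 0, and AC = 1, matching the values listed for the example.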

Page 40: Sharc Processor

Ex2-7 Data Operation Status Bits in the SHARC
For the fixed-point multiplier operation -2 * 3, the ASTAT bits are set as follows: MN = 1, MV = 0, MU = 1, and MI = 0.
The multiplier has four STKY bits, none of which will be set:
- MOS (multiplier fixed-point overflow)
- MVS (multiplier floating-point overflow)
- MUS (multiplier floating-point underflow)
- MIS (multiplier floating-point invalid operation)

Page 41: Sharc Processor

Ex2-7 Data Operation Status Bits in the SHARC
For the following shifter operation,
LSHIFT 0x7fffffff BY 3
the ASTAT bits will be set as follows: SZ = 0, SV = 1, and SS = 0.
The shifter has no sticky bits.

Page 42: Sharc Processor

Load and store operations
Operands must be loaded into registers before operating on them.
SHARC supplies special registers that are used to control loading and storing.
SHARC has two data address generators (DAGs): one for the data memory and the other for the program memory.

Page 43: Sharc Processor

DAGs
Data address generator 1 (DAG1) generates 32-bit addresses on the DM Address Bus.
Data address generator 2 (DAG2) generates 24-bit addresses on the PM Address Bus.
Each DAG has four types of registers: Index (I), Modify (M), Base (B), and Length (L) registers.

Page 44: Sharc Processor

DAGs
- The I register acts as a pointer to memory.
- The M register contains the increment value for advancing the pointer.
- The B registers and L registers are used only for circular data buffers.
- The B register holds the base address (i.e. the first address) of a circular buffer.
- The L register contains the number of locations in (i.e. the length of) the circular buffer.

Page 45: Sharc Processor

DAGs
With two DAGs, the SHARC can perform two load-store operations per cycle.
The DAG hardware automatically updates the register values so that a series of accesses can be performed very easily.
This makes the DAGs quite useful for sequential accesses.

Page 46: Sharc Processor

DAGs
Each data address generator has eight sets of primary registers.
Having several sets allows for quicker access of multiple sets of data.
The registers numbered 0 through 7 belong to DAG1, while registers 8 through 15 belong to DAG2.

Page 47: Sharc Processor
Page 48: Sharc Processor

MODE1
Bit  Name   Definition
3    SRD1H  DAG1 alternate register select (4-7)
4    SRD1L  DAG1 alternate register select (0-3)
5    SRD2H  DAG2 alternate register select (12-15)
6    SRD2L  DAG2 alternate register select (8-11)

Page 49: Sharc Processor

DAGs
DAGs provide the following addressing modes.
Immediate value:
R0 = DM(0x2000000);
R0 = DM(_a); ! loads R0 with the contents of the variable a
DM(_a) = R0; ! stores R0 into the memory location of a

Page 50: Sharc Processor

DAGs
Absolute address:
- has the entire address in the instruction
- the address bits take up most of the instruction (32 bits/40 bits)

Page 51: Sharc Processor

post-modify with update mode
- used to sweep through a range of addresses
- uses an I register and a modifier, which is either an M register or an immediate value
- the I register specifies the address and is then updated by the modifier value
R0 = DM(I3,M1);
DM(I2,1) = R1;
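Post-modify addressing behaves like a C pointer that is read first and advanced afterwards. The sketch below models that with an array index. It is illustrative only: the struct `dag_regs` and the functions `load_post_modify` and `sum_strided` are hypothetical names, and memory is modeled as a plain array of 32-bit words.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of one index (I) and modify (M) register pair. */
typedef struct {
    size_t i;      /* index register: current address */
    ptrdiff_t m;   /* modify register: step applied after each access */
} dag_regs;

/* Post-modify load: use the address in I, then add the modifier to I. */
int32_t load_post_modify(const int32_t *mem, dag_regs *d) {
    int32_t value = mem[d->i];
    d->i = (size_t)((ptrdiff_t)d->i + d->m);
    return value;
}

/* Sweep through memory with a fixed stride, summing `count` words. */
int32_t sum_strided(const int32_t *mem, size_t start, ptrdiff_t stride, int count) {
    dag_regs d = { start, stride };
    int32_t total = 0;
    for (int k = 0; k < count; k++)
        total += load_post_modify(mem, &d);
    return total;
}
```

The loop body contains no separate address arithmetic, which is the point of the mode: the update comes for free with each access.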

Page 52: Sharc Processor

base-plus-offset addressing
The address of the location to be fetched is computed as I + M, where I is the base and M is the modifier or offset.
With I0 = 0x2000000 and M1 = 4,
R0 = DM(M1,I0);
loads DM(0x2000004) into R0.

Page 53: Sharc Processor

circular buffers
A circular buffer is an array of n elements; when the (n + 1)th element is referenced, the reference goes to buffer location 0, wrapping around from the end to the beginning of the buffer.
To set up a circular buffer, an L register is loaded with a positive, nonzero value (the buffer length), and the B register of the same number is loaded with the base address of the circular buffer.
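The wraparound can be sketched in C using the same three quantities the DAG keeps: base, length, and index. This is a software model under stated assumptions, not the hardware algorithm; the struct `circ_regs` and the function `circ_next` are hypothetical names, and the modifier may be any signed step.

```c
#include <stddef.h>

/* Hypothetical model of the B, L, and I registers for one buffer. */
typedef struct {
    size_t b;  /* base address of the buffer */
    size_t l;  /* length of the buffer, must be nonzero */
    size_t i;  /* index: current address, kept inside [b, b + l) */
} circ_regs;

/* Return the current address, then advance I by m with wraparound. */
size_t circ_next(circ_regs *c, ptrdiff_t m) {
    size_t addr = c->i;
    ptrdiff_t len = (ptrdiff_t)c->l;
    ptrdiff_t off = ((ptrdiff_t)(c->i - c->b) + m) % len;
    if (off < 0) off += len;        /* keep the offset non-negative */
    c->i = c->b + (size_t)off;
    return addr;
}
```

With base 100 and length 4, stepping by 1 visits 100, 101, 102, 103 and then returns to 100, which is how a delay line in a filter reuses the same storage indefinitely.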

Page 54: Sharc Processor

bit-reversal addressing
Bit-reversal addressing is used in the fast Fourier transform (FFT).
Bit-reversal addressing can be performed only in I0 and I8, as controlled by the BR0 and BR8 bits in the MODE1 register.
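What bit reversal does to an index can be shown in a few lines of C. This is a software sketch of the index mapping only, with the hypothetical function name `bit_reverse`; it makes no claim about how the DAG hardware applies the reversal to a full address.

```c
#include <stdint.h>

/* Reverse the low `bits` bits of `index`; for an FFT of size 2^bits this
   maps each element to its position in bit-reversed order. */
uint32_t bit_reverse(uint32_t index, unsigned bits) {
    uint32_t reversed = 0;
    for (unsigned k = 0; k < bits; k++) {
        reversed = (reversed << 1) | (index & 1u);  /* move LSB to the other end */
        index >>= 1;
    }
    return reversed;
}
```

For an 8-point FFT (3 bits), index 1 (001) maps to 4 (100) and index 6 (110) maps to 3 (011), giving the access order 0, 4, 2, 6, 1, 5, 3, 7.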

Page 55: Sharc Processor

storing data in program memory
The SHARC allows data to be stored in the program memory, which allows two data fetches per cycle.
F0 = DM(M0,I0), F1 = PM(M8,I9);
simultaneously loads F0 from data memory and F1 from program memory.

Page 56: Sharc Processor

The declarations
float dm a[N];
float pm b[N];
will place the a[] array in data memory and b[] in program memory.

Page 57: Sharc Processor

Ex2-8 C Assignments in SHARC Instructions
x = (a + b) - c;
Use r0 for a, r1 for b, r2 for c, and r3 for x:
R0 = DM(_a); ! get value of a
R1 = DM(_b); ! load value of b
R3 = R0 + R1; ! set result for x to a + b
R2 = DM(_c); ! get value of c
R3 = R3 - R2; ! complete computation of x
DM(_x) = R3; ! store x at proper location

Page 58: Sharc Processor

Ex2-8 C Assignments in SHARC Instructions
y = a*(b + c);
Use r0 for a, r1 for b, and r2 for both c and y:
R1 = DM(_b); ! load b
R2 = DM(_c); ! load c
R2 = R1 + R2; ! compute partial result for y
R0 = DM(_a); ! load a
R2 = R2 * R0; ! compute final value of y
DM(_y) = R2; ! store y

Page 59: Sharc Processor

Ex2-8 C Assignments in SHARC Instructions
y = a*(b + c);
The code can be made shorter by using pointers:
R2 = DM(I1,M5), R1 = PM(I8,M13); ! load b and c in parallel
R0 = R2 + R1, R12 = DM(I0,M5); ! add (b+c) and load (a) in parallel
R6 = R12*R0 (SSI); ! finish y computation
DM(I0,M5) = R6; ! store y

Page 60: Sharc Processor

Ex2-8 C Assignments in SHARC Instructions
z = (a << 2) | (b & 15);
Use r0 for a and z, r1 for b, and r3 to hold the bit mask to be ANDed:
R0 = DM(_a); ! get value of a
R0 = LSHIFT R0 BY #2; ! perform shift
R1 = DM(_b); ! get value of b
R3 = #15; ! set up the bit mask for ANDing
R1 = R1 AND R3; ! perform logical AND
R0 = R1 OR R0; ! compute final value of z
DM(_z) = R0; ! store value of z

Page 61: Sharc Processor

2.4.3 Flow of Control

Page 62: Sharc Processor

JUMP instruction
JUMP foo jumps to the location foo.
- Direct: specifies a 24-bit address as an immediate
- Indirect: the address is supplied by the DAG2 data address generator
- PC-relative: specifies an immediate value that is added to the current PC

Page 63: Sharc Processor

loop instruction
LCNTR = n, DO Label UNTIL LCE;
The loop instruction specifies the following:
- the length of the loop (the iteration count n), loaded into the loop counter LCNTR
- Label, the address of the last instruction in the loop
- the loop termination condition LCE, which stands for "loop counter expired"

Page 64: Sharc Processor

True version  Description   Complement version
EQ            ALU = 0       NE
LT            ALU < 0       GE
LE            ALU ≤ 0       GT
AC            ALU carry     NOT AC
AV            ALU overflow  NOT AV

Page 65: Sharc Processor

MV        Multiplier overflow  NOT MV
MS        Multiplier sign      NOT MS
SV        Shifter overflow     NOT SV
SZ        Shifter zero         NOT SZ
FLAG0_IN  Flag 0 input         NOT FLAG0_IN

Page 66: Sharc Processor

FLAG1_IN  Flag 1 input              NOT FLAG1_IN
FLAG2_IN  Flag 2 input              NOT FLAG2_IN
FLAG3_IN  Flag 3 input              NOT FLAG3_IN
TF        Bit test flag             NOT TF
LCE       Loop counter expired
NOT LCE   Loop counter not expired

Page 67: Sharc Processor

Ex2-9 if statement
if (a > b) {
    x = 5;
    y = c + d;
}
else x = c - d;

Page 68: Sharc Processor

Ex2-9 if statement
! test
R0 = DM(_a); ! load a
R1 = DM(_b); ! load b
COMP(R0,R1); ! compare a,b
IF LE JUMP fblock; ! jump if the test a > b fails
! true block

Page 69: Sharc Processor

Ex2-9 if statement
tblock: R0 = 5; ! get value for x
DM(_x) = R0; ! store value for x
R0 = DM(_c); ! get c
R1 = DM(_d); ! get d
R1 = R0 + R1; ! compute c + d
DM(_y) = R1; ! save value for y
JUMP other; ! skip false block

Page 70: Sharc Processor

Ex2-9 if statement
! false block
fblock: R0 = DM(_c); ! get c
R1 = DM(_d); ! get d
R1 = R0 - R1; ! compute c - d
DM(_x) = R1; ! save value for x
other: ... ! code after if

Page 71: Sharc Processor

Ex2-9 if statement
if (a > b)
    y = c - d;
else
    y = c + d;

Page 72: Sharc Processor

Ex2-9 if statement
! load values
R1 = DM(_a); ! load a
R8 = DM(_b); ! load b
R2 = DM(_c); ! load c
R4 = DM(_d); ! load d
! compute both sum and difference

Page 73: Sharc Processor

Ex2-9 if statement
r12 = r2 + r4, r0 = r2 - r4;
! choose which one to save, copy it into r0 if necessary, then write to y
comp(r8,r1); ! compare b,a
if ge r0 = r12; ! a <= b
dm(_y) = r0; ! store y

Page 74: Sharc Processor

When control reaches the last instruction in the loop, the machine immediately returns to the head of the loop unless the loop counter has expired.

This is called a zero-overhead loop because the jump back to the top of the loop (and its associated delays) is avoided.

Page 75: Sharc Processor

The loop instruction uses two stacks to handle nested loops (one loop contained inside another).

The PC is in fact a stack; a separate stack holds the loop counters for all active loops.

The PC stack is 30 deep and holds subroutine return addresses and loop addresses; the loop counter stack is 6 deep.

Page 76: Sharc Processor

When the DO UNTIL is first encountered:
- the loop end address is pushed onto the PC stack
- the new loop counter value is pushed onto the loop counter stack
When execution reaches the loop end address:
- the CPU automatically decrements the loop counter and checks its value

Page 77: Sharc Processor

If the termination condition (which may be LCE or NOT LCE) is not satisfied, the PC is set to the instruction just after the DO UNTIL for another iteration.

If the condition is satisfied, the two stacks are popped and execution continues at the instruction after the loop end address.
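The DO UNTIL mechanism described above can be modeled with a small C sketch. It is a simplified software model, not the hardware: the struct, the function names, and the addresses 101 and 103 are hypothetical, the loop start address is passed in as a parameter for simplicity, and the count is assumed to be at least 1.

```c
#define PC_STACK_DEPTH 30   /* PC stack depth given in the slides */
#define LC_STACK_DEPTH 6    /* loop counter stack depth given in the slides */

/* Hypothetical model of the two stacks used by DO UNTIL. */
struct sequencer {
    int pc_stack[PC_STACK_DEPTH]; int pc_top;
    int lc_stack[LC_STACK_DEPTH]; int lc_top;
};

/* DO UNTIL: push the loop end address and the loop count.
   Returns 0 if either stack is full, 1 otherwise. */
int do_until(struct sequencer *s, int loop_end, int count)
{
    if (s->pc_top == PC_STACK_DEPTH || s->lc_top == LC_STACK_DEPTH)
        return 0;
    s->pc_stack[s->pc_top++] = loop_end;
    s->lc_stack[s->lc_top++] = count;
    return 1;
}

/* Reached the loop end address: decrement the counter and check it.
   Returns the next PC value. */
int at_loop_end(struct sequencer *s, int loop_start)
{
    int *counter = &s->lc_stack[s->lc_top - 1];
    *counter -= 1;
    if (*counter > 0)
        return loop_start;              /* not expired: iterate again */
    s->lc_top -= 1;                     /* expired: pop both stacks */
    return s->pc_stack[--s->pc_top] + 1; /* continue after the loop end */
}

/* Drive the model and count how many times the loop body completes. */
int run_loop(int n)
{
    struct sequencer s = {0};
    const int loop_start = 101, loop_end = 103; /* made-up addresses */
    int body_runs = 0;
    int pc = loop_start;
    if (!do_until(&s, loop_end, n))
        return -1;
    while (pc != loop_end + 1) {
        if (pc == loop_end) {
            body_runs++;
            pc = at_loop_end(&s, loop_start);
        } else {
            pc++;
        }
    }
    return body_runs;
}
```

Nested loops would simply call `do_until` again before the outer loop finishes, which is why two stacks are needed rather than two registers.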

Page 78: Sharc Processor

Ex 2-10 loop
for (i = 0, f = 0; i < N; i++)
f = f + c[i] * x[i];
! loop setup
I0 = _a; ! I0 points to a[0]
M0 = 1; ! set up increment
I8 = _b; ! I8 points to b[0]
M8 = 1; ! set up postincrement mode

Page 79: Sharc Processor

Ex 2-10 loop
! loop body
LCNTR = N, DO loopend UNTIL LCE;
! use postincrement mode
R1 = DM(I0,M0), R2 = PM(I8,M8);
loopend: R8 = R1*R2, R12 = R12 + R9; ! multiply and accumulate

Page 80: Sharc Processor

Ex 2-10 loop
optimized:
! loop setup
I4 = _a; ! load a
I12 = _b; ! load b
R4 = R4 xor R4, R1 = DM(I4,M6), R2 = PM(I12,M14);
MR0F=R4, MODIFY(I7,M7);

Page 81: Sharc Processor

Ex 2-10 loop
! start loop
LCNTR = 20, DO(PC,loop) UNTIL LCE;
loop: MRF = MRF + R2*R1 (SSI), R1 = DM(I4,M6), R2 = PM(I12,M14);
! loop clean-up
R0 = MR0F;
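The optimized loop fetches the next pair of operands in the same instruction that accumulates the current pair. A rough C sketch of that schedule is shown below; `dot_product` is a hypothetical name, and the guards against reading past the end of the arrays are additions needed for safe C, not part of the assembly.

```c
/* Sketch of the software-pipelined dot product: the first operand pair is
   loaded during setup, and each iteration accumulates the current pair
   while fetching the next one. Names are illustrative only. */
int dot_product(const int *a, const int *b, int n)
{
    int acc, r1, r2, i;
    if (n <= 0)
        return 0;               /* guard added for C; not in the assembly */
    acc = 0;                    /* R4 = R4 xor R4; MR0F = R4 */
    r1 = a[0];                  /* R1 = DM(I4,M6)   (setup fetch) */
    r2 = b[0];                  /* R2 = PM(I12,M14) (setup fetch) */
    for (i = 1; i <= n; i++) {  /* LCNTR = n, DO ... UNTIL LCE */
        acc += r2 * r1;         /* MRF = MRF + R2*R1 */
        if (i < n) {            /* C-only guard against an out-of-range read */
            r1 = a[i];          /* fetch the next pair in the same step */
            r2 = b[i];
        }
    }
    return acc;                 /* R0 = MR0F */
}
```

For `a = {1, 2, 3}` and `b = {4, 5, 6}` this returns 1*4 + 2*5 + 3*6 = 32.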

Page 82: Sharc Processor

SHARC function calls
Procedure calls:
CALL foo;
Calls can be executed conditionally:
IF GT CALL (PC,100);
This is a PC-relative call to a point 100 locations past the current PC value.
The CALL instruction pushes the current PC value plus 1 onto the PC stack before jumping to the target address.

Page 83: Sharc Processor

SHARC function calls
Return from a procedure call is performed by the RTS (return from subroutine) instruction.

This instruction pops the PC stack to return to the instruction after the call.

The SHARC does not include specific instructions for saving and restoring registers for procedure calls.
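The CALL and RTS behavior described above can be sketched as a small C model. The struct and function names are hypothetical, and the model covers only the push and pop of the return address, not the conditional execution.

```c
#define PC_STACK_DEPTH 30   /* PC stack depth given in the slides */

/* Hypothetical model of the PC and its hardware stack. */
struct pc_state {
    int pc;
    int stack[PC_STACK_DEPTH];
    int top;
};

/* CALL target: push PC + 1, then jump. Returns 0 if the stack is full. */
int call(struct pc_state *s, int target)
{
    if (s->top == PC_STACK_DEPTH)
        return 0;
    s->stack[s->top++] = s->pc + 1;
    s->pc = target;
    return 1;
}

/* CALL (PC,offset): PC-relative form of the same operation. */
int call_relative(struct pc_state *s, int offset)
{
    return call(s, s->pc + offset);
}

/* RTS: pop the return address into the PC. Returns 0 if the stack is empty. */
int rts(struct pc_state *s)
{
    if (s->top == 0)
        return 0;
    s->pc = s->stack[--s->top];
    return 1;
}
```

Starting from PC = 10, `call_relative(&s, 100)` moves the PC to 110, and the following `rts(&s)` returns it to 11.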

Page 84: Sharc Processor

Example 2-11
void f1(int a) {
f2(a);
}
The SHARC has a PC stack, so the return address does not need to be pushed, only the registers.

The SHARC does not have general-purpose stack operators, but the DAGs can be used to implement a stack with a little effort.

Page 85: Sharc Processor

Example 2-11
Pushing onto the stack: use postincrement mode, so that the I register automatically points to the empty location at the top of the stack.

Reading values off the stack requires specifying a constant offset in the M field to provide the distance from the end of the stack frame to the variable. Popping the stack means modifying the I register.

Page 86: Sharc Processor

Example 2-11
We use I1 to point to the stack and assume that M1 has been set to 1, the stack push increment, at the start of the program. Here is handwritten code for f1(), which includes a call to f2():

Page 87: Sharc Processor

Example 2-11
f1: R0 = DM(I1,-1); ! load argument a into R0 from stack
! call f2()
DM(I1,M1) = R0; ! push f2's argument onto the stack
CALL f2; ! call f2
! return from f1()
MODIFY(I1,-1); ! pop one element off stack
RTS; ! return
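The stack discipline used by f1() can be sketched in C as follows. This is a model under stated assumptions: the struct, the helper names, the 16-word stack size, and the `f2_seen` variable are all hypothetical; the offset read follows the slides' description of `DM(I1,-1)`; and bounds checks are omitted for brevity.

```c
#define STACK_WORDS 16      /* arbitrary size chosen for the sketch */

/* Hypothetical model of a DAG-based stack: i1 plays the role of I1 and
   always indexes the empty location at the top of the stack. */
struct dag_stack {
    int mem[STACK_WORDS];
    int i1;
};

/* DM(I1,M1) = value with M1 = 1: store, then postincrement I1. */
void push(struct dag_stack *s, int value)
{
    s->mem[s->i1] = value;
    s->i1 += 1;
}

/* R = DM(I1,offset): read at a constant offset, leaving I1 unchanged. */
int read_at(const struct dag_stack *s, int offset)
{
    return s->mem[s->i1 + offset];
}

/* MODIFY(I1,-count): popping just moves the I register. */
void pop(struct dag_stack *s, int count)
{
    s->i1 -= count;
}

int f2_seen;                /* records the argument f2 received */

void f2(struct dag_stack *s)
{
    f2_seen = read_at(s, -1);
}

void f1(struct dag_stack *s)
{
    int r0 = read_at(s, -1);    /* load argument a from the stack */
    push(s, r0);                /* push f2's argument */
    f2(s);                      /* CALL f2 */
    pop(s, 1);                  /* pop one element off the stack */
}
```

A caller would push the argument and then call `f1`; after `f1` returns, the stack pointer is back where the caller left it.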

Page 88: Sharc Processor

2.4.4 Parallelism within Instructions

Page 89: Sharc Processor

The SHARC allows operations to be performed simultaneously.

Many machines offer parallel execution, but it is hidden from the programmer.

The SHARC's wide instruction word allows the programmer to put together parallel operations.

Page 90: Sharc Processor

The machine supports both memory parallelism and operation parallelism.

Both reduce the number of instructions required for common operations.

For example, the basic operation in a dot product loop can be performed in one cycle that performs two fetches, a multiplication, and an addition.

Page 91: Sharc Processor

The modified Harvard architecture allows multiple data fetches in a single instruction.

The most common instructions allow a memory reference and a computation to be performed at the same time.

Memory references can be done two at a time in many instructions, with each reference using a DAG.

Page 92: Sharc Processor

The instruction set allows the CPU's function units to perform these combinations of operations in a single instruction:

- fixed-point multiply-accumulate and add, subtract, or average;
- floating-point multiplication and ALU operation; and
- multiplication and dual add-subtract.

Page 93: Sharc Processor

There are restrictions on the sources of the operands when operations are combined.

The operands going to the multiplier must come from R0 through R7 (or in the case of floating-point operands, f0 to f7), with one input coming from R0-R3/f0-f3 and the other from R4-R7/f4-f7.

Page 94: Sharc Processor

The ALU operands must come from R8-R15/f8-f15, with one operand coming from R8-R11/f8-f11 and the other from R12-R15/f12-f15.

For example, this instruction performs three operations:
R6 = R0 * R4, R9 = R8 + R12, R10 = R8 - R12
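The operand restrictions and the three-operation instruction can be illustrated with a short C sketch. All names here are hypothetical. Accepting the two operands in either order is an assumption based on the wording "one input ... and the other"; the real encoding may fix which operand comes from which group.

```c
/* Helper: is register number reg within [lo, hi]? */
static int in_range(int reg, int lo, int hi)
{
    return reg >= lo && reg <= hi;
}

/* Multiplier operands: one from R0-R3, the other from R4-R7.
   Either order is accepted here (an assumption, see the lead-in). */
int mul_operands_ok(int rx, int ry)
{
    return (in_range(rx, 0, 3) && in_range(ry, 4, 7)) ||
           (in_range(rx, 4, 7) && in_range(ry, 0, 3));
}

/* ALU operands: one from R8-R11, the other from R12-R15. */
int alu_operands_ok(int rx, int ry)
{
    return (in_range(rx, 8, 11) && in_range(ry, 12, 15)) ||
           (in_range(rx, 12, 15) && in_range(ry, 8, 11));
}

/* Model of R6 = R0 * R4, R9 = R8 + R12, R10 = R8 - R12.
   All sources are read before any destination is written, which is how
   operations issued in one instruction are modeled in this sketch. */
void mul_dual_addsub(int r[16])
{
    int prod = r[0] * r[4];
    int sum  = r[8] + r[12];
    int diff = r[8] - r[12];
    r[6]  = prod;
    r[9]  = sum;
    r[10] = diff;
}
```

With R0 = 3, R4 = 5, R8 = 10, and R12 = 4, the model leaves 15 in R6, 14 in R9, and 6 in R10.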

Page 95: Sharc Processor

2.5 Summary

Page 96: Sharc Processor

All CPUs are similar: they read and write memory, perform data operations, and make decisions.

There are many ways to design an instruction set, as illustrated by the differences between the ARM and the SHARC.

Page 97: Sharc Processor

When designing complex systems, we generally work in high-level language form, which hides many of the details of the instruction set.

Differences in instruction sets can be reflected in nonfunctional characteristics, such as program size and speed.