Automatically Generating Custom Instruction Set Extensions

Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors


Page 1: Automatically Generating Custom Instruction Set Extensions

Nathan Clark, Wilkin Tang, Scott Mahlke
Workshop on Application Specific Processors

Page 2: Problem Statement

There is demand for high-performance, low-power special-purpose systems
  E.g., cell phones, network routers, PDAs
One way to achieve these goals is to augment a general-purpose processor with Custom Function Units (CFUs)
  CFUs combine several primitive operations
We propose an automated method for CFU generation

Page 3: System Overview

[Figure: system overview diagram; contents not recoverable from the transcript]

Page 4: Example

[Figure: dataflow graph with nodes 1 to 8]

Potential CFUs:
  1,3; 2,4; 2,6; 3,4; 4,5; 5,8; 6,7; 7,8

Page 5: Example

[Figure: dataflow graph with nodes 1 to 8]

Potential CFUs:
  1,3; 2,4; 2,6; …
  1,3,4; 2,4,5; 2,6,7; …

Page 6: Example

[Figure: dataflow graph with nodes 1 to 8]

Potential CFUs:
  1,3; 2,4; 2,6; …
  1,3,4,5; 2,4,5,8; 2,6,7,8; …
  1,3,4,5,8
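The progression across the three Example slides (pairs, then larger groups) can be sketched as growing each candidate by one neighboring node at a time. This is only an illustration: it assumes the pairs listed on the first Example slide are the edges of the dataflow graph, treats them as undirected, and uses hypothetical names (`EDGES`, `enumerate_candidates`). The slides do not confirm that the authors' discovery algorithm works exactly this way.

```python
# Assumed edges of the example dataflow graph, taken from the pair list on the slide.
EDGES = [(1, 3), (2, 4), (2, 6), (3, 4), (4, 5), (5, 8), (6, 7), (7, 8)]


def build_neighbors(edges):
    """Map each node to the set of nodes it shares an edge with."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def enumerate_candidates(edges, max_ops):
    """Return {size: set of connected node sets}, for sizes 2..max_ops."""
    neighbors = build_neighbors(edges)
    current = {frozenset(edge) for edge in edges}  # size-2 seeds
    by_size = {2: current}
    for size in range(3, max_ops + 1):
        grown = set()
        for cand in current:
            # Nodes adjacent to the candidate but not yet part of it.
            frontier = set().union(*(neighbors[n] for n in cand)) - cand
            for node in frontier:
                grown.add(cand | {node})
        by_size[size] = grown
        current = grown
    return by_size
```

Calling `enumerate_candidates(EDGES, 5)` yields the eight pairs at size 2 and, among others, the sets 1,3,4 and 1,3,4,5,8 at sizes 3 and 5. The number of candidates grows quickly with `max_ops`, which is one reason a size limit is a natural parameter in a sketch like this.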

Page 7: Characterization

Use the macro library to get information on each potential CFU
  Latency is the sum of the primitives' latencies
  Area is the sum of the primitives' macrocell areas
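A minimal sketch of this characterization step, assuming a macro library that maps each primitive to a latency and an area. The table values, the units (fractions of a clock cycle, adder equivalents), and the names `MACRO_LIBRARY` and `characterize` are hypothetical and are not taken from the authors' library.

```python
# Hypothetical macro library: primitive -> (latency in cycles, area in adder equivalents).
MACRO_LIBRARY = {
    "ADD": (0.6, 1.0),
    "AND": (0.1, 0.1),
    "XOR": (0.1, 0.1),
    "LSL": (0.1, 0.3),
}


def characterize(ops, library=MACRO_LIBRARY):
    """Return (latency, area) of a candidate CFU as sums over its primitives."""
    latency = sum(library[op][0] for op in ops)
    area = sum(library[op][1] for op in ops)
    return latency, area
```

For example, `characterize(["AND", "XOR", "ADD"])` gives a latency of about 0.8 cycles and an area of about 1.2 adders under these made-up numbers.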

Page 8: Issues we consider

Performance
  On critical path
  Cycles saved
Cost
  CFU area
  Control logic (difficult to measure)
  Decode logic (difficult to measure)
  Register file area (can be amortized)

[Figure: example dataflow graph with operations LD, ADD, ADD, AND, ASL, XOR, BR; node annotations 1, 1, 1, 1, 1 and 0.1, 0.1, 0.1, 0.6, 0.6]
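The "cycles saved" measure can be illustrated with a small calculation. The sketch below assumes the replaced operations form a dependent chain in which each takes one cycle on the baseline machine, and that the CFU's latency is rounded up to whole cycles. Both assumptions, and the name `cycles_saved`, are mine and not stated on the slide.

```python
import math


def cycles_saved(num_ops, cfu_latency, on_critical_path=True):
    """Estimate cycles saved by replacing a chain of num_ops operations with one CFU."""
    if not on_critical_path:
        return 0  # shortening a non-critical chain does not shorten the schedule
    baseline_cycles = num_ops  # assumed: one cycle per primitive operation
    cfu_cycles = max(1, math.ceil(cfu_latency))  # a CFU occupies at least one cycle
    return baseline_cycles - cfu_cycles
```

Under these assumptions a three-operation chain with a combined latency of 0.8 cycles saves two cycles, and the same chain off the critical path saves none.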

Page 9: More Issues to Consider

IO
  Number of input and output operands
Usability
  How well can the compiler use the pattern

[Figure: dataflow graph with operations OR, LSL, AND, CMPP]
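Counting a candidate's input and output operands can be sketched as follows. The graph representation (a dictionary from each operation to the values it reads), the example connectivity, and the name `count_io` are assumptions for illustration. Only the operation names come from the slide's figure.

```python
def count_io(operands, candidate, live_out=frozenset()):
    """Return (number of inputs, number of outputs) for a candidate CFU.

    operands maps each operation to the list of values it reads; a value is
    either another operation in the graph or an external name such as "r1".
    """
    candidate = set(candidate)
    # Inputs: distinct values read by the candidate but produced outside it.
    inputs = {src for op in candidate for src in operands[op] if src not in candidate}
    # Outputs: candidate results that are live-out or read by an outside operation.
    outputs = {
        op
        for op in candidate
        if op in live_out
        or any(op in srcs for user, srcs in operands.items() if user not in candidate)
    }
    return len(inputs), len(outputs)


# Hypothetical connectivity for the four operations named in the figure.
EXAMPLE = {
    "or": ["r1", "r2"],
    "lsl": ["or", "r3"],
    "and": ["lsl", "r4"],
    "cmpp": ["and", "or"],
}
```

With this made-up graph, grouping `or`, `lsl`, and `and` needs four inputs and two outputs, because `cmpp` reads both `and` and `or`. Absorbing `cmpp` as well, with `cmpp` live-out, leaves four inputs and a single output.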

Page 10: Selection

Currently use a greedy algorithm
  Pick the candidate with the best ratio of performance gain to area first
  Can yield bad selections

[Figure: dataflow graph with operations OR, LSL, AND, CMPP]
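A minimal sketch of greedy selection by gain per unit area, which also shows how it can pick badly. The `Candidate` fields, the limit on the number of CFUs, and the rule that a chosen CFU blocks overlapping candidates are assumptions made for this example. The slide only states that the best gain/area ratio is picked first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    ops: frozenset       # operations of the dataflow graph covered by this CFU
    cycles_saved: float  # estimated performance gain
    area: float          # estimated cost


def greedy_select(candidates, max_cfus):
    """Pick up to max_cfus candidates, best cycles_saved/area ratio first."""
    ranked = sorted(candidates, key=lambda c: c.cycles_saved / c.area, reverse=True)
    chosen, covered = [], set()
    for cand in ranked:
        if len(chosen) == max_cfus:
            break
        if cand.ops & covered:
            continue  # assumed: an operation can belong to only one selected CFU
        chosen.append(cand)
        covered |= cand.ops
    return chosen


# Hypothetical candidates: A has the best ratio but blocks the more valuable B.
CANDIDATES = [
    Candidate("A", frozenset({1, 2}), cycles_saved=2, area=1),
    Candidate("B", frozenset({2, 3, 4}), cycles_saved=5, area=3),
    Candidate("C", frozenset({5, 6}), cycles_saved=1, area=1),
]
```

Here `greedy_select(CANDIDATES, 2)` returns A and C for a total of 3 cycles saved, even though B and C together would save 6. That is the kind of bad selection a purely ratio-driven greedy choice can produce.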

Page 11: Case study 1: Blowfish

Speedup: 1.24
  10 cycles can be compressed down to 2!
Cost: ~6 adders
6 inputs, 2 outputs
C code this DFG came from:
  r ^= (((s[(t>>24)] + s[0x0100+((t>>16)&0xff)]) ^ s[0x0200+((t>>8)&0xff)]) + s[0x0300+(t&0xff)]) & 0xffffffff;

[Figure: dataflow graph of ten operations (ADD, XOR, ADD, AND, XOR, LSR, AND, ADD, LSL, ADD) with register operands r65, r70, r76, r81, r891, r91 and constants #-1, #16, #255, #256, #2]
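To make the computation behind this case study concrete, here is a hedged Python transcription of the C statement. It assumes `t` is an unsigned 32-bit value, `s` is a table with at least 1024 entries of 32-bit values, and the last lookup is `s[0x0300 + (t & 0xff)]`. The function name `blowfish_update` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def blowfish_update(r, t, s):
    """Return the new value of r after the XOR-assignment from the slide."""
    a = s[t >> 24]                      # top byte of t indexes the first 256 entries
    b = s[0x0100 + ((t >> 16) & 0xFF)]  # second byte, second group of 256 entries
    c = s[0x0200 + ((t >> 8) & 0xFF)]   # third byte, third group
    d = s[0x0300 + (t & 0xFF)]          # low byte, fourth group
    # Python integers do not wrap, so the final mask restores 32-bit behavior.
    return r ^ ((((a + b) ^ c) + d) & MASK32)
```

With an identity table `s = list(range(1024))`, `blowfish_update(0, 0x01020304, s)` evaluates to 1540. The shifts, masks, additions, and XORs visible here are the kind of primitive operations the CFU in this case study combines.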

Page 12: Case study 2: ADPCM Decode

Speedup: 1.20
  3 cycles can be compressed down to 1
Cost: ~1.5 adders
2 inputs, 2 outputs
C code this DFG came from:
  d = d & 7;
  if ( d & 4 ) { … }

[Figure: dataflow graph with operations AND, AND, CMPP; register operand r16 and constants #7, #4, #0]

Page 13: Experimental Setup

CFU recognition is implemented in the Trimaran research infrastructure
Speedup shown is with CFUs relative to a baseline machine
  Four-wide VLIW with predication
  Can issue at most one integer, floating-point, memory, and branch instruction per cycle
  300 MHz clock
CFU latency is estimated using standard cells from Synopsys' design library

Page 14: Varying the Number of CFUs

More CFUs yield more performance
Weakness in our selection algorithm causes plateaus

[Figure: adpcm-decode. x-axis: number of function units (0 to 20); left y-axis: speedup (1 to 2.2); right y-axis: additional cost (0 to 80); series: speedup, additional cost]

Page 15: Varying the Number of Ops

Bigger CFUs yield better performance
If they're too big, they can't be used as often and they expose alternate critical paths

[Figure: blowfish. x-axis: max number of ops/CFU (0 to 20); left y-axis: speedup (1 to 2); right y-axis: additional cost (0 to 80); series: speedup, additional cost]

Page 16: Related Work

Many people have done this for code size
  Bose et al., Liao et al.
Typically done with traces
  Arnold et al.
Previous paper used a more enumerative discovery algorithm
We are unique because:
  Compiler-based approach
  Novel analysis of CFUs

Page 17: Conclusion and Future Work

CFUs have the potential to offer big performance gains for small cost
Recognize more complex subgraphs
  Generalized acyclic/cyclic subgraphs
Develop our system to automatically synthesize application-tailored coprocessors