Architecture-aware Taylor shift by 1 - Drexel University1226/datastream... · Architecture-aware...

Architecture-aware Taylor Shift by 1

A Thesis

Submitted to the Faculty

of

Drexel University

by

Anatole D. Ruslanov

in partial fulfillment of the

requirements for the degree

of

Doctor of Philosophy

December 2006

11

Acknowledgements

• I would like to thank my advisors Jeremy Johnson and Werner Krandick for

their support and patience.

• I would like to thank Jiirgen Gerhard, Guillaume Hanrot et al., Bernard Mour-

rain et al., and George Collins, who developed IPRRIDB, for making their code

available.

• I would like to thank Paul Zimmermann for providing his code for the modular

convolution method.

• Thomas Decker invented the interlaced polynomial representation to improve

the SACLIB method.

• I would like to offer my profound gratitude to my parents, Rena and David

Finko for their endless and unfailing patience and support.

• I would like to thank P. R. Sarkar and Sandra Valenzano-Dillon for inspiration

and perspective.

• I would like to thank Jamie Chandler, Robert Garbellano, Donna Hanelin,

Wendy Henson, Sarah Hohenberger, Sherry Paris, Milan Sampat, Virat Staer,

and Clifford J. Straehley for support and encouragement. Friends do matter!

• I would like to thank Carolyn Kieber Grady and Diane Ranck for final proof

reading.

iii

Table of Contents

List of Tables vii

List of Figures ix

Abstract xv

1. Preliminaries 7

1.1 Introduction 7

1.2 Thesis organization 7

1.3 Taylor shift by 1: Analysis 8

1.4 Performance and computer architecture 15

1.4.1 Pipelined processors 16

1.4.2 Superscalar execution (ILP) 18

1.4.3 Memory hierarchy 20

1.5 Methodology and experimental procedures 24

1.5.1 Processor architecture 24

1.5.2 Hardware configuration 26

1.5.3 Compilation protocol 27

1.5.4 Input polynomials 28

1.5.5 Performance counter measurements 30

1.5.6 Cache flushing 30

1.6 Literature review 31

1.6.1 Taylor shift by 1 31

1.6.2 Asymptotically fast methods 32

1.6.3 Crossover points for GNU-MP, NTL, and SACLIB 34

iv

1.6.4 Register tiling 35

1.6.5 Compilers and automatic code generation and tuning 36

1.6.6 Real root isolation 37

1.6.7 Notes on experimental methodology 39

2. Straightforward implementations 40

2.1 Introduction 40

2.2 GNU-MP addition 42

2.3 NTL addition 43

2.4 The SACLIB method 44

3. The tile method 47

3.1 Introduction 47

3.2 Description of the algorithm 48

3.3 Properties of the algorithm 51

3.3.1 Register tile schedule 52

3.3.2 An example 53

3.4 Performance 59

3.4.1 Experimental methodology and platform 60

3.4.2 Execution time 60

3.4.3 Efficiency of addition 62

3.4.4 Memory traffic reduction 62

3.4.5 Cache miss rates 66

3.4.6 Branch mispredictions 66

3.4.7 Computing times in the literature 70

3.5 Automatic code generation 72

V

3.5.1 Processor utilization 73

3.5.2 The 4100 degree irregularity 76

4. Modeling Taylor shift by 1 79

4.1 Introduction 79

4.2 A model for GNU-MP addition 79

4.3 Modeling the straightforward method 83

4.4 Modeling the tile method 94

4.4.1 Impact of changing the number of engaged IEUs 99

4.4.2 Finding optimal register tile size 101

5. Asymptotically fast methods 104

5.1 Introduction 104

5.2 Performance of the fast methods 106

5.3 Computing times in the literature 109

5.4 Improving performance of the fast methods 109

5.4.1 Replacing native NTL arithmetic with GNU-MP ari thmetic. . . 113

5.4.2 Observations on coding 116

5.5 Conclusions 117

6. Applications 118

6.1 High-performance de Casteljau's algorithm 118

6.2 High-performance Descartes method 120

6.2.1 Monomial vs. Bernstein bases 120

6.2.2 The Descartes methods we compare 123

6.2.3 Performance results 125

7. Future research 136

vi

Bibliography 138

Vita 146

Vll

List of Tables

3.1 An optimal instruction schedule for the 8 x 8 register tile for the Ultra

SPARC III processor 54

3.2 An example polynomial with its coefficients in the interlaced representa

tion 55

3.3 The output polynomial after Taylor shift by 1 computation with its co

efficients not normalized 58

3.4 The output polynomial after Taylor shift by 1 computation with its co

efficients normalized 59

3.5 Computing times (s.) for Taylor shift by 1 —"small" coefficients 70

3.6 Computing times (s.) for Taylor shift by 1 —"large" coefficients 71

4.1 Parameters used for modeling GNU-MP addition. The patch refers to

Gaudry's patch [39] 82

4.2 Experimentally determined cost for register tile execution 96

4.3 Cost of the b x b register tile execution in cycles. The chosen b is the

optimal value for the particular platform, see Section 3.5 96

4.4 Experimentally determined cost of the 3 versions of the carry propagation

in processor cycles for the register tile sizes ranging from 4 x 4 to 24 x 24. 97

4.5 Cost of the delayed carry propagation in cycles 98

4.6 Parameters used in modeling the tile method 98

5.1 Computing times (s.) for the divide and conquer method of Taylor shift

by 1 —"small" coefficients 110

Vll l

5.2 Computing times (s.) for the convolution method of Taylor shift by 1

—"small" coefficients 110

5.3 Computing times (s.) for the Paterson & Stockmeyer method of Taylor

shift by 1 —"small" coefficients I l l

5.4 Computing times (s.) for the divide and conquer method of Taylor shift

by 1 —"large" coefficients I l l

5.5 Computing times (s.) for the convolution method of Taylor shift by 1

—"large" coefficients 112

5.6 Computing times (s.) for the Paterson & Stockmeyer method of Taylor

shift by 1 —"large" coefficients 112

6.1 Root isolation timings in milliseconds for Intel Pentium EE 132

6.2 Root isolation timings in milliseconds for AMD Opteron 133

6.3 Root isolation timings in milliseconds for UltraSPARC III 134

6.4 Root isolation timings in milliseconds for Intel Pentium 4 135

IX

List of Figures

1.1 By Theorem 1.3.3, the pattern of integer additions in Pascal's triangle, a>%,3 = &i,3-i + ai-i,j) c a n be used to perform Taylor shift by 1 10

1.2 The coefficients of the polynomial A^(x) in the proof of Theorem 1.3.3 reside on the &-th diagonal. Multiplication of Ak(x) by (x + 1) can be interpreted as an addition that follows a shift to the right and a downward shift 11

1.3 Pipelining delivers efficient execution of machine instructions 16

1.4 Pipeline stalls caused by dependencies such as mispredicted branches may cause a slowdown by a factor of 10 to 30 19

1.5 Superscalar feature of UltraSPARC III processor accommodates up to 4 simultaneous instructions in its Execute stage 20

1.6 Memory hierarchy of 1 GHz UltraSPARC III processor 22

1.7 A simple direct-mapped cache 23

2.1 The straightforward method we consider uses integer additions to compute the elements of the matrix in Figure 1.1 from top to bottom and from left to right 41

2.2 The GNU-MP assembly addition routine for the UltraSPARC III platform 43

2.3 Performance gain due to Gaudry's patch for the straightforward method. 44

2.4 Taylor shift by 1 in SACLIB 46

3.1 a. Tiled Pascal triangle, b. Register tile stack. A register tile is computed for each order of significance. Carries are propagated only along lower and right borders 49

X

3.2 a. A scheduled 8 x 8 register tile. Arrows represent memory references, "+" signs represent additions. Numbers represent processor cycles. The 2 integer execution units (IEUs) perform 2 additions per cycle, b. Register tile: a sketch of the proof of Theorem 3.3.1 52

3.3 Pascal's triangle for the example polynomial. The coefficients of the polynomial and the elements of the triangle are in decimal representation. 53

3.4 Pascal's triangle for the level of significance 0 56



3.7 Non-normalized register tile for the level of significance 0 57



3.10 Normalized register tile for level of significance 0 after carry propagation; now the radix (3 = 8 58



3.13 The tile method is up to 7 times faster than the straightforward method. 61

3.14 For the input polynomials Cn,d the tile method computes a whole register tile stack at the precision required for just the constant term 63

3.15 In GNU-MP addition the ratio of cycles per word addition (left scale) increases with the cache miss rate (right scale) 64

3.16 In classical Taylor shift by 1 the tile method requires fewer cycles per word addition than the straightforward method 65

xi

3.17 The tile method substantially reduces the number of memory reads required for the Taylor shift; the extent of the reduction depends on the compiler 67

3.18 For large degrees the tile method has a lower cache miss rate than the straightforward method. Moreover, the number of cache misses generated by the tile method is small because the tile method performs few read operations 68

3.19 The number of branch mispredictions per cycle is negligible for the tile method and the straightforward method 69

3.20 Impact of tile size on the performance of the tile method on Pentium EE processor. Legend: tile size in word x word 74

3.21 Impact of tile size on the performance of the tile method on Opteron processor. Legend: tile size in word x word 74

3.22 Impact of tile size on the performance of the tile method on Pentium 4 processor. Legend: tile size in word x word 75

3.23 Impact of tile size on the performance of the tile method on UltraSPARC III processor. Legend: tile size in word x word 75

3.24 Processor utilization in word additions per cycle for the straightforward method 77

3.25 Processor utilization in word additions per cycle for the tile method 77

4.1 Modeling GNU-MP addition for the UltraSPARC III processor 84

4.2 Modeling GNU-MP addition for the Pentium 4 processor 85

4.3 Modeling GNU-MP addition without Gaudry's patch for the Opteron processor 86

4.4 Modeling GNU-MP addition with Gaudry's patch for the Opteron processor 87

Xll

4.5 Modeling GNU-MP addition for the Pentium EE processor 88

4.6 The distribution of the length of sums L in the straightforward method. Both experimental and modeled data provided 91

4.7 Modeling the straightforward method for the UltraSPARC III processor. 91

4.8 Modeling the straightforward method for the Pentium 4 processor 92

4.9 Modeling the straightforward method for the Opteron processor 92

4.10 Modeling the straightforward method for the Opteron processor with Gaudry's patch 93

4.11 Modeling the straightforward method for the Pentium EE processor 93

4.12 The rolled delayed carry release routine for the tile method 97

4.13 Modeling the tile method on the Pentium EE architecture 99

4.14 Modeling the tile method on the Opteron architecture 100

4.15 Modeling the tile method on the Pentium 4 architecture 100

4.16 Modeling the tile method on the UltraSPARC III architecture 101

4.17 Impact of changing the number of IEUs on the lower bound for the computing time of the tile method for AMD Opteron architecture 102

5.1 The tile method is faster than the asymptotically superior divide and conquer method for a wide range of degrees 106

5.2 All asymptotically fast methods are slower than the divide and conquer method on the Pentium EE. The convolution method is over 80 x slower than the tile method and is not shown 107

5.3 All asymptotically fast methods are slower than the divide and conquer method on the Opteron. The convolution method is over llOx slower than the tile method and is not shown 107

Xll l

5.4 All asymptotically fast methods are slower than the divide and conquer method on the Pentium 4. The convolution method is over 50 x slower than the tile method and is not shown 108

5.5 All asymptotically fast methods are slower than the divide and conquer method on the UltraSPARC III. The convolution method is over 200 x slower than the tile method and is not shown 108

5.6 Using 64-bit arithmetic on the Pentium EE improves the crossover point. The convolution method is over 30 x slower than the tile method and is not shown 113

5.7 Using 64-bit arithmetic on the Opteron improves the crossover point. The convolution method is over 25 x slower than the tile method and is not shown 114

5.8 Using 64-bit arithmetic on the Pentium 4 improves the crossover point. The convolution method is over 8x slower than the tile method and is not shown 114

5.9 Using 64-bit arithmetic on the UltraSPARC III improves the crossover point. The convolution method is over 62 x slower than the tile method and is not shown 115

5.10 Gaudry's patch improves the crossover point on the Opteron. The convolution method is over 19 x slower than the tile method and is not shown. 116

6.1 The computation structure of (a) Taylor shift by 1 and (b) de Casteljau's algorithms 118

6.2 (a) The pattern of integer additions in Pascal's triangle, al>3 = at>3-\ + at-\>3) can be used to perform Taylor shift by 1. (b) In de Casteljau's algorithm all dependencies are reversed, the intermediate results are computed according to the recursion b3>l = ^-i,» + &j_i)l+i 119

6.3 Register tiling can be applied to (a) Taylor shift by 1 and (b) de Casteljau's algorithm. Arrows show direction of addition 119

XIV

6.4 Speedup with respect to the monomial SACLIB implementation for random polynomials on four architectures 126

6.5 Speedup with respect to the monomial SACLIB implementation for Cheby-shev polynomials on four architectures 127

6.6 Speedup with respect to the monomial SACLIB implementation for reduced Chebyshev polynomials on four architectures 128

6.7 Speedup with respect to the monomial SACLIB implementation for Mignotte polynomials on four architectures 129

XV

Abstract

Architecture-aware Taylor Shift by 1

Anatole D. Ruslanov Advisors: Jeremy R. Johnson and Werner Krandick

We introduce register tiling for optimizing series of multiprecision additions.

Our new tile method for designing an architecture-aware classical Taylor shift by 1

algorithm—a low-level operation important to the monomial bases variant of the

Descartes method for polynomial real root isolation—obtains up to 7 times faster

performance over standard implementations that call the efficient integer addition

routines from the GNU Multiple Precision Arithmetic Library [44].

Our tile method for Taylor shift by 1 algorithm requires more word additions

but it reduces the number of cycles per word addition by decreasing memory traffic

and the number of carry computations. To enable standard compilers to tile the

algorithm, we introduce signed digits, suspended normalization, radix reduction,

and delayed carry propagation.

The performance of our tile method depends on several parameters that can be

modeled for and tuned to the underlying architecture. We show how such modeling

can guide automatic code generation and automatic experimentation to adapt an

algorithm to the underlying architecture for better ILP and pipeline utilization. We

automatically generate our tile method for Taylor shift by 1 in a high-level language

and tune it to four different processor architectures.

The architecture-aware tile method outperforms four asymptotically fast meth-

XVI

ods up to degree 6000 on the four hardware platforms. We analyze feasibility of

constructing high-performance architecture-aware fast methods.

Using our register tiling technique, we automatically generate and tune de Castel-

jau's algorithm, an operation with similar pattern of additions. The algorithm—

"probably the most fundamental computation in the field of curve and surface de

sign" (G. Farin [33])—is the main subalgorithm of the Bernstein bases variant of the

Descartes method. We obtain similar performance gains.

Applying our architecture-aware algorithms, we compare performance of several

implementations of the monomial bases and Bernstein bases variants of the Descartes

method on four processor architectures and for three classes of input polynomials.

All variants have the same asymptotic computing time bound. The comparison

shows that the best absolute computing times are obtained on an Opteron processor

platform using the Bernstein-bases variant of the Descartes method with register

tiling.

1

Foreword

Computer algebra systems, such as Maple and Mathematica, provide many effi

cient algorithms for exact computation with mathematical objects such as arbitrary

precision integers, rational numbers, polynomials, and, more generally, mathemati

cal expressions. Improved algorithms, better implementations, and faster computers

have enabled many previously time-consuming computer algebra computations to be

performed routinely. However, many computations still require excessive computing

time, and there are many cases where the performance achieved by an implementa

tion could be dramatically improved.

Many challenges for achieving high performance in computer algebra algorithm

implementations are due to their irregular structure and higher level data types.

Most of the work in high-performance algebraic algorithm design has been focused on

reducing arithmetic complexity and bit-complexity when the size of numbers is im

portant. However, simply reducing the number of arithmetic operations or using op

timized implementations of basic arithmetic operations is insufficient. An algorithm-

level perspective that considers the entire computation—not just its parts—must be

adopted instead.

Modern computers are complex systems that incorporate features such as pipelin

ing, superscalar execution, speculative computing, and multilevel memory hierar-

2

chies, to achieve high performance. These features, when properly utilized, can lead

to dramatic improvement in performance. However, effective utilization of the pro

cessor is a highly non-trivial problem, which can not simply be left to the compiler.

The complex interactions of the features make it difficult to predict performance and

has led to an empirically based approach called automated performance tuning [70].

In fact, effective utilization of these features can be more important than reducing

the number of arithmetic operations in obtaining high-performance code and can

lead to an order of magnitude improvement in performance.

Achieving effective algorithm-level optimizations typically requires transforma

tions such as high-level restructuring of the algorithm, changing data structures

and reordering operations to overcome dependencies. Programming is done in a

high-level language for portability and ease of maintenance. Portable coding also

simplifies automatic code generation and performance tuning, which finds the best

algorithm for high performance on a particular architecture through automatic ex

perimentation.

An execution model that takes features of the architecture into account would be

helpful in guiding the choice of transformations and optimizations and selecting the

best implementation. However, as indicated, the complexities of modern processors

along with lack of detail provided by hardware vendors make this difficult. The

difficulty in accurately modeling performance leads to a more empirical approach

relying on benchmarking and profiling (including measuring the utilization of the

features of the processor) which searches for the best implementation. Nonethe

less, modeling, while not always an accurate prediction of performance, can provide

insight and reduce the amount of search required.

3

Designing high-performance algorithms for modern processors requires consider

able effort. We think of our work as computing with abstractions that arise from

the architecture versus more usual computing with abstractions that arise from the

underlying mathematical operations. This allows for a more architecture-centered

approach to designing high-performance algorithms.

4

Summary of contributions

My thesis has addressed the following problems and has made the following

contributions:

1. We have applied known high-performance architecture-aware algo

rithm design techniques—in particular, register tiling optimizations—

to Taylor shift by 1. The algorithm has a pattern of additions that a com

piler should be able to optimize for high performance. Tuning multiprecision

("bignum") integer addition, however, is a challenge because compilers cannot

perform such optimizations without understanding of the algorithmic domain.

We have avoided low-level assembly coding with a high-level language algo

rithm that exploits features of the architecture and enables the compiler to

perform optimizations that it otherwise would have been prevented from per

forming. We also reduced implementation time and software maintenance cost

with automatic code generation and tuning to a target architecture.

2. We have determined that there is a large range of input sizes for

which our classical approach to Taylor shift by 1 outperforms asymp

totically fast approaches. While our experiments utilized existing imple

mentations of several asymptotically fast algorithms for Taylor shift by 1 that

5

were not architecture-aware, we have performed extensive profiling and inves

tigated several approaches for redesign to further improve their performance.

We have demonstrated that effective utilization of features of the computer

architecture significantly affects crossover points. However, our results suggest

that, even with these enhancements, there is a wide range of inputs (up to de

gree 6000 in our studies), covering most practical sizes, where the tuned clas

sical approaches significantly outperform the asymptotically fast algorithms.

3. We have implemented a high-performance de Casteljau's a lgori thm

by applying the knowledge gained from designing the architecture-

aware Taylor shift by 1. De Casteljau's algorithm—a fundamental com

putation for curve and surface design—has a similar pattern of additions and

benefit from the same optimization techniques as Taylor shift by 1 computa

tion.

4. We have used the high-performance Taylor shift by 1 and de Castel

jau's a lgorithms in polynomial real root isolation. Using efficient ker

nels is not straightforward due to incompatible data structure interfaces, in

ability to apply high-level optimization across calls to the kernel routines,

and need for special instances of these kernel routines. We used our high-

performance versions of Taylor shift by 1 and de Casteljau's algorithms for

comparing the performance of algorithmically tuned implementations of the

monomial and Bernstein variants and architecture-unaware implementations

of both variants on four different processor architectures and for three classes

of input polynomials. The comparison shows that the best absolute computing

6

times are obtained on an Opteron processor platform using the Bernstein-bases

variant of the Descartes method with register tiling.

The results of this dissertation have led to the following publications:

1. High-performance architecture-aware Taylor shift by 1, (with Jeremy R. John

son and Werner Krandick), 10th International Conference on Applications of

Computer Algebra, Lamar University, July 21-23, 2004, Beaumont, Texas.

2. Architecture-aw are classical Taylor shift by 1, (with Jeremy R. Johnson and

Werner Krandick), International Symposium on Symbolic and Algebraic Com

putation, pages 200-207, ACM Press, 2005.

3. Using high-performance Taylor shift by 1 in real root isolation, (with Jeremy R.

Johnson and Werner Krandick), 11th International Conference on Applications

of Computer Algebra, Nara Women's University, July 31 - August 3, 2005,

Nara, Japan.

4. High-performance implementations of the Descartes method, (with Jeremy R.

Johnson, Werner Krandick, Kevin M. Lynch, David G. Richardson), Interna

tional Symposium on Symbolic and Algebraic Computation, pages 154-161,

ACM Press, 2006.

7

1. Prel iminaries

1.1 Introduct ion

This thesis is a study of methods for computing Taylor shift by 1, a low-level op

eration that is important for the monomial bases variant of the Descartes method of

polynomial real root isolation, an essential algorithm in computer algebra systems.

Both classical and asymptotically fast methods of Taylor shift by 1 are studied. The

thesis presents a new architecture-aware method that introduces register tiling opti

mization techniques for algorithms that involve patterns of multiprecision additions.

Since our register tile method (see Chapter 3) outperforms the asymptotically fast

methods for a wide range of degrees, it is also a study of how useful the theoretically

fast algorithms are in practical applications. In addition, we apply the register tiling

technique to a similar algorithm: de Casteljau's algorithm, a fundamental method in

computer-aided design [33] which is the main subalgorithm of the Bernstein-bases

variant of the Descartes method for real root isolation. We then apply the tiled

Taylor shift by 1 and de Casteljau's algorithms to their respective variants of the

Descartes method and experimentally compare performance of both variants for

three classes of input polynomials on four different processor architectures.

1.2 Thesis organization

This thesis is organized as follows. In Chapter 1, we define Taylor shift by 1 as

a series of additions that compute elements of Pascal's triangle, discuss computer

8

architecture and its influence on performance, discuss our experimental methodol

ogy, describe the four architecture platforms used in our experiments, and review

previous work. In Chapter 2, we discuss the straightforward methods for comput

ing Taylor shift by 1 based on GNU-MP [44] and NTL [82] libraries as well as the

SACLIB [22] method. Chapter 3 presents the tile method of computing Taylor

shift by 1, including its performance advantages on the UltraSPARC III architec

ture [49, 86]. We also discuss automatic code generation and tuning in the chapter.

Chapter 4 describes modeling GNU-MP [44] addition, the straightforward method,

and the tile method. Chapter 5 presents asymptotically fast methods for computing

Taylor shift by 1, performance comparison to the tile method, and our research into

ways of improving performance of the fast algorithms. In Chapter 6 we discuss ap

plying register tiling for implementing high-performance de Casteljau's algorithm.

We then apply the tiled implementation of de Casteljau's algorithms to derive and

compare several high-performance variants for the Descartes method of real root

isolation. In the final Chapter 7, we present ideas for further research.

1.3 Taylor shift by 1: Analysis

Let A(x) be a univariate polynomial with integer coefficients. Taylor shift by 1

is the operation that computes the coefficients of the polynomial B(x) = A(x + 1)

from the coefficients of the polynomial A(x). Taylor shift by 1 is the most time-

consuming subalgorithm of the monomial bases variant of Descartes method [21]

for polynomial real root isolation. Taylor shift by 1 can also be used to shift a

polynomial by an arbitrary integer a. Indeed, if B(x) = A(ax) and C(x) = B(x+ 1)

and D(x) = C(x/a), then D(x) = A(x + a). According to Borowczyk [12], Budan

proved this fact in 1811.

Theorem 1.3.1. Let A{x) = Y J atx% be a polynomial of degree. Taylor shift by 0<i<n

1 computes Taylor expansion B{x) = Y J b%x% = A(x + 1) = Y J at(x + 1)* where 0<i<n 0<i<n

K = (nt)an + (™71)a„_i + . . . + (l)at for i = 0,..., n.

Proof Induction on n using the binomial theorem. □

We will call a method that computes Taylor shift by 1 classical if the method

uses only additions and computes the intermediate results given in Definition 1.3.2.

Definition 1.3.2. For any non-negative integer n let In = {(i,j) \ i, J' > 0 A i-\-j <

n}. If n is a non-negative integer and

A(x) = anxn + . . . + a\X + a0,

is an integer polynomial we let, for k G { 0 , . . . , n} and (i, j) G In,

a-i,k = 0,

Qfe,-1 = «ra- fc j

as shown in Figure 1.1.

Theorem 1.3.3. Let n be a non-negative integer, and let A(x) = anxn+.. .+aix+a0

10

0 0 0 0 4* 4* 4* 4*

an —> ao,o Qo,i &o,2 ao,3 Q n - i ~ ^ Qi,o Qi , i Qi,2

O n - 2 ~ ^ &2,0 ^2,1

Q n - 3 ~ ^ a 3 , 0

Figure 1.1: By Theorem 1.3.3, the pattern of integer additions in Pascal's triangle, a>i,3 = ai,3-i + a*-i,j> c a n be used to perform Taylor shift by 1.

be an integer 'polynomial. Then, in the notation of Definition 1.3.2,

n

A(x + 1) = ^ an-h,hXh. h=0

Proof. The assertion clearly holds for n = 0; so we may assume n > 0. For every

_nak-h,hX ■ Figure 1.2 shows that the coefficients of

the polynomial Ak reside on the k-ih diagonal of the matrix of Figure 1.1. Then,

for all k E {0 , . . . , n — 1}, we have Ak+i(x) = (x + l)Ak(x) + an-(k+i)- Now an easy

induction on k shows that Ak(x) = J2h=0 an_k+h(x + ^)h for all k G {0 , . . . , n}. In

particular, An{x) = Y^h=o a^x + ^)h = Mx + !)■ ^

Definition 1.3.4. Let a be an integer. The binary-length of a is defined as

[log2 HJ + 1 otherwise.

Definition 1.3.5. The max-norm of an integer polynomial A = anxn + - ■ --\-aiX-\-ao

is \A\oo = max(|a„| , . . . , |a0|).

11

A0(x)

Ck),o

Qi,o

x

«o,i

« i , i

«0,2

«1,2

X

«0,3

M{x)^ 0*2,0^ «2,1

A2(x)^ «3,0 A2(x)^ «3,0

A3(x) ' '

multiplication by x multiplication by 1

Figure 1.2: The coefficients of the polynomial A^(x) in the proof of Theorem 1.3.3 reside on the A;-th diagonal. Multiplication of Ak(x) by (x + 1) can be interpreted as an addition that follows a shift to the right and a downward shift.

The SACLIB method (see Section 2.4) and the new tile method (see Section 3.2)

for Taylor shift by 1 computation require a bound on the binary lengths of the

intermediate results a, H,J-

Theorem 1.3.6. Let n be a non-negative integer, and let A(x) = anxn+.. .+aix+a0

be an integer polynomial of max-norm d. Then, for all (i,j) G In,

!• <h,3 = C + ' H + C T l ) a n ~ i + ■■■ + ( j K - * > a n d

2. L(at>3) < L(d) + i + j .

Proof Assertion (1) follows from Definition 1.3.2 by induction on % + j . Due to

12

assertion (1),

M < ( f ^ + f-H-1 ) + ... + (' in

l + J + l

< 2t+3d

which proves assertion (2). □

Remark 1.3.7. Theorem 1.3.6 implies that, for degree n and max-norm d, the bi

nary length of all intermediate results is at most L(d) + n. The SACLIB method

(Section 2.4) can be slightly improved for small-degree polynomials by tightening

that bound for n E { 8 , . . . 39} to L(d) +n- 1, for n E { 4 0 , . . . , 161} to L(d) + n - 2 ,

and for n E {162 , . . . , 649} to L{d) + n - 3.

We will use Theorem 1.3.8 to prove lower bounds for the computing time of two

classes of input polynomials.

Theorem 1.3.8. Let n be a non-negative integer. Then at least n/2 of the binomial

coefficients ( \, 0 < k < n, have binary length > n/2. V k '

Proof. By direct computation, the assertion is true for all n E { 0 , . . . , 19}, so we

may assume n > 20. We then have

n - [n/A\ + 1 > 42.

13

Also, for 0 < % < [n /4 j ,

n — % n n [n/4j - % [n/A\ ~~ n/A

so that

n

Ln/4j n n—\ n— |_rz/4j + 1

Ln/4J [n/A\ - 1

> 4K4J+1 _ 22K4J+2 > 2™/2

Hence, the binary length of each binomial coefficient

n \ / n \ / n

Ln/4J ^' ^ K 4 J + 1 n - K4J

is > n /2 . But the number of those coefficients is > n/2. D

Theorem 1.3.10 and the proof of Theorem 1.3.9 characterize the computing time

functions of classical Taylor shift on the sets of polynomials Bn>d and Cn>d, see

Definition 1.5.1 in Section 1.5.4. We use the concept of dominance defined by

Collins [20] since it hides fewer constants than the more widely used big-Oh notation;

Collins also defines the maximum computing time function.

Theorem 1.3.9. Let t+(n, d) be the maximum computing time function for classical

Taylor shift by 1 where n > 1 is the degree and d is the max-norm. Then t+(n, d) is

co-dominant with n3 + n2L(d).

Proof. The recursion formula in Definition 1.3.2 is invoked \In\ = n(n+ l ) / 2 times.

14

Hence the number of integer additions is dominated by n2. By Theorem 1.3.6, the

binary length of any summand is at most L(d) + n. Thus the computing time is

dominated by n2 • (L(d) + n).

We now show that, for the input polynomials Bn>(i, the computing time dominates

n3 +n2L(d). Since, for any fixed n > 1, the computing time clearly dominates L(d)

we may assume n > 2. By Theorem 1.3.6 (1),

at, = )d

for all (i, j) G In. For k = % + j > 2, Theorem 1.3.8 yields that at least (k + l)/2 of

the binomial coefficients

fc+1),('; + 1 ) , . . . , ( ' ; + 1

1 ' V 2 7 V fc+1

have binary length > [k + l) /2. So, for all A; G {2 , . . . , n} there are least (k + l)/2

integers al>3 with iJrj=k and

L(o„)>L(d)-l + ^ i i

Now the assertion follows by summing all the lengths. □

Our proof of Theorem 1.3.10 assumes that the time to add two non-zero integers

a, b is co-dominant with L(a) + L(b); Collins [20] makes the same assumption in his

analysis.

15

Theorem 1.3.10. The computing time function of classical Taylor shift by 1 on the

set of polynomials Cn>d of Definition 1.5.1 is co-dominant with n3 + L(d).

Proof By Theorem 1.3.6, a„)0 = d+l and, for (i,j) E In — {(n, 0)},

(h,3 = [ )■ J

Hence, by Theorem 1.3.8, for any k G { 0 , . . . , n } , at least half of the integers a^o,

afe-i,i, ■ • ■, ao,fc have binary length > k/2. Since all of them—except possibly an>o—

have binary length < k, we have that

n k

-L(anfi) + 5 ^ 5 ^ L{ah-3t3) ~ n3. fc=0 j = 0

But the time to compute ara0 is co-dominant with L(d), and so the total computing

time is co-dominant with n3 + L(d). D

1.4 Performance and computer architecture

Effective computation implies meeting performance expectations. Effective com

putation also motivates studying algorithms and computer architecture for achieving

future high performance computing goals.

Relying on compilers for achieving high performance is an oversight because

compilers do not know the application domain and cannot optimize to the depth

possible if the domain is well-understood. We have shown that compilers cannot

deliver high performance even for a relatively basic classical algorithm [59] and

16

A 01

■o +^ tu o>

o

t 0)

o> u. < j

>, u '—

t <s 01 < j

>. U

Figure 1.3: Pipelining delivers efficient execution of machine instructions.

certainly not for its asymptotically fast variant, see Chapter 5.

This section reviews the features of computer architecture that have become

important recently and have significant influence on performance. We begin with a

discussion of modern pipelining techniques followed by a discussion of the memory

hierarchy.

1.4.1 P ipe l ined processors

Pipelined processors deliver efficient execution of machine instructions by fetch

ing and executing several instructions per cycle. Instruction execution is partitioned

into several steps, and these steps are overlapped - when an instruction has com

pleted a step the next instruction can use the hardware for that step, see Figure 1.3.

[46, 14, 29]

a +^ ■o

ecut

S -o ec

ut

fa

Rec

o U Re

c

ode

Exec

ute

mit

ord

Dec

Exec

ute

fa o

U Rec

Fetc

h ode

Exec

ute

mit

ord

Fetc

h

Dec

Exec

ute

fa o U Re

c

t o> o> +^ ■o

T Fe

tch ■o

o

Exec

ut

S -o

o> " 3

Fetc

h

Dec

Exec

ut

fa o U Re

c

>, w

17

Dependencies that present problems for smooth pipelining are called pipeline

hazards. The hazard conditions occur when the next instruction in the instruction

stream is prevented from being executed during its designated clock cycle because

either the result of a previous instruction still in the pipeline is not yet available or

the instruction to be executed itself is not known. [46, 14, 29]

For example, an instruction that is moving through the Execute stage must have

a value to operate upon. If this value is not available—such as if the preceding

instruction is a memory load, which computes the reference address in the Execute

stage and fetches the data in the Commit stage—then the instruction cannot proceed

and must be stalled until the data is available. [46, 14, 29]

Mispredicted branches are an example of control hazard—another common pipelin

ing concern. Branch instructions must move through several pipeline stages before

the target address is known, see Figure 1.4. Meanwhile, other instruction must en

ter the pipeline to keep it operating. Current processors use a variety of prediction

algorithms to "guess" which instructions to execute while waiting for the branch to

resolve. Effective branch prediction circuitry is important for superscalar pipelines

(see Section 1.4.2 below) because mispredicted branches may cause a slowdown by a

factor of 10 to 30 since several pipeline stages must be cleared out, each containing

more than 1 instruction. [46, 14, 29]

The pipeline hazards must be avoided as they cause sequential execution and,

hence, degrade pipeline performance. Compilers strive to schedule instructions so

that the dependencies have time to be resolved. Algorithms can be designed for

easy scheduling. Most modern processors are capable of out-of-order execution—a

processor feature, which rearranges instructions in hardware in order to move them

18

through the pipeline with minimal stall interruptions. Register renaming is used to

reduce artificial dependencies between the registers during program execution that

is imposed by the limited number of registers visible to the compiler. [46, 14, 29,

40, 24, 26, 25]

For high performance, memory references immediately followed by dependent

instruction and control structures with irregular behavior in particular should be

avoided. The compilers are good at eliminating dependency hazards associated

with memory references—through rescheduling the involved instructions. Control

structures are harder to optimize because they are usually part of a larger conceptual

constructs: algorithm design, ADT, and easy to maintain top-level code. [46, 14, 29]

There are three fundamental ways to improve pipeline performance through

hardware design: improve manufacturing techniques to make clock rate faster, intro

duce a longer pipeline that has smaller steps to increase clock rate without improve

ments in manufacturing (an approach taken with Pentium 4 and being abandoned

now), and increase superscalar execution where several simultaneous pipelines exe

cute more than one instruction per cycle. Improvements in superscalar execution is

the approach common today and can be effectively used through high-level language

scheduling to elicit high performance. [46, 14, 25, 2, 4, 86]

1.4.2 Superscalar execut ion (ILP)

Modern pipelined processors are designed for superscalar execution—also called

instruction-level parallelism or ILP. The processors have many pipelines each with

many functional units for executing several instructions in the same clock cycle.

This is accomplished by designing a dispatch unit that sends several instructions in

19

r ^- Concurrent pipeline stage

A o>

■o o> = m

it ■o -—*- o —*- 0) —*- S —j»- 0)

tu 01

n r̂ « "

!̂ """-~ Branch target address known

x o> ■o 1 m

it ■o -01 tu

—*-

Dec

—*-

Exec

- i» -

Com

—!»-

Rec

_ 4 . f |

] Stall cycles ■ | L j _ _ i i

A o>

■o 1 mit ■o -

St '£ —!»-

Dec

—*-

Exec

—*-

Com

—*-

Rec

Bra ticn target address fetched ! j

Figure 1.4: Pipeline stalls caused by dependencies such as mispredicted branches may cause a slowdown by a factor of 10 to 30.

parallel down several pipelines and a commit unit that completes the instructions

so that correctness is assured. [14, 46, 29]

Figure 1.5 illustrates superscalar capabilities of UltraSPARC III processor, which

has 6 pipelines that can simultaneously execute up to 4 independent instructions.

The UltraSPARC III processor is capable of up to 4x performance gain over a

non-ILP processors. [49, 86]

For high performance on a superscalar pipelined processor, independent instruc

tions must be scheduled (or packed) appropriately so that the processor can dispatch

them in the same cycle. A program must have enough usable parallelism to accom

plish this. While ILP is straightforward conceptually, implementation is complicated

due to "precedence" hazards. [14, 46, 29]

20

JS ■o ALU

tnm

it

scor

d

% o ALU tnm

it

scor

d

tu MEM BR

o U «

Fetc

h

Dec

ode ALU

ALU MEM BR C

omm

it

Rec

ord

1 ecod

e ALU ALU m

mit

ecor

d

tu O MEM BR

o U «

Figure 1.5: Superscalar feature of UltraSPARC III processor accommodates up to 4 simultaneous instructions in its Execute stage.

1.4.3 M e m o r y hierarchy

Memory hierarchy exists to deal with the ever increasing performance gap be

tween the processor and random access memory (RAM). Memory systems are de

signed to provide an illusion of a very large memory that can be accessed as fast

as a very small memory. Without well-designed hierarchical structure the memory

system will either be expensive or slow. [46, 14, 29]

At the top of the memory hierarchy is the processor register file that consists of

an array of very fast n-bit SRAM registers where n is the width of hardware word in

bits. The register file is part of the processor and is the fastest part of the memory

hierarchy. Machine instructions reference the registers directly. The processor must

fetch data into the registers for all computations.

The memory hierarchy consists of a number of caches or buffers between the

main memory (RAM) at the bottom of the hierarchy and the processor register file.

Cache is a small but fast memory that is used to store or prefetch items that have

21

been recently referenced or likely will be referenced soon. Caches greatly speed up

memory access due to the Principle of Locality, which states that programs tend to

relatively small portion of their address space at any instant of time. This

allows for a small but fast memory buffer (i.e., a cache) near the registers to contain

nearly 100% of the data and instructions required at the time of the execution.

[46, 14, 29]

The memory hierarchy of a 1 GHz UltraSPARC III processor—a usual design

for the current processors—is presented in Figure 1.6. It takes 1 processor cycle to

reach data in registers. The LI cache has 2 to 3 cycles latency, i.e., it takes up to

3 cycles to transfer data from the cache to a register. The latency for the L2 cache

is greater, typically 10 — 20 cycles. The latency for transferring data from the main

memory to the L2 cache is up to 200 cycles. The two levels of caches are used to

reduce the gap between fast CPU clock rates and the relatively long time to access

memory. [49, 86]

A cache miss occurs when the data (or instruction) accessed is not in cache, and

a cache hit occurs when the data is available in cache. Miss rate is a measure of how

well a particular program behaves with respect to a cache. Miss rate is influenced

by the algorithm coding techniques, by compiler efficiency, and by hardware design.

When a cache misses the cost of reaching the data is equal to the latency of reaching

the next level of the hierarchy. The LI caches are optimized to fast hits. The L2

caches are optimized for low miss rates. For high performance, data should be

shifted toward the processor registers. [14, 29, 7, 54]

22

1 ns 2-3 ns i 10-20 ns

100-200 ns

1 ns 2-3 ns i 10-20 ns

RAM Memoiy L2 Cache

RAM Memoiy

Register File Ll Cache L2 Cache

RAM Memoiy

Register File Ll Cache

| L2 Cache

RAM Memoiy L2 Cache

RAM Memoiy

r p n i

RAM Memoiy

Smaller, faster and expensive

Figure 1.6: Memory hierarchy of 1 GHz UltraSPARC III processor.

Cache organization

Caches are organized by the way they reference and store data and by the size

of data line (block size).

Direct-mapped caches have a simple architecture: a block from memory can map

to only one location in cache (by using a part of an address as an index into the

cache). These caches tend to be the fastest because they have the least number of

hardware comparators. However, they are vulnerable to regular memory access pat

terns: if the cache is accessed at a certain stride that causes each memory reference

to map to the same location in cache the miss rate will approach 100%. A simple

direct-mapped cache is shown in Figure 1.7. [46]

Fully associative caches are directly opposite in their organization to direct

mapped caches. A memory location can be placed anywhere in the cache; data

in the cache is replaced using least recently used (LRU) strategy or some approxi

mation to LRU. These caches have the lowest miss rate but are impractical because

of the high hardware cost, and to be effective they must be small. They tend to be

slow due to the number of comparisons required to find whether the data is in the

cache. [46]

23

| | | I**— Memory address

t Tag f Index J Block offset

11

' Block data (Cache line)

Figure 1.7: A simple direct-mapped cache.

Set associative caches are a compromise between these two architectures. These

caches are indexed like direct mapped caches but have several places where data can

be stored for each indexed location (set). An n-way set associative cache has n fully

associative locations per set; the sets are direct mapped. Set associative caches are

fast and their performance is usually similar to the fully associative caches. [46]

Block fetching and prefetching

In order to further reduce miss rate, caches fetch several words from memory

at a time. A block, also known as a cache line, is a group of contiguous words

that are transferred to a cache simultaneously. Block size (in words) is specific to a

particular architecture and is usually a power of 2. If a particular word is referenced

by the processor, all words that belong to the block will also be fetched. This

results in a substantial reduction in cache misses due to the Principle of Locality (see

Section 1.4.3)—particularly when fetching instructions and traversing arrays. [46, 29]

24

Most modern processors also feature hardware prefetching where data is brought

into cache ahead of memory references. This is also known as load prediction. In ad

dition, all modern Instruction Set Architectures (ISAs) include software prefetching

instructions to bring the data to cache ahead of use. However, software prefetch

ing can have detrimental effects on the performance if it interferes with hardware

prefetching. [46, 29, 25, 2, 4, 49]

1.5 Methodology and experimental procedures

In this section, we describe the hardware platforms, profiling techniques, and

input polynomials used in our experiments.

1.5.1 Processor architecture

Our tile methods (see Chapter 3 and Section 6.1) primarily achieve their speedup

by using delayed carry propagation and register tiling for respectively reducing mem

ory traffic and improving locality of reference. The computation schedule for the

register tiles allows multiple integer execution units to be used simultaneously, see

Section 3.3.1. When implementing the tile methods for a given processor the maxi

mum speedup that can be obtained by the methods is determined by the precision

of the native integer arithmetic (i.e., the width of the hardware registers), the num

ber of general purpose integer registers, and the number of integer execution units.

The speedup will be greater with a higher native integer precision and with a larger

number of the general purpose integer registers and integer execution units, see

Section 3.5.

25

64-bit processors

Current processors such as the Pentium EE [27], Opteron [3, 2], and Ultra

SPARC III [49, 86] that support native 64-bit integer operations have at least 16

64-bit general purpose integer registers, and at least 2 integer execution units. The

tile method was developed for such processors. We briefly summarize the relevant

features of these three processors below.

Pentium EE: The Intel Pentium Extreme Edition (EE) dual-core processor

supports both the 32-bit x86 and 64-bit EM64T instruction sets. Each core of the

Pentium EE provides 16 64-bit general purpose integer registers and has an 8-way

set-associative 16 kilobyte LI data cache and an 8-way set-associative 1 megabyte

L2 cache. Each core has 2 ALUs that are each capable of 2 arithmetic operations per

cycle. The processor is capable of register renaming, out-of-order execution, dynamic

cache prefetching, and dynamic branch prediction. The number of Pentium EE

pipeline stages has not been publicly disclosed. [27]

Opteron: The AMD Opteron processor supports the 32-bit x86 and 64-bit

AMD64 instruction sets. The Opteron provides 16 64 -bit general purpose integer

registers and has a 2 -way set-associative 64 kilobyte LI data cache and a 4 -way

set-associative 1 megabyte L2 cache. The Opteron processor has 3 ALUs that can

be independently engaged to decode, execute and retire 3 x86-instructions per cycle

in its 20-stage pipeline. The processor is capable of register renaming, out-of-order

execution, and dynamic branch prediction. [3, 2]

UltraSPARC III: The Sun UltraSPARC III processor supports the SPARC

V9 instruction set. The UltraSPARC III provides 32 64-bit general purpose in-

26

teger registers and has a 64 kilobyte 4-way set-associative LI data cache and an

8 megabyte 2-way set-associative L2 cache. Its superscalar architecture provides six

14-stage pipelines, four of which can be independently engaged. Two of the pipelines

perform integer operations, two floating point operations, one memory access, and

one pipeline performs branch instructions. The processor is capable of speculative

execution of branch instructions and memory loads. [49, 86]

32-bit processors

The Pentium 4 [26] is included for comparison only and is not expected to per

form well with the tile methods due to the unavailability of native 64-bit integer

arithmetic and the small number of general purpose integer registers.

P e n t i u m 4: The Intel Pentium 4 processor supports the 32-bit x86 instruction

set. The Pentium 4 provides 8 32-bit general purpose integer registers and has

a 16 kilobyte 8-way set-associative LI data cache and a 1 megabyte 8-way set-

associative L2 cache. The Pentium 4 processor has 2 ALUs that are each capable

of 2 operations per cycle. The processor has a 20-stage pipeline that is capable of

register renaming, out-of-order execution, dynamic cache prefetching, and dynamic

branch prediction. [26]

1.5.2 Hardware configuration

The hardware platforms used in this study are configured as follows:

P e n t i u m EE: We use a Pentium Extreme Edition 840 Dual-Core CPU with a

clock speed of 3.2 GHz and 1 GB of main memory. The Gentoo Linux distribu

tion with the 2.6.14-gentoo-r2 kernel is installed. Hyper-Threading is disabled in

27

the BIOS.

O p t e r o n : We use an Opteron 244 with a clock speed of 1.8 GHz and 2 GB of

main memory. The Gentoo Linux distribution with the 2.6.14-gentoo-r2 kernel is

installed.

U l t r a S P A R C III: We use a Sun Blade 2000 with two 900 MHz UltraSPARC III

processors and 2 GB of main memory. The Solaris 9 operating system is installed.

P e n t i u m 4: We use a Pentium 4 with a clock speed of 3.0 GHz and 1 GB of

main memory. The Fedora Core 2 Linux distribution with the 2.6.5-1.358 kernel is

installed.

1.5.3 Compi lat ion p r o t o c o l

This section describes how our software was compiled. The default compilation

flags were chosen because they deliver the best performance in most cases.

Default: All software was written in C and, unless noted below, was com

piled using gcc 3.4.4 with the flags "-03 -march=nocona -m64" on the Pentium EE,

gcc 3.4.4 with the flags "-03 -march=opteron -m64" on the Opteron, Sun Studio

9 compilers [85] with the flags "-x03 -xarch=v9b" on the UltraSPARC III, and

gcc 3.3.3 with the flags "-03 -march=pentium4" on the Pentium 4.

SACLIB: The SACLIB 3.0 (Beta) [77] library was used on the Pentium EE,

Opteron, and Pentium 4 machines. The SACLIB 2.1 library [22] was used on the

UltraSPARC III machine and compiled with Sun Studio 9 compilers [85] with the

flags "-x03." The programs IPRRID and IPRRIDB respectively call the SACLIB

3.0 (Beta) or 2.1.

NTL: On the Pentium EE, Opteron, and Pentium 4 machines NTL 5.4 [82] is

28

compiled using the compiler flags set by NTL. On the UltraSPARC III the Sun

Studio 9 compiler with the "-x03 -xarch—v9b" flags was used. NTL is limited to

32-bit integer arithmetic because of the way it performs multiplication; however, for

compatibility, NTL is compiled to use the 64-bit application binary interface (ABI).

GNU-MP: GNU-MP 4.2 [44] is compiled using the compiler flags as set by

GNU-MP.

SYNAPS: SYNAPS 2.4 [71] is compiled with the default compilers and flags.

SYNAPS required minor porting before it could be compiled with the Sun Studio 9

compiler for the UltraSPARC III platform.

Hanrot et al.: The code of Hanrot et al. [45] is compiled with the default

compilers and flags.

1.5.4 Input polynomials

For testing the tile method for Taylor shift by 1 computation (see Chapter 3),

we use the following two classes of polynomials:

Definition 1.5.1. For any positive integers n, d we define the polynomials

Bn,d(x) = d%n + dxa~l -\ h dx + d,

Cn4{x) =xn + d.

We sometime refer to Bnd as the "worst case" and Crad as the "best case" polyno

mials, see Theorems 1.3.10 and 1.3.9. In our experiments, d is usually set to 220 — 1

or 2™ — 1. Such fixed coefficient polynomials require slightly more time for Taylor

shift by 1 computation than random polynomials.

29

For testing the Descartes methods, we use random polynomials, Chebyshev poly

nomials, and Mignotte polynomials. These polynomials are commonly used bench

mark polynomials [57] for testing the Descartes method.

1. R a n d o m polynomials are integer polynomials with random 20-bit coeffi

cients or with random n-bit coefficients, where n is the degree. The coeffi

cients are pseudo-randomly generated from a uniform distribution. We report

computing times for degrees 100, 200, . . . , 1000. For random polynomials, the

Descartes method produces recursion trees that typically have few nodes.

2. Chebyshev polynomials are the polynomials defined by the recurrence re

lation T0(x) = 1, Ti(x) = x, Tn+l(x) = 2xTn(x) - Tn_x(x). The roots of

Chebyshev polynomials are well-known values of the cosine function. When

the Descartes method is applied to Chebyshev polynomials wide recursion trees

with many nodes are obtained. We report computing times for degrees 100,

200, . . . , 1000. Since all these degrees are even, the corresponding Chebyshev

polynomials are polynomials in x2. Since, for even n, the method by Hanrot et

al. [45] reduces Tn(x) to Tn(y/x), we apply the same pre-processing step also to

the other methods. We call the polynomials Tn(y/x) somewhat ambiguously

"reduced Chebyshev polynomials of degree n"; of course, deg(Tn(y/x)) = n /2 .

3. Mignot te polynomials are defined by xn — 2 (5a; — 1) . We are not aware of

any applications that involve Mignotte polynomials; however, the Descartes

method generates extremely deep recursion trees for Mignotte polynomials,

and it requires computing times that are approximately proportional to its

worst-case computing time function. We report computing times for degrees

30

100, 200, . . . , 600.

1.5.5 Performance counter measurements

All modern processors have special hardware counter registers that allow mea

suring many hardware events in real time. We accessed the performance counters

on our target processor through the CPC-library (provided with Solaris operating

system) and PAPI library [51] on all others. The PAPI library is convenient; it is

portable on most hardware platforms.

Where noted, we also monitored the following events: processor instructions,

branch mispredictions, and cache misses for the LI data cache, the LI instruction

cache, and the L2 external cache.

Execution times were computed from the number of processor cycles; for exam

ple, 1 cycle corresponds precisely to 1/1000 /is on a 1 GHz machine.

In addition, we used the UNIX getrusage system call [28] on all platforms to

obtain execution time in order to verify hardware counter measurements or when

using hardware counters was inconvenient.

1.5.6 Cache flushing

Before each measurement, we flushed the LI and L2 data caches by declaring

a large integer array and writing and reading it once [14]. We did not flush the

LI instruction cache; our measurements show that its impact on performance is

insignificant. We obtained each data point as the average of at least 3 measurements,

unless otherwise noted. The fluctuation within these measurements was usually well

under 1%. We did not remove any outliers, unless noted otherwise.

31

1.6 Literature review

This section presents a survey of background literature about Taylor shift by

1, register tiling, and their applications including de Casteljau's algorithm and the

Descartes method.

1.6.1 Taylor shift by 1

Recently von zur Gathen and Gerhard [89, 41] compared six different methods

to perform Taylor shifts. The authors distinguish between classical methods and

asymptotically fast methods. When the shift amount is 1, the classical methods

collapse into a single method which computes n(n + l)/2 integer sums where n is

the degree of the input polynomial. Von zur Gathen's and Gerhard's implementation

of classical Taylor shift by 1 simply makes calls to an integer addition routine. We

refer to such implementations as straightforward implementations. There are four

such methods, see Section 2.

The efficiency of straightforward methods depends entirely on the efficiency of

the underlying integer addition routine. Von zur Gathen and Gerhard use the integer

addition routine of NTL [83, 82] in their experiments. In Johnson et al. [59], we used

the GNU-MP [43] addition routine because the data (see Tables 1 and 2 in [59]) imply

that the GNU-MP routine is faster. In fact, NTL documentation [82] suggests using

GNU-MP arithmetic if high performance is desired. See Chapter 2 for more detail

on SACLIB, NTL, and GNU-MP multiprecision addition and the straightforward

methods. See Section 5.4 for a discussion of the impact high-performance GNU-MP

arithmetic has on asymptotically fast methods of Taylor shift by 1.

32

In Johnson et al. [59], we presented two algorithms that outperform straightfor

ward implementations of classical Taylor shift by 1. For input polynomials of low

degrees the routine IUPTR1 of the SACLIB library [22] is faster than straightfor

ward implementations by a factor of at least 2 on the UltraSPARC III platform. The

SACLIB routine IUPTR1 is described in Section 2.4. In addition, we developed a

new, architecture-aware tile method that is faster than straightforward implementa

tions by a factor of up to 7 on the UltraSPARC III platform. Chapter 3 describes the

tile method, reviews its performance, and extends it using automatic code genera

tion and tuning to the Pentium EE and Opteron platforms with similar performance

results.

It is widely believed that computer algebra systems can obtain high performance

by building on top of basic arithmetic routines that exploit features of the hardware.

It is also believed that only assembly language programs can exploit features of the

hardware. Results reported in this thesis, in Johnson et al. (2005) [59], and in

Johnson et al. (2006) [58] suggest that these tenets are wrong.

1.6.2 Asymptot ica l ly fast m e t h o d s

The tile method [59] is faster than four asymptotically fast methods for Taylor

shift by 1 [89, 41, 97] up to degree 6000 on 4 platforms [58], see Chapter 5. We

are not aware of applications of Taylor shift by 1 for such high degrees. This is

an example of a common gap between theoretical expectations and practical results

from asymptotically fast algorithms. The concern is whether the fast algorithms

deliver the performance gain for the practical problem sizes that is worth the time

invested in designing and implementing them.

33

The exploration of the practical usefulness of the asymptotically fast algorithms

began shortly after the initial ground-breaking discovery of such algorithms for ex

act integer, polynomial, and matrix arithmetic [60, 23, 84]. For instance, it was

discovered early that implementations of the Strassen algorithm for matrix multi

plication [84] do not yield theoretically expected results but still provide performance

gains for useful—although large—input data sizes [16, 66]. A lower crossover point

was found for the Strassen algorithm on supercomputers [8, 50]. More recently

crossover points were explored for several finite field linear algebra algorithms [30].

Filatei et al. discussed their crossover experiments for high performance implemen

tations for polynomial arithmetic [36]. Nonetheless, as with the tile method [59],

some computer algebra problems are better solved by a classical approach [79].

The consequences of recent developments in computer architecture (pipelining,

superscalar execution, speculative computation, and multilevel memory hierarchy)

are seldom taken into consideration when designing the fast algorithms or seeking

to improve the crossover points. For example, Schonhage [81, 80], Zuras [98], and

Montgomery [69] do not discuss the effect of the architecture features on classical,

Karatsuba, Toom-Cook, and FFT-based integer multiplication algorithms. On the

other hand, Fateman explored the LI cache behavior in his comparisons of sparse

polynomial multiplication methods [34].

Automatic tuning for the best crossover point for a particular platform is likewise

uncommon. For example, GNU-MP is the only library we utilized that has a facility

for automatic determination of crossover points for the multiplication algorithms

using the included tuneup.c program, see Section 1.6.3 below for more information.

More generally, von zur Gathen and Gerhard in their major Modern Computer

34

Algebra text [90] point out that the crossover point determination requires coding

and testing a large variety of algorithms. Previous work comparing asymptotically

fast computer algebra algorithms to their classical counterparts typically does not

take the underlying computer architecture into account. In this thesis, we have

shown that the architecture can dramatically effect performance, and hence should

be taken into account when making these comparisons.

1.6.3 Crossover points for G N U - M P , N T L , and SACLIB

The GNU-MP, NTL, and SACLIB libraries—all include asymptotically fast al

gorithms. Some algorithms such as integer and polynomial multiplication have

crossover points that are within currently useful input data ranges, while others do

not. However, only the GNU-MP package provides a mechanism for automatically

tuning the crossover points to a particular architecture. The SACLIB and NTL

libraries hard-code the crossover points for outdated platforms.

The GNU-MP 4.2 [44] multiplication and squaring routines call one of four al

gorithms: classical (base case), Karatsuba, Toom-3, and FFT-based [60, 98]. The

crossover points for the multiplication algorithms can be automatically determined

using tuneup . c program included in the library that is run during installation.

For example, on Pentium 4 machines the crossover constants for multiplication and

squaring respectively were determined to be at 18 and 68 words from classical to

Karatsuba, at 139 and 108 to Toom-3, and at 5888 and 6400 for FFT-based algo

rithms. The FFT variant threshold is quite large. The GNU-MP library does not

implement operations on polynomials.

The NTL [82] library versions 5.0 through 5.4 (the current version) use only

35

classical and Karatsuba algorithms for integer multiplication. The library uses hard-

coded crossover points that were estimated for Sparc-10, Sparc-20, and Pentium-90

processors and are set at 16 words for multiplication and at 32 words for squaring.

No attempt is made to pre-tune the crossover point to a particular architecture.

No tuning method is available to the user. In order to avoid function calls and

loops, NTL multiplication is completely unrolled and optimized for small integers

of lengths < 3 words.

The NTL polynomial multiplication is carried out using a combination of the

classical algorithm, Karatsuba, the FFT using small primes, and the FFT using

the Schonhage-Strassen approach. The choice of algorithm depends on the coeffi

cient domain. The crossovers for polynomial multiplication are again hard-coded

in ZZXl.c source file to happen at degree 10 < n < 40 for Karatsuba and at de

gree 80 < n < 150 for FFT approaches; exact crossover depends on \A\oo of the

polynomial operands, see Definition 1.3.5.

The SACLIB [22] integer multiplication uses Karatsuba approach with crossover

from classical multiplication at the length of 14 words. SACLIB polynomial multi

plication, however, does not use Karatsuba algorithm.

1.6.4 Register t i l ing

Register tiling is an instance of loop tiling—a well-known loop transformation

used by the high-performance compilers for improving the utilization of memory

hierarchy and superscalar features of the processor. Register tiling groups the

operands, loads them into the machine registers, and operates on them utilizing

ILP without repeatedly referencing the memory. This achieves substantial perfor-

36

mance improvement. [7, 53, 54, 55]

There are many publications about register tiling due to the widespread use of

the technique. For example, an early one described improving register assignment

to subscripted variables in loops [15]. Another one suggested changing the shape

of the tiles for better processor utilization on multiprocessor platforms [47, 48].

Marta Jimenez et al. explored register tiling [53, 54] and introduced a cost-effective

algorithm to compute exact loop bounds for nonrectangular iteration spaces [55].

While register tiling is conceptually uncomplicated, the implementation for the

nonrectangular iteration spaces is problematic [53, 54, 55]. In fact, it is not clear

how the technique can be applied to computations with multi-word integers such as

classical Taylor shift by 1. There are no publications about register tiling multipreci-

sion addition. Without the introduction of signed digits, suspended normalization,

radix reduction, and delayed carry propagation, we would not be able to tile classical

Taylor shift by 1, see Chapter 3. Without these domain specific transformations,

standard compilers would be unable of tiling the code.

1.6.5 Compilers and automat ic code generat ion and tuning

Current compilers such as those from GNU, Intel, and Sun Microsystems [40,

24, 85], however efficient, typically cannot generate code that is more efficient than

hand-tuned code [94]. This is true even for a simple kernel like matrix multiplica

tion. There are many techniques for transforming high-level programs into programs

that run efficiently on modern high-performance architectures such as linear loop

transformations loop tiling [61], and loop unrolling [7] for enhancing locality and

parallelism. There are also many methods for estimating optimal values for pa-

37

rameters associated with these transformations, such as tile sizes [54], and loop

unroll factors [7]. Manual optimization, however, still remains the best method for

achieving high performance [37].

Manual optimization, however, can be automated, i.e., human participation can

be reduced or even eliminated from the process. A process of writing and timing

several versions of a particular program or an algorithm can be replaced with auto

matic code generation and search using empirical run times or a performance model.

The programmers writing the generator may also use their architectural insights and

domain knowledge to limit the number of versions that are automatically generated

and evaluated.

Self-adapting code has been developed to automatically generate and optimize

the implementation of important classes of algorithms [70]. A number of recent

projects such as F F T W [38], ATLAS [1, 92, 93], and SPIRAL [76] have an automatic

code generator and evaluator. These library generators produce much better code

than native compilers do on modern high-performance architectures. Thus, code

generation with automatic tuning has become state-of-the-art.

1.6.6 Real root isolation

A primary application for Taylor shift by 1 is polynomial real root isolation,

which spends nearly all its computing time performing the operation. Some years

after Collins and Akritas [21] proposed an algorithm for polynomial real root isola

tion, Lane and Riesenfeld [64] presented a variant of the algorithm that uses Bern

stein bases instead of monomial bases. Both methods proceed recursively and use

the Descartes rule of signs as a termination criterion. For any input polynomial,

38

the two methods compute the same isolating intervals since they generate the same

recursion tree. The recursion tree was analyzed by several authors, most recently

by Krandick and Mehlhorn [63].

The monomial variant of the Descartes method was initially analyzed by Us-

pensky [87], Ostrowski [74], Collins and Akritas [21], and Collins and Loos [18].

Collins and Johnson [17] showed that the computing time is dominated by n6.

Johnson [56, 57] later realized that a root separation theorem by Davenport could

be used to reduce the computing time bound to n5; his proof contained a gap that

was removed by Krandick [62]. The Bernstein variant of the Descartes method

was analyzed as n6 by Mourrain, Vrahatis and Yakoubsohn [72] and later repeated

by Basu, Pollack, and Roy [9] and the result restated by Mourrain, Rouillier and

Roy [73]. Basu, Pollack and Roy improved their analysis in the second edition of

their book [10]. Eigenwillig, Sharma, and Yap recently showed that the computing

time for the Bernstein variant is also n5 [31].

The computing times of the monomial and the Bernstein variants of the Descartes

method have never been compared empirically and fairly. To our knowledge, no

published work used modern hardware profiling methods and architecture-aware

optimization techniques to compare the two variants. The technique of register

tiling that makes the classical Taylor shift by 1 efficient [59] carries over to de

Casteljau's algorithm, a fundamental method in computer-aided design [33]. We

are not aware of any architecture-aware implementation of de Casteljau's algorithm.

De Casteljau's algorithm is also the main subalgorithm of the Bernstein variant of

the Descartes method. Using such high-performance implementations of the two

algorithms would yield a fairer comparison between the two variants of the Descartes

39

method. In Section 6.2.3, the results are presented.

1.6.7 Notes on experimental methodology

All current processors allow the user to monitor a wide range of hardware events.

Such hardware counters can be used for precise real run-time measurements of per

formance metrics such as processor cycles, pipeline stalls, cache behavior, and branch

misprediction. The counters can be used for tuning, compiler optimization, debug

ging, benchmarking, monitoring, and performance modeling. While these techniques

are becoming widely used [96, 95], we did not find any computer algebra papers that

use performance counter measurements apart from the papers by Richard Fate-

man [34, 35].

40

2. Straightforward implementations

In this chapter we discuss straightforward implementations of classical Taylor

shift by 1 as well as the multiprecision addition routines they call.

2.1 Introduction

We call an implementation of classical Taylor shift by 1 straightforward if it uses

a generic integer addition routine to compute one of the following sequences of the

intermediate results Definition 1.3.2.

1. Horner's scheme—descending order of output coefficients

(ao ,0 , ^0,1) ^1,0) «0,2, « 1 , 1 , «2,0 , • • • , «0,ra, ■ ■ ■ , O-n,0)-

2. Horner's scheme—ascending order of output coefficients

(ao ,0 , « l , 0 j «0 , l j «2,0j 0>1,1, <k),2i • • • , O>n,0i • • • j O>0,n)-

3. Synthetic division—ascending order of output coefficients

(^0,0) a l , 0 j • • • j O>n,0i ^0,1) • • • ) Qn-1 ,1) • • • j «0,ra)-

41

for i = 0 , . . . , n bt < - at

assertion: bt = an_8)_i for j = 0 , . . . , n — 1

for i = n — 1,... ,j bl*-bl + bl+i assertion: b% = an-

Figure 2.1: The straightforward method we consider uses integer additions to compute the elements of the matrix in Figure 1.1 from top to bottom and from left to right.

4. Descending order of output coefficients

,n—l j • • • j Q"n,G)'

Von zur Gathen and Gerhard use method (1) [42]. The computer algebra system

Maple [65, 68, 67], version 9.01, uses method (3) in its PolynomialTools [T rans l a t e ]

function. In methods (3) and (4) the output coefficients appear earlier in the se

quence than in the other methods. The computing times of the four methods are

very similar; they differ typically by less than 10%.

In our experiments we will use method (3) to represent the straightforward meth

ods; Figure 2.1 gives the pseudocode. For addition we use the faster GNU-MP [43]

addition, unless noted otherwise.

We review the the GNU-MP [43] and NTL [83, 82] addition next. A review of

the SACLIB library [22] Taylor shift by 1 routine follows and concludes this chapter.

42

2.2 GNU-MP addition

The GNU Multiple Precision Arithmetic Library [43, 44] represents integers in

sign-length-magnitude representation. On the Pentium EE, Opteron, and Ultra

SPARC III platforms we have the package use the radix (3 = 264. On the Pentium 4

platform the package is set to use the radix (3 = 232.

The GNU-MP addition routine mpn_add_n is written in highly optimized, hand

crafted assembly code for most platforms. We present GNU-MP addition on Ultra

SPARC III as an example.

GNU-MP addition on UltraSPARC III

Let n be a non-negative integer, and let u = UQ + U\f3 + • • • + un/3n, where

0 < u% < (3 for alH G {0 , . . . , n} and un ^ 0. The magnitude u is represented as

an array u of unsigned 64-bit integers such that u [ i ] — u% for alH G {0 , . . . , n}. Let

v = v0 + ViP+- ■ ■Jrvn(3n be a magnitude of the same length. The routine mpn_add_n

is designed to add u and v in n + 1 phases of 4 cycles each. Phase i computes the

carry-in ct-\ and the result digit r% = (ut + v% + ct-i)mod(3. Figure 2.2 gives a

high-level description of the routine; all logical operators in the figure are bit-wise

operators. The UltraSPARC III has two integer execution units (IEU1, IEU2) and

one memory management unit (MMU). The GNU-MP addition routine adds each

pair of 64-bit words in a phase that consists of 4 machine cycles. Digit additions

are performed modulo (3 = 264; carries are reconstructed from the leading bits of

the operands and the result. In each set of the four successive phases the operation

address computes new offset addresses for ut+\, vt+\, and rt+\, respectively, during

43

cycle 1 cycle 2 cycle 3 cycle 4

IEUl a <-- (w4_! Vt>4_i) A - r 4 _ i a <— a V b Cl_! <- [a /2 6 3 J r 4 *■ - 6 + Cj_i mod/3 IEU2 6 <— M»_l A V»_l a d d r e s s 6 <— Mj + Vj mod /3 u% Vo , MMU l o a d M 1 + 3 l o a d vl+3 s t o r e rj_i —

Figure 2.2: The GNU-MP assembly addition routine for the UltraSPARC III platform.

the first three phases; in the fourth phase, the operation address is replaced by a

loop control operation. The routine consists of 178 lines of assembly code. In-place

addition can be performed. Whenever the sum does not fit into the allocated result

array, GNU-MP allocates a new array that is just large enough to hold the sum.

Gaudry's patch

In May 2006 we learned about a patch to the GNU-MP assembly routines for

the AMD64 architecture [97, 39]. The new assembly routines provide substantial

speedup for the GNU-MP addition, subtraction, and multiplication routines. We

confirmed that the GNU-MP addition is approximately 2x faster with the patch.

Figure 2.3 illustrates the performance improvement offered by the Gaudry's patch

for the straightforward method of Taylor shift by 1.

2.3 NTL addition

The NTL library [83, 82] represents integers using a sign-length-magnitude rep

resentation similar to the one GNU-MP uses. But while GNU-MP allows the digits

to have word-length, NTL-digits have 2 bits less than a word. As opposed to GNU-

MP, NTL needs 1 bit of the word to absorb the carry when it adds two digits. This

44

Speedup due to Gaudry patch on AMD Opteron (Straightforward method of Taylor shift by 1)

2

1 8

Q-1 6 T3 <D CD Q . W 1 4

1 2

10 2000 4000 6000 8000 10000

Degree

Figure 2.3: Performance gain due to Gaudry's patch for the straightforward method.

explains why NTL-digits are 1 bit shorter than GNU-MP-digits. Another bit is lost

for the following reason. While GNU-MP represents an integer as a C-language

s t ruc t , NTL represents it as an array and uses the first array element to represent

the signed length of the integer. Since all array elements are of the same type,

NTL-digits are signed as well—even though their sign is never used. Finally, due to

its way of performing multiplications, NTL cannot take full advantage of a 64-bit

word-length. In our experiments on all four platforms the NTL radix was 230. The

NTL addition routine ntl_zadd consists of 113 lines of C++ code.

2.4 The SACLIB method

The SACLIB library of computer algebra programs [22] performs classical Taylor

shift by 1 using the routine IUPTR1. The routine, consisting of 144 lines of C-code,

45

was originally written by G. E. Collins for the SAC-2 computer algebra system [19].

The method implements its own addition scheme, uses its own data structure, and

does not call an external addition routine. The method is faster than NTL- and

GNU-MP-based straightforward methods for polynomials of small degrees [59].

SACLIB represents integers with respect to a radix (3 that is a positive power

of 2. In our experiments, we set (3 = 262 on the Pentium EE, Opteron, and Ul

traSPARC III platforms and (3 = 229 on the Pentium 4 platform. Integers a such

that —(3 < a < (3 are called /5-digits and are represented as variables of type i n t

or long long. Integers a such that a < —(3 or (3 < a are represented as lists

(d0,..., dh) of /3-digits with a = J2t=0 dt(3% where dh ^ 0 and, for i G { 0 , . . . , / i},

dt < 0 if a < 0 and dt > 0 if a > 0.

SACLIB adds integers of opposite signs by adding their digits. None of these

digit additions produces a carry. The result is a list (d0,..., d^) of /5-digits that may

be 0 and that may have different signs. If not all digits are 0, the non-zero digit of

highest order has the sign s of the result. The digits whose sign is different from s

are adjusted in a step called normalization. The normalization step processes the

digits in ascending order. Digits are adjusted by adding s ■ (3 and propagating the

carry —s.

The routine IUPTR1 performs Taylor shift by 1 of a polynomial of degree n and

max-norm d by performing the n(n+ l ) / 2 coefficient additions without normalizing

after each addition. A secondary idea is to eliminate the loop control for each

coefficient addition. To do this the program first computes the bound n + L(d) of

Remark 1.3.7 for the binary length of the result coefficients. The program determines

the number k of words required to store n + L(d) bits. The program then copies

46

Step4: /* Apply synthetic division. */ m = k * (n + 1); for (h = n; h >= 1; h—) { c = 0; m = m - k; for (i = 1; i <= m; i++) { s = P[i] + P[i + k] + c; c = 0; if (s >= BETA) { s = s - BETA; c = 1; }

else if (s <= -BETA) { s = s + BETA; c = -1; }

P[i + k] = s; } }

Figure 2.4: Taylor shift by 1 in SACLIB.

the polynomial coefficients in ascending order, and in ascending order of digits,

into an array that provides k words for each coefficient; the unneeded high-order

words of each coefficient are filled with the value 0. This results in an array P

of k(n + 1) entries such that , for % G { 0 , . . . , k(n + 1) — 1} and % = qk + r with

0 < r < k, P[i + 1] = a„_g where ^ I Q an-qP3 ^s t n e coefficient of xn~q in the input

polynomial. After these preparations the Taylor shift can be executed using just the

two nested loops of Figure 2.4. The principal disadvantage of the method is the cost

of adding many zero words due to padding. This makes the method impractical for

large inputs. Also, the carry computation generates branch mispredictions.

47

3. The ti le m e t h o d

3.1 Introduct ion

In Johnson et al. [59], we presented a new version of Taylor shift by 1 algo

rithm. The introduction of signed digits, suspended normalization, radix reduction,

and delayed carry propagation enables our algorithm to take advantage of the reg

ister tiling technique for multiprecision addition. Register tiling—an optimization

method commonly used by high-performance compilers—groups the operands, loads

them into machine registers, and operates on the operands without referencing the

memory [7, 54]. We call our method the tile method.

The new register tile method for Taylor shift by 1 outperforms the straightfor

ward methods by reducing the number of cycles per word addition. We reduce the

number of carry computations by using a smaller radix and allowing carries to ac

cumulate inside a computer word. Further, we reduce the number of read and write

operations by performing more than one word addition once a set of digits has been

loaded into registers. This requires changing the order of operations; only certain

digits of the intermediate integer results al>3 in Definition 1.3.2 are computed in one

step. We perform only additions; signed digits will implicitly distinguish between

addition and subtraction. The new algorithm was written in a high-level language.

The tile method routine consists of 275 hand-written lines of C-code. In addition,

we developed a code generator to automatically unroll and schedule some parts of

the code, which further improves performance, see Section 3.5.

48

3.2 Description of the algorithm

We partition the set of indices In of Definition 1.3.2 as shown in Figure 3.1 (a).

Definition 3.2.1. Let n, b be positive integers. For non-negative integers i,j let

Ttt3 = {(h, k)eln\ [h/b\ = % A [k/b\ = j},

and let T be the set of non-empty sets T%3.

Remark 3.2.2. The set T is a partition of the set of indices In; some elements of T

can be interpreted as squares of sidelength b, others as triangles and pentagons.

Definition 3.2.3. Let Th3 E T. The sets of input indices to T%3 are

Nt>3 = {(h,k)eln\h = ib-lA [k/b\ = 3},

Wl>3 = {(h, k) Eln\ [h/b\ =iAk = jb-l}.

The sets of output indices for Th3 are

St>3 = {(h, k) e ln | h = tb + b - 1 A [k/b\ = j},

E%)3 = {(h, k) eln\ [h/b\ =iAk = jb + b-l}.

Remark 3.2A. Clearly, Nh3 = St-it3, Wt>3 = Eh3-i whenever these sets are defined.

Definition 3.2.5. Let an>k be one of the intermediate integer results in Defini-

49

J-0,0 To,i J-0,2 J-0,3

Ti,o """ Tl,2

J-2,0 T2,i

■1-3,0

o A a — o

rQ-ri ?L x C a r r y j —j j j j—T—» f / Propagation

b.

Figure 3.1: a. Tiled Pascal triangle, b. Register tile stack. A register tile is computed for each order of significance. Carries are propagated only along lower and right borders.

tion 1.3.2, and let (3 be an integer > 1. We write

dhk — £4> where 0) < (3, and we define, for all i,j,

NS=Kk\(h,k)eNhJ}

and, analogously, W;r , S\ , and E,r* hj ' hj hj

Let / = max{? | T%3 E T} and J = max{j | T%3 E T}. The tile method computes,

for % = 0 , . . . , / and j = 0 , . . . , J — i, the intermediate integer results with indices

in Sh3 U E%3 from the intermediate integer results with indices in Nl>3 U Wl>3. The

computation is performed as follows. A register tile computation at level r takes

N^' and W!f as inputs and performs the additions described in Figure 3.2 (a);

50

the additions are performed without carry but using a radix B > f3. Once the

register tile computations have been performed for all levels r, a carry propagation

transforms the results into S^ and E^J for all levels r. Referring to Figure 3.1 (b)

we call the collection of register tile computations for all levels r a register hie stack.

The maximum value of r for each stack of index (i,j) depends on the precision of

the stack which we now define.

Definition 3.2.6. The precision L* of the register tile stack with index (i,j) is

defined recursively as follows.

L*_1>3 = max({L(ah>k) | (h, k) E N0>3}),

L*_i = max({L(ah>k) | (h, k) E Wt>0}),

L*ttJ = max({L*_1>3, L ^ . J U {L(ah>k) \ (h, k) E Sh3 U Eh3}).

To facilitate block prefetching, we place the input digits to a register tile next to

each other in memory. We thus have the following interlaced polynomial representa

tion of the polynomial A(x) in Definition 1.3.2 by the array P. If % is a non-negative

integer such that % = g(n + 1) + / and 0 < / < n + 1 then P[i] contains the value

df_i defined in Definition 3.2.5.

Theorem 3.2.7. The computation of a register tile requires at most L{(3—1) + 26 —2

bits for the magnitude of each intermediate result.

Proof. Let n = 2b—I, and let Bn>p_i(x) be the polynomial defined in Definition 1.5.1.

For all (h, k) E T0)0 we have 0 < h, k < b - 1. Then, by Theorem 1.3.6, L(aKk) <

L(p-l) + h + k<L(p-l) + 2b-2. □

51

Theorem 3.2.8. If L(B) > L{(3 — 1) + 2b — 2 and 1 bit is available for the sign,

then the tile method is correct.

Remark 3.2.9. The UltraSPARC III has a 64-bit word. We let b = 8, (3 = 249, and

B = 263. For other platforms, see Section 3.5.

3.3 Properties of the algorithm

Theorem 3.3.1. The tile method has the following properties:

1. Assuming the straightforward method must read all operands from memory

and write all results to memory, the tile method will reduce memory reads by

a factor of b/2, and memory writes by a factor of b/4.

2. Given a processor architecture capable of concurrent execution of 2 integer

instructions and 1 memory reference instruction with a memory reference la

tency of at least 2 cycles, ab x b register tile computation takes at least y + 7

processor cycles.

Proof. (1) Obvious. (2) In the register tile, the addition at the SE-corner must follow

the other b2 — 1 additions, and the addition at the NW-corner must precede all other

additions. The first addition requires two summands in registers, which takes at least

3 cycles for the first summand and 1 more cycle for the second summand. The last

sum needs to be written to two locations; the first write requires 3 cycles and the

second 1 more cycle. Since we can perform the other b2 — 2 additions in *■ ~ ' cycles,

the register tile will take at least 3 + 1 + (b2 - 2)/2 + 1 + 3 = b2/2 + 7 cycles. □

52

I I I I I I I I 0+ ,+ 8+

,+ 2+ 9+

Z+ 3+ ,0+

5 6^ i r

6+ , + ,4+

, + 8+ ,5+

-m

9+ , 6 +

,0+ „+ „+ ,8+

,2+ , 9 +

1 * 2 *

,4+zt 1 * 22*"

,«+ „+ 1* 8* 15' 16* 23 ' 24* 3l" 321

1 * 24"" 2 * T

1 * 2 * 2 *

1 * 2 * 2 *

2 * 2 * 2 *

2 * 2 * 2 *

2 * 2 * 3 *

2 * 3 * 3 *

ttttt l oad / ' load add

y X

add store

^ / store

b.

Figure 3.2: a. A scheduled 8 x 8 register tile. Arrows represent memory references, "+" signs represent additions. Numbers represent processor cycles. The 2 integer execution units (IEUs) perform 2 additions per cycle, b. Register tile: a sketch of the proof of Theorem 3.3.1.

The 8 x 8 register tile computation should take at least 82/2 + 7 = 39 processor

cycles, see Figure 3.2 (b). By unrolling and manually scheduling the code for the

register tile, the code compiled with the Sun Studio 9 C compiler and the optimiza

tion options -fast -xchip=ultra3 -xarch=v9b required 53 cycles. When the compiler

was used to schedule the unrolled code, the computation required 63 cycles. Also

see Section 4.4 for more recent measurements.

3.3.1 Register tile schedule

An example of optimal schedule for UltraSPARC II and III processors is pre

sented in Table 3.1. Except for the initial and final additions, all additions within

the register tile are scheduled in pairs to fully utilize UltraSPARC III processor's

two IEUs. This schedule assumes that loads and stores require a 3 cycle latency.

According to the schedule, it will take 40-41 cycles to execute one register tile on the

processor. Most modern processors are capable of at least two integer operations

53

0 0 0 0 0

148 ->■ 148 148 148 148 148 -192 - > ■ - 4 4 104 252 400

33 - > ■ - 7 7 27 279 15 - > ■ - 6 2 - 3 5 3 - > ■ - 5 9

Figure 3.3: Pascal's triangle for the example polynomial. The coefficients of the polynomial and the elements of the triangle are in decimal representation.

per processor cycle and will yield similar schedules. A similar schedule was used in

all our experiments with the tile method.

3.3.2 A n example

In order to illustrate the register tile stack, we provide an example of register

tiling Taylor shift by 1 computation for a small polynomial of degree 4. Let

A(x) = 148a;4 - 192a;3 - 33a;2 + 15a; + 3

be the input polynomial. Then,

B(x) = A(x + 1) = 148a;4 + 400a;3 + 279a;2 - 35a; - 59

will be the output polynomial. Pascal's triangle for the example computation is

provided in Figure 3.3.

For the purpose of this example, let us use octal representation for the integer

coefficients, i.e., the radix will be (3 = 8. Let us further assume that the CPU

54

Cycle MMU IEU1 IEU2 1 load a l 2 load zl 3 load a2 4 load z2 5 load z3 z l + = a l 6 load z4 z2+=zl z l+=a2 7 load z5 z3+=z2 z2+=zl 8 load z6 z4+=z3 z3+=z2 9 load z7 z5+=z4 z4+=z3 10 load z8 z6+=z5 z5+=z4 11 load a3 z7+=z6 z6+=z5 12 load a4 z8+=z7 z7+=z6 13 store al (=z8) z l+=a3 z8+=z7 14 store a2 (=z8) z2+=zl z l+=a4 15 z3+=z2 z2+=zl 16 z4+=z3 z3+=z2 17 z5+=z4 z4+=z3 18 load a5 z6+=z5 z5+=z4 19 load a6 z7+=z6 z6+=z5 20 z8+=z7 z7+=z6 21 store a3 (=z8) z l+=a5 z8+=z7 22 store a4 (=z8) z2+=zl z l+=a6 23 z3+=z2 z2+=zl 24 z4+=z3 z3+=z2 25 z5+=z4 z4+=z3 26 load a7 z6+=z5 z5+=z4 27 load a8 z7+=z6 z6+=z5 28 z8+=z7 z7+=z6 29 store a5 (=z8) z l+=a7 z8+=z7 30 store a6 (=z8) z2+=zl z l+=a8 31 store zl z3+=z2 z2+=zl 32 store z2 z4+=z3 z3+=z2 33 store z3 z5+=z4 z4+=z3 34 store z4 z6+=z5 z5+=z4 35 store z5 z7+=z6 z6+=z5 36 store z6 z8+=z7 z7+=z6 37 store a7 (=z8) z8+=z7 38 store a8 (=z8) 39 store z7 40 store z8

Table 3.1: An optimal instruction schedule for the 8 x 8 register tile for the UltraSPARC III processor.

55

Decimal Octal Level of significance 0 1 2

(24 148 4 2 2 a3 -192 0 0 - 3 a2 -33 - 1 - 4 0 d\ 15 7 1 0 a0 3 3 0 0

Table 3.2: An example polynomial with its coefficients in the interlaced representation.

hardware registers be 6 bits wide, i.e., the addition in the hardware will be performed

using radix B = 25, i.e., L(B) = 5 (allowing 1 bit for the sign). Therefore, there will

be a space of 2 bits for storing the accumulating carries during addition performed

without carry propagation, see Theorem 3.2.7.

Octal digits for the example polynomial A(x) are presented in Table 3.2. The

sign of the integers is embedded in the digits as illustrated. The digits are all

normalized, i.e., all nonzero digits of a coefficient have the same sign—the sign of

the coefficient.

Figures 3.4, 3.5, and 3.6 show Pascal's triangles for Taylor shift by 1 computed

for digits of the levels of significance 0, 1, and 2 respectively. See also Figure 1.1.

Some elements (e.g., 21, 23, 16, 8) in the triangles are digits to the radix B and not

(3 because the carries were not propagated. In addition, some of the elements are

not normalized, i.e., the integers are represented with digits of different signs.

Let us choose the 3 x 3 area in Pascal's triangles as illustrated in Figure 3.7,

3.8, and 3.9. These blocks are register tiles of size 3 x 3 and together they form a

56

0 0 0 0 0

4 -i

->■ 4 1 4

i 4

i 4

1 4

0 -->■ 4 8 12 16 - 1 -->■ 3 11 23 7 -->■ 1 0 21 3 -->■ 1 3

Figure 3.4: Pascal's triangle for the level of significance 0.

0 0 0 0 0

i 1 i i 1 2 -+ 2 2 2 2 2 0 -+ 2 4 6 8 - 4 -+ - 2 2 8 1 -+ - 1 1 0 -+ - 1


0 0 0 0 0

i 1 i i 1 2 -■» 2 2 2 2 2 - 3 -■» - 1 1 3 5 0 -■» - 1 0 3 0 -■» - 1 - 1 0 -■» - 1


57

4 4 4 4 8 12 3 11 23

Figure 3.7: Non-normalized register tile for the level of significance 0.

2 2 2 2 4 6

- 2 2 8


register tile stack of height 3, see Figure 3.1(b).

Carry propagation is performed only on the two relevant sides of the register tile

stack as illustrated in Figures 3.10, 3.11, and 3.12. After the carry propagation the

integers are still not normalized and contain digits of different signs (see lower left

element in Figures 3.10, 3.11, and 3.12).

The full output polynomial will be likewise non-normalized as shown in Table 3.3

and will be normalized before the Taylor shift computation completes as shown in

Table 3.4. The lower right corner of the register tile stack already contain the output

coefficient 02 shown in Table 3.4.

2 2 2 - 1 1 3 - 1 0 3


4 4

3 3 7

58

Figure 3.10: Normalized register tile for level of significance 0 after carry propagation; now the radix (3 = 8.

2 7

- 2 3 2


2 3

- 1 0 4



CL4 148 4 2 2 a3 400 0 2 6 a2 279 7 2 4 <2i -35 5 3 - 1 a0 -59 5 0 - 1

Table 3.3: The output polynomial after Taylor shift by 1 computation with its coefficients not normalized.

59


(24 148 4 2 2 a3 400 0 2 6 a2 279 7 2 4 d\ -35 - 3 - 4 0 a0 -59 - 3 - 7 0

Table 3.4: The output polynomial after Taylor shift by 1 computation with its coefficients normalized.

3.4 Performance

In the RAM-model of computation [6] the tile method is more expensive—with

respect to the logarithmic cost function—than straightforward methods. Indeed, by

reducing the radix the tile method increases the number of machine words needed to

represent integers and therefore requires more word additions than straightforward

implementations. However, modern computer architectures [46, 14] are quite differ

ent from the RAM-model. In this section, we show that, on the UltraSPARC III

architecture, the tile method outperforms straightforward methods by a significant

factor—essentially by reducing the number of cycles per word addition. In Sec

tion 3.4.7 we compare our computing times with those published by von zur Gathen

and Gerhard [89, 41]. In Section 3.5 we show how the code for our method can be

automatically generated and tuned for any architecture.

60

3.4.1 Experimental methodology and platform

For this section, all experimental code was written in C and compiled with Sun

Studio 9 [85] compiler using -fast -xchip=ultra3 -xarch=v9b optimization options.

The GNU-MP package [43] (GMP) library version 4.1.2 was compiled with Sun

Studio 7 compiler and installed using the standard installation but with CFLAGS

set to -fast. All experiments were performed on a Sun Blade 2000 workstation, see

Sections 1.5.1 and 1.5.2.

3.4.2 Execution time

Figure 3.13 shows the speedup that the SACLIB and tile methods provide with

respect to the straightforward method for the input polynomials Bnd with d =

220 — 1, see Definition 1.5.1. The tile method is up to 7 times faster than the

straightforward method for low degrees and 3 times faster for high degrees. The

SACLIB method is up to 4 times faster than the straightforward method for low

degrees but slower for high degrees. The speedups are not due to the fact that the

faster methods avoid the cost of re-allocating memory as the intermediate results

grow. Indeed, pre-allocating memory accelerates the straightforward method by a

factor of only 1.25 for degree 50. As the degree increases that factor approaches 1.

In Figure 3.14 the polynomials Cn>d (with d = 220 —1, see Definition 1.5.1) reveal a

weakness of the tile method. The tile method does not keep track of the individual

precisions of the intermediate results a v but uses the same precision for all the

integers in a tile. The tile stack containing the constant term d of C22,d a n d C25td

consists of 28 and 3 integers ah}, respectively. Thus, when the degree stays fixed and

d tends to infinity, the tile method becomes slower than the straightforward method

61

Q .

T3 <D 4 CD Q . W

Performance - low degrees Speedup relative to straightforward method

50

1 1 ' 1 i '

/ / Tile method SACLIB method

-/ — Tile method SACLIB method

-/

>' Y i

-

i , i i , i

\

i

-

100

Degree 150 200

Q . 3 4

T3 CD CD

w 3

Performance - high degrees Speedup relative to straightforward method

2000

\ I I I I I

\ \ - r I i.1 I -\

SACLIB method

-

V—. SACLIB method . V—.

^^—v« i AA. A r

y y v

\ -

s _ I , I 1 , 1 ,

4000 6000

Degree 8000 10000

Figure 3.13: The tile method is up to 7 times faster than the straightforward method.

62

by a constant factor. The figure shows that—even when the degree is small—the

constant term d must become extremely large in order to degrade the performance.

3.4.3 Efficiency of addition

Figure 3.15 shows the number of cycles per word addition for the GNU-MP

addition routine described in Section 2.2. In the experiment all words of both

summands were initialized to 264 — 1, and the summands were prefetched into LI

cache. The figure shows that the intended ratio of 4 cycles per word addition is

nearly reached when the summands are very long and fit into LI cache; for short

integers GNU-MP addition is much less efficient.

Figure 3.16 shows the number of cycles per word addition for the GNU-MP-

based straightforward Taylor shift described in Section 2.2 and for the tile method

described in Section 3.2; the polynomials Bnd with d = 220 — 1 (see Definition 1.5.1)

were used as inputs. For large degrees the methods require about 5.7 and 1.4

cycles per word addition, respectively. Since the tile method uses the radix 249

and the straightforward method uses the radix 264 the tile method executes about

64/49 ~ 1.3 times more word additions than the straightforward method. As a

result the tile method should be faster than the straightforward method by a factor

of 5.7/(1.4 • 1.3) ~ 3.1. The measurements shown in Figure 3.13 agree well with this

expectation.

3.4.4 Memory traffic reduction

Figure 3.17 shows that the tile method reduces the number of memory reads

with respect to the straightforward method by a factor of up to 7. The polynomials

Performance - degree=22 Speedup relative to straightforward method

1 1 ' 1 i 1 i 1 i

.

1 1 '

Tile method

1

SACLIB method

1 1

-

i 1

\. — w

i 1 I , I , 200 400 600

Coefficient length in bits 800 1000

Performance - degree=25 Speedup relative to straightforward method

400 600 Coefficient length in bits

1000

Figure 3.14: For the input polynomials Cn,d the tile method computes a register tile stack at the precision required for just the constant term.

Efficiency of GNU-MP addition

64

200

o ^ 150 T3 T3 CO T3 i _

O 5 100 l _ CD Q . V)

_CD O 50 O

1 1 1 ' 1

-- Cycles per word addition --

■

— Cache miss rate -■ -

I 1 i ii

-

\ i M H

f» iV l i L ' l _ - A - j \ l \ ' \

1 1 " 1 , 1 i

0 025

0 02 CD

(/> 015 (/>

fc CD c-o 01 m o m ro 005 T3

50 100 150 Summand integer length in words

200

Efficiency of GNU-MP addition

20

B 15

I 5 10

Cycles per word addition

Cache miss rate

— — — — — — — J

0 25

2000 4000 6000 8000 Summand integer length in words

-- o

10000

Figure 3.15: In GNU-MP addition the ratio of cycles per word addition (left scale) increases with the cache miss rate (right scale).

Efficiency of Taylor shift by 1

140

120

T3 T3 CO

T3 i _ O

CD Q . V)

_CD

o O

100

80

60

40

20

I I I I I I I

Tile method Straightforward method Tile method Straightforward method -Tile method Straightforward method

M ll -11 \1

\ N>

I ' - 1 1 1 , 50 100

Degree 150 200

Efficiency of Taylor shift by 1

20

i l 5 T3 T3 CC

T3 i _ O 5 10 l _ CD Q . V)

_CD

° 5 o

1 1 1 ' 1 ' 1 '

Tile method Straightforward method Tile method Straightforward method

\ \

Tile method Straightforward method

\ \ \

"** -~

v ^ / \ 1 I , I , I ,

2000 4000 6000

Degree 8000 10000

Figure 3.16: In classical Taylor shift by 1 the tile method requires fewer cycles word addition than the straightforward method.

66

Bn>d with d = 220 — 1 were used as inputs. The number of memory reads in the

GNU-MP-based straightforward method is independent of the compiler since the

implementation relies to a large extent on an assembly language routine. However,

the number of memory reads in the tile method depends on how well the compiler

is able to take advantage of our C-code for the computation of register tiles. The

figure shows that the Sun Studio 9 C-compiler with the option -xOS -xarch=v9b

works best for the tile method.

3.4.5 Cache miss rates

Figure 3.18 shows the LI data cache miss rates for the straightforward method

and the tile method; the polynomials Bnd with d = 220 — 1 were used as inputs. As

the degree increases the cache miss rate of the straightforward method rises sharply

as soon as the polynomials no longer fit into the cache. The cache miss rate levels

off at about 13%. Indeed, by Section 3.4.1 the block size is 8 words; so, one expects

7 cache hits for each cache miss.

3.4.6 Branch mispredict ions

Figure 3.19 shows the number of branch mispredictions per cycle for the straight

forward method and the tile method; the polynomials Bnd with d = 220 — 1 were

used as inputs. Since either method produces at most one branch misprediction

every 200 cycles, branch mispredictions do not significantly affect the two methods.

However, the branch misprediction rate of the SACLIB method is 60 times greater

than that of the straightforward method when the degree is high.

67

Memory reads 1 1 ' 1 1 1

"

I f . ' • ^ . *■—

i "

/ ~v— ^JVB--i~_ * —*

1 , 1 1 50 100 150

Degree 200

Memory reads

in gcc v3 4 2 unoptimized

qcc v3 4 2 -Q3

Sun cc v5 6 -x03 -xarch=v9b Sun cc v5 6 -fast -xchip=ultra3 -xarch=v9b

2000 4000 6000

Degree 8000 10000

Figure 3.17: The tile method substantially reduces the number of memory reads required for the Taylor shift; the extent of the reduction depends on the compiler.

68

Taylor shift by 1

0 05

0 04

■ § 0 0 3 <D

<D

"§002 O

7 3 0 01

-1 1 1 1 1 1

-1


\/1 1


\/1 1

- -

/\ ^ V / N V~^ -^^^

\

1 i , i , 50 100

Degree 150 200

02

S W 0 15

T3 CO CD

Taylor shift by 1


4000 6000 Degree

10000

Figure 3.18: For large degrees the tile method has a lower cache miss rate than the straightforward method. Moreover, the number of cache misses generated by the tile method is small because the tile method performs few read operations.

69

Taylor shift by 1

0 006

O > s 0 005 o

0 004 Q .

o o 15 0 003 <D l _ Q . w ' F 0 002 o nj 0001 00

1 1 ' 1 ' 1 '

1 I , A/ IAA/VA/* ^s^^^y]

1 ll 1 ll Tile method Straightforward method \\A Ai Tile method Straightforward method \\A Ai

\ \ (

1 , 1 , 1 , 50 100

Degree 150 200

Taylor shift by 1

0 005

O O 0 004 CD Q. (/> o "o T3 CD i_ Q. (/>

0 003

0 002

o CD 00

0 001

-I I I I I I I I

\ \ Tile method Straightforward method V

Tile method Straightforward method V

* ^ - _ - ~ % ' - - ~ - ' - - ^ . - _

-I

C- " " " --

I i , i , i , * ■ *

2000 4000 6000 Degree

8000 10000

Figure 3.19: The number of branch mispredictions per cycle is negligible for the tile method and the straightforward method.

70

straightforward tile NTL-addition GMP-add

degree UltraSPARC [89] Pentium III [41] UltraSPARC III 127 0.004 0.001 0.001 0.00076 0.00010 255 0.019 0.005 0.004 0.00327 0.00046 511 0.102 0.030 0.016 0.01475 0.00286 1023 0.637 0.190 0.101 0.08261 0.03183 2047 4.700 2.447 0.710 0.56577 0.27114 4095 39.243 22.126 4.958 3.73049 1.97799 8191 — 176.840 44.200 29.91298 18.48445

Table 3.5: Computing times (s.) for Taylor shift by 1 —"small" coefficients.

3.4.7 Computing times in the literature

Von zur Gathen and Gerhard [89, 41] published computing times for the NTL-

based implementation of the straightforward method described in Section 2.3. Ta

bles 3.5 and 3.6 quote those computing times and compare the NTL-based straight

forward method with the GNU-MP-based straightforward method and the tile

method.

The computing times we quote were obtained on an UltraSPARC workstation

rated at 167 MHz [89] and on a Pentium III 800 MHz Linux PC; the latter experi

ments were performed using the default installation of version 5.0c of NTL [41]. We

installed NTL in the same way on our experimental platform but while the default

installation uses the gcc compiler with the -02 option we used the Sun compiler with

the options -fast -xchip=ultra3. This change of compilers sped-up the NTL-based

straightforward method by factors ranging from 1.06 to 1.63.

Von zur Gathen and Gerhard ran their program letting k = 7 , . . . , 13 and n =

71

straightforward tile NTL-addition GMP-add

degree UltraSPARC [89] Pentium III [41] UltraSPARC III 127 0.006 0.002 0.001 0.00096 0.00016 255 0.036 0.010 0.005 0.00434 0.00099 511 0.244 0.068 0.029 0.02154 0.00838

1023 1.788 0.608 0.231 0.17607 0.09183 2047 13.897 8.068 1.773 1.27955 0.83963 4095 111.503 65.758 13.878 9.97772 6.27948 8191 — 576.539 140.630 151.27732 61.04515

Table 3.6: Computing times (s.) for Taylor shift by 1 —"large" coefficients.

2k — 1 for input polynomials of degree n and max-norm < n for Table 3.5, and max-

norm < 2n+1 for Table 3.6; the integer coefficients were pseudo-randomly generated.

We used the same input polynomials in our experiments.

The NTL-based straightforward method runs faster on the UltraSPARC III than

on the Pentium III, but the speedup ratios vary. This is likely due to differences

between the processors in cache size and pipeline organization. The computing

time ratios between the NTL- and GNU-MP-based straightforward methods on the

UltraSPARC III are more uniform and range between 0.9 and 1.7. If these computing

time ratios can be explained by the difference in radix size—230 for NTL and 264

for GNU-MP—then there is no justification for the use of assembly language in the

GNU-MP-addition routine. Again, the tile method outperforms the straightforward

methods.

72

3.5 Automat i c code generation

Predicting performance on modern computer platforms is difficult due to their

complexity, see Chapter 4. This challenge calls for automatic code generation and

tuning techniques where several variants of an algorithm are constructed and as

sessed for performance without human intervention. Modeling, however, helps to

narrow the search, see Section 4.4.2 [94].

A tile can be described by three parameters: tile size, tile shape, and addition

schedule [59]. The structure of the Taylor shift by 1 computation favors a rectangular

shape for the tiles; however, we performed experiments only with square tiles. More

than one addition schedule is possible. The number of possible schedules depends on

the ILP features of the target processor and increases with the number of available

integer execution units (IEUs). The optimal tile size depends on the number of

registers in the target CPU and the quality with which the CPU schedules memory

operations. We found that the best performing parameter values varied widely

depending on the target CPU.

We have implemented an automatic code generator in Perl [91]. The generator

produces portable ANSI C + + [52, 13] code for tiles of different sizes, compiles and

executes the code, and searches through successively larger tile sizes until a tile size

with the best performance is discovered. A single addition schedule is produced by

the code generator for a CPU with 2 addition units, see next paragraph. The code

generator consists of 1,000 lines of Perl code and produces approximately 11,000

lines of C + + code for each platform per each register tile size.

The purpose of the addition schedule was to expose the dependencies of the

73

computation so that the C++ compiler can then schedule the computation appro

priately for the target architecture. The compilers we tested were unable to infer

these dependencies when the computation is programed in for loops. We did not

test addition schedules that target CPUs with more than 2 IEUs.

We search over tiles of size n x n for n = 4, 6, 8, 10, 12, 14, and 16. Smaller

tiles are too small to offer a substantial speedup and larger tiles would cause register

spilling that would negate the locality of reference advantages of the tile method.

The outcome of the tile search is shown in Figures 3.20, 3.21, 3.22, and 3.23. The

Figures show performance gain (speedup) obtained by the tiled version of the Taylor

shift by 1 algorithm on the Pentium EE, Opteron, Pentium 4, and UltraSPARC III

platforms, respectively, for different register tile sizes. The speedup is calculated

with respect to the straightforward GMP-based implementation of Taylor shift by

1, see Section 2.2. The dips in the Opteron and UltraSPARC curves are discussed

in Section 3.5.2. Based on the outcome of the search, we use register tile sizes of

12 x 12 for the Pentium EE and Opteron, 8 x 8 on the UltraSPARC III, and 6 x 6

on the Pentium 4 in our experiments.

3.5.1 Processor utilization

The tile method performs more word additions but offers substantial gain in

performance[59]. Figures 3.24 and 3.25 show how the processor is being utilized

by word additions. The tile method dispatches substantially more word additions

per each processor cycle. This is the only cause for the substantial performance

difference between the method. Register tile sizes were set to 12 x 12 words for the

Pentium EE and Opteron, 8 x 8 words for the UltraSPARC III, and 6 x 6 words for

74

Impact of register tile size on performance Architecture Pentium EE

3 4 T3 <D

w 3

- - ■ ■ v .

4x4 6x6 8x8 10x10 12x12 14x14 16x16

) . ' / V - ^ s / -

2000 4000 6000 Degree

8000 10000

Figure 3.20: Impact of tile size on the performance of the tile method on Pentium EE processor. Legend: tile size in word x word.

Impact of register tile size on performance Architecture AMDOpteron

3 4 T3 <D Q-3 w 3

7ft .•■•. s v - ' 1 f f l l . .--f.i-. .- .-. .. ... .'rf » ' i ur

6x6 8x8 10x10 12x12 14x14 16x16 2000 4000 6000

Degree 8000 10000

Figure 3.21: Impact of tile size on the performance of the tile method on Opteron processor. Legend: tile size in word x word.

75

Impact of register tile size on performance Architecture Intel Pentium 4

Q-3 T3 <D <D Q . W 2

4x4 6x6 8x8 10x10 12x12 14x14 16x16

2000 4000 6000

Degree 8000 10000

Figure 3.22: Impact of tile size on the performance of the tile method on Pentium 4 processor. Legend: tile size in word x word.

Impact of register tile size on performance Architecture Sun UltraSPARC III

4x4 6x6 8x8 10x10 12x12 14x14

4000 6000

Degree 10000

Figure 3.23: Impact of tile size on the performance of the tile method on UltraSPARC III processor. Legend: tile size in word x word.

76

the Pentium 4. Initial coefficients were set to 2™ — 1.

We suggest several strategies for improving processor utilization by the tile

method and, hence, its performance:

1. Reschedule the register tile computation to utilize all available IEUs within

the target processor. For example, Opteron should be able to do 3 additions

per cycle as the processor has 3 IEUs.

2. A larger register file (i.e., more available registers) will allow for larger register

tiles. This would help further reduce the cost of carry propagation.

3. Using a processor with wider registers (i.e., 128-bit vs. 64-bit wide) will im

prove performance as wider registers allow for a larger radix. A larger radix

will shorten the integer coefficients and further reduce the number of additions.

4. It is not known whether the sparse interlaced array representation for poly

nomials is a more efficient data structure for the tile method than a noninter

laced representation. We conjecture that the interlaced representation is better

because it helps avoid cache thrashing and favors automatic block prefetch

ing [59]. The best representation should be determined experimentally.

3.5.2 The 4100 degree irregularity

The observed dip in the speedup and a spike in computing time at the degree

4100 for the tile method appears only in experimental data for UltraSPARC III and

Opteron processors.

Straightforward method - processor utilization 0 5

a , 0 4

o o

W 0 3

o

CO

I 0 2

0 1

Pentium EE Opteron with Gaudry's patch Opteron Pentium 4 UltraSPARC III

;V ' • \-^, J r.K./r - * - , V V v - V v ' v V v

r 2000 4000 6000 8000

Degree 10000

Figure 3.24: Processor utilization in word additions per cycle for the straightforward method.

Tile method - processor utilization

T ; 15 O

CO

■i 1 T3 T3 CO T3 i _ O > 05

- ■ V V V - A - , " / . / V A . A ^ A

Pentium EE Opteron Pentium 4 UltraSPARC II

it''"'..'

2000 4000 6000

Degree 8000 10000

Figure 3.25: Processor utilization in word additions per cycle for the tile method.

78

On UltraSPARC III processor, the tile method presents a dramatic increase in

LI data cache miss rate at the degree 4100. The increase is from 2% to almost 11%.

The UltraSPARC III processor's LI caches have block size of 32-bytes, i.e., each

block can contain 4 64-bit words [49, 86]. This implies an almost 50% LI cache miss

rate.

At the degree 4100, there is also a significant increase in the LI instruction cache

references and a significant decrease in branch mispredictions. Both of these obser

vations indicate that the instructions are not being fetched or executed efficiently

because the machine is waiting for the memory system to deliver the data.

The irregularity also appears for the GNU-MP addition of integers that are 4100

words long. On UltraSPARC III processor, there is a spike for q = 2 but not for

q = 3, see Section 4.2. On the Opteron processor (for GNU-MP both with and

without Gaudry's patch), there is a spike for q = 3 and a dip for q = 2. However,

this surprising behavior could be a compiler idiosyncrasy.

This anomaly is likely caused by the low set associativity of the UltraSPARC III

and Opteron data LI caches, which are respectively 4-way and 2-way set associa

tive. The Pentiums have 8-way set associative LI caches and do not exhibit the

irregularity. Conceivably a number-theoretic reason may cause the cache to exhibit

a substantial increase in conflict misses at this particular degree.

79

4. Modeling Taylor shift by 1

4.1 Introduction

In this chapter, we present a model for GNU-MP addition and models for the

straightforward and tile methods of Taylor shift by 1 with respect to target archi

tecture. We compare the models' predictions to the experimental data.

A performance model is a function of several variables that represent features

and parameters of the target microprocessor architecture: specifically, the number

of available integer execution units (IEUs), the memory management unit (MMU)

latency of the processor, the superscalar capacity of its pipeline, and the architecture

of caches in the processor.

The modeling explains performance advantages of the tile method with respect

to the straightforward method and suggests automatic code generation and tuning,

see Section 3.5.

4.2 A model for GNU-MP addition

The high-performance GNU-MP addition routines are written in assembly code

for most architectures. A generic version is also provided. The assembly code is

highly optimized and aims toward optimal utilization of superscalar features of the

target processor as well as minimization of stall cycles due to memory access delays

(cache misses) and branch mispredictions.

Our model for GNU-MP addition assumes that

80

1. a certain number of cycles are consumed for each two words (digits) to be

added, i.e., cycles per word addition,

2. a certain number of cycles are spent for the overhead of calling the addition

routine and housekeeping within the routine, and

3. a certain number of cycles may be spent waiting for memory hierarchy to

deliver the data.

Let C(L) be the number of cycles it takes to add two L-word integers. In our

model, we assume that C(L) = cL + h + 7r/x(L) is a function of c, h, L, / /(£), and

7r, where c is the number of cycles per word addition, h is the combined cost of the

overhead of calling the GNU-MP addition routine and of housekeeping within the

routine, L is the length of integers in words (or the length of the larger operand),

fi(L) is expected number of LI cache misses for a fully associative cache (see below),

and 7r is the cache miss penalty in cycles. These parameters are explained below.

Experimental data for C(L), however, may be influenced by factors not accounted

for in the model, e.g., particulars of the LI data cache design (i.e., associativity,

number of ports) or other processes running on the target machine.

Table 4.1 lists the parameters used in modeling GNU-MP addition. The cycles

per word addition c is known via one or more of the following: learned from available

GNU-MP documentation, measured experimentally, or estimated by studying the

code. The overhead h must be empirically measured because it varies with the target

architecture. In fact, the processor architecture, memory organization, compiler

quality, and compilation flags used—all affect the overhead. The Table 4.1 provides

values for c and h for two cases: the summands are cleared from or preloaded into

81

the LI data cache before running GNU-MP addition. We call these instances "cold"

and "warm" caches respectively.

Let the LI cache capacity be K bytes, w the size of the GNU-MP word (digit) in

bytes, A the LI cache block size (cache line) in bytes, and q the number of GNU-MP

integer operands. Then, assuming a fully associative LI data cache, the expected

number of LI cache misses fi{L) is a piece-wise function defined as follows:

{ 0 if qwL < K,

^f^ otherwise.

In addition, fi{L) may also be near 0 if the processor is capable of and succeeds

with speculative execution of memory reference instructions, see below. The LI

cache miss penalty n is a parameter determined by the target memory hierarchy

organization and technology and is identical to the L2 cache latency.

In our model and experimental measurements of GNU-MP addition, one of the

summands is also the destination sum, i.e., q = 2. There is virtually no difference

in performance for such in-place addition (two operands) and addition with three

operands for Pentium EE and UltraSPARC III processors. However, in-place ad

dition is 5 — 20% slower for Pentium 4 and Opteron processors (with and without

Gaudry's patch). We do not consider the L2 cache because the two GNU-MP in

teger operands of L < 10, 000 words will fit in the L2 cache and cause no misses.

In the straightforward method we use the in-place addition; the Table 4.1 lists only

those parameters.

The GNU-MP addition assembly routines prefetch data to avoid memory refer-

82

Description (Parameter) Value

Pentium EE Opteron Pentium 4 UltraSPARC III

w/patch w/o patch

Cycles per word (warm) (c) 11.5 cycles 1 667 cycles 3 cycles 7.25 cycles 4.5 cycles

Overhead (warm) (h) 170 cycles 60 cycles 61 cycles 200 cycles 300 cycles

Cycles per word (cold) (c) 115 cycles 6 cycles 5 8 cycles 7 5 cycles 8 25 cycles

Overhead (cold) (h) 325 cycles 250 cycles 140 cycles 375 cycles 425 cycles

LI data cache size (K) 16 kb 64 kb 16 kb 64 kb

LI data cache block size (A) 64 bytes 64 bytes 32 bytes 32 bytes

LI cache associativity 8-way 2-way 8-way 4-way

LI cache miss penalty (IT) 7 cycles 14 cycles 7 cycles 20 cycles

Effective miss penalty 0 cycles 14 cycles 7 cycles 20 cycles

Table 4.1: Parameters used for modeling GNU-MP addition. The patch refers to Gaudry's patch [39].

ence stall cycles [44]. For relatively old processors such as UltraSPARC III, such

data prefetching is not effective when the size of the summands exceeds the LI

cache size and the resulting capacity misses cause pipeline stalls. On newer proces

sors such as Pentium 4 and Pentium M as well as the current dual-core AMD and

Intel processors (e.g., Opteron and Pentium EE) the speculative execution of mem

ory reference instructions such as the address prediction, load value prediction, and

stride prediction are used to reduce the latency of load instructions by aggressively

prefetching source operands.

Figures 4.1, 4.2, 4.3, 4.4 and 4.5 show plots for both experimentally measured

and modeled arithmetic efficiency (C(L)/L) under both the favorable (warm) and

adverse (cold) LI data cache conditions. For long integer addition, the "warm"

cache data was obtained after reading the operands once before each measurement

in an attempt to preload the LI cache. For the modeled data, values for c and h

83

were extracted from the short integer addition data by the "best fit" method. For

long integer addition, the Lf data cache wait cycles were added as described in the

discussion above. All words of both summands were initialized to 264 — f. Very few

outliers were present in measurements of short integer addition; these outliers were

removed. The short integer addition was performed in 10-plicate. The long integer

addition was performed in triplicate.

This model gives excellent results for the short integer addition. For the long

integer addition, it gives excellent results for the Intel processors and good results

for the UltraSPARC III and Opteron processors.

The straightforward method of Taylor shift by 1 (see Sections 2 and 4.3) exercises

the GNU-MP addition for short integers only (see Figure 4.6) that will fit in the LI

data cache. However, the entire polynomial will not fit in the cache. An attempt to

understand what happens when GNU-MP addition is called in the context of the

straightforward method for Taylor shift by 1 prompted the decision to provide data

for both the "cold" and "warm" cache conditions.

4.3 Modeling the straightforward method

Our model for the straightforward method is based on the above model of GNU-

MP addition, see Section 2 for the definition of the method and Figure 2.1 for the

pseudocode for the method.

Let S be the number of processor cycles required to perform computation for the

straightforward method of Taylor shift by 1. For our model, we assume that S is a

function of degree n of the polynomial, initial size of its coefficients k < L2(\A\00) in

bits (see Definition 1.3.5), and the cycle cost of GNU-MP addition C(L) as discussed

84

Modeling GNU-MP addition on UltraSPARC

50

40

30

Measured (warm cache) Modeled (warm cache) Measured (cold cache) Modeled (cold cache)

100 200 300 400 Length of summands in words

500

Modeling GNU-MP addition on UltraSPARC

30

25

20



10000

Figure 4.1: Modeling GNU-MP addition for the UltraSPARC III processor.

Modeling GNU-MP addition on Pentium 4

50

40

30



500

Modeling GNU-MP addition on Pentium 4

20

15

10 o

1 ' 1 1 1 1 1

i ' i ' i ' 1 ' 1 1 1 1 1


1 1 1

.1 » \1 «


1 1 1

.1 » \1 «

.^4^ .^4^ W ^ W ^ W W T H A ^ T T

i I , I , I , 2000 4000 6000 8000

Length of summands in words 10000

ure 4.2: Modeling GNU-MP addition for the Pentium 4 processor.

86

30

25

20

15

o 10

Modeling GNU-MP addition on Opteron without Gaudry's patch

I I I I I I I I I

1 Measured (warm cache) Modeled (warm cache) Measured (cold cache) Modeled (cold cache)

-

1 1

Measured (warm cache) Modeled (warm cache) Measured (cold cache) Modeled (cold cache) -

\

1 I , I , I , 100 200 300 400


10

O

Modeling GNU-MP addition on Opteron without Gaudry's patch

I +

A ftJ.-r.T'V -"' •v^v.^vMV^^^^\w^>^.i'^Li^^l^)t^,I^>



10000

Figure 4.3: Modeling GNU-MP addition without Gaudry's patch for the Opteron processor.

87

Modeling GNU-MP addition on Opteron with Gaudry's patch

30

25

20

15 O

10

Measured (cold cache) Modeled (warm cache) Measured (warm cache) Modeled (cold cache)

-St >s< *^ww„ «f-.«


500

10

O

Modeling GNU-MP addition on Opteron with Gaudry's patch

-> . i l - L H <UI ,+X>y>fV»o...y

Measured (warm cache) Modeled (warm cache) Measured (cold cache) Modeled (cold cache) _L _L


10000

Figure 4.4: Modeling GNU-MP addition with Gaudry's patch for the Opteron processor.

Modeling GNU-MP addition on Pentium EE

O 20

Measured (cold cache) Measured (warm cache) Modeled (cold cache) Modeled (warm cache)

~~d^^^UL), -


500

Modeling GNU-MP addition on Pentium EE

30

25

20

o

15

10

SM -H

Measured (cold cache) Measured (warm cache) Modeled (cold cache) Modeled (warm cache)

W. m iA ..Jw'>^.,U,vr\L... 2000 4000 6000 8000


ure 4.5: Modeling GNU-MP addition for the Pentium EE processor.

89

in Section 4.2 above. Experimental measurement of S likely depends on additional

factors not accounted for in this model, such as the LI data cache behavior while

accessing the array of GNU-MP integers that represent polynomials.

Each element of Pascal's triangle has a binary length of L 2 (a v ) < k + % + j

by Theorem 1.3.6(2). The length of the longest summand is L2 < L(ah}). The

straightforward methods perform n(n+ l)/2 additions to compute the elements, see

Section 1.3. Therefore, we obtain a computing time bound as follows:

k + i w

) S = £ \ C ( l<i<n

We will ignore memory wait cycles since the LI cache miss penalty is negligible for

GNU-MP integers with L < 500; the integers in our experiments never grow longer

than 500 words. Therefore, substituting C(L) = cL + h (see Section 4.2 above) and

assuming that \^f\ = ^ , we get

s = £ l(c(£±!) +/o Ki<n

s= y (ic(—)+ih) *-^ w

Ki<n

S= > — + — + ih) Ki<n

s = - y t+-y t2+hy % Ki<n Ki<n Ki<n

90

w z—' w z—' l<i<n l<i<n

OK f

S=(— + h)n(n + l)/2 + -n(n + l)(2n + l)/6.

Since S is an upper bound on the computing time (in cycles) for the straightfor

ward method, we expect our model to overestimate the computing time. In addition,

we provide modeled performance data based on run-time L that were experimentally

measured for each addition; we add the corresponding computing time for GNU-MP

addition to the total computing time. We do this with values for c and h for both

warm and cold cache.

The modeled data were generated using parameters described in the above Sec

tion 4.2; initial coefficients were assumed equal to 2™ — 1 where n is the degree. The

data presented in Figures 4.7, 4.8, 4.9, and, 4.11 show reasonable agreement between

modeled and measured performance. The greater discrepancy between the modeled

and the measured data on the Intel processors is probably caused by the processors'

more sophisticated out-of-order execution and hardware prefetching units.

A histogram that shows the distribution of the length of sums in the GNU-MP

based straightforward method for both the experimental and the modeled data is

provided in Figure 4.6. Since the model incrementally overestimates the length of

the sums, they grow as the computation progresses. The experimental execution

does not have as many long additions as the modeled execution assumes. The real

execution "shifts" longer additions toward the middle of the Pascal triangle.

91

Histogram of the number of additions 64-bit straightforward method of Taylor shift by 1

5e+05

w o

CO

CD

E

4e+05

3e+05

2e+05

1e+05

in x Measured - degree 1000 Measured - degree 2000 Measured - degree 4000 Measured - degree 8000 Modeled - degree 1000 Modeled - degree 2000 Modeled - degree 4000 Modeled - degree 8000

/

/ ** * N

\

V+ / / \ i

V _^Z L 50 100 150 200

Length of the sum in words 250

Figure 4.6: The distribution of the length of sums L in the straightforward method. Both experimental and modeled data provided.

Modeling straightforward method for UltraSPARC

CD E i _ CD Q . X CD V) jD O > s

o

o o

T Warm cache with modeled L Warm cache wtih measured L Cold cache with modeled L Cold cache with measured L

4000 6000

Degree 10000

Figure 4.7: Modeling the straightforward method for the UltraSPARC III processor.

92

Modeling straightforward method for Pentium 4

E <D Q . X <D (/> jD

o O

w o O

T~ T~ Warm cache with modeled L Warm cache with measured L Cold cache with modeled L Cold cache with measured L

•-.. V

2000 4000 6000 Degree

8000 10000

Figure 4.8: Modeling the straightforward method for the Pentium 4 processor.

CD E i _ CD Q . X CD V)

j D O

o

O o

Modeling straightforward method for Opteron without Gaudry's patch

•*v> T" T"

\ Warm cache with modeled L Warm cache with measured L Cold cache with modeled L Cold cache with measured L

4000 6000 Degree

10000

Figure 4.9: Modeling the straightforward method for the Opteron processor.

93

CD E

■ i _ <D Q . X <D w jD O O

w O o

Modeling straightforward method for Opteron with Gaudry's patch

\ % \

1 ' 1 ' 1 ' ■ \

% \ \ A / l_ XL. - 1 1 - 1 1

■

\ \

warm cache with modeled L Warm cache with measured L Cold cache with modeled L Cold cache with measured L

■

* % ■■■«. " " - " « - . ^

-

-

1 1 1 , 1 , 1 , 2000 4000 6000

Degree 8000 10000

Figure 4.10: Modeling the straightforward method for the Opteron processor with Gaudry's patch.

Modeling straightforward method for Pentium EE

E i _ <D Q . X <D (/> jD O

o

O

o

Warm cache with modeled L Warm cache with measured L Cold cache with modeled L Cold cache with measured L

2000 4000 6000 Degree

8000 10000

Figure 4.11: Modeling the straightforward method for the Pentium EE processor.

94

4.4 Model ing the ti le m e t h o d

The computing time of the tile method entirely depends on the performance of

the register tile and the delayed carry propagation routines. However, it is difficult

to predict how efficiently a compiler will schedule the code.

In order to model performance of the tile method, we begin with a discussion

of the lower bound for computing time of the register tile and the delayed carry

propagation and compare it to the measured performance results.

In order to derive the lower bound for computing time, let us expand on Theo

rem 3.3.1 (2). Let register tile Th} be a square tile with sidelength b, see Remark 3.2.2

and Figure 3.2 (a). Let u be the number of integer execution units (IEUs) that can

be engaged simultaneously by the register tile. Current processors typically feature

2 — 4 IEUs. There will be b x b additions for the square register tile; most of the addi

tions can be simultaneously executed on u IEUs, see Figure 3.2 for an example with

u = 2. At the beginning and at the end of the register tile computation there are

several dependent additions that cannot be parallelized: ~̂ ' at each end. These

dependent additions require u— 1 cycles to execute at each end. Then—provided u 6x6 2"^"~1-)

divides b—we have ■^—i \- 2(u — 1) cycles for additions per each register tile.

Let m cycles be the latency for memory references (i.e., the LI data cache

latency). Let us further assume that memory operations can be pipelined with

CPI< 1, which is true for all current processors. This assumption implies that we

should be able to read the initial 2 words for addition or write the 2 words after the

final addition i n m + 1 cycles. It also implies that all the other memory operations

can be scheduled simultaneously with the remaining addition operations and will

95

not consume any additional cycles, see 3.3.1(2). Thus, the memory references will

contribute only 2(m + 1) cycles to the register tile computation.

Therefore, the total cost in cycles per each register tile without the carry prop

agation is at least t = -•»(«- ) _|_ 2(u — 1) + 2{m + 1).

Table 4.2 lists experimentally measured time for the register tile computation of

the optimal size value for a particular platform, see Section 3.5. The lower bound

time (in cycles) for the register tile as explained above is listed in Table 4.3. Table 4.3

also presents the corresponding measured performance and the ratio between the

two. In each case, except for the Opteron processor, there is a substantial difference

between the lower bound estimate and the experimentally measured performance.

For the UltraSPARC III processor, the datum is different from our previously pub

lished results [59] and reflects a new purpose for the measurement. In the publica

tion [59] (also see Section 3.3), we were interested in inducing the compiler to meet

the lower bound. The current measurement was performed by measuring the time

(in cycles) to call and execute the actual function that implements the register tile

in our tile method code. In addition, the new experiments were performed with

different compiler optimization flags (see Section 1.5.3) that favor performance of

the tile method as a whole—probably at the cost of individual register tiles.

Let p be the computational cost of the delayed carry propagation for each word

on the register tile border, see Figure 3.1 (b). There are 2b words (two register tile

sides) that need to have the accumulated carries thus released. There will be 2b x p

cycles spent on carry computation per each register tile.

The code illustrated in Figure 4.12 has 4 dependent instructions per each word:

1 addition instruction, 1 subtraction instruction, 1 right shift, and 1 left shift and

96

Experimental data (cycles) b 4 6 8 10 12 14 16

Pentium EE 96 96 120 144 168 600 704 Opteron 19 29 42 58 87 371 499

Pentium 4 96 126 210 284 442 532 698 UltraSPARC III 49 75 102 133 176 182 216

Table 4.2: Experimentally determined cost for register tile execution.

b Lower bound Experimentally measured Ratio Pentium EE 12 81 cycles 168 cycles 2.07

Opteron 12 81 cycles 87 cycles 1.07 Pentium 4 6 27 cycles 126 cycles 4.67

UltraSPARC III 8 39 cycles 102 cycles 2.62

Table 4.3: Cost of the b x b register tile execution in cycles. The chosen b is the optimal value for the particular platform, see Section 3.5.

3 instructions that can be overlapped: 1 read, 1 write, and 1 branch instructions.

This implies that the cost for carry propagation computation is at least 4 cycles

provided no ILP features are engaged, no branch mispredictions occur, and no load

hazards encountered.

The cost of carry propagation per word released was experimentally measured

and is presented in Table 4.4 are for register tile stack of 100 register tiles high and are

averages for 10000 runs. Previous experiments have shown significant improvement

from rolled carry propagation code to the unrolled code [59]. These results confirm

those findings.

These experimental measurements are compared with the above estimated lower

97

inline void release_carries(baseint *P, int indx, m t *P1, int span, int tile, int n) {

baseint c, s; int 1, digit, pi; pi = PI [tile]; c = 0;

for (l = 0; l < span; i++)

for (digit = 0; digit < pi; digit++) ■c

s = P[dig i t*(n+l)+mdx+i] + c; / / add carry c = s / BETA; s = s - c * BETA; P[digi t*(n+l)+indx+i] = s;

> while ( c != 0 ) {

s = P[pl*(n+l)+indx+i] + c; c = s / BETA; s = s - c * BETA; P[pl*(n+l)+mdx+i] = s; pl++;

>

// compute carry // compute current digit // set it

// add carry to existing digit // compute carry // compute current digit // set it // increment the length

PI [tile] = pi; // set the length

Figure 4.12: The rolled delayed carry release routine for the tile method.

Pentium EE Opteron Pentium 4 UltraSPARC III Rolled 37.73-38.22 7.82-8.06 15.78-16.33 14.77-14.87

Unrolled Unrolled variable

13.44-17.07 15.01-17.32

5.06-6.58 5.26-6.83

8.23-12.12 9.07-11.57

5.01-10.12 5.44-10.13

Table 4.4: Experimentally determined cost of the 3 versions of the carry propagation in processor cycles for the register tile sizes ranging from 4 x 4 to 24 x 24.

98

b Lower bound Experimentally measured Ratio Pentium EE 12 4 cycles 13.44 cycles 3.36

Opteron 12 4 cycles 5.06 cycles 1.27 Pentium 4 6 4 cycles 8.23 cycles 2.06

UltraSPARC III 8 4 cycles 5.01 cycles 1.25

Table 4.5: Cost of the delayed carry propagation in cycles.

Pentium EE Opteron Pentium 4 UltraSPARC III Tile size b 12 words 12 words 6 words 8 words

Register tile cost t 168 cycles 87 cycles 126 cycles 102 cycles Carry propagation cost p 13.5 cycles 5 cycles 8.5 cycles 5 cycles

Table 4.6: Parameters used in modeling the tile method.

bound cycles in Table 4.5. The difference between the experimental measurements

and the lower bound is small for Opteron and UltraSPARC III processors.

Ignoring the different register tile shapes and assuming only square tiles, the tile

method computes \(n+ l)/&] (|~(n+ 1)/^1 + l)/2 register tile stacks. The height of

each stack H will grow by at most 2b bits by Theorem 1.3.6(2).

The parameters used for the modeling are presented in Table 4.6. The impact

of cache misses on performance is assumed to be of minor significance.

The modeling results for the Pentium EE, AMD Opteron, Pentium 4, and Ul

traSPARC III architectures are presented and compared to measured data in Fig

ures 4.13, 4.14, 4.15, and 4.16 respectively. The measured data were generated with

a code scheduled for 2 IEUs. The modeled data were generated using the parameters

described in Table 4.6. Initial coefficients were assumed to k = 2n — 1, where n is

99

Modeling tile method for Pentium EE

E i _ <D Q . X <D

(/> o O

o O

Measured t & p - modeled H Measured t & p - measured H Lower bound t & p - modeled H

■ — - - Lower bound t & p - measured H

2000 4000 6000 Degree

8000 10000

Figure 4.13: Modeling the tile method on the Pentium EE architecture.

the degree.

Modeling of the tile method clearly shows that the measured computational

cost of the components does not predict well the computational cost of the whole

method. Thus, accurate modeling and predicting performance is difficult; this calls

for automatic code generation and tuning, see Section 3.5.

4.4.1 Impact of changing the number of engaged IEUs

Our software was originally designed for UltraSPARC III, a processor with u = 2.

The 2 IEUs assumption is hard-coded in our register tile schedule, see Section 3.3.1.

However, there is a linear dependence between the number of engaged IEUs u and

performance. Our model for the lower bound predicts 7 — 27% and 12 — 46% per

formance improvement when u is changed from 2 to 3 and 4 respectively, see Fig-

Modeling tile method for Opteron

E ■ i _ <D Q . X <D w jD

o O

w o O

Measured t & p - modeled H Measured t & p - measured H Lower bound t & p - modeled H Lower bound t & p - measured H

4000 6000 Degree

10000

Figure 4.14: Modeling the tile method on the Opteron architecture.

Modeling tile method for Pentium 4

E i _ <D Q . X <D (/> jD O

o

O

o

1 1 1 1 1 1 1 1

A Measured t & p - modeled H Measured t & p - measured H Lower bound t & p - modeled H Lower bound t & p - measured H ./ V Measured t & p - modeled H Measured t & p - measured H Lower bound t & p - modeled H Lower bound t & p - measured H

/ * • *

1: \. '>~-1: \. '>~-. ' * ■

-' ̂ \ — — -» V **■■■

1 . i , i , 2000 4000 6000

Degree 8000 10000

Figure 4.15: Modeling the tile method on the Pentium 4 architecture.

101

Modeling tile method for UltraSPARC

E i _ <D Q . X <D

w o O

w O

o

V v v * — i . Measured t & p - modeled H Measured t & p- measured H Lower bound t & p - modeled H Lower bound t & p - measured H

4000 6000 Degree

10000

Figure 4.16: Modeling the tile method on the UltraSPARC III architecture.

ure 4.17 for an example. This predicts a potentially substantial performance gain

from rescheduling the register tile code to utilize more IEUs. We suggest, therefore,

that a search for optimal u should be made a part of automatic tuning and code

generation in the future, see Section 3.5.

4.4.2 Finding optimal register tile size

The optimal tile size depends on several hard-wired features of the processor

architecture such as the number of registers in the processor's register file, word

size, and superscalar capabilities. The optimal tile size will also depend on the

register usage conventions, and the compiler.

For each sweep within the register tile at least b+u registers are required. Several

additional registers may be required by the compiler for housekeeping or due to the

102

Impact of changing the number of lEUs on Opteron processor

1 5 e + 1 0 i 1 1 1 1 1 1 1

Degree

Figure 4.17: Impact of changing the number of IEUs on the lower bound for the computing time of the tile method for AMD Opteron architecture.

register usage conventions.

For example, the UltraSPARC III processor machine is a RISC general pur

pose register architecture, which by convention restricts several registers for the OS

kernel use. On the UltraSPARC III, we had initially derived the optimal tile size

manually [59] and later confirmed it experimentally [58] using our automatic code

generation and tuning technique, see Section 3.5 below.

For the Intel processors and AMD Opteron, the GCC compiler appears to need

only 2 registers for housekeeping. Therefore, the optimal register tile size was pre

dicted to be b = p — u — 2, where p is the number of general purpose registers in the

processor's register file. For the Pentium EE, Opteron, and Pentium 4 processors,

p is 16, 16, and 8 registers respectively and the optimal b is predicted to be 12, 12,

103

and 4 respectively for u = 2. Our experiments verified that the optimal tile size for

the Pentium EE and Opteron processors is b = 12; for the Pentium 4, b = 6 gives

slightly better results.

104

5. Asymptotically fast methods

5.1 Introduction

There are 4 known asymptotically fast methods for computing Taylor shift by 1:

the Paterson and Stockmeyer's method [89, 41, 75], the divide and conquer method

[89, 41, 88, 11], the convolution method [89, 41, 5, 80], and the modular convolution

method [97].

This chapter follows the notation used by von zur Gathen and Gerhard in their

1997 work [89, 41] and by Gerhard in his 2004 work [41]. All experimental data in

this chapter was produced with polynomials with initial coefficients set to 220 — 1. In

our experiments, we have used the Taylor shift code kindly made available to us by

Jiirgen Gerhard [89, 41] and Paul Zimmermann (modular convolution method) [97].

Let f(x) be an integer polynomial and g(x) = f(x+l), then the 4 asymptotically

fast methods of Taylor shift by 1 are as follows:

Paterson and Stockmeyer's method

Assume (n + 1) = m2 is a square (padding / with leading zeroes if necessary),

and write / = J20<l<m f^xm%, with polynomials / ^ G R[x] of degree less than m

for 0 < % < m. Compute (x+ 1)* for 0 < % < m. For 0 < % < m, compute f^(x + 1)

as a linear combination of 1, (x + 1), (x + l)2 , . . . , (x + l ) m _ 1 . Finally compute

g(x) = ^^ (x + l )mVW( a ; + 1) in a Horner-like fashion. 0<i<m

105

Divide and conquer method

Assume that (n + 1) = 2m is a power of 2. Precompute (x + l)2* for 0 < % < m.

Divide f(x) as follows f(x) = f^(x) + x^n+1^2f^(x), where polynomials f^(x),

f{1)(x) E R[x] of degree less than (n + l)/2. Then g{x) = f{0){x + 1) + (x +

l)(n+1)/2j(1)(a; -f 1) where f^(x + 1) and f^l\x + 1) are computed recursively.

Convolution method

This method works if n! is not a zero divisor in R. By Theorem 1.3.1, g^ =

y I J/j. If we multiply both sides by n!&!, we obtain nlklgk = / ^ (^■ft)j'- nT &<«<«, ^ ' &<«<«, ^ ''

in _R. Let u = 2_, ilftXn~land v = n\ 2_, ~r m ^ M , then nlklgk is the coefficient

of xra_fcin the product polynomial uv.

Modular convolution method

We can compute convolution method modulo an integer m > 22n where n is

the degree as long as m is prime to n\. This fact is based on a trivial observation

that for l/loo < n the coefficients of g(x) have at most 2n bits. This method works

only with non-negative coefficients but can be extended to work with all integer

coefficients, for example, by separating negative and positive coefficients and writing

f(x) = f+(x) — f~(x) and computing the corresponding g(x) separately [97]. We

did not implement the method capable of handling negative coefficients.

The code used in experiments for the divide and conquer, convolution, and

Paterson-Stockmeyer methods was NTL-based. The code used for the modular

convolution method was GNU-MP-based.

106

Classical Taylor shift vs. divide and conquer Speedup of tile method relative to Divide&Conquer

11 i i i i i i i i i i

0 2000 4000 6000 8000 10000 Degree

Figure 5.1: The tile method is faster than the asymptotically superior divide and conquer method for a wide range of degrees.

5.2 Performance of the fast methods

Initially we have compared performance of von zur Gathen and Gerhard's NTL-

based divide and conquer Taylor shift by 1 [89, 41] with our tile method [59]. The

tile method is faster up to degree 6000 on all 4 target platforms, see Figure 5.1.

This result was published in 2006 [58].

We have compared the 4 asymptotically fast methods on our 4 target architec

tures. The performance of the methods relative to the tile method is reported in

Figures 5.2, 5.3, 5.4, and 5.5. The divide and conquer method outperforms all the

oretically fast methods for degrees 100 < n < 7500 on all 64-bit architectures. The

data was collected for polynomials Bn>d with d = 220 — 1. We did not measure for

degrees n > 7500 because the experiments would take very long to run.

107

Fast methods on Pentium EE

T3 O

15 E <D

Q) 8

! : X 5

3 ^ o O

X ~T

-**-\

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

2000 4000

Degree 6000 8000

Figure 5.2: All asymptotically fast methods are slower than the divide and conquer method on the Pentium EE. The convolution method is over 80 x slower than the tile method and is not shown.

Fast methods on Opteron 13

12 T3 O 11 C-

<D 10

E 9 <i> — R -•—» t/> <D 1

O > s b o

■ — . 5 X

4 t/> <i> o > s O 2

\ ' , ' i 1 i '

\ » Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ \ Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer


Convolution

Modular convolution

Paterson Stockmeyer

\ »

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer t \ v

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer \ ^ '•

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ \ V

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ \ V \ ^ ' " -

\ A. ""̂ \ / \ > ■ \ / \ . v

" ^ ^

i , i i 2000 4000

Degree 6000 8000

Figure 5.3: All asymptotically fast methods are slower than the divide and conquer method on the Opteron. The convolution method is over llOx slower than the tile method and is not shown.

108

Fast methods on Pentium 4

T3 O .C ID E <D

o o

w o O

T~


Convolution

Modular convolution

■ — ■ - Paterson Stockmeyer

2000 4000

Degree 6000 8000

Figure 5.4: All asymptotically fast methods are slower than the divide and conquer method on the Pentium 4. The convolution method is over 50 x slower than the tile method and is not shown.

Fast methods on UltraSPARC

T3 O .C ID E CD

jD O

jD O

o

\ V ' i * i * \ \ Divide and conquer

Convolution

Modular convolution

Paternson Stockmeyer

\ \ \ \

—

Divide and conquer

Convolution

Modular convolution


1 \ 1 * —

Divide and conquer

Convolution

Modular convolution

Paternson Stockmeyer \ \ \ \

—

Divide and conquer

Convolution

Modular convolution


\ \ """̂ \ ■ \

—

Divide and conquer

Convolution

Modular convolution


\ \ """̂ \ ■ \

\ \

V •—• s ;̂- ■ * « • ,

•v ^

i , i i 2000 4000

Degree 6000 8000

Figure 5.5: All asymptotically fast methods are slower than the divide and conquer method on the UltraSPARC III. The convolution method is over 200 x slower than the tile method and is not shown.

109

5.3 Comput ing t imes in the l iterature

Von zur Gathen and Gerhard [89, 41] have published computing times for their

NTL-based implementation of the asymptotically fast method. Tables 5.1, 5.2, 5.3,

5.4, 5.5, and 5.6 quote those computing times and list our experimental measure

ments for the three methods on the UltraSPARC III, Pentium 4, Opteron, and

Pentium EE processors. The computing times we quote were obtained on an Ultra

SPARC workstation rated at 167 MHz [89] and on a Pentium III 800 MHz Linux

PC [41]; our experiments were performed using compiled libraries as described in

Section 1.5.3. Von zur Gathen and Gerhard ran their program letting k = 7 , . . . , 13

and n = 2k — 1 for input polynomials of degree n. Tables 5.1, 5.2, and 5.3 quote and

report computing time using "small" coefficients, i.e, max-norm < n. Tables 5.4, 5.5,

and 5.6 quote and report computing time using "large" coefficients, i.e., max-norm

< 2n+1. The integer coefficients were pseudo-randomly generated; we used the same

input polynomials in our experiments. Our data corroborates Von zur Gathen and

Gerhard's conclusion that divide and conquer method is the fastest among the three

methods for high degrees.

5.4 Improving performance of the fast methods

Broadly, there are two ways to approach redesigning the fast methods: a top to

bottom and a bottom-up approach. The former is time consuming but in the long

run may lead to better performance results. The tile method for the classical Taylor

shift by 1 [59] is an example of a complete algorithm redesign for high performance.

The bottom-up hierarchical approach in the case of the asymptotically fast Taylor

110

Degree UltraSPARC [89] Pentium III [41] UltraSPARC III Pentium 4 Opteron Pentium EE

31 — — 0.00065 0.00841 0.00018 0.00021

63 — — 0.00180 0.00050 0.00049 0.00047

127 0.260 0.010 0.00452 0.00142 0.00123 0.00113

255 1.640 0.072 0.01226 0.00430 0.00331 0.00307

511 11.450 0.432 0.04065 0.01630 0.01012 0.00939

1023 86.090 2.989 0.15353 0.07650 0.03795 0.03665

2047 713.200 16.892 0.68125 0.36726 0.18785 0.17505

4095 — 125.716 3.16158 1.84255 0.91648 0.86930

8191 — — 17.34230 9.76371 4.55053 4.37315

Table 5.1: Computing times (s.) for the divide and conquer method of Taylor shift by 1 —"small" coefficients.


31 — — 0.00058 0.00028 0.00017 0.00022

63 — — 0.00222 0.00115 0.00064 0.00064

127 0.060 0.044 0.01515 0.00973 0.00418 0.00407

255 0.640 0.290 0.10956 0.06471 0.02940 0.02820

511 7.430 2.007 0.67894 0.44935 0.19416 0.18580

1023 87.570 13.958 4.60220 3.09768 1.25495 1.19873

2047 1387.390 98.807 33.93170 22.13006 8.53665 8.15568

4095 — 787.817 240.62589 160.39211 59.54717 56.98022

8191 — — 1721.13593 1666.83422 423.89578 2509.76500

Table 5.2: Computing times (s.) for the convolution method of Taylor shift by 1 —"small" coefficients.

I l l


31 — — 0.00033 0.00008 0.00009 0.00019

63 — — 0.00088 0.00026 0.00027 0.00026

127 0.080 0.010 0.00301 0.00115 0.00089 0.00083

255 0.440 0.072 0.01318 0.00511 0.00376 0.00364

511 2.480 0.602 0.06557 0.02785 0.01925 0.01820

1023 15.530 6.364 0.36632 0.17988 0.11426 0.10303

2047 102.640 57.744 1.34248 0.83561 0.41286 0.41554

4095 — 722.757 10.29823 5.78704 2.93546 3.18046

8191 — — 71.02069 42.28142 19.20488 22.63864

Table 5.3: Computing times (s.) for the Paterson & Stockmeyer method of Taylor shift by 1 —"small" coefficients.


31 — — 0.000695 0.000192 0.000178 0.000213

63 — — 0.001978 0.000600 0.000514 0.000536

127 6.340 0.489 0.005770 0.002020 0.001446 0.001503

255 107.000 9.566 0.019537 0.008620 0.005227 0.005123

511 — 166.138 0.089770 0.050037 0.024322 0.023670

1023 — — 0.422919 0.187685 0.113050 0.116488

2047 — — 1.715237 0.983371 0.481346 0.472606

4095 — — 8.759840 5.525364 2.488822 2.449783

8191 — — 51.315470 32.580679 13.996941 13.693067

Table 5.4: Computing times (s.) for the divide and conquer method of Taylor shift by 1 —"large" coefficients.

112


31 — — 0.000621 0.000257 0.000187 0.000176

63 — — 0.002456 0.002083 0.000721 0.000685

127 7.880 0.524 0.017887 0.013282 0.004989 0.004713

255 241.540 13.262 0.122438 0.287136 0.036668 0.032581

511 7453.690 234.087 0.762593 0.674811 0.278306 0.270762

1023 — — 5.237463 3.619739 1.476873 1.424302

2047 — — 38.150037 25.160389 9.522594 9.090904

4095 — — 268.787407 182.303800 66.651315 63.672271

8191 — — 1928.247407 2672.559111 475.816648 1874.960729

Table 5.5: Computing times (s.) for the convolution method of Taylor shift by 1 —"large" coefficients.


31 — — 0.000336 0.002618 0.002409 0.083244

63 — — 0.001028 0.345857 0.000289 0.015961

127 4.810 0.700 0.004395 0.078607 0.001279 0.266423

255 76.210 14.894 0.022186 2.403883 0.006682 0.795823

511 1289.730 420.562 0.128907 8.865631 0.040849 0.676700

1023 — — 0.830711 2.888625 0.300753 2.038819

2047 — — 3.478230 2.746924 1.002791 1.055621

4095 — — 29.064893 17.772409 8.059511 9.584729

8191 — — 218.780667 151.508244 58.104000 72.311823

Table 5.6: Computing times (s.) for the Paterson & Stockmeyer method of Taylor shift by 1 —"large" coefficients.

113

n 10

T3 ° 9

E 8

^ 7

i 6 & 5 O

£ 2 o

1

Fast methods on Pentium EE with GNU-MP arithmetic

\ V I ' I ' \ V \ \ x

\ \ N \ \ x

\ \ N Convolution Modular convolution

■ — - - Paterson Stockmeyer

\ v

\ x x Convolution Modular convolution

■ — - - Paterson Stockmeyer \ x Xx

Convolution Modular convolution

■ — - - Paterson Stockmeyer \ \ v

Convolution Modular convolution

■ — - - Paterson Stockmeyer

\ * s

\ sv X \ V <- — ^

I , I , I 2000 4000

Degree 6000 8000

Figure 5.6: Using 64-bit arithmetic on the Pentium EE improves the crossover point. The convolution method is over 30 x slower than the tile method and is not shown.

shifts by 1 would involve improving integer arithmetic, polynomial arithmetic, and

high-level coding. Only the first has been explored experimentally.

5.4.1 Replacing native N T L ari thmet ic w i th G N U - M P ari thmet ic

If we replace the native NTL arithmetic, which uses a radix of 230 with GNU-

MP arithmetic, which uses a radix of 264 (except in case of Pentium 4 architecture

where the radix is 232), we get a speedup that moves the crossover points in the

desirable direction, see Figures 5.6, 5.7, 5.8, and 5.9. However, the improvement in

the performance is not sufficient to make the fast methods superior for any practical

degree range.

Replacing the native GNU-MP arithmetic with Gaudry's patch on the Opteron

114

Fast methods on Opteron with GNU-MP arithmetic

4000

Degree 8000

Figure 5.7: Using 64-bit arithmetic on the Opteron improves the crossover point. The convolution method is over 25 x slower than the tile method and is not shown.

Fast methods on Pentium 4 with GNU-MP arithmetic

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

4000

Degree 8000

Figure 5.8: Using 64-bit arithmetic on the Pentium 4 improves the crossover point. The convolution method is over 8x slower than the tile method and is not shown.

115

8000

Figure 5.9: Using 64-bit arithmetic on the UltraSPARC III improves the crossover point. The convolution method is over 62 x slower than the tile method and is not shown.

processor causes a minor improvement for the divide and conquer method but a

significant improvement for the convolution method, which is the slowest method,

see Figure 5.10.

Thus, improving integer arithmetic leads to significant performance advantage.

However, this advantage does not lead to a change in the crossover point such that

any of the fast methods becomes a viable replacement for the tile method.

Nonetheless, studying the computing times changes when switching from the

native NTL arithmetic to GNU-MP-based gives us an estimate on how much im

provement we can expect if we recode the algorithm in GNU-MP. This would not

be a trivial task however, as we would need to implement the necessary polyno

mial arithmetic using the GNU-MP integer arithmetic. Polynomial arithmetic is

18 17 1fi

a o 1b .c 14 (I) E 13 <D 12

til

11

(/> 10 (I) o y > s R o ^ / X 6

(/> ^ (I) o 4 > s ? () 2

1 0

Fast methods on UltraSPARC with GNU-MP arithmetic

t \ ' ' 1 1 '

. \ Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ • \ Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ x x

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer \ x """"%

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer \ t \

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer \ • \

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ x

Divide and conquer

Convolution

Modular convolution

Paterson Stockmeyer

\ x

^ x. \ * ̂ . \ " '*'" • •».

"*

I , I i 2000 4000

Degree 6000

116

Fast methods on Opteron with GNU-MP arithmetic and Gaudry's patch

11 i i i i i i i i

0 2000 4000 6000 8000 Degree

Figure 5.10: Gaudry's patch improves the crossover point on the Opteron. The convolution method is over 19 x slower than the tile method and is not shown.

efficiently implemented in the NTL package but is not provided by the GNU-MP

library.

5.4.2 Observations on coding

The fast methods (courtesy of Jiirgen Gerhard) are skillfully coded using calls

to the NTL library. However, NTL is written in a portable C + + and treats both

polynomials and integers as ADT classes with constructors and destructors. All

NTL-based fast methods for Taylor shift by 1 involve multiple copying of the inter

mediate polynomials. Such construction and destruction is probably costly. Using

in-place computation techniques may be warranted.

117

5.5 Conclusions

None of our approaches for improving performance of the fast Taylor shifts by 1

algorithms have yielded a method that outperforms our tile method for any currently

useful degree. However, we have demonstrated that effective utilization of features

of computer architecture significantly affects crossover points.

118

6. Appl icat ions

In this chapter, we discuss application of the tile method both as a technique

for improving analogous computer algebra algorithms and as a high-performance

algorithm that can be used in important applications. Specifically, de Casteljau's

algorithm has similar pattern of addition and can be redesigned for performance

using the register tiling technique from the tile method for Taylor shift by 1. Both

algorithms play an important role in the two variants of the Descartes method for

real root isolation.

6.1 High-performance de Casteljau's a lgori thm

The Taylor shift by 1 and de Casteljau's algorithms share a similar computational

structure. The algorithms have a similar dependency pattern, but the order of

computation is reversed. Figures 6.1 and 6.2 present diagrams of addition direction

within the Pascal triangle for the two algorithms.

The technique of register tiling that makes the classical Taylor shift by 1 effi-

a. b.

Figure 6.1: The computation structure of (a) Taylor shift by 1 and (b) de Casteljau's algorithms.

119

0 0 0 0 i i 1 i

(ln ~> a 0 , 0 Qo,i ^0,2 ^0,3

Q>n-1 -> Qi,o «i , i ai,2 d"n-2 "> «2,0 «2,1 \ dn-3 "> «3,0 \

(a)

#,' 62 6? 6g T T T T

^3 < — ^3,0 &2,1 &12 t>0,3

b2 <— 62,0 &i,i bo,2 b[ <- 6i)0 6o,i \

(b)

h%3 = a * , j - i + Figure 6.2: (a) The pattern of integer additions in Pascal's triangle, CLi-i>3) can be used to perform Taylor shift by 1. (b) In de Casteljau's algorithm all dependencies are reversed, the intermediate results are computed according to the recursion b3>l = b3-\>t + b3-\>t+\.

Figure 6.3: Register tiling can be applied to (a) Taylor shift by 1 and (b) de Casteljau's algorithm. Arrows show direction of addition.

120

cient [59] can be used to redesign de Casteljau's algorithm as well, see Figure 6.3.

Tiling the classical Taylor shift by 1 was discussed in Chapter 3.

In order to allow our tiled implementations to be used on multiple architectures,

we have implemented an automatic code generator in Perl [91] for de Casteljau's

algorithm. The generator is similar to the one used for generating the tile method

for Taylor shift by 1, see Section 3.5. It produces portable ANSI C + + [52, 13] code

for tiles of different sizes, compiles and executes the code, and searches through

successively larger tile sizes until a tile size with peak performance is discovered.

The limitations of the tile method for Taylor shift by 1 transferred to the tile method

of de Casteljau's algorithm: a single addition schedule for a CPU with two addition

units.

6.2 High-performance Descartes m e t h o d

In this section we describe our implementation of the Descartes method using

the tiled version of de Casteljau's algorithm, see Section 6.1. We then compare

performance of our implementation with the method by Hanrot et al., the SYNAPS

method, and two architecture-unaware implementations from SACLIB library [22].

6.2.1 Monomial vs . Bernste in bases

The Descartes method, independent of the basis used to represent polynomials,

uses binary search to find isolating intervals and relies on the Descartes rule of signs

to determine when an isolating interval has been found or when the search can stop

since there are no roots in the given interval. Let A(x) = amxm + • • • + (i\X + a0.

The Descartes rule states that the number of coefficient sign variations, var(A), is

121

greater than or equal to the number of positive roots of A, and that the difference

is even. This provides an exact test when var(A) G {0,1}.

The following polynomial transformations are needed for the method and the

mapping between the monomial basis and the Bernstein basis:

1. Translation Tc(A(x)) = A(x - c),

2. Reciprocal transformation: R(A(x)) = xmA(l/x), and

3. Homothetic transformation: Ha(A(x)) = A(x/a).

The method proceeds by using a root bound and a homothetic transformation to

transform the input polynomial to a polynomial, A, whose roots in the interval (0, 1)

correspond to the positive roots of the input polynomial. It can be advantageous

to compute the negative roots separately using a separate root bound for the neg

ative roots. When using the monomial basis, the Descartes rule is applied to the

transformed polynomial A* = T_iR(A) to determine whether A has zero or one real

roots in the interval (0, 1). Bisection is performed by computing the transformed

polynomials Ax = H2(A) and A2 = T_iH2(A) whose roots in the interval (0,1)

correspond to the roots of A in the intervals (0,1/2) and (1/2,1) respectively. The

Descartes rule is then applied to A\ = T_iR(A\) and A2 = T_iR(A2), and if more

than one coefficient sign variation is obtained the algorithm proceeds recursively

with the bisected polynomials.

Associated with this bisection process is a binary tree, where each node in the

tree has an associated subinterval and polynomial. Each internal node requires the

computation of three polynomial translations T_i, i.e., Taylor shift by 1, to compute

122

the bisection polynomial and the two applications of the Descartes rule, while leaf

nodes only require the polynomial translations for the application of the Descartes

rule. The bulk of the computing time for the method is devoted to Taylor shift by

1. Figure 6.2(a) shows the classical computation of A(x + 1) = J2™=0 am-h,hXh-, see

also Theorem 1.3.3. Note that it is possible to avoid the complete computation of

the Taylor shift in the application of the Descartes rule by stopping as soon as more

than one sign variation is detected.

Let Bm>i(x) = (™)x*(l — x)m~\ i = 0 , . . . , m be the Bernstein basis, and write

Mx) = YZokBm,t(x). Since T^R{Bm>t{x))= (™)xm~\ T^R{A{x)) =E™o ( 7 ) M "

and v&r(A*(x)) = var(60, •••, bm). The Bernstein representation of the bisection poly

nomials, Ai(x) and A2(x), can be obtained from the coefficients of the Bernstein

representation of A(x) using de Casteljau's algorithm. In order to preserve integer

coefficients a fraction-free variant is used. For 0 < i < m set 60,« = h, and for

1 < j < m and 0 < % < m — j set b3>l = b3-\ + b3-\>t+\. As Figure 6.2(b) shows, this

computation is similar to the computation of the Taylor shift by 1, except that the

computation proceeds in the reverse direction. Eigenwillig et al. [31] remark that if

b[ = 2m~%fi and b"% = Tbm.v, A1(x) = £ ™ 0 VtBm>t(x) and A2{x) = TZo b'lBm>l(x).

This establishes a one-to-one mapping between the nodes and the associated polyno

mials in the search trees for the monomial and Bernstein variants of the algorithm.

Moreover, the cost of the computation at each node, assuming classical algorithms

for de Casteljau's and Taylor shift, for both variants is codominant and hence the

total computing time is codominant. In contrast to the monomial basis, each inter

nal node requires one application of de Casteljau's algorithm instead of three Taylor

shifts by one and no transformations are required for leaf nodes. A similar approach,

123

called the dual algorithm, which also reduces the number of Taylor shifts by com

puting A\(x) and A^ix) directly from A*(x) using monomial bases was suggested

by Johnson [56].

6.2.2 The Descartes methods we compare

The descriptions of the Descartes methods we compared follow.

The monomial SACLIB method IPRRID

The program IPRRID in the SACLIB library [22] processes the bisection tree

in breadth-first order [62, 78]. The IPRRID program calls the IUPTR1 routine,

i.e., the classical Taylor shift by 1 algorithm included in the SACLIB library, see

Section 2.4. The IPRRID routine also includes calls to a partial Taylor shift by 1

and will avoid the complete computation of the Taylor shift in the application of

the Descartes rule by stopping as soon as more than one sign variation is detected.

Also, IPRRID checks whether var(A) ^ 0 before computing T-\R(A).

The Bernstein SACLIB method IPRRIDB

The program IPRRIDB (courtesy of G. E. Collins) is based on the SACLIB

library [22]. The program converts the input polynomial from its monomial rep

resentation into a fraction-free Bernstein-basis representation. IPRRIDB processes

the bisection tree in the same way as the program IPRRID of Section 6.2.2 above.

IPRRIDB uses a fraction-free version of de Casteljau's algorithm that avoids the

overhead of calling integer addition routines and of normalizing after each integer

addition—in the same way as the SACLIB Taylor shift by 1 program IUPTR1, see

124

Sections 6.2.2 and 2.4.

The m e t h o d by Hanrot et al.

Hanrot et al. [45] provide an efficient implementation of the monomial version of

the Descartes method that incorporates the memory-saving technique of Rouillier

and Zimmermann [78]. Their implementation uses GNU-MP [44] for the integer

additions required by Taylor shift operations.

Additional algorithmic optimizations are included to reduce the time spent on

Taylor shift. The complete execution of the Taylor shift used to compute T_XR prior

to the application of the Descartes rule is not needed in many situations. If all of

the input coefficients are of the same sign, then the transformed polynomial will

have zero coefficient sign variations and the Taylor shift can be avoided. If all of

the intermediate coefficients in a column of the Taylor shift computation have the

same sign, then the remaining result coefficients will have the same sign, and the

computation can be aborted. If exactly two sign variations are reported, then there

are either zero or two roots in the interval. If the signs of the polynomial evaluated

at 0 and 1 are equal but different from the sign at 1/2, then two roots have been

found and the algorithm can terminate avoiding the additional Taylor shifts needed

for the Descartes test to report the termination. This test is efficient to apply since

the polynomial evaluated at 1 is equal to the sum of the coefficients and the sum

is known after computing the first column of intermediate coefficients in the Taylor

shift computation. In practice, computation of a partial rather than a complete

Taylor shift along with the early termination tests can save a substantial amount of

time.

125

In a pre-processing step the method determines the greatest k such that the

input polynomial A(x) is a polynomial in xk, and replaces A(x) by A(j/x). If k is

even, the method isolates only the positive roots.

The SYNAPS method

The SYNAPS [71] implementation IslBzInteger<QQ> [32] of the Descartes method

uses GNU-MP [44] for the integer additions required by the de Casteljau's opera

tions. Otherwise, the method is a straightforward implementation of the Bernstein-

bases variant. A hard-coded limitation of the recursion depth to 96 prevents the

method from isolating the roots of Mignotte polynomials of degrees greater than 80.

6.2.3 Performance results

We measured the executions times of the five implementations of the Descartes

method on the four processor architectures for input polynomials of various degrees

from the three classes of polynomials, see Section 1.5.4. The data are given in

Tables 6.1, 6.2, 6.4, and 6.3. Figures 6.4, 6.5, 6.6 and 6.7 plot the performance

gain obtained by the methods with respect to the SACLIB routine IPRRID for

input polynomials of various degrees. Gains by an order of magnitude are typical.

The largest speedup is by a factor of 24, and it is obtained by the Bernstein-based

variant of the Descartes method with register tiling on the Opteron processor for

the Chebyshev polynomial of degree 1000.

The data show that high performance can be achieved using a number of algo

rithmic devices:

126

Real root isolation of Random polynomials Real root isolation of Random polynomials Architecture Intel Penitum EE Architecture AMD Opteron

200 400 000 800 000 ° 200 400 000 800 000 Degree Degree

Figure 6.4: Speedup with respect to the monomial SACLIB implementation for random polynomials on four architectures.

Real root isolation of Chebyshev polynomials Architecture Intel Pentium EE

Real root isolation of Chebyshev polynomials Architecture AMDOpteron

Degree

Real root isolation of Chebyshev polynomials Architecture Sun UltraSPARC III

— SACLBBe nsten I ' ' ' ' ' I

Degree

Real root isolation of Chebyshev polynomials Architecture Intel Pentium 4

SACL B Be nste n TeclBei SYNAPS

Degree Degree

Figure 6.5: Speedup with respect to the monomial SACLIB implementation for Chebyshev polynomials on four architectures.

128

Real root isolation of reduced Chebyshev polynomials Architecture Intel Pentium EE

Degree

Real root isolation of reduced Chebyshev polynomials Architecture Sun UltraSPARC III

Real root isolation of reduced Chebyshev polynomials Architecture AMD Opteron

Degree

Real root isolation of reduced Chebyshev polynomials Architecture Intel Pentium 4

Degree Degree

Figure 6.6: Speedup with respect to the monomial SACLIB implementation for reduced Chebyshev polynomials on four architectures.

129

Real root isolation of Mignotte polynomials Architecture Intel Pentium EE

Degree

Real root isolation of Mignotte polynomials Architecture Sun UltraSPARC III

SACLBBe nste n T ed Be nste n

^ - ^ N

.—- " ^ S^ ^< = :<< : ; : .^

Real root isolation of Mignotte polynomials Architecture AMDOpteron

Degree

Real root isolation of Mignotte polynomials Architecture Intel Pentium 4

Degree Degree

Figure 6.7: Speedup with respect to the monomial SACLIB implementation for Mignotte polynomials on four architectures.

130

1. The use of Bernstein bases can be viewed as a way to reduce the number of

reopera t ions per internal node of the recursion tree in the monomial vari

ant from 3 Taylor shifts to 1 de Casteljau's transformation. A comparison

between the SACLIB methods IPRRID and IPRRIDB shows that this ap

proach is successful—despite the fact that the initial transformation of the

input polynomial into the Bernstein-base representation can increase the co

efficient length.

2. Hanrot et al. achieve a similar reduction in the number of n3-operations by

partial execution of certain Taylor shifts. In addition, their early termination

test avoids all n3-operations at certain leaf nodes. For reduced Chebyshev

polynomials this device reduces the number of complete Taylor shifts by 40%.

3. The use of the assembly-language integer addition routines of GNU-MP makes

the SYNAPS method faster than the SACLIB method IPRRIDB for polynomi

als with long coefficients, and it contributes to making the method by Hanrot

et al. faster than the SACLIB method IPRRID.

4. The use of register tiling is orthogonal to devices (1) and (2). In fact, in an

additional experiment we replaced the complete Taylor shift in the method by

Hanrot et al. with our tiled Taylor shift and obtained an additional speed-up

by a factor of about 1.33.

The three implementations of the Bernstein-bases variant might be further improved

by incorporating the early termination test by Hanrot et al.

The data show that—with minor exceptions—for all classes of polynomials, the

131

best absolute computing times are achieved on the Opteron processor using the

Bernstein-bases variant of the Descartes method with register tiling.

Deg SACLIB SACLIB Tiled Hanrot SYN-mon. Bern. Bern. et al. APS

Ran 100 8 5 4 4 11 Ran 200 81 44 20 25 44 Ran 300 311 148 57 103 115 Ran 400 787 360 128 194 232 Ran 500 1708 733 252 430 417 Ran 600 2257 1071 376 687 617 Ran 700 4416 1884 641 1058 982 Ran 800 8706 3309 1143 2361 1496 Ran 900 11679 4832 1610 2090 2132 Ran 1000 18155 6761 2274 3417 2741 Che 100 344 108 60 4 55 Che 200 6316 1712 608 52 764 Che 300 34606 8980 2604 240 3728 Che 400 115675 29409 8284 708 11664 Che 500 296434 74340 23748 1860 28625 Che 600 638131 156701 48267 3720 59842 Che 700 1207523 285578 90662 6540 116087 Che 800 2106963 492447 156935 10688 209550 Che 900 3455690 811711 261720 18033 351207 Che 1000 5388421 1320835 418654 27326 553752 Red 100 24 8 8 4 8 Red 200 348 108 64 52 60 Red 300 1924 544 236 240 268 Red 400 6372 1756 624 708 840 Red 500 16845 4372 1372 1860 2044 Red 600 35046 9092 2612 3720 4068 Red 700 67820 17237 4940 6540 7568 Red 800 116783 30049 8380 10688 12729 Red 900 193304 48583 15049 18033 20333 Red 1000 302510 75012 23685 27326 31286 Mig 100 4736 1288 760 728 N/A Mig 200 143673 38642 17932 20721 N/A Mig 300 1083263 290385 132196 167387 N/A Mig 400 4536036 1213095 660173 721453 N/A Mig 500 13777640 3682927 2470037 2204722 N/A Mig 600 34211121 9110764 7533232 5468834 N/A

Table 6.1: Root isolation timings in milliseconds for Intel Pentium EE.



Table 6.2: Root isolation timings in milliseconds for AMD Opteron.



Table 6.3: Root isolation timings in milliseconds for UltraSPARC III.



Table 6.4: Root isolation timings in milliseconds for Intel Pentium 4.

136

7. Future research

Several questions remain:

1. Our modeling predicts a substantial performance gain from rescheduling the

register tile code to utilize more IEUs, see Section 4.4.1. We suggest, therefore,

that a search for optimal u should be made a part of automatic tuning and

code generation. For example, Opteron should be able to do 3 additions per

cycle as the processor has 3 IEUs.

2. We have experimented only with the square-shaped register tiles. Effect of

register tile shape on performance should be analyzed. Another shape may

yield a better performance. A search for the optimal tile shape then should

be incorporated into automatic code generation and tuning process.

3. Applying Hanrot et al.'s early-termination tests to the Bernstein bases variant

of the Descartes method with register tiled de Casteljau's algorithm will likely

yield the fastest known implementation of the Descartes method.

4. Whether the sparse interlaced array representation for polynomials is a more

efficient data structure for the tile method than a noninterlaced representation

should be verified experimentally.

5. A larger register file in the CPU (i.e., with more available registers) will allow

larger register tiles. This would further reduce the cost of carry propagation.

137

6. Using a processor with wider registers (i.e., 128-bit vs. 64-bit wide) will im

prove performance as the wider registers would allow for a larger radix. A

larger radix will shorten the integer coefficients and further reduce the num

ber of additions.

138

Bibliography

Automatically Tuned Linear Algebra Software (ATLAS), h t t p : / /math-a t las . sourceforge .net/.

Advanced Micro Devices, Inc., AMD Eighth-Generation Processor Architecture, http://www.amd.com/, October 2001.

, Processor Reference, http://www.amd.com/, June 2004.

, Software Optimization Guide for AMD64 Processors, http://www. amd.com/, September 2005.

A. V. Aho, K. Steiglitz, and J. D. Ullman, Evaluating polynomials at fixed set of points, SIAM Journal on Computing 4 (1975), no. 4, 533-539.

Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman, The design and analysis of computer algorithms, Addison-Wesley Publishing Company, 1974.

Randy Allen and Ken Kennedy, Optimizing compilers for modern architecture: A dependence-based approach, Morgan Kaufmann Publishers, New York, 2002.

David H. Bailey, King Lee, and Horst D. Simon, Using Strassen's algorithm to accelerate the solution of linear systems, Journal of Supercomputing 4 (1990), no. 4, 357-371.

Saugata Basu, Richard Pollack, and Marie-Frangoise Roy, Algorithms in real algebraic geometry, Springer-Verlag, 2003.

, Algorithms in real algebraic geometry, second ed., Springer-Verlag, 2006.

D. Bini and V. Y. Pan, Polynomial and matrix computations, vol. 1, Birkhauser, 1994.

Jacques Borowczyk, Sur la vie et Voeuvre de Francois Budan (1761-1840), His-toria Mathematica 18 (1991), 129-157.

British Standards Institute, The C++ Standard: Incorporating technical corrigendum no. 1, John Wiley and Sons, 2003.

http://www.amd.com/

http://www.amd.com/

http://www

amd.com/

139

Randal E. Bryant and David R. O'Hallaron, Computer systems: A programmer's perspective, Prentice Hall, 2003.

David Callahan, Steve Carr, and Ken Kennedy, Improving register allocation for subscripted variables, ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM Press, 1990, pp. 53-65.

J. Cohen and M. Roth, On the implementation of Strassen's fast multiplication algorithm, Acta Informatica 6 (1976), 341-355.

G. E. Collins and J. R. Johnson, Quantifier elimination and the sign variation method for real root isolation, International Symposium on Symbolic and Algebraic Computation, ACM Press, 1989, pp. 264-271.

G. E. Collins and R. Loos, Real zeros of polynomials, Computer Algebra: Symbolic and Algebraic Computation (B. Buchberger, G. E. Collins, and R. Loos, eds.), Springer-Verlag, 2nd ed., 1982, pp. 83-94.

G. E. Collins and R. G. K. Loos, Specifications and index of SAC-2 algorithms, Tech. Report WSI-90-4, Wilhelm-Schickard-Institut fur Informatik, Universitat Tubingen, 1990.

George E. Collins, The computing time of the Euclidean algorithm, SIAM Journal on Computing 3 (1974), no. 1, 1-10.

George E. Collins and Alkiviadis G. Akritas, Polynomial real root isolation using Descartes' rule of signs, Proceedings of the 1976 ACM Symposium on Symbolic and Algebraic Computation (R. D. Jenks, ed.), ACM Press, 1976, pp. 272-275.

George E. Collins et al., SACLIB User's Guide, Tech. Report 93-19, Research Institute for Symbolic Computation, RISC-Linz, Johannes Kepler University, A-4040 Linz, Austria, 1993.

J.W. Cooley and J.W. Tukey, An algorithm for the machine calculation of complex Fourier series, Mathematics of Computation 19 (1965), 297-301.

Intel Corporation, The Intel C+ + Compiler, http://www.intel.com/.

, A detailed look inside the Intel NetBurst micro-architecture of the Intel Pentium 4 processor, http://www.intel.com/, November 2000.

[26] , The IA-32 Intel Architecture Optimization: Reference Manual, h t t p : //www.intel.com/, 2004.

http://www.intel.com/



140

[27] , Intel Pentium D Processor 800 Sequence: Datasheet, 2006.

[28] D. Curry, Using C on the UNIX System, 1st ed., O'Reilly and Associates, Inc., 1989.

[29] K. Dowd and C. R. Severance, High performance computing, 2nd ed., O'Reilly and Associates, Inc., Sebastopol, CA, 1998.

[30] Jean-Guillaume Dumas, Pascal Giorgi, and Clement Pernet, FFPACK: Finite field linear algebra package, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2004, pp. 119-126.

[31] Arno Eigenwillig, Vikram Sharma, and Chee K. Yap, Almost tight recursion tree bounds for the Descartes method, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2006, pp. 71-78.

[32] I. Z. Emiris, B. Mourrain, and E. Tsigaridas, Real algebraic numbers: Complexity analysis and experimentations, Research Report 5897, INRIA, 2006.

[33] Gerald Farin, Curves and surfaces for computer aided geometric design, Academic Press, 1988.

[34] Richard Fateman, Comparing the speed of programs for sparse polynomial multiplication, ACM SIGSAM Bulletin 37 (2003), no. 1, 4-15.

[35] , Memory cache and Lisp: Faster list processing via automatically rearranging memory, ACM SIGSAM Bulletin 37 (2003), no. 4, 109-116.

[36] Akpodigha Filatei, Xin Li, Marc Moreno Maza, and Eric Schost, Implementation techniques for fast polynomial arithmetic in a high-level programming environment, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2006, pp. 93-100.

[37] M. Fowler, Yet another optimization article, IEEE Software 19 (2002), no. 3, 20-21.

[38] M. Frigo and S. G. Johnson, The design and implementation of FFTW3, Proceedings of the IEEE 93 (2005), no. 2, 216-231.

[39] Pierrick Gaudry, Assembly support for gmp on amd64, h t t p : //www. l o r i a . f r / ~gaudry/mpn_AMD64/.

[40] GNU Compiler Collection, h t tp : / /gcc .gnu .o rg / .

http://gcc.gnu.org/

141

[41] Jiirgen Gerhard, Modular algorithms in symbolic summation and symbolic integration, Lecture Notes in Computer Science, vol. 3218, Springer-Verlag, 2004.

, Personal communication, 2005.

Torbjorn Granlund, GNU MP: The GNU Multiple Precision Arithmetic Library, Swox AB, September 2004, Edition 4.1.4.

, GNU MP: The GNU Multiple Precision Arithmetic Library, Swox AB, March 2006, Edition 4.2.

Guillaume Hanrot, Fabrice Rouillier, Paul Zimmermann, and Sylvain Petitjean, Uspensky 's algorithm, http: //www. loria . f r/equipes/vegas/qi/usp/usp . c, 2004.

John L. Hennessy, David A. Patterson, and David Goldberg, Computer architecture: A quantitative approach, 3rd ed., Morgan Kaufmann, 2002.

Karin Hogstedt, Larry Carter, and Jeanne Ferrante, Determining the idle time of a tiling, ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, ACM Press, 1997, pp. 160-173.

, Selecting tile shape for minimal execution time, ACM Symposium on Parallel Algorithms and Architectures, ACM Press, 1999, pp. 201-211.

Tim Horel and Gary Lauterbach, UltraSPARC-Ill: Designing third-generation 64-bit performance, IEEE Micro 19 (1999), no. 3, 73-85.

Steven Huss-Lederman, Elaine M. Jacobson, Anna Tsao, Thomas Turnbull, and Jeremy R. Johnson, Implementation of Strassen's algorithm for matrix multiplication, Supercomputing '96: Proceedings of the 1996 ACM/IEEE conference on Supercomputing, IEEE Computer Society, 1996, p. 32.

Innovative Computing Laboratory, PAPI: Performance Application Programming Interface, h t tp : / / i c l . c s .u tk .edu /PAPI .

International Standards Organization, http://www.iso.org, ISO/IEC 14882:2003: Programming languages—C++, 2003.

M. Jimenez, J. M. Llaberia, A. Fernandez, and E. Morancho, A general algorithm for tiling the register level, International Conference on Supercomputing, ACM Press, 1998, pp. 133-140.

http://icl.cs.utk.edu/PAPI

http://www.iso.org

142

[54] Marta Jimenez, Jose M. Llaberia, and Agustin Fernandez, Register tiling in nonrectangular iteration spaces, ACM Transactions on Programming Languages and Systems 24 (2002), no. 4, 409-453.

[55] , A cost-effective implementation of multilevel tiling, IEEE Transactions on Parallel and Distributed Systems 14 (2003), no. 10, 1006-1020.

[56] J. R. Johnson, Algorithms for polynomial real root isolation, Technical research report OSU-CISRC-8/91-TR21, The Ohio State University, Department of Computer and Information Science, 1991.

[57] , Algorithms for polynomial real root isolation, Quantifier Elimination and Cylindrical Algebraic Decomposition (B. F. Caviness and J. R. Johnson, eds.), Springer-Verlag, 1998, pp. 269-299.

[58] Jeremy R. Johnson, Werner Krandick, Kevin Lynch, David G. Richardson, and Anatole D. Ruslanov, High-performance implementations of the Descartes method, International Symposium on Symbolic and Algebraic Computation (J.-G. Dumas, ed.), ACM Press, 2006, pp. 154-161.

[59] Jeremy R. Johnson, Werner Krandick, and Anatole D. Ruslanov, Architecture-aware classical Taylor shift by 1, International Symposium on Symbolic and Algebraic Computation (M. Kauers, ed.), ACM Press, 2005, pp. 200-207.

[60] A. Karatsuba and Yu Ofman, Multiplication of multidigit numbers on automata, Sov. Phys. Dokl. 7 (1962), 595-596.

[61] I. Kodukula, N. Ahmed, and K. Pingali, Data-centric multi-level blocking, ACM SIGPLAN Conference on Programming Language Design and Implementation, ACM Press, 1997, pp. 346-357.

[62] Werner Krandick, Isoherung reeller Nullstellen von Polynomen, Wis-senschaftliches Rechnen (J. Herzberger, ed.), Akademie Verlag, Berlin, 1995, pp. 105-154.

[63] Werner Krandick and Kurt Mehlhorn, New bounds for the Descartes method, Journal of Symbolic Computation 41 (2006), no. 1, 49-66.

[64] Jeffrey M. Lane and R. F. Riesenfeld, Bounds on a polynomial, BIT 21 (1981), no. 1, 112-117.

[65] Maplesoft, Maple 9: Learning guide, 2003.

143

[66] Robert T. Moenck, Practical fast polynomial multiplication, Proceedings of the 1976 ACM Symposium on Symbolic and Algebraic Computation, ACM Press, 1976, pp. 136-148.

[67] M. B. Monagan, K. O. Geddes, K. M. Heal, G. Labahn, S. M. Vorkoetter, J. Mc-Carron, and P. DeMarco, Maple 9: Advanced programming guide, Maplesoft, 2003.

[68] , Maple 9: Introductory programming guide, Maplesoft, 2003.

[69] Peter L. Montgomery, Five, six, and seven-term Karatsuba-like formulae, IEEE Transactions on Computers 54 (2005), no. 3, 899-908.

[70] J. Moura, M. Puschel, J. Dongarra, and D. Padua (eds.), Special issue on program generation, optimization, and adaptation, Proceedings of IEEE, vol. 93, February 2005.

[71] B. Mourrain, J. P. Pavone, P. Trebuchet, and E. Tsigaridas, SYNAPS: A library for symbolic-numeric computation, Software presentation. MEGA 2005, Sardinia, Italy, May 2005, h t tp : / /www-sop . in r i a . f r /ga laad / log ic ie l s / synaps/.

[72] B. Mourrain, M. N. Vrahatis, and J. C. Yakoubsohn, On the complexity of isolating real roots and computing with certainty the topological degree, Journal of Complexity 18 (2002), no. 2, 612-640.

[73] Bernard Mourrain, Fabrice Rouillier, and Marie-Frangoise Roy, The Bernstein basis and real root isolation, Combinatorial and Computational Geometry (J. E. Goodman, J. Pach, and E. Welzl, eds.), Mathematical Sciences Research Institute Publications, vol. 52, Cambridge University Press, 2005, pp. 459-478.

[74] A. M. Ostrowski, Note on Vincent's theorem, Annals of Mathematics, Second Series 52 (1950), no. 3, 702-707, Reprinted in: Alexander Ostrowski: Collected Mathematical Papers, vol. 1, Birkhauser Verlag, 1983, pages 728-733.

[75] M. S. Paterson and L. Stockmeyer, On the number of nonscalar multiplications necessary to evaluate polynomials, SIAM Journal on Computing 2 (1973), 60-66.

[76] M. Puschel, J. M. F. Moura, J. R. Johnson, D. Padua, M. Veloso, B. W. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo, SPIRAL: Code generation for DSP transforms, Proceedings of the

http://www-sop.inria.fr/galaad/logiciels/

144

IEEE, special issue on "Program Generation, Optimization, and Adaptation" 93 (2005), no. 2, 232-275.

[77] David G. Richardson and Werner Krandick, Compiler-enforced memory semantics in the SACLIB computer algebra library, International Workshop on Computer Algebra in Scientific Computing (V. G. Ganzha, E. W. Mayr, and E. V. Vorozhtsov, eds.), Lecture Notes in Computer Science, vol. 3718, Springer-Verlag, 2005, pp. 330-343.

[78] Fabrice Rouillier and Paul Zimmermann, Efficient isolation of a polynomial's real roots, Journal of Computational and Applied Mathematics 162 (2004), 33-50.

[79] David Saunders and Zhendong Wan, Smith normal form of dense integer matrices fast algorithms into practice, International Symposium on Symbolic and Algebraic Computation, ACM Press, 2004, pp. 274-281.

[80] A. Schonhage, A. F. W. Grotefeld, and E. Vetter, Fast algorithms, B.I. Wissenschaftsverlag, Mannheim, 1994.

[81] A. Schonhage and V. Strassen, Schnelle Multvplikation grosser Zahlen, Computing 7 (1971), 281-292.

[82] Victor Shoup, NTL: A Library for doing Number Theory, h t t p : //www. shoup. n e t / n t l .

[83] , A new polynomial factorization algorithm and its implementation, Journal of Symbolic Computation 20 (1995), no. 4, 363-397.

[84] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969), 354-356.

[85] Sun Microsystems, Sun Studio Collection, http://www.sun.com/.

[86] , UltraSPARC III Cu: User's manual, Ver. 2.2.1, http://www.sun. com/, 2004.

[87] J. V. Uspensky, Theory of equations, McGraw-Hill Book Company, Inc., 1948.

[88] Joachim von zur Gathen, Functional decomposition of polynomials: the tame case, Journal of Symbolic Computation 9 (1990), 281-299.

http://www.sun.com/

http://www.sun

145

[89] Joachim von zur Gathen and Jiirgen Gerhard, Fast algorithms for Taylor shifts and certain difference equations, International Symposium on Symbolic and Algebraic Computation (W. W. Kiichlin, ed.), ACM Press, 1997, pp. 40-47.

[90] , Modern computer algebra, 2nd ed., Cambridge University Press, 2003.

[91] Larry Wall, Tom Christiansen, and Jon Orwant, Programming perl, 3rd ed., O'Reilly, 2000.

[92] R. C. Whaley and A. Petitet, Minimizing development and maintenance costs in supporting persistently optimized BLAS, Software: Practice and Experience 35 (2005), no. 2, 101-121.

[93] R. C. Whaley, A. Petitet, and J. J. Dongarra, Automated empirical optimization of software and the ATLAS project, Parallel Computing 27 (2001), no. 1-2, 3 -35.

[94] K. Yotov, X. Li, G. Ren, M. Garzaran, D. Padua, K. Pingali, and P. Stodghill, Is search really necessary to generate high-performance BLAS?, Proceedings of the IEEE 93 (2005), no. 2, 358-386.

[95] K. Yotov, K. Pingali, and P. Stodghill, Automatic measurement of memory hierarchy parameters, SIGMETRICS '05: Proceedings of the 2005 ACM SIG-METRICS international conference on Measurement and modeling of computer systems (New York, NY, USA), ACM Press, 2005, pp. 181-192.

[96] , X-ray: A tool for automatic measurement of hardware parameters, Second International Conference on the Quantitative Evaluation of Systems 2005, IEEE Computer Society, 2005, pp. 168-177.

[97] Paul Zimmermann, Personal communication, 2006.

[98] Dan Zuras, More on multiplying and squaring large integers, IEEE Transactions on Computers 43 (1994), no. 8, 899-908.

146

Vita

Anatole D. Ruslanov was born in St. Petersburg, Russia. He emigrated to the

United States in 1979 and became a US citizen in 1986. He attended University

of Pennsylvania (BA in Mathematics) and Drexel University (M.S. and Ph.D. in

Computer Science). Dr. Ruslanov is currently an assistant professor of computer

science at SUNY Fredonia's Department of Computer and Information Sciences.

His research interests include algorithm engineering, high-performance computing,

automated performance tuning, computer architecture, performance analysis and

benchmarking, symbolic computation, and algorithms for VLSI design automation.

Architecture-aware Taylor shift by 1 - Drexel University1226/datastream... · Architecture-aware...

Documents

Transcript of Architecture-aware Taylor shift by 1 - Drexel University1226/datastream... · Architecture-aware...