
Page 1: Special Course on Computer Architecture 2008

Special Course on Computer Architecture 2008

Toolkit ver1.0

Special Course on Computer Architecture

This is based on the “Cell Speed Challenge 2008” in the IPSJ workshop SACSIS (Symposium on Advanced Computing Systems and Infrastructures).

Page 2: Special Course on Computer Architecture 2008


About this document

• Brief information about toolkit ver. 1.0
  • Tools for solving simultaneous linear equations with multiple SPEs
  • Please refer to the implementation guide for your homework
• This document explains the algorithm used to solve simultaneous linear equations

Page 3: Special Course on Computer Architecture 2008

Summary of homework
• Your task is to write a parallel program for solving simultaneous linear equations.
• You will compete on the performance of the program, which obtains the solution matrix x from the constant matrix A and the right-hand-side matrix b.
• A is a matrix of N×N elements (each element is a float: 4 bytes)
• x and b are matrices of M×N elements
• Contact: [email protected]



Page 4: Special Course on Computer Architecture 2008

Toolkit ver1.0
• Solves simultaneous linear equations using multiple SPEs
  – The number of SPEs can be changed. The default is 6 (the maximum).
• Limitation
  – The sizes of the matrices MUST be multiples of 32 (N=32n)
• Implement your program by modifying the function spe_soleqs() in spe1.c
  • Modifications elsewhere will be ignored in the evaluation
  • You are free to change any of the code inside spe_soleqs().


Page 5: Special Course on Computer Architecture 2008

Initial data distribution 1/2
• Matrices A, b, x are distributed in main memory as follows:

buf
• The head address of the working memory that is available to users
• It must be aligned to 128 bytes
• The constant matrix A (N×N×4 bytes) is stored at buf.

b = buf+N×N×sizeof(float)
• The head address of the region that stores the right-hand-side vectors b (M×N×4 bytes).
• Notice: elements are ordered in the column direction (data are stored in the order (0,0),…,(0,N-1),(1,0),…,(1,N-1),…,(M-1,N-1))

x = buf+(N×N+M×N)×sizeof(float)
• The head address of the solution vectors x (M×N×4 bytes)
• The ordering of the data is the same as for b
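The address arithmetic above can be sketched in C as follows. This is only an illustration, assuming buf is handled as a float pointer so that offsets are counted in elements rather than bytes. The type `layout` and the helpers `make_layout`, `rhs_index` and `is_aligned_128` are hypothetical names and are not part of the toolkit.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper type: the three regions inside the working memory. */
struct layout {
    float *A;  /* N x N constant matrix, starts at buf        */
    float *b;  /* M right-hand-side vectors of N elements each */
    float *x;  /* M solution vectors, same ordering as b       */
};

/* Derive the head addresses of A, b and x from buf (offsets in floats). */
static struct layout make_layout(float *buf, size_t N, size_t M)
{
    struct layout l;
    l.A = buf;
    l.b = buf + N * N;          /* buf + N*N*sizeof(float) bytes         */
    l.x = buf + N * N + M * N;  /* buf + (N*N + M*N)*sizeof(float) bytes */
    return l;
}

/* Index of element i of vector m inside b or x:
   (0,0),...,(0,N-1),(1,0),...,(M-1,N-1). */
static size_t rhs_index(size_t m, size_t i, size_t N)
{
    return m * N + i;
}

/* buf must be aligned to 128 bytes. */
static int is_aligned_128(const void *p)
{
    return ((uintptr_t)p % 128) == 0;
}
```

For example, with N=32 and M=2, `make_layout(buf, 32, 2).x` points 1088 floats (4352 bytes) past buf.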






Page 6: Special Course on Computer Architecture 2008

Initial data distribution 2/2
Main memory

Blank region
• Head address: buf+N×(N+2M)×sizeof(float)
• It is allocated by the PPE program, and you can use this region.
• Its size is the same as the total size of the matrices: N×(N+2M)×sizeof(float)

Mapped addresses for transfers between SPEs: ls_addr[5]
• No physical memory is allocated for them
• Each of them (ls_addr[0] ~ ls_addr[4]) is 256 KB
• You can transfer data to the local store of each SPE by accessing these regions directly.

• The memory allocation is kept below 80 MB
• That is, the total size of matrices A, b, and x is guaranteed to be less than half of the allocated memory
• N is a multiple of 32
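The size constraints above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it assumes that "80 MB" means 80×1024×1024 bytes and that the allocated memory consists of the matrices plus the equally sized blank region.

```c
#define FLOAT_BYTES  4ull                      /* each element is a 4-byte float */
#define BUDGET_BYTES (80ull * 1024ull * 1024ull) /* assumption: 80 MB = 80 MiB   */

/* Bytes occupied by A (N x N), b (M x N) and x (M x N) together. */
static unsigned long long matrices_bytes(unsigned long long N,
                                         unsigned long long M)
{
    return N * (N + 2ull * M) * FLOAT_BYTES;
}

/* Byte offset of the blank region from buf: it starts right after x. */
static unsigned long long blank_region_offset(unsigned long long N,
                                              unsigned long long M)
{
    return matrices_bytes(N, M);
}

/* Returns 1 if N is a non-zero multiple of 32 and the matrices plus the
   blank region (same size again) stay below the budget, otherwise 0. */
static int fits_memory_budget(unsigned long long N, unsigned long long M)
{
    if (N == 0 || N % 32 != 0)
        return 0;
    return 2ull * matrices_bytes(N, M) < BUDGET_BYTES;
}
```

Under these assumptions, N=1024 with M=1 needs about 8 MB in total and fits, while N=4096 with M=1 does not.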


Page 7: Special Course on Computer Architecture 2008









Ordering elements in matrices













• Notice: the ordering of elements in matrix A is not the same as in the other matrices (b and x)





Page 8: Special Course on Computer Architecture 2008

The algorithm adopted by the toolkit
1. LU decomposition
  • with pivoting
2. Forward substitution
3. Backward substitution
• Pivoting is always performed, regardless of the form and size of the matrices
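In matrix form, the three steps correspond to the standard factorization with row pivoting. The symbols P (permutation matrix recording the row swaps), L (unit lower triangular), U (upper triangular) and y (intermediate vector) are introduced here for illustration and are not names used by the toolkit. Each right-hand-side vector b is treated as a column vector:

```latex
\begin{aligned}
PA &= LU && \text{(1. LU decomposition with pivoting)}\\
Ly &= Pb && \text{(2. forward substitution)}\\
Ux &= y  && \text{(3. backward substitution)}
\end{aligned}
```

Combining the three lines gives $PAx = LUx = Ly = Pb$, hence $Ax = b$.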


Page 9: Special Course on Computer Architecture 2008

Subroutines for DMA transfer (1/2)
• Functions for DMA transfer
  • dmaget, dmaput: subroutines for DMA transfer provided by the toolkit
• void dmaget_burst(unsigned int ppe_addr, unsigned int spe_addr, unsigned int row, unsigned int col, unsigned int n)
  Reads 128 bytes around element (col, row) of the n×n matrix whose head address in main memory is ppe_addr (each element is a float), and stores the data in the LocalStore at head address spe_addr. The requested element can then be fetched with *(float*)(spe_addr+row%32*sizeof(float)).

[Figure: element (col, row) of an n×n matrix in PPE main memory, the 128-byte-aligned address that contains it, and the destination in the SPE LocalStore]

Please take care to identify the location of the element within the matrix.
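The indexing rule `row%32` can be illustrated with a host-side stand-in that uses memcpy in place of a real DMA transfer. This sketch rests on an assumption that the slides do not confirm: that element (col, row) of matrix A lives at float offset col×n+row from the head address, so that 32 consecutive rows of one column form one 128-byte block. The function names `burst_get_sim` and `burst_element` are hypothetical; check the toolkit source for the actual layout.

```c
#include <string.h>

#define BURST_FLOATS 32  /* 128 bytes / sizeof(float) */

/* Host-side stand-in for dmaget_burst (hypothetical, no real DMA).
   ASSUMPTION: element (col, row) is stored at float offset col*n + row,
   n is a multiple of 32 and the matrix itself is 128-byte aligned. */
static void burst_get_sim(const float *matrix, float *local,
                          unsigned int row, unsigned int col, unsigned int n)
{
    /* First row of the aligned 32-element block that contains `row`. */
    unsigned int first_row = row - row % BURST_FLOATS;
    memcpy(local, matrix + (size_t)col * n + first_row,
           BURST_FLOATS * sizeof(float));
}

/* Pick the requested element out of the fetched block. */
static float burst_element(const float *local, unsigned int row)
{
    return local[row % BURST_FLOATS];
}
```

After `burst_get_sim(A, ls, 37, 2, 64)`, the local buffer holds rows 32 to 63 of column 2, and `burst_element(ls, 37)` returns the element at index 5 of that block.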


Page 10: Special Course on Computer Architecture 2008

Subroutines for DMA transfer (2/2)
• float dmaget_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n)
  Reads one element, whose coordinates are given by (col, row), from an n×n matrix. The matrix is stored in main memory, and its beginning address is addr.
• void dmaput_value(unsigned int addr, unsigned int row, unsigned int col, unsigned int n, float value)
  Writes value to the element of an n×n matrix whose coordinates are given by (col, row). The matrix is stored in main memory, and its beginning address is addr. Note that the value is NOT synchronized among SPEs.

[Figure: element (col, row) of an n×n matrix in PPE main memory]

Page 11: Special Course on Computer Architecture 2008

Synchronization
• Subroutines for synchronization
• SPE0 is in charge of DMA synchronization among the SPEs.
• void sync(UINT32 id,                     // ID number of the SPE
            UINT32* ppe_ls,                // array of the main-memory addresses to which the LocalStore of each SPE is mapped
            volatile struct spe_sync* sd,  // array with local addresses in the SPE
            UINT32 key)                    // a key used for synchronization

In function sync …
• SPE0 writes the value key to the variable start_flag of the other SPEs, whose address is given by sp. The SPEs other than SPE0 start their calculation once start_flag == key becomes true.
• SPE1 ~ 5 write the value key to SPE0's variables (sd[id].end_flag). SPE0 regards the calculation of SPE1 ~ 5 as finished once end_flag == key becomes true for all of them.
• Users can set any value as the key, but beware of conflicts with other calls to the sync functions.
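The flag protocol described above can be modelled on a single host thread to make the roles of start_flag, end_flag and key concrete. This is a simplified sketch, not the toolkit's implementation: `struct sync_flags` is a hypothetical stand-in for struct spe_sync, all flags are kept in one array instead of being spread over the local stores, and the real code would spin on the flags and write them through the mapped addresses.

```c
#include <stdint.h>

#define NUM_SPE 6

/* Hypothetical stand-in for the toolkit's struct spe_sync. */
struct sync_flags {
    volatile uint32_t start_flag;  /* written by SPE0 to release a worker   */
    volatile uint32_t end_flag;    /* written by a worker to report to SPE0 */
};

/* SPE0 side: release SPE1..5 by writing key into their start_flag. */
static void master_release(struct sync_flags sd[NUM_SPE], uint32_t key)
{
    for (int id = 1; id < NUM_SPE; id++)
        sd[id].start_flag = key;
}

/* Worker side: a real worker would poll this until it returns 1. */
static int worker_may_start(const struct sync_flags *self, uint32_t key)
{
    return self->start_flag == key;
}

/* Worker side: report the end of this round to SPE0. */
static void worker_report(struct sync_flags sd[NUM_SPE], int id, uint32_t key)
{
    sd[id].end_flag = key;
}

/* SPE0 side: 1 once every worker has reported with this key. */
static int master_all_done(const struct sync_flags sd[NUM_SPE], uint32_t key)
{
    for (int id = 1; id < NUM_SPE; id++)
        if (sd[id].end_flag != key)
            return 0;
    return 1;
}
```

The model also shows why keys must not collide: if a later synchronization reused a key that is still stored in the flags, both tests would succeed immediately without any SPE having done its work.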


Page 12: Special Course on Computer Architecture 2008

LU decomposition
• The following procedures are repeated N times (i=0 ~ N-1)
1. Pivot selection (selection of the row with the largest element)
2. Row swapping
3. LU decomposition (right-looking method)

Partial matrices of the n×n matrix:
i=0    N×N
i=1    (N-1)×(N-1)
i=2    (N-2)×(N-2)
i=3    (N-3)×(N-3)
i=N-1  1×1


Page 13: Special Course on Computer Architecture 2008

1. Pivot selection
• Pivoting function: searches for the row with the maximum i-th value
• Parallel task using 6 SPEs
  • Each SPE reads the i-th value of each of its rows (using the “dmaget_value” function)
  • It reports the row number maxj with the maximum value to SPE0 (using the “sync_collect” functions)
  • SPE0 selects the row with the maximum value among all the SPEs.

[Figure: (n-i)×(n-i) partial matrix]
• Each SPE processes a row if (row number)%6 is equal to its own ID
• Each SPE reports its row number to SPE0
• SPE0 finds the row with the maximum i-th value
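The row assignment and the final selection by SPE0 can be sketched sequentially as below. This is an illustration under stated assumptions: the matrix is a plain row-major array in ordinary memory instead of being read with dmaget_value, the reduction is a simple loop instead of sync_collect, and values are compared by absolute value as is conventional for partial pivoting (the slides only say "maximum value"). The names `pivot_abs`, `local_pivot` and `select_pivot` are hypothetical.

```c
#define NUM_SPE 6

static float pivot_abs(float v)
{
    return v < 0.0f ? -v : v;
}

/* Candidate of one SPE: among rows r >= i with r % NUM_SPE == id, the row
   whose i-th value has the largest magnitude. Returns -1 if this SPE owns
   no row in the partial matrix. `a` is n x n, row-major (assumption). */
static int local_pivot(const float *a, int n, int i, int id)
{
    int best = -1;
    for (int r = i; r < n; r++) {
        if (r % NUM_SPE != id)
            continue;
        if (best < 0 || pivot_abs(a[r * n + i]) > pivot_abs(a[best * n + i]))
            best = r;
    }
    return best;
}

/* SPE0's part: pick the best row among the candidates of all SPEs. */
static int select_pivot(const float *a, int n, int i)
{
    int maxj = -1;
    for (int id = 0; id < NUM_SPE; id++) {
        int cand = local_pivot(a, n, i, id);
        if (cand < 0)
            continue;
        if (maxj < 0 || pivot_abs(a[cand * n + i]) > pivot_abs(a[maxj * n + i]))
            maxj = cand;
    }
    return maxj;
}
```

Because each SPE only scans about one sixth of the rows, the search time per step shrinks accordingly, at the cost of one collection step on SPE0.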


Page 14: Special Course on Computer Architecture 2008

2. Swapping of rows & columns
• “swap_row” function
  • Each SPE swaps the rows indicated in the arguments
  • Swaps the i-th row of matrix A with the “maxj” row
  • 32 elements are swapped at once (dmaget, dmaput)

[Figure: matrix (n×n) with the i-th row and the “maxj” row]

• “swap_col” function
  • Swaps the i-th column of matrix b with the “maxj” column


Page 15: Special Course on Computer Architecture 2008

3. LU decomposition with Right Looking Method
• lu_decomposition
  • Allots the partial matrix to multiple SPEs in units of rows.
  • Same assignment procedure as in pivot selection

1. The element (R1, R2) is stored in the variable diag (dmaget_value)
2. The elements of row R1 are stored in buf2, beginning from the second element of row R1 (for SPE0 ~ 5)
3. The value t1, the quotient diag/element (i, row), is written back
4. The elements of the i-th row are stored in buf1, beginning from the second element of the i-th row.
5. buf1 - buf2×t1 is calculated for each element and written back to the i-th row.
6. Procedures 2, 4, and 5 are repeated until the last row is reached. (Use buf3 when needed.)

[Figure: buffers buf1 and buf2]

Procedures 1-3 (pivot selection, swapping, and LU decomposition) must be repeated N times to decompose matrix A.
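For reference, the three procedures can be put together in a sequential, in-place version. This is a textbook sketch rather than the toolkit's code: it works on a row-major array in ordinary memory, ignores the distribution over SPEs and the DMA buffers, and uses the conventional multiplier t1 = (element below the pivot)/diag. The names `lu_abs`, `lu_decompose` and `perm` are hypothetical.

```c
static float lu_abs(float v)
{
    return v < 0.0f ? -v : v;
}

/* In-place LU decomposition with partial pivoting (right-looking).
   a: n x n, row-major. On return the strict lower triangle holds the
   multipliers (L without its unit diagonal) and the rest holds U.
   perm[i] is the row that was swapped with row i in step i.
   Returns 0 on success, -1 if a zero pivot is met (singular matrix). */
static int lu_decompose(float *a, int n, int *perm)
{
    for (int i = 0; i < n; i++) {
        /* 1. Pivot selection in column i of the partial matrix. */
        int maxj = i;
        for (int r = i + 1; r < n; r++)
            if (lu_abs(a[r * n + i]) > lu_abs(a[maxj * n + i]))
                maxj = r;
        if (a[maxj * n + i] == 0.0f)
            return -1;
        perm[i] = maxj;

        /* 2. Row swapping. */
        if (maxj != i) {
            for (int c = 0; c < n; c++) {
                float t = a[i * n + c];
                a[i * n + c] = a[maxj * n + c];
                a[maxj * n + c] = t;
            }
        }

        /* 3. Update the trailing (n-i-1) x (n-i-1) partial matrix. */
        float diag = a[i * n + i];
        for (int r = i + 1; r < n; r++) {
            float t1 = a[r * n + i] / diag;
            a[r * n + i] = t1;                  /* keep the multiplier */
            for (int c = i + 1; c < n; c++)
                a[r * n + c] -= a[i * n + c] * t1;
        }
    }
    return 0;
}
```

In a parallel version, the loop over r in step 3 is the part that would be divided among the SPEs by rows, since each row update is independent of the others.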



Page 16: Special Course on Computer Architecture 2008

Forward and backward substitution
• forward_substitution & backward_substitution functions
  • Refer to the source code for details
• The work is divided among the SPEs by solution vector
  • When the number of solution vectors is less than 6, some SPEs may not do any work in these functions
• forward_substitution uses the “blank region” to store intermediate data
• The result of backward substitution is written to x in main memory
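Since the slides defer to the source code here, the following is only a generic sketch of the two substitutions, not the toolkit's functions. It assumes the factors are stored in one row-major array (multipliers below the diagonal, U on and above it), that each right-hand-side vector has already been permuted in the same way as the rows of A, and that vectors are stored one after another. The hypothetical names carry a `_ref` suffix to avoid confusion with the toolkit's forward_substitution and backward_substitution.

```c
/* Solve L*y = b in place; L is unit lower triangular, stored below the
   diagonal of lu (n x n, row-major). On return b holds y. */
static void forward_substitution_ref(const float *lu, int n, float *b)
{
    for (int r = 1; r < n; r++)
        for (int c = 0; c < r; c++)
            b[r] -= lu[r * n + c] * b[c];
}

/* Solve U*x = y in place; U is stored on and above the diagonal of lu.
   On return y holds x. */
static void backward_substitution_ref(const float *lu, int n, float *y)
{
    for (int r = n - 1; r >= 0; r--) {
        for (int c = r + 1; c < n; c++)
            y[r] -= lu[r * n + c] * y[c];
        y[r] /= lu[r * n + r];
    }
}

/* Solve for M right-hand sides stored back to back (vector m starts at
   b + m*n). Each vector is independent, so vector m could be handled by
   SPE m % 6; with M < 6 some SPEs would stay idle. */
static void solve_all_ref(const float *lu, int n, float *b, int M)
{
    for (int m = 0; m < M; m++) {
        forward_substitution_ref(lu, n, b + m * n);
        backward_substitution_ref(lu, n, b + m * n);
    }
}
```

For instance, with the factors {4, 3, 0.5, -0.5} of the row-swapped matrix [[4, 3], [2, 1]] and the permuted right-hand side (10, 4), the two passes yield y = (10, -1) and then x = (1, 2).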


