Parallela Board with Epiphany Coprocessor

Martin Kruliš

6. 1. 2015 (v1.0)


About Adapteva

Adapteva Company
◦ Small fabless semiconductor company
◦ Founded in 2008
◦ Main objective is to design massively parallel chips with emphasis on power efficiency
  ◦ First company to design a chip that is expected to scale beyond 1000 cores
◦ Current products
  ◦ Epiphany processor (16-core and 64-core versions)
  ◦ Parallela board
◦ Parallela University Program started this year

Parallela Board

Board components:
◦ 16-core Epiphany coprocessor
◦ Zynq dual-core ARM Cortex-A9 (with integrated FPGA)
◦ 1GB SDRAM
◦ 1Gb Ethernet
◦ μUSB (two ports)
◦ μSD
◦ μHDMI
◦ Expansion slots


Parallela Architecture


Epiphany Coprocessor


Epiphany Architecture

Coprocessor
◦ 32-bit RISC cores with superscalar architecture
◦ 32KB local memory per core (1 cycle latency)
  ◦ Divided into four independent banks
◦ IEEE 754 compliant floating point instruction set
◦ Two DMA channels

eMesh (Network-on-Chip)
◦ Both on-chip and off-chip communication
◦ No specific API, works with memory transactions

eLink (Chip-to-Chip Links)
◦ 4 I/O ports for external communication


Coprocessor Cores
◦ Simple in-order RISC architecture
  ◦ Most instructions take 1 cycle
  ◦ 8-stage dual-issue pipeline
  ◦ Instruction set optimized for signal processing
◦ Separate integer and floating point ALU
◦ 64x 32-bit registers (for both IALU and FPU)
  ◦ Load-store architecture
  ◦ Register accesses per cycle: 3 reads/1 write for the FPU, 2 reads/1 write for the IALU, and 1 load/store
◦ Performance
  ◦ 16 cores ~ 2 GFLOPS each, 64 cores ~ 1.6 GFLOPS each



Memory Model
◦ Internal memory of each node is mapped into global memory
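The mapping can be sketched in a few lines of C. This is only an illustration under stated assumptions: addresses are 32 bits wide, the upper 12 bits hold the core ID (6-bit row, 6-bit column, as described on the eMesh Routing slide), and the remaining 20 bits are the offset inside that core's memory. The helper names `core_id` and `global_address` are hypothetical and are not part of the SDK.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; these are not SDK functions. */

/* Core ID: 6-bit row index followed by 6-bit column index. */
static uint32_t core_id(uint32_t row, uint32_t col)
{
    return ((row & 0x3Fu) << 6) | (col & 0x3Fu);
}

/* Global address: 12-bit core ID in the upper bits,
 * 20-bit local offset in the lower bits (assumes 32-bit addresses). */
static uint32_t global_address(uint32_t row, uint32_t col, uint32_t local_offset)
{
    return (core_id(row, col) << 20) | (local_offset & 0xFFFFFu);
}
```

The `core_id` expression is the same arithmetic as `(i + platform.row) * 64 + j + platform.col` in the host code example, written with shifts instead of a multiplication.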




Local Memory
◦ Divided into four banks with independent controllers
◦ Each clock cycle, each bank may perform:
  ◦ Send 64bit word to program sequencer
  ◦ Transfer 64bit word between memory and registers
  ◦ Receive 64bit word from eMesh interface
  ◦ Local DMA sends 64bit word to eMesh interface
◦ Memory order model
  ◦ Local reads and writes follow strong memory model
  ◦ Non-local transactions follow weak memory model
    ◦ Operations may not propagate in the same order
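To make the banking concrete, the sketch below maps a local offset to its bank. It assumes that the 32KB of local memory is split into four equal, contiguous 8KB banks; the slides state the bank count and the total size but not the layout, so the layout is an assumption. The names `bank_of` and `same_bank` are hypothetical.

```c
#include <stdint.h>

#define LOCAL_MEM_SIZE 0x8000u                        /* 32KB per core (from the slides) */
#define NUM_BANKS      4u                             /* four banks (from the slides) */
#define BANK_SIZE      (LOCAL_MEM_SIZE / NUM_BANKS)   /* 8KB; assumes equal contiguous banks */

/* Returns the bank that holds a local offset, or -1 if the offset is out of range. */
static int bank_of(uint32_t local_offset)
{
    if (local_offset >= LOCAL_MEM_SIZE)
        return -1;
    return (int)(local_offset / BANK_SIZE);
}

/* Returns 1 if both offsets are valid and fall into the same bank, 0 otherwise. */
static int same_bank(uint32_t a, uint32_t b)
{
    return bank_of(a) >= 0 && bank_of(a) == bank_of(b);
}
```

Under this layout, offset 0x6000 (used by the host code example) falls into the last bank. Since each bank has its own controller, keeping code and data at offsets for which `same_bank` returns 0 is one plausible way to let instruction fetches and data accesses proceed in the same cycle.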




eMesh
◦ 2D topology with nearest-neighbor connections
◦ 3 orthogonal (independent) meshes
  ◦ cMesh: on-chip write transactions (8B/cycle)
  ◦ xMesh: off-chip write transactions (1B/cycle)
  ◦ rMesh: read requests (1req/8cycles)
◦ Edge connections may be interfaced with other Epiphany chips
  ◦ Or other types of buses (off-core memory, IO ports, …)
◦ Significantly favors write operations over reads
  ◦ Write transactions are 16x faster



eMesh




eMesh Routing
◦ Upper 12 bits of the address are the address of the core
  ◦ 6 bits for the row index, 6 bits for the column index
◦ Each node uses a simple routing algorithm
◦ Nodes use round-robin arbitration to avoid deadlock
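A per-node routing decision of this kind can be sketched as follows. The slides do not spell out the algorithm, so this sketch assumes dimension-order routing that resolves the column first and the row second, with columns growing eastward and rows growing southward. It also assumes 32-bit addresses with the core ID in the upper 12 bits. All names (`route_t`, `route_step`, `hop_count`) are hypothetical.

```c
#include <stdint.h>

typedef enum { ROUTE_LOCAL, ROUTE_EAST, ROUTE_WEST, ROUTE_NORTH, ROUTE_SOUTH } route_t;

/* Destination coordinates taken from the upper 12 bits of a 32-bit address. */
static uint32_t dest_row(uint32_t addr) { return (addr >> 26) & 0x3Fu; }
static uint32_t dest_col(uint32_t addr) { return (addr >> 20) & 0x3Fu; }

/* One routing decision at node (row, col).
 * Assumption: the column is resolved first, then the row. */
static route_t route_step(uint32_t row, uint32_t col, uint32_t addr)
{
    uint32_t r = dest_row(addr), c = dest_col(addr);
    if (c > col) return ROUTE_EAST;
    if (c < col) return ROUTE_WEST;
    if (r > row) return ROUTE_SOUTH;
    if (r < row) return ROUTE_NORTH;
    return ROUTE_LOCAL;              /* the transaction has arrived */
}

/* Number of hops from node (row, col) to the node addressed by addr. */
static unsigned hop_count(uint32_t row, uint32_t col, uint32_t addr)
{
    unsigned hops = 0;
    for (;;) {
        switch (route_step(row, col, addr)) {
        case ROUTE_EAST:  ++col; break;
        case ROUTE_WEST:  --col; break;
        case ROUTE_SOUTH: ++row; break;
        case ROUTE_NORTH: --row; break;
        case ROUTE_LOCAL: return hops;
        }
        ++hops;
    }
}
```

Because each node only compares its own coordinates with the destination coordinates, no routing tables are needed, and the hop count equals the Manhattan distance between source and destination.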




DMA
◦ Two DMA channels per node
◦ 2D addressing awareness, flexible strides
◦ Local-external memory and external-external memory transfers
◦ Completion signaling by HW interrupt
◦ Master and slave modes
  ◦ Slave DMA is controlled by external IO or another DMA
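What "2D addressing with flexible strides" means can be shown with a software model of such a transfer. This is a sketch only: the descriptor below is hypothetical, its field names do not correspond to the hardware registers, and strides are expressed in elements measured from the start of each row, which is a simplification chosen for readability.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 2D transfer descriptor (not the hardware layout). */
typedef struct {
    size_t inner_count;       /* elements per row */
    size_t outer_count;       /* number of rows */
    size_t src_inner_stride;  /* elements between consecutive source elements */
    size_t src_outer_stride;  /* elements between consecutive source rows */
    size_t dst_inner_stride;  /* elements between consecutive destination elements */
    size_t dst_outer_stride;  /* elements between consecutive destination rows */
} dma2d_desc_t;

/* Software model of the transfer: copies outer_count rows of inner_count elements. */
static void dma2d_copy(const dma2d_desc_t *d, const uint32_t *src, uint32_t *dst)
{
    for (size_t r = 0; r < d->outer_count; ++r)
        for (size_t c = 0; c < d->inner_count; ++c)
            dst[r * d->dst_outer_stride + c * d->dst_inner_stride] =
                src[r * d->src_outer_stride + c * d->src_inner_stride];
}
```

For example, a descriptor `{2, 2, 1, 4, 1, 2}` applied to a pointer into a row-major 4x4 matrix packs a 2x2 tile into a contiguous buffer, which is the kind of transfer a tiled matrix multiplication needs.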




Programming

Epiphany SDK
◦ Separate compilation for host and coprocessor code
  ◦ Epiphany uses e-gcc and e-objcopy
◦ The host runtime provides a way to
  ◦ Detect the coprocessor
  ◦ Allocate memory, transfer data
  ◦ Execute precompiled binaries on the coprocessor

OpenCL
◦ The coprocessor is perceived as an OpenCL accelerator
◦ Each core is a compute unit, on-chip memory is local memory, …


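Host Code Example

    #include <unistd.h>    /* usleep() */
    #include <e-hal.h>     /* Epiphany host library */

    /* Excerpt from main(). The slide does not show the declarations, _BufSize,
       or the setup of emem; the types below are assumptions. */
    unsigned i, j, coreid, flag;
    e_platform_t platform;
    e_epiphany_t dev;
    e_mem_t emem;
    char emsg[_BufSize];

    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);

    /* Open a workgroup covering the whole chip, then load and start the binary. */
    e_open(&dev, 0, 0, platform.rows, platform.cols);
    e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);

    for (i = 0; i < platform.rows; ++i)
        for (j = 0; j < platform.cols; ++j) {
            /* Core ID: 6-bit row index followed by 6-bit column index. */
            coreid = (i + platform.row) * 64 + j + platform.col;
            usleep(100000);    /* wait for the core to produce its output */
            e_read(&emem, 0, 0, 0x0, emsg, _BufSize);           /* message in shared external memory */
            e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));    /* flag in the core's local memory */
            ...
        }

    e_close(&dev);
    e_finalize();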


Matrix Multiplication
◦ Using naïve algorithm
◦ Square matrices
◦ N is divisible by number of cores
  ◦ Each core computes its corresponding tile of the result matrix
◦ Both input matrices and output matrix fit the total amount of local memory
  ◦ A smart plan of the computations and the data transfers can be devised
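The tile decomposition can be sketched in plain C. This is a sequential stand-in, not coprocessor code: it assumes the cores form a square `grid` x `grid` mesh, that N is divisible by `grid`, and that all matrices are row-major arrays visible to every core. It does not model local memory or data transfers. The function names are hypothetical.

```c
#include <stddef.h>

/* Work of the core at (core_row, core_col): one tile of C = A * B,
 * computed with the naive triple loop. Matrices are n x n, row-major. */
static void compute_tile(size_t n, size_t grid, size_t core_row, size_t core_col,
                         const float *a, const float *b, float *c)
{
    size_t tile = n / grid;                /* tile edge; assumes n % grid == 0 */
    size_t row0 = core_row * tile;         /* first row of this core's tile */
    size_t col0 = core_col * tile;         /* first column of this core's tile */

    for (size_t i = row0; i < row0 + tile; ++i)
        for (size_t j = col0; j < col0 + tile; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
}

/* Sequential stand-in for the mesh: every "core" computes its own tile. */
static void matmul_tiled(size_t n, size_t grid, const float *a, const float *b, float *c)
{
    for (size_t r = 0; r < grid; ++r)
        for (size_t q = 0; q < grid; ++q)
            compute_tile(n, grid, r, q, a, b, c);
}
```

Each tile of the result depends on a full block row of A and a full block column of B, which is why the order in which input tiles reach each core matters once the data has to live in per-core local memory.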


Example


Matrix Multiplication

A tiles are rotated vertically in each column
B tiles are rotated horizontally in each row


Discussion
