Page 1: Parallela Board with Epiphany Coprocessor

Martin Kruliš

6. 1. 2015 (v1.0)

Page 2: About Adapteva

Adapteva Company
◦ Small fabless semiconductor company
◦ Founded in 2008
◦ Main objective is to design massively parallel chips with emphasis on power efficiency
    ◦ First company to design a chip that is expected to scale beyond 1000 cores
◦ Current products
    ◦ Epiphany processor (16-core and 64-core versions)
    ◦ Parallela board
◦ Parallela University Program started this year

Page 3: Parallela Board

◦ 16-core Epiphany coprocessor
◦ 1GB SDRAM
◦ μUSB
◦ 1Gb Ethernet
◦ μSD
◦ μHDMI
◦ μUSB
◦ Zynq dual-core ARM-A9 (with integrated FPGA)
◦ Expansion slots

Page 4: Parallela Architecture

Page 5: Epiphany Coprocessor

Page 6: Epiphany Architecture

Coprocessor
◦ 32-bit RISC cores with superscalar architecture
◦ 32KB local memory per core (1 cycle latency)
    ◦ Divided into four independent banks
◦ IEEE754 compliant floating point instruction set
◦ Two DMA channels

eMesh (Network-on-Chip)
◦ Both on-chip and off-chip communication
◦ No specific API, works with memory transactions

eLink (Chip-to-Chip Links)
◦ 4 I/O ports for external communication

Page 7: Epiphany Architecture

Coprocessor Cores
◦ Simple in-order RISC architecture
    ◦ Most instructions take 1 cycle
    ◦ 8-stage dual-issue pipeline
    ◦ Instruction set optimized for signal processing
◦ Separate integer and floating point ALU
◦ 64x 32-bit registers (for both IALU and FPU)
    ◦ Load-store architecture
    ◦ Per cycle: 3 read/1 write FPU accesses, 2 read/1 write IALU accesses, 1 load/store
◦ Performance
    ◦ 16 cores ~ 2 Gflops each, 64 cores ~ 1.6 Gflops each
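The per-core figures quoted on this slide multiply out to an aggregate peak for each chip. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the result is a theoretical peak, not a measured value.

```c
/* Aggregate peak throughput of a chip, computed from the per-core
   figure on the slide (hypothetical helper, not part of any SDK). */
double chip_peak_gflops(int cores, double gflops_per_core)
{
    return cores * gflops_per_core;
}
```

With the slide's numbers, `chip_peak_gflops(16, 2.0)` gives 32 Gflops for the 16-core chip, and the 64-core chip comes out at roughly 102 Gflops.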

Page 8: Epiphany Architecture

Memory Model
◦ Internal memory of each node is mapped into global memory
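To make the mapping concrete, here is a minimal sketch of how a global address could be composed from a core coordinate and a local offset. It assumes a 32-bit address whose upper 12 bits identify the core (6 bits row, then 6 bits column, as stated on the eMesh Routing slide), leaving 20 bits for the offset. The function names and the row-before-column bit order are assumptions made for illustration, not SDK API.

```c
#include <stdint.h>

/* Field layout assumed from the routing slide: 6-bit row, 6-bit column,
   and the remaining 20 bits of a 32-bit address as the local offset. */
enum { ROW_SHIFT = 26, COL_SHIFT = 20 };
#define COORD_MASK  0x3Fu
#define OFFSET_MASK 0xFFFFFu

/* Global address of a byte in the local memory of core (row, col). */
uint32_t global_addr(unsigned row, unsigned col, uint32_t offset)
{
    return ((uint32_t)(row & COORD_MASK) << ROW_SHIFT)
         | ((uint32_t)(col & COORD_MASK) << COL_SHIFT)
         | (offset & OFFSET_MASK);
}

/* Inverse helpers: split a global address back into its fields. */
unsigned addr_row(uint32_t addr)    { return (addr >> ROW_SHIFT) & COORD_MASK; }
unsigned addr_col(uint32_t addr)    { return (addr >> COL_SHIFT) & COORD_MASK; }
uint32_t addr_offset(uint32_t addr) { return addr & OFFSET_MASK; }
```

Under these assumptions, a core can reach another core's local memory simply by dereferencing such an address, which matches the "no specific API, works with memory transactions" remark on the architecture slide.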

Page 9: Epiphany Architecture

Local Memory
◦ Divided into four banks with independent controllers
◦ Each clock cycle each bank may perform:
    ◦ Send 64-bit word to program sequencer
    ◦ Transfer 64-bit word between memory and registers
    ◦ Receive 64-bit word from eMesh interface
    ◦ Local DMA sends 64-bit word to eMesh interface
◦ Memory order model
    ◦ Local reads and writes follow strong memory model
    ◦ Non-local transactions follow weak memory model
        ◦ Operations may not propagate in the same order
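Because the banks have independent controllers, it can matter which bank a buffer lands in. The sketch below computes the bank index of a local offset, assuming the 32KB of local memory is split into four equal, contiguous 8KB banks; the slides give only the total size and the bank count, so the equal split and the helper name are assumptions.

```c
#include <stdint.h>

#define LOCAL_MEM_SIZE 0x8000u                        /* 32 KB per core (from the slides) */
#define NUM_BANKS      4u                             /* four banks (from the slides) */
#define BANK_SIZE      (LOCAL_MEM_SIZE / NUM_BANKS)   /* 8 KB, ASSUMING equal banks */

/* Return the bank (0..3) that holds a local offset, or -1 if the
   offset lies outside the local memory. */
int bank_of(uint32_t offset)
{
    if (offset >= LOCAL_MEM_SIZE)
        return -1;
    return (int)(offset / BANK_SIZE);
}
```

Under this assumption, offset 0x6000 is the start of the last bank, so placing code in one bank and data buffers in others would let instruction fetch and data transfers avoid competing for the same controller.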

Page 10: Epiphany Architecture

eMesh
◦ 2D topology with nearest-neighbor connections
◦ 3 orthogonal (independent) meshes
    ◦ cMesh – on-chip write transactions (8B/cycle)
    ◦ xMesh – off-chip write transactions (1B/cycle)
    ◦ rMesh – read requests (1req/8cycles)
◦ Edge connections may be interfaced with other Epiphany chips
    ◦ Or other types of buses (off-core memory, IO ports, …)
◦ Significantly favors writes over reads
    ◦ Write transactions are 16x faster

Page 11: Epiphany Architecture

eMesh

Page 12: Epiphany Architecture

eMesh Routing
◦ Upper 12 bits of the address hold the address of the core
    ◦ 6 bits – row index, 6 bits – col index
◦ Each node uses a simple routing algorithm
◦ Nodes use round-robin arbitration to avoid deadlock
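The slide does not spell out the routing algorithm itself. One simple scheme that fits a 2D mesh addressed by (row, col) is dimension-order routing: each node compares its own coordinates with the destination and forwards the transaction one step closer. The sketch below assumes the column is corrected first and then the row, and that row indices grow southwards; both choices, and all names, are assumptions made for illustration.

```c
typedef enum { HOP_LOCAL, HOP_EAST, HOP_WEST, HOP_NORTH, HOP_SOUTH } hop_t;

/* One routing decision at node (row, col) for a transaction addressed
   to (dst_row, dst_col). ASSUMPTION: column first, then row. */
hop_t next_hop(unsigned row, unsigned col, unsigned dst_row, unsigned dst_col)
{
    if (dst_col > col) return HOP_EAST;
    if (dst_col < col) return HOP_WEST;
    if (dst_row > row) return HOP_SOUTH;
    if (dst_row < row) return HOP_NORTH;
    return HOP_LOCAL;               /* arrived: deliver to this node */
}

/* Number of hops a transaction takes, found by applying next_hop()
   repeatedly until it reaches the destination node. */
unsigned hop_count(unsigned row, unsigned col, unsigned dst_row, unsigned dst_col)
{
    unsigned hops = 0;
    for (;;) {
        switch (next_hop(row, col, dst_row, dst_col)) {
        case HOP_EAST:  ++col; break;
        case HOP_WEST:  --col; break;
        case HOP_SOUTH: ++row; break;
        case HOP_NORTH: --row; break;
        case HOP_LOCAL: return hops;
        }
        ++hops;
    }
}
```

With this scheme the hop count equals the Manhattan distance between the two nodes, and every node needs only its own coordinates to make the decision.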

Page 13: Epiphany Architecture

DMA
◦ Two DMA channels per node
◦ 2D addressing awareness, flexible strides
◦ Local-external memory and external-external memory transfers
◦ Completion signaling by HW interrupt
◦ Master and slave modes
    ◦ Slave DMA is controlled by external IO or another DMA
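"2D addressing awareness" means a single transfer can move a rectangular block whose rows are not contiguous in memory. The following is a plain software model of such a transfer, not the DMA descriptor format or any SDK call; the function name and the pitch-based parameters are assumptions chosen for illustration.

```c
#include <stddef.h>

/* Software model of a 2D strided transfer: copy `rows` rows of `cols`
   elements each. The pitches (distance between row starts, in elements)
   may differ between source and destination. */
void copy2d(float *dst, size_t dst_pitch,
            const float *src, size_t src_pitch,
            size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c)
            dst[r * dst_pitch + c] = src[r * src_pitch + c];
}
```

For example, `copy2d(tile, 2, &m[2 * 4 + 2], 4, 2, 2)` packs the lower-right 2x2 tile of a row-major 4x4 matrix `m` into a contiguous buffer `tile`, which is the kind of access pattern a tiled matrix multiplication needs.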

Page 14: Programming

Epiphany SDK
◦ Separate compilation for host and coprocessor code
    ◦ Epiphany uses e-gcc and e-objcopy
◦ The host runtime provides a way to
    ◦ Detect the coprocessor
    ◦ Allocate memory, transfer data
    ◦ Execute precompiled binaries on the coprocessor

OpenCL
◦ The coprocessor is perceived as an OpenCL accelerator
◦ Each core is a compute unit, on-chip memory is local memory, …

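Page 15: Programming

Host Code Example

#include <unistd.h>   /* usleep */
#include <e-hal.h>    /* Epiphany host library */

#define _BufOffset 0x01000000  /* assumed value; not shown on the slide */
#define _BufSize   128         /* assumed value; not shown on the slide */

int main(void)
{
    e_platform_t platform;
    e_epiphany_t dev;
    e_mem_t emem;                 /* shared external memory buffer */
    char emsg[_BufSize];
    unsigned i, j, coreid, flag;  /* types assumed; not declared on the slide */

    /* Initialize the system and query the platform. */
    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);

    /* Open all cores as one workgroup, then load and start the program. */
    e_alloc(&emem, _BufOffset, _BufSize);
    e_open(&dev, 0, 0, platform.rows, platform.cols);
    e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);

    for (i = 0; i < platform.rows; ++i)
        for (j = 0; j < platform.cols; ++j) {
            coreid = (i + platform.row) * 64 + j + platform.col;
            usleep(100000);
            e_read(&emem, 0, 0, 0x0, emsg, _BufSize);         /* message from shared memory */
            e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));  /* flag from the core's local memory */
            ...
        }

    e_close(&dev);
    e_free(&emem);
    e_finalize();
    return 0;
}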

Page 16: Example

Matrix Multiplication
◦ Using naïve algorithm
◦ Square matrices
◦ N is divisible by number of cores
    ◦ Each core computes its corresponding tile of the result matrix
◦ Both input matrices and the output matrix fit into the total amount of local memory
    ◦ A smart plan of the computations and the data transfers can be devised

Page 17: Example

Matrix Multiplication

A tiles are rotated vertically in each column

B tiles are rotated horizontally in each row
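The figure for this slide is not available, so the exact rotation plan cannot be reproduced. As a stand-in, the sketch below simulates the textbook Cannon's algorithm on a small grid of "cores" in ordinary C: every core keeps one tile of A, B and C, multiplies its local tiles, and then passes its A and B tiles to a neighbor. In this formulation A tiles travel along rows and B tiles along columns, which may be the reverse of the slide's wording if the slide lays the tiles out differently. The grid size, the tile size and all names are hypothetical.

```c
#include <string.h>

enum { GRID = 2, TILE = 2, DIM = GRID * TILE };  /* hypothetical 2x2 grid, 2x2 tiles */

/* Local memory of one simulated core: one tile each of A, B and C. */
typedef struct { float a[TILE][TILE], b[TILE][TILE], c[TILE][TILE]; } core_t;

/* c += a * b on the tiles currently held by one core. */
static void tile_mac(core_t *k)
{
    for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
            for (int x = 0; x < TILE; ++x)
                k->c[i][j] += k->a[i][x] * k->b[x][j];
}

/* C = A * B for row-major DIM x DIM matrices, using Cannon's rotation scheme. */
void cannon_matmul(const float *A, const float *B, float *C)
{
    static core_t cores[GRID][GRID], next[GRID][GRID];

    /* Skewed initial placement: core (r, c) starts with tiles A(r, k)
       and B(k, c), where k = (r + c) mod GRID. */
    for (int r = 0; r < GRID; ++r)
        for (int c = 0; c < GRID; ++c) {
            int k = (r + c) % GRID;
            for (int i = 0; i < TILE; ++i)
                for (int j = 0; j < TILE; ++j) {
                    cores[r][c].a[i][j] = A[(r * TILE + i) * DIM + (k * TILE + j)];
                    cores[r][c].b[i][j] = B[(k * TILE + i) * DIM + (c * TILE + j)];
                    cores[r][c].c[i][j] = 0.0f;
                }
        }

    for (int step = 0; step < GRID; ++step) {
        /* Every core multiplies the tiles it currently holds. */
        for (int r = 0; r < GRID; ++r)
            for (int c = 0; c < GRID; ++c)
                tile_mac(&cores[r][c]);

        /* Rotate: each core takes A from its right neighbor and B from
           the neighbor below, with wrap-around at the grid edges. */
        memcpy(next, cores, sizeof cores);
        for (int r = 0; r < GRID; ++r)
            for (int c = 0; c < GRID; ++c) {
                memcpy(next[r][c].a, cores[r][(c + 1) % GRID].a, sizeof next[r][c].a);
                memcpy(next[r][c].b, cores[(r + 1) % GRID][c].b, sizeof next[r][c].b);
            }
        memcpy(cores, next, sizeof cores);
    }

    /* Gather the result tiles back into C. */
    for (int r = 0; r < GRID; ++r)
        for (int c = 0; c < GRID; ++c)
            for (int i = 0; i < TILE; ++i)
                for (int j = 0; j < TILE; ++j)
                    C[(r * TILE + i) * DIM + (c * TILE + j)] = cores[r][c].c[i][j];
}
```

On real hardware each rotation step would be a write into the neighbor's local memory, which suits the eMesh's preference for writes over reads; here the rotation is modeled with `memcpy` through a scratch copy of the grid.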

Page 18: Discussion