Parallella Board with Epiphany Coprocessor
by Martin Kruliš (v1.0)
6. 1. 2015

About Adapteva

Adapteva Company
  ◦ Small fabless semiconductor company
  ◦ Founded in 2008
  ◦ Main objective is to design massively parallel chips with emphasis on power efficiency
    - First company to design a chip that is expected to scale to over 1000 cores
  ◦ Current products
    - Epiphany processor (16-core and 64-core versions)
    - Parallella board
  ◦ Parallella University Program started this year

Parallella Board (labels from the board photo)
  ◦ 16-core Epiphany coprocessor
  ◦ Zynq dual-core ARM Cortex-A9 (with integrated FPGA)
  ◦ 1 GB SDRAM
  ◦ 1 Gb Ethernet
  ◦ 2x μUSB
  ◦ μSD
  ◦ μHDMI
  ◦ Expansion slots

Parallella Architecture (figure not preserved)

Epiphany Coprocessor (figure not preserved)

Epiphany Architecture

Coprocessor
  ◦ 32-bit RISC cores with superscalar architecture
  ◦ 32 KB local memory per core (1-cycle latency)
    - Divided into four independent banks
  ◦ IEEE 754 compliant floating-point instruction set
  ◦ Two DMA channels
eMesh (Network-on-Chip)
  ◦ Both on-chip and off-chip communication
  ◦ No specific API, works with memory transactions
eLink (Chip-to-Chip Links)
  ◦ 4 I/O ports for external communication

Coprocessor Cores
  ◦ Simple in-order RISC architecture
    - Most instructions take 1 cycle
    - 8-stage dual-issue pipeline
    - Instruction set optimized for signal processing
  ◦ Separate integer and floating-point ALUs
  ◦ 64 x 32-bit registers (used by both the IALU and the FPU)
    - Load/store architecture
    - Register accesses per cycle: 3 reads / 1 write by the FPU, 2 reads / 1 write by the IALU, 1 load/store
  ◦ Performance
    - 16-core chip: ~2 GFLOPS per core
    - 64-core chip: ~1.6 GFLOPS per core

Memory Model
  ◦ The internal memory of each node is mapped into the global memory (address space)
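
The mapping of per-node memory into one global space can be illustrated with a little address arithmetic. This is a minimal sketch, assuming 32-bit global addresses whose upper 12 bits hold the core coordinates (6-bit row, then 6-bit column, as the routing slide and the host code's `coreid` formula suggest) and whose lower 20 bits are the offset inside that core. The function and macro names are hypothetical and are not part of the Epiphany SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a 32-bit global address:
   [31:26] core row, [25:20] core column, [19:0] offset inside the core. */
#define ROW_SHIFT   26
#define COL_SHIFT   20
#define COORD_MASK  0x3Fu
#define OFFSET_MASK 0xFFFFFu

/* Build the global address of `offset` inside core (row, col). */
static uint32_t global_address(unsigned row, unsigned col, uint32_t offset)
{
    assert(row <= COORD_MASK && col <= COORD_MASK && offset <= OFFSET_MASK);
    return ((uint32_t)row << ROW_SHIFT) | ((uint32_t)col << COL_SHIFT) | offset;
}

/* Split a global address back into its three fields. */
static unsigned address_row(uint32_t addr)    { return (addr >> ROW_SHIFT) & COORD_MASK; }
static unsigned address_col(uint32_t addr)    { return (addr >> COL_SHIFT) & COORD_MASK; }
static uint32_t address_offset(uint32_t addr) { return addr & OFFSET_MASK; }
```

For example, under these assumptions offset 0x6000 in core (32, 8) has the global address 0x80806000, and any core could reach it with an ordinary load or store to that address.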

Local Memory
  ◦ Divided into four banks with independent controllers
  ◦ Each clock cycle each bank may perform:
    - Send a 64-bit word to the program sequencer
    - Transfer a 64-bit word between memory and registers
    - Receive a 64-bit word from the eMesh interface
    - Send a 64-bit word to the eMesh interface (local DMA)
  ◦ Memory order model
    - Local reads and writes follow a strong memory model
    - Non-local transactions follow a weak memory model (operations may not propagate in the same order)
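
Since the banks have independent controllers, it can be useful to know which bank a given local offset falls into, for instance to keep code and data buffers in different banks. A minimal sketch, assuming the 32 KB are split into four equal, contiguous 8 KB banks (the slides give the bank count but not the layout); `bank_of` and `same_bank` are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>

#define LOCAL_MEM_SIZE 0x8000u                        /* 32 KB per core (from the slides) */
#define BANK_COUNT     4u                             /* four banks (from the slides) */
#define BANK_SIZE      (LOCAL_MEM_SIZE / BANK_COUNT)  /* 8 KB each, assumed contiguous */

/* Index (0..3) of the bank that holds the given local offset. */
static unsigned bank_of(uint32_t local_offset)
{
    assert(local_offset < LOCAL_MEM_SIZE);
    return (unsigned)(local_offset / BANK_SIZE);
}

/* True if both offsets are served by the same bank controller. */
static int same_bank(uint32_t a, uint32_t b)
{
    return bank_of(a) == bank_of(b);
}
```

Under this assumed layout, the offset 0x6000 used in the host code example would be the start of the last bank.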

eMesh
  ◦ 2D topology with nearest-neighbor connections
  ◦ 3 orthogonal (independent) meshes
    - cMesh: on-chip write transactions (8 B/cycle)
    - xMesh: off-chip write transactions (1 B/cycle)
    - rMesh: read requests (1 request per 8 cycles)
  ◦ Edge connections may be interfaced with other Epiphany chips
    - Or with other types of buses (off-core memory, I/O ports, …)
  ◦ Significantly favors write operations over reads
    - Write transactions are 16x faster

eMesh (figure not preserved)

eMesh Routing
  ◦ The upper 12 bits of the address are the address of the core
    - 6 bits for the row index, 6 bits for the column index
  ◦ Each node uses a simple routing algorithm
  ◦ Nodes use round-robin arbitration to avoid deadlock
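
The transcript does not preserve the routing algorithm itself. The sketch below shows one simple scheme of the kind described: dimension-ordered routing, where each node compares its own coordinates with the destination's and forwards the transaction along the row until the column matches, then along the column. The column-first order and the orientation (columns grow eastward, rows grow southward) are assumptions; all names are hypothetical.

```c
#include <assert.h>

typedef enum { HOP_EAST, HOP_WEST, HOP_NORTH, HOP_SOUTH, HOP_LOCAL } hop_t;

/* Next hop at node (row, col) for a transaction addressed to (dst_row, dst_col).
   Assumed order: fix the column first, then the row. */
static hop_t next_hop(unsigned row, unsigned col, unsigned dst_row, unsigned dst_col)
{
    assert(row < 64 && col < 64 && dst_row < 64 && dst_col < 64); /* 6-bit coordinates */
    if (dst_col > col) return HOP_EAST;
    if (dst_col < col) return HOP_WEST;
    if (dst_row > row) return HOP_SOUTH;
    if (dst_row < row) return HOP_NORTH;
    return HOP_LOCAL;   /* arrived: deliver to this node */
}

/* Follow the route hop by hop and count the links traversed. */
static unsigned hop_count(unsigned row, unsigned col, unsigned dst_row, unsigned dst_col)
{
    unsigned hops = 0;
    for (;;) {
        switch (next_hop(row, col, dst_row, dst_col)) {
        case HOP_EAST:  ++col; break;
        case HOP_WEST:  --col; break;
        case HOP_SOUTH: ++row; break;
        case HOP_NORTH: --row; break;
        case HOP_LOCAL: return hops;
        }
        ++hops;
    }
}
```

Each node only needs its own coordinates and the destination bits of the address, so no routing tables are required.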

DMA
  ◦ Two DMA channels per node
  ◦ 2D addressing awareness, flexible strides
  ◦ Local-to-external and external-to-external memory transfers
  ◦ Completion signaled by a HW interrupt
  ◦ Master and slave modes
    - A slave DMA is controlled by external I/O or by another DMA
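
To make "2D addressing with flexible strides" concrete, here is a software model of what such a transfer does: an inner loop over the elements of a row and an outer loop over rows, each with its own source and destination stride. This is only an illustration; `dma2d_desc` is a hypothetical structure, not the e-lib DMA descriptor, and the hardware may define its strides differently (here they are byte distances between consecutive elements and between consecutive row starts).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 2D transfer description (all strides in bytes). */
typedef struct {
    size_t    elem_size;          /* bytes per element */
    size_t    inner_count;        /* elements per row */
    size_t    outer_count;        /* number of rows */
    ptrdiff_t src_inner_stride;   /* source: element to element */
    ptrdiff_t src_outer_stride;   /* source: row start to row start */
    ptrdiff_t dst_inner_stride;   /* destination: element to element */
    ptrdiff_t dst_outer_stride;   /* destination: row start to row start */
} dma2d_desc;

/* Perform the described copy in software. */
static void dma2d_copy(const dma2d_desc *d, const void *src, void *dst)
{
    assert(d != NULL && src != NULL && dst != NULL);
    const unsigned char *src_row = src;
    unsigned char *dst_row = dst;
    for (size_t r = 0; r < d->outer_count; ++r) {
        const unsigned char *s = src_row;
        unsigned char *t = dst_row;
        for (size_t c = 0; c < d->inner_count; ++c) {
            memcpy(t, s, d->elem_size);
            s += d->src_inner_stride;
            t += d->dst_inner_stride;
        }
        src_row += d->src_outer_stride;
        dst_row += d->dst_outer_stride;
    }
}
```

With a source row stride of one full matrix row and a destination row stride of one tile row, a single descriptor extracts a rectangular tile from a larger matrix into a compact buffer.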

Programming

Epiphany SDK
  ◦ Separate compilation for host and coprocessor code
    - Epiphany code is built with e-gcc and e-objcopy
  ◦ The host runtime provides a way to
    - Detect the coprocessor
    - Allocate memory, transfer data
    - Execute precompiled binaries on the coprocessor
OpenCL
  ◦ The coprocessor is perceived as an OpenCL accelerator
  ◦ Each core is a compute unit, on-chip memory is local memory, …
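
Host Code Example

    #include <unistd.h>
    #include <e-hal.h>

    #define _BufOffset 0x01000000  /* assumed offset of the shared buffer in external memory */
    #define _BufSize   128         /* assumed size of the message buffer */

    int main(void)
    {
        e_platform_t platform;
        e_epiphany_t dev;
        e_mem_t emem;              /* shared external-memory buffer, not declared on the slide */
        char emsg[_BufSize];
        unsigned i, j, coreid, flag;

        e_init(NULL);
        e_reset_system();
        e_get_platform_info(&platform);
        e_alloc(&emem, _BufOffset, _BufSize);  /* must precede e_read(&emem, ...) */
        e_open(&dev, 0, 0, platform.rows, platform.cols);
        e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE); /* E_TRUE: start cores */
        for (i = 0; i < platform.rows; ++i)
            for (j = 0; j < platform.cols; ++j) {
                coreid = (i + platform.row) * 64 + j + platform.col;  /* 6-bit row, 6-bit column */
                usleep(100000);                                       /* crude wait for the core */
                e_read(&emem, 0, 0, 0x0, emsg, _BufSize);             /* message from shared memory */
                e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));      /* flag from the core's local memory */
                ...
            }
        e_close(&dev);
        e_free(&emem);
        e_finalize();
        return 0;
    }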

Example

Matrix Multiplication
  ◦ Using the naïve algorithm
  ◦ Square matrices
  ◦ N is divisible by the number of cores
    - Each core computes its corresponding tile of the result matrix
  ◦ Both input matrices and the output matrix fit into the total amount of local memory
    - A smart plan of the computations and the data transfers can be devised
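
Whether the three matrices fit into local memory is simple arithmetic. The sketch below assumes single-precision floats, a square grid of cores (for example 4 x 4 for the 16-core chip), and a tile side of N divided by the grid side. `grid_side` and `reserved_bytes` (space set aside for code, stack and any transfer buffers) are hypothetical parameters, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define CORE_LOCAL_MEM (32u * 1024u)   /* 32 KB per core, from the slides */

/* Bytes one core needs for its tiles of A, B and C,
   each (n / grid_side) x (n / grid_side) floats. */
static size_t tile_bytes_per_core(unsigned n, unsigned grid_side)
{
    assert(grid_side > 0 && n % grid_side == 0);
    size_t tile_side = n / grid_side;
    return 3u * tile_side * tile_side * sizeof(float);
}

/* True if the three tiles fit next to `reserved_bytes` of other data. */
static int tiles_fit(unsigned n, unsigned grid_side, size_t reserved_bytes)
{
    assert(reserved_bytes <= CORE_LOCAL_MEM);
    return tile_bytes_per_core(n, grid_side) <= CORE_LOCAL_MEM - reserved_bytes;
}
```

For instance, with a 4 x 4 grid and N = 128, each core holds three 32 x 32 tiles, which is 12 KB and leaves room for code and buffers, while N = 256 would need 48 KB per core and does not fit.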

Matrix Multiplication (figure not preserved)
  ◦ A tiles are rotated vertically in each column
  ◦ B tiles are rotated horizontally in each row
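
The rotation scheme can be checked with a single-process simulation. The sketch assumes a Cannon-style schedule, which is a guess at what the figure showed: at step s, the core holding result tile (i, j) multiplies A tile (i, k) by B tile (k, j) with k = (i + j + s) mod GRID, so every core sees each k exactly once and no two cores in a row or column need the same input tile at the same step. The code does not move any data; it only tracks which tile indices meet on each core, and how those index shifts map to vertical or horizontal moves depends on the layout in the lost figure. `GRID` and `TILE` are illustrative values.

```c
#include <assert.h>

#define GRID 4                 /* cores per mesh side, 4 x 4 = 16 cores */
#define TILE 2                 /* rows/columns per tile (illustrative) */
#define N    (GRID * TILE)     /* matrix side */

/* C tile (ti, tj) += A tile (ti, tk) * B tile (tk, tj); row-major N x N matrices. */
static void tile_mac(const float *a, const float *b, float *c, int ti, int tk, int tj)
{
    assert(ti < GRID && tk < GRID && tj < GRID);
    for (int i = ti * TILE; i < (ti + 1) * TILE; ++i)
        for (int j = tj * TILE; j < (tj + 1) * TILE; ++j)
            for (int k = tk * TILE; k < (tk + 1) * TILE; ++k)
                c[i * N + j] += a[i * N + k] * b[k * N + j];
}

/* Simulate GRID rotation steps; core (row, col) owns result tile (row, col). */
static void rotated_matmul(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N * N; ++i)
        c[i] = 0.0f;
    for (int step = 0; step < GRID; ++step)
        for (int row = 0; row < GRID; ++row)
            for (int col = 0; col < GRID; ++col)
                tile_mac(a, b, c, row, (row + col + step) % GRID, col);
}

/* Reference: the naive triple loop over the whole matrices. */
static void naive_matmul(const float *a, const float *b, float *c)
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += a[i * N + k] * b[k * N + j];
            c[i * N + j] = sum;
        }
}
```

Comparing `rotated_matmul` against `naive_matmul` on small integer-valued inputs confirms that the schedule covers every partial product exactly once.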
Discussion