MANYCORES – FROM HARDWARE PERSPECTIVE TO SOFTWARE

Presenter: D96943001, Graduate Institute of Electronics Engineering, 陳泓輝

Why Moore’s Law is dying

He is not CEO anymore!!
Walls => ILP, Frequency, Power, Memory walls

ILP – more cost, less return

ILP: instruction-level parallelism
OOO: out-of-order execution of micro-operations

Frequency wall

FO4 delay metric: delay of an inverter that is driven by an inverter ¼ its size and that drives another inverter 4x its size (fan-out of 4)

Saturated!

Freq ↑ => Some OP cycle counts ↑

Memory wall

External access penalty is increasing (the gap)
Solution => enlarge cache

Cache decides the performance and the price

It’s cache that matters!

The power wall

High power might imply
  Thermal runaway of device behavior
  Larger current => electromigration => reliability issues in the metal interconnect
  Hitting the packaging heat limit
    Change to high-cost packaging
    Cooling noise!!
    Form factor

The great wall……

CMOS

Moore’s Law

Multicore, Manycore

Historical - Intel 2007 Xeon

Dual on-chip memory controller => fcpu > 2*fmem
Point-to-point interconnection => fabrics

Multiple communication activities (c.f. “bus” => one activity)

Fabric working notation

AMD – Opteron(Shanghai)

Much the same as Intel Xeon
Shared L3 cache among 2 cores

Game consoles

XBox360 => Triple core (homogeneous)
PS3 => Cell, 8+1 cores (heterogeneous)

PowerPC wins!

State-of-the-art multicore DSP chips

TI TNETV3020 => Homogeneous
Freescale 8156 => Heterogeneous

State-of-the-art multicore DSP chips

picoChip PC205 => Heterogeneous
Tilera TILE64 => Homogeneous, Mesh

State-of-the-art multicore x86 chips

Intel Single-chip Cloud Computer
  24 “tiles” with two IA cores per tile
  A 24-router mesh network with 256 GB/s bisection bandwidth
  4 integrated DDR3 memory controllers
  Hardware support for message-passing !!
  1GHz Pentium

GPGPU - OpenCL

Special case: multicore video processor

Characteristics of video applications in consumer electronics
  High computational capability
  Low hardware cost
  Low power consumption

A general solution
  Fixed-function logic design

Challenges
  Multiple video decoding standards
  Video decoding standards keep being updated
  Ill-posed video processing algorithms
  Product requirements are diverse and mutually exclusive

Broadcom: mediaDSP technology

Heterogeneous (programmable and fixed-function units)
A task-based programming model
  A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements
A platform, easily extendable to support a range of applications
  Easy to customize for special purposes

Success stories
  SD MPEG video encoder including scaling and noise reduction
  Frame-rate-conversion video processing for FHD@60Hz/120Hz videos

Nickname: accelerator

Classes of video processing

Highly parallelizable operations on fixed-point data, no floating point
  => A processor with a SIMD data path engine

Ad-hoc computation and decision making, operating on smaller sets of data produced by the parallelizable processes
  => A general processor such as a RISC

Data movement and formatting on multidimensional pixels

Bit-serial processing for entropy decoding and encoding
  => Dedicated hardware does this job very efficiently

Task-based programming model

Programmers’ duties are as follows:
  Partition a sequential algorithm into a set of parallelizable tasks, then map them efficiently to the massively parallel architecture
    A task has a definite initiation time
    A task runs until completion with no interruption and no further synchronization with other tasks
  Understand the hardware architecture and its limitations
    Shared memory (instead of FIFO mode)
    Buffer size must be enough for a data unit
    Interconnect bandwidth must be enough
    Computational power must be enough for real time
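A minimal sketch of the run-to-completion idea above, in plain Python. The `Task` class, the `run_tasks` scheduler and the task names are hypothetical and only illustrate the model; they are not the mediaDSP programming interface.

```python
from collections import deque


class Task:
    """A unit of work that runs to completion once started (hypothetical model)."""

    def __init__(self, name, func, deps=()):
        self.name = name
        self.func = func
        self.deps = set(deps)  # names of tasks that must finish first


def run_tasks(tasks):
    """Start each task only after its dependencies are done; never preempt it."""
    done, order = set(), []
    pending = deque(tasks)
    stalled = 0
    while pending:
        task = pending.popleft()
        if task.deps <= done:
            task.func()  # runs until completion, no interruption
            done.add(task.name)
            order.append(task.name)
            stalled = 0
        else:
            pending.append(task)  # not ready yet, retry later
            stalled += 1
            if stalled > len(pending):
                raise RuntimeError("unsatisfiable dependencies")
    return order


results = []
tasks = [
    Task("scale", lambda: results.append("scale"), deps=["decode"]),
    Task("decode", lambda: results.append("decode")),
    Task("denoise", lambda: results.append("denoise"), deps=["scale"]),
]
order = run_tasks(tasks)
```

All synchronization happens before a task starts (its dependencies), which is what lets a task run with no further synchronization once it has been initiated.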

(IP) Platform-based architecture

Task-oriented engine (TOE)
  A programmable DSP or a fixed-function unit

Task control unit (TCU)
  A RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs
  Goal: maximize the utilization of TOEs

Control engine
Shared memory
Communication fabric

Memory architecture Memory hierarchy

L1 - Processor Instruction and Data Memory

L2 - On-chip Shared Memory

L3 - Off-chip

All TOEs use software-managed DMA rather than caches for their local storage
  6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller subblocks
  No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …}
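A rough sketch of what chunking a frame into subblocks could look like when local storage is filled by explicit copies instead of a cache. It uses only two of the six addressing dimensions (x, y); the block sizes, the `subblocks` and `process_frame` helpers and the "double the pixel value" computation are all assumptions made for illustration.

```python
def subblocks(width, height, block_w, block_h):
    """Yield (x, y, w, h) rectangles covering a width x height frame."""
    for y in range(0, height, block_h):
        for x in range(0, width, block_w):
            yield x, y, min(block_w, width - x), min(block_h, height - y)


def process_frame(frame, block_w, block_h):
    """Copy each subblock to a local buffer, process it, write it back."""
    height, width = len(frame), len(frame[0])
    out = [[0] * width for _ in range(height)]
    for x, y, w, h in subblocks(width, height, block_w, block_h):
        # Stand-in for "DMA in": bring the subblock into local storage.
        local = [row[x:x + w] for row in frame[y:y + h]]
        # Placeholder computation on the local copy only.
        local = [[min(255, 2 * p) for p in row] for row in local]
        # Stand-in for "DMA out": write the result back to the frame buffer.
        for dy, row in enumerate(local):
            out[y + dy][x:x + w] = row
    return out


frame = [[(x + y) % 200 for x in range(10)] for y in range(6)]
result = process_frame(frame, block_w=4, block_h=4)
```

The point of the sketch is that the programmer, not the hardware, decides which subblock is resident at any time.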

Broadcom BCM35421 chip [1/2]

Does motion-compensated frame-rate conversion
  Doubles the frame rate from FHD@60fps to FHD@120fps (to conquer motion blur)
  24fps => 60fps (de-judder)

Broadcom BCM35421 chip [2/2]

65nm CMOS process
mediaDSP runs at 400 MHz
106 million transistors
Two teraops of peak integer performance

Performance of DSPs for applications

DSP becomes useful when it can perform a minimum of 100 instructions per sample period

68% of the DSPs shipped in 2008 went into mobile handsets and base stations

Several thousand cycles are available for processing one input sample
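A back-of-the-envelope sketch of the cycle budget per sample. The 400 MHz clock is borrowed from the mediaDSP slide; the 48 kHz audio rate and the FHD@60Hz pixel rate are assumptions chosen for illustration, and one instruction per cycle is assumed.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Clock cycles available between two consecutive input samples."""
    return clock_hz // sample_rate_hz


CLOCK_HZ = 400_000_000                 # clock borrowed from the mediaDSP slide
AUDIO_RATE_HZ = 48_000                 # assumed audio sample rate
VIDEO_RATE_HZ = 1920 * 1080 * 60       # assumed pixel rate for FHD@60Hz

audio_budget = cycles_per_sample(CLOCK_HZ, AUDIO_RATE_HZ)
video_budget = cycles_per_sample(CLOCK_HZ, VIDEO_RATE_HZ)
```

Under these assumptions a single element has thousands of cycles per audio sample but only a few per video pixel, far below the 100-instruction threshold, which is one way to see why video pushes toward many parallel elements.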

Multiple elements

Increase in performance: multiple elements > higher-performance single elements

Go deeper – TI’s multicore

Multicore Programming Guide

Mapping application to multicore

Know the processing model options
Identify all the tasks
  Partition tasks into many small ones
  Become familiar with inter-task communication/data flow
  Combination/aggregation
  Mapping

Memory hierarchy => L1/L2/L3, private/shared memory, number/capability of external memory channels

DMA
Special-purpose hardware!!
  FFT, Viterbi, Reed-Solomon, AES codec, entropy codec

Parallel processing model

Very successful in communication systems
  Router
  Base station

Master/Slave model
Data flow model

Data movement

Shared memory
Dedicated memory

Transitional memory => ownership changes, content is not copied
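A small sketch of the "ownership changes, content is not copied" idea. The `Buffer` class and its core numbering are hypothetical; on a real device the ownership rule is a software convention between cores, not something Python objects enforce.

```python
class Buffer:
    """A shared buffer with exactly one owner at a time (hypothetical model)."""

    def __init__(self, size, owner):
        self.data = bytearray(size)
        self.owner = owner

    def write(self, core, offset, payload):
        if core != self.owner:
            raise PermissionError(f"core {core} does not own this buffer")
        self.data[offset:offset + len(payload)] = payload

    def hand_over(self, core, new_owner):
        """Transfer ownership; the bytes stay where they are."""
        if core != self.owner:
            raise PermissionError(f"core {core} does not own this buffer")
        self.owner = new_owner


buf = Buffer(16, owner=0)
storage_before = buf.data          # remember the underlying storage object
buf.write(0, 0, b"frame")          # core 0 produces the data
buf.hand_over(0, new_owner=1)      # only the ownership moves
buf.write(1, 5, b"!")              # core 1 continues in place
```

Handing over a buffer this way costs one notification instead of a copy proportional to the buffer size.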

Notification [1/4]

Direct signaling
  Create an event to the other core’s local interrupt controller
  The other core polls its local status
  Or the local interrupt controller converts this event into a real interrupt

Notification [2/4]

Indirect signaling
  Not directly controlled by software

Notification [3/4]

Atomic arbitration
  Hardware semaphore/mutex
    Semaphore => allows limited multiple access => example: multi-port SRAM/external DDR memory
    Mutex => allows one access only
  Use a software semaphore instead if the resource is shared only between processes running on one core
    The overhead of a hardware semaphore is not small
  It is only a facility for software use: hardware only guarantees the atomic operation, the locked content is not protected
    Cost and performance considerations
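A software analogue of the semaphore/mutex distinction, using Python threads in place of cores. The port count of 2 and the worker function are assumptions for illustration; this is not the hardware semaphore interface of any TI device.

```python
import threading

PORTS = 2                                # assumed number of memory ports
port_sem = threading.Semaphore(PORTS)    # semaphore: limited multiple access
stats_lock = threading.Lock()            # mutex: one access only
active = 0
max_active = 0
completed = 0


def worker():
    global active, max_active, completed
    with port_sem:                       # at most PORTS workers get past here
        with stats_lock:                 # counters are updated one at a time
            active += 1
            max_active = max(max_active, active)
        # ... the access to the shared resource would happen here ...
        with stats_lock:
            active -= 1
            completed += 1


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

As on the slide, the lock is only a convention: nothing stops code that skips `port_sem` from touching the resource, so every participant has to follow the protocol.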

Notification [4/4]

Left diagram is mutex

Just like the software counterpart

Data transfer engines

DMA => System DMA; local DMA belongs to a core
Ethernet
  Up to 32 MAC addresses
RapidIO
  Implemented with an ultra-fast serial IO physical layer
  May have multiple serial IO links, uni/bi-directional
  Examples of serial links
    USB 2.0 => 480 Mbit/sec
    USB 3.0 => 5 Gbit/sec
    Serial ATA
      1.0, Gen 1 => 1.5 Gbit/sec
      2.0, Gen 2 => 3 Gbit/sec
      3.0, Gen 3 => 6 Gbit/sec

High speed serial link

USB SATA

Memory management

Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced

Switched central resource fabric

Highlights [1/3]

A portion of the cache can be configured as memory-mapped SRAM
  Transparent cache => visible

Address aliasing => masking the most significant byte
  For core 0: 0x10800000 == 0x00800000
  For core 1: 0x11800000 == 0x00800000
  For core 2: 0x12800000 == 0x00800000

Special register DNUM for dynamic pointer address update
  => Write common ROM code that can still access each core’s private area
  Each core has its own DNUM

Implicit

Explicit
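A sketch of the address arithmetic implied by the three aliasing examples. It assumes the pattern (most significant byte = 0x10 + core number) extends to other cores; the helper names are hypothetical and the real memory map of a given device should be checked in its data manual.

```python
LOCAL_MASK = 0x00FFFFFF   # low 24 bits: the core-local (aliased) address
GLOBAL_BASE = 0x10        # most significant byte used for core 0 on the slide


def to_global(local_addr, dnum):
    """Build the global address of a core-local address for core `dnum`."""
    return ((GLOBAL_BASE + dnum) << 24) | (local_addr & LOCAL_MASK)


def to_local(global_addr):
    """Mask the most significant byte to recover the aliased local address."""
    return global_addr & LOCAL_MASK


# Common code can compute "my own" global address from the core number,
# which is the role the DNUM register plays on the slide.
my_buffers = [to_global(0x00800000, dnum) for dnum in range(3)]
```

The same code image therefore yields a different global pointer on each core without any per-core constants compiled in.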

Highlight [2/3]

The only coherency guaranteed by hardware
  L1D and L2 (core-locally)
  L1D, L2 and SL2 (if configured as memory-mapped SRAM) (core-locally)

Equal access to the external DDR2 SDRAM through SCR
  This may be the bottleneck for certain applications

Highlight [3/3]

If any portion of the L1s is configured as memory-mapped SRAM, there is a small paging engine built into the core (IDMA) that can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation
  Paging engine => MMU

IDMA may also be used to perform bulk peripheral configuration register access

DSP code and data image

Image types
  Single image
  Multiple images
  Multiple images with shared code and data
    A complex linking scheme should be used

Device boot

Debugging

Cuda => XXOO↑↑↓↓←→←→BA

TI’s offer

Hardware emulation => ICE, JTAG
  Basically, not intrusive

Software instrumentation
  Patching the original code to enable the same ability => this time, “Trace Logs”
  Basically, intrusive

Types of Trace Logs
  API call log, Statistics log, DMA transaction log, Event log, Customer data log

More on logs

Log information is stored in memory and pulled back to the host through the hardware emulation path

Tools are provided to correlate all the logs
  They display the logs in an organized manner
Log example:
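To make the correlation step concrete, here is a minimal sketch that merges per-core logs into one timeline. It is not TI's tool or log format: the `LogEntry` fields, the sample entries and the assumption of a common time base across cores are all invented for illustration.

```python
import heapq
from typing import NamedTuple


class LogEntry(NamedTuple):
    timestamp: int   # assumed common time base across cores
    core: int
    kind: str        # e.g. "api", "dma", "event"
    message: str


def correlate(*per_core_logs):
    """Merge logs that are each already sorted by timestamp."""
    return list(heapq.merge(*per_core_logs, key=lambda e: e.timestamp))


core0 = [LogEntry(10, 0, "api", "task_start"), LogEntry(40, 0, "dma", "xfer_done")]
core1 = [LogEntry(25, 1, "event", "irq"), LogEntry(30, 1, "api", "task_start")]
timeline = correlate(core0, core1)
```

Once the entries share one ordering, a display tool can show which DMA transfer or event on one core preceded an API call on another.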

Go deeper – Freescale’s manycore

Embedded Multicore: An Introduction

Why manycore?

Freescale MPC8641
  Single core => freq x 1.5 => power x 2
  Dual core => freq x 1.5 => power x 1.3

Bug in this Fig.

Memory system types

SMP + AMP + Sharing

Manycore enables multiple OSes to run concurrently
  Memory sharing => MMU
  Interface/peripheral sharing => hypervisor
Virtualization is good for legacy support

Review of single core

Manycore example [1/2]


Manycore example [2/2]


Highlights [1/2]

CoreNet fabric supports cache coherency across all cache layers

CoreNet fabric also supports software semaphores by extending the bit-test to guarantee atomic access between cores

CLASS is better suited for DSPs, as they tend to use less complex operating systems and the application software is more in control
  Silicon area of the fabric is reduced

If a core is configured as a software accelerator
  Some space of L1, L2 can be converted to memory-mapped SRAM on a per-way basis

Highlights [2/2]

While the MMU protects memory accesses, DMA could ruin everything => solution: “PAMU”
  PAMU is located at the connection between the non-core masters and the CoreNet fabric
  It is configured to map memory and to limit access windows, thereby increasing system stability

Cache stashing
  DMA between cache and external memory
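A toy model of the "limit access windows" half of the PAMU idea: every access from a non-core master is checked against the windows granted to that master. The class names, master ids and addresses are hypothetical, address mapping is left out, and this is not the real PAMU programming model.

```python
class AccessWindow:
    """Address range that a given non-core master (e.g. a DMA engine) may touch."""

    def __init__(self, base, size, writable=True):
        self.base, self.size, self.writable = base, size, writable

    def allows(self, addr, length, write):
        inside = self.base <= addr and addr + length <= self.base + self.size
        return inside and (self.writable or not write)


class WindowChecker:
    """Toy stand-in for a PAMU-like gatekeeper (illustration only)."""

    def __init__(self):
        self.windows = {}  # master id -> list of AccessWindow

    def grant(self, master, window):
        self.windows.setdefault(master, []).append(window)

    def check(self, master, addr, length, write=False):
        return any(w.allows(addr, length, write)
                   for w in self.windows.get(master, []))


pamu = WindowChecker()
pamu.grant("dma0", AccessWindow(0x8000_0000, 0x1000))
pamu.grant("eth0", AccessWindow(0x9000_0000, 0x4000, writable=False))
```

A misprogrammed DMA descriptor that runs past its window is rejected at the fabric boundary instead of silently overwriting another partition's memory.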

Deal with 10Gb Ethernet

Parsing network traffic up to layer 4

Assign packets to designated cores
  TCP port 80/22 to core 0
  ARP to core 1
  UDP to core 4~7

Queue Mgr and Buffer Mgr simplify driver codes
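A sketch of the packet-to-core assignment rules listed above, written as software for clarity (on the chip this classification is done by hardware). The dictionary packet format and the rule for spreading UDP over cores 4 to 7 by source port are assumptions; the slide only gives the core range.

```python
def assign_core(packet):
    """Return the core index for a parsed packet (a plain dict in this sketch)."""
    proto = packet["proto"]
    if proto == "ARP":
        return 1
    if proto == "TCP" and packet["dst_port"] in (80, 22):
        return 0
    if proto == "UDP":
        # Assumption: spread UDP flows over cores 4..7 by source port.
        return 4 + packet["src_port"] % 4
    return None  # no rule on the slide for other traffic


http = assign_core({"proto": "TCP", "dst_port": 80, "src_port": 40000})
ssh = assign_core({"proto": "TCP", "dst_port": 22, "src_port": 40001})
arp = assign_core({"proto": "ARP"})
udp = assign_core({"proto": "UDP", "dst_port": 53, "src_port": 5001})
other = assign_core({"proto": "TCP", "dst_port": 443, "src_port": 40002})
```

Keeping all packets of one flow on the same core avoids cross-core locking in the protocol stack, which is the usual motivation for this kind of steering.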

Debugging

JTAG

High-speed logging link
  Platinum version of “printk”

Why is JTAG slow?

It involves serial bit shifting

Conclusion

Hardware guys are crazy and unfriendly
  They are all “Postscript” geeks

However, they give the world a chance to process real-time sample data with thousands of cycles per sample => more things can be done in software

Besides data structures and algorithms, a good engineer now needs to know more

The more you know the hardware, the more it dances with you, or at least……

You can talk about it as common sense

Reference[1/2]

Sudhakar Yalamanchili, Georgia Institute of Technology, “Multicore Computing - Evolution”

(Broadcom) “Broadcom mediaDSP: A platform for building programmable multicore video processors,” IEEE Micro, Mar/April 2009.

(Texas Instruments, TI) Lina J. Karam, Ismail Alkamal, Alan Gatherer, Gene A. Frantz, David V. Anderson, and Brian L. Evans, “Trends in Multicore DSP Platforms,” IEEE Signal Processing Magazine, Nov. 2009.

Reference[2/2]

http://en.wikipedia.org/

(Tilera) Tile64 processor products, http://www.tilera.com/ (originated from MIT)

(Intel) Single-chip Cloud Computer

TI, “Multicore Programming Guide”

Freescale, “Embedded Multicore: An Introduction”

Presentation slides from lab members
