MANYCORES – FROM HARDWARE PERSPECTIVE TO SOFTWARE
Presenter: D96943001, Graduate Institute of Electronics Engineering, 陳泓輝
ILP – more cost, less return
ILP: instruction-level parallelism
OOO: out-of-order execution of micro-ops
Frequency wall
FO4 delay metric: delay of an inverter that is driven by an inverter ¼ its size and drives another inverter 4x its size (a fan-out of 4)
Saturated!
Freq ↑ => Some OP cycle counts ↑
Memory wall
External access penalty is increasing (the gap)
Solution => enlarge cache
Cache decides the performance and the price
The power wall
High power might imply:
  Thermal runaway of device behavior
  Larger current => electromigration => reliability issues in the metal connections
  Hitting the packaging heat limit
    Change to high-cost packaging
    Cooling noise!!
    Form factor
Historical - Intel 2007 Xeon
Dual on-chip memory controllers => fcpu > 2*fmem
Point-to-point interconnection => fabrics
  Multiple communication activities (c.f. “bus” => one activity)
Game consoles
Xbox 360 => triple core
PS3 => Cell, 8+1 cores
PowerPC wins!
Homogeneous
Heterogeneous
State-of-the-art multicore x86 chips
24 “tiles” with two IA cores per tile
A 24-router mesh network with 256 GB/s bisection bandwidth
4 integrated DDR3 memory controllers
Hardware support for message passing!!
Intel Single-chip Cloud Computer
1 GHz Pentium
Special case: multicore video processor
Characteristics of video applications in consumer electronics
  High computational capability
  Low hardware cost
  Low power consumption
A general solution
  Fixed-function logic designs
Challenges
  Multiple video decoding standards
  Video decoding standards keep being updated
  Ill-posed video processing algorithms
  Product requirements are diverse and mutually exclusive
Broadcom: mediaDSP technology
Heterogeneous (programmable and fixed-function units)
A task-based programming model
A uniform approach for managing tasks executing on different types of programmable and fixed-function processing elements
A platform that is easily extended to support a range of applications
Easy to customize for special purposes
Success stories
  SD MPEG video encoder including scaling and noise reduction
  Frame-rate-conversion video processing for FHD@60Hz/120Hz videos
Nickname: accelerator
Classes of video processing
Highly parallelizable operations on fixed-point data, with no floating point => a processor with a SIMD data-path engine
Ad-hoc computation and decision making, operating on smaller sets of data produced by the parallelizable processes => a general-purpose processor such as a RISC
Data movement and formatting of multidimensional pixel data
Bit-serial processing for entropy decoding and encoding => dedicated hardware does this job very efficiently
Task-based programming model
Programmers’ duties are as follows:
  Partition a sequential algorithm into a set of parallelizable tasks, then map it efficiently onto the massively parallel architecture
    A task has a definite initiation time
    A task runs until completion with no interruption and no further synchronization with other tasks
  Understand the hardware architecture and its limitations
    Shared memory (instead of FIFO mode)
    Buffer size must be large enough for a data unit
    Interconnect bandwidth must be sufficient
    Computational power must be sufficient for real time
(IP) Platform-based architecture
Task-oriented engine (TOE)
  A programmable DSP or a fixed-function unit
Task control unit (TCU)
  A RISC that maintains a queue of tasks and synchronizes with other TCUs/TOEs
  Goal: maximize the utilization of the TOEs
Control engine
Shared memory
Communication fabric
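The TCU's job of queueing tasks and releasing them to engines can be sketched in a few lines. This is only an illustration under stated assumptions: the names `Task` and `dispatch` are hypothetical, and the dependency model (a task starts once all of its predecessors have completed, then runs to completion) is one plausible reading of the task-based model, not Broadcom's actual design.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Task:
    name: str
    engine: str                              # TOE type that must run it (hypothetical labels)
    deps: set = field(default_factory=set)   # names of tasks that must finish first


def dispatch(tasks):
    """Return, per engine, the order in which tasks are started.

    A task is started only when all of its dependencies have completed,
    and is then treated as running to completion without further
    synchronization.
    """
    pending = deque(tasks)
    done = set()
    schedule = {}
    while pending:
        progressed = False
        # One pass over the queue: start every task whose inputs are ready.
        for _ in range(len(pending)):
            task = pending.popleft()
            if task.deps <= done:
                schedule.setdefault(task.engine, []).append(task.name)
                done.add(task.name)
                progressed = True
            else:
                pending.append(task)   # not ready yet, keep it queued
        if not progressed:
            raise ValueError("circular or unsatisfiable task dependencies")
    return schedule
```

For example, a hypothetical decode -> filter -> scale pipeline would place `decode` on the entropy engine and then `filter` and `scale`, in that order, on the SIMD engine.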
Memory architecture
Memory hierarchy
L1 - Processor Instruction and Data Memory
L2 - On-chip Shared Memory
L3 - Off-chip
All TOEs use software-managed DMA rather than caches for their local storage
6D addressing (x,y,t,Y,U,V) and the chunking of blocks into smaller sub-blocks
No {pre-fetching, early load scheduling, cache, speculative execution, multithreading …}
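Because local storage is filled by software-managed DMA, the programmer has to cut each pixel block into pieces that fit the local memory. The sketch below shows that chunking for the two spatial dimensions only; the function name `chunk_block` and the (x, y, width, height) descriptor layout are assumptions for illustration, and the temporal and Y/U/V dimensions of the 6D scheme are left out.

```python
def chunk_block(x0, y0, width, height, sub_w, sub_h):
    """Split a pixel block into sub-blocks small enough for local memory.

    Returns a list of (x, y, w, h) descriptors, one per DMA transfer.
    Edge sub-blocks are clipped so the block is covered exactly once.
    """
    chunks = []
    for y in range(y0, y0 + height, sub_h):
        for x in range(x0, x0 + width, sub_w):
            w = min(sub_w, x0 + width - x)
            h = min(sub_h, y0 + height - y)
            chunks.append((x, y, w, h))
    return chunks
```

A 40x16 block cut into 16x16 pieces yields two full sub-blocks and one clipped 8x16 sub-block at the right edge.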
Broadcom BCM35421 chip [1/2]
Performs motion-compensated frame-rate conversion
  Doubles the frame rate from FHD@60fps to FHD@120fps (to conquer motion blur)
  24fps => 60fps (de-judder)
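Frame-rate conversion has to synthesize output frames at time positions that fall between source frames. The following sketch only computes those positions, that is, which source frame precedes each output frame and how far along the way to the next one it lies; the motion-compensated interpolation itself is not shown. The function name `interpolation_phases` is hypothetical.

```python
from fractions import Fraction


def interpolation_phases(src_fps, dst_fps, n_out):
    """For each output frame, return (previous source frame index, phase).

    The phase is the fractional position between that source frame and
    the next one, in the range [0, 1). Exact fractions avoid rounding drift.
    """
    step = Fraction(src_fps, dst_fps)   # source frames advanced per output frame
    phases = []
    for k in range(n_out):
        pos = k * step
        idx = pos.numerator // pos.denominator   # floor of the position
        phases.append((idx, pos - idx))
    return phases
```

For 24fps => 60fps the phases repeat every five output frames as 0, 2/5, 4/5, 1/5, 3/5. Simple frame repetition (3:2 pulldown) would show source frames for unequal durations, which is the judder that interpolating at these phases is meant to remove. For 60fps => 120fps every second output frame sits at phase 1/2.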
Broadcom BCM35421 chip [2/2]
65nm CMOS process
mediaDSP runs at 400 MHz
106 million transistors
Two teraops of peak integer performance
Performance of DSPs for applications
A DSP becomes useful when it can perform a minimum of 100 instructions per sample period
68% of DSPs shipped in 2008 went to mobile handsets and base stations
Several K cycles for processing an input sample
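The per-sample budget is simply the clock rate divided by the sample rate. A small sketch of that arithmetic follows; the clock and sample rates used in the example are assumed values chosen for illustration, not figures from the slides, and one instruction per cycle is assumed by default.

```python
def cycles_per_sample(clock_hz, sample_rate_hz):
    """Clock cycles available between two consecutive input samples."""
    return clock_hz / sample_rate_hz


def meets_dsp_threshold(clock_hz, sample_rate_hz, instr_per_cycle=1.0, min_instr=100):
    """Check the rule of thumb of at least `min_instr` instructions per sample period."""
    budget = cycles_per_sample(clock_hz, sample_rate_hz) * instr_per_cycle
    return budget >= min_instr
```

With an assumed 1 GHz core and a 250 kHz sample stream, `cycles_per_sample(1_000_000_000, 250_000)` gives 4000 cycles per sample, which is in the "several K cycles" range.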
Mapping application to multicore
Know the processing model options
Identify all the tasks
  Partition tasks into many small ones
  Become familiar with inter-task communication/data flow
  Combination/aggregation
  Mapping
Memory hierarchy => L1/L2/L3, private/shared memory, number and capability of external memory channels
DMA
Special-purpose hardware!!
  FFT, Viterbi, Reed-Solomon, AES codec, entropy codec
Parallel processing model
Very successful in communication systems
  Router
  Base station
Master/slave model
Data flow
Data movement
Shared memory
Dedicated memory
Transitional memory => ownership changes, content is not copied
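The transitional-memory idea, where a buffer changes owner instead of being copied, can be sketched as follows. The class `TransitionalBuffer` and its methods are hypothetical; on real hardware the ownership rule is a software convention between cores, whereas here it is enforced by an explicit check purely for illustration.

```python
class TransitionalBuffer:
    """A buffer whose ownership moves between cores; the payload is never copied."""

    def __init__(self, size, owner):
        self.data = bytearray(size)
        self.owner = owner

    def view(self, core):
        """Give the owning core a zero-copy view of the payload."""
        if core != self.owner:
            raise PermissionError(f"core {core} does not own this buffer")
        return memoryview(self.data)

    def hand_over(self, from_core, to_core):
        """Transfer ownership; only the current owner may give the buffer away."""
        if from_core != self.owner:
            raise PermissionError(f"core {from_core} does not own this buffer")
        self.owner = to_core
```

After `hand_over(0, 1)`, core 1 sees exactly the bytes core 0 wrote, and core 0 may no longer touch the buffer.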
Notification [1/4]
Direct signaling
  Create an event at the other core’s local interrupt controller
  The other core polls its local status
  Or the local interrupt controller converts this event into a real interrupt
Notification [3/4]
Atomic arbitration
  Hardware semaphore/mutex
    Semaphore => allows limited multiple access => example: multi-port SRAM/external DDR memory
    Mutex => allows one access only
  Use a software semaphore instead if the resource is shared only between processes executing on one core
    The overhead of a hardware semaphore is not small
  It is only a facility for software use: hardware only guarantees the atomic operation, the locked content is not protected
    Cost and performance considerations
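The difference between a semaphore (a limited number of simultaneous accesses) and a mutex (one access only) can be demonstrated with a software analogue. The sketch below uses Python's standard `threading` primitives, not a hardware semaphore; the helper name `max_concurrent_holders` is hypothetical. As on the hardware, the guard only arbitrates entry: nothing stops code that ignores the guard from touching the shared resource.

```python
import threading
import time


def max_concurrent_holders(guard, workers=8, hold_s=0.01):
    """Run `workers` threads through `guard`; return the peak number inside at once."""
    inside = 0
    peak = 0
    bookkeeping = threading.Lock()   # protects the two counters only

    def worker():
        nonlocal inside, peak
        with guard:                  # acquire the semaphore or mutex
            with bookkeeping:
                inside += 1
                peak = max(peak, inside)
            time.sleep(hold_s)       # pretend to use the shared resource
            with bookkeeping:
                inside -= 1

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak
```

Passing `threading.Lock()` never reports more than one holder, while `threading.Semaphore(2)` admits up to two, which matches the dual-port memory example.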
Data transfer engines
DMA => system DMA; local DMA belongs to a core
Ethernet
  Up to 32 MAC addresses
RapidIO
  Implemented with an ultra-fast serial I/O physical layer
  May use multiple serial I/O links, uni/bi-directional
  Examples
    USB 2.0 => 480 Mbit/s
    USB 3.0 => 5 Gbit/s
    Serial ATA
      1.0, Gen 1 => 1.5 Gbit/s
      2.0, Gen 2 => 3 Gbit/s
      3.0, Gen 3 => 6 Gbit/s
Memory management
Devices do not support automated cache coherency among cores because of the power consumption involved and the latency overhead introduced
Switched central resource (SCR) fabric
Highlights [1/3]
A portion of the cache can be configured as memory-mapped SRAM
  Transparent cache => visible
Address aliasing => masking the MSByte
  For core 0: 0x10800000 == 0x00800000
  For core 1: 0x11800000 == 0x00800000
  For core 2: 0x12800000 == 0x00800000
Special register DNUM for dynamic pointer address update
  => Write common ROM code that can still access each core’s private area
  Each core has its own DNUM
Implicit
Explicit
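The aliasing rule from the three address pairs above can be written out as bit arithmetic: the top byte selects the core, and the low 24 bits are the offset inside that core's local window. This is a sketch inferred only from those example addresses; the constant and function names are hypothetical, and a real device's memory map should be taken from its data manual.

```python
LOCAL_MASK = 0x00FFFFFF   # low 24 bits: offset inside a core's local memory window
GLOBAL_BASE = 0x10        # top byte used for core 0 in the example addresses


def to_global(local_addr, dnum):
    """Build the globally visible address of a core-local address for core `dnum`."""
    return ((GLOBAL_BASE + dnum) << 24) | (local_addr & LOCAL_MASK)


def to_local(global_addr):
    """Mask off the most significant byte to get the aliased local address."""
    return global_addr & LOCAL_MASK


def owner_core(global_addr):
    """Recover which core's private memory a global address refers to."""
    return (global_addr >> 24) - GLOBAL_BASE
```

Common code can call `to_global(0x00800000, dnum)` with its own DNUM value when it needs an address that another master, such as a DMA engine, can use; inside the core the aliased local address is sufficient.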
Highlights [2/3]
The only coherency guaranteed by hardware
  L1D <=> L2 (core-locally)
  L1D <=> L2 <=> SL2 (if configured as memory-mapped SRAM) (core-locally)
Equal access to the external DDR2 SDRAM through the SCR
This may be the bottleneck for certain applications
Highlights [3/3]
If any portion of the L1s is configured as memory-mapped SRAM, a small paging engine built into the core (IDMA) can be used to transfer linear blocks of memory between L1 and L2 in the background of CPU operation
  Paging engine => MMU
IDMA may also be used to perform bulk peripheral configuration register access
DSP code and data image
Image types
  Single image
  Multiple images
  Multiple images with shared code and data
    A complex linking scheme should be used
Device boot
TI’s offer
Hardware emulation => ICE, JTAG
  Basically, not intrusive
Software instrumentation
  Patching the original code to enable the same ability => this time, “Trace Logs”
  Basically, intrusive
Types of trace logs
  API call log, statistics log, DMA transaction log, event log, customer data log
More on logs
Information is stored in memory and pulled back to the host through the hardware emulation path
A tool is provided to correlate all the logs
  Displays them in an organized manner
Log example:
Why manycore?
Freescale MPC8641
  Single core => freq x 1.5 => power x 2
  Dual core => freq x 1.5 => power x 1.3
Bug in this Fig.
SMP + AMP + Sharing
Manycore enables multiple OSes to run concurrently
  Memory sharing => MMU
  Interface/peripheral sharing => hypervisor
Virtualization is good for legacy support
Highlights [1/2]
CoreNet fabric supports cache coherency across all cache layers
CoreNet fabric also supports software semaphores by extending the bit-test to guarantee atomic access between cores
CLASS is better suited for DSPs, as they tend to use less complex operating systems and the application software is more in control
  Silicon area of the fabric is reduced
If a core is configured as a software accelerator
  Some of the L1 and L2 space can be converted to memory-mapped SRAM on a per-way basis
Highlights [2/2]
While the MMU protects memory access, DMA could ruin everything => solution: “PAMU”
  The PAMU is located at the connection between non-core masters and the CoreNet fabric; it is configured to map memory and to limit access windows, thereby increasing system stability
Cache stashing
  DMA between cache and external memory
Deal with 10Gb Ethernet
Parsing network traffic up to layer 4
Assign packets to designated cores
  TCP port 80/22 to core 0
  ARP to core 1
  UDP to core 4~7
Queue Mgr and Buffer Mgr simplify driver code
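The steering rules above can be sketched as a small classifier. Everything here is illustrative: the function name `assign_core` is hypothetical, and spreading UDP flows over cores 4 to 7 with a CRC32 hash of a flow key is an assumption, since the slide does not say how a core within that range is chosen. The EtherType and IP protocol numbers are the standard values.

```python
import zlib

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806
IPPROTO_TCP = 6
IPPROTO_UDP = 17


def assign_core(ethertype, ip_proto=None, dst_port=None, flow_key=b""):
    """Pick a destination core from parsed layer 2 to layer 4 header fields.

    Returns None for traffic the example rules do not cover.
    """
    if ethertype == ETHERTYPE_ARP:
        return 1
    if ethertype == ETHERTYPE_IPV4:
        if ip_proto == IPPROTO_TCP and dst_port in (80, 22):
            return 0
        if ip_proto == IPPROTO_UDP:
            # Assumed policy: hash the flow so one flow always lands on one core.
            return 4 + zlib.crc32(flow_key) % 4
    return None
```

Hashing on a per-flow key keeps packets of the same UDP flow in order on a single core while still spreading load across the four cores.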
Conclusion
Hardware guys are crazy and unfriendly
  They are all “Postscript” geeks
However, they give the world a chance to process real-time sample data with thousands of cycles per sample => more things can be done in software
Besides data structures and algorithms, a good engineer now needs to know more
The more you know the hardware, the more it dances with you, or at least……
Reference[1/2]
Sudhakar Yalamanchili, Georgia Institute of Technology, “Multicore Computing - Evolution”
(Broadcom) “Broadcom mediaDSP: A platform for building programmable multicore video processors,” IEEE Micro., Mar/April 2009.
(Texas Instruments, TI) Lina J. Karam, Ismail Alkamal, Alan Gatherer, Gene A. Frantz, David V. Anderson, and Brian L. Evans, “Trends in Multicore DSP Platforms,” IEEE Signal Processing Magazine, Nov. 2009. University Texas and Texas Instruments
Reference[2/2]
http://en.wikipedia.org
(Tilera) Tile64 processor products. http://www.tilera.com
  Originated from MIT
(Intel) Single-chip Cloud Computer
TI, “Multicore Programming Guide”
Freescale, “Embedded Multicore: An Introduction”
Presentation slides from lab member