Multicore Programming Faraday

NDA required

TCI6487 Multi-core Programming

China HPMP, May 2009

Agenda

• Multi-core chips high level overview
• Multi-core programming
  – Memory consideration
  – Inter-core communication
  – Multi-core arbitration
  – Peripherals consideration
• Image creation

Faraday: High Level View

• Three C64x+™ DSP cores @ 1+ GHz
  – 16/32-bit ISA, doubled MPY vs. C64x core
  – RSA instruction set extension for CR processing (downlink & uplink)
  – 65 nm process
• Memory
  – 32 kB L1 program memory
  – 32 kB L1 data memory
  – 3 MB of total L2 memory (2 configurations): 1 MB / 1 MB / 1 MB or 1.5 MB / 1 MB / 0.5 MB
  – Boot ROM
  – DDR2-667 MHz, 32-bit
• Communications subsystem
  – 2x sRIO (1x links)
  – SGMII Gigabit Ethernet
  – Antenna interface supporting OBSAI / CPRI, 6 links
• Acceleration
  – VCP2, TCP2
  – Receive accelerator (RAC)
• 561 balls, 23x23 mm FC-BGA
  – 5 rows + 11x11 center array
• Others
  – IP security, lead-free and green

[Block diagram labels: EDMA3.0 with switch fabric; three C64x+ cores, each with L1 Prog, L1 Data, L2 memory, and RSA; VCP2; TCP2; RAC; antenna interface; McBSP; 10/100/1G Ethernet; sRIO; DDR-2 IF; GPIO; I2C; PLL; timers; boot ROM; others.]

Programming Considerations

• Programming model
  – Shared image: the programmer needs to determine whether aliased addressing is appropriate. If so, the code still needs to assign pointers to memory using the global address for any data transfers (aside from internal DMA performed within a single core's memory).
  – Non-shared images: only global addresses should be used. There is no advantage to aliased addressing.
• Resource allocation: shared resources must be partitioned or arbitrated for.
  – Multi-channel peripherals can be split among the cores for concurrent, orthogonal control: EDMA channels, EDMA events, Ethernet MAC TX/RX data flow, RapidIO TX/RX/LSU data flow.
  – Single-channel peripherals ought to be controlled by a single master, servicing the other cores if needed: Timer64.
• System-level prioritization. A user-specified priority may be assigned to:
  – Any cache-miss or non-cacheable access by any of the CPUs
  – Any EDMA transfer
  – Any Serial RapidIO transfer
  – Any Ethernet transfer
• Inter-core communication
  – Discrete events: INTGEN peripheral
  – Message passing: direct writes to memory, or DMA transfer. Can implement a polling or interrupt-driven protocol (DSP BIOS MSGQ available).

Core Local Memory Map

• For each core, L1/L2 memories have two entries in the memory map.
  – Global addresses: accessible to all masters in the chip
  – Local (aliased) addresses: accessible only to the local core and IDMA
    • The eight most significant bits are masked to zero, e.g. 0x10800000 and 0x00800000 are the same memory for core 0.
    • Allows common code to be run unmodified on multiple cores
    • Not beneficial for un-shared code
• Each core has a private configuration space
  – Local core control registers (cache, TSC, IDMA, INTC) are not visible to other masters in the chip.
• Core number
  – Software can verify the core on which it is running through a register (DNUM) that holds the DSP core number (0, 1, or 2).
  – The core number can be used during run-time to conditionally execute code, update pointers, create a global address, etc.
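As an illustration of building a global address from a local alias and the core number, here is a minimal sketch in portable C. The slide only gives the core 0 example (0x00800000 and 0x10800000); the rule that core n uses the top byte 0x10 + n is an assumption extrapolated from that example, and the names `local_to_global`, `global_to_local`, and `GLOBAL_BASE_PREFIX` are hypothetical. On the target, `core_num` would come from the DNUM register; here it is a plain parameter so the logic can be checked on a host.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES          3u     /* DNUM holds 0, 1, or 2 */
#define GLOBAL_BASE_PREFIX 0x10u  /* ASSUMED top byte of core 0's global addresses */

/* Nonzero if the eight most significant bits are zero (a local alias). */
int is_local_alias(uint32_t addr)
{
    return (addr >> 24) == 0u;
}

/* Build the global address for a local (aliased) L1/L2 address.
   ASSUMPTION: core n's global addresses use the top byte 0x10 + n. */
uint32_t local_to_global(uint32_t local_addr, uint32_t core_num)
{
    assert(core_num < NUM_CORES);
    assert(is_local_alias(local_addr));
    return ((GLOBAL_BASE_PREFIX + core_num) << 24) | local_addr;
}

/* Mask the eight most significant bits to zero to get the local alias. */
uint32_t global_to_local(uint32_t global_addr)
{
    return global_addr & 0x00FFFFFFu;
}
```

Under these assumptions, a shared image could keep its pointers as local aliases and convert them with `local_to_global(ptr, DNUM)` only when handing a buffer to another master such as EDMA.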

Basic Techniques for Multi-core DSP

• Inter-core interrupts
  – Cooperation between cores
• EDMA
  – Main inter-core data transaction engine

• Shared memory

* Blue parts are necessary for multi-core DSP

Inter-core data transaction

• Discrete events: INTGEN peripheral
• Message passing: direct writes to memory, or DMA transfer. Can implement a polling or interrupt-driven protocol (DSP BIOS MSGQ available).
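A polling protocol built on direct writes to memory could look like the following sketch: a one-slot mailbox with a single sender and a single receiver. This is not the DSP BIOS MSGQ API; `mailbox_t` and its functions are hypothetical names, and C11 atomics stand in for whatever ordering and cache-coherence handling the real device needs. On the DSP the mailbox would have to live at a global address visible to both cores.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-slot mailbox: exactly one sending core and one receiving core. */
typedef struct {
    atomic_uint full;     /* 0 = empty, 1 = message waiting */
    uint32_t    payload;  /* message body, valid only while full == 1 */
} mailbox_t;

void mailbox_init(mailbox_t *mb)
{
    assert(mb != NULL);
    atomic_init(&mb->full, 0u);
    mb->payload = 0u;
}

/* Sender side: returns false if the previous message is still unread. */
bool mailbox_try_send(mailbox_t *mb, uint32_t msg)
{
    if (atomic_load_explicit(&mb->full, memory_order_acquire) != 0u)
        return false;
    mb->payload = msg;                                   /* write data first */
    atomic_store_explicit(&mb->full, 1u, memory_order_release);
    return true;
}

/* Receiver side: polled; returns false if no message is waiting. */
bool mailbox_try_receive(mailbox_t *mb, uint32_t *out)
{
    assert(out != NULL);
    if (atomic_load_explicit(&mb->full, memory_order_acquire) == 0u)
        return false;
    *out = mb->payload;                                  /* read data first */
    atomic_store_explicit(&mb->full, 0u, memory_order_release);
    return true;
}
```

An interrupt-driven variant would keep the same data layout and replace the receiver's polling loop with an inter-core interrupt raised after a successful send.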

Inter-DSP Interrupts

• 2 registers per core to control inter-DSP interrupts
  – IPCG (in IPCGRx)
    • Writing '1' to IPCG triggers an interrupt to the corresponding GEM
    • Any '1' write within 8 CPU cycles of a previous one does not trigger a new interrupt
    • Writes of '0' and reads have no effect
  – SRCSx (in IPCGRx)
    • SW method to tell what caused the interrupt
    • Usage is completely SW defined
    • A write of '1' is sticky and is read back as '1' until cleared
    • A write of '0' has no effect
    • Reads return the current value of the bit
  – SRCCx (in IPCARx)
    • A write of '1' to SRCCx in IPCARx clears the corresponding SRCSx
    • A write of '0' or a read has no effect
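The register semantics listed above can be summarized with a small host-side model. This is a sketch, not a driver: the bit positions (`IPCG_BIT` at bit 0, source bits from bit 4 upward) and all names are assumptions made for illustration, the number of source bits is not stated in the slides, and the 8-cycle suppression window is not modeled.

```c
#include <assert.h>
#include <stdint.h>

#define IPCG_BIT     (1u << 0)            /* ASSUMED position of IPCG */
#define SRC_SHIFT    4u                   /* ASSUMED position of SRCS0 / SRCC0 */
#define SRC_MASK(n)  (1u << (SRC_SHIFT + (n)))
#define ALL_SRC_BITS 0xFFFFFFF0u          /* ASSUMED: every bit from 4 up is a source bit */

/* Software model of one receiver's IPCGRx / IPCARx pair. */
typedef struct {
    uint32_t srcs;       /* sticky SRCSx bits */
    unsigned irq_count;  /* interrupts delivered to the receiving core */
} ipc_model_t;

/* Write to IPCGRx: '1' bits set SRCSx and stick, '0' bits are ignored,
   and a '1' in IPCG raises an interrupt. */
void ipcgr_write(ipc_model_t *r, uint32_t value)
{
    assert(r != 0);
    r->srcs |= value & ALL_SRC_BITS;
    if (value & IPCG_BIT)
        r->irq_count++;
}

/* Read of IPCGRx: returns the current SRCSx bits and has no side effect. */
uint32_t ipcgr_read(const ipc_model_t *r)
{
    return r->srcs;
}

/* Write to IPCARx: a '1' in SRCCx clears the matching SRCSx, '0' is ignored. */
void ipcar_write(ipc_model_t *r, uint32_t value)
{
    assert(r != 0);
    r->srcs &= ~(value & ALL_SRC_BITS);
}
```

A typical software convention would be for sender n to write `IPCG_BIT | SRC_MASK(n)`, and for the receiver's interrupt handler to read the source bits, service each sender, and acknowledge with `ipcar_write`.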

Multi-Channel Peripherals

• These peripherals allow resources to be allocated to the cores and controlled orthogonally, without software hand-shaking prior to accesses. Examples of these multi-channel peripherals are:
  – EDMA
    • 64 channels and 256 parameter RAM entries can be separated by software into regions, with each region assigned to a core.
  – EMAC
    • Eight receive and eight transmit DMA channels assigned by software.
    • Received packets are transferred to a core based on MAC address routing assigned to a channel.
    • Transmit packets are transferred from a core based on a core-defined list.
  – SRIO
    • Eight receive and eight transmit DMA channels assigned by software.
    • Received packets are transferred to a core based on address routing assigned to a channel.
    • Transmit packets are transferred from a core based on a core-defined list.
  – AIF
    • Six inbound and outbound links, with multiple EDMA channels assigned by software.
  – INTGEN
    • The interrupt generation logic, used for discrete signaling between cores, is designed to allow orthogonal event assertion and clearing by each core.
    • Control registers are established per receiver, and multiple senders can assert events concurrently.
  – GPIO
    • Multiple GPIO pins can be divided among the cores by software.
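To make the idea of splitting a multi-channel peripheral concrete, the sketch below divides the 64 EDMA channels into three disjoint per-core ownership masks. The even, contiguous split is only one possible policy chosen for illustration, `edma_channel_mask` is a hypothetical helper, and programming the result into the EDMA3 region registers is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CORES    3u
#define NUM_CHANNELS 64u

/* Bitmask of the EDMA channels owned by `core` (bit n = channel n).
   Channels are handed out in contiguous blocks; when the count does not
   divide evenly, the lowest-numbered cores get one extra channel each. */
uint64_t edma_channel_mask(uint32_t core)
{
    assert(core < NUM_CORES);

    uint32_t base  = NUM_CHANNELS / NUM_CORES;
    uint32_t extra = NUM_CHANNELS % NUM_CORES;
    uint32_t count = base + (core < extra ? 1u : 0u);
    uint32_t first = core * base + (core < extra ? core : extra);

    uint64_t mask = 0u;
    for (uint32_t ch = first; ch < first + count; ch++)
        mask |= (uint64_t)1 << ch;
    return mask;
}
```

Because the masks are disjoint and together cover every channel, each core can set up and trigger its own channels without hand-shaking with the others, which is the property the slide describes.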

Single-Channel Peripherals

• I2C
  – Typically used during boot, system setup, or board monitoring, the I2C should be serviced by a single core. If shared tables/resources are accessed through I2C, it is much faster to first copy the data to DSP memory and share it from there. The I2C can be serviced by direct CPU accesses or EDMA.
• Timer64
  – There are multiple timers on the chip. Typically these are individually allocated to single cores, allowing the owning core to control its timer without arbitrating.
• All other peripherals
  – These are intended for use during system initialization only, and as such do not need to be allocated or arbitrated for. The boot master should take care of this initialization. This includes DDR2, which has built-in arbitration for multiple masters based on transaction priority.

Single Code Image

[Figure: single code image. Labels: DDR2 memory; three C64x+ cores (Core 0, Core 1, Core 2), each with L1 Prog, L1 Data, and L2 memory; App.out on each core; App.out code and read-only data.]

• Default configuration of chip will be for single image.
• BIOS code and read-only data should be placed into shared memory.
  – .hwi_vec will default to LL2 memory (it can be modified during runtime).
  – The sections .gblinit, .switch, .cinit, .pinit, and .const will default to shared memory. All other data sections will default to L2 memory.
• If using CCS
  – User can load and run the app on all cores synchronously with Parallel Debug Manager (Simulator).
  – User can also load and run the app on each individual core (Simulator).
• If using Bootloader
  – Sections located in aliased memory will automatically be replicated across the cores' memory.
  – When done loading the app, it can release all cores from reset.

Multiple images, not shared

[Figure: multiple images, not shared. Labels: DDR2 memory; three C64x+ cores (Core 0, Core 1, Core 2), each with L1 Prog, L1 Data, and L2 memory; App0.out, App1.out, App2.out.]

• Each core will be loaded with its app.
• Each app needs to manage its usage of memory and make sure it doesn't collide with any other app.
• If using CCS
  – Open and load each core with its app (Simulator).
  – Use Parallel Debug Manager to run all cores synchronously, or open up each core to run them asynchronously (Simulator).
• If using Bootloader
  – Load each core with its app.
  – Take each core out of reset.
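Since each app is responsible for not colliding with the others, a build-time or boot-time sanity check over the planned memory ranges can catch mistakes early. The sketch below is one hypothetical way to do that on a host; `mem_region_t`, `regions_overlap`, and `find_collision` are illustrative names, regions are assumed to be non-empty, and any addresses passed in are whatever the linker command files actually assign.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous range claimed by an app in shared memory. */
typedef struct {
    const char *owner;  /* e.g. "App0" */
    uint32_t    start;  /* first byte of the range */
    uint32_t    size;   /* length in bytes, assumed > 0 */
} mem_region_t;

/* True if the half-open ranges [start, start + size) intersect. */
bool regions_overlap(const mem_region_t *a, const mem_region_t *b)
{
    uint64_t a_end = (uint64_t)a->start + a->size;  /* 64-bit to avoid wrap */
    uint64_t b_end = (uint64_t)b->start + b->size;
    return a->start < b_end && b->start < a_end;
}

/* Returns the index of the first region that collides with an earlier
   one in the table, or -1 if all regions are disjoint. */
int find_collision(const mem_region_t *regions, size_t count)
{
    assert(regions != NULL || count == 0);
    for (size_t i = 1; i < count; i++)
        for (size_t j = 0; j < i; j++)
            if (regions_overlap(&regions[i], &regions[j]))
                return (int)i;
    return -1;
}
```

The quadratic scan is fine for the handful of regions a three-core system would declare; the same table could also be shared between the apps as a single source of truth for the memory plan.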

Multiple images, shared

[Figure: multiple images, shared. Labels: DDR memory; three C64x+ cores (Core 0, Core 1, Core 2), each with L1 Prog, L1 Data, and L2 memory; partial image code & read-only data (BIOS and/or App); partial image data for each core; App0.out, App1.out, App2.out.]

• All apps share some common code/data (partial link image).
  – The partial link image needs to be built as a separate step.
  – The partial link image is at a fixed location on all cores.
  – Code and read-only data should be placed into shared memory.
  – Some BIOS read/write data will need to be placed in each core's L2 memory.
• The non-shared code and data
  – Should be placed in each core's LL2 memory.
  – Each app can use SL2 memory, but needs to manage its usage of the SL2 memory and make sure it doesn't collide with any other app.
• If using CCS
  – Load the partial link image first through Parallel Debug Manager (Simulator). [Only needed if not loaded with app].
  – Load each core with its app (Simulator).
  – Use Parallel Debug Manager to run all cores synchronously, or open up each core to run them asynchronously (Simulator).
• If using Bootloader
  – Load the partial link image (if not loaded with app).
  – Load each core with its app.
  – Release each core from reset.
• Note: the partial link image can be loaded once if it is not included in the load of the apps; otherwise it would be loaded multiple times (once for each app loaded on each core).

Device Boot

• Regardless of the number of .out files created, a single boot table should be generated for the final image to be loaded in the end system.
• The boot sequence is controlled by Core 0.
  – After device reset, Core 0 is responsible for releasing all cores from reset after the boot image is loaded into the device.
• Details on the boot loader are available in TI user guide SPRUEA7, TMS320TCI648x DSP Bootloader.

[Figure: boot table generation. Core0.out + Core0.rmd, Core1.out + Core1.rmd, and Core2.out + Core2.rmd each pass through Hex6x to produce Core0.btbl, Core1.btbl, and Core2.btbl, which MERGEBTBL combines into DspCode.btbl.]

Q&A
