Multicore Programming Faraday
NDA required
TCI6487 Multi-core Programming
China HPMP, May 2009
Agenda
• Multi-core chips high level overview
• Multi-core programming
  – Memory consideration
  – Inter-core communication
  – Multi-core arbitration
  – Peripherals consideration
• Image creation
Faraday: High Level View
• THREE C64X+™ DSP CORE @ 1+ GHZ
  – 16/32-bit ISA, doubled MPY vs. C64x core
  – RSA instruction set extension for CR processing (downlink & uplink)
  – 65 nm process
• MEMORY
  – 32 kB L1 program memory
  – 32 kB L1 data memory
  – 3 MB of total L2 memory (2 configurations): 1 MB / 1 MB / 1 MB or 1.5 MB / 1 MB / 0.5 MB
  – Boot ROM
  – DDR2-667 MHz, 32-bit
• COMMUNICATIONS SUBSYSTEM
  – 2x sRIO (1x links)
  – SGMII Gigabit Ethernet
  – Antenna interface supporting OBSAI / CPRI, 6 links
• ACCELERATION
  – VCP2, TCP2
  – Receive accelerator (RAC)
• 561 BALLS, 23x23 MM FC-BGA
  – 5 rows + 11x11 center array
• OTHERS
  – IP security, lead-free and green
[Block diagram: three C64x+ cores, each with L1 Prog, L1 Data, RSA and L2 memory; EDMA3.0 with switch fabric; VCP2, TCP2, RAC; Antenna Interface, McBSP, 10/100/1G Ethernet, sRIO, DDR-2 IF, GPIO, I2C, PLL, Timers, Others, Boot ROM.]
Programming Considerations
• Programming model
  – Shared image: the programmer needs to determine whether aliased addressing is appropriate. If so, the code still needs to assign pointers to memory using the global address for any data transfers (aside from internal DMA performed within a single core’s memory).
  – Non-shared images: only global addresses should be used. There is no advantage to aliased addressing.
• Resource allocation: shared resources must be partitioned or arbitrated for.
  – Multi-channel peripherals can be split amongst the cores for concurrent, orthogonal control: EDMA channels, EDMA events, Ethernet MAC TX/RX data flow, RapidIO TX/RX/LSU data flow.
  – Single-channel peripherals ought to be controlled by a single master, servicing the other cores if needed: Timer64.
• System-level prioritization. A user-specified priority may be assigned to:
  – Any cache-miss or non-cacheable access by any of the CPUs
  – Any EDMA transfer
  – Any Serial RapidIO transfer
  – Any Ethernet transfer
• Inter-core communication
  – Discrete events: INTGEN peripheral
  – Message passing: direct writes to memory, or DMA transfer. Can implement a polling or interrupt-driven protocol (DSP BIOS MSGQ available).
Core Local Memory Map
• For each core, L1/L2 memories have two entries in the memory map.
  – Global addresses: accessible to all masters in the chip
  – Local (aliased) addresses: accessible only to the local core and IDMA
    • The eight most significant bits are masked to zero
      – E.g. 0x10800000 and 0x00800000 are the same memory for core 0.
    • Allows for common code to be run unmodified on multiple cores
    • Not beneficial for un-shared code.
• Each core has a private configuration space
  – Local core control registers (cache, TSC, IDMA, INTC) are not visible to other masters in the chip.
• Core number
  – Software can verify the core on which it is running through a register (DNUM) that holds the DSP core number (0, 1, or 2)
  – The core number can be used during run-time to conditionally execute code, update pointers, create a global address, etc.
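As an illustration of creating a global address from the core number, here is a minimal sketch in C. The slide only confirms the core 0 mapping (0x00800000 and 0x10800000); the sketch assumes that the global alias of core n sits at 0x10000000 + (n << 24), which should be checked against the device data manual. The function name `local_to_global` is hypothetical, and the core number is passed as a parameter so the code also runs on a host; on the target it would be read from DNUM.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: core n's global alias = 0x10000000 + (n << 24) + local address.
 * Only the core 0 case (0x00800000 <-> 0x10800000) is stated in the slides. */
#define GLOBAL_BASE 0x10000000u
#define NUM_CORES   3u

/* Convert a local (aliased) L1/L2 address into the global address of `core`.
 * On the target, `core` would be the value read from the DNUM register. */
uint32_t local_to_global(uint32_t local, uint32_t core)
{
    assert(core < NUM_CORES);
    /* Local aliases have the eight most significant bits equal to zero. */
    assert((local & 0xFF000000u) == 0u);
    return GLOBAL_BASE | (core << 24) | local;
}
```

A shared image would typically keep using the local alias for its own accesses and call a helper like this only when handing a pointer to another master, such as EDMA or another core.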
Basic Techniques for Multi-core DSP
• Inter-core interrupts
  – Cooperation between cores
• EDMA
  – Main inter-core data transaction engine
• Shared memory
* Blue parts are necessary for multi-core DSP
Inter-core data transaction
• Discrete events: INTGEN peripheral
• Message passing: direct writes to memory, or DMA transfer. Can implement a polling or interrupt-driven protocol (DSP BIOS MSGQ available).
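The polling variant of message passing by direct memory writes can be sketched as a single-slot mailbox. This is an illustrative model, not the DSP BIOS MSGQ API: the type `mailbox_t` and the functions `mailbox_post` and `mailbox_poll` are hypothetical names. On the device the mailbox would be placed at a global address visible to both cores, and cache write-back/invalidate of the mailbox would also be needed; that step is only noted in comments here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MSG_WORDS 4u

/* Hypothetical single-slot mailbox, intended to live in memory that both
 * the sending and the receiving core can address. */
typedef struct {
    volatile uint32_t full;        /* 0 = empty, 1 = message waiting */
    uint32_t payload[MSG_WORDS];
} mailbox_t;

/* Sender side: returns 1 if the message was posted, 0 if the slot is busy. */
int mailbox_post(mailbox_t *mb, const uint32_t msg[MSG_WORDS])
{
    if (mb->full)
        return 0;
    memcpy(mb->payload, msg, sizeof mb->payload);
    /* On real hardware: write back the payload before raising the flag. */
    mb->full = 1u;
    return 1;
}

/* Receiver side: polls once; returns 1 and copies the message if present. */
int mailbox_poll(mailbox_t *mb, uint32_t out[MSG_WORDS])
{
    /* On real hardware: invalidate the cached copy before reading. */
    if (!mb->full)
        return 0;
    memcpy(out, mb->payload, sizeof mb->payload);
    mb->full = 0u;                 /* hand the slot back to the sender */
    return 1;
}
```

An interrupt-driven version would keep the same data layout and replace the polling loop with an inter-core interrupt raised after the flag is set.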
Inter-DSP Interrupts
• 2 registers per core to control inter-DSP interrupts
  – IPCG (in IPCGRx)
    • Writing ‘1’ to IPCG triggers an interrupt to the corresponding GEM
    • A further ‘1’ write within 8 CPU cycles does not trigger a new interrupt
    • Writes of ‘0’ and reads have no effect
  – SRCSx (in IPCGRx)
    • SW method to tell what caused the interrupt
    • Usage is completely SW defined
    • A write of ‘1’ is sticky and is read back as ‘1’ until cleared
    • A write of ‘0’ has no effect
    • Reads return the current value of the bit
  – SRCCx (in IPCARx)
    • A write of ‘1’ clears the corresponding SRCSx bit
    • A write of ‘0’ or a read has no effect
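The register semantics above can be captured in a small host-side software model, which may help when reasoning about a signaling protocol. The bit positions are an assumption (IPCG in bit 0, source bits in bits 4 to 31) and should be confirmed in the device documentation; the names `ipc_model_t`, `ipcgr_write`, `ipcgr_read` and `ipcar_write` are hypothetical, and the 8-cycle window is not modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout (not given in the slides): IPCG is bit 0,
 * SRCSx / SRCCx occupy bits 4..31. */
#define IPCG_BIT 0x00000001u
#define SRC_MASK 0xFFFFFFF0u

typedef struct {
    uint32_t srcs;        /* sticky SRCSx bits, as read back from IPCGRx */
    unsigned interrupts;  /* interrupts raised towards the receiving core */
} ipc_model_t;

/* Model a write to IPCGRx by a sending core. */
void ipcgr_write(ipc_model_t *m, uint32_t value)
{
    m->srcs |= value & SRC_MASK;   /* '1' is sticky, '0' has no effect */
    if (value & IPCG_BIT)
        m->interrupts++;           /* 8-cycle merge window not modelled */
}

/* Model a read of IPCGRx; this model returns only the sticky source bits. */
uint32_t ipcgr_read(const ipc_model_t *m)
{
    return m->srcs;
}

/* Model a write to IPCARx by the receiving core: '1' clears SRCSx. */
void ipcar_write(ipc_model_t *m, uint32_t value)
{
    m->srcs &= ~(value & SRC_MASK);
}
```

A typical use is for each sender to own one source bit: the sender writes its bit together with IPCG, and the receiver's interrupt handler reads the source bits, services each sender, and acknowledges through IPCARx.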
Multi-Channel Peripherals
• These peripherals allow resources to be allocated to the cores and controlled orthogonally, without software handshaking prior to each access. Examples of multi-channel peripherals are:
  – EDMA
    • 64 channels and 256 parameter RAM (PaRAM) entries can be separated by software into regions, with each region assigned to a core.
  – EMAC
    • Eight receive and eight transmit DMA channels assigned by software.
    • Received packets are transferred to a core based on MAC address routing assigned to a channel.
    • Transmit packets are transferred from a core based on a core-defined list.
  – SRIO
    • Eight receive and eight transmit DMA channels assigned by software.
    • Received packets are transferred to a core based on address routing assigned to a channel.
    • Transmit packets are transferred from a core based on a core-defined list.
  – AIF
    • Six inbound and outbound links, with multiple EDMA channels assigned by software.
  – INTGEN
    • The interrupt generation logic, used for discrete signaling between cores, is designed to allow orthogonal event assertion and clearing by each core.
    • Control registers are established per receiver, and multiple senders can assert events concurrently.
  – GPIO
    • Multiple GPIO pins can be separated by software.
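One way to express the EDMA split described above is as one 64-bit ownership mask per core, with a check that no channel is claimed twice. The sketch below is a host-side illustration only: `channel_mask` and `masks_disjoint` are hypothetical helpers, the 22/21/21 split in the usage note is an arbitrary example, and how the masks are programmed into the EDMA region registers is not shown.

```c
#include <assert.h>
#include <stdint.h>

#define EDMA_CHANNELS 64u

/* Build a 64-bit ownership mask for channels [first, first + count). */
uint64_t channel_mask(unsigned first, unsigned count)
{
    assert(first + count <= EDMA_CHANNELS);
    if (count == 0u)
        return 0u;
    /* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
    uint64_t ones = (count == 64u) ? ~UINT64_C(0)
                                   : ((UINT64_C(1) << count) - 1u);
    return ones << first;
}

/* Return 1 if no channel is claimed by more than one core, else 0. */
int masks_disjoint(const uint64_t *masks, unsigned n)
{
    uint64_t seen = 0u;
    for (unsigned i = 0; i < n; i++) {
        if (seen & masks[i])
            return 0;
        seen |= masks[i];
    }
    return 1;
}
```

For three cores, `channel_mask(0, 22)`, `channel_mask(22, 21)` and `channel_mask(43, 21)` cover all 64 channels without overlap; running `masks_disjoint` at start-up is a cheap guard against two images claiming the same channel.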
Single-Channel Peripherals
– I2C
  • Typically used during boot, system setup, or board monitoring, the I2C should be serviced by a single core. If shared tables/resources are accessed through I2C, it is much faster to first copy the data to DSP memory and share it from there. The I2C can be serviced by direct CPU accesses or EDMA.
– Timer64
  • There are multiple timers on the chip. Typically these are individually allocated to single cores, allowing the owning core to control its timer without arbitration.
– All other peripherals
  • These are intended for use during system initialization only, and as such do not need to be allocated or arbitrated for. The boot master should take care of this initialization. This includes DDR2, which has built-in arbitration for multiple masters based on transaction priority.
Single Code Image
[Diagram: single code image. Labels: DDR2 memory; App.out (code and read-only data); L2 memory, L1 Prog and L1 Data for each of C64x+ Core 0, Core 1 and Core 2; App.out shown once per core.]
– The default configuration of the chip is for a single image.
– BIOS code and read-only data should be placed into shared memory.
  • .hwi_vec will default to LL2 memory (it can be modified during runtime).
  • The sections .gblinit, .switch, .cinit, .pinit, and .const will default to shared memory. All other data sections will default to L2 memory.
– If using CCS
  • User can load and run the app on all cores synchronously with Parallel Debug Manager (Simulator).
  • User can also load and run the app on each individual core (Simulator).
– If using Bootloader
  • Sections located in aliased memory will automatically be replicated across the cores’ memory.
  • When loading of the app is done, all cores can be released from reset.
Multiple images, not shared
[Diagram: multiple images, not shared. Labels: DDR2 memory; L2 memory, L1 Prog and L1 Data for each of C64x+ Core 0, Core 1 and Core 2; App0.out, App1.out, App2.out.]
– Each core will be loaded with its app.
– Each app needs to manage its usage of memory and make sure it doesn’t collide with any other app.
– If using CCS
  • Open and load each core with its app (Simulator).
  • Use Parallel Debug Manager to run all cores synchronously or open up each core to run them asynchronously (Simulator).
– If using Bootloader
  • Load each core with its app
  • Take each core out of reset
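The requirement that each app must not collide with any other app's memory can be checked mechanically once each image's regions are known. The following is a minimal sketch, assuming each region is described by a base address and a non-zero size; `region_t` and `regions_overlap` are hypothetical names, and the addresses in the usage note are made-up examples rather than values from the device memory map.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a memory range used by one app. */
typedef struct {
    uint32_t base;   /* first byte of the region (global address) */
    uint32_t size;   /* length in bytes, assumed non-zero */
} region_t;

/* Return 1 if the two regions share at least one byte, else 0. */
int regions_overlap(region_t a, region_t b)
{
    /* 64-bit end addresses avoid wrap-around at the top of the 32-bit space. */
    uint64_t a_end = (uint64_t)a.base + a.size;
    uint64_t b_end = (uint64_t)b.base + b.size;
    return a.base < b_end && b.base < a_end;
}
```

For example, two adjacent 1 MB regions starting at 0x80000000 and 0x80100000 do not overlap, while a 32-byte region starting at 0x800FFFF0 overlaps both. Such a check could run on the host over the link maps, or at start-up on the boot master.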
Multiple images, shared
[Diagram: multiple images, shared. Labels: DDR memory; Partial Image (code & read-only data; BIOS and/or App); Partial Image data (three instances); L2 memory, L1 Prog and L1 Data for each of C64x+ Core 0, Core 1 and Core 2; App0.out, App1.out, App2.out.]
– All apps share some common code/data (partial link image).
  • The partial link image needs to be built as a separate step.
  • The partial link image is at a fixed location on all cores.
  • Code and read-only data should be placed into shared memory.
  • Some BIOS read/write data will need to be placed in each core’s L2 memory.
– The non-shared code and data
  • Should be placed in each core’s LL2 memory.
  • Each app can use SL2 memory, but needs to manage its usage of the SL2 memory and make sure it doesn’t collide with any other app.
– If using CCS
  • Load the partial link image first through Parallel Debug Manager (Simulator). [Only needed if not loaded with app].
  • Load each core with its app (Simulator).
  • Use Parallel Debug Manager to run all cores synchronously or open up each core to run them asynchronously (Simulator).
– If using Bootloader
  • Load the partial link image (if not loaded with app).
  • Load each core with its app.
  • Release each core from reset.
– Note: The partial link image can be loaded once if it is not included in the load of the apps; otherwise it is loaded multiple times (once for each app loaded on each core).
Device Boot
• Regardless of the number of .out files created, a single boot table should be generated for the final image to be loaded in the end system.
• The boot sequence is controlled by Core 0.
  – After device reset, Core 0 is responsible for releasing all cores from reset after the boot image is loaded into the device.
• Details on the boot loader are available in TI user guide SPRUEA7, TMS320TCI648x DSP Bootloader
[Diagram: boot table generation flow. Each CoreN.out with its CoreN.rmd (N = 0, 1, 2) is processed by Hex6x into CoreN.btbl; MERGEBTBL combines Core0.btbl, Core1.btbl and Core2.btbl into DspCode.btbl.]
Q&A