[IEEE 2008 IEEE International Symposium on VLSI Design, Automation and Test (VLSI-DAT) - Hsinchu,...

Multilevel Communication Modeling for Multiprocessor System-on-Chip

Katalin Popovici1, Ahmed Amine Jerraya2

[email protected] TIMA Laboratory

46, Av Felix Viallet 38031, Grenoble,France

[email protected] CEA-LETI, MINATEC

17, rue des Martyrs 38054, Grenoble, France

ABSTRACT

The high complexity of current Multi-Processor Systems-on-Chip (MPSoC) compels designers to model and simulate the system components and their interaction in the early design stages. High-level models usually require less modeling effort and execute faster than low-level ones. In this paper we propose high-level communication models that allow early MPSoC design, performance estimation and evaluation of the application's communication requirements. We applied the proposed modeling methods to analyze the performance impact of different communication architectures for the H.264 Encoder running on a complex MPSoC platform.

I. INTRODUCTION

Current multiprocessor system-on-chip (MPSoC) architectures integrate a massive number of processors that can exchange application and synchronization data in different ways [1]. The communication architecture is characterized by a large set of parameters and design choices, such as:

- Programming Model: shared memory (e.g. OpenMP [2]), message passing (e.g. MPI [3])

- Blocking versus non-blocking semantics

- Synchronous versus Asynchronous communication

- Buffered versus Un-buffered data transfer

- Synchronization mechanism, such as interrupt or polling

- Type of connection: point-to-point dedicated link or global interconnection component, such as system bus or Network-on-Chip (NoC)

- Communication Buffer mapping: stored in sender subsystem, stored in receiver subsystem or using a dedicated storage resource such as global memory or hardware FIFO.

Finding the best communication architecture for a specific application requires exploring a large design space. Exploring the communication at a low level, on the virtual prototype, is often too slow for exhaustive exploration and requires too much effort to build accurate models. Raising the abstraction level above the virtual prototype therefore plays a key role in system-level design, because it allows fast design space exploration. High-level models usually require less modeling effort and execute faster, so they are especially well suited to the early design stages, where the design space is very large.

In this paper, we propose communication models used in the early stages of MPSoC design, which enable validation of the communication performance through simulation at different abstraction levels. The proposal allows exploring different communication mapping schemes and different network components. In our approach, we support an asynchronous message passing communication architecture with buffered transfer and the use of global interconnection components, such as a bus or Network-on-Chip (NoC), in different topologies and routing configurations. The main contribution of this paper is the definition of the communication models used at the different MPSoC abstraction levels.

The rest of the paper is structured as follows. Section II discusses related work on communication modeling and exploration flows. Section III details the proposed communication models at the different abstraction levels. Section IV presents the experimental results using the H.264 Encoder application running on a complex MPSoC architecture with different communication schemes. A conclusion follows.

II. RELATED WORKS

To overcome the shortcomings of low-level design, such as long design time, a number of high-level modeling and simulation environments have been proposed. In [4] the authors describe the high-level Sesame framework, which explores the optimal mapping of an application model onto an architecture model. However, their work does not address communication architecture exploration.

Shin et al. propose the automatic generation of transaction-level models from an abstract input description in order to refine the communication architecture [5]. However, they target bus-based architectures and do not support networks-on-chip. The authors in [6] present fast interconnect architecture exploration using three types of interconnect: distributed memory server, AMBA bus and Octagon NoC, modeled using the OCCN methodology [7]. This approach is essential for selecting an optimal on-chip interconnect architecture, but it does not explore the communication mapping schemes.

In this paper, we focus on multilevel communication modeling that allows validation of the communication architecture, communication buffer mapping exploration for a given architecture and interconnect exploration using bus and NoC network components. In our previous work, we presented a multi-level MPSoC hardware and software modeling and simulation flow. The considered design flow has four abstraction levels: System Architecture, Virtual Architecture, Transaction Accurate Architecture and Virtual Prototype [8]. This paper emphasizes the adopted communication models at these abstraction levels.

III. MULTILEVEL COMMUNICATION MODELING

As illustrated in figure 1, the communication models adopted at the different MPSoC abstraction levels detail the following:

- Combined architecture/application specification with communication mapping information, resulting in the System Architecture (SA) model

- Communication buffer mapping onto the hardware resources (memories or dedicated hardware components such as hardware FIFOs), corresponding to the Virtual Architecture (VA) communication model

- Communication protocol implementation including an explicit synchronization mechanism, corresponding to the Transaction Accurate Architecture (TA) communication model

- Communication software adaptation to a particular processor based on I/O device drivers, resulting in the Virtual Prototype (VP) communication model.

978-1-4244-1617-2/08/$25.00 ©2008 IEEE.

[Figure 1: refinement flow from the Application and Architecture specification through four levels: System Architecture (abstract communication units, implicit I/Os), Virtual Architecture (abstract interconnect component, buffer storage resources), Transaction Accurate Architecture (explicit synchronization, explicit interconnect component) and Virtual Prototype (explicit implementation, I/O device drivers, load/store primitives). The refinement steps are communication buffer mapping on storage resources, specific hardware communication implementation, and communication software adaptation to specific CPUs.]

Figure 1. Communication Modeling Abstraction Levels

The following paragraphs give more details on the communication models adopted at each abstraction level.

Communication Specification at the SA level

The highest abstraction level is the SA level, which results from the application and architecture specification. The initial descriptions of the application functionality and of the target hardware topology are independent of one another. The application is specified as a set of Simulink task subsystems that group the application functions and communicate using abstract communication units. In the following, we use the terms process and task interchangeably. The target hardware topology can be defined using XML-based descriptions as proposed in SPIRIT. The combination of these two representations results in the SA model.

The SA model is annotated with the communication mapping information by adding parameters to each communication unit. Thus, the SA model comprises the mapping information that defines which software process or task is executed on which hardware resource and which hardware/software communication mechanisms are used for the data exchange between the software elements. The communication at this level makes use of implicit Simulink I/Os. The communication units between the application tasks correspond to FIFO buffers.
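The mapping annotation described above can be pictured as a small data structure. The following is a minimal sketch in Python, not the Simulink representation used in the paper; the class names, the field names (for example `buffer_location`) and the task-to-subsystem assignment are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CommUnit:
    """Abstract FIFO communication unit annotated with mapping parameters."""
    name: str
    producer: str          # task that writes into the unit
    consumer: str          # task that reads from the unit
    buffer_location: str   # hypothetical parameter: storage resource for the buffer


@dataclass
class SystemArchitecture:
    """Combined application/architecture description with mapping information."""
    task_to_subsystem: dict = field(default_factory=dict)
    comm_units: list = field(default_factory=list)

    def subsystems_of(self, unit: CommUnit) -> tuple:
        """Return the (producer, consumer) subsystems of a communication unit."""
        return (self.task_to_subsystem[unit.producer],
                self.task_to_subsystem[unit.consumer])


# Illustrative mapping only; the assignment of tasks to subsystems is assumed.
sa = SystemArchitecture(
    task_to_subsystem={"T1": "CPU-SS1", "T2": "CPU-SS2"},
    comm_units=[CommUnit("COMM1", "T1", "T2", buffer_location="MEM")],
)
endpoints = sa.subsystems_of(sa.comm_units[0])
```

Keeping the mapping as data on each communication unit means that a different mapping scheme can be explored by changing parameters rather than the application description.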

Communication Modeling at the VA level

At the next abstraction level, the communication modeling consists of modeling the mapping of the FIFO buffers onto the storage elements provided by the architecture. This is implemented according to the mapping information specified at the previous level by the designer. The result corresponds to the virtual architecture, modeled using Transaction Level Modeling (TLM) in SystemC.

As illustrated in figure 2, the virtual architecture model has the following components:

- Abstract Processing Subsystems that execute the encapsulated application tasks or Kahn processes, such as CPU-SS1 and CPU-SS2 in figure 2. These represent SystemC modules (sc_module).

- Storage Elements that are available in the hardware architecture and serve to store the communication buffers, e.g. MEM. These are modeled as SystemC modules (sc_module).

- Abstract Interconnect Component that aims to interconnect the processing subsystems and the storage components. This represents a SystemC channel (sc_channel).

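[Figure 2: the Abstract CPU-SS1 and Abstract CPU-SS2 processing subsystems with tasks T1, T2 and T3 and the SWFIFO channel, and the MEM storage element holding COMM1 and COMM2, all connected through the Abstract Communication Network; communication paths 1 and 2 are marked.]

Figure 2. Communication in the Virtual Architecture Model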

The communication inside a processing subsystem takes place through the local memory of that subsystem. Therefore, at the virtual architecture level, it is not mapped explicitly onto a storage element. The intra-subsystem communication is modeled as a point-to-point SystemC communication channel (sc_channel), e.g. the SWFIFO channel in figure 2.

The communication buffer mapping concerns only the FIFO buffers between processes running on different processing subsystems, e.g. the units COMM1 and COMM2 between CPU-SS1 and CPU-SS2 are mapped onto the MEM storage resource. They correspond to communication path 1 between task T1 and task T2 and to path 2 between task T3 and task T1, respectively (figure 2). These buffers become bounded, and their size is defined by estimating the application requirements through simulation.
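The mapping rule in the two paragraphs above can be sketched as a single decision: a buffer between tasks on the same subsystem stays in a local software FIFO, while a buffer between different subsystems goes to the storage resource chosen by the designer. The Python sketch below is an assumption-laden illustration; the function name `place_buffer`, the unit named `LOCAL` and the direction of each unit are hypothetical.

```python
def place_buffer(producer_ss: str, consumer_ss: str, storage: str) -> str:
    """Return where a FIFO buffer is placed at the virtual architecture level."""
    if producer_ss == consumer_ss:
        # Intra-subsystem communication is not mapped on a storage element.
        return "SWFIFO"
    # Inter-subsystem communication uses the designer-selected storage resource.
    return storage


# (producer subsystem, consumer subsystem, requested storage); directions assumed.
units = {
    "COMM1": ("CPU-SS1", "CPU-SS2", "MEM"),
    "COMM2": ("CPU-SS2", "CPU-SS1", "MEM"),
    "LOCAL": ("CPU-SS1", "CPU-SS1", "MEM"),
}
placement = {name: place_buffer(*args) for name, args in units.items()}
```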

The abstract communication network is implemented as a SystemC communication channel that interconnects the multiple data senders and receivers. It permits a single data transfer at a time through the network component. Simultaneous requests for a data transfer through the network are arbitrated using a queue. Thus, the abstract communication network does not yet implement the topology and behavior of the target interconnect component.
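As an illustration of such an abstract network, the Python sketch below serves one transfer at a time and counts the traffic that passes through it. It assumes first-come, first-served ordering for the queue-based arbitration, which the text does not spell out; the class and method names are hypothetical and do not come from the SystemC model.

```python
from collections import deque


class AbstractInterconnect:
    """Serves one transfer at a time, in arrival order (FIFO arbitration assumed)."""

    def __init__(self):
        self.pending = deque()        # queued transfer requests
        self.words_transferred = 0    # traffic counter for later statistics

    def request(self, initiator: str, target: str, words: int) -> None:
        """Queue a transfer request from an initiator to a target."""
        self.pending.append((initiator, target, words))

    def serve_all(self) -> list:
        """Serve queued requests one by one and return the service order."""
        served = []
        while self.pending:
            initiator, target, words = self.pending.popleft()
            self.words_transferred += words
            served.append((initiator, target))
        return served


net = AbstractInterconnect()
net.request("CPU-SS1", "MEM", 16)
net.request("CPU-SS2", "MEM", 8)
net.request("CPU-SS1", "MEM", 4)
order = net.serve_all()
```

No topology or routing appears in this model, which matches the level of detail described for the virtual architecture.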

The virtual architecture implements asynchronous message passing communication between the processor subsystems. Hence, the sender transfers the data to the FIFO buffer mapped on a particular storage element. The sender blocks only if this FIFO is full. The use of a FIFO makes the communication reliable, as no data is lost. However, the sender has no timing information about when the data is consumed by the receiver. At this level, the synchronization between the processing subsystems is still implicit and the communication path, e.g. path 1 and 2, during the data exchange is not yet fully defined.
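The blocking behavior of this asynchronous scheme can be sketched with a bounded queue and two threads. This is a Python illustration under stated assumptions, not the SystemC TLM model: `queue.Queue` stands in for the FIFO mapped on a storage element, and the `None` end-of-stream marker is an artifact of the sketch.

```python
import queue
import threading


def run_transfer(num_words: int, fifo_depth: int) -> list:
    """Send num_words through a bounded FIFO and return what the receiver got."""
    fifo = queue.Queue(maxsize=fifo_depth)   # bounded FIFO buffer
    received = []

    def sender():
        for word in range(num_words):
            fifo.put(word)        # blocks only while the FIFO is full
        fifo.put(None)            # end-of-stream marker (assumption of this sketch)

    def receiver():
        while True:
            word = fifo.get()     # blocks only while the FIFO is empty
            if word is None:
                break
            received.append(word)

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


result = run_transfer(num_words=100, fifo_depth=4)
```

The sender never learns when a word is consumed; it only stalls when the buffer is full, so no data is lost and the order is preserved.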

The virtual architecture simulation gives important statistics regarding the communication requirements, such as the total number of bytes exchanged between the subsystems during the execution of the application, the amount of data passing through the abstract interconnect component, and the worst-case buffer size that the storage resources need in order to support the communication mapping. Based on the application requirements and the communication traffic resulting from the virtual architecture simulation, the designer can fix the topology of the interconnect component (bus-based topology or NoC-based topology).
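Two of these statistics, the total number of bytes exchanged and the worst-case buffer occupancy, can be derived from a trace of buffer accesses. The Python sketch below assumes a hypothetical trace format of `("write", nbytes)` and `("read", nbytes)` events; the trace values are invented for illustration and are not measurements from the paper.

```python
def buffer_statistics(trace):
    """Return (total bytes written, worst-case occupancy) for one buffer."""
    occupancy = 0
    worst_case = 0
    total_bytes = 0
    for operation, nbytes in trace:
        if operation == "write":
            occupancy += nbytes
            total_bytes += nbytes
            worst_case = max(worst_case, occupancy)
        elif operation == "read":
            if nbytes > occupancy:
                raise ValueError("read of more data than the buffer holds")
            occupancy -= nbytes
        else:
            raise ValueError(f"unknown operation: {operation}")
    return total_bytes, worst_case


# Invented example trace for a single communication buffer.
trace = [("write", 64), ("write", 64), ("read", 64), ("write", 128), ("read", 192)]
total_bytes, worst_case = buffer_statistics(trace)
```

The worst-case occupancy is the value a designer would use to size the bounded buffer on the chosen storage resource.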

Communication Protocol Modeling at the TA level

At the TA level, communication models implement the communication protocol. This considers three aspects: i) modeling of an explicit synchronization mechanism between the data sender subsystem and receiver subsystem, ii) usage of an explicit global interconnect component, such as bus or NoC, and iii) modeling of the full end-to-end communication path between the communicating processing subsystems. The result of this step corresponds to the transaction accurate architecture, modeled in SystemC TLM.

As shown in figure 3, the transaction accurate architecture model is composed of the following elements:

- Processing Subsystems, detailed with local hardware components, e.g. local memory, network interface, explicit synchronization components such as semaphore or mailbox, local bus, etc. Examples of these subsystems are CPU-SS1 and CPU-SS2, and they are modeled as SystemC modules (sc_module).

- Global Storage Elements, such as global memories or dedicated hardware FIFOs, e.g. MEM-SS in figure 3. They represent SystemC modules (sc_module).

- Explicit Interconnect Component, which implements a global bus or NoC as a SystemC channel (sc_channel).

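[Figure 3: the CPU-SS1 and CPU-SS2 processing subsystems, each containing an abstract CPU, a memory, an interface and a synchronization component, and the MEM-SS storage subsystem holding COMM1 and COMM2, all connected through the Communication Network (Bus/NoC).]

Figure 3. Transaction Accurate Architecture Model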

At this level, the communication between the tasks running on the same processing subsystem is managed in software by the real-time operating system. The communication between the tasks running on distinct processing subsystems follows a fully defined communication path from the sender to the receiver, crossing explicit components, such as local bus, network interface and network component. Additionally, it includes synchronization between the communicating partners through events.

Thus, the sender transfers the data to the storage element and sends an event to notify the receiver that data is available. After reading the data, the receiver sends an acknowledge event to confirm that it has consumed the data. Only after receiving this acknowledgement can the sender initiate a new data transfer. The event is implemented as a processor interrupt. If there is more than one communication channel between the processor subsystems, a dedicated synchronization hardware component, e.g. a mailbox, is required.
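The notify/acknowledge handshake can be sketched with two events shared by a sender and a receiver. In this Python illustration `threading.Event` stands in for the processor interrupts mentioned in the text, and a one-slot dictionary stands in for the storage element; both are assumptions of the sketch rather than parts of the transaction accurate model.

```python
import threading


def handshake_transfer(messages):
    """Transfer messages one at a time using data-ready and acknowledge events."""
    storage = {"slot": None}           # stands in for the storage element
    data_ready = threading.Event()     # sender -> receiver notification
    ack = threading.Event()            # receiver -> sender acknowledge
    consumed = []

    def sender():
        for msg in messages:
            storage["slot"] = msg      # write the data to the storage element
            data_ready.set()           # notify the receiver
            ack.wait()                 # wait until the data has been consumed
            ack.clear()                # re-arm for the next transfer

    def receiver():
        for _ in messages:
            data_ready.wait()          # wait for the notification
            data_ready.clear()
            consumed.append(storage["slot"])
            ack.set()                  # confirm consumption

    threads = [threading.Thread(target=sender), threading.Thread(target=receiver)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


delivered = handshake_transfer(["frame0", "frame1", "frame2"])
```

Because the sender waits for the acknowledge before overwriting the slot, a single storage location is enough for one channel; several channels would need the events to be demultiplexed, which is the role the text assigns to a mailbox.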

The topology of the interconnect component becomes explicit. Thus, the bus protocol, arbitration and burst transfer are explicitly modeled. For a NoC, details such as number of routers, topology, routing algorithm and router buffer size become explicit.

The simulation at this level gives more precise information on the communication architecture, such as number of conflicts on the global bus, the amount of NoC congestion, number of transmitted bytes through the bus or NoC, number of routing requests or the number of times some routers failed to transmit the packet due to conflicts inside the NoC.

Communication Software Adaptation at the VP level

At the classical virtual prototype level, the communication software is adapted to a particular processor implementation. The adaptation consists of integration of device drivers into the software stack to access specific I/O units. The full software stack is executed on Instruction Set Simulators, while the hardware is modeled using cycle accurate components.

IV. EXPERIMENTS

In this paper, we present the communication modeling and exploration for the H.264 Encoder application running on a complex heterogeneous MPSoC architecture. The SA model is described in Simulink, while the VA and TA abstraction levels use SystemC TLM. The H.264 Encoder is a video processing multimedia application [9]. To validate the communication architecture through simulation at different abstraction levels, we use a 10-frame QCIF YUV 4:2:0 video sequence, the Main Profile and the x264 open source code as reference C code.

The target architecture is a heterogeneous MPSoC platform, made of a flexible ARM9 processor subsystem and two high performance commercial off-the-shelf ATMEL VLIW DSP subsystems (DSP1 and DSP2). The processors can communicate through the external global memory (DXM), the local memory of the ARM9 subsystem (SRAM), the local memory (DMEM) or registers (REG) of the DSP processor subsystems, as shown in figure 4.

The H.264 application functions were grouped into three separate tasks, one for each processor subsystem. Thus, the DSP1 subsystem is responsible for encoding a frame of the video sequence, the DSP2 subsystem compresses the encoded frame using the CABAC method, and finally the ARM9 subsystem creates the NAL bitstream and runs the bitrate controller. The application executes in a pipelined fashion and requires three communication units between the tasks.

After the specification, the three communication units are modeled through mapping the buffers on the storage resources in different ways. The corresponding virtual architecture model, made of the processing units, storage resources and abstract interconnect, is given in figure 4 for the case of mapping the communication buffers onto the DXM.

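[Figure 4: the Abstract ARM9-SS (with SRAM), Abstract DSP1-SS (with DMEM1 and REG1), Abstract DSP2-SS (with DMEM2 and REG2) and Abstract POT-SS subsystems running tasks T1, T2 and T3, and the DXM external memory holding comm1, comm2 and comm3, all connected through the Abstract Interconnect.]

Figure 4. Communication Buffer Mapping on DXM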

The simulation of the virtual architecture allowed us to estimate the usage of the interconnect component for each communication scheme. The total amount of data passing through the abstract interconnect is illustrated in figure 5. As expected, mapping the communication buffers onto the external memory involves a large number of transactions through the interconnect component.

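[Figure 5: horizontal bar chart of the total words transmitted through the interconnect (axis from 0 to 6,000,000) for seven buffer mapping schemes. Scheme labels, in extraction order: DXM+DXM+DXM, DXM+DMEM2+DMEM1, DMEM1+DMEM2+SRAM, DMEM2+DXM+SRAM, DXM+DXM+REG1, DMEM1+SRAM+DXM, DXM+SRAM+DXM. Bar values, in extraction order: 6171680, 5971690, 3085840, 3285820, 6171670, 3085850, 5971700.]

Figure 5. Total Words transmitted through the Interconnect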

At the TA level, the mailbox synchronization component, the explicit network component and the whole communication path are fully modeled. Figure 6 illustrates the TA model. We used the DXM to store the communication buffers. At this level, we experimented with three different interconnect components: AMBA bus, Torus NoC and Mesh NoC. The adopted NoC is a two-dimensional Hermes NoC with 9 routers (3x3) [10].

[Figure 6: the ARM9-SS, DSP1-SS, DSP2-SS, POT-SS and MEM-SS subsystems, with components including the abstract ARM9 and DSP processors, SRAM, DXM, REG, DMEM, DMA, AIC, PIC, SPI, Timer, Mailbox and network interfaces (NI), connected through the explicit interconnect (AMBA or Hermes NoC).]

Figure 6. Transaction Accurate Architecture Model

Table I summarizes the estimated execution cycles and the simulation time of the H.264 Encoder, together with the total routing requests for the NoC. The AMBA bus showed intermediate performance. The Torus NoC required the fewest clock cycles and the shortest simulation time, while the Mesh NoC had the worst performance. This can be explained by the fact that the Torus offers better path diversity than the Mesh, which reduces network congestion and hence the number of routing requests. Figure 7 presents the load of the Torus NoC.
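One structural reason a torus can relieve congestion is that its wraparound links shorten the paths between routers. The Python sketch below compares the average minimal hop count of a 3x3 mesh and a 3x3 torus. It assumes shortest-path routing and uniform traffic between all router pairs, and it does not model the Hermes routers, their buffers or the H.264 traffic, so it illustrates the topology difference only.

```python
from itertools import product


def axis_distance(a: int, b: int, size: int, wraparound: bool) -> int:
    """Hops along one dimension, optionally using the torus wraparound link."""
    d = abs(a - b)
    return min(d, size - d) if wraparound else d


def average_hops(size: int, wraparound: bool) -> float:
    """Average minimal hop count over all ordered pairs of distinct routers."""
    nodes = list(product(range(size), repeat=2))
    total = 0
    pairs = 0
    for (x1, y1), (x2, y2) in product(nodes, repeat=2):
        if (x1, y1) == (x2, y2):
            continue
        total += axis_distance(x1, x2, size, wraparound)
        total += axis_distance(y1, y2, size, wraparound)
        pairs += 1
    return total / pairs


mesh_hops = average_hops(3, wraparound=False)    # 2.0 hops on average
torus_hops = average_hops(3, wraparound=True)    # 1.5 hops on average
```

Under these assumptions the torus needs a quarter fewer hops on average, which is consistent in direction with the lower routing request count reported for the Torus NoC.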

TABLE I. Execution Time of Transaction Accurate Architecture

| Network Interconnect | Execution Time (clock cycles) | Simulation Time | Routing Requests |
|---|---|---|---|
| Mesh | 64,028,725 | 26 min | 96,618,508 |
| Torus | 46,713,986 | 19 min | 78,217,542 |
| AMBA | 58,435,467 | 23 min | - |
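The relative differences between the interconnects follow directly from the values in Table I. The short Python calculation below only restates those values; the helper name `relative_reduction` is introduced for this sketch.

```python
# Values copied from Table I.
cycles = {"Mesh": 64_028_725, "Torus": 46_713_986, "AMBA": 58_435_467}
routing_requests = {"Mesh": 96_618_508, "Torus": 78_217_542}


def relative_reduction(baseline: int, value: int) -> float:
    """Fraction by which value is lower than baseline."""
    return (baseline - value) / baseline


torus_vs_mesh = relative_reduction(cycles["Mesh"], cycles["Torus"])
torus_vs_amba = relative_reduction(cycles["AMBA"], cycles["Torus"])
requests_torus_vs_mesh = relative_reduction(
    routing_requests["Mesh"], routing_requests["Torus"])

print(f"Torus vs Mesh, clock cycles:     {torus_vs_mesh:.1%}")
print(f"Torus vs AMBA, clock cycles:     {torus_vs_amba:.1%}")
print(f"Torus vs Mesh, routing requests: {requests_torus_vs_mesh:.1%}")
```

With these figures the Torus NoC needs roughly 27% fewer clock cycles than the Mesh NoC and roughly 20% fewer than the AMBA bus, and it issues roughly 19% fewer routing requests than the Mesh NoC.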

Figure 7 shows, for each router, the percentage of the data that traverses the Torus NoC through each port during the encoding of the 10 frames. The local port of each router injects packets into the NoC, while the remaining ports (north, west, south and east) transfer them inside the NoC. For example, the network interface 2x1 associated with DSP1-SS performs 30% of its data transfers through the west port and 70% locally.

[Figure 7: bar chart (0% to 100%) of the share of data per port (North, South, East, West, Local) for each router of the Torus NoC: 0x0 (MEM-SS), 0x1 (POT-SS), 0x2, 1x0 (ARM9-SS), 1x1 (DMA1), 1x2 (DMA2), 2x0, 2x1 (DSP1-SS), 2x2 (DSP2-SS).]

Figure 7. Total Mbytes transmitted through the Torus NoC

V. CONCLUSION

In this paper, we presented high level communication models used in the early stages of MPSoC hardware-software design. The proposed flow allows evaluation of the communication architecture, through simulation at different abstraction levels and early performance estimation. We illustrated and explored the communication models for the H.264 Encoder application using several different communication buffer mapping schemes and Torus NoC, Mesh NoC and AMBA bus interconnect components.

REFERENCES

[1] W. Wolf, "High-Performance Embedded Computing", Morgan Kaufmann, 2006

[2] R. Chandra et al., "Parallel Programming in OpenMP", Morgan Kaufmann, 2000, ISBN 9781558606715

[3] MPI, http://www-unix.mcs.anl.gov/mpi

[4] Cagkan Erbas et al., "A Framework for System-Level Modeling and Simulation of Embedded Systems Architecture", EURASIP Journal on Embedded Systems, Volume 2007, Article ID 82123, June 2007

[5] Dongwan Shin et al., "Automatic Generation of Transaction-Level Models for Rapid Design Space Exploration", Proceedings of CODES+ISSS 2006, Seoul, Korea, pp. 64-69

[6] F. Dumitrascu et al., "Flexible MPSoC Platform with fast interconnect exploration for optimal system performance for a specific application", Proceedings of DATE 2006 Designers' Forum, 6-10 March 2006, Munich, Germany, pp. 166-171

[7] M. Coppola et al., "OCCN: A Network on chip Modeling and Simulation Framework", Proceedings of DATE 2004, Paris, France, pp. 174-179

[8] K. Popovici et al., "Efficient Software Development Platforms for Multimedia Applications at Different Abstraction Levels", Proceedings of RSP Workshop, 28-30 May 2007, Porto Alegre, Brazil, pp. 113-122

[9] Jian-Wen Chen, Chao-Yang Kao, Youn-Long Lin, "Introduction to H.264 Advanced Video Coding", Proceedings of ASP-DAC 2006, Japan, pp. 736-741

[10] F. Moraes et al., "HERMES: an Infrastructure for Low Area Overhead Packet-switching Networks-on-Chip Integration", VLSI Journal, v38(1), 2004, pp. 69-93
