Embedded Multiprocessor Systems · This course is included in the Master in Computer Science program offered by the Politecnico di Milano in collaboration with University of Illinois, Chicago.

Author: Matteo Forfori
Reviewer: Eng. Marco D. Santambrogio
Date: 06/22/04
Version: 1.0
State:




Document License

This is a private version of an internal document of the Micro Architecture Laboratory of the Politecnico di Milano. If you received this document by mistake, please feel free to discard it or to contact +39 02 2399 3564. Thank you for your attention.


Revisions

Date Version State Comment


Index

1. Introduction
2. Classical Multiprocessor Systems
3. Embedded Multiprocessor Systems
   - Need for High Performance
   - Advantages of On-Chip Multiprocessor Architectures
   - VLSI Design Issues
   - Processor and Memory Architecture
   - Interconnection Architecture
   - Optimizing Applications
4. Embedded Multiprocessor Case Studies
   - Multiprocessor Architectures for Embedded Cartographic Systems
   - A Communication Layer for Embedded Multiprocessors
   - An Optimal Memory Allocation for SoC Multiprocessors
5. Conclusions
6. Bibliography


1. Introduction

This document surveys the state of the art of embedded multiprocessor systems. It is part of the Cerbero project, which deals with multiprocessors; the goal of this work is therefore to identify important issues in the design of embedded multiprocessors that can serve as guidelines for future work on multiprocessors. Since no single work currently summarizes all the knowledge about embedded multiprocessors, or even gives a precise definition of what an embedded multiprocessor is, the starting point of this work was to identify an essential bibliography on embedded multiprocessors.

In the following sections, classical multiprocessor systems are presented briefly, while embedded multiprocessors are described in more detail, with particular attention to some of their design issues. This work also includes the description of some case studies that show practical solutions to design problems of embedded multiprocessors, as well as real applications of them.

This document was written as part of the project of the CS569 course, High Performance Processors and Systems. This course is included in the Master in Computer Science program offered by the Politecnico di Milano in collaboration with the University of Illinois, Chicago. A particular acknowledgement goes to Eng. Marco D. Santambrogio for his advice and his precious help.


2. Classical Multiprocessor Systems

Multiprocessors are multiple-CPU computer systems that have shared memory: in a multiprocessor there is a single physical address space that is shared by all CPUs. If any CPU writes the value 81 to address 1900, any other CPU subsequently reading from its address 1900 will get the value 81. To be more precise, there are different kinds of multiprocessors, and there are also architectures that do not have a shared memory, but the definitions in these cases often differ. A computer system in which each CPU has its private memory is also called a multicomputer, while an architecture in which there is a wide-area network between the CPUs is also called a distributed system.

Figure 1 - Different multiprocessor models. Shared-memory model (a), message-passing multiprocessor with private memories (b), wide-area distributed system (c).
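The single-address-space behavior described above can be mimicked in software: threads in one process share one address space, much as CPUs in a shared-memory multiprocessor share one physical memory. A toy sketch (the list and lock are stand-ins for memory and bus arbitration, not real hardware):

```python
# Toy illustration: threads in one process share a single address space,
# like CPUs in a shared-memory multiprocessor. A write by one "CPU" is
# visible to any other that reads the same location afterwards.
import threading

memory = [0] * 4096          # the shared "physical memory"
lock = threading.Lock()      # serializes access, like bus arbitration

def cpu_write(addr, value):
    with lock:
        memory[addr] = value

def cpu_read(addr):
    with lock:
        return memory[addr]

writer = threading.Thread(target=cpu_write, args=(1900, 81))
writer.start()
writer.join()

# Any other thread ("CPU") now observes the value at address 1900.
reader_result = cpu_read(1900)
print(reader_result)  # 81
```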

The idea of using multiple processors both to increase performance and to improve availability dates back to the Sixties, when Flynn proposed his famous classification of computers that is still in use. He created four categories, analyzing the level of parallelism in the instruction and data streams called for by the instructions at the most constrained component of the multiprocessor:

- SISD: Single instruction stream, single data stream (uniprocessors).
- SIMD: Single instruction stream, multiple data streams - the same instruction is executed by multiple processors using different data streams (vector architectures).
- MISD: Multiple instruction streams, single data stream - only approximations of this type exist; no commercial multiprocessor of this kind has been built so far.
- MIMD: Multiple instruction streams, multiple data streams - each processor fetches its own instructions and operates on its own data.


This is just a simple model, and it is not rigid: some multiprocessors are hybrids of two categories. SIMD was the most appreciated model until the mid-1990s, when MIMD emerged as the model of choice for general-purpose architectures. The two main reasons are flexibility and the cost-performance trade-off. Flexibility, because MIMDs can achieve high performance on a single application, run many tasks simultaneously, or provide a combination of these functions, if well supported by hardware and software. Cost-performance, because nearly all multiprocessors built today use the same microprocessors found in workstations and single-processor servers.

MIMD multiprocessors exist in two classes, depending on their memory organization. The first class contains multiprocessors with small processor counts, for which a single centralized memory can be shared. Because there is a single main memory that has a symmetric relationship to all processors and a uniform access time from any processor, these multiprocessors are often called symmetric (shared-memory) multiprocessors (SMPs), and the style of architecture is sometimes called uniform memory access (UMA). This type of centralized shared-memory architecture is currently by far the most popular organization.

Processors and memory are interconnected by a bus. The problem with bus-based multiprocessors is their limited scalability. By replacing the single bus with multiple buses, or even a switch, a centralized shared-memory design can be scaled to a few dozen processors. One possibility is to divide the memory into modules and connect them to the CPUs with a crossbar switch. Each CPU and each memory has a connection coming out of it. When a CPU wants to access a particular memory, the crosspoint switch connecting them is closed momentarily, to allow the access to take place. The virtue of the crossbar switch is that many CPUs can access memory at the same time, although if two CPUs try to access the same memory simultaneously, one of them will have to wait. The downside of the crossbar switch is that with n CPUs and n memories, n² crosspoint switches are needed. For large n, this number can be prohibitive. Although scaling beyond that is technically conceivable, sharing a centralized memory becomes less attractive as the number of processors sharing it increases.

Figure 2 - UMA Multiprocessor using a crossbar switch
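The scalability argument can be made concrete with a rough cost count: a shared bus needs one connection per device, while a full crossbar needs one crosspoint per CPU-memory pair, so its cost grows quadratically. An illustrative sketch (the numbers are not from any specific system):

```python
# Rough cost comparison between a shared bus and a full crossbar.
# A bus needs one connection per device; a crossbar needs one crosspoint
# switch per CPU-memory pair, i.e. a quadratically growing count.
def bus_connections(n_cpus, n_memories):
    return n_cpus + n_memories        # each device taps the one shared bus

def crossbar_crosspoints(n_cpus, n_memories):
    return n_cpus * n_memories        # one switch per CPU-memory pair

for n in (4, 16, 64):
    print(n, bus_connections(n, n), crossbar_crosspoints(n, n))
```

For 64 CPUs and 64 memories the crossbar already needs 4096 crosspoints against 128 bus taps, which is why the text calls the number prohibitive for large n.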


The second class consists of multiprocessors with physically distributed memory. To support large processor counts, memory must be distributed among the processors rather than centralized. The basic architecture of a distributed-memory multiprocessor consists of individual nodes containing a processor, some memory, typically some I/O, and an interface to an interconnection network that connects all the nodes. Distributing the memory among the nodes is a cost-effective way to scale the memory bandwidth if most of the accesses are to the local memory in the node. It also reduces the latency for accesses to the local memory. The key disadvantage for a distributed memory architecture is that communicating data between processors becomes somewhat more complex and has higher latency because the processors no longer share a single, centralized memory.

There are two alternative architectural approaches that differ in the method used for communicating data among processors in a multiprocessor system whose multiple memories are physically distributed with the processors. In the first one, the physically separate memories can be addressed as one logically shared address space, meaning that a memory reference can be made by any processor to any memory location, assuming it has the correct access rights. These multiprocessors are called distributed shared-memory (DSM) architectures, referring to the fact that the address space is shared. They are also called NUMAs (nonuniform memory access), since the access time depends on the location of a word in memory. For a multiprocessor with a shared address space, there is a communication mechanism that uses the address space to communicate data implicitly via load and store operations.

Alternatively, the address space can consist of multiple private address spaces that are logically disjoint and cannot be addressed by a remote processor. The same physical address on two different processors then refers to two different locations in two different memories. Each processor-memory module is essentially a separate computer; therefore, these parallel processors have been called multicomputers. For them, communication of data is done by explicitly passing messages among the processors.

Figure 3 - Network interface boards in a multicomputer
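The explicit message-passing style described above can be emulated in software: each "node" keeps private state, and data only moves by being copied through a channel. A toy sketch, with threads and a queue standing in for the processor-memory nodes and the interconnection network:

```python
# Toy emulation of explicit message passing between two processor-memory
# "nodes": each node has private state, and data moves only via messages.
# Threads and a queue stand in for processors and the interconnect.
import threading
import queue

def node_a(outbox):
    private_memory = {"result": 6 * 7}                # local to node A only
    outbox.put(("result", private_memory["result"]))  # explicit send

def node_b(inbox, received):
    key, value = inbox.get()      # explicit receive
    received[key] = value         # copied into node B's own memory

channel = queue.Queue()           # the interconnection network
received = {}
ta = threading.Thread(target=node_a, args=(channel,))
tb = threading.Thread(target=node_b, args=(channel, received))
ta.start(); tb.start(); ta.join(); tb.join()
print(received)  # {'result': 42}
```

Note that node B never addresses node A's memory directly; it obtains the value only as a copy carried by the message, which is exactly what distinguishes a multicomputer from a DSM machine.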


3. Embedded Multiprocessor Systems

Need for High Performance

Embedded systems for applications such as video streaming require very high MIPS ratings, of the order of a billion operations per second, which cannot be obtained from a single on-chip signal processor. The N-Gage system from Nokia combines a cell phone, mini-disc and MP3 players, and a game player into a single system. Personal hand-held video players can display a movie, TV show, or music video on their 2.5-inch display. Set-top boxes can provide access to digital TV and related interactive services, serve as a gateway to the Internet, and act as a hub for a home network. For applications such as these, system architects resort to multiprocessor architectures to get the required performance. What has made this choice possible is the power granted by VLSI system-on-chip technology, which allows the logic of several instruction-set processors and several megabytes of memory to be integrated in the same package. There are also software and system-design issues that make a multiprocessor solution attractive. There are numerous VLSI design challenges that a design team may find daunting when faced with the problem of designing a high-performance system-on-chip (SoC): verification, logic design, physical design, timing analysis, and timing closure are some of these issues. The ways to harness performance in a single-processor alternative are superscalar computing and very long instruction word (VLIW) processors. Compilers written for such processors have a limited scope for extracting the parallelism in applications. To increase the compute power of a processor, architects make use of sophisticated features like out-of-order execution and speculative execution of instructions. These kinds of processors dynamically extract parallelism from the instruction sequence. However, the cost of extracting parallelism from a single thread is becoming prohibitive, making a "single complex processor" alternative unattractive.

Advantages of On-Chip Multiprocessor Architectures

As the applications loaded into system-on-a-chip architectures become more and more complex, it is extremely important to have sufficient compute power on the chip. One way of achieving this is to put multiple processor cores on a single chip. This strategy has advantages over the alternative of putting one more powerful and complex processor on the chip. First, the design of an on-chip multiprocessor composed of multiple simple processor cores is simpler than that of a complex single-processor system. Note that this simplicity also helps reduce the time spent in verification and validation. Second, an on-chip multiprocessor is expected to result in better utilization of the silicon space: the extra logic that would be spent on register renaming, instruction wake-up, and register bypass in a complex single processor can be spent on providing higher bandwidth in an on-chip multiprocessor.


Third, the on-chip multiprocessor architecture can exploit loop-level parallelism at the software level in array-intensive applications. In contrast, a complex single-processor architecture needs to convert loop-level parallelism into instruction-level parallelism at runtime (that is, dynamically) using sophisticated (and power-hungry) strategies. During this process, some loss in parallelism is inevitable. An on-chip multiprocessor improves the execution time of applications using on-chip parallelism. An application program can be made to run faster by distributing its work over the multiple processors of the on-chip multiprocessor. There is always some part of the program's logic that has to be executed serially, by a single processor; however, in many applications it is possible to parallelize some parts of the program code. Suppose, for example, that there is one loop in the code where the program spends 50% of its execution time. If the iterations of this loop can be divided across two processors, so that half of them are done in one processor while the other half are done at the same time in a different processor, the whole loop can be finished in half the time, resulting in a 25% reduction in overall execution time.

VLSI Design Issues

The cost of designing a multiprocessor system-on-chip, where the processors work at moderate speeds and the system throughput is multiplied by the number of processors, is smaller than that of designing a single processor which works at a much higher clock speed. This is due to the difficulty of handling the timing-closure problem in an automated design flow. The delays due to parasitic resistance, capacitance, and inductance of the interconnect make it difficult to predict critical-path delays accurately during logic design. Physical designers attempt to optimize the layout subject to the interconnect-related timing constraints posed by the logic design phase. Failure to meet these timing constraints results in costly iterations of logic and physical design. These problems have only been aggravated by the scaling down of technology, where tall, thin wires run close to one another, resulting in crosstalk. Voltage drop across the resistance of the power supply rails is another potential cause of timing and functional failures in deep-submicron integrated circuits. Although custom design may be used for some performance-critical portions of the chip, today it is quite common to employ automated logic-synthesis flows to reduce the front-end design cycle time. The success of logic synthesis, both in terms of timing closure and optimization, depends critically on the constraints specified during synthesis. These constraints include timing constraints, area constraints, load constraints, and so on. Such constraints are easier to provide when a hierarchical approach is followed and smaller partitions are identified. The idea of using multiple processors as opposed to a single processor is more attractive in this scenario. Another benefit that comes from a divide-and-conquer approach is concurrency in the design flow. A design that can naturally be partitioned into sub-blocks such as processors, memories, application-specific processors, etc. can be design-managed relatively easily. Different design teams can concurrently address the design tasks associated with the individual sub-blocks of the design. When a design has multiple instances of a common block, such as a processor, the design team can gain significantly in terms of design cycle time.
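The loop-parallelization arithmetic used earlier (a loop taking 50% of the run time split across two processors yields a 25% overall reduction) is an instance of Amdahl's law: if a fraction p of the execution time is parallelizable over n processors, the new time is (1 - p) + p/n of the original. A minimal sketch:

```python
# Amdahl's law applied to the loop example: a fraction p of the run
# time is parallelized over n processors, the rest stays serial.
def relative_runtime(p, n):
    """New execution time as a fraction of the original."""
    return (1.0 - p) + p / n

# 50% of the time spent in one loop, split across two processors:
t = relative_runtime(0.5, 2)
print(t)  # 0.75, i.e. a 25% reduction in execution time

# The serial half caps the benefit: even 1000 processors give < 2x speedup.
print(1.0 / relative_runtime(0.5, 1000))
```

This also shows why limiting the serial fraction matters more than adding cores: the serial part bounds the achievable speedup no matter how many on-chip processors are available.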


Processor and Memory Architecture

A wide variety of choices exists today for selecting the embedded processor, and the selection is primarily guided by considerations such as overall system cost, performance constraints, power dissipation, system and application software development support that permits rapid prototyping, and the suitability of the instruction set to the embedded application. Code density, power, and performance are closely related to the instruction set of the embedded processor; compiler optimizations and application software programming style also play a major role. RISC, CISC, and DSP are the three main categories of processors available to a designer. Some design decisions that must be made early in the design cycle of the embedded system are:

- General-purpose processors versus application-specific processors for compute-intensive tasks such as video/audio processing
- Granularity of the processors: a small set of powerful CPUs versus a large number of less powerful processors
- Homogeneous versus heterogeneous processing
- Reusing an existing CPU core versus architecting a new processor
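Decisions like these span a combinatorial design space, and the number of candidate configurations to evaluate grows multiplicatively with each axis. A toy enumeration (the option labels are illustrative stand-ins, not values from the text):

```python
# Toy design-space enumeration: each early design decision is an axis,
# and every combination of options is one candidate configuration.
# The option labels are illustrative, not from any real project.
from itertools import product

design_axes = {
    "processor_type": ["RISC", "CISC", "DSP"],
    "granularity": ["few powerful CPUs", "many simple CPUs"],
    "composition": ["homogeneous", "heterogeneous"],
    "cpu_core": ["reuse existing", "new design"],
}

candidates = [dict(zip(design_axes, combo))
              for combo in product(*design_axes.values())]

print(len(candidates))  # 3 * 2 * 2 * 2 = 24 configurations to evaluate
```

Even this tiny example yields 24 configurations; adding the memory parameters discussed next multiplies the space further, which is why the text calls for a systematic exploration approach.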

The memory architecture of the MPSoC is equally critical to the performance of the system, since most embedded applications are data-intensive. In current MPSoC architectures, memory occupies 50% of the die area; this number is expected to increase to 70% by 2005 and to 92% by 2014. Given the numerous choices a system architect has for memory architectures, a systematic approach is necessary for exploring the solution space. Variations in memory architecture come from the choice of sharing mechanism (distributed shared memory or centrally shared memory), the ways to improve the memory bandwidth, the type of processor-memory interconnect network, the cache coherence protocol, and the memory parameters (cache size, type of memory, number of memory banks, size of the memory banks). Most DSP and multimedia applications require very fast memory close to the CPU that can provide at least two accesses per processor cycle.

Interconnection Architecture

An efficient interconnection architecture is also necessary; this covers interprocessor communication, communication between processors and peripherals, and communication between memories and processors/peripherals. The major considerations in designing the interconnection architecture are propagation delay, testability, layout area, and expandability. Bus-based interconnection schemes continue to remain popular in today's embedded systems, since the number of processors/peripherals in these systems is still quite small. Buses, however, do not scale well in terms of performance as the number of master and slave processors connected to the bus increases. Assuming that Moore's law will continue to hold for several years to come, one can expect a very large number of processors, memories, and peripherals to be integrated on a single SoC in the future. Bus-based interconnection architectures will not be appropriate in such systems. Given the problems that VLSI design engineers already face in closing timing, one can expect these problems to escalate further in future systems where the number of connections will be very high. A modular approach to interconnections will therefore be necessary.

Optimizing Applications

Compilers and other software development support play an important role in selecting the processor for an embedded application. Compiler optimizations are important for optimizing code size, performance, and power. While compiler optimizations are useful in the final phase of software development, a significant difference in the quality of the software comes from the programming style and the software architecture itself. Developing an application for a multiprocessor SoC poses several challenges:

- Partitioning the overall functionality into several parallel tasks
- Allocation of tasks to the available processors
- Scheduling of tasks
- Management of interprocessor communication

Identifying the coarse-grain parallelism in the target application is a manual task left to the programmer. For example, in video applications the image is segmented into multiple macro-blocks (16x16 pixels) and each segment is assigned to a processor for computation. Fine-grain parallelism in instruction sequences can be identified by compilers. The other key challenge in optimizing an application for a multiprocessor SoC is to limit the number of messages between processors and the number of shared-memory accesses. The overall throughput and speedup obtainable through the multiprocessor solution can be marred by an excess of shared-memory accesses and interprocessor communications. Performing worst-case analysis of task run times and interprocessor communication times, and guaranteeing real-time performance, are also challenges in optimizing an application for a multiprocessor SoC.
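The macro-block partitioning mentioned above can be sketched as a simple round-robin allocation of 16x16 blocks to processors; the frame size and processor count below are arbitrary example values, not from the text:

```python
# Round-robin assignment of 16x16 macro-blocks of a video frame to
# processors. Frame dimensions and processor count are example values.
MB = 16  # macro-block edge in pixels

def assign_macroblocks(width, height, n_procs):
    """Map each macro-block, identified by (row, col), to a processor id."""
    cols, rows = width // MB, height // MB
    assignment = {}
    for r in range(rows):
        for c in range(cols):
            assignment[(r, c)] = (r * cols + c) % n_procs
    return assignment

plan = assign_macroblocks(640, 480, 4)  # 40 x 30 = 1200 macro-blocks
per_proc = [sum(1 for p in plan.values() if p == i) for i in range(4)]
print(len(plan), per_proc)  # 1200 [300, 300, 300, 300]
```

A real allocator would also weigh the communication cost at block boundaries, since, as noted above, excessive interprocessor traffic can erase the speedup the partitioning buys.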


4. Embedded Multiprocessor Case Studies

Multiprocessor Architectures for Embedded Cartographic Systems

Embedded systems with complex graphical interfaces require significant computational power. Moreover, low power consumption and low cost are usually strict specification constraints. A possible solution for addressing these conflicting needs is the adoption of a simple multiprocessor on a single chip, using low-cost CPU cores. Bechini and Prete present a methodology for selecting and tuning the multiprocessor architecture for a cartographic system. They consider cartographic systems to be deployed on hand-held devices with an LCD display, supporting GPS as well. The design process takes into account the specific behavior of cartographic software in terms of its use of system resources (CPU, memory, LCD, other peripheral devices). They first define the system workload, pointing out the relevant software activities and their dependence on the input data. They then define a pool of eligible architectures, compare and analyze the simulation results, and select an architecture from the pool according to given criteria.

The main time-critical activity for the system is plotting the map on the LCD screen. This activity has to be dealt with carefully, because it directly impacts the user of the hand-held device. At the same time, the cyclic execution of the GPS algorithms takes place: it is in charge of updating the current geographical position of the device. Once the workload has been roughly defined, it has to be characterized better by investigating the input domain. Specifically, what has to be found out is which input data cause the heaviest computational load. Particular attention must be devoted to choosing the map portion that yields the longest execution time, in order to treat it as the worst-case input for the system. Moreover, a map of medium complexity can be used to characterize the average operating conditions of the system.

Figure 4 - The plotting time on the LCD display depends also on the map complexity. The area of each bubble in the diagram is proportional to the plotting time in the corresponding position on the map.


The following observations help in populating a pool of eligible architectures:

- Symmetric vs. asymmetric architectures. Considering that cartographic applications are continuously evolving, the lack of flexibility hampers the adoption of an asymmetric architecture. On the other hand, a solution with anonymous processors is more suitable to host updated versions of the cartographic applications. Even if an asymmetric architecture could yield impressive performance, it cannot be taken into account, for maintenance and ease-of-deployment reasons.

- Bus width and number of buses. An important design choice is the bus width. It is worth adopting a wide bus only if it gives considerably better performance than narrower ones. The bus widths analyzed are 16, 32, and 64 bits. The adoption of a 64-bit bus does not yield any special payoff. The main advantage in choosing a 16-bit bus is the silicon saving; on the other side, a 32-bit bus gives a higher tolerance to the LCD refresh traffic and better performance (12% more with respect to the 16-bit bus). Whenever a single internal bus is used in the system, the traffic due to LCD refreshing overlaps with the traffic towards main memory, which might cause bus congestion. Thus, to avoid the time overhead due to bus contention, architectures with a dedicated on-chip bus can also be taken into consideration, by designing a separate bus that mainly supports the LCD traffic. However, preliminary simulations show that such bus splitting does not yield a significant performance improvement: the speedup in plotting time obtained in this way is less than 4% in every architecture of interest, while the cost in terms of chip die size becomes unacceptable. From these observations it can be seen that, for the selected workload, the bus system does not represent a real bottleneck, neither in the number of buses nor in their width. Thus one can try to improve the overall performance by designing proper caching schemes.

- Searching for possible caches. A crucial role in the architecture design is played by the caching scheme. Caches, in order to be really useful and effective, need to be carefully tuned. Many architectures in the pool differ from each other only in the caching scheme; moreover, the same architectural structure can be used for several solutions that differ only in the values of the cache parameters. A first result from the simulations is that it is not possible to make the system respect the specified timings without memory caching. Through simulation, the software process causing the highest number of cache misses can be determined; that process is then used for tuning the cache parameters. Considering the speedup due to the introduction of different types of caches, a range of possible values for the cache parameters can be selected. After these considerations, the resulting ranges selected for the chipset are: size 4-8 KB; block size 8-16 bytes; associativity 4 ways.

Once the workload and the pool of eligible architectures have been defined, trace-driven simulation allows the performance of the cartographic applications to be evaluated over the pool. This process is aimed at gathering information on the performance-related aspects of each eligible solution. In the evaluation of the chip architecture, the architecture of the different systems it will be plugged into must also be considered. In fact, the access timings of all the devices external to the chip (e.g., RAM, controllers, etc.) affect the overall system performance (even if the cache memory usually has a strong decoupling effect). For this reason, the behavior of each eligible solution has been simulated on two different products, employing in the first case (low-end product) cheap RAM and a small LCD display, and in the second case (high-end product) a quicker memory and a wider display. The selection of the most appropriate solution is carried out using given performance indexes. In this particular case, the simplest performance index is the time spent redrawing a map on the LCD display. This index can be measured for the plotting of both the most complex and the typical map, according to the worst and average test cases chosen in the earlier workload-definition phase. In this particular situation, the quickest redraw time is obtained by the same architecture for both the average and the most complex map portion. There are many other issues to take into account in selecting a particular architecture, and most of them have to do with production costs; the corresponding chip area is one of the most important. Even if it is very difficult to predict the exact chip dimensions from the plain architectural scheme, it is possible to estimate them using heuristic methods, starting from the actual cache parameters. The architecture actually selected for the chipset has two CPU cores with private 8 KB caches, 4-way set associative, with 16-byte blocks and a copy-back write policy. It was chosen not only because of its quick redraw time, but also because of its simplicity with respect to other architectures with analogous performance, and its reasonable occupancy in terms of chip die size.
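The selection step just described amounts to ranking the simulated candidates by the performance index (redraw time) and breaking ties by a cost estimate such as die area. A toy sketch; all the candidate names and figures below are invented for illustration, not data from the Bechini and Prete study:

```python
# Toy architecture selection: rank simulated candidates by worst-case
# redraw time, breaking ties by estimated die area. All numbers are
# invented for illustration; they are not the paper's results.
candidates = [
    # (name, worst-case redraw time in ms, estimated die area in mm^2)
    ("1 CPU, no cache",     520.0, 18.0),
    ("2 CPUs, 4KB caches",  210.0, 24.0),
    ("2 CPUs, 8KB caches",  180.0, 27.0),
    ("4 CPUs, 8KB caches",  180.0, 41.0),
]

best = min(candidates, key=lambda c: (c[1], c[2]))
print(best[0])  # the two-CPU solution ties on time but costs less area
```

This mirrors the reasoning in the text: the selected architecture was not only the quickest, but also the simplest and smallest among those with analogous performance.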

Figure 5 - The architecture selected for the chipset.

The components placed outside the chip borders may have different response (or access) times, and this issue severely influences the system performance. For this reason, the external components have to be properly modeled within the simulation environment as well.

The module called “C-card” is a special purpose device for storing digital maps.


A Communication Layer for Embedded Multiprocessors

The application domain for embedded systems is rapidly expanding. Embedded systems are now also used for multimedia processing, which requires a high level of sustained performance; on the other hand, the lifetime of applications is decreasing. This calls for a hybrid solution, in which a general-purpose processor is used for flexibility and multiple embedded processors (DSPs) for high-throughput signal processing. The embedded processors can be configured as a pipeline, or as a processor farm. The latter architecture is often implemented by connecting all processors (host and DSPs) to a shared bus or network. In such a system, also called an embedded multiprocessor system, all processors can communicate with each other, which greatly enhances flexibility.

To ease application development for embedded multiprocessor systems, it is crucial to develop compiler and runtime-system support to handle difficult and error-prone tasks like processor synchronization and data transport between heterogeneous processors (e.g., host to DSP). Many parallel programming systems exist, ranging from library-based systems to language-based systems. The latter are easy to program because the parallelizing compiler hides the complex interface to the parallel hardware; their performance, however, is often compromised by the layered approach (OS/runtime-system/compiler) taken to achieve portability. The alternative of writing explicitly parallel programs using communication libraries is not very appealing because of the large penalty in development costs (parallel programming at such a low level is difficult and error-prone). Another approach consists of building a complete tool chain for application development targeting embedded multiprocessors, combining ease of programming (compiler) and application performance (efficient DMA-based communication).
In this case study, they support data-parallel (SPMD) programming through the Spar/Java language, a Java derivative with explicit support for scientific computations. The Spar/Java compiler recognizes annotations for data and code placement, and automatically generates a parallel program with explicit communication. On embedded systems, communication is handled by the ensemble layer, which has been explicitly designed to take advantage of the hardware capabilities (i.e., DMA engines) present in embedded multiprocessors. Furthermore, they tightly integrated the compiler and ensemble to overcome the performance penalties associated with portable communication libraries.

Most embedded processors are capable of initiating asynchronous DMA transfers, so that communication and processing can be overlapped. In ensemble they exploit this feature by overlapping simultaneous DMA transfers with buffer packing and unpacking as much as possible. The communication performed by data-parallel Spar/Java programs is implicit: whenever a processor references data that resides on another processor, the compiler generates a communication event. The performance of such element-wise communication is poor; therefore, the compiler uses message aggregation to send multiple data elements in a single message. With ensemble they are able to overlap message aggregation (computation) and communication, which is not possible when the Spar/Java compiler targets traditional message-passing libraries like PVM and MPI. The Spar/Java compiler generates C++ code with explicit send and receive primitives, and performs a sophisticated analysis to identify opportunities for message aggregation. Furthermore, the Spar/Java compiler determines in which order the messages are processed. If messages happen to arrive in a different order at runtime,
they cannot be processed immediately, but must be buffered. To avoid unnecessary waiting, messages should be processed on a first-come first-served basis. Overlapping message aggregation and communication requires a tight integration: either the compiler must be made communication-aware (e.g., address fragmentation for pipelining), or the message-passing layer must provide a higher-level interface (e.g., scatter-gather message vectors).
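The buffering issue above can be sketched in a few lines (illustrative code, not ensemble's actual implementation): when messages must be handled strictly in the compiler-determined order, any message that arrives early has to sit in a buffer until its turn comes.

```python
# Illustrative sketch: process messages only in the compiler-determined order
# ("expected"), buffering any message that arrives ahead of its turn, and
# draining the buffer as soon as the blocking message arrives.

def process_in_fixed_order(expected, arrivals):
    """Return the processing order; early arrivals wait in `buffered`."""
    buffered, processed, queue = set(), [], list(expected)
    for msg in arrivals:
        if queue and msg == queue[0]:
            processed.append(msg)
            queue.pop(0)
            # drain any earlier arrivals that are now next in order
            while queue and queue[0] in buffered:
                buffered.remove(queue[0])
                processed.append(queue.pop(0))
        else:
            buffered.add(msg)
    return processed

# "b" and "c" arrive before "a" and must wait until "a" is processed
order = process_in_fixed_order(["a", "b", "c"], ["b", "c", "a"])
```

A first-come first-served scheme would instead process "b" and "c" immediately, which is exactly the waiting the registration-execution mechanism tries to avoid.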

With ensemble they take the latter approach and operate on (n-dimensional) data descriptors instead of contiguous buffers. Furthermore, they require that all sends and receives are registered before invoking ensemble to perform the actual data transfers. This registration-execution mechanism provides ensemble with the opportunity to schedule data transfers dynamically to match availability of data at source processors.

Both source and destination of a data transfer in ensemble must be specified using 'data descriptors'. These allow a variety of often-occurring data access patterns to be specified as either the source or destination of a transfer. The data descriptors used in ensemble are based on the concept of selecting the elements of a one-dimensional array with index-values ranging from a lower-bound L up to an upper-bound U, each time incrementing the index with a stride S.
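The (L, U, S) selection above can be sketched as follows (function names are hypothetical, not the actual ensemble interface; n-dimensional descriptors nest this pattern per dimension):

```python
# Illustrative sketch of a one-dimensional data descriptor: select indices
# from lower bound L up to upper bound U, stepping by stride S, and gather
# the selected elements into a contiguous message buffer.

def descriptor_indices(lower, upper, stride):
    """Yield the selected index values L, L+S, ..., up to and including U."""
    return range(lower, upper + 1, stride)

def pack(array, descriptor):
    """Gather the elements selected by (L, U, S) into a contiguous buffer."""
    lower, upper, stride = descriptor
    return [array[i] for i in descriptor_indices(lower, upper, stride)]

src = list(range(20))
msg = pack(src, (2, 14, 4))   # selects indices 2, 6, 10, 14
```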

The goal of ensemble is to overlap message aggregation (computation) and DMA transfers (communication). This calls for a pipelined approach with three stages: packing (at the source processor), transmitting (DMA engine at the source), and unpacking (at the destination processor). To reduce startup costs they use relatively small buffers that are passed down the pipeline; each DMA across the PCI bus transfers 4 KB (or less for the last fragment) of data. These transfer buffers are also the basic unit of flow control between sender and receiver. Flow control is implemented per sender/receiver pair.
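The fragmentation step described above can be sketched as follows (the 4 KB figure comes from the text; all names are illustrative):

```python
# Illustrative sketch of the flow-control unit described above: a message is
# split into transfer buffers of at most 4 KB each, which are then passed
# down the pack -> DMA -> unpack pipeline. The last fragment may be shorter.

FRAGMENT_SIZE = 4 * 1024   # bytes moved by each DMA across the PCI bus

def fragments(message, size=FRAGMENT_SIZE):
    """Yield successive fragments of at most `size` bytes."""
    for offset in range(0, len(message), size):
        yield message[offset:offset + size]

parts = list(fragments(bytes(10000)))   # three fragments: 4096, 4096, 1808 bytes
```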

They implemented ensemble (and the Spar/Java compiler) on a heterogeneous multiprocessor system consisting of one host CPU (AMD Athlon) and three multimedia DSPs (Philips TriMedia) connected by a PCI bus. The Spar/Java compiler generates C++ code, which is then compiled for the Athlon and cross-compiled for the TriMedia. The Athlon executable downloads the TriMedia code onto the three embedded processors and initiates execution. The three TriMedia processors perform the actual parallel computation and return their output to the Athlon. Performance measurements show that ensemble adds little overhead to the raw DMA speed. When supporting Spar/Java, the actual communication speed largely depends on the layout of the data elements involved; the pack and unpack routines are sensitive to how well the cache handles strided accesses. They determined that overlapping message aggregation and communication increases performance by up to 39% for peer-to-peer communication, and by up to 34% for all-to-all communication. They anticipate an even larger benefit when the Spar/Java compiler implements shrinking to store distributed arrays compactly.


An Optimal Memory Allocation for SoC Multiprocessors

The design of modern digital systems is influenced by technological and market trends, including the ability to produce ever more complex chips with increasingly shorter time-to-market. Time-to-market considerations have driven the need for automation in the embedded systems-on-chip (SoC) design industry. Although a completely manual approach relying on an expert designer's intuition helps reduce chip area and improve performance and power metrics, the time spent by a human designer to optimize the complex chips designed today is prohibitively high, and the manual approach becomes impractical.

In embedded systems the memory architecture can be chosen more or less freely; it is constrained only by the application requirements. Different choices can lead to solutions with very different costs, which means that it is important to make the right choice. For this reason, the allocation of the memory blocks is a major step in the SoC design flow. The goal of memory allocation and assignment is to exploit this freedom in the memory architecture to minimize the cost related to background memory storage and transfers. Many applications in fields such as multimedia (audio and video) and image processing handle bulky and strongly dependent data. They consequently require the integration of a great number of memories of various types (local private, local distributed, and on-chip global shared memory). Moreover, up to 70% of the chip area is dedicated to memory. Unfortunately, there is currently no complete and automatic method allowing designers to integrate all these memory types (particularly the shared memory) in the SoC from a high abstraction level.
To provide designers with a global and fast method, as well as tools to design these systems within time-to-market constraints, an optimal approach can be used to solve this problem, minimizing the memory architecture area and the global access time to the shared data of the application. Multiprocessor SoCs integrate more and more elements, and the description of such a system at the architecture level can reach 200K lines of code (SystemC, C, VHDL). Therefore, it is better to insert into the memory design flow a stage of automatic application-code transformation that takes the chosen memory architecture into account, which is very beneficial to designers from the time-to-market point of view.

System design of a multiprocessor can start with the system specification of a given application with system-level communication (written in SystemC, for example) and an abstract multiprocessor architecture. Processors and communication components are allocated, and system behavior and communication (ports and channels) are mapped/scheduled by the designer onto processors and communication components of the architecture template. After the allocation/mapping/scheduling step, a system specification at the architecture level is obtained. For each processor, the software code (operating system and application code of tasks) is generated. Interfaces between processors and the communication network are also generated. The micro-architecture of the system is thus obtained.

Typical applications using shared memory are signal, data, and image processing, multimedia, and all distributed applications dealing with a large amount of data. In a design methodology in which each processor is connected to the communication network by an interface, processors have their own local memory, and to share data they need to send it to the others via interfaces. The idea is to allow designers to insert shared memory in the platform architecture as an IP. This means that an interface is
introduced between the communication network and the memory. This interface is both a protocol adapter and a memory controller.

Figure 6 - The design flow.

Starting from a system-level specification and an architecture model, they suppose that processor and task allocation is done by the designer. The first step is to decide whether the application needs a shared memory: this is the memory allocation step. The architecture refinement is then performed to insert this shared memory, yielding an architecture with interconnected processors, IPs, and memories. The application code may be transformed if necessary.

The next step is memory assignment: they have to decide the best address for each datum in the shared memory. For data in local memory, compilers are left to do this work. The output of this assignment step is a new application code and an allocation table in which each variable is assigned to one memory (and, for each variable in the shared memory, to one address). Memory synthesis consists of using existing memory blocks from a library. Hardware/software targeting consists of the generation of interfaces between processors and the communication network, and of software generation.

The memory allocation flow takes as input a system-level specification of the application (after processor allocation), a generic architecture model, and libraries containing the estimated access time of each processor to the memories and the memory costs. This flow is mainly composed of three parts. The first consists of extracting parameters from the application code. The second carries out the memory allocation using an automatically generated integer linear program. The third rewrites the read/write primitives of the shared data in the application code (taking the memory allocation results into account), and generates an architecture-level description of the application.

The parameter extraction stage consists of extracting from the system-level description of the application some information about the handled data, such as their names, sizes (types) and the use of the
communication channels. The allocation stage uses the parameters extracted in the previous step and the results of the system-level simulation of the application, and automatically generates an integer linear program. This program gives an exact solution to the memory block allocation problem, minimizing the memory cost and the global time to access the shared data in the application.

Figure 7 - The memory design flow.

Modeling the memory allocation problem with an integer linear programming approach presents some major advantages:

- it is an exact method which, contrary to heuristic-based methods, gives an optimal solution;
- it is a very generic model which allows the integration of all the memory types (local private, local distributed, and global shared memories) in the architecture;
- it resolves two problems: the allocation of the memory blocks, and the assignment of data to these blocks;
- there are many available tools which permit the resolution of such a model.

Since some variables in the model are Boolean, the resolution step can be slow, depending on the number of such variables. Therefore, for applications integrating many processors, the use of stochastic methods instead of the linear model is recommended.
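As a toy illustration of the model discussed above (all data names, memories, and costs are invented for the example), the 0/1 assignment can be written down and, for a tiny instance, solved by enumerating the Boolean variables; the exponential enumeration also makes it clear why resolution slows down as their number grows. A real flow would hand the generated ILP to a dedicated solver.

```python
from itertools import product

# Toy 0/1 allocation model (names and costs invented for illustration):
# each shared datum is assigned to exactly one memory block, minimizing a
# weighted sum of memory area cost and access time.

data = ["frame", "coeffs", "table"]      # shared data items (hypothetical)
memories = ["local", "shared"]           # candidate memory blocks
cost = {                                 # (area cost, access time) per pair
    ("frame", "local"): (8, 1),  ("frame", "shared"): (4, 3),
    ("coeffs", "local"): (2, 1), ("coeffs", "shared"): (1, 3),
    ("table", "local"): (6, 1),  ("table", "shared"): (3, 3),
}

def objective(assignment):
    """Equal-weight sum of the area and access-time costs of an assignment."""
    return sum(sum(cost[(d, m)]) for d, m in assignment.items())

# Enumerate every Boolean assignment: exponential in the number of data items,
# which is why large instances need a real ILP solver or stochastic methods.
best = min(
    (dict(zip(data, choice)) for choice in product(memories, repeat=len(data))),
    key=objective,
)
```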


5. Conclusions

In this work the field of embedded multiprocessor systems has been analyzed, starting from a comparison between classical multiprocessors and embedded ones, which pointed out that embedded multiprocessors are convenient in terms of high performance and exploitation of loop-level parallelism. Embedded multiprocessors combine the key advantages of multiprocessors with the application-specific optimizations of embedded systems, so they represent the preferred architecture for applications such as video/audio streaming. A deep analysis of the state of the art of embedded multiprocessors through some case studies showed that if, on the one hand, embedded multiprocessor systems seem capable of meeting the demand for processing power and flexibility of complex applications, on the other hand such systems are very complex to design and optimize, so that the design methodology plays a major role in determining the success of the products.


6. Bibliography

M. Shalan, V. Mooney, "Hardware Support for Real-Time Embedded Multiprocessor System-on-a-Chip Memory Management", CODES'02.
M. Kandemir, J. Ramanujam, A. Choudhary, "Exploiting Shared Scratch Pad Memory Space in Embedded Multiprocessor Systems", DAC 2002.
S. Cadot, F. Kuijlman, K. Langendoen, K. van Reeuwijk, H. Sips, "ENSEMBLE: A Communication Layer for Embedded Multiprocessor Systems", LCTES 2001.
A. Bechini, C. A. Prete, "Evaluation of On-Chip Multiprocessor Architectures for an Embedded Cartographic System", 2001.
S. Meftali, F. Gharsalli, F. Rousseau, A. A. Jerraya, "An Optimal Memory Allocation for Application-Specific Multiprocessor System-on-Chip", ISSS'01.
I. Kadayif, M. Kandemir, U. Sezer, "An Integer Linear Programming Based Approach for Parallelizing Applications in On-Chip Multiprocessors", DAC 2002.
F. Gharsalli, S. Meftali, F. Rousseau, A. A. Jerraya, "Automatic Generation of Embedded Memory Wrapper for Multiprocessor SoC", DAC 2002.
C. P. Ravikumar, "Multiprocessor Architectures for Embedded System-on-chip Applications", VLSID 2004.
D. Sciuto, F. Salice, L. Pomante, W. Fornaciari, "Metrics for Design Space Exploration of Heterogeneous Multiprocessor Embedded Systems", CODES'02.