Design of a Gigabit Router Packet Buffer using DDR SDRAM Memory

Master's thesis in Computer Engineering
at Linköping Institute of Technology

by

Daniel Ferm

LITH-ISY-EX--06/3814--SE

Supervisor: Andreas Ehliar

Examiner: Dake Liu

Linköping 2006

Division, Department: Institutionen för Systemteknik, 581 83 Linköping

Date: 1 March 2006

ISRN: LITH-ISY-EX--06/3814

URL for electronic version: http://www.ep.liu.se/

9th March 2006

Title: Design of a Gigabit Router Packet Buffer using DDR SDRAM Memory

Swedish title: Design av en Packetbuffer för en Gigabit Router användandes DDR Minne

Author: Daniel Ferm

Keywords: DDR, SDRAM, memory, FPGA, ethernet, router, socbus

Abstract

The computer engineering department at Linköping University has a research project which investigates the use of an on-chip network in a router. There has been an implementation of it in an FPGA and for this router there is a need for buffer memory. This thesis extends the router design with a DDR memory controller which uses the features provided by the Virtex-II FPGA family.

The thesis shows that by carefully scheduling the DDR SDRAM memory, high-volume transfers are possible and the memory can be used quite efficiently despite its rather complex interface.

The DDR memory controller developed is part of a packet buffer module which is integrated and tested with a previous, slightly modified, FPGA-based router design. The performance of this router is investigated using real network interfaces and, due to the poor network performance of desktop computers, special hardware is developed for this purpose.


Abbreviations

AFIFO Asynchronous FIFO

ARP Address Resolution Protocol, protocol for IP to MAC address translation on Ethernet

ASIC Application Specific Integrated Circuit

CAS Column Address Strobe, control signal for SDRAM

CPU Central Processing Unit, processor

CRC Cyclic Redundancy Check, an algorithm used for error detection

CS Chip Select, control signal for SDRAM

DCM Digital Clock Manager, a clock manipulating hardware in FPGA

DDR Double Data Rate

DRAM Dynamic RAM, a type of memory

FIFO First In First Out, a type of buffer

FPGA Field Programmable Gate Array, a type of programmable circuit

IOB Input Output Block, part of FPGA that handles communication outside of it

IP Internet Protocol, network protocol used on Internet

IPv4 Internet Protocol version 4

MAC address Hardware address used in Ethernet

MTU Maximum Transmission Unit, the largest packet a network can transfer

RAM Random Access Memory

RAS Row Address Strobe, control signal for SDRAM

SDRAM Synchronous DRAM, clocked DRAM

WE Write Enable, control signal for SDRAM


Contents

1 Introduction
  1.1 Background
  1.2 Objective
    1.2.1 Primary Requirements
    1.2.2 Secondary Requirements
  1.3 Reading Instructions
    1.3.1 Thesis Outline
  1.4 Method

2 Memories
  2.1 SRAM
  2.2 DRAM
    2.2.1 SDRAM
    2.2.2 DDR SDRAM

3 Virtex-II FPGAs
  3.1 DCMs
    3.1.1 Clock De-skew
    3.1.2 Variable Phase Shift
    3.1.3 Statically Phase Shifted Clock Outputs
    3.1.4 Frequency Altered Outputs
  3.2 DDR IOBs
  3.3 Global Clock Network
  3.4 Block RAMs
    3.4.1 Asynchronous FIFOs

4 Router Design
  4.1 Packet Path: Old Router
  4.2 Packet Path: New Router
  4.3 Router Blocks
    4.3.1 Input Module
    4.3.2 Output Module
    4.3.3 Routing Table
    4.3.4 Packet Buffer
    4.3.5 Socbus

5 Router Memory Usage
  5.1 Packet Identifiers
  5.2 Memory Storage Scheme
  5.3 Packet to Memory Mapping
  5.4 Packet Identifier Format

6 Packet Buffer Design
  6.1 DDR Controller Selection
    6.1.1 Available Controllers
    6.1.2 Custom Controller
  6.2 The Controller
    6.2.1 Startup Controller
    6.2.2 Primary Controller
    6.2.3 Secondary Controllers
  6.3 Memory Interface
  6.4 Router Interface
    6.4.1 Input Connections
    6.4.2 Output Connections
    6.4.3 Route Connection
  6.5 Buffer Memory
    6.5.1 Buffer Memory Usage

7 Verification, Testing and Debugging
  7.1 Simulation
  7.2 Logic Analyzer Hardware Debugging
  7.3 Packet Buffer Verification and Speed
    7.3.1 Packet Generators
    7.3.2 Route Generator
  7.4 Status Registers/Counters
    7.4.1 Serial Connection
  7.5 Testing in Real Network
    7.5.1 Dedicated Hardware Packet Generators

8 Results
  8.1 Controller Efficiency
    8.1.1 Different Buffer Sizes
    8.1.2 Final Packet Buffer Design
    8.1.3 Theoretical Ethernet Maximal Throughput
    8.1.4 Packet Buffer Limits
  8.2 Router Performance
    8.2.1 Manual Tests
    8.2.2 Automated Tests
    8.2.3 Automated Tests, Different Packet Sizes
  8.3 FPGA Utilization
  8.4 Requirements

9 Problems
  9.1 Meeting Controller Timing
  9.2 Synthesizer Bug
  9.3 Memory Read Timing
  9.4 Controller Buffer Usage

10 Conclusions and Further Work
  10.1 Controller Improvements
    10.1.1 Small Packets
    10.1.2 Large Packets
    10.1.3 Buffer Memory
    10.1.4 Controller Command FIFOs
  10.2 Router Improvements
    10.2.1 Socbus
    10.2.2 ARP Support

List of Tables

5.1 Internet Mix
5.2 Relative Block Sizes
8.1 8k Buffers @115MHz with Internet Mix
8.2 16k Buffers @115MHz with Internet Mix
8.3 Final Design, 8k Buffers @115MHz with Internet Mix
8.4 Final Design, 8k Buffers @115MHz with 40 byte Packets
8.5 Final Design, 8k Buffers @115MHz with 1500 byte Packets
8.6 Packet Generator Measurements at Full Duplex
8.7 FPGA Utilization, Router
8.8 FPGA Utilization, Modules

List of Figures

2.1 Organization of SDRAM with 4 Banks
2.2 SDR SDRAM READ Timing Diagram
2.3 Strobe to Data Timing Relationship
2.4 DDR SDRAM READ Timing Diagram
3.1 DCM
3.2 DCM Outputs
3.3 DDR IOB
3.4 Asynchronous FIFO
4.1 Successful Socbus Connection
4.2 Old Router Socbus Network Configuration
4.3 New Router Socbus Network Configuration
5.1 Block Allocation per Row
5.2 Packet Identifier Format
6.1 Packet Buffer Overview
6.2 Controller
6.3 Socbus Connections Buffer Sharing
8.1 Loss for Packet in Range 48 to 288 bytes
8.2 Loss with Different Packet Sizes, 400-1500
8.3 Loss with Different Packet Sizes, 1500-1500

Chapter 1

Introduction

1.1 Background

The basis for this project is two other master's thesis projects that have been done at the department.

The first was a feasibility study of an Internet core router design using an on-chip network [4]. This design targeted an ASIC, which is a much higher performing circuit than what was actually used during this and the second project.

The second project was based on the feasibility study and implemented a gigabit Ethernet router [1]; this time, however, an FPGA was used, as in this project. During the second project the number of ports on the router was severely limited (to 2 ports) due to the available hardware, and this restriction was also in effect during this project. The department is working on new expansion cards with 4 gigabit Ethernet interfaces, enabling a total of 8 ports to be connected to the router, but these are not finished at the time of writing.

With this much increased capacity comes a need for more buffer memory to store the packets while processing them in the router. The available memory in the FPGA is limited, so to facilitate the increased capacity it was decided to use the DDR memory available on the FPGA development boards. Using the DDR memory requires a controller, and that is the idea behind this project: make a DDR memory controller and integrate it into a packet buffer of the FPGA router.

1.2 Objective

The project objective was to design a working DDR memory controller for the Avnet Kokerboom development board featuring a Xilinx Virtex-II XC2V4000 FPGA and to test it with the router.

A number of requirements were formed at the beginning of the project. They were divided into two groups, primary and secondary requirements, where the primary requirements were to be completed and the secondary ones only if time allowed.

1.2.1 Primary Requirements

• Implement a working DDR memory controller for the Kokerboom development board with a Xilinx Virtex-II XC2V4000 FPGA and a 128MB SODIMM memory based on the Micron MT46V8M16-75 chip.

• Design a packet buffer using the memory controller with socbus interface(s).

• The packet bu�er should handle 4 gigabit in/out port pairs.

• Document how the hardware works.

1.2.2 Secondary Requirements

• Support for 8 gigabit in/out port pairs.

• Test the packet buffer in the previous router design with real network interfaces.

1.3 Reading Instructions

In order to understand the contents of this thesis a basic understanding of computer networks is required. Especially a notion of what a router is and does is important.

Precise knowledge should not be required and a short introduction to this subject can be found in [1]. That report can also be interesting in order to get some background and a better understanding of the router design.

1.3.1 Thesis Outline

Chapters 2 and 3 start off by describing the technology used in this project, the memories and the FPGA.

Chapter 4 goes on to describe the overall router design to put the packet buffer into context and give the reader an understanding of what tasks it should perform.

The core of the thesis lies in Chapters 5 and 6. Chapter 5 deals with the packet buffer's use in the overall router design and introduces some design decisions regarding how the DDR controller will be used. Chapter 6 then goes into detail on the DDR controller/packet buffer design and shows how they are implemented.

Chapters 7 and 8 describe the methods and results of the evaluation of the packet buffer/router, and Chapter 9 deals with problems encountered throughout the project.

Chapter 10 sums up the project and deals with what can be done in the future.

1.4 Method

The project started with a collection of information on DDR SDRAM memories, router memory management and the FPGA in question. The details of how DDR SDRAM memories work were studied first, and later what the available FPGA could do to solve the different issues was investigated.

Using this information a first controller was constructed, which was later transformed into the final packet buffer design. This was integrated into the previous router design, which also needed modification due to changed behaviour between the new and old packet buffers.

Throughout the development process the evolving design was tested, and bugs were investigated and corrected as they were found.

Chapter 2

Memories

There are several different types of memories. This chapter describes two main types of memories, both of which are used in this project.

2.1 SRAM

SRAM stands for Static Random Access Memory. Static means the memory retains its contents as long as power is supplied.

SRAM is constructed from cells consisting of a number of transistors; each bit is stored in one of these SRAM cells. Using a memory based on this principle is easy: there is normally an address bus, a data bus and some control signals to control read/write operations.

The Virtex-II FPGA used in this project is an SRAM device, which means that all configuration information for the different parts of the FPGA is stored in SRAM cells. The FPGA also has built-in memories called block rams. These also work according to this principle and have the simple address-data-control-signal interface; for more detailed information on the block rams see Section 3.4.

2.2 DRAM

DRAM stands for Dynamic Random Access Memory and unlike SRAM it does not retain its data just because it has power; DRAM needs to be refreshed in order to keep its data. The reason for this lies in the way a DRAM cell is constructed. The SRAM cell has several transistors where the DRAM cell only has one. Instead of keeping its information in a looped-back transistor configuration the DRAM cell uses a small capacitor to keep information. Basically, when the capacitor is charged the bit is 1, and when it is not the bit is 0.

This is where the refresh comes in, because a capacitor leaks current, thereby draining it of energy. So in order to keep the information one must at regular intervals read out the information from the cell and then write it back in. This makes using DRAM a much more tedious task than SRAM, but in return DRAM is much cheaper.

2.2.1 SDRAM

SDRAM stands for Synchronous Dynamic Random Access Memory. The meaning of synchronous is that all accesses to the memory are clocked. This section describes what is sometimes called SDR SDRAM (Single Data Rate).

SDR SDRAM memory is organized in a 3-dimensional array: there are rows, columns and banks. The banks are few, usually 2 or 4, while the number of rows and columns is much larger. Figure 2.1 shows the memory organization.

[Figure: diagram with the labels bank, buffer, row address, column address and data]

Figure 2.1: Organization of SDRAM with 4 Banks

During startup the memory needs to go through an initialization procedure in order to place it in a correct operating state. In this procedure all of the banks are put into an inactive state and the settings for the memory are configured. All of the memory settings are in the so-called Mode Register, and to change it a load mode register command is sent with the relevant settings on the address bus. One important setting is the burst length, which is the amount of data that a read/write operation works on (in multiples of the data width). For example, a burst length of 4 means that a read will return data in four consecutive clock cycles. There are more settings, some of which are mentioned below.

Once the startup procedure has been executed the memory is ready for use. To access it the memory needs to be addressed with the 3 parts of the address: bank, row and column. During the access the bank used cannot be used for any other accesses.
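To make the three-part address concrete, the following is a minimal sketch of how a linear word address could be split into bank, row and column. The bit widths (2 bank bits, 12 row bits, 9 column bits) and the row | bank | column ordering are assumptions chosen for illustration only; they are not the mapping used by the packet buffer, and `split_address` and `DramAddress` are hypothetical names.

```python
from typing import NamedTuple


class DramAddress(NamedTuple):
    bank: int
    row: int
    column: int


def split_address(addr: int, bank_bits: int = 2, row_bits: int = 12,
                  col_bits: int = 9) -> DramAddress:
    """Split a linear word address into (bank, row, column).

    Assumed layout, most significant bits first: row | bank | column.
    The default widths are illustrative, not taken from the thesis.
    """
    total_bits = bank_bits + row_bits + col_bits
    if not 0 <= addr < (1 << total_bits):
        raise ValueError("address out of range")
    column = addr & ((1 << col_bits) - 1)
    bank = (addr >> col_bits) & ((1 << bank_bits) - 1)
    row = (addr >> (col_bits + bank_bits)) & ((1 << row_bits) - 1)
    return DramAddress(bank, row, column)


print(split_address(0x200))  # first word of bank 1, row 0
```

With this particular ordering, consecutive addresses first walk through the columns of one row and then move on to the next bank, which is one of several possible choices.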

The first step in an access is to activate the row of interest (in the bank). This is done by asserting the RAS (Row Address Strobe) signal while having the row and bank addresses on the address bus of the memory. After the command has been sent it takes some time for it to complete, and in the meantime other operations can be performed on the memory (directed at other banks). When the row is being activated the memory contents of that row are read out so that they can be used.

With the row active, read and write operations can be performed on it. By asserting the CAS (Column Address Strobe) and/or WE (Write Enable) signals while having the column and bank addresses on the address bus a read or write is initiated. During a write there is one signal involved in addition to the data lines, the data mask signal. This signal is used to indicate which parts of the data are valid; the invalid parts of the data are not written to the memory. A read has no use for this signal, since unwanted data is simply discarded by the controller, but there is another problem involved with reads. When reading, the requested data is not immediately transferred over the data lines; it takes some time for the correct data to be selected based on the column address. This extra time is called CAS-latency or read-latency and it is a multiple of the clock cycle. Typically the CAS-latency for an SDR SDRAM is 2 or 3 clock cycles and it is specified during memory initialization. Figure 2.2 shows a read cycle with CAS-latency 2 and a burst length of 4.
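The effect of the data mask during a write can be illustrated with a small behavioural model. This is only a sketch: the active row is modelled as a Python list, the function name `masked_write` is hypothetical, and burst wrapping at the end of the row is not modelled.

```python
def masked_write(row_buffer: list, column: int, data: list, mask: list) -> None:
    """Write a burst into the active row, starting at `column`.

    A word whose mask entry is True is treated as invalid and is not
    written, so the previous contents of that column are kept.
    """
    if len(data) != len(mask):
        raise ValueError("data and mask must have the same length")
    if column < 0 or column + len(data) > len(row_buffer):
        raise ValueError("burst does not fit in the row")
    for offset, (word, masked) in enumerate(zip(data, mask)):
        if not masked:
            row_buffer[column + offset] = word


row = [0] * 8
masked_write(row, 2, [1, 2, 3, 4], [False, True, False, True])
print(row)  # [0, 0, 1, 0, 3, 0, 0, 0]
```

Only the first and third words of the burst reach the row; the masked words leave the old contents untouched.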

[Figure: timing diagram with the signals CK, Command, Address and Data; a READ command (address: bank, col) followed by NOP commands; CAS latency = 2]

Figure 2.2: SDR SDRAM READ Timing Diagram

After an access is done, another access can be performed in the same way; otherwise the row needs to be deactivated. Deactivating a row is done with a so-called precharge command¹, and by doing this the possibly changed data of the row is written back into the memory. Once this is done the bank is no longer active and can be used for new accesses.

¹ There is also the possibility to tell the memory to do an auto precharge when issuing a read or write command; by doing so the memory will initiate the precharge at the first possible time after the operation completes.

As mentioned earlier, DRAM memories also need to be refreshed. This means that at regular intervals the memory needs to be issued a refresh command. Each row needs to be refreshed every 64 ms, so on average refreshes can be no more than $64/\mathit{rows}$ ms apart². To give a refresh command all of the banks have to be in the precharged (inactive) state.
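As a worked example of the refresh requirement, the sketch below computes the largest average spacing between refresh commands. The row count of 4096 is an assumed example value, not a figure taken from this thesis, and the function name is hypothetical.

```python
def max_average_refresh_interval_us(rows: int, retention_ms: float = 64.0) -> float:
    """Largest average time between refresh commands, in microseconds.

    Every row must be refreshed once per retention period, so the
    commands can on average be at most retention / rows apart.
    """
    if rows <= 0:
        raise ValueError("rows must be positive")
    return retention_ms * 1000.0 / rows


# Assumed example: a memory with 4096 rows
print(max_average_refresh_interval_us(4096))  # 15.625
```

With the assumed 4096 rows a refresh command is therefore needed roughly every 15.6 µs on average; doubling the number of rows halves this interval.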

2.2.2 DDR SDRAM

When moving from SDRAM to DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory) some things changed in order to increase speed. This section describes these differences.

First and foremost, data is transferred on both clock edges in DDR, hence the name.

In order to handle this increased speed other changes had to be made. When clocking data on both clock edges a good clock signal is essential. To improve the accuracy of the clock it is differential in the DDR standard. A differential clock has two lines, one like the normal clock and one that is the inverse of the first. The clock edges are then defined as the time when the signals cross: if the first changes from high to low (and the second from low to high) it is defined as the falling edge of the differential clock, and vice versa.

Another of the changes is the addition of data strobe lines. Data strobe lines work like a separate clock for the data and there are several of them, one per byte (8 bits) or nibble (4 bits) depending on the total data width of the memory. The strobes are, like the data lines, tristate signals and they are driven by either the memory controller or the memory itself depending on whether the operation is a write or a read.

When writing, the strobe (and data) lines are driven by the controller and the strobe edge should be aligned with the center of the data. The reason for this is that the memory should be able to use the strobe as a clock signal and therefore it should have an edge when the data is stable; this is shown in Figure 2.3(a).

During a read the roles are reversed and the memory drives the strobe and data lines, but the strobe alignment differs. The strobe is no longer aligned with the center of the data; instead it changes in sync with the data, see Figure 2.3(b). The reason for this difference is that a delay circuit is complex, and instead of having one in each memory it only needs to be implemented once in the controller.

As can be seen in the figures, the strobe signal is also a DDR signal (it changes just as often as the data). Besides these there is one additional DDR signal, the data mask signal. It works in exactly the same way as it does for SDR SDRAM, but since data comes on both clock edges the data mask must also do so.

[Figure: two panels, (a) WRITE and (b) READ, each showing the Strobe and Data waveforms]

Figure 2.3: Strobe to Data Timing Relationship

These are the basic changes (differential clock, DDR data, strobe and data mask) made to the signaling interface of the DDR memory compared to SDR SDRAM. In addition to the signaling changes there are also some new/changed settings. The burst length for DDR memories has to be a multiple of 2 since every clock cycle can deliver two chunks of data; typical values are 2, 4 and 8. Also the CAS-latency no longer needs to be a multiple of the clock cycle; half clock cycles are also available, which means that the first data arrives in between commands. Typical CAS-latencies for DDR memories are 2, 2.5 and 3. Figure 2.4 shows a read cycle with CAS-latency 2.5 and a burst length of 4; compare with an SDR burst of length 4 (Figure 2.2).

² rows is the number of rows in the memory.

[Figure: timing diagram with the signals CK, Command, Address, Strobe and Data; a READ command (address: bank, col) followed by NOP commands; CAS latency = 2.5]

Figure 2.4: DDR SDRAM READ Timing Diagram
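The difference between the two read cycles can be summarised with a small idealised model that lists when each word of a burst appears on the data lines, counted in clock cycles after the READ command. The function name `read_data_times` is hypothetical and the model ignores all access-time tolerances; it is a sketch of the principle rather than of any particular memory.

```python
def read_data_times(cas_latency: float, burst_length: int, ddr: bool) -> list:
    """Times (in clock cycles after READ) at which the burst words appear.

    SDR delivers one word per clock cycle; DDR delivers one word per
    half cycle and also allows half-cycle CAS latencies such as 2.5.
    """
    if burst_length <= 0:
        raise ValueError("burst length must be positive")
    if ddr and burst_length % 2 != 0:
        raise ValueError("DDR burst length must be a multiple of 2")
    if not ddr and cas_latency != int(cas_latency):
        raise ValueError("SDR CAS latency must be a whole number of cycles")
    step = 0.5 if ddr else 1.0
    return [cas_latency + i * step for i in range(burst_length)]


print(read_data_times(2, 4, ddr=False))   # [2.0, 3.0, 4.0, 5.0]
print(read_data_times(2.5, 4, ddr=True))  # [2.5, 3.0, 3.5, 4.0]
```

In this model the SDR burst of length 4 occupies the data lines for four clock cycles, while the DDR burst of the same length is finished after two.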

In addition to the already mentioned changes there is one more change compared to SDR, the startup procedure. The DDR SDRAM's startup procedure is more complex than that of an SDR SDRAM. It follows the same basic principle as described in Section 2.2.1, but some more steps are required for correct initialisation. For an exact description of the requirements see the DDR SDRAM standard [3].

Chapter 3

Virtex-II FPGAs

This chapter describes some of the important features of Virtex-II FPGAs which are used in the design.

3.1 DCMs

When creating high speed designs it is important to be able to handle clock signals in different ways; this is what Digital Clock Managers are for. The Virtex-II DCMs can do a multitude of things to clock signals; changing the frequency and delaying them are the most important for this project.

This section covers the features of the DCM used in this project; for a more detailed and complete description see the Virtex-II User Guide [8].

Figure 3.1 shows all the signals a DCM has, and as can be seen there are lots of them. Still, almost all of the signals (all except CLK2X180, CLKFX180 and STATUS) have been used in the project.

[Figure: DCM block with the signals CLKIN, CLKFB, RST, PSINCDEC, PSEN, PSCLK, CLK0, CLK90, CLK180, CLK270, CLK2X, CLK2X180, CLKDV, CLKFX, CLKFX180, LOCKED, STATUS and PSDONE]

Figure 3.1: DCM


3.1.1 Clock De-skew

In normal operation the CLK0 output is the same as the CLKIN input clock. When the output clocks are used in the design there is a certain propagation delay from the DCM. To compensate for this the CLK0 (or CLK2X) signal should be connected to the feedback input of the DCM, CLKFB. By doing so the CLKIN signal can be internally delayed until the rising edges of the CLKIN and CLKFB signals match. When this happens those clock signals are 360° out of phase with each other, which means they are in phase. By having the CLKIN and CLKFB signals in phase the propagation delay for the clock signals is effectively zero¹.

When the clocks come into phase with each other the LOCKED signal of the DCM is asserted to signal that the design can start working; before this happens the output clocks should not be used as they are not stable.

3.1.2 Variable Phase Shift

Most features of a DCM are decided at design time through attributes in the HDL code and cannot be changed at runtime. The only thing that can be changed at runtime is the phase shift for all clock outputs (provided the correct attributes are applied). This is what the PS* signals are for. By phase shifting a DCM the delay circuit used to align CLKIN and CLKFB is modified to insert a constant delay, the phase shift. Thereby all DCM clock outputs are delayed, and by using the PS* signals the delay can be changed, effectively allowing variable phase shifting.

The phase shift interface is very simple; it runs in its own clock domain with PSCLK as clock signal. If the PSEN signal is active during a cycle, the DCM checks the PSINCDEC signal to see whether an increase or decrease in phase shift is wanted and then performs the change in delay. Every change in delay is $\frac{1}{256}$ of the clock cycle time. Once the delay change has happened the PSDONE signal is asserted for one cycle to inform the design that another phase shift change can be performed. Changing the delay takes some time because it is changed slowly so that the DCM lock is not lost (the LOCKED signal never needs to change).
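The size of a phase shift step follows directly from the clock period. The sketch below converts between a number of steps and the resulting delay; the 100 MHz clock is an assumed example, the function names are hypothetical, and the limits on the allowed number of steps as well as the discrete delay elements of a real DCM are not modelled.

```python
def phase_shift_delay_ns(clk_mhz: float, steps: int) -> float:
    """Ideal delay for `steps` phase shift increments of 1/256 period each."""
    period_ns = 1000.0 / clk_mhz
    return steps * period_ns / 256.0


def steps_for_delay(clk_mhz: float, delay_ns: float) -> int:
    """Number of 1/256-period steps that comes closest to `delay_ns`."""
    period_ns = 1000.0 / clk_mhz
    return round(delay_ns * 256.0 / period_ns)


# Assumed example: 100 MHz clock, i.e. a 10 ns period
print(phase_shift_delay_ns(100.0, 64))  # 2.5 (a quarter of the period)
print(steps_for_delay(100.0, 2.5))      # 64
```

At the assumed 100 MHz one step corresponds to about 39 ps, so 64 steps shift the clock by a quarter of a period.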

Depending on the frequency the DCM might not be able to achieve exactly the delay requested, because the delay is achieved by including a discrete number of delay elements in a delay line for the CLKIN signal. In this case the DCM selects the closest number of delay elements.

3.1.3 Statically Phase Shifted Clock Outputs

Other than the CLK0 signal there are also a number of clock outputs that are always phase shifted in relation to it. CLK90, CLK180 and CLK270 all run at the same frequency as CLK0 but are shifted $\frac{1}{4}$, $\frac{1}{2}$ and $\frac{3}{4}$ of the clock cycle time, respectively. There are also the two clock outputs running at twice the speed of CLKIN, CLK2X and CLK2X180. Figure 3.2 shows how the different clocks relate to each other.

¹ This is zero propagation delay to the global clock network, see Section 3.3; from there, however, additional delays are introduced.

[Figure: waveforms of CLK0, CLK90, CLK180, CLK270, CLK2X and CLK2X180]

Figure 3.2: DCM Outputs

3.1.4 Frequency Altered Outputs

The remaining three clock outputs of the DCM, CLKDV, CLKFX and CLKFX180, can output clocks which are of a different frequency than CLKIN.

CLKDV is the simplest of the three and only produces frequencies lower than CLKIN. The divider is chosen by setting a DCM attribute; the only limitation is that the divider has to be from a predefined set of values².

The CLKFX output is more flexible; its output has a frequency of $\frac{M}{D} \times \mathrm{CLKIN}$. M is the multiplier and can have any integer value in the range 2-32; likewise D is the divider, also an integer, but the allowed range is 1-32.
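The CLKFX relation can be explored with a short sketch that evaluates the output frequency and searches the allowed multiplier and divider ranges for the pair closest to a wanted frequency. The 100 MHz input clock and the 115 MHz target are assumed example values, the function names are hypothetical, and the frequency limits of the device are not checked here.

```python
def clkfx_mhz(clkin_mhz: float, m: int, d: int) -> float:
    """CLKFX frequency for multiplier M (2-32) and divider D (1-32)."""
    if not 2 <= m <= 32:
        raise ValueError("M must be in the range 2-32")
    if not 1 <= d <= 32:
        raise ValueError("D must be in the range 1-32")
    return clkin_mhz * m / d


def best_clkfx_ratio(clkin_mhz: float, target_mhz: float) -> tuple:
    """Return the (M, D) pair whose output is closest to the target.

    Ties are broken in favour of the smallest multiplier.
    """
    candidates = ((m, d) for m in range(2, 33) for d in range(1, 33))
    return min(
        candidates,
        key=lambda md: (abs(clkfx_mhz(clkin_mhz, md[0], md[1]) - target_mhz), md[0]),
    )


# Assumed example: 100 MHz input clock, 115 MHz wanted
m, d = best_clkfx_ratio(100.0, 115.0)
print(m, d, clkfx_mhz(100.0, m, d))  # 23 20 115.0
```

An exhaustive search is sufficient because there are only 31 × 32 combinations to try.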

CLKFX180, like the other clock outputs named 180, has the same output as CLKFX, only phase shifted 180° in relation to it.

There are limitations on how fast the clocks produced by the CLKFX outputs can be³. What the limitation is, however, depends on the FPGA model and its speed grade; for the FPGA used in this project CLKFX can output a maximal frequency of 210MHz, which is more than enough for this design. All limitations are available in the Virtex-II Complete Data Sheet [7].

² 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 9, 10, 11, 12, 13, 14, 15 and 16

³ There are limitations on the other output clocks too, but CLKFX is the most restrictive.


3.2 DDR IOBs

One of the points of double data rate signals is to increase the data transferred per wire without increasing the clock frequency. This means for example that the FPGA internally can work at 100MHz and still have data transfers at 200MHz. But in order to work with an external unit using DDR (for example a memory) the FPGA needs a facility to interface with DDR signals; this is what DDR IOBs (Double Data Rate Input Output Blocks) are for.

The element that makes all this possible is the DDR mux. It is driven by two flip-flops running on clocks phase shifted 180° from each other and on each rising clock edge the DDR mux output changes.

A single DDR IOB contains two DDR mux constructions, one for data and one for tristate control. All four of these flip-flops have to be driven by the same two clocks because of how the FPGA is constructed internally. In addition to this there are also two input flip-flops, otherwise it wouldn't be much of an IOB (Input Output Block). These flip-flops also have to be driven by clocks phase shifted 180° from each other; however, they do not have to be the same clocks as the other four flip-flops use.
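The behaviour of the DDR mux and the two input flip-flops can be sketched as a purely behavioural model: two streams that each advance once per clock cycle are interleaved into one stream that changes twice per cycle, and the reverse happens on the input side. The function names are hypothetical and no timing, tristate control or clock phase offsets are modelled.

```python
def ddr_mux(data0: list, data180: list) -> list:
    """Interleave two single-rate streams into one double-rate stream.

    data0 is taken on the rising edge of the 0-degree clock and data180
    on the rising edge of the clock shifted by 180 degrees.
    """
    if len(data0) != len(data180):
        raise ValueError("both streams must have the same length")
    out = []
    for first_half, second_half in zip(data0, data180):
        out.append(first_half)   # first half of the clock cycle
        out.append(second_half)  # second half of the clock cycle
    return out


def ddr_capture(stream: list) -> tuple:
    """Split a double-rate stream back into the two single-rate streams."""
    return stream[0::2], stream[1::2]


pin = ddr_mux([1, 3], [2, 4])
print(pin)               # [1, 2, 3, 4]
print(ddr_capture(pin))  # ([1, 3], [2, 4])
```

The model shows why the internal logic can run at the base clock rate: each of the two streams only changes once per clock cycle even though the pin changes twice.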

Figure 3.3 shows the complete IOB with flip-flops, DDR muxes and tristate buffer. The signals in the figure are the four clocks, the in and out data and the output enable (tristate) control.

Figure 3.3: DDR IOB (signals: oe0, oe180, o_data0, o_data180, o_clk0, o_clk180, i_data0, i_data180, i_clk0, i_clk180 and the external pin)


3.3 Global Clock Network

In Virtex-II devices there is something called the global clock network. This is a number of dedicated clock lines that distribute clock signals to different parts of the design. The clock network is designed to minimize clock skew, which is the difference in arrival time of a clock signal at different parts of the system. To achieve this low skew all clock signals on the global clock network are routed into the center of the FPGA and distributed from there. This makes the distance each clock signal travels before reaching its destinations as equal as possible.

In addition to its low skew the global clock network also contains clock buffers. These, as the name implies, buffer the clock signals. This is done to make them strong enough to drive a large number of synchronous elements.

These are both important features of the global clock network; however, the important thing about it in the context of this project is that the number of clock lines is limited. This means care must be taken to make sure the design fits. If the number of clock signals exceeds the available clock lines some of them will have to be distributed through general routing, where delay and thereby skew can be significant. The number of clock lines available in the FPGA in question, the Xilinx Virtex-II XC2V4000, is 16 in total with a maximum of 8 of those mapped into each quadrant⁴.

3.4 Block RAMs

As mentioned in Chapter 2 block RAMs work like SRAM memories. However, they are a little more complex: block RAMs are synchronous dual-port devices, meaning they can perform two operations at the same time and all operations are clocked.

Each of the two ports has clock, enable, write enable, address, data-in and data-out signals. The address, data-in and data-out signals are buses and their widths depend on how the block RAM is configured. A block RAM has an 18 kbit memory array which can be configured for different depth/width ratios. 2 kbit of the memory is however only available when using the wider configurations. The different data widths available are 1, 2, 4, (8+1), (16+2) and (32+4), and the resulting depths are as needed to access the full memory array. The two ports can also be configured independently, giving access to the memory in two different ways simultaneously.

One important feature of the block RAMs is that the two ports can work completely independently of each other. This is almost true; some care must be taken so that a write to a specific address does not conflict with another operation (using the same address) on the other port.

⁴ The Virtex-II FPGAs are divided into 4 different quadrants: North-East, North-West, South-East and South-West.


The ports being completely independent means that even the clocks need no relationship to each other; they can be asynchronous, which enables some important designs to be built, namely the asynchronous FIFO (First In First Out).

3.4.1 Asynchronous FIFOs

The asynchronous FIFO is a construction which allows data to be transferred between clock domains and at the same time provides some buffer capacity. Data entered on the write side of the FIFO will, with some delay, end up on the read side, and all data will be read in the order in which it was written⁵, assuming no data was lost due to lack of buffer space. Figure 3.4 shows the basic interface of an asynchronous FIFO.

Figure 3.4: Asynchronous FIFO (write side: wr_clk, wr_en, data in; read side: rd_clk, rd_en, data out)

Asynchronous FIFOs are used in a lot of different places in this project because there are a number of different clock domains that need to communicate.

There are also synchronous FIFOs, which are like the asynchronous FIFOs except that the read and write sides both reside in the same clock domain. This type of FIFO is also used in this project where there is a need to buffer data in a FIFO fashion.

⁵ Hence the name FIFO, First In First Out.

Chapter 4

Router Design

The router consists of a number of different blocks. These blocks are briefly described in Section 4.3.

All of the blocks existed in the old router but only one remains relatively intact.

In order to better understand the differences between the old and the new router designs the following two sections describe what happens to a packet on its way through the router.

4.1 Packet Path: Old Router

When a packet arrives at one of the in-ports it is read in and classified by the input module. The input module then sends the packet onwards to the packet buffer.

In the packet buffer the destination IP address is extracted from the packet and sent to the routing table.

The routing table looks up which destination (output port and destination MAC address) the packet should go to and the information is transferred back to the packet buffer.

Here the destination is translated into source and destination MAC addresses and the packet is forwarded to the correct output module, which in turn outputs it onto the network.

4.2 Packet Path: New Router

When a packet arrives at one of the in-ports it is read in and classified by the input module. The input module drops unwanted (non-IPv4) packets. If the packet is an IPv4 packet the input module sends it to the packet buffer and while doing so extracts the destination IP address. Once the packet has been transferred the input module sends the destination IP address to the routing table.


The packet buffer stores the packet in memory and awaits the route lookup from the routing table.

Once the routing table finds the correct destination it forwards the information to the packet buffer.

Here the destination is again translated into a destination MAC address (but not a source address) and the packet is forwarded to the correct output module, which outputs it onto the network.

4.3 Router Blocks

4.3.1 Input Module

The input module for this project is a slightly modified version of the input module from the previous router. The change made was to move parts of the functionality to a more suitable place in the router. In the old router the destination IP address was extracted in the packet buffer and from there sent to the routing table. The new design moves this functionality to the input module so that the packet buffer and routing table can process requests for the same packet in parallel.

The new input module was also modified so that it drops non-IPv4 packets, something the old version did not do. The reason for this is that the router as a whole only handles IPv4 traffic, and allowing anything else to enter the router would result in incorrect behaviour: the packets would be routed as if they were IP packets even though they are not.

4.3.2 Output Module

The output module handles outputting packets onto the network. The old output module wanted full Ethernet packets, requiring the packet buffer to send both sender and receiver MAC addresses to the output module. In the redesign it was decided it would be better to add the sender MAC address in the output module instead, since it will always be the same.

Because of this change and the fact that interfacing with the Ethernet physical interface is a relatively simple task it was decided to rewrite the entire module. In this process it was discovered that the old module did not correctly handle packets smaller than the minimum Ethernet packet length, something that had to be fixed.

4.3.3 Routing Table

In the old router the routing table was just a small lookup table mapping a few IP addresses to different output ports. The original idea was for it to use a special pipelined routing table provided by the department.

The lookup table accepted a route lookup request on its socbus connection and then sent the reply to the packet buffer, again over socbus. The problem with this is that every lookup results in a new socbus connection, which incurs a large overhead.

To handle this problem the new routing table can send the results of several route lookups to the packet buffer in the same socbus transmission, if several are available. The department's routing table was also integrated into the routing table.

4.3.4 Packet Buffer

Clearly this part of the router is where the most change was made, since it now includes a DDR controller. The old packet buffer also performed some tasks that were moved elsewhere in the router design, namely extracting the destination IP address (moved to the input module) and adding the source Ethernet address (moved to the output module).

For a more thorough description of the packet buffer see Chapter 6.

What has not changed is that the packet buffer still uses two socbus connections, one for packet data and one for route information¹. In the old router the route interface was used both for sending lookup requests and for receiving the destinations; the new one only uses it for the latter.

4.3.5 Socbus

All of the different blocks are connected with an on-chip network called socbus [5]. A socbus network consists of socbus routers and connections between them. A block can also be connected to each socbus router; in the case of the router, the different modules are connected to the socbus routers.

When modules want to communicate a packet-connected circuit is established. The circuit is unidirectional and is set up by the sender when it makes a connection request to the receiver. The request is routed through the socbus network and the resources needed along the path are reserved. If the receiver or any of the routers along the path cannot accept the connection (because a link is already in use) the connection attempt fails and the sender will have to try again later.

Each socbus link² is bidirectional in the form of two unidirectional links. This allows a module to both send and receive data at the same time. Each unidirectional link consists of a number of control and data signals. There are a total of four control signals: strobe, qual, ack and cancel. Two of these, strobe and qual, are used in the forward direction (from sender to receiver), while the other two are used in the reverse direction (from receiver to sender).

The strobe signal is used to control connections: a transition from low to high signals a connection request while a transition from high to low signals a connection teardown. The two reverse direction signals are used to accept or reject connections; as might be guessed from the names, ack is used to accept and cancel to reject. The last control signal, qual, is used to indicate valid data once the circuit is up and running. This means that pauses in data transfers are possible; the feature is however not used in this design.

¹ The packet buffer described later actually has several more connections for data, but with the limited number of input/output modules one is enough.

² A socbus link is the wires between two routers or between a router and a module.

Figure 4.1 shows the life of a successful socbus connection. It also shows a part of the connection setup which hasn't been discussed yet. In order for the socbus network to know where to route a request the target address is needed. This is transferred over the data wires during the connection setup and the data sent is called req0 and req1. The address is contained in req0 along with some information on the type of connection needed; here, however, only one connection type is implemented and used. The second request word, req1, contains unspecified data and can be used for whatever purpose the application deems useful³.

Figure 4.1: Successful Socbus Connection (waveforms for clk, strobe, qual, data, cancel and ack; the data wires carry req0 and req1 followed by the data words)
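The order of events in a connection attempt can be summarised in a small sender-side sketch. It is an illustration of the description above and not the socbus implementation: only the sequence of events is modelled, the exact cycle in which ack or cancel arrives relative to req0 and req1 is an assumption, and the function name and event format are made up for this example.

```python
def socbus_connection(target, req1, payload, accepted=True):
    """Return the ordered event list for one connection attempt.

    Each event is a (signal, value) pair. Clock-cycle timing is not modelled.
    """
    events = [
        ("strobe", 1),                # low-to-high: connection request
        ("data", ("req0", target)),   # req0 carries the target address
        ("data", ("req1", req1)),     # req1 is free for the application
    ]
    if not accepted:
        # The receiver or a router on the path rejects the request.
        events.append(("cancel", 1))
        # Assumption: the sender releases strobe before retrying later.
        events.append(("strobe", 0))
        return events
    events.append(("ack", 1))         # request accepted, circuit is up
    for word in payload:
        events.append(("qual", 1))    # qual marks the data word as valid
        events.append(("data", word))
    events.append(("strobe", 0))      # high-to-low: connection teardown
    return events
```

A rejected attempt carries no data words; the sender simply has to call the function again at a later time, which corresponds to retrying the connection.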

In both the old and new routers the modules are connected in a 3 × 3 grid; however, their placement differs. This is because of the changed behaviour of the input module. Since it now communicates with both the routing table and the packet buffer a different configuration was preferred. The old and new socbus network configurations can be seen in Figures 4.2 and 4.3 respectively.

Another difference between the two socbus configurations is the width of the data bus. In the old design it was 36 bits wide, and the extra bits were used in some special cases. The new design does not use these extra bits, so the bus width was decreased to 32 bits, the width at which packet data is transferred.

³ It is used in this project to send packet identifiers; what those are will be presented later on.


Figure 4.2: Old Router Socbus Network Configuration (blocks: PB, RT, two IN and two OUT)

Figure 4.3: New Router Socbus Network Configuration (blocks: PB, RT, two IN and two OUT)

Chapter 5

Router Memory Usage

Before designing the packet buffer and memory controller some idea of how it will be used is required. Just having a memory controller is not enough to get good performance; knowledge about the access patterns used gives vital information for designing a good controller for the task. This chapter describes how the packet buffer memory will be used and how the router will keep track of the packets while they are in the router.

5.1 Packet Identifiers

In the feasibility study the designed router marked all packets with a 32-bit identifier upon entering the input modules. These identifiers would then be transferred with the packet to the different places in the router: the packet buffer and the routing table.

In the previous implementation this feature was removed but here it is reintroduced. The feasibility study however did not specify a good way to generate the identifiers, so this needs to be taken care of.

The requirement on the identifiers is that they should identify the packet uniquely and thereby give all the information needed about the packet. So what is this extra information? In the router the packet needs to be allocated some memory space in the packet buffer memory; it also has a size. Sometime during its passage through the router it also needs to get some kind of destination. The input module might also want to classify the packet, so some information about what type of packet it is would also be of interest.

When the identifier is assigned not all of this information is available: the size and classification are, the destination however is not. The location in the packet buffer memory is the most complex part of the information. Some kind of storage scheme is needed; see Section 5.2 for information on this.

Regardless of the storage scheme there still has to be a mapping between storage space and packets, and this can be decided upon at any time before the packet is written into the memory, even as early as when the packet enters the router and the identifier is given. This has the advantage that all the information about the packet, except the destination, can be associated with the identifier right away, and the packet buffer and routing table (which work in parallel on the packet) can have the same information associated with the identifier. This in turn means the associated information can be encoded into the identifier right from the start, requiring only the destination to be added later on.

5.2 Memory Storage Scheme

The choice of storage scheme will depend on a number of factors such as the size of the available memory, the book-keeping needed and the type of access pattern it would incur. A number of different schemes have been used in routers and some important types are [2]:

1. Fixed size blocks. The approach is very simple: the memory is divided into blocks large enough to fit any packet. A variation on this is to divide the memory into fixed size blocks but allow a number of different block sizes. This gives the advantage that less memory will be wasted on small packets because a smaller block size can be used for those packets. Either way the advantage is that packets will be stored in a contiguous address space and book-keeping can be kept minimal because of the simple scheme.

2. Variable sized blocks. This is a complex solution; the start and end addresses (or length) of each block need to be stored. Clearly very high utilization can be achieved because no memory at all needs to go to waste, but it has a potential for fragmentation, which could be very hard to handle. The memory would still be allocated in a contiguous fashion, which is good, while book-keeping becomes significantly more complex.

3. Linked list of small blocks. The last version is a combination of the good parts of the previous two. The fixed size blocks enable the simpler version of book-keeping while the small blocks allow for a higher memory utilization. The packets would however not be stored in a contiguous fashion and information about which block is next would need to be stored.

Choosing between the three is simple. DDR SDRAMs are good at pushing lots of data when it is stored contiguously in the memory. The third version could have this advantage too with a good enough controller and allocation; however, the increased complexity is not worth it compared to the alternative. In addition to this both the variable and linked list versions are methods which increase memory utilization, and they are good when memory size is an issue. Here however there is 128 MB of memory available, so why not use this abundance and get a simpler usage, and thereby a simpler allocation scheme and controller.

This leaves the question of one or several different block sizes. The simple solution is to use one single block size. Since the router only has Ethernet connections and the Ethernet MTU (Maximum Transmission Unit) is 1500 bytes¹, this, or rather 1536 bytes (1536 will be explained below), would be the size to select.

However, it is clear that a higher memory utilization can be reached with several block sizes, and in fact packet sizes in an Internet core router application are rather well suited for this type of memory division. The so-called Internet mix is important here; it is an observed packet size distribution from a real core network which has become more or less a standard in benchmarking of Internet core applications [4]. Table 5.1 shows the packet size distribution of the Internet mix; it also shows how much of all data can be expected to come in a specific packet size.

Table 5.1: Internet Mix

Packet Size    Probability    Relative Size
40 bytes       56%            5%
1500 bytes     23%            74%
576 bytes      17%            21%
52 bytes       5%             <1%
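The relative size column can be recomputed from the packet sizes and probabilities: each size contributes its size times its probability to the average number of bytes per packet, and the relative size is that contribution divided by the total. The sketch below does this for the values of Table 5.1; the names are chosen for this example only.

```python
# Packet size in bytes -> probability, as listed in Table 5.1.
INTERNET_MIX = {40: 0.56, 1500: 0.23, 576: 0.17, 52: 0.05}


def relative_sizes(mix):
    """Return each packet size's share of the total number of bytes."""
    total = sum(size * prob for size, prob in mix.items())
    return {size: size * prob / total for size, prob in mix.items()}


shares = relative_sizes(INTERNET_MIX)
for size, share in shares.items():
    print(f"{size:5d} bytes: {share:6.1%}")
```

The listed probabilities are rounded and add up to 101%, which is harmless here because the shares are normalised by the computed total. Rounded to whole percent the result agrees with the table.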

Given the Internet mix a division into the 3 different block sizes 64, 576 and 1536 bytes covers the packet sizes quite well. All these sizes are divisible by 64, which will be explained below. The allocation of blocks in the memory should be similar to the relative size column in Table 5.1, and a simple way to do that is to allocate contiguously first 5-6% of the memory to 64 byte blocks, and so on. However, there is another way to allocate memory which better suits the memory in question, thereby resulting in a simpler controller.

5.3 Packet to Memory Mapping

In both schemes above blocks of some size would be used and it is desirable to have these blocks allocated in the memory in such a way as to allow rapid reading and writing of data. Regardless of the block size it is also good if as few memory resources as possible are involved in a read/write. Other than the actual memory used (which has already been considered above) the DDR SDRAM has a few other resources: data bus, command bus (control + address lines) and banks.

¹ Ethernet jumbo frames of 9000 bytes will not be considered for reasons explained in Section 5.3.

The data bus usage could be optimized by only transferring exactly as much data as needed and then breaking off the read/write. This is called a burst terminate, and the only place where it could be used is on the last read/write operation for a packet, since this is the only place where there could be a mismatch between the burst length and the packet size. This method will not be considered in this project as the burst terminate command disallows using read/write commands with auto-precharge.

Using less of the command bus would mean using fewer commands for a read/write; this is hard to do and, as will be seen in Chapter 6, it is also unnecessary.

This just leaves the memory banks to optimize. There are a number of them and a packet could be spread across all of them, but it could just as well be contained within just one bank, which would be the resource optimal solution.

The 128 MB SODIMM memory used in this project is a double-sided memory and has a total of eight Micron MT46V8M16-75 chips, four on each side. Double-sided means that it is actually two 64 MB memories that share the same address, data and command buses; only the chip select signal is separate. This means that a command can be sent to either side of the memory (or both if both should get the same command) as long as only one uses the data bus; in other words, commands other than read and write commands can work in parallel.

Each of the four chips on a side has a data width of 16 bits, giving a total of 64 bits = 8 bytes that can be transferred at a time. With the DDR SDRAM maximum burst length of 8 this gives a total of 64 bytes transferred per read/write command. This is the reason why the block sizes should be divisible by 64. The reason for using the maximum burst length will become apparent in Chapter 6.

A burst must be contained within the same bank, and since a bank can only have one row active at a time this implies a burst must be contained within a single row. In fact a DDR SDRAM burst is contained within a block the size of the burst; changing the lower bits in the address only affects the order, the same data is still used. What this means is that blocks should be allocated so that they are contained within a row, and they need to start on an address that is a multiple of 64.

The row size in the memory used is 4 kB. This makes jumbo frames impractical to handle since they would have to extend over three different rows, and therefore it was decided to skip such support. With just one block size (of 1536 bytes = 1.5 kB) two such blocks fit in a row and this leaves 1 kB of memory unused. Since the rest of the memory cannot be used for any new block the blocks might as well be made 2 kB in size.
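The numbers in this section follow from a few lines of arithmetic, sketched below. All constants are taken from the text; the variable and function names are chosen for this example.

```python
import math

CHIPS_PER_SIDE = 4       # chips sharing the data bus on one side
BITS_PER_CHIP = 16       # data width of each chip
BURST_LENGTH = 8         # maximum DDR SDRAM burst length
ROW_BYTES = 4 * 1024     # row size of the memory used

bus_bytes = CHIPS_PER_SIDE * BITS_PER_CHIP // 8   # bytes per transfer
burst_bytes = bus_bytes * BURST_LENGTH            # bytes per read/write command


def bursts_needed(packet_bytes):
    """Number of full bursts needed to move a packet of the given size."""
    return math.ceil(packet_bytes / burst_bytes)


def rows_spanned(frame_bytes):
    """Minimum number of rows a frame occupies when it starts on a row boundary."""
    return math.ceil(frame_bytes / ROW_BYTES)


blocks_1536_per_row = ROW_BYTES // 1536
unused_bytes = ROW_BYTES - blocks_1536_per_row * 1536
```

A 1500 byte packet needs 24 bursts of 64 bytes, that is 1536 bytes, which is why 1536 rather than 1500 is used as the block size. A 9000 byte jumbo frame would span three rows, and two 1536 byte blocks leave 1024 bytes of a row unused.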

If however the three block sizes (64, 576 and 1536) are used a more efficient allocation can be made. Based on the relative size numbers of Table 5.1 it can be seen that about 75% of the memory should go to 1536 byte blocks; 75% of 4 kB is 3 kB, which fits nicely with two 1536 byte blocks. This leaves 1 kB for the other two block sizes, and since only one 576 byte block fits into this space there can naturally only be one, leaving the remainder of 448 bytes for seven 64 byte blocks. Figure 5.1(a) shows the resulting block allocation for a single block size and Figure 5.1(b) shows it for the 64, 576, 1536 allocation.

Figure 5.1: Block Allocation per Row. (a) Two 2048 byte blocks. (b) Two 1536 byte blocks, one 576 byte block and seven 64 byte blocks.

Table 5.2 shows the relative sizes for the second allocation. Compared to Table 5.1 there is too little space for 576 byte blocks and about twice as much space for 64 byte blocks. However, the resulting allocation is close enough and gives a single allocation that can be used across all rows.

Table 5.2: Relative Block Sizes

Block Size     Relative Size
64 bytes       11%
576 bytes      14%
1536 bytes     75%
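The per-row allocation and the percentages of Table 5.2 can be checked with the short sketch below. The layout is the one described in the text (two 1536 byte blocks, one 576 byte block and seven 64 byte blocks per 4 kB row); the names are chosen for this example.

```python
ROW_BYTES = 4096
ROW_LAYOUT = {1536: 2, 576: 1, 64: 7}   # block size -> blocks per row


def row_usage(layout):
    """Return the number of bytes each block size occupies in one row."""
    return {size: size * count for size, count in layout.items()}


used = row_usage(ROW_LAYOUT)
row_shares = {size: nbytes / ROW_BYTES for size, nbytes in used.items()}

for size in sorted(row_shares):
    print(f"{size:5d} byte blocks: {used[size]:4d} bytes, {row_shares[size]:.1%}")
```

The three groups add up to exactly 4096 bytes, so no part of a row is left unused by this layout.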

The only issue not addressed is that of how to actually map the packets to these blocks. As indicated in Section 5.1 this can be done already in the input module by letting each input module manage its own part of the memory. In the requirements for the project a maximum of eight gigabit connections would be connected to the router, resulting in eight input modules. As it turns out each side of the memory has four banks, and with two sides this gives a total of eight banks. Allocating each input module its own bank has the advantage that one input module cannot block another by using the same bank all the time. Also, since the memory transfer speed is much higher than that of a single gigabit port (it has to be in order to handle 4, or even 8, ports), it is very likely the bank will be available again when the next packet from the same port arrives for reading/writing.

What all this boils down to is that each input port is allocated 1/8 of the memory, 16 MB. When a new packet arrives it is given the next available block of the correct size (if only one block size exists the choice is simple). To minimize book-keeping a very simple way to allocate blocks is to have a counter per block size, and when space for a packet is allocated the counter is increased by one.
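A counter based allocator for the single block size case could look like the sketch below. The text only states that a counter is increased for every allocation; the wrap-around to block 0, the class name and the linear byte offset (which ignores the actual bank and row addressing of the memory) are assumptions made for illustration.

```python
MODULE_BYTES = 16 * 1024 * 1024   # 1/8 of the 128 MB memory per input module
BLOCK_BYTES = 2048                # single block size variant
BLOCKS_PER_MODULE = MODULE_BYTES // BLOCK_BYTES


class BlockCounter:
    """Per input module block allocator using a single counter."""

    def __init__(self, module):
        if not 0 <= module < 8:
            raise ValueError("there are eight input modules, numbered 0..7")
        self.module = module
        self.next_block = 0

    def allocate(self):
        """Return the next block number and advance the counter."""
        block = self.next_block
        # Assumption: the counter wraps, so the oldest block is reused
        # without checking whether its packet has left the router.
        self.next_block = (self.next_block + 1) % BLOCKS_PER_MODULE
        return block

    def byte_offset(self, block):
        """Linear offset of a block; the real bank/row mapping is not modelled."""
        return self.module * MODULE_BYTES + block * BLOCK_BYTES
```

With three block sizes the same idea applies, with one such counter per block size in every input module.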

5.4 Packet Identifier Format

As discussed in Section 5.1 the packet identifier can have almost all the information needed about the packet encoded into it. Here the exact format for the identifier will be given.

The three parts of the identifier that remain, packet size, packet type and storage location, are all available when the identifier is given.

A packet can be up to 1500 bytes long, which means $\lceil \log_2 1500 \rceil = 11$ bits (range 0 to 2047) are needed to store the size of the packet with byte precision. Here a little trick can be used which simplifies some hardware later on. By storing the packet size minus one it is simple to determine how many blocks of size $2^n$ would be needed to store the packet. For example, with $n = 3$ a packet of size 107 bytes ($107_{10} - 1_{10} = 106_{10} = 1101010_2$) can be stored in $1101_2 + 1 = 13_{10} + 1_{10} = 14_{10}$ blocks of size $2^3 = 8$ bytes ($14 \times 8 = 112 \geq 107$). The $n$ lower order bits of the length-minus-one are simply discarded to get the number of blocks (minus one). This helps the controller when it needs to determine how many 64 byte blocks are needed when storing the packet in the DDR SDRAM.
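The length-minus-one trick amounts to a shift and an increment, as the following sketch shows. The function name is chosen for this example; in hardware the shift is simply a matter of ignoring the $n$ lowest bits.

```python
def blocks_needed(size_bytes, n):
    """Number of 2**n byte blocks needed for a packet of size_bytes.

    The stored value is size_bytes - 1; dropping its n lowest bits gives
    the block count minus one.
    """
    if size_bytes < 1:
        raise ValueError("a packet has at least one byte")
    length_minus_one = size_bytes - 1
    return (length_minus_one >> n) + 1
```

For the example in the text, `blocks_needed(107, 3)` gives 14. With 64 byte blocks ($n = 6$) a full 1500 byte packet needs 24 blocks, and a packet of exactly 64 bytes needs one block rather than two, which is the case a plain shift of the unmodified size would get wrong.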

The packet type could potentially need lots of bits, but to keep things simple it is allocated whatever is left over after the other two get their share of the 32-bit identifier, in the hope that this is enough for now².

This leaves the storage location. Its size will of course depend on the number of block sizes that are used, but it must include at least 3 bits to differentiate between the 8 input modules.

For the simple case of 2 kB blocks there is a total of $\frac{16\,\mathrm{MB}}{4\,\mathrm{kB}} \times 2 = 8192$ blocks available for allocation in each input module. This requires $\lceil \log_2 8192 \rceil = 13$ bits and leaves $32 - 11 - 3 - 13 = 5$ bits for the packet type. Figure 5.2(a) shows the packet identifier for this case.
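Packing and unpacking an identifier in the 2 kB block format can be sketched as follows. The field widths (5 bit type, 16 bit location, 11 bit length) and the stored length-minus-one come from the text; placing the 3 input module bits above the 13 block bits inside the location field is an assumption, as are the function names.

```python
TYPE_BITS, LOCATION_BITS, LENGTH_BITS = 5, 16, 11
MODULE_BITS, BLOCK_BITS = 3, 13   # together they form the location field


def pack_identifier(ptype, module, block, size_bytes):
    """Build a 32-bit identifier: type | location | length-minus-one."""
    if not 0 <= ptype < (1 << TYPE_BITS):
        raise ValueError("packet type does not fit in 5 bits")
    if not 0 <= module < (1 << MODULE_BITS):
        raise ValueError("input module must be 0..7")
    if not 0 <= block < (1 << BLOCK_BITS):
        raise ValueError("block number must be 0..8191")
    if not 1 <= size_bytes <= 1500:
        raise ValueError("packet size must be 1..1500 bytes")
    # Assumed ordering inside the location field: module above block.
    location = (module << BLOCK_BITS) | block
    return ((ptype << (LOCATION_BITS + LENGTH_BITS))
            | (location << LENGTH_BITS)
            | (size_bytes - 1))


def unpack_identifier(ident):
    """Return (ptype, module, block, size_bytes) from a 32-bit identifier."""
    size_bytes = (ident & ((1 << LENGTH_BITS) - 1)) + 1
    location = (ident >> LENGTH_BITS) & ((1 << LOCATION_BITS) - 1)
    ptype = (ident >> (LOCATION_BITS + LENGTH_BITS)) & ((1 << TYPE_BITS) - 1)
    module = location >> BLOCK_BITS
    block = location & ((1 << BLOCK_BITS) - 1)
    return ptype, module, block, size_bytes
```

Because every field is known when the packet enters the router, the input module can build the identifier once and both the packet buffer and the routing table can decode the same value.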

With the three different block sizes the bit allocation becomes a little more complex. Now the packet size can be stored in different sized fields depending on the block size, which enables some compression to be made. Doing too much compression will however make the hardware more complex and that can be a problem³. To differentiate between the block sizes 2 bits are required, 3 bits are still needed to differentiate the input modules and, at least for the 1536 byte blocks, 11 bits are needed for the packet length. The number of rows available to an input module is $\frac{16\,\mathrm{MB}}{4\,\mathrm{kB}} = 4096$, resulting in $\lceil \log_2 4096 \rceil = 12$ bits needed for row identification, and there are two 1536 byte blocks per row which gives 1 additional bit. All this adds up to 29 bits for 1536 byte packets, leaving 3 for the packet type. For 576 byte packets the length field required by the 1536 byte packets can be reused, and since there is only one slot per row no additional data is needed for that. For the 64 byte packets however there are 7 slots per row, which requires 3 bits to differentiate among them. These 3 bits could be allocated in the most significant part of the packet length since such small packets will always have zero there; however, that makes the hardware more complex, so it is better to instead add 2 bits (and use the 1 bit from the 1536 byte packets) to select the correct slot. This leaves the packet type with only 1 bit. Figure 5.2(b) shows what this allocation looks like.

² The feature of letting the input module classify the packets was almost not used at all in the previous project [1] and this project makes no attempt to extend it.

³ It is a problem; experience from this project says so.
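The bit budget for the three block size format can be tallied with a short sketch. The field sizes are those derived in the text; the helper names are chosen for this example, and the sketch counts bits only without defining an encoding.

```python
IDENTIFIER_BITS = 32


def bits_for(count):
    """Bits needed to distinguish `count` different values."""
    return (count - 1).bit_length()


block_size_bits = bits_for(3)                    # 64, 576 or 1536 byte block
module_bits = bits_for(8)                        # eight input modules
length_bits = bits_for(1500)                     # length-minus-one, 0..1499
row_bits = bits_for(16 * 1024 * 1024 // 4096)    # rows per input module
slots_per_row = {1536: 2, 576: 1, 64: 7}
slot_bits = {size: bits_for(n) for size, n in slots_per_row.items()}


def type_bits(block_size):
    """Bits left for the packet type when a block of this size is used."""
    used = (block_size_bits + module_bits + length_bits
            + row_bits + slot_bits[block_size])
    return IDENTIFIER_BITS - used
```

This reproduces the 3 bits left over for 1536 byte blocks and the single bit left for 64 byte blocks. Since one common format is used for all block sizes, the smallest of these values decides the width of the type field.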

Figure 5.2: Packet Identifier Format
(a) type [5], bits 31-27; location [16], bits 26-11; length [11], bits 10-0
(b) type [1], bit 31; blocksize [2], bits 30-29; slot [3], bits 28-26 (0-6 for 64 byte blocks, 0-1 for 1536 byte blocks); input module & row [15], bits 25-11; length [11], bits 10-0

Both of these formats are viable; even though the second only has a single bit for packet type, the current input module makes no real use of it. Both have been implemented, and in the end it turned out that the increased hardware size of the second version was too much of a problem without providing any real gain; there is enough memory that wasting some isn't a problem.

Chapter 6

Packet Buffer Design

This chapter describes the design of the packet buffer. Figure 6.1 shows an overview of the design and the following sections will go into detail on the different parts. The grey part of the figure shows the clock domain division of the packet buffer.

Figure 6.1: Packet Buffer Overview (labels: MEM, INTERFACE, Controller, inbuffer, outbuffer, SOCBUS)

6.1 DDR Controller Selection

When starting the project a choice had to be made on which DDR controller should be used, or whether a completely new one should be designed. As can be seen later on in this chapter the latter approach was used; the sections below explain why, and where the basis of the selected controller came from.


To better understand the requirements on the controller the first source of information was the DDR SDRAM standard document [3]. This gave some insight into what modes of operation would be required from a controller to achieve high speed data transfers.

6.1.1 Available Controllers

The first step in the process of finding a good controller design was of course to look for what ready-to-use solutions were available.

Opencores¹ has a DDR controller available which is designed for Virtex-II devices. It is free to use but not particularly configurable. In particular it is designed to perform bursts of length 2 only, the shortest burst available. After reading the standard it was apparent that such a solution would probably waste too much time controlling the memory to get good transfer speeds.

Xilinx, the Virtex-II manufacturer, has a number of different memory controllers available for their FPGAs. For the Virtex-II there exist basically two reference designs that handle DDR SDRAM memory. These are described in Xilinx application notes XAPP253 [6] and XAPP688.

Application note XAPP688 is not free: a short document describing the basic design method is available but the code costs money, thereby disqualifying it for usage in this project. The design also includes delaying the strobe signal from the memory within the FPGA using a rather sophisticated method. This method needs some sort of control for the delay circuit which depends on the supply voltage and temperature of the FPGA, and figuring out a solution to do this from scratch seemed like a bit too much work.

XAPP253 on the other hand has code available and the design is simpler than that of XAPP688. There are however some problems with it and Xilinx does not recommend its usage. Internally it works with a special construction which uses clock signals as input to combinatorial logic. This construct requires special constraints on the design, and to use it the data bus would have to be extended from 16 bits to 64 bits, which is complicated because of this construct.

The design is not a complete controller either: it does not handle refreshing the memory and does not allow read and write commands to be issued back-to-back, which would hinder long transfers. It is also made for a single memory, which the memory used in this project is not (see Section 5.3). Some of the ideas from XAPP253 can however be applied to building a custom controller.

6.1.2 Custom Controller

A DDR controller implemented in an ASIC design is normally run at twice the clock frequency of the memory. Compared to an FPGA an ASIC is much faster, which means this is no problem. By doing so the ASIC requires no special DDR IOBs and doesn't have to deal with half clock cycles internally. This type of design would be simpler to implement from a pure code perspective; however, getting such a design to meet timing at the speeds required is not feasible with the FPGA.

¹ http://www.opencores.org/

The XAPP253 controller uses the DDR IOBs and DCMs described in Chapter 3 to achieve its goals. It uses DCMs to phase-shift clock signals so that generation of data and strobe signals fits the memory timing model. It also shows how to configure the FPGA to use the I/O standard used by DDR SDRAMs.

Starting with this information a custom controller that doesn't need to run at twice the memory speed can be constructed by using the special FPGA resources. Designing the controller from scratch gives complete control over how the memory should be used and allows for a solution that fits the specific needs of the application. It also gives promise of a solution that integrates nicely into the overall design of the router.

In the next section the exact design of the controller is presented. Chapter 5 gives some background on how it will be used and, as a consequence, some of the reasons why it is constructed as it is.

6.2 The Controller

The controller is at the heart of the packet buffer. It controls the memory by sending commands and generating all the control signals needed for the different elements in the data path. It also decides how to schedule all reads, writes and refreshes during operation and provides a correct startup procedure.

To simplify the implementation the different tasks of the controller have been divided among several smaller controllers. This helps because smaller state machines are easier to place and route so that timing is met. Also, with a good division the parallel processing simplifies the problem of issuing commands to the memory in an efficient way.

Initially a controller incorporating all the different tasks into a single sequential state machine was designed. This first controller managed to do reads and writes, but only at a low frequency because of the delay between memory and FPGA during read operations; see Section 9.3 for more information on this. With compensation for the delay a higher frequency was achievable, but in order to get really good performance memory commands cannot be issued in the straightforward sequential fashion of first activating the row, then doing the read/write and then precharging the row.

The reason things cannot be done sequentially is the delays: after a row has been activated some time needs to pass before it can be used, and therefore it is preferable to do the activation during another read/write operation. In this way the activation time is hidden during another operation and no time is lost due to it. The same goes for precharging the row; the time needed can be hidden during another operation.

In addition to the active/precharge and read/write operations there is really only one more operation needed, refresh. During refresh the memory cannot be used for anything else, but since it is a double-sided memory, while one side refreshes the other can still perform operations.

Getting maximal performance out of the memory is the goal and this means getting as much data in/out of it as possible. In other words the data bus should be used as much as possible, so any command that is not a read or write² should be performed while a read/write operation is using the data bus. Of course this is not always possible, but when there are no read/write operations available there really isn't any need to hide the commands because there is time for them anyway.

So since there can only be one read/write command in progress at any one time it makes sense to allocate these to a single controller. Of the other operations needed the precharge is easiest to allocate because precharge commands can be grouped with the read/write commands in the form of read/write commands with auto-precharge. This leaves the active and refresh commands, both of which should be executed in parallel with the read/write commands. In other words they should be allocated to a separate controller, and since the different sides of the memory can work independently a separate controller for each side might be a good idea.

This just leaves the startup procedure. Both sides of the memory have the same configuration, so it is possible to send the same sequence of commands to both sides at the same time and both will be initialized in the same way. Also, the startup procedure is in a way more complex than the tasks needed during operation, but still it is very straightforward. So it makes sense to put this in a separate controller since that will remove complexity from the other controllers.

With these three different kinds of controller, all of which can send commands to the memory, a way to give them different priorities is required. The priorities must be given in such a way that no starvation occurs. The controller handling the memory startup procedure, from now on called the startup controller, will never work at the same time as the others so there will never be any conflict. The other controllers, the primary controller handling read/write commands and the two secondary controllers (handling active and refresh commands, one for each side of the memory), all compete for the command interface of the memory.

Before going into exactly how to prioritise among these three controllers, some information is needed on how read and write operations get to the controller. When a packet has arrived that should be read or written a command is sent to one of the secondary controllers (depending on which side of the memory the command is destined for) through an asynchronous FIFO³. Once the secondary controller has read the command and activated the row on which the operation should be performed, it forwards the command to the primary controller so that it in turn can perform the read or write. Since there are two secondary controllers there might be a conflict over which command the primary controller chooses. Also the primary controller could already be handling a command, meaning it cannot accept another. In the case of a conflict a priority is needed which does not starve either of the two secondary controllers, and a simple way to do this is to let the primary controller accept the command from the secondary controller which it is not currently serving. The resulting connections between the different elements of the controller can be seen in Figure 6.2.

² Read and write operations are the only ones that use the data bus.

³ It needs to pass an asynchronous FIFO because the socbus input ports and controller run in different clock domains, see Figure 6.1.

Figure 6.2: Controller (block diagram showing four AFIFOs, two secondary controllers, the primary controller, the startup controller and the memory)

Now back to the priority of the memory command interface. The primary controller has control of the data bus and getting maximal utilization is the goal, so naturally it should in some way be prioritised. With only the two secondary controllers to compete with, a rather simple scheme can fulfill the requirements of giving the primary controller priority while neither of the secondary controllers suffers from starvation. The reason this works is that the primary controller only needs to do read and write operations with the full burst length. The full burst length is 8, which means 4 clock cycles because it is a DDR memory. In other words the worst case scenario is that every 4th cycle is used by the primary controller, leaving the two secondary controllers with 3/4 of all command opportunities to the memory. So the simple scheme is that if the primary controller wants to send a command to the memory, let it. If the primary controller has no command to send, a selection between the secondary controllers might be needed. If only one of the secondary controllers wants to send a command there is no choice, but if both do a number of selection methods are possible, for example at random, one always having priority over the other, or whichever did not get to send last. All of these would probably work out fairly well, but since implementing the one that did not send last is simple and gives a fair balance this was the choice.
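The priority scheme described above can be illustrated with a small behavioural model. This is only a sketch in Python, not the hardware description used in the design; the class name `CommandArbiter`, the helper `simulate` and the choice of which secondary controller wins the very first tie are assumptions made for the illustration.

```python
from typing import List, Optional


class CommandArbiter:
    """Toy model of the arbitration for the memory command interface."""

    def __init__(self) -> None:
        # Assumption: pretend secondary 1 sent last, so secondary 0 wins the first tie.
        self.last_secondary = 1

    def grant(self, primary: bool, sec0: bool, sec1: bool) -> Optional[str]:
        """Return which controller may send a command this cycle, or None."""
        if primary:
            # The primary controller always wins when it has a command.
            return "primary"
        if sec0 and sec1:
            # Both secondaries want to send: pick the one that did not send last.
            winner = 1 - self.last_secondary
        elif sec0:
            winner = 0
        elif sec1:
            winner = 1
        else:
            return None
        self.last_secondary = winner
        return f"secondary{winner}"


def simulate(cycles: int) -> List[Optional[str]]:
    """Worst case: primary requests every 4th cycle, both secondaries always request."""
    arbiter = CommandArbiter()
    return [arbiter.grant(cycle % 4 == 0, True, True) for cycle in range(cycles)]
```

In this worst case the primary controller takes one command slot in four and the two secondary controllers alternate over the remaining three, so neither of them is starved.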


6.2.1 Startup Controller

To provide the memory with a correct startup procedure as outlined in Chapter 2 a special part of the controller is dedicated to this task. The startup controller is just a state machine which outputs the correct commands to the memory. This procedure requires no interaction with the memory data interface, which makes the task significantly simpler than it otherwise would be.

The startup controller is only used once during startup; in the meantime the other controllers are idle. After startup is complete the others handle all memory interaction.

6.2.2 Primary Controller

The primary controller handles all the read and write operations and to do this it is in control of the buffer memory and data interface. Only a few signals are distributed from the primary controller to outside of the controller: the two signals read and write and the address for the buffer memory. That is all that is needed for it to perform its job. The reason why there are so few is the number of clocks involved. The controller and memory interface all run at the same clock speed but there are multiple phase-shifted clocks which the control signals need to cross between. To do this and still maintain timing is not an easy task; the signals need to pass through a number of flip-flops belonging to the different clocks and scheduling this is a time-consuming task⁴. Therefore a solution with as few signals as possible simplifies the design work.

When the primary controller performs a read or write it always works on a complete packet; for the reason see Section 5.3. To make the transfer as fast as possible it issues read/write commands every 4th cycle, which keeps the data for the commands back-to-back, wasting no cycles. Once the full packet has been transferred it will fetch another transfer from one of the secondary controllers and repeat the process. In this switch between transfers there will always be at least one cycle of idle time before the next command is issued because the data channel requires it when switching between read and write. This is not a requirement when going from read to read or write to write, but for the primary controller to be able to do that it would need to store two transfers at a time⁵. Also, when going from write to read part of the CAS latency could be hidden by issuing the read command during the last write operation. Right now this is not done, so the full CAS latency plus the extra cycle of delay affect the start of all packets being read from memory.

6.2.3 Secondary Controllers

The secondary controllers are the only entry points of the controller. They accept packet transfers from the router, activate the relevant bank with the correct row and forward the transfer to the primary controller. In addition to this they also perform refreshes at regular intervals.

⁴ Especially since at least one of the tools involved seems to have bugs when working with this type of design.

⁵ Which, by the way, is not particularly hard to do, it just requires some work.

When a secondary controller receives a transfer it activates the correct row regardless of whether it is a read or a write. For read commands however there is one more task for it, to allocate space in the buffer memory for the packet to be read. Once this is complete it signals the primary controller that a new transfer is available and waits for it to accept, then it repeats the process.

Each secondary controller has a counter that keeps track of when it is time to do a refresh. It counts up and once it is over a specific level value the controller waits for the current packet transfer to be done (accepted by the primary controller) and then performs the refresh. When performing the refresh the counter is assigned the difference between its current value and the level value, ensuring that the average time between refreshes is not disturbed. At the speed at which the controller runs it is not a problem that a refresh can be delayed for as long as it takes to transfer a maximal sized packet: at 100 MHz refreshes have to be done about every 1500th cycle and transferring a maximal sized packet takes less than 100 cycles. To ensure that the two secondary controllers do not consistently try to refresh both sides of the memory at the same time (which would leave the primary controller without work) the counters are initialised to different values at reset; one is initialised to 0 and the other to half the level value. This results in refreshes being issued as evenly as possible over time.
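The refresh counter behaviour can be sketched as follows. This is a behavioural Python model written for illustration only; the names `RefreshCounter`, `tick` and `refresh_times`, and the strict "greater than" comparison against the level value, are assumptions and not taken from the actual implementation.

```python
from typing import Callable, List


class RefreshCounter:
    """Toy model of the refresh counter in one secondary controller."""

    def __init__(self, level: int, start: int = 0) -> None:
        self.level = level  # number of cycles between refreshes
        self.count = start  # reset value (0 for one side, level // 2 for the other)

    def tick(self, busy: bool) -> bool:
        """Advance one cycle; return True if a refresh is issued this cycle."""
        self.count += 1
        if self.count > self.level and not busy:
            # Keep the overshoot so a delayed refresh does not shift later ones.
            self.count -= self.level
            return True
        return False


def refresh_times(counter: RefreshCounter, cycles: int,
                  busy: Callable[[int], bool]) -> List[int]:
    """Return the cycle numbers (starting at 1) at which refreshes are issued."""
    return [t for t in range(1, cycles + 1) if counter.tick(busy(t))]
```

With a level value of 10 and a transfer occupying cycles 11 to 14, the first refresh slips to cycle 15 but the following ones still occur at cycles 21 and 31, which is the point of subtracting the level value instead of resetting the counter to zero. Starting the second counter at half the level value places its refreshes midway between those of the first.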

6.3 Memory Interface

The packet buffer communicates with the memory through two interfaces, the command interface and the data interface. The command interface is one-way while the data interface is bidirectional, see Chapter 2.

The command interface consists of the RAS, CAS, WE, CS and address signals. It is simple compared to the data interface and only involves having the controller assert the correct signals every cycle at the positive clock edge.

The data interface on the other hand is much more complex. The signals involved are DDR and there are lots of them: 8 strobes, 64 data and 8 data masks.

The data mask signals are unidirectional and used to indicate which part of a write is valid. In this application, with the selected memory scheme, that makes them constantly in the valid state.

The rest of the signals however are not simple. They are all bidirectional, which from the FPGA point of view means there are three signals: one to control the tristate buffer, one for data out and one for data in. Adding to this that they are DDR signals means that every strobe and data signal has a total of six signals internally, since two values are needed per cycle.

For the data signals the connections for all of these six signals are obvious. The in/out data signals have to be connected to the buffer memories handling the respective read and write data, leaving only the tristate control signals to be generated from the write signal from the primary controller.

For the strobe signal things are both harder and easier than for the data signals. The selected read method does not use the strobe to determine when data is stable, so it is only used during writes. In other words only the four outbound signals are needed. However the data for these signals is not just taken from a memory; instead they all have to be generated from the write signal. Since there are 8 strobes in total and their physical locations are very spread out, generating them in a single place and distributing them is a hassle. Instead, the solution is a small hardware construction that can be duplicated and placed close to the physical strobe pins. This ensures that very few signals need to travel long distances and to many places in the FPGA. This might seem like a problem for the data signals too since their respective pins are also spread out physically. However for them there are only two connection points per signal, one at the IOB flip-flop and one at the buffer memory⁶.

6.4 Router Interface

The packet buffer has several connections to the rest of the router. All in all there are five socbus connections: four are for packet data and the last is for route information. Sections 6.4.1 and 6.4.2 describe the four socbus connections for data and 6.4.3 describes the route connection.

Socbus connections are bidirectional and the in and out connections can be used independently of each other.

6.4.1 Input Connections

Packets enter the packet buffer via one of the four input connections. When a packet is to be transferred over socbus there is first a connection setup phase during which a path through the socbus network is found and reserved. With a connection request there can also be some data transferred, so that when a request reaches one of the input connections of the packet buffer a decision to accept or reject it can be made based on this short data.

When receiving a packet it is the job of the input connection to store it in the buffer for write commands so that it can be written to memory. Therefore there must be space in the buffer before the packet can be accepted. Hence information about the packet size is needed before accepting. This is where the small extra information comes into play: it is large enough to fit the full packet identifier (see Chapter 5), which contains the packet size. Based on this the input connection can check for and reserve space for the packet. If enough space is not available the connection is rejected and the sender has to retry at a later time.

⁶ This however in no way means all is well with the data signals; as the controller frequency is increased this also becomes more and more of a problem.


If the connection is accepted the packet is received and once the full packet is stored a write command is entered into the asynchronous FIFO leading to the controller. When the controller performs the write operation the memory for the packet is again released so new packets can be received.

6.4.2 Output Connections

The four output connections work like the input connections in reverse.

When the controller has completed a read operation and stored a packet in the buffer memory one of the output connections is signaled via an asynchronous FIFO. This initiates a socbus connection request to the destination output module. If the connection fails it will retry until it succeeds.

Once a connection is established all packet data is transferred and as it goes along the buffer memory is freed for new packets to be read by the controller.

6.4.3 Route Connection

The route connection exists so that the packet buffer can get information on where to send packets. It is only used to receive route lookups from the routing table and when one arrives it immediately forwards a read command for the packet in question to the controller through an asynchronous FIFO.

That is all it currently does, but if at a later time some runtime configuration of the packet buffer should be added the route connection is where to do it.

6.5 Buffer Memory

In Figure 6.1 the buffer memory is divided into two parts, in and out. To have such a configuration is by no means a requirement; all that is required is enough memory width to handle the memory data bus.

As the names in- and out-buffer imply, one is only used to read from while the other is only written to (from the memory point of view). These two separate functionalities can of course just as well be implemented in one memory that allows both read and write, since the memory can only perform one transfer at a time anyway. There are two main reasons it is not done in this way: first, the total size of the buffer would be cut in half, and second, and most important, is the speed at which data can be read/written on the router side of the buffers. If a single in/out-buffer was used it is clear that the speed required on the router side of the buffer would have to be as great or greater than the memory side to be able to handle the controller data rate⁷. The reality is that the controller runs faster than most parts in the router, including the router interface, so the split is required.

⁷ This is not completely true, the controller could be so inefficient that it cannot handle more than the speed of the router interface anyway, but that is not a good starting point for a high performance packet buffer so it will be assumed this is not the case.


The requirement on buffer data width sets a limit on the smallest size possible for the buffer. The memories used for the buffers are the block rams described in Section 3.4. These have a maximal data width of 32 bits, assuming only one port is used, which is a requirement since the other port will be used by the router interface. The memory data bus width is 64 bits, but it is also DDR which in effect makes it 128 bits wide. The lower limit on the buffer size is therefore the size of four block rams, (128/32) × 2 kB = 8 kB, for each of the in- and out-buffers. For larger buffers the size can be multiples of this value; for the final decision on buffer size see Chapter 8.
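The lower limit can be recomputed with a few lines of Python. This is just the arithmetic from the paragraph above written out; the function name `min_buffer_size` and its parameter names are made up for the illustration.

```python
from typing import Tuple


def min_buffer_size(bus_bits: int = 64, ddr: bool = True,
                    bram_port_bits: int = 32,
                    bram_bytes: int = 2048) -> Tuple[int, int]:
    """Return (number of block rams, buffer size in bytes) for one buffer."""
    # A DDR bus delivers two values per clock cycle, doubling the effective width.
    effective_bits = bus_bits * (2 if ddr else 1)
    # Enough block rams in parallel to match the effective bus width (rounded up).
    brams = -(-effective_bits // bram_port_bits)
    return brams, brams * bram_bytes
```

With the default values this gives four block rams and 8 kB for each of the in- and out-buffers, matching the figures in the text.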

What still hasn't been addressed is the way in which the four socbus data connections share the respective buffers. The socbus data width is 32 bits while the buffers have a width of 128 bits. In total the four socbus connections have a width of 128 bits, but unfortunately it is not as simple as assigning one block ram to each socbus connection. This is because when the read and write operations are performed they need the packet data stored in all of the block rams, since all have to work in parallel to handle the memory data bus width.

Sharing the block rams can be done in several ways, but two basic ways are that the connections either get access to all of the block rams every 4th cycle, or get access to one at a time, with the block ram accessed changing each cycle. The two methods have somewhat different properties: accessing all at once means having to register data for use during the other 3 cycles, while getting access to the different memories sequentially means having to line up the data in the correct way. Registering data is a little more resource intensive but is required if the data cannot be aligned correctly, as in the case of the input connections. There data arrives at a time which the sender decides, something the packet buffer has no control over, and therefore data needs to be registered until alignment is achieved. Figure 6.3(a) shows how the smaller 32 bit registers are connected to a larger 128 bit register which in turn can feed the buffer every 4th cycle. By loading the registers at the correct times this stores the incoming data in each of the input connections until it can be written to the buffer.

With the output connections these extra registers are not needed since waiting for the right block ram to become available and then starting the transfer is possible. Also there is no point in using the other solution as it would have to wait for its own cycle to read the data from the buffer anyway. Therefore a simpler hardware solution with only a mux which selects the correct block ram can be used, as shown in Figure 6.3(b).
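The sequential sharing used by the output connections can be pictured as a rotating schedule. The sketch below is an assumption about how such a rotation could look, with connection i accessing block ram (i + cycle) mod 4; the text does not state the exact mapping, and the names `bram_for` and `wait_cycles` are hypothetical.

```python
NUM_CONNECTIONS = 4  # four socbus data connections, four block rams in parallel


def bram_for(connection: int, cycle: int) -> int:
    """Block ram that a given connection may access in a given cycle."""
    return (connection + cycle) % NUM_CONNECTIONS


def wait_cycles(connection: int, cycle: int, wanted_bram: int) -> int:
    """Cycles a connection must wait before it reaches the wanted block ram."""
    return (wanted_bram - bram_for(connection, cycle)) % NUM_CONNECTIONS
```

In every cycle the four connections touch four different block rams, and a connection waits at most three cycles for the block ram holding the start of its packet before it can stream data out at one 32 bit word per cycle.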

6.5.1 Buffer Memory Usage

The whole packet buffer is built around the concept that packets should be served in-order. The reason for this is the simple allocation/deallocation of buffer memory this method makes possible.

Both the buffer memories are divided into four equal parts, one for each socbus connection. This will allocate at least⁸ 2 kB to each input and output connection, which allows a full size Ethernet packet to fit. The memory will always use bursts of 64 bytes when doing reads and writes so these buffer parts are also divided and allocated in blocks of 64 bytes. By doing this the task of allocating memory has been reduced to allocating the correct number of 64 byte blocks to each packet.

⁸ Depending on the total buffer size.

Figure 6.3: Socbus Connections Buffer Sharing. (a) 32 bit input data collected in registers and passed through a 128 bit register to the buffer; (b) buffer output selected down to 32 bit output data.

Whatever solution is chosen the allocation information needs to be transferred to/from the controller. Since it resides in a different clock domain this is a problem. The easiest way to do it is to simply allocate the blocks in sequence; this way only the starting block and the number of blocks need to be transferred. A solution where the next block to be used is stored in the previous block could be a possibility, but it would result in this information being written to the memory because the controller allows no pause in its transfers. Also keeping track of the allocation information would become significantly harder and determining if enough buffer memory is available would require a more complicated circuit because blocks would no longer contain an even power of 2 bytes.

With the straight sequence all that is needed to keep track of the allocation is the block numbers of the first free (FF) and oldest allocated (OA) blocks. Using this information to determine if a packet can fit in the free buffer space becomes the simple task of subtracting FF from OA and checking that the result is at least as great as the number of blocks needed by the packet. The subtraction just needs to be modulo the total number of blocks, which is exactly what the hardware will produce⁹. For example with a total of 32 blocks (block numbers in the range 0-31), FF = 25 and OA = 31 the maximal packet that can be accepted is OA − FF = 31 − 25 (mod 32) = 6 blocks, or with FF = 13 and OA = 1 packets of maximal length OA − FF = 1 − 13 (mod 32) = 20 blocks can be accepted.

⁹ If the result is negative it can also be interpreted as an unsigned number with the value (result + total number of blocks), which is the correct modulus value.
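The free-space check described above can be written out as a short Python sketch. The function names are invented for the illustration, and it is an assumption that the size passed in is the number of bytes the packet occupies in the buffer. The sketch follows the formula literally and does not try to settle how the case FF = OA is treated, since the text does not say.

```python
BLOCK_BYTES = 64  # the memory always uses bursts of 64 bytes


def blocks_needed(packet_bytes: int) -> int:
    """Number of 64 byte blocks a packet occupies (rounded up)."""
    return -(-packet_bytes // BLOCK_BYTES)


def free_blocks(ff: int, oa: int, total_blocks: int) -> int:
    """Free blocks as (OA - FF) modulo the total number of blocks."""
    return (oa - ff) % total_blocks


def can_accept(packet_bytes: int, ff: int, oa: int, total_blocks: int) -> bool:
    """True if the packet fits in the free part of the buffer."""
    return blocks_needed(packet_bytes) <= free_blocks(ff, oa, total_blocks)
```

When the total number of blocks is a power of two, as in the example with 32 blocks, the modulo operation is the same as keeping only the low bits of the subtraction result, which is why the hardware gets it for free.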

This allocation method also allows for a very simple deallocation procedure. A block is simply deallocated when the first data is read from it. This works fine in the out-buffer (data to be written to the memory) since when the first read in a block happens there is no time for the router interface to fill that block faster than the controller reads it. In the in-buffer however the situation is reversed and the controller can possibly have time to overwrite the data, therefore one block is always left unallocated in the in-buffer¹⁰.

¹⁰ This method is actually used in the out-buffer allocation too because the same code was originally used in both places.

Chapter 7

Verification, Testing and Debugging

During the project a lot of testing and debugging has been done. This chapter describes the basic methods applied in this verification process.

7.1 Simulation

The basic way to test the design has been to simulate it. For this purpose libraries for all the different functional units in the FPGA were available. Also a simulation model for the memory was available from Micron's homepage¹, which proved crucial in debugging the controller.

Throughout the project all the different parts have been simulated to some extent but in general test benches for each individual module have not been developed. The initial controller developed was tested with dedicated hardware parts which could be synthesised so that an early design could be proved to work in hardware. During this time the only test bench used covered the complete design and basically only connected it to the memory simulation model. Here, and as the development of the controller progressed, the Micron simulation model was vital to ensure that each new addition did not violate correct memory usage, especially the timing constraints.

Simulation also showed lots of small errors made throughout the development process so that they could be discovered at an early time. However a few errors exhibited themselves only in hardware². These were the hardest to find, and the methods used to finally find them are described in the following sections.

¹ http://www.micron.com/

² Or so seldom that significant time was required with the used input stimuli.


7.2 Logic Analyzer Hardware Debugging

With a controller working in simulation the task of getting it to run in hardware proved much harder than expected. The problem turned out to be delays in the memory while reading (see Section 9.3), but in order to find this out lots of time was spent on measuring the design in hardware. There was no way to connect equipment directly to the memory interface so all measurements had to be done on a few selected signals that could be routed out from the design to an external logic analyzer. There the effects of the problem could be seen, but not the cause.

Instead the cause of the problem had to be guessed and a solution worked out based on that guess. Testing again hopefully showed the solution worked. This is common for most of the hard to find problems: observing the problem and its cause is hard but getting an indication of what or where the problem could be is easier.

In some cases after guessing the problem a new type of input stimuli could be developed to make the problem exhibit itself more often, allowing for simulation. This also proved to be a useful route to go because in simulation there is no limit on what is observable; every part of the design is available.

7.3 Packet Buffer Verification and Speed

Evaluating the full packet buffer is not possible if real network interfaces are to be used. The reason is that there are not enough of them to generate the required traffic volumes. Also both the feasibility study and the previous project showed that the socbus network could, and would, be a bottleneck. Therefore some other solution is needed to show that the packet buffer can handle the speed requirements.

To do this the idea of packet generators was conceived. Each of them connects directly to one of the packet buffer's socbus data ports, thereby cutting all of the socbus network out of the picture. This still leaves the socbus interface logic in the packet buffer, which should be included in the test.

The units described here were used both to evaluate the packet buffer performance and to verify that the packet buffer worked, both in simulation and hardware.

7.3.1 Packet Generators

Each of the packet generators is to pump as much data as possible into the packet buffer. They are also responsible for verifying that whatever comes out of it is correct.

To do this a very simple way to generate packets was developed. First the packet identifier generation scheme from the input module is used. Then a counter value, starting at 1 at the beginning of each packet, is used as packet data. When verifying, another counter is used and if any data in the packet does not match the counter the packet was received incorrectly.
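The counter based generation and checking can be sketched in Python as below. The sketch assumes that the counter advances once per 32 bit data word and leaves out the packet identifier; the function names are hypothetical.

```python
from typing import Iterable, List


def generate_payload(num_words: int) -> List[int]:
    """Packet data: a counter starting at 1 at the beginning of each packet."""
    return list(range(1, num_words + 1))


def verify_payload(words: Iterable[int]) -> bool:
    """Check received data against a second counter that also starts at 1."""
    return all(word == expected for expected, word in enumerate(words, start=1))
```

A single corrupted, missing or duplicated word makes the received data disagree with the checking counter, so the packet is flagged as wrong without the generator having to store what it sent.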

To make the test more like the real world a random distribution of packet sizes is used. It has on average a size distribution like the Internet Mix (see Table 5.1). To achieve this a precalculated distribution is stored in a block ram and used during testing. The precalculated distribution is made by a Java program developed specifically for this purpose and each of the four packet generators has its own distribution.

To measure the speed at which the packet buffer works two 32 bit counters are included in each packet generator, one for the number of cycles during which data is sent and one for the number of cycles during which data is received. In addition to these a single 32 bit counter is also included which contains the total number of cycles passed. By reading all of the counters simultaneously the efficiency of the packet buffer can be measured, both in raw data transferred per time unit and in how much of the memory data bus is used on average³. For results of these tests see Chapter 8.
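How the counter values could be turned into the two figures mentioned above is sketched below. This is an assumed calculation, not the one used for the results in Chapter 8: it takes 32 bits per active generator cycle and 128 bits per memory clock cycle (64 bit bus, DDR), counts only data seen by the generators, and ignores any padding to full 64 byte bursts. All function and parameter names are hypothetical.

```python
from typing import Sequence


def throughput_bits_per_s(active_cycles: int, total_cycles: int,
                          gen_clock_hz: int, word_bits: int = 32) -> float:
    """Raw data rate of one direction of one packet generator."""
    return active_cycles * word_bits * gen_clock_hz / total_cycles


def memory_bus_utilisation(sent_cycles: Sequence[int], recv_cycles: Sequence[int],
                           total_cycles: int, gen_clock_hz: int, mem_clock_hz: int,
                           word_bits: int = 32, mem_bits_per_cycle: int = 128) -> float:
    """Fraction of the memory data bus capacity used by all generators together."""
    # Data sent is written to memory and data received was read from it.
    bits_moved = (sum(sent_cycles) + sum(recv_cycles)) * word_bits
    # Bits the memory bus could have carried during the measured time.
    capacity = total_cycles * mem_clock_hz * mem_bits_per_cycle / gen_clock_hz
    return bits_moved / capacity
```

The ratio of the two clock frequencies is what converts cycles counted in the packet generator clock domain into memory clock cycles, which is why both frequencies must be known.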

7.3.2 Route Generator

For the packet buffer to read packets out and send them on the outgoing socbus connections it needs to be issued route lookups. To generate these a unit called the route generator was developed. It connects to the packet buffer route connection, but it also eavesdrops on the four socbus connections to find out the packet identifiers of the packets being sent to it. With some delay the route generator sends faked route lookups with a random destination to the packet buffer, which then reads out the packet.

7.4 Status Registers/Counters

When the different modules in the router were connected to each other the need to get statistics from the design became apparent. If the problems could not be easily shown in simulation finding them would be hard.

Therefore a number of status registers/counters were put into the design. By performing tests in real hardware and looking at the registers/counters an indication of where the problem was situated could be deduced. For example during tests with real network traffic small amounts of packet loss were detected. By looking at the counters for the number of packets received and the number of packets sent onto the socbus network by the input module the source of the packet loss could be pinned down to the input module.

³ Assuming the clock frequencies of both the memory and the packet generators are known, which they are.


7.4.1 Serial Connection

To communicate with the hardware and access the counters and registers during operation, a small monitor CPU with a serial connection (provided by the department) was integrated into the design. This greatly simplified checking the state of the system and performing rudimentary debugging. These features were also used during performance evaluation to check how many packets were transferred or lost and, if lost, in what part of the router.

7.5 Testing in Real Network

The final router is meant to be used in a network, and to really test this real network interfaces have to be used. For this a simple setup with two computers with gigabit network interfaces connected to the router was used. This enables basic tests to be run, and when the router fails to forward packets these can be captured in the computers and the cause examined more carefully, for example by running the same packet through a simulation.

Using a tool called iperf⁴ some performance statistics can also be gathered. This method was used in evaluating the old router and a comparison might be interesting.

As can be seen in Chapter 8, this method of performance evaluation is not enough. Sadly the computers used cannot handle the speeds required to push the router to its limits, especially with small packets.

7.5.1 Dedicated Hardware Packet Generators

To be able to really push the router to its limits, the computer bottlenecks need to be cut out of the picture. What was used instead was dedicated hardware which could send and receive packets of any valid size at wire speed.

These extra hardware units were implemented with two additional development boards exactly like the one the router runs on.

They are not very advanced. Basically they consist of a send and a receive module incorporating an Ethernet CRC generator/checker and a number of registers/counters. To control them, the same monitor CPU with serial connection that is used in the router is used. Via this, statistics for the number of packets received and dropped for various reasons can be read out, and the send module can be controlled to set the send packet size, the intergap time⁵ and the number of packets to send.

To make sure the packets received have not been altered by the router, they also incorporate a complete packet content check. When one of them sends a packet, the packet content is read from a block RAM, and at the receiver side the same content is available in another block RAM which is used to check that the data is correct byte by byte.

⁴ http://dast.nlanr.net/Projects/Iperf/

⁵ The intergap time is the time between two successive packets.


There really isn't anything good about the traffic generated by these packet generators except that they're fast: all packets are of the same size and the intergap is always the same during a session. This is a highly unrealistic traffic pattern, but it is still used because the hardware needed to generate it is very simple and could be (and was) built in a very short time, and no other solution for generating the required traffic was available in the project time-frame.

Chapter 8

Results

This chapter describes the results of the various performance evaluations done and how well the final design fulfills the requirements.

8.1 Controller Efficiency

At two different stages in the design the packet buffer speed was measured using the hardware described in Section 7.3, once before integration with the rest of the router and once with the final packet buffer.

The first test measured how the buffer size affected performance, and it was not repeated with the final packet buffer design because of the work involved: changing the buffer size means that all the addresses used in conjunction with the buffers also change width, resulting in lots of changes throughout the packet buffer. Increasing the buffer size also adds block RAMs to the data path to/from the memory, which makes it much harder to meet timing.

The final packet buffer uses the minimal sized buffers described in Section 6.5. The first test also used the minimal buffer size and compared this to using twice the buffer size.

8.1.1 Different Buffer Sizes

Tables 8.1 and 8.2 show the counter values for two runs with 8k and 16k buffers, respectively. All counters are in the socbus clock domain, which runs at 80MHz; the 115MHz referenced in the tables is the controller speed.

Both of these performance evaluations were performed with only the Internet Mix as input packet size distribution. An equivalent evaluation of the final design can be found in Section 8.1.2.

Table 8.1: 8k Buffers @115MHz with Internet Mix

Socbus #   Data Out Cycles   Data In Cycles   Total Cycles
0          851317C5h         84EFD93Fh        F518E81Bh
1          84CC23BCh         84EF5820h        F518E81Bh
2          85902F20h         8545579Fh        F518E81Bh
3          84FAE8FDh         8545B9DCh        F518E81Bh
Total:     2146A539Eh        2146A42DAh       F518E81Bh

Table 8.2: 16k Buffers @115MHz with Internet Mix

Socbus #   Data Out Cycles   Data In Cycles   Total Cycles
0          9029A308h         91512AD4h        F6498115h
1          90780286h         90506D42h        F6498115h
2          9062F402h         908892BBh        F6498115h
3          90AE46AAh         908895FCh        F6498115h
Total:     241B2E03Ah        242B2C0CDh       F6498115h

To see how efficient the controller is, the percentage of its cycles that were actually spent transferring useful data can be calculated. Each of the data in and data out counter values represents how many 32 bit words have been transferred. The total cycles counter only gives the number of 80MHz cycles over which the test was performed. Using the total cycles counter, the number of controller cycles can be calculated as $\text{total cycles} \times \frac{115}{80}$; then, using the data counters, the number of useful cycles can be inferred as $\text{sum of all data counters} \times \frac{32}{64 \times 2}$¹. With these calculations the controller efficiency for the two cases becomes:

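• $\frac{(8932447134 + 8932442842) \times \frac{32}{64 \times 2}}{4112050203 \times \frac{115}{80}} = 75.6\%$ for the 8k buffers

• $\frac{(9692176442 + 9708945613) \times \frac{32}{64 \times 2}}{4132012309 \times \frac{115}{80}} = 81.7\%$ for the 16k buffers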
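The calculation above can be reproduced directly from the hexadecimal totals in Tables 8.1 and 8.2. The following Python sketch is only an illustration of the arithmetic; the function and parameter names are chosen here and do not come from the design.

```python
def controller_efficiency(data_out, data_in, total_cycles,
                          socbus_mhz=80, controller_mhz=115,
                          counter_bits=32, memory_bits=64):
    """Fraction of controller cycles spent transferring useful data."""
    # Each counted word is 32 bits; the DDR bus moves 2 x 64 bits per cycle.
    useful_cycles = (data_out + data_in) * counter_bits / (memory_bits * 2)
    # The total cycles counter runs in the 80MHz socbus clock domain.
    controller_cycles = total_cycles * controller_mhz / socbus_mhz
    return useful_cycles / controller_cycles


# Totals from Table 8.1 (8k buffers) and Table 8.2 (16k buffers).
eff_8k = controller_efficiency(0x2146A539E, 0x2146A42DA, 0xF518E81B)
eff_16k = controller_efficiency(0x241B2E03A, 0x242B2C0CD, 0xF6498115)

print(f"8k buffers:  {eff_8k:.1%}")
print(f"16k buffers: {eff_16k:.1%}")
```

The same function can be applied to the totals of the later tables by passing the corresponding counter values.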

Clearly, increasing the buffer size increases the controller efficiency; however, doubling the buffer size yields only an 8% performance increase (from 75.6% to 81.7%). What this shows is that if the buffers are too small they set a limit on how much data can be transferred in and out of the packet buffer. This is because less data can be queued up waiting for the controller to have time for it.

To solve this, either the buffer size can be increased, as shown here, or the buffers could be used more efficiently, in other words the circular buffer principle would have to be revised. Neither of these solutions has been included in the final packet buffer design, because increasing the buffer sizes causes too many problems with meeting timing and a more complex buffer allocation would require more time to develop.

¹ 32 is the bit-width of the data counters, 64 is the memory data width, and the extra 2 is there because the memory is DDR, meaning two 64 bit values are transferred per cycle.

8.1.2 Final Packet Buffer Design

The performance of the final packet buffer was measured in the same way as above. This time, however, more than the Internet Mix case was studied: "distributions" consisting of only small (40 byte) and only large (1500 byte) packets were also measured. Tables 8.3, 8.4 and 8.5 contain the relevant counter values for each case.

Table 8.3: Final Design, 8k Buffers @115MHz with Internet Mix

Socbus #   Data Out Cycles   Data In Cycles   Total Cycles
0          8C3FBA7Ah         8BFF1AE7h        F88502CAh
1          8BE851BBh         8C09F5A8h        F88502CAh
2          8C5EB030h         8C16E119h        F88502CAh
3          8BA342EFh         8C0A0615h        F88502CAh
Total:     23029FF54h        23029F7BDh       F88502CAh

Table 8.4: Final Design, 8k Buffers @115MHz with 40 byte Packets

Socbus #   Data Out Cycles   Data In Cycles   Total Cycles
0          3123DE47h         3108D096h        F645989Bh
1          3123DE40h         3108D08Ch        F645989Bh
2          30EDC54Eh         3108D08Ch        F645989Bh
3          30EDC54Eh         3108D082h        F645989Bh
Total:     C4234723h         C4234230h        F645989Bh

Table 8.5: Final Design, 8k Buffers @115MHz with 1500 byte Packets

Socbus #   Data Out Cycles   Data In Cycles   Total Cycles
0          A566C7FEh         A566C6F9h        F7EA0F38h
1          A566C777h         A566C626h        F7EA0F38h
2          A566C8D7h         A566C600h        F7EA0F38h
3          A566C7B8h         A566C5C9h        F7EA0F38h
Total:     2959B2004h        2959B18E8h       F7EA0F38h


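The controller efficiency thereby becomes:

• $\frac{(9397993300 + 9397991357) \times \frac{32}{64 \times 2}}{4169466570 \times \frac{115}{80}} = 78.4\%$ with the Internet Mix

• $\frac{(3290646307 + 3290645040) \times \frac{32}{64 \times 2}}{4131756187 \times \frac{115}{80}} = 27.7\%$ with 40 byte packets

• $\frac{(11099906052 + 11099904232) \times \frac{32}{64 \times 2}}{4159311672 \times \frac{115}{80}} = 92.8\%$ with 1500 byte packets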

Compared to the first measurements, the controller efficiency has increased somewhat for the Internet Mix case, from 75.6% to 78.4%. This slight increase is due to increased parallelism in the controller (each secondary controller looks at one read and one write command at the same time). For the other two cases, performance is very poor for small packets and good for large packets.

For the small packets the reason for the poor performance lies in the controller: each read or write is a multiple of 64 bytes, which means that even for an ideal controller that uses the data bus 100% of the time the efficiency would still be just $\frac{40}{64} = 62.5\%$. This doesn't explain all of the poor performance. The rest lies in the fact that a 64 byte transfer takes 4 cycles, and to set up that transfer at least 1 cycle is needed if it is a write, or 1+2.5 (and a wasted 0.5) cycles if it is a read. Additionally, the active/precharge times become noticeable in these cases as the transfer times are so short.
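To get a feeling for how much the setup cycles cost, the numbers above can be combined into rough upper bounds. The sketch below is only an estimate: it assumes that the setup cycles of a transfer are not overlapped with any other transfer (the real controller may hide part of them) and it ignores active/precharge and refresh times. The function name is hypothetical.

```python
BYTES_PER_CYCLE = 2 * 64 // 8  # DDR: two 64 bit values per cycle = 16 bytes
BURST_BYTES = 64               # every read or write is a multiple of 64 bytes


def bus_efficiency(payload_bytes, setup_cycles):
    """Useful bytes divided by the bus capacity of the cycles occupied."""
    bursts = -(-payload_bytes // BURST_BYTES)           # round up to whole bursts
    data_cycles = bursts * BURST_BYTES // BYTES_PER_CYCLE
    return payload_bytes / ((data_cycles + setup_cycles) * BYTES_PER_CYCLE)


ideal = bus_efficiency(40, 0)             # bus busy 100% of the time
write = bus_efficiency(40, 1)             # 1 setup cycle
read = bus_efficiency(40, 1 + 2.5 + 0.5)  # 1 + 2.5 setup cycles and 0.5 wasted

print(f"ideal {ideal:.1%}, write {write:.1%}, read {read:.1%}")
```

Under these assumptions a 40 byte write can reach at most 50% and a 40 byte read about 31%, which puts the measured 27.7% in perspective once active/precharge times are added.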

8.1.3 Theoretical Ethernet Maximal Throughput

A gigabit Ethernet connection cannot achieve a full 1Gbit/s of data transfer speed because there is some overhead. The overhead comes from the Ethernet frame header and trailer plus the required gap between packets.

The Ethernet header consists of a 64 bit preamble, the source and destination MAC addresses, each 48 bits long, and a 16 bit type field. The trailer consists of only a 32 bit checksum. In between them up to 1500 bytes of data can be stored; however, there has to be a minimum of 46 bytes, so shorter packets are padded.

Before a new packet can be sent, a minimum of 96 bit-times has to pass after the last packet finishes.

All in all this results in an overhead of 64 + 48 + 48 + 16 + 32 + 96 = 304 bits for packets larger than 45 bytes.

The theoretical throughput of a gigabit Ethernet connection therefore depends on the packet size. To have something to compare the packet buffer tests with, the throughput for 40 byte packets, 1500 byte packets and the Internet Mix is interesting.

For 40 byte packets the overhead is an additional 6 bytes of padding ($46 - 40 = 6$ bytes $= 48$ bits), for a total overhead of $304 + 48 = 352$ bits. The total packet size will be $40 \times 8 + 352 = 672$ bits, giving a throughput of $\frac{40 \times 8}{672} = 47.62\%$ (of the 1Gbit/s available).

For 1500 byte packets, in the same way, the overhead is 304 bits, the total packet size $1500 \times 8 + 304 = 12304$ bits and the final throughput $\frac{1500 \times 8}{12304} = 97.53\%$.

Calculating the maximal throughput for the Internet Mix, assuming minimal gaps between packets, is most easily done by calculating the average for 100 packets. Using the Internet Mix numbers from Table 5.1 gives a total length of $100 \times 304 + 8 \times (55 \times (40 + 6) + 23 \times 1500 + 17 \times 576 + 5 \times 52) = 407056$ bits, giving a throughput of $\frac{8 \times (55 \times 40 + 23 \times 1500 + 17 \times 576 + 5 \times 52)}{407056} = 91.88\%$.
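The three throughput figures can be checked with a few lines of Python. This is a sketch of the arithmetic only; the constant and function names are chosen here, and the Internet Mix weights are the ones that appear in the calculation above.

```python
# Preamble, two MAC addresses, type field, checksum and inter-packet gap, in bits.
OVERHEAD_BITS = 64 + 48 + 48 + 16 + 32 + 96  # = 304
MIN_PAYLOAD_BYTES = 46                       # shorter payloads are padded


def wire_bits(payload_bytes):
    """Bits occupied on the wire by one packet, including padding and gap."""
    return max(payload_bytes, MIN_PAYLOAD_BYTES) * 8 + OVERHEAD_BITS


def throughput(size_counts):
    """Useful payload bits divided by wire bits for a mix {size: count}."""
    useful = sum(count * size * 8 for size, count in size_counts.items())
    total = sum(count * wire_bits(size) for size, count in size_counts.items())
    return useful / total


INTERNET_MIX = {40: 55, 1500: 23, 576: 17, 52: 5}  # packets per 100

print(f"40 byte:      {throughput({40: 1}):.2%}")
print(f"1500 byte:    {throughput({1500: 1}):.2%}")
print(f"Internet Mix: {throughput(INTERNET_MIX):.2%}")
```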

8.1.4 Packet Buffer Limits

Comparing the theoretical Ethernet numbers to the controller efficiency numbers gives an idea of how much the packet buffer can handle.

For each of the three cases the number of full duplex gigabit Ethernet ports the packet buffer can support can be calculated. Each port needs twice the Ethernet maximal throughput because of the full duplex operation. The throughput varies with the packet size distribution, as does the controller efficiency, which affects the usable memory bandwidth. The raw memory bandwidth (using all cycles for data) is $115\,\text{MHz} \times 2 \times 64\,\text{bits} = 14.72$ Gbit/s.

For the three cases the number of ports the packet buffer can support is therefore:

• $\frac{14.72 \times 0.784}{2 \times 0.9188} = 6.28$ with the Internet Mix

• $\frac{14.72 \times 0.277}{2 \times 0.4762} = 4.28$ with 40 byte packets

• $\frac{14.72 \times 0.928}{2 \times 0.9753} = 7.00$ with 1500 byte packets
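As a cross-check, the port counts follow from the raw bandwidth, the measured controller efficiency and the theoretical Ethernet throughput. The Python sketch below simply restates that formula; the names are illustrative.

```python
# Raw memory bandwidth in Gbit/s: 115MHz, two transfers per cycle, 64 bits wide.
RAW_GBITS = 115e6 * 2 * 64 / 1e9


def supported_ports(controller_efficiency, ethernet_throughput):
    """Full duplex gigabit ports: usable bandwidth / (2 x payload rate per port)."""
    return RAW_GBITS * controller_efficiency / (2 * ethernet_throughput)


cases = {
    "Internet Mix": (0.784, 0.9188),
    "40 byte packets": (0.277, 0.4762),
    "1500 byte packets": (0.928, 0.9753),
}
for name, (efficiency, eth) in cases.items():
    print(f"{name}: {supported_ports(efficiency, eth):.2f} ports")
```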

Most interesting in these numbers is the performance for 40 byte packets: even though the controller performs really badly with them, the number of supported ports is still reasonable. This case doesn't happen in reality either, and as can be seen with the more realistic Internet Mix example the performance increases significantly even though about 60% of the packets are small.

8.2 Router Performance

When measuring the router performance real network interfaces were used. Since only two Ethernet ports were available in hardware, pushing the whole packet buffer to its limits this way was not possible.

Two types of measurements were performed: one using computers to send/receive data and one using the specialised hardware packet generators described in Section 7.5.1. The former was also used during the measurements of the old router, which would give something to compare with. The same computers were used, however the same driver was not. This affected the results so greatly that a comparison based on them is rather useless.


8.2.1 Manual Tests

The real test for the router comes with the use of the packet generators for measurements. First, simple tests with one sender and one receiver were performed. The size of the packets was varied and the intergap was kept at a minimum. The results aren't very interesting: regardless of the packet size, every single packet was routed and received correctly at the destination.

To push the router further, both sides were instead configured to work at full duplex. With this configuration loss started to occur as the packet sizes decreased. Table 8.6 shows the results of the measurements. The size column shows the size of the packet after source, destination and CRC are stripped, which is the size of the packet when it is transferred on the socbus network. The sent column is the number of packets each side sent, and each of the received columns shows the number of packets actually received per side.

Table 8.6: Packet Generator Measurements at Full Duplex

Size   Sent        Received #1   Received #2   Loss
48     06000000h   052926C3h     05292BF8h     27.97%
88     02000000h   01EB8AB6h     01EB8C72h     7.99%
120    01000000h   00FB5474h     00FB5550h     3.65%
136    01000000h   00D2EF34h     00D2F017h     35.21%
180    01000000h   00E21542h     00E21621h     23.37%
188    01000000h   00E77FF5h     00E780D8h     19.14%
196    01000000h   00FFE088h     00FFF786h     0.06%
236    01000000h   00FFFDD2h     00FFFFE0h     0.00%
288    00C00000h   00C00000h     00C00000h     -
388    00800000h   00800000h     00800000h     -
788    00400000h   00400000h     00400000h     -
1500   00F00000h   00F00000h     00F00000h     -
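The loss figures can be recomputed from the hexadecimal counters. Judging by the arithmetic, the Loss column corresponds to the packets lost in both directions combined, divided by the number of packets sent per side; this reading is inferred from the table values and is not stated explicitly in the text. A small Python sketch under that assumption:

```python
def loss_percent(sent, received_1, received_2):
    """Packets lost in both directions, relative to the packets sent per side."""
    lost = (sent - received_1) + (sent - received_2)
    return 100.0 * lost / sent


# A few rows from Table 8.6: size -> (sent, received #1, received #2).
rows = {
    48: (0x06000000, 0x052926C3, 0x05292BF8),
    88: (0x02000000, 0x01EB8AB6, 0x01EB8C72),
    136: (0x01000000, 0x00D2EF34, 0x00D2F017),
}
for size, counters in rows.items():
    print(f"{size} bytes: {loss_percent(*counters):.2f}%")
```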

As seen from the results, loss happens on small packets; however, the loss does not increase evenly as the packets get smaller.

8.2.2 Automated Tests

To further investigate the loss behaviour with small packets, an automated test was set up in which the loss was measured for all packet sizes, in steps of 4 bytes, from the minimum 48 bytes (anything smaller would have to be padded to be sent) to 288 bytes (where loss is no longer a problem). The steps had to be 4 bytes because the packet generators cannot check packet contents unless the size is a multiple of 4. The results of this test can be seen in the diagram in Figure 8.1. The diagram actually shows three test runs plotted on top of each other, and they all exhibit exactly the same loss behaviour.


[Figure: loss (%), 0 to 35, plotted against packet size, 48 to 288 bytes]

Figure 8.1: Loss for Packets in the Range 48 to 288 bytes

There are a number of different reasons a packet can get lost, and by using the counters available in the different parts of the router and packet generators the reason for most of the lost packets is known. For the rest of them the number of unaccounted packets can at least be recorded.

For the test depicted in Figure 8.1 almost all loss (well above 99%) occurred due to lack of buffer space in the input modules, which is as it should be. When the packet buffer and/or socbus network has reached its capacity the input modules cannot empty their buffers fast enough and new packets have to be dropped. The irregular curve however is not clearly explained by this.

Part of the reason for the irregular loss lies in the way buffer memory is allocated. The packet buffer allocates packets in blocks of 64 bytes while the input module uses 16 byte blocks. This explains the sharp rise in loss between 48 and 52 byte packets, as 48 byte packets only consume three 16 byte blocks while 52 byte packets use four. As packet sizes grow this effect should subside, and from 52 to 80 bytes loss clearly goes down. The peak at 84 however cannot be explained in this way: from 80 to 84 there is the addition of one 16 byte block, but that happens between 64 and 68 too, without any increase in loss. Instead it is more likely that the loss depends either on a problem with how the input module uses its buffer when it is full or on the socbus network working poorly for specific packet sizes due to resource conflicts between the input modules.

Allocation block sizes can also explain the bump after 128 byte packets: there the memory used by the packet buffer increases from two to three blocks, which should have some effect. This is however probably not the only reason, judging by the surrounding loss. At 192 however the move from three to four 64 byte blocks seems to have a positive, stabilizing effect on the loss.
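The block counts referred to above are simply the packet size divided by the block size, rounded up. A short Python illustration (the function name is chosen here):

```python
def blocks_needed(packet_bytes, block_bytes):
    """Number of fixed-size blocks a packet occupies (ceiling division)."""
    return -(-packet_bytes // block_bytes)


# Input module (16 byte blocks) and packet buffer (64 byte blocks).
for size in (48, 52, 64, 68, 80, 84, 128, 132, 192, 196):
    print(f"{size:4d} bytes: {blocks_needed(size, 16):2d} x 16 byte, "
          f"{blocks_needed(size, 64)} x 64 byte")
```

Going from 48 to 52 bytes adds a fourth 16 byte block, and the 64 byte block count steps up just above 128 and just above 192 bytes.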

If the socbus network, when transferring packets of certain sizes, has total transfer times that interact poorly with the time needed to perform a failed connection setup, it is entirely possible that this could lead to the strange loss behaviour observed. It is known that the socbus network has quite a large overhead on small packets [1], and since the input modules share the packet buffer and route table they will from time to time block each other, causing failed connection setups. This can explain why such behaviour would manifest itself only on small packets and not on large ones.

In a few cases some of the loss which occurred was unaccounted for in the router², but it was minimal (fewer than 100 packets out of more than 8,000,000). In even fewer cases the receiving packet generator also got small amounts of corrupt packets (CRC and data contents incorrect). The reasons for these problems are unknown, but there is no correlation between these small losses and the peaks seen in the diagram. Also, these types of loss only happen during full duplex tests; when data is flowing in only one direction the loss is zero, as in not a single packet lost.

Also interesting to note is that in all tests exactly the same number of packets leaving the input modules is received by the corresponding packet generators, indicating that there are no problems with the rest of the router design, except of course that it cannot handle the pressure of lots of small packets.

In Figure 4.3 on page 21 it can be seen that the packet buffer only has two of its connections in the socbus network. One of them is the route connection, which means only one of the four data connections is available for the two in/output modules. This socbus configuration is used because if each in/output module got its own packet buffer data connection the whole point of the socbus network would be lost. With such a configuration all the socbus network would be is an additional delay. Also, if the full eight Ethernet ports were added to the router, two ports per packet buffer connection is the sharing that would be needed.

By sharing one port this way and leaving the other three unconnected, the resources actually used in the packet buffer are only the ones allocated to that single port. This means that only 1/4 of the buffer memories are used and only two of the memory banks (both on the same side of the memory as well). Also, with only one port all packets have to be processed in strict sequential order by the controller due to the way the buffer memory is used. This severely limits its parallel processing capability by only letting it select between a read and a write. As a result the controller basically works as inefficiently as it can with the selected implementation.

To determine whether the controller can handle the packets, a simulation was set up. By simulating a constant flow of small (40 byte) packets directly on the socbus connection of the packet buffer, the average time for each memory read/write could be calculated to 17.7 memory clock cycles. This would allow for $\frac{115{,}000{,}000}{17.7} = 6{,}497{,}175$ packets per second to be read from or written to the memory. Two full duplex gigabit Ethernet ports can handle $\frac{2 \times 2 \times 1{,}000{,}000{,}000}{96 + 64 + 48 + 48 + 16 + 8 \times 46 + 32} = 5{,}952{,}381$ packets per second, which means the controller is not the reason for the loss. Instead the problem must lie in the socbus network not being able to feed enough packets to the packet buffer.

² Packets entered the input modules but never left them, without being counted as dropped due to a full buffer.
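The comparison between controller capacity and Ethernet packet rate is easy to reproduce. The Python sketch below restates the two calculations; the variable names are illustrative.

```python
MEMORY_CLOCK_HZ = 115_000_000
CYCLES_PER_ACCESS = 17.7  # simulated average per 40 byte read or write

# Memory operations (packet reads or writes) the controller can do per second.
controller_pps = MEMORY_CLOCK_HZ / CYCLES_PER_ACCESS

# Bits on the wire for a minimum-size packet: gap, preamble, two MAC
# addresses, type field, 46 bytes of (padded) data and checksum.
MIN_PACKET_BITS = 96 + 64 + 48 + 48 + 16 + 8 * 46 + 32  # = 672

# Two ports, each full duplex (factor 2 x 2), at 1Gbit/s per direction.
ethernet_pps = 2 * 2 * 1_000_000_000 / MIN_PACKET_BITS

print(f"controller: {controller_pps:,.0f} packets/s")
print(f"ethernet:   {ethernet_pps:,.0f} packets/s")
print("controller keeps up:", controller_pps > ethernet_pps)
```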

8.2.3 Automated Tests, Different Packet Sizes

Given the fairly odd loss behaviour with small packet sizes, another automated test was set up to investigate whether small packets would introduce massive loss when combined with larger packets. Instead of varying a single packet size for both packet generators, the size for each of them was now changed independently. Unfortunately a complete test with all packet sizes divisible by 4 would take close to two weeks, so instead, with the previous results in mind, the number of data points was kept higher for smaller packet sizes and the gap was increased successively. The first test case had one side vary from 48 to 400 bytes while the other varied more coarsely from 48 to 1500 bytes. Figure 8.2 shows the results of this test, and as can be seen the loss is fairly smooth when one side has "large" packets and the other small packets. With both sizes above roughly 200 bytes the loss is zero.

[Figure: loss (%), 0 to 35, plotted against the packet sizes (bytes) of the two sides, 0 to 400 and 0 to 1500]

Figure 8.2: Loss with Different Packet Sizes, 400-1500

To ensure that no other odd effects existed with large packets on both sides, another test was made where the same sizes were used as above, except that the side using 48 to 400 byte packets was extended to cover the full range from 48 to 1500 bytes. The result can be seen in Figure 8.3 and there are no new surprises. The parts that look very black in the diagram are due to the increased number of data points there.

[Figure: loss (%), 0 to 35, plotted against the packet sizes (bytes) of the two sides, both 0 to 1500]

Figure 8.3: Loss with Different Packet Sizes, 1500-1500

8.3 FPGA Utilization

One of the requirements was for the design to be implemented in a Xilinx Virtex-II XC2V4000 FPGA. Obviously this has been done, but it is also interesting to see how much of the FPGA is used, to determine the prospects of extending the design. How much and what type of resources the different modules use is important in this context to see whether extending the router to eight ports is even feasible.

Table 8.7 shows the FPGA utilization for the complete router while Table 8.8 shows the resource usage per module. It is important to note that the individual modules combined do not add up to the complete router with regard to clocking resources. All modules use the socbus clock, which is generated by a DCM directly in the router, and this is counted as a global clock in each of the modules.

The packet buffer also lists two additional clocks which are not really part of it: one is for the monitor CPU and the other is a clock signal that is routed through it because a suitable DCM is located there. The remaining three clocks are used for the memory controller.

The input modules also use extra clocking resources; each uses a DCM and two additional clocks for the physical interface.

Table 8.7: FPGA Utilization, Router

Resource          Used    Total   Usage
Slices            11421   23040   49%
Slice Flip Flops  10137   46080   21%
4 input LUTs      19047   46080   41%
Bonded IOBs       165     824     20%
Block RAMs        41      120     34%
Global Clocks     11      16      68%
DCMs              6       12      50%

Table 8.8: FPGA Utilization, Modules

Resource          Input       Output     Packet Buffer  Route Table  Socbus
Slices            895   3%    273  1%    2542  11%      1970   8%    6354   27%
Slice Flip Flops  856   1%    271  0%    3470   7%      1745   3%    3944    8%
4 input LUTs      1139  2%    444  0%    3395   7%      3234   7%    11565  25%
Block RAMs        2     1%    3    2%    18    15%      12    10%    0       0%
Global Clocks     3    18%    2   12%    6     37%      1      6%    1       6%
DCMs              1     8%    0    0%    3     25%      0      0%    0       0%

8.4 Requirements

All the primary requirements have been fulfilled: a working packet buffer using DDR memory has been implemented and it can handle 4 gigabit Ethernet ports. Of the secondary requirements, the support for 8 ports could not be achieved with the selected design; it runs too slowly and has too much overhead. Testing of the packet buffer in the router has however been done, and much more thoroughly than originally envisioned. Many parts of the router have been replaced or changed and a more advanced routing table has been integrated.

Some performance issues appeared when testing the router, and to be able to really make use of the packet buffer some redesign would be necessary; however, this was never the goal of the integration.

Chapter 9

Problems

During the development process problems were naturally encountered. Many of them were logic errors and unimportant in the grand scheme of things once solved; however, some of them merit a closer look.

This chapter also includes some issues which existed throughout the project.

9.1 Meeting Controller Timing

Some parts of the design, particularly the controller, are what one might call high speed for the technology used.

When the project was conceived, the original idea was for the controller to run at 125MHz, which would allow for a raw memory bandwidth of $125 \times 2 \times 64 = 16000$ Mbit/s. This would be sufficient to provide eight gigabit Ethernet ports running at full duplex with buffer capacity.

Reaching these kinds of speeds in the FPGA is hard. Only one attempt to run the controller at 125MHz succeeded, and this was with a simpler controller which did not interface with the rest of the design as necessary. For a while 120MHz was used, but as bugs were found and corrected, resulting in a slightly more complex design, the controller became too complex to meet the required timing at this speed as well.

The final design runs the controller at 115MHz, and in order to meet timing the design tries to use a low fan-out¹. This is one of the methods used in XAPP253 [6] to achieve a high speed design. It is a simple concept: if a signal needs to be routed to fewer places it is easier to meet the timing. However, the tools used are far from perfect, which complicates things. A couple of times during development, for no apparent reason, the design failed to meet timing when an extra level of flip-flops was put into a path to help reduce fan-out. The extra flip-flops were of course put there in the first place to help with timing, so the end result was that neither design worked. No real solution was found to these problems other than working around them, redesigning parts and moving flip-flops around until timing was met.

¹ Fan-out is a term describing how many inputs are driven by a specific output.

It seems that the tools cannot handle overly complicated timing behaviour on certain signals, but no consistent behaviour was ever found. Sometimes timing failed and the tools reported failing paths that should not even be in the design. In some of the cases the design even worked when tested in hardware despite these timing errors, indicating that at least part of the problem might be in the timing analyzer.

The tools however are not all bad. What made it possible for the design to meet timing throughout the project was a feature in them. The normal way to generate the FPGA configuration is a three step process: synthesize, map, and place & route. There is however an option for the map step which performs so-called "timing-driven packing and placement". This does the normal work of map and additionally places the hardware in the FPGA, the work normally performed by the placer in the last step, place & route. By combining the two tasks the end result is a better starting point for the router than if the placer had been run separately. It doesn't come cheap however: the time needed to complete the full process on the final design exceeds 2 hours on the computer used, of which more than half is spent in map.

9.2 Synthesizer Bug

Another tool problem, which was shown to be a bug, was encountered when translating parts of the old input module from a schematic to Verilog HDL code.

The code translated was an Internet checksum generator, and after translation to Verilog a combinatorial loop was introduced by the synthesizer. At least it gave an error stating this, and half a day's work yielded a work-around for the problem.

9.3 Memory Read Timing

This problem was the first of the really hard ones to find and the only one relating to how the provided hardware worked.

During testing with the first simple sequential controller everything worked fine at 50MHz. This showed that the hardware worked with the selected design. The speed however would not be enough to handle the amounts of data required. Naturally the frequency was increased, and all of a sudden things stopped working². This was at 60MHz, and the memory is not specified to run below 70MHz. This was puzzling as things worked at 50MHz.

To find out what worked and what didn't, the clock frequency was varied, and it was shown that the controller worked from around 35MHz to 55MHz. The reason it did not work below 35MHz is not known to this day, but given that this is below half the specified minimum frequency, and that at 30MHz the controller wouldn't have had even close to the required bandwidth, that part of the problem was left uninvestigated.

² No data was available at this time as to what was wrong; this came later by using a logic analyzer.

The possibility that read data could be corrupted at higher speeds because it arrived at the wrong time was not considered, because of the Xilinx application note used as the basis for the controller design. In the application note [6] Xilinx shows a DDR memory controller which uses the memory clock to capture read data, without using the strobe signals, at 133MHz. This was the basis for the belief that the memory clock could be used for this purpose in this design too, which obviously turned out to be false. That it could happen at a sufficiently high speed was obvious, but for it to happen as early as 60MHz, at less than half the speed at which the Xilinx controller runs, was inconceivable at the time.

To be able to directly observe the problem, a logic analyzer connected to the memory bus would be needed. This would require soldering small wires to the memory, and how that would affect the delicate electrical environment surrounding the memory was reason enough to discourage such a route, on top of the fact that soldering wires to 0.45mm wide connectors seems like a painful job. Instead, to try to get to the bottom of the problem, the FPGA was reprogrammed to output some of the read data and control signals to observable external pins, to which a logic analyzer was connected.

With the information provided by the logic analyzer a pattern could be seen as the frequency was increased. When writing the data sequence 1, 2, 3, 4, ... to the memory, the result when reading would be correct at 50MHz, but as the frequency was increased the data became somewhat distorted, and at 75MHz the result of a read was in the order 2, 1, 4, 3, ... What this means is that the data read was offset by half a cycle (since data comes two times per cycle in DDR). Using this information it was deduced that the clock used for reading data was flawed, and alternative solutions were looked at, including the selected solution: delaying the read clock until correct operation is achieved.

9.4 Controller Bu�er Usage

The simple allocation method described in 6.5.1 may be good when it worksbut before that it was one of the biggest sources of problems in the packetbu�er.

The buffer memory needs to be used in a strictly sequential fashion, both when reading and when writing, for the allocation to work properly. Initially the buffer memory was allocated not per socbus connection but per secondary controller. Allocating per secondary controller works most of the time. It fails when a large packet enters on one port and a smaller packet, which completes before the first, enters on another port, because deallocating the smaller packet implies that the larger packet is deallocated as well.
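The failure described above can be illustrated with a toy model. The sketch assumes a head/tail style of sequential bookkeeping in which releasing a region moves the tail past it; the actual mechanism in 6.5.1 may differ, and the class `SequentialBuffer` and its methods are hypothetical names.

```python
class SequentialBuffer:
    """Toy buffer whose blocks are allocated and freed strictly in order."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.head = 0  # running count of blocks handed out
        self.tail = 0  # every block index below tail counts as free

    def allocate(self, count):
        if self.head + count - self.tail > self.blocks:
            raise MemoryError("buffer full")
        start = self.head
        self.head += count
        return start

    def release(self, start, count):
        # Sequential bookkeeping: freeing a region moves the tail past it,
        # which silently frees everything allocated before it as well.
        self.tail = max(self.tail, start + count)

    def is_live(self, start):
        return start >= self.tail


# One allocator shared by two ports (the problematic case).
shared = SequentialBuffer(blocks=16)
large = shared.allocate(12)   # large packet arriving on one port
small = shared.allocate(2)    # small packet arriving on another port
shared.release(small, 2)      # the small packet completes first
large_survives_shared = shared.is_live(large)

# One allocator per connection keeps the two packets independent.
conn_a, conn_b = SequentialBuffer(blocks=16), SequentialBuffer(blocks=4)
large = conn_a.allocate(12)
small = conn_b.allocate(2)
conn_b.release(small, 2)
large_survives_split = conn_a.is_live(large)

print(large_survives_shared, large_survives_split)  # False True
```

In the shared case the large packet is lost when the small one is released, while separate allocators keep the packets independent.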

On the memory read side there were also problems with the strict sequential ordering, because before a packet is allowed to be read it has to be written to the memory. A simple enough concept, but it also implies that a packet cannot be allocated buffer space before it actually is allowed to be read out. Additionally, it cannot be allowed to start reading before there is buffer memory available, and just letting it wait for buffer memory to become available is bad for performance because that locks up part of the controller which could otherwise perform useful operations.

Before all of this was handled correctly, many days had been spent trying to find the problems, which became more and more rare with each correction. Sometimes it was not even evident that this was the actual problem. Testing in hardware showed that the design did not work, but simulating until the problem appeared took several hours.

Chapter 10

Conclusions and Further Work

By using the special resources of the Virtex-II FPGA family it is possible to construct a DDR memory controller which can work with data rates in excess of 200 MHz. High efficiency can also be achieved by doing large memory transfers at maximal memory speed. This is possible by using an efficient memory operation schedule which leaves few interruptions from refreshes and other SDRAM memory operations when doing reads and writes.

Getting good performance when making a large number of small memory reads and writes is, however, a much harder task. This would require very careful scheduling of memory operations and a controller capable of selecting among a large number of possible transfers.

How the memory is used from a transfer perspective is important, but there are more aspects of memory usage. How to allocate and keep track of the memory is an issue that needs to be addressed. The use of large amounts of memory in a router can be problematic, but by using a system where packets are stored in equal-sized slots and identified by their memory location, a system with low bookkeeping requirements is achieved.

In addition to the packet buffer memory, smaller buffers within the router also affect performance. The size and use of the packet buffer's local storage for packets being read or written is particularly interesting with regard to overall packet buffer performance.

In the router, however, the limiting factor is not the packet buffer but the socbus network. By using socbus for interconnecting modules a very flexible router design is achieved, which unfortunately has the drawback of some loss when the router is heavily loaded with small packets. This is because the socbus network cannot transfer packets to the packet buffer fast enough, so new packets have to be dropped.


10.1 Controller Improvements

To get more performance out of the packet buffer the controller needs to be improved. There are basically two paths to increased throughput: raising the clock speed and improving parallel use of the different memory resources. To do the latter, changes need to be made in the way the controller gets its read/write commands as well as in how memory operations are scheduled.

10.1.1 Small Packets

The main limitation when it comes to small packets is the overhead. As it is, setting up a memory transfer incurs too many extra cycles to handle small packets efficiently.

Solving this is by no means impossible. It involves increasing the parallel use of banks, interrupting bursts once the real data has been transferred and issuing reads/writes for several packets back-to-back. The increased complexity of the controller would, however, make for a difficult development process.

10.1.2 Large Packets

To be able to handle large packets at full wire speed in all directions at the same time, a more efficient schedule is not enough. It would improve the throughput slightly, but at 115 MHz there is no way it can support 8 gigabit ports.

Instead the frequency of the controller has to be increased. This, together with the increased complexity of the controller, would probably require an extremely pipelined architecture.
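The bandwidth argument can be checked with a rough calculation. The sketch below assumes that every packet is written to the buffer once and read from it once, so 8 gigabit ports at full wire speed in both directions need 16 Gbit/s of memory bandwidth. The data bus width is not stated in this chapter, so several widths are tried as assumptions, and all overhead (refresh, activate, precharge) is ignored, which makes the peak figures upper bounds. The function names are illustrative.

```python
def peak_ddr_gbit_per_s(clock_mhz, bus_width_bits):
    """Upper bound on DDR throughput: two transfers per clock, no overhead."""
    return clock_mhz * 1e6 * 2 * bus_width_bits / 1e9


def required_gbit_per_s(ports, line_rate_gbit=1.0):
    """Each packet is written to the buffer once and read from it once."""
    return ports * line_rate_gbit * 2


need = required_gbit_per_s(ports=8)
for width in (16, 32, 64):  # assumed widths; not taken from the thesis
    peak = peak_ddr_gbit_per_s(115, width)
    print(f"{width}-bit bus: peak {peak:.2f} Gbit/s, needed {need:.0f} Gbit/s")
```

Even with an assumed 64-bit bus the peak at 115 MHz is 14.72 Gbit/s, below the 16 Gbit/s needed, which is consistent with the conclusion that a better schedule alone cannot close the gap.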

10.1.3 Buffer Memory

The small buffer memories associated with the controller would also have to change. An increased size and/or a more efficient allocation scheme would be required to give the controller a broad selection of packets to choose from, ensuring that as few memory cycles as possible go to waste.

One idea to improve their usage is to keep a list of all available blocks instead of using them sequentially. This way there are no limitations on the order in which the buffer memory blocks have to be used, and packets can be allocated in a highly fragmented way within the buffer memories. This would complicate the transfer of read/write commands to the controller, since information on the exact blocks used is required too. However, the same asynchronous FIFOs could be used for this task, again with some increased complexity in the controller.
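A minimal sketch of the free-list idea, assuming equal-sized blocks identified by an index. The class `FreeListBuffer` is a hypothetical illustration, not the design proposed for the FPGA. Note that `allocate` returns the exact block indices, which is the extra information that would have to accompany each read/write command.

```python
from collections import deque


class FreeListBuffer:
    """Toy block allocator that tracks free blocks in a list, in any order."""

    def __init__(self, blocks):
        self.free = deque(range(blocks))

    def allocate(self, count):
        if count > len(self.free):
            return None  # not enough free blocks right now
        return [self.free.popleft() for _ in range(count)]

    def release(self, block_ids):
        self.free.extend(block_ids)


buf = FreeListBuffer(blocks=8)
large = buf.allocate(5)   # blocks 0..4
small = buf.allocate(2)   # blocks 5..6
buf.release(small)        # small packet finishes first; large is untouched
other = buf.allocate(3)   # reuses the freed blocks in a fragmented order
print(large, other)       # [0, 1, 2, 3, 4] [7, 5, 6]
```

Releasing the small packet has no effect on the large one, and the freed blocks can be reused immediately in whatever order they became available.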

10.1.4 Controller Command FIFOs

Trying to improve the parallel memory usage is no good unless the controller actually has something to use it for. Therefore the number of command FIFOs used within the packet buffer to transport read/write commands to the controller should be increased.

Twice¹ as many command FIFOs as there are input modules would be ideal. This way each input module would have its own "path" to the controller and each bank of the memory can be scheduled independently. Using more than that could be harmful, as it would allow reordering of packets from the same input module, which could hurt TCP performance.

10.2 Router Improvements

To be able to take advantage of packet buffer improvements the router design needs to change too. This relates specifically to socbus.

There are other types of improvements that should also be made to make the router more router-like. Currently it is very static and does not even have an IP address of its own.

10.2.1 Socbus

The most obvious limitation of the router is the socbus network. As it is, socbus cannot handle many small packets being transferred, and as the router grows to more ports this problem will become more apparent, since each input module needs two socbus transfers per packet: one for the actual packet data and one for the route lookup.
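To give a feeling for the load this puts on the network, the following sketch estimates the number of socbus transfers per second in the worst case. It assumes minimum-size Ethernet frames (64 bytes, plus 8 bytes of preamble and a 12-byte inter-frame gap, which are standard Ethernet figures rather than numbers from this thesis) arriving at full gigabit line rate on every port. The function name and the port counts are illustrative.

```python
WIRE_BYTES_MIN_FRAME = 64 + 8 + 12  # frame + preamble + inter-frame gap


def socbus_transfers_per_s(ports, line_rate_bit_per_s=1e9,
                           transfers_per_packet=2):
    """Worst-case socbus transfers per second with minimum-size frames."""
    packets_per_port = line_rate_bit_per_s / (WIRE_BYTES_MIN_FRAME * 8)
    return ports * packets_per_port * transfers_per_packet


for ports in (2, 4, 8):  # assumed port counts
    rate = socbus_transfers_per_s(ports)
    print(f"{ports} ports: about {rate / 1e6:.1f} million transfers/s")
```

Each port contributes roughly 1.49 million packets per second in this worst case, so the transfer count grows linearly with the number of ports and doubles again because of the separate route lookup.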

Given a good module placement, the problems with getting packets to the packet buffer could probably be kept at the current level without increasing congestion. The problem will instead arise when all input modules want to send lookups to the route table. Unless the route table has as many socbus connections as the packet buffer, the congestion for route lookups would increase. Having so many socbus connections would make the socbus network large, and given that almost half of the current design lies in the socbus network this would be very expensive; most probably the design would not fit in the FPGA.

A simple solution to the problems with loss, which would also allow many ports in the design, would be to replace the socbus network with dedicated connections between the modules that communicate. This would, however, remove the very general interconnection method where all modules can communicate with all others.

Instead, introducing dedicated connections for those paths which have high bandwidth requirements could be interesting. That would mean that the connections on which packets are transferred between input/output modules and the packet buffer would be replaced, leaving the socbus network to do much less work². By doing so the width of the socbus network can be decreased, because the high speed transfers are done elsewhere, and the resources needed for it would be more reasonable.

¹ There should be separate FIFOs for read and write commands.

10.2.2 ARP Support

The current router does not support any automatic discovery of hosts; it relies completely on preconfigured information. It neither processes ARP information nor announces itself in ARP replies.

To include such support, a CPU connected to the socbus network that could receive ARP information and form ARP replies would be good. The ARP replies could then be forwarded directly to the output modules by the CPU. The input module would have to be modified to forward ARP packets to the CPU, something it has been prepared for. Translation of destinations to MAC addresses would also need to be modified so that the CPU can update it. This hardware currently resides within the packet buffer, but it should be investigated whether the output modules would be a more suitable place.

Such a CPU has been mentioned in the previous projects too, and it could also be used for other tasks such as handling routing protocols and updating the routing table.

² In the current router this "less work" would consist only of route lookups, but other types of data, such as ARP and module reconfiguration, would also be transferred there.

Bibliography

[1] Tobias Borslehag, Implementation of a Gigabit IP router on an FPGA platform, LITH-ISY-EX--05/3708--SE, 2005

[2] Douglas E. Comer, Network Systems Design using Network Processors: Intel 2XXX Version, ISBN 0-13-1872-86-9, 2005

[3] Jedec Standard, Double Data Rate (DDR) SDRAM Specification, JESD79D, January 2004

[4] Jimmy Svensson, Design of a core router using the SoCBUS on-chip network, LiTH-ISY-EX--04/3562 SE, 2004

[5] Daniel Wiklund, Development and Performance Evaluation of Networks on Chip, Linköping University, ISBN 91-85297-62-3, 2005

[6] Xilinx Application Note, Synthesizable 266 Mbit/s DDR SDRAM Controller, XAPP253, January 2001

[7] Xilinx Product Specification, Virtex II Platform FPGAs: Complete Data Sheet, DS031 (v3.4), March 2005

[8] Xilinx User Guide, Virtex II Platform User Guide, UG002 (v2.0), March 2005


LINKÖPING UNIVERSITY ELECTRONIC PRESS

Copyright


The publishers will keep this document online on the Internet - or its possible replacement - for a considerable time from the date of publication barring exceptional circumstances.

The online availability of the document implies a permanent permission for anyone to read, to download, to print out single copies for your own use and to use it unchanged for any non-commercial research and educational purpose. Subsequent transfers of copyright cannot revoke this permission. All other uses of the document are conditional on the consent of the copyright owner. The publisher has taken technical and administrative measures to assure authenticity, security and accessibility.

According to intellectual property law the author has the right to be mentioned when his/her work is accessed as described above and to be protected against infringement.

For additional information about the Linköping University Electronic Press and its procedures for publication and for assurance of document integrity, please refer to its WWW home page: http://www.ep.liu.se/

© Daniel Ferm
Linköping, 2006
