
  • Multimedia and Quality of Service Support in PCI Express* Architecture

    JASMIN AJANOVIC AND HONG JIANG INTEL CORPORATION

    White Paper - September 19, 2002

    Information in this document is provided in connection with Intel products. No license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted by this document. Except as provided in Intel's Terms and Conditions of Sale for such products, Intel assumes no liability whatsoever, and Intel disclaims any express or implied warranty, relating to sale and/or use of Intel products including liability or warranties relating to fitness for a particular purpose, merchantability, or infringement of any patent, copyright or other intellectual property right. Intel products are not intended for use in medical, life saving, or life sustaining applications. Intel may make changes to specifications and product descriptions at any time, without notice.

    Copyright Intel Corporation 2002

    * Other names and brands may be claimed as the property of others.


    ABSTRACT

    PCI Express* is a third-generation, general-purpose I/O technology that is a natural evolution of the PCI architecture. It greatly enhances physical interconnect capability by using high-speed, point-to-point serial signaling with embedded clocking to enable long-term frequency scaling. PCI Express technology applies a fully packetized, split-transaction protocol to the load/store architecture model. This approach preserves the PCI software investment by providing backward compatibility with the PCI configuration and driver software model. While preserving software compatibility, PCI Express introduces capabilities that are essential for enabling new applications in client and server computer platforms as well as in telecommunication systems. Examples of these capabilities are: Enhanced Configuration and Power Management; Advanced Reliability, Availability and Serviceability; Quality of Service and Isochronous Support; and Advanced Switching.

    This white paper focuses on the Quality of Service (QoS) aspects of the PCI Express architecture that are driven by the requirements of multimedia and communication applications. It covers the PCI Express mechanisms provided to support traffic differentiation, including the Traffic Class labeling and Virtual Channel mechanisms. In addition, details of isochronous support are provided from an application perspective.

    Background

    Quality of Service Challenges

    Computer platforms based on current mainstream I/O interconnect technologies typically do not incorporate standard mechanisms for managing bandwidth and latency of I/O traffic within the system. Servicing of I/O traffic within such platforms is commonly based on a best-effort approach: any I/O agent can instantly generate a burst of traffic at any time, and the system will service it the best way it can. Since multiple agents can generate bursts simultaneously, this condition frequently leads to bandwidth over-subscription: during traffic peaks, an I/O agent cannot be provided with the amount of bandwidth required for maximum performance of its associated application. In addition, since instantly available bandwidth is directly related to service latency, over-subscription can cause extraordinary latencies, which in turn may affect not only the performance but even the basic functionality of the I/O agent and, consequently, of the entire platform. For example, a lack of adequate peak bandwidth for an I/O disk subsystem may result in lower application performance without greatly impacting functionality. However, extraordinary latencies can affect I/O traffic with real-time deadlines (such as traffic that moves video and audio information) to the point that the application fails to deliver the minimum acceptable quality, resulting in broken functionality.

    Figure 1. Traffic Blocking Example (diagram: a chipset with CPU and memory, two PCI bridges and a LAN device, with blocked traffic points marked on the PCI segments)

    Besides the problems caused by the lack of bandwidth management, current computer platforms are also exposed to traffic-blocking conditions, where a combination of traffic-ordering rules and flow-control actions can turn an I/O agent or subsystem into a bottleneck for local traffic. Figure 1 shows an example of a larger system topology where a local blockage at Bridge 1 is caused by a combination of oversubscribed write and read traffic. This blockage eventually ripples through the entire system, causing saturation of the read-completion queues within the host memory subsystem. At that point the host stops servicing new read requests from the LAN or the I/O subsystem connected to Bridge 2. When this condition lasts long enough, it causes visible bandwidth reduction and extraordinary service latencies.

    Platforms that are exposed to this type of problem can be said to provide no mechanisms for managing Quality of Service (QoS).

    PCI Express Solution for QoS

    The PCI Express solution for QoS relies on a set of mechanisms that enable traffic differentiation/prioritization and bandwidth-allocation control within the interconnect fabric. The foundation of traffic differentiation is:

    Traffic Class (TC) labeling mechanism, where packet-level tags are used to classify each packet

    Virtual Channel (VC) mechanism, which allows segregation of traffic with different TC labels into separate ordering and flow-control domains

    PCI Express defines a TC labeling mechanism that supports differentiation of packets into eight Traffic Classes. Packets carry TC information as an invariant label end to end within the fabric. As a packet traverses the fabric, this information is used at every link and within each switch element to make decisions concerning proper servicing of the traffic. A key aspect of servicing is the routing of packets, based on their TC labels, through corresponding Virtual Channels.

    Figure 2 illustrates the Virtual Channel concept. The foundation of a VC is independent fabric resources (shown in the enlarged area): queues/buffers and the associated control logic. Conceptually, traffic that flows through VCs is multiplexed onto a common physical link resource on the transmit side and de-multiplexed into separate VC paths on the receive side. These resources are used to move information across individual links with complete flow-control independence between different VCs. In addition, from the transaction-ordering point of view, VCs represent completely independent traffic paths. This is the key to solving the problem of flow-control/ordering-induced blocking, where a single traffic flow may create a bottleneck for all the traffic within the system.

    Figure 2. Virtual Channel Concept - An Illustration (diagram: packets from VC0..VCn multiplexed onto a shared Link between Chipset, Switch, Bridge and LAN devices, with CPU and Memory attached; the enlarged area shows per-VC queues at both ends of a link)

    Traffic is associated with up to eight VCs by mapping packets with particular TC labels to their corresponding VCs. The PCI Express VC mechanism allows flexible mapping of TCs onto VCs. In the simplest form, TCs can be mapped to VCs on a 1:1 basis. To allow performance/cost tradeoffs, PCI Express provides the capability of mapping multiple TCs onto a single VC. It is up to the system software to determine TC labeling and TC/VC mapping in order to provide differentiated services that meet target platform requirements. For example, on a platform that supports isochronous data traffic, TC7 can be reserved for isochronous transactions, in which case TC7 is mapped to the VC with the highest weight/priority. Figure 3 provides a platform-level example that illustrates the flexibility of the TC/VC mapping supported within the PCI Express architecture.

    Figure 3. TC/VC Mapping Example (diagram: links between Endpoints, Switches and the Root Complex, showing mappings such as TC[0:1] to VC0, TC[2:4] to VC1, TC[5:6] to VC2 and TC7 to VC3 on one link; TC[0:6] to VC0 and TC7 to VC1 on another; and TC[0:7] to VC0 on a third)
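As an illustration, the flexible TC-to-VC association described above can be sketched in a few lines of Python (a hypothetical helper written for this paper's discussion, not part of any PCI Express software interface):

```python
def build_tc_vc_map(vc_assignments):
    """vc_assignments: dict of VC id -> iterable of TC labels mapped to it.
    Returns a flat TC -> VC lookup covering all eight Traffic Classes."""
    tc_to_vc = {}
    for vc, tcs in vc_assignments.items():
        for tc in tcs:
            if tc in tc_to_vc:
                raise ValueError(f"TC{tc} mapped to more than one VC")
            tc_to_vc[tc] = vc
    missing = set(range(8)) - set(tc_to_vc)
    if missing:
        raise ValueError(f"unmapped Traffic Classes: {sorted(missing)}")
    return tc_to_vc

# Example from the text: TC7 reserved for isochronous traffic on its own VC,
# with the remaining TCs collapsed onto VC0 (a cost-reduced configuration).
mapping = build_tc_vc_map({0: range(7), 1: [7]})
assert mapping[7] == 1 and mapping[3] == 0
```

A 1:1 mapping is simply `{vc: [vc] for vc in range(8)}`; the many-to-one forms trade VC buffering cost for coarser differentiation.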

    By segregating traffic into up to eight classes and associating it with VC resources within the fabric, PCI Express provides the foundation for QoS. This foundation is extended with the definition of arbitration mechanisms that are deployed within the fabric elements (primarily switches) to control the allocation and usage of bandwidth. The following paragraphs cover some key switch-related aspects of QoS.

    PCI Express Switch and QoS

    As illustrated by the platform example in Figure 2, a switch is an element of the PCI Express fabric that provides fan-out for point-to-point links. A switch consists of multiple ports, each of which can serve the function of an Ingress (incoming) or Egress (outgoing) Port depending on the traffic flow. Packets that flow through a switch are subject to the following actions/operations:

    1. At the Ingress Port, the address/routing information within the packet is used to determine the Egress Port.

    2. After the Egress Port determination is made, the TC/VC mapping information associated with that port is used, along with the TC label contained within the packet, to determine the specific VC resource within the Egress Port.

    3. Since multiple packets from different Ingress Ports can target the same Egress VC resource, the next step is to arbitrate for the VC resource (queue/buffer) before the packet can be moved from Ingress to Egress. This is called Port Arbitration.

    4. Once the packet is transferred to the appropriate VC resource within the Egress Port, it is subject to VC Arbitration: VC resources within the same Egress Port arbitrate for the shared link connected to the port to transfer the packet.

    Figure 4 shows an example of a switch illustrating the points made previously.


    Figure 4. Switch Model (diagram: Ingress Ports 0-3 feeding Egress Ports; Port Arbitration within each VC of an Egress Port, TC/VC mapping at the Egress Port, and VC Arbitration among the VC0/VC1 queues for the egress link; these structures are replicated for each Egress Port)
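As a sketch, the first steps of the packet flow listed above might look like this (toy data structures invented for illustration; Port and VC Arbitration, steps 3 and 4, are only hinted at by the queueing):

```python
from collections import deque

class SwitchModel:
    """Toy model of packet flow through a PCI Express switch. The routing
    table and TC/VC map shapes are hypothetical, chosen for readability."""
    def __init__(self, routes, tc_vc_map):
        self.routes = routes          # address prefix -> egress port number
        self.tc_vc_map = tc_vc_map    # egress port -> {TC label: VC id}
        self.vc_queues = {}           # (egress, vc) -> deque of packets

    def accept(self, packet):
        # Step 1: routing information within the packet selects the Egress Port.
        egress = self.routes[packet["addr_prefix"]]
        # Step 2: the Egress Port's TC/VC map selects the target VC resource.
        vc = self.tc_vc_map[egress][packet["tc"]]
        # Step 3 (Port Arbitration) would arbitrate here among Ingress Ports
        # contending for the same (egress, vc) queue; we simply enqueue.
        self.vc_queues.setdefault((egress, vc), deque()).append(packet)
        return egress, vc

sw = SwitchModel(routes={0x1: 2}, tc_vc_map={2: {0: 0, 7: 1}})
egress, vc = sw.accept({"addr_prefix": 0x1, "tc": 7})
assert (egress, vc) == (2, 1)   # TC7 lands in VC1 of Egress Port 2
```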

    VC Arbitration

    PCI Express defines a default VC prioritization via the VC Identification (VC ID) assignment; i.e., the VC IDs are arranged in ascending order of relative priority in the Virtual Channel Capability Structure. The example in Figure 5 illustrates an Egress Port that supports eight VCs, with VC0 treated as the lowest priority and VC7 as the highest.

    Figure 5. VC Arbitration Model (diagram: eight VCs, VC0 lowest priority to VC7 highest; with Low Priority Extended VC Count = 3, VC0-VC3 form a low-priority group whose arbitration is governed by the VC Arbitration Capability field, e.g. WRR, while VC4-VC7 form a strict-priority group; VC7 is shown as the ISOCH VC and VC0 as the default VC for PCI-style traffic)

    PCI Express provides the flexibility to tailor/configure the arbitration scheme to the specific requirements of a platform. The availability of default prioritization does not restrict the type of algorithms that may be implemented for VC arbitration. Either an implementation-specific method or one of the following defined methods can be implemented:

    Strict Priority: based on inherent prioritization, i.e., VC0 = lowest, VC7 = highest

    Round Robin (RR): the simplest form of arbitration, where all VCs have equal priority


    Weighted Round Robin (WRR): a programmable weight factor determines the level of service

    Depending on their product requirements, switches can support either a fixed subset of arbitration methods or full configurability controlled by the PCI Express defined VC programming model. A switch component reports its capabilities to the PCI Express configuration software using the VC Arbitration Capability register. If strict-priority arbitration is supported by the hardware for a subset of the VC resources, software can configure the VCs into two priority groups: a lower and an upper group. The upper group is treated as a strict-priority arbitration group, while the lower group is arbitrated for only when there are no packets to process in the upper group. Figure 5 illustrates an example configuration that supports eight VCs separated into two groups: the lower group consisting of VC0-VC3 and the upper group consisting of VC4-VC7. Arbitration within the lower group can be configured to one of the supported arbitration methods. When the Low Priority Extended VC Count configuration parameter is set to zero, all VCs are governed by strict-priority VC arbitration. When the parameter is equal to the Extended VC Count (i.e., to the maximum number of supported VCs), all VCs are governed by the VC arbitration indicated by the VC Arbitration Capability.
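The two-group arbitration just described can be sketched as follows (an illustrative model with invented names; the lower-group arbiter is passed in as a callable, standing in for RR or WRR):

```python
def select_vc(pending, lpevc, low_group_arbiter):
    """Choose the next VC to transmit from, per the two-group model.
    pending: list where pending[i] is True if VC i has packets queued.
    lpevc: Low Priority Extended VC Count; VC0..VC[lpevc] form the lower
    group, VCs above it form a strict-priority group (highest VC ID wins).
    low_group_arbiter: callable applied to the lower group's ready VC IDs."""
    # Upper group: strict priority, highest VC ID first.
    for vc in range(len(pending) - 1, lpevc, -1):
        if pending[vc]:
            return vc
    # Lower group: arbitrated only when the upper group has nothing pending.
    ready = [vc for vc in range(lpevc + 1) if pending[vc]]
    return low_group_arbiter(ready) if ready else None

# With lpevc = 3 (as in Figure 5): a ready VC5 beats everything in VC0-VC3.
assert select_vc([True, False, False, True, False, True, False, False],
                 3, lambda ready: ready[0]) == 5
# Upper group idle: the lower-group arbiter decides among VC0 and VC3.
assert select_vc([True, False, False, True, False, False, False, False],
                 3, lambda ready: ready[0]) == 0
```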

    Port Arbitration

    Arbitration within a VC refers to arbitration between traffic that is coming from different Ingress Ports but is mapped onto the same VC. The inherent prioritization scheme that makes sense for arbitration among VCs is not applicable in this context, since it would imply a strict arbitration priority among ports. Traffic from different ports can be arbitrated using the following supported schemes:

    Hardware-fixed arbitration scheme, e.g., Round Robin

    Programmable WRR arbitration scheme

    Programmable Time-based WRR arbitration scheme

    A hardware-fixed RR or RR-like scheme is the simplest to implement, since it does not require any programmability. It makes all ports equal in priority, which is acceptable for applications where no software-managed differentiation (i.e., per-port bandwidth budgeting) is required. Programmable WRR adds flexibility: it can operate as flat RR or, if differentiation is required, different weights can be applied to traffic coming from different ports. This scheme is used where a different allocation of bandwidth needs to be provided for different ports. Time-based WRR is used for applications that require not only a particular allocation of bandwidth but also tight control of its usage. This scheme limits the amount of traffic that can be injected from different ports within a fixed time period, which is required for applications such as isochronous services, where traffic must meet strict deadline requirements. The following sections of this document discuss support for isochronous applications in more detail.
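A minimal sketch of time-based WRR, assuming a programmable arbitration table that assigns each virtual timeslot in the period to at most one port (the class and table shape are invented for illustration):

```python
class TimeBasedWRR:
    """Toy time-based WRR port arbiter. The table assigns each virtual
    timeslot to at most one Ingress Port; a port may inject one payload
    only in its own slots, which bounds its bandwidth in every period."""
    def __init__(self, table):
        self.table = table  # table[slot] = port number, or None for idle
        self.slot = 0

    def grant(self):
        # Unlike plain WRR, the arbiter advances with time even when the
        # granted port has nothing to send, so bandwidth cannot be hoarded.
        port = self.table[self.slot]
        self.slot = (self.slot + 1) % len(self.table)
        return port

# Period of 8 virtual timeslots: port 1 owns 2 slots, port 2 owns 1 slot,
# so their injection rates are fixed at 2/8 and 1/8 of the period.
arb = TimeBasedWRR([1, None, 2, None, 1, None, None, None])
grants = [arb.grant() for _ in range(8)]
assert grants.count(1) == 2 and grants.count(2) == 1
```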

    Example of a Traffic Flow through a Switch

    Having identified the basic QoS mechanisms within a switch, we can now use the example in Figure 6 to illustrate the entire traffic flow.

    The example shows packets belonging to two TCs that map to two Virtual Channels of different priority (with darker shaded boxes symbolizing the higher-priority TC packets). The position of incoming packets in the horizontal dimension illustrates their relative arrival times. Assuming RR arbitration between Port 1 and Port 2 and a fixed packet size, the Egress Port packet streams are the result of the previously defined actions applied within the switch.


    Figure 6. Example of Traffic Flow Through a Switch (diagram: Ingress Ports 0-3 with labeled packets such as 2.0a and 3.1b flowing through Port Arbitration and VC Arbitration onto Egress Ports 2 and 3; darker high-priority packets are serviced ahead of lower-priority ones)

    PCI Express QoS Programming Model and Software Requirements

    To complete the QoS support, the PCI Express architecture defines a standard programming model for software-visible QoS mechanisms. The model uses a standard PCI Capability Structure to define the register set required for control of:

    - VC setup/configuration
    - VC status
    - TC/VC mapping
    - Arbitration (Port, VC) configuration

    Figure 7 illustrates the layout of the PCI Express VC Capability Structure.


    Byte Offset            Register (bits 31:16 / 15:0)
    00h                    PCI Express Enhanced Capability Header
    04h                    Port VC Capability Register 1
    08h                    Port VC Capability Register 2 (VC Arb Table Offset in bits 31:24)
    0Ch                    Port VC Status Register / Port VC Control Register
    10h + n * 0Ch          VC Resource Capability Register (n) (Port Arb Table Offset in bits 31:24)
    14h + n * 0Ch          VC Resource Control Register (n)
    18h + n * 0Ch          RsvdP / VC Resource Status Register (n)
    VAT_Offset * 04h       VC Arbitration Table
    PAT_Offset(n) * 04h    Port Arbitration Table (n)

    Figure 7. VC Capability Structure

    A requirement for PCI Express based platforms that support QoS capabilities is to include a QoS Manager (OS-level QoS management software). This software is responsible for resource management and for configuration of VC capabilities/parameters within the fabric using the VC Capability Structure registers. Part of this task includes controlling the admission process for the PCI Express devices that are competing for VC resources within the fabric. This is accomplished in coordination with device-driver software, which negotiates the required bandwidth resources and traffic priorities with the QoS Manager on behalf of the corresponding I/O hardware and its associated application. In order to perform these functions, the QoS Manager must create a "fabric database" in which it associates QoS capabilities (bandwidth, VC resources, TC/VC mapping, etc.) with every link within the discovered system topology. The performance characteristics of each link are determined by calculating its maximum available bandwidth from the width and operational speed of the link. Only after the fabric database is constructed can the QoS Manager evaluate requests from the device drivers associated with individual PCI Express Endpoints, to determine which VC resources, and how much bandwidth per VC, can be allocated to a particular Endpoint.
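A sketch of what a fabric-database entry might hold, with link bandwidth computed from width and speed as described above (assuming the first-generation PCI Express signaling rate of 2.5 GT/s per lane with 8b/10b encoding; the field names and link labels are invented):

```python
def link_bandwidth_mb_s(lanes, gt_per_s=2.5):
    """Raw bandwidth of a link in MB/s per direction: each lane carries
    gt_per_s gigatransfers (bits) per second, and 8b/10b encoding spends
    10 wire bits per data byte, hence 250 MB/s per lane at 2.5 GT/s."""
    return lanes * gt_per_s * 1000 / 10

# One fabric-database record per discovered link (example values only):
fabric_db = {
    ("root_port_2", "endpoint_C"): {
        "bandwidth_mb_s": link_bandwidth_mb_s(lanes=4),  # x4 link
        "vc_count": 2,
        "tc_vc_map": {0: 0, 7: 1},
    },
}
assert fabric_db[("root_port_2", "endpoint_C")]["bandwidth_mb_s"] == 1000.0
```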

    QoS Application Example

    We have identified the standard mechanisms that PCI Express provides in support of QoS at the platform level. Figure 8 illustrates a potential application using an example where a PCI Express based platform is part of a larger network. In this example, a real-time video/audio stream is transmitted from a video-server source across the network to a client machine. Part of the network includes one or more communication switching/routing platforms based on PCI Express technology. (Note that the video server, as well as the client computer, may also be based on PCI Express technology.) Along with time-critical information such as video/audio, other information that is not time-sensitive (for example, textual news and/or financial statistics) is also communicated as part of the application. Since the entire network can consist of elements based on different technologies, the higher-level QoS mechanisms are typically translated to specific QoS mechanisms at every element of such a hybrid network. Within PCI Express based network elements, the Traffic Class and Virtual Channel mechanisms are used to provide differentiated services at the hardware level. Note that within PCI Express platforms, time-sensitive information (e.g., audio/video) is communicated using a dedicated VC resource, separate from non-real-time information and from network control/management traffic.


    Figure 8. Example of QoS Usage (diagram: a video server streaming across a network to a client; time-critical video/audio and non-time-sensitive text, e.g. news headlines, are carried over separate PCI Express VCs)

    To further illustrate the QoS support in the PCI Express architecture and the practical implementation aspects of the VC mechanism, the rest of this white paper covers isochronous service support as a specific usage model.

    What Is Isochrony?

    While a PCI Express Endpoint device performing bulk data transfer generally requires high throughput and low latency to achieve good performance, it can tolerate occasional data transfers that complete with arbitrarily long delays. The normal semantics for general-purpose I/O transactions, as defined for the default Traffic Class (TC0), are supported by the default Virtual Channel (VC0). VC0 supports bulk data transfer by providing a best-effort class of service. Since there is no traffic regulation for VC0, during any given time period devices may issue more transactions than the PCI Express links can support and may saturate the physical links. Therefore, VC0 provides the device no guaranteed bandwidth or deterministic latency. This is why the default general-purpose I/O Traffic Class is referred to as the best-effort Traffic Class.

    On the other hand, a PCI Express Endpoint with real-time data transfer requirements, such as audio and video data streaming, continuously or periodically generates transactions. For example, NTSC video data in YUV422 format requires about 20 MB/s of sustained bandwidth. Another aspect of timely service for this type of data is providing a latency bound. For example, a video frame arriving later than its required time within the video information stream may be useless. The same holds at the I/O transaction level: each transaction must be serviced within a predetermined time bound. This type of application requires not only guaranteed data bandwidth but also deterministic service latency. The corresponding service is called isochronous (ISOCH) service, and is sometimes also referred to as hard real-time service.

    Due to limited system resources, the amount of bandwidth that a device can consume depends on the device's requirements and may be subject to limitations imposed by the platform software and hardware. Admission control refers to the process of determining whether a new isochronous service request can be granted based on the status of system resources.

    PC: the Center of Your Digital World

    Isochronous applications are becoming more important as the personal computer moves into the center of the user's digital world. As shown in Figure 9, the current generation of personal computers is used for various new applications:


    Audio playback/recording (CD or MP3 music)

    Video editing or simply watching full motion video from DVD, PC TV and possibly serving Video on Demand (VOD)

    Burning personal CDs or DVDs

    Exchanging multimedia contents with other wired or wireless devices

    Ever-increasing demands of existing applications, as well as the emergence of new applications, expose certain weaknesses of today's computers. Many of us have already experienced problems such as audio glitches, dropped video frames in capture or playback, or a bad CD caused by FIFO under-run conditions in the CD burner. Most of these problems are associated with some sort of failure in real-time services at the platform level. These failures may be caused by hardware components (I/O devices, interconnect buses, the memory subsystem, and the CPU/cache subsystem) as well as by software components (applications, middleware, device drivers, and the operating system).

    Figure 9. Examples of Applications that Require Isochronous or Other QoS Services (diagram: CD burning, full-motion video, music, VoIP, PC TV, VOD, and communications QoS)

    Looking from an I/O capability point of view, people are often surprised by such failures, as the actual bandwidth of the application may be well below the capability of the system and peripheral interconnects within the platform. Improving peak bandwidth definitely helps, but may not solve the fundamental problem whenever common resources are used for integrated services. There have been attempts in the past to deal with real-time services by defining latency registers that provide latency hints, or by rather loosely defined bandwidth/latency guidelines. However, such attempts have failed: they either lacked the detail needed to drive interoperable implementations, or fell short of providing control/regulation mechanisms, leaving them open to misinterpretation and likely misuse. In short, providing glitch-less multimedia services requires "ISOCH done right." Toward that goal, the PCI Express architecture delivers a comprehensive isochronous protocol solution that helps provide robust support for real-time data transactions.

    PCI Express Isochronous Support

    The foundation of isochronous service support in PCI Express is the VC mechanism. TC labeling and VC arbitration provide segregation of isochronous and non-isochronous traffic. In particular, time-based WRR port arbitration is the most important mechanism defined for support of isochronous service. It provides several service policies, including bandwidth allocation, traffic regulation, and traffic policing (metering).


    PCI Express isochronous support is based on a quantitative isochronous contract that covers both bandwidth and latency. The contract has the following parameters:

    T: Isochronous Period
    t: Virtual Timeslot
    Y: Isochronous Payload Size
    N: Maximum allocatable number of virtual timeslots (N = T/t)
    M_i: Number of virtual timeslots allocated to client i
    L: Total latency for a requester
    L_fabric: PCI Express fabric latency
    L_completer: Latency of the completer

    By assigning M_i timeslots to client i, the bandwidth allocated to client i is calculated as:

        BW_i = (M_i * Y) / T

    For mainstream workstation, desktop, and mobile computers, the parameters can be fixed as follows: T = 12.8 microseconds, t = 100 ns, Y = 128 B (Max_Payload_Size), and N = 128. With these fixed isochronous parameters, a bandwidth value in megabytes per second is easily converted into a number of virtual timeslots: for example, a 20 MB/s bandwidth is equivalent to two timeslots.
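The conversion can be checked with a few lines of arithmetic, using the fixed parameters above:

```python
# Fixed isochronous contract parameters for mainstream platforms.
T = 12.8e-6   # isochronous period, seconds
t = 100e-9    # virtual timeslot, seconds
Y = 128       # isochronous payload size, bytes (Max_Payload_Size)
N = round(T / t)
assert N == 128  # maximum allocatable virtual timeslots per period

def slots_needed(bw_mb_s):
    """Virtual timeslots needed for a requested bandwidth. From
    BW_i = (M_i * Y) / T it follows that M_i = BW_i * T / Y; with the
    fixed parameters one timeslot is worth Y / T = 10 MB/s."""
    return round(bw_mb_s * 1e6 * T / Y)

assert slots_needed(20) == 2   # the 20 MB/s NTSC-style stream from the text
```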

    Besides bandwidth, another important isochronous parameter is latency. The total latency for servicing any request from a requester is the sum of the latency caused by the PCI Express fabric and the latency contributed by the completer that services the request:

        L = L_fabric + L_completer

    The completer latency accounts for the allowed delays of the completer's internal data services, which may include its memory subsystem. The fabric latency can be further divided into propagation delay and queuing delay. The propagation delay, which may consist of propagation through the transaction layer, data link layer, and physical link, as well as switching delays, is relatively fixed and is a function of the data-path width and speed. The queuing delay is variable and depends on the servicing protocols, such as port and VC arbitration. Given the isochronous service policies, the fabric latency can be readily calculated from the topology. The completer latency may be platform dependent. Future versions of this document will provide a more detailed discussion of latency.

    End-to-End Isochronous Services

    In order to meet the requirements of the above-mentioned isochronous contract, the system must provide end-to-end isochronous service. That means each component along the path between the requester and the completer must have TC/VC support and conform to the isochronous requirements, including management and regulation of traffic at each aggregation point. To configure the operation of an aggregation point, ISOCH Broker software (part of the QoS Manager) scans the topology to determine whether a given isochronous request can be served and, if so, configures the components along the path to provide the service.

    PCI Express follows the standard address-decoding rules of PCI, where each device is given a range of a flat memory address space. Within a PCI Express hierarchy associated with a Root Complex, one hierarchy domain is formed from each Root Port. A hierarchy domain, consisting of a Root Port, one or more Endpoints, Switches, and PCI Express-to-PCI bridges, resembles a tree topology. (The simplest hierarchy domain may contain one Root Port and one Endpoint.) For arbitration purposes, this tree concept can be extended across the whole PCI Express hierarchy using the Root Complex Register Block (RCRB, a configuration register structure defined by the PCI Express Specification), which provides TC/VC mapping and arbitration control among Root Ports. For systems where a single host memory subsystem is the target of all I/O accesses, a tree topology can be established for the entire PCI Express hierarchy. Figure 10 shows an example of a system that consists of a Root Complex, one Switch, one PCI Express-to-PCI Bridge, and three connected Endpoint devices. The Root Complex has four Root Ports. Arbitration within the Root Complex is controlled using an RCRB structure. In this example, it is fair to assume that no isochronous service is required from Port 1, since it connects the PCI Express hierarchy to a legacy PCI bus via a Bridge component.


    Figure 10. An Example of PCI Express Topology

    When configuring isochronous service for Endpoint C, the ISOCH Broker needs to check the following two configurable elements to determine the ISOCH bandwidth support along the path between Endpoint C and the host memory subsystem:

    Root Port #2: the Maximum Time Slot field reported by the associated VC in Root Port #2.

    RCRB: the remaining timeslots in RCRB (the Maximum Time Slot reported minus the timeslots already assigned for Root Port #4 if any).

    However, for Endpoint D, the ISOCH Broker will have to check the following four configurable elements:

    Switch Port A (#1): the Maximum Time Slot field reported by the associated VC in the port.

    Switch Port C (#0): the remaining timeslots in the port (the Maximum Time Slot reported minus the timeslots already assigned for Port B (#2) if any).

    Root Port #4: the Maximum Time Slot field reported by the associated VC in Root Port #4.

    RCRB: the remaining timeslots in RCRB (the Maximum Time Slot reported minus the timeslots already assigned for Root Port #2 if any).

    When the ISOCH Broker determines that isochronous service can be provided for Endpoint D, it maps the assigned TC to the corresponding VC and enables the VC for the path (within all above listed configurable elements). It also programs the following two Time-based WRR port arbitration tables according to the bandwidth allocation provided for Endpoint D:

    RCRB: governing arbitration between Root Ports #1 (not expecting any isochronous requests), #2, #3 (not connected) and #4.


    Switch Port C (#0): governing arbitration between Ports A and B.
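The admission check the ISOCH Broker performs along such a path can be sketched as follows. This is an illustrative model, not the spec-defined configuration interface; the field names (`max_time_slots`, `assigned_slots`) and the numeric capacities are assumptions:

```python
# Sketch of the ISOCH Broker admission check (illustrative, not a spec API).
# Each configurable element along the path reports a Maximum Time Slots value
# and tracks the timeslots already assigned to other ports/streams.

def remaining_slots(element):
    """Timeslots still available at one arbitration point."""
    return element["max_time_slots"] - element["assigned_slots"]

def can_admit(path_elements, requested_slots):
    """Service is possible only if every element on the path between
    requester and completer can still supply the requested slots."""
    return all(remaining_slots(e) >= requested_slots for e in path_elements)

# Path for Endpoint D in Figure 10: Switch Port A, Switch Port C, Root Port #4,
# and the RCRB. Capacities and prior assignments below are invented examples.
path = [
    {"name": "Switch Port A (#1)", "max_time_slots": 64,  "assigned_slots": 0},
    {"name": "Switch Port C (#0)", "max_time_slots": 64,  "assigned_slots": 8},
    {"name": "Root Port #4",       "max_time_slots": 64,  "assigned_slots": 0},
    {"name": "RCRB",               "max_time_slots": 128, "assigned_slots": 16},
]
print(can_admit(path, requested_slots=12))  # True
```

If the check succeeds, the Broker would then program the TC/VC mapping and the time-based WRR tables listed above; if any element falls short, the request is refused.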

    Time-based WRR Arbitration

The WRR arbitration process provides a certain type of fair service for its clients. It has been shown that, for traffic regulated by a leaky bucket, it can also provide rate/jitter bounds to clients [see References 3-6]. However, such a scheme has several drawbacks:

Large buffers (cost): Larger buffer requirements for the server (a Switch or any fabric component). In particular, it has been shown that the minimum buffer required for each client can be as large as one WRR cycle time.

Large jitter (performance): Even though jitter can be bounded, large jitter is introduced by the bursty service, because the arbiter serves one client continuously until it reaches the weight limit and only then turns its attention to the next client. Therefore, traffic for each client experiences alternating busy and idle cycles. Without per-stage traffic regulation, such burstiness accumulates at each stage in the fabric: the larger the fabric, the bigger the bursts.

Lack of traffic policing (no traffic reshaping): WRR arbitration alone does not reshape traffic. Consequently, traffic may become burstier inside a fabric due to traffic aggregation. Even if the traffic injected into the fabric (at the edge) is nicely regulated with certain rate/jitter characteristics, more and more buffer space is required at downstream switches (and also at the completer) in order to meet a prescribed latency bound. Due to the same traffic aggregation, non-conforming traffic (i.e. traffic not conforming to a certain rate/jitter bound) can negatively impact the service of conforming traffic along or across the same data path. As a consequence, an ill-behaved device may cause failure of applications associated with well-behaved (i.e. compliant) devices.

In order to provide cost-effective isochronous service solutions for mainstream computing and communication systems, the PCI Express architecture solves the above problems by defining the time-based Uniform WRR (UWRR) arbitration scheme as an element of the Virtual Channel mechanism. In short, given a fixed WRR cycle, UWRR arbitration tries to serve each client uniformly according to its weight assignment. This is achieved by dividing the WRR cycle into small units and servicing one client per unit. Therefore, a WRR table can represent one WRR cycle, with the total number of entries equal to the sum of the weights. The problem of uniform servicing then becomes the mathematical problem of assigning all clients with weights that are 'uniformly' distributed in the UWRR Table. Adding a time dimension to the UWRR scheme, the so-called time-based UWRR arbitration further provides traffic policing capability, which allows traffic policing at the edge of the PCI Express fabric and traffic reshaping within the fabric. With time-based UWRR arbitration, one WRR cycle corresponds to an isochronous period of 12.8 microseconds, the minimal service unit is one virtual timeslot, and one transaction, regardless of its packet payload size, accounts for one service.

Figure 11 depicts an example of isochronous traffic flowing upstream through the Switch shown in Figure 10. In this simplified example, an isochronous period T is subdivided into 10 virtual timeslots t. The time-based WRR port arbitration table at Port C is shown as [0, B, 0, A, 0, 0, A, 0, 0, A], in which one slot is assigned to Port B and three slots to Port A. The value 0 (the Port Number of Port C) in the table indicates an idle slot. (Note that, to keep the example simple, non-isochronous data traffic is not shown.) When isochronous traffic is served as the highest-priority traffic using strict-priority VC arbitration, starvation of other traffic is avoided because the isochronous traffic is regulated, through the use of idle slots, to not consume all the bandwidth resources. In this case, the impact of other traffic can be modeled as jitter on individual isochronous transactions, which does not affect the general isochronous service. By examining Figure 11, we can observe the following key aspects of time-based WRR arbitration:

Uniform service of the Ingress Port: Any request arriving at an Ingress Port may be served at the next timeslot scheduled for that Ingress Port. When traffic is uniformly injected into Ingress Port A, it is also uniformly serviced at Egress Port C. In this case, each request experiences a tightly bounded queuing latency that is smaller than the worst-case schedule jitter in the arbitration table (e.g. 4 timeslots for traffic from Port A).


Uniform injection at the Egress Port: When both Ingress Ports are fully loaded, the isochronous output from Egress Port C takes the exact form of the port arbitration table, with traffic from Ingress Ports A and B uniformly mixed and evenly spaced. When an Ingress Port does not have a pending request at a timeslot whose arbitration table entry contains that port's ID, the port arbitration simply enters an idle state, allowing any pending non-isochronous traffic to go through. This uniform output behavior of a Switch allows cascading of Switches while maintaining the same behavior, and thus the same requirements, for a Switch regardless of its location within the fabric (i.e. its position within the interconnect hierarchy).

Isochronous traffic regulation: In the case of non-uniform injection, the time-based WRR arbitration delays an early-arriving request until the next assigned timeslot. For example, the second request arrives at Port B only two timeslots after the first request. Because there is no timeslot assigned to Port B within the next 9 timeslots, this request has to wait about 10 timeslots before it can be serviced.

Isochronous traffic policing: For the next two requests (3rd and 4th) at Port B, if the corresponding receive buffer at Port B is large enough to accept them early as shown, they will sit in the receive buffer for a long time (two to three periods in this example). This would eventually cause flow-control-induced backpressure on the device on the other side of the link. If the receive buffer at Port B is not very large, these two requests might not be accepted by Port B at all. In either case, if the device connected to Port B is an ill-behaved device that attempts to inject more traffic than has been negotiated in the ISOCH contract, the behavior of that device will not impact other, compliant devices.
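The regulation behavior described above can be reproduced with a minimal simulation of the Port C arbitration table. This is a simplified sketch (one request served per timeslot, unbounded receive buffers, no non-isochronous traffic), not a model of actual Switch hardware:

```python
from collections import deque

# Port arbitration table at Egress Port C from Figure 11; 0 marks an idle slot.
TABLE = [0, "B", 0, "A", 0, 0, "A", 0, 0, "A"]

def simulate(arrivals, cycles=2):
    """Serve at most one pending request per timeslot, and only when the table
    entry for that slot names the request's ingress port. `arrivals` is a list
    of (port, arrival_timeslot) pairs."""
    queues = {"A": deque(), "B": deque()}
    served = []  # (service_timeslot, port)
    for t in range(cycles * len(TABLE)):
        for port, arrival in arrivals:
            if arrival == t:
                queues[port].append(arrival)
        entry = TABLE[t % len(TABLE)]
        if entry != 0 and queues[entry]:
            queues[entry].popleft()
            served.append((t, entry))
    return served

# Port B's second request arrives only two slots after the first, but the next
# B slot does not come around until the following isochronous period:
print(simulate([("B", 0), ("B", 2)]))  # [(1, 'B'), (11, 'B')]
```

The early request at t = 2 is held until the next B slot at t = 11, matching the roughly 10-timeslot wait described above.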

[Figure 11 plots, against time, the repeating port arbitration table at Port C ([0, B, 0, A, 0, 0, A, 0, 0, A], one isochronous period T of 10 timeslots t), the requests received at Ports A and B, and the transactions served at Port C.]

    Figure 11. Time-based WRR Port Arbitration Example

    Programming Uniform WRR Table

Given a set of bandwidth requirements, generating uniformly distributed WRR tables is essential for providing isochronous services. Here is an example of a solution in which programming of a time-based Uniform WRR table is formulated as a MINMAX problem, and the MINMAX problem is solved using a DDA (Digital Differential Analyzer) Huffman Tree algorithm.

Without loss of generality, let us consider an (n+1)-port Switch. The UWRR Table programming problem can be formulated as follows:

    Objective: Define a service schedule for an Egress Port (with Port Number 0) of an (n+1)-port Switch that not only provides uniform service for each of the n Ingress Ports (with Port Number 1 to n) but also generates uniformly distributed output stream from the Egress Port.


Inputs: The bandwidth allocations for the n Ingress Ports are {M1, M2, ..., Mn} transactions per isochronous period T, with M = M1 + M2 + ... + Mn not exceeding the Egress Port's capability M0 (M0 ≤ N = 128).

Output: TABLE[0:N-1] (the time-based WRR Port Arbitration Table), with M fields assigned a valid Ingress Port Number and N-M idle fields assigned Port Number 0.

    Constraints:

Minimize the maximum input jitter for each Ingress Port p as:

    MIN{ MAX[ Jitter_I(i, Mp) ] }, for i = 0 to Mp - 1

Minimize the maximum output jitter for the Egress Port as:

    MIN{ MAX[ Jitter_E(i, M) ] }, for i = 0 to M - 1

where jitter is a floating-point value defined as the absolute difference between the cyclic distance of two adjacently assigned fields in TABLE and the ideal uniform distance dist_target = N/M:

    Jitter_E(i, M) = ABS[ dist(i, i+1) - dist_target ], for i = 0 to M - 1

Jitter_I is defined analogously over the Mp fields assigned to Ingress Port p, with dist_target = N/Mp.

The MINMAX problem described above is similar to the entropy-coding problem. Here each port can be viewed as a symbol, and the probability of symbol i (Port i) is expressed as p(i) = Mi/N. Huffman coding, using a Huffman tree, gives the minimal entropy coding for this set of symbols. The same Huffman tree can also be used to generate the WRR table by introducing a DDA function at each parent node of the tree. A DDA, commonly used in graphics, can generate evenly distributed 0s and 1s for a given slope of a line. As there are many valid Huffman codes for any given set of symbols, the results from a DDA Huffman Tree are not unique, depending on the initial phase of each DDA. Each solution may have a different jitter measure. To meet the above-mentioned MINMAX constraints, a search of the valid codes can yield an optimal solution, as described by the following procedure.

    Setup: Build a Huffman binary tree from bandwidth assignments

The Ingress Ports, plus a virtual idle port, form the leaf nodes of the tree, each with a value equal to its assigned number of timeslots.

    Sort the leaf nodes in ascending order and replace the first two nodes by a parent node. Continue this process until there is only one node left, which is the root node of the tree.

The value at a parent node is the sum of the values of its two child nodes.

    The function at a parent node is a DDA that uniformly mixes the assignments of the two child nodes.


Operation: Traverse the tree using the DDAs, assigning one timeslot to one leaf node at each step

    Assign a new initial phase to each DDA.

    Evaluation starts at the root node; the assignment traverses down the tree and stops at a leaf node. There are N evaluations per round, and each evaluation assigns one timeslot to one leaf node (one port, including the virtual idle port).

    Measure the input and output jitters.

    Repeat this process for a new set of initial phase assignments.

    Results: The table with the minimal jitter is the final result.
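A compact implementation of this procedure is sketched below. It builds the Huffman tree from the weight assignments and runs a smooth-WRR style DDA at each parent node. For brevity, the search over initial DDA phases for minimal jitter is omitted (all phases start at zero), so the result is one valid table rather than the jitter-optimal one; the data layout and function names are illustrative assumptions:

```python
import heapq
import itertools

# Sketch of the DDA Huffman Tree table generator (illustrative, phase search
# omitted). Leaf = ('leaf', weight, label); parent = ('node', weight, l, r).

def build_huffman(weights):
    """weights: dict mapping port label -> timeslot count, including the
    virtual idle port. Repeatedly merges the two smallest nodes."""
    tie = itertools.count()  # tie-breaker so nodes themselves are not compared
    heap = [(w, next(tie), ('leaf', w, label)) for label, w in weights.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, n1 = heapq.heappop(heap)
        w2, _, n2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), ('node', w1 + w2, n1, n2)))
    return heap[0][2]

def generate_table(tree, n):
    """One evaluation per timeslot: walk down from the root, letting the DDA
    at each parent pick the child whose accumulated credit is largest."""
    credits = {}  # per-parent DDA state; all initial phases are zero here
    table = []
    for _ in range(n):
        node = tree
        while node[0] == 'node':
            _, total, left, right = node
            state = credits.setdefault(id(node), [0.0, 0.0])
            state[0] += left[1]    # each child earns credit at its weight
            state[1] += right[1]
            pick = 0 if state[0] >= state[1] else 1
            state[pick] -= total   # charge the chosen child one full cycle
            node = left if pick == 0 else right
        table.append(node[2])      # leaf reached: its port gets this slot
    return table

# Figure 12 assignments: Ports A-D get 3, 4, 5, 6 slots; 0 is the idle port.
weights = {"A": 3, "B": 4, "C": 5, "D": 6, 0: 14}
table = generate_table(build_huffman(weights), n=32)
print(table.count("D"), table.count(0))  # 6 14
```

Because each parent's DDA hands out visits in exact proportion to its children's weights over one cycle, every leaf receives exactly its assigned number of timeslots per 32-slot table.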

The above DDA Huffman Tree algorithm is applied to an example 5-port Switch. The Switch and the resulting Huffman tree for Switch Port E are illustrated in Figure 12. The four Ingress Ports of the Switch, A, B, C, D, with Port Numbers 1, 2, 3, 4, are assigned 3, 4, 5, 6 timeslots, respectively, from an isochronous period with a total of 32 timeslots. As shown, a leaf node with value 14 is also inserted in the Huffman tree; it corresponds to the number of idle timeslots that need to be assigned. Depending on the selection of the initial phase for each DDA, the resulting jitter measurements differ. The result with the smallest jitter measurement generated by the algorithm is given in Table 1.

[Figure 12 shows the 5-port Switch (Ingress Ports A-D, Egress Port E) and the Huffman tree for Port E: leaf nodes 3, 4, 5, 6 (Ports A-D) and 14 (the virtual idle port); parent nodes 7, 11, and 18; and root node 32. M for Ports A, B, C, D = {3, 4, 5, 6}; total slots N = 32.]

    Figure 12. A DDA Huffman Tree for Port E of a 5-port Switch

    Table 1. UWRR Table for Port E

Index:  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Entry:  C  0  B  0  D  0  A  0  C  D  0  B  0  C  0  A

Index: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
Entry:  D  0  C  0  B  0  D  0  C  A  0  D  0  B  0  D
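As a quick sanity check, the entries of Table 1 can be verified against the bandwidth assignments of Figure 12:

```python
# Table 1 entries for Egress Port E; "0" marks an idle timeslot.
ENTRIES = ("C 0 B 0 D 0 A 0 C D 0 B 0 C 0 A "
           "D 0 C 0 B 0 D 0 C A 0 D 0 B 0 D").split()

counts = {port: ENTRIES.count(port) for port in "ABCD"}
print(len(ENTRIES), counts, ENTRIES.count("0"))
# 32 {'A': 3, 'B': 4, 'C': 5, 'D': 6} 14
```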


    Device Considerations

    This section summarizes some key requirements/considerations related to design of PCI Express platform elements that support isochronous services.

    Design of an ISOCH capable Switch or Root Complex should consider the following requirements:

- Minimum 2 VCs
- Support for VC arbitration such as strict priority
- Time-based WRR port arbitration for a non-0 VC
- Report bandwidth capability through MaxTimeSlots
- Meet the ISOCH latency guideline

    Design of an ISOCH capable Endpoint should consider the following requirements:

- Minimum 2 VCs
- Aggregate data up to MaxPayloadSize
- Use naturally aligned packets
- Proper use of the No Snoop attribute field
- Proper buffer sizing to meet the latency guideline
- Expect backpressure when injecting non-uniformly

    Multimedia Platform Example

As shown in Figure 13, PCI Express links may be used in a multimedia computer as the backbone (host interface) supporting isochronous I/O traffic for external ISOCH capable interconnects such as 1394 and USB. PCI Express can also serve as the backbone for QoS support of wired (Gigabit Ethernet) or wireless (IEEE 802.11) communications, or as a direct connection for PCI Express native audio/video devices. The PCI Express architecture's support for multimedia and QoS paves the way for many other new and exciting applications across computing and communication platforms.



    Figure 13. A Multimedia Platform Example

    Summary

PCI Express provides comprehensive support for QoS and isochronous services at the interconnect fabric level. Along with the appropriate support in operating system software and platform firmware, this will enable a new class of applications for client, server, and communication platforms. At the time this white paper is written, deployment of the QoS capabilities in mainstream platforms is in progress, targeting the first generation of products to be introduced in 2003.

References

1. PCI Express Base Specification, Rev. 1.0, PCI-SIG, July 2002.
2. Proposed Advanced Switching Addendum to the PCI Express Specification, Draft.
3. A. Demers, S. Keshav, and S. Shenker, "Analysis and simulation of a fair queueing algorithm," Internetworking: Research and Experience, vol. 1, 1990.
4. J. Bennett and H. Zhang, "WF2Q: Worst-case fair weighted fair queueing," Proc. IEEE INFOCOM '96, pp. 120-128, Mar. 1996.
5. A. K. Parekh and R. G. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the single-node case," IEEE/ACM Trans. on Networking, vol. 1, pp. 344-357, June 1993.
6. A. K. Parekh and R. G. Gallager, "A generalized processor sharing approach to flow control in integrated services networks: the multiple node case," IEEE/ACM Trans. on Networking, vol. 2, pp. 137-150, April 1994.

    Want More Info on PCI Express?

Intel Developer Network for PCI Express: http://developer.intel.com
PCI-SIG Web Site: http://www.pcisig.com

