
A High-Performance, Low-Latency Fully-Optical Peripheral Interconnect for Short Range Applications

David J. Miller (supervised by Andrew W. Moore)

October 2008

1 Introduction

PCI Express 3 was announced in August 2007. Like its predecessor, it sought to satisfy the insatiable demand for performance by doubling throughput. Unlike its predecessor, it did not do so by simply doubling the bit rate.

It is well understood that the performance of a communications connection is a function of both throughput and latency. Beyond cut-through communications protocols such as Infiniband, very little work has been done to address latency, especially not in peripheral interconnects within a computer. In fact, a hypothesis of my research is that, if anything, latency has grown worse as a largely unavoidable consequence of certain design decisions required to boost throughput.

As with processors in the past, throughput was conventionally advanced by building faster and wider buses, but, as with processors, there is a definite limit to the success of this strategy.

The physical dimensions of a bus limit frequency both because of dielectric loss and because power consumption is a function of both frequency and capacitance. PCI Express can perhaps double frequency once more before costly materials such as Teflon become necessary in manufacture.
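The frequency and capacitance dependence can be made explicit with the standard first-order model for dynamic power dissipation. This is a textbook approximation supplied here for context, not a formula from the original text:

    P_{\text{dynamic}} = \alpha \, C \, V^{2} \, f

where α is the switching activity factor, C the load capacitance of the trace, V the voltage swing and f the clock frequency. Widening a bus adds capacitance and raising the clock adds frequency, so either strategy pushes power up roughly linearly.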

Optical technologies show great promise as solu-tions for improving both throughput and latency buthave evident drawbacks, notably the lack of an equiv-alent of RAM.

2 Related work

The importance of latency in communications performance is generally well understood. [MVCA97] showed how host overhead influenced performance in Myricom hardware, and described a microbenchmarking technique.

The HPC community makes heavy use of low-latency protocols with cut-through routing, such as Myrinet and Infiniband. Liu et al. showed in [LCW+03, LWP04] that in such systems latency is on the order of 5 µs and that RDMA polling times can be as low as 600 ns, figures low enough that interconnect latency could be a nontrivial factor.

The data vortex [LSLL+03] uses deflection routing (concentric rings of fibre with periodic entries and exits) to avoid the need for buffering packets. [SSLLB05] describes a fully populated 12×12 data vortex. It is a novel approach, but difficult to build: it does not support variable-length packets, every fibre must be exactly the same length, and latency through the fabric is potentially unbounded.

A PCI Express interface for the data vortex was described in [LLWB07], but the paper assumes fixed-length packets (as required by the data vortex), and the use of only one photodetector per agent limits sustained throughput to that of one wavelength.

SPINet [SLB05] is an optically addressed, self-routing architecture intended for use in highly integrated situations. It avoids delay lines and deflection routing, but resolves contention by dropping traffic.

Micro-ring resonators can be used to build very efficient, low-power switching elements [KKL07], and there is a substantial body of relevant work that builds upon them. Micro-ring resonators are exciting, but they are difficult to fabricate reliably, are very sensitive to temperature, and by their nature are not well suited to WDM.

[KL06] and [KKL06] describe a low-latency shared memory suited to large multiprocessor environments. Corona [VSM+08] is an on-chip interconnect suitable for chip multiprocessors.

[GDM+05] describes a local area WDM optical interconnect which takes a hybrid approach, using an electronic control plane with an optical data plane, similar to what is proposed in this thesis.

[Tuc06a] and [Tuc06b] consider optically switched IP routers with reference to buffering requirements, power consumption and physical density. Tucker concludes that, due to the lack of optical RAM, electronics will remain superior to all-optical implementations for the next decade or two, although McKeown made an argument in [McK04] for IP routers with no buffering.

Recent work [HBY+08] describes an optical memory interface, and IBM's TeraBus [KDK+05, SKD+06] describes a complete chip-to-chip packaged optical transmission system.

3 Factors of performance

3.1 Throughput

Bus throughput can be augmented either by increasing width or by increasing clock frequency, up to a limit imposed by a combination of clock skew, package pin count and signal integrity.

PCI Express avoided these problems by employing high-speed serial lines in which a clock is encoded in the data. Multiple lanes can be ganged together to provide the required throughput, and de-skewing can be done on a lane-by-lane basis.

As with conventional PCI, PCI Express doubled throughput once by doubling frequency, to 4 Gb/s at a signalling rate of 5 GT/s¹ [PS06], which is low enough for signal integrity to be acceptable in FR4 printed circuit boards provided care is taken when designing transmission lines.

¹ The unit gigatransfers-per-second accounts for the overhead imposed by the coding method.

Doubling the signalling rate again brings it to the point where signal integrity becomes challenging in FR4 PCBs. To manage this, the proposed new revision of the PCI Express standard has changed the coding method from 8b10b (a coding method common in high-speed serial communications protocols) to 128b/130b, reducing overhead from 25% down to about 2%. For almost double the throughput, the signalling rate increases by only 60%, to 8 GT/s.
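A quick sanity check of these figures, using Python as a calculator (illustrative arithmetic only; the helper function is my own, not anything from the specification):

    # Effective per-lane data rate after line-coding overhead.
    def payload_rate_gbps(signalling_gt_s, payload_bits, total_bits):
        return signalling_gt_s * payload_bits / total_bits

    gen2 = payload_rate_gbps(5.0, 8, 10)     # 8b10b     -> 4.00 Gb/s
    gen3 = payload_rate_gbps(8.0, 128, 130)  # 128b/130b -> ~7.88 Gb/s
    print(gen2, gen3, gen3 / gen2)           # ~1.97x the throughput for a
                                             # 60% higher signalling rate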

As with processor clock frequency, then, it seems likely that easy performance increases are a thing of the past.

3.2 Latency

Latency is an important factor in performance [Che96] and, in general, bandwidth has grown at a faster rate than latency has improved. This was established for disc, CPU, memory and networks in [Pat04], but Patterson's paper was silent about local interconnects.

3.2.1 Sources of latency

The PCI protocol [PS02] was introduced in 1992 for use in personal computers, to improve throughput and solve certain shortcomings of earlier bus protocols. The original protocol was specified at 33 MHz, and provided up to 16 clock cycles for slow peripherals to respond to a transaction.

PCI-X was introduced in 1998 to further enhance throughput [PS00]. It was rated to a somewhat higher maximum clock speed (133 MHz), and provided an option for split transactions. Split transactions allow a target to shoulder responsibility for completing a request by initiating a new transaction containing the requested result at its own convenience. Split transactions were necessary to accommodate devices that could not respond within the 16-cycle wait-state limitation at 133 MHz, but had the unfortunate side effects of increasing overhead (two bus transactions instead of one) and removing the upper bound on completion latency.


PCI Express made a radical departure from conventionally clocked parallel buses by introducing ganged, high-speed serial links. The standard borrowed heavily from the Physical Coding Sublayer of Gigabit Ethernet, using the same 8b10b coding; only the bit rate was increased, from 1.25 GT/s to 2.50 GT/s. Transactions became messages encapsulated in packets, and split transactions perforce became compulsory where they had been optional under PCI-X.

Conventional PCI and PCI-X used parity to detect bus errors. Data integrity is checked as data arrive, so an agent (bus device) can begin to process a transaction before the transaction has ended. PCI Express is packet oriented and, as with Ethernet, the entire transaction is protected with a CRC. A PCI Express agent must buffer a packet until the entire transaction has arrived before it can begin processing, because until then it is impossible to tell where an error lies, if one is present.

Even if the change was unavoidable, PCI Express probably did a great deal to hurt latency.

3.2.2 Significance

Conventionally, bus latency was probably insignificant relative to the speed at which peripherals operated. As peripherals become faster, the significance of bus throughput and latency increases.

[MVCA97] showed that system overhead was the most significant factor in the performance of Myrinet adapters. [LCW+03, LWP04] showed that RDMA operations over Infiniband and Myrinet take on the order of microseconds, with polling operations as little as 600 ns.

Algorithmic share trading accounted for an estimated third of all trades in 2006 [Gro06], projected to rise to over 50% by 2010. Latency, even in the microseconds, translates into a competitive advantage [Thi08].

In his 2008 Hot Interconnects keynote, Andrew Bach said that the NYSE data centre already uses multiple 10 Gb/s networks, and would readily use multiple 100 Gb/s networks if they were commercially available, because trade latency is so important. By comparison, a sixteen-lane PCI Express v3 link offers only about 114 Gb/s for payload,² only marginally more than the network's link speed. The host interface of a 100 Gb/s adapter would have to be very efficient before it could saturate its link.

² Based on 64-bit addressing and 128-byte TLPs. Although a TLP can theoretically carry a 4 kB payload, 128 bytes is standard because few chipsets have buffering for larger.
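The footnote's figure can be approximated as follows. The 24 bytes of per-TLP overhead assumed here (a 16-byte header with 64-bit addressing, plus sequence number, LCRC and framing) is my own accounting; the quoted ~114 Gb/s presumably assumes slightly less overhead per packet:

    # Rough payload-bandwidth estimate for a sixteen-lane PCIe v3 link.
    LANES, RATE_GT_S = 16, 8.0
    CODING = 128 / 130            # 128b/130b efficiency
    PAYLOAD, OVERHEAD = 128, 24   # bytes per TLP; overhead is assumed

    raw = LANES * RATE_GT_S * CODING         # ~126 Gb/s on the wire
    payload = raw * PAYLOAD / (PAYLOAD + OVERHEAD)
    print(f"{payload:.0f} Gb/s of payload")  # ~106 Gb/s, in the same
                                             # region as the quoted figure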

3.3 Power consumption

The high-speed serial transceivers used by PCI Express consume more power than the equivalent parallel bus because they contain logic not required on parallel buses (such as clock recovery, CRC logic and buffer management). By way of example, the power consumption of each transceiver in the Virtex-II Pro is 300 mW [Xil04].

Consumption is further increased by the transition density required to maintain clock synchronisation and DC balance, and by the greater amount of logic required to implement a PCI Express end-point.

Power consumption is also a function of frequency. Faster electronic interconnects will not help contain or reduce power consumption in data centres.

4 Optical solutions

Given the issues of throughput, latency and power consumption taken together, it seems likely that existing electronic interconnect technologies are not capable of meeting the demands of emerging and near-future peripherals, and drastically limit the effectiveness of peripherals such as accelerators. A solution may lie in an optical interconnect.

4.1 Properties of optical technologies

Photonic technologies have a number of features which make them attractive as a possible alternative to electronic transmission lines.

The most obvious benefit is sheer bandwidth. Unlike loss in electronic transmission lines, loss in optics is a function of (large) distance only, and not of modulation frequency. Latency due to buffering is absent because there is no optical equivalent of RAM. Where electronics use power during switching, lasers (including SOAs, described below) use power to maintain the inversion layer of the lasing medium, so power consumption is independent of modulation frequency.

Optics provides not just a potential escape from all of the problems outlined above, but other benefits besides. Where electronic switches require at least one switching element per bit or lane, wavelength division multiplexing can be used to combine multiple lanes so that a single optical switching element can switch all lanes at once. Further, optical switching elements are bidirectional: light can go in both directions at once.

Light doesn't radiate electromagnetic interference, at least not of the kind that matters to the FCC. Transmission lines carrying high-frequency signals make electromagnetic compliance harder, so optical interconnects promise escape from a problem which has conventionally been very difficult to manage [ban07].

4.2 Optical building blocks

The field of photonics has existed for a long time and offers a wide array of components to choose from. This section describes the components that may be useful for the purposes of this research.

Laser transceivers are a commodity item available in compact pluggable modules. For moderate speeds, laser light can be modulated directly by end-point electronics. Above 10 Gb/s, it is more common to modulate a continuous-wave light source with an external component such as a Mach-Zehnder interferometer.

Light can be switched using a wide variety of devices, including Semiconductor Optical Amplifiers (SOAs), Micro Electro Mechanical Systems (MEMS), Lithium Niobate Mach-Zehnder Interferometers and Spatial Light Modulators. Each has its strengths and weaknesses.

An SOA is a laser without mirrors. It generates very little light on its own, and outputs light from an external source as a function of the bias current applied to it. SOA switching time is related to carrier lifetime, which is on the order of 1 ns. Put together, these properties make SOAs excellent switching elements, and they have been used successfully in a number of applications [AWG+07]. We will use them in the proposed thesis.

As described in that paper, an optical switch fabric can be constructed using pairs of SOAs functioning as complementary on-off switches, connected using optical combiners and splitters. The intrinsic gain of SOAs used as switches is useful for compensating for the passive loss in the combiners and splitters required to couple the switch fabric.
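To make the gain-compensation point concrete, here is a minimal loss-budget sketch. The decibel figures are typical values assumed for illustration, not measurements from [AWG+07]:

    # Minimal optical loss budget for a combiner/splitter switch fabric
    # in which SOA gain offsets passive loss. All dB values are assumed.
    SPLIT_LOSS_DB = 3.5     # 1x2 splitter, including excess loss
    COMBINE_LOSS_DB = 3.5   # 2x1 combiner
    SOA_GAIN_DB = 10.0      # gain of the single SOA traversed per stage

    def net_gain_db(stages):
        """Net gain after `stages` switch stages, one SOA per stage."""
        return stages * (SOA_GAIN_DB - SPLIT_LOSS_DB - COMBINE_LOSS_DB)

    for n in (1, 3, 5):
        # A non-negative result means the SOAs cover the passive loss.
        print(n, "stages:", net_gain_db(n), "dB")

So long as the per-stage gain at least matches the combined splitter and combiner loss, the fabric can grow in depth without the signal falling below receiver sensitivity (noise permitting).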

4.3 Optical architectures

Interconnection networks have been well described and are the subject of textbooks such as [DT04] and [DYN03]. Many of them, notably Benes [Ben62] and Clos [Clo53] networks, lend themselves well to implementation in optics. Eng Tin Aw et al. described a practical implementation of such a medium-degree (32×32), fully interconnected, non-blocking SOA-based switch [AMW+08].

Fully interconnected switches are possible, but expensive and large due to the number of SOAs involved (640, in the case of the aforementioned 32×32 switch). Although in principle any device can communicate with any other device in personal computers and servers, this very seldom happens, because it is unusual for any two peripherals to speak a mutually compatible protocol.

Therefore, an N×1 switch is likely sufficient for the vast majority of cases, and inter-device communication can be implemented in two passes through the switch fabric for the few cases in which direct peripheral-to-peripheral communication is necessary.

Scheduling a device's access to an optical switch fabric is made considerably easier by dividing time into fixed-length slots, at a potential cost in latency and fabric utilisation. An important part of the work in this thesis will be to find a suitable choice for the time slot size, and to assess the performance trade-offs involved.

5 Thesis and desired outcomes

The preceding sections introduced and provided background on the problem which my research hopes to address, and some of the tools that may be used to do so.

This section sets out the claims that my research seeks to prove, and describes the contribution that it will make.

The hypothesis of my research is that an unbuffered optical interconnect can:

• be viable

• perform at least as well as conventional electronic interconnects

• outperform electronic interconnect bandwidth

• reduce latency and therefore increase performance

by virtue of the loss profile inherent to photonic components.

For the sake of simplicity, the research will begin by considering these claims in respect of the peripheral interconnects found in standard COTS (commercial, off-the-shelf) PC server hardware. PCs are self-contained (they lend themselves well to the proposed architecture), readily available (ease of experimentation) and collectively represent a highly significant proportion of the world's aggregate computing power (economically worthwhile).

Front-side bus (CPU, memory, etc.) interconnects are out of the scope of this research, but are logical and worthwhile targets for future work. Similarly, high-performance computing (especially clusters) may benefit from the techniques covered in this thesis, but is outside the scope of the research.

The research is not concerned with photonic components themselves, and will use the existing technology described in section 4.2 as discrete components. The integration required to make the proposed architecture commercially viable, such as building multiple SOAs together with suitable semiconductor waveguides, is left to others.

There are significant implications for the way future computers and applications are built should the thesis be proved.

The experimental work aims to support a thesis statement similar to one of the following:

• Unbuffered optical backplanes work and perform better than electronics, or

• Unbuffered optical backplanes work, are comparable with electronics, and could be made to exceed them given further development, or

• Unbuffered optical backplanes can be made to work, but are as yet impractical for industrial applications because of given factors, or

• Unbuffered optical backplanes won't work because of given factors, and an alternative to wider, faster interconnects must be found.

Additionally, some estimate may be made of the point at which power profiles favour optical interconnects over electronic interconnects, and/or of how far the proposed system might scale before a buffered OEO bridge would be required.

6 Methodology

The work begins with several hypotheses, which the experimental work described here is designed to either support, modify or disprove:

• Peripheral interconnect latency is significant enough to affect application performance

• Buffering, together with store-and-forward packet-oriented interconnects, makes this latency worse

• The characteristics of photonic technology allow for an unbuffered optical switching fabric without the store-and-forward architecture of existing electronic interconnects, as well as increasing system throughput past that of which electronic interconnects are capable

• The interconnect can be designed so that it is otherwise competitive with electronic interconnects

For the sake of keeping things simple and containing expenses, all of these experiments will be carried out on a one-lane PCI Express link, with the consequence that bandwidth figures will look low. Real implementations would run at much higher throughput (both higher symbol rates and greater width, viz. WDM).

It should also be noted that in the various demonstrator experiments, the overall system performance can never be greater than that of the host uplink, and that, at best, the performance of devices plugged into the demonstrator can only equal the performance of the same devices plugged directly into the host. The measure of the demonstrator's quality is how close it comes to native performance.

A close result demonstrates that the proposed interconnect is viable and efficient, and the expectation is that it will carry that performance over when the throughput (bit rate and number of lanes) is increased past that of an electronic interconnect.

The rest of this section describes in more detail the flowchart in figure 1.

6.1 Experiment 1: PCI Express latency

The first experiment is designed to characterise transaction timing on a variety of PCI buses, with a view to establishing the role of interconnect latency and buffering in peripheral performance. The data gathered will be used to answer the following questions:

• Do split completions and/or CRC-protected, serialised, packet-like transactions make latency worse than conventional parallel immediate (non-split) transactions?

• How much does transaction buffering contribute to the overall transaction latency?

• How much latency is there in the interconnect, and is it significant relative to the speed at which different sorts of workloads (network, disc, video, memory, hardware accelerators, etc.) operate?

Slow devices like discs and USB devices are not going to benefit from a low-latency interconnect. High-performance interconnects like Infiniband and Myrinet might, especially since those protocols employ cut-through switching in order to reduce latency, yet the interconnects which feed the interface cards are store-and-forward.

Figure 1: Flowchart of intended experiments.

As network parameters approach, or even exceed, the parameters of a host's internal interconnect, it seems very likely that the high bandwidth and low latency possible in optical interconnects could be very beneficial. Finally, accelerators and co-processors are peripherals which absolutely depend on low latency in order to be worthwhile. As a class, they have largely been under-exploited because the interconnects between peripheral and host processor(s) make accelerators ineffective.

For reasons described in the (separate) proposal for experiment 1, a direct measurement of a PCI bus is difficult. Instead, the latency associated with each segment of the local interconnect can be inferred from a set of differential measurements (microbenchmarks) from an origin to each hop along the path to a target (similar to how traceroute works). The latency due to buffering can be calculated from the link speed and the size of the probe packet.
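A sketch of that inference, with Python used purely as a calculator. The hop names and round-trip figures are hypothetical placeholders, not measurements:

    # Differential (traceroute-like) latency inference.
    rtt_ns = [("root complex", 250), ("switch", 640), ("endpoint", 1020)]

    prev = 0
    for name, rtt in rtt_ns:
        # Per-segment latency is the difference between successive
        # round-trip times, halved for one direction.
        print(f"{name}: ~{(rtt - prev) / 2:.0f} ns one way")
        prev = rtt

    # Buffering (serialisation) delay from link speed and probe size:
    probe_bits, link_bps = 128 * 8, 2e9   # 128 B probe, 2 Gb/s lane
    print(f"store-and-forward cost: {probe_bits / link_bps * 1e9:.0f} ns per hop")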

The data will provide a baseline against which to compare the data from later experiments, and will inform certain design decisions about the new optical interconnect. For example, if it turns out that CRC error detection increases latency because processing cannot begin until the whole packet has arrived, there may be benefit in reverting to a distributed checksum or parity detection code, similar to the way it was done in parallel PCI.

It is unclear at this stage whether split completions can be avoided in an optical fabric, but this situation may become clearer as the research progresses.

Results from this experiment should be apparent by Christmas '08.

6.2 Experiment 2a: Evaluate impact of latency

Martin et al. showed [MVCA97] that system overhead (of which latency is a component) is the factor with the greatest impact on performance. The object of this experiment is to see whether the latency attributable to buffering and the interconnect matters to peripheral performance.

As observed in [MVCA97], overhead cannot readily be reduced in a real system. Instead, the effect of increasing overhead by means of an electronic delay line can be used to extrapolate what might happen if overhead were decreased below its ordinary level.
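The extrapolation can be as simple as a linear fit of benchmark score against added delay, projected backwards past zero added delay. All the data points below, and the 0.5 µs reduction, are hypothetical:

    # Extrapolating the effect of reduced overhead from added-delay
    # measurements, after [MVCA97]. All data points are hypothetical.
    delay_us = [0.0, 1.0, 2.0, 4.0, 8.0]     # artificially added delay
    ops_s = [9800, 9100, 8500, 7400, 5800]   # measured benchmark score

    # Least-squares line: score as a function of added delay.
    n = len(delay_us)
    mx, my = sum(delay_us) / n, sum(ops_s) / n
    slope = sum((x - mx) * (y - my)
                for x, y in zip(delay_us, ops_s)) \
            / sum((x - mx) ** 2 for x in delay_us)
    intercept = my - slope * mx

    # Projected score if 0.5 us of intrinsic latency were removed:
    print(f"{intercept - 0.5 * slope:.0f} ops/s vs {intercept:.0f} baseline")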

Performance will be measured using benchmarks appropriate to the application hardware connected to the apparatus, as latency is gradually increased. For example, SpecSFS can be used to measure the performance of a network and a RAID controller.

A positive result would show that latency affects peripheral performance and, by inference, that a reduction in latency (such as by removing buffers from the interconnect) would improve performance.

There are good reasons to expect a positive result: [MVCA97, LCW+03, LWP04, Thi08] all suggest that HPC and financial applications are already sensitive to delays on the order of microseconds. If the result is negative or equivocal, it may be that the application examined is not sensitive enough to latency to show an effect. In that case, some investigation will be made into why the application is not significantly affected by latency and, if the hardware is available, the experiment will be repeated with a more applicable workload.

If a case cannot be made, theoretically or experimentally, for the significance of interconnect latency, the focus of the research will switch to providing bandwidth beyond that of which an electronic interconnect would be capable.

This experiment involves designing some custom hardware to break out a one-lane PCI Express slot to SMA connectors. One end will be a small PCI Express card with some hardware to replicate the reference clock for each of the devices connected. The other end will contain a PCI Express socket, a power connector and a regulator to supply the 3.3 volt rail to the socket.

The experiment will be implemented on a Xilinx Virtex prototype board interposed between these break-out boards. Allow about four months for this experiment.


6.3 Experiment 2b: Evaluate slot size selection

For the sake of the manageability of fabric scheduling, and given the lack of an optical equivalent to RAM, access to an optical switch fabric is typically divided into time slots [JRG+03].

A trade-off must therefore be made between fabric utilisation, overhead and latency. For sufficiently large time slots (where slot length >> MTU, as in optical burst switching [JV05]), latency is traded for efficiency. The results of experiments 1 and 2a will indicate whether this is a viable trade-off. For slot length ≈ MTU, latency is improved at the expense of overhead.
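The shape of the trade-off can be illustrated with simple arithmetic. The guard-band and CDR figures below are assumed values, not measurements (CDR cost is the subject of experiment 4):

    # Time-slot efficiency vs. latency. Guard and CDR times are assumed.
    GUARD_NS = 2.0   # guard band for SOA switching
    CDR_NS = 20.0    # clock/data recovery after fabric reconfiguration

    def slot_stats(data_ns):
        slot = GUARD_NS + CDR_NS + data_ns
        efficiency = data_ns / slot  # fraction of the slot moving data
        worst_wait = slot            # worst case: just missed a boundary
        return efficiency, worst_wait

    for data_ns in (50, 500, 5000):  # slot ~ MTU ... slot >> MTU
        eff, wait = slot_stats(data_ns)
        print(f"data {data_ns} ns: efficiency {eff:.0%}, worst wait {wait:.0f} ns")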

This experiment is designed to investigate the effect of quantising transactions at set intervals. The apparatus used for experiment 2a can be retasked to forward packets according to a given time slot size. As in the last experiment, a benchmark will be used to evaluate the effect of different time slot sizes on various applications.

The design of the fabric access scheduler will depend on the results of this experiment.

The apparatus required for this experiment is the same as for 2a. All that should be required is altering the way transactions are emitted from the delay line. Allow one month for this.

6.4 Experiment 3: Evaluate an optical fabric

This experiment seeks to build an electronic model of a switch fabric. It is essentially a PCI Express bridge with two subordinate buses. The hardware required is much the same as for experiment 2, with the addition of a second card adapter.

The FPGA on the Virtex prototype board will contain a partial PCI Express end-point for each device connected, and the model of the optical switch itself. Although electronic, the model will be designed subject to the same rules and limitations as an optical implementation.

PCI Express is designed to expect buffers in both directions at both ends of a link, and uses credit advertisements to mediate flow control between the buffers. It expects to be able to transmit at any time there is sufficient credit to do so, and thus it does not have to arbitrate for access to the fabric. Conversely, in a time-division multiplexed switch fabric, an agent can only transmit when it is in possession of an open channel (whether the channel is arbitrated for, or the agent is allocated a dedicated time slot).

Figure 2: Block diagram of components in Experiment 3.

In order to interface with the architecture being developed here, the credit-based flow control mechanism of PCI Express will need to be replaced with an arbitration mechanism. The end-point module will make this conversion, along with any other optimisations of the protocol (such as distributed error detection codes). The latency introduced by this step will have to be allowed for when calculating the efficiency of the electronic model as a whole.
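As a minimal sketch of the sort of arbitration that replaces credit-based flow control, consider a round-robin slot arbiter. The class and its interface are an illustration of the concept, not a description of the planned end-point module:

    # Round-robin time-slot arbiter: an agent transmits only when
    # granted a slot, rather than whenever it holds credit.
    from collections import deque

    class SlotArbiter:
        def __init__(self, agents):
            self.queue = deque(agents)   # rotation order of agents
            self.requests = set()        # agents with pending traffic

        def request(self, agent):
            self.requests.add(agent)

        def next_slot(self):
            """Grant the next time slot to the first requesting agent."""
            for _ in range(len(self.queue)):
                agent = self.queue[0]
                self.queue.rotate(-1)    # keep rotating for fairness
                if agent in self.requests:
                    self.requests.discard(agent)
                    return agent
            return None                  # no requests: an idle slot

    arb = SlotArbiter(["nic", "raid", "accel"])
    arb.request("raid"); arb.request("nic")
    print(arb.next_slot(), arb.next_slot(), arb.next_slot())  # nic raid None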

As in a real optical implementation, the bandwidth of each of the downlink ports will be the same as the bandwidth of the uplink port. When multiple downlink agents are active, the uplink bandwidth is shared between them, and no one downlink can reach 100% utilisation.


Unlike experiment 2, PCI Express is now being used as a transport emulating another protocol, so the performance of even a single active downlink agent cannot be expected to equal the performance of that same agent speaking native PCI Express. Evaluation of the performance of the interconnect cannot therefore be done with benchmarks as in previous experiments, because end-to-end measurements would include delay components that don't belong to the model (such as the PCI Express links themselves).

Instead, the model and end-point adapters will have to be instrumented to measure parameters such as fabric utilisation and arbitration latency. It might be possible to measure true round-trip latency differentially, using the same technique as in experiment 1. Useful data could also be collected if a “native” agent were designed and implemented inside the prototype board.

This is a challenging and complex experiment to execute, and will require the design and implementation of a large number of new components, including the fabric model, the scheduler, the aforementioned end-point, and enough logic to implement a simple PCI bridge. This might take six months.

If successful, this experiment will have produced a demonstrator which shows the feasibility of an optical interconnect.

6.5 Experiment 4: Optical implementation

The electronic model can capture architectural sources of latency, but it cannot readily model technological parameters like the time taken to recover a new clock after the fabric has been reconfigured. CDR latency has the potential to be a major factor in the overall overhead associated with the optical switch. A possible fourth experiment therefore involves building a real optical switch to investigate practical matters like CDR latency.

The apparatus is an extension of experiment 3. It will require the use of a Virtex evaluation board which can accommodate six high-speed serial ports (enough for three PCI Express links and three optical transceivers), four SOAs, and the associated optical couplers and current drivers.

Figure 3: Block diagram of components in Experiment 4.

7 Related issues

7.1 Power consumption and density

Since any current implementation will be built using discrete components, density (the physical volume occupied by the switch) and, to a lesser extent, its power consumption are not things that can readily be analysed in this thesis. Ultimately, density and power consumption will depend on the degree of integration which can be achieved (and which will also affect cost).

7.2 Scheduling

Until a viable optical equivalent to RAM becomes available, switch fabrics of the kind mentioned here aren't random access, and therefore require a scheduler to arbitrate access between devices. Aside from slot size, the design of that scheduler is a critical factor in the latency through the switch.

PCI Express divides traffic into two classes: link layer and transaction layer. Transaction layer packets carry data transfers (PCI transactions), while link layer packets are responsible for flow control and signalling errors (amongst other things). PCI Express is also a windowed protocol, meaning that, within sensible limits, the timing of acknowledgements affects only the amount of buffering required within an end-point, not throughput.

The scheduler can be designed to take advantage of the features of the protocol it is transporting, and of the switching technology in which the fabric is implemented, such as by pipelining (advance scheduling) of multiple requests, and by delaying acknowledgements to an endpoint until the fabric is next configured to transfer more data with that endpoint.

7.3 Subsequent work

The results of these experiments will determine where the work goes next. Problems encountered will inspire research into how to solve them, and successes will provide opportunities to study ways of leveraging the technology.

The arbitration and time slot mechanism is an area which might benefit from further research, since both are significant factors in the latency of the switch. Analysis of power consumption is also worth doing.

CPU and memory interconnects are not covered by this research, and would no doubt benefit greatly from reduced latency and power consumption. This is probably the most promising and enabling line of inquiry.

8 Conclusion

The principal contribution of this research is not to show whether a photonic switch can be developed, but whether it can be done efficiently enough both to make it a viable replacement for existing PCI Express and to provide scope to address the considerations outlined in the introduction.

Despite the considerable advances in photonic components, the field is still in its infancy compared with what we can do with electronics today. It is clear, even at this point, that for optical interconnects to become a commercial reality, much work will need to be done both in the development and integration of the technology, and in understanding the implications of such systems for computer architecture.

9 Proposed thesis chapters

I Introduction

II Background

There are a good number of relevant-looking papers at the upcoming Hot Interconnects 16 conference which will be worth mentioning here.

1 Evolution of power and performance in local interconnects

Bring in experimental results from the latency work.

2 Existing photonic switches.

Cover IP routers, SWIFT and SOAPS from Intel, and Rod Tucker's analyses of the power and density of optical vs. electronic IP routers.

3 Comparison between PCI Express and Infiniband

HotI 16, and any other work that comes up.

4 Applications for an optical interconnect

I've been focusing on PCI Express, but also mention CPU interconnects etc.

5 On-chip and off-chip optical interconnects

Multi-drop optical buses from HotI 16; waveguides in silicon (NoC applications etc.) and deposited on printed circuit boards, plus the recent work from CUDOS (photonic integration, scratch waveguides)

III Architectural considerations

1 Switch fabric configuration

i Overview of switch design
Very brief mention of choices in interconnection network architecture


ii N:N and N:1
Make the case for a minimally populated switch for PCs

2 Overview of photonic components

Briefly describe the choices, and why we select what we have

i Transceivers
Lasers and photodiodes, VCSELs for direct coupling to PCBs, CDR

ii Amplifiers and switching elements
SOAs and YDFAs; trade-offs in switching time and noise; MEMS; lithium niobate Mach-Zehnder modulators

iii Passives
Couplers and splitters, insertion loss and its impact on the depth of a switch fabric

iv Loss and noise
Extent to which loss in passives can be compensated for with amplifiers, with reference to noise

v Inter- and intra-chip waveguides
Fibres; on-chip and off-chip waveguides (in a bit more detail than mentioned in the related work)

3 Medium access arbitration

i Time slots and their structure
Guard bands for switching, CDR and data time. CDR is probably the limiting factor, so comment on solutions

ii The role of buffering
Use of buffering in conventional non-blocking switch fabrics; bufferless switches; cost of OEO buffering; use of edge buffering

iii Scheduling

a Overview of algorithms
Algorithms as appropriate to interconnect type

b Transaction types and scheduling
Pipelining of consecutive requests; delayed acknowledgements and their impact on edge buffering requirements

4 Latency and performance
Relationship between slot size, fabric utilisation/efficiency and latency, with reference to CDR performance. Contrast with optical burst switching [JV05]. Describe what determines effective throughput

i Trade-offs between latency, efficiency andutilisation

ii Effective throughput

iii Optical Burst Switching

IV Characterisation of real-world application per-formance

1 On the need to model whole systems

Performance depends on both ends of the communication; modelling just one end gives an incomplete picture

2 Direct measurement

Use of measurement apparatus to characterise different workloads

3 Simulation

See if an effective simulation model can be built from measured data. Some related work exists in [FW02]

4 Verification using optical testbed

and, if it can, whether the model relates to reality

V Testbed implementation

Describe the model and prototype implementations of the optical switch as proposed in this thesis

1 Electronic model

2 Photonic prototype

3 Technological considerations

If CDR performance turns out to be as big a problem as it was for the SOAPS team, then perhaps I can build a better CDR.

VI Evaluation of the testbed

Compare performance against existing local interconnects, with reference to theoretical performance


1 Throughput and latency performance

2 Power performance

3 Analysis of time slot allocation

i Clock recovery time
Analysis of how significant CDR performance was to overall performance

ii End-point buffering requirements
Analysis of the effect of time slots on end-point buffering requirements

4 Analysis of medium access scheduler

5 Synchronous endpoints

Analysis of how the implicit requirement that even low-performance end-points would have to run at memory/CPU interface speed may be ameliorated.

VII Conclusion

References

[AMW+08] Eng Tin Aw, David J. Miller, Adrian Wonfor, Andrew W. Moore, Madeleine Glick, Richard Penty, and Ian White, Practical non-blocking SOA switch architecture for optical interconnects, to be submitted to Journal of Lightwave Technology, 2008.

[AWG+07] Eng Tin Aw, Adrian Wonfor, Madeleine Glick, Richard Penty, and Ian White, Large dynamic range 32×32 optimized non-blocking SOA-based switch for 2.56 Tb/s interconnect applications, ECOC (2007).

[ban07] “Banana skins” compendium, www.cherryclough.com, Cherry Clough Consultants, 2007. Anecdotes of practical experience dealing with EMC issues.

[Ben62] Vaclav Benes, On rearrangeable three-stage connecting networks, Bell System Technical Journal 41 (1962).

[Che96] Stuart Cheshire, It's the latency, stupid, http://rescomp.stanford.edu/~cheshire/rants/Latency.html, May 1996.

[Clo53] Charles Clos, A study of non-blocking switching networks, Bell System Technical Journal 32 (1953).

[DT04] William James Dally and Brian Towles,Principles and practices of interconnec-tion networks, Morgan Kaufmann, 2004.

[DYN03] José Duato, Sudhakar Yalamanchili, and Lionel Ni, Interconnection networks, Morgan Kaufmann, 2003.

[FW02] Ehud Finkelstein and Shlomo Weiss, A PCI bus simulation framework and some simulation results on PCI standard 2.1 latency limitations, Journal of Systems Architecture (2002), no. 47.

[GDM+05] M. Glick, M. Dales, D. McAuley, Tao Lin, K. Williams, R. Penty, and I. White, SWIFT: a testbed with optically switched data paths for computing applications, Proceedings of the 7th International Conference on Transparent Optical Networks 2 (2005), 29–32.

[Gro06] Aite Group, Algorithmic trading 2006: More bells and whistles, http://www.aitegroup.com/reports/200610311.php, November 2006.

[HBY+08] A. Hadke, T. Benavides, S.J.B. Yoo, R. Amirtharajah, and V. Akella, OCDIMM: Scaling the DRAM memory wall using WDM based optical interconnects, High Performance Interconnects, 2008. HOTI '08. 16th IEEE Symposium on (2008), 57–63.

[JRG+03] L. B. James, G. F. Roberts, M. Glick, D. McAuley, K. A. Williams, R. V. Penty, and I. H. White, Wavelength striped semi-synchronous optical local area networks, London Communications Symposium, September 2003.

[JV05] Jason P. Jue and Vinod M. Vokkarane,Optical burst switched networks,Springer-Verlag, 2005.

[KDK+05] J.A. Kash, F. Doany, D. Kuchta, P. Pepeljugoski, L. Schares, J. Schaub, C. Schow, J. Trewhella, C. Baks, Y. Kwark, C. Schuster, L. Shan, C. Patel, C. Tsang, J. Rosner, F. Libsch, R. Budd, P. Chiniwalla, D. Guckenberger, D. Kucharski, R. Dangel, B. Offrein, M. Tan, G. Trott, D. Lin, A. Tandon, and M. Nystrom, Terabus: a chip-to-chip parallel optical interconnect, Lasers and Electro-Optics Society, 2005. LEOS 2005. The 18th Annual Meeting of the IEEE (2005), 363–364.

[KKL06] Chander Kochar, Avinash Kodi, and Ahmed Louri, nD-RAPID: A multi-dimensional scalable fault-tolerant opto-electronic interconnection for scalable high-performance computing systems, Optical Society of America, 2006.

[KKL07] C. Kochar, A. Kodi, and A. Louri, Proposed low-power high-speed microring resonator-based switching technique for dynamically reconfigurable optical interconnects, Photonics Technology Letters, IEEE 19 (2007), no. 17, 1304–1306.

[KL06] Avinash Karanth Kodi and Ahmed Louri, RAPID for high-performance computing systems: architecture and performance evaluation, Applied Optics 45 (2006), no. 25.

[LCW+03] Jiuxing Liu, B. Chandrasekaran, Jiesheng Wu, Weihang Jiang, S. Kini, Weikuan Yu, D. Buntinas, P. Wyckoff, and D.K. Panda, Performance comparison of MPI implementations over InfiniBand, Myrinet and Quadrics, Supercomputing, 2003 ACM/IEEE Conference (2003), 58.

[LLWB07] Odile Liboiron-Ladouceur, Howard Wang, and Keren Bergman, An all-optical PCI Express network interface for optical packet switched networks, OFC (2007).

[LSLL+03] W. Lu, B. A. Small, Odile Liboiron-Ladouceur, J. N. Kutz, and Keren Bergman, Optical packet switching through multiple nodes in the data vortex, IEEE LEOS (2003).

[LWP04] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K. Panda, High performance RDMA-based MPI implementation over InfiniBand, Int. J. Parallel Program. 32 (2004), no. 3, 167–198.

[McK04] Nick McKeown, Buffers: How we fell in love with them, and why we need a divorce, Hot Interconnects, 2004.

[MVCA97] R.P. Martin, A.M. Vahdat, D.E. Culler, and T.E. Anderson, Effects of communication latency, overhead, and bandwidth in a cluster architecture, Computer Architecture, 1997. Conference Proceedings. The 24th Annual International Symposium on (1997), 85–97.

[Pat04] David Patterson, Latency lags bandwidth, Communications of the ACM 47 (2004), no. 10.

[PS00] PCI-SIG (ed.), PCI-X addendum to the PCI local bus specification, 1.0a ed., PCI-SIG, July 2000.

[PS02] PCI-SIG (ed.), PCI local bus specification, 2.3 ed., PCI-SIG, March 2002.

[PS06] PCI-SIG (ed.), PCI Express base specification, 2.0 ed., PCI-SIG, December 2006.


[SKD+06] L. Schares, J.A. Kash, F.E. Doany, C.L. Schow, C. Schuster, D.M. Kuchta, P.K. Pepeljugoski, J.M. Trewhella, C.W. Baks, R.A. John, L. Shan, Y.H. Kwark, R.A. Budd, P. Chiniwalla, F.R. Libsch, J. Rosner, C.K. Tsang, C.S. Patel, J.D. Schaub, R. Dangel, F. Horst, B.J. Offrein, D. Kucharski, D. Guckenberger, S. Hegde, H. Nyikal, C.-K. Lin, A. Tandon, G.R. Trott, M. Nystrom, D.P. Bour, M.R.T. Tan, and D.W. Dolfi, Terabus: Terabit/second-class card-level optical interconnect technologies, Selected Topics in Quantum Electronics, IEEE Journal of 12 (2006), no. 5, 1032–1044.

[SLB05] Assaf Shacham, Benjamin G. Lee, and Keren Bergman, A scalable, self-routed, terabit capacity photonic interconnection network, Hot Interconnects 13 (2005).

[SSLLB05] Assaf Shacham, Benjamin A. Small, Odile Liboiron-Ladouceur, and Keren Bergman, A fully implemented 12×12 data vortex optical packet switching interconnection network, J. Lightwave Technology (2005).

[Thi08] Patrick Thibodeau, Stock exchanges start thinking in microseconds, http://www.computerworld.com/action/article.do?command=viewArticleBasic&articleId=323391, August 2008.

[Tuc06a] R. S. Tucker, The role of optics and electronics in high-capacity routers, Journal of Lightwave Technology 24 (2006), no. 12, 4655–4673.

[Tuc06b] R.S. Tucker, Petabit-per-second routers: optical vs. electronic implementations, Optical Fiber Communication Conference, 2006 and the 2006 National Fiber Optic Engineers Conference, OFC 2006 (2006), 3 pp.

[VSM+08] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N.P. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R.G. Beausoleil, and J.H. Ahn, Corona: System implications of emerging nanophotonic technology, Computer Architecture, 2008. ISCA '08. 35th International Symposium on (2008), 153–164.

[Xil04] Xilinx, Inc. (ed.), Virtex-II Pro datasheet, v4.0 ed., no. DS038-3, Xilinx, Inc., June 2004.
