
HEWLETT-PACKARD JOURNAL

February 1990, Volume 41, Number 1

    Articles

6 An Overview of the HP OSI Express Card, by William R. Johnson

8 The HP OSI Express Card Backplane Handler, by Glenn F. Talbott
15 Custom VLSI Chips for DMA

18 CONE: A Software Environment for Network Protocols, by Steven M. Dean, David A. Kumpf, and H. Michael Wenzel

28 The Upper Layers of the HP OSI Express Card Stack, by Kimball K. Banker and Michael A. Ellis

36 Implementation of the OSI Class 4 Transport Protocol in the HP OSI Express Card, by Rex A. Pugh
45 Data Link Layer Design and Testing for the HP OSI Express Card, by Judith A. Smith and Bill Thomas

49 The OSI Connectionless Network Protocol

51 HP OSI Express Design for Performance, by Elizabeth P. Bortolotto

59 The HP OSI Express Card Software Diagnostic Program, by Joseph R. Longo, Jr.

Editor, Richard P. Dolan • Associate Editor, Charles L. Leath • Assistant Editor, Gene M. Sadoff • Art Director, Photographer, Arvid A. Danielson • Support Supervisor, Susan E. Wright • Administrative Services, Diane W. Woodworth • Typography, Anne S. LoPresti • European Production Supervisor, Sonja Wirth


67 Support Features of the HP OSI Express Card, by Jayesh K. Shah and Charles L. Hamer
72 Integration and Test for the HP OSI Express Card's Protocol Stack, by Neil M. Alexander and Randy J. Westra

80 High-Speed Lightwave Signal Analysis, by Christopher M. Miller
84 A Broadband Instrumentation Photoreceiver

92 Linewidth and Power Spectral Measurements of Single-Frequency Lasers, by Douglas M. Baney and Wayne V. Sorin

Departments

4 In this Issue
5 Cover
5 What's Ahead

77 Authors

The Hewlett-Packard Journal is published bimonthly by the Hewlett-Packard Company to recognize technical contributions made by Hewlett-Packard (HP) personnel. While the information found in this publication is believed to be accurate, the Hewlett-Packard Company makes no warranties, express or implied, as to the accuracy or reliability of such information. The Hewlett-Packard Company disclaims all warranties of merchantability and fitness for a particular purpose and all obligations and liabilities for damages, including but not limited to indirect, special, or consequential damages, attorney's and expert's fees, and court costs, arising out of or in connection with this publication.

Subscriptions: The Hewlett-Packard Journal is distributed free of charge to HP research, design, and manufacturing engineering personnel, as well as to qualified non-HP individuals, businesses, and educational institutions. Please address subscription or change of address requests on printed letterhead (or include a business card) to the HP address on the back cover that is closest to you. When submitting a change of address, please include your zip or postal code and a copy of your old label.

Submissions: Although articles in the Hewlett-Packard Journal are primarily authored by HP employees, articles from non-HP authors dealing with HP-related research or solutions to technical problems made possible by using HP equipment are also considered for publication. Please contact the Editor before submitting such articles. Also, the Hewlett-Packard Journal encourages technical discussions of the topics presented in recent articles and may publish letters expected to be of interest to readers. Letters should be brief, and are subject to editing by HP.

Copyright © 1990 Hewlett-Packard Company. All rights reserved. Permission to copy without fee all or part of this publication is hereby granted provided that 1) the copies are not made, used, displayed, or distributed for commercial advantage; 2) the Hewlett-Packard Company copyright notice and the title of the publication and date appear on the copies; and 3) a notice stating that the copying is by permission of the Hewlett-Packard Company appears on the copies. Otherwise, no portion of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage retrieval system, without written permission of the Hewlett-Packard Company.

Please address inquiries, submissions, and requests to: Editor, Hewlett-Packard Journal, 3200 Hillview Avenue, Palo Alto, CA 94304, U.S.A.


In this Issue

Open systems are systems that communicate with the outside world using standard protocols. Because they communicate using standard protocols, open systems can be interconnected and will work together regardless of what company manufactured them. For a user who is configuring a computer system or network, the benefit of open systems is the freedom to choose the best component for each function from the offerings of all manufacturers. In 1983, the International Organization for Standardization (ISO) published its Open Systems Interconnection (OSI) Reference Model to serve as a master model for coordinating all open systems activities. The OSI model provides a framework that organizes all intersystem communications functions into seven layers. Specific protocols perform the functions of each layer. Any organization can have an open system by implementing these standard protocols. The movement towards this model as the global standard for open systems has steadily gained momentum. Hewlett-Packard, along with many other manufacturers and the governments of many countries, is committed to the development of standards and products based on the OSI model.

The HP OSI Express card implements the OSI model for HP 9000 Series 800 computers. The hardware and firmware on the card off-load most of the processing for the seven-layer OSI stack from the host computer. This not only gets the job done faster and improves throughput, but also leaves more time for the host processor to service user applications. Although it's only a single Series 800 I/O card, the HP OSI Express card implements many complex ideas and required a major design effort that claims most of this issue. You'll find an overview of its design on page 6. The interface between the card driver (which is part of the host software) and the operating system on the card is a set of firmware routines called the backplane handler; it's described in the article on page 8. The card's architecture and most of its operating system are determined by a design called the common OSI networking environment, or CONE (see page 18). CONE defines how the protocol firmware modules interact and provides system functions to support the protocol modules. The top three layers of the OSI Express card protocol stack, the application, presentation, and session layer modules, are described in the article on page 28. These three layers share the same architecture and are implemented using tables. In the protocol module for the fourth OSI layer, the transport layer, are the important functions of error detection and recovery, multiplexing, addressing, and flow control, including congestion avoidance (see page 36). The bottom three OSI layers are the network, data link, and physical layers. The bottom of the OSI stack on the OSI Express card is covered in the article on page 45. Because of the number of layers in the OSI stack, data throughput is an important consideration in the design of any card implementation. Performance analysis of the HP OSI Express card began in the early design phases and helped identify critical bottlenecks that needed to be eliminated (see page 51). As a result, throughput as high as 600,000 bytes per second has been measured.

Troubleshooting in a multivendor environment is also an important concern because of the need to avoid situations in which vendors blame each other's products for a problem. Logging, tracing, and other support features of the OSI Express card are discussed in the article on page 67. Finally, debugging and final testing of the card's firmware are the subjects of the articles on pages 59 and 72, respectively.


Both the use and the performance of fiber optic voice and data communications systems continue to increase, and we continue to see new forms of measuring instrumentation adapted to the needs of fiber optic system design, test, and maintenance. The article on page 80 presents the design and applications of the HP 71400A lightwave signal analyzer. This instrument is designed to measure signal strength and distortion, modulation depth and bandwidth, intensity noise, and susceptibility to reflected light of high-performance optical systems and components such as semiconductor lasers and broadband photodetectors. Unlike an optical spectrum analyzer, it does not provide information about the frequency of the carrier. Rather, it acts as a spectrum analyzer for the modulation on a lightwave carrier. It complements the lightwave component analyzer described in our June 1989 issue, which can be thought of as a network analyzer for lightwave components. Using high-frequency photodiodes and a broadband amplifier consisting of four microwave monolithic distributed amplifier stages, the lightwave signal analyzer can measure lightwave modulation up to 22 GHz. (This seems like a huge bandwidth until one realizes that the carrier frequency in a fiber optic system is 200,000 GHz or more!) A companion instrument, the HP 11980A fiber optic interferometer (page 92), can be used with the analyzer to measure the linewidth, chirp, and frequency modulation characteristics of single-frequency lasers. The interferometer acts as a frequency discriminator, converting optical phase or frequency variations into intensity variations, which are then detected by the analyzer. Chirp and frequency modulation measurements with the interferometer use a new measurement technique called a gated self-homodyne technique.

R.P. Dolan
Editor

Cover

The cover is a rendition of the seven layers of the International Organization for Standardization's OSI Reference Model on the HP OSI Express card, and the communication path between two end systems over a network.

What's Ahead

The April issue will feature the design of the HP 1050 modular liquid chromatograph and the HP OpenView network management software.


An Overview of the HP OSI Express Card

The OSI Express card provides on an I/O card the networking services defined by the ISO OSI (Open Systems Interconnection) Reference Model, resulting in off-loading much of the network overhead from the host computer. This and other features set the OSI Express card apart from other network implementations in existence today.

by William R. Johnson

THE DAYS WHEN A VENDOR used a proprietary network to "lock in" customer commitment are over. Today, customers demand multivendor network connectivity providing standardized application services. HP's commitment to OSI-based networks provides a path to fill this customer requirement.

The ISO (International Organization for Standardization) OSI (Open Systems Interconnection) Reference Model was developed to facilitate the development of protocol specifications for the implementation of vendor independent networks. HP has been committed to implementation of OSI-based standards since the early 1980s. Now that the standards have become stable, OSI-based products are becoming available. One of HP's contributions to this arena is the OSI Express card for HP 9000 Series 800 computers.

The OSI Express card provides a platform for the protocol stack used by OSI applications. Unlike other networking implementations, the common OSI protocol stack resides on the card. Thus, much of the network overhead is offloaded from the host, leaving CPU bandwidth available for processing user applications. This common protocol stack consists of elements that implement layers 1 through 6 of the OSI Reference Model and the Association Control Service Element (ACSE), which is the protocol for the seventh layer of the OSI stack. Most of the application layer functionality is performed outside the card environment since applications are more intimately tied to specific user functions (e.g., mail service or file systems). An architectural view of the OSI Express card is given in Fig. 1, and Fig. 2 shows the association between the OSI Express stack and the OSI Reference Model.

The series of articles in this issue associated with the OSI Express card provides some insight into how the project team at HP's Roseville Networks Division implemented the card and what sets it apart from many other implementations currently in existence today. This article gives an overview of the topics covered in the other articles and the components shown in Fig. 1.

OSI Express Stack

The protocol layers on the OSI Express card stack provide the following services:

Media Access Control (MAC) Hardware. The MAC hardware is responsible for reading data from the LAN interface into the card buffers as specified by the link/MAC interface software module. All normal data packets destined for a particular node's address are forwarded by the logical link control (LLC) to the network layer.

Network Layer. The network layer on the OSI Express card uses the connectionless network service (CLNS). The OSI Express card's CLNS implementation supports the end-system-to-intermediate-system protocol, which facilitates dynamic network routing capabilities. As new nodes are brought up on the LAN, they announce themselves using this subset of the network protocol. The service provided by CLNS is not reliable and dictates the use of the transport layer to provide a reliable data transfer service. Both the transport layer and CLNS can provide segmentation and reassembly capabilities when warranted.

Transport Layer Class 4. In addition to ensuring a reliable data transfer service, the OSI Express transport is also responsible for monitoring card congestion and providing flow control.

Session Layer. The OSI Express card's implementation of the session layer protocol facilitates the management of an application's dialogue by passing parameters, policing state transitions, and providing an extensive set of service primitives for applications.

Presentation Layer and Association Control Service Element (ACSE). The OSI Express card's presentation layer extracts protocol parameters and negotiates the syntax rules for transmitting information across the current association. Both ACSE and the presentation layer use a flexible method of protocol encoding called ASN.1 (Abstract Syntax Notation One). ASN.1 allows arbitrary encodings of the protocol header, posing special challenges to the decoder. ACSE is used in the transmission of parameters used in the establishment and release of the association.

OSI Express Protocols

The OSI protocols are implemented within the common OSI networking environment (CONE). CONE is basically a network-specific operating system for the OSI Express card. The utilities provided by CONE include buffer management, timer management, connection management, queue management, nodal management, and network management. CONE defines a standard protocol interface that provides isolation. This feature ensures portability of networking software across various hardware and/or software platforms. The basic operating system used in the OSI Express card is not part of CONE and consists of a simple interrupt system and a rudimentary memory manager.


Service requests are transmitted to CONE using the backplane message interface (BMI) protocol. Application requests are communicated through the OSI user interface to the CONE interface adapter (CIA). The interface adapter bundles the request into the BMI format and hands it off to the system driver. The BMI converts message-based requests, which are asynchronous, from the interface adapter into corresponding CONE procedure calls, which are synchronous.

The backplane handler controls the hardware that moves messages between the host computer and the OSI Express card. A special chip, called the HP Precision bus interface chip, is used by the backplane handler to gain control of the HP Precision bus and perform DMA between the OSI Express card and the host memory space. Another special chip, called the midplane memory controller, is used by the backplane handler to take care of OSI Express card midplane bus arbitration and card-resident memory. The backplane handler conceals the interactions of these two chips from CONE and the driver.

Diagnostics and Maintenance

The OSI Express card uses three utilities to aid in fault detection and isolation. The hardware diagnostics and maintenance program uses the ROM-resident code on the card to perform initial configuration of the MAC hardware. After configuration, the program is used to access the ROM-based test code that exercises both local and remote networking hardware. The same utility is also used to download the OSI Express card software into RAM. The host-based network and nodal management tool contains the tracing (event recording) and logging (error reporting) facilities. The network and nodal management tool can be used to report network management events and statistics as well. However, it is primarily used to resolve protocol networking problems causing abnormal application behavior (e.g., receipt of nonconforming protocol header information). The software diagnostic program, which is the third fault detection program on the OSI Express card, was developed to aid in the identification of defects encountered during the software development of the card. This program uses the software diagnostic module on the card to read and write data, set software breakpoints, and indicate when breakpoints have been encountered. The interface to the software diagnostic program provides access to data structures and breakpoint locations through the use of symbolic names. It also searches CONE-defined data structures with little user intervention.

Fig. 1. HP OSI Express card overview.


Fig. 2. Comparison between the OSI Express components and the OSI Reference Model.

In a multivendor environment it is crucial that networking problems be readily diagnosable. The OSI Express diagnostics provide ample data (headers and state information) to resolve the problem at hand quickly.

An implementation of the OSI protocols is not inherently doomed to poor performance. In fact, file transfer throughput using the OSI Express card in some cases is similar to that of existing networking products based on the TCP/IP protocol stack. Performance is important to HP's customers, and special attention to performance was an integral part of the development of the OSI Express card. Special focus on critical code paths for the OSI Express card resulted in throughputs in excess of 600,000 bytes per second. Intelligent use of card memory and creative congestion control allow the card to support up to 100 open connections.

Acknowledgments

Much credit needs to be given to our section manager Doug Boliere and his staff Todd Alleckson, Diana Bare, Mary Ryan, Lloyd Serra, Randi Swisley, and Gary Wermuth for putting this effort together and following it through. Gratitude goes to the hardware engineers who gave the networking software a home, as well as those who developed the host software at Information Networks Division and Colorado Networks Division.

The HP OSI Express Card Backplane Handler

The backplane on the HP OSI Express card is handled by a pair of VLSI chips and a set of firmware routines. These components provide the interface between the HP OSI Express card driver on the host machine and the common OSI networking environment, or CONE, on the OSI Express card.

by Glenn F. Talbott

THE HP OSI EXPRESS CARD BACKPLANE handler is a set of firmware routines that provide an interface between the common OSI networking environment (CONE) software and the host-resident driver. CONE provides network-specific operating system functions and other facilities for the OSI Express card (see the article on page 18). The handler accomplishes its tasks by controlling the hardware that moves messages between the host computer and the OSI Express card. The backplane handler design is compatible with the I/O architecture defined for HP Precision Architecture systems,1 and it makes use of the features of this architecture to provide the communication paths between CONE and the host-resident driver (see Fig. 1). The HP Precision I/O Architecture defines the types of modules that can be connected to an HP Precision bus (including processors, memory, and I/O). The OSI Express card is classified as an I/O module.

The OSI Express card connects to an HP 9000 Series 800 system via the HP Precision bus (HP-PB), which is a 32-bit-wide high-performance memory and I/O bus. The HP-PB allows all modules connected to it to be either masters or slaves in bus transactions. Bus transactions are initiated by a master and responses are invoked from one or more slaves.


Fig. 1. The data flow relationships between the OSI Express card driver on the host computer and the major hardware and software components on the card.

For a read transaction, data is transferred from the slave to the master, and for a write transaction, data is transferred from the master to the slave. Each module that can act as a master in bus transactions is capable of being a DMA controller. Bus transactions include reading or writing 4, 16, or 32 bytes and atomically reading and clearing 16 bytes for semaphore operations.

The OSI Express card uses a pair of custom VLSI chips to perform DMA between its own resident memory and the host memory. The first chip is the HP-PB interface chip, which acts as the master in the appropriate HP Precision bus transactions to perform DMA between the OSI Express card and the host system memory space. The second chip is the midplane memory controller, which controls the DMA between the HP-PB interface chip and the OSI Express card resident memory. The memory controller chip also performs midplane bus arbitration and functions as a dynamic RAM memory controller and an interrupt controller. See the box on page 15 for more information about the HP-PB interface chip and the midplane memory controller chip. The backplane handler hides all the programming required for these chips from the host computer OSI Express driver and CONE.

Host Interface

The HP Precision I/O Architecture views an I/O module as a continuously addressable portion of the overall HP Precision Architecture address space. I/O modules are assigned starting addresses and sizes in this space at system initialization time. The HP Precision I/O Architecture further divides this I/O module address space (called soft physical address space, or SPA) into uniform, independent register sets consisting of 16 32-bit registers each.

The OSI Express backplane handler is designed to support up to 2048 of these register sets. (The HP-PB interface chip maps HP-PB accesses to these register sets into the Express card's resident memory.) With one exception, for the backplane handler each register set is independent of all the other register sets, and the register sets are organized in inbound-outbound pairs to form full-duplex paths or connections. The one register set that is the exception (RS1) is used to notify the host system driver of asynchronous events on the OSI Express card, and the driver is always expected to keep a read transaction pending on this register set to receive notification of these events.

Register Sets

The registers are numbered zero through 15 within a given register set. The registers within each set that are used by the backplane handler, as they are defined by the HP Precision I/O Architecture, are listed below. The registers not included in the list are used by the backplane handler to maintain internal state information about the register set.

Number  Name            Function
4       IO_DMA_LINK     Pointer to DMA control structure
5       IO_DMA_COMMAND  Current DMA chain command
6       IO_DMA_ADDRESS  DMA buffer address
7       IO_DMA_COUNT    DMA buffer size (bytes)
12      IO_COMMAND      Register set I/O command
13      IO_STATUS       Register set status
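The layout below is a minimal C sketch of how one such register set might be described. The register names and positions come from the table above; the struct itself, the reserved-slot layout, and the types are illustrative assumptions, not the card's actual firmware definitions.

    /* Sketch of one HP Precision I/O register set: 16 32-bit registers.
     * Only the registers named in the table above are labeled; the remaining
     * slots are shown as reserved and would hold the backplane handler's
     * internal state for the register set. */
    #include <stdint.h>

    typedef struct {
        volatile uint32_t reserved_0_3[4];   /* registers 0-3: internal state */
        volatile uint32_t io_dma_link;       /* register 4: pointer to DMA control structure */
        volatile uint32_t io_dma_command;    /* register 5: current DMA chain command */
        volatile uint32_t io_dma_address;    /* register 6: DMA buffer address */
        volatile uint32_t io_dma_count;      /* register 7: DMA buffer size (bytes) */
        volatile uint32_t reserved_8_11[4];  /* registers 8-11: internal state */
        volatile uint32_t io_command;        /* register 12: register set I/O command */
        volatile uint32_t io_status;         /* register 13: register set status */
        volatile uint32_t reserved_14_15[2]; /* registers 14-15: internal state */
    } register_set_t;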

The OSI Express card functions as a DMA controller and uses DMA chaining to transfer data to and from the card. DMA chaining consists of the DMA controller's autonomously following chains of DMA commands written in memory by the host processor. HP Precision I/O Architecture defines DMA chaining methods and commands for HP Precision Architecture systems. A DMA chain consists of a linked list of DMA control structures known as quads. Fig. 2 shows a portion of a DMA chain and the names of the entries in each quad.

Data Quad. The data quad is used to maintain reference to and information about the data that is being transferred. The fields in the data quad have the following meaning and use.


Field       Meaning and Use
CHAIN_LINK  Pointer to the next quad in the chain
CHAIN_CMD   DMA chaining command plus application-specific fields
ADDRESS     Memory address of the data buffer
COUNT       Length of the data buffer in bytes

Bits in the application-specific fields of CHAIN_CMD control the generation of asynchronous events to CONE and the acknowledgment of asynchronous event indications to the host.

Link Quad. A link quad is created by the driver to indicate the end of a DMA transaction (note: not the end of a DMA chain). When a link quad is encountered in the chain, a completion list entry is filled in and linked into a completion list. If the CHAIN_LINK field does not contain an END_OF_CHAIN, DMA transfers continue. The fields in the link quad have the following meaning and use.

Field       Meaning and Use
CHAIN_LINK  Pointer to the next quad in the chain, or END_OF_CHAIN value
CCMD_LINK   Causes a completion list entry to be created and may specify whether the host should be interrupted
HEAD_ADDR   Address of the completion list
ENTRY_ADDR  Address of the completion list entry to be used to report completion status

Completion List Entry. A completion list entry is used to indicate the completion status of a DMA transaction. One is filled in when a link quad is encountered in the DMA chain. The fields in the completion list have the following meanings and use:

Field       Meaning and Use
NEXT_LINK   Pointer used to link the entry into the completion list
ICL_STATUS  Completion status field, a copy of the IO_STATUS register
SAVE_LINK   Pointer to the quad where an error occurred, or to the link quad in the case of no error
SAVE_COUNT  Residue count of bytes remaining in the buffer associated with the quad pointed to by SAVE_LINK, or zero if no error

Fig. 2. A portion of a DMA chain.
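Taken together, the quads and completion list entries can be pictured as the C structures sketched below. The field names follow the tables above, but the word widths, the mapping of the link quad fields onto the same four words as a data quad, and the uint32_t types are simplifying assumptions for illustration, not the card's actual definitions.

    #include <stdint.h>

    /* A quad is four 32-bit words; how the last two words are interpreted
     * depends on whether the chain command marks it as a data quad or a
     * link quad. */
    typedef struct quad {
        uint32_t chain_link;  /* address of the next quad, or END_OF_CHAIN */
        uint32_t chain_cmd;   /* chaining command plus application-specific bits;
                                 a link quad carries CCMD_LINK here */
        uint32_t address;     /* data quad: host address of the data buffer;
                                 link quad: HEAD_ADDR, address of the completion list */
        uint32_t count;       /* data quad: buffer length in bytes;
                                 link quad: ENTRY_ADDR, address of the completion list entry */
    } quad_t;

    /* A completion list entry, filled in when a link quad is reached. */
    typedef struct completion_entry {
        uint32_t next_link;   /* links this entry into the completion list */
        uint32_t icl_status;  /* copy of the IO_STATUS register */
        uint32_t save_link;   /* quad where an error occurred, or the link quad if none */
        uint32_t save_count;  /* residue byte count for that quad, or zero if no error */
    } completion_entry_t;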

The completion list head contains a semaphore that allows a completion list to be shared by multiple I/O modules, and a pointer to the first entry in the completion list.

DMA Chaining

DMA chaining is started by the host system driver when the address of the first quad in a DMA chain is written into the IO_DMA_LINK register of a register set. To tell the OSI Express card to start chaining, the driver writes the chain command CMD_CHAIN into the register set's IO_COMMAND register. This causes an interrupt and the Express card's backplane handler is entered. From this point until a completion list entry is made, the DMA chain belongs to the OSI Express card's register set, and DMA chaining is under control of the backplane handler through the register set. Fig. 3 shows the flow of activities for DMA chaining in the driver and in the backplane handler.

Once control is transferred to the backplane handler, the first thing the handler does is queue the register set for service. When the register set reaches the head of the queue, the backplane handler fetches the quad pointed to by IO_DMA_LINK and copies the quad into the registers IO_DMA_LINK, IO_DMA_COMMAND, IO_DMA_ADDRESS, and IO_DMA_COUNT. The backplane handler then interprets the chain command in register IO_DMA_COMMAND, executes the indicated DMA operation, and fetches and copies the next quad pointed to by IO_DMA_LINK. This fetch, interpret, and execute process is repeated until the value in IO_DMA_LINK is END_OF_CHAIN. When END_OF_CHAIN is reached, the backplane handler indicates that the register set is ready for a new I/O command by posting the status of the DMA transaction in the IO_STATUS register.

The DMA operation executed by the backplane handler is determined by the chain command in the IO_DMA_COMMAND register. For quads associated with data buffers, this chain command is CCMD_IN or CCMD_OUT for inbound or outbound buffers, respectively. In this case the backplane handler transfers the number of bytes of data specified in the IO_DMA_COUNT register to or from the buffer at the host memory location in the IO_DMA_ADDRESS register.
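The fetch, interpret, and execute cycle described above can be summarized in the following C sketch, which builds on the register set and quad structures sketched earlier. The command encodings and the helper functions (fetch_quad, dma_transfer, post_completion, post_status) are hypothetical stand-ins, not the actual backplane handler routines.

    /* Placeholder encodings; the real values are defined by the
     * HP Precision I/O Architecture and are not reproduced here. */
    #define END_OF_CHAIN 0xFFFFFFFFu
    #define CCMD_MASK    0x0000000Fu
    #define CCMD_IN      0x1u
    #define CCMD_OUT     0x2u
    #define CCMD_LINK    0x3u

    /* Hypothetical helpers: copy a quad down from host memory, run one DMA
     * transfer, write a completion list entry, and post the final status. */
    extern void fetch_quad(uint32_t host_addr, quad_t *q);
    extern void dma_transfer(uint32_t host_addr, uint32_t count, uint32_t cmd);
    extern void post_completion(const quad_t *q, uint32_t status);
    extern void post_status(register_set_t *rs);

    /* Service one register set that has reached the head of the queue. */
    void service_register_set(register_set_t *rs)
    {
        while (rs->io_dma_link != END_OF_CHAIN) {
            quad_t q;
            fetch_quad(rs->io_dma_link, &q);          /* fetch the next quad */
            rs->io_dma_link    = q.chain_link;        /* copy it into the DMA registers */
            rs->io_dma_command = q.chain_cmd;
            rs->io_dma_address = q.address;
            rs->io_dma_count   = q.count;

            switch (rs->io_dma_command & CCMD_MASK) { /* interpret the chain command */
            case CCMD_IN:                             /* execute: host buffer to card */
            case CCMD_OUT:                            /* execute: card to host buffer */
                dma_transfer(rs->io_dma_address, rs->io_dma_count, rs->io_dma_command);
                break;
            case CCMD_LINK:                           /* link quad: report completion */
                post_completion(&q, rs->io_status);
                break;
            }
        }
        /* End of chain: post DMA status so the driver can issue a new command. */
        post_status(rs);
    }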


[7] IMPLICIT Default-context-result OPTIONAL,
Default-context-result ::= INTEGER {
    acceptance (0),
    user-rejection (1),
    provider-rejection (2) }
[10] IMPLICIT Provider-reason OPTIONAL
Provider-reason ::= INTEGER {
    reason-not-specified (0),
    temporary-congestion (1),
    local-limit-exceeded (2),
    called-presentation-address-unknown (3),
    protocol-version-not-supported (4),
    default-context-not-supported (5),
    user-data-not-readable (6),
    no-PSAP-available (7) }
User-data OPTIONAL }

Fig. 7. An ASN.1 specification for the presentation connect confirm negative PDU.


Responding-presentation-selector is encoded in two ways, both of which are valid since octet strings can be encoded as constructed types at the discretion of the sender. In Fig. 8a a constructed type is used to break the Responding-presentation-selector value into three primitive encodings, each with a tag of 04 (universal tag type for octet string) and a length of 02. Fig. 8b merely encodes the entire value as a primitive with a length of 06. ASN.1 encoding rules are not deterministic because the encodings given in Figs. 8a and 8b are valid for the same PDU.

Another important aspect of ASN.1 is the concept of a context-specific tag. Some tag values are universal in scope and apply to all ASN.1 encodings. Other tag types assume values whose meaning is specific to a particular PDU. For example, context-specific tag value [0] identifies the presentation Protocol-version in Fig. 7. This tag value only means Protocol-version when encountered in a presentation connect confirm negative PDU. In another PDU, the value [0] means something else entirely.

Context tags allow a protocol designer to assign a tag value such that the value of the tag determines the type of value. To decode and validate the PDU, the decoder must have knowledge of a protocol's context-specific values, their meanings, and the order and range of the PDU primitive values. This means that some parts of an ASN.1 decoder may be generic to any ASN.1 encoded PDU (such as an ASN.1 integer decode routine), while other parts of the decoder are quite specific to a single PDU (such as the checking needed to verify that presentation transfer syntaxes are in the appropriate sequence).

A final key to understanding ASN.1 encoding rules is that in almost all cases, the sender chooses which options to use. These options include the way in which lengths are encoded and when constructed elements may be segmented. Octet strings, for example, may optionally be sent as a contiguous string or parsed into a constructed version with many pieces, which may themselves be segmented. A decoder must handle any combination of the above. Thus, the decoder must be able to handle an almost infinite number of byte combinations for PDUs of any complexity. This makes the decoder more complicated to construct than an encoder. For example, Fig. 8 shows that the Responding-presentation-selector can be encoded in two ways, both valid.

Encoder

The encoder is responsible for encoding outbound data packets based on the ASN.1 syntax. Because the encoder can select a limited set of options within the rather large ASN.1 set of choices, encoding is much easier than decoding. The main requirement of the encoder is to know the syntax of the PDU to be constructed. In particular, it needs to know the order and values of the tags and be equipped with the mechanisms to encode the actual lengths and values.

The OSI Express card implementation encodes PDUs front to back using indefinite length encoding. An alternative, encoding ASN.1 back to front, has the advantage of being able to calculate the lengths and allow definite length encoding. Once all of the primitive values are encoded, the encoder can work backwards, filling in all of the constructed tag lengths.

However, encoding back to front does not allow data streaming, since all of the PDU must be present and encoded (including user data) to calculate the lengths. Without data streaming, large pieces of shared memory must be used, thus making memory unavailable to the rest of the card's processes until all of the PDU and its user data has been encoded.

The encoder is table-driven in that a set of tables is used for each type of PDU. Each table contains constants for the tag and length and an index to a routine for a particular value. A generic algorithm uses the tables to build each PDU. The tables allow modifications to be made easily when there are changes to the OSI standards. OSI standards for tag values and primitives changed constantly during our implementation. However, these changes merely meant changing a constant used by the table (often a simple macro update). Changing the order or adding or deleting a value was also easy because only the table entries had to be altered.
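A table-driven encoder of this kind might be organized along the lines of the C sketch below. The entry layout, the routine indices, and the generic loop are assumptions for illustration; the article does not give the actual table format used on the card.

    #include <stddef.h>
    #include <stdint.h>

    /* One row of a per-PDU encoding table: a constant tag, a constant length
     * octet (0x80 means indefinite length; an end-of-contents row uses tag
     * 0x00, length 0x00), and the index of a routine that emits the value,
     * or -1 if the row emits no value of its own. */
    typedef struct {
        uint8_t tag;
        uint8_t length;
        int     value_routine;
    } encode_entry_t;

    typedef size_t (*value_encoder_t)(uint8_t *out, const void *pdu_params);

    /* Generic front-to-back encoder: walk the table, emit each tag and length,
     * then call the value routine (if any) to emit that value's contents. */
    size_t encode_pdu(const encode_entry_t *table, size_t nentries,
                      const value_encoder_t *routines, const void *pdu_params,
                      uint8_t *out)
    {
        size_t len = 0;
        for (size_t i = 0; i < nentries; i++) {
            out[len++] = table[i].tag;
            out[len++] = table[i].length;
            if (table[i].value_routine >= 0)
                len += routines[table[i].value_routine](out + len, pdu_params);
        }
        return len;
    }

With this arrangement, a change to a tag value in the OSI standards is a one-constant change to a table row, and reordering or removing a field only rearranges or deletes rows.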

(a)
30 80                      SEQUENCE
  80 02 07 80                Protocol-version
  a3 80                      Responding-presentation-selector
    04 02 01 02
    04 02 03 04
    04 02 05 06
  00 00                      EOC
  a5 80                      Presentation-context-definition-result-list
                             IMPLICIT SEQUENCE
    30 80                      SEQUENCE
      80 01 00                   Result
      81 02 51 01                Transfer-syntax-name
    00 00                      EOC
    30 80                      SEQUENCE
      80 01 00                   Result
      81 02 51 01                Transfer-syntax-name
    00 00                      EOC
    30 80                      SEQUENCE
      80 01 01                   Result
    00 00                      EOC
  00 00                      EOC
  87 01 03                   Default-context-result
  8a 01 00                   Provider-reason
00 00                      EOC

(b)
30 2B                      SEQUENCE
  80 02 07 80                Protocol-version
  83 06 01 02 03 04 05 06    Responding-presentation-selector
  a5 17                      Presentation-context-definition-result-list
                             IMPLICIT SEQUENCE
    30 07                      SEQUENCE
      80 01 00                   Result
      81 02 51 01                Transfer-syntax-name
    30 07                      SEQUENCE
      80 01 00                   Result
      81 02 51 01                Transfer-syntax-name
    30 03                      SEQUENCE
      80 01 01                   Result
  87 01 03                   Default-context-result
  8a 01 00                   Provider-reason

Fig. 8. Encoding for the PDU shown in Fig. 7. (a) Indefinite length encoding. (b) Definite length encoding.


Decoder

The decoder presented a more significant challenge in the ACSE/presentation protocol machine. In an effort to reduce memory requirements, the decoder does not depend upon having the entire PDU in memory to decode. Pieces can be received separately, and these need to be decoded and the memory released. The decoder also does not require contiguous memory. PDU segments can be received from the session layer according to the transport segment size. In addition, the memory manager on the card presents PDUs in separate physical buffers called line data buffers (see the article on page 18, which describes the CONE memory manager).

The main job of the decoder is to find the primitive values encoded within the complex nesting of tags and values, and extract those primitives. Along the way, the decoder must also verify that the outer constructed tags are correct, and that the lengths associated with all the constructed tags are correct.

The decoder uses a mathematical calculation to predict and check directly the appropriate tag values. The idea is to generate a unique token that directly identifies particular primitive values. This unique tag is calculated by successively using the outer nested tag values to create a unique number that can be predicted a priori. For example, a simple method to calculate a unique value for any primitive is to take every constructed tag value and add it to the total calculated from previous constructed tags, and then multiply the new total by some base. This calculation derives a unique value for every primitive in a PDU. The unique value can be calculated statically from the standard. Our implementation uses the same constants as were used in the encoding tables above to construct a compiled constant. The unique value can then be calculated dynamically as the decoder goes through a received PDU. Thus, as the decoder is parsing a PDU and successively reading constructed tags, it is calculating current_unique_tag = (old_unique_tag x base) + tag_value.

The advantage of this method is that a generic decoder routine can be used to validate ASN.1 syntax, and as soon as a primitive is reached within a nested PDU, the generic routine can jump directly to a specific routine to deal with the primitive. The value can be checked for specifics and then used or stored. The generic routine is relatively simple. It merely loops looking for a tag, length, and value. If the value is not a primitive it calculates the unique tag. Otherwise it uses the calculated unique tag to know which routine to call. Much of the syntax is automatically verified during the calculation.

The disadvantage of using such a calculation is that while it guarantees a unique number, the number may grow quite large as the depth of nesting within a PDU grows. The problem is that the base used must be at least as large as the total number of tag values. Thus, the unique tag must be able to represent a number as large as the base to the nth power, where n is the depth of nesting required. PDUs that allow very large nesting may not be suitable for unique tag calculation if the largest reasonable number cannot hold the maximum calculated unique tag. Calculating a unique tag has proven to be fairly quick in comparison to using a structure definition to verify each incoming PDU.
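The running calculation can be written directly from the formula above. In this C sketch, BASE and the tag width are placeholders; the card derives its constants from the same compiled tables used by the encoder.

    #include <stdint.h>

    /* BASE must be at least as large as the number of distinct tag values so
     * that every nesting path produces a distinct number. Placeholder value. */
    #define BASE 64u

    /* Applied each time the decoder reads a constructed tag while parsing:
     * current_unique_tag = (old_unique_tag x BASE) + tag_value. */
    static uint32_t next_unique_tag(uint32_t old_unique_tag, uint32_t tag_value)
    {
        return old_unique_tag * BASE + tag_value;
    }

Because the same arithmetic can be evaluated statically from the protocol standard, each primitive's expected unique tag can be held as a compiled constant in the decoder's switch table and compared against the value accumulated while parsing.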

Once a primitive tag value is reached, the derived unique tag is used to vector to a procedure specific to that primitive. The procedure contains the code to deal with the primitive. The decoder has a switch table of valid tags, as well as a bit table used to determine correct orders of values and mandatory or optional field checks. This mechanism allows the decoder to identify quickly the primitives nested within a complex PDU, verify correctness, and take the necessary action.

The decoder must perform two types of length checking: definite lengths, in which lengths must be kept and verified, and indefinite lengths, in which the decoder must keep track of end-of-contents flags. Definite and indefinite lengths can appear together in the same PDU at the discretion of the encoder. The decoder uses two stacks in parallel to check the lengths, one for definite values, and one for EOCs. The definite length stack pushes a value for each constructor type encountered and subtracts a primitive length from each of the appropriate constructor values in the stack. When the last, innermost primitive is subtracted, the appropriate constructor values are popped from the stack. Using and saving stacks allows the decoder to receive PDU segments and decode part way, stop, save the stack values, and resume decoding when the next PDU segment is received. Thus, a complete PDU does not have to be received before memory can be released back to the card memory pool. With this design we have not noticed any difference in the amount of time it takes to decode definite and indefinite length types.

Using a Compiler

During the design phase, the option of using an ASN.1 compiler was considered for the ACSE and presentation protocol machines. The main advantage of a compiler is that once the compiler is written, any protocol specification that uses ASN.1 can be compiled into useful object code. The object code then interacts with the protocol machine via a set of interface structures. The disadvantages of compilers are that they are complicated to write and existing compilers expect PDUs to be decoded from contiguous buffers. The generic code produced is also larger than the specific code necessary for relatively small protocols. Given the requirement to stream PDUs in memory segments, to use as little memory as possible, and to decode only ACSE and presentation PDUs, the compiler alternative was not as attractive as it might be in other applications.

References
1. Information Processing Systems - Open Systems Interconnection - Application Layer Structure, ISO/DP 9545, ISO/TC97/SC21/N1743, July 24, 1987. Revised November 1987.
2. Information Processing Systems - Open Systems Interconnection - Specification of Abstract Syntax Notation One (ASN.1), ISO 8824: 1987 (E).
3. Information Processing Systems - Open Systems Interconnection - Specification of Basic Encoding Rules for Abstract Syntax Notation One (ASN.1), ISO 8825: 1987 (E).
4. Information Processing Systems - Open Systems Interconnection - Service Definition for the Association Control Service Element, ISO 8649: 1987 (E) (ISO/IEC JTC1/SC21 N2326).


5. Information Processing Systems - Open Systems Interconnection - Protocol Definition for the Association Control Service Element, ISO 8650: 1987 (E) (ISO/IEC JTC1/SC21 N2327).
6. Information Processing Systems - Open Systems Interconnection - Connection Oriented Presentation Service Specification, ISO 8822: 1988 (ISO/IEC JTC1/SC21 N2335).

7. Information Processing Systems - Open Systems Interconnection - Connection Oriented Presentation Protocol Specification, ISO 8823: 1988 (ISO/IEC JTC1/SC21 N2336).

Implementation of the OSI Class 4 Transport Layer Protocol in the HP OSI Express Card

The HP OSI Express card's implementation of the transport layer protocol provides flow control, congestion control, and congestion avoidance.

by Rex A. Pugh

THE TRANSPORT LAYER is responsible for providing reliable end-to-end transport services, such as error detection and recovery, multiplexing, addressing, flow control, and other features. These services relieve the upper-layer user (typically the session layer) of any concern about the details of achieving reliable cost-effective data transfers. These services are provided on top of both connection-oriented and connectionless network protocols. Basically, the transport layer is responsible for converting the quality of service provided by the network layer into the quality of service demanded by the upper layer protocol.

This article describes the OSI Express card's implementation of the OSI Class 4 Transport Protocol (TP4). The OSI Express TP4 implementation extends the definition of the OSI transport layer's basic flow control mechanisms to provide congestion avoidance and congestion control for the network and the OSI Express card itself. Because we have requirements to support a large number of connections on a fairly inexpensive platform, the memory management and flow control schemes are designed to work closely together and to use the card's limited memory as efficiently as possible. This efficiency also includes ensuring fair buffer utilization among connections.

Flow Control Basics

An introduction to the basic concepts of flow control, congestion control, and congestion avoidance is useful in setting the stage for a discussion of the OSI Express card TP4 implementation. These concepts are related because they all solve the problem of resource management in the network.

They are also distinct because they solve resource problems either in different parts of the network or in a different manner.

Flow Control

Flow control is the process of controlling the flow of data between two network entities. Flow control at the transport layer is needed because of the interactions between the transport service users, the transport protocol machines, and the network service. A transport entity can be modeled as a pair of queues (inbound and outbound) between the transport service user and the transport protocol machine, and a set of buffers dedicated to receiving inbound data and/or storing outbound data for retransmission (see Fig. 1). The transport entity would want to restrain the rate of transport protocol data unit (TPDU*) transmission over a connection from another transport entity for the following reasons:
- The user of the receiving transport entity cannot keep up with the flow of inbound data. In other words, the inbound queue between the transport service user and the transport protocol machine has grown too deep.
- The receiving transport entity does not have enough buffers to keep up with the flow of inbound data from the network.

Note that analogous situations exist in the outbound direction, but they are usually handled internally between the transport user and the transport entity. If the sending transport entity does not have enough buffers to keep up with the flow of data from the transport user, or the sending transport entity is flow controlled by the receiving transport entity, then the transport user must be flow controlled by some backpressure mechanism caused by the outbound queue's growing too deep.

*A TPDU contains transport layer control commands and data packets.


Thus flow control is a two-party agreement between the transport entities of a connection to limit the flow of packets without taking into account the load on the network. Its purpose is to ensure that a packet arriving at its destination is given the resources it needs to be processed up to the transport user.

Congestion Control

While flow control is used to prevent end system resources from being overrun, congestion control is used to keep resources along a network path from becoming congested. Congestion is said to occur in the network when the resource demands exceed the capacity and packets are lost because of too much queuing in the network.

Congestion control is usually categorized as a network layer function. In an X.25 type network where the network layer is connection-oriented, the congestion problem is handled by reserving resources at each of the routers along a path during connection setup. The X.25 flow control mechanism can be used between the X.25 routers to ensure that these resources do not become congested. With a connectionless network layer like ISO 8473, the routers can detect that they are becoming congested, but there are no explicit flow control mechanisms (like choke packets1) that can be used by the OSI network layer alone for controlling congestion.

The most promising approach to congestion control in connectionless networks is the use of implicit techniques whereby the transport entities are notified that the network is becoming congested. The binary feedback scheme2 is an example of such a notification technique. The transport entities can relieve the congestion by exercising varying degrees of flow control.

Thus congestion control is a social agreement among network entities. Different connections may choose different flow control practices, but all entities on a network must follow the same congestion control strategy. The purpose of congestion control is to control network traffic to reduce resource overload.

Congestion Avoidance

Congestion control helps to improve performance after congestion has occurred. Congestion avoidance tries to keep congestion from occurring. Thus congestion control procedures are curative while congestion avoidance procedures are preventive. Given that a graph of throughput versus network load typically looks like Fig. 2, a congestion avoidance scheme should cause the network to oscillate slightly around the knee, while a congestion control scheme tries to minimize the chances of going over the cliff. The knee is the optimal operating point because increases in load do not offer a proportional increase in throughput, and it provides a certain amount of reserve for the natural burstiness associated with network traffic.

    Flow Control Mechanisms in TP4

The OSI Class 4 Transport, or TP4, protocol is described in ISO document number 8073. It provides a reliable end-to-end data transfer service by using error detection and recovery mechanisms. Flow control is an inherent part of this reliable service. This section will describe the protocol mechanisms that are used to provide flow control in OSI TP4. These mechanisms make use of the TP4 data stream structure, TPDU numbering, and TPDU acknowledgments.

TP4 Data Stream Structure

The main service provided by the transport layer is, of course, data transfer. Two types of transfer service are available from TP4: a normal data service and an expedited data service. Expedited data at the transport layer bypasses normal data end-to-end flow control, so we need not concern ourselves with expedited data when discussing TP4 flow control.

The OSI transport service (TS) interface is modeled as a set of primitives through which information is passed between the TS provider and the TS user. Normal TS user data is given to the transport layer by the sending TS user in a transport data request primitive. TS user data is delivered to the receiving TS user in a transport data indication primitive.

Fig. 1. Model of a transport entity.

Fig. 2. A typical graph of throughput versus network load. A congestion avoidance scheme should cause the network to oscillate around the knee, while a congestion control scheme tries to minimize the chances of going over the cliff.


The data carried in each transport data request and transport data indication primitive is called a transport service data unit (TSDU). There is no limit on the length of a TSDU. To deliver a TSDU, the transport protocol may segment the TSDU into multiple data transport protocol data units (DT TPDUs). The maximum data TPDU size is negotiated for each connection at connection establishment. Negotiation of a particular size depends on the internal buffer management scheme and the maximum packet size supported by the underlying network service. The maximum TPDU sizes allowed in TP4 are 128, 256, 512, 1024, 2048, 4096, and 8192 octets. (An octet is eight bits.)
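As a rough illustration of this segmentation, the sketch below computes how many DT TPDUs are needed for a TSDU of a given length; the function and its header-length parameter are assumptions for illustration, not part of the card's TP4 code.

    #include <stddef.h>

    /* Number of DT TPDUs needed to carry a TSDU when each DT TPDU can hold
     * (max_tpdu_size - header_len) octets of user data. max_tpdu_size would
     * be one of the negotiated values (128 through 8192 octets). */
    static size_t dt_tpdus_for_tsdu(size_t tsdu_len, size_t max_tpdu_size,
                                    size_t header_len)
    {
        size_t payload = max_tpdu_size - header_len;
        return (tsdu_len + payload - 1) / payload;   /* ceiling division */
    }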


TPDU Numbering

The error detection, recovery, and flow control functions all rely on TPDU numbering. Unlike ARPA TCP, where sequencing is based on numbering each byte in the data stream since connection establishment, TP4 sequencing is based on numbering each TPDU in the data stream since connection establishment. A transport entity allocates the sequence number zero to the first DT TPDU that it transmits for a transport connection. For subsequent DT TPDUs sent on the same transport connection, the transport entity allocates a sequence number one greater than the previous one, modulo the sequence space size (see Fig. 3).

The sequence number is carried in the header of each DT TPDU and its corresponding AK (acknowledgment) TPDU. The sequence number field can be either 7 or 31 bits long. The size of the sequence space is negotiated at connection establishment. Since a transport entity must wait until the network's maximum packet lifetime has expired before reusing a sequence number, the 31-bit sequence space is preferred for performance reasons.

    TP4 AcknowledgmentsAn AK (acknowledgment) TPDU is used in OSI TP4 forthe following reasons: It is the third part of the three-way handshake that isused for connection establishment (see Fig. 4). It acknowledges the receipt of the CC (connect confirm)TPDU. It is used to provide the connection assurance or keep-alive function. To detect an unsignaled loss of the network connection or failure of the remote transport entity,an inactivity timer is used. A connection's inactivity*An octet is eight bits.

Fig. 4. Three-way handshake used for connection establishment.

A connection's inactivity timer is reset each time a valid TPDU is received on that connection. If a connection's inactivity timer expires, the connection is presumed lost and the local transport entity invokes its release procedures for the connection. The keep-alive function maintains an idle connection by periodically transmitting an AK TPDU upon expiration of the window timer. Thus the interval of one transport entity's window timer must be less than that of its peer's inactivity timer. Since there is no mechanism for sharing information about timer values, a transport entity must respond to the receipt of a duplicate AK TPDU not containing the FCC (flow control confirmation) parameter by transmitting an AK TPDU containing the FCC parameter. Thus, a transport entity can provoke another transport entity into sending an AK TPDU to keep the connection alive by transmitting a duplicate AK TPDU.
• It is used to acknowledge the in-sequence receipt of one or more DT TPDUs. Since OSI TP4 retains DT TPDUs until acknowledgment (for possible retransmission), receipt of an AK TPDU allows the sender to release the acknowledged TPDUs and free transmit buffers. To acknowledge the receipt of multiple DT TPDUs, an implementation of OSI TP4 may withhold sending an AK TPDU for some time (maximum acknowledgment holdback time) after receipt of a DT TPDU. This holdback time must be conveyed to the remote transport entity at connection establishment time.

Fig. 3. Transport service data unit (TSDU) format and the DT TPDU numbering scheme. (0, i, m, n = sequence numbers; m = i + 1 mod (sequence space size); sequence space size shown is 31 bits.)


• It is used to convey TP4 flow control information, as described in the next section.

TP4 Flow Control
OSI TP4 flow control, like many other schemes, is managed by the receiver. TP4 uses a credit scheme. The receiver sends an indication through the AK TPDU of how many DT TPDUs it is prepared to receive. More specifically, an AK TPDU carries the sequence number of the next expected DT TPDU (this is called the LWE or lower window edge) and the credit window (CDT), which is the number of DT TPDUs that the peer transport entity may send on this connection. The sequence number of the first DT TPDU that cannot be sent, called the upper window edge (UWE), is then the lower window edge plus the credit window modulo the sequence space size (see Fig. 5). As an example, say that the receiving transport entity has received DT TPDUs up through sequence number 5. Then the LWE or next expected DT TPDU number is 6. If the receiver transmits an AK TPDU with a CDT of 10 and an LWE of 6, then the transmitter (receiver of the AK TPDU) has permission to transmit 10 DT TPDUs numbered 6 through 15. The transmitter is free to retransmit any DT TPDU that has not been acknowledged and for which it has credit. A DT TPDU is acknowledged when an AK TPDU is received whose LWE is greater than the sequence number of the DT TPDU.

Credit Reduction
OSI TP4 allows the receiver to reduce the credit window as well as take back credit for DT TPDUs that it has not yet acknowledged. The LWE cannot be reduced, however, since it represents the next expected DT TPDU sequence number and acknowledges receipt of all DT TPDUs of lower number. Another way of saying this is that the UWE need not move forward with each successive AK TPDU, and in fact it may move backwards as long as it isn't less than the LWE. As will be discussed later, the OSI Express card's TP4 takes advantage of this feature to provide memory congestion control by closing the credit window (AK TPDU with CDT of zero) under certain circumstances.
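The window arithmetic just described can be summarized in a short sketch (the names are illustrative, not from the OSI Express source). It reproduces the example above: with an LWE of 6 and a CDT of 10, DT TPDUs 6 through 15 may be sent, and sequence number 16 is the UWE.

    #include <stdio.h>

    /* Hypothetical sketch of the TP4 credit window arithmetic.
       All arithmetic is modulo the negotiated sequence space size. */
    #define SEQ_SPACE 0x80000000UL   /* 31-bit sequence space (2^31) */

    /* Upper window edge: first DT TPDU sequence number that cannot be sent. */
    static unsigned long upper_window_edge(unsigned long lwe, unsigned long cdt)
    {
        return (lwe + cdt) % SEQ_SPACE;
    }

    /* True if sequence number seq lies inside the window [LWE, LWE + CDT). */
    static int in_credit_window(unsigned long seq, unsigned long lwe,
                                unsigned long cdt)
    {
        /* Distance from LWE to seq, modulo the sequence space, must be
           smaller than the credit window. */
        unsigned long offset = (seq + SEQ_SPACE - lwe) % SEQ_SPACE;
        return offset < cdt;
    }

    int main(void)
    {
        unsigned long lwe = 6UL, cdt = 10UL;

        printf("UWE = %lu\n", upper_window_edge(lwe, cdt));            /* 16 */
        printf("DT 15 sendable? %d\n", in_credit_window(15UL, lwe, cdt)); /* 1 */
        printf("DT 16 sendable? %d\n", in_credit_window(16UL, lwe, cdt)); /* 0 */
        return 0;
    }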

Fig. 5. Parameters associated with a buffer of DT TPDUs. UWE = upper window edge (sequence number of the first DT TPDU that cannot be sent); LWE = lower window edge (sequence number, carried by an AK TPDU, of the next expected DT TPDU); CDT = credit window or window size (number of DT TPDUs a receiver can handle).

OSI Express Card TP4
The OSI Express card's implementation of TP4 (hereafter called the Express TP4) flow control and network congestion control and avoidance policies use many of the basic protocol mechanisms described above.

Flow Control
In Express TP4 the maximum receive credit window size (W) is a user-settable parameter. A similar parameter (Q) is used to provide an upper limit on the number of DT TPDUs a given connection is allowed to retain awaiting acknowledgment. The Express TP4 dynamically changes the window size and queuing limit based on the state of congestion, so W and Q are treated as upper limits. An application can set values for W and Q for a particular connection during connection establishment. A set of values may also be associated with a particular TSAP (transport service access point) selector, so that applications can select from different transport service profiles. In lieu of a connection using one of the two methods just described, configured default values are used.
There is no real notion of flow control in the outbound direction, although TPDU transmissions are paced during times of congestion. The Express TP4 continues to send TPDUs until it has used all the credit that it was allocated by the peer entity, or it has Q TPDUs in its retransmission queue awaiting acknowledgment, whichever comes first.
Ignoring any congestion control mechanisms for the moment, inbound flow control is also fairly simple. When the Express TP4 sends an AK TPDU, its goal is to grant a full window's worth of credit. The CDT field of the AK TPDU is set to W, and the LWE field is set to the sequence number of the last in-sequence DT TPDU received plus one (i.e., the next expected DT TPDU).

Fig. 6. Simple flow control policy.


The key to the efficient operation of the flow control policy is the timing of the AK TPDU transmissions.
A simple flow control policy (see Fig. 6) could be to send an AK TPDU granting a full credit window when the last in-sequence DT TPDU of the current credit window has been received. This policy would degrade the potential throughput of the connection, however, because it neglects the propagation delays and processing times of the DT TPDUs and AK TPDUs. After transmitting the last DT TPDU of the current credit window, the sender is idle until the AK TPDU is received and processed. After sending the AK TPDU, the receiver is idle until the first DT TPDU of the new credit window has propagated across the network. These delays could be lengthy depending on the speed of the underlying transmission equipment and on the relative speeds of the sending and receiving end systems.
A more efficient flow control policy, like that implemented in the Express TP4, sends credit updates such that the slowest part of the transmission pipeline (sending entity, receiving entity, or network subsystem) is not idle as long as there is data to be transmitted. This is done by sending an AK TPDU granting a full window's worth of credit before all of the DT TPDUs of the current credit window have been received. The point in the current credit window at which the credit-giving AK TPDU is sent is called the credit acknowledgment point (CAP). Thus the CAP is the sequence number of a DT TPDU in the current credit window whose in-sequence receipt will generate the transmission of an AK TPDU. The AK TPDU's LWE will be the sequence number of the DT TPDU causing the generation of the AK TPDU and the CDT field of the AK TPDU will contain the value of W. The CAP is calculated each time an AK TPDU is sent, and is just the sum of the credit acknowledgment interval (CAI) and the current LWE. CAI represents the number of data packets received before an AK TPDU is sent.
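As a small sketch of the receive-side decision just described (illustrative names only; this is not the OSI Express source), the CAP is recomputed whenever an AK TPDU is sent, and the arrival of the CAP DT TPDU triggers the next credit-granting AK TPDU:

    #include <stdio.h>

    #define SEQ_SPACE 0x80000000UL   /* 31-bit sequence space */

    /* Hypothetical receive-side state for one connection. */
    struct rx_state {
        unsigned long lwe;   /* next expected DT TPDU sequence number      */
        unsigned long cai;   /* credit acknowledgment interval             */
        unsigned long cap;   /* DT TPDU whose receipt triggers the next AK */
        unsigned long w;     /* maximum receive credit window size         */
    };

    /* Called when an AK TPDU is sent: the next CAP is LWE + CAI (mod space). */
    static void ak_sent(struct rx_state *s)
    {
        s->cap = (s->lwe + s->cai) % SEQ_SPACE;
    }

    /* Called for each in-sequence DT TPDU.  Returns 1 if an AK TPDU granting
       a full window (CDT = W) should be sent now. */
    static int dt_received(struct rx_state *s, unsigned long seq)
    {
        s->lwe = (seq + 1UL) % SEQ_SPACE;   /* next expected DT TPDU */
        return seq == s->cap;
    }

    int main(void)
    {
        struct rx_state s = { 0UL, 4UL, 0UL, 8UL };  /* CAI = W/2 initially */
        unsigned long seq;

        ak_sent(&s);                                  /* CAP = 0 + 4 = 4 */
        for (seq = 0UL; seq < 6UL; seq++)
            if (dt_received(&s, seq)) {
                printf("send AK: LWE=%lu CDT=%lu\n", s.lwe, s.w);
                ak_sent(&s);                          /* recompute the CAP */
            }
        return 0;
    }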

Example
Consider a hypothetical connection where two end systems are connected through an intermediate system via two 9600-baud full-duplex serial links. Fig. 7 shows the progression of DT TPDUs and the flow control pacing AK TPDUs across the links of this connection. At time T0, end system 1 has received the DT TPDU whose sequence number is the CAP. End system 1 then places an AK TPDU in the transmission queue of link A', thereby granting a new credit window. Meanwhile links A and B are busy processing DT TPDUs numbered CAP+1 and CAP+2 respectively. At time T1, the AK TPDU has made it to the link B' transmission queue and the DT TPDUs have advanced one hop, allowing DT TPDU number CAP+3 to be inserted in the link B transmission queue. Finally, at time T2, the AK TPDU has made it to end system 2, and again the DT TPDUs have advanced one hop, allowing DT TPDU number CAP+4 to be inserted in the link B transmission queue. Note that for simplicity, it is assumed that the propagation delay of a DT TPDU across a link is equal to that of an AK TPDU. In reality, DT TPDUs are larger than AK TPDUs and would take longer to propagate.
For this example, the minimal CAI needed to keep the links busy is four, and the minimal window size W is eight. Thus the AK TPDU would carry a CDT of eight, so that end system 2 has credit to send DT TPDUs numbered CAP+5 through CAP+8 at the time it receives the AK TPDU (time T2). DT TPDU number CAP+4 would trigger end system 1 to send another credit-granting AK. The CAI should not be greater than W - 4 for this example, or end system 1 will notice an abnormal delay in the packet train because end system 2 does not have enough credit to keep the links busy while the AK TPDU is in transit. Any CAI less than W - 4 would avoid the delay problem, but the increase in AK TPDU traffic tends to decrease the amount of CPU and link bandwidth that can be used for data transmission.

Fig. 7. Hypothetical connection of two end systems through an intermediate system via two 9600-baud full-duplex serial links.


The optimal CAI for this example would be W - 4 since that avoids the credit delay and minimizes the number of AK TPDUs. The graph in Fig. 9 on page 56 shows the effect on throughput of different values for W and different CAIs (packets per AK TPDU) for each of the window sizes. This graph was created from a simulation of the Express TP4 implementation running a connection between two end systems connected to a single LAN segment. This simulation data and analysis of real Express TP4 data have shown that a maximum CAI of W/2 yields the best performance with the least amount of algorithmic complexity.

Optimal Credit Window
In the Express TP4, the CAI initially starts at half the credit window size (W/2), but can be reduced and subsequently increased dynamically to reach and maintain the optimal interval during the life of the connection. The optimal value, as shown in the above example, is large enough to ensure that the sender receives the AK TPDU granting a new credit window before it finishes transmitting at least the last DT TPDU of the current window, but not larger than the number of DT TPDUs the sender is willing to queue on its retransmit queue awaiting acknowledgment (note that this scheme relies on the setting of sufficiently large values for Q and W such that the optimal CAI can be reached). If the sending transport entity does not allow W/2 DT TPDUs to be queued awaiting acknowledgment, then as a receiver, the Express TP4 will decrease the CAI to avoid waiting for the CAP DT TPDU that would never come. This situation is detected with the maximum acknowledgment holdback timer. Since any AK TPDU that is sent cancels the acknowledgment holdback timer, expiration of the holdback timer indicates that the sender may not have sent the CAP DT TPDU. When the timer expires, the CAI is decreased to half the number of DT TPDUs received since the last credit update. This is done to preserve the pipelining scheme, since it has been shown that it is better to send AK TPDUs slightly more often than to allow the pipeline to dry up. The amount of credit offered is not shrunk (unless congestion is detected), so if the sender devotes more resources to the connection, it can take advantage of the larger window size. The CAI will increase linearly as long as the sender is able to send up to the CAP DT TPDU before the acknowledgment holdback timer expires. The linear increase allows the Express TP4 to probe the sender's transmit capability, and has proved fairly effective.
A more effective mechanism for matching the receiver's AK TPDU rate to the sender's needs has reached draft proposal status as an enhancement to OSI TP4. That mechanism allows the sending transport entity to request an acknowledgment from the receiving transport entity.
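A rough sketch of this CAI adaptation follows (illustrative only; the variable names and the increase step of one are assumptions, not taken from the OSI Express source):

    /* Hypothetical sketch of dynamic CAI adjustment on the receive side.
       On holdback-timer expiration the CAI is cut to half the DT TPDUs
       received since the last credit update; otherwise it creeps back up
       linearly toward W/2. */
    struct cai_state {
        unsigned long w;            /* maximum credit window size             */
        unsigned long cai;          /* current credit acknowledgment interval */
        unsigned long dt_since_ak;  /* DT TPDUs received since the last AK    */
    };

    /* Holdback timer expired: the CAP DT TPDU may never arrive. */
    static void holdback_expired(struct cai_state *s)
    {
        unsigned long half = s->dt_since_ak / 2UL;
        s->cai = (half > 0UL) ? half : 1UL;   /* never drop below one packet */
    }

    /* The CAP DT TPDU arrived in time: probe the sender with a linear
       increase, bounded by W/2 (the maximum CAI found to perform best).
       The step size of one is an assumption for illustration. */
    static void cap_reached(struct cai_state *s)
    {
        if (s->cai + 1UL <= s->w / 2UL)
            s->cai += 1UL;
    }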

Congestion Control and Avoidance
Several network congestion control and avoidance algorithms are used in the Express TP4. All of these algorithms have been described and rationalized in reference 3. This section provides a basic description of each algorithm and how they were effectively incorporated in the Express TP4 implementation.

There is also a description of how these algorithms are used together with the dynamic credit window and retransmit queue sizing algorithms to provide congestion control of card resources and network resources.

Slow Start/CUTE* Congestion Avoidance
Two very similar congestion avoidance schemes have been described by Jacobsen3 and Jain.4 The fundamental observation of these two algorithms is that the flow on a transport connection should obey a "conservation of packets" principle. If a network is running in equilibrium, then a new packet isn't put onto the network until an old one leaves. Congestion and ultimately packet loss occur as soon as this principle is violated. In practice, whenever a new connection is started or an existing connection is restarted after an idle period, new packets are injected into the network before an old packet has exited. To minimize the destabilizing effects of these new packet injections, the CUTE and slow start schemes require the sender to start from one packet and linearly increase the number of packets sent per round-trip time. The basic algorithm is as follows:
• When starting or restarting after a packet loss or an idle period, set a variable congestion window to one.
• When sending DT TPDUs, send the minimum of the congestion window or the receiver's advertised credit window size.
• On receipt of an AK TPDU acknowledging outstanding DT TPDUs, increase the congestion window by one up to some maximum (the minimum of Q or the receiver's advertised credit window size).
Note that this algorithm also is employed when the retransmit or retransmission timer expires. The CUTE scheme proposes that a retransmission time-out be used as an indication of packet loss because of congestion. Jacobsen3 also argues, with some confidence, that if a good round-trip-time estimator is used in setting the retransmit timer, a time-out indicates a lost packet and not a broken timer (assuming that a delayed packet is equated with a lost packet). For a LAN environment, packets are dropped because of congestion.
The Express TP4 uses the slow start algorithm (if configured to do so) when a connection is first established, upon expiration of the retransmission timer, and after an idle period on an existing connection. An idle period is detected when a certain number of keep-alive AK TPDUs have been sent or received. The slow start and CUTE schemes limit their description to sender functions. The Express TP4 provides the slow start function on the receive side as well, to protect both the network and the OSI Express card from a sender that does not use the slow start scheme. The receiver slow start algorithm is nearly identical to the sender's and works as follows (a brief sketch follows the list):
• When starting or restarting after an idle period, set a variable congestion window to one.
• When sending an AK TPDU, offer a credit window size equal to the congestion window to the sender.
• On receipt of the CAP DT TPDU, increase the congestion window by one up to some maximum W as described above.
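A minimal sketch of the sender-side slow start, assuming illustrative names (this is not the OSI Express source):

    /* Hypothetical sketch of the slow start (CUTE) scheme on the send side.
       cwnd is the congestion window in DT TPDUs. */
    struct ss_state {
        unsigned long cwnd;        /* congestion window                        */
        unsigned long peer_credit; /* credit window advertised by the receiver */
        unsigned long q;           /* local retransmission queue limit (Q)     */
    };

    /* Connection start, restart after an idle period, or packet loss. */
    static void slow_start(struct ss_state *s)
    {
        s->cwnd = 1UL;
    }

    /* How many DT TPDUs may be outstanding right now. */
    static unsigned long send_limit(const struct ss_state *s)
    {
        return (s->cwnd < s->peer_credit) ? s->cwnd : s->peer_credit;
    }

    /* An AK TPDU acknowledged outstanding DT TPDUs: open the window by one,
       up to the minimum of Q and the advertised credit. */
    static void ak_received(struct ss_state *s)
    {
        unsigned long max = (s->q < s->peer_credit) ? s->q : s->peer_credit;
        if (s->cwnd < max)
            s->cwnd += 1UL;
    }

The receiver-side version offers a credit window equal to its own congestion window in each AK TPDU instead of granting W outright, opening it by one each time the CAP DT TPDU arrives.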

*CUTE stands for Congestion control Using Time-outs at the End-to-end layer.


Round-Trip-Time Variance Estimation
Since the retransmit timer is used to provide congestion notification, it must be sensitive to abnormal packet delay as well as packet loss. To do this, it must maintain an accurate measurement of the round-trip time (RTT). The round-trip time is defined as the time it takes for a packet to propagate across the network and its acknowledgment to propagate back. Most transport implementations use an averaging algorithm to keep an ongoing estimation of the round-trip time using measurements taken for each received acknowledgment.
The Express TP4 uses RTT mean and variance estimation algorithms3 to derive its retransmission timer value. The basic estimator equations in a C-language-like notation are:

Err = M - A
A = A + (Err >> Gain)
D = D + ((|Err| - D) >> Gain)

where:

M = current RTT measurement
A = average estimation for RTT, or a prediction of the next measurement
Err = error in the previous prediction of M, which may be treated as a variance
Gain = a weighting factor
D = estimated mean deviation
>> = C notation for the logical shift right operation (a division of the left operand by 2 to the power of the right operand).

The retransmission timer is then calculated as:

retrans_time = A + 2D.

The addition of the deviation estimator has provided a more reactive retransmission timer while still damping the somewhat spurious fluctuations in the round-trip time.
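Expressed as compilable C rather than the shorthand above (the integer timer units and a gain of 3, i.e., a new-sample weighting of 1/8, are assumptions for illustration):

    /* Hypothetical sketch of the RTT mean/deviation estimator and the
       retransmission timer calculation described above. */
    struct rtt_est {
        long a;   /* smoothed RTT estimate (A)    */
        long d;   /* estimated mean deviation (D) */
    };

    #define RTT_GAIN 3   /* assumed weighting factor: new sample weighted 1/8 */

    /* Update the estimator with one RTT measurement m and return the new
       retransmission timer value, retrans_time = A + 2D.  Assumes an
       arithmetic right shift for negative Err, as is typical. */
    static long rtt_update(struct rtt_est *e, long m)
    {
        long err = m - e->a;                  /* Err = M - A                   */

        e->a += err >> RTT_GAIN;              /* A = A + (Err >> Gain)         */
        if (err < 0)
            err = -err;                       /* |Err|                         */
        e->d += (err - e->d) >> RTT_GAIN;     /* D = D + ((|Err| - D) >> Gain) */

        return e->a + 2L * e->d;
    }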

Exponential Retransmit Timer
If it can be believed that a retransmit timer expiration is a signal of network congestion, then it should be obvious that the retransmission time should be increased when the timer expires to avoid further unnecessary retransmissions. If the network is congested, then the timer most likely expired because the round-trip time has increased appreciably (a packet loss could be viewed as an infinite increase). The question is how the retransmissions should be spaced. An exponential timer back-off seems to be good enough to provide stability in the face of congestion, although in theory even an exponential back-off won't guarantee stability.5
The Express TP4 uses an exponential back-off with clamping. Clamping means that the backed-off retransmit time is used as the new round-trip time estimate, and thus directly affects the retransmit time for subsequent DT TPDUs. The exponential back-off equation is as follows:

retrans_time = retrans_time × 2^n

where n is the number of times the packet has been transmitted. For a given DT TPDU, the first time the retransmission timer expires the retransmission time is doubled. The second time it expires, the doubled retransmission time is quadrupled, and so on.

Dynamic Window and Retransmit Queue Sizing
The slow start described earlier provides congestion avoidance when used at connection start-up and restart after an idle period. It provides congestion control when triggered by a retransmission. The problem with it is that a slow start only reduces a connection's resource demands for a short while. It takes time RTT × log2(W), where RTT is the round-trip time and W is the credit window size, for the window increase to reach W. When a window size reaches W again, congestion will most likely recur if it doesn't still exist. Something needs to be done to control a connection's contribution to the load on the network for the long run.
The transport credit window size is the most appropriate control point, since the size of the offered credit window directly affects the load on the network. Increasing the window size increases the load on the network, and decreasing the window size decreases the load. A simple rule is that to avoid congestion, the sum of all the window sizes (W) of the connections in the network must be less than the network capacity. If the network becomes congested, then having each connection reduce its W (while also employing the slow start algorithm to alleviate the congestion) should bring the network back into equilibrium. Since there is no notification by the network when a connection is using less than its fair share of the network resources, a connection should increase its W in the absence of congestion notification to find its limit. For example, a connection could have been sharing a path with someone else and converged to a window that gave each connection half the available bandwidth. If the other connection shuts down, the released bandwidth will be wasted unless the remaining connection increases its window size.
It is argued that a multiplicative decrease of the window size is best when the feedback signal indicates congestion, while an additive increase of the window size is best in the absence of congestion.3,6 Network load grows nonlinearly at the onset of congestion, so a multiplicative decrease is about the least that can be done to help the network reach equilibrium again. A multiplicative decrease also affects connections with large window sizes more than those with small window sizes, so it penalizes connections fairly. An additive increase slowly probes the capacity of the network and lessens the chance of overestimating the available bandwidth. Overestimation could result in frequent congestion oscillations.
Like the slow start algorithm, the Express TP4 uses multiplicative decrease and additive increase by adjusting W on a connection's receive side and by adjusting Q on a connection's send side. This allows us to control the injection of packets into the network and control the memory utilization of each connection on the OSI Express card. The amount of credit given controls the amount of buffer space needed in the network and on the card.


The size of Q also controls the amount of buffer space needed on the card, because TSDUs are not sent to the card from the host computer unless the connection has credit to send them or there are less than Q TPDUs already queued awaiting acknowledgment. The Express TP4 uses the following equations to implement multiplicative decrease and additive increase.

Upon notification of congestion (multiplicative decrease):

W' = W'/2    (1)
Q' = Q'/2    (2)

Upon absence of congestion (additive increase):

W' = W' + W'/4    (3)
Q' = Q' + Q'/4    (4)

W' and Q' are the actual values used by the connection and W and Q are upper limits for W' and Q' respectively. The window and queue size adjustments are used with the retransmit timer congestion notification in the following manner (a sketch follows the list):
• Expiration of the retransmit timer signals network congestion and Q' is decreased.
• The slow start algorithm is used to clock data packets out until the congestion window equals Q'.
• As long as no other notifications of congestion occur, Q' is increased each time an AK TPDU is received, up to a maximum of Q.
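A condensed sketch of equations 1 through 4 (illustrative names; the integer arithmetic and the minimum value of one are assumptions, not taken from the OSI Express source):

    /* Hypothetical sketch of multiplicative decrease / additive increase of
       the working window (w_act = W') and retransmission queue limit
       (q_act = Q'), bounded by the configured upper limits W and Q. */
    struct win_state {
        unsigned long w_max, q_max;   /* configured upper limits W and Q */
        unsigned long w_act, q_act;   /* working values W' and Q'        */
    };

    /* Retransmit timer expired (congestion): equations 1 and 2. */
    static void congestion_detected(struct win_state *s)
    {
        s->w_act /= 2UL;
        s->q_act /= 2UL;
        if (s->w_act == 0UL) s->w_act = 1UL;   /* keep at least one packet */
        if (s->q_act == 0UL) s->q_act = 1UL;
    }

    /* AK TPDU received with no sign of congestion: equations 3 and 4,
       clipped to the upper limits W and Q. */
    static void no_congestion(struct win_state *s)
    {
        s->w_act += s->w_act / 4UL;
        s->q_act += s->q_act / 4UL;
        if (s->w_act > s->w_max) s->w_act = s->w_max;
        if (s->q_act > s->q_max) s->q_act = s->q_max;
    }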

OSI Express Congestion Control
One of the main design goals of the OSI Express card was to support a large number of connections. To achieve this goal, the memory management scheme had to be as efficient as possible since memory (for data structures and data buffers) is the limiting factor in supporting many connections. OSI Express memory management is provided by the CONE memory buffer manager (see page 27 for more about the CONE memory buffer manager).

Initially, the memory buffer manager was designed such that a connection's packet buffers were preallocated. A connection was guaranteed that the buffers it needed would be available on demand. This scheme provided good performance for each connection when there were many active connections, but it would not support enough active connections. The connections goal had to be met, so the memory buffer manager was redesigned such that all connections share the buffer pool. Theoretically, there can be more connections active than there are data buffers, so this scheme maximizes the number of supportable connections at the cost of individual connection performance as the ratio of data buffers to the number of connections approaches one.

The Problem and The Solution
With a shared buffer scheme comes the possibility of congestion. (Actually, even without a shared buffer scheme, other resources such as CPU and queuing capacity are typically shared, so congestion is not a problem specific to statistical buffering.) Since no resources are reserved for each connection, congestion on the card arises from the same situations as congestion in the network.

A new connection coming alive or an existing connection restarting after an idle period injects new packets into the system without waiting for old packets to leave the system. Also, since there can be many connections, it is likely that the sum of the connections' window sizes and other resource demands could become greater than what the card can actually supply.
A shared resource scheme also brings the problem of ensuring that each connection can get its fair share of the resources. Connections will operate with different window sizes, packet sizes, and consumption and production rates. This leads to many different patterns and quantities of resource use. As many connections start competing for scarce resources, the congestion control scheme must be able to determine which connections are and which connections are not contributing to the shortage.
The problem of congestion and fairness was addressed by modeling the card as a simple feedback control system. The system model used consists of processes (connections) that take input in the form of user data, buffers, CPU resources, and control signals, and produce output in the form of protocol data units. To guarantee the success of the system as a whole, each process must be successful. Each process reports its success by providing feedback signals to a central control decision process. The control decision process is responsible for processing these feedback signals, determining how well the system is performing, and providing control information to the connection processes so that they will adjust their use of buffers and CPU resources such that system performance can be maximized.

Control System
Certain measures are needed to determine the load on the card so that congestion can be detected, controlled, and hopefully avoided. When the card is lightly loaded, fairness is not an issue. As resources become scarce, however, some way is needed to measure each connection's resource use so that fairness can be determined and control applied to reduce congestion.
Two types of accounting structures are used on the OSI Express card to facilitate measurement: accounts and credit cards. Since outbound packets are already associated with a connection as they are sent from the host to the card, each connection uses its own account structure to maintain its outbound resource use information. All protocol layers involved in a particular connection charge their outbound operations directly to the connection's outbound account. For inbound traffic, when a packet is received from the LAN, the first three protocol layers do not know which upper-layer connection the packet is for. Therefore, a single inbound account is used for all inbound resource use information for the first three protocol layers, and some combined resource use information for upper-layer connections. This provides some level of accountability for inbound resource use at the lower layers such that comparisons can be made to overall outbound resource use. Since a single inbound account exists for all connections, credit cards are used by the upper four layers (transport and up) to charge their inbound operations to specific connections. Thus each connection has an outbound account and a credit card for the inbound account.
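To make the accounting scheme concrete, here is a purely hypothetical sketch of what such account and credit card structures might track; the fields and names are assumptions for illustration, not the card's actual data structures:

    /* Hypothetical sketch of the two accounting structures: one outbound
       account per connection, one shared inbound account, and a
       per-connection credit card that the upper layers use to charge
       inbound work to a specific connection.  Field names are illustrative. */
    struct account {
        unsigned long buffers_in_use;   /* time-averaged memory utilization */
        unsigned long queue_depth;      /* connection queue depth           */
        unsigned long cpu_ticks;        /* CPU use charged to this account  */
        unsigned long octets_moved;     /* throughput accounting            */
    };

    struct credit_card {
        struct account *inbound;         /* the single shared inbound account     */
        unsigned long   charged_buffers; /* this connection's share of inbound use */
        unsigned long   charged_cpu;
    };

    struct connection {
        struct account     outbound;    /* outbound use charged directly here  */
        struct credit_card inbound;     /* inbound use charged via credit card */
    };

    /* An upper-layer module charging an inbound buffer to a connection: the
       shared inbound account and the connection's credit card both see it. */
    static void charge_inbound_buffer(struct connection *c, unsigned long octets)
    {
        c->inbound.inbound->buffers_in_use += octets;
        c->inbound.charged_buffers += octets;
    }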


The protocol modules and CONE utilities are responsible for updating the statistics (i.e., charging the operations) that are used to measure resource use. These statistics include various system and connection queue depths, CPU use, throughput, and time-averaged memory utilization. When summed over all of the connections, these statistics are used along with other signals to determine the degree of resource shortage or congestion on the card. The individual connection values indicate which connections are contributing the most to the congestion (and should be punished) and which connections are not using their fair share of resources (and should be allowed to do so).

Flow Control Daemon
The control decision and feedback filtering function is implemented in a CONE daemon process aptly named the flow control daemon. Using a daemon allows the overhead for flow control to be averaged over a number of packets. The daemon periodically looks at the global resource statistics and then sets a target for each of the resources for each connection. The target level is not just the total number of, say, buffers divided by the number of connections. Targets are based on the average use adjusted up or down based on the scarcity of various resources. This allows more flexibility of system configurations since one installation or mix of connections may perform better with different maximum queue depths than another. It is also the simplest way to set targets for things like throughput since total throughput is not a constant or a linear function of the number of connections.
Control signals are generated by the flow control daemon as simple indications of whether a connection should increase, decrease, or leave as is its level of resource use. The daemon determines the direction by comparing the connection's level of use with the current target levels. There is a separate direction indication for inbound and outbound resource use.
The fairness function falls out very simply from this decision and control scheme. Any connection that is using more than its fair share of a resource will have a level of use greater than the average and thus greater than the target when that resource is scarce. In other words, the "fair share" is the target.
The control signals are generated when a connection queries the daemon. The most likely point for querying the daemon is when a connection is about to make a flow control decision. That decision point is, of course, in the TP4 layer of the OSI Express card.
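As a loose sketch of the direction-signal decision just described (the enum values, target fields, and query function are hypothetical, not the card's actual interface):

    /* Hypothetical sketch of how the flow control daemon might turn a
       connection's measured use and the current target into a control
       signal.  A separate signal would exist for the inbound and outbound
       directions. */
    enum fc_signal { FC_DECREASE, FC_HOLD, FC_INCREASE };

    struct fc_target {
        unsigned long buffers;   /* target buffer use per connection */
    };

    /* Compare a connection's current buffer use against the target set by
       the daemon on its last periodic pass. */
    static enum fc_signal fc_query(unsigned long connection_use,
                                   const struct fc_target *target)
    {
        if (connection_use > target->buffers)
            return FC_DECREASE;    /* over its fair share: back off  */
        if (connection_use < target->buffers)
            return FC_INCREASE;    /* under the target: may use more */
        return FC_HOLD;
    }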

Effects of the Daemon
The effects of flow control notifications to a connection regarding decreasing or increasing resource use vary according to whether the direction of traffic is inbound or outbound.
Outbound. The Express TP4 queries the flow control daemon for outbound congestion/fairness notification when it receives an AK TPDU. It is at this point that DT TPDUs are released from the retransmission queue, and it can be decided if more or fewer DT TPDUs can be queued until the next AK TPDU is received.
If the connection is using more than its fair share of