Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf ·...

59

Transcript of Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf ·...

Page 1: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Virtio PCI Card Speci�cation

v0.9.4 DRAFT

-

Rusty Russell <[email protected]> IBM Corporation (Editor)

2012February 7.

Page 2: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Chapter 1

Purpose and Description

This document describes the speci�cations of the �virtio� family of PCI de-vices. These are devices are found in virtual environments, yet by design theyare not all that di�erent from physical PCI devices, and this document treatsthem as such. This allows the guest to use standard PCI drivers and discoverymechanisms.

The purpose of virtio and this speci�cation is that virtual environments andguests should have a straightforward, e�cient, standard and extensible mecha-nism for virtual devices, rather than boutique per-environment or per-OS mech-anisms.

Straightforward: Virtio PCI devices use normal PCI mechanisms of inter-rupts and DMA which should be familiar to any device driver author.There is no exotic page-�ipping or COW mechanism: it's just a PCI de-vice.1

E�cient: Virtio PCI devices consist of rings of descriptors for input and out-put, which are neatly separated to avoid cache e�ects from both guest anddevice writing to the same cache lines.

Standard: Virtio PCI makes no assumptions about the environment in whichit operates, beyond supporting PCI. In fact the virtio devices speci�ed inthe appendices do not require PCI at all: they have been implemented onnon-PCI buses.2

1This lack of page-sharing implies that the implementation of the device (e.g. the hyper-visor or host) needs full access to the guest memory. Communication with untrusted parties(i.e. inter-guest communication) requires copying.

2The Linux implementation further separates the PCI virtio code from the speci�c virtiodrivers: these drivers are shared with the non-PCI implementations (currently lguest andS/390).

1

Page 3: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Extensible: Virtio PCI devices contain feature bits which are acknowledgedby the guest operating system during device setup. This allows forwardsand backwards compatibility: the device o�ers all the features it knowsabout, and the driver acknowledges those it understands and wishes touse.

1.1 Virtqueues

The mechanism for bulk data transport on virtio PCI devices is pretentiouslycalled a virtqueue. Each device can have zero or more virtqueues: for example,the network device has one for transmit and one for receive.

Each virtqueue occupies two or more physically-contiguous pages (de�ned, forthe purposes of this speci�cation, as 4096 bytes), and consists of three parts:

Descriptor Table Available Ring (padding) Used Ring

When the driver wants to send a bu�er to the device, it �lls in a slot in thedescriptor table (or chains several together), and writes the descriptor indexinto the available ring. It then noti�es the device. When the device has �nisheda bu�er, it writes the descriptor into the used ring, and sends an interrupt.

2

Page 4: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Chapter 2

Speci�cation

2.1 PCI Discovery

Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000 through 0x103Finclusive is a virtio device1. The device must also have a Revision ID of 0 tomatch this speci�cation.

The Subsystem Device ID indicates which virtio device is supported by thedevice. The Subsystem Vendor ID should re�ect the PCI Vendor ID of theenvironment (it's currently only used for informational purposes by the guest).

Subsystem Device ID Virtio Device Speci�cation

1 network card Appendix C2 block device Appendix D3 console Appendix E4 entropy source Appendix F5 memory ballooning Appendix G6 ioMemory -7 rpmsg -8 SCSI host Appendix H9 9P transport -

2.2 Device Con�guration

To con�gure the device, we use the �rst I/O region of the PCI device. Thiscontains a virtio header followed by a device-speci�c region.

1The actual value within this range is ignored

3

Page 5: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

There may be di�erent widths of accesses to the I/O region; the �natural� accessmethod for each �eld in the virtio header must be used (i.e. 32-bit accesses for32-bit �elds, etc), but the device-speci�c region can be accessed using any widthaccesses, and should obtain the same results.

Note that this is possible because while the virtio header is PCI (i.e. little)endian, the device-speci�c region is encoded in the native endian of the guest(where such distinction is applicable).

2.2.1 Device Initialization Sequence

We start with an overview of device initialization, then expand on the detailsof the device and how each step is preformed.

1. Reset the device. This is not required on initial start up.

2. The ACKNOWLEDGE status bit is set: we have noticed the device.

3. The DRIVER status bit is set: we know how to drive the device.

4. Device-speci�c setup, including reading the Device Feature Bits, discov-ery of virtqueues for the device, optional MSI-X setup, and reading andpossibly writing the virtio con�guration space.

5. The subset of Device Feature Bits understood by the driver is written tothe device.

6. The DRIVER_OK status bit is set.

7. The device can now be used (ie. bu�ers added to the virtqueues)2

If any of these steps go irrecoverably wrong, the guest should set the FAILEDstatus bit to indicate that it has given up on the device (it can reset the devicelater to restart if desired).

We now cover the �elds required for general setup in detail.

2.2.2 Virtio Header

The virtio header looks as follows:

Bits 32 32 32 16 16 16 8 8Read/Write R R+W R+W R R+W R+W R+W RPurpose Device Guest Queue Queue Queue Queue Device ISR

Features bits 0:31 Features bits 0:31 Address Size Select Notify Status Status

2Historically, drivers have used the device before steps 5 and 6. This is only allowed if thedriver does not use any features which would alter this early use of the device.

4

Page 6: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

If MSI-X is enabled for the device, two additional �elds immediately follow thisheader:3

Bits 16 16Read/Write R+W R+WPurpose Con�guration Queue

(MSI-X) Vector Vector

Immediately following these general headers, there may be device-speci�c head-ers:

Bits Device Speci�cRead/Write Device Speci�cPurpose Device Speci�c...

2.2.2.1 Device Status

The Device Status �eld is updated by the guest to indicate its progress. Thisprovides a simple low-level diagnostic: it's most useful to imagine them hookedup to tra�c lights on the console indicating the status of each device.

The device can be reset by writing a 0 to this �eld, otherwise at least one bitshould be set:

ACKNOWLEDGE (1) Indicates that the guest OS has found the device andrecognized it as a valid virtio device.

DRIVER (2) Indicates that the guest OS knows how to drive the device.Under Linux, drivers can be loadable modules so there may be a signi�cant(or in�nite) delay before setting this bit.

DRIVER_OK (4) Indicates that the driver is set up and ready to drive thedevice.

FAILED (128) Indicates that something went wrong in the guest, and it hasgiven up on the device. This could be an internal error, or the driverdidn't like the device for some reason, or even a fatal error during deviceoperation. The device must be reset before attempting to re-initialize.

2.2.2.2 Feature Bits

The�rst con�guration �eld indicates the features that the device supports. Thebits are allocated as follows:

0 to 23 Feature bits for the speci�c device type

3ie. once you enable MSI-X on the device, the other �elds move. If you turn it o� again,they move back!

5

Page 7: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

24 to 32 Feature bits reserved for extensions to the queue and feature negoti-ation mechanisms

For example, feature bit 0 for a network device (i.e. Subsystem Device ID 1)indicates that the device supports checksumming of packets.

The feature bits are negotiated: the device lists all the features it understandsin the Device Features �eld, and the guest writes the subset that it understandsinto the Guest Features �eld. The only way to renegotiate is to reset the device.

In particular, new �elds in the device con�guration header are indicated byo�ering a feature bit, so the guest can check before accessing that part of thecon�guration space.

This allows for forwards and backwards compatibility: if the device is enhancedwith a new feature bit, older guests will not write that feature bit back to theGuest Features �eld and it can go into backwards compatibility mode. Similarly,if a guest is enhanced with a feature that the device doesn't support, it will notsee that feature bit in the Device Features �eld and can go into backwards com-patibility mode (or, for poor implementations, set the FAILED Device Statusbit).

2.2.2.3 Con�guration/Queue Vectors

When MSI-X capability is present and enabled in the device (through standardPCI con�guration space) 4 bytes at byte o�set 20 are used to map con�gurationchange and queue interrupts to MSI-X vectors. In this case, the ISR Status�eld is unused, and device speci�c con�guration starts at byte o�set 24 in vir-tio header structure. When MSI-X capability is not enabled, device speci�ccon�guration starts at byte o�set 20 in virtio header.

Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of Con�gu-ration/Queue Vector registers, maps interrupts triggered by the con�gurationchange/selected queue events respectively to the corresponding MSI-X vector.To disable interrupts for a speci�c event type, unmap it by writing a specialNO_VECTOR value:

/∗ Vector va lue used to d i s ab l e MSI f o r queue ∗/#de f i n e VIRTIO_MSI_NO_VECTOR 0 x f f f f

Reading these registers returns vector mapped to a given event, or NO_VECTORif unmapped. All queue and con�guration change events are unmapped by de-fault.

Note that mapping an event to vector might require allocating internal de-vice resources, and might fail. Devices report such failures by returning theNO_VECTOR value when the relevant Vector �eld is read. After mappingan event to vector, the driver must verify success by reading the Vector �eld

6

Page 8: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

value: on success, the previously written value is returned, and on failure,NO_VECTOR is returned. If a mapping failure is detected, the driver canretry mapping with fewervectors, or disable MSI-X.

2.3 Virtqueue Con�guration

As a device can have zero or more virtqueues for bulk data transport (for ex-ample, the network driver has two), the driver needs to con�gure them as partof the device-speci�c con�guration.

This is done as follows, for each virtqueue a device has:

1. Write the virtqueue index (�rst queue is 0) to the Queue Select �eld.

2. Read the virtqueue size from the Queue Size �eld, which is always a powerof 2. This controls how big the virtqueue is (see below). If this �eld is 0,the virtqueue does not exist.

3. Allocate and zero virtqueue in contiguous physical memory, on a 4096byte alignment. Write the physical address, divided by 4096 to the QueueAddress �eld.4

4. Optionally, if MSI-X capability is present and enabled on the device, selecta vector to use to request interrupts triggered by virtqueue events. Writethe MSI-X Table entry number corresponding to this vector in QueueVector �eld. Read the Queue Vector �eld: on success, previously writtenvalue is returned; on failure, NO_VECTOR value is returned.

The Queue Size �eld controls the total number of bytes required for the virtqueueaccording to the following formula:

#de f i n e ALIGN(x ) ( ( ( x ) + 4095) & ~4095)s t a t i c i n l i n e unsigned vr ing_s i ze ( unsigned i n t qsz ){

re turn ALIGN( s i z e o f ( s t r u c t vring_desc )∗ qsz + s i z e o f ( u16 )∗ (2 + qsz ) )+ ALIGN( s i z e o f ( s t r u c t vring_used_elem )∗ qsz ) ;

}

This currently wastes some space with padding, but also allows future exten-sions. The virtqueue layout structure looks like this (qsz is the Queue Size �eld,which is a variable, so this code won't compile):

s t r u c t vr ing {/∗ The ac tua l d e s c r i p t o r s (16 bytes each ) ∗/

4The 4096 is based on the x86 page size, but it's also large enough to ensure that theseparate parts of the virtqueue are on separate cache lines.

7

Page 9: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

s t r u c t vring_desc desc [ qsz ] ;

/∗ A r ing o f a v a i l a b l e d e s c r i p t o r heads with f r e e−running index . ∗/s t r u c t vr ing_ava i l a v a i l ;

// Padding to the next 4096 boundary .char pad [ ] ;

// A r ing o f used d e s c r i p t o r heads with f r e e−running index .s t r u c t vring_used used ;

} ;

2.3.1 A Note on Virtqueue Endianness

Note that the endian of these �elds and everything else in the virtqueue is thenative endian of the guest, not little-endian as PCI normally is. This makes forsimpler guest code, and it is assumed that the host already has to be deeplyaware of the guest endian so such an �endian-aware� device is not a signi�cantissue.

2.3.2 Descriptor Table

The descriptor table refers to the bu�ers the guest is using for the device. Theaddresses are physical addresses, and the bu�ers can be chained via the next�eld. Each descriptor describes a bu�er which is read-only or write-only, but achain of descriptors can contain both read-only and write-only bu�ers.

No descriptor chain may be more than 2^32 bytes long in total.

s t r u c t vring_desc {/∗ Address ( guest−phy s i c a l ) . ∗/u64 addr ;/∗ Length . ∗/u32 l en ;

/∗ This marks a bu f f e r as cont inu ing v ia the next f i e l d . ∗/#de f i n e VRING_DESC_F_NEXT 1/∗ This marks a bu f f e r as write−only ( o therwi se read−only ) . ∗/#de f i n e VRING_DESC_F_WRITE 2/∗ This means the bu f f e r conta in s a l i s t o f bu f f e r d e s c r i p t o r s . ∗/#de f i n e VRING_DESC_F_INDIRECT 4

/∗ The f l a g s as i nd i c a t ed above . ∗/u16 f l a g s ;/∗ Next f i e l d i f f l a g s & NEXT ∗/u16 next ;

} ;

8

Page 10: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

The number of descriptors in the table is speci�ed by the Queue Size �eld forthis virtqueue.

2.3.3 Indirect Descriptors

Some devices bene�t by concurrently dispatching a large number of large re-quests. The VIRTIO_RING_F_INDIRECT_DESC feature can be used toallow this (see 3). To increase ring capacity it is possible to store a table of in-direct descriptors anywhere in memory, and insert a descriptor in main virtqueue(with �ags&INDIRECT on) that refers to memory bu�er containing this indi-rect descriptor table; �elds addr and len refer to the indirect table address andlength in bytes, respectively. The indirect table layout structure looks like this(len is the length of the descriptor that refers to this table, which is a variable,so this code won't compile):

s t r u c t i nd i r e c t_de s c r i p t o r_tab l e {/∗ The ac tua l d e s c r i p t o r s (16 bytes each ) ∗/s t r u c t vring_desc desc [ l en / 1 6 ] ;

} ;

The �rst indirect descriptor is located at start of the indirect descriptor table (in-dex 0), additional indirect descriptors are chained by next �eld. An indirect de-scriptor without next �eld (with �ags&NEXT o�) signals the end of the indirectdescriptor table, and transfers control back to the main virtqueue. An indirectdescriptor can not refer to another indirect descriptor table (�ags&INDIRECTmust be o�). A single indirect descriptor table can include both read-only andwrite-only descriptors; write-only �ag (�ags&WRITE) in the descriptor thatrefers to it is ignored.

2.3.4 Available Ring

The available ring refers to what descriptors we are o�ering the device: it refersto the head of a descriptor chain. The ��ags� �eld is currently 0 or 1: 1 indicatingthat we do not need an interrupt when the device consumes a descriptor fromthe available ring. Alternatively, the guest can ask the device to delay interruptsuntil an entry with an index speci�ed by the �used_event� �eld is written in theused ring (equivalently, until the idx �eld in the used ring will reach the valueused_event + 1 ). The method employed by the device is controlled by the VIR-TIO_RING_F_EVENT_IDX feature bit (see 3). This interrupt suppressionis merely an optimization; it may not suppress interrupts entirely.

The �idx� �eld indicates where we would put the next descriptor entry (modulothe ring size). This starts at 0, and increases.

s t r u c t vr ing_ava i l {#de f i n e VRING_AVAIL_F_NO_INTERRUPT 1

9

Page 11: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

u16 f l a g s ;u16 idx ;u16 r ing [ qsz ] ; /∗ qsz i s the Queue S i z e f i e l d read from dev i ce ∗/u16 used_event ;

} ;

2.3.5 Used Ring

The used ring is where the device returns bu�ers once it is done with them. The�ags �eld can be used by the device to hint that no noti�cation is necessary whenthe guest adds to the available ring. Alternatively, the �avail_event� �eld can beused by the device to hint that no noti�cation is necessary until an entry with anindex speci�ed by the �avail_event� is written in the available ring (equivalently,until the idx �eld in the available ring will reach the value avail_event + 1 ).The method employed by the device is controlled by the guest through theVIRTIO_RING_F_EVENT_IDX feature bit (see 3). 5.

Each entry in the ring is a pair: the head entry of the descriptor chain describingthe bu�er (this matches an entry placed in the available ring by the guestearlier), and the total of bytes written into the bu�er. The latter is extremelyuseful for guests using untrusted bu�ers: if you do not know exactly how muchhas been written by the device, you usually have to zero the bu�er to ensure nodata leakage occurs.

/∗ u32 i s used here f o r i d s f o r padding reasons . ∗/s t r u c t vring_used_elem {

/∗ Index o f s t a r t o f used d e s c r i p t o r chain . ∗/u32 id ;/∗ Total l ength o f the d e s c r i p t o r chain which was used ( wr i t t en to ) ∗/u32 l en ;

} ;

s t r u c t vring_used {#de f i n e VRING_USED_F_NO_NOTIFY 1

u16 f l a g s ;u16 idx ;s t r u c t vring_used_elem r ing [ qsz ] ;u16 avai l_event ;

} ;

5These �elds are kept here because this is the only part of the virtqueue written by thedevice

10

Page 12: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

2.3.6 Helpers for Managing Virtqueues

The Linux Kernel Source code contains the de�nitions above and helper rou-tines in a more usable form, in include/linux/virtio_ring.h. This was explicitlylicensed by IBM and Red Hat under the (3-clause) BSD license so that it canbe freely used by all other projects, and is reproduced (with slight variation toremove Linux assumptions) in Appendix A.

2.4 Device Operation

There are two parts to device operation: supplying new bu�ers to the device,and processing used bu�ers from the device. As an example, the virtio networkdevice has two virtqueues: the transmit virtqueue and the receive virtqueue.The driver adds outgoing (read-only) packets to the transmit virtqueue, andthen frees them after they are used. Similarly, incoming (write-only) bu�ers areadded to the receive virtqueue, and processed after they are used.

2.4.1 Supplying Bu�ers to The Device

Actual transfer of bu�ers from the guest OS to the device operates as follows:

1. Place the bu�er(s) into free descriptor(s).

(a) If there are no free descriptors, the guest may choose to notify thedevice even if noti�cations are suppressed (to reduce latency).6

2. Place the id of the bu�er in the next ring entry of the available ring.

3. The steps (1) and (2) may be performed repeatedly if batching is possible.

4. A memory barrier should be executed to ensure the device sees the updateddescriptor table and available ring before the next step.

5. The available �idx� �eld should be increased by the number of entriesadded to the available ring.

6. A memory barrier should be executed to ensure that we update the idx�eld before checking for noti�cation suppression.

7. If noti�cations are not suppressed, the device should be noti�ed of thenew bu�ers.

6The Linux drivers do this only for read-only bu�ers: for write-only bu�ers, it is assumedthat the driver is merely trying to keep the receive bu�er ring full, and no noti�cation of thisexpected condition is necessary.

11

Page 13: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Note that the above code does not take precautions against the available ringbu�er wrapping around: this is not possible since the ring bu�er is the samesize as the descriptor table, so step (1) will prevent such a condition.

In addition, the maximum queue size is 32768 (it must be a power of 2 which�ts in 16 bits), so the 16-bit �idx� value can always distinguish between a fulland empty bu�er.

Here is a description of each stage in more detail.

2.4.1.1 Placing Bu�ers Into The Descriptor Table

A bu�er consists of zero or more read-only physically-contiguous elements fol-lowed by zero or more physically-contiguous write-only elements (it must haveat least one element). This algorithm maps it into the descriptor table:

1. for each bu�er element, b:

(a) Get the next free descriptor table entry, d

(b) Set d.addr to the physical address of the start of b

(c) Set d.len to the length of b.

(d) If b is write-only, set d.flags to VRING_DESC_F_WRITE, oth-erwise 0.

(e) If there is a bu�er element after this:

i. Set d.next to the index of the next free descriptor element.

ii. Set the VRING_DESC_F_NEXT bit in d.flags.

In practice, the d.next �elds are usually used to chain free descriptors, and aseparate count kept to check there are enough free descriptors before beginningthe mappings.

2.4.1.2 Updating The Available Ring

The head of the bu�er we mapped is the �rst d in the algorithm above. A naiveimplementation would do the following:

ava i l−>r ing [ ava i l−>idx % qsz ] = head ;

However, in general we can add many descriptors before we update the �idx��eld (at which point they become visible to the device), so we keep a counterof how many we've added:

ava i l−>r ing [ ( ava i l−>idx + added++) % qsz ] = head ;

12

Page 14: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

2.4.1.3 Updating The Index Field

Once the idx �eld of the virtqueue is updated, the device will be able to accessthe descriptor entries we've created and the memory they refer to. This is whya memory barrier is generally used before the idx update, to ensure it sees themost up-to-date copy.

The idx �eld always increments, and we let it wrap naturally at 65536:

ava i l−>idx += added ;

2.4.1.4 Notifying The Device

Device noti�cation occurs by writing the 16-bit virtqueue index of this virtqueueto the Queue Notify �eld of the virtio header in the �rst I/O region of thePCI device. This can be expensive, however, so the device can suppress suchnoti�cations if it doesn't need them. We have to be careful to expose the newidx value before checking the suppression �ag: it's OK to notify gratuitously,but not to omit a required noti�cation. So again, we use a memory barrier herebefore reading the �ags or the avail_event �eld.

If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if theVRING_USED_F_NOTIFY �ag is not set, we go ahead and write to the PCIcon�guration space.

If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read theavail_event �eld in the available ring structure. If the available index crossed_theavail_event �eld value since the last noti�cation, we go ahead and write to thePCI con�guration space. The avail_event �eld wraps naturally at 65536 aswell:

( u16 ) ( new_idx − avai l_event − 1) < ( u16 ) ( new_idx − old_idx )

2.4.2 Receiving Used Bu�ers From The Device

Once the device has used a bu�er (read from or written to it, or parts of both,depending on the nature of the virtqueue and the device), it sends an interrupt,following an algorithm very similar to the algorithm used for the driver to sendthe device a bu�er:

1. Write the head descriptor number to the next �eld in the used ring.

2. Update the used ring idx.

3. Determine whether an interrupt is necessary:

13

Page 15: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

(a) If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated:check if f the VRING_AVAIL_F_NO_INTERRUPT �ag is not setin avail->�ags

(b) If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: checkwhether the used index crossed the used_event �eld value since thelast update. The used_event �eld wraps naturally at 65536 as well:

( u16 ) ( new_idx − used_event − 1) < ( u16 ) ( new_idx − old_idx )

4. If an interrupt is necessary:

(a) If MSI-X capability is disabled:

i. Set the lower bit of the ISR Status �eld for the device.

ii. Send the appropriate PCI interrupt for the device.

(b) If MSI-X capability is enabled:

i. Request the appropriate MSI-X interrupt message for the device,Queue Vector �eld sets the MSI-X Table entry number.

ii. If Queue Vector �eld value is NO_VECTOR, no interrupt mes-sage is requested for this event.

The guest interrupt handler should:

1. If MSI-X capability is disabled: read the ISR Status �eld, which will resetit to zero. If the lower bit is zero, the interrupt was not for this device.Otherwise, the guest driver should look through the used rings of eachvirtqueue for the device, to see if any progress has been made by thedevice which requires servicing.

2. If MSI-X capability is enabled: look through the used rings of each virtqueuemapped to the speci�c MSI-X vector for the device, to see if any progresshas been made by the device which requires servicing.

For each ring, guest should then disable interrupts by writing VRING_AVAIL_F_NO_INTERRUPT�ag in avail structure, if required. It can then process used ring entries �nally en-abling interrupts by clearing the VRING_AVAIL_F_NO_INTERRUPT �agor updating the EVENT_IDX �eld in the available structure, Guest should thenexecute a memory barrier, and then recheck the ring empty condition. This isnecessary to handle the case where, after the last check and before enablinginterrupts, an interrupt has been suppressed by the device:

v r ing_di sab l e_inte r rupt s ( vq ) ;f o r ( ; ; ) {

i f ( vq−>last_seen_used != vring−>used . idx ) {vr ing_enable_interrupts ( vq ) ;mb( ) ;

14

Page 16: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

i f ( vq−>last_seen_used != vring−>used . idx )break ;

}s t r u c t vring_used_elem ∗e = vr ing . used−>r ing [ vq−>last_seen_used%vsz ] ;p roce s s_buf f e r ( e ) ;vq−>last_seen_used++;

}

2.4.3 Dealing With Con�guration Changes

Some virtio PCI devices can change the device con�guration state, as re�ectedin the virtio header in the PCI con�guration space. In this case:

1. If MSI-X capability is disabled: an interrupt is delivered and the sec-ond highest bit is set in the ISR Status �eld to indicate that the drivershould re-examine the con�guration space.Note that a single interrupt canindicate both that one or more virtqueue has been used and that the con-�guration space has changed: even if the con�g bit is set, virtqueues mustbe scanned.

2. If MSI-X capability is enabled: an interrupt message is requested. TheCon�guration Vector �eld sets the MSI-X Table entry number to use. IfCon�guration Vector �eld value is NO_VECTOR, no interrupt messageis requested for this event.

15

Page 17: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Chapter 3

Creating New Device Types

Various considerations are necessary when creating a new device type:

How Many Virtqueues?

It is possible that a very simple device will operate entirely through its con�g-uration space, but most will need at least one virtqueue in which it will placerequests. A device with both input and output (eg. console and network de-vices described here) need two queues: one which the driver �lls with bu�ers toreceive input, and one which the driver places bu�ers to transmit output.

What Con�guration Space Layout?

Con�guration space is generally used for rarely-changing or initialization-timeparameters. But it is a limited resource, so it might be better to use a virtqueueto update con�guration information (the network device does this for �ltering,otherwise the table in the con�g space could potentially be very large).

Note that this space is generally the guest's native endian, rather than PCI'slittle-endian.

What Device Number?

Currently device numbers are assigned quite freely: a simple request mail tothe author of this document or the Linux virtualization mailing list1 will besu�cient to secure a unique one.

1https://lists.linux-foundation.org/mailman/listinfo/virtualization

16

Page 18: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Meanwhile for experimental drivers, use 65535 and work backwards.

How many MSI-X vectors?

Using the optional MSI-X capability devices can speed up interrupt processingby removing the need to read ISR Status register by guest driver (which might bean expensive operation), reducing interrupt sharing between devices and queueswithin the device, and handling interrupts from multiple CPUs. However, somesystems impose a limit (which might be as low as 256) on the total number ofMSI-X vectors that can be allocated to all devices. Devices and/or device driversshould take this into account, limiting the number of vectors used unless thedevice is expected to cause a high volume of interrupts. Devices can control thenumber of vectors used by limiting the MSI-X Table Size or not presenting MSI-X capability in PCI con�guration space. Drivers can control this by mappingevents to as small number of vectors as possible, or disabling MSI-X capabilityaltogether.

Message Framing

The descriptors used for a bu�er should not e�ect the semantics of the message,except for the total length of the bu�er. For example, a network bu�er consistsof a 10 byte header followed by the network packet. Whether this is presentedin the ring descriptor chain as (say) a 10 byte bu�er and a 1514 byte bu�er, ora single 1524 byte bu�er, or even three bu�ers, should have no e�ect.

In particular, no implementation should use the descriptor boundaries to deter-mine the size of any header in a request.2

Device Improvements

Any change to con�guration space, or new virtqueues, or behavioural changes,should be indicated by negotiation of a new feature bit. This establishes clarity3

and avoids future expansion problems.

Clusters of functionality which are always implemented together can use a sin-gle bit, but if one feature makes sense without the others they should not begratuitously grouped together to conserve feature bits. We can always extendthe spec when the �rst person needs more than 24 feature bits for their device.

2The current qemu device implementations mistakenly insist that the �rst descriptor coverthe header in these cases exactly, so a cautious driver should arrange it so.

3Even if it does mean documenting design or implementation mistakes!

17

Page 19: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Nomenclature

PCI Peripheral Component Interconnect; a common device bus. Seehttp://en.wikipedia.org/wiki/Peripheral Component Interconnect

virtualized Environments where access to hardware is restricted (and often em-ulated) by a hypervisor.

18

Page 20: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix A: virtio_ring.h

#i f n d e f VIRTIO_RING_H#de f i n e VIRTIO_RING_H/∗ An i n t e r f a c e f o r e f f i c i e n t v i r t i o implementation .∗∗ This header i s BSD l i c e n s e d so anyone can use the d e f i n i t i o n s∗ to implement compatible d r i v e r s / s e r v e r s .∗∗ Copyright 2007 , 2009 , IBM Corporat ion∗ Copyright 2011 , Red Hat , Inc∗ Al l r i g h t s r e s e rved .∗∗ Red i s t r i bu t i on and use in source and binary forms , with or without∗ modi f i ca t i on , are permitted provided that the f o l l ow i ng cond i t i on s∗ are met :∗ 1 . Red i s t r i bu t i on s o f source code must r e t a i n the above copyr ight∗ not i ce , t h i s l i s t o f c ond i t i on s and the f o l l ow i ng d i s c l a ime r .∗ 2 . Red i s t r i bu t i on s in binary form must reproduce the above copyr ight∗ not i ce , t h i s l i s t o f c ond i t i on s and the f o l l ow i ng d i s c l a ime r in the∗ documentation and/or other mat e r i a l s provided with the d i s t r i b u t i o n .∗ 3 . Ne i ther the name o f IBM nor the names o f i t s c on t r i bu t o r s∗ may be used to endorse or promote products der ived from th i s so f tware∗ without s p e c i f i c p r i o r wr i t t en permis s ion .∗ THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS ` `AS IS ' ' AND∗ ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE∗ IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE∗ ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE LIABLE∗ FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL∗ DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS∗ OR SERVICES ; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)∗ HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY , WHETHER IN CONTRACT, STRICT∗ LIABILITY , OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY∗ OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF∗ SUCH DAMAGE.∗/

19

Page 21: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

/∗ This marks a bu f f e r as cont inu ing v ia the next f i e l d . ∗/#de f i n e VRING_DESC_F_NEXT 1/∗ This marks a bu f f e r as write−only ( o therwi se read−only ) . ∗/#de f i n e VRING_DESC_F_WRITE 2

/∗ The Host uses t h i s in used−>f l a g s to adv i s e the Guest : don ' t k i ck me∗ when you add a bu f f e r . I t ' s un r e l i ab l e , so i t ' s s imply an∗ opt imiza t i on . Guest w i l l s t i l l k i ck i f i t ' s out o f b u f f e r s . ∗/

#de f i n e VRING_USED_F_NO_NOTIFY 1/∗ The Guest uses t h i s in ava i l−>f l a g s to adv i s e the Host : don ' t∗ i n t e r r up t me when you consume a bu f f e r . I t ' s un r e l i ab l e , so i t ' s∗ s imply an opt imiza t i on . ∗/

#de f i n e VRING_AVAIL_F_NO_INTERRUPT 1

/∗ Vi r t i o r ing d e s c r i p t o r s : 16 bytes .∗ These can chain toge the r v ia "next " . ∗/s t r u c t vring_desc {

/∗ Address ( guest−phy s i c a l ) . ∗/uint64_t addr ;/∗ Length . ∗/uint32_t l en ;/∗ The f l a g s as i nd i c a t ed above . ∗/uint16_t f l a g s ;/∗ We chain unused d e s c r i p t o r s v ia th i s , too ∗/uint16_t next ;

} ;

s t r u c t vr ing_ava i l {uint16_t f l a g s ;uint16_t idx ;uint16_t r ing [ ] ;uint16_t used_event ;

} ;

/∗ u32 i s used here f o r i d s f o r padding reasons . ∗/s t r u c t vring_used_elem {

/∗ Index o f s t a r t o f used d e s c r i p t o r chain . ∗/uint32_t id ;/∗ Total l ength o f the d e s c r i p t o r chain which was wr i t t en to . ∗/uint32_t l en ;

} ;

s t r u c t vring_used {uint16_t f l a g s ;uint16_t idx ;

20

Page 22: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

s t r u c t vring_used_elem r ing [ ] ;uint16_t avai l_event ;

} ;

s t r u c t vr ing {unsigned i n t num;

s t r u c t vring_desc ∗desc ;s t r u c t vr ing_ava i l ∗ av a i l ;s t r u c t vring_used ∗used ;

} ;

/∗ The standard layout f o r the r ing i s a cont inuous chunk o f memory which∗ l o ok s l i k e t h i s . We assume num i s a power o f 2 .∗∗ s t r u c t vr ing {∗ // The ac tua l d e s c r i p t o r s (16 bytes each )∗ s t r u c t vring_desc desc [num ] ;∗∗ // A r ing o f a v a i l a b l e d e s c r i p t o r heads with f r e e−running index .∗ __u16 ava i l_ f l a g s ;∗ __u16 avai l_idx ;∗ __u16 av a i l a b l e [num ] ;∗∗ // Padding to the next a l i g n boundary .∗ char pad [ ] ;∗∗ // A r ing o f used d e s c r i p t o r heads with f r e e−running index .∗ __u16 used_f lags ;∗ __u16 EVENT_IDX;∗ s t r u c t vring_used_elem used [num ] ;∗ } ;∗ Note : f o r v i r t i o PCI , a l i g n i s 4096 .∗/s t a t i c i n l i n e void vr ing_in i t ( s t r u c t vr ing ∗vr , unsigned i n t num, void ∗p ,

unsigned long a l i g n ){

vr−>num = num;vr−>desc = p ;vr−>ava i l = p + num∗ s i z e o f ( s t r u c t vring_desc ) ;vr−>used = ( void ∗ ) ( ( ( unsigned long)&vr−>ava i l−>r ing [num]

+ a l ign −1)& ~( a l i g n − 1 ) ) ;

}

s t a t i c i n l i n e unsigned vr ing_s i ze ( unsigned i n t num, unsigned long a l i g n )

21

Page 23: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

{return ( ( s i z e o f ( s t r u c t vring_desc )∗num + s i z e o f ( uint16_t )∗(2+num)

+ a l i g n − 1) & ~( a l i g n − 1) )+ s i z e o f ( uint16_t )∗3 + s i z e o f ( s t r u c t vring_used_elem )∗num;

}

s t a t i c i n l i n e i n t vring_need_event ( uint16_t event_idx , uint16_t new_idx , uint16_t old_idx ){

return ( uint16_t ) ( new_idx − event_idx − 1) < ( uint16_t ) ( new_idx − old_idx ) ;}#end i f /∗ VIRTIO_RING_H ∗/

22

Page 24: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix B: Reserved

Feature Bits

Currently there are �ve device-independent feature bits de�ned:

VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature indi-cates that the driver wants an interrupt if the device runs out of availabledescriptors on a virtqueue, even though interrupts are suppressed usingthe VRING_AVAIL_F_NO_INTERRUPT �ag or the used_event �eld.An example of this is the networking driver: it doesn't need to know ev-ery time a packet is transmitted, but it does need to free the transmittedpackets a �nite time after they are transmitted. It can avoid using a timerif the device interrupts it when all the packets are transmitted.

VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature in-dicates that the driver can use descriptors with the VRING_DESC_F_INDIRECT�ag set, as described in 2.3.3.

VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_eventand the avail_event �elds. If set, it indicates that the device should ignorethe �ags �eld in the available ring structure. Instead, the used_event �eldin this structure is used by guest to suppress device interrupts. Further,the driver should ignore the �ags �eld in the used ring structure. Instead,the avail_event �eld in this structure is used by the device to suppressnoti�cations. If unset, the driver should ignore the used_event �eld; thedevice should ignore the avail_event �eld; the �ags �eld is used

23

Page 25: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix C: Network Device

The virtio network device is a virtual ethernet card, and is the most complexof the devices supported so far by virtio. It has enhanced rapidly and demon-strates clearly how support for new features should be added to an existingdevice. Empty bu�ers are placed in one virtqueue for receiving packets, andoutgoing packets are enqueued into another for transmission in that order. Athird command queue is used to control advanced �ltering features.

Con�guration

Subsystem Device ID 1

Virtqueues 0:receiveq. 1:transmitq. 2:controlq4

Feature bits

VIRTIO_NET_F_CSUM (0) Device handles packets with partialchecksum

VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets withpartial checksum

VIRTIO_NET_F_MAC (5) Device has given MAC address.

VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets withany GSO type.5

VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.

VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.

VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO withECN.

VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.

VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.

4Only if VIRTIO_NET_F_CTRL_VQ set5It was supposed to indicate segmentation o�oad support, but upon further investigation

it became clear that multiple bits were required.

24

Page 26: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.

VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO withECN.

VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.

VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receivebu�ers.

VIRTIO_NET_F_STATUS (16) Con�guration status �eld is avail-able.

VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.

VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode sup-port.

VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN �l-tering.

VIRTIO_NET_F_GUEST_ANNOUNCE(21) Guest can send gra-tuitous packets.

Device con�guration layout Two con�guration �elds are currently de�ned.The mac address �eld always exists (though is only valid if VIRTIO_NET_F_MACis set), and the status �eld only exists if VIRTIO_NET_F_STATUS isset. Two bits are currently de�ned for the status �eld: VIRTIO_NET_S_LINK_UPand VIRTIO_NET_S_ANNOUNCE.

#de f i n e VIRTIO_NET_S_LINK_UP 1#de f i n e VIRTIO_NET_S_ANNOUNCE 2

s t r u c t v i r t i o_net_con f i g {u8 mac [ 6 ] ;u16 s t a tu s ;

} ;

Device Initialization

1. The initialization routine should identify the receive and transmissionvirtqueues.

2. If the VIRTIO_NET_F_MAC feature bit is set, the con�guration space�mac� entry indicates the �physical� address of the the network card, oth-erwise a private MAC address should be assigned. All guests are expectedto negotiate this feature if it is set.

3. If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identifythe control virtqueue.

25

Page 27: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

4. If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link sta-tus can be read from the bottom bit of the �status� con�g �eld. Otherwise,the link should be assumed active.

5. The receive virtqueue should be �lled with receive bu�ers. This is de-scribed in detail below in �Setting Up Receive Bu�ers�.

6. A driver can indicate that it will generate checksumless packets by nego-tating the VIRTIO_NET_F_CSUM feature. This �checksum o�oad� isa common feature on modern network cards.

7. If that feature is negotiated6, a driver can use TCP or UDP segmentationo�oad by negotiating the VIRTIO_NET_F_HOST_TSO4 (IPv4 TCP),VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and VIRTIO_NET_F_HOST_UFO(UDP fragmentation) features. It should not send TCP packets requiringsegmentation o�oad which have the Explicit Congestion Noti�cation bitset, unless the VIRTIO_NET_F_HOST_ECN feature is negotiated.7

8. The converse features are also available: a driver can save the virtual de-vice some work by negotiating these features.8 The VIRTIO_NET_F_GUEST_CSUMfeature indicates that partially checksummed packets can be received,and if it can do that then the VIRTIO_NET_F_GUEST_TSO4, VIR-TIO_NET_F_GUEST_TSO6, VIRTIO_NET_F_GUEST_UFO andVIRTIO_NET_F_GUEST_ECN are the input equivalents of the fea-tures described above. See �Receiving Packets� below.

Device Operation

Packets are transmitted by placing them in the transmitq, and bu�ers for in-coming packets are placed in the receiveq. In each case, the packet itself ispreceeded by a header:

s t r u c t virt io_net_hdr {#de f i n e VIRTIO_NET_HDR_F_NEEDS_CSUM 1

u8 f l a g s ;#de f i n e VIRTIO_NET_HDR_GSO_NONE 0#de f i n e VIRTIO_NET_HDR_GSO_TCPV4 1#de f i n e VIRTIO_NET_HDR_GSO_UDP 3#de f i n e VIRTIO_NET_HDR_GSO_TCPV6 4

6ie. VIRTIO_NET_F_HOST_TSO* and VIRTIO_NET_F_HOST_UFO are depen-dent on VIRTIO_NET_F_CSUM; a dvice which o�ers the o�oad features must o�er thechecksum feature, and a driver which accepts the o�oad features must accept the checksumfeature. Similar logic applies to the VIRTIO_NET_F_GUEST_TSO4 features dependingon VIRTIO_NET_F_GUEST_CSUM.

7This is a common restriction in real, older network cards.8For example, a network packet transported between two guests on the same system may

not require checksumming at all, nor segmentation, if both guests are amenable.

26

Page 28: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

#de f i n e VIRTIO_NET_HDR_GSO_ECN 0x80u8 gso_type ;u16 hdr_len ;u16 gso_size ;u16 csum_start ;u16 csum_offset ;

/∗ Only i f VIRTIO_NET_F_MRG_RXBUF: ∗/u16 num_buffers

} ;

The controlq is used to control device features such as �ltering.

Packet Transmission

Transmitting a single packet is simple, but varies depending on the di�erentfeatures the driver negotiated.

1. If the driver negotiated VIRTIO_NET_F_CSUM, and the packet hasnot been fully checksummed, then the virtio_net_hdr's �elds are set asfollows. Otherwise, the packet must be fully checksummed, and �ags iszero.

• �ags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,

• csum_start is set to the o�set within the packet to begin checksum-ming, and

• csum_o�set indicates how many bytes after the csum_start the new(16 bit ones' complement) checksum should be placed.9

2. If the driver negotiated VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO,and the packet requires TCP segmentation or UDP fragmentation, thenthe �gso_type� �eld is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6or UDP. (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). Inthis case, packets larger than 1514 bytes can be transmitted: the metadataindicates how to replicate the packet header to cut it into smaller packets.The other gso �elds are set:

• hdr_len is a hint to the device as to how much of the header needsto be kept to copy into each packet, usually set to the length of theheaders, including the transport header.10

9For example, consider a partially checksummed TCP (IPv4) packet. It will have a 14 byteethernet header and 20 byte IP header followed by the TCP header (with the TCP checksum�eld 16 bytes into that header). csum_start will be 14+20 = 34 (the TCP checksum includesthe header), and csum_o�set will be 16. The value in the TCP checksum �eld should beinitialized to the sum of the TCP pseudo header, so that replacing it by the ones' complementchecksum of the TCP header and body will give the correct result.

10Due to various bugs in implementations, this �eld is not useful as a guarantee of thetransport header size.

27

Page 29: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

• gso_size is the maximum size of each packet beyond that header (ie.MSS).

• If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature,the VIRTIO_NET_HDR_GSO_ECN bit may be set in �gso_type�as well, indicating that the TCP packet has the ECN bit set.11

3. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,the num_bu�ers �eld is set to zero.

4. The header and packet are added as one output bu�er to the transmitq,and the device is noti�ed of the new entry (see 2.4.1.4).12

Packet Transmission Interrupt

Often a driver will suppress transmission interrupts using the VRING_AVAIL_F_NO_INTERRUPT�ag (see 2.4.2) and check for used packets in the transmit path of following pack-ets. However, it will still receive interrupts if the VIRTIO_F_NOTIFY_ON_EMPTYfeature is negotiated, indicating that the transmission queue is completely emp-tied.

The normal behavior in this interrupt handler is to retrieve and new descriptorsfrom the used ring and free the corresponding headers and packets.

Setting Up Receive Bu�ers

It is generally a good idea to keep the receive virtqueue as fully populated aspossible: if it runs out, network performance will su�er.

If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6or VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need toaccept packets of up to 65550 bytes long (the maximum size of a TCP or UDPpacket, plus the 14 byte ethernet header), otherwise 1514 bytes. So unlessVIRTIO_NET_F_MRG_RXBUF is negotiated, every bu�er in the receivequeue needs to be at least this length 13.

If VIRTIO_NET_F_MRG_RXBUF is negotiated, each bu�er must be at leastthe size of the struct virtio_net_hdr.

11This case is not handled by some older hardware, so is called out speci�cally in theprotocol.

12Note that the header will be two bytes longer for the VIRTIO_NET_F_MRG_RXBUFcase.

13Obviously each one can be split across multiple descriptor elements.

28

Page 30: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Packet Receive Interrupt

When a packet is copied into a bu�er in the receiveq, the optimal path is todisable further interrupts for the receiveq (see 2.4.2) and process packets untilno more are found, then re-enable them.

Processing packet involves:

1. If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,then the �num_bu�ers� �eld indicates how many descriptors this packetis spread over (including this one). This allows receipt of large packetswithout having to allocate large bu�ers. In this case, there will be atleast �num_bu�ers� in the used ring, and they should be chained togetherto form a single packet. The other bu�ers will not begin with a struct

virtio_net_hdr.

2. If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, orthe �num_bu�ers� �eld is one, then the entire packet will be containedwithin this bu�er, immediately following the struct virtio_net_hdr.

3. If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, theVIRTIO_NET_HDR_F_NEEDS_CSUM bit in the ��ags� �eld may beset: if so, the checksum on the packet is incomplete and the �csum_start�and �csum_o�set� �elds indicate how to calculate it (see 1).

4. If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were ne-gotiated, then the �gso_type� may be something other than VIRTIO_NET_HDR_GSO_NONE,and the �gso_size� �eld indicates the desired MSS (see 2).

Control Virtqueue

The driver uses the control virtqueue (if VIRTIO_NET_F_VTRL_VQ is ne-gotiated) to send commands to manipulate various features of the device whichwould not easily map into the con�guration space.

All commands are of the following form:

s t r u c t v i r t i o_ne t_ct r l {u8 c l a s s ;u8 command ;u8 command−s p e c i f i c −data [ ] ;u8 ack ;

} ;

/∗ ack va lue s ∗/#de f i n e VIRTIO_NET_OK 0#de f i n e VIRTIO_NET_ERR 1

29

Page 31: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

The class, command and command-speci�c-data are set by the driver, and thedevice sets the ack byte. There is little it can do except issue a diagnostic if theack byte is not VIRTIO_NET_OK.

Packet Receive Filtering

If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can sendcontrol commands for promiscuous mode, multicast receiving, and �ltering ofMAC addresses.

Note that in general, these commands are best-e�ort: unwanted packets maystill arrive.

Setting Promiscuous Mode

#de f i n e VIRTIO_NET_CTRL_RX 0#de f i n e VIRTIO_NET_CTRL_RX_PROMISC 0#de f i n e VIRTIO_NET_CTRL_RX_ALLMULTI 1

The class VIRTIO_NET_CTRL_RX has two commands: VIRTIO_NET_CTRL_RX_PROMISCturns promiscuous mode on and o�, and VIRTIO_NET_CTRL_RX_ALLMULTIturns all-multicast receive on and o�. The command-speci�c-data is one bytecontaining 0 (o�) or 1 (on).

Setting MAC Address Filtering

s t r u c t virt io_net_ctrl_mac {u32 e n t r i e s ;u8 macs [ e n t r i e s ] [ETH_ALEN] ;

} ;

#de f i n e VIRTIO_NET_CTRL_MAC 1#de f i n e VIRTIO_NET_CTRL_MAC_TABLE_SET 0

The device can �lter incoming packets by any number of destination MAC ad-dresses.14 This table is set using the class VIRTIO_NET_CTRL_MAC andthe command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-speci�c-data is two variable length tables of 6-byte MAC addresses. The �rsttable contains unicast addresses, and the second contains multicast addresses.

VLAN Filtering

If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it cancontrol a VLAN �lter table in the device.

14Since there are no guarentees, it can use a hash �lter orsilently switch to allmulti orpromiscuous mode if it is given too many addresses.

30

Page 32: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

#de f i n e VIRTIO_NET_CTRL_VLAN 2#de f i n e VIRTIO_NET_CTRL_VLAN_ADD 0#de f i n e VIRTIO_NET_CTRL_VLAN_DEL 1

Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DELcommand take a 16-bit VLAN id as the command-speci�c-data.

Gratuitous Packet Sending

If the driver negotiates the VIRTIO_NET_F_GUEST_ANNOUNCE, it canask the guest to send gratuitous packets; this is usually done after the guesthas been physically migrated, and needs to announce its presence on the newnetwork links. (As hypervisor does not have the knowledge of guest networkcon�guration (eg. tagged vlan) it is simplest to prod the guest in this way).

The Guest needs to check VIRTIO_NET_S_ANNOUNCE bit in status �eldwhen it notices the changes of device con�guration.

Processing this noti�cation involves:

1. Clearing VIRTIO_NET_S_ANNOUNCE bit in the status �eld.

2. Sending the gratuitous packets.

31

Page 33: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix D: Block Device

The virtio block device is a simple virtual block device (ie. disk). Read andwrite requests (and other exotic requests) are placed in the queue, and serviced(probably out of order) by the device except where noted.

Con�guration

Subsystem Device ID 2

Virtqueues 0:requestq.

Feature bits

VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.

VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single seg-ment is in �size_max�.

VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segmentsin a request is in �seg_max�.

VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry speci�edin �geometry�.

VIRTIO_BLK_F_RO (5) Device is read-only.

VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in �blk_size�.

VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.

VIRTIO_BLK_F_FLUSH (9) Cache �ush command support.

Device con�guration layout The capacity of the device (expressed in 512-byte sectors) is always present. The availability of the others all dependon various feature bits as indicated above.

s t r u c t v i r t i o_b lk_con f i g {u64 capac i ty ;u32 size_max ;u32 seg_max ;

32

Page 34: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

s t r u c t virt io_blk_geometry {u16 c y l i n d e r s ;u8 heads ;u8 s e c t o r s ;

} geometry ;u32 b lk_s ize ;

} ;

Device Initialization

1. The device size should be read from the �capacity� con�guration �eld. Norequests should be submitted which goes beyond this limit.

2. If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the blk_size�eld can be read to determine the optimal sector size for the driver to use.This does not e�ect the units used in the protocol (always 512 bytes), butawareness of the correct value can e�ect performance.

3. If the VIRTIO_BLK_F_RO feature is set by the device, any write re-quests will fail.

Device Operation

The driver queues requests to the virtqueue, and they are used by the device(not necessarily in order). Each request is of form:

s t r u c t v i r t io_blk_req {

u32 type ;u32 i o p r i o ;u64 s e c t o r ;char data [ ] [ 5 1 2 ] ;u8 s t a tu s ;

} ;

If the device has VIRTIO_BLK_F_SCSI feature, it can also support scsi packetcommand requests, each of these requests is of form:

s t r u c t v i r t io_scs i_pc_req {u32 type ;u32 i o p r i o ;u64 s e c t o r ;

char cmd [ ] ;

33

Page 35: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

char data [ ] [ 5 1 2 ] ;#de f i n e SCSI_SENSE_BUFFERSIZE 96

u8 sense [SCSI_SENSE_BUFFERSIZE ] ;u32 e r r o r s ;u32 data_len ;u32 sense_len ;u32 r e s i d u a l ;

u8 s t a tu s ;} ;

The type of the request is either a read (VIRTIO_BLK_T_IN), a write (VIR-TIO_BLK_T_OUT), a scsi packet command (VIRTIO_BLK_T_SCSI_CMDor VIRTIO_BLK_T_SCSI_CMD_OUT15) or a �ush (VIRTIO_BLK_T_FLUSHor VIRTIO_BLK_T_FLUSH_OUT16). If the device has VIRTIO_BLK_F_BARRIERfeature the high bit (VIRTIO_BLK_T_BARRIER) indicates that this requestacts as a barrier and that all preceeding requests must be complete before thisone, and all following requests must not be started until this is complete. Notethat a barrier does not �ush caches in the underlying backend device in host,and thus does not serve as data consistency guarantee. Driver must use FLUSHrequest to �ush the host cache.

#de f i n e VIRTIO_BLK_T_IN 0#de f i n e VIRTIO_BLK_T_OUT 1#de f i n e VIRTIO_BLK_T_SCSI_CMD 2#de f i n e VIRTIO_BLK_T_SCSI_CMD_OUT 3#de f i n e VIRTIO_BLK_T_FLUSH 4#de f i n e VIRTIO_BLK_T_FLUSH_OUT 5#de f i n e VIRTIO_BLK_T_BARRIER 0x80000000

The ioprio �eld is a hint about the relative priorities of requests to the device:higher numbers indicate more important requests.

The sector number indicates the o�set (multiplied by 512) where the read orwrite is to occur. This �eld is unused and set to 0 for scsi packet commandsand for �ush commands.

The cmd �eld is only present for scsi packet command requests, and indicatesthe command to perform. This �eld must reside in a single, separate read-onlybu�er; command length can be derived from the length of this bu�er.

Note that these �rst three (four for scsi packet commands) �elds are alwaysread-only: the data �eld is either read-only or write-only, depending on therequest. The size of the read or write can be derived from the total size of therequest bu�ers.

15the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device does not distin-guish between them

16the FLUSH and FLUSH_OUT types are equivalent, the device does not distinguish be-tween them

34

Page 36: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

The sense �eld is only present for scsi packet command requests, and indicatesthe bu�er for scsi sense data.

The data_len �eld is only present for scsi packet command requests, this �eldis deprecated, and should be ignored by the driver. Historically, devices copieddata length there.

The sense_len �eld is only present for scsi packet command requests and indi-cates the number of bytes actually written to the sense bu�er.

The residual �eld is only present for scsi packet command requests and indicatesthe residual size, calculated as data length - number of bytes actually transferred.

The �nal status byte is written by the device: either VIRTIO_BLK_S_OK forsuccess, VIRTIO_BLK_S_IOERR for host or guest error or VIRTIO_BLK_S_UNSUPPfor a request unsupported by host:

#de f i n e VIRTIO_BLK_S_OK 0#de f i n e VIRTIO_BLK_S_IOERR 1#de f i n e VIRTIO_BLK_S_UNSUPP 2

Historically, devices assumed that the �elds type, ioprio and sector reside ina single, separate read-only bu�er; the �elds errors, data_len, sense_len andresidual reside in a single, separate write-only bu�er; the sense �eld in a separatewrite-only bu�er of size 96 bytes, by itself; the �elds errors, data_len, sense_lenand residual in a single write-only bu�er; and the status �eld is a separate read-only bu�er of size 1 byte, by itself.

35

Page 37: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix E: Console Device

The virtio console device is a simple device for data input and output. Adevice may have one or more ports. Each port has a pair of input and outputvirtqueues. Moreover, a device has a pair of control IO virtqueues. The controlvirtqueues are used to communicate information between the device and thedriver about ports being opened and closed on either side of the connection,indication from the host about whether a particular port is a console port,adding new ports, port hot-plug/unplug, etc., and indication from the guestabout whether a port or a device was successfully added, port open/close, etc..For data IO, one or more empty bu�ers are placed in the receive queue forincoming data and outgoing characters are placed in the transmit queue.

Con�guration

Subsystem Device ID 3

Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control receiveq17, 3:con-trol transmitq, 4:receiveq(port1), 5:transmitq(port1), ...

Feature bits

VIRTIO_CONSOLE_F_SIZE (0) Con�guration cols and rows �eldsare valid.

VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support formultiple ports; con�guration �elds nr_ports and max_nr_ports arevalid and control virtqueues will be used.

Device con�guration layout The size of the console is supplied in the con-�guration space if the VIRTIO_CONSOLE_F_SIZE feature is set. Fur-thermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature is set,the maximum number of ports supported by the device can be fetched.

17Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set

36

Page 38: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

s t r u c t v i r t i o_conso l e_con f i g {u16 c o l s ;u16 rows ;

u32 max_nr_ports ;} ;

Device Initialization

1. If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver canread the console dimensions from the con�guration �elds.

2. If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, thedriver can spawn multiple ports, not all of which may be attached to a con-sole. Some could be generic ports. In this case, the control virtqueues areenabled and according to the max_nr_ports con�guration-space value,the appropriate number of virtqueues are created. A control message in-dicating the driver is ready is sent to the host. The host can then sendcontrol messages for adding new ports to the device. After creating andinitializing each port, a VIRTIO_CONSOLE_PORT_READY controlmessage is sent to the host for that port so the host can let us know ofany additional con�guration options set for that port.

3. The receiveq for each port is populated with one or more receive bu�ers.

Device Operation

1. For output, a bu�er containing the characters is placed in the port's trans-mitq.18

2. When a bu�er is used in the receiveq (signalled by an interrupt), thecontents is the input to the port associated with the virtqueue for whichthe noti�cation was received.

3. If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a con-�guration change interrupt may occur. The updated size can be read fromthe con�guration �elds.

18Because this is high importance and low bandwidth, the current Linux implementationpolls for the bu�er to be used, rather than waiting for an interrupt, simplifying the implemen-tation signi�cantly. However, for generic serial ports with the O_NONBLOCK �ag set, thepolling limitation is relaxed and the consumed bu�ers are freed upon the next write or pollcall or when a port is closed or hot-unplugged.

37

Page 39: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

4. If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT fea-ture, active ports are announced by the host using the VIRTIO_CONSOLE_PORT_ADDcontrol message. The same message is used for port hot-plug as well.

5. If the host speci�ed a port `name', a sysfs attribute is created with thename �lled in, so that udev rules can be written that can create a symlinkfrom the port's name to the char device for port discovery by applicationsin the guest.

6. Changes to ports' state are e�ected by control messages. Appropriateaction is taken on the port indicated in the control message. The layoutof the structure of the control bu�er and the events associated are:

s t r u c t v i r t i o_conso l e_cont ro l {uint32_t id ; /∗ Port number ∗/uint16_t event ; /∗ The kind o f c on t r o l event ∗/uint16_t value ; /∗ Extra in fo rmat ion f o r the event ∗/

} ;

/∗ Some events f o r the i n t e r n a l messages ( c on t r o l packets ) ∗/

#de f i n e VIRTIO_CONSOLE_DEVICE_READY 0#de f i n e VIRTIO_CONSOLE_PORT_ADD 1#de f i n e VIRTIO_CONSOLE_PORT_REMOVE 2#de f i n e VIRTIO_CONSOLE_PORT_READY 3#de f i n e VIRTIO_CONSOLE_CONSOLE_PORT 4#de f i n e VIRTIO_CONSOLE_RESIZE 5#de f i n e VIRTIO_CONSOLE_PORT_OPEN 6#de f i n e VIRTIO_CONSOLE_PORT_NAME 7

38

Page 40: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix F: Entropy Device

The virtio entropy device supplies high-quality randomness for guest use.

Con�guration

Subsystem Device ID 4

Virtqueues 0:requestq.

Feature bits None currently de�ned

Device con�guration layout None currently de�ned.

Device Initialization

1. The virtqueue is initialized

Device Operation

When the driver requires random bytes, it places the descriptor of one or morebu�ers in the queue. It will be completely �lled by random data by the device.

39

Page 41: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix G: Memory

Balloon Device

The virtio memory balloon device is a primitive device for managing guestmemory: the device asks for a certain amount of memory, and the guest suppliesit (or withdraws it, if the device has more than it asks for). This allows theguest to adapt to changes in allowance of underlying physical memory. If thefeature is negotiated, the device can also be used to communicate guest memorystatistics to the host.

Con�guration

Subsystem Device ID 5

Virtqueues 0:in�ateq. 1:de�ateq. 2:statsq.19

Feature bits

VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must betold before pages from the balloon are used.

VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for report-ing guest memory statistics is present.

Device con�guration layout Both �elds of this con�guration are always avail-able. Note that they are little endian, despite convention that device �eldsare guest endian:

s t r u c t v i r t i o_ba l l oon_con f i g {u32 num_pages ;u32 ac tua l ;

} ;

19Only if VIRTIO_BALLON_F_STATS_VQ set

40

Page 42: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Device Initialization

1. The in�ate and de�ate virtqueues are identi�ed.

2. If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:

(a) Identify the stats virtqueue.

(b) Add one empty bu�er to the stats virtqueue and notify the host.

Device operation begins immediately.

Device Operation

Memory Ballooning The device is driven by the receipt of a con�gurationchange interrupt.

1. The �num_pages� con�guration �eld is examined. If this is greater thanthe �actual� number of pages, memory must be given to the balloon. Ifit is less than the �actual� number of pages, memory may be taken backfrom the balloon for general use.

2. To supply memory to the balloon (aka. in�ate):

(a) The driver constructs an array of addresses of unused memory pages.These addresses are divided by 409620 and the descriptor describingthe resulting 32-bit array is added to the in�ateq.

3. To remove memory from the balloon (aka. de�ate):

(a) The driver constructs an array of addresses of memory pages it haspreviously given to the balloon, as described above. This descriptoris added to the de�ateq.

(b) If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set,the guest may not use these requested pages until that descriptor inthe de�ateq has been used by the device.

(c) Otherwise, the guest may begin to re-use pages previously given tothe balloon before the device has acknowledged their withdrawl. 21

4. In either case, once the device has completed the in�ation or de�ation,the �actual� �eld of the con�guration should be updated to re�ect thenew number of pages in the balloon.22

20This is historical, and independent of the guest page size21In this case, de�ation advice is merely a courtesy22As updates to con�guration space are not atomic, this �eld isn't particularly reliable, but

can be used to diagnose buggy guests.

41

Page 43: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Memory Statistics

The stats virtqueue is atypical because communication is driven by the device(not the driver). The channel becomes active at driver initialization time whenthe driver adds an empty bu�er and noti�es the device. A request for memorystatistics proceeds as follows:

1. The device pushes the bu�er onto the used ring and sends an interrupt.

2. The driver pops the used bu�er and discards it.

3. The driver collects memory statistics and writes them into a new bu�er.

4. The driver adds the bu�er to the virtqueue and noti�es the device.

5. The device pops the bu�er (retaining it to initiate a subsequent request)and consumes the statistics.

Memory Statistics Format Each statistic consists of a 16 bit tag and a 64bit value. Both quantities are represented in the native endian of theguest. All statistics are optional and the driver may choose which onesto supply. To guarantee backwards compatibility, unsupported statisticsshould be omitted.

s t r u c t v i r t i o_ba l l oon_sta t {#de f i n e VIRTIO_BALLOON_S_SWAP_IN 0#de f i n e VIRTIO_BALLOON_S_SWAP_OUT 1#de f i n e VIRTIO_BALLOON_S_MAJFLT 2#de f i n e VIRTIO_BALLOON_S_MINFLT 3#de f i n e VIRTIO_BALLOON_S_MEMFREE 4#de f i n e VIRTIO_BALLOON_S_MEMTOT 5

u16 tag ;u64 va l ;

} __attribute__ ( ( packed ) ) ;

Tags

VIRTIO_BALLOON_S_SWAP_IN The amount of memory that hasbeen swapped in (in bytes).

VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that hasbeen swapped out to disk (in bytes).

VIRTIO_BALLOON_S_MAJFLT The number of major page faults thathave occurred.

VIRTIO_BALLOON_S_MINFLT The number of minor page faults thathave occurred.

42

Page 44: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

VIRTIO_BALLOON_S_MEMFREE The amount of memory not beingused for any purpose (in bytes).

VIRTIO_BALLOON_S_MEMTOT The total amount of memory avail-able (in bytes).

43

Page 45: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix H: SCSI Host

Device

The virtio SCSI host device groups together one or more virtual logical units(such as disks), and allows communicating to them using the SCSI protocol. Aninstance of the device represents a SCSI host to which many targets and LUNsare attached.

The virtio SCSI device services two kinds of requests:

• command requests for a logical unit;

• task management functions related to a logical unit, target or command.

The device is also able to send out noti�cations about added and removed logicalunits. Together, these capabilities provide a SCSI transport protocol that usesvirtqueues as the transfer medium. In the transport protocol, the virtio driveracts as the initiator, while the virtio SCSI host provides one or more targetsthat receive and process the requests.

Con�guration

Subsystem Device ID 7

Virtqueues 0:controlq; 1:eventq; 2..n:request queues.

Feature bits

VIRTIO_SCSI_F_INOUT (0) A single request can include bothread-only and write-only data bu�ers.

VIRTIO_SCSI_F_HOTPLUG (1) The host should enable hot-plug/hot-unplug of new LUNs and targets on the SCSI bus.

Device con�guration layout All �elds of this con�guration are always avail-able. sense_size and cdb_size are writable by the guest.

44

Page 46: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

s t r u c t v i r t i o_s c s i_con f i g {u32 num_queues ;u32 seg_max ;u32 max_sectors ;u32 cmd_per_lun ;u32 event_info_s ize ;u32 sense_s i z e ;u32 cdb_size ;u16 max_channel ;u16 max_target ;u32 max_lun ;

} ;

num_queues is the total number of request virtqueues exposed by thedevice. The driver is free to use only one request queue, or it can usemore to achieve better performance.

seg_max is the maximum number of segments that can be in a com-mand. A bidirectional command can include seg_max input seg-ments and seg_max output segments.

max_sectors is a hint to the guest about the maximum transfer size itshould use.

cmd_per_lun is a hint to the guest about the maximum number oflinked commands it should send to one LUN. The actual value to beused is the minimum of cmd_per_lun and the virtqueue size.

event_info_size is the maximum size that the device will �ll for bu�ersthat the driver places in the eventq. The driver should always putbu�ers at least of this size. It is written by the device depending onthe set of negotated features.

sense_size is the maximum size of the sense data that the device willwrite. The default value is written by the device and will always be96, but the driver can modify it. It is restored to the default whenthe device is reset.

cdb_size is the maximum size of the CDB that the driver will write.The default value is written by the device and will always be 32, butthe driver can likewise modify it. It is restored to the default whenthe device is reset.

max_channel, max_target and max_lun can be used by the driveras hints to constrain scanning the logical units on the host.h

Device Initialization

The initialization routine should �rst of all discover the device's virtqueues.

45

Page 47: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

If the driver uses the eventq, it should then place at least a bu�er in the eventq.

The driver can immediately issue requests (for example, INQUIRY or REPORTLUNS) or task management functions (for example, I_T RESET).

Device Operation: request queues

The driver queues requests to an arbitrary request queue, and they are used bythe device on that same queue. It is the responsibility of the driver to ensurestrict request ordering for commands placed on di�erent queues, because theywill be consumed with no order constraints.

Requests have the following format:

s t r u c t virtio_scsi_req_cmd {// Read−onlyu8 lun [ 8 ] ;u64 id ;u8 task_attr ;u8 pr i o ;u8 crn ;char cdb [ cdb_size ] ;char dataout [ ] ;// Write−only partu32 sense_len ;u32 r e s i d u a l ;u16 s t a t u s_qua l i f i e r ;u8 s t a tu s ;u8 response ;u8 sense [ s ense_s i ze ] ;char data in [ ] ;

} ;

/∗ command−s p e c i f i c r e sponse va lue s ∗/#de f i n e VIRTIO_SCSI_S_OK 0#de f i n e VIRTIO_SCSI_S_OVERRUN 1#de f i n e VIRTIO_SCSI_S_ABORTED 2#de f i n e VIRTIO_SCSI_S_BAD_TARGET 3#de f i n e VIRTIO_SCSI_S_RESET 4#de f i n e VIRTIO_SCSI_S_BUSY 5#de f i n e VIRTIO_SCSI_S_TRANSPORT_FAILURE 6#de f i n e VIRTIO_SCSI_S_TARGET_FAILURE 7#de f i n e VIRTIO_SCSI_S_NEXUS_FAILURE 8#de f i n e VIRTIO_SCSI_S_FAILURE 9

/∗ task_attr ∗/

46

Page 48: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

#de f i n e VIRTIO_SCSI_S_SIMPLE 0#de f i n e VIRTIO_SCSI_S_ORDERED 1#de f i n e VIRTIO_SCSI_S_HEAD 2#de f i n e VIRTIO_SCSI_S_ACA 3

The lun �eld addresses a target and logical unit in the virtio-scsi device's SCSIdomain. The only supported format for the LUN �eld is: �rst byte set to 1,second byte set to target, third and fourth byte representing a single level LUNstructure, followed by four zero bytes. With this representation, a virtio-scsidevice can serve up to 256 targets and 16384 LUNs per target.

The id �eld is the command identi�er (�tag�).

task_attr, prio and crn should be left to zero. task_attr de�nes the task at-tribute as in the table above, but all task attributes may be mapped to SIMPLEby the device; crn may also be provided by clients, but is generally expected tobe 0. The maximum CRN value de�ned by the protocol is 255, since CRN isstored in an 8-bit integer.

All of these �elds are de�ned in SAM. They are always read-only, as are thecdb and dataout �eld. The cdb_size is taken from the con�guration space.

sense and subsequent �elds are always write-only. The sense_len �eld indi-cates the number of bytes actually written to the sense bu�er. The residual �eldindicates the residual size, calculated as �data_length - number_of_transferred_bytes�,for read or write operations. For bidirectional commands, the number_of_transferred_bytesincludes both read and written bytes. A residual �eld that is less than the size ofdatain means that the dataout �eld was processed entirely. A residual �eld thatexceeds the size of datain means that the dataout �eld was processed partiallyand the datain �eld was not processed at all.

The status byte is written by the device to be the status code as de�ned inSAM.

The response byte is written by the device to be one of the following:

VIRTIO_SCSI_S_OK when the request was completed and the status byteis �lled with a SCSI status code (not necessarily "GOOD").

VIRTIO_SCSI_S_OVERRUN if the content of the CDB requires trans-ferring more data than is available in the data bu�ers.

VIRTIO_SCSI_S_ABORTED if the request was cancelled due to an ABORTTASK or ABORT TASK SET task management function.

VIRTIO_SCSI_S_BAD_TARGET if the request was never processedbecause the target indicated by the lun �eld does not exist.

VIRTIO_SCSI_S_RESET if the request was cancelled due to a bus ordevice reset (including a task management function).

47

Page 49: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

VIRTIO_SCSI_S_TRANSPORT_FAILURE if the request failed dueto a problem in the connection between the host and the target (severedlink).

VIRTIO_SCSI_S_TARGET_FAILURE if the target is su�ering a fail-ure and the guest should not retry on other paths.

VIRTIO_SCSI_S_NEXUS_FAILURE if the nexus is su�ering a failurebut retrying on other paths might yield a di�erent result.

VIRTIO_SCSI_S_BUSY if the request failed but retrying on the samepath should work.

VIRTIO_SCSI_S_FAILURE for other host or guest error. In particular,if neither dataout nor datain is empty, and the VIRTIO_SCSI_F_INOUTfeature has not been negotiated, the request will be immediately returnedwith a response equal to VIRTIO_SCSI_S_FAILURE.

Device Operation: controlq

The controlq is used for other SCSI transport operations. Requests have thefollowing format:

s t r u c t v i r t i o_ s c s i_ c t r l {u32 type ;. . .u8 response ;

} ;

/∗ re sponse va lue s va l i d f o r a l l commands ∗/#de f i n e VIRTIO_SCSI_S_OK 0#de f i n e VIRTIO_SCSI_S_BAD_TARGET 3#de f i n e VIRTIO_SCSI_S_BUSY 5#de f i n e VIRTIO_SCSI_S_TRANSPORT_FAILURE 6#de f i n e VIRTIO_SCSI_S_TARGET_FAILURE 7#de f i n e VIRTIO_SCSI_S_NEXUS_FAILURE 8#de f i n e VIRTIO_SCSI_S_FAILURE 9#de f i n e VIRTIO_SCSI_S_INCORRECT_LUN 12

The type identi�es the remaining �elds.

The following commands are de�ned:

Task management function

48

Page 50: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

#de f i n e VIRTIO_SCSI_T_TMF 0

#de f i n e VIRTIO_SCSI_T_TMF_ABORT_TASK 0#de f i n e VIRTIO_SCSI_T_TMF_ABORT_TASK_SET 1#de f i n e VIRTIO_SCSI_T_TMF_CLEAR_ACA 2#de f i n e VIRTIO_SCSI_T_TMF_CLEAR_TASK_SET 3#de f i n e VIRTIO_SCSI_T_TMF_I_T_NEXUS_RESET 4#de f i n e VIRTIO_SCSI_T_TMF_LOGICAL_UNIT_RESET 5#de f i n e VIRTIO_SCSI_T_TMF_QUERY_TASK 6#de f i n e VIRTIO_SCSI_T_TMF_QUERY_TASK_SET 7

s t r u c t v i r t i o_sc s i_ct r l_tmf{

// Read−only partu32 type ;u32 subtype ;u8 lun [ 8 ] ;u64 id ;// Write−only partu8 response ;

}

/∗ command−s p e c i f i c r e sponse va lue s ∗/#de f i n e VIRTIO_SCSI_S_FUNCTION_COMPLETE 0#de f i n e VIRTIO_SCSI_S_FUNCTION_SUCCEEDED 10#de f i n e VIRTIO_SCSI_S_FUNCTION_REJECTED 11

The type is VIRTIO_SCSI_T_TMF; the subtype �eld de�nes. All �eldsexcept response are �lled by the driver. The subtype �eld must alwaysbe speci�ed and identi�es the requested task management function.

Other �elds may be irrelevant for the requested TMF; if so, they areignored but they should still be present. The lun �eld is in the same formatspeci�ed for request queues; the single level LUN is ignored when the taskmanagement function addresses a whole I_T nexus. When relevant, thevalue of the id �eld is matched against the id values passed on the requestq.

The outcome of the task management function is written by the devicein the response �eld. The command-speci�c response values map 1-to-1with those de�ned in SAM.

Asynchronous noti�cation query

#de f i n e VIRTIO_SCSI_T_AN_QUERY 1

s t r u c t v i r t i o_sc s i_ct r l_an {// Read−only part

49

Page 51: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

u32 type ;u8 lun [ 8 ] ;u32 event_requested ;// Write−only partu32 event_actual ;u8 response ;

}

#de f i n e VIRTIO_SCSI_EVT_ASYNC_OPERATIONAL_CHANGE 2#de f i n e VIRTIO_SCSI_EVT_ASYNC_POWER_MGMT 4#de f i n e VIRTIO_SCSI_EVT_ASYNC_EXTERNAL_REQUEST 8#de f i n e VIRTIO_SCSI_EVT_ASYNC_MEDIA_CHANGE 16#de f i n e VIRTIO_SCSI_EVT_ASYNC_MULTI_HOST 32#de f i n e VIRTIO_SCSI_EVT_ASYNC_DEVICE_BUSY 64

By sending this command, the driver asks the device which events thegiven LUN can report, as described in paragraphs 6.6 and A.6 of the SCSIMMC speci�cation. The driver writes the events it is interested in intothe event_requested; the device responds by writing the events that itsupports into event_actual.

The type is VIRTIO_SCSI_T_AN_QUERY. The lun and event_requested�elds are written by the driver. The event_actual and response �eldsare written by the device.

No command-speci�c values are de�ned for the response byte.

Asynchronous noti�cation subscription

#de f i n e VIRTIO_SCSI_T_AN_SUBSCRIBE 2

s t r u c t v i r t i o_sc s i_ct r l_an {// Read−only partu32 type ;u8 lun [ 8 ] ;u32 event_requested ;// Write−only partu32 event_actual ;u8 response ;

}

By sending this command, the driver asks the speci�ed LUN to reportevents for its physical interface, again as described in the SCSI MMCspeci�cation. The driver writes the events it is interested in into theevent_requested; the device responds by writing the events that it sup-ports into event_actual.

Event types are the same as for the asynchronous noti�cation query mes-sage.

50

Page 52: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

The type is VIRTIO_SCSI_T_AN_SUBSCRIBE. The lun and event_requested�elds are written by the driver. The event_actual and response �eldsare written by the device.

No command-speci�c values are de�ned for the response byte.

Device Operation: eventq

The eventq is used by the device to report information on logical units that areattached to it. The driver should always leave a few bu�ers ready in the eventq.In general, the device will not queue events to cope with an empty eventq, andwill end up dropping events if it �nds no bu�er ready. However, when reportingevents for many LUNs (e.g. when a whole target disappears), the device canthrottle events to avoid dropping them. For this reason, placing 10-15 bu�erson the event queue should be enough.

Bu�ers are placed in the eventq and �lled by the device when interesting eventsoccur. The bu�ers should be strictly write-only (device-�lled) and the sizeof the bu�ers should be at least the value given in the device's con�gurationinformation.

Bu�ers returned by the device on the eventq will be referred to as "events" inthe rest of this section. Events have the following format:

#de f i n e VIRTIO_SCSI_T_EVENTS_MISSED 0x80000000

s t r u c t v i r t i o_sc s i_event {// Write−only partu32 event ;. . .

}

If bit 31 is set in the event �eld, the device failed to report an event due tomissing bu�ers. In this case, the driver should poll the logical units for unitattention conditions, and/or do whatever form of bus scan is appropriate forthe guest operating system.

Other data that the device writes to the bu�er depends on the contents of theevent �eld. The following events are de�ned:

No event

#de f i n e VIRTIO_SCSI_T_NO_EVENT 0

This event is �red in the following cases:

51

Page 53: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

• When the device detects in the eventq a bu�er that is shorter thanwhat is indicated in the con�guration �eld, it might use it immedi-ately and put this dummy value in the event �eld. A well-writtendriver will never observe this situation.

• When events are dropped, the device may signal this event as soon asthe drivers makes a bu�er available, in order to request action fromthe driver. In this case, of course, this event will be reported withthe VIRTIO_SCSI_T_EVENTS_MISSED �ag.

Transport reset

#de f i n e VIRTIO_SCSI_T_TRANSPORT_RESET 1

s t r u c t v i r t i o_sc s i_event_re s e t {// Write−only partu32 event ;u8 lun [ 8 ] ;u32 reason ;

}

#de f i n e VIRTIO_SCSI_EVT_RESET_HARD 0#de f i n e VIRTIO_SCSI_EVT_RESET_RESCAN 1#de f i n e VIRTIO_SCSI_EVT_RESET_REMOVED 2

By sending this event, the device signals that a logical unit on a targethas been reset, including the case of a new device appearing or disap-pearing on the bus.The device �lls in all �elds. The event �eld is setto VIRTIO_SCSI_T_TRANSPORT_RESET. The lun �eld addresses alogical unit in the SCSI host.

The reason value is one of the three #de�ne values appearing above:

• VIRTIO_SCSI_EVT_RESET_REMOVED (�LUN/target re-moved�) is used if the target or logical unit is no longer able to receivecommands.

• VIRTIO_SCSI_EVT_RESET_HARD (�LUN hard reset�) isused if the logical unit has been reset, but is still present.

• VIRTIO_SCSI_EVT_RESET_RESCAN (�rescan LUN/tar-get�) is used if a target or logical unit has just appeared on the device.

The �removed� and �rescan� events, when sent for LUN 0, may apply tothe entire target. After receiving them the driver should ask the initiatorto rescan the target, in order to detect the case when an entire target hasappeared or disappeared. These two events will never be reported unlessthe VIRTIO_SCSI_F_HOTPLUG feature was negotiated betweenthe host and the guest.

52

Page 54: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Events will also be reported via sense codes (this obviously does not ap-ply to newly appeared buses or targets, since the application has neverdiscovered them):

• �LUN/target removed� maps to sense key ILLEGAL REQUEST, asc0x25, ascq 0x00 (LOGICAL UNIT NOT SUPPORTED)

• �LUN hard reset� maps to sense key UNIT ATTENTION, asc 0x29(POWER ON, RESET OR BUS DEVICE RESET OCCURRED)

• �rescan LUN/target� maps to sense key UNIT ATTENTION, asc0x3f, ascq 0x0e (REPORTED LUNS DATA HAS CHANGED)

The preferred way to detect transport reset is always to use events, becausesense codes are only seen by the driver when it sends a SCSI commandto the logical unit or target. However, in case events are dropped, theinitiator will still be able to synchronize with the actual state of the con-troller if the driver asks the initiator to rescan of the SCSI bus. Duringthe rescan, the initiator will be able to observe the above sense codes, andit will process them as if it the driver had received the equivalent event.

Asynchronous noti�cation

#de f i n e VIRTIO_SCSI_T_ASYNC_NOTIFY 2

s t r u c t virt io_scs i_event_an {// Write−only partu32 event ;u8 lun [ 8 ] ;u32 reason ;

}

By sending this event, the device signals that an asynchronous event was�red from a physical interface.

All �elds are written by the device. The event �eld is set to VIR-TIO_SCSI_T_ASYNC_NOTIFY. The lun �eld addresses a logical unitin the SCSI host. The reason �eld is a subset of the events that thedriver has subscribed to via the "Asynchronous noti�cation subscription"command.

When dropped events are reported, the driver should poll for asynchronousevents manually using SCSI commands.

53

Page 55: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Appendix X: virtio-mmio

Virtual environments without PCI support (a common situation in embeddeddevices models) might use simple memory mapped device (�virtio-mmio�) in-stead of the PCI device.

The memory mapped virtio device behaviour is based on the PCI device speci�-cation. Therefore most of operations like device initialization, queues con�gura-tion and bu�er transfers are nearly identical. Existing di�erences are describedin the following sections.

Device Initialization

Instead of using the PCI IO space for virtio header, the �virtio-mmio� deviceprovides a set of memory mapped control registers, all 32 bits wide, followed bydevice-speci�c con�guration space. The following list presents their layout:

• O�set from the device base address | Direction | NameDescription

• 0x000 | R | MagicValue�virt� string.

• 0x004 | R | VersionDevice version number. Currently must be 1.

• 0x008 | R | DeviceIDVirtio Subsystem Device ID (ie. 1 for network card).

• 0x00c | R | VendorIDVirtio Subsystem Vendor ID.

• 0x010 | R | HostFeaturesFlags representing features the device supports.Reading from this register returns 32 consecutive �ag bits, �rst bit depend-ing on the last value written to HostFeaturesSel register. Access to thisregister returns bits HostFeaturesSel∗32 to (HostFeaturesSel∗32)+31,

54

Page 56: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

eg. feature bits 0 to 31 if HostFeaturesSel is set to 0 and features bits 32to 63 if HostFeaturesSel is set to 1. Also see 2.2.2.2

• 0x014 | W | HostFeaturesSelDevice (Host) features word selection.Writing to this register selects a set of 32 device feature bits accessible byreading from HostFeatures register. Device driver must write a value tothe HostFeaturesSel register before reading from the HostFeatures register.

• 0x020 | W | GuestFeaturesFlags representing device features understood and activated by the driver.Writing to this register sets 32 consecutive �ag bits, �rst bit depending onthe last value written to GuestFeaturesSel register. Access to this registersets bits GuestFeaturesSel ∗ 32 to (GuestFeaturesSel ∗ 32) + 31, eg.feature bits 0 to 31 if GuestFeaturesSel is set to 0 and features bits 32 to63 if GuestFeaturesSel is set to 1. Also see 2.2.2.2

• 0x024 | W | GuestFeaturesSelActivated (Guest) features word selection.Writing to this register selects a set of 32 activated feature bits accessi-ble by writing to the GuestFeatures register. Device driver must write avalue to the GuestFeaturesSel register before writing to the GuestFeaturesregister.

• 0x028 | W | GuestPageSizeGuest page size.Device driver must write the guest page size in bytes to the register duringinitialization, before any queues are used. This value must be a power of2 and is used by the Host to calculate Guest address of the �rst queuepage (see QueuePFN).

• 0x030 | W | QueueSelVirtual queue index (�rst queue is 0).Writing to this register selects the virtual queue that the following oper-ations on QueueNum, QueueAlign and QueuePFN apply to.

• 0x034 | R | QueueNumMaxMaximum virtual queue size.Reading from the register returns the maximum size of the queue the Hostis ready to process or zero (0x0) if the queue is not available. This appliesto the queue selected by writing to QueueSel and is allowed only whenQueuePFN is set to zero (0x0), so when the queue is not actively used.

• 0x038 | W | QueueNumVirtual queue size.Queue size is a number of elements in the queue, therefore size of thedescriptor table and both available and used rings.Writing to this register noti�es the Host what size of the queue the Guestwill use. This applies to the queue selected by writing to QueueSel.

55

Page 57: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

• 0x03c | W | QueueAlignUsed Ring alignment in the virtual queue.Writing to this register noti�es the Host about alignment boundary of theUsed Ring in bytes. This value must be a power of 2 and applies to thequeue selected by writing to QueueSel.

• 0x040 | RW | QueuePFNGuest physical page number of the virtual queue.Writing to this register noti�es the host about location of the virtual queuein the Guest's physical address space. This value is the index number ofa page starting with the queue Descriptor Table. Value zero (0x0) meansphysical address zero (0x00000000) and is illegal. When the Guest stopsusing the queue it must write zero (0x0) to this register.Reading from this register returns the currently used page number of thequeue, therefore a value other than zero (0x0) means that the queue is inuse.Both read and write accesses apply to the queue selected by writing toQueueSel.

• 0x050 | W | QueueNotifyQueue noti�er.Writing a queue index to this register noti�es the Host that there are newbu�ers to process in the queue.

• 0x60 | R | InterruptStatusInterrupt status.Reading from this register returns a bit mask of interrupts asserted by thedevice. An interrupt is asserted if the corresponding bit is set, ie. equalsone (1).

� Bit 0 | Used Ring UpdateThis interrupt is asserted when the Host has updated the Used Ringin at least one of the active virtual queues.

� Bit 1 | Con�guration changeThis interrupt is asserted when con�guration of the device has changed.

• 0x064 | W | InterruptACKInterrupt acknowledge.Writing to this register noti�es the Host that the Guest �nished handlinginterrupts. Set bits in the value clear the corresponding bits of the Inter-ruptStatus register.

• 0x070 | RW | StatusDevice status.Reading from this register returns the current device status �ags.Writing non-zero values to this register sets the status �ags, indicating theGuest progress. Writing zero (0x0) to this register triggers a device reset.Also see 2.2.1

56

Page 58: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

• 0x100+ | RW | Con�gDevice-speci�c con�guration space starts at an o�set 0x100 and is accessedwith byte alignment. Its meaning and size depends on the device and thedriver.

Virtual queue size is a number of elements in the queue, therefore size of thedescriptor table and both available and used rings.

The endianness of the registers follows the native endianness of the Guest. Writ-ing to registers described as �R� and reading from registers described as �W� isnot permitted and can cause unde�ned behavior.

The device initialization is performed as described in 2.2.1 with one exception:the Guest must notify the Host about its page size, writing the size in bytes toGuestPageSize register before the initialization is �nished.

The memory mapped virtio devices generate single interrupt only, therefore nospecial con�guration is required.

Virtqueue Con�guration

The virtual queue con�guration is performed in a similar way to the one de-scribed in 2.3 with a few additional operations:

1. Select the queue writing its index (�rst queue is 0) to the QueueSel register.

2. Check if the queue is not already in use: read QueuePFN register, returnedvalue should be zero (0x0).

3. Read maximum queue size (number of elements) from the QueueNumMaxregister. If the returned value is zero (0x0) the queue is not available.

4. Allocate and zero the queue pages in contiguous virtual memory, aligningthe Used Ring to an optimal boundary (usually page size). Size of the al-located queue may be smaller than or equal to the maximum size returnedby the Host.

5. Notify the Host about the queue size by writing the size to QueueNumregister.

6. Notify the Host about the used alignment by writing its value in bytes toQueueAlign register.

7. Write the physical number of the �rst page of the queue to the QueuePFNregister.

The queue and the device are ready to begin normal operations now.

57

Page 59: Virtio PCI Card Speci cation v0.9.4 DRAFT - Red Hatpeople.redhat.com/pbonzini/virtio-spec.pdf · Virtio PCI Card Speci cation v0.9.4 DRAFT-Rusty Russell

Device Operation

The memory mapped virtio device behaves in the same way as described in 2.4,with the following exceptions:

1. The device is noti�ed about new bu�ers available in a queue by writingthe queue index to register QueueNum instead of the virtio header in PCII/O space (2.4.1.4).

2. The memory mapped virtio device is using single, dedicated interrupt sig-nal, which is raised when at least one of the interrupts described in the In-terruptStatus register description is asserted. After receiving an interrupt,the driver must read the InterruptStatus register to check what caused theinterrupt (see the register description). After the interrupt is handled, thedriver must acknowledge it by writing a bit mask corresponding to theserviced interrupt to the InterruptACK register.

58