Better Utilization of Storage Features from KVM Guest via...
Transcript of Better Utilization of Storage Features from KVM Guest via...
© Hitachi, Ltd. 2013. All rights reserved.
Hitachi, Ltd.
Information & Telecommunication Systems Company IT Platform Division Group
Masaki Kimura
Better Utilization of Storage Features from KVM Guest via virtio-scsi
© Hitachi, Ltd. 2013. All rights reserved.
1. Background (Use Cases and Requirements)
2. KVM features for guest SCSI commands
3. Current Status of these features
Better Utilization of Storage Features from KVM Guest via virtio-scsi
1
Contents
4. Summary
5. Future work
© Hitachi, Ltd. 2013. All rights reserved.
1. Background
2
Better Utilization of Storage Features from KVM Guest via virtio-scsi
3 © Hitachi, Ltd. 2013. All rights reserved.
1-1. Background
Enterprise systems expect that virtualized environment has the same level of manageability, availability, and reliability achieved in bare-metal.
For example: 1. Thin-provisioned Storage for manageability, 2. HA cluster for availability, 3. Backup server for reliability.
In bare-metal environment, some of these requirements are achieved by using storage features, such as SCSI commands.
In virtualized environment, the same use cases exists for guests. Issuing SCSI commands from guests are required.
Three use cases, thin-provisioned storage, HA cluster, and backup server will be explained in the next slides.
4 © Hitachi, Ltd. 2013. All rights reserved.
1-2. Use case #1: Thin-provisioned Storage
Server
OS
Server
Host OS
QEMU
WriteSame
reclaiming disk blocks
Bare-metal KVM Guest
WriteSame
Guest OS
Guest OS
This use case exists for both bare-metal and KVM guests. WriteSame is required to be issued to storage from guests.
- Many types of enterprise storage have thin-provision function. - For achievement of thin-provision, a disk block is allocated on access. - However, once it is allocated, it can not be reclaimed by storage automatically even when the disk block becomes unused by OS. - To reclaim unused disk block, OS needs to let storage know unused blocks by issuing WriteSame SCSI command.
5 © Hitachi, Ltd. 2013. All rights reserved.
1-3. Use Case #2: HA cluster (1/2)
LU
Server
OS
Server
OS
LU
Server
Host OS
Server
Host OS
QEMU QEMU HA Cluster
HA Cluster
Persistent Reservation Persistent
Reservation
Guest OS
Guest OS
Guest OS
Guest OS
This use case also exists for KVM guests. Persistent Reservation is required to be issued to storage from guest.
- To improve availability, HA cluster is commonly used in bare-metal.
- For HA cluster, Persistent Reservation SCSI command is generally used to guarantee an exclusive access from an active system.
Bare-metal KVM Guest
blocked blocked
Active Standby
Standby Active
6 © Hitachi, Ltd. 2013. All rights reserved.
1-4. Use Case #2: HA cluster (2/2)
- Persistent Reservation is held by so-called I_T nexus, the combination of initiator ID and target ID.
- I/Os from standby system are blocked, because the I_T nexus is different.
- Therefore, I_T nexus is required to be unique for Persistent Reservation to work properly.
LU
Server
OS
Server
OS
Bare-metal
LU
Server
Host OS
Server
Host OS
QEMU QEMU
KVM Guest
HA Cluster
HA Cluster
Persistent Reservation
Persistent Reservation
Guest OS
Guest OS
Guest OS
Guest OS
initiator_1 initiator_2
initiator_3 initiator_4
target_A target_B
Reserved (initiator_1, target_A)
Reserved (initiator_3, target_B)
initiator_* target_* : Initiator ID. : Target ID.
blocked blocked
Active Standby
Standby Active
7 © Hitachi, Ltd. 2013. All rights reserved.
1-5. Use Case #3: Backup Server
- Persistent Reservation is also used by some backup-server products to guarantee an exclusive access from a backup server on backup.
- Persistent Reservation to storage and unique I_T nexus are required by these products.
Server
OS
Bare-metal
Server
Host OS
QEMU
KVM Guest
Persistent Reservation
Persistent Reservation
Guest OS
Guest OS
Guest OS
initiator_1
initiator_4
initiator_* target_* : Initiator ID. : Target ID.
blocked blocked
Backup Server
Server
OS
LU
initiator_2
Server
OS
Business systems
LU
initiator_3
initiator_5
Backup Server
LU LU
Business systems
initiator_6
target_A target_B
8 © Hitachi, Ltd. 2013. All rights reserved.
1-6. Summary of Requirements
Requirements from the use cases: Requirement #1: SCSI commands to storage from guests.
- Thin-provisioned storage requires WriteSame to storage from guests.
- HA cluster and backup server require Persistent Reservation to storage from guests.
Requirement #2: Unique initiator ID across guests. - HA cluster and backup server require I_T nexus to
be unique.
© Hitachi, Ltd. 2013. All rights reserved.
2. KVM features for guest SCSI commands
9
Better Utilization of Storage Features from KVM Guest via virtio-scsi
10 © Hitachi, Ltd. 2013. All rights reserved.
2-1. KVM features for guest disks
This presentation focuses on following three device types and their configurations.
# Configuration
Device Type Initiator Target Backend
1
(1) virtio-blk -
File
2 Device
3 LUN
4
(2) virtio-scsi
- (a)qemu
File
5 Device
6 LUN
7 - (b)lio
block
8 pscsi
9 (c) libiscsi - iSCSI storage
10 (3)PCI device assignment
(a)Legacy - PCI device
11 (b)VFIO - PCI device
11 © Hitachi, Ltd. 2013. All rights reserved.
2-2. (1) virtio-blk (1/2)
- Para-virtualized disk. - Shown as /dev/vdX on Linux guests. - Maximum number of disks is limited by maximum number of PCI devices (32). - Improving performance with virtio-blk plane and bio-based I/O.
Guest kernel
QEMU
Host kernel
Hardware
virtio-blk
Block Layer
SCSI Layer
/dev/vdx
/dev/sdx
Block Layer
SCSI Layer
virtio-blk Device Layer
Device
Device Layer
Characteristic (1)virtio-blk
12 © Hitachi, Ltd. 2013. All rights reserved.
2-3. (1) virtio-blk (2/2)
Configurations and how SCSI commands are handled.
- SCSI command from guest reaches to storage only when attached as LUN.
Guest
QEMU
Host kernel
Hardware
virtio-blk
Block
/dev/vdx Block
SCSI
Device
Device
File Block
LUN VFS
virtio-blk /dev/vdx
virtio-blk /dev/vdx
virtio-blk
Pass-through
SCSI Command Capability
Backend KVM Command Line Libvirt XML SCSI command
Disk (file)
-device virtio-blk-pci,scsi=off <disk type=‘file’ device=‘disk’> <target dev='vda' bus='virtio'/>
Not Supported
Disk (block)
-device virtio-blk-pci,scsi=off <disk type=‘block’ device=‘disk’> <target dev='vda' bus='virtio'/>
Not Supported
LUN -device virtio-blk-pci,scsi=on <disk type=‘block’ device=‘lun’> <target dev='vda' bus='virtio'/>
Pass-through
virtio-blk virtio-blk
13 © Hitachi, Ltd. 2013. All rights reserved.
2-4. (2) virtio-scsi
virtio-scsi has three types of configurations.
(a)Qemu target (b)lio target (c)libiscsi
Guest kernel
QEMU
Host kernel
Hardware
virtio-scsi scsi_mod
Block Layer
SCSI Layer
/dev/sdx
/dev/sdx
Block Layer
SCSI Layer
virtio-scsi Device Layer
Device
Device Layer
virtio-scsi scsi_mod
/dev/sdx
virtio-scsi
virtio-scsi scsi_mod
/dev/sdx
libiscsi
tcm_vhost
/dev/sdx LIO
14 © Hitachi, Ltd. 2013. All rights reserved.
2-5. (2) virtio-scsi: (a) qemu target (1/2)
[virtio-scsi] - Para-virtualized SCSI transport. - Shown as /dev/sdX on Linux guests.
[Qemu target] - User space target
Guest kernel
QEMU
Host kernel
Hardware
Block Layer
SCSI Layer
Block Layer
SCSI Layer
Device Layer
Device
Device Layer
virtio-scsi
scsi_mod
/dev/sdx
/dev/sdx
virtio-scsi
Characteristic (a)Qemu target
15 © Hitachi, Ltd. 2013. All rights reserved.
2-6. (2) virtio-scsi (a) qemu-target (2/2)
Configurations and how SCSI commands are handled.
- SCSI command from guest reaches to storage only when attached as LUN. - Emulated results return to guest when attached as file or device.
Guest
QEMU
Host kernel
Hardware
Block
Block
SCSI
Device
Device
File Device
LUN VFS
/dev/sdx
virtio-scsi
virtio-scsi scsi_mod
/dev/sdx
virtio-scsi
/dev/sdx
virtio-scsi
virtio-scsi scsi_mod
virtio-scsi scsi_mod
Emulated result Pass-
through
SCSI Command Capability
Backend KVM Command Line Libvirt XML SCSI command
Disk (file)
-device scsi-hd <disk type=‘file’ device=‘disk’> <target dev='sda' bus='scsi'/>
Emulated
Disk (block)
-device scsi-hd <disk type=‘block’ device=‘disk’> <target dev='sda' bus='scsi'/>
Emulated
LUN -device scsi-block <disk type=‘block’ device=‘lun’> <target dev='sda' bus='scsi'/>
Pass-through
16 © Hitachi, Ltd. 2013. All rights reserved.
2-7. (2) virtio-scsi: (b) lio target
Guest kernel
QEMU
Host kernel
Hardware
Block Layer
SCSI Layer
Block Layer
SCSI Layer
Device Layer
Device
Device Layer
virtio-scsi scsi_mod
/dev/sdx
tcm_vhost
/dev/sdx
[virtio-scsi] - Para-virtualized SCSI transport. - Shown as /dev/sdX on Linux guests.
[lio target] - Kernel space target. - Using LIO (linux-iscsi.org ) as backend. - LIO supports following back-stores: - block - fileio - pscsi - ramdisk
- Not yet evaluated. (pscsi is pass-through SCSI, therefore it is expected to work well.)
Characteristic
SCSI Command Capability
LIO
(b)lio target
17 © Hitachi, Ltd. 2013. All rights reserved.
2-8. (2) virtio-scsi: (c) libiscsi
Guest kernel
QEMU
Host kernel
Hardware
Block Layer
SCSI Layer
Device Layer
Device
virtio-scsi scsi_mod
/dev/sdx
virtio-scsi libiscsi
[virtio-scsi] - Para-virtualized SCSI transport. - Shown as /dev/sdX on Linux guests.
[libiscsi] - User space iSCSI initiator. - Only support iSCSI. - QEMU directly talk to iSCSI storage, therefore host does not see guest disks.
- SCSI command from guest reaches to storage.
Characteristic
SCSI Command Capability
Directly talk to iSCSI storage.
(c)libiscsi
18 © Hitachi, Ltd. 2013. All rights reserved.
2-9. (3) PCI Device assignment
(a) Legacy
Guest kernel
QEMU
Host kernel
Hardware PCI device
SCSI LLD scsi_mod
PCI device
/dev/sdx Block Layer
SCSI Layer
pci-stub
Device Layer
Device
Device Layer
PCI device
SCSI LLD scsi_mod
PCI device
/dev/sdx
vfio
(b) VFIO
PCI Device assignment has two types:
Characteristic
SCSI Command Capability
- Assign PCI device to guests. - Host PCI device is dedicated to one guest, therefore the number of guests is limited to the number of PCI devices (or their ports.)
- SCSI command from guest reaches to storage in both legacy and VFIO configurations.
19 © Hitachi, Ltd. 2013. All rights reserved.
2-10. Summary of SCSI command capability
# Configuration Whether guest
SCSI commands reach to storage Device Type Initiator Target Backend
1
virtio-blk -
File No
2 Device No
3 LUN Yes
4
virtio-scsi
- qemu
File No
5 Device No
6 LUN Yes
7 - lio
block No
8 pscsi ???
9 libiscsi - iSCSI storage Yes
10 PCI Device assignment
Legacy - PCI device Yes
11 VFIO - PCI device Yes
In the next chapter, we will see SCSI command capabilities deeper only with configurations marked “Yes” in above table.
© Hitachi, Ltd. 2013. All rights reserved.
3. Current Status of these features
20
Better Utilization of Storage Features from KVM Guest via virtio-scsi
21 © Hitachi, Ltd. 2013. All rights reserved.
3-1. Evaluation Item and Configuration
Evaluation Items:
# Device Type Initiator Target Backend
1 virtio-blk - LUN
2 virtio-scsi
- qemu LUN
3 libiscsi - iSCSI storage
4 PCI Device assignment
Legacy - PCI device
5 VFIO - PCI device
From next slide, I will share what problems remain in which configurations.
Configurations:
# Evaluation Item Remarks
(a) Whether SCSI commands reach to storage. Requirement #1
(b) Whether unique initiator ID is assigned. Requirement #2
(c) Whether SCSI commands return proper results. -
22 © Hitachi, Ltd. 2013. All rights reserved.
[virtio-blk] [virtio-scsi]
Guest kernel
QEMU
Host kernel
Hardware
Block Layer
SCSI Layer
/dev/vdx /dev/sdx Block Layer
SCSI Layer
Device Layer
Device
Device Layer
Permission Check
Permission Check
CAP_SYS_RAWIO
read_ok command
write_ok command && has write permission
No
No
Not allowed Allowed No
Yes
- There is a permission check in host kernel, when guest SCSI commands is issued via virtio-blk or virtio-scsi with qemu target.
- Libvirt-managed KVM guests run as qemu user, who lacks CAP_SYS_RAWIO.
Some SCSI commands, such as Persistent Reserve and Write Same, are blocked by this check unless KVM is running as root user.
: Inquiry, Report LUN, etc
: Persistent Reserve, WriteSame, etc
3-2. (a) Whether SCSI commands reach to storage(1/4) Problem
virtio-blk
virtio-scsi scsi_mod
/dev/sdx /dev/sdx
virtio-blk virtio-scsi
23 © Hitachi, Ltd. 2013. All rights reserved.
However, neither of them has been merged yet.
To solve this issue, following patches have been submitted.
3-3. (a) Whether SCSI commands reach to storage(2/4)
Subject : [PATCH 00/13] Corrections and customization of the SG_IO command whitelist (CVE-2012-4542) Date : January 24, 2013 Committer: Paolo Bonzini URL : https://lkml.org/lkml/2013/1/24/279
Subject : [PATCH v3 0/2] add per-device sysfs knob to enable unrestricted, unprivileged SG_IO Date : November 13, 2012 Committer: Paolo Bonzini URL : https://lkml.org/lkml/2012/11/13/440
Subject : [PATCH v3 part2] Add per-device sysfs knob to enable unrestricted, unprivileged SG_IO Date : May 23, 2013 Committer: Paolo Bonzini URL : https://lkml.org/lkml/2013/5/23/294
24 © Hitachi, Ltd. 2013. All rights reserved.
Concept of these patches is to introduce a flag, unpriv_sgio, to allow non-root users to issue SCSI commands. Permission Check
CAP_SYS_RAWIO || unpriv_sgio
read_ok command
write_ok command && has write permission
No
No
Not allowed Allowed No
Yes # cat /sys/block/sda/queue/unpriv_sgio 0 # echo 1 > /sys/block/sda/queue/unpriv_sgio
* Kernel side interface (Not merged yet.)
If this kernel patch is merged, KVM guest running as qemu user will be able to configure to issue any SCSI commands to storages.
(*) Unpriv_sgio flag can be set per disk.
3-4. (a) Whether SCSI commands reach to storage(3/4)
: Inquiry, Report LUN, etc
: Persistent Reserve, WriteSame, etc
25 © Hitachi, Ltd. 2013. All rights reserved.
3-5. (a) Whether SCSI commands reach to storage(4/4) Why these patches have not been merged yet? Still under discussion on how to avoid opcodes-overlap problem.
READ SUBCHANNEL
UNMAP
0x42 0x42
READ SUBCHANNEL
0x42
0x42 => UNMAP
Intended in guest.
Interpreted in host.
When issued from host When issued from guest
Share same opcode.
Different device classes.
Guest
QEMU
Host
Hardware
[Problem] : Destructive commands might pass-throughed to the host from guests. [Implementation]: Split permission check by device class (*1)
or Introduce per-device filter w/o unpriv_sgio (*2)
Permission Check
Permission Check
Permission Check
Permission Check
w/ unpriv_sgio.
(*1) https://lkml.org/lkml/2013/1/24/299 (*2) https://lkml.org/lkml/2013/5/27/230
26 © Hitachi, Ltd. 2013. All rights reserved.
Guest 2 Guest 1
3-6. (b) Whether unique initiator ID is assigned(1/3)
In such a condition, exclusive access is not guaranteed.
When both guest1 and guest2 are on the same host and use the same HBA, they share the same initiator ID. (virtio-blk or vitio-scsi w/ qemu target)
[virtio-blk] [virtio-scsi]
Guest kernel
QEMU
Host kernel
Hardware
Block Layer SCSI Layer
Block Layer SCSI Layer
virtio-blk virtio-scsi Device Layer
Device
Device Layer
Persistent Reservation
virtio-blk /dev/vda
virtio-blk /dev/vda
Guest 2 Guest 1
virtio-scsi scsi_mod
/dev/sda
virtio-scsi scsi_mod
/dev/sda Persistent Reservation
initiator_* target_* : Initiator ID. : Target ID.
target_a target_b
Problem
/dev/sdx /dev/sdx
HBA HBA initiator_1 initiator_2
27 © Hitachi, Ltd. 2013. All rights reserved.
3-7. (b) Whether unique initiator ID is assigned(2/3)
With NPIV (N-Port ID Virtualization), virtio-blk and vitio-scsi w/ qemu target can assign unique initiator ID.
Guest 2 Guest 1
[virtio-blk] [virtio-scsi]
Guest kernel
QEMU
Host kernel
Hardware
Block Layer SCSI Layer
Block Layer SCSI Layer
virtio-blk virtio-scsi Device Layer
Device
Device Layer
Persistent Reservation
virtio-blk /dev/vda
virtio-blk /dev/vda
Guest 2 Guest 1
virtio-scsi scsi_mod
/dev/sda
virtio-scsi scsi_mod
/dev/sda Persistent Reservation
initiator_* target_* : Initiator ID. : Target ID.
target_a target_b
initiator_1_1 initiator_1_2 initiator_2_1 initiator_2_2
/dev/sdy /dev/sdy
NPIV
Solution
/dev/sdx
HBA initiator_1
/dev/sdx
HBA initiator_2 blocked blocked
28 © Hitachi, Ltd. 2013. All rights reserved.
3-8. (b) Whether unique initiator ID is assigned(3/3)
With libiscsi or PCI device assignment, exclusive access is guaranteed, because initiator IDs are unique with these configurations.
FYI
: Initiator ID. : Target ID.
Guest 2 Guest 1
[PCI device assignment]
Guest kernel
QEMU
Host kernel
Hardware
Block Layer SCSI Layer
Block Layer SCSI Layer Device Layer
Device
Device Layer
Guest 2 Guest 1
initiator_* target_*
target_a target_b
virtio-scsi scsi_mod
/dev/sda
virtio-scsi libiscsi
initiator_1
virtio-scsi scsi_mod
/dev/sda
virtio-scsi libiscsi
Persistent Reservation
initiator_2
[virtio-scsi + libiscsi]
SCSI LLD scsi_mod
pci_device
/dev/sda
pci-stub/vfio
SCSI LLD scsi_mod
pci_device
/dev/sda
pci-stub/vfio
HBA initiator_4 HBA initiator_3
Persistent Reservation
blocked
blocked
29 © Hitachi, Ltd. 2013. All rights reserved.
With virtio-blk, Report LUNs returns the list of LUNs including LUNs which are not assigned to the guest.
Guest kernel
QEMU
Host kernel
Hardware
virtio-blk
virtio-scsi scsi_mod
Block Layer SCSI Layer
/dev/vdx /dev/sdx
/dev/sdb /dev/sdb
Block Layer SCSI Layer
LUN1 LUN1
virtio-blk virtio-scsi Device Layer
Device
Device Layer
LUN0 LUN2 LUN0 LUN2
/dev/sda /dev/sdc /dev/sda /dev/sdc
Report LUNs Report LUNs
Just pass-through Returns emulated results, depending on SCSI commands.
HBA HBA
3-9. (c) Whether SCSI commands return proper results
Problem
* The list of LUNs that host sees.
LUN0 LUN1 LUN2
* The list of LUNs that guest sees.
LUN1
Virtio-blk needs emulation functions to return proper results for particular SCSI commands, such as Report LUNs.
[virtio-blk] [virtio-scsi]
© Hitachi, Ltd. 2013. All rights reserved.
4. Summary
30
Better Utilization of Storage Features from KVM Guest via virtio-scsi
31 © Hitachi, Ltd. 2013. All rights reserved.
4. Summary
Enterprise system requires SCSI commands in virtualized environments. KVM has some configurations which can issue SCSI commands to
storage from guests, however each configuration has some restrictions.
# Configuration
SCSI command Unique WWN
Restriction Persistent Reservation
Write Same
Inquiry Report LUNs
1 virtio-blk - No No Yes No Yes Requires NPIV for unique WWN.
2 virtio-scsi
qemu No No Yes Yes Yes Requires NPIV for unique WWN.
3 libiscsi Yes Yes Yes Yes Yes iSCSI only.
4 PCI Device assignment
Legacy Yes Yes Yes Yes Yes Max number of guests is limited by the number of HBA ports.
5 VFIO Yes Yes Yes Yes Yes
: Already available. : Patch exists, but not merged yet. : No patch, but could be fixed.
© Hitachi, Ltd. 2013. All rights reserved.
5. Future Work
32
Better Utilization of Storage Features from KVM Guest via virtio-scsi
33 © Hitachi, Ltd. 2013. All rights reserved.
5. Future work
1. To allow qemu user to issue Persistent Reservation and WriteSame, proper permission check in host kernel is needed.
2. To make Report LUNs return proper results with virtio-blk, an emulation function for Report LUNs is needed.
3. To make SCSI command capability of virtio-scsi w/ lio target clear, evaluation is needed for virtio-scsi w/ lio target.
© Hitachi, Ltd. 2013. All rights reserved.
LinuxCon North America
Better Utilization of Storage Features from KVM Guest via virtio-scsi
END
35
Hitachi Ltd.
Information & Telecommunication Systems Company IT Platform Division Group
Masaki Kimura <[email protected]>
36 © Hitachi, Ltd. 2013. All rights reserved.
Trademarks
Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries.