Completing the all flash datastore
If you read my last two posts and followed along, you now have an all flash ZFS on Linux based datastore. The only problem is that it isn’t performing well at all. In this post, we’ll fall back to Linux software RAID and Logical Volume Manager (LVM) without destroying the pool data.
Testing Linux software RAID
The setup we want is the combination of software RAID through mdadm to create mirrors, together with logical volume manager to represent multiple mirrors as a single device.
I start by taking one device out of my pool:
$ sudo zpool detach tank /dev/nvme0n1
And install the software RAID and lvm tooling:
$ sudo apt-get install mdadm lvm2
Next, the disk has to be repartitioned. Detailed instructions can be found here, I’ll just list the commands used:
$ sudo parted /dev/nvme0n1
mklabel gpt
mkpart primary 0% 100%
set 1 raid on
align-check optimal 1
quit
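The interactive session above can also be expressed as a single non-interactive command (a sketch; `-s` puts parted in script mode, so double-check the device name before running it):

```shell
# Same partitioning as the interactive session: GPT label, one partition
# spanning the disk, flagged for RAID, with optimal alignment
sudo parted -s -a optimal /dev/nvme0n1 \
    mklabel gpt \
    mkpart primary 0% 100% \
    set 1 raid on
```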
Once done, we create the first mirror, with one device (we tell mdadm the second device is currently missing):
$ sudo mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/nvme0n1p1 missing
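Before moving on, it’s worth verifying the degraded array came up as expected:

```shell
# The array should show one active device and one removed/missing slot
sudo mdadm --detail /dev/md0
cat /proc/mdstat
```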
We repeat the steps above for the second mirror device (nvme1n1): detach it from the pool, repartition it, and create a second mirror md1. This leaves the pool unmirrored but operational.
Next, save the array configuration:
$ sudo -i
# mdadm --detail --scan >> /etc/mdadm/mdadm.conf
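On Debian/Ubuntu it’s also worth refreshing the initramfs afterwards; without it, the arrays may assemble under fallback names like md127 on the next boot:

```shell
# Rebuild the initramfs so early boot picks up the updated mdadm.conf
sudo update-initramfs -u
```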
Measure software RAID impact
Time for the first performance test against /dev/md0:
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 162k | 664 | 0.81x |
random write | 4k | 142k | 580 | 0.84x |
random read | 64k | 23.6k | 1548 | 0.99x |
random write | 64k | 22.2k | 1453 | 0.83x |
random read | 1M | 1545 | 1621 | 1.00x |
random write | 1M | 1418 | 1488 | 0.84x |
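For reference, numbers along these lines can be produced with fio (a sketch; the exact job parameters I used in the earlier posts may differ, so treat queue depth and job count as assumptions):

```shell
# 4k random read against the mirror; vary --rw and --bs for the other rows
sudo fio --name=randread --filename=/dev/md0 --direct=1 \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --runtime=30 --time_based --group_reporting
```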
Much better. Now to add the logical volume manager: register the two mirrors as physical volumes, add them to the volume group lvg0, and create a striped logical volume lv0.
$ sudo pvcreate /dev/md0
$ sudo pvcreate /dev/md1
$ sudo vgcreate lvg0 /dev/md0
$ sudo vgextend lvg0 /dev/md1
$ sudo lvcreate --name lv0 -L 1500g --stripes 2 -I 16 lvg0
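A quick way to confirm the volume is actually striped across both mirrors:

```shell
# Show stripe count, stripe size, and the backing devices for the new volume
sudo lvs -o +stripes,stripe_size,devices lvg0
```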
The striped logical volume will make sure the writes are spread over the 2 backing mirrors which increases performance, and makes certain the mirrors are filled equally. The stripe size of 16K was determined experimentally: larger stripes improve large blocksize reads/writes, but deteriorate small blocksize access. Even smaller stripe size didn’t lead to significant further performance improvements of small blocksize accesses.
Measure lvm impact
Time for another test against /dev/lvg0/lv0:
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 135k | 553 | 0.87x |
random write | 4k | 124k | 509 | 0.75x |
random read | 64k | 50k | 3279 | 2.1x |
random write | 64k | 48.5k | 3030 | 1.8x |
random read | 1M | 3439 | 3606 | 2.2x |
random write | 1M | 3229 | 3386 | 2.0x |
So much better…now to restore the mirrors.
To restore the mirrors we first have to copy the data from the zvol to the new volume:
$ sudo dd if=/dev/zvol/tank/iSCSIvol of=/dev/lvg0/lv0 bs=64K status=progress
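It’s worth confirming that the new volume is at least as large as the zvol; blockdev reports exact byte sizes:

```shell
# The second number must be >= the first, or the dd copy will truncate data
sudo blockdev --getsize64 /dev/zvol/tank/iSCSIvol
sudo blockdev --getsize64 /dev/lvg0/lv0
```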
Activate the mirrors
We can delete the pool now:
$ sudo zpool destroy tank
The devices nvme2n1 and nvme3n1 can now be partitioned (see above), and added to the mirrors:
$ sudo mdadm --manage /dev/md0 --add /dev/nvme2n1p1
$ sudo mdadm --manage /dev/md1 --add /dev/nvme3n1p1
By checking /proc/mdstat we can see the mirrors are rebuilding:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 nvme3n1p1[2] nvme1n1p1[0]
976628736 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 1.5% (14805248/976628736) finish=77.7min speed=206111K/sec
bitmap: 6/8 pages [24KB], 65536KB chunk
md0 : active raid1 nvme2n1p1[2] nvme0n1p1[0]
976628736 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 1.6% (16126592/976628736) finish=80.0min speed=200041K/sec
bitmap: 8/8 pages [32KB], 65536KB chunk
unused devices: <none>
After this finishes, we take a final measurement against a new logical volume (we don’t want to overwrite the data):
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 137k | 545 | 0.72x |
random write | 4k | 90.2k | 370 | 0.55x |
random read | 64k | 40.4k | 2650 | 1.7x |
random write | 64k | 23.2k | 1521 | 0.90x |
random read | 1M | 2325 | 2439 | 1.5x |
random write | 1M | 1550 | 1626 | 0.96x |
Excellent. It is a little surprising that the numbers are worse than the pre-mirror ones; one possible explanation is that the drives are now partly filled with data. Moving on to sharing the datastore with ESXi.
Setting up an iSCSI target
It was quite easy to find tooling for consuming iSCSI (initiators), but I had a hard time searching for iSCSI server (target) software. I wasted some time on tgt, which turns up first in searches but is quite dated. In the end it turns out the kernel ships a standard iSCSI target implementation called LIO, which works great.
$ sudo apt-get install targetcli-fb
Now the convention is to set this up through the CLI, but you can edit the resulting config file manually afterwards. The process involves a few steps, starting with a backstore - in this case a block device:
$ sudo targetcli
/> backstores/block create lvm_backend dev=/dev/lvg0/lv0
Followed by setting up the iscsi endpoint (target):
/> cd iscsi
/iscsi> create iqn.2020-03.local.lab.nas:fast
Connect the two by creating a lun mapped to the backstore device:
/iscsi> cd iqn.2020-03.local.lab.nas:fast/tpg1
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> luns/ create /backstores/block/lvm_backend
Now for sharing: in production environments you should set up ACLs, but this is a homelab, so I’ll just allow any initiators to connect.
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> set attribute generate_node_acls=1
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> set attribute authentication=0
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> set attribute demo_mode_write_protect=0
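Before exiting, persist the configuration (recent targetcli-fb versions also save automatically on exit, writing to /etc/rtslib-fb-target/saveconfig.json):

```shell
/> saveconfig
```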
The configuration should look like this now:
/> ls
o- / .................................................................................................................. [...]
o- backstores ....................................................................................................... [...]
| o- block ........................................................................................... [Storage Objects: 1]
| | o- lvm_backend .......................................................... [/dev/lvg0/lv0 (1.5TiB) write-thru activated]
| o- fileio .......................................................................................... [Storage Objects: 0]
| o- pscsi ........................................................................................... [Storage Objects: 0]
| o- ramdisk ......................................................................................... [Storage Objects: 0]
o- iscsi ..................................................................................................... [Targets: 1]
| o- iqn.2020-03.local.lab.nas:fast ............................................................................. [TPGs: 1]
| o- tpg1 ........................................................................................... [gen-acls, no-auth]
| o- acls ................................................................................................... [ACLs: 0]
| o- luns ................................................................................................... [LUNs: 1]
| | o- lun0 ....................................................................... [block/lvm_backend (/dev/lvg0/lv0)]
| o- portals ............................................................................................. [Portals: 1]
| o- 0.0.0.0:3260 .............................................................................................. [OK]
o- loopback .................................................................................................. [Targets: 0]
o- vhost ..................................................................................................... [Targets: 0]
You can test whether the share is visible from another machine with iscsiadm:
$ sudo iscsiadm -m discovery -t st -p nas.lab.local
nas.lab.local:3260,1 iqn.2020-03.local.lab.nas:fast
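To go one step further than discovery, you can log in from a Linux initiator and check that the LUN actually attaches (the target name comes from the discovery output above):

```shell
# Log in, check for a new block device, then log out again
sudo iscsiadm -m node -T iqn.2020-03.local.lab.nas:fast -p nas.lab.local --login
lsblk
sudo iscsiadm -m node -T iqn.2020-03.local.lab.nas:fast -p nas.lab.local --logout
```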
All is peachy, now to hook up to ESXi.
Consume in ESXi
I added the iSCSI target and did a storage rescan. The LUN was visible, but the datastore didn’t come online in the GUI. It turned out ESXi detected my new iSCSI share as a datastore snapshot of my previous FreeNAS share. In this case you need to tell ESXi this is the real thing, not a snapshot: keep the existing signature rather than resignaturing the datastore. If you’re lucky, you can do all this through the GUI. I kept getting weird errors about datastore extents still being online:
$ esxcli storage vmfs snapshot list
5df0ecf0-b9a7f923-4a7f-0015178ab812
Volume Name: Fast
VMFS UUID: 5df0ecf0-b9a7f923-4a7f-0015178ab812
Can mount: false
Reason for un-mountability: the original volume has some extents online
Can resignature: true
Reason for non-resignaturability:
Unresolved Extent Count: 1
In the end I rebooted the host. After that it showed:
$ esxcli storage vmfs snapshot list
5df0ecf0-b9a7f923-4a7f-0015178ab812
Volume Name: Fast
VMFS UUID: 5df0ecf0-b9a7f923-4a7f-0015178ab812
Can mount: true
Reason for un-mountability:
Can resignature: false
Reason for non-resignaturability: the volume is being actively used
Unresolved Extent Count: 1
This time I could mount the ‘snapshot’:
$ esxcli storage vmfs snapshot mount -l Fast
VAAI
The iSCSI datastore we made available through LIO has full support for the VMware VAAI primitives Atomic Test & Set (ATS), Zero, Clone and Delete (Unmap). These can greatly speed up storage operations. Unmap is especially important on space constrained flash but is disabled by default on the LIO side. So one small tweak we have to make is to enable the Delete operation. It comes down to setting the emulate_tpu flag on the backstore in targetcli:
$ sudo targetcli
/> backstores/block/lvm_backend set attribute emulate_tpu=1
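On the ESXi side you can check whether the host now reports the primitives as supported:

```shell
# Lists VAAI (ATS, Clone, Zero, Delete) support status per device;
# look for the entry matching the iSCSI LUN
esxcli storage core device vaai status get
```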
This concludes the setup of an all flash datastore. So how is the performance in a VM?
One final measurement
For the final measurement I migrated a VM onto the datastore and ran the benchmarks against its local disk.
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device | vs. lvm volume |
---|---|---|---|---|---|
random read | 4k | 31.1k | 121 | 0.16x | 0.22x |
random write | 4k | 24.2k | 94.5 | 0.14x | 0.26x |
random read | 64k | 14.0k | 876 | 0.56x | 0.33x |
random write | 64k | 11.0k | 687 | 0.41x | 0.45x |
random read | 1M | 723 | 723 | 0.45x | 0.30x |
random write | 1M | 649 | 650 | 0.38x | 0.40x |
Apparently the combination of iSCSI, network stacks, VMFS and EXT4 (in the guest) causes a performance hit of 60-80%. I’m not diving into the details here as I’m quite happy with the overall result.
Now I need 10G networking :)