ZFS on Linux with all flash?
In the previous post I described how I mounted 4 NVMe flash drives in a single host in order to build an all-flash datastore. In this post, I’ll describe how to move from FreeNAS, TrueNAS (or any other ZFS host OS) to ZFS on Linux, and test if the performance is acceptable to continue with ZFS. Spoiler: it’s not.
Moving to ZFS on Linux
Still on the FreeNAS box, I extended my single-SSD pool (yes really, 1TB NVMe SSDs were expensive 3 years ago) to a pool with 2 mirrored vdevs through standard zpool commands (sketched below). Afterwards I shut down the VM and connected the disks (via passthrough/VT-d) to a new Ubuntu VM.
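The extension boils down to two operations (a sketch with placeholder device names, not the ones from my actual box): zpool attach turns the existing single-disk vdev into a mirror, and zpool add appends a second mirror vdev built from the remaining two disks.
$ zpool attach tank nvd0 nvd1
$ zpool add tank mirror nvd2 nvd3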
Installing ZFS on Ubuntu is easy: the kernel module ships with Ubuntu's kernel packages, and there's a standard package with the userland tools:
$ sudo apt install zfsutils-linux
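If you want to confirm the module is actually available before touching real data, a quick sanity check (my habit, not a required step):
$ modinfo zfs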
Instead of creating a new pool, I imported the existing 4-disk pool:
$ sudo zpool import -f tank
The -f (force) option is because I didn’t export the pool properly in FreeNAS (shame on me).
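For the record, the clean way is to export the pool on the FreeNAS side before detaching the disks, which unmounts the datasets and marks the pool as no longer in use:
$ zpool export tank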
$ zpool status
  pool: tank
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
  scan: resilvered 232G in 1h29m with 0 errors on Thu Feb 27 12:39:37 2020
config:

        NAME         STATE     READ WRITE CKSUM
        tank         ONLINE       0     0     0
          mirror-0   ONLINE       0     0     0
            nvme0n1  ONLINE       0     0     0
            nvme3n1  ONLINE       0     0     0
          mirror-1   ONLINE       0     0     0
            nvme1n1  ONLINE       0     0     0
            nvme2n1  ONLINE       0     0     0

errors: No known data errors
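The status output also nags about disabled feature flags. Once you're sure the pool never has to be imported back into FreeNAS/TrueNAS, you can enable them, keeping in mind that this is a one-way operation:
$ sudo zpool upgrade tank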
The pool contains a ZFS volume (zvol) named iSCSIvol, a virtual block device that ESXi connects to over iSCSI:
$ zfs list
NAME            USED  AVAIL  REFER  MOUNTPOINT
tank           1.53T   339G    88K  /tank
tank/.system   50.8M   339G   100K  legacy
tank/iSCSIvol  1.51T  1.50T   350G  -
Performance
We already measured the raw performance of one of the disks in an earlier post:
Mode | Blocksize | IOPS | Bandwidth (MB/s) |
---|---|---|---|
random read | 4k | 184k | 752 |
random write | 4k | 165k | 675 |
random read | 64k | 23.9k | 1566 |
random write | 64k | 25.8k | 1694 |
random read | 1M | 1544 | 1620 |
random write | 1M | 1616 | 1695 |
The next step is to get a measurement of the raw performance of the ZFS volume before things like iSCSI, VMFS, the network or ESXi come into play. To do this, I created a new zvol on the pool and ran the same experiment as before, only instead of targeting the NVMe disk (/dev/nvmex) directly, I now accessed the test zvol (/dev/zvol/tank/test) from within the VM that holds the pool. The results:
$ fio --ioengine=libaio --direct=1 --name=test --filename=/dev/zvol/tank/test --iodepth=32 --size=12G --numjobs=16 --group_reporting --bs=4k --readwrite=randread
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 155k | 634 | 0.84x |
random write | 4k | 30.3k | 124 | 0.18x |
random read | 64k | 16.4k | 1073 | 0.69x |
random write | 64k | 2.8k | 184 | 0.11x |
random read | 1M | 1.22k | 1278 | 0.79x |
random write | 1M | 189 | 199 | 0.12x |
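For completeness: the test zvol is just another dataset on the pool, and the other rows in the table come from rerunning fio with the block size and read/write mode varied. Something along these lines reproduces it; only the name tank/test comes from the command above, while the 100G size and the sparse (-s) flag are assumptions on my part:
$ sudo zfs create -s -V 100G tank/test
$ for rw in randread randwrite; do for bs in 4k 64k 1M; do fio --ioengine=libaio --direct=1 --name=test --filename=/dev/zvol/tank/test --iodepth=32 --size=12G --numjobs=16 --group_reporting --bs=$bs --readwrite=$rw; done; done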
Goodbye ZFS
So I expected some overhead, but this is terrible. Keep in mind the devices are mirrored and there are two mirror vdevs, so read performance should in theory approach 4x that of a single device, and write performance 2x. ZFS needs a lot of RAM to perform well, but for this test the VM had plenty. The results are not CPU constrained either: I increased the number of virtual CPUs (vCPUs) until the VM was no longer constantly at 100% load. Instead, iostat -x showed the NVMe devices themselves never exceeded roughly 25% utilization, while the zvol device sat at 100%. Tuning ZFS doesn't help here either, as performance traces showed most of the time being spent in spinlocks and mutexes. This points to a bottleneck in the ZFS code itself, one that only becomes apparent because the NVMe devices are so fast.
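If you want to see this for yourself, the rough recipe is to keep the fio workload running in one shell and watch the block devices and the hot kernel code paths from another. iostat produced the utilization numbers above; perf top is one straightforward way to get the kind of trace that shows the time disappearing into spinlocks and mutexes:
$ iostat -x 1
$ sudo perf top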
These results are so bad, in fact, that as much as I love ZFS, I can't continue with it as the backing store for my all-flash datastore until it gets some much-needed performance work for all-flash setups. That's why in the next post I'll repeat the same exercise with Linux software RAID and LVM, and see how that fares.