Completing the all flash datastore
If you read my last two posts and followed along, you now have an all flash ZFS on Linux based datastore. The only problem is that it isn’t performing well at all. In this post, we’ll fall back to Linux software RAID and Logical Volume Manager (LVM) without destroying the pool data.
Testing Linux software RAID
The setup we want is the combination of software RAID through mdadm to create mirrors, together with logical volume manager to represent multiple mirrors as a single device.
I start by taking one device out of my pool:
$ sudo zpool detach tank /dev/nvme0n1
And install the software RAID and lvm tooling:
$ sudo apt-get install mdadm lvm2
Next, the disk has to be repartitioned. Detailed instructions can be found here, I’ll just list the commands used:
$ sudo parted /dev/nvme0n1
mklabel gpt
mkpart primary 0% 100%
set 1 raid on
align-check optimal 1
quit
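The interactive session above can also be expressed as a single non-interactive command (a sketch; `-s` puts parted in script mode, so double-check the device name before running it):

```shell
# Same partitioning as the interactive session: GPT label, one partition
# spanning the disk, flagged for RAID, with optimal alignment
sudo parted -s -a optimal /dev/nvme0n1 \
    mklabel gpt \
    mkpart primary 0% 100% \
    set 1 raid on
```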
Once done, we create the first mirror, with one device (we tell mdadm the second device is currently missing):
$ sudo mdadm --create --verbose /dev/md0 --level=mirror --raid-devices=2 /dev/nvme0n1p1 missing
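Before moving on, it’s worth verifying the degraded array came up as expected:

```shell
# The array should show one active device and one removed/missing slot
sudo mdadm --detail /dev/md0
cat /proc/mdstat
```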
We repeat the steps above for the second mirror device (nvme1n1): detach it from the pool, repartition it, and create a second mirror md1. This leaves the pool unmirrored but operational.
Next, save the array configuration:
$ sudo -i
# mdadm --detail --scan >> /etc/mdadm/mdadm.conf
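On Debian/Ubuntu it’s also worth refreshing the initramfs afterwards; without it, the arrays may assemble under fallback names like md127 on the next boot:

```shell
# Rebuild the initramfs so early boot picks up the updated mdadm.conf
sudo update-initramfs -u
```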
Measure software RAID impact
Time for the first performance test against /dev/md0:
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 162k | 664 | 0.81x |
random write | 4k | 142k | 580 | 0.84x |
random read | 64k | 23.6k | 1548 | 0.99x |
random write | 64k | 22.2k | 1453 | 0.83x |
random read | 1M | 1545 | 1621 | 1.00x |
random write | 1M | 1418 | 1488 | 0.84x |
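For reference, numbers along these lines can be produced with fio (a sketch; the exact job parameters I used in the earlier posts may differ, so treat queue depth and job count as assumptions):

```shell
# 4k random read against the mirror; vary --rw and --bs for the other rows
sudo fio --name=randread --filename=/dev/md0 --direct=1 \
    --rw=randread --bs=4k --ioengine=libaio --iodepth=32 \
    --numjobs=4 --runtime=30 --time_based --group_reporting
```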
Much better. Now to add the logical volume manager: register the two mirrors as physical volumes, add them to the volume group lvg0, and create a striped logical volume lv0.
$ sudo pvcreate /dev/md0
$ sudo pvcreate /dev/md1
$ sudo vgcreate lvg0 /dev/md0
$ sudo vgextend lvg0 /dev/md1
$ sudo lvcreate --name lv0 -L 1500g --stripes 2 -I 16 lvg0
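A quick way to confirm the volume is actually striped across both mirrors:

```shell
# Show stripe count, stripe size, and the backing devices for the new volume
sudo lvs -o +stripes,stripe_size,devices lvg0
```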
The striped logical volume will make sure the writes are spread over the 2 backing mirrors which increases performance, and makes certain the mirrors are filled equally. The stripe size of 16K was determined experimentally: larger stripes improve large blocksize reads/writes, but deteriorate small blocksize access. Even smaller stripe size didn’t lead to significant further performance improvements of small blocksize accesses.
Measure lvm impact
Time for another test against /dev/lvg0/lv0:
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 135k | 553 | 0.87x |
random write | 4k | 124k | 509 | 0.75x |
random read | 64k | 50k | 3279 | 2.1x |
random write | 64k | 48.5k | 3030 | 1.8x |
random read | 1M | 3439 | 3606 | 2.2x |
random write | 1M | 3229 | 3386 | 2.0x |
So much better…now to restore the mirrors.
To restore the mirrors we first have to copy the data from the zvol to the new volume:
$ sudo dd if=/dev/zvol/tank/iSCSIvol of=/dev/lvg0/lv0 bs=64K status=progress
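It’s worth confirming that the new volume is at least as large as the zvol; blockdev reports exact byte sizes:

```shell
# The second number must be >= the first, or the dd copy will truncate data
sudo blockdev --getsize64 /dev/zvol/tank/iSCSIvol
sudo blockdev --getsize64 /dev/lvg0/lv0
```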
Activate the mirrors
We can delete the pool now:
$ sudo zpool destroy tank
The devices nvme2n1 and nvme3n1 can now be partitioned (see above), and added to the mirrors:
$ sudo mdadm --manage /dev/md0 --add /dev/nvme2n1p1
$ sudo mdadm --manage /dev/md1 --add /dev/nvme3n1p1
By checking /proc/mdstat we can see the mirrors are rebuilding:
$ cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 nvme3n1p1[2] nvme1n1p1[0]
976628736 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 1.5% (14805248/976628736) finish=77.7min speed=206111K/sec
bitmap: 6/8 pages [24KB], 65536KB chunk
md0 : active raid1 nvme2n1p1[2] nvme0n1p1[0]
976628736 blocks super 1.2 [2/1] [U_]
[>....................] recovery = 1.6% (16126592/976628736) finish=80.0min speed=200041K/sec
bitmap: 8/8 pages [32KB], 65536KB chunk
unused devices: <none>
After this finishes, we take a final measurement against a new logical volume (we don’t want to overwrite the data):
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device |
---|---|---|---|---|
random read | 4k | 137k | 545 | 0.72x |
random write | 4k | 90.2k | 370 | 0.55x |
random read | 64k | 40.4k | 2650 | 1.7x |
random write | 64k | 23.2k | 1521 | 0.90x |
random read | 1M | 2325 | 2439 | 1.5x |
random write | 1M | 1550 | 1626 | 0.96x |
Excellent. It is a little surprising that the numbers are worse than the pre-mirror ones; one possible explanation is that the drives are now partly filled with data. Moving on to sharing the datastore with ESXi.
Setting up an iSCSI target
It was quite easy to find tooling for consuming iSCSI (initiators), but I had a hard time searching for iSCSI server (target) software. I wasted some time on tgt, which turns up first in searches but is quite dated. In the end it turns out the kernel ships a standard iSCSI target implementation called LIO, which works great.
$ sudo apt-get install targetcli-fb
Now the convention is to set this up through the CLI, but you can edit the resulting config file manually afterwards. The process involves a few steps, starting with a backstore - in this case a block device:
$ sudo targetcli
/> backstores/block create lvm_backend dev=/dev/lvg0/lv0
Followed by setting up the iscsi endpoint (target):
/> cd iscsi
/iscsi> create iqn.2020-03.local.lab.nas:fast
Connect the two by creating a lun mapped to the backstore device:
/iscsi> cd iqn.2020-03.local.lab.nas:fast/tpg1
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> luns/ create /backstores/block/lvm_backend
Now for sharing: in production environments you should set up ACLs, but this is a homelab, so I’ll just allow any initiators to connect.
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> set attribute generate_node_acls=1
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> set attribute authentication=0
/iscsi/iqn.2020-03.local.lab.nas:fast/tpg1> set attribute demo_mode_write_protect=0
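Before exiting, persist the configuration (recent targetcli-fb versions also save automatically on exit, writing to /etc/rtslib-fb-target/saveconfig.json):

```shell
/> saveconfig
```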
The configuration should look like this now:
/> ls
o- / .................................................................................................................. [...]
o- backstores ....................................................................................................... [...]
| o- block ........................................................................................... [Storage Objects: 1]
| | o- lvm_backend .......................................................... [/dev/lvg0/lv0 (1.5TiB) write-thru activated]
| o- fileio .......................................................................................... [Storage Objects: 0]
| o- pscsi ........................................................................................... [Storage Objects: 0]
| o- ramdisk ......................................................................................... [Storage Objects: 0]
o- iscsi ..................................................................................................... [Targets: 1]
| o- iqn.2020-03.local.lab.nas:fast ............................................................................. [TPGs: 1]
| o- tpg1 ........................................................................................... [gen-acls, no-auth]
| o- acls ................................................................................................... [ACLs: 0]
| o- luns ................................................................................................... [LUNs: 1]
| | o- lun0 ....................................................................... [block/lvm_backend (/dev/lvg0/lv0)]
| o- portals ............................................................................................. [Portals: 1]
| o- 0.0.0.0:3260 .............................................................................................. [OK]
o- loopback .................................................................................................. [Targets: 0]
o- vhost ..................................................................................................... [Targets: 0]
You can test whether the share is visible from another machine with iscsiadm:
$ sudo iscsiadm -m discovery -t st -p nas.lab.local
nas.lab.local:3260,1 iqn.2020-03.local.lab.nas:fast
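To go one step further than discovery, you can log in from a Linux initiator and check that the LUN actually attaches (the target name comes from the discovery output above):

```shell
# Log in, check for a new block device, then log out again
sudo iscsiadm -m node -T iqn.2020-03.local.lab.nas:fast -p nas.lab.local --login
lsblk
sudo iscsiadm -m node -T iqn.2020-03.local.lab.nas:fast -p nas.lab.local --logout
```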
All is peachy, now to hook up to ESXi.
Consume in ESXi
I added the iSCSI target and did a storage rescan. The LUN was visible, but the datastore didn’t come online in the GUI. It turned out ESXi detected my new iSCSI share as a datastore snapshot of my previous FreeNAS share. In this case you need to tell ESXi this is the real thing, not a snapshot: keep the existing signature rather than resignaturing the datastore. If you’re lucky, you can do all this through the GUI. I kept getting weird errors about datastore extents still being online:
$ esxcli storage vmfs snapshot list
5df0ecf0-b9a7f923-4a7f-0015178ab812
Volume Name: Fast
VMFS UUID: 5df0ecf0-b9a7f923-4a7f-0015178ab812
Can mount: false
Reason for un-mountability: the original volume has some extents online
Can resignature: true
Reason for non-resignaturability:
Unresolved Extent Count: 1
In the end I rebooted the host. After that it showed:
$ esxcli storage vmfs snapshot list
5df0ecf0-b9a7f923-4a7f-0015178ab812
Volume Name: Fast
VMFS UUID: 5df0ecf0-b9a7f923-4a7f-0015178ab812
Can mount: true
Reason for un-mountability:
Can resignature: false
Reason for non-resignaturability: the volume is being actively used
Unresolved Extent Count: 1
This time I could mount the ‘snapshot’:
$ esxcli storage vmfs snapshot mount -l Fast
VAAI
The iSCSI datastore we made available through LIO has full support for the VMware VAAI primitives Atomic Test & Set (ATS), Zero, Clone and Delete (Unmap). These can greatly speed up storage operations. Unmap is especially important on space constrained flash but is disabled by default on the LIO side. So one small tweak we have to make is to enable the Delete operation. It comes down to setting the emulate_tpu flag on the backstore in targetcli:
$ sudo targetcli
/> backstores/block/lvm_backend set attribute emulate_tpu=1
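On the ESXi side you can check whether the host now reports the primitives as supported:

```shell
# Lists VAAI (ATS, Clone, Zero, Delete) support status per device;
# look for the entry matching the iSCSI LUN
esxcli storage core device vaai status get
```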
This concludes the setup of an all flash datastore. So how is the performance in a VM?
One final measurement
For the final measurement I migrated a VM onto the datastore and ran the benchmarks against its local disk.
Mode | Blocksize | IOPS | Bandwidth (MB/s) | vs. single raw device | vs. lvm volume |
---|---|---|---|---|---|
random read | 4k | 31.1k | 121 | 0.16x | 0.22x |
random write | 4k | 24.2k | 94.5 | 0.14x | 0.26x |
random read | 64k | 14.0k | 876 | 0.56x | 0.33x |
random write | 64k | 11.0k | 687 | 0.41x | 0.45x |
random read | 1M | 723 | 723 | 0.45x | 0.30x |
random write | 1M | 649 | 650 | 0.38x | 0.40x |
Apparently the combination of iSCSI, network stacks, VMFS and EXT4 (in the guest) causes a performance hit of 60-80%. I’m not diving into the details here as I’m quite happy with the overall result.
Now I need 10G networking :)