ZFS pool not importing upon reboot

Background

My main hypervisor is a Dell R720 server running Proxmox. It has 8 spinning hard drives making up a ZFS pool called r720_storage_pool. There is also a high-performance VM pool that runs on NVMe SSDs, and a boot pool created by Proxmox. Every month, I upgrade Proxmox and reboot to apply the new kernel. It has been running mostly maintenance-free for a few years, until yesterday, right after one of those routine reboots.

Before I jump to the actual issue, it would be helpful to lay some details on how the current stack works:

  • the ZFS pool r720_storage_pool has some encrypted datasets, whose key is stored in the boot pool and loaded upon reboot. The process does not require user intervention and the pool is automatically imported upon reboot, mounted under /r720_storage_pool (a sketch of this setup follows the list)
  • based on ArchWiki SFTP chroot, I set up bind mount in /etc/fstab so that OpenSSH server can serve the pool via SFTP:
    /r720_storage_pool/encrypted/media /srv/ssh/media none bind,defaults,nofail,x-systemd.requires=zfs-mount.service 0 0
  • I also created dedicated users for SFTP/SSHFS purpose only. Their entry in /etc/passwd is as follows:
    media:x:1001:1000::/srv/ssh/media:/usr/sbin/nologin
  • the VMs (in this case, a Docker host) access the SFTP chroot jail upon boot, conveniently defined in /etc/fstab:
    media@proxmox.local.lan:/ /home/ewon/media fuse.sshfs defaults,delay_connect,_netdev,allow_other,default_permissions,uid=1000,gid=1000,IdentityFile=/home/ewon/.ssh/id_ed25519 0 0
  • Docker containers consume ZFS storage backend through bind mounted Docker volumes, defined in docker-compose file, for example:
    ...
    volumes:
      - /home/ewon/media:/data/photo
      - /home/ewon/media:/data/video
    ...
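
For context, here is a minimal sketch of how such an encrypted dataset can be created with OpenZFS native encryption. The key file path (/root/keys/media.key) and the exact dataset name are assumptions on my part; the point is that keyformat=raw plus keylocation=file:// lets ZFS load the key at boot without prompting anyone:

    # generate a raw 32-byte key on the boot pool (path is hypothetical)
    dd if=/dev/urandom of=/root/keys/media.key bs=32 count=1
    chmod 600 /root/keys/media.key

    # create the encrypted dataset, pointing keylocation at that key file
    zfs create -o encryption=aes-256-gcm \
               -o keyformat=raw \
               -o keylocation=file:///root/keys/media.key \
               r720_storage_pool/encrypted

    # at boot, the key can then be loaded non-interactively
    zfs load-key r720_storage_pool/encrypted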

Incident

After rebooting Proxmox, I noticed some services were not available. I went to the Docker VM (which runs on Proxmox) and found out that all the containers that use r720_storage_pool had failed to start.

I've had some trouble in the past when rebooting the hypervisor, due to a race condition between the Docker VM and the SFTP server on Proxmox. Since then, I added a start delay on the Docker host and the issue never happened again. However, this time it's different.

Investigation

I ssh'ed into docker.local.lan and noticed that the SSHFS share was mounted correctly, but there was no content in the directory. "Oh no!" This can't be good.

Following up the chain, I ssh'ed into proxmox.local.lan and checked the ZFS pools. zpool status would not show r720_storage_pool. I started sweating.

Manually running zpool import -a would not import the pool, either. I rushed down to the server rack: all 8 drives were still blinking and humming. "Ok", my drives are still there, nobody stole them, cats didn't piss on them (story for another day). Did the disk controller give up? On the terminal, I checked /dev/disk/by-id, and thank goodness all of my sdX devices still show up.
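
For reference, this is roughly the check I mean; filtering out the partition entries makes the list easier to eyeball:

    root@r720:~# ls -l /dev/disk/by-id/ | grep -v part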

Next, I needed to manually import the pool and make it available again:

  1. zpool import -d /dev/disk/by-id: it took a good few seconds to run, and my pool showed up again!
  2. zpool status -v: shows the pool with 0 errors, very healthy.
  3. zfs load-key -r r720_storage_pool/media/encrypted: load the encryption key file.
  4. zfs get keystatus r720_storage_pool/media/encrypted: (optionally) check the key status.
  5. zfs mount -a: mount all the datasets again, just in case.
  6. mount -a: re-apply the bind mount under /srv/ssh, so it no longer shows up as an empty directory.

Lastly, on the Docker VM, I re-mounted the SSHFS share and started the Docker containers. Crisis mode over, for now. Wipe forehead.
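
On the Docker VM, that boils down to something like the following. The mount point comes from the fstab entry above; the compose project directory is hypothetical:

    # re-mount the SSHFS share defined in /etc/fstab
    mount /home/ewon/media

    # bring the affected containers back up (compose directory is hypothetical)
    cd /home/ewon/compose && docker compose up -d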

Root cause

As soon as I put my beans together, I started to wonder why it happened in the first place and how to avoid it in the future.

I always start by checking for failed systemd units:

root@r720:~# systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

Nothing to see here, move along.

After some searching online, I found folks talking about a corrupted zfs-import-cache, like here and here. The problem for me is not a failed zfs-import-cache.service; in fact, the service seemed to run just fine:

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; preset: enabled)
     Active: active (exited) since Sat 2024-05-04 20:37:25 EDT; 17h ago
       Docs: man:zpool(8)
    Process: 1935 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN $ZPOOL_IMPORT_OPTS (code=exited, status=0/SUCCESS)
   Main PID: 1935 (code=exited, status=0/SUCCESS)
        CPU: 15ms

May 04 20:37:25 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
May 04 20:37:25 r720 zpool[1935]: no pools available to import
May 04 20:37:25 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.

However, the line "no pools available to import" caught my attention. If we dig a little deeper, this line is the exception rather than the norm, compared to recent server reboots:

root@r720:~# journalctl -u zfs-import-cache.service

-- Boot 11f348a818b2439598566752a8d2cdbc --
Feb 04 11:05:24 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
Feb 04 11:05:31 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
-- Boot 1ca32d8475854b98b65153f2a801dd15 --
Mar 03 21:45:27 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
Mar 03 21:45:34 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
-- Boot b71ade2b29ce4b999c337c65636238c7 --
Apr 07 22:13:45 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
Apr 07 22:13:51 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
-- Boot e921bbd16046429d8fdb03e6f35a0d88 --
May 04 20:37:25 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
May 04 20:37:25 r720 zpool[1935]: no pools available to import
May 04 20:37:25 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
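
For what it's worth, the corrupted-cache theory can also be ruled out directly: zdb can dump the pool configurations stored in /etc/zfs/zpool.cache, and zpool get shows whether the (now re-imported) pool still points at it:

    # dump the pool configs recorded in the default cache file
    zdb -C

    # confirm the pool's cachefile property (default "-" means /etc/zfs/zpool.cache)
    zpool get cachefile r720_storage_pool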

I started to check and compare the raw journalctl messages between the current boot and previous boots, with a focus on the keywords zfs and r720_storage_pool.
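
Concretely, something along these lines, where -b 0 is the current boot and -b -1 the one before it:

    # current boot
    journalctl -b 0 | grep -Ei 'zfs|r720_storage_pool'

    # previous boot
    journalctl -b -1 | grep -Ei 'zfs|r720_storage_pool'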

In previous boots, zfs-zed.service recorded the pool being imported, as can be seen in the following three lines. The same cannot be said for the current boot, at least not before I manually imported the pool.

Apr 07 22:13:54 r720 zed[2783]: eid=11 class=pool_import pool='r720_storage_pool'
Apr 07 22:13:54 r720 zed[2786]: eid=10 class=config_sync pool='r720_storage_pool'
Apr 07 22:13:54 r720 zed[2794]: eid=15 class=config_sync pool='r720_storage_pool'

At this point, without getting too deep into the rabbit hole, it is reasonably safe to conclude that the disks are not ready when Proxmox's ZFS daemon tries to import the pool. As to why that is the case, I can't tell. I do, however, believe the following factors could have contributed to the issue:

  • aging hard drives are slow for the operating system to initialize
  • Proxmox 8.2 brings some changes (most likely a heavier workload for the system) that delay disk initialization
  • ZFS got upgraded and runs a bit faster, hence it started importing pools before the devices were ready
  • It's Saturday night and nobody wants to work?!

Final words

For now, there is no clear answer as to what action should be taken to prevent this from happening again. Maybe the issue is just a one-off, or maybe it will keep happening. I decided to do nothing for now and keep an eye out in the future.

Perhaps I could edit zfs-import-cache.service to add some delay, like this post suggests, but I don't like the idea of adding ad-hoc fixes for an unconfirmed issue. I've seen far too many overreacting sysadmins (or their managers) in my previous jobs. The band-aid ends up in production for so long that nobody even knows why it was there in the first place, and everyone is afraid to tear it down. Fortunately, this is my homelab and I have 100% say in this.
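
If it ever did come to that, the least invasive way would be a systemd drop-in rather than editing the unit file shipped by the package. A minimal sketch, with the 15-second delay being an arbitrary number for illustration:

    # "systemctl edit zfs-import-cache.service" creates a drop-in under
    # /etc/systemd/system/zfs-import-cache.service.d/ with this content:
    [Service]
    ExecStartPre=/usr/bin/sleep 15

systemctl edit reloads the daemon on save, so the delay would take effect on the next boot.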

The ultimate "right" solution is to replace my aging hard drives with something newer and faster, maybe even enterprise SSDs. But I'm satisfied with the current speed and capacity, so why bother? I plan to run the current batch of hard drives into the ground, even though the greybeards all suggest otherwise.
