ZFS import fix

This is a follow-up to my previous post, ZFS pool not importing upon reboot. There, I documented that my storage pool would not automatically import after a reboot. The takeaway was that it seemed to be an occasional blip, and that rather than digging further, I would wait and see if it ever happened again.

Well, during the very next patch-and-reboot cycle, I hit the same issue. After researching some more (1 and 2), the fix was relatively easy:

  • systemctl status zfs-import-cache.service: make sure it's enabled and running
  • systemctl disable zfs-import@mypool.service: do this for every pool; previously, it was enabled in Proxmox
  • zpool set cachefile=/etc/zfs/zpool.cache mypool: do this for every pool; previously, this value was unset in Proxmox
  • update-initramfs -k all -u
  • reboot now

Here is the brief explanation:

  • ZFS should use zfs-import-cache.service to automatically import pools at boot; as the name suggests, it reads the cache file instead of scanning the actual hard drives, which may or may not be available during boot.
  • zfs-import@mypool.service is the service that imports a pool by scanning for its hard drives. Disable it for each pool. (Why was it enabled in Proxmox?)
  • manually set the "cachefile" property for each pool. (Again, why was it empty?)
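To verify the new setup after a reboot, a couple of read-only checks like these should be enough (a quick sketch; mypool stands in for each real pool name):

systemctl is-enabled zfs-import-cache.service     # should print "enabled"
zpool get cachefile mypool                        # should show /etc/zfs/zpool.cache, not "-"
systemctl is-enabled zfs-import@mypool.service    # should now print "disabled"
journalctl -b -u zfs-import-cache.service         # should no longer say "no pools available to import"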

Since I installed Proxmox 7 on this machine three years ago and only started to experience ZFS import issues in recent months, I suspect that a recent update changed the order in which services are loaded during boot. As a result, the bad defaults exposed themselves.

I know that Proxmox has its quirks when it comes to HA stuff, but this ZFS implementation is another knock on its reputation (for me). If I ever have to do it from scratch, I'll probably go with stock Ubuntu with KVM/QEMU and ZFS.

ZFS pool not importing upon reboot

Background

My main hypervisor is a Dell R720 server running Proxmox. It has 8 spinning hard drives making up a ZFS pool called r720_storage_pool. There is also a high-performance VM pool that runs on NVMe SSDs, and a boot pool created by Proxmox. Every month, I upgrade Proxmox and reboot to apply the new kernel. It had been running mostly maintenance-free for a few years, until yesterday, right after I routinely rebooted it.

Before I jump into the actual issue, it would be helpful to lay out some details on how the current stack works:

  • the ZFS pool r720_storage_pool has some encrypted datasets, whose key is stored in the boot pool and loaded upon reboot. The process does not require user intervention, and the pool is automatically imported upon reboot and mounted under /r720_storage_pool
  • based on the ArchWiki SFTP chroot article, I set up a bind mount in /etc/fstab so that the OpenSSH server can serve the pool via SFTP:
    /r720_storage_pool/encrypted/media /srv/ssh/media none bind,defaults,nofail,x-systemd.requires=zfs-mount.service 0 0
  • I also created dedicated users for SFTP/SSHFS purpose only. Their entry in /etc/passwd is as follows:
    media:x:1001:1000::/srv/ssh/media:/usr/sbin/nologin
  • the VMs (in this case, a Docker host) mount the SFTP chroot jail at boot, conveniently defined in /etc/fstab:
    media@proxmox.local.lan:/ /home/ewon/media fuse.sshfs defaults,delay_connect,_netdev,allow_other,default_permissions,uid=1000,gid=1000,IdentityFile=/home/ewon/.ssh/id_ed25519 0 0
  • Docker containers consume the ZFS storage backend through bind-mounted Docker volumes, defined in the docker-compose file, for example:
    ...
    volumes:
      - /home/ewon/media:/data/photo
      - /home/ewon/media:/data/video
    ...
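For context, the sshd_config side of that chroot setup looks roughly like the following. This is a sketch based on the ArchWiki page referenced above, not a copy of my actual config; the Match block and the ChrootDirectory value are assumptions and have to line up with the /srv/ssh layout and sshd's root-ownership requirements for chroot directories.

# /etc/ssh/sshd_config (sketch)
Match User media
    ChrootDirectory %h
    ForceCommand internal-sftp
    AllowTcpForwarding no
    X11Forwarding no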

Incident

After rebooting Proxmox, I noticed some services were not available. I went to the Docker VM (which runs on Proxmox) and found out that all the containers that use r720_storage_pool had failed to start.

I've had some trouble in the past when rebooting the hypervisor, due to a race condition between the Docker VM and the SFTP server on Proxmox. Since then, I added a start delay to the Docker host and the issue never happened again. This time, however, it was different.

Investigation

I ssh'ed into docker.local.lan and noticed that the SSHFS share was mounted correctly, but there was no content in the directory. "Oh no!" This can't be good.

Following up the chain, I ssh'ed into proxmox.local.lan and checked the ZFS pools. zpool status would not show r720_storage_pool. I started sweating.

A manual zpool import -a would not import the pool, either. I rushed down to the server rack; all 8 drives were still blinking and humming. "OK," my drives are still there, nobody stole them, the cats didn't piss on them (story for another day). Did the disk controller give up? Back on the terminal, I checked /dev/disk/by-id; thank goodness, all of my sdX devices still showed up.

Next, I needed to manually import the pool and make it available again:

  1. zpool import -d /dev/disk/by-id: it took a good few seconds to run, and my pool showed up again!
  2. zpool status -v: shows the pool with 0 errors, very healthy.
  3. zfs load-key -r r720_storage_pool/media/encrypted: load the encryption key file
  4. zfs get keystatus r720_storage_pool/media/encrypted: (optionally) check the key status
  5. zfs mount -a: mount all the datasets again, just in case
  6. mount -a: bind mount the zpool to the /srv/ssh directory again, so it won't show up as empty
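For future me, here is the same recovery sequence collapsed into one copy-paste block, to be run on the Proxmox host. It is a sketch using the pool and dataset names from the steps above; the only change is that the import names the pool explicitly instead of just listing it:

zpool import -d /dev/disk/by-id r720_storage_pool
zpool status -v r720_storage_pool
zfs load-key -r r720_storage_pool/media/encrypted
zfs mount -a
mount -a    # re-establish the /srv/ssh bind mounts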

Lastly, on the Docker VM, re-mount all the SSHFS resources and start the Docker containers. Crisis mode over, for now. Wipe forehead.

Root cause

As soon as I gathered my wits, I started to wonder why this happened in the first place and how to avoid it in the future.

I almost always start by checking for failed systemd units:

root@r720:~# systemctl --failed
  UNIT LOAD ACTIVE SUB DESCRIPTION
0 loaded units listed.

Nothing to see here, move along.

After some online searching, I found folks talking about a corrupted zfs-import-cache, like here and here. My problem is not a failed zfs-import-cache.service; in fact, the service seemed to run just fine:

● zfs-import-cache.service - Import ZFS pools by cache file
     Loaded: loaded (/lib/systemd/system/zfs-import-cache.service; enabled; preset: enabled)
     Active: active (exited) since Sat 2024-05-04 20:37:25 EDT; 17h ago
       Docs: man:zpool(8)
    Process: 1935 ExecStart=/sbin/zpool import -c /etc/zfs/zpool.cache -aN $ZPOOL_IMPORT_OPTS (code=exited, status=0/SUCCESS)
   Main PID: 1935 (code=exited, status=0/SUCCESS)
        CPU: 15ms

May 04 20:37:25 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
May 04 20:37:25 r720 zpool[1935]: no pools available to import
May 04 20:37:25 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.

However, the line no pools available to import caught my attention. Digging a little deeper, this line is the exception rather than the norm, compared to recent server reboots:

root@r720:~# journalctl -u zfs-import-cache.service

-- Boot 11f348a818b2439598566752a8d2cdbc --
Feb 04 11:05:24 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
Feb 04 11:05:31 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
-- Boot 1ca32d8475854b98b65153f2a801dd15 --
Mar 03 21:45:27 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
Mar 03 21:45:34 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
-- Boot b71ade2b29ce4b999c337c65636238c7 --
Apr 07 22:13:45 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
Apr 07 22:13:51 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.
-- Boot e921bbd16046429d8fdb03e6f35a0d88 --
May 04 20:37:25 r720 systemd[1]: Starting zfs-import-cache.service - Import ZFS pools by cache file...
May 04 20:37:25 r720 zpool[1935]: no pools available to import
May 04 20:37:25 r720 systemd[1]: Finished zfs-import-cache.service - Import ZFS pools by cache file.

I started to check and compare the raw journalctl messages between the current boot and the previous boot, with a focus on the keywords zfs and r720_storage_pool.
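Concretely, the comparison boils down to grepping the journal of the current and the previous boot for those keywords, something like:

journalctl -b 0  | grep -Ei 'zfs|r720_storage_pool'    # current boot
journalctl -b -1 | grep -Ei 'zfs|r720_storage_pool'    # previous boot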

In previous boots, zfs-zed.service was able to import the pool, as can be seen in the following three lines. The same cannot be said for the current boot, at least not before I manually imported the pool.

Apr 07 22:13:54 r720 zed[2783]: eid=11 class=pool_import pool='r720_storage_pool'
Apr 07 22:13:54 r720 zed[2786]: eid=10 class=config_sync pool='r720_storage_pool'
Apr 07 22:13:54 r720 zed[2794]: eid=15 class=config_sync pool='r720_storage_pool'

At this point, without going too deep down the rabbit hole, it is reasonably safe to conclude that the disks are not ready when Proxmox's ZFS daemon tries to import the pool. As to why that is the case, I can't tell. I do, however, believe the following factors could have contributed to the issue:

  • aging hard drives are slow for the operating system to initialize
  • Proxmox 8.2 brings some changes (most likely a heavier workload for the system) that delay disk initialization
  • ZFS got upgraded and runs a bit faster, hence it started to import pools before the devices were ready
  • It's Saturday night and nobody wants to work?!

Final words

For now, there is no clear answer as to what action should be taken to prevent this from happening again. Maybe the issue is just a one-off, or maybe it will keep happening. I decided to do nothing for now and keep an eye on it going forward.

Perhaps I could edit zfs-import-cache.service to add some delay, like this post suggests, but I don't like the idea of adding ad-hoc fixes for an unconfirmed issue. I've seen far too many overreacting sysadmins (or their managers) in my previous jobs. The bandage ends up in production for so long that nobody even knows why it was there in the first place, and everyone is afraid to tear it off. Fortunately, this is my homelab and I have 100% say in it.
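For the record, the kind of band-aid I'm talking about would be a systemd drop-in that delays the import before it runs, roughly like this (an untested sketch; the 15-second sleep is an arbitrary value):

# systemctl edit zfs-import-cache.service creates an override like
# /etc/systemd/system/zfs-import-cache.service.d/override.conf
[Service]
ExecStartPre=/usr/bin/sleep 15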

The ultimate "right" solution is to replace my aging hard drives with something newer and faster, even enterprise SSDs. But I'm satisfied with its current speed and capacity, why bother? I plan to run the current batch of hard drives to the ground, despite grey beard all suggest otherwise.

Arch Linux installation notes: three filesystem schemes

I started my Linux journey with Arch Linux about five years ago (Technically, I used Ubuntu 9.10 way back when, but that's a story for another day). As a newbie, installing Arch by hand was a pretty big deal for me. It taught me almost everything about Linux and open source software in general.

Over time, I moved on from Arch to Debian/Ubuntu/Fedora, but I have always kept a technical note on how to install Arch Linux. What drove me to revisit this topic? Well, I decided to put Linux on an old MacBook Pro with an Intel CPU. I can think of no better distro than Arch Linux: it is flexible, lightweight, and... did I mention it's Arch, btw?

This article focuses on installing an Arch Linux base system on a laptop or desktop with a single hard drive and UEFI support, using one of three partition/filesystem schemes and one of two boot loaders (GRUB or systemd-boot). The scenarios are:

  1. LVM with ext4: no encryption
  2. LVM on LUKS: offers root partition encryption
  3. Btrfs on LUKS: like the above, but with Btrfs
  4. Encrypted EFI system partition with Unified Kernel Image

The purpose here is to dual boot Arch Linux with macOS, so an encrypted EFI partition is out of the question. I might write another article about Unified Kernel Images in the future.

While researching and refining my notes, I came across the YouTube channel EF linux. He is concise and straight to the point. I also recommend his video about btrfs snapshots with Timeshift.

So, without further ado, let me present my raw notes of installing Arch Linux in (late) 2023.


Live system stage

Preparation

wireless network config
Wi-Fi: iwctl (authenticates to Wi-Fi) or wifi-menu (netctl)
Ethernet: ArchISO's systemd-networkd and systemd-resolved should work out of the box

SSH remote install:
passwd to set root password
systemctl start sshd.service
ip addr to get IP address

Optional:
setfont ter-132b set larger font for HiDPI
timedatectl set-ntp true to ensure the system clock is accurate
ls /sys/firmware/efi/efivars verify boot mode (BIOS or UEFI)
cat /sys/firmware/efi/fw_platform_size another way to verify (64 or 32)

Partitioning

https://wiki.archlinux.org/title/Partitioning
https://wiki.archlinux.org/title/EFI_system_partition
https://wiki.archlinux.org/title/Btrfs
https://wiki.archlinux.org/title/Install_Arch_Linux_on_LVM
https://wiki.archlinux.org/title/Dm-crypt/Encrypting_an_entire_system

Boot partition

Partition the disk: create an ESP (EFI system partition) and a "Linux root" partition
cfdisk is a TUI alternative to fdisk
fdisk -l to check existing disks/partitions
fdisk /dev/sd[X]

Format and mount:
mkfs.fat -F 32 /dev/sda1 format ESP to FAT32
mkdir -p /mnt/boot
mount /dev/sda1 /mnt/boot

Notes on ESP mount point:

  • /boot: cannot be encrypted; contains kernels, initramfs images, microcode and boot loader config files; supports dual boot with Windows/macOS
  • /efi (historically /boot/efi): holds only the boot loader and its config files
Mount point     Partition    Partition type (GUID)                                           Size
/boot or /efi   /dev/sda1    EFI system partition (C12A7328-F81F-11D2-BA4B-00A0C93EC93B)    300 MB to 1 GB
/               /dev/sda2    Linux root (4F68BCE3-E8CD-4DB1-96E7-FBCAF984B709)              remainder
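If you prefer scripting the layout over clicking through cfdisk, the same table can be created non-interactively with sgdisk. This is my own habit rather than part of the original notes; replace /dev/sdX with the target disk and double-check before wiping anything:

sgdisk --zap-all /dev/sdX                                # wipe existing partition tables
sgdisk -n 1:0:+1G -t 1:ef00 -c 1:"EFI system" /dev/sdX   # ESP, type ef00
sgdisk -n 2:0:0   -t 2:8300 -c 2:"Linux root" /dev/sdX   # root, remainder of the disk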

1. LVM

LVM with ext4 (no LUKS encryption), see Install Arch on LVM

  • create a PV, VG and LV on /dev/sda2 (if the LV will be formatted with ext4, leave 256 MiB of space for e2scrub; see the next section for the commands)
  • format: mkfs.ext4 /dev/VolGroup/root
  • mount: mount /dev/VolGroup/root /mnt
  • note: swap and home logical volumes are optional

2. LVM on LUKS

LVM on LUKS

  • LUKS2
    cryptsetup luksFormat /dev/sda2
    cryptsetup open /dev/sda2 cryptroot
  • LVM
    pvcreate /dev/mapper/cryptroot
    pvdisplay/pvscan
    vgcreate VolGroup /dev/mapper/cryptroot
    lvcreate -l 100%FREE VolGroup -n root
    lvreduce -L -256M VolGroup/root ### if ext4, leave 256 MiB space for e2scrub
  • format and mount
    mkfs.ext4 /dev/VolGroup/root
    mount /dev/VolGroup/root /mnt

3. Btrfs on LUKS

Btrfs

  • LUKS2
    cryptsetup luksFormat /dev/sda2
    cryptsetup open /dev/sda2 cryptroot
  • Btrfs
    mkfs.btrfs -L archlinux /dev/mapper/cryptroot
    mount /dev/mapper/cryptroot /mnt
    cd /mnt
    btrfs subvolume create @            ## or root
    btrfs subvolume create @home   ## or home
    cd
    umount /mnt
    mount -o subvol=@,compress=zstd /dev/mapper/cryptroot /mnt
    mkdir /mnt/home
    mount -o subvol=@home,compress=zstd /dev/mapper/cryptroot /mnt/home

Install base system

(optionally) edit the mirror list /etc/pacman.d/mirrorlist
pacstrap -K /mnt base linux linux-firmware intel-ucode sudo lvm2 btrfs-progs nano (use amd-ucode on AMD CPUs; optionally add networkmanager)

  • Kernels can be linux-lts or linux-zen
  • skip linux-firmware if it's a VM

FSTAB

genfstab -U /mnt >> /mnt/etc/fstab use -U or -L to define by UUID or labels, respectively
cat /mnt/etc/fstab to verify

Chroot stage

arch-chroot /mnt

Initramfs (mkinitcpio)

Workflow:
edit the HOOKS array in /etc/mkinitcpio.conf
re-generate with mkinitcpio -P

See also:
mkinitcpio common hooks
kernel parameters

1. LVM

lvm2 package must be installed in the arch-chroot environment
"udev" and "lvm2" for busybox-based initramfs: HOOKS=(base udev ... block lvm2 filesystems)
"systemd" and "lvm2" for systemd-based initramfs: HOOKS=(base systemd ... block lvm2 filesystems)

2. LVM on LUKS

lvm2 package must be installed in the arch-chroot environment
"keyboard", "encrypt" and "lvm" for busybox-based initramfs: HOOKS=(base udev autodetect modconf kms keyboard keymap consolefont block encrypt lvm2 filesystems fsck)
"keyboard", "sd-encrypt" and "lvm" for systemd-based initramfs: HOOKS=(base systemd autodetect modconf kms keyboard sd-vconsole block sd-encrypt lvm2 filesystems fsck)

3. Btrfs on LUKS

btrfs-progs package must be installed in the arch-chroot environment
"keyboard" and "encrypt" for busybox-based initramfs
"keyboard" and "sd-encrypt" for systemd-based initramfs

For a single-device Btrfs pool, the "filesystems" hook is sufficient (no need for the "btrfs" hook)
For a multi-device Btrfs pool, use one of the "udev", "systemd" or "btrfs" hooks. See common hooks

Additionally, edit /etc/fstab to add mount options (get the UUIDs with lsblk -f or blkid):

UUID=XXX / btrfs subvol=@,compress=zstd:9,discard=async,noatime,ssd
UUID=YYY /home btrfs subvol=@home,compress=zstd:9,discard=async,noatime,ssd

Boot loader

Installation:

  • systemd-boot is shipped with the systemd package which is a dependency of the base meta package
  • GRUB: pacman -S grub efibootmgr
  • rEFInd: pacman -S refind efibootmgr

Kernel parameter references:

1. LVM

Choose any boot loader
Kernel parameter root=/dev/VolGroup/root

2. LVM on LUKS

GRUB install (assuming the ESP is mounted at /boot)
grub-install --target=x86_64-efi --efi-directory=/boot --bootloader-id=GRUB
grub-mkconfig -o /boot/grub/grub.cfg

(Optional) fallback boot path:
either use --removable flag
or mkdir *esp*/EFI/BOOT and cp *esp*/EFI/GRUB/grubx64.efi *esp*/EFI/BOOT/BOOTX64.EFI

Kernel parameters: https://wiki.archlinux.org/title/Dm-crypt/Encrypting_an_entire_system#Configuring_the_boot_loader_2
Get the UUIDs with lsblk -f or blkid
Unlock the encrypted root partition at boot (device-UUID below refers to the LUKS partition /dev/sda2):
For encrypt hook: cryptdevice=UUID=<device-UUID>:cryptroot:allow-discards root=/dev/VolGroup/root
For sd-encrypt hook: rd.luks.name=<device-UUID>=cryptroot rd.luks.options=discard root=/dev/VolGroup/root
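With GRUB, those parameters go into /etc/default/grub before running grub-mkconfig; a minimal sketch for the "encrypt" hook (the UUID placeholder is the LUKS partition's UUID, as above):

# /etc/default/grub
GRUB_CMDLINE_LINUX="cryptdevice=UUID=<device-UUID>:cryptroot:allow-discards root=/dev/VolGroup/root"

grub-mkconfig -o /boot/grub/grub.cfg    # re-generate the config after editing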

3. Btrfs on LUKS

Install systemd-boot: bootctl install optionally set --esp-path=/custom/esp
Automatic update: enable systemd-boot-update.service and/or add pacman hook

Configure nano /boot/loader/loader.conf

# default name must match filename /boot/loader/entries/arch.conf
default arch
timeout 3
console-mode max
# console-mode auto/keep
# editor no

Add loaders nano /boot/loader/entries/arch.conf add -lts for LTS kernel

title Arch Linux
linux   /vmlinuz-linux
initrd  /intel-ucode.img
# initrd  /amd-ucode.img
initrd  /initramfs-linux.img
# kernel parameters for Btrfs on LUKS ("encrypt" hook), where XXX is the UUID of the LUKS partition (/dev/sda2)
options cryptdevice=UUID=XXX:cryptroot:allow-discards root=/dev/mapper/cryptroot rootflags=subvol=@ rw quiet splash
# kernel parameters for Btrfs on LUKS ("sd-encrypt" hook), where XXX is the UUID of the LUKS partition and YYY is the UUID of the Btrfs filesystem on /dev/mapper/cryptroot
options rd.luks.name=XXX=cryptroot rd.luks.options=discard root=UUID=YYY rootflags=subvol=@ rw quiet splash

Note:

  • Fedora's GRUB only has "rd.luks.name=", with no "root=" or "rootflags="; on Arch these have to be set
  • Either set "rootflags=" here, or run btrfs subvolume set-default <subvolume-id> /

Fallback nano /boot/loader/entries/arch-fallback.conf; add -lts for LTS kernel

title Arch Linux (fallback initramfs)
...
initrd  /initramfs-linux-fallback.img
...

Time zone

From here on it's easy, just follow the Installation Guide

ln -sf /usr/share/zoneinfo/America/Toronto /etc/localtime to set the time zone
hwclock -w -u to set time

Localization

nano /etc/locale.gen uncomment en_CA.UTF-8 and other needed locales
or
echo "en_CA.UTF-8 UTF-8" >> /etc/locale.gen
lastly
locale-gen to generate locale

nano /etc/locale.conf to set the LANG variable: LANG=en_CA.UTF-8
or
optionally locale > /etc/locale.conf

Network configuration

echo MYHOSTNAME > /etc/hostname

nano /etc/hosts

127.0.0.1   localhost
::1     localhost
127.0.1.1   MYHOSTNAME.localdomain  MYHOSTNAME

either systemctl enable systemd-networkd.service systemd-resolved.service and follow some example configurations
or pacman -S networkmanager + systemctl enable NetworkManager
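A minimal wired DHCP example for the systemd-networkd route (the file name and the interface glob are placeholders; adjust to your NIC name):

# /etc/systemd/network/20-wired.network
[Match]
Name=en*

[Network]
DHCP=yes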

User and password

passwd for root (or skip it: we want a passwordless, locked root and will rely on sudo instead)
ensure sudo package is installed
useradd -m -G wheel -s /bin/bash your_user
passwd your_user
EDITOR=nano visudo and uncomment %wheel ALL=(ALL) ALL

Reboot

exit to exit chroot
umount -R /mnt optional but safe
reboot now

Post-install

Useful topics

systemd-boot: enable systemd-boot-update.service and/or add pacman hook

zram: replaces swap file or swap partition
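A minimal example, assuming the zram-generator package is installed: drop a config like the following in place and reboot (the size expression is just one common choice, not a recommendation):

# /etc/systemd/zram-generator.conf
[zram0]
zram-size = min(ram / 2, 4096)
compression-algorithm = zstd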

Hibernation: (swap partition)
For "encrypt" hook: resume=/dev/VolGroup/swap (same format as root parameter)
systemd-boot ("sd-encrypt" hook): does not require additional kernel parameter with systemd >= v255

Snapper has no official GUI, but it has GRUB and rEFInd integration
Timeshift has a GUI as well as a command line tool

Simplify Linux VM installation on KVM/QEMU with virt-install and cloud-init

This is a follow-up to my previous post about Windows VM installation. This one, surprise surprise, is about installing Linux VMs.

I hate tedious, manual work, but sometimes it also doesn't make sense to spend time modifying an Ansible playbook that I will probably only use a few times. I find that virt-install and cloud-init meet most of my needs when it comes to quickly spinning up VMs for testing. They offer simplicity with great flexibility. Within minutes I can create VMs for testing; if I want to go crazy, I can tell cloud-init to run Ansible during the first boot, and probably other automation tools as well.

For more serious stuff (like a production server), I will stick to Ansible for deployment and config management.


I will use Ubuntu as the example for this tutorial; Debian/Fedora/CentOS Stream all have cloud editions as well. Download the cloud image (.img file).
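For Ubuntu 22.04, for example, the image referenced later in the virt-install command can be fetched like this (URL pattern current at the time of writing):

wget https://cloud-images.ubuntu.com/jammy/current/jammy-server-cloudimg-amd64.img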

In a directory, create files meta-data and user-data (optionally vendor-data)

meta-data:

instance-id: <Your-ID> # not important; will not be in virtual machine's XML file
local-hostname: ubuntu.local.lan # this will be the FQDN

user-data docs and examples

#cloud-config
users:
  - name: user1
    gecos: A super admin user on Ubuntu with passwordless sudo
    groups: [sudo, adm, audio, cdrom, dialout, floppy, video, plugdev, dip, netdev]
    # other than sudo, the rest are Ubuntu defaults
    shell: /bin/bash
    sudo: 'ALL=(ALL) NOPASSWD:ALL'
    lock_passwd: true # by default; disables password login
    chpasswd:
      expire: True
    ssh_authorized_keys:
      - <Your SSH pub key>

    # Another example
  - name: user2
    gecos: A generic admin user with sudo privilege but requires password
    groups: users,admin,wheel
    shell: /bin/bash
    sudo: 'ALL=(ALL) ALL'
    passwd: <hash of password> # mkpasswd --method=SHA-512 --rounds=4096 ## to get the hash
    ssh_authorized_keys:
      - ' <Your SSH pub key>'

package_update: true
package_upgrade: true # default command on Ubuntu is 'apt dist-upgrade'

# installing additional packages
packages:
  - ansible

# cloud-init is able to chain Ansible pull mode, if further configuration is needed
ansible:
  pull:
    url: "https://git.../xxx.git"
    playbook_name: xxx.yml

# run some commands on first boot
bootcmd: # very similar to runcmd, but commands run very early in the boot process, only slightly after a 'boothook' would run.
- some commands...
runcmd:
- systemctl daemon-reload

#swap: # by default, there is no swap
#  filename: /swap
#  size: "auto" # or size in bytes
#  maxsize: 2147484000   # size in bytes (2 Gibibyte)

# after system comes up first time; find IP in the output text
final_message: "The system is finally up, after $UPTIME seconds"

Finally, install the VM with cloud-init scripts and the cloud image we downloaded earlier. We are going to use user session qemu:///session and store the qcow2 image to ~/.local/share/libvirt/images/xxx.qcow2

# To list accepted OS variants: virt-install --osinfo list (e.g. debian11, fedora37, win10)
# Other useful flags: --cpu MODEL[,+feature][,-feature][,match=MATCH][,vendor=VENDOR] and --memballoon driver.iommu=on
# --graphics none because this is a headless server install
virt-install \
  --connect qemu:///session \
  --name ubuntu \
  --vcpus 2 \
  --memory 2048 \
  --osinfo ubuntu22.04 \
  --network bridge=virbr0,model=virtio,driver.iommu=on \
  --graphics none \
  --disk ~/.local/share/libvirt/images/xxx.qcow2,size=30,backing_store=$PWD"/jammy-server-cloudimg-amd64.img",target.bus=virtio \
  --cloud-init user-data=$PWD"/user-data",meta-data=${PWD}"/meta-data"

As usual, tweak any flags as you see fit.
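Once virt-install returns, the VM can be managed with virsh against the same user session; for instance:

virsh --connect qemu:///session list --all        # confirm the domain exists and is running
virsh --connect qemu:///session console ubuntu    # attach to the serial console (Ctrl+] to detach)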

Simplify Windows VM installation on KVM/QEMU with virt-install

This post is for you if you:

  • Need to quickly spin up a Windows virtual machine on a Linux server or workstation
  • Want to have performance optimized hardware settings for Windows VM
  • Don't want to click through a graphical interface such as virt-manager or Gnome Boxes every time

Well, I have the solution for you. From time to time I need a Windows VM for various purposes. Manually installing Windows on Linux KVM/QEMU is error-prone and time-consuming. To scratch my own itch, I have found and documented a way to reliably spin up Windows 10 and 11 VMs on any Linux machine.

Prerequisite

You will need to prepare the following things before you can start:

  • Windows 10 or 11 ISO image (nowadays you can download directly from Microsoft)
  • Virtualisation stack (sudo apt install qemu-kvm libvirt-daemon-system or sudo dnf install @virtualization)
  • virt-install command-line utility (provided by package virtinst on Debian/Ubuntu; virt-install on RHEL/Fedora)

virt-install command

Here is the one-liner command for Windows 10 or 11. Adjust anything as you see fit.

virt-install \
  --connect qemu:///session \
  --name win11-test \
  --boot uefi \
  --vcpus 4 \
  --cpu qemu64,-vmx \
  --memory 8192 \
  --memballoon driver.iommu=on \
  --osinfo win11 \
  --network bridge=virbr0,model=virtio,driver.iommu=on \
  --graphics spice \
  --noautoconsole \
  --cdrom Win11_22H2_English_x64v2.iso \
  --disk /home/ewon/.local/share/libvirt/images/win11-test.qcow2,size=50,target.bus=scsi,cache=writeback \
  --controller type=scsi,model=virtio-scsi,driver.iommu=on

Explanations:

  • To use the KVM/QEMU system session (as opposed to the user session), specify --connect qemu:///system
  • The --boot uefi may not work reliably on some distros. For example, Fedora 37 (as I tested; Fedora 38 seems to be fine) would default to non-4M version of the OVMF file, resulting in non-working UEFI, hence no Windows 11 support. You may need to manually specify OVMF 4M file path using the following flags instead:
# For Fedora, the OVMF 4M code is under /usr/share/edk2/ovmf-4m/OVMF_CODE.fd
# For Debian, the OVMF 4M code is under /usr/share/OVMF/OVMF_CODE_4M.fd
--machine q35 \
--boot loader=/usr/share/edk2/ovmf-4m/OVMF_CODE.fd,loader.readonly=yes,loader.type=pflash,nvram.template=/usr/share/edk2/ovmf-4m/OVMF_VARS.fd,loader_secure=yes \
  • The --cpu flag for AMD is qemu64,-vmx; qemu64 enables Windows 11 support
  • The --osinfo flag can be either win10 or win11
  • Memory, network and disk controller all support IOMMU driver. Enable them for best performance
  • --graphics spice implies both --video=qxl and --channel=spicevmc; use it for best performance
  • --disk specifies the path of the qcow2 image file; use scsi (VirtIO) and writeback caching for best performance
  • --controller is a rather important flag that often gets overlooked. It has to be specified for the scsi-type disk (see --disk) to show up in Windows

Installation process

Follow the steps:

  • Paste the virt-install one-liner and hit enter. Ideally you would get the following output:
[ewon@ThinkPad]$ virt-install \
  --connect qemu:///session \
  --name win11-test \
  --boot uefi \
  --vcpus 4 \
  --cpu qemu64,-vmx \
  --memory 8192 \
  --memballoon driver.iommu=on \
  --osinfo win11 \
  --network bridge=virbr0,model=virtio,driver.iommu=on \
  --graphics spice \
  --noautoconsole \
  --cdrom Win11_22H2_English_x64v2.iso \
  --disk /home/ewon/.local/share/libvirt/images/win11-test.qcow2,size=50,target.bus=scsi,cache=writeback \
  --controller type=scsi,model=virtio-scsi,driver.iommu=on

Starting install...
Allocating 'win11-test.qcow2'                                                                     |    0 B  00:00:00 ...
Creating domain...                                                                               |    0 B  00:00:00

Domain is still running. Installation may be in progress.
You can reconnect to the console to complete the installation process.
  • We also need the VirtIO drivers ISO attached to the VM during installation. Since virt-install does not support loading multiple CD-ROMs, we have to add it using virt-manager (see the next step) or by directly editing the XML file (see how-to).
  • Shut down the VM and edit it to include the second CD-ROM. Don't forget to keep SATA CDROM 1 (the Windows ISO) enabled as a boot device.
  • Start the VM and attach to the graphical console. The Windows Installer should appear.
  • If you manually specified --machine q35 and --boot loader= instead of --boot uefi, press Esc during boot and turn on Secure Boot. While you are in the UEFI settings, you can also adjust the screen resolution.
  • Follow the Windows Installer, load the drivers (vioscsi, NetKVM, Balloon) from the virtio-win CD drive and continue the installation process.

Post-installation and bugs

After Windows is installed, you will want to install the VirtIO Guest Tools by running "virtio-win-gt-x64.msi" from the CD-ROM. It enables quality-of-life improvements such as dynamic resolution and two-way clipboard sharing. After that, you can remove the two CD-ROMs from the VM instance.

If the screen resolution still looks off, make sure the GPU is detected by Windows: in Windows Update, check "receive updates for other Microsoft products...", then install the graphics driver.

A downside of enabling UEFI firmware is that internal snapshots are not possible; see StackExchange for workarounds. Personally, I don't bother snapshotting Windows VMs anyway, since they are ephemeral. If I were to solve it, I would use filesystem snapshots (Btrfs or ZFS).
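Roughly, that means snapshotting whatever dataset or subvolume holds the qcow2 images while the VM is shut down. The names below are made up for illustration, and this assumes the images directory is its own ZFS dataset or Btrfs subvolume:

# ZFS
zfs snapshot tank/libvirt/images@pre-update
# Btrfs
btrfs subvolume snapshot -r /var/lib/libvirt/images /var/lib/libvirt/images-pre-update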

I have probably encountered other bugs/annoyances in the past that I didn't document. Since installing a Windows VM is a somewhat "popular" practice for Linux users, I think most problems have been found and fixed, or at least worked around. Fire up your search engine if you can't solve something on your own.

Nextcloud upgrade woes

I have been self-hosting a Nextcloud instance for almost two years. It is a LAMP stack in a Proxmox LXC container. The container's operating system is Debian 11, with PHP 7.2.

Up until Nextcloud 25, everything was good. I always use the web updater for minor and major Nextcloud upgrades. It wasn't always smooth sailing (sometimes I needed to drop into the command line for some post-upgrade work), but generally speaking things worked as intended.

A few months ago, I heard Nextcloud 26 would drop support for PHP 7.2, which meant Debian 11 would not be able to upgrade to Nextcloud 26. That's fine, because Debian 12 was just around the corner. I could rock 25 until Debian 12 came out in the summer.

Fast forward to yesterday: I decided to upgrade my LXC container to Debian 12 and Nextcloud from 25 to 27, since both projects had just released major upgrades within the last week. How exciting! Strangely enough, the Nextcloud web interface, under "Administration settings", didn't even report the new versions 26 or 27.

I thought "Fine, I will upgrade Debian first and then use Nextcloud web updater". Turns out, Debian upgrade went very smoothly; all php packages were bumped from 7.2 to 8.2; reboot, done. However, Nextcloud cannot be opened, the web interface says something like "This version of Nextcloud is not compatible with PHP>=8.2. You are currently running 8.2.7". I start to grind my teeth as Nextcloud throws me into this hoop. "Fine, I will manually upgrade".

Following the How to upgrade guide, I downloaded latest.zip from the Nextcloud website and started the (painful) process:

  • turn maintenance mode on
  • unzip the file
  • copy everything except config and data into the document root at /var/www/nextcloud
  • make sure the user, group and permissions are correct
  • add "apc.enable_cli = 1" to the PHP CLI config because of this bug
  • sudo -u www-data php occ upgrade
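Roughly, the command-line version of those steps looks like this. It's a sketch with assumed paths, and occ maintenance:mode is the toggle behind "turn maintenance mode on":

cd /var/www
sudo -u www-data php nextcloud/occ maintenance:mode --on
unzip ~/latest.zip -d /tmp/nextcloud-new
rsync -a --exclude=config --exclude=data /tmp/nextcloud-new/nextcloud/ /var/www/nextcloud/
chown -R www-data:www-data /var/www/nextcloud
sudo -u www-data php nextcloud/occ upgrade
sudo -u www-data php nextcloud/occ maintenance:mode --off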

Of course it didn't work. I went to the web interface to see why; it said "Updates between multiple major versions are unsupported". You could hear me grinding my teeth from across the street.

Finally, after a lot of faffing about, I downloaded Nextcloud 26.0.2 and successfully upgraded. However, that wasn't the end of the misery. As per usual, a major upgrade always needs some cleaning up. I got half a dozen warnings under "Administration settings": PHP memory_limit, file hash mismatches, failed cron jobs, etc. They are not difficult to fix, just hella annoying.

Just thinking about the 26-to-27 upgrade, which will put me through (some of) the rigmarole again, makes me tired. This process is stressful and tedious, especially for something you only need to do every half a year. It periodically reminds me of the bad old days of system administration. Maybe I should have opted for the Docker container deployment, I don't know.

On the flip side, thank goodness I have ZFS snapshots of the container and the data directory. Should something go wrong, I can always roll back.

Practical udev rules

udev is a userspace subsystem on Linux that gives system administrators the ability to register userspace handlers for device events. In other words, it allows custom actions to be executed when a device is plugged in or removed. The device can be physical or virtual, as long as its device node lives under the /dev directory.

udev rules can do pretty powerful things, and I'm only scratching the surface here. It's also amazing how little has changed in terms of syntax and capabilities, since 2004, when udev was first introduced. My learning resources include:

My motivation for learning udev systematically came from work requirements. The custom Linux image we are building has to have specific devices show up under specific /dev/tty paths. This has to work on multiple physical hardware models, be forward compatible with future devices, and, most importantly, work reliably. For example, the pinpad shows up as /dev/ttyS6 and the weight scale shows up as /dev/ttyS7, no matter which port they are plugged into or which distro the machine currently runs (CentOS or Ubuntu).

Monitoring events

When a device is plugged in or removed, we can monitor the verbose messages by running the monitor subcommand. Ideally, we get important info such as the device node (e.g., /dev/ttyUSB0) and environment variables such as "ACTION=add". If it's a USB device, we can also easily use lsusb to find the vendor ID and device ID.

# udevadm monitor --environment --udev

The next step is to use the device node path to find all the information about this device and its parents.

# udevadm info --attribute-walk --path=$(udevadm info --query=path --name=/dev/ttyUSB0)

Note the message printed out by the above command: "Udevadm info starts with the device specified by the devpath and then walks up the chain of parent devices. It prints for every device found, all possible attributes in the udev rules key format. A rule to match, can be composed by the attributes of the device and the attributes from one single parent device."

It means exactly what it says.

Writing udev rules

Rule files go under /etc/udev/rules.d, and fortunately the path and syntax are distro-agnostic. Some common match keys are "KERNEL/SUBSYSTEM/ATTR". The corresponding match keys for parent devices are "KERNELS/SUBSYSTEMS/ATTRS"; think of them as the plural forms of the former. For a complete list of match keys, refer to the man page. A rule that creates a symlink for a tty device looks like this:

SUBSYSTEM=="tty", KERNELS=="1-7.3", ATTRS{idVendor}=="067b", ATTRS{idProduct}=="23c3", SYMLINK+="ttyS2"

Here, only the first match key, SUBSYSTEM, is matched against the device itself. The three other match keys are matched against a parent device. Note that all parent match keys have to come from the same parent, i.e., you cannot pick and choose match keys from different levels of parent devices.

Some common mistakes I found in other people's rule files:

  • it's not possible to change the device name assigned by the kernel (e.g., NAME="myUSB"). The limitation comes from udev being only a userspace program.
  • most of the time, it's not necessary to specify the ACTION=="add" match key.
  • for symlinks, it's usually not necessary to specify GROUP and MODE, as soft links don't inherit ownership and permissions from the original file. Do it only when you know what you are doing.

Some advanced topics include:

  • string substitutions: udev uses printf-like string substitution operators
  • string matching: much like shell globbing, it accepts "*", "?" and "[]"
  • for remove events, try to leverage environment variables as match keys (ENV{KEY}=="VALUE"), as device attributes may no longer be accessible.
  • run external scripts/programs with RUN+="/path/to/executable"; think of it like a subshell, in which the environment variables differ from the ones in your user shell and there is no stdout/stderr.
  • for systemd integration, refer to Scripting with udev
  • OPTIONS+="last_rule" (I can't think of a possible use case)
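As an example of the remove-event advice above, a rule can match on environment variables rather than attributes and hand off to a script. The script path is hypothetical, the IDs reuse the ones from the earlier rule, and whether these properties actually survive until the remove event should be confirmed with udevadm monitor --environment:

ACTION=="remove", SUBSYSTEM=="tty", ENV{ID_VENDOR_ID}=="067b", ENV{ID_MODEL_ID}=="23c3", RUN+="/usr/local/bin/pinpad-unplugged.sh"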

Triggering new rules

After saving the rules files, manually trigger them against existing devices:

# udevadm trigger
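If a freshly edited rule doesn't seem to be picked up (udevd should normally notice changes in the rules directories on its own), the rules can be reloaded explicitly before triggering again:

# udevadm control --reload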

You will find out instantly whether your rules work or not. Novices like myself may rely on trial and error to develop their first couple of rules, and I shamelessly confess that's how I learned udev. Once I got the basics, it felt like a second language.

Put your computer behind a firewall

A recent task at work required me to investigate a failure on a Linux machine deployed at a customer's site.

I remoted into said machine and quickly found the problem. The log file for the GDM display manager (~/.cache/gdm/access.log) had grown to almost 100 GiB, driving the free space to zero. As a result, the system would crash, the log files would get cleared, and the cycle would repeat.

Checking access.log, I found continuous failed login attempts against port 5900/TCP (the default VNC server port) from malicious bots. I also noticed thousands of failed SSH login attempts on root.

It turns out this machine was assigned a public IP address and was open to the internet. By design, these Linux machines are never meant to be exposed to the open internet, but here we are. I could only try to patch up the firewall as much as possible at the machine level, knowing the machine would otherwise inevitably fall into the hands of a botnet.
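For what it's worth, "patching up the firewall" amounted to a default-deny policy with only the needed services allowed. On an Ubuntu-style box with ufw, that looks roughly like the following sketch (not the exact rules I deployed):

ufw default deny incoming
ufw default allow outgoing
ufw limit 22/tcp      # rate-limit SSH instead of leaving it wide open
ufw enable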

Fingers crossed this particular client won't be owned by ransomware gangs, at least not soon.

Why I think Apple device is a better choice for normal people

I have been a FOSS user and advocate for a few years now. My main computers run Linux, and I work for a company doing Linux-related work. However, I have always had Apple devices around; some were my own purchases (iPhones and iPads) and some were passed on to me (Macs). Before you read on: this post is about my reasoning for recommending Apple devices to non-technical users. It is subjective and heavily biased. You have been warned.

As a long-time (well, since around 2010) mobile user and tech follower, I have to give Apple credit for their integrated hardware and software solutions, especially from a "normie's" perspective. The longevity of OS support, good default privacy settings and the general availability of battery service are the main reasons I say this. Full disclosure: I used to be a Google/Android fan; I've owned several Google-branded Android phones, including the Nexus 4, the original Pixel and the Pixel 3.

Take my own experience as an example: I purchased a refurbished iPhone 8 in 2020 (it originally came out in Fall 2017); I have been using it for 2.5 years now and just had the battery replaced. Now I can keep using the phone until the end of 2023, when (presumably) iOS 16 stops being supported. For a device I paid roughly $300 CAD for, that's a hell of a value. A startling contrast is the Google Pixel 3 (Fall 2018), which lost support from Google after a mere 3 years. People may argue that you can root it and flash it with LineageOS. While that is technically feasible and might be fun for some, I wouldn't even consider this option for normal users.

The next major point is privacy by default. I know some people may disagree and even spit at the idea that Apple is good for privacy, but the truth is that Apple's centrally controlled App Store is doing a lot better than its competitors. For various reasons, people end up forced onto third-party app markets other than the Play Store (the stock ROM defaults to a third-party store; certain apps are not available on the Play Store; or the Play Store simply isn't accessible). Third-party app markets are a wild west, to put it mildly, and their popularity is high in certain regions of the world. Again, I am not endorsing Apple but contrasting it with its Android counterparts. You can draw your own conclusion.

Lastly, there's the ease of battery replacement and its modest cost ($49 CAD for an iPhone 8, at the time of writing). Give me an example of an Android phone battery replacement service that is universally available across North America and costs less than $100 CAD for a 3+ year old device. Probably rarer than a dinosaur. For an iPhone, I can just walk into an authorized store and have it serviced in less than 40 minutes.

All in all, I would personally recommend only Apple devices to my family members. Non-technical people also deserve reasonably good privacy, a serviceable battery and more than 3 years of security updates from a device.

Scripting with nmcli to connect RADIUS/WPA2 Enterprise Wi-Fi network

Recently, a challenge came up at work: a batch of Linux client machines that are going to be deployed onsite need to connect to an enterprise Wi-Fi network backed by a RADIUS authentication server.

Due to the sheer number of client machines, it is impractical to configure them individually using NetworkManager's GUI. So I decided to write a small script that automates the process using NetworkManager's command-line interface: nmcli.

The script is very straightforward: it reads the machine's current static IP address, turns on the Wi-Fi radio and connects to a pre-configured Wi-Fi network with a static IP and manual DNS/gateway settings.

#!/bin/bash

currentstaticip=$(ip -4 --brief address | grep -m1 192.168 | awk '{print $3}')
echo "The static IP address of $HOSTNAME is $currentstaticip"

# Turn Wi-Fi on and scan for Wi-Fi signals
nmcli radio wifi on
sleep 3

# Configure wlan0 connection
nmcli con modify wlan0 802-11-wireless.ssid THE-SSID

nmcli con modify wlan0 802-1x.eap peap 802-1x.identity THE-IDENTITY \
802-1x.password THE-PASSWD \
802-1x.phase2-auth mschapv2 \
802-11-wireless-security.key-mgmt wpa-eap

nmcli con modify wlan0 ipv4.method manual
nmcli con modify wlan0 ipv4.address $currentstaticip
nmcli con modify wlan0 ipv4.dns 8.8.8.8,1.1.1.1
nmcli con modify wlan0 ipv4.gateway 192.168.x.1

# Connect
nmcli con up "wlan0"
nmcli con modify "wlan0" wifi.hidden yes
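After running the script, a quick sanity check along these lines confirms the connection actually came up (the gateway address is whatever was configured above):

nmcli device status
nmcli con show --active
ping -c 3 192.168.x.1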

The only part that required trial and error was the sequence in which security and identity information is supplied to the RADIUS server. Every RADIUS setup is different, and what worked in this scenario may not work under a different setup. On the other hand, there aren't a lot of scripting examples on the internet that deal with enterprise Wi-Fi. All in all, it took me a few hours of reading man pages to come up with this solution.

I hope it brings value to people who are struggling with similar problems.