OPNsense zpool upgrade

TLDR: I enabled new ZFS feature flags on the OPNsense boot pool (out of ignorance) and had to update the UEFI boot code to avoid "bricking" the box.
I want to document this unsettling experience for anyone who has walked the same path and is desperately searching for a remedy.

Background

I was doing an OPNsense major version upgrade from 25.1 to 25.7. Things went pretty smoothly, and I did some post-upgrade checks. One of them was zpool status -v, which told me there were new feature flags that could be enabled on the ZFS pool.
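
For reference, here is a minimal sketch of how the not-yet-enabled features could be listed before deciding anything. The helper name disabled_features is made up for illustration; it only filters the tab-separated output of zpool get -H, and the live call is skipped on machines without zpool.

```shell
#!/bin/sh
# disabled_features (hypothetical helper): read "property<TAB>value" lines on
# stdin, the format produced by `zpool get -H -o property,value all POOL`,
# and print the names of feature flags whose value is "disabled".
disabled_features() {
    awk -F '\t' '$1 ~ /^feature@/ && $2 == "disabled" {
        sub(/^feature@/, "", $1)
        print $1
    }'
}

# Live check, only where zpool exists. "zroot" is the pool name from this post.
if command -v zpool >/dev/null 2>&1; then
    zpool get -H -o property,value all zroot 2>/dev/null | disabled_features
fi
```

Listing the features this way changes nothing on the pool; only zpool upgrade does.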

Story

Without thinking too much (read: at all), I went ahead and did zpool upgrade -a. Here is the output:

root@OPNsense:/home/ewon # zpool upgrade -a
This system supports ZFS pool feature flags.

Enabled the following features on 'zroot':
  edonr
  zilsaxattr
  head_errlog
  blake3
  block_cloning
  vdev_zaps_v2

Pool 'zroot' has the bootfs property set, you might need to update
the boot code. See gptzfsboot(8) and loader.efi(8) for details.
root@OPNsense:/home/ewon #

The seemingly casual sentence "you might need to update the boot code" caught my attention. I went searching and found this forum post. A cold shiver ran down my spine and I broke into a sweat: the boot loader still on disk predated these feature flags and might not be able to read the pool any more. If I hadn't caught this, the next reboot would have sent my home network to hell, literally.
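
In hindsight, a check along these lines before running the upgrade would have flagged the risk. It is only a sketch: is_boot_pool is a made-up helper name, and the live part assumes the pool is called zroot as on my box.

```shell
#!/bin/sh
# is_boot_pool VALUE (hypothetical helper): succeed when VALUE, the pool's
# bootfs property, is set. `zpool get` prints "-" for an unset property.
is_boot_pool() {
    [ -n "$1" ] && [ "$1" != "-" ]
}

pool="zroot"   # pool name from this post; adjust as needed
if command -v zpool >/dev/null 2>&1; then
    bootfs=$(zpool get -H -o value bootfs "$pool" 2>/dev/null)
    if is_boot_pool "$bootfs"; then
        echo "$pool boots from $bootfs: refresh the boot code before rebooting after a zpool upgrade"
    fi
fi
```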

Fix

Luckily, by following what people shared in that post and updating the UEFI boot code, I was able to avert a crisis.

cp /boot/loader.efi /boot/efi/efi/boot/bootx64.efi
cp /boot/loader.efi /boot/efi/efi/freebsd/loader.efi
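
A slightly more cautious variant of the same copy could keep the old loader around and verify the result. This is a sketch, not what I actually ran: update_loader and the backup directory are names I made up, and the paths in the usage comments are the ones from my system.

```shell
#!/bin/sh
# update_loader SRC DEST BACKUP_DIR (hypothetical helper):
# save the current DEST into BACKUP_DIR, copy SRC over DEST, then confirm
# that the two files are byte-identical.
update_loader() {
    src=$1 dest=$2 backup_dir=$3
    [ -f "$src" ] || { echo "missing source: $src" >&2; return 1; }
    if [ -f "$dest" ]; then
        mkdir -p "$backup_dir" || return 1
        cp "$dest" "$backup_dir/$(basename "$dest").old" || return 1
    fi
    cp "$src" "$dest" || return 1
    cmp -s "$src" "$dest"
}

# Usage with the paths from this post (the backup directory is an assumption):
#   update_loader /boot/loader.efi /boot/efi/efi/boot/bootx64.efi   /root/efi-backup
#   update_loader /boot/loader.efi /boot/efi/efi/freebsd/loader.efi /root/efi-backup
```

The backups go to a directory outside the EFI partition on purpose, since that partition can be small.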

If your machine boots in BIOS mode, do this instead (adjust the partition index and disk name to match your layout):

gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 2 da0
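
The -i 2 and da0 above depend on the disk layout. A small sketch for looking up the index of the freebsd-boot partition from gpart show output follows; boot_partition_index is a made-up helper name, and the live part only runs where gpart exists.

```shell
#!/bin/sh
# boot_partition_index (hypothetical helper): read `gpart show DISK` output on
# stdin and print the index (column 3) of the freebsd-boot partition row.
boot_partition_index() {
    awk '$4 == "freebsd-boot" { print $3 }'
}

# Live lookup, only where the FreeBSD tools exist. "da0" is the disk from the
# command above and may differ on other machines.
if command -v gpart >/dev/null 2>&1; then
    # On FreeBSD amd64 this sysctl reports how the system booted (UEFI or BIOS).
    sysctl -n machdep.bootmethod 2>/dev/null
    gpart show da0 2>/dev/null | boot_partition_index
fi
```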

From now on, I won't run zpool upgrade on OPNsense. It should be left alone as a network appliance, not treated as a storage server.

Strange issue on OPNsense with Unbound DNS

I'm documenting an issue, quite possibly the strangest I have ever experienced in IT. At the time of writing I have managed to fix it, but I still don't know why it happened in the first place.

On a peaceful Saturday afternoon, out of nowhere, the internet stopped working. In panic mode (I wish I were exaggerating), I rushed down to my computer and started troubleshooting. "Thankfully", it was only a DNS issue, as ping 1.1.1.1 still worked.

A quick background on my router setup: I use OPNsense with Unbound DNS as the recursive DNS server for my entire LAN. It's a pretty normal setup; in fact, it's the default.

Restarting the Unbound DNS service did not help; rebooting OPNsense did not help; power cycling the ISP ONT box didn't help, either.

As the whole family needed internet access ASAP, I applied a quick fix: I turned off the Unbound DNS service entirely and let OPNsense use an upstream public DNS service instead.

(A quick note: only Proxmox did not pick up the newly advertised public DNS server, as its network settings are "static"; the other client devices received the change shortly afterwards.)

In the evening, I had some time to troubleshoot further. I disabled DNSBL, domain overrides and some other fancy features I had enabled in Unbound, making it almost a vanilla install. Still no luck. I noticed the clients were not able to reach the DNS server (gateway/VLAN interface/network address, it's all the same thing) over port 53; however, the logs still didn't show anything useful.
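
One check that might have narrowed things down faster is looking at the response code of a direct query. As far as I understand unbound.conf(5), the "refuse" action answers with rcode REFUSED while "deny" silently drops the query, so a refused reply looks different from a timeout. The sketch below uses made-up helper names (dns_rcode, check_resolver), and 192.168.1.1 in the usage comment is a placeholder address, not my actual gateway.

```shell
#!/bin/sh
# dns_rcode (hypothetical helper): read dig or drill output on stdin and print
# the response code from the header line. dig labels it "status:", drill
# labels it "rcode:".
dns_rcode() {
    sed -nE 's/.*->>HEADER<<-.*(status|rcode): ([A-Za-z]+),.*/\2/p'
}

# check_resolver ADDR (hypothetical helper): query ADDR with drill, which is
# part of the FreeBSD base system, and print the rcode, or "no reply" when
# nothing usable comes back.
check_resolver() {
    rcode=$(drill example.com @"$1" 2>/dev/null | dns_rcode)
    echo "${rcode:-no reply}"
}

# Usage (placeholder address):
#   check_resolver 192.168.1.1
```

A result of REFUSED would point at an access list, while "no reply" would point more towards a firewall rule or a service that is not listening.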

In a final attempt, I pulled up the OPNsense documentation and started examining my config line by line. I noticed that the Access Lists (ACL) default action was set to "refuse". After I changed it back to the default "accept", the DNS service immediately started working.
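
The same setting can also be looked for in the generated Unbound configuration, where access lists appear as access-control: lines with a netblock and an action such as allow, refuse or deny. A rough sketch follows; blocking_rules is a made-up helper, and the file path in the usage comment is an assumption, since where OPNsense writes these lines may differ between versions.

```shell
#!/bin/sh
# blocking_rules FILE (hypothetical helper): print the netblock and action of
# every access-control line in FILE whose action is "refuse" or "deny".
blocking_rules() {
    awk '$1 == "access-control:" && ($3 == "refuse" || $3 == "deny") {
        print $2, $3
    }' "$1"
}

# Usage (the path is an assumption, check where your version keeps it):
#   blocking_rules /var/unbound/unbound.conf
```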

Ok, problem solved, but questions started popping up. Why was the ACL changed? Nobody had touched the box in months; the automated upgrades had happened weeks earlier; what gives?

Since I rebooted OPNsense shortly after the incident, I lost the logs; and since I had only configured the system to keep 5 config backups, I used them up quickly while troubleshooting and overwrote the copy of the config from before the reboot.

Searching the internet didn't turn up much, either. Nobody else seems to have had their Unbound ACL suddenly flip from accept to refuse. The real cause of this issue may never be known.

After this, I will consider keeping logs for longer, and maybe even remote logging. Enabling more automatic backups (from 5 to, say, 20) is also a good idea.
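
Besides raising the backup count in the GUI, an independent timestamped copy of the config could be kept by a small script. This is only a sketch of the idea: snapshot_config and the snapshot directory are made up, and /conf/config.xml is where OPNsense keeps its live configuration.

```shell
#!/bin/sh
# snapshot_config SRC DIR KEEP (hypothetical helper): copy SRC into DIR under
# a timestamped name, then delete all but the newest KEEP snapshots.
snapshot_config() {
    src=$1 dir=$2 keep=$3
    mkdir -p "$dir" || return 1
    cp "$src" "$dir/config-$(date +%Y%m%d-%H%M%S).xml" || return 1
    # Timestamped names sort chronologically, so in reverse order everything
    # after the first $keep entries is an old snapshot.
    ls -1 "$dir" | grep '^config-.*\.xml$' | sort -r | tail -n +"$((keep + 1))" |
    while read -r old; do
        rm -f "$dir/$old"
    done
}

# Usage (the snapshot directory is an assumption):
#   snapshot_config /conf/config.xml /root/config-snapshots 20
```

Because these copies are separate from the built-in backups, saving the config repeatedly during troubleshooting would not push them out.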