Strange issue on OPNsense with Unbound DNS

I'm documenting an issue that is quite possibly the strangest I have ever experienced in IT. At the time of writing, I have managed to fix it, but I still don't know why it happened in the first place.

On a peaceful Saturday afternoon, out of nowhere, the internet stopped working. In panic mode (I wish I were exaggerating), I rushed down to my computer and started troubleshooting. "Thankfully", it was only a DNS issue, as ping 1.1.1.1 still worked.
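For anyone who wants to repeat that first check in a script, here is a minimal Python sketch of the same idea: test a literal IP address and a hostname separately, then compare the two results. It uses a TCP connection instead of ping, because ICMP needs raw sockets. The function names, the default port 443 and the test hostname `example.com` are my own choices for illustration.

```python
import socket


def ip_reachable(host="1.1.1.1", port=443, timeout=3.0):
    """Return True if a TCP connection to a literal IP succeeds (no DNS involved)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def name_resolves(name="example.com"):
    """Return True if the system resolver can turn the name into an address."""
    try:
        socket.getaddrinfo(name, 443)
        return True
    except socket.gaierror:
        return False


def diagnose(ip_ok, dns_ok):
    """Turn the two probe results into a rough verdict."""
    if ip_ok and dns_ok:
        return "connectivity and DNS both look fine"
    if ip_ok:
        return "DNS problem: IP traffic works but names do not resolve"
    if dns_ok:
        return "names resolve (possibly from a cache) but the IP probe failed"
    return "no connectivity at all"
```

Calling `diagnose(ip_reachable(), name_resolves())` on an affected client should land in the "DNS problem" branch in a situation like mine.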

A quick background on my router setup: I use OPNsense with Unbound DNS as the recursive DNS server for my entire LAN. It's a pretty normal setup; in fact, it's the default.

Restarting the Unbound DNS service did not help; rebooting OPNsense did not help; power cycling the ISP ONT box didn't help, either.

As the whole family needed internet access ASAP, I applied a quick fix by turning off the Unbound DNS service entirely and letting OPNsense hand out an upstream public DNS service instead.

(A quick note: only Proxmox did not receive the newly advertised public DNS server, as its network settings are "static"; other client devices picked up the change shortly afterwards.)

In the evening, I had some time to troubleshoot further. I disabled DNSBL, domain overrides and some other fancy features I had enabled in Unbound, making it almost a vanilla install. Still no luck. I noticed that clients were not able to reach the DNS server (gateway/VLAN interface/network address, it's all the same thing) over port 53; however, the logs still didn't show anything useful.
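In hindsight, a probe that tells "no answer at all" apart from "answered, but refused" would have narrowed things down faster. Below is a minimal Python sketch that hand-builds one DNS query, sends it over UDP to port 53 and reports the response code. As far as I understand the Unbound documentation, an ACL action of `refuse` answers with rcode REFUSED while `deny` silently drops the query. That OPNsense's "refuse" setting maps directly onto that behaviour is my assumption, and the helper names and the test hostname are made up for this example.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL",
          3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}


def build_query(name, qid=0x1234):
    """Build a minimal DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def rcode_name(response):
    """Return the response code name from the low 4 bits of the flags field."""
    if len(response) < 4:
        return "MALFORMED"
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(server, name="example.com", timeout=3.0):
    """Send one UDP query to server:53 and report what came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "TIMEOUT"
        except OSError as exc:
            return f"ERROR: {exc}"
    return rcode_name(response)
```

Running `probe()` against the gateway address from a LAN client gives a rough split: `TIMEOUT` suggests a firewall rule, a dropped query or a dead service, while `REFUSED` suggests the resolver is running but its access list is rejecting the client.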

In a final attempt, I pulled up the OPNsense documentation and started examining my config line by line. I noticed that the default action under Access Lists (ACL) was set to "refuse". After I changed it back to the default "accept", the DNS service immediately started working.
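To illustrate why a single default action can take out DNS for a whole LAN, here is a simplified Python model of an access list with a fallback. It follows the rule Unbound documents for its `access-control` entries, where the most specific matching netblock wins. This is not Unbound's or OPNsense's actual code, and the networks and action names in the usage note are made up.

```python
import ipaddress


def acl_action(client_ip, rules, default_action):
    """Pick the action of the most specific matching network, else the default.

    rules is a list of (network, action) pairs, e.g. ("192.168.10.0/24", "allow").
    """
    addr = ipaddress.ip_address(client_ip)
    best = None
    for network, action in rules:
        net = ipaddress.ip_network(network)
        # Only compare like with like (IPv4 vs IPv6), then test membership.
        if addr.version == net.version and addr in net:
            if best is None or net.prefixlen > best[0].prefixlen:
                best = (net, action)
    return best[1] if best else default_action
```

With `rules = [("192.168.10.0/24", "allow")]` and a default of `"refuse"`, a client at `192.168.10.5` is still allowed, but a client on any VLAN without its own entry, such as `192.168.20.5`, falls through to the default and gets refused.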

OK, problem solved, but questions started popping up. Why was the ACL changed? Nobody had touched the box in months; the automated upgrades happened weeks ago; what gives?

Since I rebooted OPNsense shortly after the incident, I lost the logs; and since I had only configured the system to keep 5 backups, I used them all up while troubleshooting and overwrote the copy of the config from before the reboot.

Searching the internet didn't turn up much of an answer, either. Nobody else seems to have had the Unbound ACL suddenly flip from accept to refuse. The real cause of this issue may never be known.

After this, I will consider keeping logs for longer, and maybe even look into remote logging options. Keeping more automatic backups (going from 5 to, say, 20) is also a good idea.
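Backups are only useful for this kind of mystery if you can see what changed between them. OPNsense stores its configuration as XML, so a generic comparison of two saved copies is enough to spot a flipped setting. The following Python sketch flattens each file into path/value pairs and lists the differences. It is a rough illustration that knows nothing about OPNsense's schema, and it ignores XML attributes.

```python
import xml.etree.ElementTree as ET


def flatten(xml_text):
    """Map each leaf element's indexed path to its (stripped) text."""
    root = ET.fromstring(xml_text)
    leaves = {}

    def walk(node, path):
        children = list(node)
        if not children:
            leaves[path] = (node.text or "").strip()
            return
        seen = {}
        for child in children:
            # Index repeated sibling tags so their paths stay unique.
            index = seen.get(child.tag, 0)
            seen[child.tag] = index + 1
            walk(child, f"{path}/{child.tag}[{index}]")

    walk(root, root.tag)
    return leaves


def diff_configs(old_xml, new_xml):
    """Return (path, old_value, new_value) for every leaf that differs."""
    old, new = flatten(old_xml), flatten(new_xml)
    changes = []
    for path in sorted(old.keys() | new.keys()):
        before, after = old.get(path), new.get(path)
        if before != after:
            changes.append((path, before, after))
    return changes
```

Feeding it the contents of an older and a newer config backup returns one tuple per changed, added or removed setting, with `None` standing in for a value that only exists on one side.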
