Unbound getting random SERVFAIL for any domain

Expected Behaviour:

Unbound should resolve most domains in a timely manner.

Actual Behaviour:

Unbound randomly fails to resolve domains that it has resolved successfully in the past, with no apparent pattern to which domains are affected. The issue clears up either after refreshing the page repeatedly or after waiting a while before trying again.

Debug Token:

https://tricorder.pi-hole.net/1N2CtGxA/

I just finished starting over from a clean slate: a DietPi VM running Debian Bookworm, freshly pulled from DietPi Downloads, with both Pi-hole and Unbound installed using dietpi-software. The problem persists.
I uploaded the debug token just in case, but I'm fairly sure the problem lies with Unbound, since switching to another upstream DNS provider works without any issue. Here's my current config:

/etc/unbound/unbound.conf:include-toplevel: "/etc/unbound/unbound.conf.d/*.conf"
/etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf:server:
/etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf:    auto-trust-anchor-file: "/var/lib/unbound/root.key"
/etc/unbound/unbound.conf.d/remote-control.conf:remote-control:
/etc/unbound/unbound.conf.d/remote-control.conf:  control-enable: yes
/etc/unbound/unbound.conf.d/remote-control.conf:  control-interface: /run/unbound.ctl
/etc/unbound/unbound.conf.d/dietpi.conf:server:
/etc/unbound/unbound.conf.d/dietpi.conf:        do-daemonize: no
/etc/unbound/unbound.conf.d/dietpi.conf:        num-threads: 2
/etc/unbound/unbound.conf.d/dietpi.conf:        verbosity: 0
/etc/unbound/unbound.conf.d/dietpi.conf:        log-queries: no
/etc/unbound/unbound.conf.d/dietpi.conf:        log-replies: no
/etc/unbound/unbound.conf.d/dietpi.conf:        interface: 127.0.0.1
/etc/unbound/unbound.conf.d/dietpi.conf:        port: 5335
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: 0.0.0.0/0 refuse
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: 10.0.0.0/8 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: 127.0.0.1/8 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: 172.16.0.0/12 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: 192.168.0.0/16 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: ::/0 refuse
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: ::1/128 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: fd00::/8 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        access-control: fe80::/10 allow
/etc/unbound/unbound.conf.d/dietpi.conf:        private-address: 10.0.0.0/8
/etc/unbound/unbound.conf.d/dietpi.conf:        private-address: 172.16.0.0/12
/etc/unbound/unbound.conf.d/dietpi.conf:        private-address: 192.168.0.0/16
/etc/unbound/unbound.conf.d/dietpi.conf:        private-address: 169.254.0.0/16
/etc/unbound/unbound.conf.d/dietpi.conf:        private-address: fd00::/8
/etc/unbound/unbound.conf.d/dietpi.conf:        private-address: fe80::/10
/etc/unbound/unbound.conf.d/dietpi.conf:        do-udp: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        do-tcp: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        do-ip4: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        do-ip6: no
/etc/unbound/unbound.conf.d/dietpi.conf:        root-hints: "/var/lib/unbound/root.hints"
/etc/unbound/unbound.conf.d/dietpi.conf:        ratelimit: 1000
/etc/unbound/unbound.conf.d/dietpi.conf:        unwanted-reply-threshold: 10000
/etc/unbound/unbound.conf.d/dietpi.conf:        edns-buffer-size: 1232
/etc/unbound/unbound.conf.d/dietpi.conf:        so-rcvbuf: 4m
/etc/unbound/unbound.conf.d/dietpi.conf:        so-sndbuf: 4m
/etc/unbound/unbound.conf.d/dietpi.conf:        harden-glue: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        harden-dnssec-stripped: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        harden-algo-downgrade: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        harden-large-queries: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        harden-short-bufsize: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        use-caps-for-id: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        rrset-roundrobin: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        qname-minimisation: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        minimal-responses: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        hide-identity: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        identity: "Server"
/etc/unbound/unbound.conf.d/dietpi.conf:        hide-version: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        cache-min-ttl: 300
/etc/unbound/unbound.conf.d/dietpi.conf:        cache-max-ttl: 86400
/etc/unbound/unbound.conf.d/dietpi.conf:        serve-expired: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        neg-cache-size: 4M
/etc/unbound/unbound.conf.d/dietpi.conf:        prefetch: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        prefetch-key: yes
/etc/unbound/unbound.conf.d/dietpi.conf:        msg-cache-size: 16m
/etc/unbound/unbound.conf.d/dietpi.conf:        rrset-cache-size: 32m

This was auto-generated by installing via dietpi-software, with some minor modifications. I've also tried following the official Pi-hole Unbound guide to a T, and while the config there has simpler settings, the problem still occurred.
Here are the results of the DNSSEC validation test from that same guide (though I've read these test domains are no longer that reliable):

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> fail01.dnssec.works @127.0.0.1 -p 5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 47723
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;fail01.dnssec.works.           IN      A

;; Query time: 2243 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1) (UDP)
;; WHEN: Fri Oct 20 18:49:13 BST 2023
;; MSG SIZE  rcvd: 48

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> dnssec.works @127.0.0.1 -p 5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40153
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;dnssec.works.                  IN      A

;; ANSWER SECTION:
dnssec.works.           3593    IN      A       5.45.107.88

;; Query time: 31 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1) (UDP)
;; WHEN: Fri Oct 20 18:49:19 BST 2023
;; MSG SIZE  rcvd: 57

Both are the expected outputs.
Other things I've noted:

  1. For any domain that Unbound fails to resolve, I can sometimes look it up using dig on the Pi-hole machine itself, where it returns NOERROR, and the website suddenly loads again afterwards. This might be a coincidence though, since at other times dig just returns SERVFAIL as well.
  2. Restarting the unbound service sometimes works in a pinch: any domain that couldn't be resolved works again, until Unbound stops resolving some domains again later on.

Before this ordeal I was running DietPi on Debian Bullseye just fine. The problem started a few days ago, when I noticed several websites weren't loading at all. I'm pulling my hair out at this point. I've been reading threads about various setups across different forums without success. Maybe someone can spot what I'm missing. Could this be a problem with my ISP?

Please post some sample digs for domains that aren't resolving.

Sure, let's start with this very forum:

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> discourse.pi-hole.net @127.0.0.1 -p 5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 20651
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;discourse.pi-hole.net.         IN      A

;; Query time: 35 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1) (UDP)
;; WHEN: Sat Oct 21 07:03:26 BST 2023
;; MSG SIZE  rcvd: 50

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> mangadex.org @127.0.0.1 -p 5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 65176
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;mangadex.org.                  IN      A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1) (UDP)
;; WHEN: Sat Oct 21 07:06:38 BST 2023
;; MSG SIZE  rcvd: 41

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> www.3blue1brown.com @127.0.0.1 -p 5335
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 49395
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;www.3blue1brown.com.           IN      A

;; Query time: 0 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1) (UDP)
;; WHEN: Sat Oct 21 07:24:43 BST 2023
;; MSG SIZE  rcvd: 48

; <<>> DiG 9.18.19-1~deb12u1-Debian <<>> @127.0.0.1 -p 5335 see.stanford.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2587
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;see.stanford.edu.              IN      A

;; Query time: 667 msec
;; SERVER: 127.0.0.1#5335(127.0.0.1) (UDP)
;; WHEN: Sat Oct 21 07:28:08 BST 2023
;; MSG SIZE  rcvd: 45

These are the ones that failed on me just now. Everything (including the domains above) is working normally again, however, which is very annoying to say the least. I'll edit the post and add more examples as I come across them.
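Since the failures come and go, it may help to log them as they happen rather than catching them by hand. The following is only a rough sketch, assuming dig is installed and Unbound listens on 127.0.0.1 port 5335 as in the config above. The names parse_dig, probe and watch are made up for illustration, and the parsing relies only on the "status:" and "Query time:" lines visible in the dig output above.

```python
import re
import subprocess
import time

# Patterns for the two lines of interest in dig's default text output.
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")
TIME_RE = re.compile(r"Query time:\s*(\d+)\s*msec")


def parse_dig(output):
    """Return (status, query_time_ms) from dig output; None where missing."""
    status = STATUS_RE.search(output)
    qtime = TIME_RE.search(output)
    return (
        status.group(1) if status else None,
        int(qtime.group(1)) if qtime else None,
    )


def probe(domain, server="127.0.0.1", port=5335):
    """Run dig once against the local Unbound instance and parse the reply."""
    result = subprocess.run(
        ["dig", domain, f"@{server}", "-p", str(port)],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return parse_dig(result.stdout)


def watch(domains, interval=60, rounds=10):
    """Probe each domain every `interval` seconds; print non-NOERROR results."""
    for _ in range(rounds):
        for domain in domains:
            status, ms = probe(domain)
            if status != "NOERROR":
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                print(stamp, domain, status, ms, flush=True)
        time.sleep(interval)
```

Calling something like watch(["discourse.pi-hole.net", "mangadex.org"]) would then print a timestamped line for every failed lookup, which makes it easier to compare the failure times against the Unbound log or the ISP connection.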

Domain-specific failures are often related to maintenance work on the name servers authoritative for that domain, to a change of the domain's digital signatures (as used by DNSSEC-validating resolvers like unbound), or sometimes to misconfigurations.
Those kinds of failures usually get sorted out quickly by the domain's maintainers, i.e. they don't last for long.

DNSSEC also relies on correct time information, so if your host's clock is off by too much, DNSSEC validation and thus DNS resolution will fail. However, that may not quite match your observation, as it would affect all DNS requests alike until the time is in sync again.
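To illustrate why the clock matters: each DNSSEC signature (RRSIG record) carries an inception and an expiration time, shown by dig +dnssec in YYYYMMDDHHMMSS form (UTC), and a validator only accepts the signature while its own clock falls inside that window. Here is a minimal sketch of that check. The function names and the timestamps in the usage note are made up for illustration, and a real validator such as unbound does considerably more than this.

```python
from datetime import datetime, timezone


def parse_rrsig_time(stamp):
    """Parse an RRSIG timestamp in YYYYMMDDHHMMSS form, interpreted as UTC."""
    return datetime.strptime(stamp, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def signature_valid_at(inception, expiration, now=None):
    """Return True if `now` lies within the signature's validity window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return parse_rrsig_time(inception) <= now <= parse_rrsig_time(expiration)
```

For example, with a hypothetical window of 20231015000000 to 20231029000000, a host whose clock reads 21 October 2023 would accept the signature, while a host whose clock had drifted to 5 November 2023 would reject it, for every signed domain alike.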

An RTC (real-time clock) module for 3-5 Euro/USD is a very effective way to avoid time mismatches between systems.

Many thanks for the suggestions, guys, but I've already ruled out the clock being out of sync. It is one of the main concerns when DNSSEC is involved, so I made sure that's not the problem. Not to mention that, like I said, this setup was working just fine not long ago.
