Some websites are suddenly failing to resolve on my Pi-hole + Unbound without any config changes. Query log shows a lot of "Retried" and some "SERVFAIL" errors when it happens. Refreshing "fixes" the page until it happens again

The issue I am facing:

I have been using my Pi-hole + Unbound for a couple weeks without issue, when I suddenly started having this error randomly pop up in Firefox (with its DoH disabled) for a bunch of different pages:

It is "fixed" each time by simply refreshing the page once or twice. I was wondering what might be causing it, so I went to my Pi-hole query log and saw entries that look like this:

There is usually a Retried status immediately followed by a BOGUS / SERVFAIL. I searched the query log for the past seven days and found over 3000 Retried entries across several different devices, not just this one.
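For anyone wanting to reproduce that count outside the web interface, here is a minimal Python sketch that tallies Retried entries per client from the FTL long-term database. It rests on assumptions that should be checked against the Pi-hole docs for your FTL version: that the database exposes a `queries` table (or view) with `timestamp`, `status` and `client` columns, and that status codes 12 and 13 denote Retried entries. The function name `retried_per_client` is made up for illustration.

```python
import sqlite3
import time

# Assumption: FTL status codes 12 ("Retried") and 13 ("Retried, ignored").
RETRIED_STATUSES = (12, 13)


def retried_per_client(conn, days=7, now=None):
    """Count Retried entries per client over the last `days` days.

    `conn` is an open sqlite3 connection; `now` is a Unix timestamp
    (defaults to the current time) and exists mainly to make testing easy.
    """
    now = time.time() if now is None else now
    since = int(now - days * 86400)
    cursor = conn.execute(
        "SELECT client, COUNT(*) FROM queries "
        "WHERE timestamp >= ? AND status IN (?, ?) "
        "GROUP BY client ORDER BY COUNT(*) DESC",
        (since, *RETRIED_STATUSES),
    )
    return cursor.fetchall()
```

A possible way to call it, assuming the default database path /etc/pihole/pihole-FTL.db and read access to that file: open the database read-only with `sqlite3.connect("file:/etc/pihole/pihole-FTL.db?mode=ro", uri=True)` and print the rows returned by `retried_per_client(conn)`.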

I haven't made any configuration changes in the past few weeks, so I'm not sure why it has suddenly started to fail when trying to resolve pages. The only thing I can think of is that I changed the speed of my AT&T broadband plan about the same time, and maybe they pushed some sort of change to their BGW210 modem I'm forced to use.

After doing some Googling, others who've had lots of Retried and/or SERVFAIL errors said they might have an issue with packet filtering upstream of the Pi-hole on Port 53 or something, but I haven't been able to figure out if that's my problem, or even how to fix it. Any help would be appreciated!

Details about my system:

  • AT&T fiber jack > Arris BGW210 modem/router [in pass-through mode] > AmpliFi Instant Router > Gigabit switch > Raspberry Pi (+ rest of network)

  • Debug Token: https://tricorder.pi-hole.net/x6im0qciqu

  • Contents of /etc/unbound/unbound.conf.d/pi-hole.conf:

server:
    # If no logfile is specified, syslog is used
    # logfile: "/var/log/unbound/unbound.log"
    verbosity: 0

    interface: 127.0.0.1
    port: 5335
    do-ip4: yes
    do-udp: yes
    do-tcp: yes

    # May be set to yes if you have IPv6 connectivity
    do-ip6: no

    # You want to leave this to no unless you have *native* IPv6. With 6to4 and
    # Teredo tunnels your web browser should favor IPv4 for the same reasons
    prefer-ip6: no

    # Use this only when you downloaded the list of primary root servers!
    root-hints: "/var/lib/unbound/root.hints"

    # Trust glue only if it is within the server's authority
    harden-glue: yes

    # Require DNSSEC data for trust-anchored zones, if such data is absent, the zone becomes BOGUS
    harden-dnssec-stripped: yes

    # Don't use Capitalization randomization as it is known to cause DNSSEC issues sometimes
    # see https://discourse.pi-hole.net/t/unbound-stubby-or-dnscrypt-proxy/9378 for further details
    use-caps-for-id: no

    # Reduce EDNS reassembly buffer size.
    # Suggested by the unbound man page to reduce fragmentation reassembly problems
    edns-buffer-size: 1472

    # Perform prefetching of close to expired message cache entries
    # This only applies to domains that have been frequently queried
    prefetch: yes

    # One thread should be sufficient, can be increased on beefy machines. In reality for most users running on small networks or on a single machine, it should be unnecessary to seek performance enhancement by increasing num-threads above 1.
    num-threads: 1

    # Ensure kernel buffer is large enough to not lose messages in traffic spikes
    so-rcvbuf: 1m

    # Ensure privacy of local IP ranges
    private-address: 192.168.0.0/16
    private-address: 169.254.0.0/16
    private-address: 172.16.0.0/12
    private-address: 10.0.0.0/8
    private-address: fd00::/8
    private-address: fe80::/10

What I have changed since installing Pi-hole:

A BOGUS reply for a DNSSEC signed record indicates that a signed record was found, and the signature was bad.

If you are seeing this randomly, it may indicate that the date/time on the Pi is not accurate. DNSSEC validation requires accurate time, because each signature is only valid between its inception and expiration timestamps.
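As a rough illustration of that time dependency, the sketch below checks whether a clock reading falls inside a signature's validity window, which is the condition a validating resolver tests for each RRSIG record. The function names and the sample timestamps are made up for illustration. Real validators also tolerate some clock skew, which this sketch ignores.

```python
from datetime import datetime, timezone


def rrsig_time(text: str) -> datetime:
    """Parse an RRSIG timestamp in its YYYYMMDDHHmmSS presentation form (UTC)."""
    return datetime.strptime(text, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)


def signature_is_current(inception: str, expiration: str, now: datetime) -> bool:
    """Return True if `now` lies inside the signature's validity window."""
    return rrsig_time(inception) <= now <= rrsig_time(expiration)


# Hypothetical two-week validity window.
inception, expiration = "20210701000000", "20210715000000"

# A correct clock inside the window, and a clock stuck at the Unix epoch
# (for example a Pi that booted before it could sync its time).
good_clock = datetime(2021, 7, 8, 12, 0, tzinfo=timezone.utc)
bad_clock = datetime(1970, 1, 1, tzinfo=timezone.utc)

print(signature_is_current(inception, expiration, good_clock))
print(signature_is_current(inception, expiration, bad_clock))
```

With the wrong clock the same, perfectly good signature looks invalid, which is why a drifting clock can turn signed domains BOGUS.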

You can also check the unbound anchor.

docs.pi-hole.net isn't a DNSSEC signed domain.

Ah, I thought BOGUS meant it couldn't find anything and was the reason my browser kept coming up short. I'm still having the issue where pages randomly fail to load, though. Could there be a broader outage elsewhere?

I ran timedatectl status and verified the time/date are correct. I'm not sure what you mean by checking the unbound anchor, but here's the output from cat /var/lib/unbound/root.hints:

;       This file holds the information on root name servers needed to
;       initialize cache of Internet domain name servers
;       (e.g. reference this file in the "cache  .  <file>"
;       configuration file of BIND domain name servers).
;
;       This file is made available by InterNIC
;       under anonymous FTP as
;           file                /domain/named.cache
;           on server           FTP.INTERNIC.NET
;       -OR-                    RS.INTERNIC.NET
;
;       last update:     June 24, 2021
;       related version of root zone:     2021062401
;
; FORMERLY NS.INTERNIC.NET
;
.                        3600000      NS    A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.      3600000      A     198.41.0.4
A.ROOT-SERVERS.NET.      3600000      AAAA  2001:503:ba3e::2:30
;
; FORMERLY NS1.ISI.EDU
;
.                        3600000      NS    B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.      3600000      A     199.9.14.201
B.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:200::b
;
; FORMERLY C.PSI.NET
;
.                        3600000      NS    C.ROOT-SERVERS.NET.
C.ROOT-SERVERS.NET.      3600000      A     192.33.4.12
C.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2::c
;
; FORMERLY TERP.UMD.EDU
;
.                        3600000      NS    D.ROOT-SERVERS.NET.
D.ROOT-SERVERS.NET.      3600000      A     199.7.91.13
D.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2d::d
;
; FORMERLY NS.NASA.GOV
;
.                        3600000      NS    E.ROOT-SERVERS.NET.
E.ROOT-SERVERS.NET.      3600000      A     192.203.230.10
E.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:a8::e
;
; FORMERLY NS.ISC.ORG
;
.                        3600000      NS    F.ROOT-SERVERS.NET.
F.ROOT-SERVERS.NET.      3600000      A     192.5.5.241
F.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:2f::f
;
; FORMERLY NS.NIC.DDN.MIL
;
.                        3600000      NS    G.ROOT-SERVERS.NET.
G.ROOT-SERVERS.NET.      3600000      A     192.112.36.4
G.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:12::d0d
;
; FORMERLY AOS.ARL.ARMY.MIL
;
.                        3600000      NS    H.ROOT-SERVERS.NET.
H.ROOT-SERVERS.NET.      3600000      A     198.97.190.53
H.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:1::53
;
; FORMERLY NIC.NORDU.NET
;
.                        3600000      NS    I.ROOT-SERVERS.NET.
I.ROOT-SERVERS.NET.      3600000      A     192.36.148.17
I.ROOT-SERVERS.NET.      3600000      AAAA  2001:7fe::53
;
; OPERATED BY VERISIGN, INC.
;
.                        3600000      NS    J.ROOT-SERVERS.NET.
J.ROOT-SERVERS.NET.      3600000      A     192.58.128.30
J.ROOT-SERVERS.NET.      3600000      AAAA  2001:503:c27::2:30
;
; OPERATED BY RIPE NCC
;
.                        3600000      NS    K.ROOT-SERVERS.NET.
K.ROOT-SERVERS.NET.      3600000      A     193.0.14.129
K.ROOT-SERVERS.NET.      3600000      AAAA  2001:7fd::1
;
; OPERATED BY ICANN
;
.                        3600000      NS    L.ROOT-SERVERS.NET.
L.ROOT-SERVERS.NET.      3600000      A     199.7.83.42
L.ROOT-SERVERS.NET.      3600000      AAAA  2001:500:9f::42
;
; OPERATED BY WIDE
;
.                        3600000      NS    M.ROOT-SERVERS.NET.
M.ROOT-SERVERS.NET.      3600000      A     202.12.27.33
M.ROOT-SERVERS.NET.      3600000      AAAA  2001:dc3::35

You'd likely be unable to resolve any domain if your anchor was incorrect.

Assuming that you are just seeing those SERVFAILs for certain domains, your issue seems to be upstream.

SERVFAIL indicates that one of Pi-hole's upstreams (unbound, in your case), or in turn one of the servers unbound itself queries, returned an error.

Intermittent SERVFAILs are not uncommon, and you'll hardly ever notice them.

Unfortunately, they are somewhat hard to troubleshoot if they persist over a longer period.
They may indicate that a DNS server authoritative for that domain is down, or that something is interfering with DNS resolution (outside of your network) - see Pi-hole unbound servfail where an ISP was filtering DNS requests.
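One way to narrow this down is to query unbound directly, bypassing Pi-hole, and look at the response code it returns for an affected domain. Below is a minimal sketch using only Python's standard library. It assumes unbound listens on 127.0.0.1 port 5335 as in the config posted above. The helper names (`build_query`, `rcode_of`, `probe`) are made up for illustration, and the sketch does not check that the response ID matches the query.

```python
import socket
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query (default qtype 1 = A record, class IN)."""
    # Header: ID, flags (only the RD bit set), QDCOUNT=1, other counts 0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def rcode_of(response: bytes) -> str:
    """Extract the RCODE (low 4 bits of the header flags) from a DNS response."""
    if len(response) < 4:
        return "MALFORMED"
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")


def probe(name: str, server: str = "127.0.0.1", port: int = 5335,
          timeout: float = 5.0) -> str:
    """Send one UDP query and return the response code, or TIMEOUT/ERROR."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, port))
            response, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "TIMEOUT"
        except OSError as exc:
            return f"ERROR ({exc})"
    return rcode_of(response)


# Example (run on the Pi itself): print(probe("mail.protonmail.com"))
```

If unbound itself answers SERVFAIL or times out for the affected domains, the problem lies between unbound and the authoritative servers rather than inside Pi-hole.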

(Some) domains returning "SERVFAIL" but no DNSSEC enabled - #9 by deHakkelaar also may have some leads on troubleshooting issues with resolving mail.protonmail.com.

That's an interesting point about the ISP. I checked AT&T's ARRIS BGW210's settings and noticed packet filtering was turned on, so I disabled it since I have a downstream router to handle it. I'll keep an eye out and see if this helps going forward. If it doesn't help, I'll turn on the unbound-remote and see if I can get some additional insight there.

I did notice some additional firewall settings, but I'm not certain which, if any, I should change from their defaults: