Unbound intermittently fails to resolve anything

Expected Behaviour:

Hello everyone,

I'm currently running a Docker setup with Pi-hole and Unbound together on an Intel NUC8 running Ubuntu Server 24.04. I am connected to the server via SSH from my local machine. My network is a Unifi network using a Cloud Gateway Ultra. I have configured 3 networks for my devices: one for networking gear, one for all user devices and one for IoT. All of these networks have my server set as the DNS provider, as does my primary WAN.

This setup works okay-ish. While the blocking of pi-hole is working as intended, something about the Unbound functionality is not.

I would expect this setup to be slow (up to ~200ms) when first looking up and caching the DNS result in Unbound, with subsequent queries being in the 1-2ms range.

Actual Behaviour:

The problem is that sometimes this works well with fast and responsive DNS queries, and sometimes it times out or takes upwards of 2000-4000ms (a rough estimate from timing in my head when connecting to e.g. google.com in my browser).

When performing the dig command, sometimes it resolves fast and sometimes it times out and I cannot figure out why...

See below for some examples of the dig command working once and failing once. These were run with ~5s of delay between each other.

henke@klassikern:/DATA/AppData$ dig google.com

; <<>> DiG 9.18.28-0ubuntu0.24.04.1-Ubuntu <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46249
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             56      IN      A       142.250.74.14

;; Query time: 1 msec
;; SERVER: 10.42.1.28#53(10.42.1.28) (UDP)
;; WHEN: Wed Sep 04 19:43:42 CEST 2024
;; MSG SIZE  rcvd: 55

henke@klassikern:/DATA/AppData$ dig google.com
;; communications error to 10.42.1.28#53: timed out
;; communications error to 10.42.1.28#53: timed out
;; communications error to 10.42.1.28#53: timed out

; <<>> DiG 9.18.28-0ubuntu0.24.04.1-Ubuntu <<>> google.com
;; global options: +cmd
;; no servers could be reached
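
To put numbers on how often this happens, the dig output could be parsed in a loop. The following is only a minimal sketch: the function names, the number of runs and the 5 s delay are my own choices for illustration, and it assumes dig is installed and prints the same "Query time" and "no servers could be reached" lines as shown above.

```python
import re
import subprocess
import time
from typing import List, Optional


def parse_dig_query_time(output: str) -> Optional[int]:
    """Return the query time in ms, or None if dig got no answer."""
    if "no servers could be reached" in output:
        return None
    match = re.search(r";; Query time: (\d+) msec", output)
    return int(match.group(1)) if match else None


def probe(name: str = "google.com", runs: int = 10,
          delay_s: float = 5.0) -> List[Optional[int]]:
    """Run dig repeatedly and collect one result per run (None = timed out)."""
    results = []
    for _ in range(runs):
        proc = subprocess.run(["dig", name], capture_output=True, text=True)
        # Combine both streams so the check works wherever dig prints errors.
        results.append(parse_dig_query_time(proc.stdout + proc.stderr))
        time.sleep(delay_s)
    return results
```

Calling probe() and counting the None entries would give a rough failure rate over time.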

I have also used the DNS performance test from GitHub (cleanbrowsing/dnsperftest). The results from this test are just as weird. See below for a few runs performed within a minute.

henke@klassikern:/DATA/AppData$ sudo bash ./dnstest.sh
                     test1   test2   test3   test4   test5   test6   test7   test8   test9   test10  Average
10.42.1.28           2500 ms 31 ms   36 ms   112 ms  18 ms   137 ms  85 ms   67 ms   1 ms    82 ms     306.90
cloudflare           24 ms   2 ms    2 ms    48 ms   2 ms    3 ms    2 ms    2 ms    31 ms   4 ms      12.00
level3               20 ms   21 ms   20 ms   20 ms   20 ms   21 ms   20 ms   20 ms   21 ms   20 ms     20.30
google               9 ms    8 ms    25 ms   8 ms    9 ms    63 ms   9 ms    9 ms    18 ms   39 ms     19.70
quad9                8 ms    9 ms    9 ms    9 ms    180 ms  9 ms    8 ms    10 ms   9 ms    10 ms     26.10
freenom              154 ms  155 ms  182 ms  155 ms  154 ms  154 ms  167 ms  154 ms  154 ms  196 ms    162.50
opendns              7 ms    7 ms    ^C

henke@klassikern:/DATA/AppData$ sudo bash ./dnstest.sh
                     test1   test2   test3   test4   test5   test6   test7   test8   test9   test10  Average
10.42.1.28           2500 ms 2500 ms 2500 ms 1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms      750.70
cloudflare           44 ms   2 ms    3 ms    1 ms    2 ms    3 ms    2 ms    2 ms    26 ms   2 ms      8.70
level3               20 ms   20 ms   20 ms   20 ms   19 ms   19 ms   20 ms   20 ms   20 ms   20 ms     19.80
google               8 ms    9 ms    16 ms   9 ms    9 ms    63 ms   9 ms    17 ms   8 ms    68 ms     21.60
quad9                9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms      9.00
freenom              153 ms  154 ms  167 ms  155 ms  186 ms  155 ms  175 ms  175 ms  154 ms  195 ms    166.90
opendns              7 ms    7 ms    23 ms   26 ms   7 ms    7 ms    7 ms    26 ms   7 ms    7 ms      12.40
norton               9 ms    8 ms    9 ms    9 ms    10 ms   9 ms    9 ms    9 ms    9 ms    9 ms      9.00
cleanbrowsing        19 ms   19 ms   19 ms   19 ms   19 ms   19 ms   19 ms   20 ms   19 ms   20 ms     19.20
yandex               48 ms   48 ms   56 ms   48 ms   49 ms   34 ms   52 ms   49 ms   51 ms   49 ms     48.40
adguard              2500 ms 2500 ms ^C

henke@klassikern:/DATA/AppData$ sudo bash ./dnstest.sh
                     test1   test2   test3   test4   test5   test6   test7   test8   test9   test10  Average
10.42.1.28           1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms      1.00
cloudflare           22 ms   2 ms    2 ms    2 ms    4 ms    2 ms    2 ms    2 ms    24 ms   9 ms      7.10
level3               20 ms   27 ms   20 ms   20 ms   20 ms   20 ms   19 ms   20 ms   20 ms   20 ms     20.60
google               9 ms    8 ms    17 ms   8 ms    8 ms    119 ms  16 ms   17 ms   8 ms    39 ms     24.90
quad9                8 ms    8 ms    9 ms    8 ms    9 ms    8 ms    9 ms    9 ms    9 ms    9 ms      8.60
freenom              154 ms  154 ms  167 ms  154 ms  154 ms  154 ms 

I manually changed the time-out of the script from 1000ms to 2500ms.

The first run was after a fresh restart of the Unbound container, with subsequent runs performed back-to-back.
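
Since every timed-out query shows up as the full 2500 ms, the averages in the last column are dominated by the timeouts. A small sketch to separate the two, assuming a timed-out sample is reported as exactly the configured time-out (the function name and the returned keys are made up for illustration):

```python
import re


def summarize_row(row: str, timeout_ms: int = 2500) -> dict:
    """Split one dnsperftest result row into time-outs and answered queries."""
    name = row.split()[0]
    # Every sample is printed as "<number> ms"; the trailing average is not.
    samples = [int(ms) for ms in re.findall(r"\b(\d+) ms\b", row)]
    answered = [ms for ms in samples if ms < timeout_ms]
    return {
        "name": name,
        "timeouts": len(samples) - len(answered),
        "mean_all": sum(samples) / len(samples),
        "mean_answered": sum(answered) / len(answered) if answered else None,
    }


second_run = ("10.42.1.28           2500 ms 2500 ms 2500 ms 1 ms    1 ms    "
              "1 ms    1 ms    1 ms    1 ms    1 ms      750.70")
print(summarize_row(second_run))
```

For the second run this separates the three time-outs from the seven answered queries, which all came back in 1 ms.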

My Unbound configuration looks like this.

server:
    # If no logfile is specified, syslog is used
    logfile: "/var/log/unbound/unbound.log"
    verbosity: 2
    interface: 0.0.0.0
    port: 5353
    do-ip4: yes
    do-udp: yes
    do-tcp: yes
    do-ip6: no
    prefer-ip6: no
    # Trust glue only if it is within the server's authority
    harden-glue: yes
    # Require DNSSEC data for trust-anchored zones, if such data is absent, the zone becomes BOGUS
    harden-dnssec-stripped: yes
    use-caps-for-id: no
    # Reduce EDNS reassembly buffer size.
    edns-buffer-size: 1232
    #Performance settings
    prefetch: yes
    num-threads: 2
    so-rcvbuf: 5m
    access-control: 10.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    access-control: 172.16.0.0/12 allow
    # Ensure privacy of local IP ranges
    private-address: 192.168.0.0/16
    private-address: 169.254.0.0/16
    private-address: 172.16.0.0/12
    private-address: 10.0.0.0/8
    private-address: fd00::/8
    private-address: fe80::/10

    #root-hints: "/var/lib/unbound/root.hints"

I have tried manually providing the root-hints but that did nothing to improve the problem.

I have also disabled the Pi-hole cache and DNSSEC.

I'm at a loss as to why this intermittent timing out of Unbound is happening. I would like to continue to use Unbound together with Pi-hole just because I like the idea of self-hosting my stuff. Though as others live with me, I will use quad9 for the time being as it works better than Unbound does at the moment.

Thanks for any help! :slight_smile:

What's your unbound version?

I'm on version 1.20.0 using the https://hub.docker.com/r/alpinelinux/unbound latest container :slight_smile:

/etc/unbound # unbound -V
Version 1.20.0

Configure line: --build=x86_64-alpine-linux-musl --host=x86_64-alpine-linux-musl --prefix=/usr --sysconfdir=/etc --mandir=/usr/share/man --localstatedir=/var --with-username=unbound --with-run-dir= --with-pidfile= --with-rootkey-file=/usr/share/dnssec-root/trusted-key.key --with-libevent --with-pthreads --disable-static --disable-rpath --enable-dnstap --with-ssl --without-pythonmodule --with-pyunbound
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.3.1 4 Jun 2024
Linked modules: dns64 respip validator iterator

With a validating recursive resolver, it would be expected that lookups take longer on average than with a caching DNS server.
This is obviously for two reasons: Recursion itself takes longer, as it involves communicating with multiple authoritative DNS servers, and DNSSEC validation comes on top of that.

Of course, unbound would cache DNS replies for as long as a domain's TTL allows, serving cached reply records almost instantaneously.

Sporadic long resolution times may (re)occur when a cached reply's TTL expires, as unbound has to partially or sometimes completely rewalk the recursion chain.
The time required for that refresh would depend on how many of the domains along the recursion chain would still be cached.
The refresh will take longer if more domains have to be rerequested. If any of the involved authoritative servers would be slow to respond or even unresponsive due to heavy load, that may further increase reply times, as unbound may have to wait and potentially repeat its request to those authoritative servers.

The above should explain why a recursive resolver takes longer to resolve domains, and sometimes may fail to respond in time.

With your unbound 1.20.0, you should be able to mitigate this.

Starting with version 1.11.0, unbound supports serving expired records according to RFC 8767, allowing it to use expired records if recursion did not complete in a certain time, in an attempt to avoid client side timeouts (see also Serving Stale Data — Unbound 1.21.0 documentation).
As the expired reply may be incorrect (e.g. a domain's IP address may have been changed in the meantime), it is served with a short TTL of 30 seconds. This in turn should prompt the client to send a new request after 30 seconds, by which time unbound should have completed its recursion and serve a current reply.

You should be able to add the following lines to your unbound configuration:

server:
    serve-expired: yes
    serve-expired-ttl: 86400            # do not serve replies older than one day, in seconds
    serve-expired-client-timeout: 1500  # consider serving expired replies when resolution takes longer than 1.5 seconds, in milliseconds

You should just add the three serve-expired* options - the initial server: line above is only included to help you find the appropriate section where those options have to be added. It should already be present in your configuration.

Afterwards, you'd need to restart unbound (sudo service unbound restart, or restart the unbound container in a Docker setup like yours).

This would not avoid longer response times altogether, but it would have unbound provide a reply after about 1.5 seconds at the latest, as long as an expired record for that domain is still in its cache.

You may want to tune that value further, if your clients would still time out before that.
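
The effect of those three options can be pictured with a small Python model. This is only a simplified sketch of the behaviour described above, not unbound's actual implementation; the function name, its arguments and the four result labels are made up for illustration, while the default values are the ones from the configuration lines above.

```python
from typing import Optional


def choose_reply(
    expired_for_s: Optional[float],
    recursion_ms: Optional[float],
    serve_expired_ttl_s: int = 86400,
    client_timeout_ms: int = 1500,
) -> str:
    """Simplified model of the serve-expired decision.

    expired_for_s: None if nothing is cached, <= 0 if the cached record is
                   still within its TTL, otherwise seconds since it expired.
    recursion_ms:  time a fresh recursion would take, None if it never
                   completes.
    """
    if expired_for_s is not None and expired_for_s <= 0:
        return "cached"   # normal cache hit, answered almost instantly
    if recursion_ms is not None and recursion_ms <= client_timeout_ms:
        return "fresh"    # recursion finished before the client timeout
    if expired_for_s is not None and expired_for_s <= serve_expired_ttl_s:
        return "stale"    # expired record served with a short TTL
    return "wait"         # nothing usable cached, client waits for recursion


print(choose_reply(expired_for_s=60, recursion_ms=4000))    # stale
print(choose_reply(expired_for_s=None, recursion_ms=4000))  # wait
```

As the second call suggests, this would only help for domains that have been resolved before; a first-time lookup still has to wait for the full recursion.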