Unbound intermittently fails to resolve anything

Expected Behaviour:

Hello everyone,

I'm currently running a Docker setup with Pi-hole and Unbound together on an Intel NUC8 running Ubuntu Server 24.04, and I SSH into the server from my local machine. My network is a Unifi setup with a Cloud Gateway Ultra. I have configured three networks for my devices: one for networking gear, one for all user devices, and one for IoT. All three networks have my server set as the DNS provider, as does my primary WAN.

This setup works okay-ish: while Pi-hole's blocking works as intended, something about the Unbound side does not.

I would expect this setup to be slow (up to ~200 ms) when first looking up and caching a DNS result in Unbound, with subsequent queries being in the 1-2 ms range.

Actual Behaviour:

The problem is that sometimes this works well, with fast and responsive DNS queries, and sometimes it times out or takes upwards of 2000-4000 ms (just rough timing in my head when connecting to e.g. google.com in my browser).

When running the dig command, sometimes it resolves fast and sometimes it times out, and I cannot figure out why...

See below for some examples of the dig command working once and failing once. These runs were about 5 seconds apart.

henke@klassikern:/DATA/AppData$ dig google.com

; <<>> DiG 9.18.28-0ubuntu0.24.04.1-Ubuntu <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46249
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             56      IN      A       142.250.74.14

;; Query time: 1 msec
;; SERVER: 10.42.1.28#53(10.42.1.28) (UDP)
;; WHEN: Wed Sep 04 19:43:42 CEST 2024
;; MSG SIZE  rcvd: 55

henke@klassikern:/DATA/AppData$ dig google.com
;; communications error to 10.42.1.28#53: timed out
;; communications error to 10.42.1.28#53: timed out
;; communications error to 10.42.1.28#53: timed out

; <<>> DiG 9.18.28-0ubuntu0.24.04.1-Ubuntu <<>> google.com
;; global options: +cmd
;; no servers could be reached

I have also used the dnsperftest script (github.com/cleanbrowsing/dnsperftest). The results from this test are just as weird. See below for a few runs performed within a minute.

henke@klassikern:/DATA/AppData$ sudo bash ./dnstest.sh
                     test1   test2   test3   test4   test5   test6   test7   test8   test9   test10  Average
10.42.1.28           2500 ms 31 ms   36 ms   112 ms  18 ms   137 ms  85 ms   67 ms   1 ms    82 ms     306.90
cloudflare           24 ms   2 ms    2 ms    48 ms   2 ms    3 ms    2 ms    2 ms    31 ms   4 ms      12.00
level3               20 ms   21 ms   20 ms   20 ms   20 ms   21 ms   20 ms   20 ms   21 ms   20 ms     20.30
google               9 ms    8 ms    25 ms   8 ms    9 ms    63 ms   9 ms    9 ms    18 ms   39 ms     19.70
quad9                8 ms    9 ms    9 ms    9 ms    180 ms  9 ms    8 ms    10 ms   9 ms    10 ms     26.10
freenom              154 ms  155 ms  182 ms  155 ms  154 ms  154 ms  167 ms  154 ms  154 ms  196 ms    162.50
opendns              7 ms    7 ms    ^C

henke@klassikern:/DATA/AppData$ sudo bash ./dnstest.sh
                     test1   test2   test3   test4   test5   test6   test7   test8   test9   test10  Average
10.42.1.28           2500 ms 2500 ms 2500 ms 1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms      750.70
cloudflare           44 ms   2 ms    3 ms    1 ms    2 ms    3 ms    2 ms    2 ms    26 ms   2 ms      8.70
level3               20 ms   20 ms   20 ms   20 ms   19 ms   19 ms   20 ms   20 ms   20 ms   20 ms     19.80
google               8 ms    9 ms    16 ms   9 ms    9 ms    63 ms   9 ms    17 ms   8 ms    68 ms     21.60
quad9                9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms    9 ms      9.00
freenom              153 ms  154 ms  167 ms  155 ms  186 ms  155 ms  175 ms  175 ms  154 ms  195 ms    166.90
opendns              7 ms    7 ms    23 ms   26 ms   7 ms    7 ms    7 ms    26 ms   7 ms    7 ms      12.40
norton               9 ms    8 ms    9 ms    9 ms    10 ms   9 ms    9 ms    9 ms    9 ms    9 ms      9.00
cleanbrowsing        19 ms   19 ms   19 ms   19 ms   19 ms   19 ms   19 ms   20 ms   19 ms   20 ms     19.20
yandex               48 ms   48 ms   56 ms   48 ms   49 ms   34 ms   52 ms   49 ms   51 ms   49 ms     48.40
adguard              2500 ms 2500 ms ^C

henke@klassikern:/DATA/AppData$ sudo bash ./dnstest.sh
                     test1   test2   test3   test4   test5   test6   test7   test8   test9   test10  Average
10.42.1.28           1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms    1 ms      1.00
cloudflare           22 ms   2 ms    2 ms    2 ms    4 ms    2 ms    2 ms    2 ms    24 ms   9 ms      7.10
level3               20 ms   27 ms   20 ms   20 ms   20 ms   20 ms   19 ms   20 ms   20 ms   20 ms     20.60
google               9 ms    8 ms    17 ms   8 ms    8 ms    119 ms  16 ms   17 ms   8 ms    39 ms     24.90
quad9                8 ms    8 ms    9 ms    8 ms    9 ms    8 ms    9 ms    9 ms    9 ms    9 ms      8.60
freenom              154 ms  154 ms  167 ms  154 ms  154 ms  154 ms 

I manually changed the timeout of the script from 1000 ms to 2500 ms.
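
For reference, in the published dnsperftest script that value is the fallback used when dig produces no "Query time:" line at all, so my edit looks roughly like this (variable names approximated from the upstream script):

ttime=$(dig +tries=1 +time=2 +stats @$pip $d | grep "Query time:" | cut -d : -f 2- | cut -d " " -f 2)
if [ -z "$ttime" ]; then
    # dig gave no reply within its own timeout; count a penalty value instead
    ttime=2500   # upstream uses 1000 here
fi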

The first run was after a fresh restart of the Unbound container, with subsequent runs performed back-to-back.

My Unbound configuration looks like this:

server:
    # If no logfile is specified, syslog is used
    logfile: "/var/log/unbound/unbound.log"
    verbosity: 2
    interface: 0.0.0.0
    port: 5353
    do-ip4: yes
    do-udp: yes
    do-tcp: yes
    do-ip6: no
    prefer-ip6: no
    # Trust glue only if it is within the server's authority
    harden-glue: yes
    # Require DNSSEC data for trust-anchored zones, if such data is absent, the zone becomes BOGUS
    harden-dnssec-stripped: yes
    use-caps-for-id: no
    # Reduce EDNS reassembly buffer size.
    edns-buffer-size: 1232
    #Performance settings
    prefetch: yes
    num-threads: 2
    so-rcvbuf: 5m
    access-control: 10.0.0.0/8 allow
    access-control: 192.168.0.0/16 allow
    access-control: 172.16.0.0/12 allow
    # Ensure privacy of local IP ranges
    private-address: 192.168.0.0/16
    private-address: 169.254.0.0/16
    private-address: 172.16.0.0/12
    private-address: 10.0.0.0/8
    private-address: fd00::/8
    private-address: fe80::/10

    #root-hints: "/var/lib/unbound/root.hints"

I have tried manually providing the root hints, but that did nothing to improve the problem.
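
For anyone trying the same, the usual way is to fetch the file from InterNIC and point the commented-out root-hints line at it (the path has to match your container mounts, so treat this as a sketch):

# download the current root hints file published by InterNIC
curl -o /var/lib/unbound/root.hints https://www.internic.net/domain/named.root
# then uncomment in unbound.conf and restart the container:
#     root-hints: "/var/lib/unbound/root.hints"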

I have also disabled the Pi-hole cache and DNSSEC.
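
(Both can be toggled in the Pi-hole web interface under Settings > DNS; if I understand it correctly, on Pi-hole v5 that corresponds to these two lines in /etc/pihole/setupVars.conf, applied with pihole restartdns:)

CACHE_SIZE=0
DNSSEC=false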

I'm at a loss for why this intermittent timing out of Unbound is happening. I would like to keep using Unbound together with Pi-hole, simply because I like the idea of self-hosting my stuff. But as others live with me, I will use Quad9 for the time being, as it works better than Unbound does at the moment.

Thanks for any help! :slight_smile:

What's your unbound version?

I'm on version 1.20.0, using the latest alpinelinux/unbound container (https://hub.docker.com/r/alpinelinux/unbound) :slight_smile:

/etc/unbound # unbound -V
Version 1.20.0

Configure line: --build=x86_64-alpine-linux-musl --host=x86_64-alpine-linux-musl --prefix=/usr --sysconfdir=/etc --mandir=/usr/share/man --localstatedir=/var --with-username=unbound --with-run-dir= --with-pidfile= --with-rootkey-file=/usr/share/dnssec-root/trusted-key.key --with-libevent --with-pthreads --disable-static --disable-rpath --enable-dnstap --with-ssl --without-pythonmodule --with-pyunbound
Linked libs: libevent 2.1.12-stable (it uses epoll), OpenSSL 3.3.1 4 Jun 2024
Linked modules: dns64 respip validator iterator

With a validating recursive resolver, lookups are expected to take longer on average than with a plain caching DNS server, for two reasons: recursion itself takes longer, as it involves communicating with multiple authoritative DNS servers, and DNSSEC validation comes on top of that.

Of course, unbound would cache DNS replies for as long as a domain's TTL allows, serving cached reply records almost instantaneously.

Sporadic long resolution times may (re)occur when a cached reply's TTL expires, as unbound has to partially or sometimes completely rewalk the recursion chain.
The time required for that refresh depends on how many of the domains along the recursion chain are still cached: the more domains have to be re-requested, the longer it takes. If any of the involved authoritative servers is slow to respond or even unresponsive due to heavy load, that may further increase reply times, as unbound may have to wait for and potentially repeat its request to that authoritative server.

The above should explain why a recursive resolver takes longer to resolve domains, and sometimes may fail to respond in time.
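
You can observe the caching directly with dig: run the same query twice against your resolver, and the second reply should come from cache, with the TTL counting down and the query time dropping to the 1 ms range (using the resolver address from your output above):

dig @10.42.1.28 example.com | grep -E "Query time:|IN"
sleep 5
dig @10.42.1.28 example.com | grep -E "Query time:|IN"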

With your unbound 1.20.0, you should be able to mitigate this.

Starting with version 1.11.0, unbound supports serving expired records according to RFC 8767, allowing it to use expired records if recursion did not complete in a certain time, in an attempt to avoid client side timeouts (see also Serving Stale Data — Unbound 1.21.0 documentation).
As the expired reply may be incorrect (e.g. a domain's IP address may have changed in the meantime), it is served with a short TTL of 30 seconds. This in turn should prompt the client to send a new request after 30 seconds, by which time unbound should have completed its recursion and can serve a current reply.

You should be able to add the following lines to your unbound configuration:

server:
    serve-expired: yes
    serve-expired-ttl: 86400            # do not serve replies older than one day, in seconds
    serve-expired-client-timeout: 1500  # consider serving expired replies when resolution takes longer than 1.5 seconds, in milliseconds

You should just add the three serve-expired* options; the initial server: line above is only included to help you find the appropriate section where those options go, and it should already be present in your configuration.

Afterwards, you'd need to restart unbound (sudo service unbound restart).
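
As your unbound runs in a Docker container rather than as a host service, restarting the container would be the equivalent (the container name here is just an assumption):

docker restart unbound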

This would not avoid longer response times altogether, but it would have unbound provide a reply no later than about 1.5 seconds.

You may want to tune that value further, if your clients would still time out before that.
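
For example, if your clients give up after about one second, you could lower the threshold (the 800 is just an illustration):

server:
    serve-expired-client-timeout: 800  # serve a stale answer after 0.8 seconds instead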

Hi, and thanks for the reply. Sorry it took me a while to get back.

Unfortunately, this does not seem to help. Initially it worked well, but after about 48 hours I started having the same issue of extremely slow responses, as if Unbound were not serving cached replies. Navigating to google.com gives a DNS error, followed by a "this site does not support HTTPS" message, and then on a third attempt it loads normally. If I then open a new incognito window, the same process repeats.

At this point I'm not sure whether it has to do with the fact that Unbound is running in a Docker container, or whether this is just an inherent problem of using a self-hosted recursive DNS resolver.

While I do like the aspects of self-hosting, if half or more of my requests time out or fail, then it feels like a big inconvenience rather than a small sacrifice for a bit more privacy.

Unless you or anyone else has an idea of what my problem might be, I will most likely have to shelve this project for the time being and revisit it another time :frowning:

Anyway, thanks for all the help! And Pi-hole itself still works like a charm!

Some hints for diagnosing (scroll down a bit):

As the extended configuration would now serve replies no later than 1.5 seconds, that could suggest a different issue altogether.

Also, this sounds as if you would regularly encounter lags in web site rendering, rather than occasional, infrequent longer DNS resolutions?

Let's see some related stats for the last week:

a. Total number of queries:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries \
WHERE timestamp > strftime('%s','now','-7 days');"

b. Number of queries with long reply times over a second:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries \
WHERE reply_time > 1 AND timestamp > strftime('%s','now','-7 days');"

c. Top 10 most frequently requested domains with long reply times:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT domain, count(domain), round(avg(reply_time),5), max(id) FROM queries \
WHERE reply_time > 1 AND timestamp > strftime('%s','now','-7 days') \
GROUP BY domain ORDER BY 2 DESC LIMIT 10;"

As a point of reference, on my own system the ratio of b. to a. is about 0.00129, or just above 0.1%, so roughly one in ~1,000 DNS requests takes longer than 1 second.
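
If you want that ratio in one go, a. and b. can be combined into a single query (same database and time window as above):

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT round(1.0 * \
  (SELECT count(*) FROM queries WHERE reply_time > 1 AND timestamp > strftime('%s','now','-7 days')) / \
  (SELECT count(*) FROM queries WHERE timestamp > strftime('%s','now','-7 days')), 5);"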

This could indicate that some DNS requests are going to an alternative DNS server, either via IPv6 or because something intercepts and redirects DNS requests.
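
Two quick checks from a client may help narrow that down (resolvectl is specific to systemd-resolved clients, so adapt as needed):

# list the DNS servers the client actually uses, including any IPv6 ones
# that would bypass Pi-hole
resolvectl status

# many public resolvers answer this identity query; an empty or odd answer
# when asking e.g. Quad9 directly can hint at interception
dig +short CH TXT id.server @9.9.9.9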

Please upload a debug log and post just the token URL that is generated after the log is uploaded by running the following command from the Pi-hole host terminal:

pihole -d

or do it through the Web interface:

Tools > Generate Debug Log
