About 10% of queries take 5s to respond

Non-standard Elements

None that I can think of.

Expected Behaviour:

Queries should be answered in a reasonable time. Direct queries to cloudfare (the upstream dns i use) respond quickly when run on the pi:

$ for i in `seq 1 10`; do time dig this.domain.does.not.exist.in.the.cache.com.example.$i +short @1.1.1.1 ; done

real	0m0.088s
user	0m0.070s
sys	0m0.011s

real	0m0.078s
user	0m0.060s
sys	0m0.011s

real	0m0.072s
user	0m0.037s
sys	0m0.028s

real	0m0.057s
user	0m0.030s
sys	0m0.020s

real	0m0.059s
user	0m0.040s
sys	0m0.010s

real	0m0.058s
user	0m0.010s
sys	0m0.040s

real	0m0.058s
user	0m0.050s
sys	0m0.001s

real	0m0.058s
user	0m0.037s
sys	0m0.013s

real	0m0.058s
user	0m0.030s
sys	0m0.021s

real	0m0.058s
user	0m0.025s
sys	0m0.026s

real	0m0.057s
user	0m0.025s
sys	0m0.025s

real	0m0.059s
user	0m0.030s
sys	0m0.021s

real	0m0.058s
user	0m0.049s
sys	0m0.001s

real	0m0.058s
user	0m0.050s
sys	0m0.001s

real	0m0.064s
user	0m0.035s
sys	0m0.018s

real	0m0.080s
user	0m0.041s
sys	0m0.032s

real	0m0.060s
user	0m0.043s
sys	0m0.010s

real	0m0.058s
user	0m0.030s
sys	0m0.021s

real	0m0.058s
user	0m0.039s
sys	0m0.011s

Actual Behaviour:

It takes ~5s for my pi-hole to respond to a query about 10% of the time, usually for new domains.

Run on my laptop (wired):

 $ for i in `seq 2 20`; do time dig this.domain.does.not.exist.in.the.cache.com.example.$i +short ; done

real	0m0.053s
user	0m0.002s
sys	0m0.005s

real	0m0.048s
user	0m0.002s
sys	0m0.004s

real	0m0.074s
user	0m0.002s
sys	0m0.005s

real	0m0.054s
user	0m0.002s
sys	0m0.005s

real	0m0.055s
user	0m0.002s
sys	0m0.005s

real	0m0.076s
user	0m0.002s
sys	0m0.005s

real	0m0.051s
user	0m0.002s
sys	0m0.005s

real	0m0.054s
user	0m0.002s
sys	0m0.005s

real	0m0.048s
user	0m0.002s
sys	0m0.004s

real	0m5.016s
user	0m0.002s
sys	0m0.005s

real	0m0.058s
user	0m0.003s
sys	0m0.006s

real	0m0.048s
user	0m0.002s
sys	0m0.004s

real	0m0.048s
user	0m0.002s
sys	0m0.005s

real	0m0.045s
user	0m0.002s
sys	0m0.004s

real	0m0.047s
user	0m0.002s
sys	0m0.005s

real	0m0.046s
user	0m0.002s
sys	0m0.005s

real	0m0.055s
user	0m0.002s
sys	0m0.004s

real	0m0.049s
user	0m0.002s
sys	0m0.006s

real	0m0.080s
user	0m0.002s
sys	0m0.004s

Debug Token:

https://tricorder.pi-hole.net/kvgx9pdlgw

Additional info:

All connections are wired.
CPU use does not jump when queries are slow ~1-5%.
Lan has both ipv4 and ipv6 (prefix delegated) addresses.
An Edgerouter-x is the router.

time cannot reveal to you how long actual DNS resolution took, it will distort values by whatever random delay your OS is arbitrarily imposing.

Fortunately, DNS resolution duration is readily available from dig output:

~ $ time dig this.domain.does.not.exist.in.the.cache.com.example.1

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Raspbian <<>> this.domain.does.not.exist.in.the.cache.com.example.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51697
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;this.domain.does.not.exist.in.the.cache.com.example.1. IN A

;; Query time: 32 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Nov 18 08:52:55 CET 2020
;; MSG SIZE  rcvd: 82

real    0m0.284s
user    0m0.154s
sys     0m0.084s

DNS resolution took 32ms, while time reports a durations that is roughly nine times higher - so off by almost an order of magnitude.

Use some real world domain examples and see if observing DNS resolution times as reported by dig would confirm your suspicion.

It's still unlikely that the OS delays the execution of dig by 5 seconds. @CamJN Is the system you're testing from under high load? What is the output of uptime ?

The system I'm using to test the dns resolution is not under any meaningful load, it's a 6 core mbp that was not doing anything but the dns test. uptime says 10:23 up 12 days, 20:10, 7 users, load averages: 2.41 2.49 3.53

@Bucking_Horn I used time instead of the dig timing because time was used here: Redirecting..., but dig also reports 5s. Also, reading the dig man page, the default timeout is 5 seconds, so these queries are likely timing out.

Yes, comparing timings to benchmark a certain set of variables and conditions against another is sound, I won't argue against that. It allowed you to observe a difference, but that's about it.

I'm not denying your observation as such, but until now, that's a suspicion that needs some substantiating evidence.

time won't help you any further in finding out why certain queries take so long.
Using dig will provide correct resolution times along with more details.

And I'd still recommend to use some real domains instead of your artificial ones for a meaningful benchmark and analysis (the benchmark page you linked tries to use real domains as well).
If you'd ask me to speculate: Public DNS servers are well monitored, so firing a set of queries for non-existent domains that are obviously generated by a template in short succession may well trigger defensive measures on their behalf.

@Bucking_Horn fair enough. I ran dig with several of my recent domains, and output the dig output as well as the time output using this command: for i in $(pbpaste); do HOME=/dev/null time dig $i ; done as I'm looking for individual timings not the total time for the group, I didn't use the file flag for dig. I've included the output from the queries that wound up taking 5s. So it looks like the response IS found quickly and then dig just hangs for 5s?

; <<>> DiG 9.10.6 <<>> cdn.discordapp.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44408
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;cdn.discordapp.com.		IN	A

;; ANSWER SECTION:
cdn.discordapp.com.	98	IN	A	162.159.135.233
cdn.discordapp.com.	98	IN	A	162.159.130.233
cdn.discordapp.com.	98	IN	A	162.159.129.233
cdn.discordapp.com.	98	IN	A	162.159.133.233
cdn.discordapp.com.	98	IN	A	162.159.134.233

;; Query time: 9 msec
;; SERVER: 10.0.1.204#53(10.0.1.204)
;; WHEN: Wed Nov 18 14:25:16 MST 2020
;; MSG SIZE  rcvd: 127

        5.02 real         0.00 user         0.00 sys


; <<>> DiG 9.10.6 <<>> google.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47569
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.ca.			IN	A

;; ANSWER SECTION:
google.ca.		213	IN	A	216.58.193.67

;; Query time: 5 msec
;; SERVER: 10.0.1.204#53(10.0.1.204)
;; WHEN: Wed Nov 18 14:25:22 MST 2020
;; MSG SIZE  rcvd: 54

        5.01 real         0.00 user         0.00 sys

Also I don't think I'm triggering defensive measures in cloudfare, because as I said in my first post, making the requests directly to cloudfare results in fast results every time. It's just querying the pi-hole that results in these weird delays. And the delay can affect any programs running on the client, not just dig: browsers, scripts, whatever. And it affects every client on the network too.

Query times of 9 and 5 msecs rule out my "defensive" speculations (unless cached). :wink:
But they also shift the blame away from DNS.

Would you observe 5 secs response times when just using dig without time?

Running again without time, dig got two results in 1msec each. I do have a really fast upstream connection, so that part isn't all that surprising to me.

However while it was running, even if it wasn't being timed by time, there were noticeable pauses between some queries which were likely the 5s delay.

1 msec could be served from cache.

I can't explain why time would report as much as 5 seconds when dig itself returns pretty much instantly, but anyway: As long as you don't see dig returning long query times, I don't think your observations using time should raise your concern.

But the reason i started looking into this is that this affects browsers and other stuff running on my clients. The 5s time is reporting are indicative of what I see just using devices, about 10% of queries just stall for several seconds.

The same reasoning would apply:
Are you able to correlate browser responsiveness to DNS query times of about 5 seconds as logged in Pi-hole's Query Log?

Looks like I'm experiencing this: linux - DNS lookups sometimes take 5 seconds - Unix & Linux Stack Exchange

So not the pi-hole's fault I guess.

Does that imply you are not observing 5 seconds reply times in Pi-hole's Query Log?

That wouldn't rule out DNS completely, and Pi-hole may yet have a part in it if you'd configured an incorrect IPv6 address in your router.

I'm not seeing values of 5s in the reply column of the query log table. Is there a way for me to generate that data on the cli on the pi? The sorting by reply column doesn't seem to be working for me.

Unfortunately, a known bug
https://github.com/pi-hole/AdminLTE/issues/1249


Not directly, as far as I know.

I tried to add DEBUG_QUERIES=true to /etc/pihole/pihole-FTL.conf as it should pint "extensive query information (domains, types, replies, etc.)." but I can't see the reply time in /var/log/pihole-FTL.log

[2020-11-19 06:08:25.568 7672M] **** new UDP query[A] query "connectivity-check.ubuntu.com" from eth0:10.0.10.136 (ID 71, FTL 10172, /root/project/src/dnsmasq/forward.c:1557)
[2020-11-19 06:08:25.568 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.569 7672M] **** forwarded connectivity-check.ubuntu.com to 127.0.0.1#5335 (ID 71, /root/project/src/dnsmasq/forward.c:550)
[2020-11-19 06:08:25.569 7672M] **** got DNSSEC details for connectivity-check.ubuntu.com: 2 (ID 71, /root/project/src/dnsmasq/forward.c:1238)
[2020-11-19 06:08:25.570 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.570 7672M] CNAME connectivity-check.ubuntu.com
[2020-11-19 06:08:25.570 7672M] **** got reply connectivity-check.ubuntu.com is 35.224.99.156 (ID 71, /root/project/src/dnsmasq/cache.c:494)
[2020-11-19 06:08:25.570 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.570 7672M] CNAME connectivity-check.ubuntu.com
[2020-11-19 06:08:25.570 7672M] **** got reply connectivity-check.ubuntu.com is 35.222.85.5 (ID 71, /root/project/src/dnsmasq/cache.c:49

@DL6ER is there a way to output the reply time to the cli for every query?

Here's a merged pcap from the client and raspberry pi of a problematic query and a normal one.

Problem Query:

No. Time Source Port S Destination Port D Protocol Info Length
79501 2020-11-18 17:23:05.034234 SECUR-T.local 64583 raspberrypi 53 DNS Standard query 0xeef6 A domain.does.not.exist.7 OPT 94
79513 2020-11-18 17:23:05.036727 raspberrypi 53 SECUR-T.local 64583 DNS Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT 94
79514 2020-11-18 17:23:05.040627 SECUR-T.local 64583 raspberrypi 53 DNS Standard query 0xeef6 A domain.does.not.exist.7 OPT 94
79515 2020-11-18 17:23:05.042739 raspberrypi 53 SECUR-T.local 64583 DNS Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT 94
82475 2020-11-18 17:23:10.036756 SECUR-T.local 64583 raspberrypi 53 DNS Standard query 0xeef6 A domain.does.not.exist.7 OPT 94
82476 2020-11-18 17:23:10.043180 SECUR-T.local 64583 raspberrypi 53 DNS Standard query 0xeef6 A domain.does.not.exist.7 OPT 94
82477 2020-11-18 17:23:10.064949 raspberrypi 53 SECUR-T.local 64583 DNS Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT 94
82478 2020-11-18 17:23:10.070844 raspberrypi 53 SECUR-T.local 64583 DNS Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT 94

Normal Query:

No. Time Source Port S Destination Port D Protocol Info Length
79471 2020-11-18 17:23:04.985761 SECUR-T.local 62129 raspberrypi 53 DNS Standard query 0xde0b A domain.does.not.exist.6 OPT 94
79472 2020-11-18 17:23:04.988310 raspberrypi 53 SECUR-T.local 62129 DNS Standard query response 0xde0b No such name A domain.does.not.exist.6 OPT 94
79473 2020-11-18 17:23:04.992161 SECUR-T.local 62129 raspberrypi 53 DNS Standard query 0xde0b A domain.does.not.exist.6 OPT 94
79474 2020-11-18 17:23:04.994326 raspberrypi 53 SECUR-T.local 62129 DNS Standard query response 0xde0b No such name A domain.does.not.exist.6 OPT 94

I'm not sure if the merged pcap is responsible for the duplicated query & response in the fast case, but it seems that there's additional duplication or possibly a retry in the slow case. But no AAAA queries to be seen which is odd, as my network does ipv6, and my problem seems so similar to the description here: linux - DNS lookups sometimes take 5 seconds - Unix & Linux Stack Exchange

Wouldn't be odd if...

Please upload a debug log and post just the token generated by

pihole -d

allowing to upload when prompted, or do it through the Web interface:

Tools > Generate Debug Log

not directly, however, you have the information available indirectly. In your example the reply time was about 1 millisecond, so your local unbound had the domain cached.

Ok I'm working on getting a ULA for my pi in the ipv6 address space. Previously there was no dns server in the ipv6 address space and everything had to use the static ipv4 address. Since I'm using an edgerouter, making this change is rather difficult, but I'll comment back when it's done.