Queries should be answered in a reasonable time. Direct queries to cloudfare (the upstream dns i use) respond quickly when run on the pi:
$ for i in `seq 1 10`; do time dig this.domain.does.not.exist.in.the.cache.com.example.$i +short @1.1.1.1 ; done
real 0m0.088s
user 0m0.070s
sys 0m0.011s
real 0m0.078s
user 0m0.060s
sys 0m0.011s
real 0m0.072s
user 0m0.037s
sys 0m0.028s
real 0m0.057s
user 0m0.030s
sys 0m0.020s
real 0m0.059s
user 0m0.040s
sys 0m0.010s
real 0m0.058s
user 0m0.010s
sys 0m0.040s
real 0m0.058s
user 0m0.050s
sys 0m0.001s
real 0m0.058s
user 0m0.037s
sys 0m0.013s
real 0m0.058s
user 0m0.030s
sys 0m0.021s
real 0m0.058s
user 0m0.025s
sys 0m0.026s
real 0m0.057s
user 0m0.025s
sys 0m0.025s
real 0m0.059s
user 0m0.030s
sys 0m0.021s
real 0m0.058s
user 0m0.049s
sys 0m0.001s
real 0m0.058s
user 0m0.050s
sys 0m0.001s
real 0m0.064s
user 0m0.035s
sys 0m0.018s
real 0m0.080s
user 0m0.041s
sys 0m0.032s
real 0m0.060s
user 0m0.043s
sys 0m0.010s
real 0m0.058s
user 0m0.030s
sys 0m0.021s
real 0m0.058s
user 0m0.039s
sys 0m0.011s
Actual Behaviour:
It takes ~5s for my pi-hole to respond to a query about 10% of the time, usually for new domains.
Run on my laptop (wired):
$ for i in `seq 2 20`; do time dig this.domain.does.not.exist.in.the.cache.com.example.$i +short ; done
real 0m0.053s
user 0m0.002s
sys 0m0.005s
real 0m0.048s
user 0m0.002s
sys 0m0.004s
real 0m0.074s
user 0m0.002s
sys 0m0.005s
real 0m0.054s
user 0m0.002s
sys 0m0.005s
real 0m0.055s
user 0m0.002s
sys 0m0.005s
real 0m0.076s
user 0m0.002s
sys 0m0.005s
real 0m0.051s
user 0m0.002s
sys 0m0.005s
real 0m0.054s
user 0m0.002s
sys 0m0.005s
real 0m0.048s
user 0m0.002s
sys 0m0.004s
real 0m5.016s
user 0m0.002s
sys 0m0.005s
real 0m0.058s
user 0m0.003s
sys 0m0.006s
real 0m0.048s
user 0m0.002s
sys 0m0.004s
real 0m0.048s
user 0m0.002s
sys 0m0.005s
real 0m0.045s
user 0m0.002s
sys 0m0.004s
real 0m0.047s
user 0m0.002s
sys 0m0.005s
real 0m0.046s
user 0m0.002s
sys 0m0.005s
real 0m0.055s
user 0m0.002s
sys 0m0.004s
real 0m0.049s
user 0m0.002s
sys 0m0.006s
real 0m0.080s
user 0m0.002s
sys 0m0.004s
All connections are wired.
CPU use does not jump when queries are slow ~1-5%.
Lan has both ipv4 and ipv6 (prefix delegated) addresses.
An Edgerouter-x is the router.
It's still unlikely that the OS delays the execution of dig by 5 seconds. @CamJN Is the system you're testing from under high load? What is the output of uptime ?
The system I'm using to test the dns resolution is not under any meaningful load, it's a 6 core mbp that was not doing anything but the dns test. uptime says 10:23 up 12 days, 20:10, 7 users, load averages: 2.41 2.49 3.53
@Bucking_Horn I used time instead of the dig timing because time was used here: Redirecting..., but dig also reports 5s. Also, reading the dig man page, the default timeout is 5 seconds, so these queries are likely timing out.
Yes, comparing timings to benchmark a certain set of variables and conditions against another is sound, I won't argue against that. It allowed you to observe a difference, but that's about it.
I'm not denying your observation as such, but until now, that's a suspicion that needs some substantiating evidence.
time won't help you any further in finding out why certain queries take so long.
Using dig will provide correct resolution times along with more details.
And I'd still recommend to use some real domains instead of your artificial ones for a meaningful benchmark and analysis (the benchmark page you linked tries to use real domains as well).
If you'd ask me to speculate: Public DNS servers are well monitored, so firing a set of queries for non-existent domains that are obviously generated by a template in short succession may well trigger defensive measures on their behalf.
@Bucking_Horn fair enough. I ran dig with several of my recent domains, and output the dig output as well as the time output using this command: for i in $(pbpaste); do HOME=/dev/null time dig $i ; done as I'm looking for individual timings not the total time for the group, I didn't use the file flag for dig. I've included the output from the queries that wound up taking 5s. So it looks like the response IS found quickly and then dig just hangs for 5s?
; <<>> DiG 9.10.6 <<>> cdn.discordapp.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44408
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;cdn.discordapp.com. IN A
;; ANSWER SECTION:
cdn.discordapp.com. 98 IN A 162.159.135.233
cdn.discordapp.com. 98 IN A 162.159.130.233
cdn.discordapp.com. 98 IN A 162.159.129.233
cdn.discordapp.com. 98 IN A 162.159.133.233
cdn.discordapp.com. 98 IN A 162.159.134.233
;; Query time: 9 msec
;; SERVER: 10.0.1.204#53(10.0.1.204)
;; WHEN: Wed Nov 18 14:25:16 MST 2020
;; MSG SIZE rcvd: 127
5.02 real 0.00 user 0.00 sys
; <<>> DiG 9.10.6 <<>> google.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47569
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.ca. IN A
;; ANSWER SECTION:
google.ca. 213 IN A 216.58.193.67
;; Query time: 5 msec
;; SERVER: 10.0.1.204#53(10.0.1.204)
;; WHEN: Wed Nov 18 14:25:22 MST 2020
;; MSG SIZE rcvd: 54
5.01 real 0.00 user 0.00 sys
Also I don't think I'm triggering defensive measures in cloudfare, because as I said in my first post, making the requests directly to cloudfare results in fast results every time. It's just querying the pi-hole that results in these weird delays. And the delay can affect any programs running on the client, not just dig: browsers, scripts, whatever. And it affects every client on the network too.
Running again without time, dig got two results in 1msec each. I do have a really fast upstream connection, so that part isn't all that surprising to me.
However while it was running, even if it wasn't being timed by time, there were noticeable pauses between some queries which were likely the 5s delay.
I can't explain why time would report as much as 5 seconds when dig itself returns pretty much instantly, but anyway: As long as you don't see dig returning long query times, I don't think your observations using time should raise your concern.
But the reason i started looking into this is that this affects browsers and other stuff running on my clients. The 5s time is reporting are indicative of what I see just using devices, about 10% of queries just stall for several seconds.
The same reasoning would apply:
Are you able to correlate browser responsiveness to DNS query times of about 5 seconds as logged in Pi-hole's Query Log?
I'm not seeing values of 5s in the reply column of the query log table. Is there a way for me to generate that data on the cli on the pi? The sorting by reply column doesn't seem to be working for me.
[2020-11-19 06:08:25.568 7672M] **** new UDP query[A] query "connectivity-check.ubuntu.com" from eth0:10.0.10.136 (ID 71, FTL 10172, /root/project/src/dnsmasq/forward.c:1557)
[2020-11-19 06:08:25.568 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.569 7672M] **** forwarded connectivity-check.ubuntu.com to 127.0.0.1#5335 (ID 71, /root/project/src/dnsmasq/forward.c:550)
[2020-11-19 06:08:25.569 7672M] **** got DNSSEC details for connectivity-check.ubuntu.com: 2 (ID 71, /root/project/src/dnsmasq/forward.c:1238)
[2020-11-19 06:08:25.570 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.570 7672M] CNAME connectivity-check.ubuntu.com
[2020-11-19 06:08:25.570 7672M] **** got reply connectivity-check.ubuntu.com is 35.224.99.156 (ID 71, /root/project/src/dnsmasq/cache.c:494)
[2020-11-19 06:08:25.570 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.570 7672M] CNAME connectivity-check.ubuntu.com
[2020-11-19 06:08:25.570 7672M] **** got reply connectivity-check.ubuntu.com is 35.222.85.5 (ID 71, /root/project/src/dnsmasq/cache.c:49
@DL6ER is there a way to output the reply time to the cli for every query?
Here's a merged pcap from the client and raspberry pi of a problematic query and a normal one.
Problem Query:
No.
Time
Source
Port S
Destination
Port D
Protocol
Info
Length
79501
2020-11-18 17:23:05.034234
SECUR-T.local
64583
raspberrypi
53
DNS
Standard query 0xeef6 A domain.does.not.exist.7 OPT
94
79513
2020-11-18 17:23:05.036727
raspberrypi
53
SECUR-T.local
64583
DNS
Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT
94
79514
2020-11-18 17:23:05.040627
SECUR-T.local
64583
raspberrypi
53
DNS
Standard query 0xeef6 A domain.does.not.exist.7 OPT
94
79515
2020-11-18 17:23:05.042739
raspberrypi
53
SECUR-T.local
64583
DNS
Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT
94
82475
2020-11-18 17:23:10.036756
SECUR-T.local
64583
raspberrypi
53
DNS
Standard query 0xeef6 A domain.does.not.exist.7 OPT
94
82476
2020-11-18 17:23:10.043180
SECUR-T.local
64583
raspberrypi
53
DNS
Standard query 0xeef6 A domain.does.not.exist.7 OPT
94
82477
2020-11-18 17:23:10.064949
raspberrypi
53
SECUR-T.local
64583
DNS
Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT
94
82478
2020-11-18 17:23:10.070844
raspberrypi
53
SECUR-T.local
64583
DNS
Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT
94
Normal Query:
No.
Time
Source
Port S
Destination
Port D
Protocol
Info
Length
79471
2020-11-18 17:23:04.985761
SECUR-T.local
62129
raspberrypi
53
DNS
Standard query 0xde0b A domain.does.not.exist.6 OPT
94
79472
2020-11-18 17:23:04.988310
raspberrypi
53
SECUR-T.local
62129
DNS
Standard query response 0xde0b No such name A domain.does.not.exist.6 OPT
94
79473
2020-11-18 17:23:04.992161
SECUR-T.local
62129
raspberrypi
53
DNS
Standard query 0xde0b A domain.does.not.exist.6 OPT
94
79474
2020-11-18 17:23:04.994326
raspberrypi
53
SECUR-T.local
62129
DNS
Standard query response 0xde0b No such name A domain.does.not.exist.6 OPT
94
I'm not sure if the merged pcap is responsible for the duplicated query & response in the fast case, but it seems that there's additional duplication or possibly a retry in the slow case. But no AAAA queries to be seen which is odd, as my network does ipv6, and my problem seems so similar to the description here: linux - DNS lookups sometimes take 5 seconds - Unix & Linux Stack Exchange
not directly, however, you have the information available indirectly. In your example the reply time was about 1 millisecond, so your local unbound had the domain cached.
Ok I'm working on getting a ULA for my pi in the ipv6 address space. Previously there was no dns server in the ipv6 address space and everything had to use the static ipv4 address. Since I'm using an edgerouter, making this change is rather difficult, but I'll comment back when it's done.