About 10% of queries take 5s to respond

CamJN · November 18, 2020, 7:08am

Non-standard Elements

None that I can think of.

Expected Behaviour:

Queries should be answered in a reasonable time. Direct queries to cloudfare (the upstream dns i use) respond quickly when run on the pi:

$ for i in `seq 1 10`; do time dig this.domain.does.not.exist.in.the.cache.com.example.$i +short @1.1.1.1 ; done

real	0m0.088s
user	0m0.070s
sys	0m0.011s

real	0m0.078s
user	0m0.060s
sys	0m0.011s

real	0m0.072s
user	0m0.037s
sys	0m0.028s

real	0m0.057s
user	0m0.030s
sys	0m0.020s

real	0m0.059s
user	0m0.040s
sys	0m0.010s

real	0m0.058s
user	0m0.010s
sys	0m0.040s

real	0m0.058s
user	0m0.050s
sys	0m0.001s

real	0m0.058s
user	0m0.037s
sys	0m0.013s

real	0m0.058s
user	0m0.030s
sys	0m0.021s

real	0m0.058s
user	0m0.025s
sys	0m0.026s

real	0m0.057s
user	0m0.025s
sys	0m0.025s

real	0m0.059s
user	0m0.030s
sys	0m0.021s

real	0m0.058s
user	0m0.049s
sys	0m0.001s

real	0m0.058s
user	0m0.050s
sys	0m0.001s

real	0m0.064s
user	0m0.035s
sys	0m0.018s

real	0m0.080s
user	0m0.041s
sys	0m0.032s

real	0m0.060s
user	0m0.043s
sys	0m0.010s

real	0m0.058s
user	0m0.030s
sys	0m0.021s

real	0m0.058s
user	0m0.039s
sys	0m0.011s

Actual Behaviour:

It takes ~5s for my pi-hole to respond to a query about 10% of the time, usually for new domains.

Run on my laptop (wired):

 $ for i in `seq 2 20`; do time dig this.domain.does.not.exist.in.the.cache.com.example.$i +short ; done

real	0m0.053s
user	0m0.002s
sys	0m0.005s

real	0m0.048s
user	0m0.002s
sys	0m0.004s

real	0m0.074s
user	0m0.002s
sys	0m0.005s

real	0m0.054s
user	0m0.002s
sys	0m0.005s

real	0m0.055s
user	0m0.002s
sys	0m0.005s

real	0m0.076s
user	0m0.002s
sys	0m0.005s

real	0m0.051s
user	0m0.002s
sys	0m0.005s

real	0m0.054s
user	0m0.002s
sys	0m0.005s

real	0m0.048s
user	0m0.002s
sys	0m0.004s

real	0m5.016s
user	0m0.002s
sys	0m0.005s

real	0m0.058s
user	0m0.003s
sys	0m0.006s

real	0m0.048s
user	0m0.002s
sys	0m0.004s

real	0m0.048s
user	0m0.002s
sys	0m0.005s

real	0m0.045s
user	0m0.002s
sys	0m0.004s

real	0m0.047s
user	0m0.002s
sys	0m0.005s

real	0m0.046s
user	0m0.002s
sys	0m0.005s

real	0m0.055s
user	0m0.002s
sys	0m0.004s

real	0m0.049s
user	0m0.002s
sys	0m0.006s

real	0m0.080s
user	0m0.002s
sys	0m0.004s

Debug Token:

https://tricorder.pi-hole.net/kvgx9pdlgw

Additional info:

All connections are wired.
CPU use does not jump when queries are slow ~1-5%.
Lan has both ipv4 and ipv6 (prefix delegated) addresses.
An Edgerouter-x is the router.

Bucking_Horn · November 18, 2020, 8:20am

time cannot reveal to you how long actual DNS resolution took, it will distort values by whatever random delay your OS is arbitrarily imposing.

Fortunately, DNS resolution duration is readily available from dig output:

~ $ time dig this.domain.does.not.exist.in.the.cache.com.example.1

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Raspbian <<>> this.domain.does.not.exist.in.the.cache.com.example.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51697
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;this.domain.does.not.exist.in.the.cache.com.example.1. IN A

;; Query time: 32 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Wed Nov 18 08:52:55 CET 2020
;; MSG SIZE  rcvd: 82

real    0m0.284s
user    0m0.154s
sys     0m0.084s

DNS resolution took 32ms, while time reports a durations that is roughly nine times higher - so off by almost an order of magnitude.

Use some real world domain examples and see if observing DNS resolution times as reported by dig would confirm your suspicion.

Coro · November 18, 2020, 3:29pm

It's still unlikely that the OS delays the execution of dig by 5 seconds. @CamJN Is the system you're testing from under high load? What is the output of uptime ?

CamJN · November 18, 2020, 5:24pm

The system I'm using to test the dns resolution is not under any meaningful load, it's a 6 core mbp that was not doing anything but the dns test. uptime says 10:23 up 12 days, 20:10, 7 users, load averages: 2.41 2.49 3.53

@Bucking_Horn I used time instead of the dig timing because time was used here: Redirecting..., but dig also reports 5s. Also, reading the dig man page, the default timeout is 5 seconds, so these queries are likely timing out.

Bucking_Horn · November 18, 2020, 9:17pm

Yes, comparing timings to benchmark a certain set of variables and conditions against another is sound, I won't argue against that. It allowed you to observe a difference, but that's about it.

I'm not denying your observation as such, but until now, that's a suspicion that needs some substantiating evidence.

time won't help you any further in finding out why certain queries take so long.
Using dig will provide correct resolution times along with more details.

And I'd still recommend to use some real domains instead of your artificial ones for a meaningful benchmark and analysis (the benchmark page you linked tries to use real domains as well).
If you'd ask me to speculate: Public DNS servers are well monitored, so firing a set of queries for non-existent domains that are obviously generated by a template in short succession may well trigger defensive measures on their behalf.

CamJN · November 18, 2020, 9:33pm

@Bucking_Horn fair enough. I ran dig with several of my recent domains, and output the dig output as well as the time output using this command: for i in $(pbpaste); do HOME=/dev/null time dig $i ; done as I'm looking for individual timings not the total time for the group, I didn't use the file flag for dig. I've included the output from the queries that wound up taking 5s. So it looks like the response IS found quickly and then dig just hangs for 5s?

; <<>> DiG 9.10.6 <<>> cdn.discordapp.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44408
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;cdn.discordapp.com.		IN	A

;; ANSWER SECTION:
cdn.discordapp.com.	98	IN	A	162.159.135.233
cdn.discordapp.com.	98	IN	A	162.159.130.233
cdn.discordapp.com.	98	IN	A	162.159.129.233
cdn.discordapp.com.	98	IN	A	162.159.133.233
cdn.discordapp.com.	98	IN	A	162.159.134.233

;; Query time: 9 msec
;; SERVER: 10.0.1.204#53(10.0.1.204)
;; WHEN: Wed Nov 18 14:25:16 MST 2020
;; MSG SIZE  rcvd: 127

        5.02 real         0.00 user         0.00 sys


; <<>> DiG 9.10.6 <<>> google.ca
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 47569
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;google.ca.			IN	A

;; ANSWER SECTION:
google.ca.		213	IN	A	216.58.193.67

;; Query time: 5 msec
;; SERVER: 10.0.1.204#53(10.0.1.204)
;; WHEN: Wed Nov 18 14:25:22 MST 2020
;; MSG SIZE  rcvd: 54

        5.01 real         0.00 user         0.00 sys

CamJN · November 18, 2020, 9:38pm

Also I don't think I'm triggering defensive measures in cloudfare, because as I said in my first post, making the requests directly to cloudfare results in fast results every time. It's just querying the pi-hole that results in these weird delays. And the delay can affect any programs running on the client, not just dig: browsers, scripts, whatever. And it affects every client on the network too.

Bucking_Horn · November 18, 2020, 9:41pm

Query times of 9 and 5 msecs rule out my "defensive" speculations (unless cached).
But they also shift the blame away from DNS.

Would you observe 5 secs response times when just using dig without time?

CamJN · November 18, 2020, 9:45pm

Running again without time, dig got two results in 1msec each. I do have a really fast upstream connection, so that part isn't all that surprising to me.

However while it was running, even if it wasn't being timed by time, there were noticeable pauses between some queries which were likely the 5s delay.

Bucking_Horn · November 18, 2020, 9:52pm

1 msec could be served from cache.

I can't explain why time would report as much as 5 seconds when dig itself returns pretty much instantly, but anyway: As long as you don't see dig returning long query times, I don't think your observations using time should raise your concern.

CamJN · November 18, 2020, 9:54pm

But the reason i started looking into this is that this affects browsers and other stuff running on my clients. The 5s time is reporting are indicative of what I see just using devices, about 10% of queries just stall for several seconds.

Bucking_Horn · November 18, 2020, 9:58pm

The same reasoning would apply:
Are you able to correlate browser responsiveness to DNS query times of about 5 seconds as logged in Pi-hole's Query Log?

CamJN · November 18, 2020, 10:58pm

Looks like I'm experiencing this: linux - DNS lookups sometimes take 5 seconds - Unix & Linux Stack Exchange

So not the pi-hole's fault I guess.

Bucking_Horn · November 18, 2020, 11:52pm

Does that imply you are not observing 5 seconds reply times in Pi-hole's Query Log?

That wouldn't rule out DNS completely, and Pi-hole may yet have a part in it if you'd configured an incorrect IPv6 address in your router.

CamJN · November 19, 2020, 12:04am

I'm not seeing values of 5s in the reply column of the query log table. Is there a way for me to generate that data on the cli on the pi? The sorting by reply column doesn't seem to be working for me.

yubiuser · November 19, 2020, 4:40am

Unfortunately, a known bug
https://github.com/pi-hole/AdminLTE/issues/1249

Not directly, as far as I know.

I tried to add DEBUG_QUERIES=true to /etc/pihole/pihole-FTL.conf as it should pint "extensive query information (domains, types, replies, etc.)." but I can't see the reply time in /var/log/pihole-FTL.log

[2020-11-19 06:08:25.568 7672M] **** new UDP query[A] query "connectivity-check.ubuntu.com" from eth0:10.0.10.136 (ID 71, FTL 10172, /root/project/src/dnsmasq/forward.c:1557)
[2020-11-19 06:08:25.568 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.569 7672M] **** forwarded connectivity-check.ubuntu.com to 127.0.0.1#5335 (ID 71, /root/project/src/dnsmasq/forward.c:550)
[2020-11-19 06:08:25.569 7672M] **** got DNSSEC details for connectivity-check.ubuntu.com: 2 (ID 71, /root/project/src/dnsmasq/forward.c:1238)
[2020-11-19 06:08:25.570 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.570 7672M] CNAME connectivity-check.ubuntu.com
[2020-11-19 06:08:25.570 7672M] **** got reply connectivity-check.ubuntu.com is 35.224.99.156 (ID 71, /root/project/src/dnsmasq/cache.c:494)
[2020-11-19 06:08:25.570 7672M] connectivity-check.ubuntu.com is known as not to be blocked
[2020-11-19 06:08:25.570 7672M] CNAME connectivity-check.ubuntu.com
[2020-11-19 06:08:25.570 7672M] **** got reply connectivity-check.ubuntu.com is 35.222.85.5 (ID 71, /root/project/src/dnsmasq/cache.c:49

@DL6ER is there a way to output the reply time to the cli for every query?

CamJN · November 19, 2020, 5:55am

Here's a merged pcap from the client and raspberry pi of a problematic query and a normal one.

Problem Query:

No.	Time	Source	Port S	Destination	Port D	Protocol	Info	Length
79501	2020-11-18 17:23:05.034234	SECUR-T.local	64583	raspberrypi	53	DNS	Standard query 0xeef6 A domain.does.not.exist.7 OPT	94
79513	2020-11-18 17:23:05.036727	raspberrypi	53	SECUR-T.local	64583	DNS	Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT	94
79514	2020-11-18 17:23:05.040627	SECUR-T.local	64583	raspberrypi	53	DNS	Standard query 0xeef6 A domain.does.not.exist.7 OPT	94
79515	2020-11-18 17:23:05.042739	raspberrypi	53	SECUR-T.local	64583	DNS	Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT	94
82475	2020-11-18 17:23:10.036756	SECUR-T.local	64583	raspberrypi	53	DNS	Standard query 0xeef6 A domain.does.not.exist.7 OPT	94
82476	2020-11-18 17:23:10.043180	SECUR-T.local	64583	raspberrypi	53	DNS	Standard query 0xeef6 A domain.does.not.exist.7 OPT	94
82477	2020-11-18 17:23:10.064949	raspberrypi	53	SECUR-T.local	64583	DNS	Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT	94
82478	2020-11-18 17:23:10.070844	raspberrypi	53	SECUR-T.local	64583	DNS	Standard query response 0xeef6 No such name A domain.does.not.exist.7 OPT	94

Normal Query:

No.	Time	Source	Port S	Destination	Port D	Protocol	Info	Length
79471	2020-11-18 17:23:04.985761	SECUR-T.local	62129	raspberrypi	53	DNS	Standard query 0xde0b A domain.does.not.exist.6 OPT	94
79472	2020-11-18 17:23:04.988310	raspberrypi	53	SECUR-T.local	62129	DNS	Standard query response 0xde0b No such name A domain.does.not.exist.6 OPT	94
79473	2020-11-18 17:23:04.992161	SECUR-T.local	62129	raspberrypi	53	DNS	Standard query 0xde0b A domain.does.not.exist.6 OPT	94
79474	2020-11-18 17:23:04.994326	raspberrypi	53	SECUR-T.local	62129	DNS	Standard query response 0xde0b No such name A domain.does.not.exist.6 OPT	94

I'm not sure if the merged pcap is responsible for the duplicated query & response in the fast case, but it seems that there's additional duplication or possibly a retry in the slow case. But no AAAA queries to be seen which is odd, as my network does ipv6, and my problem seems so similar to the description here: linux - DNS lookups sometimes take 5 seconds - Unix & Linux Stack Exchange

Bucking_Horn · November 19, 2020, 6:55am

Wouldn't be odd if...

Please upload a debug log and post just the token generated by

pihole -d

allowing to upload when prompted, or do it through the Web interface:

Tools > Generate Debug Log

DL6ER · November 19, 2020, 9:49pm

not directly, however, you have the information available indirectly. In your example the reply time was about 1 millisecond, so your local unbound had the domain cached.

CamJN · November 24, 2020, 5:03pm

Ok I'm working on getting a ULA for my pi in the ipv6 address space. Previously there was no dns server in the ipv6 address space and everything had to use the static ipv4 address. Since I'm using an edgerouter, making this change is rather difficult, but I'll comment back when it's done.