Very Slow Queries (>100 ms) for Hosts with Many CNAMEs / Collapsed Answers – How to Debug & Fix? (site = www.ucla.edu)

Pi-hole seems very slow with collapsed/recursive queries, even when cached. For example, www.ucla.edu resolves to (4) CNAMEs. Querying Pi-hole averages ~115 ms, whereas my router's DNS is ~1.8 ms (see results below). Hostnames with fewer CNAME entries show a smaller delta between Pi-hole and the router's DNS server (e.g., www.facebook.com: ~4 ms avg via Pi-hole vs. ~1.5 ms via router DNS).

Questions

  • Is this poor performance expected?
  • Are there config changes that can help here?
  • How can I trace dnsmasq / pihole-FTL to find the issue?

What I've Tried to Fix

  • I've set local-TTL=360 to force caching (config sketch below).
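
For reference, a minimal sketch of the drop-in file I'm using (the filename is just an example). Note that, per the dnsmasq man page, local-ttl applies to locally known data such as /etc/hosts entries and DHCP leases, so it may not affect cached upstream answers at all:

# /etc/dnsmasq.d/99-local-ttl.conf  (example filename)
# local-ttl sets the TTL returned for local data (hosts file, DHCP leases);
# upstream answers keep whatever TTL the authoritative server supplied.
local-ttl=360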

Pi-hole results for www.ucla.edu, AVG = 143 ms [n=30, concurrent=2]

dnstrace -n30 -c2 -s 192.168.1.250 -r www.ucla.edu

DNS timings, 60 datapoints
	 min:		 71.303168ms
	 mean:		 143.480149ms
	 [+/-sd]:	 28.928263ms
	 max:		 301.989887ms

Router results for www.ucla.edu, AVG = 1.8 ms

➜  ~ dnstrace -n30 -c2 -s 192.168.1.1 -r www.ucla.edu|head -n 20
Benchmarking 192.168.1.1:53 via udp with 2 concurrent requests 

DNS timings, 60 datapoints
	 min:		 1.31072ms
	 mean:		 1.883067ms
	 [+/-sd]:	 353.3µs
	 max:		 3.407871ms

DNS answer for www.ucla.edu

 dig +short @192.168.1.250 www.ucla.edu                     
d2qnx9mnmqbxo6.cloudfront.net.
13.226.237.98
13.226.237.30
13.226.237.64
13.226.237.101     

Pi-hole is acceptable for www.facebook.com, AVG = 4 ms

dnstrace -n30 -c2 -s 192.168.1.250 -r www.facebook.com


DNS timings, 60 datapoints
	 min:		 2.62144ms
	 mean:		 4.010803ms
	 [+/-sd]:	 898.966µs
	 max:		 7.864319ms

Pi-hole log sample during the test

Findings:

  • Results are cached.
  • Pi-hole seems to query each CNAME.
  • Query latency appears to scale with the number of CNAMEs (log sample below).

Sep  9 16:40:16 dnsmasq[21711]: query[A] www.ucla.edu from 192.168.1.134
Sep  9 16:40:16 dnsmasq[21711]: cached www.ucla.edu is <CNAME>
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.30
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.64
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.101
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.98
Sep  9 16:40:25 dnsmasq[21711]: query[A] www.ucla.edu from 192.168.1.134
Sep  9 16:40:25 dnsmasq[21711]: cached www.ucla.edu is <CNAME>
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.98
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.30
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.64
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.101
Sep  9 16:40:29 dnsmasq[21711]: query[type=65] gateway.fe.apple-dns.net from 192.168.1.133
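
For reproducibility: the sample above was captured by tailing the query log while the benchmark ran. Assuming the default v5 log location, either of these works:

pihole -t
# or equivalently:
tail -f /var/log/pihole.log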

Tests were performed using dnstrace.

Debug Token:

https://tricorder.pi-hole.net/etJFUb1t/

What are your dig results? I don't see what you are seeing for query times.

I don't know that particular tool. It seems to be a Docker image, but there are no ARM images available to test with.

dan@nanopineo:~$ dig www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.ucla.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46972
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           60      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.101
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.98
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.64

;; Query time: 282 msec
;; SERVER: 192.168.88.2#53(192.168.88.2)
;; WHEN: Thu Sep 09 17:04:20 PDT 2021
;; MSG SIZE  rcvd: 148

dan@nanopineo:~$ dig www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.ucla.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16514
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           57      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.64
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.98
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.101

;; Query time: 12 msec
;; SERVER: 192.168.88.2#53(192.168.88.2)
;; WHEN: Thu Sep 09 17:04:23 PDT 2021
;; MSG SIZE  rcvd: 148

dan@nanopineo:~$ dig www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.ucla.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6132
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           56      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.101
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.64
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.98

;; Query time: 10 msec
;; SERVER: 192.168.88.2#53(192.168.88.2)
;; WHEN: Thu Sep 09 17:04:24 PDT 2021
;; MSG SIZE  rcvd: 148

The first response takes a bit longer since nothing was in cache; the consecutive queries all came from cache, as you can see from the short query times.

And there's only one CNAME: www.ucla.edu >> d2qnx9mnmqbxo6.cloudfront.net >> round-robin A records from Amazon CloudFront.

Lastly, that 353.3 µs is the standard deviation, not the actual query time. The minimum, maximum and mean are what you should be looking at, the mean being the most relevant.

That's a typo; I meant the 1.3 ms mean. I'll update.

With dig, my cached queries are 80–115 ms:

 dig +recurse  @192.168.1.250 www.ucla.edu
[truncated]
;; Query time: 113 msec
;; SERVER: 192.168.1.250#53(192.168.1.250)
;; WHEN: Thu Sep 09 17:15:05 PDT 2021
;; MSG SIZE  rcvd: 148

dig +norecurse  @192.168.1.250 www.ucla.edu

; <<>> DiG 9.16.20 <<>> +norecurse @192.168.1.250 www.ucla.edu
[truncated]
;; Query time: 732 msec

Is that on the Pi-hole server or a client?

I'm showing vastly different results on a NanoPi Neo with 512M RAM:

dan@nanopineo:~$ dig +norecurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +norecurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 16120
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; Query time: 59 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:18:20 PDT 2021
;; MSG SIZE  rcvd: 30

dan@nanopineo:~$ dig +norecurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +norecurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 40470
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; Query time: 5 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:18:22 PDT 2021
;; MSG SIZE  rcvd: 30

dan@nanopineo:~$ dig +norecurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +norecurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55265
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; Query time: 7 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:18:23 PDT 2021
;; MSG SIZE  rcvd: 30

Do you have a tool that can test a sample (multiple + concurrent queries)?

dnstrace can be installed with "go install github.com/redsift/dnstrace@latest" to build locally.

A shortcoming of dig is that you aren't getting an adequate sample size (if you'd rather not install Go, see the shell sketch below).
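
A rough shell substitute that collects repeated dig timings and summarizes them (sequential rather than concurrent, so it only approximates dnstrace; the server IP and n=30 are taken from my earlier runs):

for i in $(seq 1 30); do
  # pull the number out of dig's ";; Query time: N msec" line
  dig @192.168.1.250 www.ucla.edu | awk '/Query time/ {print $4}'
done | sort -n | awk '{v[NR]=$1; s+=$1} END {printf "n=%d min=%s max=%s mean=%.1f msec\n", NR, v[1], v[NR], s/NR}'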

I just noticed the REFUSED results, here's a full result.

dan@nanopineo:~$ dig +recurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +recurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15151
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           58      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.98
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.64
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.101

;; Query time: 30 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:21:12 PDT 2021
;; MSG SIZE  rcvd: 148

I'm testing from a client. The server is running on a Pi Zero (non-W).

I'll run some tests straight from the server to see the difference.

I'm not installing a full go toolchain to get one binary, sorry.

What are you using for the upstream?

What I'm seeing is that results are served from cache, and they're fast from cache. Throwing more queries at the resolver isn't going to change that or cause the results to slow down. So we need to find where the slowdown is actually happening and why.

Run from the server, I see a big variance between +recurse and +norecurse:

$ dig @localhost +norecurse www.ucla.edu|grep time
;; Query time: 65 msec
pi@piholezero:~ $ dig @localhost +recurse www.ucla.edu|grep time
;; Query time: 57 msec
pi@piholezero:~ $ dig @localhost +recurse www.ucla.edu|grep time
;; Query time: 55 msec
pi@piholezero:~ $ dig @localhost +norecurse www.ucla.edu|grep time
;; Query time: 2007 msec

You're getting a lot of noise and very little signal.

Change the upstream to something besides the router and check the actual results to see if you're getting back IP addresses and answers instead of timing out. And remove any changes you've made to TTL values in the configurations. (A quick check is sketched below.)
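
A quick way to verify both the outcome and the timing in one shot (the server IP is from your earlier tests; a healthy cached reply should show status: NOERROR, ANSWER: 5, and a low query time):

dig @192.168.1.250 www.ucla.edu | grep -E 'status:|ANSWER:|Query time:'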

You can run dig -f filename and the process will resolve every domain contained in the file, for example:
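
A minimal sketch (the file path and domain list are just illustrative):

printf 'www.ucla.edu\nwww.facebook.com\nwww.ucla.edu\n' > /tmp/domains.txt
dig @192.168.1.250 -f /tmp/domains.txt | grep 'Query time'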

You can also try dnsping (e.g., the one from the dnsdiag suite).

I recommend having a statistical DNS benchmarking tool.

Upstream = router (192.168.1.1, very fast), then ISP DNS 75.75.75.75.

The test above (against the router) shows the response time at 192.168.1.1 is very low, ~1.8 ms. That makes me confident that the high-latency signal I'm getting comes from testing Pi-hole itself. Can you clarify?

dan@nanopineo:~$ dnsping -s 127.0.0.1 www.ucla.edu
dnsping DNS: 127.0.0.1:53, hostname: www.ucla.edu, rdatatype: A
212 bytes from 127.0.0.1: seq=0   time=148.009 ms
212 bytes from 127.0.0.1: seq=1   time=11.096 ms
212 bytes from 127.0.0.1: seq=2   time=10.740 ms
212 bytes from 127.0.0.1: seq=3   time=19.682 ms
212 bytes from 127.0.0.1: seq=4   time=19.038 ms
212 bytes from 127.0.0.1: seq=5   time=21.634 ms
212 bytes from 127.0.0.1: seq=6   time=10.756 ms
212 bytes from 127.0.0.1: seq=7   time=18.132 ms
212 bytes from 127.0.0.1: seq=8   time=19.273 ms
212 bytes from 127.0.0.1: seq=9   time=18.999 ms

--- 127.0.0.1 dnsping statistics ---
10 requests transmitted, 10 responses received, 0% lost
min=10.740 ms, avg=29.736 ms, max=148.009 ms, stddev=41.763 ms

And just for comparison:

dan@nanopineo:~$ dnsping -s 8.8.8.8 www.ucla.edu
dnsping DNS: 8.8.8.8:53, hostname: www.ucla.edu, rdatatype: A
212 bytes from 8.8.8.8: seq=0   time=16.569 ms
212 bytes from 8.8.8.8: seq=1   time=18.149 ms
212 bytes from 8.8.8.8: seq=2   time=14.132 ms
212 bytes from 8.8.8.8: seq=3   time=14.882 ms
212 bytes from 8.8.8.8: seq=4   time=14.152 ms
212 bytes from 8.8.8.8: seq=5   time=14.519 ms
212 bytes from 8.8.8.8: seq=6   time=13.936 ms
212 bytes from 8.8.8.8: seq=7   time=20.670 ms
212 bytes from 8.8.8.8: seq=8   time=13.472 ms
212 bytes from 8.8.8.8: seq=9   time=13.521 ms

--- 8.8.8.8 dnsping statistics ---
10 requests transmitted, 10 responses received, 0% lost
min=13.472 ms, avg=15.400 ms, max=20.670 ms, stddev=2.364 ms

On a fresh VM install I get results similar to yours.

Is there tracing I can activate for FTL so I can see where the extra latency is coming from? My theory is that something like logging I/O or the recursive CNAME walk is causing the difference between www.ucla.edu and www.facebook.com, and the reduced resources of the server are highlighting that issue well. (A debug-config sketch of what I have in mind is below.)
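
Something like FTL's debug flags is what I have in mind; a sketch assuming a v5-era /etc/pihole/pihole-FTL.conf (flag names would need verifying against the FTL documentation for your version):

# /etc/pihole/pihole-FTL.conf
DEBUG_QUERIES=true   # verbose per-query processing, logged to /var/log/pihole-FTL.log
# DEBUG_ALL=true     # much noisier: enables every debug category

# apply the change and watch the log:
sudo service pihole-FTL restart
tail -f /var/log/pihole-FTL.log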