Very Slow Queries (>100 ms) for Hosts with Many CNAMEs / Collapsed Answers – How to Debug & Fix? (site = www.ucla.edu)

Pi-hole seems very slow with collapsed/recursive queries, even when cached. For example, www.ucla.edu resolves to (4) CNAMEs. Querying Pi-hole averages ~115 ms, whereas my router's DNS is ~1.8 ms (see results below). Hostnames with fewer CNAME entries show a smaller delta between Pi-hole and the router's DNS server (e.g., www.facebook.com: ~4 ms avg via Pi-hole vs. ~1.5 ms via router DNS).

Questions

  • Is this poor performance expected?
  • Are there config changes that can help here?
  • How can I trace dnsmasq / pihole-FTL to find the issue?

What I've Tried to Fix

  • I've set local-TTL=360 to force caching (config sketch below).
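
For reference, a minimal sketch of the drop-in file I'm using (the filename is just an example). Note that, per the dnsmasq man page, local-ttl applies to locally known data such as /etc/hosts entries and DHCP leases, so it may not affect cached upstream answers at all:

# /etc/dnsmasq.d/99-local-ttl.conf  (example filename)
# local-ttl sets the TTL returned for local data (hosts file, DHCP leases);
# upstream answers keep whatever TTL the authoritative server supplied.
local-ttl=360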

Pi-hole results for www.ucla.edu, AVG = 143 ms [n=30, concurrent=2]

dnstrace -n30 -c2 -s 192.168.1.250 -r www.ucla.edu

DNS timings, 60 datapoints
	 min:		 71.303168ms
	 mean:		 143.480149ms
	 [+/-sd]:	 28.928263ms
	 max:		 301.989887ms

Router results for www.ucla.edu, AVG = 1.8 ms

➜  ~ dnstrace -n30 -c2 -s 192.168.1.1 -r www.ucla.edu|head -n 20
Benchmarking 192.168.1.1:53 via udp with 2 concurrent requests 

DNS timings, 60 datapoints
	 min:		 1.31072ms
	 mean:		 1.883067ms
	 [+/-sd]:	 353.3µs
	 max:		 3.407871ms

DNS answer for www.ucla.edu

 dig +short @192.168.1.250 www.ucla.edu                     
d2qnx9mnmqbxo6.cloudfront.net.
13.226.237.98
13.226.237.30
13.226.237.64
13.226.237.101     

Pi-hole is acceptable for www.facebook.com, AVG = 4 ms

dnstrace -n30 -c2 -s 192.168.1.250 -r www.facebook.com


DNS timings, 60 datapoints
	 min:		 2.62144ms
	 mean:		 4.010803ms
	 [+/-sd]:	 898.966µs
	 max:		 7.864319ms

Pi-hole log sample during the test

Findings:

  • Results are cached.
  • Pi-hole seems to query each CNAME.
  • Query latency appears to scale with the number of CNAMEs (log sample below).

Sep  9 16:40:16 dnsmasq[21711]: query[A] www.ucla.edu from 192.168.1.134
Sep  9 16:40:16 dnsmasq[21711]: cached www.ucla.edu is <CNAME>
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.30
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.64
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.101
Sep  9 16:40:16 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.98
Sep  9 16:40:25 dnsmasq[21711]: query[A] www.ucla.edu from 192.168.1.134
Sep  9 16:40:25 dnsmasq[21711]: cached www.ucla.edu is <CNAME>
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.98
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.30
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.64
Sep  9 16:40:25 dnsmasq[21711]: cached d2qnx9mnmqbxo6.cloudfront.net is 13.226.237.101
Sep  9 16:40:29 dnsmasq[21711]: query[type=65] gateway.fe.apple-dns.net from 192.168.1.133
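
For reproducibility: the sample above was captured by tailing the query log while the benchmark ran. Assuming the default v5 log location, either of these works:

pihole -t
# or equivalently:
tail -f /var/log/pihole.log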

Tests were performed using dnstrace.

Debug Token:

https://tricorder.pi-hole.net/etJFUb1t/

What are your dig results? I don't see what you are seeing for query times.

I don't know that particular tool. It seems to be a Docker image, but there are no ARM images available to test with.

dan@nanopineo:~$ dig www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.ucla.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46972
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           60      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.101
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.98
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 60 IN    A       13.226.237.64

;; Query time: 282 msec
;; SERVER: 192.168.88.2#53(192.168.88.2)
;; WHEN: Thu Sep 09 17:04:20 PDT 2021
;; MSG SIZE  rcvd: 148

dan@nanopineo:~$ dig www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.ucla.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16514
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           57      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.64
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.98
d2qnx9mnmqbxo6.cloudfront.net. 57 IN    A       13.226.237.101

;; Query time: 12 msec
;; SERVER: 192.168.88.2#53(192.168.88.2)
;; WHEN: Thu Sep 09 17:04:23 PDT 2021
;; MSG SIZE  rcvd: 148

dan@nanopineo:~$ dig www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> www.ucla.edu
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6132
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           56      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.101
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.64
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 56 IN    A       13.226.237.98

;; Query time: 10 msec
;; SERVER: 192.168.88.2#53(192.168.88.2)
;; WHEN: Thu Sep 09 17:04:24 PDT 2021
;; MSG SIZE  rcvd: 148

The first response takes a bit longer since nothing was in cache; the consecutive queries all came from cache, as you can see from the short query times.

And there's only one CNAME: www.ucla.edu >> d2qnx9mnmqbxo6.cloudfront.net >> round-robin A records from Amazon CloudFront.

Lastly, that 353.3 µs is the standard deviation, not the actual query time. The minimum, maximum and mean are what you should be looking at, the mean being the most relevant.

That's a typo; I meant the 1.3 ms mean. I'll update.

With dig, my cached queries are 80–115 ms:

 dig +recurse  @192.168.1.250 www.ucla.edu
[truncated]
;; Query time: 113 msec
;; SERVER: 192.168.1.250#53(192.168.1.250)
;; WHEN: Thu Sep 09 17:15:05 PDT 2021
;; MSG SIZE  rcvd: 148

dig +norecurse  @192.168.1.250 www.ucla.edu

; <<>> DiG 9.16.20 <<>> +norecurse @192.168.1.250 www.ucla.edu
[truncated]
;; Query time: 732 msec

Is that on the Pi-hole server or a client?

I'm showing vastly different results on a NanoPi Neo with 512M RAM:

dan@nanopineo:~$ dig +norecurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +norecurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 16120
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; Query time: 59 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:18:20 PDT 2021
;; MSG SIZE  rcvd: 30

dan@nanopineo:~$ dig +norecurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +norecurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 40470
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; Query time: 5 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:18:22 PDT 2021
;; MSG SIZE  rcvd: 30

dan@nanopineo:~$ dig +norecurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +norecurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 55265
;; flags: qr ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; Query time: 7 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:18:23 PDT 2021
;; MSG SIZE  rcvd: 30

Do you have a tool that can test a sample (multiple + concurrent queries)?

dnstrace can be installed with "go install github.com/redsift/dnstrace@latest" to build locally.

A shortcoming of dig is that you aren't getting an adequate sample size (if you'd rather not install Go, see the shell sketch below).
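
A rough shell substitute that collects repeated dig timings and summarizes them (sequential rather than concurrent, so it only approximates dnstrace; the server IP and n=30 are taken from my earlier runs):

for i in $(seq 1 30); do
  # pull the number out of dig's ";; Query time: N msec" line
  dig @192.168.1.250 www.ucla.edu | awk '/Query time/ {print $4}'
done | sort -n | awk '{v[NR]=$1; s+=$1} END {printf "n=%d min=%s max=%s mean=%.1f msec\n", NR, v[1], v[NR], s/NR}'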

I just noticed the REFUSED results, here's a full result.

dan@nanopineo:~$ dig +recurse @127.0.0.1 www.ucla.edu

; <<>> DiG 9.11.5-P4-5.1+deb10u5-Debian <<>> +recurse @127.0.0.1 www.ucla.edu
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15151
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;www.ucla.edu.                  IN      A

;; ANSWER SECTION:
www.ucla.edu.           58      IN      CNAME   d2qnx9mnmqbxo6.cloudfront.net.
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.98
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.30
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.64
d2qnx9mnmqbxo6.cloudfront.net. 59 IN    A       13.226.237.101

;; Query time: 30 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Sep 09 17:21:12 PDT 2021
;; MSG SIZE  rcvd: 148

I'm testing from a client. The server is running on a Pi Zero (non-W).

I'll run some tests straight from the server to see the difference.

I'm not installing a full go toolchain to get one binary, sorry.

What are you using for the upstream?

What I'm seeing is that results are served from cache, and they're fast from cache. Throwing more queries at the resolver isn't going to change that or cause the results to slow down. So we need to find where the slowdown is actually happening and why.

Run from the server, I see a big variance between +recurse and +norecurse:

$ dig @localhost +norecurse www.ucla.edu|grep time
;; Query time: 65 msec
pi@piholezero:~ $ dig @localhost +recurse www.ucla.edu|grep time
;; Query time: 57 msec
pi@piholezero:~ $ dig @localhost +recurse www.ucla.edu|grep time
;; Query time: 55 msec
pi@piholezero:~ $ dig @localhost +norecurse www.ucla.edu|grep time
;; Query time: 2007 msec

You're getting a lot of noise and very little signal.

Change the upstream to something besides the router and check the actual results to see if you're getting back IP addresses and answers instead of timing out. And remove any changes you've made to TTL values in the configurations. (A quick check is sketched below.)
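
A quick way to verify both the outcome and the timing in one shot (the server IP is from your earlier tests; a healthy cached reply should show status: NOERROR, ANSWER: 5, and a low query time):

dig @192.168.1.250 www.ucla.edu | grep -E 'status:|ANSWER:|Query time:'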

You can run dig -f filename and the process will resolve every domain contained in the file, for example:
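
A minimal sketch (the file path and domain list are just illustrative):

printf 'www.ucla.edu\nwww.facebook.com\nwww.ucla.edu\n' > /tmp/domains.txt
dig @192.168.1.250 -f /tmp/domains.txt | grep 'Query time'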

You can also try dnsping (e.g., the one from the dnsdiag suite).

I recommend having a statistical DNS benchmarking tool.

Upstream = router (192.168.1.1, very fast), then ISP DNS 75.75.75.75.

The test above (against the router) shows the response time at 192.168.1.1 is very low, ~1.8 ms. That makes me confident that the high-latency signal I'm getting comes from testing Pi-hole itself. Can you clarify?

dan@nanopineo:~$ dnsping -s 127.0.0.1 www.ucla.edu
dnsping DNS: 127.0.0.1:53, hostname: www.ucla.edu, rdatatype: A
212 bytes from 127.0.0.1: seq=0   time=148.009 ms
212 bytes from 127.0.0.1: seq=1   time=11.096 ms
212 bytes from 127.0.0.1: seq=2   time=10.740 ms
212 bytes from 127.0.0.1: seq=3   time=19.682 ms
212 bytes from 127.0.0.1: seq=4   time=19.038 ms
212 bytes from 127.0.0.1: seq=5   time=21.634 ms
212 bytes from 127.0.0.1: seq=6   time=10.756 ms
212 bytes from 127.0.0.1: seq=7   time=18.132 ms
212 bytes from 127.0.0.1: seq=8   time=19.273 ms
212 bytes from 127.0.0.1: seq=9   time=18.999 ms

--- 127.0.0.1 dnsping statistics ---
10 requests transmitted, 10 responses received, 0% lost
min=10.740 ms, avg=29.736 ms, max=148.009 ms, stddev=41.763 ms

And just for comparison:

dan@nanopineo:~$ dnsping -s 8.8.8.8 www.ucla.edu
dnsping DNS: 8.8.8.8:53, hostname: www.ucla.edu, rdatatype: A
212 bytes from 8.8.8.8: seq=0   time=16.569 ms
212 bytes from 8.8.8.8: seq=1   time=18.149 ms
212 bytes from 8.8.8.8: seq=2   time=14.132 ms
212 bytes from 8.8.8.8: seq=3   time=14.882 ms
212 bytes from 8.8.8.8: seq=4   time=14.152 ms
212 bytes from 8.8.8.8: seq=5   time=14.519 ms
212 bytes from 8.8.8.8: seq=6   time=13.936 ms
212 bytes from 8.8.8.8: seq=7   time=20.670 ms
212 bytes from 8.8.8.8: seq=8   time=13.472 ms
212 bytes from 8.8.8.8: seq=9   time=13.521 ms

--- 8.8.8.8 dnsping statistics ---
10 requests transmitted, 10 responses received, 0% lost
min=13.472 ms, avg=15.400 ms, max=20.670 ms, stddev=2.364 ms

On a fresh VM install I get results similar to yours.

Is there tracing I can activate for FTL so I can see where the extra latency is coming from? My theory is that something like logging I/O or the recursive CNAME walk is causing the difference between www.ucla.edu and www.facebook.com, and the reduced resources of the server are highlighting that issue well. (A debug-config sketch of what I have in mind is below.)
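
Something like FTL's debug flags is what I have in mind; a sketch assuming a v5-era /etc/pihole/pihole-FTL.conf (flag names would need verifying against the FTL documentation for your version):

# /etc/pihole/pihole-FTL.conf
DEBUG_QUERIES=true   # verbose per-query processing, logged to /var/log/pihole-FTL.log
# DEBUG_ALL=true     # much noisier: enables every debug category

# apply the change and watch the log:
sudo service pihole-FTL restart
tail -f /var/log/pihole-FTL.log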