There are a number of debugging options in pihole-FTL. One of these may meet your needs.
e.g. testing on an i386 VM shows acceptable cached results:

```
# uncached, upstream 8.8.8.8 = 140 ms
$ dig @localhost www.ucla.edu
[truncated]
;; Query time: 140 msec

# cached
$ dig @localhost www.ucla.edu
[truncated]
;; Query time: 0 msec
```
What are the resources? My results are from a Pi-hole running as a Docker container on a tiny SBC with half a gig of RAM and an SD card for storage. You can't get much more resource-constrained than that.
If you want to run pihole-FTL under gdb, you can follow the linked guide.
I was able to resolve the issue by restarting pihole-FTL (`sudo service pihole-FTL restart`).
I'll use the guidance here to try to debug if the issue comes up again.
That doesn't force caching.
Pi-hole already caches DNS results according to their TTL by default.
`local-ttl` only affects "information from /etc/hosts or configuration or the DHCP leases" - in the context of Pi-hole, that would be blocked domains specifically (but not exclusively).
Your benchmark setup is flawed in at least two ways:
- You are not testing Pi-hole specifically, but rather the whole DNS resolution chain.
  Reported timings comprise Pi-hole's own latency plus that of all upstream DNS servers involved in resolving a domain, i.e. your router's and any upstream DNS server your router uses. Pi-hole may contribute just 10 ms to your maximum figure of 301.99 ms, or it may take 300 ms; you cannot tell either way from that lone figure.
- You do not control the load on that resolution chain.
  According to your debug log, you are using your router as Pi-hole's only upstream. Any DNS traffic on your network, any DNS traffic of your router itself, and the computational load of all traffic flowing through your router at the time of testing will impact your measurements, as will any load on the public upstreams. Furthermore, any upstream DNS server used by your router may stop answering further DNS queries once its quota for your IP has been exceeded, which is more likely to happen if you increase load on purpose.
Any figures you've captured are thus likely unsuited for comparisons, as you would not be able to attribute them individually, nor to recreate the exact conditions under which you gathered them.
You'd have to revise your testing conditions to eliminate all potential outside influence for your results to gain in significance: query only domains that Pi-hole knows about to evaluate Pi-hole's own performance, and query only domains that its direct upstream will answer without rate-limiting to evaluate Pi-hole's forwarding performance, preferably in an isolated network comprising just the test clients, Pi-hole and its upstream.
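As a rough sketch of the first case, assuming a hypothetical Pi-hole at 192.168.0.5 and a domain that is already on its blocklist (substitute one from your own lists): the reply is generated locally, so the reported query time reflects Pi-hole alone.

```bash
# Query Pi-hole directly for a domain you know is blocked; the answer never
# leaves Pi-hole, so ";; Query time:" measures only Pi-hole's own latency.
dig @192.168.0.5 doubleclick.net +noall +answer +stats
```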
An alternative to putting artificial load on systems that you do not control would be monitoring your average latency as observable during Pi-hole's real-world operation. This would also give you real-world mean values instead of or in addition to hypothetical statistics.
This comment has been very helpful. I could use some help defining a better experiment to identify the performance issues.
I'm a big fan of Pi-hole, but I'm having to bypass it during the day due to DNS lookup issues (evidenced by DNS latency in the Chrome profiler). I admit the test above does have noise, but there is a huge discrepancy between Pi-hole and my router's DNS server. I'm hoping this experiment helps me contribute to a performance fix.
Just to help clarify what I'm trying to investigate: I'm seeing sluggishness that is interfering with network performance. My goal is to build a reproducible test to help identify which queries are slow and why. Based on the evidence above (www.ucla.edu being slower than www.facebook.com), I thought this was a case where collapsed queries were an issue.
You haven't yet supplied any reproducible evidence that Pi-hole would be responsible for slow lookups.
That's not to deny your observation, but reproducible slowdowns often turn out to be related to upstream filtering (e.g. by your ISP), upstream DNS servers, geo-location issues or your router's configuration in conjunction with Pi-hole's, with full or partial DNS loops being the most common reason.
If you had configured such a loop, you'd usually see a largely exaggerated query count originating from your router.
By creating artificial load on a real-world resolution chain for a domain where you've once observed a high latency value, you may even amplify a slow resolution.
Also, the order in which you run your commands would be important. If you always ran your tests against Pi-hole first and your router second, your router would not incur the penalty of the first lookup, but could answer straight from the cache that Pi-hole's forwarded query had just populated.
If you were to benchmark DNS resolution, you'd need to create a set of DNS records with a DNS server you control (so that records are served from a local source entirely), and use that as your Pi-hole's upstream.
I guess you could use a second Pi-hole for this, without any filtering applied.
If you want to repeat requests for the same domain, you may want to avoid using Local DNS records and use dnsmasq's `host-record` option instead, so that you can explicitly set a TTL.
Alternatively, you could probably skip manual configuration of DNS records if you configured your second Pi-hole with a `min-cache-ttl` of 500 seconds, or any other arbitrary value for as long as you want your benchmarks to run (up to an hour).
(Refer to dnsmasq's man page for further details on both of those options.)
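A minimal sketch of what either approach could look like on that second (upstream) Pi-hole; the drop-in file name, domain and address are hypothetical placeholders:

```
# /etc/dnsmasq.d/99-benchmark.conf (hypothetical example)

# Serve a fixed record with an explicit one-hour TTL:
host-record=bench1.example.test,192.0.2.10,3600

# Or: keep forwarded answers cached for at least an hour,
# regardless of their original TTL:
min-cache-ttl=3600
```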
You'd then have to prime your second DNS server (e.g. by looking up the domains you are going to use in your benchmarks) to populate its cache before running your actual benchmarks against your Pi-hole and that second server directly.
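Priming could be as simple as a short loop like the one below; the address stands in for your second, non-filtering Pi-hole and the domains are whatever your benchmark will use:

```bash
# Warm the upstream Pi-hole's cache before benchmarking (192.168.0.3 is a placeholder).
for domain in www.ucla.edu www.facebook.com bench1.example.test; do
  dig @192.168.0.3 "$domain" +short > /dev/null
done
```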
You could then compare reply times as reported by both of your Pi-holes.
I'd expect your filtering Pi-hole to be slower than the upstream one by the average amount of your network connection's latency.
The penalty of the first Pi-hole's filtering itself should be negligible, with one notable exception: If you were making extensive use of regex filtering.
That would be quite different from benchmarking.
You could try my suggestion for monitoring actual delays from Pi-hole's in-memory data.
As response times are held in memory only, you'd have to query those values over Pi-hole's Telnet API.
You are looking for the 8th column of the `getallqueries` result set produced by:

```
echo ">getallqueries >quit" | nc localhost 4711
```
That result set can be interpreted as follows:
| timestamp | query type | domain | client | status type | DNSSEC result | reply type | delay | CNAME domain | RegEx# | upstream |
|---|---|---|---|---|---|---|---|---|---|---|
| 1631437263 | A | fonts.gstatic.com | 192.168.0.65 | 9 | 2 | 3 | 14 | gstaticadssl.l.google.com | -1 | unbound#5335 |

where the 14 in the delay column would translate to a 1.4 ms response time.
As Pi-hole's in-memory data covers a rolling 24-hour time frame, you could consider collecting multiple data points via its Telnet API every 24 hours if you'd like to cover several days.
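As a hedged sketch of such a collection step (assuming the default Telnet API port 4711, and that the delay column is in tenths of a millisecond as in the example above), you could average the 8th column like this:

```bash
# Pull all queries from FTL's Telnet API and print the mean reply delay.
# Column 8 is the delay in tenths of a millisecond (14 -> 1.4 ms).
# Lines with fewer than 8 fields are skipped as they are not query records.
echo ">getallqueries >quit" | nc localhost 4711 \
  | awk 'NF >= 8 { sum += $8; n++ }
         END { if (n) printf "mean reply time: %.1f ms over %d queries\n", sum / n / 10, n }'
```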
Thanks for sharing the Telnet API, I'll look into that. That's helpful.