Large number of clients cause DNS timeouts over time

Ladrien · February 5, 2025, 9:03pm

I run pihole in a semi-large environment, around 1500 clients. I've run into an issue whereby enabling EDNS0 to show the client's true IP causes pihole to hang during the DB update cycle.

Here is the setup:

Both piholes are on Ubuntu 22.04LTS and run nginx for ssl, and dnsdist to load balance the dns queries. This issue only presents itself when enabling the pihole DB.

Pihole1:

ip: 172.16.1.28 (pihole)
ip: 172.16.1.26 (dnsdist listens on this IP and forwards queries to both pihole IPs)

Pihole2:

ip:172.16.1.29 (pihole)
ip:172.16.1.27 (dnsdist listens on this IP and forwards queries to both pihole IPs)

Below is an image that shows the pihole hangs during the db update. DNS queries timeout during this process.

Anyone have any ideas to improve the DB performance? I've tried moving the DB to a ramdisk, but I've no luck.

Bucking_Horn · February 6, 2025, 12:04am

Please upload a debug log and post just the token URL that is generated after the log is uploaded by running the following command from the Pi-hole host terminal:

pihole -d

or do it through the Web interface:

Tools > Generate Debug Log

Ladrien · February 6, 2025, 8:37pm

https://tricorder.pi-hole.net/SpSQD7dk/ -- pihole with the database enabled and the issue present

https://tricorder.pi-hole.net/cak3QdJw/ -- pihole with the database disabled and no issues present

This is after ~15 minutes of starting pihole. I've noticed that if I disable the DB through setting MAXDBDAYS=0 or DBFILE= the issue goes away.

The issue gets progressively worse as the total number of queries goes higher, and it tends to start around 30k queries.

Ladrien · February 6, 2025, 9:21pm

After playing around with the DBINTERVAL variable, I've found some interesting correlations.

When leaving the interval at 1 minute, the hangs start to reliably occur at around 32,000 total queries.

However, when adjusting this to a shorter interval, such as .25 (15 seconds) it will delay this issue until the total queries get to 128,000, after which the pihole will reliably hang at the 15 second interval.

Bucking_Horn · February 7, 2025, 8:38am

Both your debug logs show that you have disabled Pi-hole's query logging.

Would that mean that you do not intend to log DNS traffic?
In that case, leaving Pi-hole's long term query database disabled as well (as you've already done temporarily) would be a viable option, avoiding your issue.

When the database is enabled, your debug log shows that you've set

REPLY_WHEN_BUSY=ALLOW

Deviating from the default DROP would mean that Pi-hole -in lieu of being able to check gravity for blocked domains- may allow otherwise blocked queries.
If that's not what you intend, you probably should consider to stick with the default.

It could also contribute to your observation, at least to the total request count, as allowing clients to access usually blocked domains may entail consecutive additional lookups that would not have happened had the domain been blocked or dropped.

Other than that, I don't see anything in your debug logs that may explain your observation - but there are a few other things that stick out (click for details).

For once, you've applied a change to Pi-hole's web UI - it seems you've disabled the dashboard's client activity graph?

You've added a bunch CNAME records to enforce safe search for some popular search engines, using their respective public safe search domains as target.
As Pi-hole's UI points out, CNAME records won't work if Pi-hole isn't authoritative for the target domain, which it wouldn't be for public domains.
I see that you are aware of this, as you've created Local DNS records for those public domain targets - but not for all of them: forcesafesearch.google.com is missing.

And seeing that you are trying to enforce safe search, it seems contradictory that you would set MOZILLA_CANARY=false, which prevents Pi-hole from indicating Mozilla/Firefox browsers to disable DoH, as that would allow those browsers to by-pass Pi-hole via DoH altogether.

Ladrien · February 7, 2025, 7:41pm

Both your debug logs show that you have disabled Pi-hole's query logging.

Query logging simply generates way too much information for the amount of queries that we handle. The long-term database, however, I believe is more helpful to track down DNS issues anyways; which is why I am on this mission to get it to work.

Deviating from the default DROP would mean that Pi-hole -in lieu of being able to check gravity for blocked domains- may allow otherwise blocked queries.
If that's not what you intend, you probably should consider to stick with the default.

I've set the REPLY_WHEN_BUSY variable during my testing. It has been set as default for the entire life of the pihole.

For once, you've applied a change to Pi-hole's web UI - it seems you've disabled the dashboard's client activity graph?

When you have this many clients, trying to load that graph will lock up pihole for a very long time. I've commented it out in the html code for that reason. There have been no other weird configurations made to the install that don't involve the standard variables.

You've added a bunch CNAME records to enforce safe search for some popular search engines, using their respective public safe search domains as target.
As Pi-hole's UI points out, CNAME records won't work if Pi-hole isn't authoritative for the target domain, which it wouldn't be for public domains.
I see that you are aware of this, as you've created Local DNS records for those public domain targets - but not for all of them: forcesafesearch.google.com is missing.

forcesafesearch.google.com works as expected with the current configuration. There is no A record needed. Why? I believe it's because google themselves have set it as a CNAME as well.

nslookup google.com
Server:  ****
Address:  172.16.1.26

Non-authoritative answer:
Name:    forcesafesearch.google.com
Addresses:  2001:4860:4802:32::78
          216.239.38.120
Aliases:  google.com

And seeing that you are trying to enforce safe search, it seems contradictory that you would set MOZILLA_CANARY=false, which prevents Pi-hole from indicating Mozilla/Firefox browsers to disable DoH, as that would allow those browsers to by-pass Pi-hole via DoH altogether.

False = block to my knowledge.

nslookup use-application-dns.net
Server:  ****
Address:  172.16.1.26

*** ***** can't find use-application-dns.net: Non-existent domain

I appreciate your time in looking through the debug log, but our piholes run great.

The issue is the database and the underlying SQL / C code that makes that work. I understand that this is a free and open-source project and I am pushing it to it's absolute limits, but I am willing to help. Send over a dev branch with some SQL performance tweaks and I will run it on one of the piholes.

The main issue is the Pihole rejecting DNS when it's trying to update the long-term database. I question why that is the case, when there are multiple ways to achieve the same result besides locking up the pihole, especially when the gravity db is a separate file.

Thank you for your time.

Bucking_Horn · February 7, 2025, 11:22pm

You are free to ignore my advice, but that wouldn't invalidate it (click to see why).

Neither www.google.com nor google.com nor forcesafesearch.google.com are CNAMEs.

Pi-hole's CNAME UI explains that:

The target of a CNAME must be a domain that the Pi-hole already has in its cache or is authoritative for. This is a universal limitation of CNAME records.

As Pi-hole isn't authoritative for forcesafesearch.google.com, it may only work if by chance Pi-hole would have cached the A / AAAA records for forcesafesearch.google.com (which have a long TTL of one day).

Once they expire, you'd only receive a CNAME reply, e.g.:

$ dig google.com

; <<>> DiG 9.11.5-P4-5.1+deb10u11-Raspbian <<>> google.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39987
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com.			IN	A

;; ANSWER SECTION:
google.com.		60	IN	CNAME	forcesafesearch.google.com.

;; Query time: 11 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fr Feb 07 22:13:38 CET 2025
;; MSG SIZE  rcvd: 79

which logs like:

query[A] www.google.com from 127.0.0.1
config www.google.com is <CNAME>

Only a consecutive direct request for forcesafesearch.google.com may repopulate the cache and allow your google.com CNAME to resolve as you'd expect.
This would register in Pi-hole's log in a sequence like:

query[A] forcesafesearch.google.com from 127.0.0.1
forwarded forcesafesearch.google.com to 127.0.1.1#5335
reply forcesafesearch.google.com is 216.239.38.120
(…)
query[A] www.google.com from 127.0.0.1
config www.google.com is <CNAME>
cached forcesafesearch.google.com is 216.239.38.120

To avoid relying on caching chances, you need to create Local DNS records for forcesafesearch.google.com (or perhaps run a job that would request it before it expires).

MOZILLA_CANARY has to be true to communicate that DoH should not be used, which is Pi-hole's default, see Configuration - Pi-hole documentation.

Altering your configuration will make it harder for us to conduct meaningful analysis of your debug logs to help you.

When trying to reproduce your observation hitting a wall at 30,000 queries, with the default DROP in place, my Pi-hole instance would start to exhaust its max concurrent connections and refuse additional queries, and may also drop some DNS requests, but it would not hang.

Response codes:  NOERROR 20757 (71.17%), SERVFAIL 126 (0.43%), NXDOMAIN 6774 (23.22%), REFUSED 1510 (5.18%)

However, while such REFUSED queries may contribute to your observation, a max concurrency warning was absent from your debug logs.

Your initial screenshot shows dnsdist output only, but we are lacking Pi-hole's own information.

For a time of an outage, could you correlate DNS requests as sent from dnsdist with DNS requests processed by Pi-hole, to see whether Pi-hole received them and how it handled the ones that dnsdist may have been waiting for unsuccessfully?

And what would prompt dnsdist to mark a DNS server as down?
Would one query timing out be enough, or would it take several?
What's dnsdist's time-out threshold?

Ladrien · February 8, 2025, 12:47am

MOZILLA_CANARY has to be true to communicate that DoH should not be used, which is Pi-hole's default, see Configuration - Pi-hole documentation.

After looking through the commit, I was wrong. I block all DoH servers so that is what caught that domain. Thank you.

When trying to reproduce your observation hitting a wall at 30,000 queries, with the default DROP in place, my Pi-hole instance would start to exhaust its max concurrent connections and refuse additional queries, and may also drop some DNS requests, but it would not hang.

This is precisely the issue. I suppose "hang" is not a good description, as everything else but DNS resolution works fine. This is only compounded as the total number of queries in the long-term database increases, and will ultimately get to a point where DNS resolution simply does not work at all. eg: 1,500,000 queries in a single day

Your initial screenshot shows dnsdist output only, but we are lacking Pi-hole's own information.

I included a picture of dnsdist to show the problem occurring, DNS still fails even if I query pihole directly. I am happy to provide information on dnsdist, but that is simply not the problem here. This issue still presents itself on a bare bones pihole install.

Nevertheless, my dnsdist configuration is unmodified as far as server-health checking. Here is a link to the documentation describing the process.

Quoting from documentation:

dnsdist uses health-check queries, sent once every second, to determine the availability of a backend server. Since 1.8.0, it also supports a lazy health-checking mode which only sends active health-check queries after a configurable threshold of regular queries have failed, see below.

By default, an A query for the “a.root-servers.net.” name is sent. A different query type, class and target can be specified by passing, respectively, the checkType, checkClass and checkName parameters to newServer(). The interval between two health-check queries can be set via the checkInterval interval parameter, and the amount of time for a response to be received via the checkTimeout one.

The default behavior is to consider any valid response with an RCODE different from ServFail as valid. If the mustResolve parameter of newServer() is set to true, a response will only be considered valid if its RCODE differs from NXDomain, ServFail and Refused.

This is confirmed in my pihole query log with 2 queries/s on each pihole (2 for each dnsdist server)

So essentially, dnsdist will consider the server "up" if it receives anything other than a servfail response. I have found it to be quite accurate in determining if the pihole will resolve DNS queries or not. Simultaneously querying the pihole directly from another computer when dnsdist claims that the pihole is offline will have the same effect (servfail)

Please let me know how else I can assist, and thank you for your time.

Bucking_Horn · February 8, 2025, 11:09am

If that would be the case, you should have seen quite a few queries with a REFUSED reply along with a maximum concurrency warning in Pi-hole's diagnosis section.
As said before, such a warning is missing from both your debug logs.

Furthermore, REFUSED is distinctively different from SERVFAIL: REFUSED replies would commonly be created by Pi-hole on various conditions, while SERVFAIL is just handed down as received from a Pi-hole upstream DNS server.

As you state to see SERVFAILs:

That may suggest an issue with Pi-hole's upstreams.

Also, your Query Log screenshot above shows Pi-hole successfully resolved a.root-servers.net and served the reply immediately (0.0ms).

Are you positive that this query correlates with a dnsdist outage as requested?

If so, that would indicate that while Pi-hole has sent the reply, dnsdist did not receive it.

So dnsdist would classify either a timeout (of unknown checkTimeout length) or a single SERVFAIL as outage.

However, SERVFAIL isn't an outage - it is a valid DNS reply. In order to observe a SERVFAIL, a full DNS exchange must have happened, i.e. the queried DNS server has been fully operational and responsive.

You didn't provide the actual time-out value, and the linked dnsdist docs didn't disclose that either, but it would seem that dnsdist would probe your Pi-hole once every second, implying a checkTimeout value of at most 1 second (where most Linux distros would configure a default DNS timeout of 5 seconds).

Since Pi-hole would DROP DNS requests by default during database busy times, that may explain why dnsdist would classify Pi-hole as down.

However, the reply type would show as either N/A, NONE or REFUSED in Pi-hole's Query Log, where your screenshots show the query as IP, while you report clients to receive SERVFAIL.

Those observations seem both inconclusive and contradictive.

To get a clearer picture of what is happening on Pi-hole's end:

That would include all DNS requests, not just dnsdist's own probes.
It would be interesting to know how many reply times exceeding a second, SERVFAILs and REFUSED replies you'd actually see during or shortly before a dnsdist down period classification.

EDIT: And also:

As simple clients would never send ENDS0 ECS information, could you explain how EDNS0 would be involved?

Ladrien · February 12, 2025, 1:48am

I am going to re-summarize since we seem to be on different pages; and after some further testing, the database seems to be a red herring.

clients > dnsdist > piholes > Windows DNS > root servers

Pihole's upstreams are Windows DNS and they are rock solid. With dnsdist as the load balancer and EDNS(0) forwarding disabled, pihole can handle the normal load that we require from them. (300-500 qps/server) Unfortunately, pihole only sees a few clients out of the two thousand this way. (see below picture while EDNS(0) is disabled)

After enabling EDNS(0) forwarding on dnsdist, pihole now sees the "true" clients, totaling to around two thousand. As this happens, the CPU load will gradually increase over multiple hours (the numbers at the left side of the picture) to a point where the pihole will respond with SERVFAIL under extreme single core load to both clients and dnsdist's health checks. I must stress that this only happens when there are a large number of clients & queries alike.

Again, the only variable that has changed is that the pihole(s) are now aware of the true number of clients, around two thousand of them.

To me, this is pretty cut and dry that the issue is with the pihole(s); not the surrounding infrastructure. It is made even more apparent that restarting pihole's dns resolver fixes the issue for a few hours, until it happens again.

Bucking_Horn · February 12, 2025, 1:11pm

We are on the same page, but you are the one with the book, and you seem reluctant both to read me the interesting parts about Pi-hole as well as to acknowledge the significance of my explanations with regard to the little bits you did share so far.

Your dnsdist classifies Pi-hole as down either if it times out or if it replies SERVFAIL. Despite me asking for it, you haven't demonstrated what condition actually triggers in your case, presenting me a screenshot of a successful IP resolution instead.

I've explained that Pi-hole would issue reply types different from SERVFAIL when running into the 'database busy' condition that you did insinuate to cause your observation, and that Pi-hole in general would not invent SERVFAILs itself, but reflect an upstream reply.

Also, load tests I've run at my end to recreate your issue would never return SERVFAIL (click for details).

(The single one that did show SERVFAIL was caused by my router, disagreeing to handle the sudden surge of UDP requests.)

The tests aimed to issue 30k queries during 5 minutes (plus rampup time).

You'll notice that spikes in latency occur about every 60 seconds, correlating with failures to resolve (requests replied with REFUSED), as caused by a maximum concurrency condition, which in turn was likely caused by the database commits.

Max concurrency happened twice (roughly 90 and 150 scs into the run)

[2025-02-12 11:23:57.098 (~90s)]  WARNING in dnsmasq core: Maximum number of concurrent DNS queries reached (max: 150)
[2025-02-12 11:24:57.114 (~150s)] WARNING in dnsmasq core: Maximum number of concurrent DNS queries reached (max: 150)

Also, 47 queries are counted as lost, dropped by Pi-hole when the database was busy, so a client never received a reply for those.

Those results seem quite different from what you seem to observe.

Note that this was run against a Pi-hole on an RPi 3A.

All of above is intended to help you better understand your Pi-hole's behaviour, facilitating your analysis.

Stating that your upstreams are rock solid won't help us to analyse your issue.

This may:

This is the third time that I'm asking you for some actual Pi-hole query data to help you analyse your issue.

Without that, we won't be able to understand why dnsdist classifies Pi-hole as down, what Pi-hole is actually doing at those times, and what may trigger the SERVFAILs you observe, nor how to reproduce your issue.

Bucking_Horn · February 13, 2025, 10:54am

How so?

My loadtest show that database commits will impact reply times (going up to several seconds, and a few queries dropped entirely), which could explain dnsdist classifying Pi-hole as down by its timeout rule (but not on SERVFAILs as you observe them).

I also can confirm that enabling EDNS0 ECS would negatively impact my loaded Pi-hole, further increasing response times by ~10 to ~15% during db commits, but mainly doubling the amount of REFUSED and tripling the amount of dropped queries (on an RPi 3A writing to an SD card).

Also, I learned from its documentation that enabling ECS would negatively impact dnsdist's caching: Since ECS may result in client-specific DNS replies (either based on client's geo-location or client-specific blocking rules), a received reply can only be a cache hit for clients in the same ECS subnet. With a /32 subnet as required for proper client identification, that would mean that e.g. one client's reply for google.com won't be shared with another client requesting it - dnsdist would have to repeat the lookup through Pi-hole, and an increasing query count would also affect Pi-hole's database commits.

So with ECS enabled in dnsdist, Pi-hole would suffer twice the penalty: Not only would it have to receive larger size DNS packets and decode ECS information, but also cope with an increased number of DNS requests and larger database commits.
(EDIT: On a side note, ECS may also affect your Windows DNS servers - once that info is in, it will travel all the way upstream, unless you'd have your upstreams strip or replace that ECS information. To that end, could you provide a fresh debug token, preferably one from a Pi-hole configured as used in your production environment, so I have a better chance of reproducing your observation?)

That makes me wonder:
Would your Pi-holes be able to better cope with your DNS traffic if you would expose both of them directly, i.e. without dnsdist in front of them?

You'd still see about the same amount of overall DNS requests, but traffic size would be smaller, and you'd free Pi-hole from processing ECS information.

Would that perhaps counter the potential effect of not having dnsdist distributing load evenly between your two Pi-holes?
To that end, in an attempt to spread DNS load, would your DHCP servers be able to rotate their offered DNS servers, or offer them in different order for different DHCP clients?

Ladrien · February 14, 2025, 2:26am

My apologies, everything's been very busy. I've also migrated dnsdist to separate VMs, which is why they won't show in the debug log this time.

Debug token: https://tricorder.pi-hole.net/Gss6HdDI/

Also, I learned from its documentation that enabling ECS would negatively impact dnsdist's caching:

dnsdist disables caching by default - we currently use it to loadbalance between the two piholes, and refuse high-traffic (garbage) queries to lighten the load from the piholes. That makes me wonder, however, if disabling caching on the pihole's side would help here. Our Windows DNS does the caching already anyways.

That makes me wonder:
Would your Pi-holes be able to better cope with your DNS traffic if you would expose both of them directly, i.e. without dnsdist in front of them?
You'd still see about the same amount of overall DNS requests, but traffic size would be smaller, and you'd free Pi-hole from processing ECS information.
Would that perhaps counter the potential effect of not having dnsdist distributing load evenly between your two Pi-holes?
To that end, in an attempt to spread DNS load, would your DHCP servers be able to rotate their offered DNS servers, or offer them in different order for different DHCP clients?

We do load balance DHCP, and distributing DNS servers through DHCP leases is a potential solution. I'd ideally like to get DoH working at some point though, and dnsdist offers that option, translating the TCP > UDP for the piholes.

As I've got it set up right now, I can see clients through ECS and pihole is relatively stable without the DB. I have to restart the pihole resolver halfway through the day though.

I can attempt to set the clients to query the piholes directly, if that would aid in your testing.

Bucking_Horn · February 14, 2025, 7:49am

That may negatively impact your Pi-holes, especially as they are under load already, as it would keep connections busy for longer (potentially on an order of magnitude).

It also wouldn't help analysing by much.

Complying with my previous request would allow us to investigate your observation of SERVFAILs:

Bucking_Horn:

Bucking_Horn:

To get a clearer picture of what is happening on Pi-hole's end:

Bucking_Horn:

For a time of an outage, could you correlate DNS requests as sent from dnsdist with DNS requests processed by Pi-hole, to see whether Pi-hole received them and how it handled the ones that dnsdist may have been waiting for unsuccessfully?

That would include all DNS requests, not just dnsdist's own probes.
It would be interesting to know how many reply times exceeding a second, SERVFAILs and REFUSED replies you'd actually see during or shortly before a dnsdist down period classification.

This is the third time that I'm asking you for some actual Pi-hole query data to help you analyse your issue.

Without that, we won't be able to understand why dnsdist classifies Pi-hole as down, what Pi-hole is actually doing at those times, and what may trigger the SERVFAILs you observe, nor how to reproduce your issue.

Ladrien · March 4, 2025, 6:51pm

Replying so the thread doesn't close.

I realize things have been busy with the v6 release, but have you had a chance to see the packet captures that I PM'd you?

Bucking_Horn · March 6, 2025, 4:04pm

No, I haven't, other than a quick try to open the link (which doesn't work for me).

It would have been of interest how Pi-hole would have processed dnsdist's requests along with upstream replies, in particular for SERVFAIL and REFUSED as well the ones that presumably timed-out on dnsdist, as registered with Pi-hole.

It would be much harder to correlate those activities from tcpdumps taken on two different systems, and those TCP dumps won't contain anything about Pi-hole's internal state anyway, so I don't think they would be of much use in investigating your issue (which I notice you have since renamed from being related to SERVFAILs to time-outs?).

Ladrien · March 6, 2025, 5:52pm

Those files had expired. Please let me know if you'd like me to send them again.

These pi-holes are on v5 still, but can you tell me what you'd need to effectively debug this? Commands to get debug logs on the pi-hole would be ideal. I will also enable query logging on the test pi-hole.

I will get debug logs for dnsdist as well.

Bucking_Horn · March 9, 2025, 8:43am

Your initial observation was about SERVFAILS, which your dnsdist detected and classified as Pi-hole being unavailable.

That would have been highly unusual, since my own tests did not prompt SERVFAILs, which is also expected, as Pi-hole in general would not invent SERVFAILs itself, but reflect an upstream reply, making me suspect that dnsdist could also classify Pi-hole based on time-outs caused by dropped queries.

You haven't presented Pi-hole Query Log or database excerpts to substantiate your initial observation, so we still don't know which condition would have prompted dnsdist more often.

Since you disabled logging, you'd have to draft database queries that match my suggestions:

I don't know how dnsdist probes look like, so can't really help you designing that query, and you may want to limit analysis to those time slots (which are also not known to me) where you actually observed dnsdist down classifications.

Analysis should GROUP BY reply_type to investigate SERVFAILs/REFUSEDs, and you'd perhaps need a separate query to investigate reply_time, as putting that in a WHERE clause of the same statement may exclude too many SERVFAILs/REFUSEDs queries.

Ladrien · March 9, 2025, 8:03pm

Yes. I've come to the conclusion that the SERVFAILS were not pi-hole's fault, but an issue with the upstream DNSSEC validation. That has been resolved, and that is why the topic title has been renamed. What is pi-holes fault is the query latency over time reaching the timeout threshold, as I've tried to convey through this discussion multiple times. It is also referenced in my first ever topic on this forum.

Put simply, this issue occurs even without dnsdist. It just made this problem easier to explain through dnsdist classifying pi-hole as down, ie: queries not responding within the timeout period.

While I have not provided those, I did provide tcpdumps of both dnsdist and pihole, and instructions on how to follow the dns queries between the two. In there is irrefutable evidence that pi-hole was responding with over a one-second delay, even on a special domain, such as mask.icloud.com. That query should be responded in milliseconds and stays local within pi-hole. I believe that this should be enough evidence in itself.

Nonetheless, I will get query logs from the pi-hole this week and send them over.

system · March 30, 2025, 8:04pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.