Pi-hole large scale help

mbrady · September 6, 2024, 1:54pm

I saw that FTL v3 supported more than 1M queries in a 24 hour period. - I'm trying to make use of pi-hole on a large scale - roughly 2K clients. I'm getting easily 1/2M requests in 10 mins.

I'm running into a problem where CPU usage goes high and in bound DNS requests start to timeout. Restarting resolver brings things back to normal for some time, then gets worse again.

At first, I was getting a lot of evictions with a the cache set to 10000 - so I've been bumping up the cache so I don't get any evictions but still having a CPU issue.

I'm running on a ESXi VM, 4CPU, 2GB RAM

Anyway I can get pihole working on a larger scale?

dnsmasq config...
no-negcache
dns-forward-max=5096
min-cache-ttl=3600

pihole config....
PRIVACYLEVEL=0
RATE_LIMIT=0/0
PIHOLE_PTR=HOSTNAMEFQDN
RESOLVE_IPV6=no
DBIMPORT=no
MAXDBDAYS=1
BLOCKINGMODE=NXDOMAIN
REPLY_WHEN_BUSY=ALLOW

DL6ER · September 8, 2024, 7:28am

This is a though one, we've discussed this internally and our first suggestions would be that it may have to do with your disk being too slow, i.e., the database cannot be updated timely enough which causes the DNS resolver to wait for disk sync at some point. Even looking at your rather severely limited amount of RAM (2 GB) for this task.

In case you have enough disk space available (we are expecting - possibly many (!) - GB):
Please add

DEBUG_ALL=true

to the file /etc/pihole/pihole-FTL.conf (create if it does not exist) and run

pihole restartdns

Next time the timeouts happen, please check /var/log/pihole/FTL.log to see what the last messages are. This may help us getting a better picture of where/why exactly the issue happens.

However, if your disk is really the bottleneck you may find that logging the queries by turning on the debug settings may cause a similar slowdown. The upcoming Pi-hole v6 should be less affected by this.

mbrady · September 8, 2024, 10:54am

Thanks so much for the feedback - I have some additional info that may help. Since the original post. If/when queries started to timeout, as soon as we were to issue a reload, things were immediately better.

I did come across this article (https://www.redhat.com/en/blog/five-nines-dnsmasq) and have added the dnsmasq options "all-servers", "no-negcache" and "dns-forward-max=5096" to the config.

I reverted the cache size back to 10000 and set pihole MAXDBDAYS to 0.

I've disabled all adlists and just have a handful of regex allows/blocks.

I've disabled query logging.

I can easily throw more resources at pihole. Will enable debuging and I should have more info in just over 24 hours. - Enabling debugging just now, running for 5 minutes with 7K queries, is causing a high CPU load...I'm not sure if I'll be able to let this persist to collect any info. - Is there a slightly lighter debug level available?

With little load (~1M queries in 9 hours) pihole stats show (with changes above, excluding debgug)...

Below is what the VM resourse utilization shows. Disk latency seems only spike in the off-hours but still stays near zero.

Bucking_Horn · September 8, 2024, 12:08pm

Both of those will potentially increase cpu, i/o, latency and perhaps memory consumption, as Pi-hole will forward any DNS request to all of its upstreams instead of just to the fastest responding one, and it would have to wait for an upstream for any NXDOMAIN reply instead of serving that directly from its cache.

That would amount to 3 million requests per hour, or 72 million requests per day, or well over 800 requests per second.

At those rates, you may be hitting an upstream DNS server's rate limit, which potentially would result in a lot of upstream refusals or -even worse- upstream time-outs, in turn prompting clients to repeat their requests.

Also, 72 million daily requests by roughly 2k clients would translate to 36,000 requests per client per day.
Comparing that with my most busy client at 6,000 per day, that's about 6 times as many requests.
That seems somewhat excessive and may perhaps suggest a (partial) DNS loop.

To investigate, please upload a debug log and post just the token URL that is generated after the log is uploaded by running the following command from the Pi-hole host terminal:

pihole -d

or do it through the Web interface:

Tools > Generate Debug Log

mbrady · September 8, 2024, 3:20pm

Totally understood on the dnsmasq config...I was just grasping for somthing to work with. Being it was a change from the default config of dnsmasq in the article, that showed improvement, I gave it a shot. I was also thinking that maybe there was a slow or mis-configured upstream DNS.

That count of requests isn't persistent throughout the day. - Def between the hours of 8am-2pm and may come in waves. That was the only metric I was able to grab before needing to do a reload. Before pihole, we were using a really old version of BIND, configured with the same upstream servers and that worked without a problem on 1GB of ram. I totally understand there's more overhead and read/writes going on with pihole.

Here's a debug from the latest config - running for a few hours. (with no load)
https://tricorder.pi-hole.net/9A5mN8eq/

Thank you!

mbrady · September 9, 2024, 2:06pm

I was able to run DEBUG_ALL=true for about 30 seconds before queries from clients started to timeout. Nothing stands out to me in the logs.

Also tried removing all-servers, no-negcache config options from dnsmasq with no noticable difference.

It seems once the FTL process get's close to the 40% mark, things go downhill until I issue a reload.

Unless if you have another idea, I think I'm going to try running the same config under dnsmasq itself and not run pihole, just to see if dnsmasq itself can handle the load.

DL6ER · September 9, 2024, 2:42pm

I cannot really recommend a more lightweight debug configuration because we don't yet know what we are looking for. If running the configuration under dnsmasq is an option, it's surely worth a try.

Mind that regex is rather hefty as all of them need to be re-evaluated for all unknown domains. Because regexes can match in obscure ways (by design), there is no short-circuiting available. It'd definitely be worth trying without regexes, too!

One thing for me to clarify: What do you mean exactly by

Is it sending SIGHUP to pihole-FTL? This would cause some caches to be flushed but not all. If this is a working fix for you, then we can almost certainly rule out that it is the database somehow slowing things down.

mbrady · September 9, 2024, 3:30pm

Totally understood on debugging.

I'll try dropping the regex's and see if helps at all! That makes sense.

Once queries from client machiens start to timeout, I can run a pihole restartdns and it immediately resolves things and I'm good again for X amount of time. Is a 'SIGNUP', the same as the restart DNS?

Bucking_Horn · September 9, 2024, 5:02pm

Your debug log shows that your Pi-hole is only using one public upstream, so all-servers would make no difference in your case.

DNS loops may be closed if one of Pi-hole's upstreams would feed DNS requests back to Pi-hole.
As you do not have Pi-hole's Conditional Forwarding enabled, and you are not using a local DNS resolver or your router as Pi-hole's upstreams, a DNS loop is rather unlikely to cause your issue.

Probably unrelated, but your debug log shows that you've currently disabled all of your Pi-hole's blocklists?
Also, as you block mask.icloud.com as well as mask-h2.icloud.com, you may want to be aware that Pi-hole would block those automatically by default, which can be controlled via BLOCK_ICLOUD_PR in your pihole-FTL.conf.

Your debug log also shows that your have enabled Pi-hole's DHCP server with quite an elaborate setup covering multiple subnets. While you've set a reasonable leasetime of 8 hours, 2,000 clients dis- and reconnecting and renewing leases could contribute to Pi-hole's load.
You could consider to pass on DHCP duties to another machine, for the purpose of assessing raw DNS performance.

As to my previous remark:

Are you positive that your chosen upstream wouldn't rate limit Pi-hole?

You could analyse Pi-hole's database to count refused and unanswered replies to see whether those would make up a substantial portion of queries, especially in the periods before your Pi-hole qets unresponsive, e.g. by running:

pihole-FTL sqlite3 "/etc/pihole/pihole-FTL.db" "SELECT reply_type, count(reply_type) FROM queries WHERE timestamp > strftime('%s','now','-1 hour') GROUP BY reply_type ORDER BY 1;"

For unanswered or refused requests, reply_type would be 0 or 8 (see also Redirecting...).

And finally, your debug log shows that some of your queries are for potentially problematic names:

INFO: FTL replaced 1 invalid characters with ~ in the query "imac~27&quot;athletic"

As those seem to relate to local hostnames exclusively, I recall a recently fixed bug where a specific wrong hostname could have prompted a true crash of pihole-FTL.
However, that was in the beta code of Pi-hole v6, and your debug log shows you to be on Pi-hole v5, but still: @DL6ER, do you think that v5 could suffer from invalid hostnames in a way that could contribute to mbrady's observation?

mbrady · September 9, 2024, 5:54pm

There's a total of 3 upstream servers, I have two listed in /etc/dnsmasq.d/10-pvcsd-dns.conf

Yeah I'm pretty confident there is no DNS loop - the only thing that's changed is we swapped out BIND/Named for pihole/dnsmasq

I was grasping for potential causes. I was attempting to limit as much processing as possible to see if that would help the issue at all. I also disabled the RegEx's and (@DL6ER ) that has seemed to extend our uptime before needing to issue a 'pihole restartdns'.

Thank you for the icloud info - I had threw it in the web interface, just so all the custom blocks were in one spot.

I noticed that there was one scope with an hour lease time, I'm going to increase that. The problem is that we have a large volume of unique devices, so we'll eat through that scope if we extend lease time too much.

I'm postive that the upstream's config hasn't changed, we've been using BIND with those same upstream servers for 20 years with no problems. We were just looking for an update.

I wasn't able to get anything out of that sql command. It didn't throw errors and didn't give any results. Would things still be captured in DB if I have 'MAXDBDAY=0' would that make a difference?

DL6ER · September 9, 2024, 6:50pm

No, this was due to a unique bug in the "new" v6 code, precisely in the code we had to implement because switching to alpine build containers (musl) instead of Debian (glibc) removed a few features from the DNS resolver we then had to implement ourselves.

Okay, this is a full restart not a reload. Try pihole restartdns reload for a real reloading. I guess it won't really help.

"start" to timeout - does it mean that FTL stops processing them altogether or just some of them and only those timeout? I could also ask: Is the log still filling when you see the timeouts?
To me, honestly, this currently very much looks like a slow disk not able to keep up with data being written to disk. How do the VM metrics look like with debugging turned on?

mbrady · September 10, 2024, 1:07pm

My bad on the full restart vs a reload. I can confirm 'pihole restartdns reload' doesn't help the situation at all. I need to run 'pihole restart dns', to get queries to be responded to.

Being that the userbase is so wide, it's difficult to see if any of the queries get processed but I can say our network monitor picks up on it almost immediately and I can confim with my local machine....I'll get a few requests timed out, then a response and eventually, I'll stop getting a reply. (I've disabled the query logging to attempt to improve performance, but I can see the DHCP logs entries start to get to spaced out as queries time out on my devices.

Here are VM metrics....it's really just seems to be the CPU that goes high and drops after a restart. First screenshot includes timeframe of enabling full debugging. Second screenshot is today, after a restart.

At this point (today) - I deleted the db and started fresh, to make sure there wasn't a db issue, expanded dhcp lease time to help with load, disabled adlists and regex's, disabled query logging.

I think now, my next move is to muddle through the day and I'll uninstall pihole and just utilize dnsmasq to see if it provides the same symptoms, just so we can rule it in/out as an issue.

Bucking_Horn · September 10, 2024, 2:14pm

My bad, that SQL was too restrictive on reply_type. I've edited the statement since.
Please try:

pihole-FTL sqlite3 "/etc/pihole/pihole-FTL.db" "SELECT reply_type, count(reply_type) FROM queries WHERE timestamp > strftime('%s','now','-1 hour') GROUP BY reply_type ORDER BY 1;"

Setting MAXDBDAYS to 0 would disable the database.
You could try that to see if db storage would have an impact on your observation.

You obviously cannot run that SQL then, but you'd still be able to retrieve some in-memory stats via:

echo ">stats >quit" | nc localhost 4711

smokingwheels · September 11, 2024, 2:30am

What about running a Dnsmasq server and forwarding requests to 4-6 piholes?
I currently run 3.
I use a tool Dnsblast to load my DNS system up.

mbrady · September 11, 2024, 1:11pm

I'm testing running on just plain dnsmasq presently, just to confirm it's robust enough itself. That could be an option. What's your reasoning for running on 3 piholes? Were you experiencing similar issues?

smokingwheels · September 12, 2024, 2:16am

Yes when my Yacy search engine was crawling lots of sites. From memory the number of DNS lookup was about 1.5 million in 24 hours.
Also if one pihole crashes or being restarted you will still have DNS resolution.

mbrady · September 12, 2024, 1:47pm

@DL6ER and @Bucking_Horn - So I've been running pure dnsmasq (no pihole) for a day with no troubles at all. VM metrics show a clear drop in CPU usage. - I also have query logging enabled with no real noticable change in disk latency. So it seems it's something in pihole that's not scaling well.

mbrady · September 12, 2024, 1:59pm

Gotcha, yeah I suppose we could add multiple pihole instances upstream, just adds to the complexity. But that's good to know I'm not the only one.

Bucking_Horn · September 12, 2024, 2:07pm

How did you implement blocking in that scenario?
Just using a hosts file with all 0.0.0.0?

mbrady · September 15, 2024, 12:04pm

Sorry, missed this - presently not doing any blocking aside from mask.icloud.com and mask-h2.icloud.com via dnsmasq 'server=' directive.