Expected Behavior:
Pi-hole to perform DNS lookups and respond to DHCP requests.
Actual Behavior:
Pi-hole randomly fails to perform DNS lookups or hand out IPs via DHCP. There appears to be no pattern to when this happens. No device on the local network can resolve hostnames or obtain an IP while this happens. The host is still up and pingable during this time.
Pi-hole will then start working again without intervention, but shows high load averages. When this happens, top shows 'pihole-FTL' at 80%+ CPU.
Pi-hole is running in a Debian 9 VM on ESXi with 2 vCPU and 2GB RAM. ESXi host has other VMs that all run flawlessly.
I see no issues with your free space.
A large number of queries would, and the number of queries would of course scale with your number of clients and internet usage, as well as over time.
Pi-hole will keep queries for a limited time only, dropping entries older than a configurable threshold.
That threshold value defaults to 365 days, but can be customised via /etc/pihole/pihole-FTL.conf:
MAXDBDAYS=365
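For anyone who prefers to change that value from the shell, here is a minimal sketch. The helper name `set_maxdbdays` is made up for illustration and is not part of Pi-hole; it only edits the file it is pointed at. I am also assuming that FTL reads this file at startup, so a restart is needed afterwards.

```shell
# Hypothetical helper: set MAXDBDAYS in an FTL-style config file,
# replacing an existing entry or appending a new one.
set_maxdbdays() {
    conf_file="$1"
    days="$2"
    # Refuse anything that is not a non-negative integer.
    case "$days" in
        ''|*[!0-9]*) echo "days must be a non-negative integer" >&2; return 1 ;;
    esac
    touch "$conf_file" || return 1
    if grep -q '^MAXDBDAYS=' "$conf_file"; then
        # Rewrite into a temporary file, then copy the content back.
        tmp_file="$(mktemp)" || return 1
        sed "s/^MAXDBDAYS=.*/MAXDBDAYS=${days}/" "$conf_file" > "$tmp_file" \
            && cat "$tmp_file" > "$conf_file"
        rm -f "$tmp_file"
    else
        # Make sure the file ends with a newline before appending.
        if [ -s "$conf_file" ] && [ -n "$(tail -c 1 "$conf_file")" ]; then
            echo >> "$conf_file"
        fi
        echo "MAXDBDAYS=${days}" >> "$conf_file"
    fi
}

# Example, run from a root shell on the Pi-hole machine:
#   cp /etc/pihole/pihole-FTL.conf /etc/pihole/pihole-FTL.conf.bak
#   set_maxdbdays /etc/pihole/pihole-FTL.conf 60
#   service pihole-FTL restart
```

Taking a backup copy first keeps the edit easy to undo.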
We haven't established that as a cause yet, though.
Let’s run the following command on your Pi-hole machine to check some of Pi-hole’s stats:
echo ">stats" | nc localhost 4711 -w 1
Not sure if this next one is related either:
Your debug log shows (only partially) a section for some issues with DHCP requests conflicting over names.
You should be able to get the full story with:
grep "not giving name" /var/log/pihole.log
I deliberately didn't post the output so as not to compromise your local naming scheme.
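If the raw grep output is long, a per-hostname count may be easier to read. This is only a sketch: `count_name_conflicts` is a made-up function name, and it assumes the dnsmasq message has the form "not giving name NAME to the DHCP lease of ADDRESS ...".

```shell
# Hypothetical helper: count "not giving name" messages per hostname.
# Assumes lines containing "not giving name NAME to the DHCP lease of ...".
count_name_conflicts() {
    log_file="${1:-/var/log/pihole.log}"
    grep "not giving name" "$log_file" \
        | sed -E 's/.*not giving name ([^ ]+) to the DHCP lease of .*/\1/' \
        | sort | uniq -c | sort -rn
}

# Example:
#   count_name_conflicts /var/log/pihole.log
```

The most frequently conflicting name ends up on the first line.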
I see what you mean about the DHCP issues in the logs. This is owing to the cloning of some other VMs, with the clones being configured for DHCP, albeit with the same hostname. This is a recent event, and the issue of Pi-hole randomly failing predates it. The 'localhost' DHCP conflict comes from a smart TV that reports its name as localhost. I've now added its MAC to the static list with a new hostname.
As for the stats, as requested here is the output of
The number of queries is somewhere in the high range, but not excessive. It is in line with your number of clients (29). From personal observations, I'd consider anything from 1k to 3k queries per client per day as normal (though I don't have the actual hard stats to prove this).
This makes it much less likely that your configuration is closing a DNS loop that could have accounted for a large number of queries (easily in the higher 100,000s or even millions a day).
And as your Pi-hole does not use your router as one of its upstream servers, and you are not using Conditional Forwarding, we can rule out a DNS loop as causing a high number of queries.
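To put a number on the per-client figure, the stats output can be piped through a small filter. This is a rough sketch: `queries_per_client` is a name I made up, and it assumes the stats output contains `dns_queries_today` and `unique_clients` lines, each as a name followed by a value.

```shell
# Hypothetical helper: average queries per client from ">stats" output on stdin.
# Assumes lines of the form "dns_queries_today N" and "unique_clients N".
queries_per_client() {
    awk '
        $1 == "dns_queries_today" { queries = $2 }
        $1 == "unique_clients"    { clients = $2 }
        END {
            if (clients > 0) {
                printf "%d\n", queries / clients
            } else {
                print "no clients reported" > "/dev/stderr"
                exit 1
            }
        }
    '
}

# Example:
#   echo ">stats" | nc localhost 4711 -w 1 | queries_per_client
```

With 29 clients, the 1k to 3k range mentioned above works out to roughly 29,000 to 87,000 queries a day.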
That TV wouldn't be a Samsung specimen, by any chance?
In the past, I have encountered a Samsung TV at a friend's place that insisted on being called localhost - and it sporadically spawned DNS requests for time servers in the 10,000s before going back to normal.
You might give that a try, though I wouldn't know how this affects CPU load once applied, as Pi-hole would have to throw away a large portion of its database.
It might be easier to just drop the database as a whole and to start anew, if you are willing to part with your statistics.
Paging @DL6ER for confirming my assumption, to be on the safe side, and for possible advice on the best strategy here.
I'm not convinced that the large long-term database is the culprit here. My long-term database is on the order of 300 MB (with far fewer clients), and that is even on a Raspberry Pi.
It would be interesting to see the times at which this happens so you could correlate this to the "queries over time" graph. Are there any visible spikes in the latter?
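If the graph is hard to read, a per-hour query count from the log gives a similar picture. A sketch only: `queries_per_hour` is a made-up name, and it assumes log lines of the form "Mon D HH:MM:SS dnsmasq[PID]: query[TYPE] name from client".

```shell
# Hypothetical helper: count DNS queries per hour, busiest hours first.
# Assumes the fifth whitespace-separated field starts with "query[" and
# the third field is the time as HH:MM:SS.
queries_per_hour() {
    log_file="${1:-/var/log/pihole.log}"
    awk '
        $5 ~ /^query\[/ {
            split($3, t, ":")
            count[$1 " " $2 " " t[1] ":00"]++
        }
        END { for (k in count) print count[k], k }
    ' "$log_file" | sort -rn
}

# Example: show the ten busiest hours
#   queries_per_hour /var/log/pihole.log | head -n 10
```

An hour that stands far above the rest would be a good candidate to compare against the times of the freezes.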
Maybe your VM software already offers monitoring for CPU load? If not, there should be plenty of software available that can automatically collect this data.
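Failing that, a few lines of shell inside the VM would do as a stopgap. This is a minimal sketch, assuming a Linux guest where `/proc/loadavg` is available; `log_load_sample` is a made-up name and the log file path is up to you.

```shell
# Hypothetical helper: append one timestamped load-average sample to a file.
# The second argument exists only so another source file can be substituted.
log_load_sample() {
    out_file="$1"
    loadavg_file="${2:-/proc/loadavg}"
    # The first three fields are the 1-, 5- and 15-minute load averages.
    read -r load1 load5 load15 _rest < "$loadavg_file" || return 1
    printf '%s %s %s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" \
        "$load1" "$load5" "$load15" >> "$out_file"
}

# Example: sample once a minute until interrupted
#   while true; do log_load_sample "$HOME/load.log"; sleep 60; done
```

The timestamps in the resulting file can then be lined up with the times at which DNS and DHCP stopped responding.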
I've had a look at the ESXi logs and that didn't show anything around the time it last happened.
I've reduced the days the stats are held for to 60 days. So far (12 hrs) it seems stable. The WebUI is also a lot slicker, and memory usage is now showing at around 10% rather than the 85-90% I was seeing before. The VM also runs Webmin, and this is also running a lot slicker.
It's still a bit early to say if this has solved it but it's looking good. I'll monitor for a few days and see what happens.
So you did this by setting MAXDBDAYS in pihole-FTL.conf? This just defines after how long queries are removed from the pihole-FTL.db database on disk; it has no effect at all on a running FTL instance.
Let's see what happens. There seems to be something happening at a specific point in time.
Well, with no changes other than setting MAXDBDAYS to 60 in pihole-FTL.conf, everything is now running smoothly. No locks or freezes at all for 48+ hours. Although intermittent and with no pattern, the fault would most definitely have occurred during this time.
The WebUI is also a lot slicker as is the rest of the system (SSH/Webmin etc).