Pi-hole randomly fails

Please follow the template below; it will help us to help you!

Expected Behavior:

Pi-hole to perform DNS lookups and respond to DHCP requests.

Actual Behavior:

Pi-hole randomly fails to perform DNS lookups or hand out IPs via DHCP. There appears to be no pattern to when this happens. No device on the local network can resolve hostnames or obtain an IP while this happens. The host is still up and pingable during this time.

Pi-hole will then start working again without intervention, but shows high load averages. When this happens, top shows 'pihole-FTL' at 80%+ CPU.

Pi-hole is running in a Debian 9 VM on ESXi with 2 vCPU and 2GB RAM. ESXi host has other VMs that all run flawlessly.

No other DHCP/DNS servers on the local network.

Debug Token:

gwob4waorn

Welcome to the Pi-hole community, gavinhatton. :slight_smile:

From your debug log, it seems your Pi-hole's long-term database has grown quite large, about 1.4GB.
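
For reference, the long-term database normally lives at /etc/pihole/pihole-FTL.db on a default install, so you can check its size yourself with something like:

ls -lh /etc/pihole/pihole-FTL.db

(If you have relocated the database via the DBFILE setting, adjust the path accordingly.)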

You wouldn't be applying any disk quotas to your VM?

Let's see how much space is left by running the following command on your Pi-hole machine:

df -ah

Please report back with the output, preferably by pasting it here in its textual form - I'll help with the editing, if required :wink:

Thanks for the quick reply.

There are no disk quotas on the VM, and it has a higher allocation of CPU, memory and disk I/O than the others.

What would cause the database to become so large?

As requested, output of df -ah:

localadmin@pihole:~$ df -ah
Filesystem                   Size  Used Avail Use% Mounted on
sysfs                           0     0     0    - /sys
proc                            0     0     0    - /proc
udev                         991M     0  991M   0% /dev
devpts                          0     0     0    - /dev/pts
tmpfs                        201M   14M  188M   7% /run
/dev/mapper/pihole--vg-root   15G  3.2G   11G  23% /
securityfs                      0     0     0    - /sys/kernel/security
tmpfs                       1003M  3.4M  999M   1% /dev/shm
tmpfs                        5.0M     0  5.0M   0% /run/lock
tmpfs                       1003M     0 1003M   0% /sys/fs/cgroup
cgroup                          0     0     0    - /sys/fs/cgroup/systemd
pstore                          0     0     0    - /sys/fs/pstore
cgroup                          0     0     0    - /sys/fs/cgroup/memory
cgroup                          0     0     0    - /sys/fs/cgroup/freezer
cgroup                          0     0     0    - /sys/fs/cgroup/net_cls,net_prio
cgroup                          0     0     0    - /sys/fs/cgroup/blkio
cgroup                          0     0     0    - /sys/fs/cgroup/devices
cgroup                          0     0     0    - /sys/fs/cgroup/cpuset
cgroup                          0     0     0    - /sys/fs/cgroup/cpu,cpuacct
cgroup                          0     0     0    - /sys/fs/cgroup/pids
cgroup                          0     0     0    - /sys/fs/cgroup/perf_event
systemd-1                       -     -     -    - /proc/sys/fs/binfmt_misc
hugetlbfs                       0     0     0    - /dev/hugepages
debugfs                         0     0     0    - /sys/kernel/debug
mqueue                          0     0     0    - /dev/mqueue
/dev/sda1                    236M   89M  135M  40% /boot
tmpfs                        201M     0  201M   0% /run/user/999
tmpfs                        201M     0  201M   0% /run/user/1000
binfmt_misc                     0     0     0    - /proc/sys/fs/binfmt_misc
localadmin@pihole:~$

(You can format your output by highlighting a text passage and choosing </> - Preformatted text from the menu.)

I see no issues with your free space.

A large number of queries would. The number of queries scales with your number of clients and internet usage, of course, and it accumulates over time.

Pi-hole will keep queries for a limited time only, dropping entries older than a configurable threshold.
That threshold value defaults to 365 days, but can be customised via /etc/pihole/pihole-FTL.conf:

MAXDBDAYS=365
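
If you ever want to lower that retention, one way that should work on a default install is to append the setting and restart FTL, e.g. for 60 days:

sudo sh -c 'echo "MAXDBDAYS=60" >> /etc/pihole/pihole-FTL.conf'   # set the new threshold
sudo service pihole-FTL restart                                   # FTL reads its config only at startup

(The restart matters - a running FTL instance won't pick up the changed value by itself.)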

We haven't established that as a cause yet, though.

Let’s run the following command on your Pi-hole machine to check some of Pi-hole’s stats:

echo ">stats" | nc localhost 4711 -w 1

Not sure if this next one is related either:
Your debug log shows (only partially) a section about DHCP requests conflicting over names.
You should be able to get the full story with:

 grep "not giving name" /var/log/pihole.log

I deliberately didn't post the output, so as not to compromise your local naming scheme. :wink:

I see what you mean about the DHCP issues in the logs. This is owing to the cloning of some other VMs, with the clone being configured for DHCP, albeit with the same hostname. This is a recent event, and the issue of Pi-hole randomly failing pre-dates it. The 'localhost' DHCP conflict comes from a smart TV that reports its name as localhost. I've now added this MAC to the static list with a new hostname.
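
For reference, that static entry ends up as a dnsmasq dhcp-host line in Pi-hole's DHCP configuration, roughly like the following (MAC, IP and hostname here are placeholders, not the real ones):

dhcp-host=AA:BB:CC:DD:EE:FF,192.168.1.50,smart-tv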

As for the stats, here is the requested output:

localadmin@pihole:~$ echo ">stats" | nc localhost 4711 -w 1

domains_being_blocked 582972
dns_queries_today 53698
ads_blocked_today 11751
ads_percentage_today 21.883497
unique_domains 1873
queries_forwarded 16797
queries_cached 25148
clients_ever_seen 29
unique_clients 29
dns_queries_all_types 53698
reply_NODATA 8
reply_NXDOMAIN 11
reply_CNAME 29
reply_IP 69
privacy_level 0
status enabled
---EOM---

localadmin@pihole:~$

I might try reducing the number of days the stats are held for. I don't need an entire year's worth.

The number of queries is somewhere in the high range, but not excessive, and it is in line with your number of clients: 53,698 queries across 29 clients works out to roughly 1,850 queries per client per day. From personal observation, I'd consider anything from 1k to 3k queries per client per day as normal (though I don't have the actual hard stats to prove this).

This makes it much less likely that your configuration is closing a DNS loop, which could have accounted for a large number of queries (easily in the higher 100,000s or even millions a day).

And as your Pi-hole does not use your router as one of its upstream servers, and you are not using Conditional Forwarding, we can rule out a DNS loop as the cause of a high number of queries.

That TV wouldn't be a Samsung specimen, by any chance?

In the past, I have encountered a Samsung TV at a friend's place that insisted on being called localhost - and it sporadically spawned DNS requests for time servers in the 10,000s before going back to normal.

You might give that a try, though I wouldn't know how this affects CPU load once applied, as Pi-hole would have to throw away a large portion of its database.
It might be easier to just drop the database as a whole and to start anew, if you are willing to part with your statistics.
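
If you decide to go the drop-and-start-fresh route, one way that should work (service name and path assume a default install; double-check before deleting anything) is:

sudo service pihole-FTL stop
sudo mv /etc/pihole/pihole-FTL.db /etc/pihole/pihole-FTL.db.bak   # keep a backup rather than deleting outright
sudo service pihole-FTL start                                     # FTL recreates an empty database on startup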

Paging @DL6ER for confirming my assumption, to be on the safe side, and for possible advice on the best strategy here.

I'm not convinced that the large long-term database is the culprit here. My long-term database is on the order of 300 MB (with far fewer clients), and that is even on a Raspberry Pi.

It would be interesting to see the times at which this happens, so you could correlate them with the "queries over time" graph. Are there any visible spikes in the latter?
Maybe your VM software already offers monitoring for CPU load? If not, there should be plenty of software available that can collect this data automatically.
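
If nothing suitable is at hand, even a small shell loop would do as a stop-gap for collecting the data (log path and interval are only examples):

while true; do
    # timestamp, 1/5/15-minute load averages, and pihole-FTL's current CPU share
    echo "$(date '+%F %T') $(cat /proc/loadavg) $(ps -C pihole-FTL -o %cpu=)" >> /tmp/ftl-load.log
    sleep 60
done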

Thanks for the responses.

I've had a look at the ESXi logs and they didn't show anything around the time it last happened.

I've reduced the days the stats are held for to 60 days. So far (12 hrs) it seems stable. The WebUI is also a lot slicker, and memory usage is now showing at around 10% rather than the 85-90% I was seeing before. The VM also runs Webmin, and that is running a lot slicker as well.

It's still a bit early to say if this has solved it but it's looking good. I'll monitor for a few days and see what happens.

Thank you.

So you did this by setting MAXDBDAYS in pihole-FTL.conf? This just defines the amount of time after which queries are removed from the pihole-FTL.db database on disk; it has no effect at all on a running FTL instance.

Let's see what happens. There seems to be something happening at a specific point in time.

Well, with no changes other than setting MAXDBDAYS to 60 in pihole-FTL.conf, everything is now running smoothly. No locks or freezes at all for 48+ hours. Although the fault was intermittent and without a pattern, it would most definitely have occurred during this time.

The WebUI is also a lot slicker as is the rest of the system (SSH/Webmin etc).

Thank you both for your help with this matter.

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.