Nightly disk space shortage

The issue I am facing:

I noticed this by visiting http://pi.hole/admin/messages.php

There are a pair of messages generated every night

Disk shortage (`/etc/pihole/pihole-FTL.db`) ahead: **96% used**
/etc/pihole: 30.0GB used, 31.1GB total
Disk shortage (`/var/log/pihole-FTL.log`) ahead: **96% used**
/var/log: 30.0GB used, 31.1GB total

I have prometheus monitoring and can see that indeed, the available disk space plummets every night to zero bytes free, then recovers.

The filesystem currently has 8GB free and usually does

/dev/mmcblk0p2   29G   20G  8.2G  71% /

It appears something is kicking off at midnight and running the system out of disk space, then cleaning it all back up? I have logs but nothing seems conclusive.

Details about my system:

I documented my pi-hole OS build and install on my blog - lots of details there.
https://lowtek.ca/roo/category/pi-hole/

High level: Raspberry Pi 4 (2GB)
Ubuntu 20.04.4 LTS
Pi-hole v5.10
FTL v5.15
Web Interface v5.12

What I have changed since installing Pi-hole:

Not terribly much. This has worked fine for months and months, then started getting warnings about disk storage.

The other night - it seems pi.hole got sick. DNS wasn't running properly and thus hanging most of my network, but I was still able to ssh into the box and recover by doing a reboot.

I am trying to determine if the nightly spike on disk space is a pi-hole driven process, or if there is some aspect of the OS that is causing this. From the logs - it is difficult to tell - but I suspect that my logging agent (td-agent-bit) is the culprit. Having reviewed things - it seems that the project has moved away from this version of the agent and there is now a fluent-bit build that replaces it.

hmm posting here reminded me that yeah - td-agent-bit has been a problem in the past too CPU usage grows until next reboot - caused by td-agent-bit

While suspicious, this may not be the actual root cause.

I see 2 things happening in my logs during this time period

  • man-db regeneration
  • log file rotation

the log file rotation causes td-agent-bit (the logging agent) to find new files to rotate etc.

You can see the free disk space graph here

It seems running

$ sudo /usr/bin/mandb

takes nearly no time - and does not increase disk space in a measurable way

Neither does the log rotation

$ sudo /usr/sbin/logrotate /etc/logrotate.conf

Hmm..

System activities (log rotate, etc) and cron tasks are logged in /var/log/syslog.

I would look there to start.

I have already looked at the logs - but it is always worth digging further.

The disk space metric graph shows that starting at midnight (00:00:00) space starts to run out - up until we have zero free bytes, then it suddenly frees up.

The start time is consistent - always some triggered job. The recover time varies a few minutes from day to day. Last night it was right at 00:18:00 (different that the previous day graph posted above)

I can see a few jobs get started right around midnight

2022-07-04 00:00:01	{"log":"Jul  4 00:00:01 localhost CRON[37275]: (root) CMD (   PATH=\"$PATH:/usr/sbin:/usr/local/bin/\" pihole flush once quiet)"}
2022-07-04 00:00:01	{"log":"Jul  4 00:00:01 localhost CRON[37274]: (root) CMD (   PATH=\"$PATH:/usr/sbin:/usr/local/bin/\" pihole updatechecker local)"}
2022-07-04 00:00:01	{"log":"Jul  4 00:00:01 localhost systemd[1]: Starting Daily man-db regeneration..."}
2022-07-04 00:00:01	{"log":"Jul  4 00:00:01 localhost systemd[1]: Starting Rotate log files..."}
2022-07-04 00:00:01	{"log":"Jul  4 00:00:01 localhost systemd[1]: Starting Discard unused blocks on filesystems from /etc/fstab..."}

I also see success messages from these services

2022-07-04 00:00:04	{"log":"Jul  4 00:00:04 localhost systemd[1]: Finished Daily man-db regeneration."}
2022-07-04 00:00:04	{"log":"Jul  4 00:00:04 localhost systemd[1]: Finished Rotate log files."}
2022-07-04 00:00:07	{"log":"Jul  4 00:00:07 localhost systemd[1]: Finished Discard unused blocks on filesystems from /etc/fstab."}

At 00:20:01 I see this - which seems suspicious

2022-07-04 00:20:01	
{"log":"Jul  4 00:20:01 localhost CRON[37588]: (root) CMD (   PATH=\"$PATH:/usr/sbin:/usr/local/bin/\" pihole updatechecker local)"}

Oh.. look at that - it might be the flush that is breaking me..

$ ls -l /var/log/pihole.l*
-rw-r--r-- 1 pihole pihole 11409416192 Jul  3 00:18 /var/log/pihole.log
-rw-r--r-- 1 pihole pihole     2363163 May 31 00:00 /var/log/pihole.log.6.gz

Reading about pihole flush and checking the implementation.. I can see what is wrong here

        # Copy pihole.log over to pihole.log.1
        # and empty out pihole.log
        # Note that moving the file is not an option, as
        # dnsmasq would happily continue writing into the
        # moved file (it will have the same file handler)
        cp -p /var/log/pihole.log /var/log/pihole.log.1
        echo " " > /var/log/pihole.log
        chmod 644 /var/log/pihole.log

Given that it is only ~10:38am here, and my pihole.log file is already 11GB is size a 'copy' of that file is certainly going to blow my available disk space (~8GB).

This is also a problem that I've caused, having a very high number of DNS queries. I am running MotionEye - but have turned off the camera. MotionEye aggressively tries to look up the camera by name - hitting pi-hole repeatedly (like 120,000 times in 24hrs). Thus with a long history of rotated logs and a lot of queries I'm going to use a lot of disk space up quickly.

With a 32GB SDCard as my filesystem - it should be enough, but clearly I'm exceeding the capacity with the number of queries (200,000+/day)

It might be wise to modify the pihole.log rotation to immediately target a .z version - but this might break some logging agents? The gotcha for me is that my log file size was much more modest - until I turned off the camera and MotionEye basically doubled the log file size in a day.

Another solution might be to 'fix' the pi-hole diagnostic to look for this particular failure and warn more specifically. Even identifying that the size of the pihole.log file is greater than the size of (pihole.log.1 + avail disk space) would be a slick way to warn specifically that this is going to fail.

Net here - others that have this problem - may benefit from looking for "missing" rotated pihole.log.[1|gz] files and discovering this particular issue.

11 GB does not appear consistent with a daily query volume of 200K queries per day.

As a scaling comparison, one of my Pi-holes received 13K queries in the past 24 hours. Yesterday's dnsmasq log is 6.1 MB.

-rw-r--r-- 1 pihole pihole 6.1M Jul 4 00:00 pihole.log.1

To scale up to 20 times this volume (260K queries), I would expect in the vicinity of 125-150 MB. You are at 11 GB, which is about 2 orders of magnitude higher, and in less than half a day.

What are the outputs of the following commands from the Pi terminal:

ls -lha /var/log/pihole*

ls -lha /var/pihole/*.db

echo ">stats" | nc 127.0.0.1 4711

echo ">top-domains >quit" | nc localhost 4711

echo ">top-ads >quit" | nc localhost 4711

echo ">top-clients >quit" | nc localhost 4711

It is odd - and I was "bad" as I just nuked the oversized logs and have lost easy access to that data now.

I stopped MotionEye from generating a zillion bad lookups.

Here is the log sizes now - very normal IMO

$ ls -l /var/log/pihole.l*
-rw-r--r-- 1 pihole pihole  8296018 Jul  5 10:06 /var/log/pihole.log
-rw-r--r-- 1 pihole pihole 12182688 Jul  5 00:00 /var/log/pihole.log.1

My pi.hole console is saying 81,000 total queries. This seems to be aligned with the ratio you're seeing. Maybe failed lookups for local domains generate more logs

Yeah.. it was terrible - every ~5 seconds (possibly more often) it tried to do this lookup series for the camera - both IPv4 and IPv6..

	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: query[A] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: query[AAAA] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: query[A] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:21	{"log":"Jul  2 20:09:21 dnsmasq[912]: query[AAAA] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[AAAA] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[A] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[A] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[AAAA] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[A] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[AAAA] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[A] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[AAAA] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[AAAA] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[A] wyze-v2.lan.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: reply wyze-v2.lan is NXDOMAIN"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[AAAA] wyze-v2.lan from 192.168.1.82"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: forwarded wyze-v2.lan to 192.168.1.14"}
	2022-07-02 20:09:20	{"log":"Jul  2 20:09:20 dnsmasq[912]: query[A] wyze-v2.lan from 192.168.1.82"}

Can you edit the hosts file on the MotionEye server node and hardcode in the IP? That should preclude that server from even trying a DNS query if it can pull it from the local hosts file.

That looks like your DHCP isn't set up quite right, or the search domains for the DNS setup isn't tuned.

Normally you'd see a query for wyze-v2 and if that didn't have a record then the DNS client would add the local domain or the search domain(s) to be wyze-v2.lan. Having that additional wyze-v2.lan.lan looks suspicious. And wyze-v2.lan coming back as NXDOMAIN doesn't look right if you're using .lan as the local domain and handing things out via DHCP.

If you can post a debug token URL we can take a deeper look if you're still interested. (I haven't been to the blog so I apologize if you've covered some of this over there.)

Have you created a local DNS record on Pi-hole for this camera name?
If so, see below:

It is worth checking into - but I'm not sure that there is a problem here. MotionEye is configured to know about "wyze-v2.lan" - that software may be appending the .lan domain in an attempt to locate the camera which is gone from the network and pasting on another .lan .. but I agree, it does seem a little bit suspicious because I'm pretty sure I didn't tell MotionEye that it was on the .lan domain - and it's running in a container which shouldn't know that either.

Yeah - I could just tell MotionEye the IP of the camera - my network is setup with OpenWRT and rarely will cycle IPs (or I could assign it a fixed IP). For now I've just turned off MotionEye because I wasn't using it at all.

As I mentioned above, I run OpenWRT. It runs my DHCP, which in turn is configured to advertise the pi.hole as the DNS.

OpenWRT hands out the names / IPs. The pi.hole just manages DNS.
I'm using the "Conditional forwarding" to forward .lan to the OpenWRT DNS to resolve names locally when needed.

However using your dig line - I am seeing a zero TTL in the ANSWER section for local names. Hmm..

Ahh, but modifying the dig to point at my OpenWRT - it appears that the DNS there is the one giving the zero TTL - (ie: can bypass the pi.hole entirely and get the same zero.. that might be worth looking into)

Seems easy enough to do [SOLVED] How do I increase DNS local-ttl in OpenWRT? - Network and Wireless Configuration - OpenWrt Forum

I've never really looked at the local DNS TTL - and using zero seems pretty aggressive. I wonder why that is the default?

Its the default for dnsmasq but IMO the man page doesnt adequately explain why :

pi@ph5b:~ $ man dnsmasq
[..]
       -T, --local-ttl=<time>
              When replying with information from /etc/hosts or configura‐
              tion or the DHCP leases file dnsmasq  by  default  sets  the
              time-to-live  field  to  zero,  meaning  that  the requester
              should not itself cache the information. This is the correct
              thing  to  do in almost all situations. This option allows a
              time-to-live (in seconds) to be  given  for  these  replies.
              This  will  reduce  the load on the server at the expense of
              clients using stale data under some circumstances.

Well - IMO - it's worse than that. OpenWRT also ships with this as the default

The OpenWRT community has some deep network guru's - I'm going to see if someone can help educate me. I know how to fix this in OpenWRT based on the doc page -- but just because you can turn the knob to a higher value, what is the trade-off?

Feedback from a couple of openwrt community members seemed to be as follows

  • dnsmasq provides zero by default
  • as it's local network, simplicity of always looking up is better than caching and complexity
  • dnsmasq doc says

"When replying with information from /etc/hosts or configuration or the DHCP leases file dnsmasq by default sets the time-to-live field to zero, meaning that the requester should not itself cache the information. This is the correct thing to do in almost all situations. This option allows a time-to-live (in seconds) to be given for these replies. This will reduce the load on the server at the expense of clients using stale data under some circumstances. "

I have still set my OpenWRT configuration to use a local_ttl of 60. Having a name be wrong for a minute (60 seconds) doesn't seem like a big deal to me, and it'll in theory reduce some of the local lookups I'm seeing in pi.hole a lot.

Boo - well, the code that is running that's making all those queries - is still hitting the pi.hole - so the statement that "most clients don't cache" is quite true.

Some of the code is rustlang, some of the code is python. They both did not change behaviour with the increased local_ttl value. I did see the pi.hole start replying from the (cache) vs. always forwarding upstream to my OpenWRT router - so name resolution got a tiny bit faster -- but probably not worth it.

5 posts were split to a new topic: PADD/Pihole: Logs account for almost all disk usage