High level: Raspberry Pi 4 (2GB)
Ubuntu 20.04.4 LTS
Web Interface v5.12
What I have changed since installing Pi-hole:
Not terribly much. This has worked fine for months and months, then I started getting warnings about disk storage.
The other night - it seems pi.hole got sick. DNS wasn't running properly and thus hanging most of my network, but I was still able to ssh into the box and recover by doing a reboot.
I am trying to determine if the nightly spike on disk space is a pi-hole driven process, or if there is some aspect of the OS that is causing this. From the logs - it is difficult to tell - but I suspect that my logging agent (td-agent-bit) is the culprit. Having reviewed things - it seems that the project has moved away from this version of the agent and there is now a fluent-bit build that replaces it.
I have already looked at the logs - but it is always worth digging further.
The disk space metric graph shows that starting at midnight (00:00:00) space starts to run out - up until we have zero free bytes, then it suddenly frees up.
The start time is consistent - always some triggered job. The recovery time varies a few minutes from day to day. Last night it was right at 00:18:00 (different from the previous day's graph posted above).
I can see a few jobs get started right around midnight
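The midnight timing can be pinned down from the host itself by grepping the cron files for entries that fire at 00:00. The sample line below is my approximation of the shape of Pi-hole's daily flush job, not a verbatim copy of the cron file:

```shell
# Look for cron entries that fire at exactly 00:00.
# Demonstrated against a sample line shaped like Pi-hole's daily
# flush job (an approximation, not a verbatim copy):
sample='00 00   * * *   root    pihole flush once quiet'
printf '%s\n' "$sample" | grep -E '^0?0 0?0[[:space:]]'

# On the Pi itself, run the same pattern over the real cron files:
#   grep -RhE '^0?0 0?0[[:space:]]' /etc/cron.d/ /etc/crontab
```

`systemctl list-timers --all` is worth a look too, in case the job is a systemd timer rather than cron.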
Reading about pihole flush and checking the implementation, I can see what is wrong here:
# Copy pihole.log over to pihole.log.1
# and empty out pihole.log
# Note that moving the file is not an option, as
# dnsmasq would happily continue writing into the
# moved file (it will have the same file handler)
cp -p /var/log/pihole.log /var/log/pihole.log.1
echo " " > /var/log/pihole.log
chmod 644 /var/log/pihole.log
Given that it is only ~10:38am here, and my pihole.log file is already 11GB in size, a 'copy' of that file is certainly going to blow my available disk space (~8GB).
This is also a problem of my own making: a very high number of DNS queries. I am running MotionEye - but have turned off the camera. MotionEye aggressively tries to look up the camera by name - hitting pi-hole repeatedly (around 120,000 times in 24hrs). So with a long history of rotated logs and a lot of queries, I'm going to use up a lot of disk space quickly.
With a 32GB SDCard as my filesystem - it should be enough, but clearly I'm exceeding the capacity with the number of queries (200,000+/day)
It might be wise to modify the pihole.log rotation to immediately target a compressed (.z) version - but this might break some logging agents? The gotcha for me is that my log file size was much more modest - until I turned off the camera and MotionEye basically doubled the log file size in a day.
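A rotation that compresses as it copies would be one way around the space problem. Here's a minimal sketch run against a scratch file (on a real box LOG would be /var/log/pihole.log, and whether logging agents can follow a .gz is the open question above):

```shell
# Demo on a scratch file; point LOG at /var/log/pihole.log for real use.
LOG=/tmp/pihole-demo.log
printf 'Jan  1 00:00:01 dnsmasq[123]: query[A] wyze-v2.lan from 192.168.1.20\n' > "$LOG"

# Stream-compress instead of cp: the extra disk needed is only the
# compressed size (text logs often shrink ~10x), not a full second copy.
gzip -c "$LOG" > "$LOG.1.gz"

# Truncate in place: the inode survives, so dnsmasq's open file
# handle keeps writing into the (now empty) original file.
truncate -s 0 "$LOG"
chmod 644 "$LOG"
```

The truncate-in-place step preserves the same property the current implementation gets from `echo " " >` - dnsmasq never loses its file handle.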
Another solution might be to 'fix' the pi-hole diagnostic to look for this particular failure and warn more specifically. Even identifying that the size of the pihole.log file is greater than the size of (pihole.log.1 + avail disk space) would be a slick way to warn specifically that this is going to fail.
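That pre-flight check is simple to sketch in shell. This is a hypothetical diagnostic, not Pi-hole code: it compares the live log's size against the free space on its filesystem before any copy is attempted (demoed on a scratch file):

```shell
# Hypothetical pre-flight check: would `cp pihole.log pihole.log.1` fit?
LOG=/tmp/pihole-preflight-demo.log
printf 'query[A] wyze-v2.lan from 192.168.1.20\n' > "$LOG"

log_bytes=$(stat -c %s "$LOG")
avail_bytes=$(df --output=avail -B1 "$LOG" | tail -n 1 | tr -d ' ')

if [ "$log_bytes" -gt "$avail_bytes" ]; then
    echo "WARNING: copying $LOG needs $log_bytes bytes but only $avail_bytes are free" >&2
else
    echo "OK: $log_bytes bytes to copy, $avail_bytes bytes free"
fi
```

A fuller version of the check described above would add the size of the existing pihole.log.1 (which gets overwritten) to the available figure before comparing.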
Net here: others that have this problem may benefit from looking for "missing" rotated pihole.log.[1|gz] files and discovering this particular issue.
That looks like your DHCP isn't set up quite right, or the search domains for the DNS setup aren't tuned.
Normally you'd see a query for wyze-v2 and if that didn't have a record then the DNS client would add the local domain or the search domain(s) to be wyze-v2.lan. Having that additional wyze-v2.lan.lan looks suspicious. And wyze-v2.lan coming back as NXDOMAIN doesn't look right if you're using .lan as the local domain and handing things out via DHCP.
If you can post a debug token URL we can take a deeper look if you're still interested. (I haven't been to the blog so I apologize if you've covered some of this over there.)
It is worth checking into - but I'm not sure that there is a problem here. MotionEye is configured to know about "wyze-v2.lan" - that software may be appending the .lan search domain itself in an attempt to locate the camera (which is gone from the network), tacking on another .lan. But I agree, it does seem a little bit suspicious: I'm pretty sure I didn't tell MotionEye that it was on the .lan domain, and it's running in a container which shouldn't know that either.
Yeah - I could just tell MotionEye the IP of the camera - my network is set up with OpenWRT and rarely cycles IPs (or I could assign it a fixed IP). For now I've just turned off MotionEye because I wasn't using it at all.
As I mentioned above, I run OpenWRT. It runs my DHCP, which in turn is configured to advertise the pi.hole as the DNS.
OpenWRT hands out the names / IPs. The pi.hole just manages DNS.
I'm using the "Conditional forwarding" to forward .lan to the OpenWRT DNS to resolve names locally when needed.
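For reference, my understanding is that the Conditional forwarding setting boils down to a couple of dnsmasq directives along these lines. The fragment below is an illustration with placeholder values (.lan domain, 192.168.1.0/24 LAN, router at .1), not the exact generated file contents:

```
# Roughly what "Conditional forwarding" configures in dnsmasq
# (illustrative placeholder values, not the exact generated config)
server=/lan/192.168.1.1
rev-server=192.168.1.0/24,192.168.1.1
```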
However using your dig line - I am seeing a zero TTL in the ANSWER section for local names. Hmm..
Ahh - but modifying the dig to point at my OpenWRT, it appears that the DNS there is the one giving the zero TTL (i.e., I can bypass the pi.hole entirely and get the same zero... that might be worth looking into).
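Pulling the TTL out of dig's ANSWER section is scriptable. The answer line below is a stand-in for what I saw (hypothetical name and address); on a live network you'd pipe `dig @192.168.1.1 wyze-v2.lan +noall +answer` straight into the awk:

```shell
# dig's answer format: NAME  TTL  CLASS  TYPE  DATA
# This line is a stand-in for a real reply (hypothetical host/IP):
answer='wyze-v2.lan.   0   IN   A   192.168.1.50'

ttl=$(printf '%s\n' "$answer" | awk '{print $2}')
echo "TTL=$ttl"   # 0 means "do not cache" - dnsmasq's default for local names
```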
It's the default for dnsmasq, but IMO the man page doesn't adequately explain why:
pi@ph5b:~ $ man dnsmasq
When replying with information from /etc/hosts or configuration or the DHCP leases file dnsmasq by default sets the time-to-live field to zero, meaning that the requester should not itself cache the information. This is the correct thing to do in almost all situations. This option allows a time-to-live (in seconds) to be given for these replies. This will reduce the load on the server at the expense of clients using stale data under some circumstances.
Well - IMO it's worse than that: OpenWRT also ships with this as the default.
The OpenWRT community has some deep network gurus - I'm going to see if someone can help educate me. I know how to fix this in OpenWRT based on the doc page - but just because you can turn the knob to a higher value, what is the trade-off?
Feedback from a couple of OpenWRT community members was as follows:
dnsmasq provides zero by default
as it's a local network, the simplicity of always looking up is better than caching plus the added complexity
the dnsmasq doc says the same as the man page excerpt quoted above
I have still set my OpenWRT configuration to use a local_ttl of 60. Having a name be wrong for a minute (60 seconds) doesn't seem like a big deal to me, and in theory it will noticeably reduce the local lookups I'm seeing in pi.hole.
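On OpenWRT that change is a few lines of UCI. This sketch assumes the stock single dnsmasq instance (section index 0); check with `uci show dhcp` first:

```
# Raise dnsmasq's TTL for local names (hosts file / DHCP leases) to 60s.
# Assumes the default single dnsmasq section - verify with `uci show dhcp`.
uci set dhcp.@dnsmasq[0].local_ttl='60'
uci commit dhcp
/etc/init.d/dnsmasq restart
```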
Boo - well, the code that's making all those queries is still hitting the pi.hole, so the statement that "most clients don't cache" is quite true.
Some of the code is Rust, some is Python. Neither changed behaviour with the increased local_ttl value. I did see the pi.hole start replying from cache instead of always forwarding upstream to my OpenWRT router - so name resolution got a tiny bit faster - but probably not worth it.