Please follow the below template, it will help us to help you!
If you are Experiencing issues with a Pi-hole install that has non-standard elements (e.g you are using nginx instead of lighttpd, or there is some other aspect of your install that is customised) - please use the Community Help category.
Expected Behaviour:
DNS resolution should work continuously.
Hardware: Raspberry Pi 3 Model B Plus Rev 1.3
OS: Raspbian 9.11
Actual Behaviour:
Periodically pihole will stop responding to DNS queries. I haven't yet deciphered a particular pattern, but when this happens the CPU increases from roughly 10% to 30% and the 1-minute load average increases from around 0.5 to 1.5. Running pihole restartdns resolved the issue. Unfortunately I did not capture the processes with high CPU (though I'll make sure to do that next time it happens). This appears to happen less frequently than daily, it may happen once a week, I'm trying to track that better. I have excepts from pihole.log, pihole-FTL.log, and syslog for the period around the issue. Based on the information in pihole.log and collected metrics, the system stopped responding after 20-June 11:40:36 and started up again at around 11:47 after I'd restarted. (There are a couple of lines out output at 11:45:50 but it didn't seem to be responding at that time).
I did look through several similar posts but none seem to match what I'm seeing.
Hard to say. Here are a few steps where you can move the existing long term database and start with a fresh one. If that resolves your problem, then you will know it was the database.
Ok, I've done that. I'll monitor and if it hasn't happened in the next week or so, I'll close this.
Thanks again for the quick response.
BTW, I don't remember where those regex whitelists came from or if they are needed. I know that I didn't write them as I know that's not valid; I must have imported from somewhere I guess.
Form the output you shared above you had dns_queries_today 80709 with only clients_ever_seen 6. Six clients producing over 80.000 is a very active network. From my experience one client produces between 1000-2000 queries/day (just a rough estimation).
Is one of your clients sticking out? By any chance, did you configure a DNS loop?
The number of clients is a little misleading. Most of the clients get their DNS from the main router (which also does DHCP), so the bulk are from that one device. I've only got one Windows machine and a couple of Linux boxes that point directly to the pihole. Interestingly, for today, I'm at about 70K queries total, with 25K going from the router, a few thousand from the one Windows machine, but a total of almost 40K from the 2 Linux boxes (which mostly don't do a whole lot. I'm going to look at what is generating all those queries.
Thanks for the tip.
Edit: I took at look at the queries from both the Linux boxes. They both seem legitimate; One is running as a Docker server and the majority of the queries are lookups for Docker-related things, mostly automatically checking for updated images, etc. The other box handles email and a lot of what it's looking up is legitimately related to that. I don't see any evidence of DNS loops, they just seem to talk a lot.
It looks like the issue happened again, during the overnight hours. I have monitoring on this now and I could see that from around 04:18 to 04:24 local there was nothing written in pihole.log.
Jun 28 04:18:07 dnsmasq[25557]: forwarded 71IX8569-3vMYm5dnh.ebl.msbl.org to 4.2.2.2
Jun 28 04:18:08 dnsmasq[25557]: reply 71IX8569-3vMYm5dnh.ebl.msbl.org is NXDOMAIN
Jun 28 04:24:42 dnsmasq[25557]: read /etc/hosts - 5 addresses
Jun 28 04:24:42 dnsmasq[25557]: read /etc/pihole/custom.list - 7 addresses
Jun 28 04:24:42 dnsmasq[25557]: read /etc/pihole/local.list - 2 addresses
Jun 28 04:24:42 dnsmasq[25557]: query[A] safebrowsing.googleapis.com from 192.168.0.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 1.0.0.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 1.1.1.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 4.2.2.2
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 4.2.2.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 8.8.4.4
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 8.8.8.8
Jun 28 04:24:42 dnsmasq[25557]: query[A] x8XlnitH.multi.surbl.org from 192.168.0.46
Jun 28 04:24:42 dnsmasq[25557]: forwarded x8XlnitH.multi.surbl.org to 4.2.2.2
I should mention that things resumed automatically (the last time I happened to be at the keyboard and manually restarted.
Checking various logs, I see that there was a cronjob that started at 04:18:01:
Jun 28 04:17:01 pi1 CRON[21035]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jun 28 04:18:01 pi1 CRON[21051]: (root) CMD ( PATH="$PATH:/usr/local/bin/" pihole updateGravity >/var/log/pihol
e_updateGravity.log || cat /var/log/pihole_updateGravity.log)
Jun 28 04:20:01 pi1 CRON[21244]: (root) CMD ( PATH="$PATH:/usr/local/bin/" pihole updatechecker local)
Jun 28 04:30:01 pi1 CRON[21308]: (root) CMD ( PATH="$PATH:/usr/local/bin/" pihole updatechecker local)
Jun 28 04:39:01 pi1 CRON[21360]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system
]; then /usr/lib/php/sessionclean; fi)
Might be coincidence, but it looks like that's the only time that particular job has run all week. I went back through the older syslogs but I don't see it at all, the past time probably happened more than 7 days ago (which I believe is the case). Could this be a smoking gun?
Any harm in trying to run that manually to see if that reproduces the issue?
Gravity update runs only once a week on Sunday morning at a random time between 03:01 to 04:58. It fetches your adlists from the internet and builds a new gravity.db database. At the end the old database is swapped with the new one. DNS resolution (and blocking) should be preserved during this time. As you have "only" domains_being_blocked 94541 this whole update and swapping should not take long at all.
You can manually run pihole -g and see what happens.
It looks like this the gravity update is in fact related. I ran it once, and it zipped right thought. I noticed that there was a presumably outdated server list ( https://hosts-file.net/ad_servers.txt) so I removed it and re-ran. This time, this happened (and it stopped resolving as tested from another system):
[✗] DNS resolution is currently unavailable
[i] Time until retry: 87
That counted down to zero, then this:
pi@pi1:/var/log $ pihole -g
[✗] DNS resolution is currently unavailable
[✗] DNS resolution is not available
I ran pihole -d while it was happening but it was unable to upload the output presumably because it couldn't resolve the host address. I've saved a copy locally and if someone tells me how to upload it securely I'll do that.
I've used pihole restartdns to recover, I'll look at the performance tips mentioned and see if that helps.
[2020-06-28 04:18:12.804 25557] Reloading DNS cache
[2020-06-28 04:18:12.804 25557] Blocking status is enabled
[2020-06-28 04:18:12.807 25557] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-06-28 04:21:27.776 25557] INFO: No regex blacklist entries found
[2020-06-28 04:21:27.776 25557] INFO: No regex whitelist entries found
[2020-06-28 04:21:27.779 25557] Compiled 0 whitelist and 0 blacklist regex filters in 4.7 msec
I see now that there is a delay of 3 minutes between the database being deleted and FTL being instructed to re-opening the database. This should not happen.
Your log contains a second instance here:
[2020-06-28 15:54:25.132 589] Reloading DNS cache
[2020-06-28 15:54:25.133 589] Blocking status is enabled
[2020-06-28 15:54:25.133 589] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-06-28 15:54:25.184 589] INFO: No regex blacklist entries found
[2020-06-28 15:54:25.185 589] INFO: No regex whitelist entries found
[2020-06-28 15:54:25.190 589] Compiled 0 whitelist and 0 blacklist regex filters in 6.5 msec
and a third here:
[2020-06-28 16:08:03.932 892] Reloading DNS cache
[2020-06-28 16:08:03.932 892] Blocking status is enabled
[2020-06-28 16:08:03.933 892] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-06-28 16:08:04.221 892] INFO: No regex blacklist entries found
[2020-06-28 16:08:04.222 892] INFO: No regex whitelist entries found
[2020-06-28 16:08:04.226 892] Compiled 0 whitelist and 0 blacklist regex filters in 4.8 msec
In the second case it was only 50 milliseconds and in the third case it was 1/3 of a second. I do now agree that something strange is going on here.