DNS resolution unexpectedly stops after upgrade to 5.0


Expected Behaviour:

DNS resolution should work continuously.
Hardware: Raspberry Pi 3 Model B Plus Rev 1.3
OS: Raspbian 9.11

Actual Behaviour:

Periodically pihole will stop responding to DNS queries. I haven't yet deciphered a particular pattern, but when this happens the CPU usage increases from roughly 10% to 30% and the 1-minute load average increases from around 0.5 to 1.5. Running pihole restartdns resolves the issue. Unfortunately I did not capture the processes with high CPU (though I'll make sure to do that next time it happens). This appears to happen less frequently than daily; it may happen once a week, and I'm trying to track that better. I have excerpts from pihole.log, pihole-FTL.log, and syslog for the period around the issue. Based on the information in pihole.log and collected metrics, the system stopped responding after 20-June 11:40:36 and started up again at around 11:47 after I'd restarted. (There are a couple of lines of output at 11:45:50, but it didn't seem to be responding at that time.)
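For next time, I've made a note to grab a process snapshot the moment it happens; something like this (standard procps ps on Raspbian) should capture the top CPU consumers:

# list the ten processes using the most CPU, sorted descending
ps aux --sort=-%cpu | head -n 10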

I did look through several similar posts but none seem to match what I'm seeing.

Where should I upload the log files?

Debug Token:

https://tricorder.pi-hole.net/iev4d27te8

A few things jump out in the debug log.

  1. Your long term database is quite large, which typically indicates a high amount of daily activity and can lead to space issues on a smaller SD card.

[2020-06-20 12:59:00.553 24087] Notice: Database size is 5002.62 MB, deleted 3540 rows

Please post the output of this command, which will show the most recent 24 hours of activity:

echo ">stats >quit" | nc localhost 4711

You also have some invalid regex that will not compile:

   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.push.apple.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.itunes.apple.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.apps.apple.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.mzstatic.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.amazonaws.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.cdn-apple.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.digicert.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.symcb.com": Invalid preceding regular expression (error code 13)
   [2020-06-20 11:47:15.436 24087] Warning: Invalid regex whitelist filter "*.symcd.com": Invalid preceding regular expression (error code 13)

https://docs.pi-hole.net/ftldns/regex/tutorial/
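For reference, shell-style wildcards aren't valid regex. Following the tutorial above, a wildcard like *.push.apple.com would normally be written as an anchored regex, for example:

(\.|^)push\.apple\.com$

which matches push.apple.com and any of its subdomains.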

Thanks for the quick response. Here's the stats output:

domains_being_blocked 94541
dns_queries_today 80709
ads_blocked_today 4423
ads_percentage_today 5.480182
unique_domains 13000
queries_forwarded 45845
queries_cached 30434
clients_ever_seen 6
unique_clients 6
dns_queries_all_types 80709
reply_NODATA 2232
reply_NXDOMAIN 1796
reply_CNAME 2259
reply_IP 3465
privacy_level 0
status enabled

I've got a 32GB SD card so I'm not worried about space, but if I need to clean it up I'm happy to do so. I'll fix those whitelist filters.

If you aren't low on space, then you can leave the database as is. Run df -h and see where you stand on disk space.

Also, what is the output of this command, which will show the other database sizes:

ls -lha /etc/pihole/*.db
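If you're curious what's taking the space, sqlite3 can count the rows in FTL's query history directly (assuming sqlite3 is installed; queries is FTL's main history table):

# count stored queries in the long-term database
sudo sqlite3 /etc/pihole/pihole-FTL.db "SELECT COUNT(*) FROM queries;"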

pi@pi1:~ $ free
              total        used        free      shared  buff/cache   available
Mem:         948304       82844       28712       62416      836748      737640
Swap:        102396       10496       91900
pi@pi1:~ $ ls -lha /etc/pihole/*.db
-rw-rw-r-- 1 pihole pihole 6.2M Jun 20 14:36 /etc/pihole/gravity.db
-rw-r--r-- 1 pihole pihole 2.4M May 13 15:56 /etc/pihole/macvendor.db
-rw-r--r-- 1 pihole pihole 4.7G Jun 20 14:37 /etc/pihole/pihole-FTL.db

Why would the database get that big? I don't think my system is that active. Is it possible that it's somehow gotten corrupted?

Also:

Filesystem      Size  Used Avail Use% Mounted on
/dev/root        28G   11G   16G  40% /

So I'm good on disk

Hard to say. Here are a few steps where you can move the existing long term database and start with a fresh one. If that resolves your problem, then you will know it was the database.

sudo service pihole-FTL stop

sudo mv /etc/pihole/pihole-FTL.db /etc/pihole/pihole-FTL-old.db

sudo service pihole-FTL start
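If a fresh database also grows faster than you'd like, you can shorten FTL's retention window via MAXDBDAYS in /etc/pihole/pihole-FTL.conf (a documented FTL setting; the default is 365 days). For example:

# /etc/pihole/pihole-FTL.conf
# keep only 90 days of query history instead of the default 365
MAXDBDAYS=90

then restart with sudo service pihole-FTL restart to apply it.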

Ok, I've done that. I'll monitor and if it hasn't happened in the next week or so, I'll close this.

Thanks again for the quick response.

BTW, I don't remember where those regex whitelist entries came from or whether they're needed. I know I didn't write them, since I know that syntax isn't valid regex; I must have imported them from somewhere, I guess.

They were all added at almost the same time, and were not imported from your previous Pi-hole install:

2020-05-23 12:31:25

From the output you shared above, you had dns_queries_today 80709 with only clients_ever_seen 6. Six clients producing over 80,000 queries is a very active network. In my experience, one client produces between 1,000 and 2,000 queries/day (just a rough estimate).
Is one of your clients sticking out? By any chance, did you configure a DNS loop?
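If it helps, FTL's telnet API can break the totals down per client, using the same pattern as the stats command above:

echo ">top-clients >quit" | nc localhost 4711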

The number of clients is a little misleading. Most of the clients get their DNS from the main router (which also does DHCP), so the bulk of the queries come from that one device. I've only got one Windows machine and a couple of Linux boxes that point directly to the Pi-hole. Interestingly, for today I'm at about 70K queries total, with 25K coming from the router, a few thousand from the one Windows machine, but a total of almost 40K from the 2 Linux boxes (which mostly don't do a whole lot). I'm going to look at what is generating all those queries.

Thanks for the tip.

Edit: I took a look at the queries from both of the Linux boxes. They both seem legitimate; one is running as a Docker server and the majority of the queries are lookups for Docker-related things, mostly automatic checks for updated images, etc. The other box handles email, and a lot of what it's looking up is legitimately related to that. I don't see any evidence of DNS loops; they just seem to talk a lot.

:sweat_smile:
You have a very active network :slight_smile:

It looks like the issue happened again, during the overnight hours. I have monitoring on this now and I could see that from around 04:18 to 04:24 local there was nothing written in pihole.log.

Jun 28 04:18:07 dnsmasq[25557]: forwarded 71IX8569-3vMYm5dnh.ebl.msbl.org to 4.2.2.2
Jun 28 04:18:08 dnsmasq[25557]: reply 71IX8569-3vMYm5dnh.ebl.msbl.org is NXDOMAIN
Jun 28 04:24:42 dnsmasq[25557]: read /etc/hosts - 5 addresses
Jun 28 04:24:42 dnsmasq[25557]: read /etc/pihole/custom.list - 7 addresses
Jun 28 04:24:42 dnsmasq[25557]: read /etc/pihole/local.list - 2 addresses
Jun 28 04:24:42 dnsmasq[25557]: query[A] safebrowsing.googleapis.com from 192.168.0.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 1.0.0.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 1.1.1.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 4.2.2.2
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 4.2.2.1
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 8.8.4.4
Jun 28 04:24:42 dnsmasq[25557]: forwarded safebrowsing.googleapis.com to 8.8.8.8
Jun 28 04:24:42 dnsmasq[25557]: query[A] x8XlnitH.multi.surbl.org from 192.168.0.46
Jun 28 04:24:42 dnsmasq[25557]: forwarded x8XlnitH.multi.surbl.org to 4.2.2.2

Here's my debug token: https://tricorder.pi-hole.net/1f2w6hi71y

I should mention that things resumed automatically (the last time, I happened to be at the keyboard and restarted manually).

Checking various logs, I see that there was a cron job that started at 04:18:01:

Jun 28 04:17:01 pi1 CRON[21035]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 28 04:18:01 pi1 CRON[21051]: (root) CMD (   PATH="$PATH:/usr/local/bin/" pihole updateGravity >/var/log/pihole_updateGravity.log || cat /var/log/pihole_updateGravity.log)
Jun 28 04:20:01 pi1 CRON[21244]: (root) CMD (   PATH="$PATH:/usr/local/bin/" pihole updatechecker local)
Jun 28 04:30:01 pi1 CRON[21308]: (root) CMD (   PATH="$PATH:/usr/local/bin/" pihole updatechecker local)
Jun 28 04:39:01 pi1 CRON[21360]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)

It might be a coincidence, but it looks like that's the only time that particular job has run all week. I went back through the older syslogs but I don't see it at all, so the previous run probably happened more than 7 days ago (which I believe is the case). Could this be a smoking gun?

Any harm in trying to run that manually to see if that reproduces the issue?

The gravity update runs only once a week, on Sunday morning at a random time between 03:01 and 04:58. It fetches your adlists from the internet and builds a new gravity.db database. At the end, the old database is swapped with the new one. DNS resolution (and blocking) should be preserved during this time. As you have "only" domains_being_blocked 94541, this whole update-and-swap should not take long at all.

You can manually run pihole -g and see what happens.
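If you want to double-check when your install has it scheduled, the entry should be in Pi-hole's cron file (standard install path assumed):

# show the weekly gravity update entry
grep -i gravity /etc/cron.d/pihole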

It looks like the gravity update is in fact related. I ran it once, and it zipped right through. I noticed that there was a presumably outdated server list (https://hosts-file.net/ad_servers.txt), so I removed it and re-ran. This time, this happened (and it stopped resolving, as tested from another system):

  [✗] DNS resolution is currently unavailable
  [i] Time until retry: 87

That counted down to zero, then this:

pi@pi1:/var/log $ pihole -g
  [✗] DNS resolution is currently unavailable
  [✗] DNS resolution is not available

I ran pihole -d while it was happening but it was unable to upload the output, presumably because it couldn't resolve the host address. I've saved a copy locally, and if someone tells me how to upload it securely I'll do that.

I've used pihole restartdns to recover. I'll look at the performance tips mentioned and see if that helps.
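In case it's useful to others, a quick way to test resolution from the Pi itself is a plain dig against the local resolver:

dig +short pi.hole @127.0.0.1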

Can you provide the output of

/var/log/pihole-FTL.log

Is there a crash? If so, the lines just above the crash are of particular interest.

I don't believe there was a crash. I've uploaded the contents of pihole-FTL.log to http://ix.io/2qr4.
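For reference, ix.io accepts piped uploads via its form interface; if I recall its front-page usage correctly, the invocation was something like:

curl -F 'f:1=<-' http://ix.io < /var/log/pihole-FTL.log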

I think the interesting line is
[2020-06-28 04:18:12.807 25557] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)

Pinging @DL6ER

No, this is expected and safe. It happens when gravity is updated using pihole -g (either manually or via the weekly cron job).

I saw that and was wondering if it was related. @jfb any ideas?

Extending on what I said before:

[2020-06-28 04:18:12.804 25557] Reloading DNS cache
[2020-06-28 04:18:12.804 25557] Blocking status is enabled
[2020-06-28 04:18:12.807 25557] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-06-28 04:21:27.776 25557] INFO: No regex blacklist entries found
[2020-06-28 04:21:27.776 25557] INFO: No regex whitelist entries found
[2020-06-28 04:21:27.779 25557] Compiled 0 whitelist and 0 blacklist regex filters in 4.7 msec

I see now that there is a delay of 3 minutes between the database being deleted and FTL being instructed to re-open the database. This should not happen.

Your log contains a second instance here:

[2020-06-28 15:54:25.132 589] Reloading DNS cache
[2020-06-28 15:54:25.133 589] Blocking status is enabled
[2020-06-28 15:54:25.133 589] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-06-28 15:54:25.184 589] INFO: No regex blacklist entries found
[2020-06-28 15:54:25.185 589] INFO: No regex whitelist entries found
[2020-06-28 15:54:25.190 589] Compiled 0 whitelist and 0 blacklist regex filters in 6.5 msec

and a third here:

[2020-06-28 16:08:03.932 892] Reloading DNS cache
[2020-06-28 16:08:03.932 892] Blocking status is enabled
[2020-06-28 16:08:03.933 892] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-06-28 16:08:04.221 892] INFO: No regex blacklist entries found
[2020-06-28 16:08:04.222 892] INFO: No regex whitelist entries found
[2020-06-28 16:08:04.226 892] Compiled 0 whitelist and 0 blacklist regex filters in 4.8 msec

In the second case the delay was only about 50 milliseconds, and in the third case about a third of a second. I do now agree that something strange is going on here.
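If anyone wants to check their own log for this gap, filtering for the two relevant messages makes the timestamps easy to compare (default log path assumed):

grep -E "Reloading DNS cache|file unlinked while open" /var/log/pihole-FTL.log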