[Solved] DNS resolution fails twice a day at the same time - where should I look?

josuah · July 23, 2020, 11:29am

Expected Behaviour:

Pihole + Wireguard running smoothly (has been this way for months) on a VPS running Ubuntu 18.04.

Apart from Syncthing, nothing else is installed on the server. All the packages are up to date.

Actual Behaviour:

DNS has been failing every day at 10.30AM and 10.30PM, for a week now.

Can't SSH into the server once it happens. I have to go to the VPS provider (OVH) interface, reboot the server and sudo service pihole-FTL restart to get everything back, for another 12 hours.

Debug Token:

https://tricorder.pi-hole.net/wdvhn2bfeo

Coro · July 23, 2020, 1:11pm

Just out of curiosity: Does the fail also happen when pihole-FTL is not running?

josuah · July 23, 2020, 2:06pm

How do I check it? I though pihole-FTL should be running all the time...

yubiuser · July 23, 2020, 4:34pm

How do you try to access the destination? By IP or hostname?

Please check
/var/log/pihole-FTL.log for errors and crashed.

josuah · July 23, 2020, 4:59pm

I access it by IP from the terminal (macOS).

I've shared an extract of my pihole-FTL.log below.

The first starred line is when I reboot the server.
The second one is (apparently) the result of sudo service pihole-FTL restart.

I've made it long enough, in case you'd find some useful bits in it.

[2020-07-23 02:19:38.166 3396M] Resizing "/FTL-strings" from 12288 to 16384
[2020-07-23 04:12:00.853 3396M] Resizing "/FTL-dns-cache" from 12288 to 16384
[2020-07-23 05:10:24.416 3396M] Resizing "/FTL-queries" from 262144 to 524288
[2020-07-23 05:57:06.452 3396M] WARN: Regex blacklist evaluation took 41.858 msec
[2020-07-23 06:00:10.472 3396M] WARN: Regex blacklist evaluation took 53.811 msec
[2020-07-23 06:00:55.862 3396M] WARN: Regex whitelist evaluation took 11.991 msec
[2020-07-23 06:01:28.818 3396M] WARN: Regex blacklist evaluation took 65.802 msec
[2020-07-23 06:37:43.384 3396M] WARN: Regex blacklist evaluation took 67.696 msec
[2020-07-23 06:37:43.507 3396M] WARN: Regex blacklist evaluation took 20.325 msec
[2020-07-23 07:09:47.863 3396M] WARN: Regex blacklist evaluation took 39.428 msec
[2020-07-23 07:09:49.566 3396M] Resizing "/FTL-strings" from 16384 to 20480
[2020-07-23 09:30:02.176 3396M] WARN: Regex blacklist evaluation took 30.938 msec
[2020-07-23 10:43:11.787 3396M] WARN: Regex blacklist evaluation took 14.258 msec
[2020-07-23 12:08:24.036 3396M] Reloading DNS cache
[2020-07-23 12:08:24.037 3396M] Blocking status is enabled
[2020-07-23 12:08:24.580 3396M] Compiled 2 whitelist and 25 blacklist regex filters for 5 clients in 33.6 msec
* [2020-07-23 12:08:24.639 3396M] Shutting down...
[2020-07-23 12:08:24.815 3396M] ########## FTL terminated after 4.322532e+04 s! ##########
[2020-07-23 12:11:44.072 1203M] Using log file /var/log/pihole-FTL.log
[2020-07-23 12:11:44.079 1203M] ########## FTL started! ##########
[2020-07-23 12:11:44.079 1203M] FTL branch: master
[2020-07-23 12:11:44.079 1203M] FTL version: v5.1
[2020-07-23 12:11:44.079 1203M] FTL commit: b522646
[2020-07-23 12:11:44.079 1203M] FTL date: 2020-07-15 22:26:33 +0100
[2020-07-23 12:11:44.079 1203M] FTL user: pihole
[2020-07-23 12:11:44.079 1203M] Compiled for x86_64 (compiled on CI) using gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
[2020-07-23 12:11:44.079 1203M] Starting config file parsing (/etc/pihole/pihole-FTL.conf)
[2020-07-23 12:11:44.079 1203M]    SOCKET_LISTENING: only local
[2020-07-23 12:11:44.079 1203M]    AAAA_QUERY_ANALYSIS: Show AAAA queries
[2020-07-23 12:11:44.079 1203M]    MAXDBDAYS: --- (DB disabled)
[2020-07-23 12:11:44.079 1203M]    RESOLVE_IPV6: Resolve IPv6 addresses
[2020-07-23 12:11:44.079 1203M]    RESOLVE_IPV4: Resolve IPv4 addresses
[2020-07-23 12:11:44.079 1203M]    DBINTERVAL: saving to DB file every minute
[2020-07-23 12:11:44.080 1203M]    DBFILE: Using /etc/pihole/pihole-FTL.db
[2020-07-23 12:11:44.080 1203M]    MAXLOGAGE: Importing up to 24.0 hours of log data
[2020-07-23 12:11:44.080 1203M]    PRIVACYLEVEL: Set to 0
[2020-07-23 12:11:44.080 1203M]    IGNORE_LOCALHOST: Show queries from localhost
[2020-07-23 12:11:44.080 1203M]    BLOCKINGMODE: Null IPs for blocked domains
[2020-07-23 12:11:44.080 1203M]    ANALYZE_ONLY_A_AND_AAAA: Disabled. Analyzing all queries
[2020-07-23 12:11:44.080 1203M]    DBIMPORT: Importing history from database
[2020-07-23 12:11:44.080 1203M]    PIDFILE: Using /run/pihole-FTL.pid
[2020-07-23 12:11:44.080 1203M]    SOCKETFILE: Using /run/pihole/FTL.sock
[2020-07-23 12:11:44.080 1203M]    SETUPVARSFILE: Using /etc/pihole/setupVars.conf
[2020-07-23 12:11:44.080 1203M]    MACVENDORDB: Using /etc/pihole/macvendor.db
[2020-07-23 12:11:44.080 1203M]    GRAVITYDB: Using /etc/pihole/gravity.db
[2020-07-23 12:11:44.080 1203M]    PARSE_ARP_CACHE: Active
[2020-07-23 12:11:44.080 1203M]    CNAME_DEEP_INSPECT: Active
[2020-07-23 12:11:44.080 1203M]    DELAY_STARTUP: No delay requested.
[2020-07-23 12:11:44.080 1203M]    NICE: Set process niceness to -10 (default)
[2020-07-23 12:11:44.080 1203M]    BLOCK_ESNI: Enabled, blocking _esni.{blocked domain}
[2020-07-23 12:11:44.080 1203M]    NAMES_FROM_NETDB: Enabled, trying to get names from network database
[2020-07-23 12:11:44.080 1203M] Finished config file parsing
[2020-07-23 12:11:45.220 1203M] Database version is 6
[2020-07-23 12:11:45.220 1203M] Not using the long-term database for storing queries
[2020-07-23 12:11:45.220 1203M]  -> Total DNS queries: 0
[2020-07-23 12:11:45.220 1203M]  -> Cached DNS queries: 0
[2020-07-23 12:11:45.220 1203M]  -> Forwarded DNS queries: 0
[2020-07-23 12:11:45.220 1203M]  -> Blocked DNS queries: 0
[2020-07-23 12:11:45.220 1203M]  -> Unknown DNS queries: 0
[2020-07-23 12:11:45.220 1203M]  -> Unique domains: 0
[2020-07-23 12:11:45.220 1203M]  -> Unique clients: 0
[2020-07-23 12:11:45.220 1203M]  -> Known forward destinations: 0
[2020-07-23 12:11:45.220 1203M] Successfully accessed setupVars.conf
[2020-07-23 12:12:29.290 1713M] Using log file /var/log/pihole-FTL.log
* [2020-07-23 12:12:29.290 1713M] ########## FTL started! ##########
[2020-07-23 12:12:29.290 1713M] FTL branch: master
[2020-07-23 12:12:29.290 1713M] FTL version: v5.1
[2020-07-23 12:12:29.290 1713M] FTL commit: b522646
[2020-07-23 12:12:29.290 1713M] FTL date: 2020-07-15 22:26:33 +0100
[2020-07-23 12:12:29.290 1713M] FTL user: pihole
[2020-07-23 12:12:29.290 1713M] Compiled for x86_64 (compiled on CI) using gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
[2020-07-23 12:12:29.290 1713M] Starting config file parsing (/etc/pihole/pihole-FTL.conf)
[2020-07-23 12:12:29.290 1713M]    SOCKET_LISTENING: only local
[2020-07-23 12:12:29.290 1713M]    AAAA_QUERY_ANALYSIS: Show AAAA queries
[2020-07-23 12:12:29.290 1713M]    MAXDBDAYS: --- (DB disabled)
[2020-07-23 12:12:29.290 1713M]    RESOLVE_IPV6: Resolve IPv6 addresses
[2020-07-23 12:12:29.290 1713M]    RESOLVE_IPV4: Resolve IPv4 addresses
[2020-07-23 12:12:29.290 1713M]    DBINTERVAL: saving to DB file every minute
[2020-07-23 12:12:29.290 1713M]    DBFILE: Using /etc/pihole/pihole-FTL.db
[2020-07-23 12:12:29.290 1713M]    MAXLOGAGE: Importing up to 24.0 hours of log data
[2020-07-23 12:12:29.290 1713M]    PRIVACYLEVEL: Set to 0
[2020-07-23 12:12:29.290 1713M]    IGNORE_LOCALHOST: Show queries from localhost
[2020-07-23 12:12:29.290 1713M]    BLOCKINGMODE: Null IPs for blocked domains
[2020-07-23 12:12:29.290 1713M]    ANALYZE_ONLY_A_AND_AAAA: Disabled. Analyzing all queries
[2020-07-23 12:12:29.290 1713M]    DBIMPORT: Importing history from database
[2020-07-23 12:12:29.290 1713M]    PIDFILE: Using /run/pihole-FTL.pid
[2020-07-23 12:12:29.290 1713M]    SOCKETFILE: Using /run/pihole/FTL.sock
[2020-07-23 12:12:29.290 1713M]    SETUPVARSFILE: Using /etc/pihole/setupVars.conf
[2020-07-23 12:12:29.290 1713M]    MACVENDORDB: Using /etc/pihole/macvendor.db
[2020-07-23 12:12:29.290 1713M]    GRAVITYDB: Using /etc/pihole/gravity.db
[2020-07-23 12:12:29.291 1713M]    PARSE_ARP_CACHE: Active
[2020-07-23 12:12:29.291 1713M]    CNAME_DEEP_INSPECT: Active
[2020-07-23 12:12:29.291 1713M]    DELAY_STARTUP: No delay requested.
[2020-07-23 12:12:29.291 1713M]    NICE: Set process niceness to -10 (default)
[2020-07-23 12:12:29.291 1713M]    BLOCK_ESNI: Enabled, blocking _esni.{blocked domain}
[2020-07-23 12:12:29.291 1713M]    NAMES_FROM_NETDB: Enabled, trying to get names from network database
[2020-07-23 12:12:29.291 1713M] Finished config file parsing
[2020-07-23 12:12:29.292 1713M] Database version is 6
[2020-07-23 12:12:29.292 1713M] Not using the long-term database for storing queries
[2020-07-23 12:12:29.292 1713M]  -> Total DNS queries: 0
[2020-07-23 12:12:29.292 1713M]  -> Cached DNS queries: 0
[2020-07-23 12:12:29.292 1713M]  -> Forwarded DNS queries: 0
[2020-07-23 12:12:29.292 1713M]  -> Blocked DNS queries: 0
[2020-07-23 12:12:29.292 1713M]  -> Unknown DNS queries: 0
[2020-07-23 12:12:29.292 1713M]  -> Unique domains: 0
[2020-07-23 12:12:29.292 1713M]  -> Unique clients: 0
[2020-07-23 12:12:29.292 1713M]  -> Known forward destinations: 0
[2020-07-23 12:12:29.292 1713M] Successfully accessed setupVars.conf
[2020-07-23 12:12:29.304 1715M] PID of FTL process: 1715
[2020-07-23 12:12:29.304 1715/T1718] Listening on Unix socket
[2020-07-23 12:12:29.304 1715/T1717] Listening on port 4711 for incoming IPv6 telnet connections
[2020-07-23 12:12:29.305 1715M] Reloading DNS cache
[2020-07-23 12:12:29.305 1715M] Blocking status is enabled
[2020-07-23 12:12:29.308 1715/T1716] Listening on port 4711 for incoming IPv4 telnet connections
[2020-07-23 12:12:29.358 1715M] Compiled 2 whitelist and 25 blacklist regex filters for 0 clients in 20.3 msec
[2020-07-23 12:17:45.279 1715M] New upstream server: 127.0.0.1 (0/128)
[2020-07-23 12:31:01.370 1715M] WARN: Regex blacklist evaluation took 15.288 msec
[2020-07-23 12:41:19.238 1715M] WARN: Regex blacklist evaluation took 12.390 msec
[2020-07-23 12:41:40.792 1715M] WARN: Regex blacklist evaluation took 34.755 msec
[2020-07-23 12:48:30.035 1715M] WARN: Regex blacklist evaluation took 16.015 msec
[2020-07-23 12:54:29.751 1715M] WARN: Regex blacklist evaluation took 10.927 msec
[2020-07-23 13:04:35.447 1715M] WARN: Regex blacklist evaluation took 24.164 msec
[2020-07-23 13:10:59.428 1715M] WARN: Regex blacklist evaluation took 17.653 msec
[2020-07-23 13:11:49.527 1715M] WARN: Regex blacklist evaluation took 44.974 msec
[2020-07-23 13:21:46.992 1715M] WARN: Regex whitelist evaluation took 14.166 msec
[2020-07-23 13:21:47.184 1715M] WARN: Regex blacklist evaluation took 20.453 msec
[2020-07-23 13:21:47.802 1715M] Resizing "/FTL-strings" from 4096 to 8192
[2020-07-23 13:22:17.576 1715M] Resizing "/FTL-dns-cache" from 4096 to 8192
[2020-07-23 13:31:50.420 1715M] WARN: Regex blacklist evaluation took 45.766 msec
[2020-07-23 13:32:45.297 1715M] WARN: Regex blacklist evaluation took 18.476 msec
[2020-07-23 13:45:01.122 1715M] WARN: Regex blacklist evaluation took 14.453 msec
[2020-07-23 13:45:01.335 1715M] WARN: Regex blacklist evaluation took 17.594 msec
[2020-07-23 13:45:08.465 1715M] WARN: Regex blacklist evaluation took 22.189 msec
[2020-07-23 13:49:11.772 1715M] Resizing "/FTL-strings" from 8192 to 12288

yubiuser · July 23, 2020, 5:39pm

That indicates a network issue rather than a direct pihole issue. Has anything changed in the last week regarding your setup?

Thanks for the log - I can't spot any obvious issues. To deactivate the long-term database was on purpose, I guess?

josuah · July 23, 2020, 5:51pm

Nothing has changed, I had not touched my setup for months, excepted updating Ubuntu packages every now and then.

In fact I first lost Internet Friday evening while I was reading about Cloudflare issues. At first I thought it was related, then I remembered that I was running only Unbound. I rebooted and relaunched pihole-FTL, then it happened Saturday morning and haven't stopped since.

And I suspected the network issue, too. But how come that it needs (1) a reboot and (2) a pihole-FTL restart to get back on its feet?

Last, the long term DB is disabled on purpose. It was a recommendation from someone from the team (either here or on Reddit, can't recall) as the size of the database was growing out of control (I have at least 15GB free on the server and couldn't understand how it could be the issue but well, it solved my issue at that time, so...).

yubiuser · July 23, 2020, 7:04pm

That's indeed strange. A reboot should be sufficient under most circumstances. Does the IP of your VPS change during the failure/reboot?

josuah · July 23, 2020, 9:23pm

No, it has been the same since I got the server.

Random thoughts:

Setting MAXDBDAYS to 0 was due to an issue almost exactly like this one, so DB size is probably out of the way.
In the log above, I think that the failure occurs on the line 'Reloading DNS cache'. Can it be a hint to something?

josuah · July 23, 2020, 11:06pm

Useless info: I've just done my evening reboot, after my evening DNS failure :((

Bucking_Horn · July 24, 2020, 12:17pm

Just for completeness:
Your debug log shows you are also running unbound and containerd, and your logs also confirm that you are using unbound as Pi-hole's upstream. I hope you intend to use these and just forgot to mention them.

You've also heavily altered Pi-hole's web server - neither dashboard nor block page show Pi-hole's default headers, and it would seem you did so in order to make it accessible via HTTPS only.

I guess this is due to you running Pi-hole in the cloud on a Virtual Private Server (as you mentioned).
I thus would assume that access to your Pi-hole is always through a Wireguard VPN tunnel (in which case you wouldn't need HTTPS on top of that, as long as your VPS blocks access to port 80 for the public). Since you cannot ssh into your VPS at all when an outage occurs for you, have you considered whether Wireguard could be failing?

I noticed that there are some SSL certificate errors in lighttpd's error log:

*** [ DIAGNOSING ]: contents of /var/log/lighttpd

-rw-r--r-- 1 www-data www-data 236488 Jul 23 12:24    2020-07-19 09:40:05: (network.c.145) SSL: no certificate/private key for TLS server name <IP redacted>  
   2020-07-19 09:40:05: (connections-glue.c.200) SSL: 1 error:1422E0EA:SSL routines:final_server_name:callback failed

I am not sure whether this is signifcant for your situation, as it occurs a bit before your 10:30 window, and the debug log shows only a few lines of that log anyway.

But as you do observe outages at a certain time repeatedly, you should try to correlate the logs of all of your applications for that time.
It may also be worth to consult resources for your additonal software for support.

A thought: Does your VPS run with a static public IP, or do you access it by public hostname that may resolve to different somewhat arbitrary IPs over time? Why and when would those IPs normally change?

josuah · July 24, 2020, 12:50pm

Hi and thank you for chiming in.

Your debug log shows you are also running unbound and containerd, and your logs also confirm that you are using unbound as Pi-hole's upstream. I hope you intend to use these and just forgot to mention them.

But yes, I'm intentionally using unbound as Pi-hole's upstream. But... containerd ? What is that ???

You've also heavily altered Pi-hole's web server - neither dashboard nor block page show Pi-hole's default headers, and it would seem you did so in order to make it accessible via HTTPS only.

I guess this is due to you running Pi-hole in the cloud on a Virtual Private Server (as you mentioned).
I thus would assume that access to your Pi-hole is always through a Wireguard VPN tunnel (in which case you wouldn't need HTTPS on top of that, as long as your VPS blocks access to port 80 for the public). Since you cannot ssh into your VPS at all when an outage occurs for you, have you considered whether Wireguard could be failing?

Yes, I wanted to access the dashboard through a subdomain of mine, and make it the most secure possible. And yes, it was through SSL and requires HTTP authentication. Maybe overkill, but I'm a self-taught guy and discovering all these things with doing, trials and errors. So I might not be exactly aware of when there's an overlap in terms of security or so.

Nice idea, I'll try to find where Wireguard logs are on the system and look into it.

Lastly, my VPS has a static IP, it has never changed and will probably not (it's part of the contract). And I access it through that IP, not a hostname.

Side note: The outage has occurred again this morning (around 45mn later than usual), and since doing my usual procedure to get it back, I've been going through /var/log/syslog to see if I can get a hint there.

yubiuser · July 24, 2020, 3:37pm

I thought you might have triggered the reload manually.... This could be a hint. There are many debug flags you could enable, but as we don't know what to look for, enabling all debug output might be overkill and fill up your logs extremely fast...

josuah · July 26, 2020, 11:59pm

Hi,

I've spent my Sunday tracking the issue, going from Pihole to Wireguard and back: the system journal shows that dnsmasq fails to start on boot. IIRC the equivalent is pihole-FTL, that's probably why almost everything runs fine again once it restarts.

So now I have to figure out:

why dnsmasq suddenly doesn't start on boot anymore, after running fine for months
why dnsmasq fails every 12 hours.

Any idea is still welcome. Thanks in advance.

jfb · July 27, 2020, 12:00am

There should be no dnsmasq process running on the OS; only pihole-FTL should run on port 53.

josuah · July 27, 2020, 12:25am

Yes, I'm aware of that. My VPS run only Ubuntu 18.04 + Pihole + Wireguard + Syncthing. Nothing else has been installed on / modified in the system, I'm don't have enough knowledge for that.

But here's an extract of the logs.

yubiuser · July 27, 2020, 7:26am

I think the conflict between dnsmasq and pihole could be the reason here. Try uninstalling dnsmasq and see if it helps.

josuah · July 27, 2020, 9:40am

The thing is, I had this issue on a previous attempt to install Pihole + Wireguard, so when reinstalling the whole VPS, I made sure I didn't install dnsmasq. After your reply and to my surprise, apt list shows me

dnsmasq/bionic 2.79-1 all
dnsmasq-base/bionic,now 2.79-1 amd64 [residual-config]
dnsmasq-base-lua/bionic 2.79-1 amd64
dnsmasq-utils/bionic 2.79-1 amd64

How did all this appear? Could it be installed as a dependency of something else during a routine update upgrade?

Anyway, should an apt remove dnsmasq be enough? Or should I add autoremove and/or purge?

Side note: I'm still bothered by how it was working and suddenly doesn't work anymore, out of the blue.

Thank you for your assistance.

Bucking_Horn · July 27, 2020, 9:56am

Maybe related to some sort of automated OS updates for your VPS server?
Try to find out about your VPS provider's update policy for your OS, and how that would relate to dnsmasq explicitly.

josuah · July 27, 2020, 10:02am

Yeah I thought of it, but this is not the case. OS updates are up to the user, and the next one (20.04) will probably be up in a few weeks, when 20.04.1 is out.