Thank you, pi-hole team, to allow me to write my final findings regarding this.
TLDR; read, unbound users only, bottom of the post, Why is this something to look into? Users hardly notice the restarts...
I'm happy to report that I haven't had a single 'unbound stopped' message anymore, since I commented out
#unbound_conf=/var/cache/unbound/resolvconf_resolvers.conf
in /etc/resolvconf.conf
, thus solving the problem.
The following are observations made on a raspberry pi 3B, Raspberry Pi OS Lite, Release date: May 7th 2021, running unbound 1.13.1 (compiled version). jmccamb however, uses unbound version 1.9.0, see here, and has the same problem. You may need to verify / change paths to get the information you require, this to verify if you are affected.
You need to ignore the time/date, mentioned in the example messages I'm adding, I'm reusing screenshots / messages from previous posts in this topic. All the events happen within 1 or 2 consecutive seconds.
The problem was first noticed when I started checking my unbound logs. I noticed the following message, several times a day:
May 14 06:15:26 unbound[790:0] info: service stopped (unbound 1.13.1).
This was unexpected, so I started checking other logs and found a matching (time) message in syslog:
06:15:26 raspberrypi dhcpcd[562]: eth0: part of Router Advertisement expired
Queries, regarding this on Netgate (I have a pfsense router/firewall) and the unbound user discussion list, didn't return a solution. Luckily, jmccamb asked the question here, a lot of feedback from various users eventually solved the problem.
Are you, an unbound user, affected?
The remainder of this topic contains instructions for users that are reasonably familiar with linux. Don't do this, if you're not!
There are two methods you can use to check if you are affected, and need to implement one of the suggested solutions.
Method 1 (recommended):
if you have unbound logging enabled (not he default in the guide), you need to up the verbosity to 1 (which means level 1 gives operational information.
).
Once unbound has been running (you need to restart unbound to activate the setting) for a day or so, run the command (notice / modify the path):
cat /var/log/unbound/unbound.log |grep stopped
remember, if you intentionally restarted unbound, this message will occur at least once, you need to look for stop messages, other than user initiated restarts. If you find unexplainable (not user initiated) stop messages, you are affected.
If you're going to keep these unbound log settings, I strongly recommend to look into / setting up logrotation for this file! If you don't want logging to be permanently enabled, or don't know how to setup logrotation, remove the log settings from you're unbound.conf (and restart), as soon as you've determined you're (not) affected.
Method 2 (for inexperienced users, I strongly recommend to use method 1 - the unbound log):
Bucking_Horn suggested a method to detect signals, sent to unbound:
- install screen (sudo apt-get install screen)
- open a new screen (screen)
- start the trace:
sudo strace -tt -ffo /tmp/trace-unbound -e trace=%signal -p "`pidof unbound`"
- detach from the screen (press CTRL-AD)
The results are logged in /tmp. You'll find a number of files 'trace-unbound', you need to check the file that matches the unbound pid.
- run
pidof unbound
, look at /tmp/trace-unbound.xxx, where xxx matches the result of pidof unbound.
- check for SIGHUP, if there is a match, you probably are affected.
To return to the detached screen, run screen -r
, to stop the trace, press CTRL-C
, to exit (close the screen permanently, just type exit
(in the active screen)
You may or may not want to verify our conclusion (the culprit is /usr/sbin/resolvconf
)
jmccamb found:
141-type=PROCTITLE msg=audit(1622304458.990:134): proctitle=2F62696E2F7368002F7573722F7362696E2F7265736F6C76636F6E66002D6100657468302E7261
142:type=SYSCALL msg=audit(1622304459.310:135): arch=40000028 syscall=37 per=800000 success=yes exit=0 a0=67eb a1=1 a2=7fffffff a3=0 items=0 ppid=12586 pid=12626 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=4294967295 comm="resolvconf" exe="/usr/bin/dash" key=(null)
I found:
Sun 30 May 18:48:15 CEST 2021,32610,/bin/sh /usr/sbin/resolvconf -a eth0.ra
Again, there are two methods to get this information:
Method 1:
Bucking_Horn suggested using audit, explained here
Method 2:
DL6ER suggested using a script, see here
Both methods return a result, identifying /usr/sbin/resolvconf
(which resolvconf) as the culprit. You'll have to write the script yourself, we're not allowed to publish scripts here (tip: endless loop logging the unique results of ps aux
to a file)
Further analysis, a big thanks to MichaIng for his insight and effort, learned that, not only the raspbian version of resolvconf sends a SIGHUP to unbound, but also modifies a file, /var/cache/unbound/resolvconf_resolvers.conf
.
The entries in that file are unbound forward-zone
configuration entries, but the file itself is completely useless, read here, edit5 - zero effect, because unbound doesn't read configuration files from this location. Unbound uses configuration files, stored in /etc/unbound/
and /etc/unbound/unbound.conf.d/
In conclusion:
MichaIng suggested 3 possible solutions to eliminate the undesired unbound restarts, read here I used the least intrusive solution (comment out the unbound line in /etc/resolvconf.conf
, however jmccamb opted to use method 3, read here. Both methods appear to be successful.
Why is this something to look into? Users hardly notice the restarts...
- I noticed a lot of unbound errors in the unbound logs, immediately after the restart. After applying the fix (comment out the unbound entry in
/etc/resolvconf.conf
), these are all gone.
- In this topic, frequent query retries are discussed, the dnsmasq fix was applied to pihole-FTL, but the retry count remained high (for me). After applying the fix, the retry count has decreased (not zero, but significantly lower than before).
May I suggest this information (the fix) is added to the unbound user guide, this to prevent future users from running into this problem...