Unbound frequent restarts

I ran:

ps aux | grep 6620
pi       14927  0.0  0.0   7348   528 pts/0    S+   08:30   0:00 grep --color=auto 6620
pi       15325  0.0  0.0   7348   540 pts/0    S+   08:34   0:00 grep --color=auto 6671
pi       15442  0.0  0.0   7348   576 pts/0    S+   08:35   0:00 grep --color=auto 13881
pi       15257  0.0  0.0   7348   556 pts/0    S+   08:33   0:00 grep --color=auto 13932

It doesn't seem to be grabbing the correct column though, right? Seems like the number after pi would be the pid

When running top I can't see it there either

edit: I think this is referencing the command that I just ran ^

There's more than one SIGHUP in your log.
I've edited my previous post once I noticed that.
(You may have to hit refresh in your browser to see those recent changes.)

Yes, use ps -p <pid> instead.

1 Like

It looks like the process stopped:

ps -p 13932
  PID TTY          TIME CMD

I have another pihole instance on my network. Should I also run strace on that?

Bugger.
Is that true for all four of them?

You could try to monitor top output, but you'd have to be lucky to spot that process without prior knowledge of the pid.

1 Like

Yeah it is true for all of them. I am trying to figure out a way to batch process names into a text file. I will post more info if I am able to capture what process this is

Maybe the journals still have that PID logged somewhere:

journalctl | less -p 13932

1 Like

Pattern not found

Interestingly enough. Haven't had any issues since reporting last. (knock on wood).

I am sure it will happen. Just not sure when.

If it does I will try this command with the relevant PID

You may find some information in the system log files /var/log/messages , /var/log/syslog . Some processes print messages to the system log when it starts.

1 Like

Awesome, I will check that too. Thank you!

Did you try to increase the Unbound verbosity, to get some more info upon unexpected service restart?

sed -i 's/verbosity: 0/verbosity: 2/' /etc/unbound/unbound.conf.d/pi-hole.conf
systemctl restart unbound

And about dhcpcd IPv6 expiry: You said you tried to set a static IPv6 via pfsense, but stat means you reserved an IP, right? In this case, a router advertisement (DHCP in case of IPv4) is still required. Did you think about setting a real static IP via /etc/dhcpcd.conf, like below?

interface eth0
    static ipaddress=192.168.0.2
    static ip6address=xxxx:yyyy:etc...
    static routers=192.168.0.1
    static domain_name_servers=1.1.1.1

EDIT: All logs from /var/log/messages and /var/log/syslog are btw coming from journalctl, i.e. systemd-journald sends them to rsyslog, which filters, sorts and writes them to the known /var/log/ files. So likely you won't find anything new there. Only after reboot or crash, the journal will be empty, as it is stored in a tmpfs (/run/log/journal/) by default, while the /var/log directory is usually preserved on disk.

Increased the logging verbosity as per your advice.

In terms of the static IP. I set both a static IPV4 and IPV6 address on pfsense, but I didn't change it on the actual device. The IPV4 address is correct in /etc/dhcpcd.conf, but should I also hard code the static IPV6 address as well?

If anything (whether it's a LAN client or a service on the Pi-hole system itself) relies on that IP and the router for whatever reason sends RAs in longer intervals than its TTL, then yes. I guess it doesn't matter and you could also disable this ULA address advertising in the router completely, so that LAN hostnames are assured to resolve to IPv4 addresses. If IPv6 connections are really wanted within the LAN, then the SLAAC addresses can be used, which are anyway always static and predictable (derived from only the MAC of related adapter).

Sorry, I am not very familiar with this kind of stuff. A few clarifying questions:

  1. If I add the IPV6 address in dhcpcd.conf, which one would I add? There are a few listed when running ifconfig. Would I just copy the static IP that I set for the device in pfsense and paste it in dhcpcd.conf?

  2. In terms of disabling ULA address advertising on the router. Is that akin to changing these settings in the RA section of my router? Or do you mean disabling ipv6 entirely

Screen Shot 2021-05-28 at 5.01.30 PM

Sorry for all the questions. When it comes to IPV6 I really am lost.

As I have been checking / resolving problems I noticed in my logs, I also noticed the unexpected unbound stop messages in my logs, and have been looking into this.

Found and solved one of the unexpected stops, see here, however there are more..

After looking at several logs, I managed to match the remaining unbound stop times to a syslog entry:

May 14 06:15:26 raspberrypi dhcpcd[562]: eth0: part of Router Advertisement expired

which matches the stop entry in the unbound log

May 14 06:15:26 unbound[790:0] info: service stopped (unbound 1.13.1).

I opened a issue here on the netgate forum, but so far, nothing useful has come out of this.

details:

  • pfsense hands out IPv4 DHCP addresses
  • pihole (raspberry pi) doesn't use a static IPv4 address, but gets a DHCP assigned address that never changes (MAC based DHCP address assignment). This makes it easier for me to swap SD cards between my production and test pi's.
  • the IPv4 DHCP renewals (pfsense DHCP log) don't match the unbound stop (unbound log on the raspberry pi), so it's not the IPv4 DHCP server that causes this.
  • IPv6 addresses GUA are distributed in the network, using pfsense 'track interface'.

on the WAN interface:

on the LAN interface

  • The IPv6 addresses on the devices never change, unless the ISP kicks the modem, in which case the WAN IPv6 changes, thus all tracked addresses (if the prefix changes).
  • The IPv6 addresses on the devices also never change, unless I move the SD card to another raspberry pi (OR the WAN address changes - new prefix)
  • On the raspberry pi, I've configured temporary IPv6 addresses, changes every 2 hours, the address changes don't match the unbound stops.
  • There are no noticeable related events on the pfsense box @ the time unbound stops.

I'm currently assuming this is 'normal', nothing you can do about it, I must admit I haven't gone trough the strace process yet, this to try to identify the process that sends the SIGHUP.

  • running unbound with verbosity 1 here all the time, no additional information found yet.
	# level 0 means no verbosity, only errors.
	# level 1 gives operational information.
	# level 2 gives detailed operational  information.
	# level 3 gives query level information, output per query.
	# level 4 gives algorithm  level  information.
	# level 5 logs client identification for cache misses.
1 Like

The one set in pfsense for that device was my idea. But to be true, I'm not sure if it will then stop respecting RAs or simply have one IPv6 address more :thinking:. I'm also not 100% familiar with IPv6.

Basically there are min tree types of addresses assigned: the MAC-based autoconfigured SLAAC addresses, the ULAs (unique local addresses), advertised by the router and the GUAs (global unicast addresses), also advertised by the router, required if you want to access a device within LAN from www without NAT. And in theory there is DHCPv6, which is rarely used. For a usual home network, IPv4 is totally sufficient and an IPv6 address on each client only required for them to connect to remote IPv6 hosts. For that SLAAC is sufficient and can be used as well to connect via IPv6 to LAN clients. SLAAC is always used immediately by every network adapter/interface where IPv6 is enabled, no router or anything else needs to trigger that. So in your case, choosing "Disabled" should be fine and reduce a bit overhead and a bit more possible confusion in your home network :slight_smile:.


EDIT: What I forgot about is the IPv6 default route/gateway that relies on the RA as well, when not statically set. Not sure if "Disabled" means that the router won't send any RAs or only that is won't send ULAs and GUAs :thinking:. When curl -6 google.com suddenly fails with "No route to host", we'll know.

1 Like

If I disabled the pi from accepting ipv6 RA with net.ipv6.conf.all.disable_ipv6 = 1 in /etc/sysctl.conf will this prevent it from restarting unbound?

I quickly checked the dhcpcd source code. dhcpcd simply reloads its configuration on receipt of SIGHUP:

It does not send any signals to any non-dhcpcd processes so it is a "victim" of the same underlying issue.

Summary:
The cause seems to be a network device update (address change). The process causing it does not exist when manually calling ps so it is either short-lived or exited already. It may be (a) a terminating network process that fires this signal or (b) an intermediate process only existing for a few moments or (c) it is the new network process but the SIGHUP is fired before the process changed its PID by forking, etc.

To address

  • (a): Try setting up a small job that stores the content of ps every few minutes. This will be your memory for later checking if this PID every existed (for long enough time).
  • (b): I don't think this can be addressed easily.
  • (c): Try to find a process whose PID is the PID you are looking for +1 and +2 (look for 124 and 125 when the PID of the SIGHUP-sender is 123). This process may be unrelated but if it is always the same one, we should take a closer look here.
1 Like

quick question

My unbound always stops twice in a very short time frame, same in your environment?

cat /etc/unbound/log/unbound.log |grep stopped
May 29 02:17:56 unbound[14980:0] info: service stopped (unbound 1.13.1).
May 29 02:18:02 unbound[14980:0] info: service stopped (unbound 1.13.1).
May 29 04:22:57 unbound[14980:0] info: service stopped (unbound 1.13.1).
May 29 04:23:03 unbound[14980:0] info: service stopped (unbound 1.13.1).

@jmccamb

Just a side note on IPv6 addresses (click for details).

Not quite. :wink:

An RFC 4941 IPV6 Privacy extension address can also be considered a special application of SLAAC, and EUI64 addresses (containing a MAC) may be used just as well to provide the interface portion of an IPv6 ULA or GUA.

There are three types or ranges of IPv6 addresses (edit: well, truth is there are more, but these are the most common and also most relevant ones):
link-local (fe80::/10), ULA (fd00::/8) and GUA (2000::/3).
There are also three methods how a client can acquire those:
SLAAC as well as Stateless and Stateful DHCPv6.

A client may use SLAAC to construct IPv6 addresses for all three ranges, or it may choose any combination of those methods to acquire any set of addresses, depending on a client's configuration and availability of router support for those methods.

Link-local addresses are the only ones that can be acquired by SLAAC exclusively, but they cannot be relied upon to be predictable: A client may be configured to forego EUI48/64 addresses in favour of an RFC 7217 address. In Raspberry Pi OS, this is the default (configured by slaac private in dhcpcd.conf).

2 Likes

Maybe.
Note that some networking tools would ignore those settings, relying on their own configuration options instead - I believe NetworkManager may fall into that category.


If we cannot find this on the receiving end, let's try and capture it at the sender, using auditd.

EDIT: I've successfully tested below steps on Raspbian Buster, but couldn't get them to work on Armbian (the audit log never populated).
It may only be worth the hassle if you would run Raspberry Pi OS as well.

Since we don't know the sender, we have to capture traffic of all processes, so this may incur a performance hit for all of your system, and the respective log file may grow large quickly. And we still need strace to find the pid.

Don't forget to disable it once finished.

We start by installing auditd:

sudo apt install auditd audispd-plugins

Enable auditing:

sudo auditd -s enable

Find your system CPU architecture, e.g.:

~ $ uname -m
armv7l

Use that arch when creating a rule to capture system calls to kill -HUP:

sudo auditctl -a exit,always -F arch=armv7l -S kill -F a1=1

EDIT: In an attempt to limit audit log growth, I tried to be as restrictive as possible here, by assuming that unbound is indeed HUPed by an unknown process. I think that's a safe assumption, as the observed behaviour is documented for HUP reception in unbound. Still, if it proves too restrictive and you can't find a match, you may drop the trailing -F a1=1.

Verify that rule got created, e.g.:

~ $ sudo auditctl -l
-a always,exit -F arch=b32 -S kill -F a1=0x1

You should see an audit log at /var/log/audit/audit.log
Check that audit logging is working:

sudo auditctl -m 'verify its working'

Verify that statement made it into your audit log:

sudo grep -n 'verify' /var/log/audit/audit.log

If everything works, have your strace running also and wait for unbound to restart.

Once that happens, extract the pid of the process that HUPed unbound from your strace log as before, and use that to determine the offending process:

sudo grep -n -B1 '=29812 ' /var/log/audit/audit.log

In the above, replace 29812 with the pid from your strace (keep the leading = and trailing space).

That should give you a few lines like:

1265-type=USER_START msg=audit(1622277865.428:1119): pid=29811 uid=0 auid=1000 ses=1522 msg='op=PAM:session_open grantors=pam_permit,pam_unix acct="root" exe="/usr/bin/sudo" hostname=? addr=? terminal=/dev/pts/1 res=success'
1266:type=SYSCALL msg=audit(1622277865.448:1120): arch=40000028 syscall=37 per=800000 success=yes exit=0 a0=1f6 a1=1 a2=42adc00 a3=42adc00 items=0 ppid=29811 pid=29812 auid=1000 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=pts1 ses=1522 comm="kill" exe="/bin/kill" key=(null)

where exe= should point to the binary (at the end of line 1266 in the example above).
Let's hope that isn't just a shell /bin/kill spawned by yet another unknown process (in the example I used, it is of course such a command that I used to trigger those log lines).

Once done, don't forget to disable auditing:

sudo auditd -s disable
sudo service auditd stop
sudo systemctl disable auditd
4 Likes