Upgraded two Pi-holes today. Both have failed twice since.

Background:

Two Pi 3Bs were upgraded today; the second a few hours after the first, to confirm the first was in good working order. Performed an update/upgrade, pihole -up, and reboot on each.
Prior uptimes were 132 and 155 days, with numerous pihole -up runs in between and no problems.

Expected Behaviour:

Pihole should not quit unexpectedly.

Actual Behaviour:

Two separate Pi-holes have both failed near-simultaneously on two separate occasions today, two hours apart.

Debug Token:

https://tricorder.pi-hole.net/ThDs5VJi/
https://tricorder.pi-hole.net/5qfVXAog/

Added info

[edit] - after each failure, a pihole restartdns was performed successfully.
Monitoring - https://imgur.com/a/r6Egxk0 https://imgur.com/a/IKTASH4

What is the exact mode of failure? Did Pi-hole actually stop (i.e., the pihole-FTL process was not running), or did Pi-hole stop responding to DNS queries?

Are there any errors shown in /var/log/pihole-FTL.log or /var/log/syslog? Only the head and tail of the first log are included in the debug logs.
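To tell those two failure modes apart next time, here is a minimal sketch (assuming a stock install where FTL runs as the pihole-FTL process and answers DNS on 127.0.0.1):

```shell
#!/bin/sh
# Sketch: distinguish "FTL process died" from "FTL alive but not answering".
# 1) Is the pihole-FTL process running at all?
if pgrep -x pihole-FTL >/dev/null 2>&1; then
    echo "pihole-FTL process: running"
else
    echo "pihole-FTL process: not running"
fi
# 2) If it IS running, does it still answer DNS? pi.hole is resolved
#    locally by FTL, so a timeout here while the process is alive means
#    FTL is wedged rather than dead:
#      dig +short +time=2 +tries=1 pi.hole @127.0.0.1
```

The dig line is left commented out so the snippet is safe to run anywhere; on a working Pi-hole it should return the host's own address almost instantly.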

The only abnormality is in the debug log for IP 99. SERVFAIL is not a desired reply.

*** [ DIAGNOSING ]: Pi-hole log
-rw-r--r-- 1 pihole pihole 2410061 Sep 17 16:42 /var/log/pihole.log
   -----head of pihole.log------
   Sep 17 00:00:03 dnsmasq[17531]: forwarded e.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.f.0.0.0.2.6.2.ip6.arpa to 9.9.9.9
   Sep 17 00:00:03 dnsmasq[17531]: forwarded e.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.f.0.0.0.2.6.2.ip6.arpa to 2620:fe::9
   Sep 17 00:00:03 dnsmasq[17531]: forwarded e.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.f.0.0.0.2.6.2.ip6.arpa to 2620:fe::fe
   Sep 17 00:00:03 dnsmasq[17531]: forwarded e.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.f.0.0.0.2.6.2.ip6.arpa to 149.112.112.112
   Sep 17 00:00:03 dnsmasq[17531]: reply error is SERVFAIL
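Incidentally, the reversed-nibble label in that query decodes back to an address: an ip6.arpa name is simply the 32 hex nibbles of the IPv6 address, dot-separated, in reverse order. A small sketch to decode it with standard tools:

```shell
#!/bin/sh
# Decode the ip6.arpa name from the log above. The label is the target
# address's 32 hex nibbles in reverse order.
name='e.f.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0.e.f.0.0.0.2.6.2'
# Strip the dots, reverse the nibble order, then group into 16-bit fields:
printf '%s\n' "$name" | tr -d '.' | rev | sed 's/..../&:/g; s/:$//'
```

The result, 2620:00fe:0000:0000:0000:0000:0000:00fe, is 2620:fe::fe, i.e. one of the Quad9 upstreams the query is being forwarded to, so the client is reverse-resolving the upstream resolver's own address.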

You also have a lot of interfaces and active DHCP servers, on what appear to be different VLANs. Has any of that changed?

*** [ DIAGNOSING ]: Discovering active DHCP servers (takes 10 seconds)
   Scanning all your interfaces for DHCP servers
   Timeout: 10 seconds
   
   WARN: Could not sendto() in send_dhcp_discover() (/root/project/src/dhcp-discover.c:228): Network is unreachable
   * Received 312 bytes from eth0:192.168.0.65
     Offered IP address: 192.168.0.98
     Server IP address: N/A
     Relay-agent IP address: N/A
     BOOTP server: (empty)
     BOOTP file: (empty)
     DHCP options:
      Message type: DHCPOFFER (2)
      server-identifier: 192.168.0.65
      lease-time: 86400 ( 1d )
      netmask: 255.255.255.192
      router: 192.168.0.65
      dns-server: 192.168.0.99
      dns-server: 192.168.0.98
      renewal-time: 43200 ( 12h )
      rebinding-time: 75600 ( 21h )
      domain-name: "lab.home"
      --- end of options ---
    
   * Received 312 bytes from eth0.6:192.168.6.1
     Offered IP address: 192.168.6.148
     Server IP address: N/A
     Relay-agent IP address: N/A
     BOOTP server: (empty)
     BOOTP file: (empty)
     DHCP options:
      Message type: DHCPOFFER (2)
      server-identifier: 192.168.6.1
      lease-time: 86400 ( 1d )
      netmask: 255.255.255.0
      router: 192.168.6.1
      dns-server: 192.168.6.98
      dns-server: 192.168.6.99
      renewal-time: 43200 ( 12h )
      rebinding-time: 75600 ( 21h )
      domain-name: "lab.home"
      --- end of options ---
    
   * Received 300 bytes from eth0.8:192.168.4.1
     Offered IP address: 192.168.4.17
     Server IP address: N/A
     Relay-agent IP address: N/A
     BOOTP server: (empty)
     BOOTP file: (empty)
     DHCP options:
      Message type: DHCPOFFER (2)
      server-identifier: 192.168.4.1
      lease-time: 86400 ( 1d )
      netmask: 255.255.255.0
      router: 192.168.4.1
      dns-server: 192.168.4.98
      dns-server: 192.168.4.99
      renewal-time: 43200 ( 12h )
      rebinding-time: 75600 ( 21h )
      --- end of options ---
    
   DHCP packets received on interface wlan0: 0
   DHCP packets received on interface lo: 0
   DHCP packets received on interface eth0.5: 0
   DHCP packets received on interface eth0: 1
   DHCP packets received on interface eth0.6: 1
   DHCP packets received on interface eth0.8: 1

What is the exact mode of failure?

FTL failed on one Pi-hole per the service monitoring. The other did not appear to indicate FTL being down. Neither responded to DNS queries, which led to more panicking than diagnosing (with kids in the house and the rarity of the event, I was unprepared for a dual failure).

I performed a restartdns on both piholes. I also rebooted a single pihole, not both, at that time.

So far, both continue to operate as expected.

Are there any errors shown in /var/log/pihole-FTL.log or /var/log/syslog ?

I will look through this and report back anything significant.

You also have a lot of interfaces and active DHCP servers, on what appear to be different VLANs. Has any of that changed?

No VLAN/DHCPServer changes in well over a year.

[2021-09-17 14:11:10.233 787M] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2021-09-17 14:11:10.233 787M] ---------------------------->  FTL crashed!  <----------------------------
[2021-09-17 14:11:10.233 787M] !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
[2021-09-17 14:11:10.233 787M] Please report a bug at https://github.com/pi-hole/FTL/issues
[2021-09-17 14:11:10.233 787M] and include in your report already the following details:
[2021-09-17 14:11:10.233 787M] FTL has been running for 2789 seconds
[2021-09-17 14:11:10.233 787M] FTL branch: master
[2021-09-17 14:11:10.233 787M] FTL version: v5.9
[2021-09-17 14:11:10.234 787M] FTL commit: 198e7c61
[2021-09-17 14:11:10.234 787M] FTL date: 2021-09-11 22:02:45 +0100
[2021-09-17 14:11:10.234 787M] FTL user: started as pihole, ended as pihole
[2021-09-17 14:11:10.234 787M] Compiled for armv7hf (compiled on CI) using arm-linux-gnueabihf-gcc (Debian 6.3.0-18) 6.3.0 20170516
[2021-09-17 14:11:10.234 787M] Process details: MID: 787
[2021-09-17 14:11:10.234 787M]                  PID: 787
[2021-09-17 14:11:10.234 787M]                  TID: 787
[2021-09-17 14:11:10.234 787M]                  Name: pihole-FTL
[2021-09-17 14:11:10.235 787M] Received signal: Segmentation fault
[2021-09-17 14:11:10.235 787M]      at address: 0x1002e
[2021-09-17 14:11:10.235 787M]      with code:  SEGV_MAPERR (Address not mapped to object)
[2021-09-17 14:11:10.235 787M] Backtrace:
[2021-09-17 14:11:10.236 787M] B[0000]: /usr/bin/pihole-FTL(generate_backtrace+0x2d) [0x4c20c2]
[2021-09-17 14:11:10.613 787M] L[0000]: /root/project/src/signals.c:99
[2021-09-17 14:11:10.618 787M] B[0001]: /usr/bin/pihole-FTL(+0x424aa) [0x4c24aa]
[2021-09-17 14:11:10.668 787M] L[0001]: /root/project/src/signals.c:240
[2021-09-17 14:11:10.673 787M] B[0002]: /lib/arm-linux-gnueabihf/libc.so.6(__default_rt_sa_restorer+0) [0x76cd3120]
[2021-09-17 14:11:10.673 787M] B[0003]: /usr/bin/pihole-FTL(+0x637ec) [0x4e37ec]
[2021-09-17 14:11:10.729 787M] L[0003]: /root/project/src/dnsmasq/forward.c:2346
[2021-09-17 14:11:10.735 787M] B[0004]: /usr/bin/pihole-FTL(receive_query+0x6c1) [0x4e4a8e]
[2021-09-17 14:11:10.774 787M] L[0004]: /root/project/src/dnsmasq/forward.c:1708
[2021-09-17 14:11:10.778 787M] B[0005]: /usr/bin/pihole-FTL(+0x5c1d4) [0x4dc1d4]
[2021-09-17 14:11:10.812 787M] L[0005]: /root/project/src/dnsmasq/dnsmasq.c:1846
[2021-09-17 14:11:10.816 787M] B[0006]: /usr/bin/pihole-FTL(main_dnsmasq+0xda7) [0x4dd884]
[2021-09-17 14:11:10.849 787M] L[0006]: /root/project/src/dnsmasq/dnsmasq.c:1250
[2021-09-17 14:11:10.853 787M] B[0007]: /usr/bin/pihole-FTL(main+0xad) [0x4ae63e]
[2021-09-17 14:11:10.880 787M] L[0007]: /root/project/src/main.c:108
[2021-09-17 14:11:10.883 787M] B[0008]: /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x10c) [0x76cbd718]

pi FTL log.txt (2.9 KB)
pi2 FTL log.txt (2.9 KB)

Two more logs attached to avoid spamming the thread. Interestingly enough, the FTL log on pi stops shortly before 5 PM, yet it appears to still be running and responding.

If possible, how does one revert to the last release?

I could roll one Pi back and leave the other for testing.

They both crashed again at the same time, down to the second.

Didn't quite make it six hours.

pis lasted six hours.txt (9.8 KB)

[edit] - I have reverted FTL to v5.8.1 on a single Pi 3B. Hopefully this lasts overnight and the mixed pihole/AdminLTE/FTL versions do not break anything.

That they crash the very same second is suspicious. Probably the time of a specific query, a loop maxing out, or some network broadcast like an RA. Do the query logs (and also the FTL logs) show anything suspicious (or a large number of queries) before the crash?

FTL itself crashes with a segmentation fault. To track down the backtrace I think we need @DL6ER.

This is a known bug we inherited from dnsmasq. It is currently under active investigation by the maintainer Simon Kelley. We track this here:

https://github.com/pi-hole/FTL/issues/1166#issuecomment-920968209

where some technical background is also given. If you also check out my mail to the mailing list (link included over there), you'll see that this is a very complex issue deep inside dnsmasq.

For now, you'd need to remove any custom IPv6 configuration like

server=/netflix.com/#
address=/netflix.com/::

etc. to avoid the crash from happening.
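To find where such lines live, grepping the dnsmasq config directory works (assuming the custom lines were added under /etc/dnsmasq.d/, where Pi-hole loads extra dnsmasq config from, rather than elsewhere). A self-contained sketch, demonstrated on a throwaway directory:

```shell
#!/bin/sh
# Locate custom IPv6-blocking directives of the kind mentioned above.
# Demonstrated on a temporary directory so the sketch runs anywhere;
# on a real Pi-hole, point the grep at /etc/dnsmasq.d/ instead.
tmp=$(mktemp -d)
printf 'server=/netflix.com/#\naddress=/netflix.com/::\n' > "$tmp/05-custom.conf"
# Match "server=/<domain>/#" and "address=/<domain>/::" style lines:
grep -RnE 'server=/.+/#|address=/.+/::' "$tmp"
rm -rf "$tmp"
```

Each hit is printed as file:line:directive, so commenting the offending lines out and running pihole restartdns afterwards is straightforward.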

As soon as we have a bug fix available, I'll immediately set up a custom testing branch for FTL.


@Andy_Grant Just out of pure curiosity: Why do you want to block IPv6 traffic to Netflix? Is there a specific issue you are solving with this? I've seen such config lines multiple times by now (also to Spotify, etc.) from various users but I've never asked before.


@MichaIng - I agree, quite suspicious. Nothing unusual to note in FTL logs, that's the fun part.

[edit] - after reading @DL6ER's post, I wonder if I can open Netflix and trigger the crash on demand? Will test and report back.

Oof, removing that is unlikely to be an option. My ISP does not provide IPv6 connectivity, so I have an HE tunnel, set up in 2018 or 2019. Netflix is highly allergic to VPNs; it apparently prefers IPv6, decides I am using a VPN, and blocks streams. By blocking IPv6 resolution, Netflix falls back to IPv4 and works.

Since rolling back a single pi to FTL v5.8.1, both have been running successfully for 10 hours. Should I roll back the other pi as well? Obviously usage was low overnight and Netflix was off.

Interesting... I wonder if the other users have the same issue.

Another option could be blocking them via regex instead. You should be able to specify something like

netflix;querytype=AAAA;reply=nodata

Maybe play around a bit with this. More details here:
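The domain part of that extended filter is an ordinary regex (here just the substring netflix), with querytype=AAAA restricting it to IPv6 address queries and reply=nodata returning an empty answer so clients fall back to the A record. A quick offline check of what the bare domain pattern would match, against an illustrative list of names:

```shell
#!/bin/sh
# Sanity-check the domain part of the filter against sample host names.
# The names below are illustrative; nflxvideo.net is one of Netflix's
# CDN domains that the plain "netflix" pattern would NOT catch.
printf '%s\n' www.netflix.com api-global.netflix.com nflxvideo.net |
    grep -E 'netflix'
```

If streams still prefer IPv6 via such extra domains, the pattern may need broadening; test carefully, since an over-broad regex with reply=nodata affects every matching AAAA query on the network.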

Interesting option. This was deployed so long ago that I don't think regex was available back then, or it was so new that adoption was low. I have many regexes now; will give it a go after some further testing tonight, when the Netflix crowd is awake.

Appreciate the support and suggestions.

The bug in dnsmasq is difficult to solve; hence, we decided not to suggest and submit a patch of our own, as that might have been a few hours of work and, if the maintainers wanted it done differently in the end, it would have been wasted time. We're all eagerly awaiting a fix, and I'll make a fixed FTL branch available as soon as possible.

I use it to block IPv6 traffic too, to avoid false VPN detection with the HE.net tunnel.

The regex option didn't exist when I created the custom config; that was a few years ago.

(cross-posting from the Github issue)

Hey all,

Simon pushed a fix. He decided not to do the massive refurbishment but instead addressed the issue at hand, in good faith that this was the only bug.

I pushed the patch to the branch mentioned above (FTL Crash since the last version · Issue #1166 · pi-hole/FTL · GitHub) and removed my segfault bug fix.
Please update your Pi-holes and try again with your full config.

Please run

pihole checkout ftl fix/server_may_not_be_server

and report back if the crashes are resolved with your full config.

Roger that. The branch has been changed; will monitor and report. It crashed again last night, so it won't take more than a day to know. Two days with certainty.

The first pi ran for 24 hours on the fixed branch OK. Have now applied it to the 2nd pi as well.

Can report that all is well on all pis using the server_may_not_be_server branch.

@DL6ER - thank you very much.


Just remember to switch back to master using

pihole checkout master

when we release the next version of Pi-hole that includes this fix, or you'll stay on this branch forever.
I'll try to remember to come back to this thread once we release the fix; however, you'd better not bet on my memory...