Pi-hole Intermittently Crashing/Non-responsive

TheMuffinMan · December 11, 2024, 4:03pm

Expected Behaviour:

A TP-Link Deco X55 mesh system is running as modem + DHCP server, with a single Pi-hole as its DNS server. This configuration functions as expected 95% of the time. The Deco system has a scheduled reboot daily at 3AM.

Raspberry Pi 5/8GB

Pi-hole [v5.18.3]
FTL [v5.25.2]
Web Interface [v5.21]

The Pi is also running Unbound.

Actual Behavior:

I have had several instances (one this morning) where the Pi becomes unresponsive (web portal down, no DNS, etc.), and I haven't figured out a time, schedule, or pattern. It can take several reboots to recover. As part of that recovery I typically have to remove it as the DNS server on the DHCP scope - I suspect it's unable to recover while getting hammered with requests. This morning's failure appears to have coincided with the 3AM reboot, when I see the number of DNS requests shoot through the roof. The Pi also failed last night around 8PM.

Debug Token:

https://tricorder.pi-hole.net/XUpFdP9l/

TheMuffinMan · December 12, 2024, 3:07pm

As part of troubleshooting I've disabled the use of Unbound and also noted that I could be hitting the rate limit for a 60s period during morning reboots for whatever reason.

I increased the rate limit from 1000 to 2000 as well.

(We had a crash this morning too)
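
To see whether a single client really exceeds the limit during those reboots, the query log can be bucketed per client and per minute. The following is a minimal sketch, not a Pi-hole feature: `busiest_minutes` is a hypothetical helper name, it assumes the usual dnsmasq line format (`query[TYPE] name from client`) in /var/log/pihole/pihole.log, and calendar-minute buckets only approximate the 60s interval that the rate limiter uses.

```shell
# Print "count month day HH:MM client" for every client that sent at least
# <threshold> queries within one calendar minute. Reads pihole.log on stdin.
# busiest_minutes is a hypothetical name; threshold defaults to 1000.
busiest_minutes() {
    awk -v limit="${1:-1000}" '
        # Field 5 is "query[A]", "query[AAAA]", ... on query lines only.
        $5 ~ /^query\[/ {
            minute = $1 " " $2 " " substr($3, 1, 5)   # e.g. "Dec 12 17:08"
            count[minute " " $NF]++                   # $NF is the client IP
        }
        END {
            for (key in count)
                if (count[key] >= limit) print count[key], key
        }'
}

# Example (reading the log may require sudo):
#   busiest_minutes 1000 < /var/log/pihole/pihole.log
```

If nothing is printed at the configured limit, lowering the threshold shows how close the busiest clients get to it.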

deHakkelaar · December 12, 2024, 5:09pm

Do you have Pi-hole "Conditional Forwarding" configured pointed at your router IP?
And do you also have the Pi-hole IP configured in the router WAN DNS settings?
If so, this closes a partial DNS loop that can trigger rate-limiting.

TheMuffinMan · December 12, 2024, 5:27pm

No Conditional Forwarding configured, I suppose I could add that in?

Sadly the TP-Link unit does not allow you to configure the WAN DNS - it can only be configured on the DHCP scope it's providing. There is also no way to disable DHCP services on the Deco.

deHakkelaar · December 12, 2024, 5:59pm

Yes.
Just as long as you don't close a DNS loop by configuring the Pi-hole IP in the WAN/Internet DNS settings on the router.

Don't need to (see also above).
Entering the Pi-hole IP in the LAN DHCP server DNS settings is sufficient.
Usually no other router settings need changing from factory defaults.

Could you post output for below four?
Might want to redact some bits!

nc localhost 4711 <<< '>stats >quit'

nc localhost 4711 <<< '>top-domains >quit'

nc localhost 4711 <<< '>top-clients >quit'

sudo pihole-FTL dhcp-discover

Did you check the logs in below folder around the time of the crashes?

/var/log/pihole/

The .gz archived ones can be browsed with the zless command.

deHakkelaar · December 12, 2024, 6:05pm

Oh and with Raspi's, always check for under voltage/brownouts:

EDIT: Oh2 before I forget

TheMuffinMan · December 12, 2024, 6:20pm

Yup! I'm familiar with that guide and this is how it is setup. Appreciate the link to it.

Will work on grabbing output from above.

Regarding undervoltage/brownouts, this was something that crossed my mind. While I am using an official RPi USB-C brick I do not have it plugged into a battery backup - just a surge protector. So very potentially could be an issue here.
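
One way to check for this on a Raspberry Pi is the firmware's throttling bitmask, reported by `vcgencmd get_throttled`. The sketch below is only an illustration under stated assumptions: `decode_throttled` is a hypothetical helper name, it assumes `vcgencmd` is installed (as on Raspberry Pi OS), and it decodes only the under-voltage and throttling bits (0, 2, 16 and 18), not the frequency-cap or temperature bits.

```shell
# Decode the bitmask printed by `vcgencmd get_throttled`, e.g. "throttled=0x50005".
# decode_throttled is a hypothetical helper name; the body runs in a subshell
# (parentheses) so its variables do not leak into the calling shell.
decode_throttled() (
    raw="${1#throttled=}"          # drop the "throttled=" prefix if present
    value=$(( ${raw:-0} ))         # arithmetic expansion accepts 0x... hex
    flags=""
    [ $(( value & 0x1 ))     -ne 0 ] && flags="$flags under-voltage-now"
    [ $(( value & 0x4 ))     -ne 0 ] && flags="$flags throttled-now"
    [ $(( value & 0x10000 )) -ne 0 ] && flags="$flags under-voltage-since-boot"
    [ $(( value & 0x40000 )) -ne 0 ] && flags="$flags throttled-since-boot"
    if [ -z "$flags" ]; then echo "ok"; else echo "${flags# }"; fi
)

# Only query the firmware when the tool exists (it will not on non-Pi hosts).
if command -v vcgencmd >/dev/null 2>&1; then
    decode_throttled "$(vcgencmd get_throttled)"
fi
```

A result of "ok" right after a crash and reboot does not rule out a power problem, since the "since-boot" bits are reset when the Pi restarts.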

deHakkelaar · December 12, 2024, 6:24pm

If the whole thing reboots, below lists all boots as an indication:

journalctl --list-boots

Could also check the systemd journal for the current --boot:

sudo journalctl --no-hostname --full --boot

Or two boots back (see --list-boots):

sudo journalctl --no-hostname --full --boot -2

TheMuffinMan · December 12, 2024, 6:37pm

Output from these commands - nothing fishy from what I can see.

I should also note the RPi has a static DHCP Lease in the Deco unit.

pi@raspberrypi:~ $ nc localhost 4711 <<< '>stats >quit'
domains_being_blocked 400175
dns_queries_today 216224
ads_blocked_today 13246
ads_percentage_today 6.126054
unique_domains 2800
queries_forwarded 82654
queries_cached 119449
clients_ever_seen 21
unique_clients 21
dns_queries_all_types 216224
reply_UNKNOWN 2244
reply_NODATA 30077
reply_NXDOMAIN 6988
reply_CNAME 116680
reply_IP 58774
reply_DOMAIN 12
reply_RRNAME 0
reply_SERVFAIL 12
reply_REFUSED 0
reply_NOTIMP 0
reply_OTHER 0
reply_DNSSEC 0
reply_NONE 0
reply_BLOB 1437
dns_queries_all_replies 216224
privacy_level 0
status enabled

pi@raspberrypi:~ $ nc localhost 4711 <<< '>top-domains >quit'
0 20024 ssl.gstatic.com
1 19232 a.slack-edge.com
2 19026 b.slack-edge.com
3 14217 ctest.cdn.nintendo.net
4 5526 teams.microsoft.com
5 5022 calendar.google.com
6 4886 statics.teams.cdn.office.net
7 4766 prodcsgcorp-my.sharepoint.com
8 4357 chat.google.com
9 4118 www.apple.com

nc localhost 4711 <<< '>top-clients >quit'
0 116237 192.168.0.50
1 19224 192.168.3.248
2 14725 192.168.0.58
3 14264 192.168.0.60
4 13856 192.168.0.51
5 13561 192.168.0.53
6 8190 192.168.0.61
7 5468 192.168.0.52
8 4200 192.168.0.62
9 2412 192.168.0.55

pi@raspberrypi:~ $ sudo pihole-FTL dhcp-discover
Scanning all your interfaces for DHCP servers
Timeout: 10 seconds

Received 300 bytes from eth0:192.168.1.1
Offered IP address: 192.168.1.2
Server IP address: N/A
Relay-agent IP address: N/A
BOOTP server: (empty)
BOOTP file: (empty)
DHCP options:
Message type: DHCPOFFER (2)
server-identifier: 192.168.1.1
lease-time: 7200 ( 2h )
netmask: 255.255.252.0
router: 192.168.1.1
dns-server: 192.168.1.2
--- end of options ---

DHCP packets received on interface eth0: 1

TheMuffinMan · December 12, 2024, 10:09pm

Just experienced another crash now.

RPi responding to ping however I'm unable to SSH into the device, sadly I don't have a monitor hooked up to it (going to set that up tonight).

Didn't experience any noticeable power outage or similar but will check logs here shortly.

TheMuffinMan · December 12, 2024, 10:43pm

Debug: https://tricorder.pi-hole.net/OhWxuzLK/

pihole.log quite literally stops recording until it reboots (crash noted at 17:10) -

Dec 12 17:08:06 dnsmasq[4334]: forwarded dns.msftncsi.com to 1.0.0.1
Dec 12 17:08:06 dnsmasq[4334]: reply dns.msftncsi.com is 131.107.255.255
Dec 12 17:11:44 dnsmasq[1408]: started, version pi-hole-v2.90+1 cachesize 10000
Dec 12 17:11:44 dnsmasq[1408]: compile time options: IPv6 GNU-getopt no-DBus no-UBus no-i18n IDN DHCP DHCPv6 Lua TFTP no-conntrack ipset no-nftset auth cryptohash DNSSEC loop-detect inotify dumpfile
Dec 12 17:11:44 dnsmasq[1408]: using nameserver 1.1.1.1#53

I also see this from running

journalctl -b -1

Dec 12 16:39:01 raspberrypi systemd[1]: Finished phpsessionclean.service - Clean php session files.
Dec 12 17:07:02 raspberrypi sshd[10011]: Connection reset by 192.168.0.58 port 56829 [preauth]
Dec 12 17:07:47 raspberrypi kernel: macb 1f00100000.ethernet eth0: Link is Down
Dec 12 17:07:50 raspberrypi NetworkManager[810]: [1734041270.6632] device (eth0): carrier: link connected
Dec 12 17:07:50 raspberrypi kernel: macb 1f00100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
Dec 12 17:07:51 raspberrypi kernel: macb 1f00100000.ethernet eth0: Link is Down
Dec 12 17:07:54 raspberrypi NetworkManager[810]: [1734041274.7592] device (eth0): carrier: link connected
Dec 12 17:07:54 raspberrypi kernel: macb 1f00100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
Dec 12 17:07:57 raspberrypi kernel: macb 1f00100000.ethernet eth0: Link is Down
Dec 12 17:08:00 raspberrypi NetworkManager[810]: [1734041280.9032] device (eth0): carrier: link connected
Dec 12 17:08:00 raspberrypi kernel: macb 1f00100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control off
Dec 12 17:09:01 raspberrypi CRON[10216]: pam_unix(cron:session): session opened for user root(uid=0) by (uid=0)
Dec 12 17:09:01 raspberrypi CRON[10217]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Dec 12 17:09:01 raspberrypi CRON[10216]: pam_unix(cron:session): session closed for user root
Dec 12 17:09:01 raspberrypi systemd[1]: Starting phpsessionclean.service - Clean php session files...
Dec 12 17:09:01 raspberrypi systemd[1]: phpsessionclean.service: Deactivated successfully.
Dec 12 17:09:01 raspberrypi systemd[1]: Finished phpsessionclean.service - Clean php session files.

Looking at the most recent (no -1) it simply starts at the next reboot/power on at 17:11

CallMeCurious · December 12, 2024, 11:00pm

Some thoughts:

The fact that you can ping it but not ssh indicates you don't have a Pi-hole issue, as these have nothing to do with DNS. A poor or underrated power source could cause this. For the Pi 5, are you using the official power brick or something else?

What are you using for a boot device? SD card, USB, NVMe? An SD card going bad could be a cause or, if you're using an NVMe HAT on the Pi 5, it might not play nice with PCIe 3.0 etc.

It might be good to know what OS / version you're running on the Pi as well. What's the output of uname -a and cat /etc/os-release?

deHakkelaar · December 12, 2024, 11:40pm

Yeah I also suspect some HW failure like an inadequate power supply.
Or poor USB cable for power.
Or maybe a faulty ethernet cable/switch port.
The eth0 link shouldn't go down three times over a period of 10 seconds.
Or were you fiddling with the connection at that time?

Did below return any when searching the whole journal?

TheMuffinMan · December 13, 2024, 3:16am

Yep, it's looking more like a hardware/power issue.

It's the official power brick, but as noted earlier I do not have it on a UPS or anything. I've ordered one, to be delivered tomorrow, and I'll see whether it improves things.

It's an SD Card - I think SanDisk Extreme C10, U3, V30 etc. Need to validate

edit Ethernet is a Monoprice Cat6A with no physical damage but you never know...

The Deco unit has 3 ethernet ports on it that show as fine in settings but also potentially an issue - maybe power is fluctuating on both of the units (they're on the same surge strip)

TheMuffinMan · December 13, 2024, 2:30pm

Surprisingly no entries here.

Also interesting is that I set up a cheap monitor on the RPi, I'm able to sign into it locally, open Firefox, log in to Pi-Hole, etc.

For some reason I cannot remotely access the RPi (PiHole interface, SSH, etc) from my Desktop1 and Laptop2. I am successfully able to access it from Laptop1.

TheMuffinMan · December 13, 2024, 9:08pm

Doing an update -

I've completely reimaged the SD Card (Samsung Pro btw) and done a clean install of PiHole + Unbound and super verified all configurations without any question.

I've also plugged both the router and RPi into a battery UPS to rule out any wonky power fluctuations on the system. Just configured the Deco to use the PiHole a minute ago so we will see how everything behaves now.

In the event I still have issues the ethernet cable is going to be next on the list for swapping out.

Appreciate all the help guys.

CallMeCurious · December 13, 2024, 10:46pm

Have you ever been able to connect to the Pi from those machines? What error messages are you getting when you try to connect? This could be a problem with keys or even an ssh client not being installed on the machine etc. Additional info would be helpful.

TheMuffinMan · December 15, 2024, 4:26pm

So specifically no error at all - just refused to connect.

As of this morning there have been no crashes like the ones seen before - I'm going to chalk it up to a probable power issue that putting the UPS in place has solved.

I also improved the configuration with the Conditional Forwarding that @deHakkelaar suggested.

deHakkelaar · December 15, 2024, 4:51pm

I would keep an eye on whether the link goes down more frequently.
The one below greps for it and shows results --since yesterday:

sudo journalctl --no-hostname --full --pager-end --since yesterday --grep 'Link is Down'
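
To turn that output into a number that is easy to compare from day to day, the matches can be tallied per interface. This is a rough sketch, assuming the kernel message format seen earlier in this topic ("... eth0: Link is Down"); `count_link_drops` is a hypothetical helper name, not an existing tool.

```shell
# Count "Link is Down" messages per network interface in journal text on stdin.
# count_link_drops is a hypothetical name; the interface is taken to be the
# word right before "Link", with its trailing colon removed.
count_link_drops() {
    awk '/Link is Down/ {
        for (i = 2; i <= NF; i++)
            if ($i == "Link") {
                ifc = $(i - 1)
                sub(/:$/, "", ifc)
                drops[ifc]++
                break
            }
    }
    END { for (ifc in drops) print ifc, drops[ifc] }'
}

# Example:
#   sudo journalctl --no-hostname --since yesterday --grep 'Link is Down' | count_link_drops
```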

EDIT: Below one shows if a link is currently down.
Note the NO-CARRIER flag for eth0, which is not physically connected in the example below:

$ ip -br l
lo               UNKNOWN        00:00:00:00:00:00 <LOOPBACK,UP,LOWER_UP>
eth0             DOWN           b8:27:eb:xx:xx:xx <NO-CARRIER,BROADCAST,MULTICAST,UP>
wlan0            UP             e8:94:f6:xx:xx:xx <BROADCAST,MULTICAST,UP,LOWER_UP>

deHakkelaar · December 15, 2024, 5:09pm

Does the one below resolve to a name if you replace <CLIENT_IP> with the IP of a DHCP client that's currently connected?

dig +short @192.168.1.1 -x <CLIENT_IP>

Eg:

$ dig +short @10.0.0.2 -x 10.0.0.11
hakpc.home.dehakkelaar.nl.
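
For context, `-x` is shorthand for a PTR query on the reversed-octet name under in-addr.arpa. A small sketch of that mapping for IPv4, with `reverse_name` as a hypothetical helper name used only for illustration:

```shell
# Build the in-addr.arpa name that `dig -x` queries for an IPv4 address.
# reverse_name is a hypothetical name; the body runs in a subshell
# (parentheses) so changing IFS does not affect the calling shell.
reverse_name() (
    IFS=.
    set -- $1                      # split "a.b.c.d" into $1 $2 $3 $4
    printf '%s.%s.%s.%s.in-addr.arpa.\n' "$4" "$3" "$2" "$1"
)

# These two lookups ask for the same record:
#   dig +short @10.0.0.2 -x 10.0.0.11
#   dig +short @10.0.0.2 PTR "$(reverse_name 10.0.0.11)"
```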

If so, you'll profit from that CF setting because names instead of IPs will be shown in the web GUI.
Also, you can now resolve by name instead of connecting via IP.
Eg on a client of mine:

$ dig hakpc
[..]
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 39282
[..]
;; ANSWER SECTION:
hakpc.                 2       IN      A       10.0.0.11

;; Query time: 3 msec
;; SERVER: 10.0.0.2#53(10.0.0.2) (UDP)
;; WHEN: Sun Dec 15 18:04:50 CET 2024
;; MSG SIZE  rcvd: 51

EDIT: Windows client:

C:\>nslookup hakpc
Server:  pi.hole
Address:  10.0.0.2

Name:    hakpc.home.dehakkelaar.nl
Address:  10.0.0.11