Pihole intermittently stops resolving until hardware reboot

Pseudokim · August 14, 2020, 9:05pm

Expected Behaviour:

PiHole should consistently answer DNS queries for the whole network.

O/S:

Raspbian GNU/Linux 10 (buster)

Hardware:

RasPi 3B+, ethernet connection.
Hard wired into gateway TP-Link AC1200 / VR400.

Conditions:

PiHole running as sole network-wide DHCP server and DNS sinkhole.
Router and other networking equipment set up with DHCP disabled.
Regardless of DHCP being disabled, PiHole's MAC and IP are bound on all network infrastructure (I did this in an attempt to fix the issue).
Upstream DNS is Google.

Actual Behaviour:

Everything works perfectly for a varying amount of time, anywhere from 30 minutes to more than 24 hours after the last reboot. Then, at seemingly random intervals, DNS resolution for the whole network starts failing.

While in this failed state, the web interface is sluggish, and although devices on the network receive no DNS results, the query log implies that queries are being received and served normally.

This issue can be immediately resolved by either SSH'ing in to the Pi and rebooting it, or following the steps to reboot through the PiHole's web interface (by navigating directly to its IP - pi.hole doesn't resolve in this situation).

I ran a diagnostic log while in this failed state. The log failed to upload to tricorder.pi-hole.net.

To reiterate, once in this state the only 'fix' is to reboot the hardware, at which point after loading, everything network-wide immediately works fine again and receives DNS resolution - until at some unspecified time later, it stops working. Repeat ad infinitum. I cannot identify any pattern to it entering this failed state.

Steps taken:

In an effort to resolve the issue I have so far:

Verified with date that the date and time are correct. With that in mind and having read a similar thread, I do also have an RTC on its way in the post, lest that turn out to be the root cause.
Tried flashing a known-good (working) disk image of Raspbian 10 with PiHole installed, set up for my network configuration.
Tried the above on an entirely new and different RasPi 3B+.
Run pihole -r and selected repair.
Run pihole -r and selected reconfigure.
Been through all configuration options on networking equipment with a fine-tooth comb to make sure nothing (obvious to me, at least) is configured wrongly.
Tried physically moving the RasPi to a different location and connecting it to a second router acting as a switch - again all ethernet connected back to the gateway. Same story, works fine until it stops.
LOTS of Googling and trying different things that also didn't work.

None of these steps have resolved my issue.

Theories:

Something in my Raspbian configuration is wrong.
Something in my pihole configuration is wrong.
Something in my router/network configuration is wrong.
A particular client device within my network is causing problems.

For all my googling and reading of this and other forums, I've not been able to narrow down my issue beyond these delightfully broad theories. The intermittency of the issue occurring is what's baffling me. It can be fine for an hour or fine for 12 hours or fine for a whole day but it still ends up in this failed state with seemingly no explanation.

Debug Token:

As mentioned earlier, PiHole was unable to upload the debug log to Tricorder. Please find the pasted debug log here.

Any help would be enormously appreciated!

DanSchaper · August 14, 2020, 9:09pm

How have you set the static IP address for the Pi-hole?

How is your Pi-hole install reaching the internet? You have 8.8.8.8 as an upstream server but connecting to 8.8.8.8 fails. This sounds like a routing or a firewall issue if the static IP is correct for the Pi-hole.

*** [ DIAGNOSING ]: Name resolution (IPv4) using a random blocked domain and a known ad-serving domain
[✓] stage.kochava.com is 0.0.0.0 via localhost (127.0.0.1)
[✓] stage.kochava.com is 0.0.0.0 via Pi-hole (192.168.0.2)
[✗] Failed to resolve doubleclick.com via a remote, public DNS server (8.8.8.8)

The OS seems like it may have issues as well. What does /etc/os-release have as its contents?

*** [ DIAGNOSING ]: Operating system
[✗] Distro:  Raspbian
[✗] Error: Raspbian is not a supported distro (https://docs.pi-hole.net/main/prerequisites/)

Pseudokim · August 14, 2020, 9:27pm

Static IP was set for the PiHole during the PiHole installation process:

wget -O basic-install.sh https://install.pi-hole.net
sudo bash basic-install.sh

As mentioned, I've also bound the PiHole's MAC in the gateway and other networking equipment to its static IP.

/etc/os-release returns:

PRETTY_NAME="Raspbian GNU/Linux 10 (buster)"
NAME="Raspbian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=raspbian
ID_LIKE=debian
HOME_URL="http://www.raspbian.org/"
SUPPORT_URL="http://www.raspbian.org/RaspbianForums"
BUG_REPORT_URL="http://www.raspbian.org/RaspbianBugs"

DanSchaper · August 14, 2020, 9:29pm

Check /etc/dhcpcd.conf and make sure it's set as you expect.

But the bigger question is, why does a dig against 8.8.8.8 fail? That's a networking level issue. You can see from the same debug output that Pi-hole is operational and responding correctly.
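
One way to reproduce that comparison by hand is to ask the same question of Pi-hole and of the upstream server and look at dig's exit status. This is only a rough sketch: check_resolver and classify_dig are made-up helper names, and it assumes dig (from the dnsutils package) is installed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, dig is assumed to be installed.

# Turn dig's exit status and short answer into a one-word verdict.
classify_dig() {
    local status="$1" answer="$2"
    if [ "$status" -ne 0 ]; then
        echo "no reply"          # dig exits non-zero when no server answered
    elif [ -z "$answer" ]; then
        echo "empty answer"      # server replied, but with no records
    else
        echo "ok"
    fi
}

# Query DOMAIN at SERVER with a short timeout and a single attempt.
check_resolver() {
    local server="$1" domain="$2" answer status
    answer="$(dig +short +time=2 +tries=1 "@${server}" "${domain}" 2>/dev/null)"
    status=$?
    echo "${server}: $(classify_dig "$status" "$answer")"
}
```

Running check_resolver 127.0.0.1 doubleclick.com and then check_resolver 8.8.8.8 doubleclick.com while the network is in the failed state should show whether only the upstream leg is broken.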

Pseudokim · August 14, 2020, 9:34pm

# A sample configuration for dhcpcd.
# See dhcpcd.conf(5) for details.

# Allow users of this group to interact with dhcpcd via the control socket.
#controlgroup wheel

# Inform the DHCP server of our hostname for DDNS.
hostname

# Use the hardware address of the interface for the Client ID.
clientid
# or
# Use the same DUID + IAID as set in DHCPv6 for DHCPv4 ClientID as per RFC4361.
# Some non-RFC compliant DHCP servers do not reply with this set.
# In this case, comment out duid and enable clientid above.
#duid

# Persist interface configuration when dhcpcd exits.
persistent

# Rapid commit support.
# Safe to enable by default because it requires the equivalent option set
# on the server to actually work.
option rapid_commit

# A list of options to request from the DHCP server.
option domain_name_servers, domain_name, domain_search, host_name
option classless_static_routes
# Respect the network MTU. This is applied to DHCP routes.
option interface_mtu

# Most distributions have NTP support.
#option ntp_servers

# A ServerID is required by RFC2131.
require dhcp_server_identifier

# Generate SLAAC address using the Hardware Address of the interface
#slaac hwaddr
# OR generate Stable Private IPv6 Addresses based from the DUID
slaac private

# Example static IP configuration:
#interface eth0
#static ip_address=192.168.0.10/24
#static ip6_address=fd51:42f8:caae:d92e::ff/64
#static routers=192.168.0.1
#static domain_name_servers=192.168.0.1 8.8.8.8 fd51:42f8:caae:d92e::1

# It is possible to fall back to a static IP if DHCP fails:
# define static profile
#profile static_eth0
#static ip_address=192.168.1.23/24
#static routers=192.168.1.1
#static domain_name_servers=192.168.1.1

# fallback to static profile on eth0
#interface eth0
#fallback static_eth0
interface eth0
        static ip_address=192.168.0.2
        static routers=192.168.0.1
        static domain_name_servers=8.8.8.8 8.8.4.4

Looks fine to me?

That's what I'm struggling with understanding. There's absolutely no reason I can identify that it would fail. I have tried using different upstream DNS (cloudflare) but the issue persists with that changed too.

At least you've helped me narrow it down to probably not being the pihole at fault. Weird that it resolves if I reboot it though.

DanSchaper · August 14, 2020, 9:45pm

I would check things like ping 8.8.8.8 and ip route get 8.8.8.8 to make sure you even have basic connectivity to that IP address first. If that works, then I'd look for things like DNS rebind protection or any logs on the TP-Link router. Is it a stock firmware on the router?

Pseudokim · August 14, 2020, 9:48pm

I'll run them now and keep the results in a text file, and run them again next time it fails and update this thread with both results.
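
For keeping those results side by side, something like the following could append a timestamped snapshot of both checks to one file. It is a sketch, not a tested tool: snapshot_connectivity and run_labelled are hypothetical names, and the default log path is an assumption.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the default log path are assumptions.

# Print a header, the command's combined output, and its exit status.
run_labelled() {
    echo "--- $*"
    "$@" 2>&1
    echo "exit status: $?"
}

# Append one timestamped snapshot of both checks to LOGFILE.
snapshot_connectivity() {
    local logfile="${1:-$HOME/pihole-connectivity.log}"
    {
        echo "=== $(date '+%Y-%m-%d %H:%M:%S')"
        run_labelled ping -c 3 -W 2 8.8.8.8
        run_labelled ip route get 8.8.8.8
    } >> "$logfile"
}
```

Calling snapshot_connectivity once while things work and once during a failure leaves two blocks in the same file that can be compared directly.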

I'll look into that now.

EDIT: Nothing out of place in the logs. The router does have DNS rebind protection but from what I've read online, disabling its DHCP functionality and letting PiHole serve DHCP and DNS circumvents it.

If rebind protection was in effect, I'd not get any DNS resolution through the PiHole, ever. If it was randomly kicking in on the router, the router's VDSL connection also has 8.8.8.8/4.4 specified (because for some reason if you put a local subnet IP in that box the router stops responding entirely, until you factory reset, but I digress).

Point being if rebind protection was kicking in and causing the PiHole to stop getting through to clients, clients would be served unfiltered DNS from 8.8.8.8/4.4 as a fallback. I've tried running ipconfig /flushdns and a variety of netsh winsock reset commands on a Windows client on the network when this PiHole DNS error kicks in, and that does nothing. After running those commands and reconnecting, ipconfig /all still returns the PiHole as DHCP and DNS. So I don't think that's got anything to do with it. Even when in this state, DHCP is still working on the PiHole, evidenced by being assigned correct information after ipconfig /release /renew.

The router firmware is stock, latest version, manually updated.

Thank you very much for your help so far!

DanSchaper · August 14, 2020, 10:50pm

This won't solve the problem with external DNS upstreams failing but you could do an unbound install as the Pi-hole upstream.

https://docs.pi-hole.net/guides/unbound/

Pseudokim · August 15, 2020, 12:43am

Didn't realise unbound was quite so simple to set up. Done. We shall see if the bug returns now that unbound is set up. All working fine on unbound as of right now.

DanSchaper · August 15, 2020, 12:45am

If nothing else it should give you a lot more for log debugging options. The default configuration we suggest doesn't have logging enabled but you can set unbound up to be very verbose with logs to pick out all kinds of transit errors.
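
As a rough illustration of what turning logging on might look like, the helper below writes a small unbound config fragment to a file of your choosing. The function name and the log path are assumptions, and the three options (verbosity, logfile, log-queries) should be checked against unbound.conf(5) for the installed version before use.

```shell
#!/usr/bin/env bash
# Sketch only: write_unbound_logging is a hypothetical helper and the
# log path is an assumption, not something taken from this thread.

# write_unbound_logging FILE: write a logging fragment for unbound to FILE.
write_unbound_logging() {
    cat > "$1" <<'EOF'
server:
    # 0 logs errors only; higher values are more detailed (up to 5)
    verbosity: 2
    # assumed path; the directory must exist and be writable by unbound
    logfile: "/var/log/unbound/unbound.log"
    # log one line per incoming query
    log-queries: yes
EOF
}
```

Writing it to a scratch file first, for example write_unbound_logging /tmp/unbound-logging.conf, gives a chance to review the lines and merge them into the existing server: block instead of overwriting anything.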

Pseudokim · August 15, 2020, 12:49am

Great, thanks! I'll leave it be for now but if I get further issues in the next few weeks (ambitious?) then I'll look into verbose logging and post what I find here.

Annoyingly with intermittent issues like this, they never show up when you're waiting for them... Murphy's law and all that.

deHakkelaar · August 15, 2020, 1:02am

On a Raspi, having intermittent issues, always suspect power issues first:

Pseudokim · August 15, 2020, 1:26am

Zero returned for that, I'll try it again if/when it plays up again.

I should note, I did swap the power supply with another of the same type when I switched the RasPi running PiHole in an effort to resolve the bug. Same problem, different hardware, different power adaptors, different physical locations (and therefore electrical rings). I doubt this is the issue but I'll check nonetheless if the problem persists. Both adaptors are 5V3A.

deHakkelaar · August 15, 2020, 1:35am

Another is a corrupted filesystem.
But for that, you'd have to remove the SD card and insert into another Linux client that has the EXT4 filesystem stack available.
You would do a:

dehakkelaar@laptop:~$ lsblk
NAME                         MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                            8:0    0 111.8G  0 disk
├─sda1                         8:1    0   190M  0 part /boot
├─sda2                         8:2    0  37.3G  0 part /
├─sda3                         8:3    0  69.9G  0 part
│ ├─vg0-lamp.dehakkelaar.nl--swap 254:0    0   1.5G  0 lvm  [SWAP]
│ └─vg0-lamp.dehakkelaar.nl--disk 254:1    0    10G  0 lvm
└─sda4                         8:4    0   4.5G  0 part [SWAP]
sdb                            8:16   1   7.2G  0 disk
├─sdb1                         8:17   1   256M  0 part
└─sdb2                         8:18   1     7G  0 part
sr0                           11:0    1  1024M  0 rom

And run filesystem-checker on the SD card partitions:

sudo fsck /dev/sdb1

sudo fsck /dev/sdb2

Pseudokim · August 15, 2020, 1:39am

Linux-wise I've only got another RasPi available right now - is this something I can check with that (utilising a USB card reader)?

deHakkelaar · August 15, 2020, 1:40am

Yes.
I have a couple of them.

Pseudokim · August 15, 2020, 1:42am

Excellent! I'll check it tomorrow after work and post an update here. Thank you!

deHakkelaar · August 15, 2020, 1:43am

You can add the -y argument to fsck to answer yes to all questions it asks about fixing.
But first try running it without (you can break out with CTRL-C).

EDIT: oh, and make sure the SD card partitions from lsblk aren't mounted when running fsck!
After inserting, check mounts with:

mount

Unmount with:

sudo umount /PATH/TO/MOUNT
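
The mount check and the fsck run above can be combined so that fsck is never started on a mounted partition. A minimal sketch, assuming a Linux system with /proc/mounts; is_mounted and safe_fsck are made-up names, and /dev/sdb1 and /dev/sdb2 are just the device names from the lsblk example.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# is_mounted DEVICE [MOUNTS_FILE]: succeed if DEVICE is listed as mounted.
is_mounted() {
    local device="$1" mounts="${2:-/proc/mounts}"
    awk -v dev="$device" '$1 == dev { found = 1 } END { exit !found }' "$mounts"
}

# safe_fsck DEVICE: run fsck only when DEVICE is not mounted.
safe_fsck() {
    local device="$1"
    if is_mounted "$device"; then
        echo "refusing: ${device} is mounted, unmount it first" >&2
        return 1
    fi
    sudo fsck "$device"
}
```

With the card in the reader, safe_fsck /dev/sdb1 followed by safe_fsck /dev/sdb2 would then do the same as the two fsck commands, with the mount check built in.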

Pseudokim · August 15, 2020, 2:58am

Guys it's just occurred to me what's going on. Feel free to laugh at me.

I had OpenVPN installed on the same RasPi as PiHole, from when I first set it up a good while back. My logic then was to encrypt all DNS traffic to prevent my ISP from reading/intercepting it (See 'snooper's charter' in the UK - creepy stuff).

Anyway I'd had it configured to use PIA's Swiss server. Then I'd totally forgotten it was installed and running on the RasPi that the PiHole is on, 'cause it'd been problem free for so long.

Long story short I've been disconnected a few times from PIA's Switzerland server on other devices recently. I'm willing to bet my bottom dollar that if the end of the PiHole's VPN tunnel is going down, even briefly, that's enough of a spanner-in-the-works to be what's causing my problems, explaining why PiHole hasn't been able to find 8.8.8.8 at times. One would have hoped I'd connected the dots before wasting your time but here we are.

Rather than faff about with trying to get OpenVPN to reconnect or hop to a different server etc I've just uninstalled it for now. I'll continue to monitor the status, but the issues with DNS resolution on this network and the VPN disconnections I've experienced elsewhere recently are in more-or-less the exact same time frame.

I'll look into other options for encrypting/obfuscating outbound DNS queries, but I'll leave this post 'unsolved' for now. If, in a few days time I've had no outages on the network now that OVPN is gone, I'll come back and mark this post as a solution, forever marking my own stupidity.

Thank you both for taking the time to help. Much appreciated. I'll be donating again to PiHole as a thanks/apology for wasting your time on what is likely entirely my own error.

tl;dr Don't use a VPN on your PiHole, especially if it's unreliable.

EDIT: 36+ hours later and no issues. Looks like it was indeed the VPN dropping out that was causing my issue. May this thread/post remain as a monument to my own shortsightedness. Hopefully this thread can help someone else one day!
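
For anyone landing here with the same symptoms, a quick way to spot a forgotten VPN is to look at which interface the route to the upstream DNS server uses. This is a sketch under assumptions: route_dev and uses_tunnel are made-up names, and it assumes the tunnel interface is named tun* or tap*, which is common for OpenVPN but not guaranteed.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; tun*/tap* naming is an assumption.

# Read `ip route get` output on stdin and print the word after "dev".
route_dev() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "dev") { print $(i + 1); exit } }'
}

# Succeed if the interface name looks like a VPN tunnel device.
uses_tunnel() {
    case "$1" in
        tun*|tap*) return 0 ;;
        *)         return 1 ;;
    esac
}
```

For example, dev="$(ip route get 8.8.8.8 | route_dev)" followed by uses_tunnel "$dev" && echo "upstream DNS goes through ${dev}" would flag a tunnel that sits between Pi-hole and its upstream.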

deHakkelaar · August 15, 2020, 3:09am

Look into whether your router is able to dial into the VPN.
Maybe it's able to deal better with temporary loss of connection.
Most current firmwares can dial into VPN... and if not, there are alternative firmwares available for popular router models that can dial in to most popular VPN providers.
One such example:

https://dd-wrt.com/