Pi-hole stops resolving

MDSharma · September 4, 2021, 10:20am

Expected Behaviour:

Once configured correctly, Pi-hole should carry on resolving without any issues.

Actual Behaviour:

Pi-hole stops resolving after some time, and only a reboot of the system or a manual pihole -g brings it back online. During this aberrant behaviour, the FTL service looks fully operational and does not generate any errors; resolution simply keeps failing. This is very similar to the topic "Pihole intermittently stops resolving until hardware reboot".

A look at the FTL logs from before and just after the time when queries stop working does not show any errors. Similarly, the dnsmasq log (pihole.log) shows DNS query forwarding and resolution working at this stage (right up until about 08:53, after which no further updates appear in either the FTL log or pihole.log).

Steps taken:

In an effort to resolve the issue I have so far:

Verified with date that the date and time are correct, and checked that NTP sync is working.
Tried a fresh re-install with some FTL.conf adjustments due to the high query volume in the network (expected and not an anomaly).
Ran pihole -r and selected repair.
Ran pihole -r and selected reconfigure.
Although the IP on these boxes is received over DHCP, the allocation has been reserved. I have not yet tried to configure it as a fixed IP address (with a static config).
Checked /etc/dhcpcd.conf.
Checked network reachability (ping dns.google, dig @dns.google, ip route get 8.8.8.8, etc.); all of that works as expected.
Checked that there are no filesystem corruption issues (both VMs were behaving similarly, but at different times).
Kept checking stats with echo ">stats >quit" | nc localhost 4711 in case there was a delay with the GUI / web interface updates.
Did LOTS of Googling and tried different things that also didn't work.
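
To pin down the exact moment resolution stops (the logs go quiet around 08:53), a small probe loop can log every failed lookup with a timestamp. This is only a sketch: the resolver address 127.0.0.1, the probe name pi-hole.net, the 30-second interval, and the function names are all assumptions, not anything taken from the installs above.

```shell
# Extract the DNS status (NOERROR, SERVFAIL, ...) from dig output on stdin.
# Prints TIMEOUT when dig produced no header line (e.g. no reply at all).
dig_status() {
    status=$(sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1)
    echo "${status:-TIMEOUT}"
}

# Run a single lookup against resolver $1 for name $2 and print its status.
probe_once() {
    dig +time=2 +tries=1 "@$1" "$2" 2>/dev/null | dig_status
}

# Probe forever; print a timestamped line for every non-NOERROR result.
# $1 = resolver (assumed 127.0.0.1), $2 = name, $3 = interval in seconds.
watch_resolver() {
    server=${1:-127.0.0.1}
    name=${2:-pi-hole.net}
    interval=${3:-30}
    while true; do
        result=$(probe_once "$server" "$name")
        [ "$result" = "NOERROR" ] || echo "$(date -Is) $server $name $result"
        sleep "$interval"
    done
}
```

Running watch_resolver >> /tmp/resolver-probe.log & (the log path is hypothetical) leaves a record that can be lined up against pihole.log and the FTL log afterwards.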

None of those steps helped resolve my issue. Please could you share some thoughts on what may be causing this strange behaviour and what I could try to deliver a permanent fix?

PS: I really like Pi-hole and appreciate all the personal effort your team puts into this.

Guesswork around a potential resolution

A reboot fixes the issue.
Running a manual pihole -g also seems to resolve the issue.

Pi-hole and OS info

Pi-hole version is v5.3.1 (Latest: v5.3.1)
AdminLTE version is v5.5.1 (Latest: v5.5.1)
FTL version is v5.8.1 (Latest: v5.8.1)

Ubuntu 20.04.3 LTS (Focal Fossa)

Two debug tokens, from two installs are included below:

Debug Token:

[✓] Your debug token is: https://tricorder.pi-hole.net/lIC1q1FF/
[✓] Your debug token is: https://tricorder.pi-hole.net/o206bxUO/

MDSharma · September 13, 2021, 8:23am

Hi Team,

Updated to the latest stable release and currently using these custom settings (as query numbers are huge):

-rw-rw-r-- 1 pihole root 127 Sep 13 13:10 /etc/pihole/pihole-FTL.conf
BLOCKINGMODE=IP-NODATA-AAAA
EDNS0_ECS=true
RATE_LIMIT=0/0
DBIMPORT=no
DBINTERVAL=1.0
MAXDBDAYS=7
MAXLOGAGE=24.0
PRIVACYLEVEL=0

DNSMASQ customisation:

-rw-r--r-- 1 root root 22 Sep 6 22:37 /etc/dnsmasq.d/99halo-pihole.conf
dns-forward-max=10000

Looks like there is something making a lot of connections to the API and then getting denied:

Client denied (at max capacity of 255): 340
IPv4 telnet error: Success (0)

A quick look at ss -t -a -p confirms that all the connections to port 4711 are genuine / local and not external processes or any other scripts.
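
For anyone repeating this check, the sockets on port 4711 can be summarised per TCP state instead of reading the raw ss listing. This is a rough sketch; the function name count_states is made up for illustration, and it assumes the usual ss layout with one header line and the state in the first column.

```shell
# Count sockets per TCP state from `ss -tan` output supplied on stdin.
# The first line (column headers) is skipped; output is "STATE COUNT".
count_states() {
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Usage (needs ss from iproute2):
#   ss -tan '( sport = :4711 or dport = :4711 )' | count_states
```

A large and growing TIME-WAIT or ESTAB count here would be worth comparing against the "Client denied (at max capacity of 255)" figure.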

Just wondering whether the developers would recommend that I compile FTL from source after raising the MAXCONNS value in FTL/src/FTL.h (pi-hole/FTL on GitHub, at commit 198e7c61362e07b22baa8bdb2bb57dd1c53be0fc), or is that not recommended?

[✓] Your debug token is: https://tricorder.pi-hole.net/v6VaVXXU/
[i] Logs are deleted 48 hours after upload.

Bucking_Horn · September 13, 2021, 10:09am

Your query count looks way too high for just 5 clients.
Did you check for DNS loops?

The chart from your screenshot would suggest that Pi-hole stopped receiving any queries after a certain time. At that time, your clients just may have started using another DNS server.
It seems your router/DHCP server is not distributing your Pi-hole as DNS, allowing DHCP clients to bypass Pi-hole. (In addition, it is providing an awful lot of time servers?)

Also, you are running Pi-hole on a public IP.

Any steps you take to allow indiscriminate access to Pi-hole from public networks will turn your Pi-hole into an open resolver, which poses a potential threat for all Internet users, e.g. by serving as a multiplier in a DNS Amplification attack.

The Pi-hole team strongly discourages Pi-hole’s usage as an open resolver, and we won't provide support in that case.

That said, it seems your public IP is currently not allowing access to port 53.

You cannot know for the TIME-WAITs. That looks like some process connected to the API at some time in the past, but never closed the connection.

Are you perhaps using any third-party scripts that would issue Telnet API calls?

MDSharma · September 13, 2021, 12:42pm

Hi,

Thank you, yes, I did check for DNS loops. No issues on that front - the query count is correct for the network and as expected.

Pi-hole is currently one of the "resolvers" my primary and secondary DNS forwarders are pointing towards. That is why the client count is low but the number of queries is high. Typically, I can see 20M+ queries in a day on those forwarders.

.. and another beautiful graph here:

When I previously tested Pi-hole (with all clients addressing it directly), a Pi-hole failure would interrupt the network. So, for the moment, Pi-hole is just one of the upstream servers, which allows some testing without disrupting things too much.

Spot-on and totally possible. However, I spent a few hours waiting for that gap to appear and then ran a rate-limited dig test (50 parallel queries at a time, against a shuffled subset of 200-1000 domains from the top 1M list); at that stage, all the server returns is SERVFAIL or no response. An occasional pihole -g makes it happy again.
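
A test along those lines could look roughly like the sketch below. It is not the exact script that was run: the function names are illustrative, the domain list file (one name per line) is assumed, and it relies on GNU shuf and xargs for the shuffling and the cap of 50 parallel lookups.

```shell
# Tally status words read from stdin, one per line; prints "STATUS COUNT".
tally_statuses() {
    sort | uniq -c | awk '{ print $2, $1 }'
}

# Look up a random sample of domains, at most 50 dig processes at a time.
# $1 = resolver address, $2 = domain list file, $3 = sample size (default 500)
sample_resolver() {
    shuf -n "${3:-500}" "$2" |
        xargs -P 50 -I{} sh -c '
            dig +time=2 +tries=1 "@$1" "$2" 2>/dev/null |
                sed -n "s/.*status: \([A-Z]*\),.*/\1/p" |
                grep . || echo NORESPONSE
        ' _ "$1" {} |
        tally_statuses
}

# Usage (resolver address and file name are placeholders):
#   sample_resolver 127.0.0.1 top1m-domains.txt 500
```

A healthy resolver should return mostly NOERROR with some NXDOMAIN; a result dominated by SERVFAIL and NORESPONSE matches the failure described above.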

Originally, I was getting dnsmasq limit errors (expected based on the pre-set thresholds) - which were addressed by a custom dnsmasq sub config (the community thread is quite useful).

dns-forward-max=10000

I subsequently hit the connection limit of the FTL telnet API on port 4711. We don't have anything on these vanilla installs that would be querying or trying to connect to the API (other than the Pi-hole internals). So, I have now edited the FTL.h file within the source:

vi FTL/src/FTL.h
// How many client connection do we accept at once?
#define MAXCONNS 510

I changed the MAXCONNS parameter to a value above 255 and then compiled a fresh build (its hash differs from the pre-compiled binary, and the system detects it as a dirty build).

I am hoping to replace those old forwarders with beautiful Pi-hole in the long run - and to support the community with donations for patronage.

Thank you for the open resolver reminder. I am happy to confirm that it is not running as an open resolver. Yes, there is a public IP on it, however, it is firewalled up to deny all queries, except those from specific IPs within the network.
Absolutely no third-party scripts whatsoever. I am assuming that the web GUI updates (under a huge volume of queries) are causing the FTL telnet warning. Maybe reducing how often the GUI triggers its updates would help?
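
On the firewall point, the general shape of an allow-list for port 53 is sketched below for readers in a similar position. It is not the actual rule set on these boxes: it assumes ufw is the firewall in use, the addresses are documentation placeholders, and the function only prints the commands so they can be reviewed before anything is applied.

```shell
# Print (do not apply) ufw commands that allow DNS only from the given
# forwarder addresses and deny port 53 for everyone else.
# ufw evaluates rules in order, so the allow rules must come first.
dns_allow_rules() {
    for ip in "$@"; do
        for proto in udp tcp; do
            echo "ufw allow from $ip to any port 53 proto $proto"
        done
    done
    echo "ufw deny 53"
}

# Usage with placeholder addresses:
#   dns_allow_rules 192.0.2.10 192.0.2.11
```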

Your thoughts to help debug / address this will be very welcome.

system · October 4, 2021, 6:45pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.