DNS resolution failed after update to v4.1

When you ran pihole -g, did you wait for it to retry? Do that, and check whether it agrees that the command completes in less than one second.

Doing that now. I'll share the results next.

Meanwhile, both Pi-hole:

timeout 1 dig -p 53 @127.0.0.1 raw.githubusercontent.com

and Unbound:

timeout 1 dig -p 5353 @127.0.0.1 raw.githubusercontent.com

were successful within a second from the server CLI.

Results from letting pihole -g retry:

[Screenshot: Capture]

FYI:

time getent ahostsv4 raw.githubusercontent.com

...takes 5.009s instead of 10.030s, but that is still rather long.

and..

perl -MSocket -le 'print inet_ntoa inet_aton shift' raw.githubusercontent.com

5.032s

and...

time dig -p 53 @8.8.8.8 raw.githubusercontent.com

1.046s

So, in conclusion: when the server uses the gateway DNS, it's slow. The gateway uses an external DNS provider, not the one provided by the ISP.

Using the local Pi-hole or Unbound, or Google (8.8.8.8), it's fast.
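To compare the different resolver paths side by side, something like the following could be used. It is only a rough sketch: the function names elapsed_ms and compare_resolvers are made up for illustration, it assumes GNU date (for %N) and that dig and getent are installed, and the ports are the ones from the commands above.

```shell
#!/usr/bin/env bash

# Print how long a command takes, in milliseconds.
# Assumes GNU date, which supports %N (nanoseconds).
elapsed_ms() {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Time the system resolver path against direct queries to specific servers.
# Hypothetical helper; servers and ports are taken from the thread.
compare_resolvers() {
    local name=$1
    echo "system config (getent): $(elapsed_ms getent ahostsv4 "$name") ms"
    echo "Pi-hole (127.0.0.1:53): $(elapsed_ms dig -p 53 @127.0.0.1 "$name") ms"
    echo "Unbound (127.0.0.1:5353): $(elapsed_ms dig -p 5353 @127.0.0.1 "$name") ms"
    echo "Google (8.8.8.8:53): $(elapsed_ms dig -p 53 @8.8.8.8 "$name") ms"
}

# Usage: compare_resolvers raw.githubusercontent.com
```

If the getent line is the only slow one, that points at the system resolver configuration rather than at Pi-hole or Unbound.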

The whole issue seems to be that Pi-hole has the bad luck of using the system config, and therefore takes the slow route. Perhaps the script should query a specific DNS server when timing, like 8.8.8.8 or some other one.

The problem we run into with hard-coding an upstream is that it creates the impression that we implicitly endorse a certain provider. If we include Google, then a large portion of our user base will either want some other company or will outright assume we are associated with Google.

Unfortunately, that may leave some configurations less than optimal, but I'm not sure of a way to implement this while still being highly privacy focused.

I can see that.

So in essence, Pi-hole determines if it can resolve, not only by resolving, but also by setting a time restriction. Perhaps you could use a higher time-out and do something like:

<1s: all good
<10s: we get a reply, but your DNS is slow as a snail, man
30s+: DNS is terribly slow, or did not reply at all
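The tiers above could be expressed as a small function. This is just a sketch with a made-up name (classify_ms). Note that the list leaves the range between 10 and 30 seconds open; the sketch assumes it belongs to the "slow" tier.

```shell
#!/usr/bin/env bash

# Map a lookup time in milliseconds to one of the three tiers.
# Assumption: anything from 1 s up to (but not including) 30 s counts as "slow".
classify_ms() {
    local ms=$1
    if (( ms < 1000 )); then
        echo "all good"
    elif (( ms < 30000 )); then
        echo "we get a reply, but your DNS is slow"
    else
        echo "DNS is terribly slow, or did not reply at all"
    fi
}

classify_ms 46      # a typical fast lookup
classify_ms 5009    # the getent timing reported earlier in the thread
classify_ms 30000   # the cap was reached
```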

A second option would be to use 3 DNS providers and use averages to determine whether it's all right. But in case people prefer not to contact any of those 3 at all, that might not be best, even though using 3 shows no bias on Pi-hole's part.

That's a fair idea and something to look at. The timeout on the getent calls is just to keep things from taking a long time and users thinking that the process had hung. I believe at one point it was set to 5 seconds but with the addition of a countdown timer we needed to make sure each call completed in the 1 second between timer ticks. We could increase the timer granularity to be more than 1 second and then increase the timeout, or look at removing the countdown timer completely as being too limiting.

Having it increase the time-out is a really fancy solution. But it takes a bit of scripting.

The easiest solution is simply informing: "We are now testing your DNS. It can be fast, but in the worst case it could take a maximum of 30 seconds. Sit tight."

I'm trying to think of how to handle the retry case. It's not just a single call that happens: if the call fails, we try again multiple times, up to 120 attempts. If we were to have 4 or 5 retries of a 30 second timeout, then the process could end up taking a few minutes to complete. Definitely open to options for the function, though. There are a few other cases in the current configuration that cause a false failure notice, but we've been trying to mitigate that.

So let's break it down into the basics:

  1. We need to know there is a connection
  2. We need to know that a domain name can be resolved
  3. We need to know that resolving is not endless

1 is easy to test.
2 could be done by resolving using the 'standard' configuration
3 would be 2 with a 30 second time-out

If it takes more than 30 seconds to resolve, something is seriously wrong. There is no need for more attempts.
If it works well, you get a fast reply and proceed.
If it takes anywhere between 2-30s you can report: working, but slow

If it takes over 30 seconds, then you can ask them if they want to (change anything and then) retry.

No need for an increase in the timeout if the max is set to 30.
Worst case scenario, it takes 30 seconds and they just read it could be like that.
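A single attempt with a hard cap, as described above, might look like this. It is a sketch only: check_once is a hypothetical name, the 2 second "slow" threshold comes from the 2-30s range mentioned above, and it relies on the coreutils timeout command and bash's SECONDS counter (whole seconds, so the measurement is coarse).

```shell
#!/usr/bin/env bash

# Run a command once with a hard cap and report one of three outcomes.
# $1 is the cap in seconds, the rest is the command to run.
check_once() {
    local cap=$1
    shift
    local start=$SECONDS elapsed
    if ! timeout "$cap" "$@" >/dev/null 2>&1; then
        # The command failed, or timeout killed it when the cap was reached.
        echo "failed or no reply within ${cap}s"
        return 1
    fi
    elapsed=$(( SECONDS - start ))
    if (( elapsed >= 2 )); then
        echo "working, but slow (${elapsed}s)"
    else
        echo "working"
    fi
}

# Usage: check_once 30 getent ahostsv4 raw.githubusercontent.com
```

With a 30 second cap there is exactly one attempt, so the worst case stays at 30 seconds.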

And if you want to provide less tech-savvy users with a built-in diagnostics tool to narrow down a problem where resolving runs for over 30 seconds, you could ask them:

"Since DNS seems slow, or not working at all, would you like us to run a few tests and help you sort it out? This means we will contact Google, Cloudflare and a DNS root server, and ask them for the domain pi-hole.net. Is that okay?"

And then let a script try all sorts of resolving, to see where the problem lies.
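Such a diagnostics script could start out roughly like this. Everything here is an assumption for illustration: the function names, the 5 second per-query limit, and the choice of 198.41.0.4 (a.root-servers.net) as the root server. Note that dig exits with 0 whenever it receives any reply, so this only shows whether a server answers at all; a root server will return a referral, not the final address.

```shell
#!/usr/bin/env bash

# Ask one server for a name and report whether any reply came back.
# Extra arguments after the name are passed to dig as additional options.
probe() {
    local label=$1 server=$2 name=$3
    shift 3
    if dig +time=5 +tries=1 "$@" "@$server" "$name" >/dev/null 2>&1; then
        echo "$label ($server): answered"
    else
        echo "$label ($server): no reply"
    fi
}

# Probe the three kinds of servers named in the prompt to the user.
run_diagnostics() {
    local name=${1:-pi-hole.net}
    probe "Google" 8.8.8.8 "$name"
    probe "Cloudflare" 1.1.1.1 "$name"
    # Root servers do not recurse, so ask without the recursion flag.
    probe "Root server" 198.41.0.4 "$name" +norecurse
}

# Usage: run_diagnostics pi-hole.net
```

If all three answer but the system resolver does not, the problem is most likely the gateway or its upstream, not the connection itself.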

We need to know if FTLDNS is back up from a restart. That's the main use for this. If we call getent at time 0:01 and FTLDNS comes back online at 0:05 then we'll never see the daemon come back online and we'll falsely assume that FTLDNS is dead and can not operate. That's the whole design of that function, keep calling getent until FTLDNS answers and is ready for use or we hit the countdown end and can say with a high degree of certainty that it's not going to come back online.
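For readers following along, the keep-calling-until-it-answers idea can be sketched like this. It is not the actual Pi-hole code; wait_until is a hypothetical helper, and the attempt count and pause are parameters chosen by the caller.

```shell
#!/usr/bin/env bash

# Keep running a command until it succeeds or the attempts run out.
# $1 is the number of attempts, $2 the pause between attempts in seconds,
# the rest is the command to run.
wait_until() {
    local attempts=$1 pause=$2
    shift 2
    local i
    for (( i = 1; i <= attempts; i++ )); do
        if "$@" >/dev/null 2>&1; then
            echo "ready after $i attempt(s)"
            return 0
        fi
        sleep "$pause"
    done
    echo "gave up after $attempts attempt(s)"
    return 1
}

# Usage, assuming a 1 second limit per call and up to 120 attempts:
#   wait_until 120 1 timeout 1 getent hosts pi-hole.net
```

The total worst-case wait is roughly attempts times (per-call timeout plus pause), which is why raising the per-call timeout to 30 seconds quickly adds up when combined with retries.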

Right. So the error I started the topic with does not mean "DNS is down. Cannot resolve", but rather "FTL does not answer quickly, or at all". I would rephrase it a bit then, so as not to confuse people.

But apart from that, you are saying that you keep retrying to resolve until you can contact FTL. I would split up the two things, so it's clear whether it's an FTL daemon problem or a (slow) resolving problem.

Tip: you can check daemon states from a script.
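For example, on a systemd-based install the two checks could be separated like this. This is a sketch under assumptions: that the service is named pihole-FTL, that systemctl and dig are available, and that a 1 second limit is acceptable for the local query. The function names are made up.

```shell
#!/usr/bin/env bash

# Is the daemon itself up? Assumes a systemd unit called pihole-FTL.
ftl_running() {
    systemctl is-active --quiet pihole-FTL
}

# Does the local resolver answer within one second?
can_resolve() {
    dig +time=1 +tries=1 -p 53 @127.0.0.1 pi-hole.net >/dev/null 2>&1
}

# Report the two failure modes separately.
diagnose() {
    if ! ftl_running; then
        echo "FTL daemon is not running"
        return 2
    fi
    if ! can_resolve; then
        echo "FTL is running, but resolving is slow or failing"
        return 1
    fi
    echo "FTL is running and resolving works"
}

# Usage: diagnose
```

That way the message shown to the user can say which of the two things went wrong.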

