DNS resolution failed after update to v4.1

Amazing work again!

I'm so happy with this update. And just before the holidays too: the perfect time for some heavy admin work, including this beauty.

Just a heads-up: when updating from 4.0 to 4.1 in our test environment (we were that eager to play with it), we ran into this problem:
[screenshot]

Some context: we run Pi-hole on a dedicated machine. Pi-hole forwards queries to 127.0.0.1#5353, where Unbound runs and queries the root servers directly. Despite the error, everything is up to date and running. We haven't rebooted yet, though ;)
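A quick way to confirm that forwarding chain from the shell is to read the upstream servers out of the dnsmasq configuration. This is a minimal sketch, assuming Pi-hole stores its upstreams as dnsmasq server= lines in /etc/dnsmasq.d/01-pihole.conf (that path is an assumption; pass another file if your install differs). The helper name list_upstreams is hypothetical, not a Pi-hole command.

```bash
#!/usr/bin/env bash
# Minimal sketch: list the upstream servers dnsmasq/FTL is configured to forward to.
# ASSUMPTION: Pi-hole writes its upstreams as dnsmasq "server=" lines in
# /etc/dnsmasq.d/01-pihole.conf; pass another file as $1 if yours differs.
list_upstreams() {
  local conf="${1:-/etc/dnsmasq.d/01-pihole.conf}"
  # Print the value of every uncommented server= line (e.g. 127.0.0.1#5353).
  sed -n 's/^[[:space:]]*server=//p' "$conf"
}
```

On the setup described above it should print 127.0.0.1#5353. To query Unbound directly and bypass Pi-hole, dig can target that port: dig @127.0.0.1 -p 5353 raw.githubusercontent.com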

Check the FTL log (/var/log/pihole-FTL.log), the dnsmasq log (/var/log/pihole.log), and the output of journalctl -u pihole-FTL for any errors.
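For a quick first pass over those log files, a crude keyword filter can help. This is a hypothetical helper, not part of Pi-hole: it only matches lines containing "error" or "fail" (case-insensitive), so expect some false positives from domain names in the query log.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print log lines that mention "error" or "fail".
# -H prefixes each match with its file name, -i ignores case.
# Exit status follows grep: 0 if something matched, 1 if nothing did.
scan_logs() {
  grep -H -i -E 'error|fail' "$@"
}
```

For example: scan_logs /var/log/pihole-FTL.log /var/log/pihole.log. For the journal, journalctl -u pihole-FTL -p err --no-pager limits the output to messages of priority err and above.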

No errors in any of the three logs, just the usual notices and informational messages. There are some "insecure" (DNSSEC) messages, but those have never been a problem. So there's no clue as to why. The logs even show that FTL knows about 127.0.0.1#5353 as the upstream (forward) server.

Run pihole -d for a debug token.

Here you go: iiboyb09gf

Since your debug log shows no errors, FTL may simply have taken too long to start up (you have a lot of block lists). Try pihole -g now. If that works, try pihole -up again.
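If slow FTL startup is the suspect, one way to rule it out is to wait until FTL actually answers queries before rerunning gravity. A minimal sketch, assuming FTL listens on 127.0.0.1 port 53 and answers for pi.hole; ftl_ready and wait_for_ftl are hypothetical names, and the probe query is easy to swap for another one.

```bash
#!/usr/bin/env bash
# Minimal sketch: poll the local DNS server until it answers, then return 0.
# ASSUMPTIONS: FTL listens on 127.0.0.1:53 and resolves pi.hole; the function
# names are hypothetical, not Pi-hole commands.

# Probe: succeeds once dig gets a non-empty answer for pi.hole.
ftl_ready() {
  dig @127.0.0.1 +short +time=1 +tries=1 pi.hole 2>/dev/null | grep -q .
}

# Usage: wait_for_ftl [MAX_TRIES] [DELAY_SECONDS]; returns 1 if FTL never answers.
wait_for_ftl() {
  local max_tries="${1:-30}" delay="${2:-1}" i=0
  until ftl_ready; do
    i=$((i + 1))
    if [ "$i" -ge "$max_tries" ]; then
      return 1
    fi
    sleep "$delay"
  done
  return 0
}
```

For example: wait_for_ftl 30 1 && pihole -g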

pihole -g 

resulted in:

[screenshot]

Does this command complete successfully?

dig raw.githubusercontent.com

Yes, that resolves perfectly (NOERROR)

Does this command run successfully (this is what the gravity script runs)?

timeout 1 getent hosts raw.githubusercontent.com

I suppose so. There's no output; it just returns me to the command line 1 second later.
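A silent return after exactly the timeout usually means the command was killed rather than completed: on success, getent hosts prints the resolved address, and getent itself exits with status 2 when the name is not found. GNU coreutils timeout exits with status 124 when it has to kill the command, so echo $? right after the command tells these cases apart. A minimal sketch wrapping that check (run_limited is a hypothetical helper):

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command under a time limit and report how it ended.
# GNU coreutils timeout exits with 124 when it had to kill the command.
run_limited() {
  local secs="$1"
  shift
  timeout "$secs" "$@" >/dev/null 2>&1
  local status=$?
  case "$status" in
    0)   echo "ok" ;;
    124) echo "timed out after ${secs}s" ;;
    *)   echo "failed (exit $status)" ;;
  esac
}
```

For example: run_limited 1 getent hosts raw.githubusercontent.com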

Just to be clear: everything is running normally. Despite the above, we can browse and domains still get blocked as usual. There are no actual resolution problems, apart from Pi-hole reporting one.

Try the same command, but replace the timeout 1 part with time:

time getent hosts raw.githubusercontent.com

[screenshot]

Did the dig command also take that long? What server(s) are in /etc/resolv.conf?

No, the dig command takes 1 second (it reports 1 msec at the bottom).
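Since dig prints both the response code and the query time, a small filter can pull those two values out for a side-by-side comparison. A minimal sketch (dig_summary is a hypothetical name), assuming dig's usual output format with a "->>HEADER<<-" line and a ";; Query time:" line:

```bash
#!/usr/bin/env bash
# Hypothetical filter: read dig's full output on stdin and print
# "status=<RCODE> time=<N> msec".
dig_summary() {
  awk '
    /->>HEADER<<-/ {
      # The header line looks like: ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 123
      for (i = 1; i <= NF; i++) if ($i == "status:") { s = $(i + 1); sub(/,$/, "", s) }
    }
    /^;; Query time:/ { t = $4 }   # ;; Query time: 1 msec
    END { printf "status=%s time=%s msec\n", s, t }
  '
}
```

For example: dig raw.githubusercontent.com | dig_summary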

/etc/resolv.conf lists two servers, in this order:

  1. Gateway (router)
  2. Local host

[screenshot]
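The order matters: the glibc stub resolver, which getent hosts normally goes through, queries the nameservers in the order listed (unless options rotate is set) and waits up to the timeout option (5 seconds by default) for an answer before moving on. To test each listed server separately, here is a minimal sketch that prints them in that order (list_nameservers is a hypothetical helper):

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the nameserver entries of a resolv.conf in order.
# Defaults to /etc/resolv.conf; comment and options lines are ignored.
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```

For example: for ns in $(list_nameservers); do echo "== $ns"; dig @"$ns" +short raw.githubusercontent.com; done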

To elaborate on that:

This server runs in a test environment that sees enough use to be comparable to a real-life setup. The environment has a gateway (router), which forwards DNS externally, and a server, on which Pi-hole and other services run.

The server itself uses the gateway for DNS resolution. The devices on the test network use the server for DNS, so they are protected by Pi-hole, but they can optionally bypass it by setting their DNS server to the gateway.

As mentioned, the Pi-hole server also runs Unbound, which queries the root servers directly. Pi-hole is configured to forward queries to this local Unbound instance.

Edit: TL;DR: the commands were run from the server's CLI, so they bypass the Pi-hole/Unbound setup.

From a (Windows) device on the network, whose queries go through Pi-hole -> Unbound -> root servers:
[screenshot]

So the issue is that getent hosts raw.githubusercontent.com takes too long, when it should take about as long as the dig command.

If the getent command fails, the gravity script waits for the retry period and then checks again using:

timeout 1 dig +short raw.githubusercontent.com
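The two-stage check described above can be sketched roughly like this. It is NOT the actual gravity.sh code: the function names are hypothetical, and RETRY_DELAY stands in for the retry period, whose real value isn't given in this thread.

```bash
#!/usr/bin/env bash
# Minimal sketch of a getent-then-dig DNS availability check.
# Hypothetical names; RETRY_DELAY is a placeholder for the real retry period.

# Stage 1: resolve through the system resolver (nsswitch + /etc/resolv.conf).
primary_lookup() {
  timeout 1 getent hosts "$1" >/dev/null 2>&1
}

# Stage 2: ask dig directly; +short prints only answers, so empty output means no answer.
fallback_lookup() {
  timeout 1 dig +short "$1" 2>/dev/null | grep -q .
}

dns_available() {
  local host="${1:-raw.githubusercontent.com}"
  primary_lookup "$host" && return 0
  sleep "${RETRY_DELAY:-0}"   # placeholder for the retry wait
  fallback_lookup "$host"
}
```

For example: dns_available raw.githubusercontent.com && echo "DNS OK"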

Does this dig command complete in less than one second? (It should, since the other dig command did, but just checking.)

timeout 1 dig +short raw.githubusercontent.com

Yes, it takes about a second.