DNS resolution fails intermittently with "config error is REFUSED (EDE: network error)"

Symptoms

I'm experiencing unusual behaviour on a new installation of Pihole. Chrome on my Windows workstation will occasionally stop working due to DNS failures. Failures are instant, and using nslookup I can confirm that Pihole is returning a "Refused" response.

Using dig on the pihole I get the same results (querying its own dnsmasq instance). However, using dig to query pihole's upstream servers directly (Cloudflare 1.1.1.1) works just fine, so this isn't an internet connection dropout.

I've tried capturing debug info but wasn't able to catch it in the middle of the latest failure. When it happens, DNS resolution is down for maybe 2-5min, then comes back again for no apparent reason.

Debug Token:

https://tricorder.pi-hole.net/w66RXrSS/

Config details

I built this pihole box intending it to replace an existing proof-of-concept install on my network. The config should be quite similar, and I copied the config from the old one using Teleporter, checking all the checkboxes during import.

It's somewhat messy, but my network has a few DHCP servers running on the same segment with non-overlapping IP pools: the Unifi USG3 gateway, the old pihole box, and the new pihole box. This setup has allowed me to test enabling pihole filtering for devices on the network without making it a single point of failure.

I would eventually like to make this new pihole the sole DHCP server and DNS resolver on the network, and then later add a second similarly-configured pihole for high availability.

The old pihole install is an RPi 4 running Raspbian, the new one is a Radxa RockPi S running Ubuntu 20.04 with Radxa's custom 4.4 kernel for the hardware. Both are attached by onboard ethernet and I have no reason to doubt their reliability.

My internet connection has native v4 and v6 support. My ISP delegates a prefix to the router and clients autoconfigure themselves. As such, I've enabled v4 and v6 upstream resolvers in Pihole.

Evidence from logs

Setting DEBUG_QUERIES=true I can see the following in pihole-FTL.log:

[2022-01-12 17:21:40.241 111183M] **** new UDP IPv4 query[A] query "mail.google.com" from eth0/192.168.1.70#37148 (ID 6726, FTL 29316, src/dnsmasq/forward.c:1601)
[2022-01-12 17:21:40.241 111183M] mail.google.com is known as not to be blocked
[2022-01-12 17:21:40.242 111183M] **** got cache reply: error is REFUSED (nowhere to forward to) (ID 6726, src/dnsmasq/rfc1035.c:1110)
[2022-01-12 17:21:40.243 111183M]      EDE: network error (23)
[2022-01-12 17:21:40.243 111183M] Set reply to REFUSED (8) in src/dnsmasq_interface.c:2071

All the failures follow this same format. The corresponding log entry in pihole.log is

Jan 12 17:21:40 dnsmasq[111183]: query[A] mail.google.com from 192.168.1.70
Jan 12 17:21:40 dnsmasq[111183]: config error is REFUSED (EDE: network error)
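The outage windows are short and easy to miss; a small log-scraping helper (a sketch - `count_refused_per_minute` is just an illustrative name, and it assumes the pihole.log format shown above) can summarise when REFUSED responses occurred:

```shell
# Count REFUSED responses per minute in a dnsmasq-format pihole.log.
# Reads the file given as $1, or stdin when no argument is given.
count_refused_per_minute() {
    grep 'config error is REFUSED' "${1:--}" |
        awk '{ print $1, $2, substr($3, 1, 5) }' |  # "Jan 12 17:21:40" -> "Jan 12 17:21"
        sort | uniq -c
}
```

Running `count_refused_per_minute /var/log/pihole.log` then prints one line per minute of outage, with the failure count first.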

The old pihole installation has never exhibited this behaviour. At first I thought it was a genuine upstream problem, because the old install uses OpenDNS and I'm trying Cloudflare on the new one. Changing the new pihole to use OpenDNS instead has not fixed the problem.

Hypotheses

I can't think of any reason why this would happen, particularly only on this pihole installation. The internet connection is quite stable, and this pihole hasn't been rebooted or anything recently.

The only scenario I can come up with is something like:

  1. A very brief transient network failure occurs
  2. A client requests DNS resolution during this loss of connectivity
  3. Pihole/dnsmasq queries all its upstreams, finds it can't reach any of them, and returns REFUSED to the client
  4. Pihole/dnsmasq caches this failure, and continues to return REFUSED for a period of time even once the upstream network issue has cleared
  5. Eventually the negative-cache times out and behaviour returns to normal

However I don't think dnsmasq caches connection errors, and it wouldn't explain why the resolution failure sometimes lasts 1-2min and sometimes lasts for 5min. Also, apps with existing connections keep working fine so I'm quite convinced it's not a connectivity issue.

It doesn't - dnsmasq does not cache connection errors.

  1. Just to be on the safe side: Does this also happen during the time of Pi-hole serving REFUSED to the clients?

  2. Your log from FTL above suggests it is a UDP query that is failing. Does the same happen when you run TCP queries (dig +tcp ...) during this time?

  3. Could you also add DEBUG_FLAGS=true and post again what you posted above from FTL's log when it happens again? That has sometimes been helpful for me in the past to understand things.

  4. I had this same issue a long time ago when I was still using an on-demand connection and the connection was not established at the time. However, you have already ruled out that it can be a connection issue.

@DL6ER Can you suggest some further debug information?

Going strictly by that message alone, your Pi-hole wasn't aware of any upstream DNS server to forward a query to at that time.

Did you perhaps test or switch to different upstream DNS servers for your Pi-hole at that time?

If that's not the case:

That observation - along with REFUSED log entries - would suggest that a client has exceeded Pi-hole's rate limit.
If a client is expected to exceed the default 1,000 queries per minute, you may adjust that rate limit via pihole-FTL.conf.
This would often be the case if a router had been configured to use Pi-hole as its upstream DNS server (as opposed to distributing it as the local DNS server via DHCP). A debug log shows only a fraction of Pi-hole's log, but almost all contained queries originate from a .70 address.
Would that be a router?
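For reference, the rate limit is adjustable in pihole-FTL.conf; a sketch with illustrative values (the format is queries/seconds, default 1000/60):

```
# /etc/pihole/pihole-FTL.conf
# Allow up to 4000 queries per client per 60 seconds:
RATE_LIMIT=4000/60
```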

However, you should then also see a corresponding RATE_LIMIT message in Pi-hole's diagnosis view - and your debug log doesn't include such a message. :thinking:

Yet your debug log shows some database related errors about missing tables in Pi-hole's gravity database.
This may be a bit of a long shot (as the error messages involved do not match), but since you are dealing with a fresh install, you may unintentionally have corrupted your database by running pihole -g -r - see Error: no such table: main.gravity - Pi-hole v. 5.6 - #2 by DL6ER.
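One quick way to check for that kind of corruption is to list what the gravity database actually contains (a sketch - /etc/pihole/gravity.db is the default path, and `list_gravity_objects` is just a helper name for this example):

```shell
# Print "type name" for every table and view in a gravity database.
# Uses /etc/pihole/gravity.db by default; pass another path as $1.
list_gravity_objects() {
    sqlite3 "${1:-/etc/pihole/gravity.db}" \
        "SELECT type || ' ' || name FROM sqlite_master
         WHERE type IN ('table','view') ORDER BY name;"
}
```

A healthy v5 database should show views such as vw_whitelist and vw_blacklist rather than bare whitelist/blacklist tables.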

And likely unrelated to your problem, note that your other DHCP server distributes a second DNS server address (.14) besides your Pi-hole's .26.

  1. That's correct, manual queries against upstream do work while pihole is serving REFUSED to clients.
  2. I'll need to test that manually next time it happens, I'll try TCP queries from the pihole against itself, and against its upstreams.

I've set DEBUG_FLAGS=true and confirmed they appear in pihole-FTL.log, so I'll be ready when it happens next time.
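For anyone following along, both of these debug switches live in FTL's config file; a sketch of the relevant lines (assuming the standard /etc/pihole/pihole-FTL.conf path, followed by a pihole-FTL restart):

```
# /etc/pihole/pihole-FTL.conf
DEBUG_QUERIES=true
DEBUG_FLAGS=true
```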

No changes at that time, I'd been using Cloudflare upstreams since the initial install.

.70 is one of my workstations, I'm using it as a guinea pig for my testing. I did read that rate-limiting could be a potential problem, but it's not a new feature at this point, and as you say there's nothing in the logs about it either.

I'll read up on the possible DB corruption issues. I don't believe I've ever run pihole -g -r but it's worth a look.

The other DHCP server indeed distributes .14 as a DNS server, that's the old pihole install that doesn't have this issue.

I prepared a special version of FTL you can try with

pihole checkout ftl tweak/debug_for_refused

Hint 1: If you are in docker, you need to use the nightly container to get access to the checkout feature
Hint 2: It will take a few minutes until the binaries are compiled and ready for checkout.

It should log a reason for the failure into /var/log/pihole.log when this happens. This is my first draft, so we'll see if it turns out to be helpful.

I've had a read and I'm confident that that's not the problem. It's a new pihole install but it's definitely been running properly otherwise.

TCP seems to work

I managed to catch a single bad query earlier today that did get a positive result when retried over TCP. I had it scripted up like so, but the problem had gone when I tried to run it a second time immediately after the failure.

# On the pihole host
echo "dig - UDP - against pihole"
dig @127.0.0.1 google.com
echo "dig - TCP - against pihole"
dig +tcp @127.0.0.1 google.com

And the output (trimmed):

# UDP
; <<>> DiG 9.16.1-Ubuntu <<>> @127.0.0.1 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: REFUSED, id: 62843                     <---------
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; OPT=15: 00 17 ("..")

;; Query time: 3 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Jan 13 18:52:20 AEDT 2022

#TCP
; <<>> DiG 9.16.1-Ubuntu <<>> +tcp @127.0.0.1 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 16022
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096

;; Query time: 119 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Thu Jan 13 18:52:21 AEDT 2022

And the corresponding FTL logs:

[2022-01-13 18:52:20.794 120256M] Processing FTL hook from src/dnsmasq/forward.c:1599...
[2022-01-13 18:52:20.794 120256M]      Flags: F_FORWARD F_IPV4 F_QUERY
[2022-01-13 18:52:20.794 120256M] **** new UDP IPv4 query[A] query "google.com" from lo/127.0.0.1#50340 (ID 18337, FTL 20209, src/dnsmasq/forward.c:1601)
[2022-01-13 18:52:20.795 120256M] google.com is not known
[2022-01-13 18:52:20.795 120256M] DNS cache: 127.0.0.1/google.com is not blocked
[2022-01-13 18:52:20.796 120256M] Processing FTL hook from src/dnsmasq/rfc1035.c:1110...
[2022-01-13 18:52:20.796 120256M]      Flags: F_CONFIG F_RCODE
[2022-01-13 18:52:20.796 120256M] ***** Unknown cache query
[2022-01-13 18:52:20.796 120256M] **** got cache reply: error is REFUSED (nowhere to forward to) (ID 18337, src/dnsmasq/rfc1035.c:1110)
[2022-01-13 18:52:20.796 120256M]      EDE: network error (23)
[2022-01-13 18:52:20.796 120256M] Set reply to REFUSED (8) in src/dnsmasq_interface.c:2071


[2022-01-13 18:52:20.905 129730/F120256] TCP worker forked for client 127.0.0.1 on interface lo with IP 127.0.0.1
[2022-01-13 18:52:20.905 129730/F120256] Reopening Gravity database for this fork
[2022-01-13 18:52:20.912 129730/F120256] Closing Telnet socket for this fork
[2022-01-13 18:52:20.912 129730/F120256] Closing Unix socket for this fork
[2022-01-13 18:52:20.913 129730/F120256] Processing FTL hook from src/dnsmasq/forward.c:2077...
[2022-01-13 18:52:20.913 129730/F120256]      Flags: F_FORWARD F_IPV4 F_QUERY
[2022-01-13 18:52:20.913 129730/F120256] **** new TCP IPv4 query[A] query "google.com" from lo/127.0.0.1#43519 (ID 18338, FTL 20210, src/dnsmasq/forward.c:2080)
[2022-01-13 18:52:20.913 129730/F120256] google.com is known as not to be blocked
[2022-01-13 18:52:21.017 129730/F120256] Processing FTL hook from src/dnsmasq/forward.c:2245...
[2022-01-13 18:52:21.018 129730/F120256]      Flags: F_FORWARD F_IPV4 F_SERVER
[2022-01-13 18:52:21.018 129730/F120256] **** forwarded google.com to 208.67.222.222#53 (ID 18338, src/dnsmasq/forward.c:2245)
[2022-01-13 18:52:21.018 129730/F120256] FTL_CNAME called with: src = (null), dst = google.com, id = 18338
[2022-01-13 18:52:21.019 129730/F120256] google.com is known as not to be blocked
[2022-01-13 18:52:21.019 129730/F120256] Query 18338: CNAME google.com
[2022-01-13 18:52:21.020 129730/F120256] Processing FTL hook from src/dnsmasq/rfc1035.c:893...
[2022-01-13 18:52:21.020 129730/F120256]      Flags: F_FORWARD F_IPV4 F_UPSTREAM
[2022-01-13 18:52:21.020 129730/F120256] **** got upstream reply from 208.67.222.222#53: google.com is 172.217.24.46 (ID 18338, src/dnsmasq/rfc1035.c:893)
[2022-01-13 18:52:21.020 129730/F120256] Set reply to IP (4) in src/dnsmasq_interface.c:2144
[2022-01-13 18:52:21.025 129730/F120256] TCP worker terminating (client disconnected)

I did notice that query volumes go up when the error occurs, but there are no signs of rate-limiting, so I suspect it's just Chrome retrying aggressively. I didn't analyse the logs properly, but a lot of the REFUSED results are for Google domains.
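Since the window only lasts a couple of minutes, a small polling sketch could catch the UDP/TCP divergence automatically (assumes `dig` is installed; the server, domain, and log path are illustrative):

```shell
#!/bin/sh
# Poll the resolver over UDP and TCP, logging any non-NOERROR status.

SERVER=127.0.0.1
DOMAIN=google.com
LOG=/tmp/dns-watch.log

# Extract the RCODE (NOERROR, REFUSED, ...) from dig's header output.
dig_status() {
    sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Run the loop only when asked, so the file is safe to source elsewhere.
if [ "${1:-}" = "--watch" ]; then
    while :; do
        udp=$(dig @"$SERVER" "$DOMAIN" 2>&1 | dig_status)
        tcp=$(dig +tcp @"$SERVER" "$DOMAIN" 2>&1 | dig_status)
        if [ "$udp" != "NOERROR" ] || [ "$tcp" != "NOERROR" ]; then
            echo "$(date '+%F %T') UDP=$udp TCP=$tcp" >> "$LOG"
        fi
        sleep 5
    done
fi
```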

Special version

Thanks, I've just run that now. Looks like it worked fine and FTL is running so I'll wait for the next time it fails.

As you said, you have a fairly complex network setup. Do you have any DNS server in between your clients and Pi-hole that would add EDNS(0) data? We have seen that some upstream DNS servers reject DNS queries with EDNS(0) data.

No, the original problem mentioned certainly isn't.

As I said, it's a bit of a long shot - the involvement I suspected was that your database had somehow failed to store the RATE_LIMIT message, as your debug log shows your database to lack a few tables:

*** [ DIAGNOSING ]: contents of /var/log/lighttpd

-rw-r--r-- 1 www-data www-data 2.5K Jan  9 03:31 /var/log/lighttpd/error.log
   -----tail of error.log------
   2022-01-09 03:31:44: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  SQLite3::exec(): no such table: blacklist in /var/www/html/admin/scripts/pi-hole/php/teleporter.php on line 90
   2022-01-09 03:31:44: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  SQLite3::exec(): no such table: regex_blacklist in /var/www/html/admin/scripts/pi-hole/php/teleporter.php on line 90
   2022-01-09 03:31:44: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  SQLite3::exec(): no such table: whitelist in /var/www/html/admin/scripts/pi-hole/php/teleporter.php on line 90
   2022-01-09 03:31:44: (mod_fastcgi.c.421) FastCGI-stderr: PHP Warning:  SQLite3::exec(): no such table: regex_whitelist in /var/www/html/admin/scripts/pi-hole/php/teleporter.php on line 90

Still, if those tables had indeed been missing, you could have considered trying to fix your database.

But then, I notice just now that those missing *lists tables are indeed views that carry a prefixed name in the database:

CREATE VIEW vw_whitelist AS SELECT domain, domainlist.id AS id, domainlist_by_group.group_id AS group_id
CREATE VIEW vw_blacklist AS SELECT domain, domainlist.id AS id, domainlist_by_group.group_id AS group_id
CREATE VIEW vw_regex_whitelist AS SELECT domain, domainlist.id AS id, domainlist_by_group.group_id AS group_id
CREATE VIEW vw_regex_blacklist AS SELECT domain, domainlist.id AS id, domainlist_by_group.group_id AS group_id

But rather than contributing to your issue, this may hint at teleporter.php incorrectly omitting the prefix when trying to clear db contents.
Could you have a look at this, @DL6ER?
If confirmed, I'll split that into a new topic.

This is interesting, as an EDE code of 17 would hint at the request being filtered.
Unless pihole-FTL/dnsmasq is customarily adding that code for any REFUSED request, that may suggest that the upstream has filtered that specific request.

Not that I can think of. The client I usually notice the problem on (192.168.1.71) is an ordinary Win10 machine so there's not much to configure there.

The linux workstation I use the most (192.168.1.70) runs Ubuntu and that does use a funky local resolver via systemd. Now that you mention it, would this /etc/resolv.conf possibly trigger behaviour like that?

nameserver 127.0.0.53
options edns0 trust-ad
search thighhighs.top <and 14 work domains for the VPN>

Given that this is a pretty fresh install, but you're seeing some weird behaviour relating to the database, what about this for an idea? My build process for the pihole was:

  1. Install the OS and update all packages
  2. Configure the network just how I need it
  3. Do a vanilla Pi-hole install and configure it with desired upstreams and eth0 listening settings
  4. Import a Teleporter backup from the old pihole host, because I'd like to keep historical data, and also keep all the Local DNS records without having to re-enter them

I can't 100% guarantee that both piholes were the same version, but I'm reasonably confident I ran pihole -up on both prior to doing the work because that seems like a wise thing to do (and I imagine that the importer would do a sanity-check on the versions too).

This is the first time I've ever used the Teleporter feature. Is there a chance that could've caused database corruption? There wouldn't have been any time to observe possible failures between steps 3 and 4 above.

I currently consider that more likely to be an issue with teleporter.php (and thus unrelated to your current topic), but let's wait for feedback from development.

I haven't managed to catch it in the act yet, but there was definitely a failure a couple of hours ago during lunchtime, just for a minute or so. The tweaked branch is definitely doing something; the logs are slightly more verbose, and the diagnosis message marker is going nuts with over 400 warnings.

Two failed queries in pihole.log

Jan 14 12:54:48 dnsmasq[136452]: query[A] content-autofill.googleapis.com from 192.168.1.181
Jan 14 12:54:48 dnsmasq[136452]: Sending packet for content-autofill.googleapis.com upstream failed: Network is unreachable
Jan 14 12:54:48 dnsmasq[136452]: Tried all available servers over UDP, none worked, returning REFUSED
Jan 14 12:54:48 dnsmasq[136452]: config error is REFUSED (EDE: network error)

Jan 14 12:54:49 dnsmasq[136452]: query[A] adservice.google.com from 192.168.1.181
Jan 14 12:54:49 dnsmasq[136452]: Sending packet for adservice.google.com upstream failed: Network is unreachable
Jan 14 12:54:49 dnsmasq[136452]: Tried all available servers over UDP, none worked, returning REFUSED
Jan 14 12:54:49 dnsmasq[136452]: config error is REFUSED (EDE: network error)

Corresponding entries in pihole-FTL.log:

[2022-01-14 12:54:48.454 136452M] Processing FTL hook from src/dnsmasq/forward.c:1609...
[2022-01-14 12:54:48.455 136452M]      Flags: F_FORWARD F_IPV4 F_QUERY
[2022-01-14 12:54:48.455 136452M] **** new UDP IPv4 query[A] query "content-autofill.googleapis.com" from eth0/192.168.1.181#61450 (ID 30314, FTL 51823, src/dnsmasq/forward.c:1611)
[2022-01-14 12:54:48.456 136452M] content-autofill.googleapis.com is known as not to be blocked
[2022-01-14 12:54:48.456 136452M] WARNING in dnsmasq core: Sending packet for content-autofill.googleapis.com upstream failed: Network is unreachable
[2022-01-14 12:54:48.489 136452M] WARNING in dnsmasq core: Tried all available servers over UDP, none worked, returning REFUSED
[2022-01-14 12:54:48.519 136452M] Processing FTL hook from src/dnsmasq/rfc1035.c:1110...
[2022-01-14 12:54:48.520 136452M]      Flags: F_CONFIG F_RCODE
[2022-01-14 12:54:48.520 136452M] ***** Unknown cache query
[2022-01-14 12:54:48.520 136452M] **** got cache reply: error is REFUSED (nowhere to forward to) (ID 30314, src/dnsmasq/rfc1035.c:1110)
[2022-01-14 12:54:48.520 136452M]      EDE: network error (23)
[2022-01-14 12:54:48.520 136452M] Set reply to REFUSED (8) in src/dnsmasq_interface.c:2071

[2022-01-14 12:54:49.163 136452M] Processing FTL hook from src/dnsmasq/forward.c:1609...
[2022-01-14 12:54:49.164 136452M]      Flags: F_FORWARD F_IPV4 F_QUERY
[2022-01-14 12:54:49.164 136452M] **** new UDP IPv4 query[A] query "adservice.google.com" from eth0/192.168.1.181#56816 (ID 30317, FTL 51826, src/dnsmasq/forward.c:1611)
[2022-01-14 12:54:49.165 136452M] adservice.google.com is known as not to be blocked
[2022-01-14 12:54:49.165 136452M] WARNING in dnsmasq core: Sending packet for adservice.google.com upstream failed: Network is unreachable
[2022-01-14 12:54:49.198 136452M] WARNING in dnsmasq core: Tried all available servers over UDP, none worked, returning REFUSED
[2022-01-14 12:54:49.229 136452M] Processing FTL hook from src/dnsmasq/rfc1035.c:1110...
[2022-01-14 12:54:49.229 136452M]      Flags: F_CONFIG F_RCODE
[2022-01-14 12:54:49.229 136452M] ***** Unknown cache query
[2022-01-14 12:54:49.230 136452M] **** got cache reply: error is REFUSED (nowhere to forward to) (ID 30317, src/dnsmasq/rfc1035.c:1110)
[2022-01-14 12:54:49.230 136452M]      EDE: network error (23)
[2022-01-14 12:54:49.230 136452M] Set reply to REFUSED (8) in src/dnsmasq_interface.c:2071

I'm starting to wonder if I need to take packet captures via a mirroring port on the switch to see why it thinks the upstream nameservers are unreachable. Nothing in dmesg or journalctl output to suggest the link is flapping or anything like that.

seems to be pretty clear and already suggests that the connect is failing, so it is the operating system telling this to your Pi-hole. However, this also explains perfectly well why you see only REFUSED during this time: every connect fails. What is puzzling is that TCP seems to work at the same time, as you said.
Can it be a firewall/antivirus/etc. software issue that is somehow temporarily blocking UDP? I'm reading this thread with interest but I'm heavily scratching my head over how this network unavailability can happen for UDP alone - and then resolve itself after a (short) moment.


For development (@DL6ER):

This is why FTL says it's a cache reply when it actually was a local error; we need to add the case CONFIG in addition to UPSTREAM here:

Pi-hole has its own embedded packet dumping. It can be enabled by adding the following to a file like /etc/dnsmasq.d/99-record.conf:

dumpfile=/etc/pihole/dump.pcap

(or any other location you prefer), in addition to

dumpmask=<mask>

where mask specifies which types of packets should be added to the dumpfile defined above. The argument should be the OR of the bitmasks for each type of packet to be dumped; it can be specified in hex by preceding the number with 0x in the normal way.
Each time a packet is written to the dumpfile, we log the packet sequence and the mask representing its type. The current types are:

  • 0x0001 - DNS queries from clients
  • 0x0002 - DNS replies to clients
  • 0x0004 - DNS queries to upstream
  • 0x0008 - DNS replies from upstream
  • 0x0010 - queries sent upstream for DNSSEC validation
  • 0x0020 - replies to queries for DNSSEC validation
  • 0x0040 - replies to client queries which fail DNSSEC validation
  • 0x0080 - replies to queries for DNSSEC validation which fail validation.
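As a worked example of the OR-ing described above (the flag choice is just illustrative): to capture only upstream traffic, combine 0x0004 and 0x0008:

```shell
# Combine "queries to upstream" (0x0004) and "replies from upstream" (0x0008):
printf 'dumpmask=0x%04x\n' $(( 0x0004 | 0x0008 ))
# prints: dumpmask=0x000c
```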

If you just want to record everything and later filter this in Wireshark (I typically recommend this) you can just add the two lines

dumpfile=/etc/pihole/dump.pcap
dumpmask=0x00ff

This may be helpful to compare with any other recordings you may be doing. Maybe you can also try a UDP-based ping to your gateway/router and write that into a log file you can later use to check what the result was during the downtime, something like sudo hping3 --udp 192.168.2.1. This will send UDP packets which may be replied to with ICMP::DestinationUnreachable but this should still be enough to check if the connection is still alive.


@Coro Yeah, thanks for the analysis, it is spot-on. This change becomes necessary due to a recent change in the dnsmasq source code. Branch fix/local_upstream_errors should fix this. I also merged it into tweak/debug_for_refused so @furinkan can test it in the affected environment, too (please update FTL, e.g., by running the checkout command again).

You and me both. The build process for this machine is pretty straightforward and there's nothing exotic going on that I can think of.

I wondered if the local ufw-managed firewall might be doing something, but iptables shows none of the DROP or REJECT rules are being hit, with counters all at zero. The only rate-limiting is for logging of blocked packets, and it's on a chain that's never even referenced. And of course, the old pihole machine was set up in a similar manner with ufw and it doesn't see the issue.

Oh that is excellent, thank you, I'll use that for now and just dump everything to a pcap. I've added the hping command to my collection script, so if/when I notice the problem again I can jump onto the pihole host and try to grab some observations.

I've also just run pihole checkout ftl tweak/debug_for_refused to grab the updated tweak branch.


A teleporter backup does not include historical data. It contains settings only.

The historical data is in the query database at /etc/pihole/pihole-FTL.db.

Thanks for the tip. I've done some reading and am pleasantly surprised that I can apparently just copy it from the old pihole to the new one. I expected I'd need to sift the data out from some other config, so this is very nice.


The builtin pcap has been somewhat useful, I spotted an in-progress failure and managed to catch just one failed query against the pihole before it fixed itself. Pinging the local gateway with
hping3 --udp --count 4 192.168.1.1 returns "ICMP Port Unreachable", which I believe is expected, so connectivity to the upstreams should be fine.

I do notice that the packet capture doesn't include any TCP packets, so I don't have any visibility on a query that succeeded right after the failure.

2022-01-17_calico_observed_failure.pcap (75.4 KB)

According to the pihole logs there was a failure on 2022-01-17 between 00:27:16 and 00:28:34 localtime. There's nothing terribly surprising in the pcap itself, the failed queries are simply DNS requests without a corresponding reply.

If you do inspect the pcap, I've trimmed it down to a window around the failure and the final packet should be stamped 2022-01-17 00:29:03.362182 localtime.

  • 192.168.1.1 - usg (internet router)
  • 192.168.1.26 - calico (new pihole with problem)
  • 192.168.1.12 - illustrious (linux box running smokeping in a docker container)
    • 2404:e80:42e3:0:d0c:242:ac11:2 - IPv6 source for smokeping probes
  • 192.168.1.70 - suomi (linux workstation)
  • 192.168.1.71 - wa-chan (windows workstation)

I'm just not sure what to make of it now. If nothing else, it shows a big hole where it didn't even try to contact any upstream OpenDNS servers between 00:26:48 and 00:28:41, using display filter
(ipv6.addr > 2620:119::) || (ip.addr > 208.67.0.0)

Does that make any sense? There are sometimes queries to OpenDNS that never get a reply,
&& (dns.flags.response == 0) && ! dns.response_in
but they're at 00:24:08, a few minutes before the observed failure. So during the window of interest, every upstream query gets a reply, and then pihole just... stops forwarding queries at some point because it thinks the upstreams are unreachable.

I'm starting to think I need that pcap from the network switch as well.

I was afraid this might be the case. Unfortunately, we cannot really do much about it because our analysis above has already shown that it is connect() that comes back with

so the operating system's kernel prevents us from connecting to the specified address. The error code returned by the kernel has to be ENETUNREACH to trigger this error message. So I checked the corresponding Linux kernel source and found that ENETUNREACH means that (a) your system is aware of the route it has to take to get to the requested destination (otherwise, EHOSTUNREACH is returned) but (b) no interface was found that is connected to this route.
This is really puzzling and I don't know how to exactly reproduce this. When I pull the Ethernet cable from the box, I do indeed get "No route to host" (and not "Network is unreachable") as error message myself. This makes me very confident that it is really an operating system issue and not a Pi-hole one. We still want to get this solved, of course. (I'm not going to say "not my responsibility")
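For the record, those two messages map to distinct errno values, which is easy to confirm from a shell (this sketch shells out to python3 purely to look up the C library's message strings):

```shell
# ENETUNREACH vs EHOSTUNREACH: the kernel's two "can't get there" errors.
python3 -c 'import errno, os
print("ENETUNREACH:", os.strerror(errno.ENETUNREACH))
print("EHOSTUNREACH:", os.strerror(errno.EHOSTUNREACH))'
# prints:
# ENETUNREACH: Network is unreachable
# EHOSTUNREACH: No route to host
```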

What is really weird is that you say TCP works during the outage, because this seems entirely impossible from the kernel source code. It explicitly returns here that there is no way to go through the network to the destination. I cannot imagine how this can be protocol-dependent.

Sorry that this does not contain anything helpful, but, nevertheless, I figured it's worth typing my current thoughts and maybe this already sparks another idea. I will keep thinking about it.

This seems to confirm my thoughts above:
