Pi-Hole does not answer DNS queries on TCP before it has answered on UDP

Hello,

I have a pi-hole deployed on my NAS using Docker and a macvlan network (so it has its own IP address and does not interfere with the host's DNS resolver) and I have been trying to get it to answer to DNS queries using TCP (the goal is to setup DNS over TLS using a reverse-proxy).

The weird thing is, if I make the same DNS query with UDP then with TCP, it works:

axel@axel-Inspiron-7370:~$ kdig @192.168.0.153 axeleroy.com +tcp
;; WARNING: response timeout for 192.168.0.153@53(TCP)
;; ERROR: failed to query server 192.168.0.153@53(TCP)
axel@axel-Inspiron-7370:~$ kdig @192.168.0.153 axeleroy.com
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 65363
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0
 
;; QUESTION SECTION:
;; axeleroy.com.       		IN	A
 
;; ANSWER SECTION:
axeleroy.com.       	3600	IN	A	213.186.33.5
 
;; Received 46 B
;; Time 2022-07-24 18:37:58 CEST
;; From 192.168.0.153@53(UDP) in 49.5 ms
axel@axel-Inspiron-7370:~$ kdig @192.168.0.153 axeleroy.com +tcp
;; ->>HEADER<<- opcode: QUERY; status: NOERROR; id: 38467
;; Flags: qr rd ra; QUERY: 1; ANSWER: 1; AUTHORITY: 0; ADDITIONAL: 0

;; QUESTION SECTION:
;; axeleroy.com.       		IN	A
 
;; ANSWER SECTION:
axeleroy.com.       	3598	IN	A	213.186.33.5
 
;; Received 46 B
;; Time 2022-07-24 18:38:00 CEST
;; From 192.168.0.153@53(TCP) in 7.1 ms

To note, I had to add the DNSMASQ_LISTENING=all environment variable to Docker in order to get TCP to work at all. I also have the following capabilities enabled: CAP_NET_BIND_SERVICE, CAP_NET_RAW, CAP_NET_ADMIN and CAP_SYS_NICE.

Has anyone any idea of what I might have missed?

Why do you want to force communication via TCP?

By default, DNS would always make use of UDP, unless there is too much data to fit into a UDP packet, or an upstream wold indicate that it would not process UDP packets larger than a certain threshold.
Only if those conditions are met, a DNS client would switch to more spacious, but also more costly TCP communication.

As I alluded to, I'm trying to implement DNS over TLS in order to use Android's Private DNS feature. Because DNS over TLS is TCP, I then forward the TCP call to my PiHole.

Apparently people have been able to make it work with a reverse-proxy, but for some reason I encounter this bug.

The requirement to use TCP exclusively would only apply to the DoT connection between a client and your reverse proxy.

The subsequent forward from your proxy to Pi-hole should follow the plain DNS protocol.
If your proxy would be set up accordingly, I'd expect it to be unlikely that you would observe that issue.

That said, while I am unable to reproduce your observation with well known domains like google.com... (click for dig result):
$  dig @192.168.127.128 google.com +tcp

; <<>> DiG 9.11.5-P4-5.1+deb10u7-Debian <<>> @192.168.127.128 google.com +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 34852
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;google.com.                    IN      A

;; ANSWER SECTION:
google.com.             300     IN      A       172.217.19.78

;; Query time: 40 msec
;; SERVER: 192.168.127.128#53(192.168.127.128)
;; WHEN: Mon Jul 25 10:24:46 CEST 2022
;; MSG SIZE  rcvd: 55

... I can recreate your observation when forcing resolution of your sample domain via TCP:

~$ dig @192.168.1.28 axeleroy.com +tcp

; <<>> DiG 9.11.5-P4-5.1+deb10u7-Debian <<>> @192.168.1.28 axeleroy.com +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 14307
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 0c ("..")
;; QUESTION SECTION:
;axeleroy.com.                  IN      A

;; Query time: 187 msec
;; SERVER: 192.168.1.28#53(192.168.1.28)
;; WHEN: Mon Jul 25 09:49:26 CEST 2022
;; MSG SIZE  rcvd: 47

Note the EDE error code 00 0c in that answer.
That is also reflected and made human-readable in Pi-hole's log:

Jul 25 09:49:26 dnsmasq[25638]: 2338 192.168.1.28/43803 query[A] axeleroy.com from 192.168.1.28
Jul 25 09:49:26 dnsmasq[25638]: 2338 192.168.1.28/43803 forwarded axeleroy.com to 127.0.1.1#5335
Jul 25 09:49:26 dnsmasq[25638]: 2339 dnssec-query[DS] axeleroy.com to 127.0.1.1#5335
Jul 25 09:49:26 dnsmasq[25638]: 2338 192.168.1.28/43803 reply axeleroy.com is 213.186.33.5
Jul 25 09:49:26 dnsmasq[25638]: 2338 192.168.1.28/43803 validation axeleroy.com is BOGUS (EDE: NSEC(3) missing)

Note that the failure of that lookup is related to DNSSEC.
If you own or administrate axeleroy.com, you may want to verify its DNSSEC configuration.

However, I also tried a forced TCP lookup for sigok.verteiltesysteme.net, a well known DNSSEC signed domain that's also known to behave correctly.
And indeed, that again triggers the same EDE error code:

~$ dig @192.168.1.28 sigok.verteiltesysteme.net +tcp 

; <<>> DiG 9.11.5-P4-5.1+deb10u7-Debian <<>> @192.168.1.28 sigok.verteiltesysteme.net +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 58895
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
; OPT=15: 00 0c ("..")
;; QUESTION SECTION:
;sigok.verteiltesysteme.net.    IN      A

;; Query time: 107 msec
;; SERVER: 192.168.1.28#53(192.168.1.28)
;; WHEN: Mon Jul 25 10:09:10 CEST 2022
;; MSG SIZE  rcvd: 61
Jul 25 10:09:10 dnsmasq[25834]: 2662 192.168.1.28/49609 query[A] sigok.verteiltesysteme.net from 192.168.1.28
Jul 25 10:09:10 dnsmasq[25834]: 2662 192.168.1.28/49609 forwarded sigok.verteiltesysteme.net to 127.0.1.1#5335
Jul 25 10:09:10 dnsmasq[25834]: 2663 dnssec-query[DS] verteiltesysteme.net to 127.0.1.1#5335
Jul 25 10:09:10 dnsmasq[25834]: 2662 192.168.1.28/49609 validation sigok.verteiltesysteme.net is BOGUS (EDE: NSEC(3) missing)
Jul 25 10:09:10 dnsmasq[25834]: 2662 192.168.1.28/49609 reply sigok.verteiltesysteme.net is 134.91.78.139

I can also confirm that a preceding successful UDP lookup will allow a subsequent forced TCP lookup to succeed as well, probably because the answer would then be cached.

As this is tied to DNSSEC, this seems to be a bug in upstream dnsmasq rather than pihole-FTL, but let's see if @DL6ER would be able to confirm that suspicion of mine.

I'm unable to reproduce your observation:

dig @192.168.2.10 sigok.verteiltesysteme.net +tcp
; <> DiG 9.18.1-1ubuntu1.1-Ubuntu <> @192.168.2.10 sigok.verteiltesysteme.net +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 24758
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;sigok.verteiltesysteme.net. IN A

;; ANSWER SECTION:
sigok.verteiltesysteme.net. 60 IN A 134.91.78.139

;; Query time: 172 msec
;; SERVER: 192.168.2.10#53(192.168.2.10) (TCP)
;; WHEN: Mon Jul 25 10:59:29 CEST 2022
;; MSG SIZE rcvd: 71

/var/log/pihole/pihole.log
Jul 25 10:59:29 dnsmasq[1907067]: 236316 192.168.2.23/36035 query[A] sigok.verteiltesysteme.net from 192.168.2.23
Jul 25 10:59:29 dnsmasq[1907067]: 236316 192.168.2.23/36035 forwarded sigok.verteiltesysteme.net to 127.0.0.1#5353
Jul 25 10:59:29 dnsmasq[1907067]: 236317 dnssec-query[DS] verteiltesysteme.net to 127.0.0.1#5353
Jul 25 10:59:29 dnsmasq[1907067]: 236317 reply verteiltesysteme.net is DS keytag 61908, algo 5, digest 1
Jul 25 10:59:29 dnsmasq[1907067]: 236317 reply verteiltesysteme.net is DS keytag 61908, algo 5, digest 2
Jul 25 10:59:29 dnsmasq[1907067]: 236318 dnssec-query[DS] verteiltesysteme.net to 127.0.0.1#5353
Jul 25 10:59:29 dnsmasq[1907067]: 236318 reply verteiltesysteme.net is DNSKEY keytag 30665, algo 5
Jul 25 10:59:29 dnsmasq[1907067]: 236318 reply verteiltesysteme.net is DNSKEY keytag 61908, algo 5
Jul 25 10:59:29 dnsmasq[1907067]: 236316 192.168.2.23/36035 validation result is SECURE
Jul 25 10:59:29 dnsmasq[1907067]: 236316 192.168.2.23/36035 reply sigok.verteiltesysteme.net is 134.91.78.139 (DNSSEC signed)

I do neither observe an issue with the other domain:

dig @192.168.2.10 axeleroy.com +tcp
; <> DiG 9.16.1-Ubuntu <> @192.168.2.10 axeleroy.com +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 61922
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;axeleroy.com. IN A

;; ANSWER SECTION:
axeleroy.com. 3600 IN A 213.186.33.5

;; Query time: 167 msec
;; SERVER: 192.168.2.10#53(192.168.2.10)
;; WHEN: Mo Jul 25 11:01:15 CEST 2022
;; MSG SIZE rcvd: 57

/var/log/pihole/pihole.log
Jul 25 11:01:15 dnsmasq[1907640]: 236444 192.168.2.10/37803 query[A] axeleroy.com from 192.168.2.10
Jul 25 11:01:15 dnsmasq[1907640]: 236444 192.168.2.10/37803 forwarded axeleroy.com to 127.0.0.1#5353
Jul 25 11:01:15 dnsmasq[1907640]: 236445 dnssec-query[DS] axeleroy.com to 127.0.0.1#5353
Jul 25 11:01:15 dnsmasq[1907640]: 236445 reply axeleroy.com is DS keytag 53287, algo 8, digest 2
Jul 25 11:01:15 dnsmasq[1907640]: 236446 dnssec-query[DS] axeleroy.com to 127.0.0.1#5353
Jul 25 11:01:15 dnsmasq[1907640]: 236446 reply axeleroy.com is DNSKEY keytag 53287, algo 8
Jul 25 11:01:15 dnsmasq[1907640]: 236446 reply axeleroy.com is DNSKEY keytag 50535, algo 8
Jul 25 11:01:15 dnsmasq[1907640]: 236444 192.168.2.10/37803 validation result is SECURE
Jul 25 11:01:15 dnsmasq[1907640]: 236444 192.168.2.10/37803 reply axeleroy.com is 213.186.33.5 (DNSSEC signed)

As you can see from the logs (and also the response times), nothing was served from cache here.

Question: Which upstream server are you using? I'm using a local unbound instance acting as full recursive resolver.

Same here, upstream unbound at 127.0.1.1#5335.

Version 1.9.0
linked libs: libevent 2.1.8-stable (it uses epoll), OpenSSL 1.1.1n  15 Mar 2022
linked modules: dns64 python subnetcache respip validator iterator

I can trigger a BOGUS reply on Armbian Buster as well as Raspberry Pi OS Buster, both up to date and running the same unbound version.

EDIT:
If I switch my upstream to DNS.WATCH, TCP lookups will succeed:

`dig @192.168.127.26 axeleroy.com +tcp`
; <<>> DiG 9.11.5-P4-5.1+deb10u7-Raspbian <<>> @192.168.127.26 axeleroy.com +tcp
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 19340
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;axeleroy.com.                  IN      A

;; ANSWER SECTION:
axeleroy.com.           3596    IN      A       213.186.33.5

;; Query time: 9428 msec
;; SERVER: 192.168.127.26#53(192.168.127.26)
;; WHEN: Mon Jul 25 11:57:50 CEST 2022
;; MSG SIZE  rcvd: 57
`pihole.log`
Jul 25 11:57:31 dnsmasq[1245]: query[A] axeleroy.com from 192.168.1.26
Jul 25 11:57:41 dnsmasq[1247]: query[A] axeleroy.com from 192.168.1.26
Jul 25 11:57:46 dnsmasq[1245]: forwarded axeleroy.com to 84.200.69.80
Jul 25 11:57:46 dnsmasq[1245]: dnssec-query[DS] axeleroy.com to 84.200.69.80
Jul 25 11:57:46 dnsmasq[1245]: reply axeleroy.com is DS keytag 53287, algo 8, digest 2
Jul 25 11:57:46 dnsmasq[1245]: dnssec-query[DNSKEY] axeleroy.com to 84.200.69.80
Jul 25 11:57:46 dnsmasq[1245]: reply axeleroy.com is DNSKEY keytag 53287, algo 8
Jul 25 11:57:46 dnsmasq[1245]: reply axeleroy.com is DNSKEY keytag 50535, algo 8
Jul 25 11:57:46 dnsmasq[1245]: reply axeleroy.com is 213.186.33.5
Jul 25 11:57:50 dnsmasq[1247]: forwarded axeleroy.com to 84.200.69.80
Jul 25 11:57:50 dnsmasq[1247]: dnssec-query[DS] axeleroy.com to 84.200.69.80
Jul 25 11:57:50 dnsmasq[1247]: reply axeleroy.com is DS keytag 53287, algo 8, digest 2
Jul 25 11:57:50 dnsmasq[1247]: dnssec-query[DNSKEY] axeleroy.com to 84.200.69.80
Jul 25 11:57:50 dnsmasq[1247]: reply axeleroy.com is DNSKEY keytag 53287, algo 8
Jul 25 11:57:50 dnsmasq[1247]: reply axeleroy.com is DNSKEY keytag 50535, algo 8
Jul 25 11:57:50 dnsmasq[1247]: reply axeleroy.com is 213.186.33.5

I'm running my Pi-hole on x86_64 hardware with Ubuntu 20.04.4 LTS as operating system.

Version 1.9.6

Configure line: 
Linked libs: mini-event internal (it uses select), OpenSSL 1.1.1f  31 Mar 2020
Linked modules: dns64 respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues

Let's wait for @axeleroy to see what he uses as upsteram. If this is really the difference, we need to turn on packet dumping for further inspection of the difference. Hopefully, we do not need to go as far as recording all port 53 activity to inspect what unbound is doing itself...

I'm not using my own recursive DNS resolver but instead I'm using 80.67.169.12 and 80.67.169.40 from the non-profit ISP FND.

In the meantime I'll try to see how I can setup nginx streams to forward TCP requests to UDP.

Why use theirs and not your own? With your own, you have complete control.

I know, I should setup unbound, but that's not the issue right now.

Thanks for sharing the IP addresses. The first one is either not working or not reachable from my location but the second one works and shows a SERVFAIL for me, too:

Jul 25 18:27:16 dnsmasq[389902]: 101 127.0.0.1/49947 query[A] axeleroy.com from 127.0.0.1
Jul 25 18:27:16 dnsmasq[389902]: 101 127.0.0.1/49947 forwarded axeleroy.com to 80.67.169.40
Jul 25 18:27:17 dnsmasq[389902]: 102 dnssec-query[DS] com to 80.67.169.40
Jul 25 18:27:17 dnsmasq[389902]: 103 dnssec-query[DS] . to 80.67.169.40
Jul 25 18:27:17 dnsmasq[389902]: 103 reply . is DNSKEY keytag 20826, algo 8
Jul 25 18:27:17 dnsmasq[389902]: 103 reply . is DNSKEY keytag 20326, algo 8
Jul 25 18:27:17 dnsmasq[389902]: 102 reply com is DS keytag 30909, algo 8, digest 2
Jul 25 18:27:17 dnsmasq[389902]: 104 dnssec-query[DS] axeleroy.com to 80.67.169.40
Jul 25 18:27:17 dnsmasq[389902]: 105 dnssec-query[DS] com to 80.67.169.40
Jul 25 18:27:17 dnsmasq[389902]: 105 reply com is DNSKEY keytag 30909, algo 8
Jul 25 18:27:17 dnsmasq[389902]: 105 reply com is DNSKEY keytag 32298, algo 8
Jul 25 18:27:17 dnsmasq[389902]: 104 reply axeleroy.com is DS keytag 53287, algo 8, digest 2
Jul 25 18:27:17 dnsmasq[389902]: 101 127.0.0.1/49947 validation axeleroy.com is ABANDONED
Jul 25 18:27:17 dnsmasq[389902]: 101 127.0.0.1/49947 reply axeleroy.com is 213.186.33.5

when simply switching to my local unbound, everything starts working as it should.
Everything also works fine when switching Google's DNS server 8.8.8.8 (not meant as a recommendation):

Jul 25 18:28:40 dnsmasq[390106]: 1 127.0.0.1/38755 query[A] axeleroy.com from 127.0.0.1
Jul 25 18:28:40 dnsmasq[390106]: 1 127.0.0.1/38755 forwarded axeleroy.com to 8.8.8.8
Jul 25 18:28:40 dnsmasq[390106]: 2 dnssec-query[DS] com to 8.8.8.8
Jul 25 18:28:40 dnsmasq[390106]: 3 dnssec-query[DS] . to 8.8.8.8
Jul 25 18:28:40 dnsmasq[390106]: 3 reply . is DNSKEY keytag 20826, algo 8
Jul 25 18:28:40 dnsmasq[390106]: 3 reply . is DNSKEY keytag 20326, algo 8
Jul 25 18:28:40 dnsmasq[390106]: 2 reply com is DS keytag 30909, algo 8, digest 2
Jul 25 18:28:40 dnsmasq[390106]: 4 dnssec-query[DS] axeleroy.com to 8.8.8.8
Jul 25 18:28:40 dnsmasq[390106]: 5 dnssec-query[DS] com to 8.8.8.8
Jul 25 18:28:40 dnsmasq[390106]: 5 reply com is DNSKEY keytag 32298, algo 8
Jul 25 18:28:40 dnsmasq[390106]: 5 reply com is DNSKEY keytag 30909, algo 8
Jul 25 18:28:40 dnsmasq[390106]: 4 reply axeleroy.com is DS keytag 53287, algo 8, digest 2
Jul 25 18:28:40 dnsmasq[390106]: 6 dnssec-query[DS] axeleroy.com to 8.8.8.8
Jul 25 18:28:40 dnsmasq[390106]: 6 reply axeleroy.com is DNSKEY keytag 53287, algo 8
Jul 25 18:28:40 dnsmasq[390106]: 6 reply axeleroy.com is DNSKEY keytag 50535, algo 8
Jul 25 18:28:40 dnsmasq[390106]: 1 127.0.0.1/38755 validation result is SECURE
Jul 25 18:28:40 dnsmasq[390106]: 1 127.0.0.1/38755 reply axeleroy.com is 213.186.33.5 (DNSSEC signed)

Hence, this very much looks like either a misconfigured or otherwise broken upstream DNS server answering differently to TCP than UDP queries.
The next step would be to do deep package inspection of the DNS traffic but this is a quite laborious task and I don't see much of a chance that this would still turn out to be a Pi-hole bug.

Once validation worked over UDP, the result is cached and subsequent TCP queries serve the already validated answer from cache (independent of the query protocol). No further lookups are done within the TTL.

I was unsuccessful when trying to verfify which DNS server is residing at 80.67.169.40 by running

dig +short @80.67.169.40 chaos txt version.bind

80.67.169.40 refused that request.

But according to DNS servers in France, ns1.fdn.fr is running the same unbound version as me:
ns1.fdn.fr

This would suggest that the observed unwanted behaviour may lie with a specific version of unbound rather than Pi-hole.

1 Like

I looked through the unbound Changelog but didn't find anything interesting when looking for "TCP" between versions 1.9.0 (05 February 2019) to 1.9.6 (12 December 2019).

There was one commit:

but if this applies here is questionable. No result at all for "DNSSEC" is the corresponding time window in the Changelog.

Well, I no longer reproduce the issue when using my usual upstream DNS servers, maybe there was a configuration issue on their end that they resolved?

Anyway, that unpleasant experience convinced me that I should deploy my own recursive DNS server.

Edit: They updated their DNS server like 12 hours ago, so it no longer times out when requested over TCP. So the issue was definitely on their end.