Two Pi-hole instances, One with "failed to send..."

Hi,
I have again started to see lots of "failed to send TCP(read_write) packet (Connection prematurely closed by remote server)" messages (I have shared this previously).
I have two instances of Pi-hole on my LAN, one for my wired LAN and one for my wireless LAN, each on a different port and IP address.
Only the Pi-hole that serves the wired LAN shows the message. The wired LAN also has fewer clients and less activity.
Unbound is running on both Pi-hole instances.
On each I use "Bind only to interface".
FTL log from this morning.
seh2000 FTL-log 05Jan2024.txt (28.8 KB)
Steen

I would suggest running a debug and posting the token. Your log shows the connection errors when using 127.0.0.1:5335, which indicates you're using unbound. Have you looked at your unbound config or tried to test unbound as detailed in the guide provided here?
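
For reference, the basic test from that guide is a direct dig against unbound on its port, something like

dig pi-hole.net @127.0.0.1 -p 5335

(the domain is only an example). You should get status: NOERROR, and repeating the same query should come back faster from the cache. If that already fails or stalls intermittently, the problem is on the unbound side rather than in Pi-hole.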

Do you have the same issue if you configure a different upstream resolver instead of unbound?

Thanks.
Both Pi-hole instances are using Unbound.
Both instances of Pi-hole run on the same Raspberry Pi model and the same Pi-hole version.
Unbound was installed and setup as per unbound - Pi-hole documentation.
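
For reference, the config from that guide ends up roughly in this shape in /etc/unbound/unbound.conf.d/pi-hole.conf (trimmed excerpt; exact values may differ from the current guide):

server:
    verbosity: 0
    interface: 127.0.0.1
    port: 5335
    do-ip4: yes
    do-udp: yes
    do-tcp: yes
    do-ip6: no
    prefetch: yes
    edns-buffer-size: 1232

so TCP queries from Pi-hole to 127.0.0.1#5335 are expected to work.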

My curiosity is really why one shows the "failed to send TCP(read_write) packet (Connection prematurely closed by remote server)" while the other one doesn't.
The one that shows it has less activity and fewer clients.
Each Pi-hole runs on its own Raspberry Pi 4B+.
Pi-hole token for Pi-hole A, which shows the "issue": https://tricorder.pi-hole.net/f1kJG85o/
Pi-hole token for Pi-hole B, which doesn't show the "issue": https://tricorder.pi-hole.net/WKp1LSgj/

Raspberry Pi A runs an instance of the speedtest-tracker app (for years)
Raspberry Pi B runs an instance of the UniFi Network manager (for years)

Pi-hole A (screenshot)

Pi-hole B (screenshot)

A few questions that come to mind (in no particular order):

  • Are both Pis of the same type (or is one maybe a Pi 1B+ and the other a 4)?
  • Are both Unbound servers of the same version?
  • Is there anything special about them that tells them apart?
  • Have you tried swapping them to check if this is a software- or hardware-caused issue? Should be as simple as swapping the SD cards.

Hi DL6ER,
Thanks for coming back to me :slight_smile:
First I'll answer your questions.
Q: Are both Pis of the same type (or is one maybe a Pi 1B+ and the other a 4)?
A: Yes, both are Pi 4B+ (same HW, same OS)
Q: Are both Unbound servers of the same version?
A: Yes, checked twice and verified
Q: Is there anything special about them that tells them apart?
A: Yes:
Raspberry Pi A (the one that shows the issue) runs an instance of the speedtest-tracker app in Docker (for years)
Raspberry Pi B runs an instance of the UniFi Network manager (for years)
Q: Have you tried swapping them to check if this is a software- or hardware-caused issue? Should be as simple as swapping the SD cards.
A: Yes, I swapped the SD cards (and swapped them back). Still the same: it is the Pi that sits on the wired LAN that shows the issue

Now, I sort of started from scratch :wink: so:

  • I moved the speedtest-tracker to a 3rd Pi used for flight tracking
  • Re-installed OS + Pi-hole on both
  • Left UniFi Network Manager on the Pi (B) that sits on the WiFi
  • Changed from Unbound on both to Cloudflare

No issue seen for the past 6 days.
Later today I will change back to Unbound on both to see if the issue pops up.

Hi DL6ER,
Enabled unbound yesterday on both Pi instances; after a few hours the "issue" started again on Pi A (LAN), while Pi B (WiFi) doesn't show the "issue".
Not sure which logs to search or what debugging to enable to see if I can catch the reason.
Any suggestions?

It is probably difficult to debug this. Maybe we could run a tcpdump capturing all the (TCP-only!) DNS traffic and then later use the log to determine which query was showing issues, trying to find it in the recording using Wireshark.

But we should first start with you trying

sudo pihole checkout ftl tweak/conn_errors

which has a bit of tweaking to not log this when a simple retry to the same server immediately succeeds. Seems a wise idea to try this first. What we are looking for here is whether the errors are gone on this custom branch.
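
You can verify that the branch is active afterwards with

pihole -v

where the FTL line should mention the tweak/conn_errors branch (the exact output format may vary between versions).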

Hi DL6ER,
Agree, will start with the simpler way first.

ftl tweak/conn_errors

Enabled.

Updated 21/01
The "issue" still pops up, see attached file (copied out from the FTL.log)
Pi4-1 21-01-2025.txt (6.8 KB)

Hi DL6ER,
I have kept a closer watch on both Pi instances after I enabled Unbound on both and enabled the FTL tweak/conn_errors branch, and to my surprise I see the "issue" on both Pi instances.
Not very often on the 2nd Pi.
Interestingly, the "issue" appears to happen around an NTP check. This could of course be a coincidence, but I wanted to mention it.

Pi A
2025-01-21 21:19:03.824 INFO Time offset: 1.394719e+00 ms (excluded 0 outliers)
2025-01-21 21:19:03.824 INFO Round-trip delay: 5.113423e+00 ms (excluded 0 outliers)
2025-01-21 21:55:39.067 WARNING Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (Connection prematurely closed by remote server)
2025-01-21 22:12:09.129 WARNING Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (Connection prematurely closed by remote server)
2025-01-21 23:36:04.536 INFO Received 8/8 valid NTP replies from pool.ntp.org

Pi B
2025-01-22 05:37:23.126 INFO Round-trip delay: 9.762764e+00 ms (excluded 1 outliers)
2025-01-22 06:28:07.174 WARNING Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (Connection prematurely closed by remote server)
2025-01-22 06:37:24.189 INFO Received 8/8 valid NTP replies from pool.ntp.org
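
(Lines like these can be pulled out of the log with a simple grep, e.g.

grep -E "Connection error|NTP" /var/log/pihole/FTL.log

just to see the warnings next to the NTP messages.)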

I don't really see any relation to the NTP syncs here; they are at least ten minutes away in the snippets you posted.

Agreed, just wanted to mention it...
I have - again - disabled Unbound and am using one of the predefined upstream servers (Cloudflare), and the issue is then gone. I still use the ftl tweak/conn_errors version.
As soon as I enable Unbound the issue pops back up; when not using Unbound the issue stops.
Suggestion for further troubleshooting?

Yes, the next step would probably mean setting

debug.queries = true

and using tcpdump to record all the traffic between your Pi-hole and your unbound.

Then, when it happens again, the debug output in /var/log/pihole/FTL.log will allow you to identify exactly which query caused the issue, and the pcap file should reveal what is going on. dnsmasq has an internal pcap recording feature (in FTL, it is the files.pcap config option) but it does not record TCP traffic - because it is not supposed to do this - I am not exactly sure why, TBH.
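
In case it helps: how you set this depends on the Pi-hole version. On v6 it lives in /etc/pihole/pihole.toml (the queries entry in the debug section), on v5 the equivalent is DEBUG_QUERIES=true in /etc/pihole/pihole-FTL.conf. Either way, restart FTL afterwards so it picks up the change:

sudo service pihole-FTL restart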

I'll try to get this set up to capture.
Will be a first :slight_smile:
Any further input on how to do this will be very much appreciated.

Hi DL6ER,
To use tcpdump for the capture, would the command below (just without the -c option) be the correct one to use?

sudo tcpdump -i any -c10 -nn -w pihole.pcap src 127.0.0.1 and port 5335

Then to read:

tcpdump -nn -r pihole.pcap

Thanks - Steen

Example

admin@Pi4-1:~ $ sudo tcpdump -i any -c10 -nn -v -w pihole.pcap src 127.0.0.1 and port 5335
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10 packets captured
44 packets received by filter
0 packets dropped by kernel

admin@Pi4-1:~ $ tcpdump -nn -r pihole.pcap
reading from file pihole.pcap, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144
Warning: interface names might be incorrect
11:25:50.043603 lo    In  IP 127.0.0.1.35162 > 127.0.0.1.5335: Flags [S], seq 2641623240, win 65495, options [mss 65495,sackOK,TS val 1096521723 ecr 0,nop,wscale 7,tfo  cookiereq,nop,nop], length 0
11:25:50.043683 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35162: Flags [S.], seq 536522800, ack 2641623241, win 65483, options [mss 65495,sackOK,TS val 1096521723 ecr 1096521723,nop,wscale 7], length 0
11:25:50.043782 lo    In  IP 127.0.0.1.35162 > 127.0.0.1.5335: Flags [P.], seq 1:46, ack 1, win 512, options [nop,nop,TS val 1096521723 ecr 1096521723], length 45
11:25:50.043877 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35162: Flags [.], ack 46, win 512, options [nop,nop,TS val 1096521723 ecr 1096521723], length 0
11:25:50.045500 lo    In  IP 127.0.0.1.35174 > 127.0.0.1.5335: Flags [S], seq 2737083423, win 65495, options [mss 65495,sackOK,TS val 1096521725 ecr 0,nop,wscale 7,tfo  cookiereq,nop,nop], length 0
11:25:50.045574 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35174: Flags [S.], seq 606931830, ack 2737083424, win 65483, options [mss 65495,sackOK,TS val 1096521725 ecr 1096521725,nop,wscale 7], length 0
11:25:50.045661 lo    In  IP 127.0.0.1.35174 > 127.0.0.1.5335: Flags [P.], seq 1:46, ack 1, win 512, options [nop,nop,TS val 1096521725 ecr 1096521725], length 45
11:25:50.045748 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35174: Flags [.], ack 46, win 512, options [nop,nop,TS val 1096521725 ecr 1096521725], length 0
11:25:50.124486 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35162: Flags [P.], seq 1:138, ack 46, win 512, options [nop,nop,TS val 1096521804 ecr 1096521723], length 137
11:25:50.131025 lo    In  IP 127.0.0.1.35162 > 127.0.0.1.5335: Flags [P.], seq 46:91, ack 138, win 512, options [nop,nop,TS val 1096521810 ecr 1096521804], length 45

Sorry for the delay. You can use what you proposed; I'd probably have used

sudo tcpdump -i lo -w /tmp/dns.pcap -n -p tcp port 5335

to capture TCP traffic to/from port 5335 on interface lo and save it into /tmp/dns.pcap, but the difference is small. You can thereafter use Wireshark to open this PCAP for easy inspection.
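
If the capture has to run for many hours before the warning appears, you can also let tcpdump rotate the output so the files stay manageable, e.g. something like

sudo tcpdump -i lo -n -w /tmp/dns-%Y%m%d-%H%M.pcap -G 3600 tcp port 5335

which starts a new file every hour (the -G option needs the strftime placeholders in the file name).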

Hint

In case Wireshark shows only raw TCP traffic (as in: it does not automatically decode as DNS), you would see only TCP as protocol but never DNS, something like:


This can happen because port 5335 is not associated with DNS by default (e.g., 5353 would be). You can fix this by navigating to the menu Analyze -> Decode As... and assigning DNS to TCP port 5335 manually, like

Thereafter, Wireshark can correctly decode your packet recordings:

The remaining TCP stuff is protocol-inherent overhead and nothing to worry about. It may still be interesting to look around the location where you see the warning, as a prematurely closed connection can also happen because ACKs did not arrive or because of other errors indicating the upstream closed the socket.
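
If you prefer the command line, tshark can apply the same decode-as mapping when reading the file, e.g.

tshark -r /tmp/dns.pcap -d tcp.port==5335,dns

which should print the decoded DNS queries and responses directly (assuming tshark is installed; it ships with the Wireshark packages).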