Two Pi-hole instances, One with "failed to send..."

Hi,
I have again started to see lots of "failed to send TCP(read_write) packet (Connection prematurely closed by remote server)" messages (I have shared this previously).
I have two instances of Pi-hole on my LAN, one for my wired LAN and one for my wireless LAN, each on a different port and IP address.
Only the Pi-hole that serves the wired LAN shows the message. The wired LAN also has fewer clients and less activity.
Unbound is running on both Pi-hole instances.
On each I use "Bind only to interface".
FTL log from this morning.
seh2000 FTL-log 05Jan2024.txt (28.8 KB)
Steen

I would suggest running a debug and posting the token. Your log shows the connection errors when using 127.0.0.1:5335, which indicates you're using unbound. Have you looked at your unbound config or tried to test unbound as detailed in the guide provided here?
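
For reference, the basic test from that guide is a direct dig against unbound on its port, something like

dig pi-hole.net @127.0.0.1 -p 5335

(the domain is only an example). You should get status: NOERROR, and repeating the same query should come back faster from the cache. If that already fails or stalls intermittently, the problem is on the unbound side rather than in Pi-hole.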

Do you have the same issue if you configure a different upstream resolver instead of unbound?

Thanks.
Both Pi-hole instances are using Unbound.
Both instances of Pi-hole run on the same Raspberry Pi model and the same Pi-hole version.
Unbound was installed and setup as per unbound - Pi-hole documentation.
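
For reference, the config from that guide ends up roughly in this shape in /etc/unbound/unbound.conf.d/pi-hole.conf (trimmed excerpt; exact values may differ from the current guide):

server:
    verbosity: 0
    interface: 127.0.0.1
    port: 5335
    do-ip4: yes
    do-udp: yes
    do-tcp: yes
    do-ip6: no
    prefetch: yes
    edns-buffer-size: 1232

so TCP queries from Pi-hole to 127.0.0.1#5335 are expected to work.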

My curiosity is really why one shows the "failed to send TCP(read_write) packet (Connection prematurely closed by remote server)" while the other one doesn't.
The one that shows it has less activity and fewer clients.
Each Pi-hole runs on its own Raspberry Pi 4B+.
Pi-hole token for Pi-hole A, which shows the "issue": https://tricorder.pi-hole.net/f1kJG85o/
Pi-hole token for Pi-hole B, which doesn't show the "issue": https://tricorder.pi-hole.net/WKp1LSgj/

Raspberry Pi A runs an instance of the speedtest-tracker app (for years)
Raspberry Pi B runs an instance of the UniFi Network manager (for years)

Pi-hole A (screenshot)

Pi-hole B (screenshot)

A few questions that come to mind (in no particular order):

  • Are both Pis of the same type (or is one maybe a Pi 1B+ and the other a 4)?
  • Are both Unbound servers of the same version?
  • Is there anything special about them that tells them apart?
  • Have you tried swapping them to check if this is a software- or hardware-caused issue? Should be as simple as swapping the SD cards.

Hi DL6ER,
Thanks for coming back to me :slight_smile:
First I'll answer your questions.
Q: Are both Pis of the same type (or is one maybe a Pi 1B+ and the other a 4)?
A: Yes, both are Pi 4B+ (same HW, same OS)
Q: Are both Unbound servers of the same version?
A: Yes, checked twice and verified
Q: Is there anything special about them that tells them apart?
A: Yes:
Raspberry Pi A (the one that shows the issue) runs an instance of the speedtest-tracker app in Docker (for years)
Raspberry Pi B runs an instance of the UniFi Network manager (for years)
Q: Have you tried swapping them to check if this is a software- or hardware-caused issue? Should be as simple as swapping the SD cards.
A: Yes, I swapped the SD cards (and swapped them back). Still the same: it is the Pi that sits on the wired LAN that shows the issue

Now, I sort of started from scratch :wink: so:

  • I moved the speedtest-tracker to a 3rd Pi used for flight tracking
  • Re-installed OS + Pi-hole on both
  • Left UniFi Network Manager on the Pi (B) that sits on the WiFi
  • Changed from Unbound on both to Cloudflare

No issue seen for the past 6 days.
Later today I will change back to Unbound on both to see if the issue pops up.

Hi DL6ER,
Enabled unbound yesterday on both Pi instances; after a few hours the "issue" started again on Pi A (LAN), while Pi B (WiFi) doesn't show the "issue".
Not sure which logs to search or what debugging to enable to see if I can catch the reason.
Any suggestions?

It is probably difficult to debug this. Maybe we could run a tcpdump capturing all the (TCP-only!) DNS traffic and then later use the log to determine which query was showing issues, trying to find it in the recording using Wireshark.

But we should first start with you trying

sudo pihole checkout ftl tweak/conn_errors

which has a bit of tweaking to not log this when a simple retry to the same server immediately succeeds. Seems a wise idea to try this first. What we are looking for here is whether the errors are gone on this custom branch.
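
You can verify that the branch is active afterwards with

pihole -v

where the FTL line should mention the tweak/conn_errors branch (the exact output format may vary between versions).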

Hi DL6ER,
Agree, will start with the simpler way first.

ftl tweak/conn_errors

Enabled.

Updated 21/01
The "issue" still pops up, see attached file (copied out from the FTL.log)
Pi4-1 21-01-2025.txt (6.8 KB)

Hi DL6ER,
I have kept a closer watch on both Pi instances after I enabled Unbound on both and enabled the FTL tweak/conn_errors branch, and to my surprise I see the "issue" on both Pi instances.
Not very often on the 2nd Pi.
Interestingly, the "issue" appears to happen around an NTP check. This could of course be a coincidence, but I wanted to mention it.

Pi A
2025-01-21 21:19:03.824 INFO Time offset: 1.394719e+00 ms (excluded 0 outliers)
2025-01-21 21:19:03.824 INFO Round-trip delay: 5.113423e+00 ms (excluded 0 outliers)
2025-01-21 21:55:39.067 WARNING Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (Connection prematurely closed by remote server)
2025-01-21 22:12:09.129 WARNING Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (Connection prematurely closed by remote server)
2025-01-21 23:36:04.536 INFO Received 8/8 valid NTP replies from pool.ntp.org

Pi B
2025-01-22 05:37:23.126 INFO Round-trip delay: 9.762764e+00 ms (excluded 1 outliers)
2025-01-22 06:28:07.174 WARNING Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (Connection prematurely closed by remote server)
2025-01-22 06:37:24.189 INFO Received 8/8 valid NTP replies from pool.ntp.org
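
(Lines like these can be pulled out of the log with a simple grep, e.g.

grep -E "Connection error|NTP" /var/log/pihole/FTL.log

just to see the warnings next to the NTP messages.)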

I don't really see any relation to the NTP syncs here; they are at least ten minutes away in the snippets you posted.

Agreed, just wanted to mention it...
I have - again - disabled Unbound and am using one of the predefined upstream servers (Cloudflare), and the issue is then gone. I still use the ftl tweak/conn_errors version.
As soon as I enable Unbound the issue pops back up; when not using Unbound the issue stops.
Suggestion for further troubleshooting?

Yes, the next step would probably mean setting

debug.queries = true

and using tcpdump to record all the traffic between your Pi-hole and your unbound.

Then, when it happens again, the debug output in /var/log/pihole/FTL.log will allow you to identify exactly which query caused the issue, and the pcap file should reveal what is going on. dnsmasq has an internal pcap recording feature (in FTL, it is the files.pcap config option) but it does not record TCP traffic - because it is not supposed to do this - I am not exactly sure why, TBH.
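
In case it helps: how you set this depends on the Pi-hole version. On v6 it lives in /etc/pihole/pihole.toml (the queries entry in the debug section), on v5 the equivalent is DEBUG_QUERIES=true in /etc/pihole/pihole-FTL.conf. Either way, restart FTL afterwards so it picks up the change:

sudo service pihole-FTL restart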

I'll try to get this set up to capture.
Will be a first :slight_smile:
Any further input on how to do this will be very much appreciated.

Hi DL6ER,
To use tcpdump for the capture, would the command below (just without the -c option) be the correct one to use?

sudo tcpdump -i any -c10 -nn -w pihole.pcap src 127.0.0.1 and port 5335

Then to read:

tcpdump -nn -r pihole.pcap

Thanks - Steen

Example

admin@Pi4-1:~ $ sudo tcpdump -i any -c10 -nn -v -w pihole.pcap src 127.0.0.1 and port 5335
tcpdump: data link type LINUX_SLL2
tcpdump: listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
10 packets captured
44 packets received by filter
0 packets dropped by kernel

admin@Pi4-1:~ $ tcpdump -nn -r pihole.pcap
reading from file pihole.pcap, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144
Warning: interface names might be incorrect
11:25:50.043603 lo    In  IP 127.0.0.1.35162 > 127.0.0.1.5335: Flags [S], seq 2641623240, win 65495, options [mss 65495,sackOK,TS val 1096521723 ecr 0,nop,wscale 7,tfo  cookiereq,nop,nop], length 0
11:25:50.043683 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35162: Flags [S.], seq 536522800, ack 2641623241, win 65483, options [mss 65495,sackOK,TS val 1096521723 ecr 1096521723,nop,wscale 7], length 0
11:25:50.043782 lo    In  IP 127.0.0.1.35162 > 127.0.0.1.5335: Flags [P.], seq 1:46, ack 1, win 512, options [nop,nop,TS val 1096521723 ecr 1096521723], length 45
11:25:50.043877 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35162: Flags [.], ack 46, win 512, options [nop,nop,TS val 1096521723 ecr 1096521723], length 0
11:25:50.045500 lo    In  IP 127.0.0.1.35174 > 127.0.0.1.5335: Flags [S], seq 2737083423, win 65495, options [mss 65495,sackOK,TS val 1096521725 ecr 0,nop,wscale 7,tfo  cookiereq,nop,nop], length 0
11:25:50.045574 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35174: Flags [S.], seq 606931830, ack 2737083424, win 65483, options [mss 65495,sackOK,TS val 1096521725 ecr 1096521725,nop,wscale 7], length 0
11:25:50.045661 lo    In  IP 127.0.0.1.35174 > 127.0.0.1.5335: Flags [P.], seq 1:46, ack 1, win 512, options [nop,nop,TS val 1096521725 ecr 1096521725], length 45
11:25:50.045748 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35174: Flags [.], ack 46, win 512, options [nop,nop,TS val 1096521725 ecr 1096521725], length 0
11:25:50.124486 lo    In  IP 127.0.0.1.5335 > 127.0.0.1.35162: Flags [P.], seq 1:138, ack 46, win 512, options [nop,nop,TS val 1096521804 ecr 1096521723], length 137
11:25:50.131025 lo    In  IP 127.0.0.1.35162 > 127.0.0.1.5335: Flags [P.], seq 46:91, ack 138, win 512, options [nop,nop,TS val 1096521810 ecr 1096521804], length 45

Sorry for the delay. You can use what you proposed; I'd probably have used

sudo tcpdump -i lo -w /tmp/dns.pcap -n -p tcp port 5335

to capture TCP traffic to/from port 5335 on interface lo and save it into /tmp/dns.pcap, but the difference is small. You can thereafter use Wireshark to open this PCAP for easy inspection.
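
If the capture has to run for many hours before the warning appears, you can also let tcpdump rotate the output so the files stay manageable, e.g. something like

sudo tcpdump -i lo -n -w /tmp/dns-%Y%m%d-%H%M.pcap -G 3600 tcp port 5335

which starts a new file every hour (the -G option needs the strftime placeholders in the file name).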

Hint

In case Wireshark shows only raw TCP traffic (as in: it does not automatically decode as DNS), you would see only TCP as protocol but never DNS, something like:


This can happen because port 5335 is not associated with DNS by default (e.g., 5353 would be). You can fix this by navigating to the menu Analyze -> Decode As... and assigning DNS to TCP port 5335 manually, like

Thereafter, Wireshark can correctly decode your packet recordings:

The remaining TCP stuff is protocol-inherent overhead and nothing to worry about. It may still be interesting to look around the location where you see the warning, as a prematurely closed connection can also happen because ACKs did not arrive or because of other errors indicating the upstream closed the socket.
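
If you prefer the command line, tshark can apply the same decode-as mapping when reading the file, e.g.

tshark -r /tmp/dns.pcap -d tcp.port==5335,dns

which should print the decoded DNS queries and responses directly (assuming tshark is installed; it ships with the Wireshark packages).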