Periodic CONNECTION_ERRORs in Diagnostics - Failed to send TCP (read_write) to 127.0.0.1#5335

I've been running Beta 6 on a Pi4 with the latest Raspbian release that's up to date. I'm using Unbound and everything has been pretty smooth.

However, in the past week or so, I'll find "CONNECTION_ERROR"'s listed in Pi-Hole Diagnostics with "Connection error (127.0.0.1#5335): failed to send TCP(read_write) packet (No error information)". These are listed in FTL.log as well but there isn't much else around it.

In pihole.log, here's what I see about the time of the latest occurrence:

May 2 16:58:42 dnsmasq[26615]: reply waws-prod-blu-229-e5b5.eastus.cloudapp.azure.com is 20.49.104.22
May 2 16:58:42 dnsmasq[26614]: failed to send TCP(read_write) packet: No error information
May 2 16:58:42 dnsmasq[26616]: failed to send TCP(read_write) packet: No error information
May 2 16:58:42 dnsmasq[12647]: query[A] teams.events.data.microsoft.com from 172.16.3.107

DNS resolution before and after the log entry appear functional making it harder to find a root cause.

Here's a tricorder link: https://tricorder.pi-hole.net/jhum8cxg/

Any thoughts?
Thanks!

Guessing this is more unbound and dnssec related and not Pihole specifically. I get the same type of thing running "dig fail01.dnssec.works @127.0.0.1 -p 5335". More often than not, I'll get timeouts and eventually the expected response.

I'll just keep plugging along with v6 and monitor unless there's something specific to look at.

This is indeed interesting. The hint "No error information" indicates that ... well ... there is no error here. Looking into the code, there is an edge-case where the upstream DNS server initially accepts the connection but then closes it without sending a reply (in the case there is technically no error on our side).

Are you using a local unbound ?

If so, it may be interesting to enable logging over there and see if the query can be identified.

One thing concerning your log excerpt:

Notice the PID in square brackets. You can use this to scroll up a few lines in your log and identify the queries triggering the error. Look out for the PIDs 26614 and 26616

May 2 16:xx:xx dnsmasq[26614]: query[A] ....... from ......

and

May 2 16:xx:xx dnsmasq[26616]: query[A] ....... from ......

This is possible as every single TCP query spawns a dedicated TCP worker for them, each one always getting their own unique PID.

Unbound is local. Same Pi4 as V6.

I'm stumped. I did some correlation to the unbound log data from recent errors but nothing in the unbound log info indicated any issues. At least any I can tell.

I may shift all the clients on the network back to V5 Pi4s w/ unbound for the weekend and see what I get. Maybe rebuild this V6 Pi4.

Thanks for the info!

I don't think you have to do anything if the logs do not show anything unusual. This may be completely normal behavior in your particular case. These connection warning have only just recently been added and things have likely always been like that, you just didn't know.

I will probably silence this new warning in case there is no actual error to report. Thanks for bringing this up!

1 Like

Thanks for looking into this and for the update. And for all the time and effort on pi-hole one killer solution. V6 has been solid so far.

Update applied and I'll continue to have fun testing and observing.

The patch has been merged into development-v6

Just a question, will this be applied to the ftl fix/dhcp_names? Understand different issues, but to keep improvements/fixes at same level in order not to fall too far behind when on the ftl fix/dhcp_names version.
Thanks for all the hard work to you and the other developers to get Pi-hole to become an even greater solution/application with Beta 6!

This patch was really only cosmetic, we hope that fix/dhcp_names shows something in the (very!) near future so we can fix the underlying issue and merge everything together into development-v6 so anyone can benefit.

1 Like

Thank again. I had been running with "tweak/tcp_conn_err" the past 24 hours without any issues or reports of the "error" in diagnostics as soon as it was in place. I've returned to development-v6 and will continue to test away.

Now to figure out what might have caused "Connection prematurely closed by remote server". Love these things happen when I've tuned down Unbound logging. :slight_smile:

A quick look had 2 entries after AAAA requests to "dc.services.visualstudio.com".

Tried going to "dc.services.visualstudio.com" which is probably not web facing and triggered a 3rd error and all three have similar output. (and a 4th after I turned off IPS/IDS just to see if it was interfering).

May 6 20:55:47 dnsmasq[26976]: query[AAAA] dc.services.visualstudio.com from 172.16.3.107
May 6 20:55:47 dnsmasq[26976]: cached-stale dc.services.visualstudio.com is
May 6 20:55:47 dnsmasq[26976]: cached-stale dc.applicationinsights.microsoft.com is
May 6 20:55:47 dnsmasq[26976]: cached-stale dc.applicationinsights.azure.com is
May 6 20:55:47 dnsmasq[26976]: cached-stale global.in.ai.monitor.azure.com is
May 6 20:55:47 dnsmasq[26976]: cached-stale global.in.ai.privatelink.monitor.azure.com is
May 6 20:55:47 dnsmasq[26976]: cached-stale dc.trafficmanager.net is
May 6 20:55:47 dnsmasq[26976]: cached-stale westus2-global.in.applicationinsights.azure.com is
May 6 20:55:47 dnsmasq[26976]: cached-stale gig-ai-prod-westus2-global.trafficmanager.net is
May 6 20:55:47 dnsmasq[26976]: cached-stale gig-ai-g-prod-westus2-2-app-v4-tag.westus2.cloudapp.azure.com is NODATA-IPv6
May 6 20:55:47 dnsmasq[26976]: failed to send TCP(read_write) packet: Connection prematurely closed by remote server

As said above, this may be perfectly normal (as in: it has always happened, not as in: it is expected). Before this rather recent change, we simple have not logged such events. Interestingly, I do too have a local unbound running receiving both UDP and TCP traffic and haven't had a single occurrence of this warning in almost three weeks since this was added.

No worries. I'll let it all keep going. Thanks!