Two Pi-hole instances, One with "failed to send..."

DL6ER · February 1, 2025, 8:08am

@seh2000 Seems I need one more thing, could you provide the log lines from the /var/log/pihole/pihole.log file around the timestamps we've seen this in the FTL.log above (like 2025-01-30 19:00:34)? It may have been rotated into pihole.log.1 away over night depending on your timezone but it will still be "somewhere".

edit Indeed, the other timestamps are probably even more interesting, like 2025-01-30 19:42:41, basically everything you had above

seh2000 · February 1, 2025, 11:27am

@DL6ER
Hello - hope you could sleep a bit longer than to 04:30am!!!

I had a look at the pihole.log.2 file, where I found several more "incidents", same in FTL.log.2 log where I found same.
Because I am not 100% sure what to look for around the time of the "incidents" I decided to attached both files. Not to make my work easier!! But to be sure not to miss what you are looking for.
I had to compress them as they were to big for attaching.

The two log files had rotated, I am in GMT time zone.

Thanks your great work with Pi-hole! And for looking at this "incidents"!

Warm Regards and have a great weekend - Steen

Ps: let me know if you could unzip and if you could use the content in the two files, and for sure if you need additional from me.

SH Pi4-1 31-01-25 FTL.log.2.zip (1.8 MB)
SH Pi4-1 31.01.25 pihole.log.2.zip (828.0 KB)

DL6ER · February 1, 2025, 12:14pm

Unfortunately no long sleeping this Saturday but it's neither unusual nor unexpected. Thank you for the additional data. I am seeing now that an additional requirement to trigger this seems to be that two independent TCP queries are done at the same time, it always seems to be one for A sync.opera.com and one for HTTPS sync.opera.com. They have very short TTLs so both will then usually be served from stale cache. One TCP worker manages to refresh the stale cache data, the other one gets the error message we're discussing here.

Investigation ongoing, thanks for providing the additional logs so swiftly! It's sunny over here right now so I will work on this later again! Cheers, Dominik

seh2000 · February 1, 2025, 1:13pm

Yes, on one or two PC s opera is the preferred browser.
Lucky you are with sun, here just rain and rain.
Stay safe.
Cheers Steen

DL6ER · February 2, 2025, 5:39am

Good morning. We've been discussing this further internally and our conclusion is currently that the two almost simultaneous queries for the same domain

is tripping some anti-query-flooding logic in the server at 127.0.0.1#5353 and it's closing the connection. Looking at the logs of unbound will likely be useful. It's probably some designed feature that can be tuned in unbound by some setting.

Hoping for less clouds + rain wherever you are today!

seh2000 · February 2, 2025, 2:28pm

Good afternoon,
Still raining here...
For Unbound log file, I believe logging needs to be enabled in the config file.
I think like the below.

/etc/unbound/unbound.conf.d/pi-hole.conf

logfile: "/var/log/unbound/unbound.log"
log-time-ascii: yes
verbosity: 2

After I'll share the 3 log files.

seh2000 · February 2, 2025, 5:38pm

Hello,
Attached log files for Unbound, Pi-hole and FTL.
There are several "incidents":
15:02:05.625
15:04:15.176
15:04:15.392
17:03:47.928
17:12:24.236
and others...
Interesting on PC uses Opera GX (192.168.2.34) but from what I can see don't created the "issue". Opera GX is a bit different from Opera Normal, GX is for Gamers.
Cheers Steen
SH-02-02-25 FTL.log.zip (1.5 MB)
SH-02-02-25 pihole.log.zip (473.7 KB)
SH-020-02-25 unbound.log.zip (135.1 KB)

DL6ER · February 2, 2025, 6:17pm

Thanks, this is interesting, we do now get an EDE error I have not seen before, e.g.,

Feb  2 15:04:14 dnsmasq[59662]: query[HTTPS] co-merchandise.opera-api.com from 192.168.1.40
Feb  2 15:04:15 dnsmasq[59662]: failed to send TCP(read_write) packet: Connection prematurely closed by remote server
Feb  2 15:04:15 dnsmasq[59662]: config error is REFUSED (EDE: network error)

Feb  2 15:04:15 dnsmasq[59663]: query[HTTPS] co-merchandise.opera-api.com from 192.168.1.40
Feb  2 15:04:15 dnsmasq[59663]: failed to send TCP(read_write) packet: Connection prematurely closed by remote server
Feb  2 15:04:15 dnsmasq[59663]: config error is REFUSED (EDE: network error)

but unbound does not even see any of these (co-merchandise.opera-api.com. IN HTTPS) in its logs around that time. So either it is not even receiving it or it is rejecting them at a state that is not logged in verbosity: 2... But I don't really know if a higher verbosity would log additional stuff. I still wouldn't hurt...

I also brought tweak/conn_errors to the latest development state and updating this would also bring us some additional output in pihole.log.

seh2000 · February 2, 2025, 7:52pm

Hi
Yes, I noticed. I assumed it was me who didn't know what to look for.
Which of the level you suggest? I used level 2.

Level 0 means no verbosity, only errors
Level 1 gives operational information
Level 2 gives  detailed operational  information
Level 3 gives query level information
Level 4 gives  algorithm  level  information
Level 5 logs client identification for cache misses

Shall I change to the normal dev branch?

DL6ER · February 2, 2025, 8:04pm

I'd say let's try the maximum logging even when this is probably (hopefully) overkill. Even after reading a lot through unbound docs, I am not sure at which level we'd see stuff like rate-limiting of clients.

Please update on the tweak/conn_errors branch. I added additional debug output that is only on this branch.

seh2000 · February 3, 2025, 8:15pm

Hi,
Sorry...which level would be maximum logging, I assumed it was 2?
Branch updated.

Bucking_Horn · February 3, 2025, 9:16pm

That would be Level 5:

seh2000 · February 4, 2025, 8:49pm

Hi Dominik,
Changed verbosity: 2 to verbosity: 5 had a look at several of the time stamps with the "issue" from pihole.log and FTL.log, didn't see anything. Most likely because I am not sure what to look for.
The unbound.log was even after being zipped too large to be attached.
All 3 I uploaded to a shared folder. Let me know if you could access the files. I sent you PM with link and password
Warm regards - Steen

DL6ER · February 5, 2025, 6:42pm

Thanks, I got all the files. Sadly, no immediate answer is apparent, however, while you were preparing the more verbose logs, I checked again on your pcap file from above. Using the filter

dns && (dns.flags.response == 0) && ! dns.response_in

I reduced the shown packages to those who were sent to unbound and were never replied to. Not many, but certainly there is an issue on unbound's side here:

This matches the related log file perfectly because it also has the Connection prematurely closed by remote server message for exactly these five queries. So the problem is really not with Pi-hole / dnsmasq but with unbound and the warning is actually correct.

I can confirm that in each case, unbound terminated/closed the connection. In all working cases, however, dnsmasq closed the connection on its side once the reply was received.

Sadly, me neither. The logs are very verbose (fair, this is what we asked for). Let me try to summarize one of the failed calls here:

Feb 04 11:49:59 unbound[1639:0] debug: startlistening 13 mode r
Feb 04 11:49:59 unbound[1639:0] debug: Reading tcp query of length 49
Feb 04 11:49:59 unbound[1639:0] debug: answer from the cache failed
Feb 04 11:49:59 unbound[1639:0] debug: tcp request from ip4 127.0.0.1 port 51536 (len 16)
Feb 04 11:49:59 unbound[1639:0] debug: mesh_run: start
Feb 04 11:49:59 unbound[1639:0] debug: subnetcache[module 0] operate: extstate:module_state_initial event:module_event_new
Feb 04 11:49:59 unbound[1639:0] info: subnetcache operate: query speeddials.opera.com. HTTPS IN


Feb 04 11:49:59 unbound[1639:0] debug: comm point stop listening 12
Feb 04 11:49:59 unbound[1639:0] debug: comm point start listening 12 (300 msec)
Feb 04 11:49:59 unbound[1639:0] debug: startlistening 12 mode r
Feb 04 11:49:59 unbound[1639:0] debug: Reading tcp query of length 49
Feb 04 11:49:59 unbound[1639:0] debug: answer from the cache failed
Feb 04 11:49:59 unbound[1639:0] debug: tcp request from ip4 127.0.0.1 port 51538 (len 16)
Feb 04 11:49:59 unbound[1639:0] debug: mesh_run: start
Feb 04 11:49:59 unbound[1639:0] debug: subnetcache[module 0] operate: extstate:module_state_initial event:module_event_new
Feb 04 11:49:59 unbound[1639:0] info: subnetcache operate: query speeddials.opera.com. A IN

... after a long time ...


Feb 04 11:50:00 unbound[1639:0] debug: serviced_delete
Feb 04 11:50:00 unbound[1639:0] debug: close of port 62962
Feb 04 11:50:00 unbound[1639:0] debug: comm_point_close of 12: event_del
Feb 04 11:50:00 unbound[1639:0] debug: close fd 12

and the A lookup succeeded!

What happens to the lost son, the HTTPS query for the same domain? Well, unbound finishes to resolve this, too, but then:


Feb 04 11:50:00 unbound[1639:0] debug: comm point stop listening 13
Feb 04 11:50:00 unbound[1639:0] debug: comm point start listening 13 (300 msec)
Feb 04 11:50:00 unbound[1639:0] debug: startlistening 13 mode r
Feb 04 11:50:00 unbound[1639:0] debug: tcp channel read side closed 13
Feb 04 11:50:00 unbound[1639:0] debug: comm_point_close of 13: event_del
Feb 04 11:50:00 unbound[1639:0] debug: close fd 13

I am not sure what is happening here. Seems like unbound thinks the connection was closed from dnsmasq's end and, hence, never sends a reply.

To be fully honest with you: My lack of knowledge on how to read the (frankly speaking quite difficult to read) unbound log will not bring us to determining why unbound has closed the connection. I have not even been able to identify where/how unbound logs having sent replies back to the original requestor (our dnsmasq), I always ever see the requests going upstream and replies coming back from there.

Do you have

log-queries: yes
log-replies: yes

in your unbound.log. If not, it may be worth to attempt a last try with this set, all the logs and - ideally - another PCAP recording. But if we cannot resolve it then, I'd we close on this as this is consuming quite some time on our both ends.

seh2000 · February 5, 2025, 7:14pm

Hi,
Glad you got all the files.
Thanks for the detailed explanation and summary!!!!

I agree (not that I am an expert) that it appears to be an Unbound issue.
Unfortunately I don't have the two below set, hope without them I haven't used your time insufficient!

log-queries: yes
log-replies: yes

I'll set or rather add them to the config fille. I presume they should be added like below:

logfile: "/var/log/unbound/unbound.log"
log-time-ascii: yes
log-queries: yes
log-replies: yes
verbosity: 5

I'll set all up tomorrow in the morning for 5-6 hours, including tcpdump to create the pcap file.

deHakkelaar · February 5, 2025, 8:20pm

Above serviced_delete function might be an indication for having replied.
Its missing for the HTTPS query.

seh2000 · February 6, 2025, 5:48pm

Hi again
I have uploaded to same location Pi-hole
sh_06_02_25_dns.pcap
sh_06_02_25_FTL.log.txt
sh_06_02_25_pihole.log.txt
sh_06_02_25_unbound.log.txt
Having the same password.
Please let me know if you can access.
Warm Regards - Steen

DL6ER · February 6, 2025, 7:11pm

Thank you, those files are surely very large (except, funnily, the PCAP). I will likely have time to look at this tomorrow after work.

DL6ER · February 7, 2025, 5:57am

I am very sorry to say this, but it doesn't seem the log-queries and log-replies have been active in the new (almost 1 GB) unbound log file you provided. As to why, I don't know. I am sure you restarted the unbound after having made the changes... Did you? (the log shows the last restart on Feb, 4th).

   log-queries: <yes or no>
          Prints one line per query to the log, with the log timestamp and
          IP address, name, type and class.  Default is no.  Note that  it
          takes time to print these lines which makes the server (signifi-
          cantly) slower.  Odd  (nonprintable)  characters  in  names  are
          printed as '?'.

   log-replies: <yes or no>
          Prints one line per reply to the log, with the log timestamp and
          IP address, name, type, class, return  code,  time  to  resolve,
          from  cache  and  response  size.   Default is no.  Note that it
          takes time to print these lines which makes the server (signifi-
          cantly)  slower.   Odd  (nonprintable)  characters  in names are
          printed as '?'.

seh2000 · February 7, 2025, 9:07am

Hi,

I was sure I did as I did the config the evening before yesterday.
I'll do for sure now and do a new trace.

admin@Pi4-1:~ $ sudo service unbound status
● unbound.service - Unbound DNS server
     Loaded: loaded (/lib/systemd/system/unbound.service; enabled; preset: enabled)
     Active: active (running) since Fri 2025-02-07 11:09:39 GMT; 8s ago
       Docs: man:unbound(8)
    Process: 1223 ExecStartPre=/usr/libexec/unbound-helper chroot_setup (code=exited, status=0/SUCCESS)
    Process: 1225 ExecStartPre=/usr/libexec/unbound-helper root_trust_anchor_update (code=exited, status=0/SUCCESS)
   Main PID: 1227 (unbound)
      Tasks: 1 (limit: 3922)
        CPU: 273ms
     CGroup: /system.slice/unbound.service
             └─1227 /usr/sbin/unbound -d -p

Feb 07 11:09:39 Pi4-1 systemd[1]: Starting unbound.service - Unbound DNS server...
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: increased limit(open files) from 1024 to 4140
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: creating udp4 socket 127.0.0.1 5335
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: creating tcp4 socket 127.0.0.1 5335
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: creating unix socket /run/unbound.ctl
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: chdir to /etc/unbound
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: drop user privileges, run as unbound
Feb 07 11:09:39 Pi4-1 unbound[1227]: Feb 07 11:09:39 unbound[1227:0] debug: switching log to /var/log/unbound/unbound.log
Feb 07 11:09:39 Pi4-1 systemd[1]: Started unbound.service - Unbound DNS server.

admin@Pi4-1:/var/log/unbound $ sudo grep -v '#\|^$' -R /etc/unbound/unbound.conf*
/etc/unbound/unbound.conf:include-toplevel: "/etc/unbound/unbound.conf.d/*.conf"
/etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf:server:
/etc/unbound/unbound.conf.d/root-auto-trust-anchor-file.conf:    auto-trust-anchor-file: "/var/lib/unbound/root.key"
/etc/unbound/unbound.conf.d/remote-control.conf:remote-control:
/etc/unbound/unbound.conf.d/remote-control.conf:  control-enable: yes
/etc/unbound/unbound.conf.d/remote-control.conf:  control-interface: /run/unbound.ctl
/etc/unbound/unbound.conf.d/pi-hole.conf:server:
/etc/unbound/unbound.conf.d/pi-hole.conf:    logfile: "/var/log/unbound/unbound.log"
/etc/unbound/unbound.conf.d/pi-hole.conf:    log-time-ascii: yes
/etc/unbound/unbound.conf.d/pi-hole.conf:    log-queries: yes
/etc/unbound/unbound.conf.d/pi-hole.conf:    log-replies: yes
/etc/unbound/unbound.conf.d/pi-hole.conf:    verbosity: 5
/etc/unbound/unbound.conf.d/pi-hole.conf:    interface: 127.0.0.1
/etc/unbound/unbound.conf.d/pi-hole.conf:    port: 5335

So the log-queriers and replies are "yes", I had to stop the service and start a few times before the restart was in the log file.

The Unbound log file is huge

admin@Pi4-1:/var/log/unbound $ ls -l
total 1105404
-rw-r--r-- 1 unbound root 1131925628 Feb  7 11:10 unbound.log

admin@Pi4-1:/var/log/unbound $ ls -l --time=birth unbound.log
-rw-r--r-- 1 unbound root 1142751699 Feb  2 14:32 unbound.log

After a restart entries are kept adding to same log file created 5 days ago.
Perhaps rename and create a new?