I'm having the same issue. Was having a couple of issues after the upgrade to 5.0, so I wiped the system and did a fresh install. It worked fine for a couple of days, but now I get the slow web interface and "DNS service not running" status again. Have you found a resolution?
One thing that stood out to me is all of the pihole-FTL instances (84 of them). Is this normal?
[80] is in use by lighttpd
[80] is in use by lighttpd
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[53] is in use by pihole-FTL
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL
[4711] is in use by pihole-FTL
I use dhcp-script which launches additional processes. Use pstree to check whether they all belong to the same parent process or whether they really are separate instances.
Then I tried to restartdns, and it failed. It recommended running a couple of commands. I ran those, along with pidof and pstree again. All of those results are here: https://tricorder.pi-hole.net/3iolwgcbrg
Okay, so this seems fine. It means that there are 7 TCP workers forked from FTL and 6 threads are surrounding the main process of FTL.
The forked TCP workers need to be bound to port 53, as this is where the communication with the clients takes place (they do not listen for new connections but, in fact, each handle only one).
The issue now is that all the forks also feel responsible for replying to incoming API requests, which makes the web interface slow. Crashes are unlikely, as FTL implements a proper locking architecture to prevent race conditions.
I'll work on a fix for this; it seems pretty simple. The second issue (restartdns not working properly) is simple to fix as well; however, I have to do this at home, where my testing Pi-hole is (Monday).
and check if the API still works and if there are still multiple bindings to 4711. Unfortunately, I'm completely unable to test this myself right now. You can simply go back to something working with
Ok, web interface was saying "DNS service not running" again this morning. No FTL crash, but here's yesterday's log:
[2020-06-26 02:32:44.750 23766] SQLite3 message: API call with invalid database connection pointer (21)
[2020-06-26 02:32:44.750 23766] SQLite3 message: misuse at line 157333 of [18db032d05] (21)
[2020-06-26 02:35:33.886 23798] SQLite3 message: API call with invalid database connection pointer (21)
[2020-06-26 02:35:33.886 23798] SQLite3 message: misuse at line 157333 of [18db032d05] (21)
[2020-06-26 02:44:42.223 698] Resizing "/FTL-dns-cache" from 57344 to 61440
[2020-06-26 05:24:14.752 698] Resizing "/FTL-strings" from 81920 to 86016
[2020-06-26 07:00:36.391 29633] SQLite3 message: API called with finalized prepared statement (21)
[2020-06-26 07:00:36.392 29633] SQLite3 message: misuse at line 81711 of [18db032d05] (21)
[2020-06-26 08:18:11.014 698] Resizing "/FTL-dns-cache" from 61440 to 65536
[2020-06-26 09:20:05.712 1521] SQLite3 message: API call with invalid database connection pointer (21)
[2020-06-26 09:20:05.712 1521] SQLite3 message: misuse at line 157333 of [18db032d05] (21)
[2020-06-26 09:51:21.760 698] Resizing "/FTL-strings" from 86016 to 90112
[2020-06-26 09:51:32.088 4289] Remapping "/FTL-strings" from 86016 to 90112
[2020-06-26 09:53:44.358 698] Resizing "/FTL-dns-cache" from 65536 to 69632
[2020-06-26 11:22:31.529 11687] SQLite3 message: API called with finalized prepared statement (21)
[2020-06-26 11:22:31.529 11687] SQLite3 message: misuse at line 81711 of [18db032d05] (21)
[2020-06-26 12:20:11.823 698] Resizing "/FTL-strings" from 90112 to 94208
[2020-06-26 12:28:44.873 17161] SQLite3 message: API call with invalid database connection pointer (21)
[2020-06-26 12:28:44.873 17161] SQLite3 message: misuse at line 157333 of [18db032d05] (21)
[2020-06-26 12:31:40.059 17404] SQLite3 message: API call with invalid database connection pointer (21)
[2020-06-26 12:31:40.059 17404] SQLite3 message: misuse at line 157333 of [18db032d05] (21)
[2020-06-26 12:35:53.860 698] Resizing "/FTL-dns-cache" from 69632 to 73728
[2020-06-26 12:38:49.529 18007] SQLite3 message: API called with finalized prepared statement (21)
[2020-06-26 12:38:49.529 18007] SQLite3 message: misuse at line 81711 of [18db032d05] (21)
[2020-06-26 14:18:11.416 24628] SQLite3 message: API called with finalized prepared statement (21)
[2020-06-26 14:18:11.416 24628] SQLite3 message: misuse at line 81711 of [18db032d05] (21)
[2020-06-26 14:31:57.811 698] Resizing "/FTL-dns-cache" from 73728 to 77824
[2020-06-26 15:33:52.032 698] Resizing "/FTL-strings" from 94208 to 98304
[2020-06-26 15:57:33.516 30101] SQLite3 message: API call with invalid database connection pointer (21)
[2020-06-26 15:57:33.516 30101] SQLite3 message: misuse at line 157333 of [18db032d05] (21)
[2020-06-26 16:36:46.448 698] Resizing "/FTL-dns-cache" from 77824 to 81920
[2020-06-26 16:53:24.270 698] Resizing "/FTL-queries" from 3670016 to 3899392
[2020-06-26 17:37:08.286 698] Resizing "/FTL-dns-cache" from 81920 to 86016
[2020-06-26 18:13:00.373 698] Resizing "/FTL-strings" from 98304 to 102400
[2020-06-26 18:35:58.960 698] Resizing "/FTL-dns-cache" from 86016 to 90112
[2020-06-26 19:11:03.579 698] Resizing "/FTL-dns-cache" from 90112 to 94208
[2020-06-26 19:16:17.417 8685] SQLite3 message: API called with finalized prepared statement (21)
[2020-06-26 19:16:17.417 8685] SQLite3 message: misuse at line 81711 of [18db032d05] (21)
[2020-06-26 19:51:00.228 698] Resizing "/FTL-strings" from 102400 to 106496
[2020-06-26 20:59:42.022 698] Resizing "/FTL-dns-cache" from 94208 to 98304
[2020-06-26 21:28:53.359 698] Resizing "/FTL-dns-cache" from 98304 to 102400
[2020-06-26 21:33:13.502 698] Resizing "/FTL-strings" from 106496 to 110592
[2020-06-26 21:50:53.978 698] Resizing "/FTL-dns-cache" from 102400 to 106496
[2020-06-26 23:41:29.774 698] Resizing "/FTL-strings" from 110592 to 114688
And today's:
[2020-06-27 00:14:48.282 698] Resizing "/FTL-dns-cache" from 106496 to 110592
[2020-06-27 03:32:50.819 698] Resizing "/FTL-dns-cache" from 110592 to 114688
[2020-06-27 06:53:57.947 698] Resizing "/FTL-domains" from 65536 to 131072
[2020-06-27 06:58:04.391 698] Resizing "/FTL-queries" from 3899392 to 4128768
[2020-06-27 07:08:57.120 698] Resizing "/FTL-dns-cache" from 114688 to 118784
[2020-06-27 07:09:40.031 698] Resizing "/FTL-strings" from 114688 to 118784
[2020-06-27 07:49:26.048 698] Resizing "/FTL-dns-cache" from 118784 to 122880
[2020-06-27 08:11:13.363 698] Resizing "/FTL-strings" from 118784 to 122880
Do you still see multiple instances listening on port 4711?
If so, could you please provide the pihole-FTL related lines from
pstree -pt
and the full output of
sudo lsof -iTCP -sTCP:LISTEN -n +c 10
?
You do not need to worry about this. The line appears because you're running a branch based on the development version. This will be fixed when v5.1 is released.
Thanks for the quick reply! Yes, still many instances listening on port 4711 per the pihole debug log. I rebooted this morning, so there aren't as many now as there were earlier, but here's what I've got currently.
Ah yes, I forgot that child processes inherit sockets from their parents; this is fixed in the change I just pushed. However, it also shows that this was not actually the problem, as only processes that also listen (which the forks don't do) are handling actual content here.
With the latest version, only the main process should be listening on port 4711.
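The socket inheritance mentioned above can be reproduced outside of FTL. The following is a minimal Python sketch, not FTL's C code: the function names are hypothetical, and closing the inherited socket in the child is only one common way to avoid the extra bindings, assumed here for illustration rather than taken from the actual fix. It binds to an ephemeral localhost port instead of 4711.

```python
import os
import socket


def holds_listener(sock, port):
    """True if this process still has the listening socket open on `port`."""
    try:
        return sock.getsockname()[1] == port
    except OSError:
        # getsockname() fails once the socket has been closed in this process.
        return False


def fork_with_listener(close_in_child):
    """Open a listener, fork, and report whether the child and parent hold it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: the kernel picks a free port
    listener.listen(1)
    port = listener.getsockname()[1]

    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: it starts with a copy of every descriptor the parent had open.
        try:
            os.close(read_fd)
            if close_in_child:
                listener.close()  # drops only the child's copy
            os.write(write_fd, b"1" if holds_listener(listener, port) else b"0")
        finally:
            os._exit(0)  # never fall through into the parent's code path

    # Parent: collect the child's report, then check its own socket.
    os.close(write_fd)
    child_holds = os.read(read_fd, 1) == b"1"
    os.close(read_fd)
    os.waitpid(pid, 0)
    parent_holds = holds_listener(listener, port)
    listener.close()
    return child_holds, parent_holds


if __name__ == "__main__":
    print("inherited as-is:", fork_with_listener(close_in_child=False))
    print("closed in child:", fork_with_listener(close_in_child=True))
```

In the first case both processes hold the listening socket, which is the kind of situation that makes a tool like lsof list several processes for the same port. In the second case only the parent still holds it.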
Thanks for all the help so far. Any other suggestions on troubleshooting the slow web interface and "DNS service not running" status? If I leave it running without reboot, eventually I start seeing actual DNS issues. I had to reboot a couple nights ago because devices on my network were having trouble with DNS queries. I'm currently at ~45 hours of uptime and the pihole still seems to be working OK for the network, but pihole status is now also returning [✗] DNS service is NOT running. pidof pihole-FTL is returning 21 PIDs.
Anything else in my pstree output above look like it would conflict with pihole?
Does this now work? It certainly works for me in a very quick test:
Concerning the latter, that might be a false-positive, they opened a PR to fix this yesterday:
Ah, this reminds me of a discussion I saw on GitHub yesterday, let's go into this more:
Yes, this:
This very much sounds like this (link to a specific comment in there, also from yesterday):
TL;DR: Pi-hole imposes an upper limit of 20 concurrently running TCP workers as a measure against resource-exhaustion attacks (after all, many Pi-holes are reachable from the public Internet, even though this is really not a good idea).
I checked git blame and this limit has always (at least since 2004) been there and is unchanged. Given the substantial improvement in memory and computing power since 2004, raising this limit is likely necessary!
@webdevelopers Maybe you want to increase the limit of concurrently allowed TCP connections from 20 to something substantially larger?
When more and more (I can only assume IoT) devices use "steady" TCP connections for their DNS, this seems a necessity. dnsmasq (and hence Pi-hole) rejects new TCP connections when the maximum number is reached. This doesn't seem to be the proper way of doing things (even though I understand the technical reason for it).
We can do this. With the massive improvements on memory consumption we've put into v5.0 we can now afford doing this.
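To make the effect of the worker limit concrete, here is a small Python sketch of the bookkeeping only. It uses no real sockets or forks, and the class and constant names are hypothetical; only the numbers 20 and 60 come from this thread. It assumes one forked worker per steady TCP client, and rejection of any client that arrives once the limit is reached.

```python
MAX_TCP_WORKERS = 20  # limit quoted in this thread; the constant name is made up


class TcpWorkerPool:
    """Toy model of forked TCP workers: tracks which clients hold a worker."""

    def __init__(self, limit=MAX_TCP_WORKERS):
        self.limit = limit
        self.active = set()

    def try_accept(self, client):
        """Return True if a worker is started, False if the client is rejected."""
        if len(self.active) >= self.limit:
            return False  # limit reached: the new TCP connection is refused
        self.active.add(client)
        return True

    def finish(self, client):
        """A worker goes away when its client closes the connection."""
        self.active.discard(client)

    def process_count(self):
        """The main process plus one process per active worker."""
        return 1 + len(self.active)


def simulate(clients, limit=MAX_TCP_WORKERS):
    """Let `clients` steady connections arrive; return (processes, rejected)."""
    pool = TcpWorkerPool(limit)
    rejected = [c for c in range(clients) if not pool.try_accept(c)]
    return pool.process_count(), len(rejected)


if __name__ == "__main__":
    print(simulate(25))            # limit of 20
    print(simulate(25, limit=60))  # proposed larger limit
```

With 25 steady clients and a limit of 20, the model ends up with 21 processes and 5 rejected clients. This would be consistent with pidof pihole-FTL returning 21 PIDs, if one of those is the main process. With a limit of 60, none of the 25 clients is rejected.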
Technical details
The TCP DNS RFC requests the ability for dedicated steady DNS connections for specific clients. TCP workers are forks of the main FTL process. The Linux kernel implements fork() using mmap on a Copy-on-Write basis. In other words: When a TCP worker is created, the memory of the parent and child stays identical. However, it is marked as read-only. Hence, an arbitrary number of forks can be created with NO extra memory requirements (besides a smallish bookkeeping overhead inside the kernel).
If one of the processes (either the parent or a child) now changes a variable, the affected memory section ("page") is recognized as read-only. The writing process gets a dedicated copy it can privately use for itself. As a result, the memory is no longer identical between the two processes. However, if only(/mostly) read-operations are being performed, no(/little) memory will be copied at all.
As soon as all forks are terminated or have their own copies, the read-only attribute is removed and the main process can reuse the memory without the need for a copy. This way, no memory is wasted.
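The visible consequence of this scheme is that a write in one process never shows up in the other. The short Python sketch below illustrates that isolation; the function name is hypothetical and it is not FTL code. It does not measure memory: CPython updates reference counts even when data is only read, so in Python more pages get copied than in a C program such as FTL.

```python
import os


def fork_and_modify(values):
    """Fork, let the child overwrite values[0], and return (parent_view, child_view)."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: this write lands in a private copy of the affected page.
        try:
            os.close(read_fd)
            values[0] = 999
            os.write(write_fd, str(values[0]).encode())
        finally:
            os._exit(0)  # never fall through into the parent's code path

    # Parent: read what the child saw, then look at our own, untouched data.
    os.close(write_fd)
    child_view = int(os.read(read_fd, 64).decode())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return values[0], child_view


if __name__ == "__main__":
    data = [1, 2, 3]
    parent_view, child_view = fork_and_modify(data)
    print("parent still sees", parent_view, "while the child saw", child_view)
```

The parent keeps seeing its original value while the child works on its own copy, which matches the behaviour described above for a TCP worker that changes a variable.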
I'd propose an upper limit of at most 60 workers to begin with. This should still fit on a Raspberry Pi Zero. We can even make this a configurable setting. However, as this is only a MAX value, it shouldn't matter, as the vast majority of users will not even remotely reach this limit.
I wasn't sure how to test it, but yesterday I checked out fix/status_checking. Here's output of pihole version:
Pi-hole version is fix/status_checking v5.0-498-g94cd7f5 (Latest: v5.0)
AdminLTE version is v5.0 (Latest: v5.0)
FTL version is v5.0 (Latest: v5.0)
Was there a better way to test the change?
With pihole running a little over 24 hours since the upgrade, everything is still working fine so far. Web interface status reports active, and it's responsive. pihole status returns:
[✓] DNS service is listening
[✓] UDP (IPv4)
[✓] TCP (IPv4)
[✓] UDP (IPv6)
[✓] TCP (IPv6)
[✓] Pi-hole blocking is enabled
pidof pihole-FTL is returning 21 PIDs again today. My pihole is serving my home network with ~30 devices. It's had 76k queries in the last 24 hours. Does it make sense that I'm hitting the TCP connections limit? Prior to v5.0, I'm not sure how many TCP workers I typically had, but I wasn't getting the "DNS service not running" status. Did anything change in v5.0 that could cause this?
Not sure how significant this is (given pihole is working fine right now), but it appears I still have a number of FTL processes listening on 4711.
It depends on the devices. Even though I have a smaller number of devices, I do not have a single TCP worker unless I force some to exist, as none of the devices in my household (Linux machines, Android phones, one web radio) do TCP queries.
The limit has been there since "the beginning of time". The test also seems to be the same since at least version v4.0, so I have no real answer to this, sorry.
According to
you're not running the special branch which is avoiding this so this is expected.