FTL hangs when reaching out-of-memory

Two network devices died within a few days. This sounds like an overvoltage issue or something similar; it definitely does not sound normal. Maybe your Pi-hole was damaged as well. If the configuration of the new equipment is similar to the old, it should work.

"Thousands" "all at the same time" may just be overwhelming your Pi-hole and it dies with out-of-memory after having hundreds of millions of queries. Is there anything interesting in /var/log/pihole-FTL.log ? Try disabling the query database, maybe your hard disk is too slow and a certain backlog arises which may be causing issues here.

In the end, it comes down to two possibilities:

  1. Your Pi-hole was damaged by the same event which made the other equipment die over a short time, or
  2. Your new router is configured incorrectly (maybe in just a minor detail) and is causing a massive memory issue on the Pi-hole (this seems more likely)

Regarding the two devices dying, I average replacing a modem every three years, as that appears to be the life cycle with them and Comcast. The router was nine years old and had 78,000+ hours on it. Technically it still works, except for the wireless portion.

The Pi-hole is actually a virtual machine on my VMware host. That host uses a Samsung M3 V-NAND SSD, so disk speed shouldn't be an issue.

I am going to increase the memory allocated to that Pi-hole VM from 256 MB to 512 MB to see if it is maxing out the available memory. These all appear to be reverse DNS lookups from the router. I was not running an internal DNS server (other than Pi-hole) with either the previous router or this one.

256 MB is a bit sparse.

Here is some of the info from /var/log/pihole-FTL.log:

[2020-09-24 15:31:52.304 584M] Resizing "/FTL-queries" from 256638976 to 256901120
[2020-09-24 15:31:52.305 584M] Resizing "/FTL-queries" from 256901120 to 257163264
[2020-09-24 15:31:52.306 584M] Resizing "/FTL-queries" from 257163264 to 257425408
[2020-09-24 15:31:52.308 584M] Resizing "/FTL-queries" from 257425408 to 257687552
[2020-09-24 15:31:52.309 584M] Resizing "/FTL-strings" from 24576 to 28672
[2020-09-24 15:31:52.309 584M] Imported 4023702 queries from the long-term database
[2020-09-24 15:31:52.309 584M]  -> Total DNS queries: 4023702
[2020-09-24 15:31:52.309 584M]  -> Cached DNS queries: 648
[2020-09-24 15:31:52.309 584M]  -> Forwarded DNS queries: 4020246
[2020-09-24 15:31:52.309 584M]  -> Blocked DNS queries: 2808
[2020-09-24 15:31:52.309 584M]  -> Unknown DNS queries: 0
[2020-09-24 15:31:52.309 584M]  -> Unique domains: 1011
[2020-09-24 15:31:52.309 584M]  -> Unique clients: 4
[2020-09-24 15:31:52.309 584M]  -> Known forward destinations: 3
[2020-09-24 15:31:52.309 584M] Successfully accessed setupVars.conf
[2020-09-24 15:31:52.322 602M] PID of FTL process: 602
[2020-09-24 15:31:52.322 602M] Reloading DNS cache
[2020-09-24 15:31:52.322 602M] Blocking status is enabled
[2020-09-24 15:31:52.323 602/T607] Listening on Unix socket
[2020-09-24 15:31:52.323 602/T606] Listening on port 4711 for incoming IPv6 telnet connections
[2020-09-24 15:31:52.323 602/T605] Listening on port 4711 for incoming IPv4 telnet connections
[2020-09-24 15:31:52.349 602M] INFO: No regex whitelist entries found
[2020-09-24 15:31:52.350 602M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 1.3 msec
[2020-09-24 15:32:03.736 602M] Resizing "/FTL-queries" from 257687552 to 257949696
[2020-09-24 15:32:07.172 602M] Resizing "/FTL-queries" from 257949696 to 258211840

There were 600+ lines of "Resizing /FTL-queries" at the beginning.

Here is the information from /var/log/pihole.log:

Sep 24 15:31:52 dnsmasq[602]: started, version pi-hole-2.81 cachesize 10000
Sep 24 15:31:52 dnsmasq[602]: compile time options: IPv6 GNU-getopt no-DBus no-UBus no-i18n IDN DHCP DHCPv6 no-Lua TFTP no-conntrack ipset auth DNSSEC loop-detect inotify dumpfile
Sep 24 15:31:52 dnsmasq[602]: using only locally-known addresses for domain use-application-dns.net
Sep 24 15:31:52 dnsmasq[602]: using nameserver 192.168.254.1#53 for domain 254.168.192.in-addr.arpa 
Sep 24 15:31:52 dnsmasq[602]: using nameserver 8.8.4.4#53
Sep 24 15:31:52 dnsmasq[602]: using nameserver 8.8.8.8#53
Sep 24 15:31:52 dnsmasq[602]: read /etc/hosts - 5 addresses
Sep 24 15:31:52 dnsmasq[602]: failed to load names from /etc/pihole/custom.list: No such file or directory
Sep 24 15:31:52 dnsmasq[602]: read /etc/pihole/local.list - 2 addresses
Sep 24 15:31:53 dnsmasq[602]: query[A] i.ytimg.com from 192.168.254.1
Sep 24 15:31:53 dnsmasq[602]: forwarded i.ytimg.com to 8.8.4.4
Sep 24 15:31:53 dnsmasq[602]: forwarded i.ytimg.com to 8.8.8.8
Sep 24 15:31:53 dnsmasq[602]: reply i.ytimg.com is 172.217.8.182
Sep 24 15:32:00 dnsmasq[602]: query[PTR] 228.254.168.192.in-addr.arpa from 127.0.0.1
Sep 24 15:32:00 dnsmasq[602]: forwarded 228.254.168.192.in-addr.arpa to 192.168.254.1
Sep 24 15:32:00 dnsmasq[602]: query[PTR] 228.254.168.192.in-addr.arpa from 192.168.254.1
Sep 24 15:32:00 dnsmasq[602]: forwarded 228.254.168.192.in-addr.arpa to 192.168.254.1

There were 18,600+ entries from what appear to be reverse lookups going into the Pi-hole and back to the router. It appears to do this for any of the local addresses, as the queries all look like "xxx.254.168.192.in-addr.arpa".
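A quick way to count these entries is a grep against the log; a sketch, assuming the default dnsmasq log path and the 192.168.254.0/24 subnet from above:

```shell
# Count reverse (PTR) queries for the 192.168.254.0/24 subnet in the dnsmasq log
LOG=/var/log/pihole.log
if [ -r "$LOG" ]; then
  grep -c 'query\[PTR\].*254\.168\.192\.in-addr\.arpa' "$LOG"
fi
```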

As for settings on the Nighthawk R8000P router, other than setting the DNS server IPs, I only added DHCP reservations along with configuring the WiFi portion. Nothing else was changed.

This is the memory being used for that purpose, in bytes. That's 258 million bytes, or about all of the allocated memory. That's a lot of DNS queries (more than 4 million in 24 hours). This is the root of your problem.

This looks like a DNS loop. A client is asking "what is the name of the client located at 192.168.254.xxx", and it doesn't appear to be getting an answer.

If this is a client on your network, you can map that IP to a name in either /etc/hosts on the Pi or using Local DNS records. Then Pi-hole (via the Pi) can answer the request and this circular traffic should stop.
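For example, a single line in /etc/hosts is enough (the IP is one from your loop; the hostname here is hypothetical, so use your actual device name):

```
# /etc/hosts on the machine running Pi-hole
192.168.254.228   office-ap.lan
```

Pi-hole will then answer the PTR query for that address locally instead of forwarding it.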

What are the outputs of the following - you don't need to redact any private IP ranges:

echo ">top-clients >quit" | nc localhost 4711

echo ">top-domains >quit" | nc localhost 4711

echo ">forward-dest >quit" | nc localhost 4711

That is what I was thinking, since they were constant. It appears that the router itself is sending the queries to the Pi-hole to get that information for each client on the network. This is even occurring for network devices (APs) which do not have names. Shouldn't this be handled within DNS instead of manually adding each device?

Not following you here. If Pi-hole is the DHCP server, it will know the names of the network clients. If not, and the router is the DHCP server, then Pi-hole may not even see the individual IPs of the clients (requests may all appear to be coming from the router).

You have conditional forwarding enabled, and this is likely causing the loop. What led you to use conditional forwarding?
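For context, Conditional Forwarding makes Pi-hole hand reverse lookups for your local subnet to the router; in dnsmasq terms it generates a directive roughly like the following (a sketch; the exact file and option spelling depend on the Pi-hole version), which matches the "using nameserver 192.168.254.1#53 for domain 254.168.192.in-addr.arpa" line in your pihole.log:

```
# /etc/dnsmasq.d/01-pihole.conf (generated by Pi-hole; do not edit by hand)
# Forward reverse lookups for 192.168.254.0/24 to the router at 192.168.254.1
rev-server=192.168.254.0/24,192.168.254.1
```

If the router in turn sends those queries straight back to Pi-hole, the loop closes.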

All result in the same response, with or without sudo:
localhost [127.0.0.1] 4711 (?) : Connection refused

So DHCP is controlled by the router and DNS settings are pointing to the PiHole w/ secondary to 8.8.8.8. I have been running that same config for years since PiHole was initially released. As for conditional forwarding, I am unaware of enabling that, nor do I know where that is enabled.

*edit
Found the conditional forwarding setting. Disabled it.

I do not recall setting up Conditional Forwarding in the past. Apparently, when I previously used the Apple AirPort Extreme, it didn't care about the loop-back and terminated the loop. It seems the Nighthawk R8000P likes to keep that loop going continuously.

I have restarted the Pi-hole VM and watched the /var/log/pihole.log file, and that reverse lookup loop has not appeared in the last five minutes. The /var/log/pihole-FTL.log file has not spit out anything since that restart and the disabling of Conditional Forwarding.

I will continue to monitor throughout the night and provide feedback. I doubt that anything is going to happen, as previously it occurred within about a minute of a restart.

Well, that was short-lived. It seems the DNS server stopped yet again. Here is the pihole-FTL.log after the earlier restart.

[2020-09-24 16:14:00.544 586M] Resizing "/FTL-queries" from 257425408 to 257687552
[2020-09-24 16:14:00.544 586M] Resizing "/FTL-strings" from 24576 to 28672
[2020-09-24 16:14:00.544 586M] Imported 4023726 queries from the long-term database
[2020-09-24 16:14:00.545 586M]  -> Total DNS queries: 4023726
[2020-09-24 16:14:00.545 586M]  -> Cached DNS queries: 658
[2020-09-24 16:14:00.545 586M]  -> Forwarded DNS queries: 4020258
[2020-09-24 16:14:00.545 586M]  -> Blocked DNS queries: 2810
[2020-09-24 16:14:00.545 586M]  -> Unknown DNS queries: 0
[2020-09-24 16:14:00.545 586M]  -> Unique domains: 1011
[2020-09-24 16:14:00.545 586M]  -> Unique clients: 4
[2020-09-24 16:14:00.545 586M]  -> Known forward destinations: 3
[2020-09-24 16:14:00.545 586M] Successfully accessed setupVars.conf
[2020-09-24 16:14:00.557 596M] PID of FTL process: 596
[2020-09-24 16:14:00.558 596M] Reloading DNS cache
[2020-09-24 16:14:00.558 596M] Blocking status is enabled
[2020-09-24 16:14:00.558 596/T601] Listening on Unix socket
[2020-09-24 16:14:00.558 596/T600] Listening on port 4711 for incoming IPv6 telnet connections
[2020-09-24 16:14:00.558 596/T599] Listening on port 4711 for incoming IPv4 telnet connections
[2020-09-24 16:14:00.593 596M] INFO: No regex whitelist entries found
[2020-09-24 16:14:00.594 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 1.1 msec
[2020-09-24 16:27:03.861 596M] Resizing "/FTL-dns-cache" from 4096 to 8192
[2020-09-24 16:28:31.634 596M] Resizing "/FTL-strings" from 28672 to 32768
[2020-09-24 17:00:00.045 596/T602] Notice: Database size is 2076.03 MB, deleted 2193 rows
[2020-09-24 17:03:13.315 596M] Resizing "/FTL-dns-cache" from 8192 to 12288
[2020-09-24 17:03:38.711 596M] Reloading DNS cache
[2020-09-24 17:03:38.711 596M] Blocking status is enabled
[2020-09-24 17:03:38.717 596M] SQLite3 message: file unlinked while open: /etc/pihole/gravity.db (28)
[2020-09-24 17:03:38.718 596M] INFO: No regex whitelist entries found
[2020-09-24 17:03:38.719 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.7 msec
[2020-09-24 17:05:59.387 596M] Received: Real-time signal 0 (34 -> 0)
[2020-09-24 17:05:59.391 596M] INFO: No regex whitelist entries found
[2020-09-24 17:05:59.391 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.7 msec
[2020-09-24 17:06:03.905 596M] Received: Real-time signal 0 (34 -> 0)
[2020-09-24 17:06:03.909 596M] INFO: No regex whitelist entries found
[2020-09-24 17:06:03.910 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.8 msec
[2020-09-24 17:06:07.251 596M] Received: Real-time signal 0 (34 -> 0)
[2020-09-24 17:06:07.255 596M] INFO: No regex whitelist entries found
[2020-09-24 17:06:07.255 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.7 msec
[2020-09-24 17:06:10.322 596M] Received: Real-time signal 0 (34 -> 0)
[2020-09-24 17:06:10.326 596M] INFO: No regex whitelist entries found
[2020-09-24 17:06:10.326 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.7 msec
[2020-09-24 17:07:17.388 596M] Received: Real-time signal 0 (34 -> 0)
[2020-09-24 17:07:17.392 596M] INFO: No regex whitelist entries found
[2020-09-24 17:07:17.393 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.7 msec
[2020-09-24 17:07:28.072 596M] Received: Real-time signal 0 (34 -> 0)
[2020-09-24 17:07:28.076 596M] INFO: No regex whitelist entries found
[2020-09-24 17:07:28.077 596M] Compiled 0 whitelist and 2 blacklist regex filters for 4 clients in 0.7 msec
[2020-09-24 17:19:57.909 596M] Resizing "/FTL-strings" from 32768 to 36864
[2020-09-24 17:31:16.863 596M] Resizing "/FTL-queries" from 257687552 to 257949696
[2020-09-24 17:59:00.154 596/T602] Notice: Database size is 2076.03 MB, deleted 500 rows
[2020-09-24 18:27:12.605 596M] Resizing "/FTL-strings" from 36864 to 40960
[2020-09-24 18:27:41.978 596M] Resizing "/FTL-dns-cache" from 12288 to 16384
[2020-09-24 19:00:00.046 596/T602] Notice: Database size is 2076.05 MB, deleted 536 rows
[2020-09-24 19:45:12.476 596M] Resizing "/FTL-strings" from 40960 to 45056
[2020-09-24 19:59:56.451 596M] Resizing "/FTL-queries" from 257949696 to 258211840
[2020-09-24 20:00:00.062 596/T602] Notice: Database size is 2076.17 MB, deleted 923 rows
[2020-09-24 20:07:27.850 596M] Resizing "/FTL-dns-cache" from 16384 to 20480
[2020-09-24 20:21:57.497 596M] Resizing "/FTL-strings" from 45056 to 49152

The pihole.log file does not contain any of the loops it was reporting prior. It's all "normal" data, and the last lines of the file are the following:

Sep 24 20:37:18 dnsmasq[596]: reply gateway.fe.apple-dns.net is 17.248.132.134
Sep 24 20:37:18 dnsmasq[596]: query[AAAA] adashbc.ut.taobao.com from 192.168.254.1
Sep 24 20:37:18 dnsmasq[596]: exactly blacklisted adashbc.ut.taobao.com is ::
Sep 24 20:37:18 dnsmasq[596]: query[A] adashbc.ut.taobao.com from 192.168.254.1
Sep 24 20:37:18 dnsmasq[596]: exactly blacklisted adashbc.ut.taobao.com is 0.0.0.0
Sep 24 20:37:22 dnsmasq[596]: query[AAAA] adashbc.ut.taobao.com from 192.168.254.1
Sep 24 20:37:22 dnsmasq[596]: exactly blacklisted adashbc.ut.taobao.com is ::
Sep 24 20:37:22 dnsmasq[596]: query[A] adashbc.ut.taobao.com from 192.168.254.1
Sep 24 20:37:22 dnsmasq[596]: exactly blacklisted adashbc.ut.taobao.com is 0.0.0.0
Sep 24 20:37:22 dnsmasq[596]: query[type=65] configuration.apple.com from 192.168.254.1
Sep 24 20:37:22 dnsmasq[596]: cached configuration.apple.com is <CNAME>
Sep 24 20:37:22 dnsmasq[596]: forwarded configuration.apple.com to 8.8.8.8
Sep 24 20:37:22 dnsmasq[596]: query[A] configuration.apple.com from 192.168.254.1
Sep 24 20:37:22 dnsmasq[596]: forwarded configuration.apple.com to 8.8.8.8
Sep 24 20:37:22 dnsmasq[596]: query[type=65] e673.dsce9.akamaiedge.net from 192.168.254.1
Sep 24 20:37:22 dnsmasq[596]: forwarded e673.dsce9.akamaiedge.net to 8.8.8.8
Sep 24 20:37:22 dnsmasq[596]: query[A] e673.dsce9.akamaiedge.net from 192.168.254.1
Sep 24 20:37:22 dnsmasq[596]: forwarded e673.dsce9.akamaiedge.net to 8.8.8.8
Sep 24 20:37:22 dnsmasq[596]: reply configuration.apple.com is <CNAME>
Sep 24 20:37:22 dnsmasq[596]: reply configuration.apple.com.akadns.net is <CNAME>
Sep 24 20:37:22 dnsmasq[596]: reply configuration.apple.com.edgekey.net is <CNAME>
Sep 24 20:37:22 dnsmasq[596]: reply e673.dsce9.akamaiedge.net is 23.35.209.97
Sep 24 20:37:22 dnsmasq[596]: reply e673.dsce9.akamaiedge.net is 23.35.209.97
Sep 24 20:37:26 dnsmasq[596]: query[type=65] gsp-ssl.ls.apple.com from 192.168.254.1
Sep 24 20:37:26 dnsmasq[596]: forwarded gsp-ssl.ls.apple.com to 8.8.8.8
Sep 24 20:37:26 dnsmasq[596]: query[A] gsp-ssl.ls.apple.com from 192.168.254.1
Sep 24 20:37:26 dnsmasq[596]: forwarded gsp-ssl.ls.apple.com to 8.8.8.8
Sep 24 20:37:26 dnsmasq[596]: query[type=65] get-bx.g.aaplimg.com from 192.168.254.1
Sep 24 20:37:26 dnsmasq[596]: forwarded get-bx.g.aaplimg.com to 8.8.8.8
Sep 24 20:37:26 dnsmasq[596]: query[A] get-bx.g.aaplimg.com from 192.168.254.1

I am going to rename both log files and restart to start over from scratch.

The issue is when Pi-hole reads the previous 24 hours of data from the long term database. Do the following to fix this:

sudo service pihole-FTL stop

sudo mv /etc/pihole/pihole-FTL.db /etc/pihole/pihole-FTL-old.db

sudo service pihole-FTL start

Check with df -h /dev/shm whether there are any limitations on /dev/shm. IIRC, the default size is half of the memory allocated to the machine, so you may still be hitting a memory limit here.

The solution suggested by @jfb will work because it prevents Pi-hole from re-importing the massive amount of looping queries from the database on restart.

Otherwise, to change the configuration for /dev/shm, add one line to /etc/fstab as follows.

tmpfs /dev/shm tmpfs defaults,size=8g 0 0

Here, the /dev/shm size is configured to be 8 GB; you may want something on the order of 400 MB.

If you would like to make it take effect immediately, run

sudo mount -o remount /dev/shm

Otherwise, the change will only become effective after a reboot (I still recommend a reboot to avoid losing the existing content of the shm region).

I have renamed the DB file and restarted the FTL service.

The results from df -h /dev/shm are the following, and it does not appear to have any limitations that I can see.

:~$ df -h /dev/shm
Filesystem      Size  Used Avail Use% Mounted on
tmpfs           247M   56K  247M   1% /dev/shm

I have increased the amount of space for /dev/shm to match the amount of memory allocated to the VM running Pi-hole. Here are the results of that change.

Filesystem      Size  Used Avail Use% Mounted on
tmpfs           512M   68K  512M   1% /dev/shm

For now I will have to wait for traffic to start flowing across the network.

The limitation is/was:

tmpfs           247M   56K  247M   1% /dev/shm

This was hit with

[2020-09-24 19:59:56.451 596M] Resizing "/FTL-queries" from 257949696 to 258211840

because 258211840/1024/1024 = 246 MB. Given that there are other objects in the shared memory as well, it just filled up completely. I will see if we can catch this situation and log a clearer warning here.
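The arithmetic checks out directly in the shell:

```shell
# The last "/FTL-queries" resize target, converted from bytes to MB
echo $((258211840 / 1024 / 1024))   # prints 246, just under the 247M tmpfs size
```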


I prepared a change for FTL to also log the usage statistics of said /dev/shm to improve visibility of the issue in the future.


I have not had any issues since I turned off Conditional Forwarding and doubled the size of /dev/shm. I appreciate all the help with this issue.


Rest assured, we're still improving on this in the background. For instance, we've chosen the path of allowing "memory holes". This means we reserve memory from the system, but we do not claim unused blocks as exclusively ours while there is still room. This simplifies memory handling for the kernel, allowing other processes to "borrow" some surplus memory from Pi-hole when they need it only for a short time.
Obviously, this lending of memory to other processes can lead to errors when those processes take Pi-hole's memory and don't give it back (because memory is full, there is no alternative for them, and they are not coded to be as friendly to others as ours is).

The solution will be to exclusively allocate memory in small heaps to avoid such holes. Read the man pages of ftruncate (with holes) and fallocate (exclusive allocation) if you want the technical details. This will also allow us to identify out-of-memory situations more easily and notify the user earlier.
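The distinction can also be seen with the userspace counterparts of those syscalls (truncate from coreutils and fallocate from util-linux); a small demonstration on temporary files:

```shell
# File with "holes": the size is set, but no blocks are actually allocated yet
truncate -s 10M /tmp/with-holes.bin
# Fully allocated file: all blocks are exclusively reserved up front
fallocate -l 10M /tmp/allocated.bin
# Both report an apparent size of 10 MB...
du -k --apparent-size /tmp/with-holes.bin /tmp/allocated.bin
# ...but only the fallocate'd file occupies real blocks on disk
du -k /tmp/with-holes.bin /tmp/allocated.bin
```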

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.