FTL crashing / out of memory

Potato · May 11, 2021, 11:44am

I have two piholes, both on Pi-Zero Ws, that have FTL crashing or being killed due to out of memory.

The below is from kern.log:

May 11 07:24:34 hera kernel: [1035697.857225] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),global_oom,task_memcg=/,task=pihole-FTL,pid=17924,uid=999
May 11 07:24:34 hera kernel: [1035697.857317] Out of memory: Killed process 17924 (pihole-FTL) total-vm:485988kB, anon-rss:349412kB, file-rss:0kB, shmem-rss:924kB, UID:999 pgtables:452kB oom_score_adj:0
May 11 07:24:34 hera kernel: [1035698.595688] oom_reaper: reaped process 17924 (pihole-FTL), now anon-rss:0kB, file-rss:0kB, shmem-rss:924kB

Restarting DNS temporarily resolves the issue. I restarted DNS immediately before generating the debug log.

Let me know if there's anything else you need to help troubleshoot. Thanks.

Expected Behaviour:

Expecting FTL to continue running.

Actual Behaviour:

FTL is crashing.

Debug Token:

https://tricorder.pi-hole.net/7ryaztjofj

Tntdruid · May 11, 2021, 2:04pm

Are you running a lot of lists?

841k domains here and use Used memory: 31.30 MB on my raspberry pi zero w

Potato · May 11, 2021, 2:25pm

I only have ~60MB used under normal operation. 85,353 domains according to pihole -g. This happens roughly every week or two.

DanSchaper · May 11, 2021, 8:28pm

What else are you running on that server? pihole-FTL runs at a nice level that makes it open to being reaped when the kernel runs out of memory.

DanSchaper · May 11, 2021, 8:31pm

Looking at the debug log:

pihole-FTL isn't using much memory at all, 1.7MB out of 225.5MB available in the Shared Memory location.

You do have some errors that happen when there is a problem with the system time:

-rw-r--r-- 1 pihole pihole 9100 May 11 07:35 /var/log/pihole-FTL.log
   -----head of pihole-FTL.log------
   [2021-05-11 00:03:43.618 17924M] Resizing "FTL-dns-cache" from 57344 to (3840 * 16) == 61440 (/dev/shm: 1.7MB used, 225.5MB total, FTL uses 1.7MB)
   [2021-05-11 00:12:34.826 17924M] FATAL: Trying to access query ID -1, but maximum is 12288
   [2021-05-11 00:12:34.835 17924M]        found in FTL_query_in_progress() (/root/project/src/dnsmasq_interface.c:2181)
   [2021-05-11 00:12:34.836 17924M] Failed to unlock SHM lock: Operation not permitted
   [2021-05-11 00:49:29.472 17924M] FATAL: Trying to access query ID -1, but maximum is 12288
   [2021-05-11 00:49:29.480 17924M]        found in FTL_query_in_progress() (/root/project/src/dnsmasq_interface.c:2181)
   [2021-05-11 00:49:29.487 17924M] Failed to unlock SHM lock: Operation not permitted
   [2021-05-11 00:49:29.515 17924M] FATAL: Trying to access query ID -1, but maximum is 12288
   [2021-05-11 00:49:29.516 17924M]        found in FTL_query_in_progress() (/root/project/src/dnsmasq_interface.c:2181)
   [2021-05-11 00:49:29.516 17924M] Failed to unlock SHM lock: Operation not permitted
   [2021-05-11 00:59:01.157 17924/T17928] Notice: Database size is 784.02 MB, deleted 768 rows

Potato · May 11, 2021, 8:47pm

The Pi Zero Ws aren't doing anything else. They're both dedicated to running pihole only. They have unattended-upgrades tweaked a bit, but aren't a huge deviation from a standard install.

That's interesting, I looked right past those timestamps. This is around the time unattended-upgrades was running. Any chance it was updating and killed pihole-FTL for more memory?

Bucking_Horn · May 12, 2021, 6:46am

We generally advise against unattended upgrades of Pi-hole.

Could you share the config items you use to achieve that?

In the meantime, it should be easy enough to disable that and monitor your system whether that would fix your issue.

There should be some lines above those log messages that show the memory usage at the time of OOM-killer invocation, preceeded by an indication which process invoked it when running short on memory (something like <process> invoked OOM-killer).

Your debug log shows at least cloudflared to be running as well:

*** [ DIAGNOSING ]: Ports in use
127.0.0.1:33419 cloudflare (IPv4)
127.0.0.1:5053 cloudflare (IPv4)

It doesn't seem that you are using this as Pi-hole's upstream?

*** [ DIAGNOSING ]: Setup variables
    PIHOLE_DNS_1=1.1.1.1
    PIHOLE_DNS_2=1.0.0.1

In addition (and probably unrelated), you are using .local as your local domain name:

*** [ DIAGNOSING ]: Discovering active DHCP servers (takes 10 seconds)
   Scanning all your interfaces for DHCP servers
   
   * Received 320 bytes from wlan0:192.168.0.1
     Offered IP address: 192.168.0.4
     Server IP address: 192.168.0.1
     DHCP options:
      Message type: DHCPOFFER (2)
      server-identifier: 192.168.0.1
      domain-name: "local"
         dns-server: 192.168.0.3
      dns-server: 192.168.0.4
      router: 192.168.0.1

Note that .local is the default domain for the mDNS protocol - it should not be used with DNS.
Consider configuring your router for a different local domain name.

Potato · May 12, 2021, 11:25am

"unattended-upgrades" is the standard debian/ubuntu/raspbian package that handles automatic security updates -- nothing to do with pihole. I can try disabling it for the next week though.

I had issues with cloudflared DOH and stopped using it, but forgot to remove it. Good eye. I nuked it from both, thanks.

I found the OOM-killer section you're referencing and attached that portion of the log.

oomkiller.txt (27.4 KB)

DL6ER · May 12, 2021, 12:06pm

Your log suggests cloudflared used even more memory than pihole-FTL:

May 11 07:24:34 hera kernel: [1035697.855962] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
May 11 07:24:34 hera kernel: [1035697.856217] [    491]   998   491   200849      439    28672      280             0 cloudflared
May 11 07:24:34 hera kernel: [1035697.856420] [  17924]   999 17924   121497    87584   462848    19529             0 pihole-FTL

apt-get was consuming some notable amount of memory, too:

May 11 07:24:34 hera kernel: [1035697.856470] [   2232]     0  2232    12986     2025    59392        0             0 apt-get

Memory amount is given in page size (4K per page), so the logs above translate to

cloudflared = 784.6 MB (virtual memory), 1,7 MB (RAM memory)
pihole-FTL = 474.6 MB (virtual memory), 342.125 MB (RAM memory)
apt-get = 50.7 MB (virtual memory), 7.9 MB (RAM memory)

So the next question is: Why is your FTL process that huge? Mine is running at around 16 MB and, hence, similar to what others see.

How many queries does your system see, roughly, within 24 hours?

You could try experimenting with reducing CACHESIZE in your /etc/pihole/setupVars.conf and running pihole -r (Repair).

Potato · May 12, 2021, 12:18pm

Thanks for the analysis, apt-get was my guess since unattended-upgrades was running around this time.

This is the secondary pihole and total queries on dashboard shows 7,678. Primary shows 26,062.

Currently top shows:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND           
  755 pihole    10 -10   90236  20016   4524 S   0.0   4.5  10:44.46 pihole-FTL

Considering the virtual memory you noted used by cloudflared, I'm curious if that was the issue all along. I've uninstalled that from both piholes, and disabled unattended-upgrades on one.

DL6ER · May 12, 2021, 12:30pm

I don't really think so. Did I got you right that the memory need increases slowly until you hit the very point where it is just used up? In this case, we'll have to run some memory-leak checks on your system.

I wrote some documentation about it that should make it not especially easy, but doable:

https://deploy-preview-485--pihole-docs.netlify.app/ftldns/valgrind/

Potato · May 12, 2021, 1:30pm

When attempting to run what's in your link, it tells me bad option -s. This option doesn't seem to exist in the man page, so I excluded it.

FTL isn't really functional when being ran with valgrind due to this line correct?

sudo setcap -r /usr/bin/pihole-FTL

Is this a viable test if I'm not actively putting a load on FTL, or am I doing something wrong?

DL6ER · May 12, 2021, 3:12pm

It is because you are just stripping privileges. However, when starting FTL as root (sudo), all privileges will always be there.

Yes. The web interface may show an incorrect up/down status, but besides this, everything should work as usual.

Interesting ... from my man-page (valgrind 3.15.0):

      -s
          Specifying -s is equivalent to --show-error-list=yes.

Maybe you have more luck with the non-abbreviated command.

Potato · May 12, 2021, 3:30pm

I'm running RPI Buster and the latest I get from apt install valgrind is 3.7.0+6. It's safe to say Buster's repo is behind.

--show-error-list=yes is also a bad option.

When running with the command:

sudo rm /dev/shm/FTL-*; sudo valgrind --trace-children=yes --leak-check=full --track-origins=yes /usr/bin/pihole-FTL &> valgrind.log

DNS service is not functioning, so I am doing something wrong. Should the standard pihole-FTL log still be getting populated? It's not.

While this may be useful to see if there's an actual memory leak in FTL, I'm starting to think just reloading the pi and copying my handful of config changes (static DNS, additional lists) may be more productive.

DL6ER · May 12, 2021, 5:59pm

Yes, it should be.

What happens if you leave off the final output redirection, so

sudo rm /dev/shm/FTL-*; sudo valgrind --trace-children=yes --leak-check=full --track-origins=yes /usr/bin/pihole-FTL

?

Helping to debug is completely voluntary, of course. If it is too troublesome for you, you don't have to do it. Sorry for the inconvenience caused by this either way!

Potato · May 12, 2021, 11:44pm

I gave this a try with the same result. Valgrind seems to just hang. I would expect the program to keep running and for it to print more debugging information. I attached a log of what it does.

valgrind.txt (22.6 KB)

I found this Stackoverflow link that describes how a newer version of Valgrind resolves the issue, but confirms that Raspbian Buster is not adequately maintaining Valgrind.

Valgrind is a useful C/C++ Development tool and probably deserves to be maintained. This is making me wish I tried Ubuntu 20.04 32-bit ARM when it came out.

DL6ER · May 13, 2021, 7:39am

Okay, too bad that it doesn't work as expected. Maybe it is time to try Ubuntu out now? I do have a testing system with Ubuntu 64bit (aarch64) on a Raspberry Pi 3 running here and it seems to be doing the job quite fine.

Potato · May 15, 2021, 2:13pm

I figured it out after reloading the pis.

Roughly a month ago I reloaded my Macbook and I had previously disabled bonjour. The piholes were set to only display A and AAAA requests, but the Macbook with bonjour enabled was drowning them in reverse DNS lookups (PTR requests). The new piholes with fresh configurations shot up to over 150k each in a day. IMO it seems reasonable that >100k requests a day could be a bit much for a Zero.

Upon disabling bonjour on the Macbook, the number of queries per hour returned to a normal count.

I apologize for the wasted time, and I'm certain I am not the first to encounter this. I will not be filtering PTR requests again.

DL6ER · May 15, 2021, 6:59pm

No need to apologize for anything. But yeah, that's not the first time something like this happens and why we are going to deprecate this option in the near future.

system · June 5, 2021, 7:00pm

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.