Pi-Hole Crashing every 12-ish hours

Good afternoon. I was hoping that my crashing were unique and one-off, but over the last few days, my DNS and FTL has been going down about every 12 hours. It started about a week ago on version 4.x - I since upgraded to 5 to see if that would solve my problems - but alas, still happening. A restart of the DNS Resolver service brings it back online. No need to restart the server, etc.

My token is https://tricorder.pi-hole.net/gtgk6yfai0

Running on VM instance of CentOS 7 - straight install package. Nothing else is running on the instance. 4GB memory, 2CPU. My monitoring shows that I'm not running out of memory, not running out of CPU, or disk space.

When the service stops - everything on my network collapses obviously. It does take a fair amount of queries - about 36 million in a 24 hour period - but it's been working fine for the 6 months or so that I've been using it.

Not seeing anything specific in the logs but will continue to investigate. Any help would be appreciated.

Thanks.

_dave

In what environment are you running this with this query and client count?

   [2020-05-15 12:20:58.787 10961] Imported 5720245 queries from the long-term database
   [2020-05-15 12:20:58.788 10961]  -> Total DNS queries: 5720245
   [2020-05-15 12:20:58.788 10961]  -> Cached DNS queries: 5692539
   [2020-05-15 12:20:58.788 10961]  -> Forwarded DNS queries: 22415
   [2020-05-15 12:20:58.788 10961]  -> Blocked DNS queries: 5291
   [2020-05-15 12:20:58.788 10961]  -> Unknown DNS queries: 0
   [2020-05-15 12:20:58.788 10961]  -> Unique domains: 1481
   [2020-05-15 12:20:58.788 10961]  -> Unique clients: 107
   [2020-05-15 12:20:58.788 10961]  -> Known forward destinations: 4

Are there any errors shown in these logs?

/var/log/pihole.log

/var/log/pihole-FTL.log

/var/log/syslog

Appreciate the quick message back. The only thing those shows errors are in the /var/log/pihole.log:

May 15 12:22:00 dnsmasq[11720]: reply **error** is not implemented
May 15 12:22:00 dnsmasq[11720]: reply **error** is not implemented
May 15 12:22:01 dnsmasq[11720]: reply **error** is not implemented

But those were in there both before and after the crash. I don't see anything at the time of the failures. All I do see is no more entries of this kind of detail:

[2020-05-15 12:09:40.402 10425] Resizing "/FTL-queries" from 1985478656 to 1985740800
[2020-05-15 12:09:50.420 10425] Resizing "/FTL-queries" from 1985740800 to 1986002944
[2020-05-15 12:10:03.753 10425] Resizing "/FTL-queries" from 1986002944 to 1986265088

That gets logged every 10 seconds or so... And then it stops. I restart the service and I see the normal startup details:

[2020-05-15 12:20:38.506 10961] Using log file /var/log/pihole-FTL.log
[2020-05-15 12:20:38.506 10961] ########## FTL started! ##########
[2020-05-15 12:20:38.506 10961] FTL branch: master
[2020-05-15 12:20:38.506 10961] FTL version: v5.0
[2020-05-15 12:20:38.506 10961] FTL commit: 3d7c095
[2020-05-15 12:20:38.506 10961] FTL date: 2020-05-10 18:58:38 +0100
[2020-05-15 12:20:38.506 10961] FTL user: pihole
[2020-05-15 12:20:38.506 10961] Compiled for x86_64 (compiled on CI) using gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
[2020-05-15 12:20:38.506 10961] Starting config file parsing (/etc/pihole/pihole-FTL.conf)
[2020-05-15 12:20:38.506 10961]    SOCKET_LISTENING: only local
[2020-05-15 12:20:38.506 10961]    AAAA_QUERY_ANALYSIS: Show AAAA queries
[2020-05-15 12:20:38.506 10961]    MAXDBDAYS: max age for stored queries is 1 days
[2020-05-15 12:20:38.506 10961]    RESOLVE_IPV6: Resolve IPv6 addresses
[2020-05-15 12:20:38.506 10961]    RESOLVE_IPV4: Resolve IPv4 addresses
[2020-05-15 12:20:38.506 10961]    DBINTERVAL: saving to DB file every minute
[2020-05-15 12:20:38.506 10961]    DBFILE: Using /etc/pihole/pihole-FTL.db
[2020-05-15 12:20:38.506 10961]    MAXLOGAGE: Importing up to 4.0 hours of log data
[2020-05-15 12:20:38.506 10961]    PRIVACYLEVEL: Set to 0
[2020-05-15 12:20:38.506 10961]    IGNORE_LOCALHOST: Show queries from localhost
[2020-05-15 12:20:38.506 10961]    BLOCKINGMODE: Null IPs for blocked domains
[2020-05-15 12:20:38.506 10961]    ANALYZE_ONLY_A_AND_AAAA: Enabled. Analyzing only A and AAAA queries
[2020-05-15 12:20:38.506 10961]    DBIMPORT: Importing history from database
[2020-05-15 12:20:38.506 10961]    PIDFILE: Using /var/run/pihole-FTL.pid
[2020-05-15 12:20:38.506 10961]    PORTFILE: Using /var/run/pihole-FTL.port
[2020-05-15 12:20:38.506 10961]    SOCKETFILE: Using /var/run/pihole/FTL.sock
[2020-05-15 12:20:38.507 10961]    SETUPVARSFILE: Using /etc/pihole/setupVars.conf
[2020-05-15 12:20:38.507 10961]    MACVENDORDB: Using /etc/pihole/macvendor.db
[2020-05-15 12:20:38.507 10961]    GRAVITYDB: Using /etc/pihole/gravity.db
[2020-05-15 12:20:38.507 10961]    PARSE_ARP_CACHE: Active
[2020-05-15 12:20:38.507 10961]    CNAME_DEEP_INSPECT: Active
[2020-05-15 12:20:38.507 10961]    DELAY_STARTUP: No delay requested.
[2020-05-15 12:20:38.507 10961]    BLOCK_ESNI: Enabled, blocking _esni.{blocked domain}
[2020-05-15 12:20:38.507 10961] Finished config file parsing
[2020-05-15 12:20:38.508 10961] Database version is 5
[2020-05-15 12:20:38.508 10961] Database successfully initialized

That 10 minutes of time between timestamps is the outage window before I was able to hit the admin page and restart the service.

The logs to get pretty large - is there anything else I could grep through for?

Thanks!

_DS

It may be worth running pihole-FTL in debug mode and waiting for it to crash to see if we can glean any more interesting information:

https://docs.pi-hole.net/ftldns/debugging/

1 Like

This is a home lab actually - but I have about 100 some odd devices with significant amount of monitoring going on between them all. A few of my servers poll a lot of my devices every 10-30 seconds so the DNS lookup is pretty extensive. I'm glad you asked - only because when I see other Pi-hole queries questions, I also wonder if I'm at the extreme side of things or if it's just regular usage.

Thanks!

I will certainly try that out. I deployed a brand new VM with a Teleporter import just to remove any sort of corrupt server/OS issue not knowing how easy this was going to be to solve. I'll keep an eye on things and get a debug running when it happens again. I still have the old system running as well.

Thanks!

_DS

12 hours could be the DHCP lease time period.
Who is doing DHCP and whats lease time ?
Do you have other network managers running ?

sudo service dhcpcd status

sudo service network-manager status

sudo service connman status

Thanks @deHakkelaar - in my case - I'm using Pi-hole to do my DHCP as well. So everything gets impacted. I have it set for 24 hours.

The more I read other issues, it seems to be similar to prior issues (that seemed to have been resolved) with a fix under:

pihole checkout ftl tweak/overhaul_overTime

branch and fix.

I think it's really just a lot entries in FTL. Even if I do a restart, FTL loads 24H of log entries and is right back to the same issue in a few minutes.

I've changed the FTL configurations to include:

MAXDBDAYS=1
MAXLOGAGE=4.0

I'm now back online. But I expect this will keep happening. I'll try the debugger this time. The logs don't show anything that says "crashed"

Thanks,

_DS

Do not run pihole checkout unless a developer asks you. That branch is based on an old version of FTL and you will cause a whole lot of headache to yourself if you do that.

Oh yeah - not doing that! But thank you for reading and the reminder. Similar symptoms for sure. I actually installed a brand new instance of Pi-hole on a new VM - same scenario. Will keep looking for answers. Thanks!

Debug spit out something new just now:


Program received signal SIGBUS, Bus error.
0x000056122bb56c3a in _getQuery (queryID=queryID@entry=31031552, checkMagic=checkMagic@entry=false,
line=line@entry=538, function=function@entry=0x56122bca6d48 <__FUNCTION__.12093> "_FTL_new_query",
file=file@entry=0x56122bca6274 "src/dnsmasq_interface.c") at src/shmem.c:766
766src/shmem.c: No such file or directory.
(gdb) 

Please send us the token for this new debug log.

And the list of files in /dev/shm when this happens.

Will do - I was remote when it happened. When it happens again - I'll provide a new debug log and list of files. Thanks!

Not sure if it's related... But each time it's crashed - it's around the same size in this resizing FTL-queries process:

(2 billion??)

[2020-05-14 12:34:50.046 30738] Resizing "/FTL-queries" from 1986002944 to 1986265088
[2020-05-14 12:35:00.552 30738] Resizing "/FTL-queries" from 1986265088 to 1986527232

[2020-05-15 12:09:50.420 10425] Resizing "/FTL-queries" from 1985740800 to 1986002944
[2020-05-15 12:10:03.753 10425] Resizing "/FTL-queries" from 1986002944 to 1986265088

[2020-05-16 12:06:32.981 24511] Resizing "/FTL-queries" from 1985216512 to 1985478656
[2020-05-16 12:06:40.601 24511] Resizing "/FTL-queries" from 1985478656 to 1985740800

[2020-05-17 10:12:30.642 8769] Resizing "/FTL-queries" from 1986002944 to 1986265088

Also - just putting in a snapshot of logs until this happens again:

https://tricorder.pi-hole.net/w0nu58i19h

# ls -l /dev/shm
total 808560
-rw------- 1 pihole pihole     81920 May 17 15:55 FTL-clients
-rw------- 1 pihole pihole       124 May 17 10:44 FTL-counters
-rw------- 1 pihole pihole    122880 May 17 18:45 FTL-dns-cache
-rw------- 1 pihole pihole    196608 May 17 18:30 FTL-domains
-rw------- 1 pihole pihole        48 May 17 10:44 FTL-lock
-rw------- 1 pihole pihole     12288 May 17 10:44 FTL-overTime
-rw------- 1 pihole pihole      4096 May 17 10:44 FTL-per-client-regex
-rw------- 1 pihole pihole 827588608 May 17 18:46 FTL-queries
-rw------- 1 pihole pihole        12 May 17 10:44 FTL-settings
-rw------- 1 pihole pihole    139264 May 17 18:40 FTL-strings
-rw------- 1 pihole pihole      4096 May 17 10:44 FTL-upstreams

How big is your shared memory allocation?

Not sure of the exact command to find that detail - but googling something similar provided me this:

# free -h
              total        used        free      shared  buff/cache   available
Mem:           3.7G        498M        108M        803M        3.1G        2.2G
Swap:          3.9G         43M        3.8G


# cat /proc/sys/kernel/shmmax
18446744073692774399

# ipcs -lm

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

# df /dev/shm
Filesystem     1K-blocks   Used Available Use% Mounted on
tmpfs            1940108 824396   1115712  43% /dev/shm

Thanks, the df is perfect.