Good afternoon. I was hoping that my crashing were unique and one-off, but over the last few days, my DNS and FTL has been going down about every 12 hours. It started about a week ago on version 4.x - I since upgraded to 5 to see if that would solve my problems - but alas, still happening. A restart of the DNS Resolver service brings it back online. No need to restart the server, etc.
Running on VM instance of CentOS 7 - straight install package. Nothing else is running on the instance. 4GB memory, 2CPU. My monitoring shows that I'm not running out of memory, not running out of CPU, or disk space.
When the service stops - everything on my network collapses obviously. It does take a fair amount of queries - about 36 million in a 24 hour period - but it's been working fine for the 6 months or so that I've been using it.
Not seeing anything specific in the logs but will continue to investigate. Any help would be appreciated.
Appreciate the quick message back. The only thing those shows errors are in the /var/log/pihole.log:
May 15 12:22:00 dnsmasq[11720]: reply **error** is not implemented
May 15 12:22:00 dnsmasq[11720]: reply **error** is not implemented
May 15 12:22:01 dnsmasq[11720]: reply **error** is not implemented
But those were in there both before and after the crash. I don't see anything at the time of the failures. All I do see is no more entries of this kind of detail:
[2020-05-15 12:09:40.402 10425] Resizing "/FTL-queries" from 1985478656 to 1985740800
[2020-05-15 12:09:50.420 10425] Resizing "/FTL-queries" from 1985740800 to 1986002944
[2020-05-15 12:10:03.753 10425] Resizing "/FTL-queries" from 1986002944 to 1986265088
That gets logged every 10 seconds or so... And then it stops. I restart the service and I see the normal startup details:
[2020-05-15 12:20:38.506 10961] Using log file /var/log/pihole-FTL.log
[2020-05-15 12:20:38.506 10961] ########## FTL started! ##########
[2020-05-15 12:20:38.506 10961] FTL branch: master
[2020-05-15 12:20:38.506 10961] FTL version: v5.0
[2020-05-15 12:20:38.506 10961] FTL commit: 3d7c095
[2020-05-15 12:20:38.506 10961] FTL date: 2020-05-10 18:58:38 +0100
[2020-05-15 12:20:38.506 10961] FTL user: pihole
[2020-05-15 12:20:38.506 10961] Compiled for x86_64 (compiled on CI) using gcc (Debian 6.3.0-18+deb9u1) 6.3.0 20170516
[2020-05-15 12:20:38.506 10961] Starting config file parsing (/etc/pihole/pihole-FTL.conf)
[2020-05-15 12:20:38.506 10961] SOCKET_LISTENING: only local
[2020-05-15 12:20:38.506 10961] AAAA_QUERY_ANALYSIS: Show AAAA queries
[2020-05-15 12:20:38.506 10961] MAXDBDAYS: max age for stored queries is 1 days
[2020-05-15 12:20:38.506 10961] RESOLVE_IPV6: Resolve IPv6 addresses
[2020-05-15 12:20:38.506 10961] RESOLVE_IPV4: Resolve IPv4 addresses
[2020-05-15 12:20:38.506 10961] DBINTERVAL: saving to DB file every minute
[2020-05-15 12:20:38.506 10961] DBFILE: Using /etc/pihole/pihole-FTL.db
[2020-05-15 12:20:38.506 10961] MAXLOGAGE: Importing up to 4.0 hours of log data
[2020-05-15 12:20:38.506 10961] PRIVACYLEVEL: Set to 0
[2020-05-15 12:20:38.506 10961] IGNORE_LOCALHOST: Show queries from localhost
[2020-05-15 12:20:38.506 10961] BLOCKINGMODE: Null IPs for blocked domains
[2020-05-15 12:20:38.506 10961] ANALYZE_ONLY_A_AND_AAAA: Enabled. Analyzing only A and AAAA queries
[2020-05-15 12:20:38.506 10961] DBIMPORT: Importing history from database
[2020-05-15 12:20:38.506 10961] PIDFILE: Using /var/run/pihole-FTL.pid
[2020-05-15 12:20:38.506 10961] PORTFILE: Using /var/run/pihole-FTL.port
[2020-05-15 12:20:38.506 10961] SOCKETFILE: Using /var/run/pihole/FTL.sock
[2020-05-15 12:20:38.507 10961] SETUPVARSFILE: Using /etc/pihole/setupVars.conf
[2020-05-15 12:20:38.507 10961] MACVENDORDB: Using /etc/pihole/macvendor.db
[2020-05-15 12:20:38.507 10961] GRAVITYDB: Using /etc/pihole/gravity.db
[2020-05-15 12:20:38.507 10961] PARSE_ARP_CACHE: Active
[2020-05-15 12:20:38.507 10961] CNAME_DEEP_INSPECT: Active
[2020-05-15 12:20:38.507 10961] DELAY_STARTUP: No delay requested.
[2020-05-15 12:20:38.507 10961] BLOCK_ESNI: Enabled, blocking _esni.{blocked domain}
[2020-05-15 12:20:38.507 10961] Finished config file parsing
[2020-05-15 12:20:38.508 10961] Database version is 5
[2020-05-15 12:20:38.508 10961] Database successfully initialized
That 10 minutes of time between timestamps is the outage window before I was able to hit the admin page and restart the service.
The logs to get pretty large - is there anything else I could grep through for?
This is a home lab actually - but I have about 100 some odd devices with significant amount of monitoring going on between them all. A few of my servers poll a lot of my devices every 10-30 seconds so the DNS lookup is pretty extensive. I'm glad you asked - only because when I see other Pi-hole queries questions, I also wonder if I'm at the extreme side of things or if it's just regular usage.
I will certainly try that out. I deployed a brand new VM with a Teleporter import just to remove any sort of corrupt server/OS issue not knowing how easy this was going to be to solve. I'll keep an eye on things and get a debug running when it happens again. I still have the old system running as well.
Thanks @deHakkelaar - in my case - I'm using Pi-hole to do my DHCP as well. So everything gets impacted. I have it set for 24 hours.
The more I read other issues, it seems to be similar to prior issues (that seemed to have been resolved) with a fix under:
pihole checkout ftl tweak/overhaul_overTime
branch and fix.
I think it's really just a lot entries in FTL. Even if I do a restart, FTL loads 24H of log entries and is right back to the same issue in a few minutes.
I've changed the FTL configurations to include:
MAXDBDAYS=1
MAXLOGAGE=4.0
I'm now back online. But I expect this will keep happening. I'll try the debugger this time. The logs don't show anything that says "crashed"
Do not run pihole checkout unless a developer asks you. That branch is based on an old version of FTL and you will cause a whole lot of headache to yourself if you do that.
Oh yeah - not doing that! But thank you for reading and the reminder. Similar symptoms for sure. I actually installed a brand new instance of Pi-hole on a new VM - same scenario. Will keep looking for answers. Thanks!
Program received signal SIGBUS, Bus error.
0x000056122bb56c3a in _getQuery (queryID=queryID@entry=31031552, checkMagic=checkMagic@entry=false,
line=line@entry=538, function=function@entry=0x56122bca6d48 <__FUNCTION__.12093> "_FTL_new_query",
file=file@entry=0x56122bca6274 "src/dnsmasq_interface.c") at src/shmem.c:766
766src/shmem.c: No such file or directory.
(gdb)
Not sure if it's related... But each time it's crashed - it's around the same size in this resizing FTL-queries process:
(2 billion??)
[2020-05-14 12:34:50.046 30738] Resizing "/FTL-queries" from 1986002944 to 1986265088
[2020-05-14 12:35:00.552 30738] Resizing "/FTL-queries" from 1986265088 to 1986527232
[2020-05-15 12:09:50.420 10425] Resizing "/FTL-queries" from 1985740800 to 1986002944
[2020-05-15 12:10:03.753 10425] Resizing "/FTL-queries" from 1986002944 to 1986265088
[2020-05-16 12:06:32.981 24511] Resizing "/FTL-queries" from 1985216512 to 1985478656
[2020-05-16 12:06:40.601 24511] Resizing "/FTL-queries" from 1985478656 to 1985740800
[2020-05-17 10:12:30.642 8769] Resizing "/FTL-queries" from 1986002944 to 1986265088
Not sure of the exact command to find that detail - but googling something similar provided me this:
# free -h
total used free shared buff/cache available
Mem: 3.7G 498M 108M 803M 3.1G 2.2G
Swap: 3.9G 43M 3.8G
# cat /proc/sys/kernel/shmmax
18446744073692774399
# ipcs -lm
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1
# df /dev/shm
Filesystem 1K-blocks Used Available Use% Mounted on
tmpfs 1940108 824396 1115712 43% /dev/shm