DNS Server not working consistently after upgrade

Expected Behaviour:

After upgrading to v6, I expected Pi-hole to resolve DNS consistently.

Actual Behaviour:

The pi-hole stops resolving addresses. After 5-10 minutes of not resolving, it begins resolving queries again. While it's not resolving, network captures show no attempts to contact upstream DNS servers. New queries keep arriving, but no response or attempt to resolve them is seen.

Unlike what others have reported, I don't see high CPU usage. It stays steady at about 10% whether resolution is working or not.

Both a working and nonworking debug token are below. The same process is in both. The nonworking debug token was generated about 8 minutes after the working token.

Thanks in advance for any assistance.

Debug Token:

This debug session is from when the system was broken: https://tricorder.pi-hole.net/zr8Mp1xv/

This debug session is from when the system worked correctly: https://tricorder.pi-hole.net/nqBdq1Tn/

Two things in your debug log may be related to what you're seeing:

*** [ DIAGNOSING ]: contents of /var/log/pihole

-rw-r----- 1 pihole pihole 890 Mar  8 00:11 /var/log/pihole/FTL.log
   -----head of FTL.log------
   2025-03-08 00:03:33.086 EST [574/T1313] ERROR: Cannot receive UDP DNS reply: Timeout - no response from upstream DNS server
   2025-03-08 00:03:33.087 EST [574/T1313] INFO: Tried to resolve PTR "118.0.168.192.in-addr.arpa" on 127.0.0.1#53 (UDP)
   2025-03-08 00:10:07.726 EST [574/T1313] ERROR: Cannot receive UDP DNS reply: Timeout - no response from upstream DNS server
   2025-03-08 00:10:07.728 EST [574/T1313] INFO: Tried to resolve PTR "183.0.168.192.in-addr.arpa" on 127.0.0.1#53 (UDP)
   2025-03-08 00:11:03.191 EST [574M] INFO: Rate-limiting 192.168.0.118 for at least 2 seconds

Upstream DNS requests time out, and at least one of your clients has been rate-limited.

What kind of device is 192.168.0.118?

Okay, going from here, let me first reverse-engineer when your system was broken:

Active: active (running) since Fri 2025-03-07 23:27:59 EST; 1h 13min ago

=> Sat 2025-03-08 00:41:00 EST
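
The estimate above can be reproduced with a short shell sketch. It assumes GNU date and uses a POSIX TZ rule for US Eastern so that it does not depend on the time zone of the machine running it; the variable names are illustrative only. Since the uptime is only reported to the minute, the result is accurate to within about a minute.

```shell
#!/bin/sh
# POSIX TZ rule for US Eastern (assumption: the Pi-hole host uses this zone).
export TZ="EST5EDT,M3.2.0,M11.1.0"

# Start time taken from the "Active: active (running) since ..." line.
start_epoch=$(date -d "2025-03-07 23:27:59" +%s)

# "1h 13min ago" means roughly 73 minutes of uptime at debug log creation.
debug_epoch=$(( start_epoch + 73 * 60 ))

debug_time=$(date -d "@${debug_epoch}" "+%Y-%m-%d %H:%M:%S %Z")
echo "$debug_time"
```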

Looking at the tail of the pihole.log snippet, no queries arrived at your Pi-hole after Mar 8 00:39:08, so this matches: no queries arriving means no queries being answered.

So now the task is to find out why no queries seem to be arriving. Since the queries generated internally while the debug log was being created were not processed either, I'd say an external network issue can be ruled out.

Please first try setting debug.all = true using, e.g.,

sudo pihole-FTL --config debug.all true

and then quote the last couple of lines from /var/log/pihole/FTL.log when it fails again. Alternatively, the command

tail -n 500 /var/log/pihole/FTL.log | pihole tricorder

would do the same for you, uploading it to our server and giving you another token you'd need to share with us.

Upload successful, your token is: https://tricorder.pi-hole.net/PySvYkd9/

Looks like the process is stopping activity when it hits disk.query_storage

root@raspberrypi:/home/pi# grep -F -C2 disk.query_storage /var/log/pihole/FTL.log
2025-03-08 08:42:16.066 EST [574/T1311] DEBUG_DATABASE: dbquery: "UPDATE disk.counters SET value = value + 31 WHERE id = 1;"
2025-03-08 08:42:16.067 EST [574/T1311] DEBUG_DATABASE:          ---> OK
2025-03-08 08:47:15.767 EST [574/T1311] DEBUG_DATABASE: Exported 235 rows for disk.query_storage (took 300647.9 ms, last SQLite ID 469361600)
2025-03-08 08:47:15.812 EST [574/T1312] DEBUG_LOCKS: Obtained SHM lock for replace_config() (/app/src/config/config.c:1901)
2025-03-08 08:47:15.813 EST [574/T1311] DEBUG_LOCKS: Removed SHM lock in DB_thread() (/app/src/database/database-thread.c:144)

2025-03-08 08:53:27.026 EST [574M] DEBUG_LOCKS: Waiting for SHM lock in FTL_dnsmasq_log() (/app/src/dnsmasq_interface.c:3776)
2025-03-08 08:53:27.027 EST [574M] DEBUG_LOCKS: SHM lock: 0x76f6f000
2025-03-08 08:58:24.218 EST [574/T1311] DEBUG_DATABASE: Exported 0 rows for disk.query_storage (took 298780.4 ms, last SQLite ID 469361365)
2025-03-08 08:58:24.265 EST [574/T1312] DEBUG_LOCKS: Obtained SHM lock for GC_thread() (/app/src/gc.c:685)
2025-03-08 08:58:24.265 EST [574/T1311] DEBUG_LOCKS: Removed SHM lock in DB_thread() (/app/src/database/database-thread.c:144)

2025-03-08 09:00:00.897 EST [574/T1313] DEBUG_LOCKS: Waiting for SHM lock in resolveClients() (/app/src/resolve.c:873)
2025-03-08 09:00:00.897 EST [574/T1313] DEBUG_LOCKS: SHM lock: 0x76f6f000
2025-03-08 09:03:22.892 EST [574/T1311] DEBUG_DATABASE: Exported 694 rows for disk.query_storage (took 297789.8 ms, last SQLite ID 469362753)
2025-03-08 09:03:22.942 EST [574/T1311] DEBUG_LOCKS: Removed SHM lock in DB_thread() (/app/src/database/database-thread.c:144)
2025-03-08 09:03:22.942 EST [574M] DEBUG_LOCKS: Obtained SHM lock for FTL_dnsmasq_log() (/app/src/dnsmasq_interface.c:3776)

2025-03-08 09:03:31.487 EST [574/T1313] DEBUG_LOCKS: Waiting for SHM lock in resolveClients() (/app/src/resolve.c:1000)
2025-03-08 09:03:31.487 EST [574/T1313] DEBUG_LOCKS: SHM lock: 0x76f6f000
2025-03-08 09:08:22.860 EST [574/T1311] DEBUG_DATABASE: Exported 0 rows for disk.query_storage (took 292554.0 ms, last SQLite ID 469362059)
2025-03-08 09:08:22.908 EST [574M] DEBUG_LOCKS: Obtained SHM lock for FTL_dnsmasq_log() (/app/src/dnsmasq_interface.c:3776)
2025-03-08 09:08:22.908 EST [574M] DEBUG_LOCKS: Removed SHM lock in FTL_dnsmasq_log() (/app/src/dnsmasq_interface.c:3782)
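
To see at a glance how long each database export took, the "took ... ms" figures can be pulled out of log lines like the ones above. The following is a minimal sketch using standard awk. It works on a few sample lines copied from the excerpt rather than on the real log file, and the variable names are only illustrative.

```shell
#!/bin/sh
# Sample lines copied from the FTL.log excerpt; on a real system one would
# feed /var/log/pihole/FTL.log to awk instead of this variable.
sample='2025-03-08 08:47:15.767 EST [574/T1311] DEBUG_DATABASE: Exported 235 rows for disk.query_storage (took 300647.9 ms, last SQLite ID 469361600)
2025-03-08 08:47:15.813 EST [574/T1311] DEBUG_LOCKS: Removed SHM lock in DB_thread() (/app/src/database/database-thread.c:144)
2025-03-08 08:58:24.218 EST [574/T1311] DEBUG_DATABASE: Exported 0 rows for disk.query_storage (took 298780.4 ms, last SQLite ID 469361365)'

durations=$(printf '%s\n' "$sample" | awk '
  /Exported [0-9]+ rows for disk\.query_storage/ {
    # Extract the number between "took " and " ms" and convert it to seconds.
    if (match($0, /took [0-9.]+ ms/)) {
      ms = substr($0, RSTART + 5, RLENGTH - 8)
      printf "%s %s %.1f s\n", $1, $2, ms / 1000
    }
  }
')
echo "$durations"
```

Each export in the excerpt takes close to 300 seconds, which is in line with the 5-10 minute outages described at the top of the thread.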

It's a Palo Alto firewall. It's not inline with any traffic; it's just looking up hosts for its policies.

I checked /etc/pihole/pihole-FTL.db and found it had grown to 13 GB. I'm not sure how the Pi-hole setting database.maxDBdays relates to this database, but I found entries in the query_storage table that were over 100 days old despite database.maxDBdays being set to 91.

I deleted entries from query_storage older than 30 days and got the database down to about 1.3 GB. The "DEBUG_DATABASE: Exported" statements are now down to about 1 second to complete. I still see the pihole not responding during this time, but with it being about a second, it's not so noticeable in normal activity.
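
For anyone wanting to check whether their own database holds rows older than the retention window, a sketch along these lines may help. It only builds and prints a counting query; it does not run it or delete anything. The table and column names (query_storage with a timestamp column in Unix epoch seconds) are assumptions based on the queries used elsewhere in this thread, and the variable names are illustrative.

```shell
#!/bin/sh
# Retention window in days, i.e. the value of database.maxDBdays from the post.
max_db_days=91

# Anything with a timestamp below this cutoff is older than the window.
now_epoch=$(date +%s)
cutoff_epoch=$(( now_epoch - max_db_days * 86400 ))

# Assumed schema: query_storage.timestamp holds Unix epoch seconds.
sql="SELECT count(*) FROM query_storage WHERE timestamp < ${cutoff_epoch};"
echo "$sql"

# On the Pi-hole host the query could then be run with the embedded shell:
#   sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "$sql"
```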

Let's see what domains it was requesting when rate limited, and how many DNS requests Pi-hole had to handle at that time.
What's the output of:

sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries \
WHERE timestamp > strftime('%s','2025-03-08 00:11:03.191 EST', '-60 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191 EST', 'utc');"
sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT domain, count(*) FROM queries \
WHERE client = '192.168.0.118' \
AND timestamp > strftime('%s','2025-03-08 00:11:03.191 EST', '-60 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191 EST', 'utc') \
GROUP BY domain ORDER BY 2 DESC LIMIT 20;"

I had to modify the time in your commands to get a result. The 'EST' suffix seems to break the strftime function. Please let me know if there's something else I should have used.

root@raspberrypi:/home/pi# pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "select strftime('%s','2025-03-08 00:11:03.191 EST', '-60 seconds', 'utc');"

root@raspberrypi:/home/pi# pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "select strftime('%s','2025-03-08 00:11:03.191', '-60 seconds', 'utc');"
1741410603
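
SQLite's date functions do not understand zone abbreviations such as EST, which is why the first command returns nothing. One way around this is to convert the log timestamp to epoch seconds in the shell first. The sketch below assumes GNU date and a POSIX TZ rule for US Eastern; the variable names are illustrative.

```shell
#!/bin/sh
# Timestamp as it appears in FTL.log, with milliseconds and zone suffix.
log_ts="2025-03-08 00:11:03.191 EST"

# Keep only "YYYY-MM-DD HH:MM:SS": drop the zone suffix, then the fraction.
plain_ts=${log_ts% *}      # 2025-03-08 00:11:03.191
plain_ts=${plain_ts%.*}    # 2025-03-08 00:11:03

# Interpret the value as US Eastern local time (assumption) and print epoch seconds.
end_epoch=$(TZ="EST5EDT,M3.2.0,M11.1.0" date -d "$plain_ts" +%s)
start_epoch=$(( end_epoch - 60 ))

echo "window: ${start_epoch} .. ${end_epoch}"
```

The start of the window comes out as 1741410603, the same value the modified strftime call returns, so the two numbers could be used directly in the timestamp comparisons of the queries.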

No results in the requested time range:

pi@raspberrypi:~ $ sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries \
WHERE timestamp > strftime('%s','2025-03-08 00:11:03.191', '-60 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191', 'utc');"
0
pi@raspberrypi:~ $ sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT domain, count(*) FROM queries \
WHERE client = '192.168.0.118' \
AND timestamp > strftime('%s','2025-03-08 00:11:03.191', '-60 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191', 'utc') \
GROUP BY domain ORDER BY 2 DESC LIMIT 20;"

I had to expand the range to 500 seconds to get any results:

pi@raspberrypi:~ $ sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries \
WHERE timestamp > strftime('%s','2025-03-08 00:11:03.191', '-400 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191', 'utc');"
0
pi@raspberrypi:~ $ sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries \
WHERE timestamp > strftime('%s','2025-03-08 00:11:03.191', '-500 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191', 'utc');"
150
pi@raspberrypi:~ $ sudo pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT domain, count(*) FROM queries \
WHERE client = '192.168.0.118' \
AND timestamp > strftime('%s','2025-03-08 00:11:03.191', '-500 seconds', 'utc') \
AND timestamp <= strftime('%s','2025-03-08 00:11:03.191', 'utc') \
GROUP BY domain ORDER BY 2 DESC LIMIT 20;"
workshop.cit.gmu.edu|2
wmra.gmu.edu|2
vxn.datawire.net|2
urecregister.gmu.edu|2
updates.paloaltonetworks.com|2
rm7.cit.gmu.edu|2
it-secawarep.gmu.edu|2
it-ntp1.gmu.edu|2
it-edwdb1.gmu.edu|2
it-cppmt.gmu.edu|2
it-ccp1.gmu.edu|2
europa.cisat.gmu.edu|2
dlve-slp.cisat.gmu.edu|2
csmbio.csm.gmu.edu|2
csm2.csm.gmu.edu|2
clearpass.gmu.edu|2
cars1.gmu.edu|2
www.gmu.edu|1
wiki.cise.gmu.edu|1

In general, I know that the two domains this device queries most are the two I added to dns.hosts in the configuration.

Apologies, I was too eager when copying the timestamp from your debug log and included the EST suffix.

The resulting timestamp looks correct:

$ TZ=":US/Eastern" date -d @1741410603
Sa 8. Mär 00:10:03 EST 2025

However, the SQL results are unexpected:
Pi-hole's rate limit default is 1,000 requests per minute per client.
For the rate limit to trigger, your 192.168.0.118 would have had to exceed 1,000 requests in the 60-second period before the message was logged.
Also, assuming it wasn't the only active client during that period, the overall count(*) from the first SQL query should have been well over 1,000, whereas your result is a meager 150, which is not enough to trigger the rate limit.

Just to be sure:
Your time zone on your Pi-hole machine is indeed US/Eastern?

Yes, it's US/Eastern
