Database locks up every 2-3 days, recovers by itself in about 2 hours

Hi,

I have a fairly standard pihole setup : DNS server on IPV4 using cloudflare as Public DNS. DHCP as well offering IPV4 leases.
Hardware is 4 GB Pi4b, 64 GB MicroSD card.
PiHole config has been imported from another pihole instance (no longer functional now).

Right from the intial days this pihole install went active, it's been experiencing issues wherein:
Database locks up every 2-3 days, recovers by itself in about 2 hours by a restart of the FTL process.
The lockup time window is manifested as a gap in Dashboard - Queries.
Within nslookup, the response to any domain being queried is "Query Refused"

I cleared the network table which seemed to help for a few days, and then it came back again .

I have held off doing the repair (pihole - r) or setting it up from scratch.
Please advise. Thanks in advance.

Debug Token:

https://tricorder.pi-hole.net/sSPBpCw8/

Here's a snippet of the dashboard, showing the gap when the database locked up

What makes you suspect that the database is locked?
Do you encounter any errors during those times that would hint at a database issue?

Commonly, a gap in the dashboard would indicate that Pi-hole has not received any queries during that time, or that Pi-hole would have been terminated, e.g. by a crash. But if Pi-hole would have crashed, it would not recover by itself.

Not sure if that's related, but your debug log shows a constant stream of warning messages:

-rw-r--r-- 1 pihole pihole 440K Jul 11 01:55 /var/log/pihole/FTL.log
   -----head of FTL.log------
   [2023-07-11 00:00:06.364 586M] WARNING in dnsmasq core: Ignoring domain alsco.com.au for DHCP host name PYM-A3V85WV3

There seems to be one of those messages every five or six seconds.

Some Intel device at 192.168.1.181 seems to be claiming that domain name during DHCP lease negotiation.
You should try to identify that device and see if you could address why it would claim that domain.

Hi,
Thanks for taking a look at the log.

The SQLite database locked messages are from the /var/log/pihole-FTL.log, including the events indicating that

  • Database is locked, so queries/transactions cannot be flushed
  • And hence will be kept in memory
  • These repeat themselves for the next 1 hr 50 mins, after which FTL terminates , and spawns a new instance of the process.

The Time window of this lockup more or less tallies with the gap in dashboard.
How I noticed there's an issue at the time:

  • My phone was unable to checking emails or chat messages.
  • Then on the computer, tried resolving 2 domains (yahoo.com/google.com) through nslookup, both failed

Below are some snips from the FTL log.

As for the that device with hostname PYM-A3vxxxx (192.168.1.181), it actually is an office asset, and the specified domain alsco.xxxxx is the DNS suffx of the corporate domain (part of the machine FQDN)
And this machine has been on the network working well with previous pihole instance for a few months, but this issue didn't occur, before this instance took over.

Can you elaborate on this imported config? Was it just a teleporter backup or were other files brought over too? How was the import done?

Database lockups are normal, but they usually don't last that long.

sqlite3 locks the database when it is writing data.

Usually, when the file is locked, the data is kept in memory and later the data is finally saved to the disk.

This is how a normal log entry looks like when the data is finally stored:

 Notice: Queries stored in long-term database: 277 (took 120.9 ms, last SQLite ID 81964907)
         Queries from earlier attempt(s) stored successfully

Your "in-memory" data should have been stored into the database when the previous operations were finished, but in your case you had an unusually long database lock period.

Queries were never stored in the database, something else (too many open files error) crashed FTL and in memory data was lost, causing the gap.


Is there another process (script, cron job, third party app) using the Pi-hole database?

The debug log only contains a few lines from the log files.

Could you share the logs by uploading them to our servers and sharing the token here?

sudo cat /var/log/pihole/pihole.log | pihole tricorder

You'd probably have to check whether pihole.log still contains the period, or if that has been rotated to pihole.log.1 already.

From the excerpt you show, the respective sequence from the logs looks just like a controlled restart, e.g. as initiated via Pi-hole's UI.

Your debug log suggests that some UI interaction occured at 01:50:08, just five seconds before the shutdown message at 01:50:13 appears in your log, and only a few minutes before the debug log was created at around 01:54.

Did you perhaps restart Pi-hole's DNS resolver as a measure to fix your issue before creating a debug log?

Does that happen at always the same time?

Also, your log shows that Pi-hole has been running for some 23+ hours before it was shutdown, which would have been earlier than the 2 to 3 days that you've observed.

And to get an idea how (continously) often that Ignoring domain warning occurs, what's the output of:

sudo zgrep -c "Ignoring domain alsco" /var/log/pihole/pihole.log.*