DNS resolution unexpectedly stops after upgrade to 5.0

Each topic branch covers a specific feature or section of code. What you see in one branch most often won't be in another: tweak/gravity_counting_logging has code that new/tre-regex doesn't, and vice versa.

Really, you shouldn't be jumping from topic branch to topic branch. Database changes can break your install. Dom does a great job making sure things are as backwards- and cross-compatible as possible, but don't expect that to be the default behavior.

I'm fully aware of the risks associated with frequent branch switches. I'll take the risk that things may break and the database might have to be deleted. I do have a very simple setup which can be restored within minutes and usually don't need the long-term database.

I'll usually switch branches to help investigate new features or fixes. I was on the new/tre-regex branch and only switched to tweak/gravity_counting_logging because I encountered the same issue (or symptom) as reported here and wanted to help narrow it down by increasing FTL's output.

Great, thanks for testing, but branches are not additive. Changing to tweak won't add that code to new/.

Edit: If we need you to change branches to get more information, we'll ask you to change branches. It's very helpful that you're testing but it causes problems and goose chases when you do things without fully understanding the consequences.

@DanSchaper is there anything I need to be doing, as I'm the one that originally reported this? The last thing I did was to run pihole checkout ftl tweak/gravity_counting_logging at the request of DL6ER. I haven't seen any issues since then, but I'm not sure if that's just coincidence as it wasn't 100% reproducible in the first place.

Sorry for the thread derail, let's get @DL6ER to take a look when he can, he's a bit busy as of late.

The changes are harmless and, at worst, they increase readability. I agree that they should still be included for Pi-hole v5.1.


It looks like this happened again just now. Debug token is https://tricorder.pi-hole.net/gl3aeqycnk. The last 5000 lines of pihole-FTL.log are at https://pastebin.com/wx5QdWrw. (If you need more, I'll figure out another way to get it to you.)

It looks like the issue started at 22:01 and I see this in the logs at that time:

[2020-07-10 22:01:15.735 931M] Querying count of distinct domains in gravity database table vw_gravity: SELECT value FROM info WHERE property = 'gravity_count';
[2020-07-10 22:02:00.079 931/T935] dbquery: "BEGIN TRANSACTION IMMEDIATE"
[2020-07-10 22:02:00.079 931/T935]          ---> OK
[2020-07-10 22:02:00.080 931/T935] dbquery: "SELECT MAX(ID) FROM queries"
[2020-07-10 22:02:00.081 931/T935]          ---> Result 1936649 (long long int)
[2020-07-10 22:02:00.082 931/T935] dbquery: "END TRANSACTION"

It looks like there's a 45-second gap there. I'll leave the rest to the experts to figure out.

These two lines come from different execution contexts (931M is the main process, 931/T is a thread of the main process), so the delay between them does not mean anything here.
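
If you want to follow only the main process in a long log, filtering on its marker is usually enough, e.g. (assuming the default log location /var/log/pihole-FTL.log and the PID 931 from the excerpt above):

grep ' 931M]' /var/log/pihole-FTL.log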

The next log line of the main process is:

[2020-07-10 22:02:44.420 931M] Querying count of distinct domains in gravity database table vw_regex_blacklist: SELECT COUNT(DISTINCT domain) FROM vw_regex_blacklist

so the gap is more like 89 seconds in reality.

I cannot figure out how

SELECT value FROM info WHERE property = 'gravity_count';

could take 1.5 minutes, especially because the next actions (partially involving real counting) return immediately:

[2020-07-10 22:02:44.420 931M] Querying count of distinct domains in gravity database table vw_regex_blacklist: SELECT COUNT(DISTINCT domain) FROM vw_regex_blacklist
[2020-07-10 22:02:44.421 931M] INFO: No regex blacklist entries found
[2020-07-10 22:02:44.421 931M] Querying count of distinct domains in gravity database table vw_regex_whitelist: SELECT COUNT(DISTINCT domain) FROM vw_regex_whitelist
[2020-07-10 22:02:44.421 931M] INFO: No regex whitelist entries found
[2020-07-10 22:02:44.422 931M] Querying gravity database for client 192.168.0.42 (getting best match)
[2020-07-10 22:02:44.422 931M] Querying gravity database for client 192.168.0.1 (getting best match)
[2020-07-10 22:02:44.423 931M] Querying gravity database for client 192.168.0.46 (getting best match)
[2020-07-10 22:02:44.424 931M] Querying gravity database for client 192.168.0.3 (getting best match)
[2020-07-10 22:02:44.424 931M] Querying gravity database for client 127.0.0.1 (getting best match)
[2020-07-10 22:02:44.425 931M] Compiled 0 whitelist and 0 blacklist regex filters for 5 clients in 4.7 msec
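
As a sanity check on the database file itself, the same count lookup can be timed from the command line; this is only a sketch, assuming the default gravity database at /etc/pihole/gravity.db:

time sudo sqlite3 /etc/pihole/gravity.db "SELECT value FROM info WHERE property = 'gravity_count';"

If this returns instantly outside FTL, the delay is unlikely to be in SQLite itself.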

Checking back with our database master @PromoFaux to see if he has an idea of what else we could look at (there are no triggers on the info table or anything like that).

@DL6ER

As far as I can tell from going through the logs, we have never seen any log output from the changes you made in fix/dkozinn-debugging. Maybe you could merge that with tweak/gravity_counting_logging and @dkozinn could try the combined branch?

It happened again overnight. Debug token is https://tricorder.pi-hole.net/dop07i5cjo; I've extracted a section of the log around when it happened. But I've noticed something interesting: based on collectd metrics, CPU and loadavg climbed sharply at about 04:16, at which point it looks like something starts to run, though it appears to finish in under a second. I don't know if that's just a coincidence, but based on the collectd data it looks like it took until around 04:23 to clear up.

I do want to mention that the Pi is also running Pi-aware (aircraft tracking), but I've checked those logs as well as syslog and there's nothing that corresponds to those times other than things that run continuously and don't cause issues. (And as you'll recall, this problem only started with 5.0; the version of Pi-aware hasn't changed.)

Also, do you still want to see the logs? I'm happy to keep sending, just want to make sure they are useful.

It may have been the automated, once-a-week gravity update; check your cron configuration to see if the time matches:

grep Gravity /etc/cron.d/pihole

If the line starts with 16 4 * * 7 (or something close to it), gravity starts at 04:16 AM on the 7th day of the week (Sunday).
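
For reference, the five leading fields of a cron entry are minute, hour, day of month, month, and day of week, so a line roughly like the following (illustrative only; the real entry in /etc/cron.d/pihole carries a longer command) would fire at 04:16 every Sunday:

16 4 * * 7   root   pihole updateGravity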

Only the devs can see them. Which branch of FTL are you running currently?

I guess this one? Maybe they can add more debugging details in case something interesting didn't show up so far. @DL6ER (edit: I see this has already been requested by @yubiuser).

@Coro: Gravity update happens at 04:18 on Sunday (18 4 * * 7), which matches what's seen in the logs.

And as for version, you're correct that I'm running that special version.
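
(If it's ever in doubt, pihole -v should list the branch each component is currently checked out on, if I remember the output of that command right.)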

@dkozinn Sorry for the delay here. I just merged the special branch fix/dkozinn-debugging into tweak/gravity_counting_logging. It will add more logging of each individual step of the database loading process so we can (hopefully) see what is really causing this delay.

Please update your Pi-hole.
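
If you are already on tweak/gravity_counting_logging, re-running the same checkout command used earlier in this thread should pull the freshly built binary (this is just the usual procedure, nothing special):

pihole checkout ftl tweak/gravity_counting_logging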

I've done the update and will keep an eye on it.


Ok, the issue recurred today. Debug token is https://tricorder.pi-hole.net/6ime59r933. The last 5000 lines of pihole-FTL.log are at Ubuntu Pastebin. I believe the issue started sometime after 17:21 (maybe 17:23?). I had some unrelated issues at the time and didn't get the normal historical CPU metrics, but I could see that pihole-FTL was running at 100% CPU while the problem was occurring.

[2020-07-17 17:23:44.892 28893M] Querying count of distinct domains in gravity database table vw_gravity: SELECT value FROM info WHERE property = 'gravity_count';
[2020-07-17 17:23:44.892 28893M] Stepping...
[2020-07-17 17:23:44.892 28893M] Getting result...
[2020-07-17 17:23:44.892 28893M] Got result: 94793
[2020-07-17 17:23:44.892 28893M] Finalized
[2020-07-17 17:23:44.892 28893M] Reading regex from database...
[...]
[2020-07-17 17:24:37.655 28893M] Querying count of distinct domains in gravity database table vw_regex_blacklist: SELECT COUNT(DISTINCT domain) FROM vw_regex_blacklist
[2020-07-17 17:24:37.655 28893M] Stepping...
[2020-07-17 17:24:37.656 28893M] Getting result...
[2020-07-17 17:24:37.656 28893M] Got result: 0
[2020-07-17 17:24:37.656 28893M] Finalized
[2020-07-17 17:24:37.656 28893M] INFO: No regex blacklist entries found

This might be a coincidence, but it looks like this happened right at the time Cloudflare had their outage, and I do use their 1.1.1.1 as one of my upstream servers.
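
For anyone cross-checking that, the configured upstream servers live in setupVars.conf on a standard install (path assumed to be the default):

grep PIHOLE_DNS /etc/pihole/setupVars.conf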

Had another one just happen. Debug token is https://tricorder.pi-hole.net/zv0xnvu7fy, and a log extract of pihole-FTL.log is at https://pastebin.com/iFUCczef. I think the issue starts at around 11:14:02 in the log:

[2020-07-19 11:14:02.456 28893M] Querying count of distinct domains in gravity database table vw_gravity: SELECT value FROM info WHERE property = 'gravity_count';
[2020-07-19 11:14:02.456 28893M] Stepping...
[2020-07-19 11:14:02.456 28893M] Getting result...
[2020-07-19 11:14:02.456 28893M] Got result: 94989
[2020-07-19 11:14:02.457 28893M] Finalized
[2020-07-19 11:14:02.457 28893M] Reading regex from database...
...
[2020-07-19 11:15:52.432 28893M] Querying count of distinct domains in gravity database table vw_regex_blacklist: SELECT COUNT(DISTINCT domain) FROM vw_regex_blacklist
[2020-07-19 11:15:52.433 28893M] Stepping...
[2020-07-19 11:15:52.433 28893M] Getting result...
[2020-07-19 11:15:52.433 28893M] Got result: 0
[2020-07-19 11:15:52.433 28893M] Finalized
[2020-07-19 11:15:52.434 28893M] INFO: No regex blacklist entries found
[2020-07-19 11:15:52.434 28893M] Querying count of distinct domains in gravity database table vw_regex_whitelist: SELECT COUNT(DISTINCT domain) FROM vw_regex_whitelist

So counting the gravity domains is a red herring. We see now that this did not even take one millisecond (11:14:02.456 to 11:14:02.457 in the excerpt above).

So the delay is really between

[2020-07-19 11:14:02.457 28893M] Reading regex from database...

and

[2020-07-19 11:15:52.432 28893M] Querying count of distinct domains in gravity database table vw_regex_blacklist: SELECT COUNT(DISTINCT domain) FROM vw_regex_blacklist

which is exactly the same gap we saw in the previous log (17:23:44 to 17:24:37).

What is interesting is that there is always a thread hopping in between these two lines (abbreviated by ... and [...] above), but this shouldn't matter right now.
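
To make these gaps easy to spot in future logs, it is enough to pull out just those two markers and compare their timestamps (a minimal sketch, assuming the default log location /var/log/pihole-FTL.log):

grep -E "Reading regex from database|vw_regex_blacklist" /var/log/pihole-FTL.log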

I pushed another change. It contains (a) even more debugging output and (b) an optimization of the cache-cleaning procedure.

I guess (b) may even be a sufficient improvement to resolve this; we'll see...

@dkozinn Please update your FTL. It may take a few minutes until the binaries are generated on the building system so please try again if you are too fast :wink:

I think we're well on the way to getting there. Best 73s!