Behavior when database is busy

True, but should this warrant an exception from rate-limiting? Rather not.

you're suggesting to

which means there will be no load on any configured resolver (server=). I wouldn't treat this as a rate limit violation, but the decission is yours.
I've disabled rate limiting on my system (RATE_LIMIT=0/0), so I'll find out soon enough if refusing the queries will cause problems.

I'm not yet sure what would be the best solution. If clients do not get a REFUSED immediately, they should also not retry immediately - but only after some timeout. So it might be a better solution to silently drop queries during the busy database issue. Typical retry delays should be on the order of seconds until the first retry is triggered.

Drop the query but keep logging it, as before, so we can see it is happening.

If you're not sure, isn't it possible to code

  • assume allowed (current situation)
  • reply refused
  • drop query
    so we have a choice, and can test both options.

It is surely possible, just notably more work so it takes longer until feature realization.

Please try

pihole checkout ftl tweak/blocking_while_busy
pihole checkout web tweak/blocking_while_busy

These branches are based on the latest beta so anyone testing this should ensure you are checking out also the other the beta components before switching to this FTL variant.

The new config option is called

REPLY_WHEN_BUSY

with the accepted options

  • DROP
  • REFUSE
  • BLOCK
  • ALLOW (the default)

You will not see anything special about ALLOW (as before). However, you will see details in the Query Log for the other three options. They are shown in orange (in contrast to the others in red/green/gray) to highlight that something out of the ordinary has happened here.

ALLOW

BLOCK

REFUSE

DROP


The screenshot above reveals a retry timeout of five seconds for the DROP method which is why it is still my favorite here.

I'll test this immediatlly, one quick question:

The database is busy problem only manifests during pihole -g and pihole -up. Is there a way to simulate a database busy condition, this to avoid running pihole -g all the time?

Thanks

Yes, run

BEGIN EXCLUSIVE;

on the /etc/pihole/gravity.db database to start the busy simulation and

END;

to end it.

using drop...

something appears to be wrong, the web interface shows

the database shows:

the log also shows this, multiple identical queries at exactly the same time.

This very much looks like it hasn't worked. FTL should have logged

REPLY_WHEN_BUSY: Drop queries when the database is busy

during startup if it has read it correctly.

from the log:

[2021-08-18 11:57:25.030 26281M]    ADDR2LINE: Enabled
[2021-08-18 11:57:25.031 26281M]    REPLY_WHEN_BUSY: Drop queries when the database is busy
[2021-08-18 11:57:25.031 26281M] Finished config file parsing
[2021-08-18 11:57:25.033 26281M] Database version is 9
[2021-08-18 11:57:25.033 26281M] Resizing "FTL-strings" from 40960 to (81920 * 1) == 81920 (/dev/shm: 1.6MB used, 484.0MB total, FTL uses 1.6MB)

edit
lots of duplicate queries for various domains from various clients.
/edit

edit2
I assume the new status type 15 means dropped.
I switched back to dev, than to the new branches (tweak/blocking_while_busy)

With dev, I see:

[2021-08-18 12:55:17.212 11202M] domain_in_list("incoming.telemetry.mozilla.org", 0x73710670, whitelist): Database is busy, assuming domain is NOT on list
[2021-08-18 12:55:19.107 11202M] domain_in_list("inbox.google.com", 0x73713eb0, whitelist): Database is busy, assuming domain is NOT on list
[2021-08-18 12:56:21.068 11202M] domain_in_list("collector-hpn.ghostery.net", 0x1a198b8, blacklist): Database is busy, assuming domain is NOT on list
[2021-08-18 12:56:21.073 11202M] domain_in_list("collector-hpn.privacy.ghostery.net", 0x73713550, whitelist): Database is busy, assuming domain is NOT on list

With tweak/blocking_while_busy, I see (29 times the 3 messages)

[2021-08-18 13:04:36.644 14886M] Gravity database is busy (whitelist)
[2021-08-18 13:04:36.645 14886M] Gravity database is busy (blacklist)
[2021-08-18 13:04:34.628 14886M] Gravity database is busy (gravity)

and I find 29 status 15 entries:


the web interface shows:

image

so at least, drop appears to work.

I do however not understand where the duplicates suddenly come from, such as:

identical queries, identical time, both forwarded to unbound

  1. Please add DEBUG_QUERIES and DEBUG_FLAGS to pihole-FTL.conf to get more information in pihole.FTL.log. We're looking for lines like

    Blocking XXX as YYY is to be ZZZ (gravity database is not available)
    

    where ZZZ should be one of

    • refused
    • dropped
    • blocked

    (there will be no message for the allow case, queries are just permitted in this case)

  2. How do the corresponding lines in /var/log/pihole.log look like? This will help us identify if it was really the client which send two at the same time.

  3. Does this (duplicated queries) also happen for REFUSE BLOCK and ALLOW ? Does it only happen when the database is busy?

  4. Status 15 is (as you see on the web interface screenshot): "Database is busy". It does not imply that the query was dropped, just that it was handled specially (according to the user config). See my screenshots above - you should see how they were replied to in the Reply column. Drop = NONE (because there was no reply), REFUSED = (well) REFUSED and BLOCK = [depending on your blocking mode] (e.g., IP for the default NULL blocking replying with 0.0.0.0).

Reduced the number of blocklists to allow the pihole -g time to be about 1 minute.
DEBUG_QUERIES and DEBUG_FLAGS added
created little script, ran script 4 times (DROP, REFUSE, BLOCK ALLOW):

#!/bin/bash
sudo service pihole-FTL stop
pihole -f
pihole -g
cat /var/log/pihole/pihole-FTL.log | grep 'is gravity blocked'

created archive with both logs for each:
drop.zip (14.0 KB)
refuse.zip (11.0 KB)
block.zip (10.6 KB)
allow.zip (13.3 KB)

Yes, I made sure there are duplicates in every zip/log

No, The duplicate requests are at the end of the log, after pihole -g has finished.

correct, but since you preffer drop, that's what I will be using, so I matched the grafana text accordingly.

We are not looking for "is gravity blocked" but rather "gravity database is not available". I haven't found this in any of your files.

I haven't found duplicates in drop.zip. After gravity ran, the following domains are queried:

[2021-08-18 15:16:59.141 18321M] **** new UDP IPv4 query[A] query "ctldl.windowsupdate.com" from -:<client1>
[2021-08-18 15:16:59.150 18321M] **** new UDP IPv4 query[AAAA] query "ctldl.windowsupdate.com" from -:1<client1>
[2021-08-18 15:16:59.210 18321M] **** new UDP IPv4 query[A] query "crl.identrust.com" from -:<client1>
[2021-08-18 15:16:59.216 18321M] **** new UDP IPv4 query[AAAA] query "crl.identrust.com" from -:<client1>
[2021-08-18 15:17:10.769 18321M] **** new UDP IPv4 query[A] query "ssl.google-analytics.com" from -:<client2>

Neither does refuse.zip contain any duplicates.

Next I checked allow.zip and there we see some duplicates after the gravity run (the last three):

[2021-08-18 15:25:11.197 20626M] **** new UDP IPv4 query[A] query "ssl.google-analytics.com" from -:<client 2>#53530
[2021-08-18 15:25:36.200 20626M] **** new UDP IPv4 query[A] query "livetileedge.dsx.mp.microsoft.com" from -:<client 1>#57229
[2021-08-18 15:25:46.078 20626M] **** new UDP IPv4 query[A] query "userlocation.googleapis.com" from -:<client 2>#59625
[2021-08-18 15:25:51.898 20626M] **** new UDP IPv4 query[A] query "firefox.settings.services.mozilla.com" from -:<client 1>#57790
[2021-08-18 15:25:51.909 20626M] **** new UDP IPv4 query[A] query "firefox.settings.services.mozilla.com" from -:<client 1>#65422
[2021-08-18 15:25:51.918 20626M] **** new UDP IPv4 query[A] query "firefox.settings.services.mozilla.com" from -:<client 1>#65158

However, they are all correct. I left the port from which they sent the query at the end of the copied line - you can see that they are all different, i.e., those are independent queries from the clients themselves. As to why they are doing this ... I don't know. They arrived with about 10 milliseconds distance between them and, at first, I was confused why they aren't cached but you may have disabled caching and then everything behaves just as expected.

Similarly for block.zip where the query to getgreenshot.org was duplicated, however, these also were doubled queries from the clients themselves so nothing looks like a Pi-hole induced issue.


Your tests did not reveal anything unusual, however, it also doesn't seem you experienced the case we want to study here (query whilst database was busy).

changed that...

looks like the gravity run needs to be long enough to trigger this event, I ran all the tests again with all lists enabled.

I can't trigger this using BLOCK (didn't try ALLOW, you said there will be no messages) I even tried throwing a random list in DPT and run that during pihole -g, still no messages...

from DROP:

[2021-08-19 02:51:55.762 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.764 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.766 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.769 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.771 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.773 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.776 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.779 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)

duplicates example in the DROP test: search for armmf

[2021-08-19 02:54:58.701 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 194, FTL 193, src/dnsmasq/forward.c:1561)
[2021-08-19 02:55:03.008 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 195, FTL 194, src/dnsmasq/forward.c:1561)
[2021-08-19 02:55:03.011 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 196, FTL 195, src/dnsmasq/forward.c:1561)
[2021-08-19 02:55:03.016 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 197, FTL 196, src/dnsmasq/forward.c:1561)
  • armmf duplicate in REFUSE (search for 49157)
  • ablink duplicate in BLOCK (search for 54907)
  • jsdelivr duplicate in BLOCK (search for 59822)
  • digicert duplicate in BLOCK (search for 60033)

BLOCK.zip (139.2 KB)
DROP.zip (27.7 KB)
REFUSE.zip (151.9 KB)

Just hoping I'm doing the correct tests and analysis...

Everything still looks fine - except the very fast retries when we want to drop. The duplicate queries are here client-side also.

I pushed another update (FTL version should be vDev-2edd512) that should prevent the almost immediate replies (few ms delay) you've seen in DROP mode for ssl.google-analytics.com above.

edit 6c6188f1 would be okay as well. I fixed a small glitch with TCP queries not always being dropped.

installed

Current Pi-hole version is development v5.3.1-100-g98867d8d
Current AdminLTE version is tweak/blocking_while_busy v5.5.1-58-gb4a26d3
Current FTL version is tweak/blocking_while_busy vDev-2edd512

testing may take a while...

I've run pihole -g like twenty times now, the text 'gravity database is not available' doesn't appear in the log file anymore...

DROP.zip (40.1 KB)

to ensure something needed to be dropped, I ran

#!/bin/bash

domains=("userlocation.googleapis.com." "ssl.google-analytics.com." "incoming.telemetry.mozilla.org." "settings-win.data.microsoft.com.")

# initialize RANDOM only once, to avoid using the same entry evry time
RANDOM=$$$(date +%s)

while :; do
	selecteddomain=${domains[ $RANDOM % ${#domains[@]} ]}
	dig @192.168.2.57 A ${selecteddomain}
	sleep 5
done

on another pi

Although I didn't get a notification from this topic, I ran pihole -up this morning and got an update.

Current Pi-hole version is development v5.3.1-100-g98867d8d
Current AdminLTE version is tweak/blocking_while_busy v5.5.1-58-gb4a26d3
Current FTL version is tweak/blocking_while_busy vDev-6c6188f

after running the update, I checked for status type 15, and yes, their back 69 entries after the new FTL became active, 15 unique domains

DEBUG_QUERIES and DEBUG_FLAGS aren't active anymore (space on TMPFS is limited)

In the pihole-FTL log, I found (41 times):

[2021-08-20 09:09:57.387 9764M] Gravity database is busy (whitelist)
[2021-08-20 09:09:57.388 9764M] Gravity database is busy (blacklist)

and (90 matches)

[2021-08-20 09:11:40.007 9764M] ERROR: Gravity database not available
[2021-08-20 09:11:40.007 9764M] get_client_groupids("192.168.2.228") - SQL error step: database is locked

I assume this is normal, some FTL thread is still trying to acces the database.

I'm NOT sure how you determine what you consider to be a client initiated resubmitted query, and thus a normal situation. Could you please explain what to look for, to identify duplicates (that shouldn't happen) and resubmitted (that is normal)?

Will you be inviting additional testers, possibly make this part of dev?

Thanks, again, for your time and effort.