Behavior when database is busy

During the end stage of the gravity run, the following message(s) may appear:

[2021-07-10 09:13:00.567 7474M] domain_in_list("userlocation.googleapis.com", 0x73715f80, gravity): Database is busy, assuming domain is NOT on list

Request: an option to either allow the requests (the current default) or block them (preferred), for example to prevent the device from communicating its location to Google.

This is tricky because it requires some code to be rewritten. Assuming something is not blocked is easy. However, if we (temporarily) assume it is blocked, then this knowledge will be added to the DNS cache. We need to build a method that allows blocking without entering the DNS cache for this to work as expected.

I wonder if we should do something altogether different in this case: refusing the query. This is neither a positive nor a negative answer and should retrigger a lookup from the client (hopefully with some added delay, but this is something we cannot control).
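As an illustration of what REFUSED means on the wire (a standalone Python sketch, not FTL's actual C implementation): a resolver builds a REFUSED reply by echoing the query back with the QR bit set and RCODE 5. The packet bytes below are hand-built for illustration only.

```python
import struct

RCODE_REFUSED = 5  # RFC 1035: the server refuses to answer

def make_refused_reply(query: bytes) -> bytes:
    """Turn a raw DNS query packet into a REFUSED response.

    Keeps the transaction ID and question section, sets QR=1
    (this is a response) and RCODE=5 (REFUSED).
    """
    txid, flags, qd, an, ns, ar = struct.unpack("!6H", query[:12])
    flags |= 0x8000                          # QR bit: message is a response
    flags = (flags & ~0x000F) | RCODE_REFUSED
    header = struct.pack("!6H", txid, flags, qd, 0, 0, 0)
    return header + query[12:]               # echo the question section back

# A hand-built query for "example.com" A/IN (transaction ID 0x1234):
query = (b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
         b"\x07example\x03com\x00\x00\x01\x00\x01")
reply = make_refused_reply(query)
assert reply[:2] == b"\x12\x34"              # same transaction ID
assert reply[3] & 0x0F == RCODE_REFUSED      # RCODE is REFUSED
```

Because the response carries no records, the client learns nothing it could cache; it only learns that the server declined to answer this time.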

What do you think?

Fine by me, as long as the client isn't able to provide the information (e.g., userlocation) to Google (again, just an example).

However, since you have no idea how many retries the client will issue, you should watch out that this doesn't trigger rate limiting...

edit
checked (MAXDBDAYS=8)

Since I'm still using and updating the dnsmasq-2.86 branch, and thus updating frequently, I assume the allowed entries were registered during the database-busy period. The entry userlocation.googleapis.com is on a static local list I maintain myself, so there is no chance it bypassed gravity in any other way.
/edit

True, but should this warrant an exception from rate-limiting? Rather not.

you're suggesting to

which means there will be no load on any configured resolver (server=). I wouldn't treat this as a rate-limit violation, but the decision is yours.
I've disabled rate limiting on my system (RATE_LIMIT=0/0), so I'll find out soon enough whether refusing the queries causes problems.

I'm not yet sure what the best solution would be. If clients do not get a REFUSED immediately, they should also not retry immediately, but only after some timeout. So it might be a better solution to silently drop queries while the database is busy. Typical retry delays should be on the order of seconds until the first retry is triggered.
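The drop-and-retry behavior can be sketched with a toy UDP exchange (a self-contained Python sketch, not real DNS; the "server", ports, and payloads are made up): when the server stays silent, the client only retries after its receive timeout expires, whereas a REFUSED answer would come back instantly.

```python
import socket
import threading

def flaky_server(sock: socket.socket, drop_first: int) -> None:
    """Toy UDP 'resolver' that silently drops the first N queries
    (simulating DROP while the database is busy), then answers."""
    dropped = 0
    while True:
        data, addr = sock.recvfrom(512)
        if dropped < drop_first:
            dropped += 1          # stay silent: no reply at all
            continue
        sock.sendto(b"answer:" + data, addr)
        break

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))     # ephemeral port on localhost
threading.Thread(target=flaky_server, args=(server, 2), daemon=True).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
client.settimeout(0.2)            # real stub resolvers typically wait seconds
attempts = 0
while True:
    attempts += 1
    client.sendto(b"query", server.getsockname())
    try:
        reply, _ = client.recvfrom(512)
        break                     # got an answer, stop retrying
    except socket.timeout:
        continue                  # dropped: retry only after the timeout

print(attempts, reply)            # → 3 b'answer:query'
```

With two dropped queries the client needs three attempts, each retry delayed by the full timeout; this is the pacing effect dropping gives us for free.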

Drop the query but keep logging it, as before, so we can see it is happening.

If you're not sure, isn't it possible to code

  • assume allowed (current situation)
  • reply REFUSED
  • drop the query

so we have a choice and can test all options.
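To pin down the semantics of the candidate behaviors, they could be modeled like this (a Python sketch with hypothetical names; FTL itself is written in C and the real implementation will look different):

```python
from enum import Enum
from typing import Optional

class BusyReply(Enum):
    ALLOW = "allow"    # current behavior: assume the domain is not blocked
    BLOCK = "block"    # pretend the domain is on a blocklist
    REFUSE = "refuse"  # answer with RCODE REFUSED
    DROP = "drop"      # send no reply at all; the client retries later

def reply_while_busy(mode: BusyReply) -> Optional[str]:
    """Return the kind of reply sent, or None when the query is dropped."""
    if mode is BusyReply.ALLOW:
        return "forwarded upstream as usual"
    if mode is BusyReply.BLOCK:
        return "blocked reply (e.g., 0.0.0.0 for NULL blocking)"
    if mode is BusyReply.REFUSE:
        return "REFUSED"
    return None  # DROP: stay silent

print(reply_while_busy(BusyReply.DROP))  # → None
```

Only DROP produces no reply at all, which is what forces the client to wait out its own retry timeout.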

It is surely possible, just notably more work, so it will take longer until the feature is realized.

Please try

pihole checkout ftl tweak/blocking_while_busy
pihole checkout web tweak/blocking_while_busy

These branches are based on the latest beta, so anyone testing this should make sure they have also checked out the other beta components before switching to this FTL variant.

The new config option is called

REPLY_WHEN_BUSY

with the accepted options

  • DROP
  • REFUSE
  • BLOCK
  • ALLOW (the default)

You will not see anything special about ALLOW (as before). However, you will see details in the Query Log for the other three options. They are shown in orange (in contrast to the others in red/green/gray) to highlight that something out of the ordinary has happened here.
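For reference, the option is set like any other FTL setting in pihole-FTL.conf (assuming the usual /etc/pihole/pihole-FTL.conf location):

```
# /etc/pihole/pihole-FTL.conf
# How to reply to queries while the gravity database is busy:
# ALLOW (default) | BLOCK | REFUSE | DROP
REPLY_WHEN_BUSY=DROP
```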

[Screenshots of the Query Log for ALLOW, BLOCK, REFUSE, and DROP]


The screenshot above reveals a retry timeout of five seconds for the DROP method which is why it is still my favorite here.

I'll test this immediately. One quick question:

The database-busy problem only manifests during pihole -g and pihole -up. Is there a way to simulate a busy database, to avoid running pihole -g all the time?

Thanks

Yes, run

BEGIN EXCLUSIVE;

on the /etc/pihole/gravity.db database to start the busy simulation and

END;

to end it.

using drop...

Something appears to be wrong. The web interface shows:

the database shows:

The log also shows this: multiple identical queries at exactly the same time.

This very much looks like it hasn't worked. FTL should have logged

REPLY_WHEN_BUSY: Drop queries when the database is busy

during startup if it has read it correctly.

from the log:

[2021-08-18 11:57:25.030 26281M]    ADDR2LINE: Enabled
[2021-08-18 11:57:25.031 26281M]    REPLY_WHEN_BUSY: Drop queries when the database is busy
[2021-08-18 11:57:25.031 26281M] Finished config file parsing
[2021-08-18 11:57:25.033 26281M] Database version is 9
[2021-08-18 11:57:25.033 26281M] Resizing "FTL-strings" from 40960 to (81920 * 1) == 81920 (/dev/shm: 1.6MB used, 484.0MB total, FTL uses 1.6MB)

edit
lots of duplicate queries for various domains from various clients.
/edit

edit2
I assume the new status type 15 means dropped.
I switched back to dev, then to the new branches (tweak/blocking_while_busy).

With dev, I see:

[2021-08-18 12:55:17.212 11202M] domain_in_list("incoming.telemetry.mozilla.org", 0x73710670, whitelist): Database is busy, assuming domain is NOT on list
[2021-08-18 12:55:19.107 11202M] domain_in_list("inbox.google.com", 0x73713eb0, whitelist): Database is busy, assuming domain is NOT on list
[2021-08-18 12:56:21.068 11202M] domain_in_list("collector-hpn.ghostery.net", 0x1a198b8, blacklist): Database is busy, assuming domain is NOT on list
[2021-08-18 12:56:21.073 11202M] domain_in_list("collector-hpn.privacy.ghostery.net", 0x73713550, whitelist): Database is busy, assuming domain is NOT on list

With tweak/blocking_while_busy, I see (these three messages, 29 times):

[2021-08-18 13:04:36.644 14886M] Gravity database is busy (whitelist)
[2021-08-18 13:04:36.645 14886M] Gravity database is busy (blacklist)
[2021-08-18 13:04:34.628 14886M] Gravity database is busy (gravity)

and I find 29 status 15 entries:


the web interface shows:

[screenshot of the web interface]

So at least DROP appears to work.

However, I do not understand where the duplicates suddenly come from, such as:

identical queries, identical time, both forwarded to unbound

  1. Please add DEBUG_QUERIES and DEBUG_FLAGS to pihole-FTL.conf to get more information in pihole.FTL.log. We're looking for lines like

    Blocking XXX as YYY is to be ZZZ (gravity database is not available)
    

    where ZZZ should be one of

    • refused
    • dropped
    • blocked

    (there will be no message for the allow case, queries are just permitted in this case)

  2. What do the corresponding lines in /var/log/pihole.log look like? This will help us identify whether it was really the client that sent two queries at the same time.

  3. Does this (duplicated queries) also happen for REFUSE, BLOCK, and ALLOW? Does it only happen when the database is busy?

  4. Status 15 is (as you see on the web interface screenshot): "Database is busy". It does not imply that the query was dropped, just that it was handled specially (according to the user config). See my screenshots above - you should see how they were replied to in the Reply column. DROP = NONE (because there was no reply), REFUSE = (well) REFUSED, and BLOCK = whatever your blocking mode dictates (e.g., an IP reply of 0.0.0.0 for the default NULL blocking).

Reduced the number of blocklists so that pihole -g takes about 1 minute.
Added DEBUG_QUERIES and DEBUG_FLAGS.
Created a little script and ran it 4 times (DROP, REFUSE, BLOCK, ALLOW):

#!/bin/bash
sudo service pihole-FTL stop
pihole -f
pihole -g
grep 'is gravity blocked' /var/log/pihole/pihole-FTL.log

created archive with both logs for each:
drop.zip (14.0 KB)
refuse.zip (11.0 KB)
block.zip (10.6 KB)
allow.zip (13.3 KB)

Yes, I made sure there are duplicates in every zip/log

No, the duplicate requests are at the end of the log, after pihole -g has finished.

Correct, but since you prefer DROP, that's what I will be using, so I matched the Grafana text accordingly.

We are not looking for "is gravity blocked" but rather "gravity database is not available". I haven't found this in any of your files.

I haven't found duplicates in drop.zip. After gravity ran, the following domains were queried:

[2021-08-18 15:16:59.141 18321M] **** new UDP IPv4 query[A] query "ctldl.windowsupdate.com" from -:<client1>
[2021-08-18 15:16:59.150 18321M] **** new UDP IPv4 query[AAAA] query "ctldl.windowsupdate.com" from -:<client1>
[2021-08-18 15:16:59.210 18321M] **** new UDP IPv4 query[A] query "crl.identrust.com" from -:<client1>
[2021-08-18 15:16:59.216 18321M] **** new UDP IPv4 query[AAAA] query "crl.identrust.com" from -:<client1>
[2021-08-18 15:17:10.769 18321M] **** new UDP IPv4 query[A] query "ssl.google-analytics.com" from -:<client2>

Neither does refuse.zip contain any duplicates.

Next I checked allow.zip and there we see some duplicates after the gravity run (the last three):

[2021-08-18 15:25:11.197 20626M] **** new UDP IPv4 query[A] query "ssl.google-analytics.com" from -:<client 2>#53530
[2021-08-18 15:25:36.200 20626M] **** new UDP IPv4 query[A] query "livetileedge.dsx.mp.microsoft.com" from -:<client 1>#57229
[2021-08-18 15:25:46.078 20626M] **** new UDP IPv4 query[A] query "userlocation.googleapis.com" from -:<client 2>#59625
[2021-08-18 15:25:51.898 20626M] **** new UDP IPv4 query[A] query "firefox.settings.services.mozilla.com" from -:<client 1>#57790
[2021-08-18 15:25:51.909 20626M] **** new UDP IPv4 query[A] query "firefox.settings.services.mozilla.com" from -:<client 1>#65422
[2021-08-18 15:25:51.918 20626M] **** new UDP IPv4 query[A] query "firefox.settings.services.mozilla.com" from -:<client 1>#65158

However, they are all legitimate. I left the port from which each query was sent at the end of the copied lines - you can see they are all different, i.e., these are independent queries from the clients themselves. As to why they are doing this... I don't know. They arrived about 10 milliseconds apart and, at first, I was confused why they weren't answered from the cache, but you may have disabled caching, in which case everything behaves just as expected.

Similarly for block.zip, where the query to getgreenshot.org was duplicated; however, these were also doubled queries from the clients themselves, so nothing looks like a Pi-hole-induced issue.


Your tests did not reveal anything unusual; however, it also doesn't seem you experienced the case we want to study here (a query while the database was busy).

changed that...

It looks like the gravity run needs to be long enough to trigger this event, so I ran all the tests again with all lists enabled.

I can't trigger this using BLOCK (I didn't try ALLOW; you said there will be no messages). I even tried throwing a random list in DPT and running that during pihole -g - still no messages...

from DROP:

[2021-08-19 02:51:55.762 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.764 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.766 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.769 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.771 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.773 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.776 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)
[2021-08-19 02:51:55.779 23632M] Blocking ssl.google-analytics.com as ssl.google-analytics.com is to be dropped (gravity database is not available)

Duplicates example in the DROP test (search for armmf):

[2021-08-19 02:54:58.701 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 194, FTL 193, src/dnsmasq/forward.c:1561)
[2021-08-19 02:55:03.008 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 195, FTL 194, src/dnsmasq/forward.c:1561)
[2021-08-19 02:55:03.011 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 196, FTL 195, src/dnsmasq/forward.c:1561)
[2021-08-19 02:55:03.016 23632M] **** new UDP IPv4 query[A] query "armmf.adobe.com" from -:192.168.2.227#56094 (ID 197, FTL 196, src/dnsmasq/forward.c:1561)
  • armmf duplicate in REFUSE (search for 49157)
  • ablink duplicate in BLOCK (search for 54907)
  • jsdelivr duplicate in BLOCK (search for 59822)
  • digicert duplicate in BLOCK (search for 60033)

BLOCK.zip (139.2 KB)
DROP.zip (27.7 KB)
REFUSE.zip (151.9 KB)

Just hoping I'm doing the correct tests and analysis...

Everything still looks fine - except the very fast retries when we want to drop. The duplicate queries here are also client-side.

I pushed another update (FTL version should be vDev-2edd512) that should prevent the almost immediate replies (few ms delay) you've seen in DROP mode for ssl.google-analytics.com above.

edit: 6c6188f1 would be okay as well. I fixed a small glitch where TCP queries were not always being dropped.