Gravity database

Whenever gravity runs (manually with 'pihole -g', or automatically on the weekly Sunday run), the new gravity database is built from the blocklists and then swapped in to become the active one. This method was found to be the most efficient during the v5.0 beta tests.

My request: don't simply erase the old database; keep it on disk, to allow users to determine which domains have been added or removed since the last run. It often happens that something suddenly stops working on a Sunday (due to domains newly added by a blocklist). The cause can be found using the query log, but it is impossible to tell whether the blocked domain is new (recently added) or already existed last week, unless the user starts digging through all the (partly compressed) Pi-hole logs.
By being able to determine that the blocked domain was added during the last gravity run, the user gains confidence that whitelisting the domain is the right thing to do.

Disk space shouldn't be a problem: the old (active) and new (to be activated) databases already coexist on disk until the 'swapping databases' step is executed. Instead of deleting the old one, simply rename it, and erase it at the start of the next gravity run.
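A minimal sketch of the proposed swap logic (variable names are assumptions modeled on gravity.sh, not the actual code; the demo runs in a throwaway directory with dummy files standing in for the databases):

```shell
#!/bin/sh
# Sketch only: variable names mimic gravity.sh, but this is not the real script.
set -e
dir="$(mktemp -d)"                        # throwaway stand-in for /etc/pihole
gravityDBfile="${dir}/gravity.db"         # active database
gravityTEMPfile="${dir}/gravity_temp.db"  # freshly built database
gravityOLDfile="${dir}/gravity_old.db"    # proposed: previous database, kept on disk

echo "last week's list" > "${gravityDBfile}"
echo "this week's list" > "${gravityTEMPfile}"

# At the START of a gravity run: erase the leftover copy from the previous run.
rm -f "${gravityOLDfile}"

# At SWAP time: rename the active database instead of deleting it,
# then activate the freshly built one (the only change to the current flow).
mv "${gravityDBfile}" "${gravityOLDfile}"
mv "${gravityTEMPfile}" "${gravityDBfile}"

cat "${gravityDBfile}"    # this week's list
cat "${gravityOLDfile}"   # last week's list
```

The only difference from today's behavior is the extra `mv` before the swap; the deletion merely moves to the start of the next run.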

This hardly changes anything in the inner workings of Pi-hole; only the moment at which the old database is removed changes.

Optionally, an additional web interface page could be added to show the domains added or removed since the last gravity run.

I can see some value in keeping the old database around for a diff. There would still be a disk space impact, as we'd need three databases in parallel (in the worst case). Assume a gravity run fails for whatever reason; the current database is kept in that case. However, with your proposal we'd already have deleted the gravity_old.db file, so the diff function would be broken until the next pihole -g run. My gravity.db is 4.5 MB, so keeping the database three times is okay for most users. Those with humongous lists should just ensure they have large enough disks. Doesn't sound like an issue to me.

How large is your gravity database?

How long does computing a diff take? I'm aware that this is a very subjective question. Assume your pihole -g run changes nothing at all: comparing the two indices will be fairly quick, but how long does it run?

  1. disable a notable amount of lists,
  2. run pihole -g,
  3. copy the database sudo cp /etc/pihole/gravity.db /etc/pihole/gravity_old.db,
  4. re-enable all lists you use,
  5. run pihole -g,
  6. run the command
     time sqlite3 /etc/pihole/gravity.db "ATTACH \"/etc/pihole/gravity_old.db\" AS olddb; SELECT DISTINCT domain FROM gravity WHERE domain NOT IN (SELECT domain FROM olddb.gravity);" > delta
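The same ATTACH technique also lists the domains that were removed, by swapping the roles of the two databases. A self-contained sketch (requires the sqlite3 CLI; the tiny throwaway databases below only have a domain column, whereas the real gravity table has more):

```shell
#!/bin/sh
set -e
dir="$(mktemp -d)"
new="${dir}/gravity.db"
old="${dir}/gravity_old.db"

# Two tiny stand-ins for the real databases (same table name as Pi-hole's).
sqlite3 "${old}" "CREATE TABLE gravity (domain TEXT);
  INSERT INTO gravity VALUES ('ads.example.com'), ('gone.example.net');"
sqlite3 "${new}" "CREATE TABLE gravity (domain TEXT);
  INSERT INTO gravity VALUES ('ads.example.com'), ('fresh.example.org');"

# Added since the last run: in the new database but not in the old one.
sqlite3 "${new}" "ATTACH '${old}' AS olddb;
  SELECT DISTINCT domain FROM gravity
  WHERE domain NOT IN (SELECT domain FROM olddb.gravity);"    # fresh.example.org

# Removed since the last run: in the old database but not in the new one.
sqlite3 "${old}" "ATTACH '${new}' AS newdb;
  SELECT DISTINCT domain FROM gravity
  WHERE domain NOT IN (SELECT domain FROM newdb.gravity);"    # gone.example.net
```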
    

my database is currently (before disabling any lists) 258,788 KB (252.7 MB). Large, I know, but the advantage is that I can produce realistic time measurements, to verify everything remains acceptable even with a lot of lists.

executing the script I sent you, with time.
first test (new entries, but no new blocks):

real    0m14.776s
user    0m13.819s
sys     0m0.685s

second test (new entries, forced a new block entry, using dig):

real    0m14.434s
user    0m13.949s
sys     0m0.477s

third test (your scenario, my script, disabled 33% of the lists):
new gravity (everything enabled): 236,216 KB
old gravity (33% disabled): 89,632 KB
12 new blocks found by the script...

real    0m36.610s
user    0m29.579s
sys     0m0.818s

fourth test (your scenario, your SQL statement, disabled 33% of the lists):
delta contains 1,543,114 entries, size 32,664 KB

real    0m22.354s
user    0m20.417s
sys     0m0.530s

all tests were performed on a Raspberry Pi 3B; uname -a: Linux raspberrypi 5.10.17-v7+ #1403 SMP Mon Feb 22 11:29:51 GMT 2021 armv7l GNU/Linux

edit
for reference:

 time pihole -q doubleclick.net
  [i] Over 100 results found for doubleclick.net
        This can be overridden using the -all option

real    0m7.940s
user    0m7.307s
sys     0m0.816s

/edit

Continuing the discussion from here about adding a new database field:

Original idea by @jpgpi250:

idea (NOT part of this PR)

The results of comparing the new and old gravity database appear to be correct.

I was testing the following:

add a field 'new_entry' to the gravity table of gravity.db, default 0:

ALTER TABLE gravity ADD COLUMN new_entry INTEGER DEFAULT 0;

by running the following update on the database, while both the old (temp) and the new one still exist, this field would be updated:

sudo sqlite3 "${gravityDBfile}" << EOSQL
ATTACH '${gravityOLDfile}' AS old;
UPDATE gravity SET new_entry = 1
  WHERE domain IN (SELECT domain FROM gravity
                   WHERE domain NOT IN (SELECT domain FROM old.gravity));
EOSQL
On a database with 4,087,232 entries and 4,223 new entries this takes only 35 seconds (Raspberry Pi 3B):

real    0m34.983s
user    0m14.937s
sys     0m0.767s

This database field could then be used to inform the user in the web interface that a new gravity entry caused a blocked query, for example by adding '(NEW)' behind the domain name.
This could be the user-friendly way to inform the user that blocking behavior has changed due to a recent gravity update.
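Once such a flag exists, the domains to highlight are one query away. A sketch against a throwaway database (the real gravity table has more columns; 'new_entry' is the proposed name from above):

```shell
#!/bin/sh
set -e
db="$(mktemp -d)/gravity.db"

# Throwaway stand-in for gravity.db, already carrying the proposed column.
sqlite3 "${db}" "CREATE TABLE gravity (domain TEXT, new_entry INTEGER DEFAULT 0);
  INSERT INTO gravity (domain, new_entry) VALUES
    ('old.example.com', 0),
    ('fresh.example.org', 1);"

# What the web interface could use to append '(NEW)' behind a blocked domain:
sqlite3 "${db}" "SELECT domain FROM gravity WHERE new_entry = 1;"   # fresh.example.org
```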

Unfortunately, I'm not smart/confident enough to update all the required components of Pi-hole (API, web), so I'm asking for help.

Reply by @DL6ER:

We should discuss the addition of a new database column somewhere else. Adding it to the gravity table does have some impact, because it is the largest (or second largest, behind queries) database table users will have. 35 seconds for 4 mio. entries doesn't sound that dramatic. TBH, I'd have expected worse.

Points for the discussion somewhere else (likely Discourse):

  • How long does gravity run overall with such a high number of domains? I'm wondering if we're talking about an increase of 1% or 10% or 50% in total runtime.
  • How large is the same database once with and once without the extra column? It'll be surely some megabytes of difference.

This would also need a change in FTL. Currently, we're just using the tree index of gravity when checking if a domain is blocked. This index is loaded into memory and is super-fast even when there are dozens of millions of domains in gravity. However, when this change is made, FTL needs to actually look up the database record to get the value of new_entry. Surely doable, but it'd remove one major speed benefit of the database. I'm neither convinced that it's worth it, nor do I have any actual numbers for comparison at hand.

Reply by @jpgpi:

I can already answer some of your questions; I've been experimenting with this already.

remember, my gravity database has 4,087,232 entries; there were 4,223 unique new entries compared to the previous gravity database. (I keep an old version of the gravity database, to increase the diff between the databases every time I run 'pihole -g'; a manual copy is required.)

[screenshot]

gravity_new.db has the additional field

How long does gravity run? As I see it, this wouldn't affect the gravity run at all (new entries would be added with 3 fields instead of the current 2, but the new field would always be '0').
The gravity build wouldn't change at all. The idea is to run the update query on the 'new_entry' column just before the gravity.sh instruction 'mv "${gravityTEMPfile}" "${gravityDBfile}"' (i.e. apply the change to the TEMP file). This would add a minute, give or take, to the gravity run, but since that happens at night (Sunday morning), you would only notice it if you forced a run using 'pihole -g'.
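The proposed ordering can be sketched end to end (throwaway databases; variable names follow the gravity.sh lines quoted above, but this is not the real script):

```shell
#!/bin/sh
set -e
dir="$(mktemp -d)"
gravityDBfile="${dir}/gravity.db"          # current, active database
gravityTEMPfile="${dir}/gravity_temp.db"   # freshly built database

sqlite3 "${gravityDBfile}" "CREATE TABLE gravity (domain TEXT);
  INSERT INTO gravity VALUES ('old.example.com');"
sqlite3 "${gravityTEMPfile}" "CREATE TABLE gravity (domain TEXT, new_entry INTEGER DEFAULT 0);
  INSERT INTO gravity (domain) VALUES ('old.example.com'), ('fresh.example.org');"

# 1. Tag the new entries in the TEMP file while the old database still exists...
sqlite3 "${gravityTEMPfile}" "ATTACH '${gravityDBfile}' AS old;
  UPDATE gravity SET new_entry = 1
  WHERE domain NOT IN (SELECT domain FROM old.gravity);"

# 2. ...then swap, exactly as gravity.sh already does.
mv "${gravityTEMPfile}" "${gravityDBfile}"

sqlite3 "${gravityDBfile}" "SELECT domain FROM gravity WHERE new_entry = 1;"   # fresh.example.org
```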


  1. Concerning file size: something isn't right with the file size comparison. Even if only one additional byte were used per entry, for 4 mio. entries this would still be a size difference of at least 4 MB. Could you run a VACUUM on both databases to ensure there are no dead pages in them, so we can really compare them?

  2. Concerning run time: but how long does a manual pihole -g run take with your 4 mio. entries on your testing Raspberry Pi 3?

Database structure:

  • stopped pihole-FTL, then
pi@raspberrypi:~ $ sudo sqlite3 "/etc/pihole/gravity.db"
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> VACUUM;
sqlite> .quit
pi@raspberrypi:~ $ sudo sqlite3 "/etc/pihole/gravity_new.db"
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> VACUUM;
sqlite> .quit

result: only the new database has become smaller (look at the file dates in the screenshots; really not trying to cheat here...).
[screenshot]

I'm already executing this with the new gravity.sh from the PR, which really doesn't do anything more than keep the old database after some simple checks (disk space, setupVars variable). This is without the additional field!

I haven't modified the code yet to add the additional field; all those changes would be pointless if adding the field doesn't get approved...

real    7m13.816s
user    4m17.191s
sys     0m28.699s

I use a large set of blocklists, to keep the 'time' checks realistic (a lot of users simply use a large number of blocklists). If required, I can push this even further by copying an internet list to a local file and adding both the 'https://' and the 'file:///' versions of the list...

edit
since you instructed me to run 'pihole -g', here are some new results, including actions:

the database now contains 4,082,670 entries (3,110,900 unique entries - dashboard report)

sudo cp gravity_very_old.db gravity_old.db
pihole -q -countnew
  [i] 20680 new domain(s) found.

sudo cp gravity.db gravity_new.db
sudo sqlite3 "/etc/pihole/gravity_new.db"
SQLite version 3.27.2 2019-02-25 16:06:06
Enter ".help" for usage hints.
sqlite> alter table gravity ADD COLUMN new_entry INTEGER DEFAULT 0;
sqlite> .quit

time sudo /home/pi/test3.sh

real    0m55.343s
user    0m15.784s
sys     0m1.387s

test3.sh tags the new entries in the database
/edit

edit2
verified the results in the new database; this matches the count from the new (PR) query.sh script (pihole -q -countnew)

Also, the duplicate entries are all tagged


/edit2

Thanks for your tests. So creating and filling the new_entry column took

real    0m55.343s

when the overall gravity run time was

real    7m13.816s

The increase is roughly 15% in time. Not too bad. For comparison, it takes certainly less than 30 seconds with stock lists.

I checked how SQLite3 stores the new column and see that the values are appended to the B-tree table leaves. Okay, while this is somewhat slower at runtime, it is obviously a lot faster when running the ALTER TABLE ... ADD COLUMN ..., because nothing more needs to be done than rewriting the table definition itself.


Meanwhile, I did some more experiments with loading the new property from the database after checking the tree for the existence of the domain in gravity. Cf. my post above for why I did that:

The result is that the lookup from the database resulted in an IO wait surprisingly often.* Overall, the time spent checking whether the domain is contained in gravity and, if so, whether it is new takes almost exactly four times as long as just checking if it should be blocked (which is what we're doing right now). I don't think this is a price worth paying.

So my vote is clearly: Don't integrate this into the gravity table.

It is a query fast enough to run when users really want to investigate something specific. For that, they should use pihole -q. We will add a better web interface tool for doing this without having to go to the CLI, maybe even clickable from the Query Log.


  • *) This specific type of database access is random access, compared to the sequential memory access of the majority of things Pi-hole does (log writing, query dumping to the database, etc.), even though this shouldn't be an issue on solid-state memory (like SD cards).
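The index-only point can be illustrated with EXPLAIN QUERY PLAN (throwaway database; the real gravity table and FTL's in-memory tree differ, but the covering-index distinction is the same):

```shell
#!/bin/sh
set -e
db="$(mktemp -d)/gravity.db"
sqlite3 "${db}" "CREATE TABLE gravity (domain TEXT, new_entry INTEGER DEFAULT 0);
  CREATE INDEX gravity_domain ON gravity (domain);
  INSERT INTO gravity (domain) VALUES ('ads.example.com');"

# Pure existence check: answered from the index alone
# (the plan reports a COVERING INDEX; exact wording varies by SQLite version).
sqlite3 "${db}" "EXPLAIN QUERY PLAN
  SELECT 1 FROM gravity WHERE domain = 'ads.example.com';"

# Fetching new_entry too: SQLite must additionally read the table row,
# so the plan drops to a plain (non-covering) index search.
sqlite3 "${db}" "EXPLAIN QUERY PLAN
  SELECT new_entry FROM gravity WHERE domain = 'ads.example.com';"
```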

Ok, so no gravity table change.

Isn't this what my current PR is all about (the ability to list blocks due to new gravity entries)?

Does this imply my PR (which only modifies query.sh and gravity.sh, and doesn't change the database) is now off the table, or will you still review it?

I will review it when I have the time to do the proper testing. Currently, I'm just looking at easy fixable things and replying to users from my laptop. Time will come for a proper review.