Pihole -g / List download / disappointing performance

Yeah, unfortunately, he did so already. It doesn't change anything, and I will briefly explain why.

Firstly, there is no way to completely get rid of an index: it is the defining element of a row in SQLite3. Rows can have variable lengths so as not to waste memory; say the comment is NULL for one domain but 2 KB for another. You would not want the database engine to reserve "enough" space for very long comments for every domain.

Secondly, we flush the database table before we fill it (in one go). The index is then empty and can be built up from scratch. This already dramatically reduces the need for reorganization in the tree.

Thirdly, we mass-insert everything in one go. All the INSERTs are done in a single transaction. More specifically, the sqlite3 CLI tool uses SAVEPOINT and RELEASE, which basically comes down to the same as BEGIN and COMMIT. All insertions are collected in a transaction and processed at once when the savepoint is released, at the end of the mass insert. Only then is the database written to and the tree built up (as the index is populated). I see no way this could be implemented any better.
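For illustration, the same pattern can be sketched with Python's stdlib sqlite3 module (this is not Pi-hole's actual code, which drives the sqlite3 CLI from gravity.sh; the table and domain names here are made up):

```python
import sqlite3

# Autocommit mode, so we control the transaction ourselves via savepoints.
con = sqlite3.connect(":memory:", isolation_level=None)
con.execute("CREATE TABLE gravity (domain TEXT PRIMARY KEY)")
domains = [(f"ads{i}.example.com",) for i in range(10_000)]

# All INSERTs happen between SAVEPOINT and RELEASE; for a top-level
# savepoint this is equivalent to BEGIN ... COMMIT, so nothing is
# flushed until the savepoint is released at the end of the bulk load.
con.execute("SAVEPOINT bulk_load")
con.executemany("INSERT OR IGNORE INTO gravity (domain) VALUES (?)", domains)
con.execute("RELEASE bulk_load")

print(con.execute("SELECT COUNT(*) FROM gravity").fetchone()[0])  # 10000
```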

Even with building the index in a separate step (which is not possible, but just assume it were), we'd need to walk the entire database table again when creating the index. We would not gain anything; it's rather likely we would be slowed down by the heavy I/O involved.

TL;DR Building the database tree really cannot be accelerated much further: we start from an empty tree (empty table) and accumulate all added domains in a transaction that is written to disk at once. I see no other way this could be realized faster.


I had some more ideas: There is some potential in just leaving the incoming domains unsorted. As contradictory as it may sound, unsorted domains may even speed up the tree build-up, as random access is faster than always accessing the last leaf in a B-tree.
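Whether unsorted input actually helps is an empirical question. A rough benchmark sketch (Python stdlib sqlite3 with made-up domain names, not Pi-hole code) to compare the two insertion orders might look like:

```python
import random
import sqlite3
import time

def time_insert(domains):
    """Bulk-insert into a fresh table with a text primary key; return seconds."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE gravity (domain TEXT PRIMARY KEY)")
    t0 = time.perf_counter()
    with con:  # one transaction for all inserts, as gravity does
        con.executemany("INSERT INTO gravity (domain) VALUES (?)",
                        ((d,) for d in domains))
    return time.perf_counter() - t0

# Zero-padded names so lexicographic order is well defined.
domains = [f"host{i:07d}.example.com" for i in range(200_000)]
t_sorted = time_insert(sorted(domains))
random.shuffle(domains)
t_shuffled = time_insert(domains)
print(f"sorted: {t_sorted:.3f}s  shuffled: {t_shuffled:.3f}s")
```

The numbers vary by machine and SQLite version, so this only tells you which order wins on the hardware at hand.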


The last one for today:

root@munichpi:/etc/.pihole# time pihole -g
  [i] Neutrino emissions detected...
  [✓] Pulling blocklist source list into range

  [✓] Preparing new gravity table
  [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  [✓] Status: Retrieval successful

  [i] Target: https://mirror1.malwaredomains.com/files/justdomains
  [✓] Status: No changes detected

  [i] Target: http://sysctl.org/cameleon/hosts
  [✓] Status: No changes detected

  [i] Target: https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
  [✓] Status: Retrieval successful

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
  [✓] Status: No changes detected

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
  [✓] Status: No changes detected

  [i] Target: https://hosts-file.net/ad_servers.txt
  [✓] Status: No changes detected

  [✓] Storing downloaded domains in new gravity table
  [✓] Activating new gravity table
  [i] Number of gravity domains: 146591 (124042 unique domains)
  [i] Number of exact blacklisted domains: 0
  [i] Number of regex blacklist filters: 2
  [i] Number of exact whitelisted domains: 0
  [i] Number of regex whitelist filters: 0
  [✓] Cleaning up stray matter

  [✓] Flushing DNS cache
  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

real    0m11,276s
user    0m3,959s
sys     0m0,705s

Looks like roughly a 4x speed-up compared to earlier today.

Testing with large blocking lists (of lower quality) will show whether we can leave out the call to uniq (which consumes about 50% of the time in the gravity script, about as much as the entire database processing).

There is something unusual in your setup. I took your blocklists (minus the first two, which are local to your network; minus three where the full URL was truncated in your debug log; and minus https://wally3k.github.io, which is not a blocklist but a collection of blocklists and cannot be parsed by Pi-hole) and ran them through a Pi Zero W running the latest dev branch (which should be the same as the 5.0 branch at this point). Note that a Zero has a 1 GHz single-core CPU and 512 MB of RAM, while a 3B has a quad-core 1.2 GHz CPU and 1 GB of RAM.

Results on the Zero W:

  [i] Number of gravity domains: 3100840 (2137802 unique domains)

real	25m56.066s

Edit - I added the wally3k URL with no other changes; the number of domains changed just a bit (one of the blocklists was likely updated in the interval), and the time went up by only 5 minutes.

 [i] Number of gravity domains: 3101039 (2137993 unique domains)

real	29m46.952s

So, your lists on a device with less capability than your Pi will rebuild gravity in a third of the time you are seeing. Still not rocket-ship speedy, but your time is a definite outlier.

The beta5 announcement says:

echo "release/v5.0" | sudo tee /etc/pihole/ftlbranch
pihole checkout core release/v5.0
pihole checkout web release/v5.0

Do I enter this to go to the new branch on a Pi-hole v4.3.2 system?

echo "release/v5.0" | sudo tee /etc/pihole/ftlbranch
pihole checkout core tweak/gravity_performance
pihole checkout web release/v5.0

I don't want to falsify the test or destroy another system; this is a lot of work...

Yeah, that should have much the same effect as checking out that branch from a 5.0 system. If you want to be safe, go from 4.3.2 to 5.0, and then check out the gravity performance branch of core afterwards. But it shouldn't make much difference!

You can always test a workflow on something like a digital ocean droplet / disposable VM instance of some kind before you do it for real :slight_smile:

Can you please just cache every list, so that if a download fails in the future the cached copy can be re-used? I believe you didn't think about server failures when removing this feature.

Also /var/cache is a thing btw, there is no need to keep it in /etc.

Asked and answered earlier:


Thank you for another enlightening explanation, @DL6ER :wink:

I took a glance at the DDL from Templates/gravity.db.sql.
My understanding of your design is most certainly only superficial, but I share my thoughts anyhow.

I saw that you are not using indexes at all, other than those enforced automatically by primary key constraints (this renders my earlier comments quite useless indeed).

The gravity table stands out as it declares only one text field, which is also its primary key.
I suspect an artificial integer id would be lighter on index creation, with regard to both memory consumption and comparison speed for insertions.
Would there be a chance to introduce such an artificial id for the gravity table and, if needed, create a second index on the text field only after bulk loading?
Of course, if the latter is needed, it might easily overcompensate any performance gain achievable by using an artificial integer.
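To make the idea concrete, here is a hypothetical sketch of the two schema variants (illustrative table and index names, not taken from the actual template). In a SQLite rowid table, a TEXT primary key materializes as an extra unique index, while an INTEGER PRIMARY KEY is just an alias for the rowid, so the text index could in principle be deferred until after the bulk load:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Variant A (like the current gravity table): the text column is the primary
# key, which in a rowid table is implemented as a separate unique index that
# must be maintained on every single insert.
con.execute("CREATE TABLE gravity_a (domain TEXT PRIMARY KEY)")

# Variant B (hypothetical): an artificial id aliases the rowid, so the bulk
# load only appends to the table b-tree; the text index is created afterwards.
con.execute("CREATE TABLE gravity_b (id INTEGER PRIMARY KEY, domain TEXT)")
rows = [(f"host{i}.example.com",) for i in range(50_000)]
with con:
    con.executemany("INSERT INTO gravity_b (domain) VALUES (?)", rows)
con.execute("CREATE INDEX idx_gravity_b_domain ON gravity_b (domain)")
```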

I also noticed some trigger definitions.
I suspect the performance penalty to be substantially smaller than that of an index, but still:
Any chance you could get away with disabling triggers during the bulk insert?
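For what it's worth, SQLite has no statement to disable a trigger; the closest equivalent is to save its definition, drop it for the duration of the bulk load, and recreate it afterwards. A sketch with a made-up audit trigger (nothing here is from the actual schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE domainlist (domain TEXT, date_modified INTEGER);
-- made-up trigger standing in for the real list-table triggers
CREATE TRIGGER tr_domainlist AFTER INSERT ON domainlist
BEGIN
    UPDATE domainlist SET date_modified = strftime('%s', 'now')
    WHERE rowid = NEW.rowid;
END;
""")

# Save the trigger's DDL, drop it, bulk-insert without trigger overhead,
# then restore it for normal operation.
(trigger_sql,) = con.execute(
    "SELECT sql FROM sqlite_master WHERE type = 'trigger' AND name = 'tr_domainlist'"
).fetchone()
con.execute("DROP TRIGGER tr_domainlist")
with con:
    con.executemany("INSERT INTO domainlist (domain) VALUES (?)",
                    ((f"h{i}.example.com",) for i in range(10_000)))
con.execute(trigger_sql)  # trigger is back in place for later single inserts
```

The obvious caveat: rows inserted while the trigger is gone never see its side effects, so this is only safe when the bulk load does not rely on them.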

Removed it from my lists for all tests.

I've typed this during the pihole -g wait periods, so forgive me if some comments are no longer necessary or valid after this AMAZING result.

  • new system specs: Raspbian (release date 2019-09-26) with sudo apt-get update && sudo apt-get -y upgrade, unbound 1.9.6 + redis, pihole 4.2.3
  • running all tests on the production Pi, so no differing results due to a different network location.

pihole 4.3.2 (pihole -g) approximately 2.5 minutes
start: Thu 23 Jan 09:40:40 CET 2020
finished: Thu 23 Jan 09:42:03 CET 2020
Number of unique domains trapped in the Event Horizon: 2.150.519

new SD card (etcher), pihole with v4.3.2 (fully configured with existing whitelist, regex and adlists.list - 75 entries)
echo "release/v5.0" | sudo tee /etc/pihole/ftlbranch
pihole checkout core release/v5.0
pihole checkout web release/v5.0

pihole beta5 (pihole -g / database v9) approximately 1 hour 46 minutes
start: Thu 23 Jan 11:10:41 CET 2020
time before the first list starts processing (Flushing gravity table): Thu 23 Jan 11:19:44 CET 2020 (9 minutes)
finished: Thu 23 Jan 12:56:16 CET 2020
Number of gravity domains: 3303054 (2.154.469 unique domains)
database size (gravity.db): 214.760KB

applied "Add gravity database 9->10 update script #3089"

pihole beta5 (pihole -g / database v10) approximately 1 hour 50 minutes
start: Thu 23 Jan 13:00:37 CET 2020
time before the first list starts processing (Flushing gravity table): Thu 23 Jan 13:10:56 CET 2020 (10 minutes)
finished: Thu 23 Jan 14:50:59 CET 2020
Number of gravity domains: 3303342 (2.154.728 unique domains)
database size (gravity.db): 214.868KB

new SD card (etcher), pihole with v4.3.2 (fully configured with existing whitelist, regex and adlists.list - 75 entries)
echo "release/v5.0" | sudo tee /etc/pihole/ftlbranch
pihole checkout core tweak/gravity_performance
pihole checkout web release/v5.0

pihole tweak/gravity_performance (pihole -g / database v9) approximately 2 minutes
start: Thu 23 Jan 15:04:22 CET 2020
finished: Thu 23 Jan 15:06:20 CET 2020
Number of gravity domains: 530379 (530.379 unique domains)
database size (gravity.db): 72.176KB

applied "Add gravity database 9->10 update script #3089"

pihole tweak/gravity_performance (pihole -g / database v10) approximately 2.1 minutes
start: Thu 23 Jan 15:09:02 CET 2020
finished: Thu 23 Jan 15:11:12 CET 2020
Number of gravity domains: 530379 (530.379 unique domains)
database size (gravity.db): 72.176KB

@DL6ER: THIS IS AMAZING...
however, there appears to be something wrong. Look at the number of unique domains: a small difference between beta5 and gravity_performance could be explained by changing lists, but gravity_performance produces only 25% of the beta5 result.
Nevertheless, GREAT IMPROVEMENT. Find the last bug, and we are one step closer to an ideal world.

I checked the firewall logs, no abnormal messages (blocks or drops) from pfsense firewall or SURICATA.
Also checked the pihole log, no abnormal entries found.
Checked the system logs, no abnormal warnings or errors found.

I noticed that during list processing a file /etc/pihole/gravity.db-journal is created, whose size changes during processing. This file can be quite large, for example 88.885KB.
Flushing the database in beta5 takes a very long time; my earlier suggestion to build gravity in a temp table would eliminate the need to flush it...

Again a lot to ask, but anyway, while you're at it...
could you change:

  [i] Target: https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 11 to database table

into

  [i] time - Target: https://hostfiles.frogeye.fr/multiparty-trackers-hosts.txt
  [✓] time - Status: Retrieval successful
  [✓] time - Adding adlist with ID 11 to database table
  [✓] time - Done adding adlist with ID 11 to database table

Other possible suggestions to increase performance:

  • download all the lists, prior to processing
  • sort the downloads by file size, smallest to biggest and store the result in a TEMP database
  • process the lists from the TEMP database; this ensures the database contains fewer entries during the processing of the first lists (the end result will be the same), as opposed to accidentally starting with the biggest list, which impacts the duplicate check
  • processing done -> remove the TEMP database


Be warned, this file is not the sole source of truth. You'll also have to check the gravity migration upgrade scripts... This is not very nice; however, it ensures that users at any gravity database version can be migrated to the most recent one. I will update this template to version 10 (the most recent).

This is not the most recent state. The gravity table currently has a 2D index: [domain, adlist_id].

The index on the domain field is ultimately needed (actually, the 2D index above is needed). This is the tree FTL scans to check whether a domain is in gravity or not. Without this index, it would have to search the entire database table, which is a big no-no.
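The effect is easy to verify with EXPLAIN QUERY PLAN. An illustrative Python/sqlite3 sketch (the index name and query are simplified stand-ins, not FTL's actual code):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gravity (domain TEXT, adlist_id INTEGER)")
con.execute("CREATE UNIQUE INDEX idx_gravity ON gravity (domain, adlist_id)")
con.execute("INSERT INTO gravity VALUES ('ads.example.com', 1)")

# With the 2D index, the lookup is a b-tree SEARCH; without it, SQLite
# would report a full-table SCAN instead.
plan = con.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT EXISTS (SELECT 1 FROM gravity WHERE domain = 'ads.example.com')"
).fetchall()
for row in plan:
    print(row[-1])  # the detail column mentions the index, not a scan
```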

There should be no triggers on the gravity table, did I miss something?

No, I suppose it's rather my understanding of your DB design that is lacking :wink:
I only saw a bunch of triggers on several of the list tables, none for gravity itself.

This is already implemented. You should already see

instead of

Also already done. There is only a rather lightweight sed call running after each download. I don't think one very large sed call is notably faster than a few smaller ones; it's not as though there are thousands of calls to sed, so the overhead should not really be visible.

With TEMP, do you mean a separate one (we already do this with the gravity_new table)? Or do you mean an in-memory one? The latter may become problematic once it grows to 200 MB or so; it doesn't seem to be a good idea.

This impact is actually smaller than you might expect, as it roughly scales with log_2(n).
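Back-of-the-envelope: a b-tree lookup needs on the order of log2(n) comparisons for n rows, so doubling the table adds roughly one comparison per duplicate check:

```python
import math

# Approximate comparisons per duplicate check for various table sizes:
# each doubling of n adds only about one more comparison.
for n in (100_000, 1_000_000, 2_000_000, 4_000_000):
    print(f"{n:>9} rows -> ~{math.log2(n):.1f} comparisons")
```

So starting with the biggest list instead of the smallest barely changes the per-lookup cost.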

Instead, on the new branch, we collect everything in one big file and inject it at once instead of step by step. You may have noticed the absence of

in my new performance branch.

Measuring the time is quite a lot of work in bash, and there are many places where it'd need to be done, so I'm not going to do this right now. We can, however, add specific time measurements in the most critical parts, but not as a permanent solution.

I found a typo because of which only the last downloaded list was actually used. Fix pushed.

running pihole -up

[✓] Installing latest logrotate script
[i] Backing up /etc/dnsmasq.conf to /etc/dnsmasq.conf.old
[✓] man pages installed and database updated
[i] Testing if systemd-resolved is enabled
[i] Systemd-resolved is not enabled
[✓] Restarting lighttpd service...
[✓] Enabling lighttpd service to start on reboot...
[i] Restarting services...
[✓] Enabling pihole-FTL service to start on reboot...
[i] Restarting pihole-FTL service...
Unable to complete update, please contact Pi-hole Support

Have you checked sudo service pihole-FTL status?
I haven't changed anything outside gravity.sh, which is not active here.

pihole-FTL was NOT running (after restart by pihole -up)
sudo service pihole-FTL start -> OK

pihole version
Pi-hole version is v4.3.2-390-gc5713b6 (Latest: v4.3.2)
AdminLTE version is v4.3.2 (Latest: v4.3.2)
FTL version is vDev-021cd83 (Latest: v4.3.1)

correct?

after pihole -up and start pihole-FTL:
pihole tweak/gravity_performance (pihole -g / database v10) approximately 14 minutes
start: Thu 23 Jan 19:03:35 CET 2020
finished: Thu 23 Jan 19:17:18 CET 2020
Number of gravity domains: 3303704 (2.155.018 unique domains)
database size (gravity.db): 250.816KB

Apparently you fixed it; pihole -g execution time is now acceptable, given the large number of lists and unique domains. The step consuming the most time is now Activating new gravity table.

The cached lists have also reappeared in /etc/pihole.

When will this be available in beta5?

This is not the very latest revision (it does not include the timers), but there has been no functional change (= also no extra speed-up) since then. Go ahead and try with the timers included. It may shed some light on what is the slowest part for you.

I'm still wondering what @jfb measures. I think he wanted to set up a similar system with lists similar to yours, just to rule out some strange factor on your side, like a slow SD card or whatsoever.

When we think we have squeezed the last drop out of it, and when we have removed the extra time commands. There is not much going on besides this performance branch as we speak. Overall, the beta really seems to work amazingly well (even if not amazingly fast yet, but we're working on it :wink: ).

pihole -up
[i] Checking for updates...
[i] Pi-hole Core: update available
[i] Web Interface: up to date
[i] FTL: update available

after pihole -up:
pihole tweak/gravity_performance (pihole -g / database v10) approximately 20 minutes
start: Thu 23 Jan 19:32:58 CET 2020
finished: Thu 23 Jan 19:52:30 CET 2020
Number of gravity domains: 3303813 (2.155.118 unique domains)
database size (gravity.db): 429.556KB

  [i] Storing downloaded domains in new gravity table...
real    7m11.525s
user    1m9.096s
sys     0m12.848s
  [✓] Storing downloaded domains in new gravity table
  [i] Activating new gravity table...
real    6m59.576s
user    0m1.414s
sys     0m6.253s
  [✓] Activating new gravity table
  [i] Number of gravity domains: 3303813 (2155118 unique domains)
  [i] Number of exact blacklisted domains: 0
  [i] Number of regex blacklist filters: 20
  [i] Number of exact whitelisted domains: 32
  [i] Number of regex whitelist filters: 0
  [✓] Cleaning up stray matter

  [✓] Flushing DNS cache
  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

  [i] The install log is located at: /etc/pihole/install.log
Update Complete!

  Current Pi-hole version is v4.3.2-392-g0939c35
  Current AdminLTE version is v4.3.2
  Current FTL version is vDev-9889501

The database has almost doubled in size!

microSDHC card 32 GB Transcend 32GB CL10 MICRO SDHC CARD Class 10 (item no.: 416521)

Yes, this is expected and an unavoidable side effect of what you called the "TEMP" table. I'll briefly explain why:

  1. Initially, gravity is already there (assume the database is 200MB in size)
  2. Now, we build gravity_temp leaving gravity intact (so FTL can still access it), this adds yet another 200 MB --> 400 MB in total
  3. We rename gravity to gravity_old (no change)
  4. We rename gravity_temp to gravity (no change)
  5. We remove gravity_old (no change)
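A minimal sketch of steps 2-5 in Python/sqlite3 (made-up contents; the real swap is driven from gravity.sh via the sqlite3 CLI):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Step 1: the live table that FTL keeps reading during the rebuild.
con.execute("CREATE TABLE gravity (domain TEXT PRIMARY KEY)")
con.execute("INSERT INTO gravity VALUES ('old.example.com')")

# Step 2: build the replacement alongside, leaving the live table intact.
con.execute("CREATE TABLE gravity_temp (domain TEXT PRIMARY KEY)")
con.execute("INSERT INTO gravity_temp VALUES ('new.example.com')")

# Steps 3-5: swap via renames, then drop the old data.
con.execute("ALTER TABLE gravity RENAME TO gravity_old")
con.execute("ALTER TABLE gravity_temp RENAME TO gravity")
con.execute("DROP TABLE gravity_old")

print(con.execute("SELECT domain FROM gravity").fetchone()[0])  # new.example.com
```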

Steps 3 & 4 are needed as they are so we do not have to drop and recreate the view.
One might expect step 5 to release 200 MB; however, the developers of sqlite3 decided to have the database file only ever grow, to avoid fragmentation. The extra space is marked as unused and can be reused by later gravity runs.

You could shrink the database by running VACUUM on it (we have pihole -g -o for this, -o for --optimize). However, be warned that this may take some VERY long extra time, as the database may be restructured in the process. I'd say it's not worth it if you have the extra space. It is the unavoidable price to pay for effectively having the gravity table twice at some point (when they are swapped).
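The grow-only behavior and the VACUUM trade-off are easy to observe directly. A sketch using Python's stdlib sqlite3 on a throwaway file (table name and contents made up):

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "gravity.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE gravity_old (domain TEXT)")
with con:  # bulk insert in one transaction
    con.executemany("INSERT INTO gravity_old VALUES (?)",
                    ((f"host{i}.example.com",) for i in range(100_000)))
size_full = os.path.getsize(path)

con.execute("DROP TABLE gravity_old")    # pages move to the freelist ...
size_dropped = os.path.getsize(path)     # ... so the file does NOT shrink

con.execute("VACUUM")                    # rewrite the file, return the space
size_vacuumed = os.path.getsize(path)
print(size_full, size_dropped, size_vacuumed)
```

The middle number matches the first (DROP frees pages internally but keeps the file size), and only VACUUM actually hands the space back to the filesystem.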

This is a bit concerning. What happens at this point is steps 3, 4 and 5 from above. I can confirm that this also takes a few seconds on my Raspberry Pi, even though I'd expect it to be an instantaneous action. I confirmed that the renaming really is only a matter of milliseconds; however, DROP TABLE gravity_old; took 2.5 seconds for me (which is the entire time measured for Activating new gravity table in my case). I'll keep looking.