Pihole -g / List download / disappointing performance

I ran the same speed test on the newest build of the tweak/gravity_performance branch, with the same blocklists, on the same Zero W with a 32 GB Class 10 SanDisk Ultra uSD card.

  [i] Number of gravity domains: 3106983 (2137836 unique domains)
  [i] Number of exact blacklisted domains: 10
  [i] Number of regex blacklist filters: 29
  [i] Number of exact whitelisted domains: 138
  [i] Number of regex whitelist filters: 0
 
real	22m2.758s
user	9m2.923s
sys	2m12.282s

For comparison, the same run yesterday on the dev branch (without the gravity tweaks) took about 4 minutes longer.

  [i] Number of gravity domains: 3100840 (2137802 unique domains)

real	25m56.066s

I implemented this. It saves me another 2 seconds as the swapping action is instantaneous now. There is zero downtime for FTL's gravity blocking now.

Since I felt that a 2D index would make postponing index creation more favourable, I decided to showcase what could theoretically be gained by switching from an automatic primary key index to a manual composite index.

Edit: Note that this is a synthetic, isolated showcase, not a full blown performance evaluation.

The latter would take more careful test case rigging, aligning the tests better with your target design (I deliberately eliminated primary keys, totally ignoring whether that would meet your needs), runs on a range of more relevant hardware, consideration of more side effects and inter-table issues (which I have eliminated completely), better accounting for the execution environment (especially the runtime behaviour of SQLite3 over time), deeper coverage of its additional performance-relevant parameters (of which, frankly, I am totally oblivious for SQLite) and, above all, a large number of iterations to raise the sample size to minimal statistical significance or even a desired level of confidence.


As such, take the resulting numbers as an indication only, not as hard evidence.

I first helped myself to a Beta 5.0 Pi-hole on a separate machine.
I also added a few blocklists to get a decent size, using Wally3K's non-crossed lists, currently 62 entries strong, resulting in just under 1.7 million blocked domains.

I then extracted the DDL for the gravity table from Pi-hole's database:
sqlite3 ".schema gravity"
CREATE TABLE gravity
(
        domain TEXT NOT NULL,
        adlist_id INTEGER NOT NULL REFERENCES adlist (id),
        PRIMARY KEY(domain, adlist_id)
);

For my isolated showcase, I decided to create two stripped-down versions of the table:
  1. compgravity - defining a composite primary key, mimicking the actual table
CREATE TABLE compgravity
(
        domain TEXT NOT NULL,
        adlist_id NOT NULL,
        PRIMARY KEY(domain, adlist_id)
);
  2. indxgravity - omitting a primary key in favour of manual index creation
CREATE TABLE indxgravity
(
        domain TEXT NOT NULL,
        adlist_id NOT NULL
);
CREATE UNIQUE INDEX idx_indxgrav_domain_adlist ON indxgravity (domain, adlist_id);

Next, I generated a `gravimport` flat file for bulk loading by a full export of Pi-hole's gravity table:
.mode csv
.once /etc/pihole/gravimport
select * from gravity;
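
For reference, the snippet can be fed to the sqlite3 shell the same way as the import scripts further below, e.g. saved to a file (the name grav-export is just an example):
sudo -u pihole sqlite3 /etc/pihole/gravity.db < grav-export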

Finally, I prepared a bulk-load `.import` script for each table with the following snippets:
  1. comp-import for compgravity
DELETE FROM compgravity;
.mode csv
.import gravimport compgravity
SELECT count(*) from compgravity;
  2. indx-import for indxgravity
DROP TABLE indxgravity;
CREATE TABLE indxgravity( domain TEXT NOT NULL, adlist_id NOT NULL );
.mode csv
.import gravimport indxgravity
CREATE UNIQUE INDEX idx_indxgrav_domain_adlist ON indxgravity (domain, adlist_id);
SELECT count(*) from indxgravity;

I then measured the time it took to bulk load the 1.7m blocked domains into each table respectively:
  1. into compgravity
time ( sudo -u pihole sqlite3 /etc/pihole/gravity.db < comp-import )
1.692.085

real    3m43.597s
user    2m42.503s
sys     0m15.124s
  2. into indxgravity
time ( sudo -u pihole sqlite3 /etc/pihole/gravity.db < indx-import )
1.692.085

real    1m48.947s
user    1m37.152s
sys     0m5.072s

As both tables were empty on that initial run, I repeated the same commands for a second time.
I also ran a VACUUM before each run.
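A VACUUM here is nothing special - just the plain statement run against the same database, along the lines of:
sudo -u pihole sqlite3 /etc/pihole/gravity.db "VACUUM;"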

The resulting times are as follows:

For those just looking at the table without reading:
Times are purely for bulk loading gravity - they don't account for a complete blocklist update


|      | initial comp | initial indx | decrease | repeated comp | repeated indx | decrease |
|------|--------------|--------------|----------|---------------|---------------|----------|
| real | 03:43.597    | 01:48.947    | -51.28%  | 04:16.282     | 02:17.995     | -46.16%  |
| user | 02:42.503    | 01:37.152    | -40.22%  | 02:42.670     | 01:38.488     | -39.46%  |
| sys  | 00:15.124    | 00:05.072    | -66.46%  | 00:21.126     | 00:10.318     | -51.16%  |

This shows that manually creating an index has some potential to decrease bulk loading times.

I ran my tests on an RPi Zero W. This is a single-core CPU - the achievable gain might turn out to be smaller on multi-core systems.

As populating the gravity table is only part of the blacklist import process, you have to decide whether the potential time saving offsets the work needed to switch to a manual composite index instead of a composite primary key.

Also, you are the only ones to know if and how the rest of your DB design would cope with this shift.

That said, I wouldn't expect a manual index to be significantly slower than the automatic index enforced by a composite primary key.

If needed, you could expand the table with an additional integer id column as an artificial primary key, which would also alias the implicit rowid column (something that wouldn't be possible with the composite key).
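
A minimal sketch of what that could look like, reusing the showcase table from above (the column name id is just an example; whether you need it at all is your call):
CREATE TABLE indxgravity
(
        id INTEGER PRIMARY KEY,   -- becomes an alias for the implicit rowid
        domain TEXT NOT NULL,
        adlist_id NOT NULL
);
CREATE UNIQUE INDEX idx_indxgrav_domain_adlist ON indxgravity (domain, adlist_id);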


Note that I would expect the performance gain to grow disproportionately with the number of blacklist entries, as insertion cost rises with each additional database record.

In other words, I would expect larger manually indexed bulk loads to be relatively faster when compared to composite primary key bulk loads, e.g. 55% for 2m blacklist entries as opposed to 46% for the showcased 1.7m, or a mere 25% for just 150k.

I haven't got the statistics to prove this, as I am shying away from the effort of preparing another scenario (it has taken a few hours so far), and I forgot to take exact numbers on my first test runs with only 130k blocked domains (I recall 12 vs. 15 seconds, though).


I hope this helps somehow :wink:


Thanks, it definitely looks interesting. We're moving to only inserting into a fresh database now, as it has a number of benefits.

I wonder whether omitting the foreign key (REFERENCES ...) makes any difference. It is not something I'd like to give up, as it is a safety measure (you cannot delete adlist entries as long as there are gravity domains still pointing at them as their source).

I did that just for the purpose of the showcase, as I didn't want to recreate all of your tables - keep those references in production.
The only thing I've been suggesting is: replace the composite primary key constraint with a manual composite index on the exact same fields, and maybe throw in an additional artificial id key if needed :wink:

P.S.: I am surprised how well SQLite's `.import` command performs on its own. As I said, I recall an order of magnitude in performance gains when we did something similar on an Oracle DB - but that was more than a decade ago :wink:

I tried your suggestion and hit an issue with it: poor-quality adlists contain multiple entries for the same domain, so I had to relax the UNIQUE constraint (it doesn't matter all that much, TBH), as SQLite is no longer able to check this on the fly.
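
In other words - sketched with the table names from the showcase above rather than our actual schema - the relaxed variant just drops the UNIQUE keyword from the manual index, so duplicate domain/adlist pairs no longer abort the import:
-- instead of a UNIQUE index on the pair:
CREATE INDEX idx_indxgrav_domain_adlist ON indxgravity (domain, adlist_id);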

I can see an improvement (no statistics on those numbers) from

  • 4 seconds for the existing 2D index realized through a PRIMARY KEY
    to
  • 2 + 1 = 3 seconds for importing + index creation in two steps

Commit pushed.

Currently, https://dbl.oisd.nl/ is unavailable, so the gravity count has severely decreased; comparing the results would NOT be meaningful. I've sent a PM to @sjhgvr (the owner of the blocklist?) asking them to look into that.

I can confirm a second database is created (gravity_temp.db), showing something like this, but since not all lists are available, the count is off and the times should be ignored. I'll rerun the test as soon as the list is available again.

  [i] Storing downloaded domains in new gravity database...
real    0m31.184s
user    0m21.222s
sys     0m0.410s
  [✓] Storing downloaded domains in new gravity database
  [i] Building tree...
real    0m29.755s
user    0m11.662s
sys     0m1.114s
  [✓] Building tree
  [i] Swapping databases...
real    0m3.465s
user    0m0.023s
sys     0m0.020s
  [✓] Swapping databases
  [✓] Flushing DNS cache
  [i] Number of gravity domains: 2350500 (1588393 unique domains)
  [i] Number of exact blacklisted domains: 0
  [i] Number of regex blacklist filters: 20
  [i] Number of exact whitelisted domains: 32
  [i] Number of regex whitelist filters: 0
  [✓] Cleaning up stray matter

  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

I checked the database content; all the data I entered appears to have been copied over to the new database (I don't use all the tables yet...)

edit
NOT sure, reporting anyway.
Yesterday, all lists downloaded perfectly, so a cached version of the list was available.
I was running pihole -up this morning and noticed the message:

  [i] Target: https://dbl.oisd.nl
  [✗] Status: Forbidden
  [✗] List download failed: no cached list available

Is it possible there is still a problem using cached lists when they are available?
/edit

Due to the fact that https://dbl.oisd.nl/ was unavailable, the gravity count was way lower than before. I was confronted with the counter again; this time, pihole -g continued before the counter expired.
This would indicate (if I'm interpreting this correctly) that, with a large gravity count (see earlier results), pihole-FTL needs even more than 120 seconds to be able to produce a result for getent hosts pi.hole.
Maybe you need to increase the timeout (double the value to 240) to satisfy systems with a very high gravity count...
Just a suggestion, no offence intended...
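
For anyone wanting to check how long their own FTL needs here, the lookup that gravity waits for can simply be timed, e.g.:
time getent hosts pi.hole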

Whole system did a boo-boo.
Rebooted, and is up again.
Thanks for reporting :wink:

back in business (thanks @sjhgvr)

pihole tweak/gravity_performance (pihole -g / database v10): approximately 3.5 minutes
start: Fri 24 Jan 10:01:02 CET 2020
finished: Fri 24 Jan 10:04:38 CET 2020
Number of gravity domains: 3308749 (2156167 unique domains)
database size (gravity.db): 204,440 KB

YES. AMAZING RESULT!!!

Now integrate these changes into beta5, so everyone can enjoy them.

Thank you, all participants, for your time and effort.

NOT sure what is happening.
Using phpliteadmin, I can see all the data I entered in the database, nothing missing.
In the web interface, however, I only have 6 adlists (should be 75), no regex_blacklist entries (should be 20), and no whitelist entries (should be 32).

restarted lighttpd, same result.

edit
Something else, needing verification.
During all previous tests, I never got the same count twice (Number of gravity domains: 3308749 (2156167 unique domains)). I've run pihole -g several times now and the count doesn't change any more (2156167). This could of course be the result of no changes in the lists (maybe too soon to tell), but it needs looking into. I will follow this up and report back...

IGNORE THIS edit, the count has changed: Number of gravity domains: 3308784 (2156200 unique domains)
/edit

So v5.0 is now even faster than v4.3.x? This is good news.

You asked to get it merged and then said something about wrong numbers on the dashboard, although everything looks fine in the database. What is the current status on this? You went a bit back and forth, and I want to have everything bug-free before opening a PR.


Does this mean that, after somebody approves this, my beta5 system (on another SD card) will get an update, OR is this reserved for the final release?

Yes. The branch tweak/gravity_performance will be abandoned after the merge. You should either check out release/v5.0 on your test system or disable it if you don't need it any longer.

No offense taken, and if there are independently verifiable reports of this being an issue then we will take a look.

This improvement has been merged into the v5.0 beta branch.


Rather than a mass insert/update, maybe there could be a step utilizing, say, a cuckoo filter so that only net-new entries are added.

Also, maybe the solution for systems that have sufficient memory is to use Redis as a backing store instead.

Just a one-liner? :wink:

As I understand it, a Cuckoo filter (as a special variant of cuckoo hash tables) performs admirably when determining that a given item is *not* a member of a static, or at least predictably sized, set of items.

As such, it would require the filter to be stored (along with the original set items), fully populated by calculating and inserting the respective hash value for each and every entry in the set to be tested against (i.e. the database of blacklisted hostnames).
Fortunately, storage requirements are another discipline where Cuckoos excel (filters as well as hashes) - but it would still grow the database. (Note to self: another extended example of trading off memory consumption against execution speed.)

Due to their fast lookups and low space requirements, Cuckoo filters are ideally suited to problems where full-on caching is too memory-consuming, and/or where establishing non-membership is excruciatingly time-sensitive or far cheaper than the action it avoids (an action that would only be sensible for set members anyway), making them a promising candidate for e.g. network routing problems.

As for the update problem at hand here:
While we cannot predict the number of blacklisted hostnames of any given Pi-hole installation, it would probably be safe to assume a certain maximum number of hosts, say 5 million or so, in order to keep the hash filter stable and prevent it from overflowing, which would make additional insertions impossible (at least, I am not aware of any procedure that allows a Cuckoo filter to be extended once it reaches its capacity limit).

Each hostname from a blacklist file would have to be checked against the existing database set by calculating its hash and applying that hash to the filter.

The established set of non-members would then have to be inserted into the database, which would still accrue the associated time cost.

On insertion of each entry, the Cuckoo filter would also have to be updated.
I take it that this is a discipline where Cuckoos fall short: while still comparably fast in the beginning, Cuckoo filters exhibit degrading insertion performance with growing load factors, i.e. inserting item (n+1) will always be more expensive than inserting item (n). This is caused by an ever higher probability that a new item displaces an older item from its calculated hash location, which might in turn force yet another item's displacement, and so forth.
There are hashing schemes that exhibit a constant (if initially higher) insertion cost that would have to be scrutinised for comparison.


It remains to be demonstrated that the combined cost of establishing non-membership for every item and inserting each non-member item into the database as well as the Cuckoo filter would outmatch the current approach for any assumed update percentage - or, failing that, to determine the update percentage up to which Cuckoo filtering would be favourable.

Furthermore, determining non-membership would only assist with one side of the gravity update operation, namely adding hostnames that are not present in the current set.

However, it does not help with the reciprocal decision, namely which entries to *remove* from the database because they are no longer present in the blacklist flat files.

Applying Cuckoo filters to this problem as well would involve creating an additional Cuckoo filter from all entries in the blacklist flat files, which each database entry would then have to be checked against.

So that would basically mean that we'd accrue the cost of creating a full index over all entries, just as in the current solution, plus the cost of checking each entry in either set against the Cuckoo filter of the respective other set, plus the cost of updating the database as well as the database's Cuckoo filter for a random percentage of entries.
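
To make that comparison concrete: stripped of any Cuckoo filter, the two-way diff discussed above boils down to two set operations. Sketched below in plain SQL against a hypothetical newgravity staging table (holding the freshly downloaded entries), purely to illustrate where the costs would sit - this is not how Pi-hole currently does it:
-- add entries that appear in the new lists but are missing from the database
INSERT INTO gravity (domain, adlist_id)
SELECT domain, adlist_id FROM newgravity
EXCEPT
SELECT domain, adlist_id FROM gravity;

-- remove entries that are no longer present in any downloaded list
DELETE FROM gravity
WHERE NOT EXISTS (
    SELECT 1 FROM newgravity
    WHERE newgravity.domain = gravity.domain
      AND newgravity.adlist_id = gravity.adlist_id
);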

The current solution relies (in part) on inserting all entries into the database straight and building the index completely after loading finishes, avoiding index insertion cost altogether.


Yours is an interesting proposal, but would you care to elaborate on how you envision Cuckoo filters to be superior to the current approach?

Also, as you seem knowledgeable about Cuckoo filters, having proposed this, sharing previous experience as well as a demonstration using some real-life blacklist data would certainly be helpful as well. :slight_smile: