Pihole -g / List download / disappointing performance

Because https://dbl.oisd.nl/ was unavailable, the gravity count was much lower than before. I was confronted with the counter again; this time, pihole -g continued before the counter had expired.
This would indicate (if I'm interpreting this correctly) that, with a large gravity count (see earlier results), pihole-FTL needs even more than 120 seconds to produce a result for getent hosts pi.hole.
Maybe you need to increase the timeout (double the value to 240 seconds) to satisfy systems with a very high gravity count...
Just a suggestion, no offence intended...
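
To make the suggestion concrete, here is a minimal Python sketch of the kind of readiness wait I have in mind: keep retrying getent hosts pi.hole until it succeeds or a configurable deadline expires. The 240-second value, the 5-second poll interval and the script itself are my own assumptions for illustration; this is not what pihole -g actually does internally.

```python
# wait_for_ftl.py - hypothetical sketch, not part of pihole -g.
# Polls "getent hosts pi.hole" until it succeeds or the deadline passes.
import subprocess
import sys
import time

DEADLINE_SECONDS = 240  # assumed doubled timeout; the current wait is reportedly 120 s
POLL_INTERVAL = 5       # seconds between attempts (assumption)

start = time.monotonic()
while time.monotonic() - start < DEADLINE_SECONDS:
    result = subprocess.run(["getent", "hosts", "pi.hole"],
                            capture_output=True, text=True)
    if result.returncode == 0:
        print(f"FTL answered after {time.monotonic() - start:.0f} s: {result.stdout.strip()}")
        sys.exit(0)
    time.sleep(POLL_INTERVAL)

print(f"FTL did not answer within {DEADLINE_SECONDS} s", file=sys.stderr)
sys.exit(1)
```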

Whole system did a boo-boo.
Rebooted, and is up again.
Thanks for reporting :wink:

back in business (thanks @sjhgvr)

pihole tweak/gravity_performance (pihole -g / database v10): approximately 3.5 minutes
start: Fri 24 Jan 10:01:02 CET 2020
finished: Fri 24 Jan 10:04:38 CET 2020
Number of gravity domains: 3308749 (2156167 unique domains)
database size (gravity.db): 204,440 KB

YES. AMAZING RESULT!!!

Now integrate these changes into beta5, so everyone can enjoy them.

Thank you, all participants, for your time and effort.

NOT sure what is happening.
Using phpliteadmin, I can see all the data I entered in the database; nothing is missing.
In the web interface, however, I only have 6 adlists (should be 75), no regex_blacklist entries (should be 20), and no whitelist entries (should be 32).

Restarted lighttpd, same result.

edit
Something else that needs verification.
During all previous tests, I never got the same "Number of gravity domains: 3308749 (2156167 unique domains)" count twice. I've run pihole -g several times now, and the count doesn't change any more (2156167). This could of course simply mean there were no changes in the lists (maybe too soon to tell), but it needs looking into. I will follow this up and report back...

IGNORE THIS edit, the count has changed: Number of gravity domains: 3308784 (2156200 unique domains)
/edit

So v5.0 is now even faster than v4.3.x? This is good news.

You asked to get it merged and then said something about wrong numbers on the dashboard, although everything looks fine in the database. What is the current status on this? You went a bit back and forth, and I want to have everything bug-free before opening a PR.


Does this mean that, after somebody approves this, my beta5 system (on another SD card) will get an update, or is this reserved for the final release?

Yes. The branch tweak/gravity_performance will be abandoned after the merge. You should either check out release/v5.0 on your test system or disable it if you don't need it any longer.

No offense taken, and if there are independently verifiable reports of this being an issue then we will take a look.

This improvement has been merged into the v5.0 beta branch.


Rather than a mass insert/update, maybe there could be a step utilizing, say, a cuckoo filter so that only net-new entries are added.

Also, maybe the solution for systems with sufficient memory is to use Redis as a backing store instead.
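
To illustrate the Redis idea, a minimal sketch of what that backing store could look like, assuming redis-py and a Redis server on localhost (both are my assumptions; Pi-hole does not ship anything like this today): the gravity domains live in a Redis set, so incremental additions and membership checks become single commands.

```python
# Hypothetical sketch only - Pi-hole does not use Redis.
# Requires: pip install redis, plus a Redis server listening on localhost:6379.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Top up the gravity set; SADD silently ignores members that already exist.
new_domains = ["ads.example.com", "tracker.example.net"]  # stand-in data
r.sadd("gravity:domains", *new_domains)

# Deciding whether to block then becomes a single O(1) membership check.
print(r.sismember("gravity:domains", "ads.example.com"))  # True
print(r.scard("gravity:domains"))                         # count of unique domains
```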

Just a one-liner? :wink:

As I understand it, a Cuckoo filter (a special variant of the cuckoo hash table) performs admirably when determining that a given item is *not* a member of a static, or at least predictably sized, set of items.

As such, it would require the filter to be stored (along with the original set items), fully populated by calculating and inserting the respective hash value for every single entry in the set to be tested against (i.e. the database of blacklisted hostnames).
Fortunately, storage requirements are another discipline where Cuckoos excel (filters as well as hashes) - but it would still grow the database (note to self: another extended example of trading off memory consumption vs. execution speed).

Due to their fast lookups and low space requirements, Cuckoo filters are ideally suited to problems where full-on caching is too memory-consuming, and/or where the cost of establishing non-membership is excruciatingly time-sensitive or largely outweighs the cost of an action that would otherwise only be sensible for set members, making them a promising candidate for e.g. network routing problems.
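
To make the following a bit more tangible, here is a deliberately simplified Python sketch of a cuckoo filter (one-byte fingerprints, four slots per bucket, a power-of-two bucket count), purely to illustrate the membership test and the displacement behaviour discussed below. Parameters and naming are mine; it is not sized or tuned for real blacklist data.

```python
# Simplified cuckoo filter - illustration only, not production code.
import hashlib
import random

class CuckooFilter:
    def __init__(self, num_buckets=1 << 20, bucket_size=4, max_kicks=500):
        # num_buckets must be a power of two so the XOR trick below stays in range
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]

    def _fingerprint(self, item: str) -> int:
        # one-byte, non-zero fingerprint derived from the item
        return hashlib.sha1(item.encode()).digest()[0] or 1

    def _index(self, item: str) -> int:
        h = hashlib.sha1(item.encode()).digest()
        return int.from_bytes(h[1:5], "big") % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # partial-key cuckoo hashing: i2 = i1 XOR hash(fingerprint)
        h = hashlib.sha1(bytes([fp])).digest()
        return index ^ (int.from_bytes(h[:4], "big") % self.num_buckets)

    def contains(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # both candidate buckets are full: displace ("kick") existing fingerprints
        i = random.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = random.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # filter is effectively full

if __name__ == "__main__":
    cf = CuckooFilter()
    cf.insert("ads.example.com")
    print(cf.contains("ads.example.com"))     # True
    print(cf.contains("benign.example.org"))  # False (with high probability)
```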

As for the update problem at hand here:
While we cannot predict the number of blacklisted hostnames of any given Pi-hole installation, it would probably be safe to assume a certain maximum number of hosts, say 5 million or so, in order to keep the hash filter stable and prevent it from overflowing, which would make additional insertions impossible (at least, I am not aware of any procedure for extending a Cuckoo filter once it reaches its capacity limit).

Each hostname from a blacklist file would have to be checked against the existing database set by calculating its hash and applying it against the filter.

The established set of non-members would then have to be inserted into the database, which would still incur the associated time cost.

On insertion of each entry, the Cuckoo filter would also have to be updated.
I take it that this is a discipline where Cuckoos fall short: while still comparably fast in the beginning, Cuckoo filters exhibit degrading insertion performance as the load factor grows, i.e. inserting item (n+1) will always be more expensive than inserting item (n). This is caused by an ever higher probability that a new item displaces an older item from its calculated hash location, which might in turn force yet another item's displacement, and so forth.
There are hashing schemes that exhibit a constant insertion cost (if a higher one initially) that would have to be scrutinised for comparison.
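
Put as a sketch, the update flow described above might look like this (reusing the CuckooFilter class from my earlier sketch; the gravity(domain) table is a placeholder of mine, not Pi-hole's real schema): each downloaded hostname is tested against the filter, and only the probable non-members are inserted into both the database and the filter.

```python
# Sketch of the proposed delta update - builds on the CuckooFilter class above.
# The "gravity(domain)" table is a placeholder, not the actual gravity schema.
import sqlite3

def delta_update(db_path: str, downloaded_domains: list[str], cf: "CuckooFilter") -> int:
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS gravity (domain TEXT PRIMARY KEY)")

    inserted = 0
    for domain in downloaded_domains:
        # Probable member -> assume it is already in the database and skip it.
        # NB: a false positive here would wrongly skip a genuinely new domain,
        # so a real implementation would have to re-check such hits against the DB.
        if cf.contains(domain):
            continue
        cur.execute("INSERT OR IGNORE INTO gravity (domain) VALUES (?)", (domain,))
        cf.insert(domain)  # keep the filter in sync; cost rises with the load factor
        inserted += 1

    con.commit()
    con.close()
    return inserted
```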


It remains to be demonstrated that the combined cost of establishing non-membership for every item and inserting each non-member item into the database as well as into the Cuckoo filter would outmatch the current approach for any assumed update percentage, or to determine the update percentage up to which Cuckoo filtering would be favourable.

Furthermore, determining non-membership would only assist with one side of the gravity update operation, namely adding hostnames that are not present in the current set.

It would not help with the reciprocal decision, namely which entries to *remove* from the database because they are no longer present in the blacklist flat files.

Applying Cuckoo filters to this problem as well would involve creating an additional Cuckoo filter from all entries in the blacklist flat files, against which each database entry would have to be checked.

So that would basically mean we'd incur the cost of creating a full index for all entries, just as in the current solution, plus the cost of checking each entry in either set against the Cuckoo filter of the respective other set, plus the cost of updating the database as well as the database's Cuckoo filter for some percentage of entries.

The current solution relies (in part) on inserting all entries into the database straight away and building the index only after loading completes, avoiding per-insertion index cost altogether.
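
For comparison, a rough sketch of that current strategy as I read it (plain SQLite from Python; the table and index names are placeholders of mine, not the actual gravity schema): drop the index, bulk-insert everything, then build the index once at the end.

```python
# Sketch of the "insert everything, index afterwards" strategy - placeholder schema.
import sqlite3

def full_rebuild(db_path: str, downloaded_domains: list[str]) -> None:
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS gravity (domain TEXT)")
    cur.execute("DROP INDEX IF EXISTS idx_gravity_domain")
    cur.execute("DELETE FROM gravity")

    # Bulk insert with no index in place, so no per-row index maintenance cost.
    cur.executemany("INSERT INTO gravity (domain) VALUES (?)",
                    ((d,) for d in downloaded_domains))

    # Build the index once, after all rows have been loaded.
    cur.execute("CREATE INDEX idx_gravity_domain ON gravity (domain)")
    con.commit()
    con.close()
```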


Yours is an interesting proposal, but would you care to elaborate on how you envisage Cuckoo filters being superior to the current approach?

Also, since you seem knowledgeable about Cuckoo filters, having proposed this, previous experience as well as a demonstration using some real-life blacklist data would certainly be helpful. :slight_smile: