Filter out wildcarded entries when consolidating lists

ZHN · July 30, 2018, 5:19pm

Please consider adding an option to run the lists against the blacklist wildcards and remove matching entries during consolidation. Users on slower hardware could use many more public lists if, during consolidation, sites that use random subdomains and appear many thousands of times in the lists, like 2o7, openx, or 302br got filtered out when wildcarded.

I found that there is a performance balance point where too many entries slow things down. Eliminating thousands of them in favor of a wildcard can decrease load and increase overall responsiveness, especially when using the UI. I'm using a Pi Zero W, which may have something to do with it, and I actually haven't noticed any slowdowns even with fairly extensive blacklist wildcarding.

To clarify what I'm asking, here's an example. It's probably inefficient and maybe even wrong, because I don't know the format of the intermediate processing list. Compare group \1 from the results of the expression (?<=address\=/)([^/]+) on the dnsmasq list to group \1 of \s(\S+)\s on the consolidated list, and drop matching lines.
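
As a rough illustration of that comparison, here is a minimal Python sketch. It assumes the dnsmasq wildcard file holds lines such as `address=/example.com/0.0.0.0` and that the consolidated list holds whitespace-separated lines ending in the domain; neither format is confirmed in this thread. The function names are hypothetical. It uses the single capture group of the expression above and, since a wildcard also covers subdomains, drops any entry whose domain or parent domain is wildcarded.

```python
import re

# Assumed formats (not confirmed by the thread):
#   dnsmasq wildcard line:  address=/example.com/0.0.0.0
#   consolidated line:      0.0.0.0 ads.example.com
WILDCARD_RE = re.compile(r"(?<=address=/)([^/]+)")


def wildcard_domains(dnsmasq_lines):
    """Collect the domain captured from each dnsmasq address= line."""
    domains = set()
    for line in dnsmasq_lines:
        match = WILDCARD_RE.search(line)
        if match:
            domains.add(match.group(1).lower())
    return domains


def is_covered(domain, wildcards):
    """True if the domain itself or any parent domain is wildcarded."""
    labels = domain.lower().split(".")
    return any(".".join(labels[i:]) in wildcards for i in range(len(labels)))


def filter_consolidated(lines, wildcards):
    """Drop consolidated lines whose last field is covered by a wildcard."""
    kept = []
    for line in lines:
        fields = line.split()
        if fields and is_covered(fields[-1], wildcards):
            continue
        kept.append(line)
    return kept
```

Looking up each suffix in a set keeps the cost per entry proportional to the number of labels in the domain rather than to the number of wildcards, which matters on hardware like a Pi Zero.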

ZHN · July 31, 2018, 2:47am

Did you get a noticeable improvement from that?

Mine didn't get too bad until I got up to around 1.2 million domain entries, and it was nearly unusable when I hit 2.5 million domain entries.

I just switched over to the regex FTL and my regex lines are so broad I'm not even sure I need more than 1000 proper domain entries. I have lines like

metrics\.
(^|\.)metrics
telemetry
(^|\.)ads?\d*\.

which cover probably 30% of all entries in common domain lists, if not more.
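
One way to check a coverage estimate like this against a real list is sketched below in Python. It assumes unanchored matching (Python's `re.search`) behaves closely enough to the FTL regex engine for these four patterns, which is an assumption rather than something stated in the thread; `coverage` is a hypothetical helper name.

```python
import re

# The four patterns quoted in the post above.
PATTERNS = [r"metrics\.", r"(^|\.)metrics", r"telemetry", r"(^|\.)ads?\d*\."]
COMPILED = [re.compile(p) for p in PATTERNS]


def coverage(domains):
    """Return the fraction of domains matched by at least one pattern."""
    domains = list(domains)
    if not domains:
        return 0.0
    hits = sum(1 for d in domains if any(rx.search(d) for rx in COMPILED))
    return hits / len(domains)
```

Feeding it the domain column of a public blocklist would give a measured figure to compare with the 30% guess.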

technicalpyro · August 2, 2018, 9:13pm

12 posts were split to a new topic: Regex examples

technicalpyro · August 1, 2018, 5:04pm

Comparing each line in gravity.list against the list created at regex.list, even for a moderately sized blocklist (~100K entries) and a single regex filter, let alone hundreds of them, would turn what is currently an efficient, resource-smart Gravity run into one that could take hours.

Based on that trade-off, and the fact that gravity is run at least once a week, I'm not sure how this would be implemented on base-level hardware like a Pi.

ZHN · August 1, 2018, 7:46pm

This is why I suggested it as an option rather than an always-on. For most people's uses, it's probably fine if it runs at 2-3am.

The way I would implement it to minimize interruptions, since the goal is "less total data needs to be scanned for each request" or "less data held in memory" (not sure which it actually is that slows things so much) rather than "use less storage space overall," is to use a second file in append mode, which I'll call collisions.list for the example:

1. Generate gravity.list as usual.
2. IF the option to filter is selected, AND collisions.list exists, replace gravity.list with a version that contains only lines unique between collisions.list and gravity.list (in the worst possible scenario, it takes as long as generating gravity.list did).
3. IF the option to filter is selected, begin the process of adding lines to collisions.list from the new gravity.list with low processor priority.
3a. IF we did step 3, perform step 2 again.

This way, so long as the regex list does remove a significant number of entries, we would run the regex expressions against the smallest possible data set.
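
The steps above can be sketched roughly as follows. This is an in-memory Python illustration under several assumptions: "lines unique between collisions.list and gravity.list" is read as "gravity lines not present in collisions", the lists are held as Python collections rather than files, and the low-priority scheduling of step 3 is not modeled. All function names are hypothetical.

```python
import re


def remove_known_collisions(gravity, collisions):
    """Step 2: keep only gravity lines not already recorded as collisions."""
    return [line for line in gravity if line not in collisions]


def find_collisions(gravity, patterns):
    """Step 3: the slow pass; return the lines matched by any regex."""
    compiled = [re.compile(p) for p in patterns]
    return {line for line in gravity if any(rx.search(line) for rx in compiled)}


def run_gravity_filter(gravity, collisions, patterns, enabled=True):
    """Apply steps 2, 3, and 3a to an already generated gravity list."""
    if not enabled:
        return gravity, collisions
    if collisions:
        # Step 2: cheap set lookup against previously found collisions.
        gravity = remove_known_collisions(gravity, collisions)
    # Step 3: regexes only see entries that survived step 2.
    new_hits = find_collisions(gravity, patterns)
    collisions = collisions | new_hits  # stands in for "append mode"
    # Step 3a: repeat step 2 with the updated collisions.
    gravity = remove_known_collisions(gravity, collisions)
    return gravity, collisions
```

On later runs most colliding entries are removed by the set lookup in step 2, so the regex pass in step 3 only has to examine entries that are new since the previous run.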

technicalpyro · August 2, 2018, 9:14pm

I have split the majority of posts in this thread to a new topic in General.

Please keep discussions in this thread directly related to the feature request.

yubiuser · June 13, 2020, 11:01am

I'm closing this FR and releasing its votes in favor of

https://discourse.pi-hole.net/t/reduce-size-gravity-list-with-active-wildcard-and-regex-entries/10698

as it takes into account wildcards and regex. If you still support the request please vote over there.