Gravity.list and regex

Informational only...
I was interested in the effect of using regular expressions.
For this test, I reconfigured pihole to use the default lists only:

StevenBlack  https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
MalwareDom   https://mirror1.malwaredomains.com/files/justdomains
Cameleon     http://sysctl.org/cameleon/hosts
ZeusTracker  https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
DisconTrack  https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
DisconAd     https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
HostsFile    https://hosts-file.net/ad_servers.txt

I'm using the regex list from @mmotti

I ran this script (it does take a while, and there probably is a better way to achieve this; as I said before, I'm NOT a Linux expert, nor a regex expert…):

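#!/bin/bash
# Count how many gravity.list domains are already covered by regex.list.

regexfile=/etc/pihole/regex.list
gravityfile=/etc/pihole/gravity.list

list=0
regexcount=0
result=0

# Load the regexes once (skipping blank lines and comments) instead of
# re-reading the file for every domain. Reading into an array also avoids
# the shell glob-expanding characters such as * inside a regex.
mapfile -t regexes < <(grep -Ev '^[[:space:]]*(#|$)' "$regexfile")

while IFS= read -r domain
do
   [[ -z $domain ]] && continue
   list=$((list+1))
   valid=true
   for regex in "${regexes[@]}"
   do
      if [[ $domain =~ $regex ]]; then
         regexcount=$((regexcount+1))
         valid=false
         break
      fi
   done
   if $valid; then
      result=$((result+1))
   fi
done < "$gravityfile"
echo "listsize=$list - regex matches=$regexcount - unique domains=$result"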

The result:
number of entries in the list: 136702
number of regex matches: 25342 (18.5%)
number of domains not covered by regex: 111360 (81.5%)

There are several topics on improving the gravity.list, based on the content of regex.list (here, here and here for example...). These are ingenious but complex procedures, which take a lot of processing time.
Given an improved and much faster version of the script above, it should however be possible to integrate this in gravity.sh (= pihole -g). Is it worth the effort?
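
For what it's worth, a much faster count should be possible by handing the whole regex file to grep in one pass instead of looping in bash. This is only a sketch: it assumes gravity.list holds one domain per line, and that grep -E (POSIX extended regex) interprets the patterns closely enough to the way pihole-FTL does, which has not been checked for every pattern. The function name count_regex_coverage is made up for this example.

```bash
#!/bin/bash
# Sketch: count gravity entries that match at least one regex, in one grep pass.
# count_regex_coverage is a hypothetical helper, not part of Pi-hole.

count_regex_coverage() {
   local gravityfile=$1 regexfile=$2
   local total matched patterns

   # Drop blank lines and comments: an empty pattern would match every domain.
   patterns=$(mktemp)
   grep -Ev '^[[:space:]]*(#|$)' "$regexfile" > "$patterns"

   # Count the non-empty lines in the gravity file.
   total=$(grep -c . "$gravityfile")
   if [[ -s $patterns ]]; then
      # -f reads one pattern per line; -c counts lines matching any of them.
      matched=$(grep -cE -f "$patterns" "$gravityfile")
   else
      matched=0
   fi
   rm -f "$patterns"

   echo "listsize=$total - regex matches=$matched - unique domains=$((total - matched))"
}
```

It would be called as: count_regex_coverage /etc/pihole/gravity.list /etc/pihole/regex.list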

discuss...

It is more efficient to block domains using the normal method instead of regex. The domains are kept in a structure which allows constant time lookup (very fast). When using regex, the domains have to be checked (in the worst case) against every regex you have loaded, and only after all of that does it know if it is blocked or not (this result is kept until the regex is modified).
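
To make the difference concrete, here is a rough bash illustration (not FTL's actual code, which is written in C): exact domains sit in an associative array, so a check is a single hash lookup, while regexes have to be tried one by one. The sample domains and patterns are invented for the example.

```bash
#!/bin/bash
# Illustration only: hash lookup vs. looping over regexes.
# The domains and patterns below are made-up examples.

declare -A blocked=( [ads.example.com]=1 [tracker.example.net]=1 )
regexes=( '^ad[0-9]*\.' '(^|\.)doubleclick\.net$' )

# Exact match: a single hash lookup, however many domains are stored.
is_exact_blocked() {
   [[ -n ${blocked[$1]:-} ]]
}

# Regex match: in the worst case every pattern is tried before answering "no".
is_regex_blocked() {
   local regex
   for regex in "${regexes[@]}"; do
      [[ $1 =~ $regex ]] && return 0
   done
   return 1
}

for domain in ads.example.com stats.g.doubleclick.net www.example.org; do
   if is_exact_blocked "$domain"; then
      echo "$domain: blocked (exact)"
   elif is_regex_blocked "$domain"; then
      echo "$domain: blocked (regex)"
   else
      echo "$domain: not blocked"
   fi
done
```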


As with unbound, there will be a delay the first time the domain is queried; after this, the answer is readily available.
I understand there are pros and cons to this approach, but, for example, I currently have 13949 (no typo, it's really 13949, as counted by Notepad++ / find / count) matches for doubleclick.net in my gravity list. I'm not sure which method will be the fastest to block the domain (wildcard, regex or gravity).
Other examples:
centade.com: 8397
clickbank.net: 631
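
For anyone who wants to repeat such a count on the Pi-hole itself, something like the following should work. It assumes gravity.list has one domain per line; count_domain_entries is just a name picked for this example.

```bash
#!/bin/bash
# Sketch: count gravity entries that are a given domain or one of its subdomains.
# count_domain_entries is a hypothetical helper name.

count_domain_entries() {
   local domain=$1 file=${2:-/etc/pihole/gravity.list}
   # Escape the dots so they only match literal dots, and anchor the
   # pattern so only the domain itself and real subdomains are counted.
   local pattern="(^|\.)${domain//./\\.}\$"
   grep -cE "$pattern" "$file"
}
```

It would be called as: count_domain_entries doubleclick.net
Note that this counts only the domain and its subdomains. A plain substring search, like the find/count in a text editor, also counts entries that merely contain the name somewhere in the middle, so the two numbers can differ.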

With the default output from pihole -g, no tweaks, I have 1280858 gravity entries, 15 regex expressions and 106 wildcard entries (this will be 53, one entry covering both IPv4 and IPv6, as soon as dnsmasq 2.80 is integrated in pihole-FTL).

But many of these are things like

www.ad.doubleclick.net.76530.9544.302br.net

which rarely exist (admittedly, this is only a feeling).

The fastest will be to have everything in gravity, as the dnsmasq cache is optimized for fast lookups of domains in cache buckets. This is much faster than wildcard or regex. Wildcards are looped over, and for each requested domain it is checked whether this domain is part of the wildcard. With regex it is much more dramatic, as every regex evaluation is a complex process which, with all the possible rules, is obviously much slower than a simple sub-string comparison. The high regex performance that is seen in FTL is only possible because we pre-compile the regexes and hence make them as fast as possible in their execution.
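
The wildcard check described here boils down to one suffix comparison per wildcard. A rough bash illustration follows; dnsmasq itself is written in C and this is not its code, and the wildcard entries are invented for the example.

```bash
#!/bin/bash
# Illustration only: wildcard blocking as a loop of suffix comparisons.
# The wildcard entries are made-up examples.

wildcards=( doubleclick.net clickbank.net )

is_wildcard_blocked() {
   local domain=$1 wildcard
   for wildcard in "${wildcards[@]}"; do
      # Blocked if the domain is the wildcard itself or ends in ".wildcard".
      if [[ $domain == "$wildcard" || $domain == *".$wildcard" ]]; then
         return 0
      fi
   done
   return 1
}
```

Each comparison is cheap, but the cost still grows with the number of wildcards, which is the point made above.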

TL;DR: gravity.list can grow as large as fits into memory. The only slowdown you'll observe is the initial loading. Neither wildcards nor regex filters have any initial overhead; however, their evaluation is notably or even significantly slower than a direct domain cache hit from domains imported from gravity.list.

I still don't understand, given the above statement, why you converted the wildcards into regex. I'm still using wildcards; I just changed the filename to avoid regexconverter="/opt/pihole/wildcard_regex_converter.sh".

That's why I feel cleaning up the gravity list would be a benefit. Doesn't pihole-FTL need to load the list every time you give it a kick (SIGHUP)? Wouldn't a cleaned gravity list benefit the load time?

Not so sure about that. From what I understand from @DL6ER's explanation, if your regex list is longer, it will take more time to do the evaluation. The same logic would apply to wildcards: the more you have, the longer it takes to process them.

I assume wildcards are read from file, by pihole-FTL, once, and kept in memory, since the dnsmasq file clearly states you need to stop/start dnsmasq for changes to apply.
I assume the same goes for regex, @DL6ER: could you please confirm/deny this?

Yes.

Yes. Wildcards need restarting. Regex can also be updated on-the-fly by sending the signal SIGHUP to pihole-FTL. The difference is that dnsmasq is constructed such that configuration can only be read at startup, while pihole-FTL can re-read its config whenever needed. Regex is an FTL implementation; there is no regex code at all in the contained dnsmasq.
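
As a small illustration of the reload-on-SIGHUP idea (this is a bash toy, not pihole-FTL code; the file contents and names are made up), a process can install a signal handler that re-reads its pattern file, so a change takes effect without a restart. On the Pi-hole itself the signal would typically be sent with something like sudo killall -SIGHUP pihole-FTL, assuming killall is installed.

```bash
#!/bin/bash
# Illustration only: a process that re-reads a pattern file on SIGHUP.
# Not pihole-FTL code; file contents and names are made up.

conf=$(mktemp)
patterns=()

load_patterns() {
   mapfile -t patterns < "$conf"
}

# Re-read the file whenever this shell receives SIGHUP.
trap load_patterns HUP

echo 'doubleclick' > "$conf"
load_patterns
echo "at startup: ${#patterns[@]} pattern(s)"

# Change the file, then signal ourselves instead of restarting.
echo 'clickbank' >> "$conf"
kill -HUP "$BASHPID"
echo "after SIGHUP: ${#patterns[@]} pattern(s)"

rm -f "$conf"
```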