Questions on RegEx efficiency

Hi,
I've been learning to build regexes over the last few days. Some things can be achieved in multiple ways, and I wanted to know which of them is the most efficient/quickest for Pi-hole to resolve.

  • Is \d more efficient than [0-9] or [[:digit:]]?
  • What about [a-z] and [[:lower:]]?
  • Is there a performance difference between www and w{3}?
  • If I want to match "mms", "mmg-fna", and "media", is there a more efficient way than m(m(s|g-fna)|edia)?
  • I read multiple times that regex is less efficient than just adding every domain, but if, for example, I want to whitelist "www.domain.net" and "www.domain.com", isn't ^www\.domain\.(net|com)$ faster, because Pi-hole only has to check "www.domain." once instead of twice?

The answer to your first three questions is: No. The reason is that regexes are compiled to efficient byte-code before being used. The result of the compilation is what is used when performing the actual matching.
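This is easy to verify empirically. The sketch below uses Python's re module purely for illustration (Pi-hole itself uses POSIX ERE in C, and Python's re does not support POSIX classes like [[:digit:]], so only the other equivalent spellings are compared; the host names are made up):

```python
import re

# Equivalent spellings compile to equivalent matchers; the compiled
# form is what actually runs, so the spelling makes no difference.
p_shorthand = re.compile(r"\d+")
p_class = re.compile(r"[0-9]+")
p_www = re.compile(r"^www\.")
p_w3 = re.compile(r"^w{3}\.")

for s in ["abc123", "no digits here", "999"]:
    assert bool(p_shorthand.search(s)) == bool(p_class.search(s))

for s in ["www.example.com", "web.example.com", "wwww.example.com"]:
    assert bool(p_www.search(s)) == bool(p_w3.search(s))

print("equivalent")
```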

Your nested form is the shortest way to write it; however, m(ms|mg-fna|edia) may be faster as it is simpler.
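A quick check (again using Python's re as a stand-in) confirms the two spellings accept exactly the same strings, so only readability and compile-time structure differ:

```python
import re

# Nested vs. flat alternation, anchored for a whole-string comparison.
nested = re.compile(r"^m(m(s|g-fna)|edia)$")
flat = re.compile(r"^m(ms|mg-fna|edia)$")

# Matching and non-matching probes give identical results for both.
for s in ["mms", "mmg-fna", "media", "mm", "medias", "mmedia"]:
    assert bool(nested.match(s)) == bool(flat.match(s))

print("same language")
```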

Checking one simple regex is more work than checking against millions of exact domains.

Giving exact numbers is neither straightforward nor easy, and it will heavily depend on the particular regex we are talking about. The reason for this extreme efficiency difference is that all the exact domains are stored in a B-tree, where lookups are performed in logarithmic time:

An O(log n) algorithm is considered highly efficient, as the ratio of the number of operations to the size of the input decreases and tends to zero when n increases. (Wikipedia, link above)
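The effect can be sketched with a binary search over a sorted list, a simplified stand-in for the B-tree (the host names here are hypothetical):

```python
import bisect

# Stand-in for the exact-domain tree: a sorted list searched in
# O(log n) time via binary search.
domains = sorted(f"host{i}.example.com" for i in range(100_000))

def lookup(name: str) -> bool:
    i = bisect.bisect_left(domains, name)
    return i < len(domains) and domains[i] == name

assert lookup("host42.example.com")
assert not lookup("absent.example.com")
# log2(100_000) is about 17, so each lookup needs only ~17 comparisons
# no matter which of the 100,000 entries is requested.
```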

A match can really be found within only a few CPU cycles.

For a regex, however, the compiled pattern has to be matched against the string. We have to try all possible combinations, even if there are fewer of them when anchoring is used, as in your example.

An algorithm that must access all elements of its input cannot take logarithmic time, as the time taken for reading an input of size n is of the order of n. (Wikipedia, same link above)

and regular expressions are actually worse than this, as they are designed to find any possible match instead of only a single one.

Thanks, that helped. I think I understood the basics of why regexes are slower.
Followup question: Would it be feasible for Pi-hole to check whether a regex has a finite number of matches (or a number of matches under a predefined threshold) and, if so, just quietly add them all as exact domains "behind the scenes"? As a current example, I whitelisted ^((pps|www|((dyn|w[0-9])\.)?web|m(m(s|g-fna)|edia(-[a-z]{3}|\.[a-z]{4})[0-9]{1,2}-[0-9][.-](cdn|fna)))\.)?whatsapp\.(com|net)$, and I would like to keep it, because WhatsApp changes its domains every now and then, but they always follow the same pattern. If my calculations are correct, the regex has 2,088,028,816 possible matches. I know I could just get a script to print them all out and add them manually, but then what if I want to disable them or add them to another group in the future?
(Also, yes, thanks to your help I know I can make the regex simpler by adding www, pps, dyn, etc. as exact domains; I will do that tomorrow.)

Since you are whitelisting them all, why not just wildcard-whitelist the domain whatsapp.com? Then all subdomains will also be whitelisted.

This is already happening. When a client first requests a domain, all the checks (blacklist, whitelist, regex, etc.) are done (but only until a match is found, the rest is skipped in this case).

When we find a match, we store the domain and the result (e.g., blacklist regex matched + the ID of the matching regex) and can provide the answer instantaneously on any subsequent query of the same domain. This knowledge is stored in the cache and does not survive restarts of the system, however, it does survive TTL expiration of domains. (Furthermore, this cache works per-client to ensure group policies can still work as a domain may not be blocked/allowed for all devices.)

You could assign them to a dedicated group and simply disable the group. This would affect all contained domains at once. Adding them to another group doesn't make much sense, instead, add the group containing them to the devices you want this service unblocked.

Or, likely even better: Don't use lists where services you want to work are expected to get blocked.

I concur. The most efficient way to do this would be a simple anchored regex like

\.whatsapp\.(com|net)$

which will match all subdomains of said domains.
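The behavior of that anchor can be illustrated like this (Python's re standing in for Pi-hole's POSIX ERE; the "evil" domain is a made-up counterexample):

```python
import re

# The leading \. requires at least one subdomain label before
# "whatsapp", and the $ anchors the match at the end of the name.
p = re.compile(r"\.whatsapp\.(com|net)$")

assert p.search("www.whatsapp.com")
assert p.search("mmg-fna.whatsapp.net")
assert not p.search("whatsapp.com")               # apex: no leading dot
assert not p.search("whatsapp.com.evil.example")  # anchored at the end
print("ok")
```

This is why the apex domains still need to be added separately, as described below.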

To complete this (you also want whatsapp.com and whatsapp.net themselves to be covered), add the following two exact domains:

  • whatsapp.com
  • whatsapp.net

I think this would give the best performance in your case.

Alright, thanks. I did that. The reason I do this is that I want to block everything else from Facebook, but need to whitelist WhatsApp for some clients. I couldn't find a good list that only blocks certain FB services, so I just block all of it and whitelist WhatsApp. Actually, just adding WhatsApp as a wildcard didn't even occur to me; I was just so fascinated by the possibilities of regex :grinning:

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.