Discussion about optimal wildcard syntax

jpgpi250 · February 20, 2020, 7:21pm

you replied:

This afternoon, we had a discussion about a Facebook regex.

The following came up:

and

do we use (scripted entry)

(.|^)wizaly\.com$

regex101.com says match -> 18 steps
OR

^(.+\.)??wizaly\.com$

regex101.com says match -> 14 steps

your opinion and comment would be highly appreciated.

edit typos

DL6ER · February 20, 2020, 8:05pm

The first one you mention is slightly different from the one we're using: (\.|^)
Please repeat your benchmark with the correct syntax, as a case-sensitive literal match (\.) is a few steps faster than an any-character match (.). Also, you only specified the regex. When you report the number of steps it takes, you should also provide the test vector you ran your test against.
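Beyond step counts, the any-character prefix also changes what gets matched. The following is a minimal sketch using Python's `re` module (not the engine FTL uses, and not regex101) that runs the three patterns quoted in this thread against a few domains. The domains `notwizaly.com` and `wizaly.com.example` are hypothetical test inputs chosen for illustration.

```python
import re

# Patterns as quoted in this thread.
PATTERNS = {
    "any-char prefix": r"(.|^)wizaly\.com$",
    "literal-dot prefix": r"(\.|^)wizaly\.com$",
    "lazy optional group": r"^(.+\.)??wizaly\.com$",
}

# Test inputs; the last two are hypothetical and only serve as negative cases.
DOMAINS = ["wizaly.com", "abc.wizaly.com", "notwizaly.com", "wizaly.com.example"]


def matches(pattern: str, domain: str) -> bool:
    """Return True if the pattern matches anywhere in the domain."""
    return re.search(pattern, domain) is not None


def match_table() -> dict:
    """Map each pattern name to the list of domains it matches."""
    return {
        name: [d for d in DOMAINS if matches(pattern, d)]
        for name, pattern in PATTERNS.items()
    }


if __name__ == "__main__":
    for name, matched in match_table().items():
        print(f"{name}: {matched}")
```

In this sketch the any-character variant also accepts `notwizaly.com`, because `.` matches the `t` in front of `wizaly`, while the literal-dot and lazy-group variants agree on every input tested.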

I can only assume you tested against wizaly.com, for which I see 16 steps for ours and 14 steps for the proposed other wildcard regex. However, this is not a definite answer, as you also need to consider subdomain matches. Given abc.wizaly.com, our regex needs 26 steps whereas the other one needs 31 steps.

You see, regex magic is not necessarily as simple as it may seem sometimes, and we've already invested a lot of effort into the more subtle things of Pi-hole as well. Not everything is documented, but that doesn't mean the things around FTL are not heavily optimized anyway.

jpgpi250 · February 20, 2020, 8:13pm

that's not my fault, that's discourse, cutting chars when block quoting.
another attempt to list the regex

(\.|^)wizaly\.com$

versus

^(.+\.)??wizaly\.com$

and a screenshot of what I type

DL6ER · February 20, 2020, 8:14pm

Seems okay.

jpgpi250 · February 20, 2020, 8:20pm

repeating, you split while I was replying

so which one is the better one? the first one is the one you use, the second one appears to require fewer steps.

DL6ER · February 20, 2020, 8:21pm

How so?

jpgpi250 · February 20, 2020, 8:30pm

end of discussion.
when entering the domain wizaly.com in regex101.com, ^(.+\.)??wizaly\.com$ wins
when entering the domain abc.wizaly.com in regex101.com, (\.|^)wizaly\.com$ wins

(hoping discourse doesn't change the code again...)
so your regex wins if a subdomain is used, which is what will happen in real life.

Sorry for my mistake.

edit
@Bucking_Horn, you showed an interest in this, here is the answer, provided by the smarter (than us) developer.
/edit

Bucking_Horn · February 20, 2020, 11:02pm

Thanks, I'd been alerted as soon as you quoted me in your opening post.

But while we're at it, I'll add my two cents.

In the very post you quoted me from, I was also doubting whether the number of steps as calculated by regex101 would actually suffice as an assessment criterion, as Pi-hole would likely use a different variant (ERE vs. PCRE) and most certainly a different implementation than that website.
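One way to see how implementation-dependent such numbers are is to time both patterns in yet another engine. The sketch below uses Python's `re` and `timeit`, which is neither regex101's PCRE nor the regex library inside FTL, so its results say nothing definitive about Pi-hole. The helper name `time_search` and the iteration count are assumptions made for this illustration.

```python
import re
import timeit

# The two candidates from this thread, precompiled once.
OURS = re.compile(r"(\.|^)wizaly\.com$")
PROPOSED = re.compile(r"^(.+\.)??wizaly\.com$")


def time_search(compiled: re.Pattern, domain: str, number: int = 100_000) -> float:
    """Return the total seconds taken by `number` searches of `domain`."""
    return timeit.timeit(lambda: compiled.search(domain), number=number)


if __name__ == "__main__":
    for domain in ("wizaly.com", "abc.wizaly.com"):
        ours = time_search(OURS, domain)
        proposed = time_search(PROPOSED, domain)
        print(f"{domain}: ours={ours:.4f}s proposed={proposed:.4f}s")
```

Wall-clock timings like these vary between runs and machines, so they are best read as a rough comparison within one engine rather than as a ranking that carries over to another.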

Anyhow, as @DL6ER's musings here also take those step counts into consideration, it seems that with increasing subdomain length, the two methods break even at a subdomain length of 4 characters.

|domain|`(^\|\.)`|`^(.+\.)?`|
|---|---|---|
|1.domain.com|20|29|
|12.domain.com|23|29|
|123.domain.com|26|29|
|1234.domain.com|29|29|
|12345.domain.com|32|29|
|123456.domain.com|35|29|
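The break-even point can be checked with a little arithmetic on the rows above. This is a sketch that assumes the step count of the first regex grows linearly with subdomain length, which the six rows are consistent with (3 extra steps per character); the function names are made up for this example.

```python
# Step counts copied from the table above: subdomain length -> steps.
ALTERNATION_STEPS = {1: 20, 2: 23, 3: 26, 4: 29, 5: 32, 6: 35}
OPTIONAL_GROUP_STEPS = 29  # the same in every row of the table


def linear_fit(points: dict) -> tuple:
    """Fit steps = slope * length + intercept through the first and last rows."""
    lengths = sorted(points)
    first, last = lengths[0], lengths[-1]
    slope = (points[last] - points[first]) / (last - first)
    intercept = points[first] - slope * first
    return slope, intercept


def break_even(points: dict, constant: float) -> float:
    """Subdomain length at which the linear count equals the constant count."""
    slope, intercept = linear_fit(points)
    return (constant - intercept) / slope


if __name__ == "__main__":
    slope, intercept = linear_fit(ALTERNATION_STEPS)
    print(f"steps = {slope:g} * length + {intercept:g}")
    print(f"break-even length: {break_even(ALTERNATION_STEPS, OPTIONAL_GROUP_STEPS):g}")
```

With these numbers the fit is 17 + 3 × length, which equals 29 at a length of 4, matching the table.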

Matching additional subdomain levels would tip the balance even earlier:

|domain|`(^\|\.)`|`^(.+\.)?`|
|---|---|---|
|1.1.domain.com|28|29|
|1.12.domain.com|31|29|
|1.123.domain.com|34|29|
|www.tripod.lycos.com|48|27|

(I've included the last entry (lycos) as an example of a real world domain.)

So in searching for an answer to my question, we have to move away from simple hard facts into heuristics, where the best solution for @jpgpi250 might not be equally beneficial for me.

However, I think it is safe to assume that the vast majority of those matches would be executed against www., which clearly favours the solution that Pi-hole has chosen. Gotta love them guys.