Discussion about optimal wildcard syntax

you replied:

This afternoon, we had a discussion about a Facebook regex.

The following came up:


do we use (scripted entry)


regex101.com says match -> 18 steps


regex101.com says match -> 14 steps

your opinion and comment would be highly appreciated.

edit typos

The first one you mention is slightly different from the one we’re using: (\.|^)
Please repeat your benchmark with the correct syntax, as a case-sensitive literal match \. is a few steps faster than an any-character match. Also, you only specified the regex. When you state the number of steps it takes, you should also provide the test vector you ran your test against.

I can only assume you tested against wizaly.com, for which I see 16 steps for ours and 14 steps for the proposed other wildcard regex. However, this is not a definitive answer, as you also need to consider subdomain matches. Given abc.wizaly.com, our regex needs 26 steps whereas the other one needs 31 steps.
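To make the test vectors explicit, here is a minimal sketch that checks which domains each pattern accepts. It uses Python's re module purely as a stand-in: that is not the engine Pi-hole uses, and it reports no step counts, so it only confirms match behaviour. The names PIHOLE_STYLE and ALTERNATIVE and the two non-matching domains are my own assumptions for illustration.

```python
import re

# Patterns as quoted in this thread; Python's re is only a stand-in engine.
PIHOLE_STYLE = re.compile(r"(\.|^)wizaly\.com$")
ALTERNATIVE = re.compile(r"^(.+\.)??wizaly\.com$")


def matches(pattern, domain):
    """Return True if the pattern matches somewhere in the domain."""
    return pattern.search(domain) is not None


# The first two vectors come from the thread; the last two are hypothetical
# negatives added to show that neither pattern matches a mere substring.
test_vectors = [
    "wizaly.com",
    "abc.wizaly.com",
    "notwizaly.com",
    "wizaly.com.evil.org",
]

results = {
    domain: (matches(PIHOLE_STYLE, domain), matches(ALTERNATIVE, domain))
    for domain in test_vectors
}

for domain, (pihole_hit, alternative_hit) in results.items():
    print(f"{domain:22} {pihole_hit!s:6} {alternative_hit!s:6}")
```

Both patterns should accept the bare domain and the subdomain and reject the other two, so any difference between them is one of cost, not of what gets blocked.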

You see, regex magic is not necessarily as simple as it may sometimes seem, and we’ve already invested a lot of effort into the more subtle things of Pi-hole as well. Not everything is documented; however, that doesn’t mean the things around FTL are not heavily optimized anyway.

that’s not my fault, that’s discourse, cutting chars when block quoting.
another attempt to list the regex




and a screenshot of what I type

Seems okay.

repeating, you split while I was replying

so which one is the better one? the first one is the one you use, the second one appears to require fewer steps.

How so?

end of discussion.
when entering the domain wizaly.com in regex101.com, ^(.+\.)??wizaly\.com$ wins
when entering the domain abc.wizaly.com in regex101.com, (\.|^)wizaly\.com$ wins

(hoping discourse doesn’t change the code again…)
so your regex wins if a subdomain is used, which is what will happen in real life.

Sorry for my mistake.

@Bucking_Horn, you showed an interest in this, here is the answer, provided by the smarter (than us) developer.

Looking at it I think that I see why it’s faster using (\.|^)

When comparing, the chance of matching a subdomain is higher, and so the second part is not matched.

Then, after compiling, how does it look? I think all permutations are written out to single lines.

becomes two lines

Am I correct in that?

Then, is the compiled list sorted on length and kind (start and end anchor) of the line, so that a match can be achieved even faster?

I am a big fan of single anchoring and handling domains makes the end anchoring ($) favourable.

Thanks, I’d been alerted as soon as you quoted me in your opening post :wink:

But while we’re at it, I add my two cents :wink:

In the very post you quoted me from, I was also doubting whether the number of steps as calculated by regex101 would actually suffice as an assessment criterion, as Pi-hole would likely use a different variant (ERE vs. PCRE) and most certainly a different implementation than that website.
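One way to look at this without relying on step counts is to time the patterns on whatever engine is at hand. The following is only a minimal sketch, assuming Python's re module, which is neither the ERE implementation Pi-hole would use nor regex101's PCRE; the numbers it prints say nothing definitive about either. The helper name time_pattern and the repeat count are hypothetical choices of mine.

```python
import re
import timeit

# Patterns as quoted in this thread.
PATTERNS = {
    "alternation": r"(\.|^)wizaly\.com$",
    "optional group": r"^(.+\.)??wizaly\.com$",
}


def time_pattern(pattern, domain, repeats=2000):
    """Return the total seconds needed to search `domain` `repeats` times."""
    compiled = re.compile(pattern)
    return timeit.timeit(lambda: compiled.search(domain), number=repeats)


# Subdomain test vector taken from the thread.
domain = "abc.wizaly.com"
timings = {name: time_pattern(pattern, domain) for name, pattern in PATTERNS.items()}

for name, seconds in timings.items():
    print(f"{name:15} {seconds:.6f} s")
```

Results will vary between runs and machines, so this is better suited to spotting large differences than to ranking patterns that are a few steps apart.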

Anyhow, as @DL6ER’s musings here also take those number of steps into consideration, it seems that with increasing subdomain length, the preference for a method would break even on length=4.

domain               (^|\.)   ^(.+\.)?
1.domain.com         20       29
12.domain.com        23       29
123.domain.com       26       29
1234.domain.com      29       29
12345.domain.com     32       29
123456.domain.com    35       29
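The break-even point can be read straight off those numbers. As a minimal sketch, using only the step counts from the table above (the variable names are mine):

```python
# Step counts copied from the table above: subdomain label length -> steps.
alternation_steps = {1: 20, 2: 23, 3: 26, 4: 29, 5: 32, 6: 35}  # (^|\.)
optional_group_steps = {n: 29 for n in alternation_steps}       # ^(.+\.)?

# Growth of the alternation variant per additional subdomain character.
lengths = sorted(alternation_steps)
deltas = {
    alternation_steps[longer] - alternation_steps[shorter]
    for shorter, longer in zip(lengths, lengths[1:])
}

# First label length at which the alternation variant is no longer cheaper.
break_even = min(
    n for n in lengths if alternation_steps[n] >= optional_group_steps[n]
)

print("growth per character:", deltas)    # {3}
print("break-even label length:", break_even)  # 4
```

In this data the alternation variant costs three more steps per extra character while the optional group stays flat at 29, which is why the two meet at a label length of 4.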

Matching additional subdomain levels would tip the balance even earlier:

domain                  (^|\.)   ^(.+\.)?
1.1.domain.com          28       29
1.12.domain.com         31       29
1.123.domain.com        34       29
www.tripod.lycos.com    48       27

(I’ve included the last entry (lycos) as an example of a real world domain.)

So in searching for an answer to my question, we have to move away from simple hard facts into heuristics, where the best solution for @jpgpi250 might not be equally beneficial for me.

However, I think it is safe to assume that the vast majority of those matches would be executed against www., which clearly favors the solution that Pi-hole has chosen - gotta love them guys :wink: :+1: :smiling_face_with_three_hearts: