Regex capture / non-capture groups best practice

chrislph · February 18, 2023, 6:25pm

I've been playing with regex, reading forums and trying out various code on regex101.

In Pi-hole I currently have the following blacklist entry (just one example of many)

^(meetings|hangouts|suggestqueries).*google(apis)?\.com$

I hadn't appreciated until now that this creates two capture groups. Since I don't need these I can modify it to use non-capture groups with the ?: modifier inside the brackets

^(?:meetings|hangouts|suggestqueries).*google(?:apis)?\.com$

I've been testing how Pi-hole handles such groups and modifiers and it appears to respect them all perfectly. If I have two capture groups I can reference them with \1 and \2. If I tweak the first group to be non-capturing with ?: Pi-hole correctly now sees just one group (what was the second one) and references it now as \1. I've been testing with

pihole-FTL regex-test testdomainhere
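As a rough illustration of the group numbering described above, here is a minimal Python sketch. It uses Python's re module as a stand-in, which is an assumption: FTL has its own engine, so pihole-FTL regex-test remains the authoritative check. The "mixed" pattern and the test domain are made up for the illustration.

```python
import re

capturing = re.compile(r"^(meetings|hangouts|suggestqueries).*google(apis)?\.com$")
non_capturing = re.compile(r"^(?:meetings|hangouts|suggestqueries).*google(?:apis)?\.com$")
# Hypothetical mixed variant: only the first group is made non-capturing.
mixed = re.compile(r"^(?:meetings|hangouts|suggestqueries).*google(apis)?\.com$")

domain = "meetings.googleapis.com"  # made-up test domain

# .groups reports how many capture groups a compiled pattern contains.
print(capturing.groups, non_capturing.groups, mixed.groups)

# With the first group non-capturing, the former second group becomes group 1.
print(capturing.match(domain).group(2))
print(mixed.match(domain).group(1))
```

All three patterns accept or reject the same domains; only the number of stored groups differs.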

The bracket syntax in the example above is quite common in here for matching variations of a domain and back-references aren't needed in these cases. So I am inclined to use non-capturing groups, and I've read a few articles where this seems to be considered best practice (non-storing so speeds things up, etc).

However I cannot find any info on Pi-hole's treatment of these or use of the ?: syntax. I assume it handles it correctly because testing seems to show that. The docs suggest that the FTL engine may not quite match what I would see on regex101.

Just wanted to get some feedback on the regex engine's behaviour and any gotchas or anything else worth knowing, before I go ahead and redo a whole raft of expressions on a couple of Pi-holes, adding in the ?: syntax. What about going forward for new expressions? Still worth adding up front?

I'm on latest FTL v5.21. I have read the posts in the regex engine thread but posted a new topic rather than disturb a lot of people by resurrecting the 2-year old thread.

DL6ER · February 18, 2023, 9:15pm

Reading what you said here about the non-capturing groups, I am reminded that I wanted to extend our regex documentation.

As you've read in the other (two years old) thread, the Regex engine inside FTL is a POSIX-compliant ERE-like (extended regular expressions) regex engine. We started at some point with BRE (basic regular expressions) but switched to ERE-like later as BRE is now considered obsolete.

Why do I say "ERE-like"? Well, because ERE does not specify some useful things (such as said back-references \1 - \9) that were actually specified in BRE so it makes sense to slightly deviate from the standard and accept them (this isn't a breaking change).

Let me try to summarize below what the regex engine of FTL can do that is (currently) missing from the cheat sheet in the documentation (some parts are mentioned elsewhere on our regex pages). You'll see that not much is missing.

Back references

\[1-9]

A back reference is a backslash followed by a single non-zero decimal digit n. It matches the same sequence of characters matched by the n-th parenthesized subexpression.
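To make the back-reference rule concrete, a small Python sketch follows. Python's re module is only a stand-in for the FTL engine here, and the pattern and domains are hypothetical.

```python
import re

# \1 must repeat exactly what group 1 matched: a label followed by itself.
pattern = re.compile(r"^([a-z]+)\.\1\.com$", re.IGNORECASE)

same_label = bool(pattern.match("ads.ads.com"))         # label repeated
different_label = bool(pattern.match("ads.track.com"))  # labels differ

print(same_label, different_label)
```

The second domain is rejected because \1 does not mean "any label"; it means "the same text the first group just matched".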

Hexadecimal literals

A literal can be an ordinary character/digit, an escape sequence (\n, \t, ...) or a hex-encoded character starting with \x such as \x1B. However, I do not see much, if any, application in the context of domains, not even for international domains with special characters, as they should be using Punycode encoding.

Assertion characters

Besides the well-known "anchors" ^ and $, which match the start and end of the input, respectively, we also support the following assertions and character-class shorthands:

\< – Beginning of word
\> – End of word
\b – Word boundary
\B – Non-word boundary
\d – Digit character (equivalent to [[:digit:]])
\D – Non-digit character (equivalent to [^[:digit:]])
\s – Space character (equivalent to [[:space:]])
\S – Non-space character (equivalent to [^[:space:]])
\w – Word character (equivalent to [[:alnum:]_])
\W – Non-word character (equivalent to [^[:alnum:]_])
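The equivalences in this list can be checked with a short Python sketch. Two assumptions apply: Python's re module stands in for the FTL engine, and because Python supports neither POSIX bracket classes such as [[:digit:]] nor \< and \>, the classes are spelled out as explicit ASCII ranges. The sample string is made up.

```python
import re

sample = "ad-server_01.example.com 42"

# Each shorthand is paired with an explicit ASCII bracket expression.
pairs = [
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\s", r"[ \t\n\r\f\v]"),
    (r"\S", r"[^ \t\n\r\f\v]"),
    (r"\w", r"[A-Za-z0-9_]"),
    (r"\W", r"[^A-Za-z0-9_]"),
]

# True where the shorthand and the explicit class find the same characters.
results = {
    shorthand: re.findall(shorthand, sample, re.ASCII) == re.findall(explicit, sample, re.ASCII)
    for shorthand, explicit in pairs
}
print(results)

# \b is a zero-width assertion: it matches a position, not a character.
whole_word = re.findall(r"\bexample\b", sample)
inside_word = re.findall(r"\bexample\b", "myexample.com")
print(whole_word, inside_word)
```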

Non-greedy matching

Normally a repeated expression is greedy, that is, it matches as many characters as possible:

(matching 0 or more)*
(matching 0 or 1)?
(matching 1 or more)+
(matching n times){n}
(matching n or more times){n,}
(matching n to m (inclusive) times){n,m}

Simple example for greedy matching ("eating as much as possible"):

regex: (.*)ab
input: bcdabdcbabcd
       ^^^^^^^^ab
match: bcdabdcb

Adding a ? to a repeat operator (such as ? or *) makes the subexpression minimal, or non-greedy. A non-greedy subexpression matches as few characters as possible:

(matching 0 or more)*?
(matching 0 or 1)??
(matching 1 or more)+?
(matching n times){n}?
(matching n or more times){n,}?
(matching n to m (inclusive) times){n,m}?

Simple example for non-greedy matching ("eating as little as possible"):

regex: (.*?)ab
input: bcdabdcbabcd
       ^^^ab
match: bcd

Note that this does not (always) mean the same thing as matching as many or few repetitions as possible. Also note that minimal repetitions are not supported for approximate matching for performance reasons.
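The greedy and non-greedy examples above can be reproduced with a minimal Python sketch. Python's re module is used as a stand-in for the FTL engine, which is an assumption, although both engines behave the same way for this simple case.

```python
import re

text = "bcdabdcbabcd"

greedy = re.match(r"(.*)ab", text)   # .* grabs as much as possible
lazy = re.match(r"(.*?)ab", text)    # .*? stops at the first "ab"

print(greedy.group(1))  # bcdabdcb
print(lazy.group(1))    # bcd
```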

Non-capture groups

Non-capturing groups perform slightly better as we do not have to create a matching record for them. However, be aware of possible caveats, e.g. "normal" groups within non-capturing groups still capture. Example: (?:([A-Za-z]+):) will match, for instance, "ftp:". Even though the entire string is matched by a non-capturing group, the inner group will still return the string "ftp" for "\1".
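The caveat about nested groups can be seen in a short Python sketch (again with Python's re module standing in for the FTL engine, as an assumption):

```python
import re

pattern = re.compile(r"(?:([A-Za-z]+):)")
match = pattern.match("ftp:")

print(pattern.groups)  # 1: only the inner group is counted
print(match.group(0))  # ftp: (the whole match)
print(match.group(1))  # ftp  (captured by the inner group)
```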

Options

Finally, you can specify some options for the regex, such as disabling the default case-insensitive matching. This typically shouldn't be done for domains. However, if you really want to do this, you can by prepending your regex with (?-i) as in

(?-i)intENtional CASe-senSItive matCHING

Other supported options are (?U), forcing the repetition operators in your regex to be non-greedy unless a ? is appended, and (?r), causing the regex to be matched right-associatively rather than in the default left-associative manner.

Mind that these options have a higher chance to hurt than to actually help you (unless you know exactly what you are doing with them).
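The effect of switching off case-insensitivity can be approximated in Python, with loud caveats: Python's re module does not accept the (?-i) prefix, nor (?U) or (?r), so this sketch uses Python's scoped form (?-i:...) (Python 3.6+) and passes re.IGNORECASE to mimic FTL's case-insensitive default. The domain is hypothetical, and the syntax shown is Python's, not FTL's.

```python
import re

# re.IGNORECASE stands in for FTL's case-insensitive default (assumption).
default = re.compile(r"^example\.com$", re.IGNORECASE)
# Python's scoped switch; in FTL the equivalent would be a (?-i) prefix.
sensitive = re.compile(r"^(?-i:example\.com)$", re.IGNORECASE)

outcome = {
    domain: (bool(default.match(domain)), bool(sensitive.match(domain)))
    for domain in ("example.com", "EXAMPLE.com")
}
print(outcome)
```

The upper-case spelling is accepted by the default pattern and rejected by the case-sensitive one, which is why this option rarely makes sense for domains.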

chrislph · February 20, 2023, 4:01pm

That's a very useful updated regex reference, thank you for writing that up. I've spent some time trying them out. Very solid and useful to have this capability in Pi-hole.

Isn't matching case-sensitive as standard (hence the need, for example, for [A-Za-z]+ to catch variants of "ftp" in your example)? Where can the standard case-insensitive matching be found?

From what you've written plus the info online I'm seeing that best practice, including here in Pi-hole, is to use capturing groups where needed and to otherwise keep them non-capturing. I guess the performance hit is minimal though. Do you concur? I will redo my regexes with brackets as non-capturing variants.

In Pi-hole if I add a wildcard domain it creates the following regex:

(\.|^)example\.com$

Does this mean it should ideally be

(?:\.|^)example\.com$
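A quick way to compare the two wildcard variants is to run both against a handful of domains. The sketch below assumes Python's re module as a stand-in for the FTL engine and uses made-up test domains; re.IGNORECASE is passed to mimic case-insensitive domain matching.

```python
import re

capturing = re.compile(r"(\.|^)example\.com$", re.IGNORECASE)
non_capturing = re.compile(r"(?:\.|^)example\.com$", re.IGNORECASE)

domains = ["example.com", "sub.example.com", "notexample.com", "example.com.evil.net"]

# Map each domain to (capturing result, non-capturing result).
verdicts = {
    d: (bool(capturing.search(d)), bool(non_capturing.search(d)))
    for d in domains
}
for d, v in verdicts.items():
    print(d, v)
```

In this sketch both variants accept and reject exactly the same domains; the only difference is whether group 1 is stored.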

DL6ER · February 20, 2023, 5:32pm

No, insensitive is the default. This is a compile-time setting and needs to be overridden if one really wants case-sensitive matching. [a-z] will be enough, otherwise, rules such as

would not make much sense, either.

Yes.

Ideally, yes. However, we do not do this by default as it is more complex to explain for new users and typically doesn't make enough of a difference to be worth the extra confusion the ?: might be causing.

chrislph · February 20, 2023, 6:35pm

Do you mean insensitive is the default for Pi-hole as this makes sense from a domain perspective? If so, that makes sense (I first thought you were saying insensitive was the default for regex engine standard behaviour).

Thanks for clarifying the groups overhead. It's almost like the syntax is the wrong way around – would be better to have (xxx) for standard grouping and (?:xxx) to modify for capture use. This article is I think a good summary. On balance I'll leave them be for the clearer syntax.

jpgpi250 · February 20, 2023, 10:12pm

please explain (not a regex expert)

regexper.com says:

what is the difference?
what should be used when adding the regexes for this list (automated - scripted), for example for the entry {"company_name": "A8.net", "domains": ["a8.net"]},

chrislph · February 21, 2023, 6:12am

Using parentheses does two things. It groups items for matching. It also captures the results of those matches so they can be referenced again in the same expression using \1 for the first capture, \2 for the second and so on.

The expression star(trek|wars) will match both startrek and starwars. It also creates a capture group 1 which can be referenced with \1. This will reference the actual string trek or wars, depending on what was actually matched.

This might be used in this longer expression:

star(trek|wars) is great and star\1 is my favourite

This will match

startrek is great and startrek is my favourite
starwars is great and starwars is my favourite

It won't match

startrek is great and starwars is my favourite
starwars is great and startrek is my favourite
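The four sentences above can be checked with a minimal Python sketch (Python's re module is assumed as a stand-in; the same expression can be tried in Pi-hole with pihole-FTL regex-test):

```python
import re

pattern = re.compile(r"star(trek|wars) is great and star\1 is my favourite")

sentences = [
    "startrek is great and startrek is my favourite",
    "starwars is great and starwars is my favourite",
    "startrek is great and starwars is my favourite",
    "starwars is great and startrek is my favourite",
]

# \1 must repeat whichever alternative group 1 actually matched.
matched = [bool(pattern.fullmatch(s)) for s in sentences]
print(matched)
```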

There's a slight overhead in storing and processing capture groups. If the parentheses are only needed for grouping, the group can be made non-capturing by adding ?: like star(?:trek|wars)

That's why the example (\.|^)example\.com$ creates a capture group, where \1 will match either . (for a subdomain) or be empty (no subdomain). This is what Pi-hole is doing if you tell it to create a wildcard. Since the \1 is not needed here (the brackets are just to select between a subdomain or the start of a domain), there is no value in creating a capture group. So technically the expression can be amended to be non-capturing by using (?:\.|^)example\.com$

You can see in your screenshot the way group 1 is created or not created depending on which variant is used.

In practice the overhead of storing the group during evaluation is tiny enough that it's probably better to go for the cleaner syntax of just using the parentheses as they are, even though this technically creates a capture group that is never used during that expression's evaluation.

Bottom line: there's no need to change any existing regexes from ( ) groups to (?: ) groups. They are fine as they are. If you do want to use a capturing group to match a domain it's great to know that Pi-hole supports it. Kudos to the devs for the work.

If you have a script that is creating bracketed selections and it's working I would say leave it as it is now.

jpgpi250 · February 21, 2023, 8:07am

Thanks for the detailed explanation.

Both NextDNS and AdguardTeam have lists for known cname entries, GitHub links in the scripts.
If you want to look at / use the scripts, you can find them here:

I run the scripts weekly to add possible new entries. Obsolete regexes aren't removed! To remove the entries, use sqlite3 to select / delete entries with the fixed comment entered by the scripts.
