Hey all,
last weekend I finally got around to finishing what I started working on a rather long time ago: extending the regex engine. For compatibility reasons (and also for the sake of convenience), we stick with the extended regular expressions (ERE) we currently have. All your existing regular expressions will still work. There is also some new stuff:
Regex Test mode
In order to ease regex development, we added a regex test mode to pihole-FTL, which can be invoked like
pihole-FTL regex-test doubleclick.net
(test doubleclick.net against all regexes in the gravity database), or
pihole-FTL regex-test doubleclick.net "(^|\.)double"
(test doubleclick.net against the CLI-provided regex (^|\.)double).
You do NOT need sudo for this; any user should be able to run this command. The test returns 0 on match and 1 on no match or on error, so it can be used for scripting.
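Since the exit code carries the result, the test is easy to wrap in a script. The following is only a minimal sketch: the helper names regex_test_cmd and domain_matches are made up for illustration, and the only things taken from the post are the two invocations and the exit codes described above.

```python
import subprocess


def regex_test_cmd(domain, regex=None):
    """Build the argument list for the two invocations shown above."""
    cmd = ["pihole-FTL", "regex-test", domain]
    if regex is not None:
        # Optional CLI-provided regex; without it, FTL tests against
        # the regexes stored in the gravity database.
        cmd.append(regex)
    return cmd


def domain_matches(domain, regex=None):
    """True if the test exits with 0 (match), False otherwise.

    A non-zero exit code means "no match" or "error"; this sketch
    does not try to tell the two apart.
    """
    result = subprocess.run(
        regex_test_cmd(domain, regex), capture_output=True, text=True
    )
    return result.returncode == 0


# Example call (needs pihole-FTL on the PATH):
#   domain_matches("doubleclick.net", r"(^|\.)double")
```

Passing the arguments as a list avoids any shell quoting issues with characters such as | and \ in the regex.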
Comments
You can specify comments within your regex using the syntax
(?#some comment here)
The comment can contain any character except a closing parenthesis ) (for the sole reason that it is the terminating element). The text in the comment is completely ignored by the regex parser and is used solely for readability purposes.
Try it yourself!
$ pihole-FTL regex-test "doubleclick.net" "(^|\.)doubleclick\.(?#TODO: We need to maybe support more than just .net here)net$"
FTL Regex test:
Domain: "doubleclick.net"
Regex: "(^|\.)doubleclick\.(?#TODO: We need to maybe support more than just .net here)net$"
Step 1: Compiling regex filter...
Compiled regex filter in 0.167 msec
Step 2: Checking domain...
Done in 0.032 msec
MATCH
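If you want to convince yourself that a comment really changes nothing, you can compare a commented regex with its comment-free twin. The sketch below uses Python's re module purely as a stand-in: it is a different engine than the one in FTL, but it happens to accept the same (?#...) comment syntax. The helper name same_verdict is hypothetical.

```python
import re

WITH_COMMENT = (
    r"(^|\.)doubleclick\."
    r"(?#TODO: We need to maybe support more than just .net here)net$"
)
WITHOUT_COMMENT = r"(^|\.)doubleclick\.net$"


def same_verdict(domain):
    """Return (commented result, plain result) for one domain."""
    commented = re.search(WITH_COMMENT, domain) is not None
    plain = re.search(WITHOUT_COMMENT, domain) is not None
    return commented, plain


for domain in ("doubleclick.net", "ads.doubleclick.net", "doubleclick.com"):
    print(domain, same_verdict(domain))
```

Both results should always agree, because the comment is dropped before matching.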
Back-references
A back reference is a backslash followed by a single non-zero decimal digit d. It matches the same sequence of characters matched by the d-th parenthesized subexpression.
Example:
"cat.foo.dog---cat%dog!foo" is matched by "(cat)\.(foo)\.(dog)---\1%\3!\2"
Another (more complex) example is:
(1234|4321)\.(foo)\.(dog)--\1
MATCH: 1234.foo.dog--1234
MATCH: 4321.foo.dog--4321
NO MATCH: 1234.foo.dog--4321
Mind that the last line gives no match, as \1 matches exactly the same sequence the first character group matched. And 4321 is not the same as 1234, even though both are valid matches for (1234|4321).
Back references are not defined for POSIX EREs (for BREs they are, surprisingly enough). We add them to ERE in the BRE style.
Try it yourself!
$ pihole-FTL regex-test "someverylongandmaybecomplexthing.foo.dog--someverylongandmaybecomplexthing" "(someverylongandmaybecomplexthing|somelesscomplexitem)\.(foo)\.(dog)--\1"
FTL Regex test:
Domain: "someverylongandmaybecomplexthing.foo.dog--someverylongandmaybecomplexthing"
Regex: "(someverylongandmaybecomplexthing|somelesscomplexitem)\.(foo)\.(dog)--\1"
Step 1: Compiling regex filter...
Compiled regex filter in 0.563 msec
Step 2: Checking domain...
Done in 0.031 msec
MATCH
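The back-reference examples above can also be replayed offline. This is a sketch using Python's re module as a stand-in engine (not FTL's), which treats \1, \2 and \3 the same way for these patterns; the helper name matches is made up for illustration.

```python
import re


def matches(regex, text):
    """True if regex matches anywhere in text."""
    return re.search(regex, text) is not None


# Three groups, referenced out of order.
print(matches(r"(cat)\.(foo)\.(dog)---\1%\3!\2", "cat.foo.dog---cat%dog!foo"))

# \1 repeats what group 1 actually matched, not what it could match.
ALTERNATION = r"(1234|4321)\.(foo)\.(dog)--\1"
for text in ("1234.foo.dog--1234", "4321.foo.dog--4321", "1234.foo.dog--4321"):
    print(text, matches(ALTERNATION, text))
```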
More character classes for bracket expressions
A bracket expression specifies a set of characters by enclosing a nonempty list of items in brackets. Normally anything matching any item in the list is matched. If the list begins with ^ the meaning is negated; any character matching no item in the list is matched.
- Multiple characters: [abc] matches a, b, and c.
- Character ranges: [0-9] matches any decimal digit.
- Character classes:
  - [:alnum:] alphanumeric characters
  - [:alpha:] alphabetic characters
  - [:blank:] blank characters
  - [:cntrl:] control characters
  - [:digit:] decimal digits (0 - 9)
  - [:graph:] all printable characters except space
  - [:lower:] lower-case letters (FTL matches case-insensitive by default)
  - [:print:] printable characters including space
  - [:punct:] printable characters not space or alphanumeric
  - [:space:] white-space characters
  - [:upper:] upper-case letters (FTL matches case-insensitive by default)
  - [:xdigit:] hexadecimal digits
Furthermore, there are two shortcuts for some character classes:
- \d - Digit character (equivalent to [[:digit:]])
- \D - Non-digit character (equivalent to [^[:digit:]])
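To make the class list a bit more tangible, here is a sketch that spells out which ASCII characters each class covers. It assumes plain ASCII in the C/POSIX locale and does not model FTL's case-insensitive matching; the names POSIX_CLASSES and in_class are hypothetical and not part of FTL.

```python
import string

# ASCII members of each class, assuming the C/POSIX locale.
POSIX_CLASSES = {
    "alnum": string.ascii_letters + string.digits,
    "alpha": string.ascii_letters,
    "blank": " \t",
    "cntrl": "".join(map(chr, range(0, 32))) + chr(127),
    "digit": string.digits,
    "graph": "".join(map(chr, range(33, 127))),
    "lower": string.ascii_lowercase,
    "print": "".join(map(chr, range(32, 127))),
    "punct": string.punctuation,
    "space": string.whitespace,
    "upper": string.ascii_uppercase,
    "xdigit": string.hexdigits,
}


def in_class(char, name):
    """True if the single character char belongs to class name."""
    return char in POSIX_CLASSES[name]


print(in_class("f", "xdigit"), in_class("g", "xdigit"))
print(in_class(" ", "print"), in_class(" ", "graph"))
```

The table also shows how the classes relate: punct is exactly graph without alnum, and print is graph plus the space character.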
Approximative matching
I don't know if you know agrep. It is basically a "forgiving" grep. I use it a lot when searching through my (offline!) dictionaries. It is tolerant of errors (up to a degree you specify). It may be beneficial if you want to match against domains where you don't really know the pattern. It is just an idea, we will have to see if it is actually useful.
This is a somewhat complicated topic, so we'll approach it by examples, as it is very hard to get your head around it from the specification alone.
The approximate matching settings for a subpattern can be changed by appending approx-settings to the subpattern. Limits for the number of errors can be set and an expression for specifying and limiting the costs can be given:
- Number of acceptable insertions (+)
  Use (something){+x} to specify that the regex should still match when x characters would need to be inserted into the sub-expression something:
  Example:
  "doubleclick.net" is matched by "^doubleclick\.(nt){+1}$"
  The missing e in nt is inserted.
  Similarly:
  "doubleclick.net" is matched by "^(doubleclk\.nt){+3}$"
  The missing characters in the domain are inserted. The maximum number of insertions spans the entire domain, as it is wrapped in the sub-expression (...).
- Number of acceptable deletions (-)
  Use (something){-x} to specify that the regex should still match when x characters would need to be deleted from the sub-expression something:
  Example:
  "doubleclick.net" is matched by "^doubleclick\.(neet){-1}$"
  The surplus e in neet is deleted.
  Similarly:
  "doubleclick.net" is matched by "^(doubleclicky\.netty){-3}$"
  "doubleclick.net" is NOT matched by "^(doubleclicky\.nettfy){-3}$"
- Number of acceptable substitutions (#)
  Use (something){#x} to specify that the regex should still match when x characters would need to be substituted in the sub-expression something:
  Example 1:
  "oobargoobaploowap" is matched by "(foobar){#2~2}"
  Hint: "goobap" is "foobar" with two substitutions "f->g" and "r->p"
  Example 2:
  "doubleclick.net" is matched by "^doubleclick\.n(tt){#1}$"
  The incorrect t in ntt is substituted. Note that substitutions are necessary when a character needs to be replaced, as the corresponding realization with one insertion and one deletion is not identical:
  "doubleclick.net" is matched by "^doubleclick\.n(tt){+1-1}$"
  (t is removed, e is added), however
  "doubleclick.nt" is ALSO matched by "^doubleclick\.n(tt){+1-1}$"
  (the t is just removed, nothing had to be added) but
  "doubleclick.nt" is NOT matched by "^doubleclick\.n(tt){#1}$"
  as substitutions always require characters to be swapped for others.
- Combinations and total error limit (~)
  All rules from above can be combined, as in {+2-5#6}, allowing (up to!) two insertions, five deletions, and six substitutions. You can enforce an upper limit on the number of tried realizations using the tilde. Even though {+2-5#6} can lead to up to 13 operations being tried, this can be limited to (at most) seven tries using {+2-5#6~7}.
  Example:
  "oobargoobaaploowap" is matched by "(foobar){+2#2~3}"
  Hint: "goobaap" is "foobar" with
  - two substitutions "f->g" and "r->p", and
  - one addition "a" between "bar" (to have "baap")
  Specifying ~2 instead of ~3 will lead to no match, as (at least) three errors need to be corrected in total for a match.
- Cost-equation: For experts (or crazy users) only!
  You can even weight the "costs" of insertions, deletions or substitutions. This is really an advanced topic and should only be touched when really needed.
  Cost-equation details (you have been warned!)
  A cost-equation can be thought of as a mathematical equation, where i, d, and s stand for the number of insertions, deletions, and substitutions, respectively. The equation can have a multiplier for each of i, d, and s.
  The multiplier is the cost of the error, and the number after < is the maximum allowed total cost of a match. Spaces and pluses can be inserted to make the equation more readable. When specifying only a cost equation, adding a space after the opening { is required.
  Example 1: { 2i + 1d + 2s < 5 }
  This sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five.
  Example 2: {+2-5#6, 2i + 1d + 2s < 5 }
  This sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five. Furthermore, it allows only up to 2 insertions (coming at a total cost of 4), five deletions and up to 6 substitutions. As six substitutions would come at a cost of 6*2 = 12, exceeding the total allowed cost of 5, they cannot all be realized.
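The error limits above are easier to reason about with a toy model. The sketch below is NOT how FTL implements approximate matching: it only checks whether one literal sub-expression can be turned into one exact string within the given limits, using the meaning of insertion, deletion and substitution from the examples above. The function name approx_literal_match and all of its parameters are made up for illustration, and max_cost is treated as an inclusive upper bound, which is an assumption about how the number after < behaves.

```python
from functools import lru_cache


def approx_literal_match(pattern, text, max_ins=0, max_del=0, max_sub=0,
                         max_errors=None, costs=(1, 1, 1), max_cost=None):
    """Can the literal pattern be edited into exactly text within the limits?

    max_ins/max_del/max_sub model {+x}, {-x} and {#x}; max_errors models ~x;
    costs=(i, d, s) and max_cost model a cost equation (inclusive bound,
    an assumption of this sketch).
    """
    cost_ins, cost_del, cost_sub = costs

    def within_limits(ins, dele, sub):
        if ins > max_ins or dele > max_del or sub > max_sub:
            return False
        if max_errors is not None and ins + dele + sub > max_errors:
            return False
        if max_cost is not None:
            total = cost_ins * ins + cost_del * dele + cost_sub * sub
            if total > max_cost:
                return False
        return True

    @lru_cache(maxsize=None)
    def walk(p, t, ins, dele, sub):
        if not within_limits(ins, dele, sub):
            return False
        if p == len(pattern) and t == len(text):
            return True
        if p < len(pattern) and t < len(text):
            if pattern[p] == text[t]:
                # Same character: no error needed.
                if walk(p + 1, t + 1, ins, dele, sub):
                    return True
            elif walk(p + 1, t + 1, ins, dele, sub + 1):
                # Substitution: one character swapped for another.
                return True
        # Insertion: the text has a character the pattern lacks.
        if t < len(text) and walk(p, t + 1, ins + 1, dele, sub):
            return True
        # Deletion: the pattern has a surplus character.
        if p < len(pattern) and walk(p + 1, t, ins, dele + 1, sub):
            return True
        return False

    return walk(0, 0, 0, 0, 0)


print(approx_literal_match("tt", "et", max_sub=1))             # (tt){#1} vs "et"
print(approx_literal_match("tt", "t", max_sub=1))              # (tt){#1} vs "t"
print(approx_literal_match("tt", "t", max_ins=1, max_del=1))   # (tt){+1-1} vs "t"
print(approx_literal_match("foobar", "goobaap",
                           max_ins=2, max_sub=2, max_errors=3))
```

With costs=(2, 1, 2) and max_cost=5, two substitutions (cost 4) are still accepted, while anything needing four or more substitutions is rejected no matter how generous max_sub is.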
Get it yourself for testing
pihole checkout ftl new/tre-regex
Switching to this version is safe (as in reversible without data-loss) from Pi-hole v5.0.