RegEx engine improvements

Hey all,

last weekend I came around to finish what I started working on a rather long time ago: Extending the RegEx engine. For compatibility reasons (and also for the sake of convenience), we stick with the extended Regular expressions (ERE) as we currently have. All your existing regular expressions will still work. There is also some new stuff:

Regex Test mode

In order to ease regex development, we added a regex test mode to pihole-FTL which can be invoked like

pihole-FTL regex-test doubleclick.net

(test doubleclick.net against all regexs in the gravity database), or

pihole-FTL regex-test doubleclick.net "(^|\.)double"

(test doubleclick.net against the CLI-provided regex (^|\.)double.

You do NOT need to be sudo for this, any arbitrary user should be able to run this command. The test returns 0 on match and 1 on no match and errors, hence, it may be used for scripting.

Comments

You can specify comments withing your regex using the syntax

(?#some comment here)

The comment can contain any characters except for a closing parenthesis ) (for the sole reason being the terminating element). The text in the comment is completely ignored by the regex parser and it used solely for readability purposes.

Try it yourself!
$ pihole-FTL regex-test "doubleclick.net" "(^|\.)doubleclick\.(?#TODO: We need to maybe support more than just .net here)net$"
FTL Regex test:
  Domain: "doubleclick.net"
   Regex: "(^|\.)doubleclick\.(?#TODO: We need to maybe support more than just .net here)net$"
Step 1: Compiling regex filter...
        Compiled regex filter in 0.167 msec
Step 2: Checking domain...
        Done in 0.032 msec

MATCH

Back-references

A back reference is a backslash followed by a single non-zero decimal digit d . It matches the same sequence of characters matched by the dth parenthesized subexpression.

Example:

"cat.foo.dog---cat%dog!foo" is matched by "(cat)\.(foo)\.(dog)---\1%\3!\2"

Another (more complex example is):

(1234|4321)\.(foo)\.(dog)--\1
   MATCH: 1234.foo.dog--1234
   MATCH: 4321.foo.dog--4321
NO MATCH: 1234.foo.dog--4321

Mind that the last line gives no match as \1 matches exactly the same sequence the first character group matched. And 4321 is not the same as 1234 even when both are valid replies for (1234|4321) Back references are not defined for POSIX EREs (for BREs they are, surprisingly enough). We add them to ERE in the BRE style.

Try it yourself!
$ pihole-FTL regex-test "someverylongandmaybecomplexthing.foo.dog--someverylongandmaybecomplexthing" "(someverylongandmaybecomplexthing|somelesscomplexitem)\.(foo)\.(dog)--\1"
FTL Regex test:
  Domain: "someverylongandmaybecomplexthing.foo.dog--someverylongandmaybecomplexthing"
   Regex: "(someverylongandmaybecomplexthing|somelesscomplexitem)\.(foo)\.(dog)--\1"
Step 1: Compiling regex filter...
        Compiled regex filter in 0.563 msec
Step 2: Checking domain...
        Done in 0.031 msec

MATCH

More character classes for bracket expressions

A bracket expression specifies a set of characters by enclosing a nonempty list of items in brackets. Normally anything matching any item in the list is matched. If the list begins with ^ the meaning is negated; any character matching no item in the list is matched.

  1. Multiple characters: [abc] matches a, b, and c.
  2. Character ranges: [0-9] matches any decimal digit.
  3. Character classes:
    - [:alnum:] alphanumeric characters
    - [:alpha:] alphabetic characters
    - [:blank:] blank characters
    - [:cntrl:] control characters
    - [:digit:] decimal digits (0 - 9)
    - [:graph:] all printable characters except space
    - [:lower:] lower-case letters (FTL matches case-insensitive by default)
    - [:print:] printable characters including space
    - [:punct:] printable characters not space or alphanumeric
    - [:space:] white-space characters
    - [:upper:] upper case letters (FTL matches case-insensitive by default)
    - [:xdigit:] hexadecimal digits

Furthermore, there are two shortcurts for some character classes:

  • \d - Digit character (equivalent to [[:digit:]])
  • \D - Non-digit character (equivalent to [^[:digit:]])

Approximative matching

I don't know if you know agrep. It is basically a "forgiving" grep. I use it a lot when searching through my (offline!) dictionaries. It is tolerant against errors (up to degree you specify). It may be beneficial is you want to match against domains where you don't really know the pattern. It is just an idea, we will have to see if it is actually useful.

This is a somewhat complicated topic, we'll approach it by examples as it is very complicated to get the head around it by just listening to the specifications.

The approximate matching settings for a subpattern can be changed by appending approx-settings to the subpattern. Limits for the number of errors can be set and an expression for specifying and limiting the costs can be given:

  1. Number of acceptable insertions (+)
    Use (something){+x} to specify that the regex should still be matching when x characters would need it be inserted into the sub-expression something:
    Example:
    "doubleclick.net" is matched by "^doubleclick\.(nt){+1}$"
    
    The missing e in nt is inserted.
    Similarly:
    "doubleclick.net" is matched by "^(doubleclk\.nt){+3}$"
    
    The missing characters in the domain are substituted. The maximum number of insertions spans the entire domain as is wrapped in the sub-expression (...)..
  2. Number of acceptable deletions (-)
    Use (something){-x} to specify that the regex should still be matching when x characters would need it be deleted from the sub-expression something:
    Example:
    "doubleclick.net" is matched by "^doubleclick\.(neet){-1}$"
    
    The surplus e in neet is deleted.
    Similarly:
    "doubleclick.net" is matched by "^(doubleclicky\.netty){-3}$"
    
    "doubleclick.net" is NOT matched by "^(doubleclicky\.nettfy){-3}$"
    
  3. Number of acceptable substitutions (#)
    Use (something){#x} to specify that the regex should still be matching when x characters would need to be substituted from the sub-expression something:
    Example 1:
    "oobargoobaploowap" is matched by "(foobar){#2~2}"
    Hint: "goobap" is "foobar" with two substitutions "f->g" and "r->p"
    
    Example 2:
    "doubleclick.net" is matched by "^doubleclick\.n(tt){#1}$"
    
    The incorrect t in ntt is substituted. Note that substitutions are necessary when a character needs to be replaced as the corresponding realization with one insertion and one deletion is not identical:
    "doubleclick.net" is matched by "^doubleclick\.n(tt){+1-1}$"
    
    (t is removed, e is added), however
    "doubleclick.nt" is ALSO matched by "^doubleclick\.n(tt){+1-1}$"
    
    (the t is just removed, nothing had to be added) but
    "doubleclick.nt" is NOT matched by "^doubleclick\.n(tt){#1}$"
    
    doesn't match as substitutions always require characters to be swapped by others.
  4. Combinations and total error limit (~)
    All rules from above can be combined like as {+2-5#6} allowing (up to!) two insertions, five deletions, and six substitutions. You can enforce an upper limit on the number of tried realizations using the tilde. Even when {+2-5#6} can lead to up to 13 operations being tried, this can be limited to (at most) seven tries using {+2-5#6~7}.
    Example:
    "oobargoobploowap" is matched by "(foobar){+2#2~3}"
    Hint: "goobaap" is "foobar" with
               - two substitutions "f->g" and "r->p", and
               - one addition "a" between "bar" (to have "baap")
    
    Specifying ~2 instead of ~3 will lead to no match as (at least) three errors need to be corrected in total for a match.
  5. Cost-equation: For experts (or crazy users) only!
    You can even weight the "costs" of insertions, deletions or substitutions. This is really an advanced topic and should only be touched when really needed.
    Cost-equation details (you have been warned!)

    A cost-equation can be thought of as a mathematical equation, where i, d, and s stand for the number of insertions, deletions, and substitutions, respectively. The equation can have a multiplier for each of i, d, and s.
    The multiplier is the cost of the error, and the number after < is the maximum allowed total cost of a match. Spaces and pluses can be inserted to make the equation more readable. When specifying only a cost equation, adding a space after the opening { is required .
    Example 1: { 2i + 1d + 2s < 5 }
    This sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five.
    Example 2: {+2-5#6, 2i + 1d + 2s < 5 }
    This sets the cost of an insertion to two, a deletion to one, a substitution to two, and the maximum cost to five. Furthermore, it allows only up to 2 insertions (coming at a total cost of 4), five deletions and up to 6 substitutions. As six substitutions would come at a cost of 6*2 = 12, exeeding the total allowed costs of 5, they cannot all be realized.

Get it yourself for testing

pihole checkout ftl new/tre-regex

Switching to this version is safe (as in reversible without data-loss) from Pi-hole v5.0.

5 Likes

Some more examples combining several elements from above for your enjoyment:

  1. Complex character sets
    "__abc#LMN012$x%yz789*" is matched by "[[:digit:]a-z#$%]+"
    "abc#lmn012%yz789--@*,abc" is matched by "[^[:digit:]#$%[:xdigit:]]+"
    
  2. Character classes
    "-0123456789ABCDEFabcdef" is matched by "[[:xdigit:]]+"
    "a~!@#$%^&*()_+=-`[]{};':\"|\\,./?>< " is matched by "[[:punct:]]+"
    
  3. Range expressions
    "!ABC-./XYZ~" is matched by "[--Z]+"
    
  4. Back referencing test
    "abc.abc" is matched by "([a-z]*)\.\1"
    "aabc" is matched by "(a)\1{1,2}"
    
    "foo" is matched by "(.)\1$"
    "foox" is NOT matched by "(.)\1$" (does not end with 2x the same character)
    
    "1234512345" is matched by "([0-9]{5})\1"
    "12345" is NOT matched by "([0-9]{5})\1" (no repetition for the back-reference)
    
  5. Approximative matching
    "foobarzap" is matched by "foo(bar){~1}zap"
    "foobrzap" is matched by "foo(bar){~1}zap"
    "foxbarzap" is NOT matched by "foo(bar){~1}zap" (error outside of the fault-tolerant regime `(bar)`)
    
    "foobar" is matched by "^(foobar){~1}$" (there is no error here)
    "cfoobar" is matched by "^(foobar){~1}$" (there is one error to be fixed here)
    "ccfoobar" is NOT matched by "^(foobar){~1}$" (at most one error is allowed)
    
    "xirefoabrzlfd" is matched by "(foobar){~2}" ("foobar" can be made "foabr" by substituting "o->a" and deleting "a")
    "xirefoabzlfd" is NOT matched by "(foobar){~2}" (at most two errors allowed)
    
    "oobargoobaploowap" is matched by "(foobar){+2#2~2}" (at most two inserts or substitutions and max. two errors in total)
    
    "molasses anaconda foo bar baz smith anderson" is matched by "(znacnda){ #1 ~3 1i + 1d < 2 }"
    "molasses anaconda foo bar baz smith anderson" is NOT matched by "(znacnda){ #1 ~3 1i + 1d < 1 }" (fixing is more costly than 1)
    
    "3oifaowefbaoraofuiebofasebfaobfaorfeoaro" is matched by "(foobar){+1 -2 #3, 2d + 1s < 4}" (At most one insert, two deletes, and three substitutions.
                                                                                                Additionally, deletes cost two and substitutes one, and
                                                                                                total cost must be less than 4.)
    
    

Yes, it is still there with double-\ as I copied it from bash. Fixed above.

(first lines of your initial post) should be

pihole-FTL regex-test doubleclick.net

see pihole-FTL --help...

The output is now MATCH or NO MATCH

Would it be possible to change the output, to match the output of pihole -q -exact, e.g.

pihole -q -exact facebook.com
 Exact match found in regex whitelist
   (\.|^)facebook\.com$
   ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$
 Exact match found in regex blacklist
   (\.|^)facebook\.com$
   ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$

edit
already using the database schema, that allows duplicate whitelist / blacklist entries, hence, the duplicate entries in the above output of pihole -q
/edit

I tried it and it works. Approximative matching is an amazing addition!

Image your kids (or you yourself) mistype google.com as gogle.com or similar - you will still see the search page. With (google){~2} I am now finally able to catch this. It is just one (maybe less practical) example, there are many sites more doing things like this. I'm sure I will use it when it is stable released.

Thank you very much!

I'm wondering about the remaining 0.5%. What would you need in addition (maybe I could need this too)?

Yes and No. The regex matching mechanism in FTL is coded differently: As soon as one regex is matching, the comparison is immediately stopped (whitelist is checked first). We don't need to know if multiple regex matched because the first match already tells us what to do. Changing this will not be easy to do. We can easily print which regex caused the current match but we cannot easily list all matches.

I'll look into this.

pihole-FTL regex-test "fbcdn.net"

Screenshot at 2020-06-23 19-48-23

edit I also improved the scripting behavior by defining three possible return types (without looking at the output at all):

  • 0 = match
  • 1 = error (cannot read gravity database, specified regex contains error, etc.)
  • 2 = no match
1 Like

great improvement first match for whitelist and blacklist

pihole-FTL regex-test facebook.com

  • ignores disabled entries
  • wildcard entries first, example: (.|^)facebook.com$
  • more complex entries last, example ^(.+.)?(facebook|fb(cdn|sbx)?|tfbnw).[^.]+$

even if the database id of the complex regex < wildcard id
correct?

edit
additional question:
Is there a website where you can test and/or visualize a regex with approximative matching (searched, but didn't find...)? neither regex101, nor regexper appear to understand these regexes (tested with "^doubleclick\.(nt){+1}$" (from the above explanation).
/edit

I don't think so. I guess this is why they added a regex tester for this.

Yes

No. We do not put any filtering/sorting on the regex query. They are read in the order they are stored in the database (this does not necessarily correlate with the ID).

pihole -q doesn't find any matches for typo's (Approximative matching)

pi@raspberrypi:~ $ pihole-FTL regex-test doubleclk.nt
[i] Loading regex filters from database...
      Compiled 36 black- and 2 whitelist regex filters in 14.998 msec

[i] Checking domain against blacklist...
      Time: 0.271 msec
[✓] "doubleclk.nt" matched by "^(doubleclk\.nt){+3}$"

[i] Checking domain against whitelist...
      Time: 0.096 msec
[✗] No result for "doubleclk.nt"

pi@raspberrypi:~ $ pihole -q  doubleclk.nt
  [i] No results found for doubleclk.nt within the block lists
pi@raspberrypi:~ $ pihole -q  -all doubleclk.nt
  [i] No results found for doubleclk.nt within the block lists
pi@raspberrypi:~ $ pihole -q -all -exact doubleclk.nt
  [i] No exact results found for doubleclk.nt within the block lists

Will this be changed (correct output for pihole -q)?

edit
possible solution, change /opt/pihole/query.sh
original (line 41):

"regex" ) 
    for list in ${lists}; do
        if [[ "${domain}" =~ ${list} ]]; then
            printf "%b\n" "${list}";
        fi
    done;;

modified:

"regex" ) 
    for list in ${lists}; do
        if [[ "${domain}" =~ ${list} ]]; then
            printf "%b\n" "${list}";
        else
            result=$(echo "${domain}" | agrep "${list}")
            if [[ ! -z ${result} ]]; then
                printf "%b\n" "${list}";
            fi
        fi
    done;;

result (example - test environment - using database schema that allows duplicate domainlist entries):

pihole -q -all -exact facebook.com
 Exact match found in regex whitelist
   (\.|^)facebook\.com$
   ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$
 Exact match found in regex blacklist
   ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$
   ^(facbok\.cm){+3}$
 Exact match for facebook.com found in:
  - file:///home/pi/duckduckgo/categorized_trackers

The problem might be agrep, which is not installed by default and doesn't even exist on Raspberry Pi OS (previously called Raspbian).

What I did to overcome this problem on Raspberry Pi OS, to be able to use agrep:

sudo apt-get install tre-agrep
sudo ln -s /usr/bin/tre-agrep /bin/agrep

the installer needs to be changed for this, finding a suitable agrep for all supported operating systems, this is way above my skillset.
/edit

Yes, thanks for the suggestion of installing agrep, however, the Pi-hole internal regex is not exactly identical with tre-agrep. For now, this difference does only manifest in a ignoring case by default, but we may implement Pi-hole specific regex extensions at any point. It's probably better to simplify the FTL output and directly use the regex-test for this as this guarantees that the regex engine actually being used is asked here.

from tre-agrep --help (ignore case requires -i)

Regexp selection and interpretation:
  -e, --regexp=PATTERN      use PATTERN as a regular expression
  -i, --ignore-case         ignore case distinctions
  -k, --literal             PATTERN is a literal string
  -w, --word-regexp         force PATTERN to match only whole words

or just use the exit codes (see your post above), something like:

"regex" ) 
    for list in ${lists}; do
        if [[ "${domain}" =~ ${list} ]]; then
            printf "%b\n" "${list}";
        else
            if /usr/bin/pihole-FTL regex-test "${domain}" "${list}" >/dev/null 2>&1; then
                printf "%b\n" "${list}";
            fi
        fi
    done;;

I just added a "quiet" regex-test mode which can directly be used within pihole -q

pihole-FTL -q regex-test "fbcdn.net"

results in

    (\.|^)fbcdn\.net$ matches (regex blacklist, DB ID 83)

(the matching regex is in bold when this is a terminal)

Does this solve the problem for query.sh?

pihole-FTL -q regex-test facebook.com
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$ matches (regex blacklist, DB ID 59)
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$ matches (regex whitelist, DB ID 93)

only shows the first match

 pihole-FTL -q regex-test facebook.com "^(facbok\.cm){+3}$"
    ^(facbok\.cm){+3}$ matches

validates the domain name versus a given regex, the output still doesn't blend in the output of pihole -q (extra word matches)

In order to use pihole-FTL in pihole -q, I think it's best to use the exit codes (see example in earlier post) and let the pihole-FTL -q option simply suppress all output, which would eliminate the need for redirection (>/dev/null 2>&1). This would be the identical to for example grep -q ( -q, --quiet, --silent suppress all normal output).

Doing it like this still pushes the regex through bash where it may be subject to exploiting things. I'd prefer a solution where FTL directly reads from the database to avoid this.

Blend in as comes as soon as available? Is this really a needed feature?

I agree on this.

No, it doesn't.

Screenshot at 2020-06-29 21-28-52

pihole-FTL -q regex-test "aaabbbccc"
    aaa matches (regex blacklist, DB ID 90)
    bbb matches (regex blacklist, DB ID 91)
    ccc matches (regex blacklist, DB ID 92)

Validating something against a single user-provided regex was never possible in pihole -q. So I don't really see what you want to say here.

I assume pihole -q in version 5.2 will also list regexes with approximative matching. For that to happen, /opt/pihole/query.sh needs to be modified, the pihole v5.0 version doesn't display these. All I try to achieve is, to ensure the output of pihole -q is properly (visually) formatted (eye candy).

when using this (the pihole-FTL -q solution) in /opt/pihole/query.sh

"regex" ) 
    for list in ${lists}; do
        if [[ "${domain}" =~ ${list} ]]; then
            printf "%b\n" "${list}";
        else
            /usr/bin/pihole-FTL -q regex-test "${domain}" "${list}"
        fi
    done;;

the output looks like this (extra indentation before the regex with Approximative matching):

image

When using this (using the pihole-FTL exit code) in /opt/pihole/query.sh

"regex" ) 
    for list in ${lists}; do
        if [[ "${domain}" =~ ${list} ]]; then
            printf "%b\n" "${list}";
        else
            if /usr/bin/pihole-FTL regex-test "${domain}" "${list}" >/dev/null 2>&1; then
                printf "%b\n" "${list}";
            fi
        fi
    done;;

the output is properly aligned:

image

if the -q option from pihole-FTL would behave as the -q option from grep, the second example could be one without >/dev/null 2>&1

enough said on the subject, do as you please, I don't want to waist anymore of your time.

Okay, I'm convinced. Please by forgiving with me when it appears I'm not picking up an idea. It doesn't always mean I don't like it. There is just too much stuff going on at the same time...

I pushed a change that should make the -q really quiet. I intentionally left one exception in there, that is regex errors where the error message is still logged on the terminal.

You may even want to leave the bash ~= completely out of this:

"regex" ) 
    for list in ${lists}; do
        if /usr/bin/pihole-FTL -q regex-test "${domain}" "${list}"; then
            printf "%b\n" "${list}";
        fi
    done;;

to ensure we don't get false-negatives. The regex-test should be fairly fast.

Thanks for the suggestion, I should of seen this myself, always good to have an expert opinion...
Your latest solution works like a charm, resulting output is perfect,..

Thanks again for your time and effort