Is there a way to "weight" a blacklist item over a regex/wildcard whitelist item?

cecoates · January 21, 2020, 7:23pm

Problem with Beta 5.0:
Being able to whitelist with wildcards is excellent. But I'm having trouble plotting out how to re-work my blacklist regexes.

Examples:

I have a few blacklist regex for things like blocking subdomains with "logs" in them. Entries such as:

^logs?\..*\..*$

However, in the whitelist I've now added a wildcard for "roku.com".

(\.|^)roku\.com$

So that means domains like:

austin.logs.roku.com

Now get through.

So if I have a blacklist regex for:

^.*metric.*\..*\..*$

BUT, I've whitelisted plex.tv as a wildcard:

(\.|^)plex\.tv$

How would I still block a domain like metrics.plex.tv?
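To make the overlap concrete, here is a minimal sketch that runs the patterns from this post against the example domains. It uses Python's `re` module as a stand-in, which is an assumption: Pi-hole's own engine may differ, although these particular patterns only use basic syntax (anchors, groups, alternation, escaped dots). The `hits` helper is made up for illustration.

```python
import re

def hits(pattern: str, domain: str) -> bool:
    """True if the regex matches somewhere in the domain (unanchored search)."""
    return re.search(pattern, domain) is not None

# Patterns and domains copied from the post above.
checks = [
    ("whitelist", r"(\.|^)roku\.com$", "austin.logs.roku.com"),
    ("blacklist", r"^logs?\..*\..*$", "austin.logs.roku.com"),
    ("blacklist", r"^logs?\..*\..*$", "logs.roku.com"),
    ("whitelist", r"(\.|^)plex\.tv$", "metrics.plex.tv"),
    ("blacklist", r"^.*metric.*\..*\..*$", "metrics.plex.tv"),
]

for kind, pattern, domain in checks:
    print(f"{kind:9} {pattern:24} {domain:22} {hits(pattern, domain)}")
```

In this sketch metrics.plex.tv is matched by both the whitelist wildcard and the blacklist regex, which is exactly the conflict in question. Note that `^logs?\..*\..*$` is anchored at the start, so on its own it matches logs.roku.com but not austin.logs.roku.com.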

Intuitively I was thinking the most "specific" item would win. So if I have a whitelist wildcard, but then I specifically blacklist metrics.plex.tv, it would still get blocked.

Or, if I have a regex set to blacklist "logs" subdomains, I could then specifically whitelist logs.roku.com if I wanted to (not that I would). Granted, THAT works because of the logic laid out here:

Exact Whitelist
Regex Whitelist
Exact Blacklist
Blocklist domains (AKA gravity)
Regex Blacklist
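The order above can be read as "first hit decides". A rough sketch of that reading follows, assuming Python's `re` as a stand-in for the real engine. The `Lists` class and `decide` function are hypothetical names for illustration only, not how Pi-hole is actually implemented.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Lists:
    # Hypothetical container; field names mirror the five list types above.
    exact_white: set = field(default_factory=set)
    regex_white: list = field(default_factory=list)
    exact_black: set = field(default_factory=set)
    gravity: set = field(default_factory=set)
    regex_black: list = field(default_factory=list)

def decide(domain: str, lists: Lists) -> str:
    """Walk the lists in the quoted order; the first hit decides."""
    if domain in lists.exact_white:
        return "allowed (exact whitelist)"
    if any(re.search(p, domain) for p in lists.regex_white):
        return "allowed (regex whitelist)"
    if domain in lists.exact_black:
        return "blocked (exact blacklist)"
    if domain in lists.gravity:
        return "blocked (gravity)"
    if any(re.search(p, domain) for p in lists.regex_black):
        return "blocked (regex blacklist)"
    return "allowed (no match)"

lists = Lists(
    regex_white=[r"(\.|^)plex\.tv$"],
    exact_black={"metrics.plex.tv"},
    regex_black=[r"^.*metric.*\..*\..*$"],
)
print(decide("metrics.plex.tv", lists))      # regex whitelist is checked before the exact blacklist
print(decide("metrics.example.com", lists))  # only the regex blacklist matches
```

Under this ordering an exact blacklist entry for metrics.plex.tv never gets a chance, because the regex whitelist is consulted first.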

But now I'm not sure what to do in certain situations like the above.

So tl;dr in my mind the hierarchy would be:

Exact Whitelist
Exact Blacklist
Regex Whitelist
Blocklist domains (AKA gravity)
Regex Blacklist

I get why y'all would choose differently, but I'm curious if that means I should just limit my use of regex/wildcard whitelisting, even if that's more labor intensive.

Example: adding only specific whitelisted plex.tv domains, so that I can still block metrics.plex.tv, although that isn't as convenient.

Debug Token:
https://tricorder.pi-hole.net/fz73yizi76

DL6ER · January 21, 2020, 7:36pm

We went back and forth on this and discussed with several users and came to the conclusion that the simplest and most obvious implementation would be: The whitelist always wins.

Your proposed hierarchy is not all that intuitive when you think about it again. In particular, intermixing the blacklist and whitelist components seems to make things rather complicated to conceptualize. The simple rule "the whitelist always wins" is always understandable.

You'll have to specify less generic whitelist regex filters. I know that this will make them somewhat longer, but, in the end, I think we benefit more from a simpler implementation.
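As a sketch of what a less generic whitelist filter might look like: the subdomain names `app` and `www` below are placeholders, not a list of real Plex hosts, and Python's `re` is again only a stand-in for Pi-hole's engine. The narrow pattern sticks to anchors, a group, and alternation.

```python
import re

BROAD_WHITELIST = r"(\.|^)plex\.tv$"        # wildcard from the original post
NARROW_WHITELIST = r"^(app|www)\.plex\.tv$"  # hypothetical: placeholder subdomains only

for domain in ("app.plex.tv", "metrics.plex.tv"):
    broad = re.search(BROAD_WHITELIST, domain) is not None
    narrow = re.search(NARROW_WHITELIST, domain) is not None
    print(f"{domain:16} broad={broad} narrow={narrow}")
```

With the narrow pattern, metrics.plex.tv is no longer whitelisted, so a blacklist entry for it can take effect.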

cecoates:

I have a few blacklist regex for things like blocking subdomains with “logs” in them. Entries such as:
^logs?\..*\..*$
However, in the whitelist I’ve now added a wildcard for “roku.com”.
(\.|^)roku\.com$
So that means domains like:
austin.logs.roku.com
Now get through.

It's rather difficult for Pi-hole to interpret what you want: You do not want logs..., however, you obviously also want everything from roku.com because you explicitly added it to your whitelist...

cecoates · January 21, 2020, 11:26pm

Fair enough! I'll remember to be less liberal with wildcard/regex whitelisting.

That was a bad example, I just did that as a test to see who would "win".

A more practical example is Plex. Plex has so many silly subdomains and domain extensions that being able to whitelist it with a regex would be nice. But! Then if I still want to block metrics.plex.tv, the whitelist entry would override it.

But your answer was very clear, so thanks! Still like the wildcarding overall.

cecoates · January 21, 2020, 11:44pm

Since y'all were so kind as to implement regex, what if I tried a "negative lookahead" on the whitelist. Like:

^(?!.*metric).*plex\..*$

It only just occurred to me as I was looking over the regex/wildcards.

If I understand:

Then that says "match everything that has plex followed by a dot, EXCEPT if it contains the word metric".

Or, stated more clearly than I can explain via regex101.com:

^ asserts position at start of a line
Negative Lookahead (?!.*metric): assert that the regex below does not match
  .* matches any character (except for line terminators)
    * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
  metric matches the characters metric literally (case sensitive)
.* matches any character (except for line terminators)
  * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
plex matches the characters plex literally (case sensitive)
\. matches the character . literally (case sensitive)
.* matches any character (except for line terminators)
  * Quantifier: matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line

I'm going to give that a try, just for kicks. Regex really is amazing although I wish it didn't make my head hurt so bad.

jfb · January 22, 2020, 12:20am

I don't believe the flavor of regex used on Pi-hole (POSIX ERE) supports negative lookaheads.
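For comparison, this is how the lookahead pattern behaves in an engine that does support lookaheads, here Python's `re`. It only illustrates what the pattern means where the syntax is available and says nothing about Pi-hole's engine. The test domains are made up.

```python
import re

# Pattern from the post above; (?!...) is a negative lookahead.
LOOKAHEAD = re.compile(r"^(?!.*metric).*plex\..*$")

for domain in ("app.plex.tv", "metrics.plex.tv", "plex.tv", "plex.example.com"):
    print(f"{domain:18} {LOOKAHEAD.search(domain) is not None}")
```

Even where it works, the pattern is broader than plex.tv: it matches any name containing "plex." that lacks "metric", including plex.example.com.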

cecoates · January 22, 2020, 12:25am

Ah, I see! Thanks for clarifying.

DL6ER · January 22, 2020, 6:06am

I was just about to say what @jfb already said.

I can understand that; you need to wrap your head around it. And the fact that there are different dialects out there does not improve the situation. We decided to implement ERE (Extended Regular Expressions) as they are very powerful whilst very efficient*. Furthermore, this is the regex flavor also used by some well-known applications such as grep.

I compiled an overview of what is possible and how to use it here: Redirecting...

There's a cheatsheet at the bottom of this page which lists everything the ERE dialect can do.

*) Implementations allowing special groups like jump operations, negative/positive lookaheads and lookbehinds, conditionals, etc. are typically much bulkier and much slower. For Pi-hole we chose an implementation which is both powerful (even if it cannot do everything) and fast. Given that some users use up to hundreds of regex filters and the issue that all of them need to be evaluated when an unknown domain is queried ... you get the idea.

cecoates · January 26, 2020, 5:18pm

Aha, I get it. Thanks for the explanation and link.