Wildcard and regex support for whitelisting

Rick_V · November 19, 2018, 10:33pm

Hello developers!

I'd like the ability to use wildcard (and regex) on the whitelist.

The reason being, our office is pretty much "all in" on Microsoft 365, SharePoint, etc. Basically, "all Microsoft, all the time". Yet several of the blacklists I subscribe to block various Microsoft servers, causing random issues with logging in and so on. While I understand this may not be everyone's cup of tea, it is what it is for us, and it would be great if we could wildcard-whitelist .microsoft.com once and for all.

Conversely, we're mainly Mac users at home, and I see on some blocklists things like ocsp.apple.com listed, as well as other Apple servers. I have no clue why anyone would think it a good idea to block an OCSP server! But same thing: at home I'd like to whitelist everything under .apple.com.

-Rick

jfb · November 19, 2018, 11:01pm

In the short term, you could reduce your false positives with fewer block lists. One of the problems with public block lists is that you have no control over the content, and what the list maintainer wants to block may not be what you want to block (as in the case of ocsp.apple.com, which I have also had to whitelist).

You might try just a few lists (or no lists) and set up some regex filters that you control. Here are some examples that will knock down a lot of the adware, metrics, etc.

^(.+[-_.])??ad[sxv]?[0-9]*[-_.]
^adim(age|g)s?[0-9]*[-_.]
^adse?rv(e(rs?)?|ices?)?[0-9]*[-.]
^adtrack(er|ing)?[0-9]*[-.]
^advert(s|is(ing|ements?))?[0-9]*[-_.]
^aff(iliat(es?|ion))?[-.]
^analytics?[-.]
^banners?[-.]
^beacons?[0-9]*[-.]
^clicks?[-.]
^count(ers?)?[0-9]*[-.]
^pixels?[-.]
^stat(s|istics)?[0-9]*[-.]
^telemetry[-.]
^track(ers?|ing)?[0-9]*[-.]
^traff(ic)?[-.]
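
To get a feel for what these filters catch, here is a minimal Python sketch that tries a handful of them against some made-up hostnames. It assumes Python's `re` module behaves like Pi-hole's regex engine for these particular patterns, which may not hold in every detail; the hostnames and the `is_blocked` helper are hypothetical.

```python
import re

# A few patterns copied verbatim from the list above.
PATTERNS = [
    r"^adse?rv(e(rs?)?|ices?)?[0-9]*[-.]",
    r"^analytics?[-.]",
    r"^telemetry[-.]",
    r"^track(ers?|ing)?[0-9]*[-.]",
]
COMPILED = [re.compile(p) for p in PATTERNS]


def is_blocked(domain: str) -> bool:
    """Return True if any pattern matches at the start of the domain."""
    return any(rx.search(domain) for rx in COMPILED)


# Hypothetical hostnames, for illustration only.
SAMPLES = [
    "adserver.example.com",
    "tracking2.example.net",
    "telemetry.example.org",
    "trackpad.example.com",
    "www.example.com",
]

for name in SAMPLES:
    print(f"{name:28} blocked={is_blocked(name)}")
```

Because every pattern is anchored with `^` and ends in a separator class such as `[-.]`, only the first label of the hostname is inspected; something like `trackpad.example.com` is not caught by the `track...` filter.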

Rick_V · December 6, 2018, 3:45am

Hey jfb,

Just a quick shout-out thanks for this list of regex filters! I've never really been able to wrap my head around regex, so this list is great! Thanks much!

Someone should consider adding this list as "examples" on the regex section of the documentation!

-Rick

jfb · December 6, 2018, 4:02am

I won't take credit for these. They are on github, but for the life of me I can't recall where. When I find the link, I'll post it.

Edit - thanks @anon55913113, that was the link.

OscarN · February 16, 2019, 2:38pm

Regex whitelisting would be a really nice feature. Right now it is possible to set .* as a blacklist regex and deny everything, but it is really cumbersome to add every subdomain that should be allowed instead of just adding *.microsoft.com, for example.
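
To make the idea concrete, here is a rough Python sketch of how a wildcard entry such as `*.microsoft.com` could be translated into a regex that matches the domain and all of its subdomains. This is one plausible translation, not a description of what Pi-hole does internally; `wildcard_to_regex` is a hypothetical helper.

```python
import re


def wildcard_to_regex(wildcard: str) -> str:
    """Turn '*.example.com' into a regex for the domain and its subdomains."""
    base = wildcard[2:] if wildcard.startswith("*.") else wildcard
    # (^|\.) requires a label boundary in front of the base domain,
    # so 'notmicrosoft.com' does not slip through.
    return r"(^|\.)" + re.escape(base) + "$"


pattern = wildcard_to_regex("*.microsoft.com")
allow = re.compile(pattern)

# Hypothetical queries, for illustration only.
for name in [
    "microsoft.com",
    "login.microsoft.com",
    "notmicrosoft.com",
    "microsoft.com.example.net",
]:
    print(f"{name:28} allowed={bool(allow.search(name))}")
```

The leading `(^|\.)` and the trailing `$` are what keep look-alike names from being whitelisted by accident.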

OscarN · February 17, 2019, 8:18pm

I'm not sure I'm following what you're saying, but the web GUI has 'exact', 'wildcard' (which is also a regex) and 'regexp' for blacklisting, but only 'exact' for whitelisting. What I would like, and what I'm guessing this feature request is about, is to have regex for whitelisting as well.

Thanks for fantastic software, to any developer who's reading.

OscarN · February 17, 2019, 8:39pm

ah ok, that's nice. Care to share? Or even better submit to the project?

Rick_V · February 18, 2019, 1:26am

I would (still) welcome wildcard whitelisting incorporated into pi-hole.

We are a Microsoft 365 Business shop (for better or worse), and literally just last week I had to whitelist yet another Microsoft subdomain after my users couldn't log in to Skype for Business because some blocklist decided it needed blocking. Grrrr....

smoser · March 21, 2019, 1:26am

@Rick_v,

For what it's worth, Microsoft publishes its domains and IP addresses at https://docs.microsoft.com/en-us/office365/enterprise/urls-and-ip-address-ranges, including machine-readable JSON linked from there.
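
As a rough sketch of how that JSON could be turned into a list of whitelist candidates, the Python below assumes the payload is a list of objects, some of which carry a `urls` array of hostnames. That shape is an assumption on my part, and the sample data is made up rather than real Microsoft output; `extract_domains` is a hypothetical helper.

```python
import json
from typing import List

# Made-up sample in the ASSUMED shape: a list of objects,
# some with a "urls" array and some without.
SAMPLE = """
[
  {"id": 1, "urls": ["mail.example.com", "*.mail.example.com"]},
  {"id": 2, "ips": ["192.0.2.0/24"]},
  {"id": 3, "urls": ["*.mail.example.com", "login.example.com"]}
]
"""


def extract_domains(payload: str) -> List[str]:
    """Collect the unique entries of every "urls" array, sorted."""
    domains = set()
    for entry in json.loads(payload):
        # Entries without a "urls" key (IP-only records) are skipped.
        domains.update(entry.get("urls", []))
    return sorted(domains)


for domain in extract_domains(SAMPLE):
    print(domain)
```

Note that wildcard entries such as `*.mail.example.com` would still need regex or wildcard whitelist support to be usable as they are.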

Rick_V · March 24, 2019, 2:41pm

@smoser,

Wow, this is tremendously helpful! Thank you so much!

jult · May 27, 2019, 9:35am

Me too. I was very surprised it wasn't there, since Pi-hole allowed me to put
*.somedomain.sometld
in whitelist.txt and didn't error out or anything.

jult · May 27, 2019, 9:39am

Is there an easy way to reload dnsmasq from this script from outside of a Docker container? Otherwise I would have to reload the list manually every time it updates, and that's not going to work.
Also, it would be nice if Pi-hole had an 'execute after' option for such scripts: as you write, a hook during import, or something that runs after all updates, just before the reload.

jfb · June 2, 2019, 4:45am

2 posts were split to a new topic: Entering multiple regex at one time

DL6ER · July 7, 2019, 10:19am

Revisiting the original request: Whitelist regex support.

It is technically possible, however, I will tell you why I don't think it is a good idea:

Regex filter evaluation is always a sequential (and hence slow) task. You have to try all of the filters before you know that none of them matched. This is exactly why we split the blacklist into an "exact" and a "regex" component. The "exact" component is loaded into the cache and can be replied to with close to no delay at all. Walking the chain of regex filters is, however, much slower.
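
The difference can be sketched in a few lines of Python. This is only an illustration of the principle, not FTL's actual code; the domain names, the two patterns, and the `check` helper are made up for the example.

```python
import re

# "Exact" component: a hash set, one lookup regardless of size.
EXACT_BLOCKED = {"ads.example.com", "tracker.example.net"}

# "Regex" component: must be walked one filter at a time.
REGEX_BLOCKED = [
    re.compile(r"^analytics?[-.]"),
    re.compile(r"^telemetry[-.]"),
]


def check(domain: str):
    """Return (verdict, number of regex filters that had to be tried)."""
    if domain in EXACT_BLOCKED:
        return "exact", 0
    tried = 0
    for rx in REGEX_BLOCKED:
        tried += 1
        if rx.search(domain):
            return "regex", tried
    # Only after ALL filters failed do we know the domain is allowed.
    return "allowed", tried


for name in ["ads.example.com", "telemetry.example.org", "www.example.com"]:
    print(name, check(name))
```

The worst case is the ordinary, non-blocked domain: it is the one that forces every single filter to be evaluated.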

The implementation could be made in two ways:

1. Only use a regex-based whitelist: very bad performance if you have many whitelisted domains.
   This is to be avoided, as Pi-hole v5.0 will introduce support for massive whitelists, using an implementation strategy that still returns the result of a query with a typical delay of < 4 msec even if you have millions of domains on the whitelist.
2. Add a regex-based whitelist next to the already existing whitelist: increased complexity for the users.
   This is to be avoided as well, as it would introduce a severe slowdown of the blocked-domains preparation (AKA "gravity"). Instead of only excluding the whitelisted domains (which is very efficient), we'd need to evaluate all whitelist regex filters against every one of the (possibly up to millions of) domains on the blocking lists. This would result in a catastrophic slowdown, maybe causing gravity to take hours instead of tens of seconds on Raspberry Pi devices. This is unacceptable.
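
To illustrate why the second variant scales so badly, the sketch below contrasts the two pruning strategies in Python. It is a toy model, not the gravity script; the domain list, the two patterns, and both helper functions are hypothetical.

```python
import re


def prune_exact(gravity: set, whitelist: set) -> set:
    """Exact whitelist: a single set difference."""
    return gravity - whitelist


def prune_regex(gravity: set, patterns: list):
    """Regex whitelist: every pattern may be tried against every domain."""
    compiled = [re.compile(p) for p in patterns]
    evaluations = 0
    kept = set()
    for domain in gravity:
        matched = False
        for rx in compiled:
            evaluations += 1
            if rx.search(domain):
                matched = True
                break
        if not matched:
            kept.add(domain)
    return kept, evaluations


# 1000 made-up blocked domains plus one that should be whitelisted.
gravity = {f"host{i}.example.com" for i in range(1000)} | {"ocsp.apple.com"}
patterns = [r"(^|\.)microsoft\.com$", r"(^|\.)apple\.com$"]

kept, evaluations = prune_regex(gravity, patterns)
print(len(kept), "domains kept after", evaluations, "regex evaluations")
```

The number of regex evaluations grows roughly as (blocked domains) x (whitelist filters), so a list with millions of entries and a few dozen filters quickly adds up on a Raspberry Pi.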

DL6ER · July 7, 2019, 10:48am

As you know, there is almost nothing I will not discuss. If I can be convinced of the contrary, I have no problem accepting that I may have been wrong.

Can you give some more details about this so we can understand the performance impact?

What kind of device are you running Pi-hole on?
How many blocked domains do you source without whitelist regex filtering?
How many whitelist regex entries do you use?
How complex are they? (This is asking for a subjective feeling: .* is not very complex, but if you use () or | rather often, then the regex is much more complex when compiled to byte code.)
How many blocked domains are left after your whitelist procedure is done with them?
How long does the script take only for the whitelist processing?

DL6ER · July 7, 2019, 8:05pm

Hmm, okay I will continue to think about this, however, we made recent optimizations for v5.0 which make implementing it in the way you did (a bit) harder.

As far as I see, your implementation also only cleans the gravity list, right? If a user (intentionally or not) blocked something in addition on the blacklist, you don't delete the line for him, right? If so, this would be rather unexpected behavior I'd say.

Instead of one big gravity run that needs to be repeated for each whitelist modification, we now use table views. With this, it is sufficient to send SIGHUP to pihole-FTL for the modified whitelist to become active. There is no call to pihole -g any more. However, this also means that edits to such a regex-based whitelist would again require us to run gravity each time, as live filtering when loading the lists may be too slow.
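
The table-view idea can be sketched with Python's built-in `sqlite3` module. The table and view names below are made up for the illustration and are not meant to reflect Pi-hole's actual database schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE gravity   (domain TEXT PRIMARY KEY);
    CREATE TABLE whitelist (domain TEXT PRIMARY KEY);

    -- The view is evaluated when it is read, so it always reflects
    -- the current whitelist without rebuilding the gravity table.
    CREATE VIEW effective_blocklist AS
        SELECT domain FROM gravity
        WHERE domain NOT IN (SELECT domain FROM whitelist);
    """
)
con.executemany(
    "INSERT INTO gravity VALUES (?)",
    [("ads.example.com",), ("ocsp.apple.com",)],
)


def blocked() -> list:
    """Read the current effective blocklist from the view."""
    rows = con.execute("SELECT domain FROM effective_blocklist ORDER BY domain")
    return [row[0] for row in rows]


before = blocked()
con.execute("INSERT INTO whitelist VALUES (?)", ("ocsp.apple.com",))
after = blocked()

print("before:", before)
print("after: ", after)
```

An exact-match whitelist fits this model well because the exclusion is a plain lookup; a regex whitelist has no equally cheap equivalent inside such a view, which is the difficulty described above.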

Can you send me the regex filters you're using so I can have a realistic set of filters for performance measurements when I come around doing a testing implementation?
Please do not see this as a guarantee that I will do it anytime soon, nor that this will become part of Pi-hole if it turns out to be either too slow or too complex.

DL6ER · July 7, 2019, 10:08pm

Quicker than expected, we now already have a development draft that is handling regex whitelists consistently and completely inside of FTL and are currently working out the performance impact.

I've seen the blacklist regex domains on @mmotti's GitHub project but I'm not sure I have seen the mentioned whitelist regex before.
Note that this feature request is about whitelisting, not about trimming down the number of gravity domains because they are already partially covered by regex filters.

jfb · July 25, 2019, 12:05pm

A post was split to a new topic: DNS configuration for wildcard whitelisting

DL6ER · September 4, 2019, 7:22pm

How about the Pi-hole Teleporter feature? Note that it is not only available on the dashboard but also through the CLI ( pihole -a -t creates a pi-hole-teleporter file in the current working directory ).

DL6ER · September 5, 2019, 5:11pm

I agree.

https://github.com/pi-hole/AdminLTE/pull/1001