Regex: Block all google.tld sites except google.com

Hello everyone, not only is this my first post but I am also a first time Raspberry Pi 5 owner as of 12/27/23. Thank you for your consideration in advance.

I wrote the following regex, which seems to work in Notepad++ but doesn't give me the desired results in Pi-hole Domain Management.

^(?:[a-zA-Z0-9-]+\.)*\bgoogle\.(?!com\b)\w+

The type is set to "Regex blacklist" and I gave it the comment: google.com Permit / Deny All Others

I want to block everything in this list except for the first two items, "google.com" and "www.google.com":

google.com
www.google.com
google.net
google.ca
sub1.google.ca
sub1.sub2.google.ca
sub1.sub2.sub3.google.ca

I hope someone can point out where I went wrong with this regex.
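
One possible explanation, offered as a sketch rather than a diagnosis: the pattern relies on a negative lookahead, (?!...), which is a Perl-style construct and not part of POSIX ERE. As far as I know, Pi-hole's documentation describes its regex flavor as POSIX ERE-based, so a pattern that works in Notepad++ may not carry over. The Python snippet below only confirms that the pattern does what was intended in an engine that supports lookahead (Python's re module); it says nothing about how FTL itself handles it.

```python
import re

# The pattern from the post, unchanged. Python's re supports (?!...), \b and \w.
PATTERN = re.compile(r"^(?:[a-zA-Z0-9-]+\.)*\bgoogle\.(?!com\b)\w+")

DOMAINS = [
    "google.com", "www.google.com", "google.net", "google.ca",
    "sub1.google.ca", "sub1.sub2.google.ca", "sub1.sub2.sub3.google.ca",
]

# Domains the pattern matches, i.e. the ones a lookahead-capable engine would block.
blocked = [d for d in DOMAINS if PATTERN.search(d)]
print(blocked)
```

In this sketch, google.com and www.google.com are the only two entries left out of the blocked list.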

Here is our regex tutorial: Tutorial - Pi-hole documentation

We also have a built-in regex tester: Testing - Pi-hole documentation

You can also find a semi-graphical regex checker at https://regex101.com

The easiest approach is to regex block the google.tld sites with no exceptions.

Then, explicitly whitelist google.com and www.google.com.

Whitelisting overrides blacklisting. The priority order is:

  1. Exact Whitelist
  2. Regex Whitelist
  3. Exact Blacklist
  4. Blocklist domains (AKA gravity)
  5. Regex Blacklist

FTL checks these lists from top to bottom. As soon as a domain is found in one of them, FTL skips the rest of the tests.
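
The first-match-wins order above can be sketched in Python. This is only an illustration of the priority logic, not how FTL is implemented: the function name first_match, the LISTS dictionary, and the placeholder blacklist pattern are all made up for the example.

```python
import re

def first_match(domain, lists):
    """Return the name of the first list that contains `domain`, in priority order."""
    checks = [
        ("exact whitelist", lambda: domain in lists["exact_white"]),
        ("regex whitelist", lambda: any(re.search(p, domain) for p in lists["regex_white"])),
        ("exact blacklist", lambda: domain in lists["exact_black"]),
        ("gravity",         lambda: domain in lists["gravity"]),
        ("regex blacklist", lambda: any(re.search(p, domain) for p in lists["regex_black"])),
    ]
    for name, hit in checks:
        if hit():
            return name  # first hit wins; lower-priority lists are not consulted
    return "not listed"

# Illustrative data only: block any name with a "google" label, allow two exact names.
LISTS = {
    "exact_white": {"google.com", "www.google.com"},
    "regex_white": [],
    "exact_black": set(),
    "gravity": set(),
    "regex_black": [r"(^|\.)google\."],  # placeholder pattern, not from this thread
}

for name in ["www.google.com", "google.ca", "example.org"]:
    print(name, "->", first_match(name, LISTS))
```

With this data, www.google.com stops at the exact whitelist even though the regex blacklist would also match it, which is the behavior the suggested approach relies on.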

Ok, good to know that order of operations there. Thanks for this tip. It is a pretty cool regex I wrote. :wink:

For those who may end up here in the future, here's what I did that does work:

Blacklist:
Google - Deny All Sites

^(?:[a-zA-Z0-9-]+\.)*google\...*$

Whitelist:
google.com - Allow *.google.com

^(?:[a-zA-Z0-9-]+\.)*google\.com

I know this is not common, but this regex will also block domains where "google" is only a subdomain label rather than the registered domain.

For example: google.whatever.tld
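
A quick way to see this is to run the blacklist pattern through Python's re module. Treat it as an approximation: Python's engine is not the one Pi-hole uses, and the helper name blocked is made up for the example.

```python
import re

# The "Deny All Sites" blacklist pattern from the post above, unchanged.
BLACKLIST = re.compile(r"^(?:[a-zA-Z0-9-]+\.)*google\...*$")

def blocked(domain):
    """True if the blacklist pattern matches the domain."""
    return bool(BLACKLIST.search(domain))

for domain in ["google.ca", "google.whatever.tld", "mail.google.whatever.tld", "notgoogle.ca"]:
    print(domain, blocked(domain))
```

The pattern only requires a "google" label followed by a dot and at least one more character, so anything may follow it, including further labels.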

Note also that Pi-hole can do subdomain wildcards for you.

A wildcard entry for google.com would get turned into

(\.|^)google\.com$
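
As a rough check of what that wildcard form covers, here is the same pattern in Python's re module (an approximation of Pi-hole's engine; the helper name matches is made up for the example):

```python
import re

# The wildcard form shown above.
WILDCARD = re.compile(r"(\.|^)google\.com$")

def matches(domain):
    """True if the wildcard pattern matches the domain."""
    return bool(WILDCARD.search(domain))

for domain in ["google.com", "www.google.com", "a.b.google.com",
               "notgoogle.com", "google.com.br"]:
    print(domain, matches(domain))
```

The leading (\.|^) means "google" has to start the name or follow a dot, and the trailing $ means nothing may come after .com.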

Ok, now you've done it... a challenge! lol - Thanks for pointing out the potential issue with the regex. Here's the new and improved version. Hammer away at it, I think it's rock solid now...

Google - Deny All Sites
^(?:[a-zA-Z0-9-]+\.)?([a-zA-Z0-9-]+\.)?google\.[a-zA-Z]{2,}$

Chris, thanks for pointing out that very cool use of wildcard matching a domain.

I'm loving pi-hole more and more.

As chrislph suggested, you can simplify this regex to (\.|^)google\.[a-zA-Z]{2,}$, but this won't match "All Google Sites"...

I don't want to discourage you, but this is harder than it looks (that's why there are many people creating lists).

Now your regex won't match google.com.br and similar domains.
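
Both versions can be compared in Python's re module (an approximation, since Pi-hole's engine is different; the names TWO_LABELS and SIMPLIFIED are made up for the example):

```python
import re

# The "new and improved" pattern: at most two optional subdomain labels.
TWO_LABELS = re.compile(r"^(?:[a-zA-Z0-9-]+\.)?([a-zA-Z0-9-]+\.)?google\.[a-zA-Z]{2,}$")
# The simplified wildcard-style pattern suggested afterwards.
SIMPLIFIED = re.compile(r"(\.|^)google\.[a-zA-Z]{2,}$")

SAMPLES = ["google.ca", "sub1.sub2.google.ca", "sub1.sub2.sub3.google.ca",
           "google.com.br", "google.whatever.tld"]

for domain in SAMPLES:
    print(domain, bool(TWO_LABELS.search(domain)), bool(SIMPLIFIED.search(domain)))
```

In this sketch neither pattern matches google.com.br, because both expect exactly one label after "google". The two-label version also misses sub1.sub2.sub3.google.ca, since it allows at most two labels in front, while the simplified version accepts any depth.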

Well, rats. I was obviously trying to future proof it, so I wouldn't have to manually maintain a list. I was unable to find a list of search engines (for google or anyone else) and their country alternative sites. Fortunately, google has a list of all their sites and I can add them to my own private list, which is what I originally started and then realized, hey maybe Regex would be a good way to go.

Thanks for the insights and it's been a great learning experience.

The way I'd handle those would be a wildcard domain blacklist entry for google.ca to cover the bottom four, and an exact domain blacklist entry for google.net. That leaves the top two working by virtue of not being blocked; alternatively, you can explicitly add a wildcard domain whitelist entry for google.com.

Chris,

If I understand your suggestion correctly, it wouldn't handle the entire list of google supported domains: https://www.google.com/supported_domains

I guess I could write a script that would slurp that URL and monitor for any changes and update my gravity list accordingly.
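
A minimal sketch of such a script in Python, under a few assumptions: the file is plain text with one domain per line (possibly with a leading dot), and the function names here are made up. How the result gets into Pi-hole is deliberately left out.

```python
import urllib.request

SUPPORTED_DOMAINS_URL = "https://www.google.com/supported_domains"

def parse_domains(text):
    """Turn the downloaded text into a set of bare, lowercase domains."""
    return {
        line.strip().lstrip(".").lower()
        for line in text.splitlines()
        if line.strip()
    }

def diff_domains(old, new):
    """Return (added, removed) as sorted lists."""
    return sorted(new - old), sorted(old - new)

def fetch_domains(url=SUPPORTED_DOMAINS_URL):
    """Download and parse the list. Needs network access."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return parse_domains(resp.read().decode("utf-8"))
```

A scheduled job could call fetch_domains(), compare the result with the previously saved set via diff_domains(), and only act when something was added or removed.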

Personally, I can live with the rare case of google.whatever.tld not coming through, so I went back to this:

^(?:[a-zA-Z0-9-]+\.)*google\...*$

Thank you again Chris and everyone for your thoughts.

If you want to block every item currently on this list, you can add 3 or 4 regex to cover the entire list without the need to whitelist google.com and its subdomains (and google.whatever.tld will not be blocked).

I created an example using 4 regex entries:

(^|\.)google\...$
(^|\.)google\...\...$
(^|\.)google\.com\...$
(^|\.)google\.cat

or

(^|\.)google(\...){1,2}$
(^|\.)google\.com\...$
(^|\.)google\.cat

The online example is concatenating the 4 entries, but you should add them separately.
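
For anyone who prefers a local check over the online tester, the three-entry variant can be tried in Python's re module. This is only an approximation of Pi-hole's engine, and the helper name is_blocked is made up for the example.

```python
import re

# The three-entry variant from above, one pattern per entry.
RULES = [
    r"(^|\.)google(\...){1,2}$",   # google.xx and google.xx.xx
    r"(^|\.)google\.com\...$",     # google.com.xx
    r"(^|\.)google\.cat",          # google.cat
]

def is_blocked(domain):
    """True if any of the entries matches the domain."""
    return any(re.search(rule, domain) for rule in RULES)

for domain in ["google.ca", "google.co.uk", "google.com.br", "google.cat",
               "sub1.sub2.sub3.google.ca", "google.com", "www.google.com",
               "google.net", "google.whatever.tld"]:
    print(domain, is_blocked(domain))
```

In this sketch google.com, www.google.com and google.whatever.tld pass through untouched. So does google.net from the list at the top of the thread, since its three-letter ending is not covered by any of the entries.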

RD,

Nice solution and it looks future proofed pretty well without hindering sites like google.whatever.com. I'm going to implement the 3 rule set now. Thank you.

Taking the advice found here, this is the solution I'm running, which allows *.google.com but blocks dns.google.com along with other variants of google search engines found with alternate TLDs.

dns.google.com is what provides the ability to do DNS over HTTPS, which gets around port 53 blocking.

The online example concatenates the 5 entries below to demonstrate matching against the current Google domains list (list date 1/4/2024).

Deny *.google.xx and *.google.xx.xx
(^|\.)google(\...){1,2}$

Deny *.google.com.xx
(^|\.)google\.com\...$

Deny *.google.cat
(^|\.)google\.cat

Deny *.dns.google.com
(^|\.)dns\.google\.com

Deny *.dns.google
(^|\.)dns\.google$
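
The two dns entries can be checked the same way in Python's re module (an approximation of Pi-hole's engine). They are written here with every literal dot escaped, which I assume is the intent; the names DNS_RULES, LOOSE and is_blocked are made up for the example.

```python
import re

# The two dns entries with every literal dot escaped (assumed intent).
DNS_RULES = [
    r"(^|\.)dns\.google\.com",
    r"(^|\.)dns\.google$",
]

# A variant with unescaped dots, for comparison: "." matches any character.
LOOSE = re.compile(r"(.|^)dns.google$")

def is_blocked(domain):
    """True if either dns entry matches the domain."""
    return any(re.search(rule, domain) for rule in DNS_RULES)

for domain in ["dns.google.com", "dns.google", "www.google.com", "xdns-google"]:
    print(domain, is_blocked(domain), bool(LOOSE.search(domain)))
```

Escaping matters because an unescaped dot accepts any character, so the loose variant also matches a made-up name like xdns-google, while the escaped entries do not.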

This is my final version until one of these brainiacs comes up with yet another way of doing things better. I hope this rundown of what I'm doing helps someone else who wants to keep the kiddos away from adult thumbnails in Google and other search engines. Side note: I'm enforcing Google Safe Search via the CNAME method. This solution has really locked things down.
