Regex help / facebook

Facebook appears to be at it again, ref this reddit article
I’ve been using the following regex (from the reddit Regex Megathread):

^(.+\.)??(facebook|.*fb.+)\.(com|net)$

but this doesn’t block the domain graphs.fb.me, also mentioned in the reddit article.
so I modified the regex to

^(.+\.)??(facebook|.*fb.+)\.(com|net|me)$

still no success.

After playing with the online regex tester and regexper, I came up with the following:

^(.+\.)??(facebook|.*fb.*)\.(com|net|org|me)$

This covers all the domains mentioned in the reddit article and in the article it refers to (checked with the online regex tester).
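For anyone wondering why adding `me` alone did not help: in `.*fb.+` the trailing `.+` demands at least one character between `fb` and the next dot, so the bare label `fb` in graphs.fb.me can never satisfy it. Below is a minimal sketch to check this. Python's `re` module is only a stand-in here, and Pi-hole's own regex engine may differ in other respects; the helper name `blocks` is made up for the example.

```python
import re

# The three patterns from the post, in order of appearance.
ORIGINAL = r"^(.+\.)??(facebook|.*fb.+)\.(com|net)$"
WITH_ME = r"^(.+\.)??(facebook|.*fb.+)\.(com|net|me)$"
RELAXED = r"^(.+\.)??(facebook|.*fb.*)\.(com|net|org|me)$"


def blocks(pattern: str, domain: str) -> bool:
    """Return True if the pattern matches the domain."""
    return re.search(pattern, domain) is not None


for name, pattern in [("original", ORIGINAL), ("with me", WITH_ME), ("relaxed", RELAXED)]:
    print(f"{name:9} graphs.fb.me -> {blocks(pattern, 'graphs.fb.me')}")
```

Only the relaxed pattern, where `.+` became `.*`, reports a match for graphs.fb.me.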

Is this a good regex, or does it go too far (covering domains you don’t want to blacklist)? All comments are welcome…
I’m using a regex here rather than a blocklist (the better solution) because it’s impossible to keep up with new Facebook domains.

edit
just realized I could remove the wildcard for ‘fb’, thus using:

^(.+\.)??(facebook|fb)\.(com|net|org|me)$

still would like to know which one you recommend…
/edit
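To make the trade-off between the two variants concrete, here is a small sketch, again using Python's `re` as a stand-in for whatever engine Pi-hole uses. The names halfback.com and surfboard.org are hypothetical examples of unrelated domains that merely contain the letters "fb"; the helper `matches` is also made up for the example.

```python
import re

WILDCARD_FB = r"^(.+\.)??(facebook|.*fb.*)\.(com|net|org|me)$"
EXACT_FB = r"^(.+\.)??(facebook|fb)\.(com|net|org|me)$"

# Two hypothetical unrelated names, then two names from this thread.
SAMPLES = ["halfback.com", "surfboard.org", "graphs.fb.me", "www.fbcdn.com"]


def matches(pattern: str, domain: str) -> bool:
    """Return True if the pattern matches the domain."""
    return re.search(pattern, domain) is not None


print(f"{'domain':16} {'wildcard':9} exact")
for domain in SAMPLES:
    print(f"{domain:16} {str(matches(WILDCARD_FB, domain)):9} {matches(EXACT_FB, domain)}")
```

The wildcard variant catches all four names, including the two unrelated ones, while the exact variant avoids the unrelated names but also stops matching www.fbcdn.com.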

A dump of my Facebook hatred:

### Facebook block ############################################################
\.fb$
(fb|fbcdn|fbsbx|tfbnw|facebook|freebasics|internet|messenger)\.[a-z.]{2,7}$
facebook\.com\.edge(key|suite)\.net$
fborigin\.[a-z.]{2,7}$
fbsx\.com.online-metrix\.[a-z.]{2,7}$
facebook-web-clients.appspot\.[a-z.]{2,7}$
fbcdn-profile-[a-z].akamaihd\.[a-z.]{2,7}$
instagram\.[a-z.]{2,7}$
cdninstagram\.[a-z.]{2,7}$
instagram(static-)?([a-z]|a\.akamaihd)\.facebook\.[a-z.]{2,7}$
instagram # sliding match
whatsapp\.[a-z.]{2,7}$
m\.me$

For example, there is also a facebook.nl, whose TLD is covered by \.[a-z.]{2,7}$
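A quick sketch of what that `\.[a-z.]{2,7}$` tail accepts, using one rule from the dump and Python's `re` as a stand-in engine. The names fastinternet.com and facebook.ab.com are hypothetical and only there to show side effects: the rule is not anchored on the left, and the tail accepts any 2 to 7 characters made of letters and dots.

```python
import re

# One rule from the dump above (no ^ anchor, so it slides along the name).
RULE = re.compile(
    r"(fb|fbcdn|fbsbx|tfbnw|facebook|freebasics|internet|messenger)\.[a-z.]{2,7}$"
)

SAMPLES = [
    "facebook.nl",         # two-letter TLD
    "www.facebook.co.uk",  # "co.uk" is 5 characters, dots allowed
    "facebook.test.com",   # "test.com" is 8 characters, too long for {2,7}
    "facebook.ab.com",     # hypothetical: "ab.com" is 6 characters
    "fastinternet.com",    # hypothetical: ends in "internet.com"
]

results = {domain: RULE.search(domain) is not None for domain in SAMPLES}
for domain, hit in results.items():
    print(f"{domain:20} -> {'match' if hit else 'NO match'}")
```

Only facebook.test.com escapes; the two hypothetical names are blocked as well, which may or may not be what you want.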

Mine looks pretty similar:
^.*\.?(facebook|.*fb.*?)\..+$

Your opening match clause ^(.+\.)?? seems to be more efficient than both my approach ^.*\.? and the one Pi-hole inserts for wildcard entries (^|\.) - more efficient in the sense that regex101.com shows some 8 or 38 steps less needed to evaluate yours (using other-sub.graph.fb.com as test string).
Of course, that’s no hard assessment criterion, as Pi-hole’s runtime behaviour might differ, depending on the actual regex implementation used.

Midways, mine would still match fbcdn or tfbnw parts, as I apply the leading and trailing wildcard matches in the domain part that you cut away from yours.

Towards EOL, mine would also catch country specific TLDs like .co.uk or .nl, but may overblock, e.g. by also matching facebook.someblacklistingdomain.com.

I am going to adopt your opening match to my regex. :wink:

And just out of curiosity, maybe a developer could comment whether it actually would be beneficial to replace (^|\.) by ^(.+\.)?? in simple wildcard matching - absolutely no priority though :wink:
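Apart from step counts (which this sketch does not try to reproduce), the three opening clauses also differ in what they accept. The snippet below puts each one in front of the same literal domain and uses Python's `re` as a stand-in engine; notfacebook.com is a hypothetical name chosen to test whether a label boundary is enforced.

```python
import re

# Three ways to open a pattern, each followed by the same literal domain.
OPENINGS = {
    "lazy optional group": r"^(.+\.)??facebook\.com$",
    "greedy prefix": r"^.*\.?facebook\.com$",
    "Pi-hole wildcard": r"(^|\.)facebook\.com$",
}

SAMPLES = ["facebook.com", "graph.facebook.com", "notfacebook.com"]

results = {
    label: {domain: re.search(pattern, domain) is not None for domain in SAMPLES}
    for label, pattern in OPENINGS.items()
}

for label, row in results.items():
    print(f"{label:20} {row}")
```

In this sketch `^.*\.?` also accepts notfacebook.com, because the dot is optional and `.*` can swallow any prefix, whereas `^(.+\.)??` and `(^|\.)` both insist that the name starts at the beginning of the string or right after a dot.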

The current wildcard is the most efficient one and I can’t follow you two in using a less specific way.

@jpgpi250 this was mine some time ago: https://www.reddit.com/r/pihole/comments/az98bk/this_is_so_satisfying_for_some_reason/ei6hb14?utm_medium=android_app&utm_source=share

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|net)$

@Bucking_Horn I believe my wildcard syntax to be more efficient but I believe the devs are using the current wildcard regexp as visually it is easier to understand.

compiled from your suggestions, how about this:

^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)\..+$



I would personally keep the tfbnw separate as it will likely never appear as the full tfbnw and it could be introducing extra steps checking for those optional characters every time

You can leave as wildcard at the end if you like but bear in mind this opens up to matching stuff like fbcdn.test.com etc. Maybe not a huge issue but sometimes have to be careful. If the list of tlds is small I would explicitly state them in an or statement at the end :slight_smile:

Or look at whether extended regexps would support something like \.[^.]+$ (dot, not dot, to end of string)
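A small sketch of how a `\.[^.]+$` tail behaves, with Python's `re` as a stand-in (negated bracket expressions such as `[^.]` are part of POSIX extended regular expressions too). The prefix and the `facebook` literal are only there to give the tail something to attach to.

```python
import re

# "dot, then one or more non-dot characters, then end of string"
SINGLE_LABEL_TAIL = re.compile(r"^(.+\.)?facebook\.[^.]+$")

SAMPLES = ["www.facebook.com", "www.facebook.nl", "www.facebook.co.uk", "facebook.test.com"]

results = {domain: SINGLE_LABEL_TAIL.search(domain) is not None for domain in SAMPLES}
for domain, hit in results.items():
    print(f"{domain:20} -> {'match' if hit else 'NO match'}")
```

The tail accepts any single-label TLD and rejects facebook.test.com, but it also rejects two-label endings such as .co.uk, so those would need their own alternative.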

And your proposed regex would then be?

I will do some experiments when I can get to my laptop :slight_smile:

^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$

www.facebook.co.uk -> match, 63 steps
graphs.fb.me -> match, 41 steps
b-api.facebook.com -> match, 45 steps
www.fbnw.com -> match, 46 steps
www.tfbnw.com -> match, 47 steps (only one step more for tfbnw)
fbcdn.test.com -> NO match, 64 steps
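The match / NO match column above can be re-checked offline with a few lines of Python. This is only a sketch: Python's `re` is a stand-in engine, and the step counts come from regex101 and are not reproduced here.

```python
import re

PATTERN = re.compile(r"^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$")

DOMAINS = [
    "www.facebook.co.uk",
    "graphs.fb.me",
    "b-api.facebook.com",
    "www.fbnw.com",
    "www.tfbnw.com",
    "fbcdn.test.com",
]

results = {domain: PATTERN.search(domain) is not None for domain in DOMAINS}
for domain, hit in results.items():
    print(f"{domain} -> {'match' if hit else 'NO match'}")
```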

pihole-FTL log:

[2020-02-20 15:01:49.737 2720] Regex blacklist (DB ID 54) >> MATCH: "www.tfbnw.com" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:02:59.087 2720] Regex blacklist (DB ID 54) >> MATCH: "www.facebook.co.uk" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:03:16.154 2720] Regex blacklist (DB ID 54) >> MATCH: "graphs.fb.me" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:04:13.793 2720] Regex blacklist (DB ID 54) >> MATCH: "www.fbnw.com" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"

Wouldn’t ^(.+\.)?? be actually more specific than (^|\.) ?

That would miss out on .co.uk (too many dots :wink: ).

But combining your more specific mid match with my ending doesn’t look too bad:
(@jpgpi250, note you can add multiple lines to regex101)

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\..+$

Test strings:
www.facebook.co.uk
an.facebook.nl
b-api.facebook.com
graph.facebook.com
graphs.fb.me
other-sub.graph.fb.com
www.fbcdn.com
www.tfbnw.com
www.fbsbx.com
www.fbnw.com
fbcdn.test.com
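The same multi-line test can be run outside regex101. A minimal sketch with Python's `re` as a stand-in engine, feeding the test strings above to the combined pattern:

```python
import re

PATTERN = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\..+$")

TEST_STRINGS = """
www.facebook.co.uk
an.facebook.nl
b-api.facebook.com
graph.facebook.com
graphs.fb.me
other-sub.graph.fb.com
www.fbcdn.com
www.tfbnw.com
www.fbsbx.com
www.fbnw.com
fbcdn.test.com
""".split()

matched = [d for d in TEST_STRINGS if PATTERN.search(d)]
missed = [d for d in TEST_STRINGS if not PATTERN.search(d)]

print("matched:", matched)
print("missed: ", missed)
```

With the open `\..+$` ending, fbcdn.test.com lands in the matched list (the overblocking discussed earlier), and www.fbnw.com is the only string missed because `nw` is not one of the optional suffixes.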

Because the RegEx is anchored to the right by a $:

\.(facebook|fb|tfb)(cdn|sbx|nw|)\.[a-z.]{2,7}$

If you also want to block facebook.com itself, then replace the leading \. by (^|\.)
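A sketch of the difference between the two openings, assuming the pattern is meant to start with `\.` immediately followed by the name group (no alternation bar in between). Python's `re` is used as a stand-in engine and the constant names are made up for the example.

```python
import re

# Leading "\." requires at least one label in front of the name.
SUBDOMAINS_ONLY = re.compile(r"\.(facebook|fb|tfb)(cdn|sbx|nw|)\.[a-z.]{2,7}$")
# "(^|\.)" also accepts the name at the very start of the string.
WITH_BARE_DOMAIN = re.compile(r"(^|\.)(facebook|fb|tfb)(cdn|sbx|nw|)\.[a-z.]{2,7}$")

SAMPLES = ["facebook.com", "www.facebook.com", "fbcdn.test.com"]

results = {
    domain: (
        SUBDOMAINS_ONLY.search(domain) is not None,
        WITH_BARE_DOMAIN.search(domain) is not None,
    )
    for domain in SAMPLES
}

for domain, (subs_only, with_bare) in results.items():
    print(f"{domain:18} leading dot: {subs_only!s:5}  (^|\\.): {with_bare}")
```

Only the `(^|\.)` form catches the bare facebook.com; both leave fbcdn.test.com alone because "test.com" is longer than the 7 characters the tail allows.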

www.facebook.co.uk -> match, 37 steps (my regex 63 steps)
graphs.fb.me -> match, 48 steps (my regex 41 steps)
b-api.facebook.com -> match, 27 steps (my regex 45 steps)
www.fbnw.com -> match, 34 steps (my regex 46 steps)
www.tfbnw.com -> match, 35 steps (my regex 47 steps)
fbcdn.test.com -> NO match, 71 steps (my regex 64 steps)

Although all but one of the matches require fewer steps, a NO match requires more. Since every domain NOT in gravity is always evaluated against the regex, I’m NOT so sure this is a better solution than mine. You always need to look at the number of steps required in case of a NO match. This may result in a less efficient looking regex, but speed is all that counts.

You absolutely do want to know the number of steps required for every individual match to arrive at the best (fastest) solution, so look at them one at a time.
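regex101 step counts are specific to the engine it emulates, so a rough local timing of the NO match case can be a useful second opinion. The sketch below uses Python's `re` and `timeit`, which is again not the engine Pi-hole runs, so treat the numbers as relative hints at best. The second pattern assumes the right-anchored regex is meant to start with `\.` followed directly by the name group; the labels, the helper `time_no_match` and the repeat count are made up for the example.

```python
import re
import timeit

PATTERNS = {
    "lazy prefix": r"^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$",
    "leading dot": r"\.(facebook|fb|tfb)(cdn|sbx|nw|)\.[a-z.]{2,7}$",
}

# Names that neither pattern should block.
NON_MATCHING = ["fbcdn.test.com", "test.com"]


def time_no_match(pattern: str, domain: str, repeats: int = 10_000) -> float:
    """Time `repeats` failed searches; refuse to time a name that matches."""
    compiled = re.compile(pattern)
    if compiled.search(domain) is not None:
        raise ValueError(f"{domain} unexpectedly matches {pattern}")
    return timeit.timeit(lambda: compiled.search(domain), number=repeats)


timings = {
    (label, domain): time_no_match(pattern, domain)
    for label, pattern in PATTERNS.items()
    for domain in NON_MATCHING
}

for (label, domain), seconds in timings.items():
    print(f"{label:12} {domain:15} {seconds:.4f} s for 10,000 searches")
```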

Yes, well, it may well be more specific but ultimately they achieve the same goal. Sadly the truth is the more efficient version looks more ugly to most people and may confuse people just starting out with regex.

Ah, yes. I see @jpgpi250 caught this above.

So, if I were to personally use regexps to block Facebook, my preferences would be as follows:

  1. If I wanted to be very specific:
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|me|net)$
  2. If I didn’t care for issues that may come up under other subdomains (e.g. facebook.test.com)
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.
  3. If I wanted to accommodate for known and possibly unknown future tlds:
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$

Regarding #2 - You do not need to include a .+$ at the end - You only need a partial match. The only reason you would need to specify the full pattern is if there may be some exceptions you would like to make.
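The three preferences can be compared side by side with a short sketch (Python's `re` as a stand-in engine; `re.search` gives the partial-match behaviour that makes option 2 work without a trailing `.+$`). The dictionary keys are just labels for this example.

```python
import re

OPTIONS = {
    "1 specific": r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|me|net)$",
    "2 partial": r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.",
    "3 any tld": r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$",
}

SAMPLES = ["graph.facebook.com", "www.facebook.co.uk", "an.facebook.nl", "facebook.test.com"]

results = {
    label: [re.search(pattern, domain) is not None for domain in SAMPLES]
    for label, pattern in OPTIONS.items()
}

print(f"{'option':12}", *SAMPLES)
for label, row in results.items():
    print(f"{label:12}", *row)
```

Option 1 only accepts the listed TLDs, option 2 accepts everything including facebook.test.com, and option 3 accepts the country TLDs while still rejecting facebook.test.com.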

www.facebook.co.uk -> match, 55 steps (msatter 37 steps) (my regex 63 steps)
graphs.fb.me -> match, 31 steps (msatter 48 steps) (my regex 41 steps)
b-api.facebook.com -> match, 39 steps (msatter 27 steps) (my regex 45 steps)
www.fbnw.com -> NO match, 36 steps (msatter match, 34 steps) (my regex match, 46 steps)
www.tfbnw.com -> match, 35 steps (msatter 35 steps) (my regex 47 steps)
fbcdn.test.com -> NO match, 55 steps (msatter 71 steps) (my regex 64 steps)

This regex has the fewest steps for a NO match, but doesn’t cover www.fbnw.com (it does exist). Looks like almost a winner…

I believe I originally referenced here although this list may have been updated since I looked at it to form the main regexp last year.

Bear in mind that this is not strictly a fair test for the steps of a NO match, as the regex would have partially matched the string before reaching the subdomain and determining it didn’t fit the criteria. If you were to use test.com or match.test.com, it would likely take fewer steps.

By simply adding nw to the expression, the NO match count only goes up by one.

^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\.([^.]+|co\.uk)$

www.fbnw.com -> match, 37 steps (msatter 34 steps) (my regex 46 steps)
fbcdn.test.com -> NO match, 56 steps (msatter 71 steps) (my regex 64 steps)
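A before/after sketch of that one-token change, with Python's `re` as a stand-in engine (step counts are not reproduced, only the match results):

```python
import re

BEFORE = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$")
AFTER = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\.([^.]+|co\.uk)$")

SAMPLES = ["www.fbnw.com", "www.fbcdn.com", "www.tfbnw.com", "fbcdn.test.com"]

results = {
    domain: (BEFORE.search(domain) is not None, AFTER.search(domain) is not None)
    for domain in SAMPLES
}

for domain, (before, after) in results.items():
    print(f"{domain:16} before: {before!s:5}  after: {after}")
```

Only www.fbnw.com changes from NO match to match; the other results stay the same.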

The final winner, or do we keep on going…

Where have you got this domain from? It isn’t one that currently looks to be owned by Facebook. It redirects to a parked page saying that the domain is for sale.

I would suggest that, for now, the following is enough :slight_smile:

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$

dig, see below, but this says ‘for sale’…

pi@raspberrypi:~ $ dig @127.10.10.2 -p 5552 www.fbnw.com

; <<>> DiG 9.11.5-P4-5.1-Raspbian <<>> @127.10.10.2 -p 5552 www.fbnw.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55098
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;www.fbnw.com.                  IN      A

;; ANSWER SECTION:
www.fbnw.com.           3600    IN      A       69.172.201.153

;; Query time: 293 msec
;; SERVER: 127.10.10.2#5552(127.10.10.2)
;; WHEN: Thu Feb 20 16:45:42 CET 2020
;; MSG SIZE  rcvd: 57