Regex help / facebook

jpgpi250 · February 20, 2020, 10:17am

Facebook appears to be at it again, ref this reddit article
I've been using the following regex (from the reddit Regex Megathread):

^(.+\.)??(facebook|.*fb.+)\.(com|net)$

but this doesn't block the domain graphs.fb.me, also mentioned in the reddit article.
so I modified the regex to

^(.+\.)??(facebook|.*fb.+)\.(com|net|me)$

still no success.

After playing with the online regex tester and regexper, I came up with the following:

^(.+\.)??(facebook|.*fb.*)\.(com|net|org|me)$
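
Why adding `me` alone was not enough: `.*fb.+` demands at least one character after `fb`, so a label that is exactly `fb` can never satisfy it, whereas `.*fb.*` allows zero. A minimal sketch that checks this with Python's `re` module. That is an assumption made for illustration; it is a different engine from the one Pi-hole uses, so it only shows which strings match, not step counts.

```python
import re

# Patterns copied from the post above.
WITH_PLUS = re.compile(r"^(.+\.)??(facebook|.*fb.+)\.(com|net|me)$")
WITH_STAR = re.compile(r"^(.+\.)??(facebook|.*fb.*)\.(com|net|org|me)$")


def blocked(pattern, domain):
    """Return True if the pattern matches the domain."""
    return pattern.search(domain) is not None


for domain in ("graphs.fb.me", "www.fbcdn.net"):
    print(domain, blocked(WITH_PLUS, domain), blocked(WITH_STAR, domain))
```

Only the `.*fb.*` version reports a match for graphs.fb.me; both match www.fbcdn.net.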

this covers all the domains (checked with the online regex tester) mentioned in the reddit article and in the article it refers to.

Is this a good regex, or does it go too far (covering domains you don't want to blacklist)? All comments are welcome...
I'm using a regex here as opposed to a blocklist (the better solution), because it's impossible to keep up with new Facebook domains.

edit
just realized I could remove the wildcard for 'fb', thus using:

^(.+\.)??(facebook|fb)\.(com|net|org|me)$

still would like to know which one you recommend...
/edit
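
One way to compare the two candidates is to run both against a name that merely contains the letters "fb". A rough sketch in Python (an assumption, not Pi-hole's engine); `halfback.com` is a made-up sample chosen only for its spelling and is not taken from the thread.

```python
import re

# Both patterns copied from the post above.
WIDE = re.compile(r"^(.+\.)??(facebook|.*fb.*)\.(com|net|org|me)$")
NARROW = re.compile(r"^(.+\.)??(facebook|fb)\.(com|net|org|me)$")

# "halfback.com" is a hypothetical unrelated name that contains "fb".
samples = ["graphs.fb.me", "www.fbcdn.net", "halfback.com"]

results = {d: (bool(WIDE.search(d)), bool(NARROW.search(d))) for d in samples}
for domain, (wide, narrow) in results.items():
    print(f"{domain:15} wide={wide} narrow={narrow}")
```

The wide form matches all three samples, including the unrelated one; the narrow form matches only graphs.fb.me and no longer catches fbcdn.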

Bucking_Horn · February 20, 2020, 12:35pm

Mine looks pretty similar:
^.*\.?(facebook|.*fb.*?)\..+$

Your opening match clause ^(.+\.)?? seems to be more efficient than both my approach ^.*\.? and the one Pi-hole inserts for wildcard entries, (^|\.). More efficient in the sense that regex101.com shows some 8 or 38 fewer steps needed to evaluate yours (using other-sub.graph.fb.com as test string).
Of course, that's no hard assessment criterion, as Pi-hole's runtime behaviour might differ, depending on the actual regex implementation used.

Midways, mine would still match fbcdn or tfbnw parts, as I apply the leading and trailing wildcard matches in the domain part that you cut away from yours.

Towards EOL, mine would also catch country specific TLDs like .co.uk or .nl, but may overblock, e.g. by also matching facebook.someblacklistingdomain.com.

I am going to adopt your opening match to my regex.

And just out of curiosity, maybe a developer could comment whether it actually would be beneficial to replace (^|\.) by ^(.+\.)?? in simple wildcard matching - absolutely no priority though
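
Leaving efficiency aside, the two openings should at least accept and reject the same names. A small sketch in Python (an assumption; step counts and timing in Pi-hole's own engine are not measured here). The `facebook\.com$` tail is hypothetical and used only to give both openings something to anchor to.

```python
import re

# Same hypothetical tail, two different opening clauses.
WILDCARD_STYLE = re.compile(r"(^|\.)facebook\.com$")
LAZY_PREFIX = re.compile(r"^(.+\.)??facebook\.com$")

samples = ["facebook.com", "www.facebook.com", "a.b.facebook.com", "notfacebook.com"]

# True only if both openings give the same verdict for every sample.
agree = all(
    bool(WILDCARD_STYLE.search(d)) == bool(LAZY_PREFIX.search(d)) for d in samples
)
for d in samples:
    print(d, bool(WILDCARD_STYLE.search(d)), bool(LAZY_PREFIX.search(d)))
print("same verdicts:", agree)
```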

mmotti · February 20, 2020, 1:12pm

@jpgpi250 this was mine some time ago: https://www.reddit.com/r/pihole/comments/az98bk/this_is_so_satisfying_for_some_reason/ei6hb14?utm_medium=android_app&utm_source=share

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|net)$

@Bucking_Horn I believe my wildcard syntax to be more efficient but I believe the devs are using the current wildcard regexp as visually it is easier to understand.

jpgpi250 · February 20, 2020, 1:24pm

compiled from your suggestions, how about this:

^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)\..+$

mmotti · February 20, 2020, 1:34pm

I would personally keep tfbnw separate, as it will likely only ever appear as the full tfbnw, and checking for those optional characters every time could introduce extra steps

You can leave the wildcard at the end if you like, but bear in mind this opens it up to matching stuff like fbcdn.test.com etc. Maybe not a huge issue, but sometimes you have to be careful. If the list of TLDs is small, I would explicitly state them in an or statement at the end

Or look at whether extended regexps would support something like \.[^.]+$ (dot, not dot, to end of string)
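
To see what a "dot, not dot, to end of string" ending would do, here is a minimal sketch in Python (an assumption; whether Pi-hole's engine accepts the same syntax is exactly the open question above). The ending is attached to the middle clause posted earlier in the thread; that exact combination is also an assumption.

```python
import re

# Middle clause from the earlier post plus the proposed \.[^.]+$ ending.
SINGLE_LABEL_TLD = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$")

samples = ["graphs.fb.me", "b-api.facebook.com", "fbcdn.test.com", "www.facebook.co.uk"]

results = {d: bool(SINGLE_LABEL_TLD.search(d)) for d in samples}
for domain, hit in results.items():
    print(f"{domain:20} {'match' if hit else 'NO match'}")
```

This keeps fbcdn.test.com out, but a two-label suffix such as .co.uk is rejected as well, so it would need its own alternative.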

jpgpi250 · February 20, 2020, 1:35pm

and your proposed regex would then be?

mmotti · February 20, 2020, 1:36pm

I will do some experiments when I can get to my laptop

jpgpi250 · February 20, 2020, 1:51pm

^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$

www.facebook.co.uk -> match, 63 steps
graphs.fb.me -> match, 41 steps
b-api.facebook.com -> match, 45 steps
www.fbnw.com -> match, 46 steps
www.tfbnw.com -> match, 47 steps (only one step more for tfbnw)
fbcdn.test.com -> NO match, 64 steps

pihole-FTL log:

[2020-02-20 15:01:49.737 2720] Regex blacklist (DB ID 54) >> MATCH: "www.tfbnw.com" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:02:59.087 2720] Regex blacklist (DB ID 54) >> MATCH: "www.facebook.co.uk" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:03:16.154 2720] Regex blacklist (DB ID 54) >> MATCH: "graphs.fb.me" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:04:13.793 2720] Regex blacklist (DB ID 54) >> MATCH: "www.fbnw.com" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
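
The match/no-match results in this post can be re-checked outside Pi-hole with a short Python sketch (an assumption; step counts are not reproduced). It also illustrates the earlier remark about keeping tfbnw separate: the optional groups combine freely, so the pattern accepts labels nobody listed. `tfbcdn.com` and `fbnwsbx.com` are made-up names, and `EXPLICIT` pairs the explicit alternation posted earlier with the same ending, which is a combination assumed here for comparison.

```python
import re

# Pattern from the post above.
LOOSE = re.compile(r"^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$")
# Assumed combination: explicit alternation with the same ending.
EXPLICIT = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)(\.[^\.]+|\.co\.uk)$")

listed = [
    "www.facebook.co.uk", "graphs.fb.me", "b-api.facebook.com",
    "www.fbnw.com", "www.tfbnw.com", "fbcdn.test.com",
]
made_up = ["tfbcdn.com", "fbnwsbx.com"]  # hypothetical labels

loose = {d: bool(LOOSE.search(d)) for d in listed + made_up}
explicit = {d: bool(EXPLICIT.search(d)) for d in listed + made_up}
for d in listed + made_up:
    print(f"{d:20} loose={loose[d]} explicit={explicit[d]}")
```

The loose form matches both made-up names; the explicit form matches neither, at the cost of also dropping www.fbnw.com.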

Bucking_Horn · February 20, 2020, 2:23pm

Wouldn't ^(.+\.)?? be actually more specific than (^|\.) ?

That would miss out on .co.uk (too many dots).

But combining your more specific mid match with my ending doesn't look too bad:
(@jpgpi250, note you can add multiple lines to regex101)

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\..+$

Test strings:

www.facebook.co.uk
an.facebook.nl
b-api.facebook.com
graph.facebook.com
graphs.fb.me
other-sub.graph.fb.com
www.fbcdn.com
www.tfbnw.com
www.fbsbx.com
www.fbnw.com
fbcdn.test.com

jpgpi250 · February 20, 2020, 3:21pm

www.facebook.co.uk -> match, 37 steps (my regex 63 steps)
graphs.fb.me -> match, 48 steps (my regex 41 steps)
b-api.facebook.com -> match, 27 steps (my regex 45 steps)
www.fbnw.com -> match, 34 steps (my regex 46 steps)
www.tfbnw.com -> match, 35 steps (my regex 47 steps)
fbcdn.test.com -> NO match, 71 steps (my regex 64 steps)

Although almost all of the matches (all except one) require fewer steps, a NO match requires more steps. Since every domain that is NOT in gravity is always evaluated by regex, I'm NOT so sure this is a better solution than mine. You always need to look at the number of steps required in case of a NO match. This may result in a less efficient looking regex, but speed is all that counts.

To come to the best (fastest) solution, you absolutely want to know the number of steps required for every individual match, thus looking at them one at a time.

mmotti · February 20, 2020, 3:23pm

Yes, well, it may well be more specific but ultimately they achieve the same goal. Sadly the truth is the more efficient version looks more ugly to most people and may confuse people just starting out with regex.

Ah, yes. I see @jpgpi250 caught this above.

mmotti · February 20, 2020, 3:30pm

So, if I were to personally use regexps to block Facebook, my preferences would be as follows:

1. If I wanted to be very specific:
^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|me|net)$
2. If I didn't care about issues that may come up under other subdomains (e.g. facebook.test.com):
^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.
3. If I wanted to accommodate known and possibly unknown future TLDs:
^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$

Regarding #2 - You do not need to include a .+$ at the end - You only need a partial match. The only reason you would need to specify the full pattern is if there may be some exceptions you would like to make.
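
The three preferences can be compared side by side. A minimal sketch in Python (an assumption, not Pi-hole's engine), using `search()` so that an unanchored tail behaves as the partial match described above. The variant names and `www.facebook.de` are illustrative and not from the thread.

```python
import re

# Shared opening and middle clause, ending with the literal dot.
MID = r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\."

VARIANTS = {
    "specific": re.compile(MID + r"(com|me|net)$"),
    "partial": re.compile(MID),  # no tail: a partial match is enough
    "any_tld": re.compile(MID + r"([^.]+|co\.uk)$"),
}

samples = ["graphs.fb.me", "www.facebook.de", "www.facebook.co.uk", "facebook.test.com"]

table = {
    d: {name: bool(rx.search(d)) for name, rx in VARIANTS.items()} for d in samples
}
for domain, row in table.items():
    print(f"{domain:20} {row}")
```

Only the partial form matches facebook.test.com, and only the specific form misses www.facebook.de and www.facebook.co.uk.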

jpgpi250 · February 20, 2020, 3:43pm

www.facebook.co.uk -> match, 55 steps (msatter 37 steps) (my regex 63 steps)
graphs.fb.me -> match, 31 steps (msatter 48 steps) (my regex 41 steps)
b-api.facebook.com -> match, 39 steps (msatter 27 steps) (my regex 45 steps)
www.fbnw.com -> NO match, 36 steps (msatter match, 34 steps) (my regex match, 46 steps)
www.tfbnw.com -> match, 35 steps (msatter 35 steps) (my regex 47 steps)
fbcdn.test.com -> NO match, 55 steps (msatter 71 steps) (my regex 64 steps)

This regex has the least number of steps for a NO match, but doesn't cover www.fbnw.com (~~does it even exist~~ it does exist). Looks like almost a winner...

mmotti · February 20, 2020, 3:49pm

I believe I originally referenced here although this list may have been updated since I looked at it to form the main regexp last year.

Bear in mind that this is not strictly a fair test for the steps of a no match, as it would have partially matched the string before reaching the subdomain and determining it didn't fit the criteria. If you were to use test.com or match.test.com, it would likely take fewer steps.

jpgpi250 · February 20, 2020, 3:56pm

by simply adding nw to the expression, the no match count only goes up by one.

^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\.([^.]+|co\.uk)$

www.fbnw.com -> match, 37 steps (msatter 34 steps) (my regex 46 steps)
fbcdn.test.com -> NO match, 56 steps (msatter 71 steps) (my regex 64 steps)

The final winner, or do we keep on going...

mmotti · February 20, 2020, 3:56pm

Where have you got this domain from? It isn't one that currently looks to be owned by Facebook. It redirects to a parked page saying that the domain is for sale.

I would suggest that, for now, the following is enough

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$

jpgpi250 · February 20, 2020, 3:59pm

dig, see below, but this says 'for sale'...

pi@raspberrypi:~ $ dig @127.10.10.2 -p 5552 www.fbnw.com

; <<>> DiG 9.11.5-P4-5.1-Raspbian <<>> @127.10.10.2 -p 5552 www.fbnw.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55098
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;www.fbnw.com.                  IN      A

;; ANSWER SECTION:
www.fbnw.com.           3600    IN      A       69.172.201.153

;; Query time: 293 msec
;; SERVER: 127.10.10.2#5552(127.10.10.2)
;; WHEN: Thu Feb 20 16:45:42 CET 2020
;; MSG SIZE  rcvd: 57

jpgpi250 · February 20, 2020, 4:01pm

Agree...

mmotti · February 20, 2020, 4:01pm

Yeah so I would suggest that this is not a Facebook domain, so not one that the regexp needs to accommodate for at this time

Edit: Yay. Looks like a resolution

Bucking_Horn · February 20, 2020, 4:06pm

What about .nl or .de?
I think it's safe to assume Facebook has registered its name under the TLD of every country they are operating or planning to operate in, so I'd stick to my more general end match

^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\..+$

| domain | match | steps |
|---|---|---|
| `www.facebook.co.uk` | YES | 44 |
| `graphs.fb.me` | YES | 30 |
| `b-api.facebook.com` | YES | 37 |
| `www.fbnw.com` | YES | 35 |
| `www.tfbnw.com` | YES | 33 |
| `fbcdn.test.com` | YES | 43 |
| `www.example.com` | NO | 32 |
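
The match column of the table above can be re-checked with a short Python sketch (an assumption; it is not Pi-hole's engine and does not reproduce the step counts). For comparison it also runs the `([^.]+|co\.uk)$` ending suggested earlier. `www.facebook.com.br` is an illustrative name added here to show a two-label suffix other than .co.uk.

```python
import re

# General end match from the post above.
GENERAL_END = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\..+$")
# Single-label ending suggested earlier in the thread.
SINGLE_LABEL_END = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$")

samples = [
    "www.facebook.co.uk", "graphs.fb.me", "b-api.facebook.com", "www.fbnw.com",
    "www.tfbnw.com", "fbcdn.test.com", "www.example.com",
    "an.facebook.nl",        # from the earlier list of test strings
    "www.facebook.com.br",   # illustrative, not from the thread
]

general = {d: bool(GENERAL_END.search(d)) for d in samples}
single = {d: bool(SINGLE_LABEL_END.search(d)) for d in samples}
for d in samples:
    print(f"{d:22} general={general[d]} single_label={single[d]}")
```

Both endings match a plain country TLD such as .nl. They differ on fbcdn.test.com and on two-label suffixes that are not listed explicitly.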