Regex help / facebook

And your proposed regex would then be?

I will do some experiments when I can get to my laptop :slight_smile:

^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$

www.facebook.co.uk -> match, 63 steps
graphs.fb.me -> match, 41 steps
b-api.facebook.com -> match, 45 steps
www.fbnw.com -> match, 46 steps
www.tfbnw.com -> match, 47 steps (only one step more for tfbnw)
fbcdn.test.com -> NO match, 64 steps
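For anyone without regex101 to hand, the match / NO match results above can be re-checked with a short script. This is only a sketch: it uses Python's `re` module, which is not the engine pihole-FTL uses, so it says nothing about step counts. The helper name `check` is made up for illustration.

```python
import re

# The regex from the post above, copied verbatim.
PATTERN = re.compile(
    r"^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
)

DOMAINS = [
    "www.facebook.co.uk",
    "graphs.fb.me",
    "b-api.facebook.com",
    "www.fbnw.com",
    "www.tfbnw.com",
    "fbcdn.test.com",
]


def check(domains, pattern=PATTERN):
    """Return {domain: True/False} depending on whether the pattern matches."""
    return {domain: pattern.search(domain) is not None for domain in domains}


for domain, matched in check(DOMAINS).items():
    print(f"{domain} -> {'match' if matched else 'NO match'}")
```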

regexper:

pihole-FTL log:

[2020-02-20 15:01:49.737 2720] Regex blacklist (DB ID 54) >> MATCH: "www.tfbnw.com" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:02:59.087 2720] Regex blacklist (DB ID 54) >> MATCH: "www.facebook.co.uk" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:03:16.154 2720] Regex blacklist (DB ID 54) >> MATCH: "graphs.fb.me" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"
[2020-02-20 15:04:13.793 2720] Regex blacklist (DB ID 54) >> MATCH: "www.fbnw.com" vs. "^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$"

Wouldn't ^(.+\.)?? be actually more specific than (^|\.) ?

That would miss out on .co.uk (too many dots :wink: ).

But combining your more specific mid match with my ending doesn't look too bad:
(@jpgpi250, note that you can add multiple lines to regex101)

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\..+$

Test strings:
www.facebook.co.uk
an.facebook.nl
b-api.facebook.com
graph.facebook.com
graphs.fb.me
other-sub.graph.fb.com
www.fbcdn.com
www.tfbnw.com
www.fbsbx.com
www.fbnw.com
fbcdn.test.com

www.facebook.co.uk -> match, 37 steps (my regex 63 steps)
graphs.fb.me -> match, 48 steps (my regex 41 steps)
b-api.facebook.com -> match, 27 steps (my regex 45 steps)
www.fbnw.com -> match, 34 steps (my regex 46 steps)
www.tfbnw.com -> match, 35 steps (my regex 47 steps)
fbcdn.test.com -> NO match, 71 steps (my regex 64 steps)

Although all but one of the matches require fewer steps, a NO match requires more steps. Since every domain that is NOT in gravity is always evaluated against the regex, I'm NOT so sure this is a better solution than mine. You always need to look at the number of steps required in the case of a NO match. This may result in a less efficient-looking regex, but speed is all that counts.

You absolutely do want to know the number of steps required for every individual match in order to arrive at the best (fastest) solution, so look at them one at a time.
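If you want a rough feel for the NO match cost outside regex101, something like the following could be used. Treat it as a sketch under assumptions: it times Python's `re` engine, not pihole-FTL's, so the absolute numbers (and possibly even the ranking) may not carry over. The labels and the function name `time_patterns` are invented for this example.

```python
import re
import timeit

# The two regexes compared so far in this thread; the labels are illustrative.
PATTERNS = {
    "lazy_prefix": r"^(.+\.)??(facebook|(t)?fb(nw)?(cdn|sbx)?)(\.[^\.]+|\.co\.uk)$",
    "general_ending": r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\..+$",
}


def time_patterns(domain, patterns=PATTERNS, number=20000):
    """Return {label: seconds spent running `number` searches of `domain`}."""
    timings = {}
    for label, source in patterns.items():
        compiled = re.compile(source)  # compile once, outside the timed loop
        timings[label] = timeit.timeit(
            lambda: compiled.search(domain), number=number
        )
    return timings


# www.example.com is a domain neither regex should match.
for label, seconds in time_patterns("www.example.com").items():
    print(f"{label}: {seconds:.4f} s for 20000 searches")
```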

Yes, well, it may well be more specific, but ultimately they achieve the same goal. Sadly, the truth is that the more efficient version looks uglier to most people and may confuse people just starting out with regex.

Ah, yes. I see @jpgpi250 caught this above.

So, if I were to personally use regexps to block Facebook, my preferences would be as follows:

  1. If I wanted to be very specific:
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|me|net)$
  2. If I didn't care about issues that may come up under other subdomains (e.g. facebook.test.com):
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.
  3. If I wanted to accommodate known and possibly unknown future TLDs:
    ^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$

Regarding #2: you do not need to include a .+$ at the end, because you only need a partial match. The only reason to specify the full pattern is if there are exceptions you would like to make.
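To make the difference between the three options concrete, here is a small sketch that reports which of them match a given domain. It assumes Python's `re.search` as a stand-in for the partial matching described above; the dictionary keys and the function name `matching_variants` are just for illustration.

```python
import re

# The three options from the list above, keyed by their list number.
VARIANTS = {
    1: r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(com|me|net)$",
    2: r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.",
    3: r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$",
}


def matching_variants(domain):
    """Return the list numbers of the options that match `domain`."""
    return [
        number
        for number, source in VARIANTS.items()
        if re.search(source, domain)  # search() allows a partial match
    ]


for domain in [
    "graph.facebook.com",
    "an.facebook.nl",
    "www.facebook.co.uk",
    "facebook.test.com",
    "www.example.com",
]:
    print(domain, "->", matching_variants(domain))
```

Option 2 has no `$` anchor, so it is the only one that also catches facebook.test.com.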

www.facebook.co.uk -> match, 55 steps (msatter 37 steps) (my regex 63 steps)
graphs.fb.me -> match, 31 steps (msatter 48 steps) (my regex 41 steps)
b-api.facebook.com -> match, 39 steps (msatter 27 steps) (my regex 45 steps)
www.fbnw.com -> NO match, 36 steps (msatter match, 34 steps) (my regex match, 46 steps)
www.tfbnw.com -> match, 35 steps (msatter 35 steps) (my regex 47 steps)
fbcdn.test.com -> NO match, 55 steps (msatter 71 steps) (my regex 64 steps)

This regex has the fewest steps for a NO match, but doesn't cover www.fbnw.com (does it even exist? It does). Almost looks like a winner...

I believe I originally referenced here although this list may have been updated since I looked at it to form the main regexp last year.

Bear in mind that this is not strictly a fair test of the steps for a NO match, as the regex would have partially matched the string before reaching the subdomain and determining that it didn't fit the criteria. If you were to use test.com or match.test.com, it would likely take fewer steps.

By simply adding nw to the expression, the NO match count only goes up by one.

^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\.([^.]+|co\.uk)$

www.fbnw.com -> match, 37 steps (msatter 34 steps) (my regex 46 steps)
fbcdn.test.com -> NO match, 56 steps (msatter 71 steps) (my regex 64 steps)

The final winner, or do we keep on going...

Where have you got this domain from? It isn't one that currently looks to be owned by Facebook. It redirects to a parked page saying that the domain is for sale.

I would suggest that, for now, the following is enough :slight_smile:

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$

dig, see below, but this says 'for sale'...

pi@raspberrypi:~ $ dig @127.10.10.2 -p 5552 www.fbnw.com

; <<>> DiG 9.11.5-P4-5.1-Raspbian <<>> @127.10.10.2 -p 5552 www.fbnw.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 55098
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;www.fbnw.com.                  IN      A

;; ANSWER SECTION:
www.fbnw.com.           3600    IN      A       69.172.201.153

;; Query time: 293 msec
;; SERVER: 127.10.10.2#5552(127.10.10.2)
;; WHEN: Thu Feb 20 16:45:42 CET 2020
;; MSG SIZE  rcvd: 57

Agree...

Yeah :+1: so I would suggest that this is not a Facebook domain, so not one that the regexp needs to accommodate for at this time :grin:

Emoji spam

Edit: Yay. Looks like a resolution :partying_face: :partying_face: :partying_face:

What about .nl or .de?
I think it's safe to assume Facebook has registered its name under the TLD of every country it operates in or plans to operate in, so I'd stick with my more general end match :wink:

^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\..+$

domain               match   steps
www.facebook.co.uk   YES     44
graphs.fb.me         YES     30
b-api.facebook.com   YES     37
www.fbnw.com         YES     35
www.tfbnw.com        YES     33
fbcdn.test.com       YES     43
www.example.com      NO      32

don't see the problem

These are covered.

\.([^.]+|co\.uk)$ essentially means that any single-label TLD (e.g. .com, .net, .nl, .de, .org) is covered, as is the one known TLD that is used with a second-level label (.co.uk).

^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\..+$ will match things like fb.fbi.gov
or fbsbx.entirelyunrelatedwebsite.com
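The over-matching is easy to see side by side. The sketch below compares the general `\..+$` ending with the `\.([^.]+|co\.uk)$` ending, using the same middle part for both so that only the ending differs. It assumes Python's `re` module as a stand-in for the real engine, and the names `GENERAL`, `SINGLE_TLD` and `compare` are made up for this example.

```python
import re

# Same middle part in both patterns; only the ending differs.
GENERAL = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\..+$")
SINGLE_TLD = re.compile(r"^(.+\.)?(facebook|fb(cdn|sbx|nw)?|tfbnw)\.([^.]+|co\.uk)$")


def compare(domain):
    """Return (matched_by_general, matched_by_single_tld) for `domain`."""
    return (
        GENERAL.search(domain) is not None,
        SINGLE_TLD.search(domain) is not None,
    )


for domain in [
    "an.facebook.nl",
    "www.facebook.co.uk",
    "fb.fbi.gov",
    "fbsbx.entirelyunrelatedwebsite.com",
    "fbcdn.test.com",
]:
    general, single = compare(domain)
    print(f"{domain}: general={general}, single_tld={single}")
```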

Any TLD would e.g. include .co.au and .co.nz as well :wink:

This is true, however going by the current facebook "blacklists", the necessary tlds are covered.

However, if you really wanted to be sure that all of your bases are covered for .co.something:
^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.(co\.)?[^.]+$

Edit:
However, I would still say that:
^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.([^.]+|co\.uk)$
is sufficient for all use cases at the moment. I don't even think the .co.uk part is entirely necessary.

Final Edit:

My recommendation for this regexp going by the blacklists available to use at this time would be:

^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\.[^.]+$ :exploding_head:
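For reference, here is a rough comparison of the three endings proposed in the last few posts. It is a sketch using Python's `re` module rather than pihole-FTL itself, and the labels and the function name `endings_matching` are invented for illustration. Note that the plain `[^.]+$` ending no longer matches www.facebook.co.uk, which fits the remark above that the .co.uk part may not be entirely necessary.

```python
import re

# Shared prefix and middle part, up to and including the dot before the TLD.
BASE = r"^(.+\.)?(facebook|fb(cdn|sbx)?|tfbnw)\."

# Illustrative labels for the three endings discussed above.
ENDINGS = {
    "co_optional": r"(co\.)?[^.]+$",     # .co.something or a single label
    "co_uk_only": r"([^.]+|co\.uk)$",    # a single label, or exactly .co.uk
    "single_label": r"[^.]+$",           # a single label only
}


def endings_matching(domain):
    """Return the labels of the endings whose full pattern matches `domain`."""
    return [
        label
        for label, ending in ENDINGS.items()
        if re.search(BASE + ending, domain)
    ]


for domain in [
    "www.facebook.co.uk",
    "www.facebook.co.nz",
    "an.facebook.nl",
    "facebook.test.com",
]:
    print(domain, "->", endings_matching(domain))
```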

I plead guilty - I admitted that much in my first post :wink: