RegexGen (Algorithm help)

Greetings!

So I am in dire need of someone with expert programming knowledge to point me in the right direction on a problem, and yes, it is related to Pi-hole. Here’s the problem:

Given a list of domains, such as in the case of:

I want the domains clustered into groups at the highest level possible (subdomain, sub-subdomain, or domain), so that I can eventually write a regex for each group. The expected output of this program would be:
Group 1: google.com
Group 2: github.io
Group 3: sub.domain.com
Group 4: sub1.sub.domain.com
Group 5: domain.com
ETC…

These are in no particular order, but you get the concept. The idea is to take a blacklist and ask how it could be shortened with regex, instead of keeping huge blacklists when in reality entire domains should be blacklisted (or whitelisted).

A preferred language to start with would be Python, since I prefer that for working with data. Any solutions/input?

I do not want to write a separate case for every possible domain length by splitting on the “.”, because that would be a lot of conditionals. Rather, I would prefer code that uses minimal resources…
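One way to avoid per-length conditionals, as a rough sketch only: loop over every dot-separated suffix of each domain and count how many list entries share it. The function names (`suffix_counts`, `candidate_groups`), the `min_members` threshold, and the sample domains are made up for illustration. The sketch knows nothing about public suffixes such as `github.io`, so a real tool would need extra handling for those.

```python
from collections import Counter

def suffix_counts(domains):
    """Count how many listed domains fall under each dot-separated suffix."""
    counts = Counter()
    for domain in domains:
        labels = domain.strip().lower().split(".")
        # Walk every suffix that has at least two labels, e.g.
        # a.b.example.com -> a.b.example.com, b.example.com, example.com
        for i in range(len(labels) - 1):
            counts[".".join(labels[i:])] += 1
    return counts

def candidate_groups(domains, min_members=2):
    """Return suffixes shared by at least `min_members` entries (hypothetical threshold)."""
    counts = suffix_counts(domains)
    return sorted(s for s, n in counts.items() if n >= min_members)

# Hypothetical sample input, not taken from any real blocklist
sample = [
    "ads.example.com",
    "track.example.com",
    "a.cdn.example.net",
    "b.cdn.example.net",
    "example.org",
]
print(candidate_groups(sample))
```

The loop handles any number of labels the same way, so no branch per domain length is needed. Which of the reported suffixes is actually safe to block as a whole remains a manual decision.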

It is rather related to blacklists.
Pi-hole developers maintain neither blacklists nor blacklisting regexes. They just provide the means of applying blacklists or regexes to their filtering engine.

Blacklist maintainers may prove more apt in providing support for your specific problem, as they may choose to put similar pattern matching challenges before them (or not).

Besides, it seems this topic is an extension of your previous one (HELP: Python Program to Generate/Suggest Wildcards from Blocklist).

I don’t want to sound discouraging - but :wink:

Considering both topics, your design idea seems flawed to me.

First, your approach seems to have primarily wildcard blocking in mind. This would leverage just a fraction of what can be achieved by applying regular expressions.

Second, wildcarding a top-level entry within the hierarchical groups you want to assemble would also block all lower levels, making the subdomain grouping somewhat obsolete.

Third, applying unspecific wildcarding, even if used on your subdomains, will quite likely result in overblocking of resources.

Fourth, concentrating on the domain name hierarchy neglects other cross domain and domain substring naming patterns that prove way more significant for matching ad-serving domains.


In short: Wildcarding by itself is too unspecific to come up with a reliable, short list of regex patterns.
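To illustrate the second point, here is a small sketch using Python's `re` module. The helper name `wildcard_regex` is made up, and the pattern shape `(^|\.)domain\.com$` is only assumed to be a typical way of writing a wildcard-style regex:

```python
import re

def wildcard_regex(domain):
    """Build a regex matching `domain` itself and any subdomain of it."""
    return re.compile(r"(^|\.)" + re.escape(domain) + r"$")

pattern = wildcard_regex("domain.com")

# The deeper entries from the example groups are already covered
# by the pattern for domain.com, so separate groups for them add nothing.
for name in ["domain.com", "sub.domain.com", "sub1.sub.domain.com", "notdomain.com"]:
    print(name, bool(pattern.search(name)))
```

The first three names match and `notdomain.com` does not, so once `domain.com` is wildcarded, the groups for `sub.domain.com` and `sub1.sub.domain.com` are redundant.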

If you pursue this just for fun to hone your programming skills, don’t be bothered by my unqualified ramblings :-1: and go ahead with it :slight_smile:

Otherwise, you may want to shift your focus a bit away from the domain name hierarchy to cross domain or substring keyword pattern matching (e.g. as demonstrated by Regex examples)
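As a rough illustration of what keyword-based matching could look like, the sketch below tests a single pattern against domains from unrelated hierarchies. The pattern and the sample domains are hypothetical and are not taken from any maintained regex list:

```python
import re

# Hypothetical keyword pattern: a label that starts with "ad"/"ads",
# "track"/"tracking" or "telemetry", optionally followed by digits.
KEYWORD_PATTERN = re.compile(r"(^|\.)(ads?|track(ing)?|telemetry)[0-9]*\.")

def is_keyword_match(domain):
    """Return True if any label of `domain` looks like an ad/tracking keyword."""
    return KEYWORD_PATTERN.search(domain.lower()) is not None

# Made-up sample domains from different, unrelated hierarchies
for name in ["ads.example.com", "tracking2.example.net",
             "cdn.ads1.example.io", "adobe.example.org", "download.example.com"]:
    print(name, is_keyword_match(name))
```

A single pattern like this spans many unrelated domains, which grouping by hierarchy cannot do. It can still overblock, so any such pattern would need testing against real query logs.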