Ok, so here's the scenario:
- I have a blocklist with multiple subdomains, sub-subdomains, and so on. If more than one entry sits under the same domain, subdomain, or sub-subdomain, the program should find that parent and recommend wildcarding it. For example, using the list below, it should suggest *.microsoft.com and *.windowsupdate.com. For a more complex list containing, say, sub1.subdomain.domain.com and sub2.subdomain.domain.com, it should recommend subdomain.domain.com.
windowsupdate.microsoft.com
update.microsoft.com
windowsupdate.com
download.windowsupdate.com
download.microsoft.com
test.stats.update.microsoft.com
ntservicepack.microsoft.com
au.windowsupdate.com
tlu.dl.delivery.mp.microsoft.com
I already have a start on this, using Python:
import re

domains = [
    "windowsupdate.microsoft.com",
    "update.microsoft.com",
    "windowsupdate.com",
    "download.windowsupdate.com",
    "download.microsoft.com",
    "test.stats.update.microsoft.com",
    "ntservicepack.microsoft.com",
    "au.windowsupdate.com",
    "tlu.dl.delivery.mp.microsoft.com",
]
# To read the list from a file instead, uncomment:
# with open("/private/tmp/blocklists/Social/twitter.txt") as f:
#     domains = [line.rstrip("\n") for line in f]

rootdomains = []
subdomains = []
subsubdomains = []

# https://go.jayke.net/YtB
domainregex = r"\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"

# Ensure that we are working with valid domains.
# Build a new list instead of removing items while iterating, which skips entries.
validdomains = []
for domain in domains:
    if re.match(domainregex, domain):
        print("Domain Valid: {}".format(domain))
        validdomains.append(domain)
    else:
        print("Domain Invalid: {}".format(domain))
domains = validdomains
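
# Experimental stuff
for domain in domains:
    parts = domain.split(".")
    # Negative slices keep the last 2, 3 and 4 labels of the name
    rootdomain = parts[-2:]
    subdomain = parts[-3:]
    subsubdomain = parts[-4:]
    if rootdomain not in rootdomains:
        rootdomains.append(rootdomain)
    # Only record a subdomain when the name really has a third label
    if len(parts) >= 3 and subdomain not in subdomains:
        subdomains.append(subdomain)

#print(rootdomains)
#print(subdomains)

for rootdomain in rootdomains:
    rdomain = ".".join(rootdomain)
    matches = 0
    for subdomain in subdomains:
        sdomain = ".".join(subdomain)
        # Compare strings, not a string against a list of labels
        if sdomain.endswith("." + rdomain):
            matches += 1
        # NOTE: subsubdomains is never filled above, so this loop does not run yet
        for subsubdomain in subsubdomains:
            ssdomain = ".".join(subsubdomain)
            if ssdomain in subdomain:
                print("You might want to wildcard: {}".format(ssdomain))
    # Suggest the root only when more than one subdomain sits under it
    if matches > 1:
        print("You might want to wildcard: {}".format(rdomain))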
It's not perfect; it is more of a proof of concept, and I could only get it to work at the root-domain level.
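To make the goal concrete, here is a minimal sketch of the behaviour I am after, not a finished solution. For every name it walks up the parent suffixes and records which label sits directly under each parent; a parent with two or more distinct children gets suggested. The names suggest_wildcards and min_children are made up for illustration, and the sketch assumes a parent needs at least two labels, which is wrong for suffixes like co.uk (something like the Public Suffix List would be needed there).

```python
from collections import defaultdict


def suggest_wildcards(domains, min_children=2):
    """Return wildcard suggestions for parents with enough distinct children.

    Hypothetical helper: assumes a parent must have at least two labels.
    """
    # Map each parent suffix to the set of labels seen directly beneath it.
    children = defaultdict(set)
    for domain in domains:
        labels = domain.lower().rstrip(".").split(".")
        # i marks where the parent starts; stop while the parent still has 2 labels.
        for i in range(1, len(labels) - 1):
            parent = ".".join(labels[i:])
            children[parent].add(labels[i - 1])
    return sorted(
        "*." + parent
        for parent, kids in children.items()
        if len(kids) >= min_children
    )


sample = [
    "windowsupdate.microsoft.com",
    "update.microsoft.com",
    "windowsupdate.com",
    "download.windowsupdate.com",
    "download.microsoft.com",
    "test.stats.update.microsoft.com",
    "ntservicepack.microsoft.com",
    "au.windowsupdate.com",
    "tlu.dl.delivery.mp.microsoft.com",
]
print(suggest_wildcards(sample))
# prints ['*.microsoft.com', '*.windowsupdate.com']
```

Because children are stored in a set, duplicate lines in the blocklist do not count twice, and a deep name such as test.stats.update.microsoft.com still counts as one child ("update") of microsoft.com.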
Any POSITIVE suggestions?
Thanks,
Jayke