HELP: Python Program to Generate/Suggest Wildcards from Blocklist

Ok, so here’s the scenario:

  • I have a blocklist with multiple subdomains, sub-subdomains, and so on. If more than one entry shares the same parent at the domain, subdomain, or sub-subdomain level, the program should find that parent and recommend it as a wildcard. For example, for the list below it should suggest *.microsoft.com and *.windowsupdate.com. For a more complex list containing, say, sub1.subdomain.domain.com and sub2.subdomain.domain.com, it would recommend *.subdomain.domain.com.
windowsupdate.microsoft.com
update.microsoft.com
windowsupdate.com
download.windowsupdate.com
download.microsoft.com
test.stats.update.microsoft.com
ntservicepack.microsoft.com
au.windowsupdate.com
tlu.dl.delivery.mp.microsoft.com

I already have a rough start on this in Python:

import re
domains = [
    "windowsupdate.microsoft.com",
    "update.microsoft.com",
    "windowsupdate.com",
    "download.windowsupdate.com",
    "download.microsoft.com",
    "test.stats.update.microsoft.com",
    "ntservicepack.microsoft.com",
    "au.windowsupdate.com",
    "tlu.dl.delivery.mp.microsoft.com",
    "tlu.dl.delivery.mp.microsoft.com"
]
# Or load the list from a file (this replaces the sample list above)
with open("/private/tmp/blocklists/Social/twitter.txt") as f:
    domains = [line.rstrip('\n') for line in f]
rootdomains = []
subdomains = []
subsubdomains = []

# https://go.jayke.net/YtB
domainregex = r"\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"

# Ensure that we are working with valid domains
# (iterate over a copy: removing from a list while looping over it skips entries)
for domain in list(domains):
    if re.match(domainregex, domain):
        print("Domain Valid: {}".format(domain))
    else:
        domains.remove(domain)
        print("Domain Invalid: {}".format(domain))

# Experimental stuff
for domain in domains:
    parts = domain.split(".")
    rootdomain = parts[(len(parts) - 2):]
    subdomain = parts[(len(parts) - 3):]
    subsubdomain = parts[(len(parts) - 4):]
    if rootdomain not in rootdomains:
        rootdomains.append(rootdomain)
    if subdomain not in subdomains:
        subdomains.append(subdomain)

#print(rootdomains)
#print(subdomains)

for rootdomain in rootdomains:
    rdomain = ".".join(rootdomain)
    for subdomain in subdomains:
        sdomain = ".".join(subdomain)
        if sdomain in rootdomain:
            print("You might want to wildcard: {}".format(rdomain))
        for subsubdomain in subsubdomains:
            ssdomain = ".".join(subsubdomain)
            if ssdomain in subdomain:
                print("You might want to wildcard: {}".format(ssdomain))

It’s not perfect; it’s more of a proof of concept, and I could only get it to work at the root-domain level.

Any POSITIVE suggestions?

Thanks,

Jayke

To generate a standard Pi-hole wildcard regex:

def genWildcardRegex(self, wildcard):
    # Raw strings keep the regex backslashes literal
    base = r"(^|\.){}$"
    parts = wildcard.strip("*.").split(".")
    if len(parts[0]) > 1:
        regex = base.format(r"\.".join(parts))
        return regex
    else:
        print("ERROR\t Malformed Wildcard Domain:: " + wildcard)
        return

Taken from here
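
To see what that helper produces, here is a standalone sketch of the same idea. The name wildcard_to_regex is made up for illustration and is not part of Pi-hole or the linked project. Escaping each label with re.escape and raising instead of printing are my own additions.

```python
import re

def wildcard_to_regex(wildcard):
    """Turn '*.example.com' into a Pi-hole style wildcard regex string."""
    # Hypothetical helper: drop only a leading '*.' if present
    domain = wildcard[2:] if wildcard.startswith("*.") else wildcard
    labels = domain.split(".")
    # An empty label means something like 'example..com' or a stray dot
    if not all(labels):
        raise ValueError("Malformed wildcard domain: " + wildcard)
    # (^|\.) anchors the match at the start of the name or at a label boundary
    return r"(^|\.)" + r"\.".join(re.escape(label) for label in labels) + "$"

pattern = wildcard_to_regex("*.microsoft.com")
print(pattern)
print(bool(re.search(pattern, "update.microsoft.com")))  # subdomain: matches
print(bool(re.search(pattern, "notmicrosoft.com")))      # different domain: no match
```

The leading (^|\.) group is what stops notmicrosoft.com from matching while still allowing both microsoft.com itself and any of its subdomains.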

Here’s what I do

def remove_subdomains(hosts):

    # Conditionally exit if hosts not provided
    if not hosts:
        return

    # Create set to store the hosts that remain
    cleaned_hosts = set()
    # Set prev tracker to None
    prev = None
    # Reverse each host
    rev_hosts = [host[::-1] for host in hosts]
    # Sort reversed hosts
    rev_hosts.sort()

    # For each host
    for host in rev_hosts:
        # If the domain is not a subdomain of the previously
        # kept host
        if not host.startswith(f'{prev}.'):
            # Un-reverse the current host and add it to the set
            # (adding prev here would drop the last host in the list)
            cleaned_hosts.add(host[::-1])
            # Set previous domain to the current iteration
            prev = host

    return cleaned_hosts

remove_subdomains(your_set_here)

It doesn’t require regex, so it’s very quick.

It reverses and sorts the domains:

moc.elpmaxe
moc.elpmaxe.tset

You then loop through the sorted, reversed list and ask whether the current entry starts with the previously kept domain plus a dot (that is, whether it is a subdomain). If it does, do nothing. When you hit an entry that doesn’t match, it’s a new domain: add it to the set, make it the new comparison target, and keep looping.
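
One caveat with sorting the reversed strings character by character: '-' sorts before '.', so a host such as my-example.com (reversed: moc.elpmaxe-ym) lands between moc.elpmaxe and moc.elpmaxe.tset and breaks the "compare with the previous one" step. Below is a minimal sketch that sidesteps this by sorting and comparing tuples of labels instead of raw strings. The name collapse_subdomains is hypothetical, and this is a variation on the idea above rather than the exact function I posted.

```python
def collapse_subdomains(hosts):
    """Return the hosts that are not subdomains of another host in the input."""
    # Represent each host as a tuple of labels, root first:
    # 'test.example.com' -> ('com', 'example', 'test')
    keys = sorted(tuple(reversed(h.split("."))) for h in set(hosts))
    kept = set()
    prev = None
    for key in keys:
        # A subdomain has the previously kept host as a label-wise prefix
        if prev is not None and key[:len(prev)] == prev:
            continue
        kept.add(".".join(reversed(key)))
        prev = key
    return kept

hosts = ["test.example.com", "my-example.com", "example.com", "a.b.example.com"]
print(sorted(collapse_subdomains(hosts)))
# → ['example.com', 'my-example.com']
```

Sorting tuples keeps every subdomain directly after its parent, whatever characters the labels contain, at the cost of a split per host.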

This one identifies domains whose number of subdomains reaches a given limit and records the count for each in a dictionary, e.g. {'test.com': 55}

def identify_wildcards(hosts, limit=50):

    # Conditionally exit if hosts not provided
    if not hosts:
        return

    # Create dict to store wildcards and their subdomain counts
    wildcards = {}
    # Set prev tracker to None
    prev = None
    # Set subdomain counter to 0
    i = 0
    # Reverse each host
    rev_hosts = [host[::-1] for host in hosts]
    # Sort reversed hosts
    rev_hosts.sort()

    # For each host
    for host in rev_hosts:
        # If the domain is not a subdomain of the previous
        # iteration
        if not host.startswith(f'{prev}.'):
            # If our previous host had at least as many
            # subdomains as the limit
            if prev is not None and i >= limit:
                # Add to wildcards dict
                wildcards[prev[::-1]] = i
            # Set previous domain to the current iteration
            prev = host
            # Reset the counter
            i = 0
        else:
            # Current iteration is a subdomain of the last
            # so increment the counter
            i += 1

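    # Check the final group, which the loop above never records
    if prev is not None and i >= limit:
        wildcards[prev[::-1]] = i

    # Sort dict on sub-domain count (desc)
    wildcards = dict(sorted(wildcards.items(), key=lambda x: x[1], reverse=True))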

    return wildcards
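
For the small list in the original question, where the threshold is effectively "more than one entry", a different approach may be simpler: count how many listed hosts fall under each parent suffix. This is only a rough sketch. It assumes every entry has a two-label registrable domain, so co.uk-style suffixes would need a public suffix list, and the names suggest_wildcards and min_count are made up for illustration.

```python
from collections import Counter

def suggest_wildcards(hosts, min_count=2):
    """Map each parent suffix to the number of listed hosts beneath it."""
    counts = Counter()
    for host in set(hosts):
        labels = host.split(".")
        # Every proper suffix with at least two labels is a candidate parent:
        # a.b.example.com -> b.example.com, example.com
        for i in range(1, len(labels) - 1):
            counts[".".join(labels[i:])] += 1
    # Keep parents with enough hosts under them, most populated first
    return {"*." + parent: n for parent, n in counts.most_common() if n >= min_count}

domains = [
    "windowsupdate.microsoft.com",
    "update.microsoft.com",
    "windowsupdate.com",
    "download.windowsupdate.com",
    "download.microsoft.com",
    "test.stats.update.microsoft.com",
    "ntservicepack.microsoft.com",
    "au.windowsupdate.com",
    "tlu.dl.delivery.mp.microsoft.com",
]
for wildcard, n in suggest_wildcards(domains).items():
    print(wildcard, n)
# → *.microsoft.com 6
# → *.windowsupdate.com 2
```

Because every level is counted, a list containing sub1.subdomain.domain.com and sub2.subdomain.domain.com would report both *.subdomain.domain.com and *.domain.com; you would then pick whichever level you want to block.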

Thanks for the help!