HELP: Python Program to Generate/Suggest Wildcards from Blocklist

jaykepeters · July 17, 2019, 5:09am

Ok, so here's the scenario:

I have a blocklist with multiple subdomains, sub-subdomains, and so on. If more than one entry falls under the same domain, subdomain, or sub-subdomain, the program should find that parent and recommend it as a wildcard. For example, given the list below, it should suggest *.microsoft.com and *.windowsupdate.com. For a more complex list, you could have, say, sub1.subdomain.domain.com and sub2.subdomain.domain.com, in which case subdomain.domain.com would be recommended.

windowsupdate.microsoft.com
update.microsoft.com
windowsupdate.com
download.windowsupdate.com
download.microsoft.com
test.stats.update.microsoft.com
ntservicepack.microsoft.com
au.windowsupdate.com
tlu.dl.delivery.mp.microsoft.com

I already kind of have a start to this, using python:

import re
domains = [
    "windowsupdate.microsoft.com",
    "update.microsoft.com",
    "windowsupdate.com",
    "download.windowsupdate.com",
    "download.microsoft.com",
    "test.stats.update.microsoft.com",
    "ntservicepack.microsoft.com",
    "au.windowsupdate.com",
    "tlu.dl.delivery.mp.microsoft.com",
    "tlu.dl.delivery.mp.microsoft.com"
]
domains = [line.rstrip('\n') for line in open("/private/tmp/blocklists/Social/twitter.txt")]
rootdomains = []
subdomains = []
subsubdomains = []

# https://go.jayke.net/YtB
domainregex = r"\b((?=[a-z0-9-]{1,63}\.)(xn--)?[a-z0-9]+(-[a-z0-9]+)*\.)+[a-z]{2,63}\b"

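# Ensure that we are working with valid domains
valid_domains = []
for domain in domains:
    if re.match(domainregex, domain):
        print("Domain Valid: {}".format(domain))
        valid_domains.append(domain)
    else:
        print("Domain Invalid: {}".format(domain))
# Rebuild the list instead of calling remove() while iterating,
# which would skip the entry after each removed one
domains = valid_domains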

# Experimental stuff
for domain in domains:
    parts = domain.split(".")
    rootdomain = parts[(len(parts) - 2):]
    subdomain = parts[(len(parts) - 3):]
    subsubdomain = parts[(len(parts) - 4):]
    if rootdomain not in rootdomains:
        rootdomains.append(rootdomain)
    if subdomain not in subdomains:
        subdomains.append(subdomain)

#print(rootdomains)
#print(subdomains)

for rootdomain in rootdomains:
    rdomain = ".".join(rootdomain)
    for subdomain in subdomains:
        sdomain = ".".join(subdomain)
        if sdomain in rootdomain:
            print("You might want to wildcard: {}".format(rdomain))
        for subsubdomain in subsubdomains:
            ssdomain = ".".join(subsubdomain)
            if ssdomain in subdomain:
                print("You might want to wildcard: {}".format(ssdomain))

It's not perfect, but it is more of a proof-of-concept type of thing, but I could only get it to work at the root-domain-level.
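
One possible way past the root-domain level, sketched under some assumptions: for every parent suffix of every entry, count how many distinct listed hosts sit beneath it, then suggest the parents that cover at least two. The names `suggest_wildcards` and `min_hosts` are made up for illustration. The sketch naively skips single-label suffixes such as `com` and does not consult a public suffix list, so multi-part suffixes like `co.uk` would need extra handling.

```python
from collections import defaultdict


def suggest_wildcards(domains, min_hosts=2):
    """Map each parent suffix to the number of distinct listed hosts below it."""
    descendants = defaultdict(set)
    for domain in domains:
        host = domain.strip().strip(".").lower()
        labels = host.split(".")
        # Walk every proper parent suffix that still has at least two labels,
        # e.g. a.b.example.com -> b.example.com, example.com
        for i in range(1, len(labels) - 1):
            descendants[".".join(labels[i:])].add(host)
    return {
        parent: len(hosts)
        for parent, hosts in descendants.items()
        if len(hosts) >= min_hosts
    }


blocklist = [
    "windowsupdate.microsoft.com",
    "update.microsoft.com",
    "windowsupdate.com",
    "download.windowsupdate.com",
    "download.microsoft.com",
    "test.stats.update.microsoft.com",
    "ntservicepack.microsoft.com",
    "au.windowsupdate.com",
    "tlu.dl.delivery.mp.microsoft.com",
]

for parent, count in sorted(suggest_wildcards(blocklist).items()):
    print("You might want to wildcard: *.{} ({} entries)".format(parent, count))
```

With the example list this reports *.microsoft.com (6 entries) and *.windowsupdate.com (2 entries). update.microsoft.com is not suggested because only one listed host sits below it.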

Any POSITIVE suggestions?

Thanks,

Jayke

jaykepeters · July 17, 2019, 5:13am

To generate a standard Pi-hole wildcard regex:

def genWildcardRegex(self, wildcard):
        # Raw strings keep the backslashes literal for the regex engine
        base = r"(^|\.){}$"
        parts = wildcard.strip("*.").split(".")
        if len(parts[0]) > 1:
            regex = base.format(r"\.".join(parts))
            return regex
        else:
            print("ERROR\t Malformed Wildcard Domain: " + wildcard)
            return

Taken from here
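
To see what such a pattern accepts, here is a small check using Python's `re` module. Pi-hole does not use Python's regex engine, so treat this as an approximation that should hold for a pattern this simple. The helper name `matches` is made up for illustration.

```python
import re

# Pattern in the form the function above returns for "*.microsoft.com"
PATTERN = r"(^|\.)microsoft\.com$"


def matches(pattern, host):
    """Return True if the pattern is found anywhere in host."""
    return re.search(pattern, host) is not None


for host in (
    "microsoft.com",
    "update.microsoft.com",
    "notmicrosoft.com",
    "microsoft.com.example.org",
):
    print(host, matches(PATTERN, host))
```

The first two hosts match. The last two do not, because the pattern requires either the start of the string or a literal dot directly before "microsoft", and the end of the string directly after "com".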

mmotti · July 18, 2019, 9:25am

Here's what I do

def remove_subdomains(hosts):

    # Conditionally exit if hosts not provided
    if not hosts:
        return

    # Create set to store wildcards
    cleaned_hosts = set()
    # Set prev tracker to None
    prev = None
    # Reverse each host
    rev_hosts = [host[::-1] for host in hosts]
    # Sort reversed hosts
    rev_hosts.sort()

    # For each host
    for host in rev_hosts:
        # If the domain is not a subdomain of the previous
        # iteration
        if prev is None or not host.startswith(f'{prev}.'):
            # Un-reverse the current host and add it to the set
            # (adding prev here would drop the last unique domain)
            cleaned_hosts.add(host[::-1])
            # Set previous domain to the current iteration
            prev = host

    return cleaned_hosts

remove_subdomains(your_set_here)

Doesn't require regex so it's very quick.

It reverses and sorts the domains:

moc.elpmaxe
moc.elpmaxe.tset

So you loop through the reversed, sorted list and ask: does the current entry start with the previous domain plus a "." (i.e. is it a subdomain)? If it does, skip it. When you hit one that doesn't match the current comparison criterion, it's a new domain, so add it to the list, set it as the new comparison criterion, and keep looping.
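
A variant of the same idea, as a sketch rather than a drop-in replacement: sort on tuples of reversed labels instead of reversed characters. One reason to consider it is that "-" sorts before "." as a character, so a reversed string like moc.elpmaxe-a lands between moc.elpmaxe and moc.elpmaxe.tset and interrupts the run of subdomains. The function name `collapse_subdomains` is hypothetical.

```python
def collapse_subdomains(hosts):
    """Keep only hosts that are not subdomains of another listed host."""
    # Each key looks like ("com", "example", "test"); a parent's key is a
    # strict prefix of every subdomain key and sorts directly before them
    keys = sorted({tuple(host.lower().split(".")[::-1]) for host in hosts})

    kept = []
    prev = None
    for key in keys:
        # Skip the host if the last kept key is a label prefix of it
        if prev is not None and key[:len(prev)] == prev:
            continue
        kept.append(".".join(reversed(key)))
        prev = key
    return kept


print(collapse_subdomains([
    "test.example.com",
    "example.com",
    "a-example.com",
    "other.org",
]))
```

Here test.example.com is dropped because example.com is listed, while a-example.com and other.org are kept as separate domains.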

mmotti · July 18, 2019, 9:54am

This one will identify domains that have at least a certain number of subdomains in the list (the limit) and record how many there actually are in a dictionary, e.g. {test.com: 55}

def identify_wildcards(hosts, limit=50):

    # Conditionally exit if hosts not provided
    if not hosts:
        return

    # Create dict to store wildcards and their subdomain counts
    wildcards = {}
    # Set prev tracker to None
    prev = None
    # Set iterator to 0
    i = 0
    # Reverse each host
    rev_hosts = [host[::-1] for host in hosts]
    # Sort reversed hosts
    rev_hosts.sort()

    # For each host
    for host in rev_hosts:
        # If the domain is not a subdomain of the previous
        # iteration
        if not host.startswith(f'{prev}.'):
            # If our previous host had more subdomains
            # than the limit
            if i >= limit:
                # Add to wildcards dict
                wildcards[prev[::-1]] = i
            # Set previous domain to the current iteration
            prev = host
            # Reset the iterator
            i = 0
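        else:
            # Current iteration is a subdomain of the last
            # so increment the counter
            i += 1

    # The loop only records a domain when the next one begins,
    # so check the final domain separately
    if prev is not None and i >= limit:
        wildcards[prev[::-1]] = i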

    # Sort dict on sub-domain count (desc)
    wildcards = {k: v for k, v in sorted(wildcards.items(), key=lambda x: x[1], reverse=True)}

    return wildcards

jaykepeters · July 20, 2019, 7:48am

Thanks for the help!

jaykepeters · January 22, 2020, 4:10am

@mmotti So now if we had the following:

How to clump them as so:
domain.com = 4, 3, 2, 1
sub.domain.com = 3, 2, 1
sub1.sub.domain.com = 1, 2

So it's more about clustering them than finding the largest similarity between them. Essentially, the last part of a reversed domain would be dropped, since it would not match other domains of the same length. So if all our sites are subdomains with a length of 3 (split by '.'), and we have one with a length of 4, we know that the last part of that one is not a match, and therefore it is not eligible for grouping. Do you see where I am coming from?

I want to create a way to get granular regexes, so users can go as far down the subdomain chain as they want, or stay as close to the root TLD as possible...
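
A rough sketch of that granularity, assuming "depth" means the number of trailing labels to group on: each domain is filed under its last `depth` labels, and domains that are too short for that depth are left out. The name `group_by_depth` and its parameters are hypothetical.

```python
from collections import defaultdict


def group_by_depth(domains, depth):
    """Group domains under the suffix formed by their last `depth` labels."""
    groups = defaultdict(list)
    for domain in domains:
        labels = domain.split(".")
        if len(labels) < depth:
            # Too short to have a suffix of this depth, so not eligible
            continue
        groups[".".join(labels[-depth:])].append(domain)
    return dict(groups)


sample = [
    "sub.domain.com",
    "sub1.sub.domain.com",
    "sub02.sub.domain.com",
    "sub2.sub1.sub.domain.com",
]

for depth in (2, 3, 4):
    print(depth, group_by_depth(sample, depth))
```

At depth 2 and 3 all four entries share one group (domain.com and sub.domain.com respectively). At depth 4 they split into sub1.sub.domain.com with two entries and sub02.sub.domain.com with one, and sub.domain.com itself drops out.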

So if I had this example from my code:
['com', 'google', 'support']
['com', 'google', 'drive']
['com', 'google', 'www']

I would somehow like to do this in parallel and say: hey, we have three coms, they match. Hey, we have three googles, they match. Hey, wait a minute, support, drive, and www do not match, and they are all element [2]. Therefore it is safe to say that there is no further grouping that can be done.
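
That position-by-position comparison can be sketched with `zip`, which yields all elements [0] together, then all elements [1], and so on, stopping at the shortest list. The function name `common_labels` is made up for illustration.

```python
def common_labels(label_lists):
    """Return the leading labels shared by every reversed domain."""
    common = []
    for column in zip(*label_lists):
        # Stop at the first position where the labels differ
        if len(set(column)) != 1:
            break
        common.append(column[0])
    return common


reversed_domains = [
    ["com", "google", "support"],
    ["com", "google", "drive"],
    ["com", "google", "www"],
]

shared = common_labels(reversed_domains)
print(".".join(reversed(shared)))  # google.com
```

Grouping stops at element [2] because support, drive, and www differ, leaving google.com as the deepest common parent.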

My Tinkering:

## RegeXgen
import json
domains = [
    "www.google.com",
    "drive.google.com",
    "support.google.com",
    "becker.k12.mn.us",
    "sartell.k12.mn.us",
    "sub.domain.com",
    "sub1.sub.domain.com",
    "sub02.sub.domain.com",
    "sub2.sub1.sub.domain.com"
]
dom_a_ins = []

## Let's separate each domain into its subparts (REVERSED)
for domain in domains:
    dom_a_ins.append(domain.split('.')[::-1])
    
## Now, let's sort this array of domain subparts from greatest to least (OTHER WAY AROUND????????) for sorting...
dom_a_ins = sorted(dom_a_ins, key=len)[::-1]