Limited parsing of AdBlock-style lists

Blockit · December 2, 2019, 10:12pm

Hi,
I'm not happy about support for AdBlock-style lists being dropped. However, I can understand your reasons, as it was an implementation that easily led to false positives.

Still, I'm in favour of having such an (optional) feature, because e.g. uBlock Origin's best practice for blocking domains is to block on the document level:

! Using request of type document will cause the whole site to be blocked through
! strict blocking, yet the site will render properly if a user still decide[s] to
! go ahead [(temporary whitelisting)].

Source: GitHub - uBlockOrigin/uAssets: Resources for uBlock Origin, uMatrix: static filter lists, ready-to-use rulesets, etc.

That means rules are formatted like this:

||adisalesde.com^$document
||adk-praxis.de^$document
||adlogistics.nl^$document

Source: https://raw.githubusercontent.com/stonecrusher/filterlists/master/watchlist-internet.txt

So "parsing" should be just a RegEx job: not trying to catch as many (sub)domains as possible, but discarding everything unknown. That way it's much more likely to grab only the "block the whole domain" rules out of AdBlock-style lists and discard all the more fine-grained filters.

RegEx example trying to catch known domain-wide filter rules (wildcard support and adblock-specific flags optional):
regex101: build, test, and debug regex
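
To illustrate the "keep only what is known" idea, here is a minimal Python sketch. The pattern is a deliberately strict stand-in and not the one behind the regex101 link; the function name `extract_domains` and the non-matching sample lines are made up for the example.

```python
import re

# Illustrative, deliberately strict pattern (an assumption, not the regex101 one):
# "||" + domain + "^", optionally followed by exactly "$document".
DOMAIN_RULE = re.compile(r"^\|\|([a-z0-9.-]+\.[a-z]{2,})\^(?:\$document)?$")


def extract_domains(lines):
    """Return the domains of whole-domain block rules; silently skip everything else."""
    domains = []
    for line in lines:
        m = DOMAIN_RULE.match(line.strip())
        if m:
            domains.append(m.group(1))
    return domains


sample = [
    "! Title: demo list",              # comment
    "||adisalesde.com^$document",      # whole-domain rule
    "||adk-praxis.de^$document",       # whole-domain rule
    "||adlogistics.nl^$document",      # whole-domain rule
    "||example.com/ads/banner.js",     # path-specific filter, discarded
    "example.com##.ad-banner",         # cosmetic filter, discarded
    "@@||example.org^$document",       # exception rule, discarded
]

print(extract_domains(sample))
```

Anything the pattern does not recognise (comments, cosmetic filters, exception rules, path-specific filters) is dropped rather than guessed at, which is the point of the approach.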

What are your thoughts on this?
Oh, and somehow I missed the discussion on this, so I won't be mad if you reply with links and money quotes.

Thanks,
blockit

jfb · December 2, 2019, 10:48pm

Here is a related discussion on this topic.

jfb · December 2, 2019, 10:50pm

I believe that a number of public blocklists have already done this parsing and have HOSTS based blocklists derived from AdBlock formatted lists.

Blockit · December 3, 2019, 10:45pm

Here is a related discussion on this topic.

Ah ok, I thought there was more (I had seen that one, but found no satisfying answers).

I believe that a number of public blocklists have already done this parsing and have HOSTS based blocklists derived from AdBlock formatted lists

Forcing ~~everyone~~ every list maintainer to publish different versions is not a good solution.
E.g. the list mentioned above is not available in hosts format.

Found piholeparser but it seems pretty dead?!

I ran EasyList (imho worst possible choice) through my RegEx:

# -*- coding: utf-8 -*-
"""
Created on Tue Dec  3 22:47:57 2019
@description: Run a file through a RegExp and write the matched domains to another file
@author: blockit
"""
import re

# block $third-party, allow wildcard
# Note: inside the last character class, \a is the bell escape in Python's re,
# so [\a-z,=~-] spans \x07 through "z" (digits, "." and uppercase included).
regex = re.compile(r'^(?:https?:\/\/|\|\|)?([a-z\d*](?=.*\.)[a-z\d.*-]{3,252})(?:\/|\^|\^\$[\w,]*?(?:document|third-party|3p|all)[\a-z,=~-]*?)?$')

# Both files are closed automatically when the with-block ends
with open("input.txt", "r") as infile, open("output.txt", "w") as outfile:
    # Iterate over the input line by line
    for line in infile:
        m = regex.match(line)
        # Keep only the captured domain of lines that look like domain-wide rules
        if m:
            outfile.write(m.group(1) + "\n")

Block $third-party, allow wildcard: 72746 lines down to 23601 domains
Block $third-party, forbid wildcard: 72746 lines down to 23583 domains
Do not block $third-party, forbid wildcard: 72746 lines down to 1140 domains

Execution took < 0.13 seconds on my laptop.
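
The three configurations above can be thought of as two switches on the pattern. The following is a rough sketch of that idea only: `build_pattern` and `extract` are hypothetical helpers, the pattern is much simpler than the RegEx in the script (it accepts a single option after `$`, not a comma-separated list), and the sample rules are invented.

```python
import re


def build_pattern(allow_wildcard, accept_third_party):
    """Compile a simplified domain-rule pattern for one strictness setting."""
    # Characters allowed in the domain part; "*" only when wildcards are allowed.
    chars = r"a-z\d.*-" if allow_wildcard else r"a-z\d.-"
    # Options that still count as "block the whole domain".
    options = ["document", "all"]
    if accept_third_party:
        options += ["third-party", "3p"]
    opts = "|".join(options)
    return re.compile(rf"^\|\|([{chars}]+\.[a-z]{{2,}})\^(?:\$(?:{opts}))?$")


def extract(rules, pattern):
    """Return the captured domain of every rule the pattern accepts."""
    return [m.group(1) for m in map(pattern.match, rules) if m]


rules = [
    "||ads.example.com^",
    "||tracker.example.net^$third-party",
    "||*.adserver.example^$third-party",
    "||cdn.example.org^$script",        # resource-type filter, never accepted
    "||example.com/banner.gif",         # path-specific filter, never accepted
]

for allow_wildcard, accept_third_party in [(True, True), (False, True), (False, False)]:
    kept = extract(rules, build_pattern(allow_wildcard, accept_third_party))
    print(allow_wildcard, accept_third_party, len(kept), kept)
```

Each switch that is turned off can only shrink the result, which matches the direction of the numbers above; the actual counts depend on the real pattern and list.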

I would make a PR, but I don't have the know-how.
Of course, before running that RegEx over a list, there should be a check whether the list is AdBlock-style. Otherwise I'd have to extend the RegEx to also accept hosts-style entries, which shouldn't be a problem.
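
Such a check could be a simple heuristic. The sketch below is one possible approach, not anything Pi-hole does: `looks_like_adblock`, the sample size and the 50% threshold are all assumptions, and it relies on AdBlock-style lists commonly starting with an `[Adblock Plus ...]` header or consisting mostly of `||`, `@@`, `!` or `##` lines.

```python
import re

# Header line commonly found at the top of AdBlock-style lists.
ADBLOCK_HEADER = re.compile(r"^\[Adblock(?: Plus)?[^\]]*\]", re.IGNORECASE)
# Line prefixes typical for AdBlock syntax: block rule, exception rule, comment.
ADBLOCK_MARKERS = ("||", "@@", "!")


def looks_like_adblock(lines, sample_size=200, threshold=0.5):
    """Guess whether a list is AdBlock-style from its first non-empty lines."""
    sample = [ln.strip() for ln in lines if ln.strip()][:sample_size]
    if not sample:
        return False
    if ADBLOCK_HEADER.match(sample[0]):
        return True
    hits = sum(
        1
        for ln in sample
        # "##" marks a cosmetic filter, unless the line is a "#" hosts comment.
        if ln.startswith(ADBLOCK_MARKERS) or ("##" in ln and not ln.startswith("#"))
    )
    return hits / len(sample) >= threshold


adblock_list = ["! Title: demo", "||adisalesde.com^$document", "||adk-praxis.de^$document"]
hosts_list = ["# hosts demo", "0.0.0.0 ads.example.com", "127.0.0.1 tracker.example.net"]

print(looks_like_adblock(adblock_list))
print(looks_like_adblock(hosts_list))
```

A list that fails the check would be handled as before, so the domain-rule RegEx only ever sees lists that actually look like AdBlock syntax.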

Hail RegEx!

jfb · December 3, 2019, 11:02pm

Everyone does not have to do this. A single list maintainer can do this.

Blockit · December 4, 2019, 12:45am

I thought that was implied as not everyone is publishing anything at all.

Blockit · December 8, 2019, 4:29pm

So I was fiddling around a bit more and the regex is now capable of digesting nearly every list format:
regex101: build, test, and debug regex

Performance is good. It always depends on how strict you want to be. I'd rather choose to be too strict to prevent false positives.

Current RegEx

allows wildcards
discards "specific" filters (not domain-like)
blocks domains that were only blocked with $third-party (after checking the lists I found this to be reasonable, especially considering how "parsing" worked before).

Just have a look at my link above.

As the devs seem to lean against that kind of list support, it wouldn't hurt to make it optional and off by default. But imho it's a feature asked for too often to be dropped completely just because the latest implementation was faulty.