Limited parsing of AdBlock-style lists

Hi,
I’m not happy about the drop of support for AdBlock-style lists. However, I can understand your reasons, as it was an implementation that easily led to false positives.

Still, I’m in favour of having such an (optional) feature, because e.g. uBlock Origin’s best practice for blocking domains is to block on the document level:

! Using request of type document will cause the whole site to be blocked through
! strict blocking, yet the site will render properly if a user still decide[s] to
! go ahead [(temporary whitelisting)].

Source: https://github.com/uBlockOrigin/uAssets

That means rules are formatted like this:

||adisalesde.com^$document
||adk-praxis.de^$document
||adlogistics.nl^$document

Source: https://raw.githubusercontent.com/stonecrusher/filterlists/master/watchlist-internet.txt

So “parsing” should be just a RegEx job. The goal is not to catch as many (sub)domains as possible, but to eliminate everything unknown. That way it’s much more likely to grab only the “block the whole domain” rules out of AdBlock-style lists and to discard all the more fine-grained filters.

RegEx example trying to catch known domain-wide filter rules (wildcard support and adblock-specific flags optional):
https://regex101.com/r/ni7Uac/4
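To make the “eliminate everything unknown” idea concrete, here is a minimal sketch. It is not the RegEx from the link above: `DOMAIN_RULE` and `domain_wide_rules` are hypothetical names, and the pattern is deliberately stricter, accepting only `||domain^` with an optional `$document` option.

```python
import re

# Hypothetical, deliberately strict pattern (an assumption, not the linked RegEx):
# "||", a plain domain, "^", then optionally "$document". Nothing else passes.
DOMAIN_RULE = re.compile(
    r"^\|\|((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63})\^(?:\$document)?$"
)

def domain_wide_rules(lines):
    """Return the domains of rules that block a whole domain."""
    domains = []
    for line in lines:
        m = DOMAIN_RULE.match(line.strip())
        if m:
            domains.append(m.group(1))
    return domains

sample = [
    "||adisalesde.com^$document",
    "||adk-praxis.de^$document",
    "||example.com/ads/banner.js",  # path-specific filter, discarded
    "example.com##.ad-box",         # cosmetic filter, discarded
    "! comment line",               # comment, discarded
]
print(domain_wide_rules(sample))
```

Anything the pattern does not recognise is dropped silently, which trades coverage for fewer false positives.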

What are your thoughts on this?
Oh, and somehow I missed the discussion on this, so I won’t be mad if you reply with links and money quotes.

Thanks,
blockit

Here is a related discussion on this topic.

I believe that a number of public blocklists have already done this parsing and have HOSTS based blocklists derived from AdBlock formatted lists.

Here is a related discussion on this topic.

Ah ok, I thought there was more (seen that one but no satisfying answers).

I believe that a number of public blocklists have already done this parsing and have HOSTS based blocklists derived from AdBlock formatted lists

Forcing every list maintainer to publish different versions is not a good solution.
E.g. the list mentioned above is not available in hosts format.

Found piholeparser but it seems pretty dead?!

I ran EasyList (imho worst possible choice) through my RegEx:
# -*- coding: utf-8 -*-
"""
Created on Tue Dec  3 22:47:57 2019
@description: Process a file through a RegExp and give output to another file
@author: blockit
"""
import re

# block $third-party, allow wildcard
regex = re.compile(r'^(?:https?:\/\/|\|\|)?([a-z\d*](?=.*\.)[a-z\d.*-]{3,252})(?:\/|\^|\^\$[\w,]*?(?:document|third-party|3p|all)[\a-z,=~-]*?)?$')

# Read input.txt line by line and write each captured domain to output.txt
with open("input.txt", "r") as inputFile, open("output.txt", "w") as outputFile:
    for line in inputFile:
        m = regex.match(line)
        if m:
            outputFile.write(m.group(1) + "\n")

  • Block $third-party, allow wildcard
    72746 lines down to 23601 domains
  • Block $third-party, forbid wildcard
    72746 lines down to 23583 domains
  • Do not block $third-party, forbid wildcard
    72746 lines down to 1140 domains

Execution took < 0.13 seconds on my laptop.
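The three variants above differ only in two switches: whether `*` is allowed in the domain part and whether `$third-party` rules count as domain-wide. A rough sketch of how those switches could be exposed follows. `build_pattern` and its parameters are hypothetical, and the pattern is a simplified stand-in for the RegEx used for the numbers above, so it would not reproduce those counts.

```python
import re

def build_pattern(allow_wildcard=True, include_third_party=True):
    """Build a simplified domain-rule pattern from two switches (illustrative only)."""
    # Characters permitted in the captured domain part.
    chars = r"[a-z\d.*-]" if allow_wildcard else r"[a-z\d.-]"
    # Options after "$" that are treated as "block the whole domain".
    options = ["document", "all"]
    if include_third_party:
        options += ["third-party", "3p"]
    opts = "|".join(options)
    return re.compile(rf"^\|\|({chars}+\.{chars}+)\^(?:\$(?:{opts}))?$")

strict = build_pattern(allow_wildcard=False, include_third_party=False)
loose = build_pattern()

print(bool(strict.match("||ads.example.com^$third-party")))
print(bool(loose.match("||ads.example.com^$third-party")))
```

Keeping both switches off by default would be the conservative choice; turning them on widens the net step by step.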

I would make a PR but I don’t have the know-how.
Of course, before running that RegEx over a list, there should be a check whether the list is AdBlock style. Otherwise I’d have to extend the RegEx to also accept hosts-style lines, which shouldn’t be a problem.
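Such a format check could be a simple heuristic over the first lines of a list. The sketch below is an assumption about how that might look, not an existing Pi-hole function: `looks_like_adblock`, the sample size and the chosen markers (`[Adblock` header, `!` comments, `||` and `@@` prefixes) are all illustrative.

```python
import re
from itertools import islice

# Hosts-style entry: a loopback/null address followed by a hostname.
HOSTS_LINE = re.compile(r"^(?:0\.0\.0\.0|127\.0\.0\.1)\s+\S+")
# Typical AdBlock-style line starts: header, comment, domain anchor, exception.
ADBLOCK_HINT = re.compile(r"^(?:\[Adblock|!|\|\||@@)")

def looks_like_adblock(lines, sample_size=200):
    """Guess the list format by counting format hints in the first lines."""
    adblock = hosts = 0
    for line in islice(lines, sample_size):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and hosts-style comments carry no signal here
        if HOSTS_LINE.match(line):
            hosts += 1
        elif ADBLOCK_HINT.match(line):
            adblock += 1
    return adblock > hosts

print(looks_like_adblock(["[Adblock Plus 2.0]", "! Title: demo", "||a.com^"]))
print(looks_like_adblock(["# hosts file", "0.0.0.0 a.com", "127.0.0.1 b.com"]))
```

A list that gives no clear signal falls through to "not AdBlock style", so the RegEx would only run where the format is reasonably certain.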

Hail RegEx!
:smiley:

Everyone does not have to do this. A single list maintainer can do this.

I thought that was implied as not everyone is publishing anything at all.

So I was fiddling around a bit more and the regex is now capable of digesting nearly every list format:
https://regex101.com/r/ni7Uac/4

Performance is good. It always depends on how strict you want to be. I’d rather choose to be too strict to prevent false positives.

Current RegEx

  • allows wildcards
  • discards “specific” filters (not domain-like)
  • blocks domains which were just blocked $third-party (checking the lists I found this to be reasonable - especially thinking about how “parsing” worked before).

Just have a look at my link above.
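For readers who prefer code over one long RegEx, the same “accept several formats, discard the rest” behaviour can be sketched as a short list of strict patterns tried in turn. This is an illustration under my own assumptions, not the linked RegEx: `DOMAIN`, `PATTERNS` and `to_domain` are hypothetical names, and wildcards are left out for brevity.

```python
import re

# Shared (simplified) domain sub-pattern; an assumption, not the linked RegEx.
DOMAIN = r"((?:[a-z\d-]+\.)+[a-z]{2,63})"

PATTERNS = [
    # hosts style: "0.0.0.0 ads.example.com"
    re.compile(rf"^(?:0\.0\.0\.0|127\.0\.0\.1)\s+{DOMAIN}$"),
    # plain domain list: "ads.example.com"
    re.compile(rf"^{DOMAIN}$"),
    # AdBlock style, domain-wide rules only: "||ads.example.com^$third-party"
    re.compile(rf"^\|\|{DOMAIN}\^(?:\$(?:document|third-party|3p|all))?$"),
]

def to_domain(line):
    """Return the domain for a recognised domain-wide entry, else None."""
    line = line.strip().lower()
    for pattern in PATTERNS:
        m = pattern.match(line)
        if m:
            return m.group(1)
    return None  # unknown or too specific: discard

for entry in ["0.0.0.0 ads.example.com", "||ads.example.com^$third-party",
              "example.com##.ad-box", "||example.com/ads/"]:
    print(entry, "->", to_domain(entry))
```

Each pattern is anchored at both ends, so cosmetic filters and path-specific rules never yield a domain.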

As the devs seem to lean against that kind of list support, it wouldn’t hurt to make it optional and off by default. But imho it’s a feature asked for too often to drop it completely just because the latest implementation was faulty.