Hi,
I'm not happy about the drop of support for AdBlock-style lists. However I can understand your reasons as it was an implementation that easily led to false positives.
Still, I'm in favor of having such an (optional) feature, since e.g. uBlock Origin's best practice for blocking domains is to block on the document level:
! Using request of type document will cause the whole site to be blocked through
! strict blocking, yet the site will render properly if a user still decide[s] to
! go ahead [(temporary whitelisting)].
So "parsing" should be just a RegEx job. The goal is not to catch as many (sub)domains as possible, but to eliminate everything unknown. That way it's much more likely to grab only the "block the whole domain" rules out of AdBlock-style lists and discard all the more fine-grained filters.
RegEx example trying to catch known domain-wide filter rules (wildcard support and adblock-specific flags optional): regex101: build, test, and debug regex
What are your thoughts on this?
Oh and somehow I missed the discussion on this, so I'm not mad if you are replying with links and moneyquotes.
Ah ok, I thought there was more (seen that one but no satisfying answers).
I believe that a number of public blocklists have already done this parsing and have HOSTS-based blocklists derived from AdBlock-formatted lists.
Forcing every list maintainer to publish different versions is not a good solution.
E.g. the list mentioned above is not available in hosts format.
I ran EasyList (imho worst possible choice) through my RegEx:
# -*- coding: utf-8 -*-
"""
Created on Tue Dec 3 22:47:57 2019
@description: Process a file through a RegExp and give output to another file
@author: blockit
"""
import re

# block $third-party, allow wildcard
regex = re.compile(r'^(?:https?:\/\/|\|\|)?([a-z\d*](?=.*\.)[a-z\d.*-]{3,252})(?:\/|\^|\^\$[\w,]*?(?:document|third-party|3p|all)[\a-z,=~-]*?)?$')

# Read input.txt line by line and write every matched domain to output.txt
with open("input.txt", "r") as fileHandler, open("output.txt", "w") as outputFile:
    for line in fileHandler:
        m = regex.match(line)
        if m:
            # group(1) is the domain part of a domain-wide rule
            outputFile.write(m.group(1) + "\n")
Block $third-party, allow wildcard: 72746 lines down to 23601 domains
Block $third-party, forbid wildcard: 72746 lines down to 23583 domains
Do not block $third-party, forbid wildcard: 72746 lines down to 1140 domains
Execution took < 0.13 seconds on my laptop.
I would make a PR but I don't have the knowhow.
Of course, before running that RegEx over a list, there should be a check whether the list is AdBlock-style. Otherwise I'd have to extend the RegEx to also accept hosts-style entries, which shouldn't be a problem.
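To illustrate what such a detection step could look like, here is a rough sketch. The function name `guess_list_format`, the two hint patterns, and the sample size are my own assumptions for illustration, not anything taken from an existing implementation, and a real check would probably need more cases.

```python
import re

# Hypothetical heuristics (assumptions, not from any existing project):
# typical AdBlock syntax markers vs. typical hosts-file entries.
ADBLOCK_HINT = re.compile(r'^(?:\[Adblock|!|\|\||@@)|##|\$[\w,=~-]+$')
HOSTS_HINT = re.compile(r'^(?:0\.0\.0\.0|127\.0\.0\.1|::1?)\s+\S+')


def guess_list_format(lines, sample_size=200):
    """Guess "adblock", "hosts" or "unknown" from the first non-empty lines."""
    adblock = hosts = seen = 0
    for line in lines:
        line = line.strip()
        # Lines starting with '#' are ambiguous (hosts comment or generic
        # cosmetic filter), so they are skipped together with empty lines.
        if not line or line.startswith("#"):
            continue
        seen += 1
        if seen > sample_size:
            break
        if HOSTS_HINT.match(line):
            hosts += 1
        elif ADBLOCK_HINT.search(line):
            adblock += 1
    if adblock > hosts:
        return "adblock"
    if hosts > adblock:
        return "hosts"
    return "unknown"


if __name__ == "__main__":
    print(guess_list_format(["[Adblock Plus 2.0]", "||example.com^"]))
    print(guess_list_format(["# comment", "0.0.0.0 ads.example.com"]))
```

Only lists guessed as "adblock" would then be run through the RegEx. Anything guessed as "unknown" (e.g. a plain list of domains) would be left to the existing parsing.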
Performance is good. It always depends on how strict you want to be. I'd rather choose to be too strict to prevent false positives.
Current RegEx:
- allows wildcards
- discards "specific" filters (not domain-like)
- blocks domains which were just blocked $third-party (checking the lists I found this to be reasonable, especially thinking about how "parsing" worked before).
Just have a look at my link above.
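For anyone who prefers to try it locally, here is a small sketch that runs the RegEx from my script above over a few sample rules. The sample rules and the helper name `extract_domain` are made up for illustration. The RegEx itself is copied unchanged.

```python
import re

# RegEx copied verbatim from the script above
# (block $third-party, allow wildcard).
regex = re.compile(r'^(?:https?:\/\/|\|\|)?([a-z\d*](?=.*\.)[a-z\d.*-]{3,252})(?:\/|\^|\^\$[\w,]*?(?:document|third-party|3p|all)[\a-z,=~-]*?)?$')


def extract_domain(rule):
    """Return the domain of a domain-wide rule, or None if the rule is discarded."""
    m = regex.match(rule)
    return m.group(1) if m else None


# Made-up sample rules, only for illustration
samples = [
    "||example.com^",               # plain domain-wide rule
    "||example.com^$third-party",   # domain rule limited to third-party
    "||*.example.com^",             # wildcard rule
    "||example.com/ads/banner.js",  # path-specific filter
    "||ads.example.com^$script",    # type-specific filter
    "example.com##.ad",             # cosmetic filter
    "@@||example.com^",             # exception rule
]

if __name__ == "__main__":
    for rule in samples:
        print(rule, "->", extract_domain(rule))
```

The first three samples are kept (the wildcard one as `*.example.com`), the remaining four are discarded, which is the "eliminate everything unknown" behaviour I was aiming for.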
As devs seem to tend against that kind of list support, it wouldn't hurt making it optional, default off. But imho it's a feature too often asked for to just drop completely only because the latest implementation was faulty.