Load regexps from "host file"

Hey Guys,

I haven't used Pi-hole in a little while as I have been experimenting with AdGuard Home. I have had a feature request or two from people who wanted to add individual domains to my regexp list, but I have been trying to keep it as broad as possible to avoid forcing my personalised wildcard blocking onto everyone.

There has been some chatter around first party trackers and needing to wildcard block those, which I may like to create a separate list for, but I don't really want to have to look back into writing scripts to check existing regexps, resolve conflicts, import etc all over again.

I haven't looked into Pi-Hole's development for some time so I apologise in advance if this feature is already present, but if not, would it be possible to look into allowing regexps to be loaded in from gravity?

I understand that there may be some concerns over how to determine what is and is not a valid regexp, but it seems to work quite well with AdGuard, and the responsibility for this should lie with the list maintainer.

AdGuard Home treats anything enclosed between / characters as a regexp, e.g.

/^(.+[-_.])??adse?rv(er?|ice)?s?[0-9]*[-.]/
/^(.+[-_.])??m?ad[sxv]?[0-9]*[-_.]/
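As a rough illustration of that convention, here is a minimal Python sketch of how a list parser might tell the two kinds of line apart. The function name `classify_line` is made up for this example, and validating with Python's `re` module is only a stand-in: Pi-hole's own regex engine may accept a different syntax.

```python
import re

def classify_line(line):
    """Classify one line of a downloaded list (hypothetical helper).

    Returns ("regex", pattern), ("exact", domain), or None for
    blank lines, comments, and regexps that fail to compile.
    """
    text = line.strip()
    if not text or text.startswith("#"):
        return None
    # AdGuard-style convention: a rule wrapped in slashes is a regexp.
    if len(text) > 2 and text.startswith("/") and text.endswith("/"):
        pattern = text[1:-1]
        try:
            re.compile(pattern)  # stand-in validity check only
        except re.error:
            return None
        return ("regex", pattern)
    # Otherwise assume a HOSTS-style line: drop any trailing comment
    # and take the last field as the domain.
    fields = text.split("#", 1)[0].split()
    if not fields:
        return None
    return ("exact", fields[-1])
```

A line such as `0.0.0.0 ads.example.com` would come back as an exact domain, while either of the two rules above would come back as a regex with the surrounding slashes removed.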

Could something similar be implemented in Pi-hole? I would imagine that gravity may pull some duplicate entries (for example, for people already using items from my list), so there would need to be some process of checking for existing manually entered regexps and either removing those or excluding the duplicates pulled down from gravity.
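To make the duplicate handling concrete, here is a small Python sketch of the "exclude the duplicates" option. The function name `merge_regexps` is hypothetical, and it only compares patterns as exact strings; two differently written regexps that match the same domains would not be detected.

```python
def merge_regexps(user_regexps, downloaded_regexps):
    """Return the downloaded patterns that are not already present.

    Patterns the user entered manually take precedence; duplicates
    within the downloaded list itself are also dropped. Order of the
    downloaded list is preserved.
    """
    seen = set(user_regexps)
    fresh = []
    for pattern in downloaded_regexps:
        if pattern not in seen:
            seen.add(pattern)
            fresh.append(pattern)
    return fresh
```

With this approach the manually entered regexps are never touched, which avoids silently deleting anything the user added on purpose.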

It could be really beneficial to people too: if I make updates, changes or fixes to resolve conflicts, their regexps could be updated automatically. Currently, one would have to update manually on a regular basis (if the list were to be updated frequently).

I'm not clear on what you want this feature to do. Gravity is a list of single domains, one per line, in HOSTS format, that will be blocked. How do you want to incorporate regular expressions into this?

Granted, gravity is a single list of domains; I used the term more as a reflection of the all-encompassing gravity.db database, which holds the blacklist, whitelist, regexps etc.

I understand that, currently, only files in HOSTS file format are supported, but it would be nice to be able to expand this to include things such as regular expressions, whitelists etc.

I raised the question originally because, in order to distribute my regexp list to people, they have to either copy and paste them all manually or run a script to 'install' them.

It would be nice to be able to load these items from a URL in a similar manner to the host files.

We will investigate this.

We could amend the adlist table by a new column type (defaulting to exact list) to support various types of automatically downloaded filters. The gravity script could then handle the various types accordingly (for now that'd be exact and regex gravity domains).

The remaining issue is that they should probably (?) be kept separately from the normal regex filters users can specify as they should neither be editable nor survive a pihole -g run. This means we'd also need a new table gravity_regex as we cannot import them into our regular regex table for performance reasons (we do the lookup directly using the B-tree of the database, it would not contain the information about regex = true/false).
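Purely as an illustration of the idea, here is a Python/SQLite sketch of what such a layout could look like. Only the `type` column on `adlist` and the `gravity_regex` table name come from the description above; every other table layout, column name and function name here is an assumption, not the actual Pi-hole schema.

```python
import sqlite3

def create_schema(conn):
    """Create a toy schema (assumed layout, not Pi-hole's real one)."""
    conn.executescript("""
        CREATE TABLE adlist (
            id      INTEGER PRIMARY KEY,
            address TEXT UNIQUE NOT NULL,
            type    TEXT NOT NULL DEFAULT 'exact'  -- 'exact' or 'regex'
        );
        -- User-managed regex filters: never touched by a gravity run.
        CREATE TABLE regex (
            id      INTEGER PRIMARY KEY,
            pattern TEXT UNIQUE NOT NULL
        );
        -- Downloaded regex filters: rebuilt on every gravity run.
        CREATE TABLE gravity_regex (
            pattern   TEXT NOT NULL,
            adlist_id INTEGER NOT NULL REFERENCES adlist(id),
            PRIMARY KEY (pattern, adlist_id)
        );
    """)

def rebuild_gravity_regex(conn, downloaded):
    """Replace all downloaded regexps with a fresh set.

    `downloaded` is an iterable of (adlist_id, pattern) pairs.
    The delete and the inserts run in a single transaction.
    """
    with conn:
        conn.execute("DELETE FROM gravity_regex")
        conn.executemany(
            "INSERT OR IGNORE INTO gravity_regex (adlist_id, pattern) "
            "VALUES (?, ?)",
            downloaded,
        )
```

Keeping the downloaded patterns in their own table means a rebuild can simply wipe and refill it, so nothing from a previous run survives and the user's own `regex` entries stay untouched.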

Thanks @DL6ER this would be most useful and I think definitely a step in the right direction to some modernisation.

Yes - I think you're right. They would need to be separate, however I think there also needs to be a way for users to "whitelist a regexp" for extreme circumstances where one or more regexps may cause them issues in their specific use case.

Pi-hole's general rule is: The whitelist trumps.

Even if a domain is contained in all adlists, we have an exact blacklist and multiple regex hits for it - if either an exact or a regex whitelist entry matches as well - then the domain is not blocked. So this is already guaranteed.
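The precedence rule can be sketched in a few lines of Python. This is only a model of the rule as described here, not how FTL implements it; the function name `is_blocked` and its parameters are invented for the example, and Python's `re` stands in for Pi-hole's regex engine.

```python
import re

def is_blocked(domain, gravity, exact_black, regex_black,
               exact_white, regex_white):
    """Model of "the whitelist trumps" (illustrative only).

    gravity, exact_black and exact_white are sets of domains;
    regex_black and regex_white are lists of pattern strings.
    """
    # Any whitelist hit, exact or regex, wins outright.
    if domain in exact_white:
        return False
    if any(re.search(p, domain) for p in regex_white):
        return False
    # Otherwise a single hit on any blocking source is enough.
    return (
        domain in gravity
        or domain in exact_black
        or any(re.search(p, domain) for p in regex_black)
    )
```

So a user who finds that a downloaded regexp breaks one site could add a single exact or regex whitelist entry for it, and the domain would be allowed no matter how many blocking sources match.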

Ah, that's helpful! I wasn't aware that you'd implemented whitelist regex.

How do you intend to define the type of list, e.g. blacklist, whitelist or regex list? Or are you simply going to have a separate page for the user to define each, and therefore know the type from that?

AdGuard Home works quite nicely by reading all rules from a single file, but I imagine that would involve a lot of modifications to the shell scripts and is not necessarily worth the time at this point.

Do you mean internally (in the database) or externally (in the front end)?

To make things clear to the user, we retitled the three categories

  • blacklist
  • whitelist
  • regexlist

and added a fourth one:

  • exact blacklist
  • regex blacklist
  • exact whitelist
  • regex whitelist

The exact and regex versions of a list are shown together on the same (e.g. blacklist) page:

We're also currently investigating a switch from this more traditional display to a common page showing all four domain types at once. This would have a number of benefits, such as letting users easily change a white regex to a black one, integrating per-client group management, etc. This is currently under review:

I meant externally :slight_smile:

Thanks for including all of the details! It definitely looks like things are going in a good direction. I was wondering where the visual representation for the 'enabled' field in the DB was.

I have adapted the regexp install scripts for now to conform to the new DB structure, but it would definitely be useful to have these loading from gravity.

Thanks for your consideration :slight_smile: