Reduce the size of gravity.list with active wildcard and regex entries

regex

#1

I am new to Pi-hole but not to DNS filtering, which I did for years on my own.

Using wildcard filtering (address=/xyz.abc/[x.x.x.x]) in dnsmasq, I was thinking about how to reduce the size of, and the number of records in, gravity.list. So I thought: why keep each record that is already covered by a wildcard record?
I reduced my domain count from 143K to 118K this way. If regex is also used, it can be reduced even more.

If a wildcard/regex record is entered, it is not good to run the filter every time, so I suggest an extra button on the blacklist page to generate a filtered gravity.list.

Removing a wildcard/regex means that all the external blocking lists have to be read from scratch, and then the remaining wildcards/regex are applied to the new gravity.list.

To make this all more manageable, I could imagine a page that shows all the records grouped by top-level domain, split over pages. From this page it would be easy to generate a new wildcard if the user wants to block the whole top-level domain.

I had a quick-and-dirty script line that did the job for me:

This is a shortened version; it could also be converted to a loop that reads each wildcard/regex separately.

awk '!/302br.net|sandai.net|liveadvert.com|appier.net/' gravity.list > gravity-1.list

I then copy gravity-1.list over gravity.list and restart Pi-hole to make the updated gravity.list active.
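A sketch of the loop variant mentioned above, processing one pattern at a time instead of one big alternation (the helper name and the file of patterns are my assumptions, not part of Pi-hole):

```shell
#!/bin/sh
# Filter a gravity file one wildcard/regex entry at a time.
filter_gravity() {   # $1 = gravity file, $2 = file with one pattern per line
    cp "$1" "$1.work"
    while read -r pattern; do
        [ -n "$pattern" ] || continue
        # keep only the lines NOT matching the current pattern
        awk -v pat="$pattern" '$0 !~ pat' "$1.work" > "$1.tmp"
        mv "$1.tmp" "$1.work"
    done < "$2"
    mv "$1.work" "$1.filtered"
}
```

Run as filter_gravity gravity.list my-wildcards.txt; the result lands in gravity.list.filtered, which can then be copied over gravity.list as described above.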


#2

Why do you want to reduce the size of gravity.list? With FTLDNS the file is actually much smaller, because it only contains domains.

This feature would add a lot of complexity for little benefit.


#3

Thanks for your answer; however, I don’t follow you regarding FTLDNS.

As I understand it, gravity.list is generated by downloading all the adlist files and putting them into one file (event horizon), which is then de-duplicated and sorted. The result is transferred to gravity.

Gravity consists of domains and subdomains. Then we have wildcards, and those I can’t see back in any file, so I assume they are stored in a database. I can add my own wildcards in /etc/dnsmasq/xx-pihole-wildcard.conf and have them imported when FTLDNS is (re)started.

My own wildcards are not shown in a query so that strengthens the idea that those are not stored internally.

You write that only domains are stored, so I will assume that is the case.

I am using an upstream DNS server to resolve and have Unbound between it and FTLDNS. Dnsmasq is not that stable when it can’t reach the upstream server, and its DNSSEC was buggy and often returned BOGUS. Unbound does a great job and is packed with features that everyone wants to use.

Back to the topic: filtering out the domains as I proposed will lower the overall impact on the system, even if wildcards are first in line when filtering. So what does FTLDNS use as its blocklist?

Extra:

When trying to make a wildcard file manually, I noticed it was almost impossible from the list: just too much ungrouped information. So I sorted it again in a way that better matches the structure of domains and subdomains.

After a lot of searching, trying, and failing, I was surprised how easy it was to get the sorting I wanted with: cat gravity.list | rev | sort | rev > gravity-3.list — Linux keeps flabbergasting me again and again.
It reverses each line, which puts the domain at the beginning and the subdomain at the end, sorts, and then reverses the lines again so they are correct subdomain.domain entries once more.
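As a tiny self-contained demonstration of that trick (the sample names are made up), reversing before sorting groups names by domain rather than by first label:

```shell
# Reverse each line, sort, reverse back: entries sharing a domain
# end up adjacent, with the bare domain first.
printf 'b.example.com\nexample.com\na.other.net\n' | rev | sort | rev
# → example.com
#   b.example.com
#   a.other.net
```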

It was much easier to find things now, and I even noticed a way to automate it with a program. In the list underneath you see a domain with its subdomain.domain entries on the following lines. Going through the list, you can check whether a certain number of subdomains share the same domain. That domain is then added as a wildcard, which will in the end filter out all matching domain and subdomain.domain entries in gravity.list.

I stated earlier that I had 143K entries, but I had used a wrongly formatted adlist file; the correct one put the counter at 290K, and after filtering out the wildcards it was reduced to 210K entries.

Example, sorted on domain:


volumtrk.com
voluumtrk.com
boh00.voluumtrk.com
eju10.voluumtrk.com
cma60.voluumtrk.com
e4sa0.voluumtrk.com
xssa0.voluumtrk.com
rvcb0.voluumtrk.com
a3hb0.voluumtrk.com
rjnc0.voluumtrk.com
rwbd0.voluumtrk.com
okrg0.voluumtrk.com
qb2h0.voluumtrk.com
hbho0.voluumtrk.com
lbfp0.voluumtrk.com
aafq0.voluumtrk.com
qitr0.voluumtrk.com
lxjw0.voluumtrk.com
hoz01.voluumtrk.com
nam11.voluumtrk.com
mb871.voluumtrk.com
.
.
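The scan described above can be sketched in a few lines of awk (the helper name, the threshold, and the "last two labels" heuristic are my assumptions; multi-label registries such as .co.uk would need more care):

```shell
# List domains that occur, as themselves or as a subdomain,
# at least $1 times on stdin: candidates for a wildcard/regex entry.
wildcard_candidates() {
    awk -F. -v min="$1" '
        NF >= 2 { dom = $(NF-1) "." $NF; count[dom]++ }
        END     { for (d in count) if (count[d] >= min) print d }'
}
```

Piping the list above through wildcard_candidates 5 should print voluumtrk.com, since far more than five entries share that domain.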


Filter out wildcarded entries when consolidating lists
#4

It sounds like you are greatly overcomplicating things. The adlists are parsed to get the domains they contain, then duplicates are removed and the result is put into gravity.list. That gravity file only contains domains, separated by newlines.

In dev, there are no more wildcards. They are replaced by regex filters, which are stored separately (/etc/pihole/regex.list).

When FTLDNS checks if a domain should be allowed, it only needs to check whether the domain exists in the list or is caught by a regex filter. The domain search is well optimized by Dnsmasq, and the regex result is even cached, so it is only a small one-time cost.

You can have millions of domains blocked and not have any notable performance issues, so I do not see the value in implementing a complex algorithm to trim the number of domains by replacing them with complex regex filters.


#5

Gravity.list works great, and Pi-hole can select which kind of blocking method is used. I don’t think it would have to be very complex; the only drawback I see is that if a wildcard is manually deleted, gravity.list is out of sync with the adlists that, combined, filled it.

My current standing on this is to filter out the wildcard/regex matches just after downloading and before normal processing of the adlist source files. This way the subdomains covered by a wildcard/regex never enter the system, and only the wildcard/regex itself is known.

I have been running the dev version (v3.3-312-gb087888-dirty) for two days, and it did not yet put the manually (web interface) added wildcards in the regex list, but in 03-pihole-wildcard.conf in /etc/dnsmasq.d:

server=/urbanairship.com/
server=/000webhostapp.com/
server=/smartadserver.com/

address=/planet.nl/192.168.0.1

This is no problem at the moment, but the added 192.168.0.1 confused me; I expected no IP address there, so that it would return NXDOMAIN. I have set BLOCKINGMODE=NXDOMAIN in /etc/pihole/pihole-FTL.conf, yet when I resolve that name it comes back as 192.168.0.1.
This may be because I am using the 03-pihole-wildcard.conf.

Speaking of NXDOMAIN, which also removes the need for a separate IPv6 list ;-), I noticed that some browsers don’t really like NXDOMAIN for specific addresses. When I put an IP behind the domain, the constant re-resolving stops.
I manually add it to the wildcard file directly, as address=/thumbnails2.opera.com/127.0.0.1, or use the wildcard entry field…see above.

I assume that in the future NXDOMAIN will be the method used in Pi-hole; it would then be nice to have a new exact blacklist that resolves not to NXDOMAIN but to a local IP address.
This would have to sit in front of the wildcard/regex stage, or even better be included in it (then allow subdomains in wildcards), with a specific hit prioritized over the wildcard line in dnsmasq.

Blacklist/Whitelist --> Wildcard/regex --> gravity

Then another request, about /etc/dnsmasq.d/01-pi-hole.conf. My router’s caching DNS looks at the TTLs sent with local replies. Dnsmasq has a default of zero seconds (no caching), and Pi-hole sets local-ttl=2, so my router sends a request to Pi-hole every 10 seconds to resolve again. It would be nice if that value could also be changed/defaulted from setupVars.conf. I use a value of 600, which keeps the router’s repeated requests from filling up the query log.
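Until such a setting exists, a dnsmasq drop-in can already carry the override; a minimal sketch, assuming the filename 99-local-ttl.conf is unused (and note that if local-ttl is already set in another loaded file, dnsmasq may reject the duplicate, so editing the existing line may be needed instead):

```
# /etc/dnsmasq.d/99-local-ttl.conf -- custom drop-in, not managed by Pi-hole
# Hand out a 10-minute TTL on locally answered names so downstream
# caches (like the router) stop re-asking every few seconds.
local-ttl=600
```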

I have been very busy with Pi-hole these last days and I like it very much. I have a lot of questions and proposals at the moment, but that will become less. :slight_smile:


#6

The regex feature has yet to be merged into core and web, but is available in FTL’s dev branch.

See the various blocking modes you can use: https://docs.pi-hole.net/ftldns/blockingmode/

You still have not said why this is a valuable feature, beyond that it will lower the number of lines in gravity.list.


#7

Thanks; I have now switched from NXDOMAIN to NULL, which also covers IPv6. The wildcard file had to be edited to reflect the NULL setting as well.

Querying lists with the tools often shows more than 100 entries, and it then states that I have to use “-all”, which does not work because it is not a domain.
I would like to repeat my suggestion to allow subdomains to be added to the wildcards and to the search options. The same goes for bare TLDs: I have now blocked the TLD “.goog”. The TLD is shown in the wildcard table and working, but I can’t search on it. Of course this can all stay in the wildcard file and in time be replaced by regex.
Update: I now have even more TLDs blocked in regex.
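For reference, a whole TLD can be caught with a single anchored pattern; this example pattern is my own, assuming the dev branch matches plain POSIX regex against the full query name:

```
# matches the bare label "goog" and every name ending in ".goog"
(^|\.)goog$
```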

No, I don’t have any better reason than a smaller gravity.list and thus less impact from that list on the whole system.

I have been busy with regex: my own regex lines now filter about 50,000 lines from a 250,000-line gravity.list, and I have converted my wildcard file to regex. Not difficult, but you have to pay attention.

Today I found regex files by C. Buijs at https://github.com/cbuijs/accomplist/tree/master/chris. I used them on my gravity.list, and that removed 100,000 lines from a 250,000-line list; it could be even more if I also apply my own regex on top. He has also made a filtering extension for Unbound, which might even make it possible to completely replace dnsmasq with Unbound.

Then again, having lots of unique domains in the regex list: is that not more load than the dnsmasq way of working? If so, would it not be better to keep all the domains not covered by regex in dnsmasq?

Update: many hours later I have completed the two lines that implement regex filtering on the adlist files. It is very rough, and I have hard-coded the paths.

The extra lines to implement the adlist-regex filter are (the first builds a chain of awk '!/pattern/' filters, one per active regex.list entry; the second runs that chain over the downloaded adlist on stdin and writes the filtered result):

cat /etc/pihole/regex.list | awk NF | awk '!/^#/' | sed "s/.*/awk \'\!\/&\/\' \|/" | sed '$ s/.$//' > /etc/pihole/regex.clean

source /etc/pihole/regex.clean > ${destination}

I have not included the surrounding code, so it is not that easy for other users to screw up a working system; it should go through the proper channels if it is accepted.

The speed when running pihole -g is good; I don’t have to wait long before it completes.

Now that I don’t have a wildcard file anymore, because everything is in regex, I can’t query the list for a match. I am sure that this will be available in the future.

PS: when you comment out lines in regex.list, they are still processed. To find errors you have to remove one or more lines and save regex.list until you find the conflicting line.


#8

I have my doubts about putting the wildcard file 1-on-1 into the regex file. I noticed some developments in the beta and did some searching.

To optimize a bit, you could merge matching rules onto one line:

use Regexp::Optimizer;
my $o  = Regexp::Optimizer->new;
my $re = $o->optimize(qr/foobar|fooxar|foozap/);
# $re is now qr/foo(?:[bx]ar|zap)/

list:

sub.aaa.bbb
sub2.aaa.bbb
sub3.ccc.ddd
sub4.ccc.ddd

So first you group each unique domain with its (different) subdomains onto one line:

^(sub\.aaa\.bbb|sub2\.aaa\.bbb)$
^(sub3\.ccc\.ddd|sub4\.ccc\.ddd)$

Then optimize:

^(sub|sub2)\.aaa\.bbb$
^(sub3|sub4)\.ccc\.ddd$

To achieve grouping by domain you could use the “rev | sort” commands.
Then walk through the list and group same-domain entries onto one line with their subdomains, still reversed:

$bbb.\aaa.\)2bus|bus(^
$ddd.\ccc.\)4bus|3bus(^

and when ready “rev” again.

This is a light optimization, not as advanced as Regexp::Optimizer, but it will reduce load.

Maybe the regex engine already optimizes internally, returning a match as soon as the alternatives are put on one line.
I only began with regex two weeks ago, generating blacklists for my firewall, so there is still a lot for me to learn.

Update: this could also be used on gravity.list; I have made corrections above and formatted the examples.


#9

An example from a gravity list, done by hand:

Source gravity.list:

.
.
robinwoodcomics.org
thong-pics.org
painolympics.org
www.painolympics.org
maturewomenpics.org
freepornpics.org
tattoo-pics.org
zoo-pics.org
netpics.org
holidaypics.org
mmetrics.org
.
.

result regex:

^robinwoodcomics\.org$
^(holiday|net|zoo-|tattoo-|freeporn|maturewomen|www\.painolym|painolym|thong-)pics\.org$
^mmetrics\.org$

Let’s take a smaller step and use the fact that some domains give themselves away as a good base for a regex line. In the reversed list below you see “moc.21” (12.com) as a base.

So let’s take “moc.21” as the base, check whether the following lines start with “moc.21”, and if so strip the base, transfer the remainder, and append “|”:

$moc.21)002di-troppus-enilno|01klc|02asdr|02erawypsitna|…|wrt|wrt.www|xobswen(^

moc.21
moc.21002di-troppus-enilno
moc.2101klc
moc.2102asdr
moc.2102erawypsitna
moc.2102hrffa
moc.2102mti
moc.2102qqqeweqew
moc.2102tropmiolop
moc.210anihc
moc.2111gnikcart
moc.21-1eciton
moc.211ikcart
moc.211ikcart.www
moc.211kcart
moc.211kcart.www
moc.211oahin
moc.211tahc
moc.213214lpyapsemagcipelpyap-ppa
moc.215tner
moc.216.1t
moc.21682919dootiowdotegn-ppa.snoitettaduarf-redrolecnac
moc.2188329501-pleh
moc.219tkm.ztnetnoc
moc.21dnesetaerc
moc.21ecinnaitsirhc
moc.21egakcap.d61we
moc.21ffa
moc.21gnikniltcennoc
moc.21gnikniltcennoc.www
moc.21grebdlogdivad
moc.21knltxen
moc.21liamc
moc.21lssup
moc.21namrof
moc.21narbacmasdan
moc.21orpriapercp
moc.21orpriapercp.www
moc.21rekamssim
moc.21-retrats
moc.21rrtup
moc.21sbale
moc.21segassem
moc.21sliamiffe
moc.21smtenihasm
moc.21sunisiurc
moc.21tceriderorez
moc.21tceriderorez.1az
moc.21tceriderorez.1cz
moc.21tceriderorez.1rz
moc.21tceriderorez.1tz
moc.21tceriderorez.1wz
moc.21tebrocs.setailiffa
moc.21wrt
moc.21wrt.www
moc.21xobswen

#10

I have been using regex for a short while now and it is really powerful; it can be used like a scalpel, very precisely.

I started using a good list but noticed that basic domains like Google, GitHub, etc. were blocked. I could manually remove those domains each time, or whitelist them.

To be sure that those domains are never blocked by any list, I generated a whitelist filter named regex.white.

In my personal configuration, regex.list + regex.white filter the adlist block files on import, by means of a temporary file regex.clean containing the awk commands that filter out entries already covered by regex.list. Thanks to regex.white, the domain names I want are now never blocked either:

# Whitelist only removes domains from the adlist files
# Don't remove the dummy.not.remove entry
dummy\.not\.remove$ 
# put underneath your whitelist domains
(^|^www\.)youtu\.be$
(^|^www\.)twitter\.com$
(^|^www\.)githubusercontent\.com$
(^|^www\.)github\.com$
(^|^www\.)myip\.ms$
(^|^www\.)grc\.com$
^lh[0-9]\.ggpht\.com$
(^|^www\.)google\.(com|nl)$

I use (^|^www\.) to be sure that only twitter.com and www.twitter.com are excluded from the adlists, ever.

The advantage is that regex.white is only used when importing the adlists; the disadvantage is that it is not visible in Pi-hole…but then, neither are the wanted domains. :wink:
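A condensed sketch of how such an import-time filter can work (the helper function and the use of grep -E are my paraphrase under those assumptions, not the exact script):

```shell
# Drop every adlist line matched by an entry in either pattern file;
# stdin is the raw downloaded adlist, stdout the cleaned version.
clean_adlist() {   # $1 = regex.list path, $2 = regex.white path
    patterns=$(awk 'NF && !/^#/' "$1" "$2" | paste -s -d '|' -)
    if [ -n "$patterns" ]; then
        grep -E -v "$patterns"
    else
        cat    # no active patterns: pass the list through unchanged
    fi
}
```

Because both files feed the same chain, a whitelisted domain simply never reaches gravity, which matches the behaviour described above.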

Looking forward to being able to use regex on the Pi-hole whitelist page too, so this workaround is not needed and one line suffices instead of two or more.


#11

Hey msatter - Do you mind sharing exactly how you implement the regex whitelist? Thanks in advance!


#12

Thanks for your interest. The whitelist only removes entries that would otherwise end up in gravity but that I want to allow.
The entries are filtered out the same way as the regex.list entries. The trick is that the whitelist is only used during processing of the adlists, while regex.list is always used.

I have published the scripts, though I have to check whether they are up to date.
The scripts have been stable for some time now and I use them daily.


#13

So for the life of me I have no idea where you published them at! :wink:


#14

Me neither. I am also searching this forum; I made that many posts in that period.

Found the thread and posting:

https://discourse.pi-hole.net/t/cleaning-gravity-list-based-on-what-is-in-regex-list/12009/78?u=msatter