Hi, I'm the guy in Reddit who helped with the awk script.
Regarding the feature request: instead of running the script every week against every domain in gravity.list, I'd suggest keeping a history of the alive and dead domains from previous runs, so you don't have to query the same millions of domains over and over. That way you'd only have to query the new ones (hundreds) each week instead of millions of domain names.
It won't be hard to incorporate this functionality into the current script; I can help.
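A minimal sketch of that idea, in Python rather than awk for readability. The names `prune_with_history`, `history` and `is_alive` are hypothetical, and `is_alive` stands in for whatever DNS lookup the real script performs; this is not the existing script.

```python
from typing import Callable, Dict, Iterable, List


def prune_with_history(
    gravity: Iterable[str],
    history: Dict[str, bool],
    is_alive: Callable[[str], bool],
) -> List[str]:
    """Return the domains to keep, looking up only domains not seen before.

    history maps domain -> True (alive) / False (dead) from earlier runs
    and is updated in place, so it can be saved and reused next week.
    """
    kept = []
    for domain in gravity:
        if domain not in history:
            # New domain: this is the only case that costs a DNS query.
            history[domain] = is_alive(domain)
        if history[domain]:
            kept.append(domain)
    return kept


if __name__ == "__main__":
    # Fake resolver for demonstration only: treats "dead.example" as dead.
    queried = []

    def fake_is_alive(domain: str) -> bool:
        queried.append(domain)
        return domain != "dead.example"

    history = {"old-alive.example": True, "old-dead.example": False}
    gravity = ["old-alive.example", "old-dead.example",
               "new.example", "dead.example"]
    print(prune_with_history(gravity, history, fake_is_alive))
    print("queried this run:", queried)
```

Entries in the history never expire in this sketch, so an occasional full rescan would still be needed to catch domains that change state.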
I assume you have tested this. What were the results? Given that Pi-Hole maintains gravity sorted in memory and the lookup algorithm is quite fast, I wouldn't imagine you would save more than a few msecs.
Yes, it was not a drastic speed boost, but a speed boost nonetheless.
I tested it reading from the eMMC flash memory on my Tinkerboard S rather than from RAM, but my understanding is that both are similarly fast.
I don't recall the exact timings.
I ran a test on two identical Pi Zero Ws, both connected via wireless: same power supplies, same SD card size and type, same Pi-Hole version and settings.
Both got the same command from their own terminal:
time dig flurry.com
Number 1 with default blocklists - 134,515 domains on blocklist
;; Query time: 1 msec
real 0m0.443s
user 0m0.147s
sys 0m0.085s
Number 2 with all WaLLy3K blocklists - 1,308,625 domains on blocklist
;; Query time: 1 msec
real 0m0.203s
user 0m0.136s
sys 0m0.031s
Repeated runs of this query (after TTL expiration) show data that overlaps and there is no statistically meaningful separation in the times between the two Pi-Holes.
Regarding whether the feature request is useful or not, I would suggest that it is useful for 2 reasons:
- it's very slow to restart Pi-hole on a Pi Zero if you have millions of blacklisted domains in gravity.list (it takes 20 seconds or more on my Pi Zero)
- it uses a lot of RAM, since Pi-hole keeps gravity.list in memory (about 100-150 MB of RAM per million blacklisted domains according to my tests). Therefore, devices with 512 MB of RAM or less (like my Pi Zero) are pushed to the limit.
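To put the RAM figure in perspective, here is a back-of-the-envelope sketch. It assumes memory scales linearly at the 100-150 MB per million domains quoted above, which is only a rough measurement from one device; the function name `estimate_ram_mb` is made up for illustration.

```python
def estimate_ram_mb(domain_count, mb_per_million=(100.0, 150.0)):
    """Return a (low, high) RAM estimate in MB for a gravity list of this size.

    Assumes linear scaling with the per-million range given in the post;
    real usage will depend on the Pi-hole version and domain lengths.
    """
    low, high = mb_per_million
    millions = domain_count / 1_000_000
    return millions * low, millions * high


if __name__ == "__main__":
    # List sizes mentioned earlier in this thread.
    for count in (134_515, 1_308_625):
        low, high = estimate_ram_mb(count)
        print(f"{count:>9,} domains: roughly {low:.0f}-{high:.0f} MB")
```

Under that assumption, a list of a few million domains would already approach or exceed the 512 MB available on a Pi Zero.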
Perhaps a better solution is to get your blocklist maintainers to remove NXDOMAIN entries from their blocklists before they publish them. This would have even more benefits:
- Less bandwidth and server load when thousands (or perhaps tens of thousands) of users download their block lists.
- Fewer demands on DNS servers - instead of every user testing all their domains, the block list maintainers do this once.
- Faster for users to download, process, remove duplicates, etc. while rebuilding gravity.
- Simplified code in Pi-Hole.
This is why the cron script to rebuild gravity runs on Sunday morning between 3 am and 5 am your local time.
Right now I am using a large number of lists, and many of the maintainers are not properly purging them.
Pi-hole also restarts (or reloads its database, I'm not sure) every time I whitelist or blacklist a new domain. I've checked that the pihole-FTL process is at 100% utilization for 15-20 seconds on my Pi Zero every time I perform these tasks, and I'm assuming it rebuilds the hash table of the blacklisted domains in memory.
Should it be the responsibility of the Pi-Hole developers and Pi-Hole code to clean up this mess?
The first time you ran this script on millions of domains, how many NXDOMAINS were removed?
No, definitely not; that's why it's just a feature request. We will try to work on a script independently.
I went from 3,098,006 domains / 65 megs down to 2,009,122 domains / 42 megs
1,088,884 domains were removed
I agree. We'll work with @p1r473 on a new version of the script when version 5 is out, and we'll keep track of the dead domains (per list ideally) so we can notify the original maintainers to clean up their lists. We'll share the script with the community as well for anyone interested.
In addition to NXDOMAINs, some lookups returned other errors, such as "No answer":
*** Can't find 00038a.net: No answer
Feel free to close this feature request if you don't believe it enhances the product!
What happens when a domain that was removed becomes active? What about domains that are active and then removed after your script runs? There are a number of situations that render the approach of pruning lists rather futile.
FTL doesn't read / grep a text list every time a query happens; the lists are read into memory and that is what is queried. An RPi Zero, with its limited memory, has a limit on how many domains can be used. The gain you get from pruning is converging quickly on being inconsequential. The better solution is to use better quality lists or use hardware that can handle the load.
I run the script weekly to solve that first question, but henf1b3rs suggestion will help: keeping track of dead and alive domains from previous weeks, and just rescanning the newly added ones. A full rescan can be done less frequently.
Looking up millions of domains weekly on the internet to save a msec or two here and there in DNS lookups does not seem to be a fruitful endeavor. As noted, there is a statistically insignificant difference in Pi-Hole search time.
What benefit is there in running blocklists totalling millions of domains compiled by strangers? Wouldn't it be much simpler, and put you in control, if you ran some basic blocklists and then made local blacklist and regex entries to block things you don't want to load?
Before you added all the blocklists, were you seeing ads or collecting malware?
No, just chasing perfection and cleanliness. I guess for me it's just cleaning up junk. Maybe trying to min/max my DNS lookup efficiency.
Exactly, if you have the domain in the local blocklist/blacklist then FTL will immediately return the NXDOMAIN response. If you remove the domain from the lists then a query for that domain will need to go to the upstream for a determination. If you have DNSSEC enabled that will be a slower process, as the NXDOMAIN from upstream will require a chain of queries from the root on down to the authoritative server.
I'm late to the party, however, I will also comment a bit on whether I find "pre-cleaning gravity" useful or not.
TL;DR: I don't.
Whether your experience on the web is better or worse when you go from 120,000 blocked domains to 3,000,000 (or even more) is, well, something everybody has to decide for themselves. I do not see an advantage here, but this is just one opinion out of many. We have intentionally always kept support for humongous lists, even if that made the code more tricky in one place or another.
With Pi-hole v5.0 on the horizon, we're moving gravity into an SQL database. Your script could actually be greatly simplified, as you could just run the one-liner
DELETE FROM "gravity" WHERE "domain" IN (SELECT "domain" FROM "known_not_existing");
with a prepared table of domains known to not exist.
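To illustrate how that one-liner would behave, here is a small sketch using Python's built-in sqlite3 module against an in-memory database. The two-table schema below is a simplified assumption for demonstration and not the actual v5.0 gravity schema; only the table and column names used in the statement above are taken from it.

```python
import sqlite3


def prune_gravity(conn: sqlite3.Connection) -> int:
    """Delete every gravity entry listed in known_not_existing.

    Returns the number of rows removed.
    """
    with conn:  # commit on success, roll back on error
        cur = conn.execute(
            'DELETE FROM "gravity" WHERE "domain" IN '
            '(SELECT "domain" FROM "known_not_existing");'
        )
    return cur.rowcount


def demo_connection() -> sqlite3.Connection:
    """Build a throwaway database with an assumed, simplified schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE gravity (domain TEXT NOT NULL);
        CREATE TABLE known_not_existing (domain TEXT PRIMARY KEY);
        """
    )
    conn.executemany(
        "INSERT INTO gravity VALUES (?)",
        [("ads.example",), ("dead.example",), ("tracker.example",)],
    )
    conn.execute("INSERT INTO known_not_existing VALUES (?)", ("dead.example",))
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = demo_connection()
    print("removed:", prune_gravity(conn))
    print([row[0] for row in conn.execute("SELECT domain FROM gravity ORDER BY domain")])
```

The expensive part, filling known_not_existing, is unchanged; the statement only replaces the text processing that removes the dead entries.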
However, coming back to a more general statement, I do have my doubts that there is any benefit in querying millions of domains once a week in order to reduce the gravity list size by 1/3.
Yes, this reduces gravity's size somewhat, however, this comes at the cost of doing bulk lookups once a week. They are so extreme that you query more domains each week than my Pi-hole at home queries in an entire year (just to give some numbers here).
How long does your script take to run to completion? And how many times could you restart pihole-FTL in this time window?
You see: I am not in favor of doing this cleaning up of the mess caused by some list maintainers (as has been said somewhere in this topic). Not because it is not a good idea, but because you are trying to obtain something arguably useful using absolutely the wrong tools. This should not be done by the end-user. Even though nobody usually sees this, internet traffic consumes and costs energy. This should not be taken lightly.
I should point out one more, only mildly related, thing here:
Behind the scenes, we're currently redesigning critical portions of our FTL daemon to remove some of the fundamental bottlenecks we have always carried around. This includes using a B-tree (a self-balancing search tree) for domains.
Due to using this tree for searching, we do not need to load all the domains into memory and startup time of FTL is greatly reduced (in fact, there is virtually no difference between having 10,000 or 10,000,000 domains on the gravity list). As a side effect, one could say that we remove any limitation on the size of the gravity list.
Obviously, nothing comes at no cost; however, we shift the cost of building the tree to the (weekly) run of pihole -g instead of the start of pihole-FTL to cover this.
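The trade-off can be sketched with SQLite, whose tables and indexes are stored on disk as B-trees. This is only an analogy under that assumption and not the experimental FTL code: `build_index` pays the tree-building cost once (like the gravity run), while `is_blocked` opens the file and walks the tree for a single domain without loading the whole list. Both function names are hypothetical.

```python
import os
import sqlite3
import tempfile


def build_index(path, domains):
    """Write all domains into an on-disk table; the PRIMARY KEY creates a B-tree index."""
    conn = sqlite3.connect(path)
    with conn:  # commit once after all inserts
        conn.execute("CREATE TABLE gravity (domain TEXT PRIMARY KEY)")
        conn.executemany(
            "INSERT OR IGNORE INTO gravity VALUES (?)",
            ((d,) for d in domains),
        )
    conn.close()


def is_blocked(path, domain):
    """Look up one domain through the index; nothing is preloaded into memory."""
    conn = sqlite3.connect(path)
    try:
        row = conn.execute(
            "SELECT 1 FROM gravity WHERE domain = ?", (domain,)
        ).fetchone()
        return row is not None
    finally:
        conn.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "gravity-demo.db")
        # Slow part, done once per rebuild.
        build_index(db, (f"ads{i}.example" for i in range(10_000)))
        # Fast part, available immediately after "startup".
        print(is_blocked(db, "ads4242.example"), is_blocked(db, "example.org"))
```

Because a lookup only touches a handful of tree pages, its cost grows very slowly with the number of domains, which matches the claim that list size barely affects startup.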
This code is -- even though it already works -- still to be considered highly experimental at this time and only future performance measurements will show if we finally go for it. Whether it could make it into a v5.1, v5.2 or possibly a v6.0 is completely open at this point.