I'm late to the party; however, I will also comment a bit on whether I find "pre-cleaning gravity" useful or not.
TL;DR: I don't.
Whether your experience on the web is better or worse when you go from 120,000 blocked domains to 3,000,000 (or even more) is, well, something everybody has to judge for themselves. I do not see an advantage here, but this is just one opinion out of many. We intentionally keep support for humongous lists, even if that made the code trickier in a few places.
With Pi-hole v5.0 on the horizon, we're moving gravity into an SQL database. Your script could actually be greatly simplified, as you could just run the one-liner
```sql
DELETE FROM "gravity" WHERE "domain" IN (SELECT "domain" FROM "known_not_existing");
```
with a prepared table of domains known to not exist.
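Since gravity will live in SQLite, the whole round trip can be sketched with the `sqlite3` CLI. This is a minimal, hypothetical demo: the temporary database, the domains, and the single-column table layout are made up for illustration (the real gravity schema has more columns), but the `DELETE` is the one-liner from above.

```shell
# Hypothetical demo: throwaway database, made-up domains and schema
db=$(mktemp)
sqlite3 "$db" <<'SQL'
CREATE TABLE gravity (domain TEXT PRIMARY KEY);
CREATE TABLE known_not_existing (domain TEXT PRIMARY KEY);

INSERT INTO gravity VALUES
  ('ads.example.com'), ('dead.example.net'), ('tracker.example.org');
-- The prepared table of domains your script found to not exist
INSERT INTO known_not_existing VALUES ('dead.example.net');

-- The one-liner: drop every gravity entry known to not exist
DELETE FROM "gravity" WHERE "domain" IN (SELECT "domain" FROM "known_not_existing");
SQL
sqlite3 "$db" 'SELECT count(*) FROM gravity;'   # prints 2
rm -f "$db"
```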
However, coming back to a more general statement, I do have my doubts that there is any benefit in querying millions of domains once a week in order to reduce the gravity list size by 1/3.
Yes, this reduces gravity's size somewhat, but it comes at the cost of weekly bulk lookups so extreme that you query more domains each week than my Pi-hole at home queries in an entire year (just to give some numbers here).
How long does your script take to run to completion? And how many times could you restart `pihole-FTL` in this time window?
You see: I am not in favor of this cleaning up of a mess caused by some list maintainers (as has been said somewhere in this topic). Not because it isn't a good idea, but because you are trying to obtain something arguably useful using the absolutely wrong tools. This should not be done by the end user. Even though nobody usually sees it, internet traffic consumes energy and costs money. This should not be taken lightly.
I should point out one more, only mildly related, thing here:
Behind the scenes, we're currently redesigning critical portions of our FTL daemon to remove some of the fundamental bottlenecks we have always carried around. This includes using a B-tree (a self-balancing search tree) for domains.
Because we use this tree for searching, we do not need to load all the domains into memory, and FTL's startup time is greatly reduced (in fact, there is virtually no difference between having 10,000 or 10,000,000 domains on the gravity list). As a side effect, one could say that we remove any limitation on the size of the gravity list.
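As a loose analogy with tools already at hand: SQLite stores its indexes as B-trees, so a point lookup descends the tree page by page instead of reading every row. The table and domain below are made up for illustration, and FTL's planned tree is a separate in-house implementation, not SQLite; this just shows the shape of the idea.

```shell
# Loose analogy: SQLite's indexes are B-trees, so an indexed point lookup
# does a tree SEARCH rather than a full table SCAN. Throwaway demo database.
db=$(mktemp)
sqlite3 "$db" <<'SQL'
CREATE TABLE gravity (domain TEXT PRIMARY KEY);
INSERT INTO gravity VALUES ('ads.example.com'), ('tracker.example.org');
-- The plan reports SEARCH ... USING ... INDEX, i.e. a B-tree descent, not SCAN
EXPLAIN QUERY PLAN SELECT 1 FROM gravity WHERE domain = 'ads.example.com';
SQL
rm -f "$db"
```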
Obviously, nothing comes at no cost; however, we shift the cost of building the tree to the (weekly) run of `pihole -g` instead of to the start of `pihole-FTL`.
This code, even though it already works, is still to be considered highly experimental at this time, and only future performance measurements will show whether we ultimately go with it. Whether it could make it into v5.1, v5.2, or possibly v6.0 is completely open at this point.