Script to remove dead hosts

Hello,
I am using a script that runs nslookup on every host in gravity.list and removes the dead ones.
Now that Pi-hole 5.0 is moving to a database, my script will break.
Have you guys considered baking this functionality in?

Or, can someone who is good at databases help me adapt my script?
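
For reference, the idea boils down to something like this (a simplified sketch, not my exact script; the upstream resolver and the "NXDOMAIN means dead" test are just illustrative choices):

#!/usr/bin/env bash
# Simplified sketch of the pruning idea. Assumes the pre-5.0 flat file with one
# bare domain per line, and counts a domain as dead only on an explicit NXDOMAIN
# from an outside resolver.
GRAVITY="/etc/pihole/gravity.list"
KEEP="$(mktemp)"

while IFS= read -r domain; do
    # Query an external resolver directly; asking Pi-hole itself would just
    # return the 0.0.0.0 blocking reply for everything in gravity.
    if dig +time=2 +tries=1 "$domain" @9.9.9.9 | grep -q 'status: NXDOMAIN'; then
        echo "dead: $domain" >&2
    else
        echo "$domain" >> "$KEEP"
    fi
done < "$GRAVITY"

mv "$KEEP" "$GRAVITY"    # swap the pruned list in, then restart/reload Pi-hole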

Just out of curiosity, what is the benefit of this script?

The vast majority of the domains in your gravity list will never be queried by your network clients. Domains that would return NXDOMAIN if they were not blocked (the domains you are removing) are rarely queried, and when they are, the reply comes very quickly from the gravity list as 0.0.0.0. This is a single lookup.

With your script, each run queries every domain in gravity. In exchange for avoiding an occasional 0.0.0.0 reply for a domain that does not exist, you are now sending, at minimum, hundreds of thousands of queries to external DNS servers.

If I have a huge gravity file, DNS lookups take longer because a larger file has to be grepped or read, which slows down the search for a domain in gravity.

You can test this by timing DNS queries on a 3 million line gravity file vs. a very tiny 20 line gravity file.
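
Something like this is what I mean by timing the queries (a rough sketch; the domain, repeat count and sleep interval are just examples). Run it against the huge gravity and again against the tiny one, then compare the reported times:

#!/usr/bin/env bash
# Rough benchmark sketch: repeat a query against Pi-hole and pull out the
# "Query time" that dig reports.
DOMAIN="flurry.com"        # a domain known to be on the blocklist
PIHOLE="127.0.0.1"         # run this on the Pi-hole itself, or point at its IP

for run in $(seq 1 20); do
    dig "$DOMAIN" @"$PIHOLE" | awk '/Query time:/ {print $4, "msec"}'
    sleep 3    # pause between runs so we are past the short TTL of the blocking reply
done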

I run the script once a week and it takes all the dead domains out, slimming gravity and thus speeding up the check of "is this domain blocked by gravity? If not, look it up."

Hi, I'm the guy on Reddit who helped with the awk script.
Regarding the feature request, I'd suggest that instead of running the script every week against every possible domain in gravity.list, you keep a history of all the alive and dead domains from previous runs, so you don't have to query the same millions of domains over and over. That way you only have to query hundreds of domains (the new ones) each week instead of millions.
It won't be hard to incorporate this into the current script; I can help.
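
The bookkeeping can be as simple as two plain text files (a sketch only; the file names, resolver and NXDOMAIN-only test are placeholders):

#!/usr/bin/env bash
# Sketch of the "only check the new domains" idea.
# checked.txt remembers every domain already looked up (alive or dead);
# dead.txt collects the ones that came back NXDOMAIN.
GRAVITY="/etc/pihole/gravity.list"
CHECKED="checked.txt"
DEAD="dead.txt"
touch "$CHECKED" "$DEAD"

# comm -23 keeps lines only in the first input, i.e. gravity domains never seen before.
comm -23 <(sort -u "$GRAVITY") <(sort -u "$CHECKED") | while IFS= read -r domain; do
    if dig +time=2 +tries=1 "$domain" @9.9.9.9 | grep -q 'status: NXDOMAIN'; then
        echo "$domain" >> "$DEAD"
    fi
    echo "$domain" >> "$CHECKED"
done

# Drop everything known to be dead from the list Pi-hole uses.
grep -vxFf "$DEAD" "$GRAVITY" > gravity.pruned && mv gravity.pruned "$GRAVITY"
# A full rescan is simply: delete checked.txt and dead.txt, then run again.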

I assume you have tested this. What were the results? Given that Pi-Hole maintains gravity sorted in memory and the lookup algorithm is quite fast, I wouldn't imagine you would save more than a few msecs.

Yes, it was not a drastic speed boost, but a speed boost nonetheless.
I tested it reading from the eMMC flash memory on my Tinkerboard S rather than from the RAM module, but my understanding is that both are similarly high-speed memory.
I don't recall the exact timings.

I ran a test. Identical Pi Zero W, both connected via wireless. Same power supplies, same SD card size and type, same Pi-Hole version and settings.

Both got the same command from their own terminal: time dig flurry.com

Number 1 with default blocklists - 134,515 domains on blocklist

;; Query time: 1 msec
real 0m0.443s
user 0m0.147s
sys 0m0.085s

Number 2 with all WaLLy3K blocklists - 1,308,625 domains on blocklist

;; Query time: 1 msec
real 0m0.203s
user 0m0.136s
sys 0m0.031s

Repeated runs of this query (after TTL expiration) show data that overlaps and there is no statistically meaningful separation in the times between the two Pi-Holes.

Regarding whether the feature request is useful or not, I would suggest that it is useful for 2 reasons:

  1. It's very slow to restart Pi-hole on a Pi Zero if you have millions of blacklisted domains in gravity.list (it takes 20 seconds or more on my Pi Zero).
  2. It takes a lot of RAM, since Pi-hole keeps gravity.list in memory (about 100-150 MB of RAM for each million blacklisted domains, according to my tests; a quick way to check this is sketched after this list). Devices with 512 MB of RAM or less, like my Pi Zero, are therefore pushed to the limit.
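
For anyone who wants to check the memory figure on their own device, FTL's resident memory can be read straight from /proc (assuming the process is named pihole-FTL, as on a standard install):

# Resident memory of FTL, in kB (VmRSS); divide by 1024 for MB.
grep VmRSS "/proc/$(pidof pihole-FTL)/status"

# The same figure via ps (the RSS column is also in kB).
ps -o pid,rss,comm -C pihole-FTL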

Perhaps a better solution is to get your blocklist maintainers to remove NXDOMAIN entries from their blocklists before they publish them. This would have even more benefits:

  1. Less bandwidth and server load when thousands (or perhaps tens of thousands) of users download their block lists.

  2. Fewer demands on DNS servers - instead of every user testing all their domains, the block list maintainers do this once.

  3. Faster for users to download, process, remove duplicates, etc while rebuilding gravity.

  4. Simplified code in Pi-Hole.

This is why the cron job that rebuilds gravity runs on Sundays between 3 am and 5 am your local time.

Right now I am using a great many lists, and many of the maintainers are not properly purging them :frowning:

Pi-hole also restarts (or reloads its database, I'm not sure) every time I whitelist or blacklist a new domain. I've seen the pihole-FTL process sit at 100% utilization for 15-20 seconds on my Pi Zero every time I perform these tasks, and I'm assuming it is rebuilding the hash table of blacklisted domains in memory.

Should it be the responsibility of the Pi-Hole developers and Pi-Hole code to clean up this mess?

The first time you ran this script on millions of domains, how many NXDOMAINs were removed?

No, for sure not, that's why it's just a feature request :slight_smile: We will try and work on a script independently.

I went from 3,098,006 domains / 65 megs down to 2,009,122 domains / 42 megs

1,088,884 domains were removed

I agree. We'll work with @p1r473 on a new version of the script when version 5 is out, and we'll keep track of the dead domains (per list ideally) so we can notify the original maintainers to clean up their lists. We'll share the script with the community as well for anyone interested.
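
For the v5 version, the per-list tracking could look roughly like this (a sketch only; it assumes the gravity database is at /etc/pihole/gravity.db with a gravity table holding domain and adlist_id, and an adlist table with the list address; check the real schema with sqlite3 /etc/pihole/gravity.db '.schema' before relying on it):

#!/usr/bin/env bash
# Sketch: read domains per source list from the v5 gravity database and record
# the dead ones per list. Table and column names are assumptions -- verify them
# against your own database schema first.
DB="/etc/pihole/gravity.db"

sqlite3 -separator '|' "$DB" \
  "SELECT a.address, g.domain FROM gravity g JOIN adlist a ON a.id = g.adlist_id;" |
while IFS='|' read -r list domain; do
    if dig +time=2 +tries=1 "$domain" @9.9.9.9 | grep -q 'status: NXDOMAIN'; then
        # One report file per source list, so each maintainer can be notified.
        echo "$domain" >> "dead_$(printf '%s' "$list" | tr -c 'A-Za-z0-9' '_').txt"
    fi
done
# Note: deleting dead rows from the gravity table would only last until the next
# "pihole -g" run re-imports them from the source lists.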

In addition to NXDOMAINs, some domains returned other errors, such as "No answer":

nslookup 00038a.net

Non-authoritative answer:
*** Can't find 00038a.net: No answer

Feel free to close this feature request if you don't believe it enhances the product!

What happens when a domain you removed becomes active again? What about domains that are active when your script runs and go dead afterwards? There are a number of situations that render the approach of pruning lists rather futile.

FTL doesn't read or grep a text list every time a query happens; the lists are read into memory and that is what is queried. RPi Zero boards with limited memory have a limit on how many domains can be used. The gain you get from pruning converges quickly on being inconsequential. The better solution is to use better-quality lists, or to use hardware that can handle the load.

I run the script weekly to address the first question, but the suggestion from henf1b3rs will help: keep track of the dead and alive domains from previous weeks and only rescan the newly added ones. A full rescan can be done less frequently.

Looking up millions of domains weekly on the internet to save a msec or two here and there in DNS lookups does not seem to be a fruitful endeavor. As noted, there is a statistically insignificant difference in Pi-Hole search time.

What benefit is there in running blocklists totalling millions of domains compiled by strangers? Wouldn't it be much simpler, and put you in control, if you ran some basic blocklists and then made local blacklist and regex entries to block things you don't want to load?

Before you added all the blocklists, were you seeing ads or collecting malware?