Pihole -g / List download / disappointing performance

No. FTL v5.0 does not need any gravity runs for any list manipulations. We just send a suitable signal to pihole-FTL (automatically done for you) and FTL will find out for itself what changed.

pihole -g is only run once a week (unless invoked manually).


While we're at it, I should mention that FTL can be informed about list changes in three ways:

  • pihole restartdns: Full restart of pihole-FTL (slowest)
  • pihole restartdns reload: Looks for updates to the lists, flushes dnsmasq's DNS cache (faster than with v4.x)
  • pihole restartdns reload-lists: Looks for updates to the lists WITHOUT flushing dnsmasq's DNS cache (new, basically instant)

As I indicated in this topic to @DL6ER, I've added the proposed database upgrade version 10 to my system. Somebody has to test it; I was expecting this would be appreciated...

Testing is appreciated, but for troubleshooting a beta 5.0 issue it would be best to be running the distributed software, to eliminate a potential variable.

The list downloads work fine; the processing message appears almost immediately. Furthermore, if it were a network problem, it would also show up in Pi-hole v4.

edit
as you can see in the debug log, I'm also using local lists (file:///home/pi/BL/tracker/domains). Processing is also slow for this list.
/edit

I know you guys keep repeating that 30 seconds is fine for the standard lists, but the reality is that a large percentage of users have millions of domains spread across many blocklists. This is of course a matter of preference; I personally prefer a smaller list. I am not criticising either approach, as I know you can't please everybody.

The problem you're going to have is that people will be used to the original pihole -g speed from before the added complexities of adlist IDs etc. It will be seen as a step backwards, even though we know that in actual fact it greatly improves functionality in many ways.

Might it be an idea to focus more on 'sync' rather than truncating the table and starting afresh? This might also solve the issue of lists not being accessible during the gravity run.

I mainly work in Python now, so I'm not too familiar with the limitations of bash, but surely, for each URL provider, you could instead compare the old and new lists, generate an array of additions/removals, and then make only the necessary changes? Say, inserting a couple of thousand URLs instead of millions every time gravity is run? The time you'd lose in the extra checks would easily be won back in the DB updates?
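For illustration, something along these lines (file names are hypothetical, and comm requires both files to be sorted):

comm -13 old.list new.list > added.list     # domains only in the new download
comm -23 old.list new.list > removed.list   # domains that have disappeared
# ...then apply only added.list and removed.list to the database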

This is a fair assumption, ref this topic.

Using only the default block lists feels like driving a sports car round the block in a 20 mph zone and convincing yourself it's a powerful car.

The default block lists do NOT take regional domains into account, one must add at least some regional lists.

edit
A long time ago, I wrote a topic about testing the value of new lists. I still use this on Pi-hole v4; it works, and the result is useful for deciding whether or not to add a new list.
So whenever I add a list, I verify whether adding it is worthwhile (a low number of new entries -> don't add); see the sketch below...
/edit
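For illustration, a rough sketch of such a check (the URL and paths are placeholders, not my actual script; assume the currently blocked domains have been exported to /tmp/existing, one domain per line, sorted):

curl -s https://example.com/candidate.list | sort -u > /tmp/candidate
comm -23 /tmp/candidate /tmp/existing | wc -l   # count of genuinely new domains

A low count means the candidate list adds little and is probably not worth the extra gravity time.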

No. Mass-insertions are very fast. Looking up each domain and then branching into an if is very slow. This is a severe limitation of bash, and it is similar for Python. One of the first things you learn in Python: avoid long loops! The second thing you learn: avoid long loops with many branches inside.
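To illustrate the difference, a minimal sketch (schema and file names are simplified here, not the actual gravity script):

# Slow: one sqlite3 process and one branch per domain
while read -r domain; do
  if [ -z "$(sqlite3 gravity.db "SELECT 1 FROM gravity WHERE domain = '${domain}';")" ]; then
    sqlite3 gravity.db "INSERT INTO gravity (domain) VALUES ('${domain}');"
  fi
done < domains.list

# Fast: a single bulk import, all INSERTs inside one transaction
sqlite3 gravity.db ".import domains.list gravity"

With millions of domains, the first variant forks millions of processes; the second touches the database exactly once.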


I'm working really hard on this topic; I'm not taking anything lightly here. When I'm not responding quickly, it's because I'm deep inside the code, all mental capacity bound up there.

My first iteration of performance improvements is done; I will look at this again tomorrow. It's already late over here in Europe.

See for yourself:

Switched to branch 'release/v5.0'

pi@munichpi:/etc/.pihole# time pihole -g
  [i] Neutrino emissions detected...
  [✓] Pulling blocklist source list into range

  [✓] Flushing gravity table
  [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 1 to database table

  [i] Target: https://mirror1.malwaredomains.com/files/justdomains
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 2 to database table

  [i] Target: http://sysctl.org/cameleon/hosts
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 3 to database table

  [i] Target: https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 4 to database table

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 5 to database table

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 6 to database table

  [i] Target: https://hosts-file.net/ad_servers.txt
  [✓] Status: Retrieval successful
  [✓] Adding adlist with ID 7 to database table

  [i] Number of gravity domains: 146591 (124042 unique domains)
  [i] Number of exact blacklisted domains: 0
  [i] Number of regex blacklist filters: 2
  [i] Number of exact whitelisted domains: 0
  [i] Number of regex whitelist filters: 0
  [✓] Cleaning up stray matter

  [✓] Flushing DNS cache
  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

real    0m49,244s
user    0m10,144s
sys     0m2,527s

versus

Switched to branch 'tweak/gravity_performance'

pi@munichpi:/etc/.pihole# time pihole -g
  [i] Neutrino emissions detected...           
  [✓] Pulling blocklist source list into range                             
                                                                                 
  [✓] Preparing new gravity table                   
  [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  [✓] Status: Retrieval successful     
                                                                              
  [i] Target: https://mirror1.malwaredomains.com/files/justdomains  
  [✓] Status: Retrieval successful                                      
                                                                         
  [i] Target: http://sysctl.org/cameleon/hosts                          
  [✓] Status: Retrieval successful                                          
                                               
  [i] Target: https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist                                             
  [✓] Status: Retrieval successful                                                               
                                                                       
  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
  [✓] Status: Retrieval successful                           
                                                                
  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
  [✓] Status: Retrieval successful                                                                          
                                            
  [i] Target: https://hosts-file.net/ad_servers.txt
  [✓] Status: Retrieval successful    
                                                                
  [✓] Storing downloaded domains in new gravity table
  [✓] Activating new gravity table                                       
  [i] Number of gravity domains: 146591 (124042 unique domains)
  [i] Number of exact blacklisted domains: 0                                 
  [i] Number of regex blacklist filters: 2                                               
  [i] Number of exact whitelisted domains: 0            
  [i] Number of regex whitelist filters: 0          
  [✓] Cleaning up stray matter                                          

  [✓] Flushing DNS cache
  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

real    0m33,544s
user    0m9,753s
sys     0m2,026s

This is on an RPi 3B with the stock lists. You can try it yourself with

pihole checkout core tweak/gravity_performance

However, note that there are no guarantees that everything works as expected, although it should.
This branch also already implements the idea of populating an alternative table called gravity_new and only swapping the two at the end. I haven't had the time to check whether this leaves all triggers and views intact; I hope so.
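The swap itself is cheap. Roughly (the statements are illustrative, not the literal script):

sqlite3 /etc/pihole/gravity.db "DROP TABLE IF EXISTS gravity; ALTER TABLE gravity_new RENAME TO gravity;"

This way, the old table stays live and keeps serving queries while the new one is being filled.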


Appreciate the time you're putting in, @DL6ER. Please don't take my comments negatively; I just think the update from v4 to v5 has overall been such an improvement that it would be a shame if it were to stumble a little on gravity updates. It's only meant in a constructive manner.

If it were run entirely in the background it would be fine, but if it runs after tweaks by the user in the UI etc., it could put them off a bit if it looks like it's just hanging.


No UI changes need gravity runs; see the explanation above.

@DL6ER, your effort is much appreciated. I don't intend to offend any of the developers with my comments; I'm just trying to prevent a massively negative response to this issue when the final v5 is released.

Thanks again to all the developers for your time and effort.

If you're going that way (download first, process later), could you please reinstate the code that uses the cached versions in case of download failure? (This was part of the original question; see the first entry of the topic.)

I will install 2 new systems tomorrow, one with beta 5 + database v10 and one with gravity_performance, using my adlist, regex and whitelist entries. This will take a while; I'll report back with the processing times.

Good night from Europe (Belgium).


I don't feel offended at all. It will be interesting to see your comparison.

Yeah, sure, I'm currently working on the performance. The cached files are something I will look into as well at some point, but that is a separate issue. The code to re-use the cached versions (if they are available) is still present; I suspect there is some misguided rm somewhere. Let's see tomorrow.

edit It turns out we keep them, but we moved them into the migration_backup directory. This is now fixed.


Yeah, unfortunately, he did so already. It doesn't change anything, and I will briefly explain why.

Firstly, there is no way to completely get rid of an index; it is the defining element of a row in SQLite3. Rows can have variable lengths so as not to waste memory: say the comment is NULL for one domain but 2 KB for another. You would not want the database engine to reserve "enough" space for very long comments for every domain.

Secondly, we flush the database table before we fill it (in one go). The index is empty and can be built up from scratch. This already dramatically reduces the need for reorganization in the tree.

Thirdly, we mass-insert everything in one go. All the INSERTs are done in a transaction; more specifically, the sqlite3 CLI tool uses SAVEPOINT and RELEASE, which basically comes down to the same as BEGIN and COMMIT. All insertions are collected in a transaction and processed at once when the savepoint is released, at the end of the mass-insertion. Only then is the database written to and the tree built up (as the index is populated). I see no way this could be implemented any better.
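In SQL terms, the pattern is roughly this (a sketch with illustrative names, not the literal gravity code):

sqlite3 gravity.db <<'EOSQL'
SAVEPOINT bulk;     -- comparable to BEGIN
-- ...all the mass-inserted domains are collected here...
RELEASE bulk;       -- comparable to COMMIT: one single write to disk
EOSQL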

Even if the index were built in a separate step (which is not possible, but just assuming it were), we would need to walk the entire database table again when creating the index. We would not gain anything; it is rather likely that we would be slowed down by the heavy I/O involved.

TL;DR: Building the database tree really cannot be accelerated much further. We start from an empty tree (empty table) and accumulate all added domains in a transaction that is written to disk at once. I see no other way this could be realized faster.


I had some more ideas: there is some potential in just leaving the incoming domains unsorted. As contradictory as it may sound, unsorted domains may even speed up the tree build-up, as random access is faster than always appending at the last leaf in a B-tree.
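A quick way to test that hypothesis, as a sketch (table name and schema are illustrative):

sort -u domains.list > sorted.list
shuf sorted.list > shuffled.list
sqlite3 a.db "CREATE TABLE gravity (domain TEXT PRIMARY KEY);"
sqlite3 b.db "CREATE TABLE gravity (domain TEXT PRIMARY KEY);"
time sqlite3 a.db ".import sorted.list gravity"
time sqlite3 b.db ".import shuffled.list gravity"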


The last one for today:

root@munichpi:/etc/.pihole# time pihole -g
  [i] Neutrino emissions detected...
  [✓] Pulling blocklist source list into range

  [✓] Preparing new gravity table
  [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  [✓] Status: Retrieval successful

  [i] Target: https://mirror1.malwaredomains.com/files/justdomains
  [✓] Status: No changes detected

  [i] Target: http://sysctl.org/cameleon/hosts
  [✓] Status: No changes detected

  [i] Target: https://zeustracker.abuse.ch/blocklist.php?download=domainblocklist
  [✓] Status: Retrieval successful

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
  [✓] Status: No changes detected

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
  [✓] Status: No changes detected

  [i] Target: https://hosts-file.net/ad_servers.txt
  [✓] Status: No changes detected

  [✓] Storing downloaded domains in new gravity table
  [✓] Activating new gravity table
  [i] Number of gravity domains: 146591 (124042 unique domains)
  [i] Number of exact blacklisted domains: 0
  [i] Number of regex blacklist filters: 2
  [i] Number of exact whitelisted domains: 0
  [i] Number of regex whitelist filters: 0
  [✓] Cleaning up stray matter

  [✓] Flushing DNS cache
  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

real    0m11,276s
user    0m3,959s
sys     0m0,705s

Looks like roughly a 4x speed-up compared to earlier today.

Testing with large blocking lists (of lower quality) will show whether we can leave out the call to uniq (which consumes about 50% of the time in the gravity script! - about as much as the entire database processing).
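For reference, one micro-optimization worth measuring here, assuming the script currently uses a separate uniq stage (just a sketch, nothing committed yet):

# two processes:
sort domains.list | uniq > unique.list
# one pass, same result:
sort -u domains.list > unique.list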

There is something unusual in your setup. I took your blocklists (minus the first two, which are local to your network; minus three where the full URL was truncated in your debug log; and minus https://wally3k.github.io, which is not a blocklist but a collection of blocklists and cannot be parsed by Pi-hole) and ran them through a Pi Zero W running the latest dev branch (which should be the same as the 5.0 branch at this point). Note that a Zero has a 1 GHz single-core CPU and 512 MB of RAM, while a 3B has a quad-core 1.2 GHz CPU and 1 GB of RAM.

Results on the Zero W:

  [i] Number of gravity domains: 3100840 (2137802 unique domains)

real	25m56.066s

Edit - I added the wally3k URL with no other changes; the number of domains changed just a bit (one of the blocklists was likely updated in the interval), and the time went up by only 5 minutes.

 [i] Number of gravity domains: 3101039 (2137993 unique domains)

real	29m46.952s

So, your lists on a device with less capability than your Pi will rebuild gravity in 1/3 of the time you are seeing. Still not rocket-ship speedy, but your time is a definite outlier.

The beta 5 announcement says:

echo "release/v5.0" | sudo tee /etc/pihole/ftlbranch
pihole checkout core release/v5.0
pihole checkout web release/v5.0

Do I enter the following instead to go to the new branch on a Pi-hole v4.3.2 system?

echo "release/v5.0" | sudo tee /etc/pihole/ftlbranch
pihole checkout core tweak/gravity_performance
pihole checkout web release/v5.0

I don't want to falsify the test or destroy another system; this is a lot of work...

Yeah, that should have much the same effect as checking out that branch from a 5.0 system. If you want to be safe, go from 4.3.2 to 5.0 first, and then check out the gravity performance branch of core afterwards. But it shouldn't make much difference!

You can always test a workflow on something like a DigitalOcean droplet / disposable VM instance of some kind before you do it for real 🙂

Can you please just cache every list, so that if a future download fails the cached copy can be re-used? I believe you didn't think about server failures when removing this feature.

Also, /var/cache is a thing, btw; there is no need to keep this in /etc.

Asked and answered earlier:
