Pihole -g / List download / disappointing performance

In previous pihole versions, when a list download failed, the previously cached version was used. These cached lists could be found in /etc/pihole, for example list.3.bitbucket.org.domains. In pihole 5, these lists only appear to exist while a list is being processed; they are deleted as soon as processing is completed.

Is this intentional? Downloads fail sometimes, and using the cached list was a benefit, as opposed to getting no data from that list at all.

I have 76 lists to process (defaults + firebog nocross + a few more, less than 10), and running pihole -g takes forever.
on pihole v5beta: approximately 1 hour 28 minutes
start: Wed 22 Jan 16:53:54 CET 2020
finished: Wed 22 Jan 18:21:45 CET 2020
on pihole v4.3.2: approximately 5 minutes
start: Wed 22 Jan 17:10:31 CET 2020
finished: Wed 22 Jan 17:15:09 CET 2020
Using the same lists, whitelist, and regex on both versions.
pihole v5beta database (v10) size: 213.860Kb
Is this normal? (running on a pi 3B).

I would almost suggest (probably not appreciated by the developers) reverting to the v4.3.2 routines to create gravity.list and then importing the result into the database...

during the pihole -g run...

 uptime
 16:50:12 up  5:19,  1 users,  load average: 4.71, 3.88, 2.94

Would this be a reason to buy a Pi 4 to run pihole?

Thanks for your time and effort.

I've noticed that the performance of my installation during the list download is severely degraded as well. Whenever my installation is fetching the newest lists and putting them into the database, DNS resolution is temporarily unavailable.

This would also take the same time. It's not the downloading of the lists that takes the time; it's inserting them into the database.

This is due to the indexing of each list, which is necessary so that lookups are fast even for clients that use only some of the adlists, because with 5.0 adlists can be assigned per client, too.
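
To make that concrete, here is a minimal sketch of an indexed, per-adlist layout; the table, column, and index names are assumptions for illustration only, not the actual gravity.db schema:

# Simplified illustration only; table, column, and index names are made up,
# not the real Pi-hole v5 schema.
sqlite3 /tmp/demo.db <<'EOF'
CREATE TABLE gravity (
  domain    TEXT NOT NULL,
  adlist_id INTEGER NOT NULL
);
-- The index that keeps lookups fast even when a client uses only some of
-- the adlists; it also has to be maintained for every row gravity inserts.
CREATE UNIQUE INDEX gravity_idx ON gravity (domain, adlist_id);

-- Example: is doubleclick.net blocked for a client assigned adlists 1 and 3?
SELECT EXISTS (
  SELECT 1 FROM gravity
  WHERE domain = 'doubleclick.net'
    AND adlist_id IN (1, 3)
);
EOF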

In my last test, the standard lists on a Pi 3 took 23 seconds to complete. For me this is acceptable.

Besides which, gravity is supposed to be a background process (hence the cron job for it to run once a week in the small hours of Sunday morning); there shouldn't be a huge reason for people to be running pihole -g constantly.

I'm currently investigating a possible speedup of the process. However, everything Adam said is 100% correct. I'm investigating whether putting everything into a single file and then mass-inserting from there into the database is any faster than doing smaller insertions for each list.

I guess this is what @jpgpi250 meant by

So far, the answer seems to be: it doesn't matter whether we do it one way or the other. The difference is less than about 2% and may very well be dominated by the general load on the system during the tests.

However, I still have a few ideas, so further work is going into this.

Is this repeatable, or was this a one-off result? On a Pi 3B+ with 41 lists, mine takes less than 10 minutes at most (I have not timed it precisely).

Rather than having the DB insert the data row by row, you have considered bulk loading the lists into SQLite (using the non-SQL .mode and .import statements, if I recall correctly), haven't you?

Not sure how this would impact indexing the freshly loaded data, though.
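
For reference, a bulk load with the sqlite3 dot commands could look roughly like the sketch below. The file and table names are placeholders, and creating the index only after the load reflects common practice rather than anything measured on Pi-hole:

# Sketch only; /tmp/list.example.domains stands in for a
# one-domain-per-line blocklist file.
sqlite3 /tmp/demo.db <<'EOF'
CREATE TABLE domains (domain TEXT);
.mode csv
.import /tmp/list.example.domains domains
-- Build the index once, after the bulk load, instead of updating it
-- for every inserted row.
CREATE INDEX domains_idx ON domains (domain);
EOF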

some ideas...

  • The old (v4) gravity list is already duplicate-free. This would eliminate the duplicate protection whilst inserting the preprocessed gravity.list into the database, which is probably consuming most of the time (CPU).
  • As far as I can tell, the pihole -g process starts by flushing the database (I noticed a message that would indicate this - correct me if I'm wrong). Could it not be an option to NOT flush the database, but add an additional field that indicates the number of times pihole -g has run?
    Example (see the sketch after this list):
  • test.com is in gravity; the latest pihole -g run counter is set to 54.
  • pihole -g starts; the counter is increased to 55.
  • As soon as test.com is found (ready to be inserted), it is either inserted (if new) or updated (if it already exists); either way, its pihole -g run counter is set to 55.
  • As soon as all lists are processed, run a database job that deletes all entries still having a pihole -g run counter of 54.
    This may even take longer, but it has the advantage that the database is never empty; gravity remains in use, even during pihole -g.
    You might also consider modifying the pihole command to run the -g in a screen session (sudo apt-get install screen).
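
A minimal sketch of that run-counter idea, assuming a hypothetical single-table layout (not how gravity.db is actually structured) and a SQLite new enough (3.24+) to support UPSERT:

# Hypothetical table and column names; requires SQLite 3.24+ for UPSERT.
RUN=55   # number of the current pihole -g run
sqlite3 /tmp/demo.db <<EOF
CREATE TABLE IF NOT EXISTS gravity (
  domain TEXT PRIMARY KEY,
  run    INTEGER NOT NULL
);
-- Insert a domain if it is new, otherwise just bump its run counter.
INSERT INTO gravity (domain, run) VALUES ('test.com', ${RUN})
  ON CONFLICT(domain) DO UPDATE SET run = excluded.run;
-- After all lists are processed, drop everything this run did not touch.
DELETE FROM gravity WHERE run < ${RUN};
EOF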

just my two cents...

This is definitely repeatable. I've been testing beta 5 since the announcement and have set up several new installs (fresh Raspbian + unbound + pihole v4 + beta 5), to ensure I always have the latest, greatest and unpolluted version of beta 5. The duration is consistent; it always takes this long, give or take a few (less than 5) minutes.

For comparison only, here is what I see on two current up-to-date installs of 5.0 beta:

Pi-3B+ ethernet

41 blocklists, time 3 minutes, 39 seconds, 778,609 domains on blocklist

Pi Zero W wireless

6 stock blocklists, time 58 seconds, 124,042 domains on blocklist

Yes, sure.

No, this is very efficient. The uniqueness is achieved through the tree (the index). As the domain's location in the database needs to be looked up anyway, SQLite immediately sees whether it is there or not at no extra cost.

Being duplicate-free is not what we want. This would severely limit the group management capabilities. We've discussed this elsewhere already, so I will not go into details here.

What you propose is similar to our first proof-of-concept implementation. Doing this in a loop for all the domains was orders of magnitude slower compared to mass-inserts through .import. Even for the standard lists, it took almost half an hour on a RPi 3B. There is really no alternative but to mass-insert. I'm currently experimenting with alternative paths in the SQLite3 bytecode engine for the B-tree and will report when I find something that really makes a difference.
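
To illustrate the gap being described here (this is not Pi-hole's actual code; file and table names are placeholders):

# Slow: one sqlite3 process and one implicit transaction per domain
# (quoting is naive here, for illustration only).
while read -r domain; do
  sqlite3 /tmp/demo.db "INSERT INTO domains (domain) VALUES ('${domain}');"
done < /tmp/all.domains

# Much faster: one process and one transaction around all inserts
# (Pi-hole's mass-insert uses .import, as described above).
{
  echo "BEGIN;"
  sed "s/.*/INSERT INTO domains (domain) VALUES ('&');/" /tmp/all.domains
  echo "COMMIT;"
} | sqlite3 /tmp/demo.db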

How many domains end up in the blocklist with this collection of lists?

Edit - generate a debug log, upload it and post the token. I'll test your list collection on my Pi and see if we have similar time results.

Could you not consider building gravity in a temporary table, leaving the active gravity list and views operational?

Just trying to help here; you'll hear this complaint (about performance) from multiple users as soon as they realize this. No offence intended whatsoever.

Which database is this? That can't be the gravity.db.

How often is the update being run? Are we trying to optimize out minutes in a process that runs once a week, maybe once a day?

Interesting, but would we still have the same issue when trying to merge the two? I'm assuming the active table may change while the temporary table is being built.

I'm not questioning this here; you cannot know what we did during the (quite extended) investigations that went into this massive change. That's why I'm just summarizing it. New ideas are never bad; however, if they have already been tried, we can skip them.

Yes and no: an in-memory database won't work. However, a duplicated on-disk database might work.

I do recall experiencing good results with this approach, though with a different DB (Oracle).

Alternating tables (or whole schemas, if need be) would also allow for validation of the data before activating it, while leaving the system fully operational (well, depending on CPU) for the duration of both import and validation.
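
As a generic sketch of that alternating-table pattern (placeholder names; not a claim about how gravity is or will be built):

sqlite3 /tmp/demo.db <<'EOF'
-- Build and validate the new data next to the live table; readers keep
-- querying "gravity" in the meantime.
CREATE TABLE gravity_new (domain TEXT NOT NULL, adlist_id INTEGER NOT NULL);
-- ... bulk load and validate gravity_new here ...
CREATE UNIQUE INDEX gravity_new_idx ON gravity_new (domain, adlist_id);

-- Swap in one transaction once validation has passed.
BEGIN;
DROP TABLE IF EXISTS gravity;
ALTER TABLE gravity_new RENAME TO gravity;
COMMIT;
EOF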

Would .importing into an existing table vs. .importing into a newly created one have any performance impact?

EDIT:
I am not sure if any of the following would apply to SQLite or Pi-hole, or if @DL6ER has already employed equivalent tactics - but for what it's worth, take what is useful from my crumbled memories and just discard the rest:

I struggle to remember exactly how we tackled a similar problem over a decade ago (importing several hundred thousand structured records from a bunch of flat files into several database tables on a daily schedule).

I think we switched off indexes when merging imports into existing tables, re-enabling them only after load completion, and postponed index creation until after bulk loading into newly created tables. This resulted in the most dramatic gain, almost an order of magnitude, and even more than that on tables with several indexes. Also, switching all indexes to upper case exclusively had some impact on the larger merge tables, if a minor one.
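
In SQLite terms, that tactic would look roughly like the sketch below (placeholder names; whether it helps as much here as it did on Oracle is exactly the open question):

sqlite3 /tmp/demo.db <<'EOF'
-- Drop the index, merge the new batch, then rebuild the index once
-- instead of updating it for every inserted row.
DROP INDEX IF EXISTS domains_idx;
.mode csv
.import /tmp/new_batch.domains domains
CREATE INDEX domains_idx ON domains (domain);
EOF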

As for the tables we chose to merge, diff'ing a file against its direct predecessor and working on that diff result proved to be faster than any in-process evaluation we tried.
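
With plain shell tools, that diff-based approach might look like this (placeholder file names; sketched here only to illustrate the idea):

# Compare yesterday's and today's copy of a list and work only on the changes.
sort -u old.domains > old.sorted
sort -u new.domains > new.sorted
comm -13 old.sorted new.sorted > added.domains    # lines only in the new file
comm -23 old.sorted new.sorted > removed.domains  # lines only in the old file
# added.domains would then be inserted and removed.domains deleted,
# instead of reloading the complete list.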

Doesn't pihole -g run every time you add a blocklist and click save and update? There was already a topic about not being able to upload multiple blocklists using the web interface. Hit the 'save and update' button (by accident) instead of the 'save' button and your day is over...

How often do you change things? It seems like a one-time cost for setting things up. Amortize the pain.

I hear the issue, I'm just trying to see what the overall impact is.

And if that takes out your entire day, well...