FTL - Further Normalize Tables

Thanks for updating the code. I was lazy and didn't format or polish anything as I was rushing through. Also, no doubt with a normalized schema some of the queries could be updated over time; I just went with the simplest method.

I only used the view/trigger to allow a direct swap; the better method is obviously doing direct inserts and removing the view and trigger. They were just a quick example to get the exchange going, and updating the queries themselves would be even better.

While this doesn't give precise timing for the migration process, it could take place in the background, invisible to the user, using the following method:

  1. The update takes place; Pi-hole continues to use the current DB and handles everything normally.
  2. Migration to the new DB takes place using something similar to what we used above. This happens in the background, invisible to the user, with the old DB still in place.
  3. Once the scripts complete, the new DB is validated (see the sketch below).
  4. If validated, the system replaces the old DB with the new DB; this should take very little time as it would just be a move.
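
For step 3, a minimal validation sketch in SQL, run against the freshly migrated file (the database path and the assumption that both schemas expose a queries table/view are placeholders, not an actual implementation):

ATTACH DATABASE '/etc/pihole/pihole-FTL.db' AS old;

-- structural soundness of the freshly migrated file
PRAGMA integrity_check;

-- row counts must match before the swap is performed
SELECT (SELECT COUNT(*) FROM main.queries) AS new_count,
       (SELECT COUNT(*) FROM old.queries)  AS old_count;

DETACH DATABASE old;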

What if users interrupt this by rebooting, etc.? We'd have to detect this and start afresh on the next round. Given the severely limited memory on Raspberry Pi systems (less than 300 MB available for use, maybe even less on other SBCs), will wrapping it into a single transaction even work? Is SQLite3 intelligent enough to see that it is nearing the end of the available memory on the system? I'm afraid it is not.
The same goes for disk space, which we may run out of during the conversion, as all of the information will be duplicated at some point.

You see my point of view: I'm talking about practical implementation aspects more than theoretical ideas. I do see a potential benefit here, but only if we can make the transition rock-solid will it be advantageous for our users. Otherwise, if nobody really notices it but it breaks somehow for even a small number of people, the pain caused is much larger than the theoretical win.

@DL6ER, that could easily be addressed in C, I'd imagine. I watched in top -o %MEM and memory usage was under 10 MB, i.e. negligible. If it's done in C I'm sure it would be higher, but since there is no commit etc. I think it cuts down the memory a lot.

The idea I think would work is pretty simple: some kind of flag or check, which could be as simple as a file existing to make it persistent. The basic check would be: if the process has not completed, check whether the new DB name exists and, if so, delete it and restart.

Actually, a better method would be to add a version history table and use that to keep track. That's how I have done it in other applications: just keep the version number plus a timestamp for completion. If the new version or the table does not exist, start over. Again, we just need to keep in mind to check whether the temporary new DB exists and, if so, delete it, otherwise it will error out.
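
A minimal sketch of such a table (illustrative names and an arbitrary version number, not FTL's actual schema):

CREATE TABLE IF NOT EXISTS migration_history (
    version   INTEGER PRIMARY KEY,  -- schema version reached
    completed INTEGER NOT NULL      -- unix timestamp when the migration finished
);

-- written only once the migration has fully completed
INSERT INTO migration_history (version, completed)
VALUES (10, strftime('%s','now'));

-- on startup: no row returned means the migration has to be (re)started,
-- after first deleting any half-written temporary database file
SELECT completed FROM migration_history WHERE version = 10;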

As for storage space, that is a good point. My only thought is to check whether they have at least 50% of the current DB's size available as free space, and if not, give them a notice that they cannot upgrade with their current storage. As stated before, storage is cheap and I really can't see this being a big issue, but a check would of course make sense.
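
The size side of that check can be estimated in SQL alone (the 50% figure is the one suggested above; comparing the result against the free space on the filesystem would still have to happen outside of SQLite):

-- current database size in bytes, halved, as the minimum free space to require
SELECT page_count * page_size / 2 AS required_bytes
FROM pragma_page_count(), pragma_page_size();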

Do you have a consolidated list of the queries used anywhere, or are they spread throughout the code? I'd be happy to look through them all to see what changes would be beneficial.

I think it might be easier to run a […] than solving the migration problem.

Once users upgrade, everything will be written to a new database which is normalized. The only place the old database comes into play is when querying the long-term data. There it gets tricky as the data has to be concatenated.

We're currently exploring using an in-memory SQLite3 database instead of storing the data bare-metal in compact structs. This will be somewhat slower and will use somewhat more memory, but it will allow for a lot of simplifications in other places of the code, as we can filter much more easily when requesting data from history. Even though the current code is quite optimized (and battle-tested), using SQL allows for much more.
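
As a small illustration of the kind of filtering that becomes a one-liner with SQL (the status values and column names here are assumptions for the sake of the example, not FTL's actual schema):

-- top clients over the last hour, counting only blocked queries
SELECT client, COUNT(*) AS hits
FROM queries
WHERE timestamp >= strftime('%s','now') - 3600
  AND status IN (1, 4, 5)        -- assumed set of "blocked" status codes
GROUP BY client
ORDER BY hits DESC
LIMIT 10;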

Yes. Sort of, at least.

The best idea I have right now is starting over in a new database. Then import the history data in batches, say 1,000 queries at a time, say every 10 seconds.

Under these conditions, converting a database with 15 million queries (41,000 per day) will take close to two days (15,000 batches at 10-second intervals is roughly 42 hours).

We go from the newest down to the oldest query when importing and don't do any special complex handling otherwise. We would mention in the changelog that long-term data may be incomplete for some time during the background migration task, but this should run to completion sooner rather than later.
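
One such import batch could look roughly like this in SQL. This is purely a sketch: the lookup tables are assumed to be populated already, and the table and column names of the new schema are assumptions rather than FTL's actual layout.

ATTACH DATABASE '/etc/pihole/pihole-FTL.db' AS old;

BEGIN;
INSERT INTO queries (timestamp, type, status, domain_id, client_id)
SELECT q.timestamp, q.type, q.status, d.id, c.id
FROM old.queries AS q
JOIN domain AS d ON d.name = q.domain
JOIN client AS c ON c.ip   = q.client
WHERE q.timestamp < (SELECT MIN(timestamp) FROM queries)  -- only what is still missing
ORDER BY q.timestamp DESC                                 -- newest first
LIMIT 1000;                                               -- one batch = 1,000 queries
COMMIT;

DETACH DATABASE old;

Since each batch lowers the oldest timestamp present in the new database, an interrupted run simply continues where it left off on the next attempt (ties at the batch boundary would need extra care in a real implementation).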

This seems to be the biggest showstopper left at this moment (apart from the fact that we'd need to write and test a lot of code). I'm not really confident about this right now.

Another idea: how many users really use the long-term features? Maybe it's better to start with a fresh database and just insert the most recent 24 hours. We leave the old database around but do not import anything automatically. We nevertheless offer a script they can run which does the job for them. I don't see any problems with doing it in SQL alone:

  1. Full database scan: Add all distinct clients to a table
  2. Full database scan: Add all distinct domains to a table
  3. Migration in steps of 1,000 queries: INSERT INTO ... SELECT ... magic applies

By doing it in smaller steps, we can give users a progress indicator instead of just having a script that runs silently for a day or two. Also, it'd be interruption-safe (if everything is wrapped into a transaction).
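
Steps 1 and 2 would essentially be two full-scan statements like the following (the lookup table names and the UNIQUE constraints on client.ip and domain.name they rely on are assumptions for illustration):

BEGIN;
INSERT OR IGNORE INTO client (ip)   SELECT DISTINCT client FROM old.queries;
INSERT OR IGNORE INTO domain (name) SELECT DISTINCT domain FROM old.queries;
COMMIT;

Step 3 would then follow the same batched INSERT INTO ... SELECT pattern as the sketch further above, with each batch wrapped in its own transaction so an interruption never leaves a partially imported batch behind.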

We'd have to keep that dual-database setup around for as long as MAXDBDAYS (with a one-year default).

During that time, it would increase disk storage requirements by roughly 50%, rather than delivering on the one relevant advantage that normalisation has, i.e. lowering storage.

Normalisation tackles storage by avoiding redundancy, as well as data integrity (by avoiding having to update the same data in multiple places). Integrity isn't that important for Pi-hole, as we never update a query once it's written, and following jfb's argument here, storage is not an issue.

The increased speed for some specific queries comes from indexed or rather keyed access to a smaller set of data in normalised tables (autoindexed by id).
Queries that benefit from that access may be faster, but those that don't and still require all columns of a record may be negatively impacted.
Assuming that we always need the complete record, this would likely be true for almost all queries that you can issue through Pi-hole's UI, as they almost always require a timeframe. Whether performance would be better or worse would depend on the actual options chosen, i.e. on the specific query.

Speed benefits could also be achieved by defining an additional index on a respective field, without incurring the cost of traversing relations (but increasing storage requirements slightly and still making insertions more expensive).

Consider this statement:

SELECT count(*) FROM queries
WHERE (timestamp BETWEEN strftime('%s','2020-10-02') AND strftime('%s','2021-01-01')
       AND client = '192.168.127.26');

        pihole-FTL.db          normalised.db          pihole-FTL.db with client index
real    x7.8   0m0.204s        x7.3   0m0.189s        x1   0m0.026s
user    x11.5  0m0.161s        x11.4  0m0.160s        x1   0m0.014s
sys     x3.6   0m0.043s        x2.4   0m0.029s        x1   0m0.012s

Note that creating a client index on the normalised data won't provide the same performance boost, as the data set would be smaller already.
With only a few hundred clients, I'd expect very marginal gains by indexing client.


What are the goals we would try to achieve by normalising?
It seems to me that if we dismiss storage reduction and don't need integrity improvements, speed could be tackled more effectively by other means, and with fewer migration concerns.

Creating an index on client/domain now is not really feasible, as this requires a full database scan. And since the client is duplicated very often (as are domains, but less so), the index tree has to be restructured all the time.

Experiment on a Raspberry Pi 3 running Ubuntu (aarch64) - expect things to be slower on RPi Zero:

$ ls -lh pihole-FTL.db
-rw-r--r-- 1 ubuntu ubuntu 274M Mar 29 09:24 pihole-FTL.db

$ time sqlite3 pihole-FTL.db "SELECT COUNT(*) FROM queries;"
4'143'870

real    0m5.406s
user    0m0.063s
sys     0m0.542s

$ time sqlite3 pihole-FTL.db "CREATE INDEX IF NOT EXISTS query_clients ON queries (client);"

real    1m59.446s
user    0m30.638s
sys     0m7.983s

$ ls -lh pihole-FTL.db
-rw-r--r-- 1 ubuntu ubuntu 358M Mar 29 09:52 pihole-FTL.db

$ time sqlite3 pihole-FTL.db "CREATE INDEX IF NOT EXISTS query_domains ON queries (domain);"

real    1m42.239s
user    0m34.677s
sys     0m4.513s

$ ls -lh pihole-FTL.db
-rw-r--r-- 1 ubuntu ubuntu 466M Mar 29 09:55 pihole-FTL.db

$ time sqlite3 pihole-FTL.db "CREATE INDEX IF NOT EXISTS query_forwards ON queries (forward);"

real    1m23.249s
user    0m46.498s
sys     0m2.740s

$ ls -lh pihole-FTL.db
-rw-r--r-- 1 ubuntu ubuntu 524M Mar 29 09:57 pihole-FTL.db

We nearly doubled the required disk space here!


Splitting things into three (maybe four, if forward gets its own as well) tables also makes the process of storing data from the v6.0 in-memory database to the disk database more complicated: we need some additional REPLACE ... statements covering the entire domain, client, and forward in-memory tables.
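
Roughly, those extra statements could take a shape like this (purely a sketch; disk stands for the attached on-disk database and the table layouts are assumed to match the in-memory ones):

BEGIN;
INSERT OR REPLACE INTO disk.domain  SELECT * FROM domain;
INSERT OR REPLACE INTO disk.client  SELECT * FROM client;
INSERT OR REPLACE INTO disk.forward SELECT * FROM forward;
-- ...followed by the regular insert of the new query records themselves
COMMIT;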

My index example above is not meant as a call to make use of it.

It is meant as an example to highlight two aspects in discussing normalisation:

  • if we were seeking to improve search speed, there are other options available that may even offer better performance
  • applying normalisation may preclude us from some of those options

in order to introduce the ensuing question:

Reduction of storage requirements and improved data integrity are the goals that normalisation is guaranteed to contribute to.
Search speed may be affected positively as well as negatively, while insertion speed will be impacted somewhat negatively.

I did not make any assumptions. I just looked at the table definitions.

There are mainly two approaches to define data structures:

  1. the process oriented approach
  2. the data oriented approach

Since E. F. Codd (relational model) and P. Chen (entity relationship approach), there have been long and intensive discussions over several decades about which approach is the better one.

The state of the art now is: the data oriented approach is superior to the process oriented approach.

You can believe it or not, just do some googling. Those who do not learn from history are damned to repeat the errors of the past.

Why is the data oriented approach better than the process oriented approach?

The process oriented approach focusses mainly on technical aspects (e.g. performance, storage needs). This leads inevitably to data redundancy, and data redundancy is poison for data quality. There is no advantage when a fast algorithm processes wrong and inconsistent data.

The data oriented approach focusses on data quality and data integrity. That means it improves the value of the later data processing.

Remember N. Wirth's famous book Algorithms and Data Structures. He elaborates that appropriate data structures lead to simpler program structures. Therefore, when you begin a software project, start with defining the data structure and then, in a second stage, derive the program structure. If you do it the other way around, you'll run into a lot of trouble.

When I look at the postings of the last few days, I can't see anybody who thinks in terms of data quality. Nearly everyone concentrates on technical aspects. There seem to be a lot of hackers around.

Rather than making further assumptions, and in order to make the jump from textbook to practical application, could you give some concrete examples of how you fear data integrity in queries may be compromised with the solution at hand, and how you would apply normalisation to help avoid that?

I fear you do not know the meaning of data integrity. Data integrity can only be compromised by manipulating operations. Because queries do not manipulate data, they cannot compromise data integrity by definition.

Professional data engineers start with an entity relationship model of the user data. If you represent such a model as (mathematical) relations, then this set of relations is already in 3NF. So there's no need for any normalization steps.

He was referring to the specific table called queries, not to query operations on the data.

I know data integrity well enough, and I am well aware it is of little concern in this specific case.

I specifically asked because you fielded data integrity to support your criticism.

So I wanted to know why you came to think that data integrity would be at risk with the current layout, or how you think it could otherwise benefit from normalisation.

It is not a good idea to pick just one table for examination. That means aiming at the symptoms of the problem and not at its roots.

Make a proper top-down design and you will see that you won't get a table like your current queries table.

The queries table has been the prime subject of this discussion.

It may help to understand that it is a complete representation of a DNS query log, used for reporting and analysis purposes only, and that it's only ever written to once per record.

So, how do you think that table would benefit from data integrity improvements offered by normalisation?

As I already said: it is not a good idea to pick just one table for examination. That means aiming at the symptoms of the problem and not at its roots.

If you do not change and broaden your point of view, then you will not see the general problem with your set of data.

Well, of course not.

I see where that misunderstanding is coming from:
You seem intent on turning this into a general controversy about normalisation.

It isn't. We agree it is generally beneficial.

But this whole discussion is about a normalisation proposal for a specific table, and how that would affect Pi-hole's associated UI driven database interaction overall (and how secondary concerns like migration could best be handled if it were adopted).

As you've already acknowledged, the data integrity argument from normalisation doesn't apply here. Any step taken towards further normalisation has to consider potential negative effects and weigh those against an obvious decrease in storage requirements.

Reasserting that database normalisation is commonly a good thing is obviously correct, but does not add anything to this specific issue.

I fear you haven't understood what I've written in my previous postings. At no time did I ever plead for normalization.

Normalization is just an algorithm to transform a set of relations from one state to another. It has nothing to do with data design.

I used the OP's mention of normalization as an opportunity to plead for a data oriented view in opposition to the process oriented view.
