FTL - Further Normalize Tables

So this isn't so much a feature as an enhancement to speed up FTL. Having looked into the databases and how the data is stored, I am questioning why it isn't being fully normalized to save space, which would increase efficiency/performance. It seems like it was partially normalized with status_id and client_typ_id, but wouldn't domain and client help also? Doing this with just a few weeks of data cut my storage space by > 50%.

The current design only meets UNF instead of 1NF due to the duplication of clients/domains.


That's discussing an implementation detail, not a feature that adds some new functionality.

I'm confident the developers carefully weighed their options.

Normalisation may save some space, but it almost always negatively impacts performance. Additional joins mean additional I/O, thus slower and/or more expensive results.
It would come down to how that cost would impact Pi-hole, and whether it could take, mitigate or maybe avoid that hit.

To elaborate on this a bit:

Reading and assembling a complete logical record from multiple tables is more expensive than doing so from a single one, and the same is true for decomposing and writing it to multiple tables as well.
The overall cost of those associated operations can be expected to increase, with the exception of column-based/6NF db designs, where this would be true for any column anyway (not accounting for db engine optimisations, of course).

How a given SQL statement's execution speed is affected is another thing.
It would depend on the individual query structure and the actual data: whether a query can take advantage of the smaller size of the normalised data set and/or an index to match records, whether other columns are relevant for matching records, and whether the full record or only a more favourable subset is to be retrieved. For example, queries won't benefit if the number of entries in a dependent table is only slightly smaller than that in the main one. Worst case, a normalised design may force a query to perform multiple full table scans on the main as well as the dependent tables where a single one would have sufficed before.
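To give a rough idea (using a hypothetical domain lookup table with a domain_id foreign key, along the lines later proposed in this topic), the same logical lookup would shift from a single-table scan to an id match plus a lookup in the dependent table:

-- current schema: one pass over a single table
SELECT count(*) FROM queries WHERE domain = 'example.com';

-- hypothetical normalised schema: resolve the id in the (ideally indexed)
-- dependent table, then match that id in the main table
SELECT count(*) FROM queries
WHERE domain_id = (SELECT id FROM domain WHERE domain = 'example.com');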

Whether an effect is measurable at all would also depend on other things like the db engine taking advantage of multi-core CPUs, whether it can be executed in-memory etc.

(edit)
Normalisation can be expected to improve data integrity and reduce redundancy.
Typically, data integrity is of little concern for logging facilities, as a record is only ever written once. Addressing redundancy may well decrease a database's footprint.
On the other hand, normalisation does not make any guarantees for speed - speed may be affected either way.
(end-edit)

Trying to assess how an application's speed or responsiveness is affected by a normalised design would amount to revisiting all queries involved, making this a tricky task for applications that support dynamic construction of db queries.
Pi-hole's long-term database UI may be considered an example of such dynamic construction.


Basically everything @Bucking_Horn already said.

The current implementation was chosen to be particularly simple: a plain SELECT * FROM queries gives you everything you want. Granted, the same can be achieved through views; however, this is not true for INSERTions.

Currently, FTL can add rows with a very simple sequence of statements:

BEGIN TRANSACTION
INSERT INTO queries VALUES (NULL,?,?,?,?,?,?,?)
INSERT INTO queries VALUES (NULL,?,?,?,?,?,?,?)
...
INSERT INTO queries VALUES (NULL,?,?,?,?,?,?,?)
INSERT INTO queries VALUES (NULL,?,?,?,?,?,?,?)
END TRANSACTION

With the level of optimization you mention, the rather simple query

INSERT INTO queries VALUES (NULL,?,?,?,?,?,?,?)

would become something like

INSERT INTO queries VALUES (NULL,?,?,(SELECT id FROM domains WHERE domain = ?),(SELECT id FROM clients WHERE client = ?),(SELECT id FROM forwards WHERE forward = ?),?,?)

It'd be a breaking change, and actually one that we cannot afford, in my opinion. I say this because we'd have to convert users' existing databases to the new storage scheme. This would take forever (think a few hours) on Raspberry Pi Zero devices, which are still used abundantly. Starting with the new schema without some auto-conversion will lose history. This is bad as well.
And any intermediate mixing of the old table queries and a new queries2, with both data sets merged in a view, is likely to cause a lot of trouble as well.

For a reduction by a factor of two, I'd say this is not worth it. On modern computers, a factor of two is not really worth anything. We can surely discuss this if the saving is on the order of > 90% (which is likely the case). However, we'd first need to find a solution for a working transition (see above).


Yeah, I didn't see any area to discuss implementation here, so that's why I picked feature request. I understand your point completely, but I can tell you that in this case it didn't negatively impact performance in my testing; it was about the same with only a month of data and limited clients.

From my DB experience, which is my day job, I am sure it will improve performance with larger quantities of data. The reason is that unique indexes can be added to the client and domain tables, which are much smaller than a full table scan of text fields would be. Nothing is slower in databases than scanning text columns with duplicated data.
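As a rough illustration (assuming a separate client table with a unique index, as proposed here), SQLite can show whether such a lookup actually uses the index:

CREATE TABLE client (id INTEGER PRIMARY KEY, client TEXT);
CREATE UNIQUE INDEX uq_client ON client (client);
EXPLAIN QUERY PLAN
SELECT id FROM client WHERE client = '192.168.2.155';
-- should report a search via the index (something like
-- "SEARCH client USING COVERING INDEX uq_client (client=?)") rather than a full scan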

I don't agree with the additional I/O from the additional joins part, as the scans of the client table are much smaller and make for faster queries. However, with all that said, I can agree with it being a breaking change at this point for existing users. My limited data set shows a 50-60% reduction, excluding forwards, so I can't judge the impact on larger data sets. My only way to do that is to use my pi.hole for a longer time :slight_smile: which will happen.

I will say SQLite supports upserts now, which simplifies this a bit and is helpful. SQLite Syntax: upsert-clause
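For illustration, a minimal upsert on a hypothetical client table with a unique constraint on its text column (names assumed, not part of the current schema) could look like:

INSERT INTO client (client) VALUES ('192.168.2.155')
ON CONFLICT (client) DO NOTHING;   -- insert only if this client is not known yet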

If you were looking just at queries, I believe that's quite likely to be true in the case you outlined (though overall performance would depend on actual queries and their frequency).

But you'd buy faster lookups with substantially more expensive inserts. And those inserts would happen during name resolution, which could be considered critical for Pi-hole's performance (though in a normal home usage scenario, that impact probably would be negligible - at least I'd guess as much :wink: ).

It could even be argued that that database is more often subject to being written to than to being read from.


I guess you wanted to link here instead. How would it be helpful in this context?

The one big showstopper here is migration on low-end devices. I have no idea how to cross this boundary. Are there any ideas that do not involve either losing history up to the migration point or a possibly complex hybrid approach that will likely break in many unforeseen places?

Normalizing the database might improve performance for lookups, but I doubt the impact will affect many users. The last 24 hours are already read from memory, so the optimization will only be relevant for Pi-hole's long-term data feature.


The existing design is a typical design made by software engineers: it is process oriented. For a database engineer, it is an example of quite the opposite of good database design.

That's an old fairy-tale and wrong.

Design and implementation are two different stages of the whole process. Software engineers tend to start with implementation and call the result the design.

Pull requests are welcomed on our GitHub site.

Watch the tone.

You make quite a few assumptions there and frankly you're quite insulting with those comments.

Do you know the backgrounds of the developers? Do you know who wrote the code or came up with the design?

If you're going to start with making comments that insinuate that you are the expert, then show your cards and tell us why we should believe you're the expert.

That may be true. However, there was no database engineer available when we made that decision. The schema was floating around for over half a year before it was fixed like this. I am not too experienced when it comes to database design (actually this was my first contact with SQL at all) so I left this to others. We just took what was agreed upon in the end. I'm not saying this was optimal.

There may have been other reasons for that particular design; I tried to express what I think might have been the reasoning. In other places we do have linked tables connected through foreign key constraints, so this was not unknown to the developers at the time of the decision.


I feel like I'm asking this for the third time, but maybe I wrote too much text and it got buried: How would you, as a database professional, address the situation on low-performance hardware without losing history for those users? If you can propose a chain of SQL actions, we can take an example database and measure how it behaves on a few different hardware choices typically used with Pi-hole. I'm just afraid there is no possible way to do this at all, because rewriting/restructuring a database in the low Gigabyte range is just too much stuff to be fast enough.

I am not convinced it would be beneficial for Pi-hole overall.

On the pros, Pi-hole would use less disk storage, and separating out certain sets of columns into indexed tables would help avoid table scans in certain search scenarios and speed up respective lookups.

On the cons, inserts could be quite a bit more expensive, especially when indexing. And as database engines tend to pull index hash tables into memory, we would increase memory consumption, which could be detrimental on systems with limited memory - and Pi-hole currently runs exceptionally well on that kind of system.

On the other hand, the fact that Pi-hole already stages its database writes would certainly help in mitigating increased insertion costs.

Still, I feel that we should scrutinise the normalisation proposal and try to back it up with some testing before considering it just on the promise of reduced disk storage.
This should give us a better idea of what to expect from that change - certainly better than me contemplating potential gains and losses. :wink:

I also agree with yubiuser that speeding up long-term searches may not be relevant for the majority of users. I myself am more comfortable grepping the logs for immediate issues and rarely resort to database queries.
But that just implies that reflecting our probably vast, certainly diverse and for the most part unknown user base in that test would be another challenge.

And yet another one would be how to interpret results:
Is 200MB less disk space, 10MB more RAM, 5ms slower resolution times and a quarter second faster domain searches better or worse?

Is this a huge concern, given the size and cost of currently shipping SD cards? As a single example, a 180 day database on my Pi-3B that serves about 16K queries a day is 185 MB. Even 10 times that is less than 2 GB.

Raspbian Buster, unbound and some other small packages running on this device:

df -ah
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/mmcblk0p1   30G  1.6G   28G   6% /
...

You can get a pair of 32 GB Sandisk SD cards for $14 US (delivered).

Input/output performance should be the driver, particularly on the least capable SBCs. Saving space should be secondary. If we can do both, that's a win.

I so opened a can of worms! :slight_smile: I am sorry, LOL. So, to address a few comments here... And I do think this would do both, as jfb said, making it a win.

#1 DBAs know better than software engineers - I am not touching this with a 10ft pole! :slight_smile: There are bad ones on both sides!
#2 Inserts will be more expensive - I don't think there would be a net impact either way. You avoid having to insert the client, domain, etc. each time, so the writes are cut down. The lookup tables do have to be queried, but they would be indexed, so the cost would mostly be offset (see the sketch right after this list).
#3 Queries will be faster - I can verify this so far with my one month of data; it was slightly faster. With larger datasets, it will progressively be more impactful.
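To make #2 concrete, a batched insert under the proposed schema might look roughly like this (a sketch only, combining the transaction and upsert patterns mentioned above - not FTL's actual code):

BEGIN TRANSACTION
-- upsert the lookup values first, so the ids exist
INSERT INTO client (client) VALUES (?) ON CONFLICT (client) DO NOTHING
INSERT INTO domain (domain) VALUES (?) ON CONFLICT (domain) DO NOTHING
INSERT INTO forward (forward) VALUES (?) ON CONFLICT (forward) DO NOTHING
-- then insert the query row referencing those ids
INSERT INTO queries VALUES (NULL,?,?,(SELECT id FROM domain WHERE domain = ?),(SELECT id FROM client WHERE client = ?),(SELECT id FROM forward WHERE forward = ?),?,?)
END TRANSACTION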

So here would be the benefits IMO.

  • A 50-70% space reduction. Likely more with large datasets.
  • Reduced writes to the SD card, which may actually improve insert speeds, but that would need to be tested.
  • FTL becomes even more FTL? :slight_smile:

The funny part is, the whole reason I brought this up was that I set up a Python script to copy my FTL data to PostgreSQL so I can add graphing to the Grafana dashboards I have for other things.

As for how we address the history, I think it's easy enough to do, possibly as part of an "upgrade" version. It would just be a few queries... but, as you said, it may take time. If it's part of a version upgrade, though, that wouldn't be a huge issue, would it? The basic idea would be the following (rough steps only, because I'm not on my main PC with access to the DB; a sketch in SQL follows the list):

  1. Alter queries: add domain_id (int), client_id (int), forward_id (int).
  2. Create table domain (id int pk, domain text).
  3. Create table client (id int pk, client text).
  4. Create table forward (id int pk, forward text).
  5. Create a unique index on the text column of each of domain, client and forward (reduces upsert cost).
  6. Insert into the new tables (domain, client, forward) using INSERT INTO table (text field) SELECT DISTINCT text field FROM queries.
  7. Alter queries: drop the old text fields.
  8. Create FKs on queries referencing these tables.
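A rough SQLite sketch of those steps (table and column names as in the list above; this assumes SQLite 3.35+ for ALTER TABLE ... DROP COLUMN, and step 8 would need a table rebuild since SQLite cannot add foreign key constraints to an existing table):

BEGIN TRANSACTION;
-- 1. add the new id columns
ALTER TABLE queries ADD COLUMN domain_id INTEGER;
ALTER TABLE queries ADD COLUMN client_id INTEGER;
ALTER TABLE queries ADD COLUMN forward_id INTEGER;
-- 2.-4. create the lookup tables
CREATE TABLE domain  (id INTEGER PRIMARY KEY, domain TEXT);
CREATE TABLE client  (id INTEGER PRIMARY KEY, client TEXT);
CREATE TABLE forward (id INTEGER PRIMARY KEY, forward TEXT);
-- 5. unique indexes on the text columns (keeps later upserts cheap)
CREATE UNIQUE INDEX uq_domain  ON domain (domain);
CREATE UNIQUE INDEX uq_client  ON client (client);
CREATE UNIQUE INDEX uq_forward ON forward (forward);
-- 6. fill the lookup tables from the existing data ...
INSERT INTO domain  (domain)  SELECT DISTINCT domain  FROM queries;
INSERT INTO client  (client)  SELECT DISTINCT client  FROM queries;
INSERT INTO forward (forward) SELECT DISTINCT forward FROM queries WHERE forward IS NOT NULL;
-- ... and backfill the new id columns
UPDATE queries SET
    domain_id  = (SELECT id FROM domain  WHERE domain  = queries.domain),
    client_id  = (SELECT id FROM client  WHERE client  = queries.client),
    forward_id = (SELECT id FROM forward WHERE forward = queries.forward);
-- 7. drop the old text columns
ALTER TABLE queries DROP COLUMN domain;
ALTER TABLE queries DROP COLUMN client;
ALTER TABLE queries DROP COLUMN forward;
COMMIT;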

I can script a lot of this when I have time and check improvements... but again I have a limited dataset.

Can you provide something that would show this? Some timing that shows the improvement and what the benefit would be to offset the cost of the changes?

And this is the problematic part. I did some preliminary testing on this because we wanted to do something some time ago. Another story, but it may help to show what the issue is:

FTL started out by mapping query types to its own enum. That's how things started, and we cannot undo this. Anyway, type is something like

| Type ID | Resource Record as text |
| --- | --- |
| 1 | A |
| 2 | AAAA |
| 3 | ANY |
| 4 | SRV |
| 5 | SOA |
| 6 | PTR |
| 7 | TXT |
| 8 | NAPTR |
| 9 | MX |
| 10 | DS |
| 11 | RRSIG |
| 12 | DNSKEY |
| 13 | NS |
| 14 | OTHER (any query type not covered elsewhere) |
| 15 | SVCB |
| 16 | HTTPS |

This became somewhat limiting when we wanted to store the actual query types, so we looked at how we could realize it.

It turned out to be a major issue. We tried a few things, like

ALTER TABLE queries RENAME TO queries_old;

CREATE TABLE queries ( id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp INTEGER NOT NULL, type INTEGER NOT NULL, status INTEGER NOT NULL, domain TEXT NOT NULL, client TEXT NOT NULL, forward TEXT , additional_info TEXT);

INSERT INTO queries (timestamp,type,status,domain,client,forward,additional_info)
SELECT timestamp
       ,CASE type WHEN 1 THEN 1
                 WHEN 2 THEN 28
                 WHEN 3 THEN 255
                 WHEN 4 THEN 33
                 WHEN 5 THEN 6
                 WHEN 6 THEN 12
                 WHEN 7 THEN 16 
                 WHEN 8 THEN 35
                 WHEN 9 THEN 15
                 WHEN 10 THEN 43
                 WHEN 11 THEN 46
                 WHEN 12 THEN 48
                 WHEN 13 THEN 2
                 WHEN 14 THEN 0
        END as type
        ,status
        ,domain
        ,client
        ,forward
        ,additional_info
FROM queries_old;

DROP TABLE queries_old;

or

UPDATE queries set type = CASE type WHEN 1 THEN 1 WHEN 2 THEN 28 WHEN 3 THEN 255 WHEN 4 THEN 33 WHEN 5 THEN 6 WHEN 6 THEN 12 WHEN 7 THEN 16  WHEN 8 THEN 35 WHEN 9 THEN 15 WHEN 10 THEN 43 WHEN 11 THEN 46 WHEN 12 THEN 48 WHEN 13 THEN 2 WHEN 14 THEN 0 END;

Both of them (the second one was actually faster) took several minutes. And my database is rather moderate, with 500 MB for one year's worth of data. There are databases out there with easily 10x (and more) the number of queries I have.

In the end, we found a solution that does not need any rewriting at all by offsetting OTHER type queries.


Back to the problem at hand (which is rather similar). The question is: What happens if a user interrupts this long-running process?

The database will either roll back on the next action (if we wrap everything in a transaction) or things will break dramatically. I'm just not convinced it's worth the risk. I hope you see my point. If it took 5 seconds, okay, fine, but we're really dealing with rewriting a database table in the Gigabyte range on single-core ARM devices with possibly slow SD cards. This is a whole lot of limitation.
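For reference, the roll-back variant relies on nothing more than ordinary SQLite transaction semantics (a sketch):

BEGIN TRANSACTION;
-- ... all of the long-running migration statements ...
COMMIT;
-- if the process is interrupted before COMMIT, SQLite discards the incomplete
-- transaction via its journal the next time the database is opened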

Oh I totally understand, but when I say load tables, it's loading JUST distinct clients or JUST distinct domains, so it's a much smaller dataset.

For curiosity's sake, I have a small db... but this looks to run fast, and I'm curious how fast it would run with a bigger db. It also gives you an idea of what I meant. If it's too slow, there is a workaround where we could link databases and have it work in the background until the job was fully completed. If it failed, the old db would still be intact and it could be reattempted on a second run.

This is an example of what I meant; there is no need to reload the full queries table this way. And if the DBs were linked, even this much wouldn't need to be done and lock the db. Basically, you could create the new db and migrate the data over.

CREATE TABLE client
(
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    client text,
    CONSTRAINT uq_client UNIQUE (client)
);
INSERT INTO client (client)
SELECT distinct client
	FROM queries;
 ALTER TABLE queries add client_id int;
update queries set client_id = (select id from client where client = queries.client);

So in my limited testing with a smaller dataset, normalization looks to only improve things. Here are my results. Those against history are normalized (it's a view) and those against queries are the old version. At the bottom is the SQL I used. This way wouldn't lock the DB, so it could be run in the background and the new DB verified before replacing the old one. I think that addresses your concerns? You would just need to run a PRAGMA integrity_check to verify it before making the final replacement. Anyone want to test on a larger data set? Just make sure to run PRAGMA foreign_keys = ON; to enable the FKs when you open your connection.
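(For reference, the two checks mentioned here are plain SQLite pragmas:)

PRAGMA foreign_keys = ON;    -- enforce the new foreign keys for this connection
PRAGMA integrity_check;      -- sanity-check the rebuilt database before swapping it in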

This was a simple count(*), but more specific queries will improve even more, for sure. The inserts seem more consistent speed-wise with the upsert method. Unfortunately, SQLite doesn't support loops or I'd test with more. Maybe I'll write a quick Python script to loop it for more detailed testing, but not today :slight_smile:

sudo sqlite3 Test.db ".read test.sql"


sqlite> select count(*) from history where client = '192.168.2.155';
71509
Run Time: real 1.039 user 1.007407 sys 0.030865
sqlite> select count(*) from history where client = '192.168.2.155';
71509
Run Time: real 1.038 user 0.948123 sys 0.089802
sqlite> select count(*) from history where client = '192.168.2.155';
71509
Run Time: real 1.031 user 0.989952 sys 0.040031
sqlite> select count(*) from history where client = '192.168.2.155';
71509
Run Time: real 1.036 user 0.965032 sys 0.070296
sqlite> select count(*) from history where client = '192.168.2.155';
71509
Run Time: real 1.035 user 1.004284 sys 0.029671
sqlite> select count(*) from history where client = '192.168.2.155';
71509
Run Time: real 1.035 user 0.994221 sys 0.039705
sqlite> 

sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 1.063 user 0.682191 sys 0.030671
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 1.244 user 0.719878 sys 0.009267
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 1.571 user 0.739281 sys 0.000249
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 1.795 user 0.730746 sys 0.010223
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 4.090 user 0.728491 sys 0.029757
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 3.619 user 0.734049 sys 0.020111
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 3.239 user 0.726018 sys 0.020167
sqlite> insert into history (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from history limit 10000;
Run Time: real 1.946 user 0.738246 sys 0.019949


sqlite> select count(*) from queries where client = '192.168.2.155';
71509
Run Time: real 1.268 user 1.106239 sys 0.160775
sqlite> select count(*) from queries where client = '192.168.2.155';
71509
Run Time: real 1.281 user 1.199737 sys 0.079670
sqlite> select count(*) from queries where client = '192.168.2.155';
71509
Run Time: real 1.259 user 1.118651 sys 0.139912
sqlite> select count(*) from queries where client = '192.168.2.155';
71509
Run Time: real 1.286 user 1.125245 sys 0.159506
sqlite> select count(*) from queries where client = '192.168.2.155';
71509
Run Time: real 1.286 user 1.064541 sys 0.220703
sqlite> 


sqlite> ^Cinsert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_in from queries limit 10000;
Run Time: real 6.939 user 0.186649 sys 0.054963
sqlite> insert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from queries limit 10000;
Run Time: real 3.648 user 0.173676 sys 0.081017
sqlite> insert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from queries limit 10000;
Run Time: real 2.111 user 0.194485 sys 0.060911
sqlite> insert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from queries limit 10000;
Run Time: real 7.550 user 0.158069 sys 0.090163
sqlite> insert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from queries limit 10000;
Run Time: real 2.391 user 0.195867 sys 0.054545
sqlite> insert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from queries limit 10000;
Run Time: real 2.515 user 0.224625 sys 0.044268
sqlite> insert into queries (timestamp, type, status, domain, client, forward, additional_info) select timestamp, type, status, domain, client, forward, additional_info from queries limit 10000;
Run Time: real 4.307 user 0.159630 sys 0.109274
ATTACH DATABASE 'pihole-FTL.db' AS og;
CREATE TABLE client (id INTEGER PRIMARY KEY AUTOINCREMENT, client text, CONSTRAINT uq_client UNIQUE (client));
CREATE TABLE domain (id INTEGER PRIMARY KEY AUTOINCREMENT, domain text, CONSTRAINT uq_domain UNIQUE (domain));
CREATE TABLE forward (id INTEGER PRIMARY KEY AUTOINCREMENT, forward text not null default "", CONSTRAINT uq_forward UNIQUE (forward));
CREATE TABLE queries ( id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp INTEGER NOT NULL, type INTEGER NOT NULL, status INTEGER NOT NULL, domain_id INTEGER , client_id INTEGER , forward_id INTEGER  , additional_info TEXT, FOREIGN KEY(client_id) REFERENCES client(id)
, FOREIGN KEY(domain_id) REFERENCES domain(id)
, FOREIGN KEY(forward_id) REFERENCES forward(id));
CREATE INDEX idx_queries_timestamps ON queries (TIMESTAMP);
CREATE TABLE aliasclient (id INTEGER PRIMARY KEY NOT NULL, name TEXT NOT NULL, comment TEXT);
CREATE TABLE counters ( id INTEGER PRIMARY KEY NOT NULL, value INTEGER NOT NULL );
CREATE TABLE ftl ( id INTEGER PRIMARY KEY NOT NULL, value BLOB NOT NULL );
CREATE TABLE message ( id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp INTEGER NOT NULL, type TEXT NOT NULL, message TEXT NOT NULL, blob1 BLOB, blob2 BLOB, blob3 BLOB, blob4 BLOB, blob5 BLOB );
CREATE TABLE network_addresses ( network_id INTEGER NOT NULL, ip TEXT UNIQUE NOT NULL, lastSeen INTEGER NOT NULL DEFAULT (cast(strftime('%s', 'now') as int)), name TEXT, nameUpdated INTEGER, FOREIGN KEY(network_id) REFERENCES network(id));
INSERT INTO client (client) select distinct client from og.queries;
INSERT INTO domain (domain) select distinct domain from og.queries;
INSERT INTO forward (forward) select distinct ifnull(forward,"") from og.queries;
INSERT INTO queries (id, timestamp, type, status, domain_id, client_id, forward_id, additional_info)
select id, timestamp, type, status, (select id from domain where domain = queries.domain), (select id from client where client = queries.client)
,(select id from forward where ifnull(forward,"") = ifnull(queries.forward,"")), additional_info
from og.queries;

create view history (id, timestamp, type, status, domain, client, forward, additional_info)
as 
select 
q.id, q.timestamp, q.type, q.status, d.domain, c.client, f.forward, q.additional_info
from queries q  join domain d on q.domain_id = d.id  join client c on q.client_id = c.id  join forward f on q.forward_id = f.id ;

CREATE TRIGGER trig1 
INSTEAD OF INSERT ON history 
BEGIN

        INSERT INTO client (client)
        VALUES (NEW.client) ON CONFLICT (client) DO NOTHING;

        INSERT INTO domain (domain)
        VALUES (NEW.domain) ON CONFLICT (domain) DO NOTHING;

        INSERT INTO forward (forward)
        VALUES (NEW.forward) ON CONFLICT (forward) DO NOTHING;

        insert into queries (timestamp, type, status, domain_id, client_id, forward_id, additional_info)
        select NEW.timestamp, NEW.type, NEW.status, (SELECT id from domain where domain = NEW.domain), (SELECT id from client where client = NEW.client), (SELECT id from forward where forward = NEW.forward), NEW.additional_info
;

END;

Thank you for those tests.

I've slightly altered your proposed schema and did a bit of testing myself, yet my results don't allow declaring an obvious winner.

I adapted your schema to account for a missing (not directly affected) table (network), as well as for imports of all tables, to better prepare for timing preliminary tests on migration duration for huge databases (which I haven't done).

I also swapped queries and history, in order to avoid breaking existing query contracts by the proposed solution.

My resulting script for creating `normalised.db` looks like this:
CREATE TABLE sqlite_sequence(name,seq);

CREATE TABLE history ( id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp INTEGER NOT NULL,
    type INTEGER NOT NULL,
    status INTEGER NOT NULL,
    domain_id INTEGER,
    client_id INTEGER,
    forward_id INTEGER,
    additional_info TEXT,
    FOREIGN KEY(client_id) REFERENCES client(id),
    FOREIGN KEY(domain_id) REFERENCES domain(id),
    FOREIGN KEY(forward_id) REFERENCES forward(id));
CREATE INDEX idx_queries_timestamps ON history (TIMESTAMP);

CREATE TABLE client (id INTEGER PRIMARY KEY AUTOINCREMENT, client text, CONSTRAINT uq_client UNIQUE (client));
CREATE TABLE domain (id INTEGER PRIMARY KEY AUTOINCREMENT, domain text, CONSTRAINT uq_domain UNIQUE (domain));
CREATE TABLE forward (id INTEGER PRIMARY KEY AUTOINCREMENT, forward text not null default "", CONSTRAINT uq_forward UNIQUE (forward));

CREATE TABLE counters ( id INTEGER PRIMARY KEY NOT NULL, value INTEGER NOT NULL );
CREATE TABLE ftl ( id INTEGER PRIMARY KEY NOT NULL, value BLOB NOT NULL );
CREATE TABLE message ( id INTEGER PRIMARY KEY AUTOINCREMENT, timestamp INTEGER NOT NULL, type TEXT NOT NULL, message TEXT NOT NULL, blob1 BLOB, blob2 BLOB, blob3 BLOB, blob4 BLOB, blob5 BLOB );

CREATE TABLE IF NOT EXISTS "network_addresses" ( network_id INTEGER NOT NULL, ip TEXT UNIQUE NOT NULL, lastSeen INTEGER NOT NULL DEFAULT (cast(strftime('%s', 'now') as int)), name TEXT, nameUpdated INTEGER, FOREIGN KEY(network_id) REFERENCES network(id));
CREATE TABLE IF NOT EXISTS "network" ( id INTEGER PRIMARY KEY NOT NULL, hwaddr TEXT UNIQUE NOT NULL, interface TEXT NOT NULL, firstSeen INTEGER NOT NULL, lastQuery INTEGER NOT NULL, numQueries INTEGER NOT NULL, macVendor TEXT, aliasclient_id INTEGER);
CREATE TABLE aliasclient (id INTEGER PRIMARY KEY NOT NULL, name TEXT NOT NULL, comment TEXT);

CREATE VIEW queries (id, timestamp, type, status, domain, client, forward, additional_info)
AS
SELECT
h.id, h.timestamp, h.type, h.status, d.domain, c.client, f.forward, h.additional_info
FROM history h
JOIN domain d on h.domain_id = d.id
JOIN client c ON h.client_id = c.id
JOIN forward f ON h.forward_id = f.id
/* queries(id,timestamp,type,status,domain,client,forward,additional_info) */;
CREATE TRIGGER trig1
INSTEAD OF INSERT ON queries
BEGIN
    INSERT INTO client (client)
    VALUES (NEW.client) ON CONFLICT (client) DO NOTHING;

    INSERT INTO domain (domain)
    VALUES (NEW.domain) ON CONFLICT (domain) DO NOTHING;

    INSERT INTO forward (forward)
    VALUES (NEW.forward) ON CONFLICT (forward) DO NOTHING;

    INSERT INTO history (timestamp, type, status, domain_id, client_id, forward_id, additional_info)
      SELECT NEW.timestamp, NEW.type, NEW.status, 
             (SELECT id FROM domain WHERE domain = NEW.domain),
             (SELECT id FROM client WHERE client = NEW.client),
             (SELECT id FROM forward WHERE forward = NEW.forward),
             NEW.additional_info;
END;

ATTACH DATABASE 'pihole-FTL.db' AS og;

INSERT INTO ftl SELECT * FROM og.ftl;
INSERT INTO counters SELECT * FROM og.counters;
INSERT INTO message SELECT * FROM og.message;
INSERT INTO network_addresses SELECT * FROM og.network_addresses;
INSERT INTO network SELECT * FROM og.network;
INSERT INTO aliasclient SELECT * FROM og.aliasclient;

INSERT INTO client (client) select distinct client from og.queries;
INSERT INTO domain (domain) select distinct domain from og.queries;
INSERT INTO forward (forward) select distinct ifnull(forward,"") from og.queries;

INSERT INTO history (id, timestamp, type, status, domain_id, client_id, forward_id, additional_info)
  SELECT id, timestamp, type, status,
         (SELECT id FROM domain WHERE domain = q.domain),
         (SELECT id FROM client WHERE client = q.client),
         (SELECT id FROM forward WHERE ifnull(forward,"") = ifnull(q.forward,"")),
         additional_info
  FROM og.queries q;

I then ran some lookup tests on a NanoPi NEO for a few randomly picked examples. (I haven't done any tests on insert performance yet. I may add them if I find the time.)

I've tried picking those examples for situations that Pi-hole could be expected to be queried for via its UI, but I was also drawing on custom SQL statements we sometimes provide when helping our users.
Of course, those examples are far from being a comprehensive set of common use cases.

The following table shows the results for the examples I came up with:

| search for | reduction (-) / increase (+) | timing | pihole-FTL.db | normalised.db | SQL |
| --- | --- | --- | --- | --- | --- |
| no. of client IPs overall | -96% | real | 0m0.479s | 0m0.020s | `SELECT count(*) FROM client;` (normalised.db) |
| | -99% | user | 0m0.427s | 0m0.002s | `SELECT count(DISTINCT client) FROM queries;` (pihole-FTL.db) |
| | -65% | sys | 0m0.052s | 0m0.018s | |
| no. of queries by a specific client | -27% | real | 0m0.244s | 0m0.178s | `SELECT count(*) FROM queries WHERE client = '192.168.0.26';` |
| | -14% | user | 0m0.183s | 0m0.158s | |
| | -67% | sys | 0m0.060s | 0m0.020s | |
| no. of queries from a specific subnet | 85% | real | 0m0.275s | 0m0.509s | `SELECT count(*) FROM queries WHERE client LIKE '192.168.0.'` |
| | 95% | user | 0m0.245s | 0m0.477s | |
| | 10% | sys | 0m0.030s | 0m0.033s | |
| no. of queries by a specific client with LIKE | 78% | real | 0m0.292s | 0m0.519s | `SELECT count(*) FROM queries WHERE client LIKE '192.168.0.26'` |
| | 87% | user | 0m0.261s | 0m0.489s | |
| | 0% | sys | 0m0.030s | 0m0.030s | |
| no. of queries during a specific timeframe | 175% | real | 0m0.106s | 0m0.292s | `SELECT count(*) FROM queries WHERE (timestamp BETWEEN strftime('%s','2020-10-02') AND strftime('%s','2021-01-01'));` |
| | 213% | user | 0m0.087s | 0m0.272s | |
| | 5% | sys | 0m0.019s | 0m0.020s | |
| list of top blocked domains | 11% | real | 0m0.461s | 0m0.512s | `SELECT domain, count(domain) FROM queries WHERE status NOT IN (0,2,3) GROUP BY domain ORDER BY count(domain) DESC LIMIT 100;` |
| | 14% | user | 0m0.420s | 0m0.480s | |
| | -20% | sys | 0m0.040s | 0m0.032s | |

The second column shows by how much the proposed normalisation would reduce (-) or increase (+) lookups relative to the current schema.

While queries that clearly benefit from involving a new index, or that even focus on a separate table's results, are indeed faster as expected, other common queries are slower than before.

Such a small set of sample tests isn't sufficient to accept or reject the normalisation proposal, but it serves to demonstrate that we cannot simply assume overall improvement. There's a price to be paid for speeding up certain queries - we slow down others (and I haven't looked at insertion yet).

Adopting it would mean more than just creating the normalised tables. Queries may have to be tuned and potentially revised into more individualised sets of queries specialising in certain tasks.
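As one example of what such tuning might look like (a sketch against the schema above, not measured), the slower subnet search could be rephrased to match the pattern in the small client table first:

-- via the compatibility view, as tested above:
SELECT count(*) FROM queries WHERE client LIKE '192.168.0.%';
-- tuned to the normalised tables:
SELECT count(*)
FROM history
WHERE client_id IN (SELECT id FROM client WHERE client LIKE '192.168.0.%');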

And of course, those test samples don't provide insights with regard to DL6ER's concern for migrating huge databases.