Suggestion: Optimize SQL query in gravity.sh

PiHoleParole · January 24, 2024, 3:32pm

I don't know how critical is the performance of the following SQL query, or how many times it is used, but seeing an opportunity to improve I'm jumping in.

In function gravity_Table_Count() in gravity.sh there's the query:

SELECT COUNT(*) FROM (SELECT DISTINCT domain FROM ${table});

SQL has the operator COUNT (DISTINCT column), therefore the above query can be rewritten as

SELECT COUNT(DISTINCT domain) FROM ${table};

It makes a difference for SQLite. If you compare the output of EXPLAIN (the first one below is optimized, the second one is original), you'll see the execution plan for the optimized version is shorter. For ${tables} with substantial number of rows it can make a real difference.

I hope it's useful and will make Pi-Hole a little faster for users

sqlite> create table t1(c1 int);

sqlite> explain select count(distinct c1) from t1;

addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     15    0                    0   Start at 15
1     Null           0     1     2                    0   r[1..2]=NULL
2     OpenEphemeral  1     0     0     k(1,B)         0   nColumn=0
3     OpenRead       0     2     0     1              0   root=2 iDb=0; t1
4     Rewind         0     11    0                    0
5       Column         0     0     3                    0   r[3]= cursor 0 column 0
6       Found          1     10    3     1              0   key=r[3]
7       MakeRecord     3     1     4                    0   r[4]=mkrec(r[3])
8       IdxInsert      1     4     3     1              16  key=r[4]
9       AggStep        0     3     2     count(1)       1   accum=r[2] step(r[3])
10    Next           0     5     0                    1
11    AggFinal       2     1     0     count(1)       0   accum=r[2] N=1
12    Copy           2     5     0                    0   r[5]=r[2]
13    ResultRow      5     1     0                    0   output=r[5]
14    Halt           0     0     0                    0
15    Transaction    0     0     1     0              1   usesStmtJournal=0
16    Goto           0     1     0                    0

sqlite> explain select count(*) from (select distinct(c1) from t1) x;

addr  opcode         p1    p2    p3    p4             p5  comment
----  -------------  ----  ----  ----  -------------  --  -------------
0     Init           0     21    0                    0   Start at 21
1     InitCoroutine  1     12    2                    0   x
2     OpenEphemeral  2     0     0     k(1,B)         8   nColumn=0
3     OpenRead       1     2     0     1              0   root=2 iDb=0; t1
4     Rewind         1     11    0                    0
5       Column         1     0     2                    0   r[2]= cursor 1 column 0
6       Found          2     10    2     1              0   key=r[2]
7       MakeRecord     2     1     3                    0   r[3]=mkrec(r[2])
8       IdxInsert      2     3     2     1              16  key=r[3]
9       Yield          1     0     0                    0
10    Next           1     5     0                    1
11    EndCoroutine   1     0     0                    0
12    Null           0     4     4                    0   r[4..4]=NULL
13    InitCoroutine  1     0     2                    0
14      Yield          1     17    0                    0   next row of x
15      AggStep        0     0     4     count(0)       0   accum=r[4] step(r[0])
16    Goto           0     14    0                    0
17    AggFinal       4     0     0     count(0)       0   accum=r[4] N=0
18    Copy           4     5     0                    0   r[5]=r[4]
19    ResultRow      5     1     0                    0   output=r[5]
20    Halt           0     0     0                    0
21    Transaction    0     0     1     0              1   usesStmtJournal=0
22    Goto           0     1     0                    0

Bucking_Horn · January 24, 2024, 6:11pm

There's always room for improvement - we welcome contributions, how small they may be,

When testing both statements against my gravity table, however, my results would favour the existing statement.

Proposed statement:

~$ time pihole-FTL sqlite3 /etc/pihole/gravity.db \
"SELECT COUNT(distinct domain) FROM gravity;"
695509

real	0m12.214s
user	0m10.620s
sys 	0m0.774s

Current statement:

~$ time pihole-FTL sqlite3 /etc/pihole/gravity.db \ 
"SELECT COUNT(*) FROM (SELECT DISTINCT domain FROM gravity);"
695509

real	0m1.529s
user	0m0.950s
sys 	0m0.281s

The current statement is substantially faster, at least with my gravity database on my machine.
I ran above a few times on an ARMv7, serving the db from an SD card, with repeatable results.

How large is your gravity table?
Do you see contrary results on your system?

PiHoleParole · January 24, 2024, 6:33pm

WOW that is totally unexpected. If this result is repeatable, that is running the query multiple times produce similar times, then that would be a good case to report to SQLite authors.

Please try running the proposed query 2-3 times in a row and also post a result of SELECT COUNT(*) FROM gravity; to get a rough idea of the ratio of distinct values to total number of values in the column, and the version of SQLite CLI used.

Thanks.

Bucking_Horn · January 24, 2024, 6:46pm

When asking SQLite3 to explain both queries, we see that the current statement is using the index when scanning gravity:

SELECT COUNT(*) FROM (SELECT DISTINCT domain FROM gravity);"
QUERY PLAN
|--CO-ROUTINE (subquery-1)
|  `--SCAN gravity USING COVERING INDEX idx_gravity

The proposed query constructs a temporary B-tree, but has to do a full table scan also, without the help of the index:

SELECT COUNT(distinct domain) FROM gravity;"
QUERY PLAN
|--USE TEMP B-TREE FOR count(DISTINCT)
`--SCAN gravity

Edit:

I was using Pi-hole's embedded engine (of course ):

~$ pihole-FTL sqlite3 --version
3.42.0 2023-05-16 12:36:15 831d0fb2836b71c9bc51067c49fee4b8f18047814f2ff22d817d25195cf350b0

And the total count of domains would be 724,301, so less than 5% more than the 695,509 distinct domains.

DL6ER · January 24, 2024, 6:48pm

Once a week, at night, running in an automated script - I'd say: not very critical.

It is shorter but shorter is not the same as "better" (at least not if we define "better" as "faster")

You have to interpret what EXPLAIN shows you, then it becomes clear. Without going over each line and explaining them what they do, the high-level difference is:

... [ I have not finished and remove my draft of this section because @Bucking_Horn replied meanwhile and has a nice description of what I was going to explain in greater lengths here ] ...

PiHoleParole · January 24, 2024, 6:55pm

This is what I suspected and I'd like to report that to SQLite authors (not only for the purpose of this query). I can confirm it exists in the newest SQLite 3.45.0, so I'll ping its authors about it. Thanks. You never know what you're going to find

DL6ER · January 24, 2024, 7:14pm

Just for the record - we are always latest version on Pi-hole's active development branches:

$ pihole-FTL sqlite3 --version
3.45.0 2024-01-15 17:01:13 1066602b2b1976fe58b5150777cced894af17c803e068f5918390d6915b46e1d (64-bit)

PiHoleParole · January 24, 2024, 9:18pm

Thank you. I have reported this peculiar behavior to SQLite developers.

https://sqlite.org/forum/forumpost/9a54836237

DL6ER · January 31, 2024, 7:18am

Even with the small change implemented in the current trunk of SQLite3, the situation will not become better than when manually forcing the index as has also been suggested:

SELECT COUNT(DISTINCT domain) FROM gravity INDEXED BY idx_gravity

And the performance is basically identical with what we already have. Admittedly, the query plan is simpler

EXPLAIN QUERY PLAN SELECT COUNT(*) FROM (SELECT DISTINCT domain FROM gravity)
QUERY PLAN
|--CO-ROUTINE (subquery-1)
|  `--SCAN gravity USING COVERING INDEX idx_gravity
`--SCAN (subquery-1)

EXPLAIN QUERY PLAN SELECT COUNT(DISTINCT domain) FROM gravity INDEXED BY idx_gravity
QUERY PLAN
`--SCAN gravity USING COVERING INDEX idx_gravity

but this does not lead to any measurable (I repeated each one about five times) difference on my machine - and with "not measurable" I mean: not even reliably 10 msec.

I'd conclude like this: We have not found a way to make it faster than the existing implementation. At this point saying "with the very latest version we can use the other SQL at the same costs", than this still sounds somewhat like a change for the sake of a change.

We are all showing some amazing dedication for such a small thing here and I'd absolutely love if we could give the many other SQL queries Pi-hole uses (especially inside FTL !) the same love - I am absolutely sure there is more room for improvement in other queries that are executed live during DNS serving rather than once per week in a process that typically runs in the background.

PiHoleParole · January 31, 2024, 7:59am

Thank you. I'm not requesting you to do anything, and I mean anything, on your part. You've involved in this voluntarily. I'm monitoring the SQLite forum thread because potentially a lot of applications can benefit from it.
As a database folk I'm glad it's been found and is being improved. For pi-hole there's no benefit, but it doesn't mean SQLite should not correct this. The thing is, it was pi-hole that allowed to find this issue.

I think this can be closed.