Discrepancy between Pi-hole API client query count and logs

bazmonkey · May 7, 2024, 5:47pm

I've been racking my brain here trying to figure this one out.

I'm using a ver. 6 beta of Pi-Hole, and also writing a prometheus exporter to use the new API to collect various counts of queries by type, status, client, etc. At the same time, I'm already ingesting my pihole logs into Loki, and deriving counts that way.

With both approaches (API, logs), I'm making per-minute count metrics. Most of the time the two counts match, but I've noticed that they don't always track. Sometimes by a just a couple queries, which I have been able to chase down to the API not including queries of pi.hole itself. But other times the difference is huge.

For example, between 20:08 and 20:09 tonight, my exporter reports 319 queries from a particular host on my network. Querying the API for these exact 60 seconds with this exact client IP gives me just that: 319. But /var/log/pihole.log shows a whopping 815 of them. I can't tell which nearly-500 of them are the odd ones out.

I checked I really am finding 815 lines of the correct form, e.g.: query[A] somedomain.com from 192.168.1.10
I double-checked I really have these in my logs... Loki is correct.
Both the 319 and 815 counts include blocked queries, cache-stale queries, etc. There aren't any pi.hole queries in this minute.
The web interface, which uses the API, of course shows 319 queries in that minute as well.
It doesn’t appear to be just a timing issue: these “missing” queries don’t show up in the minutes before/after this.
The pi-hole itself is working fine, blocking like it always has, etc. etc.

Does anyone know what could be causing the wild discrepancy? The ONLY possible clue I have is that 492 of the queries (hey... that's almost how many extra there are) came in one second, when I refreshed a particular site. It seems like maybe the API is missing some that arrive too fast. But that doesn't make sense to me because it's the process that fills out the log in the first place.

Thanks for any hints/leads/ideas.

DL6ER · May 7, 2024, 6:02pm

Thanks for bringing this up, bugs can of course always exist and we'll fix then once we identify them!

You said the API tells you there are 319 queries in this in interval. Please go to the Query Log page and request the same time interval: How many are reported here?

If the Query Log also contain be only 319, you should be able to find the difference and this may tell a story. We can also do this research for you if you prefer, we just need the data in any form we could process.

bazmonkey · May 7, 2024, 6:07pm

The query log agrees with the API: 319 queries.

I'll get some hard data here tonight.

bazmonkey · May 8, 2024, 2:28am

Ok. In this repo are three files:

The API /queries response for 192.168.1.116 for a minute 15 minutes ago. 335 records. I set the length huge so that's all of 'em. The website query log and my exporter both agree with this number.
The pihole.log for this IP for that minute. 505 lines query[type] domain from 192.168.1.116.
The full pihole.log for that minute.

I didn't include it here, but the log and API agree in terms of query count for several minutes before and after this.

Thanks for the assistance! I'd love to not only be able to arrive at that 335 number from the logs as well, but know the significance of the extra 170 lines here. I tried to scrutinize the whole log and not just the "query from" lines... but my grep-fu is not good enough to parse it out easily and look for what 170 or 335 of them have in common.

https://github.com/bazmonk/junk

DL6ER · May 8, 2024, 7:25am

Okay, so this is indeed very strange and we'll need additional debug output. Please run

sudo pihole-FTL --config debug.queries true

and post the relevant parts from /var/log/pihole/FTL.log when it happens again.

I have not been able to identify anything in common about the missing domains even looking at full_pihole.log.

FWIW this is my real quick comparison script and its result:

compare.py (1.4 KB)
missing_queries.json (5.4 KB)

Output is:

Queries in the log file:  505
Queries in the API call:  335
Missing queries:  170

bazmonkey · May 8, 2024, 4:30pm

Will do... unfortunately I'm back at work :-/

bazmonkey · May 9, 2024, 2:48am

Ok... there's a pihole.log, an FTL.log (debug true), and the API queries for that same IP (192.168.1.116).

There are 288 queries in the API, and I see 303 in the pihole and FTL logs (those agree) now for that IP.

https://github.com/bazmonk/junk

It's the three files with "0236" in it (the minute I chose).

I went to Test Ad Block - Toolz (an "ad-block test" page) to generate these queries (because it generates a lot and causes this log/API discrepancy pretty reliably).

DL6ER · May 9, 2024, 7:15pm

Something doesn't really line up with your new data:

API.0236 has 288 queries from 02:36:19.079 to 02:36:48.367 (little less than 30 seconds), and
pihole.0236.log has 404 queries from 02:36:02 to 02:36:57 (55 seconds), and
FTL.0236.log has 379 queries from 02:36:19.067 to 02:36:57.643 (almost 40 seconds)

Anyway, when manually tailoring the three files to the time frame of the API file (smallest one), we get:

$ grep -c "\"list_id\"" API.0236.json 
288
grep -c "query\[" pihole.0236_edit.log 
358  
$ grep -c " query" FTL.0236_edit.log 
360

Still enough to work on.

We can indeed find missing queries easily - however - none of this makes sense to me right now. Everything seems perfectly normal in the debug logs and I see no reason for things not to show up in the API response at this time. This all seems to require me investing some work to reproduce this locally where I can fire a ton more debug stuff at it when needed. I will come back to you but it will surely take a few days until I can allocate enough time to do this.

bazmonkey · May 9, 2024, 7:23pm

No problem, thanks for the help! Let me know if I can provide you something else.

On the FTL and pihole logs, I didn't filter those for just the IP I've been focusing on (192.168.1.116). When I grepped them I found 303 queries for that IP.

The API query is just for that IP. Sorry if that was confusing.

rewt@cairon:~/junk $ grep "query.*192.168.1.116" pihole.0236.log -c
303
rewt@cairon:~/junk $ grep "query.*192.168.1.116" FTL.0236.log -c
303

DL6ER · May 10, 2024, 8:10am

The pointer to logs are not filtered this time but the API is made the difference, I see 303 now as well. I checked some of the missing queries but still couldn't find a single difference compared to queries that have not been reported in the API.

This brings me to asking you

how does your particular API request look like?
what are your config values (/etc/pihole/pihole.toml) concerning:
- misc.privacylevel
- webserver.api.excludeDomains
- webserver.api.excludeClients

It may also be worthwhile to enable debug.api = true and check the log after you have requested the API response.

edit Also of interest is if these queries have made their way into the database or not. Please try both commands:

pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'adtago.s3.amazonaws.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'events3alt.adcolony.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"

bazmonkey · May 10, 2024, 3:41pm

The API request was:

curl -X GET "https://pi.hole:443/api/queries?from=1715222160&until=1715222220&length=100000&client_ip=192.168.1.116" \
 -H "accept: application/json"

privacylevel is 0, and I have no excluded domains or clients

I get no results for either database query.

rewt@cairon:~/junk $ sudo pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'adtago.s3.amazonaws.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
rewt@cairon:~/junk $ pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'events3alt.adcolony.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
rewt@cairon:~/junk $

I'm also noticing that now, running that same API query (using disk=true), I get only 209 results! I uploaded it to that repository.

I'm running these API calls about 25 seconds after the minute, querying for the last whole minute (e.g. at 02:37:23 the exporter queries 02:36 - 02:37). Is that... too soon to query the last minute? Do things resolve or change after that?

EDIT: the database also now only shows 209 results... but that's where the API query is getting it now that it's on disk, right?

rewt@cairon:~/junk $ pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE client = '192.168.1.116' AND timestamp BETWEEN 1715222160 AND 1715222220;" | grep 116 -c
209

DL6ER · May 11, 2024, 10:19am

This is strange, could you please also run

sudo pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT COUNT(*) FROM queries WHERE timestamp BETWEEN 1715222179 AND 1715222208;"
sudo pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT COUNT(*) FROM queries WHERE client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;

edit No, I just saw you already anticipated this question

No, this should be fine.

Another question: How does the situation look like when you don't filter on a specific client, i.e., all queries in that time interval?

bazmonkey · May 11, 2024, 10:42pm

I’ll turn on debug and get some data for all clients.

bazmonkey · May 12, 2024, 7:50pm

Well... I'm at a loss. I can post more data but I don't know how fruitful it will be.

As far as I can tell there's just plain more queries in the logs than the API reports. The more rapidly I query, the more "extra" queries show up.

Is there any chance that something in the API code implicitly assumes that there will be like, only one queried domain per millisecond or something, and is misplacing some? That's what it seems is happening.

This was a pretty fresh install (i.e. just installed the OS and slapped pi-hole on), but I'm tempted to re-install it all as a sanity check.

Has anyone happened to check their logs against the API queries reported like I am? It would be helpful to know if this was just a "me" problem or not.

DL6ER · May 12, 2024, 8:07pm

I'm planing to reproduce this next week but I wasn't able to invest much time into this, so far. However, I'm confident that the scripts I wrote to check your data is like 60% of what is needed so we should have some answers soon(ish). We will eventually get this fixed, no doubts!

DL6ER · May 14, 2024, 6:24pm

Thank you for you patience. I had some time yesterday evening and was able to reproduce this.

The issue is in an algorithm designed to reduce the number of database updates to a reasonable minimum. It was designed to be adaptive but it turns out it cannot cope very well with irregular burst happening only seldomly. I replaced this algorithm with instead updating all queries up to 30 seconds in the past. This will both always include all queries within bursts of any length/frequency as well as queries where the reply take up to 30 seconds. Queries taking even longer to resolve are skipped by this method but this seems fine as the requesting client will have dropped them anyway (timeout). It is overall an improvement even for systems without bursts as it reduces the workload involved with database interactions notably.

I had this running for almost a full day now and when requesting, say, the entire last five full hours, I do not see a single missed query with the new algorithm:

Client	`pihole.log`	API `/queries`
Total (sum)	7152	7152
::	315	315
127.0.0.1	3961	3961
192.168.1.1	15	15
192.168.1.2	5	5
192.168.1.3	4	4
192.168.1.4	457	457
192.168.1.26	1	1
192.168.1.33	55	55
192.168.1.39	151	151
192.168.1.57	341	341
192.168.1.109	7	7
192.168.1.120	5	5
192.168.1.127	5	5
192.168.1.147	56	56
192.168.1.236	1774	1774

The script compare_log_api.py (2.2 KB) will do exactly this analysis for you and create such a table, please confirm the fix with your own method independently.

To get the fix, simply run:

pihole checkout ftl fix/database_queries

and let the system alone for a few hours

DL6ER · May 14, 2024, 6:37pm

https://github.com/pi-hole/FTL/pull/1960

bazmonkey · May 14, 2024, 8:06pm

Wonderful! I will try this tonight! Thanks for the work reproducing it!

bazmonkey · May 15, 2024, 2:00am

I haven't left it alone for a few hours yet, but I can tell you that it definitely looks to have fixed it! It's been spot-on so far. I'll check back after an evening of queries.

Thanks again! Time to proceed with the exporter

bazmonkey · May 15, 2024, 2:26am

I assume I'll checkout the development-v6 branch again in pihole once this is upstream?