I've been racking my brain here trying to figure this one out.
I'm using a ver. 6 beta of Pi-Hole, and also writing a prometheus exporter to use the new API to collect various counts of queries by type, status, client, etc. At the same time, I'm already ingesting my pihole logs into Loki, and deriving counts that way.
With both approaches (API, logs), I'm making per-minute count metrics. Most of the time the two counts match, but I've noticed that they don't always track. Sometimes by a just a couple queries, which I have been able to chase down to the API not including queries of pi.hole itself. But other times the difference is huge.
For example, between 20:08 and 20:09 tonight, my exporter reports 319 queries from a particular host on my network. Querying the API for these exact 60 seconds with this exact client IP gives me just that: 319. But /var/log/pihole.log shows a whopping 815 of them. I can't tell which nearly-500 of them are the odd ones out.
I checked I really am finding 815 lines of the correct form, e.g.: query[A] somedomain.com from 192.168.1.10
I double-checked I really have these in my logs... Loki is correct.
Both the 319 and 815 counts include blocked queries, cache-stale queries, etc. There aren't any pi.hole queries in this minute.
The web interface, which uses the API, of course shows 319 queries in that minute as well.
It doesn’t appear to be just a timing issue: these “missing” queries don’t show up in the minutes before/after this.
The pi-hole itself is working fine, blocking like it always has, etc. etc.
Does anyone know what could be causing the wild discrepancy? The ONLY possible clue I have is that 492 of the queries (hey... that's almost how many extra there are) came in one second, when I refreshed a particular site. It seems like maybe the API is missing some that arrive too fast. But that doesn't make sense to me because it's the process that fills out the log in the first place.
Thanks for bringing this up, bugs can of course always exist and we'll fix then once we identify them!
You said the API tells you there are 319 queries in this in interval. Please go to the Query Log page and request the same time interval: How many are reported here?
If the Query Log also contain be only 319, you should be able to find the difference and this may tell a story. We can also do this research for you if you prefer, we just need the data in any form we could process.
The API /queries response for 192.168.1.116 for a minute 15 minutes ago. 335 records. I set the length huge so that's all of 'em. The website query log and my exporter both agree with this number.
The pihole.log for this IP for that minute. 505 lines query[type] domain from 192.168.1.116.
The full pihole.log for that minute.
I didn't include it here, but the log and API agree in terms of query count for several minutes before and after this.
Thanks for the assistance! I'd love to not only be able to arrive at that 335 number from the logs as well, but know the significance of the extra 170 lines here. I tried to scrutinize the whole log and not just the "query from" lines... but my grep-fu is not good enough to parse it out easily and look for what 170 or 335 of them have in common.
It's the three files with "0236" in it (the minute I chose).
I went to Test Ad Block - Toolz (an "ad-block test" page) to generate these queries (because it generates a lot and causes this log/API discrepancy pretty reliably).
We can indeed find missing queries easily - however - none of this makes sense to me right now. Everything seems perfectly normal in the debug logs and I see no reason for things not to show up in the API response at this time. This all seems to require me investing some work to reproduce this locally where I can fire a ton more debug stuff at it when needed. I will come back to you but it will surely take a few days until I can allocate enough time to do this.
No problem, thanks for the help! Let me know if I can provide you something else.
On the FTL and pihole logs, I didn't filter those for just the IP I've been focusing on (192.168.1.116). When I grepped them I found 303 queries for that IP.
The API query is just for that IP. Sorry if that was confusing.
The pointer to logs are not filtered this time but the API is made the difference, I see 303 now as well. I checked some of the missing queries but still couldn't find a single difference compared to queries that have not been reported in the API.
This brings me to asking you
how does your particular API request look like?
what are your config values (/etc/pihole/pihole.toml) concerning:
misc.privacylevel
webserver.api.excludeDomains
webserver.api.excludeClients
It may also be worthwhile to enable debug.api = true and check the log after you have requested the API response.
edit Also of interest is if these queries have made their way into the database or not. Please try both commands:
pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'adtago.s3.amazonaws.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'events3alt.adcolony.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
curl -X GET "https://pi.hole:443/api/queries?from=1715222160&until=1715222220&length=100000&client_ip=192.168.1.116" \
-H "accept: application/json"
privacylevel is 0, and I have no excluded domains or clients
I get no results for either database query.
rewt@cairon:~/junk $ sudo pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'adtago.s3.amazonaws.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
rewt@cairon:~/junk $ pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE domain = 'events3alt.adcolony.com' AND client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;"
rewt@cairon:~/junk $
I'm also noticing that now, running that same API query (using disk=true), I get only 209 results! I uploaded it to that repository.
I'm running these API calls about 25 seconds after the minute, querying for the last whole minute (e.g. at 02:37:23 the exporter queries 02:36 - 02:37). Is that... too soon to query the last minute? Do things resolve or change after that?
EDIT: the database also now only shows 209 results... but that's where the API query is getting it now that it's on disk, right?
rewt@cairon:~/junk $ pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT * FROM queries WHERE client = '192.168.1.116' AND timestamp BETWEEN 1715222160 AND 1715222220;" | grep 116 -c
209
sudo pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT COUNT(*) FROM queries WHERE timestamp BETWEEN 1715222179 AND 1715222208;"
sudo pihole-FTL sqlite3 -h /etc/pihole/pihole-FTL.db "SELECT COUNT(*) FROM queries WHERE client = '192.168.1.116' AND timestamp BETWEEN 1715222179 AND 1715222208;
edit No, I just saw you already anticipated this question
No, this should be fine.
Another question: How does the situation look like when you don't filter on a specific client, i.e., all queries in that time interval?
Well... I'm at a loss. I can post more data but I don't know how fruitful it will be.
As far as I can tell there's just plain more queries in the logs than the API reports. The more rapidly I query, the more "extra" queries show up.
Is there any chance that something in the API code implicitly assumes that there will be like, only one queried domain per millisecond or something, and is misplacing some? That's what it seems is happening.
This was a pretty fresh install (i.e. just installed the OS and slapped pi-hole on), but I'm tempted to re-install it all as a sanity check.
Has anyone happened to check their logs against the API queries reported like I am? It would be helpful to know if this was just a "me" problem or not.
I'm planing to reproduce this next week but I wasn't able to invest much time into this, so far. However, I'm confident that the scripts I wrote to check your data is like 60% of what is needed so we should have some answers soon(ish). We will eventually get this fixed, no doubts!
Thank you for you patience. I had some time yesterday evening and was able to reproduce this.
The issue is in an algorithm designed to reduce the number of database updates to a reasonable minimum. It was designed to be adaptive but it turns out it cannot cope very well with irregular burst happening only seldomly. I replaced this algorithm with instead updating all queries up to 30 seconds in the past. This will both always include all queries within bursts of any length/frequency as well as queries where the reply take up to 30 seconds. Queries taking even longer to resolve are skipped by this method but this seems fine as the requesting client will have dropped them anyway (timeout). It is overall an improvement even for systems without bursts as it reduces the workload involved with database interactions notably.
I had this running for almost a full day now and when requesting, say, the entire last five full hours, I do not see a single missed query with the new algorithm:
Client
pihole.log
API /queries
Total (sum)
7152
7152
::
315
315
127.0.0.1
3961
3961
192.168.1.1
15
15
192.168.1.2
5
5
192.168.1.3
4
4
192.168.1.4
457
457
192.168.1.26
1
1
192.168.1.33
55
55
192.168.1.39
151
151
192.168.1.57
341
341
192.168.1.109
7
7
192.168.1.120
5
5
192.168.1.127
5
5
192.168.1.147
56
56
192.168.1.236
1774
1774
The script compare_log_api.py (2.2 KB) will do exactly this analysis for you and create such a table, please confirm the fix with your own method independently.
I haven't left it alone for a few hours yet, but I can tell you that it definitely looks to have fixed it! It's been spot-on so far. I'll check back after an evening of queries.