Pie chart Upstream servers does not add up to 100%

yubiuser · August 17, 2021, 7:55pm

Noticed today (very few overall queries).

* **Pi-hole** vDev (release/v5.4, v5.3.1-96-g58a21be)
* **Web Interface** vDev (release/v5.6, v5.5.1-57-geb4b88d)
* **FTL** vDev (release/v5.9, vDev-c1cbe48)

Daxtorim · August 18, 2021, 7:07pm

Did you really not have any forwarded queries?

I'm not on the bleeding edge here, but I'm seeing the same thing, although not quite as bad.

$ pihole -v
  Pi-hole version is release/v5.4 v5.3.1-94-ge163ed44 (Latest: v5.3.1)
  AdminLTE version is release/v5.6 v5.5.1-57-geb4b88d (Latest: v5.5.1)
  FTL version is release/v5.9 vDev-b90a993 (Latest: v5.8.1)

Pie chart after running Pi-hole for 3 days:

Total = 95,7%

Pie chart immediately after restarting the DNS resolver:

Total = 100%

Importing from the database seems to work correctly, but during prolonged uptime some queries seem to not be counted (or over-corrected).

Noteworthy here is that the values particularly for localhost#5335 (unbound) changed quite significantly. But also my target for reverse lookups, fritz.box#53, has increased by almost a third of the previous value.

There was a similar issue a while ago, but the total back then was too high. Now it is too low. My blind guess would have been that maybe the internal queries from pi.hole would not have been counted, but those make up for only 2% of all queries in my case, which is not enough to explain the 4.3% that were actually missing. Otherwise I don't really have any immediate ideas about what the issue could be.

yubiuser · August 19, 2021, 11:29am

I noticed this as well and think it is very strange. I guess the chart is wrong, as I'm seeing forwarded queries within the last 24h to #5335 (which is my local unbound instance).

Also note, the sum of top blocked queries is greater than the number of total queries.

I'm currently not at home so unfortunately I can't provide much more data :-/

DL6ER · August 19, 2021, 8:01pm

What changed recently is that we store all upstream servers we send queries to. Before, we stored only the first one a query was sent to. This does not make a difference when you have only one upstream, however, it makes a difference when you have more than one (as FTL periodically checks which is the fastest one). To account for multiple forward destinations being associated with a query, we broadened the garbage collection algorithm to clean up multiple servers.

[[ Small side-note: As we can store only one server per query in the long-term database, a certain difference in the percentages is expected when restarting FTL. Nevertheless, the values should always add up to 100%. ]]

Looking over the changes, I found one small bug which affects counting: when a query has already been forwarded and is then later blocked due to CNAME inspection. In this case, the upstream count is immediately reduced by one and 24 hours later again (which is wrong -> over-correction). This is so much edge-case that we should definitely continue to monitor if it still happen after you updated and leave it running for a few hours/days. The beta branch has been updated.

yubiuser · August 20, 2021, 6:45am

I do have only one upstream, which is my local unbound.

Will do when being back home.

Daxtorim · August 22, 2021, 12:45am

After two days of uptime since updating it is still borked.

Total = 94.8%

$ pihole -v
  Pi-hole version is release/v5.4 v5.3.1-96-g58a21bee (Latest: v5.3.1)
  AdminLTE version is release/v5.6 v5.5.1-63-ga2ff508 (Latest: v5.5.1)
  FTL version is release/v5.9 vDev-594ff02 (Latest: v5.8.1)

Same as @yubiuser, I also only have a local unbound instance as the sole upstream server. The fritz.box#53 server is just for arpa and fritz.box queries. If Pi-hole (or rather dnsmasq?) starts sending any queries to the DNS server defined in the server=/<domain>/<DNS server> directive (or rev-server), or the other way around, starts sending fritz.box queries to the other general upstream DNS servers, I would count that as a defect. But I don't think that is what is happening here.

I took a look at your changes, but I'm not entirely sure what I'm looking at. You are keeping a tally of overtime statistics and pushing related memory around on garbage collection, but those statistics are never exposed anywhere as far as I can tell. What is the advantage of having both a general upstream->count as well as a more detailed upstream->overTime[timeidx] if the latter isn't really used? Especially since they are always changed together.

`git grep` output

# In the pi-hole/FTL repo, release/v5.9 branch
$ git grep --line-number --show-function --heading --break 'upstream->overTime\|upstream->count'
src/api/api.c
495=void getUpstreamDestinations(const char *client_message, const int *sock)
578:                            percentage = 1e2f * upstream->count / totalqueries;

src/database/query-table.c
339=void DB_read_queries(void)
580:                                            upstream->overTime[timeidx]++;
581:                                            upstream->count++;

src/datastructure.c
74=int findUpstreamID(const char * upstreamString, const in_port_t port)
105:    upstream->count = 0;

src/dnsmasq_interface.c
1264=static void FTL_forwarded(const unsigned int flags, const char *name, const union all_addr *addr,
1347:                   upstream->overTime[timeidx]++;
1349:                   upstream->count++;
1846=static void query_blocked(queriesData* query, domainsData* domain, clientsData* client, const enum query_status new_status)
1862:                   upstream->count--;
1864:                   upstream->overTime[timeidx]--;

src/gc.c
39=void *GC_thread(void *val) {
126:              upstream->overTime[timeidx]--;
127:              upstream->count--;

src/overTime.c
27=static void initSlot(const unsigned int index, const time_t timestamp)
63:                     upstream->overTime[index] = 0;
152=void moveOverTimeMemory(const time_t mintime)
223:                    sum += upstream->overTime[idx];
225:            upstream->count -= sum;
228:                         sum, getstr(upstream->ippos), upstream->port, upstream->count);
231:            memmove(&(upstream->overTime[0]),
232:                    &(upstream->overTime[moveOverTime]),
233:                    remainingSlots*sizeof(*upstream->overTime));

The thing that sticks out to me is this conditional:

github.com/pi-hole/FTL

src/dnsmasq_interface.c

594ff02c0


      
          		// Only count upstream when there has been no reply so far
          		if(query->reply == REPLY_UNKNOWN)
          		{
          			// Update overTime
          			const int timeidx = getOverTimeID(query->timestamp);
          			upstream->overTime[timeidx]++;
          			// Update total count
          			upstream->count++;
          		}

What was the reason for gating the statistic increment now, only introducing it here recently? How likely is it that this condition is false and statistics are not incremented at this point? And why would you not want to increment statistics here if the query was forwarded?

DL6ER · August 22, 2021, 9:59am

It is used - in garbage collection. I'd agree with you that it doesn't make much sense when you have only one upstream server. However, as the queriesData can only hold one upstream server ID, it makes a difference when you have multiple.

Say you have three servers A, B and C. And you may even have set all-servers as config option. This means you always forward three times for each query. So far, we did not count B and C when A was queried first. This has been changed so all three servers get an increment. However, this caused an issue with garbage collection because we memorized only upstream server A's ID within the corresponding query data. Hence, we had to add another over-time memory structure that keeps track of all the upstream forwarding so garbage collection can correct the numbers later on.

I've been thinking about this conditional yesterday, too. However, I've then been distracted but let's look at it again now. I'd say the chances are relatively low for this case to be triggered. It means we forward another query after a reply was already received.

The only case where I can see this happening is a CNAME reply (with a long TTL) we still have cached but for which we need to get some A/AAAA RRs with a shorter TTL. I searched through my logs and indeed found one such case in four days worth of data:

Aug 21 11:12:00 dnsmasq[1253642]: query[A] clients1.google.com from 2a02:<snip>
Aug 21 11:12:00 dnsmasq[1253642]: cached clients1.google.com is <CNAME>
Aug 21 11:12:00 dnsmasq[1253642]: forwarded clients1.google.com to 127.0.0.1
Aug 21 11:12:00 dnsmasq[1253642]: validation result is INSECURE
Aug 21 11:12:00 dnsmasq[1253642]: reply clients1.google.com is <CNAME>
Aug 21 11:12:00 dnsmasq[1253642]: reply clients.l.google.com is 142.250.186.110

The condition isn't really necessary now (but it was before when we haven't had the upstream->overTime structure) so we can just remove it.

I'll do this once I have my development system back up (we're still affected by power issues since the flooding here in Ahrweiler one month ago).

edit I pushed the changes now + I also removed the (now redundant) upstream->count.

Daxtorim · August 22, 2021, 7:35pm

Ah. I didn't really know what to make of the code in src/overTime.c but I think I see now what it is trying to do. I actually had DEBUG_GC enabled, but with only one general upstream DNS server set it didn't really tell my anything. I guess that was to be expected.

I actually see a lot of these cases in my logs for all sorts of services, to the point where this happens once a minute for most of the day. A couple examples that I see many of:

Microsoft, basically every 5 minutes or so while my PC is on.

Aug 22 14:25:08 dnsmasq[16450]: query[A] v10.events.data.microsoft.com from 192.168.178.20
Aug 22 14:25:08 dnsmasq[16450]: cached v10.events.data.microsoft.com is <CNAME>
Aug 22 14:25:08 dnsmasq[16450]: cached global.asimov.events.data.trafficmanager.net is <CNAME>
Aug 22 14:25:08 dnsmasq[16450]: forwarded v10.events.data.microsoft.com to 127.0.0.1
Aug 22 14:25:08 dnsmasq[16450]: reply v10.events.data.microsoft.com is <CNAME>
Aug 22 14:25:08 dnsmasq[16450]: reply global.asimov.events.data.trafficmanager.net is <CNAME>
Aug 22 14:25:08 dnsmasq[16450]: reply onedscolprdcus00.centralus.cloudapp.azure.com is 13.89.178.26

v10.events.data.microsoft.com TTL 3600
global.asimov.events.data.trafficmanager.net TTL 60
onedscolprdcus00.centralus.cloudapp.azure.com TTL 10

Aug 22 14:30:24 dnsmasq[16450]: query[A] v.redd.it from 192.168.178.20
Aug 22 14:30:24 dnsmasq[16450]: cached v.redd.it is <CNAME>
Aug 22 14:30:24 dnsmasq[16450]: forwarded v.redd.it to 127.0.0.1
Aug 22 14:30:24 dnsmasq[16450]: reply v.redd.it is <CNAME>
Aug 22 14:30:24 dnsmasq[16450]: reply reddit.map.fastly.net is 151.101.13.140

v.redd.it TTL 900
reddit.map.fastly.net TTL 30

Spotify

Aug 22 14:33:23 dnsmasq[16450]: query[A] i.scdn.co from 192.168.178.20
Aug 22 14:33:23 dnsmasq[16450]: cached i.scdn.co is <CNAME>
Aug 22 14:33:23 dnsmasq[16450]: forwarded i.scdn.co to 127.0.0.1
Aug 22 14:33:23 dnsmasq[16450]: reply i.scdn.co is <CNAME>
Aug 22 14:33:23 dnsmasq[16450]: reply scdnco.spotify.map.fastly.net is 151.101.134.248

i.scdn.co TTL 300
scdnco.spotify.map.fastly.net TTL 30

Since updating the system it has been running for over 6 hours so far and the pie chart still adds up to 100 % exact. So far it is looking good. I'll let you know when it goes off again.

$ pihole -v
  Pi-hole version is release/v5.4 v5.3.1-96-g58a21bee (Latest: v5.3.1)
  AdminLTE version is release/v5.6 v5.5.1-63-ga2ff508 (Latest: v5.5.1)
  FTL version is release/v5.9 vDev-f73c9d2 (Latest: v5.8.1)

yubiuser · August 23, 2021, 6:48pm

Only gathered data for 2 hours now, but the latest updates fixed the issue for me. Upstream servers add up to 100% again and also my localhost:5335 showed up again. Thanks for the fix.

  Pi-hole version is release/v5.4 v5.3.1-96-g58a21be (Latest: v5.3.1)
  AdminLTE version is release/v5.6 v5.5.1-63-ga2ff508 (Latest: v5.5.1)
  FTL version is release/v5.9 vDev-f73c9d2 (Latest: v5.8.1)

DL6ER · August 25, 2021, 6:17pm

Do you have an update for me two days later?

yubiuser · August 25, 2021, 6:26pm

Still perfect 100%