Help us test FTL v5.8 / dnsmasq v2.85

Thanks a lot @yubiuser - I was copying the same line 14 times and incrementing the status number each time - your script is simpler and cleaner.

I'm taking this a step further, to verify it's not some specific TLD or domain that is causing the problem.

first column (count) = SELECT count(*)
second column (unique) = SELECT count(DISTINCT domain)

today: 04/02/2021 -> 1617314400
total # of queries today: 9612
status  count   unique  description
0       0       0       Unknown status
1       1343    51      Domain contained in gravity database
2       5918    358     Forwarded
3       104     16      Known, replied to from cache
4       18      4       Domain matched by a regex blacklist filter
5       0       0       Domain contained in exact blacklist
6       0       0       By upstream server (known blocking page IP address)
7       0       0       By upstream server (0.0.0.0 or ::)
8       0       0       By upstream server (NXDOMAIN with RA bit unset)
9       209     3       Domain contained in gravity database (CNAME)
10      0       0       Domain matched by a regex blacklist filter (CNAME)
11      0       0       Domain contained in exact blacklist (CNAME)
12      2020    95      Retried query
13      0       0       Retried but ignored query (DNSSEC)
14      0       0       Already forwarded, not forwarding again
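For reference, the two columns above come from running the same query once per status. A minimal sketch of that loop (against a tiny in-memory stand-in database, since the real /etc/pihole/pihole-FTL.db schema has many more columns) could look like this:

```python
import sqlite3

# Stand-in for /etc/pihole/pihole-FTL.db: a "queries" table with only the
# three columns the per-status query needs (timestamp, status, domain).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE queries (timestamp INTEGER, status INTEGER, domain TEXT)")
db.executemany(
    "INSERT INTO queries VALUES (?, ?, ?)",
    [
        (1617314500, 2, "example.com"),   # forwarded
        (1617314501, 2, "example.com"),   # forwarded again, same domain
        (1617314502, 12, "example.com"),  # retried
        (1617314503, 12, "example.com"),  # retried, same domain
    ],
)

since = 1617314400  # start of the day under investigation
results = {}
for status in range(15):  # statuses 0..14
    results[status] = db.execute(
        "SELECT count(*), count(DISTINCT domain) FROM queries"
        " WHERE timestamp > ? AND status = ?",
        (since, status),
    ).fetchone()

print(results[2], results[12])  # -> (2, 1) (2, 1)
```

Repeating the same domain keeps count(*) growing while count(DISTINCT domain) stays flat, which is exactly the pattern in the table above.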

Looking at the result of SELECT DISTINCT domain FROM "queries" WHERE timestamp > 1617314400 AND status = 12, it doesn't look like a specific TLD or domain is causing this; it's just my activity (browsing and viewing habits) that makes the unique count much lower than the total count. Trying to find a cause - another dead end...

edit
I used the result of the above query (95 domains) as the source for DPT (DNS Performance Test). Checked the stats again after completion: no significant change in the count (status 12), thus confirming it's not a TLD or domain problem.
/edit

Continuing to look for a cause...

Normally, I use unbound as a recursive resolver, so pihole-FTL reports stats based on the data provided by unbound.
As I have a rather specific compiled unbound setup (TCP Fast Open, Redis, and unbound optimizations), I wanted to make sure this wasn't causing the problem, so I installed knot-resolver with its out-of-the-box config and changed the pihole-FTL resolver settings to use knot-resolver, taking unbound out of the picture.
It didn't take long to conclude that my unbound config isn't the cause of the retries: knot-resolver generated similar stats (lots of retries) within a few hours.

While I was testing whether knot-resolver gave me the same features (DNSSEC, IPv6 support, ...), I noticed during one of the tests that the retry count increased significantly when a lot of messages like "dnsmasq[1135]: reply ipv6.test-ipv6.fratec.net is CNAME" appeared in the pihole log. To reproduce this, just browse to that site; you'll see a lot of them. No idea whether this points DL6ER in the right direction - hope it helps.

edit
I just ran a test on the unique domains with status 12 from today. Of the 88 unique domains listed, 55 are actually CNAMEs. Don't know if this means anything...
/edit

I also noticed the following.

  • Installed a fresh Pi-hole on a new system.
  • As soon as Pi-hole was installed, executed pihole checkout ftl update/dnsmasq-v2.85
  • Tried to add entries to the aliasclient table of /etc/pihole/pihole-FTL.db; this failed because the table didn't exist. The aliasclient_id field in the network table was also missing.
  • Switched back to master; the aliasclient table was immediately created, ditto the aliasclient_id field.
  • Switched back to update/dnsmasq-v2.85 - problem solved...

I hope the alias feature is here to stay (not removed in the next version of FTL), as it works perfectly...

Thanks, this hint was very helpful (even if it doesn't have anything to do with CNAMEs). I can reproduce this with your suggested test.

edit Content largely replaced/updated

There seems to be an issue with the "fixed" retry algorithm of dnsmasq; I reported this upstream. The issue is that certain queries are refused by the upstream (SERVFAIL) - this is by design of the IPv6 test - and dnsmasq retries those SERVFAILs even though it shouldn't. Further investigation is ongoing.

I pushed a small fix that should detect and handle this case. Please update on this branch and try again.

Pi-hole version is v5.2.4 (Latest: v5.2.4)
AdminLTE version is v5.4 (Latest: v5.4)
FTL version is update/dnsmasq-v2.85 vDev-7127e2a (Latest: v5.7)

What are we looking for? I don't see any warnings or specific messages in the pihole log. Or should we run the tests again and watch the retry count?

Reminder, the original v2.85 test branch doesn't create the aliasclient table and the aliasclient_id field. Intentional?

This. A warning like Ignoring self-retry would only be shown for DEBUG_QUERIES=true.

This special branch may not have been in sync with development; however, the aliasclient feature has been around for a while by now. The tables are added when the database is updated to version 9 (which is the most recent version). Does it work as expected now?

Enabled DEBUG_QUERIES=true

Looking in the pihole-FTL log, I don't see any Ignoring self-retry messages; however, I do see:

[2021-04-04 20:57:11.820 15947M] **** query 14606 is duplicate of 14606

Relevant? The first and last numbers are identical...

Yes, this is another, closely related issue which I haven't seen during my tests. I'll push a fix for this later today (edit: done) even though it is completely harmless (a no-op for identical IDs). We should still suppress the debug message in this case.

It's strange that you don't see the self-retry message, but this may just mean you're not even affected by the issue I've fixed.

  1. Can you identify the pihole.log lines related to this?
    This is simplified by setting log-queries=extra in /etc/dnsmasq.d/01-pihole.conf, as this ensures the query IDs also show up in pihole.log.

  2. Did the retry count in your database change notably since the most recent update?

  3. Just to ensure we run (about) the same test: what is your result? I'm getting a 10/10 in the end.
    The self-retry messages appear close to the end or even shortly afterwards for me.

what I did:

  • pihole -up
    Pi-hole version is v5.2.4 (Latest: v5.2.4)
    AdminLTE version is v5.4 (Latest: v5.4)
    FTL version is update/dnsmasq-v2.85 vDev-fe69f11 (Latest: v5.7)
  • stopped FTL and unbound
  • removed all of today's queries from the database (all counters in my script -> 0)
    delete from queries where timestamp > 1617573600
  • made requested configuration changes
    DEBUG_QUERIES=true
    log-queries=extra
  • erased all pihole logs (pihole and FTL)
  • started unbound and FTL
  • browsed a little bit, including http://test-ipv6.com (result = 10/10)
  • stopped FTL and copied the logs (have sent them to you in a PM)

I don't see any duplicate or Ignoring self-retry messages.
The retry counter still increases, but at a much lower rate than before (22 retries, 13 unique in one hour, 1282 queries total).

Small test (one hour); I'll keep you informed if the retry count rate increases.

see PM for logs and details.

This may be correct. There may be (expected) retries from your clients. You could grep for **** RETRIED query ... with DEBUG_QUERIES enabled.

found (example) [2021-04-05 09:38:38.474 1029M] **** RETRIED query 1049 as 1051 to fdaa:bbcc:ddee:2::5552 (ID 2)

The count matches the count in the database.

Two new retries in a row, with no reason for the retry logged at all (if I'm reading the data correctly):

[2021-04-05 10:26:43.266 1029M] **** new UDP query[A] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1488, FTL 10685, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.267 1029M] cdn.syndication.twimg.com is not known
[2021-04-05 10:26:43.269 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1488, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.270 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1488, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.271 1029M] **** new UDP query[AAAA] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1489, FTL 10686, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.272 1029M] cdn.syndication.twimg.com is not known
[2021-04-05 10:26:43.273 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1489, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.275 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1489, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.282 1029M] **** new UDP query[A] query "www.googletagmanager.com" from eth0:192.168.2.228 (ID 1490, FTL 10687, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.283 1029M] www.googletagmanager.com is known as gravity blocked
[2021-04-05 10:26:43.284 1029M] **** new UDP query[AAAA] query "www.googletagmanager.com" from eth0:192.168.2.228 (ID 1491, FTL 10688, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.284 1029M] www.googletagmanager.com is known as gravity blocked
[2021-04-05 10:26:43.298 1029M] **** new UDP query[A] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1492, FTL 10689, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.299 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.299 1029M] **** RETRIED query 1488 as 1492 to 127.10.10.2 (ID 3)
[2021-04-05 10:26:43.300 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1492, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.300 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1492, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.301 1029M] **** new UDP query[AAAA] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1493, FTL 10690, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.301 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.302 1029M] **** RETRIED query 1489 as 1493 to 127.10.10.2 (ID 3)
[2021-04-05 10:26:43.302 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1493, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.303 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1493, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.394 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.394 1029M] Query 1492: CNAME cdn.syndication.twimg.com
[2021-04-05 10:26:43.395 1029M] **** got reply cdn.syndication.twimg.com is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.395 1029M] cs196.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.396 1029M] Query 1492: CNAME cs196.wac.edgecastcdn.net
[2021-04-05 10:26:43.396 1029M] **** got reply cs196.wac.edgecastcdn.net is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.397 1029M] cs2-wac.apr-8315.edgecastdns.net is not known
[2021-04-05 10:26:43.397 1029M] Query 1492: CNAME cs2-wac.apr-8315.edgecastdns.net
[2021-04-05 10:26:43.398 1029M] **** got reply cs2-wac.apr-8315.edgecastdns.net is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.398 1029M] cs2-wac-eu.8315.ecdns.net is not known
[2021-04-05 10:26:43.399 1029M] Query 1492: CNAME cs2-wac-eu.8315.ecdns.net
[2021-04-05 10:26:43.399 1029M] **** got reply cs2-wac-eu.8315.ecdns.net is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.399 1029M] cs672.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.400 1029M] Query 1492: CNAME cs672.wac.edgecastcdn.net
[2021-04-05 10:26:43.400 1029M] **** got reply cs672.wac.edgecastcdn.net is 192.229.233.50 (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.407 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.408 1029M] Query 1493: CNAME cdn.syndication.twimg.com
[2021-04-05 10:26:43.409 1029M] **** got reply cdn.syndication.twimg.com is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.409 1029M] cs196.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.410 1029M] Query 1493: CNAME cs196.wac.edgecastcdn.net
[2021-04-05 10:26:43.411 1029M] **** got reply cs196.wac.edgecastcdn.net is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.411 1029M] cs2-wac.apr-8315.edgecastdns.net is not known
[2021-04-05 10:26:43.412 1029M] Query 1493: CNAME cs2-wac.apr-8315.edgecastdns.net
[2021-04-05 10:26:43.413 1029M] **** got reply cs2-wac.apr-8315.edgecastdns.net is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.414 1029M] cs2-wac-eu.8315.ecdns.net is not known
[2021-04-05 10:26:43.414 1029M] Query 1493: CNAME cs2-wac-eu.8315.ecdns.net
[2021-04-05 10:26:43.415 1029M] **** got reply cs2-wac-eu.8315.ecdns.net is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.416 1029M] cs45.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.417 1029M] Query 1493: CNAME cs45.wac.edgecastcdn.net
[2021-04-05 10:26:43.417 1029M] **** got reply cs45.wac.edgecastcdn.net is 2606:2800:134:1a0d:1429:742:782:b6 (ID 1493, /root/project/src/dnsmasq/cache.c:494)

Yes, there is always only one reason: the client requested the same domain multiple times. It is dnsmasq policy to interpret this as: the client is impatient, maybe the upstream didn't reply. dnsmasq doesn't keep any timing records itself, so it doesn't know whether 10 msec or 5 minutes passed between the two queries. Whenever the second one arrives, it marks the previous query as retried. And this is what FTL reports here.
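The behaviour described above can be sketched as a toy model (this is illustrative Python, not dnsmasq code): because no timestamps are kept, any repeat of a still-pending question marks the earlier query as retried and triggers a fresh upstream submission.

```python
from dataclasses import dataclass, field

@dataclass
class Forwarder:
    """Toy model of dnsmasq v2.85's retry marking (not real dnsmasq code)."""
    pending: dict = field(default_factory=dict)   # (client, domain) -> query id
    retried: list = field(default_factory=list)   # ids marked as RETRIED
    forwards: int = 0                             # upstream submissions

    def query(self, client, domain, qid):
        key = (client, domain)
        if key in self.pending:
            # Same question again while the first is still unanswered:
            # the earlier query is marked as retried, no matter whether
            # 10 msec or 5 minutes passed...
            self.retried.append(self.pending[key])
        self.pending[key] = qid
        # ...and v2.85 always re-submits the query upstream.
        self.forwards += 1

# Replaying the log excerpt above: query 1488, then 1492 only 32 ms later.
f = Forwarder()
f.query("192.168.2.228", "cdn.syndication.twimg.com", 1488)
f.query("192.168.2.228", "cdn.syndication.twimg.com", 1492)
```

After the second call, query 1488 is on the retried list and two upstream submissions have gone out, matching the `**** RETRIED query 1488 as 1492` line in the log.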

Let's look at the first together in more detail:

was forwarded

Later on (well, only 32 ms later), the query was again received from the same client and dnsmasq marked the first one as having been retried:

reports this:

and forwards again (because it assumes the answer to the first will never arrive):

The reply from upstream arrives some 100 msec later and is sent two times to the client.

Before dnsmasq v2.85, the behavior was different: when a query arrived while we were still waiting for the forwarded query to come back, the new query was marked as IN PROGRESS (status 14). This status no longer exists, as a second query for the same domain now always triggers a re-submission. See,

The impatient client in this case would be a Windows 10 20H2 machine, IPv6 enabled, running Edge Chromium Version 89.0.774.68 (Official build) (64-bit). The site I was visiting is DataNews.

Maybe the client isn't impatient, but the same domain needs to be resolved twice on that page. A lot of newspapers, and probably others, have some main articles which they repeat in their sections (tech, sports, ...). This means the identical content is repeated...
A lot of the retried domains in the list (updated in the PM) have front pages that use this technique (repeating articles) frequently.

Doesn't dnsmasq keep track of this, with something like a queue of pending requests?

Why does it do that? Why doesn't it simply reply to both queries? The DNS service on the client will have these pending requests in its queue, so it will be expecting replies to both queries.

I'm NOT an expert so my questions may be foolish...

Yes, but I seem to recall somebody told me that Windows does have a DNS cache itself. In this case (and if you didn't disable it intentionally), it shouldn't ask the same question twice in such a short interval.

Well, yes, but as it has no concept of the timeout after which a retry becomes useful, it just always retries, whether that is 5 msec or 5 sec. Your upstream wasn't fast enough in this case, so dnsmasq thought: okay, it did not reply (which is correct at this point).

Resource economy. It discards the old query and just remembers that it has to reply to the client. You are right: when the client sent two queries independently of each other (with different query IDs), it will never get a reply to one of them. I'll mention this to the mailing list as well when sending my patch (to retry only after a certain timeout, to avoid spamming upstream servers) later today.

Here is how to view the DNS cache on any Windows 10 computer. My Windows DNS cache isn't empty!

There are already two (or more) requests forwarded to unbound, and thus possibly (if not in the unbound cache) on to the NS servers. I assume the NS servers will always reply, regardless of how many times you ask the same question (the questions may come from different clients behind a NATted single IPv4 address). The only thing you achieve by NOT forwarding the reply to the client (from dnsmasq) is saving a few small packets on the local LAN. Doesn't look like a lot of profit to me (my opinion).

Remember: NO OFFENCE intended, even if the replies may sound a little direct.
Very happy you're taking the time to look into this.

There is, however, still a query ID in the request. With this, the server is able to tell different queries apart from each other.

I sent a patch upstream that will ensure:

  1. When multiple queries arrive, only the first one is sent upstream. Further queries are just added to a list and otherwise ignored (they don't trigger a re-submission upstream).
    Only if no reply comes back after the timeout (defaulting to 3 seconds, customizable) is another query sent upstream.
  2. When the first valid reply arrives from upstream, the list is processed. All clients will receive a reply. If the same client asked the same question multiple times, it will get the answer multiple times (every question will be answered).

Now we wait to see what they say.

There are multiple solutions to tackle this problem; yours looks like a pretty good one (as always). Especially the fact that clients get answers to all queries looks appealing; this may even fix problems with some applications that go crazy when not receiving answers.

Will there be a log entry to show that this has happened? Will there be database entries to show this (status duplicate, reply sent anyway)? Otherwise you don't even know it's happening...

Will this change (your patch) still be part of v2.85 (already at release candidate stage), or will we have to wait for v2.86 (hopefully with your patch)?

Thanks again for your time and effort, much appreciated.

dnsmasq doesn't want to log such things. We don't mess with their log routines.

pihole-FTL.log will only contain it in DEBUG_QUERIES mode, but it will surely log something about it.

Yes, this will revive the IN_PROGRESS status that is currently deprecated by dnsmasq (as every query is currently forwarded). The individual queries will be summarized in a list by dnsmasq, however, FTL will report them as individual queries (because that's what they are). The first query will be FORWARDED, the others IN_PROGRESS (because we're waiting for the reply and not forwarding on our own).

This is something we cannot control. In my opinion, the current behavior is defective, and a fix warrants a v2.85rc3 (we are currently at rc2).
