Help us test FTL v5.8 / dnsmasq v2.85

what I did:

  • pihole -up
    Pi-hole version is v5.2.4 (Latest: v5.2.4)
    AdminLTE version is v5.4 (Latest: v5.4)
    FTL version is update/dnsmasq-v2.85 vDev-fe69f11 (Latest: v5.7)
  • stopped FTL and unbound
  • removed all todays queries from the database (all counter in my script -> 0)
    delete from queries where timestamp > 1617573600
  • made requested configuration changes
    DEBUG_QUERIES=true
    log-queries=extra
  • erased all pihole logs (pihole and FTL)
  • started unbound and FTL
  • browsed a little bit, including http://test-ipv6.com (result = 10/10)
  • stopped FTL and copied the logs (have send them to you in a PM)

I don't see any duplicate or Ignoring self-retry messages
The retry counter still increases, but at a much lower rate than before (22 retries, 13 unique in one hour, 1282 queries total)

Small test (one hour) will keep you informed if retry count rate increases

see PM for logs and details.

This may be correct. There may be (expected) retries from your clients. You could grep for **** RETRIED query ... with DEBUG_QUERIES enabled.

found (example) [2021-04-05 09:38:38.474 1029M] **** RETRIED query 1049 as 1051 to fdaa:bbcc:ddee:2::5552 (ID 2)

the count matches the count in the database

Two new retries in a row, no reason for the retry logged at all (if I correctly read the data)

[2021-04-05 10:26:43.266 1029M] **** new UDP query[A] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1488, FTL 10685, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.267 1029M] cdn.syndication.twimg.com is not known
[2021-04-05 10:26:43.269 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1488, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.270 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1488, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.271 1029M] **** new UDP query[AAAA] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1489, FTL 10686, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.272 1029M] cdn.syndication.twimg.com is not known
[2021-04-05 10:26:43.273 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1489, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.275 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1489, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.282 1029M] **** new UDP query[A] query "www.googletagmanager.com" from eth0:192.168.2.228 (ID 1490, FTL 10687, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.283 1029M] www.googletagmanager.com is known as gravity blocked
[2021-04-05 10:26:43.284 1029M] **** new UDP query[AAAA] query "www.googletagmanager.com" from eth0:192.168.2.228 (ID 1491, FTL 10688, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.284 1029M] www.googletagmanager.com is known as gravity blocked
[2021-04-05 10:26:43.298 1029M] **** new UDP query[A] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1492, FTL 10689, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.299 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.299 1029M] **** RETRIED query 1488 as 1492 to 127.10.10.2 (ID 3)
[2021-04-05 10:26:43.300 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1492, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.300 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1492, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.301 1029M] **** new UDP query[AAAA] query "cdn.syndication.twimg.com" from eth0:192.168.2.228 (ID 1493, FTL 10690, /root/project/src/dnsmasq/forward.c:1568)
[2021-04-05 10:26:43.301 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.302 1029M] **** RETRIED query 1489 as 1493 to 127.10.10.2 (ID 3)
[2021-04-05 10:26:43.302 1029M] **** forwarded cdn.syndication.twimg.com to fdaa:bbcc:ddee:2::5552#5552 (ID 1493, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.303 1029M] **** forwarded cdn.syndication.twimg.com to 127.10.10.2#5552 (ID 1493, /root/project/src/dnsmasq/forward.c:613)
[2021-04-05 10:26:43.394 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.394 1029M] Query 1492: CNAME cdn.syndication.twimg.com
[2021-04-05 10:26:43.395 1029M] **** got reply cdn.syndication.twimg.com is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.395 1029M] cs196.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.396 1029M] Query 1492: CNAME cs196.wac.edgecastcdn.net
[2021-04-05 10:26:43.396 1029M] **** got reply cs196.wac.edgecastcdn.net is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.397 1029M] cs2-wac.apr-8315.edgecastdns.net is not known
[2021-04-05 10:26:43.397 1029M] Query 1492: CNAME cs2-wac.apr-8315.edgecastdns.net
[2021-04-05 10:26:43.398 1029M] **** got reply cs2-wac.apr-8315.edgecastdns.net is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.398 1029M] cs2-wac-eu.8315.ecdns.net is not known
[2021-04-05 10:26:43.399 1029M] Query 1492: CNAME cs2-wac-eu.8315.ecdns.net
[2021-04-05 10:26:43.399 1029M] **** got reply cs2-wac-eu.8315.ecdns.net is (CNAME) (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.399 1029M] cs672.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.400 1029M] Query 1492: CNAME cs672.wac.edgecastcdn.net
[2021-04-05 10:26:43.400 1029M] **** got reply cs672.wac.edgecastcdn.net is 192.229.233.50 (ID 1492, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.407 1029M] cdn.syndication.twimg.com is known as not to be blocked
[2021-04-05 10:26:43.408 1029M] Query 1493: CNAME cdn.syndication.twimg.com
[2021-04-05 10:26:43.409 1029M] **** got reply cdn.syndication.twimg.com is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.409 1029M] cs196.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.410 1029M] Query 1493: CNAME cs196.wac.edgecastcdn.net
[2021-04-05 10:26:43.411 1029M] **** got reply cs196.wac.edgecastcdn.net is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.411 1029M] cs2-wac.apr-8315.edgecastdns.net is not known
[2021-04-05 10:26:43.412 1029M] Query 1493: CNAME cs2-wac.apr-8315.edgecastdns.net
[2021-04-05 10:26:43.413 1029M] **** got reply cs2-wac.apr-8315.edgecastdns.net is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.414 1029M] cs2-wac-eu.8315.ecdns.net is not known
[2021-04-05 10:26:43.414 1029M] Query 1493: CNAME cs2-wac-eu.8315.ecdns.net
[2021-04-05 10:26:43.415 1029M] **** got reply cs2-wac-eu.8315.ecdns.net is (CNAME) (ID 1493, /root/project/src/dnsmasq/cache.c:494)
[2021-04-05 10:26:43.416 1029M] cs45.wac.edgecastcdn.net is not known
[2021-04-05 10:26:43.417 1029M] Query 1493: CNAME cs45.wac.edgecastcdn.net
[2021-04-05 10:26:43.417 1029M] **** got reply cs45.wac.edgecastcdn.net is 2606:2800:134:1a0d:1429:742:782:b6 (ID 1493, /root/project/src/dnsmasq/cache.c:494)

Yes, there is always only one reason: The client requested the same domain multiple times. It is dnsmasq politics to interpret this as: The client is impatient, maybe the upstream didn't reply. dnsmasq doesn't keep any records about time itself so it doesn't know if there were 10 msec or 5 minutes between the two queries. Whenever the second arrives, it marks the previous query as retried. And this is what FTL reports here.

Let's look at the first together in more detail:

was forwarded

Later on (well, only 32 ms later), the query was again received from the same client and dnsmasq marked the first one as having been retried:

reports this:

and forwards again (because it assumes the answer to the first will never arrive):

The reply from upstream arrives some 100 msec later and is sent two times to the client.

Before dnsmasq v2.85 the behavior was different: When a query arrived while we're still waiting for the forwarded to arrive back, the query was marked as IN PROGRESS (status 14). This status does not exist any longer as a second query for the same domain does now always trigger a re-submission. See,

The impatient client in this case would be a windows10 20H2, IPv6 enabled, with browser edge chromium Version 89.0.774.68 (Official build) (64-bit). The site I was visiting is DataNews.

Maybe the client isn't impatient, but he same domain needs to be resolved on that page twice. A lot of newspapers, and probably others have a (some) main articles, which they repeat in there sections (tech, sports, ...). This means the identical content is repeated...
A lot of the retried domains in the list (updated in the PM) have front pages, that use this technique (repeating articles) frequently.

doesn't dnsmasq know this, something like a queue of pending requests?

Why does it do that. Why doesn't it simply reply to both queries. The DNS service on the client will have these pending requests in it's queue, so it will be expecting replies to both queries.

I'm NOT an expert so my questions may be foolish...

Yes, but I seem to recall somebody told me that Windows does have a DNS cache itself. In this case (and if you didn't disable it intentionally), it shouldn't ask the same question twice in such a short interval.

Well, yes, but as it has no concept of after which kind of timeout a retry is a useful thing, it just always retries. Wether this is 5 msec or 5 sec. Your upstream wasn't fast enough in this case so dnsmasq thought: Okay, it did not reply. (which is correct at this point)

Resource-economy. It discards the old query and just remembers that it has to reply to the client. You are right, when the client sent two queries independently from each other (different query IDs), it'll never get a reply to one of them. I'll mention this as well to the mailing list when sending my patch (to retry only after a certain timeout to avoid spamming upstream servers) later today.

Here is how to view DNS cache on any Windows 10 computer. My windows DNS cache isn't empty!

There are already two (or multiple requests) forwarded to unbound, thus possibly (not in the unbound cache) to the NS servers. I assume the NS servers will always reply, regardless of the number of times you ask the same question (The question may come from different clients behind a NATted single IPv4). The only thing you would achieve by NOT forwarding the reply to the client (from dnsmasq) is some small packets on the local LAN. Doesn't look like a lot of profit here (my opinion).

Remember: NO OFFENCE intended, even if the replies may sound a little direct.
Very Happy you are taking the time to look into this.

There is, however, still a query ID in the random request. With this the server is able to tell apart different queries from each other.

I sent a patch upstream that will ensure:

  1. When multiple queries arrive, only the first one will be sent upstream. Further queries are just added to a list and are otherwise ignored (they don't trigger a re-submission upstream).
    Only if no reply comes back after the timeout (defaulting to 3 seconds, customizable), another query is sent upstream.
  2. When the first valid reply arrives from upstream, the list is processed. All clients will receive a reply. If the same client asked the same question multiple times, it will get the answer multiple times (every question will be answered).

Now we wait what they say.

There are multiple solutions to tackle this problem, yours looks like a pretty good one (as always). Especially the clients getting answers to all queries looks appealing, this may even fix problems with some applications that go crazy, when not receiving answers.

Will there be a log entry to show that this has happened? Wil there be database entries to show this (status duplicate, reply send anyway) Otherwise you don't even know this is happening...

Will this change (your patch) still be part of v2.85 (already in release candidate stadium) or will we have to wait for v2.86 (hopefully with your patch)

Thanks again for your time and effort, much appreciated.

dnsmasq doesn't want to log such things. We don't mess with their log routines.

pihole-FTL.log will only contain it when in DEBUG_QUERIES mode but it will surely log something about it.

Yes, this will revive the IN_PROGRESS status that is currently deprecated by dnsmasq (as every query is currently forwarded). The individual queries will be summarized in a list by dnsmasq, however, FTL will report them as individual queries (because that's what they are). The first query will be FORWARDED, the others IN_PROGRESS (because we're waiting for the reply and not forwarding on our own).

This is something we cannot control. In my opinion, the current behavior is defect and a fix warrants a v2.85rc3 (we are currently at rc2).

1 Like

From what I'm reading in the dnsmasq discussion group, Simon doesn't agree with your solution very much. What I'm missing in your comments is the fact the client, that sends the two requests, only gets an answer for one of the requests, The second request remains unanswered, thus possibly causing problems on that client.

Maybe, if your solution doesn't make it in it's current form (he appears to be against the fixed 3 second wait), You should at least ask for an option that makes dnsmasq reply to both queries, not simply discarding one of them, thus eliminating the retry status.

You are right. His main point seems to be one of the kind "it was even worse pre-2.83 so what do we complain about here". Do you have any suggestion for an effective alternative to the proposed timeout?

The other issue (the everyone should receive a reply) has been fixed with his commit already. Whether intentional or not.

In the code (the bin file ) your uploaded to the discussion group yesterday, you proposed a default of of 3 seconds with the option to set it to zero to regain the prior behavior. He (Simon) said that users wouldn't understand this, get confused. Why not rewrite the code to have a default of zero, so that the average user gets the default behavior, with the option to change it? You could than add a setting to pihole-01.conf, setting it to the desired value for pihole users only, try to get a lot more testers than you have today, and see if the result satisfies our needs. Maybe Simon will be more open for this...

can we get an update of the current FTL (test v2.85 version) with the patch from Simon applied, so we can test again?

He says:

It does mean the clients which repeat queries from new source ports,
rather than retrying them from the same source port, will see
different behaviour, but it in fact restores the pre-2.83 behaviour,
so it's not expected to be a practical problem.

From the notes in the patch, it appears to be intentional...

I'm a bit shocked that the proposed patch is not applied. Giving servers a grace time of 3 seconds (maybe reduce this to 2 or even 1) seems meaningful. My unbound regularly needs longer for altogether new domains. If a client doesn't retry after a few seconds, the need for getting this query cannot be too high... Or the system is really defect. In which case you added the ability to disable this bandwidth saving behavior completely by setting retry-timeout=0 (if I understand it correctly, but I'm confident I do).

I'm not sure I understand Simon Kelley's comment correctly, maybe @jpgpi250 and @DL6ER can help me understand it:

  1. Retry means same query from same machine, but if the ID is the same is irrelevant?
  2. Repeat means same query but from a different machine, again, the ID is irrelevant?

If so this nomenclature seems very confusing to me, yet, I don't know how to interpret Kelley's comment otherwise.

Let's recap, the history of this topic, DL6ER will correct me if I get it wrong somewhere, apologies if I did.

We were asked to test the new dnsmasq v2.85 branch, which originally was based on dnsmasq v2.85rc2, the latest available as I'm writing this (may / will change in the future). One of the entries in the release notes, "Fix problem with DNS retries in 2.83/2.84" caught my attention, so I started writing some scripts (don't ask for the scripts, not allowed to publish scripts here, I was banned for doing so) to get the retry count (current day only), and found my retry count was very high (up to 15% of all queries). I tried to find a cause of these retries and wrote down my findings (they were wrong, NOT caused by CNAME), as I found a web site that always caused retries. Luckily, DL6ER performed some more extensive tests (wireshark) and was able to duplicate the problem that causes the "retries". I didn't fully agree with his findings the retries are caused by "impatient clients", but than again, I didn't look at a wireshark dump. I believe there are (a lot) of websites that simply require the same domain resolved twice (duplicate articles, which appear, for example, on several news sites (main article, repeated in a specific section).
There are thus two possible scenario's:

  • client submits the same query twice (duplicate content on the web page. Identical URL for the repeated article picture?)
  • client does a retry (impatient client)

In the original v2.85rc2 (both dnsmasq and pihole-FTL), both scenario's were treated in the same way. If it was determined the same request was made twice, only one was answered, the other one discarded (never answered to). My opinion is that this may possibly cause problems for the application (keeps waiting for an answer to the second request). For clarity, this is NOT the problem with the browsers I've tested (firefox and edge chromium), they don't seem to have a problem with the missing reply.

DL6ER than released a second version of the pihole-FTL v2.85 test version, don't really know what he has changed, apart from additional logging features, but my retry count decreased, however, I still noticed the retry count increased for duplicate (same query twice from a single web page), one unanswered.

DL6ER now uploaded a patch proposal (way to change the behavior of dnsmasq) so that it would not simply discard the second (not a client retry) but keep it in a queue and reply to both queries without submitting the query to the upstream twice. That's the 3 second wait. However, Simon isn't really convinced this is the way to go, for various reasons, such as 'to complex for the average user', what if the first upstream query is never answered, will the client eventually give up and display an error.

Simon has now come up with his own method to solve the problem, which he also submitted as a patch, his notes now say dnsmasq will reply to both queries (duplicate), so clients will receive an answer to all submitted queries.

I really want to test this patch, because something in his notes worry me a bit, "Retry -> all servers, Repeat -> next server". I wonder what the effect will be for the user experience, in the end, that's all that counts.

Repeat doesn't necessarily mean different machine, as I explained in the recap, I believe some webpages trigger an identical query twice. All my tests have been performed on a single client.

I didn't test this, but my understanding is a retry should have the same ID (and use the same source port?), a repeat should have a different ID ( and use a different source port? That is what Simon mentions in his latest patch notes). Will this work, Simon isn't convinced (there is no RFC that enforce ID and source port must me the same when retrying)

As far as I understand your and his discussion about Windows DNS caching, this shouldn't happen. The multiple requests should all end up at Windows's DNS cache which itself should ask only once upstream, correct?

I just had a look at the code. The difference is that FTL marked queries as RETRIED when they happen multiple times. Now, this does not happen any more if the same query is just send by the same client multiple times. They are all FORWARDED now, because this is what dnsmasq does internally. So the previous FTL behavior was incorrect. I guess a leftover detail from the recent changes in dnsmasq.

Do we agree that they should not? (because of local DNS chache on the client) I think this is an important detail to the entire discussion.

Yes, I agree on this, however, looking at the patch from Mr. Kelley, this is not what the code does. It only does

retry on existing query, from original source. Send to all available servers

There is no comparison of IDs anywhere. At least not in this patch. Hence my question because I don't seem to understand the patch message. Maybe it is wrong, maybe it is a language barrier.

There hasn't been a reply yet, even though the client asked it twice, so there is nothing in the windows cache. If the answer is in the windows cache, dnsmasq isn't involved, the answer would come from the clients cache, you wouldn't see any requests in the logs.

In order to test, you need to empty the windows cache frequently. If I'm correct, some browsers even have a DNS cache, you need to CTRL-F5 to resubmit.

again, nothing in the cache yet...

DL6ER will enlighten us, I'm sure...