Help us test FTL v5.8 / dnsmasq v2.85

Originally published at: https://pi-hole.net/2021/03/31/help-us-test-ftl-v5-8-dnsmasq-v2-85/

Pi-hole embeds the DNS server dnsmasq, which is currently in release-candidate state for version 2.85. Please join us in the final testing of this version of dnsmasq, to help us ensure there are no major bugs before the final release. You may be receiving a few updates on this branch.

To get the release candidate version, run
pihole checkout ftl update/dnsmasq-v2.85

You can go back at any time using
pihole checkout ftl master

Please also go back to master before updating Pi-hole after the next release. This can be done at any time, even after the update has happened. Support and discussion are available in the linked Discourse thread.

dnsmasq: CHANGELOG

  • Fix problem with DNS retries in 2.83/2.84.
    The new logic in 2.83/2.84 which merges distinct requests for the same domain causes problems with clients which do retries as distinct requests (differing IDs and/or source ports.) The retries just get piggy-backed on the first, failed, request.
    The logic is now changed so that distinct requests for repeated queries still get merged into a single ID/source port, but they now always trigger a re-try upstream.
  • Avoid treating a dhcp-host which has an IPv6 address as eligible for use with DHCPv4 on the grounds that it has no address, and vice-versa.
  • Add dynamic-host option
    A and AAAA records which take their network part from the network of a local interface. Useful for routers with dynamic prefixes.
  • Teach bogus-nxdomain and ignore-address to take an IPv4 subnet.
  • Use random source ports where possible if source addresses/interfaces in use. CVE-2021-3448 applies.
    It’s possible to specify the source address or interface to be used when contacting upstream name servers: server=8.8.8.8@1.2.3.4 or server=8.8.8.8@1.2.3.4#66 or server=8.8.8.8@eth0, and all of these have, until now, used a single socket, bound to a fixed port. This was originally done to allow an error (non-existent interface, or non-local address) to be detected at start-up. This means that any upstream servers specified in such a way don’t use random source ports, and are more susceptible to cache-poisoning attacks.
    We now use random ports where possible, even when the source is specified, so server=8.8.8.8@1.2.3.4 or server=8.8.8.8@eth0 will use random source ports. server=8.8.8.8@1.2.3.4#66 or any use of query-port will use the explicitly configured port, and should only be done with understanding of the security implications. Note that this change changes non-existing interface, or non-local source address errors from fatal to run-time. The error will be logged and communication with the server not possible.
  • Change the method of allocation of random source ports for DNS. Previously, without min-port or max-port configured, dnsmasq would default to the compiled in defaults for those, which are 1024 and 65535. Now, when neither are configured, it defaults instead to the kernel’s ephemeral port range, which is typically 32768 to 60999 on Linux systems. This change eliminates the possibility that dnsmasq may be using a registered port > 1024 when a long-running daemon starts up and wishes to claim it. This change does likely slightly reduce the number of random ports and therefore the protection from reply spoofing. The older behaviour can be restored using the min-port and max-port config switches should that be a concern.
  • Scale the size of the DNS random-port pool based on the value of the dns-forward-max configuration.
  • TFTP tweak: Check sender of all received packets, as specified in RFC 1350 para 4.
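To put the source-port change in numbers, here is a minimal Python sketch comparing the old compiled-in range (1024 to 65535) with the typical Linux ephemeral range quoted above (32768 to 60999). The helper names are hypothetical and not part of dnsmasq. Reading /proc/sys/net/ipv4/ip_local_port_range is an assumption about where a Linux host exposes its range; if the file cannot be read, the sketch falls back to the typical values from the changelog.

```python
import math

def port_pool_size(low: int, high: int) -> int:
    """Number of ports in the inclusive range [low, high]."""
    return high - low + 1

def kernel_ephemeral_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Return the kernel's ephemeral port range, or typical Linux values if unreadable."""
    try:
        with open(path) as f:
            low, high = (int(x) for x in f.read().split())
        return low, high
    except (OSError, ValueError):
        return 32768, 60999  # typical Linux range quoted in the changelog

old = port_pool_size(1024, 65535)    # previous compiled-in defaults
new = port_pool_size(32768, 60999)   # typical Linux ephemeral range
print(f"old: {old} ports ({math.log2(old):.2f} bits)")
print(f"new: {new} ports ({math.log2(new):.2f} bits)")
print("this host:", kernel_ephemeral_range())
```

With these ranges the pool shrinks from 64512 to 28232 ports, a little over one bit of source-port randomness, which is the "slight" reduction the changelog mentions.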

Pi-hole version is v5.2.4 (Latest: v5.2.4)
AdminLTE version is v5.4 (Latest: v5.4)
FTL version is update/dnsmasq-v2.85 vDev-3fa1640 (Latest: v5.7)

Your debug token is: https://tricorder.pi-hole.net/ux6j6s03kq

pihole-FTL.zip (1.8 KB)

What are we looking for:

  • something that doesn't happen anymore?
  • something new that's happening?

Nothing too specific, just whether everything still works. The goal is to catch issues like the broken retry mechanism in the previous/current master version early enough that they don't get released into the wild and annoy users who may not be willing to register here and provide feedback.

Just updated a couple of Pi 3s with this build. I'll keep my eyes open for issues and report back.

Updated both of my pi-holes (Ubuntu 20.04.2 LTS server) yesterday (2021-03-31) morning and all has been well thus far.

started new FTL (update/dnsmasq-v2.85) yesterday (2021-03-31 23:05:57 local time).
hopefully no logic errors...

Epoch timestamp : 1617228000
Date and time (Your time zone) : Thursday, April 1, 2021 12:00:00 AM GMT+02:00
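For anyone wanting to reproduce the epoch value, here is a minimal Python sketch that computes the timestamp for local midnight. It assumes the fixed GMT+02:00 offset shown above; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

def local_midnight_epoch(year, month, day, utc_offset_hours):
    """Epoch seconds for 00:00:00 on the given date at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return int(datetime(year, month, day, tzinfo=tz).timestamp())

print(local_midnight_epoch(2021, 4, 1, 2))  # 1617228000
```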

SELECT count(*) FROM "queries" WHERE timestamp > "1617228000";

count: 7489

SELECT count(*) FROM "queries" WHERE timestamp > "1617228000" and status is "12";

count: 1289

There is no noticeable impact on the user experience.

edit
today's results @ 09:00 local time (CET)

today: 04/02/2021 -> 1617314400
total # of queries today: 1681
status  count   description
0       0       Unknown status
1       399     Domain contained in gravity database
2       1128    Forwarded
3       25      Known, replied to from cache
4       5       Domain matched by a regex blacklist filter
5       0       Domain contained in exact blacklist
6       0       By upstream server (known blocking page IP address)
7       0       By upstream server (0.0.0.0 or ::)
8       0       By upstream server (NXDOMAIN with RA bit unset)
9       44      Domain contained in gravity database (CNAME)
10      0       Domain matched by a regex blacklist filter (CNAME)
11      0       Domain contained in exact blacklist (CNAME)
12      80      Retried query
13      0       Retried but ignored query (DNSSEC)
14      0       Already forwarded, not forwarding again

/edit

Running nicely here on Pi Zero W, with other Pi-Hole modules running dev branches.
(for completeness, using latest unbound as DNS resolver)

Ran the same check as @jpgpi250 , here are my results:

today: 04/02/2021 -> 1617314400
total # of queries today: 41707
status  count   description
0       0       Unknown status
1       21660   Domain contained in gravity database
2       19062   Forwarded
3       133     Known, replied to from cache
4       8       Domain matched by a regex blacklist filter
5       0       Domain contained in exact blacklist
6       0       By upstream server (known blocking page IP address)
7       0       By upstream server (0.0.0.0 or ::)
8       0       By upstream server (NXDOMAIN with RA bit unset)
9       437     Domain contained in gravity database (CNAME)
10      20      Domain matched by a regex blacklist filter (CNAME)
11      0       Domain contained in exact blacklist (CNAME)
12      387     Retried query
13      0       Retried but ignored query (DNSSEC)
14      0       Already forwarded, not forwarding again

@jpgpi250
This may be a brain fart on my side, but I cannot seem to run either of the SELECT commands on 3 different Pis (one built just to check whether the problem was with my earlier instance).
SELECT count(*) FROM "queries" WHERE timestamp > "1617228000";
I keep getting the error below:
-bash: syntax error near unexpected token `('

What am I doing wrong?

Edit:
Managed to get this going with the command below
sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM queries WHERE timestamp > 1617228000 AND status = 12;"
However, the output is a single line with a number, not in tabular form as shown.

The SELECT statement you used returns only a single number: the total number of retried queries registered since the given timestamp. The output is thus correct. I assume DL6ER is mostly interested in the result of the query you executed, as that is one of the things dnsmasq v2.85 is supposed to fix (if I understand the description correctly...).

The output I added in my edit is the result of a script that gets the count for every possible status type from the database. Unfortunately, we're not allowed to share scripts here. You can look at the documentation to learn more about the different status types.

Thanks @jpgpi250.
I figured this must be some type of script but didn’t extend my question that far. I appreciate you pointing out that the script is internal. I managed to run a one-liner using the same documentation and will try to build something from all the available options.

sqlite3 /etc/pihole/pihole-FTL.db --header --column "SELECT status, count(*) FROM 'queries' WHERE timestamp > strftime('%s','now','-24 hours') group by status order by status asc;"

Thanks a lot @yubiuser. I was copying the line 14 times, incrementing the status number each time; your command is much simpler and cleaner.

Taking this a step further, to verify it's not some specific TLD or domain that is causing the problem.

first column (count) = SELECT count(*)
second column (unique) = SELECT count(DISTINCT domain)

today: 04/02/2021 -> 1617314400
total # of queries today: 9612
status  count   unique  description
0       0       0       Unknown status
1       1343    51      Domain contained in gravity database
2       5918    358     Forwarded
3       104     16      Known, replied to from cache
4       18      4       Domain matched by a regex blacklist filter
5       0       0       Domain contained in exact blacklist
6       0       0       By upstream server (known blocking page IP address)
7       0       0       By upstream server (0.0.0.0 or ::)
8       0       0       By upstream server (NXDOMAIN with RA bit unset)
9       209     3       Domain contained in gravity database (CNAME)
10      0       0       Domain matched by a regex blacklist filter (CNAME)
11      0       0       Domain contained in exact blacklist (CNAME)
12      2020    95      Retried query
13      0       0       Retried but ignored query (DNSSEC)
14      0       0       Already forwarded, not forwarding again

Looking at the result of SELECT DISTINCT domain FROM "queries" WHERE timestamp > 1617314400 AND status = 12, it doesn't look like a specific TLD or domain is causing this. It's just my activity (browsing and viewing habits) that makes the unique count much lower than the total count. Another dead end in trying to find a cause...
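A table like the one above (count and unique count per status) could be produced along these lines. This is a minimal sketch, not the script used for the posted output. The function names are hypothetical, the status descriptions are copied from the tables in this thread, and the only assumption about the database is that the queries table has timestamp, status and domain columns, as used in the SELECT statements above.

```python
import sqlite3

# Status descriptions as listed in the tables in this thread
STATUS_DESCRIPTIONS = {
    0: "Unknown status",
    1: "Domain contained in gravity database",
    2: "Forwarded",
    3: "Known, replied to from cache",
    4: "Domain matched by a regex blacklist filter",
    5: "Domain contained in exact blacklist",
    6: "By upstream server (known blocking page IP address)",
    7: "By upstream server (0.0.0.0 or ::)",
    8: "By upstream server (NXDOMAIN with RA bit unset)",
    9: "Domain contained in gravity database (CNAME)",
    10: "Domain matched by a regex blacklist filter (CNAME)",
    11: "Domain contained in exact blacklist (CNAME)",
    12: "Retried query",
    13: "Retried but ignored query (DNSSEC)",
    14: "Already forwarded, not forwarding again",
}

def status_summary(conn, since):
    """Return (status, count, unique_domains, description) for every known status."""
    rows = conn.execute(
        "SELECT status, COUNT(*), COUNT(DISTINCT domain) "
        "FROM queries WHERE timestamp > ? GROUP BY status",
        (since,),
    ).fetchall()
    counts = {status: (total, unique) for status, total, unique in rows}
    # Statuses without any rows are reported with zero counts
    return [(s, *counts.get(s, (0, 0)), desc) for s, desc in STATUS_DESCRIPTIONS.items()]

def print_summary(db_path, since):
    """Open the database read-only and print the summary as a tab-separated table."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        print("status\tcount\tunique\tdescription")
        for status, total, unique, desc in status_summary(conn, since):
            print(f"{status}\t{total}\t{unique}\t{desc}")
    finally:
        conn.close()

# Example call (path taken from the sqlite3 commands in this thread):
# print_summary("/etc/pihole/pihole-FTL.db", 1617314400)
```

Opening the database read-only is a deliberate choice here, so the sketch cannot interfere with pihole-FTL writing to the same file.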

edit
I used the result from the above query (95 domains) as the source for DPT (DNS Performance Test). I checked the stats again after completion: no significant change in the count (status 12). Thus confirmed, it's not a TLD or domain problem.
/edit

Continuing to look for a cause...

Normally, I use unbound as a recursive resolver, so pihole-FTL reports stats based on the data provided by unbound.
As I have a rather specific compiled unbound setup (TCP Fast Open, redis and unbound optimizations), I wanted to make sure this isn't causing the problem. I installed knot-resolver with its out-of-the-box config and changed the pihole-FTL resolver settings to use it, taking unbound out of the picture.
It didn't take long to conclude that my unbound config isn't the cause of the retries: knot-resolver generated similar stats (lots of retries) within a few hours.

While I was testing whether knot-resolver gave me the same features (DNSSEC, IPv6 support, ...), I noticed during one of the tests that the retry count increased significantly when a lot of messages like "dnsmasq[1135]: reply ipv6.test-ipv6.fratec.net is CNAME" appeared in the Pi-hole log. To reproduce this, just browse to this site and you'll see a lot of them. No idea whether this points DL6ER in the right direction, but I hope it helps.

edit
just ran a test on the unique domains with status 12 from today. Out of 88 unique domains listed, 55 are actually cnames. Don't know if this means anything...
/edit

I also noticed the following.

  • Installed a fresh Pi-hole on a new system.
  • As soon as Pi-hole was installed, executed pihole checkout ftl update/dnsmasq-v2.85.
  • Tried to add entries to the aliasclient table of /etc/pihole/pihole-FTL.db. This failed because the table didn't exist. The aliasclient_id field in the network table was also missing.
  • Switched back to master: the aliasclient table was created immediately, as was the aliasclient_id field.
  • Switched back to update/dnsmasq-v2.85: problem solved...

I hope the alias feature is here to stay (not removed in the next version of FTL), as it works perfectly...

Thanks, this hint was very helpful (even if it doesn't have anything to do with CNAMEs). I can reproduce this with your suggested test.

edit Content largely replaced/updated

There seems to be an issue with the "fixed" retry algorithm of dnsmasq. I reported this upstream. The issue is that certain queries are refused by the upstream (SERVFAIL). This is by design of the IPv6 test. dnsmasq retries those SERVFAILs while it shouldn't. Further investigation is going on.

I pushed a small fix that should detect and handle this case. Please update on this branch and try again.

Pi-hole version is v5.2.4 (Latest: v5.2.4)
AdminLTE version is v5.4 (Latest: v5.4)
FTL version is update/dnsmasq-v2.85 vDev-7127e2a (Latest: v5.7)

What are we looking for? I don't see any warnings or specific messages in the Pi-hole log. Or should we run the tests again and watch the retry count?

Reminder, the original v2.85 test branch doesn't create the aliasclient table and the aliasclient_id field. Intentional?

This. A warning like Ignoring self-retry would only be shown for DEBUG_QUERIES=true.

This special branch may not have been in sync with development; however, the aliasclient feature has been around for a while now. The table and field are added when the database is updated to version 9 (the most recent version). Does it work as expected now?

Enabled DEBUG_QUERIES=true

Looking in the pihole-FTL log, I don't see any Ignoring self-retry messages; however, I do see:

[2021-04-04 20:57:11.820 15947M] **** query 14606 is duplicate of 14606

Relevant? The first and last numbers are identical...
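To check how often this happens, the log can be scanned for lines where both query IDs match. This is a minimal Python sketch based only on the message format shown above; the function name is hypothetical, and the log location on a given install is left to the reader.

```python
import re

# Matches the debug message format shown above
DUPLICATE_RE = re.compile(r"query (\d+) is duplicate of (\d+)")

def self_duplicates(lines):
    """Return the query IDs that are reported as duplicates of themselves."""
    ids = []
    for line in lines:
        match = DUPLICATE_RE.search(line)
        if match and match.group(1) == match.group(2):
            ids.append(int(match.group(1)))
    return ids

sample = ["[2021-04-04 20:57:11.820 15947M] **** query 14606 is duplicate of 14606"]
print(self_duplicates(sample))  # [14606]
```

Passing an open file object as `lines` works as well, since the function only iterates over it line by line.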

Yes, this is another, closely related issue which I haven't seen during my tests. I'll push a fix for this later today (edit: done), even though it is completely harmless (a no-op for identical IDs). We should still suppress the debug message in this case.

It's strange that you don't see the self-retry message, but this may just mean you're not affected by the issue I've fixed.

  1. Can you identify the pihole.log lines related to this?
    This is simplified by setting log-queries=extra in /etc/dnsmasq.d/01-pihole.conf, as this ensures the query IDs also show up in pihole.log.

  2. Did the retry count in your database change notably since the most recent update?

  3. Just to ensure we run (about) the same test: What is your result? I'm getting a 10/10 in the end.
    The self-retry messages appear close to the end or even shortly afterwards for me.