Failing DNSSEC lookups after a power outage

Interested in opinions on this - have I thought this through enough?

I've got one particular site that has a Pi-hole installed on a Pi 4, with dnsmasq running DNSSEC validation on a standard config. It has been working fine with some split-horizon local DNS entries in Pi-hole.

The site suffered a power outage beyond the capabilities of the local UPS, so when the Pi came back up its clock was out by a couple of hours (and yes, before anyone says it: stick an RTC hat on - will do that). The problem was that with DNSSEC configured, all external lookups were now being rejected because of the clock error, and the TTLs on cached copies had expired. That included lookups for the NTP servers, which meant the Pi could never get the right time: it was seeing hundreds of NTP resolver enquiries and never satisfying them.

My immediate thought was to add a local NTP server (such as a local Synology NAS, which has a battery-backed RTC) to the Pi's NTP client config (I happen to be running chrony). It's easy to do and means at least one time source is available to the Pi without an external lookup. Of course, after a power cut the Pi and the NAS will both be coming up at the same time, so it may take a little while for things to get back into sync, and the NAS may not come back up at all (for whatever reason). Either way it's worth doing, so I've done that.
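
A minimal sketch of that first fix for chrony. The address 192.168.1.10 and the file name are placeholders, and the sources.d path assumes a Debian-style chrony package that reads extra sources from /etc/chrony/sources.d/*.sources (on older packages the same line would go into /etc/chrony/chrony.conf). Using an IP address rather than a hostname means chrony needs no DNS lookup to reach the NAS.

```shell
#!/bin/sh
# Append a local NTP source to a chrony configuration file, once.
# Usage: add_local_ntp_source <config-file> <ip-address>
add_local_ntp_source() {
    conf_file=$1
    ntp_ip=$2
    line="server ${ntp_ip} iburst"
    # Do nothing if the exact line is already present (idempotent).
    if [ -f "$conf_file" ] && grep -qxF -- "$line" "$conf_file"; then
        return 0
    fi
    printf '%s\n' "$line" >> "$conf_file"
}
```

Something like `add_local_ntp_source /etc/chrony/sources.d/local-nas.sources 192.168.1.10` run as root, followed by a restart of chrony, should be enough to pick it up.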

My second thought was to use the very simple, non-DNSSEC resolver that sits in the local router (as is generally the case), by adding the following to the bottom of the local dnsmasq custom settings file:

server=/pool.ntp.org/_IP_ADDRESS_OF_ROUTER_

Now pi-hole behaves as normal, but will go out to the non-DNSSEC resolver on the router for NTP lookups (which I'm never going to block obviously).
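
For anyone wanting to script this, here is a rough sketch that writes the break-out line into its own drop-in file. It assumes the Pi-hole install reads dnsmasq drop-ins from /etc/dnsmasq.d/; the file name 99-ntp-breakout.conf and the router address 192.168.1.1 are made up for illustration.

```shell
#!/bin/sh
# Write a dnsmasq drop-in that forwards one domain to a chosen upstream.
# Usage: write_ntp_breakout <conf-file> <domain> <upstream-ip>
write_ntp_breakout() {
    conf_file=$1
    domain=$2
    upstream_ip=$3
    # dnsmasq syntax: server=/<domain>/<ipaddr>
    printf 'server=/%s/%s\n' "$domain" "$upstream_ip" > "$conf_file"
}
```

Example call (as root): `write_ntp_breakout /etc/dnsmasq.d/99-ntp-breakout.conf pool.ntp.org 192.168.1.1`, then restart the resolver and check with `dig pool.ntp.org @127.0.0.1` that the name still resolves.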

Anyone see a situation where that won't solve the problem when power is restored after a long outage?

I have installed ntp (sudo apt-get -yq install ntp), configured it and created a cron job (@reboot) with the following content:

sudo /etc/init.d/ntp stop
sudo ntpd -gq
sudo /etc/init.d/ntp start

from the ntpd man:

     -g, --panicgate
             Allow the first adjustment to be Big.  This option may appear an
             unlimited number of times.

             Normally, ntpd exits with a message to the system log if the
             offset exceeds the panic threshold, which is 1000 s by default.
             This option allows the time to be set to any value without
             restriction; however, this can happen only once. If the threshold
             is exceeded after that, ntpd will exit with a message to the
             system log. This option can be used with the -q and -x options.
             See the tinker configuration file directive for other options.

What nameserver is the Pi using?

cat /etc/resolv.conf
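
Why this matters: if the only nameserver listed is 127.0.0.1, the Pi's own NTP client resolves its time servers through Pi-hole and so inherits the DNSSEC dependency. A small sketch that pulls out just the nameserver lines (the function name is only illustrative):

```shell
#!/bin/sh
# Print the nameserver addresses listed in a resolv.conf-style file.
# Usage: list_nameservers [file]   (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}
```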

Yup. The problem isn’t the size of the adjustment from the initial ntp sync - that can be stepped over using settings like the ones you describe, depending on your weapon of choice for ntp. It’s essentially a race condition: the clock is out of date, the dnssec resolver has expired its cached lookup of the ntp server, and it is then unable to resolve it again because the keys no longer validate against an incorrect clock.

If you tell the ntp daemon to get its time from ‘xxx.foo.com’, then when the Pi reboots it is unable to resolve ‘xxx.foo.com’ because the dnssec validation fails on the key lookup - so you end up in a deadly embrace.

Only fixes (other than putting a battery rtc on the pi) I can see that have minimal dependency on external activity are (and are not mutually exclusive by any means):

  1. add a local ‘server’ rather than ‘pool’ entry to your ntp setup that points to a system with an rtc (like a Synology NAS), one that you know is in the PiHole local dns files and that you have nailed down on a static IP address - that also helps when the internet connection is down and you get a power cut, btw

  2. hard code into your local PiHole dns files the IP address of some internet ntp servers and hope they don’t move (or script updating them at some point)

  3. point the dnssec resolver at an upstream resolver that does not implement dnssec, for just the ntp server domain you care about (that carries a small risk of hijacking - but I carry a bigger risk of a power cut!). A pretty generic example will be your local router. If you are setting up a router with dnssec then you have the same problem there; if it comes out of the box set up that way, one would hope the vendor has either hard coded an IP address they control (so you can ‘rely’ on it) or put a battery-backed rtc inside - although always worth checking!

It might not be a bad idea (and I can’t see any harm in it) for PiHole (given dnsmasq doesn’t do it) to allow you to enter somewhere ‘domains that will always be cached locally’ and maintained in the PiHole local config. Can be used for ntp domains but I can think of several others that would always be useful to keep locally and permanently cached upon TTL expiry.

Or a better and a lot less intrusive solution: a ‘local and remote ntp server’ option that PiHole calls on startup to set (or validate) the date if the dnssec option is enabled in the configuration, given how sensitive dnssec is to the date. It could appear in the UI only when the dnssec option is turned on.

It’s also worth noting that the Pi’s restore-on-power-up of the last saved time (the fake-hwclock file kept in /etc) doesn’t help once the outage goes beyond the TTL of the ntp server records.

Have to say it was a pretty spectacular failure mode - clients picking up PiHole from DHCP on a number of vlans failing but infrastructure and administration systems all working fine as they were set to use a non-dnssec resolver to minimise dependency chains. My bad that I wasn’t checking date settings in Prometheus monitoring (now implemented with a suitable alert) - support teams saying ‘world is fine’ and random users saying ‘my email isn’t working’.

Open for discussion: Would disabling DNSSEC on a configurable domain be a solution? Maybe circumventing not in general but more specifically only on timeout errors?

I'm not convinced this would be a good solution but it is an idea and we can discuss it.

That’s another way of doing it - it relies on the admin understanding and joining up the config in PiHole and their ntp client (which is outside PiHole control).

Personally I think (more so the more I mull it over) that for the standard use case (the vast majority) the least intrusive option would be this: when dnssec is enabled in the UI, open up two fields - a ‘local ntp server’ option, defaulting to the local default route as a suggested value, and a ‘remote ntp server’ option, defaulting to a generic ntp server out in internet land.

It’s easy to explain in a caption next to the fields: ‘dnssec requires accurate clocks - please provide local and/or remote clock IP addresses’. It might be worth adding ‘that will be accurate even after a power cut’, but perhaps that isn’t needed to get the desired input.

The local server gets written into the custom list file as ntp1.hole and the remote as ntp2.hole (or whatever), to avoid any upstream interference and to ensure they are always resolved from cache.

On startup PiHole:

if (dnssec_mode and !rtc) {
   check current Pi date against local and remote
   if (consensus) break        // clock already agrees with both, nothing to do
   if (remote_responded) {
      force sync to remote
   } else {
      force sync to local
   }
   if (failed_force_sync) {
      disable dnssec mode
   }
}

(Forgive format - typing pseudo code on an iPhone while on a tube not a great thing to do!)

It’s best effort but if it fails at least things work. If you put fqdn in the local and remote there are some other gymnastics to do obviously.
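
The pseudo code above can be sketched as a pair of shell functions. This is only an illustration of the decision logic, not anything PiHole actually does: the function names and the yes/no arguments are invented here, the actual time queries and syncing are left out, and ‘neither source responded’ is treated as the failed-sync case. The rtc check relies on Linux listing registered hardware clocks under /sys/class/rtc.

```shell
#!/bin/sh
# Return success if the kernel has registered a hardware clock.
# Usage: has_rtc [sysfs-dir]   (defaults to /sys/class/rtc)
has_rtc() {
    for dev in "${1:-/sys/class/rtc}"/rtc*; do
        [ -e "$dev" ] && return 0
    done
    return 1
}

# Decide what to do at startup. Each argument is "yes" or "no".
# Usage: startup_action <dnssec> <rtc> <consensus> <remote_ok> <local_ok>
# Prints one of: none, sync-remote, sync-local, disable-dnssec
startup_action() {
    dnssec=$1 rtc=$2 consensus=$3 remote_ok=$4 local_ok=$5
    if [ "$dnssec" != yes ] || [ "$rtc" = yes ] || [ "$consensus" = yes ]; then
        echo none                 # nothing to fix, or nothing at risk
    elif [ "$remote_ok" = yes ]; then
        echo sync-remote          # prefer the remote clock when reachable
    elif [ "$local_ok" = yes ]; then
        echo sync-local           # fall back to the local clock
    else
        echo disable-dnssec       # best effort failed, keep lookups working
    fi
}
```

For example, `startup_action yes no no no yes` covers the ‘internet down, NAS up’ case and picks the local server.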

Could always wrap a minimal ntp client into PiHole, I guess, preset underneath to not use a dnssec lookup for whatever servers are entered. It can fire up once on startup, and I still think it needs a local server as a last resort to cope with the ‘internet down’ situation.

Or just refuse dnssec mode if no rtc or local rtc server entered - a bit draconian but would also solve the problem.

Interestingly there are a number of threads in the discussion forums that talk about failures on power up that sound remarkably like this situation but have been dismissed or resolved in other ways (or it just sorted itself out when I made this random change and rebooted :slight_smile: ) but actually are probably this underlying issue.

Depending on your choice of time-sync s/w, you may still have to adjust time manually if your s/w finds it off by too much. Other s/w may pick up the correct time, but still stretch out adoption over a considerably long period before clock is in-sync again (from what you wrote, I think you are already aware of this, but I decided to mention that anyway for the benefit of casual readers).

Often, a router can be configured to act as a local time server.
If that's the case for you, you could just add your router's IP to your NTP server list (edit: that's how I run my Pi-hole, which also has an RTC).

Another option would be to use the IP address of one of the NIST time servers, as they seem to have fixed IPs.

And as a not quite clean solution, you could also decide to create a Local DNS Record in Pi-hole for one of your time server names.

Yup all correct - and essentially what I’ve done. Hard cached both a local and remote ntp server, added a local ntp server into the pools on the PiHole server ntp config, and pointed the nist domain lookups at a known non-dnssec resolver (local internet router which happens to be a udmpro - one of the reasons I’m running PiHole as a split horizon locally - grr no option to run local lookups on those things unlike the edgemax… but that’s another story).

The question was: other than those things, can anyone spot anything else I’m missing, or a failure mode that this won’t cope with?

As soon as I spotted the problem, a simple manual stomp on the PiHole server’s clock fixed it, obviously - but it was non-obvious to spot in the heat of battle amid the screams from the human elements :slight_smile: and the PiHole was getting quite a thrashing from clients trying to resolve - including of course the ntp clients.

I have to disclose as well that this is not a trivial problem - I happen to be chair of Jersey Telecom (amongst other things) - and this is worth a quick glance (scroll to the technical bit): https://channeleye.media/jt-shares-the-cause-of-the-12-july-major-service-outage/

There, an ntp clock rollover triggered a failure in the key exchanges at the bgp route exchange level in the core Cisco mesh (simplified as this isn’t a networking forum!) - but it’s a similar dependency. Key exchanges rely on accurate clocks - if you are going to enable key validation to get secure critical data exchanges, make sure the clock fabric is just as safe (!)

On a side note: Whether or not the forward target supports DNSSEC is not relevant - according to dnsmasq documentation, DNSSEC validation will be disabled anyway:

-S, --local, --server=[/[<domain>]/[domain/]][<ipaddr>[#<port>]][@<interface>][@<source-ip>[#<port>]]
(...) DNSSEC validation is turned off for such private nameservers, UNLESS a --trust-anchor is specified for the domain in question.

True - while you are using PiHole/dnssec - but that’s not the only deployment PiHole supports and other resolvers can be plugged into PiHole - and of course the problem is not limited to PiHole - this is a failure mode for any dnssec resolver platform.

That’s one of the reasons I was suggesting we add a mini ntp client into PiHole startup - it’s all within PiHole control at that point and so any dependencies are eliminated for the common case of ‘I just installed PiHole and it worked’

Anyone doing anything smarter will have to brew their own solution - but should be capable of doing it.

You're correct, but as Pi-hole is the prime subject here in this forum, let's keep the discussion focused on what can be done with Pi-hole. :wink:

If you deploy Pi-hole as part of a critical infrastructure, I'd always consider an RTC backup, which would have avoided the issue altogether (and yes: you already admitted that much :wink: ). It's inexpensive and well worth the few bucks (even for a home use scenario).

Personally, to me, expanding Pi-hole to include NTP functionality would seem out of scope. There's just too many dependencies and too many configuration details to consider (and code to maintain).
Time-Syncing is the domain of a whole different class of s/w and should be left to those.

Your other suggestion to add redirection of a specific NTP domain to a customisable forward target like a router seems like the solution that probably would work well within the scope of Pi-hole.

Great! It’s a good point that expanding scope is never a good thing and of course rtc is already on order :slight_smile:

Just thought worth running through all the detail and random ideas while fresh in my mind and for the purposes of documenting the debate for others to find!

Okay, so conclusion on this topic is either / or any combination:

  • Install an RTC or use a board which has a built-in battery-backed clock
  • Add a custom server=/<NTP domain>/<IP address of a non-DNSSEC resolver> line (dnsmasq expects an IP address here, not a hostname)
  • Use the IP address of a local NTP server instead (may be your router) but ensure this server is not using your Pi-hole as DNS or gets the time from a reliable non-Internet source (DCF77, GNSS, etc.)

However, don't

  • Hard-code the IP address of your favorite NTP server as this can cause a lot of unrelated trouble if their IP changes

Did I forget something? If so, I can edit this into here.

Have you ever considered this dnsmasq option: dnssec-no-timecheck?

man dnsmasq:

--dnssec-no-timecheck
DNSSEC signatures are only valid for specified time windows, and should be rejected outside those windows. This generates an interesting chicken-and-egg problem for machines which don't have a hardware real time clock. For these machines to determine the correct time typically requires use of NTP and therefore DNS, but validating DNS requires that the correct time is already known. Setting this flag removes the time-window checks (but not other DNSSEC validation.) only until the dnsmasq process receives SIGINT. The intention is that dnsmasq should be started with this flag when the platform determines that reliable time is not currently available. As soon as reliable time is established, a SIGINT should be sent to dnsmasq, which enables time checking, and purges the cache of DNS records which have not been thoroughly checked.
Earlier versions of dnsmasq overloaded SIGHUP (which re-reads much configuration) to also enable time validation.

If dnsmasq is run in debug mode (--no-daemon flag) then SIGINT retains its usual meaning of terminating the dnsmasq process.

Last time I brought it up was 3 years ago, here, never received any comment on it, but it works...
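
A rough sketch of the hand-over the man page describes: start dnsmasq with --dnssec-no-timecheck, wait until the clock is trustworthy, then send SIGINT. It is written against stock dnsmasq with chrony as the time source; whether pihole-FTL's embedded resolver treats SIGINT the same way is not something confirmed here, and the function names are made up for illustration.

```shell
#!/bin/sh
# Hooks, kept as separate functions so they can be swapped for other tools.
wait_for_sync() {
    # chronyc waitsync <max-tries>: polls until chronyd reports it is synchronised.
    chronyc waitsync 60
}
signal_dnsmasq() {
    # SIGINT re-enables time checking in a daemonised dnsmasq.
    pkill -INT -x dnsmasq
}

# Wait for a trustworthy clock, then tell dnsmasq to resume time-window checks.
enable_timecheck_when_synced() {
    if wait_for_sync; then
        signal_dnsmasq
    else
        echo "clock not synchronised; leaving time checks off" >&2
        return 1
    fi
}
```

Calling `enable_timecheck_when_synced` once from a boot-time unit or script would keep the window without time checks as short as the sync itself.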

Yes, but it is not wise to deploy this unconditionally. It'd allow you to serve valid but old responses, giving rise to a not-so-unlikely kind of man-in-the-middle attack.

An example of how to abuse this follows. Enough technical detail is given to show it's possible, but hopefully little enough not to encourage anyone to actually try it:

Fewer and fewer users/companies host servers themselves. Assume your domain was pointing to a cloud server instance. At some point, you decide that you need a different server with larger memory or whatever. Now assume that this server has a new IP address. No problem, you just change the DNS records of your domain.

Now, a new party may rent the server with your old IP. If they manage to cache-poison a resolver with the old (validly signed but expired) response, they can effectively take over the domain. This could even be abused to get a valid certificate for the domain from providers like Let's Encrypt, so the man-in-the-middle is perfect. All the attacker needs is to know when a server IP changes (easy) and then to keep renting cloud instances (usually on the order of a few cents per instance if only rented for minutes) until they land on the IP they are looking for. Then they cache-poison with the expired (but otherwise valid) record pointing to the old IP and voila - everything looks fine.

This may sound unlikely, however, it is actually possible and I'd have a few ideas how to do this if I really wanted to (hint: I don't). If you don't disable time checking, then this security issue does not exist. If dnssec-no-timecheck could be narrowed down to a single domain, and you know that the party behind it is likely to own the IP addresses forever (!), this would be much less of an issue.

Agree. Thinking about it overnight generally hard coding ip addresses into things you don’t have control of is just wrong for all sorts of reasons.

I get the point about not overloading PiHole with another function (‘ntp’) or dependency, and agree that is the job of other things. I still think that if PiHole is the ‘controller’ that enables dnssec, it should at least validate or warn that time is critical in some way (even if that’s just a note in the UI that says ‘hey, pay attention here’). Providing a mechanism in the UI that lets you divert a specific ntp domain is by far and away the simplest fix.

Yes. There are a bunch of specific things I can do to avoid the situation. That option feels like a bit of a sledgehammer to crack a nut if applied generically, though, and (as previously noted) it opens up all sorts of other issues. And thinking about PiHole’s general user base and purpose (making this stuff available, functional and easy): if I’m able to apply that, I’m probably also able to manually add a break-out for an ntp domain to my chosen resolver.

Given most people won’t have an rtc on their pi (very broad sweeping hand gestures at that point) at least alerting when dnssec is enabled or offering the ntp domain break out feels like the right config option that should be in PiHole’s ‘scope and remit’.

At the very least this thread has probably debated and documented the point thoroughly enough that ‘future me’ will get to this point and understand the issue and the possible solutions available to them - which in itself is a good thing.

rtc on its way from my chosen pi vendor arriving tomorrow :slight_smile:

Good discussion! DL6ER Bucking_Horn - respect! (I’d mention but ‘new users are not allowed to mention other users’ :slight_smile: )