DNS Resolver is getting sluggish/unstable after time

Bucking_Horn · March 2, 2023, 5:29pm

Those domain counts are too low - that output doesn't cover the timeframe when the 'maximum concurrent' warning was logged.

That's quite likely my fault, of course - I forgot to have you adjust the SQL statement for your local time zone.

Assuming your Pi-hole host machine is configured for a timezone matching UTC +01:00 (e.g. WAT or CET), please run the following statement:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT client, count(domain) FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:14:40+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by client ORDER BY count(domain) DESC LIMIT 10;"

If your timezone differs from UTC+01:00, please substitute both(!) of the +01:00 literals trailing the two datetimes in the above statement.
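To see what the offset actually changes, here is a minimal Python sketch that converts the same wall-clock window into Unix timestamps for two different offsets. The helper name `window_epochs` is made up for this illustration and is not part of Pi-hole; the datetimes are the ones from the statement above.

```python
from datetime import datetime


def window_epochs(start, end, utc_offset="+01:00"):
    """Return (start, end) as Unix timestamps for wall-clock times at utc_offset."""
    def to_epoch(stamp):
        # fromisoformat() understands 'YYYY-MM-DD HH:MM:SS+HH:MM'
        return int(datetime.fromisoformat(f"{stamp}{utc_offset}").timestamp())
    return to_epoch(start), to_epoch(end)


# Same wall-clock window, interpreted as UTC+01:00 and as plain UTC.
cet = window_epochs("2023-03-02 03:14:40", "2023-03-02 03:14:43", "+01:00")
utc = window_epochs("2023-03-02 03:14:40", "2023-03-02 03:14:43", "+00:00")

print(cet, utc)
print(utc[0] - cet[0])  # the two windows lie 3600 seconds apart
```

With the wrong offset, the three-second window lands a full hour away from the moment the warning was logged, so the query counts whatever happened to be requested then.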

Johnnii360 · March 2, 2023, 6:00pm

No problem. But I think it's useless now because I rebooted the Pi.

192.168.178.32|8
192.168.178.24|8
192.168.178.64|4
fe80::464e:6dff:fe7f:2341|3
192.168.178.1|3
192.168.178.21|1

Bucking_Horn · March 2, 2023, 6:08pm

Those are even fewer.

Your debug log shows the warning as follows:

*** [ DIAGNOSING ]: Pi-hole diagnosis messages
 count  last timestamp       type                  message
 -----  -------------------  --------------------  ------------------------------------------------------------
 1      2023-03-02 03:14:42  DNSMASQ_WARN          Maximum number of concurrent DNS queries reached (max: 150)

Are you sure you applied the correct time modifier?
You may check your timezone via timedatectl.

As for the reboot: that shouldn't matter. The SQL statement searches the long-term database.

DL6ER · March 2, 2023, 7:11pm

Well, this does not necessarily have to come from an excessive amount of queries in a short time but can also mean you are not receiving m/any replies. Say you send two queries per second but never receive answers from upstream. This would make 150 concurrent (as in actively waiting for a reply at the same time) in little more than one minute.
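The arithmetic behind that estimate, as a rough Python sketch. It assumes a constant query rate and that no pending query is ever answered or timed out, which is a simplification; the function name is made up for this illustration.

```python
def seconds_until_limit(queries_per_second, max_concurrent=150):
    """Seconds until max_concurrent queries are pending, if none is ever answered."""
    return max_concurrent / queries_per_second


t = seconds_until_limit(2)
print(t)  # 75.0, i.e. one minute and 15 seconds
```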

This somewhat matches the overall title of this topic. Either you have reliably picked exactly those servers that are unreliable, or there are other things in your network that are causing (partial?) network failures. While normal Internet traffic (using the TCP protocol) is easily re-requested when packets get lost in transit, the DNS protocol (utilizing UDP) has no such built-in retransmission, so intermittent Internet access affects you much more strongly.

Johnnii360 · March 2, 2023, 7:17pm

Now it should be correct.

192.168.178.32|1429
192.168.178.24|1058
192.168.178.21|1045
192.168.178.64|982
192.168.178.1|586
fe80::464e:6dff:fe7f:2341|564
192.168.178.26|242
192.168.178.22|120
192.168.178.23|82
2a02:810d:b63f:fdc8:6531:3642:376d:73a9|70

Edit:
Interesting... 32 is my TP-Link C200 IP-Cam, 24 my TP-Link LB130 Smart Light Bulb and 21 the Vodafone GigaTV2 Box. 64 is my homeassistant.

Johnnii360 · March 2, 2023, 7:20pm

Hmm... It's a good question. I don't know if DNS.WATCH is reliable or not. Is Cloudflare maybe better?

DL6ER · March 2, 2023, 7:45pm

I don't know, it also strongly depends on your location. Google is known for its reliability but simply go ahead and try.

Bucking_Horn · March 2, 2023, 8:07pm

Ok, now let's see what those clients have requested.

Please run:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT client, domain, count(domain), datetime(min(timestamp), 'unixepoch', 'localtime'), datetime(max(timestamp), 'unixepoch', 'localtime') FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:14:40+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by client, domain ORDER BY count(domain) DESC LIMIT 20;"

Again, please adjust the +01:00 as required.

Johnnii360 · March 3, 2023, 6:54am

+01:00 is good but I adjusted the first time to 02:14:40+01:00.

Here you are:

192.168.178.32|rtsp-dcipc.tplinknbu.com|732|2023-03-02 02:59:31|2023-03-02 03:14:43
192.168.178.32|euw1-relay-i-04f41c7ec834938bb.dcipc.i.tplinknbu.com|513|2023-03-02 03:01:55|2023-03-02 03:14:43
192.168.178.21|connectivitycheck.gstatic.com|457|2023-03-02 02:56:33|2023-03-02 03:14:39
192.168.178.64|checkonline.home-assistant.io|457|2023-03-02 02:22:12|2023-03-02 03:14:39
192.168.178.24|pool.ntp.org|449|2023-03-02 02:35:14|2023-03-02 03:14:43
192.168.178.24|time.nist.gov|449|2023-03-02 02:35:14|2023-03-02 03:14:43
192.168.178.64|srz28.homematic.com|306|2023-03-02 02:56:40|2023-03-02 03:11:55
192.168.178.26|dispatch.mcs2.miele.com|240|2023-03-02 02:56:25|2023-03-02 03:14:38
192.168.178.21|beta-api.crunchyroll.com|172|2023-03-02 02:26:49|2023-03-02 03:14:42
192.168.178.64|o427061.ingest.sentry.io|148|2023-03-02 02:38:11|2023-03-02 03:14:42
192.168.178.24|n-devs.tplinkcloud.com|145|2023-03-02 02:56:30|2023-03-02 03:14:41
fe80::464e:6dff:fe7f:2341|outlook.office365.com|122|2023-03-02 02:29:22|2023-03-02 03:14:39
192.168.178.1|outlook.office365.com|119|2023-03-02 02:29:26|2023-03-02 03:14:39
192.168.178.21|connectivitycheck.gstatic.com.fritz.box|108|2023-03-02 02:57:15|2023-03-02 03:14:38
192.168.178.32|n-devs-dcipc.tplinkcloud.com|104|2023-03-02 02:58:22|2023-03-02 03:14:40
192.168.178.21|beta-api.crunchyroll.com.fritz.box|85|2023-03-02 02:57:07|2023-03-02 03:14:39
192.168.178.32|n-deventry-dcipc.tplinkcloud.com|72|2023-03-02 02:57:06|2023-03-02 03:14:19
192.168.178.25|n-devs.tplinkcloud.com|64|2023-03-02 02:58:06|2023-03-02 03:14:24
192.168.178.22|lookup.homematic-ip.com|60|2023-03-02 02:57:11|2023-03-02 03:14:25
192.168.178.22|lookup.homematic.com|60|2023-03-02 02:56:37|2023-03-02 03:12:46

But today it's all fine.

Edit:
Okay, not really. I was watching a few YouTube Shorts and then it suddenly slowed down again. But I don't see any errors in the Pi-hole diagnosis section. Maybe I should test Cloudflare now.

192.168.178.39|223
fd05:51a5:9dc4:0:3a32:b1be:8560:bdd1|179
fd05:51a5:9dc4:0:b140:2d48:3b04:158f|96
fd05:51a5:9dc4:0:a6eb:d837:85b7:259a|64
fd05:51a5:9dc4:0:709b:496b:268:be90|27
192.168.178.42|26
fd05:51a5:9dc4:0:a5d5:ebe9:8c17:e8e6|10
192.168.178.28|8
192.168.178.1|6
192.168.178.64|5

The first one is my PC, the second the Pi-hole.
https://tricorder.pi-hole.net/lG3AqsVW/

Bucking_Horn · March 3, 2023, 9:49pm

That time frame is too large.

For the concurrency warning to be triggered, you need a certain amount of DNS requests in a short time frame to saturate Pi-hole's upstream connection pool, either because of a substantially large number of requests or because of very slow or non-responding upstream resolvers, or perhaps both.

Only the requests arriving shortly before the warning is triggered may provide clues as to which clients, domains or upstreams would contribute to the warning.

Please rerun the command with the time frame as provided.
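To illustrate how much the width of the window changes the counts, here is a self-contained Python sketch that runs a comparable GROUP BY query against a throwaway in-memory table. The table holds only the columns used in this thread, the client address and domain are made-up placeholders, and the helper names are hypothetical; it does not touch /etc/pihole/pihole-FTL.db.

```python
import sqlite3
from datetime import datetime


def epoch(stamp_with_offset):
    """Unix timestamp for a 'YYYY-MM-DD HH:MM:SS+HH:MM' string."""
    return int(datetime.fromisoformat(stamp_with_offset).timestamp())


def top_clients(conn, start, end, limit=10):
    """Count queries per client between start and end (both inclusive)."""
    sql = """
        SELECT client, count(domain) FROM queries
        WHERE timestamp BETWEEN ? AND ?
        GROUP BY client ORDER BY count(domain) DESC, client LIMIT ?
    """
    return conn.execute(sql, (epoch(start), epoch(end), limit)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (timestamp INTEGER, client TEXT, domain TEXT)")

# One made-up client sending one query per second from 02:14:40 to 03:14:43.
base = epoch("2023-03-02 02:14:40+01:00")
rows = [(base + i, "192.0.2.10", "example.com") for i in range(3604)]
conn.executemany("INSERT INTO queries VALUES (?, ?, ?)", rows)

wide = top_clients(conn, "2023-03-02 02:14:40+01:00", "2023-03-02 03:14:43+01:00")
narrow = top_clients(conn, "2023-03-02 03:14:40+01:00", "2023-03-02 03:14:43+01:00")
print(wide)    # counts for the whole hour
print(narrow)  # counts for the last few seconds only
```

A perfectly ordinary one-query-per-second client looks like a heavy hitter in the hour-long window, while the short window shows only what was in flight right before the warning.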

Johnnii360 · March 4, 2023, 6:00pm

As you wish. Here you are.

192.168.178.32|rtsp-dcipc.tplinknbu.com|4|2023-03-02 03:14:41|2023-03-02 03:14:43
192.168.178.64|o427061.ingest.sentry.io|4|2023-03-02 03:14:40|2023-03-02 03:14:42
192.168.178.24|pool.ntp.org|3|2023-03-02 03:14:40|2023-03-02 03:14:43
192.168.178.24|time.nist.gov|3|2023-03-02 03:14:40|2023-03-02 03:14:43
192.168.178.32|euw1-relay-i-04f41c7ec834938bb.dcipc.i.tplinknbu.com|3|2023-03-02 03:14:41|2023-03-02 03:14:43
192.168.178.1|17-courier.push.apple.com|2|2023-03-02 03:14:41|2023-03-02 03:14:41
192.168.178.24|n-devs.tplinkcloud.com|2|2023-03-02 03:14:40|2023-03-02 03:14:41
fe80::464e:6dff:fe7f:2341|17-courier.push.apple.com|2|2023-03-02 03:14:41|2023-03-02 03:14:41
192.168.178.1|gateway.fe.apple-dns.net|1|2023-03-02 03:14:43|2023-03-02 03:14:43
192.168.178.21|beta-api.crunchyroll.com|1|2023-03-02 03:14:42|2023-03-02 03:14:42
192.168.178.32|n-devs-dcipc.tplinkcloud.com|1|2023-03-02 03:14:40|2023-03-02 03:14:40
fe80::464e:6dff:fe7f:2341|gateway.fe.apple-dns.net|1|2023-03-02 03:14:43|2023-03-02 03:14:43

Bucking_Horn · March 4, 2023, 11:42pm

Thank you.
These are also low query counts, so it is likely that a slow upstream resolver may indeed have contributed to that warning.

Let's extend the time frame from a few seconds to a few minutes before the failure:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT client, domain, count(domain), datetime(min(timestamp), 'unixepoch', 'localtime'), datetime(max(timestamp), 'unixepoch', 'localtime') FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:12:00+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by client, domain ORDER BY count(domain) DESC LIMIT 10;"

And let's also have a look at the reply types, to get an idea about upstream behaviour:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT substr('00'||reply_type,-2), datetime(min(timestamp), 'unixepoch', 'localtime'), datetime(max(timestamp), 'unixepoch', 'localtime'), count(reply_type) FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:12:00+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by reply_type ORDER BY 1;"

Johnnii360 · March 5, 2023, 7:02am

Here are the two results:

192.168.178.64|o427061.ingest.sentry.io|144|2023-03-02 03:12:20|2023-03-02 03:14:42
192.168.178.24|pool.ntp.org|136|2023-03-02 03:12:01|2023-03-02 03:14:43
192.168.178.24|time.nist.gov|136|2023-03-02 03:12:01|2023-03-02 03:14:43
192.168.178.32|rtsp-dcipc.tplinknbu.com|133|2023-03-02 03:12:00|2023-03-02 03:14:43
192.168.178.32|euw1-relay-i-04f41c7ec834938bb.dcipc.i.tplinknbu.com|110|2023-03-02 03:12:00|2023-03-02 03:14:43
192.168.178.64|checkonline.home-assistant.io|94|2023-03-02 03:12:03|2023-03-02 03:14:39
192.168.178.21|connectivitycheck.gstatic.com|63|2023-03-02 03:12:00|2023-03-02 03:14:39
192.168.178.26|dispatch.mcs2.miele.com|40|2023-03-02 03:12:22|2023-03-02 03:14:38
192.168.178.1|init.itunes.apple.com|26|2023-03-02 03:12:03|2023-03-02 03:13:56
192.168.178.21|beta-api.crunchyroll.com|26|2023-03-02 03:12:02|2023-03-02 03:14:42

00|2023-03-02 03:12:00|2023-03-02 03:14:43|1253
02|2023-03-02 03:12:02|2023-03-02 03:14:39|42
08|2023-03-02 03:14:42|2023-03-02 03:14:42|1

Hmm... 00... unknown... could the DNS.WATCH DNS be the culprit? Meanwhile I switched to Cloudflare and have had no issues since.

Bucking_Horn · March 5, 2023, 8:33am

Yes, and this should be easy to confirm via Pi-hole's Query Log or Long Term Data Query Log as well.

While it is not uncommon for upstreams to occasionally take longer for a response (~1% of DNS requests in February for myself), if the majority of answers reads OK (already forwarded), then your upstream has an issue, as Pi-hole is still waiting for upstream answers.
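For a sense of proportion, a small Python sketch that turns the reply-type counts posted above into shares. Type 0 is the one called "unknown" in this thread; the labels for 2 and 8 are an assumption based on my reading of the Pi-hole database documentation and should be checked there.

```python
# Reply-type counts as posted above (reply_type -> number of queries).
counts = {0: 1253, 2: 42, 8: 1}

# Label for 0 comes from this thread; 2 and 8 are assumed from the Pi-hole docs.
labels = {0: "unknown (no reply yet)", 2: "NXDOMAIN", 8: "REFUSED"}

total = sum(counts.values())
for code, n in sorted(counts.items()):
    print(f"{code:02d}  {labels[code]:<24} {n:>5}  {n / total:6.1%}")

unanswered_share = counts[0] / total
```

Roughly 97% of the queries in that window had no recorded reply, compared to the ~1% mentioned as normal.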

So as DL6ER already suggested, your observation was quite correct:
You were using a DNS resolver that got sluggish at times, but it wasn't Pi-hole - it was your upstream.

It would also fit your later observation that you subjectively observed slow resolutions without a concurrency warning appearing: Pi-hole's upstream failed or slowed down, but recovered before Pi-hole ran out of concurrent threads.
EDIT: If you'd observe that with different upstreams, that could also suggest a somewhat unreliable upstream connection, either locally or your ISP.

Johnnii360 · March 5, 2023, 10:35am

Thank you for all of your help and tips! These days it can't be taken for granted to get such good and extensive help in an internet community. I wish there were more communities on the internet as helpful as this one!
