Those domain counts are too low - that output doesn't cover the timeframe when the 'maximum concurrent' warning was logged.
That's quite likely my fault, of course - I forgot to have you adjust the SQL statement for your local time zone.
Assuming your Pi-hole host machine is configured for a timezone matching UTC +01:00 (e.g. WAT or CET), please run the following statement:
pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT client, count(domain) FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:14:40+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by client ORDER BY count(domain) DESC LIMIT 10;"
If your timezone differs from UTC+01:00, please replace both(!) +01:00 literals trailing the two datetimes in the above statement with your own UTC offset.
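To illustrate why the offset literal matters, here is a minimal sketch that converts the same wall-clock time to epoch seconds under two different offsets. It assumes GNU date (coreutils) is available on the host; the variable names are only for illustration.

```shell
# Assumption: GNU date (coreutils). Other date implementations may parse -d differently.
ts='2023-03-02 03:14:40'

# Epoch seconds for the same wall-clock time under two different UTC offsets
utc1=$(date -d "${ts}+01:00" +%s)
utc2=$(date -d "${ts}+02:00" +%s)

echo "as UTC+01:00: ${utc1}"
echo "as UTC+02:00: ${utc2}"
echo "difference:   $(( utc1 - utc2 )) seconds"
```

A wrong offset shifts the search window by a full hour per hour of difference, so a window of only a few seconds would miss the warning entirely.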
*** [ DIAGNOSING ]: Pi-hole diagnosis messages
   count   last timestamp       type          message
   ------  -------------------  ------------  ------------------------------------------------------------
   1       2023-03-02 03:14:42  DNSMASQ_WARN  Maximum number of concurrent DNS queries reached (max: 150)
Are you sure you applied the correct time modifier?
You may check your timezone via timedatectl.
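As an alternative to reading the timedatectl output, the following sketch prints the UTC offset already formatted like the literal used in the SQL statement. It assumes GNU date, whose `%:z` format prints the offset as +hh:mm; the offset is evaluated for the date of the warning rather than for today, in case daylight saving time has changed in between.

```shell
# Assumption: GNU date (coreutils); %:z prints the UTC offset as +hh:mm.
# Evaluate the offset for the local time at which the warning was logged.
offset=$(date -d '2023-03-02 03:14:40' +%:z)
echo "Use ${offset} in place of +01:00"
```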
That shouldn't matter. The SQL statement is searching the long-term database.
Well, this does not necessarily have to come from an excessive number of queries in a short time; it can also mean you are not receiving m/any replies. Suppose you send two queries per second but never receive answers from upstream. That would make 150 concurrent queries (as in actively waiting for a reply at the same time) in little more than one minute.
This somewhat matches the overall title of this topic. You seem to have either reliably picked exactly those servers that are unreliable, or there are other things in your network causing (partial?) network failures. While normal Internet traffic (using the TCP protocol) can easily be re-requested when packets get lost in transit, the DNS protocol (utilizing UDP) cannot do this, so you are affected much more strongly by intermittent Internet access.
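The arithmetic behind that estimate, as a rough sketch. The limit of 150 is taken from the warning message; the rate of two queries per second is the assumed example from above, and the calculation deliberately ignores queries that time out in the meantime.

```shell
max_concurrent=150   # limit reported in the warning message
rate=2               # assumed example: queries per second, none of them answered

# Seconds until the number of unanswered queries reaches the limit
secs=$(( max_concurrent / rate ))
echo "limit reached after ${secs} seconds"
```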
Ok, now let's see what those clients have requested.
Please run:
pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT client, domain, count(domain), datetime(min(timestamp), 'unixepoch', 'localtime'), datetime(max(timestamp), 'unixepoch', 'localtime') FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:14:40+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by client, domain ORDER BY count(domain) DESC LIMIT 20;"
Edit:
Okay, not really. I was watching a few YouTube Shorts and then it suddenly slowed down again. But I don't see any errors in the Pi-hole diagnosis section. Maybe I should test Cloudflare now.
For the concurrency warning to be triggered, you need a certain number of DNS requests in a short time frame to saturate Pi-hole's upstream connection pool, either because of substantially large numbers or because of very slow or non-responding upstream resolvers, or perhaps both.
Only the requests arriving shortly before the warning is triggered may provide clues as to which clients, domains or upstreams would contribute to the warning.
Please rerun the command with the time frame as provided.
Thank you.
These are also low query counts, so it is likely that a slow upstream resolver may indeed have contributed to that warning.
Let's extend the time frame from a few seconds to a few minutes before the failure:
pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT client, domain, count(domain), datetime(min(timestamp), 'unixepoch', 'localtime'), datetime(max(timestamp), 'unixepoch', 'localtime') FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:12:00+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by client, domain ORDER BY count(domain) DESC LIMIT 10;"
And let's also have a look at the reply types, to get an idea about upstream behaviour:
pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db \
"SELECT substr('00'||reply_type,-2), datetime(min(timestamp), 'unixepoch', 'localtime'), datetime(max(timestamp), 'unixepoch', 'localtime'), count(reply_type) FROM queries \
WHERE timestamp BETWEEN strftime('%s','2023-03-02 03:12:00+01:00') AND strftime('%s','2023-03-02 03:14:43+01:00') \
GROUP by reply_type ORDER BY 1;"
Yes, and this should be easy to confirm via Pi-hole's Query Log or Long Term Data Query Log as well.
While it is not uncommon for upstreams to occasionally take longer to respond (~1% of DNS requests in February for myself), if the majority of answers read OK (already forwarded), then your upstream has an issue, as Pi-hole is still waiting for upstream answers.
So as DL6ER already suggested, your observation was quite correct:
You were using a DNS resolver that got sluggish at times, but it wasn't Pi-hole - it was your upstream.
It would also fit your later, subjective observation of slow resolutions without a concurrency warning appearing: Pi-hole's upstream failed or slowed down, but recovered before Pi-hole ran out of concurrent threads.
EDIT: If you observed that with different upstreams, that could also suggest a somewhat unreliable upstream connection, either locally or at your ISP.
Thank you for all of your help and tips! It's not a given these days to get such good and extensive help from an internet community. I wish there were more communities as helpful as this one!