Random DNS query failures with Pi-hole

The issue I am facing: Some DNS queries fail at random. Retrying the same DNS query usually resolves it correctly, even when the same DNS server is used.

Details about my system:
https://tricorder.pi-hole.net/alcgjlb9wr

My setup is as follows:
Two outbound connections from two different operators. These are load balanced with failover by my router at 192.168.42.1, with WAN1 as the "active" connection and WAN2 as a secondary in case WAN1 is not responding.

The LAN has two DNS servers, both Raspberry Pis: ns1 is a Raspberry Pi 4 with Debian GNU/Linux 10 (buster), ns2 a Raspberry Pi 3B+ with Raspbian GNU/Linux 9 (stretch). Both run BIND, at 192.168.42.38 (ns1) and 192.168.42.34 (ns2), and both are configured to use the OpenDNS servers for resolution. Neither uses the other for any DNS queries. I have set up a local domain for the LAN, which BIND handles correctly; this is not an issue. BIND is configured to allow queries for the local domain from ANY, but query, recursion and cache only for 192.168.42.34/.38/.98/.99 as well as 127.0.0.1.

Both DNS servers also run Pi-hole, which is the main DNS for my LAN: the first at 192.168.42.99 (ns1), the second at 192.168.42.98 (ns2). Pi-hole is configured to use OpenDNS as the resolver for everything but the local domain, which each Pi-hole handles independently; that is, ns1 uses only the BIND at ns1 and ns2 only the BIND at ns2. DHCP is handled by my router, which hands out 192.168.42.99 and 192.168.42.98 as DNS servers. ns1 uses Apache, ns2 lighttpd (but this makes no difference for DNS queries).

No IPv6 is used, only IPv4, as the secondary operator does not support IPv6.

What I have changed since installing Pi-hole:
Well, a lot (as you would imagine)...
My configs for Pi-hole:
/etc/dnsmasq.d/01-pihole.conf

addn-hosts=/etc/pihole/local.list
addn-hosts=/etc/pihole/custom.list
localise-queries
no-resolv
cache-size=10000
log-queries
log-facility=/var/log/pihole.log
local-ttl=2
log-async
listen-address=192.168.42.98
server=208.67.220.220
server=208.67.222.222
domain-needed
expand-hosts
bogus-priv
bind-interfaces
rev-server=192.168.42.0/24,192.168.42.34
server=/heralan/192.168.42.34
server=/use-application-dns.net/

/etc/pihole/pihole-FTL.conf

PRIVACYLEVEL=0
#BLOCKINGMODE=NODATA
BLOCKINGMODE=IP-NODATA-AAAA
RATE_LIMIT=0/0

/etc/pihole/setupVars.conf

#PIHOLE_INTERFACE=
WEBPASSWORD=[ omitted ]
ADMIN_EMAIL=tommi@oire.fi
WEBUIBOXEDLAYOUT=boxed
WEBTHEME=default-dark
API_EXCLUDE_DOMAINS=
API_EXCLUDE_CLIENTS=
API_QUERY_LOG_SHOW=all
API_PRIVACY_MODE=false
IPV4_ADDRESS=192.168.42.98/24
IPV6_ADDRESS=::1
QUERY_LOGGING=true
INSTALL_WEB_SERVER=false
INSTALL_WEB_INTERFACE=true
LIGHTTPD_ENABLED=false
CACHE_SIZE=100000
BLOCKING_ENABLED=true
CNAME_DEEP_INSPECT=true
RESOLVE_IPV6=no
DNSMASQ_LISTENING=single
PIHOLE_DNS_1=208.67.220.220
PIHOLE_DNS_2=208.67.222.222
DNS_FQDN_REQUIRED=true
DNS_BOGUS_PRIV=true
DNSSEC=false
REV_SERVER=true
REV_SERVER_CIDR=192.168.42.0/24
REV_SERVER_TARGET=192.168.42.34
REV_SERVER_DOMAIN=heralan

/etc/resolv.conf

# Generated by resolvconf
search heralan
nameserver 192.168.42.98

I did get an error "Gateway did not respond." with pihole -d. The gateway, however, does respond to pings when trying manually:

PING 192.168.42.1 (192.168.42.1) 56(84) bytes of data.
64 bytes from 192.168.42.1: icmp_seq=1 ttl=64 time=0.580 ms
64 bytes from 192.168.42.1: icmp_seq=2 ttl=64 time=0.451 ms
64 bytes from 192.168.42.1: icmp_seq=3 ttl=64 time=0.509 ms
64 bytes from 192.168.42.1: icmp_seq=4 ttl=64 time=0.474 ms
64 bytes from 192.168.42.1: icmp_seq=5 ttl=64 time=0.413 ms
^C
--- 192.168.42.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4138ms
rtt min/avg/max/mdev = 0.413/0.485/0.580/0.060 ms

Also, DNS query succeeds (if no GW was reached, I would imagine the DNS query would fail).

From here on, I am using ns2 as an example while the issue is more or less the same with ns1. All operations, configurations, etc. are from ns2. ns2 does not use ns1 for anything but the BIND zone transfers from ns1 (again, not an issue).
I tried debugging the issue with dnsperf and came up with:


[Status] Command line: dnsperf -e -Q 100 -l 5 -d /root/dnsperf-test.txt -s 192.168.42.98
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:06:27 2021
[Status] Stopping after 5.000000 seconds
[Timeout] Query timed out: msg id 3
[Timeout] Query timed out: msg id 5
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         500
  Queries completed:    498 (99.60%)
  Queries lost:         2 (0.40%)

  Response codes:       NOERROR 498 (100.00%)
  Average packet size:  request 41, response 80
  Run time (s):         5.000187
  Queries per second:   99.596275

  Average Latency (s):  0.000839 (min 0.000164, max 0.049515)
  Latency StdDev (s):   0.002248
[Status] Command line: dnsperf -e -Q 100 -l 5 -d /root/dnsperf-test.txt -s 192.168.42.98
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:06:40 2021
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         500
  Queries completed:    500 (100.00%)
  Queries lost:         0 (0.00%)

  Response codes:       NOERROR 500 (100.00%)
  Average packet size:  request 41, response 81
  Run time (s):         5.000193
  Queries per second:   99.996140

  Average Latency (s):  0.000729 (min 0.000168, max 0.004930)
  Latency StdDev (s):   0.000465

The /root/dnsperf-test.txt for dnsperf is:

www.google.com A
www.yle.fi A

As we can see, I might lose a packet (or query) or two from time to time with this setup, but this is acceptable, at least at 100 qps. Usually the load from my LAN clients is much lower.

Here's where it gets interesting and what my actual issue is:

Running a different set of A queries with the exact same setup, I get (again, all from ns2):

[Status] Command line: dnsperf -Q 250 -l 5 -t 10 -d /root/dnsperf-simple.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:40:42 2021
[Status] Stopping after 5.000000 seconds

Statistics:

  Queries sent:         254
  Queries completed:    154 (60.63%)
  Queries lost:         100 (39.37%)

  Response codes:       NOERROR 154 (100.00%)
  Average packet size:  request 33, response 93
  Run time (s):         5.000149
  Queries per second:   30.799082

  Average Latency (s):  0.002605 (min 0.000216, max 0.082471)
  Latency StdDev (s):   0.010682

No matter how many times I repeat the test the results are not any better.
The /root/dnsperf-simple.txt for dnsperf is

0.pool.ntp.org A
1.pool.ntp.org A
api.dropbox.com A
calendar.google.com A
docs.google.com A
play.google.com A
ns1.heralan A
ipinfo.io A
netperf-eu.bufferbloat.net A
outlook.office365.com A
pihole0.heralan A
time.akamai.com A
www.amazon.com A
www.bing.com A
www.bloomberg.com A
www.dropbox.com A
www.ebay.com A
www.eff.org A
www.facebook.com A
www.google.com A
www.grafana.com A
www.helsinki.fi A
www.microsoft.com A
www.netflix.com A
www.opendns.com A
www.openwrt.org A
www.pushbullet.com A
www.raspberrypi.org A
www.reddit.com A
www.tp-link.com A
www.twitter.com A
www.wikipedia.org A
www.yahoo.com A
www.youtube.com A

Every entry resolves correctly if tried manually:

while read line; do dig $line +short; echo "**"; sleep 0.2; done < /root/dnsperf-simple.txt
193.182.111.13
176.119.210.243
122.117.253.246
89.221.214.130
**
95.216.175.117
95.216.154.135
95.216.24.230
162.159.200.123
**
api-env.dropbox-dns.com.
162.125.70.19
**
216.58.207.206
**
142.250.74.142
**
172.217.20.46
**
192.168.42.38
**
34.117.59.81
**
flent-eu.bufferbloat.net.
demo.tohojo.dk.
193.10.227.30
**
outlook.ha.office365.com.
40.101.12.82
40.101.80.2
40.101.80.18
52.97.200.130
52.97.200.146
52.97.250.210
52.97.250.226
40.101.12.18
**
192.168.42.99
**
time.akamai.com.edgekey.net.
e1534.dscb.akamaiedge.net.
2.20.2.248
**
tp.47cf2c8c9-frontier.amazon.com.
d3ag4hukkh62yn.cloudfront.net.
13.32.123.226
**
a-0001.a-afdentry.net.trafficmanager.net.
www-bing-com.dual-a-0001.a-msedge.net.
dual-a-0001.a-msedge.net.
13.107.21.200
204.79.197.200
**
www.bloomberg.com.shared.bloomberga.com.
bloomberg.map.fastly.net.
151.101.129.73
151.101.193.73
151.101.1.73
151.101.65.73
**
www-env.dropbox-dns.com.
162.125.70.18
**
slot9428.ebay.com.edgekey.net.
e9428.a.akamaiedge.net.
96.16.165.22
**
eff.map.fastly.net.
151.101.128.201
151.101.192.201
151.101.0.201
151.101.64.201
**
star-mini.c10r.facebook.com.
157.240.205.35
**
216.58.207.196
**
grafana.com.
34.120.177.193
**
adc-vip3.it.helsinki.fi.
128.214.189.90
**
www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
e13678.dscb.akamaiedge.net.
2.22.42.141
**
www.dradis.netflix.com.
www.eu-west-1.internal.dradis.netflix.com.
dualstack.apiproxy-website-nlb-prod-3-ac110f6ae472b85a.elb.eu-west-1.amazonaws.com.
3.251.50.149
54.74.73.31
54.155.178.5
**
146.112.62.105
**
wiki-01.infra.openwrt.org.
139.59.209.225
**
104.26.5.125
104.26.4.125
172.67.68.238
**
104.22.1.43
104.22.0.43
172.67.36.98
**
reddit.map.fastly.net.
151.101.1.140
151.101.65.140
151.101.129.140
151.101.193.140
**
13.32.123.66
13.32.123.43
13.32.123.97
13.32.123.76
**
twitter.com.
104.244.42.193
104.244.42.1
**
dyna.wikimedia.org.
91.198.174.192
**
new-fp-shed.wg1.b.yahoo.com.
87.248.100.216
87.248.100.215
**
youtube-ui.l.google.com.
216.58.207.206
172.217.21.174
172.217.21.142
172.217.20.46
142.250.74.142
142.250.74.110
142.250.74.78
142.250.74.46
142.250.74.14
216.58.211.14
216.58.207.238

How do I get Pi-hole to respond to queries correctly instead of "losing" a third to a half of them? With every other DNS query lost, the LAN becomes unbearably slow to use...

edit(s): Tried to make the long text more readable. Also, if some details are missing, I'd be happy to add them...

Bulk load tests at a rate of -Q 250 requests per second would trigger Pi-hole's default rate limiting, but your debug log tells me you already configured Pi-hole to disable rate limiting:

*** [ DIAGNOSING ]: contents of /etc/pihole

-rw-rw-r-- 1 pihole root 104 Jul 15 11:37 /etc/pihole/pihole-FTL.conf
   PRIVACYLEVEL=0
   BLOCKINGMODE=IP-NODATA-AAAA
   RATE_LIMIT=0/0

While bulk results and overall stats are helpful for a general assessment, they won't be of much use in tracking down potential issues.

I see you've already been running dnsperf with -v.
Did the verbose output produce any timed out or interrupted queries? If so, for what domain?

Also, you should try to correlate any anomalies in Pi-hole's query logs at /var/log/pihole.log* with the time frame of your bulk tests (e.g. Thu Jul 15 12:40:42 2021 plus 5 (or 10) seconds).
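One way to do that correlation is to list every name that was queried inside the test window but never had an answer logged. This is only a sketch: it assumes the usual dnsmasq-style line shapes (`... dnsmasq[PID]: query[A] name from client` for a query and `... reply|cached|<hosts file> name is value` for an answer), and the function name `unanswered_domains` is made up for illustration. Blocked names are logged differently and may show up as false positives.

```shell
# List names that were queried inside a time window but never answered.
# $1: log file (e.g. /var/log/pihole.log)
# $2: timestamp prefix selecting the window (e.g. "Jul 15 12:40:4")
unanswered_domains() {
    awk -v ts="$2" '
        index($0, ts) != 1 { next }             # skip lines outside the window
        $5 ~ /^query\[/    { asked[$6] = 1 }    # "query[A] name from client"
        $7 == "is"         { answered[$6] = 1 } # "reply name is value" and similar
        END {
            for (name in asked)
                if (!(name in answered)) print name
        }
    ' "$1" | sort
}
```

For example, `unanswered_domains /var/log/pihole.log "Jul 15 12:40:4"` would cover the ten seconds starting at 12:40:40. A name appearing here was received by Pi-hole but no answer for it was logged in that window.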

Your debug log does show some inconsistencies.

Your Pi-hole doesn't seem to be properly aware of its network configuration.

*** [ DIAGNOSING ]: Networking
[✗] No IPv4 address(es) found on the  interface.

[✗] No IPv6 address(es) found on the  interface.

[i] Default IPv4 gateway: 192.168.42.1
   * Pinging 192.168.42.1...
[✗] Gateway did not respond.

Nevertheless, your Pi-hole seems operational:

*** [ DIAGNOSING ]: Name resolution (IPv4) using a random blocked domain and a known ad-serving domain
[✓] www.mobicover.com.ua is 162.55.8.125 via localhost (127.0.0.1)
[✓] www.mobicover.com.ua is 162.55.8.125 via Pi-hole (192.168.42.98)
[✓] doubleclick.com is 142.250.74.110 via a remote, public DNS server (8.8.8.8)

BIND's named claims multiple ports, including port 53:

*** [ DIAGNOSING ]: Ports in use
[80] is in use by lighttpd
[80] is in use by lighttpd
192.168.42.34:8053 named (IPv4)
127.0.0.1:8053 named (IPv4)
[53] is in use by named (https://docs.pi-hole.net/main/prerequisites/#ports)
[53] is in use by named (https://docs.pi-hole.net/main/prerequisites/#ports)
127.0.0.1:953 named (IPv4)
[53] is in use by pihole-FTL
[4711] is in use by pihole-FTL

That may be intended and attributable to your multiple IP address configuration - I can't tell that from just that output. I do note, however, that you've tried to take precautions against possible port conflicts by applying the bind-interfaces option.
Still, you should consider verifying that the three different ports named is listening on won't conflict with Pi-hole.

Also note that you applied your custom dnsmasq options to Pi-hole's own 01-pihole.conf: That file will be overwritten on Pi-hole upgrades or certain UI interactions or Repair/Reconfigure runs of Pi-hole's CLI.
I'd recommend putting your custom options into a separate file instead - at least the ones that won't conflict with Pi-hole's configuration.
Your listen-address=192.168.42.98 will always conflict with Pi-hole's own interface option (which may well be the reason for the debug log showing connectivity issues, since that option is absent from your configuration).
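As a rough illustration of keeping custom options in their own file: the sketch below writes a separate file next to 01-pihole.conf. The file name `99-custom.conf` and the function name are hypothetical, and `bind-interfaces` is used purely as an example of a custom option from this thread. Which of your options are actually safe to move is something to verify against your own setup.

```shell
# Write custom dnsmasq options to a file of their own.
# $1: target directory (defaults to /etc/dnsmasq.d, which needs root)
write_custom_dnsmasq_conf() {
    conf_dir=${1:-/etc/dnsmasq.d}
    # 99-custom.conf is a made-up name; pick any name that Pi-hole
    # does not generate itself.
    cat > "$conf_dir/99-custom.conf" <<'EOF'
# Local additions kept out of Pi-hole's own 01-pihole.conf
bind-interfaces
EOF
}
```

After writing the file, the resolver has to be restarted (for example with `pihole restartdns`) before the options take effect.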

And finally, your host machine seems to be low on disk space:

*** [ DIAGNOSING ]: contents of /var/log/lighttpd

-rwxrwx--- 1 www-data www-data 6362 Jul 13 12:38 /var/log/lighttpd/error.log
   -----head of error.log------
   2021-07-11 20:08:59: (mod_fastcgi.c.2543) FastCGI-stderr: 
           PHP Warning:  Unknown: write failed: No space left on device (28) in Unknown on line 0

Inability to write log files usually won't stop Pi-hole from continuing to provide DNS resolution, though.

So apart from potential DNS loops (as a possible side effect of (yet unconfirmed) conflicting port allocations), none of the above would provide any immediate lead as to why a large amount of DNS queries is reported as lost by dnsperf.
(What does "Queries lost" mean anyway, i.e. what exactly makes dnsperf report a query as lost? Does it distinguish between a query not reaching its target and a response never received?)

As suggested, Pi-hole's query log may provide additional insights about which DNS requests actually were received and how those queries were processed.

If that's the case, please share a sample set of such log entries.

Thanks for the input!

I disabled (stopped) BIND and recorded all my local domain names in Pi-hole's custom.list. I also changed the Pi-hole configuration to reflect that BIND is no longer running:
/etc/dnsmasq.d/01-pihole.conf

interface=eth0
#bind-interfaces

to match the interface on the Raspberry

# ip addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether b8:27:eb:d0:c1:5e brd ff:ff:ff:ff:ff:ff
    inet 192.168.42.34/24 brd 192.168.42.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet 192.168.42.98/24 scope global secondary eth0
       valid_lft forever preferred_lft forever
3: wlan0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether b8:27:eb:85:94:0b brd ff:ff:ff:ff:ff:ff

After a restart the Pi-hole is listening:

# netstat -tulpan|grep ":53 "
tcp        0      0 0.0.0.0:53              0.0.0.0:*               LISTEN      24482/pihole-FTL
tcp6       0      0 :::53                   :::*                    LISTEN      24482/pihole-FTL
udp        0      0 0.0.0.0:53              0.0.0.0:*                           24482/pihole-FTL
udp6       0      0 :::53                   :::*                                24482/pihole-FTL

(Can I set the PTRs for my local domain names somehow? Missing the PTRs is not a deal breaker but would be nice to have...)

Running dnsperf (with -v) again with BIND now out of the scope:

[Status] Command line: dnsperf -Q 250 -l 5 -t 10 -n 1 -d /root/dnsperf-simple.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Fri Jul 16 01:12:40 2021
[Status] Stopping after 5.000000 seconds or 1 run through file
> NOERROR ns1.heralan A 0.001438
> NOERROR pihole0.heralan A 0.000502
> NOERROR 0.pool.ntp.org A 0.063362
> NOERROR 1.pool.ntp.org A 0.093172
> NOERROR api.dropbox.com A 0.092555
> NOERROR www.google.com A 0.039335
> NOERROR www.facebook.com A 0.081420
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
[Status] Testing complete (end of file)

Statistics:

  Queries sent:         34
  Queries completed:    7 (20.59%)
  Queries lost:         27 (79.41%)

  Response codes:       NOERROR 7 (100.00%)
  Average packet size:  request 33, response 70
  Run time (s):         0.153532
  Queries per second:   45.593101

  Average Latency (s):  0.053112 (min 0.000502, max 0.093172)
  Latency StdDev (s):   0.040197

Yet a manual query (dig) for the missed names results in success, e.g.

; <<>> DiG 9.10.3-P4-Debian <<>> +search calendar.google.com A
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21563
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;calendar.google.com.           IN      A

;; ANSWER SECTION:
calendar.google.com.    100     IN      A       142.250.74.110

;; Query time: 32 msec
;; SERVER: 192.168.42.98#53(192.168.42.98)
;; WHEN: Fri Jul 16 01:14:09 EEST 2021
;; MSG SIZE  rcvd: 64

With the pihole.log these are seen:
dnsperf:
Jul 16 01:12:40 dnsmasq[24482]: query[A] calendar.google.com from 192.168.42.34
Jul 16 01:12:40 dnsmasq[24482]: forwarded calendar.google.com to 208.67.222.222
Jul 16 01:12:40 dnsmasq[24482]: forwarded calendar.google.com to 208.67.220.220

dig:
Jul 16 01:14:09 dnsmasq[24482]: query[A] calendar.google.com from 192.168.42.34
Jul 16 01:14:09 dnsmasq[24482]: forwarded calendar.google.com to 208.67.222.222
Jul 16 01:14:09 dnsmasq[24482]: forwarded calendar.google.com to 208.67.220.220
Jul 16 01:14:09 dnsmasq[24482]: reply calendar.google.com is 142.250.74.110

So, a reply is missing for the dnsperf. If the reason could be found it would probably solve my issue...

Just for fun, I ran dnsperf again a couple of times later, and the results differed from the previous run:

[Status] Command line: dnsperf -Q 300 -l 5 -n 3 -d /root/dnsperf-simple.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Fri Jul 16 01:57:36 2021
[Status] Stopping after 5.000000 seconds or 3 runs through file
> NOERROR ns1.heralan A 0.000369
> NOERROR pihole0.heralan A 0.000270
> NOERROR ns1.heralan A 0.000336
> NOERROR pihole0.heralan A 0.000277
> NOERROR ns1.heralan A 0.000350
> NOERROR pihole0.heralan A 0.000284
> T 0.pool.ntp.org A
> T 1.pool.ntp.org A
> T api.dropbox.com A
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.facebook.com A
> T www.google.com A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
> T 0.pool.ntp.org A
> T 1.pool.ntp.org A
> T api.dropbox.com A
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.facebook.com A
> T www.google.com A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
> T 0.pool.ntp.org A
> T 1.pool.ntp.org A
> T api.dropbox.com A
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.facebook.com A
> T www.google.com A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
[Status] Testing complete (end of file)

Statistics:

  Queries sent:         102
  Queries completed:    6 (5.88%)
  Queries lost:         96 (94.12%)

  Response codes:       NOERROR 6 (100.00%)
  Average packet size:  request 33, response 47
  Run time (s):         0.340131
  Queries per second:   17.640262

  Average Latency (s):  0.000314 (min 0.000270, max 0.000369)
  Latency StdDev (s):   0.000042
[Status] Command line: dnsperf -Q 300 -l 5 -n 3 -d /root/dnsperf-simple.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Fri Jul 16 01:58:06 2021
[Status] Stopping after 5.000000 seconds or 3 runs through file
> NOERROR ns1.heralan A 0.000347
> NOERROR pihole0.heralan A 0.000255
> NOERROR 0.pool.ntp.org A 0.059319
> NOERROR 1.pool.ntp.org A 0.104837
> NOERROR 0.pool.ntp.org A 0.000518
> NOERROR 1.pool.ntp.org A 0.000428
> NOERROR ns1.heralan A 0.000337
> NOERROR pihole0.heralan A 0.000269
> NOERROR 0.pool.ntp.org A 0.000443
> NOERROR 1.pool.ntp.org A 0.000436
> NOERROR ns1.heralan A 0.000338
> NOERROR pihole0.heralan A 0.000270
> T api.dropbox.com A
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.facebook.com A
> T www.google.com A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
> T api.dropbox.com A
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.facebook.com A
> T www.google.com A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
> T api.dropbox.com A
> T calendar.google.com A
> T docs.google.com A
> T play.google.com A
> T ipinfo.io A
> T netperf-eu.bufferbloat.net A
> T outlook.office365.com A
> T time.akamai.com A
> T www.amazon.com A
> T www.bing.com A
> T www.bloomberg.com A
> T www.dropbox.com A
> T www.ebay.com A
> T www.eff.org A
> T www.facebook.com A
> T www.google.com A
> T www.grafana.com A
> T www.helsinki.fi A
> T www.microsoft.com A
> T www.netflix.com A
> T www.opendns.com A
> T www.openwrt.org A
> T www.pushbullet.com A
> T www.raspberrypi.org A
> T www.reddit.com A
> T www.tp-link.com A
> T www.twitter.com A
> T www.wikipedia.org A
> T www.yahoo.com A
> T www.youtube.com A
[Status] Testing complete (end of file)

Statistics:

  Queries sent:         102
  Queries completed:    12 (11.76%)
  Queries lost:         90 (88.24%)

  Response codes:       NOERROR 12 (100.00%)
  Average packet size:  request 33, response 71
  Run time (s):         0.340124
  Queries per second:   35.281250

  Average Latency (s):  0.013983 (min 0.000255, max 0.104837)
  Latency StdDev (s):   0.033255

Again, manually both these queries succeed:

; <<>> DiG 9.10.3-P4-Debian <<>> +search api.dropbox.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42776
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;api.dropbox.com.               IN      A

;; ANSWER SECTION:
api.dropbox.com.        29      IN      CNAME   api-env.dropbox-dns.com.
api-env.dropbox-dns.com. 60     IN      A       162.125.70.19

;; Query time: 105 msec
;; SERVER: 192.168.42.98#53(192.168.42.98)
;; WHEN: Fri Jul 16 02:01:02 EEST 2021
;; MSG SIZE  rcvd: 94
; <<>> DiG 9.10.3-P4-Debian <<>> +search 0.pool.ntp.org
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 48778
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;0.pool.ntp.org.                        IN      A

;; ANSWER SECTION:
0.pool.ntp.org.         150     IN      A       37.228.129.2
0.pool.ntp.org.         150     IN      A       95.216.138.141
0.pool.ntp.org.         150     IN      A       95.216.142.52
0.pool.ntp.org.         150     IN      A       95.216.149.161

;; Query time: 120 msec
;; SERVER: 192.168.42.98#53(192.168.42.98)
;; WHEN: Fri Jul 16 02:00:43 EEST 2021
;; MSG SIZE  rcvd: 107

About the lighttpd error on disk space: the Raspberry shows enough free disk space, and log entries for lighttpd are being recorded, e.g. when entering the "settings" page:

1626388785|pihole1.heralan|GET /admin/settings.php HTTP/1.1|200|115344
1626388785|pihole1.heralan|GET /admin/api.php?summaryRaw&topItems HTTP/1.1|200|1102
1626388785|pihole1.heralan|GET /admin/style/vendor/SourceSansPro/SourceSansPro.css?v=1615852941 HTTP/1.1|304|0
1626388785|pihole1.heralan|GET /admin/style/vendor/bootstrap/css/bootstrap.min.css?v=1615852941 HTTP/1.1|304|0
1626388785|pihole1.heralan|GET /admin/style/vendor/datatables.min.css?v=1615852941 HTTP/1.1|304|0

[etc.]

# df -h
Filesystem              Size  Used Avail Use% Mounted on
/dev/root               6.8G  5.2G  1.3G  81% /
/dev/mmcblk0p6           66M   22M   45M  32% /boot

[etc.]

Anything more I could do for the Pi-hole/DNS? I might need to remove the router (load balancer) from the setup to verify it is not to blame...

edit: I ran the debug for pihole again. The networking section was unchanged by the configuration changes I made.

This indicates that Pi-hole has correctly forwarded the DNS request to its upstreams, but hasn't received an answer (yet).

Pi-hole can do nothing about an upstream server that doesn't reply or doesn't reply in time.
At 250 DNS requests per second, you may be observing the upstream's rate limit.
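Whether an upstream answers at all can be checked by asking it directly, bypassing Pi-hole. The following is only a sketch: it assumes `dig` is installed and that a single attempt with a 2-second timeout is a fair probe. `probe_upstreams` and the `DIG` override variable are illustrative names, not part of any tool. It sends queries one at a time, so it will not reproduce a rate limit by itself.

```shell
# Ask each upstream directly for each name and report ok/FAIL.
# $1: file with one "name [type]" entry per line (dnsperf input works)
# remaining arguments: upstream server IPs
probe_upstreams() {
    names=$1; shift
    while read -r name rest; do
        [ -n "$name" ] || continue      # skip blank lines
        for server in "$@"; do
            # One try, 2 s timeout; an empty answer counts as a failure too.
            if out=$("${DIG:-dig}" +short +tries=1 +time=2 "@$server" "$name" A) \
                && [ -n "$out" ]; then
                echo "ok   $server $name"
            else
                echo "FAIL $server $name"
            fi
        done
    done < "$names"
}
```

With the files from this thread, a call could look like `probe_upstreams /root/dnsperf-simple.txt 208.67.222.222 208.67.220.220`. If names fail here as well, the loss happens somewhere between the host and the upstream rather than inside Pi-hole.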

So you are suggesting the upstream is an issue with this? I tend to agree with this (without any evidence for it - yet). The query rate is [most likely] not an issue, I've run close to 1,500qps on the upstream in the past without issues.

I'm at a loss with the networking of the Pi-hole still...

Networking
[✗] No IPv4 address(es) found on the  interface.

[✗] No IPv6 address(es) found on the  interface.

What could be the reason for this? The ethernet (eth0) is configured via DHCP (from the router).

Once more, running a dnsperf test, this time a simpler one, gives me yet more unpredictable results:
/root/dnsperf-test.txt

www.eff.org A
www.google.com A
google.com MX
www.yahoo.fr A
[Status] Command line: dnsperf -Q 2 -l 10 -d /root/dnsperf-test.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Fri Jul 16 10:22:33 2021
[Status] Stopping after 10.000000 seconds
> NOERROR www.google.com A 0.000357
> NOERROR www.yahoo.fr A 0.041440
> NOERROR www.google.com A 0.000656
> NOERROR www.yahoo.fr A 0.001565
> NOERROR www.google.com A 0.000506
> T www.eff.org A
> NOERROR www.yahoo.fr A 0.001506
> T google.com MX
> NOERROR www.google.com A 0.000558
> T www.eff.org A
> NOERROR www.yahoo.fr A 0.001490
> T google.com MX
> NOERROR www.google.com A 0.000559
> T www.eff.org A
> NOERROR www.yahoo.fr A 0.001465
> T google.com MX
> T www.eff.org A
> T google.com MX
> T www.eff.org A
> T google.com MX
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         20
  Queries completed:    10 (50.00%)
  Queries lost:         10 (50.00%)

  Response codes:       NOERROR 10 (100.00%)
  Average packet size:  request 29, response 77
  Run time (s):         10.000234
  Queries per second:   0.999977

  Average Latency (s):  0.005010 (min 0.000357, max 0.041440)
  Latency StdDev (s):   0.012810

The same test, some time later

[Status] Command line: dnsperf -Q 2 -l 10 -d /root/dnsperf-test.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Fri Jul 16 10:26:59 2021
[Status] Stopping after 10.000000 seconds
> NOERROR www.eff.org A 0.039919
> NOERROR www.google.com A 0.124464
> NOERROR google.com MX 0.138557
> NOERROR www.eff.org A 0.001381
> NOERROR www.google.com A 0.000608
> NOERROR google.com MX 0.033536
> NOERROR www.eff.org A 0.001750
> NOERROR www.google.com A 0.000587
> NOERROR google.com MX 0.138582
> NOERROR www.eff.org A 0.001735
> T www.yahoo.fr A
> NOERROR www.google.com A 0.000663
> NOERROR google.com MX 0.100718
> NOERROR www.eff.org A 0.001753
> T www.yahoo.fr A
> NOERROR www.google.com A 0.000720
> T www.yahoo.fr A
> T www.yahoo.fr A
> T google.com MX
> T www.yahoo.fr A
[Status] Testing complete (time limit)

Statistics:

  Queries sent:         20
  Queries completed:    14 (70.00%)
  Queries lost:         6 (30.00%)

  Response codes:       NOERROR 14 (100.00%)
  Average packet size:  request 29, response 100
  Run time (s):         10.000235
  Queries per second:   1.399967

  Average Latency (s):  0.041783 (min 0.000587, max 0.138582)
  Latency StdDev (s):   0.057061

So at least the query rate is not the issue in this test.
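To compare runs like the two above, the verbose output can be tallied per query. A minimal sketch, assuming the `> T name type` and `> NOERROR name type latency` line shapes shown in the dnsperf output above; `tally_dnsperf` is a made-up helper name:

```shell
# Count completed and timed-out queries per "name type" pair.
# $1: file holding saved `dnsperf -v` output
tally_dnsperf() {
    awk '
        $1 == ">" && $2 == "T" { lost[$3 " " $4]++; seen[$3 " " $4] = 1; next }
        $1 == ">"              { ok[$3 " " $4]++;   seen[$3 " " $4] = 1 }
        END {
            for (q in seen)
                printf "%s ok=%d lost=%d\n", q, ok[q], lost[q]
        }
    ' "$1" | sort
}
```

Saving each run to a file and running `tally_dnsperf` on it makes it easier to see whether the same names time out every time or whether the failing set moves around between runs, as it does above.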

Yes, and that would be supported by Pi-hole's log entries indicating that an upstream's reply is still pending.

I can only speculate here as well.
From the information you've provided so far, I would conclude that Pi-hole is fully operational, and your observation would not constitute a failure.

For a start, bulk load tests are designed to find a system's breaking point under load, not to find functional errors. Since manual lookups for failing domains always work for you, I'm leaning towards suspecting a feature (or maybe a flaw) in the bulk test design to provoke a behaviour that you normally wouldn't encounter in the wild.

Looking at your observations from that angle, I notice that Pi-hole would probably, on occasion, multiply your intended query rate by the number of upstream servers it's forwarding queries to - by default, Pi-hole will do so from time to time to find and prefer the fastest responding upstream.
Combined with a somewhat too small set of domains, this may in turn trigger other security precautions of upstream DNS servers, e.g. filtering out excessive requests for the same domain from the same IP - especially when all upstreams would belong to the same DNS provider.

You could try to eliminate this from your tests by enlarging your test domain set (if necessary) and by configuring just one upstream for the purpose of bulk loading to see if my speculation is any good.
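Enlarging the test set mostly means producing more lines in the `name type` format that the dnsperf input files in this thread use. A small sketch, assuming a plain list with one domain per line as input; `make_dnsperf_input` and the file names in the usage note are hypothetical:

```shell
# Turn a plain list of domains (stdin) into dnsperf input lines (stdout).
# Blank lines and lines starting with # are skipped; duplicates are dropped.
make_dnsperf_input() {
    awk 'NF && $1 !~ /^#/ && !seen[$1]++ { print $1, "A" }'
}
```

Usage could look like `make_dnsperf_input < domains.txt > /root/dnsperf-large.txt`, followed by a dnsperf run against that file with only one upstream configured in Pi-hole.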

Must be my glasses, maybe, but I can't find your new debug log.
Would you share that again, please?

It's not you, it's me: I didn't upload any new log(s). I can do that later if the issue persists...

Again, thank you for your input: valid points, heed taken! I am in the process of re-designing my LAN anyway, so it is a good time to apply what I have learned on this subject as well. One thing I will definitely at least try is to rid the LAN of the current load balancer (which is not "industrial grade" and thus maybe underperforming). I already tried removing the secondary outbound connection from the LB, and the results suggest the LB is to blame here. Changing the configuration and/or replacing the router with a completely different one will provide more evidence, but this is still on my TODO list...

I think I found the culprit... I had set the logs for most of my services to be written to external disk(s) to reduce wear and tear on the Raspberry Pi's SD card. The Pi-hole had its logs linked to the external drive as well. Turns out, not only was my SD card in the Raspberry failing, the external drive had issues with data consistency as well (probably due to the SD card failing). Once fixed, all my LAN services seem happy enough with the DNS Pi-hole provides. I still get errors with the dnsperf tests, but as you pointed out, this is another topic and [most likely] my Pi-hole will run smoothly even if stress testing the DNS does not give a 100% success rate.

This indicates that Pi-hole has correctly forwarded the DNS request to its upstreams, but hasn't received an answer (yet).

I cannot stress enough the importance of understanding given advice: my upstream was indeed to blame. :joy: Once fixed (and the Pi-hole server repaired/fixed) every allowed query succeeds. My setup works now as expected. Delays are gone, LAN services resumed, etc.

The problem actually was with the LB and the way I had the FW rules for the LB set up. I can confirm there is absolutely nothing wrong with the Pi-hole (although as pointed out, my settings should be revised, still).

Thanks for all your help, it really pointed me in the right direction and ultimately fixed my [bad configuration] setup.
