The issue I am facing: Some DNS queries fail at random. Retrying the same query usually succeeds, even when the same DNS server is used.
Details about my system:
https://tricorder.pi-hole.net/alcgjlb9wr
My setup is as follows:
Two outbound connections from two different operators. They are load-balanced with failover by my router at 192.168.42.1: WAN1 is the "active" link and WAN2 is the secondary, used if WAN1 is not responding.
The LAN has two DNS servers, both Raspberry Pis: ns1 is a Raspberry Pi 4 with Debian GNU/Linux 10 (buster), ns2 a Raspberry Pi 3B+ with Raspbian GNU/Linux 9 (stretch). Both run BIND, at 192.168.42.38 (ns1) and 192.168.42.34 (ns2). Both BIND servers are configured to use the OpenDNS servers for resolution. Neither uses the other for any DNS queries. I have set up a local domain (heralan), which BIND handles correctly, so this is not an issue. BIND is configured to allow queries for the local domain from ANY, but query, recursion and cache are allowed only for 192.168.42.34/.38/.98/.99 and 127.0.0.1.
Both DNS servers also run Pi-hole, which is the main DNS for my LAN: the first at 192.168.42.99 (ns1), the second at 192.168.42.98 (ns2). Pi-hole is configured to use OpenDNS as the resolver for everything except the local domain, which each Pi-hole handles independently: the one on ns1 uses only the BIND on ns1, and the one on ns2 only the BIND on ns2. DHCP is handled by my router, which hands out 192.168.42.99 and 192.168.42.98 as DNS servers. ns1 uses Apache and ns2 lighttpd as the web server (but this makes no difference for DNS queries).
No IPv6 is used, only IPv4, as the secondary operator does not support IPv6.
What I have changed since installing Pi-hole:
Well, a lot (as you would imagine)...
My configs for Pi-hole:
/etc/dnsmasq.d/01-pihole.conf
addn-hosts=/etc/pihole/local.list
addn-hosts=/etc/pihole/custom.list
localise-queries
no-resolv
cache-size=10000
log-queries
log-facility=/var/log/pihole.log
local-ttl=2
log-async
listen-address=192.168.42.98
server=208.67.220.220
server=208.67.222.222
domain-needed
expand-hosts
bogus-priv
bind-interfaces
rev-server=192.168.42.0/24,192.168.42.34
server=/heralan/192.168.42.34
server=/use-application-dns.net/
/etc/pihole/pihole-FTL.conf
PRIVACYLEVEL=0
#BLOCKINGMODE=NODATA
BLOCKINGMODE=IP-NODATA-AAAA
RATE_LIMIT=0/0
/etc/pihole/setupVars.conf
#PIHOLE_INTERFACE=
WEBPASSWORD=[ omitted ]
ADMIN_EMAIL=tommi@oire.fi
WEBUIBOXEDLAYOUT=boxed
WEBTHEME=default-dark
API_EXCLUDE_DOMAINS=
API_EXCLUDE_CLIENTS=
API_QUERY_LOG_SHOW=all
API_PRIVACY_MODE=false
IPV4_ADDRESS=192.168.42.98/24
IPV6_ADDRESS=::1
QUERY_LOGGING=true
INSTALL_WEB_SERVER=false
INSTALL_WEB_INTERFACE=true
LIGHTTPD_ENABLED=false
CACHE_SIZE=100000
BLOCKING_ENABLED=true
CNAME_DEEP_INSPECT=true
RESOLVE_IPV6=no
DNSMASQ_LISTENING=single
PIHOLE_DNS_1=208.67.220.220
PIHOLE_DNS_2=208.67.222.222
DNS_FQDN_REQUIRED=true
DNS_BOGUS_PRIV=true
DNSSEC=false
REV_SERVER=true
REV_SERVER_CIDR=192.168.42.0/24
REV_SERVER_TARGET=192.168.42.34
REV_SERVER_DOMAIN=heralan
/etc/resolv.conf
# Generated by resolvconf
search heralan
nameserver 192.168.42.98
I did get an error "Gateway did not respond." with pihole -d. The gateway, however, does respond to pings when trying manually:
PING 192.168.42.1 (192.168.42.1) 56(84) bytes of data.
64 bytes from 192.168.42.1: icmp_seq=1 ttl=64 time=0.580 ms
64 bytes from 192.168.42.1: icmp_seq=2 ttl=64 time=0.451 ms
64 bytes from 192.168.42.1: icmp_seq=3 ttl=64 time=0.509 ms
64 bytes from 192.168.42.1: icmp_seq=4 ttl=64 time=0.474 ms
64 bytes from 192.168.42.1: icmp_seq=5 ttl=64 time=0.413 ms
^C
--- 192.168.42.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4138ms
rtt min/avg/max/mdev = 0.413/0.485/0.580/0.060 ms
Also, DNS queries succeed (if the gateway could not be reached, I would expect them to fail).
From here on, I am using ns2 as the example; the issue is more or less the same on ns1. All operations, configurations, etc. are from ns2. ns2 does not use ns1 for anything except the BIND zone transfers from ns1 (again, not an issue).
I tried debugging the issue with dnsperf and came up with:
[Status] Command line: dnsperf -e -Q 100 -l 5 -d /root/dnsperf-test.txt -s 192.168.42.98
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:06:27 2021
[Status] Stopping after 5.000000 seconds
[Timeout] Query timed out: msg id 3
[Timeout] Query timed out: msg id 5
[Status] Testing complete (time limit)
Statistics:
Queries sent: 500
Queries completed: 498 (99.60%)
Queries lost: 2 (0.40%)
Response codes: NOERROR 498 (100.00%)
Average packet size: request 41, response 80
Run time (s): 5.000187
Queries per second: 99.596275
Average Latency (s): 0.000839 (min 0.000164, max 0.049515)
Latency StdDev (s): 0.002248
[Status] Command line: dnsperf -e -Q 100 -l 5 -d /root/dnsperf-test.txt -s 192.168.42.98
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:06:40 2021
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)
Statistics:
Queries sent: 500
Queries completed: 500 (100.00%)
Queries lost: 0 (0.00%)
Response codes: NOERROR 500 (100.00%)
Average packet size: request 41, response 81
Run time (s): 5.000193
Queries per second: 99.996140
Average Latency (s): 0.000729 (min 0.000168, max 0.004930)
Latency StdDev (s): 0.000465
The /root/dnsperf-test.txt for dnsperf is:
www.google.com A
www.yle.fi A
As we can see, I might lose a packet (query) or two from time to time with this setup, but this is acceptable, at least at 100 qps. The load from my LAN clients is usually much lower.
Here's where it gets interesting and what my actual issue is:
Running a different set of A queries with the exact same setup, I get (again, all from ns2):
[Status] Command line: dnsperf -Q 250 -l 5 -t 10 -d /root/dnsperf-simple.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:40:42 2021
[Status] Stopping after 5.000000 seconds
Statistics:
Queries sent: 254
Queries completed: 154 (60.63%)
Queries lost: 100 (39.37%)
Response codes: NOERROR 154 (100.00%)
Average packet size: request 33, response 93
Run time (s): 5.000149
Queries per second: 30.799082
Average Latency (s): 0.002605 (min 0.000216, max 0.082471)
Latency StdDev (s): 0.010682
No matter how many times I repeat the test the results are not any better.
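To compare repeated runs without reading each full report, the loss figures can be pulled out of the dnsperf output. This is only a minimal sketch, assuming the report contains the "Queries sent:" and "Queries lost:" lines as shown above; the function name dnsperf_loss is made up for illustration.

```shell
# Summarise a dnsperf report read from stdin as "lost/sent pct%".
# Assumes the "Queries sent:" and "Queries lost:" lines of the Statistics block.
# Hypothetical usage:
#   dnsperf -Q 250 -l 5 -t 10 -d /root/dnsperf-simple.txt -s 192.168.42.98 | dnsperf_loss
dnsperf_loss() {
    awk '
        /Queries sent:/ { sent = $3 }   # third field is the count
        /Queries lost:/ { lost = $3 }
        END {
            if (sent > 0) {
                printf "%d/%d %.2f%%\n", lost, sent, 100 * lost / sent
            } else {
                print "no statistics found"
                exit 1
            }
        }
    '
}
```

Fed the report above, this prints 100/254 39.37%.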
The /root/dnsperf-simple.txt for dnsperf is
0.pool.ntp.org A
1.pool.ntp.org A
api.dropbox.com A
calendar.google.com A
docs.google.com A
play.google.com A
ns1.heralan A
ipinfo.io A
netperf-eu.bufferbloat.net A
outlook.office365.com A
pihole0.heralan A
time.akamai.com A
www.amazon.com A
www.bing.com A
www.bloomberg.com A
www.dropbox.com A
www.ebay.com A
www.eff.org A
www.facebook.com A
www.google.com A
www.grafana.com A
www.helsinki.fi A
www.microsoft.com A
www.netflix.com A
www.opendns.com A
www.openwrt.org A
www.pushbullet.com A
www.raspberrypi.org A
www.reddit.com A
www.tp-link.com A
www.twitter.com A
www.wikipedia.org A
www.yahoo.com A
www.youtube.com A
Every entry resolves correctly if tried manually:
while read line; do dig $line +short; echo "**"; sleep 0.2; done < /root/dnsperf-simple.txt
193.182.111.13
176.119.210.243
122.117.253.246
89.221.214.130
**
95.216.175.117
95.216.154.135
95.216.24.230
162.159.200.123
**
api-env.dropbox-dns.com.
162.125.70.19
**
216.58.207.206
**
142.250.74.142
**
172.217.20.46
**
192.168.42.38
**
34.117.59.81
**
flent-eu.bufferbloat.net.
demo.tohojo.dk.
193.10.227.30
**
outlook.ha.office365.com.
40.101.12.82
40.101.80.2
40.101.80.18
52.97.200.130
52.97.200.146
52.97.250.210
52.97.250.226
40.101.12.18
**
192.168.42.99
**
time.akamai.com.edgekey.net.
e1534.dscb.akamaiedge.net.
2.20.2.248
**
tp.47cf2c8c9-frontier.amazon.com.
d3ag4hukkh62yn.cloudfront.net.
13.32.123.226
**
a-0001.a-afdentry.net.trafficmanager.net.
www-bing-com.dual-a-0001.a-msedge.net.
dual-a-0001.a-msedge.net.
13.107.21.200
204.79.197.200
**
www.bloomberg.com.shared.bloomberga.com.
bloomberg.map.fastly.net.
151.101.129.73
151.101.193.73
151.101.1.73
151.101.65.73
**
www-env.dropbox-dns.com.
162.125.70.18
**
slot9428.ebay.com.edgekey.net.
e9428.a.akamaiedge.net.
96.16.165.22
**
eff.map.fastly.net.
151.101.128.201
151.101.192.201
151.101.0.201
151.101.64.201
**
star-mini.c10r.facebook.com.
157.240.205.35
**
216.58.207.196
**
grafana.com.
34.120.177.193
**
adc-vip3.it.helsinki.fi.
128.214.189.90
**
www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
e13678.dscb.akamaiedge.net.
2.22.42.141
**
www.dradis.netflix.com.
www.eu-west-1.internal.dradis.netflix.com.
dualstack.apiproxy-website-nlb-prod-3-ac110f6ae472b85a.elb.eu-west-1.amazonaws.com.
3.251.50.149
54.74.73.31
54.155.178.5
**
146.112.62.105
**
wiki-01.infra.openwrt.org.
139.59.209.225
**
104.26.5.125
104.26.4.125
172.67.68.238
**
104.22.1.43
104.22.0.43
172.67.36.98
**
reddit.map.fastly.net.
151.101.1.140
151.101.65.140
151.101.129.140
151.101.193.140
**
13.32.123.66
13.32.123.43
13.32.123.97
13.32.123.76
**
twitter.com.
104.244.42.193
104.244.42.1
**
dyna.wikimedia.org.
91.198.174.192
**
new-fp-shed.wg1.b.yahoo.com.
87.248.100.216
87.248.100.215
**
youtube-ui.l.google.com.
216.58.207.206
172.217.21.174
172.217.21.142
172.217.20.46
142.250.74.142
142.250.74.110
142.250.74.78
142.250.74.46
142.250.74.14
216.58.211.14
216.58.207.238
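One caveat about the manual check: dig retries a UDP query by default (3 tries), so the loop above can succeed even when a single query was lost. A stricter check would allow exactly one attempt per name and count the queries that get no reply. This is a rough sketch, not something I have run against this setup; the function name count_timeouts and the 2 second timeout are assumptions, and it relies on dig exiting with a non-zero status (9) when no reply arrives.

```shell
# count_timeouts FILE [RESOLVER COMMAND...]
# Runs the resolver once per "name type" line in FILE and counts non-zero exits.
# Default resolver (an assumption): a single dig attempt with a 2 s timeout
# against the Pi-hole on ns2. dig exits non-zero when it gets no reply.
count_timeouts() (
    file=$1
    shift
    [ "$#" -gt 0 ] || set -- dig +short +tries=1 +time=2 @192.168.42.98
    lost=0
    total=0
    while read -r name type; do
        [ -n "$name" ] || continue          # skip empty lines
        total=$((total + 1))
        # stdin is redirected so the resolver cannot consume the name list
        if ! "$@" "$name" "${type:-A}" > /dev/null 2>&1 < /dev/null; then
            echo "no reply: $name"
            lost=$((lost + 1))
        fi
    done < "$file"
    echo "lost $lost of $total"
)
```

Called as count_timeouts /root/dnsperf-simple.txt, it lists the names that timed out and a total. Passing a different resolver command after the file name makes it possible to point the same check at another server for comparison.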
How do I get Pi-hole to respond to queries correctly and stop losing a third to a half of them? With every other DNS query lost, the LAN becomes unbearably slow to use...
edit(s): Tried to make the long text more readable. Also, if some details are missing, I'd be happy to add them...