The issue I am facing: Some DNS queries fail at random. Retrying the same query usually resolves it correctly, even when the same DNS server is used.
Details about my system:
My setup is as follows:
Two outbound connections from two different operators. These are load balanced with failover by my router at 192.168.42.1: WAN1 is the active link, and WAN2 is the secondary in case WAN1 is not responding.
The LAN has two DNS servers, both Raspberry Pis: ns1 is a Raspberry Pi 4 with Debian GNU/Linux 10 (buster), and ns2 is a Raspberry Pi 3B+ with Raspbian GNU/Linux 9 (stretch). Both run BIND, at 192.168.42.38 (ns1) and 192.168.42.34 (ns2). Both BIND servers are configured to use the OpenDNS servers for resolution. Neither uses the other for any DNS queries. I have set up a local LAN domain, which BIND handles correctly; this is not an issue. BIND is configured to answer queries for the local domain from ANY, but query, recursion and cache are allowed only for 192.168.42.34/.38/.98/.99 as well as 127.0.0.1.
Both DNS servers also run Pi-hole, which is the main DNS for my LAN: the first at 192.168.42.99 (ns1), the second at 192.168.42.98 (ns2). Pi-hole is configured to use OpenDNS as the resolver for everything but the local domain, which each Pi-hole handles independently: ns1 uses only the BIND at ns1, and ns2 only the BIND at ns2. DHCP is handled by my router, which hands out 192.168.42.99 and 192.168.42.98 as the DNS servers. ns1 uses Apache, ns2 lighttpd (but this makes no difference for DNS queries).
No IPv6 is used, only IPv4, as the secondary operator does not support IPv6.
What I have changed since installing Pi-hole:
Well, a lot (as you would imagine)...
My configs for Pi-hole:
addn-hosts=/etc/pihole/local.list
addn-hosts=/etc/pihole/custom.list
localise-queries
no-resolv
cache-size=10000
log-queries
log-facility=/var/log/pihole.log
local-ttl=2
log-async
listen-address=192.168.42.98
server=126.96.36.199
server=188.8.131.52
domain-needed
expand-hosts
bogus-priv
bind-interfaces
rev-server=192.168.42.0/24,192.168.42.34
server=/heralan/192.168.42.34
server=/use-application-dns.net/

PRIVACYLEVEL=0
#BLOCKINGMODE=NODATA
BLOCKINGMODE=IP-NODATA-AAAA
RATE_LIMIT=0/0

#PIHOLE_INTERFACE=
WEBPASSWORD=[ omitted ]
ADMIN_EMAIL=firstname.lastname@example.org
WEBUIBOXEDLAYOUT=boxed
WEBTHEME=default-dark
API_EXCLUDE_DOMAINS=
API_EXCLUDE_CLIENTS=
API_QUERY_LOG_SHOW=all
API_PRIVACY_MODE=false
IPV4_ADDRESS=192.168.42.98/24
IPV6_ADDRESS=::1
QUERY_LOGGING=true
INSTALL_WEB_SERVER=false
INSTALL_WEB_INTERFACE=true
LIGHTTPD_ENABLED=false
CACHE_SIZE=100000
BLOCKING_ENABLED=true
CNAME_DEEP_INSPECT=true
RESOLVE_IPV6=no
DNSMASQ_LISTENING=single
PIHOLE_DNS_1=184.108.40.206
PIHOLE_DNS_2=220.127.116.11
DNS_FQDN_REQUIRED=true
DNS_BOGUS_PRIV=true
DNSSEC=false
REV_SERVER=true
REV_SERVER_CIDR=192.168.42.0/24
REV_SERVER_TARGET=192.168.42.34
REV_SERVER_DOMAIN=heralan

# Generated by resolvconf
search heralan
nameserver 192.168.42.98
I did get an error "Gateway did not respond." with pihole -d. The gateway, however, does respond to pings when trying manually:
PING 192.168.42.1 (192.168.42.1) 56(84) bytes of data.
64 bytes from 192.168.42.1: icmp_seq=1 ttl=64 time=0.580 ms
64 bytes from 192.168.42.1: icmp_seq=2 ttl=64 time=0.451 ms
64 bytes from 192.168.42.1: icmp_seq=3 ttl=64 time=0.509 ms
64 bytes from 192.168.42.1: icmp_seq=4 ttl=64 time=0.474 ms
64 bytes from 192.168.42.1: icmp_seq=5 ttl=64 time=0.413 ms
^C
--- 192.168.42.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4138ms
rtt min/avg/max/mdev = 0.413/0.485/0.580/0.060 ms
Also, DNS queries succeed (if the gateway could not be reached, I would expect them to fail).
From here on, I am using ns2 as the example; the issue is more or less the same with ns1. All operations, configurations, etc. are from ns2. ns2 does not use ns1 for anything but the BIND zone transfers from ns1 (again, not an issue).
I tried debugging the issue with dnsperf and came up with:
[Status] Command line: dnsperf -e -Q 100 -l 5 -d /root/dnsperf-test.txt -s 192.168.42.98
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:06:27 2021
[Status] Stopping after 5.000000 seconds
[Timeout] Query timed out: msg id 3
[Timeout] Query timed out: msg id 5
[Status] Testing complete (time limit)
Statistics:
  Queries sent:         500
  Queries completed:    498 (99.60%)
  Queries lost:         2 (0.40%)
  Response codes:       NOERROR 498 (100.00%)
  Average packet size:  request 41, response 80
  Run time (s):         5.000187
  Queries per second:   99.596275
  Average Latency (s):  0.000839 (min 0.000164, max 0.049515)
  Latency StdDev (s):   0.002248

[Status] Command line: dnsperf -e -Q 100 -l 5 -d /root/dnsperf-test.txt -s 192.168.42.98
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:06:40 2021
[Status] Stopping after 5.000000 seconds
[Status] Testing complete (time limit)
Statistics:
  Queries sent:         500
  Queries completed:    500 (100.00%)
  Queries lost:         0 (0.00%)
  Response codes:       NOERROR 500 (100.00%)
  Average packet size:  request 41, response 81
  Run time (s):         5.000193
  Queries per second:   99.996140
  Average Latency (s):  0.000729 (min 0.000168, max 0.004930)
  Latency StdDev (s):   0.000465
The /root/dnsperf-test.txt for dnsperf is:
www.google.com A
www.yle.fi A
As we can see, I might lose a packet (or query) or two from time to time with this setup, but this is acceptable, at least at 100 qps. The load from my LAN clients is usually much lower.
Here's where it gets interesting and what my actual issue is:
Running a different set of A queries with the exact same setup, I get (again, all from ns2):
[Status] Command line: dnsperf -Q 250 -l 5 -t 10 -d /root/dnsperf-simple.txt -s 192.168.42.98 -v
[Status] Sending queries (to 192.168.42.98)
[Status] Started at: Thu Jul 15 12:40:42 2021
[Status] Stopping after 5.000000 seconds
Statistics:
  Queries sent:         254
  Queries completed:    154 (60.63%)
  Queries lost:         100 (39.37%)
  Response codes:       NOERROR 154 (100.00%)
  Average packet size:  request 33, response 93
  Run time (s):         5.000149
  Queries per second:   30.799082
  Average Latency (s):  0.002605 (min 0.000216, max 0.082471)
  Latency StdDev (s):   0.010682
No matter how many times I repeat the test the results are not any better.
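To check whether the loss depends on the query rate rather than on the query file, the same file could be replayed at several rates. The following is only a sketch: `sweep_rates` and the `DNSPERF` variable are names I made up for illustration, the dnsperf options are the ones already used in the runs above, and the parsing assumes the "Queries sent:" and "Queries lost:" lines appear as in the statistics shown.

```shell
#!/bin/sh
# Hypothetical helper, not part of dnsperf or Pi-hole.
# DNSPERF can be overridden to point at a different binary.
DNSPERF=${DNSPERF:-dnsperf}

# sweep_rates SERVER QUERYFILE QPS...
# Runs dnsperf for 5 seconds at each rate and prints one line per run:
#   <qps> <queries sent> <queries lost>
sweep_rates() {
    server=$1
    file=$2
    shift 2
    for qps in "$@"; do
        "$DNSPERF" -Q "$qps" -l 5 -t 10 -d "$file" -s "$server" 2>/dev/null |
            awk -v qps="$qps" '
                /Queries sent:/ { sent = $3 }   # third field is the count
                /Queries lost:/ { lost = $3 }
                END { print qps, sent + 0, lost + 0 }
            '
    done
}
```

For example, `sweep_rates 192.168.42.98 /root/dnsperf-simple.txt 10 50 100 250` would show at which rate, if any, the loss starts to appear.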
The /root/dnsperf-simple.txt for dnsperf is:
0.pool.ntp.org A
1.pool.ntp.org A
api.dropbox.com A
calendar.google.com A
docs.google.com A
play.google.com A
ns1.heralan A
ipinfo.io A
netperf-eu.bufferbloat.net A
outlook.office365.com A
pihole0.heralan A
time.akamai.com A
www.amazon.com A
www.bing.com A
www.bloomberg.com A
www.dropbox.com A
www.ebay.com A
www.eff.org A
www.facebook.com A
www.google.com A
www.grafana.com A
www.helsinki.fi A
www.microsoft.com A
www.netflix.com A
www.opendns.com A
www.openwrt.org A
www.pushbullet.com A
www.raspberrypi.org A
www.reddit.com A
www.tp-link.com A
www.twitter.com A
www.wikipedia.org A
www.yahoo.com A
www.youtube.com A
Every entry resolves correctly if tried manually:
while read line; do dig $line +short; echo "**"; sleep 0.2; done < /root/dnsperf-simple.txt
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
**
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
**
api-env.dropbox-dns.com.
126.96.36.199
**
188.8.131.52
**
184.108.40.206
**
220.127.116.11
**
192.168.42.38
**
18.104.22.168
**
flent-eu.bufferbloat.net.
demo.tohojo.dk.
22.214.171.124
**
outlook.ha.office365.com.
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
**
192.168.42.99
**
time.akamai.com.edgekey.net.
e1534.dscb.akamaiedge.net.
184.108.40.206
**
tp.47cf2c8c9-frontier.amazon.com.
d3ag4hukkh62yn.cloudfront.net.
220.127.116.11
**
a-0001.a-afdentry.net.trafficmanager.net.
www-bing-com.dual-a-0001.a-msedge.net.
dual-a-0001.a-msedge.net.
18.104.22.168
22.214.171.124
**
www.bloomberg.com.shared.bloomberga.com.
bloomberg.map.fastly.net.
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
**
www-env.dropbox-dns.com.
18.104.22.168
**
slot9428.ebay.com.edgekey.net.
e9428.a.akamaiedge.net.
22.214.171.124
**
eff.map.fastly.net.
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
**
star-mini.c10r.facebook.com.
18.104.22.168
**
22.214.171.124
**
grafana.com.
126.96.36.199
**
adc-vip3.it.helsinki.fi.
188.8.131.52
**
www.microsoft.com-c-3.edgekey.net.
www.microsoft.com-c-3.edgekey.net.globalredir.akadns.net.
e13678.dscb.akamaiedge.net.
184.108.40.206
**
www.dradis.netflix.com.
www.eu-west-1.internal.dradis.netflix.com.
dualstack.apiproxy-website-nlb-prod-3-ac110f6ae472b85a.elb.eu-west-1.amazonaws.com.
220.127.116.11
18.104.22.168
22.214.171.124
**
126.96.36.199
**
wiki-01.infra.openwrt.org.
188.8.131.52
**
184.108.40.206
220.127.116.11
18.104.22.168
**
22.214.171.124
126.96.36.199
188.8.131.52
**
reddit.map.fastly.net.
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
**
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
**
twitter.com.
18.104.22.168
22.214.171.124
**
dyna.wikimedia.org.
126.96.36.199
**
new-fp-shed.wg1.b.yahoo.com.
188.8.131.52
184.108.40.206
**
youtube-ui.l.google.com.
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
184.108.40.206
220.127.116.11
18.104.22.168
22.214.171.124
126.96.36.199
188.8.131.52
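One caveat with the manual loop: dig retries on its own (three tries by default, as far as I know), so a single lost query can still end in a correct answer and go unnoticed. A stricter check would ask each entry exactly once with a short timeout and list only the entries that got no reply. This is a minimal sketch; `failed_queries` and the `DIG` variable are hypothetical names, and it assumes dig exits non-zero when no server replies.

```shell
#!/bin/sh
# Hypothetical helper, not part of dig or Pi-hole.
# DIG can be overridden to point at a different binary.
DIG=${DIG:-dig}

# failed_queries SERVER QUERYFILE
# QUERYFILE uses the dnsperf format: one "<name> <type>" pair per line.
# Prints every entry that received no reply to a single attempt.
failed_queries() {
    server=$1
    file=$2
    while read -r name type; do
        [ -z "$name" ] && continue      # skip empty lines
        type=${type:-A}                 # default to A if no type is given
        # +tries=1 disables dig's own retries; +time=1 waits one second
        if ! "$DIG" "@$server" "$name" "$type" +short +tries=1 +time=1 \
                </dev/null >/dev/null 2>&1; then
            echo "$name $type"
        fi
    done < "$file"
}
```

Running `failed_queries 192.168.42.98 /root/dnsperf-simple.txt` a few times in a row would show whether the same names are lost every time or whether the loss is spread at random over the list.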
How do I get Pi-hole to respond to queries correctly instead of losing between a third and a half of them? With every other DNS query lost, the LAN becomes unbearably slow to use...
edit(s): Tried to make the long text more readable. Also, if some details are missing, I'd be happy to add them...