DNS goes down and needs a long time to resize/remap tables

gvb · June 8, 2023, 8:22am

Hello,

Expected Behaviour:

DNS server should start quickly.
I'm running it on a Raspbian VM under VMware ESXi.
Traffic is quite high, though.

Actual Behaviour:

Each time I update Pi-hole, the update itself goes quickly and without problems.
The problems start when the DNS server restarts afterwards.
Everything looks fine, but the web interface keeps showing a "not working" sign and no stats appear.
This goes away after 30 minutes or so.

As the DNS seems to go down sometimes, I started looking around and found the FTL log.
It shows that while DNS is down, FTL is doing a lot of resizing and remapping of tables:

Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:35.400 4268M] Resizing "FTL-strings" from 22691840 to (22732800 * 1) == 22732800 (/dev/shm: 75.7MB used, 517.0MB total, FTL uses 75.7MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:36.391 4268M] Resizing "FTL-domains" from 13762560 to (689152 * 20) == 13783040 (/dev/shm: 75.7MB used, 517.0MB total, FTL uses 75.7MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:39.599 4268M] Resizing "FTL-strings" from 22732800 to (22773760 * 1) == 22773760 (/dev/shm: 75.7MB used, 517.0MB total, FTL uses 75.7MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:40.226 4268M] Resizing "FTL-domains" from 13783040 to (690176 * 20) == 13803520 (/dev/shm: 75.8MB used, 517.0MB total, FTL uses 75.8MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:43.795 4268M] Resizing "FTL-strings" from 22773760 to (22814720 * 1) == 22814720 (/dev/shm: 75.8MB used, 517.0MB total, FTL uses 75.8MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:44.076 4268M] Resizing "FTL-domains" from 13803520 to (691200 * 20) == 13824000 (/dev/shm: 75.8MB used, 517.0MB total, FTL uses 75.8MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:44.968 4268M] DB warn: TYPE should not be 100
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:45.044 4268M] Resizing "FTL-queries" from 38207488 to (872448 * 44) == 38387712 (/dev/shm: 75.9MB used, 517.0MB total, FTL uses 75.8MB)
Jun 08 09:34:48 server-pihole pihole-FTL[4268]: [2023-06-08 09:34:47.942 4268M] Resizing "FTL-domains" from 13824000 to (692224 * 20) == 13844480 (/dev/shm: 76.0MB used, 517.0MB total, FTL uses 76.0MB)
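
The excerpt above shows each table growing by a small fixed step per resize. As a rough sketch for summarizing such lines, the following shell function prints the table name and the growth in bytes for every "Resizing" line it reads. It assumes the exact line format pasted above; `resize_steps` is a hypothetical helper name, not part of Pi-hole.

```shell
# Print "<table> <growth in bytes>" for each FTL "Resizing" line on stdin.
# Assumes the line format shown in the log excerpt above.
resize_steps() {
  awk '/Resizing "/ {
    for (i = 1; i <= NF; i++) {
      if ($i == "Resizing") name = $(i + 1)   # quoted table name
      if ($i == "from")     old  = $(i + 1)   # size before the resize
      if ($i == "==")       new  = $(i + 1)   # size after the resize
    }
    gsub(/"/, "", name)
    print name, new - old
  }'
}
```

Applied to the lines above, this gives 40960 bytes per "FTL-strings" step, 20480 bytes (1024 entries of 20 bytes) per "FTL-domains" step and 180224 bytes (4096 entries of 44 bytes) for the "FTL-queries" step, so the tables grow in many small increments rather than in one large allocation.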

I already lowered the number of days to keep to 7, but it still takes ages to get started.
Maybe I should lower it to 1, as I don't really need to look things up.
But would this speed up the resizing/remapping as well?

A quick fix was clearing the database, after which it starts right away, but I'd prefer that it keeps running like it did for months before a certain update.

Debug Token:

I was running the debugging diagnostics via the web interface but it hangs at *** [ DIAGNOSING ]: Dashboard headers

and the entire GUI seems to be unresponsive now so I'll get back to you with the token when I have it.

Can I restart only the web interface without disrupting the DNS service, now that it's running again?

gvb · June 8, 2023, 8:48am

for some reason it can't upload the debug file.

[?] Would you like to upload the log? [y/N] y
* Using curl for transmission.
* curl failed, contact Pi-hole support for assistance.
* Error message: curl: (22) The requested URL returned error: 500

[✗] There was an error uploading your debug log.

Please try again or contact the Pi-hole team for assistance.

Bucking_Horn · June 9, 2023, 7:44am

You may have been affected by down-times.
Please try again.

gvb · June 12, 2023, 7:23am

ok, no problem.

here you go...

https://tricorder.pi-hole.net/KmHNgQpd/

I changed the cache size to 50,000 and now the high number of evictions is gone.

gvb · June 13, 2023, 7:02am

I got a call that someone couldn't reach a server with shared folder anymore.

Pi-hole GUI was totally unresponsive.

A restartdns solved that, but after I let it do 20 minutes of that resizing, the DNS service still hadn't come online.

So I wiped the logs again; then it resized 2 things and, after a few remaps, it came online again.

Why doesn't the DNS service start before this resizing happens?
Then it wouldn't matter that it needs 30+ minutes.

DL6ER · June 13, 2023, 7:34am

No, lowering the number of days would just reduce the disk space required for the database. Pi-hole reimports the latest 24 hours of history during a restart, so I'd recommend trying to disable the database altogether, especially given that you said the web UI isn't working either.

The DNS service can't start before the import because we need to import the DNS history first so we can append new queries at the end. Making this parallel would require a lot of work and is usually not needed. The long delay is caused by one of:

Very many queries (multi millions per hour range),
Very slow processor, or
Very slow disk speed.

Having said that, the reason for any kind of slowness is typically that the hardware simply cannot process the amount of data. This would also fit your observation.

It's easier for us to answer this when you can give us a rough estimate of

What is the used hardware to run Pi-hole?
How many queries are there roughly (per hour or per day)?
How many clients is your Pi-hole serving approximately?

gvb · June 13, 2023, 7:59am

What do I need to do to disable the database, and what's the downside?
Just not being able to see which requests are blocked/passed?

It runs on a small Lenovo PC as a VMware virtual machine.
The same host also runs another VM with a 3CX VoIP PABX.
All worked fine for months until a certain update, though.

every 10 minutes I see around 8-12K requests in the stats.
55K since the restart an hour ago.

Hard to tell with all those tablets and smartphones these days that pollute the network.
But it's 2 subnets so < 500 and concurrent maybe 50+
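
To compare these figures with the "multi millions per hour" range mentioned earlier, the 10-minute counts can be scaled up. This is only a back-of-the-envelope sketch: the helper names are made up, and it assumes the rate stays constant around the clock, which most likely overstates the daily total.

```shell
# Scale a query count observed in a 10-minute window.
per_hour() { echo $(( $1 * 6 )); }     # 6 ten-minute windows per hour
per_day()  { echo $(( $1 * 144 )); }   # 144 ten-minute windows per day

per_hour 8000    # 48000
per_hour 12000   # 72000
per_day 12000    # 1728000
```

Even the upper estimate of about 72,000 queries per hour is far below millions per hour.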

Bucking_Horn · June 13, 2023, 10:42am

You didn't mention any evictions so far?
Also, Pi-hole's embedded dnsmasq comes with a hard-coded maximum cache size of 10,000 entries.
How did you go about applying 50,000?

Probably not related, but your debug log shows that you use some *.local as your local/search domain name.

You should note that .local is reserved for use by the mDNS protocol and should NOT be used with DNS. While most modern OSs would come with mDNS support, Apple devices in particular would regularly employ mDNS for local name resolution and service discovery.

To get an idea how many different client IPs your Pi-hole has seen over time, please share the result of:

pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM client_by_id;"
pihole-FTL sqlite3 /etc/pihole/pihole-FTL.db "SELECT count(*) FROM network_addresses;"

Please do so before applying below configuration.

Quoting Pi-hole's documentation:

The config parameter DBIMPORT controls whether FTL loads information from the database on startup. It needs to do this to populate the internal data structure with the most recent history. However, as importing from the database on disk can delay FTL on very large deploys, it can be disabled using this option.
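
A minimal sketch of how the option quoted above could be applied. The helper `set_conf_option` is hypothetical, and the file location `/etc/pihole/pihole-FTL.conf` as well as the value `no` are assumptions based on Pi-hole v5's key=value FTL configuration; check the documentation for your version before using it.

```shell
# Set KEY=VALUE in a key=value config file: replace an existing KEY line,
# or append a new one if the key is not present yet.
set_conf_option() {
  file=$1 key=$2 value=$3
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    # Rewrite the existing line via a temp file (portable across sed variants).
    sed "s|^${key}=.*|${key}=${value}|" "$file" > "${file}.tmp" &&
      mv "${file}.tmp" "$file"
  else
    printf '%s=%s\n' "$key" "$value" >> "$file"
  fi
}

# Intended use on the Pi-hole host (assumed path, needs root):
#   set_conf_option /etc/pihole/pihole-FTL.conf DBIMPORT no
#   pihole restartdns
```

With the import disabled, FTL would skip loading the recent history on startup, so the dashboard would presumably start out empty after each restart while new queries are still recorded.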

gvb · June 13, 2023, 11:28am

I changed the cache size with

sudo nano /etc/dnsmasq.d/01-pihole.conf

setting it to, for example, 50000.

The two database queries return 49 and 75.

About the .local ... do you mean our local domain?

I added it so that it can forward requests to our local microsoft AD/DNS if needed.

I currently see 2 notifications:

cache size greater than 10000 may cause performance issues, and is unlikely to be useful. (I changed that)

Maximum number of concurrent DNS queries reached (max: 150)

But it's still working normally.

Bucking_Horn · June 13, 2023, 12:09pm

Those numbers are not really that high.

The "Maximum number of concurrent DNS queries reached" warning can be caused by Pi-hole's upstream DNS servers not responding in time or not at all, or by a DNS loop of sorts.
Both may contribute to a (potentially vastly) exaggerated count of DNS requests, which may well have an impact on the slow start-up times you observe.

You should investigate this before you decide to disable the database.

Are Pi-hole's upstream DNS resolvers responding?
You should be able to tell by investigating the logs specifically for requests that didn't receive a reply.

Is 192.168.0.1 using Pi-hole as an upstream DNS resolver?

As you've enabled Pi-hole's Conditional Forwarding, this could have closed a DNS loop if your 192.168.0.1 in turn uses Pi-hole: requests for unknown hostnames would then be bounced back and forth between Pi-hole and 192.168.0.1 ad infinitum, or until a time-out or rate and concurrency limits kick in.

And just to satisfy my curiosity:
Given that your request count seems rather high, are you running your Pi-hole in some kind of campus or company environment?

gvb · June 13, 2023, 12:23pm

192.168.0.1 is the Microsoft AD/DNS server on site A.

the 'problem' pi-hole is on site B.
The Microsoft server forwards to a pi-hole server on site A.

Requests from site B should not pass site A's microsoft dns as it will only end there for xxx.ourdomain.local requests. The requests on site A's pi-hole are a lot less.

This is a (multi site) institution for people with mental disability which can be compared to a small campus/school I guess.

Bucking_Horn · June 13, 2023, 12:30pm

Software on devices may generate requests for non-existent local or non-dot domains for various reasons (e.g. connectivity or captive portal checks from browsers, or mail address checks from mail clients).

Those would cause the DNS loop to close as described above.
In that case, you should have observed 150 requests for such an offending domain in short succession in your logs, i.e. one identical request after the other.

Would that be the case for you?
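
One way to look for such runs of identical requests is to count queries per domain and per one-second timestamp. The function below is only a sketch: `query_bursts` is a made-up name, the default threshold of 100 is arbitrary, and it assumes dnsmasq-style `query[TYPE] domain from client` lines as they appear in Pi-hole's query log.

```shell
# List (second, domain) pairs that were queried at least MIN times within
# the same one-second timestamp. Reads query log lines on stdin.
query_bursts() {
  min=${1:-100}
  awk -v min="$min" '
    {
      for (i = 4; i < NF; i++) {
        if ($i ~ /^query\[/) {
          ts = $1 " " $2 " " $3          # e.g. "Jun 26 16:12:19"
          sub(/:$/, "", ts)              # drop a trailing colon, if any
          count[ts "\t" $(i + 1)]++      # key: timestamp + queried domain
          break
        }
      }
    }
    END {
      for (k in count)
        if (count[k] >= min) print count[k] "\t" k
    }
  ' | sort -rn
}
```

A looped domain would be expected to show up with a count near the concurrency limit, while ordinary lookups rarely repeat more than a few times per second.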

EDIT:

Pi-hole has been reported to operate on larger sites like university campuses.
Chances are some secondary causes may impact or at least contribute to your observation, like that potential DNS loop we are trying to investigate.

gvb · June 13, 2023, 1:56pm

You are probably referring to requests like this one:

wpad.ourdomain.local

which are proxy testing requests or something like that, made by browsers.
3542 in the last few hours, which is a lot but not problematic, I guess.

What I noticed is that 85% of the requests are now coming from _Gateway.
On the other side this is the IP address of the 2 gateways (lan & wifi) but not _Gateway.
But when I click on it it shows " showing all queries for client 192.168.10.10" which is correct.
(The router is DHCP, DNS & gateway and uses Pi-hole for lookups. I need to test whether adding the Pi-hole IP as DNS in the DHCP settings works on the hotspot. You can't connect to other devices in hotspot mode; maybe DNS is an exception.)

Bucking_Horn · June 13, 2023, 4:22pm

Yes, that would be one example, and a count like that could indeed suggest a DNS loop, as clients usually don't request that very often (perhaps once per session, if at all).
I'd expect there to be additional similar domains as well, so the total count of looped queries may be much higher.

Since your debug log shows that your DHCP server at 192.168.10.10 is correctly distributing the Pi-hole host machine's IP address as local DNS resolver, I'd expect the vast majority of DNS requests to originate directly from clients.
Yet you are observing 85% to originate from your gateway.

This would again support my suspicion of an active DNS loop - unless your router/gateway at 192.168.10.10 does indeed aggregate DNS traffic of the majority of clients in your network.

However, that could still be the case in your scenario, as you mention two separate networks, and the bulk of your clients could well reside in your site A, which potentially aggregates DNS traffic on behalf of site A clients.

Would you expect the majority of DNS requests to originate from site A's 192.168.0.0/24(?) network?
Do you run one Pi-hole machine for each of your networks?

You could use the following nslookups, run once from a client in each site's network, to verify whether an unknown domain lookup triggers a loop:

nslookup bogus-host

nslookup bogus-host.ourdomain.local

gvb · June 14, 2023, 6:22am

It hangs again.

I don't know if this token shows why

https://tricorder.pi-hole.net/kDe19EbY/

I'll respond to the other questions later.

gvb · June 14, 2023, 7:39am

yes, I have a pi-hole on site A (192.168.0.16) and one on site B (192.168.10.3)

both forward ourdomain.local to 192.168.0.1 which is the microsoft AD/DNS/DHCP server.
the forwarder on that server is site A's pi-hole.

wired clients use 0.1 as dns

wireless clients use one of the gateway addresses of the router's dhcp ranges (varies between 40.10, 50.10 - 54.10)
The router uses the Pi-hole (0.16) as its system DNS.

As mentioned before, if hotspot mode allows connecting to 0.16 from that range/VLAN as well, I could enter 0.16 in the DHCP settings. Then the stats would look more realistic instead of showing a bundled amount per gateway IP, with fewer hops and less load on the router too.

Here are some lookups, but I don't know how we can detect a possible loop with that info.

C:\Users\administrator>nslookup dummy
Server:  pi.hole
Address:  192.168.10.3

*** pi.hole can't find dummy: Non-existent domain

C:\Users\administrator>nslookup dummy.ourdomain.local
Server:  pi.hole
Address:  192.168.10.3

*** pi.hole can't find dummy.ourdomain.local: Non-existent domain

C:\Users\administrator>nslookup mypc.ourdomain.local
Server:  pi.hole
Address:  192.168.10.3

Name:    mypc.ourdomain.local
Address:  192.168.0.59

C:\Users\administrator>nslookup mypc
Server:  pi.hole
Address:  192.168.10.3

Non-authoritative answer:
Name:    mypc.ourdomain.local
Address:  192.168.0.59

gvb · June 15, 2023, 6:04am

got another mail that nothing worked again.

Indeed, I can log in, but then I get a black page instead of the page with non-loading stats.

I changed the cache size back to 10,000; maybe it's more stable then.

Bucking_Horn · June 15, 2023, 7:30am

I didn't ask for the results.
I suggested using those lookups for unknown domains to verify a loop - one of yours returns an IP.

You should monitor Pi-hole's logs for those 150 looped requests while issuing those lookups, e.g. by running pihole -t.
Also, you should see the max concurrent warning in Pi-hole's UI.
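
As a rough aid for that check, the queries for one test name can be grouped by the client that sent them. This is a sketch under assumptions: `queries_by_client` is a hypothetical helper, and it expects `query[TYPE] name from client` lines as printed by pihole -t. If a single lookup of a bogus name produces many entries coming from another resolver (for example 192.168.0.1) instead of one entry from the test client, that would be consistent with a loop.

```shell
# Count queries for one exact name, grouped by requesting client.
# Reads query log lines on stdin; prints "<count> <client>", busiest first.
queries_by_client() {
  awk -v name="$1" '
    {
      for (i = 4; i <= NF - 3; i++) {
        if ($i ~ /^query\[/ && $(i + 1) == name && $(i + 2) == "from") {
          count[$(i + 3)]++
          break
        }
      }
    }
    END { for (c in count) print count[c], c }
  ' | sort -rn
}

# Example (log path as it appears elsewhere in this thread):
#   queries_by_client bogus-host < /var/log/pihole/pihole.log
```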

gvb · June 27, 2023, 6:24am

G'day,

It froze again after running fine for over a week.

I had the logging running all that time and it shows this

Jun 26 16:12:19: query[A] ocws.officeapps.live.com from 192.168.10.123
Jun 26 16:12:19: query[A] gopulse-evolve.lulululemon.com from 192.168.10.10
tail: /var/log/pihole/pihole.log: bestand is ingekort

The last line translates to "file truncated".

I'm not sure if that means a storage problem or just a log file resizing after something crashed/ended.

rdwebdesign · June 27, 2023, 6:57am

Usually pihole.log is truncated at midnight.

What is the output of this command?

ls -la /var/log/pihole/pihole*