Google Wifi w/pihole DHCP problems (narrowed down to pihole-FTL pegged at 99%)

None of those outputs show anything that should cause a problem. You don't have a huge number of daily queries, excessive memory use, etc.

Take a look in the pihole log for details of the DHCP transactions and see if there are any failures or long processes taking place.

/var/log/pihole.log
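
For anyone following along, a minimal sketch of how the DHCP lines could be pulled out of that log so each exchange can be read with its timestamps. It assumes the DHCP entries carry dnsmasq's usual `dnsmasq-dhcp[PID]:` tag; the function name `dhcp_events` is made up for illustration.

```shell
#!/bin/sh
# Print only the DHCP-related lines from a dnsmasq-style log.
# Assumption: DHCP entries are tagged "dnsmasq-dhcp[PID]: " as dnsmasq
# normally does; adjust the pattern if your log looks different.
dhcp_events() {
    grep -E 'dnsmasq-dhcp\[[0-9]+\]: ' "$@"
}

# Usage (log path from the post above):
#   dhcp_events /var/log/pihole.log | tail -n 50
```

Reading the last few dozen lines while a client renews its lease should show whether the exchange completes or keeps repeating.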

Nothing jumped out at me. Here are 3 tests with output, timings, and results:

https://pastebin.com/raw/mtGgtZmu

Since the DHCP function is provided by dnsmasq (which is embedded in pihole-FTL but not part of the Pi-Hole specific code), you may find some answers on the dnsmasq mailing list.

http://lists.thekelleys.org.uk/mailman/listinfo/dnsmasq-discuss

Fair enough, but here's my last tidbit of findings FWIW.

I just spun up a Pi3B with the exact same configuration (host target change in ansible, 100% same settings except for static ip address).

It is working 100% as expected. With 818k domains on the blocklist, multiple devices renewed their DHCP leases in <3 seconds.

Note that the pihole-FTL process still pins a core (>95%) for a few seconds, but at least that doesn't block other processes (dnsmasq) from getting some CPU cycles.

Whatever is in my config, I don't think a single-core PiZ can handle it. It has been relegated to a secondary pihole with no dnsmasq for now :slight_smile:

The dnsmasq code is embedded in pihole-FTL, so dnsmasq is running under that process.

It is still running dnsmasq as before, but without DHCP enabled.

Oh I didn't realize it was compiled into that process, my bad. That makes much more sense why I was having problems when it spiked. Guess I'm just getting lucky with the Pi3B's higher clock speed for a single core.

This may be going too deep, but my inner programmer is curious, is that multithreaded and free of locks/mutexes between the dnsmasq code and the rest of FTL code?

pi@noads:~ $ ps -mo cmd,pid,tid,pcpu -C pihole-FTL
CMD                           PID   TID %CPU
/usr/bin/pihole-FTL         17075     -  0.7
-                               - 17075  0.3
-                               - 17077  0.0
-                               - 17078  0.0
-                               - 17079  0.0
-                               - 17080  0.1
-                               - 17081  0.0
-                               - 17082  0.0

Thanks, yeah I started looking through the code base also.

I do appreciate the help and effort you both put into this thread!

I forgot to ask previously, but why do you have swap in RAM?
Isn't swap there to relieve RAM of stale threads, filling up, etc.?
Better to avoid swap entirely and make sure she's got enough RAM :wink:

Very valid question. Honestly I had been exploring overlayfs and compressed memory partitions to save the sd card from thrashing as much. I'm doing some longer term tests to see how much/if any swap these little buggers use/need.

So far the PiZ is swapping about 512b, and zram is keeping it down to 76b (lz4). The Pi3B is not using any swap yet, very likely I'll get rid of it on that one.

Hardly any uptime because of testing:

pi@noads:~ $ uptime
 23:38:20 up 1 day,  4:56,  2 users,  load average: 0.13, 0.14, 0.14

pi@noads:~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:           179M         45M         31M        9.9M        102M         77M
Swap:           99M         12M         87M

pi@noads:~ $ echo '>stats' | nc localhost 4711
domains_being_blocked 131149
dns_queries_today 10140
ads_blocked_today 1716
ads_percentage_today 16.923077
unique_domains 807
queries_forwarded 3890
queries_cached 4534
clients_ever_seen 6
unique_clients 6
dns_queries_all_types 10140
reply_NODATA 31
reply_NXDOMAIN 1
reply_CNAME 197
reply_IP 1822
privacy_level 0
status enabled
---EOM---

FTL has hooks into the dnsmasq code (run this function when a query is resolved, etc), but it is mostly kept separate from dnsmasq, especially as for most of its existence, FTL did not have dnsmasq embedded into it.

I've started to dive into the code to see if anything pops out at me as to where/why it's spiking. I found it interesting that the problem is amplified as the blocklist grows, because I would not expect the blocklist code path to be involved in a DHCP lease renewal. It's very easy to reproduce the problem on my Pi(s). Enabling either DHCP or conditional forwarding causes my FTL to spike almost immediately. The problem is less severe on the Pi3, but the PiZ is brought to its knees.

Ok I think I struck gold finally. Posting this up for any who happen to land on this thread.

TL;DR:
Add this to /etc/dnsmasq.d/03-custom.conf:

# Fix for clients that misbehave if no WPAD option specified
dhcp-option=252,"\n"

Detailed findings for those interested below...

I inspected the shared memory and locking code and nothing jumped out at me. That many locks for the same mutex for so many functions is scary but it's shared memory and I understand it's needed.

I got both the PiZ and Pi3B into the failing setup, watched the debug timings for all the locks, and confirmed there were no deadlocks or excessive waits for the lock in any function.

So here it is... It's a DHCP client issue and pihole is not to blame, and by extension dnsmasq isn't technically to blame either because it is just doing what is requested by the clients.

Some DHCP clients, most notably Android (read: cellphones, Amazon Fire OS, anything based on Android is suspect), immediately send another DHCPINFORM/DHCPREQUEST if no Web Proxy Auto-Discovery (WPAD) option is specified in the DHCPACK response.

This can be observed in the pihole.log as the client constantly sending DHCPREQUEST over and over again until either a) the client finally accepts reality and stops the loop, or b) the client side timeout expires and it gives up.
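
A rough sketch of how the loop could be spotted: count DHCPREQUEST lines per client MAC and list the busiest clients first. It assumes dnsmasq's usual log layout, `DHCPREQUEST(iface) IP MAC`, so the MAC is taken as the second field after the message type. The function name `count_dhcp_requests` is made up for illustration.

```shell
#!/bin/sh
# Count DHCPREQUEST lines per client MAC address, busiest client first.
# Assumption: lines look like
#   "... dnsmasq-dhcp[PID]: DHCPREQUEST(eth0) 192.168.1.50 aa:bb:cc:dd:ee:ff"
# so the MAC is the second field after the "DHCPREQUEST(" field.
count_dhcp_requests() {
    awk '{
        for (i = 1; i <= NF - 2; i++)
            if ($i ~ /^DHCPREQUEST\(/) { n[$(i + 2)]++; break }
    }
    END { for (mac in n) print n[mac], mac }' "$@" | sort -rn
}

# Usage:
#   count_dhcp_requests /var/log/pihole.log | head
```

A client stuck in the loop should stand out with a request count far above the others.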

This has the domino effect of making the dnsmasq code spin hard, as it repeatedly executes the same code path for each client.

The fix above adds the WPAD option, 252, to the DHCPACK response with basically a no-op. The misbehaving client is then satisfied the option is present and no longer repeats the process over and over.
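
For those who script their setup, a minimal sketch of applying the change idempotently. The helper name `write_wpad_fix` is hypothetical, and the default path is the one from the TL;DR; the function appends the two lines only if no option 252 entry is already in the file.

```shell
#!/bin/sh
# Append the WPAD workaround to a dnsmasq config file, unless an
# option 252 line is already there. The path defaults to the file
# named in the TL;DR; pass another path as the first argument.
write_wpad_fix() {
    conf=${1:-/etc/dnsmasq.d/03-custom.conf}
    # Nothing to do if the option is already present.
    grep -qF 'dhcp-option=252,' "$conf" 2>/dev/null && return 0
    cat >> "$conf" <<'EOF'
# Fix for clients that misbehave if no WPAD option specified
dhcp-option=252,"\n"
EOF
}

# Usage (needs root for the default path):
#   write_wpad_fix
#   pihole restartdns
```

The quoted heredoc keeps the `\n` literal, exactly as in the config line above. The restart is there so the embedded dnsmasq re-reads its config directory.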

I was able to observe clients obtaining IPs in <3s from both the PiZ and Pi3B with that config change with the full ~820k domains on the blocklist.

I’ve tried everything listed, but the other pucks still don’t get an IP. Is this still working for you?

It kind of worked. The problem was a race between Google Wifi's dnsmasq and Pi-hole's dnsmasq (FTL) to respond to DHCP requests from clients. If Google Wifi's dnsmasq sent a NAK before FTL sent an ACK, the clients got stuck in an infinite loop of DHCPREQUEST/DHCPOFFER. It was a toss-up whether clients would get IPs; I had to keep rebooting things and hoping they got one. After another technical discussion with the Google Wifi team, pleading with them to either let me disable their dnsmasq or add a delay to their config, I'd had enough and ripped Google Wifi out of my network. It was the last straw.

I've since built out a ubiquiti network and couldn't be happier.

Given all the technical limitations of Google Wifi, I can honestly only recommend disabling Pi-hole's DHCP server, pointing Google Wifi's DNS at your Pi-hole, and letting it proxy all DNS requests (which makes all your network clients show up as a single client in the Pi-hole logs).

Thanks for the update! I should mention that I've discovered several other issues of my own. I’m running Pi-hole in a container on macOS, and net=host doesn’t work there the way it does on Linux, so DHCP was not broadcasting past the internal Docker network. I can’t create a new macvlan network, because I can’t create a VLAN on my Ethernet port that Google Wifi can see, and DHCP relay won’t work right now (no support; I might try a separate program). I may try rooting Google Wifi, which I think lets you stay in the mesh and disable DHCP. And then, of course, buy a Pi to run it on and remove the Docker issue. Weekend projects are fun.
