Faulty regex consumes all memory and causes issues with pihole -q

I selected "Type: Wildcard Whitelist" but it appeared as "Regex Whitelist". Not sure if it should be like this - it is automatically converted to a regex style. See picture for test.domain - it was added as wildcard whitelist but appears as Regex Whitelist.

Will do that.

Yes, that is correct. Wildcards are converted to a regex when committed.
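To make the conversion concrete, here is a rough sketch of what it appears to do, inferred only from the patterns shown later in this thread (dots escaped, `(\.|^)` prepended, `$` appended). The function name `wildcard_to_regex` is made up for illustration and this is not the actual web interface code:

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT Pi-hole code: mimics the wildcard-to-regex
# conversion as it can be inferred from the patterns in this thread.
wildcard_to_regex() {
  local escaped
  # Escape every literal dot so it no longer means "any character".
  escaped=$(printf '%s\n' "$1" | sed 's/\./\\./g')
  # Prefix "(\.|^)" anchors at a label boundary, suffix "$" at the end.
  printf '%s%s%s\n' '(\.|^)' "$escaped" '$'
}
```

With this sketch, an entry such as `*.services.generalmagic.com` keeps its leading `*`, which then ends up directly behind the `(\.|^)` group.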

Issue opened:


This is functionally the same issue as: Pi-hole becomes unresponsive during pihole -d and -q

Faulty regex causes -q (and, by extension -d) to stall out

(You're not alone!)

Thanks for the hint :)

As the root cause was the same with us I opened a new topic on that issue:

4 posts were merged into an existing topic: Rename "wildcard" blacklist/whitelist because it is misleading and lead to regex errors

As mentioned in the other thread, this has already been in the works for quite some time but will likely not make it into v5.0, so as not to push back the release unnecessarily.

Thanks for this. However, we will simply reuse the built-in PHP domain validator we already used before. It also checks the maximum length of subdomains, etc.
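For readers wondering what such validation looks like, here is a rough shell sketch of the kind of checks a domain validator performs. It is not the PHP validator mentioned above; the limits used (253 characters overall, 63 per label) are the usual hostname limits, and `is_valid_domain` is a made-up name:

```bash
#!/usr/bin/env bash
# Hypothetical helper, NOT the PHP validator: returns 0 if "$1" looks like
# a plain hostname, 1 otherwise.
is_valid_domain() {
  local domain=$1
  # A label: starts and ends with a letter or digit, at most 63 characters.
  local label='[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?'
  # The whole name must be between 1 and 253 characters long.
  [ "${#domain}" -ge 1 ] && [ "${#domain}" -le 253 ] || return 1
  # One or more labels separated by single dots, nothing else.
  printf '%s\n' "$domain" | grep -Eq "^(${label}\.)*${label}\$"
}
```

A check along these lines would reject an entry like `*.host.domain` before it is ever turned into a regex, because `*` is not a valid hostname character.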


Can we call this the solution to the topic then?

Depends on how you (or pihole team) interpret "solution": issue acknowledged and code will be written or code is written and functioning.

For me it's solved - I can't contribute any further except for testing the patch when it's ready.

Can I get a teleporter tarball from someone that is seeing this so I can work on it?

I think the glitch may be in something else. As noted on the GitHub issue, calling awk directly doesn't cause a crash for me, and matching in the shell also seems to be okay. I don't think there is actually a problem with using a * in a wildcard entry: it will always be the first character, so the pattern will always start with (\.|^)*. I think that's okay; it just means "zero or more dots or starts of line".

Checking with a small script and using shell regex instead of awk looks hopeful?

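#!/usr/bin/env bash
# Domains to test against the regex that the wildcard entry was converted to.
domains=(gstatic.com generalmagic.com services.generalmagic.com me.services.generalmagic.com)
# Single quotes keep the backslashes and the trailing $ literal.
regex='(\.|^)*\.services\.generalmagic\.com$'

for domain in "${domains[@]}"; do
  # $regex must stay unquoted so bash treats it as a regex, not a literal string.
  if [[ $domain =~ $regex ]]; then
    printf 'Found %s in %s.\n' "$domain" "$regex"
  else
    printf 'Did not find %s in %s.\n' "$domain" "$regex"
  fi
done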

OUT:

Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.

Edit:

For completeness and ease of discussion, here's the awk:

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "gstatic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "generalmagic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "services.generalmagic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$

I can try to carve out a very low-RAM LXC container to check for memory starvation; 1G is the minimum on DigitalOcean, and that may not show what we are looking for.

Edit: 1G RAM DigitalOcean droplet and I can't get it to crash or hang at all. I'll need something to duplicate with to try and work this down.
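For anyone trying to reproduce the memory blow-up without risking the whole box, one option is to cap the process instead of shrinking the machine. A minimal sketch, assuming GNU coreutils `timeout` is installed; `run_capped` is a hypothetical helper name, not part of Pi-hole:

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command with a wall-clock limit (seconds) and a
# virtual-memory cap (KiB) so a runaway matcher cannot starve the system.
# Usage: run_capped SECONDS MEM_KIB COMMAND [ARGS...]
run_capped() {
  local seconds=$1 mem_kib=$2
  shift 2
  (
    # The limit only applies inside this subshell, not to the calling shell.
    ulimit -v "$mem_kib" || exit 125
    # timeout exits with status 124 if the command had to be stopped.
    exec timeout "$seconds" "$@"
  )
}
```

Something like `run_capped 10 262144 mawk '...' regexps.txt domains.txt` (file names are placeholders) should then fail quickly with a non-zero exit status instead of being killed by the kernel after exhausting RAM.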

The awk is from Fix for regexp queries through pihole -q by mmotti · Pull Request #2780 · pi-hole/pi-hole · GitHub and the PR author may have some thoughts on what/why this is being seen.

paging @mmotti

Hmm, to be completely honest, I'm not sure. My dealings with awk are quite limited, and I only started playing around with it for this purpose because it seemed much quicker than the alternatives when it came to a significant number of regexps, particularly compared to running grep in a loop for each regexp and checking the return code, as grep doesn't seem to have the ability to display the pattern that caused a match.

(\.|^)* seems like particularly bad regex, and I wonder if it could be classed as an 'evil regex', i.e. regex denial of service. The issue is likely the caret used within a group (not great practice anyway), further complicated by the * quantifier: since ^ matches without consuming any characters, the group can match the empty string, and * can then repeat that empty match. It could be stuck checking a huge number of possibilities for each domain and thus dying?

I'm away from my laptop at the moment, but does grep or the more generic bash approach have the same outcome when compared against the same number of domains? Is the performance as good when running a for loop with bash? I'm assuming the domain count for the people experiencing issues is in the millions, but I haven't read every detail of the thread just yet.

Sadly, I'm not sure what to suggest for this, particularly with awk. The problem with regex generally is that as long as the pattern is technically compilable, the processor (awk in this case) is at the mercy of the user.
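For comparison, the grep-in-a-loop alternative mentioned above could look roughly like this. It is only a sketch for testing, with a made-up function name (`matching_patterns`), and it spawns one grep per pattern, which is exactly why it gets slow with many regexps:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every pattern from file "$2" (one extended regex
# per line) that matches domain "$1". grep itself cannot report which pattern
# matched, so each pattern is tried separately and the exit code is checked.
matching_patterns() {
  local domain=$1 pattern_file=$2 pattern
  while IFS= read -r pattern || [ -n "$pattern" ]; do
    # Skip blank lines: an empty pattern would match every domain.
    [ -z "$pattern" ] && continue
    if printf '%s\n' "$domain" | grep -Eq -e "$pattern"; then
      printf '%s\n' "$pattern"
    fi
  done < "$pattern_file"
}
```

Called as `matching_patterns some.domain /path/to/regex.list` (the path is a placeholder), it prints the matching patterns one per line, which is the same information the awk one-liner produces.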

Tried it myself.

With the script everything is fine, running just ~1 sec.

nanopi@nanopi:~$ ./test 
Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.

But the awk got killed again due to high memory consumption:

nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed


As it seems to come down to awk, I checked what was installed:

nanopi@nanopi:~$ apt -s install awk
NOTE: This is only a simulation!
      apt needs root privileges for real execution.
      Keep also in mind that locking is deactivated,
      so don't depend on the relevance to the real current situation!
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package awk is a virtual package provided by:
  original-awk:armhf 2012-12-20-6
  mawk:armhf 1.3.3-17+b3
  gawk:armhf 1:4.2.1+dfsg-1
  original-awk 2012-12-20-6
  mawk 1.3.3-17+b3
  gawk 1:4.2.1+dfsg-1
You should explicitly select one to install.
nanopi@nanopi:~$ sudo apt-show-versions awk
awk not installed (not available)
nanopi@nanopi:~$ apt-show-versions mawk
mawk:arm64 1.3.3-17+b3 installed: No available version in archive
nanopi@nanopi:~$ apt-show-versions gawk
gawk not installed (not available)

Running the awk explicitly with mawk:

nanopi@nanopi:~$ mawk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed

BUT installing gawk

nanopi@nanopi:~$ sudo apt install gawk

nanopi@nanopi:~$ gawk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$
nanopi@nanopi:~$ pihole -q gstatic
 Match found in exact whitelist
   fonts.gstatic.com
 Match found in https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts:
   csi.gstatic.com 
 Match found in https://hosts-file.net/ad_servers.txt:
   csi.gstatic.com 
   p2-aahhyknavsj2m-wtnlrzkba6lht33q-if-v6exp3-v4.metric.gstatic.com 
   p2-f6rp6piuxns4u-uzq4vp76bu3w2tso-if-v6exp3-v4.metric.gstatic.com 
   p2-n3zurhre4jjvk-can5rb2f2a4urcxh-if-v6exp3-v4.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-i1-v6exp3-v4.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-i2-v6exp3-ds.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-s1-v6exp3-v4.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-if-v6exp3-v4.metric.gstatic.com 
   p5-lj5aujgj7jl7w-r2pmxqvndsgx2im2-931517-i1-v6exp3-v4.metric.gstatic.com 
   p5-lj5aujgj7jl7w-r2pmxqvndsgx2im2-931517-i2-v6exp3-ds.metric.gstatic.com 
   s6.netlogstatic.com 
   v6exp3-ds.metric.gstatic.com 
   v6exp3-v4.metric.gstatic.com 
 Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardMobileAds.txt:
   csi.gstatic.com 
 Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardDNS.txt:
   metric.gstatic.com 
   diagnose.igstatic.com 
 Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/EasyPrivacy3rdParty.txt:
   csi.gstatic.com 
   diagnose.igstatic.com 
 Match found in https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt:
   connectivitycheck.gstatic.comwf.cbarsrv.com 
   gstaticadssl.l.google.com 
   metric.gstatic.com 
   anycast.metric.gstatic.com 
   anycast-stb.metric.gstatic.com 
   anycast1.metric.gstatic.com 
   anycast1-stb.metric.gstatic.com 
   anycast2.metric.gstatic.com 
   anycast2-stb.metric.gstatic.com 
   ds.metric.gstatic.com 
   s-v6exp1-ds.metric.gstatic.com 
   s-v6exp1-v4.metric.gstatic.com 
   stbcast.metric.gstatic.com 
   stbcast-stb.metric.gstatic.com 
   stbcast2.metric.gstatic.com 
   stbcast2-stb.metric.gstatic.com 
   stbcast3.metric.gstatic.com 
   stbcast3-stb.metric.gstatic.com 
   stbcast4.metric.gstatic.com 
   stbcast4-stb.metric.gstatic.com 
   stbcast5.metric.gstatic.com 
   stbcast5-stb.metric.gstatic.com 
   test-ipv6-dot-com-v6exp3-v4.metric.gstatic.com 
   unicast.metric.gstatic.com 
   unicast-stb.metric.gstatic.com 
   unicast2.metric.gstatic.com 
   unicast2-stb.metric.gstatic.com 
   v4.metric.gstatic.com 
   v6exp3-ds.metric.gstatic.com 
   v6exp3-v4.metric.gstatic.com 

NO ERRORS!

There must be a difference between mawk and gawk resulting in the error we've seen.

Is pihole checking for awk during installation? Maybe extend to gawk?
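As an idea of what "extend to gawk" could mean in a script, here is a small sketch that prefers one implementation over another when both are installed. The helper name `first_available` is hypothetical, and this is not how the Pi-hole installer currently works:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the first command from the argument list that is
# actually installed; return 1 if none of them is found.
first_available() {
  local candidate
  for candidate in "$@"; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1
}

# Example use: prefer gawk, fall back to whatever "awk" points to.
# awk_cmd=$(first_available gawk awk) || echo "no awk implementation found" >&2
```

On Debian-based systems, `readlink -f "$(command -v awk)"` shows which implementation the plain `awk` command currently resolves to.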


It's very interesting that it's crapping out after only a single domain is passed.

Am I right in understanding at the moment that this only happens for select users? (as Dan was able to get an output without memory consumption issue)

We do awk, but that may not be needed now. I'll check the version table for bash and its internal regex matcher. If all the supported OS releases can do this in shell, then we'll do it in shell. I'll put a branch up as soon as I wake up fully and get some coffee. Testing for feature parity and speed of processing is the key.

This also would mean that there's no immediate need to change the wildcard entry, you can throw *.host.domain at it and it should work just the same.
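A rough idea of what doing it in shell could look like, using bash's built-in `=~` matcher in place of the awk one-liner. This is only a sketch under the assumption that the regexps are available one per line in a file; `match_regexps` is a made-up name and this is not the branch mentioned above:

```bash
#!/usr/bin/env bash
# Hypothetical sketch: print every regex (one per line in file "$2") that
# matches domain "$1", using bash's built-in matcher instead of awk.
match_regexps() {
  local domain=$1 regex_file=$2 regex
  while IFS= read -r regex || [[ -n $regex ]]; do
    # Skip blank lines: an empty regex would match every domain.
    [[ -z $regex ]] && continue
    # $regex stays unquoted so it is treated as a pattern, not a literal.
    if [[ $domain =~ $regex ]]; then
      printf '%s\n' "$regex"
    fi
  done < "$regex_file"
}
```

This avoids spawning any external process per domain or per regex, but whether it keeps up with awk on large lists would still need the speed testing mentioned above.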