Faulty regex consumes all memory and causes issues with pihole -q

I can try to carve out an LXC container with very little RAM to check for memory starvation. 1G is the minimum on DigitalOcean, and that may not show what we are looking for.

Edit: On a 1G RAM DigitalOcean droplet I can't get it to crash or hang at all. I'll need a way to reproduce this before I can narrow it down.

The awk is from "Fix for regexp queries through pihole -q" by mmotti (Pull Request #2780, pi-hole/pi-hole on GitHub), and the PR author may have some thoughts on what is being seen and why.

paging @mmotti

Hmm, to be completely honest, I'm not sure. My dealings with awk are quite limited, and I only started playing around with it for this purpose because it seemed to be much quicker than the alternatives once a significant number of regexps were involved. The main alternative was running grep in a loop for each regexp and checking the return code, since grep doesn't seem to have the ability to display the pattern that caused a match.
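For illustration, the grep-in-a-loop approach mentioned above might look roughly like the sketch below. The function name match_with_grep and the sample patterns are made up for this example (they are not from pihole), and the patterns are deliberately simple ones rather than the problematic regex from this thread.

```bash
#!/usr/bin/env bash

# Hypothetical helper: print every regex from the argument list that
# matches the given domain. grep itself cannot report which pattern
# matched, so each pattern gets its own grep call.
match_with_grep() {
    local domain="$1"
    shift
    local re
    for re in "$@"; do
        # grep -E exits 0 on a match; -q suppresses its output
        if printf '%s\n' "${domain}" | grep -qE -- "${re}"; then
            printf '%s\n' "${re}"
        fi
    done
}

# Sample call with two illustrative patterns
match_with_grep "me.services.generalmagic.com" \
    '(^|\.)generalmagic\.com$' \
    '^ads\.'
```

This spawns one grep per pattern for every domain checked, which is the kind of overhead that made a single awk pass look attractive for large lists.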

(\.|^)* seems like particularly bad regex, and I wonder if it could be classed as 'evil regex', or regex denial of service. The likely issue is the caret used within a group (not great practice anyway), further complicated by the * quantifier. It could be stuck evaluating a huge number of possibilities for each domain and dying as a result?

I'm away from my laptop at the moment, but does grep or the more generic bash approach have the same outcome when run against the same number of domains? Is the performance as good when running a for loop in bash? I'm assuming the domain count for the people experiencing issues is in the millions, but I haven't read every detail of the thread just yet.

Sadly, I'm not sure what to suggest with this, particularly with awk. The problem with regex generally is that as long as the pattern is technically compilable, the processor (awk in this case) is at the mercy of the user.

Tried it myself.

With the script everything is fine, running in just ~1 sec.

nanopi@nanopi:~$ ./test 
Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.

But the awk got killed again due to high memory consumption:

nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed


As it seems to come down to awk, I checked what was installed:

nanopi@nanopi:~$ apt -s install awk
NOTE: This is only a simulation!
      apt needs root privileges for real execution.
      Keep also in mind that locking is deactivated,
      so don't depend on the relevance to the real current situation!
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Package awk is a virtual package provided by:
  original-awk:armhf 2012-12-20-6
  mawk:armhf 1.3.3-17+b3
  gawk:armhf 1:4.2.1+dfsg-1
  original-awk 2012-12-20-6
  mawk 1.3.3-17+b3
  gawk 1:4.2.1+dfsg-1
You should explicitly select one to install.
nanopi@nanopi:~$ sudo apt-show-versions awk
awk not installed (not available)
nanopi@nanopi:~$ apt-show-versions mawk
mawk:arm64 1.3.3-17+b3 installed: No available version in archive
nanopi@nanopi:~$ apt-show-versions gawk
gawk not installed (not available)

Running the awk explicitly with mawk:

nanopi@nanopi:~$ mawk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed

BUT installing gawk

nanopi@nanopi:~$ sudo apt install gawk

nanopi@nanopi:~$ gawk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$
nanopi@nanopi:~$ pihole -q gstatic
 Match found in exact whitelist
   fonts.gstatic.com
 Match found in https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts:
   csi.gstatic.com 
 Match found in https://hosts-file.net/ad_servers.txt:
   csi.gstatic.com 
   p2-aahhyknavsj2m-wtnlrzkba6lht33q-if-v6exp3-v4.metric.gstatic.com 
   p2-f6rp6piuxns4u-uzq4vp76bu3w2tso-if-v6exp3-v4.metric.gstatic.com 
   p2-n3zurhre4jjvk-can5rb2f2a4urcxh-if-v6exp3-v4.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-i1-v6exp3-v4.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-i2-v6exp3-ds.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-s1-v6exp3-v4.metric.gstatic.com 
   p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-if-v6exp3-v4.metric.gstatic.com 
   p5-lj5aujgj7jl7w-r2pmxqvndsgx2im2-931517-i1-v6exp3-v4.metric.gstatic.com 
   p5-lj5aujgj7jl7w-r2pmxqvndsgx2im2-931517-i2-v6exp3-ds.metric.gstatic.com 
   s6.netlogstatic.com 
   v6exp3-ds.metric.gstatic.com 
   v6exp3-v4.metric.gstatic.com 
 Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardMobileAds.txt:
   csi.gstatic.com 
 Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardDNS.txt:
   metric.gstatic.com 
   diagnose.igstatic.com 
 Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/EasyPrivacy3rdParty.txt:
   csi.gstatic.com 
   diagnose.igstatic.com 
 Match found in https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt:
   connectivitycheck.gstatic.comwf.cbarsrv.com 
   gstaticadssl.l.google.com 
   metric.gstatic.com 
   anycast.metric.gstatic.com 
   anycast-stb.metric.gstatic.com 
   anycast1.metric.gstatic.com 
   anycast1-stb.metric.gstatic.com 
   anycast2.metric.gstatic.com 
   anycast2-stb.metric.gstatic.com 
   ds.metric.gstatic.com 
   s-v6exp1-ds.metric.gstatic.com 
   s-v6exp1-v4.metric.gstatic.com 
   stbcast.metric.gstatic.com 
   stbcast-stb.metric.gstatic.com 
   stbcast2.metric.gstatic.com 
   stbcast2-stb.metric.gstatic.com 
   stbcast3.metric.gstatic.com 
   stbcast3-stb.metric.gstatic.com 
   stbcast4.metric.gstatic.com 
   stbcast4-stb.metric.gstatic.com 
   stbcast5.metric.gstatic.com 
   stbcast5-stb.metric.gstatic.com 
   test-ipv6-dot-com-v6exp3-v4.metric.gstatic.com 
   unicast.metric.gstatic.com 
   unicast-stb.metric.gstatic.com 
   unicast2.metric.gstatic.com 
   unicast2-stb.metric.gstatic.com 
   v4.metric.gstatic.com 
   v6exp3-ds.metric.gstatic.com 
   v6exp3-v4.metric.gstatic.com 

NO ERRORS!

There must be a difference between mawk and gawk resulting in the error we've seen.

Is pihole checking for awk during installation? Maybe extend to gawk?
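One quick way to see which implementation the awk command actually resolves to on a Debian-style system is to follow the symlink chain. This is only a small sketch: the helper name resolve_impl is made up here, and it assumes readlink -f is available (as in GNU coreutils).

```bash
#!/usr/bin/env bash

# Hypothetical helper: resolve a command name to the binary it finally
# points at (e.g. awk -> /etc/alternatives/awk -> an actual awk variant)
# and print that binary's file name.
resolve_impl() {
    local path
    path=$(command -v "$1") || return 1
    basename "$(readlink -f "${path}")"
}

# Show which awk variant is behind the generic "awk" command
resolve_impl awk || echo "awk not found"
```

On a system like the one above, where only mawk is installed, this would be expected to print mawk.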


It's very interesting that it's crapping out after only a single domain is passed.

Am I right in understanding that, at the moment, this only happens for select users (as Dan was able to get output without the memory consumption issue)?

We do awk, but that may not be needed now. I'll check the version table for bash and its internal regex matcher. If all the supported OS releases can do this in shell, then we'll do it in shell. I'll put a branch up as soon as I wake up fully and get some coffee. Testing for feature parity and speed of processing is the key.

This also would mean that there's no immediate need to change the wildcard entry, you can throw *.host.domain at it and it should work just the same.

Okay, pushed a core branch fix/awkInQuery to test. Looks okay from my very quick check.

pihole checkout core fix/awkInQuery and see what you can do with it.

Okay, which of the above lines is not correct so I can adjust accordingly?

Ah okay, my view is that entering *.host.domain in as a wildcard means you want to block a.host.domain b.host.domain and not host.domain.

I'm going by what users would or could enter in the screens. The implementation is up to us to do behind the scenes what we think the users intend.

Entering *.host.domain as Wildcard means block any subhost of host.domain. To actually block host.domain and subhosts then enter host.domain as wildcard.
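The intended behaviour described here could be sketched as a small conversion from a wildcard entry to a regex. This is not pihole's actual implementation. The function name wildcard_to_regex and the exact regex shapes it produces are assumptions made purely to illustrate the two cases.

```bash
#!/usr/bin/env bash

# Hypothetical sketch: turn a user-entered wildcard into a regex.
#   "*.host.domain" -> subdomains of host.domain only
#   "host.domain"   -> host.domain itself and any subdomain
wildcard_to_regex() {
    local entry="$1" escaped
    if [[ "${entry}" == \*.* ]]; then
        # Strip the leading "*." and escape the remaining dots
        escaped=$(printf '%s' "${entry#\*.}" | sed 's/\./\\./g')
        printf '%s%s%s\n' '\.' "${escaped}" '$'
    else
        escaped=$(printf '%s' "${entry}" | sed 's/\./\\./g')
        printf '%s%s%s\n' '(^|\.)' "${escaped}" '$'
    fi
}

wildcard_to_regex '*.host.domain'
wildcard_to_regex 'host.domain'
```

Note that neither generated pattern puts a * quantifier on a group containing the caret, which is the construct suspected earlier in this thread.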

That keeps things very similar to version 4's wildcards and no user behavior needs to change. They can enter a * in wildcard and the intent will be seen and acted upon. No rejection telling them the entry is invalid thus causing more help requests for "Why is *.host.domain invalid??".

Why though? It works and it's what the user intended. I'm the last person to add code that lectures users on what is proper or improper. If we can use what data is entered and give the expected result then that's what I'm concerned with.

Edit: Can we keep things on topic and discuss this problem and the potential solution? I really don't want this thread to go 100+ posts.

That's an option to consider, yes. I don't like it myself since users are not going to learn proper regex, that's why we added the helper in the first place.

I'd like to see if bash regex matching works as well/better than awk though, no matter what the ending solution happens to be.


Agree with this. And backing up with a mod request to keep the discussion in this thread based around the proposed fix for the reported issue.

We can talk about semantics in another thread, another time. :)

There is already a discussion about the naming/purpose of the wildcard feature, as @yubiuser linked above.

# Split regexps over a new line
str_regexList=$(printf '%s\n' "${regexList[@]}")
# Check domain against regexps
mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")
 "regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;

It's been a while since I have looked through this code or done any bash / shell scripting, but does this return a matching pattern or all regexps if a match is found?

Also, does this work as expected? I've not done native bash pattern matching, but does it definitely iterate through each item in the list like grep/awk?

Sorry if these are silly questions. I'm not able to test currently as I'm away from home.

Edit 2: Before I left, I was able to verify that the bad regex killed my awk too on my rpi 3b.

It returns the regex if the domain matches the regex. There isn't a list of regex passed in to the function, the only thing it takes is a domain and a single regex. Looping through the contents of the list of regexes happens elsewhere.
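A minimal sketch of that split, assuming a simplified stand-in for the "regex" branch: the function name match_one, the sample regexList contents, and the outer loop are all made up for illustration and are not the code on the branch.

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for the "regex" case: takes one domain and one
# regex, and prints the regex only if the domain matches it.
match_one() {
    local domain="$1" regex="$2"
    # The right-hand side is left unquoted so bash treats it as a regex
    if [[ "${domain}" =~ ${regex} ]]; then
        printf '%s\n' "${regex}"
    fi
}

# The iteration over the list happens in the caller, not in the matcher
regexList=('(^|\.)generalmagic\.com$' '^ads\.')
for re in "${regexList[@]}"; do
    match_one "me.services.generalmagic.com" "${re}"
done
```

Each call prints at most one line, so collecting the output of the loop yields exactly the patterns that matched.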

Seems to, but that's why I'm asking for people to test it and see what they can do with it.

Not silly questions at all, no worries.

Thanks, I think it's down to gawk/mawk/awk variants and the different optimizations and internal state engines.

Ah, I see! I may have only looked at the most recent commit on that branch. My mistake.

Will try to test later.

Happy to have any test results.

My statement of how the function is used is based on my understanding and checking out the bash -x output. The function scanList() is called as below:

scanList gstatic.com '(\.|^)*\.services\.generalmagic\.com$' regex

FuncName ${1} ${2} ${3}
scanList() gstatic.com (\.|^)*\.services\.generalmagic\.com$ regex

Nothing is wrong with that. The error was in the next paragraph of my post about your awk commands.