Faulty regex consumes all memory and causes issues with pihole -q

It seems to be killed while scanning the whitelist regex. Not sure why it is still scanning this list, because I disabled all regex for this test.

EDIT

Yes, it was the whitelist regex

(\.|^)*\.services\.generalmagic\.com$

Although it was already disabled, I removed it via the web UI, and pihole -q now runs very fast without error.

nanopi@nanopi:~$ pihole -q gstatic.com
  [i] No results found for gstatic.com within the block lists
nanopi@nanopi:~$ sudo bash -x /opt/pihole/query.sh gstatic.com
+ piholeDir=/etc/pihole
+ gravityDBfile=/etc/pihole/gravity.db
+ options=gstatic.com
+ all=
+ exact=
+ blockpage=
+ matchType=match
+ colfile=/opt/pihole/COL_TABLE
+ source /opt/pihole/COL_TABLE
++ [[ -t 1 ]]
+++ tput colors
++ [[ 256 -ge 8 ]]
++ COL_BOLD=''
++ COL_ULINE=''
++ COL_NC=''
++ COL_GRAY=''
++ COL_RED=''
++ COL_GREEN=''
++ COL_YELLOW=''
++ COL_BLUE=''
++ COL_PURPLE=''
++ COL_CYAN=''
++ COL_WHITE=''
++ COL_BLACK=''
++ COL_LIGHT_BLUE=''
++ COL_LIGHT_GREEN=''
++ COL_LIGHT_CYAN=''
++ COL_LIGHT_RED=''
++ COL_URG_RED=''
++ COL_LIGHT_PURPLE=''
++ COL_BROWN=''
++ COL_LIGHT_GRAY=''
++ COL_DARK_GRAY=''
++ TICK='[✓]'
++ CROSS='[✗]'
++ INFO='[i]'
++ QST='[?]'
++ DONE=' done!'
++ OVER='\r'
+ [[ gstatic.com == \-\h ]]
+ [[ gstatic.com == \-\-\h\e\l\p ]]
+ [[ gstatic.com == *\-\b\p* ]]
+ [[ gstatic.com == *\-\a\l\l* ]]
+ [[ gstatic.com == *\-\e\x\a\c\t* ]]
++ sed -E 's/ ?-(bp|adlists?|all|exact) ?//g'
+ options=gstatic.com
+ case "${options}" in
+ domainQuery=gstatic.com
+ [[ -n '' ]]
+ scanDatabaseTable gstatic.com whitelist 0
+ local domain table type querystr result extra
++ printf %q gstatic.com
+ domain=gstatic.com
+ table=whitelist
+ type=0
+ [[ whitelist == \g\r\a\v\i\t\y ]]
+ case "${exact}" in
+ querystr='SELECT domain,enabled FROM domainlist WHERE type = '\''0'\'' AND domain LIKE '\''%gstatic.com%'\'' ESCAPE '\''\'\'''
++ sqlite3 /etc/pihole/gravity.db 'SELECT domain,enabled FROM domainlist WHERE type = '\''0'\'' AND domain LIKE '\''%gstatic.com%'\'' ESCAPE '\''\'\'''
+ result=
+ [[ -z '' ]]
+ return
+ scanDatabaseTable gstatic.com blacklist 1
+ local domain table type querystr result extra
++ printf %q gstatic.com
+ domain=gstatic.com
+ table=blacklist
+ type=1
+ [[ blacklist == \g\r\a\v\i\t\y ]]
+ case "${exact}" in
+ querystr='SELECT domain,enabled FROM domainlist WHERE type = '\''1'\'' AND domain LIKE '\''%gstatic.com%'\'' ESCAPE '\''\'\'''
++ sqlite3 /etc/pihole/gravity.db 'SELECT domain,enabled FROM domainlist WHERE type = '\''1'\'' AND domain LIKE '\''%gstatic.com%'\'' ESCAPE '\''\'\'''
+ result=
+ [[ -z '' ]]
+ return
+ scanRegexDatabaseTable gstatic.com whitelist 2
+ local domain list
+ domain=gstatic.com
+ list=whitelist
+ type=2
+ mapfile -t regexList
++ sqlite3 /etc/pihole/gravity.db 'SELECT domain FROM domainlist WHERE type = 2'
+ [[ 0 -ne 0 ]]
+ scanRegexDatabaseTable gstatic.com blacklist 3
+ local domain list
+ domain=gstatic.com
+ list=blacklist
+ type=3
+ mapfile -t regexList
++ sqlite3 /etc/pihole/gravity.db 'SELECT domain FROM domainlist WHERE type = 3'
+ [[ 23 -ne 0 ]]
++ printf '%s\n' '^(.+[.-])?ad[sxv]?[0-9]*[.-]' '^(.+[.-])?adse?rv(er?|ice)?s?[0-9]*[.-]' '^(.+[.-])?telemetry[.-]' '^(www[0-9].)?xn–' '^adim(age|g)s?[0-9][.-]' '^adtrack(er|ing)?[0-9]*[.-]' '^advert(s|is(ing|ements?))?[0-9][.-]' '^aff(iliat(es?|ion))?[.-]' '^analytics?[.-]' '^banners?[.-]' '^beacons?[0-9][.-]' '^count(ers?)?[0-9]*[.-]' '^mads.' '^pixels?[-.]' '^stat(s|istics)?[0-9][_.-]' '^track(ers?|ing)?[0-9][_.-]' '^traff(ic)?[.-]' '^.*metric.*\..*\..*$' '^logs?\..*\..*$' '(^|[-_.]+)(m?a(d((vert(s|is(ing|e?ments?))?|im(age|g)s?)|([vx]|s(e?rv(er?|ice)?s?)?)|track(ers?|ing))?|ff(iliat(es?|ion))?|nalytics?)|b((anner|eacon)s?)|count(ers?)?|pixels?|stat(s|istics?)?|t(elemetry|ra(ffic|ck(ers?|ing))))[-_]*[0-9]*[-_.]' '^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]' '^(.+[_.-])?ad[sxv]?[0-9]*[_.-]' '(^|\.)asn\.advolution\.de$'
+ str_regexList='^(.+[.-])?ad[sxv]?[0-9]*[.-]
^(.+[.-])?adse?rv(er?|ice)?s?[0-9]*[.-]
^(.+[.-])?telemetry[.-]
^(www[0-9].)?xn–
^adim(age|g)s?[0-9][.-]
^adtrack(er|ing)?[0-9]*[.-]
^advert(s|is(ing|ements?))?[0-9][.-]
^aff(iliat(es?|ion))?[.-]
^analytics?[.-]
^banners?[.-]
^beacons?[0-9][.-]
^count(ers?)?[0-9]*[.-]
^mads.
^pixels?[-.]
^stat(s|istics)?[0-9][_.-]
^track(ers?|ing)?[0-9][_.-]
^traff(ic)?[.-]
^.*metric.*\..*\..*$
^logs?\..*\..*$
(^|[-_.]+)(m?a(d((vert(s|is(ing|e?ments?))?|im(age|g)s?)|([vx]|s(e?rv(er?|ice)?s?)?)|track(ers?|ing))?|ff(iliat(es?|ion))?|nalytics?)|b((anner|eacon)s?)|count(ers?)?|pixels?|stat(s|istics?)?|t(elemetry|ra(ffic|ck(ers?|ing))))[-_]*[0-9]*[-_.]
^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
(^|\.)asn\.advolution\.de$'
+ mapfile -t regexMatches
++ scanList gstatic.com '^(.+[.-])?ad[sxv]?[0-9]*[.-]
^(.+[.-])?adse?rv(er?|ice)?s?[0-9]*[.-]
^(.+[.-])?telemetry[.-]
^(www[0-9].)?xn–
^adim(age|g)s?[0-9][.-]
^adtrack(er|ing)?[0-9]*[.-]
^advert(s|is(ing|ements?))?[0-9][.-]
^aff(iliat(es?|ion))?[.-]
^analytics?[.-]
^banners?[.-]
^beacons?[0-9][.-]
^count(ers?)?[0-9]*[.-]
^mads.
^pixels?[-.]
^stat(s|istics)?[0-9][_.-]
^track(ers?|ing)?[0-9][_.-]
^traff(ic)?[.-]
^.*metric.*\..*\..*$
^logs?\..*\..*$
(^|[-_.]+)(m?a(d((vert(s|is(ing|e?ments?))?|im(age|g)s?)|([vx]|s(e?rv(er?|ice)?s?)?)|track(ers?|ing))?|ff(iliat(es?|ion))?|nalytics?)|b((anner|eacon)s?)|count(ers?)?|pixels?|stat(s|istics?)?|t(elemetry|ra(ffic|ck(ers?|ing))))[-_]*[0-9]*[-_.]
^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
(^|\.)asn\.advolution\.de$' regex
++ local domain=gstatic.com 'esc_domain=gstatic\.com' 'lists=^(.+[.-])?ad[sxv]?[0-9]*[.-]
^(.+[.-])?adse?rv(er?|ice)?s?[0-9]*[.-]
^(.+[.-])?telemetry[.-]
^(www[0-9].)?xn–
^adim(age|g)s?[0-9][.-]
^adtrack(er|ing)?[0-9]*[.-]
^advert(s|is(ing|ements?))?[0-9][.-]
^aff(iliat(es?|ion))?[.-]
^analytics?[.-]
^banners?[.-]
^beacons?[0-9][.-]
^count(ers?)?[0-9]*[.-]
^mads.
^pixels?[-.]
^stat(s|istics)?[0-9][_.-]
^track(ers?|ing)?[0-9][_.-]
^traff(ic)?[.-]
^.*metric.*\..*\..*$
^logs?\..*\..*$
(^|[-_.]+)(m?a(d((vert(s|is(ing|e?ments?))?|im(age|g)s?)|([vx]|s(e?rv(er?|ice)?s?)?)|track(ers?|ing))?|ff(iliat(es?|ion))?|nalytics?)|b((anner|eacon)s?)|count(ers?)?|pixels?|stat(s|istics?)?|t(elemetry|ra(ffic|ck(ers?|ing))))[-_]*[0-9]*[-_.]
^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
(^|\.)asn\.advolution\.de$' type=regex
++ cd /etc/pihole
++ export LC_CTYPE=C
++ LC_CTYPE=C
++ case "${type}" in
+++ echo '^(.+[.-])?ad[sxv]?[0-9]*[.-]
^(.+[.-])?adse?rv(er?|ice)?s?[0-9]*[.-]
^(.+[.-])?telemetry[.-]
^(www[0-9].)?xn–
^adim(age|g)s?[0-9][.-]
^adtrack(er|ing)?[0-9]*[.-]
^advert(s|is(ing|ements?))?[0-9][.-]
^aff(iliat(es?|ion))?[.-]
^analytics?[.-]
^banners?[.-]
^beacons?[0-9][.-]
^count(ers?)?[0-9]*[.-]
^mads.
^pixels?[-.]
^stat(s|istics)?[0-9][_.-]
^track(ers?|ing)?[0-9][_.-]
^traff(ic)?[.-]
^.*metric.*\..*\..*$
^logs?\..*\..*$
(^|[-_.]+)(m?a(d((vert(s|is(ing|e?ments?))?|im(age|g)s?)|([vx]|s(e?rv(er?|ice)?s?)?)|track(ers?|ing))?|ff(iliat(es?|ion))?|nalytics?)|b((anner|eacon)s?)|count(ers?)?|pixels?|stat(s|istics?)?|t(elemetry|ra(ffic|ck(ers?|ing))))[-_]*[0-9]*[-_.]
^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
(^|\.)asn\.advolution\.de$'
++ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' /dev/fd/63 /dev/fd/62
+++ echo gstatic.com
+ [[ 0 -ne 0 ]]
+ mapfile -t results
++ scanDatabaseTable gstatic.com gravity
++ local domain table type querystr result extra
+++ printf %q gstatic.com
++ domain=gstatic.com
++ table=gravity
++ type=
++ [[ gravity == \g\r\a\v\i\t\y ]]
++ case "${exact}" in
++ querystr='SELECT gravity.domain,adlist.address,adlist.enabled FROM gravity LEFT JOIN adlist ON adlist.id = gravity.adlist_id WHERE domain LIKE '\''%gstatic.com%'\'' ESCAPE '\''\'\'''
+++ sqlite3 /etc/pihole/gravity.db 'SELECT gravity.domain,adlist.address,adlist.enabled FROM gravity LEFT JOIN adlist ON adlist.id = gravity.adlist_id WHERE domain LIKE '\''%gstatic.com%'\'' ESCAPE '\''\'\'''
++ result=
++ [[ -z '' ]]
++ return
+ [[ -z '' ]]
+ [[ -z '' ]]
+ [[ -z '' ]]
+ echo -e '  [i] No results found for gstatic.com within the block lists'
  [i] No results found for gstatic.com within the block lists
+ exit 0

I re-enabled all adlists and the black/whitelist (without the specified whitelist regex) and it worked without issues.
I'm not very good at regex, is this a faulty one?

(\.|^)*\.services\.generalmagic\.com

If I remember correctly, I created it with the Pi-hole web UI by entering *.service*.generalmagic*.com and setting the type to 'wildcard whitelist'.

Judging from the results, yes.

https://docs.pi-hole.net/ftldns/regex/tutorial/

Wildcard is meant for plain domains: the wildcard then applies to the subdomains of that domain. Entering regex characters like * and choosing wildcard tries to wildcard that literal value.

That says literally “Zero or more (\.|^) chars” which is not right.

regexr.com/4v4vl

Edit: copy paste breaks the formatting but the regexr link shows what I mean.

Thanks for the answer. I think I misinterpreted wildcard - I'm used to reading it as 'I can use wildcard characters'.
I tried to match domains like

content1.ro.m71os.services.generalmagic.com
shop1.m7.services.generalmagic.com
pubsub1.ro.m71os.services.generalmagic.com
overlays1.ro.m71os.services.generalmagic.com

I think using services.generalmagic.com as wildcard whitelist will be sufficient.
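A quick way to check that plan, as a minimal sketch: it assumes the wildcard entry services.generalmagic.com ends up stored as (\.|^)services\.generalmagic\.com$ (inferred from the regex quoted earlier in this thread, not confirmed against Pi-hole's code). The helper name matches_wildcard is made up for illustration, and bash's regex engine is not necessarily the one Pi-hole uses.

```bash
#!/usr/bin/env bash
# Assumed stored form of the wildcard entry "services.generalmagic.com".
regex='(\.|^)services\.generalmagic\.com$'

# Hypothetical helper: succeeds if the domain matches the regex above.
matches_wildcard() {
  [[ $1 =~ $regex ]]
}

domains=(
  content1.ro.m71os.services.generalmagic.com
  shop1.m7.services.generalmagic.com
  pubsub1.ro.m71os.services.generalmagic.com
  overlays1.ro.m71os.services.generalmagic.com
)

for domain in "${domains[@]}"; do
  if matches_wildcard "$domain"; then
    printf 'match:    %s\n' "$domain"
  else
    printf 'no match: %s\n' "$domain"
  fi
done
```

Under these assumptions all four domains should be reported as matches, while an unrelated domain such as gstatic.com should not.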

I don't think pihole should have a full-grown regex validator (in terms of 'This is what the regex means') but maybe a simple syntax checker ('This is not a valid regex and will never match any domain.') - especially as a non-valid regex can break pihole's scripts. Or pihole should be more robust against faulty regex.
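To illustrate what I mean by a simple syntax checker, here is a rough sketch. It uses grep's extended-regex engine purely as a stand-in (Pi-hole's own engine may accept or reject different patterns), and the function name regex_compiles is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: returns 0 if grep's ERE engine can compile the pattern.
# grep exits 0 on a match, 1 on no match and 2 on an error such as a bad regex.
regex_compiles() {
  local rc=0
  grep -Eq -e "$1" <<<'' 2>/dev/null || rc=$?
  (( rc < 2 ))
}

for pattern in '(^|\.)asn\.advolution\.de$' '(unclosed'; do
  if regex_compiles "$pattern"; then
    printf 'compiles:         %s\n' "$pattern"
  else
    printf 'does not compile: %s\n' "$pattern"
  fi
done
```

A check like this only catches patterns that fail to compile. It cannot tell whether a pattern that does compile matches what the user intended.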

edit
Changed the topic's title to what I think reflects the issue better.

Thank you, that will help with users searching the forums.

Regex is a very powerful tool and can be used to customize the user experience. It's not easy to validate regex but I agree that having some kind of check would be helpful. In this case it would not have applied since the input was for wildcard values and should not have contained any regex values at all. This particular issue could be helped with checking for input that should only contain chars found in domain names. Opening a bug report on our core or adminlte repositories would be helpful in tracking that specific case.
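As a rough idea of what such an input check could look like, here is a minimal sketch. The function name is_plain_domain is hypothetical, and the limits (labels of 1 to 63 letters, digits or hyphens, 253 characters in total) follow the usual hostname rules rather than anything taken from Pi-hole's code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: accepts only dot-separated labels made of letters,
# digits and hyphens (1-63 chars each, no leading or trailing hyphen), with a
# total length of at most 253 characters. Pi-hole's own validation may differ.
is_plain_domain() {
  local LC_ALL=C                 # keep the A-Z / a-z ranges ASCII-only
  local domain=$1
  local label='[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?'
  local re="^${label}(\\.${label})*\$"
  (( ${#domain} >= 1 && ${#domain} <= 253 )) || return 1
  [[ $domain =~ $re ]]
}

for input in 'services.generalmagic.com' '*.service*.generalmagic*.com'; do
  if is_plain_domain "$input"; then
    printf 'accepted: %s\n' "$input"
  else
    printf 'rejected: %s\n' "$input"
  fi
done
```

With this sketch the plain domain is accepted and the entry containing * is rejected before it could ever be converted to a regex.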

I selected "Type: Wildcard Whitelist" but it appeared as "Regex Whitelist". Not sure if it should be like this - it is automatically converted to a regex style. See picture for test.domain - it was added as wildcard whitelist but appears as Regex Whitelist.

Will do that.

Yes, that is correct. Wildcards are converted to regex value when committed.
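For anyone wondering what that conversion roughly looks like, here is a small sketch. It is inferred from the regex quoted in this thread and is not Pi-hole's actual code; the function name wildcard_to_regex is made up.

```bash
#!/usr/bin/env bash
# Hypothetical re-implementation of the wildcard-to-regex step, inferred from
# the regex quoted in this thread; it is not Pi-hole's actual code.
wildcard_to_regex() {
  local escaped=${1//./\\.}     # escape every dot so it matches literally
  printf '(\\.|^)%s$\n' "$escaped"
}

wildcard_to_regex 'services.generalmagic.com'
# prints (\.|^)services\.generalmagic\.com$
wildcard_to_regex '*.services.generalmagic.com'
# prints (\.|^)*\.services\.generalmagic\.com$
```

The second call shows how a leading * in a wildcard entry would end up directly behind the (\.|^) group, which is the shape of the regex reported at the top of the topic.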

Issue opened:

This is functionally the same issue as: Pi-hole becomes unresponsive during pihole -d and -q

Faulty regex causes -q (and, by extension -d) to stall out

(You're not alone!)

Thanks for the hint :)

As the root cause was the same with us I opened a new topic on that issue:

4 posts were merged into an existing topic: Rename "wildcard" blacklist/whitelist because it is misleading and lead to regex errors

As mentioned in the other thread, this has already been in the works for quite some time but will likely not hit v5.0, so as not to push back the release unnecessarily long.

Thanks for this, however, we will simply reuse the inbuilt PHP domain validator we already used before. It also checks for the maximum length of subdomains, etc.

Can we call this the solution to the topic then?

Depends on how you (or pihole team) interpret "solution": issue acknowledged and code will be written or code is written and functioning.

For me it's solved - I can't contribute any further except for testing the patch when it's ready.

Can I get a teleporter tarball from someone that is seeing this so I can work on it?

I think the glitch may be in something else. As noted on the GitHub issue, calling awk directly doesn't cause a crash, and with shell regex matching everything seems to be okay as well. I don't think there actually is a problem with using a * in a wildcard entry, as that will always be the first char and will always end up as (\.|^)*. I think that's okay, it just means "Zero or more dots or start of lines".

Checking with a small script and using shell regex instead of awk looks hopeful?

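#!/usr/bin/env bash
# Test each domain against the converted wildcard regex using bash's =~ operator.
domains=(gstatic.com generalmagic.com services.generalmagic.com me.services.generalmagic.com)
# Single quotes keep the backslashes and the trailing $ literal.
regex='(\.|^)*\.services\.generalmagic\.com$'

for domain in "${domains[@]}"; do
  # $regex must stay unquoted here, otherwise bash matches it as a literal string.
  if [[ $domain =~ $regex ]]; then
    printf 'Found %s in %s.\n' "$domain" "$regex"
  else
    printf 'Did not find %s in %s.\n' "$domain" "$regex"
  fi
done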

OUT:

Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.

Edit:

For completeness and ease of discussion, here's the awk:

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "gstatic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "generalmagic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "services.generalmagic.com")

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$

dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$

I can try and carve out a very small RAM lxc to check for memory starvation, 1G is the minimum on Digital Ocean and that may not show what we are looking for.

Edit: 1G RAM DigitalOcean droplet and I can't get it to crash or hang at all. I'll need something to duplicate with to try and work this down.
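A lighter-weight way to probe for memory starvation without a dedicated container might be to cap the awk call itself. This is only a rough sketch: the helper name match_capped is hypothetical, and the 1 GiB and 10 second limits are arbitrary values picked to mirror the droplet size above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run the awk matcher from this thread under a memory
# and time cap, so a runaway match fails fast instead of exhausting RAM.
# The limits are arbitrary illustration values, not Pi-hole settings.
match_capped() {
  local regex=$1 domain=$2
  (
    ulimit -v 1048576           # cap virtual memory (KiB) for this subshell only
    timeout 10 awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
      <(printf '%s\n' "$regex") <(printf '%s\n' "$domain")
  )
}

match_capped '(\.|^)*\.services\.generalmagic\.com$' 'me.services.generalmagic.com'
```

If awk hits either cap, the subshell exits with a non-zero status instead of taking the whole machine down, which might make the failure easier to capture on an affected system.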

The awk is from Fix for regexp queries through pihole -q by mmotti · Pull Request #2780 · pi-hole/pi-hole · GitHub and the PR author may have some thoughts on what/why this is being seen.