Faulty regex consumes all memory and causes issues with pihole -q

# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"
        echo "DEBUG: This is lists[@]"
    for regexp in "${lists[@]}"; do
        echo "DEBUG: regexp ${regexp}"
        echo "DEBUG: -------"
    done
    echo "GUBED"

    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1

    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C

    # /dev/null forces filename to be printed when only one list has been generated
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) 
            for list in `echo "${lists}"`; do
                if [[ "${domain}" =~ ${list} ]]; then
                    printf "%b\n" "${list}";
                fi
            done;;
        *       ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}
pihole -q analytics.test.com
 Match found in regex whitelist
   DEBUG: This is lists[@]
   DEBUG: regexp (\.|^)*\.services\.generalmagic\.com$
   DEBUG: -------
   GUBED
 Match found in regex blacklist
   DEBUG: This is lists[@]
   DEBUG: regexp (\.|^)*\.metric\.gstatic\.com$
   ^analytics?[_.-]
   DEBUG: -------
   GUBED
   ^analytics?[_.-]
 Match found in regex whitelist
   DEBUG: This is lists[@]
   DEBUG: regexp (\.|^)*\.services\.generalmagic\.com$
   DEBUG: -------
   GUBED
   DEBUG2: This is echo (\.|^)*\.services\.generalmagic\.com$
   DEBUG2: regexp (\.|^)*\.services\.generalmagic\.com$
   DEBUG2: -------
   GUBED
 Match found in regex blacklist
   DEBUG: This is lists[@]
   DEBUG: regexp (\.|^)*\.metric\.gstatic\.com$
   ^analytics?[_.-]
   DEBUG: -------
   GUBED
   DEBUG2: This is echo (\.|^)*\.metric\.gstatic\.com$ ^analytics?[_.-]
   DEBUG2: regexp (\.|^)*\.metric\.gstatic\.com$
   DEBUG2: -------
   DEBUG2: regexp ^analytics?[_.-]
   DEBUG2: -------
   GUBED
   ^analytics?[_.-]
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"
        echo "DEBUG: This is lists[@]"
    for regexp in "${lists[@]}"; do
        echo "DEBUG: regexp ${regexp}"
        echo "DEBUG: -------"
    done
    echo "GUBED"

        echo "DEBUG2: This is echo ${lists}""
    for regexp in `echo "${lists}"`; do
        echo "DEBUG2: regexp ${regexp}"
        echo "DEBUG2: -------"
    done
    echo "GUBED"

Edit: If there is a better option than backticks I welcome it heartily! I only used backticks as a last resort.

Two secs - I am looking at this. I forgot it's /opt/pihole that actually houses the query script.

Edit: I stand corrected. Bash is weird. @DanSchaper only tagging you as this is an edit as opposed to a new post.

So, the current idea is to convert the mapfile to a multiline string and pass that to scanList, which will then use a while loop instead of the for loop with the legacy quotes.

Current issues:
For some reason the printf %s\n is ignored.

mmotti@ubuntu-server:~$ pihole -q analytics.ads.test.com
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-] ^analytics?[_.-]

scanList

# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"

    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1

    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C

    # /dev/null forces filename to be printed when only one list has been generated
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" "${lists}" /dev/null 2>/dev/null;;
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) \
            while IFS= read -r regexp; do
                if [[ "${domain}" =~ ${regexp} ]]; then
                    printf '%s\n' "${regexp}"
                fi
            done < <(echo "$lists");;
        *       ) grep -i "${esc_domain}" "${lists}" /dev/null 2>/dev/null;;
    esac
}

scanRegexDatabaseTable

scanRegexDatabaseTable() {
    local domain list
    domain="${1}"
    list="${2}"
    type="${3:-}"

    # Query all regex from the corresponding database tables
    mapfile -t regexList < <(sqlite3 "${gravityDBfile}" "SELECT domain FROM domainlist WHERE type = ${type}" 2> /dev/null)

    # If we have regexps to process
    if [[ "${#regexList[@]}" -ne 0 ]]; then
	# Split regexps over a new line
        str_regexList=$(printf '%s\n' "${regexList[@]}")
        # Check domain against regexps
        mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")

Thanks! I think I have a simple solution?
IFS is set to newline, iterate over the lists that is newline delimited.

scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"
        echo "DEBUG: This is lists[@]"
    IFS=$'\n'
    for regexp in ${lists}; do
        echo "DEBUG: regexp ${regexp}"
        echo "DEBUG: -------"
    done
    echo "GUBED"

        echo "DEBUG2: This is echo ${lists}"
    for regexp in `echo "${lists}"`; do
        echo "DEBUG2: regexp ${regexp}"
        echo "DEBUG2: -------"
    done
    echo "GUBED"
pihole -q analytics.test.com
 Match found in regex whitelist
   DEBUG: This is lists[@]
   DEBUG: regexp (\.|^)*\.services\.generalmagic\.com$
   DEBUG: -------
   GUBED
   DEBUG2: This is echo (\.|^)*\.services\.generalmagic\.com$
   DEBUG2: regexp (\.|^)*\.services\.generalmagic\.com$
   DEBUG2: -------
   GUBED
 Match found in regex blacklist
   DEBUG: This is lists[@]
   DEBUG: regexp (\.|^)*\.metric\.gstatic\.com$
   DEBUG: -------
   DEBUG: regexp ^analytics?[_.-]
   DEBUG: -------
   GUBED
   DEBUG2: This is echo (\.|^)*\.metric\.gstatic\.com$
   ^analytics?[_.-]
   DEBUG2: regexp (\.|^)*\.metric\.gstatic\.com$
   DEBUG2: -------
   DEBUG2: regexp ^analytics?[_.-]
   DEBUG2: -------
   GUBED
   ^analytics?[_.-]

Edit: Might not even need to adjust IFS in this case, just use the unquoted lists variable.

I haven't tested this, but if it works, it could be interesting? My preference (if it were my code), would be the while loop to iterate over the list though.

Also - I'm not sure how much of an issue it is, but if changing the IFS, most examples of people I have seen playing around with it take the starting value, change it, and then change it back to the starting value on completion. Might be worth tweaking slightly if you are taking this approach? :slight_smile:

I don't think we even need to adjust IFS now.

I trust your judgement. Bash is bloody weird. I remember entirely why I gave up with it now lol

Hahahaha, I know. I'm going to do some more test but with the help you provided I'm a lot further along.

1 Like

Okay, new changes up. I do like the while loop, but it's a bit easier to read with a loop over the unquoted variable.

Tip, use %b instead of the %s for something that may contain values that you don't see. And double for regex values. :slight_smile:

1 Like

Hey, if it works, I will happily concede for a simple for-loop.

Looks to work OK for me, boss!

mmotti@ubuntu-server:~$ pihole -q analytics.adserver.ads.test.com
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
   ^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
   ^analytics?[_.-]

That's with the crappy (\.|^)*\.services\.generalmagic\.com$ in my regex blacklist too.

I only have 125k domains though so would be useful for others to test with a higher domain count.

1 Like

Can you go to the blacklist (or whitelist) display on the web interface and enter *.adserver.ads.test.com as a wildcard type? That's the key for this fix.

It adds as (\.|^)*\.adserver\.ads\.test\.com$ which obviously won't match anything with the leading \.

mmotti@ubuntu-server:~$ pihole -q adserver.ads.test.com
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
   ^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]

If I manually adjust the shitty regex to an equally shitty regex (\.|^)*adserver\.ads\.test\.com$:

mmotti@ubuntu-server:~$ pihole -q adserver.ads.test.com
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
   ^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
   (\.|^)*adserver\.ads\.test\.com$

But, it doesn't crap out like awk did. So overall, for now, it looks like a win.

Edit edit edit edit:

I'm back on release v5.0 with the awk still in place and actually even with these shitty regexps, I'm not experiencing any crashing at all lol.

1 Like

That should match tryme.adserver.ads.test.com if it's doing things as I intended.

Ah! two secs

Edit:

You are correct

mmotti@ubuntu-server:~$ pihole -q tryme.adserver.ads.test.com
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
   ^(.+[_.-])?adse?rv(er?|ice)?s?[0-9]*[_.-]
   (\.|^)*\.adserver\.ads\.test\.com$
   (\.|^)*adserver\.ads\.test\.com$

But please do not encourage users to use these examples :rofl:

1 Like

Just a thought, too. It might be worth testing the performance when looping through thousands of regexps as no doubt some people with have a bazillion wildcards.

When you have to check x amount of wildcards against millions of domains, it could become troublesome.

Might be worth adding a warning notice that it could take time to run the query if the regexps exceed a certain amount, or better yet look to run the query and get the result straight from ftl?

This fix will work for now ofc but thinking long term, and communicating directly with ftl would remove the need to loop through everything is bash.

1 Like

nanopi@nanopi:~$ pihole -q tryme.adserver.ads.test.com
 Match found in regex blacklist
   (\.|^)*\.adserver\.ads\.test\.com$
1 Like

Agree, my main concern for moving to shell regex checking was the performance issue. This fix gets us in the door and working without having to wait to redo the web interface or other more drastic changes. Future options that work better are good to have.

1 Like

I think you found a solution to the issue.

This is post #100 ... one reason more to set it to solved :wink:

2 Likes

Fixed in Malformed wildcard blocking doesn't crash awk. by dschaper · Pull Request #3186 · pi-hole/pi-hole · GitHub

2 Likes