Faulty regex consumes all memory and causes issues with pihole -q

Okay, which of the above lines is not correct so I can adjust accordingly?

Ah okay, my view is that entering *.host.domain in as a wildcard means you want to block a.host.domain b.host.domain and not host.domain.

I'm going by what users would or could enter in the screens. The implementation is up to us to do behind the scenes what we think the users intend.

Entering *.host.domain as Wildcard means block any subhost of host.domain. To actually block host.domain and subhosts then enter host.domain as wildcard.

That keeps things very similar to version 4's wildcards and no user behavior needs to change. They can enter a * in wildcard and the intent will be seen and acted upon. No rejection telling them the entry is invalid thus causing more help requests for "Why is *.host.domain invalid??".

Why though? It works and it's what the user intended. I'm the last person to add code that lectures users on what is proper or improper. If we can use what data is entered and give the expected result then that's what I'm concerned with.

Edit: Can we keep things on topic and discuss this problem and the potential solution? I really don't want this thread to go 100+ posts.

That's an option to consider, yes. I don't like it myself since users are not going to learn proper regex, that's why we added the helper in the first place.

I'd like to see if bash regex matching works as well/better than awk though, no matter what the ending solution happens to be.

1 Like

Agree with this. And backing up with a mod request to keep the discussion in this thread based around the proposed fix for the reported issue.

We can talk about semantics in another thread, another time. :slight_smile:

There is already a discussion based around the naming /purpose of the wildcard feature as @yubiuser linked above:

# Split regexps over a new line
str_regexList=$(printf '%s\n' "${regexList[@]}")
# Check domain against regexps
mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")
 "regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;

It's been a while since I have looked through this code or done any bash / shell scripting, but does this return a matching pattern or all regexps if a match is found?

Also, does this work as expected? I've not done native bash pattern matching, but does it definitely iterate through each item in the list like grep/awk?

Sorry if these are silly questions. I'm not able to test currently as away from home.

Edit 2: Before I left, I was able to verify that the bad regex killed my awk too on my rpi 3b.

It returns the regex if the domain matches the regex. There isn't a list of regex passed in to the function, the only thing it takes is a domain and a single regex. Looping through the contents of the list of regexes happens elsewhere.

Seems to, but that's why I'm asking for people to test it and see what they can do with it.

Not silly questions at all, no worries.

Thanks, I think it's down to gawk/mawk/awk variants and the different optimizations and internal state engines.

Ah, I see! I may have only looked at the most recent commit on that branch. My mistake.

Will try test later

Happy to have any test results.

My statement of how the function is used is based on my understanding and checking out the bash -x output. The function scanList() is called as below:

scanList gstatic.com '(\.|^)*\.services\.generalmagic\.com$' regex

FuncName ${1} ${2} ${3}
scanList() gstatic.com (\.|^)*\.services\.generalmagic\.com$ regex

Nothing is wrong with that. The error was in the next paragraph of my post about your awk commands.

Great, that's what I'd like. The quote of the bash script output was for msatter to find the output he had issues with.

Edit: Clarity.

I tried the fixing branch (fix/awkInQuery)- and it worked well. Fast, no high memory consumption, no process killed.

Is this definitely the case? Unless things are drastically different from when I last looked some time ago, the process used to be as follows (some code omitted):

  1. Get regexps from the db
mapfile -t regexList < <(sqlite3 "${gravityDBfile}" "SELECT domain FROM domainlist WHERE type = ${type}" 2> /dev/null)
  1. Add regexps to a string (one each line)
str_regexList=$(printf '%s\n' "${regexList[@]}")
  1. Call the scanList function with a single domain (user input) and the multiline regex string from the db
 mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")
  1. Iterate through each regex in the multiline string and add to a regexps array, then iterate through each regexp and check if the domain matches (print pattern if it does)
 "regex" ) awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
                  <(echo "${lists}") <(echo "${domain}") 2>/dev/null;;

So as I understand, the loop is actually done there.

So if we take:

"regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;

${lists} at this point should be a multiline string of regexps.

As I said, I could be way off the mark here and I have been wrong many times in the past, but this is how I remember the script functioning previously.

RESULTS BELOW

Release v5

mmotti@ubuntu-server:~$ pihole -q ads.test.bbc.co.uk
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
mmotti@ubuntu-server:~$ pihole -q analytics.test.co
 Match found in regex blacklist
   ^analytics?[_.-]

New branch

Regexps that should match:

  • ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
  • ^analytics?[_.-]
mmotti@ubuntu-server:~$ pihole -q ads.test.bbc.co.uk
  [i] No results found for ads.test.bbc.co.uk within the block lists
mmotti@ubuntu-server:~$ pihole -q analytics.test.com
  [i] No results found for analytics.test.com within the block lists

I was confused as well. I couldn't understand the reason for arrays or looping inside the function since the function is only passed a single regex.

Proposed function:

# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"

    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1

    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C

    # /dev/null forces filename to be printed when only one list has been generated
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;
        *       ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}

Current function:

# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"

    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1

    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C

    # /dev/null forces filename to be printed when only one list has been generated
    # shellcheck disable=SC2086
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
                  <(echo "${lists}") <(echo "${domain}") 2>/dev/null;;
        *       ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}

Hmm, looks like the regex isn't ever passed to the queryFunc?

+ resolver=pihole-FTL
+ [[ 2 = 0 ]]
+ case "${1}" in
+ [[ ! 0 -eq 0 ]]
+ case "${1}" in
+ queryFunc -q analytics.test.co
+ shift
+ /opt/pihole/query.sh analytics.test.co
  [i] No results found for analytics.test.co within the block lists
+ exit 0

Ah I see it! Okay, quick fix coming.

I must admit, I am unclear on the exact inner workings of the rest of the script(s) and/or functions as I was entirely focused on query.sh when making changes initially, as that's where WaLLy3K had made some initial progress; the method being (if I recall correctly) to match with grep and then if there was a match, try to do a reverse match but it was fiddly and didn't quite work properly with things being escaped etc.

See pull request in question

So it's possible that I may have missed functions that could have required information being passed to.

I still believe that the function in question here is passed a list of regexps, and not a single one. Unless I am really losing my marbles lol

Nope, marbles are intact! I'll do a short loop over the list as a space delimited array. I'll have a patch up in a few minutes.

Edit: Adding set -x to the top of query.sh is eye opening! :laughing:

1 Like

Phew! :crazy_face: