Faulty regex consumes all memory and causes issues with pihole -q

DanSchaper · March 2, 2020, 1:29pm

We do awk, but that may not be needed now. I'll check the version table for bash and its internal regex matcher. If all the supported OS releases can do this in shell, then we'll do it in shell. I'll put a branch up as soon as I wake up fully and get some coffee. Testing for feature parity and speed of processing is the key.

This also would mean that there's no immediate need to change the wildcard entry, you can throw *.host.domain at it and it should work just the same.

DanSchaper · March 2, 2020, 1:47pm

Okay, pushed a core branch fix/awkInQuery to test. Looks okay from my very quick check.

pihole checkout core fix/awkInQuery and see what you can do with it.

DanSchaper · March 2, 2020, 1:53pm

yubiuser:

nanopi@nanopi:~$ ./test 
Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$. 
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$. 
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$. 
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.

Okay, which of the above lines is not correct so I can adjust accordingly?

DanSchaper · March 2, 2020, 2:00pm

Ah okay, my view is that entering *.host.domain in as a wildcard means you want to block a.host.domain b.host.domain and not host.domain.

DanSchaper · March 2, 2020, 2:09pm

I'm going by what users would or could enter in the screens. The implementation is up to us to do behind the scenes what we think the users intend.

Entering *.host.domain as Wildcard means block any subhost of host.domain. To actually block host.domain and subhosts then enter host.domain as wildcard.

That keeps things very similar to version 4's wildcards and no user behavior needs to change. They can enter a * in wildcard and the intent will be seen and acted upon. No rejection telling them the entry is invalid thus causing more help requests for "Why is *.host.domain invalid??".

DanSchaper · March 2, 2020, 2:19pm

Why though? It works and it's what the user intended. I'm the last person to add code that lectures users on what is proper or improper. If we can use what data is entered and give the expected result then that's what I'm concerned with.

Edit: Can we keep things on topic and discuss this problem and the potential solution? I really don't want this thread to go 100+ posts.

DanSchaper · March 2, 2020, 2:28pm

That's an option to consider, yes. I don't like it myself since users are not going to learn proper regex, that's why we added the helper in the first place.

I'd like to see if bash regex matching works as well/better than awk though, no matter what the ending solution happens to be.

PromoFaux · March 2, 2020, 2:37pm

Agree with this. And backing up with a mod request to keep the discussion in this thread based around the proposed fix for the reported issue.

We can talk about semantics in another thread, another time.

There is already a discussion based around the naming /purpose of the wildcard feature as @yubiuser linked above:

mmotti · March 2, 2020, 2:54pm

# Split regexps over a new line
str_regexList=$(printf '%s\n' "${regexList[@]}")
# Check domain against regexps
mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")

 "regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;

It's been a while since I have looked through this code or done any bash / shell scripting, but does this return a matching pattern or all regexps if a match is found?

Also, does this work as expected? I've not done native bash pattern matching, but does it definitely iterate through each item in the list like grep/awk?

Sorry if these are silly questions. I'm not able to test currently as away from home.

Edit 2: Before I left, I was able to verify that the bad regex killed my awk too on my rpi 3b.

DanSchaper · March 2, 2020, 3:01pm

It returns the regex if the domain matches the regex. There isn't a list of regex passed in to the function, the only thing it takes is a domain and a single regex. Looping through the contents of the list of regexes happens elsewhere.

Seems to, but that's why I'm asking for people to test it and see what they can do with it.

Not silly questions at all, no worries.

Thanks, I think it's down to gawk/mawk/awk variants and the different optimizations and internal state engines.

mmotti · March 2, 2020, 3:02pm

Ah, I see! I may have only looked at the most recent commit on that branch. My mistake.

Will try test later

DanSchaper · March 2, 2020, 3:08pm

Happy to have any test results.

My statement of how the function is used is based on my understanding and checking out the bash -x output. The function scanList() is called as below:

scanList gstatic.com '(\.|^)*\.services\.generalmagic\.com$' regex

FuncName	`${1}`	`${2}`	`${3}`
scanList()	gstatic.com	`(.	^)*.services.generalmagic.com$`

yubiuser · March 2, 2020, 3:12pm

Nothing is wrong with that. The error was in the next paragraph of my post about your awk commands.

yubiuser:

But the awk got killed again due to high memory consumptation

nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com
```) <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed

DanSchaper · March 2, 2020, 3:14pm

Great, that's what I'd like. The quote of the bash script output was for msatter to find the output he had issues with.

Edit: Clarity.

yubiuser · March 2, 2020, 3:20pm

I tried the fixing branch (fix/awkInQuery)- and it worked well. Fast, no high memory consumption, no process killed.

mmotti · March 2, 2020, 3:29pm

Is this definitely the case? Unless things are drastically different from when I last looked some time ago, the process used to be as follows (some code omitted):

Get regexps from the db

mapfile -t regexList < <(sqlite3 "${gravityDBfile}" "SELECT domain FROM domainlist WHERE type = ${type}" 2> /dev/null)

Add regexps to a string (one each line)

str_regexList=$(printf '%s\n' "${regexList[@]}")

Call the scanList function with a single domain (user input) and the multiline regex string from the db

 mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")

Iterate through each regex in the multiline string and add to a regexps array, then iterate through each regexp and check if the domain matches (print pattern if it does)

 "regex" ) awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
                  <(echo "${lists}") <(echo "${domain}") 2>/dev/null;;

So as I understand, the loop is actually done there.

So if we take:

"regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;

${lists} at this point should be a multiline string of regexps.

As I said, I could be way off the mark here and I have been wrong many times in the past, but this is how I remember the script functioning previously.

RESULTS BELOW

Release v5

mmotti@ubuntu-server:~$ pihole -q ads.test.bbc.co.uk
 Match found in regex blacklist
   ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
mmotti@ubuntu-server:~$ pihole -q analytics.test.co
 Match found in regex blacklist
   ^analytics?[_.-]

New branch

Regexps that should match:

^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
^analytics?[_.-]

mmotti@ubuntu-server:~$ pihole -q ads.test.bbc.co.uk
  [i] No results found for ads.test.bbc.co.uk within the block lists
mmotti@ubuntu-server:~$ pihole -q analytics.test.com
  [i] No results found for analytics.test.com within the block lists

DanSchaper · March 2, 2020, 3:39pm

I was confused as well. I couldn't understand the reason for arrays or looping inside the function since the function is only passed a single regex.

Proposed function:

# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"

    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1

    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C

    # /dev/null forces filename to be printed when only one list has been generated
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;
        *       ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}

Current function:

# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"

    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1

    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C

    # /dev/null forces filename to be printed when only one list has been generated
    # shellcheck disable=SC2086
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
                  <(echo "${lists}") <(echo "${domain}") 2>/dev/null;;
        *       ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}

DanSchaper · March 2, 2020, 3:43pm

Hmm, looks like the regex isn't ever passed to the queryFunc?

+ resolver=pihole-FTL
+ [[ 2 = 0 ]]
+ case "${1}" in
+ [[ ! 0 -eq 0 ]]
+ case "${1}" in
+ queryFunc -q analytics.test.co
+ shift
+ /opt/pihole/query.sh analytics.test.co
  [i] No results found for analytics.test.co within the block lists
+ exit 0

DanSchaper · March 2, 2020, 3:49pm

Ah I see it! Okay, quick fix coming.

mmotti · March 2, 2020, 3:50pm

I must admit, I am unclear on the exact inner workings of the rest of the script(s) and/or functions as I was entirely focused on query.sh when making changes initially, as that's where WaLLy3K had made some initial progress; the method being (if I recall correctly) to match with grep and then if there was a match, try to do a reverse match but it was fiddly and didn't quite work properly with things being escaped etc.

See pull request in question

So it's possible that I may have missed functions that could have required information being passed to.

I still believe that the function in question here is passed a list of regexps, and not a single one. Unless I am really losing my marbles lol