We use awk at the moment, but that may no longer be needed. I'll check the version table for bash and its internal regex matcher. If all the supported OS releases can do this in shell, then we'll do it in shell. I'll put a branch up as soon as I wake up fully and get some coffee. The key is testing for feature parity and processing speed.
This would also mean that there's no immediate need to change the wildcard entry; you can throw *.host.domain at it and it should work just the same.
I'm going by what users would or could enter in the screens. It's up to us to implement, behind the scenes, what we think the users intend.
Entering *.host.domain as a wildcard means "block any subhost of host.domain". To block host.domain itself as well as its subhosts, enter host.domain as the wildcard.
That keeps things very similar to version 4's wildcards, and no user behavior needs to change. Users can enter a * in a wildcard and the intent will be recognized and acted upon. There is no rejection telling them the entry is invalid, which would only cause more help requests asking "Why is *.host.domain invalid??".
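To make the two cases concrete, here is a rough sketch of how a wildcard entry could be turned into a regex behind the scenes. The function name wildcard_to_regex and the exact regex shapes it produces are assumptions for illustration only, not what the current code does.

```shell
#!/usr/bin/env bash
# Hypothetical helper (assumption, not existing Pi-hole code):
# convert a user-entered wildcard into an ERE string.
wildcard_to_regex() {
    # Default: match the domain itself and any subhost
    local entry="${1}" prefix='(^|\.)'
    # A leading "*." is read as "subhosts only", so require a dot before the domain
    if [[ "${entry}" == \*.* ]]; then
        entry="${entry#\*.}"
        prefix='\.'
    fi
    # Escape full stops so they only match literally
    printf '%s%s$\n' "${prefix}" "${entry//./\\.}"
}

wildcard_to_regex 'host.domain'     # prints (^|\.)host\.domain$
wildcard_to_regex '*.host.domain'   # prints \.host\.domain$
```

With this reading, *.host.domain is accepted as typed and simply produces a narrower regex than host.domain does.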
Why though? It works and it's what the user intended. I'm the last person to add code that lectures users on what is proper or improper. If we can use what data is entered and give the expected result then that's what I'm concerned with.
Edit: Can we keep things on topic and discuss this problem and the potential solution? I really don't want this thread to go 100+ posts.
That's an option to consider, yes. I don't like it myself, since users are not going to learn proper regex; that's why we added the helper in the first place.
I'd like to see if bash regex matching works as well as or better than awk though, no matter what the final solution happens to be.
# Split regexps over a new line
str_regexList=$(printf '%s\n' "${regexList[@]}")
# Check domain against regexps
mapfile -t regexMatches < <(scanList "${domain}" "${str_regexList}" "regex")
"regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;
It's been a while since I have looked through this code or done any bash / shell scripting, but does this return only the matching pattern, or all regexps if a match is found?
Also, does this work as expected? I've not done native bash pattern matching, but does it definitely iterate through each item in the list like grep/awk?
Sorry if these are silly questions. I'm not able to test currently as I'm away from home.
Edit 2: Before I left, I was able to verify that the bad regex killed my awk too on my rpi 3b.
It returns the regex if the domain matches the regex. There isn't a list of regex passed in to the function, the only thing it takes is a domain and a single regex. Looping through the contents of the list of regexes happens elsewhere.
Seems to, but that's why I'm asking for people to test it and see what they can do with it.
Not silly questions at all, no worries.
Thanks, I think it's down to gawk/mawk/awk variants and the different optimizations and internal state engines.
Is this definitely the case? Unless things are drastically different from when I last looked some time ago, the process used to be as follows (some code omitted):
Get regexps from the db
mapfile -t regexList < <(sqlite3 "${gravityDBfile}" "SELECT domain FROM domainlist WHERE type = ${type}" 2> /dev/null)
Add regexps to a string (one each line)
str_regexList=$(printf '%s\n' "${regexList[@]}")
Call the scanList function with a single domain (user input) and the multiline regex string from the db
Iterate through each regex in the multiline string and add to a regexps array, then iterate through each regexp and check if the domain matches (print pattern if it does)
So as I understand, the loop is actually done there.
So if we take:
"regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;
${lists} at this point should be a multiline string of regexps.
As I said, I could be way off the mark here and I have been wrong many times in the past, but this is how I remember the script functioning previously.
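A quick way to check this without touching query.sh is the sketch below. matchRegexList is a hypothetical helper name, and the two regexps are the ones from the results in this thread. It first tries the whole multiline string as a single pattern, then tries one line at a time.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of query.sh): print every regex in a
# newline-separated list that matches the given domain.
matchRegexList() {
    local domain="${1}" regexes="${2}" regex
    while IFS= read -r regex; do
        # Skip blank lines; an empty pattern would match every domain
        [[ -z "${regex}" ]] && continue
        # ${regex} stays unquoted so bash treats it as an ERE, not a literal string
        if [[ "${domain}" =~ ${regex} ]]; then
            printf '%s\n' "${regex}"
        fi
    done <<< "${regexes}"
}

regexList=('^(.+[_.-])?ad[sxv]?[0-9]*[_.-]' '^analytics?[_.-]')
str_regexList=$(printf '%s\n' "${regexList[@]}")

# Whole multiline string as ONE pattern: the embedded newline is a literal
# character that no domain contains, so this branch is never taken
if [[ "ads.test.bbc.co.uk" =~ ${str_regexList} ]]; then
    echo "single-pattern test matched"
fi

# One pattern per iteration
matchRegexList "ads.test.bbc.co.uk" "${str_regexList}"   # prints ^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
matchRegexList "analytics.test.co" "${str_regexList}"    # prints ^analytics?[_.-]
```

The sketch prints with %s rather than %b, so that backslash sequences inside a regex are echoed back unchanged.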
RESULTS BELOW
Release v5
mmotti@ubuntu-server:~$ pihole -q ads.test.bbc.co.uk
Match found in regex blacklist
^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
mmotti@ubuntu-server:~$ pihole -q analytics.test.co
Match found in regex blacklist
^analytics?[_.-]
New branch
Regexps that should match:
^(.+[_.-])?ad[sxv]?[0-9]*[_.-]
^analytics?[_.-]
mmotti@ubuntu-server:~$ pihole -q ads.test.bbc.co.uk
[i] No results found for ads.test.bbc.co.uk within the block lists
mmotti@ubuntu-server:~$ pihole -q analytics.test.com
[i] No results found for analytics.test.com within the block lists
I was confused as well. I couldn't understand the reason for arrays or looping inside the function since the function is only passed a single regex.
Proposed function:
# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"
    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1
    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C
    # /dev/null forces filename to be printed when only one list has been generated
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) if [[ "${domain}" =~ ${lists} ]]; then printf "%b\n" "${lists}"; fi;;
        * ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}
Current function:
# Scan an array of files for matching strings
scanList(){
    # Escape full stops
    local domain="${1}" esc_domain="${1//./\\.}" lists="${2}" type="${3:-}"
    # Prevent grep from printing file path
    cd "$piholeDir" || exit 1
    # Prevent grep -i matching slowly: http://bit.ly/2xFXtUX
    export LC_CTYPE=C
    # /dev/null forces filename to be printed when only one list has been generated
    # shellcheck disable=SC2086
    case "${type}" in
        "exact" ) grep -i -E -l "(^|(?<!#)\\s)${esc_domain}($|\\s|#)" ${lists} /dev/null 2>/dev/null;;
        # Create array of regexps
        # Iterate through each regexp and check whether it matches the domainQuery
        # If it does, print the matching regexp and continue looping
        # Input 1 - regexps | Input 2 - domainQuery
        "regex" ) awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
            <(echo "${lists}") <(echo "${domain}") 2>/dev/null;;
        * ) grep -i "${esc_domain}" ${lists} /dev/null 2>/dev/null;;
    esac
}
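For anyone who wants to poke at the awk branch on its own, here is a standalone sketch of the same one-liner. The wrapper name awkScan and the sample inputs are assumptions for testing only; the awk program itself is copied from the current function.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the awk one-liner from the current scanList
awkScan() {
    local domain="${1}" regexes="${2}"
    # First input (NR==FNR): store each line as a key in the regexps array.
    # Second input: test the domain against every stored regex and print
    # each regex that matches.
    awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' \
        <(printf '%s\n' "${regexes}") <(printf '%s\n' "${domain}")
}

regexes=$'^(.+[_.-])?ad[sxv]?[0-9]*[_.-]\n^analytics?[_.-]'
awkScan "analytics.test.co" "${regexes}"   # prints ^analytics?[_.-]
```

This makes it visible that the looping over regexps happens inside awk, which is why the function can be handed the whole multiline string. Note that awk does not guarantee the iteration order of "for (r in regexps)", so when several regexps match, the order of the printed lines may vary between awk variants.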
Hmm, looks like the regex isn't ever passed to the queryFunc?
+ resolver=pihole-FTL
+ [[ 2 = 0 ]]
+ case "${1}" in
+ [[ ! 0 -eq 0 ]]
+ case "${1}" in
+ queryFunc -q analytics.test.co
+ shift
+ /opt/pihole/query.sh analytics.test.co
[i] No results found for analytics.test.co within the block lists
+ exit 0
I must admit, I am unclear on the exact inner workings of the rest of the script(s) and functions. I was entirely focused on query.sh when making my initial changes, as that's where WaLLy3K had made some initial progress. The method (if I recall correctly) was to match with grep and then, if there was a match, try a reverse match, but it was fiddly and didn't quite work properly with things being escaped, etc.