Will do that.
Yes, that is correct. Wildcards are converted to a regex value when committed.
Issue opened:
This is functionally the same issue as: Pi-hole becomes unresponsive during pihole -d and -q
Faulty regex causes -q (and, by extension -d) to stall out
(You're not alone!)
Thanks for the hint
As the root cause was the same with us I opened a new topic on that issue:
4 posts were merged into an existing topic: Rename "wildcard" blacklist/whitelist because it is misleading and lead to regex errors
As mentioned in the other thread, this has been in the works for quite some time but will likely not make it into v5.0, so as not to push back the release unnecessarily.
Thanks for this, however, we will simply reuse the inbuilt PHP domain validator we already used before. It also checks for the maximum length of subdomains, etc.
Can we call this the solution to the topic then?
Depends on how you (or pihole team) interpret "solution": issue acknowledged and code will be written or code is written and functioning.
For me it's solved - I can't contribute any further except for testing the patch when it's ready.
Can I get a teleporter tarball from someone that is seeing this so I can work on it?
I think the glitch may be in something else. As noted on the GitHub issue, calling `awk` directly doesn't cause a crash, and when I use the shell instead everything seems to be okay. I don't think there actually is a problem with using a `*` in a wildcard entry, as that will always be the first char and will always end up as `(\.|^)*`. I think that's okay, it just means "zero or more dots or starts of line".
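To make the conversion concrete, here is a minimal sketch of how a wildcard entry might turn into the regex discussed here. The rule (prefix `(\.|^)`, escape the dots, append `$`) is inferred from the example in this thread only, and `wildcard_to_regex` is a hypothetical name, not a function from the Pi-hole code base.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a regex from a wildcard entry.
# Assumption: the rule is "(\.|^)" + domain with escaped dots + "$",
# as inferred from the example in this thread.
wildcard_to_regex() {
  local domain=$1
  # Escape every literal dot so it no longer matches "any character"
  local escaped=${domain//./\\.}
  printf '(\\.|^)%s$\n' "$escaped"
}
```

With this sketch, `wildcard_to_regex '*.services.generalmagic.com'` yields the `(\.|^)*\.services\.generalmagic\.com$` pattern tested in this thread, while an entry without the leading `*.` yields `(\.|^)services\.generalmagic\.com$`.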
Checking with a small script and using shell regex instead of `awk` looks hopeful?
```bash
#!/usr/bin/env bash
# Test each domain against the regex under discussion using bash's =~ operator.
domains=(gstatic.com generalmagic.com services.generalmagic.com me.services.generalmagic.com)
regex='(\.|^)*\.services\.generalmagic\.com$'
for domain in "${domains[@]}"; do
  # The right-hand side stays unquoted so bash treats it as a regex
  if [[ $domain =~ $regex ]]; then
    printf 'Found %s in %s.\n' "$domain" "$regex"
  else
    printf 'Did not find %s in %s.\n' "$domain" "$regex"
  fi
done
```
OUT:
```
Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
```
Edit:
For completeness and ease of discussion, here's the awk:
dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "gstatic.com")
dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "generalmagic.com")
dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "services.generalmagic.com")
dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$
dschaper@Mariner-10:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$
I can try and carve out a very small-RAM LXC to check for memory starvation; 1G is the minimum on DigitalOcean, and that may not show what we are looking for.
Edit: On a 1G RAM DigitalOcean droplet I can't get it to crash or hang at all. I'll need something to reproduce this with so I can narrow it down.
The awk is from "Fix for regexp queries through pihole -q" by mmotti (Pull Request #2780, pi-hole/pi-hole on GitHub), and the PR author may have some thoughts on what is being seen and why.
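For anyone not fluent in awk, here is the one-liner quoted in this thread unpacked and commented. The wrapper function `print_matching_regexps` and its two file arguments are hypothetical, added only so the program can be read and run on its own; the awk program itself is unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the one-liner quoted in this thread.
# $1: file with one regexp per line, $2: file with one domain per line.
print_matching_regexps() {
  awk '
    # While reading the first file (NR == FNR), store each line as an array key.
    NR == FNR { regexps[$0]; next }
    # For every line of the second file, print each stored regexp that matches it.
    {
      for (r in regexps)
        if ($0 ~ r)
          print r
    }
  ' "$1" "$2"
}
```

Note that the order in which `for (r in regexps)` visits the patterns is unspecified in awk, and the `NR == FNR` idiom assumes the first file is not empty.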
Hmm - To be completely honest, I'm not sure. My dealings with awk are quite limited and I only started playing around with it for this purpose as it seemed to be much quicker than the alternatives when it came to a significant number of regexps, particularly compared with running grep in a loop for each regexp and checking the return code, as grep doesn't seem to have the ability to display the pattern that caused a match.
`(\.|^)*` seems like a particularly bad regex, and I wonder if it could be classed as an 'evil regex', i.e. regex denial of service. The likely issue is the caret used within a group (not great practice anyway), further complicated by the `*` quantifier. Could it be stuck checking a huge number of possibilities for each domain and thus dying?
I'm away from my laptop at the moment, but does grep or the more generic bash approach have the same outcome when compared against the same number of domains? Is the performance as good when running a for loop in bash? I'm assuming the domain count for the people experiencing issues is in the millions, but I haven't read every detail of the thread just yet.
Sadly I'm not sure what to suggest here, particularly with awk. The problem with regex generally is that as long as the pattern is technically compilable, the processor (awk in this case) is at the mercy of the user.
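For comparison, the grep-in-a-loop alternative mentioned above could be sketched roughly like this. `match_with_grep` is a hypothetical name, and this is not code from the PR; it only illustrates why the loop is needed: grep reports whether a pattern matched, so the pattern has to be printed by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every regexp from a file that matches a domain.
# $1: domain to test, $2: file with one regexp per line.
match_with_grep() {
  local domain=$1 regex_file=$2 regex
  while IFS= read -r regex; do
    # Skip empty lines, which would otherwise match everything
    [[ -z $regex ]] && continue
    # grep only returns a status, so print the pattern ourselves on a match
    if grep -qE -e "$regex" <<< "$domain"; then
      printf '%s\n' "$regex"
    fi
  done < "$regex_file"
}
```

This starts one grep process per regexp and per domain, which is the overhead that made awk look attractive with a large number of regexps.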
Tried it myself.
With the script everything is fine, running just ~1 sec.
nanopi@nanopi:~$ ./test
Did not find gstatic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Did not find services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
Found me.services.generalmagic.com in (\.|^)*\.services\.generalmagic\.com$.
But the `awk` got killed again due to high memory consumption:
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo '(\.|^)*\.services\.generalmagic\.com$') <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "gstatic.com")
Killed
nanopi@nanopi:~$ awk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed
As it seems to come down to `awk`, I checked what was installed:
nanopi@nanopi:~$ apt -s install awk
NOTE: This is only a simulation!
apt needs root privileges for real execution.
Keep also in mind that locking is deactivated,
so don't depend on the relevance to the real current situation!
Reading package lists... Done
Building dependency tree
Reading state information... Done
Package awk is a virtual package provided by:
original-awk:armhf 2012-12-20-6
mawk:armhf 1.3.3-17+b3
gawk:armhf 1:4.2.1+dfsg-1
original-awk 2012-12-20-6
mawk 1.3.3-17+b3
gawk 1:4.2.1+dfsg-1
You should explicitly select one to install.
nanopi@nanopi:~$ sudo apt-show-versions awk
awk not installed (not available)
nanopi@nanopi:~$ apt-show-versions mawk
mawk:arm64 1.3.3-17+b3 installed: No available version in archive
nanopi@nanopi:~$ apt-show-versions gawk
gawk not installed (not available)
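A shorter way to see which implementation the bare `awk` command resolves to is to follow its symlink chain. This is a minimal sketch that assumes `awk` is a symlink managed by the alternatives system, as on Debian-based installs; `awk_impl` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the name of the binary that "awk" really points to.
# Assumption: awk is a symlink chain (e.g. via /etc/alternatives on Debian).
awk_impl() {
  local path
  path=$(command -v awk) || { echo "none"; return 1; }
  # readlink -f follows every symlink down to the real file
  basename "$(readlink -f "$path")"
}

awk_impl
```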
Running the `awk` explicitly with `mawk`:
nanopi@nanopi:~$ mawk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
Killed
BUT after installing `gawk`:
nanopi@nanopi:~$ sudo apt install gawk
nanopi@nanopi:~$ gawk 'NR==FNR{regexps[$0];next}{for (r in regexps)if($0 ~ r)print r}' <(echo "(\.|^)*\.services\.generalmagic\.com$") <(echo "me.services.generalmagic.com")
(\.|^)*\.services\.generalmagic\.com$
nanopi@nanopi:~$ pihole -q gstatic
Match found in exact whitelist
fonts.gstatic.com
Match found in https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts:
csi.gstatic.com
Match found in https://hosts-file.net/ad_servers.txt:
csi.gstatic.com
p2-aahhyknavsj2m-wtnlrzkba6lht33q-if-v6exp3-v4.metric.gstatic.com
p2-f6rp6piuxns4u-uzq4vp76bu3w2tso-if-v6exp3-v4.metric.gstatic.com
p2-n3zurhre4jjvk-can5rb2f2a4urcxh-if-v6exp3-v4.metric.gstatic.com
p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-i1-v6exp3-v4.metric.gstatic.com
p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-i2-v6exp3-ds.metric.gstatic.com
p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-164149-s1-v6exp3-v4.metric.gstatic.com
p4-ajvwyt5lpjazy-us7r2dzqcjsqh7pt-if-v6exp3-v4.metric.gstatic.com
p5-lj5aujgj7jl7w-r2pmxqvndsgx2im2-931517-i1-v6exp3-v4.metric.gstatic.com
p5-lj5aujgj7jl7w-r2pmxqvndsgx2im2-931517-i2-v6exp3-ds.metric.gstatic.com
s6.netlogstatic.com
v6exp3-ds.metric.gstatic.com
v6exp3-v4.metric.gstatic.com
Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardMobileAds.txt:
csi.gstatic.com
Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardDNS.txt:
metric.gstatic.com
diagnose.igstatic.com
Match found in https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/EasyPrivacy3rdParty.txt:
csi.gstatic.com
diagnose.igstatic.com
Match found in https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt:
connectivitycheck.gstatic.comwf.cbarsrv.com
gstaticadssl.l.google.com
metric.gstatic.com
anycast.metric.gstatic.com
anycast-stb.metric.gstatic.com
anycast1.metric.gstatic.com
anycast1-stb.metric.gstatic.com
anycast2.metric.gstatic.com
anycast2-stb.metric.gstatic.com
ds.metric.gstatic.com
s-v6exp1-ds.metric.gstatic.com
s-v6exp1-v4.metric.gstatic.com
stbcast.metric.gstatic.com
stbcast-stb.metric.gstatic.com
stbcast2.metric.gstatic.com
stbcast2-stb.metric.gstatic.com
stbcast3.metric.gstatic.com
stbcast3-stb.metric.gstatic.com
stbcast4.metric.gstatic.com
stbcast4-stb.metric.gstatic.com
stbcast5.metric.gstatic.com
stbcast5-stb.metric.gstatic.com
test-ipv6-dot-com-v6exp3-v4.metric.gstatic.com
unicast.metric.gstatic.com
unicast-stb.metric.gstatic.com
unicast2.metric.gstatic.com
unicast2-stb.metric.gstatic.com
v4.metric.gstatic.com
v6exp3-ds.metric.gstatic.com
v6exp3-v4.metric.gstatic.com
NO ERRORS!
There must be a difference between `mawk` and `gawk` resulting in the error we've seen.
Is pihole checking for `awk` during installation? Maybe extend to `gawk`?
This is very interesting that it's crapping out after only a single domain being passed.
Am I right in understanding at the moment that this only happens for select users? (as Dan was able to get an output without memory consumption issue)
We do check for `awk`, but that may not be needed now. I'll check the version table for `bash` and its internal regex matcher. If all the supported OS releases can do this in shell, then we'll do it in shell. I'll put a branch up as soon as I wake up fully and get some coffee. Testing for feature parity and speed of processing is the key.
This would also mean that there's no immediate need to change the wildcard entry; you can throw `*.host.domain` at it and it should work just the same.
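As a rough idea of what a shell-only matcher could look like, here is a sketch that uses bash's built-in `=~` operator and prints the pattern that matched, like the awk version does. `match_in_shell` is a hypothetical name and this is an illustration only, not the contents of any Pi-hole branch.

```shell
#!/usr/bin/env bash
# Hypothetical shell-only stand-in for the awk matcher.
# $1: domain to test, $2: file with one regexp per line.
match_in_shell() {
  local domain=$1 regex_file=$2 regex
  # "|| [[ -n $regex ]]" also handles a last line without a trailing newline
  while IFS= read -r regex || [[ -n $regex ]]; do
    [[ -z $regex ]] && continue
    # Unquoted right-hand side: bash treats it as an extended regex
    if [[ $domain =~ $regex ]]; then
      printf '%s\n' "$regex"
    fi
  done < "$regex_file"
}
```

No external process is started per pattern here, so whether it keeps up with awk on a large regex list would still need to be measured.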
Okay, pushed a core branch `fix/awkInQuery` to test. Looks okay from my very quick check.
Run `pihole checkout core fix/awkInQuery` and see what you can do with it.