Gravity update flags accented character domain as invalid

I'm not sure if this is a bug or correct behaviour, so reporting for analysis. I've not added a debug token since I don't think it's setup-related. I did some digging and didn't find anywhere else asking about this.

When updating Gravity, Pi-hole reports that one of my adlists contains two invalid domains.

[i] Target: https://malware-filter.gitlab.io/malware-filter/phishing-filter-hosts.txt
[✓] Status: Retrieval successful
[i] Analyzed 59469 domains, 2 domains invalid!
    Sample of invalid domains:
    - boursoramaaccès.com
    - emails%2eazure%2emicrosoft%2ecom@url917.vakeelavaahini.com
[i] List has been updated

The domain relevant to this post is the first one. I assume it's the accented è that it doesn't like.

If I paste the domain into Domains blacklist, it is accepted no problem and converted into its Punycode representation. A green box appears stating:

+ Success!
Added xn--boursoramaaccs-7jb.com

The entry below even lists both:

boursoramaaccès.com (xn--boursoramaaccs-7jb.com)

The adlist contains a few Punycode encoded domains directly and the Gravity update handles them fine. I tested with one of them (xn--80agfkuhcahdlf0o.xn--p1ai) and it can be added manually as either its Russian unicode or its Punycode format, with the same result, no problems. It is already found in the database from the Gravity update, and can be searched for in Tools > Search Adlists using either Russian or Punycode format.
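
For reference, the two representations are mechanically interchangeable. A minimal Python sketch of the round trip for the domain above, using the standard library's `idna` codec. This is only an illustration, not Pi-hole's own code, and the codec follows the older IDNA 2003 rules, so it may differ from other converters for some edge-case characters. The helper names `to_punycode` and `to_unicode` are made up for this example.

```python
# Convert between the Unicode and Punycode (ASCII) forms of a domain.
# Illustration only: this uses Python's built-in "idna" codec, not Pi-hole code.

def to_punycode(domain: str) -> str:
    """Return the ASCII (Punycode) form of a domain name."""
    return domain.encode("idna").decode("ascii")


def to_unicode(domain: str) -> str:
    """Return the Unicode form of a Punycode domain name."""
    return domain.encode("ascii").decode("idna")


print(to_punycode("boursoramaaccès.com"))        # xn--boursoramaaccs-7jb.com
print(to_unicode("xn--boursoramaaccs-7jb.com"))  # boursoramaaccès.com
```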

Summary

Pi-hole's Domains section handles the adding of special domains, using either their unicode or Punycode representation, without problem.

Pi-hole's Gravity update handles Punycode entries without problem, but flags the unicode representation as invalid.

Since Pi-hole can interpret and store these domains both ways without problem, it seems like a bug or omission in the Gravity update logic.

(Pi-hole v5.12.2 FTL v5.18.1 Web Interface v5.15.1)

We convert Unicode characters to Punycode when users enter them via the web interface, to help users who usually don't know what Punycode is. Usually, users enter only a few domains in Unicode format.
Gravity, on the other hand, reports Unicode domains found on an adlist as invalid and does not convert them. We believe this is the right approach: adlists are shared, so converting in gravity would mean a huge number of users each repeating the Punycode conversion every time the adlist is loaded. The right fix is to inform the maintainer of the list, who only needs to do the conversion once.
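
As a rough sketch of what that one-time conversion could look like on the maintainer's side, the Python snippet below rewrites Unicode domains in a hosts-style list to Punycode. It assumes each entry has the form `0.0.0.0 domain`, which is an assumption about the list format, and the function name `normalize_hosts_line` is hypothetical. It uses Python's built-in `idna` codec (IDNA 2003 rules), so a maintainer may prefer a different converter.

```python
# Sketch: convert Unicode domains in a hosts-format list to Punycode, once,
# before publishing the list. Assumes lines look like "0.0.0.0 domain".

def normalize_hosts_line(line: str) -> str:
    """Return the line with its domain converted to Punycode if needed.

    Comments, blank lines, ASCII-only entries, and entries that cannot be
    converted are returned unchanged.
    """
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return line
    parts = stripped.split()
    if len(parts) != 2:
        return line
    address, domain = parts
    if domain.isascii():
        return line  # already ASCII; nothing to convert
    try:
        ascii_domain = domain.encode("idna").decode("ascii")
    except UnicodeError:
        return line  # leave unconvertible entries for manual review
    return f"{address} {ascii_domain}"


for entry in ["# phishing list", "0.0.0.0 example.com", "0.0.0.0 boursoramaaccès.com"]:
    print(normalize_hosts_line(entry))
```

Note that this only helps with entries like the first invalid domain in the report. An ASCII entry containing `%` and `@`, like the second one, passes through unchanged and would still need to be cleaned up by hand.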

See:
https://github.com/pi-hole/pi-hole/issues/4410
https://github.com/pi-hole/pi-hole/issues/4436
https://github.com/pi-hole/pi-hole/issues/4447
https://github.com/pi-hole/pi-hole/issues/4512

and a few other issues on github

Thanks for the github links, I had found some posts on reddit but didn't know to look on github. I'll be sure to check the issues pages in future. I agree with the assessments there and here. That list has a couple of entries that have not been cleaned up.

Here's how I was thinking about the different scenarios Pi-hole deals with, with respect to domains containing non-standard unicode characters. You can see that the Gravity update appears to be the odd one out – but while writing this I found that the Query long term log also doesn't like them, which downgrades the 'odd one out effect' I was seeing.

| interaction mode | unicode | punycode |
|---|---|---|
| Add domains to black/whitelist | yes, converted | yes |
| Query long-term log via Query field | no, not found | yes, found |
| Search Adlists via Search field | yes, converted and found | yes, found |
| Blocked/allowed by FTL | yes, but converted by client first so n/a? | yes |
| Add from adlist during Gravity update | no, reports invalid | yes |

In summary, Pi-hole does a great job converting and handling non-punycode representations in various places, and this almost made those situations where it doesn't convert seem like a possible bug. Apologies for missing the github previous discussion. Thanks again.

I think we could add the unicode -> punycode conversion when searching from the long-term database.
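
To illustrate the idea, here is a minimal Python sketch that normalizes a search term to Punycode before querying. It is not how the web interface or FTL implement it. The in-memory table `queries` with a single `domain` column is a stand-in made up for this example, as is the function name `search_queries`.

```python
import sqlite3


def search_queries(conn: sqlite3.Connection, term: str) -> list:
    """Look up a domain, accepting either Unicode or Punycode input."""
    try:
        # ASCII input passes through unchanged; Unicode becomes xn--... form.
        ascii_term = term.strip().lower().encode("idna").decode("ascii")
    except UnicodeError:
        ascii_term = term  # fall back to the raw term if it cannot be converted
    rows = conn.execute(
        "SELECT domain FROM queries WHERE domain = ?", (ascii_term,)
    )
    return [row[0] for row in rows]


# Hypothetical stand-in for the query database, which stores Punycode only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE queries (domain TEXT)")
conn.execute("INSERT INTO queries VALUES ('xn--boursoramaaccs-7jb.com')")

print(search_queries(conn, "boursoramaaccès.com"))
print(search_queries(conn, "xn--boursoramaaccs-7jb.com"))
```

Both searches return the stored Punycode entry, because the term is converted before it reaches the database.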

That would be a useful addition, thanks. Out of interest, is that because you feel the logic/code is already in place for the adlist searches, so you can repurpose it for the query database too?

Both the long-term database and the current Query database only support displaying and searching punycode representations. When adding a punycode or unicode domain to a black/whitelist, the entry shows both formats. Could this same tweak allow the Query log to show both formats as they come up? So instead of showing (using example domain above)

xn--boursoramaaccs-7jb.com

it shows

boursoramaaccès.com (xn--boursoramaaccs-7jb.com)

or maybe

xn--boursoramaaccs-7jb.com (boursoramaaccès.com)

Not sure if that's similarly related to the extra capability you're proposing. It's not a feature request by any means – Pi-hole continues to work great and handle them all seamlessly under the hood either way.
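
To make the display idea concrete, here is a small Python sketch of the first format. The function name `display_name` is made up, and it uses Python's built-in `idna` codec rather than anything from Pi-hole.

```python
def display_name(stored: str) -> str:
    """Format a stored domain for display.

    If the stored (Punycode) name has a different Unicode form, show both;
    otherwise show the stored name as-is.
    """
    try:
        unicode_form = stored.encode("ascii").decode("idna")
    except UnicodeError:
        return stored  # not decodable; show what is stored
    if unicode_form == stored:
        return stored
    return f"{unicode_form} ({stored})"


print(display_name("xn--boursoramaaccs-7jb.com"))  # boursoramaaccès.com (xn--boursoramaaccs-7jb.com)
print(display_name("example.com"))                 # example.com
```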

Happy to test any tweaks on a different branch if it helps.