Invalid domain bug

Problem with Beta 5.0:
lightswitch05, who hosts a number of pihole lists, has 8 domains in Unicode format that are reported as invalid. The owner of that repository, Daniel, mentioned that this is a bug and is slated to be fixed in version 5.
The issue in his repository: Invalid domain · Issue #153 · lightswitch05/hosts · GitHub

Just wondering if this is fixed or on track to be fixed?

Edit - add exact issue from update log
[i] Target: https://www.github.developerdan.com/hosts/lists/ads-and-tracking-extended.txt
[✓] Status: Retrieval successful
[i] Received 110984 domains, 8 domains invalid!
Sample of invalid domains:

If you are running the latest 5.0 beta, you have the latest code, and I don't see an open issue for this on the Pi-hole GitHub.

Note that the blocklist maintainer has duplicate domains using only punycode (as noted on his GitHub), so you aren't missing any domains here.
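If you want to confirm that for a given list, you can check that every Unicode entry has its Punycode twin on the same list. This is only a minimal sketch: it assumes hosts-style lines of the form `0.0.0.0 domain`, the function name and the sample lines are made up for illustration, and Python's built-in `idna` codec follows the older IDNA 2003 rules.

```python
def missing_punycode_twins(lines):
    """Return the Unicode domains whose Punycode form is not also listed."""
    domains = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line:
            domains.append(line.split()[-1])  # last field is the domain

    ascii_domains = {d.lower() for d in domains if d.isascii()}

    missing = []
    for domain in domains:
        if domain.isascii():
            continue  # nothing to convert
        try:
            twin = domain.encode("idna").decode("ascii")
        except UnicodeError:
            twin = None  # cannot be converted at all
        if twin not in ascii_domains:
            missing.append(domain)
    return missing


# Hypothetical excerpt, not the real list contents.
sample = [
    "# hypothetical excerpt",
    "0.0.0.0 ɢoogletranslate.com",
    "0.0.0.0 xn--oogletranslate-u5f.com",
    "0.0.0.0 bücher.example",
]
print(missing_punycode_twins(sample))  # ['bücher.example']
```

An empty result means that skipping the Unicode entries loses nothing.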

@lightswitch05

I was the one who initially reported the IDN domains to the maintainer. He wanted to keep them (because pihole is not the only consumer of this hosts file) but added the punycode version of the mentioned domains.
I was fine with this and ignored pihole's warning about invalid domains because I knew that these domains are duplicates.

But I can see that the decision to keep the original IDN domains can cause trouble and support requests to both pihole and the list maintainer.

It's kind of a stalemate: it's not pihole's duty to correct malformed lists, but the list maintainer does have valid reasons to keep IDN domains. But are IDN domains malformed entries?

One way to solve it could be for pihole to convert IDN domains (or simply all domains?) to punycode before checking whether they are invalid.
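Roughly what I have in mind, as a sketch only. This is not how pihole actually validates domains: the function name and the validity rule (ASCII letters, digits and hyphens per label) are my assumptions, and Python's built-in `idna` codec implements the older IDNA 2003 rules.

```python
import re

# Assumed rule for one label: letters, digits, hyphens; 1-63 characters;
# no hyphen at either end.
LDH_LABEL = re.compile(r"(?!-)[a-z0-9-]{1,63}(?<!-)")


def normalize_domain(domain):
    """Return the ASCII (Punycode) form of `domain`, or None if invalid."""
    try:
        # Pure-ASCII names pass through unchanged; Unicode labels get "xn--".
        ascii_domain = domain.encode("idna").decode("ascii").lower()
    except UnicodeError:
        return None  # not convertible, e.g. empty or over-long label

    labels = ascii_domain.rstrip(".").split(".")
    if len(ascii_domain) > 253:
        return None
    if not all(LDH_LABEL.fullmatch(label) for label in labels):
        return None
    return ascii_domain


print(normalize_domain("ɢoogletranslate.com"))  # xn--oogletranslate-u5f.com
print(normalize_domain("example.com"))          # example.com
print(normalize_domain("foo_bar.com"))          # None
```

With this order of operations, an IDN entry would be stored under its punycode name instead of being counted as invalid.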

(I think I have responded to a similar observation in the past, but can't find it.)

Internationalizing Domain Names in Applications (IDNA) was conceived to allow client-side use of language-specific characters in domain names without requiring any existing infrastructure (DNS servers, mail servers, etc., including associated protocols) to change.

The corresponding original RFC 3490 clearly states that IDNA is employed at the application level, not on the server side:

From RFC 3490

The IDNA protocol is contained completely within applications. It is not a client-server or peer-to-peer protocol: everything is done inside the application itself. When used with a DNS resolver library, IDNA is inserted as a "shim" between the application and the resolver library. When used for writing names into a DNS zone, IDNA is used just before the name is committed to the zone.

Hence, DNS servers never see any IDN domain name, which means DNS records do not store IDN domain names at all, only their Punycode representations.

So nslookup ɢoogletranslate.com results in NXDOMAIN, whereas its Punycode equivalent nslookup xn--oogletranslate-u5f.com will return an IP (if not blocked by Pi-hole) :wink:
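To see the conversion an IDNA-aware application performs before it asks the resolver, here is a small sketch using Python's built-in `idna` codec (which implements the IDNA 2003 rules of RFC 3490). The two helper names are mine, chosen for illustration.

```python
def to_ascii_form(name):
    """Encode a Unicode domain name into the ASCII form sent to DNS."""
    return name.encode("idna").decode("ascii")


def to_unicode_form(name):
    """Decode a Punycode domain name back into Unicode for display."""
    return name.encode("ascii").decode("idna")


print(to_ascii_form("ɢoogletranslate.com"))           # xn--oogletranslate-u5f.com
print(to_unicode_form("xn--oogletranslate-u5f.com"))  # ɢoogletranslate.com
```

Only the first form is ever looked up; the second one exists purely for showing the name to a user.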

Granted, Pi-hole could do a conversion before committing a list entry to its database.
But forcing all list consumers (not just Pi-hole) to apply that conversion multiplies the effort for Punycode conversion by the number of clients, and potentially by the number of times the entry is used if the conversion is done on the fly.
This seems like a waste of resources to me.

:man_facepalming: You responded to me...

Thanks for the explanation again. I'm clearly with you (that's why I contacted the maintainer) but that doesn't solve the problem.
As long as some adlists contain IDNs (for whatever reason), this will generate the invalid warning in pihole, leading to user requests that consume moderators' and devs' time to answer.

Solutions:

  1. all maintainers remove IDNs from their lists (unrealistic)
  2. pihole converts IDNs to punycode (waste of resources)
  3. remove pihole's invalid warning (removes ability to reach solution #1)
  4. keep it as it is (generates support requests)
  5. (placeholder of clever idea...).

Hello. Sorry about this issue everyone. I don’t believe this is something that the pihole needs to be concerned about and said as much in the ticket. Thank you for all the information @Bucking_Horn, that is really useful.

Although I primarily use my hosts-formatted list with the pihole, I also use it with uBlock Matrix, and I know others use it with uBlock Origin and still more tools. It completely makes sense that the pihole only needs to know the punycode version; I’m not sure about browser extensions. Unless someone has information that all the ad blockers out there only need punycode versions, I believe having both versions in the list provides the most coverage.

So:

  1. all maintainers remove IDNs from their lists (unrealistic)

As much as I like pihole, it’s not the only tool that accepts hosts formatted lists.

  2. pihole converts IDNs to punycode (waste of resources)

I’m not against this, but I also don’t think it’s necessary as I added both versions.

  4. keep it as it is (generates support requests)

Yes. Perhaps someone will have some information for me about browser extensions and I can just remove the domains causing the complaints.

As long as some adlists contain IDNs (for whatever reason) this will generate the invalid warning in pihole leading to user requests that consume moderator/dev’s time to answer.

Yes, because they are invalid.

Domains are only allowed to contain ASCII characters; the invention/introduction of internationalized domain names (IDNs) did not change this.

Relevant RFC section

2.3.2.1. IDNA-valid strings, A-label, and U-label
IDNA-aware applications permit only A-labels and NR-LDH labels to appear in zone files and queries. U-labels can appear, along with the other two, in presentation and user interface forms, and in protocols that use IDNA forms but that do not involve the DNS itself.

Source: Section 2.3.2.1. of RFC 5890


Definition of NR-LDH labels: only ASCII

These specifications use the term "NR-LDH label" strictly to refer to an all-ASCII label that obeys the LDH label syntax discussed in Section 2.3.1 [...]

Source: Section 2.3.2.2. of RFC 5890


Definition of A labels: only ASCII

An A-label is recognizable from the prefix "xn--" before the characters produced by the Punycode algorithm [RFC3492]; thus, a user application can identify an A-label and convert it into Unicode (or some local coded character set) for display.

Source: Section 1.1. of RFC 5894

see also:

An "A-label" is the ASCII-Compatible Encoding (ACE, see Section 2.3.2.5) form of an IDNA-valid string. [...]

Source: Section 2.3.2.1. of RFC 5890


Definition of U labels: Unicode

A "U-label" is an IDNA-valid string of Unicode characters, in Normalization Form C (NFC) and including at least one non-ASCII character, expressed in a standard Unicode Encoding Form (such as UTF-8). [...]

Source: Section 2.3.2.1. of RFC 5890


Unicode domains should only appear in IDN U-labels, which are, in turn, only allowed in the user-facing frontend but not in the internal DNS machinery, where only A-labels and NR-LDH labels are allowed (cf. my first dropdown field).
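As a rough illustration of these categories, here is a sketch that sorts a single label into them. The function names and return strings are mine. It only checks syntax: a label starting with "xn--" is a true A-label only if its Punycode part is also valid, which this sketch does not verify.

```python
import re

# LDH label syntax: ASCII letters, digits, hyphens; 1-63 characters;
# no hyphen at either end.
LDH = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def classify_label(label):
    if not label.isascii():
        return "non-ASCII"  # at best a U-label: presentation only
    if not LDH.fullmatch(label):
        return "invalid"    # ASCII, but not LDH syntax
    if label[:4].lower() == "xn--":
        return "xn-label"   # A-label candidate (Punycode not verified here)
    if label[2:4] == "--":
        return "reserved"   # other labels with "--" in positions 3 and 4
    return "nr-ldh"


def allowed_in_dns(domain):
    """True if every label is an NR-LDH label or an A-label candidate."""
    labels = domain.rstrip(".").split(".")
    return all(classify_label(label) in ("nr-ldh", "xn-label") for label in labels)


print(allowed_in_dns("xn--oogletranslate-u5f.com"))  # True
print(allowed_in_dns("ɢoogletranslate.com"))         # False
```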

TL;DR Hence, my interpretation is that having Unicode domains in /etc/hosts-like lists is incorrect.


Pi-hole is doing exactly as users requested. In the referenced thread above regarding how gravity processes and reports invalid entries in the imported lists (which you participated in), this was the agreed upon solution.

The domains are invalid, they are not imported into Pi-hole, and Pi-hole is reporting this.
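In other words, the behavior is a split-and-report step. The following is an illustrative sketch only, not Pi-hole's actual gravity code: the validity rule (dot-separated ASCII labels of letters, digits and hyphens), the function name, and the exact message layout are assumptions modelled on the log excerpt quoted earlier in this thread.

```python
import re

# Assumed validity rule; Pi-hole's real check may differ.
VALID = re.compile(
    r"(?:(?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)*(?!-)[A-Za-z0-9-]{1,63}(?<!-)"
)


def import_report(domains, sample_size=5):
    """Split domains into valid ones and a report about the skipped ones."""
    valid = [d for d in domains if VALID.fullmatch(d)]
    invalid = [d for d in domains if not VALID.fullmatch(d)]

    lines = [f"Received {len(domains)} domains, {len(invalid)} domains invalid!"]
    if invalid:
        lines.append("Sample of invalid domains:")
        lines.extend(f"  - {d}" for d in invalid[:sample_size])
    return valid, lines


valid, report = import_report(
    ["example.com", "ɢoogletranslate.com", "xn--oogletranslate-u5f.com"]
)
print("\n".join(report))
```

Only the entries in `valid` would be imported; the Unicode entry is skipped and shows up in the report.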

Number 4 is the answer here, in my opinion. Or, if blocklist maintainers want to make lists that are completely HOSTS compatible, then number 1 is the solution, but number 4 will still function as is. This is not unrealistic, as many blocklists are completely imported with no errors.

In this specific case, we have met everybody's needs.

The blocklist maintainer included some Unicode entries because of how they use the blocklist.

This is no stalemate. Pi-hole is working as intended. Those entries are not valid for Pi-hole, so Pi-hole skipped them and reported this to the user.

We can handle the very small number of support requests we will receive for this issue.

@lightswitch05 removed the IDN entries from the adlist