[DEV] bad handling of capitalization of domain names

Please follow the below template, it will help us to help you!

Expected Behaviour:

I am on current devel branch with new regex whitelist feature.

  1. When adding domains to blacklist or whitelist, I would expect to not beeing able to add the same domain with different capitalization.

  2. Also it should not be able to add the same domains simultaneously in whitelist and blacklist.

  3. When querying a domain in the ad lists, I would expect the same output from CLI and web interface, regardless of capitalization.

Actual Behaviour:

  1. Currently it's possible to add the same domain with different capitalization to whitelist and blacklist.

  2. It's also possible to add the same domain simultaneously in whitelist and blacklist.

  3. Querying for e.g. Microsoft on the web interface and in CLI (using pihole -q -adlist Microsoft) leads to different outputs.
    On the web interface, the entered domain gets converted to lower case in queryads.js.
    Querying for Microsoft and microsoft in the CLI leads to different outputs.

    I had expected that in CLI the entered domain always get converted to lower case.
    Then there would no need to do this in javascript and the outputs would always be equal.

    However to be honest, I currently do not know how the domain name capitalization is handeld when the gravity list database gets filled and how incoming DNS queries are handled in this respect.

Debug Token:

-

Domain names are case insensitive so we shouldn't have anything that is affected by case.

May need to look for grep and change to grep -i, and set the PCRE to be insensitive.

Oh, this is ugly.

Probably should just make that whole script to be shopt -s nocasematch.

Actually, this is a quite difficult task to do. Yes, this is more simple for exact white- and blacklists, but for their respective regex versions, much more work would be required to ensure there are no duplicates. Even then, they are harmless as the whitelists always trump so the behavior should always be as expected (given users known what's on their lists).

FTL, internally, converts everything to lowercase before processing anything. It is the very first step before any query processing takes place.

For this reason, domains with at least one capital letter in the database do never match any incoming domain. We could amend the query expression by adding COLLATE NOCASE to the end of this line:

However, I'm not sure about the performance impact this may have. Currently, the lookups are very very fast as they are working on the B-tree. However, this tree can only do exact searches efficiently IIRC. This means either SQLite would need to traverse all entries every time, converting both ends to lower case before comparing or by testing the database against any possible capitilization variant of a given domain. Both sounds awfully slow. So my vote goes towards ensuring there capitalized domains do not make their way into the database.


Regex are case-sensitive by definition, however, we support setting

REGEX_IGNORECASE=true

in pihole-FTL.conf to overwrite this. This option defaults to false, however. @DanSchaper @PromoFaux Do we want to change this? We should discuss this properly, maybe also somewhere else.

I don't think case should ever matter in our code since it doesn't matter for the RFCs. Bash can be worked with to ignore case (though query.sh is a monster and I don't know why we did it all with shell instead of a sed/awk or more proper tool). Doesn't PCRE have a case sensitivity modifier that we can implement?

(?i) makes the regex case insensitive
Regex Tutorial - Turning Modes On and Off for Only Part of The Regular Expression

The CLI tools will remove an entry from the opposite list you add a domain to:

root@pihole:/etc/.pihole# pihole -w test.com
  [i] Adding test.com to the whitelist...
  [i] test.com does not exist in blacklist, no need to remove!
  [βœ“] Reloading DNS service
root@pihole:/etc/.pihole# pihole -b test.com
  [i] Adding test.com to the blacklist...
  [i] Removing test.com from the blacklist... [nb. there is a bug here, it should say whitelist]
  [βœ“] Reloading DNS service
root@pihole:/etc/.pihole# pihole -b test.com
  [i] test.com already exists in blacklist, no need to add!
  [i] test.com does not exist in whitelist, no need to remove!
root@pihole:/etc/.pihole# pihole -w test.com
  [i] Adding test.com to the whitelist...
  [i] Removing test.com from the whitelist... [nb. there is a bug here, it should say blacklist]
  [βœ“] Reloading DNS service

When adding a regex rule, it should be sufficient to check if exactly the same rule already is on the opposite list. Maybe this already works this way in CLI?

If I add a domain simultaneously to Blacklist and Whitelist then the domain gets blocked although it should be not.
I have tested this with google.com and test.de.

Right, and as expected the entered domain gets converted to lowercase letters.

Agreed it should not and it doesn't in testing my own. Whitelist always wins.

Can you get us a debug so we can see why your particular setup is not doing that?

Ok. I've added google.com and test.de back to both lists. Debug token is 9ggd1ejgon.

Narrator: This is on Master, so pretty much means nothing.

Nothing in either list, dig should NXDOMAIN

dschaper@nanopihole:/etc/pihole$ ls -l {white,black}list.txt
ls: cannot access 'whitelist.txt': No such file or directory
-rw-r--r-- 1 root root 0 Nov 29 19:12 blacklist.txt

dschaper@nanopihole:/etc/pihole$ cat {white,black}list.txt
cat: whitelist.txt: No such file or directory

dschaper@nanopihole:/etc/pihole$ dig +answer arbitrary.domain

; <<>> DiG 9.11.5-P4-5.1-Debian <<>> +answer arbitrary.domain
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 65059
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;arbitrary.domain.              IN      A

;; Query time: 1 msec
;; SERVER: 192.168.10.2#53(192.168.10.2)
;; WHEN: Fri Nov 29 19:16:16 UTC 2019
;; MSG SIZE  rcvd: 45

Blacklist domain, should be 0.0.0.0

dschaper@nanopihole:/etc/pihole$ pihole -b arbitrary.domain
  [i] Adding arbitrary.domain to blacklist...
  [i] arbitrary.domain does not exist in whitelist, no need to remove!

  [i] Pi-hole blocking is enabled
  [i] Using cached Event Horizon list...
  [i] 45733 unique domains trapped in the Event Horizon
  [i] Number of blacklisted domains: 1
  [i] Number of regex filters: 8
  [βœ“] Parsing domains into hosts format
  [βœ“] Cleaning up stray matter

  [βœ“] Force-reloading DNS service
  [βœ“] DNS service is running
  [βœ“] Pi-hole blocking is Enabled

dschaper@nanopihole:/etc/pihole$ dig +answer arbitrary.domain

; <<>> DiG 9.11.5-P4-5.1-Debian <<>> +answer arbitrary.domain
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15074
;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;arbitrary.domain.              IN      A

;; ANSWER SECTION:
arbitrary.domain.       2       IN      A       0.0.0.0

;; Query time: 1 msec
;; SERVER: 192.168.10.2#53(192.168.10.2)
;; WHEN: Fri Nov 29 19:16:43 UTC 2019
;; MSG SIZE  rcvd: 61

dschaper@nanopihole:/etc/pihole$ ls -l {white,black}list.txt
-rw-r--r-- 1 root root 17 Nov 29 19:16 blacklist.txt
-rw-r--r-- 1 root root  0 Nov 29 19:16 whitelist.txt

dschaper@nanopihole:/etc/pihole$ cat {white,black}list.txt
arbitrary.domain

Whitelist domain, should be NXDOMAIN again from upstream (allowed to resolve)

dschaper@nanopihole:/etc/pihole$ pihole -w arbitrary.domain
  [i] Adding arbitrary.domain to whitelist...
  [i] Removing arbitrary.domain from blacklist...

  [i] Pi-hole blocking is enabled
  [i] Using cached Event Horizon list...
  [i] 45733 unique domains trapped in the Event Horizon
  [i] Number of whitelisted domains: 1
  [i] Number of blacklisted domains: 0
  [i] Number of regex filters: 8
  [βœ“] Parsing domains into hosts format
  [βœ“] Cleaning up stray matter

  [βœ“] Force-reloading DNS service
  [βœ“] DNS service is running
  [βœ“] Pi-hole blocking is Enabled

dschaper@nanopihole:/etc/pihole$ dig +answer arbitrary.domain

; <<>> DiG 9.11.5-P4-5.1-Debian <<>> +answer arbitrary.domain
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 13691
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;arbitrary.domain.              IN      A

;; AUTHORITY SECTION:
.                       932     IN      SOA     a.root-servers.net. nstld.verisign-grs.com. 2019112900 1800 900 604800 86400

;; Query time: 4 msec
;; SERVER: 192.168.10.2#53(192.168.10.2)
;; WHEN: Fri Nov 29 19:17:09 UTC 2019
;; MSG SIZE  rcvd: 120

dschaper@nanopihole:/etc/pihole$ ls -l {white,black}list.txt
-rw-r--r-- 1 root root  0 Nov 29 19:17 blacklist.txt
-rw-r--r-- 1 root root 17 Nov 29 19:17 whitelist.txt

dschaper@nanopihole:/etc/pihole$ cat {white,black}list.txt
arbitrary.domain

This condition shouldn't happen, Dom/Adam?

*** [ DIAGNOSING ]: Exact whitelist
   id    domain                                                                                                enabled  date_added           date_modified        comment                                           
   ----  ----------------------------------------------------------------------------------------------------  -------  -------------------  -------------------  --------------------------------------------------                                      
   156   google.com                                                                                            1        2019-11-29 20:07:59  2019-11-29 20:07:59                                                    
   157   test.de                                                                                               1        2019-11-29 20:08:11  2019-11-29 20:08:11                                                    

*** [ DIAGNOSING ]: Exact whitelist groups

*** [ DIAGNOSING ]: Regex whitelist

*** [ DIAGNOSING ]: Regex whitelist groups

*** [ DIAGNOSING ]: Exact blacklist
   id    domain                                                                                                enabled  date_added           date_modified        comment                                           
   ----  ----------------------------------------------------------------------------------------------------  -------  -------------------  -------------------  --------------------------------------------------
   125   test.de                                                                                               1        2019-11-29 20:08:24  2019-11-29 20:08:24                                                    
   126   google.com                                                                                            1        2019-11-29 20:08:35  2019-11-29 20:08:35                                                    

Oh, I forgot to mention at the beginning, that I am able to add domains simultaiosly to both lists via the web interface.

In CLI it works as expected.

Since the domains are on both lists, I expected the whitelisting to have priority, also because

Yes, I'll defer to @DL6ER and @PromoFaux since they know the database integration better than I.

Oh yes, CLI and web interface are completely decoupled when it comes to editing lists in v5.0.

We'll have to extend

to something like

INSERT OR IGNORE INTO ".$table." (domain,comment) VALUES (:domain, :comment)
WHERE NOT EXISTS(SELECT * FROM ".$othertable1." WHERE domain = :domain)
  AND NOT EXISTS(SELECT * FROM ".$othertable2." WHERE domain = :domain)
  AND [...]

this can quickly become quite ugly. Also, it is only a front-end fix. Users can still insert duplicates into the database when they interact manually with the database.

The alternative is to restructure the database. Instead of maintaining 4 tables with always the same structure (exact/regex black-/whitelist), we can put all on them in a single table with additional columns type (0 = exact, 1 = regex) and color (0 = black, 1 = white). The views can then return those where the type and color match, e.g.,

CREATE VIEW vw_whitelist AS SELECT DISTINCT domain
    FROM domainlist
    WHERE domainlist.enabled = 1 AND type = 0 AND color = 1
    ORDER BY domainlist.id;

Uniqueness can then trivially be achieved by having a unique index in the single domainlist table.

What do you think about this proposal @PromoFaux and @DanSchaper ?

Agree with the single-table approach. It is less unweildy to interact with that way.

With a unique key on the domain, then nobody could enter any duplicate domains, which is a good thing. Question is, of course, is this desired behaviour? Are there any legitimate reasons why one may want the domain to be in both the white and blacklist? Probably not... but if I don't ask!

I don't think so. The whitelist always wins so duplicates are, in fact, already now harmless.
So the question is also: Is it worth restructuring the entire thing now? However, if we do it, then we should do it now.

For sure, if we're going to change how it works, we should do that now.

I'm not sure (at least, in all my years with SQL, i haven't used something like this) if it is possible to enforce a unique key between several tables. It is, I guess, possible, but would be messy code-wise.

If we keep several tables, then we need to run several queries to ensure that the data is or is not in the other table.

With one table we try to insert, and if it fails, then it is a duplicate.

This was my first idea, however, it doesn't seem SQL(ite) supports this. If, at all, this is possible, then the way to go seems to be SQL functions. And this really looks like a mess.

Lets, then, go with a single table to contain all the lists with an identifying key for list type. Actually, we could maybe cut it down to one column for identifier and use a bit mask

attribute value
blacklist 1
whitelist 2
regex 4
decimal:
1 = blacklist
5 = regex blacklist
2 = whitelist
6 = regex whitelist

That said, even after typing this out, probably just have values 0,1,2, and 3 to mean the same things, and it would be less complex.

Or maybe even, as you suggested :slight_smile:

domainId domain comments type isRegex
1 goodsite.com i like this site 1 0
2 badsite.com I don't like this site 0 0
3 (.*)\.something\.(.*) stops something with regex 0 1
4 (.*)\.somethingElse\.(.*) allows something with regex 1 1

tl;dr - disregard my first suggestions :smiley:

I already have a PR for this :slight_smile:

1 Like