Apply Pi-Hole blocking to CNAMEs

This is what I see on the wire for queries of the f7ds.liberation.fr domain:

02:05:14.421451 IP 192.168.10.150.49450 > 192.168.10.2.53: 53354+ [1au] A? f7ds.liberation.fr. (59)
02:05:14.423161 IP 192.168.10.2.10875 > 208.67.222.222.53: 18107+ [1au] A? f7ds.liberation.fr. (59)
02:05:14.725895 IP 208.67.222.222.53 > 192.168.10.2.10875: 18107 3/0/1 CNAME liberation.eulerian.net., CNAME atc.eulerian.net., A 109.232.197.179 (118)
02:05:14.726932 IP 192.168.10.2.53 > 192.168.10.150.49450: 53354 3/0/1 CNAME liberation.eulerian.net., CNAME atc.eulerian.net., A 109.232.197.179 (118)

That tells me that my workstation (192.168.10.150) is asking Pi-hole (192.168.10.2) for the domain. Pi-hole asks the upstream (208.67.222.222) for that domain and gets a single response in which the whole CNAME chain has already been resolved by the upstream, not by Pi-hole. If you have unbound set up locally, Pi-hole asks unbound for the address of f7ds and unbound is the one that resolves the CNAMEs.

Deferring to DL as he knows the internals better than I.

I don't know exactly how Pi-hole handles the requests, but the information needed to block is in the DNS response it gets from upstream (and in the DNS response it sends downstream), so couldn't it be used without any additional request?

liberation is the website that drew attention to this problem because it's a newspaper that claimed "now we don't track our paying customers" ... but many websites use this trick to bypass 3rd-party cookie blocking in browsers.
Some of them are very useful; for example, the French national railway company uses this trick too ... you can't just kill them all.

FYI, Criteo is actively suggesting the DNS delegation on their support page, so it seems to be becoming a thing:

https://guides.criteotilt.com/cname/cname_generic/

OneTag 2.0 is Criteo’s latest cross-device innovation protecting your reach of shoppers as well as facilitating ad relevance and ensuring accurate sales attribution.
To implement OneTag 2.0 it is necessary that you delegate Criteo a sub-domain by creating a CNAME record in your name-server/hosting platform.

There is a project that tries to identify delegated subdomains from the Pi-hole database and generate a custom blocklist from them. It's worth checking out: geoffrey/eulaurarien (Frogeye Git), which generates a host list of first-party trackers for ad-blocking.

(I'm not affiliated with any of the above)

Well, yes, and no. You were typing your text while I typed mine but, nonetheless, I think it should be clear from my (I hope not too extensive) description below.

Okay. So I will briefly describe (I really planned to be brief!) the current technical limitations and why it could work in the end, nonetheless. Please be aware that the texts under "1." and "2." describe fundamentally different principles.
The first description (1.) describes the current philosophy of Pi-hole (v4.x and the upcoming v5.x).
The second description (2.) describes a new philosophy I have realized here.

1. Why is "deep" CNAME blocking currently not possible?
Pi-hole's blocking happens only at one specific point in time: when receiving a new query. This is the point where FTL decides whether it installs a cache entry pointing towards 0.0.0.0/::.
Thereafter, the DNS processing chain is happening as usual in dnsmasq. It first checks the cache for the domain.

  • If it finds the entry we injected, it replies with 0.0.0.0/:: because this is in this cache entry (query status = blocked).
  • If it finds another cache entry (because it knows the domain already), it replies from cache as well (query status = cached).
  • If it cannot find the query in its cache, it sends the request upstream (query status = forwarded).
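The three cases above can be sketched as a small shell function. This is only an illustration of the decision order described in this post, not how FTL or dnsmasq is implemented; the two lists and all domain names in them are made-up placeholders.

```shell
# Hypothetical stand-ins for the blocklist and the dnsmasq cache (space-separated names).
BLOCKLIST="ads.example.com tracker.example.net"
CACHED="known.example.org"

# Print the status a query for $1 would get under the current design.
query_status() {
    case " $BLOCKLIST " in
        *" $1 "*) echo blocked; return ;;   # a cache entry pointing at 0.0.0.0/:: was injected
    esac
    case " $CACHED " in
        *" $1 "*) echo cached; return ;;    # answered from an existing cache entry
    esac
    echo forwarded                          # sent upstream; the reply is passed on as-is
}
```

Calling `query_status f7ds.liberation.fr` prints `forwarded`, because the name is on neither list, and nothing in this flow ever looks at the reply.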

Whatever the reply of the upstream server may be is directly sent to the client. We cannot change anything anymore at this point in time. Let me explain this a little bit on the example used in this discussion at several places:

;; QUESTION SECTION:
;f7ds.liberation.fr.		IN	A

;; ANSWER SECTION:
f7ds.liberation.fr.	307	IN	CNAME	liberation.eulerian.net.
liberation.eulerian.net. 7199	IN	CNAME	atc.eulerian.net.
atc.eulerian.net.	7199	IN	A	109.232.197.179

The browser sends the request for f7ds.liberation.fr. FTL decides not to block this domain (it is not contained on any blocking list) and sends the request upstream. The reply consists of the three records you see here, all at once. It is important to realize that there is no ping-pong activity involved here (as in: we first get liberation.eulerian.net, then we query this one, and so on). Instead, all the records are received at once from the upstream destination and are served straight to the original requestor. There is no way for us to object at this point in time.
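To make visible what arrives in that single reply, here is a minimal sketch that pulls the CNAME targets out of dig-style answer lines. The function name `cname_targets` is made up for this example, and it assumes the five-column layout shown above (name, TTL, class, type, data).

```shell
# Print the target of every CNAME record found in dig-style answer lines on stdin.
# Assumed columns: name, TTL, class, type, data.
cname_targets() {
    awk '$4 == "CNAME" { sub(/\.$/, "", $5); print $5 }'
}
```

Piping the output of `dig +noall +answer f7ds.liberation.fr` into `cname_targets` should list the intermediate names that a reply-side check would have to compare against the blocking lists.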

2. Why could "deep" CNAME blocking become reality in the future?
As already linked above, we vastly redesigned some of the inner machinery of FTL. Moving away from blocking queries by installing corresponding cache entries and towards replying directly to DNS queries (which involves truly building DNS UDP/TCP replies ourselves) theoretically enables us to intervene at any point in the process. This removes the one fundamental limitation that prevents us from doing this in the current variant of Pi-hole.
Why do I write "theoretical"? Well, as has also been raised as a concern in here already, validating not only the incoming request but also the (possibly many) replies certainly adds complexity to the overall process. If this is done with regex filters, the performance impact will be even worse.*

I will keep this in mind when continuing to develop on

which, I admit, is something I have somewhat lost sight of given all the many other things happening in real life. Once the performance impact of regex filters* is resolved, adding the feature requested in here should be possible.


*A comment on the performance impact:

  1. Current version of Pi-hole
    When a query is received, all regex filters (say 10) are evaluated. Based on them, a domain might be blocked, even if it is on no exact blocking list. The domain is then marked internally as known and no regex evaluation is happening for this domain in the future.
  2. Proposed changes in the machinery
    When a query is received, all regex filters (say 10) are evaluated. Based on them, a domain might be blocked, even if it is on no exact blocking list. Okay, that's similar up to here. However, we don't inject the cache entry now, but directly reply. We could still memorize this for the domain, however, the internal cache we need for this now needs to memorize this for each connected client. This increases the dynamic complexity notably and is something I haven't (yet) implemented into my proposal. Although I think that this is possible, I'm not (yet) sure how to do it best.
    Until this is implemented, all regex filters are evaluated for each and every requested domain and, with the feature requested in this discussion, also for all records in the replies. The cost of 10 regex evaluations (once per domain) increases to 40 regex evaluations (for every request!). All this is possible, however, the project as such targets all Raspberry Pi devices. With this change, we are slowly diverging from this target, at least as long as regex filtering is used.
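One plausible reading of the jump from 10 to 40 is that the filters run once for the queried name and once more for each of the three records in the example reply. A small sketch under that assumption (the function name is made up):

```shell
# Regex evaluations for one request: the queried name plus every record in the reply.
# Arguments: number of regex filters, number of records in the reply.
regex_evaluations() {
    filters=$1
    reply_records=$2
    echo $(( filters * (1 + reply_records) ))
}
```

With 10 filters and the three-record reply for f7ds.liberation.fr, `regex_evaluations 10 3` gives 40. Without per-domain memorization this cost is paid again on every request.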

Thank you for another thorough explanation, @DL6ER.

I had started experimenting with temporarily adding the line address=/eulerian.net/# to 01-pihole.conf the very minute you composed your answer, to see how dnsmasq itself handles this, only to confirm what you elaborated in your post just now: dnsmasq itself is not inspecting the answers, just the queries.

Still, it seems that Pi-hole is in a better position than browser based filters like uBlock Origin, as Pi-hole actually sees the full DNS answers, rather than just the resulting IP address.

To my simple mind, it looks like suppressing DNS responses in addition to blocking DNS queries would roughly double the workload on Pi-hole.
For my Zero sporting under 200k blocked domains and about 10 manual blocklist entries (one regex) serving 6 clients, that would hardly be noticeable, probably raising cpu utilization from ~4% to ~8%.
But I realise that other installations with more clients, requests and blocked domains would be hit harder.

You emphasize the effect of regex evaluation.
So what would be the performance impact of wildcard matching at the DNS response level?
Wouldn't that tackle the major part of CNAME obfuscated 3rd party trackers, e.g. just suppressing a domain (like eulerian.net) when encountered in a DNS response?
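As a rough illustration of what wildcard matching at the response level could mean, the sketch below checks every owner name and CNAME target of a reply against one blocked base domain. It is plain suffix matching on a label boundary, with no regex involved. Both function names are made up, and this is not how FTL does or will do it.

```shell
# Succeed if name $1 equals the blocked base domain $2 or is a subdomain of it.
in_blocked_zone() {
    case "$1" in
        "$2"|*."$2") return 0 ;;
    esac
    return 1
}

# Succeed if any owner name or CNAME target in the dig-style answer lines
# on stdin falls under the base domain given as $1.
response_is_blocked() {
    base=$1
    while read -r name _ttl _class type data; do
        in_blocked_zone "${name%.}" "$base" && return 0
        if [ "$type" = CNAME ]; then
            in_blocked_zone "${data%.}" "$base" && return 0
        fi
    done
    return 1
}
```

Fed with the three records for f7ds.liberation.fr and the base domain eulerian.net, `response_is_blocked` succeeds on the first record already, because its CNAME target liberation.eulerian.net lies under eulerian.net.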

Why do you think unbound should be relevant for this feature? The majority of Pi-hole users will likely run FTL with a distant upstream DNS. We are looking at implementing this somewhere but only inside Pi-hole itself.

The CPU utilization is a difficult, maybe misleading measure here as it is a momentary unit. Better suited would be either looking at load or memory utilization. However, the best metric for measuring the performance of Pi-hole should be the delay of replies. This is obviously difficult with queries that are sent upstream, however, replies answered from cache undergo (mostly) the same routines and can be used to measure.

Say your local delay is 2 msec and you have 4 active clients making 10 queries per second each. This means Pi-hole will be busy for 0.002 * 4 * 10 = 0.08 sec (80 msec) per second. This is a "busyness" of 8%.
If you, however, use the same hardware and, due to your many regex filters, the delay per query is, say, 100 msec, then the busyness will be 0.100 * 4 * 10 = 4 sec per second, i.e. 400%. This means the Pi-hole would only be able to reply to 1/4-th of the incoming queries in time and a certain backlog will build up. Clients who retry queries because of the delay only make things worse.

I hope this example makes it clear why we have to keep the delay (= the work per individual query) as low as possible. I should also say that I have never seen a delay coming even close to 100 msec even with regex filter lists going into the hundreds.
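The busyness arithmetic above can be reproduced with a one-line shell function. The function name is made up, and it assumes that the 10 queries per second are per client, which is how the multiplication in the example reads.

```shell
# Percentage of each second that is spent answering queries.
# Arguments: delay per query in msec, number of clients, queries per second per client.
busyness_percent() {
    # delay * clients * qps is the busy time in msec per second; 1000 msec would be 100 %.
    echo $(( $1 * $2 * $3 / 10 ))
}
```

`busyness_percent 2 4 10` gives 8 and `busyness_percent 100 4 10` gives 400, matching the two scenarios in the example.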

Having said all that, I spent a few hours writing and testing a suitable framework for keeping the majority of our current regex performance also with the newly proposed way of doing things. This could very well reduce the additional amount of work from N-times (where N may be a large number in unfortunate scenarios) to two-fold, which is obviously a much better compromise. Lots of implementation work still has to be done but we are, again, on a good track now.


Apparently, there is no solution to this problem, now or in the near future. This script will at least tell you for which domains this is happening, provided you are an unbound user and have enabled unbound-control.
It's not pretty, but it does the job. The result can be found in /etc/pihole/cnamematches.list and is formatted to be used by pihole-FTL (addn-hosts=/etc/pihole/cnamematches.list); a restart is required.
The script isn't smart, so CNAME entries that need to work should be added to the whitelist and removed from the result file.

'edit'
added comment to the result in /etc/pihole/cnamematches.list
result will now look like:

0.0.0.0 fonts.gstatic.com # CNAME gstaticadssl.l.google.com found in gravity list

'/edit'

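#!/bin/bash

# List every cached CNAME record from unbound and check its target against gravity.
# Fields in each dump_cache line: name, TTL, class, type, target.
TAB=$'\t'
sudo /usr/sbin/unbound-control dump_cache | grep 'CNAME' | grep "$TAB" | while read -r name _ttl _class _type target _rest; do
   # Strip the trailing dot from both names
   domain=${name%.}
   cname=${target%.}
   #echo "domain: $domain"
   #echo "cname: $cname"
   # Add the domain only if its CNAME target is in gravity and the domain itself
   # is not yet in the result file, the whitelist, or gravity.
   # --fixed-strings keeps the dots literal; matching is still by substring.
   if grep --quiet --fixed-strings -- "$cname" /etc/pihole/gravity.list; then
      if ! grep --quiet --fixed-strings -- "$domain" /etc/pihole/cnamematches.list; then
         if ! grep --quiet --fixed-strings -- "$domain" /etc/pihole/whitelist.txt; then
            if ! grep --quiet --fixed-strings -- "$domain" /etc/pihole/gravity.list; then
               #echo "cname found in gravity: $cname"
               #echo "domain to add to blocklist: $domain"
               printf '0.0.0.0 %s # CNAME %s found in gravity list\n' "$domain" "$cname" | sudo tee -a /etc/pihole/cnamematches.list
            fi
         fi
      fi
   fi
done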

Since the unbound cache is very much alive, you'll need to schedule a cron job to execute the script regularly. I'm aware this only works for entries, cached by unbound, so some CNAME entries could be missing.

The command dig @127.10.10.2 -p 5552 +dnssec f7ds.liberation.fr (modify IP and port to query unbound directly) ensures that the domain used in the original post is in the cache before the script runs. This assumes that liberation.eulerian.net is part of a blocklist:

pihole -q liberation.eulerian.net
 Match found in list.3.dbl.oisd.nl.domains:
   liberation.eulerian.net

liberation.eulerian.net is already in the gravity list, because it is in list.3.dbl.oisd.nl.domains.

It is f7ds.liberation.fr that will be added to /etc/pihole/cnamematches.list

This can never be a wildcard (cannot be liberation.fr) because, for example, www.liberation.fr may be a valid page, static.liberation.fr may contain some additional elements required. In order to effectively prevent the CNAME trick, we need to specify the domain exactly (f7ds.liberation.fr)

/etc/pihole/cnamematches.list will contain:

0.0.0.0 f7ds.liberation.fr

Another example to make things clear:

gravity.list contains gstaticadssl.l.google.com

pihole -q gstaticadssl.l.google.com
 Match found in list.3.dbl.oisd.nl.domains:
   gstaticadssl.l.google.com

gravity.list doesn't contain fonts.gstatic.com

pihole -q fonts.gstatic.com
  [i] No results found for fonts.gstatic.com within the block lists

but gstaticadssl.l.google.com is a CNAME for fonts.gstatic.com

dig fonts.gstatic.com

; <<>> DiG 9.11.5-P4-5.1-Raspbian <<>> fonts.gstatic.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15334
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1472
;; QUESTION SECTION:
;fonts.gstatic.com.             IN      A

;; ANSWER SECTION:
fonts.gstatic.com.      0       IN      CNAME   gstaticadssl.l.google.com.
gstaticadssl.l.google.com. 0    IN      A       172.217.19.195

As soon as CNAME gstaticadssl.l.google.com is in the unbound cache AND the script runs, fonts.gstatic.com will be added to /etc/pihole/cnamematches.list

0.0.0.0 fonts.gstatic.com

/etc/pihole/cnamematches.list will only grow (learn as time progresses) when using a cron job to execute the script. Example (my script is /home/pi/cron/cname.sh):

5,20,35,50 * * * *  root PATH="$PATH:/home/pi" /home/pi/cron/cname.sh >/dev/null 2>&1

You can add domain names to /etc/pihole/whitelist.txt to prevent blocking, but you need to manually remove the entry from /etc/pihole/cnamematches.list

pihole-FTL will only use the new list if you add a dnsmasq configuration file with this content:

addn-hosts=/etc/pihole/cnamematches.list

AND pihole-FTL is restarted.
Worst case scenario is the default restart once a week (pihole -g)

I am a bit confused by your posts, @jpgpi250.

What makes you think that?
Pi-hole's developers already stated they are on track for developing a solution:

And with regard to your CNAME matching:

f7ds.liberation.fr is the domain name request that Pi-hole is passing to FTL-DNS/dnsmasq.
If you block that, you don't need to look at the DNS answer. In fact, blocking f7ds.liberation.fr is very similar to what @DanSchaper has proposed to tackle the problem, and you wouldn't need an extra config file to do that.

If the intention of writing to cnamematches.list is to gather information about CNAME obfuscated 3rd party trackers, it probably would make sense to include the offending domain names from the answers as well. Otherwise you lose the information why an entry goes on that list.

Also, I am unsure what parts of your explanation are a definite part of Pi-hole's coming version, how unbound is involved and which ones are your own.

Pi-Hole does not normally restart with a gravity update.

When I run pihole -g, the output says:

pihole -g
  [i] Pi-hole blocking is enabled
  [i] Neutrino emissions detected...
  [✓] Pulling blocklist source list into range

   [i] Target: adaway.org (hosts.txt)
  [✓] Status: No changes detected

….

  [✓] Cleaning up stray matter

  [✓] Force-reloading DNS service
  [✓] DNS service is running
  [✓] Pi-hole blocking is Enabled

What does "Force-reloading DNS service" mean, if it isn't a restart in order to use the new gravity.list?

I don't expect a new pihole-FTL binary (with CNAME detection) to be released this year. Pi-hole 5.x has been in development for months now; gravity, regex, blacklist, whitelist, … are being moved into a database (/etc/pihole/gravity.db). Meanwhile, for unbound users only, this method lets you quantify the problem.

In both examples discussed, only the CNAME entry is in the gravity list; the domain name isn't. How would you know you need to block the domain name if you don't detect that the CNAME is used to bypass Pi-hole? An entry is made only when the CNAME is found in the gravity list AND the domain name isn't.

Adding an extra config file with the content addn-hosts=/etc/pihole/cnamematches.list is just one of the possibilities. If you don't want to add an extra dnsmasq configuration file, you could choose to add /etc/pihole/cnamematches.list as a block list, with this syntax (and yes, NOT a typo, really 3 forward slashes):

file:///etc/pihole/cnamematches.list

Example (this site does NOT use this method in real life): Let's assume you visit pcsupport.lenovo.com and they start using the CNAME method to bypass pihole. They would create a DNS entry recommend.lenovo.com, cname liberation.eulerian.net, and use the domain name to trigger a third party script. Even if liberation.eulerian.net is in the gravity list, the current version of pihole-FTL would NOT detect this. Running the script without the www limitation would add recommend.lenovo.com to the list. Running the script with the www limitation would never add recommend.lenovo.com to the list, thus the DNS query would be resolved to a valid IP.

As far as I know, whitelist.txt and gravity.list are replaced by a database (/etc/pihole/gravity.db) in Pi-hole 5.x. This would actually be beneficial for the detection. The majority of the script's processing time is spent on the grep searches in gravity.list. As soon as this data is available in the database, a simple sqlite3 query would really speed things up. Of course, the script will then require a rewrite.

It does not mean restart, it means reload. This is not done by

but by sending a signal to FTL causing it to re-read the gravity list and clearing the DNS cache. However, this is done without restarting. And in the future (the PR I linked above), we might simply reload gravity even without clearing the DNS cache.

Well, true. We could actually do it, however, it would be useless for most as we have not even started to think about what a web interface page for this could look like. Users who are fine with interacting directly with the database can already try it. I was able to resolve the mentioned performance penalty this morning, which brings this a lot further. However, true, I haven't had time to properly look at CNAME interventions. And I think I shouldn't do it as part of this PR but as a follow-up.

Found a possible problem with the used logic.

if the cname is nieuwsblad.be (assume this is in gravity.list)
and the domain is anything.nieuwsblad.be

a new entry would be created.

but since both the cname and the domain end with nieuwsblad.be, this is NOT something you want to catch (make a new entry).

This would require some additional logic in the above script, such as:

if [[ "$domain" != *"$cname" ]]; then
	: # continue processing ...
fi

Don't know if you need to consider this in your pihole-FTL code...

Here is the code for the script: pi-hole/gravity.sh at master · pi-hole/pi-hole · GitHub

As a blocklist maintainer, I 100% support this feature. I add all hosts as I find them, but it's certainly a losing battle to keep up with them all. Here is just one example that everyone should be familiar with:

4645336.fls.doubleclick.net. 21599 IN CNAME dart.l.doubleclick.net

There are millions of doubleclick hosts, but a drastically smaller number of CNAME values for all those hosts. Being able to block just dart.l.doubleclick.net and have it take effect for all the doubleclick hosts would be amazing! And before anyone says it - yes, it's easy to block all of doubleclick.net using regex/wildcards - but as a blocklist maintainer, I do not have the ability to make use of that feature, not to mention the performance tradeoffs of regex.

Furthermore, I would be opposed to making this feature a separate list. I would recommend going the same path as uBlock Origin, where it's an on/off toggle for all lists and defaults to off.


Isn't a regex a more sensible way to handle the doubleclick domains? One regex and you are done.

@jfb maybe you didn't read my full comment?

And before anyone says it - yes its easy to block all of doubleclick.net using regex/wildcards - but as a blocklist maintainer, I do not have the ability to make use of that feature

Yes, I can block things locally via regex easily - but that is not something that I can include in my public blocklist for others to use.

I did read it. The point is that blocklists with individual domain blocks are not the ideal solution for this specific problem. A more suitable feature of Pi-Hole (regex) should be applied in this case.

The regex feature was added to Pi-Hole so that it can be used.

Seems like NextDNS implemented exactly this already:
https://news.ycombinator.com/item?id=21610386