Encountered non-critical SQL warnings

The question is not really "how" but rather "why" here.

And with why I do not mean "Why should we clean a list" but rather Does the list domains still make sense after cleaning?. I'm pretty sure the answer will almost always be No.

Assume a list blocks supermarket.com/ads/* and we reduce this to supermarket.com. Unwanted blocking.

After thinking about it a while I would agree with the view that cleaning a list is not something pihole should do.
But to educate users to select their lists well and to contact list maintainers if the list is not in HOSTS format pihole should provide an easy was to see which list causes problems.

My wish would be something like this:
If a list (or part of it) is not in HOSTS format skip the entire list. In the web UI grey out this list (or in Group management set it fixed to disabled), mark it with a red triangle and add a mouse over ('List is not in the correct format and has been disabled entirely. Please contact the list's maintainer to provide it in correct HOSTS format').

I'm currently working on something. Performance is always critical.
It is not as simple as was suggested, many lists have additional comments, etc. whilst they're still (mostly) in HOSTS format.

Please stand by.

1 Like

Please update and try again. As I'm still traveling this is, again, a blind-shot. The code should work, however, your tries will reveal if it actually works.

Wow, that's nice!

  [✓] Preparing new gravity database
  [i] Target: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
  [✓] Status: Retrieval successful
  [i] List quality: 51233 of 51233 lines importable (100.0%)

  [i] Target: https://mirror1.malwaredomains.com/files/justdomains
  [✓] Status: Retrieval successful
  [i] List quality: 26860 of 26860 lines importable (100.0%)

  [i] Target: http://sysctl.org/cameleon/hosts
  [✓] Status: Retrieval successful
  [i] List quality: 20567 of 20567 lines importable (100.0%)

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_tracking.txt
  [✓] Status: Retrieval successful
  [i] List quality: 34 of 34 lines importable (100.0%)

  [i] Target: https://s3.amazonaws.com/lists.disconnect.me/simple_ad.txt
  [✓] Status: Retrieval successful
  [i] List quality: 2701 of 2701 lines importable (100.0%)

  [i] Target: https://hosts-file.net/ad_servers.txt
  [✓] Status: Retrieval successful
  [i] List quality: 45736 of 45736 lines importable (100.0%)

  [i] Target: http://www.malwaredomainlist.com/hostslist/hosts.txt
  [✓] Status: Retrieval successful
  [i] List quality: 1104 of 1104 lines importable (100.0%)

  [i] Target: https://easylist.to/easylistgermany/easylistgermany.txt
  [✓] Status: Retrieval successful
  [i] List quality: 2 of 945 lines importable (0.2%)
      Example for invalid domains (showing only the first five):
      - 1.1]
      - easylist.germany@gmail.com
      - ||2mdn.net^$object,third-party,domain=anleger-fernsehen.de|blick.ch|fitforfun.de|focus.de|giga.de|golem.de|helpster.de|myspass.de|netzwelt.de|stol.it|sueddeutsche.de|tvtoday.de
      - ||4rm.de^$third-party
      - ||85.114.133.62^$third-party

  [i] Target: http://pgl.yoyo.org/adservers/serverlist.php?hostformat=hosts&showintro=0&mimetype=plaintext
  [✓] Status: Retrieval successful
  [i] List quality: 3292 of 3292 lines importable (100.0%)

  [i] Target: https://adaway.org/hosts.txt
  [✓] Status: Retrieval successful
  [i] List quality: 12156 of 12156 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardMobileAds.txt
  [✓] Status: Retrieval successful
  [i] List quality: 976 of 976 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardMobileSpyware.txt
  [✓] Status: Retrieval successful
  [i] List quality: 305 of 305 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardDNS.txt
  [✓] Status: Retrieval successful
  [i] List quality: 28614 of 28619 lines importable (99.9%)
      Example for invalid domains (showing only the first five):
      - банрек.рус
      - укроп-петрушка-огурцы.рф
      - хельга.рф
      - эхх.рф
      - mxtads.com:8040

  [i] Target: https://gitlab.com/ZeroDot1/CoinBlockerLists/raw/master/hosts
  [✓] Status: Retrieval successful
  [i] List quality: 46752 of 46752 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/biroloter/Mobile-Ad-Hosts/master/hosts
  [✓] Status: Retrieval successful
  [i] List quality: 674 of 674 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/HenningVanRaumle/pihole-ytadblock/master/ytadblock.txt
  [✓] Status: Retrieval successful
  [i] List quality: 69 of 69 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/anudeepND/youtubeadsblacklist/master/hosts.txt
  [✓] Status: Retrieval successful
  [i] List quality: 8882 of 8882 lines importable (100.0%)

  [i] Target: https://raw.githubusercontent.com/crazy-max/WindowsSpyBlocker/master/data/hosts/spy.txt
  [✓] Status: Retrieval successful
  [i] List quality: 365 of 365 lines importable (100.0%)

  [i] Target: https://www.dropbox.com/s/dz12nye2750vvbl/blacklist_spy-adware.txt?dl=0
  [✓] Status: Retrieval successful
  [i] List quality: 0 of 2 lines importable (0.0%)
      Example for invalid domains (showing only the first five):
      - event.preventdefault();
      - ensemble.getrequestid();

  [✓] Storing downloaded domains in new gravity database
  [✓] Building tree
  [✓] Swapping databases
  [i] Number of gravity domains: 250322 (204367 unique domains)

Thanks for testing, this looks how I intended it to look. Good.
This solution also rejects incompatible domains. This should also resolve all the previously seen SQLite errors/warnings as they will no longer happen at all.

Pinging @jpgpi250 and @Vexira on this as we know they have a large collection of lists so they can easily check if my solution is showing any odd behavior like filtering out domains it shouldn't. Also, all the database warnings/errors should be gone now.

pihole checkout core tweak/database_warnings_inspection

Hey I did as, was suggested to do by @DanSchaper and to check all my lists and I shrunk it to just a set that doesn't throw any errors, I've been sifting through the lists to see which ones were clean.

But I'll back up this current install and restore the backup from the old install and test your fix, also thank you for your efforts.

Ohh cool cheers for that I think I found one of his lists is down on fire blog, I'll be sure to find it and let him know

1 Like

From your PR I can see Valid are: a-z, A-Z, 0-9, dot (.), minus (-), underscore (_) and after pihole- g I get

  [i] 28614 of 28619 domains imported (99.9%)
      Sample of invalid domains (showing up to five):
      - банрек.рус
      - укроп-петрушка-огурцы.рф
      - хельга.рф
      - эхх.рф
      - mxtads.com:8040

I don't know if theses cyrillic (?) domains are valid, but as you are checking just for latin letters they might never be imported. Can you extend the check for internationalized domains?

Those are indeed internationalized domain names (IDNA) as resolvable by your browser.

DNS on the other hand never uses these names directly, as it is restricted to a subset of ASCII.

In order for IDNAs to be usable by DNS, they have to be converted to their Punycode equivalent, e.g.

банрек.рус is equivalent to xn--80abnuns.xn--p1acf

Depending on your browser, you might actually see the URL change to its Punycode equivalent when entering and submitting an IDNA URL.

As the lists are meant to be domain names for filtering at DNS level, I'd assume that it would be the list maintainer's job, rather than Pi-hole's, to provide the Punycode names in the first place.

Having Pi-hole do the conversion would mean to mutiply the effort for Punycode conversions by number of Pi-hole installations times building the gravity when using that list.
This would not only be a waste of ressources, it would slow down gravity creation as well.

1 Like

Thanks for the explanation, didn't know that DNS uses a subset of ASCII.

Will contact the list maintainer.

It's still crap, but you want to stuff yourself full of manure, I guess that's your wish.

If this stays then it needs to say

976 of 976 lines importable and that's it. No implying the list is quality, no saying it's 100% correct.

The latest update turned it into

 Target: https://raw.githubusercontent.com/r-a-y/mobile-hosts/master/AdguardDNS.txt
  [✓] Status: Retrieval successful
  [i] 28614 of 28619 domains imported (99.9%)
      Sample of invalid domains (showing up to five):
      - банрек.рус
      - укроп-петрушка-огурцы.рф
      - хельга.рф
      - эхх.рф
      - mxtads.com:8040

  [i] Target: https://gitlab.com/ZeroDot1/CoinBlockerLists/raw/master/hosts
  [✓] Status: Retrieval successful
  [i] 46752 of 46752 domains imported (100.0%)

Now that it is a problem, the users will have to clean up their lists, or the list maintainers will have to clean up the lists.

I want to have users know that they are putting garbage in to their DNS servers and learn to not do it anymore. This will help everyone going forward and far in to the future. If we have to monkeypatch in workarounds for bad data then no one will ever change and they will continue to put crap in their DNS servers.

1 Like

So you're saying people will stop using Pi-hole because we tell them that their lists were ineffective before and now they are not valid? Seems that we'd gain more trust from being fully transparent and teaching people what exactly it is they are doing.

You don't tell a diabetic to eat a dozen donuts but just remember to triple the insulin the next time.

There should be no rating. Pi-hole does not and can not "rate" a list. I can make a 100% accepted list that blocks you from accessing anything outside of your local network. Is that Quality 100!? People will now say "Hey, this is a 100% Pi-hole list! It's full of shit, but it's 100%!"