Some websites are showing up as unknown in status while others work

DL6ER · October 7, 2020, 2:37pm

Thanks for providing the PCAP via PM. I checked what was going on in your network and found that the second query was in fact a resubmission because Windows was impatient.

Windows resubmitted after waiting only 0.1 seconds! That's pretty odd and a bit low for a timeout, but okay, this is probably among the things that cannot be fixed on Windows.
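As a rough illustration of what "resubmission" means here, the sketch below flags queries in a capture that repeat an earlier query from the same client within a short window. It is not how the PCAP was actually analyzed: the `Query` record, the `find_resubmissions` name, and the assumption that a retry reuses the same query ID and name are all hypothetical choices made for the example.

```python
from collections import namedtuple

# Hypothetical record for one captured DNS query (timestamp in seconds).
Query = namedtuple("Query", "timestamp client query_id qname")


def find_resubmissions(queries, window=1.0):
    """Return queries that repeat an earlier (client, ID, name) within `window` seconds.

    Assumption: a client-side retry reuses the same query ID and name.
    """
    last_seen = {}
    resubmitted = []
    for q in sorted(queries, key=lambda item: item.timestamp):
        key = (q.client, q.query_id, q.qname.lower())
        previous = last_seen.get(key)
        if previous is not None and q.timestamp - previous <= window:
            resubmitted.append(q)
        # Remember the most recent time this exact query was seen.
        last_seen[key] = q.timestamp
    return resubmitted
```

Feeding it two identical queries spaced 0.1 seconds apart would report the second one as a resubmission, which matches the pattern described above.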

Now we know what is going on and I can look into reproducing this locally so we can work on a fix.

Scepterus · October 7, 2020, 2:45pm

that's great! I would point out that I do use TCP Optimizer on my Windows machines with these settings:
[screenshot: image_2020-10-07_174412]
[screenshot: image_2020-10-07_174436]

I didn't see anything that's immediately relevant, but maybe you will.

DanSchaper · October 7, 2020, 2:46pm

Undo the optimizer and see if things work right without it. If so, add back tweaks one at a time until you find the one that is causing it.

I'm very sure it's one of the tweaks.

Scepterus · October 7, 2020, 3:02pm

that's a lot of restarts for something that won't fix this for sure. I'll read later into each tweak to see in-depth if something is more relevant to this case. however if it's a time out thing, those entries were in the 600ms range for some of the sites, it could be a normal timeout.

DanSchaper · October 7, 2020, 3:06pm

It's just a single restart. The one that disables the entire list of changes. That will tell you with certainty if the issue is with pihole-FTL or if it's self-inflicted.

DL6ER · October 8, 2020, 9:04am

Just a quick update: Reproducing this locally turns out to be a lot trickier than I figured initially because Linux (which is the only operating system I have at hand) is trying really hard to prevent me from doing DNS lookups with such a ridiculously low retry timeout.

Still work in progress...

Scepterus · October 8, 2020, 9:11am

ah Linux, allowing you to do stupid things if you want to, but you'll have to work hard for that. yeah, windows is a bit more flexible with user errors. anyway, I might have the time today to restore the settings of the TCP optimizer to defaults and check if that's the cause.

DL6ER · October 8, 2020, 3:11pm

I honestly disagree. From what I know, the registry is a beast you don't want to edit manually. And you can only tweak such things in Windows using third-party software.

Anyway, even when I was able to reproduce retried queries by sending queries with the same query ID in short succession, I was not able to reproduce exactly what you saw. However, I'm currently on my somewhat limited mobile setup and will try to reproduce this at home next week.
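One way to send two queries with the same query ID in short succession is to build the DNS packet by hand and send it twice from one UDP socket. The following is only a minimal sketch of that idea, not the test setup used above; the function names, the default ID `0x1234`, and the 0.1 second gap are assumptions chosen for illustration.

```python
import socket
import struct
import time


def build_query(qname, query_id, qtype=1):
    """Build a plain DNS query packet (qtype 1 = A record, class IN)."""
    # Header: ID, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in qname.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack("!HH", qtype, 1)
    return header + question


def send_retried(qname, server, query_id=0x1234, gap=0.1, port=53):
    """Send the same query twice, `gap` seconds apart, and collect any replies."""
    packet = build_query(qname, query_id)
    replies = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(2.0)
        sock.sendto(packet, (server, port))
        time.sleep(gap)
        # Same socket, so the retry comes from the same source port with the same ID.
        sock.sendto(packet, (server, port))
        try:
            while len(replies) < 2:
                data, _ = sock.recvfrom(4096)
                replies.append(data)
        except socket.timeout:
            pass
    return replies
```

Calling `send_retried("ikea.com", "192.168.1.2")` (with the address replaced by that of the resolver under test) would produce two identical queries 0.1 seconds apart.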

So far, the proposed change is documented here:

github.com/pi-hole/FTL

Add new query retried status

pi-hole:development ← pi-hole:fix/query_retries

opened 03:08PM - 08 Oct 20 UTC

DL6ER

+44 -6

**By submitting this pull request, I confirm the following:**

- [X] I have read and understood the [contributors guide](https://github.com/pi-hole/pi-hole/blob/master/CONTRIBUTING.md).
- [X] I have checked that [another pull request](https://github.com/pi-hole/FTL/pulls) for this purpose does not exist.
- [X] I have considered, and confirmed that this submission will be valuable to others.
- [X] I accept that this submission may not be used, and the pull request closed at the will of the maintainer.
- [X] I give this submission freely, and claim no ownership to its content.

**How familiar are you with the codebase?:**

## 10

---

Add new status `RETRIED` (12) to be used for queries which were retried. If a query was retried five times until we received a reply from upstream, queries 1-4 will be marked as `RETRIED` and only query 5 will stay in status `FORWARDED`. This does not affect the statistics because all five queries were sent upstream, so five upstream packets are counted, even when only one query stays in status `FORWARDED`.

**edit** Add another new status `RETRIED_DNSSEC` (13) to be used for queries for which automatic DNSSEC queries were retried. If we've already got an answer to a query, but we're awaiting keys for validation, there's no point retrying the query, so we're retrying the key query instead.

The web interface will need updating to support this status. This will be done in a follow-up PR.
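The `RETRIED` rule described in the pull request can be pictured with a small sketch. This is not FTL's code (FTL is written in C); the `QueryRecord` class and the `apply_retry_status` function are hypothetical names, and statuses are kept as plain strings to avoid guessing at FTL's internal values.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """Hypothetical stand-in for one logged attempt of the same query."""
    number: int
    status: str = "FORWARDED"


def apply_retry_status(attempts):
    """Mark every attempt except the last as RETRIED; return the upstream packet count.

    `attempts` holds all attempts of one query in chronological order.
    """
    for record in attempts[:-1]:
        record.status = "RETRIED"
    # Every attempt was sent upstream, so all of them still count in the statistics.
    return len(attempts)
```

With five attempts, the first four end up as `RETRIED`, the fifth stays `FORWARDED`, and the returned count is still five, which mirrors the statement that the statistics are unaffected.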

Scepterus · October 8, 2020, 4:56pm

> windows is a bit more flexible

I mentioned it's flexible with user errors. I worked as a PC tech for most of my career, and the ease with which normal users can destroy Windows with a few clicks, be it with 3rd-party software or just randomly, is astounding.

as for the change, that's great! I hope this also helps other people, maybe ones with lower-end hardware or low memory or something.

haven't gotten around to testing the TCP optimizer, maybe tomorrow. will update with the results.

Scepterus · October 9, 2020, 11:25am

yep, reverted to windows defaults and the issue disappeared. will test further to see if it's just temporary.
EDIT: was wrong, it did not change, and I saw this happen on a computer in the network I'm pretty sure I did not use the optimizer on.

DL6ER · October 9, 2020, 2:20pm

Hmm, strange that we are not seeing this from other users on Windows (at least there are no reports). Anyway, a method to handle this is on its way. I hope this will work for you as well.

@Scepterus Could you try

pihole checkout ftl fix/retries_master

and see if the situation changes?

Scepterus · October 10, 2020, 4:21pm

@DL6ER do I need to restart after that? because I didn't and I still see those.

DL6ER · October 10, 2020, 7:25pm

Restart shouldn't be necessary. I guess it may be something else (or rather: in addition) then. I'll keep looking for it.

Coro · October 10, 2020, 7:30pm

Can we reproduce this on a Mac?

Scepterus · October 11, 2020, 4:28am

thanks!

Scepterus · October 13, 2020, 4:35am

I think I found a possible explanation in TCP optimizer. there's a setting called "Retransmit Timeout" the description for it says it determines the time before connections are aborted.

now you can see in my screenshot the initial time is 2 seconds, and the minimum time is 300 ms. that would explain the queries that took more than 600ms to respond showing up as unknown.

however, queries should mostly not take that long to respond, I'm using Cloudflare DNS which has a very fast response time, around 60-80ms.

DL6ER · October 13, 2020, 5:22am

Hmm, yes, that's indeed interesting. Can you test the delay for some random domains you have not queried before? Like

dig ebay.com @1.1.1.1
dig ikea.com @1.1.1.1

and some others, checking the reply time (right at the bottom)?

Scepterus · October 13, 2020, 10:30am

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Raspbian <<>> ikea.com @1.1.1.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 704
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;ikea.com.                      IN      A

;; ANSWER SECTION:
ikea.com.               300     IN      A       204.74.99.103

;; Query time: 65 msec
;; SERVER: 1.1.1.1#53(1.1.1.1)
;; WHEN: Tue Oct 13 13:25:48 IDT 2020
;; MSG SIZE  rcvd: 53

Did not enter IKEA at all, so it's a new site. it's 65 ms.

this is one I ran:

dig blizzard.com @1.1.1.3

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Raspbian <<>> blizzard.com @1.1.1.3
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 14091
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;blizzard.com.                  IN      A

;; ANSWER SECTION:
blizzard.com.           129     IN      A       137.221.106.104

;; Query time: 188 msec
;; SERVER: 1.1.1.3#53(1.1.1.3)
;; WHEN: Tue Oct 13 13:27:31 IDT 2020
;; MSG SIZE  rcvd: 57

I will watch my network to see if something is using the upload to the limit of my isp's bandwidth. would have been nice to have a dashboard in pihole for traffic that at least goes through the pihole.

Scepterus · October 13, 2020, 11:05am

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Raspbian <<>> get.paleorecipebook.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 40500
;; flags: qr rd ra; QUERY: 1, ANSWER: 3, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;get.paleorecipebook.com.       IN      A

;; ANSWER SECTION:
get.paleorecipebook.com. 300    IN      CNAME   unbouncepages.com.
unbouncepages.com.      60      IN      A       54.93.101.66
unbouncepages.com.      60      IN      A       18.196.95.178

;; Query time: 339 msec
;; SERVER: 1.1.1.3#53(1.1.1.3)
;; WHEN: Tue Oct 13 14:05:16 IDT 2020
;; MSG SIZE  rcvd: 112
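
For comparing many such lookups, the reply time can be pulled out of saved `dig` output instead of being read by eye. The snippet below is a small sketch that relies only on the `;; Query time:` and `;; SERVER:` lines shown in the output above; the function names and the idea of filtering by a threshold are assumptions for illustration, and any threshold value is the caller's choice.

```python
import re

# Patterns for the two summary lines at the bottom of dig's output.
QUERY_TIME = re.compile(r"^;; Query time: (\d+) msec", re.MULTILINE)
SERVER = re.compile(r"^;; SERVER: ([^#\s]+)#", re.MULTILINE)


def parse_dig(output):
    """Extract the query time (ms) and server address from dig's text output."""
    time_match = QUERY_TIME.search(output)
    server_match = SERVER.search(output)
    return {
        "query_time_ms": int(time_match.group(1)) if time_match else None,
        "server": server_match.group(1) if server_match else None,
    }


def slower_than(outputs, threshold_ms):
    """Return parsed results whose query time exceeds `threshold_ms`."""
    parsed = [parse_dig(text) for text in outputs]
    return [
        result for result in parsed
        if result["query_time_ms"] is not None
        and result["query_time_ms"] > threshold_ms
    ]
```

Applied to the three outputs above with a threshold of 300 ms, only the 339 msec lookup would be returned.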