[Solved] Weird connection drops / unstable name translation

Hello everyone,

I've had a pi-hole + proxy setup running correctly for most of 2023. 2 nights ago, it started to act up.
My network setup runs along the lines of:

  • Asus router running DHCP and with WAN connection
  • Pihole on RPI (not sure if 3 or 4), connected to the main router
  • Gigabyte router on second division, acting as AP
  • Unraid server connected to second router, and running an NGINX Proxy Manager as a transparent proxy

A normal request on my browser for a network-name (unraid.rabbit, for example) will go:
router -> pi-hole -> NPM -> ip:port

Up until now, things have run mostly well, with just a small drop here and there. It took me quite a while to get the whole thing running as I wanted, but once I did, it was stable as hell for quite a few months.

Then, maybe 2 nights ago, there was a power outage during the night, and things started to act up. Since then, I've reinstalled through the "pihole -r" command and selected the "wait for network to start" option, and eth0 now gets its IP address correctly at startup.

However, not all is smooth. After some time with low volume, or idling, connectivity to Pihole dies. I get constant "connection refused" errors whenever I try to access the web UI /admin (through IP, pi.hole, or my network name translation). SSH also refuses to connect through any means.

Name translation is also unstable on the rest of the network. I can barely use unraid to check for updates on its docker containers; it warns that the lists couldn't be retrieved due to name translation failure.

I've noticed that if I wait several minutes (5-10?), eventually connectivity seems to be re-established.

Expected Behaviour:

Name translation and network connection to remain stable throughout the day. Access to the web page and SSH to be available at all times.

Actual Behaviour:

After some minutes of low volume/idling, whenever I try to access the Pihole remotely, or do name translations, it completely stops working with "connection refused" errors when accessing its UI or through SSH. Name translation to the outside seems a bit more stable, although it fails sometimes.
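When narrowing down symptoms like these, it can help to record whether each attempt is actively refused or simply times out: a refusal means some device at that address answered and rejected the connection, while a timeout means nothing answered at all. Below is a minimal sketch of such a probe using only the Python standard library. The function name `probe_tcp` and the address in the usage comment are assumptions for illustration, not something from the setup above.

```python
import socket


def probe_tcp(host, port, timeout=2.0):
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout" or "error".
    """
    try:
        # create_connection performs the full TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Something at that address replied, but rejected the connection.
        return "refused"
    except socket.timeout:
        # No reply within the timeout.
        return "timeout"
    except OSError:
        # Anything else, e.g. "no route to host".
        return "error"


# Example usage (192.168.1.2 is a made-up address; use your Pi-hole's IP):
#   for port in (22, 53, 80):
#       print(port, probe_tcp("192.168.1.2", port))
```

Running this every minute or so from another machine and logging the results would show whether SSH (22), DNS over TCP (53) and the web UI (80) fail together, and in which way.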

Workaround:

Currently, if I go directly to my RPI, and do a ping to the network devices, normal behavior resumes, until a new "timeout" of sorts happens again. Whenever I try to ping my unraid server, or my desktop, it takes up to 10s to start pinging, but when it does, it has very low latency, and this is what causes normal behavior to resume.

This kinda seems to be related to connectivity, or to some networking information being lost after a while?

Debug Token:

https://tricorder.pi-hole.net/jMpxgQ3X/

Hi MiguelBazil,

I had something similar some time ago after a power outage. I spent hours with the Pi and then found that it was a broken SD card.

Try with a new SD card.

Marcus

Hey, thanks for the response. I'm going to pray that it's not something like that, as I'm using an external SSD, and that'd be an expensive issue :sweat:

Are there any tools/commands I could use to verify the integrity of an SSD's NAND?

Hi,
try smartctl:
$ sudo smartctl -t short /dev/sdX (where /dev/sdX is your SSD).
If it is not already installed:
$ sudo apt install smartmontools
Marcus

Thanks for the info. I ran the short test, and no issues were found. I'm not saying that rules it out, just that nothing was detected. I wonder if a full reinstall, followed by loading the configurations, would have a better chance.

Also, would there be a way to test the filesystem for errors?

And for completeness, here's the result from smartctl:

sudo smartctl --all /dev/sda
smartctl 7.2 2020-12-30 r5155 [aarch64-linux-6.1.21-v8+] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Phison Driven SSDs
Device Model: PNY CS900 120GB SSD
Serial Number: PNY422000267102198E6
LU WWN Device Id: 5 f8db4c 4220198e6
Firmware Version: CS900613
User Capacity: 120,034,123,776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-4 (minor revision not indicated)
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Mon Jan 29 20:55:41 2024 WET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (65535) seconds.
Offline data collection
capabilities:                    (0x79) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  30) minutes.
Conveyance self-test routine
recommended polling time:        (   6) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       25063
 12 Power_Cycle_Count       0x0012   100   100   000    Old_age   Always       -       75
168 SATA_Phy_Error_Count    0x0012   100   100   000    Old_age   Always       -       0
170 Bad_Blk_Ct_Erl/Lat      0x0003   100   100   000    Pre-fail  Always       -       0/141
173 MaxAvgErase_Ct          0x0012   100   100   000    Old_age   Always       -       91 (Average 61)
192 Unsafe_Shutdown_Count   0x0012   100   100   000    Old_age   Always       -       66
194 Temperature_Celsius     0x0023   067   067   000    Pre-fail  Always       -       33 (Min/Max 33/33)
218 CRC_Error_Count         0x000b   100   100   050    Pre-fail  Always       -       0
231 SSD_Life_Left           0x0013   100   100   000    Pre-fail  Always       -       97
241 Lifetime_Writes_GiB     0x0012   100   100   000    Old_age   Always       -       1926

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
  1  Short offline       Completed without error        00%            25063  -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
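
For anyone who wants to check an attribute table like the one above without reading it by eye, here is a minimal Python sketch. It assumes the ten-column layout shown in this output (other drives may report different columns), and the helper names `parse_smart_attributes` and `attributes_at_threshold` are made up for illustration. The sample rows are copied from the output above.

```python
def parse_smart_attributes(text):
    """Parse 'smartctl --all' attribute rows into a dict keyed by attribute name.

    Assumes ten whitespace-separated columns, with RAW_VALUE last
    (it may itself contain spaces).
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows start with a numeric ID; the header row does not.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "worst": int(fields[4]),
                "thresh": int(fields[5]),
                "raw": fields[9].strip(),
            }
    return attrs


def attributes_at_threshold(attrs):
    """Names of attributes whose normalized value is at or below a non-zero threshold."""
    return [
        name
        for name, a in attrs.items()
        if a["thresh"] > 0 and a["value"] <= a["thresh"]
    ]


SAMPLE = """\
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
12 Power_Cycle_Count 0x0012 100 100 000 Old_age Always - 75
173 MaxAvgErase_Ct 0x0012 100 100 000 Old_age Always - 91 (Average 61)
192 Unsafe_Shutdown_Count 0x0012 100 100 000 Old_age Always - 66
218 CRC_Error_Count 0x000b 100 100 050 Pre-fail Always - 0
231 SSD_Life_Left 0x0013 100 100 000 Pre-fail Always - 97
"""

attrs = parse_smart_attributes(SAMPLE)
print(attributes_at_threshold(attrs))
print(attrs["Unsafe_Shutdown_Count"]["raw"], "of", attrs["Power_Cycle_Count"]["raw"])
```

On this drive no attribute is at its threshold, which matches the PASSED self-assessment. The raw counters do show 66 unsafe shutdowns out of 75 power cycles.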

Update on this:

The issue was not on the Pi-hole. I'm not sure how, but apparently unraid decided to assign the Pi-hole's IP to another docker container, and the two were constantly fighting for that IP on the network. That's why it kept failing: the router kept changing the MAC address associated with the IP.

I'm still trying to figure out how the IP was incorrectly assigned, but that closes the issue. Hope this helps someone in the future.
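
For anyone landing here with similar symptoms, a conflict like this can be spotted by sampling the neighbour (ARP) table a few times from another Linux machine and checking whether one IP shows up with more than one MAC address. Below is a minimal Python sketch that does this on the text output of `ip neigh show`. The function names and all addresses in the sample are made up for illustration; they are not the real addresses from this network.

```python
from collections import defaultdict


def parse_ip_neigh(text):
    """Extract (ip, mac) pairs from 'ip neigh show' output.

    Expects lines such as:
        192.168.1.2 dev eth0 lladdr aa:bb:cc:dd:ee:ff REACHABLE
    Entries without a link-layer address (e.g. FAILED) are skipped.
    """
    pairs = []
    for line in text.splitlines():
        fields = line.split()
        if "lladdr" in fields:
            idx = fields.index("lladdr")
            if idx + 1 < len(fields):
                pairs.append((fields[0], fields[idx + 1].lower()))
    return pairs


def find_ip_conflicts(pairs):
    """Return {ip: sorted MACs} for every IP seen with more than one MAC."""
    seen = defaultdict(set)
    for ip, mac in pairs:
        seen[ip].add(mac)
    return {ip: sorted(macs) for ip, macs in seen.items() if len(macs) > 1}


# Hypothetical samples taken a few minutes apart and concatenated.
SAMPLES = """\
192.168.1.2 dev eth0 lladdr aa:aa:aa:aa:aa:01 REACHABLE
192.168.1.10 dev eth0 lladdr aa:aa:aa:aa:aa:10 STALE
192.168.1.2 dev eth0 lladdr 02:42:c0:a8:01:02 REACHABLE
192.168.1.30 dev eth0 FAILED
"""

conflicts = find_ip_conflicts(parse_ip_neigh(SAMPLES))
for ip, macs in conflicts.items():
    print(f"{ip} was seen with {len(macs)} MAC addresses: {', '.join(macs)}")
```

An IP that is reported here is worth checking against the router's client list and the network settings of each container on the unraid server.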

