My 8-instance Pi-hole cluster (Debian 12 / Proxmox VMs)

Over the last 5 days I’ve been building and tuning a DNS test environment around the Pi-hole, focusing on high-QPS performance and packet-path limits.

The initial goal was simple benchmarking, but it evolved into a full-stack test:

• 8 Pi-hole instances (Debian 12 / Proxmox VMs)
• Google Public DNS for upstream recursive resolution
• dnsdist as the front end for load distribution
• dnspyre for sustained high-concurrency testing (example below)
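
For reference, the sustained tests were driven by commands shaped roughly like this. This is a sketch only: the frontend address 10.0.0.10 and domains.txt are placeholders, and the flag spellings should be checked against your dnspyre version.

    # 60 s of sustained UDP load, 100 concurrent workers,
    # hostnames read from a newline-separated list
    dnspyre --server 10.0.0.10 --duration 60s -c 100 @domains.txt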

Key findings and tuning steps:

  1. Switch bottleneck
    A cheap unmanaged switch was dropping packets under load. Replacing it significantly reduced UDP errors and improved consistency.

  2. UDP vs TCP tuning
    Initially I focused on TCP parameters, with minimal impact. The real gains came from tuning the UDP path and kernel networking; the counter check below shows where the drops surface.
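
The kernel's own counters make the UDP pain visible. A quick check with standard Linux tools, nothing Pi-hole-specific:

    # UDP-level drop counters; watch "packet receive errors"
    # and "receive buffer errors" climb under load
    netstat -su

    # or, with iproute2, just the two relevant counters
    nstat -az UdpInErrors UdpRcvbufErrors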

  3. Kernel tuning (major impact)
    The most effective change was tuning:

net.core.netdev_max_backlog = 32768

This reduced packet drops while avoiding excessive queue buildup. Higher values increased latency and reduced throughput.
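
A minimal sketch of applying and verifying it (the sysctl.d filename is arbitrary; note that the softnet_stat counters are hexadecimal):

    # apply at runtime
    sysctl -w net.core.netdev_max_backlog=32768

    # persist across reboots
    echo 'net.core.netdev_max_backlog = 32768' | sudo tee /etc/sysctl.d/99-dns-tuning.conf

    # verify: the 2nd hex column of each per-CPU row counts packets
    # dropped because the ingress backlog queue was full
    cat /proc/net/softnet_stat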

  4. VM environment
    Testing was done on Proxmox using VirtIO networking with default settings. No advanced NIC queue tuning or CPU pinning was applied — results reflect kernel/network tuning rather than hypervisor-level optimisation.

  5. Cache vs real workload
    Testing revealed two distinct performance profiles:

Warm cache (1,000 hostnames): ~80k QPS, ~0.2–0.3% errors

Larger working set (5,000 hostnames): ~6.5k QPS, ~5–6% errors

This clearly shows cache amplification vs real recursive/upstream limits.

  6. WAN limitation
    With ~10 Mbps upload, upstream capacity aligns closely with ~6k QPS when cache effectiveness drops — confirming bandwidth as the limiting factor in “cold” scenarios.
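
Back-of-envelope check, assuming very roughly 200 bytes on the wire per upstream exchange including UDP/IP overhead (an assumption; real sizes vary with the query mix):

    10 Mbps ≈ 1.25 MB/s
    1.25 MB/s ÷ ~200 B per exchange ≈ ~6,250 QPS

which lands right on the observed ~6k ceiling.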

Summary:

The system is now capable of:
• very high throughput for cached responses (~80k QPS)
• stable performance under sustained load
• exposing clear boundaries between cache, network, and upstream limits

Biggest takeaway:
Performance at this level is no longer about Pi-hole itself — it’s about packet handling, buffering, and network path efficiency.

Still experimenting, but current setup is stable and repeatable.


So which one should people avoid? :grimacing:

I used a TP-Link 8-port switch worth about $30.
I switched to a TP-Link 16-port switch worth about $100.


Pointing at a specific model here would seem a bit unfair to me.

A typical home network probably sees a few thousand to ten thousand DNS requests a day. It doesn't have to cope with 6,500 queries per second, amounting to 23 million per hour or well over half a billion requests per day, so that question doesn't seem applicable for home usage scenarios.

For home users, there's likely no reason to retire their switch just because it didn't manage well in smokingwheels's stress tests (unless they're also into seriously stress testing their equipment).


It’s always good to know this kind of stuff in case someone starts looking for issues on the Internet and finds this thread mentioning the very basic TP-Link 108 switch. But now we just need the exact revision of it, because there have been like 4 or 5 of them for the whole 105/108 line:

  • 105
  • 105E
  • 105PE
  • 108
  • 108E
  • 108PE

For example, games “SPAM” a lot of UDP traffic, and IMHO this might cause packet loss there too… You never know…

@nero355

Yeah, revision matters, but let’s be real about expectations here.

The TL-SG105/108 series (any revision) are cheap unmanaged switches with limited buffers and basic ASICs. They’re designed for throughput (Gbps), not high-PPS workloads.

What bites people is this:

  • Games, DNS, VoIP, etc. = small UDP packets → very high packets-per-second

  • These switches have tiny buffers + limited packet processing capacity

  • Result = microbursts → buffer overflow → packet loss

So even if you’re only doing 50–100 Mbps, you can still drop packets if the PPS is high enough.

Key point:

:backhand_index_pointing_right: Bandwidth ≠ performance
:backhand_index_pointing_right: PPS (packets/sec) is the real limiter
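
To put rough numbers on that, assuming ~80-byte DNS query packets versus full 1,500-byte frames (illustrative sizes, not measurements from this thread):

    100 Mbps of 1500-byte packets ≈   8,300 pps
    100 Mbps of   80-byte packets ≈ 156,000 pps

Same bandwidth, nearly 19× the packet rate. It's the forwarding rate, not the bits, that exhausts a small switch.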


For $30:

  • :check_mark: Fine for normal home use

  • :check_mark: Fine for bulk traffic (downloads, streaming)

  • :check_mark: Fine for light gaming

But:

  • :cross_mark: Not designed for stress tools (dnsblast, dnspyre, dnsblast-go, etc.)

  • :cross_mark: Not designed for sustained high PPS

  • :cross_mark: Not consistent under microburst load


So yeah: value for money = good, but expecting it to behave like enterprise gear under load is unrealistic.

If someone is chasing packet loss:

  • test with a better switch (even a cheap managed one with bigger buffers)

  • or reduce burst/concurrency

  • or accept the hardware limit


Great $30 switch for what it is.
Not a high-PPS device.
Packet loss under UDP bursts is expected, not a fault.

I will switch back to my $30 8-port TP-Link switch to test after all the tuning is done.


:test_tube: DNS Performance Testing Summary (dnsdist + Pi-hole cluster)

I’ve been benchmarking a local DNS stack using dnsdist (frontend) with multiple Pi-hole backends under high-QPS UDP load (dnspyre/dnsblast-style testing).

:desktop_computer: Hardware

  • HP DL360 (older server)

  • Proxmox VM environment

  • Up to 24 cores available


:wrench: Test Setup

  • dnsdist as frontend load balancer

  • Pi-hole instances as backends (scaled from a few → up to 12)

  • High PPS / UDP-heavy workload (small packets, burst traffic)

  • Both cached and uncached scenarios tested

:backhand_index_pointing_right: NOTE: The results below primarily reflect cached DNS performance, not full recursive (uncached) resolution.
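
For anyone reproducing the shape of this setup, the smallest possible dnsdist frontend can be started straight from the command line. A sketch only: the backend IPs are placeholders, and a real deployment would use a Lua config file with health checks and an explicit server policy instead:

    # listen on :53 and balance across three Pi-hole backends
    # (dnsdist's default server-selection policy applies)
    dnsdist -l 0.0.0.0:53 10.0.0.11 10.0.0.12 10.0.0.13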


:bar_chart: Key Findings

1. Peak vs Sustained Performance

  • Peak (short burst): ~140k QPS

  • Sustained (realistic): ~60k–70k QPS

:right_arrow: System handles very high bursts but settles to a stable throughput ceiling.


2. Frontend Core Scaling

  • Increasing dnsdist cores alone did not always improve performance

  • With limited backend capacity:

    • More cores = more queueing + higher drops
  • With a larger backend (12 Pi-holes):

    • Higher core counts improved throughput

:right_arrow: Frontend scaling only helps if backend can absorb it


3. Backend Scaling (most important factor)

  • Adding more Pi-hole instances gave the biggest improvement

  • Example:

    • Smaller backend → ~60–68k QPS

    • 12 Pi-holes → ~71k QPS sustained

:right_arrow: Backend fanout had more impact than CPU tuning


4. RAM Disk (tmpfs) Testing

  • Moving logs/DB to RAM:

    • Reduced disk I/O

    • Improved median latency (p50)

  • But:

    • Increased queue depth

    • Higher error rates under burst load

:right_arrow: Removing I/O bottlenecks can increase overload effects
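
For reference, the RAM-disk experiment amounts to something like this. The path is illustrative (recent Pi-hole versions log under /var/log/pihole), and it's a test-rig trick rather than a recommendation, since the contents vanish on reboot:

    # mount a RAM-backed filesystem over the log directory (testing only)
    mount -t tmpfs -o size=512M tmpfs /var/log/pihole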


5. Network Tuning (sysctl)

  • Increased buffers/backlog improved burst handling

  • Allowed ingestion of very high initial QPS (~400k+)

  • But:

    • Did not increase sustained throughput

    • Increased latency under overload

:right_arrow: Higher buffers = more queueing, not more capacity
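
The buffer/backlog changes were along these lines. The values are illustrative examples, not the exact figures used in these runs:

    # larger socket receive buffers to absorb bursts
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.rmem_default=8388608
    # deeper per-CPU ingress queue, as in the earlier post
    sysctl -w net.core.netdev_max_backlog=32768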


6. Real Bottleneck

The limiting factors are NOT:

  • CPU (low load observed)

  • Disk (after tuning)

  • Network stack (after sysctl tuning)

The bottleneck is:

  • dnsdist processing + scheduling

  • Pi-hole/FTL backend capacity

  • Queue buildup under burst load


:chart_decreasing: Typical Stable Envelope

Across multiple runs:

  • Throughput: ~65k–70k QPS

  • Error rate: ~1.5%–3%

  • Latency (under burst):

    • p50: ~200–250 ms

    • p95: ~280–300 ms


:brain: Key Takeaways

  • More cores ≠ more performance

  • Backend scaling > frontend scaling

  • Burst capacity ≠ sustainable throughput

  • Reducing bottlenecks can expose deeper limits

  • Queueing (not CPU) drives latency at high load


:chequered_flag: Best Performing Setup

So far:

  • dnsdist: ~20 cores

  • Backend: 12 Pi-hole instances

  • No RAM disk tricks

  • Tuned network stack

:right_arrow: Best balance of throughput and stability


:speech_balloon: Final Thoughts

For high-QPS DNS workloads:

  • Focus on backend scaling and distribution

  • Avoid over-driving the system with unrealistic burst loads

  • Tune for sustained throughput, not just peak numbers


:test_tube: Pi-hole (Tuned) Performance Snapshot: Cached Data

After tuning Pi-hole (logging adjustments, general cleanup, and backend balancing), I ran a focused test to look at individual node behaviour.
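
The logging adjustment is essentially the classic toggle below; current Pi-hole versions may expose the same switch through the settings UI or FTL config instead, so treat this as a sketch:

    # disable query logging to cut per-query disk writes
    pihole logging off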

:bar_chart: Results (single-node style load, 1,000-domain list)

  • Achieved send QPS: ~4,588

  • Total queries: 67,164

  • Successful: 66,484

  • Failed: 680

  • Error rate: ~1.01%

  • Latency:

    • p50: ~74 ms

    • p95: ~128 ms


:bar_chart: Results (single-node style load, 20-domain list)

  • Achieved send QPS: ~12k

:brain: Interpretation

  • This reflects per-node performance under sustained load, not aggregate cluster throughput

  • Results are consistent with earlier findings of ~12k cached peak per node, dropping under sustained pressure

  • Latency remains significantly lower than full-cluster burst tests due to:

    • reduced queue depth

    • more controlled ingestion rate


:magnifying_glass_tilted_left: Key Observations

  • Pi-hole handles moderate sustained load well, but:

    • throughput per node is limited under continuous pressure

    • queueing still appears as load increases

  • Error rate (~1%) is acceptable for this test shape and aligns with earlier cluster behaviour


:chequered_flag: Takeaway

:backhand_index_pointing_right: Individual Pi-hole instances perform reliably in the low-thousands QPS range under sustained load
:backhand_index_pointing_right: Scaling to higher throughput requires horizontal backend expansion (more nodes) rather than pushing individual instances harder

If anyone else is pushing Pi-hole/dnsdist at high PPS, I'd be interested to compare results :+1:
