My 8-instance Pi-hole cluster (Debian 12 / Proxmox VMs)

Over the last 5 days I’ve been building and tuning a DNS test environment around the Pi-hole, focusing on high-QPS performance and packet-path limits.

The initial goal was simple benchmarking, but it evolved into a full-stack test:

• 8 Pi-hole instances (Debian 12 / Proxmox VMs)
• Google Public DNS for upstream recursive resolution
• dnsdist as the front end for load distribution
• dnspyre for sustained high-concurrency testing (example below)
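
For reference, the sustained tests were driven by commands shaped roughly like this. This is a sketch only: the frontend address 10.0.0.10 and domains.txt are placeholders, and the flag spellings should be checked against your dnspyre version.

    # 60 s of sustained UDP load, 100 concurrent workers,
    # hostnames read from a newline-separated list
    dnspyre --server 10.0.0.10 --duration 60s -c 100 @domains.txt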

Key findings and tuning steps:

  1. Switch bottleneck
    A cheap unmanaged switch was dropping packets under load. Replacing it significantly reduced UDP errors and improved consistency.

  2. UDP vs TCP tuning
    Initially I focused on TCP parameters, with minimal impact. The real gains came from tuning the UDP path and kernel networking; the counter check below shows where the drops surface.
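
The kernel's own counters make the UDP pain visible. A quick check with standard Linux tools, nothing Pi-hole-specific:

    # UDP-level drop counters; watch "packet receive errors"
    # and "receive buffer errors" climb under load
    netstat -su

    # or, with iproute2, just the two relevant counters
    nstat -az UdpInErrors UdpRcvbufErrors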

  3. Kernel tuning (major impact)
    The most effective change was tuning:

net.core.netdev_max_backlog = 32768

This reduced packet drops while avoiding excessive queue buildup. Higher values increased latency and reduced throughput.
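
A minimal sketch of applying and verifying it (the sysctl.d filename is arbitrary; note that the softnet_stat counters are hexadecimal):

    # apply at runtime
    sysctl -w net.core.netdev_max_backlog=32768

    # persist across reboots
    echo 'net.core.netdev_max_backlog = 32768' | sudo tee /etc/sysctl.d/99-dns-tuning.conf

    # verify: the 2nd hex column of each per-CPU row counts packets
    # dropped because the ingress backlog queue was full
    cat /proc/net/softnet_stat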

  4. VM environment
    Testing was done on Proxmox using VirtIO networking with default settings. No advanced NIC queue tuning or CPU pinning was applied — results reflect kernel/network tuning rather than hypervisor-level optimisation.

  5. Cache vs real workload
    Testing revealed two distinct performance profiles:

Warm cache (1,000 hostnames): ~80k QPS, ~0.2–0.3% errors

Larger working set (5,000 hostnames): ~6.5k QPS, ~5–6% errors

This clearly shows cache amplification vs real recursive/upstream limits.

  6. WAN limitation
    With ~10 Mbps upload, upstream capacity aligns closely with ~6k QPS when cache effectiveness drops — confirming bandwidth as the limiting factor in “cold” scenarios.
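
Back-of-envelope check, assuming very roughly 200 bytes on the wire per upstream exchange including UDP/IP overhead (an assumption; real sizes vary with the query mix):

    10 Mbps ≈ 1.25 MB/s
    1.25 MB/s ÷ ~200 B per exchange ≈ ~6,250 QPS

which lands right on the observed ~6k ceiling.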

Summary:

The system is now capable of:
• very high throughput for cached responses (~80k QPS)
• stable performance under sustained load
• exposing clear boundaries between cache, network, and upstream limits

Biggest takeaway:
Performance at this level is no longer about Pi-hole itself — it’s about packet handling, buffering, and network path efficiency.

Still experimenting, but current setup is stable and repeatable.


So which one should people avoid? :grimacing:

I used a TP-Link 8-port switch worth about $30.
I switched to a TP-Link 16-port switch worth about $100.


Pointing at a specific model here would seem a bit unfair to me.

A typical home network probably sees a few thousand to ten thousand DNS requests a day. It doesn't have to cope with 6,500 queries per second, amounting to 23 million per hour or well over half a billion requests per day, so that question doesn't seem applicable for home usage scenarios.

For home users, there's likely no reason to retire their switch just because it didn't manage well in smokingwheels's stress tests (unless they're also into seriously stress testing their equipment).


It’s always good to know this kind of stuff in case someone starts looking for issues on the Internet and finds this thread mentioning the very basic TP-Link 108 switch. But now we just need the exact revision of it, because there have been like 4 or 5 of them for the whole 105/108 line:

  • 105
  • 105E
  • 105PE
  • 108
  • 108E
  • 108PE

For example, games “SPAM” a lot of UDP traffic, and IMHO this might cause packet loss there too… You never know…

@nero355

Yeah, revision matters, but let’s be real about expectations here.

The TL-SG105/108 series (any revision) are cheap unmanaged switches with limited buffers and basic ASICs. They’re designed for throughput (Gbps), not high-PPS workloads.

What bites people is this:

  • Games, DNS, VoIP, etc. = small UDP packets → very high packets-per-second

  • These switches have tiny buffers + limited packet processing capacity

  • Result = microbursts → buffer overflow → packet loss

So even if you’re only doing 50–100 Mbps, you can still drop packets if the PPS is high enough.

Key point:

:backhand_index_pointing_right: Bandwidth ≠ performance
:backhand_index_pointing_right: PPS (packets/sec) is the real limiter
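
To put rough numbers on that, assuming ~80-byte DNS query packets versus full 1,500-byte frames (illustrative sizes, not measurements from this thread):

    100 Mbps of 1500-byte packets ≈   8,300 pps
    100 Mbps of   80-byte packets ≈ 156,000 pps

Same bandwidth, nearly 19× the packet rate. It's the forwarding rate, not the bits, that exhausts a small switch.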


For $30:

  • :check_mark: Fine for normal home use

  • :check_mark: Fine for bulk traffic (downloads, streaming)

  • :check_mark: Fine for light gaming

But:

  • :cross_mark: Not designed for stress tools (dnsblast, dnspyre, dnsblast-go, etc.)

  • :cross_mark: Not designed for sustained high PPS

  • :cross_mark: Not consistent under microburst load


So yeah: value for money = good, but expecting it to behave like enterprise gear under load is unrealistic.

If someone is chasing packet loss:

  • test with a better switch (even a cheap managed one with bigger buffers)

  • or reduce burst/concurrency

  • or accept the hardware limit


Great $30 switch for what it is.
Not a high-PPS device.
Packet loss under UDP bursts is expected, not a fault.

I will switch back to my $30 8-port TP-Link switch to test after all the tuning is done.


:test_tube: DNS Performance Testing Summary (dnsdist + Pi-hole cluster)

I’ve been benchmarking a local DNS stack using dnsdist (frontend) with multiple Pi-hole backends under high-QPS UDP load (dnspyre/dnsblast-style testing).

:desktop_computer: Hardware

  • HP DL360 (older server)

  • Proxmox VM environment

  • Up to 24 cores available


:wrench: Test Setup

  • dnsdist as frontend load balancer

  • Pi-hole instances as backends (scaled from a few → up to 12)

  • High PPS / UDP-heavy workload (small packets, burst traffic)

  • Both cached and uncached scenarios tested

:backhand_index_pointing_right: NOTE: The results below primarily reflect cached DNS performance, not full recursive (uncached) resolution.
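
For anyone reproducing the shape of this setup, the smallest possible dnsdist frontend can be started straight from the command line. A sketch only: the backend IPs are placeholders, and a real deployment would use a Lua config file with health checks and an explicit server policy instead:

    # listen on :53 and balance across three Pi-hole backends
    # (dnsdist's default server-selection policy applies)
    dnsdist -l 0.0.0.0:53 10.0.0.11 10.0.0.12 10.0.0.13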


:bar_chart: Key Findings

1. Peak vs Sustained Performance

  • Peak (short burst): ~140k QPS

  • Sustained (realistic): ~60k–70k QPS

:right_arrow: System handles very high bursts but settles to a stable throughput ceiling.


2. Frontend Core Scaling

  • Increasing dnsdist cores alone did not always improve performance

  • With limited backend capacity:

    • More cores = more queueing + higher drops
  • With a larger backend (12 Pi-holes):

    • Higher core counts improved throughput

:right_arrow: Frontend scaling only helps if backend can absorb it


3. Backend Scaling (most important factor)

  • Adding more Pi-hole instances gave the biggest improvement

  • Example:

    • Smaller backend → ~60–68k QPS

    • 12 Pi-holes → ~71k QPS sustained

:right_arrow: Backend fanout had more impact than CPU tuning


4. RAM Disk (tmpfs) Testing

  • Moving logs/DB to RAM:

    • Reduced disk I/O

    • Improved median latency (p50)

  • But:

    • Increased queue depth

    • Higher error rates under burst load

:right_arrow: Removing I/O bottlenecks can increase overload effects
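
For reference, the RAM-disk experiment amounts to something like this. The path is illustrative (recent Pi-hole versions log under /var/log/pihole), and it's a test-rig trick rather than a recommendation, since the contents vanish on reboot:

    # mount a RAM-backed filesystem over the log directory (testing only)
    mount -t tmpfs -o size=512M tmpfs /var/log/pihole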


5. Network Tuning (sysctl)

  • Increased buffers/backlog improved burst handling

  • Allowed ingestion of very high initial QPS (~400k+)

  • But:

    • Did not increase sustained throughput

    • Increased latency under overload

:right_arrow: Higher buffers = more queueing, not more capacity
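
The buffer/backlog changes were along these lines. The values are illustrative examples, not the exact figures used in these runs:

    # larger socket receive buffers to absorb bursts
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.rmem_default=8388608
    # deeper per-CPU ingress queue, as in the earlier post
    sysctl -w net.core.netdev_max_backlog=32768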


6. Real Bottleneck

The limiting factors are NOT:

  • CPU (low load observed)

  • Disk (after tuning)

  • Network stack (after sysctl tuning)

The bottleneck is:

  • dnsdist processing + scheduling

  • Pi-hole/FTL backend capacity

  • Queue buildup under burst load


:chart_decreasing: Typical Stable Envelope

Across multiple runs:

  • Throughput: ~65k–70k QPS

  • Error rate: ~1.5%–3%

  • Latency (under burst):

    • p50: ~200–250 ms

    • p95: ~280–300 ms


:brain: Key Takeaways

  • More cores ≠ more performance

  • Backend scaling > frontend scaling

  • Burst capacity ≠ sustainable throughput

  • Reducing bottlenecks can expose deeper limits

  • Queueing (not CPU) drives latency at high load


:chequered_flag: Best Performing Setup

So far:

  • dnsdist: ~20 cores

  • Backend: 12 Pi-hole instances

  • No RAM disk tricks

  • Tuned network stack

:right_arrow: Best balance of throughput and stability


:speech_balloon: Final Thoughts

For high-QPS DNS workloads:

  • Focus on backend scaling and distribution

  • Avoid over-driving the system with unrealistic burst loads

  • Tune for sustained throughput, not just peak numbers


:test_tube: Pi-hole (Tuned) Performance Snapshot: Cached Data

After tuning Pi-hole (logging adjustments, general cleanup, and backend balancing), I ran a focused test to look at individual node behaviour.
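
The logging adjustment is essentially the classic toggle below; current Pi-hole versions may expose the same switch through the settings UI or FTL config instead, so treat this as a sketch:

    # disable query logging to cut per-query disk writes
    pihole logging off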

:bar_chart: Results (single-node style load, 1,000-domain list)

  • Achieved send QPS: ~4,588

  • Total queries: 67,164

  • Successful: 66,484

  • Failed: 680

  • Error rate: ~1.01%

  • Latency:

    • p50: ~74 ms

    • p95: ~128 ms


:bar_chart: Results (single-node style load, 20-domain list)

  • Achieved send QPS: ~12k

:brain: Interpretation

  • This reflects per-node performance under sustained load, not aggregate cluster throughput

  • Results are consistent with earlier findings of ~12k cached peak per node, dropping under sustained pressure

  • Latency remains significantly lower than full-cluster burst tests due to:

    • reduced queue depth

    • more controlled ingestion rate


:magnifying_glass_tilted_left: Key Observations

  • Pi-hole handles moderate sustained load well, but:

    • throughput per node is limited under continuous pressure

    • queueing still appears as load increases

  • Error rate (~1%) is acceptable for this test shape and aligns with earlier cluster behaviour


:chequered_flag: Takeaway

:backhand_index_pointing_right: Individual Pi-hole instances perform reliably in the low-thousands QPS range under sustained load
:backhand_index_pointing_right: Scaling to higher throughput requires horizontal backend expansion (more nodes) rather than pushing individual instances harder

If anyone else is pushing Pi-hole/dnsdist at high PPS, I'd be interested to compare results :+1:
