[PATCH] Enabling Threaded NAPI by Default
Mirco Barone
mirco.barone at polito.it
Thu May 29 10:22:31 UTC 2025
On 5/28/2025 7:26 PM, Mirco Barone wrote:
> This patch enables threaded NAPI by default for WireGuard devices in
> response to low performance behavior that we observed when multiple
> tunnels (and thus multiple wg devices) are deployed on a single host.
> This affects any kind of multi-tunnel deployment, regardless of whether
> the tunnels share the same endpoints or not (i.e., a VPN concentrator
> type of gateway would also be affected).
>
> The problem is caused by the fact that, in case of a traffic surge that
> involves multiple tunnels at the same time, the polling of the NAPI
> instance of all these wg devices tends to converge onto the same core,
> causing underutilization of the CPU and bottlenecking performance.
>
> This happens because NAPI polling runs by default in softirq context,
> and the WireGuard driver only completes its NAPI instance, allowing a
> new softirq to be raised, once the peer rx queue has been drained,
> which doesn't happen during high traffic. Until then, the softirq
> already active on a core is reused instead of a new one being raised.
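>
> For illustration, the poll routine follows roughly this pattern (a
> simplified sketch, not the exact code in receive.c; queue_not_empty()
> and process_one_packet() are placeholders for the real peer rx-queue
> handling):
>
>     /* Sketch of the NAPI poll pattern described above. The NAPI
>      * instance is only completed once the peer rx queue is drained
>      * within the budget; under sustained load that never happens, so
>      * the poll keeps being re-run by the softirq already active on
>      * the same core. */
>     static int wg_rx_poll_sketch(struct napi_struct *napi, int budget)
>     {
>             int work_done = 0;
>
>             while (queue_not_empty() && work_done < budget) {
>                     process_one_packet();
>                     work_done++;
>             }
>
>             if (work_done < budget)
>                     napi_complete_done(napi, work_done);
>
>             return work_done;
>     }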
>
> As a result, once two or more tunnel softirqs have been scheduled on
> the same core, they remain pinned there until the surge ends.
>
> In our experiments, this almost always leads to all tunnel NAPIs being
> handled on a single core shortly after a surge begins, limiting
> scalability to less than 3× the performance of a single tunnel, despite
> plenty of unused CPU cores being available.
>
> The proposed mitigation is to enable threaded NAPI for all WireGuard
> devices. This moves the NAPI polling context to a dedicated per-device
> kernel thread, allowing the scheduler to balance the load across all
> available cores.
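>
> The same per-device state that this patch sets at device creation can
> also be inspected or toggled from userspace through the standard
> /sys/class/net/<dev>/threaded attribute. As a minimal illustration
> (the device name wg0 is an assumption), enabling it by hand looks
> like this:
>
>     /* Userspace sketch: enable threaded NAPI on an existing wg0
>      * device by writing "1" to its sysfs "threaded" attribute. */
>     #include <fcntl.h>
>     #include <stdio.h>
>     #include <unistd.h>
>
>     int main(void)
>     {
>             int fd = open("/sys/class/net/wg0/threaded", O_WRONLY);
>
>             if (fd < 0) {
>                     perror("open");
>                     return 1;
>             }
>             if (write(fd, "1", 1) != 1)
>                     perror("write");
>             close(fd);
>             return 0;
>     }
>
> With this patch, the attribute simply starts out at 1 for every newly
> created WireGuard device, so no such manual step is needed.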
>
> On our 32-core gateways, enabling threaded NAPI yields a ~4× performance
> improvement with 16 tunnels, increasing throughput from ~13 Gbps to
> ~48 Gbps. Meanwhile, CPU usage on the receiver (which is the bottleneck)
> jumps from 20% to 100%.
>
> We have found no performance regressions in any scenario we tested.
> Single-tunnel throughput remains unchanged.
>
> More details are available in our Netdev paper:
> https://netdevconf.info/0x18/docs/netdev-0x18-paper23-talk-paper.pdf
> ---
> drivers/net/wireguard/device.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/wireguard/device.c b/drivers/net/wireguard/device.c
> index 45e9b908dbfb..bb77f54d7526 100644
> --- a/drivers/net/wireguard/device.c
> +++ b/drivers/net/wireguard/device.c
> @@ -363,6 +363,7 @@ static int wg_newlink(struct net *src_net, struct net_device *dev,
>  	ret = wg_ratelimiter_init();
>  	if (ret < 0)
>  		goto err_free_handshake_queue;
> +	dev_set_threaded(dev, true);
>
>  	ret = register_netdevice(dev);
>  	if (ret < 0)
Signed-off-by: Mirco Barone <mirco.barone at polito.it>