WireGuard IRQ distribution

Rumen Telbizov rumen.telbizov at menlosecurity.com
Tue May 9 22:17:21 UTC 2023


Hello WireGuard,

New subscriber to the list here.
I've been running performance tests between two bare-metal machines, trying
to gauge what throughput and CPU utilization I can expect out of WireGuard.
While doing so I noticed that the immediate bottleneck is an IRQ that lands
on a single CPU core. I strongly suspect this is because the underlying
packet flow between the two machines is always the same 5-tuple: UDP,
src IP:51280, dst IP:51280. Since WireGuard doesn't vary the source UDP
port, all packets hash to the same IRQ and thus the same CPU. No huge
surprise so far, if my understanding is correct. The interesting part comes
when I artificially introduce UDP source-port variability through nftables -
see below for details. Even though I am then able to distribute the IRQ
load pretty well across all cores, the overall performance actually drops
by about 50%. I was hoping to get some ideas as to what might be going on
and whether this is expected behaviour. Any further pointers on how to
fully utilize all my CPU capacity and get as close to wire speed as
possible would be appreciated.
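
For reference, this is roughly how I watch the per-CPU softirq load and the
NIC interrupt counters while a test is running (the mlx5 interrupt names
below are from my ConnectX-5 setup; adjust for other NICs):

    # per-CPU utilization, including %soft (softirq time), once per second
    mpstat -P ALL 1

    # which CPU each mlx5 completion-queue interrupt is being serviced on
    watch -n1 'grep mlx5 /proc/interrupts'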


Setup -- two of the following:
* Xeon(R) E-2378G CPU @ 2.80GHz, 64GB RAM
* MT27800 Family [ConnectX-5] - 2 x 25Gbit/s in LACP bond = 50Gbit/s
* Debian 11, kernel: 5.10.178-3
* modinfo wireguard: version: 1.0.0
* Server running: iperf3 -s
* Client running: iperf3 -c XXX -Z -t 30

Baseline iperf3 performance over plain VLAN:
* Stable 24Gbit/s and 2Mpps

bmon:
  Gb                      (RX Bits/second)
24.54 .........|.||..|.||.||.||||||..||.||.......................
20.45 .........|||||||||||||||||||||||||||||.....................
16.36 ........||||||||||||||||||||||||||||||.....................
12.27 ........||||||||||||||||||||||||||||||.....................
8.18 ........|||||||||||||||||||||||||||||||.....................
4.09 ::::::::|||||||||||||||||||||||||||||||:::::::::::::::::::::
     1   5   10   15   20   25   30   35   40   45   50   55   60
   M                     (RX Packets/second)
2.03 .........|.||..|.||.||.||||||..||.||........................
1.69 .........|||||||||||||||||||||||||||||......................
1.35 ........||||||||||||||||||||||||||||||......................
1.01 ........||||||||||||||||||||||||||||||......................
0.68 ........|||||||||||||||||||||||||||||||.....................
0.34 ::::::::|||||||||||||||||||||||||||||||:::::::::::::::::::::
     1   5   10   15   20   25   30   35   40   45   50   55   60

top:
%Cpu0  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  1.0 us,  1.0 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  1.0 us,  0.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  1.0 us,  1.0 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.9 sy,  0.0 ni, 16.8 id,  0.0 wa,  0.0 hi, 82.2 si,  0.0 st
%Cpu12 :  0.0 us, 32.3 sy,  0.0 ni, 65.6 id,  0.0 wa,  0.0 hi,  2.1 si,  0.0 st
%Cpu13 :  1.0 us, 36.3 sy,  0.0 ni, 59.8 id,  0.0 wa,  0.0 hi,  2.9 si,  0.0 st
%Cpu14 :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu15 :  0.0 us,  1.0 sy,  0.0 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

The IRQs do pile up on CPU 11 here too, since the single-threaded iperf3 run
produces a single flow. Still, I can reach the full bandwidth of a single
NIC (25Gbit/s); being limited to one NIC is likewise an artefact of LACP
hashing a single flow onto a single bond member.
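
For completeness, the bond's transmit hash policy can be checked like this
(assuming the bond interface is bond0, as in my setup):

    # regardless of policy, a single flow always maps to a single slave,
    # hence the 25Gbit/s cap
    grep 'Transmit Hash Policy' /proc/net/bonding/bond0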


Scenario 1: No port randomization (stock WireGuard setup)
* all IRQs land on a single CPU core
* 8Gbit/s and 660Kpps

bmon:
      Gb                      (RX Bits/second)
    8.01 ...........|||||||||||||||.|||||||||||||....................
    6.68 ...........|||||||||||||||||||||||||||||....................
    5.34 ...........||||||||||||||||||||||||||||||...................
    4.01 ...........||||||||||||||||||||||||||||||...................
    2.67 ...........||||||||||||||||||||||||||||||...................
    1.34 ::::::::::|||||||||||||||||||||||||||||||:::::::::::::::::::
         1   5   10   15   20   25   30   35   40   45   50   55   60
       K                     (RX Packets/second)
  661.71 ...........|||||||||||||||.|||||||||||||....................
  551.42 ...........|||||||||||||||||||||||||||||....................
  441.14 ...........||||||||||||||||||||||||||||||...................
  330.85 ...........||||||||||||||||||||||||||||||...................
  220.57 ...........||||||||||||||||||||||||||||||...................
  110.28 ::::::::::|||||||||||||||||||||||||||||||:::::::::::::::::::
         1   5   10   15   20   25   30   35   40   45   50   55   60

top:
%Cpu0  :  0.0 us, 28.0 sy,  0.0 ni, 69.0 id,  0.0 wa,  0.0 hi,  3.0 si,  0.0 st
%Cpu1  :  0.0 us, 18.1 sy,  0.0 ni, 79.8 id,  0.0 wa,  0.0 hi,  2.1 si,  0.0 st
%Cpu2  :  0.0 us, 20.2 sy,  0.0 ni, 77.9 id,  0.0 wa,  0.0 hi,  1.9 si,  0.0 st
%Cpu3  :  0.0 us, 22.8 sy,  0.0 ni, 74.3 id,  0.0 wa,  0.0 hi,  3.0 si,  0.0 st
%Cpu4  :  0.0 us, 14.6 sy,  0.0 ni, 85.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us, 12.6 sy,  0.0 ni, 87.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us, 21.3 sy,  0.0 ni, 75.5 id,  0.0 wa,  0.0 hi,  3.2 si,  0.0 st
%Cpu7  :  0.0 us, 17.6 sy,  0.0 ni, 76.9 id,  0.0 wa,  0.0 hi,  5.5 si,  0.0 st
%Cpu8  :  1.1 us, 24.2 sy,  0.0 ni, 70.5 id,  0.0 wa,  0.0 hi,  4.2 si,  0.0 st
%Cpu9  :  0.0 us, 20.2 sy,  0.0 ni, 74.5 id,  0.0 wa,  0.0 hi,  5.3 si,  0.0 st
%Cpu10 :  0.0 us, 30.3 sy,  0.0 ni, 62.6 id,  0.0 wa,  0.0 hi,  7.1 si,  0.0 st
%Cpu11 :  0.0 us, 22.3 sy,  0.0 ni, 71.3 id,  0.0 wa,  0.0 hi,  6.4 si,  0.0 st
%Cpu12 :  1.1 us, 15.8 sy,  0.0 ni, 76.8 id,  0.0 wa,  0.0 hi,  6.3 si,  0.0 st
%Cpu13 :  0.0 us,  0.0 sy,  0.0 ni,  5.0 id,  0.0 wa,  0.0 hi, 95.0 si,  0.0 st
%Cpu14 :  1.0 us, 23.7 sy,  0.0 ni, 71.1 id,  0.0 wa,  0.0 hi,  4.1 si,  0.0 st
%Cpu15 :  0.0 us, 23.2 sy,  0.0 ni, 73.7 id,  0.0 wa,  0.0 hi,  3.2 si,  0.0 st

As mentioned above, I suspect this is an effect of the single 5-tuple
(UDP, src 169.254.100.2:51280, dst 169.254.100.1:51280) that WireGuard uses
under the hood. Parallelizing iperf3 has no effect, since after
encapsulation everything comes down to the same flow on the wire.

This is the point where I decided to randomize the source UDP port, to try
to distribute the CPU load over the remaining cores.

Scenario 2: UDP source port randomization via nftables

* 4Gbit/s and 337Kpps
* I applied the following nftables rules to transparently change the source
  UDP port at transmit time and to restore it on receive to what WireGuard
  expects:
table inet raw {
    chain POSTROUTING {
        type filter hook postrouting priority raw; policy accept;
        # outgoing WireGuard packets: skip conntrack and scatter the source
        # port by copying the IP ID field into it
        oif bond0.2000 udp dport 51280   notrack udp sport set ip id
    }

    chain PREROUTING {
        type filter hook prerouting priority raw; policy accept;
        # incoming WireGuard packets: skip conntrack and restore the source
        # port to the fixed one WireGuard expects
        iif bond0.2000 udp dport 51280   notrack udp sport set 51280
    }
}

In essence I set the source UDP port to the value of the IP ID field, which
gives me a pretty good distribution of source ports. I tried the random and
inc expressions of nftables as well, but with no luck - the port was always
0. The IP ID trick does seem to work though.
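
To confirm the rewrite actually shows up on the wire, a quick capture on the
VLAN interface should show varying source ports (run on either end during
the test):

    # source ports should now vary per packet instead of always being 51280
    tcpdump -ni bond0.2000 -c 20 'udp dst port 51280'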

bmon:
      Gb                      (RX Bits/second)
    4.08 ........|..|||.||||||||..||||||||...........................
    3.40 ........||.||||||||||||||||||||||||||.......................
    2.72 .......||||||||||||||||||||||||||||||.......................
    2.04 .......||||||||||||||||||||||||||||||.......................
    1.36 .......||||||||||||||||||||||||||||||.......................
    0.68 :::::::|||||||||||||||||||||||||||||||::::::::::::::::::::::
         1   5   10   15   20   25   30   35   40   45   50   55   60
       K                     (RX Packets/second)
  337.23 ........|..|||.||||||||..||||||||...........................
  281.02 ........||.||||||||||||||||||||||||||.......................
  224.82 .......||||||||||||||||||||||||||||||.......................
  168.61 .......||||||||||||||||||||||||||||||.......................
  112.41 .......||||||||||||||||||||||||||||||.......................
   56.20 :::::::|||||||||||||||||||||||||||||||::::::::::::::::::::::
         1   5   10   15   20   25   30   35   40   45   50   55   60

top:
%Cpu0  :  0.0 us, 16.5 sy,  0.0 ni, 62.9 id,  0.0 wa,  0.0 hi, 20.6 si,  0.0 st
%Cpu1  :  0.0 us, 50.5 sy,  0.0 ni, 31.1 id,  0.0 wa,  0.0 hi, 18.4 si,  0.0 st
%Cpu2  :  0.0 us, 16.8 sy,  0.0 ni, 68.4 id,  0.0 wa,  0.0 hi, 14.7 si,  0.0 st
%Cpu3  :  0.0 us, 20.6 sy,  0.0 ni, 61.8 id,  0.0 wa,  0.0 hi, 17.6 si,  0.0 st
%Cpu4  :  0.0 us, 13.1 sy,  0.0 ni, 68.7 id,  0.0 wa,  0.0 hi, 18.2 si,  0.0 st
%Cpu5  :  0.0 us, 19.2 sy,  0.0 ni, 61.6 id,  0.0 wa,  0.0 hi, 19.2 si,  0.0 st
%Cpu6  :  0.0 us, 15.5 sy,  0.0 ni, 62.1 id,  0.0 wa,  0.0 hi, 22.3 si,  0.0 st
%Cpu7  :  0.0 us, 29.3 sy,  0.0 ni, 53.5 id,  0.0 wa,  0.0 hi, 17.2 si,  0.0 st
%Cpu8  :  1.0 us, 18.0 sy,  0.0 ni, 59.0 id,  0.0 wa,  0.0 hi, 22.0 si,  0.0 st
%Cpu9  :  0.0 us, 20.8 sy,  0.0 ni, 68.9 id,  0.0 wa,  0.0 hi, 10.4 si,  0.0 st
%Cpu10 :  1.0 us, 16.8 sy,  0.0 ni, 66.3 id,  0.0 wa,  0.0 hi, 15.8 si,  0.0 st
%Cpu11 :  0.0 us, 13.4 sy,  0.0 ni, 66.0 id,  0.0 wa,  0.0 hi, 20.6 si,  0.0 st
%Cpu12 :  0.0 us, 21.9 sy,  0.0 ni, 64.6 id,  0.0 wa,  0.0 hi, 13.5 si,  0.0 st
%Cpu13 :  0.0 us, 22.4 sy,  0.0 ni, 60.2 id,  0.0 wa,  0.0 hi, 17.3 si,  0.0 st
%Cpu14 :  0.0 us, 23.0 sy,  0.0 ni, 61.0 id,  0.0 wa,  0.0 hi, 16.0 si,  0.0 st
%Cpu15 :  0.0 us, 16.8 sy,  0.0 ni, 67.4 id,  0.0 wa,  0.0 hi, 15.8 si,  0.0 st

As you can see, the IRQs are pretty well balanced and there is plenty of
idle time on every core, yet I get roughly half the throughput of Scenario 1.
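
In case it helps the discussion, the obvious next step on my side is to
profile where the extra cycles go while the test is running; something
along these lines (using perf from the linux-perf package):

    # live view of the hottest kernel/user functions, system wide
    perf top -g

    # or record ~30 seconds of the run and inspect it afterwards
    perf record -a -g -- sleep 30
    perf report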

I'll continue with my tests and try a newer kernel, but I wanted to share
this with the community first and get your feedback.

Thank you,
Rumen Telbizov

