My GSoC contributions to WireGuard
j.neuschaefer at gmx.net
Mon Aug 13 22:20:39 CEST 2018
Over the past few months, I was a GSoC student in the WireGuard project
(nominally the Linux Foundation). Together with Thomas Gschwantner and Gauvain
Roussel-Tarbouriech (who were also GSoC students), and of course Jason
Donenfeld, I tried to implement some of the performance optimizations from the
TL;DR: Here are the commits that I contributed during that time, which were
also merged (not all of them were):
Our first task was to change the trie (prefix tree) used in allowedips.c
to look up peers based on their IP(v6) address, to use native endianness
(little endian on the x86 systems prevalent today) instead of big endian,
which is how the bytes in an IP(v6) address are stored in a packet. This
saves a few byte swaps per lookup in situations with many peers, because the
number is reduced from one byte swap per trie edge walked to one byte swap
per complete traversal of the trie.
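
To illustrate the idea, here is a minimal userspace sketch, not the code from
allowedips.c. It assumes IPv4 only, and the names (struct node, load_be32,
lookup_v4) are made up for this example. Prefixes are stored in native byte
order, so the address from the packet is converted once at the start of the
lookup, and every step down the trie works on plain native integers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical trie node; prefixes are kept in native byte order. */
struct node {
	struct node *child[2];
	uint32_t bits;  /* prefix value, native endianness */
	uint8_t cidr;   /* prefix length in bits, 0..32 */
	int peer;       /* stand-in for a peer pointer, -1 if none */
};

/* Read a big-endian (network order) 32-bit value from packet bytes. */
uint32_t load_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Compare the first n->cidr bits of key against the node's prefix. */
static int prefix_matches(const struct node *n, uint32_t key)
{
	uint32_t mask = n->cidr ? ~(uint32_t)0 << (32 - n->cidr) : 0;

	return ((key ^ n->bits) & mask) == 0;
}

/* Longest-prefix match; returns the peer id or -1 if nothing matches. */
int lookup_v4(const struct node *root, const uint8_t addr[4])
{
	uint32_t key = load_be32(addr); /* the only conversion per lookup */
	const struct node *n = root;
	int found = -1;

	while (n && prefix_matches(n, key)) {
		if (n->peer >= 0)
			found = n->peer;
		if (n->cidr == 32)
			break;
		/* Branch on the first bit after the prefix; no swap needed. */
		n = n->child[(key >> (31 - n->cidr)) & 1];
	}
	return found;
}
```

With 10.0.0.0/8 and 10.1.0.0/16 in such a trie, a lookup of 10.1.2.3 would
return the peer of the /16 entry, and 10.2.0.1 would fall back to the /8.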
We implemented and discussed a few versions of this optimization, and
finally Jason committed his, and a selftest to help ensure that the new
implementation is correct. I noticed that the debugging code that can
print the allowedips trie in Graphviz format was now wrong, and fixed
The next goal was to replace the spinlock-based ring buffer implementation
used in WireGuard to implement the receive/transmit and encrypt/decrypt
queues with a lock-free version. For this, we looked at ConcurrencyKit,
and reimplemented their algorithms. However, after seeing no clear
performance improvement, and in some cases serious performance degradation,
we gave up on this goal. The code, along with a selftest, can be found on
the jn/mpmc-wip branch.
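
For readers unfamiliar with lock-free rings, the following is a minimal
sketch of the general technique in userspace C11. It is deliberately the
simpler single-producer/single-consumer case, not the multi-producer
algorithm from ConcurrencyKit and not the code on the branch. The names
(struct ring, ring_push, ring_pop, RING_SIZE) are invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SIZE 8 /* must be a power of two */

/* A zero-initialized struct ring is an empty ring. */
struct ring {
	void *slots[RING_SIZE];
	atomic_size_t head; /* next slot to write; stored only by the producer */
	atomic_size_t tail; /* next slot to read; stored only by the consumer */
};

/* Producer side: returns false if the ring is full. */
bool ring_push(struct ring *r, void *item)
{
	size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
	size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	if (head - tail == RING_SIZE)
		return false;
	r->slots[head & (RING_SIZE - 1)] = item;
	/* Release: the slot write becomes visible before the new head. */
	atomic_store_explicit(&r->head, head + 1, memory_order_release);
	return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct ring *r, void **item)
{
	size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	size_t head = atomic_load_explicit(&r->head, memory_order_acquire);

	if (head == tail)
		return false;
	*item = r->slots[tail & (RING_SIZE - 1)];
	/* Release: the slot read completes before the slot is handed back. */
	atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
	return true;
}
```

Neither side takes a lock; each index has exactly one writer. Supporting
several producers or consumers needs compare-and-swap on the indices, which
is where most of the extra complexity and cache-line contention comes from.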
Next, I made use of the NAPI infrastructure of the Linux networking
subsystem, to improve the performance of the receive path, with some
success, but also a performance regression on some systems, which
Jason appears to have solved by using GRO before passing packets up
through the network stack.
Since my integration of NAPI relied on one NAPI context per WireGuard peer, and
NAPI contexts are stored in a fixed-size hashtable in the kernel, Thomas
disabled NAPI busy polling, thus disabling the use of this hashtable.
This avoids overloading the hashtable when there are many WireGuard peers.
My last task was to change WireGuard's hashtables from a fixed-size
hashtable to the resizable hashtable implementation (called "rhashtable")
that's already in the upstream kernel. This will save some memory when
few items are stored, and avoid overloading, and thus slowing down, the
hashtables when many items are stored. The main difficulty here stems from
the fact that WireGuard's pubkey_hashtable uses the SipHash function to be
better protected against hash collision, or "hashtable poisoning"
attacks, in which the hashed data is chosen in such a way that many
items are stored in the same hash bucket, slowing down lookups. SipHash uses
a 128-bit random seed to make the hashes less predictable, while the
existing rhashtable API only allows 32 bits. Thus, to properly integrate
SipHash into rhashtable, I need to extend the rhashtable API to use a longer
random seed. A work-in-progress version of this change can be found in the
linux-dev repo on git.zx2c4.com.
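
To make the seed-size issue concrete, here is a self-contained userspace
sketch of SipHash-2-4 with its 128-bit key, written from the published
algorithm description. It is neither the kernel's siphash implementation nor
the rhashtable change; the bucket_index helper and its parameters are
hypothetical and only show how a keyed 64-bit hash could select a bucket.

```c
#include <stddef.h>
#include <stdint.h>

#define ROTL64(x, b) (((x) << (b)) | ((x) >> (64 - (b))))

#define SIPROUND do { \
	v0 += v1; v1 = ROTL64(v1, 13); v1 ^= v0; v0 = ROTL64(v0, 32); \
	v2 += v3; v3 = ROTL64(v3, 16); v3 ^= v2; \
	v0 += v3; v3 = ROTL64(v3, 21); v3 ^= v0; \
	v2 += v1; v1 = ROTL64(v1, 17); v1 ^= v2; v2 = ROTL64(v2, 32); \
} while (0)

/* Read a little-endian 64-bit value, independent of host endianness. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)p[i] << (8 * i);
	return v;
}

/* SipHash-2-4: 2 compression rounds per word, 4 finalization rounds. */
uint64_t siphash24(const uint8_t *in, size_t len, const uint8_t key[16])
{
	uint64_t k0 = load_le64(key), k1 = load_le64(key + 8);
	uint64_t v0 = 0x736f6d6570736575ULL ^ k0;
	uint64_t v1 = 0x646f72616e646f6dULL ^ k1;
	uint64_t v2 = 0x6c7967656e657261ULL ^ k0;
	uint64_t v3 = 0x7465646279746573ULL ^ k1;
	uint64_t b = (uint64_t)len << 56; /* last word carries the length */
	size_t left = len % 8;
	const uint8_t *end = in + (len - left);

	for (; in != end; in += 8) {
		uint64_t m = load_le64(in);

		v3 ^= m;
		SIPROUND; SIPROUND;
		v0 ^= m;
	}
	for (size_t i = 0; i < left; i++)
		b |= (uint64_t)in[i] << (8 * i);
	v3 ^= b;
	SIPROUND; SIPROUND;
	v0 ^= b;
	v2 ^= 0xff;
	SIPROUND; SIPROUND; SIPROUND; SIPROUND;
	return v0 ^ v1 ^ v2 ^ v3;
}

/* Hypothetical helper: pick a bucket; nbuckets must be a power of two. */
size_t bucket_index(const uint8_t *data, size_t len,
		    const uint8_t key[16], size_t nbuckets)
{
	return (size_t)(siphash24(data, len, key) & (nbuckets - 1));
}
```

An attacker who does not know the 16 key bytes cannot predict which inputs
land in the same bucket. A 32-bit seed leaves far fewer possibilities to
guess, which is why the existing seed parameter is not a good fit here.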
I have not reached all the goals that I initially set out to reach. In
particular, here are some of the items that were part of my GSoC project
- Reducing bufferbloat with CoDel/DQL. Since WireGuard temporarily stores
packets in queues, and queues can add latency when they're full, it would be
interesting to try to limit the added latency with CoDel or Linux's
dynamic queue limits library (DQL).
- CPU auto-scaling and NUMA awareness. Currently, WireGuard splits the
encryption/decryption workload among all CPUs currently online. For low
packet rates it might be a good idea to stick to one CPU, to avoid polluting
the other CPUs' caches, and possibly to avoid waking the other CPUs from sleep
states. In larger systems with two or more sets of CPUs, each with their own
L3 cache and DDR memory controller, i.e. in NUMA systems, it might also
be a good idea to confine WireGuard to one NUMA node, to save bus traffic.
- Performance profiling and benchmarking. I included performance profiling and
benchmarking in my project proposal, because serious performance optimization
has to rely on benchmarks to find out which approach is *really* the fastest,
and on profiling to figure out where the bottlenecks are. However, I did
very little in this direction beyond running iperf3.
Even though I did not do everything I intended to do, it was nice to bring
WireGuard a bit further and learn something about optimization and the Linux
networking stack. Thanks to Jason Donenfeld, the Linux Foundation, and Google
for providing this opportunity.