ARM multithreaded?

Tue Nov 21 11:02:02 CET 2017

Hi René,

There are a few bottlenecks in the existing queuing code:

- transmission of packets is limited to one core, even if encryption
is multicore, to avoid out of order packets.
- packet queues use a ring buffer with two spinlocks, which cause
contention on systems with copious amounts of CPUs (not your case).
- CPU autoscaling - sometimes using all the cores isn't useful if that
lowers the clockrate or if there are few packets, but we don't have an
auto scale-up/scale-down algorithm right now. instead we blast out to
all cores always.
- CPU locality - packets might be created on one core and encrypted on
another. not much we can do about this with a multicore algorithm,
unless there are "hints" or dual per-cpu and per-device queues with
scheduling between them, which is complicated and would need lots of
thought.
- the transmission core is also used as an encryption core. in some
environments this is a benefit, in others a detriment.
- there's a slightly expensive bitmask operation to determine which
CPU should be used for the next packet.
- other challenging puzzles from queue-theory land.
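
To make the first and the bitmask points concrete, here is a minimal
single-threaded userspace sketch, not the actual WireGuard code. It
assumes a per-device ring in which packets keep their arrival order,
workers mark packets as encrypted in any order, and one transmitter
releases only the finished packets at the head. All names (tx_ring,
ring_enqueue, encrypt_done, ring_transmit, next_cpu) are hypothetical,
and the locking or atomics a real multicore version needs are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum pkt_state { PKT_UNCRYPTED, PKT_CRYPTED };

struct pkt {
	unsigned seq;          /* arrival order, for illustration only */
	enum pkt_state state;
};

#define RING_SIZE 8            /* power of two, so index wraparound is safe */

struct tx_ring {
	struct pkt *slots[RING_SIZE];
	size_t head;           /* next packet to transmit */
	size_t tail;           /* next free slot */
};

/* Producer: record the packet in arrival order before any encryption. */
static bool ring_enqueue(struct tx_ring *r, struct pkt *p)
{
	assert(r != NULL && p != NULL);
	if (r->tail - r->head == RING_SIZE)
		return false;  /* ring is full */
	p->state = PKT_UNCRYPTED;
	r->slots[r->tail++ % RING_SIZE] = p;
	return true;
}

/* Worker: may finish packets in any order, on any core. */
static void encrypt_done(struct pkt *p)
{
	p->state = PKT_CRYPTED;
}

/*
 * Transmitter: release packets from the head only while they are
 * finished, so the output order always matches the arrival order.
 * Returns the number of sequence numbers written to out.
 */
static size_t ring_transmit(struct tx_ring *r, unsigned *out, size_t max)
{
	size_t n = 0;

	while (n < max && r->head != r->tail) {
		struct pkt *p = r->slots[r->head % RING_SIZE];

		if (p->state != PKT_CRYPTED)
			break; /* head not ready: everything behind it waits */
		out[n++] = p->seq;
		r->head++;
	}
	return n;
}

/*
 * Round-robin CPU pick: the first set bit in online_mask after `last`,
 * wrapping around. Returns -1 if the mask is empty. Pass last = -1 to
 * start from CPU 0.
 */
static int next_cpu(uint64_t online_mask, int last)
{
	for (int i = 1; i <= 64; i++) {
		int cpu = (last + i) % 64;

		if (online_mask & (UINT64_C(1) << cpu))
			return cpu;
	}
	return -1;
}
```

In this model, if packets 1, 2 and 3 are queued and only 2 and 3 have
finished encrypting, ring_transmit() releases nothing until packet 1
is done. That head-of-line wait is the price of in-order delivery, and
it is why a single slow core can stall the whole transmit path.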

I've CCd Samuel and Toke in case they want to jump in on this thread
and complain some about other aspects of the multicore algorithm. It's
certainly much better than it was during the padata era, but there's still
a lot to be done. The implementation lives here:

From these lines on down, best read from bottom to top.
https://git.zx2c4.com/WireGuard/tree/src/send.c#n185
https://git.zx2c4.com/WireGuard/tree/src/receive.c#n281
Utility functions:
https://git.zx2c4.com/WireGuard/tree/src/queueing.c
https://git.zx2c4.com/WireGuard/tree/src/queueing.h

Let me know if you have further ideas for improving performance.

Jason