[WireGuard] Major Queueing Algorithm Simplification

Fri Nov 4 15:45:06 CET 2016

"Jason A. Donenfeld" <Jason at zx2c4.com> writes:

> Hey,
>
> This might be of interest...
>
> Before, every time I got a GSO superpacket from the kernel, I'd split
> it into little packets, and then queue each little packet as a
> different parallel job.
>
> Now, every time I get a GSO super packet from the kernel, I split it
> into little packets, and queue up that whole bundle of packets as a
> single parallel job. This means that each GSO superpacket expansion
> gets processed on a single CPU. This greatly simplifies the algorithm,
> and delivers mega impressive performance throughput gains.
>
> In practice, what this means is that if you call send(tcp_socket_fd,
> buffer, biglength), then each 65k contiguous chunk of buffer will be
> encrypted on the same CPU. Before, each 1.5k contiguous chunk would be
> encrypted on the same CPU.
>
> I had thought about doing this a long time ago, but didn't, due to
> reasons that are now fuzzy to me. I believe it had something to do
> with latency. But at the moment, I think this solution will actually
> reduce latency on systems with lots of cores, since it means those
> cores don't all have to be synchronized before a bundle can be sent
> out. I haven't measured this yet, and I welcome any such tests. The
> magic commit for this is [1], if you'd like to compare before and
> after.
>
> Are there any obvious objections I've overlooked with this simplified
> approach?

My guess would be that it would worsen latency. You now basically have
head-of-line blocking, where an entire superpacket needs to be processed
before another flow gets to transmit a single packet.

I guess this also means that the total amount of data being processed
at any one time increases? I.e., before you would have at most (max
number of jobs * 1.5K) bytes queued up for encryption at once, whereas
now you will have (max number of jobs * 65K) bytes? That can add a
substantial amount of latency.
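
To put rough numbers on that, here is a back-of-the-envelope sketch in
Python. The job limit and the throughput figure are made-up assumptions
for illustration, not values taken from the WireGuard source, and the
function names are hypothetical:

```python
# Rough estimate only. MAX_JOBS and THROUGHPUT_BPS are illustrative
# assumptions, not WireGuard's actual limits or measured rates.

def queued_bytes(max_jobs: int, job_bytes: int) -> int:
    """Upper bound on bytes queued for encryption when every job slot is full."""
    return max_jobs * job_bytes


def drain_time_ms(total_bytes: int, throughput_bits_per_s: float) -> float:
    """Time in milliseconds to work through total_bytes at the given rate."""
    return total_bytes * 8 / throughput_bits_per_s * 1000


MAX_JOBS = 64          # hypothetical job limit
PACKET_JOB = 1500      # ~1.5K per job (one packet per job)
BUNDLE_JOB = 65536     # ~65K per job (one GSO superpacket per job)
THROUGHPUT_BPS = 1e9   # hypothetical aggregate throughput of 1 Gbit/s

if __name__ == "__main__":
    for label, size in (("per-packet", PACKET_JOB), ("per-bundle", BUNDLE_JOB)):
        total = queued_bytes(MAX_JOBS, size)
        delay = drain_time_ms(total, THROUGHPUT_BPS)
        print(f"{label}: {total} bytes queued, {delay:.2f} ms to drain")
```

With those assumed numbers, the worst case goes from under a
millisecond of queued data to roughly 34 ms. The real figures depend on
the actual job limit and on how fast the cores encrypt.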

But really, instead of guessing, why not measure?

Simply run a `flent tcp_4up <hostname>` through the tunnel (requires a
netperf server instance running on <hostname>) and look at the latency
graph. The TCP flows will start five seconds after the ping flow; this
shouldn't cause the ping RTT to rise by more than ~10ms at most. And of
course, it is important to also try this on a machine that does *not*
have a gazillion megafast cores :)

There's an Ubuntu PPA to get Flent, or you can just `pip install flent`.
See https://flent.org/intro.html#installing-flent

-Toke