[WireGuard] Major Queueing Algorithm Simplification

Fri Nov 4 14:24:58 CET 2016

Hey,

This might be of interest...

Before, every time I got a GSO superpacket from the kernel, I'd split
it into little packets, and then queue each little packet as a
different parallel job.

Now, every time I get a GSO superpacket from the kernel, I split it
into little packets, and queue up that whole bundle of packets as a
single parallel job. This means that each GSO superpacket expansion
gets processed on a single CPU. This greatly simplifies the algorithm,
and delivers mega impressive throughput gains.
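
A rough sketch of the difference, as a toy Python model and not the
actual kernel code. Here a "job" is simply a list of packets handed to
one worker, and the names (segment, old_jobs, new_jobs) and the
1500-byte packet size are assumptions made up for illustration:

```python
MTU = 1500  # assumed size of each "little packet", in bytes


def segment(superpacket, mtu=MTU):
    """Split one superpacket into packets of at most mtu bytes."""
    return [superpacket[i:i + mtu] for i in range(0, len(superpacket), mtu)]


def old_jobs(superpackets):
    """Old scheme: every little packet becomes its own parallel job."""
    return [[pkt] for sp in superpackets for pkt in segment(sp)]


def new_jobs(superpackets):
    """New scheme: the whole bundle from one superpacket is one job."""
    return [segment(sp) for sp in superpackets]


if __name__ == "__main__":
    sps = [bytes(65000), bytes(65000)]
    print("old scheme jobs:", len(old_jobs(sps)))
    print("new scheme jobs:", len(new_jobs(sps)))
```

In this model the same packets get produced either way; only the
number of jobs changes, and with it the number of workers that must
finish before a bundle is complete.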

In practice, what this means is that if you call send(tcp_socket_fd,
buffer, biglength), then each 65k contiguous chunk of buffer will be
encrypted on the same CPU. Before, only each 1.5k contiguous chunk was
guaranteed to stay on one CPU, and neighboring chunks could land on
different CPUs.
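
To make the granularity concrete, here is a small sketch that maps a
byte offset in the buffer to the job that would encrypt it. It assumes
"65k" means 65536 bytes and "1.5k" means 1500 bytes, and it ignores
header overhead, so the real boundaries will differ somewhat; job_index
is a hypothetical helper, not anything from the WireGuard source:

```python
SUPERPACKET = 65536  # assumed size of one GSO superpacket, in bytes
PACKET = 1500        # assumed size of one little packet, in bytes


def job_index(offset, unit):
    """Index of the job that handles the byte at this buffer offset."""
    return offset // unit


def same_job(a, b, unit):
    """True if the bytes at offsets a and b are encrypted in one job."""
    return job_index(a, unit) == job_index(b, unit)


if __name__ == "__main__":
    # Two bytes 60000 apart within the first 65k of the buffer.
    print("old scheme, same job:", same_job(0, 60000, PACKET))
    print("new scheme, same job:", same_job(0, 60000, SUPERPACKET))
```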

I had thought about doing this a long time ago, but didn't, due to
reasons that are now fuzzy to me. I believe it had something to do
with latency. But at the moment, I think this solution will actually
reduce latency on systems with lots of cores, since it means those
cores don't all have to be synchronized before a bundle can be sent
out. I haven't measured this yet, and I welcome any such tests. The
magic commit for this is [1], if you'd like to compare before and
after.

Are there any obvious objections to this simplified approach that I've
overlooked?

Thanks,
Jason

[1] https://git.zx2c4.com/WireGuard/commit/?id=7901251422e55bcd55ab04afb7fb390983593e39