[PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers

Toke Høiland-Jørgensen toke at toke.dk
Wed Feb 17 23:41:41 UTC 2021


"Jason A. Donenfeld" <Jason at zx2c4.com> writes:

> On Wed, Feb 17, 2021 at 7:36 PM Toke Høiland-Jørgensen <toke at toke.dk> wrote:
>> Are these performance measurements based on micro-benchmarks of the
>> queueing structure, or overall wireguard performance? Do you see any
>> measurable difference in the overall performance (i.e., throughput
>> drop)?
>
> These are from counting cycles per instruction using perf and seeing
> which instructions are hotspots that take a greater or smaller
> percentage of the overall time.

Right. Would still love to see some actual numbers if you have them.
I.e., how much overhead do the queueing operations add compared to the
rest of the wg data path, and how much of that is due to the hotspot
operations? Even better if you have a comparison with a spinlock
version, but I do realise that may be asking too much :)

>> And what about relative to using one of the existing skb queueing
>> primitives in the kernel? Including some actual numbers would be nice to
>> justify adding yet-another skb queueing scheme to the kernel :)
>
> If you're referring to skb_queue_* and friends, those very much will
> not work in any way, shape, or form here. Aside from the fact that the
> MPSC nature of it is problematic for performance, those functions use
> a doubly linked list. In wireguard's case, there is only one pointer
> available (skb->prev), as skb->next is used to create the singly
> linked skb_list (see skb_list_walk_safe) of gso frags. And in fact, by
> having these two pointers next to each other for the separate lists,
> it doesn't need to pull in another cache line. This isn't "yet-another
> queueing scheme" in the kernel. This is just a singly linked list
> queue.

Having this clearly articulated in the commit message would be good, and
could prevent others from pushing back against what really does appear
at first glance to be "yet-another queueing scheme"...

I.e., in the version you posted you go "the ring buffer is too much
memory, so here's a new linked-list queueing algorithm", skipping the
"and this is why we can't use any of the existing ones" in-between.

>> I say this also because the actual queueing of the packets has never
>> really shown up on any performance radar in the qdisc and mac80211
>> layers, which both use traditional spinlock-protected queueing
>> structures.
>
> Those are single threaded and the locks aren't really contended much.
>
>> that would be good; also for figuring out if this algorithm might be
>> useful in other areas as well (and don't get me wrong, I'm fascinated by
>> it!).
>
> If I find the motivation -- and if the mailing list conversations
> don't become overly miserable -- I might try to fashion the queueing
> mechanism into a general header-only data structure in include/linux/.
> But that'd take a bit of work to see if there are actually places
> where it matters and where it's useful. WireGuard can get away with it
> because of its workqueue design, but other things probably aren't as
> lucky. So I'm on the fence about generality.

Yeah, I can't think of any off the top of my head either. But I'll
definitely keep this in mind if I do run into any. If there's no obvious
contenders, IMO it would be fine to just keep it internal to wg until
such a use case shows up, and then generalise it at that time. Although
that does give it less visibility for other users, it also saves you
some potentially-redundant work :)

>> > -     if (wg_packet_queue_init(&peer->tx_queue, wg_packet_tx_worker, false,
>> > -                              MAX_QUEUED_PACKETS))
>> > -             goto err_2;
>> > +     INIT_WORK(&peer->transmit_packet_work, wg_packet_tx_worker);
>>
>> It's not quite clear to me why changing the queue primitives requires
>> adding another work queue?
>
> It doesn't require a new workqueue. It's just that a workqueue was
> init'd earlier in the call to "wg_packet_queue_init", which allocated
> a ring buffer at the same time. We're not going through that
> infrastructure anymore, but I still want the workqueue it used, so I
> init it there instead. I truncated the diff in my quoted reply -- take
> a look at that quote above and you'll see more clearly what I mean.

Ah, right, it's moving things from wg_packet_queue_init() - missed that.
Thanks!

>> > +#define NEXT(skb) ((skb)->prev)
>>
>> In particular, please explain this oxymoronic define :)
>
> I can write more about that, sure. But it's what I wrote earlier in
> this email -- the next pointer is taken; the prev one is free. So,
> this uses the prev one.

Yeah, I just meant to duplicate the explanation and references in
comments as well as the commit message, to save the people looking at
the code in the future some head scratching, and to make the origins
of the algorithm clear (credit where credit is due and all that).

>> While this is nice and compact it's also really hard to read.
>
> Actually I've already reworked that a bit in master to get the memory
> barrier better.

That version still hides the possible race inside a nested macro
expansion, though. Not doing your readers any favours.

>> I don't see anywhere that you're clearing the next pointer (or prev, as
>> it were). Which means you'll likely end up passing packets up or down
>> the stack with that pointer still set, right? See this commit for a
>> previous instance where something like this has lead to issues:
>>
>> 22f6bbb7bcfc ("net: use skb_list_del_init() to remove from RX sublists")
>
> The prev pointer is never used for anything or initialized to NULL
> anywhere. skb_mark_not_on_list concerns skb->next.

I was more concerned with stepping on the 'struct list_head' that shares
the space with the next and prev pointers, actually. But if you audited
that there are no other users of the pointer space at all, great! Please
do note this somewhere, though.

> Thanks for the review.

You're welcome - feel free to add my:

Reviewed-by: Toke Høiland-Jørgensen <toke at redhat.com>

-Toke

