[PATCH RFC v1] wireguard: queueing: get rid of per-peer ring buffers

Jason A. Donenfeld <Jason at zx2c4.com>
Tue Feb 9 15:44:41 UTC 2021


Hi Dmitry,

Thanks for the review.

On Tue, Feb 9, 2021 at 9:24 AM Dmitry Vyukov <dvyukov at google.com> wrote:
> Strictly saying, 0.15% is for delaying the newly added item only. This
> is not a problem, we can just consider that push has not finished yet
> in this case. You can get this with any queue. It's just that consumer
> has peeked on producer that it started enqueue but has not finished
> yet. In a mutex-protected queue consumers just don't have the
> opportunity to peek, they just block until enqueue has completed.
> The problem is only when a partially queued item blocks subsequent
> completely queued items. That should be some small fraction of 0.15%.

Ah right. I'll make that clear in the commit message.

> We could try to cheat a bit here.
> We could split the counter into:
>
> atomic_t enqueued;
> unsigned dequeued;
>
> then, consumer will do just dequeued++.
> Producers can do (depending on how precise you want them to be):
>
> if ((int)(atomic_read(&enqueued) - dequeued) >= MAX)
>     return false;
> atomic_add(&enqueued, 1);
>
> or, for more precise counting we could do a CAS loop on enqueued.

I guess the CAS case would look like `if
(!atomic_add_unless(&enqueued, 1, MAX + dequeued))` or similar. But
atomic_add_unless() only refuses when the value is exactly equal to
the limit, and a >= check would be safer, so writing out the cmpxchg
loop manually wouldn't be a bad idea.

But... I would probably need smp_load_acquire()/smp_store_release()
(or at least READ_ONCE()/WRITE_ONCE()) around dequeued, right? Unless
we argue that some degree of coarseness doesn't matter.

> Or, we could check, opportunistically increment, and then decrement if
> overflow, but that looks the least favorable option.

I had originally done something like that, but I didn't like the idea
of it being able to grow beyond the limit by the number of CPU cores.
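For concreteness, the shape I mean at the top of the push path --
illustrative only, not the actual code from that earlier attempt
(MAX_QUEUED again being a placeholder):

    /* Opportunistically bump the count, then back off on overflow.
     * Between the inc and the dec, other producers observe a value above
     * the limit, so the count can transiently exceed MAX_QUEUED by up to
     * the number of concurrently pushing CPUs.
     */
    if (atomic_inc_return(&queue->count) > MAX_QUEUED) {
            atomic_dec(&queue->count);
            return false;
    }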

The other option, of course, is to do nothing and keep the atomic
as-is. There's already fairly high overhead from kref_get, so I could
always revisit this after I move from kref.h over to
percpu-refcount.h.

>
> > +struct prev_queue {
> > +       struct sk_buff *head, *tail, *peeked;
> > +       struct { struct sk_buff *next, *prev; } empty;
> > +       atomic_t count;
> >  };
>
>
> This would benefit from a comment explaining that empty needs to match
> sk_buff up to prev (and a corresponding build bug that offset of prev
> match in empty and sk_buff), and why we use prev instead of next (I
> don't know).

That's a good idea. Will do.
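Concretely, I'm thinking of a comment on the struct plus an assertion
at init time along these lines -- just a sketch, with the exact
wording and placement still to be settled:

    /* "empty" acts as a stub sk_buff: its next/prev members must sit at
     * the same offsets as sk_buff's, so that &queue->empty can be cast
     * to a struct sk_buff *.
     */
    BUILD_BUG_ON(offsetof(struct sk_buff, next) !=
                 offsetof(struct prev_queue, empty.next) -
                         offsetof(struct prev_queue, empty) ||
                 offsetof(struct sk_buff, prev) !=
                 offsetof(struct prev_queue, empty.prev) -
                         offsetof(struct prev_queue, empty));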


> > @@ -197,8 +188,8 @@ static void rcu_release(struct rcu_head *rcu)
> >         struct wg_peer *peer = container_of(rcu, struct wg_peer, rcu);
> >
> >         dst_cache_destroy(&peer->endpoint_cache);
> > -       wg_packet_queue_free(&peer->rx_queue, false);
> > -       wg_packet_queue_free(&peer->tx_queue, false);
> > +       WARN_ON(wg_prev_queue_dequeue(&peer->tx_queue) || peer->tx_queue.peeked);
> > +       WARN_ON(wg_prev_queue_dequeue(&peer->rx_queue) || peer->rx_queue.peeked);
>
> This could use just wg_prev_queue_peek.

Nice catch, thanks.
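If I'm reading your suggestion right, those two lines become just:

    WARN_ON(wg_prev_queue_peek(&peer->tx_queue));
    WARN_ON(wg_prev_queue_peek(&peer->rx_queue));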

Jason

