[PATCH] wg-quick: linux: fix MTU calculation (use PMTUD)

Thu Nov 23 03:33:39 UTC 2023

Hi Daniel

Thanks for having a look at this.

On Mon, 20 Nov 2023 at 01:17, Daniel Gröber <dxld at darkboxed.org> wrote:

> > because this only queries the routing cache.  To
> > trigger PMTUD on the endpoint and fill this cache, it is necessary to
> > send an ICMP with the DF bit set.
>
> I don't think this is useful. Path MTU may change, doing this only once
> when the interface comes up just makes wg-quick less predictable IMO.

Yes, I understand PMTU may change, usually when changing internet
connection. There is also the issue of bringing up an interface without
a connection, such as when using the wg-quick startup service.
Accommodating dynamic PMTU is probably out of scope of the wg-quick
script, but is something I would like to look into separately.

I still think it would be beneficial to set the MTU optimally, even if
only when bringing an interface up, because PMTU is usually stable for a
particular gateway and having this built in makes it far easier for
users to automatically obtain the appropriate MTU. I think it also more
accurately reflects the man page, which suggests automatic discovery.

> > 2. Consider IPv6/4 Header Size
> >
> > Currently an 80 byte header size is assumed i.e. IPv6=40 + WireGuard=40.
> > However this is not optimal in the case of IPv4. Since determining the
> > IP header size is required for PMTUD anyway, this is now optimised as a
> > side effect of endpoint MTU calculation.
>
> This is not a good idea. Consider what happens when a peer roams from an
> IPv4 to an IPv6 endpoint address. It's better to be conservative and assume
> IPv6 sized overhead, besides IPv4 is legacy anyway ;)

MTU calculation is performed independently for each endpoint, with
separate header size calculation accommodating both IPv4 and IPv6
addresses alongside each other. The smallest MTU of all endpoints is
used, so switching from an IPv4 to an IPv6 endpoint should not result in
an MTU which is too large due to IP header size differences.
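
To make the per-endpoint calculation concrete, here is a minimal sketch
in Python (not the patch itself, which is bash). The function names and
the example addresses are made up for illustration, and it assumes IP
headers without options or extension headers:

```python
import ipaddress

UDP_HEADER = 8
# WireGuard data message: type/reserved 4 + receiver index 4
# + counter 8 + authentication tag 16
WG_DATA_OVERHEAD = 32
IP_HEADER = {4: 20, 6: 40}  # assumes no IPv4 options / IPv6 extension headers


def tunnel_mtu(endpoint_ip, path_mtu):
    """Largest tunnel MTU that fits the path MTU towards one endpoint."""
    version = ipaddress.ip_address(endpoint_ip).version
    return path_mtu - IP_HEADER[version] - UDP_HEADER - WG_DATA_OVERHEAD


def interface_mtu(endpoints):
    """endpoints maps endpoint IP -> discovered path MTU; smallest result wins."""
    return min(tunnel_mtu(ip, pmtu) for ip, pmtu in endpoints.items())


# Hypothetical endpoints (documentation addresses), both with a 1500 byte path.
endpoints = {"192.0.2.1": 1500, "2001:db8::1": 1500}
print(interface_mtu(endpoints))  # 1420, limited by the IPv6 endpoint
```

With only the IPv4 endpoint configured the same path would allow 1440,
but as soon as an IPv6 endpoint is present the result falls back to the
IPv6 sized overhead.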

In my case the current behaviour is not conservative enough, but that is
due to the absence of PMTUD rather than the assumed IP header sizes.

> > 3. Use Smallest Endpoint MTU
> >
> > Currently in the case of multiple endpoints the largest endpoint path
> > MTU is used. However WireGuard will dynamically switch between endpoints
> > when e.g. one fails, so the smallest MTU is now used to ensure all
> > endpoints will function correctly.
>
> "function correctly". Do note that WireGuard lets it's UDP packets be
> fragmented. So connectivity will still work even when the wg device MTU
> doesn't match the (current) PMTU. The only downsides to this mismatch being
> performance:
>
>  - additional header overhead for fragments,
>  - less than half max packets-per-second performance and
>  - additional latency for tunnel packets hit by IPv6 PMTU discovery
>
>    I was surprised to learn that this would happen periodically, every time
>    the PMTU cache expires. Seems inherent in the IPv6 design as there's no
>    way (AFAICT) for the kernel to validate the PMTU before the cache
>    expires (like is done for NDP for example).

So, the reason I ended up tinkering with WireGuard MTU is real-world
reliability issues. Although the risk in setting it optimally based on
PMTU remains unclear to me, marginal performance gains are not what
brought me here. Networking is not my area of expertise and I haven't
done a full root cause analysis, so the best I can do is lay out my
experience and see if you think it adds any weight in favour of this
change in behaviour:

I found that browsing the web over WireGuard with an MTU set larger than
the PMTU resulted in randomly stalled HTTP requests. This is noticeable
even with a single stalled HTTP request due to HTTP/1.1 head-of-line
blocking. I tested this manually with individual HTTP requests with a
large enough payload, verifying that it only occurs over WireGuard
connections.

With naked HTTP/TCP the network seems happy, and I assume it is
fragmenting packets; but over WireGuard, somehow, some packets just seem
to get dropped. Maybe UDP is getting treated differently, or maybe what's
actually happening is the network is blackholing in both cases but PMTUD
is figuring this out in the case of TCP (RFC 2923), and maybe that stops
working when encapsulated in UDP? But this is pure speculation, I'm out
of my depth here, and haven't dug any deeper.

This behaviour is probably network operator dependent, or specific to
LTE networks, which I use for permanent internet access, and which
commonly use a lower than average MTU. For example, my current ISP uses
an MTU of 1380, whereas the current wg-quick behaviour is to set the MTU
to the default route interface MTU less 80 bytes (1420 for a regular
1500 byte interface), which results in the above behaviour.
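
As a rough worked example of those numbers, here is a small Python
sketch (helper names are hypothetical) that assumes the 1380 figure is
the usable path MTU towards the endpoint and that the local interface
MTU is 1500:

```python
WG_OVERHEAD = 40  # UDP header 8 + WireGuard data message overhead 32
IP_HEADER = {"ipv4": 20, "ipv6": 40}  # headers without options


def outer_packet_size(tunnel_mtu, family):
    """Size on the wire of a full-sized tunnel packet after encapsulation."""
    return tunnel_mtu + WG_OVERHEAD + IP_HEADER[family]


def fits_path(tunnel_mtu, family, path_mtu):
    """True if a full-sized tunnel packet needs no fragmentation."""
    return outer_packet_size(tunnel_mtu, family) <= path_mtu


current_default = 1500 - 80  # default route interface MTU less 80 bytes
print(current_default)                           # 1420
print(outer_packet_size(current_default, "ipv4"))  # 1480
print(fits_path(current_default, "ipv4", 1380))  # False
print(fits_path(1380 - 80, "ipv6", 1380))        # True
```

So with the current default, every full-sized tunnel packet is about 100
bytes larger than the path allows, whereas 1300 would fit even with an
IPv6 sized header.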

I've used all four of the major mobile network operators in my country
and experienced this on two of them (separate physical networks, not
virtual operators). The other two used an MTU of 1500 anyway.

Just to prove I'm not entirely on my own, this issue also appears to be
known to WireGuard VPN providers, e.g. from Mullvad's FAQ:

> The default MTU (maximum transmission unit) for WireGuard in the Mullvad
> app is 1380. You can set it to 1280 if the WireGuard connection stops
> working. This may be necessary in some mobile networks.

I suppose it could be argued this is not a WireGuard concern and that
mobile networks are simply behaving weirdly. Also, IME it's not entirely
unreliable above the optimal MTU, it's just *less* reliable.

I had not anticipated such a patch would have any downsides; I saw this
as fixing a general deficiency, although I appreciate, as you pointed
out, that it is not a 100% complete solution.

I'm interested more in what your concerns are and what you think of the
above, but will move along if you still think it's not suitable.

Cheers

Tom