[RFC PATCH 0/4] Introduce per-peer MTU setting

Leon Schuermann leon at is.currently.online
Fri Jan 7 22:13:21 UTC 2022


Toke Høiland-Jørgensen <toke at toke.dk> writes:
> I'm not sure I understand the use case? Either PMTUD works through the
> tunnel and you can just let that do its job, or it doesn't and you
> have to do out-of-band discovery anyway in which case you can just use
> the FIB route MTU?

For traffic _through_ the WireGuard tunnel, that is correct. As
WireGuard in general does not do any funny business with the traffic it
forwards, path MTU discovery through the tunnel works just fine. I'll
call that end-to-end PMTUD. If this does not work for any reason, one
has to fall back on specifying the MTU in the FIB, or some other
mechanism.
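
As a quick illustration of both options (addresses and interface names
below are placeholders, not from my actual setup): end-to-end PMTUD
through the tunnel can be observed with tracepath, and the FIB
fallback amounts to clamping the route MTU by hand.

#+BEGIN_EXAMPLE
    # End-to-end PMTUD: tracepath sends DF-marked probes and reports the
    # path MTU it discovers towards a host behind the tunnel:
    tracepath -n 192.168.1.2

    # FIB fallback: clamp the MTU of the route through the tunnel so that
    # locally generated packets never exceed it:
    ip route change 192.168.1.0/24 dev wg0 mtu 1412
#+END_EXAMPLE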

I am however concerned about the link(s) _underneath_ the WireGuard
tunnel (where the encrypted + authenticated packets are forwarded), so
the endpoint-to-endpoint link. Regular path MTU discovery does not work
here. As far as I understand, the reasoning behind this is that even if
the WireGuard endpoint does receive ICMP Fragmentation Needed / Packet
Too Big messages from a host on the path the tunnel traverses, these
messages are not and cannot be authenticated. This means that the
reported MTU cannot safely be relayed to the sender of the original
packet outside of the tunnel.

This is a real-world issue I am experiencing in WireGuard setups. For
instance, I administer the WireGuard instance of a small student ISP.
Clients connect to this endpoint from a variety of networks, such as
DSL links (PPPoE), which commonly have an MTU of 1492 bytes, or
Dual-Stack Lite connections with an MTU of 1460 bytes due to the
encapsulation overhead. Essentially no residential providers fragment
packets, and some do not even send ICMP responses. Sometimes people
run a tunnel inside another tunnel, further decreasing the MTU.

While reducing the server and client MTUs to the largest MTU that
works across all expected link types is technically sufficient, it
increases the relative IP, tunnel and transport header overhead. It is
thus desirable to be able to specify an individual MTU per WireGuard
peer, making use of the MTU actually available on the respective
routes. This is also on the WireGuard project's todo list [1] and has
been discussed before [2].
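
To give a rough idea of the numbers involved (these are my own
back-of-the-envelope figures, not something computed by the patch
set): a WireGuard transport data message adds 32 bytes (16-byte
message header plus 16-byte Poly1305 tag) on top of the outer IP and
UDP headers, which is also where the default device MTU of 1420 comes
from. A per-peer MTU would then simply reflect the respective
underlay:

#+BEGIN_EXAMPLE
    tunnel MTU = outer link MTU - (outer IP + UDP + 16 + 16)

    1500 - (20 + 8 + 16 + 16) = 1440   IPv4 underlay, plain Ethernet
    1500 - (40 + 8 + 16 + 16) = 1420   IPv6 underlay (WireGuard default)
    1492 - (40 + 8 + 16 + 16) = 1412   IPv6 underlay over PPPoE
    1460 - (20 + 8 + 16 + 16) = 1400   IPv4 underlay behind DS-Lite
#+END_EXAMPLE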

> what do you mean by "usable on the rest of the route"?

Actually, I think I might be wrong here. Initial tests had suggested
to me that if the route MTU is specified in the FIB, Linux would not
take any ICMP Fragmentation Needed / Packet Too Big responses into
account. I have now tested this again, and it does indeed perform
proper path MTU discovery even if the route MTU is specified. This is
important, as a route to the destination host might first go through a
WireGuard tunnel to a peer and then be forwarded over paths which
might have an even lower MTU.

Thus the FIB entry MTU is a viable solution for setting individual
peers' route MTU limits, but it would be rather inelegant to modify
route MTU values in the FIB from within kernel space, which an in-band
PMTUD mechanism might require.
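
For completeness, this matches the iproute2 semantics as I understand
them: a plain "mtu" on a route acts as an upper bound that PMTUD can
still lower, while "mtu lock" pins the value and disables PMTUD for
that route. The value currently in effect for a destination can be
inspected with ip route get (placeholder addresses again):

#+BEGIN_EXAMPLE
    # Upper bound only; ICMP Fragmentation Needed / Packet Too Big can
    # still shrink the effective PMTU below 1412:
    ip route change 192.168.1.0/24 dev wg0 mtu 1412

    # Pinned; PMTUD updates are ignored for this route:
    ip route change 192.168.1.0/24 dev wg0 mtu lock 1412

    # Show the MTU currently in effect for a destination:
    ip route get 192.168.1.2
#+END_EXAMPLE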

>> Furthermore, with the goal of eventually introducing an in-band
>> per-peer PMTUD mechanism, keeping an internal per-peer MTU value does
>> not require modifying the FIB and thus potentially interfere with
>> userspace.
>
> What "in-band per-peer PMTUD mechanism"? And why does it need this?

As outlined above, WireGuard cannot utilize the regular ICMP-based
PMTUD mechanism over the endpoint-to-endpoint path. It is however not
great to default to a low MTU to accommodate low-MTU links on this
path, and it is very inconvenient to manually adjust the tunnel MTUs.

A solution to this issue could be a PMTUD mechanism through the tunnel
link itself. It would circumvent the security concerns of ICMP-based
PMTUD by relying exclusively on an encrypted + authenticated message
exchange. For instance, a naive approach could be to send ICMP echo
messages of increasing/decreasing payload size to the peer and
discover the usable tunnel MTU based on the (lost) responses. While
this can be implemented outside of the WireGuard kernel module, it
makes certain assumptions about the tunnel and endpoint configuration,
such as the endpoints having an IP assigned, this IP being included in
the peer's AllowedIPs (not a given), the peer responding to ICMP echo
packets, etc. If such a mechanism were to be (optionally) integrated
into WireGuard directly, it could significantly reduce these kinds of
headaches.
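
Just to sketch what such an out-of-tree workaround might look like
(untested, and it assumes the peer's tunnel IP is covered by its
AllowedIPs and answers ICMP echo): a simple binary search over
DF-marked pings through the tunnel, with the result written back into
the FIB.

#+BEGIN_EXAMPLE
    #!/bin/sh
    # Naive in-tunnel MTU probe; peer tunnel IP, route and device are
    # placeholders. Binary-search the largest inner packet size that
    # still gets an echo reply with DF set.
    PEER=192.168.1.2; ROUTE=192.168.1.0/24; DEV=wg0
    lo=1200; hi=1500   # assumed lower/upper bounds on the tunnel MTU
    while [ $((hi - lo)) -gt 1 ]; do
        mid=$(( (lo + hi) / 2 ))
        # -s takes the ICMP payload size; 28 bytes of IPv4 + ICMP
        # headers are added on top, so this probes an inner packet
        # of $mid bytes.
        if ping -c1 -W1 -Mdo -s $((mid - 28)) "$PEER" >/dev/null 2>&1; then
            lo=$mid          # fits, try larger
        else
            hi=$mid          # too big (or lost), try smaller
        fi
    done
    echo "usable tunnel MTU towards $PEER: $lo"
    ip route change "$ROUTE" dev "$DEV" mtu "$lo"
#+END_EXAMPLE

Of course this conflates packet loss with MTU black holes and has to
be re-run whenever the underlay changes, which is exactly the kind of
headache an integrated mechanism could avoid.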

#+BEGIN_EXAMPLE
Here is an illustration of these issues using a hacky Mininet test
setup [3] with the following topology (all traffic from h5 being
routed over the tunnel between h1 and h4), with fragmentation disabled:

      /--- wireguard ---\
     /                   \
    /  eth    eth    eth  \
    h1 <-> h2 <-> h3 <-> h4 <-> h5

The route from h1 to h4 has an MTU of 1500 bytes:

    mininet> h1 ping -c1 -Mdo -s1472 h4
    1480 bytes from 10.0.2.2: icmp_seq=1 ttl=62 time=0.508 ms

The route from h1 to h5 (through the WireGuard tunnel, via h4) has an
MTU of 1420 bytes:

    mininet> h1 ping -c1 -Mdo -s1392 h5
    1400 bytes from 192.168.1.2: icmp_seq=1 ttl=63 time=7.44 ms

When decreasing the MTU of the h2 to h3 link, we can observe that PMTUD
works on the route of h1 to h4:

    mininet> h2 ip link set h2-eth1 mtu 1492
    mininet> h3 ip link set h3-eth0 mtu 1492
    mininet> h1 ping -c1 -Mdo -s1472 h4
    From 10.0.0.2 icmp_seq=1 Frag needed and DF set (mtu = 1492)

However, when trying to ping h5 from h1 through the WireGuard tunnel,
the packet is silently dropped:

    mininet> h1 ping -c1 -Mdo -s1392 -W1 h5
    PING 192.168.1.2 (192.168.1.2) 1392(1420) bytes of data.

    --- 192.168.1.2 ping statistics ---
    1 packets transmitted, 0 received, 100% packet loss, time 0ms

We can change the appropriate FIB entry of the route _through_ the
tunnel to make Linux aware of the lower MTU:

    mininet> h1 ip route change 192.168.1.0/24 dev wg0 mtu 1412
    mininet> h1 ping -c1 -Mdo -s1392 -W1 h5
    ping: local error: message too long, mtu=1412
    mininet> h1 ping -c1 -Mdo -s1384 -W1 h5
    1392 bytes from 192.168.1.2: icmp_seq=1 ttl=63 time=10.8 ms

When lowering the MTU of the h4 to h5 link even further (which is not
part of the endpoint-to-endpoint link, but part of the route through
the tunnel), PMTUD does work, which is good:

    mininet> h4 ip link set h4-eth1 mtu 1400
    mininet> h5 ip link set h5-eth0 mtu 1400
    mininet> h1 ping -c1 -Mdo -s1384 -W1 h5
    PING 192.168.1.2 (192.168.1.2) 1384(1412) bytes of data.
    From 192.168.0.2 icmp_seq=1 Frag needed and DF set (mtu = 1400)
#+END_EXAMPLE

Let me know if that made things any clearer. :)

- Leon

[1]: https://www.wireguard.com/todo/#per-peer-pmtu
[2]: https://lists.zx2c4.com/pipermail/wireguard/2018-April/002651.html
[3]: https://gist.github.com/lschuermann/7e5de6e00358d1312c86e2144d7352b4

