Wg source address is too sticky for multihomed systems aka multiple endpoints redux

Sun Jul 23 17:05:04 UTC 2023

Hi John,

On Fri, Jul 21, 2023 at 09:47:11AM -0400, John Lauro wrote:
> I have a lots of multihomed routers setup for vpn site to site and
> running bgp over the vpn mesh.
> 
> First, make sure these are all 0 as are multihomed.
> cat $( find /proc/sys/net/ipv4 -name rp_filter )

My routers are behind consumer ISPs, so I never get packets which would fail
RPF, and I have RPF upstream of me either way, so this doesn't make a
difference in my case. Like I said, I have ip rules (PBR) to direct traffic
to the correct interface based on source address to appease upstream's RPF.
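
For illustration, a minimal sketch of that kind of source-based policy
routing. The addresses (documentation ranges), gateways, interface names and
table numbers are made-up placeholders, not the real setup, and the helper
`pbr_cmds` is hypothetical: it only prints the iproute2 commands rather than
running them.

```shell
#!/bin/sh
# Print (not run) iproute2 commands for source-based policy routing.
# All addresses, devices and table numbers below are hypothetical.
pbr_cmds() {
    src=$1 gw=$2 dev=$3 table=$4
    # Packets sourced from $src consult the per-uplink routing table ...
    echo "ip rule add from $src lookup $table"
    # ... whose default route leaves via that uplink's gateway.
    echo "ip route add default via $gw dev $dev table $table"
}

pbr_cmds 192.0.2.10 192.0.2.1 eth0 100          # e.g. the DOCSIS uplink
pbr_cmds 198.51.100.10 198.51.100.1 wwan0 200   # e.g. the LTE/5G uplink
```

Piping the output into `sh` as root would apply the rules; printing first
makes it easy to review them before touching the routing policy.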

> The other thing I do is I run a different wireguard interface and peer
> on a different port and interface.

Same. To run a routing daemon on top of wg you pretty much have to do that
currently, as only one peer per interface may have AllowedIPs=::/0 and the
routing daemons don't (yet, I'm working on this for babel) know how to
update AllowedIPs.
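
A rough sketch of the one-interface-per-peer layout this implies: each peer
gets its own wg interface and listen port, so each can carry ::/0 and the
routing daemon chooses between interfaces instead of between AllowedIPs
entries. Interface names, ports and the key placeholders are assumptions for
illustration, `wg_iface_cmds` is a hypothetical helper that only prints the
commands, and private-key setup is left out.

```shell
#!/bin/sh
# Print (not run) the commands for one WireGuard interface per peer.
# Interface names, ports and key placeholders are hypothetical.
wg_iface_cmds() {
    ifname=$1 port=$2 peer_key=$3
    echo "ip link add dev $ifname type wireguard"
    # A single peer per interface, so ::/0 cannot clash with another peer.
    echo "wg set $ifname listen-port $port peer $peer_key allowed-ips ::/0"
    echo "ip link set up dev $ifname"
}

wg_iface_cmds wg-a 51820 '<peer-a-public-key>'
wg_iface_cmds wg-b 51821 '<peer-b-public-key>'
```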

> With bgp on top, one multihomed router to another multihomed router
> just ends up being multiple links it can route over and let linux/bgp
> decide which ones to use and automatically fail over if one path goes
> down.
> 
> That said, I don't have any NAT and both ends have fixed IPs, although
> they are multihomed.

I'm pretty sure you're not seeing the problem I describe here because your
paths are going to be pretty equivalent, but in my case one is DOCSIS3 and
one is LTE/5G (depends on weather) which is much worse in terms of
bandwidth and latency/jitter consistency. So I can actually see the
difference in applications (video buffering etc) which is what had me start
debugging in the first place :)

> Can you create a separate wireguard interface for each physical
> interface (I suggest a different port too).  Separate wireguard
> interfaces should keep WG from having issues, and of course disabling
> rp_filter to keep linux from having issues.

Hmm, that might just work since my routing daemon does RTT-based routing
and the mobile connection is going to be much worse there. I already have
to deploy two tunnels because of the mentioned v4/v6 dualstack issue, so I'm
not really keen to multiply that number _again_. Besides, my `set fwmark`
workaround does actually legitimately work, but it's ugly as hell :)
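
The workaround itself is not spelled out in this mail, so the following is
only a guess at its general shape: re-applying the interface fwmark with
`wg set`, on the assumption that writing the fwmark makes the kernel drop
the cached source address for the interface's peers. The interface name and
mark value are placeholders, and `fwmark_reset_cmd` is a hypothetical helper
that only prints the command.

```shell
#!/bin/sh
# Print (not run) a command that re-applies the fwmark on a wg interface.
# ASSUMPTION: setting the fwmark resets the sticky source address; the
# interface name and mark value are hypothetical.
fwmark_reset_cmd() {
    ifname=$1 mark=$2
    echo "wg set $ifname fwmark $mark"
}

fwmark_reset_cmd wg-a 0x2a
```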

> On Fri, Jul 21, 2023 at 4:05 AM Nico Schottelius

/me realizes you were replying to Nico *blush*. See this is why you don't
top-post. Learn some netiquette people :-)

I've actually taken my followup discussion with Nico off-list because I
think it might be a more involved debug session on what's going on in his
setup, which is going to distract from my proposal. I'll send any
conclusions we come to back to the list though.

FYI: I do have a patch to add the necessary debugging code and logs to show
the concrete issue here, I just didn't want to cause information overload
in the initial mail. Just let me know and I'll send those along if there's
any doubt about whether what I describe is the actual issue I'm having. I'm
pretty convinced, but the first rule of the internet is that the problem is
always the X-Y problem.

Thanks,
--Daniel