DNS resolution retries and EAI_NONAME
wireguard at zackelan.com
Tue Nov 3 09:57:38 CET 2020
Short version: if I set WG_ENDPOINT_RESOLUTION_RETRIES=infinity, I would like wg(8) to actually retry infinitely, rather than exiting the first time it gets what it assumes to be a permanent failure.
When WG_ENDPOINT_RESOLUTION_RETRIES is set, wg will retry endpoint resolution failures...but it special-cases 2 or 3 error response codes  - EAI_NONAME, EAI_FAIL and (if defined) EAI_NODATA because it considers them "permanent" failures that are not worth retrying.
I have several Wireguard tunnels that are set to start at boot on a NixOS box I host. NixOS sets this variable to infinite for me . Despite this, when I reboot that host, I consistently have the tunnels fail on startup. They're failing with a error that wg(8) considers permanent:
Oct 29 19:08:48 mord kernel: wireguard: WireGuard 1.0.20200908 loaded. See www.wireguard.com for information.
Oct 29 19:08:48 mord wireguard-wg0-peer-eG9xbERdnkQrnrpnRrteQQ4zKn-Bi2WA2V2Y2X0UCl0--x3d-start: Name or service not known: `4184f12df2343796c155.freemyip.com:12345'
Oct 29 19:08:48 mord systemd: wireguard-wg0-peer-eG9xbERdnkQrnrpnRrteQQ4zKn-Bi2WA2V2Y2X0UCl0\x3d.service: Main process exited, code=exited, status=1/FAILURE
Oct 29 19:08:48 mord wireguard-wg0-peer-CBdpnmSVnwIvgtj4M1g3LlCLm-wooeo--x2bs5AARyoPjxU--x3d-start: Name or service not known: `d67930b08f5396e21ae1.freemyip.com:12345'
Oct 29 19:08:48 mord systemd: wireguard-wg0-peer-CBdpnmSVnwIvgtj4M1g3LlCLm-wooeo\x2bs5AARyoPjxU\x3d.service: Main process exited, code=exited, status=1/FAILURE
Oct 29 19:08:48 mord wireguard-wg0-peer-J4rZgtReGrTwTglP05wLQt1GniIfUV4o4zAqcO-b3AI--x3d-start: Name or service not known: `9fa2756baed60cb5f18e.freemyip.com:12345'
Oct 29 19:08:48 mord systemd: wireguard-wg0-peer-J4rZgtReGrTwTglP05wLQt1GniIfUV4o4zAqcO-b3AI\x3d.service: Main process exited, code=exited, status=1/FAILURE
This host gets an IP from DHCP a few seconds later, and after that I can SSH in and manually start the Wireguard tunnels without issue.
The assumption that wg(8) makes - that EAI_NONAME / "name or service not known" is a permanent failure - may be true in some cases, but isn't true in mine.
I think it might also make sense, along with not special-casing those error codes, to lower the default number of retries to maybe 1 or 2 (instead of 15)? That would achieve the desired effect of not taking forever to fail if the error truly is permanent, but also allow use cases like mine where the tunnel is configured to start on boot and I want "the network will be up soon, trust me" retry behavior.
More information about the WireGuard