From edumazet at google.com Sun Oct 5 13:04:31 2025
From: edumazet at google.com (Eric Dumazet)
Date: Sun, 5 Oct 2025 06:04:31 -0700
Subject: [PATCH] wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
In-Reply-To: <20251005122626.26988-1-wangfushuai@baidu.com>
References: <20251005122626.26988-1-wangfushuai@baidu.com>
Message-ID:

On Sun, Oct 5, 2025 at 5:26 AM Fushuai Wang wrote:
>
> Replace call_rcu() + kmem_cache_free() with kfree_rcu() to simplify
> the code and reduce function size.
>
> Signed-off-by: Fushuai Wang

Hmm... have you compiled this patch ?

I think all compilers would complain loudly.

> ---
>  drivers/net/wireguard/allowedips.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
> index 09f7fcd7da78..506f7cf0d7cf 100644
> --- a/drivers/net/wireguard/allowedips.c
> +++ b/drivers/net/wireguard/allowedips.c
> @@ -48,11 +48,6 @@ static void push_rcu(struct allowedips_node **stack,
>  	}
>  }
>
> -static void node_free_rcu(struct rcu_head *rcu)
> -{
> -	kmem_cache_free(node_cache, container_of(rcu, struct allowedips_node, rcu));
> -}
> -
>  static void root_free_rcu(struct rcu_head *rcu)
>  {
>  	struct allowedips_node *node, *stack[MAX_ALLOWEDIPS_DEPTH] = {
> @@ -271,13 +266,13 @@ static void remove_node(struct allowedips_node *node, struct mutex *lock)
>  	if (free_parent)
>  		child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)],
>  						  lockdep_is_held(lock));
> -	call_rcu(&node->rcu, node_free_rcu);
> +	kfree_rcu(&node, rcu);

kfree_rcu(node, rcu);

>  	if (!free_parent)
>  		return;
>  	if (child)
>  		child->parent_bit_packed = parent->parent_bit_packed;
>  	*(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child;
> -	call_rcu(&parent->rcu, node_free_rcu);
> +	kfree_rcu(&parent, rcu);

kfree_rcu(parent, rcu);

>  }
>
>  static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key,
> --
> 2.36.1
>

From pabeni at redhat.com Tue Oct 7 12:55:07 2025
From: pabeni at redhat.com (Paolo Abeni)
Date: Tue, 7 Oct 2025 14:55:07 +0200
Subject: [PATCH v2] wireguard: allowedips: Use kfree_rcu() instead of call_rcu()
In-Reply-To: <20251005133936.32667-1-wangfushuai@baidu.com>
References: <20251005133936.32667-1-wangfushuai@baidu.com>
Message-ID:

On 10/5/25 3:39 PM, Fushuai Wang wrote:
> Replace call_rcu() + kmem_cache_free() with kfree_rcu() to simplify
> the code and reduce function size.
>
> Signed-off-by: Fushuai Wang
> ---
>  drivers/net/wireguard/allowedips.c | 9 ++-------
>  1 file changed, 2 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
> index 09f7fcd7da78..5ece9acad64d 100644
> --- a/drivers/net/wireguard/allowedips.c
> +++ b/drivers/net/wireguard/allowedips.c
> @@ -48,11 +48,6 @@ static void push_rcu(struct allowedips_node **stack,
>  	}
>  }
>
> -static void node_free_rcu(struct rcu_head *rcu)
> -{
> -	kmem_cache_free(node_cache, container_of(rcu, struct allowedips_node, rcu));
> -}
> -
>  static void root_free_rcu(struct rcu_head *rcu)
>  {
>  	struct allowedips_node *node, *stack[MAX_ALLOWEDIPS_DEPTH] = {
> @@ -271,13 +266,13 @@ static void remove_node(struct allowedips_node *node, struct mutex *lock)
>  	if (free_parent)
>  		child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)],
>  						  lockdep_is_held(lock));
> -	call_rcu(&node->rcu, node_free_rcu);
> +	kfree_rcu(node, rcu);
>  	if (!free_parent)
>  		return;
>  	if (child)
>  		child->parent_bit_packed = parent->parent_bit_packed;
>  	*(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child;
> -	call_rcu(&parent->rcu, node_free_rcu);
> +	kfree_rcu(parent, rcu);
>  }
>
>  static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key,

This is net-next material, and net-next is currently closed for the merge
window, but I guess Jason will take this patch in his tree.
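Eric's objection to v1 comes down to the first argument of kfree_rcu(ptr, rhf): it must be the pointer to the object that embeds the named struct rcu_head, as in v2's kfree_rcu(node, rcu). With &node, the macro would look for an rcu member inside a struct allowedips_node **, which has none, hence the compile failure. The userspace sketch below mimics only that argument contract; deferred_free(), struct fake_rcu_head, retire_pair() and drain_pending() are hypothetical names, and no RCU grace period is modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct rcu_head. */
struct fake_rcu_head {
	struct fake_rcu_head *next;
};

/* Userspace container_of(): recover the enclosing object from a member pointer. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Simplified trie node; only the embedded head matters here. */
struct node {
	int key;
	struct fake_rcu_head rcu;
};

/* Objects queued for a "deferred" free (no real grace period is modelled). */
static struct fake_rcu_head *pending;

/*
 * Hypothetical deferred_free(ptr, field), shaped like kfree_rcu(ptr, rhf):
 * it evaluates &(ptr)->field, so ptr must point at the object itself.
 * deferred_free(&node, rcu) does not compile: a struct node ** has no
 * member named rcu.
 */
#define deferred_free(ptr, field)            \
	do {                                 \
		(ptr)->field.next = pending; \
		pending = &(ptr)->field;     \
	} while (0)

/* Mirrors the corrected remove_node() calls: pass the object pointers. */
void retire_pair(struct node *node, struct node *parent)
{
	deferred_free(node, rcu);
	deferred_free(parent, rcu);
}

/* Stand-in for grace-period expiry: free every queued object, return count. */
int drain_pending(void)
{
	int freed = 0;

	while (pending) {
		struct fake_rcu_head *head = pending;

		pending = head->next;
		free(container_of(head, struct node, rcu));
		freed++;
	}
	return freed;
}
```

Only the shape of the macro matters here: the free path recovers the object from the embedded head with container_of(), which is why the caller hands over the object pointer rather than the address of a local variable holding it.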
Cheers,

Paolo

From patchwork-bot+linux-riscv at kernel.org Thu Oct 9 01:07:09 2025
From: patchwork-bot+linux-riscv at kernel.org (patchwork-bot+linux-riscv at kernel.org)
Date: Thu, 09 Oct 2025 01:07:09 +0000
Subject: update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap())
In-Reply-To:
References:
Message-ID: <175997202925.3661959.5694356441030280085.git-patchwork-notify@kernel.org>

Hello:

This patch was applied to riscv/linux.git (for-next)
by Mike Rapoport (Microsoft):

On Mon, 25 Aug 2025 19:58:10 +0300 you wrote:
> On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote:
> >
> > I don't quite understand the interaction with PG_Reserved and why anybody
> > using this function should care.
> >
> > So maybe you can rephrase in a way that is easier to digest, and rather
> > focuses on what callers of this function are supposed to do vs. have the
> > liberty of not doing?
>
> [...]

Here is the summary with links:
  - update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap())
    https://git.kernel.org/riscv/c/b3dcc9d1d806

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html

From david at redhat.com Thu Oct 9 06:12:25 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 08:12:25 +0200
Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
In-Reply-To:
References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-7-david@redhat.com>
Message-ID: <5a5013ca-e976-4622-b881-290eb0d78b44@redhat.com>

On 09.10.25 06:21, Balbir Singh wrote:
> On 8/22/25 06:06, David Hildenbrand wrote:
>> Let's reject them early, which in turn makes folio_alloc_gigantic() reject
>> them properly.
>>
>> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
>> and calculate MAX_FOLIO_NR_PAGES based on that.
>>
>> Signed-off-by: David Hildenbrand
>> ---
>>  include/linux/mm.h | 6 ++++--
>>  mm/page_alloc.c    | 5 ++++-
>>  2 files changed, 8 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>> index 00c8a54127d37..77737cbf2216a 100644
>> --- a/include/linux/mm.h
>> +++ b/include/linux/mm.h
>> @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio)
>>
>>  /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>>  #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>> -#define MAX_FOLIO_NR_PAGES	(1UL << PUD_ORDER)
>> +#define MAX_FOLIO_ORDER		PUD_ORDER
>
> Do we need to check for CONTIG_ALLOC as well with CONFIG_ARCH_HAS_GIGANTIC_PAGE?
>

I don't think so, can you elaborate?

>>  #else
>> -#define MAX_FOLIO_NR_PAGES	MAX_ORDER_NR_PAGES
>> +#define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
>>  #endif
>>
>> +#define MAX_FOLIO_NR_PAGES	(1UL << MAX_FOLIO_ORDER)
>> +
>>  /*
>>   * compound_nr() returns the number of pages in this potentially compound
>>   * page. compound_nr() can be called on a tail page, and is defined to
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index ca9e6b9633f79..1e6ae4c395b30 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>>  int alloc_contig_range_noprof(unsigned long start, unsigned long end,
>>  			      acr_flags_t alloc_flags, gfp_t gfp_mask)
>>  {
>> +	const unsigned int order = ilog2(end - start);
>
> Do we need a VM_WARN_ON(end < start)?

I don't think so.
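The mm.h hunk derives MAX_FOLIO_NR_PAGES from MAX_FOLIO_ORDER, and the page_alloc.c hunk computes the order of the PFN range once so that oversized __GFP_COMP requests can be rejected early. The arithmetic can be sketched in userspace C as below. The SKETCH_* values are assumptions (an order-10 MAX_PAGE_ORDER and an order-18 PUD order, i.e. 1 GiB with 4 KiB pages); real values depend on the architecture and configuration. ilog2_ul() and check_contig_request() are hypothetical helpers, not kernel functions.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins; real values are architecture/config dependent. */
#define SKETCH_MAX_PAGE_ORDER 10 /* buddy allocator limit (assumed) */
#define SKETCH_PUD_ORDER      18 /* 1 GiB with 4 KiB pages (assumed) */

#ifdef SKETCH_HAS_GIGANTIC_PAGE
#define SKETCH_MAX_FOLIO_ORDER SKETCH_PUD_ORDER
#else
#define SKETCH_MAX_FOLIO_ORDER SKETCH_MAX_PAGE_ORDER
#endif

/* Derived from the order, as the patch does for MAX_FOLIO_NR_PAGES. */
#define SKETCH_MAX_FOLIO_NR_PAGES (1UL << SKETCH_MAX_FOLIO_ORDER)

/* floor(log2(n)) like the kernel's ilog2(); returns -1 for n == 0 (sketch choice). */
int ilog2_ul(unsigned long n)
{
	int order = -1;

	while (n) {
		n >>= 1;
		order++;
	}
	return order;
}

/*
 * Mirrors the early check added to alloc_contig_range_noprof(): a compound
 * request whose order exceeds the largest folio order is rejected up front.
 */
int check_contig_request(unsigned long start_pfn, unsigned long end_pfn, bool compound)
{
	int order = ilog2_ul(end_pfn - start_pfn);

	if (compound && order > SKETCH_MAX_FOLIO_ORDER)
		return -EINVAL;
	return 0;
}
```

Like ilog2(), the helper rounds down, so a 1500-page range has order 10. Defining SKETCH_HAS_GIGANTIC_PAGE switches the limit to the PUD order, mirroring the #ifdef in the mm.h hunk.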
>
>>  	unsigned long outer_start, outer_end;
>>  	int ret = 0;
>>
>> @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
>>  		PB_ISOLATE_MODE_CMA_ALLOC :
>>  		PB_ISOLATE_MODE_OTHER;
>>
>> +	if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER))
>> +		return -EINVAL;
>> +
>>  	gfp_mask = current_gfp_context(gfp_mask);
>>  	if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask))
>>  		return -EINVAL;
>> @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
>>  		free_contig_range(end, outer_end - end);
>>  	} else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
>>  		struct page *head = pfn_to_page(start);
>> -		int order = ilog2(end - start);
>>
>>  		check_new_pages(head, order);
>>  		prep_new_page(head, order, gfp_mask, 0);
>
> Acked-by: Balbir Singh

Thanks for the review, but note that this is already upstream.

--
Cheers

David / dhildenb

From christophe.leroy at csgroup.eu Thu Oct 9 07:14:24 2025
From: christophe.leroy at csgroup.eu (Christophe Leroy)
Date: Thu, 9 Oct 2025 09:14:24 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <20250901150359.867252-9-david@redhat.com>
References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com>
Message-ID: <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>

Hi David,

On 01/09/2025 at 17:03, David Hildenbrand wrote:
> Let's check that no hstate that corresponds to an unreasonable folio size
> is registered by an architecture. If we were to succeed registering, we
> could later try allocating an unsupported gigantic folio size.
>
> Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER
> is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have
> to use a BUILD_BUG_ON_INVALID() to make it compile.
>
> No existing kernel configuration should be able to trigger this check:
> either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or
> gigantic folios will not exceed a memory section (the case on sparse).
>
> Reviewed-by: Zi Yan
> Reviewed-by: Lorenzo Stoakes
> Reviewed-by: Liam R. Howlett
> Signed-off-by: David Hildenbrand

I get following warning on powerpc with linus tree, bisected to commit
7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when
registering hstate")

------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at mm/hugetlb.c:4744 hugetlb_add_hstate+0xc0/0x180
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Not tainted 6.17.0-rc4-00275-g7b4f21f5e038 #1683 NONE
Hardware name: QEMU ppce500 e5500 0x80240020 QEMU e500
NIP: c000000001357408 LR: c000000001357c90 CTR: 0000000000000003
REGS: c00000000152bad0 TRAP: 0700 Not tainted (6.17.0-rc4-00275-g7b4f21f5e038)
MSR: 0000000080021002 CR: 44000448 XER: 20000000
IRQMASK: 1
GPR00: c000000001357c90 c00000000152bd70 c000000001339000 0000000000000012
GPR04: 000000000000000a 0000000000001000 000000000000001e 0000000000000000
GPR08: 0000000000000000 0000000000000000 0000000000000001 000000000000000a
GPR12: c000000001357b68 c000000001590000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: c0000000011adb40 c00000000156b528 0000000000000000 c00000000156b4b0
GPR28: c00000000156b528 0000000000000012 0000000040000000 0000000000000000
NIP [c000000001357408] hugetlb_add_hstate+0xc0/0x180
LR [c000000001357c90] hugepagesz_setup+0x128/0x150
Call Trace:
[c00000000152bd70] [c00000000152bda0] init_stack+0x3da0/0x4000 (unreliable)
[c00000000152be10] [c000000001357c90] hugepagesz_setup+0x128/0x150
[c00000000152be80] [c00000000135841c] hugetlb_bootmem_alloc+0x84/0x104
[c00000000152bec0] [c00000000135143c] mm_core_init+0x30/0x174
[c00000000152bf30] [c000000001332ed4] start_kernel+0x540/0x880
[c00000000152bfe0] [c000000000000a50] start_here_common+0x1c/0x20
Code: 2c09000f 39000001 38e00000 39400001 7d00401e 0b080000 281d0001 7d00505e 79080020 0b080000 281d000c 7d4a385e <0b0a0000> 1f5a00b8 38bf0020 3c82ffe8
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at mm/hugetlb.c:4744 hugetlb_add_hstate+0xc0/0x180
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G W 6.17.0-rc4-00275-g7b4f21f5e038 #1683 NONE
Tainted: [W]=WARN
Hardware name: QEMU ppce500 e5500 0x80240020 QEMU e500
NIP: c000000001357408 LR: c000000001357c90 CTR: 0000000000000005
REGS: c00000000152bad0 TRAP: 0700 Tainted: G W (6.17.0-rc4-00275-g7b4f21f5e038)
MSR: 0000000080021002 CR: 48000448 XER: 20000000
IRQMASK: 1
GPR00: c000000001357c90 c00000000152bd70 c000000001339000 000000000000000e
GPR04: 000000000000000a 0000000000001000 0000000040000000 0000000000000000
GPR08: 0000000000000000 0000000000000001 0000000000000001 0000000000000280
GPR12: c000000001357b68 c000000001590000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: c0000000011adb40 c00000000156b5e0 0000000000000001 c00000000156b4b0
GPR28: c00000000156b528 000000000000000e 0000000004000000 00000000000000b8
NIP [c000000001357408] hugetlb_add_hstate+0xc0/0x180
LR [c000000001357c90] hugepagesz_setup+0x128/0x150
Call Trace:
[c00000000152bd70] [c000000000f27048] __func__.0+0x0/0x18 (unreliable)
[c00000000152be10] [c000000001357c90] hugepagesz_setup+0x128/0x150
[c00000000152be80] [c00000000135841c] hugetlb_bootmem_alloc+0x84/0x104
[c00000000152bec0] [c00000000135143c] mm_core_init+0x30/0x174
[c00000000152bf30] [c000000001332ed4] start_kernel+0x540/0x880
[c00000000152bfe0] [c000000000000a50] start_here_common+0x1c/0x20
Code: 2c09000f 39000001 38e00000 39400001 7d00401e 0b080000 281d0001 7d00505e 79080020 0b080000 281d000c 7d4a385e <0b0a0000> 1f5a00b8 38bf0020 3c82ffe8
---[ end trace 0000000000000000 ]---
------------[ cut here ]------------
WARNING: CPU: 0 PID: 0 at mm/hugetlb.c:4744 hugetlb_add_hstate+0xc0/0x180
Modules linked in:
CPU: 0 UID: 0 PID: 0 Comm: swapper Tainted: G W 6.17.0-rc4-00275-g7b4f21f5e038 #1683 NONE
Tainted: [W]=WARN
Hardware name: QEMU ppce500 e5500 0x80240020 QEMU e500
NIP: c000000001357408 LR: c000000001357c90 CTR: 0000000000000004
REGS: c00000000152bad0 TRAP: 0700 Tainted: G W (6.17.0-rc4-00275-g7b4f21f5e038)
MSR: 0000000080021002 CR: 48000448 XER: 20000000
IRQMASK: 1
GPR00: c000000001357c90 c00000000152bd70 c000000001339000 0000000000000010
GPR04: 000000000000000a 0000000000001000 0000000004000000 0000000000000000
GPR08: 0000000000000000 0000000000000002 0000000000000001 0000000000000a00
GPR12: c000000001357b68 c000000001590000 0000000000000000 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: c0000000011adb40 c00000000156b698 0000000000000002 c00000000156b4b0
GPR28: c00000000156b528 0000000000000010 0000000010000000 0000000000000170
NIP [c000000001357408] hugetlb_add_hstate+0xc0/0x180
LR [c000000001357c90] hugepagesz_setup+0x128/0x150
Call Trace:
[c00000000152bd70] [c000000000f27048] __func__.0+0x0/0x18 (unreliable)
[c00000000152be10] [c000000001357c90] hugepagesz_setup+0x128/0x150
[c00000000152be80] [c00000000135841c] hugetlb_bootmem_alloc+0x84/0x104
[c00000000152bec0] [c00000000135143c] mm_core_init+0x30/0x174
[c00000000152bf30] [c000000001332ed4] start_kernel+0x540/0x880
[c00000000152bfe0] [c000000000000a50] start_here_common+0x1c/0x20
Code: 2c09000f 39000001 38e00000 39400001 7d00401e 0b080000 281d0001 7d00505e 79080020 0b080000 281d000c 7d4a385e <0b0a0000> 1f5a00b8 38bf0020 3c82ffe8
---[ end trace 0000000000000000 ]---

> ---
>  mm/hugetlb.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 1e777cc51ad04..d3542e92a712e 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>
>  	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>  			__NR_HPAGEFLAGS);
> +	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>
>  	if (!hugepages_supported()) {
>  		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>  	}
>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>  	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
> +	WARN_ON(order > MAX_FOLIO_ORDER);
>  	h = &hstates[hugetlb_max_hstate++];
>  	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>  	h->order = order;

From david at redhat.com Thu Oct 9 07:22:14 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 09:22:14 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com> <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
Message-ID:

On 09.10.25 09:14, Christophe Leroy wrote:
> Hi David,
>
> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>> Let's check that no hstate that corresponds to an unreasonable folio size
>> is registered by an architecture. If we were to succeed registering, we
>> could later try allocating an unsupported gigantic folio size.
>>
>> Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER
>> is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have
>> to use a BUILD_BUG_ON_INVALID() to make it compile.
>>
>> No existing kernel configuration should be able to trigger this check:
>> either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or
>> gigantic folios will not exceed a memory section (the case on sparse).
>>
>> Reviewed-by: Zi Yan
>> Reviewed-by: Lorenzo Stoakes
>> Reviewed-by: Liam R. Howlett
>> Signed-off-by: David Hildenbrand
>
> I get following warning on powerpc with linus tree, bisected to commit
> 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when
> registering hstate")

Do you have the kernel config around? Is it 32bit?

That would be helpful.

[...]

>> ---
>>  mm/hugetlb.c | 2 ++
>>  1 file changed, 2 insertions(+)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 1e777cc51ad04..d3542e92a712e 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>
>>  	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>  			__NR_HPAGEFLAGS);
>> +	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>
>>  	if (!hugepages_supported()) {
>>  		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>  	}
>>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>  	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>> +	WARN_ON(order > MAX_FOLIO_ORDER);
>>  	h = &hstates[hugetlb_max_hstate++];
>>  	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>  	h->order = order;

We end up registering hugetlb folios that are bigger than
MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
(and if we have to support that).

--
Cheers

David / dhildenb

From christophe.leroy at csgroup.eu Thu Oct 9 07:44:30 2025
From: christophe.leroy at csgroup.eu (Christophe Leroy)
Date: Thu, 9 Oct 2025 09:44:30 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To:
References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com> <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
Message-ID: <1fb2259f-65e1-4cd0-ae70-b355843970e4@csgroup.eu>

On 09/10/2025 at 09:22, David Hildenbrand wrote:
> On 09.10.25 09:14, Christophe Leroy wrote:
>> Hi David,
>>
>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>> Let's check that no hstate that corresponds to an unreasonable folio size
>>> is registered by an architecture. If we were to succeed registering, we
>>> could later try allocating an unsupported gigantic folio size.
>>>
>>> Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER
>>> is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have
>>> to use a BUILD_BUG_ON_INVALID() to make it compile.
>>>
>>> No existing kernel configuration should be able to trigger this check:
>>> either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or
>>> gigantic folios will not exceed a memory section (the case on sparse).
>>>
>>> Reviewed-by: Zi Yan
>>> Reviewed-by: Lorenzo Stoakes
>>> Reviewed-by: Liam R. Howlett
>>> Signed-off-by: David Hildenbrand
>>
>> I get following warning on powerpc with linus tree, bisected to commit
>> 7b4f21f5e038 ("mm/hugetlb: check for unreasonable folio sizes when
>> registering hstate")
>
> Do you have the kernel config around? Is it 32bit?
>
> That would be helpful.

That's corenet64_smp_defconfig

Boot on QEMU with:

	qemu-system-ppc64 -smp 2 -nographic -M ppce500 -cpu e5500 -m 1G

Christophe

From david at redhat.com Thu Oct 9 08:14:02 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 10:14:02 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To:
References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com> <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
Message-ID: <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com>

On 09.10.25 10:04, Christophe Leroy wrote:
>
>
> On 09/10/2025 at 09:22, David Hildenbrand wrote:
>> On 09.10.25 09:14, Christophe Leroy wrote:
>>> Hi David,
>>>
>>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index 1e777cc51ad04..d3542e92a712e 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>>  	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>>  			__NR_HPAGEFLAGS);
>>>> +	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>>  	if (!hugepages_supported()) {
>>>>  		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>>  	}
>>>>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>>  	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>>> +	WARN_ON(order > MAX_FOLIO_ORDER);
>>>>  	h = &hstates[hugetlb_max_hstate++];
>>>>  	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>>  	h->order = order;
>>
>> We end up registering hugetlb folios that are bigger than
>> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
>> (and if we have to support that).
>>
>
> MAX_FOLIO_ORDER is defined as:
>
> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> #define MAX_FOLIO_ORDER		PUD_ORDER
> #else
> #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
> #endif
>
> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>
> 	hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>
> Gives:
>
> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page

I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
folio_dump() code would not handle it correctly as well.

See how snapshot_page() uses MAX_FOLIO_NR_PAGES.

--
Cheers

David / dhildenb

From christophe.leroy at csgroup.eu Thu Oct 9 08:04:34 2025
From: christophe.leroy at csgroup.eu (Christophe Leroy)
Date: Thu, 9 Oct 2025 10:04:34 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To:
References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com> <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
Message-ID:

On 09/10/2025 at 09:22, David Hildenbrand wrote:
> On 09.10.25 09:14, Christophe Leroy wrote:
>> Hi David,
>>
>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index 1e777cc51ad04..d3542e92a712e 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>  	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>  			__NR_HPAGEFLAGS);
>>> +	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>  	if (!hugepages_supported()) {
>>>  		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>  	}
>>>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>  	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>> +	WARN_ON(order > MAX_FOLIO_ORDER);
>>>  	h = &hstates[hugetlb_max_hstate++];
>>>  	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>  	h->order = order;
>
> We end up registering hugetlb folios that are bigger than
> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
> (and if we have to support that).
>

MAX_FOLIO_ORDER is defined as:

#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
#define MAX_FOLIO_ORDER		PUD_ORDER
#else
#define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
#endif

MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
/sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:

	hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1

Gives:

HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page

Christophe

From david at redhat.com Thu Oct 9 09:20:24 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 11:20:24 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu>
References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com> <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu> <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com> <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu>
Message-ID: <1db15a30-72d6-4045-8aa1-68bd8411b0ba@redhat.com>

On 09.10.25 11:16, Christophe Leroy wrote:
>
>
> On 09/10/2025 at 10:14, David Hildenbrand wrote:
>> On 09.10.25 10:04, Christophe Leroy wrote:
>>>
>>>
>>> On 09/10/2025 at 09:22, David Hildenbrand wrote:
>>>> On 09.10.25 09:14, Christophe Leroy wrote:
>>>>> Hi David,
>>>>>
>>>>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>> index 1e777cc51ad04..d3542e92a712e 100644
>>>>>> --- a/mm/hugetlb.c
>>>>>> +++ b/mm/hugetlb.c
>>>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>>>>  	BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>>>>  			__NR_HPAGEFLAGS);
>>>>>> +	BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>>>>  	if (!hugepages_supported()) {
>>>>>>  		if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>>>>  	}
>>>>>>  	BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>>>>  	BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>>>>> +	WARN_ON(order > MAX_FOLIO_ORDER);
>>>>>>  	h = &hstates[hugetlb_max_hstate++];
>>>>>>  	__mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>>>>  	h->order = order;
>>>>
>>>> We end up registering hugetlb folios that are bigger than
>>>> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
>>>> (and if we have to support that).
>>>>
>>>
>>> MAX_FOLIO_ORDER is defined as:
>>>
>>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>> #define MAX_FOLIO_ORDER		PUD_ORDER
>>> #else
>>> #define MAX_FOLIO_ORDER		MAX_PAGE_ORDER
>>> #endif
>>>
>>> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
>>> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
>>> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>>>
>>> 	hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>>>
>>> Gives:
>>>
>>> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
>>> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
>>> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
>>> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
>>> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
>>> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
>>> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
>>> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page
>>
>> I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
>> folio_dump() code would not handle it correctly as well.
>
> I'm trying to dig into history and when looking at commit 4eb0716e868e
> ("hugetlb: allow to free gigantic pages regardless of the
> configuration") I understand that CONFIG_ARCH_HAS_GIGANTIC_PAGE is
> needed to be able to allocate gigantic pages at runtime. It is not
> needed to reserve gigantic pages at boottime.
>
> What am I missing ?

That CONFIG_ARCH_HAS_GIGANTIC_PAGE has nothing runtime-specific in its name.

Can't we just select CONFIG_ARCH_HAS_GIGANTIC_PAGE for the relevant hugetlb
config that allows for *gigantic pages*.
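Which of the boot-time sizes in the log above trip WARN_ON(order > MAX_FOLIO_ORDER) depends only on each size's order relative to the limit. A minimal sketch of that arithmetic follows, assuming 4 KiB base pages (PAGE_SHIFT of 12) and taking the limit as a parameter, because its actual value on corenet64_smp_defconfig is not stated in the thread. size_to_order() and exceeds_max_folio_order() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12 /* assumed 4 KiB base pages */

/* log2(size) - PAGE_SHIFT for a power-of-two size of at least one page. */
int size_to_order(unsigned long long size_bytes)
{
	int shift = 0;

	while (shift < 63 && (1ULL << shift) < size_bytes)
		shift++;
	return shift - SKETCH_PAGE_SHIFT;
}

/* True if registering this size would hit WARN_ON(order > max_folio_order). */
bool exceeds_max_folio_order(unsigned long long size_bytes, int max_folio_order)
{
	return size_to_order(size_bytes) > max_folio_order;
}
```

With 4 KiB pages, the 1 GiB, 64 MiB and 256 MiB sizes are orders 18, 14 and 16, while 4 MiB and 16 MiB are orders 10 and 12. Passing a PUD-sized order as the limit models the effect of selecting CONFIG_ARCH_HAS_GIGANTIC_PAGE, under which MAX_FOLIO_ORDER becomes PUD_ORDER; whether PUD_ORDER covers every size this platform registers is not established here.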
-- Cheers David / dhildenb From christophe.leroy at csgroup.eu Thu Oct 9 09:16:52 2025 From: christophe.leroy at csgroup.eu (Christophe Leroy) Date: Thu, 9 Oct 2025 11:16:52 +0200 Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com> References: <20250901150359.867252-1-david@redhat.com> <20250901150359.867252-9-david@redhat.com> <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu> <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com> Message-ID: <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu> Le 09/10/2025 ? 10:14, David Hildenbrand a ?crit?: > On 09.10.25 10:04, Christophe Leroy wrote: >> >> >> Le 09/10/2025 ? 09:22, David Hildenbrand a ?crit?: >>> On 09.10.25 09:14, Christophe Leroy wrote: >>>> Hi David, >>>> >>>> Le 01/09/2025 ? 17:03, David Hildenbrand a ?crit?: >>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>>>> index 1e777cc51ad04..d3542e92a712e 100644 >>>>> --- a/mm/hugetlb.c >>>>> +++ b/mm/hugetlb.c >>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void) >>>>> ??????? BUILD_BUG_ON(sizeof_field(struct page, private) * >>>>> BITS_PER_BYTE < >>>>> ??????????????? __NR_HPAGEFLAGS); >>>>> +??? BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER); >>>>> ??????? if (!hugepages_supported()) { >>>>> ??????????? if (hugetlb_max_hstate || default_hstate_max_huge_pages) >>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int >>>>> order) >>>>> ??????? } >>>>> ??????? BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); >>>>> ??????? BUG_ON(order < order_base_2(__NR_USED_SUBPAGE)); >>>>> +??? WARN_ON(order > MAX_FOLIO_ORDER); >>>>> ??????? h = &hstates[hugetlb_max_hstate++]; >>>>> ??????? __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); >>>>> ??????? h->order = order; >>> >>> We end up registering hugetlb folios that are bigger than >>> MAX_FOLIO_ORDER. 
>>> So we have to figure out how a config can trigger that
>>> (and if we have to support that).
>>>
>>
>> MAX_FOLIO_ORDER is defined as:
>>
>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>> #define MAX_FOLIO_ORDER        PUD_ORDER
>> #else
>> #define MAX_FOLIO_ORDER        MAX_PAGE_ORDER
>> #endif
>>
>> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
>> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
>> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>>
>>     hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>>
>> Gives:
>>
>> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
>> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
>> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
>> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
>> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
>> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
>> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
>> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page
>
> I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
> folio_dump() code would not handle it correctly as well.

I'm trying to dig into history and when looking at commit 4eb0716e868e
("hugetlb: allow to free gigantic pages regardless of the
configuration") I understand that CONFIG_ARCH_HAS_GIGANTIC_PAGE is
needed to be able to allocate gigantic pages at runtime. It is not
needed to reserve gigantic pages at boottime.

What am I missing ?

>
> See how snapshot_page() uses MAX_FOLIO_NR_PAGES.
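The size limit discussed above can be modelled in a few lines of userspace C. This is a minimal sketch, not kernel code: PAGE_SHIFT = 12 (4 KiB base pages) and MAX_PAGE_ORDER = 10 are hypothetical values, PUD_SHIFT = 28 mirrors the 256 MiB PUD_SIZE reported later in this thread, and size_to_order() and would_warn() are invented helper names. It reuses the quoted MAX_FOLIO_ORDER #ifdef and applies the same `order > MAX_FOLIO_ORDER` test as the new WARN_ON() in hugetlb_add_hstate().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical configuration values, for illustration only. */
#define PAGE_SHIFT     12                      /* assume 4 KiB base pages */
#define MAX_PAGE_ORDER 10                      /* assume a 4 MiB buddy limit */
#define PUD_SHIFT      28                      /* assume a 256 MiB PUD */
#define PUD_ORDER      (PUD_SHIFT - PAGE_SHIFT)

/*
 * Same selection as the MAX_FOLIO_ORDER #ifdef quoted above; compile with
 * -DCONFIG_ARCH_HAS_GIGANTIC_PAGE to model the option being selected.
 */
#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
#define MAX_FOLIO_ORDER PUD_ORDER
#else
#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
#endif

/* Order of a power-of-two hugepage size: log2(size) - PAGE_SHIFT. */
static unsigned int size_to_order(unsigned long long size)
{
	unsigned int shift = 0;

	/* Only power-of-two sizes of at least one base page are meaningful. */
	assert(size >= (1ULL << PAGE_SHIFT) && (size & (size - 1)) == 0);
	while ((1ULL << shift) < size)
		shift++;
	return shift - PAGE_SHIFT;
}

/* Models WARN_ON(order > MAX_FOLIO_ORDER) in hugetlb_add_hstate(). */
static bool would_warn(unsigned long long size)
{
	return size_to_order(size) > MAX_FOLIO_ORDER;
}
```

Under these assumptions, and without the option defined, would_warn() is false for the 4 MiB size and true for 16 MiB, 64 MiB, 256 MiB and 1 GiB. Building with -DCONFIG_ARCH_HAS_GIGANTIC_PAGE raises the limit to PUD_ORDER (16), so only the 1 GiB size still trips the check.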
>

From christophe.leroy at csgroup.eu Thu Oct 9 10:01:08 2025
From: christophe.leroy at csgroup.eu (Christophe Leroy)
Date: Thu, 9 Oct 2025 12:01:08 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <1db15a30-72d6-4045-8aa1-68bd8411b0ba@redhat.com>
References: <20250901150359.867252-1-david@redhat.com>
 <20250901150359.867252-9-david@redhat.com>
 <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
 <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com>
 <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu>
 <1db15a30-72d6-4045-8aa1-68bd8411b0ba@redhat.com>
Message-ID: <0c730c52-97ee-43ea-9697-ac11d2880ab7@csgroup.eu>

On 09/10/2025 at 11:20, David Hildenbrand wrote:
> On 09.10.25 11:16, Christophe Leroy wrote:
>>
>>
>> On 09/10/2025 at 10:14, David Hildenbrand wrote:
>>> On 09.10.25 10:04, Christophe Leroy wrote:
>>>>
>>>>
>>>> On 09/10/2025 at 09:22, David Hildenbrand wrote:
>>>>> On 09.10.25 09:14, Christophe Leroy wrote:
>>>>>> Hi David,
>>>>>>
>>>>>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index 1e777cc51ad04..d3542e92a712e 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>>>>>      BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>>>>>              __NR_HPAGEFLAGS);
>>>>>>> +    BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>>>>>      if (!hugepages_supported()) {
>>>>>>>          if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>>>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>>>>>      }
>>>>>>>      BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>>>>>      BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>>>>>> +    WARN_ON(order > MAX_FOLIO_ORDER);
>>>>>>>      h = &hstates[hugetlb_max_hstate++];
>>>>>>>      __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>>>>>      h->order = order;
>>>>>
>>>>> We end up registering hugetlb folios that are bigger than
>>>>> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
>>>>> (and if we have to support that).
>>>>>
>>>>
>>>> MAX_FOLIO_ORDER is defined as:
>>>>
>>>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>>> #define MAX_FOLIO_ORDER        PUD_ORDER
>>>> #else
>>>> #define MAX_FOLIO_ORDER        MAX_PAGE_ORDER
>>>> #endif
>>>>
>>>> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
>>>> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
>>>> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>>>>
>>>>     hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>>>>
>>>> Gives:
>>>>
>>>> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>>> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>>> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
>>>> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
>>>> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
>>>> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
>>>> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
>>>> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
>>>> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
>>>> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page
>>>
>>> I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
>>> folio_dump() code would not handle it correctly as well.
>>
>> I'm trying to dig into history and when looking at commit 4eb0716e868e
>> ("hugetlb: allow to free gigantic pages regardless of the
>> configuration") I understand that CONFIG_ARCH_HAS_GIGANTIC_PAGE is
>> needed to be able to allocate gigantic pages at runtime. It is not
>> needed to reserve gigantic pages at boottime.
>>
>> What am I missing ?
>
> That CONFIG_ARCH_HAS_GIGANTIC_PAGE has nothing runtime-specific in its
> name.

In its name for sure, but the commit I mention says:

    On systems without CONTIG_ALLOC activated but that support gigantic pages,
    boottime reserved gigantic pages can not be freed at all.  This patch
    simply enables the possibility to hand back those pages to memory
    allocator.

And one of the hunks is:

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 7f7fbd8bd9d5b..7a1aa53d188d3 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -19,7 +19,7 @@ config ARM64
         select ARCH_HAS_FAST_MULTIPLIER
         select ARCH_HAS_FORTIFY_SOURCE
         select ARCH_HAS_GCOV_PROFILE_ALL
-        select ARCH_HAS_GIGANTIC_PAGE if CONTIG_ALLOC
+        select ARCH_HAS_GIGANTIC_PAGE
         select ARCH_HAS_KCOV
         select ARCH_HAS_KEEPINITRD
         select ARCH_HAS_MEMBARRIER_SYNC_CORE

So I understand from the commit message that it was possible at that
time to have gigantic pages without ARCH_HAS_GIGANTIC_PAGE as long as
you didn't have to be able to free them during runtime.

>
> Can't we just select CONFIG_ARCH_HAS_GIGANTIC_PAGE for the relevant
> hugetlb config that allows for *gigantic pages*.
>

We probably can, but I'd really like to understand history and how we
ended up in the situation we are now.
Because blind fixes often lead to more problems.

If I follow things correctly I see a helper gigantic_page_supported()
added by commit 944d9fec8d7a ("hugetlb: add support for gigantic page
allocation at runtime").

And then commit 461a7184320a ("mm/hugetlb: introduce
ARCH_HAS_GIGANTIC_PAGE") is added to wrap gigantic_page_supported()

Then commit 4eb0716e868e ("hugetlb: allow to free gigantic pages
regardless of the configuration") changed gigantic_page_supported() to
gigantic_page_runtime_supported()

So where are we now ?
Christophe

From david at redhat.com Thu Oct 9 10:27:17 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 12:27:17 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <0c730c52-97ee-43ea-9697-ac11d2880ab7@csgroup.eu>
References: <20250901150359.867252-1-david@redhat.com>
 <20250901150359.867252-9-david@redhat.com>
 <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
 <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com>
 <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu>
 <1db15a30-72d6-4045-8aa1-68bd8411b0ba@redhat.com>
 <0c730c52-97ee-43ea-9697-ac11d2880ab7@csgroup.eu>
Message-ID: <543e9440-8ee0-4d9e-9b05-0107032d665b@redhat.com>

On 09.10.25 12:01, Christophe Leroy wrote:
>
>
> On 09/10/2025 at 11:20, David Hildenbrand wrote:
>> On 09.10.25 11:16, Christophe Leroy wrote:
>>>
>>>
>>> On 09/10/2025 at 10:14, David Hildenbrand wrote:
>>>> On 09.10.25 10:04, Christophe Leroy wrote:
>>>>>
>>>>>
>>>>> On 09/10/2025 at 09:22, David Hildenbrand wrote:
>>>>>> On 09.10.25 09:14, Christophe Leroy wrote:
>>>>>>> Hi David,
>>>>>>>
>>>>>>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>>> index 1e777cc51ad04..d3542e92a712e 100644
>>>>>>>> --- a/mm/hugetlb.c
>>>>>>>> +++ b/mm/hugetlb.c
>>>>>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>>>>>>      BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>>>>>>              __NR_HPAGEFLAGS);
>>>>>>>> +    BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>>>>>>      if (!hugepages_supported()) {
>>>>>>>>          if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>>>>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>>>>>>      }
>>>>>>>>      BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>>>>>>      BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>>>>>>> +    WARN_ON(order > MAX_FOLIO_ORDER);
>>>>>>>>      h = &hstates[hugetlb_max_hstate++];
>>>>>>>>      __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>>>>>>      h->order = order;
>>>>>>
>>>>>> We end up registering hugetlb folios that are bigger than
>>>>>> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
>>>>>> (and if we have to support that).
>>>>>>
>>>>>
>>>>> MAX_FOLIO_ORDER is defined as:
>>>>>
>>>>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>>>> #define MAX_FOLIO_ORDER        PUD_ORDER
>>>>> #else
>>>>> #define MAX_FOLIO_ORDER        MAX_PAGE_ORDER
>>>>> #endif
>>>>>
>>>>> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
>>>>> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
>>>>> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>>>>>
>>>>>     hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>>>>>
>>>>> Gives:
>>>>>
>>>>> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>>>> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
>>>>> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
>>>>> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
>>>>> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page
>>>>
>>>> I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
>>>> folio_dump() code would not handle it correctly as well.
>>>
>>> I'm trying to dig into history and when looking at commit 4eb0716e868e
>>> ("hugetlb: allow to free gigantic pages regardless of the
>>> configuration") I understand that CONFIG_ARCH_HAS_GIGANTIC_PAGE is
>>> needed to be able to allocate gigantic pages at runtime. It is not
>>> needed to reserve gigantic pages at boottime.
>>>
>>> What am I missing ?
>>
>> That CONFIG_ARCH_HAS_GIGANTIC_PAGE has nothing runtime-specific in its
>> name.
>
> In its name for sure, but the commit I mention says:
>
>     On systems without CONTIG_ALLOC activated but that support gigantic pages,
>     boottime reserved gigantic pages can not be freed at all.  This patch
>     simply enables the possibility to hand back those pages to memory
>     allocator.

Right, I think it was a historical artifact.

>
> And one of the hunks is:
>
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 7f7fbd8bd9d5b..7a1aa53d188d3 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -19,7 +19,7 @@ config ARM64
>          select ARCH_HAS_FAST_MULTIPLIER
>          select ARCH_HAS_FORTIFY_SOURCE
>          select ARCH_HAS_GCOV_PROFILE_ALL
> -        select ARCH_HAS_GIGANTIC_PAGE if CONTIG_ALLOC
> +        select ARCH_HAS_GIGANTIC_PAGE
>          select ARCH_HAS_KCOV
>          select ARCH_HAS_KEEPINITRD
>          select ARCH_HAS_MEMBARRIER_SYNC_CORE
>
> So I understand from the commit message that it was possible at that
> time to have gigantic pages without ARCH_HAS_GIGANTIC_PAGE as long as
> you didn't have to be able to free them during runtime.

Yes, I agree.

>
>>
>> Can't we just select CONFIG_ARCH_HAS_GIGANTIC_PAGE for the relevant
>> hugetlb config that allows for *gigantic pages*.
>>
>
> We probably can, but I'd really like to understand history and how we
> ended up in the situation we are now.
> Because blind fixes often lead to more problems.

Yes, let's figure out how to do it cleanly.
>
> If I follow things correctly I see a helper gigantic_page_supported()
> added by commit 944d9fec8d7a ("hugetlb: add support for gigantic page
> allocation at runtime").
>
> And then commit 461a7184320a ("mm/hugetlb: introduce
> ARCH_HAS_GIGANTIC_PAGE") is added to wrap gigantic_page_supported()
>
> Then commit 4eb0716e868e ("hugetlb: allow to free gigantic pages
> regardless of the configuration") changed gigantic_page_supported() to
> gigantic_page_runtime_supported()
>
> So where are we now ?

In

commit fae7d834c43ccdb9fcecaf4d0f33145d884b3e5c
Author: Matthew Wilcox (Oracle)
Date:   Tue Feb 27 19:23:31 2024 +0000

    mm: add __dump_folio()


We started assuming that a folio in the system (boottime, dynamic, whatever)
has a maximum of MAX_FOLIO_NR_PAGES.

Any other interpretation doesn't make any sense for MAX_FOLIO_NR_PAGES.


So we have two questions:

1) How to teach MAX_FOLIO_NR_PAGES that hugetlb supports gigantic pages

2) How do we handle CONFIG_ARCH_HAS_GIGANTIC_PAGE


We have the following options

(A) Rename existing CONFIG_ARCH_HAS_GIGANTIC_PAGE to something else that is
clearer and add a new CONFIG_ARCH_HAS_GIGANTIC_PAGE.

(B) Rename existing CONFIG_ARCH_HAS_GIGANTIC_PAGE -> to something else that is
clearer and derive somehow else that hugetlb in that config supports
gigantic pages.

(C) Just use CONFIG_ARCH_HAS_GIGANTIC_PAGE if hugetlb on an architecture
supports gigantic pages.


I don't quite see why an architecture should be able to opt in into dynamically
allocating+freeing gigantic pages. That's just CONTIG_ALLOC magic and not some
arch-specific thing IIRC.


Note that in mm/hugetlb.c it is

    #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
    #ifdef CONFIG_CONTIG_ALLOC

Meaning that at least the allocation side is guarded by CONTIG_ALLOC.

So I think (C) is just the right thing to do.

diff --git a/fs/Kconfig b/fs/Kconfig
index 0bfdaecaa8775..12c11eb9279d3 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -283,6 +283,8 @@ config HUGETLB_PMD_PAGE_TABLE_SHARING
        def_bool HUGETLB_PAGE
        depends on ARCH_WANT_HUGE_PMD_SHARE && SPLIT_PMD_PTLOCKS

+# An architecture must select this option if there is any mechanism (esp. hugetlb)
+# could obtain gigantic folios.
 config ARCH_HAS_GIGANTIC_PAGE
        bool

--
Cheers

David / dhildenb

From david at redhat.com Thu Oct 9 10:30:27 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 12:30:27 +0200
Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof()
In-Reply-To:
References: <20250821200701.1329277-1-david@redhat.com>
 <20250821200701.1329277-7-david@redhat.com>
 <5a5013ca-e976-4622-b881-290eb0d78b44@redhat.com>
Message-ID: <7d82cf5e-f60c-4295-9566-c40f6897fce7@redhat.com>

On 09.10.25 12:25, Balbir Singh wrote:
> On 10/9/25 17:12, David Hildenbrand wrote:
>> On 09.10.25 06:21, Balbir Singh wrote:
>>> On 8/22/25 06:06, David Hildenbrand wrote:
>>>> Let's reject them early, which in turn makes folio_alloc_gigantic() reject
>>>> them properly.
>>>>
>>>> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER
>>>> and calculate MAX_FOLIO_NR_PAGES based on that.
>>>>
>>>> Signed-off-by: David Hildenbrand
>>>> ---
>>>>  include/linux/mm.h | 6 ++++--
>>>>  mm/page_alloc.c    | 5 ++++-
>>>>  2 files changed, 8 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 00c8a54127d37..77737cbf2216a 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio)
>>>>
>>>>  /* Only hugetlbfs can allocate folios larger than MAX_ORDER */
>>>>  #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>>> -#define MAX_FOLIO_NR_PAGES    (1UL << PUD_ORDER)
>>>> +#define MAX_FOLIO_ORDER        PUD_ORDER
>>>
>>> Do we need to check for CONTIG_ALLOC as well with CONFIG_ARCH_HAS_GIGANTIC_PAGE?
>>>
>>
>> I don't think so, can you elaborate?
>>
>
> The only way to allocate a gigantic page is to use CMA, IIRC, which is covered by CONTIG_ALLOC

As we are discussing as part of v2 right now, there is the way to just
obtain them from memblock during boot.

>
>>>>  #else
>>>> -#define MAX_FOLIO_NR_PAGES    MAX_ORDER_NR_PAGES
>>>> +#define MAX_FOLIO_ORDER        MAX_PAGE_ORDER
>>>>  #endif
>>>>
>>>> +#define MAX_FOLIO_NR_PAGES    (1UL << MAX_FOLIO_ORDER)
>>>> +
>>>>  /*
>>>>   * compound_nr() returns the number of pages in this potentially compound
>>>>   * page.  compound_nr() can be called on a tail page, and is defined to
>>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>>> index ca9e6b9633f79..1e6ae4c395b30 100644
>>>> --- a/mm/page_alloc.c
>>>> +++ b/mm/page_alloc.c
>>>> @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask)
>>>>  int alloc_contig_range_noprof(unsigned long start, unsigned long end,
>>>>                    acr_flags_t alloc_flags, gfp_t gfp_mask)
>>>>  {
>>>> +    const unsigned int order = ilog2(end - start);
>>>
>>> Do we need a VM_WARN_ON(end < start)?
>>
>> I don't think so.
>>
>
> end - start being < 0, completely breaks ilog2. But we would error out because ilog2 > MAX_FOLIO_ORDER, so we should fine

Right, and if we have code that buggy that does it, it probably
shouldn't be our responsibility to sanity check that :) It would have
been completely buggy before this patch.

>
>>>
>>>>      unsigned long outer_start, outer_end;
>>>>      int ret = 0;
>>>>
>>>> @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
>>>>                          PB_ISOLATE_MODE_CMA_ALLOC :
>>>>                          PB_ISOLATE_MODE_OTHER;
>>>>
>>>> +    if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER))
>>>> +        return -EINVAL;
>>>> +
>>>>      gfp_mask = current_gfp_context(gfp_mask);
>>>>      if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask))
>>>>          return -EINVAL;
>>>> @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end,
>>>>              free_contig_range(end, outer_end - end);
>>>>      } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) {
>>>>          struct page *head = pfn_to_page(start);
>>>> -        int order = ilog2(end - start);
>>>>
>>>>          check_new_pages(head, order);
>>>>          prep_new_page(head, order, gfp_mask, 0);
>>>
>>> Acked-by: Balbir Singh
>>
>> Thanks for the review, but note that this is already upstream.
>>
>
> Sorry, this showed up in my updated mm thread and I ended up reviewing it, please ignore if it's upstream

I'm happy for any review (better in reply to v2), because any bug caught
early is good!

--
Cheers

David / dhildenb

From christophe.leroy at csgroup.eu Thu Oct 9 12:08:05 2025
From: christophe.leroy at csgroup.eu (Christophe Leroy)
Date: Thu, 9 Oct 2025 14:08:05 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <543e9440-8ee0-4d9e-9b05-0107032d665b@redhat.com>
References: <20250901150359.867252-1-david@redhat.com>
 <20250901150359.867252-9-david@redhat.com>
 <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
 <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com>
 <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu>
 <1db15a30-72d6-4045-8aa1-68bd8411b0ba@redhat.com>
 <0c730c52-97ee-43ea-9697-ac11d2880ab7@csgroup.eu>
 <543e9440-8ee0-4d9e-9b05-0107032d665b@redhat.com>
Message-ID: <4632e721-0ac8-4d72-a8ed-e6c928eee94d@csgroup.eu>

On 09/10/2025 at 12:27, David Hildenbrand wrote:
> On 09.10.25 12:01, Christophe Leroy wrote:
>>
>>
>> On 09/10/2025 at 11:20, David Hildenbrand wrote:
>>> On 09.10.25 11:16, Christophe Leroy wrote:
>>>>
>>>>
>>>> On 09/10/2025 at 10:14, David Hildenbrand wrote:
>>>>> On 09.10.25 10:04, Christophe Leroy wrote:
>>>>>>
>>>>>>
>>>>>> On 09/10/2025 at 09:22, David Hildenbrand wrote:
>>>>>>> On 09.10.25 09:14, Christophe Leroy wrote:
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>>>> index 1e777cc51ad04..d3542e92a712e 100644
>>>>>>>>> --- a/mm/hugetlb.c
>>>>>>>>> +++ b/mm/hugetlb.c
>>>>>>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>>>>>>>      BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>>>>>>>              __NR_HPAGEFLAGS);
>>>>>>>>> +    BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>>>>>>>      if (!hugepages_supported()) {
>>>>>>>>>          if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>>>>>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>>>>>>>      }
>>>>>>>>>      BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>>>>>>>      BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>>>>>>>> +    WARN_ON(order > MAX_FOLIO_ORDER);
>>>>>>>>>      h = &hstates[hugetlb_max_hstate++];
>>>>>>>>>      __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>>>>>>>      h->order = order;
>>>>>>>
>>>>>>> We end up registering hugetlb folios that are bigger than
>>>>>>> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
>>>>>>> (and if we have to support that).
>>>>>>>
>>>>>>
>>>>>> MAX_FOLIO_ORDER is defined as:
>>>>>>
>>>>>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>>>>> #define MAX_FOLIO_ORDER        PUD_ORDER
>>>>>> #else
>>>>>> #define MAX_FOLIO_ORDER        MAX_PAGE_ORDER
>>>>>> #endif
>>>>>>
>>>>>> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
>>>>>> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
>>>>>> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>>>>>>
>>>>>>     hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>>>>>>
>>>>>> Gives:
>>>>>>
>>>>>> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>>>>> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
>>>>>> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
>>>>>> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
>>>>>> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page
>>>>>
>>>>> I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
>>>>> folio_dump() code would not handle it correctly as well.
>>>>
>>>> I'm trying to dig into history and when looking at commit 4eb0716e868e
>>>> ("hugetlb: allow to free gigantic pages regardless of the
>>>> configuration") I understand that CONFIG_ARCH_HAS_GIGANTIC_PAGE is
>>>> needed to be able to allocate gigantic pages at runtime. It is not
>>>> needed to reserve gigantic pages at boottime.
>>>>
>>>> What am I missing ?
>>>
>>> That CONFIG_ARCH_HAS_GIGANTIC_PAGE has nothing runtime-specific in its
>>> name.
>>
>> In its name for sure, but the commit I mention says:
>>
>>     On systems without CONTIG_ALLOC activated but that support gigantic pages,
>>     boottime reserved gigantic pages can not be freed at all.  This patch
>>     simply enables the possibility to hand back those pages to memory
>>     allocator.
>
> Right, I think it was a historical artifact.
>
>>
>> And one of the hunks is:
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index 7f7fbd8bd9d5b..7a1aa53d188d3 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -19,7 +19,7 @@ config ARM64
>>          select ARCH_HAS_FAST_MULTIPLIER
>>          select ARCH_HAS_FORTIFY_SOURCE
>>          select ARCH_HAS_GCOV_PROFILE_ALL
>> -        select ARCH_HAS_GIGANTIC_PAGE if CONTIG_ALLOC
>> +        select ARCH_HAS_GIGANTIC_PAGE
>>          select ARCH_HAS_KCOV
>>          select ARCH_HAS_KEEPINITRD
>>          select ARCH_HAS_MEMBARRIER_SYNC_CORE
>>
>> So I understand from the commit message that it was possible at that
>> time to have gigantic pages without ARCH_HAS_GIGANTIC_PAGE as long as
>> you didn't have to be able to free them during runtime.
>
> Yes, I agree.
>
>>
>>>
>>> Can't we just select CONFIG_ARCH_HAS_GIGANTIC_PAGE for the relevant
>>> hugetlb config that allows for *gigantic pages*.
>>>
>>
>> We probably can, but I'd really like to understand history and how we
>> ended up in the situation we are now.
>> Because blind fixes often lead to more problems.
>
> Yes, let's figure out how to do it cleanly.
>
>>
>> If I follow things correctly I see a helper gigantic_page_supported()
>> added by commit 944d9fec8d7a ("hugetlb: add support for gigantic page
>> allocation at runtime").
>>
>> And then commit 461a7184320a ("mm/hugetlb: introduce
>> ARCH_HAS_GIGANTIC_PAGE") is added to wrap gigantic_page_supported()
>>
>> Then commit 4eb0716e868e ("hugetlb: allow to free gigantic pages
>> regardless of the configuration") changed gigantic_page_supported() to
>> gigantic_page_runtime_supported()
>>
>> So where are we now ?
>
> In
>
> commit fae7d834c43ccdb9fcecaf4d0f33145d884b3e5c
> Author: Matthew Wilcox (Oracle)
> Date:   Tue Feb 27 19:23:31 2024 +0000
>
>     mm: add __dump_folio()
>
>
> We started assuming that a folio in the system (boottime, dynamic, whatever)
> has a maximum of MAX_FOLIO_NR_PAGES.
>
> Any other interpretation doesn't make any sense for MAX_FOLIO_NR_PAGES.
>
>
> So we have two questions:
>
> 1) How to teach MAX_FOLIO_NR_PAGES that hugetlb supports gigantic pages
>
> 2) How do we handle CONFIG_ARCH_HAS_GIGANTIC_PAGE
>
>
> We have the following options
>
> (A) Rename existing CONFIG_ARCH_HAS_GIGANTIC_PAGE to something else that is
> clearer and add a new CONFIG_ARCH_HAS_GIGANTIC_PAGE.
>
> (B) Rename existing CONFIG_ARCH_HAS_GIGANTIC_PAGE -> to something else that is
> clearer and derive somehow else that hugetlb in that config supports
> gigantic pages.
>
> (C) Just use CONFIG_ARCH_HAS_GIGANTIC_PAGE if hugetlb on an architecture
> supports gigantic pages.
>
>
> I don't quite see why an architecture should be able to opt in into dynamically
> allocating+freeing gigantic pages. That's just CONTIG_ALLOC magic and not some
> arch-specific thing IIRC.
>
>
> Note that in mm/hugetlb.c it is
>
>     #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>     #ifdef CONFIG_CONTIG_ALLOC
>
> Meaning that at least the allocation side is guarded by CONTIG_ALLOC.

Yes but not the freeing since commit 4eb0716e868e ("hugetlb: allow to
free gigantic pages regardless of the configuration")

>
> So I think (C) is just the right thing to do.
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 0bfdaecaa8775..12c11eb9279d3 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -283,6 +283,8 @@ config HUGETLB_PMD_PAGE_TABLE_SHARING
>         def_bool HUGETLB_PAGE
>         depends on ARCH_WANT_HUGE_PMD_SHARE && SPLIT_PMD_PTLOCKS
>
> +# An architecture must select this option if there is any mechanism (esp. hugetlb)
> +# could obtain gigantic folios.
>  config ARCH_HAS_GIGANTIC_PAGE
>         bool
>
>

I gave it a try. That's not enough, it fixes the problem for 64 Mbytes
pages and 256 Mbytes pages, but not for 1 Gbytes pages.
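Why selecting the option fixes 64 MiB and 256 MiB but not 1 GiB can be checked with a small order comparison. This is a minimal userspace sketch, not kernel code: PAGE_SHIFT = 12 is a hypothetical value, PUD_SHIFT = 28 and P4D_SHIFT = 37 encode the PUD_SIZE = 256 MiB and P4D_SIZE = 128 GiB figures given in this message, and fits_limit() and max_folio_nr_pages() are invented helper names. Because both sides of each comparison use the same PAGE_SHIFT, the fits/does-not-fit outcome does not depend on the assumed base page size.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                        /* hypothetical: 4 KiB base pages */
#define PUD_SHIFT  28                        /* PUD_SIZE = 256 MiB */
#define P4D_SHIFT  37                        /* P4D_SIZE = 128 GiB */
#define PUD_ORDER  (PUD_SHIFT - PAGE_SHIFT)  /* 16 with these values */
#define P4D_ORDER  (P4D_SHIFT - PAGE_SHIFT)  /* 25 with these values */

/* Order of a power-of-two size that is at least one base page. */
static unsigned int size_to_order(unsigned long long size)
{
	unsigned int shift = PAGE_SHIFT;

	assert(size >= (1ULL << PAGE_SHIFT) && (size & (size - 1)) == 0);
	while ((1ULL << shift) < size)
		shift++;
	return shift - PAGE_SHIFT;
}

/* True if a hugepage of @size stays within a MAX_FOLIO_ORDER of @max_order. */
static bool fits_limit(unsigned long long size, unsigned int max_order)
{
	return size_to_order(size) <= max_order;
}

/* Same formula as MAX_FOLIO_NR_PAGES = 1UL << MAX_FOLIO_ORDER in the mm.h hunk. */
static unsigned long long max_folio_nr_pages(unsigned int max_order)
{
	return 1ULL << max_order;
}
```

With these assumptions, 64 MiB and 256 MiB fit under PUD_ORDER while 1 GiB does not, and a P4D_ORDER limit would cover 1 GiB with a MAX_FOLIO_NR_PAGES of 2^25 pages; whether that much headroom is acceptable is the open question in the thread.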
Max folio is defined by PUD_ORDER, but PUD_SIZE is 256 Mbytes so we
need to make MAX_FOLIO larger. Do we change it to P4D_ORDER or is it
too much ? P4D_SIZE is 128 Gbytes

Christophe

From david at redhat.com Thu Oct 9 13:05:06 2025
From: david at redhat.com (David Hildenbrand)
Date: Thu, 9 Oct 2025 15:05:06 +0200
Subject: (bisected) [PATCH v2 08/37] mm/hugetlb: check for unreasonable folio sizes when registering hstate
In-Reply-To: <4632e721-0ac8-4d72-a8ed-e6c928eee94d@csgroup.eu>
References: <20250901150359.867252-1-david@redhat.com>
 <20250901150359.867252-9-david@redhat.com>
 <3e043453-3f27-48ad-b987-cc39f523060a@csgroup.eu>
 <9361c75a-ab37-4d7f-8680-9833430d93d4@redhat.com>
 <03671aa8-4276-4707-9c75-83c96968cbb2@csgroup.eu>
 <1db15a30-72d6-4045-8aa1-68bd8411b0ba@redhat.com>
 <0c730c52-97ee-43ea-9697-ac11d2880ab7@csgroup.eu>
 <543e9440-8ee0-4d9e-9b05-0107032d665b@redhat.com>
 <4632e721-0ac8-4d72-a8ed-e6c928eee94d@csgroup.eu>
Message-ID:

On 09.10.25 14:08, Christophe Leroy wrote:
>
>
> On 09/10/2025 at 12:27, David Hildenbrand wrote:
>> On 09.10.25 12:01, Christophe Leroy wrote:
>>>
>>>
>>> On 09/10/2025 at 11:20, David Hildenbrand wrote:
>>>> On 09.10.25 11:16, Christophe Leroy wrote:
>>>>>
>>>>>
>>>>> On 09/10/2025 at 10:14, David Hildenbrand wrote:
>>>>>> On 09.10.25 10:04, Christophe Leroy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 09/10/2025 at 09:22, David Hildenbrand wrote:
>>>>>>>> On 09.10.25 09:14, Christophe Leroy wrote:
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> On 01/09/2025 at 17:03, David Hildenbrand wrote:
>>>>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>>>>> index 1e777cc51ad04..d3542e92a712e 100644
>>>>>>>>>> --- a/mm/hugetlb.c
>>>>>>>>>> +++ b/mm/hugetlb.c
>>>>>>>>>> @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void)
>>>>>>>>>>      BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE <
>>>>>>>>>>              __NR_HPAGEFLAGS);
>>>>>>>>>> +    BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER);
>>>>>>>>>>      if (!hugepages_supported()) {
>>>>>>>>>>          if (hugetlb_max_hstate || default_hstate_max_huge_pages)
>>>>>>>>>> @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order)
>>>>>>>>>>      }
>>>>>>>>>>      BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE);
>>>>>>>>>>      BUG_ON(order < order_base_2(__NR_USED_SUBPAGE));
>>>>>>>>>> +    WARN_ON(order > MAX_FOLIO_ORDER);
>>>>>>>>>>      h = &hstates[hugetlb_max_hstate++];
>>>>>>>>>>      __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key);
>>>>>>>>>>      h->order = order;
>>>>>>>>
>>>>>>>> We end up registering hugetlb folios that are bigger than
>>>>>>>> MAX_FOLIO_ORDER. So we have to figure out how a config can trigger that
>>>>>>>> (and if we have to support that).
>>>>>>>>
>>>>>>>
>>>>>>> MAX_FOLIO_ORDER is defined as:
>>>>>>>
>>>>>>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>>>>>>> #define MAX_FOLIO_ORDER        PUD_ORDER
>>>>>>> #else
>>>>>>> #define MAX_FOLIO_ORDER        MAX_PAGE_ORDER
>>>>>>> #endif
>>>>>>>
>>>>>>> MAX_PAGE_ORDER is the limit for dynamic creation of hugepages via
>>>>>>> /sys/kernel/mm/hugepages/ but bigger pages can be created at boottime
>>>>>>> with kernel boot parameters without CONFIG_ARCH_HAS_GIGANTIC_PAGE:
>>>>>>>
>>>>>>>     hugepagesz=64m hugepages=1 hugepagesz=256m hugepages=1
>>>>>>>
>>>>>>> Gives:
>>>>>>>
>>>>>>> HugeTLB: registered 1.00 GiB page size, pre-allocated 0 pages
>>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 1.00 GiB page
>>>>>>> HugeTLB: registered 64.0 MiB page size, pre-allocated 1 pages
>>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 64.0 MiB page
>>>>>>> HugeTLB: registered 256 MiB page size, pre-allocated 1 pages
>>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 256 MiB page
>>>>>>> HugeTLB: registered 4.00 MiB page size, pre-allocated 0 pages
>>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 4.00 MiB page
>>>>>>> HugeTLB: registered 16.0 MiB page size, pre-allocated 0 pages
>>>>>>> HugeTLB: 0 KiB vmemmap can be freed for a 16.0 MiB page
>>>>>>
>>>>>> I think it's a violation of CONFIG_ARCH_HAS_GIGANTIC_PAGE. The existing
>>>>>> folio_dump() code would not handle it correctly as well.
>>>>>
>>>>> I'm trying to dig into history and when looking at commit 4eb0716e868e
>>>>> ("hugetlb: allow to free gigantic pages regardless of the
>>>>> configuration") I understand that CONFIG_ARCH_HAS_GIGANTIC_PAGE is
>>>>> needed to be able to allocate gigantic pages at runtime. It is not
>>>>> needed to reserve gigantic pages at boottime.
>>>>>
>>>>> What am I missing ?
>>>>
>>>> That CONFIG_ARCH_HAS_GIGANTIC_PAGE has nothing runtime-specific in its
>>>> name.
>>>
>>> In its name for sure, but the commit I mention says:
>>>
>>>     On systems without CONTIG_ALLOC activated but that support gigantic pages,
>>>     boottime reserved gigantic pages can not be freed at all.  This patch
>>>     simply enables the possibility to hand back those pages to memory
>>>     allocator.
>>
>> Right, I think it was a historical artifact.
>> >>> >>> And one of the hunks is: >>> >>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig >>> index 7f7fbd8bd9d5b..7a1aa53d188d3 100644 >>> --- a/arch/arm64/Kconfig >>> +++ b/arch/arm64/Kconfig >>> @@ -19,7 +19,7 @@ config ARM64 >>> select ARCH_HAS_FAST_MULTIPLIER >>> select ARCH_HAS_FORTIFY_SOURCE >>> select ARCH_HAS_GCOV_PROFILE_ALL >>> - select ARCH_HAS_GIGANTIC_PAGE if CONTIG_ALLOC >>> + select ARCH_HAS_GIGANTIC_PAGE >>> select ARCH_HAS_KCOV >>> select ARCH_HAS_KEEPINITRD >>> select ARCH_HAS_MEMBARRIER_SYNC_CORE >>> >>> So I understand from the commit message that it was possible at that >>> time to have gigantic pages without ARCH_HAS_GIGANTIC_PAGE as long as >>> you didn't have to be able to free them during runtime. >> >> Yes, I agree. >> >>> >>>> >>>> Can't we just select CONFIG_ARCH_HAS_GIGANTIC_PAGE for the relevant >>>> hugetlb config that allows for *gigantic pages*. >>>> >>> >>> We probably can, but I'd really like to understand history and how we >>> ended up in the situation we are now. >>> Because blind fixes often lead to more problems. >> >> Yes, let's figure out how to to it cleanly. >> >>> >>> If I follow things correctly I see a helper gigantic_page_supported() >>> added by commit 944d9fec8d7a ("hugetlb: add support for gigantic page >>> allocation at runtime"). >>> >>> And then commit 461a7184320a ("mm/hugetlb: introduce >>> ARCH_HAS_GIGANTIC_PAGE") is added to wrap gigantic_page_supported() >>> >>> Then commit 4eb0716e868e ("hugetlb: allow to free gigantic pages >>> regardless of the configuration") changed gigantic_page_supported() to >>> gigantic_page_runtime_supported() >>> >>> So where are we now ? >> >> In >> >> commit fae7d834c43ccdb9fcecaf4d0f33145d884b3e5c >> Author: Matthew Wilcox (Oracle) >> Date: Tue Feb 27 19:23:31 2024 +0000 >> >>
mm: add __dump_folio() >> >> >> We started assuming that a folio in the system (boottime, dynamic, >> whatever) >> has a maximum of MAX_FOLIO_NR_PAGES. >> >> Any other interpretation doesn't make any sense for MAX_FOLIO_NR_PAGES. >> >> >> So we have two questions: >> >> 1) How to teach MAX_FOLIO_NR_PAGES that hugetlb supports gigantic pages >> >> 2) How do we handle CONFIG_ARCH_HAS_GIGANTIC_PAGE >> >> >> We have the following options >> >> (A) Rename existing CONFIG_ARCH_HAS_GIGANTIC_PAGE to something else that is >> clearer and add a new CONFIG_ARCH_HAS_GIGANTIC_PAGE. >> >> (B) Rename existing CONFIG_ARCH_HAS_GIGANTIC_PAGE -> to something else >> that is >> clearer and derive somehow else that hugetlb in that config supports >> gigantic pages. >> >> (c) Just use CONFIG_ARCH_HAS_GIGANTIC_PAGE if hugetlb on an architecture >> supports gigantic pages. >> >> >> I don't quite see why an architecture should be able to opt in into >> dynamically >> allocating+freeing gigantic pages. That's just CONTIG_ALLOC magic and >> not some >> arch-specific thing IIRC. >> >> >> Note that in mm/hugetlb.c it is >> >> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE >> #ifdef CONFIG_CONTIG_ALLOC >> >> Meaning that at least the allocation side is guarded by CONTIG_ALLOC. > > Yes but not the freeing since commit 4eb0716e868e ("hugetlb: allow to > free gigantic pages regardless of the configuration") Right, the freeing path is just always around as we no longer depend on free_contig_range(). > >> >> So I think (C) is just the right thing to do. >> >> diff --git a/fs/Kconfig b/fs/Kconfig >> index 0bfdaecaa8775..12c11eb9279d3 100644 >> --- a/fs/Kconfig >> +++ b/fs/Kconfig >> @@ -283,6 +283,8 @@ config HUGETLB_PMD_PAGE_TABLE_SHARING >> def_bool HUGETLB_PAGE >> depends on ARCH_WANT_HUGE_PMD_SHARE && SPLIT_PMD_PTLOCKS >> >> +# An architecture must select this option if there is any mechanism >> (esp. hugetlb) >> +# could obtain gigantic folios.
>> config ARCH_HAS_GIGANTIC_PAGE >> bool >> >> > > I gave it a try. That's not enough, it fixes the problem for 64 Mbytes > pages and 256 Mbytes pages, but not for 1 Gbytes pages. Thanks! > > Max folio is defined by PUD_ORDER, but PUD_SIZE is 256 Mbytes so we need > to make MAX_FOLIO larger. Do we change it to P4D_ORDER or is it too much > ? P4D_SIZE is 128 Gbytes The exact size doesn't matter, we started with something that sounds reasonable. I added the comment "There is no real limit on the folio size. We limit them to the maximum we currently expect (e.g., hugetlb, dax)." We can set it to whatever we would expect for now. -- Cheers David / dhildenb From lkp at intel.com Thu Oct 9 16:48:53 2025 From: lkp at intel.com (kernel test robot) Date: Fri, 10 Oct 2025 00:48:53 +0800 Subject: [PATCH] wireguard: allowedips: Use kfree_rcu() instead of call_rcu() In-Reply-To: <20251005122626.26988-1-wangfushuai@baidu.com> References: <20251005122626.26988-1-wangfushuai@baidu.com> Message-ID: <202510100057.ZUiqBtur-lkp@intel.com> Hi Fushuai, kernel test robot noticed the following build errors: [auto build test ERROR on linus/master] [also build test ERROR on crng-random/master v6.17 next-20251009] [If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in https://git-scm.com/docs/git-format-patch#_base_tree_information] url: https://github.com/intel-lab-lkp/linux/commits/Fushuai-Wang/wireguard-allowedips-Use-kfree_rcu-instead-of-call_rcu/20251009-142048 base: linus/master patch link: https://lore.kernel.org/r/20251005122626.26988-1-wangfushuai%40baidu.com patch subject: [PATCH] wireguard: allowedips: Use kfree_rcu() instead of call_rcu() config: m68k-hp300_defconfig (https://download.01.org/0day-ci/archive/20251010/202510100057.ZUiqBtur-lkp at intel.com/config) compiler: m68k-linux-gcc (GCC) 15.1.0 reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20251010/202510100057.ZUiqBtur-lkp at intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot | Closes: https://lore.kernel.org/oe-kbuild-all/202510100057.ZUiqBtur-lkp at intel.com/ All errors (new ones prefixed by >>): In file included from <command-line>: drivers/net/wireguard/allowedips.c: In function 'remove_node': >> include/linux/stddef.h:16:33: error: '*0' is a pointer; did you mean to use '->'?
16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER) | ^~~~~~~~~~~~~~~~~~ include/linux/compiler_types.h:575:23: note: in definition of macro '__compiletime_assert' 575 | if (!(condition)) \ | ^~~~~~~~~ include/linux/compiler_types.h:595:9: note: in expansion of macro '_compiletime_assert' 595 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG' 50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition) | ^~~~~~~~~~~~~~~~ include/linux/rcupdate.h:1124:17: note: in expansion of macro 'BUILD_BUG_ON' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~~~~~ include/linux/rcupdate.h:1124:30: note: in expansion of macro 'offsetof' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ drivers/net/wireguard/allowedips.c:269:9: note: in expansion of macro 'kfree_rcu' 269 | kfree_rcu(&node, rcu); | ^~~~~~~~~ In file included from include/linux/rculist.h:11, from include/linux/dcache.h:8, from include/linux/fs.h:9, from include/linux/highmem.h:5, from include/linux/bvec.h:10, from include/linux/skbuff.h:17, from include/linux/ip.h:16, from drivers/net/wireguard/allowedips.h:10, from drivers/net/wireguard/allowedips.c:6: >> include/linux/rcupdate.h:1125:41: error: '___p' is a pointer to pointer; did you mean to dereference it before applying '->' to it? 
1125 | kvfree_call_rcu(&((___p)->rhf), (void *) (___p)); \ | ^~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ drivers/net/wireguard/allowedips.c:269:9: note: in expansion of macro 'kfree_rcu' 269 | kfree_rcu(&node, rcu); | ^~~~~~~~~ >> include/linux/stddef.h:16:33: error: '*0' is a pointer; did you mean to use '->'? 16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER) | ^~~~~~~~~~~~~~~~~~ include/linux/compiler_types.h:575:23: note: in definition of macro '__compiletime_assert' 575 | if (!(condition)) \ | ^~~~~~~~~ include/linux/compiler_types.h:595:9: note: in expansion of macro '_compiletime_assert' 595 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG' 50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition) | ^~~~~~~~~~~~~~~~ include/linux/rcupdate.h:1124:17: note: in expansion of macro 'BUILD_BUG_ON' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~~~~~ include/linux/rcupdate.h:1124:30: note: in expansion of macro 'offsetof' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ drivers/net/wireguard/allowedips.c:275:9: note: in expansion of macro 'kfree_rcu' 275 | kfree_rcu(&parent, rcu); | ^~~~~~~~~ >> include/linux/rcupdate.h:1125:41: error: '___p' is a pointer to pointer; did you mean to dereference it before applying '->' to it? 
1125 | kvfree_call_rcu(&((___p)->rhf), (void *) (___p)); \ | ^~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ drivers/net/wireguard/allowedips.c:275:9: note: in expansion of macro 'kfree_rcu' 275 | kfree_rcu(&parent, rcu); | ^~~~~~~~~ -- In file included from <command-line>: allowedips.c: In function 'remove_node': >> include/linux/stddef.h:16:33: error: '*0' is a pointer; did you mean to use '->'? 16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER) | ^~~~~~~~~~~~~~~~~~ include/linux/compiler_types.h:575:23: note: in definition of macro '__compiletime_assert' 575 | if (!(condition)) \ | ^~~~~~~~~ include/linux/compiler_types.h:595:9: note: in expansion of macro '_compiletime_assert' 595 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG' 50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition) | ^~~~~~~~~~~~~~~~ include/linux/rcupdate.h:1124:17: note: in expansion of macro 'BUILD_BUG_ON' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~~~~~ include/linux/rcupdate.h:1124:30: note: in expansion of macro 'offsetof' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ allowedips.c:269:9: note: in expansion of macro 'kfree_rcu' 269 | kfree_rcu(&node, rcu); | ^~~~~~~~~ In file included from include/linux/rculist.h:11, from include/linux/dcache.h:8, from include/linux/fs.h:9, from include/linux/highmem.h:5, from include/linux/bvec.h:10, from
include/linux/skbuff.h:17, from include/linux/ip.h:16, from allowedips.h:10, from allowedips.c:6: >> include/linux/rcupdate.h:1125:41: error: '___p' is a pointer to pointer; did you mean to dereference it before applying '->' to it? 1125 | kvfree_call_rcu(&((___p)->rhf), (void *) (___p)); \ | ^~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ allowedips.c:269:9: note: in expansion of macro 'kfree_rcu' 269 | kfree_rcu(&node, rcu); | ^~~~~~~~~ >> include/linux/stddef.h:16:33: error: '*0' is a pointer; did you mean to use '->'? 16 | #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER) | ^~~~~~~~~~~~~~~~~~ include/linux/compiler_types.h:575:23: note: in definition of macro '__compiletime_assert' 575 | if (!(condition)) \ | ^~~~~~~~~ include/linux/compiler_types.h:595:9: note: in expansion of macro '_compiletime_assert' 595 | _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__) | ^~~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:39:37: note: in expansion of macro 'compiletime_assert' 39 | #define BUILD_BUG_ON_MSG(cond, msg) compiletime_assert(!(cond), msg) | ^~~~~~~~~~~~~~~~~~ include/linux/build_bug.h:50:9: note: in expansion of macro 'BUILD_BUG_ON_MSG' 50 | BUILD_BUG_ON_MSG(condition, "BUILD_BUG_ON failed: " #condition) | ^~~~~~~~~~~~~~~~ include/linux/rcupdate.h:1124:17: note: in expansion of macro 'BUILD_BUG_ON' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~~~~~ include/linux/rcupdate.h:1124:30: note: in expansion of macro 'offsetof' 1124 | BUILD_BUG_ON(offsetof(typeof(*(ptr)), rhf) >= 4096); \ | ^~~~~~~~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ allowedips.c:275:9: note: in expansion of macro 'kfree_rcu' 275 | kfree_rcu(&parent, rcu); | ^~~~~~~~~ >> include/linux/rcupdate.h:1125:41: 
error: '___p' is a pointer to pointer; did you mean to dereference it before applying '->' to it? 1125 | kvfree_call_rcu(&((___p)->rhf), (void *) (___p)); \ | ^~ include/linux/rcupdate.h:1087:29: note: in expansion of macro 'kvfree_rcu_arg_2' 1087 | #define kfree_rcu(ptr, rhf) kvfree_rcu_arg_2(ptr, rhf) | ^~~~~~~~~~~~~~~~ allowedips.c:275:9: note: in expansion of macro 'kfree_rcu' 275 | kfree_rcu(&parent, rcu); | ^~~~~~~~~ vim +16 include/linux/stddef.h 6e218287432472 Richard Knutsson 2006-09-30 14 ^1da177e4c3f41 Linus Torvalds 2005-04-16 15 #undef offsetof 14e83077d55ff4 Rasmus Villemoes 2022-03-23 @16 #define offsetof(TYPE, MEMBER) __builtin_offsetof(TYPE, MEMBER) 3876488444e712 Denys Vlasenko 2015-03-09 17 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki From rdunlap at infradead.org Tue Oct 14 01:37:30 2025 From: rdunlap at infradead.org (Randy Dunlap) Date: Mon, 13 Oct 2025 18:37:30 -0700 Subject: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <20251012115035.2169-1-lirongqing@baidu.com> References: <20251012115035.2169-1-lirongqing@baidu.com> Message-ID: Hi-- On 10/12/25 4:50 AM, lirongqing wrote: > From: Li RongQing > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index a51ab46..7d9a8ee 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1992,14 +1992,20 @@ > the added memory block itself do not be affected. > > hung_task_panic= > - [KNL] Should the hung task detector generate panics. > - Format: 0 | 1 > + [KNL] Number of hung tasks to trigger kernel panic. > + Format: > + > + Set this to the number of hung tasks that must be > + detected before triggering a kernel panic. 
> + > + 0: don't panic > + 1: panic immediately on first hung task > + N: panic after N hung tasks are detect are detected > > - A value of 1 instructs the kernel to panic when a > - hung task is detected. The default value is controlled > - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time > - option. The value selected by this boot parameter can > - be changed later by the kernel.hung_task_panic sysctl. > + The default value is controlled by the > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value > + selected by this boot parameter can be changed later by the > + kernel.hung_task_panic sysctl. > > hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) > terminal devices. Valid values: 0..8 -- ~Randy From mhiramat at kernel.org Thu Oct 16 08:02:12 2025 From: mhiramat at kernel.org (Masami Hiramatsu (Google)) Date: Thu, 16 Oct 2025 17:02:12 +0900 Subject: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: <20251015063615.2632-1-lirongqing@baidu.com> References: <20251015063615.2632-1-lirongqing@baidu.com> Message-ID: <20251016170212.65e2ad95b80cdeeb6f7d7ce3@kernel.org> On Wed, 15 Oct 2025 14:36:15 +0800 lirongqing wrote: > From: Li RongQing > > Currently, when 'hung_task_panic' is enabled, the kernel panics > immediately upon detecting the first hung task. However, some hung > tasks are transient and allow system recovery, while persistent hangs > should trigger a panic when accumulating beyond a threshold. > > Extend the 'hung_task_panic' sysctl to accept a threshold value > specifying the number of hung tasks that must be detected before > triggering a kernel panic. This provides finer control for environments > where transient hangs may occur but persistent hangs should be fatal. 
> > The sysctl now accepts: > - 0: don't panic (maintains original behavior) > - 1: panic on first hung task (maintains original behavior) > - N > 1: panic after N hung tasks are detected in a single scan > > This maintains backward compatibility while providing flexibility for > different hang scenarios. Looks good to me. Reviewed-by: Masami Hiramatsu (Google) Thank you, > > Signed-off-by: Li RongQing > Cc: Andrew Jeffery > Cc: Anshuman Khandual > Cc: Arnd Bergmann > Cc: David Hildenbrand > Cc: Florian Wesphal > Cc: Jakub Kacinski > Cc: Jason A. Donenfeld > Cc: Joel Granados > Cc: Joel Stanley > Cc: Jonathan Corbet > Cc: Kees Cook > Cc: Lance Yang > Cc: Liam Howlett > Cc: Lorenzo Stoakes > Cc: "Masami Hiramatsu (Google)" > Cc: "Paul E . McKenney" > Cc: Pawan Gupta > Cc: Petr Mladek > Cc: Phil Auld > Cc: Randy Dunlap > Cc: Russell King > Cc: Shuah Khan > Cc: Simon Horman > Cc: Stanislav Fomichev > Cc: Steven Rostedt > --- > diff with v3: comments modification, suggested by Lance, Masami, Randy and Petr > diff with v2: do not add a new sysctl, extend hung_task_panic, suggested by Kees Cook > > Documentation/admin-guide/kernel-parameters.txt | 20 +++++++++++++------- > Documentation/admin-guide/sysctl/kernel.rst | 9 +++++---- > arch/arm/configs/aspeed_g5_defconfig | 2 +- > kernel/configs/debug.config | 2 +- > kernel/hung_task.c | 15 ++++++++++----- > lib/Kconfig.debug | 9 +++++---- > tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- > 7 files changed, 36 insertions(+), 23 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index a51ab46..492f0bc 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1992,14 +1992,20 @@ > the added memory block itself do not be affected. > > hung_task_panic= > - [KNL] Should the hung task detector generate panics. 
> - Format: 0 | 1 > + [KNL] Number of hung tasks to trigger kernel panic. > + Format: > + > + When set to a non-zero value, a kernel panic will be triggered if > + the number of detected hung tasks reaches this value. > + > + 0: don't panic > + 1: panic immediately on first hung task > + N: panic after N hung tasks are detected in a single scan > > - A value of 1 instructs the kernel to panic when a > - hung task is detected. The default value is controlled > - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time > - option. The value selected by this boot parameter can > - be changed later by the kernel.hung_task_panic sysctl. > + The default value is controlled by the > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value > + selected by this boot parameter can be changed later by the > + kernel.hung_task_panic sysctl. > > hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) > terminal devices. Valid values: 0..8 > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst > index f3ee807..0065a55 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -397,13 +397,14 @@ a hung task is detected. > hung_task_panic > =============== > > -Controls the kernel's behavior when a hung task is detected. > +When set to a non-zero value, a kernel panic will be triggered if the > +number of hung tasks found during a single scan reaches this value. > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. > > -= ================================================= > += ======================================================= > 0 Continue operation. This is the default behavior. > -1 Panic immediately. > -= ================================================= > +N Panic when N hung tasks are found during a single scan. 
> += ======================================================= > > > hung_task_check_count > diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig > index 61cee1e..c3b0d5f 100644 > --- a/arch/arm/configs/aspeed_g5_defconfig > +++ b/arch/arm/configs/aspeed_g5_defconfig > @@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y > CONFIG_PANIC_TIMEOUT=-1 > CONFIG_SOFTLOCKUP_DETECTOR=y > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y > -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 > CONFIG_WQ_WATCHDOG=y > # CONFIG_SCHED_DEBUG is not set > CONFIG_FUNCTION_TRACER=y > diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config > index e81327d..9f6ab7d 100644 > --- a/kernel/configs/debug.config > +++ b/kernel/configs/debug.config > @@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y > # > # Debug Oops, Lockups and Hangs > # > -# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0 > # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set > CONFIG_DEBUG_ATOMIC_SLEEP=y > CONFIG_DETECT_HUNG_TASK=y > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > index b2c1f14..84b4b04 100644 > --- a/kernel/hung_task.c > +++ b/kernel/hung_task.c > @@ -81,7 +81,7 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace; > * hung task is detected: > */ > static unsigned int __read_mostly sysctl_hung_task_panic = > - IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC); > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC; > > static int > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) > @@ -218,8 +218,11 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti > } > #endif > > -static void check_hung_task(struct task_struct *t, unsigned long timeout) > +static void check_hung_task(struct task_struct *t, unsigned long timeout, > + unsigned long prev_detect_count) > { > + unsigned long total_hung_task; > + > if (!task_is_hung(t, timeout)) > return; > > @@ -229,9 +232,10 @@ static void 
check_hung_task(struct task_struct *t, unsigned long timeout) > */ > sysctl_hung_task_detect_count++; > > + total_hung_task = sysctl_hung_task_detect_count - prev_detect_count; > trace_sched_process_hang(t); > > - if (sysctl_hung_task_panic) { > + if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) { > console_verbose(); > hung_task_show_lock = true; > hung_task_call_panic = true; > @@ -300,6 +304,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) > int max_count = sysctl_hung_task_check_count; > unsigned long last_break = jiffies; > struct task_struct *g, *t; > + unsigned long prev_detect_count = sysctl_hung_task_detect_count; > > /* > * If the system crashed already then all bets are off, > @@ -320,7 +325,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) > last_break = jiffies; > } > > - check_hung_task(t, timeout); > + check_hung_task(t, timeout, prev_detect_count); > } > unlock: > rcu_read_unlock(); > @@ -389,7 +394,7 @@ static const struct ctl_table hung_task_sysctls[] = { > .mode = 0644, > .proc_handler = proc_dointvec_minmax, > .extra1 = SYSCTL_ZERO, > - .extra2 = SYSCTL_ONE, > + .extra2 = SYSCTL_INT_MAX, > }, > { > .procname = "hung_task_check_count", > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug > index 3034e294..3976c90 100644 > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -1258,12 +1258,13 @@ config DEFAULT_HUNG_TASK_TIMEOUT > Keeping the default should be fine in most cases. > > config BOOTPARAM_HUNG_TASK_PANIC > - bool "Panic (Reboot) On Hung Tasks" > + int "Number of hung tasks to trigger kernel panic" > depends on DETECT_HUNG_TASK > + default 0 > help > - Say Y here to enable the kernel to panic on "hung tasks", > - which are bugs that cause the kernel to leave a task stuck > - in uninterruptible "D" state. > + When set to a non-zero value, a kernel panic will be triggered > + if the number of hung tasks found during a single scan reaches > + this value. 
> > The panic can be used in combination with panic_timeout, > to cause the system to reboot automatically after a > diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config > index 936b18b..0504c11 100644 > --- a/tools/testing/selftests/wireguard/qemu/kernel.config > +++ b/tools/testing/selftests/wireguard/qemu/kernel.config > @@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y > CONFIG_DETECT_HUNG_TASK=y > CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y > -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 > CONFIG_PANIC_TIMEOUT=-1 > CONFIG_STACKTRACE=y > CONFIG_EARLY_PRINTK=y > -- > 2.9.4 > -- Masami Hiramatsu (Google) From pmenzel at molgen.mpg.de Thu Oct 16 12:47:51 2025 From: pmenzel at molgen.mpg.de (Paul Menzel) Date: Thu, 16 Oct 2025 14:47:51 +0200 Subject: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: <20251015063615.2632-1-lirongqing@baidu.com> References: <20251015063615.2632-1-lirongqing@baidu.com> Message-ID: <906dd11d-26db-4570-840a-e4797748c05c@molgen.mpg.de> Dear RongQing, Thank you for the patch. One minor comment regarding the Kconfig description. Am 15.10.25 um 08:36 schrieb lirongqing: > From: Li RongQing > > Currently, when 'hung_task_panic' is enabled, the kernel panics > immediately upon detecting the first hung task. However, some hung > tasks are transient and allow system recovery, while persistent hangs > should trigger a panic when accumulating beyond a threshold. > > Extend the 'hung_task_panic' sysctl to accept a threshold value > specifying the number of hung tasks that must be detected before > triggering a kernel panic. This provides finer control for environments > where transient hangs may occur but persistent hangs should be fatal. 
> > The sysctl now accepts: > - 0: don't panic (maintains original behavior) > - 1: panic on first hung task (maintains original behavior) > - N > 1: panic after N hung tasks are detected in a single scan > > This maintains backward compatibility while providing flexibility for > different hang scenarios. > > Signed-off-by: Li RongQing > Cc: Andrew Jeffery > Cc: Anshuman Khandual > Cc: Arnd Bergmann > Cc: David Hildenbrand > Cc: Florian Wesphal > Cc: Jakub Kacinski > Cc: Jason A. Donenfeld > Cc: Joel Granados > Cc: Joel Stanley > Cc: Jonathan Corbet > Cc: Kees Cook > Cc: Lance Yang > Cc: Liam Howlett > Cc: Lorenzo Stoakes > Cc: "Masami Hiramatsu (Google)" > Cc: "Paul E . McKenney" > Cc: Pawan Gupta > Cc: Petr Mladek > Cc: Phil Auld > Cc: Randy Dunlap > Cc: Russell King > Cc: Shuah Khan > Cc: Simon Horman > Cc: Stanislav Fomichev > Cc: Steven Rostedt > --- > diff with v3: comments modification, suggested by Lance, Masami, Randy and Petr > diff with v2: do not add a new sysctl, extend hung_task_panic, suggested by Kees Cook > > Documentation/admin-guide/kernel-parameters.txt | 20 +++++++++++++------- > Documentation/admin-guide/sysctl/kernel.rst | 9 +++++---- > arch/arm/configs/aspeed_g5_defconfig | 2 +- > kernel/configs/debug.config | 2 +- > kernel/hung_task.c | 15 ++++++++++----- > lib/Kconfig.debug | 9 +++++---- > tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- > 7 files changed, 36 insertions(+), 23 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index a51ab46..492f0bc 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1992,14 +1992,20 @@ > the added memory block itself do not be affected. > > hung_task_panic= > - [KNL] Should the hung task detector generate panics. > - Format: 0 | 1 > + [KNL] Number of hung tasks to trigger kernel panic. 
> + Format: > + > + When set to a non-zero value, a kernel panic will be triggered if > + the number of detected hung tasks reaches this value. > + > + 0: don't panic > + 1: panic immediately on first hung task > + N: panic after N hung tasks are detected in a single scan > > - A value of 1 instructs the kernel to panic when a > - hung task is detected. The default value is controlled > - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time > - option. The value selected by this boot parameter can > - be changed later by the kernel.hung_task_panic sysctl. > + The default value is controlled by the > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value > + selected by this boot parameter can be changed later by the > + kernel.hung_task_panic sysctl. > > hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) > terminal devices. Valid values: 0..8 > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst > index f3ee807..0065a55 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -397,13 +397,14 @@ a hung task is detected. > hung_task_panic > =============== > > -Controls the kernel's behavior when a hung task is detected. > +When set to a non-zero value, a kernel panic will be triggered if the > +number of hung tasks found during a single scan reaches this value. > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. > > -= ================================================= > += ======================================================= > 0 Continue operation. This is the default behavior. > -1 Panic immediately. > -= ================================================= > +N Panic when N hung tasks are found during a single scan. > += ======================================================= > > > hung_task_check_count […]
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug > index 3034e294..3976c90 100644 > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -1258,12 +1258,13 @@ config DEFAULT_HUNG_TASK_TIMEOUT > Keeping the default should be fine in most cases. > > config BOOTPARAM_HUNG_TASK_PANIC > - bool "Panic (Reboot) On Hung Tasks" > + int "Number of hung tasks to trigger kernel panic" > depends on DETECT_HUNG_TASK > + default 0 > help > - Say Y here to enable the kernel to panic on "hung tasks", > - which are bugs that cause the kernel to leave a task stuck > - in uninterruptible "D" state. > + When set to a non-zero value, a kernel panic will be triggered > + if the number of hung tasks found during a single scan reaches > + this value. > > The panic can be used in combination with panic_timeout, > to cause the system to reboot automatically after a Why not leave the sentence about the uninterruptible "D" state in there? Also, it sounds like some are actually using this in production. Maybe it should be moved out of `Kconfig.debug` too? Kind regards, Paul From akpm at linux-foundation.org Thu Oct 16 20:50:28 2025 From: akpm at linux-foundation.org (Andrew Morton) Date: Thu, 16 Oct 2025 13:50:28 -0700 Subject: [External Mail] Re: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: References: <20251015063615.2632-1-lirongqing@baidu.com> <4db3bd26-1f74-4096-84fd-f652ec9a4d27@linux.dev> Message-ID: <20251016135028.aea65e20b0bc7efee11572f1@linux-foundation.org> On Thu, 16 Oct 2025 05:57:34 +0000 "Li,Rongqing" wrote: > > If you agree, likely no need to resend - Andrew could pick it up directly when > > applying :) > > > > This is better; > > Andrew, could you pick it up directly No problems, thanks.
From jaquilina at eagleeyet.net Mon Oct 20 00:50:30 2025 From: jaquilina at eagleeyet.net (Jonathan Aquilina) Date: Mon, 20 Oct 2025 00:50:30 +0000 Subject: Laptop using tunnel on iPhone through hotspot Message-ID: Good evening, I have a very interesting use case that I would like to run by the list. My iPhone is peered with my opnsense router which has WireGuard. Is it possible if I have a laptop connected to my phone's hotspot to have all network traffic go over the WireGuard tunnel that my phone is connected to? Regards, Jonathan From syzbot+f2fbf7478a35a94c8b7c at syzkaller.appspotmail.com Mon Oct 20 01:52:25 2025 From: syzbot+f2fbf7478a35a94c8b7c at syzkaller.appspotmail.com (syzbot) Date: Sun, 19 Oct 2025 18:52:25 -0700 Subject: [syzbot] [wireguard?] INFO: task hung in wg_netns_pre_exit (5) In-Reply-To: <66f49736.050a0220.211276.0036.GAE@google.com> Message-ID: <68f595d8.050a0220.91a22.043c.GAE@google.com> syzbot has found a reproducer for the following issue on: HEAD commit: 88224095b4e5 Merge branch 'net-dsa-lantiq_gswip-clean-up-a.. git tree: net-next console output: https://syzkaller.appspot.com/x/log.txt?x=17b28de2580000 kernel config: https://syzkaller.appspot.com/x/.config?x=913caf94397d1b8d dashboard link: https://syzkaller.appspot.com/bug?extid=f2fbf7478a35a94c8b7c compiler: Debian clang version 20.1.8 (++20250708063551+0c9f909b7976-1~exp1~20250708183702.136), Debian LLD 20.1.8 syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10796b04580000 Downloadable assets: disk image: https://storage.googleapis.com/syzbot-assets/f3cb46a2b9fc/disk-88224095.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/0d43ffbc738d/vmlinux-88224095.xz kernel image: https://storage.googleapis.com/syzbot-assets/9817e4fdd10a/bzImage-88224095.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+f2fbf7478a35a94c8b7c at syzkaller.appspotmail.com INFO: task kworker/u8:8:6081 blocked for more than 143 seconds. 
Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:kworker/u8:8 state:D stack:23424 pid:6081 tgid:6081 ppid:2 task_flags:0x4208060 flags:0x00080000 Workqueue: netns cleanup_net Call Trace: context_switch kernel/sched/core.c:5325 [inline] __schedule+0x1798/0x4cc0 kernel/sched/core.c:6929 __schedule_loop kernel/sched/core.c:7011 [inline] schedule+0x165/0x360 kernel/sched/core.c:7026 schedule_preempt_disabled+0x13/0x30 kernel/sched/core.c:7083 __mutex_lock_common kernel/locking/mutex.c:676 [inline] __mutex_lock+0x7e6/0x1350 kernel/locking/mutex.c:760 wg_netns_pre_exit+0x1c/0x1d0 drivers/net/wireguard/device.c:419 ops_pre_exit_list net/core/net_namespace.c:161 [inline] ops_undo_list+0x187/0x990 net/core/net_namespace.c:234 cleanup_net+0x4d8/0x820 net/core/net_namespace.c:696 process_one_work kernel/workqueue.c:3263 [inline] process_scheduled_works+0xae1/0x17b0 kernel/workqueue.c:3346 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3427 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x4bc/0x870 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 INFO: task syz-executor:6161 blocked for more than 147 seconds. Not tainted syzkaller #0 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:syz-executor state:D stack:25608 pid:6161 tgid:6161 ppid:1 task_flags:0x400140 flags:0x00080002 Call Trace: context_switch kernel/sched/core.c:5325 [inline] __schedule+0x1798/0x4cc0 kernel/sched/core.c:6929 __schedule_loop kernel/sched/core.c:7011 [inline] schedule+0x165/0x360 kernel/sched/core.c:7026 --- If you want syzbot to run the reproducer, reply with: #syz test: git://repo/address.git branch-or-commit-hash If you attach or paste a git patch, syzbot will apply it before testing. 
From mike at pineview.net Mon Oct 20 02:12:13 2025 From: mike at pineview.net (Mike O'Connor) Date: Mon, 20 Oct 2025 12:42:13 +1030 Subject: Laptop using tunnel on iPhone through hotspot In-Reply-To: References: Message-ID: <38FEDEE6-1770-41E7-8718-18978650C993@pineview.net> Hi Jonathan, iPhones do not allow this, and I might be wrong, but I think this is also the case with Android. You will need to run a VPN on your laptop. Cheers > On 20 Oct 2025, at 11:26 am, Jonathan Aquilina wrote: > > Good evening, > > I have a very interesting use case that I would like to run by the list. > > My iPhone is peered with my opnsense router which has WireGuard. > > Is it possible if I have a laptop connected to my phone's hotspot to have all network traffic go over the WireGuard tunnel that my phone is connected to? > > Regards, > Jonathan From syzbot+f2fbf7478a35a94c8b7c at syzkaller.appspotmail.com Mon Oct 20 19:23:03 2025 From: syzbot+f2fbf7478a35a94c8b7c at syzkaller.appspotmail.com (syzbot) Date: Mon, 20 Oct 2025 12:23:03 -0700 Subject: [syzbot] [wireguard?] INFO: task hung in wg_netns_pre_exit (5) In-Reply-To: <66f49736.050a0220.211276.0036.GAE@google.com> Message-ID: <68f68c17.050a0220.91a22.0450.GAE@google.com> syzbot has bisected this issue to: commit d4dfc5700e867b22ab94f960f9a9972696a637d5 Author: Andrii Nakryiko Date: Tue Mar 19 23:38:49 2024 +0000 bpf: pass whole link instead of prog when triggering raw tracepoint bisection log: https://syzkaller.appspot.com/x/bisect.txt?x=17ccbc58580000 start commit: 88224095b4e5 Merge branch 'net-dsa-lantiq_gswip-clean-up-a..
git tree: net-next final oops: https://syzkaller.appspot.com/x/report.txt?x=142cbc58580000 console output: https://syzkaller.appspot.com/x/log.txt?x=102cbc58580000 kernel config: https://syzkaller.appspot.com/x/.config?x=913caf94397d1b8d dashboard link: https://syzkaller.appspot.com/bug?extid=f2fbf7478a35a94c8b7c syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10796b04580000 Reported-by: syzbot+f2fbf7478a35a94c8b7c at syzkaller.appspotmail.com Fixes: d4dfc5700e86 ("bpf: pass whole link instead of prog when triggering raw tracepoint") For information about bisection process see: https://goo.gl/tpsmEJ#bisection From wireguard at asbjorn.st Thu Oct 30 19:13:00 2025 From: wireguard at asbjorn.st (Asbjørn Sloth Tønnesen) Date: Thu, 30 Oct 2025 19:13:00 +0000 Subject: [PATCH wireguard-tools v2 0/2] ipc: linux: kernel-side device filtering Message-ID: <20251030191305.800464-1-wireguard@asbjorn.st> Move device filtering to the kernel, thereby reducing netlink traffic. The first patch requests kernel-side filtering. The second patch, as an additional step, removes the old user-space filtering code, which breaks device listing on kernels earlier than Linux v4.6. I assume that a dependency on Linux v4.6+ is acceptable for wg-tools now, as wireguard-linux-compat hasn't been updated for 3 years.
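The kernel-side filter this series relies on is a nested IFLA_LINKINFO attribute carrying IFLA_INFO_KIND in an RTM_GETLINK dump request, which patch 1/2 builds with libmnl. As a rough sketch, the same request can be assembled with only the uapi macros from <linux/rtnetlink.h>. The helper names add_attr() and build_getlink_by_kind() are hypothetical, the string "wireguard" is assumed to match WG_GENL_NAME, and the code only builds the message: it does not open a netlink socket or parse replies.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>
#include <linux/if_link.h>

/* Append one attribute at the (aligned) end of the message.
 * Returns a pointer to the new attribute, or NULL if it does not fit. */
static struct rtattr *add_attr(struct nlmsghdr *nlh, size_t maxlen,
			       unsigned short type, const void *data, size_t len)
{
	size_t attrlen = RTA_LENGTH(len);
	struct rtattr *rta;

	if (NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(attrlen) > maxlen)
		return NULL;
	rta = (struct rtattr *)((char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len));
	rta->rta_type = type;
	rta->rta_len = attrlen;
	if (len)
		memcpy(RTA_DATA(rta), data, len);
	nlh->nlmsg_len = NLMSG_ALIGN(nlh->nlmsg_len) + RTA_ALIGN(attrlen);
	return rta;
}

/* Hypothetical helper: build an RTM_GETLINK dump request that asks the
 * kernel to return only links whose IFLA_INFO_KIND equals `kind`.
 * Returns the message length, or 0 if the buffer is too small. */
static size_t build_getlink_by_kind(char *buf, size_t buflen,
				    const char *kind, unsigned int seq)
{
	struct nlmsghdr *nlh = (struct nlmsghdr *)buf;
	struct ifinfomsg *ifm;
	struct rtattr *nest;

	assert(kind != NULL);
	if (buflen < NLMSG_SPACE(sizeof(*ifm)))
		return 0;
	memset(buf, 0, buflen);
	nlh->nlmsg_len = NLMSG_LENGTH(sizeof(*ifm));
	nlh->nlmsg_type = RTM_GETLINK;
	nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	nlh->nlmsg_seq = seq;
	ifm = NLMSG_DATA(nlh);
	ifm->ifi_family = AF_UNSPEC;

	/* Open the IFLA_LINKINFO nest, then add the NUL-terminated kind. */
	nest = add_attr(nlh, buflen, IFLA_LINKINFO, NULL, 0);
	if (!nest || !add_attr(nlh, buflen, IFLA_INFO_KIND, kind, strlen(kind) + 1))
		return 0;
	/* Close the nest: its length must cover the nested attribute. */
	nest->rta_len = (char *)nlh + nlh->nlmsg_len - (char *)nest;
	return nlh->nlmsg_len;
}
```

Per the series, kernels older than v4.6 do not apply this filter, which is why dropping the user-space IFLA_INFO_KIND check in patch 2/2 breaks device listing there.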
--- Changes: v2: - Added info about kernel support to commit message - Added another patch, for removing client-side filtering v1: https://lists.zx2c4.com/pipermail/wireguard/2025-September/009004.html Asbjørn Sloth Tønnesen (2): ipc: linux: filter netdevices kernel-side ipc: linux: remove user-space device filtering src/ipc-linux.h | 22 ++++++++-------------- 1 file changed, 8 insertions(+), 14 deletions(-) -- 2.51.0 From wireguard at asbjorn.st Thu Oct 30 19:13:01 2025 From: wireguard at asbjorn.st (Asbjørn Sloth Tønnesen) Date: Thu, 30 Oct 2025 19:13:01 +0000 Subject: [PATCH wireguard-tools v2 1/2] ipc: linux: filter netdevices kernel-side In-Reply-To: <20251030191305.800464-1-wireguard@asbjorn.st> References: <20251030191305.800464-1-wireguard@asbjorn.st> Message-ID: <20251030191305.800464-2-wireguard@asbjorn.st> Tell the kernel that we are only interested in wireguard netdevices, so that the kernel doesn't have to dump all the other netdevices. Kernel-side support for this was added in Linux v4.6 in commit dc599f76c22b ("net: Add support for filtering link dump by master device and kind"). Tested with 10000 netdevices (common with ISP BNG setups), out of which 1 was a wireguard netdevice.
Baseline: # time ./src/wg show real 0m0.342s user 0m0.013s sys 0m0.290s With patch: # time ./src/wg show real 0m0.006s user 0m0.000s sys 0m0.005s Signed-off-by: Asbjørn Sloth Tønnesen --- src/ipc-linux.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/src/ipc-linux.h b/src/ipc-linux.h index 01247f1..c56fede 100644 --- a/src/ipc-linux.h +++ b/src/ipc-linux.h @@ -80,6 +80,7 @@ static int kernel_get_wireguard_interfaces(struct string_list *list) int ret = 0; struct nlmsghdr *nlh; struct ifinfomsg *ifm; + struct nlattr *linkinfo_nest; ret = -ENOMEM; rtnl_buffer = calloc(SOCKET_BUFFER_SIZE, 1); @@ -105,6 +106,11 @@ static int kernel_get_wireguard_interfaces(struct string_list *list) nlh->nlmsg_seq = seq; ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm)); ifm->ifi_family = AF_UNSPEC; + + linkinfo_nest = mnl_attr_nest_start(nlh, IFLA_LINKINFO); + mnl_attr_put_strz(nlh, IFLA_INFO_KIND, WG_GENL_NAME); + mnl_attr_nest_end(nlh, linkinfo_nest); + message_len = nlh->nlmsg_len; if (mnl_socket_sendto(nl, rtnl_buffer, message_len) < 0) { -- 2.51.0 From wireguard at asbjorn.st Thu Oct 30 19:13:02 2025 From: wireguard at asbjorn.st (Asbjørn Sloth Tønnesen) Date: Thu, 30 Oct 2025 19:13:02 +0000 Subject: [PATCH wireguard-tools v2 2/2] ipc: linux: remove user-space device filtering In-Reply-To: <20251030191305.800464-1-wireguard@asbjorn.st> References: <20251030191305.800464-1-wireguard@asbjorn.st> Message-ID: <20251030191305.800464-3-wireguard@asbjorn.st> As devices are now filtered kernel-side, we can remove the user-space filtering code. This breaks device listing for kernels earlier than Linux v4.6.
Signed-off-by: Asbjørn Sloth Tønnesen --- src/ipc-linux.h | 16 ++-------------- 1 file changed, 2 insertions(+), 14 deletions(-) diff --git a/src/ipc-linux.h b/src/ipc-linux.h index c56fede..45bb55c 100644 --- a/src/ipc-linux.h +++ b/src/ipc-linux.h @@ -29,25 +29,13 @@ struct interface { const char *name; - bool is_wireguard; }; -static int parse_linkinfo(const struct nlattr *attr, void *data) -{ - struct interface *interface = data; - - if (mnl_attr_get_type(attr) == IFLA_INFO_KIND && !strcmp(WG_GENL_NAME, mnl_attr_get_str(attr))) - interface->is_wireguard = true; - return MNL_CB_OK; -} - static int parse_infomsg(const struct nlattr *attr, void *data) { struct interface *interface = data; - if (mnl_attr_get_type(attr) == IFLA_LINKINFO) - return mnl_attr_parse_nested(attr, parse_linkinfo, data); - else if (mnl_attr_get_type(attr) == IFLA_IFNAME) + if (mnl_attr_get_type(attr) == IFLA_IFNAME) interface->name = mnl_attr_get_str(attr); return MNL_CB_OK; } @@ -61,7 +49,7 @@ static int read_devices_cb(const struct nlmsghdr *nlh, void *data) ret = mnl_attr_parse(nlh, sizeof(struct ifinfomsg), parse_infomsg, &interface); if (ret != MNL_CB_OK) return ret; - if (interface.name && interface.is_wireguard) + if (interface.name) ret = string_list_add(list, interface.name); if (ret < 0) return ret; -- 2.51.0 From anthonypaul at gmail.com Fri Oct 31 13:14:14 2025 From: anthonypaul at gmail.com (Anthony Paul) Date: Fri, 31 Oct 2025 10:44:14 -0230 Subject: Crash on Windows ARM64 when Import and potential fix In-Reply-To: References: Message-ID: Hi Simon. WG for arm64 windows still has all the issues with the file selection stuff... Any way you could get this patch pushed through? Anything I can do to help? Thanks, Anthony On Wed, May 21, 2025 at 4:55 AM Simon Rozman wrote: > > Hi, > > > There are numerous reports that the import function causes a crash on > > Windows Arm64.
(I believe it is actually the file selector that causes > > the crash) > > > > https://www.reddit.com/r/WireGuard/comments/kwqnb5/wireguard_client_crashes_when_trying_to_add/ > > > > Thanks for reaching out. > > We already have a patch for this in the wireguard-windows repo: > https://git.zx2c4.com/wireguard-windows/commit/?id=8e6558eba6665b51de35779bffa46803dbc4c10d > > It is pending review and official release. > > Best regards, > Simon From Jason at zx2c4.com Fri Oct 31 13:21:05 2025 From: Jason at zx2c4.com (Jason A. Donenfeld) Date: Fri, 31 Oct 2025 14:21:05 +0100 Subject: Crash on Windows ARM64 when Import and potential fix In-Reply-To: References: Message-ID: On Fri, Oct 31, 2025 at 10:44:14AM -0230, Anthony Paul wrote: > Hi Simon. > > WG for arm64 windows still has all the issues with the file selection > stuff... Any way you could get this patch pushed through? Anything I > can do to help? I'm in the process of rebuilding my Windows rig so that I can resume development of this. From balbirs at nvidia.com Thu Oct 9 04:21:25 2025 From: balbirs at nvidia.com (Balbir Singh) Date: Thu, 09 Oct 2025 04:21:25 -0000 Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250821200701.1329277-7-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-7-david@redhat.com> Message-ID: On 8/22/25 06:06, David Hildenbrand wrote: > Let's reject them early, which in turn makes folio_alloc_gigantic() reject > them properly. > > To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER > and calculate MAX_FOLIO_NR_PAGES based on that.
> > Signed-off-by: David Hildenbrand > --- > include/linux/mm.h | 6 ++++-- > mm/page_alloc.c | 5 ++++- > 2 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 00c8a54127d37..77737cbf2216a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) > > /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) > +#define MAX_FOLIO_ORDER PUD_ORDER Do we need to check for CONTIG_ALLOC as well with CONFIG_ARCH_HAS_GIGANTIC_PAGE? > #else > -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES > +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER > #endif > > +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > + > /* > * compound_nr() returns the number of pages in this potentially compound > * page. compound_nr() can be called on a tail page, and is defined to > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index ca9e6b9633f79..1e6ae4c395b30 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) > int alloc_contig_range_noprof(unsigned long start, unsigned long end, > acr_flags_t alloc_flags, gfp_t gfp_mask) > { > + const unsigned int order = ilog2(end - start); Do we need a VM_WARN_ON(end < start)? 
> unsigned long outer_start, outer_end; > int ret = 0; > > @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > PB_ISOLATE_MODE_CMA_ALLOC : > PB_ISOLATE_MODE_OTHER; > > + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) > + return -EINVAL; > + > gfp_mask = current_gfp_context(gfp_mask); > if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) > return -EINVAL; > @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > free_contig_range(end, outer_end - end); > } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { > struct page *head = pfn_to_page(start); > - int order = ilog2(end - start); > > check_new_pages(head, order); > prep_new_page(head, order, gfp_mask, 0); Acked-by: Balbir Singh Balbir From balbirs at nvidia.com Thu Oct 9 10:26:02 2025 From: balbirs at nvidia.com (Balbir Singh) Date: Thu, 09 Oct 2025 10:26:02 -0000 Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <5a5013ca-e976-4622-b881-290eb0d78b44@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-7-david@redhat.com> <5a5013ca-e976-4622-b881-290eb0d78b44@redhat.com> Message-ID: On 10/9/25 17:12, David Hildenbrand wrote: > On 09.10.25 06:21, Balbir Singh wrote: >> On 8/22/25 06:06, David Hildenbrand wrote: >>> Let's reject them early, which in turn makes folio_alloc_gigantic() reject >>> them properly. >>> >>> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER >>> and calculate MAX_FOLIO_NR_PAGES based on that. >>> >>> Signed-off-by: David Hildenbrand >>> --- >>> include/linux/mm.h | 6 ++++-- >>> mm/page_alloc.c    | 5 ++++- >>>
2 files changed, 8 insertions(+), 3 deletions(-) >>> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>> index 00c8a54127d37..77737cbf2216a 100644 >>> --- a/include/linux/mm.h >>> +++ b/include/linux/mm.h >>> @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) >>> /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ >>> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE >>> -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) >>> +#define MAX_FOLIO_ORDER PUD_ORDER >> >> Do we need to check for CONTIG_ALLOC as well with CONFIG_ARCH_HAS_GIGANTIC_PAGE? >> > > I don't think so, can you elaborate? > The only way to allocate a gigantic page is to use CMA, IIRC, which is covered by CONTIG_ALLOC >>> #else >>> -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES >>> +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER >>> #endif >>> +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) >>> + >>> /* >>> * compound_nr() returns the number of pages in this potentially compound >>> * page. compound_nr() can be called on a tail page, and is defined to >>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>> index ca9e6b9633f79..1e6ae4c395b30 100644 >>> --- a/mm/page_alloc.c >>> +++ b/mm/page_alloc.c >>> @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) >>> int alloc_contig_range_noprof(unsigned long start, unsigned long end, >>> acr_flags_t alloc_flags, gfp_t gfp_mask) >>> { >>> + const unsigned int order = ilog2(end - start); >> >> Do we need a VM_WARN_ON(end < start)? > > I don't think so. > end - start being < 0 completely breaks ilog2. But we would error out because ilog2 > MAX_FOLIO_ORDER, so we should be fine >> >>> unsigned long outer_start, outer_end; >>> int ret = 0; >>> @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, >>>
PB_ISOLATE_MODE_CMA_ALLOC : >>> PB_ISOLATE_MODE_OTHER; >>> + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) >>> + return -EINVAL; >>> + >>> gfp_mask = current_gfp_context(gfp_mask); >>> if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) >>> return -EINVAL; >>> @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, >>> free_contig_range(end, outer_end - end); >>> } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { >>> struct page *head = pfn_to_page(start); >>> - int order = ilog2(end - start); >>> check_new_pages(head, order); >>> prep_new_page(head, order, gfp_mask, 0); >> >> Acked-by: Balbir Singh > > Thanks for the review, but note that this is already upstream. > Sorry, this showed up in my updated mm thread and I ended up reviewing it; please ignore it if it's upstream. Balbir From wangfushuai at baidu.com Sun Oct 5 12:27:06 2025 From: wangfushuai at baidu.com (Fushuai Wang) Date: Sun, 05 Oct 2025 12:27:06 -0000 Subject: [PATCH] wireguard: allowedips: Use kfree_rcu() instead of call_rcu() Message-ID: <20251005122626.26988-1-wangfushuai@baidu.com> Replace call_rcu() + kmem_cache_free() with kfree_rcu() to simplify the code and reduce function size.
Signed-off-by: Fushuai Wang --- drivers/net/wireguard/allowedips.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c index 09f7fcd7da78..506f7cf0d7cf 100644 --- a/drivers/net/wireguard/allowedips.c +++ b/drivers/net/wireguard/allowedips.c @@ -48,11 +48,6 @@ static void push_rcu(struct allowedips_node **stack, } } -static void node_free_rcu(struct rcu_head *rcu) -{ - kmem_cache_free(node_cache, container_of(rcu, struct allowedips_node, rcu)); -} - static void root_free_rcu(struct rcu_head *rcu) { struct allowedips_node *node, *stack[MAX_ALLOWEDIPS_DEPTH] = { @@ -271,13 +266,13 @@ static void remove_node(struct allowedips_node *node, struct mutex *lock) if (free_parent) child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)], lockdep_is_held(lock)); - call_rcu(&node->rcu, node_free_rcu); + kfree_rcu(&node, rcu); if (!free_parent) return; if (child) child->parent_bit_packed = parent->parent_bit_packed; *(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child; - call_rcu(&parent->rcu, node_free_rcu); + kfree_rcu(&parent, rcu); } static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key, -- 2.36.1 From wangfushuai at baidu.com Sun Oct 5 13:31:00 2025 From: wangfushuai at baidu.com (Fushuai Wang) Date: Sun, 05 Oct 2025 13:31:00 -0000 Subject: [PATCH] wireguard: allowedips: Use kfree_rcu() instead of call_rcu() In-Reply-To: References: Message-ID: <20251005133031.31891-1-wangfushuai@baidu.com> >> Replace call_rcu() + kmem_cache_free() with kfree_rcu() to simplify >> the code and reduce function size. >> >> Signed-off-by: Fushuai Wang > > Hmm... have you compiled this patch ? > > I think all compilers would complain loudly. You are right. I uploaded the wrong version of the patch. I will send the correct v2 shortly. Thank you for pointing it out! Regards, Wang.
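Eric's objection concerns the first argument: kfree_rcu(ptr, field) takes the object pointer plus the name of its embedded struct rcu_head member, whereas v1 passed &node, the address of the pointer variable. The userspace toy below is only an illustration. struct rcu_head, toy_call_rcu(), toy_kfree_rcu() and struct demo_node are hypothetical stand-ins with no real grace period, and none of this is the kernel implementation. It contrasts the removed node_free_rcu() pattern (a per-type callback using container_of()) with a generic kfree_rcu-style macro that needs only the object pointer and the field name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-in for struct rcu_head; NOT the kernel definition. The offset
 * field exists only so the generic free below can find the object start. */
struct rcu_head {
	struct rcu_head *next;
	void (*func)(struct rcu_head *head);
	size_t offset;
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical node with an embedded rcu_head, like allowedips_node. */
struct demo_node {
	int key;
	struct rcu_head rcu;
};

static struct rcu_head *pending; /* toy callback list; no real grace period */
static int freed;                /* number of objects released so far */

static void toy_call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *))
{
	head->func = func;
	head->next = pending;
	pending = head;
}

/* Pretend a grace period has elapsed and run every queued callback. */
static void toy_run_callbacks(void)
{
	while (pending) {
		struct rcu_head *head = pending;

		pending = head->next; /* read before the callback frees it */
		assert(head->func != NULL);
		head->func(head);
	}
}

/* Old pattern (like node_free_rcu): a per-type callback that maps the
 * rcu_head back to its containing object with container_of(). */
static void demo_node_free_rcu(struct rcu_head *head)
{
	free(container_of(head, struct demo_node, rcu));
	freed++;
}

/* Generic callback: subtract the recorded offset to recover the object. */
static void toy_free_at_offset(struct rcu_head *head)
{
	free((char *)head - head->offset);
	freed++;
}

/* kfree_rcu-style macro: (ptr)->field only compiles when ptr is the object
 * pointer itself. Passing &node (a struct demo_node **) fails to compile,
 * which matches the v1 error. */
#define toy_kfree_rcu(ptr, field)                                        \
	do {                                                             \
		(ptr)->field.offset =                                    \
			(size_t)((char *)&(ptr)->field - (char *)(ptr)); \
		toy_call_rcu(&(ptr)->field, toy_free_at_offset);         \
	} while (0)
```

In the toy, recording the field's offset is what lets a single generic callback replace the per-type one, which mirrors why v2 could delete node_free_rcu() entirely.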
From wangfushuai at baidu.com Sun Oct 5 13:40:03 2025 From: wangfushuai at baidu.com (Fushuai Wang) Date: Sun, 05 Oct 2025 13:40:03 -0000 Subject: [PATCH v2] wireguard: allowedips: Use kfree_rcu() instead of call_rcu() Message-ID: <20251005133936.32667-1-wangfushuai@baidu.com> Replace call_rcu() + kmem_cache_free() with kfree_rcu() to simplify the code and reduce function size. Signed-off-by: Fushuai Wang --- drivers/net/wireguard/allowedips.c | 9 ++------- 1 file changed, 2 insertions(+), 7 deletions(-) diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c index 09f7fcd7da78..5ece9acad64d 100644 --- a/drivers/net/wireguard/allowedips.c +++ b/drivers/net/wireguard/allowedips.c @@ -48,11 +48,6 @@ static void push_rcu(struct allowedips_node **stack, } } -static void node_free_rcu(struct rcu_head *rcu) -{ - kmem_cache_free(node_cache, container_of(rcu, struct allowedips_node, rcu)); -} - static void root_free_rcu(struct rcu_head *rcu) { struct allowedips_node *node, *stack[MAX_ALLOWEDIPS_DEPTH] = { @@ -271,13 +266,13 @@ static void remove_node(struct allowedips_node *node, struct mutex *lock) if (free_parent) child = rcu_dereference_protected(parent->bit[!(node->parent_bit_packed & 1)], lockdep_is_held(lock)); - call_rcu(&node->rcu, node_free_rcu); + kfree_rcu(node, rcu); if (!free_parent) return; if (child) child->parent_bit_packed = parent->parent_bit_packed; *(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child; - call_rcu(&parent->rcu, node_free_rcu); + kfree_rcu(parent, rcu); } static int remove(struct allowedips_node __rcu **trie, u8 bits, const u8 *key, -- 2.36.1 From lirongqing at baidu.com Sun Oct 12 11:53:20 2025 From: lirongqing at baidu.com (lirongqing) Date: Sun, 12 Oct 2025 11:53:20 -0000 Subject: [PATCH][v3] hung_task: Panic after fixed number of hung tasks Message-ID: <20251012115035.2169-1-lirongqing@baidu.com> From: Li RongQing Currently, when 'hung_task_panic' is enabled, the kernel panics immediately 
upon detecting the first hung task. However, some hung tasks are transient and the system can recover, while others are persistent and may accumulate progressively. This patch extends the 'hung_task_panic' sysctl to allow specifying the number of hung tasks that must be detected before triggering a kernel panic. This provides finer control for environments where transient hangs may occur but persistent hangs should still be fatal. The sysctl can be set to: - 0: disabled (never panic) - 1: original behavior (panic on first hung task) - N: panic when N hung tasks are detected This maintains backward compatibility while providing more flexibility for handling different hang scenarios. Signed-off-by: Li RongQing --- Diff with v2: not add new sysctl, extend hung_task_panic Documentation/admin-guide/kernel-parameters.txt | 20 +++++++++++++------- Documentation/admin-guide/sysctl/kernel.rst | 3 ++- arch/arm/configs/aspeed_g5_defconfig | 2 +- kernel/configs/debug.config | 2 +- kernel/hung_task.c | 16 +++++++++++----- lib/Kconfig.debug | 10 ++++++---- tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- 7 files changed, 35 insertions(+), 20 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a51ab46..7d9a8ee 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1992,14 +1992,20 @@ the added memory block itself do not be affected. hung_task_panic= - [KNL] Should the hung task detector generate panics. - Format: 0 | 1 + [KNL] Number of hung tasks to trigger kernel panic. + Format: + + Set this to the number of hung tasks that must be + detected before triggering a kernel panic. + + 0: don't panic + 1: panic immediately on first hung task + N: panic after N hung tasks are detect - A value of 1 instructs the kernel to panic when a - hung task is detected. 
The default value is controlled - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time - option. The value selected by this boot parameter can - be changed later by the kernel.hung_task_panic sysctl. + The default value is controlled by the + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value + selected by this boot parameter can be changed later by the + kernel.hung_task_panic sysctl. hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index f3ee807..0a8dfab 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -397,7 +397,8 @@ a hung task is detected. hung_task_panic =============== -Controls the kernel's behavior when a hung task is detected. +When set to a non-zero value, a kernel panic will be triggered if the +number of detected hung tasks reaches this value This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. 
= ================================================= diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig index 61cee1e..c3b0d5f 100644 --- a/arch/arm/configs/aspeed_g5_defconfig +++ b/arch/arm/configs/aspeed_g5_defconfig @@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y CONFIG_PANIC_TIMEOUT=-1 CONFIG_SOFTLOCKUP_DETECTOR=y CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 CONFIG_WQ_WATCHDOG=y # CONFIG_SCHED_DEBUG is not set CONFIG_FUNCTION_TRACER=y diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config index e81327d..9f6ab7d 100644 --- a/kernel/configs/debug.config +++ b/kernel/configs/debug.config @@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y # # Debug Oops, Lockups and Hangs # -# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0 # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_DETECT_HUNG_TASK=y diff --git a/kernel/hung_task.c b/kernel/hung_task.c index b2c1f14..3929ed9 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -81,7 +81,7 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace; * hung task is detected: */ static unsigned int __read_mostly sysctl_hung_task_panic = - IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC); + CONFIG_BOOTPARAM_HUNG_TASK_PANIC; static int hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) @@ -218,8 +218,11 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti } #endif -static void check_hung_task(struct task_struct *t, unsigned long timeout) +static void check_hung_task(struct task_struct *t, unsigned long timeout, + unsigned long prev_detect_count) { + unsigned long total_hung_task; + if (!task_is_hung(t, timeout)) return; @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) */ sysctl_hung_task_detect_count++; + total_hung_task = 
sysctl_hung_task_detect_count - prev_detect_count; trace_sched_process_hang(t); - if (sysctl_hung_task_panic) { + if (sysctl_hung_task_panic && + (total_hung_task >= sysctl_hung_task_panic)) { console_verbose(); hung_task_show_lock = true; hung_task_call_panic = true; @@ -300,6 +305,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) int max_count = sysctl_hung_task_check_count; unsigned long last_break = jiffies; struct task_struct *g, *t; + unsigned long prev_detect_count = sysctl_hung_task_detect_count; /* * If the system crashed already then all bets are off, @@ -320,7 +326,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) last_break = jiffies; } - check_hung_task(t, timeout); + check_hung_task(t, timeout, prev_detect_count); } unlock: rcu_read_unlock(); @@ -389,7 +395,7 @@ static const struct ctl_table hung_task_sysctls[] = { .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, + .extra2 = SYSCTL_INT_MAX, }, { .procname = "hung_task_check_count", diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 3034e294..077b9e4 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1258,12 +1258,14 @@ config DEFAULT_HUNG_TASK_TIMEOUT Keeping the default should be fine in most cases. config BOOTPARAM_HUNG_TASK_PANIC - bool "Panic (Reboot) On Hung Tasks" + int "Number of hung tasks to trigger kernel panic" depends on DETECT_HUNG_TASK + default 0 help - Say Y here to enable the kernel to panic on "hung tasks", - which are bugs that cause the kernel to leave a task stuck - in uninterruptible "D" state. + The number of hung tasks must be detected to trigger kernel panic. 
+ + - 0: Don't trigger panic + - N: Panic when N hung tasks are detected The panic can be used in combination with panic_timeout, to cause the system to reboot automatically after a diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config index 936b18b..0504c11 100644 --- a/tools/testing/selftests/wireguard/qemu/kernel.config +++ b/tools/testing/selftests/wireguard/qemu/kernel.config @@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y CONFIG_DETECT_HUNG_TASK=y CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 CONFIG_PANIC_TIMEOUT=-1 CONFIG_STACKTRACE=y CONFIG_EARLY_PRINTK=y -- 2.9.4 From Markus.Elfring at web.de Sun Oct 12 13:27:57 2025 From: Markus.Elfring at web.de (Markus Elfring) Date: Sun, 12 Oct 2025 13:27:57 -0000 Subject: [PATCH v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <20251012115035.2169-1-lirongqing@baidu.com> References: <20251012115035.2169-1-lirongqing@baidu.com> Message-ID: <0aea408f-f6d7-4e2d-8cee-1801ad7f3139@web.de> … > This patch extends the … Will another imperative wording approach become more helpful for an improved change description? https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/process/submitting-patches.rst?h=v6.17#n94 … > +++ b/kernel/hung_task.c … @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) … > trace_sched_process_hang(t); > > - if (sysctl_hung_task_panic) { > + if (sysctl_hung_task_panic && > + (total_hung_task >= sysctl_hung_task_panic)) { … I suggest to use the following source code variant instead.
if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) { Regards, Markus From lirongqing at baidu.com Mon Oct 13 02:17:45 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Mon, 13 Oct 2025 02:17:45 -0000 Subject: RE: [External Email] Re: [PATCH v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <0aea408f-f6d7-4e2d-8cee-1801ad7f3139@web.de> References: <20251012115035.2169-1-lirongqing@baidu.com> <0aea408f-f6d7-4e2d-8cee-1801ad7f3139@web.de> Message-ID: <2b19bac7de174fe6baa07234b88c8156@baidu.com> > … > > This patch extends the … > > Will another imperative wording approach become more helpful for an > improved change description? > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Docum > entation/process/submitting-patches.rst?h=v6.17#n94 > will fix in next version > > … > > +++ b/kernel/hung_task.c > … > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, > unsigned long timeout) … > > trace_sched_process_hang(t); > > > > - if (sysctl_hung_task_panic) { > > + if (sysctl_hung_task_panic && > > + (total_hung_task >= sysctl_hung_task_panic)) { > … > > I suggest to use the following source code variant instead.
> > if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) > { > will fix in next version thanks -Li > > Regards, > Markus From lirongqing at baidu.com Tue Oct 14 02:05:22 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Tue, 14 Oct 2025 02:05:22 -0000 Subject: RE: [External Email] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: References: <20251012115035.2169-1-lirongqing@baidu.com> Message-ID: <61a45c98b3a14b75bb83a2a5ea4dab51@baidu.com> > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > > index a51ab46..7d9a8ee 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1992,14 +1992,20 @@ > > the added memory block itself do not be affected. > > > > hung_task_panic= > > - [KNL] Should the hung task detector generate panics. > > - Format: 0 | 1 > > + [KNL] Number of hung tasks to trigger kernel panic. > > + Format: > > + > > + Set this to the number of hung tasks that must be > > + detected before triggering a kernel panic. > > + > > + 0: don't panic > > + 1: panic immediately on first hung task > > + N: panic after N hung tasks are detect > > are detected > Thanks, will fix in next version -Li From lance.yang at linux.dev Tue Oct 14 05:24:11 2025 From: lance.yang at linux.dev (Lance Yang) Date: Tue, 14 Oct 2025 05:24:11 -0000 Subject: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <20251012115035.2169-1-lirongqing@baidu.com> References: <20251012115035.2169-1-lirongqing@baidu.com> Message-ID: <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Thanks for the patch! I noticed the implementation panics only when N tasks are detected within a single scan, because total_hung_task is reset for each check_hung_uninterruptible_tasks() run.
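The per-scan behavior Lance describes can be reproduced with a small user-space model of the v3 arithmetic. This is a sketch, not kernel code: `v3_scan()` is a hypothetical stand-in for one check_hung_uninterruptible_tasks() pass, and the file-scope `detect_count` stands in for sysctl_hung_task_detect_count. Because prev_detect_count is re-snapshotted at the start of every scan, the difference compared against the threshold only ever counts tasks found in the current scan.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for sysctl_hung_task_detect_count (cumulative). */
static unsigned long detect_count;

/*
 * Models one scan that finds 'nhung' hung tasks, using the v3 arithmetic.
 * Returns true if the v3 condition would request a panic during this scan.
 */
static bool v3_scan(unsigned int nhung, unsigned int panic_threshold)
{
	/* Snapshot taken once per scan, like prev_detect_count in the patch. */
	unsigned long prev_detect_count = detect_count;
	bool want_panic = false;

	for (unsigned int i = 0; i < nhung; i++) {
		unsigned long total_hung_task;

		detect_count++; /* the global counter keeps growing across scans */
		total_hung_task = detect_count - prev_detect_count;

		/* ...but the difference only counts tasks seen in this scan. */
		assert(total_hung_task == i + 1UL);

		if (panic_threshold && total_hung_task >= panic_threshold)
			want_panic = true;
	}
	return want_panic;
}
```

With a threshold of 3, two consecutive `v3_scan(2, 3)` calls both return false even though the cumulative counter has reached 4, while a single `v3_scan(3, 3)` returns true.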
So some suggestions to align the documentation with the code's behavior below :) On 2025/10/12 19:50, lirongqing wrote: > From: Li RongQing > > Currently, when 'hung_task_panic' is enabled, the kernel panics > immediately upon detecting the first hung task. However, some hung > tasks are transient and the system can recover, while others are > persistent and may accumulate progressively. > > This patch extends the 'hung_task_panic' sysctl to allow specifying > the number of hung tasks that must be detected before triggering > a kernel panic. This provides finer control for environments where > transient hangs may occur but persistent hangs should still be fatal. > > The sysctl can be set to: > - 0: disabled (never panic) > - 1: original behavior (panic on first hung task) > - N: panic when N hung tasks are detected > > This maintains backward compatibility while providing more flexibility > for handling different hang scenarios. > > Signed-off-by: Li RongQing > --- > Diff with v2: not add new sysctl, extend hung_task_panic > > Documentation/admin-guide/kernel-parameters.txt | 20 +++++++++++++------- > Documentation/admin-guide/sysctl/kernel.rst | 3 ++- > arch/arm/configs/aspeed_g5_defconfig | 2 +- > kernel/configs/debug.config | 2 +- > kernel/hung_task.c | 16 +++++++++++----- > lib/Kconfig.debug | 10 ++++++---- > tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- > 7 files changed, 35 insertions(+), 20 deletions(-) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt > index a51ab46..7d9a8ee 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -1992,14 +1992,20 @@ > the added memory block itself do not be affected. > > hung_task_panic= > - [KNL] Should the hung task detector generate panics. > - Format: 0 | 1 > + [KNL] Number of hung tasks to trigger kernel panic. 
> + Format: > + > + Set this to the number of hung tasks that must be > + detected before triggering a kernel panic. > + > + 0: don't panic > + 1: panic immediately on first hung task > + N: panic after N hung tasks are detect The description should be more specific :) N: panic after N hung tasks are detected in a single scan Would it be better and cleaner? > > - A value of 1 instructs the kernel to panic when a > - hung task is detected. The default value is controlled > - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time > - option. The value selected by this boot parameter can > - be changed later by the kernel.hung_task_panic sysctl. > + The default value is controlled by the > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value > + selected by this boot parameter can be changed later by the > + kernel.hung_task_panic sysctl. > > hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) > terminal devices. Valid values: 0..8 > diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst > index f3ee807..0a8dfab 100644 > --- a/Documentation/admin-guide/sysctl/kernel.rst > +++ b/Documentation/admin-guide/sysctl/kernel.rst > @@ -397,7 +397,8 @@ a hung task is detected. > hung_task_panic > =============== > > -Controls the kernel's behavior when a hung task is detected. > +When set to a non-zero value, a kernel panic will be triggered if the > +number of detected hung tasks reaches this value Hmm... that is also ambiguous ... +When set to a non-zero value, a kernel panic will be triggered if the +number of hung tasks found during a single scan reaches this value. > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. 
> > = ================================================= > diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig > index 61cee1e..c3b0d5f 100644 > --- a/arch/arm/configs/aspeed_g5_defconfig > +++ b/arch/arm/configs/aspeed_g5_defconfig > @@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y > CONFIG_PANIC_TIMEOUT=-1 > CONFIG_SOFTLOCKUP_DETECTOR=y > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y > -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 > CONFIG_WQ_WATCHDOG=y > # CONFIG_SCHED_DEBUG is not set > CONFIG_FUNCTION_TRACER=y > diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config > index e81327d..9f6ab7d 100644 > --- a/kernel/configs/debug.config > +++ b/kernel/configs/debug.config > @@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y > # > # Debug Oops, Lockups and Hangs > # > -# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0 > # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set > CONFIG_DEBUG_ATOMIC_SLEEP=y > CONFIG_DETECT_HUNG_TASK=y > diff --git a/kernel/hung_task.c b/kernel/hung_task.c > index b2c1f14..3929ed9 100644 > --- a/kernel/hung_task.c > +++ b/kernel/hung_task.c > @@ -81,7 +81,7 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace; > * hung task is detected: > */ > static unsigned int __read_mostly sysctl_hung_task_panic = > - IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC); > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC; > > static int > hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) > @@ -218,8 +218,11 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti > } > #endif > > -static void check_hung_task(struct task_struct *t, unsigned long timeout) > +static void check_hung_task(struct task_struct *t, unsigned long timeout, > + unsigned long prev_detect_count) > { > + unsigned long total_hung_task; > + > if (!task_is_hung(t, timeout)) > return; > > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct 
*t, unsigned long timeout) > */ > sysctl_hung_task_detect_count++; > > + total_hung_task = sysctl_hung_task_detect_count - prev_detect_count; > trace_sched_process_hang(t); > > - if (sysctl_hung_task_panic) { > + if (sysctl_hung_task_panic && > + (total_hung_task >= sysctl_hung_task_panic)) { > console_verbose(); > hung_task_show_lock = true; > hung_task_call_panic = true; > @@ -300,6 +305,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) > int max_count = sysctl_hung_task_check_count; > unsigned long last_break = jiffies; > struct task_struct *g, *t; > + unsigned long prev_detect_count = sysctl_hung_task_detect_count; > > /* > * If the system crashed already then all bets are off, > @@ -320,7 +326,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) > last_break = jiffies; > } > > - check_hung_task(t, timeout); > + check_hung_task(t, timeout, prev_detect_count); > } > unlock: > rcu_read_unlock(); > @@ -389,7 +395,7 @@ static const struct ctl_table hung_task_sysctls[] = { > .mode = 0644, > .proc_handler = proc_dointvec_minmax, > .extra1 = SYSCTL_ZERO, > - .extra2 = SYSCTL_ONE, > + .extra2 = SYSCTL_INT_MAX, > }, > { > .procname = "hung_task_check_count", > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug > index 3034e294..077b9e4 100644 > --- a/lib/Kconfig.debug > +++ b/lib/Kconfig.debug > @@ -1258,12 +1258,14 @@ config DEFAULT_HUNG_TASK_TIMEOUT > Keeping the default should be fine in most cases. > > config BOOTPARAM_HUNG_TASK_PANIC > - bool "Panic (Reboot) On Hung Tasks" > + int "Number of hung tasks to trigger kernel panic" > depends on DETECT_HUNG_TASK > + default 0 > help > - Say Y here to enable the kernel to panic on "hung tasks", > - which are bugs that cause the kernel to leave a task stuck > - in uninterruptible "D" state. > + The number of hung tasks must be detected to trigger kernel panic. 
> + > + - 0: Don't trigger panic > + - N: Panic when N hung tasks are detected + - N: Panic when N hung tasks are detected in a single scan With these documentation changes, this patch would accurately describe its behavior, IMHO. > > The panic can be used in combination with panic_timeout, > to cause the system to reboot automatically after a > diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config > index 936b18b..0504c11 100644 > --- a/tools/testing/selftests/wireguard/qemu/kernel.config > +++ b/tools/testing/selftests/wireguard/qemu/kernel.config > @@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y > CONFIG_DETECT_HUNG_TASK=y > CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y > -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 > CONFIG_PANIC_TIMEOUT=-1 > CONFIG_STACKTRACE=y > CONFIG_EARLY_PRINTK=y From lirongqing at baidu.com Tue Oct 14 05:38:35 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Tue, 14 Oct 2025 05:38:35 -0000 Subject: =?utf-8?B?UkU6IFvlpJbpg6jpgq7ku7ZdIFJlOiBbUEFUQ0hdW3YzXSBodW5nX3Rhc2s6?= =?utf-8?Q?_Panic_after_fixed_number_of_hung_tasks?= In-Reply-To: <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Message-ID: <884f179defbe482b9873912ce88350b5@baidu.com> > > I noticed the implementation panics only when N tasks are detected within a > single scan, because total_hung_task is reset for each > check_hung_uninterruptible_tasks() run. > > So some suggestions to align the documentation with the code's behavior > below :) > True, I will change the comments as your suggestion, thanks -Li > On 2025/10/12 19:50, lirongqing wrote: > > From: Li RongQing > > > > Currently, when 'hung_task_panic' is enabled, the kernel panics > > immediately upon detecting the first hung task. 
However, some hung > > tasks are transient and the system can recover, while others are > > persistent and may accumulate progressively. > > > > This patch extends the 'hung_task_panic' sysctl to allow specifying > > the number of hung tasks that must be detected before triggering a > > kernel panic. This provides finer control for environments where > > transient hangs may occur but persistent hangs should still be fatal. > > > > The sysctl can be set to: > > - 0: disabled (never panic) > > - 1: original behavior (panic on first hung task) > > - N: panic when N hung tasks are detected > > > > This maintains backward compatibility while providing more flexibility > > for handling different hang scenarios. > > > > Signed-off-by: Li RongQing > > --- > > Diff with v2: not add new sysctl, extend hung_task_panic > > > > Documentation/admin-guide/kernel-parameters.txt | 20 > +++++++++++++------- > > Documentation/admin-guide/sysctl/kernel.rst | 3 ++- > > arch/arm/configs/aspeed_g5_defconfig | 2 +- > > kernel/configs/debug.config | 2 +- > > kernel/hung_task.c | 16 > +++++++++++----- > > lib/Kconfig.debug | 10 > ++++++---- > > tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- > > 7 files changed, 35 insertions(+), 20 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > b/Documentation/admin-guide/kernel-parameters.txt > > index a51ab46..7d9a8ee 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1992,14 +1992,20 @@ > > the added memory block itself do not be affected. > > > > hung_task_panic= > > - [KNL] Should the hung task detector generate panics. > > - Format: 0 | 1 > > + [KNL] Number of hung tasks to trigger kernel panic. > > + Format: > > + > > + Set this to the number of hung tasks that must be > > + detected before triggering a kernel panic. 
> > + > > + 0: don't panic > > + 1: panic immediately on first hung task > > + N: panic after N hung tasks are detect > > The description should be more specific :) > > N: panic after N hung tasks are detected in a single scan > > Would it be better and cleaner? > > > > > - A value of 1 instructs the kernel to panic when a > > - hung task is detected. The default value is controlled > > - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time > > - option. The value selected by this boot parameter can > > - be changed later by the kernel.hung_task_panic sysctl. > > + The default value is controlled by the > > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. > The value > > + selected by this boot parameter can be changed later by the > > + kernel.hung_task_panic sysctl. > > > > hvc_iucv= [S390] Number of z/VM IUCV hypervisor console > (HVC) > > terminal devices. Valid values: 0..8 diff --git > > a/Documentation/admin-guide/sysctl/kernel.rst > > b/Documentation/admin-guide/sysctl/kernel.rst > > index f3ee807..0a8dfab 100644 > > --- a/Documentation/admin-guide/sysctl/kernel.rst > > +++ b/Documentation/admin-guide/sysctl/kernel.rst > > @@ -397,7 +397,8 @@ a hung task is detected. > > hung_task_panic > > =============== > > > > -Controls the kernel's behavior when a hung task is detected. > > +When set to a non-zero value, a kernel panic will be triggered if the > > +number of detected hung tasks reaches this value > > Hmm... that is also ambiguous ... > > +When set to a non-zero value, a kernel panic will be triggered if the > +number of hung tasks found during a single scan reaches this value. > > > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. 
> > > > = ================================================= > > diff --git a/arch/arm/configs/aspeed_g5_defconfig > > b/arch/arm/configs/aspeed_g5_defconfig > > index 61cee1e..c3b0d5f 100644 > > --- a/arch/arm/configs/aspeed_g5_defconfig > > +++ b/arch/arm/configs/aspeed_g5_defconfig > > @@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y > > CONFIG_PANIC_TIMEOUT=-1 > > CONFIG_SOFTLOCKUP_DETECTOR=y > > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y > > -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y > > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 > > CONFIG_WQ_WATCHDOG=y > > # CONFIG_SCHED_DEBUG is not set > > CONFIG_FUNCTION_TRACER=y > > diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config > > index e81327d..9f6ab7d 100644 > > --- a/kernel/configs/debug.config > > +++ b/kernel/configs/debug.config > > @@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y > > # > > # Debug Oops, Lockups and Hangs > > # > > -# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set > > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0 > > # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set > > CONFIG_DEBUG_ATOMIC_SLEEP=y > > CONFIG_DETECT_HUNG_TASK=y > > diff --git a/kernel/hung_task.c b/kernel/hung_task.c index > > b2c1f14..3929ed9 100644 > > --- a/kernel/hung_task.c > > +++ b/kernel/hung_task.c > > @@ -81,7 +81,7 @@ static unsigned int __read_mostly > sysctl_hung_task_all_cpu_backtrace; > > * hung task is detected: > > */ > > static unsigned int __read_mostly sysctl_hung_task_panic = > > - IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC); > > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC; > > > > static int > > hung_task_panic(struct notifier_block *this, unsigned long event, > > void *ptr) @@ -218,8 +218,11 @@ static inline void > debug_show_blocker(struct task_struct *task, unsigned long ti > > } > > #endif > > > > -static void check_hung_task(struct task_struct *t, unsigned long > > timeout) > > +static void check_hung_task(struct task_struct *t, unsigned long timeout, > > + unsigned long prev_detect_count) > > { > > + unsigned long total_hung_task; > > + > 
> if (!task_is_hung(t, timeout)) > > return; > > > > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, > unsigned long timeout) > > */ > > sysctl_hung_task_detect_count++; > > > > + total_hung_task = sysctl_hung_task_detect_count - prev_detect_count; > > trace_sched_process_hang(t); > > > > - if (sysctl_hung_task_panic) { > > + if (sysctl_hung_task_panic && > > + (total_hung_task >= sysctl_hung_task_panic)) { > > console_verbose(); > > hung_task_show_lock = true; > > hung_task_call_panic = true; > > @@ -300,6 +305,7 @@ static void > check_hung_uninterruptible_tasks(unsigned long timeout) > > int max_count = sysctl_hung_task_check_count; > > unsigned long last_break = jiffies; > > struct task_struct *g, *t; > > + unsigned long prev_detect_count = sysctl_hung_task_detect_count; > > > > /* > > * If the system crashed already then all bets are off, @@ -320,7 > > +326,7 @@ static void check_hung_uninterruptible_tasks(unsigned long > timeout) > > last_break = jiffies; > > } > > > > - check_hung_task(t, timeout); > > + check_hung_task(t, timeout, prev_detect_count); > > } > > unlock: > > rcu_read_unlock(); > > @@ -389,7 +395,7 @@ static const struct ctl_table hung_task_sysctls[] = { > > .mode = 0644, > > .proc_handler = proc_dointvec_minmax, > > .extra1 = SYSCTL_ZERO, > > - .extra2 = SYSCTL_ONE, > > + .extra2 = SYSCTL_INT_MAX, > > }, > > { > > .procname = "hung_task_check_count", > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index > > 3034e294..077b9e4 100644 > > --- a/lib/Kconfig.debug > > +++ b/lib/Kconfig.debug > > @@ -1258,12 +1258,14 @@ config DEFAULT_HUNG_TASK_TIMEOUT > > Keeping the default should be fine in most cases. 
> > > > config BOOTPARAM_HUNG_TASK_PANIC > > - bool "Panic (Reboot) On Hung Tasks" > > + int "Number of hung tasks to trigger kernel panic" > > depends on DETECT_HUNG_TASK > > + default 0 > > help > > - Say Y here to enable the kernel to panic on "hung tasks", > > - which are bugs that cause the kernel to leave a task stuck > > - in uninterruptible "D" state. > > + The number of hung tasks must be detected to trigger kernel panic. > > + > > + - 0: Don't trigger panic > > + - N: Panic when N hung tasks are detected > > + - N: Panic when N hung tasks are detected in a single scan > > With these documentation changes, this patch would accurately describe its > behavior, IMHO. > > > > > The panic can be used in combination with panic_timeout, > > to cause the system to reboot automatically after a diff --git > > a/tools/testing/selftests/wireguard/qemu/kernel.config > > b/tools/testing/selftests/wireguard/qemu/kernel.config > > index 936b18b..0504c11 100644 > > --- a/tools/testing/selftests/wireguard/qemu/kernel.config > > +++ b/tools/testing/selftests/wireguard/qemu/kernel.config > > @@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y > > CONFIG_DETECT_HUNG_TASK=y > > CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y > > CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y > > -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y > > +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 > > CONFIG_PANIC_TIMEOUT=-1 > > CONFIG_STACKTRACE=y > > CONFIG_EARLY_PRINTK=y > From pmladek at suse.com Tue Oct 14 09:45:08 2025 From: pmladek at suse.com (Petr Mladek) Date: Tue, 14 Oct 2025 09:45:08 -0000 Subject: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Message-ID: On Tue 2025-10-14 13:23:58, Lance Yang wrote: > Thanks for the patch! 
> > I noticed the implementation panics only when N tasks are detected > within a single scan, because total_hung_task is reset for each > check_hung_uninterruptible_tasks() run. Great catch! Does it make sense? Is is the intended behavior, please? > So some suggestions to align the documentation with the code's > behavior below :) > On 2025/10/12 19:50, lirongqing wrote: > > From: Li RongQing > > > > Currently, when 'hung_task_panic' is enabled, the kernel panics > > immediately upon detecting the first hung task. However, some hung > > tasks are transient and the system can recover, while others are > > persistent and may accumulate progressively. My understanding is that this patch wanted to do: + report even temporary stalls + panic only when the stall was much longer and likely persistent Which might make some sense. But the code does something else. > > --- a/kernel/hung_task.c > > +++ b/kernel/hung_task.c > > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) > > */ > > sysctl_hung_task_detect_count++; > > + total_hung_task = sysctl_hung_task_detect_count - prev_detect_count; > > trace_sched_process_hang(t); > > - if (sysctl_hung_task_panic) { > > + if (sysctl_hung_task_panic && > > + (total_hung_task >= sysctl_hung_task_panic)) { > > console_verbose(); > > hung_task_show_lock = true; > > hung_task_call_panic = true; I would expect that this patch added another counter, similar to sysctl_hung_task_detect_count. It would be incremented only once per check when a hung task was detected. And it would be cleared (reset) when no hung task was found. Best Regards, Petr From lirongqing at baidu.com Tue Oct 14 10:54:49 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Tue, 14 Oct 2025 10:54:49 -0000 Subject: [External Email]
Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Message-ID: > On Tue 2025-10-14 13:23:58, Lance Yang wrote: > > Thanks for the patch! > > > > I noticed the implementation panics only when N tasks are detected > > within a single scan, because total_hung_task is reset for each > > check_hung_uninterruptible_tasks() run. > > Great catch! > > Does it make sense? > Is is the intended behavior, please? > Yes, this is intended behavior > > So some suggestions to align the documentation with the code's > > behavior below :) > > > On 2025/10/12 19:50, lirongqing wrote: > > > From: Li RongQing > > > > > > Currently, when 'hung_task_panic' is enabled, the kernel panics > > > immediately upon detecting the first hung task. However, some hung > > > tasks are transient and the system can recover, while others are > > > persistent and may accumulate progressively. > > My understanding is that this patch wanted to do: > > + report even temporary stalls > + panic only when the stall was much longer and likely persistent > > Which might make some sense. But the code does something else. > A single task hanging for an extended period may not be a critical issue, as users might still log into the system to investigate. However, if multiple tasks hang simultaneously-such as in cases of I/O hangs caused by disk failures-it could prevent users from logging in and become a serious problem, and a panic is expected. 
> > > --- a/kernel/hung_task.c > > > +++ b/kernel/hung_task.c > > > @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct > *t, unsigned long timeout) > > > */ > > > sysctl_hung_task_detect_count++; > > > + total_hung_task = sysctl_hung_task_detect_count - > > > +prev_detect_count; > > > trace_sched_process_hang(t); > > > - if (sysctl_hung_task_panic) { > > > + if (sysctl_hung_task_panic && > > > + (total_hung_task >= sysctl_hung_task_panic)) { > > > console_verbose(); > > > hung_task_show_lock = true; > > > hung_task_call_panic = true; > > I would expect that this patch added another counter, similar to > sysctl_hung_task_detect_count. It would be incremented only once per check > when a hung task was detected. And it would be cleared (reset) when no > hung task was found. > > Best Regards, > Petr From lance.yang at linux.dev Tue Oct 14 10:59:23 2025 From: lance.yang at linux.dev (Lance Yang) Date: Tue, 14 Oct 2025 10:59:23 -0000 Subject: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Message-ID: <3acdcd15-7e52-4a9a-9492-a434ed609dcc@linux.dev> On 2025/10/14 17:45, Petr Mladek wrote: > On Tue 2025-10-14 13:23:58, Lance Yang wrote: >> Thanks for the patch! >> >> I noticed the implementation panics only when N tasks are detected >> within a single scan, because total_hung_task is reset for each >> check_hung_uninterruptible_tasks() run. > > Great catch! > > Does it make sense? > Is is the intended behavior, please? > >> So some suggestions to align the documentation with the code's >> behavior below :) > >> On 2025/10/12 19:50, lirongqing wrote: >>> From: Li RongQing >>> >>> Currently, when 'hung_task_panic' is enabled, the kernel panics >>> immediately upon detecting the first hung task. 
However, some hung >>> tasks are transient and the system can recover, while others are >>> persistent and may accumulate progressively. > > My understanding is that this patch wanted to do: > > + report even temporary stalls > + panic only when the stall was much longer and likely persistent > > Which might make some sense. But the code does something else. Cool. Sounds good to me! > >>> --- a/kernel/hung_task.c >>> +++ b/kernel/hung_task.c >>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) >>> */ >>> sysctl_hung_task_detect_count++; >>> + total_hung_task = sysctl_hung_task_detect_count - prev_detect_count; >>> trace_sched_process_hang(t); >>> - if (sysctl_hung_task_panic) { >>> + if (sysctl_hung_task_panic && >>> + (total_hung_task >= sysctl_hung_task_panic)) { >>> console_verbose(); >>> hung_task_show_lock = true; >>> hung_task_call_panic = true; > > I would expect that this patch added another counter, similar to > sysctl_hung_task_detect_count. It would be incremented only > once per check when a hung task was detected. And it would > be cleared (reset) when no hung task was found. Much cleaner. We could add an internal counter for that, yeah. No need to expose it to userspace ;) Petr's suggestion seems to align better with the goal of panicking on persistent hangs, IMHO. Panic after N consecutive checks with hung tasks. @RongQing does that work for you? 
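The alternative Petr and Lance describe (an internal counter bumped at most once per scan that finds a hung task, cleared by a clean scan, with a panic after N consecutive such scans) could be modeled roughly as follows. `hung_scan_streak` and `end_of_scan()` are hypothetical names rather than kernel symbols, and calling such a hook once at the end of each scan is an assumption about where it would live.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical internal counter; not exported as a sysctl. */
static unsigned int hung_scan_streak;

/*
 * Call once at the end of each scan with the number of hung tasks that
 * scan found. Returns true once 'panic_threshold' consecutive scans have
 * each found at least one hung task (a threshold of 0 disables the panic).
 */
static bool end_of_scan(unsigned int found, unsigned int panic_threshold)
{
	/* The patch bounds the sysctl to 0..INT_MAX (SYSCTL_INT_MAX). */
	assert(panic_threshold <= INT_MAX);

	if (found == 0) {
		hung_scan_streak = 0; /* a clean scan resets the streak */
		return false;
	}
	if (hung_scan_streak < UINT_MAX)
		hung_scan_streak++; /* once per scan, however many tasks hung */

	return panic_threshold && hung_scan_streak >= panic_threshold;
}
```

Under this policy a single scan that finds many hung tasks does not panic by itself, which is the case Li argues should be fatal; the per-scan threshold catches simultaneous hangs, while the streak counter catches persistent ones.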
From lirongqing at baidu.com Tue Oct 14 11:23:02 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Tue, 14 Oct 2025 11:23:02 -0000 Subject: RE: [External Email] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <3acdcd15-7e52-4a9a-9492-a434ed609dcc@linux.dev> References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> <3acdcd15-7e52-4a9a-9492-a434ed609dcc@linux.dev> Message-ID: <38af4922ca44433fa7cd168f7c520dc9@baidu.com> > >>> Currently, when 'hung_task_panic' is enabled, the kernel panics > >>> immediately upon detecting the first hung task. However, some hung > >>> tasks are transient and the system can recover, while others are > >>> persistent and may accumulate progressively. > > > > My understanding is that this patch wanted to do: > > > > + report even temporary stalls > > + panic only when the stall was much longer and likely persistent > > > > Which might make some sense. But the code does something else. > > Cool. Sounds good to me! > > > > >>> --- a/kernel/hung_task.c > >>> +++ b/kernel/hung_task.c > >>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct > *t, unsigned long timeout) > >>> */ > >>> sysctl_hung_task_detect_count++; > >>> + total_hung_task = sysctl_hung_task_detect_count - > >>> +prev_detect_count; > >>> trace_sched_process_hang(t); > >>> - if (sysctl_hung_task_panic) { > >>> + if (sysctl_hung_task_panic && > >>> + (total_hung_task >= sysctl_hung_task_panic)) { > >>> console_verbose(); > >>> hung_task_show_lock = true; > >>> hung_task_call_panic = true; > > > > I would expect that this patch added another counter, similar to > > sysctl_hung_task_detect_count. It would be incremented only once per > > check when a hung task was detected. And it would be cleared (reset) > > when no hung task was found. > > Much cleaner. We could add an internal counter for that, yeah.
No need to > expose it to userspace ;) > > Petr's suggestion seems to align better with the goal of panicking on > persistent hangs, IMHO. Panic after N consecutive checks with hung tasks. > > @RongQing does that work for you? In my opinion, a single task hang is not a critical issue, fatal hangs, such as those caused by I/O hangs, network card failures, or hangs while holding locks, will inevitably lead to multiple tasks being hung. In such scenarios, users cannot even log in to the machine, making it extremely difficult to investigate the root cause. Therefore, I believe the current approach is sound. What's your opinion? -Li From lance.yang at linux.dev Tue Oct 14 11:41:11 2025 From: lance.yang at linux.dev (Lance Yang) Date: Tue, 14 Oct 2025 11:41:11 -0000 Subject: Re: [External Email] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: <38af4922ca44433fa7cd168f7c520dc9@baidu.com> References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> <3acdcd15-7e52-4a9a-9492-a434ed609dcc@linux.dev> <38af4922ca44433fa7cd168f7c520dc9@baidu.com> Message-ID: <096168a6-8687-4dae-a774-0741d3e5a891@linux.dev> On 2025/10/14 19:18, Li,Rongqing wrote: >>>>> Currently, when 'hung_task_panic' is enabled, the kernel panics >>>>> immediately upon detecting the first hung task. However, some hung >>>>> tasks are transient and the system can recover, while others are >>>>> persistent and may accumulate progressively. >>> >>> My understanding is that this patch wanted to do: >>> >>> + report even temporary stalls >>> + panic only when the stall was much longer and likely persistent >>> >>> Which might make some sense. But the code does something else. >> >> Cool. Sounds good to me!
>> >>> >>>>> --- a/kernel/hung_task.c >>>>> +++ b/kernel/hung_task.c >>>>> @@ -229,9 +232,11 @@ static void check_hung_task(struct task_struct >> *t, unsigned long timeout) >>>>> */ >>>>> sysctl_hung_task_detect_count++; >>>>> + total_hung_task = sysctl_hung_task_detect_count - >>>>> +prev_detect_count; >>>>> trace_sched_process_hang(t); >>>>> - if (sysctl_hung_task_panic) { >>>>> + if (sysctl_hung_task_panic && >>>>> + (total_hung_task >= sysctl_hung_task_panic)) { >>>>> console_verbose(); >>>>> hung_task_show_lock = true; >>>>> hung_task_call_panic = true; >>> >>> I would expect that this patch added another counter, similar to >>> sysctl_hung_task_detect_count. It would be incremented only once per >>> check when a hung task was detected. And it would be cleared (reset) >>> when no hung task was found. >> >> Much cleaner. We could add an internal counter for that, yeah. No need to >> expose it to userspace ;) >> >> Petr's suggestion seems to align better with the goal of panicking on >> persistent hangs, IMHO. Panic after N consecutive checks with hung tasks. >> >> @RongQing does that work for you? > > > In my opinion, a single task hang is not a critical issue; fatal hangs, such as those caused by I/O hangs, network card failures, or hangs while holding locks, will inevitably lead to multiple tasks being hung. In such scenarios, users cannot even log in to the machine, making it extremely difficult to investigate the root cause. Therefore, I believe the current approach is sound. What's your opinion? Thanks! I'm fine with either approach.
Let's hear what the other folks think ;) From lance.yang at linux.dev Thu Oct 16 05:07:45 2025 From: lance.yang at linux.dev (Lance Yang) Date: Thu, 16 Oct 2025 05:07:45 -0000 Subject: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: <20251015063615.2632-1-lirongqing@baidu.com> References: <20251015063615.2632-1-lirongqing@baidu.com> Message-ID: <4db3bd26-1f74-4096-84fd-f652ec9a4d27@linux.dev> LGTM. It works as expected, thanks! On 2025/10/15 14:36, lirongqing wrote: > From: Li RongQing For the commit message, I'd suggest the following for better clarity: ``` The hung_task_panic sysctl is currently a blunt instrument: it's all or nothing. Panicking on a single hung task can be an overreaction to a transient glitch. A more reliable indicator of a systemic problem is when multiple tasks hang simultaneously. Extend hung_task_panic to accept an integer threshold, allowing the kernel to panic only when N hung tasks are detected in a single scan. This provides finer control to distinguish between isolated incidents and system-wide failures. The accepted values are: - 0: Don't panic (unchanged) - 1: Panic on the first hung task (unchanged) - N > 1: Panic after N hung tasks are detected in a single scan The original behavior is preserved for values 0 and 1, maintaining full backward compatibility. ``` If you agree, likely no need to resend - Andrew could pick it up directly when applying :) > > Currently, when 'hung_task_panic' is enabled, the kernel panics > immediately upon detecting the first hung task. However, some hung > tasks are transient and allow system recovery, while persistent hangs > should trigger a panic when accumulating beyond a threshold. > > Extend the 'hung_task_panic' sysctl to accept a threshold value > specifying the number of hung tasks that must be detected before > triggering a kernel panic. 
This provides finer control for environments > where transient hangs may occur but persistent hangs should be fatal. > > The sysctl now accepts: > - 0: don't panic (maintains original behavior) > - 1: panic on first hung task (maintains original behavior) > - N > 1: panic after N hung tasks are detected in a single scan > > This maintains backward compatibility while providing flexibility for > different hang scenarios. > > Signed-off-by: Li RongQing > Cc: Andrew Jeffery > Cc: Anshuman Khandual > Cc: Arnd Bergmann > Cc: David Hildenbrand > Cc: Florian Wesphal > Cc: Jakub Kacinski > Cc: Jason A. Donenfeld > Cc: Joel Granados > Cc: Joel Stanley > Cc: Jonathan Corbet > Cc: Kees Cook > Cc: Lance Yang > Cc: Liam Howlett > Cc: Lorenzo Stoakes > Cc: "Masami Hiramatsu (Google)" > Cc: "Paul E . McKenney" > Cc: Pawan Gupta > Cc: Petr Mladek > Cc: Phil Auld > Cc: Randy Dunlap > Cc: Russell King > Cc: Shuah Khan > Cc: Simon Horman > Cc: Stanislav Fomichev > Cc: Steven Rostedt > --- So: Reviewed-by: Lance Yang Tested-by: Lance Yang Cheers, Lance From lirongqing at baidu.com Thu Oct 16 05:59:45 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Thu, 16 Oct 2025 05:59:45 -0000 Subject: RE: [External Email] Re: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: <4db3bd26-1f74-4096-84fd-f652ec9a4d27@linux.dev> References: <20251015063615.2632-1-lirongqing@baidu.com> <4db3bd26-1f74-4096-84fd-f652ec9a4d27@linux.dev> Message-ID: > LGTM. It works as expected, thanks! > > On 2025/10/15 14:36, lirongqing wrote: > > From: Li RongQing > > For the commit message, I'd suggest the following for better clarity: > > ``` > The hung_task_panic sysctl is currently a blunt instrument: it's all or nothing. > > Panicking on a single hung task can be an overreaction to a transient glitch. A > more reliable indicator of a systemic problem is when multiple tasks hang > simultaneously.
> > Extend hung_task_panic to accept an integer threshold, allowing the kernel > to panic only when N hung tasks are detected in a single scan. This provides > finer control to distinguish between isolated incidents and system-wide > failures. > > The accepted values are: > - 0: Don't panic (unchanged) > - 1: Panic on the first hung task (unchanged) > - N > 1: Panic after N hung tasks are detected in a single scan > > The original behavior is preserved for values 0 and 1, maintaining full > backward compatibility. > ``` > > If you agree, likely no need to resend - Andrew could pick it up directly when > applying :) > This is better. Andrew, could you pick it up directly? Thanks -Li > > > > Currently, when 'hung_task_panic' is enabled, the kernel panics > > immediately upon detecting the first hung task. However, some hung > > tasks are transient and allow system recovery, while persistent hangs > > should trigger a panic when accumulating beyond a threshold. > > > > Extend the 'hung_task_panic' sysctl to accept a threshold value > > specifying the number of hung tasks that must be detected before > > triggering a kernel panic. This provides finer control for > > environments where transient hangs may occur but persistent hangs > should be fatal. > > > > The sysctl now accepts: > > - 0: don't panic (maintains original behavior) > > - 1: panic on first hung task (maintains original behavior) > > - N > 1: panic after N hung tasks are detected in a single scan > > > > This maintains backward compatibility while providing flexibility for > > different hang scenarios. > > > > Signed-off-by: Li RongQing > > Cc: Andrew Jeffery > > Cc: Anshuman Khandual > > Cc: Arnd Bergmann > > Cc: David Hildenbrand > > Cc: Florian Wesphal > > Cc: Jakub Kacinski > > Cc: Jason A. Donenfeld > > Cc: Joel Granados > > Cc: Joel Stanley > > Cc: Jonathan Corbet > > Cc: Kees Cook > > Cc: Lance Yang > > Cc: Liam Howlett > > Cc: Lorenzo Stoakes > > Cc: "Masami Hiramatsu (Google)" > > Cc: "Paul E . 
McKenney" > > Cc: Pawan Gupta > > Cc: Petr Mladek > > Cc: Phil Auld > > Cc: Randy Dunlap > > Cc: Russell King > > Cc: Shuah Khan > > Cc: Simon Horman > > Cc: Stanislav Fomichev > > Cc: Steven Rostedt > > --- > > So: > > Reviewed-by: Lance Yang > Tested-by: Lance Yang > > Cheers, > Lance From lirongqing at baidu.com Fri Oct 17 02:11:58 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Fri, 17 Oct 2025 02:11:58 -0000 Subject: RE: [External Email] Re: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: <906dd11d-26db-4570-840a-e4797748c05c@molgen.mpg.de> References: <20251015063615.2632-1-lirongqing@baidu.com> <906dd11d-26db-4570-840a-e4797748c05c@molgen.mpg.de> Message-ID: > > On 15.10.25 at 08:36, lirongqing wrote: > > From: Li RongQing > > > > Currently, when 'hung_task_panic' is enabled, the kernel panics > > immediately upon detecting the first hung task. However, some hung > > tasks are transient and allow system recovery, while persistent hangs > > should trigger a panic when accumulating beyond a threshold. > > > > Extend the 'hung_task_panic' sysctl to accept a threshold value > > specifying the number of hung tasks that must be detected before > > triggering a kernel panic. This provides finer control for > > environments where transient hangs may occur but persistent hangs > should be fatal. > > > > The sysctl now accepts: > > - 0: don't panic (maintains original behavior) > > - 1: panic on first hung task (maintains original behavior) > > - N > 1: panic after N hung tasks are detected in a single scan > > > > This maintains backward compatibility while providing flexibility for > > different hang scenarios. > > > > Signed-off-by: Li RongQing > > Cc: Andrew Jeffery > > Cc: Anshuman Khandual > > Cc: Arnd Bergmann > > Cc: David Hildenbrand > > Cc: Florian Wesphal > > Cc: Jakub Kacinski > > Cc: Jason A. 
Donenfeld > > Cc: Joel Granados > > Cc: Joel Stanley > > Cc: Jonathan Corbet > > Cc: Kees Cook > > Cc: Lance Yang > > Cc: Liam Howlett > > Cc: Lorenzo Stoakes > > Cc: "Masami Hiramatsu (Google)" > > Cc: "Paul E . McKenney" > > Cc: Pawan Gupta > > Cc: Petr Mladek > > Cc: Phil Auld > > Cc: Randy Dunlap > > Cc: Russell King > > Cc: Shuah Khan > > Cc: Simon Horman > > Cc: Stanislav Fomichev > > Cc: Steven Rostedt > > --- > > diff with v3: comments modification, suggested by Lance, Masami, Randy > > and Petr diff with v2: do not add a new sysctl, extend > > hung_task_panic, suggested by Kees Cook > > > > Documentation/admin-guide/kernel-parameters.txt | 20 > +++++++++++++------- > > Documentation/admin-guide/sysctl/kernel.rst | 9 +++++---- > > arch/arm/configs/aspeed_g5_defconfig | 2 +- > > kernel/configs/debug.config | 2 +- > > kernel/hung_task.c | 15 > ++++++++++----- > > lib/Kconfig.debug | 9 > +++++---- > > tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- > > 7 files changed, 36 insertions(+), 23 deletions(-) > > > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > > b/Documentation/admin-guide/kernel-parameters.txt > > index a51ab46..492f0bc 100644 > > --- a/Documentation/admin-guide/kernel-parameters.txt > > +++ b/Documentation/admin-guide/kernel-parameters.txt > > @@ -1992,14 +1992,20 @@ > > the added memory block itself do not be affected. > > > > hung_task_panic= > > - [KNL] Should the hung task detector generate panics. > > - Format: 0 | 1 > > + [KNL] Number of hung tasks to trigger kernel panic. > > + Format: > > + > > + When set to a non-zero value, a kernel panic will be triggered > if > > + the number of detected hung tasks reaches this value. > > + > > + 0: don't panic > > + 1: panic immediately on first hung task > > + N: panic after N hung tasks are detected in a single scan > > > > - A value of 1 instructs the kernel to panic when a > > - hung task is detected. 
The default value is controlled > > - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time > > - option. The value selected by this boot parameter can > > - be changed later by the kernel.hung_task_panic sysctl. > > + The default value is controlled by the > > + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. > The value > > + selected by this boot parameter can be changed later by the > > + kernel.hung_task_panic sysctl. > > > > hvc_iucv= [S390] Number of z/VM IUCV hypervisor console > (HVC) > > terminal devices. Valid values: 0..8 diff --git > > a/Documentation/admin-guide/sysctl/kernel.rst > > b/Documentation/admin-guide/sysctl/kernel.rst > > index f3ee807..0065a55 100644 > > --- a/Documentation/admin-guide/sysctl/kernel.rst > > +++ b/Documentation/admin-guide/sysctl/kernel.rst > > @@ -397,13 +397,14 @@ a hung task is detected. > > hung_task_panic > > =============== > > > > -Controls the kernel's behavior when a hung task is detected. > > +When set to a non-zero value, a kernel panic will be triggered if the > > +number of hung tasks found during a single scan reaches this value. > > This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. > > > > -= ================================================= > > += ======================================================= > > 0 Continue operation. This is the default behavior. > > -1 Panic immediately. > > -= ================================================= > > +N Panic when N hung tasks are found during a single scan. > > += ======================================================= > > > > > > hung_task_check_count > > [...] > > > diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index > > 3034e294..3976c90 100644 > > --- a/lib/Kconfig.debug > > +++ b/lib/Kconfig.debug > > @@ -1258,12 +1258,13 @@ config DEFAULT_HUNG_TASK_TIMEOUT > > Keeping the default should be fine in most cases.
> > > > config BOOTPARAM_HUNG_TASK_PANIC > > - bool "Panic (Reboot) On Hung Tasks" > > + int "Number of hung tasks to trigger kernel panic" > > depends on DETECT_HUNG_TASK > > + default 0 > > help > > - Say Y here to enable the kernel to panic on "hung tasks", > > - which are bugs that cause the kernel to leave a task stuck > > - in uninterruptible "D" state. > > + When set to a non-zero value, a kernel panic will be triggered > > + if the number of hung tasks found during a single scan reaches > > + this value. > > > > The panic can be used in combination with panic_timeout, > > to cause the system to reboot automatically after a > Why not leave the sentence about the uninterruptible "D" state in there? > This sentence implies that hung tasks are always caused by kernel bugs, but they may also come from a hardware failure (or a virtio backend bug), so I did not keep it. > Also, it sounds like, some are actually using this in production. Maybe it > should be moved out of `Kconfig.debug` too? > I think hung task panic is a useful feature; it should be moved out of Kconfig.debug. Thanks -Li > > Kind regards, > > Paul From andrew at codeconstruct.com.au Fri Oct 17 05:17:57 2025 From: andrew at codeconstruct.com.au (Andrew Jeffery) Date: Fri, 17 Oct 2025 05:17:57 -0000 Subject: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time In-Reply-To: <20251015063615.2632-1-lirongqing@baidu.com> References: <20251015063615.2632-1-lirongqing@baidu.com> Message-ID: <57dffe112a461a218c7dab6bfc3b02967440cc77.camel@codeconstruct.com.au> On Wed, 2025-10-15 at 14:36 +0800, lirongqing wrote: > From: Li RongQing > > Currently, when 'hung_task_panic' is enabled, the kernel panics > immediately upon detecting the first hung task. However, some hung > tasks are transient and allow system recovery, while persistent hangs > should trigger a panic when accumulating beyond a threshold.
> > Extend the 'hung_task_panic' sysctl to accept a threshold value > specifying the number of hung tasks that must be detected before > triggering a kernel panic. This provides finer control for environments > where transient hangs may occur but persistent hangs should be fatal. > > The sysctl now accepts: > - 0: don't panic (maintains original behavior) > - 1: panic on first hung task (maintains original behavior) > - N > 1: panic after N hung tasks are detected in a single scan > > This maintains backward compatibility while providing flexibility for > different hang scenarios. > > Signed-off-by: Li RongQing > Cc: Andrew Jeffery > Cc: Anshuman Khandual > Cc: Arnd Bergmann > Cc: David Hildenbrand > Cc: Florian Wesphal > Cc: Jakub Kacinski > Cc: Jason A. Donenfeld > Cc: Joel Granados > Cc: Joel Stanley > Cc: Jonathan Corbet > Cc: Kees Cook > Cc: Lance Yang > Cc: Liam Howlett > Cc: Lorenzo Stoakes > Cc: "Masami Hiramatsu (Google)" > Cc: "Paul E . McKenney" > Cc: Pawan Gupta > Cc: Petr Mladek > Cc: Phil Auld > Cc: Randy Dunlap > Cc: Russell King > Cc: Shuah Khan > Cc: Simon Horman > Cc: Stanislav Fomichev > Cc: Steven Rostedt > --- > diff with v3: comments modification, suggested by Lance, Masami, Randy and Petr > diff with v2: do not add a new sysctl, extend hung_task_panic, suggested by Kees Cook > > Documentation/admin-guide/kernel-parameters.txt | 20 +++++++++++++------- > Documentation/admin-guide/sysctl/kernel.rst | 9 +++++---- > arch/arm/configs/aspeed_g5_defconfig | 
2 +- For the aspeed_g5_defconfig change: Acked-by: Andrew Jeffery From andrii.nakryiko at gmail.com Tue Oct 21 17:49:50 2025 From: andrii.nakryiko at gmail.com (Andrii Nakryiko) Date: Tue, 21 Oct 2025 17:49:50 -0000 Subject: [PATCH] Fix up 'make versioncheck' issues In-Reply-To: References: Message-ID: On Mon, Oct 20, 2025 at 7:09 PM Jesper Juhl wrote: > > From d2e411b4cd37b1936a30d130e2b21e37e62e0cfb Mon Sep 17 00:00:00 2001 > From: Jesper Juhl > Date: Tue, 21 Oct 2025 03:51:21 +0200 > Subject: [PATCH] [PATCH] Fix up 'make versioncheck' issues > > 'make versioncheck' currently flags a few files that don't need to > needs it but doesn't include it. This patch fixes that up. > > Signed-Off-By: Jesper Juhl > --- > samples/bpf/spintest.bpf.c | 1 - > tools/lib/bpf/bpf_helpers.h | 2 ++ > tools/testing/selftests/bpf/progs/dev_cgroup.c | 1 - > tools/testing/selftests/bpf/progs/netcnt_prog.c | 2 -- > tools/testing/selftests/bpf/progs/test_map_lock.c | 1 - > tools/testing/selftests/bpf/progs/test_send_signal_kern.c | 1 - > tools/testing/selftests/bpf/progs/test_spin_lock.c | 1 - > tools/testing/selftests/bpf/progs/test_tcp_estats.c | 1 - > tools/testing/selftests/wireguard/qemu/init.c | 1 - > 9 files changed, 2 insertions(+), 9 deletions(-) > > diff --git a/samples/bpf/spintest.bpf.c b/samples/bpf/spintest.bpf.c > index cba5a9d507831..6278f6d0b731f 100644 > --- a/samples/bpf/spintest.bpf.c > +++ b/samples/bpf/spintest.bpf.c > @@ -5,7 +5,6 @@ > * License as published by the Free Software Foundation. > */ > #include "vmlinux.h" > -#include > #include > #include > > diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h > index 80c0285406561..393ce1063a977 100644 > --- a/tools/lib/bpf/bpf_helpers.h > +++ b/tools/lib/bpf/bpf_helpers.h > @@ -2,6 +2,8 @@ > #ifndef __BPF_HELPERS__ > #define __BPF_HELPERS__ > > +#include > + This is libbpf's public API header; we are not adding linux/version.h here.
Linux version on which something was built has nothing to do with the version of Linux on which the BPF program is actually running. And BPF programs are most of the time intentionally Linux version-agnostic. [...] From ast at fiberby.net Wed Oct 29 20:51:55 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:55 -0000 Subject: [PATCH net-next v1 06/11] uapi: wireguard: move flag enums In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-7-ast@fiberby.net> Move the wg*_flag enums, so that they are defined above the attribute set enums, as ynl-gen would place them. This is an incremental step towards adopting a UAPI header generated by ynl-gen. This is split out to keep the patches readable. This is a trivial patch with no behavioural changes intended. Signed-off-by: Asbjørn Sloth Tønnesen --- include/uapi/linux/wireguard.h | 25 ++++++++++++++----------- 1 file changed, 14 insertions(+), 11 deletions(-) diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index 3ebfffd61269a..a2815f4f29104 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -15,6 +15,20 @@ enum wgdevice_flag { WGDEVICE_F_REPLACE_PEERS = 1U << 0, __WGDEVICE_F_ALL = WGDEVICE_F_REPLACE_PEERS }; + +enum wgpeer_flag { + WGPEER_F_REMOVE_ME = 1U << 0, + WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1, + WGPEER_F_UPDATE_ONLY = 1U << 2, + __WGPEER_F_ALL = WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS | + WGPEER_F_UPDATE_ONLY +}; + +enum wgallowedip_flag { + WGALLOWEDIP_F_REMOVE_ME = 1U << 0, + __WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME +}; + enum wgdevice_attribute { WGDEVICE_A_UNSPEC, WGDEVICE_A_IFINDEX, @@ -29,13 +43,6 @@ enum wgdevice_attribute { }; #define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1) -enum wgpeer_flag { - WGPEER_F_REMOVE_ME = 1U << 0, - WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1, - 
WGPEER_F_UPDATE_ONLY = 1U << 2, - __WGPEER_F_ALL = WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS | - WGPEER_F_UPDATE_ONLY -}; enum wgpeer_attribute { WGPEER_A_UNSPEC, WGPEER_A_PUBLIC_KEY, @@ -52,10 +59,6 @@ enum wgpeer_attribute { }; #define WGPEER_A_MAX (__WGPEER_A_LAST - 1) -enum wgallowedip_flag { - WGALLOWEDIP_F_REMOVE_ME = 1U << 0, - __WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME -}; enum wgallowedip_attribute { WGALLOWEDIP_A_UNSPEC, WGALLOWEDIP_A_FAMILY, -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:56 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:56 -0000 Subject: [PATCH net-next v1 11/11] wireguard: netlink: generate netlink code In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-12-ast@fiberby.net> This patch adopts netlink policy and command definitions as generated by ynl-gen, thus completing the conversion to YNL. Given that the old and new policies are functionally identical and have only moved to a new file, this serves to verify that the policy in the spec is identical to the previous policy code. No behavioural changes intended.
Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/Makefile | 1 + drivers/net/wireguard/netlink.c | 62 +++-------------------- drivers/net/wireguard/netlink_gen.c | 77 +++++++++++++++++++++++++++++ drivers/net/wireguard/netlink_gen.h | 29 +++++++++++ 4 files changed, 114 insertions(+), 55 deletions(-) create mode 100644 drivers/net/wireguard/netlink_gen.c create mode 100644 drivers/net/wireguard/netlink_gen.h diff --git a/drivers/net/wireguard/Makefile b/drivers/net/wireguard/Makefile index dbe1f8514efc3..ae4b479cddbda 100644 --- a/drivers/net/wireguard/Makefile +++ b/drivers/net/wireguard/Makefile @@ -14,4 +14,5 @@ wireguard-y += allowedips.o wireguard-y += ratelimiter.o wireguard-y += cookie.o wireguard-y += netlink.o +wireguard-y += netlink_gen.o obj-$(CONFIG_WIREGUARD) := wireguard.o diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index 3595349448b2c..6a7e522e3a78e 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -9,6 +9,7 @@ #include "socket.h" #include "queueing.h" #include "messages.h" +#include "netlink_gen.h" #include @@ -19,37 +20,6 @@ static struct genl_family genl_family; -static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { - [WGDEVICE_A_IFINDEX] = { .type = NLA_U32 }, - [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, - [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), - [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, - [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, - [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy), -}; - -static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { - [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7), - [WGPEER_A_ENDPOINT] = 
NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)), - [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 }, - [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)), - [WGPEER_A_RX_BYTES] = { .type = NLA_U64 }, - [WGPEER_A_TX_BYTES] = { .type = NLA_U64 }, - [WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(allowedip_policy), - [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 } -}; - -static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { - [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, - [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct in_addr)), - [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 }, - [WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), -}; - static struct wg_device *lookup_interface(struct nlattr **attrs, struct sk_buff *skb) { @@ -197,7 +167,7 @@ get_peer(struct wg_peer *peer, struct sk_buff *skb, struct dump_ctx *ctx) return -EMSGSIZE; } -static int wireguard_nl_get_device_start(struct netlink_callback *cb) +int wireguard_nl_get_device_start(struct netlink_callback *cb) { struct wg_device *wg; @@ -208,8 +178,8 @@ static int wireguard_nl_get_device_start(struct netlink_callback *cb) return 0; } -static int wireguard_nl_get_device_dumpit(struct sk_buff *skb, - struct netlink_callback *cb) +int wireguard_nl_get_device_dumpit(struct sk_buff *skb, + struct netlink_callback *cb) { struct wg_peer *peer, *next_peer_cursor; struct dump_ctx *ctx = DUMP_CTX(cb); @@ -303,7 +273,7 @@ static int wireguard_nl_get_device_dumpit(struct sk_buff *skb, */ } -static int wireguard_nl_get_device_done(struct netlink_callback *cb) +int wireguard_nl_get_device_done(struct netlink_callback *cb) { struct dump_ctx *ctx = DUMP_CTX(cb); @@ -501,8 +471,8 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs) return ret; } -static int wireguard_nl_set_device_doit(struct sk_buff *skb, - struct genl_info *info) +int wireguard_nl_set_device_doit(struct sk_buff *skb, + struct genl_info *info) { struct wg_device 
*wg = lookup_interface(info->attrs, skb); u32 flags = 0; @@ -616,24 +586,6 @@ static int wireguard_nl_set_device_doit(struct sk_buff *skb, return ret; } -static const struct genl_split_ops wireguard_nl_ops[] = { - { - .cmd = WG_CMD_GET_DEVICE, - .start = wireguard_nl_get_device_start, - .dumpit = wireguard_nl_get_device_dumpit, - .done = wireguard_nl_get_device_done, - .policy = device_policy, - .maxattr = WGDEVICE_A_PEERS, - .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, - }, { - .cmd = WG_CMD_SET_DEVICE, - .doit = wireguard_nl_set_device_doit, - .policy = device_policy, - .maxattr = WGDEVICE_A_PEERS, - .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, - } -}; - static struct genl_family genl_family __ro_after_init = { .split_ops = wireguard_nl_ops, .n_split_ops = ARRAY_SIZE(wireguard_nl_ops), diff --git a/drivers/net/wireguard/netlink_gen.c b/drivers/net/wireguard/netlink_gen.c new file mode 100644 index 0000000000000..f95fa133778f1 --- /dev/null +++ b/drivers/net/wireguard/netlink_gen.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) +/* Do not edit directly, auto-generated from: */ +/* Documentation/netlink/specs/wireguard.yaml */ +/* YNL-GEN kernel source */ + +#include +#include + +#include "netlink_gen.h" + +#include +#include + +/* Common nested types */ +const struct nla_policy wireguard_wgallowedip_nl_policy[WGALLOWEDIP_A_FLAGS + 1] = { + [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16, }, + [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(4), + [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8, }, + [WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), +}; + +const struct nla_policy wireguard_wgpeer_nl_policy[WGPEER_A_PROTOCOL_VERSION + 1] = { + [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7), + [WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(16), + [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = 
NLA_U16, }, + [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(16), + [WGPEER_A_RX_BYTES] = { .type = NLA_U64, }, + [WGPEER_A_TX_BYTES] = { .type = NLA_U64, }, + [WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(wireguard_wgallowedip_nl_policy), + [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32, }, +}; + +/* WG_CMD_GET_DEVICE - dump */ +static const struct nla_policy wireguard_get_device_nl_policy[WGDEVICE_A_PEERS + 1] = { + [WGDEVICE_A_IFINDEX] = { .type = NLA_U32, }, + [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = 15, }, + [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), + [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16, }, + [WGDEVICE_A_FWMARK] = { .type = NLA_U32, }, + [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(wireguard_wgpeer_nl_policy), +}; + +/* WG_CMD_SET_DEVICE - do */ +static const struct nla_policy wireguard_set_device_nl_policy[WGDEVICE_A_PEERS + 1] = { + [WGDEVICE_A_IFINDEX] = { .type = NLA_U32, }, + [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = 15, }, + [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), + [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16, }, + [WGDEVICE_A_FWMARK] = { .type = NLA_U32, }, + [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(wireguard_wgpeer_nl_policy), +}; + +/* Ops table for wireguard */ +const struct genl_split_ops wireguard_nl_ops[2] = { + { + .cmd = WG_CMD_GET_DEVICE, + .start = wireguard_nl_get_device_start, + .dumpit = wireguard_nl_get_device_dumpit, + .done = wireguard_nl_get_device_done, + .policy = wireguard_get_device_nl_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, + }, + { + .cmd = WG_CMD_SET_DEVICE, + .doit = wireguard_nl_set_device_doit, + .policy = wireguard_set_device_nl_policy, + .maxattr = 
WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, + }, +}; diff --git a/drivers/net/wireguard/netlink_gen.h b/drivers/net/wireguard/netlink_gen.h new file mode 100644 index 0000000000000..e635b1f5f0df5 --- /dev/null +++ b/drivers/net/wireguard/netlink_gen.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ +/* Do not edit directly, auto-generated from: */ +/* Documentation/netlink/specs/wireguard.yaml */ +/* YNL-GEN kernel header */ + +#ifndef _LINUX_WIREGUARD_GEN_H +#define _LINUX_WIREGUARD_GEN_H + +#include +#include + +#include +#include + +/* Common nested types */ +extern const struct nla_policy wireguard_wgallowedip_nl_policy[WGALLOWEDIP_A_FLAGS + 1]; +extern const struct nla_policy wireguard_wgpeer_nl_policy[WGPEER_A_PROTOCOL_VERSION + 1]; + +/* Ops table for wireguard */ +extern const struct genl_split_ops wireguard_nl_ops[2]; + +int wireguard_nl_get_device_start(struct netlink_callback *cb); +int wireguard_nl_get_device_done(struct netlink_callback *cb); + +int wireguard_nl_get_device_dumpit(struct sk_buff *skb, + struct netlink_callback *cb); +int wireguard_nl_set_device_doit(struct sk_buff *skb, struct genl_info *info); + +#endif /* _LINUX_WIREGUARD_GEN_H */ -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:57 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:57 -0000 Subject: [PATCH net-next v1 01/11] wireguard: netlink: validate nested arrays in policy In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-2-ast@fiberby.net> Use NLA_POLICY_NESTED_ARRAY() to perform nested array validation in the policy validation step. The nested policy was already enforced through nla_parse_nested(); however, extack wasn't passed previously.
Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index 67f962eb8b46d..9bc76e1bcba2d 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -27,7 +27,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL), [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, - [WGDEVICE_A_PEERS] = { .type = NLA_NESTED } + [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy), }; static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { @@ -39,7 +39,7 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)), [WGPEER_A_RX_BYTES] = { .type = NLA_U64 }, [WGPEER_A_TX_BYTES] = { .type = NLA_U64 }, - [WGPEER_A_ALLOWEDIPS] = { .type = NLA_NESTED }, + [WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(allowedip_policy), [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 } }; @@ -467,7 +467,7 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs) nla_for_each_nested(attr, attrs[WGPEER_A_ALLOWEDIPS], rem) { ret = nla_parse_nested(allowedip, WGALLOWEDIP_A_MAX, - attr, allowedip_policy, NULL); + attr, NULL, NULL); if (ret < 0) goto out; ret = set_allowedip(peer, allowedip); @@ -593,7 +593,7 @@ static int wg_set_device(struct sk_buff *skb, struct genl_info *info) nla_for_each_nested(attr, info->attrs[WGDEVICE_A_PEERS], rem) { ret = nla_parse_nested(peer, WGPEER_A_MAX, attr, - peer_policy, NULL); + NULL, NULL); if (ret < 0) goto out; ret = set_peer(wg, peer); -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:57 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:57 -0000 Subject: [PATCH net-next v1 05/11] uapi: 
wireguard: move enum wg_cmd In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-6-ast@fiberby.net> This patch moves enum wg_cmd to the end of the file, where ynl-gen would like to generate it. This is an incremental step towards adopting a UAPI header generated by ynl-gen. This is split out to keep the patches readable. This is a trivial patch with no behavioural changes intended. Signed-off-by: Asbjørn Sloth Tønnesen --- include/uapi/linux/wireguard.h | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index dee4401e0b5df..3ebfffd61269a 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -11,13 +11,6 @@ #define WG_KEY_LEN 32 -enum wg_cmd { - WG_CMD_GET_DEVICE, - WG_CMD_SET_DEVICE, - __WG_CMD_MAX -}; -#define WG_CMD_MAX (__WG_CMD_MAX - 1) - enum wgdevice_flag { WGDEVICE_F_REPLACE_PEERS = 1U << 0, __WGDEVICE_F_ALL = WGDEVICE_F_REPLACE_PEERS @@ -73,4 +66,12 @@ enum wgallowedip_attribute { }; #define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1) +enum wg_cmd { + WG_CMD_GET_DEVICE, + WG_CMD_SET_DEVICE, + + __WG_CMD_MAX +}; +#define WG_CMD_MAX (__WG_CMD_MAX - 1) + #endif /* _WG_UAPI_WIREGUARD_H */ -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:57 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:57 -0000 Subject: [PATCH net-next v1 07/11] uapi: wireguard: generate header with ynl-gen In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-8-ast@fiberby.net> Use ynl-gen to generate the UAPI header for wireguard. The cosmetic changes in this patch confirm that the spec is aligned with the implementation and ensure that it stays in sync. Changes in generated header: * Trivial include guard rename.
* Trivial white space changes. * Trivial comment changes. * Precompute bitflags in ynl-gen (see [1]). * Drop __*_F_ALL constants (see [1]). [1] https://lore.kernel.org/r/20251014123201.6ecfd146 at kernel.org/ No behavioural changes intended. Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 6 +++--- include/uapi/linux/wireguard.h | 37 ++++++++++++++++----------------- 2 files changed, 21 insertions(+), 22 deletions(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index 024d4a6cc74c6..86333c263e6a5 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -24,7 +24,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL), + [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy), @@ -33,7 +33,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGPEER_F_ALL), + [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7), [WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)), [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 }, [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)), @@ -47,7 +47,7 @@ static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct
in_addr)), [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 }, - [WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGALLOWEDIP_F_ALL), + [WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), }; static struct wg_device *lookup_interface(struct nlattr **attrs, diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index a2815f4f29104..dc3924d0c5524 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -1,32 +1,28 @@ -/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */ -/* - * Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved. - */ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ +/* Do not edit directly, auto-generated from: */ +/* Documentation/netlink/specs/wireguard.yaml */ +/* YNL-GEN uapi header */ -#ifndef _WG_UAPI_WIREGUARD_H -#define _WG_UAPI_WIREGUARD_H +#ifndef _UAPI_LINUX_WIREGUARD_H +#define _UAPI_LINUX_WIREGUARD_H -#define WG_GENL_NAME "wireguard" -#define WG_GENL_VERSION 1 +#define WG_GENL_NAME "wireguard" +#define WG_GENL_VERSION 1 -#define WG_KEY_LEN 32 +#define WG_KEY_LEN 32 enum wgdevice_flag { - WGDEVICE_F_REPLACE_PEERS = 1U << 0, - __WGDEVICE_F_ALL = WGDEVICE_F_REPLACE_PEERS + WGDEVICE_F_REPLACE_PEERS = 1, }; enum wgpeer_flag { - WGPEER_F_REMOVE_ME = 1U << 0, - WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1, - WGPEER_F_UPDATE_ONLY = 1U << 2, - __WGPEER_F_ALL = WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS | - WGPEER_F_UPDATE_ONLY + WGPEER_F_REMOVE_ME = 1, + WGPEER_F_REPLACE_ALLOWEDIPS = 2, + WGPEER_F_UPDATE_ONLY = 4, }; enum wgallowedip_flag { - WGALLOWEDIP_F_REMOVE_ME = 1U << 0, - __WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME + WGALLOWEDIP_F_REMOVE_ME = 1, }; enum wgdevice_attribute { @@ -39,6 +35,7 @@ enum wgdevice_attribute { WGDEVICE_A_LISTEN_PORT, WGDEVICE_A_FWMARK, WGDEVICE_A_PEERS, + __WGDEVICE_A_LAST }; #define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1) @@ -55,6 +52,7 @@ enum wgpeer_attribute { WGPEER_A_TX_BYTES, 
WGPEER_A_ALLOWEDIPS, WGPEER_A_PROTOCOL_VERSION, + __WGPEER_A_LAST }; #define WGPEER_A_MAX (__WGPEER_A_LAST - 1) @@ -65,6 +63,7 @@ enum wgallowedip_attribute { WGALLOWEDIP_A_IPADDR, WGALLOWEDIP_A_CIDR_MASK, WGALLOWEDIP_A_FLAGS, + __WGALLOWEDIP_A_LAST }; #define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1) @@ -77,4 +76,4 @@ enum wg_cmd { }; #define WG_CMD_MAX (__WG_CMD_MAX - 1) -#endif /* _WG_UAPI_WIREGUARD_H */ +#endif /* _UAPI_LINUX_WIREGUARD_H */ -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:57 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:57 -0000 Subject: [PATCH net-next v1 04/11] netlink: specs: add specification for wireguard In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-5-ast@fiberby.net> This patch adds a near-complete[1] YNL specification for wireguard. It documents the protocol in a more machine-readable format than the comment in wireguard.h, and eases usage from C and non-C programming languages alike. The generated C library will be featured in a later patch in this series, so in this patch I will use the in-tree Python client for examples. This makes the documentation in the UAPI header redundant, and it is therefore removed.
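The get-device documentation carried over into the spec says a dump can split one peer across messages. When a peer's allowed IPs overflow a message, the same peer is repeated in the next message with only its public key and allowed IPs, and coalescing adjacent peers is left to the receiver. Below is a minimal sketch of that merge step. It assumes a hypothetical, simplified struct peer_chunk that counts allowed IPs instead of storing them; coalesce_peers() is likewise illustrative, not part of any existing library.

```c
#include <stddef.h>
#include <string.h>

#define KEY_LEN 32 /* WG_KEY_LEN */

/* Hypothetical, simplified view of one peer entry from one dump message. */
struct peer_chunk {
	unsigned char public_key[KEY_LEN];
	unsigned int n_allowedips;   /* allowed IPs carried in this chunk */
	unsigned long long rx_bytes; /* only present in a peer's first chunk */
};

/*
 * Merge adjacent chunks that carry the same public key, in place.
 * Returns the number of distinct peers left at the front of the array.
 */
static size_t coalesce_peers(struct peer_chunk *p, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0 &&
		    memcmp(p[out - 1].public_key, p[i].public_key, KEY_LEN) == 0) {
			/* continuation chunk: fold its allowed IPs into the peer */
			p[out - 1].n_allowedips += p[i].n_allowedips;
			continue;
		}
		p[out++] = p[i]; /* first chunk of a new peer */
	}
	return out;
}
```

Only adjacent entries are merged, matching the documented behaviour that continuation chunks follow directly after the peer they extend.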
The in-line documentation in the spec is based on the existing comment in wireguard.h, and once released it will be available in the kernel documentation at: https://docs.kernel.org/netlink/specs/wireguard.html (until then run: make htmldocs) Generate wireguard.rst from this spec: $ make -C tools/net/ynl/generated/ wireguard.rst Query a wireguard interface through pyynl: $ sudo ./tools/net/ynl/pyynl/cli.py --family wireguard \ --dump get-device \ --json '{"ifindex":3}' [{'fwmark': 0, 'ifindex': 3, 'ifname': 'wg-test', 'listen-port': 54318, 'peers': [{0: {'allowedips': [{0: {'cidr-mask': 0, 'family': 2, 'ipaddr': '0.0.0.0'}}, {0: {'cidr-mask': 0, 'family': 10, 'ipaddr': '::'}}], 'endpoint': b'[...]', 'last-handshake-time': {'nsec': 42, 'sec': 42}, 'persistent-keepalive-interval': 42, 'preshared-key': '[...]', 'protocol-version': 1, 'public-key': '[...]', 'rx-bytes': 42, 'tx-bytes': 42}}], 'private-key': '[...]', 'public-key': '[...]'}] Add another allowed IP prefix: $ sudo ./tools/net/ynl/pyynl/cli.py --family wireguard \ --do set-device --json '{"ifindex":3,"peers":[ {"public-key":"6a df b1 83 a4 ..","allowedips":[ {"cidr-mask":0,"family":10,"ipaddr":"::"}]}]}' [1] As can be seen above, the "endpoint" is only decoded as binary data, as it can't be fully described in YNL. It's a struct sockaddr_in or struct sockaddr_in6 depending on the attribute length.
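Since the endpoint stays opaque binary in YNL, a client has to pick the struct from the attribute length itself. The sketch below assumes the payload bytes have already been extracted from the attribute. It copies them into a struct sockaddr_storage to avoid unaligned access. It then accepts only two combinations: a length of sizeof(struct sockaddr_in) with AF_INET, or a length of sizeof(struct sockaddr_in6) with AF_INET6. format_endpoint() is a hypothetical helper, not part of the generated library.

```c
#include <arpa/inet.h>  /* inet_ntop(), ntohs() */
#include <netinet/in.h> /* struct sockaddr_in, struct sockaddr_in6 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h> /* struct sockaddr_storage, AF_INET, AF_INET6 */

/*
 * Format a WGPEER_A_ENDPOINT payload as "addr:port" or "[addr]:port".
 * The attribute length selects the struct and the family must agree.
 * Returns 0 on success, -1 for any other length/family combination.
 */
static int format_endpoint(const void *data, size_t len,
			   char *out, size_t outlen)
{
	struct sockaddr_storage ss;
	char addr[INET6_ADDRSTRLEN];

	if (len > sizeof(ss))
		return -1;
	memset(&ss, 0, sizeof(ss));
	memcpy(&ss, data, len); /* aligned copy of the raw attribute bytes */

	if (len == sizeof(struct sockaddr_in) && ss.ss_family == AF_INET) {
		const struct sockaddr_in *sin = (const struct sockaddr_in *)&ss;

		if (!inet_ntop(AF_INET, &sin->sin_addr, addr, sizeof(addr)))
			return -1;
		snprintf(out, outlen, "%s:%u", addr,
			 (unsigned int)ntohs(sin->sin_port));
		return 0;
	}
	if (len == sizeof(struct sockaddr_in6) && ss.ss_family == AF_INET6) {
		const struct sockaddr_in6 *sin6 =
			(const struct sockaddr_in6 *)&ss;

		if (!inet_ntop(AF_INET6, &sin6->sin6_addr, addr, sizeof(addr)))
			return -1;
		snprintf(out, outlen, "[%s]:%u", addr,
			 (unsigned int)ntohs(sin6->sin6_port));
		return 0;
	}
	return -1; /* unknown length/family combination */
}
```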
Signed-off-by: Asbjørn Sloth Tønnesen --- Documentation/netlink/specs/wireguard.yaml | 307 +++++++++++++++++++++ MAINTAINERS | 1 + include/uapi/linux/wireguard.h | 129 --------- 3 files changed, 308 insertions(+), 129 deletions(-) create mode 100644 Documentation/netlink/specs/wireguard.yaml diff --git a/Documentation/netlink/specs/wireguard.yaml b/Documentation/netlink/specs/wireguard.yaml new file mode 100644 index 0000000000000..f3226fa38095e --- /dev/null +++ b/Documentation/netlink/specs/wireguard.yaml @@ -0,0 +1,307 @@ +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) +--- +name: wireguard +protocol: genetlink-legacy + +doc: | + **Netlink protocol to control WireGuard network devices.** + + The below enums and macros are for interfacing with WireGuard, + using generic netlink, with family ``WG_GENL_NAME`` and version + ``WG_GENL_VERSION``. It defines two commands: get and set. + Note that while they share many common attributes, these two + commands actually accept a slightly different set of inputs and + outputs. These differences are noted under the individual attributes. +c-family-name: wg-genl-name +c-version-name: wg-genl-version +max-by-define: true + +definitions: + - + name-prefix: wg- + name: key-len + type: const + value: 32 + - + name: --kernel-timespec + type: struct + header: linux/time_types.h + members: + - + name: sec + type: u64 + doc: Number of seconds since the UNIX epoch. + - + name: nsec + type: u64 + doc: Number of nanoseconds after the second began.
+ - + name: wgdevice-flags + name-prefix: wgdevice-f- + enum-name: wgdevice-flag + type: flags + entries: + - replace-peers + - + name: wgpeer-flags + name-prefix: wgpeer-f- + enum-name: wgpeer-flag + type: flags + entries: + - remove-me + - replace-allowedips + - update-only + - + name: wgallowedip-flags + name-prefix: wgallowedip-f- + enum-name: wgallowedip-flag + type: flags + entries: + - remove-me + +attribute-sets: + - + name: wgdevice + enum-name: wgdevice-attribute + name-prefix: wgdevice-a- + attr-cnt-name: --wgdevice-a-last + attributes: + - + name: unspec + type: unused + value: 0 + - + name: ifindex + type: u32 + - + name: ifname + type: string + checks: + max-len: 15 + - + name: private-key + type: binary + doc: Set to all zeros to remove. + display-hint: hex + checks: + exact-len: wg-key-len + - + name: public-key + type: binary + display-hint: hex + checks: + exact-len: wg-key-len + - + name: flags + doc: | + ``0`` or ``WGDEVICE_F_REPLACE_PEERS`` if all current peers + should be removed prior to adding the list below. + type: u32 + enum: wgdevice-flags + checks: + flags-mask: wgdevice-flags + - + name: listen-port + type: u16 + doc: Set as ``0`` to choose randomly. + - + name: fwmark + type: u32 + doc: Set as ``0`` to disable. + - + name: peers + type: indexed-array + sub-type: nest + nested-attributes: wgpeer + doc: The index is set as ``0`` in ``DUMP``, and unused in ``DO``. + - + name: wgpeer + enum-name: wgpeer-attribute + name-prefix: wgpeer-a- + attr-cnt-name: --wgpeer-a-last + attributes: + - + name: unspec + type: unused + value: 0 + - + name: public-key + type: binary + display-hint: hex + checks: + exact-len: wg-key-len + - + name: preshared-key + type: binary + doc: Set as all zeros to remove. 
+ display-hint: hex + checks: + exact-len: wg-key-len + - + name: flags + doc: | + ``0`` and/or ``WGPEER_F_REMOVE_ME`` if the specified peer should not + exist at the end of the operation, rather than added/updated + and/or ``WGPEER_F_REPLACE_ALLOWEDIPS`` if all current allowed IPs + of this peer should be removed prior to adding the list below + and/or ``WGPEER_F_UPDATE_ONLY`` if the peer should only be set if + it already exists. + type: u32 + enum: wgpeer-flags + checks: + flags-mask: wgpeer-flags + - + name: endpoint + doc: struct sockaddr_in or struct sockaddr_in6 + type: binary + checks: + min-len: 16 + - + name: persistent-keepalive-interval + type: u16 + doc: Set as ``0`` to disable. + - + name: last-handshake-time + type: binary + struct: --kernel-timespec + checks: + exact-len: 16 + - + name: rx-bytes + type: u64 + - + name: tx-bytes + type: u64 + - + name: allowedips + type: indexed-array + sub-type: nest + nested-attributes: wgallowedip + doc: The index is set as ``0`` in ``DUMP``, and unused in ``DO``. + - + name: protocol-version + type: u32 + doc: | + Should not be set or used at all by most users of this API, + as the most recent protocol will be used when this is unset. + Otherwise, must be set to ``1``. + - + name: wgallowedip + enum-name: wgallowedip-attribute + name-prefix: wgallowedip-a- + attr-cnt-name: --wgallowedip-a-last + attributes: + - + name: unspec + type: unused + value: 0 + - + name: family + type: u16 + doc: IP family, either ``AF_INET`` or ``AF_INET6``. + - + name: ipaddr + type: binary + doc: Either ``struct in_addr`` or ``struct in6_addr``. + display-hint: ipv4-or-v6 + checks: + min-len: 4 + - + name: cidr-mask + type: u8 + - + name: flags + type: u32 + doc: | + ``WGALLOWEDIP_F_REMOVE_ME`` if the specified IP should be + removed; otherwise, this IP will be added if it is not + already present. 
+ enum: wgallowedip-flags + checks: + flags-mask: wgallowedip-flags + +operations: + enum-name: wg-cmd + name-prefix: wg-cmd- + list: + - + name: get-device + value: 0 + doc: | + Retrieve WireGuard device + ~~~~~~~~~~~~~~~~~~~~~~~~~ + + The command should be called with one but not both of: + + - ``WGDEVICE_A_IFINDEX`` + - ``WGDEVICE_A_IFNAME`` + + The kernel will then return several messages (``NLM_F_MULTI``). + It is possible that all of the allowed IPs of a single peer + will not fit within a single netlink message. In that case, the + same peer will be written in the following message, except it will + only contain ``WGPEER_A_PUBLIC_KEY`` and ``WGPEER_A_ALLOWEDIPS``. + This may occur several times in a row for the same peer. + It is then up to the receiver to coalesce adjacent peers. + Likewise, it is possible that all peers will not fit within a + single message. + So, subsequent peers will be sent in following messages, + except those will only contain ``WGDEVICE_A_IFNAME`` and + ``WGDEVICE_A_PEERS``. It is then up to the receiver to coalesce + these messages to form the complete list of peers. + + While this command does accept the other ``WGDEVICE_A_*`` + attributes for compatibility reasons, they are ignored + by this command and should not be used in requests. + + Since this is an ``NLM_F_DUMP`` command, the final message will + always be ``NLMSG_DONE``, even if an error occurs. However, this + ``NLMSG_DONE`` message contains an integer error code. It is + either zero or a negative error code corresponding to the errno.
+ attribute-set: wgdevice + flags: [uns-admin-perm] + + dump: + pre: wireguard-nl-get-device-start + post: wireguard-nl-get-device-done + # request only uses ifindex | ifname, but keep .maxattr as is + request: &all-attrs + attributes: + - ifindex + - ifname + - private-key + - public-key + - flags + - listen-port + - fwmark + - peers + reply: *all-attrs + - + name: set-device + value: 1 + doc: | + Set WireGuard device + ~~~~~~~~~~~~~~~~~~~~ + + This command should be called with a wgdevice set, containing one + but not both of ``WGDEVICE_A_IFINDEX`` and ``WGDEVICE_A_IFNAME``. + + It is possible that the amount of configuration data exceeds that + of the maximum message length accepted by the kernel. + In that case, several messages should be sent one after another, + with each successive one filling in information not contained in + the prior. + Note that if ``WGDEVICE_F_REPLACE_PEERS`` is specified in the first + message, it probably should not be specified in fragments that come + after, so that the list of peers is only cleared the first time but + appended after. + Likewise for peers, if ``WGPEER_F_REPLACE_ALLOWEDIPS`` is specified + in the first message of a peer, it likely should not be specified + in subsequent fragments. + + If an error occurs, ``NLMSG_ERROR`` will reply containing an errno. + attribute-set: wgdevice + flags: [uns-admin-perm] + + do: + request: *all-attrs diff --git a/MAINTAINERS b/MAINTAINERS index d652f4f27756e..1bceeb4f5d122 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -27630,6 +27630,7 @@ M: Jason A. 
Donenfeld L: wireguard at lists.zx2c4.com L: netdev at vger.kernel.org S: Maintained +F: Documentation/netlink/specs/wireguard.yaml F: drivers/net/wireguard/ F: tools/testing/selftests/wireguard/ diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index 8c26391196d50..dee4401e0b5df 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -1,135 +1,6 @@ /* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */ /* * Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved. - * - * Documentation - * ============= - * - * The below enums and macros are for interfacing with WireGuard, using generic - * netlink, with family WG_GENL_NAME and version WG_GENL_VERSION. It defines two - * methods: get and set. Note that while they share many common attributes, - * these two functions actually accept a slightly different set of inputs and - * outputs. - * - * WG_CMD_GET_DEVICE - * ----------------- - * - * May only be called via NLM_F_REQUEST | NLM_F_DUMP. 
The command should contain - * one but not both of: - * - * WGDEVICE_A_IFINDEX: NLA_U32 - * WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1 - * - * The kernel will then return several messages (NLM_F_MULTI) containing the - * following tree of nested items: - * - * WGDEVICE_A_IFINDEX: NLA_U32 - * WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1 - * WGDEVICE_A_PRIVATE_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGDEVICE_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGDEVICE_A_LISTEN_PORT: NLA_U16 - * WGDEVICE_A_FWMARK: NLA_U32 - * WGDEVICE_A_PEERS: NLA_NESTED - * 0: NLA_NESTED - * WGPEER_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGPEER_A_PRESHARED_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGPEER_A_ENDPOINT: NLA_MIN_LEN(struct sockaddr), struct sockaddr_in or struct sockaddr_in6 - * WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL: NLA_U16 - * WGPEER_A_LAST_HANDSHAKE_TIME: NLA_EXACT_LEN, struct __kernel_timespec - * WGPEER_A_RX_BYTES: NLA_U64 - * WGPEER_A_TX_BYTES: NLA_U64 - * WGPEER_A_ALLOWEDIPS: NLA_NESTED - * 0: NLA_NESTED - * WGALLOWEDIP_A_FAMILY: NLA_U16 - * WGALLOWEDIP_A_IPADDR: NLA_MIN_LEN(struct in_addr), struct in_addr or struct in6_addr - * WGALLOWEDIP_A_CIDR_MASK: NLA_U8 - * 0: NLA_NESTED - * ... - * 0: NLA_NESTED - * ... - * ... - * WGPEER_A_PROTOCOL_VERSION: NLA_U32 - * 0: NLA_NESTED - * ... - * ... - * - * It is possible that all of the allowed IPs of a single peer will not - * fit within a single netlink message. In that case, the same peer will - * be written in the following message, except it will only contain - * WGPEER_A_PUBLIC_KEY and WGPEER_A_ALLOWEDIPS. This may occur several - * times in a row for the same peer. It is then up to the receiver to - * coalesce adjacent peers. Likewise, it is possible that all peers will - * not fit within a single message. So, subsequent peers will be sent - * in following messages, except those will only contain WGDEVICE_A_IFNAME - * and WGDEVICE_A_PEERS. 
It is then up to the receiver to coalesce these - * messages to form the complete list of peers. - * - * Since this is an NLA_F_DUMP command, the final message will always be - * NLMSG_DONE, even if an error occurs. However, this NLMSG_DONE message - * contains an integer error code. It is either zero or a negative error - * code corresponding to the errno. - * - * WG_CMD_SET_DEVICE - * ----------------- - * - * May only be called via NLM_F_REQUEST. The command should contain the - * following tree of nested items, containing one but not both of - * WGDEVICE_A_IFINDEX and WGDEVICE_A_IFNAME: - * - * WGDEVICE_A_IFINDEX: NLA_U32 - * WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1 - * WGDEVICE_A_FLAGS: NLA_U32, 0 or WGDEVICE_F_REPLACE_PEERS if all current - * peers should be removed prior to adding the list below. - * WGDEVICE_A_PRIVATE_KEY: len WG_KEY_LEN, all zeros to remove - * WGDEVICE_A_LISTEN_PORT: NLA_U16, 0 to choose randomly - * WGDEVICE_A_FWMARK: NLA_U32, 0 to disable - * WGDEVICE_A_PEERS: NLA_NESTED - * 0: NLA_NESTED - * WGPEER_A_PUBLIC_KEY: len WG_KEY_LEN - * WGPEER_A_FLAGS: NLA_U32, 0 and/or WGPEER_F_REMOVE_ME if the - * specified peer should not exist at the end of the - * operation, rather than added/updated and/or - * WGPEER_F_REPLACE_ALLOWEDIPS if all current allowed - * IPs of this peer should be removed prior to adding - * the list below and/or WGPEER_F_UPDATE_ONLY if the - * peer should only be set if it already exists. 
- * WGPEER_A_PRESHARED_KEY: len WG_KEY_LEN, all zeros to remove - * WGPEER_A_ENDPOINT: struct sockaddr_in or struct sockaddr_in6 - * WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL: NLA_U16, 0 to disable - * WGPEER_A_ALLOWEDIPS: NLA_NESTED - * 0: NLA_NESTED - * WGALLOWEDIP_A_FAMILY: NLA_U16 - * WGALLOWEDIP_A_IPADDR: struct in_addr or struct in6_addr - * WGALLOWEDIP_A_CIDR_MASK: NLA_U8 - * WGALLOWEDIP_A_FLAGS: NLA_U32, WGALLOWEDIP_F_REMOVE_ME if - * the specified IP should be removed; - * otherwise, this IP will be added if - * it is not already present. - * 0: NLA_NESTED - * ... - * 0: NLA_NESTED - * ... - * ... - * WGPEER_A_PROTOCOL_VERSION: NLA_U32, should not be set or used at - * all by most users of this API, as the - * most recent protocol will be used when - * this is unset. Otherwise, must be set - * to 1. - * 0: NLA_NESTED - * ... - * ... - * - * It is possible that the amount of configuration data exceeds that of - * the maximum message length accepted by the kernel. In that case, several - * messages should be sent one after another, with each successive one - * filling in information not contained in the prior. Note that if - * WGDEVICE_F_REPLACE_PEERS is specified in the first message, it probably - * should not be specified in fragments that come after, so that the list - * of peers is only cleared the first time but appended after. Likewise for - * peers, if WGPEER_F_REPLACE_ALLOWEDIPS is specified in the first message - * of a peer, it likely should not be specified in subsequent fragments. - * - * If an error occurs, NLMSG_ERROR will reply containing an errno. 
*/ #ifndef _WG_UAPI_WIREGUARD_H -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:58 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:58 -0000 Subject: [PATCH net-next v1 09/11] wireguard: netlink: convert to split ops In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-10-ast@fiberby.net> This patch converts wireguard from the legacy struct genl_ops to struct genl_split_ops. It does so by applying the same transformation that genl_cmd_full_to_split() would otherwise perform at runtime. WGDEVICE_A_MAX is swapped for WGDEVICE_A_PEERS. While the two are currently equivalent, .maxattr should be the maximum attribute that a given command supports, which might not be WGDEVICE_A_MAX. This is an incremental step towards adopting netlink policy code generated by ynl-gen, ensuring that the code and spec are aligned. This is a trivial patch with no behavioural changes intended.
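A split-ops table like the one this patch introduces has one entry per command and mode (do or dump). Each entry carries its own policy and .maxattr. The toy model below is a hedged sketch of that lookup, not the genetlink implementation. struct split_op, CAP_DO, CAP_DUMP, find_op() and attr_in_range() are hypothetical stand-ins. The command and attribute values are copied from the wireguard UAPI header.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for GENL_CMD_CAP_DO / GENL_CMD_CAP_DUMP. */
#define CAP_DO   (1u << 0)
#define CAP_DUMP (1u << 1)

/* Values from the wireguard UAPI header. */
enum { WG_CMD_GET_DEVICE = 0, WG_CMD_SET_DEVICE = 1 };
enum { WGDEVICE_A_PEERS = 8 };

/* Hypothetical, trimmed-down model of one split-op entry. */
struct split_op {
	uint8_t cmd;
	unsigned int flags; /* exactly one of CAP_DO / CAP_DUMP */
	uint16_t maxattr;   /* highest attribute type this entry covers */
};

static const struct split_op wg_ops[] = {
	{ .cmd = WG_CMD_GET_DEVICE, .flags = CAP_DUMP, .maxattr = WGDEVICE_A_PEERS },
	{ .cmd = WG_CMD_SET_DEVICE, .flags = CAP_DO,   .maxattr = WGDEVICE_A_PEERS },
};

/* Find the entry for (cmd, mode); NULL if that combination is not offered. */
static const struct split_op *find_op(uint8_t cmd, unsigned int cap)
{
	for (size_t i = 0; i < sizeof(wg_ops) / sizeof(wg_ops[0]); i++)
		if (wg_ops[i].cmd == cmd && (wg_ops[i].flags & cap))
			return &wg_ops[i];
	return NULL;
}

/* True if the attribute type lies within this entry's declared range. */
static int attr_in_range(const struct split_op *op, uint16_t type)
{
	return type <= op->maxattr;
}
```

Because the bound lives in each entry, a future command can declare a narrower range than the family-wide WGDEVICE_A_MAX without affecting the others.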
Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index 86333c263e6a5..2acd651f4c71f 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -614,28 +614,30 @@ static int wg_set_device(struct sk_buff *skb, struct genl_info *info) return ret; } -static const struct genl_ops genl_ops[] = { +static const struct genl_split_ops wireguard_nl_ops[] = { { .cmd = WG_CMD_GET_DEVICE, .start = wg_get_device_start, .dumpit = wg_get_device_dump, .done = wg_get_device_done, - .flags = GENL_UNS_ADMIN_PERM + .policy = device_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, }, { .cmd = WG_CMD_SET_DEVICE, .doit = wg_set_device, - .flags = GENL_UNS_ADMIN_PERM + .policy = device_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, } }; static struct genl_family genl_family __ro_after_init = { - .ops = genl_ops, - .n_ops = ARRAY_SIZE(genl_ops), + .split_ops = wireguard_nl_ops, + .n_split_ops = ARRAY_SIZE(wireguard_nl_ops), .name = WG_GENL_NAME, .version = WG_GENL_VERSION, - .maxattr = WGDEVICE_A_MAX, .module = THIS_MODULE, - .policy = device_policy, .netnsok = true }; -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:58 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:58 -0000 Subject: [PATCH net-next v1 00/11] wireguard: netlink: ynl conversion Message-ID: <20251029205123.286115-1-ast@fiberby.net> This series completes the implementation of YNL for wireguard, as previously announced[1]. This series consists of 5 parts: 1) Patch 01-03 - Misc.
changes 2) Patch 04 - Add YNL specification for wireguard 3) Patch 05-07 - Transition to a generated UAPI header 4) Patch 08 - Add a sample program for the generated C library 5) Patch 09-11 - Transition to generated netlink policy code The main benefit of having a YNL specification is unlocked after the first 2 parts. The RFC version already seems to have spawned a new Rust netlink binding[2] that uses wireguard as its main example. Parts 3 and 5 validate that the specification is complete and aligned. The generated code might have a few warts, but they don't matter much and are mostly a transitional problem[3]. Part 4 is possible after part 2, but is ordered after part 3, as it needs to duplicate the UAPI header in tools/include. For the non-generated kernel C code the diff stat looks like this: $ git diff --stat net-next/main..wg-ynl include/ drivers/ \ ':(exclude)*netlink_gen*' drivers/net/wireguard/Makefile | 1 + drivers/net/wireguard/netlink.c | 70 +++--------- include/uapi/linux/wireguard.h | 190 ++++++-------------------------- 3 files changed, 47 insertions(+), 214 deletions(-) [1] [PATCH net 0/4] tools: ynl-gen: misc fixes + wireguard ynl plan https://lore.kernel.org/r/20250901145034.525518-1-ast at fiberby.net/ [2] https://github.com/one-d-wide/netlink-bindings/ [3] https://lore.kernel.org/r/20251014123201.6ecfd146 at kernel.org/ --- v1: - Policy argument to nla_parse_nested() changed to NULL (thanks Johannes). - Added attr-cnt-name to the spec, to reduce the diff a bit. - Refined the doc in the spec a bit. - Reworded commit messages a bit. - Reordered the patches, and reduced the series from 14 to 11 patches.
RFC: https://lore.kernel.org/r/20250904-wg-ynl-rfc at fiberby.net/ diff -Naur a/sent/0904/b/0002-wireguard-netlink-validate-nested-arrays-in-policy.patch 0001-wireguard-netlink-validate-nested-arrays-in-policy.patch diff -Naur a/sent/0904/b/0001-wireguard-netlink-use-WG_KEY_LEN-in-policies.patch 0002-wireguard-netlink-use-WG_KEY_LEN-in-policies.patch diff -Naur a/sent/0904/b/0013-wireguard-netlink-enable-strict-genetlink-validation.patch 0003-wireguard-netlink-enable-strict-genetlink-validation.patch diff -Naur a/sent/0904/b/0003-netlink-specs-add-specification-for-wireguard.patch 0004-netlink-specs-add-specification-for-wireguard.patch Asbjørn Sloth Tønnesen (11): wireguard: netlink: validate nested arrays in policy wireguard: netlink: use WG_KEY_LEN in policies wireguard: netlink: enable strict genetlink validation netlink: specs: add specification for wireguard uapi: wireguard: move enum wg_cmd uapi: wireguard: move flag enums uapi: wireguard: generate header with ynl-gen tools: ynl: add sample for wireguard wireguard: netlink: convert to split ops wireguard: netlink: rename netlink handlers wireguard: netlink: generate netlink code Documentation/netlink/specs/wireguard.yaml | 307 +++++++++++++++++++++ MAINTAINERS | 3 + drivers/net/wireguard/Makefile | 1 + drivers/net/wireguard/netlink.c | 70 +---- drivers/net/wireguard/netlink_gen.c | 77 ++++++ drivers/net/wireguard/netlink_gen.h | 29 ++ include/uapi/linux/wireguard.h | 190 +++---------- tools/include/uapi/linux/wireguard.h | 79 ++++++ tools/net/ynl/samples/.gitignore | 1 + tools/net/ynl/samples/wireguard.c | 104 +++++++ 10 files changed, 647 insertions(+), 214 deletions(-) create mode 100644 Documentation/netlink/specs/wireguard.yaml create mode 100644 drivers/net/wireguard/netlink_gen.c create mode 100644 drivers/net/wireguard/netlink_gen.h create mode 100644 tools/include/uapi/linux/wireguard.h create mode 100644 tools/net/ynl/samples/wireguard.c -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:58 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Wed, 29 Oct 2025 20:51:58 -0000 Subject: [PATCH net-next v1 08/11] tools: ynl: add sample for wireguard In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-9-ast@fiberby.net> Add a sample application for wireguard using the generated C library. The main benefit of this is to exercise the generated library, which might be useful for future selftests. The UAPI header is copied to tools/include/uapi/; when the header changes, ynl-gen will regenerate both copies. Example: $ make -C tools/net/ynl/lib $ make -C tools/net/ynl/generated $ make -C tools/net/ynl/samples wireguard $ ./tools/net/ynl/samples/wireguard usage: ./tools/net/ynl/samples/wireguard $ sudo ./tools/net/ynl/samples/wireguard wg-test Interface 3: wg-test Peer 6adfb183a4a2c94a2f92dab5ade762a4788[...]: Data: rx: 42 / tx: 42 bytes Allowed IPs: 0.0.0.0/0 ::/0 Signed-off-by: Asbjørn Sloth Tønnesen --- MAINTAINERS | 2 + tools/include/uapi/linux/wireguard.h | 79 ++++++++++++++++++++ tools/net/ynl/samples/.gitignore | 1 + tools/net/ynl/samples/wireguard.c | 104 +++++++++++++++++++++++++++ 4 files changed, 186 insertions(+) create mode 100644 tools/include/uapi/linux/wireguard.h create mode 100644 tools/net/ynl/samples/wireguard.c diff --git a/MAINTAINERS b/MAINTAINERS index 1bceeb4f5d122..e7ec4cb4d044f 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -27632,6 +27632,8 @@ L: netdev at vger.kernel.org S: Maintained F: Documentation/netlink/specs/wireguard.yaml F: drivers/net/wireguard/ +F: tools/include/uapi/linux/wireguard.h +F: tools/net/ynl/samples/wireguard.c F: tools/testing/selftests/wireguard/ WISTRON LAPTOP BUTTON DRIVER diff --git a/tools/include/uapi/linux/wireguard.h b/tools/include/uapi/linux/wireguard.h new file mode 100644 index 0000000000000..dc3924d0c5524 --- /dev/null +++ b/tools/include/uapi/linux/wireguard.h @@ -0,0 +1,79 @@ +/*
SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ +/* Do not edit directly, auto-generated from: */ +/* Documentation/netlink/specs/wireguard.yaml */ +/* YNL-GEN uapi header */ + +#ifndef _UAPI_LINUX_WIREGUARD_H +#define _UAPI_LINUX_WIREGUARD_H + +#define WG_GENL_NAME "wireguard" +#define WG_GENL_VERSION 1 + +#define WG_KEY_LEN 32 + +enum wgdevice_flag { + WGDEVICE_F_REPLACE_PEERS = 1, +}; + +enum wgpeer_flag { + WGPEER_F_REMOVE_ME = 1, + WGPEER_F_REPLACE_ALLOWEDIPS = 2, + WGPEER_F_UPDATE_ONLY = 4, +}; + +enum wgallowedip_flag { + WGALLOWEDIP_F_REMOVE_ME = 1, +}; + +enum wgdevice_attribute { + WGDEVICE_A_UNSPEC, + WGDEVICE_A_IFINDEX, + WGDEVICE_A_IFNAME, + WGDEVICE_A_PRIVATE_KEY, + WGDEVICE_A_PUBLIC_KEY, + WGDEVICE_A_FLAGS, + WGDEVICE_A_LISTEN_PORT, + WGDEVICE_A_FWMARK, + WGDEVICE_A_PEERS, + + __WGDEVICE_A_LAST +}; +#define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1) + +enum wgpeer_attribute { + WGPEER_A_UNSPEC, + WGPEER_A_PUBLIC_KEY, + WGPEER_A_PRESHARED_KEY, + WGPEER_A_FLAGS, + WGPEER_A_ENDPOINT, + WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL, + WGPEER_A_LAST_HANDSHAKE_TIME, + WGPEER_A_RX_BYTES, + WGPEER_A_TX_BYTES, + WGPEER_A_ALLOWEDIPS, + WGPEER_A_PROTOCOL_VERSION, + + __WGPEER_A_LAST +}; +#define WGPEER_A_MAX (__WGPEER_A_LAST - 1) + +enum wgallowedip_attribute { + WGALLOWEDIP_A_UNSPEC, + WGALLOWEDIP_A_FAMILY, + WGALLOWEDIP_A_IPADDR, + WGALLOWEDIP_A_CIDR_MASK, + WGALLOWEDIP_A_FLAGS, + + __WGALLOWEDIP_A_LAST +}; +#define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1) + +enum wg_cmd { + WG_CMD_GET_DEVICE, + WG_CMD_SET_DEVICE, + + __WG_CMD_MAX +}; +#define WG_CMD_MAX (__WG_CMD_MAX - 1) + +#endif /* _UAPI_LINUX_WIREGUARD_H */ diff --git a/tools/net/ynl/samples/.gitignore b/tools/net/ynl/samples/.gitignore index 7f5fca7682d74..09c61e4c18cd4 100644 --- a/tools/net/ynl/samples/.gitignore +++ b/tools/net/ynl/samples/.gitignore @@ -7,3 +7,4 @@ rt-addr rt-link rt-route tc +wireguard diff --git a/tools/net/ynl/samples/wireguard.c 
b/tools/net/ynl/samples/wireguard.c new file mode 100644 index 0000000000000..43f3551eb101a --- /dev/null +++ b/tools/net/ynl/samples/wireguard.c @@ -0,0 +1,104 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include + +#include "wireguard-user.h" + +static void print_allowed_ip(const struct wireguard_wgallowedip *aip) +{ + char addr_out[INET6_ADDRSTRLEN]; + + if (!inet_ntop(aip->family, aip->ipaddr, addr_out, sizeof(addr_out))) { + addr_out[0] = '?'; + addr_out[1] = '\0'; + } + printf("\t\t\t%s/%u\n", addr_out, aip->cidr_mask); +} + +/* Only printing public key in this demo. For better key formatting, + * use the constant-time implementation as found in wireguard-tools. + */ +static void print_peer_header(const struct wireguard_wgpeer *peer) +{ + unsigned int i; + uint8_t *key = peer->public_key; + unsigned int len = peer->_len.public_key; + + if (len != 32) + return; + printf("\tPeer "); + for (i = 0; i < len; i++) + printf("%02x", key[i]); + printf(":\n"); +} + +static void print_peer(const struct wireguard_wgpeer *peer) +{ + unsigned int i; + + print_peer_header(peer); + printf("\t\tData: rx: %llu / tx: %llu bytes\n", + peer->rx_bytes, peer->tx_bytes); + printf("\t\tAllowed IPs:\n"); + for (i = 0; i < peer->_count.allowedips; i++) + print_allowed_ip(&peer->allowedips[i]); +} + +static void build_request(struct wireguard_get_device_req *req, char *arg) +{ + char *endptr; + int ifindex; + + ifindex = strtol(arg, &endptr, 0); + if (endptr != arg + strlen(arg) || errno != 0) + ifindex = 0; + if (ifindex > 0) + wireguard_get_device_req_set_ifindex(req, ifindex); + else + wireguard_get_device_req_set_ifname(req, arg); +} + +int main(int argc, char **argv) +{ + struct wireguard_get_device_list *devs; + struct wireguard_get_device_req *req; + struct ynl_sock *ys; + + if (argc < 2) { + fprintf(stderr, "usage: %s \n", argv[0]); + return 1; + } + + req = wireguard_get_device_req_alloc(); + build_request(req, argv[1]); + + ys = 
ynl_sock_create(&ynl_wireguard_family, NULL); + if (!ys) + return 2; + + devs = wireguard_get_device_dump(ys, req); + if (!devs) + goto err_close; + + ynl_dump_foreach(devs, d) { + unsigned int i; + + printf("Interface %d: %s\n", d->ifindex, d->ifname); + for (i = 0; i < d->_count.peers; i++) + print_peer(&d->peers[i]); + } + wireguard_get_device_list_free(devs); + wireguard_get_device_req_free(req); + ynl_sock_destroy(ys); + + return 0; + +err_close: + fprintf(stderr, "YNL (%d): %s\n", ys->err.code, ys->err.msg); + wireguard_get_device_req_free(req); + ynl_sock_destroy(ys); + return 3; +} -- 2.51.0 From ast at fiberby.net Wed Oct 29 20:51:58 2025 From: ast at fiberby.net (=?UTF-8?q?Asbj=C3=B8rn=20Sloth=20T=C3=B8nnesen?=) Date: Wed, 29 Oct 2025 20:51:58 -0000 Subject: [PATCH net-next v1 03/11] wireguard: netlink: enable strict genetlink validation In-Reply-To: <20251029205123.286115-1-ast@fiberby.net> References: <20251029205123.286115-1-ast@fiberby.net> Message-ID: <20251029205123.286115-4-ast@fiberby.net> Wireguard is a modern enough genetlink family, that it doesn't need resv_start_op. It already had policies in place when it was first merged, it has also never used the reserved field, or other things toggled by resv_start_op. wireguard-tools have always used zero initialized memory, and have never touched the reserved field, neither have any other clients I have checked. Closed-source clients are much more likely to use the embeddedable library from wireguard-tools, than a DIY implementation using uninitialized memory. 
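The commit message above relies on existing clients building requests in zero-initialized memory, which leaves the genetlink header's reserved field at 0, as strict validation requires. What follows is a minimal userspace sketch of that pattern. It assumes the standard <linux/netlink.h> and <linux/genetlink.h> UAPI headers are installed. The names wg_dump_request, wg_build_dump_request() and the EXAMPLE_* macros are hypothetical, and the family ID is assumed to have been resolved already through the nlctrl family. A real WG_CMD_GET_DEVICE request must also carry a WGDEVICE_A_IFINDEX or WGDEVICE_A_IFNAME attribute, which this sketch leaves out.

```c
#include <linux/genetlink.h>
#include <linux/netlink.h>
#include <string.h>

/* Values copied from the wireguard UAPI header (hypothetical macro names). */
#define EXAMPLE_WG_CMD_GET_DEVICE 0	/* WG_CMD_GET_DEVICE */
#define EXAMPLE_WG_GENL_VERSION 1	/* WG_GENL_VERSION */

/* Netlink header immediately followed by the genetlink header. */
struct wg_dump_request {
	struct nlmsghdr nlh;
	struct genlmsghdr genl;
};

/*
 * Fill in a WG_CMD_GET_DEVICE dump request. Starting from zeroed memory
 * guarantees genl.reserved == 0, so the request is also accepted when
 * the family validates strictly. Attributes would be appended after
 * the genetlink header and are omitted here.
 */
static void wg_build_dump_request(struct wg_dump_request *req,
				  __u16 family_id, __u32 seq)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = NLMSG_LENGTH(GENL_HDRLEN);
	req->nlh.nlmsg_type = family_id;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->nlh.nlmsg_seq = seq;
	req->genl.cmd = EXAMPLE_WG_CMD_GET_DEVICE;
	req->genl.version = EXAMPLE_WG_GENL_VERSION;
	/* req->genl.reserved stays 0 thanks to the memset() above. */
}
```

Using memset() rather than a designated initializer keeps the zeroing explicit even if the request struct later grows padding or extra fields.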
Signed-off-by: Asbjørn Sloth Tønnesen
---
 drivers/net/wireguard/netlink.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index d36e94220d2c3..024d4a6cc74c6 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -631,7 +631,6 @@ static const struct genl_ops genl_ops[] = {
 static struct genl_family genl_family __ro_after_init = {
 	.ops = genl_ops,
 	.n_ops = ARRAY_SIZE(genl_ops),
-	.resv_start_op = WG_CMD_SET_DEVICE + 1,
 	.name = WG_GENL_NAME,
 	.version = WG_GENL_VERSION,
 	.maxattr = WGDEVICE_A_MAX,
-- 
2.51.0

From ast at fiberby.net Wed Oct 29 20:51:58 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Wed, 29 Oct 2025 20:51:58 -0000
Subject: [PATCH net-next v1 10/11] wireguard: netlink: rename netlink handlers
In-Reply-To: <20251029205123.286115-1-ast@fiberby.net>
References: <20251029205123.286115-1-ast@fiberby.net>
Message-ID: <20251029205123.286115-11-ast@fiberby.net>

Rename netlink handlers to use the naming expected by ynl-gen.

This is an incremental step towards adopting netlink command
definitions generated by ynl-gen.

This is a trivial patch with no behavioural changes intended.
Signed-off-by: Asbjørn Sloth Tønnesen
---
 drivers/net/wireguard/netlink.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index 2acd651f4c71f..3595349448b2c 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -197,7 +197,7 @@ get_peer(struct wg_peer *peer, struct sk_buff *skb, struct dump_ctx *ctx)
 	return -EMSGSIZE;
 }
 
-static int wg_get_device_start(struct netlink_callback *cb)
+static int wireguard_nl_get_device_start(struct netlink_callback *cb)
 {
 	struct wg_device *wg;
 
@@ -208,7 +208,8 @@ static int wg_get_device_start(struct netlink_callback *cb)
 	return 0;
 }
 
-static int wg_get_device_dump(struct sk_buff *skb, struct netlink_callback *cb)
+static int wireguard_nl_get_device_dumpit(struct sk_buff *skb,
+					  struct netlink_callback *cb)
 {
 	struct wg_peer *peer, *next_peer_cursor;
 	struct dump_ctx *ctx = DUMP_CTX(cb);
@@ -302,7 +303,7 @@ static int wg_get_device_dump(struct sk_buff *skb, struct netlink_callback *cb)
 	 */
 }
 
-static int wg_get_device_done(struct netlink_callback *cb)
+static int wireguard_nl_get_device_done(struct netlink_callback *cb)
 {
 	struct dump_ctx *ctx = DUMP_CTX(cb);
 
@@ -500,7 +501,8 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs)
 	return ret;
 }
 
-static int wg_set_device(struct sk_buff *skb, struct genl_info *info)
+static int wireguard_nl_set_device_doit(struct sk_buff *skb,
+					struct genl_info *info)
 {
 	struct wg_device *wg = lookup_interface(info->attrs, skb);
 	u32 flags = 0;
@@ -617,15 +619,15 @@ static int wg_set_device(struct sk_buff *skb, struct genl_info *info)
 static const struct genl_split_ops wireguard_nl_ops[] = {
 	{
 		.cmd = WG_CMD_GET_DEVICE,
-		.start = wg_get_device_start,
-		.dumpit = wg_get_device_dump,
-		.done = wg_get_device_done,
+		.start = wireguard_nl_get_device_start,
+		.dumpit = wireguard_nl_get_device_dumpit,
+		.done = wireguard_nl_get_device_done,
 		.policy = device_policy,
 		.maxattr = WGDEVICE_A_PEERS,
 		.flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP,
 	}, {
 		.cmd = WG_CMD_SET_DEVICE,
-		.doit = wg_set_device,
+		.doit = wireguard_nl_set_device_doit,
 		.policy = device_policy,
 		.maxattr = WGDEVICE_A_PEERS,
 		.flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO,
-- 
2.51.0

From ast at fiberby.net Wed Oct 29 20:51:59 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Wed, 29 Oct 2025 20:51:59 -0000
Subject: [PATCH net-next v1 02/11] wireguard: netlink: use WG_KEY_LEN in policies
In-Reply-To: <20251029205123.286115-1-ast@fiberby.net>
References: <20251029205123.286115-1-ast@fiberby.net>
Message-ID: <20251029205123.286115-3-ast@fiberby.net>

When converting the netlink policies to YNL, the constants used in the
policies have to be visible to user space.

As NOISE_*_KEY_LEN isn't visible to user space, change the policies to
use WG_KEY_LEN instead, as is also documented in the UAPI header:

  $ grep WG_KEY_LEN include/uapi/linux/wireguard.h
   * WGDEVICE_A_PRIVATE_KEY: NLA_EXACT_LEN, len WG_KEY_LEN
   * WGDEVICE_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN
   * WGPEER_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN
   * WGPEER_A_PRESHARED_KEY: NLA_EXACT_LEN, len WG_KEY_LEN
  [...]

Add a couple of BUILD_BUG_ON() checks to ensure that they stay in sync.

No behavioural changes intended.
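In userspace C, the BUILD_BUG_ON() guard described above maps directly onto C11 _Static_assert. The sketch below shows both the compile-time check and the length rule that NLA_POLICY_EXACT_LEN(WG_KEY_LEN) enforces. The EXAMPLE_NOISE_* macros are hypothetical stand-ins for the kernel-internal NOISE_*_KEY_LEN constants, which userspace cannot see. copy_key_attr() is likewise a hypothetical helper and not a kernel function.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WG_KEY_LEN 32 /* as defined in the wireguard UAPI header */

/* Hypothetical stand-ins for the kernel-internal noise key lengths. */
#define EXAMPLE_NOISE_PUBLIC_KEY_LEN 32
#define EXAMPLE_NOISE_SYMMETRIC_KEY_LEN 32

/* Compile-time equivalents of the two BUILD_BUG_ON() checks: the build
 * fails if either internal length drifts away from WG_KEY_LEN. */
_Static_assert(WG_KEY_LEN == EXAMPLE_NOISE_PUBLIC_KEY_LEN,
	       "WG_KEY_LEN must match the public key length");
_Static_assert(WG_KEY_LEN == EXAMPLE_NOISE_SYMMETRIC_KEY_LEN,
	       "WG_KEY_LEN must match the symmetric key length");

/* Same rule as an exact-length policy: accept a key attribute payload
 * only if it is exactly WG_KEY_LEN bytes long. */
static int copy_key_attr(uint8_t out[WG_KEY_LEN], const void *payload,
			 size_t len)
{
	if (len != WG_KEY_LEN)
		return -1;
	memcpy(out, payload, WG_KEY_LEN);
	return 0;
}
```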
Signed-off-by: Asbjørn Sloth Tønnesen
---
 drivers/net/wireguard/netlink.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index 9bc76e1bcba2d..d36e94220d2c3 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -22,8 +22,8 @@ static struct genl_family genl_family;
 static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 	[WGDEVICE_A_IFINDEX] = { .type = NLA_U32 },
 	[WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 },
-	[WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN),
-	[WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN),
+	[WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
+	[WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
 	[WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL),
 	[WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 },
 	[WGDEVICE_A_FWMARK] = { .type = NLA_U32 },
@@ -31,8 +31,8 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 };
 
 static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
-	[WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN),
-	[WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(NOISE_SYMMETRIC_KEY_LEN),
+	[WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
+	[WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
 	[WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGPEER_F_ALL),
 	[WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)),
 	[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 },
@@ -642,6 +642,9 @@ static struct genl_family genl_family __ro_after_init = {
 
 int __init wg_genetlink_init(void)
 {
+	BUILD_BUG_ON(WG_KEY_LEN != NOISE_PUBLIC_KEY_LEN);
+	BUILD_BUG_ON(WG_KEY_LEN != NOISE_SYMMETRIC_KEY_LEN);
+
 	return genl_register_family(&genl_family);
 }
 
-- 
2.51.0

From ast at fiberby.net Thu Oct 30 09:46:38 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Thu, 30 Oct 2025 09:46:38 -0000
Subject: [PATCH net-next v1 01/11] wireguard: netlink: validate nested arrays in policy
In-Reply-To: <20251029205123.286115-2-ast@fiberby.net>
References: <20251029205123.286115-1-ast@fiberby.net> <20251029205123.286115-2-ast@fiberby.net>
Message-ID: <52e4619e-d018-4395-a94a-499ff7fd918d@fiberby.net>

On 10/29/25 8:51 PM, Asbjørn Sloth Tønnesen wrote:
> diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
> index 67f962eb8b46d..9bc76e1bcba2d 100644
> --- a/drivers/net/wireguard/netlink.c
> +++ b/drivers/net/wireguard/netlink.c
> @@ -27,7 +27,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
>  	[WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL),
>  	[WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 },
>  	[WGDEVICE_A_FWMARK] = { .type = NLA_U32 },
> -	[WGDEVICE_A_PEERS] = { .type = NLA_NESTED }
> +	[WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy),
>  };
>  
>  static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
> @@ -39,7 +39,7 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
>  	[WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)),
>  	[WGPEER_A_RX_BYTES] = { .type = NLA_U64 },
>  	[WGPEER_A_TX_BYTES] = { .type = NLA_U64 },
> -	[WGPEER_A_ALLOWEDIPS] = { .type = NLA_NESTED },
> +	[WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(allowedip_policy),
>  	[WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 }
>  };

Oops, I messed this patch up: device_policy now references peer_policy,
and peer_policy references allowedip_policy, before either is defined.
I will add forward declarations in v2. They will be removed again once
the policy code is generated, which is less messy than reordering the
policies.
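Forward declarations like these work because a file-scope static array declared without an initializer is a tentative definition. An earlier table can therefore take its address, and the initialized definition can follow further down the file. Below is a minimal standalone sketch of that ordering. struct policy and the table names are hypothetical simplifications of struct nla_policy and are not the kernel's definitions.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct nla_policy. */
struct policy {
	int type;
	const struct policy *nested;	/* policy for each nested element */
	size_t nested_max;		/* highest attribute type in it */
};

enum { TYPE_U32 = 1, TYPE_NESTED_ARRAY = 2 };

/* Forward declarations (tentative definitions with internal linkage),
 * so outer_policy can point at tables that are defined further down. */
static const struct policy middle_policy[3];
static const struct policy inner_policy[2];

static const struct policy outer_policy[2] = {
	[1] = { .type = TYPE_NESTED_ARRAY, .nested = middle_policy,
		.nested_max = 2 },
};

static const struct policy middle_policy[3] = {
	[1] = { .type = TYPE_U32 },
	[2] = { .type = TYPE_NESTED_ARRAY, .nested = inner_policy,
		.nested_max = 1 },
};

static const struct policy inner_policy[2] = {
	[1] = { .type = TYPE_U32 },
};
```

The alternative is to reorder the tables innermost-first. That also compiles, but it produces a larger diff, which is the trade-off the reply mentions.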
-- 
pw-bot: changes-requested

From ast at fiberby.net Fri Oct 31 16:07:14 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Fri, 31 Oct 2025 16:07:14 -0000
Subject: [PATCH net-next v2 00/11] wireguard: netlink: ynl conversion
Message-ID: <20251031160539.1701943-1-ast@fiberby.net>

This series completes the implementation of YNL for wireguard,
as previously announced[1].

This series consists of 5 parts:
1) Patch 01-03 - Misc. changes
2) Patch 04    - Add YNL specification for wireguard
3) Patch 05-07 - Transition to a generated UAPI header
4) Patch 08    - Add a sample program for the generated C library
5) Patch 09-11 - Transition to generated netlink policy code

The main benefit of having a YNL specification is unlocked after the
first 2 parts. The RFC version already seems to have spawned a new Rust
netlink binding[2] that uses wireguard as its main example.

Parts 3 and 5 validate that the specification is complete and aligned.
The generated code might have a few warts, but they don't matter too
much, and are mostly a transitional problem[3].

Part 4 is possible after part 2, but is ordered after part 3, as it
needs to duplicate the UAPI header in tools/include.
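The sample program in part 4 (tools/net/ynl/samples/wireguard.c) uses strtol() in build_request() to decide whether its argument is an interface index or an interface name. It checks errno after the call without clearing it first. strtol() sets errno on overflow but never resets it, so a stale value could make a numeric argument be treated as a name. The result is also a long, which may not fit in an int. Below is a minimal sketch of a stricter parser, not the sample's actual code. parse_ifindex() is a hypothetical helper name.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Return the interface index encoded in arg, or 0 if arg should be
 * treated as an interface name instead. Hypothetical helper.
 */
static int parse_ifindex(const char *arg)
{
	char *endptr;
	long val;

	errno = 0;	/* strtol() sets errno on overflow but never clears it */
	val = strtol(arg, &endptr, 0);
	if (endptr == arg || *endptr != '\0' || errno != 0)
		return 0;	/* not a complete number: use arg as a name */
	if (val <= 0 || val > INT_MAX)
		return 0;	/* ifindex values are positive ints */
	return (int)val;
}
```

A caller would pass a positive result to wireguard_get_device_req_set_ifindex() and otherwise fall back to wireguard_get_device_req_set_ifname(), as the sample already does.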
For the non-generated kernel C code, the diff stat looks like this:

$ git diff --stat net-next/main..wg-ynl include/ drivers/ \
        ':(exclude)*netlink_gen*'
 drivers/net/wireguard/Makefile  |   1 +
 drivers/net/wireguard/netlink.c |  70 +++---------
 include/uapi/linux/wireguard.h  | 190 ++++++--------------------------
 3 files changed, 47 insertions(+), 214 deletions(-)

[1] [PATCH net 0/4] tools: ynl-gen: misc fixes + wireguard ynl plan
    https://lore.kernel.org/r/20250901145034.525518-1-ast at fiberby.net/
[2] https://github.com/one-d-wide/netlink-bindings/
[3] https://lore.kernel.org/r/20251014123201.6ecfd146 at kernel.org/

---
v2:
- Added missing forward declaration.

v1: https://lore.kernel.org/r/20251029205123.286115-1-ast at fiberby.net/
- Policy argument to nla_parse_nested() changed to NULL (thanks Johannes).
- Added attr-cnt-name to the spec, to reduce the diff a bit.
- Refined the doc in the spec a bit.
- Reworded commit messages a bit.
- Reordered the patches, and reduced the series from 14 to 11 patches.
RFC: https://lore.kernel.org/r/20250904-wg-ynl-rfc at fiberby.net/

Asbjørn Sloth Tønnesen (11):
  wireguard: netlink: validate nested arrays in policy
  wireguard: netlink: use WG_KEY_LEN in policies
  wireguard: netlink: enable strict genetlink validation
  netlink: specs: add specification for wireguard
  uapi: wireguard: move enum wg_cmd
  uapi: wireguard: move flag enums
  uapi: wireguard: generate header with ynl-gen
  tools: ynl: add sample for wireguard
  wireguard: netlink: convert to split ops
  wireguard: netlink: rename netlink handlers
  wireguard: netlink: generate netlink code

 Documentation/netlink/specs/wireguard.yaml | 307 +++++++++++++++++++++
 MAINTAINERS                                |   3 +
 drivers/net/wireguard/Makefile             |   1 +
 drivers/net/wireguard/netlink.c            |  70 +----
 drivers/net/wireguard/netlink_gen.c        |  77 ++++++
 drivers/net/wireguard/netlink_gen.h        |  29 ++
 include/uapi/linux/wireguard.h             | 190 +++----------
 tools/include/uapi/linux/wireguard.h       |  79 ++++++
 tools/net/ynl/samples/.gitignore           |   1 +
 tools/net/ynl/samples/wireguard.c          | 104 +++++++
 10 files changed, 647 insertions(+), 214 deletions(-)
 create mode 100644 Documentation/netlink/specs/wireguard.yaml
 create mode 100644 drivers/net/wireguard/netlink_gen.c
 create mode 100644 drivers/net/wireguard/netlink_gen.h
 create mode 100644 tools/include/uapi/linux/wireguard.h
 create mode 100644 tools/net/ynl/samples/wireguard.c

-- 
2.51.0

From ast at fiberby.net Fri Oct 31 16:07:15 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Fri, 31 Oct 2025 16:07:15 -0000
Subject: [PATCH net-next v2 06/11] uapi: wireguard: move flag enums
In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net>
References: <20251031160539.1701943-1-ast@fiberby.net>
Message-ID: <20251031160539.1701943-7-ast@fiberby.net>

Move the wg*_flag enums so that they are defined above the attribute
set enums, where ynl-gen would place them.

This is an incremental step towards adopting a UAPI header generated
by ynl-gen.
This is split out to keep the patches readable.

This is a trivial patch with no behavioural changes intended.

Signed-off-by: Asbjørn Sloth Tønnesen
---
 include/uapi/linux/wireguard.h | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h
index 3ebfffd61269..a2815f4f2910 100644
--- a/include/uapi/linux/wireguard.h
+++ b/include/uapi/linux/wireguard.h
@@ -15,6 +15,20 @@ enum wgdevice_flag {
 	WGDEVICE_F_REPLACE_PEERS = 1U << 0,
 	__WGDEVICE_F_ALL = WGDEVICE_F_REPLACE_PEERS
 };
+
+enum wgpeer_flag {
+	WGPEER_F_REMOVE_ME = 1U << 0,
+	WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1,
+	WGPEER_F_UPDATE_ONLY = 1U << 2,
+	__WGPEER_F_ALL = WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS |
+			 WGPEER_F_UPDATE_ONLY
+};
+
+enum wgallowedip_flag {
+	WGALLOWEDIP_F_REMOVE_ME = 1U << 0,
+	__WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME
+};
+
 enum wgdevice_attribute {
 	WGDEVICE_A_UNSPEC,
 	WGDEVICE_A_IFINDEX,
@@ -29,13 +43,6 @@ enum wgdevice_attribute {
 };
 #define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1)
 
-enum wgpeer_flag {
-	WGPEER_F_REMOVE_ME = 1U << 0,
-	WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1,
-	WGPEER_F_UPDATE_ONLY = 1U << 2,
-	__WGPEER_F_ALL = WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS |
-			 WGPEER_F_UPDATE_ONLY
-};
 enum wgpeer_attribute {
 	WGPEER_A_UNSPEC,
 	WGPEER_A_PUBLIC_KEY,
@@ -52,10 +59,6 @@ enum wgpeer_attribute {
 };
 #define WGPEER_A_MAX (__WGPEER_A_LAST - 1)
 
-enum wgallowedip_flag {
-	WGALLOWEDIP_F_REMOVE_ME = 1U << 0,
-	__WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME
-};
 enum wgallowedip_attribute {
 	WGALLOWEDIP_A_UNSPEC,
 	WGALLOWEDIP_A_FAMILY,
-- 
2.51.0

From ast at fiberby.net Fri Oct 31 16:07:15 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Fri, 31 Oct 2025 16:07:15 -0000
Subject: [PATCH net-next v2 01/11] wireguard: netlink: validate nested arrays in policy
In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net>
References: <20251031160539.1701943-1-ast@fiberby.net>
Message-ID: <20251031160539.1701943-2-ast@fiberby.net>

Use NLA_POLICY_NESTED_ARRAY() to perform nested array validation in
the policy validation step.

The nested policy was already enforced through nla_parse_nested();
however, extack wasn't passed previously.

Signed-off-by: Asbjørn Sloth Tønnesen
---
 drivers/net/wireguard/netlink.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index 67f962eb8b46..e4416f23d427 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -18,6 +18,8 @@
 #include
 
 static struct genl_family genl_family;
+static const struct nla_policy peer_policy[WGPEER_A_MAX + 1];
+static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1];
 
 static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 	[WGDEVICE_A_IFINDEX] = { .type = NLA_U32 },
@@ -27,7 +29,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 	[WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL),
 	[WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 },
 	[WGDEVICE_A_FWMARK] = { .type = NLA_U32 },
-	[WGDEVICE_A_PEERS] = { .type = NLA_NESTED }
+	[WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy),
 };
 
 static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
@@ -39,7 +41,7 @@ static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
 	[WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)),
 	[WGPEER_A_RX_BYTES] = { .type = NLA_U64 },
 	[WGPEER_A_TX_BYTES] = { .type = NLA_U64 },
-	[WGPEER_A_ALLOWEDIPS] = { .type = NLA_NESTED },
+	[WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(allowedip_policy),
 	[WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 }
 };
 
@@ -467,7 +469,7 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs)
 
 		nla_for_each_nested(attr, attrs[WGPEER_A_ALLOWEDIPS], rem) {
 			ret = nla_parse_nested(allowedip, WGALLOWEDIP_A_MAX,
-					       attr, allowedip_policy, NULL);
+					       attr, NULL, NULL);
 			if (ret < 0)
 				goto out;
 			ret = set_allowedip(peer, allowedip);
@@ -593,7 +595,7 @@ static int wg_set_device(struct sk_buff *skb, struct genl_info *info)
 
 		nla_for_each_nested(attr, info->attrs[WGDEVICE_A_PEERS], rem) {
 			ret = nla_parse_nested(peer, WGPEER_A_MAX, attr,
-					       peer_policy, NULL);
+					       NULL, NULL);
 			if (ret < 0)
 				goto out;
 			ret = set_peer(wg, peer);
-- 
2.51.0

From ast at fiberby.net Fri Oct 31 16:07:15 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Fri, 31 Oct 2025 16:07:15 -0000
Subject: [PATCH net-next v2 07/11] uapi: wireguard: generate header with ynl-gen
In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net>
References: <20251031160539.1701943-1-ast@fiberby.net>
Message-ID: <20251031160539.1701943-8-ast@fiberby.net>

Use ynl-gen to generate the UAPI header for wireguard.

The cosmetic changes in this patch confirm that the spec is aligned
with the implementation, and ensure that it stays in sync.

Changes in generated header:
* Trivial include guard rename.
* Trivial white space changes.
* Trivial comment changes.
* Precompute bitflags in ynl-gen (see [1]).
* Drop __*_F_ALL constants (see [1]).

[1] https://lore.kernel.org/r/20251014123201.6ecfd146 at kernel.org/

No behavioural changes intended.
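With the __*_F_ALL constants dropped, the policies now use the precomputed masks 0x1, 0x7 and 0x1. The sketch below checks at compile time that those masks equal the OR of the generated flag values. It also shows the check an NLA_POLICY_MASK() policy performs, which is to reject any value carrying a bit outside the mask. The flag enums are copied from the generated header. flags_valid() is a hypothetical helper, not a kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as emitted by ynl-gen in the generated UAPI header. */
enum wgdevice_flag {
	WGDEVICE_F_REPLACE_PEERS = 1,
};

enum wgpeer_flag {
	WGPEER_F_REMOVE_ME = 1,
	WGPEER_F_REPLACE_ALLOWEDIPS = 2,
	WGPEER_F_UPDATE_ONLY = 4,
};

enum wgallowedip_flag {
	WGALLOWEDIP_F_REMOVE_ME = 1,
};

/* The masks hard-coded in the policies must cover exactly these flags. */
_Static_assert(WGDEVICE_F_REPLACE_PEERS == 0x1, "device flag mask");
_Static_assert((WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS |
		WGPEER_F_UPDATE_ONLY) == 0x7, "peer flag mask");
_Static_assert(WGALLOWEDIP_F_REMOVE_ME == 0x1, "allowedip flag mask");

/* Mask check in the style of NLA_POLICY_MASK(): unknown bits are invalid. */
static bool flags_valid(uint32_t value, uint32_t mask)
{
	return (value & ~mask) == 0;
}
```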
Signed-off-by: Asbjørn Sloth Tønnesen
---
 drivers/net/wireguard/netlink.c |  6 +++---
 include/uapi/linux/wireguard.h  | 37 ++++++++++++++++-----------------
 2 files changed, 21 insertions(+), 22 deletions(-)

diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c
index 682678d24a9f..f9bed135000f 100644
--- a/drivers/net/wireguard/netlink.c
+++ b/drivers/net/wireguard/netlink.c
@@ -26,7 +26,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 	[WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 },
 	[WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
 	[WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
-	[WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL),
+	[WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1),
 	[WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 },
 	[WGDEVICE_A_FWMARK] = { .type = NLA_U32 },
 	[WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy),
@@ -35,7 +35,7 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = {
 static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = {
 	[WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
 	[WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN),
-	[WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGPEER_F_ALL),
+	[WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7),
 	[WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)),
 	[WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 },
 	[WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)),
@@ -49,7 +49,7 @@ static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = {
 	[WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 },
 	[WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct in_addr)),
 	[WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 },
-	[WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGALLOWEDIP_F_ALL),
+	[WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1),
 };
 
 static struct wg_device *lookup_interface(struct nlattr **attrs,
diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h
index a2815f4f2910..dc3924d0c552 100644
--- a/include/uapi/linux/wireguard.h
+++ b/include/uapi/linux/wireguard.h
@@ -1,32 +1,28 @@
-/* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */
-/*
- * Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved.
- */
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/wireguard.yaml */
+/* YNL-GEN uapi header */
 
-#ifndef _WG_UAPI_WIREGUARD_H
-#define _WG_UAPI_WIREGUARD_H
+#ifndef _UAPI_LINUX_WIREGUARD_H
+#define _UAPI_LINUX_WIREGUARD_H
 
-#define WG_GENL_NAME "wireguard"
-#define WG_GENL_VERSION 1
+#define WG_GENL_NAME	"wireguard"
+#define WG_GENL_VERSION	1
 
-#define WG_KEY_LEN 32
+#define WG_KEY_LEN	32
 
 enum wgdevice_flag {
-	WGDEVICE_F_REPLACE_PEERS = 1U << 0,
-	__WGDEVICE_F_ALL = WGDEVICE_F_REPLACE_PEERS
+	WGDEVICE_F_REPLACE_PEERS = 1,
 };
 
 enum wgpeer_flag {
-	WGPEER_F_REMOVE_ME = 1U << 0,
-	WGPEER_F_REPLACE_ALLOWEDIPS = 1U << 1,
-	WGPEER_F_UPDATE_ONLY = 1U << 2,
-	__WGPEER_F_ALL = WGPEER_F_REMOVE_ME | WGPEER_F_REPLACE_ALLOWEDIPS |
-			 WGPEER_F_UPDATE_ONLY
+	WGPEER_F_REMOVE_ME = 1,
+	WGPEER_F_REPLACE_ALLOWEDIPS = 2,
+	WGPEER_F_UPDATE_ONLY = 4,
 };
 
 enum wgallowedip_flag {
-	WGALLOWEDIP_F_REMOVE_ME = 1U << 0,
-	__WGALLOWEDIP_F_ALL = WGALLOWEDIP_F_REMOVE_ME
+	WGALLOWEDIP_F_REMOVE_ME = 1,
 };
 
 enum wgdevice_attribute {
@@ -39,6 +35,7 @@ enum wgdevice_attribute {
 	WGDEVICE_A_LISTEN_PORT,
 	WGDEVICE_A_FWMARK,
 	WGDEVICE_A_PEERS,
+
 	__WGDEVICE_A_LAST
 };
 #define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1)
@@ -55,6 +52,7 @@ enum wgpeer_attribute {
 	WGPEER_A_TX_BYTES,
 	WGPEER_A_ALLOWEDIPS,
 	WGPEER_A_PROTOCOL_VERSION,
+
 	__WGPEER_A_LAST
 };
 #define WGPEER_A_MAX (__WGPEER_A_LAST - 1)
@@ -65,6 +63,7 @@ enum wgallowedip_attribute {
 	WGALLOWEDIP_A_IPADDR,
 	WGALLOWEDIP_A_CIDR_MASK,
 	WGALLOWEDIP_A_FLAGS,
+
 	__WGALLOWEDIP_A_LAST
 };
 #define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1)
@@ -77,4 +76,4 @@ enum wg_cmd {
 };
 #define WG_CMD_MAX (__WG_CMD_MAX - 1)
 
-#endif /* _WG_UAPI_WIREGUARD_H */
+#endif /* _UAPI_LINUX_WIREGUARD_H */
-- 
2.51.0

From ast at fiberby.net Fri Oct 31 16:07:15 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Fri, 31 Oct 2025 16:07:15 -0000
Subject: [PATCH net-next v2 08/11] tools: ynl: add sample for wireguard
In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net>
References: <20251031160539.1701943-1-ast@fiberby.net>
Message-ID: <20251031160539.1701943-9-ast@fiberby.net>

Add a sample application for wireguard using the generated C library.
The main benefit of this is to exercise the generated library, which
might be useful for future selftests.

The UAPI header is copied to tools/include/uapi/; when the header
changes, ynl-gen will regenerate both copies.

Example:
  $ make -C tools/net/ynl/lib
  $ make -C tools/net/ynl/generated
  $ make -C tools/net/ynl/samples wireguard
  $ ./tools/net/ynl/samples/wireguard
  usage: ./tools/net/ynl/samples/wireguard
  $ sudo ./tools/net/ynl/samples/wireguard wg-test
  Interface 3: wg-test
      Peer 6adfb183a4a2c94a2f92dab5ade762a4788[...]:
          Data: rx: 42 / tx: 42 bytes
          Allowed IPs:
              0.0.0.0/0
              ::/0

Signed-off-by: Asbjørn Sloth Tønnesen
---
 MAINTAINERS                          |   2 +
 tools/include/uapi/linux/wireguard.h |  79 ++++++++++++++++++++
 tools/net/ynl/samples/.gitignore     |   1 +
 tools/net/ynl/samples/wireguard.c    | 104 +++++++++++++++++++++++++++
 4 files changed, 186 insertions(+)
 create mode 100644 tools/include/uapi/linux/wireguard.h
 create mode 100644 tools/net/ynl/samples/wireguard.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 65c71728e2c6..5b1b8cd37124 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27643,6 +27643,8 @@ L:	netdev at vger.kernel.org
 S:	Maintained
 F:	Documentation/netlink/specs/wireguard.yaml
 F:	drivers/net/wireguard/
+F:	tools/include/uapi/linux/wireguard.h
+F:	tools/net/ynl/samples/wireguard.c
 F:	tools/testing/selftests/wireguard/
 
 WISTRON LAPTOP BUTTON DRIVER
diff --git a/tools/include/uapi/linux/wireguard.h b/tools/include/uapi/linux/wireguard.h
new file mode 100644
index 000000000000..dc3924d0c552
--- /dev/null
+++ b/tools/include/uapi/linux/wireguard.h
@@ -0,0 +1,79 @@
+/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */
+/* Do not edit directly, auto-generated from: */
+/*	Documentation/netlink/specs/wireguard.yaml */
+/* YNL-GEN uapi header */
+
+#ifndef _UAPI_LINUX_WIREGUARD_H
+#define _UAPI_LINUX_WIREGUARD_H
+
+#define WG_GENL_NAME	"wireguard"
+#define WG_GENL_VERSION	1
+
+#define WG_KEY_LEN	32
+
+enum wgdevice_flag {
+	WGDEVICE_F_REPLACE_PEERS = 1,
+};
+
+enum wgpeer_flag {
+	WGPEER_F_REMOVE_ME = 1,
+	WGPEER_F_REPLACE_ALLOWEDIPS = 2,
+	WGPEER_F_UPDATE_ONLY = 4,
+};
+
+enum wgallowedip_flag {
+	WGALLOWEDIP_F_REMOVE_ME = 1,
+};
+
+enum wgdevice_attribute {
+	WGDEVICE_A_UNSPEC,
+	WGDEVICE_A_IFINDEX,
+	WGDEVICE_A_IFNAME,
+	WGDEVICE_A_PRIVATE_KEY,
+	WGDEVICE_A_PUBLIC_KEY,
+	WGDEVICE_A_FLAGS,
+	WGDEVICE_A_LISTEN_PORT,
+	WGDEVICE_A_FWMARK,
+	WGDEVICE_A_PEERS,
+
+	__WGDEVICE_A_LAST
+};
+#define WGDEVICE_A_MAX (__WGDEVICE_A_LAST - 1)
+
+enum wgpeer_attribute {
+	WGPEER_A_UNSPEC,
+	WGPEER_A_PUBLIC_KEY,
+	WGPEER_A_PRESHARED_KEY,
+	WGPEER_A_FLAGS,
+	WGPEER_A_ENDPOINT,
+	WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL,
+	WGPEER_A_LAST_HANDSHAKE_TIME,
+	WGPEER_A_RX_BYTES,
+	WGPEER_A_TX_BYTES,
+	WGPEER_A_ALLOWEDIPS,
+	WGPEER_A_PROTOCOL_VERSION,
+
+	__WGPEER_A_LAST
+};
+#define WGPEER_A_MAX (__WGPEER_A_LAST - 1)
+
+enum wgallowedip_attribute {
+	WGALLOWEDIP_A_UNSPEC,
+	WGALLOWEDIP_A_FAMILY,
+	WGALLOWEDIP_A_IPADDR,
+	WGALLOWEDIP_A_CIDR_MASK,
+	WGALLOWEDIP_A_FLAGS,
+
+	__WGALLOWEDIP_A_LAST
+};
+#define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1)
+
+enum wg_cmd {
+	WG_CMD_GET_DEVICE,
+	WG_CMD_SET_DEVICE,
+
+	__WG_CMD_MAX
+};
+#define WG_CMD_MAX (__WG_CMD_MAX - 1)
+
+#endif /* _UAPI_LINUX_WIREGUARD_H */
diff --git a/tools/net/ynl/samples/.gitignore b/tools/net/ynl/samples/.gitignore
index 7f5fca7682d7..09c61e4c18cd 100644
--- a/tools/net/ynl/samples/.gitignore
+++ b/tools/net/ynl/samples/.gitignore
@@ -7,3 +7,4 @@ rt-addr
 rt-link
 rt-route
 tc
+wireguard
diff --git a/tools/net/ynl/samples/wireguard.c b/tools/net/ynl/samples/wireguard.c
new file mode 100644
index 000000000000..43f3551eb101
--- /dev/null
+++ b/tools/net/ynl/samples/wireguard.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0
+#include
+#include
+#include
+#include
+#include
+
+#include "wireguard-user.h"
+
+static void print_allowed_ip(const struct wireguard_wgallowedip *aip)
+{
+	char addr_out[INET6_ADDRSTRLEN];
+
+	if (!inet_ntop(aip->family, aip->ipaddr, addr_out, sizeof(addr_out))) {
+		addr_out[0] = '?';
+		addr_out[1] = '\0';
+	}
+	printf("\t\t\t%s/%u\n", addr_out, aip->cidr_mask);
+}
+
+/* Only printing public key in this demo. For better key formatting,
+ * use the constant-time implementation as found in wireguard-tools.
+ */
+static void print_peer_header(const struct wireguard_wgpeer *peer)
+{
+	unsigned int i;
+	uint8_t *key = peer->public_key;
+	unsigned int len = peer->_len.public_key;
+
+	if (len != 32)
+		return;
+	printf("\tPeer ");
+	for (i = 0; i < len; i++)
+		printf("%02x", key[i]);
+	printf(":\n");
+}
+
+static void print_peer(const struct wireguard_wgpeer *peer)
+{
+	unsigned int i;
+
+	print_peer_header(peer);
+	printf("\t\tData: rx: %llu / tx: %llu bytes\n",
+	       peer->rx_bytes, peer->tx_bytes);
+	printf("\t\tAllowed IPs:\n");
+	for (i = 0; i < peer->_count.allowedips; i++)
+		print_allowed_ip(&peer->allowedips[i]);
+}
+
+static void build_request(struct wireguard_get_device_req *req, char *arg)
+{
+	char *endptr;
+	int ifindex;
+
+	ifindex = strtol(arg, &endptr, 0);
+	if (endptr != arg + strlen(arg) || errno != 0)
+		ifindex = 0;
+	if (ifindex > 0)
+		wireguard_get_device_req_set_ifindex(req, ifindex);
+	else
+		wireguard_get_device_req_set_ifname(req, arg);
+}
+
+int main(int argc, char **argv)
+{
+	struct wireguard_get_device_list *devs;
+	struct wireguard_get_device_req *req;
+	struct ynl_sock *ys;
+
+	if (argc < 2) {
+		fprintf(stderr, "usage: %s \n", argv[0]);
+		return 1;
+	}
+
+	req = wireguard_get_device_req_alloc();
+	build_request(req, argv[1]);
+
+	ys = ynl_sock_create(&ynl_wireguard_family, NULL);
+	if (!ys)
+		return 2;
+
+	devs = wireguard_get_device_dump(ys, req);
+	if (!devs)
+		goto err_close;
+
+	ynl_dump_foreach(devs, d) {
+		unsigned int i;
+
+		printf("Interface %d: %s\n", d->ifindex, d->ifname);
+		for (i = 0; i < d->_count.peers; i++)
+			print_peer(&d->peers[i]);
+	}
+	wireguard_get_device_list_free(devs);
+	wireguard_get_device_req_free(req);
+	ynl_sock_destroy(ys);
+
+	return 0;
+
+err_close:
+	fprintf(stderr, "YNL (%d): %s\n", ys->err.code, ys->err.msg);
+	wireguard_get_device_req_free(req);
+	ynl_sock_destroy(ys);
+	return 3;
+}
-- 
2.51.0

From ast at fiberby.net Fri Oct 31 16:07:15 2025
From: ast at fiberby.net (Asbjørn Sloth Tønnesen)
Date: Fri, 31 Oct 2025 16:07:15 -0000
Subject: [PATCH net-next v2 05/11] uapi: wireguard: move enum wg_cmd
In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net>
References: <20251031160539.1701943-1-ast@fiberby.net>
Message-ID: <20251031160539.1701943-6-ast@fiberby.net>

This patch moves enum wg_cmd to the end of the file, where ynl-gen
would generate it.

This is an incremental step towards adopting a UAPI header generated
by ynl-gen. This is split out to keep the patches readable.

This is a trivial patch with no behavioural changes intended.
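The enum being moved relies on the sentinel-enumerator idiom. __WG_CMD_MAX always sits one past the last command, so WG_CMD_MAX stays correct when a new command is added in front of the sentinel. The sketch below copies enum wg_cmd from the UAPI header and adds a hypothetical example_cmd enum whose third command exists only to show the effect. The strings in wg_cmd_names are illustrative and are not taken from the kernel.

```c
/* enum wg_cmd as it appears in the UAPI header. */
enum wg_cmd {
	WG_CMD_GET_DEVICE,
	WG_CMD_SET_DEVICE,

	__WG_CMD_MAX
};
#define WG_CMD_MAX (__WG_CMD_MAX - 1)

/* Hypothetical extension: a command added before the sentinel moves
 * the MAX value along with it, with no other edits needed. */
enum example_cmd {
	EXAMPLE_CMD_GET,
	EXAMPLE_CMD_SET,
	EXAMPLE_CMD_NEW,	/* hypothetical new command */

	__EXAMPLE_CMD_MAX
};
#define EXAMPLE_CMD_MAX (__EXAMPLE_CMD_MAX - 1)

/* Tables indexed by command can be sized from the MAX value. */
static const char *const wg_cmd_names[WG_CMD_MAX + 1] = {
	[WG_CMD_GET_DEVICE] = "get-device",
	[WG_CMD_SET_DEVICE] = "set-device",
};
```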
Signed-off-by: Asbjørn Sloth Tønnesen --- include/uapi/linux/wireguard.h | 15 ++++++++------- 1 file changed, 8 insertions(+), 7 deletions(-) diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index dee4401e0b5d..3ebfffd61269 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -11,13 +11,6 @@ #define WG_KEY_LEN 32 -enum wg_cmd { - WG_CMD_GET_DEVICE, - WG_CMD_SET_DEVICE, - __WG_CMD_MAX -}; -#define WG_CMD_MAX (__WG_CMD_MAX - 1) - enum wgdevice_flag { WGDEVICE_F_REPLACE_PEERS = 1U << 0, __WGDEVICE_F_ALL = WGDEVICE_F_REPLACE_PEERS @@ -73,4 +66,12 @@ enum wgallowedip_attribute { }; #define WGALLOWEDIP_A_MAX (__WGALLOWEDIP_A_LAST - 1) +enum wg_cmd { + WG_CMD_GET_DEVICE, + WG_CMD_SET_DEVICE, + + __WG_CMD_MAX +}; +#define WG_CMD_MAX (__WG_CMD_MAX - 1) + #endif /* _WG_UAPI_WIREGUARD_H */ -- 2.51.0 From ast at fiberby.net Fri Oct 31 16:07:16 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Fri, 31 Oct 2025 16:07:16 -0000 Subject: [PATCH net-next v2 10/11] wireguard: netlink: rename netlink handlers In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net> References: <20251031160539.1701943-1-ast@fiberby.net> Message-ID: <20251031160539.1701943-11-ast@fiberby.net> Rename netlink handlers to use the naming expected by ynl-gen. This is an incremental step towards adopting netlink command definitions generated by ynl-gen. This is a trivial patch with no behavioural changes intended.
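The new names follow a <family>_nl_<op>_<callback> pattern: the family name, "_nl_", the operation name with dashes turned into underscores, then the callback kind (start, dumpit, done, doit). The hypothetical ynl_handler_name() below only reproduces that pattern as inferred from the four renamed handlers in this diff. It is not code from ynl-gen, and the dashed operation names ("get-device", "set-device") are an assumption about the spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<family>_nl_<op>_<callback>" into buf,
 * turning every '-' into '_'. Returns the name length, or -1 if buf is
 * too small to hold the whole name. */
static int ynl_handler_name(char *buf, size_t len, const char *family,
			    const char *op, const char *callback)
{
	int n;

	assert(buf && family && op && callback);
	n = snprintf(buf, len, "%s_nl_%s_%s", family, op, callback);
	if (n < 0 || (size_t)n >= len)
		return -1;
	/* YNL-style operation names use dashes; C identifiers need underscores. */
	for (char *p = strchr(buf, '-'); p; p = strchr(p + 1, '-'))
		*p = '_';
	return n;
}
```

For example, ("wireguard", "get-device", "dumpit") yields wireguard_nl_get_device_dumpit, matching the renamed dump handler above.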
Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 18 ++++++++++-------- 1 file changed, 10 insertions(+), 8 deletions(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index 7fecc25bd781..ff1549fe55e2 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -199,7 +199,7 @@ get_peer(struct wg_peer *peer, struct sk_buff *skb, struct dump_ctx *ctx) return -EMSGSIZE; } -static int wg_get_device_start(struct netlink_callback *cb) +static int wireguard_nl_get_device_start(struct netlink_callback *cb) { struct wg_device *wg; @@ -210,7 +210,8 @@ static int wg_get_device_start(struct netlink_callback *cb) return 0; } -static int wg_get_device_dump(struct sk_buff *skb, struct netlink_callback *cb) +static int wireguard_nl_get_device_dumpit(struct sk_buff *skb, + struct netlink_callback *cb) { struct wg_peer *peer, *next_peer_cursor; struct dump_ctx *ctx = DUMP_CTX(cb); @@ -304,7 +305,7 @@ static int wg_get_device_dump(struct sk_buff *skb, struct netlink_callback *cb) */ } -static int wg_get_device_done(struct netlink_callback *cb) +static int wireguard_nl_get_device_done(struct netlink_callback *cb) { struct dump_ctx *ctx = DUMP_CTX(cb); @@ -502,7 +503,8 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs) return ret; } -static int wg_set_device(struct sk_buff *skb, struct genl_info *info) +static int wireguard_nl_set_device_doit(struct sk_buff *skb, + struct genl_info *info) { struct wg_device *wg = lookup_interface(info->attrs, skb); u32 flags = 0; @@ -619,15 +621,15 @@ static int wg_set_device(struct sk_buff *skb, struct genl_info *info) static const struct genl_split_ops wireguard_nl_ops[] = { { .cmd = WG_CMD_GET_DEVICE, - .start = wg_get_device_start, - .dumpit = wg_get_device_dump, - .done = wg_get_device_done, + .start = wireguard_nl_get_device_start, + .dumpit = wireguard_nl_get_device_dumpit, + .done = 
wireguard_nl_get_device_done, .policy = device_policy, .maxattr
= WGDEVICE_A_PEERS, .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, }, { .cmd = WG_CMD_SET_DEVICE, - .doit = wg_set_device, + .doit = wireguard_nl_set_device_doit, .policy = device_policy, .maxattr = WGDEVICE_A_PEERS, .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, -- 2.51.0 From ast at fiberby.net Fri Oct 31 16:07:16 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Fri, 31 Oct 2025 16:07:16 -0000 Subject: [PATCH net-next v2 03/11] wireguard: netlink: enable strict genetlink validation In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net> References: <20251031160539.1701943-1-ast@fiberby.net> Message-ID: <20251031160539.1701943-4-ast@fiberby.net> Wireguard is a modern enough genetlink family that it doesn't need resv_start_op. It already had policies in place when it was first merged, and it has never used the reserved field or other things toggled by resv_start_op. wireguard-tools has always used zero-initialized memory and has never touched the reserved field, and neither has any other client I have checked. Closed-source clients are much more likely to use the embeddable library from wireguard-tools than a DIY implementation using uninitialized memory.
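To see what dropping resv_start_op changes, here is a simplified C model of the reserved-field check it gates: commands numbered at or above resv_start_op must arrive with a zero reserved field in the generic netlink header, while older commands are accepted regardless. This is an illustration under that assumption, not the code in net/netlink/genetlink.c. struct genl_hdr_model is a local stand-in for struct genlmsghdr, and check_reserved() and build_hdr() are hypothetical names.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for the generic netlink header layout
 * (struct genlmsghdr): command, version, reserved. */
struct genl_hdr_model {
	uint8_t cmd;
	uint8_t version;
	uint16_t reserved;
};

/* Simplified model: commands at or above resv_start_op must carry a zero
 * reserved field. Returns 0 if accepted, -EINVAL if rejected. */
static int check_reserved(const struct genl_hdr_model *hdr,
			  unsigned int resv_start_op)
{
	if (hdr->cmd >= resv_start_op && hdr->reserved)
		return -EINVAL;
	return 0;
}

/* How a well-behaved client builds the header: start from zeroed memory,
 * so the reserved field can never carry stale bytes. */
static void build_hdr(struct genl_hdr_model *hdr, uint8_t cmd, uint8_t version)
{
	memset(hdr, 0, sizeof(*hdr));
	hdr->cmd = cmd;
	hdr->version = version;
	assert(hdr->reserved == 0);
}
```

With the old initializer, resv_start_op was WG_CMD_SET_DEVICE + 1, i.e. 2, so both WireGuard commands (0 and 1) fell below it and skipped the check. Removing the line leaves the field zero-initialized, so every command is checked. That is harmless for clients that, like build_hdr() in the sketch, start from zeroed memory.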
Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 1 - 1 file changed, 1 deletion(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index db57a74d379b..682678d24a9f 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -633,7 +633,6 @@ static const struct genl_ops genl_ops[] = { static struct genl_family genl_family __ro_after_init = { .ops = genl_ops, .n_ops = ARRAY_SIZE(genl_ops), - .resv_start_op = WG_CMD_SET_DEVICE + 1, .name = WG_GENL_NAME, .version = WG_GENL_VERSION, .maxattr = WGDEVICE_A_MAX, -- 2.51.0 From ast at fiberby.net Fri Oct 31 16:07:16 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Fri, 31 Oct 2025 16:07:16 -0000 Subject: [PATCH net-next v2 02/11] wireguard: netlink: use WG_KEY_LEN in policies In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net> References: <20251031160539.1701943-1-ast@fiberby.net> Message-ID: <20251031160539.1701943-3-ast@fiberby.net> When converting the netlink policies to YNL, the constants used in the policy have to be visible to user space. As NOISE_*_KEY_LEN isn't visible to user space, change the policy to use WG_KEY_LEN, as is also documented in the UAPI header: $ grep WG_KEY_LEN include/uapi/linux/wireguard.h * WGDEVICE_A_PRIVATE_KEY: NLA_EXACT_LEN, len WG_KEY_LEN * WGDEVICE_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN * WGPEER_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN * WGPEER_A_PRESHARED_KEY: NLA_EXACT_LEN, len WG_KEY_LEN [...] Add a couple of BUILD_BUG_ON()s to ensure that they stay in sync. No behavioural changes intended.
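The BUILD_BUG_ON() pair that this patch adds turns any future divergence between the UAPI constant and the Noise constants into a compile error. Here is a user-space sketch of the same idea using C11 static_assert. The 32-byte values are stated as an assumption (Curve25519 and ChaCha20-Poly1305 both use 32-byte keys), and BUILD_BUG_ON_MODEL() and key_attr_len_ok() are hypothetical names.

```c
#include <assert.h>

/* Values assumed for illustration: all three key lengths are 32 bytes. */
#define WG_KEY_LEN 32
#define NOISE_PUBLIC_KEY_LEN 32
#define NOISE_SYMMETRIC_KEY_LEN 32

/* User-space stand-in for the kernel's BUILD_BUG_ON(): compilation fails
 * if cond is true. C11 static_assert fails when its condition is false,
 * hence the negation. */
#define BUILD_BUG_ON_MODEL(cond) static_assert(!(cond), #cond)

BUILD_BUG_ON_MODEL(WG_KEY_LEN != NOISE_PUBLIC_KEY_LEN);
BUILD_BUG_ON_MODEL(WG_KEY_LEN != NOISE_SYMMETRIC_KEY_LEN);

/* Hypothetical policy-style check: a key attribute such as
 * WGPEER_A_PUBLIC_KEY must be exactly WG_KEY_LEN bytes long. */
static int key_attr_len_ok(unsigned int len)
{
	return len == WG_KEY_LEN;
}
```

If someone later changed one of the Noise lengths, the build would stop at the corresponding BUILD_BUG_ON_MODEL() line instead of the policy silently accepting the wrong attribute size.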
Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 11 +++++++---- 1 file changed, 7 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index e4416f23d427..db57a74d379b 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -24,8 +24,8 @@ static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1]; static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { [WGDEVICE_A_IFINDEX] = { .type = NLA_U32 }, [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, - [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN), - [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN), + [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGDEVICE_F_ALL), [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, @@ -33,8 +33,8 @@ static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { }; static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { - [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(NOISE_PUBLIC_KEY_LEN), - [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(NOISE_SYMMETRIC_KEY_LEN), + [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, __WGPEER_F_ALL), [WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)), [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 }, @@ -644,6 +644,9 @@ static struct genl_family genl_family __ro_after_init = { int __init wg_genetlink_init(void) { + BUILD_BUG_ON(WG_KEY_LEN != NOISE_PUBLIC_KEY_LEN); + BUILD_BUG_ON(WG_KEY_LEN != NOISE_SYMMETRIC_KEY_LEN); + return genl_register_family(&genl_family); } -- 2.51.0 From ast at fiberby.net Fri Oct 31 
16:07:16 2025 From: ast at
fiberby.net (Asbjørn Sloth Tønnesen) Date: Fri, 31 Oct 2025 16:07:16 -0000 Subject: [PATCH net-next v2 09/11] wireguard: netlink: convert to split ops In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net> References: <20251031160539.1701943-1-ast@fiberby.net> Message-ID: <20251031160539.1701943-10-ast@fiberby.net> This patch converts wireguard from the legacy struct genl_ops to struct genl_split_ops, applying the same transformation that genl_cmd_full_to_split() would otherwise perform at runtime. WGDEVICE_A_MAX is swapped for WGDEVICE_A_PEERS: while they are currently equivalent, .maxattr should be the maximum attribute that a given command supports, which might not be WGDEVICE_A_MAX. This is an incremental step towards adopting netlink policy code generated by ynl-gen, ensuring that the code and spec are aligned. This is a trivial patch with no behavioural changes intended. Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/netlink.c | 16 +++++++++------- 1 file changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index f9bed135000f..7fecc25bd781 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -616,28 +616,30 @@ static int wg_set_device(struct sk_buff *skb, struct genl_info *info) return ret; } -static const struct genl_ops genl_ops[] = { +static const struct genl_split_ops wireguard_nl_ops[] = { { .cmd = WG_CMD_GET_DEVICE, .start = wg_get_device_start, .dumpit = wg_get_device_dump, .done = wg_get_device_done, - .flags = GENL_UNS_ADMIN_PERM + .policy = device_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, }, { .cmd = WG_CMD_SET_DEVICE, .doit = wg_set_device, - .flags = GENL_UNS_ADMIN_PERM + .policy = device_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, } }; static struct genl_family genl_family __ro_after_init = { - .ops = genl_ops,
- .n_ops = ARRAY_SIZE(genl_ops), + .split_ops = wireguard_nl_ops, + .n_split_ops = ARRAY_SIZE(wireguard_nl_ops), .name = WG_GENL_NAME, .version = WG_GENL_VERSION, - .maxattr = WGDEVICE_A_MAX, .module = THIS_MODULE, - .policy = device_policy, .netnsok = true }; -- 2.51.0 From pmladek at suse.com Tue Oct 14 13:09:13 2025 From: pmladek at suse.com (Petr Mladek) Date: Tue, 14 Oct 2025 13:09:13 -0000 Subject: Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Message-ID: On Tue 2025-10-14 10:49:53, Li,Rongqing wrote: > > > On Tue 2025-10-14 13:23:58, Lance Yang wrote: > > > Thanks for the patch! > > > > > > I noticed the implementation panics only when N tasks are detected > > > within a single scan, because total_hung_task is reset for each > > > check_hung_uninterruptible_tasks() run. > > > > Great catch! > > > > Does it make sense? > > Is is the intended behavior, please? > > > > Yes, this is intended behavior > > > > So some suggestions to align the documentation with the code's > > > behavior below :) > > > > > On 2025/10/12 19:50, lirongqing wrote: > > > > From: Li RongQing > > > > > > > > Currently, when 'hung_task_panic' is enabled, the kernel panics > > > > immediately upon detecting the first hung task. However, some hung > > > > tasks are transient and the system can recover, while others are > > > > persistent and may accumulate progressively. > > > > My understanding is that this patch wanted to do: > > > > + report even temporary stalls > > + panic only when the stall was much longer and likely persistent > > > > Which might make some sense. But the code does something else. > > > > A single task hanging for an extended period may not be a critical > issue, as users might still log into the system to investigate. > However, if multiple tasks hang simultaneously-such as in cases
> However, if multiple tasks hang simultaneously-such as in cases > of I/O hangs caused by disk failures-it could prevent users from > logging in and become a serious problem, and a panic is expected. I see. This another approach and it makes sense as well. An this is much more clear description than the original text. I would also update the subject to something like: hung_task: Panic when there are more than N hung tasks at the same time That said, I think that both approaches make sense. Your approach would trigger the panic when many processes are stuck. Note that it still might be a transient state. But I agree that the more stuck processes exist the more serious the problem likely is for the heath of the system. My approach would trigger panic when a single process hangs for a long time. It will trigger more likely only when the problem is persistent. The seriousness depends on which particular process get stuck. I am fine with your approach. Just please, make more clear that the number means the number of hung tasks at the same time. And mention the problems to login, ... Best Regards, Petr From lirongqing at baidu.com Wed Oct 15 02:08:55 2025 From: lirongqing at baidu.com (Li,Rongqing) Date: Wed, 15 Oct 2025 02:08:55 -0000 Subject: [????] Re: [????] Re: [PATCH][v3] hung_task: Panic after fixed number of hung tasks In-Reply-To: References: <20251012115035.2169-1-lirongqing@baidu.com> <588c1935-835f-4cab-9679-f31c1e903a9a@linux.dev> Message-ID: > I would also update the subject to something like: > > hung_task: Panic when there are more than N hung tasks at the same > time > Ok, I will update > > > That said, I think that both approaches make sense. > > Your approach would trigger the panic when many processes are stuck. > Note that it still might be a transient state. But I agree that the more stuck > processes exist the more serious the problem likely is for the heath of the > system. 
> > My approach would trigger panic when a single process hangs for a long > time. It will trigger more likely only when the problem is persistent. The > seriousness depends on which particular process get stuck. > Yes, both are reasonable requirements, and I will leave it to you or anyone else interested in implementing it. Thanks -Li. > I am fine with your approach. Just please, make more clear that the number > means the number of hung tasks at the same time. > And mention the problems to login, ... > > Best Regards, > Petr From lirongqing at baidu.com Wed Oct 15 06:39:04 2025 From: lirongqing at baidu.com (lirongqing) Date: Wed, 15 Oct 2025 06:39:04 -0000 Subject: [PATCH][v4] hung_task: Panic when there are more than N hung tasks at the same time Message-ID: <20251015063615.2632-1-lirongqing@baidu.com> From: Li RongQing Currently, when 'hung_task_panic' is enabled, the kernel panics immediately upon detecting the first hung task. However, some hung tasks are transient and allow system recovery, while persistent hangs should trigger a panic when accumulating beyond a threshold. Extend the 'hung_task_panic' sysctl to accept a threshold value specifying the number of hung tasks that must be detected before triggering a kernel panic. This provides finer control for environments where transient hangs may occur but persistent hangs should be fatal. The sysctl now accepts: - 0: don't panic (maintains original behavior) - 1: panic on first hung task (maintains original behavior) - N > 1: panic after N hung tasks are detected in a single scan This maintains backward compatibility while providing flexibility for different hang scenarios. Signed-off-by: Li RongQing Cc: Andrew Jeffery Cc: Anshuman Khandual Cc: Arnd Bergmann Cc: David Hildenbrand Cc: Florian Wesphal Cc: Jakub Kacinski Cc: Jason A. Donenfeld Cc: Joel Granados Cc: Joel Stanley Cc: Jonathan Corbet Cc: Kees Cook Cc: Lance Yang Cc: Liam Howlett Cc: Lorenzo Stoakes Cc: "Masami Hiramatsu (Google)" Cc: "Paul E .
McKenney" Cc: Pawan Gupta Cc: Petr Mladek Cc: Phil Auld Cc: Randy Dunlap Cc: Russell King Cc: Shuah Khan Cc: Simon Horman Cc: Stanislav Fomichev Cc: Steven Rostedt --- diff with v3: comments modification, suggested by Lance, Masami, Randy and Petr diff with v2: do not add a new sysctl, extend hung_task_panic, suggested by Kees Cook Documentation/admin-guide/kernel-parameters.txt | 20 +++++++++++++------- Documentation/admin-guide/sysctl/kernel.rst | 9 +++++---- arch/arm/configs/aspeed_g5_defconfig | 2 +- kernel/configs/debug.config | 2 +- kernel/hung_task.c | 15 ++++++++++----- lib/Kconfig.debug | 9 +++++---- tools/testing/selftests/wireguard/qemu/kernel.config | 2 +- 7 files changed, 36 insertions(+), 23 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index a51ab46..492f0bc 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -1992,14 +1992,20 @@ the added memory block itself do not be affected. hung_task_panic= - [KNL] Should the hung task detector generate panics. - Format: 0 | 1 + [KNL] Number of hung tasks to trigger kernel panic. + Format: + + When set to a non-zero value, a kernel panic will be triggered if + the number of detected hung tasks reaches this value. + + 0: don't panic + 1: panic immediately on first hung task + N: panic after N hung tasks are detected in a single scan - A value of 1 instructs the kernel to panic when a - hung task is detected. The default value is controlled - by the CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time - option. The value selected by this boot parameter can - be changed later by the kernel.hung_task_panic sysctl. + The default value is controlled by the + CONFIG_BOOTPARAM_HUNG_TASK_PANIC build-time option. The value + selected by this boot parameter can be changed later by the + kernel.hung_task_panic sysctl. 
hvc_iucv= [S390] Number of z/VM IUCV hypervisor console (HVC) terminal devices. Valid values: 0..8 diff --git a/Documentation/admin-guide/sysctl/kernel.rst b/Documentation/admin-guide/sysctl/kernel.rst index f3ee807..0065a55 100644 --- a/Documentation/admin-guide/sysctl/kernel.rst +++ b/Documentation/admin-guide/sysctl/kernel.rst @@ -397,13 +397,14 @@ a hung task is detected. hung_task_panic =============== -Controls the kernel's behavior when a hung task is detected. +When set to a non-zero value, a kernel panic will be triggered if the +number of hung tasks found during a single scan reaches this value. This file shows up if ``CONFIG_DETECT_HUNG_TASK`` is enabled. -= ================================================= += ======================================================= 0 Continue operation. This is the default behavior. -1 Panic immediately. -= ================================================= +N Panic when N hung tasks are found during a single scan. += ======================================================= hung_task_check_count diff --git a/arch/arm/configs/aspeed_g5_defconfig b/arch/arm/configs/aspeed_g5_defconfig index 61cee1e..c3b0d5f 100644 --- a/arch/arm/configs/aspeed_g5_defconfig +++ b/arch/arm/configs/aspeed_g5_defconfig @@ -308,7 +308,7 @@ CONFIG_PANIC_ON_OOPS=y CONFIG_PANIC_TIMEOUT=-1 CONFIG_SOFTLOCKUP_DETECTOR=y CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 CONFIG_WQ_WATCHDOG=y # CONFIG_SCHED_DEBUG is not set CONFIG_FUNCTION_TRACER=y diff --git a/kernel/configs/debug.config b/kernel/configs/debug.config index e81327d..9f6ab7d 100644 --- a/kernel/configs/debug.config +++ b/kernel/configs/debug.config @@ -83,7 +83,7 @@ CONFIG_SLUB_DEBUG_ON=y # # Debug Oops, Lockups and Hangs # -# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=0 # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set CONFIG_DEBUG_ATOMIC_SLEEP=y CONFIG_DETECT_HUNG_TASK=y diff --git 
a/kernel/hung_task.c b/kernel/hung_task.c index b2c1f14..84b4b04 100644 --- a/kernel/hung_task.c +++ b/kernel/hung_task.c @@ -81,7 +81,7 @@ static unsigned int __read_mostly sysctl_hung_task_all_cpu_backtrace; * hung task is detected: */ static unsigned int __read_mostly sysctl_hung_task_panic = - IS_ENABLED(CONFIG_BOOTPARAM_HUNG_TASK_PANIC); + CONFIG_BOOTPARAM_HUNG_TASK_PANIC; static int hung_task_panic(struct notifier_block *this, unsigned long event, void *ptr) @@ -218,8 +218,11 @@ static inline void debug_show_blocker(struct task_struct *task, unsigned long ti } #endif -static void check_hung_task(struct task_struct *t, unsigned long timeout) +static void check_hung_task(struct task_struct *t, unsigned long timeout, + unsigned long prev_detect_count) { + unsigned long total_hung_task; + if (!task_is_hung(t, timeout)) return; @@ -229,9 +232,10 @@ static void check_hung_task(struct task_struct *t, unsigned long timeout) */ sysctl_hung_task_detect_count++; + total_hung_task = sysctl_hung_task_detect_count - prev_detect_count; trace_sched_process_hang(t); - if (sysctl_hung_task_panic) { + if (sysctl_hung_task_panic && total_hung_task >= sysctl_hung_task_panic) { console_verbose(); hung_task_show_lock = true; hung_task_call_panic = true; @@ -300,6 +304,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) int max_count = sysctl_hung_task_check_count; unsigned long last_break = jiffies; struct task_struct *g, *t; + unsigned long prev_detect_count = sysctl_hung_task_detect_count; /* * If the system crashed already then all bets are off, @@ -320,7 +325,7 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout) last_break = jiffies; } - check_hung_task(t, timeout); + check_hung_task(t, timeout, prev_detect_count); } unlock: rcu_read_unlock(); @@ -389,7 +394,7 @@ static const struct ctl_table hung_task_sysctls[] = { .mode = 0644, .proc_handler = proc_dointvec_minmax, .extra1 = SYSCTL_ZERO, - .extra2 = SYSCTL_ONE, + .extra2 = 
SYSCTL_INT_MAX, }, { .procname = "hung_task_check_count", diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug index 3034e294..3976c90 100644 --- a/lib/Kconfig.debug +++ b/lib/Kconfig.debug @@ -1258,12 +1258,13 @@ config DEFAULT_HUNG_TASK_TIMEOUT Keeping the default should be fine in most cases. config BOOTPARAM_HUNG_TASK_PANIC - bool "Panic (Reboot) On Hung Tasks" + int "Number of hung tasks to trigger kernel panic" depends on DETECT_HUNG_TASK + default 0 help - Say Y here to enable the kernel to panic on "hung tasks", - which are bugs that cause the kernel to leave a task stuck - in uninterruptible "D" state. + When set to a non-zero value, a kernel panic will be triggered + if the number of hung tasks found during a single scan reaches + this value. The panic can be used in combination with panic_timeout, to cause the system to reboot automatically after a diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config index 936b18b..0504c11 100644 --- a/tools/testing/selftests/wireguard/qemu/kernel.config +++ b/tools/testing/selftests/wireguard/qemu/kernel.config @@ -81,7 +81,7 @@ CONFIG_WQ_WATCHDOG=y CONFIG_DETECT_HUNG_TASK=y CONFIG_BOOTPARAM_HARDLOCKUP_PANIC=y CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC=y -CONFIG_BOOTPARAM_HUNG_TASK_PANIC=y +CONFIG_BOOTPARAM_HUNG_TASK_PANIC=1 CONFIG_PANIC_TIMEOUT=-1 CONFIG_STACKTRACE=y CONFIG_EARLY_PRINTK=y -- 2.9.4 From jesperjuhl76 at gmail.com Tue Oct 21 02:09:09 2025 From: jesperjuhl76 at gmail.com (Jesper Juhl) Date: Tue, 21 Oct 2025 02:09:09 -0000 Subject: [PATCH] Fix up 'make versioncheck' issues Message-ID: >From d2e411b4cd37b1936a30d130e2b21e37e62e0cfb Mon Sep 17 00:00:00 2001 From: Jesper Juhl Date: Tue, 21 Oct 2025 03:51:21 +0200 Subject: [PATCH] [PATCH] Fix up 'make versioncheck' issues 'make versioncheck' currently flags a few files that include the kernel version header without needing it, and one file that needs it but doesn't include it. This patch fixes that up.
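Here is a rough C model of the two conditions the commit message describes. It assumes, as scripts/checkversion.pl does, that the version header is <linux/version.h> and that a file "uses" it when it mentions LINUX_VERSION_CODE or KERNEL_VERSION. version_check() and enum vc_result are hypothetical names. The sketch does a plain substring search and does not try to skip comments or string literals.

```c
#include <assert.h>
#include <string.h>

/* Outcome of the simplified check. */
enum vc_result {
	VC_OK,
	VC_NEED_HEADER,   /* uses version macros but lacks the include */
	VC_HEADER_UNUSED, /* includes the header but never uses it */
};

/* Simplified model of the versioncheck rule applied to one file's text. */
static enum vc_result version_check(const char *src)
{
	int includes, uses;

	assert(src != NULL);
	includes = strstr(src, "#include <linux/version.h>") != NULL;
	uses = strstr(src, "LINUX_VERSION_CODE") != NULL ||
	       strstr(src, "KERNEL_VERSION") != NULL;

	if (uses && !includes)
		return VC_NEED_HEADER;
	if (includes && !uses)
		return VC_HEADER_UNUSED;
	return VC_OK;
}
```

The patch below then fixes both classes: it drops an include from files in the VC_HEADER_UNUSED situation and adds one to the file in the VC_NEED_HEADER situation.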
Signed-Off-By: Jesper Juhl --- samples/bpf/spintest.bpf.c | 1 - tools/lib/bpf/bpf_helpers.h | 2 ++ tools/testing/selftests/bpf/progs/dev_cgroup.c | 1 - tools/testing/selftests/bpf/progs/netcnt_prog.c | 2 -- tools/testing/selftests/bpf/progs/test_map_lock.c | 1 - tools/testing/selftests/bpf/progs/test_send_signal_kern.c | 1 - tools/testing/selftests/bpf/progs/test_spin_lock.c | 1 - tools/testing/selftests/bpf/progs/test_tcp_estats.c | 1 - tools/testing/selftests/wireguard/qemu/init.c | 1 - 9 files changed, 2 insertions(+), 9 deletions(-) diff --git a/samples/bpf/spintest.bpf.c b/samples/bpf/spintest.bpf.c index cba5a9d507831..6278f6d0b731f 100644 --- a/samples/bpf/spintest.bpf.c +++ b/samples/bpf/spintest.bpf.c @@ -5,7 +5,6 @@ * License as published by the Free Software Foundation. */ #include "vmlinux.h" -#include #include #include diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h index 80c0285406561..393ce1063a977 100644 --- a/tools/lib/bpf/bpf_helpers.h +++ b/tools/lib/bpf/bpf_helpers.h @@ -2,6 +2,8 @@ #ifndef __BPF_HELPERS__ #define __BPF_HELPERS__ +#include + /* * Note that bpf programs need to include either * vmlinux.h (auto-generated from BTF) or linux/types.h diff --git a/tools/testing/selftests/bpf/progs/dev_cgroup.c b/tools/testing/selftests/bpf/progs/dev_cgroup.c index c1dfbd2b56fc9..4c4e747bf827a 100644 --- a/tools/testing/selftests/bpf/progs/dev_cgroup.c +++ b/tools/testing/selftests/bpf/progs/dev_cgroup.c @@ -6,7 +6,6 @@ */ #include -#include #include SEC("cgroup/dev") diff --git a/tools/testing/selftests/bpf/progs/netcnt_prog.c b/tools/testing/selftests/bpf/progs/netcnt_prog.c index f9ef8aee56f16..3cf6b7a27a34a 100644 --- a/tools/testing/selftests/bpf/progs/netcnt_prog.c +++ b/tools/testing/selftests/bpf/progs/netcnt_prog.c @@ -1,7 +1,5 @@ // SPDX-License-Identifier: GPL-2.0 #include -#include - #include #include "netcnt_common.h" diff --git a/tools/testing/selftests/bpf/progs/test_map_lock.c 
b/tools/testing/selftests/bpf/progs/test_map_lock.c index 1c02511b73cdb..982bdbf0dba6b 100644 --- a/tools/testing/selftests/bpf/progs/test_map_lock.c +++ b/tools/testing/selftests/bpf/progs/test_map_lock.c @@ -1,7 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 // Copyright (c) 2019 Facebook #include -#include #include #define VAR_NUM 16 diff --git a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c index 176a355e30624..e70b191162359 100644 --- a/tools/testing/selftests/bpf/progs/test_send_signal_kern.c +++ b/tools/testing/selftests/bpf/progs/test_send_signal_kern.c @@ -1,7 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 // Copyright (c) 2019 Facebook #include -#include #include struct task_struct *bpf_task_from_pid(int pid) __ksym; diff --git a/tools/testing/selftests/bpf/progs/test_spin_lock.c b/tools/testing/selftests/bpf/progs/test_spin_lock.c index d8d77bdffd3d2..9bcee268f828b 100644 --- a/tools/testing/selftests/bpf/progs/test_spin_lock.c +++ b/tools/testing/selftests/bpf/progs/test_spin_lock.c @@ -1,7 +1,6 @@ // SPDX-License-Identifier: GPL-2.0 // Copyright (c) 2019 Facebook #include -#include #include #include "bpf_misc.h" diff --git a/tools/testing/selftests/bpf/progs/test_tcp_estats.c b/tools/testing/selftests/bpf/progs/test_tcp_estats.c index e2ae049c2f850..eb0e55ba3f284 100644 --- a/tools/testing/selftests/bpf/progs/test_tcp_estats.c +++ b/tools/testing/selftests/bpf/progs/test_tcp_estats.c @@ -34,7 +34,6 @@ #include #include #include -#include #include #include diff --git a/tools/testing/selftests/wireguard/qemu/init.c b/tools/testing/selftests/wireguard/qemu/init.c index 3e49924dd77e8..20d8d3192f75c 100644 --- a/tools/testing/selftests/wireguard/qemu/init.c +++ b/tools/testing/selftests/wireguard/qemu/init.c @@ -24,7 +24,6 @@ #include #include #include -#include __attribute__((noreturn)) static void poweroff(void) { -- 2.51.1 From ast at fiberby.net Fri Oct 31 16:07:14 2025 From: ast at 
fiberby.net (Asbjørn Sloth Tønnesen) Date: Fri, 31 Oct 2025 16:07:14 -0000 Subject: [PATCH net-next v2 11/11] wireguard: netlink: generate netlink code In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net> References: <20251031160539.1701943-1-ast@fiberby.net> Message-ID: <20251031160539.1701943-12-ast@fiberby.net> This patch adopts the netlink policy and command definitions generated by ynl-gen, thus completing the conversion to YNL. Given that the old and new policies are functionally identical and the code has just moved to a new file, this serves to verify that the policy in the spec is identical to the previous policy code. The new files are covered by the drivers/net/wireguard/ pattern in MAINTAINERS. No behavioural changes intended. Signed-off-by: Asbjørn Sloth Tønnesen --- drivers/net/wireguard/Makefile | 1 + drivers/net/wireguard/netlink.c | 64 +++--------------------- drivers/net/wireguard/netlink_gen.c | 77 +++++++++++++++++++++++++++++ drivers/net/wireguard/netlink_gen.h | 29 +++++++++++ 4 files changed, 114 insertions(+), 57 deletions(-) create mode 100644 drivers/net/wireguard/netlink_gen.c create mode 100644 drivers/net/wireguard/netlink_gen.h diff --git a/drivers/net/wireguard/Makefile b/drivers/net/wireguard/Makefile index dbe1f8514efc..ae4b479cddbd 100644 --- a/drivers/net/wireguard/Makefile +++ b/drivers/net/wireguard/Makefile @@ -14,4 +14,5 @@ wireguard-y += allowedips.o wireguard-y += ratelimiter.o wireguard-y += cookie.o wireguard-y += netlink.o +wireguard-y += netlink_gen.o obj-$(CONFIG_WIREGUARD) := wireguard.o diff --git a/drivers/net/wireguard/netlink.c b/drivers/net/wireguard/netlink.c index ff1549fe55e2..6a7e522e3a78 100644 --- a/drivers/net/wireguard/netlink.c +++ b/drivers/net/wireguard/netlink.c @@ -9,6 +9,7 @@ #include "socket.h" #include "queueing.h" #include "messages.h" +#include "netlink_gen.h" #include @@ -18,39 +19,6 @@ #include static struct genl_family genl_family; -static const struct nla_policy
peer_policy[WGPEER_A_MAX + 1]; -static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1]; - -static const struct nla_policy device_policy[WGDEVICE_A_MAX + 1] = { - [WGDEVICE_A_IFINDEX] = { .type = NLA_U32 }, - [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = IFNAMSIZ - 1 }, - [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), - [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16 }, - [WGDEVICE_A_FWMARK] = { .type = NLA_U32 }, - [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(peer_policy), -}; - -static const struct nla_policy peer_policy[WGPEER_A_MAX + 1] = { - [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), - [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7), - [WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(sizeof(struct sockaddr)), - [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16 }, - [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(sizeof(struct __kernel_timespec)), - [WGPEER_A_RX_BYTES] = { .type = NLA_U64 }, - [WGPEER_A_TX_BYTES] = { .type = NLA_U64 }, - [WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(allowedip_policy), - [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32 } -}; - -static const struct nla_policy allowedip_policy[WGALLOWEDIP_A_MAX + 1] = { - [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16 }, - [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(sizeof(struct in_addr)), - [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8 }, - [WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), -}; static struct wg_device *lookup_interface(struct nlattr **attrs, struct sk_buff *skb) @@ -199,7 +167,7 @@ get_peer(struct wg_peer *peer, struct sk_buff *skb, struct dump_ctx *ctx) return -EMSGSIZE; } -static int wireguard_nl_get_device_start(struct netlink_callback *cb) +int wireguard_nl_get_device_start(struct netlink_callback *cb) { struct wg_device *wg; @@ -210,8 
+178,8 @@ static int wireguard_nl_get_device_start(struct netlink_callback *cb) return 0; } -static int wireguard_nl_get_device_dumpit(struct sk_buff *skb, - struct netlink_callback *cb) +int wireguard_nl_get_device_dumpit(struct sk_buff *skb, + struct netlink_callback *cb) { struct wg_peer *peer, *next_peer_cursor; struct dump_ctx *ctx = DUMP_CTX(cb); @@ -305,7 +273,7 @@ static int wireguard_nl_get_device_dumpit(struct sk_buff *skb, */ } -static int wireguard_nl_get_device_done(struct netlink_callback *cb) +int wireguard_nl_get_device_done(struct netlink_callback *cb) { struct dump_ctx *ctx = DUMP_CTX(cb); @@ -503,8 +471,8 @@ static int set_peer(struct wg_device *wg, struct nlattr **attrs) return ret; } -static int wireguard_nl_set_device_doit(struct sk_buff *skb, - struct genl_info *info) +int wireguard_nl_set_device_doit(struct sk_buff *skb, + struct genl_info *info) { struct wg_device *wg = lookup_interface(info->attrs, skb); u32 flags = 0; @@ -618,24 +586,6 @@ static int wireguard_nl_set_device_doit(struct sk_buff *skb, return ret; } -static const struct genl_split_ops wireguard_nl_ops[] = { - { - .cmd = WG_CMD_GET_DEVICE, - .start = wireguard_nl_get_device_start, - .dumpit = wireguard_nl_get_device_dumpit, - .done = wireguard_nl_get_device_done, - .policy = device_policy, - .maxattr = WGDEVICE_A_PEERS, - .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, - }, { - .cmd = WG_CMD_SET_DEVICE, - .doit = wireguard_nl_set_device_doit, - .policy = device_policy, - .maxattr = WGDEVICE_A_PEERS, - .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, - } -}; - static struct genl_family genl_family __ro_after_init = { .split_ops = wireguard_nl_ops, .n_split_ops = ARRAY_SIZE(wireguard_nl_ops), diff --git a/drivers/net/wireguard/netlink_gen.c b/drivers/net/wireguard/netlink_gen.c new file mode 100644 index 000000000000..f95fa133778f --- /dev/null +++ b/drivers/net/wireguard/netlink_gen.c @@ -0,0 +1,77 @@ +// SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR 
BSD-3-Clause) +/* Do not edit directly, auto-generated from: */ +/* Documentation/netlink/specs/wireguard.yaml */ +/* YNL-GEN kernel source */ + +#include +#include + +#include "netlink_gen.h" + +#include +#include + +/* Common nested types */ +const struct nla_policy wireguard_wgallowedip_nl_policy[WGALLOWEDIP_A_FLAGS + 1] = { + [WGALLOWEDIP_A_FAMILY] = { .type = NLA_U16, }, + [WGALLOWEDIP_A_IPADDR] = NLA_POLICY_MIN_LEN(4), + [WGALLOWEDIP_A_CIDR_MASK] = { .type = NLA_U8, }, + [WGALLOWEDIP_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), +}; + +const struct nla_policy wireguard_wgpeer_nl_policy[WGPEER_A_PROTOCOL_VERSION + 1] = { + [WGPEER_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGPEER_A_PRESHARED_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGPEER_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x7), + [WGPEER_A_ENDPOINT] = NLA_POLICY_MIN_LEN(16), + [WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL] = { .type = NLA_U16, }, + [WGPEER_A_LAST_HANDSHAKE_TIME] = NLA_POLICY_EXACT_LEN(16), + [WGPEER_A_RX_BYTES] = { .type = NLA_U64, }, + [WGPEER_A_TX_BYTES] = { .type = NLA_U64, }, + [WGPEER_A_ALLOWEDIPS] = NLA_POLICY_NESTED_ARRAY(wireguard_wgallowedip_nl_policy), + [WGPEER_A_PROTOCOL_VERSION] = { .type = NLA_U32, }, +}; + +/* WG_CMD_GET_DEVICE - dump */ +static const struct nla_policy wireguard_get_device_nl_policy[WGDEVICE_A_PEERS + 1] = { + [WGDEVICE_A_IFINDEX] = { .type = NLA_U32, }, + [WGDEVICE_A_IFNAME] = { .type = NLA_NUL_STRING, .len = 15, }, + [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), + [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16, }, + [WGDEVICE_A_FWMARK] = { .type = NLA_U32, }, + [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(wireguard_wgpeer_nl_policy), +}; + +/* WG_CMD_SET_DEVICE - do */ +static const struct nla_policy wireguard_set_device_nl_policy[WGDEVICE_A_PEERS + 1] = { + [WGDEVICE_A_IFINDEX] = { .type = NLA_U32, }, + [WGDEVICE_A_IFNAME] = { 
.type = NLA_NUL_STRING, .len = 15, }, + [WGDEVICE_A_PRIVATE_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_PUBLIC_KEY] = NLA_POLICY_EXACT_LEN(WG_KEY_LEN), + [WGDEVICE_A_FLAGS] = NLA_POLICY_MASK(NLA_U32, 0x1), + [WGDEVICE_A_LISTEN_PORT] = { .type = NLA_U16, }, + [WGDEVICE_A_FWMARK] = { .type = NLA_U32, }, + [WGDEVICE_A_PEERS] = NLA_POLICY_NESTED_ARRAY(wireguard_wgpeer_nl_policy), +}; + +/* Ops table for wireguard */ +const struct genl_split_ops wireguard_nl_ops[2] = { + { + .cmd = WG_CMD_GET_DEVICE, + .start = wireguard_nl_get_device_start, + .dumpit = wireguard_nl_get_device_dumpit, + .done = wireguard_nl_get_device_done, + .policy = wireguard_get_device_nl_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DUMP, + }, + { + .cmd = WG_CMD_SET_DEVICE, + .doit = wireguard_nl_set_device_doit, + .policy = wireguard_set_device_nl_policy, + .maxattr = WGDEVICE_A_PEERS, + .flags = GENL_UNS_ADMIN_PERM | GENL_CMD_CAP_DO, + }, +}; diff --git a/drivers/net/wireguard/netlink_gen.h b/drivers/net/wireguard/netlink_gen.h new file mode 100644 index 000000000000..e635b1f5f0df --- /dev/null +++ b/drivers/net/wireguard/netlink_gen.h @@ -0,0 +1,29 @@ +/* SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) */ +/* Do not edit directly, auto-generated from: */ +/* Documentation/netlink/specs/wireguard.yaml */ +/* YNL-GEN kernel header */ + +#ifndef _LINUX_WIREGUARD_GEN_H +#define _LINUX_WIREGUARD_GEN_H + +#include +#include + +#include +#include + +/* Common nested types */ +extern const struct nla_policy wireguard_wgallowedip_nl_policy[WGALLOWEDIP_A_FLAGS + 1]; +extern const struct nla_policy wireguard_wgpeer_nl_policy[WGPEER_A_PROTOCOL_VERSION + 1]; + +/* Ops table for wireguard */ +extern const struct genl_split_ops wireguard_nl_ops[2]; + +int wireguard_nl_get_device_start(struct netlink_callback *cb); +int wireguard_nl_get_device_done(struct netlink_callback *cb); + +int wireguard_nl_get_device_dumpit(struct sk_buff 
*skb, + struct netlink_callback *cb); +int wireguard_nl_set_device_doit(struct sk_buff *skb, struct genl_info *info); + +#endif /* _LINUX_WIREGUARD_GEN_H */ -- 2.51.0 From ast at fiberby.net Fri Oct 31 16:07:14 2025 From: ast at fiberby.net (Asbjørn Sloth Tønnesen) Date: Fri, 31 Oct 2025 16:07:14 -0000 Subject: [PATCH net-next v2 04/11] netlink: specs: add specification for wireguard In-Reply-To: <20251031160539.1701943-1-ast@fiberby.net> References: <20251031160539.1701943-1-ast@fiberby.net> Message-ID: <20251031160539.1701943-5-ast@fiberby.net> This patch adds a near-complete[1] YNL specification for wireguard, documenting the protocol in a more machine-readable format than the comment in wireguard.h, and easing usage from C and non-C programming languages alike. The generated C library will be featured in the next patch, so in this patch I will use the in-kernel Python client for examples. This makes the documentation in the UAPI header redundant, and it is therefore removed.
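One behaviour that both the removed UAPI comment and the new get-device doc text describe is that a dump may split a single peer across messages, repeating only WGPEER_A_PUBLIC_KEY and WGPEER_A_ALLOWEDIPS, and that the receiver must coalesce adjacent peers. The following is a minimal C sketch of that coalescing step. It assumes the peers have already been parsed into a hypothetical, simplified struct peer_entry (with a fixed MAX_ALLOWEDIPS capacity and ints standing in for parsed prefixes). None of these names come from YNL or the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WG_KEY_LEN 32     /* matches the spec's wg-key-len constant */
#define MAX_ALLOWEDIPS 64 /* hypothetical fixed capacity for this sketch */

/* Hypothetical, simplified view of one peer entry parsed from one message.
 * A continuation entry carries only the public key and allowed IPs, so the
 * first entry of a run is treated as authoritative for everything else. */
struct peer_entry {
	unsigned char public_key[WG_KEY_LEN];
	size_t n_allowedips;
	int allowedips[MAX_ALLOWEDIPS]; /* stand-in for parsed prefixes */
};

/* Merge runs of adjacent entries that share a public key, in place,
 * appending each continuation's allowed IPs to the first entry of the run.
 * Returns the number of distinct peers left at the front of the array. */
static size_t coalesce_peers(struct peer_entry *peers, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0 &&
		    memcmp(peers[out - 1].public_key, peers[i].public_key,
			   WG_KEY_LEN) == 0) {
			struct peer_entry *dst = &peers[out - 1];

			/* Continuation of the previous peer: append its IPs. */
			for (size_t j = 0; j < peers[i].n_allowedips; j++) {
				assert(dst->n_allowedips < MAX_ALLOWEDIPS);
				dst->allowedips[dst->n_allowedips++] =
					peers[i].allowedips[j];
			}
		} else {
			/* A new peer: compact it towards the front. */
			if (out != i)
				peers[out] = peers[i];
			out++;
		}
	}
	return out;
}
```

Only *adjacent* entries are merged, because the documentation says the kernel repeats a split peer in the immediately following message.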
The in-line documentation in the spec is based on the existing comment in wireguard.h, and once released it will be available in the kernel documentation at: https://docs.kernel.org/netlink/specs/wireguard.html (until then run: make htmldocs) Generate wireguard.rst from this spec: $ make -C tools/net/ynl/generated/ wireguard.rst Query a wireguard interface through pyynl: $ sudo ./tools/net/ynl/pyynl/cli.py --family wireguard \ --dump get-device \ --json '{"ifindex":3}' [{'fwmark': 0, 'ifindex': 3, 'ifname': 'wg-test', 'listen-port': 54318, 'peers': [{0: {'allowedips': [{0: {'cidr-mask': 0, 'family': 2, 'ipaddr': '0.0.0.0'}}, {0: {'cidr-mask': 0, 'family': 10, 'ipaddr': '::'}}], 'endpoint': b'[...]', 'last-handshake-time': {'nsec': 42, 'sec': 42}, 'persistent-keepalive-interval': 42, 'preshared-key': '[...]', 'protocol-version': 1, 'public-key': '[...]', 'rx-bytes': 42, 'tx-bytes': 42}}], 'private-key': '[...]', 'public-key': '[...]'}] Add another allowed IP prefix: $ sudo ./tools/net/ynl/pyynl/cli.py --family wireguard \ --do set-device --json '{"ifindex":3,"peers":[ {"public-key":"6a df b1 83 a4 ..","allowedips":[ {"cidr-mask":0,"family":10,"ipaddr":"::"}]}]}' [1] As can be seen above, the "endpoint" is only decoded as binary data, as it can't be described fully in YNL. It's a struct sockaddr_in or struct sockaddr_in6 depending on the attribute length.
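Footnote [1] notes that the endpoint attribute is a struct sockaddr_in or a struct sockaddr_in6, told apart only by the attribute length, and that YNL cannot express this. Below is a minimal C sketch of how a userspace consumer might decode those raw bytes. It assumes Linux sockaddr layouts (16 and 28 bytes). decode_wg_endpoint() is a hypothetical helper, not part of YNL or the kernel. As a defensive choice, it checks the family together with the length.

```c
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: interpret the raw payload of WGPEER_A_ENDPOINT.
 * On success, copies the address into *out and returns AF_INET or
 * AF_INET6; returns -1 if the length/family pair is not recognised. */
static int decode_wg_endpoint(const void *data, size_t len,
			      struct sockaddr_storage *out)
{
	struct sockaddr hdr;

	assert(data != NULL && out != NULL);
	/* The spec only guarantees min-len 16, i.e. at least a struct sockaddr. */
	if (len < sizeof(hdr))
		return -1;
	/* Copy the header out to avoid unaligned access to the payload. */
	memcpy(&hdr, data, sizeof(hdr));

	memset(out, 0, sizeof(*out));
	if (len == sizeof(struct sockaddr_in) && hdr.sa_family == AF_INET) {
		memcpy(out, data, sizeof(struct sockaddr_in));
		return AF_INET;
	}
	if (len == sizeof(struct sockaddr_in6) && hdr.sa_family == AF_INET6) {
		memcpy(out, data, sizeof(struct sockaddr_in6));
		return AF_INET6;
	}
	return -1;
}
```

Copying into a struct sockaddr_storage gives the caller a buffer that is large enough and suitably aligned for either address type.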
Signed-off-by: Asbjørn Sloth Tønnesen --- Documentation/netlink/specs/wireguard.yaml | 307 +++++++++++++++++++++ MAINTAINERS | 1 + include/uapi/linux/wireguard.h | 129 --------- 3 files changed, 308 insertions(+), 129 deletions(-) create mode 100644 Documentation/netlink/specs/wireguard.yaml diff --git a/Documentation/netlink/specs/wireguard.yaml b/Documentation/netlink/specs/wireguard.yaml new file mode 100644 index 000000000000..f3226fa38095 --- /dev/null +++ b/Documentation/netlink/specs/wireguard.yaml @@ -0,0 +1,307 @@ +# SPDX-License-Identifier: ((GPL-2.0 WITH Linux-syscall-note) OR BSD-3-Clause) +--- +name: wireguard +protocol: genetlink-legacy + +doc: | + **Netlink protocol to control WireGuard network devices.** + + The below enums and macros are for interfacing with WireGuard, + using generic netlink, with family ``WG_GENL_NAME`` and version + ``WG_GENL_VERSION``. It defines two commands: get and set. + Note that while they share many common attributes, these two + commands actually accept a slightly different set of inputs and + outputs. These differences are noted under the individual attributes. +c-family-name: wg-genl-name +c-version-name: wg-genl-version +max-by-define: true + +definitions: + - + name-prefix: wg- + name: key-len + type: const + value: 32 + - + name: --kernel-timespec + type: struct + header: linux/time_types.h + members: + - + name: sec + type: u64 + doc: Number of seconds, since UNIX epoch. + - + name: nsec + type: u64 + doc: Number of nanoseconds, after the second began.
+ - + name: wgdevice-flags + name-prefix: wgdevice-f- + enum-name: wgdevice-flag + type: flags + entries: + - replace-peers + - + name: wgpeer-flags + name-prefix: wgpeer-f- + enum-name: wgpeer-flag + type: flags + entries: + - remove-me + - replace-allowedips + - update-only + - + name: wgallowedip-flags + name-prefix: wgallowedip-f- + enum-name: wgallowedip-flag + type: flags + entries: + - remove-me + +attribute-sets: + - + name: wgdevice + enum-name: wgdevice-attribute + name-prefix: wgdevice-a- + attr-cnt-name: --wgdevice-a-last + attributes: + - + name: unspec + type: unused + value: 0 + - + name: ifindex + type: u32 + - + name: ifname + type: string + checks: + max-len: 15 + - + name: private-key + type: binary + doc: Set to all zeros to remove. + display-hint: hex + checks: + exact-len: wg-key-len + - + name: public-key + type: binary + display-hint: hex + checks: + exact-len: wg-key-len + - + name: flags + doc: | + ``0`` or ``WGDEVICE_F_REPLACE_PEERS`` if all current peers + should be removed prior to adding the list below. + type: u32 + enum: wgdevice-flags + checks: + flags-mask: wgdevice-flags + - + name: listen-port + type: u16 + doc: Set as ``0`` to choose randomly. + - + name: fwmark + type: u32 + doc: Set as ``0`` to disable. + - + name: peers + type: indexed-array + sub-type: nest + nested-attributes: wgpeer + doc: The index is set as ``0`` in ``DUMP``, and unused in ``DO``. + - + name: wgpeer + enum-name: wgpeer-attribute + name-prefix: wgpeer-a- + attr-cnt-name: --wgpeer-a-last + attributes: + - + name: unspec + type: unused + value: 0 + - + name: public-key + type: binary + display-hint: hex + checks: + exact-len: wg-key-len + - + name: preshared-key + type: binary + doc: Set as all zeros to remove. 
+ display-hint: hex + checks: + exact-len: wg-key-len + - + name: flags + doc: | + ``0`` and/or ``WGPEER_F_REMOVE_ME`` if the specified peer should not + exist at the end of the operation, rather than added/updated + and/or ``WGPEER_F_REPLACE_ALLOWEDIPS`` if all current allowed IPs + of this peer should be removed prior to adding the list below + and/or ``WGPEER_F_UPDATE_ONLY`` if the peer should only be set if + it already exists. + type: u32 + enum: wgpeer-flags + checks: + flags-mask: wgpeer-flags + - + name: endpoint + doc: struct sockaddr_in or struct sockaddr_in6 + type: binary + checks: + min-len: 16 + - + name: persistent-keepalive-interval + type: u16 + doc: Set as ``0`` to disable. + - + name: last-handshake-time + type: binary + struct: --kernel-timespec + checks: + exact-len: 16 + - + name: rx-bytes + type: u64 + - + name: tx-bytes + type: u64 + - + name: allowedips + type: indexed-array + sub-type: nest + nested-attributes: wgallowedip + doc: The index is set as ``0`` in ``DUMP``, and unused in ``DO``. + - + name: protocol-version + type: u32 + doc: | + Should not be set or used at all by most users of this API, + as the most recent protocol will be used when this is unset. + Otherwise, must be set to ``1``. + - + name: wgallowedip + enum-name: wgallowedip-attribute + name-prefix: wgallowedip-a- + attr-cnt-name: --wgallowedip-a-last + attributes: + - + name: unspec + type: unused + value: 0 + - + name: family + type: u16 + doc: IP family, either ``AF_INET`` or ``AF_INET6``. + - + name: ipaddr + type: binary + doc: Either ``struct in_addr`` or ``struct in6_addr``. + display-hint: ipv4-or-v6 + checks: + min-len: 4 + - + name: cidr-mask + type: u8 + - + name: flags + type: u32 + doc: | + ``WGALLOWEDIP_F_REMOVE_ME`` if the specified IP should be + removed; otherwise, this IP will be added if it is not + already present. 
+ enum: wgallowedip-flags + checks: + flags-mask: wgallowedip-flags + +operations: + enum-name: wg-cmd + name-prefix: wg-cmd- + list: + - + name: get-device + value: 0 + doc: | + Retrieve WireGuard device + ~~~~~~~~~~~~~~~~~~~~~~~~~ + + The command should be called with one but not both of: + + - ``WGDEVICE_A_IFINDEX`` + - ``WGDEVICE_A_IFNAME`` + + The kernel will then return several messages (``NLM_F_MULTI``). + It is possible that all of the allowed IPs of a single peer + will not fit within a single netlink message. In that case, the + same peer will be written in the following message, except it will + only contain ``WGPEER_A_PUBLIC_KEY`` and ``WGPEER_A_ALLOWEDIPS``. + This may occur several times in a row for the same peer. + It is then up to the receiver to coalesce adjacent peers. + Likewise, it is possible that all peers will not fit within a + single message. + So, subsequent peers will be sent in following messages, + except those will only contain ``WGDEVICE_A_IFNAME`` and + ``WGDEVICE_A_PEERS``. It is then up to the receiver to coalesce + these messages to form the complete list of peers. + + While this command does accept the other ``WGDEVICE_A_*`` + attributes for compatibility reasons, they are ignored + by this command and should not be used in requests. + + Since this is an ``NLM_F_DUMP`` command, the final message will + always be ``NLMSG_DONE``, even if an error occurs. However, this + ``NLMSG_DONE`` message contains an integer error code. It is + either zero or a negative error code corresponding to the errno.
+ attribute-set: wgdevice + flags: [uns-admin-perm] + + dump: + pre: wireguard-nl-get-device-start + post: wireguard-nl-get-device-done + # request only uses ifindex | ifname, but keep .maxattr as is + request: &all-attrs + attributes: + - ifindex + - ifname + - private-key + - public-key + - flags + - listen-port + - fwmark + - peers + reply: *all-attrs + - + name: set-device + value: 1 + doc: | + Set WireGuard device + ~~~~~~~~~~~~~~~~~~~~ + + This command should be called with a wgdevice set, containing one + but not both of ``WGDEVICE_A_IFINDEX`` and ``WGDEVICE_A_IFNAME``. + + It is possible that the amount of configuration data exceeds that + of the maximum message length accepted by the kernel. + In that case, several messages should be sent one after another, + with each successive one filling in information not contained in + the prior. + Note that if ``WGDEVICE_F_REPLACE_PEERS`` is specified in the first + message, it probably should not be specified in fragments that come + after, so that the list of peers is only cleared the first time but + appended after. + Likewise for peers, if ``WGPEER_F_REPLACE_ALLOWEDIPS`` is specified + in the first message of a peer, it likely should not be specified + in subsequent fragments. + + If an error occurs, ``NLMSG_ERROR`` will reply containing an errno. + attribute-set: wgdevice + flags: [uns-admin-perm] + + do: + request: *all-attrs diff --git a/MAINTAINERS b/MAINTAINERS index 1ab7e8746299..65c71728e2c6 100644 --- a/MAINTAINERS +++ b/MAINTAINERS @@ -27641,6 +27641,7 @@ M: Jason A. 
Donenfeld L: wireguard at lists.zx2c4.com L: netdev at vger.kernel.org S: Maintained +F: Documentation/netlink/specs/wireguard.yaml F: drivers/net/wireguard/ F: tools/testing/selftests/wireguard/ diff --git a/include/uapi/linux/wireguard.h b/include/uapi/linux/wireguard.h index 8c26391196d5..dee4401e0b5d 100644 --- a/include/uapi/linux/wireguard.h +++ b/include/uapi/linux/wireguard.h @@ -1,135 +1,6 @@ /* SPDX-License-Identifier: (GPL-2.0 WITH Linux-syscall-note) OR MIT */ /* * Copyright (C) 2015-2019 Jason A. Donenfeld . All Rights Reserved. - * - * Documentation - * ============= - * - * The below enums and macros are for interfacing with WireGuard, using generic - * netlink, with family WG_GENL_NAME and version WG_GENL_VERSION. It defines two - * methods: get and set. Note that while they share many common attributes, - * these two functions actually accept a slightly different set of inputs and - * outputs. - * - * WG_CMD_GET_DEVICE - * ----------------- - * - * May only be called via NLM_F_REQUEST | NLM_F_DUMP. 
The command should contain - * one but not both of: - * - * WGDEVICE_A_IFINDEX: NLA_U32 - * WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1 - * - * The kernel will then return several messages (NLM_F_MULTI) containing the - * following tree of nested items: - * - * WGDEVICE_A_IFINDEX: NLA_U32 - * WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1 - * WGDEVICE_A_PRIVATE_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGDEVICE_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGDEVICE_A_LISTEN_PORT: NLA_U16 - * WGDEVICE_A_FWMARK: NLA_U32 - * WGDEVICE_A_PEERS: NLA_NESTED - * 0: NLA_NESTED - * WGPEER_A_PUBLIC_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGPEER_A_PRESHARED_KEY: NLA_EXACT_LEN, len WG_KEY_LEN - * WGPEER_A_ENDPOINT: NLA_MIN_LEN(struct sockaddr), struct sockaddr_in or struct sockaddr_in6 - * WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL: NLA_U16 - * WGPEER_A_LAST_HANDSHAKE_TIME: NLA_EXACT_LEN, struct __kernel_timespec - * WGPEER_A_RX_BYTES: NLA_U64 - * WGPEER_A_TX_BYTES: NLA_U64 - * WGPEER_A_ALLOWEDIPS: NLA_NESTED - * 0: NLA_NESTED - * WGALLOWEDIP_A_FAMILY: NLA_U16 - * WGALLOWEDIP_A_IPADDR: NLA_MIN_LEN(struct in_addr), struct in_addr or struct in6_addr - * WGALLOWEDIP_A_CIDR_MASK: NLA_U8 - * 0: NLA_NESTED - * ... - * 0: NLA_NESTED - * ... - * ... - * WGPEER_A_PROTOCOL_VERSION: NLA_U32 - * 0: NLA_NESTED - * ... - * ... - * - * It is possible that all of the allowed IPs of a single peer will not - * fit within a single netlink message. In that case, the same peer will - * be written in the following message, except it will only contain - * WGPEER_A_PUBLIC_KEY and WGPEER_A_ALLOWEDIPS. This may occur several - * times in a row for the same peer. It is then up to the receiver to - * coalesce adjacent peers. Likewise, it is possible that all peers will - * not fit within a single message. So, subsequent peers will be sent - * in following messages, except those will only contain WGDEVICE_A_IFNAME - * and WGDEVICE_A_PEERS. 
It is then up to the receiver to coalesce these - * messages to form the complete list of peers. - * - * Since this is an NLA_F_DUMP command, the final message will always be - * NLMSG_DONE, even if an error occurs. However, this NLMSG_DONE message - * contains an integer error code. It is either zero or a negative error - * code corresponding to the errno. - * - * WG_CMD_SET_DEVICE - * ----------------- - * - * May only be called via NLM_F_REQUEST. The command should contain the - * following tree of nested items, containing one but not both of - * WGDEVICE_A_IFINDEX and WGDEVICE_A_IFNAME: - * - * WGDEVICE_A_IFINDEX: NLA_U32 - * WGDEVICE_A_IFNAME: NLA_NUL_STRING, maxlen IFNAMSIZ - 1 - * WGDEVICE_A_FLAGS: NLA_U32, 0 or WGDEVICE_F_REPLACE_PEERS if all current - * peers should be removed prior to adding the list below. - * WGDEVICE_A_PRIVATE_KEY: len WG_KEY_LEN, all zeros to remove - * WGDEVICE_A_LISTEN_PORT: NLA_U16, 0 to choose randomly - * WGDEVICE_A_FWMARK: NLA_U32, 0 to disable - * WGDEVICE_A_PEERS: NLA_NESTED - * 0: NLA_NESTED - * WGPEER_A_PUBLIC_KEY: len WG_KEY_LEN - * WGPEER_A_FLAGS: NLA_U32, 0 and/or WGPEER_F_REMOVE_ME if the - * specified peer should not exist at the end of the - * operation, rather than added/updated and/or - * WGPEER_F_REPLACE_ALLOWEDIPS if all current allowed - * IPs of this peer should be removed prior to adding - * the list below and/or WGPEER_F_UPDATE_ONLY if the - * peer should only be set if it already exists. 
- * WGPEER_A_PRESHARED_KEY: len WG_KEY_LEN, all zeros to remove - * WGPEER_A_ENDPOINT: struct sockaddr_in or struct sockaddr_in6 - * WGPEER_A_PERSISTENT_KEEPALIVE_INTERVAL: NLA_U16, 0 to disable - * WGPEER_A_ALLOWEDIPS: NLA_NESTED - * 0: NLA_NESTED - * WGALLOWEDIP_A_FAMILY: NLA_U16 - * WGALLOWEDIP_A_IPADDR: struct in_addr or struct in6_addr - * WGALLOWEDIP_A_CIDR_MASK: NLA_U8 - * WGALLOWEDIP_A_FLAGS: NLA_U32, WGALLOWEDIP_F_REMOVE_ME if - * the specified IP should be removed; - * otherwise, this IP will be added if - * it is not already present. - * 0: NLA_NESTED - * ... - * 0: NLA_NESTED - * ... - * ... - * WGPEER_A_PROTOCOL_VERSION: NLA_U32, should not be set or used at - * all by most users of this API, as the - * most recent protocol will be used when - * this is unset. Otherwise, must be set - * to 1. - * 0: NLA_NESTED - * ... - * ... - * - * It is possible that the amount of configuration data exceeds that of - * the maximum message length accepted by the kernel. In that case, several - * messages should be sent one after another, with each successive one - * filling in information not contained in the prior. Note that if - * WGDEVICE_F_REPLACE_PEERS is specified in the first message, it probably - * should not be specified in fragments that come after, so that the list - * of peers is only cleared the first time but appended after. Likewise for - * peers, if WGPEER_F_REPLACE_ALLOWEDIPS is specified in the first message - * of a peer, it likely should not be specified in subsequent fragments. - * - * If an error occurs, NLMSG_ERROR will reply containing an errno. 
*/ #ifndef _WG_UAPI_WIREGUARD_H -- 2.51.0 From jofficer at gmail.com Wed Oct 1 03:59:23 2025 From: jofficer at gmail.com (Joey Officer) Date: Wed, 01 Oct 2025 03:59:23 -0000 Subject: Fwd: Wireguard for Windows - NCO elevation In-Reply-To: References: Message-ID: We're using WireGuard for Windows with LimitedOperatorUI enabled and standard users added to the local Network Configuration Operators (NCO) group. Intermittently, often after sleep or a restart and if a tunnel was active, we get: "WireGuard may only be used by users who are a member of the Builtin Administrators group." Environment: * WireGuard for Windows: 0.5.3 - master (downloaded from official site) * Windows: both Windows 10 & 11 (Fast Startup: off - as best I can determine) * Users in NCO: Assigned via policy from Intune * UI startup: manual I traced the check to TokenIsElevatedOrElevatable, which only authorizes when the token is elevated + admin, or a linked elevated admin token exists. This seems to preclude Limited-Operator use unless the user is also an admin. Proposed change: when LimitedOperatorUI is enabled, explicitly allow tokens that are members of NCO (without requiring elevation). Diagnostics (when the rejection happens): * IsElevated(): false * NCO SID present in TokenGroups: I believe this is true, based on whoami on Windows reporting NCO membership. I've written (today) a debug tool to report tokens when this fails again. Workaround that consistently helps: as admin or with remote management software, stop the "WireGuard Tunnel: ${name}" service, then have the user log out/in (so a new access token is minted). We typically tell the user to ensure they are using user/pass rather than re-entering their PIN on login. Does the above approach align with the intended Limited-Operator design? If so, I'm happy to clean up a proper patch + tests; if not, where should the NCO allowance be wired in? Any guidance on the preferred place to gate Limited-Operator UI would be appreciated.
Respectfully, Joey From florian at uekermann.me Mon Oct 20 14:27:53 2025 From: florian at uekermann.me (Florian Uekermann) Date: Mon, 20 Oct 2025 14:27:53 -0000 Subject: [PATCH] wg-quick: fix darwin MTU detection Message-ID: <20251020142254.16546-2-florian@uekermann.me> Signed-off-by: Florian Uekermann --- On macOS, the set_mtu function fails to detect the default interface correctly and, as a result, always sets the (potentially incorrect) default MTU. The problem is that netstat -nr -f inet prints the interface name in the 4th column, not the 6th. I used macOS 15.4 for testing, but I am not particularly familiar with the Apple ecosystem. I'm not sure whether this never worked or the netstat shipped by Apple changed at some point, nor which other platforms (iOS?) may be affected, so please keep that in mind before merging. src/wg-quick/darwin.bash | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/wg-quick/darwin.bash b/src/wg-quick/darwin.bash index 1b7fe5e..0467f0e 100755 --- a/src/wg-quick/darwin.bash +++ b/src/wg-quick/darwin.bash @@ -177,7 +177,7 @@ set_mtu() { cmd ifconfig "$REAL_INTERFACE" mtu "$MTU" return fi - while read -r destination _ _ _ _ netif _; do + while read -r destination _ _ netif _; do if [[ $destination == default ]]; then defaultif="$netif" break -- 2.51.0
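The darwin.bash fix hinges on field positions. With bash's default IFS, `read -r destination _ _ netif _` assigns the 4th whitespace-separated field to netif, whereas the old `_ _ _ _ netif _` took the 6th. Below is a small C illustration of the same field selection. The column index is a parameter so that both variants can be compared. default_route_netif() and the sample route lines in the usage are hypothetical; they are not captured macOS output.

```c
#define _POSIX_C_SOURCE 200809L /* for strtok_r */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper mirroring the `while read -r destination ...` loop:
 * if LINE describes the "default" route, copy whitespace-separated field
 * NETIF_FIELD (1-based; 4 after the patch, 6 before it) into OUT and
 * return 1. Otherwise, return 0. */
static int default_route_netif(const char *line, int netif_field,
			       char *out, size_t outlen)
{
	char buf[512];
	char *saveptr = NULL, *tok;
	int field = 0;

	assert(line != NULL && out != NULL && outlen > 0);
	snprintf(buf, sizeof(buf), "%s", line); /* strtok_r modifies its input */
	for (tok = strtok_r(buf, " \t\n", &saveptr); tok != NULL;
	     tok = strtok_r(NULL, " \t\n", &saveptr)) {
		field++;
		if (field == 1 && strcmp(tok, "default") != 0)
			return 0; /* not the default route */
		if (field == netif_field) {
			snprintf(out, outlen, "%s", tok);
			return 1;
		}
	}
	return 0; /* too few fields for the requested column */
}
```

With a four-column line such as "default 192.168.1.1 UGScg en0", asking for field 4 yields "en0", while asking for field 6 finds nothing. In the script, that failure leaves defaultif empty, and set_mtu falls back to its default MTU, as the patch describes.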