From syzbot+list782da1462944fe92dbc9 at syzkaller.appspotmail.com Mon Jun 3 07:52:28 2024
From: syzbot+list782da1462944fe92dbc9 at syzkaller.appspotmail.com (syzbot)
Date: Mon, 03 Jun 2024 00:52:28 -0700
Subject: [syzbot] Monthly wireguard report (Jun 2024)
Message-ID: <00000000000065d96e0619f79dff@google.com>

Hello wireguard maintainers/developers,

This is a 31-day syzbot report for the wireguard subsystem.
All related reports/information can be found at:
https://syzkaller.appspot.com/upstream/s/wireguard

During the period, 4 new issues were detected and 0 were fixed.
In total, 9 issues are still open and 16 have been fixed so far.

Some of the still happening issues:

Ref  Crashes  Repro  Title
<1>  972      No     KCSAN: data-race in wg_packet_send_staged_packets / wg_packet_send_staged_packets (3)
                     https://syzkaller.appspot.com/bug?extid=6ba34f16b98fe40daef1
<2>  63       No     INFO: task hung in wg_destruct
                     https://syzkaller.appspot.com/bug?extid=a6bdd2d02402f18fdd5e
<3>  48       No     INFO: task hung in wg_netns_pre_exit (4)
                     https://syzkaller.appspot.com/bug?extid=1d5c9cd5bcdce13e618e
<4>  4        No     WARNING in kthread_unpark (2)
                     https://syzkaller.appspot.com/bug?extid=943d34fa3cf2191e3068
<5>  1        No     WARNING: locking bug in wg_packet_encrypt_worker
                     https://syzkaller.appspot.com/bug?extid=f19160c19b77d76b5bc2
<6>  1        No     general protection fault in wg_packet_receive
                     https://syzkaller.appspot.com/bug?extid=470d70be7e9ee9f22a01

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller at googlegroups.com.

To disable reminders for individual bugs, reply with the following command:
#syz set no-reminders

To change bug's subsystems, reply with:
#syz set subsystems: new-subsystem

You may send multiple commands in a single email message.

From syzbot+9f1d21c20c7306ca9417 at syzkaller.appspotmail.com Tue Jun 4 19:56:29 2024
From: syzbot+9f1d21c20c7306ca9417 at syzkaller.appspotmail.com (syzbot)
Date: Tue, 04 Jun 2024 12:56:29 -0700
Subject: [syzbot] [wireguard?] WARNING: locking bug in wg_packet_decrypt_worker
Message-ID: <00000000000089d850061a15d886@google.com>

Hello,

syzbot found the following issue on:

HEAD commit:    83814698cf48 Merge tag 'powerpc-6.10-2' of git://git.kerne..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16746d3a980000
kernel config:  https://syzkaller.appspot.com/x/.config?x=733cc7a95171d8e7
dashboard link: https://syzkaller.appspot.com/bug?extid=9f1d21c20c7306ca9417
compiler:       gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40
userspace arch: i386

Unfortunately, I don't have any reproducer for this issue yet.

Downloadable assets:
disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-83814698.raw.xz
vmlinux: https://storage.googleapis.com/syzbot-assets/7042fdcb685d/vmlinux-83814698.xz
kernel image: https://storage.googleapis.com/syzbot-assets/9f795e13834f/bzImage-83814698.xz

IMPORTANT: if you fix the issue, please add the following tag to the commit:
Reported-by: syzbot+9f1d21c20c7306ca9417 at syzkaller.appspotmail.com

------------[ cut here ]------------
DEBUG_LOCKS_WARN_ON(1)
WARNING: CPU: 0 PID: 10 at kernel/locking/lockdep.c:232 hlock_class kernel/locking/lockdep.c:232 [inline]
WARNING: CPU: 0 PID: 10 at kernel/locking/lockdep.c:232 hlock_class+0xfa/0x130 kernel/locking/lockdep.c:221
Modules linked in:
CPU: 0 PID: 10 Comm: kworker/0:1 Not tainted 6.10.0-rc1-syzkaller-00304-g83814698cf48 #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
Workqueue: wg-crypt-wg0 wg_packet_decrypt_worker
RIP: 0010:hlock_class kernel/locking/lockdep.c:232 [inline]
RIP: 0010:hlock_class+0xfa/0x130 kernel/locking/lockdep.c:221
Code: b6 14 11 38 d0 7c 04 84 d2 75 43 8b 05 b3 39 77 0e 85 c0 75 19 90 48 c7 c6 00 bd 2c 8b 48 c7 c7 a0 b7 2c 8b e8 97 47 e5 ff 90 <0f> 0b 90 90 90 31 c0 eb 9e e8 88 f7 7f 00 e9 1c ff ff ff 48 c7 c7
RSP: 0018:ffffc900003c7a00 EFLAGS: 00010082
RAX: 0000000000000000 RBX: 0000000000000f2b RCX: ffffffff81510229
RDX: ffff888015f38000 RSI: ffffffff81510236 RDI: 0000000000000001
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 000000002d2d2d2d R12: 0000000000000000
R13: 0000000000000000 R14: ffff888015f38b30 R15: 0000000000000f2b
FS:  0000000000000000(0000) GS:ffff88802c000000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000f743fb94 CR3: 000000005ff86000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 check_wait_context kernel/locking/lockdep.c:4773 [inline]
 __lock_acquire+0x3f2/0x3b30 kernel/locking/lockdep.c:5087
 lock_acquire kernel/locking/lockdep.c:5754 [inline]
 lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719
 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:126 [inline]
 _raw_spin_lock_bh+0x33/0x40 kernel/locking/spinlock.c:178
 spin_lock_bh include/linux/spinlock.h:356 [inline]
 ptr_ring_consume_bh include/linux/ptr_ring.h:365 [inline]
 wg_packet_decrypt_worker+0x2aa/0x530 drivers/net/wireguard/receive.c:499
 process_one_work+0x958/0x1ad0 kernel/workqueue.c:3231
 process_scheduled_works kernel/workqueue.c:3312 [inline]
 worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393
 kthread+0x2c1/0x3a0 kernel/kthread.c:389
 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244

---
This report is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller at googlegroups.com.

syzbot will keep track of this issue. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

If the report is already addressed, let syzbot know by replying with:
#syz fix: exact-commit-title

If you want to overwrite report's subsystems, reply with:
#syz set subsystems: new-subsystem
(See the list of subsystem names on the web dashboard)

If the report is a duplicate of another one, reply with:
#syz dup: exact-subject-of-another-report

If you want to undo deduplication, reply with:
#syz undup
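For context: the warning above fires inside ptr_ring_consume_bh(), which
takes the ring's consumer lock with spin_lock_bh(). A minimal sketch of the
consume pattern the trace passes through, assuming a ring populated
elsewhere; the function name and the use of skbs here are illustrative, not
taken from drivers/net/wireguard/receive.c:

#include <linux/ptr_ring.h>
#include <linux/skbuff.h>

/* Drain a ptr_ring of skbs. ptr_ring_consume_bh() internally takes
 * r->consumer_lock with spin_lock_bh(), the lock that lockdep is
 * complaining about in the report above. */
static void example_drain_ring(struct ptr_ring *ring)
{
	struct sk_buff *skb;

	/* Pops one queued pointer per call, or NULL once empty. */
	while ((skb = ptr_ring_consume_bh(ring)) != NULL)
		kfree_skb(skb);
}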
From Julia.Lawall at inria.fr Sun Jun 9 08:27:12 2024
From: Julia.Lawall at inria.fr (Julia Lawall)
Date: Sun, 9 Jun 2024 10:27:12 +0200
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
Message-ID: <20240609082726.32742-1-Julia.Lawall@inria.fr>

Since SLOB was removed, it is not necessary to use call_rcu
when the callback only performs kmem_cache_free. Use
kfree_rcu() directly.

The changes were done using the following Coccinelle semantic patch.
This semantic patch is designed to ignore cases where the callback
function is used in another way.

//
@r@
expression e;
local idexpression e2;
identifier cb,f;
position p;
@@

(
call_rcu(...,e2)
|
call_rcu(&e->f,cb@p)
)

@r1@
type T;
identifier x,r.cb;
@@

cb(...) {
(
  kmem_cache_free(...);
|
  T x = ...;
  kmem_cache_free(...,x);
|
  T x;
  x = ...;
  kmem_cache_free(...,x);
)
}

@s depends on r1@
position p != r.p;
identifier r.cb;
@@

cb@p

@script:ocaml@
cb << r.cb;
p << s.p;
@@

Printf.eprintf "Other use of %s at %s:%d\n" cb (List.hd p).file (List.hd p).line

@depends on r1 && !s@
expression e;
identifier r.cb,f;
position r.p;
@@

- call_rcu(&e->f,cb@p)
+ kfree_rcu(e,f)

@r1a depends on !s@
type T;
identifier x,r.cb;
@@

- cb(...) {
(
-   kmem_cache_free(...);
|
-   T x = ...;
-   kmem_cache_free(...,x);
|
-   T x;
-   x = ...;
-   kmem_cache_free(...,x);
)
- }
//

Signed-off-by: Julia Lawall
Reviewed-by: Paul E. McKenney
Reviewed-by: Vlastimil Babka

---
 arch/powerpc/kvm/book3s_mmu_hpte.c  |  8 +-------
 block/blk-ioc.c                     |  9 +--------
 drivers/net/wireguard/allowedips.c  |  9 ++-------
 fs/ecryptfs/dentry.c                |  8 +-------
 fs/nfsd/nfs4state.c                 |  9 +--------
 fs/tracefs/inode.c                  | 10 +---------
 kernel/time/posix-timers.c          |  9 +--------
 kernel/workqueue.c                  |  8 +-------
 net/bridge/br_fdb.c                 |  9 +--------
 net/can/gw.c                        | 13 +++----------
 net/ipv4/fib_trie.c                 |  8 +-------
 net/ipv4/inetpeer.c                 |  9 ++-------
 net/ipv6/ip6_fib.c                  |  9 +--------
 net/ipv6/xfrm6_tunnel.c             |  8 +-------
 net/kcm/kcmsock.c                   | 10 +---------
 net/netfilter/nf_conncount.c        | 10 +---------
 net/netfilter/nf_conntrack_expect.c | 10 +---------
 net/netfilter/xt_hashlimit.c        |  9 +--------
 18 files changed, 22 insertions(+), 143 deletions(-)
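In C terms, the transformation this semantic patch performs has the
following generic shape; the struct and function names below are
illustrative stand-ins, not code from any of the touched files. The one
structural requirement is that the object embeds its struct rcu_head,
since kfree_rcu() frees the whole object through that field's offset:

#include <linux/rcupdate.h>
#include <linux/slab.h>

static struct kmem_cache *foo_cache;	/* created elsewhere */

struct foo {
	long data;
	struct rcu_head rcu;
};

/* Before: a callback whose only job is to free the object. */
static void foo_free_rcu(struct rcu_head *rcu)
{
	kmem_cache_free(foo_cache, container_of(rcu, struct foo, rcu));
}

static void foo_release_before(struct foo *f)
{
	call_rcu(&f->rcu, foo_free_rcu);
}

/* After: no callback at all; the grace-period machinery frees the
 * object. The kfree() this ends up doing is legal on kmem_cache
 * allocated objects since SLOB's removal, which is the premise of
 * the whole series. */
static void foo_release_after(struct foo *f)
{
	kfree_rcu(f, rcu);
}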
From Julia.Lawall at inria.fr Sun Jun 9 08:27:13 2024
From: Julia.Lawall at inria.fr (Julia Lawall)
Date: Sun, 9 Jun 2024 10:27:13 +0200
Subject: [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <20240609082726.32742-1-Julia.Lawall@inria.fr>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr>
Message-ID: <20240609082726.32742-2-Julia.Lawall@inria.fr>

Since SLOB was removed, it is not necessary to use call_rcu
when the callback only performs kmem_cache_free. Use
kfree_rcu() directly.

The changes were done using the following Coccinelle semantic patch.
This semantic patch is designed to ignore cases where the callback
function is used in another way.

//
@r@
expression e;
local idexpression e2;
identifier cb,f;
position p;
@@

(
call_rcu(...,e2)
|
call_rcu(&e->f,cb@p)
)

@r1@
type T;
identifier x,r.cb;
@@

cb(...) {
(
  kmem_cache_free(...);
|
  T x = ...;
  kmem_cache_free(...,x);
|
  T x;
  x = ...;
  kmem_cache_free(...,x);
)
}

@s depends on r1@
position p != r.p;
identifier r.cb;
@@

cb@p

@script:ocaml@
cb << r.cb;
p << s.p;
@@

Printf.eprintf "Other use of %s at %s:%d\n" cb (List.hd p).file (List.hd p).line

@depends on r1 && !s@
expression e;
identifier r.cb,f;
position r.p;
@@

- call_rcu(&e->f,cb@p)
+ kfree_rcu(e,f)

@r1a depends on !s@
type T;
identifier x,r.cb;
@@

- cb(...) {
(
-   kmem_cache_free(...);
|
-   T x = ...;
-   kmem_cache_free(...,x);
|
-   T x;
-   x = ...;
-   kmem_cache_free(...,x);
)
- }
//

Signed-off-by: Julia Lawall
Reviewed-by: Paul E. McKenney
Reviewed-by: Vlastimil Babka

---
 drivers/net/wireguard/allowedips.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
index 0ba714ca5185..e4e1638fce1b 100644
--- a/drivers/net/wireguard/allowedips.c
+++ b/drivers/net/wireguard/allowedips.c
@@ -48,11 +48,6 @@ static void push_rcu(struct allowedips_node **stack,
 	}
 }
 
-static void node_free_rcu(struct rcu_head *rcu)
-{
-	kmem_cache_free(node_cache, container_of(rcu, struct allowedips_node, rcu));
-}
-
 static void root_free_rcu(struct rcu_head *rcu)
 {
 	struct allowedips_node *node, *stack[MAX_ALLOWEDIPS_DEPTH] = {
@@ -330,13 +325,13 @@ void wg_allowedips_remove_by_peer(struct allowedips *table,
 		child = rcu_dereference_protected(
 				parent->bit[!(node->parent_bit_packed & 1)],
 				lockdep_is_held(lock));
-		call_rcu(&node->rcu, node_free_rcu);
+		kfree_rcu(node, rcu);
 		if (!free_parent)
 			continue;
 		if (child)
 			child->parent_bit_packed = parent->parent_bit_packed;
 		*(struct allowedips_node **)(parent->parent_bit_packed & ~3UL) = child;
-		call_rcu(&parent->rcu, node_free_rcu);
+		kfree_rcu(parent, rcu);
 	}
 }

From Jason at zx2c4.com Sun Jun 9 14:32:06 2024
From: Jason at zx2c4.com (Jason A. Donenfeld)
Date: Sun, 9 Jun 2024 16:32:06 +0200
Subject: [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <20240609082726.32742-2-Julia.Lawall@inria.fr>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240609082726.32742-2-Julia.Lawall@inria.fr>
Message-ID:

Hi Julia & Vlastimil,

On Sun, Jun 09, 2024 at 10:27:13AM +0200, Julia Lawall wrote:
> Since SLOB was removed, it is not necessary to use call_rcu
> when the callback only performs kmem_cache_free. Use
> kfree_rcu() directly.

Thanks, I applied this to the wireguard tree, and I'll send this out as
a fix for 6.10. Let me know if this is unfavorable to you and if you'd
like to take this somewhere yourself, in which case I'll give you my
ack.

Just a question, though, for Vlastimil -- I know that with the SLOB
removal, kfree() is now allowed on kmemcache'd objects. Do you plan to
do a blanket s/kmem_cache_free/kfree/g at some point, and then remove
kmem_cache_free altogether?

Jason

From julia.lawall at inria.fr Sun Jun 9 14:36:15 2024
From: julia.lawall at inria.fr (Julia Lawall)
Date: Sun, 9 Jun 2024 16:36:15 +0200 (CEST)
Subject: [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240609082726.32742-2-Julia.Lawall@inria.fr>
Message-ID:

On Sun, 9 Jun 2024, Jason A. Donenfeld wrote:

> Hi Julia & Vlastimil,
>
> On Sun, Jun 09, 2024 at 10:27:13AM +0200, Julia Lawall wrote:
> > Since SLOB was removed, it is not necessary to use call_rcu
> > when the callback only performs kmem_cache_free. Use
> > kfree_rcu() directly.
>
> Thanks, I applied this to the wireguard tree, and I'll send this out as
> a fix for 6.10. Let me know if this is unfavorable to you and if you'd
> like to take this somewhere yourself, in which case I'll give you my
> ack.

Please push it onward.

julia

> Just a question, though, for Vlastimil -- I know that with the SLOB
> removal, kfree() is now allowed on kmemcache'd objects. Do you plan to
> do a blanket s/kmem_cache_free/kfree/g at some point, and then remove
> kmem_cache_free altogether?
>
> Jason

From nico.schottelius at ungleich.ch Sun Jun 9 15:39:46 2024
From: nico.schottelius at ungleich.ch (Nico Schottelius)
Date: Sun, 09 Jun 2024 17:39:46 +0200
Subject: Wireguard address binding - how to fix?
In-Reply-To: <87bk4tc5m7.fsf@ungleich.ch> (Nico Schottelius's message of "Sun, 26 May 2024 10:57:52 +0200")
References: <87le4cfz0u.fsf@ungleich.ch> <20240514113648.neaj6kfazx4fi7af@House.clients.dxld.at> <87msojhbq0.fsf@ungleich.ch> <87a5kjgw3j.fsf@ungleich.ch> <874jarfd43.fsf@ungleich.ch> <87bk4tc5m7.fsf@ungleich.ch>
Message-ID: <87tti2xgzh.fsf@ungleich.ch>

Jason,

may I shortly ask what your opinion is on the patch and whether there is
a way forward to make wireguard usable on systems with multiple IP
addresses?

Best regards,

Nico

Nico Schottelius writes:

> d tbsky writes:
>> I remembered how exciting when I tested wireguard at 2017. until I
>> asked the multi-home question in the list.
>> wireguard is beautiful, elegant, fast but not easy to get along with.
>> openvpn is not so amazing but it can get the job done.
>
> Nice summary, hits the nail quite well.
>
> Jason, do you mind having a look at the submitted patches for IP address
> binding and comment on them? Or alternatively can you give green light
> for generally moving forward so that a direct inclusion in the Linux
> kernel would be accepted?
>
> Best regards,
>
> Nico

-------------- next part --------------
--
Sustainable and modern Infrastructures by ungleich.ch
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 873 bytes
Desc: not available
URL:

From vbabka at suse.cz Mon Jun 10 20:38:08 2024
From: vbabka at suse.cz (Vlastimil Babka)
Date: Mon, 10 Jun 2024 22:38:08 +0200
Subject: [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240609082726.32742-2-Julia.Lawall@inria.fr>
Message-ID: <3f58c9a6-614f-4188-9a38-72c26fb42c8e@suse.cz>

On 6/9/24 4:32 PM, Jason A. Donenfeld wrote:
> Hi Julia & Vlastimil,
>
> On Sun, Jun 09, 2024 at 10:27:13AM +0200, Julia Lawall wrote:
>> Since SLOB was removed, it is not necessary to use call_rcu
>> when the callback only performs kmem_cache_free. Use
>> kfree_rcu() directly.
>
> Thanks, I applied this to the wireguard tree, and I'll send this out as
> a fix for 6.10. Let me know if this is unfavorable to you and if you'd
> like to take this somewhere yourself, in which case I'll give you my
> ack.
>
> Just a question, though, for Vlastimil -- I know that with the SLOB
> removal, kfree() is now allowed on kmemcache'd objects. Do you plan to
> do a blanket s/kmem_cache_free/kfree/g at some point, and then remove
> kmem_cache_free altogether?

Hmm, not really, but obligatory Cc for willy who'd love to have "one
free() to rule them all" IIRC.
My current thinking is that kmem_cache_free() can save the kmem_cache
lookup, or serve as a double check if debugging is enabled, and doesn't
have much downside. If someone wants to not care about the kmem_cache
pointer, they can use kfree(), and even convert their subsystem at will.
But a mass conversion of everything would be rather a lot of churn for
not much of a benefit, IMHO.

From Jason at zx2c4.com Mon Jun 10 20:59:07 2024
From: Jason at zx2c4.com (Jason A. Donenfeld)
Date: Mon, 10 Jun 2024 22:59:07 +0200
Subject: [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <3f58c9a6-614f-4188-9a38-72c26fb42c8e@suse.cz>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240609082726.32742-2-Julia.Lawall@inria.fr> <3f58c9a6-614f-4188-9a38-72c26fb42c8e@suse.cz>
Message-ID:

Hi Vlastimil,

On Mon, Jun 10, 2024 at 10:38:08PM +0200, Vlastimil Babka wrote:
> On 6/9/24 4:32 PM, Jason A. Donenfeld wrote:
> > Hi Julia & Vlastimil,
> >
> > On Sun, Jun 09, 2024 at 10:27:13AM +0200, Julia Lawall wrote:
> >> Since SLOB was removed, it is not necessary to use call_rcu
> >> when the callback only performs kmem_cache_free. Use
> >> kfree_rcu() directly.
> >
> > Thanks, I applied this to the wireguard tree, and I'll send this out as
> > a fix for 6.10. Let me know if this is unfavorable to you and if you'd
> > like to take this somewhere yourself, in which case I'll give you my
> > ack.
> >
> > Just a question, though, for Vlastimil -- I know that with the SLOB
> > removal, kfree() is now allowed on kmemcache'd objects. Do you plan to
> > do a blanket s/kmem_cache_free/kfree/g at some point, and then remove
> > kmem_cache_free altogether?
>
> Hmm, not really, but obligatory Cc for willy who'd love to have "one
> free() to rule them all" IIRC.
>
> My current thinking is that kmem_cache_free() can save the kmem_cache
> lookup, or serve as a double check if debugging is enabled, and doesn't
> have much downside. If someone wants to not care about the kmem_cache
> pointer, they can use kfree(), and even convert their subsystem at will.
> But a mass conversion of everything would be rather a lot of churn for
> not much of a benefit, IMHO.

Huh, interesting. I can see the practical sense in that, not causing
unnecessary churn and such. At the same time, this doesn't appeal much
to some sort of orderly part of my mind. Either all kmalloc/kmem_cache
memory is kfree()d as the rule for what is best, or a kmalloc pairs with
a kfree and a kmem_cache_alloc pairs with a kmem_cache_free and that's
the rule. And those can be checked and enforced and so forth. But
saying, "oh, well, they might work a bit different, but whatever you
want is basically fine; there's no rhyme or reason" is somehow
dissatisfying. Maybe the rule is actually, "use kmem_cache_free if you
can because it saves a pointer lookup, but don't go out of your way to
do that and certainly don't bloat .text to make it happen," then maybe
that makes sense? But I dunno, I find myself wanting a rule and
consistency. (Did you find it annoying that in this paragraph, I used ()
on only one function mention but not on the others? If so, maybe you're
like me.) Maybe I should just chill though.

Anyway, only my 2¢, and my opinion here isn't worth much, so please
regard this as only a gut statement from a bystander.

Jason
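For readers following along, the two pairings being debated look like
this; foo_cache and struct foo are illustrative stand-ins, a sketch
rather than code from any subsystem in the series:

#include <linux/slab.h>

struct foo {
	long data;
};

static struct kmem_cache *foo_cache;	/* from kmem_cache_create() elsewhere */

static void foo_demo(void)
{
	struct foo *f = kmem_cache_alloc(foo_cache, GFP_KERNEL);

	if (!f)
		return;

	/* Strict pairing: no cache lookup needed, and with debugging
	 * enabled the slab code can verify f really belongs to
	 * foo_cache. */
	kmem_cache_free(foo_cache, f);

	/* Equally legal since SLOB's removal: kfree(f); the allocator
	 * derives the owning cache from the object's slab page. */
}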
From germano.massullo at gmail.com Wed Jun 12 14:11:00 2024
From: germano.massullo at gmail.com (Germano Massullo)
Date: Wed, 12 Jun 2024 16:11:00 +0200
Subject: Mini PCIE HW accelerator for ChaCha20
Message-ID:

Hello, I would like to ask if you are aware of any mini PCI express card
that provides hardware acceleration for the ChaCha20 algorithm. I would
need it to improve Turris Omnia Wireguard throughput.
Cheers!

From kuba at kernel.org Wed Jun 12 21:33:05 2024
From: kuba at kernel.org (Jakub Kicinski)
Date: Wed, 12 Jun 2024 14:33:05 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <20240609082726.32742-1-Julia.Lawall@inria.fr>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr>
Message-ID: <20240612143305.451abf58@kernel.org>

On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> Since SLOB was removed, it is not necessary to use call_rcu
> when the callback only performs kmem_cache_free. Use
> kfree_rcu() directly.
>
> The changes were done using the following Coccinelle semantic patch.
> This semantic patch is designed to ignore cases where the callback
> function is used in another way.

How does the discussion on:
[PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/
reflect on this series? IIUC we should hold off..

From paulmck at kernel.org Wed Jun 12 22:37:55 2024
From: paulmck at kernel.org (Paul E. McKenney)
Date: Wed, 12 Jun 2024 15:37:55 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <20240612143305.451abf58@kernel.org>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID:

On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > Since SLOB was removed, it is not necessary to use call_rcu
> > when the callback only performs kmem_cache_free. Use
> > kfree_rcu() directly.
> >
> > The changes were done using the following Coccinelle semantic patch.
> > This semantic patch is designed to ignore cases where the callback
> > function is used in another way.
>
> How does the discussion on:
> [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/
> reflect on this series? IIUC we should hold off..

We do need to hold off for the ones in kernel modules (such as 07/14)
where the kmem_cache is destroyed during module unload.

OK, I might as well go through them...

[PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	Needs to wait, see wg_allowedips_slab_uninit().

[PATCH 02/14] net: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	I don't immediately see the rcu_barrier(), but if there isn't
	one in there somewhere there probably should be. Caution
	suggests a need to wait.

[PATCH 03/14] KVM: PPC: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	I don't immediately see the rcu_barrier(), but if there isn't
	one in there somewhere there probably should be. Caution
	suggests a need to wait.

[PATCH 04/14] xfrm6_tunnel: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	Needs to wait, see xfrm6_tunnel_fini().

[PATCH 05/14] tracefs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	This one is fine because the tracefs_inode_cachep kmem_cache
	is created at boot and never destroyed.

[PATCH 06/14] eCryptfs: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	I don't see a kmem_cache_destroy(), but then again, I also
	don't see the kmem_cache_create(). Unless someone can see
	what I am not seeing, let's wait.

[PATCH 07/14] net: bridge: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	Needs to wait, see br_fdb_fini() and br_deinit().

[PATCH 08/14] nfsd: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	I don't immediately see the rcu_barrier(), but if there isn't
	one in there somewhere there probably should be. Caution
	suggests a need to wait.

[PATCH 09/14] block: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	I don't see a kmem_cache_destroy(), but then again, I also
	don't see the kmem_cache_create(). Unless someone can see
	what I am not seeing, let's wait.

[PATCH 10/14] can: gw: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	Needs to wait, see cgw_module_exit().

[PATCH 11/14] posix-timers: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	This one is fine because the posix_timers_cache kmem_cache is
	created at boot and never destroyed.

[PATCH 12/14] workqueue: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	This one is fine because the pwq_cache kmem_cache is created
	at boot and never destroyed.

[PATCH 13/14] kcm: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	I don't immediately see the rcu_barrier(), but if there isn't
	one in there somewhere there probably should be. Caution
	suggests a need to wait.

[PATCH 14/14] netfilter: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
	Needs to wait, see hashlimit_mt_exit().

So 05/14, 11/14 and 12/14 are OK and can go ahead. The rest need some
help.

Apologies for my having gotten overly enthusiastic about this change!

							Thanx, Paul
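The pattern that makes these module cases "need to wait" is the module
exit contract sketched below, with illustrative "foo" names. With
call_rcu(), rcu_barrier() flushes every pending free-callback before the
cache is destroyed; the thread's point is that batched kfree_rcu()
objects are no longer ordinary call_rcu() callbacks, so rcu_barrier()
no longer provides that guarantee:

#include <linux/module.h>
#include <linux/rcupdate.h>
#include <linux/slab.h>

static struct kmem_cache *foo_cache;

struct foo {
	struct rcu_head rcu;
};

static void foo_free_rcu(struct rcu_head *rcu)
{
	kmem_cache_free(foo_cache, container_of(rcu, struct foo, rcu));
}

static void __exit foo_exit(void)
{
	/* Waits for all pending call_rcu() callbacks, including every
	 * outstanding foo_free_rcu(), so the cache is empty below. */
	rcu_barrier();
	/* Safe only because nothing is still allocated at this point. */
	kmem_cache_destroy(foo_cache);
}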
From kuba at kernel.org Wed Jun 12 22:46:15 2024
From: kuba at kernel.org (Jakub Kicinski)
Date: Wed, 12 Jun 2024 15:46:15 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID: <20240612154615.21206fea@kernel.org>

On Wed, 12 Jun 2024 15:37:55 -0700 Paul E. McKenney wrote:
> So 05/14, 11/14 and 12/14 are OK and can go ahead. The rest need some
> help.

Thank you for the breakdown!

From paulmck at kernel.org Wed Jun 12 23:04:19 2024
From: paulmck at kernel.org (Paul E. McKenney)
Date: Wed, 12 Jun 2024 16:04:19 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <7e58e73d-4173-49fe-8f05-38a3699bc2c1@kernel.dk>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <7e58e73d-4173-49fe-8f05-38a3699bc2c1@kernel.dk>
Message-ID:

On Wed, Jun 12, 2024 at 04:52:57PM -0600, Jens Axboe wrote:
> On 6/12/24 4:37 PM, Paul E. McKenney wrote:
> > [PATCH 09/14] block: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> > 	I don't see a kmem_cache_destroy(), but then again, I also
> > 	don't see the kmem_cache_create(). Unless someone can see
> > 	what I am not seeing, let's wait.
>
> It's in that same file:
>
> blk_ioc_init()
>
> the cache itself never goes away, as the ioc code is not unloadable. So
> I think the change there should be fine.

Thank you, Jens! (And to Jakub for motivating me to go look.)

So to update the scorecard, 05/14, 09/14, 11/14 and 12/14 are OK and can
go ahead.

							Thanx, Paul

From Jason at zx2c4.com Wed Jun 12 23:31:57 2024
From: Jason at zx2c4.com (Jason A. Donenfeld)
Date: Thu, 13 Jun 2024 01:31:57 +0200
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID:

On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > Since SLOB was removed, it is not necessary to use call_rcu
> > > when the callback only performs kmem_cache_free. Use
> > > kfree_rcu() directly.
> > >
> > > The changes were done using the following Coccinelle semantic patch.
> > > This semantic patch is designed to ignore cases where the callback
> > > function is used in another way.
> >
> > How does the discussion on:
> > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/
> > reflect on this series? IIUC we should hold off..
>
> We do need to hold off for the ones in kernel modules (such as 07/14)
> where the kmem_cache is destroyed during module unload.
>
> OK, I might as well go through them...
>
> [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> 	Needs to wait, see wg_allowedips_slab_uninit().

Right, this has exactly the same pattern as the batman-adv issue:

void wg_allowedips_slab_uninit(void)
{
	rcu_barrier();
	kmem_cache_destroy(node_cache);
}

I'll hold off on sending that up until this matter is resolved.

Jason

From Jason at zx2c4.com Thu Jun 13 00:31:53 2024
From: Jason at zx2c4.com (Jason A. Donenfeld)
Date: Thu, 13 Jun 2024 02:31:53 +0200
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID:

On Thu, Jun 13, 2024 at 01:31:57AM +0200, Jason A. Donenfeld wrote:
>
> Right, this has exactly the same pattern as the batman-adv issue:
>
> void wg_allowedips_slab_uninit(void)
> {
> 	rcu_barrier();
> 	kmem_cache_destroy(node_cache);
> }
>
> I'll hold off on sending that up until this matter is resolved.

BTW, I think this whole thing might be caused by:

a35d16905efc ("rcu: Add basic support for kfree_rcu() batching")

The commit message there mentions:

    There is an implication with rcu_barrier() with this patch. Since the
    kfree_rcu() calls can be batched, and may not be handed yet to the RCU
    machinery in fact, the monitor may not have even run yet to do the
    queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
    to wait for those kfree_rcu()s that are already made. So this means a
    kfree_rcu() followed by an rcu_barrier() does not imply that memory will
    be freed once rcu_barrier() returns.

Before that, a kfree_rcu() used to just add a normal call_rcu() into the
list, but with the function offset < 4096 as a special marker. So the
kfree_rcu() calls would be treated alongside the other call_rcu() ones
and thus affected by rcu_barrier(). Looks like that behavior is no more
since this commit.

Rather than getting rid of the batching, which seems good for
efficiency, I wonder if the right fix to this would be adding a
`should_destroy` boolean to kmem_cache, which kmem_cache_destroy() sets
to true. And then right after it checks `if (number_of_allocations == 0)
actually_destroy()`, and likewise on each kmem_cache_free(), it could
check `if (should_destroy && number_of_allocations == 0)
actually_destroy()`. This way, the work is delayed until it's safe to do
so. This might also mitigate other lurking bugs of bad code that calls
kmem_cache_destroy() before kmem_cache_free().

Jason
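Jason's proposal, rendered as a sketch. Nothing below is existing slab
code; the struct, fields, and helpers (should_destroy, nr_allocated,
foo_actually_destroy()) are all made up for illustration:

#include <linux/atomic.h>
#include <linux/types.h>

/* Hypothetical: a cache flagged for destruction gets torn down by
 * whichever free operation returns the last outstanding object. */
struct foo_cache_state {
	bool should_destroy;	/* set by kmem_cache_destroy() */
	atomic_t nr_allocated;	/* objects currently outstanding */
};

static void foo_actually_destroy(struct foo_cache_state *s);

static void foo_cache_mark_destroy(struct foo_cache_state *s)
{
	WRITE_ONCE(s->should_destroy, true);
	/* Destroy immediately if nothing is outstanding... */
	if (atomic_read(&s->nr_allocated) == 0)
		foo_actually_destroy(s);
}

static void foo_cache_on_free(struct foo_cache_state *s)
{
	/* ...otherwise the last free completes the deferred destroy. */
	if (atomic_dec_and_test(&s->nr_allocated) &&
	    READ_ONCE(s->should_destroy))
		foo_actually_destroy(s);
}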
From paulmck at kernel.org Thu Jun 13 03:38:02 2024
From: paulmck at kernel.org (Paul E. McKenney)
Date: Wed, 12 Jun 2024 20:38:02 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop>

On Thu, Jun 13, 2024 at 02:31:53AM +0200, Jason A. Donenfeld wrote:
> On Thu, Jun 13, 2024 at 01:31:57AM +0200, Jason A. Donenfeld wrote:
> > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > > > Since SLOB was removed, it is not necessary to use call_rcu
> > > > > when the callback only performs kmem_cache_free. Use
> > > > > kfree_rcu() directly.
> > > > >
> > > > > The changes were done using the following Coccinelle semantic patch.
> > > > > This semantic patch is designed to ignore cases where the callback
> > > > > function is used in another way.
> > > >
> > > > How does the discussion on:
> > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/
> > > > reflect on this series? IIUC we should hold off..
> > >
> > > We do need to hold off for the ones in kernel modules (such as 07/14)
> > > where the kmem_cache is destroyed during module unload.
> > >
> > > OK, I might as well go through them...
> > >
> > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> > > 	Needs to wait, see wg_allowedips_slab_uninit().
> >
> > Right, this has exactly the same pattern as the batman-adv issue:
> >
> > void wg_allowedips_slab_uninit(void)
> > {
> > 	rcu_barrier();
> > 	kmem_cache_destroy(node_cache);
> > }
> >
> > I'll hold off on sending that up until this matter is resolved.
>
> BTW, I think this whole thing might be caused by:
>
> a35d16905efc ("rcu: Add basic support for kfree_rcu() batching")
>
> The commit message there mentions:
>
>     There is an implication with rcu_barrier() with this patch. Since the
>     kfree_rcu() calls can be batched, and may not be handed yet to the RCU
>     machinery in fact, the monitor may not have even run yet to do the
>     queue_rcu_work(), there seems no easy way of implementing rcu_barrier()
>     to wait for those kfree_rcu()s that are already made. So this means a
>     kfree_rcu() followed by an rcu_barrier() does not imply that memory will
>     be freed once rcu_barrier() returns.
>
> Before that, a kfree_rcu() used to just add a normal call_rcu() into the
> list, but with the function offset < 4096 as a special marker. So the
> kfree_rcu() calls would be treated alongside the other call_rcu() ones
> and thus affected by rcu_barrier(). Looks like that behavior is no more
> since this commit.

You might well be right, and thank you for digging into this!

> Rather than getting rid of the batching, which seems good for
> efficiency, I wonder if the right fix to this would be adding a
> `should_destroy` boolean to kmem_cache, which kmem_cache_destroy() sets
> to true. And then right after it checks `if (number_of_allocations == 0)
> actually_destroy()`, and likewise on each kmem_cache_free(), it could
> check `if (should_destroy && number_of_allocations == 0)
> actually_destroy()`. This way, the work is delayed until it's safe to do
> so. This might also mitigate other lurking bugs of bad code that calls
> kmem_cache_destroy() before kmem_cache_free().

Here are the current options being considered, including those that
are completely brain-dead:

o	Document current state. (Must use call_rcu() if module
	destroys slab of RCU-protected objects.)

	Need to review Julia's and Uladzislau's series of patches
	that change call_rcu() of slab objects to kfree_rcu().

o	Make rcu_barrier() wait for kfree_rcu() objects. (This is
	surprisingly complex and will wait unnecessarily in some
	cases. However, it does preserve current code.)

o	Make a kfree_rcu_barrier() that waits for kfree_rcu()
	objects. (This avoids the unnecessary waits, but adds
	complexity to kfree_rcu(). This is harder than it looks,
	but could be done, for example by maintaining pairs of
	per-CPU counters and handling them in an SRCU-like fashion.
	Need some way of communicating the index, though.)

	(There might be use cases where both rcu_barrier() and
	kfree_rcu_barrier() would need to be invoked.)

	A simpler way to implement this is to scan all of the
	in-flight objects, and queue each (either separately or in
	bulk) using call_rcu(). This still has problems with
	kfree_rcu_mightsleep() under low-memory conditions, in which
	case there are a bunch of synchronize_rcu() instances waiting.
	These instances could use SRCU-like per-CPU arrays of
	counters. Or just protect the calls to synchronize_rcu() and
	the later frees with an SRCU reader, then have the other end
	call synchronize_srcu().

o	Make the current kmem_cache_destroy() asynchronously wait for
	all memory to be returned, then complete the destruction.
	(This gets rid of a valuable debugging technique because
	in normal use, it is a bug to attempt to destroy a kmem_cache
	that has objects still allocated.)
o	Make a kmem_cache_destroy_rcu() that asynchronously waits for
	all memory to be returned, then completes the destruction.
	(This raises the question of what to do if it takes a "long
	time" for the objects to be freed.)

o	Make a kmem_cache_free_barrier() that blocks until all
	objects in the specified kmem_cache have been freed.

o	Make a kmem_cache_destroy_wait() that waits for all memory to
	be returned, then does the destruction. This is equivalent to:

	kmem_cache_free_barrier(&mycache);
	kmem_cache_destroy(&mycache);

Uladzislau has started discussions on the last few of these:

https://lore.kernel.org/all/ZmnL4jkhJLIW924W at pc636/

I have also added this information to a Google Document for easier
tracking:

https://docs.google.com/document/d/1v0rcZLvvjVGejT3523W0rDy_sLFu2LWc_NR3fQItZaA/edit?usp=sharing

Other thoughts?

							Thanx, Paul

From Jason at zx2c4.com Thu Jun 13 11:58:59 2024
From: Jason at zx2c4.com (Jason A. Donenfeld)
Date: Thu, 13 Jun 2024 13:58:59 +0200
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID:

On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > Since SLOB was removed, it is not necessary to use call_rcu
> > > when the callback only performs kmem_cache_free. Use
> > > kfree_rcu() directly.
> > >
> > > The changes were done using the following Coccinelle semantic patch.
> > > This semantic patch is designed to ignore cases where the callback
> > > function is used in another way.
> >
> > How does the discussion on:
> > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/
> > reflect on this series? IIUC we should hold off..
>
> We do need to hold off for the ones in kernel modules (such as 07/14)
> where the kmem_cache is destroyed during module unload.
>
> OK, I might as well go through them...
>
> [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> 	Needs to wait, see wg_allowedips_slab_uninit().

Also, notably, this patch needs additionally:

diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
index e4e1638fce1b..c95f6937c3f1 100644
--- a/drivers/net/wireguard/allowedips.c
+++ b/drivers/net/wireguard/allowedips.c
@@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void)
 
 void wg_allowedips_slab_uninit(void)
 {
-	rcu_barrier();
 	kmem_cache_destroy(node_cache);
 }

Once kmem_cache_destroy has been fixed to be deferrable.

I assume the other patches are similar -- an rcu_barrier() can be
removed. So some manual meddling of these might be in order.

Jason
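Under two of the options above, a module teardown path might end up
looking like one of the following. Both APIs are hypothetical at this
point, sketched only from the descriptions in this thread:

/* Option A: fire-and-forget; the slab code completes the teardown
 * once the last in-flight kfree_rcu() object comes back. */
static void example_uninit_async(void)
{
	kmem_cache_destroy_rcu(node_cache);	/* hypothetical API */
}

/* Option B: block until the cache is empty, then destroy as today. */
static void example_uninit_blocking(void)
{
	kmem_cache_free_barrier(node_cache);	/* hypothetical API */
	kmem_cache_destroy(node_cache);
}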
From Jason at zx2c4.com Thu Jun 13 12:22:41 2024
From: Jason at zx2c4.com (Jason A. Donenfeld)
Date: Thu, 13 Jun 2024 14:22:41 +0200
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop>
Message-ID:

On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E. McKenney wrote:
> o	Make the current kmem_cache_destroy() asynchronously wait for
> 	all memory to be returned, then complete the destruction.
> 	(This gets rid of a valuable debugging technique because
> 	in normal use, it is a bug to attempt to destroy a kmem_cache
> 	that has objects still allocated.)
>
> o	Make a kmem_cache_destroy_rcu() that asynchronously waits for
> 	all memory to be returned, then completes the destruction.
> 	(This raises the question of what to do if it takes a "long
> 	time" for the objects to be freed.)

These seem like the best two options.

> o	Make a kmem_cache_free_barrier() that blocks until all
> 	objects in the specified kmem_cache have been freed.
>
> o	Make a kmem_cache_destroy_wait() that waits for all memory to
> 	be returned, then does the destruction. This is equivalent to:
>
> 	kmem_cache_free_barrier(&mycache);
> 	kmem_cache_destroy(&mycache);

These also seem fine, but I'm less keen about blocking behavior.

Though, along the ideas of kmem_cache_destroy_rcu(), you might also
consider renaming this last one to kmem_cache_destroy_rcu_wait/barrier().
This way, it's RCU focused, and you can deal directly with the question
of, "how long is too long to block/to memleak?"

Specifically what I mean is that we can still claim a memory leak has
occurred if one batched kfree_rcu freeing grace period has elapsed since
the last call to kmem_cache_destroy_rcu_wait/barrier() or
kmem_cache_destroy_rcu(). In that case, you quit blocking, or you quit
asynchronously waiting, and then you splat about a memleak like we have
now.

But then, if that mechanism generally works, we don't really need a new
function and we can just go with the first option of making
kmem_cache_destroy() asynchronously wait. It'll wait, as you described,
but then we adjust the tail of every kfree_rcu batch freeing cycle to
check if there are _still_ any old outstanding kmem_cache_destroy()
requests. If so, then we can splat and keep the old debugging info we
currently have for finding memleaks.

Jason

From paulmck at kernel.org Thu Jun 13 12:46:11 2024
From: paulmck at kernel.org (Paul E. McKenney)
Date: Thu, 13 Jun 2024 05:46:11 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop>
Message-ID:

On Thu, Jun 13, 2024 at 02:22:41PM +0200, Jason A. Donenfeld wrote:
> On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E. McKenney wrote:
> > o	Make the current kmem_cache_destroy() asynchronously wait for
> > 	all memory to be returned, then complete the destruction.
> > 	(This gets rid of a valuable debugging technique because
> > 	in normal use, it is a bug to attempt to destroy a kmem_cache
> > 	that has objects still allocated.)
> >
> > o	Make a kmem_cache_destroy_rcu() that asynchronously waits for
> > 	all memory to be returned, then completes the destruction.
> > 	(This raises the question of what to do if it takes a "long
> > 	time" for the objects to be freed.)
>
> These seem like the best two options.

I like them myself, but much depends on how much violence they do to
the slab subsystem and to debuggability.

> > o	Make a kmem_cache_free_barrier() that blocks until all
> > 	objects in the specified kmem_cache have been freed.
> >
> > o	Make a kmem_cache_destroy_wait() that waits for all memory to
> > 	be returned, then does the destruction. This is equivalent to:
> >
> > 	kmem_cache_free_barrier(&mycache);
> > 	kmem_cache_destroy(&mycache);
>
> These also seem fine, but I'm less keen about blocking behavior.

One advantage of the blocking behavior is that it pinpoints memory
leaks from that slab. On the other hand, one can argue that you want
this to block during testing but to be asynchronous in production.
Borrowing someone else's hand, there are probably lots of other
arguments one can make.

> Though, along the ideas of kmem_cache_destroy_rcu(), you might also
> consider renaming this last one to kmem_cache_destroy_rcu_wait/barrier().
> This way, it's RCU focused, and you can deal directly with the question
> of, "how long is too long to block/to memleak?"

Good point!

> Specifically what I mean is that we can still claim a memory leak has
> occurred if one batched kfree_rcu freeing grace period has elapsed since
> the last call to kmem_cache_destroy_rcu_wait/barrier() or
> kmem_cache_destroy_rcu(). In that case, you quit blocking, or you quit
> asynchronously waiting, and then you splat about a memleak like we have
> now.

How about a kmem_cache_destroy_rcu() that marks that specified cache
for destruction, and then a kmem_cache_destroy_barrier() that waits?

I took the liberty of adding your name to the Google document [1] and
adding this section:

	kmem_cache_destroy_rcu/_barrier()

	The idea here is to provide an asynchronous
	kmem_cache_destroy_rcu() as described above along with a
	kmem_cache_destroy_barrier() that waits for the destruction
	of all prior kmem_cache instances previously passed to
	kmem_cache_destroy_rcu(). Alternatively, kmem_cache_destroy_rcu()
	could return a cookie that could be passed into a later call
	to kmem_cache_destroy_barrier(). This alternative has the
	advantage of isolating which kmem_cache instance is suffering
	the memory leak.

Please let me know if either liberty is in any way problematic.

> But then, if that mechanism generally works, we don't really need a new
> function and we can just go with the first option of making
> kmem_cache_destroy() asynchronously wait. It'll wait, as you described,
> but then we adjust the tail of every kfree_rcu batch freeing cycle to
> check if there are _still_ any old outstanding kmem_cache_destroy()
> requests. If so, then we can splat and keep the old debugging info we
> currently have for finding memleaks.

The mechanism can always be sabotaged by memory-leak bugs on the part
of the user of the kmem_cache structure in play, right?

OK, but I see your point. I added this to the existing
"kmem_cache_destroy() Lingers for kfree_rcu()" section:

	One way of preserving this debugging information is to splat if
	all of the slab's memory has not been freed within a reasonable
	timeframe, perhaps the same 21 seconds that causes an RCU CPU
	stall warning.

Does that capture it?

							Thanx, Paul

[1] https://docs.google.com/document/d/1v0rcZLvvjVGejT3523W0rDy_sLFu2LWc_NR3fQItZaA/edit?usp=sharing

From paulmck at kernel.org Thu Jun 13 12:47:08 2024
From: paulmck at kernel.org (Paul E. McKenney)
Date: Thu, 13 Jun 2024 05:47:08 -0700
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To:
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org>
Message-ID: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop>

On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote:
> On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > > Since SLOB was removed, it is not necessary to use call_rcu
> > > > when the callback only performs kmem_cache_free. Use
> > > > kfree_rcu() directly.
> > > >
> > > > The changes were done using the following Coccinelle semantic patch.
> > > > This semantic patch is designed to ignore cases where the callback
> > > > function is used in another way.
> > >
> > > How does the discussion on:
> > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks"
> > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/
> > > reflect on this series? IIUC we should hold off..
> >
> > We do need to hold off for the ones in kernel modules (such as 07/14)
> > where the kmem_cache is destroyed during module unload.
> >
> > OK, I might as well go through them...
> >
> > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback
> > 	Needs to wait, see wg_allowedips_slab_uninit().
>
> Also, notably, this patch needs additionally:
>
> diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
> index e4e1638fce1b..c95f6937c3f1 100644
> --- a/drivers/net/wireguard/allowedips.c
> +++ b/drivers/net/wireguard/allowedips.c
> @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void)
>
> void wg_allowedips_slab_uninit(void)
> {
> -	rcu_barrier();
> 	kmem_cache_destroy(node_cache);
> }
>
> Once kmem_cache_destroy has been fixed to be deferrable.
>
> I assume the other patches are similar -- an rcu_barrier() can be
> removed. So some manual meddling of these might be in order.

Assuming that the deferrable kmem_cache_destroy() is the option chosen,
agreed.

							Thanx, Paul

From urezki at gmail.com Thu Jun 13 13:06:54 2024
From: urezki at gmail.com (Uladzislau Rezki)
Date: Thu, 13 Jun 2024 15:06:54 +0200
Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback
In-Reply-To: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop>
References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop>
Message-ID:

On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote:
> On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote:
> > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote:
> > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote:
> > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote:
> > > > > Since SLOB was removed, it is not necessary to use call_rcu
> > > > > when the callback only performs kmem_cache_free.
> > Also, notably, this patch needs additionally:
> >
> > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c
> > index e4e1638fce1b..c95f6937c3f1 100644
> > --- a/drivers/net/wireguard/allowedips.c
> > +++ b/drivers/net/wireguard/allowedips.c
> > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void)
> >
> > void wg_allowedips_slab_uninit(void)
> > {
> > -	rcu_barrier();
> > 	kmem_cache_destroy(node_cache);
> > }
> >
> > Once kmem_cache_destroy has been fixed to be deferrable.
> >
> > I assume the other patches are similar -- an rcu_barrier() can be
> > removed. So some manual meddling of these might be in order.
>
> Assuming that the deferrable kmem_cache_destroy() is the option chosen,
> agreed.
>
void kmem_cache_destroy(struct kmem_cache *s)
{
	int err = -EBUSY;
	bool rcu_set;

	if (unlikely(!s) || !kasan_check_byte(s))
		return;

	cpus_read_lock();
	mutex_lock(&slab_mutex);

	rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU;

	s->refcount--;
	if (s->refcount)
		goto out_unlock;

	err = shutdown_cache(s);
	WARN(err, "%s %s: Slab cache still has objects when called from %pS",
	     __func__, s->name, (void *)_RET_IP_);
...
	cpus_read_unlock();
	if (!err && !rcu_set)
		kmem_cache_release(s);
}

so we have the SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages
and the cache by a grace period. A similar flag can be added, like
SLAB_DESTROY_ONCE_FULLY_FREED; in this case a worker rearms itself if
there are still objects which should be freed.

Any thoughts here?

--
Uladzislau Rezki
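A sketch of what that rearming worker might look like.
SLAB_DESTROY_ONCE_FULLY_FREED and every helper below are hypothetical,
modeled loosely on how SLAB_TYPESAFE_BY_RCU already defers its freeing:

#include <linux/slab.h>
#include <linux/workqueue.h>

/* Hypothetical deferred-destroy state; none of this exists in slab. */
struct deferred_destroy {
	struct kmem_cache *s;
	struct delayed_work dwork;
};

static bool cache_still_has_objects(struct kmem_cache *s);	/* hypothetical */
static void cache_release_now(struct kmem_cache *s);		/* hypothetical */

static void deferred_destroy_workfn(struct work_struct *work)
{
	struct deferred_destroy *dd =
		container_of(to_delayed_work(work), struct deferred_destroy, dwork);

	if (cache_still_has_objects(dd->s)) {
		/* Objects still in flight (e.g. queued in kfree_rcu
		 * batches): rearm and check again later. */
		schedule_delayed_work(&dd->dwork, HZ);
		return;
	}
	/* Fully freed: release slab pages and the cache itself. */
	cache_release_now(dd->s);
	kfree(dd);
}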
3) At the end of this batch freeing, the kernel notices that the kmem_cache whose destruction was previously deferred still has outstanding objects and has not been destroyed. It can conclude that there's thus been a memory leak. In other words, instead of having to do this based on timers, you can just have the batch freeing code ask, "did those pending kmem_cache destructions get completed as a result of this last operation?" From kuba at kernel.org Thu Jun 13 14:17:38 2024 From: kuba at kernel.org (Jakub Kicinski) Date: Thu, 13 Jun 2024 07:17:38 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> Message-ID: <20240613071738.0655ff4f@kernel.org> On Wed, 12 Jun 2024 20:38:02 -0700 Paul E. McKenney wrote: > o Make rcu_barrier() wait for kfree_rcu() objects. (This is > surprisingly complex and will wait unnecessarily in some cases. > However, it does preserve current code.) Not sure how much mental capacity for API variations we expect from people using caches, but I feel like this would score the highest on Rusty's API scale. I'd even venture an opinion that it's less confusing to require cache users to have their own (trivial) callbacks than add API variants we can't error check even at runtime... From perry at cynic.org Thu Jun 13 14:34:32 2024 From: perry at cynic.org (Perry The Cynic) Date: Thu, 13 Jun 2024 07:34:32 -0700 Subject: Wireguard, iPhone, and cruise ships Message-ID: <60B826FA-3FCA-40B5-9771-8FFEDA6278AB@cynic.org> Dear wg community, I recently enjoyed a cruise to Alaska. Fun and easy, and with Starlink on board, the WiFi connectivity was actually not bad (some sporadic packet drops, mostly). Sadly, the cruise company?s network unceremoniously drops UDP of most kinds, leading to my Wireguard VPN (to my inside network at home) failing entirely. The cruise line is utterly immovable on this: ?it?s someone else?s fault, and how dare you want to do this nonstandard thing?? Yes, I actually talked to their onboard IT guy. ?It?s on the network path somewhere, and they don?t even tell me how and why." Now I totally understand Wireguard?s attitude towards this: It?s not a ?core? wg problem, and should be solved on the outside by whatever tools happen to fit the problem. If this was a linux-to-linux connection, I?d just pop in my favorite TCP-ish tunnel tool and move on. But it?s an iPhone (and iPad). And iOS doesn?t seem to like network composability. At all. Once you move outside the ?it?s a VPN endpoint? paradigm, things get stuck very quickly. I realize this is all Apple?s fault, and they should allow building arbitrary network stacks in iOS. But they don?t (yet). NWConnection is getting pretty good, but it requires in-app code composition. AFAIK, you can?t stack two iOS VPNs on top of each other (right?). So what are the practically available options here? I can set up whatever is needed on the server endpoint (it?s Debian), but what can I do on my phone to make wg work through an HTTP(s)-shaped pinhole? I?d hate to have to ditch wg for some other vpn just for that rare case? but what?s the answer? And, to prefetch a possible ending of this discussion: if I coded up patches to the iOS client that add some tcp-wrapper option, would you take it? Cheers ? 
From perry at cynic.org Thu Jun 13 14:34:32 2024 From: perry at cynic.org (Perry The Cynic) Date: Thu, 13 Jun 2024 07:34:32 -0700 Subject: Wireguard, iPhone, and cruise ships Message-ID: <60B826FA-3FCA-40B5-9771-8FFEDA6278AB@cynic.org> Dear wg community, I recently enjoyed a cruise to Alaska. Fun and easy, and with Starlink on board, the WiFi connectivity was actually not bad (some sporadic packet drops, mostly). Sadly, the cruise company's network unceremoniously drops UDP of most kinds, leading to my Wireguard VPN (to my inside network at home) failing entirely. The cruise line is utterly immovable on this: "it's someone else's fault, and how dare you want to do this nonstandard thing?" Yes, I actually talked to their onboard IT guy. "It's on the network path somewhere, and they don't even tell me how and why." Now I totally understand Wireguard's attitude towards this: It's not a "core" wg problem, and should be solved on the outside by whatever tools happen to fit the problem. If this was a linux-to-linux connection, I'd just pop in my favorite TCP-ish tunnel tool and move on. But it's an iPhone (and iPad). And iOS doesn't seem to like network composability. At all. Once you move outside the "it's a VPN endpoint" paradigm, things get stuck very quickly. I realize this is all Apple's fault, and they should allow building arbitrary network stacks in iOS. But they don't (yet). NWConnection is getting pretty good, but it requires in-app code composition. AFAIK, you can't stack two iOS VPNs on top of each other (right?). So what are the practically available options here? I can set up whatever is needed on the server endpoint (it's Debian), but what can I do on my phone to make wg work through an HTTP(s)-shaped pinhole? I'd hate to have to ditch wg for some other vpn just for that rare case... but what's the answer? And, to prefetch a possible ending of this discussion: if I coded up patches to the iOS client that add some tcp-wrapper option, would you take it? Cheers - perry --------------------------------------------------------------------------- Perry The Cynic perry at cynic.org To a blind optimist, an optimistic realist must seem like an Accursed Cynic. --------------------------------------------------------------------------- From perry at cynic.org Thu Jun 13 14:42:41 2024 From: perry at cynic.org (Perry The Cynic) Date: Thu, 13 Jun 2024 07:42:41 -0700 Subject: Wireguard, iPhone, and cruise ships In-Reply-To: References: <60B826FA-3FCA-40B5-9771-8FFEDA6278AB@cynic.org> Message-ID: <2A8A3A9D-82CD-451B-B693-3FD01CF5861C@cynic.org> I'm basically coming to the conclusion that it's not a wg core issue, but it IS an iOS app issue. If iOS won't support a composition that works, then the app needs to. Somehow. Cheers - perry > On Jun 13, 2024, at 7:40 AM, Amir Omidi wrote: > > I think there is "technically" a way to put a VPN on a VPN and that is doing one of those VPNs as a configuration profile. I'm not 100% sure about this though. > > I've run into very similar issues to this at various hotels. I've also always wished there was something to do HTTP tunneling on Wireguard officially to help with these awful network setups. But I also understand that's not a core WG issue. > > > On Thu, Jun 13, 2024 at 2:35 PM Perry The Cynic wrote: > Dear wg community, > > I recently enjoyed a cruise to Alaska. Fun and easy, and with Starlink on board, the WiFi connectivity was actually not bad (some sporadic packet drops, mostly). Sadly, the cruise company's network unceremoniously drops UDP of most kinds, leading to my Wireguard VPN (to my inside network at home) failing entirely. The cruise line is utterly immovable on this: "it's someone else's fault, and how dare you want to do this nonstandard thing?" Yes, I actually talked to their onboard IT guy. "It's on the network path somewhere, and they don't even tell me how and why." > > Now I totally understand Wireguard's attitude towards this: It's not a "core" wg problem, and should be solved on the outside by whatever tools happen to fit the problem. If this was a linux-to-linux connection, I'd just pop in my favorite TCP-ish tunnel tool and move on. But it's an iPhone (and iPad). And iOS doesn't seem to like network composability. At all. Once you move outside the "it's a VPN endpoint" paradigm, things get stuck very quickly. I realize this is all Apple's fault, and they should allow building arbitrary network stacks in iOS. But they don't (yet). NWConnection is getting pretty good, but it requires in-app code composition. AFAIK, you can't stack two iOS VPNs on top of each other (right?). > > So what are the practically available options here? I can set up whatever is needed on the server endpoint (it's Debian), but what can I do on my phone to make wg work through an HTTP(s)-shaped pinhole? I'd hate to have to ditch wg for some other vpn just for that rare case... but what's the answer? > > And, to prefetch a possible ending of this discussion: if I coded up patches to the iOS client that add some tcp-wrapper option, would you take it? > > Cheers > - perry > --------------------------------------------------------------------------- > Perry The Cynic perry at cynic.org > To a blind optimist, an optimistic realist must seem like an Accursed Cynic. > --------------------------------------------------------------------------- >
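Whatever tool ends up doing the wrapping, the core of any "tcp-wrapper option" is the same: each WireGuard datagram is length-prefixed onto a byte stream and carved back out on the far side. A minimal sketch of that framing (illustrative only; a real tool also needs socket setup, reconnect logic, and on iOS a packet tunnel provider to host it):

#include <stdint.h>
#include <unistd.h>
#include <arpa/inet.h>

/* Send one datagram as [2-byte big-endian length][payload]. */
static int send_framed(int stream_fd, const uint8_t *pkt, uint16_t len)
{
        uint16_t be_len = htons(len);

        if (write(stream_fd, &be_len, sizeof(be_len)) != (ssize_t)sizeof(be_len))
                return -1;
        return write(stream_fd, pkt, len) == (ssize_t)len ? 0 : -1;
}

/* TCP is a stream, so one read may return a partial frame; loop. */
static int read_exact(int stream_fd, uint8_t *buf, size_t n)
{
        size_t done = 0;

        while (done < n) {
                ssize_t r = read(stream_fd, buf + done, n - done);
                if (r <= 0)
                        return -1;
                done += (size_t)r;
        }
        return 0;
}

/* Recover one datagram; returns its length, or -1 on error. */
static int recv_framed(int stream_fd, uint8_t *buf, size_t cap)
{
        uint16_t be_len, len;

        if (read_exact(stream_fd, (uint8_t *)&be_len, sizeof(be_len)))
                return -1;
        len = ntohs(be_len);
        if (len > cap)
                return -1;
        return read_exact(stream_fd, buf, len) ? -1 : (int)len;
}

On the Debian endpoint this shuttles frames between the TCP (or TLS, to look HTTPS-shaped) connection and a local UDP socket aimed at the WireGuard port; as the thread notes, the hard part is hosting the client half inside the iOS tunnel provider.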
From a at unstable.cc Thu Jun 13 14:45:40 2024 From: a at unstable.cc (Antonio Quartulli) Date: Thu, 13 Jun 2024 16:45:40 +0200 Subject: Wireguard, iPhone, and cruise ships In-Reply-To: <60B826FA-3FCA-40B5-9771-8FFEDA6278AB@cynic.org> References: <60B826FA-3FCA-40B5-9771-8FFEDA6278AB@cynic.org> Message-ID: Hi, On 13/06/2024 16:34, Perry The Cynic wrote: > So what are the practically available options here? I can set up whatever is needed on the server endpoint (it's Debian), but what can I do on my phone to make wg work through an HTTP(s)-shaped pinhole? I'd hate to have to ditch wg for some other vpn just for that rare case... but what's the answer? How about simply getting a small travel router that you can install between your devices and the uplink connection (be it wifi or ethernet)? The travel router could be running OpenWRT and thus allow you to play any wanted trick. Cheers, -- Antonio Quartulli From perry at cynic.org Thu Jun 13 14:52:19 2024 From: perry at cynic.org (Perry The Cynic) Date: Thu, 13 Jun 2024 07:52:19 -0700 Subject: Wireguard, iPhone, and cruise ships In-Reply-To: References: <60B826FA-3FCA-40B5-9771-8FFEDA6278AB@cynic.org> Message-ID: That works when I'm in my room/cabin/place. I'm actually building a Raspberry Pi-based travel box right now (so next time I have linux tools to diagnose problems), and it can do tcp wrapping/forwarding. But carrying a battery-powered router-sized thing around on vacation somewhat defeats the point of carrying an iPhone on travel. Another box to break, another battery to run out. And my wife wants vpn access too, even if she's away from me. Cheers - perry > On Jun 13, 2024, at 7:45 AM, Antonio Quartulli wrote: > > Hi, > > On 13/06/2024 16:34, Perry The Cynic wrote: >> So what are the practically available options here? I can set up whatever is needed on the server endpoint (it's Debian), but what can I do on my phone to make wg work through an HTTP(s)-shaped pinhole? I'd hate to have to ditch wg for some other vpn just for that rare case... but what's the answer? > > How about simply getting a small travel router that you can install between your devices and the uplink connection (be it wifi or ethernet)? > > The travel router could be running OpenWRT and thus allow you to play any wanted trick. > > Cheers, > > -- > Antonio Quartulli From paulmck at kernel.org Thu Jun 13 14:53:24 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Thu, 13 Jun 2024 07:53:24 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <20240613071738.0655ff4f@kernel.org> References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <20240613071738.0655ff4f@kernel.org> Message-ID: <62757652-8874-45d7-afec-734edeb03831@paulmck-laptop> On Thu, Jun 13, 2024 at 07:17:38AM -0700, Jakub Kicinski wrote: > On Wed, 12 Jun 2024 20:38:02 -0700 Paul E. McKenney wrote: > > o Make rcu_barrier() wait for kfree_rcu() objects. (This is > > surprisingly complex and will wait unnecessarily in some cases. > > However, it does preserve current code.) > > Not sure how much mental capacity for API variations we expect from > people using caches, but I feel like this would score the highest > on Rusty's API scale.
I'd even venture an opinion that it's less > confusing to require cache users to have their own (trivial) callbacks > than add API variants we can't error check even at runtime... Fair point, though please see Jason's emails. And the underlying within-RCU mechanism is the same either way, so that API decision can be deferred for some time. But the within-slab mechanism does have the advantage of also possibly simplifying reference-counting and the potential upcoming hazard pointers. On the other hand, I currently have no idea what level of violence this change would make to the slab subsystem. Thanx, Paul From paulmck at kernel.org Thu Jun 13 15:06:30 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Thu, 13 Jun 2024 08:06:30 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> Message-ID: <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > kfree_rcu() directly. > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > function is used in another way. > > > > > > > > > > How does the discussion on: > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > OK, I might as well go through them... > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > Also, notably, this patch needs additionally: > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > --- a/drivers/net/wireguard/allowedips.c > > > +++ b/drivers/net/wireguard/allowedips.c > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > void wg_allowedips_slab_uninit(void) > > > { > > > - rcu_barrier(); > > > kmem_cache_destroy(node_cache); > > > } > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > removed. So some manual meddling of these might be in order. > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > agreed. 
> > > > void kmem_cache_destroy(struct kmem_cache *s) > { > int err = -EBUSY; > bool rcu_set; > > if (unlikely(!s) || !kasan_check_byte(s)) > return; > > cpus_read_lock(); > mutex_lock(&slab_mutex); > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > s->refcount--; > if (s->refcount) > goto out_unlock; > > err = shutdown_cache(s); > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > __func__, s->name, (void *)_RET_IP_); > ... > cpus_read_unlock(); > if (!err && !rcu_set) > kmem_cache_release(s); > } > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > and a cache by a grace period. Similar flag can be added, like > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > if there are still objects which should be freed. > > Any thoughts here? Wouldn't we also need some additional code to later check for all objects being freed to the slab, whether or not that code is initiated from kmem_cache_destroy()? Either way, I am adding the SLAB_DESTROY_ONCE_FULLY_FREED possibility, thank you! [1] Thanx, Paul [1] https://docs.google.com/document/d/1v0rcZLvvjVGejT3523W0rDy_sLFu2LWc_NR3fQItZaA/edit?usp=sharing From paulmck at kernel.org Thu Jun 13 15:12:05 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Thu, 13 Jun 2024 08:12:05 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> Message-ID: <6595ff2a-690e-4d6c-9be5-eb83f2df23fa@paulmck-laptop> On Thu, Jun 13, 2024 at 04:11:52PM +0200, Jason A. Donenfeld wrote: > On Thu, Jun 13, 2024 at 05:46:11AM -0700, Paul E. McKenney wrote: > > How about a kmem_cache_destroy_rcu() that marks that specified cache > > for destruction, and then a kmem_cache_destroy_barrier() that waits? > > > > I took the liberty of adding your name to the Google document [1] and > > adding this section: > > Cool, though no need to make me yellow! No worries, Jakub is also colored yellow. People added tomorrow will be cyan if I follow my usual change-color ordering. ;-) > > > But then, if that mechanism generally works, we don't really need a new > > > function and we can just go with the first option of making > > > kmem_cache_destroy() asynchronously wait. It'll wait, as you described, > > > but then we adjust the tail of every kfree_rcu batch freeing cycle to > > > check if there are _still_ any old outstanding kmem_cache_destroy() > > > requests. If so, then we can splat and keep the old debugging info we > > > currently have for finding memleaks. > > > > The mechanism can always be sabotaged by memory-leak bugs on the part > > of the user of the kmem_cache structure in play, right? > > > > OK, but I see your point. I added this to the existing > > "kmem_cache_destroy() Lingers for kfree_rcu()" section: > > > > One way of preserving this debugging information is to splat if > > all of the slab?s memory has not been freed within a reasonable > > timeframe, perhaps the same 21 seconds that causes an RCU CPU > > stall warning. > > > > Does that capture it? > > Not quite what I was thinking. Your 21 seconds as a time-based thing I > guess could be fine. But I was mostly thinking: > > 1) kmem_cache_destroy() is called, but there are outstanding objects, so > it defers. > > 2) Sometime later, a kfree_rcu_work batch freeing operation runs. 
Or not, if there has been a leak and there happens to be no outstanding kfree_rcu() memory. > 3) At the end of this batch freeing, the kernel notices that the > kmem_cache whose destruction was previously deferred still has > outstanding objects and has not been destroyed. It can conclude that > there's thus been a memory leak. And the batch freeing can be replicated across CPUs, so it would be necessary to determine which was last to do this effectively. Don't get me wrong, this can be done, but the performance/latency tradeoffs can be interesting. > In other words, instead of having to do this based on timers, you can > just have the batch freeing code ask, "did those pending kmem_cache > destructions get completed as a result of this last operation?" I agree that kfree_rcu_work-batch time is a good time to evaluate slab (and I have added this to the document), but I do not believe that it can completely replace timeouts. Thanx, Paul From urezki at gmail.com Thu Jun 13 17:38:59 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Thu, 13 Jun 2024 19:38:59 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > function is used in another way. > > > > > > > > > > > > How does the discussion on: > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > Needs to wait, see wg_allowedips_slab_uninit(). 
> > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > --- a/drivers/net/wireguard/allowedips.c > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > { > > > > - rcu_barrier(); > > > > kmem_cache_destroy(node_cache); > > > > } > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > removed. So some manual meddling of these might be in order. > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > agreed. > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > { > > int err = -EBUSY; > > bool rcu_set; > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > return; > > > > cpus_read_lock(); > > mutex_lock(&slab_mutex); > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > s->refcount--; > > if (s->refcount) > > goto out_unlock; > > > > err = shutdown_cache(s); > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > __func__, s->name, (void *)_RET_IP_); > > ... > > cpus_read_unlock(); > > if (!err && !rcu_set) > > kmem_cache_release(s); > > } > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > and a cache by a grace period. Similar flag can be added, like > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > if there are still objects which should be freed. > > > > Any thoughts here? > > Wouldn't we also need some additional code to later check for all objects > being freed to the slab, whether or not that code is initiated from > kmem_cache_destroy()? > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. It checks that flag and if it is true and extra worker is scheduled to perform a deferred(instead of right away) destroy after rcu_barrier() finishes. -- Uladzislau Rezki From paulmck at kernel.org Thu Jun 13 17:45:59 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Thu, 13 Jun 2024 10:45:59 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. 
> > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > > function is used in another way. > > > > > > > > > > > > > > How does the discussion on: > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > { > > > > > - rcu_barrier(); > > > > > kmem_cache_destroy(node_cache); > > > > > } > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > > agreed. > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > { > > > int err = -EBUSY; > > > bool rcu_set; > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > return; > > > > > > cpus_read_lock(); > > > mutex_lock(&slab_mutex); > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > s->refcount--; > > > if (s->refcount) > > > goto out_unlock; > > > > > > err = shutdown_cache(s); > > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > > __func__, s->name, (void *)_RET_IP_); > > > ... > > > cpus_read_unlock(); > > > if (!err && !rcu_set) > > > kmem_cache_release(s); > > > } > > > > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > > and a cache by a grace period. Similar flag can be added, like > > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > > if there are still objects which should be freed. > > > > > > Any thoughts here? > > > > Wouldn't we also need some additional code to later check for all objects > > being freed to the slab, whether or not that code is initiated from > > kmem_cache_destroy()? > > > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. > It checks that flag and if it is true and extra worker is scheduled to perform a > deferred(instead of right away) destroy after rcu_barrier() finishes. Like this? SLAB_DESTROY_ONCE_FULLY_FREED Instead of adding a new kmem_cache_destroy_rcu() or kmem_cache_destroy_wait() API member, instead add a SLAB_DESTROY_ONCE_FULLY_FREED flag that can be passed to the existing kmem_cache_destroy() function.? 
Use of this flag would suppress any warnings that would otherwise be issued if there was still slab memory yet to be freed, and it would also spawn workqueues (or timers or whatever) to do any needed cleanup work. Thanx, Paul From urezki at gmail.com Thu Jun 13 17:58:17 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Thu, 13 Jun 2024 19:58:17 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > { > > > > > > - rcu_barrier(); > > > > > > kmem_cache_destroy(node_cache); > > > > > > } > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > > > agreed. 
> > > > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > > { > > > > int err = -EBUSY; > > > > bool rcu_set; > > > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > > return; > > > > > > > > cpus_read_lock(); > > > > mutex_lock(&slab_mutex); > > > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > > > s->refcount--; > > > > if (s->refcount) > > > > goto out_unlock; > > > > > > > > err = shutdown_cache(s); > > > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > > > __func__, s->name, (void *)_RET_IP_); > > > > ... > > > > cpus_read_unlock(); > > > > if (!err && !rcu_set) > > > > kmem_cache_release(s); > > > > } > > > > > > > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > > > and a cache by a grace period. Similar flag can be added, like > > > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > > > if there are still objects which should be freed. > > > > > > > > Any thoughts here? > > > > > > Wouldn't we also need some additional code to later check for all objects > > > being freed to the slab, whether or not that code is initiated from > > > kmem_cache_destroy()? > > > > > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. > > It checks that flag and if it is true and extra worker is scheduled to perform a > > deferred(instead of right away) destroy after rcu_barrier() finishes. > > Like this? > > SLAB_DESTROY_ONCE_FULLY_FREED > > Instead of adding a new kmem_cache_destroy_rcu() > or kmem_cache_destroy_wait() API member, instead add a > SLAB_DESTROY_ONCE_FULLY_FREED flag that can be passed to the > existing kmem_cache_destroy() function.? Use of this flag would > suppress any warnings that would otherwise be issued if there > was still slab memory yet to be freed, and it would also spawn > workqueues (or timers or whatever) to do any needed cleanup work. > > The flag is passed as all others during creating a cache: slab = kmem_cache_create(name, size, ..., SLAB_DESTROY_ONCE_FULLY_FREED | OTHER_FLAGS, NULL); the rest description is correct to me. -- Uladzislau Rezki From paulmck at kernel.org Thu Jun 13 18:13:52 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Thu, 13 Jun 2024 11:13:52 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Thu, Jun 13, 2024 at 07:58:17PM +0200, Uladzislau Rezki wrote: > On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > > kfree_rcu() directly. 
> > > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > > { > > > > > > > - rcu_barrier(); > > > > > > > kmem_cache_destroy(node_cache); > > > > > > > } > > > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > > > > agreed. > > > > > > > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > > > { > > > > > int err = -EBUSY; > > > > > bool rcu_set; > > > > > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > > > return; > > > > > > > > > > cpus_read_lock(); > > > > > mutex_lock(&slab_mutex); > > > > > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > > > > > s->refcount--; > > > > > if (s->refcount) > > > > > goto out_unlock; > > > > > > > > > > err = shutdown_cache(s); > > > > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > > > > __func__, s->name, (void *)_RET_IP_); > > > > > ... > > > > > cpus_read_unlock(); > > > > > if (!err && !rcu_set) > > > > > kmem_cache_release(s); > > > > > } > > > > > > > > > > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > > > > and a cache by a grace period. Similar flag can be added, like > > > > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > > > > if there are still objects which should be freed. > > > > > > > > > > Any thoughts here? > > > > > > > > Wouldn't we also need some additional code to later check for all objects > > > > being freed to the slab, whether or not that code is initiated from > > > > kmem_cache_destroy()? > > > > > > > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. > > > It checks that flag and if it is true and extra worker is scheduled to perform a > > > deferred(instead of right away) destroy after rcu_barrier() finishes. 
> > > > Like this? > > > > SLAB_DESTROY_ONCE_FULLY_FREED > > > > Instead of adding a new kmem_cache_destroy_rcu() > > or kmem_cache_destroy_wait() API member, instead add a > > SLAB_DESTROY_ONCE_FULLY_FREED flag that can be passed to the > > existing kmem_cache_destroy() function.? Use of this flag would > > suppress any warnings that would otherwise be issued if there > > was still slab memory yet to be freed, and it would also spawn > > workqueues (or timers or whatever) to do any needed cleanup work. > > > > > The flag is passed as all others during creating a cache: > > slab = kmem_cache_create(name, size, ..., SLAB_DESTROY_ONCE_FULLY_FREED | OTHER_FLAGS, NULL); > > the rest description is correct to me. Good catch, fixed, thank you! Thanx, Paul From urezki at gmail.com Fri Jun 14 12:35:33 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Fri, 14 Jun 2024 14:35:33 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240612143305.451abf58@kernel.org> <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Thu, Jun 13, 2024 at 11:13:52AM -0700, Paul E. McKenney wrote: > On Thu, Jun 13, 2024 at 07:58:17PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). 
> > > > > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > > > { > > > > > > > > - rcu_barrier(); > > > > > > > > kmem_cache_destroy(node_cache); > > > > > > > > } > > > > > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > > > > > agreed. > > > > > > > > > > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > > > > { > > > > > > int err = -EBUSY; > > > > > > bool rcu_set; > > > > > > > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > > > > return; > > > > > > > > > > > > cpus_read_lock(); > > > > > > mutex_lock(&slab_mutex); > > > > > > > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > > > > > > > s->refcount--; > > > > > > if (s->refcount) > > > > > > goto out_unlock; > > > > > > > > > > > > err = shutdown_cache(s); > > > > > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > > > > > __func__, s->name, (void *)_RET_IP_); > > > > > > ... > > > > > > cpus_read_unlock(); > > > > > > if (!err && !rcu_set) > > > > > > kmem_cache_release(s); > > > > > > } > > > > > > > > > > > > > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > > > > > and a cache by a grace period. Similar flag can be added, like > > > > > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > > > > > if there are still objects which should be freed. > > > > > > > > > > > > Any thoughts here? > > > > > > > > > > Wouldn't we also need some additional code to later check for all objects > > > > > being freed to the slab, whether or not that code is initiated from > > > > > kmem_cache_destroy()? > > > > > > > > > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. > > > > It checks that flag and if it is true and extra worker is scheduled to perform a > > > > deferred(instead of right away) destroy after rcu_barrier() finishes. > > > > > > Like this? > > > > > > SLAB_DESTROY_ONCE_FULLY_FREED > > > > > > Instead of adding a new kmem_cache_destroy_rcu() > > > or kmem_cache_destroy_wait() API member, instead add a > > > SLAB_DESTROY_ONCE_FULLY_FREED flag that can be passed to the > > > existing kmem_cache_destroy() function.? Use of this flag would > > > suppress any warnings that would otherwise be issued if there > > > was still slab memory yet to be freed, and it would also spawn > > > workqueues (or timers or whatever) to do any needed cleanup work. > > > > > > > > The flag is passed as all others during creating a cache: > > > > slab = kmem_cache_create(name, size, ..., SLAB_DESTROY_ONCE_FULLY_FREED | OTHER_FLAGS, NULL); > > > > the rest description is correct to me. > > Good catch, fixed, thank you! 
> And here we go with prototype(untested): diff --git a/include/linux/slab.h b/include/linux/slab.h index 7247e217e21b..700b8a909f8a 100644 --- a/include/linux/slab.h +++ b/include/linux/slab.h @@ -59,6 +59,7 @@ enum _slab_flag_bits { #ifdef CONFIG_SLAB_OBJ_EXT _SLAB_NO_OBJ_EXT, #endif + _SLAB_DEFER_DESTROY, _SLAB_FLAGS_LAST_BIT }; @@ -139,6 +140,7 @@ enum _slab_flag_bits { */ /* Defer freeing slabs to RCU */ #define SLAB_TYPESAFE_BY_RCU __SLAB_FLAG_BIT(_SLAB_TYPESAFE_BY_RCU) +#define SLAB_DEFER_DESTROY __SLAB_FLAG_BIT(_SLAB_DEFER_DESTROY) /* Trace allocations and frees */ #define SLAB_TRACE __SLAB_FLAG_BIT(_SLAB_TRACE) diff --git a/mm/slab_common.c b/mm/slab_common.c index 1560a1546bb1..99458a0197b5 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -45,6 +45,11 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work); static DECLARE_WORK(slab_caches_to_rcu_destroy_work, slab_caches_to_rcu_destroy_workfn); +static LIST_HEAD(slab_caches_defer_destroy); +static void slab_caches_defer_destroy_workfn(struct work_struct *work); +static DECLARE_DELAYED_WORK(slab_caches_defer_destroy_work, + slab_caches_defer_destroy_workfn); + /* * Set of flags that will prevent slab merging */ @@ -448,6 +453,31 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) } } +static void +slab_caches_defer_destroy_workfn(struct work_struct *work) +{ + struct kmem_cache *s, *s2; + + mutex_lock(&slab_mutex); + list_for_each_entry_safe(s, s2, &slab_caches_defer_destroy, list) { + if (__kmem_cache_empty(s)) { + /* free asan quarantined objects */ + kasan_cache_shutdown(s); + (void) __kmem_cache_shutdown(s); + + list_del(&s->list); + + debugfs_slab_release(s); + kfence_shutdown_cache(s); + kmem_cache_release(s); + } + } + mutex_unlock(&slab_mutex); + + if (!list_empty(&slab_caches_defer_destroy)) + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); +} + static int shutdown_cache(struct kmem_cache *s) { /* free asan quarantined objects */ @@ -493,6 +523,13 @@ void kmem_cache_destroy(struct kmem_cache *s) if (s->refcount) goto out_unlock; + /* Should a destroy process be deferred? */ + if (s->flags & SLAB_DEFER_DESTROY) { + list_move_tail(&s->list, &slab_caches_defer_destroy); + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); + goto out_unlock; + } + err = shutdown_cache(s); WARN(err, "%s %s: Slab cache still has objects when called from %pS", __func__, s->name, (void *)_RET_IP_); Thanks! -- Uladzislau Rezki From paulmck at kernel.org Fri Jun 14 14:17:29 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Fri, 14 Jun 2024 07:17:29 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > On Thu, Jun 13, 2024 at 11:13:52AM -0700, Paul E. McKenney wrote: > > On Thu, Jun 13, 2024 at 07:58:17PM +0200, Uladzislau Rezki wrote: > > > On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. McKenney wrote: > > > > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > > > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. 
Donenfeld wrote: > > > > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > > > > { > > > > > > > > > - rcu_barrier(); > > > > > > > > > kmem_cache_destroy(node_cache); > > > > > > > > > } > > > > > > > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > > > > > > agreed. > > > > > > > > > > > > > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > > > > > { > > > > > > > int err = -EBUSY; > > > > > > > bool rcu_set; > > > > > > > > > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > > > > > return; > > > > > > > > > > > > > > cpus_read_lock(); > > > > > > > mutex_lock(&slab_mutex); > > > > > > > > > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > > > > > > > > > s->refcount--; > > > > > > > if (s->refcount) > > > > > > > goto out_unlock; > > > > > > > > > > > > > > err = shutdown_cache(s); > > > > > > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > > > > > > __func__, s->name, (void *)_RET_IP_); > > > > > > > ... 
> > > > > > > cpus_read_unlock(); > > > > > > > if (!err && !rcu_set) > > > > > > > kmem_cache_release(s); > > > > > > > } > > > > > > > > > > > > > > > > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > > > > > > and a cache by a grace period. Similar flag can be added, like > > > > > > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > > > > > > if there are still objects which should be freed. > > > > > > > > > > > > > > Any thoughts here? > > > > > > > > > > > > Wouldn't we also need some additional code to later check for all objects > > > > > > being freed to the slab, whether or not that code is initiated from > > > > > > kmem_cache_destroy()? > > > > > > > > > > > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. > > > > > It checks that flag and if it is true and extra worker is scheduled to perform a > > > > > deferred(instead of right away) destroy after rcu_barrier() finishes. > > > > > > > > Like this? > > > > > > > > SLAB_DESTROY_ONCE_FULLY_FREED > > > > > > > > Instead of adding a new kmem_cache_destroy_rcu() > > > > or kmem_cache_destroy_wait() API member, instead add a > > > > SLAB_DESTROY_ONCE_FULLY_FREED flag that can be passed to the > > > > existing kmem_cache_destroy() function.? Use of this flag would > > > > suppress any warnings that would otherwise be issued if there > > > > was still slab memory yet to be freed, and it would also spawn > > > > workqueues (or timers or whatever) to do any needed cleanup work. > > > > > > > > > > > The flag is passed as all others during creating a cache: > > > > > > slab = kmem_cache_create(name, size, ..., SLAB_DESTROY_ONCE_FULLY_FREED | OTHER_FLAGS, NULL); > > > > > > the rest description is correct to me. > > > > Good catch, fixed, thank you! > > > And here we go with prototype(untested): Thank you for putting this together! It looks way simpler than I would have guessed, and quite a bit simpler than I would expect it would be to extend rcu_barrier() to cover kfree_rcu(). 
> > diff --git a/include/linux/slab.h b/include/linux/slab.h > index 7247e217e21b..700b8a909f8a 100644 > --- a/include/linux/slab.h > +++ b/include/linux/slab.h > @@ -59,6 +59,7 @@ enum _slab_flag_bits { > #ifdef CONFIG_SLAB_OBJ_EXT > _SLAB_NO_OBJ_EXT, > #endif > + _SLAB_DEFER_DESTROY, > _SLAB_FLAGS_LAST_BIT > }; > > @@ -139,6 +140,7 @@ enum _slab_flag_bits { > */ > /* Defer freeing slabs to RCU */ > #define SLAB_TYPESAFE_BY_RCU __SLAB_FLAG_BIT(_SLAB_TYPESAFE_BY_RCU) > +#define SLAB_DEFER_DESTROY __SLAB_FLAG_BIT(_SLAB_DEFER_DESTROY) > /* Trace allocations and frees */ > #define SLAB_TRACE __SLAB_FLAG_BIT(_SLAB_TRACE) > > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 1560a1546bb1..99458a0197b5 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -45,6 +45,11 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work); > static DECLARE_WORK(slab_caches_to_rcu_destroy_work, > slab_caches_to_rcu_destroy_workfn); > > +static LIST_HEAD(slab_caches_defer_destroy); > +static void slab_caches_defer_destroy_workfn(struct work_struct *work); > +static DECLARE_DELAYED_WORK(slab_caches_defer_destroy_work, > + slab_caches_defer_destroy_workfn); > + > /* > * Set of flags that will prevent slab merging > */ > @@ -448,6 +453,31 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) > } > } > > +static void > +slab_caches_defer_destroy_workfn(struct work_struct *work) > +{ > + struct kmem_cache *s, *s2; > + > + mutex_lock(&slab_mutex); > + list_for_each_entry_safe(s, s2, &slab_caches_defer_destroy, list) { > + if (__kmem_cache_empty(s)) { > + /* free asan quarantined objects */ > + kasan_cache_shutdown(s); > + (void) __kmem_cache_shutdown(s); > + > + list_del(&s->list); > + > + debugfs_slab_release(s); > + kfence_shutdown_cache(s); > + kmem_cache_release(s); > + } My guess is that there would want to be a splat if the slab stuck around for too long, but maybe that should instead be handled elsewhere or in some other way? I must defer to you guys on that one. Thanx, Paul > + } > + mutex_unlock(&slab_mutex); > + > + if (!list_empty(&slab_caches_defer_destroy)) > + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); > +} > + > static int shutdown_cache(struct kmem_cache *s) > { > /* free asan quarantined objects */ > @@ -493,6 +523,13 @@ void kmem_cache_destroy(struct kmem_cache *s) > if (s->refcount) > goto out_unlock; > > + /* Should a destroy process be deferred? */ > + if (s->flags & SLAB_DEFER_DESTROY) { > + list_move_tail(&s->list, &slab_caches_defer_destroy); > + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); > + goto out_unlock; > + } > + > err = shutdown_cache(s); > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > __func__, s->name, (void *)_RET_IP_); > From urezki at gmail.com Fri Jun 14 14:50:45 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Fri, 14 Jun 2024 16:50:45 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Fri, Jun 14, 2024 at 07:17:29AM -0700, Paul E. McKenney wrote: > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > > On Thu, Jun 13, 2024 at 11:13:52AM -0700, Paul E. McKenney wrote: > > > On Thu, Jun 13, 2024 at 07:58:17PM +0200, Uladzislau Rezki wrote: > > > > On Thu, Jun 13, 2024 at 10:45:59AM -0700, Paul E. 
McKenney wrote: > > > > > On Thu, Jun 13, 2024 at 07:38:59PM +0200, Uladzislau Rezki wrote: > > > > > > On Thu, Jun 13, 2024 at 08:06:30AM -0700, Paul E. McKenney wrote: > > > > > > > On Thu, Jun 13, 2024 at 03:06:54PM +0200, Uladzislau Rezki wrote: > > > > > > > > On Thu, Jun 13, 2024 at 05:47:08AM -0700, Paul E. McKenney wrote: > > > > > > > > > On Thu, Jun 13, 2024 at 01:58:59PM +0200, Jason A. Donenfeld wrote: > > > > > > > > > > On Wed, Jun 12, 2024 at 03:37:55PM -0700, Paul E. McKenney wrote: > > > > > > > > > > > On Wed, Jun 12, 2024 at 02:33:05PM -0700, Jakub Kicinski wrote: > > > > > > > > > > > > On Sun, 9 Jun 2024 10:27:12 +0200 Julia Lawall wrote: > > > > > > > > > > > > > Since SLOB was removed, it is not necessary to use call_rcu > > > > > > > > > > > > > when the callback only performs kmem_cache_free. Use > > > > > > > > > > > > > kfree_rcu() directly. > > > > > > > > > > > > > > > > > > > > > > > > > > The changes were done using the following Coccinelle semantic patch. > > > > > > > > > > > > > This semantic patch is designed to ignore cases where the callback > > > > > > > > > > > > > function is used in another way. > > > > > > > > > > > > > > > > > > > > > > > > How does the discussion on: > > > > > > > > > > > > [PATCH] Revert "batman-adv: prefer kfree_rcu() over call_rcu() with free-only callbacks" > > > > > > > > > > > > https://lore.kernel.org/all/20240612133357.2596-1-linus.luessing at c0d3.blue/ > > > > > > > > > > > > reflect on this series? IIUC we should hold off.. > > > > > > > > > > > > > > > > > > > > > > We do need to hold off for the ones in kernel modules (such as 07/14) > > > > > > > > > > > where the kmem_cache is destroyed during module unload. > > > > > > > > > > > > > > > > > > > > > > OK, I might as well go through them... > > > > > > > > > > > > > > > > > > > > > > [PATCH 01/14] wireguard: allowedips: replace call_rcu by kfree_rcu for simple kmem_cache_free callback > > > > > > > > > > > Needs to wait, see wg_allowedips_slab_uninit(). > > > > > > > > > > > > > > > > > > > > Also, notably, this patch needs additionally: > > > > > > > > > > > > > > > > > > > > diff --git a/drivers/net/wireguard/allowedips.c b/drivers/net/wireguard/allowedips.c > > > > > > > > > > index e4e1638fce1b..c95f6937c3f1 100644 > > > > > > > > > > --- a/drivers/net/wireguard/allowedips.c > > > > > > > > > > +++ b/drivers/net/wireguard/allowedips.c > > > > > > > > > > @@ -377,7 +377,6 @@ int __init wg_allowedips_slab_init(void) > > > > > > > > > > > > > > > > > > > > void wg_allowedips_slab_uninit(void) > > > > > > > > > > { > > > > > > > > > > - rcu_barrier(); > > > > > > > > > > kmem_cache_destroy(node_cache); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > Once kmem_cache_destroy has been fixed to be deferrable. > > > > > > > > > > > > > > > > > > > > I assume the other patches are similar -- an rcu_barrier() can be > > > > > > > > > > removed. So some manual meddling of these might be in order. > > > > > > > > > > > > > > > > > > Assuming that the deferrable kmem_cache_destroy() is the option chosen, > > > > > > > > > agreed. 
> > > > > > > > > > > > > > > > > > > > > > > > > void kmem_cache_destroy(struct kmem_cache *s) > > > > > > > > { > > > > > > > > int err = -EBUSY; > > > > > > > > bool rcu_set; > > > > > > > > > > > > > > > > if (unlikely(!s) || !kasan_check_byte(s)) > > > > > > > > return; > > > > > > > > > > > > > > > > cpus_read_lock(); > > > > > > > > mutex_lock(&slab_mutex); > > > > > > > > > > > > > > > > rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > > > > > > > > > > > > > > > > s->refcount--; > > > > > > > > if (s->refcount) > > > > > > > > goto out_unlock; > > > > > > > > > > > > > > > > err = shutdown_cache(s); > > > > > > > > WARN(err, "%s %s: Slab cache still has objects when called from %pS", > > > > > > > > __func__, s->name, (void *)_RET_IP_); > > > > > > > > ... > > > > > > > > cpus_read_unlock(); > > > > > > > > if (!err && !rcu_set) > > > > > > > > kmem_cache_release(s); > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > so we have SLAB_TYPESAFE_BY_RCU flag that defers freeing slab-pages > > > > > > > > and a cache by a grace period. Similar flag can be added, like > > > > > > > > SLAB_DESTROY_ONCE_FULLY_FREED, in this case a worker rearm itself > > > > > > > > if there are still objects which should be freed. > > > > > > > > > > > > > > > > Any thoughts here? > > > > > > > > > > > > > > Wouldn't we also need some additional code to later check for all objects > > > > > > > being freed to the slab, whether or not that code is initiated from > > > > > > > kmem_cache_destroy()? > > > > > > > > > > > > > Same away as SLAB_TYPESAFE_BY_RCU is handled from the kmem_cache_destroy() function. > > > > > > It checks that flag and if it is true and extra worker is scheduled to perform a > > > > > > deferred(instead of right away) destroy after rcu_barrier() finishes. > > > > > > > > > > Like this? > > > > > > > > > > SLAB_DESTROY_ONCE_FULLY_FREED > > > > > > > > > > Instead of adding a new kmem_cache_destroy_rcu() > > > > > or kmem_cache_destroy_wait() API member, instead add a > > > > > SLAB_DESTROY_ONCE_FULLY_FREED flag that can be passed to the > > > > > existing kmem_cache_destroy() function.? Use of this flag would > > > > > suppress any warnings that would otherwise be issued if there > > > > > was still slab memory yet to be freed, and it would also spawn > > > > > workqueues (or timers or whatever) to do any needed cleanup work. > > > > > > > > > > > > > > The flag is passed as all others during creating a cache: > > > > > > > > slab = kmem_cache_create(name, size, ..., SLAB_DESTROY_ONCE_FULLY_FREED | OTHER_FLAGS, NULL); > > > > > > > > the rest description is correct to me. > > > > > > Good catch, fixed, thank you! > > > > > And here we go with prototype(untested): > > Thank you for putting this together! It looks way simpler than I would > have guessed, and quite a bit simpler than I would expect it would be > to extend rcu_barrier() to cover kfree_rcu(). > Yep, it should be pretty pretty straightforward. The slab mechanism does not have a functionality when it comes to defer of destroying, i.e. it is not allowed to destroy non-fully-freed-slab: void kmem_cache_destroy(struct kmem_cache *s) { ... err = shutdown_cache(s); WARN(err, "%s %s: Slab cache still has objects when called from %pS", __func__, s->name, (void *)_RET_IP_); ... So, this patch extends it. 
> > > > +static void > > +slab_caches_defer_destroy_workfn(struct work_struct *work) > > +{ > > + struct kmem_cache *s, *s2; > > + > > + mutex_lock(&slab_mutex); > > + list_for_each_entry_safe(s, s2, &slab_caches_defer_destroy, list) { > > + if (__kmem_cache_empty(s)) { > > + /* free asan quarantined objects */ > > + kasan_cache_shutdown(s); > > + (void) __kmem_cache_shutdown(s); > > + > > + list_del(&s->list); > > + > > + debugfs_slab_release(s); > > + kfence_shutdown_cache(s); > > + kmem_cache_release(s); > > + } > > My guess is that there would want to be a splat if the slab stuck around > for too long, but maybe that should instead be handled elsewhere or in > some other way? I must defer to you guys on that one. > Probably yes. -- Uladzislau Rezki From Jason at zx2c4.com Fri Jun 14 19:33:45 2024 From: Jason at zx2c4.com (Jason A. Donenfeld) Date: Fri, 14 Jun 2024 21:33:45 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > + /* Should a destroy process be deferred? */ > + if (s->flags & SLAB_DEFER_DESTROY) { > + list_move_tail(&s->list, &slab_caches_defer_destroy); > + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); > + goto out_unlock; > + } Wouldn't it be smoother to have the actual kmem_cache_free() function check to see if it's been marked for destruction and the refcount is zero, rather than polling every one second? I mentioned this approach in: https://lore.kernel.org/all/Zmo9-YGraiCj5-MI at zx2c4.com/ - I wonder if the right fix to this would be adding a `should_destroy` boolean to kmem_cache, which kmem_cache_destroy() sets to true. And then right after it checks `if (number_of_allocations == 0) actually_destroy()`, and likewise on each kmem_cache_free(), it could check `if (should_destroy && number_of_allocations == 0) actually_destroy()`. Jason From max.schulze at online.de Sun Jun 16 13:47:38 2024 From: max.schulze at online.de (Max Schulze) Date: Sun, 16 Jun 2024 15:47:38 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: References: Message-ID: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> Hi, On 12.06.24 16:11, Germano Massullo wrote: > Hello, I would like to ask if you are aware of any mini PCI express > card that provides hardware acceleration for ChaCha20 algorithm. I > would need it to improve Turris Omnia Wireguard throughput why do you think this is the bottleneck and at what speed are you hitting a limit? Curious, as I always found wg performance to be excellent, even on ARM. From germano.massullo at gmail.com Sun Jun 16 14:59:34 2024 From: germano.massullo at gmail.com (Germano Massullo) Date: Sun, 16 Jun 2024 16:59:34 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> Message-ID: <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> Il 16/06/24 15:47, Max Schulze ha scritto: > why do you think this is the bottleneck and at what speed are you hitting a limit? 
I get ~550 Mbit/s throughput in LAN between a Ryzen 5 3600 and the Turris Omnia, whose CPU goes to 100% load during the iperf3 test From max.schulze at online.de Sun Jun 16 19:00:37 2024 From: max.schulze at online.de (Max Schulze) Date: Sun, 16 Jun 2024 21:00:37 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> Message-ID: On 16.06.24 16:59, Germano Massullo wrote: > Il 16/06/24 15:47, Max Schulze ha scritto: >> why do you think this is the bottleneck and at what speed are you hitting a limit? > > I get ~550 Mbit/s throughput in LAN between a Ryzen 5 3600 and the Turris Omnia, whose CPU goes to 100% load during the iperf3 test Ok then I think you really max out the cpu. I have not heard of any acceleration card. Overall I think it's not too bad. Some notes: Per [1], my stock Raspberry Pi 4 B (BCM2711, quad-core) has roughly 1.5x the cpu-power of the dual-core Marvell Armada 385. Are you running iperf3 with --bidir? I get: > - - - - - - - - - - - - - - - - - - - - - - - - - > [ ID][Role] Interval Transfer Bitrate Retr > [ 5][TX-C] 0.00-120.00 sec 10.4 GBytes 744 Mbits/sec 1839 sender > [ 5][TX-C] 0.00-120.00 sec 10.4 GBytes 744 Mbits/sec receiver > [ 7][RX-C] 0.00-120.00 sec 9.08 GBytes 650 Mbits/sec 151 sender > [ 7][RX-C] 0.00-120.00 sec 9.08 GBytes 650 Mbits/sec receiver Keep in mind that iperf3 itself uses some cpu. You could test serving a static file and transfer via http. ( ex: dd if=/dev/urandom of=/dev/shm/test.rand bs=1M count=300 and serve with [2], and download with "wget -O /dev/null [...]" ) I get 848 Mbit/s when downloading to the pi and 728 Mbit/s when downloading from it (everything via wireguard). [1] https://github.com/ThomasKaiser/sbc-bench/blob/master/Results.md [2] https://github.com/svenstaro/miniserve From germano.massullo at gmail.com Mon Jun 17 09:21:15 2024 From: germano.massullo at gmail.com (Germano Massullo) Date: Mon, 17 Jun 2024 11:21:15 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> Message-ID: Il 16/06/24 21:00, Max Schulze ha scritto: > Ok then I think you really max out the cpu. I have not heard of any acceleration card. Overall I think it's not too bad. The problem is that it is far below my internet connection capabilities (1 Gbit/s upload) > Are you running iperf3 with --bidir? Such a flag halves the throughput: I am getting ~280 Mbit/s compared to the previous value I got by using iperf3 -c 10.0.50.1 -P 4 -Z bbr (using the Ryzen as client) > Keep in mind that iperf3 itself uses some cpu. > You could test serving a static file and transfer via http. The iperf3 CPU usage is not so high; it wouldn't change much to use the http transfer From a at unstable.cc Mon Jun 17 09:45:46 2024 From: a at unstable.cc (Antonio Quartulli) Date: Mon, 17 Jun 2024 11:45:46 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> Message-ID: On 17/06/2024 11:21, Germano Massullo wrote: > Il 16/06/24 21:00, Max Schulze ha scritto: >> Ok then I think you really max out the cpu. I have not heard of any >> acceleration card. Overall I think it's not too bad.
> > The problem is that it is far below my internet connection capabilities (1 > Gbit/s upload) > >> Are you running iperf3 with --bidir? > > Such a flag halves the throughput: I am getting ~280 Mbit/s compared to > the previous value I got by using > iperf3 -c 10.0.50.1 -P 4 -Z bbr > (using the Ryzen as client) >> Keep in mind that iperf3 itself uses some cpu. >> You could test serving a static file and transfer via http. > > The iperf3 CPU usage is not so high; it wouldn't change much to use the > http transfer Have you tried running the test between a client behind the omnia turris and the wg server? I am asking because such embedded devices are not necessarily fast in generating the traffic that iperf requires, therefore using a different client may give you a better estimate. Regards, -- Antonio Quartulli From germano.massullo at gmail.com Mon Jun 17 11:08:15 2024 From: germano.massullo at gmail.com (Germano Massullo) Date: Mon, 17 Jun 2024 13:08:15 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> Message-ID: Il 17/06/24 11:45, Antonio Quartulli ha scritto: > Have you tried running the test between a client behind the omnia > turris and the wg server? Do you mean a configuration where the Turris Omnia is not acting as Wireguard peer/gateway? I could do it but I prefer not to From a at unstable.cc Mon Jun 17 11:42:50 2024 From: a at unstable.cc (Antonio Quartulli) Date: Mon, 17 Jun 2024 13:42:50 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> Message-ID: <8f366adc-89a5-4b81-9a4f-8bcb1d08aad3@unstable.cc> Hi, On 17/06/2024 13:08, Germano Massullo wrote: > Il 17/06/24 11:45, Antonio Quartulli ha scritto: >> Have you tried running the test between a client behind the omnia >> turris and the wg server? > > Do you mean a configuration where the Turris Omnia is not acting as > Wireguard peer/gateway? I could do it but I prefer not to No no.
Sorry I might have used the wrong words. > > Basically you should keep the wg setup as it is, but instead of > running the iperf client on the turris, you run it on another host > that uses the turris as gateway (as if the turris was the gateway of a > LAN). > > This way the tunnel is still established between the turris and the > server (which is what you want to test), but you move the traffic > generation to a different host (which is most likely what you will > have in a real scenario). > > I hope I clarified your doubt. > > Regards, > > Got it. That configuration will not improve the throughput, because the reason why I started this benchmark is finding out the bottleneck in my configuration, which is very similar to the one you described From rm at romanrm.net Mon Jun 17 12:41:59 2024 From: rm at romanrm.net (Roman Mamedov) Date: Mon, 17 Jun 2024 17:41:59 +0500 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: <4940a8fd-6e87-49a4-83e0-8daa69e7a68f@gmail.com> References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> <8f366adc-89a5-4b81-9a4f-8bcb1d08aad3@unstable.cc> <4940a8fd-6e87-49a4-83e0-8daa69e7a68f@gmail.com> Message-ID: <20240617174159.46b69d3b@nvm> On Mon, 17 Jun 2024 14:32:19 +0200 Germano Massullo wrote: > Got it. That configuration will not improve the throughput, because the > reason why I started this benchmark is finding out the bottleneck in my > configuration, which is very similar to the one you described The point is that iperf itself is using a huge amount of CPU. You can run your test and launch "top" in another SSH window. In my experience, for slow CPUs during such tests the CPU use may be something like 60% iperf. If your typical scenario is the router just forwarding packets between networks and into the WG tunnel, and not providing any network services itself (such as Samba), testing with iperf launched on the router will not be representative of the real-world usage bottleneck or lack thereof. -- With respect, Roman From urezki at gmail.com Mon Jun 17 13:50:56 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Mon, 17 Jun 2024 15:50:56 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Fri, Jun 14, 2024 at 09:33:45PM +0200, Jason A. Donenfeld wrote: > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > > + /* Should a destroy process be deferred? */ > > + if (s->flags & SLAB_DEFER_DESTROY) { > > + list_move_tail(&s->list, &slab_caches_defer_destroy); > > + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); > > + goto out_unlock; > > + } > > Wouldn't it be smoother to have the actual kmem_cache_free() function > check to see if it's been marked for destruction and the refcount is > zero, rather than polling every one second? I mentioned this approach > in: https://lore.kernel.org/all/Zmo9-YGraiCj5-MI at zx2c4.com/ - > > I wonder if the right fix to this would be adding a `should_destroy` > boolean to kmem_cache, which kmem_cache_destroy() sets to true. And > then right after it checks `if (number_of_allocations == 0) > actually_destroy()`, and likewise on each kmem_cache_free(), it > could check `if (should_destroy && number_of_allocations == 0) > actually_destroy()`. > I do not find polling a bad way to go. But your proposal sounds reasonable to me also. We can combine both "prototypes" into one and offer it.
Can you post a prototype here? Thanks! -- Uladzislau Rezki From germano.massullo at gmail.com Mon Jun 17 14:31:49 2024 From: germano.massullo at gmail.com (Germano Massullo) Date: Mon, 17 Jun 2024 16:31:49 +0200 Subject: Mini PCIE HW accelerator for ChaCha20 In-Reply-To: <20240617174159.46b69d3b@nvm> References: <78a56a8e-59d2-4948-a761-f1f1b3a6b26a@online.de> <93a15ab0-1cfc-4b39-97fe-64712eafb278@gmail.com> <8f366adc-89a5-4b81-9a4f-8bcb1d08aad3@unstable.cc> <4940a8fd-6e87-49a4-83e0-8daa69e7a68f@gmail.com> <20240617174159.46b69d3b@nvm> Message-ID: <9e717774-c1cd-493f-abfd-fcf3d75eec8d@gmail.com> After having checked that iperf3 was indeed consuming a lot of a CPU core on the Turris Omnia, I modified the Wireguard topology in order to have the router just be the Wireguard gateway between two LAN computers ( [A] <--wireguard--> [C] <--wireguard--> [B] ), and I have run iperf3 between such computers: iperf3 -c x.x.x.x -P 4 -Z bbr and the throughput was ~320 Mbit/s. Considering that the router had to handle two Wireguard tunnels, one could guess (without any claim of accuracy, due to the lack of more accurate tests) that the maximum Wireguard throughput that such a router can handle is ~2x 320 Mbit/s = ~640 Mbit/s [A]: Ryzen 5 3600 - kernel 5.14.0-427.18.1.el9_4.x86_64 [B]: Ryzen 7 PRO 6850U - kernel 6.8.11-300.fc40.x86_64 [C]: Turris Omnia - TurrisOS 7.0.0, kernel 5.15.148 From vbabka at suse.cz Mon Jun 17 14:37:20 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 16:37:20 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: <324c25c9-fe86-4d82-b4e2-5f3ad76031c7@suse.cz> On 6/14/24 9:33 PM, Jason A. Donenfeld wrote: > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: >> + /* Should a destroy process be deferred? */ >> + if (s->flags & SLAB_DEFER_DESTROY) { >> + list_move_tail(&s->list, &slab_caches_defer_destroy); >> + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); >> + goto out_unlock; >> + } > > Wouldn't it be smoother to have the actual kmem_cache_free() function > check to see if it's been marked for destruction and the refcount is > zero, rather than polling every one second? I mentioned this approach > in: https://lore.kernel.org/all/Zmo9-YGraiCj5-MI at zx2c4.com/ - > > I wonder if the right fix to this would be adding a `should_destroy` > boolean to kmem_cache, which kmem_cache_destroy() sets to true. And > then right after it checks `if (number_of_allocations == 0) > actually_destroy()`, and likewise on each kmem_cache_free(), it > could check `if (should_destroy && number_of_allocations == 0) > actually_destroy()`. I would prefer not to affect the performance of kmem_cache_free() by doing such checks, if possible. Ideally we'd have a way to wait/poll for the kfree_rcu() "grace period" expiring even with the batching that's implemented there. Even if it's pessimistically long to avoid affecting kfree_rcu() performance. The goal here is just to print the warnings if there was a leak and the precise timing of them shouldn't matter. The owning module could be already unloaded at that point? I guess only a kunit test could want to be synchronous and then it could just ask for kmem_cache_free() to wait synchronously. > Jason From Jason at zx2c4.com Mon Jun 17 14:56:17 2024 From: Jason at zx2c4.com (Jason A.
Donenfeld) Date: Mon, 17 Jun 2024 16:56:17 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <80e03b02-7e24-4342-af0b-ba5117b19828@paulmck-laptop> <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Mon, Jun 17, 2024 at 03:50:56PM +0200, Uladzislau Rezki wrote: > On Fri, Jun 14, 2024 at 09:33:45PM +0200, Jason A. Donenfeld wrote: > > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > > > + /* Should a destroy process be deferred? */ > > > + if (s->flags & SLAB_DEFER_DESTROY) { > > > + list_move_tail(&s->list, &slab_caches_defer_destroy); > > > + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); > > > + goto out_unlock; > > > + } > > > > Wouldn't it be smoother to have the actual kmem_cache_free() function > > check to see if it's been marked for destruction and the refcount is > > zero, rather than polling every one second? I mentioned this approach > > in: https://lore.kernel.org/all/Zmo9-YGraiCj5-MI at zx2c4.com/ - > > > > I wonder if the right fix to this would be adding a `should_destroy` > > boolean to kmem_cache, which kmem_cache_destroy() sets to true. And > > then right after it checks `if (number_of_allocations == 0) > > actually_destroy()`, and likewise on each kmem_cache_free(), it > > could check `if (should_destroy && number_of_allocations == 0) > > actually_destroy()`. > > > I do not find pooling as bad way we can go with. But your proposal > sounds reasonable to me also. We can combine both "prototypes" to > one and offer. > > Can you post a prototype here? This is untested, but the simplest, shortest possible version would be: diff --git a/mm/slab.h b/mm/slab.h index 5f8f47c5bee0..907c0ea56c01 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -275,6 +275,7 @@ struct kmem_cache { unsigned int inuse; /* Offset to metadata */ unsigned int align; /* Alignment */ unsigned int red_left_pad; /* Left redzone padding size */ + bool is_destroyed; /* Destruction happens when no objects */ const char *name; /* Name (only for display!) 
*/ struct list_head list; /* List of slab caches */ #ifdef CONFIG_SYSFS diff --git a/mm/slab_common.c b/mm/slab_common.c index 1560a1546bb1..f700bed066d9 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -494,8 +494,8 @@ void kmem_cache_destroy(struct kmem_cache *s) goto out_unlock; err = shutdown_cache(s); - WARN(err, "%s %s: Slab cache still has objects when called from %pS", - __func__, s->name, (void *)_RET_IP_); + if (err) + s->is_destroyed = true; out_unlock: mutex_unlock(&slab_mutex); cpus_read_unlock(); diff --git a/mm/slub.c b/mm/slub.c index 1373ac365a46..7db8fe90a323 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) return; trace_kmem_cache_free(_RET_IP_, x, s); slab_free(s, virt_to_slab(x), x, _RET_IP_); + if (s->is_destroyed) + kmem_cache_destroy(s); } EXPORT_SYMBOL(kmem_cache_free); @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) if (!slab->inuse) { remove_partial(n, slab); list_add(&slab->slab_list, &discard); - } else { - list_slab_objects(s, slab, - "Objects remaining in %s on __kmem_cache_shutdown()"); } } spin_unlock_irq(&n->list_lock); From vbabka at suse.cz Mon Jun 17 15:10:50 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 17:10:50 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> Message-ID: On 6/13/24 2:22 PM, Jason A. Donenfeld wrote: > On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E. McKenney wrote: >> o Make the current kmem_cache_destroy() asynchronously wait for >> all memory to be returned, then complete the destruction. >> (This gets rid of a valuable debugging technique because >> in normal use, it is a bug to attempt to destroy a kmem_cache >> that has objects still allocated.) This seems like the best option to me. As Jason already said, the debugging technique is not affected significantly, if the warning just occurs asynchronously later. The module can be already unloaded at that point, as the leak is never checked programatically anyway to control further execution, it's just a splat in dmesg. > Specifically what I mean is that we can still claim a memory leak has > occurred if one batched kfree_rcu freeing grace period has elapsed since > the last call to kmem_cache_destroy_rcu_wait/barrier() or > kmem_cache_destroy_rcu(). In that case, you quit blocking, or you quit > asynchronously waiting, and then you splat about a memleak like we have > now. Yes so we'd need the kmem_cache_free_barrier() for a slab kunit test (or the pessimistic variant waiting for the 21 seconds), and a polling variant of the same thing for the asynchronous destruction. Or we don't need a polling variant if it's ok to invoke such a barrier in a schedule_work() workfn. We should not need any new kmem_cache flag nor kmem_cache_destroy() flag to burden the users of kfree_rcu() with. We have __kmem_cache_shutdown() that will try to flush everything immediately and if it doesn't succeed, we can assume kfree_rcu() might be in flight and try to wait for it asynchronously, without any flags. SLAB_TYPESAFE_BY_RCU is still handled specially because it has special semantics as well. 
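To sketch what I mean (untested, hypothetical names, and hand-waving the locking and the exact wait primitive):

static void kmem_cache_async_destroy_workfn(struct work_struct *work)
{
	struct kmem_cache *s =
		container_of(work, struct kmem_cache, async_destroy_work);

	/*
	 * Wait for pending kfree_rcu() batches to finish; a plain
	 * rcu_barrier() is only a stand-in here, since it does not
	 * cover the kfree_rcu() batching.
	 */
	rcu_barrier();

	/* Retry the flush; only warn if objects really leaked. */
	if (__kmem_cache_shutdown(s) != 0)
		WARN(1, "kmem_cache_destroy %s: Slab cache still has objects",
		     s->name);
}

void kmem_cache_destroy(struct kmem_cache *s)
{
	...
	if (__kmem_cache_shutdown(s) != 0) {
		/* assume kfree_rcu() in flight, finish asynchronously */
		schedule_work(&s->async_destroy_work);
		goto out_unlock;
	}
	...
}

That way kmem_cache_destroy() stays synchronous in the common empty-cache case and only the pending/leaky case goes through the workqueue.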
As for users of call_rcu() with arbitrary callbacks that might be functions from the module that is about to unload, these should not return from kmem_cache_destroy() with objects in flight. But those should be using rcu_barrier() before calling kmem_cache_destroy() already, and probably we should not try to handle this automagically? Maybe one potential change with the described approach is that today they would get the "cache not empty" warning immediately. But that wouldn't stop the module unload so later the callbacks would try to execute unmapped code anyway. With the new approach the asynchronous handling might delay the "cache not empty" warnings (or not, if kmem_cache_free_barrier() would finish before a rcu_barrier() would) so the unmapped code execution would come first. I don't think that would be a regression. From paulmck at kernel.org Mon Jun 17 16:12:28 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Mon, 17 Jun 2024 09:12:28 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> Message-ID: <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> On Mon, Jun 17, 2024 at 05:10:50PM +0200, Vlastimil Babka wrote: > On 6/13/24 2:22 PM, Jason A. Donenfeld wrote: > > On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E. McKenney wrote: > >> o Make the current kmem_cache_destroy() asynchronously wait for > >> all memory to be returned, then complete the destruction. > >> (This gets rid of a valuable debugging technique because > >> in normal use, it is a bug to attempt to destroy a kmem_cache > >> that has objects still allocated.) > > This seems like the best option to me. As Jason already said, the debugging > technique is not affected significantly, if the warning just occurs > asynchronously later. The module can be already unloaded at that point, as > the leak is never checked programatically anyway to control further > execution, it's just a splat in dmesg. Works for me! > > Specifically what I mean is that we can still claim a memory leak has > > occurred if one batched kfree_rcu freeing grace period has elapsed since > > the last call to kmem_cache_destroy_rcu_wait/barrier() or > > kmem_cache_destroy_rcu(). In that case, you quit blocking, or you quit > > asynchronously waiting, and then you splat about a memleak like we have > > now. > > Yes so we'd need the kmem_cache_free_barrier() for a slab kunit test (or the > pessimistic variant waiting for the 21 seconds), and a polling variant of > the same thing for the asynchronous destruction. Or we don't need a polling > variant if it's ok to invoke such a barrier in a schedule_work() workfn. > > We should not need any new kmem_cache flag nor kmem_cache_destroy() flag to > burden the users of kfree_rcu() with. We have __kmem_cache_shutdown() that > will try to flush everything immediately and if it doesn't succeed, we can > assume kfree_rcu() might be in flight and try to wait for it asynchronously, > without any flags. That does sound like a very attractive approach. > SLAB_TYPESAFE_BY_RCU is still handled specially because it has special > semantics as well. > > As for users of call_rcu() with arbitrary callbacks that might be functions > from the module that is about to unload, these should not return from > kmem_cache_destroy() with objects in flight. 
But those should be using > rcu_barrier() before calling kmem_cache_destroy() already, and probably we > should not try to handle this automagically? Maybe one potential change with > the described approach is that today they would get the "cache not empty" > warning immediately. But that wouldn't stop the module unload so later the > callbacks would try to execute unmapped code anyway. With the new approach > the asynchronous handling might delay the "cache not empty" warnings (or > not, if kmem_cache_free_barrier() would finish before a rcu_barrier() would) > so the unmapped code execution would come first. I don't think that would be > a regression. Agreed. There are some use cases where a call_rcu() from a module without an rcu_barrier() would be OK, for example, if the callback function was defined in the core kernel and either: (1) The memory was from kmalloc() or (2) The memory was from kmem_cache_alloc() and your suggested changes above have been applied. My current belief is that these are too special of cases to be worth optimizing for, so that the rule should remain "If you use call_rcu() in a module, you must call rcu_barrier() within the module-unload code." There have been discussions of having module-unload automatically invoke rcu_barrier() if needed, but thus far we have not come up with a good way to do this. Challenges include things like static inline functions from the core kernel invoking call_rcu(), in which case how to figure out that the rcu_barrier() is not needed? Thanx, Paul From urezki at gmail.com Mon Jun 17 16:30:53 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Mon, 17 Jun 2024 18:30:53 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Mon, Jun 17, 2024 at 04:56:17PM +0200, Jason A. Donenfeld wrote: > On Mon, Jun 17, 2024 at 03:50:56PM +0200, Uladzislau Rezki wrote: > > On Fri, Jun 14, 2024 at 09:33:45PM +0200, Jason A. Donenfeld wrote: > > > On Fri, Jun 14, 2024 at 02:35:33PM +0200, Uladzislau Rezki wrote: > > > > + /* Should a destroy process be deferred? */ > > > > + if (s->flags & SLAB_DEFER_DESTROY) { > > > > + list_move_tail(&s->list, &slab_caches_defer_destroy); > > > > + schedule_delayed_work(&slab_caches_defer_destroy_work, HZ); > > > > + goto out_unlock; > > > > + } > > > > > > Wouldn't it be smoother to have the actual kmem_cache_free() function > > > check to see if it's been marked for destruction and the refcount is > > > zero, rather than polling every one second? I mentioned this approach > > > in: https://lore.kernel.org/all/Zmo9-YGraiCj5-MI at zx2c4.com/ - > > > > > > I wonder if the right fix to this would be adding a `should_destroy` > > > boolean to kmem_cache, which kmem_cache_destroy() sets to true. And > > > then right after it checks `if (number_of_allocations == 0) > > > actually_destroy()`, and likewise on each kmem_cache_free(), it > > > could check `if (should_destroy && number_of_allocations == 0) > > > actually_destroy()`. > > > > > I do not find pooling as bad way we can go with. But your proposal > > sounds reasonable to me also. We can combine both "prototypes" to > > one and offer. > > > > Can you post a prototype here? 
> > This is untested, but the simplest, shortest possible version would be: > > diff --git a/mm/slab.h b/mm/slab.h > index 5f8f47c5bee0..907c0ea56c01 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -275,6 +275,7 @@ struct kmem_cache { > unsigned int inuse; /* Offset to metadata */ > unsigned int align; /* Alignment */ > unsigned int red_left_pad; /* Left redzone padding size */ > + bool is_destroyed; /* Destruction happens when no objects */ > const char *name; /* Name (only for display!) */ > struct list_head list; /* List of slab caches */ > #ifdef CONFIG_SYSFS > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 1560a1546bb1..f700bed066d9 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -494,8 +494,8 @@ void kmem_cache_destroy(struct kmem_cache *s) > goto out_unlock; > > err = shutdown_cache(s); > - WARN(err, "%s %s: Slab cache still has objects when called from %pS", > - __func__, s->name, (void *)_RET_IP_); > + if (err) > + s->is_destroyed = true; > Here, if "err" is less than "0", it means there are still objects, whereas "is_destroyed" is set to "true", which is not consistent with the comment: "Destruction happens when no objects" > out_unlock: > mutex_unlock(&slab_mutex); > cpus_read_unlock(); > diff --git a/mm/slub.c b/mm/slub.c > index 1373ac365a46..7db8fe90a323 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > return; > trace_kmem_cache_free(_RET_IP_, x, s); > slab_free(s, virt_to_slab(x), x, _RET_IP_); > + if (s->is_destroyed) > + kmem_cache_destroy(s); > } > EXPORT_SYMBOL(kmem_cache_free); > > @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) > if (!slab->inuse) { > remove_partial(n, slab); > list_add(&slab->slab_list, &discard); > - } else { > - list_slab_objects(s, slab, > - "Objects remaining in %s on __kmem_cache_shutdown()"); > } > } > spin_unlock_irq(&n->list_lock); > Anyway it looks like it was not welcome to do it in the kmem_cache_free() function due to performance reasons. -- Uladzislau Rezki From Jason at zx2c4.com Mon Jun 17 16:33:23 2024 From: Jason at zx2c4.com (Jason A. Donenfeld) Date: Mon, 17 Jun 2024 18:33:23 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote: > Here, if "err" is less than "0", it means there are still objects > whereas "is_destroyed" is set to "true", which is not consistent > with the comment: > > "Destruction happens when no objects" The comment is just poorly written. But the logic of the code is right.
> > > out_unlock: > > mutex_unlock(&slab_mutex); > > cpus_read_unlock(); > > diff --git a/mm/slub.c b/mm/slub.c > > index 1373ac365a46..7db8fe90a323 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > > return; > > trace_kmem_cache_free(_RET_IP_, x, s); > > slab_free(s, virt_to_slab(x), x, _RET_IP_); > > + if (s->is_destroyed) > > + kmem_cache_destroy(s); > > } > > EXPORT_SYMBOL(kmem_cache_free); > > > > @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) > > if (!slab->inuse) { > > remove_partial(n, slab); > > list_add(&slab->slab_list, &discard); > > - } else { > > - list_slab_objects(s, slab, > > - "Objects remaining in %s on __kmem_cache_shutdown()"); > > } > > } > > spin_unlock_irq(&n->list_lock); > > > Anyway it looks like it was not welcome to do it in the kmem_cache_free() > function due to performance reason. "was not welcome" - Vlastimil mentioned *potential* performance concerns before I posted this. I suspect he might have a different view now, maybe? Vlastimil, this is just checking a boolean (which could be unlikely()'d), which should have pretty minimal overhead. Is that alright with you? Jason From vbabka at suse.cz Mon Jun 17 16:38:52 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 18:38:52 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On 6/17/24 6:33 PM, Jason A. Donenfeld wrote: > On Mon, Jun 17, 2024 at 6:30?PM Uladzislau Rezki wrote: >> Here if an "err" is less then "0" means there are still objects >> whereas "is_destroyed" is set to "true" which is not correlated >> with a comment: >> >> "Destruction happens when no objects" > > The comment is just poorly written. But the logic of the code is right. > >> >> > out_unlock: >> > mutex_unlock(&slab_mutex); >> > cpus_read_unlock(); >> > diff --git a/mm/slub.c b/mm/slub.c >> > index 1373ac365a46..7db8fe90a323 100644 >> > --- a/mm/slub.c >> > +++ b/mm/slub.c >> > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) >> > return; >> > trace_kmem_cache_free(_RET_IP_, x, s); >> > slab_free(s, virt_to_slab(x), x, _RET_IP_); >> > + if (s->is_destroyed) >> > + kmem_cache_destroy(s); >> > } >> > EXPORT_SYMBOL(kmem_cache_free); >> > >> > @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) >> > if (!slab->inuse) { >> > remove_partial(n, slab); >> > list_add(&slab->slab_list, &discard); >> > - } else { >> > - list_slab_objects(s, slab, >> > - "Objects remaining in %s on __kmem_cache_shutdown()"); >> > } >> > } >> > spin_unlock_irq(&n->list_lock); >> > >> Anyway it looks like it was not welcome to do it in the kmem_cache_free() >> function due to performance reason. > > "was not welcome" - Vlastimil mentioned *potential* performance > concerns before I posted this. I suspect he might have a different > view now, maybe? > > Vlastimil, this is just checking a boolean (which could be > unlikely()'d), which should have pretty minimal overhead. Is that > alright with you? Well I doubt we can just set and check it without any barriers? The completion of the last pending kfree_rcu() might race with kmem_cache_destroy() in a way that will leave the cache there forever, no? And once we add barriers it becomes a perf issue? 
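To illustrate with a hypothetical atomic object count (say s->objects; the real counters are per-cpu/per-node, which only makes it worse), the two sides would need something like:

	/* kmem_cache_destroy() side */
	WRITE_ONCE(s->is_destroyed, true);
	smp_mb();		/* order flag store before count read */
	if (atomic_read(&s->objects) == 0)
		actually_destroy(s);

	/* kmem_cache_free() side, after freeing the object */
	atomic_dec(&s->objects);
	smp_mb__after_atomic();	/* order count update before flag read */
	if (READ_ONCE(s->is_destroyed) && atomic_read(&s->objects) == 0)
		actually_destroy(s);

Without the two barriers each side can miss the other's store and nobody destroys the cache; with them, both sides can observe the condition true at once, so actually_destroy() would additionally need an atomic test-and-set to pick a single winner. And all of that would sit in the kmem_cache_free() fast path.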
> Jason From urezki at gmail.com Mon Jun 17 16:42:23 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Mon, 17 Jun 2024 18:42:23 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: Message-ID: On Mon, Jun 17, 2024 at 06:33:23PM +0200, Jason A. Donenfeld wrote: > On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote: > > Here, if "err" is less than "0", it means there are still objects > > whereas "is_destroyed" is set to "true", which is not consistent > > with the comment: > > > > "Destruction happens when no objects" > > The comment is just poorly written. But the logic of the code is right. > OK. > > > > > out_unlock: > > > mutex_unlock(&slab_mutex); > > > cpus_read_unlock(); > > > diff --git a/mm/slub.c b/mm/slub.c > > > index 1373ac365a46..7db8fe90a323 100644 > > > --- a/mm/slub.c > > > +++ b/mm/slub.c > > > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > > > return; > > > trace_kmem_cache_free(_RET_IP_, x, s); > > > slab_free(s, virt_to_slab(x), x, _RET_IP_); > > > + if (s->is_destroyed) > > > + kmem_cache_destroy(s); > Here I am not following you. How do you see that a cache has been fully freed? Or is it just super draft code? Thanks! -- Uladzislau Rezki From Jason at zx2c4.com Mon Jun 17 16:57:45 2024 From: Jason at zx2c4.com (Jason A. Donenfeld) Date: Mon, 17 Jun 2024 18:57:45 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: Message-ID: On Mon, Jun 17, 2024 at 06:42:23PM +0200, Uladzislau Rezki wrote: > On Mon, Jun 17, 2024 at 06:33:23PM +0200, Jason A. Donenfeld wrote: > > On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote: > > > Here, if "err" is less than "0", it means there are still objects > > > whereas "is_destroyed" is set to "true", which is not consistent > > > with the comment: > > > > > > "Destruction happens when no objects" > > > > The comment is just poorly written. But the logic of the code is right.
> > > out_unlock: > > mutex_unlock(&slab_mutex); > > cpus_read_unlock(); > > diff --git a/mm/slub.c b/mm/slub.c > > index 1373ac365a46..7db8fe90a323 100644 > > --- a/mm/slub.c > > +++ b/mm/slub.c > > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > > return; > > trace_kmem_cache_free(_RET_IP_, x, s); > > slab_free(s, virt_to_slab(x), x, _RET_IP_); > > + if (s->is_destroyed) > > + kmem_cache_destroy(s); > > } > > EXPORT_SYMBOL(kmem_cache_free); > > > > @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) > > if (!slab->inuse) { > > remove_partial(n, slab); > > list_add(&slab->slab_list, &discard); > > - } else { > > - list_slab_objects(s, slab, > > - "Objects remaining in %s on __kmem_cache_shutdown()"); > > } > > } > > spin_unlock_irq(&n->list_lock); > > > Anyway it looks like it was not welcome to do it in the kmem_cache_free() > function due to performance reasons. "was not welcome" - Vlastimil mentioned *potential* performance concerns before I posted this. I suspect he might have a different view now, maybe? Vlastimil, this is just checking a boolean (which could be unlikely()'d), which should have pretty minimal overhead. Is that alright with you? Jason From vbabka at suse.cz Mon Jun 17 16:38:52 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 18:38:52 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <7efde25f-6af5-4a67-abea-b26732a8aca1@paulmck-laptop> Message-ID: On 6/17/24 6:33 PM, Jason A. Donenfeld wrote: > On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote: >> Here, if "err" is less than "0", it means there are still objects >> whereas "is_destroyed" is set to "true", which is not consistent >> with the comment: >> >> "Destruction happens when no objects" > > The comment is just poorly written. But the logic of the code is right. > >> >> > out_unlock: >> > mutex_unlock(&slab_mutex); >> > cpus_read_unlock(); >> > diff --git a/mm/slub.c b/mm/slub.c >> > index 1373ac365a46..7db8fe90a323 100644 >> > --- a/mm/slub.c >> > +++ b/mm/slub.c >> > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) >> > return; >> > trace_kmem_cache_free(_RET_IP_, x, s); >> > slab_free(s, virt_to_slab(x), x, _RET_IP_); >> > + if (s->is_destroyed) >> > + kmem_cache_destroy(s); >> > } >> > EXPORT_SYMBOL(kmem_cache_free); >> > >> > @@ -5342,9 +5344,6 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) >> > if (!slab->inuse) { >> > remove_partial(n, slab); >> > list_add(&slab->slab_list, &discard); >> > - } else { >> > - list_slab_objects(s, slab, >> > - "Objects remaining in %s on __kmem_cache_shutdown()"); >> > } >> > } >> > spin_unlock_irq(&n->list_lock); >> > >> Anyway it looks like it was not welcome to do it in the kmem_cache_free() >> function due to performance reasons. > > "was not welcome" - Vlastimil mentioned *potential* performance > concerns before I posted this. I suspect he might have a different > view now, maybe? > > Vlastimil, this is just checking a boolean (which could be > unlikely()'d), which should have pretty minimal overhead. Is that > alright with you? Well I doubt we can just set and check it without any barriers? The completion of the last pending kfree_rcu() might race with kmem_cache_destroy() in a way that will leave the cache there forever, no? And once we add barriers it becomes a perf issue? Hm, yea you might be right about barriers being required. But actually, might this point toward a larger problem, no matter what approach, polling or event, is chosen? If the current rule is that kmem_cache_free() must never race with kmem_cache_destroy(), because users have always made diligent use of call_rcu()/rcu_barrier() and such, but now we're going to let those race with each other - either by my thing above or by polling - then we're potentially going to get in trouble and need some barriers anyway. I think? Jason From urezki at gmail.com Mon Jun 17 17:19:53 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Mon, 17 Jun 2024 19:19:53 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: Message-ID: On Mon, Jun 17, 2024 at 06:57:45PM +0200, Jason A. Donenfeld wrote: > On Mon, Jun 17, 2024 at 06:42:23PM +0200, Uladzislau Rezki wrote: > > On Mon, Jun 17, 2024 at 06:33:23PM +0200, Jason A. Donenfeld wrote: > > > On Mon, Jun 17, 2024 at 6:30 PM Uladzislau Rezki wrote: > > > > Here, if "err" is less than "0", it means there are still objects > > > > whereas "is_destroyed" is set to "true", which is not consistent > > > > with the comment: > > > > > > > > "Destruction happens when no objects" > > > > > > The comment is just poorly written. But the logic of the code is right. > > > > > OK.
> > > > > > > > > > out_unlock: > > > > > mutex_unlock(&slab_mutex); > > > > > cpus_read_unlock(); > > > > > diff --git a/mm/slub.c b/mm/slub.c > > > > > index 1373ac365a46..7db8fe90a323 100644 > > > > > --- a/mm/slub.c > > > > > +++ b/mm/slub.c > > > > > @@ -4510,6 +4510,8 @@ void kmem_cache_free(struct kmem_cache *s, void *x) > > > > > return; > > > > > trace_kmem_cache_free(_RET_IP_, x, s); > > > > > slab_free(s, virt_to_slab(x), x, _RET_IP_); > > > > > + if (s->is_destroyed) > > > > > + kmem_cache_destroy(s); > > > Here I am not following you. How do you see that a cache has been fully > > freed? Or is it just super draft code? > > kmem_cache_destroy() does this in shutdown_cache(). > Right. In this scenario you invoke kmem_cache_destroy() over and over until the last object gets freed. This potentially slows kmem_cache_free(), which is not OK, at least to me. -- Uladzislau Rezki From vbabka at suse.cz Mon Jun 17 17:23:36 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 19:23:36 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> Message-ID: <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> On 6/17/24 6:12 PM, Paul E. McKenney wrote: > On Mon, Jun 17, 2024 at 05:10:50PM +0200, Vlastimil Babka wrote: >> On 6/13/24 2:22 PM, Jason A. Donenfeld wrote: >> > On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E. McKenney wrote: >> >> o Make the current kmem_cache_destroy() asynchronously wait for >> >> all memory to be returned, then complete the destruction. >> >> (This gets rid of a valuable debugging technique because >> >> in normal use, it is a bug to attempt to destroy a kmem_cache >> >> that has objects still allocated.) >> >> This seems like the best option to me. As Jason already said, the debugging >> technique is not affected significantly, if the warning just occurs >> asynchronously later. The module can be already unloaded at that point, as >> the leak is never checked programatically anyway to control further >> execution, it's just a splat in dmesg. > > Works for me! Great. So this is how a prototype could look, hopefully? The kunit test does generate the splat for me, which should be because the rcu_barrier() in the implementation (marked to be replaced with the real thing) is really insufficient. Note the test itself passes as this kind of error isn't wired up properly. Another thing to resolve is the marked comment about kasan_shutdown() with potential kfree_rcu()'s in flight. Also you need CONFIG_SLUB_DEBUG enabled otherwise node_nr_slabs() is a no-op and it might fail to notice the pending slabs. This will need to change.
----8<---- diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c index e6667a28c014..e3e4d0ca40b7 100644 --- a/lib/slub_kunit.c +++ b/lib/slub_kunit.c @@ -5,6 +5,7 @@ #include #include #include +#include #include "../mm/slab.h" static struct kunit_resource resource; @@ -157,6 +158,26 @@ static void test_kmalloc_redzone_access(struct kunit *test) kmem_cache_destroy(s); } +struct test_kfree_rcu_struct { + struct rcu_head rcu; +}; + +static void test_kfree_rcu(struct kunit *test) +{ + struct kmem_cache *s = test_kmem_cache_create("TestSlub_kfree_rcu", + sizeof(struct test_kfree_rcu_struct), + SLAB_NO_MERGE); + struct test_kfree_rcu_struct *p = kmem_cache_alloc(s, GFP_KERNEL); + + kasan_disable_current(); + + KUNIT_EXPECT_EQ(test, 0, slab_errors); + + kasan_enable_current(); + kfree_rcu(p, rcu); + kmem_cache_destroy(s); +} + static int test_init(struct kunit *test) { slab_errors = 0; @@ -177,6 +198,7 @@ static struct kunit_case test_cases[] = { KUNIT_CASE(test_clobber_redzone_free), KUNIT_CASE(test_kmalloc_redzone_access), + KUNIT_CASE(test_kfree_rcu), {} }; diff --git a/mm/slab.h b/mm/slab.h index b16e63191578..a0295600af92 100644 --- a/mm/slab.h +++ b/mm/slab.h @@ -277,6 +277,8 @@ struct kmem_cache { unsigned int red_left_pad; /* Left redzone padding size */ const char *name; /* Name (only for display!) */ struct list_head list; /* List of slab caches */ + struct work_struct async_destroy_work; + #ifdef CONFIG_SYSFS struct kobject kobj; /* For sysfs */ #endif @@ -474,7 +476,7 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s) SLAB_NO_USER_FLAGS) bool __kmem_cache_empty(struct kmem_cache *); -int __kmem_cache_shutdown(struct kmem_cache *); +int __kmem_cache_shutdown(struct kmem_cache *, bool); void __kmem_cache_release(struct kmem_cache *); int __kmem_cache_shrink(struct kmem_cache *); void slab_kmem_cache_release(struct kmem_cache *); diff --git a/mm/slab_common.c b/mm/slab_common.c index 5b1f996bed06..c5c356d0235d 100644 --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -44,6 +44,8 @@ static LIST_HEAD(slab_caches_to_rcu_destroy); static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work); static DECLARE_WORK(slab_caches_to_rcu_destroy_work, slab_caches_to_rcu_destroy_workfn); +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work); + /* * Set of flags that will prevent slab merging @@ -234,6 +236,7 @@ static struct kmem_cache *create_cache(const char *name, s->refcount = 1; list_add(&s->list, &slab_caches); + INIT_WORK(&s->async_destroy_work, kmem_cache_kfree_rcu_destroy_workfn); return s; out_free_cache: @@ -449,12 +452,16 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) } } -static int shutdown_cache(struct kmem_cache *s) +static int shutdown_cache(struct kmem_cache *s, bool warn_inuse) { /* free asan quarantined objects */ + /* + * XXX: is it ok to call this multiple times? and what happens with a + * kfree_rcu() in flight that finishes after or in parallel with this? 
+ */ kasan_cache_shutdown(s); - if (__kmem_cache_shutdown(s) != 0) + if (__kmem_cache_shutdown(s, warn_inuse) != 0) return -EBUSY; list_del(&s->list); @@ -477,6 +484,32 @@ void slab_kmem_cache_release(struct kmem_cache *s) kmem_cache_free(kmem_cache, s); } +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work) +{ + struct kmem_cache *s; + int err = -EBUSY; + bool rcu_set; + + s = container_of(work, struct kmem_cache, async_destroy_work); + + // XXX use the real kmem_cache_free_barrier() or similar thing here + rcu_barrier(); + + cpus_read_lock(); + mutex_lock(&slab_mutex); + + rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; + + err = shutdown_cache(s, true); + WARN(err, "kmem_cache_destroy %s: Slab cache still has objects", + s->name); + + mutex_unlock(&slab_mutex); + cpus_read_unlock(); + if (!err && !rcu_set) + kmem_cache_release(s); +} + void kmem_cache_destroy(struct kmem_cache *s) { int err = -EBUSY; @@ -494,9 +527,9 @@ void kmem_cache_destroy(struct kmem_cache *s) if (s->refcount) goto out_unlock; - err = shutdown_cache(s); - WARN(err, "%s %s: Slab cache still has objects when called from %pS", - __func__, s->name, (void *)_RET_IP_); + err = shutdown_cache(s, false); + if (err) + schedule_work(&s->async_destroy_work); out_unlock: mutex_unlock(&slab_mutex); cpus_read_unlock(); diff --git a/mm/slub.c b/mm/slub.c index 1617d8014ecd..4d435b3d2b5f 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -5342,7 +5342,8 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab, * This is called from __kmem_cache_shutdown(). We must take list_lock * because sysfs file might still access partial list after the shutdowning. */ -static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) +static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n, + bool warn_inuse) { LIST_HEAD(discard); struct slab *slab, *h; @@ -5353,7 +5354,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) if (!slab->inuse) { remove_partial(n, slab); list_add(&slab->slab_list, &discard); - } else { + } else if (warn_inuse) { list_slab_objects(s, slab, "Objects remaining in %s on __kmem_cache_shutdown()"); } @@ -5378,7 +5379,7 @@ bool __kmem_cache_empty(struct kmem_cache *s) /* * Release all resources used by a slab cache. */ -int __kmem_cache_shutdown(struct kmem_cache *s) +int __kmem_cache_shutdown(struct kmem_cache *s, bool warn_inuse) { int node; struct kmem_cache_node *n; @@ -5386,7 +5387,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s) flush_all_cpus_locked(s); /* Attempt to free all objects */ for_each_kmem_cache_node(s, node, n) { - free_partial(s, n); + free_partial(s, n, warn_inuse); if (n->nr_partial || node_nr_slabs(n)) return 1; } From urezki at gmail.com Mon Jun 17 18:42:09 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Mon, 17 Jun 2024 20:42:09 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> Message-ID: On Mon, Jun 17, 2024 at 07:23:36PM +0200, Vlastimil Babka wrote: > On 6/17/24 6:12 PM, Paul E. McKenney wrote: > > On Mon, Jun 17, 2024 at 05:10:50PM +0200, Vlastimil Babka wrote: > >> On 6/13/24 2:22 PM, Jason A. 
Donenfeld wrote: > >> > On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E. McKenney wrote: > >> >> o Make the current kmem_cache_destroy() asynchronously wait for > >> >> all memory to be returned, then complete the destruction. > >> >> (This gets rid of a valuable debugging technique because > >> >> in normal use, it is a bug to attempt to destroy a kmem_cache > >> >> that has objects still allocated.) > >> > >> This seems like the best option to me. As Jason already said, the debugging > >> technique is not affected significantly, if the warning just occurs > >> asynchronously later. The module can be already unloaded at that point, as > >> the leak is never checked programatically anyway to control further > >> execution, it's just a splat in dmesg. > > > > Works for me! > > Great. So this is how a prototype could look like, hopefully? The kunit test > does generate the splat for me, which should be because the rcu_barrier() in > the implementation (marked to be replaced with the real thing) is really > insufficient. Note the test itself passes as this kind of error isn't wired > up properly. > > Another thing to resolve is the marked comment about kasan_shutdown() with > potential kfree_rcu()'s in flight. > > Also you need CONFIG_SLUB_DEBUG enabled otherwise node_nr_slabs() is a no-op > and it might fail to notice the pending slabs. This will need to change. > > ----8<---- > diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c > index e6667a28c014..e3e4d0ca40b7 100644 > --- a/lib/slub_kunit.c > +++ b/lib/slub_kunit.c > @@ -5,6 +5,7 @@ > #include > #include > #include > +#include > #include "../mm/slab.h" > > static struct kunit_resource resource; > @@ -157,6 +158,26 @@ static void test_kmalloc_redzone_access(struct kunit *test) > kmem_cache_destroy(s); > } > > +struct test_kfree_rcu_struct { > + struct rcu_head rcu; > +}; > + > +static void test_kfree_rcu(struct kunit *test) > +{ > + struct kmem_cache *s = test_kmem_cache_create("TestSlub_kfree_rcu", > + sizeof(struct test_kfree_rcu_struct), > + SLAB_NO_MERGE); > + struct test_kfree_rcu_struct *p = kmem_cache_alloc(s, GFP_KERNEL); > + > + kasan_disable_current(); > + > + KUNIT_EXPECT_EQ(test, 0, slab_errors); > + > + kasan_enable_current(); > + kfree_rcu(p, rcu); > + kmem_cache_destroy(s); > +} > + > static int test_init(struct kunit *test) > { > slab_errors = 0; > @@ -177,6 +198,7 @@ static struct kunit_case test_cases[] = { > > KUNIT_CASE(test_clobber_redzone_free), > KUNIT_CASE(test_kmalloc_redzone_access), > + KUNIT_CASE(test_kfree_rcu), > {} > }; > > diff --git a/mm/slab.h b/mm/slab.h > index b16e63191578..a0295600af92 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -277,6 +277,8 @@ struct kmem_cache { > unsigned int red_left_pad; /* Left redzone padding size */ > const char *name; /* Name (only for display!) 
*/ > struct list_head list; /* List of slab caches */ > + struct work_struct async_destroy_work; > + > #ifdef CONFIG_SYSFS > struct kobject kobj; /* For sysfs */ > #endif > @@ -474,7 +476,7 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s) > SLAB_NO_USER_FLAGS) > > bool __kmem_cache_empty(struct kmem_cache *); > -int __kmem_cache_shutdown(struct kmem_cache *); > +int __kmem_cache_shutdown(struct kmem_cache *, bool); > void __kmem_cache_release(struct kmem_cache *); > int __kmem_cache_shrink(struct kmem_cache *); > void slab_kmem_cache_release(struct kmem_cache *); > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 5b1f996bed06..c5c356d0235d 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -44,6 +44,8 @@ static LIST_HEAD(slab_caches_to_rcu_destroy); > static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work); > static DECLARE_WORK(slab_caches_to_rcu_destroy_work, > slab_caches_to_rcu_destroy_workfn); > +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work); > + > > /* > * Set of flags that will prevent slab merging > @@ -234,6 +236,7 @@ static struct kmem_cache *create_cache(const char *name, > > s->refcount = 1; > list_add(&s->list, &slab_caches); > + INIT_WORK(&s->async_destroy_work, kmem_cache_kfree_rcu_destroy_workfn); > return s; > > out_free_cache: > @@ -449,12 +452,16 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) > } > } > > -static int shutdown_cache(struct kmem_cache *s) > +static int shutdown_cache(struct kmem_cache *s, bool warn_inuse) > { > /* free asan quarantined objects */ > + /* > + * XXX: is it ok to call this multiple times? and what happens with a > + * kfree_rcu() in flight that finishes after or in parallel with this? > + */ > kasan_cache_shutdown(s); > > - if (__kmem_cache_shutdown(s) != 0) > + if (__kmem_cache_shutdown(s, warn_inuse) != 0) > return -EBUSY; > > list_del(&s->list); > @@ -477,6 +484,32 @@ void slab_kmem_cache_release(struct kmem_cache *s) > kmem_cache_free(kmem_cache, s); > } > > +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work) > +{ > + struct kmem_cache *s; > + int err = -EBUSY; > + bool rcu_set; > + > + s = container_of(work, struct kmem_cache, async_destroy_work); > + > + // XXX use the real kmem_cache_free_barrier() or similar thing here It implies that we need to introduce kfree_rcu_barrier(), a new API, which I wanted to avoid initially. Since you do it asynchronously, can we just repeat and wait until the cache is fully freed? I am asking because inventing a new kfree_rcu_barrier() might not be so straightforward. -- Uladzislau Rezki From paulmck at kernel.org Mon Jun 17 18:54:39 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Mon, 17 Jun 2024 11:54:39 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> Message-ID: <1755282b-e3f5-4d18-9eab-fc6a29ca5886@paulmck-laptop> On Mon, Jun 17, 2024 at 07:23:36PM +0200, Vlastimil Babka wrote: > On 6/17/24 6:12 PM, Paul E. McKenney wrote: > > On Mon, Jun 17, 2024 at 05:10:50PM +0200, Vlastimil Babka wrote: > >> On 6/13/24 2:22 PM, Jason A. Donenfeld wrote: > >> > On Wed, Jun 12, 2024 at 08:38:02PM -0700, Paul E.
McKenney wrote: > >> >> o Make the current kmem_cache_destroy() asynchronously wait for > >> >> all memory to be returned, then complete the destruction. > >> >> (This gets rid of a valuable debugging technique because > >> >> in normal use, it is a bug to attempt to destroy a kmem_cache > >> >> that has objects still allocated.) > >> > >> This seems like the best option to me. As Jason already said, the debugging > >> technique is not affected significantly, if the warning just occurs > >> asynchronously later. The module can be already unloaded at that point, as > >> the leak is never checked programatically anyway to control further > >> execution, it's just a splat in dmesg. > > > > Works for me! > > Great. So this is how a prototype could look like, hopefully? The kunit test > does generate the splat for me, which should be because the rcu_barrier() in > the implementation (marked to be replaced with the real thing) is really > insufficient. Note the test itself passes as this kind of error isn't wired > up properly. ;-) ;-) ;-) Some might want confirmation that their cleanup efforts succeeded, but if so, I will let them make that known. > Another thing to resolve is the marked comment about kasan_shutdown() with > potential kfree_rcu()'s in flight. Could that simply move to the worker function? (Hey, had to ask!) > Also you need CONFIG_SLUB_DEBUG enabled otherwise node_nr_slabs() is a no-op > and it might fail to notice the pending slabs. This will need to change. Agreed. Looks generally good. A few questions below, to be taken with a grain of salt. Thanx, Paul > ----8<---- > diff --git a/lib/slub_kunit.c b/lib/slub_kunit.c > index e6667a28c014..e3e4d0ca40b7 100644 > --- a/lib/slub_kunit.c > +++ b/lib/slub_kunit.c > @@ -5,6 +5,7 @@ > #include > #include > #include > +#include > #include "../mm/slab.h" > > static struct kunit_resource resource; > @@ -157,6 +158,26 @@ static void test_kmalloc_redzone_access(struct kunit *test) > kmem_cache_destroy(s); > } > > +struct test_kfree_rcu_struct { > + struct rcu_head rcu; > +}; > + > +static void test_kfree_rcu(struct kunit *test) > +{ > + struct kmem_cache *s = test_kmem_cache_create("TestSlub_kfree_rcu", > + sizeof(struct test_kfree_rcu_struct), > + SLAB_NO_MERGE); > + struct test_kfree_rcu_struct *p = kmem_cache_alloc(s, GFP_KERNEL); > + > + kasan_disable_current(); > + > + KUNIT_EXPECT_EQ(test, 0, slab_errors); > + > + kasan_enable_current(); > + kfree_rcu(p, rcu); > + kmem_cache_destroy(s); Looks like the type of test for this! > +} > + > static int test_init(struct kunit *test) > { > slab_errors = 0; > @@ -177,6 +198,7 @@ static struct kunit_case test_cases[] = { > > KUNIT_CASE(test_clobber_redzone_free), > KUNIT_CASE(test_kmalloc_redzone_access), > + KUNIT_CASE(test_kfree_rcu), > {} > }; > > diff --git a/mm/slab.h b/mm/slab.h > index b16e63191578..a0295600af92 100644 > --- a/mm/slab.h > +++ b/mm/slab.h > @@ -277,6 +277,8 @@ struct kmem_cache { > unsigned int red_left_pad; /* Left redzone padding size */ > const char *name; /* Name (only for display!) 
*/ > struct list_head list; /* List of slab caches */ > + struct work_struct async_destroy_work; > + > #ifdef CONFIG_SYSFS > struct kobject kobj; /* For sysfs */ > #endif > @@ -474,7 +476,7 @@ static inline bool is_kmalloc_cache(struct kmem_cache *s) > SLAB_NO_USER_FLAGS) > > bool __kmem_cache_empty(struct kmem_cache *); > -int __kmem_cache_shutdown(struct kmem_cache *); > +int __kmem_cache_shutdown(struct kmem_cache *, bool); > void __kmem_cache_release(struct kmem_cache *); > int __kmem_cache_shrink(struct kmem_cache *); > void slab_kmem_cache_release(struct kmem_cache *); > diff --git a/mm/slab_common.c b/mm/slab_common.c > index 5b1f996bed06..c5c356d0235d 100644 > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -44,6 +44,8 @@ static LIST_HEAD(slab_caches_to_rcu_destroy); > static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work); > static DECLARE_WORK(slab_caches_to_rcu_destroy_work, > slab_caches_to_rcu_destroy_workfn); > +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work); > + > > /* > * Set of flags that will prevent slab merging > @@ -234,6 +236,7 @@ static struct kmem_cache *create_cache(const char *name, > > s->refcount = 1; > list_add(&s->list, &slab_caches); > + INIT_WORK(&s->async_destroy_work, kmem_cache_kfree_rcu_destroy_workfn); > return s; > > out_free_cache: > @@ -449,12 +452,16 @@ static void slab_caches_to_rcu_destroy_workfn(struct work_struct *work) > } > } > > -static int shutdown_cache(struct kmem_cache *s) > +static int shutdown_cache(struct kmem_cache *s, bool warn_inuse) > { > /* free asan quarantined objects */ > + /* > + * XXX: is it ok to call this multiple times? and what happens with a > + * kfree_rcu() in flight that finishes after or in parallel with this? > + */ > kasan_cache_shutdown(s); > > - if (__kmem_cache_shutdown(s) != 0) > + if (__kmem_cache_shutdown(s, warn_inuse) != 0) > return -EBUSY; > > list_del(&s->list); > @@ -477,6 +484,32 @@ void slab_kmem_cache_release(struct kmem_cache *s) > kmem_cache_free(kmem_cache, s); > } > > +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work) > +{ > + struct kmem_cache *s; > + int err = -EBUSY; > + bool rcu_set; > + > + s = container_of(work, struct kmem_cache, async_destroy_work); > + > + // XXX use the real kmem_cache_free_barrier() or similar thing here > + rcu_barrier(); > + > + cpus_read_lock(); > + mutex_lock(&slab_mutex); > + > + rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; > + > + err = shutdown_cache(s, true); This is currently the only call to shutdown_cache()? So there is to be a way for the caller to have some influence over the value of that bool? > + WARN(err, "kmem_cache_destroy %s: Slab cache still has objects", > + s->name); Don't we want to have some sort of delay here? Or is this the 21-second delay and/or kfree_rcu_barrier() mentioned before? 
> + mutex_unlock(&slab_mutex); > + cpus_read_unlock(); > + if (!err && !rcu_set) > + kmem_cache_release(s); > +} > + > void kmem_cache_destroy(struct kmem_cache *s) > { > int err = -EBUSY; > @@ -494,9 +527,9 @@ void kmem_cache_destroy(struct kmem_cache *s) > if (s->refcount) > goto out_unlock; > > - err = shutdown_cache(s); > - WARN(err, "%s %s: Slab cache still has objects when called from %pS", > - __func__, s->name, (void *)_RET_IP_); > + err = shutdown_cache(s, false); > + if (err) > + schedule_work(&s->async_destroy_work); > out_unlock: > mutex_unlock(&slab_mutex); > cpus_read_unlock(); > diff --git a/mm/slub.c b/mm/slub.c > index 1617d8014ecd..4d435b3d2b5f 100644 > --- a/mm/slub.c > +++ b/mm/slub.c > @@ -5342,7 +5342,8 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab, > * This is called from __kmem_cache_shutdown(). We must take list_lock > * because sysfs file might still access partial list after the shutdowning. > */ > -static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) > +static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n, > + bool warn_inuse) > { > LIST_HEAD(discard); > struct slab *slab, *h; > @@ -5353,7 +5354,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) > if (!slab->inuse) { > remove_partial(n, slab); > list_add(&slab->slab_list, &discard); > - } else { > + } else if (warn_inuse) { > list_slab_objects(s, slab, > "Objects remaining in %s on __kmem_cache_shutdown()"); > } > @@ -5378,7 +5379,7 @@ bool __kmem_cache_empty(struct kmem_cache *s) > /* > * Release all resources used by a slab cache. > */ > -int __kmem_cache_shutdown(struct kmem_cache *s) > +int __kmem_cache_shutdown(struct kmem_cache *s, bool warn_inuse) > { > int node; > struct kmem_cache_node *n; > @@ -5386,7 +5387,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s) > flush_all_cpus_locked(s); > /* Attempt to free all objects */ > for_each_kmem_cache_node(s, node, n) { > - free_partial(s, n); > + free_partial(s, n, warn_inuse); > if (n->nr_partial || node_nr_slabs(n)) > return 1; > } > From vbabka at suse.cz Mon Jun 17 21:08:58 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 23:08:58 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <20240609082726.32742-1-Julia.Lawall@inria.fr> <20240612143305.451abf58@kernel.org> <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> Message-ID: <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> On 6/17/24 8:42 PM, Uladzislau Rezki wrote: >> + >> + s = container_of(work, struct kmem_cache, async_destroy_work); >> + >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > wanted to avoid initially. I wanted to avoid new API or flags for kfree_rcu() users and this would be achieved. The barrier is used internally so I don't consider that an API to avoid. How difficult the implementation is, is another question, depending on how the current batching works. Once (if) we have sheaves proven to work and move kfree_rcu() fully into SLUB, the barrier might also look different and hopefully easier. So maybe it's not worth investing too much into that barrier, and we should just go for the potentially longer, but easier to implement, approach?
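(Aside for illustration, not code from this thread; the foo_* names are hypothetical. What the internal barrier buys kfree_rcu() users: with a module-private call_rcu() callback, a module's exit path has to drain the callbacks itself before destroying its cache, while with kfree_rcu() handled inside slab/RCU plus an internal barrier or deferral in kmem_cache_destroy(), the exit path shrinks to a single call.)

	/* Sketch only; foo_* names are hypothetical. */
	static void __exit foo_exit_with_call_rcu(void)
	{
		/* earlier: call_rcu(&obj->rcu, foo_free_cb) for each object */
		rcu_barrier();                  /* drain foo_free_cb() invocations */
		kmem_cache_destroy(foo_cache);  /* would splat without the barrier */
	}

	static void __exit foo_exit_with_kfree_rcu(void)
	{
		/* earlier: kfree_rcu(obj, rcu) for each object */
		kmem_cache_destroy(foo_cache);  /* slab waits or defers internally */
	}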
> Since you do it asynchronous can we just repeat > and wait until it a cache is furry freed? The problem is we want to detect the cases when it's not fully freed because there was an actual leak. So at some point we'd need to stop the repeats because we know there can no longer be any kfree_rcu()'s in flight since the kmem_cache_destroy() was called. > I am asking because inventing a new kfree_rcu_barrier() might not be so > straight forward. Agreed. > > -- > Uladzislau Rezki From vbabka at suse.cz Mon Jun 17 21:19:00 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Mon, 17 Jun 2024 23:19:00 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: Message-ID: On 6/17/24 7:04 PM, Jason A. Donenfeld wrote: >>> Vlastimil, this is just checking a boolean (which could be >>> unlikely()'d), which should have pretty minimal overhead. Is that >>> alright with you? >> >> Well I doubt we can just set and check it without any barriers? The >> completion of the last pending kfree_rcu() might race with >> kmem_cache_destroy() in a way that will leave the cache there forever, no? >> And once we add barriers it becomes a perf issue? > > Hm, yea you might be right about barriers being required. But actually, > might this point toward a larger problem with no matter what approach, > polling or event, is chosen? If the current rule is that > kmem_cache_free() must never race with kmem_cache_destroy(), because Yes, calling alloc/free operations that race with destroy is a bug and we can't prevent that. > users have always made diligent use of call_rcu()/rcu_barrier() and But the issue we are solving here is a bit different - the users are not buggy, they do kfree_rcu() and then kmem_cache_destroy() and no more operations on the cache afterwards. We need to ensure that the handling of kfree_rcu() (which ultimately is basically kmem_cache_free() but internally to rcu/slub) doesn't race with kmem_cache_destroy(). > such, but now we're going to let those race with each other - either by > my thing above or by polling - so we're potentially going to get in trouble > and need some barriers anyway. The barrier in the async part of kmem_cache_destroy() should be enough to make sure all kfree_rcu() have finished before we proceed with the potentially racy parts of destroying, and we should be able to avoid changes in kmem_cache_free(). > I think? > > Jason
>>>>>> (This gets rid of a valuable debugging technique because >>>>>> in normal use, it is a bug to attempt to destroy a kmem_cache >>>>>> that has objects still allocated.) >>>> >>>> This seems like the best option to me. As Jason already said, the debugging >>>> technique is not affected significantly, if the warning just occurs >>>> asynchronously later. The module can be already unloaded at that point, as >>>> the leak is never checked programatically anyway to control further >>>> execution, it's just a splat in dmesg. >>> >>> Works for me! >> >> Great. So this is how a prototype could look like, hopefully? The kunit test >> does generate the splat for me, which should be because the rcu_barrier() in >> the implementation (marked to be replaced with the real thing) is really >> insufficient. Note the test itself passes as this kind of error isn't wired >> up properly. > > ;-) ;-) ;-) Yeah yeah, I just used the kunit module as a convenient way to add the code that should see if there's the splat :) > Some might want confirmation that their cleanup efforts succeeded, > but if so, I will let them make that known. It could be just the kunit test that could want that, but I don't see how it could wrap and inspect the result of the async handling and suppress the splats for intentionally triggered errors as many of the other tests do. >> Another thing to resolve is the marked comment about kasan_shutdown() with >> potential kfree_rcu()'s in flight. > > Could that simply move to the worker function? (Hey, had to ask!) I think I had a reason why not, but I guess it could move. It would just mean that if any objects are quarantined, we'll go for the async freeing even though those could be flushed immediately. Guess that's not too bad. >> Also you need CONFIG_SLUB_DEBUG enabled otherwise node_nr_slabs() is a no-op >> and it might fail to notice the pending slabs. This will need to change. > > Agreed. > > Looks generally good. A few questions below, to be taken with a > grain of salt. Thanks! >> +static void kmem_cache_kfree_rcu_destroy_workfn(struct work_struct *work) >> +{ >> + struct kmem_cache *s; >> + int err = -EBUSY; >> + bool rcu_set; >> + >> + s = container_of(work, struct kmem_cache, async_destroy_work); >> + >> + // XXX use the real kmem_cache_free_barrier() or similar thing here >> + rcu_barrier(); Note here's the barrier. >> + cpus_read_lock(); >> + mutex_lock(&slab_mutex); >> + >> + rcu_set = s->flags & SLAB_TYPESAFE_BY_RCU; >> + >> + err = shutdown_cache(s, true); > > This is currently the only call to shutdown_cache()? So there is to be > a way for the caller to have some influence over the value of that bool? Not the only caller, there's still the initial attempt in kmem_cache_destroy() itself below. > >> + WARN(err, "kmem_cache_destroy %s: Slab cache still has objects", >> + s->name); > > Don't we want to have some sort of delay here? Or is this the > 21-second delay and/or kfree_rcu_barrier() mentioned before? Yes this is after the barrier. The first immediate attempt to shut down doesn't warn.
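(Aside for illustration, not the prototype above: the "repeat and re-check" variant discussed in this thread could look roughly like the sketch below. The async_destroy_dwork and destroy_deadline fields and the cache_is_empty()/shutdown_cache_final() helpers are invented for the sketch.)

	static void kmem_cache_destroy_retry_workfn(struct work_struct *work)
	{
		struct kmem_cache *s = container_of(to_delayed_work(work),
						    struct kmem_cache,
						    async_destroy_dwork);

		if (cache_is_empty(s)) {
			shutdown_cache_final(s);	/* hypothetical final step */
			return;
		}
		if (time_before(jiffies, s->destroy_deadline)) {
			/* kfree_rcu() callbacks may still be in flight; re-check later. */
			schedule_delayed_work(&s->async_destroy_dwork, HZ);
			return;
		}
		/* Past the deadline (e.g. 21s): treat what remains as a leak. */
		WARN(1, "kmem_cache_destroy %s: Slab cache still has objects", s->name);
	}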
>> + mutex_unlock(&slab_mutex); >> + cpus_read_unlock(); >> + if (!err && !rcu_set) >> + kmem_cache_release(s); >> +} >> + >> void kmem_cache_destroy(struct kmem_cache *s) >> { >> int err = -EBUSY; >> @@ -494,9 +527,9 @@ void kmem_cache_destroy(struct kmem_cache *s) >> if (s->refcount) >> goto out_unlock; >> >> - err = shutdown_cache(s); >> - WARN(err, "%s %s: Slab cache still has objects when called from %pS", >> - __func__, s->name, (void *)_RET_IP_); >> + err = shutdown_cache(s, false); >> + if (err) >> + schedule_work(&s->async_destroy_work); And here's the initial attempt that used to warn but now doesn't and instead schedules the async one. >> out_unlock: >> mutex_unlock(&slab_mutex); >> cpus_read_unlock(); >> diff --git a/mm/slub.c b/mm/slub.c >> index 1617d8014ecd..4d435b3d2b5f 100644 >> --- a/mm/slub.c >> +++ b/mm/slub.c >> @@ -5342,7 +5342,8 @@ static void list_slab_objects(struct kmem_cache *s, struct slab *slab, >> * This is called from __kmem_cache_shutdown(). We must take list_lock >> * because sysfs file might still access partial list after the shutdowning. >> */ >> -static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) >> +static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n, >> + bool warn_inuse) >> { >> LIST_HEAD(discard); >> struct slab *slab, *h; >> @@ -5353,7 +5354,7 @@ static void free_partial(struct kmem_cache *s, struct kmem_cache_node *n) >> if (!slab->inuse) { >> remove_partial(n, slab); >> list_add(&slab->slab_list, &discard); >> - } else { >> + } else if (warn_inuse) { >> list_slab_objects(s, slab, >> "Objects remaining in %s on __kmem_cache_shutdown()"); >> } >> @@ -5378,7 +5379,7 @@ bool __kmem_cache_empty(struct kmem_cache *s) >> /* >> * Release all resources used by a slab cache. >> */ >> -int __kmem_cache_shutdown(struct kmem_cache *s) >> +int __kmem_cache_shutdown(struct kmem_cache *s, bool warn_inuse) >> { >> int node; >> struct kmem_cache_node *n; >> @@ -5386,7 +5387,7 @@ int __kmem_cache_shutdown(struct kmem_cache *s) >> flush_all_cpus_locked(s); >> /* Attempt to free all objects */ >> for_each_kmem_cache_node(s, node, n) { >> - free_partial(s, n); >> + free_partial(s, n, warn_inuse); >> if (n->nr_partial || node_nr_slabs(n)) >> return 1; >> } >> From syzbot+0dc211bc2adb944e1fd6 at syzkaller.appspotmail.com Tue Jun 18 02:22:21 2024 From: syzbot+0dc211bc2adb944e1fd6 at syzkaller.appspotmail.com (syzbot) Date: Mon, 17 Jun 2024 19:22:21 -0700 Subject: [syzbot] [kvm?] general protection fault in get_work_pool (2) Message-ID: <0000000000006eb03a061b20c079@google.com> Hello, syzbot found the following issue on: HEAD commit: 2ccbdf43d5e7 Merge tag 'for-linus' of git://git.kernel.org.. git tree: upstream console output: https://syzkaller.appspot.com/x/log.txt?x=16f23146980000 kernel config: https://syzkaller.appspot.com/x/.config?x=81c0d76ceef02b39 dashboard link: https://syzkaller.appspot.com/bug?extid=0dc211bc2adb944e1fd6 compiler: gcc (Debian 12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40 userspace arch: i386 Unfortunately, I don't have any reproducer for this issue yet. 
Downloadable assets: disk image (non-bootable): https://storage.googleapis.com/syzbot-assets/7bc7510fe41f/non_bootable_disk-2ccbdf43.raw.xz vmlinux: https://storage.googleapis.com/syzbot-assets/13cdb5bfbafa/vmlinux-2ccbdf43.xz kernel image: https://storage.googleapis.com/syzbot-assets/7a14f5d07f81/bzImage-2ccbdf43.xz IMPORTANT: if you fix the issue, please add the following tag to the commit: Reported-by: syzbot+0dc211bc2adb944e1fd6 at syzkaller.appspotmail.com Oops: general protection fault, probably for non-canonical address 0xe003fbfffff80000: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: maybe wild-memory-access in range [0x001fffffffc00000-0x001fffffffc00007] CPU: 1 PID: 5570 Comm: kworker/1:5 Not tainted 6.10.0-rc3-syzkaller-00044-g2ccbdf43d5e7 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Workqueue: wg-crypt-wg2 wg_packet_tx_worker RIP: 0010:get_work_pool+0xcb/0x1c0 kernel/workqueue.c:887 Code: 0d 36 00 48 89 d8 5b 5d c3 cc cc cc cc e8 8d 0d 36 00 48 81 e3 00 fe ff ff 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <80> 3c 02 00 0f 85 da 00 00 00 48 8b 1b e8 63 0d 36 00 48 89 d8 5b RSP: 0018:ffffc90000598738 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 001fffffffc00000 RCX: ffffffff815881f2 RDX: 0003fffffff80000 RSI: ffffffff81588243 RDI: 0000000000000007 RBP: 0000000000000004 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000004 R11: 0000000000000005 R12: ffffe8ffad288cc0 R13: ffff888000596400 R14: dffffc0000000000 R15: ffff88805b626800 FS: 0000000000000000(0000) GS:ffff88802c100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f7f75598 CR3: 0000000056bd0000 CR4: 0000000000350ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: __queue_work+0x200/0x1020 kernel/workqueue.c:2301 queue_work_on+0x11a/0x140 kernel/workqueue.c:2410 wg_queue_enqueue_per_device_and_peer drivers/net/wireguard/queueing.h:176 [inline] wg_packet_consume_data drivers/net/wireguard/receive.c:526 [inline] wg_packet_receive+0xf65/0x2350 drivers/net/wireguard/receive.c:576 wg_receive+0x74/0xc0 drivers/net/wireguard/socket.c:326 udp_queue_rcv_one_skb+0xad1/0x18b0 net/ipv4/udp.c:2131 udp_queue_rcv_skb+0x198/0xd10 net/ipv4/udp.c:2209 udp_unicast_rcv_skb+0x165/0x3b0 net/ipv4/udp.c:2369 __udp4_lib_rcv+0x2636/0x3550 net/ipv4/udp.c:2445 ip_protocol_deliver_rcu+0x30c/0x4e0 net/ipv4/ip_input.c:205 ip_local_deliver_finish+0x316/0x570 net/ipv4/ip_input.c:233 NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip_local_deliver+0x18e/0x1f0 net/ipv4/ip_input.c:254 dst_input include/net/dst.h:460 [inline] ip_rcv_finish net/ipv4/ip_input.c:449 [inline] NF_HOOK include/linux/netfilter.h:314 [inline] NF_HOOK include/linux/netfilter.h:308 [inline] ip_rcv+0x2c5/0x5d0 net/ipv4/ip_input.c:569 __netif_receive_skb_one_core+0x199/0x1e0 net/core/dev.c:5625 __netif_receive_skb+0x1d/0x160 net/core/dev.c:5739 process_backlog+0x133/0x760 net/core/dev.c:6068 __napi_poll.constprop.0+0xb7/0x550 net/core/dev.c:6722 napi_poll net/core/dev.c:6791 [inline] net_rx_action+0x9b6/0xf10 net/core/dev.c:6907 handle_softirqs+0x216/0x8f0 kernel/softirq.c:554 do_softirq kernel/softirq.c:455 [inline] do_softirq+0xb2/0xf0 kernel/softirq.c:442 __local_bh_enable_ip+0x100/0x120 kernel/softirq.c:382 wg_socket_send_skb_to_peer+0x14c/0x220 drivers/net/wireguard/socket.c:184 wg_packet_create_data_done drivers/net/wireguard/send.c:251 
[inline] wg_packet_tx_worker+0x1aa/0x810 drivers/net/wireguard/send.c:276 process_one_work+0x958/0x1ad0 kernel/workqueue.c:3231 process_scheduled_works kernel/workqueue.c:3312 [inline] worker_thread+0x6c8/0xf70 kernel/workqueue.c:3393 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:get_work_pool+0xcb/0x1c0 kernel/workqueue.c:887 Code: 0d 36 00 48 89 d8 5b 5d c3 cc cc cc cc e8 8d 0d 36 00 48 81 e3 00 fe ff ff 48 b8 00 00 00 00 00 fc ff df 48 89 da 48 c1 ea 03 <80> 3c 02 00 0f 85 da 00 00 00 48 8b 1b e8 63 0d 36 00 48 89 d8 5b RSP: 0018:ffffc90000598738 EFLAGS: 00010006 RAX: dffffc0000000000 RBX: 001fffffffc00000 RCX: ffffffff815881f2 RDX: 0003fffffff80000 RSI: ffffffff81588243 RDI: 0000000000000007 RBP: 0000000000000004 R08: 0000000000000007 R09: 0000000000000000 R10: 0000000000000004 R11: 0000000000000005 R12: ffffe8ffad288cc0 R13: ffff888000596400 R14: dffffc0000000000 R15: ffff88805b626800 FS: 0000000000000000(0000) GS:ffff88802c100000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000f7f75598 CR3: 0000000056bd0000 CR4: 0000000000350ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 ---------------- Code disassembly (best guess): 0: 0d 36 00 48 89 or $0x89480036,%eax 5: d8 5b 5d fcomps 0x5d(%rbx) 8: c3 ret 9: cc int3 a: cc int3 b: cc int3 c: cc int3 d: e8 8d 0d 36 00 call 0x360d9f 12: 48 81 e3 00 fe ff ff and $0xfffffffffffffe00,%rbx 19: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax 20: fc ff df 23: 48 89 da mov %rbx,%rdx 26: 48 c1 ea 03 shr $0x3,%rdx * 2a: 80 3c 02 00 cmpb $0x0,(%rdx,%rax,1) <-- trapping instruction 2e: 0f 85 da 00 00 00 jne 0x10e 34: 48 8b 1b mov (%rbx),%rbx 37: e8 63 0d 36 00 call 0x360d9f 3c: 48 89 d8 mov %rbx,%rax 3f: 5b pop %rbx --- This report is generated by a bot. It may contain errors. See https://goo.gl/tpsmEJ for more information about syzbot. syzbot engineers can be reached at syzkaller at googlegroups.com. syzbot will keep track of this issue. See: https://goo.gl/tpsmEJ#status for how to communicate with syzbot. If the report is already addressed, let syzbot know by replying with: #syz fix: exact-commit-title If you want to overwrite report's subsystems, reply with: #syz set subsystems: new-subsystem (See the list of subsystem names on the web dashboard) If the report is a duplicate of another one, reply with: #syz dup: exact-subject-of-another-report If you want to undo deduplication, reply with: #syz undup From urezki at gmail.com Tue Jun 18 09:31:00 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Tue, 18 Jun 2024 11:31:00 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> Message-ID: > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> + > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> + > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > > wanted to avoid initially. 
> > I wanted to avoid new API or flags for kfree_rcu() users and this would > be achieved. The barrier is used internally so I don't consider that an > API to avoid. How difficult is the implementation is another question, > depending on how the current batching works. Once (if) we have sheaves > proven to work and move kfree_rcu() fully into SLUB, the barrier might > also look different and hopefully easier. So maybe it's not worth to > invest too much into that barrier and just go for the potentially > longer, but easier to implement? > Right. I agree here. If the cache is not empty, OK, we just defer the work; we can even use a big 21-second delay, and after that we just "warn" if it is still not empty and leave it as it is, i.e. emit a warning and we are done. Destroying the cache is not something that must happen right away. > > Since you do it asynchronous can we just repeat > > and wait until it a cache is furry freed? > > The problem is we want to detect the cases when it's not fully freed > because there was an actual read. So at some point we'd need to stop the > repeats because we know there can no longer be any kfree_rcu()'s in > flight since the kmem_cache_destroy() was called. > Agree. As noted above, we can go with a 21-second (as an example) interval and just perform the destroy (without repeating). -- Uladzislau Rezki From paulmck at kernel.org Tue Jun 18 16:48:49 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Tue, 18 Jun 2024 09:48:49 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> Message-ID: <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > > >> + > > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > > >> + > > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > > > wanted to avoid initially. > > > > I wanted to avoid new API or flags for kfree_rcu() users and this would > > be achieved. The barrier is used internally so I don't consider that an > > API to avoid. How difficult is the implementation is another question, > > depending on how the current batching works. Once (if) we have sheaves > > proven to work and move kfree_rcu() fully into SLUB, the barrier might > > also look different and hopefully easier. So maybe it's not worth to > > invest too much into that barrier and just go for the potentially > > longer, but easier to implement? > > > Right. I agree here. If the cache is not empty, OK, we just defer the > work, even we can use a big 21 seconds delay, after that we just "warn" > if it is still not empty and leave it as it is, i.e. emit a warning and > we are done. > > Destroying the cache is not something that must happen right away. OK, I have to ask... Suppose that the cache is created and destroyed by a module at init/cleanup time, respectively. Suppose that this module is rmmod'ed then very quickly insmod'ed. Do we need to fail the insmod if the kmem_cache has not yet been fully cleaned up? If not, do we have two versions of the same kmem_cache in /proc during the overlap time?
Thanx, Paul > > > Since you do it asynchronous can we just repeat > > > and wait until it a cache is furry freed? > > > > The problem is we want to detect the cases when it's not fully freed > > because there was an actual read. So at some point we'd need to stop the > > repeats because we know there can no longer be any kfree_rcu()'s in > > flight since the kmem_cache_destroy() was called. > > > Agree. As noted above, we can go with 21 seconds(as an example) interval > and just perform destroy(without repeating). > > -- > Uladzislau Rezki From vbabka at suse.cz Tue Jun 18 17:21:42 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Tue, 18 Jun 2024 19:21:42 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> Message-ID: <9967fdfa-e649-456d-a0cb-b4c4bf7f9d68@suse.cz> On 6/18/24 6:48 PM, Paul E. McKenney wrote: > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: >> > >> + >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); >> > >> + >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here >> > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i >> > > wanted to avoid initially. >> > >> > I wanted to avoid new API or flags for kfree_rcu() users and this would >> > be achieved. The barrier is used internally so I don't consider that an >> > API to avoid. How difficult is the implementation is another question, >> > depending on how the current batching works. Once (if) we have sheaves >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might >> > also look different and hopefully easier. So maybe it's not worth to >> > invest too much into that barrier and just go for the potentially >> > longer, but easier to implement? >> > >> Right. I agree here. If the cache is not empty, OK, we just defer the >> work, even we can use a big 21 seconds delay, after that we just "warn" >> if it is still not empty and leave it as it is, i.e. emit a warning and >> we are done. >> >> Destroying the cache is not something that must happen right away. > > OK, I have to ask... > > Suppose that the cache is created and destroyed by a module and > init/cleanup time, respectively. Suppose that this module is rmmod'ed > then very quickly insmod'ed. > > Do we need to fail the insmod if the kmem_cache has not yet been fully > cleaned up? We don't have any such link between kmem_cache and module to detect that, so we would have to start tracking that. Probably not worth the trouble. > If not, do we have two versions of the same kmem_cache in > /proc during the overlap time? Hm could happen in /proc/slabinfo but without being harmful other than perhaps confusing someone. We could filter out the caches being destroyed trivially. Sysfs and debugfs might be more problematic as I suppose directory names would clash. I'll have to check... might be even happening now when we do detect leaked objects and just leave the cache around... thanks for the question. > Thanx, Paul > >> > > Since you do it asynchronous can we just repeat >> > > and wait until it a cache is furry freed? 
>> > >> > The problem is we want to detect the cases when it's not fully freed >> > because there was an actual read. So at some point we'd need to stop the >> > repeats because we know there can no longer be any kfree_rcu()'s in >> > flight since the kmem_cache_destroy() was called. >> > >> Agree. As noted above, we can go with 21 seconds(as an example) interval >> and just perform destroy(without repeating). >> >> -- >> Uladzislau Rezki From paulmck at kernel.org Tue Jun 18 17:53:11 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Tue, 18 Jun 2024 10:53:11 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <9967fdfa-e649-456d-a0cb-b4c4bf7f9d68@suse.cz> References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> <9967fdfa-e649-456d-a0cb-b4c4bf7f9d68@suse.cz> Message-ID: <6dad6e9f-e0ca-4446-be9c-1be25b2536dd@paulmck-laptop> On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: > On 6/18/24 6:48 PM, Paul E. McKenney wrote: > > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> > >> + > >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> > >> + > >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > >> > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > >> > > wanted to avoid initially. > >> > > >> > I wanted to avoid new API or flags for kfree_rcu() users and this would > >> > be achieved. The barrier is used internally so I don't consider that an > >> > API to avoid. How difficult is the implementation is another question, > >> > depending on how the current batching works. Once (if) we have sheaves > >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> > also look different and hopefully easier. So maybe it's not worth to > >> > invest too much into that barrier and just go for the potentially > >> > longer, but easier to implement? > >> > > >> Right. I agree here. If the cache is not empty, OK, we just defer the > >> work, even we can use a big 21 seconds delay, after that we just "warn" > >> if it is still not empty and leave it as it is, i.e. emit a warning and > >> we are done. > >> > >> Destroying the cache is not something that must happen right away. > > > > OK, I have to ask... > > > > Suppose that the cache is created and destroyed by a module and > > init/cleanup time, respectively. Suppose that this module is rmmod'ed > > then very quickly insmod'ed. > > > > Do we need to fail the insmod if the kmem_cache has not yet been fully > > cleaned up? > > We don't have any such link between kmem_cache and module to detect that, so > we would have to start tracking that. Probably not worth the trouble. Fair enough! > > If not, do we have two versions of the same kmem_cache in > > /proc during the overlap time? > > Hm could happen in /proc/slabinfo but without being harmful other than > perhaps confusing someone. We could filter out the caches being destroyed > trivially. Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script breakage from flagging the name somehow, for example, trailing "/" character. > Sysfs and debugfs might be more problematic as I suppose directory names > would clash. I'll have to check... 
might be even happening now when we do > detect leaked objects and just leave the cache around... thanks for the > question. "It is a service that I provide." ;-) But yes, we might be living with it already and there might already be ways people deal with it. Thanx, Paul > >> > > Since you do it asynchronous can we just repeat > >> > > and wait until it a cache is furry freed? > >> > > >> > The problem is we want to detect the cases when it's not fully freed > >> > because there was an actual read. So at some point we'd need to stop the > >> > repeats because we know there can no longer be any kfree_rcu()'s in > >> > flight since the kmem_cache_destroy() was called. > >> > > >> Agree. As noted above, we can go with 21 seconds(as an example) interval > >> and just perform destroy(without repeating). > >> > >> -- > >> Uladzislau Rezki > From vbabka at suse.cz Wed Jun 19 09:28:13 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Wed, 19 Jun 2024 11:28:13 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <6dad6e9f-e0ca-4446-be9c-1be25b2536dd@paulmck-laptop> References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> <9967fdfa-e649-456d-a0cb-b4c4bf7f9d68@suse.cz> <6dad6e9f-e0ca-4446-be9c-1be25b2536dd@paulmck-laptop> Message-ID: <4cba4a48-902b-4fb6-895c-c8e6b64e0d5f@suse.cz> On 6/18/24 7:53 PM, Paul E. McKenney wrote: > On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: >> On 6/18/24 6:48 PM, Paul E. McKenney wrote: >> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: >> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: >> >> > >> + >> >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); >> >> > >> + >> >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here >> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i >> >> > > wanted to avoid initially. >> >> > >> >> > I wanted to avoid new API or flags for kfree_rcu() users and this would >> >> > be achieved. The barrier is used internally so I don't consider that an >> >> > API to avoid. How difficult is the implementation is another question, >> >> > depending on how the current batching works. Once (if) we have sheaves >> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might >> >> > also look different and hopefully easier. So maybe it's not worth to >> >> > invest too much into that barrier and just go for the potentially >> >> > longer, but easier to implement? >> >> > >> >> Right. I agree here. If the cache is not empty, OK, we just defer the >> >> work, even we can use a big 21 seconds delay, after that we just "warn" >> >> if it is still not empty and leave it as it is, i.e. emit a warning and >> >> we are done. >> >> >> >> Destroying the cache is not something that must happen right away. >> > >> > OK, I have to ask... >> > >> > Suppose that the cache is created and destroyed by a module and >> > init/cleanup time, respectively. Suppose that this module is rmmod'ed >> > then very quickly insmod'ed. >> > >> > Do we need to fail the insmod if the kmem_cache has not yet been fully >> > cleaned up? >> >> We don't have any such link between kmem_cache and module to detect that, so >> we would have to start tracking that. Probably not worth the trouble. 
> > Fair enough! > >> > If not, do we have two versions of the same kmem_cache in >> > /proc during the overlap time? >> >> Hm could happen in /proc/slabinfo but without being harmful other than >> perhaps confusing someone. We could filter out the caches being destroyed >> trivially. > > Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script > breakage from flagging the name somehow, for example, trailing "/" > character. Yeah I've been resisting such changes to the layout and this wouldn't be worth it, apart from changing the name itself but not in a dangerous way like with "/" :) >> Sysfs and debugfs might be more problematic as I suppose directory names >> would clash. I'll have to check... might be even happening now when we do >> detect leaked objects and just leave the cache around... thanks for the >> question. > > "It is a service that I provide." ;-) > > But yes, we might be living with it already and there might already > be ways people deal with it. So it seems if the sysfs/debugfs directories already exist, they will silently not be created. Wonder if we have such cases today already because caches with same name exist. I think we do with the zsmalloc using 32 caches with same name that we discussed elsewhere just recently. Also indeed if the cache has leaked objects and thus won't be destroyed, these directories indeed stay around, as well as the slabinfo entry, and can prevent new ones from being created (slabinfo lines with same name are not prevented). But it wouldn't be great to introduce this possibility for the temporarily delayed removal due to kfree_rcu() and a module re-insert, since that's a legitimate case and not buggy state due to leaks. The debugfs directory we could remove immediately before handing over to the scheduled workfn, but if it turns out there was a leak and the workfn leaves the cache around, debugfs dir will be gone and we can't check the alloc_traces/free_traces files there (but we have the per-object info including the traces in the dmesg splat). The sysfs directory is currently removed only with the whole cache being destroyed due to the sysfs/kobject lifetime model. I'd love to untangle it for other reasons too, but haven't investigated it yet. But again it might be useful for sysfs dir to stay around for inspection, as for the debugfs. We could rename the sysfs/debugfs directories before queuing the work? Add some prefix like GOING_AWAY-$name. If a leak is detected and cache stays forever, another rename to LEAKED-$name. (and same for the slabinfo). But multiple ones with same name might pile up, so try adding a counter then? Probably messy to implement, but perhaps the most robust in the end? The automatic counter could also solve the general case of people using same name for multiple caches. Other ideas? Thanks, Vlastimil > > Thanx, Paul > >> >> > > Since you do it asynchronous >> >> > > and wait until it a cache is furry freed?
>> >> >> >> -- >> >> Uladzislau Rezki >> From nico.schottelius at ungleich.ch Wed Jun 19 09:42:34 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Wed, 19 Jun 2024 11:42:34 +0200 Subject: Wireguard broken with ip rule due to missing address binding Message-ID: <87h6dpi7zp.fsf@ungleich.ch> Hello, a follow-up to the previous thread: if one uses "ip rule" for doing source-based routing, wireguard is broken / cannot be used correctly. Let's take the following test case: a) We have a separate VRF / routing table for wireguard endpoints [09:35] server141.place10:~# ip rule ls 0: from all lookup local 32765: from 192.168.1.0/24 lookup 42 32766: from all lookup main 32767: from all lookup default [09:37] server141.place10:~# ip route sh table 42 194.5.220.0/24 via 192.168.1.254 dev eth1 proto bird metric 32 194.187.90.23 via 192.168.1.254 dev eth1 proto bird metric 32 212.103.65.231 via 192.168.1.254 dev eth1 proto bird metric 32 b) ping with a random IP address does not work (correct) [09:35] server141.place10:~# ping -c2 194.187.90.23 PING 194.187.90.23 (194.187.90.23): 56 data bytes --- 194.187.90.23 ping statistics --- 2 packets transmitted, 0 packets received, 100% packet loss c) ping with the correct source IP address does work [09:35] server141.place10:~# ping -I 192.168.1.149 -c2 194.187.90.23 PING 194.187.90.23 (194.187.90.23) from 192.168.1.149: 56 data bytes 64 bytes from 194.187.90.23: seq=0 ttl=57 time=3.883 ms 64 bytes from 194.187.90.23: seq=1 ttl=57 time=3.810 ms --- 194.187.90.23 ping statistics --- 2 packets transmitted, 2 packets received, 0% packet loss round-trip min/avg/max = 3.810/3.846/3.883 ms [09:35] server141.place10:~# d) wireguard does not work [09:38] server141.place10:~# wg show interface: oserver120 public key: EqrNWstRSdJnj1trm5KSWbVNxLi10w/ea2EbdADJSWU= private key: (hidden) listening port: 54658 peer: hUm9SGQnhOG7dPn4OuiGXJZ3Wk9UZZ9JdHd32HYyH0w= endpoint: 194.187.90.23:4011 allowed ips: ::/0, 0.0.0.0/0 transfer: 0 B received, 8.09 KiB sent [09:38] server141.place10:~# From my perspective this is yet another bug that one encounters due to missing IP address binding in wireguard. And no, putting everything into a separate namespace is not an option, because processes from the non-namespaced part need access to the tunnel. I really hope the address binding issue can be solved soon, especially given there is already a patch for it available. Best regards, Nico -- Sustainable and modern Infrastructures by ungleich.ch From urezki at gmail.com Wed Jun 19 09:51:58 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Wed, 19 Jun 2024 11:51:58 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> Message-ID: On Tue, Jun 18, 2024 at 09:48:49AM -0700, Paul E.
McKenney wrote: > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > > > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > > > >> + > > > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > > > >> + > > > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > > > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > > > > wanted to avoid initially. > > > > > > I wanted to avoid new API or flags for kfree_rcu() users and this would > > > be achieved. The barrier is used internally so I don't consider that an > > > API to avoid. How difficult is the implementation is another question, > > > depending on how the current batching works. Once (if) we have sheaves > > > proven to work and move kfree_rcu() fully into SLUB, the barrier might > > > also look different and hopefully easier. So maybe it's not worth to > > > invest too much into that barrier and just go for the potentially > > > longer, but easier to implement? > > > > > Right. I agree here. If the cache is not empty, OK, we just defer the > > work, even we can use a big 21 seconds delay, after that we just "warn" > > if it is still not empty and leave it as it is, i.e. emit a warning and > > we are done. > > > > Destroying the cache is not something that must happen right away. > > OK, I have to ask... > > Suppose that the cache is created and destroyed by a module and > init/cleanup time, respectively. Suppose that this module is rmmod'ed > then very quickly insmod'ed. > > Do we need to fail the insmod if the kmem_cache has not yet been fully > cleaned up? If not, do we have two versions of the same kmem_cache in > /proc during the overlap time? > No fail :) If same cache is created several times, its s->refcount gets increased, so, it does not create two entries in the "slabinfo". But i agree that your point is good! We need to be careful with removing and simultaneously creating. From the first glance, there is a refcounter and a global "slab_mutex" which is used to protect a critical section. Destroying is almost fully protected (as noted above, by a global mutex) with one exception, it is: static void kmem_cache_release(struct kmem_cache *s) { if (slab_state >= FULL) { sysfs_slab_unlink(s); sysfs_slab_release(s); } else { slab_kmem_cache_release(s); } } this one can race, IMO. -- Uladzislau Rezki From vbabka at suse.cz Wed Jun 19 09:56:44 2024 From: vbabka at suse.cz (Vlastimil Babka) Date: Wed, 19 Jun 2024 11:56:44 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <08ee7eb2-8d08-4f1f-9c46-495a544b8c0e@paulmck-laptop> <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> Message-ID: On 6/19/24 11:51 AM, Uladzislau Rezki wrote: > On Tue, Jun 18, 2024 at 09:48:49AM -0700, Paul E. McKenney wrote: >> On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: >> > > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: >> > > >> + >> > > >> + s = container_of(work, struct kmem_cache, async_destroy_work); >> > > >> + >> > > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here >> > > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i >> > > > wanted to avoid initially. >> > > >> > > I wanted to avoid new API or flags for kfree_rcu() users and this would >> > > be achieved.
The barrier is used internally so I don't consider that an >> > > API to avoid. How difficult is the implementation is another question, >> > > depending on how the current batching works. Once (if) we have sheaves >> > > proven to work and move kfree_rcu() fully into SLUB, the barrier might >> > > also look different and hopefully easier. So maybe it's not worth to >> > > invest too much into that barrier and just go for the potentially >> > > longer, but easier to implement? >> > > >> > Right. I agree here. If the cache is not empty, OK, we just defer the >> > work, even we can use a big 21 seconds delay, after that we just "warn" >> > if it is still not empty and leave it as it is, i.e. emit a warning and >> > we are done. >> > >> > Destroying the cache is not something that must happen right away. >> >> OK, I have to ask... >> >> Suppose that the cache is created and destroyed by a module and >> init/cleanup time, respectively. Suppose that this module is rmmod'ed >> then very quickly insmod'ed. >> >> Do we need to fail the insmod if the kmem_cache has not yet been fully >> cleaned up? If not, do we have two versions of the same kmem_cache in >> /proc during the overlap time? >> > No fail :) If same cache is created several times, its s->refcount gets > increased, so, it does not create two entries in the "slabinfo". But i > agree that your point is good! We need to be carefully with removing and > simultaneous creating. Note that this merging may be disabled or not happen due to various flags on the cache being incompatible with it. And I want to actually make sure it never happens for caches being already destroyed as that would lead to use-after-free (the workfn doesn't recheck the refcount in case a merge would happen during the grace period) --- a/mm/slab_common.c +++ b/mm/slab_common.c @@ -150,9 +150,10 @@ int slab_unmergeable(struct kmem_cache *s) #endif /* - * We may have set a slab to be unmergeable during bootstrap. + * We may have set a cache to be unmergeable during bootstrap. + * 0 is for cache being destroyed asynchronously */ - if (s->refcount < 0) + if (s->refcount <= 0) return 1; return 0; > From the first glance, there is a refcounter and a global "slab_mutex" > which is used to protect a critical section. Destroying is almost fully > protected(as noted above, by a global mutex) with one exception, it is: > > static void kmem_cache_release(struct kmem_cache *s) > { > if (slab_state >= FULL) { > sysfs_slab_unlink(s); > sysfs_slab_release(s); > } else { > slab_kmem_cache_release(s); > } > } > > this one can race, IMO. > > -- > Uladzislau Rezki From a at unstable.cc Wed Jun 19 10:01:07 2024 From: a at unstable.cc (Antonio Quartulli) Date: Wed, 19 Jun 2024 12:01:07 +0200 Subject: Wireguard broken with ip rule due to missing address binding In-Reply-To: <87h6dpi7zp.fsf@ungleich.ch> References: <87h6dpi7zp.fsf@ungleich.ch> Message-ID: <9e91adb2-b155-4eef-8604-a2f762a98d4d@unstable.cc> Hi, On 19/06/2024 11:42, Nico Schottelius wrote: > I really hope the address binding issue can be solved soon, especially > giving there is already a patch for it available. Question: instead of implementing pure IP binding, may it help to implement some logic so that messages to a peer are always sent using the IP where previous packets were received? 
Cheers, -- Antonio Quartulli From nico.schottelius at ungleich.ch Wed Jun 19 10:12:49 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Wed, 19 Jun 2024 12:12:49 +0200 Subject: Wireguard broken with ip rule due to missing address binding In-Reply-To: <9e91adb2-b155-4eef-8604-a2f762a98d4d@unstable.cc> (Antonio Quartulli's message of "Wed, 19 Jun 2024 12:01:07 +0200") References: <87h6dpi7zp.fsf@ungleich.ch> <9e91adb2-b155-4eef-8604-a2f762a98d4d@unstable.cc> Message-ID: <87cyodi6la.fsf@ungleich.ch> Hello Antonio, Antonio Quartulli writes: > Hi, > > On 19/06/2024 11:42, Nico Schottelius wrote: >> I really hope the address binding issue can be solved soon, especially >> giving there is already a patch for it available. > > Question: instead of implementing pure IP binding, may it help to > implement some logic so that messages to a peer are always sent using > the IP where previous packets were received? This would fix the problem of replying with the incorrect address, yes. However, it does not fix the issue of selecting the right IP address on systems with multiple IP addresses ("Originating / initial ip address wrong"). Adding this option sounds rather reasonable, but it does not fix the whole issue. Note that both issues would be fixed with IP address binding. BR, Nico -- Sustainable and modern Infrastructures by ungleich.ch From a at unstable.cc Wed Jun 19 10:19:36 2024 From: a at unstable.cc (Antonio Quartulli) Date: Wed, 19 Jun 2024 12:19:36 +0200 Subject: Wireguard broken with ip rule due to missing address binding In-Reply-To: <87cyodi6la.fsf@ungleich.ch> References: <87h6dpi7zp.fsf@ungleich.ch> <9e91adb2-b155-4eef-8604-a2f762a98d4d@unstable.cc> <87cyodi6la.fsf@ungleich.ch> Message-ID: Hi Nico, On 19/06/2024 12:12, Nico Schottelius wrote: > However it does not fix the issue of selecting the right ip address on > systems with multiple IP addresses ("Originating / initial ip address > wrong"). you're right. I looked at this from a pure "server" perspective, where you always wait for somebody else to originate the connection. Regards, -- Antonio Quartulli From urezki at gmail.com Wed Jun 19 11:22:12 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Wed, 19 Jun 2024 13:22:12 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: References: <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> Message-ID: On Wed, Jun 19, 2024 at 11:56:44AM +0200, Vlastimil Babka wrote: > On 6/19/24 11:51 AM, Uladzislau Rezki wrote: > > On Tue, Jun 18, 2024 at 09:48:49AM -0700, Paul E. McKenney wrote: > >> On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> > > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> > > >> + > >> > > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> > > >> + > >> > > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > >> > > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > >> > > > wanted to avoid initially. > >> > > > >> > > I wanted to avoid new API or flags for kfree_rcu() users and this would > >> > > be achieved.
The barrier is used internally so I don't consider that an > >> > > API to avoid. How difficult is the implementation is another question, > >> > > depending on how the current batching works. Once (if) we have sheaves > >> > > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> > > also look different and hopefully easier. So maybe it's not worth to > >> > > invest too much into that barrier and just go for the potentially > >> > > longer, but easier to implement? > >> > > > >> > Right. I agree here. If the cache is not empty, OK, we just defer the > >> > work, even we can use a big 21 seconds delay, after that we just "warn" > >> > if it is still not empty and leave it as it is, i.e. emit a warning and > >> > we are done. > >> > > >> > Destroying the cache is not something that must happen right away. > >> > >> OK, I have to ask... > >> > >> Suppose that the cache is created and destroyed by a module and > >> init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> then very quickly insmod'ed. > >> > >> Do we need to fail the insmod if the kmem_cache has not yet been fully > >> cleaned up? If not, do we have two versions of the same kmem_cache in > >> /proc during the overlap time? > >> > > No fail :) If same cache is created several times, its s->refcount gets > > increased, so, it does not create two entries in the "slabinfo". But i > > agree that your point is good! We need to be carefully with removing and > > simultaneous creating. > > Note that this merging may be disabled or not happen due to various flags on > the cache being incompatible with it. And I want to actually make sure it > never happens for caches being already destroyed as that would lead to > use-after-free (the workfn doesn't recheck the refcount in case a merge > would happen during the grace period) > > --- a/mm/slab_common.c > +++ b/mm/slab_common.c > @@ -150,9 +150,10 @@ int slab_unmergeable(struct kmem_cache *s) > #endif > > /* > - * We may have set a slab to be unmergeable during bootstrap. > + * We may have set a cache to be unmergeable during bootstrap. > + * 0 is for cache being destroyed asynchronously > */ > - if (s->refcount < 0) > + if (s->refcount <= 0) > return 1; > > return 0; > OK, i see such flags, SLAB_NO_MERGE. Then i was wrong, it can create two different slabs. Thanks! -- Uladzislau Rezki From paulmck at kernel.org Wed Jun 19 16:46:35 2024 From: paulmck at kernel.org (Paul E. McKenney) Date: Wed, 19 Jun 2024 09:46:35 -0700 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <4cba4a48-902b-4fb6-895c-c8e6b64e0d5f@suse.cz> References: <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> <9967fdfa-e649-456d-a0cb-b4c4bf7f9d68@suse.cz> <6dad6e9f-e0ca-4446-be9c-1be25b2536dd@paulmck-laptop> <4cba4a48-902b-4fb6-895c-c8e6b64e0d5f@suse.cz> Message-ID: <04567347-c138-48fb-a5ab-44cc6a318549@paulmck-laptop> On Wed, Jun 19, 2024 at 11:28:13AM +0200, Vlastimil Babka wrote: > On 6/18/24 7:53 PM, Paul E. McKenney wrote: > > On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: > >> On 6/18/24 6:48 PM, Paul E. 
McKenney wrote: > >> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> >> > >> + > >> >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> >> > >> + > >> >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > >> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > >> >> > > wanted to avoid initially. > >> >> > > >> >> > I wanted to avoid new API or flags for kfree_rcu() users and this would > >> >> > be achieved. The barrier is used internally so I don't consider that an > >> >> > API to avoid. How difficult is the implementation is another question, > >> >> > depending on how the current batching works. Once (if) we have sheaves > >> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> >> > also look different and hopefully easier. So maybe it's not worth to > >> >> > invest too much into that barrier and just go for the potentially > >> >> > longer, but easier to implement? > >> >> > > >> >> Right. I agree here. If the cache is not empty, OK, we just defer the > >> >> work, even we can use a big 21 seconds delay, after that we just "warn" > >> >> if it is still not empty and leave it as it is, i.e. emit a warning and > >> >> we are done. > >> >> > >> >> Destroying the cache is not something that must happen right away. > >> > > >> > OK, I have to ask... > >> > > >> > Suppose that the cache is created and destroyed by a module and > >> > init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> > then very quickly insmod'ed. > >> > > >> > Do we need to fail the insmod if the kmem_cache has not yet been fully > >> > cleaned up? > >> > >> We don't have any such link between kmem_cache and module to detect that, so > >> we would have to start tracking that. Probably not worth the trouble. > > > > Fair enough! > > > >> > If not, do we have two versions of the same kmem_cache in > >> > /proc during the overlap time? > >> > >> Hm could happen in /proc/slabinfo but without being harmful other than > >> perhaps confusing someone. We could filter out the caches being destroyed > >> trivially. > > > > Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script > > breakage from flagging the name somehow, for example, trailing "/" > > character. > > Yeah I've been resisting such changes to the layout and this wouldn't be > worth it, apart from changing the name itself but not in a dangerous way > like with "/" :) ;-) ;-) ;-) > >> Sysfs and debugfs might be more problematic as I suppose directory names > >> would clash. I'll have to check... might be even happening now when we do > >> detect leaked objects and just leave the cache around... thanks for the > >> question. > > > > "It is a service that I provide." ;-) > > > > But yes, we might be living with it already and there might already > > be ways people deal with it. > > So it seems if the sysfs/debugfs directories already exist, they will > silently not be created. Wonder if we have such cases today already because > caches with same name exist. I think we do with the zsmalloc using 32 caches > with same name that we discussed elsewhere just recently. > > Also indeed if the cache has leaked objects and won't be thus destroyed, > these directories indeed stay around, as well as the slabinfo entry, and can > prevent new ones from being created (slabinfo lines with same name are not > prevented). New one on me! 
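(For concreteness, a hypothetical sketch of the deferred destruction being discussed, in the shape of the patch quoted earlier in the thread: async_destroy_work comes from that patch, kfree_rcu_barrier() does not exist yet, and the warn-instead-of-free policy follows the 21-second example above. This is not code from any tree.)

static void kmem_cache_destroy_workfn(struct work_struct *work)
{
	struct kmem_cache *s =
		container_of(work, struct kmem_cache, async_destroy_work);

	/* hypothetical: wait for in-flight kfree_rcu() batches to drain */
	kfree_rcu_barrier();

	mutex_lock(&slab_mutex);
	if (!__kmem_cache_empty(s)) {
		/* a real leak: warn once and leave the cache around for inspection */
		WARN(1, "kmem_cache %s: objects remain after grace periods\n",
		     s->name);
		mutex_unlock(&slab_mutex);
		return;
	}
	/* refcount stayed <= 0 throughout, so no merge could have occurred */
	shutdown_cache(s);
	mutex_unlock(&slab_mutex);
}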
> But it wouldn't be great to introduce this possibility to happen for the > temporarily delayed removal due to kfree_rcu() and a module re-insert, since > that's a legitimate case and not buggy state due to leaks. Agreed. > The debugfs directory we could remove immediately before handing over to the > scheduled workfn, but if it turns out there was a leak and the workfn leaves > the cache around, debugfs dir will be gone and we can't check the > alloc_traces/free_traces files there (but we have the per-object info > including the traces in the dmesg splat). > > The sysfs directory is currently removed only with the whole cache being > destryed due to sysfs/kobject lifetime model. I'd love to untangle it for > other reasons too, but haven't investigated it yet. But again it might be > useful for sysfs dir to stay around for inspection, as for the debugfs. > > We could rename the sysfs/debugfs directories before queuing the work? Add > some prefix like GOING_AWAY-$name. If leak is detected and cache stays > forever, another rename to LEAKED-$name. (and same for the slabinfo). But > multiple ones with same name might pile up, so try adding a counter then? > Probably messy to implement, but perhaps the most robust in the end? The > automatic counter could also solve the general case of people using same > name for multiple caches. > > Other ideas? Move the going-away files/directories to some new directoriesy? But you would still need a counter or whatever. I honestly cannot say what would be best from the viewpoint of existing software scanning those files and directories. Thanx, Paul > Thanks, > Vlastimil > > > > > Thanx, Paul > > > >> >> > > Since you do it asynchronous can we just repeat > >> >> > > and wait until it a cache is furry freed? > >> >> > > >> >> > The problem is we want to detect the cases when it's not fully freed > >> >> > because there was an actual read. So at some point we'd need to stop the > >> >> > repeats because we know there can no longer be any kfree_rcu()'s in > >> >> > flight since the kmem_cache_destroy() was called. > >> >> > > >> >> Agree. As noted above, we can go with 21 seconds(as an example) interval > >> >> and just perform destroy(without repeating). > >> >> > >> >> -- > >> >> Uladzislau Rezki > >> > From nohktwo at gmail.com Thu Jun 20 14:52:10 2024 From: nohktwo at gmail.com (Nohk Two) Date: Thu, 20 Jun 2024 22:52:10 +0800 Subject: How to detect the IP CAM on LAN from WG tunnel ? Message-ID: <384d1fdd-a32f-4839-bb8b-2761be363b50@gmail.com> Hi, This seems a common question but I don't know how do you solve this problem. My machine has an ethernet interface: eth0 It's network is 192.168.100.1/24 I created a wireguard interface thru eth0: wg0 It's network is 192.168.128.1/24 I have an IP CAM on the LAN: cam1 It's network is 192.168.100.21/24 This is physically on the same LAN as my machine's eth0. My machine has a MASQUERADE iptable entry in the nat table: iptables -t nat -A POSTROUTING -s 192.168.128.0/24 -o eth0 -j MASQUERADE My phone uses the wireguard connect to my machine's wg0. This wireguard configuration allow 192.168.100.0/24. My phone's wireguard VPN IP address 192.168.128.10/24. So my phone should be able to connect to my IP CAM without problem. 
192.168.128.10(phone) source NAT as 192.168.100.1(eth0) then connect to 192.168.100.21(cam1) 192.168.100.21(cam1) reply to 192.168.100.1(eth0) then NAT rewrite to 192.168.128.10(phone) However, the IP CAM's mobile App on my phone never remembers the IP CAM's IP address and will always scan the network to find the IP CAM. It then fails if my phone uses the wireguard VPN. Maybe the problem is that my phone and the IP CAM are on different networks, 192.168.128.0/24 vs 192.168.100.0/24. How do you solve this problem ? From sune at molgaard.org Thu Jun 20 23:08:15 2024 From: sune at molgaard.org (=?UTF-8?Q?Sune_M=C3=B8lgaard?=) Date: Fri, 21 Jun 2024 01:08:15 +0200 Subject: Multicast on Android? Message-ID: Is it currently possible to enable multicast on the wireguard device under Android, in order to, e.g., access DLNA services? If not, is it feasible to implement? Best regards, Sune Mølgaard From mark at rekudos.net Fri Jun 21 09:18:49 2024 From: mark at rekudos.net (Mark Lawrence) Date: Fri, 21 Jun 2024 09:18:49 +0000 Subject: How to detect the IP CAM on LAN from WG tunnel ? In-Reply-To: <384d1fdd-a32f-4839-bb8b-2761be363b50@gmail.com> References: <384d1fdd-a32f-4839-bb8b-2761be363b50@gmail.com> Message-ID: >How do you solve this problem ? Iterative fact checking, from the lowest levels of the network stack to the highest. - Are the devices actually connected where you think they are? - With the tunnel disconnected, does your phone connect to the camera? - Is your Wireguard tunnel set up properly? - Can your phone ping the wg0 address with the tunnel active? - Can your phone ping other .100 devices with the tunnel active? - Does your eth0/wg0 machine have IP forwarding enabled? - sysctl net.ipv4.ip_forward=1 - What does packet tracing show? - I.e. `ngrep -d wg0 .\* icmp` or the tcpdump equivalent, also against eth0 for the wireguard UDP port. - Does the mobile App actually support remote (routed) cameras or just on the local network? -- Mark Lawrence From urezki at gmail.com Fri Jun 21 09:32:12 2024 From: urezki at gmail.com (Uladzislau Rezki) Date: Fri, 21 Jun 2024 11:32:12 +0200 Subject: [PATCH 00/14] replace call_rcu by kfree_rcu for simple kmem_cache_free callback In-Reply-To: <4cba4a48-902b-4fb6-895c-c8e6b64e0d5f@suse.cz> References: <3b6fe525-626c-41fb-8625-3925ca820d8e@paulmck-laptop> <6711935d-20b5-41c1-8864-db3fc7d7823d@suse.cz> <36c60acd-543e-48c5-8bd2-6ed509972d28@suse.cz> <5c8b2883-962f-431f-b2d3-3632755de3b0@paulmck-laptop> <9967fdfa-e649-456d-a0cb-b4c4bf7f9d68@suse.cz> <6dad6e9f-e0ca-4446-be9c-1be25b2536dd@paulmck-laptop> <4cba4a48-902b-4fb6-895c-c8e6b64e0d5f@suse.cz> Message-ID: On Wed, Jun 19, 2024 at 11:28:13AM +0200, Vlastimil Babka wrote: > On 6/18/24 7:53 PM, Paul E. McKenney wrote: > > On Tue, Jun 18, 2024 at 07:21:42PM +0200, Vlastimil Babka wrote: > >> On 6/18/24 6:48 PM, Paul E. McKenney wrote: > >> > On Tue, Jun 18, 2024 at 11:31:00AM +0200, Uladzislau Rezki wrote: > >> >> > On 6/17/24 8:42 PM, Uladzislau Rezki wrote: > >> >> > >> + > >> >> > >> + s = container_of(work, struct kmem_cache, async_destroy_work); > >> >> > >> + > >> >> > >> + // XXX use the real kmem_cache_free_barrier() or similar thing here > >> >> > > It implies that we need to introduce kfree_rcu_barrier(), a new API, which i > >> >> > > wanted to avoid initially. > >> >> > > >> >> > I wanted to avoid new API or flags for kfree_rcu() users and this would > >> >> > be achieved. The barrier is used internally so I don't consider that an > >> >> > API to avoid.
How difficult is the implementation is another question, > >> >> > depending on how the current batching works. Once (if) we have sheaves > >> >> > proven to work and move kfree_rcu() fully into SLUB, the barrier might > >> >> > also look different and hopefully easier. So maybe it's not worth to > >> >> > invest too much into that barrier and just go for the potentially > >> >> > longer, but easier to implement? > >> >> > > >> >> Right. I agree here. If the cache is not empty, OK, we just defer the > >> >> work, even we can use a big 21 seconds delay, after that we just "warn" > >> >> if it is still not empty and leave it as it is, i.e. emit a warning and > >> >> we are done. > >> >> > >> >> Destroying the cache is not something that must happen right away. > >> > > >> > OK, I have to ask... > >> > > >> > Suppose that the cache is created and destroyed by a module and > >> > init/cleanup time, respectively. Suppose that this module is rmmod'ed > >> > then very quickly insmod'ed. > >> > > >> > Do we need to fail the insmod if the kmem_cache has not yet been fully > >> > cleaned up? > >> > >> We don't have any such link between kmem_cache and module to detect that, so > >> we would have to start tracking that. Probably not worth the trouble. > > > > Fair enough! > > > >> > If not, do we have two versions of the same kmem_cache in > >> > /proc during the overlap time? > >> > >> Hm could happen in /proc/slabinfo but without being harmful other than > >> perhaps confusing someone. We could filter out the caches being destroyed > >> trivially. > > > > Or mark them in /proc/slabinfo? Yet another column, yay!!! Or script > > breakage from flagging the name somehow, for example, trailing "/" > > character. > > Yeah I've been resisting such changes to the layout and this wouldn't be > worth it, apart from changing the name itself but not in a dangerous way > like with "/" :) > > >> Sysfs and debugfs might be more problematic as I suppose directory names > >> would clash. I'll have to check... might be even happening now when we do > >> detect leaked objects and just leave the cache around... thanks for the > >> question. > > > > "It is a service that I provide." ;-) > > > > But yes, we might be living with it already and there might already > > be ways people deal with it. > > So it seems if the sysfs/debugfs directories already exist, they will > silently not be created. Wonder if we have such cases today already because > caches with same name exist. I think we do with the zsmalloc using 32 caches > with same name that we discussed elsewhere just recently. > > Also indeed if the cache has leaked objects and won't be thus destroyed, > these directories indeed stay around, as well as the slabinfo entry, and can > prevent new ones from being created (slabinfo lines with same name are not > prevented). > > But it wouldn't be great to introduce this possibility to happen for the > temporarily delayed removal due to kfree_rcu() and a module re-insert, since > that's a legitimate case and not buggy state due to leaks. > > The debugfs directory we could remove immediately before handing over to the > scheduled workfn, but if it turns out there was a leak and the workfn leaves > the cache around, debugfs dir will be gone and we can't check the > alloc_traces/free_traces files there (but we have the per-object info > including the traces in the dmesg splat). > > The sysfs directory is currently removed only with the whole cache being > destryed due to sysfs/kobject lifetime model. 
I'd love to untangle it for > other reasons too, but haven't investigated it yet. But again it might be > useful for sysfs dir to stay around for inspection, as for the debugfs. > > We could rename the sysfs/debugfs directories before queuing the work? Add > some prefix like GOING_AWAY-$name. If leak is detected and cache stays > forever, another rename to LEAKED-$name. (and same for the slabinfo). But > multiple ones with same name might pile up, so try adding a counter then? > Probably messy to implement, but perhaps the most robust in the end? The > automatic counter could also solve the general case of people using same > name for multiple caches. > > Other ideas? > One question. Maybe it is already late but it is better to ask rather than not. What do you think about having a small discussion about it at LPC 2024 as a topic? It might be that it is already late or the schedule is set by now. Or we fix it by conference time. Just a thought. -- Uladzislau Rezki From nohktwo at gmail.com Fri Jun 21 10:39:11 2024 From: nohktwo at gmail.com (Nohk Two) Date: Fri, 21 Jun 2024 18:39:11 +0800 Subject: How to detect the IP CAM on LAN from WG tunnel ? In-Reply-To: References: <384d1fdd-a32f-4839-bb8b-2761be363b50@gmail.com> Message-ID: <1f7f4177-86b1-4a33-876b-06bf4e4f1cbd@gmail.com> On 2024/6/21 17:18, Mark Lawrence wrote: >> How do you solve this problem ? > > Iterative fact checking, from the lowest levels of the network stack to the highest. > > - Are the devices actually connected where you think they are? > - With the tunnel disconnected, does your phone connect to the camera? I use the wireguard VPN while my phone is using mobile data (4G LTE). With the tunnel disconnected my phone can't connect to the camera, since it scans for it and cannot find it. > - Is your Wireguard tunnel set up properly? > - Can your phone ping the wg0 address with the tunnel active? > - Can your phone ping other .100 devices with the tunnel active? I don't know how to ping from my phone. But the phone, with the wireguard tunnel connected, can visit my LAN website which is in the network 192.168.100.0/24. > - Does your eth0/wg0 machine have IP forwarding enabled? > - sysctl net.ipv4.ip_forward=1 Yes. $ sysctl net.ipv4.ip_forward net.ipv4.ip_forward = 1 > - What does packet tracing show? > - I.e. `ngrep -d wg0 .\* icmp` or the tcpdump equivalent, also against eth0 for the wireguard UDP port. I use `ngrep -d wg0 .\* icmp`, but nothing was dumped. However, while I open my phone's browser to visit my LAN site, it did dump something. > - Does the mobile App actually support remote (routed) cameras or just on the local network? This is the point I made in my original mail: I think my phone and the camera are in different networks. I believe this App is for the LAN network only. For this scenario, are there solutions ? From domi at tomcsanyi.net Fri Jun 21 10:47:30 2024 From: domi at tomcsanyi.net (Tomcsanyi, Domonkos) Date: Fri, 21 Jun 2024 12:47:30 +0200 Subject: How to detect the IP CAM on LAN from WG tunnel ? In-Reply-To: <1f7f4177-86b1-4a33-876b-06bf4e4f1cbd@gmail.com> References: <1f7f4177-86b1-4a33-876b-06bf4e4f1cbd@gmail.com> Message-ID: <4B7285ED-C08C-4AAB-827C-AF511D606D03@tomcsanyi.net> In case the camera app uses something below IP, e.g. ARP, for discovery, you don't have a chance, since it will never cross the wireguard tunnel. You should try to capture somehow what the app is doing, and then work from that.
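One concrete way to do that capture from the camera's LAN side is a few lines of libpcap watching for ARP/broadcast/multicast frames while the app scans. A minimal sketch; the interface name "eth0" and the filter are assumptions, build with -lpcap and run as root:

#include <pcap/pcap.h>
#include <stdio.h>

static void seen(u_char *user, const struct pcap_pkthdr *h, const u_char *bytes)
{
	(void)user; (void)bytes;
	printf("discovery-candidate frame, %u bytes\n", h->len);
}

int main(void)
{
	char err[PCAP_ERRBUF_SIZE];
	struct bpf_program prog;
	pcap_t *p = pcap_open_live("eth0", 128, 1, 1000, err);

	if (!p) { fprintf(stderr, "%s\n", err); return 1; }
	/* L2 discovery (ARP, broadcast, multicast) never crosses a routed tunnel */
	if (pcap_compile(p, &prog, "arp or broadcast or multicast", 1,
			 PCAP_NETMASK_UNKNOWN) == 0)
		pcap_setfilter(p, &prog);
	pcap_loop(p, 20, seen, NULL);	/* report the first 20 matching frames */
	pcap_close(p);
	return 0;
}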
Either they do not accept the Wireguard routes or they are using non-IP discovery that does not get routed through wg. Good luck! Domi > On 21.06.2024 at 12:42, Nohk Two wrote: > > On 2024/6/21 17:18, Mark Lawrence wrote: >>> How do you solve this problem ? >> Iterative fact checking, from the lowest levels of the network stack to the highest. >> - Are the devices actually connected where you think they are? >> - With the tunnel disconnected, does your phone connect to the camera? > I use the wireguard VPN while my phone is using mobile data (4G LTE). With the tunnel disconnected my phone can't connect to the camera, since it scans for it and cannot find it. > >> - Is your Wireguard tunnel set up properly? >> - Can your phone ping the wg0 address with the tunnel active? >> - Can your phone ping other .100 devices with the tunnel active? > I don't know how to ping from my phone. But the phone, with the wireguard tunnel connected, can visit my LAN website which is in the network 192.168.100.0/24. > >> - Does your eth0/wg0 machine have IP forwarding enabled? >> - sysctl net.ipv4.ip_forward=1 > Yes. > $ sysctl net.ipv4.ip_forward > net.ipv4.ip_forward = 1 > >> - What does packet tracing show? >> - I.e. `ngrep -d wg0 .\* icmp` or the tcpdump equivalent, also against eth0 for the wireguard UDP port. > I use `ngrep -d wg0 .\* icmp`, but nothing was dumped. However, while I open my phone's browser to visit my LAN site, it did dump something. > >> - Does the mobile App actually support remote (routed) cameras or just on the local network? > This is the point I made in my original mail: I think my phone and the camera are in different networks. I believe this App is for the LAN network only. > > For this scenario, are there solutions ? From nico.schottelius at ungleich.ch Fri Jun 21 11:13:27 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Fri, 21 Jun 2024 13:13:27 +0200 Subject: Wireguard uses incorrect interface - routing issue Message-ID: <878qyyim5k.fsf@ungleich.ch> Hello again, I'm sorry to flood the mailing list with wireguard bugs, but it seems there is yet another routing bug in wireguard - happy to be wrong, but here are my findings: a) system has source-based routing enabled via ip rule: [11:07] server141.place10:~# ip rule ls 0: from all lookup local 32765: from 192.168.1.0/24 lookup 42 32766: from all lookup main 32767: from all lookup default [11:07] server141.place10:~# ip route sh table 42 194.5.220.0/24 via 192.168.1.254 dev eth1 proto bird metric 32 194.187.90.23 via 192.168.1.254 dev eth1 proto bird metric 32 212.103.65.231 via 192.168.1.254 dev eth1 proto bird metric 32 [11:08] server141.place10:~# This should ensure that packets towards 194.187.90.23 travel via eth1. b) tcpdump for verification Using "tcpdump -ni any port 4000" I observe: 11:10:22.445638 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 11:10:27.447026 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 11:10:32.448329 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 11:10:37.449719 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 c) Route in main table There is indeed a route in the main routing table that matches, too: [11:08] server141.place10:~# ip r get 194.187.90.23 194.187.90.23 via 10.5.2.123 dev eth0 src 192.168.1.149 uid 0 cache d) ip rule not working (?)
So from what I can observe, it seems that ip rule does not work together with wireguard / wireguard routing takes the route from the main fib instead of from the separate table. I am not sure if this is related at all to the IP address binding bug, but it appears in a similar context in our tests. BR, Nico -- Sustainable and modern Infrastructures by ungleich.ch -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 873 bytes Desc: not available URL: From nico.schottelius at ungleich.ch Fri Jun 21 11:24:47 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Fri, 21 Jun 2024 13:24:47 +0200 Subject: Wireguard uses incorrect interface - routing issue In-Reply-To: <878qyyim5k.fsf@ungleich.ch> (Nico Schottelius's message of "Fri, 21 Jun 2024 13:13:27 +0200") References: <878qyyim5k.fsf@ungleich.ch> Message-ID: <874j9milmo.fsf@ungleich.ch> p.s.: the route lookup looks correct on the machine, when selecting the source IP: [11:15] server141.place10:~# ip r get 194.187.90.23 194.187.90.23 via inet6 fe80::3eec:efff:fecb:d81a dev eth0 src 192.168.1.149 uid 0 cache [11:16] server141.place10:~# ip r get 194.187.90.23 from 192.168.1.149 194.187.90.23 from 192.168.1.149 via 192.168.1.254 dev eth1 table 42 uid 0 cache wireguard still uses the wrong interface: 11:20:13.115154 eth0 Out IP 192.168.1.149.60031 > 194.187.90.23.4000: UDP, length 148 -- Sustainable and modern Infrastructures by ungleich.ch From dxld at darkboxed.org Fri Jun 21 12:29:26 2024 From: dxld at darkboxed.org (Daniel =?utf-8?Q?Gr=C3=B6ber?=) Date: Fri, 21 Jun 2024 14:29:26 +0200 Subject: Wireguard uses incorrect interface - routing issue In-Reply-To: <874j9milmo.fsf@ungleich.ch> References: <878qyyim5k.fsf@ungleich.ch> <874j9milmo.fsf@ungleich.ch> Message-ID: <20240621122926.2xzt7ulno5oczqcv@House.clients.dxld.at> On Fri, Jun 21, 2024 at 01:24:47PM +0200, Nico Schottelius wrote: > > p.s.: the route lookup looks correct on the machine, when selecting the > source IP: > > [11:15] server141.place10:~# ip r get 194.187.90.23 > 194.187.90.23 via inet6 fe80::3eec:efff:fecb:d81a dev eth0 src 192.168.1.149 uid 0 > cache > [11:16] server141.place10:~# ip r get 194.187.90.23 from 192.168.1.149 > 194.187.90.23 from 192.168.1.149 via 192.168.1.254 dev eth1 table 42 uid 0 > cache > > wireguard still uses the wrong interface: > > 11:20:13.115154 eth0 Out IP 192.168.1.149.60031 > 194.187.90.23.4000: UDP, length 148 I haven't looked at the details yet but this smells like the same route caching issue I found a while ago: https://lists.zx2c4.com/pipermail/wireguard/2023-July/008111.html Does up/down'ing the interface make the problem go away? IIRC that will re-initialize the udp socket and thus clear the route cache. FYI Nico: It may be time to escalate these bugs to the network subsystem maintainers on netdev at vger.kernel.org since Jason is not reading this list anymore AFAICT. get_maintainer.pl spits out this list of emails to send To: Jason A. Donenfeld" , "David S. Miller" , Eric Dumazet , Jakub Kicinski , Paolo Abeni , wireguard at lists.zx2c4.com, netdev at vger.kernel.org, linux-kernel at vger.kernel.org Do add me to CC as well.
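For the report it may also help to point at the transmit-side route lookup. A simplified sketch of how the IPv4 endpoint route is chosen, paraphrased from drivers/net/wireguard/socket.c (dst caching and error handling omitted; treat the details as approximate):

	struct flowi4 fl = {
		.saddr = endpoint->src4.s_addr,		/* often zero unless a source stuck */
		.daddr = endpoint->addr4.sin_addr.s_addr,
		.fl4_dport = endpoint->addr4.sin_port,
		.flowi4_mark = wg->fwmark,		/* why fwmark-based rules can match */
		.flowi4_proto = IPPROTO_UDP,
	};
	rt = ip_route_output_flow(sock_net(sock), &fl, sock);

If saddr is still zero when the FIB is consulted, a rule like "from 192.168.1.0/24 lookup 42" has nothing to match and the main table wins, while a fwmark-based rule matches regardless; that is a hypothesis consistent with the eth0 transmits above, not a confirmed diagnosis.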
Before sending I'd recommend working out an ip-netns based reproducer script -- makes it harder to ignore the report as "ugh, too much work" ;) Let me know if you need help with that, --Daniel From diyaa at diyaa.ca Fri Jun 21 14:42:02 2024 From: diyaa at diyaa.ca (Diyaa Alkanakre) Date: Fri, 21 Jun 2024 16:42:02 +0200 (CEST) Subject: Wireguard uses incorrect interface - routing issue In-Reply-To: <878qyyim5k.fsf@ungleich.ch> References: <878qyyim5k.fsf@ungleich.ch> Message-ID: The better approach would be to exclude the IPs from your WireGuard AllowedIPs. I always exclude IPs if I can before doing policy based routing. https://www.procustodibus.com/blog/2021/03/wireguard-allowedips-calculator/ Jun 21, 2024, 5:15 AM by nico.schottelius at ungleich.ch: > > Hello again, > > I'm sorry to flood the mailing list with wireguard bugs, but it seems > there is yet another routing bug in wireguard - happy to be wrong, but > here are my findings: > > a) system has source based routing on via ip rule: > > [11:07] server141.place10:~# ip rule ls > 0: from all lookup local > 32765: from 192.168.1.0/24 lookup 42 > 32766: from all lookup main > 32767: from all lookup default > [11:07] server141.place10:~# ip route sh table 42 > 194.5.220.0/24 via 192.168.1.254 dev eth1 proto bird metric 32 > 194.187.90.23 via 192.168.1.254 dev eth1 proto bird metric 32 > 212.103.65.231 via 192.168.1.254 dev eth1 proto bird metric 32 > [11:08] server141.place10:~# > > This should ensure that packets towards 194.187.90.23 travel via eth1. > > b) tcpdump for verification > > Using "tcpdump -ni any port 4000" I observe: > > 11:10:22.445638 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 > 11:10:27.447026 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 > 11:10:32.448329 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 > 11:10:37.449719 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 > > c) Route in main table > > There is indeed a route in the main routing table that matches, too: > > [11:08] server141.place10:~# ip r get 194.187.90.23 > 194.187.90.23 via 10.5.2.123 dev eth0 src 192.168.1.149 uid 0 > cache > > d) ip rule not working (?) > > So from what I can observe it is that ip rule does not work together > with wireguard / wireguard routing takes the route from main fib instead > of from the separate table. > > I am not sure if this is related at all to the IP address binding bug, > but it appears in a similar context from our tests. > > BR, > > Nico > > -- > Sustainable and modern Infrastructures by ungleich.ch > From dxld at darkboxed.org Fri Jun 21 15:18:53 2024 From: dxld at darkboxed.org (Daniel =?utf-8?Q?Gr=C3=B6ber?=) Date: Fri, 21 Jun 2024 17:18:53 +0200 Subject: Wireguard uses incorrect interface - routing issue In-Reply-To: <20240621155439.6cb5abb9@ithnet.com> Message-ID: <20240621151853.s7nzoyanrn4sr6gf@House.clients.dxld.at> Hi, On Fri, Jun 21, 2024 at 03:54:39PM +0200, Stephan von Krawczynski wrote: > ... and in case you do find someone interested at all there is still the > problem of no signaling to anyone when a client connects. > I hardly can remember the decade when all this was implemented in cipe. Yeah. Can be hard to get attention on netdev, but I've been advised that when the maintainance of a (sub)subsystem is in question that is an issue they'll take notice of. 
So be sure to lament the fact that Jason hasn't been responding in at least a year on this ML ;) IIRC we have a patch for netlink notifications on handshakes flying around somewhere tho. Just needs some more work. On Fri, Jun 21, 2024 at 04:42:02PM +0200, Diyaa Alkanakre wrote: > The better approach would be to exclude the IPs from your WireGuard > AllowedIPs. I always exclude IPs if I can before doing policy based > routing. > > https://www.procustodibus.com/blog/2021/03/wireguard-allowedips-calculator/ Interesting approach, thanks for the pointer :) --Daniel From nico.schottelius at ungleich.ch Fri Jun 21 15:38:56 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Fri, 21 Jun 2024 17:38:56 +0200 Subject: Wireguard uses incorrect interface - routing issue In-Reply-To: (Diyaa Alkanakre's message of "Fri, 21 Jun 2024 16:42:02 +0200 (CEST)") References: <878qyyim5k.fsf@ungleich.ch> Message-ID: <87jzii8fvz.fsf@ungleich.ch> Diyaa, this is about the *outside* tunnel IP address that wireguard uses to establish connection, not about inside routing. BR, Nico Diyaa Alkanakre writes: > The better approach would be to exclude the IPs from your WireGuard AllowedIPs. I always exclude IPs if I can before doing policy based routing. > > https://www.procustodibus.com/blog/2021/03/wireguard-allowedips-calculator/ > > > Jun 21, 2024, 5:15 AM by nico.schottelius at ungleich.ch: > >> >> Hello again, >> >> I'm sorry to flood the mailing list with wireguard bugs, but it seems >> there is yet another routing bug in wireguard - happy to be wrong, but >> here are my findings: >> >> a) system has source based routing on via ip rule: >> >> [11:07] server141.place10:~# ip rule ls >> 0: from all lookup local >> 32765: from 192.168.1.0/24 lookup 42 >> 32766: from all lookup main >> 32767: from all lookup default >> [11:07] server141.place10:~# ip route sh table 42 >> 194.5.220.0/24 via 192.168.1.254 dev eth1 proto bird metric 32 >> 194.187.90.23 via 192.168.1.254 dev eth1 proto bird metric 32 >> 212.103.65.231 via 192.168.1.254 dev eth1 proto bird metric 32 >> [11:08] server141.place10:~# >> >> This should ensure that packets towards 194.187.90.23 travel via eth1. >> >> b) tcpdump for verification >> >> Using "tcpdump -ni any port 4000" I observe: >> >> 11:10:22.445638 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 >> 11:10:27.447026 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 >> 11:10:32.448329 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 >> 11:10:37.449719 eth0 Out IP 192.168.1.149.58591 > 194.187.90.23.4000: UDP, length 148 >> >> c) Route in main table >> >> There is indeed a route in the main routing table that matches, too: >> >> [11:08] server141.place10:~# ip r get 194.187.90.23 >> 194.187.90.23 via 10.5.2.123 dev eth0 src 192.168.1.149 uid 0 >> cache >> >> d) ip rule not working (?) >> >> So from what I can observe it is that ip rule does not work together >> with wireguard / wireguard routing takes the route from main fib instead >> of from the separate table. >> >> I am not sure if this is related at all to the IP address binding bug, >> but it appears in a similar context from our tests. >> >> BR, >> >> Nico >> >> -- >> Sustainable and modern Infrastructures by ungleich.ch >> -------------- next part -------------- -- Sustainable and modern Infrastructures by ungleich.ch -------------- next part -------------- A non-text attachment was scrubbed... 
Name: signature.asc Type: application/pgp-signature Size: 873 bytes Desc: not available URL: From nico.schottelius at ungleich.ch Sat Jun 22 09:22:28 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Sat, 22 Jun 2024 11:22:28 +0200 Subject: Wireguard uses incorrect interface - routing issue In-Reply-To: <20240621122926.2xzt7ulno5oczqcv@House.clients.dxld.at> ("Daniel =?utf-8?Q?Gr=C3=B6ber=22's?= message of "Fri, 21 Jun 2024 14:29:26 +0200") References: <878qyyim5k.fsf@ungleich.ch> <874j9milmo.fsf@ungleich.ch> <20240621122926.2xzt7ulno5oczqcv@House.clients.dxld.at> Message-ID: <87zfrdgwmj.fsf@ungleich.ch> Good morning Daniel, Daniel Gröber writes: >> wireguard still uses the wrong interface: >> >> 11:20:13.115154 eth0 Out IP 192.168.1.149.60031 > 194.187.90.23.4000: UDP, length 148 > > I haven't looked at the details yet but this smells like the same route > caching issue I found a while ago: > https://lists.zx2c4.com/pipermail/wireguard/2023-July/008111.html > > Does up/down'ing the interface make the problem go away? IIRC that will > re-initialize the udp socket and thus clear the route cache. Up & down does *not* fix it, however a *reboot* did. I've the feeling that this is a race condition together with bird running on the machine. I suspect the following is happening: - machine starts - ip rule is used to move traffic into table 42 (part of the container startup) - table 42 is populated by bird with static routes (part of bird startup) -- at this stage wireguard works - bird establishes iBGP sessions and receives alternate routes for the target in the main routing table - wireguard restart is triggered and from that moment on wireguard uses the route from the main table -- at this stage wireguard is broken/takes the route from the main table This is so far a theory, I'll need to verify that, maybe a simple test script as you suggested makes sense. > FYI Nico: It may be time to escalate these bugs to the network subsystem > maintainers on netdev at vger.kernel.org since Jason is not reading this list > anymore AFAICT. That is a very good point and I shall do so next week! > get_maintainer.pl spits out this list of emails to send To: > > Jason A. Donenfeld" , > "David S. Miller" , > Eric Dumazet , > Jakub Kicinski , > Paolo Abeni , > wireguard at lists.zx2c4.com, > netdev at vger.kernel.org, > linux-kernel at vger.kernel.org Thanks for looking it up! > Do add me to CC as well. Before sending I'd recommend working out an > ip-netns based reproducer script -- makes it harder to ignore the report as > "ugh, too much work" ;) Understood and ... > Let me know if you need help with that, ... would certainly appreciate that. You are on matrix, too, aren't you? I'm @nico:ungleich.ch, might be easier for coordination. Best regards from sunny Glarus, Nico -------------- next part -------------- -- Sustainable and modern Infrastructures by ungleich.ch -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 873 bytes Desc: not available URL: From alarsen at maidenheadbridge.com Mon Jun 24 09:36:06 2024 From: alarsen at maidenheadbridge.com (Adrian Larsen) Date: Mon, 24 Jun 2024 10:36:06 +0100 Subject: Fwd: Wireguard address binding - how to fix?
In-Reply-To: <740ee793-0ed2-4cf6-ba4a-07268b46b761@maidenheadbridge.com> References: <740ee793-0ed2-4cf6-ba4a-07268b46b761@maidenheadbridge.com> Message-ID: <43aac110-8699-41b3-bad8-7a38bf984b45@maidenheadbridge.com> Hi Friends, You can achieve address binding on a Linux box with a mix of marking, ip rules, ip route and Source NAT. 1) On WG interface, add "FwMark = 0x34" (the value 0x34 is an example, you can put any value here) 2) Create IP Rule "from all fwmark 0x34 lookup rt_wg0_out" -> this will force the outgoing packet to use the route table "rt_wg0_out" 3) On the route table "rt_wg0_out" create the default or specific route to force the packet market with 0x34 to leave using the interface where your desire "IP address" resides. 4) Create a POSTROUTING -> SNAT forcing mark 0x34 via the desired "IP address". This will bind your "IP address". Done! The packet with mark 0x34 will be routed via the correct interface using the source IP you want. I hope this helps. Best regards, Adrian Larsen Maidenhead Bridge Cloud Security Connectors for SSE vendors. m: +44 7487640352 e:alarsen at maidenheadbridge.com On 09/06/2024 16:39, Nico Schottelius wrote: > Jason, > > may I shortly ask what your opinion is on the patch and whether there is > a way forward to make wireguard usable on systems with multiple IP > addresses? > > Best regards, > > Nico > > Nico Schottelius writes: > >> d tbsky writes: >>> I remembered how exciting when I tested wireguard at 2017. until I >>> asked muti-home question in the list. >>> wiregurad is beautiful,elegant,fast but not easy to get along with. >>> openvpn is not so amazing but it can get the job done. >> Nice summary, hits the nail quite well. >> >> Jason, do you mind having a look at the submitted patches for IP address >> binding and comment on them? Or alternatively can you give green light >> for generally moving forward so that a direct inclusion in the Linux >> kernel would be accepted? >> >> Best regards, >> >> Nico >> From nico.schottelius at ungleich.ch Thu Jun 27 11:33:18 2024 From: nico.schottelius at ungleich.ch (Nico Schottelius) Date: Thu, 27 Jun 2024 13:33:18 +0200 Subject: Fwd: Wireguard address binding - how to fix? In-Reply-To: <43aac110-8699-41b3-bad8-7a38bf984b45@maidenheadbridge.com> (Adrian Larsen's message of "Mon, 24 Jun 2024 10:36:06 +0100") References: <740ee793-0ed2-4cf6-ba4a-07268b46b761@maidenheadbridge.com> <43aac110-8699-41b3-bad8-7a38bf984b45@maidenheadbridge.com> Message-ID: <87ed8ihb7l.fsf@ungleich.ch> Hello Adrian, I tried 1,2 and 3 and observed that wireguard seems to be taking the correct routing table when using fwmark: -------------------------------------------------------------------------------- # cat /etc/wireguard/or3ge.conf [Interface] PrivateKey = ... 
Address = 2a0a:5480:5:2::2/64 Table = off FwMark = 0x42 [Peer] PublicKey = 3WNj2YuTTm+5wpsAOauRQ3bEMv/WXcKMDZXbJPB8fx0= AllowedIPs = ::/0, 0.0.0.0/0 Endpoint = 194.5.220.43:5001 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- [09:32] server142.place10:~# ip r sh table 42 194.5.220.0/24 via 192.168.1.254 dev eth1 proto bird metric 32 194.187.90.23 via 192.168.1.254 dev eth1 proto bird metric 32 212.103.65.231 via 192.168.1.254 dev eth1 proto bird metric 32 [09:32] server142.place10:~# ip rule ls 0: from all lookup local 32765: from all fwmark 0x42 lookup 42 32766: from all lookup main 32767: from all lookup default -------------------------------------------------------------------------------- So, long story short: one cannot match on the source IP address with wireguard, potentially because it does not do address binding by default. But I have to say thanks, at least one problem solved for the moment! Best regards, Nico Adrian Larsen writes: > Hi Friends, > > You can achieve address binding on a Linux box with a mix of marking, > ip rules, ip route and Source NAT. > > 1) On WG interface, add "FwMark = 0x34" (the value 0x34 is an example, > you can put any value here) > > 2) Create IP Rule "from all fwmark 0x34 lookup rt_wg0_out" -> this > will force the outgoing packet to use the route table "rt_wg0_out" > > 3) On the route table "rt_wg0_out" create the default or specific > route to force the packet market with 0x34 to leave using the > interface where your desire "IP address" resides. > > 4) Create a POSTROUTING -> SNAT forcing mark 0x34 via the desired "IP > address". This will bind your "IP address". > > Done! The packet with mark 0x34 will be routed via the correct > interface using the source IP you want. > > I hope this helps. > > Best regards, > > Adrian Larsen > Maidenhead Bridge > Cloud Security Connectors for SSE vendors. > m: +44 7487640352 > e:alarsen at maidenheadbridge.com > > On 09/06/2024 16:39, Nico Schottelius wrote: >> Jason, >> >> may I shortly ask what your opinion is on the patch and whether there is >> a way forward to make wireguard usable on systems with multiple IP >> addresses? >> >> Best regards, >> >> Nico >> >> Nico Schottelius writes: >> >>> d tbsky writes: >>>> I remembered how exciting when I tested wireguard at 2017. until I >>>> asked muti-home question in the list. >>>> wiregurad is beautiful,elegant,fast but not easy to get along with. >>>> openvpn is not so amazing but it can get the job done. >>> Nice summary, hits the nail quite well. >>> >>> Jason, do you mind having a look at the submitted patches for IP address >>> binding and comment on them? Or alternatively can you give green light >>> for generally moving forward so that a direct inclusion in the Linux >>> kernel would be accepted? >>> >>> Best regards, >>> >>> Nico >>> -------------- next part -------------- -- Sustainable and modern Infrastructures by ungleich.ch -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 873 bytes Desc: not available URL: