From doug.hs at proton.me Sat Aug 9 12:08:17 2025 From: doug.hs at proton.me (Douglas Silva) Date: Sat, 09 Aug 2025 12:08:17 +0000 Subject: Android app: export configuration as a password-protected zip file Message-ID: <3uPzijiGmCFwzmGThO2Z2EhiLTEw1w-Jqx3eC5wTpEjGUfbAivno4-eIG9UV4Xqc9tUFSdKkobsWPYRBke-QuqkqGwylaPy4-ALihwtoe7Y=@proton.me> Greetings. This is a feature request for the Android app. I'd like to suggest two alternatives to address the issue of exporting an unencrypted zip-file containing all your private keys into the Downloads folder. 1. Export a password-protected zip-file instead, allowing us to choose a password in the app settings. This is what Syncthing does nowadays. 2. Optionally let Android Backup pick up our config (disabled by default). Google Drive isn't the only back-end available. Ever heard of Seedvault? It's the default on systems like LineageOS or CalyxOS. Seedvault does encrypted backups to any location; even offline, to a flash stick. Aegis Authenticator uses it, as long as the backup is advertised as encrypted. See [1], [2] and [3] and [4]. [1] https://developer.android.com/identity/data/autobackup#EnablingAutoBackup [2] https://developer.android.com/identity/data/autobackup#define-device-conditions [3] https://github.com/seedvault-app/seedvault/wiki/FAQ#why-do-some-apps-not-allow-to-get-backed-up [4] https://github.com/beemdevelopment/Aegis/blob/master/app/src/main/AndroidManifest.xml#L22 For option 1 (password-protected zip), I'd also like to suggest the addition of (optional) automatic exports to a chosen folder, to facilitate automatic backups with an external tool such as Syncthing. The messaging app Signal does something like this to export your messages (encrypted) and keep the last five versions of it. Thank you. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: publickey - doug.hs at proton.me - 0xB577E0C1.asc Type: application/pgp-keys Size: 645 bytes Desc: not available URL: From yury.norov at gmail.com Sat Aug 9 13:24:11 2025 From: yury.norov at gmail.com (Yury Norov) Date: Sat, 9 Aug 2025 09:24:11 -0400 Subject: [PATCH v2 0/2] rework wg_cpumask_next_online() In-Reply-To: <20250719224444.411074-1-yury.norov@gmail.com> References: <20250719224444.411074-1-yury.norov@gmail.com> Message-ID: Ping? On Sat, Jul 19, 2025 at 06:44:41PM -0400, Yury Norov wrote: > From: Yury Norov (NVIDIA) > > Simplify the function and fix possible out-of-boundary condition. > > v2: > - fix possible >= nr_cpu_ids return (Jason). > > Yury Norov (NVIDIA) (2): > wireguard: queueing: simplify wg_cpumask_next_online() > wireguard: queueing: always return valid online CPU in wg_cpumask_choose_online() > > drivers/net/wireguard/queueing.h | 13 ++++--------- > 1 file changed, 4 insertions(+), 9 deletions(-) > > -- > 2.43.0 From david at redhat.com Thu Aug 21 20:06:32 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:32 +0200 Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-7-david@redhat.com> Let's reject them early, which in turn makes folio_alloc_gigantic() reject them properly. To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER and calculate MAX_FOLIO_NR_PAGES based on that. 
Signed-off-by: David Hildenbrand --- include/linux/mm.h | 6 ++++-- mm/page_alloc.c | 5 ++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 00c8a54127d37..77737cbf2216a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) +#define MAX_FOLIO_ORDER PUD_ORDER #else -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER #endif +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) + /* * compound_nr() returns the number of pages in this potentially compound * page. compound_nr() can be called on a tail page, and is defined to diff --git a/mm/page_alloc.c b/mm/page_alloc.c index ca9e6b9633f79..1e6ae4c395b30 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) int alloc_contig_range_noprof(unsigned long start, unsigned long end, acr_flags_t alloc_flags, gfp_t gfp_mask) { + const unsigned int order = ilog2(end - start); unsigned long outer_start, outer_end; int ret = 0; @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, PB_ISOLATE_MODE_CMA_ALLOC : PB_ISOLATE_MODE_OTHER; + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) + return -EINVAL; + gfp_mask = current_gfp_context(gfp_mask); if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) return -EINVAL; @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, free_contig_range(end, outer_end - end); } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { struct page *head = pfn_to_page(start); - int order = ilog2(end - start); check_new_pages(head, order); 
prep_new_page(head, order, gfp_mask, 0); -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:33 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:33 +0200 Subject: [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-8-david@redhat.com> Let's reject unreasonable folio sizes early, where we can still fail. We'll add sanity checks to prep_compound_head()/prep_compound_page() next. Is there a way to configure a system such that unreasonable folio sizes would be possible? It would already be rather questionable. If so, we'd probably want to bail out earlier, where we can avoid a WARN and just report a proper error message that indicates what went wrong. Signed-off-by: David Hildenbrand --- mm/memremap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/memremap.c b/mm/memremap.c index b0ce0d8254bd8..a2d4bb88f64b6 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) return ERR_PTR(-EINVAL); + if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER, + "requested folio size unsupported\n")) + return ERR_PTR(-EINVAL); switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:34 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:34 +0200 Subject: [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-9-david@redhat.com> Let's check that no hstate that corresponds to an unreasonable folio size is registered by an architecture.
If we were to succeed registering, we could later try allocating an unsupported gigantic folio size. Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have to use a BUILD_BUG_ON_INVALID() to make it compile. No existing kernel configuration should be able to trigger this check: either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured, or gigantic folios will not exceed a memory section (the case with plain SPARSEMEM). Signed-off-by: David Hildenbrand --- mm/hugetlb.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 514fab5a20ef8..d12a9d5146af4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void) BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE < __NR_HPAGEFLAGS); + BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER); if (!hugepages_supported()) { if (hugetlb_max_hstate || default_hstate_max_huge_pages) @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order) } BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); BUG_ON(order < order_base_2(__NR_USED_SUBPAGE)); + WARN_ON(order > MAX_FOLIO_ORDER); h = &hstates[hugetlb_max_hstate++]; __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); h->order = order; -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:35 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:35 +0200 Subject: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-10-david@redhat.com> Grepping for "prep_compound_page" leaves one clueless as to how devdax gets its compound pages initialized. Let's add a comment that makes this open-coded prep_compound_page() initialization easier to find.
Further, let's be less smart about the ordering of initialization and just perform the prep_compound_head() call after all tail pages have been initialized: just like prep_compound_page() does. No need for a lengthy comment then: again, just like prep_compound_page(). Note that prep_compound_head() already initializes fields in page[2] that subsequent tail page initialization will overwrite: _deferred_list, and on 32bit _entire_mapcount and _pincount. Very likely 32bit does not apply, and likely nobody ever ends up testing whether the _deferred_list is empty. So it shouldn't need a fix at this point, but it is certainly something to clean up. Signed-off-by: David Hildenbrand --- mm/mm_init.c | 13 +++++-------- 1 file changed, 5 insertions(+), 8 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index 5c21b3af216b2..708466c5b2cc9 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head, unsigned long pfn, end_pfn = head_pfn + nr_pages; unsigned int order = pgmap->vmemmap_shift; + /* + * This is an open-coded prep_compound_page() whereby we avoid + * walking pages twice by initializing them in the same go. + */ __SetPageHead(head); for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { struct page *page = pfn_to_page(pfn); @@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head, __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); prep_compound_tail(head, pfn - head_pfn); set_page_count(page, 0); - - /* - * The first tail page stores important compound page info. - * Call prep_compound_head() after the first tail page has - * been initialized, to not have the data overwritten.
- */ - if (pfn == head_pfn + 1) - prep_compound_head(head, order); } + prep_compound_head(head, order); } void __ref memmap_init_zone_device(struct zone *zone, -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:36 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:36 +0200 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-11-david@redhat.com> All pages were already initialized and set to PageReserved() with a refcount of 1 by MM init code. In fact, by using __init_single_page(), we will be setting the refcount to 1 just to freeze it again immediately afterwards. So drop the __init_single_page() and use __ClearPageReserved() instead. Adjust the comments to highlight that we are dealing with an open-coded prep_compound_page() variant. Further, as we can now safely iterate over all pages in a folio, let's avoid the page-pfn dance and just iterate the pages directly. Note that the current code was likely problematic, but we never ran into it: prep_compound_tail() would have been called with an offset that might exceed a memory section, and prep_compound_tail() would have simply added that offset to the page pointer -- which would not have done the right thing on sparsemem without vmemmap. 
Signed-off-by: David Hildenbrand --- mm/hugetlb.c | 21 ++++++++++----------- 1 file changed, 10 insertions(+), 11 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index d12a9d5146af4..ae82a845b14ad 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, unsigned long start_page_number, unsigned long end_page_number) { - enum zone_type zone = zone_idx(folio_zone(folio)); - int nid = folio_nid(folio); - unsigned long head_pfn = folio_pfn(folio); - unsigned long pfn, end_pfn = head_pfn + end_page_number; + struct page *head_page = folio_page(folio, 0); + struct page *page = folio_page(folio, start_page_number); + unsigned long i; int ret; - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { - struct page *page = pfn_to_page(pfn); - - __init_single_page(page, pfn, zone, nid); - prep_compound_tail((struct page *)folio, pfn - head_pfn); + for (i = start_page_number; i < end_page_number; i++, page++) { + __ClearPageReserved(page); + prep_compound_tail(head_page, i); ret = page_ref_freeze(page, 1); VM_BUG_ON(!ret); } @@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, { int ret; - /* Prepare folio head */ + /* + * This is an open-coded prep_compound_page() whereby we avoid + * walking pages twice by preparing+freezing them in the same go. 
+ */ __folio_clear_reserved(folio); __folio_set_head(folio); ret = folio_ref_freeze(folio, 1); VM_BUG_ON(!ret); - /* Initialize the necessary tail struct pages */ hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); prep_compound_head((struct page *)folio, huge_page_order(h)); } -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:37 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:37 +0200 Subject: [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-12-david@redhat.com> Let's sanity-check in folio_set_order() whether we would be trying to create a folio with an order that would make it exceed MAX_FOLIO_ORDER. This will enable the check whenever a folio/compound page is initialized through prep_compound_head() / prep_compound_page(). Signed-off-by: David Hildenbrand --- mm/internal.h | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/internal.h b/mm/internal.h index 45b725c3dc030..946ce97036d67 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) { if (WARN_ON_ONCE(!order || !folio_test_large(folio))) return; + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER); folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; #ifdef NR_PAGES_IN_LARGE_FOLIO -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:38 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:38 +0200 Subject: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-13-david@redhat.com> Let's limit the maximum folio size in problematic kernel configs where the memmap is allocated per memory section (SPARSEMEM without
SPARSEMEM_VMEMMAP) to a single memory section. Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE but not SPARSEMEM_VMEMMAP: sh. Fortunately, the biggest hugetlb size sh supports is 64 MiB (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB (SECTION_SIZE_BITS == 26), so its use case is not degraded. As folios and memory sections are naturally aligned to their power-of-two size in memory, a single folio can consequently no longer span multiple memory sections on these problematic kernel configs. nth_page() is no longer required when operating within a single compound page / folio. Signed-off-by: David Hildenbrand --- include/linux/mm.h | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 77737cbf2216a..48a985e17ef4e 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) return folio_large_nr_pages(folio); } -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE -#define MAX_FOLIO_ORDER PUD_ORDER -#else +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) +/* + * We don't expect any folios that exceed buddy sizes (and consequently + * memory sections). + */ #define MAX_FOLIO_ORDER MAX_PAGE_ORDER +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +/* + * Only pages within a single memory section are guaranteed to be + * contiguous. By limiting folios to a single memory section, all folio + * pages are guaranteed to be contiguous. + */ +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT +#else +/* + * There is no real limit on the folio size. We limit them to the maximum we + * currently expect.
+ */ +#define MAX_FOLIO_ORDER PUD_ORDER #endif #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:39 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:39 +0200 Subject: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-14-david@redhat.com> Now that a single folio/compound page can no longer span memory sections in problematic kernel configurations, we can stop using nth_page(). While at it, turn both macros into static inline functions and add kernel doc for folio_page_idx(). Signed-off-by: David Hildenbrand --- include/linux/mm.h | 16 ++++++++++++++-- include/linux/page-flags.h | 5 ++++- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 48a985e17ef4e..ef360b72cb05c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) #else #define nth_page(page,n) ((page) + (n)) -#define folio_page_idx(folio, p) ((p) - &(folio)->page) #endif /* to align the pointer to the (next) page boundary */ @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) +/** + * folio_page_idx - Return the number of a page in a folio. + * @folio: The folio. + * @page: The folio page. + * + * This function expects that the page is actually part of the folio. + * The returned number is relative to the start of the folio. 
+ */ +static inline unsigned long folio_page_idx(const struct folio *folio, + const struct page *page) +{ + return page - &folio->page; +} + static inline struct folio *lru_to_folio(struct list_head *head) { return list_entry((head)->prev, struct folio, lru); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index d53a86e68c89b..080ad10c0defc 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) * check that the page number lies within @folio; the caller is presumed * to have a reference to the page. */ -#define folio_page(folio, n) nth_page(&(folio)->page, n) +static inline struct page *folio_page(struct folio *folio, unsigned long nr) +{ + return &folio->page + nr; +} static __always_inline int PageTail(const struct page *page) { -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:40 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:40 +0200 Subject: [PATCH RFC 14/35] mm/percpu-km: drop nth_page() usage within single allocation In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-15-david@redhat.com> We're allocating a higher-order page from the buddy. For these pages (that are guaranteed to not exceed a single memory section) there is no need to use nth_page().
Signed-off-by: David Hildenbrand --- mm/percpu-km.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/percpu-km.c b/mm/percpu-km.c index fe31aa19db81a..4efa74a495cb6 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) } for (i = 0; i < nr_pages; i++) - pcpu_set_page_chunk(nth_page(pages, i), chunk); + pcpu_set_page_chunk(pages + i, chunk); chunk->data = pages; chunk->base_addr = page_address(pages); -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:42 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:42 +0200 Subject: [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-17-david@redhat.com> It's no longer required to use nth_page() within a folio, so let's just drop the nth_page() in folio_walk_start(). Signed-off-by: David Hildenbrand --- mm/pagewalk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/pagewalk.c b/mm/pagewalk.c index c6753d370ff4e..9e4225e5fcf5c 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, found: if (expose_page) /* Note: Offset from the mapped page, not the folio start. 
*/ - fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT); + fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT); else fw->page = NULL; fw->ptl = ptl; -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:41 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:41 +0200 Subject: [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-16-david@redhat.com> The nth_page() is not really required anymore, so let's remove it. While at it, cleanup and simplify the code a bit. Signed-off-by: David Hildenbrand --- fs/hugetlbfs/inode.c | 25 ++++++++----------------- 1 file changed, 8 insertions(+), 17 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 34d496a2b7de6..dc981509a7717 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -198,31 +198,22 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, size_t bytes) { - struct page *page; - size_t n = 0; - size_t res = 0; + struct page *page = folio_page(folio, offset / PAGE_SIZE); + size_t n, safe_bytes; - /* First page to start the loop. */ - page = folio_page(folio, offset / PAGE_SIZE); offset %= PAGE_SIZE; - while (1) { + for (safe_bytes = 0; safe_bytes < bytes; safe_bytes += n) { + if (is_raw_hwpoison_page_in_hugepage(page)) break; /* Safe to read n bytes without touching HWPOISON subpage. 
*/ - n = min(bytes, (size_t)PAGE_SIZE - offset); - res += n; - bytes -= n; - if (!bytes || !n) - break; - offset += n; - if (offset == PAGE_SIZE) { - page = nth_page(page, 1); - offset = 0; - } + n = min(bytes - safe_bytes, (size_t)PAGE_SIZE - offset); + offset = 0; + page++; } - return res; + return safe_bytes; } /* -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:43 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:43 +0200 Subject: [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when recording subpages In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-18-david@redhat.com> nth_page() is no longer required when iterating over pages within a single folio, so let's just drop it when recording subpages. Signed-off-by: David Hildenbrand --- mm/gup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index b2a78f0291273..f017ff6d7d61a 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -491,9 +491,9 @@ static int record_subpages(struct page *page, unsigned long sz, struct page *start_page; int nr; - start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT); + start_page = page + ((addr & (sz - 1)) >> PAGE_SHIFT); for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) - pages[nr] = nth_page(start_page, nr); + pages[nr] = start_page + nr; return nr; } @@ -1512,7 +1512,7 @@ static long __get_user_pages(struct mm_struct *mm, } for (j = 0; j < page_increm; j++) { - subpage = nth_page(page, j); + subpage = page + j; pages[i + j] = subpage; flush_anon_page(vma, subpage, start + j * PAGE_SIZE); flush_dcache_page(subpage); -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:45 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:45 +0200 Subject: [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio In-Reply-To: <20250821200701.1329277-1-david@redhat.com> 
References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-20-david@redhat.com> Within a folio/compound page, nth_page() is no longer required. Given that we call folio_test_partial_kmap()+kmap_local_page(), the code would already be problematic if the src pages spanned multiple folios. So let's just assume that all src pages belong to a single folio/compound page and can be iterated ordinarily. Cc: Jens Axboe Signed-off-by: David Hildenbrand --- io_uring/zcrx.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index f29b2a4867516..107b2a1b31c1c 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -966,7 +966,7 @@ static ssize_t io_copy_page(struct page *dst_page, struct page *src_page, size_t n = len; if (folio_test_partial_kmap(page_folio(src_page))) { - src_page = nth_page(src_page, src_offset / PAGE_SIZE); + src_page += src_offset / PAGE_SIZE; src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); n = min(n, len); -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:44 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:44 +0200 Subject: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-19-david@redhat.com> We always provide a single dst page; it's unclear why the io_copy_cache complexity is required. So let's simplify and get rid of "struct io_copy_cache", simply working on the single page. ... which immediately allows us to drop one "nth_page" usage, because it's really just a single page.
Cc: Jens Axboe Signed-off-by: David Hildenbrand --- io_uring/zcrx.c | 32 +++++++------------------------- 1 file changed, 7 insertions(+), 25 deletions(-) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index e5ff49f3425e0..f29b2a4867516 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -954,29 +954,18 @@ static struct net_iov *io_zcrx_alloc_fallback(struct io_zcrx_area *area) return niov; } -struct io_copy_cache { - struct page *page; - unsigned long offset; - size_t size; -}; - -static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, +static ssize_t io_copy_page(struct page *dst_page, struct page *src_page, unsigned int src_offset, size_t len) { - size_t copied = 0; + size_t dst_offset = 0; - len = min(len, cc->size); + len = min(len, PAGE_SIZE); while (len) { void *src_addr, *dst_addr; - struct page *dst_page = cc->page; - unsigned dst_offset = cc->offset; size_t n = len; - if (folio_test_partial_kmap(page_folio(dst_page)) || - folio_test_partial_kmap(page_folio(src_page))) { - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); - dst_offset = offset_in_page(dst_offset); + if (folio_test_partial_kmap(page_folio(src_page))) { src_page = nth_page(src_page, src_offset / PAGE_SIZE); src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); @@ -991,12 +980,10 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, kunmap_local(src_addr); kunmap_local(dst_addr); - cc->size -= n; - cc->offset += n; + dst_offset += n; len -= n; - copied += n; } - return copied; + return dst_offset; } static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, @@ -1011,7 +998,6 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, return -EFAULT; while (len) { - struct io_copy_cache cc; struct net_iov *niov; size_t n; @@ -1021,11 +1007,7 @@ static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, break; } - cc.page = 
io_zcrx_iov_page(niov); - cc.offset = 0; - cc.size = PAGE_SIZE; - - n = io_copy_page(&cc, src_page, src_offset, len); + n = io_copy_page(io_zcrx_iov_page(niov), src_page, src_offset, len); if (!io_zcrx_queue_cqe(req, niov, ifq, 0, n)) { io_zcrx_return_niov(niov); -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:46 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:46 +0200 Subject: [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-21-david@redhat.com> Let's make it clearer that we are operating within a single folio by providing both the folio and the page. This implies that for flush_dcache_folio() we'll now avoid one more page->folio lookup, and that we can safely drop the "nth_page" usage. Cc: Thomas Bogendoerfer Signed-off-by: David Hildenbrand --- arch/mips/include/asm/cacheflush.h | 11 +++++++---- arch/mips/mm/cache.c | 8 ++++---- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h index 1f14132b3fc98..8a2de28936e07 100644 --- a/arch/mips/include/asm/cacheflush.h +++ b/arch/mips/include/asm/cacheflush.h @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); extern void (*flush_cache_range)(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); -extern void __flush_dcache_pages(struct page *page, unsigned int nr); +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 static inline void flush_dcache_folio(struct folio *folio) { if (cpu_has_dc_aliases) - __flush_dcache_pages(&folio->page, folio_nr_pages(folio)); + __flush_dcache_folio_pages(folio, 
folio_page(folio, 0), + folio_nr_pages(folio)); else if (!cpu_has_ic_fills_f_dc) folio_set_dcache_dirty(folio); } @@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio) static inline void flush_dcache_page(struct page *page) { + struct folio *folio = page_folio(page); + if (cpu_has_dc_aliases) - __flush_dcache_pages(page, 1); + __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio)); else if (!cpu_has_ic_fills_f_dc) - folio_set_dcache_dirty(page_folio(page)); + folio_set_dcache_dirty(folio); } #define flush_dcache_mmap_lock(mapping) do { } while (0) diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c index bf9a37c60e9f0..e3b4224c9a406 100644 --- a/arch/mips/mm/cache.c +++ b/arch/mips/mm/cache.c @@ -99,9 +99,9 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes, return 0; } -void __flush_dcache_pages(struct page *page, unsigned int nr) +void __flush_dcache_folio_pages(struct folio *folio, struct page *page, + unsigned int nr) { - struct folio *folio = page_folio(page); struct address_space *mapping = folio_flush_mapping(folio); unsigned long addr; unsigned int i; @@ -117,12 +117,12 @@ void __flush_dcache_pages(struct page *page, unsigned int nr) * get faulted into the tlb (and thus flushed) anyways. 
*/ for (i = 0; i < nr; i++) { - addr = (unsigned long)kmap_local_page(nth_page(page, i)); + addr = (unsigned long)kmap_local_page(page + i); flush_data_cache_page(addr); kunmap_local((void *)addr); } } -EXPORT_SYMBOL(__flush_dcache_pages); +EXPORT_SYMBOL(__flush_dcache_folio_pages); void __flush_anon_page(struct page *page, unsigned long vmaddr) { -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:47 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:47 +0200 Subject: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-22-david@redhat.com> Let's disallow handing out PFN ranges with non-contiguous pages, so we can remove the nth-page usage in __cma_alloc(), and so any callers don't have to worry about that either when wanting to blindly iterate pages. This is really only a problem in configs with SPARSEMEM but without SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some cases. Will this cause harm? Probably not, because it's mostly 32bit that does not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could look into allocating the memmap for the memory sections spanned by a single CMA region in one go from memblock. 
Signed-off-by: David Hildenbrand --- include/linux/mm.h | 6 ++++++ mm/cma.c | 36 +++++++++++++++++++++++------------- mm/util.c | 33 +++++++++++++++++++++++++++++++++ 3 files changed, 62 insertions(+), 13 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index ef360b72cb05c..f59ad1f9fc792 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes; #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #else #define nth_page(page,n) ((page) + (n)) +static inline bool page_range_contiguous(const struct page *page, + unsigned long nr_pages) +{ + return true; +} #endif /* to align the pointer to the (next) page boundary */ diff --git a/mm/cma.c b/mm/cma.c index 2ffa4befb99ab..1119fa2830008 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, unsigned long count, unsigned int align, struct page **pagep, gfp_t gfp) { - unsigned long mask, offset; - unsigned long pfn = -1; - unsigned long start = 0; unsigned long bitmap_maxno, bitmap_no, bitmap_count; + unsigned long start, pfn, mask, offset; int ret = -EBUSY; struct page *page = NULL; @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, if (bitmap_count > bitmap_maxno) goto out; - for (;;) { + for (start = 0; ; start = bitmap_no + mask + 1) { spin_lock_irq(&cma->lock); /* * If the request is larger than the available number @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, spin_unlock_irq(&cma->lock); break; } + + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); + page = pfn_to_page(pfn); + + /* + * Do not hand out page ranges that are not contiguous, so + * callers can just 
iterate the pages without having to worry + * about these corner cases. + */ + if (!page_range_contiguous(page, count)) { + spin_unlock_irq(&cma->lock); + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", + __func__, cma->name, pfn, pfn + count - 1); + continue; + } + bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); cma->available_count -= count; /* @@ -821,29 +835,25 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, */ spin_unlock_irq(&cma->lock); - pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); mutex_lock(&cma->alloc_mutex); ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); mutex_unlock(&cma->alloc_mutex); - if (ret == 0) { - page = pfn_to_page(pfn); + if (!ret) break; - } cma_clear_bitmap(cma, cmr, pfn, count); if (ret != -EBUSY) break; pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", - __func__, pfn, pfn_to_page(pfn)); + __func__, pfn, page); trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), count, align); - /* try again with a bit different memory target */ - start = bitmap_no + mask + 1; } out: - *pagep = page; + if (!ret) + *pagep = page; return ret; } @@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, */ if (page) { for (i = 0; i < count; i++) - page_kasan_tag_reset(nth_page(page, i)); + page_kasan_tag_reset(page + i); } if (ret && !(gfp & __GFP_NOWARN)) { diff --git a/mm/util.c b/mm/util.c index d235b74f7aff7..0bf349b19b652 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, { return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); } + +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +/** + * page_range_contiguous - test whether the page range is contiguous + * @page: the start of the page range. + * @nr_pages: the number of pages in the range. 
+ * + * Test whether the page range is contiguous, such that they can be iterated + * naively, corresponding to iterating a contiguous PFN range. + * + * This function should primarily only be used for debug checks, or when + * working with page ranges that are not naturally contiguous (e.g., pages + * within a folio are). + * + * Returns true if contiguous, otherwise false. + */ +bool page_range_contiguous(const struct page *page, unsigned long nr_pages) +{ + const unsigned long start_pfn = page_to_pfn(page); + const unsigned long end_pfn = start_pfn + nr_pages; + unsigned long pfn; + + /* + * The memmap is allocated per memory section. We need to check + * each involved memory section once. + */ + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); + pfn < end_pfn; pfn += PAGES_PER_SECTION) + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) + return false; + return true; +} +#endif #endif /* CONFIG_MMU */ -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:48 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:48 +0200 Subject: [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-23-david@redhat.com> dma_common_contiguous_remap() is used to remap an "allocated contiguous region". Within a single allocation, there is no need to use nth_page() anymore. Neither the buddy, nor hugetlb, nor CMA will hand out problematic page ranges. 
Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: David Hildenbrand --- kernel/dma/remap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 9e2afad1c6152..b7c1c0c92d0c8 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size, if (!pages) return NULL; for (i = 0; i < count; i++) - pages[i] = nth_page(page, i); + pages[i] = page++; vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); kvfree(pages); -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:49 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:49 +0200 Subject: [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-24-david@redhat.com> The expectation is that there is currently no user that would pass in non-contiguous page ranges: no allocator, not even VMA, will hand these out. The only problematic part would be if someone provided a range obtained directly from memblock, or manually merged problematic ranges. If we find such cases, we should fix them to create separate SG entries. Let's check in sg_set_page() that this is really the case. No need to check in sg_set_folio(), as pages in a folio are guaranteed to be contiguous. We can now drop the nth_page() usage in sg_page_iter_page().
Signed-off-by: David Hildenbrand --- include/linux/scatterlist.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f8a4965f9b98..8196949dfc82c 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -6,6 +6,7 @@ #include #include #include +#include #include struct scatterlist { @@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) static inline void sg_set_page(struct scatterlist *sg, struct page *page, unsigned int len, unsigned int offset) { + VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE)); sg_assign_page(sg, page); sg->offset = offset; sg->length = len; @@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter, */ static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) { - return nth_page(sg_page(piter->sg), piter->sg_pgoffset); + return sg_page(piter->sg) + piter->sg_pgoffset; } /** -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:58 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:58 +0200 Subject: [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-33-david@redhat.com> There is the concern that unpin_user_page_range_dirty_lock() might do some weird merging of PFN ranges -- either now or in the future -- such that PFN range is contiguous but the page range might not be. Let's sanity-check for that and drop the nth_page() usage. 
Signed-off-by: David Hildenbrand --- mm/gup.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index f017ff6d7d61a..0a669a766204b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -237,7 +237,7 @@ void folio_add_pin(struct folio *folio) static inline struct folio *gup_folio_range_next(struct page *start, unsigned long npages, unsigned long i, unsigned int *ntails) { - struct page *next = nth_page(start, i); + struct page *next = start + i; struct folio *folio = page_folio(next); unsigned int nr = 1; @@ -342,6 +342,9 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock); * "gup-pinned page range" refers to a range of pages that has had one of the * pin_user_pages() variants called on that page. * + * The page range must be truly contiguous: the page range corresponds + * to a contiguous PFN range and all pages can be iterated naturally. + * * For the page ranges defined by [page .. page+npages], make that range (or * its head pages, if a compound page) dirty, if @make_dirty is true, and if the * page range was previously listed as clean. @@ -359,6 +362,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, struct folio *folio; unsigned int nr; + VM_WARN_ON_ONCE(!page_range_contiguous(page, npages)); + for (i = 0; i < npages; i += nr) { folio = gup_folio_range_next(page, npages, i, &nr); if (make_dirty && !folio_test_dirty(folio)) { -- 2.50.1 From david at redhat.com Thu Aug 21 20:06:59 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:06:59 +0200 Subject: [PATCH RFC 33/35] kfence: drop nth_page() usage In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-34-david@redhat.com> We want to get rid of nth_page(), and kfence init code is the last user. 
Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP). We could check whether the page range is contiguous using page_range_contiguous() and fail kfence init if it is not, or make kfence incompatible with these problematic kernel configs. Let's keep it simple and simply use pfn_to_page() by iterating PFNs. Cc: Alexander Potapenko Cc: Marco Elver Cc: Dmitry Vyukov Signed-off-by: David Hildenbrand --- mm/kfence/core.c | 17 ++++++++++------- 1 file changed, 10 insertions(+), 7 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 0ed3be100963a..793507c77f9e8 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -594,15 +594,15 @@ static void rcu_guarded_free(struct rcu_head *h) */ static unsigned long kfence_init_pool(void) { - unsigned long addr; - struct page *pages; + unsigned long addr, pfn, start_pfn, end_pfn; int i; if (!arch_kfence_init_pool()) return (unsigned long)__kfence_pool; addr = (unsigned long)__kfence_pool; - pages = virt_to_page(__kfence_pool); + start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool)); + end_pfn = start_pfn + KFENCE_POOL_SIZE / PAGE_SIZE; /* * Set up object pages: they must have PGTY_slab set to avoid freeing @@ -612,12 +612,13 @@ static unsigned long kfence_init_pool(void) * fast-path in SLUB, and therefore need to ensure kfree() correctly * enters __slab_free() slow-path.
*/ - for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { - struct slab *slab = page_slab(nth_page(pages, i)); + for (pfn = start_pfn; pfn != end_pfn; pfn++) { + struct slab *slab; if (!i || (i % 2)) continue; + slab = page_slab(pfn_to_page(pfn)); __folio_set_slab(slab_folio(slab)); #ifdef CONFIG_MEMCG slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts | @@ -664,11 +665,13 @@ static unsigned long kfence_init_pool(void) return 0; reset_slab: - for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { - struct slab *slab = page_slab(nth_page(pages, i)); + for (pfn = start_pfn; pfn != end_pfn; pfn++) { + struct slab *slab; if (!i || (i % 2)) continue; + + slab = page_slab(pfn_to_page(pfn)); #ifdef CONFIG_MEMCG slab->obj_exts = 0; #endif -- 2.50.1 From david at redhat.com Thu Aug 21 20:07:00 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:07:00 +0200 Subject: [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-35-david@redhat.com> Ever since commit 858c708d9efb ("block: move the bi_size update out of __bio_try_merge_page"), page_is_mergeable() no longer exists, and the logic in bvec_try_merge_page() is now a simple page pointer comparison. Signed-off-by: David Hildenbrand --- include/linux/bvec.h | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/include/linux/bvec.h b/include/linux/bvec.h index 0a80e1f9aa201..3fc0efa0825b1 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -22,11 +22,8 @@ struct page; * @bv_len: Number of bytes in the address range. * @bv_offset: Start of the address range relative to the start of @bv_page. 
* - * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len: - * - * nth_page(@bv_page, n) == @bv_page + n - * - * This holds because page_is_mergeable() checks the above property. + * All pages within a bio_vec starting from @bv_page are contiguous and + * can simply be iterated (see bvec_advance()). */ struct bio_vec { struct page *bv_page; -- 2.50.1 From david at redhat.com Thu Aug 21 20:07:01 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:07:01 +0200 Subject: [PATCH RFC 35/35] mm: remove nth_page() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250821200701.1329277-36-david@redhat.com> Now that all users are gone, let's remove it. Signed-off-by: David Hildenbrand --- include/linux/mm.h | 2 -- tools/testing/scatterlist/linux/mm.h | 1 - 2 files changed, 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f59ad1f9fc792..3ded0db8322f7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,9 +210,7 @@ extern unsigned long sysctl_admin_reserve_kbytes; #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) bool page_range_contiguous(const struct page *page, unsigned long nr_pages); -#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #else -#define nth_page(page,n) ((page) + (n)) static inline bool page_range_contiguous(const struct page *page, unsigned long nr_pages) { diff --git a/tools/testing/scatterlist/linux/mm.h b/tools/testing/scatterlist/linux/mm.h index 5bd9e6e806254..121ae78d6e885 100644 --- a/tools/testing/scatterlist/linux/mm.h +++ b/tools/testing/scatterlist/linux/mm.h @@ -51,7 +51,6 @@ static inline unsigned long page_to_phys(struct page *page) #define page_to_pfn(page) ((unsigned long)(page) / PAGE_SIZE) #define pfn_to_page(pfn) (void *)((pfn) * PAGE_SIZE) -#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #define __min(t1, t2, min1, min2, x, y) ({ \ t1 min1 
= (x); \ -- 2.50.1 From david at redhat.com Thu Aug 21 20:29:41 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:29:41 +0200 Subject: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-32-david@redhat.com> Message-ID: <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> On 21.08.25 22:24, Linus Torvalds wrote: > On Thu, 21 Aug 2025 at 16:08, David Hildenbrand wrote: >> >> - page = nth_page(page, offset >> PAGE_SHIFT); >> + page += offset / PAGE_SIZE; > > Please keep the " >> PAGE_SHIFT" form. No strong opinion. I was primarily doing it to get rid of (in other cases) the parentheses. Like in patch #29 - /* Assumption: contiguous pages can be accessed as "page + i" */ - page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT)); + page = sg_page(sg) + *offset / PAGE_SIZE; > > Is "offset" unsigned? Yes it is, But I had to look at the source code > to make sure, because it wasn't locally obvious from the patch. And > I'd rather we keep a pattern that is "safe", in that it doesn't > generate strange code if the value might be a 's64' (eg loff_t) on > 32-bit architectures. > > Because doing a 64-bit shift on x86-32 is like three cycles. Doing a > 64-bit signed division by a simple constant is something like ten > strange instructions even if the end result is only 32-bit. I would have thought that the compiler is smart enough to optimize that? PAGE_SIZE is a constant. > > And again - not the case *here*, but just a general "let's keep to one > pattern", and the shift pattern is simply the better choice. It's a wild mixture, but I can keep doing what we already do in these cases. 
-- Cheers David / dhildenb From torvalds at linux-foundation.org Thu Aug 21 20:24:23 2025 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 21 Aug 2025 16:24:23 -0400 Subject: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry In-Reply-To: <20250821200701.1329277-32-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-32-david@redhat.com> Message-ID: On Thu, 21 Aug 2025 at 16:08, David Hildenbrand wrote: > > - page = nth_page(page, offset >> PAGE_SHIFT); > + page += offset / PAGE_SIZE; Please keep the " >> PAGE_SHIFT" form. Is "offset" unsigned? Yes it is, But I had to look at the source code to make sure, because it wasn't locally obvious from the patch. And I'd rather we keep a pattern that is "safe", in that it doesn't generate strange code if the value might be a 's64' (eg loff_t) on 32-bit architectures. Because doing a 64-bit shift on x86-32 is like three cycles. Doing a 64-bit signed division by a simple constant is something like ten strange instructions even if the end result is only 32-bit. And again - not the case *here*, but just a general "let's keep to one pattern", and the shift pattern is simply the better choice. Linus From david at redhat.com Thu Aug 21 20:32:29 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:32:29 +0200 Subject: [PATCH RFC 33/35] kfence: drop nth_page() usage In-Reply-To: <20250821200701.1329277-34-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-34-david@redhat.com> Message-ID: <1a13a5cb-4312-4c01-827b-fa8a029df0f1@redhat.com> On 21.08.25 22:06, David Hildenbrand wrote: > We want to get rid of nth_page(), and kfence init code is the last user. 
> > Unfortunately, we might actually walk a PFN range where the pages are > not contiguous, because we might be allocating an area from memblock > that could span memory sections in problematic kernel configs (SPARSEMEM > without SPARSEMEM_VMEMMAP). > > We could check whether the page range is contiguous > using page_range_contiguous() and fail kfence init if it is not, or make kfence > incompatible with these problematic kernel configs. > > Let's keep it simple and simply use pfn_to_page() by iterating PFNs. > Fortunately this series is RFC due to lack of detailed testing :P Something gives me a NULL-pointer dereference here (maybe the virt_to_phys()). Will look into that tomorrow. -- Cheers David / dhildenb From torvalds at linux-foundation.org Thu Aug 21 20:36:32 2025 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 21 Aug 2025 16:36:32 -0400 Subject: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry In-Reply-To: <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-32-david@redhat.com> <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> Message-ID: Oh, and your reply was an invalid email and ended up in my spam-box: From: David Hildenbrand but you apparently didn't use the redhat mail system, so the DKIM signing fails dmarc=fail (p=QUARANTINE sp=QUARANTINE dis=QUARANTINE) header.from=redhat.com and it gets marked as spam. I think you may have gone through smtp.kernel.org, but then you need to use your kernel.org email address to get the DKIM right.
Linus From david at redhat.com Thu Aug 21 20:37:26 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:37:26 +0200 Subject: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry In-Reply-To: <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-32-david@redhat.com> <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> Message-ID: <0dc9936f-c977-4ff4-98f3-7941b2eba9d3@redhat.com> On 21.08.25 22:29, David Hildenbrand wrote: > On 21.08.25 22:24, Linus Torvalds wrote: >> On Thu, 21 Aug 2025 at 16:08, David Hildenbrand wrote: >>> >>> - page = nth_page(page, offset >> PAGE_SHIFT); >>> + page += offset / PAGE_SIZE; >> >> Please keep the " >> PAGE_SHIFT" form. > > No strong opinion. > > I was primarily doing it to get rid of (in other cases) the parentheses. > > Like in patch #29 > > - /* Assumption: contiguous pages can be accessed as "page + i" */ > - page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT)); > + page = sg_page(sg) + *offset / PAGE_SIZE; > >> >> Is "offset" unsigned? Yes it is, But I had to look at the source code >> to make sure, because it wasn't locally obvious from the patch. And >> I'd rather we keep a pattern that is "safe", in that it doesn't >> generate strange code if the value might be a 's64' (eg loff_t) on >> 32-bit architectures. >> >> Because doing a 64-bit shift on x86-32 is like three cycles. Doing a >> 64-bit signed division by a simple constant is something like ten >> strange instructions even if the end result is only 32-bit. > > I would have thought that the compiler is smart enough to optimize that? > PAGE_SIZE is a constant. It's late, I get your point: if the compiler can't optimize if it's a signed value ... 
-- Cheers David / dhildenb From torvalds at linux-foundation.org Thu Aug 21 20:40:13 2025 From: torvalds at linux-foundation.org (Linus Torvalds) Date: Thu, 21 Aug 2025 16:40:13 -0400 Subject: [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry In-Reply-To: <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-32-david@redhat.com> <2926d7d9-b44e-40c0-b05d-8c42e99c511d@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 4:29 PM David Hildenbrand wrote: > > Because doing a 64-bit shift on x86-32 is like three cycles. Doing a > > 64-bit signed division by a simple constant is something like ten > > strange instructions even if the end result is only 32-bit. > > I would have thought that the compiler is smart enough to optimize that? > PAGE_SIZE is a constant. Oh, the compiler optimizes things. But dividing a 64-bit signed value with a constant is still quite complicated. It doesn't generate a 'div' instruction, but it generates something like this: movl %ebx, %edx sarl $31, %edx movl %edx, %eax xorl %edx, %edx andl $4095, %eax addl %ecx, %eax adcl %ebx, %edx and that's certainly a lot faster than an actual 64-bit divide would be. An unsigned divide - or a shift - results in just shrdl $12, %ecx, %eax which is still not the fastest instruction (I think shrld gets split into two uops), but it's certainly simpler and easier to read.
Linus From david at redhat.com Thu Aug 21 20:49:07 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 22:49:07 +0200 Subject: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-13-david@redhat.com> Message-ID: <835b776a-4e15-4821-a601-1470807373a1@redhat.com> On 21.08.25 22:46, Zi Yan wrote: > On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > >> Let's limit the maximum folio size in problematic kernel configs where >> the memmap is allocated per memory section (SPARSEMEM without >> SPARSEMEM_VMEMMAP) to a single memory section. >> >> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE >> but not SPARSEMEM_VMEMMAP: sh. >> >> Fortunately, the biggest hugetlb size sh supports is 64 MiB >> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB >> (SECTION_SIZE_BITS == 26), so their use case is not degraded. >> >> As folios and memory sections are naturally aligned to their order-2 size >> in memory, a single folio can no longer span multiple memory >> sections on these problematic kernel configs. >> >> nth_page() is no longer required when operating within a single compound >> page / folio.
>> >> Signed-off-by: David Hildenbrand >> --- >> include/linux/mm.h | 22 ++++++++++++++++++---- >> 1 file changed, 18 insertions(+), 4 deletions(-) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 77737cbf2216a..48a985e17ef4e 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) >> return folio_large_nr_pages(folio); >> } >> >> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ >> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE >> -#define MAX_FOLIO_ORDER PUD_ORDER >> -#else >> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) >> +/* >> + * We don't expect any folios that exceed buddy sizes (and consequently >> + * memory sections). >> + */ >> #define MAX_FOLIO_ORDER MAX_PAGE_ORDER >> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) >> +/* >> + * Only pages within a single memory section are guaranteed to be >> + * contiguous. By limiting folios to a single memory section, all folio >> + * pages are guaranteed to be contiguous. >> + */ >> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT >> +#else >> +/* >> + * There is no real limit on the folio size. We limit them to the maximum we >> + * currently expect. > > The comment about hugetlbfs is helpful here, since the other folios are still > limited by buddy allocator's MAX_ORDER. Yeah, but the old comment was wrong (there is DAX). I can add here "currently expect (e.g., hugetlbfs, dax)."
-- Cheers David / dhildenb From david at redhat.com Thu Aug 21 21:00:37 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 23:00:37 +0200 Subject: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-14-david@redhat.com> Message-ID: <23c6e511-19b2-4662-acfc-18692c899a6c@redhat.com> On 21.08.25 22:55, Zi Yan wrote: > On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > >> Now that a single folio/compound page can no longer span memory sections >> in problematic kernel configurations, we can stop using nth_page(). >> >> While at it, turn both macros into static inline functions and add >> kernel doc for folio_page_idx(). >> >> Signed-off-by: David Hildenbrand >> --- >> include/linux/mm.h | 16 ++++++++++++++-- >> include/linux/page-flags.h | 5 ++++- >> 2 files changed, 18 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 48a985e17ef4e..ef360b72cb05c 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; >> >> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) >> #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) >> -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) >> #else >> #define nth_page(page,n) ((page) + (n)) >> -#define folio_page_idx(folio, p) ((p) - &(folio)->page) >> #endif >> >> /* to align the pointer to the (next) page boundary */ >> @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; >> /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ >> #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) >> >> +/** >> + * folio_page_idx - Return the number of a page in a folio. >> + * @folio: The folio. >> + * @page: The folio page. 
>> + * >> + * This function expects that the page is actually part of the folio. >> + * The returned number is relative to the start of the folio. >> + */ >> +static inline unsigned long folio_page_idx(const struct folio *folio, >> + const struct page *page) >> +{ >> + return page - &folio->page; >> +} >> + >> static inline struct folio *lru_to_folio(struct list_head *head) >> { >> return list_entry((head)->prev, struct folio, lru); >> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h >> index d53a86e68c89b..080ad10c0defc 100644 >> --- a/include/linux/page-flags.h >> +++ b/include/linux/page-flags.h >> @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) >> * check that the page number lies within @folio; the caller is presumed >> * to have a reference to the page. >> */ >> -#define folio_page(folio, n) nth_page(&(folio)->page, n) >> +static inline struct page *folio_page(struct folio *folio, unsigned long nr) >> +{ >> + return &folio->page + nr; >> +} > > Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio. Yeah, it's even called "n" in the kernel docs ... > > Since you have added kernel doc for folio_page_idx(), it does not hurt > to have something similar for folio_page(). :) ... which we already have! (see above the macro) :) Thanks! -- Cheers David / dhildenb From david at redhat.com Thu Aug 21 21:45:18 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 21 Aug 2025 23:45:18 +0200 Subject: [PATCH RFC 33/35] kfence: drop nth_page() usage In-Reply-To: <1a13a5cb-4312-4c01-827b-fa8a029df0f1@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-34-david@redhat.com> <1a13a5cb-4312-4c01-827b-fa8a029df0f1@redhat.com> Message-ID: On 21.08.25 22:32, David Hildenbrand wrote: > On 21.08.25 22:06, David Hildenbrand wrote: >> We want to get rid of nth_page(), and kfence init code is the last user. 
>> >> Unfortunately, we might actually walk a PFN range where the pages are >> not contiguous, because we might be allocating an area from memblock >> that could span memory sections in problematic kernel configs (SPARSEMEM >> without SPARSEMEM_VMEMMAP). >> >> We could check whether the page range is contiguous >> using page_range_contiguous() and fail kfence init, or make kfence >> incompatible with these problematic kernel configs. >> >> Let's keep it simple and simply use pfn_to_page() by iterating PFNs. >> > > Fortunately this series is RFC due to lack of detailed testing :P > > Something gives me a NULL pointer here (maybe the virt_to_phys()). > > Will look into that tomorrow. Okay, easy: relying on i but not updating it /me facepalm -- Cheers David / dhildenb From ziy at nvidia.com Thu Aug 21 20:20:48 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 21 Aug 2025 16:20:48 -0400 Subject: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250821200701.1329277-2-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-2-david@redhat.com> Message-ID: <7169DDE5-A347-44F9-A6A1-707BF9A314F0@nvidia.com> On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > In an ideal world, we wouldn't have to deal with SPARSEMEM without > SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is > considered too costly and consequently not supported. > > However, if an architecture does support SPARSEMEM with > SPARSEMEM_VMEMMAP, let's forbid the user from disabling VMEMMAP: just > like we already do for arm64, s390 and x86. > > So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without > SPARSEMEM_VMEMMAP. > > This implies that the option to not use SPARSEMEM_VMEMMAP will now be > gone for loongarch, powerpc, riscv and sparc. All architectures only > enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really > be a big downside to using the VMEMMAP (quite the contrary).
> > This is a preparation for not supporting > > (1) folio sizes that exceed a single memory section > (2) CMA allocations of non-contiguous page ranges > > in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we > want to limit possible impact as much as possible (e.g., gigantic hugetlb > page allocations suddenly fail). Sounds like a good idea. > > Cc: Huacai Chen > Cc: WANG Xuerui > Cc: Madhavan Srinivasan > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Christophe Leroy > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: Alexandre Ghiti > Cc: "David S. Miller" > Cc: Andreas Larsson > Signed-off-by: David Hildenbrand > --- > mm/Kconfig | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > Acked-by: Zi Yan Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 21 20:23:05 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 21 Aug 2025 16:23:05 -0400 Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250821200701.1329277-7-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-7-david@redhat.com> Message-ID: On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > Let's reject them early, which in turn makes folio_alloc_gigantic() reject > them properly. > > To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER > and calculate MAX_FOLIO_NR_PAGES based on that. > > Signed-off-by: David Hildenbrand > --- > include/linux/mm.h | 6 ++++-- > mm/page_alloc.c | 5 ++++- > 2 files changed, 8 insertions(+), 3 deletions(-) > LGTM.
Reviewed-by: Zi Yan Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 21 20:36:49 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 21 Aug 2025 16:36:49 -0400 Subject: [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: <20250821200701.1329277-12-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-12-david@redhat.com> Message-ID: <5566D681-ED92-41A8-AF46-216AC8F62174@nvidia.com> On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > Let's sanity-check in folio_set_order() whether we would be trying to > create a folio with an order that would make it exceed MAX_FOLIO_ORDER. > > This will enable the check whenever a folio/compound page is initialized > through prepare_compound_head() / prepare_compound_page(). > > Signed-off-by: David Hildenbrand > --- > mm/internal.h | 1 + > 1 file changed, 1 insertion(+) > Reviewed-by: Zi Yan Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 21 20:46:34 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 21 Aug 2025 16:46:34 -0400 Subject: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250821200701.1329277-13-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-13-david@redhat.com> Message-ID: On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > Let's limit the maximum folio size in problematic kernel configs where > the memmap is allocated per memory section (SPARSEMEM without > SPARSEMEM_VMEMMAP) to a single memory section. > > Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE > but not SPARSEMEM_VMEMMAP: sh. > > Fortunately, the biggest hugetlb size sh supports is 64 MiB > (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB > (SECTION_SIZE_BITS == 26), so their use case is not degraded.
> > As folios and memory sections are naturally aligned to their order-2 size > in memory, consequently a single folio can no longer span multiple memory > sections on these problematic kernel configs. > > nth_page() is no longer required when operating within a single compound > page / folio. > > Signed-off-by: David Hildenbrand > --- > include/linux/mm.h | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 77737cbf2216a..48a985e17ef4e 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) > return folio_large_nr_pages(folio); > } > > -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_ORDER PUD_ORDER > -#else > +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) > +/* > + * We don't expect any folios that exceed buddy sizes (and consequently > + * memory sections). > + */ > #define MAX_FOLIO_ORDER MAX_PAGE_ORDER > +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +/* > + * Only pages within a single memory section are guaranteed to be > + * contiguous. By limiting folios to a single memory section, all folio > + * pages are guaranteed to be contiguous. > + */ > +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT > +#else > +/* > + * There is no real limit on the folio size. We limit them to the maximum we > + * currently expect. The comment about hugetlbfs is helpful here, since the other folios are still limited by buddy allocator's MAX_ORDER.
> + */ > +#define MAX_FOLIO_ORDER PUD_ORDER > #endif > > #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > -- > 2.50.1 Otherwise, Reviewed-by: Zi Yan Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 21 20:50:03 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 21 Aug 2025 16:50:03 -0400 Subject: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <835b776a-4e15-4821-a601-1470807373a1@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-13-david@redhat.com> <835b776a-4e15-4821-a601-1470807373a1@redhat.com> Message-ID: <48255144-572E-4BB9-BBA0-D446DCBA8D75@nvidia.com> On 21 Aug 2025, at 16:49, David Hildenbrand wrote: > On 21.08.25 22:46, Zi Yan wrote: >> On 21 Aug 2025, at 16:06, David Hildenbrand wrote: >> >>> Let's limit the maximum folio size in problematic kernel config where >>> the memmap is allocated per memory section (SPARSEMEM without >>> SPARSEMEM_VMEMMAP) to a single memory section. >>> >>> Currently, only a single architectures supports ARCH_HAS_GIGANTIC_PAGE >>> but not SPARSEMEM_VMEMMAP: sh. >>> >>> Fortunately, the biggest hugetlb size sh supports is 64 MiB >>> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB >>> (SECTION_SIZE_BITS == 26), so their use case is not degraded. >>> >>> As folios and memory sections are naturally aligned to their order-2 size >>> in memory, consequently a single folio can no longer span multiple memory >>> sections on these problematic kernel configs. >>> >>> nth_page() is no longer required when operating within a single compound >>> page / folio. 
>>> >>> Signed-off-by: David Hildenbrand >>> --- >>> include/linux/mm.h | 22 ++++++++++++++++++---- >>> 1 file changed, 18 insertions(+), 4 deletions(-) >>> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h >>> index 77737cbf2216a..48a985e17ef4e 100644 >>> --- a/include/linux/mm.h >>> +++ b/include/linux/mm.h >>> @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) >>> return folio_large_nr_pages(folio); >>> } >>> >>> -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ >>> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE >>> -#define MAX_FOLIO_ORDER PUD_ORDER >>> -#else >>> +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) >>> +/* >>> + * We don't expect any folios that exceed buddy sizes (and consequently >>> + * memory sections). >>> + */ >>> #define MAX_FOLIO_ORDER MAX_PAGE_ORDER >>> +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) >>> +/* >>> + * Only pages within a single memory section are guaranteed to be >>> + * contiguous. By limiting folios to a single memory section, all folio >>> + * pages are guaranteed to be contiguous. >>> + */ >>> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT >>> +#else >>> +/* >>> + * There is no real limit on the folio size. We limit them to the maximum we >>> + * currently expect. >> >> The comment about hugetlbfs is helpful here, since the other folios are still >> limited by buddy allocator's MAX_ORDER. > > Yeah, but the old comment was wrong (there is DAX). > > I can add here "currently expect (e.g., hugetlbfs, dax)." Sounds good.
Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 21 20:55:52 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 21 Aug 2025 16:55:52 -0400 Subject: [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250821200701.1329277-14-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-14-david@redhat.com> Message-ID: On 21 Aug 2025, at 16:06, David Hildenbrand wrote: > Now that a single folio/compound page can no longer span memory sections > in problematic kernel configurations, we can stop using nth_page(). > > While at it, turn both macros into static inline functions and add > kernel doc for folio_page_idx(). > > Signed-off-by: David Hildenbrand > --- > include/linux/mm.h | 16 ++++++++++++++-- > include/linux/page-flags.h | 5 ++++- > 2 files changed, 18 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 48a985e17ef4e..ef360b72cb05c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) > #else > #define nth_page(page,n) ((page) + (n)) > -#define folio_page_idx(folio, p) ((p) - &(folio)->page) > #endif > > /* to align the pointer to the (next) page boundary */ > @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; > /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ > #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) > > +/** > + * folio_page_idx - Return the number of a page in a folio. > + * @folio: The folio. > + * @page: The folio page. > + * > + * This function expects that the page is actually part of the folio. > + * The returned number is relative to the start of the folio. 
> + */ > +static inline unsigned long folio_page_idx(const struct folio *folio, > + const struct page *page) > +{ > + return page - &folio->page; > +} > + > static inline struct folio *lru_to_folio(struct list_head *head) > { > return list_entry((head)->prev, struct folio, lru); > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index d53a86e68c89b..080ad10c0defc 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) > * check that the page number lies within @folio; the caller is presumed > * to have a reference to the page. > */ > -#define folio_page(folio, n) nth_page(&(folio)->page, n) > +static inline struct page *folio_page(struct folio *folio, unsigned long nr) > +{ > + return &folio->page + nr; > +} Maybe s/nr/n/ or s/nr/nth/, since it returns the nth page within a folio. Since you have added kernel doc for folio_page_idx(), it does not hurt to have something similar for folio_page(). :) +/** + * folio_page - Return the nth page in a folio. + * @folio: The folio. + * @n: Page index within the folio. + * + * This function expects that n does not exceed folio_nr_pages(folio). + * The returned page is relative to the first page of the folio. 
+ */ > > static __always_inline int PageTail(const struct page *page) > { > -- > 2.50.1 Otherwise, Reviewed-by: Zi Yan Best Regards, Yan, Zi From dlemoal at kernel.org Fri Aug 22 01:59:15 2025 From: dlemoal at kernel.org (Damien Le Moal) Date: Fri, 22 Aug 2025 10:59:15 +0900 Subject: [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry In-Reply-To: <20250821200701.1329277-25-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-25-david@redhat.com> Message-ID: <3812ed9e-2a47-4c1c-bd69-f37768e62ad3@kernel.org> On 8/22/25 05:06, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Cc: Damien Le Moal > Cc: Niklas Cassel > Signed-off-by: David Hildenbrand > --- > drivers/ata/libata-sff.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c > index 7fc407255eb46..9f5d0f9f6d686 100644 > --- a/drivers/ata/libata-sff.c > +++ b/drivers/ata/libata-sff.c > @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) > offset = qc->cursg->offset + qc->cursg_ofs; > > /* get the current page and offset */ > - page = nth_page(page, (offset >> PAGE_SHIFT)); > + page += offset / PAGE_SHIFT; Shouldn't this be "offset >> PAGE_SHIFT" ? 
> offset %= PAGE_SIZE; > > /* don't overrun current sg */ > @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) > unsigned int split_len = PAGE_SIZE - offset; > > ata_pio_xfer(qc, page, offset, split_len); > - ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len); > + ata_pio_xfer(qc, page + 1, 0, count - split_len); > } else { > ata_pio_xfer(qc, page, offset, count); > } > @@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes) > offset = sg->offset + qc->cursg_ofs; > > /* get the current page and offset */ > - page = nth_page(page, (offset >> PAGE_SHIFT)); > + page += offset / PAGE_SIZE; Same here, though this seems correct too. > offset %= PAGE_SIZE; > > /* don't overrun current sg */ -- Damien Le Moal Western Digital Research From mpenttil at redhat.com Fri Aug 22 04:09:17 2025 From: mpenttil at redhat.com (=?UTF-8?Q?Mika_Penttil=C3=A4?=) Date: Fri, 22 Aug 2025 07:09:17 +0300 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <20250821200701.1329277-11-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> Message-ID: <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> On 8/21/25 23:06, David Hildenbrand wrote: > All pages were already initialized and set to PageReserved() with a > refcount of 1 by MM init code. Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to initialize struct pages? > In fact, by using __init_single_page(), we will be setting the refcount to > 1 just to freeze it again immediately afterwards. > > So drop the __init_single_page() and use __ClearPageReserved() instead. > Adjust the comments to highlight that we are dealing with an open-coded > prep_compound_page() variant. > > Further, as we can now safely iterate over all pages in a folio, let's > avoid the page-pfn dance and just iterate the pages directly. 
> > Note that the current code was likely problematic, but we never ran into > it: prep_compound_tail() would have been called with an offset that might > exceed a memory section, and prep_compound_tail() would have simply > added that offset to the page pointer -- which would not have done the > right thing on sparsemem without vmemmap. > > Signed-off-by: David Hildenbrand > --- > mm/hugetlb.c | 21 ++++++++++----------- > 1 file changed, 10 insertions(+), 11 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index d12a9d5146af4..ae82a845b14ad 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3235,17 +3235,14 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, > unsigned long start_page_number, > unsigned long end_page_number) > { > - enum zone_type zone = zone_idx(folio_zone(folio)); > - int nid = folio_nid(folio); > - unsigned long head_pfn = folio_pfn(folio); > - unsigned long pfn, end_pfn = head_pfn + end_page_number; > + struct page *head_page = folio_page(folio, 0); > + struct page *page = folio_page(folio, start_page_number); > + unsigned long i; > int ret; > > - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { > - struct page *page = pfn_to_page(pfn); > - > - __init_single_page(page, pfn, zone, nid); > - prep_compound_tail((struct page *)folio, pfn - head_pfn); > + for (i = start_page_number; i < end_page_number; i++, page++) { > + __ClearPageReserved(page); > + prep_compound_tail(head_page, i); > ret = page_ref_freeze(page, 1); > VM_BUG_ON(!ret); > } > @@ -3257,12 +3254,14 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, > { > int ret; > > - /* Prepare folio head */ > + /* > + * This is an open-coded prep_compound_page() whereby we avoid > + * walking pages twice by preparing+freezing them in the same go. 
> + */ > __folio_clear_reserved(folio); > __folio_set_head(folio); > ret = folio_ref_freeze(folio, 1); > VM_BUG_ON(!ret); > - /* Initialize the necessary tail struct pages */ > hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); > prep_compound_head((struct page *)folio, huge_page_order(h)); > } --Mika From david at redhat.com Fri Aug 22 06:18:40 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 22 Aug 2025 08:18:40 +0200 Subject: [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry In-Reply-To: <3812ed9e-2a47-4c1c-bd69-f37768e62ad3@kernel.org> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-25-david@redhat.com> <3812ed9e-2a47-4c1c-bd69-f37768e62ad3@kernel.org> Message-ID: <6bff5a45-8e52-4a5d-81cb-63a7331d7d0b@redhat.com> On 22.08.25 03:59, Damien Le Moal wrote: > On 8/22/25 05:06, David Hildenbrand wrote: >> It's no longer required to use nth_page() when iterating pages within a >> single SG entry, so let's drop the nth_page() usage. >> >> Cc: Damien Le Moal >> Cc: Niklas Cassel >> Signed-off-by: David Hildenbrand >> --- >> drivers/ata/libata-sff.c | 6 +++--- >> 1 file changed, 3 insertions(+), 3 deletions(-) >> >> diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c >> index 7fc407255eb46..9f5d0f9f6d686 100644 >> --- a/drivers/ata/libata-sff.c >> +++ b/drivers/ata/libata-sff.c >> @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) >> offset = qc->cursg->offset + qc->cursg_ofs; >> >> /* get the current page and offset */ >> - page = nth_page(page, (offset >> PAGE_SHIFT)); >> + page += offset / PAGE_SHIFT; > > Shouldn't this be "offset >> PAGE_SHIFT" ? Thanks for taking a look! Yeah, I already reverted back to "offset >> PAGE_SHIFT" after Linus mentioned in another mail in this thread that ">> PAGE_SHIFT" is generally preferred because the compiler cannot optimize as much if offset would be a signed variable. So the next version will have the shift again. 
-- Cheers David / dhildenb From david at redhat.com Fri Aug 22 06:24:31 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 22 Aug 2025 08:24:31 +0200 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> Message-ID: <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> On 22.08.25 06:09, Mika Penttilä wrote: > > On 8/21/25 23:06, David Hildenbrand wrote: > >> All pages were already initialized and set to PageReserved() with a >> refcount of 1 by MM init code. > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to > initialize struct pages? Excellent point, I did not know about that one. Spotting that we don't do the same for the head page made me assume that it's just a misuse of __init_single_page(). But the nasty thing is that we use memblock_reserved_mark_noinit() to only mark the tail pages ... Let me revert back to __init_single_page() and add a big fat comment why this is required. Thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 22 13:59:57 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 22 Aug 2025 15:59:57 +0200 Subject: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-19-david@redhat.com> Message-ID: <473f3576-ddf3-4388-aeec-d486f639950a@redhat.com>
Not finished, and currently it's wasting memory. Okay, so what you're saying is that there will be follow-up work that will actually make this structure useful. > > Why not do as below? Pages there never cross boundaries of their folios. > Do you want it to be taken into the io_uring tree? This should better all go through the MM tree where we actually guarantee contiguous pages within a folio. (see the cover letter) > > diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c > index e5ff49f3425e..18c12f4b56b6 100644 > --- a/io_uring/zcrx.c > +++ b/io_uring/zcrx.c > @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, > > if (folio_test_partial_kmap(page_folio(dst_page)) || > folio_test_partial_kmap(page_folio(src_page))) { > - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); > + dst_page += dst_offset / PAGE_SIZE; > dst_offset = offset_in_page(dst_offset); > - src_page = nth_page(src_page, src_offset / PAGE_SIZE); > + src_page += src_offset / PAGE_SIZE; Yeah, I can do that in the next version given that you have plans on extending that code soon. -- Cheers David / dhildenb From rppt at kernel.org Fri Aug 22 15:10:24 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 22 Aug 2025 18:10:24 +0300 Subject: [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250821200701.1329277-3-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-3-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:28PM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. 
> > Cc: Catalin Marinas > Cc: Will Deacon > Signed-off-by: David Hildenbrand Reviewed-by: Mike Rapoport (Microsoft) > --- > arch/arm64/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index e9bbfacc35a64..b1d1f2ff2493b 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz" > config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_VMEMMAP_ENABLE > - select SPARSEMEM_VMEMMAP > > config HW_PERF_EVENTS > def_bool y > -- > 2.50.1 > -- Sincerely yours, Mike. From rppt at kernel.org Fri Aug 22 15:13:39 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 22 Aug 2025 18:13:39 +0300 Subject: [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config In-Reply-To: <20250821200701.1329277-6-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-6-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:31PM +0200, David Hildenbrand wrote: > It's no longer user-selectable (and the default was already "y"), so > let's just drop it. and it should not matter for wireguard selftest anyway > > Cc: "Jason A. Donenfeld" > Cc: Shuah Khan > Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport (Microsoft) > --- > tools/testing/selftests/wireguard/qemu/kernel.config | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config > index 0a5381717e9f4..1149289f4b30f 100644 > --- a/tools/testing/selftests/wireguard/qemu/kernel.config > +++ b/tools/testing/selftests/wireguard/qemu/kernel.config > @@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y > CONFIG_FUTEX=y > CONFIG_SHMEM=y > CONFIG_SLUB=y > -CONFIG_SPARSEMEM_VMEMMAP=y > CONFIG_SMP=y > CONFIG_SCHED_SMT=y > CONFIG_SCHED_MC=y > -- > 2.50.1 > -- Sincerely yours, Mike. 
From rppt at kernel.org Fri Aug 22 15:27:22 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 22 Aug 2025 18:27:22 +0300 Subject: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: <20250821200701.1329277-10-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-10-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote: > Grepping for "prep_compound_page" leaves one clueless how devdax gets its > compound pages initialized. > > Let's add a comment that might help finding this open-coded > prep_compound_page() initialization more easily. > > Further, let's be less smart about the ordering of initialization and just > perform the prep_compound_head() call after all tail pages were > initialized: just like prep_compound_page() does. > > No need for a lengthy comment then: again, just like prep_compound_page(). > > Note that prep_compound_head() already does initialize stuff in page[2] > through prep_compound_head() that successive tail page initialization > will overwrite: _deferred_list, and on 32bit _entire_mapcount and > _pincount. Very likely 32bit does not apply, and likely nobody ever ends > up testing whether the _deferred_list is empty. > > So it shouldn't be a fix at this point, but certainly something to clean > up. > > Signed-off-by: David Hildenbrand > --- > mm/mm_init.c | 13 +++++-------- > 1 file changed, 5 insertions(+), 8 deletions(-) > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 5c21b3af216b2..708466c5b2cc9 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head, > unsigned long pfn, end_pfn = head_pfn + nr_pages; > unsigned int order = pgmap->vmemmap_shift; > > + /* > + * This is an open-coded prep_compound_page() whereby we avoid > + * walking pages twice by initializing them in the same go.
> + */ While on it, can you also mention that prep_compound_page() is not used to properly set page zone link? With this Reviewed-by: Mike Rapoport (Microsoft) > __SetPageHead(head); > for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { > struct page *page = pfn_to_page(pfn); > @@ -1098,15 +1102,8 @@ static void __ref memmap_init_compound(struct page *head, > __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); > prep_compound_tail(head, pfn - head_pfn); > set_page_count(page, 0); > - > - /* > - * The first tail page stores important compound page info. > - * Call prep_compound_head() after the first tail page has > - * been initialized, to not have the data overwritten. > - */ > - if (pfn == head_pfn + 1) > - prep_compound_head(head, order); > } > + prep_compound_head(head, order); > } > > void __ref memmap_init_zone_device(struct zone *zone, > -- > 2.50.1 > -- Sincerely yours, Mike. From sj at kernel.org Fri Aug 22 17:07:46 2025 From: sj at kernel.org (SeongJae Park) Date: Fri, 22 Aug 2025 10:07:46 -0700 Subject: [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250821200701.1329277-7-david@redhat.com> Message-ID: <20250822170746.53309-1-sj@kernel.org> On Thu, 21 Aug 2025 22:06:32 +0200 David Hildenbrand wrote: > Let's reject them early, I like early failures. :) > which in turn makes folio_alloc_gigantic() reject > them properly. > > To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER > and calculate MAX_FOLIO_NR_PAGES based on that. > > Signed-off-by: David Hildenbrand Acked-by: SeongJae Park Thanks, SJ [...] 
From sj at kernel.org Fri Aug 22 17:09:51 2025 From: sj at kernel.org (SeongJae Park) Date: Fri, 22 Aug 2025 10:09:51 -0700 Subject: [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() In-Reply-To: <20250821200701.1329277-8-david@redhat.com> Message-ID: <20250822170951.53418-1-sj@kernel.org> On Thu, 21 Aug 2025 22:06:33 +0200 David Hildenbrand wrote: > Let's reject unreasonable folio sizes early, where we can still fail. > We'll add sanity checks to prepare_compound_head/prepare_compound_page > next. > > Is there a way to configure a system such that unreasonable folio sizes > would be possible? It would already be rather questionable. > > If so, we'd probably want to bail out earlier, where we can avoid a > WARN and just report a proper error message that indicates where > something went wrong such that we messed up. > > Signed-off-by: David Hildenbrand Acked-by: SeongJae Park Thanks, SJ [...] From david at redhat.com Fri Aug 22 18:09:54 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 22 Aug 2025 20:09:54 +0200 Subject: [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-10-david@redhat.com> Message-ID: <1a3ca0c5-0720-4882-b425-031297c1abb7@redhat.com> On 22.08.25 17:27, Mike Rapoport wrote: > On Thu, Aug 21, 2025 at 10:06:35PM +0200, David Hildenbrand wrote: >> Grepping for "prep_compound_page" leaves one clueless how devdax gets its >> compound pages initialized. >> >> Let's add a comment that might help finding this open-coded >> prep_compound_page() initialization more easily. >> >> Further, let's be less smart about the ordering of initialization and just >> perform the prep_compound_head() call after all tail pages were >> initialized: just like prep_compound_page() does. >> >> No need for a lengthy comment then: again, just like prep_compound_page().
>> >> Note that prep_compound_head() already does initialize stuff in page[2] >> through prep_compound_head() that successive tail page initialization >> will overwrite: _deferred_list, and on 32bit _entire_mapcount and >> _pincount. Very likely 32bit does not apply, and likely nobody ever ends >> up testing whether the _deferred_list is empty. >> >> So it shouldn't be a fix at this point, but certainly something to clean >> up. >> >> Signed-off-by: David Hildenbrand >> --- >> mm/mm_init.c | 13 +++++-------- >> 1 file changed, 5 insertions(+), 8 deletions(-) >> >> diff --git a/mm/mm_init.c b/mm/mm_init.c >> index 5c21b3af216b2..708466c5b2cc9 100644 >> --- a/mm/mm_init.c >> +++ b/mm/mm_init.c >> @@ -1091,6 +1091,10 @@ static void __ref memmap_init_compound(struct page *head, >> unsigned long pfn, end_pfn = head_pfn + nr_pages; >> unsigned int order = pgmap->vmemmap_shift; >> >> + /* >> + * This is an open-coded prep_compound_page() whereby we avoid >> + * walking pages twice by initializing them in the same go. >> + */ > > While on it, can you also mention that prep_compound_page() is not used to > properly set page zone link? Sure, thanks! -- Cheers David / dhildenb From rppt at kernel.org Sat Aug 23 08:59:50 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sat, 23 Aug 2025 11:59:50 +0300 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: > On 22.08.25 06:09, Mika Penttilä wrote: > > > > On 8/21/25 23:06, David Hildenbrand wrote: > > > > > All pages were already initialized and set to PageReserved() with a > > > refcount of 1 by MM init code.
> > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to > > initialize struct pages? > > Excellent point, I did not know about that one. > > Spotting that we don't do the same for the head page made me assume that > it's just a misuse of __init_single_page(). > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only > mark the tail pages ... And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled struct pages are initialized regardless of memblock_reserved_mark_noinit(). I think this patch should go in before your updates: diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 753f99b4c718..1c51788339a5 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3230,6 +3230,22 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid) return 1; } +/* + * Tail pages in a huge folio allocated from memblock are marked as 'noinit', + * which means that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled their + * struct page won't be initialized + */ +#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT +static void __init hugetlb_init_tail_page(struct page *page, unsigned long pfn, + enum zone_type zone, int nid) +{ + __init_single_page(page, pfn, zone, nid); +} +#else +static inline void hugetlb_init_tail_page(struct page *page, unsigned long pfn, + enum zone_type zone, int nid) {} +#endif + /* Initialize [start_page:end_page_number] tail struct pages of a hugepage */ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, unsigned long start_page_number, @@ -3244,7 +3260,7 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { struct page *page = pfn_to_page(pfn); - __init_single_page(page, pfn, zone, nid); + hugetlb_init_tail_page(page, pfn, zone, nid); prep_compound_tail((struct page *)folio, pfn - head_pfn); ret = page_ref_freeze(page, 1); VM_BUG_ON(!ret); > Let me revert back to __init_single_page() and add 
a big fat comment why > this is required. > > Thanks! -- Sincerely yours, Mike. From rppt at kernel.org Sun Aug 24 13:24:23 2025 From: rppt at kernel.org (Mike Rapoport) Date: Sun, 24 Aug 2025 16:24:23 +0300 Subject: [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250821200701.1329277-13-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-13-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:38PM +0200, David Hildenbrand wrote: > Let's limit the maximum folio size in problematic kernel configs where > the memmap is allocated per memory section (SPARSEMEM without > SPARSEMEM_VMEMMAP) to a single memory section. > > Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE > but not SPARSEMEM_VMEMMAP: sh. > > Fortunately, the biggest hugetlb size sh supports is 64 MiB > (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB > (SECTION_SIZE_BITS == 26), so its use case is not degraded. > > As folios and memory sections are naturally aligned to their power-of-two size > in memory, a single folio consequently can no longer span multiple memory > sections on these problematic kernel configs. > > nth_page() is no longer required when operating within a single compound > page / folio.
> > Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport (Microsoft) > --- > include/linux/mm.h | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 77737cbf2216a..48a985e17ef4e 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) > return folio_large_nr_pages(folio); > } > > -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_ORDER PUD_ORDER > -#else > +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) > +/* > + * We don't expect any folios that exceed buddy sizes (and consequently > + * memory sections). > + */ > #define MAX_FOLIO_ORDER MAX_PAGE_ORDER > +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +/* > + * Only pages within a single memory section are guaranteed to be > + * contiguous. By limiting folios to a single memory section, all folio > + * pages are guaranteed to be contiguous. > + */ > +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT > +#else > +/* > + * There is no real limit on the folio size. We limit them to the maximum we > + * currently expect. > + */ > +#define MAX_FOLIO_ORDER PUD_ORDER > #endif > > #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > -- > 2.50.1 > -- Sincerely yours, Mike. From david at redhat.com Mon Aug 25 12:48:58 2025 From: david at redhat.com (David Hildenbrand) Date: Mon, 25 Aug 2025 14:48:58 +0200 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On 23.08.25 10:59, Mike Rapoport wrote: > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: >> On 22.08.25 06:09, Mika Penttil? 
wrote: >>> >>> On 8/21/25 23:06, David Hildenbrand wrote: >>> >>>> All pages were already initialized and set to PageReserved() with a >>>> refcount of 1 by MM init code. >>> >>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to >>> initialize struct pages? >> >> Excellent point, I did not know about that one. >> >> Spotting that we don't do the same for the head page made me assume that >> it's just a misuse of __init_single_page(). >> >> But the nasty thing is that we use memblock_reserved_mark_noinit() to only >> mark the tail pages ... > > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is > disabled struct pages are initialized regardless of > memblock_reserved_mark_noinit(). > > I think this patch should go in before your updates: Shouldn't we fix this in memblock code? Hacking around that in the memblock_reserved_mark_noinit() user sound wrong -- and nothing in the doc of memblock_reserved_mark_noinit() spells that behavior out. -- Cheers David / dhildenb From rppt at kernel.org Mon Aug 25 14:32:20 2025 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 25 Aug 2025 17:32:20 +0300 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote: > On 23.08.25 10:59, Mike Rapoport wrote: > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: > > > On 22.08.25 06:09, Mika Penttil? wrote: > > > > > > > > On 8/21/25 23:06, David Hildenbrand wrote: > > > > > > > > > All pages were already initialized and set to PageReserved() with a > > > > > refcount of 1 by MM init code. 
> > > > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to > > > > initialize struct pages? > > > > > > Excellent point, I did not know about that one. > > > > > > Spotting that we don't do the same for the head page made me assume that > > > it's just a misuse of __init_single_page(). > > > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only > > > mark the tail pages ... > > > > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is > > disabled struct pages are initialized regardless of > > memblock_reserved_mark_noinit(). > > > > I think this patch should go in before your updates: > > Shouldn't we fix this in memblock code? > > Hacking around that in the memblock_reserved_mark_noinit() user sound wrong > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that > behavior out. We can surely update the docs, but unfortunately I don't see how to avoid hacking around it in hugetlb. Since it's used to optimise HVO even further to the point hugetlb open codes memmap initialization, I think it's fair that it should deal with all possible configurations. > -- > Cheers > > David / dhildenb > > -- Sincerely yours, Mike. From david at redhat.com Mon Aug 25 14:38:03 2025 From: david at redhat.com (David Hildenbrand) Date: Mon, 25 Aug 2025 16:38:03 +0200 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On 25.08.25 16:32, Mike Rapoport wrote: > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote: >> On 23.08.25 10:59, Mike Rapoport wrote: >>> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: >>>> On 22.08.25 06:09, Mika Penttil? 
wrote: >>>>> >>>>> On 8/21/25 23:06, David Hildenbrand wrote: >>>>> >>>>>> All pages were already initialized and set to PageReserved() with a >>>>>> refcount of 1 by MM init code. >>>>> >>>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to >>>>> initialize struct pages? >>>> >>>> Excellent point, I did not know about that one. >>>> >>>> Spotting that we don't do the same for the head page made me assume that >>>> it's just a misuse of __init_single_page(). >>>> >>>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only >>>> mark the tail pages ... >>> >>> And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is >>> disabled struct pages are initialized regardless of >>> memblock_reserved_mark_noinit(). >>> >>> I think this patch should go in before your updates: >> >> Shouldn't we fix this in memblock code? >> >> Hacking around that in the memblock_reserved_mark_noinit() user sound wrong >> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that >> behavior out. > > We can surely update the docs, but unfortunately I don't see how to avoid > hacking around it in hugetlb. > Since it's used to optimise HVO even further to the point hugetlb open > codes memmap initialization, I think it's fair that it should deal with all > possible configurations. Remind me, why can't we support memblock_reserved_mark_noinit() when CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled? 
-- Cheers David / dhildenb From rppt at kernel.org Mon Aug 25 14:59:22 2025 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 25 Aug 2025 17:59:22 +0300 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote: > On 25.08.25 16:32, Mike Rapoport wrote: > > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote: > > > On 23.08.25 10:59, Mike Rapoport wrote: > > > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: > > > > > On 22.08.25 06:09, Mika Penttil? wrote: > > > > > > > > > > > > On 8/21/25 23:06, David Hildenbrand wrote: > > > > > > > > > > > > > All pages were already initialized and set to PageReserved() with a > > > > > > > refcount of 1 by MM init code. > > > > > > > > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to > > > > > > initialize struct pages? > > > > > > > > > > Excellent point, I did not know about that one. > > > > > > > > > > Spotting that we don't do the same for the head page made me assume that > > > > > it's just a misuse of __init_single_page(). > > > > > > > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only > > > > > mark the tail pages ... > > > > > > > > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is > > > > disabled struct pages are initialized regardless of > > > > memblock_reserved_mark_noinit(). > > > > > > > > I think this patch should go in before your updates: > > > > > > Shouldn't we fix this in memblock code? 
> > > > > > Hacking around that in the memblock_reserved_mark_noinit() user sound wrong > > > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that > > > behavior out. > > > > We can surely update the docs, but unfortunately I don't see how to avoid > > hacking around it in hugetlb. > > Since it's used to optimise HVO even further to the point hugetlb open > > codes memmap initialization, I think it's fair that it should deal with all > > possible configurations. > > Remind me, why can't we support memblock_reserved_mark_noinit() when > CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled? When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled we initialize the entire memmap early (setup_arch()->free_area_init()), and we may have a bunch of memblock_reserved_mark_noinit() afterwards > -- > Cheers > > David / dhildenb > -- Sincerely yours, Mike. From david at redhat.com Mon Aug 25 15:42:33 2025 From: david at redhat.com (David Hildenbrand) Date: Mon, 25 Aug 2025 17:42:33 +0200 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On 25.08.25 16:59, Mike Rapoport wrote: > On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote: >> On 25.08.25 16:32, Mike Rapoport wrote: >>> On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote: >>>> On 23.08.25 10:59, Mike Rapoport wrote: >>>>> On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: >>>>>> On 22.08.25 06:09, Mika Penttil? wrote: >>>>>>> >>>>>>> On 8/21/25 23:06, David Hildenbrand wrote: >>>>>>> >>>>>>>> All pages were already initialized and set to PageReserved() with a >>>>>>>> refcount of 1 by MM init code. 
>>>>>>> >>>>>>> Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to >>>>>>> initialize struct pages? >>>>>> >>>>>> Excellent point, I did not know about that one. >>>>>> >>>>>> Spotting that we don't do the same for the head page made me assume that >>>>>> it's just a misuse of __init_single_page(). >>>>>> >>>>>> But the nasty thing is that we use memblock_reserved_mark_noinit() to only >>>>>> mark the tail pages ... >>>>> >>>>> And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is >>>>> disabled struct pages are initialized regardless of >>>>> memblock_reserved_mark_noinit(). >>>>> >>>>> I think this patch should go in before your updates: >>>> >>>> Shouldn't we fix this in memblock code? >>>> >>>> Hacking around that in the memblock_reserved_mark_noinit() user sound wrong >>>> -- and nothing in the doc of memblock_reserved_mark_noinit() spells that >>>> behavior out. >>> >>> We can surely update the docs, but unfortunately I don't see how to avoid >>> hacking around it in hugetlb. >>> Since it's used to optimise HVO even further to the point hugetlb open >>> codes memmap initialization, I think it's fair that it should deal with all >>> possible configurations. >> >> Remind me, why can't we support memblock_reserved_mark_noinit() when >> CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled? > > When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled we initialize the entire > memmap early (setup_arch()->free_area_init()), and we may have a bunch of > memblock_reserved_mark_noinit() afterwards Oh, you mean that we get effective memblock modifications after already initializing the memmap. That sounds ... interesting :) So yeah, we have to document this for memblock_reserved_mark_noinit(). Is it also a problem for kexec_handover? 
We should do something like: diff --git a/mm/memblock.c b/mm/memblock.c index 154f1d73b61f2..ed4c563d72c32 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) /** * memblock_reserved_mark_noinit - Mark a reserved memory region with flag - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized - * for this region. + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding + * to this region not getting initialized, because the caller will take + * care of it. * @base: the base phys addr of the region * @size: the size of the region * - * struct pages will not be initialized for reserved memory regions marked with - * %MEMBLOCK_RSRV_NOINIT. + * "struct pages" will not be initialized for reserved memory regions marked + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely + * that this function is not effective. * * Return: 0 on success, -errno on failure. */ Optimizing the hugetlb code could be done, but I am not sure how high the priority is (nobody complained so far about the double init). 
-- Cheers David / dhildenb From rppt at kernel.org Mon Aug 25 16:17:02 2025 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 25 Aug 2025 19:17:02 +0300 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On Mon, Aug 25, 2025 at 05:42:33PM +0200, David Hildenbrand wrote: > On 25.08.25 16:59, Mike Rapoport wrote: > > On Mon, Aug 25, 2025 at 04:38:03PM +0200, David Hildenbrand wrote: > > > On 25.08.25 16:32, Mike Rapoport wrote: > > > > On Mon, Aug 25, 2025 at 02:48:58PM +0200, David Hildenbrand wrote: > > > > > On 23.08.25 10:59, Mike Rapoport wrote: > > > > > > On Fri, Aug 22, 2025 at 08:24:31AM +0200, David Hildenbrand wrote: > > > > > > > On 22.08.25 06:09, Mika Penttil? wrote: > > > > > > > > > > > > > > > > On 8/21/25 23:06, David Hildenbrand wrote: > > > > > > > > > > > > > > > > > All pages were already initialized and set to PageReserved() with a > > > > > > > > > refcount of 1 by MM init code. > > > > > > > > > > > > > > > > Just to be sure, how is this working with MEMBLOCK_RSRV_NOINIT, where MM is supposed not to > > > > > > > > initialize struct pages? > > > > > > > > > > > > > > Excellent point, I did not know about that one. > > > > > > > > > > > > > > Spotting that we don't do the same for the head page made me assume that > > > > > > > it's just a misuse of __init_single_page(). > > > > > > > > > > > > > > But the nasty thing is that we use memblock_reserved_mark_noinit() to only > > > > > > > mark the tail pages ... > > > > > > > > > > > > And even nastier thing is that when CONFIG_DEFERRED_STRUCT_PAGE_INIT is > > > > > > disabled struct pages are initialized regardless of > > > > > > memblock_reserved_mark_noinit(). 
> > > > > > > > > > > > I think this patch should go in before your updates: > > > > > > > > > > Shouldn't we fix this in memblock code? > > > > > > > > > > Hacking around that in the memblock_reserved_mark_noinit() user sound wrong > > > > > -- and nothing in the doc of memblock_reserved_mark_noinit() spells that > > > > > behavior out. > > > > > > > > We can surely update the docs, but unfortunately I don't see how to avoid > > > > hacking around it in hugetlb. > > > > Since it's used to optimise HVO even further to the point hugetlb open > > > > codes memmap initialization, I think it's fair that it should deal with all > > > > possible configurations. > > > > > > Remind me, why can't we support memblock_reserved_mark_noinit() when > > > CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled? > > > > When CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled we initialize the entire > > memmap early (setup_arch()->free_area_init()), and we may have a bunch of > > memblock_reserved_mark_noinit() afterwards > > Oh, you mean that we get effective memblock modifications after already > initializing the memmap. > > That sounds ... interesting :) It's memmap, not the free lists. Without deferred init, memblock is active for a while after memmap initialized and before the memory goes to the free lists. > So yeah, we have to document this for memblock_reserved_mark_noinit(). > > Is it also a problem for kexec_handover? With KHO it's also interesting, but it does not support deferred struct page init for now :) > We should do something like: > > diff --git a/mm/memblock.c b/mm/memblock.c > index 154f1d73b61f2..ed4c563d72c32 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) > /** > * memblock_reserved_mark_noinit - Mark a reserved memory region with flag > - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized > - * for this region. 
> + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding > + * to this region not getting initialized, because the caller will take > + * care of it. > * @base: the base phys addr of the region > * @size: the size of the region > * > - * struct pages will not be initialized for reserved memory regions marked with > - * %MEMBLOCK_RSRV_NOINIT. > + * "struct pages" will not be initialized for reserved memory regions marked > + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization > + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely > + * that this function is not effective. > * > * Return: 0 on success, -errno on failure. > */ I have a different version :) diff --git a/include/linux/memblock.h b/include/linux/memblock.h index b96746376e17..d20d091c6343 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn; * via a driver, and never indicated in the firmware-provided memory map as * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the * kernel resource tree. - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are - * not initialized (only for reserved regions). + * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages don't have + * PG_Reserved set and are completely not initialized when + * %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled (only for reserved regions). * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, * either explictitly with memblock_reserve_kern() or via memblock * allocation APIs. All memblock allocations set this flag. 
diff --git a/mm/memblock.c b/mm/memblock.c index 154f1d73b61f..02de5ffb085b 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1091,13 +1091,15 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) /** * memblock_reserved_mark_noinit - Mark a reserved memory region with flag - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized - * for this region. + * MEMBLOCK_RSRV_NOINIT + * * @base: the base phys addr of the region * @size: the size of the region * - * struct pages will not be initialized for reserved memory regions marked with - * %MEMBLOCK_RSRV_NOINIT. + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will + * not have %PG_Reserved flag set. + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flags also + * completly bypasses the initialization of struct pages for this region. * * Return: 0 on success, -errno on failure. */ > Optimizing the hugetlb code could be done, but I am not sure how high > the priority is (nobody complained so far about the double init). > > -- > Cheers > > David / dhildenb > -- Sincerely yours, Mike. 
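The two configurations Mike distinguishes above can be sketched with a toy model (plain Python, not kernel code; `simulate()` and its state strings are invented for illustration): with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled, a MEMBLOCK_RSRV_NOINIT mark really suppresses struct-page initialization, while without it the memmap was already initialized early (setup_arch() -> free_area_init()) and a later mark only skips the reserved marking.

```python
# Toy model of MEMBLOCK_RSRV_NOINIT semantics (illustration only, not kernel
# code; all names here are invented for this sketch).

def simulate(deferred_init: bool, marked_noinit: bool) -> str:
    """Return how a reserved region's struct pages end up after boot."""
    if not deferred_init:
        # Without deferred init, the entire memmap is initialized early,
        # before callers get a chance to mark regions noinit.
        state = "default-initialized"
        # The noinit mark still prevents the later reserved marking.
        state += ", not marked reserved" if marked_noinit else ", marked reserved"
        return state
    # With deferred init, noinit regions are skipped entirely and the
    # caller (e.g. hugetlb's open-coded memmap init) must initialize them.
    if marked_noinit:
        return "uninitialized (caller must init)"
    return "default-initialized, marked reserved"

print(simulate(deferred_init=True, marked_noinit=True))
print(simulate(deferred_init=False, marked_noinit=True))
```

This mirrors why the hugetlb code in this thread has to cope with both configurations: it can only skip its own `__init_single_page()` calls when the struct pages were genuinely left uninitialized.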
From david at redhat.com Mon Aug 25 16:23:48 2025 From: david at redhat.com (David Hildenbrand) Date: Mon, 25 Aug 2025 18:23:48 +0200 Subject: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-11-david@redhat.com> <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: > >> We should do something like: >> >> diff --git a/mm/memblock.c b/mm/memblock.c >> index 154f1d73b61f2..ed4c563d72c32 100644 >> --- a/mm/memblock.c >> +++ b/mm/memblock.c >> @@ -1091,13 +1091,16 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) >> /** >> * memblock_reserved_mark_noinit - Mark a reserved memory region with flag >> - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized >> - * for this region. >> + * MEMBLOCK_RSRV_NOINIT which allows for the "struct pages" corresponding >> + * to this region not getting initialized, because the caller will take >> + * care of it. >> * @base: the base phys addr of the region >> * @size: the size of the region >> * >> - * struct pages will not be initialized for reserved memory regions marked with >> - * %MEMBLOCK_RSRV_NOINIT. >> + * "struct pages" will not be initialized for reserved memory regions marked >> + * with %MEMBLOCK_RSRV_NOINIT if this function is called before initialization >> + * code runs. Without CONFIG_DEFERRED_STRUCT_PAGE_INIT, it is more likely >> + * that this function is not effective. >> * >> * Return: 0 on success, -errno on failure. >> */ > > I have a different version :) > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h > index b96746376e17..d20d091c6343 100644 > --- a/include/linux/memblock.h > +++ b/include/linux/memblock.h > @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn; > * via a driver, and never indicated in the firmware-provided memory map as > * system RAM. 
This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the > * kernel resource tree. > - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are > - * not initialized (only for reserved regions). > + * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages don't have > + * PG_Reserved set and are completely not initialized when > + * %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled (only for reserved regions). > * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, > * either explictitly with memblock_reserve_kern() or via memblock > * allocation APIs. All memblock allocations set this flag. > diff --git a/mm/memblock.c b/mm/memblock.c > index 154f1d73b61f..02de5ffb085b 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1091,13 +1091,15 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) > > /** > * memblock_reserved_mark_noinit - Mark a reserved memory region with flag > - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized > - * for this region. > + * MEMBLOCK_RSRV_NOINIT > + * > * @base: the base phys addr of the region > * @size: the size of the region > * > - * struct pages will not be initialized for reserved memory regions marked with > - * %MEMBLOCK_RSRV_NOINIT. > + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will > + * not have %PG_Reserved flag set. > + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flags also > + * completly bypasses the initialization of struct pages for this region. s/completly/completely. I don't quite understand the interaction with PG_Reserved and why anybody using this function should care. So maybe you can rephrase in a way that is easier to digest, and rather focuses on what callers of this function are supposed to do vs. have the liberty of not doing? 
-- Cheers David / dhildenb From rppt at kernel.org Mon Aug 25 16:58:10 2025 From: rppt at kernel.org (Mike Rapoport) Date: Mon, 25 Aug 2025 19:58:10 +0300 Subject: update kernel-doc for MEMBLOCK_RSRV_NOINIT (was: Re: [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()) In-Reply-To: References: <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote: > > I don't quite understand the interaction with PG_Reserved and why anybody > using this function should care. > > So maybe you can rephrase in a way that is easier to digest, and rather > focuses on what callers of this function are supposed to do vs. have the > liberty of not doing? How about diff --git a/include/linux/memblock.h b/include/linux/memblock.h index b96746376e17..fcda8481de9a 100644 --- a/include/linux/memblock.h +++ b/include/linux/memblock.h @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn; * via a driver, and never indicated in the firmware-provided memory map as * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the * kernel resource tree. - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are - * not initialized (only for reserved regions). + * @MEMBLOCK_RSRV_NOINIT: reserved memory region for which struct pages are not + * fully initialized. Users of this flag are responsible to properly initialize + * struct pages of this region * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, * either explictitly with memblock_reserve_kern() or via memblock * allocation APIs. All memblock allocations set this flag. 
diff --git a/mm/memblock.c b/mm/memblock.c index 154f1d73b61f..46b411fb3630 100644 --- a/mm/memblock.c +++ b/mm/memblock.c @@ -1091,13 +1091,20 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) /** * memblock_reserved_mark_noinit - Mark a reserved memory region with flag - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized - * for this region. + * MEMBLOCK_RSRV_NOINIT + * * @base: the base phys addr of the region * @size: the size of the region * - * struct pages will not be initialized for reserved memory regions marked with - * %MEMBLOCK_RSRV_NOINIT. + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will + * not be fully initialized to allow the caller optimize their initialization. + * + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flag + * completely bypasses the initialization of struct pages for such region. + * + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, struct pages in this + * region will be initialized with default values but won't be marked as + * reserved. * * Return: 0 on success, -errno on failure. */ > -- > Cheers > > David / dhildenb > -- Sincerely yours, Mike. From david at redhat.com Mon Aug 25 18:32:27 2025 From: david at redhat.com (David Hildenbrand) Date: Mon, 25 Aug 2025 20:32:27 +0200 Subject: update kernel-doc for MEMBLOCK_RSRV_NOINIT In-Reply-To: References: <9156d191-9ec4-4422-bae9-2e8ce66f9d5e@redhat.com> <7077e09f-6ce9-43ba-8f87-47a290680141@redhat.com> Message-ID: <7ffd0abd-27a1-40a8-b538-9a01e21abb29@redhat.com> On 25.08.25 18:58, Mike Rapoport wrote: > On Mon, Aug 25, 2025 at 06:23:48PM +0200, David Hildenbrand wrote: >> >> I don't quite understand the interaction with PG_Reserved and why anybody >> using this function should care. >> >> So maybe you can rephrase in a way that is easier to digest, and rather >> focuses on what callers of this function are supposed to do vs. have the >> liberty of not doing? 
> > How about > > diff --git a/include/linux/memblock.h b/include/linux/memblock.h > index b96746376e17..fcda8481de9a 100644 > --- a/include/linux/memblock.h > +++ b/include/linux/memblock.h > @@ -40,8 +40,9 @@ extern unsigned long long max_possible_pfn; > * via a driver, and never indicated in the firmware-provided memory map as > * system RAM. This corresponds to IORESOURCE_SYSRAM_DRIVER_MANAGED in the > * kernel resource tree. > - * @MEMBLOCK_RSRV_NOINIT: memory region for which struct pages are > - * not initialized (only for reserved regions). > + * @MEMBLOCK_RSRV_NOINIT: reserved memory region for which struct pages are not > + * fully initialized. Users of this flag are responsible to properly initialize > + * struct pages of this region > * @MEMBLOCK_RSRV_KERN: memory region that is reserved for kernel use, > * either explictitly with memblock_reserve_kern() or via memblock > * allocation APIs. All memblock allocations set this flag. > diff --git a/mm/memblock.c b/mm/memblock.c > index 154f1d73b61f..46b411fb3630 100644 > --- a/mm/memblock.c > +++ b/mm/memblock.c > @@ -1091,13 +1091,20 @@ int __init_memblock memblock_clear_nomap(phys_addr_t base, phys_addr_t size) > > /** > * memblock_reserved_mark_noinit - Mark a reserved memory region with flag > - * MEMBLOCK_RSRV_NOINIT which results in the struct pages not being initialized > - * for this region. > + * MEMBLOCK_RSRV_NOINIT > + * > * @base: the base phys addr of the region > * @size: the size of the region > * > - * struct pages will not be initialized for reserved memory regions marked with > - * %MEMBLOCK_RSRV_NOINIT. > + * The struct pages for the reserved regions marked %MEMBLOCK_RSRV_NOINIT will > + * not be fully initialized to allow the caller optimize their initialization. > + * > + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, setting this flag > + * completely bypasses the initialization of struct pages for such region. 
> + * > + * When %CONFIG_DEFERRED_STRUCT_PAGE_INIT is disabled, struct pages in this > + * region will be initialized with default values but won't be marked as > + * reserved. Sounds good. I am surprised regarding "reserved", but I guess that's because we don't end up calling "reserve_bootmem_region()" on these regions in memmap_init_reserved_pages(). -- Cheers David / dhildenb From david at redhat.com Tue Aug 26 11:04:33 2025 From: david at redhat.com (David Hildenbrand) Date: Tue, 26 Aug 2025 13:04:33 +0200 Subject: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-22-david@redhat.com> Message-ID: >> >> pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", >> - __func__, pfn, pfn_to_page(pfn)); >> + __func__, pfn, page); >> >> trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), > > Nitpick: I think you already have the page here. Indeed, forgot to clean that up as well. > >> count, align); >> - /* try again with a bit different memory target */ >> - start = bitmap_no + mask + 1; >> } >> out: >> - *pagep = page; >> + if (!ret) >> + *pagep = page; >> return ret; >> } >> >> @@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, >> */ >> if (page) { >> for (i = 0; i < count; i++) >> - page_kasan_tag_reset(nth_page(page, i)); >> + page_kasan_tag_reset(page + i); > > Had a look at it, not very familiar with CMA, but the changes look equivalent to > what was before. Not sure that's worth a Reviewed-by tag, but here it in case > you want to add it: > > Reviewed-by: Alexandru Elisei Thanks! > > Just so I can better understand the problem being fixed, I guess you can have > two consecutive pfns with non-consecutive associated struct page if you have two > adjacent memory sections spanning the same physical memory region, is that > correct? Exactly. 
Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not guaranteed that pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1 when we cross memory section boundaries. It can be the case for early boot memory if we allocated consecutive areas from memblock when allocating the memmap (struct pages) per memory section, but it's not guaranteed. So we rule out that case. -- Cheers David / dhildenb From david at redhat.com Tue Aug 26 13:08:08 2025 From: david at redhat.com (David Hildenbrand) Date: Tue, 26 Aug 2025 15:08:08 +0200 Subject: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-22-david@redhat.com> Message-ID: On 26.08.25 15:03, Alexandru Elisei wrote: > Hi David, > > On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote: > .. >>> Just so I can better understand the problem being fixed, I guess you can have >>> two consecutive pfns with non-consecutive associated struct page if you have two >>> adjacent memory sections spanning the same physical memory region, is that >>> correct? >> >> Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not >> guaranteed that >> >> pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1 >> >> when we cross memory section boundaries. >> >> It can be the case for early boot memory if we allocated consecutive areas >> from memblock when allocating the memmap (struct pages) per memory section, >> but it's not guaranteed. > > Thank you for the explanation, but I'm a bit confused by the last paragraph. I > think what you're saying is that we can also have the reverse problem, where > consecutive struct page * represent non-consecutive pfns, because memmap > allocations happened to return consecutive virtual addresses, is that right? Exactly, that's something we have to deal with elsewhere [1]. For this code, it's not a problem because we always allocate a contiguous PFN range. 
> > If that's correct, I don't think that's the case for CMA, which deals out > contiguous physical memory. Or were you just trying to explain the other side of > the problem, and I'm just overthinking it? The latter :) [1] https://lkml.kernel.org/r/20250814064714.56485-2-lizhe.67 at bytedance.com -- Cheers David / dhildenb From david at redhat.com Wed Aug 27 22:01:10 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:10 +0200 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-7-david@redhat.com> Let's reject them early, which in turn makes folio_alloc_gigantic() reject them properly. To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER and calculate MAX_FOLIO_NR_PAGES based on that. Reviewed-by: Zi Yan Acked-by: SeongJae Park Signed-off-by: David Hildenbrand --- include/linux/mm.h | 6 ++++-- mm/page_alloc.c | 5 ++++- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 00c8a54127d37..77737cbf2216a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) +#define MAX_FOLIO_ORDER PUD_ORDER #else -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER #endif +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) + /* * compound_nr() returns the number of pages in this potentially compound * page. 
compound_nr() can be called on a tail page, and is defined to diff --git a/mm/page_alloc.c b/mm/page_alloc.c index baead29b3e67b..426bc404b80cc 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) int alloc_contig_range_noprof(unsigned long start, unsigned long end, acr_flags_t alloc_flags, gfp_t gfp_mask) { + const unsigned int order = ilog2(end - start); unsigned long outer_start, outer_end; int ret = 0; @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, PB_ISOLATE_MODE_CMA_ALLOC : PB_ISOLATE_MODE_OTHER; + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) + return -EINVAL; + gfp_mask = current_gfp_context(gfp_mask); if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) return -EINVAL; @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, free_contig_range(end, outer_end - end); } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { struct page *head = pfn_to_page(start); - int order = ilog2(end - start); check_new_pages(head, order); prep_new_page(head, order, gfp_mask, 0); -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:11 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:11 +0200 Subject: [PATCH v1 07/36] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-8-david@redhat.com> Let's reject unreasonable folio sizes early, where we can still fail. We'll add sanity checks to prepare_compound_head/prepare_compound_page next. Is there a way to configure a system such that unreasonable folio sizes would be possible? It would already be rather questionable. 
If so, we'd probably want to bail out earlier, where we can avoid a WARN and just report a proper error message that indicates what went wrong. Acked-by: SeongJae Park Signed-off-by: David Hildenbrand --- mm/memremap.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/mm/memremap.c b/mm/memremap.c index b0ce0d8254bd8..a2d4bb88f64b6 100644 --- a/mm/memremap.c +++ b/mm/memremap.c @@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) return ERR_PTR(-EINVAL); + if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER, + "requested folio size unsupported\n")) + return ERR_PTR(-EINVAL); switch (pgmap->type) { case MEMORY_DEVICE_PRIVATE: -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:12 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:12 +0200 Subject: [PATCH v1 08/36] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-9-david@redhat.com> Let's check that no hstate that corresponds to an unreasonable folio size is registered by an architecture. If we were to succeed registering, we could later try allocating an unsupported gigantic folio size. Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have to use a BUILD_BUG_ON_INVALID() to make it compile. No existing kernel configuration should be able to trigger this check: either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or gigantic folios will not exceed a memory section (the case on sh).
Signed-off-by: David Hildenbrand --- mm/hugetlb.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 572b6f7772841..4a97e4f14c0dc 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void) BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE < __NR_HPAGEFLAGS); + BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER); if (!hugepages_supported()) { if (hugetlb_max_hstate || default_hstate_max_huge_pages) @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order) } BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); BUG_ON(order < order_base_2(__NR_USED_SUBPAGE)); + WARN_ON(order > MAX_FOLIO_ORDER); h = &hstates[hugetlb_max_hstate++]; __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); h->order = order; -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:13 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:13 +0200 Subject: [PATCH v1 09/36] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-10-david@redhat.com> Grepping for "prep_compound_page" leaves one clueless as to how devdax gets its compound pages initialized. Let's add a comment that might help finding this open-coded prep_compound_page() initialization more easily. Further, let's be less smart about the ordering of initialization and just perform the prep_compound_head() call after all tail pages were initialized: just like prep_compound_page() does. No need for a comment to describe the initialization order: again, just like prep_compound_page().
Reviewed-by: Mike Rapoport (Microsoft) Signed-off-by: David Hildenbrand --- mm/mm_init.c | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/mm/mm_init.c b/mm/mm_init.c index 5c21b3af216b2..df614556741a4 100644 --- a/mm/mm_init.c +++ b/mm/mm_init.c @@ -1091,6 +1091,12 @@ static void __ref memmap_init_compound(struct page *head, unsigned long pfn, end_pfn = head_pfn + nr_pages; unsigned int order = pgmap->vmemmap_shift; + /* + * We have to initialize the pages, including setting up page links. + * prep_compound_page() does not take care of that, so instead we + * open-code prep_compound_page() so we can take care of initializing + * the pages in the same go. + */ __SetPageHead(head); for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { struct page *page = pfn_to_page(pfn); @@ -1098,15 +1104,8 @@ static void __ref memmap_init_compound(struct page *head, __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); prep_compound_tail(head, pfn - head_pfn); set_page_count(page, 0); - - /* - * The first tail page stores important compound page info. - * Call prep_compound_head() after the first tail page has - * been initialized, to not have the data overwritten. - */ - if (pfn == head_pfn + 1) - prep_compound_head(head, order); } + prep_compound_head(head, order); } void __ref memmap_init_zone_device(struct zone *zone, -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:14 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:14 +0200 Subject: [PATCH v1 10/36] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-11-david@redhat.com> Let's sanity-check in folio_set_order() whether we would be trying to create a folio with an order that would make it exceed MAX_FOLIO_ORDER. 
This will enable the check whenever a folio/compound page is initialized through prepare_compound_head() / prepare_compound_page(). Reviewed-by: Zi Yan Signed-off-by: David Hildenbrand --- mm/internal.h | 1 + 1 file changed, 1 insertion(+) diff --git a/mm/internal.h b/mm/internal.h index 45da9ff5694f6..9b0129531d004 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) { if (WARN_ON_ONCE(!order || !folio_test_large(folio))) return; + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER); folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; #ifdef NR_PAGES_IN_LARGE_FOLIO -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:15 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:15 +0200 Subject: [PATCH v1 11/36] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-12-david@redhat.com> Let's limit the maximum folio size in problematic kernel configs where the memmap is allocated per memory section (SPARSEMEM without SPARSEMEM_VMEMMAP) to a single memory section. Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE but not SPARSEMEM_VMEMMAP: sh. Fortunately, the biggest hugetlb size sh supports is 64 MiB (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB (SECTION_SIZE_BITS == 26), so their use case is not degraded. As folios and memory sections are naturally aligned to their order-2 size in memory, a single folio can consequently no longer span multiple memory sections on these problematic kernel configs. nth_page() is no longer required when operating within a single compound page / folio.
Reviewed-by: Zi Yan Acked-by: Mike Rapoport (Microsoft) Signed-off-by: David Hildenbrand --- include/linux/mm.h | 22 ++++++++++++++++++---- 1 file changed, 18 insertions(+), 4 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 77737cbf2216a..2dee79fa2efcf 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) return folio_large_nr_pages(folio); } -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE -#define MAX_FOLIO_ORDER PUD_ORDER -#else +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) +/* + * We don't expect any folios that exceed buddy sizes (and consequently + * memory sections). + */ #define MAX_FOLIO_ORDER MAX_PAGE_ORDER +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +/* + * Only pages within a single memory section are guaranteed to be + * contiguous. By limiting folios to a single memory section, all folio + * pages are guaranteed to be contiguous. + */ +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT +#else +/* + * There is no real limit on the folio size. We limit them to the maximum we + * currently expect (e.g., hugetlb, dax). + */ +#define MAX_FOLIO_ORDER PUD_ORDER #endif #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:16 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:16 +0200 Subject: [PATCH v1 12/36] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-13-david@redhat.com> Now that a single folio/compound page can no longer span memory sections in problematic kernel configurations, we can stop using nth_page(). While at it, turn both macros into static inline functions and add kernel doc for folio_page_idx(). 
Reviewed-by: Zi Yan Signed-off-by: David Hildenbrand --- include/linux/mm.h | 16 ++++++++++++++-- include/linux/page-flags.h | 5 ++++- 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2dee79fa2efcf..f6880e3225c5c 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) #else #define nth_page(page,n) ((page) + (n)) -#define folio_page_idx(folio, p) ((p) - &(folio)->page) #endif /* to align the pointer to the (next) page boundary */ @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) +/** + * folio_page_idx - Return the number of a page in a folio. + * @folio: The folio. + * @page: The folio page. + * + * This function expects that the page is actually part of the folio. + * The returned number is relative to the start of the folio. + */ +static inline unsigned long folio_page_idx(const struct folio *folio, + const struct page *page) +{ + return page - &folio->page; +} + static inline struct folio *lru_to_folio(struct list_head *head) { return list_entry((head)->prev, struct folio, lru); diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h index 5ee6ffbdbf831..faf17ca211b4f 100644 --- a/include/linux/page-flags.h +++ b/include/linux/page-flags.h @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) * check that the page number lies within @folio; the caller is presumed * to have a reference to the page. 
*/ -#define folio_page(folio, n) nth_page(&(folio)->page, n) +static inline struct page *folio_page(struct folio *folio, unsigned long n) +{ + return &folio->page + n; +} static __always_inline int PageTail(const struct page *page) { -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:17 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:17 +0200 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-14-david@redhat.com> We can now safely iterate over all pages in a folio, so no need for the pfn_to_page(). Also, as we already force the refcount in __init_single_page() to 1, we can just set the refcount to 0 and avoid page_ref_freeze() + VM_BUG_ON. Likely, in the future, we would just want to tell __init_single_page() to which value to initialize the refcount. Further, adjust the comments to highlight that we are dealing with an open-coded prep_compound_page() variant, and add another comment explaining why we really need the __init_single_page() only on the tail pages. Note that the current code was likely problematic, but we never ran into it: prep_compound_tail() would have been called with an offset that might exceed a memory section, and prep_compound_tail() would have simply added that offset to the page pointer -- which would not have done the right thing on sparsemem without vmemmap. 
Signed-off-by: David Hildenbrand --- mm/hugetlb.c | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 4a97e4f14c0dc..1f42186a85ea4 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, { enum zone_type zone = zone_idx(folio_zone(folio)); int nid = folio_nid(folio); + struct page *page = folio_page(folio, start_page_number); unsigned long head_pfn = folio_pfn(folio); unsigned long pfn, end_pfn = head_pfn + end_page_number; - int ret; - - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { - struct page *page = pfn_to_page(pfn); + /* + * We mark all tail pages with memblock_reserved_mark_noinit(), + * so these pages are completely uninitialized. + */ + for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) { __init_single_page(page, pfn, zone, nid); prep_compound_tail((struct page *)folio, pfn - head_pfn); - ret = page_ref_freeze(page, 1); - VM_BUG_ON(!ret); + set_page_count(page, 0); } } @@ -3257,12 +3258,15 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, { int ret; - /* Prepare folio head */ + /* + * This is an open-coded prep_compound_page() whereby we avoid + * walking pages twice by initializing/preparing+freezing them in the + * same go. 
+ */ __folio_clear_reserved(folio); __folio_set_head(folio); ret = folio_ref_freeze(folio, 1); VM_BUG_ON(!ret); - /* Initialize the necessary tail struct pages */ hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); prep_compound_head((struct page *)folio, huge_page_order(h)); } -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:18 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:18 +0200 Subject: [PATCH v1 14/36] mm/mm/percpu-km: drop nth_page() usage within single allocation In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-15-david@redhat.com> We're allocating a higher-order page from the buddy. For these pages (that are guaranteed to not exceed a single memory section) there is no need to use nth_page(). Signed-off-by: David Hildenbrand --- mm/percpu-km.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/percpu-km.c b/mm/percpu-km.c index fe31aa19db81a..4efa74a495cb6 100644 --- a/mm/percpu-km.c +++ b/mm/percpu-km.c @@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) } for (i = 0; i < nr_pages; i++) - pcpu_set_page_chunk(nth_page(pages, i), chunk); + pcpu_set_page_chunk(pages + i, chunk); chunk->data = pages; chunk->base_addr = page_address(pages); -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:19 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:19 +0200 Subject: [PATCH v1 15/36] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-16-david@redhat.com> The nth_page() is not really required anymore, so let's remove it. While at it, cleanup and simplify the code a bit. 
Signed-off-by: David Hildenbrand --- fs/hugetlbfs/inode.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index 34d496a2b7de6..c5a46d10afaa0 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -217,7 +217,7 @@ static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, break; offset += n; if (offset == PAGE_SIZE) { - page = nth_page(page, 1); + page++; offset = 0; } } -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:20 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:20 +0200 Subject: [PATCH v1 16/36] fs: hugetlbfs: cleanup folio in adjust_range_hwpoison() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-17-david@redhat.com> Let's cleanup and simplify the function a bit. Signed-off-by: David Hildenbrand --- fs/hugetlbfs/inode.c | 33 +++++++++++---------------------- 1 file changed, 11 insertions(+), 22 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index c5a46d10afaa0..6ca1f6b45c1e5 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -198,31 +198,20 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, size_t bytes) { - struct page *page; - size_t n = 0; - size_t res = 0; - - /* First page to start the loop. */ - page = folio_page(folio, offset / PAGE_SIZE); - offset %= PAGE_SIZE; - while (1) { - if (is_raw_hwpoison_page_in_hugepage(page)) - break; + struct page *page = folio_page(folio, offset / PAGE_SIZE); + size_t safe_bytes; + + if (is_raw_hwpoison_page_in_hugepage(page)) + return 0; + /* Safe to read the remaining bytes in this page. */ + safe_bytes = PAGE_SIZE - (offset % PAGE_SIZE); + page++; - /* Safe to read n bytes without touching HWPOISON subpage. 
*/ - n = min(bytes, (size_t)PAGE_SIZE - offset); - res += n; - bytes -= n; - if (!bytes || !n) + for (; safe_bytes < bytes; safe_bytes += PAGE_SIZE, page++) + if (is_raw_hwpoison_page_in_hugepage(page)) break; - offset += n; - if (offset == PAGE_SIZE) { - page++; - offset = 0; - } - } - return res; + return min(safe_bytes, bytes); } /* -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:21 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:21 +0200 Subject: [PATCH v1 17/36] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-18-david@redhat.com> It's no longer required to use nth_page() within a folio, so let's just drop the nth_page() in folio_walk_start(). Signed-off-by: David Hildenbrand --- mm/pagewalk.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/mm/pagewalk.c b/mm/pagewalk.c index c6753d370ff4e..9e4225e5fcf5c 100644 --- a/mm/pagewalk.c +++ b/mm/pagewalk.c @@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, found: if (expose_page) /* Note: Offset from the mapped page, not the folio start. */ - fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT); + fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT); else fw->page = NULL; fw->ptl = ptl; -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:22 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:22 +0200 Subject: [PATCH v1 18/36] mm/gup: drop nth_page() usage within folio when recording subpages In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-19-david@redhat.com> nth_page() is no longer required when iterating over pages within a single folio, so let's just drop it when recording subpages. 
Signed-off-by: David Hildenbrand --- mm/gup.c | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/mm/gup.c b/mm/gup.c index b2a78f0291273..89ca0813791ab 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -488,12 +488,11 @@ static int record_subpages(struct page *page, unsigned long sz, unsigned long addr, unsigned long end, struct page **pages) { - struct page *start_page; int nr; - start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT); + page += (addr & (sz - 1)) >> PAGE_SHIFT; for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) - pages[nr] = nth_page(start_page, nr); + pages[nr] = page++; return nr; } @@ -1512,7 +1511,7 @@ static long __get_user_pages(struct mm_struct *mm, } for (j = 0; j < page_increm; j++) { - subpage = nth_page(page, j); + subpage = page + j; pages[i + j] = subpage; flush_anon_page(vma, subpage, start + j * PAGE_SIZE); flush_dcache_page(subpage); -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:23 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:23 +0200 Subject: [PATCH v1 19/36] io_uring/zcrx: remove nth_page() usage within folio In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-20-david@redhat.com> Within a folio/compound page, nth_page() is no longer required. Given that we call folio_test_partial_kmap()+kmap_local_page(), the code would already be problematic if the pages would span multiple folios. So let's just assume that all src pages belong to a single folio/compound page and can be iterated ordinarily. The dst page is currently always a single page, so we're not actually iterating anything. 
Reviewed-by: Pavel Begunkov Cc: Jens Axboe Cc: Pavel Begunkov Signed-off-by: David Hildenbrand --- io_uring/zcrx.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index e5ff49f3425e0..18c12f4b56b6c 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, if (folio_test_partial_kmap(page_folio(dst_page)) || folio_test_partial_kmap(page_folio(src_page))) { - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); + dst_page += dst_offset / PAGE_SIZE; dst_offset = offset_in_page(dst_offset); - src_page = nth_page(src_page, src_offset / PAGE_SIZE); + src_page += src_offset / PAGE_SIZE; src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); n = min(n, len); -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:24 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:24 +0200 Subject: [PATCH v1 20/36] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-21-david@redhat.com> Let's make it clearer that we are operating within a single folio by providing both the folio and the page. This implies that for flush_dcache_folio() we'll now avoid one more page->folio lookup, and that we can safely drop the "nth_page" usage. 
Cc: Thomas Bogendoerfer Signed-off-by: David Hildenbrand --- arch/mips/include/asm/cacheflush.h | 11 +++++++---- arch/mips/mm/cache.c | 8 ++++---- 2 files changed, 11 insertions(+), 8 deletions(-) diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h index 5d283ef89d90d..8d79bfc687d21 100644 --- a/arch/mips/include/asm/cacheflush.h +++ b/arch/mips/include/asm/cacheflush.h @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); extern void (*flush_cache_range)(struct vm_area_struct *vma, unsigned long start, unsigned long end); extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); -extern void __flush_dcache_pages(struct page *page, unsigned int nr); +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 static inline void flush_dcache_folio(struct folio *folio) { if (cpu_has_dc_aliases) - __flush_dcache_pages(&folio->page, folio_nr_pages(folio)); + __flush_dcache_folio_pages(folio, folio_page(folio, 0), + folio_nr_pages(folio)); else if (!cpu_has_ic_fills_f_dc) folio_set_dcache_dirty(folio); } @@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio) static inline void flush_dcache_page(struct page *page) { + struct folio *folio = page_folio(page); + if (cpu_has_dc_aliases) - __flush_dcache_pages(page, 1); + __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio)); else if (!cpu_has_ic_fills_f_dc) - folio_set_dcache_dirty(page_folio(page)); + folio_set_dcache_dirty(folio); } #define flush_dcache_mmap_lock(mapping) do { } while (0) diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c index bf9a37c60e9f0..e3b4224c9a406 100644 --- a/arch/mips/mm/cache.c +++ b/arch/mips/mm/cache.c @@ -99,9 +99,9 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes, return 0; } -void __flush_dcache_pages(struct page *page, unsigned int nr) +void 
__flush_dcache_folio_pages(struct folio *folio, struct page *page, + unsigned int nr) { - struct folio *folio = page_folio(page); struct address_space *mapping = folio_flush_mapping(folio); unsigned long addr; unsigned int i; @@ -117,12 +117,12 @@ void __flush_dcache_pages(struct page *page, unsigned int nr) * get faulted into the tlb (and thus flushed) anyways. */ for (i = 0; i < nr; i++) { - addr = (unsigned long)kmap_local_page(nth_page(page, i)); + addr = (unsigned long)kmap_local_page(page + i); flush_data_cache_page(addr); kunmap_local((void *)addr); } } -EXPORT_SYMBOL(__flush_dcache_pages); +EXPORT_SYMBOL(__flush_dcache_folio_pages); void __flush_anon_page(struct page *page, unsigned long vmaddr) { -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:25 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:25 +0200 Subject: [PATCH v1 21/36] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-22-david@redhat.com> Let's disallow handing out PFN ranges with non-contiguous pages, so we can remove the nth-page usage in __cma_alloc(), and so any callers don't have to worry about that either when wanting to blindly iterate pages. This is really only a problem in configs with SPARSEMEM but without SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some cases. Will this cause harm? Probably not, because it's mostly 32bit that does not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could look into allocating the memmap for the memory sections spanned by a single CMA region in one go from memblock. 
Reviewed-by: Alexandru Elisei Signed-off-by: David Hildenbrand --- include/linux/mm.h | 6 ++++++ mm/cma.c | 39 ++++++++++++++++++++++++--------------- mm/util.c | 33 +++++++++++++++++++++++++++++++++ 3 files changed, 63 insertions(+), 15 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index f6880e3225c5c..2ca1eb2db63ec 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; extern unsigned long sysctl_admin_reserve_kbytes; #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #else #define nth_page(page,n) ((page) + (n)) +static inline bool page_range_contiguous(const struct page *page, + unsigned long nr_pages) +{ + return true; +} #endif /* to align the pointer to the (next) page boundary */ diff --git a/mm/cma.c b/mm/cma.c index e56ec64d0567e..813e6dc7b0954 100644 --- a/mm/cma.c +++ b/mm/cma.c @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, unsigned long count, unsigned int align, struct page **pagep, gfp_t gfp) { - unsigned long mask, offset; - unsigned long pfn = -1; - unsigned long start = 0; unsigned long bitmap_maxno, bitmap_no, bitmap_count; + unsigned long start, pfn, mask, offset; int ret = -EBUSY; struct page *page = NULL; @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, if (bitmap_count > bitmap_maxno) goto out; - for (;;) { + for (start = 0; ; start = bitmap_no + mask + 1) { spin_lock_irq(&cma->lock); /* * If the request is larger than the available number @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, spin_unlock_irq(&cma->lock); break; } + + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); + page = pfn_to_page(pfn); + + /* + * Do not hand out page ranges that are not 
contiguous, so + * callers can just iterate the pages without having to worry + * about these corner cases. + */ + if (!page_range_contiguous(page, count)) { + spin_unlock_irq(&cma->lock); + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", + __func__, cma->name, pfn, pfn + count - 1); + continue; + } + bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); cma->available_count -= count; /* @@ -821,29 +835,24 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, */ spin_unlock_irq(&cma->lock); - pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); mutex_lock(&cma->alloc_mutex); ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); mutex_unlock(&cma->alloc_mutex); - if (ret == 0) { - page = pfn_to_page(pfn); + if (!ret) break; - } cma_clear_bitmap(cma, cmr, pfn, count); if (ret != -EBUSY) break; pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", - __func__, pfn, pfn_to_page(pfn)); + __func__, pfn, page); - trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), - count, align); - /* try again with a bit different memory target */ - start = bitmap_no + mask + 1; + trace_cma_alloc_busy_retry(cma->name, pfn, page, count, align); } out: - *pagep = page; + if (!ret) + *pagep = page; return ret; } @@ -882,7 +891,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, */ if (page) { for (i = 0; i < count; i++) - page_kasan_tag_reset(nth_page(page, i)); + page_kasan_tag_reset(page + i); } if (ret && !(gfp & __GFP_NOWARN)) { diff --git a/mm/util.c b/mm/util.c index d235b74f7aff7..0bf349b19b652 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, { return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); } + +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) +/** + * page_range_contiguous - test whether the page range is contiguous + * @page: the start of the page range. 
+ * @nr_pages: the number of pages in the range. + * + * Test whether the page range is contiguous, such that they can be iterated + * naively, corresponding to iterating a contiguous PFN range. + * + * This function should primarily only be used for debug checks, or when + * working with page ranges that are not naturally contiguous (e.g., pages + * within a folio are). + * + * Returns true if contiguous, otherwise false. + */ +bool page_range_contiguous(const struct page *page, unsigned long nr_pages) +{ + const unsigned long start_pfn = page_to_pfn(page); + const unsigned long end_pfn = start_pfn + nr_pages; + unsigned long pfn; + + /* + * The memmap is allocated per memory section. We need to check + * each involved memory section once. + */ + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); + pfn < end_pfn; pfn += PAGES_PER_SECTION) + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) + return false; + return true; +} +#endif #endif /* CONFIG_MMU */ -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:26 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:26 +0200 Subject: [PATCH v1 22/36] dma-remap: drop nth_page() in dma_common_contiguous_remap() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-23-david@redhat.com> dma_common_contiguous_remap() is used to remap an "allocated contiguous region". Within a single allocation, there is no need to use nth_page() anymore. Neither the buddy, nor hugetlb, nor CMA will hand out problematic page ranges. 
Acked-by: Marek Szyprowski Cc: Marek Szyprowski Cc: Robin Murphy Signed-off-by: David Hildenbrand --- kernel/dma/remap.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c index 9e2afad1c6152..b7c1c0c92d0c8 100644 --- a/kernel/dma/remap.c +++ b/kernel/dma/remap.c @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size, if (!pages) return NULL; for (i = 0; i < count; i++) - pages[i] = nth_page(page, i); + pages[i] = page++; vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); kvfree(pages); -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:27 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:27 +0200 Subject: [PATCH v1 23/36] scatterlist: disallow non-contiguous page ranges in a single SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-24-david@redhat.com> The expectation is that there is currently no user that would pass in non-contiguous page ranges: no allocator, not even VMA, will hand these out. The only problematic part would be if someone would provide a range obtained directly from memblock, or manually merge problematic ranges. If we find such cases, we should fix them to create separate SG entries. Let's check in sg_set_page() that this is really the case. No need to check in sg_set_folio(), as pages in a folio are guaranteed to be contiguous. As sg_set_page() gets inlined into modules, we have to export the page_range_contiguous() helper -- use EXPORT_SYMBOL, there is nothing special about this helper such that we would want to enforce GPL-only modules. We can now drop the nth_page() usage in sg_page_iter_page(). 
Acked-by: Marek Szyprowski Signed-off-by: David Hildenbrand --- include/linux/scatterlist.h | 3 ++- mm/util.c | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h index 6f8a4965f9b98..29f6ceb98d74b 100644 --- a/include/linux/scatterlist.h +++ b/include/linux/scatterlist.h @@ -158,6 +158,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) static inline void sg_set_page(struct scatterlist *sg, struct page *page, unsigned int len, unsigned int offset) { + VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE)); sg_assign_page(sg, page); sg->offset = offset; sg->length = len; @@ -600,7 +601,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter, */ static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) { - return nth_page(sg_page(piter->sg), piter->sg_pgoffset); + return sg_page(piter->sg) + piter->sg_pgoffset; } /** diff --git a/mm/util.c b/mm/util.c index 0bf349b19b652..e8b9da6b13230 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1312,5 +1312,6 @@ bool page_range_contiguous(const struct page *page, unsigned long nr_pages) return false; return true; } +EXPORT_SYMBOL(page_range_contiguous); #endif #endif /* CONFIG_MMU */ -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:37 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:37 +0200 Subject: [PATCH v1 33/36] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-34-david@redhat.com> There is the concern that unpin_user_page_range_dirty_lock() might do some weird merging of PFN ranges -- either now or in the future -- such that PFN range is contiguous but the page range might not be. Let's sanity-check for that and drop the nth_page() usage. 
Signed-off-by: David Hildenbrand --- mm/gup.c | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/mm/gup.c b/mm/gup.c index 89ca0813791ab..c24f6009a7a44 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -237,7 +237,7 @@ void folio_add_pin(struct folio *folio) static inline struct folio *gup_folio_range_next(struct page *start, unsigned long npages, unsigned long i, unsigned int *ntails) { - struct page *next = nth_page(start, i); + struct page *next = start + i; struct folio *folio = page_folio(next); unsigned int nr = 1; @@ -342,6 +342,9 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock); * "gup-pinned page range" refers to a range of pages that has had one of the * pin_user_pages() variants called on that page. * + * The page range must be truly contiguous: the page range corresponds + * to a contiguous PFN range and all pages can be iterated naturally. + * * For the page ranges defined by [page .. page+npages], make that range (or * its head pages, if a compound page) dirty, if @make_dirty is true, and if the * page range was previously listed as clean. @@ -359,6 +362,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, struct folio *folio; unsigned int nr; + VM_WARN_ON_ONCE(!page_range_contiguous(page, npages)); + for (i = 0; i < npages; i += nr) { folio = gup_folio_range_next(page, npages, i, &nr); if (make_dirty && !folio_test_dirty(folio)) { -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:38 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:38 +0200 Subject: [PATCH v1 34/36] kfence: drop nth_page() usage In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-35-david@redhat.com> We want to get rid of nth_page(), and kfence init code is the last user. 
Unfortunately, we might actually walk a PFN range where the pages are not contiguous, because we might be allocating an area from memblock that could span memory sections in problematic kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP). We could check whether the page range is contiguous using page_range_contiguous() and fail kfence init, or make kfence incompatible with these problematic kernel configs. Let's keep it simple and just use pfn_to_page() by iterating PFNs. Cc: Alexander Potapenko Cc: Marco Elver Cc: Dmitry Vyukov Signed-off-by: David Hildenbrand --- mm/kfence/core.c | 12 +++++++----- 1 file changed, 7 insertions(+), 5 deletions(-) diff --git a/mm/kfence/core.c b/mm/kfence/core.c index 0ed3be100963a..727c20c94ac59 100644 --- a/mm/kfence/core.c +++ b/mm/kfence/core.c @@ -594,15 +594,14 @@ static void rcu_guarded_free(struct rcu_head *h) */ static unsigned long kfence_init_pool(void) { - unsigned long addr; - struct page *pages; + unsigned long addr, start_pfn; int i; if (!arch_kfence_init_pool()) return (unsigned long)__kfence_pool; addr = (unsigned long)__kfence_pool; - pages = virt_to_page(__kfence_pool); + start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool)); /* * Set up object pages: they must have PGTY_slab set to avoid freeing @@ -613,11 +612,12 @@ static unsigned long kfence_init_pool(void) * enters __slab_free() slow-path. 
*/ for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { - struct slab *slab = page_slab(nth_page(pages, i)); + struct slab *slab; if (!i || (i % 2)) continue; + slab = page_slab(pfn_to_page(start_pfn + i)); __folio_set_slab(slab_folio(slab)); #ifdef CONFIG_MEMCG slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts | @@ -665,10 +665,12 @@ static unsigned long kfence_init_pool(void) reset_slab: for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { - struct slab *slab = page_slab(nth_page(pages, i)); + struct slab *slab; if (!i || (i % 2)) continue; + + slab = page_slab(pfn_to_page(start_pfn + i)); #ifdef CONFIG_MEMCG slab->obj_exts = 0; #endif -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:39 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:39 +0200 Subject: [PATCH v1 35/36] block: update comment of "struct bio_vec" regarding nth_page() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-36-david@redhat.com> Ever since commit 858c708d9efb ("block: move the bi_size update out of __bio_try_merge_page"), page_is_mergeable() no longer exists, and the logic in bvec_try_merge_page() is now a simple page pointer comparison. Signed-off-by: David Hildenbrand --- include/linux/bvec.h | 7 ++----- 1 file changed, 2 insertions(+), 5 deletions(-) diff --git a/include/linux/bvec.h b/include/linux/bvec.h index 0a80e1f9aa201..3fc0efa0825b1 100644 --- a/include/linux/bvec.h +++ b/include/linux/bvec.h @@ -22,11 +22,8 @@ struct page; * @bv_len: Number of bytes in the address range. * @bv_offset: Start of the address range relative to the start of @bv_page. * - * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len: - * - * nth_page(@bv_page, n) == @bv_page + n - * - * This holds because page_is_mergeable() checks the above property. 
+ * All pages within a bio_vec starting from @bv_page are contiguous and + * can simply be iterated (see bvec_advance()). */ struct bio_vec { struct page *bv_page; -- 2.50.1 From david at redhat.com Wed Aug 27 22:01:40 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 00:01:40 +0200 Subject: [PATCH v1 36/36] mm: remove nth_page() In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-37-david@redhat.com> Now that all users are gone, let's remove it. Signed-off-by: David Hildenbrand --- include/linux/mm.h | 2 -- tools/testing/scatterlist/linux/mm.h | 1 - 2 files changed, 3 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 2ca1eb2db63ec..b26ca8b2162d9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -210,9 +210,7 @@ extern unsigned long sysctl_admin_reserve_kbytes; #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) bool page_range_contiguous(const struct page *page, unsigned long nr_pages); -#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #else -#define nth_page(page,n) ((page) + (n)) static inline bool page_range_contiguous(const struct page *page, unsigned long nr_pages) { diff --git a/tools/testing/scatterlist/linux/mm.h b/tools/testing/scatterlist/linux/mm.h index 5bd9e6e806254..121ae78d6e885 100644 --- a/tools/testing/scatterlist/linux/mm.h +++ b/tools/testing/scatterlist/linux/mm.h @@ -51,7 +51,6 @@ static inline unsigned long page_to_phys(struct page *page) #define page_to_pfn(page) ((unsigned long)(page) / PAGE_SIZE) #define pfn_to_page(pfn) (void *)((pfn) * PAGE_SIZE) -#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) #define __min(t1, t2, min1, min2, x, y) ({ \ t1 min1 = (x); \ -- 2.50.1 From dlemoal at kernel.org Thu Aug 28 04:24:45 2025 From: dlemoal at kernel.org (Damien Le Moal) Date: Thu, 28 Aug 2025 13:24:45 +0900 Subject: [PATCH v1 24/36] ata: libata-eh: drop 
nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-25-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-25-david@redhat.com> Message-ID: On 8/28/25 7:01 AM, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Cc: Damien Le Moal > Cc: Niklas Cassel > Signed-off-by: David Hildenbrand Acked-by: Damien Le Moal -- Damien Le Moal Western Digital Research From rppt at kernel.org Thu Aug 28 07:21:00 2025 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 28 Aug 2025 10:21:00 +0300 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <20250827220141.262669-14-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: > We can now safely iterate over all pages in a folio, so no need for the > pfn_to_page(). > > Also, as we already force the refcount in __init_single_page() to 1, > we can just set the refcount to 0 and avoid page_ref_freeze() + > VM_BUG_ON. Likely, in the future, we would just want to tell > __init_single_page() to which value to initialize the refcount. > > Further, adjust the comments to highlight that we are dealing with an > open-coded prep_compound_page() variant, and add another comment explaining > why we really need the __init_single_page() only on the tail pages. > > Note that the current code was likely problematic, but we never ran into > it: prep_compound_tail() would have been called with an offset that might > exceed a memory section, and prep_compound_tail() would have simply > added that offset to the page pointer -- which would not have done the > right thing on sparsemem without vmemmap. 
> > Signed-off-by: David Hildenbrand > --- > mm/hugetlb.c | 20 ++++++++++++-------- > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 4a97e4f14c0dc..1f42186a85ea4 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, > { > enum zone_type zone = zone_idx(folio_zone(folio)); > int nid = folio_nid(folio); > + struct page *page = folio_page(folio, start_page_number); > unsigned long head_pfn = folio_pfn(folio); > unsigned long pfn, end_pfn = head_pfn + end_page_number; > - int ret; > - > - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { > - struct page *page = pfn_to_page(pfn); > > + /* > + * We mark all tail pages with memblock_reserved_mark_noinit(), > + * so these pages are completely uninitialized. ^ not? ;-) > + */ > + for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) { > __init_single_page(page, pfn, zone, nid); > prep_compound_tail((struct page *)folio, pfn - head_pfn); > - ret = page_ref_freeze(page, 1); > - VM_BUG_ON(!ret); > + set_page_count(page, 0); > } > } > > @@ -3257,12 +3258,15 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, > { > int ret; > > - /* Prepare folio head */ > + /* > + * This is an open-coded prep_compound_page() whereby we avoid > + * walking pages twice by initializing/preparing+freezing them in the > + * same go. > + */ > __folio_clear_reserved(folio); > __folio_set_head(folio); > ret = folio_ref_freeze(folio, 1); > VM_BUG_ON(!ret); > - /* Initialize the necessary tail struct pages */ > hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); > prep_compound_head((struct page *)folio, huge_page_order(h)); > } > -- > 2.50.1 > -- Sincerely yours, Mike. 
From houmie at gmail.com Thu Aug 28 07:22:18 2025 From: houmie at gmail.com (Houman) Date: Thu, 28 Aug 2025 08:22:18 +0100 Subject: Google Play 16KB alignment issue with WireGuard library - fix needed Message-ID: Google PlayStore is no longer accepting apps affected by Google Play's 16 KB page size requirements. The native library `arm64-v8a/libwg-go.so` (from `com.wireguard.android:tunnel:1.0.20230706`) is not 16 KB aligned implementation 'com.wireguard.android:tunnel:1.0.20230706' Can this be fixed? Thanks, Houman From david at redhat.com Thu Aug 28 07:44:27 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 09:44:27 +0200 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> Message-ID: <377449bd-3c06-4a09-8647-e41354e64b30@redhat.com> On 28.08.25 09:21, Mike Rapoport wrote: > On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: >> We can now safely iterate over all pages in a folio, so no need for the >> pfn_to_page(). >> >> Also, as we already force the refcount in __init_single_page() to 1, >> we can just set the refcount to 0 and avoid page_ref_freeze() + >> VM_BUG_ON. Likely, in the future, we would just want to tell >> __init_single_page() to which value to initialize the refcount. >> >> Further, adjust the comments to highlight that we are dealing with an >> open-coded prep_compound_page() variant, and add another comment explaining >> why we really need the __init_single_page() only on the tail pages. >> >> Note that the current code was likely problematic, but we never ran into >> it: prep_compound_tail() would have been called with an offset that might >> exceed a memory section, and prep_compound_tail() would have simply >> added that offset to the page pointer -- which would not have done the >> right thing on sparsemem without vmemmap. 
>> >> Signed-off-by: David Hildenbrand >> --- >> mm/hugetlb.c | 20 ++++++++++++-------- >> 1 file changed, 12 insertions(+), 8 deletions(-) >> >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >> index 4a97e4f14c0dc..1f42186a85ea4 100644 >> --- a/mm/hugetlb.c >> +++ b/mm/hugetlb.c >> @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, >> { >> enum zone_type zone = zone_idx(folio_zone(folio)); >> int nid = folio_nid(folio); >> + struct page *page = folio_page(folio, start_page_number); >> unsigned long head_pfn = folio_pfn(folio); >> unsigned long pfn, end_pfn = head_pfn + end_page_number; >> - int ret; >> - >> - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { >> - struct page *page = pfn_to_page(pfn); >> >> + /* >> + * We mark all tail pages with memblock_reserved_mark_noinit(), >> + * so these pages are completely uninitialized. > > ^ not? ;-) Can you elaborate? -- Cheers David / dhildenb From rppt at kernel.org Thu Aug 28 08:06:07 2025 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 28 Aug 2025 11:06:07 +0300 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <377449bd-3c06-4a09-8647-e41354e64b30@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> <377449bd-3c06-4a09-8647-e41354e64b30@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 09:44:27AM +0200, David Hildenbrand wrote: > On 28.08.25 09:21, Mike Rapoport wrote: > > On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: > > > We can now safely iterate over all pages in a folio, so no need for the > > > pfn_to_page(). > > > > > > Also, as we already force the refcount in __init_single_page() to 1, > > > we can just set the refcount to 0 and avoid page_ref_freeze() + > > > VM_BUG_ON. Likely, in the future, we would just want to tell > > > __init_single_page() to which value to initialize the refcount. 
> > > > > > Further, adjust the comments to highlight that we are dealing with an > > > open-coded prep_compound_page() variant, and add another comment explaining > > > why we really need the __init_single_page() only on the tail pages. > > > > > > Note that the current code was likely problematic, but we never ran into > > > it: prep_compound_tail() would have been called with an offset that might > > > exceed a memory section, and prep_compound_tail() would have simply > > > added that offset to the page pointer -- which would not have done the > > > right thing on sparsemem without vmemmap. > > > > > > Signed-off-by: David Hildenbrand > > > --- > > > mm/hugetlb.c | 20 ++++++++++++-------- > > > 1 file changed, 12 insertions(+), 8 deletions(-) > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index 4a97e4f14c0dc..1f42186a85ea4 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, > > > { > > > enum zone_type zone = zone_idx(folio_zone(folio)); > > > int nid = folio_nid(folio); > > > + struct page *page = folio_page(folio, start_page_number); > > > unsigned long head_pfn = folio_pfn(folio); > > > unsigned long pfn, end_pfn = head_pfn + end_page_number; > > > - int ret; > > > - > > > - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { > > > - struct page *page = pfn_to_page(pfn); > > > + /* > > > + * We mark all tail pages with memblock_reserved_mark_noinit(), > > > + * so these pages are completely uninitialized. > > > > ^ not? ;-) > > Can you elaborate? Oh, sorry, I misread "uninitialized". Still, I'd phrase it as /* * We marked all tail pages with memblock_reserved_mark_noinit(), * so we must initialize them here. */ > -- > Cheers > > David / dhildenb > -- Sincerely yours, Mike. 
From david at redhat.com Thu Aug 28 08:18:23 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 10:18:23 +0200 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> <377449bd-3c06-4a09-8647-e41354e64b30@redhat.com> Message-ID: <6880f125-803d-4eea-88ac-b67fdcc5995d@redhat.com> On 28.08.25 10:06, Mike Rapoport wrote: > On Thu, Aug 28, 2025 at 09:44:27AM +0200, David Hildenbrand wrote: >> On 28.08.25 09:21, Mike Rapoport wrote: >>> On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: >>>> We can now safely iterate over all pages in a folio, so no need for the >>>> pfn_to_page(). >>>> >>>> Also, as we already force the refcount in __init_single_page() to 1, >>>> we can just set the refcount to 0 and avoid page_ref_freeze() + >>>> VM_BUG_ON. Likely, in the future, we would just want to tell >>>> __init_single_page() to which value to initialize the refcount. >>>> >>>> Further, adjust the comments to highlight that we are dealing with an >>>> open-coded prep_compound_page() variant, and add another comment explaining >>>> why we really need the __init_single_page() only on the tail pages. >>>> >>>> Note that the current code was likely problematic, but we never ran into >>>> it: prep_compound_tail() would have been called with an offset that might >>>> exceed a memory section, and prep_compound_tail() would have simply >>>> added that offset to the page pointer -- which would not have done the >>>> right thing on sparsemem without vmemmap. 
>>>> >>>> Signed-off-by: David Hildenbrand >>>> --- >>>> mm/hugetlb.c | 20 ++++++++++++-------- >>>> 1 file changed, 12 insertions(+), 8 deletions(-) >>>> >>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c >>>> index 4a97e4f14c0dc..1f42186a85ea4 100644 >>>> --- a/mm/hugetlb.c >>>> +++ b/mm/hugetlb.c >>>> @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, >>>> { >>>> enum zone_type zone = zone_idx(folio_zone(folio)); >>>> int nid = folio_nid(folio); >>>> + struct page *page = folio_page(folio, start_page_number); >>>> unsigned long head_pfn = folio_pfn(folio); >>>> unsigned long pfn, end_pfn = head_pfn + end_page_number; >>>> - int ret; >>>> - >>>> - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { >>>> - struct page *page = pfn_to_page(pfn); >>>> + /* >>>> + * We mark all tail pages with memblock_reserved_mark_noinit(), >>>> + * so these pages are completely uninitialized. >>> >>> ^ not? ;-) >> >> Can you elaborate? > > Oh, sorry, I misread "uninitialized". > Still, I'd phrase it as > > /* > * We marked all tail pages with memblock_reserved_mark_noinit(), > * so we must initialize them here. > */ I prefer what I currently have, but thanks for the review. -- Cheers David / dhildenb From leon at sidebranch.com Thu Aug 28 08:25:03 2025 From: leon at sidebranch.com (Leon Woestenberg) Date: Thu, 28 Aug 2025 10:25:03 +0200 Subject: Google Play 16KB alignment issue with WireGuard library - fix needed In-Reply-To: References: Message-ID: On Thu, Aug 28, 2025 at 9:31?AM Houman wrote: > > Google PlayStore is no longer accepting apps affected by Google Play's > 16 KB page size requirements. > > The native library `arm64-v8a/libwg-go.so` (from > `com.wireguard.android:tunnel:1.0.20230706`) is not 16 KB aligned > >From what I read, the requirement is not to rely on 4KiB (hard coded) page sizes. Or are they going to hard-code 16KiB now??! 
https://android-developers.googleblog.com/2024/12/get-your-apps-ready-for-16-kb-page-size-devices.html - Leon. From houmie at gmail.com Thu Aug 28 08:28:50 2025 From: houmie at gmail.com (Houman) Date: Thu, 28 Aug 2025 09:28:50 +0100 Subject: Google Play 16KB alignment issue with WireGuard library - fix needed In-Reply-To: References: Message-ID: I got this email from Google: To ensure your app works correctly on the latest versions of Android, Google Play requires all apps targeting Android 15+ to support 16 KB memory page sizes. >From May 1, 2026, if your app updates do not support 16 KB memory page sizes, you won't be able to release these updates. Thanks, Houman On Thu, 28 Aug 2025 at 09:25, Leon Woestenberg wrote: > > On Thu, Aug 28, 2025 at 9:31?AM Houman wrote: > > > > Google PlayStore is no longer accepting apps affected by Google Play's > > 16 KB page size requirements. > > > > The native library `arm64-v8a/libwg-go.so` (from > > `com.wireguard.android:tunnel:1.0.20230706`) is not 16 KB aligned > > > From what I read, the requirement is not to rely on 4KiB (hard coded) > page sizes. Or are they going to hard-code 16KiB now??! > > https://android-developers.googleblog.com/2024/12/get-your-apps-ready-for-16-kb-page-size-devices.html > > - > Leon. 
From rppt at kernel.org Thu Aug 28 08:37:37 2025 From: rppt at kernel.org (Mike Rapoport) Date: Thu, 28 Aug 2025 11:37:37 +0300 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <6880f125-803d-4eea-88ac-b67fdcc5995d@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> <377449bd-3c06-4a09-8647-e41354e64b30@redhat.com> <6880f125-803d-4eea-88ac-b67fdcc5995d@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 10:18:23AM +0200, David Hildenbrand wrote: > On 28.08.25 10:06, Mike Rapoport wrote: > > On Thu, Aug 28, 2025 at 09:44:27AM +0200, David Hildenbrand wrote: > > > On 28.08.25 09:21, Mike Rapoport wrote: > > > > On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: > > > > > + /* > > > > > + * We mark all tail pages with memblock_reserved_mark_noinit(), > > > > > + * so these pages are completely uninitialized. > > > > > > > > ^ not? ;-) > > > > > > Can you elaborate? > > > > Oh, sorry, I misread "uninitialized". > > Still, I'd phrase it as > > > > /* > > * We marked all tail pages with memblock_reserved_mark_noinit(), > > * so we must initialize them here. > > */ > > I prefer what I currently have, but thanks for the review. No strong feelings, feel free to add Reviewed-by: Mike Rapoport (Microsoft) -- Sincerely yours, Mike. From elver at google.com Thu Aug 28 08:43:16 2025 From: elver at google.com (Marco Elver) Date: Thu, 28 Aug 2025 10:43:16 +0200 Subject: [PATCH v1 34/36] kfence: drop nth_page() usage In-Reply-To: <20250827220141.262669-35-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-35-david@redhat.com> Message-ID: On Thu, 28 Aug 2025 at 00:11, 'David Hildenbrand' via kasan-dev wrote: > > We want to get rid of nth_page(), and kfence init code is the last user. 
> > Unfortunately, we might actually walk a PFN range where the pages are > not contiguous, because we might be allocating an area from memblock > that could span memory sections in problematic kernel configs (SPARSEMEM > without SPARSEMEM_VMEMMAP). > > We could check whether the page range is contiguous > using page_range_contiguous() and fail kfence init, or make kfence > incompatible with these problematic kernel configs. > > Let's keep it simple and simply use pfn_to_page() by iterating PFNs. > > Cc: Alexander Potapenko > Cc: Marco Elver > Cc: Dmitry Vyukov > Signed-off-by: David Hildenbrand Reviewed-by: Marco Elver Thanks. > --- > mm/kfence/core.c | 12 +++++++----- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > index 0ed3be100963a..727c20c94ac59 100644 > --- a/mm/kfence/core.c > +++ b/mm/kfence/core.c > @@ -594,15 +594,14 @@ static void rcu_guarded_free(struct rcu_head *h) > */ > static unsigned long kfence_init_pool(void) > { > - unsigned long addr; > - struct page *pages; > + unsigned long addr, start_pfn; > int i; > > if (!arch_kfence_init_pool()) > return (unsigned long)__kfence_pool; > > addr = (unsigned long)__kfence_pool; > - pages = virt_to_page(__kfence_pool); > + start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool)); > > /* > * Set up object pages: they must have PGTY_slab set to avoid freeing > * enters __slab_free() slow-path.
> */ > for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { > - struct slab *slab = page_slab(nth_page(pages, i)); > + struct slab *slab; > > if (!i || (i % 2)) > continue; > > + slab = page_slab(pfn_to_page(start_pfn + i)); > __folio_set_slab(slab_folio(slab)); > #ifdef CONFIG_MEMCG > slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts | > @@ -665,10 +665,12 @@ static unsigned long kfence_init_pool(void) > > reset_slab: > for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { > - struct slab *slab = page_slab(nth_page(pages, i)); > + struct slab *slab; > > if (!i || (i % 2)) > continue; > + > + slab = page_slab(pfn_to_page(start_pfn + i)); > #ifdef CONFIG_MEMCG > slab->obj_exts = 0; > #endif > -- > 2.50.1 > > -- > You received this message because you are subscribed to the Google Groups "kasan-dev" group. > To unsubscribe from this group and stop receiving emails from it, send an email to kasan-dev+unsubscribe at googlegroups.com. > To view this discussion visit https://groups.google.com/d/msgid/kasan-dev/20250827220141.262669-35-david%40redhat.com. From david at redhat.com Thu Aug 28 20:51:46 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 22:51:46 +0200 Subject: [PATCH v1 20/36] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-21-david@redhat.com> Message-ID: <2be7db96-2fa2-4348-837e-648124bd604f@redhat.com> On 28.08.25 18:57, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:24AM +0200, David Hildenbrand wrote: >> Let's make it clearer that we are operating within a single folio by >> providing both the folio and the page. >> >> This implies that for flush_dcache_folio() we'll now avoid one more >> page->folio lookup, and that we can safely drop the "nth_page" usage. 
>> >> Cc: Thomas Bogendoerfer >> Signed-off-by: David Hildenbrand >> --- >> arch/mips/include/asm/cacheflush.h | 11 +++++++---- >> arch/mips/mm/cache.c | 8 ++++---- >> 2 files changed, 11 insertions(+), 8 deletions(-) >> >> diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h >> index 5d283ef89d90d..8d79bfc687d21 100644 >> --- a/arch/mips/include/asm/cacheflush.h >> +++ b/arch/mips/include/asm/cacheflush.h >> @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); >> extern void (*flush_cache_range)(struct vm_area_struct *vma, >> unsigned long start, unsigned long end); >> extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); >> -extern void __flush_dcache_pages(struct page *page, unsigned int nr); >> +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); > > NIT: Be good to drop the extern. I think I'll leave the one in, though, someone should clean up all of them in one go. Just imagine how the other functions would think about the new guy showing off here. :) > >> >> #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 >> static inline void flush_dcache_folio(struct folio *folio) >> { >> if (cpu_has_dc_aliases) >> - __flush_dcache_pages(&folio->page, folio_nr_pages(folio)); >> + __flush_dcache_folio_pages(folio, folio_page(folio, 0), >> + folio_nr_pages(folio)); >> else if (!cpu_has_ic_fills_f_dc) >> folio_set_dcache_dirty(folio); >> } >> @@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio) >> >> static inline void flush_dcache_page(struct page *page) >> { >> + struct folio *folio = page_folio(page); >> + >> if (cpu_has_dc_aliases) >> - __flush_dcache_pages(page, 1); >> + __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio)); > > Hmmm, shouldn't this be 1 not folio_nr_pages()? Seems that the original > implementation only flushed a single page even if contained within a larger > folio? 
Yes, reworked it 3 times and messed it up during the last rework. Thanks! -- Cheers David / dhildenb From dlemoal at kernel.org Fri Aug 29 00:22:30 2025 From: dlemoal at kernel.org (Damien Le Moal) Date: Fri, 29 Aug 2025 09:22:30 +0900 Subject: [PATCH v1 24/36] ata: libata-eh: drop nth_page() usage within SG entry In-Reply-To: <7612fdc2-97ff-4b89-a532-90c5de56acdc@lucifer.local> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-25-david@redhat.com> <7612fdc2-97ff-4b89-a532-90c5de56acdc@lucifer.local> Message-ID: <423566a0-5967-488d-a62a-4f825ae6f227@kernel.org> On 8/29/25 2:53 AM, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:28AM +0200, David Hildenbrand wrote: >> It's no longer required to use nth_page() when iterating pages within a >> single SG entry, so let's drop the nth_page() usage. >> >> Cc: Damien Le Moal >> Cc: Niklas Cassel >> Signed-off-by: David Hildenbrand > > LGTM, so: > > Reviewed-by: Lorenzo Stoakes Just noticed this: s/libata-eh/libata-sff in the commit title please. -- Damien Le Moal Western Digital Research From david at redhat.com Fri Aug 29 09:58:49 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 11:58:49 +0200 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <3hpjmfa6p3onfdv4ma4nv2tdggvsyarh7m36aufy6hvwqtp2wd@2odohwxkl3rk> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> <3hpjmfa6p3onfdv4ma4nv2tdggvsyarh7m36aufy6hvwqtp2wd@2odohwxkl3rk> Message-ID: <6a2e2ba2-e5ea-4744-a66e-790216c1e762@redhat.com> On 29.08.25 02:33, Liam R. Howlett wrote: > * David Hildenbrand [250827 18:04]: >> Let's reject them early, which in turn makes folio_alloc_gigantic() reject >> them properly. >> >> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER >> and calculate MAX_FOLIO_NR_PAGES based on that. 
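The order/nr_pages relationship the commit message describes can be sketched in stand-alone form. The values here are illustrative, not taken from any real config: 18 assumes x86-64 with 4 KiB pages, where a PUD maps 1 GiB = 2^18 pages.

```c
#include <stdint.h>

/* Illustrative stand-ins: a PUD-sized folio covers 2^18 pages with
 * 4 KiB pages; MAX_PAGE_ORDER is the usual buddy-allocator limit. */
#define PUD_ORDER      18
#define MAX_PAGE_ORDER 10

#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
#define MAX_FOLIO_ORDER PUD_ORDER	/* hugetlbfs can exceed the buddy limit */
#else
#define MAX_FOLIO_ORDER MAX_PAGE_ORDER
#endif

/* Derived from the order, as in the patch, so callers never have to
 * convert back and forth between orders and page counts. */
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)
```

With the config symbol undefined (the non-gigantic case), the derived limit is simply the buddy limit in pages.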
>> >> Reviewed-by: Zi Yan >> Acked-by: SeongJae Park >> Signed-off-by: David Hildenbrand > > Nit below, but.. > > Reviewed-by: Liam R. Howlett > >> --- >> include/linux/mm.h | 6 ++++-- >> mm/page_alloc.c | 5 ++++- >> 2 files changed, 8 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 00c8a54127d37..77737cbf2216a 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) >> >> /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ >> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE >> -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) >> +#define MAX_FOLIO_ORDER PUD_ORDER >> #else >> -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES >> +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER >> #endif >> >> +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) >> + >> /* >> * compound_nr() returns the number of pages in this potentially compound >> * page. compound_nr() can be called on a tail page, and is defined to >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index baead29b3e67b..426bc404b80cc 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) >> int alloc_contig_range_noprof(unsigned long start, unsigned long end, >> acr_flags_t alloc_flags, gfp_t gfp_mask) >> { >> + const unsigned int order = ilog2(end - start); >> unsigned long outer_start, outer_end; >> int ret = 0; >> >> @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, >> PB_ISOLATE_MODE_CMA_ALLOC : >> PB_ISOLATE_MODE_OTHER; >> >> + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) >> + return -EINVAL; >> + >> gfp_mask = current_gfp_context(gfp_mask); >> if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) >> return -EINVAL; >> @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, 
unsigned long end, >> free_contig_range(end, outer_end - end); >> } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { >> struct page *head = pfn_to_page(start); >> - int order = ilog2(end - start); > > You have changed this from an int to a const unsigned int, which is > totally fine but it was left out of the change log. Considered too trivial to document, but I can add a sentence about that. > Probably not really > worth mentioning but curious why the change to unsigned here? orders are always unsigned, like folio_order(). Thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 10:06:21 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 12:06:21 +0200 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> Message-ID: <547145e0-9b0e-40ca-8201-e94cc5d19356@redhat.com> On 28.08.25 16:37, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:10AM +0200, David Hildenbrand wrote: >> Let's reject them early, which in turn makes folio_alloc_gigantic() reject >> them properly. >> >> To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER >> and calculate MAX_FOLIO_NR_PAGES based on that.
>> >> Reviewed-by: Zi Yan >> Acked-by: SeongJae Park >> Signed-off-by: David Hildenbrand > > Some nits, but overall LGTM so: > > Reviewed-by: Lorenzo Stoakes > >> --- >> include/linux/mm.h | 6 ++++-- >> mm/page_alloc.c | 5 ++++- >> 2 files changed, 8 insertions(+), 3 deletions(-) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 00c8a54127d37..77737cbf2216a 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) >> >> /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ >> #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE >> -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) >> +#define MAX_FOLIO_ORDER PUD_ORDER >> #else >> -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES >> +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER >> #endif >> >> +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > > BIT()? I don't think we want to use BIT whenever we convert from order -> folio -- which is why we also don't do that in other code. BIT() is nice in the context of flags and bitmaps, but not really in the context of converting orders to pages. One could argue that maybe one would want a order_to_pages() helper (that could use BIT() internally), but I am certainly not someone that would suggest that at this point ... :) > >> + >> /* >> * compound_nr() returns the number of pages in this potentially compound >> * page. 
compound_nr() can be called on a tail page, and is defined to >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >> index baead29b3e67b..426bc404b80cc 100644 >> --- a/mm/page_alloc.c >> +++ b/mm/page_alloc.c >> @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) >> int alloc_contig_range_noprof(unsigned long start, unsigned long end, >> acr_flags_t alloc_flags, gfp_t gfp_mask) >> { >> + const unsigned int order = ilog2(end - start); >> unsigned long outer_start, outer_end; >> int ret = 0; >> >> @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, >> PB_ISOLATE_MODE_CMA_ALLOC : >> PB_ISOLATE_MODE_OTHER; >> >> + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) >> + return -EINVAL; > > Possibly not worth it for a one off, but be nice to have this as a helper function, like: > > static bool is_valid_order(gfp_t gfp_mask, unsigned int order) > { > return !(gfp_mask & __GFP_COMP) || order <= MAX_FOLIO_ORDER; > } > > Then makes this: > > if (WARN_ON_ONCE(!is_valid_order(gfp_mask, order))) > return -EINVAL; > > Kinda self-documenting! I don't like it -- especially forwarding __GFP_COMP. is_valid_folio_order() to wrap the order check? Also not sure. So I'll leave it as is I think. Thanks for all the review! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 10:07:44 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 12:07:44 +0200 Subject: [PATCH v1 08/36] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-9-david@redhat.com> Message-ID: <5f6e49fa-4c1c-4ece-ba67-0e140e2685da@redhat.com> On 28.08.25 16:45, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:12AM +0200, David Hildenbrand wrote: >> Let's check that no hstate that corresponds to an unreasonable folio size >> is registered by an architecture. 
If we were to succeed registering, we >> could later try allocating an unsupported gigantic folio size. >> >> Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER >> is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have >> to use a BUILD_BUG_ON_INVALID() to make it compile. >> >> No existing kernel configuration should be able to trigger this check: >> either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or >> gigantic folios will not exceed a memory section (the case on sparse). > I am guessing it's implicit that MAX_FOLIO_ORDER <= section size? Yes, we have a build-time check for that somewhere. -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 10:10:30 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 12:10:30 +0200 Subject: [PATCH v1 10/36] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-11-david@redhat.com> Message-ID: On 28.08.25 17:00, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:14AM +0200, David Hildenbrand wrote: >> Let's sanity-check in folio_set_order() whether we would be trying to >> create a folio with an order that would make it exceed MAX_FOLIO_ORDER. >> >> This will enable the check whenever a folio/compound page is initialized >> through prepare_compound_head() / prepare_compound_page(). > NIT: with CONFIG_DEBUG_VM set :) Yes, will add that.
> >> >> Reviewed-by: Zi Yan >> Signed-off-by: David Hildenbrand > LGTM (apart from nit below), so: > Reviewed-by: Lorenzo Stoakes > >> --- >> mm/internal.h | 1 + >> 1 file changed, 1 insertion(+) >> >> diff --git a/mm/internal.h b/mm/internal.h >> index 45da9ff5694f6..9b0129531d004 100644 >> --- a/mm/internal.h >> +++ b/mm/internal.h >> @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) >> { >> if (WARN_ON_ONCE(!order || !folio_test_large(folio))) >> return; >> + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER); > > Given we have 'full-fat' WARN_ON*()'s above, maybe worth making this one too? The idea is that if you reach this point here, previous such checks I added failed. So this is the safety net, and for that VM_WARN_ON_ONCE() is sufficient. I think we should rather convert the WARN_ON_ONCE to VM_WARN_ON_ONCE() at some point, because no sane code should ever trigger that. -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 11:57:22 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 13:57:22 +0200 Subject: [PATCH v1 11/36] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-12-david@redhat.com> Message-ID: On 28.08.25 17:10, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:15AM +0200, David Hildenbrand wrote: >> Let's limit the maximum folio size in problematic kernel configs where >> the memmap is allocated per memory section (SPARSEMEM without >> SPARSEMEM_VMEMMAP) to a single memory section. >> >> Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE >> but not SPARSEMEM_VMEMMAP: sh. >> >> Fortunately, the biggest hugetlb size sh supports is 64 MiB >> (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB >> (SECTION_SIZE_BITS == 26), so their use case is not degraded.
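The sh arithmetic quoted above checks out in a small stand-alone model. PAGE_SHIFT of 12 is assumed here, and the constants mirror the commit message rather than the actual sh headers:

```c
#include <stdint.h>

/* SECTION_SIZE_BITS == 26 gives 64 MiB memory sections; with 4 KiB pages
 * a section holds 2^14 pages. Capping folios at one section therefore
 * still permits the 64 MiB hugetlb size that sh supports. */
#define SECTION_SIZE_BITS 26
#define PAGE_SHIFT        12
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

static uint64_t max_folio_bytes(void)
{
	/* pages per section, converted back to bytes */
	return (1ULL << PFN_SECTION_SHIFT) << PAGE_SHIFT;
}
```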
>> >> As folios and memory sections are naturally aligned to their order-2 size >> in memory, consequently a single folio can no longer span multiple memory >> sections on these problematic kernel configs. >> >> nth_page() is no longer required when operating within a single compound >> page / folio. >> >> Reviewed-by: Zi Yan >> Acked-by: Mike Rapoport (Microsoft) >> Signed-off-by: David Hildenbrand > > Really great comments, like this! > > I wonder if we could have this be part of the first patch where you fiddle > with MAX_FOLIO_ORDER etc. but not a big deal. I think it belongs into this patch where we actually impose the restrictions. [...] >> +/* >> + * Only pages within a single memory section are guaranteed to be >> + * contiguous. By limiting folios to a single memory section, all folio >> + * pages are guaranteed to be contiguous. >> + */ >> +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT > > Hmmm, was this implicit before somehow? I mean surely by the fact as you say > that physical contiguity would not otherwise be guaranteed :)) Well, my patches until this point made sure that any attempt to use a larger folio would fail in a way that we could spot now if there is any offender. That is why before this change, nth_page() was required within a folio. Hope that clarifies it, thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 11:59:19 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 13:59:19 +0200 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> Message-ID: <0dcef56e-0ae7-401b-9453-f6dc6a4dcebf@redhat.com> On 28.08.25 17:37, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: >> We can now safely iterate over all pages in a folio, so no need for the >> pfn_to_page().
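The equivalence this cleanup relies on is that, once folio pages are guaranteed contiguous, incrementing a page pointer from folio_page(folio, start) visits exactly the entries pfn_to_page(head_pfn + i) did. That can be modelled with a plain array standing in for the memmap (purely illustrative, not kernel code):

```c
#include <stddef.h>

struct fake_page { unsigned long flags; };

/* 16-entry stand-in for the memmap; contiguity is precisely what makes
 * the pointer arithmetic legal. */
static struct fake_page memmap[16];

static int iteration_matches(size_t start, size_t end)
{
	struct fake_page *page = &memmap[start];	/* folio_page(folio, start) */
	size_t i;

	for (i = start; i < end; i++, page++)
		if (page != &memmap[i])			/* pfn_to_page(head_pfn + i) */
			return 0;
	return 1;
}
```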
>> >> Also, as we already force the refcount in __init_single_page() to 1, > > Mega huge nit (ignore if you want), but maybe worth saying 'via > init_page_count()'. Will add, thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 12:00:37 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 14:00:37 +0200 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> <377449bd-3c06-4a09-8647-e41354e64b30@redhat.com> <6880f125-803d-4eea-88ac-b67fdcc5995d@redhat.com> Message-ID: On 28.08.25 10:37, Mike Rapoport wrote: > On Thu, Aug 28, 2025 at 10:18:23AM +0200, David Hildenbrand wrote: >> On 28.08.25 10:06, Mike Rapoport wrote: >>> On Thu, Aug 28, 2025 at 09:44:27AM +0200, David Hildenbrand wrote: >>>> On 28.08.25 09:21, Mike Rapoport wrote: >>>>> On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: >>>>>> + /* >>>>>> + * We mark all tail pages with memblock_reserved_mark_noinit(), >>>>>> + * so these pages are completely uninitialized. >>>>> >>>>> ^ not? ;-) >>>> >>>> Can you elaborate? >>> >>> Oh, sorry, I misread "uninitialized". >>> Still, I'd phrase it as >>> >>> /* >>> * We marked all tail pages with memblock_reserved_mark_noinit(), >>> * so we must initialize them here. >>> */ >> >> I prefer what I currently have, but thanks for the review. > > No strong feelings, feel free to add > > Reviewed-by: Mike Rapoport (Microsoft) > I now have "As we marked all tail pages with memblock_reserved_mark_noinit(), we must initialize them ourselves here." 
-- Cheers David / dhildenb From david at redhat.com Fri Aug 29 12:02:02 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 14:02:02 +0200 Subject: [PATCH v1 15/36] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() In-Reply-To: <1d74a0e2-51ff-462f-8f3c-75639fd21221@lucifer.local> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-16-david@redhat.com> <1d74a0e2-51ff-462f-8f3c-75639fd21221@lucifer.local> Message-ID: On 28.08.25 17:45, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:19AM +0200, David Hildenbrand wrote: >> The nth_page() is not really required anymore, so let's remove it. >> While at it, cleanup and simplify the code a bit. > > Hm Not sure which bit is the cleanup? Was there meant to be more here or? Thanks, leftover from the pre-split of this patch! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 13:09:45 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 15:09:45 +0200 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <34edaa0d-0d5f-4041-9a3d-fb5b2dd584e8@lucifer.local> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> <547145e0-9b0e-40ca-8201-e94cc5d19356@redhat.com> <34edaa0d-0d5f-4041-9a3d-fb5b2dd584e8@lucifer.local> Message-ID: <4f6e66a1-1747-402e-8f1a-f6b7783fc2e5@redhat.com> > > It seems a bit arbitrary, like we open-code this (at risk of making a mistake) > in some places but not others. [...] >> >> One could argue that maybe one would want a order_to_pages() helper (that >> could use BIT() internally), but I am certainly not someone that would >> suggest that at this point ... :) > > I mean maybe. > > Anyway as I said none of this is massively important, the open-coding here is > correct, just seems silly. Maybe we really want a ORDER_PAGES() and PAGES_ORDER(). 
But I mean, we also have PHYS_PFN() PFN_PHYS() and see how many "<< PAGE_SHIFT" etc we are using all over the place. > >> >>> >>>> + >>>> /* >>>> * compound_nr() returns the number of pages in this potentially compound >>>> * page. compound_nr() can be called on a tail page, and is defined to >>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c >>>> index baead29b3e67b..426bc404b80cc 100644 >>>> --- a/mm/page_alloc.c >>>> +++ b/mm/page_alloc.c >>>> @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) >>>> int alloc_contig_range_noprof(unsigned long start, unsigned long end, >>>> acr_flags_t alloc_flags, gfp_t gfp_mask) >>>> { >>>> + const unsigned int order = ilog2(end - start); >>>> unsigned long outer_start, outer_end; >>>> int ret = 0; >>>> >>>> @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, >>>> PB_ISOLATE_MODE_CMA_ALLOC : >>>> PB_ISOLATE_MODE_OTHER; >>>> >>>> + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) >>>> + return -EINVAL; >>> >>> Possibly not worth it for a one off, but be nice to have this as a helper function, like: >>> >>> static bool is_valid_order(gfp_t gfp_mask, unsigned int order) >>> { >>> return !(gfp_mask & __GFP_COMP) || order <= MAX_FOLIO_ORDER; >>> } >>> >>> Then makes this: >>> >>> if (WARN_ON_ONCE(!is_valid_order(gfp_mask, order))) >>> return -EINVAL; >>> >>> Kinda self-documenting! >> >> I don't like it -- especially forwarding __GFP_COMP. >> >> is_valid_folio_order() to wrap the order check? Also not sure. > > OK, it's not a big deal. > > Can we have a comment explaining this though? As people might be confused > as to why we check this here and not elsewhere. I can add a comment.
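Whatever comment lands there, the check itself is small enough to exercise stand-alone. The constants below are illustrative stand-ins for __GFP_COMP and MAX_FOLIO_ORDER, not the kernel's values:

```c
#include <stdbool.h>

#define FAKE_GFP_COMP   0x1u	/* stand-in for __GFP_COMP */
#define MAX_FOLIO_ORDER 18u	/* illustrative cap */

/* Only __GFP_COMP requests are order-limited, because only they
 * materialize the whole range as a single compound page/folio.
 * Non-compound requests of any order pass through. */
static bool contig_order_ok(unsigned int gfp_mask, unsigned int order)
{
	return !(gfp_mask & FAKE_GFP_COMP) || order <= MAX_FOLIO_ORDER;
}
```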
-- Cheers David / dhildenb From david at redhat.com Fri Aug 29 13:22:01 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 15:22:01 +0200 Subject: [PATCH v1 16/36] fs: hugetlbfs: cleanup folio in adjust_range_hwpoison() In-Reply-To: <71cf3600-d9cf-4d16-951c-44582b46c0fa@lucifer.local> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-17-david@redhat.com> <71cf3600-d9cf-4d16-951c-44582b46c0fa@lucifer.local> Message-ID: > > Lord above. > > Also semantics of 'if bytes == 0, then check first page anyway' which you do > capture. Yeah, I think bytes == 0 would not make any sense, though. Staring briefly at the single caller, that seems to be the case (bytes != 0). > > OK think I have convinced myself this is right, so hopefully no deeply subtle > off-by-one issues here :P > > Anyway, LGTM, so: > > Reviewed-by: Lorenzo Stoakes > >> --- >> fs/hugetlbfs/inode.c | 33 +++++++++++---------------------- >> 1 file changed, 11 insertions(+), 22 deletions(-) >> >> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c >> index c5a46d10afaa0..6ca1f6b45c1e5 100644 >> --- a/fs/hugetlbfs/inode.c >> +++ b/fs/hugetlbfs/inode.c >> @@ -198,31 +198,20 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, >> static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, >> size_t bytes) >> { >> - struct page *page; >> - size_t n = 0; >> - size_t res = 0; >> - >> - /* First page to start the loop. */ >> - page = folio_page(folio, offset / PAGE_SIZE); >> - offset %= PAGE_SIZE; >> - while (1) { >> - if (is_raw_hwpoison_page_in_hugepage(page)) >> - break; >> + struct page *page = folio_page(folio, offset / PAGE_SIZE); >> + size_t safe_bytes; >> + >> + if (is_raw_hwpoison_page_in_hugepage(page)) >> + return 0; >> + /* Safe to read the remaining bytes in this page. */ >> + safe_bytes = PAGE_SIZE - (offset % PAGE_SIZE); >> + page++; >> >> - /* Safe to read n bytes without touching HWPOISON subpage. 
*/ >> - n = min(bytes, (size_t)PAGE_SIZE - offset); >> - res += n; >> - bytes -= n; >> - if (!bytes || !n) >> + for (; safe_bytes < bytes; safe_bytes += PAGE_SIZE, page++) > OK this is quite subtle - so if safe_bytes == bytes, this means we've confirmed > that all requested bytes are safe. > > So offset=0, bytes = 4096 would fail this (as safe_bytes == 4096). > > Maybe worth putting something like: > > /* > * Now we check page-by-page in the folio to see if any bytes we don't > * yet know to be safe are contained within poisoned pages or not. > */ > > Above the loop. Or something like this. "Check each remaining page as long as we are not done yet." > >> + if (is_raw_hwpoison_page_in_hugepage(page)) >> break; >> - offset += n; >> - if (offset == PAGE_SIZE) { >> - page++; >> - offset = 0; >> - } >> - } >> >> - return res; >> + return min(safe_bytes, bytes); > Yeah given above analysis this seems correct. > > You must have torn your hair out over this :) I couldn't resist the urge to clean that up, yes. I'll also drop the "The implementation borrows the iteration logic from copy_page_to_iter*." part, because I suspect this comment no longer makes sense. Thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 13:41:40 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 15:41:40 +0200 Subject: [PATCH v1 18/36] mm/gup: drop nth_page() usage within folio when recording subpages In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-19-david@redhat.com> Message-ID: <632fea32-28aa-4993-9eff-99fc291c64f2@redhat.com> On 28.08.25 18:37, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:22AM +0200, David Hildenbrand wrote: >> nth_page() is no longer required when iterating over pages within a >> single folio, so let's just drop it when recording subpages.
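The pointer arithmetic being swapped in for nth_page() can be exercised in userspace. This is a hypothetical model with longs standing in for struct page, not the kernel function itself:

```c
#include <stddef.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

static long fake_memmap[512];		/* stand-in for the leaf's pages */
static const long *recorded[16];

/* Same shape as the simplified record_subpages(): sz is the leaf size
 * in bytes (2 MiB here), and the low bits of addr select the start page
 * within the leaf. */
static int record_subpages(const long *page, unsigned long sz,
			   unsigned long addr, unsigned long end,
			   const long **pages)
{
	int nr;

	page += (addr & (sz - 1)) >> PAGE_SHIFT;
	for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
		pages[nr] = page++;
	return nr;
}

/* Record pages 5..7 of a 2 MiB leaf: three pages, starting at index 5. */
static int demo_count(void)
{
	return record_subpages(fake_memmap, 1UL << 21, 0x5000, 0x8000, recorded);
}

static size_t demo_first_index(void)
{
	demo_count();
	return (size_t)(recorded[0] - fake_memmap);
}
```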
>> >> Signed-off-by: David Hildenbrand > > This looks correct to me, so notwithtsanding suggestion below, LGTM and: > > Reviewed-by: Lorenzo Stoakes > >> --- >> mm/gup.c | 7 +++---- >> 1 file changed, 3 insertions(+), 4 deletions(-) >> >> diff --git a/mm/gup.c b/mm/gup.c >> index b2a78f0291273..89ca0813791ab 100644 >> --- a/mm/gup.c >> +++ b/mm/gup.c >> @@ -488,12 +488,11 @@ static int record_subpages(struct page *page, unsigned long sz, >> unsigned long addr, unsigned long end, >> struct page **pages) >> { >> - struct page *start_page; >> int nr; >> >> - start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT); >> + page += (addr & (sz - 1)) >> PAGE_SHIFT; >> for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) >> - pages[nr] = nth_page(start_page, nr); >> + pages[nr] = page++; > > > This is really nice, but I wonder if (while we're here) we can't be even > more clear as to what's going on here, e.g.: > > static int record_subpages(struct page *page, unsigned long sz, > unsigned long addr, unsigned long end, > struct page **pages) > { > size_t offset_in_folio = (addr & (sz - 1)) >> PAGE_SHIFT; > struct page *subpage = page + offset_in_folio; > > for (; addr != end; addr += PAGE_SIZE) > *pages++ = subpage++; > > return nr; > } > > Or some variant of that with the masking stuff self-documented. What about the following cleanup on top: diff --git a/mm/gup.c b/mm/gup.c index 89ca0813791ab..5a72a135ec70b 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -484,19 +484,6 @@ static inline void mm_set_has_pinned_flag(struct mm_struct *mm) #ifdef CONFIG_MMU #ifdef CONFIG_HAVE_GUP_FAST -static int record_subpages(struct page *page, unsigned long sz, - unsigned long addr, unsigned long end, - struct page **pages) -{ - int nr; - - page += (addr & (sz - 1)) >> PAGE_SHIFT; - for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) - pages[nr] = page++; - - return nr; -} - /** * try_grab_folio_fast() - Attempt to get or pin a folio in fast path. 
* @page: pointer to page to be grabbed @@ -2963,8 +2950,8 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, if (pmd_special(orig)) return 0; - page = pmd_page(orig); - refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr); + refs = (end - addr) >> PAGE_SHIFT; + page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); folio = try_grab_folio_fast(page, refs, flags); if (!folio) @@ -2985,6 +2972,8 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, } *nr += refs; + for (; refs; refs--) + *(pages++) = page++; folio_set_referenced(folio); return 1; } @@ -3003,8 +2992,8 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, if (pud_special(orig)) return 0; - page = pud_page(orig); - refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr); + refs = (end - addr) >> PAGE_SHIFT; + page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); folio = try_grab_folio_fast(page, refs, flags); if (!folio) @@ -3026,6 +3015,8 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, } *nr += refs; + for (; refs; refs--) + *(pages++) = page++; folio_set_referenced(folio); return 1; } The nice thing is that we only record pages in the array if they actually passed our tests. 
-- Cheers David / dhildenb From david at redhat.com Fri Aug 29 13:44:20 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 15:44:20 +0200 Subject: [PATCH v1 20/36] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: <549a60a6-25e2-48d5-b442-49404a857014@lucifer.local> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-21-david@redhat.com> <2be7db96-2fa2-4348-837e-648124bd604f@redhat.com> <549a60a6-25e2-48d5-b442-49404a857014@lucifer.local> Message-ID: On 29.08.25 14:51, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 10:51:46PM +0200, David Hildenbrand wrote: >> On 28.08.25 18:57, Lorenzo Stoakes wrote: >>> On Thu, Aug 28, 2025 at 12:01:24AM +0200, David Hildenbrand wrote: >>>> Let's make it clearer that we are operating within a single folio by >>>> providing both the folio and the page. >>>> >>>> This implies that for flush_dcache_folio() we'll now avoid one more >>>> page->folio lookup, and that we can safely drop the "nth_page" usage. 
>>>> >>>> Cc: Thomas Bogendoerfer >>>> Signed-off-by: David Hildenbrand >>>> --- >>>> arch/mips/include/asm/cacheflush.h | 11 +++++++---- >>>> arch/mips/mm/cache.c | 8 ++++---- >>>> 2 files changed, 11 insertions(+), 8 deletions(-) >>>> >>>> diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h >>>> index 5d283ef89d90d..8d79bfc687d21 100644 >>>> --- a/arch/mips/include/asm/cacheflush.h >>>> +++ b/arch/mips/include/asm/cacheflush.h >>>> @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); >>>> extern void (*flush_cache_range)(struct vm_area_struct *vma, >>>> unsigned long start, unsigned long end); >>>> extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); >>>> -extern void __flush_dcache_pages(struct page *page, unsigned int nr); >>>> +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); >>> >>> NIT: Be good to drop the extern. >> >> I think I'll leave the one in, though, someone should clean up all of them >> in one go. > > This is how we always clean these up though, buuut to be fair that's in mm. > Well, okay, I'll make all the other functions jealous and blame it on you! :P -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 14:34:54 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 16:34:54 +0200 Subject: [PATCH v1 21/36] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-22-david@redhat.com> Message-ID: <62fad23f-e8dc-4fd5-a82f-6419376465b5@redhat.com> On 28.08.25 19:28, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:25AM +0200, David Hildenbrand wrote: >> Let's disallow handing out PFN ranges with non-contiguous pages, so we >> can remove the nth-page usage in __cma_alloc(), and so any callers don't >> have to worry about that either when wanting to blindly iterate pages. 
>> >> This is really only a problem in configs with SPARSEMEM but without >> SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some >> cases. > > I'm guessing this is something that we don't need to worry about in > reality? That's my theory, yes. > >> >> Will this cause harm? Probably not, because it's mostly 32bit that does >> not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could >> look into allocating the memmap for the memory sections spanned by a >> single CMA region in one go from memblock. >> >> Reviewed-by: Alexandru Elisei >> Signed-off-by: David Hildenbrand > > LGTM other than refactoring point below. > > CMA stuff looks fine afaict after staring at it for a while, on proviso > that handing out ranges within the same section is always going to be the > case. > > Anyway overall, > > LGTM, so: > > Reviewed-by: Lorenzo Stoakes > > >> --- >> include/linux/mm.h | 6 ++++++ >> mm/cma.c | 39 ++++++++++++++++++++++++--------------- >> mm/util.c | 33 +++++++++++++++++++++++++++++++++ >> 3 files changed, 63 insertions(+), 15 deletions(-) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index f6880e3225c5c..2ca1eb2db63ec 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; >> extern unsigned long sysctl_admin_reserve_kbytes; >> >> #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) >> +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); >> #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) >> #else >> #define nth_page(page,n) ((page) + (n)) >> +static inline bool page_range_contiguous(const struct page *page, >> + unsigned long nr_pages) >> +{ >> + return true; >> +} >> #endif >> >> /* to align the pointer to the (next) page boundary */ >> diff --git a/mm/cma.c b/mm/cma.c >> index e56ec64d0567e..813e6dc7b0954 100644 >> --- a/mm/cma.c >> +++ b/mm/cma.c >> @@ -780,10 +780,8 @@ static int
cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, >> unsigned long count, unsigned int align, >> struct page **pagep, gfp_t gfp) >> { >> - unsigned long mask, offset; >> - unsigned long pfn = -1; >> - unsigned long start = 0; >> unsigned long bitmap_maxno, bitmap_no, bitmap_count; >> + unsigned long start, pfn, mask, offset; >> int ret = -EBUSY; >> struct page *page = NULL; >> >> @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, >> if (bitmap_count > bitmap_maxno) >> goto out; >> >> - for (;;) { >> + for (start = 0; ; start = bitmap_no + mask + 1) { >> spin_lock_irq(&cma->lock); >> /* >> * If the request is larger than the available number >> @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, >> spin_unlock_irq(&cma->lock); >> break; >> } >> + >> + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); >> + page = pfn_to_page(pfn); >> + >> + /* >> + * Do not hand out page ranges that are not contiguous, so >> + * callers can just iterate the pages without having to worry >> + * about these corner cases. 
>> + */ >> + if (!page_range_contiguous(page, count)) { >> + spin_unlock_irq(&cma->lock); >> + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", >> + __func__, cma->name, pfn, pfn + count - 1); >> + continue; >> + } >> + >> bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); >> cma->available_count -= count; >> /* >> @@ -821,29 +835,24 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, >> */ >> spin_unlock_irq(&cma->lock); >> >> - pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); >> mutex_lock(&cma->alloc_mutex); >> ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); >> mutex_unlock(&cma->alloc_mutex); >> - if (ret == 0) { >> - page = pfn_to_page(pfn); >> + if (!ret) >> break; >> - } >> >> cma_clear_bitmap(cma, cmr, pfn, count); >> if (ret != -EBUSY) >> break; >> >> pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", >> - __func__, pfn, pfn_to_page(pfn)); >> + __func__, pfn, page); >> >> - trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), >> - count, align); >> - /* try again with a bit different memory target */ >> - start = bitmap_no + mask + 1; >> + trace_cma_alloc_busy_retry(cma->name, pfn, page, count, align); >> } >> out: >> - *pagep = page; >> + if (!ret) >> + *pagep = page; >> return ret; >> } >> >> @@ -882,7 +891,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, >> */ >> if (page) { >> for (i = 0; i < count; i++) >> - page_kasan_tag_reset(nth_page(page, i)); >> + page_kasan_tag_reset(page + i); >> } >> >> if (ret && !(gfp & __GFP_NOWARN)) { >> diff --git a/mm/util.c b/mm/util.c >> index d235b74f7aff7..0bf349b19b652 100644 >> --- a/mm/util.c >> +++ b/mm/util.c >> @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, >> { >> return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); >> } >> + >> +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) >> +/** >> + * page_range_contiguous 
- test whether the page range is contiguous >> + * @page: the start of the page range. >> + * @nr_pages: the number of pages in the range. >> + * >> + * Test whether the page range is contiguous, such that they can be iterated >> + * naively, corresponding to iterating a contiguous PFN range. >> + * >> + * This function should primarily only be used for debug checks, or when >> + * working with page ranges that are not naturally contiguous (e.g., pages >> + * within a folio are). >> + * >> + * Returns true if contiguous, otherwise false. >> + */ >> +bool page_range_contiguous(const struct page *page, unsigned long nr_pages) >> +{ >> + const unsigned long start_pfn = page_to_pfn(page); >> + const unsigned long end_pfn = start_pfn + nr_pages; >> + unsigned long pfn; >> + >> + /* >> + * The memmap is allocated per memory section. We need to check >> + * each involved memory section once. >> + */ >> + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); >> + pfn < end_pfn; pfn += PAGES_PER_SECTION) >> + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) >> + return false; > > I find this pretty confusing, my test for this is how many times I have to read > the code to understand what it's doing :) > > So we have something like: > > (pfn of page) > start_pfn pfn = align UP > | | > v v > | section | > <-----------------> > pfn - start_pfn > > Then check page + (pfn - start_pfn) == pfn_to_page(pfn) > > And loop such that: > > (pfn of page) > start_pfn pfn > | | > v v > | section | section | > <------------------------------------------> > pfn - start_pfn > > Again check page + (pfn - start_pfn) == pfn_to_page(pfn) > > And so on. > > So the logic looks good, but it's just... 
that took me a hot second to > parse :) > > I think a few simple fixups > > bool page_range_contiguous(const struct page *page, unsigned long nr_pages) > { > const unsigned long start_pfn = page_to_pfn(page); > const unsigned long end_pfn = start_pfn + nr_pages; > /* The PFN of the start of the next section. */ > unsigned long pfn = ALIGN(start_pfn, PAGES_PER_SECTION); > /* The page we'd expected to see if the range were contiguous. */ > struct page *expected = page + (pfn - start_pfn); > > /* > * The memmap is allocated per memory section. We need to check > * each involved memory section once. > */ > for (; pfn < end_pfn; pfn += PAGES_PER_SECTION, expected += PAGES_PER_SECTION) > if (unlikely(expected != pfn_to_page(pfn))) > return false; > return true; > } > Hm, I prefer my variant, especially where the pfn is calculated in the for loop. Likely a matter of personal taste. But I can see why skipping the first section might be a surprise when not having the semantics of ALIGN() in the cache. So I'll add the following on top: diff --git a/mm/util.c b/mm/util.c index 0bf349b19b652..fbdb73aaf35fe 100644 --- a/mm/util.c +++ b/mm/util.c @@ -1303,8 +1303,10 @@ bool page_range_contiguous(const struct page *page, unsigned long nr_pages) unsigned long pfn; /* - * The memmap is allocated per memory section. We need to check - * each involved memory section once. + * The memmap is allocated per memory section, so no need to check + * within the first section. However, we need to check each other + * spanned memory section once, making sure the first page in a + * section could similarly be reached by just iterating pages. */ for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); pfn < end_pfn; pfn += PAGES_PER_SECTION) Thanks! 
-- Cheers David / dhildenb From david at redhat.com Fri Aug 29 14:37:26 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 16:37:26 +0200 Subject: [PATCH v1 24/36] ata: libata-eh: drop nth_page() usage within SG entry In-Reply-To: <423566a0-5967-488d-a62a-4f825ae6f227@kernel.org> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-25-david@redhat.com> <7612fdc2-97ff-4b89-a532-90c5de56acdc@lucifer.local> <423566a0-5967-488d-a62a-4f825ae6f227@kernel.org> Message-ID: <07b11bc1-ea31-4d9d-b0be-0dd94a7b1c9c@redhat.com> On 29.08.25 02:22, Damien Le Moal wrote: > On 8/29/25 2:53 AM, Lorenzo Stoakes wrote: >> On Thu, Aug 28, 2025 at 12:01:28AM +0200, David Hildenbrand wrote: >>> It's no longer required to use nth_page() when iterating pages within a >>> single SG entry, so let's drop the nth_page() usage. >>> >>> Cc: Damien Le Moal >>> Cc: Niklas Cassel >>> Signed-off-by: David Hildenbrand >> >> LGTM, so: >> >> Reviewed-by: Lorenzo Stoakes > > Just noticed this: > > s/libata-eh/libata-sff > > in the commit title please. > Sure, I think some quick git-log search mislead me. Thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 14:41:08 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 16:41:08 +0200 Subject: [PATCH v1 33/36] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-34-david@redhat.com> Message-ID: <4b053602-7c80-4ea4-8617-0f5e526c02f6@redhat.com> On 28.08.25 20:09, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:37AM +0200, David Hildenbrand wrote: >> There is the concern that unpin_user_page_range_dirty_lock() might do >> some weird merging of PFN ranges -- either now or in the future -- such >> that PFN range is contiguous but the page range might not be. >> >> Let's sanity-check for that and drop the nth_page() usage. 
>> >> Signed-off-by: David Hildenbrand > > Seems one user uses SG and the other is IOMMU and in each instance you'd > expect physical contiguity (maybe Jason G. or somebody else more familiar > with these uses can also chime in). Right, and I added the sanity-check so we can identify and fix any such wrong merging of ranges. Thanks! -- Cheers David / dhildenb From david at redhat.com Fri Aug 29 14:42:30 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 29 Aug 2025 16:42:30 +0200 Subject: [PATCH v1 36/36] mm: remove nth_page() In-Reply-To: <18c6a175-507f-464c-b776-67d346863ddf@lucifer.local> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-37-david@redhat.com> <18c6a175-507f-464c-b776-67d346863ddf@lucifer.local> Message-ID: <7cd5f8c9-9bd3-40ed-a3df-a359dcfe1567@redhat.com> On 28.08.25 20:25, Lorenzo Stoakes wrote: > On Thu, Aug 28, 2025 at 12:01:40AM +0200, David Hildenbrand wrote: >> Now that all users are gone, let's remove it. >> >> Signed-off-by: David Hildenbrand > > HAPPY DAYYS!!!! > > Happy to have reached this bit, great work! :) I was just as happy when I made it to the end of this series :) Thanks for all the review!! -- Cheers David / dhildenb From herbert at gondor.apana.org.au Sat Aug 30 08:50:52 2025 From: herbert at gondor.apana.org.au (Herbert Xu) Date: Sat, 30 Aug 2025 16:50:52 +0800 Subject: [PATCH v1 32/36] crypto: remove nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-33-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-33-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:36AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Cc: Herbert Xu > Cc: "David S. 
Miller" > Signed-off-by: David Hildenbrand > --- > crypto/ahash.c | 4 ++-- > crypto/scompress.c | 8 ++++---- > include/crypto/scatterwalk.h | 4 ++-- > 3 files changed, 8 insertions(+), 8 deletions(-) Acked-by: Herbert Xu Thanks, -- Email: Herbert Xu Home Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt From david at redhat.com Fri Aug 22 18:11:04 2025 From: david at redhat.com (David Hildenbrand) Date: Fri, 22 Aug 2025 18:11:04 -0000 Subject: [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry In-Reply-To: <58816f2c-d4a7-4ec0-a48e-66a876ea1168@acm.org> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-30-david@redhat.com> <58816f2c-d4a7-4ec0-a48e-66a876ea1168@acm.org> Message-ID: <9a9eb9ca-a5ae-4230-8921-fd0e0a79ccbb@redhat.com> On 22.08.25 20:01, Bart Van Assche wrote: > On 8/21/25 1:06 PM, David Hildenbrand wrote: >> It's no longer required to use nth_page() when iterating pages within a >> single SG entry, so let's drop the nth_page() usage. > Usually the SCSI core and the SG I/O driver are updated separately. > Anyway: Thanks, I had it separately but decided to merge per broader subsystem before sending. I can split it up in the next version. -- Cheers David / dhildenb From david at redhat.com Wed Aug 27 22:02:28 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:02:28 -0000 Subject: [PATCH v1 00/36] mm: remove nth_page() Message-ID: <20250827220141.262669-1-david@redhat.com> This is based on mm-unstable. I will only CC non-MM folks on the cover letter and the respective patch to not flood too many inboxes (the lists receive all patches). -- As discussed recently with Linus, nth_page() is just nasty and we would like to remove it. To recap, the reason we currently need nth_page() within a folio is because on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the memmap is allocated per memory section. 
While buddy allocations cannot cross memory section boundaries, hugetlb and dax folios can. So crossing a memory section means that "page++" could do the wrong thing. Instead, nth_page() on these problematic configs always goes from page->pfn, to then go from (++pfn)->page, which is rather nasty. Likely, many people have no idea when nth_page() is required and when it might be dropped. We refer to such problematic PFN ranges as "non-contiguous pages". If we only deal with "contiguous pages", there is no need for nth_page(). Besides that "obvious" folio case, we might end up using nth_page() within CMA allocations (again, could span memory sections), and in one corner case (kfence) when processing memblock allocations (again, could span memory sections). So let's handle all that, add sanity checks, and remove nth_page(). Patch #1 -> #5 : stop making SPARSEMEM_VMEMMAP user-selectable + cleanups Patch #6 -> #13 : disallow folios to have non-contiguous pages Patch #14 -> #20 : remove nth_page() usage within folios Patch #21 : disallow CMA allocations of non-contiguous pages Patch #22 -> #32 : sanity-check + remove nth_page() usage within SG entry Patch #33 : sanity-check + remove nth_page() usage in unpin_user_page_range_dirty_lock() Patch #34 : remove nth_page() in kfence Patch #35 : adjust stale comment regarding nth_page Patch #36 : mm: remove nth_page() A lot of this is inspired from the discussion at [1] between Linus, Jason and me, so kudos to them.
[1] https://lore.kernel.org/all/CAHk-=wiCYfNp4AJLBORU-c7ZyRBUp66W2-Et6cdQ4REx-GyQ_A at mail.gmail.com/T/#u RFC -> v1: * "wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config" -> Mention that it was never really relevant for the test * "mm/mm_init: make memmap_init_compound() look more like prep_compound_page()" -> Mention the setup of page links * "mm: limit folio/compound page sizes in problematic kernel configs" -> Improve comment for PUD handling, mentioning hugetlb and dax * "mm: simplify folio_page() and folio_page_idx()" -> Call variable "n" * "mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap()" -> Keep __init_single_page() and refer to the usage of memblock_reserved_mark_noinit() * "fs: hugetlbfs: cleanup folio in adjust_range_hwpoison()" * "fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison()" -> Separate nth_page() removal from cleanups -> Further improve cleanups * "io_uring/zcrx: remove nth_page() usage within folio" -> Keep the io_copy_cache for now and limit to nth_page() removal * "mm/gup: drop nth_page() usage within folio when recording subpages" -> Cleanup record_subpages a bit * "mm/cma: refuse handing out non-contiguous page ranges" -> Replace another instance of "pfn_to_page(pfn)" where we already have the page * "scatterlist: disallow non-contigous page ranges in a single SG entry" -> We have to EXPORT the symbol.
I thought about moving it to mm_inline.h, but I really don't want to include that in include/linux/scatterlist.h * "ata: libata-eh: drop nth_page() usage within SG entry" * "mspro_block: drop nth_page() usage within SG entry" * "memstick: drop nth_page() usage within SG entry" * "mmc: drop nth_page() usage within SG entry" -> Keep PAGE_SHIFT * "scsi: scsi_lib: drop nth_page() usage within SG entry" * "scsi: sg: drop nth_page() usage within SG entry" -> Split patches, Keep PAGE_SHIFT * "crypto: remove nth_page() usage within SG entry" -> Keep PAGE_SHIFT * "kfence: drop nth_page() usage" -> Keep modifying i and use "start_pfn" only instead Cc: Andrew Morton Cc: Linus Torvalds Cc: Jason Gunthorpe Cc: Lorenzo Stoakes Cc: "Liam R. Howlett" Cc: Vlastimil Babka Cc: Mike Rapoport Cc: Suren Baghdasaryan Cc: Michal Hocko Cc: Jens Axboe Cc: Marek Szyprowski Cc: Robin Murphy Cc: John Hubbard Cc: Peter Xu Cc: Alexander Potapenko Cc: Marco Elver Cc: Dmitry Vyukov Cc: Brendan Jackman Cc: Johannes Weiner Cc: Zi Yan Cc: Dennis Zhou Cc: Tejun Heo Cc: Christoph Lameter Cc: Muchun Song Cc: Oscar Salvador Cc: x86 at kernel.org Cc: linux-arm-kernel at lists.infradead.org Cc: linux-mips at vger.kernel.org Cc: linux-s390 at vger.kernel.org Cc: linux-crypto at vger.kernel.org Cc: linux-ide at vger.kernel.org Cc: intel-gfx at lists.freedesktop.org Cc: dri-devel at lists.freedesktop.org Cc: linux-mmc at vger.kernel.org Cc: linux-arm-kernel at axis.com Cc: linux-scsi at vger.kernel.org Cc: kvm at vger.kernel.org Cc: virtualization at lists.linux.dev Cc: linux-mm at kvack.org Cc: io-uring at vger.kernel.org Cc: iommu at lists.linux.dev Cc: kasan-dev at googlegroups.com Cc: wireguard at lists.zx2c4.com Cc: netdev at vger.kernel.org Cc: linux-kselftest at vger.kernel.org Cc: linux-riscv at lists.infradead.org David Hildenbrand (36): mm: stop making SPARSEMEM_VMEMMAP user-selectable arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" s390/Kconfig: drop superfluous "select 
SPARSEMEM_VMEMMAP" x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() mm/hugetlb: check for unreasonable folio sizes when registering hstate mm/mm_init: make memmap_init_compound() look more like prep_compound_page() mm: sanity-check maximum folio size in folio_set_order() mm: limit folio/compound page sizes in problematic kernel configs mm: simplify folio_page() and folio_page_idx() mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() mm/mm/percpu-km: drop nth_page() usage within single allocation fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() fs: hugetlbfs: cleanup folio in adjust_range_hwpoison() mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() mm/gup: drop nth_page() usage within folio when recording subpages io_uring/zcrx: remove nth_page() usage within folio mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() mm/cma: refuse handing out non-contiguous page ranges dma-remap: drop nth_page() in dma_common_contiguous_remap() scatterlist: disallow non-contigous page ranges in a single SG entry ata: libata-eh: drop nth_page() usage within SG entry drm/i915/gem: drop nth_page() usage within SG entry mspro_block: drop nth_page() usage within SG entry memstick: drop nth_page() usage within SG entry mmc: drop nth_page() usage within SG entry scsi: scsi_lib: drop nth_page() usage within SG entry scsi: sg: drop nth_page() usage within SG entry vfio/pci: drop nth_page() usage within SG entry crypto: remove nth_page() usage within SG entry mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() kfence: drop nth_page() usage block: update comment of "struct bio_vec" regarding nth_page() mm: remove nth_page() arch/arm64/Kconfig | 1 - 
arch/mips/include/asm/cacheflush.h | 11 +++-- arch/mips/mm/cache.c | 8 ++-- arch/s390/Kconfig | 1 - arch/x86/Kconfig | 1 - crypto/ahash.c | 4 +- crypto/scompress.c | 8 ++-- drivers/ata/libata-sff.c | 6 +-- drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +- drivers/memstick/core/mspro_block.c | 3 +- drivers/memstick/host/jmb38x_ms.c | 3 +- drivers/memstick/host/tifm_ms.c | 3 +- drivers/mmc/host/tifm_sd.c | 4 +- drivers/mmc/host/usdhi6rol0.c | 4 +- drivers/scsi/scsi_lib.c | 3 +- drivers/scsi/sg.c | 3 +- drivers/vfio/pci/pds/lm.c | 3 +- drivers/vfio/pci/virtio/migrate.c | 3 +- fs/hugetlbfs/inode.c | 33 +++++-------- include/crypto/scatterwalk.h | 4 +- include/linux/bvec.h | 7 +-- include/linux/mm.h | 48 +++++++++++++++---- include/linux/page-flags.h | 5 +- include/linux/scatterlist.h | 3 +- io_uring/zcrx.c | 4 +- kernel/dma/remap.c | 2 +- mm/Kconfig | 3 +- mm/cma.c | 39 +++++++++------ mm/gup.c | 14 ++++-- mm/hugetlb.c | 22 +++++---- mm/internal.h | 1 + mm/kfence/core.c | 12 +++-- mm/memremap.c | 3 ++ mm/mm_init.c | 15 +++--- mm/page_alloc.c | 5 +- mm/pagewalk.c | 2 +- mm/percpu-km.c | 2 +- mm/util.c | 34 +++++++++++++ tools/testing/scatterlist/linux/mm.h | 1 - .../selftests/wireguard/qemu/kernel.config | 1 - 40 files changed, 202 insertions(+), 129 deletions(-) base-commit: efa7612003b44c220551fd02466bfbad5180fc83 -- 2.50.1 From david at redhat.com Wed Aug 27 22:02:51 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:02:51 -0000 Subject: [PATCH v1 01/36] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-2-david@redhat.com> In an ideal world, we wouldn't have to deal with SPARSEMEM without SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is considered too costly and consequently not supported. 
However, if an architecture does support SPARSEMEM with SPARSEMEM_VMEMMAP, let's forbid the user from disabling VMEMMAP, just like we already do for arm64, s390 and x86. So if SPARSEMEM_VMEMMAP is supported, don't allow using SPARSEMEM without SPARSEMEM_VMEMMAP. This implies that the option to not use SPARSEMEM_VMEMMAP will now be gone for loongarch, powerpc, riscv and sparc. All architectures only enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really be a big downside to using the VMEMMAP (quite the contrary). This is a preparation for not supporting (1) folio sizes that exceed a single memory section (2) CMA allocations of non-contiguous page ranges in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we want to limit possible impact as much as possible (e.g., gigantic hugetlb page allocations suddenly failing). Acked-by: Zi Yan Acked-by: Mike Rapoport (Microsoft) Acked-by: SeongJae Park Cc: Huacai Chen Cc: WANG Xuerui Cc: Madhavan Srinivasan Cc: Michael Ellerman Cc: Nicholas Piggin Cc: Christophe Leroy Cc: Paul Walmsley Cc: Palmer Dabbelt Cc: Albert Ou Cc: Alexandre Ghiti Cc: "David S. Miller" Cc: Andreas Larsson Signed-off-by: David Hildenbrand --- mm/Kconfig | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/mm/Kconfig b/mm/Kconfig index 4108bcd967848..330d0e698ef96 100644 --- a/mm/Kconfig +++ b/mm/Kconfig @@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE bool config SPARSEMEM_VMEMMAP - bool "Sparse Memory virtual memmap" + def_bool y depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE - default y help SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise pfn_to_page and page_to_pfn operations. This is the most efficient option when sufficient kernel resources are available.
This is the most -- 2.50.1 From david at redhat.com Wed Aug 27 22:03:06 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:03:06 -0000 Subject: [PATCH v1 02/36] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-3-david@redhat.com> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE is selected. Reviewed-by: Mike Rapoport (Microsoft) Cc: Catalin Marinas Cc: Will Deacon Signed-off-by: David Hildenbrand --- arch/arm64/Kconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index e9bbfacc35a64..b1d1f2ff2493b 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz" config ARCH_SPARSEMEM_ENABLE def_bool y select SPARSEMEM_VMEMMAP_ENABLE - select SPARSEMEM_VMEMMAP config HW_PERF_EVENTS def_bool y -- 2.50.1 From david at redhat.com Wed Aug 27 22:03:27 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:03:27 -0000 Subject: [PATCH v1 03/36] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-4-david@redhat.com> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE is selected. 
Reviewed-by: Mike Rapoport (Microsoft) Cc: Heiko Carstens Cc: Vasily Gorbik Cc: Alexander Gordeev Cc: Christian Borntraeger Cc: Sven Schnelle Signed-off-by: David Hildenbrand --- arch/s390/Kconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig index bf680c26a33cf..145ca23c2fff6 100644 --- a/arch/s390/Kconfig +++ b/arch/s390/Kconfig @@ -710,7 +710,6 @@ menu "Memory setup" config ARCH_SPARSEMEM_ENABLE def_bool y select SPARSEMEM_VMEMMAP_ENABLE - select SPARSEMEM_VMEMMAP config ARCH_SPARSEMEM_DEFAULT def_bool y -- 2.50.1 From david at redhat.com Wed Aug 27 22:03:42 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:03:42 -0000 Subject: [PATCH v1 04/36] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-5-david@redhat.com> Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE is selected. 
Reviewed-by: Mike Rapoport (Microsoft) Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Signed-off-by: David Hildenbrand --- arch/x86/Kconfig | 1 - 1 file changed, 1 deletion(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 58d890fe2100e..e431d1c06fecd 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE def_bool y select SPARSEMEM_STATIC if X86_32 select SPARSEMEM_VMEMMAP_ENABLE if X86_64 - select SPARSEMEM_VMEMMAP if X86_64 config ARCH_SPARSEMEM_DEFAULT def_bool X86_64 || (NUMA && X86_32) -- 2.50.1 From david at redhat.com Wed Aug 27 22:03:58 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:03:58 -0000 Subject: [PATCH v1 05/36] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-6-david@redhat.com> It's no longer user-selectable (and the default was already "y"), so let's just drop it. It was never really relevant to the wireguard selftests either way. Acked-by: Mike Rapoport (Microsoft) Cc: "Jason A. 
Donenfeld" Cc: Shuah Khan Signed-off-by: David Hildenbrand --- tools/testing/selftests/wireguard/qemu/kernel.config | 1 - 1 file changed, 1 deletion(-) diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config index 0a5381717e9f4..1149289f4b30f 100644 --- a/tools/testing/selftests/wireguard/qemu/kernel.config +++ b/tools/testing/selftests/wireguard/qemu/kernel.config @@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y CONFIG_FUTEX=y CONFIG_SHMEM=y CONFIG_SLUB=y -CONFIG_SPARSEMEM_VMEMMAP=y CONFIG_SMP=y CONFIG_SCHED_SMT=y CONFIG_SCHED_MC=y -- 2.50.1 From david at redhat.com Wed Aug 27 22:09:09 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:09:09 -0000 Subject: [PATCH v1 24/36] ata: libata-eh: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-25-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. 
Cc: Damien Le Moal Cc: Niklas Cassel Signed-off-by: David Hildenbrand --- drivers/ata/libata-sff.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c index 7fc407255eb46..1e2a2c33cdc80 100644 --- a/drivers/ata/libata-sff.c +++ b/drivers/ata/libata-sff.c @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) offset = qc->cursg->offset + qc->cursg_ofs; /* get the current page and offset */ - page = nth_page(page, (offset >> PAGE_SHIFT)); + page += offset >> PAGE_SHIFT; offset %= PAGE_SIZE; /* don't overrun current sg */ @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) unsigned int split_len = PAGE_SIZE - offset; ata_pio_xfer(qc, page, offset, split_len); - ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len); + ata_pio_xfer(qc, page + 1, 0, count - split_len); } else { ata_pio_xfer(qc, page, offset, count); } @@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes) offset = sg->offset + qc->cursg_ofs; /* get the current page and offset */ - page = nth_page(page, (offset >> PAGE_SHIFT)); + page += offset >> PAGE_SHIFT; offset %= PAGE_SIZE; /* don't overrun current sg */ -- 2.50.1 From david at redhat.com Wed Aug 27 22:09:26 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:09:26 -0000 Subject: [PATCH v1 25/36] drm/i915/gem: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-26-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. 
Cc: Jani Nikula Cc: Joonas Lahtinen Cc: Rodrigo Vivi Cc: Tvrtko Ursulin Cc: David Airlie Cc: Simona Vetter Signed-off-by: David Hildenbrand --- drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c index c16a57160b262..031d7acc16142 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c @@ -779,7 +779,7 @@ __i915_gem_object_get_page(struct drm_i915_gem_object *obj, pgoff_t n) GEM_BUG_ON(!i915_gem_object_has_struct_page(obj)); sg = i915_gem_object_get_sg(obj, n, &offset); - return nth_page(sg_page(sg), offset); + return sg_page(sg) + offset; } /* Like i915_gem_object_get_page(), but mark the returned page dirty */ -- 2.50.1 From david at redhat.com Wed Aug 27 22:09:43 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:09:43 -0000 Subject: [PATCH v1 26/36] mspro_block: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-27-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. 
Acked-by: Ulf Hansson Cc: Maxim Levitsky Cc: Alex Dubov Cc: Ulf Hansson Signed-off-by: David Hildenbrand --- drivers/memstick/core/mspro_block.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/memstick/core/mspro_block.c b/drivers/memstick/core/mspro_block.c index c9853d887d282..d3f160dc0da4c 100644 --- a/drivers/memstick/core/mspro_block.c +++ b/drivers/memstick/core/mspro_block.c @@ -560,8 +560,7 @@ static int h_mspro_block_transfer_data(struct memstick_dev *card, t_offset += msb->current_page * msb->page_size; sg_set_page(&t_sg, - nth_page(sg_page(&(msb->req_sg[msb->current_seg])), - t_offset >> PAGE_SHIFT), + sg_page(&(msb->req_sg[msb->current_seg])) + (t_offset >> PAGE_SHIFT), msb->page_size, offset_in_page(t_offset)); memstick_init_req_sg(*mrq, msb->data_dir == READ -- 2.50.1 From david at redhat.com Wed Aug 27 22:10:01 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:10:01 -0000 Subject: [PATCH v1 27/36] memstick: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-28-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. 
Acked-by: Ulf Hansson Cc: Maxim Levitsky Cc: Alex Dubov Cc: Ulf Hansson Signed-off-by: David Hildenbrand --- drivers/memstick/host/jmb38x_ms.c | 3 +-- drivers/memstick/host/tifm_ms.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/memstick/host/jmb38x_ms.c b/drivers/memstick/host/jmb38x_ms.c index cddddb3a5a27f..79e66e30417c1 100644 --- a/drivers/memstick/host/jmb38x_ms.c +++ b/drivers/memstick/host/jmb38x_ms.c @@ -317,8 +317,7 @@ static int jmb38x_ms_transfer_data(struct jmb38x_ms_host *host) unsigned int p_off; if (host->req->long_data) { - pg = nth_page(sg_page(&host->req->sg), - off >> PAGE_SHIFT); + pg = sg_page(&host->req->sg) + (off >> PAGE_SHIFT); p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, length); diff --git a/drivers/memstick/host/tifm_ms.c b/drivers/memstick/host/tifm_ms.c index db7f3a088fb09..0b6a90661eee5 100644 --- a/drivers/memstick/host/tifm_ms.c +++ b/drivers/memstick/host/tifm_ms.c @@ -201,8 +201,7 @@ static unsigned int tifm_ms_transfer_data(struct tifm_ms *host) unsigned int p_off; if (host->req->long_data) { - pg = nth_page(sg_page(&host->req->sg), - off >> PAGE_SHIFT); + pg = sg_page(&host->req->sg) + (off >> PAGE_SHIFT); p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, length); -- 2.50.1 From david at redhat.com Wed Aug 27 22:10:21 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:10:21 -0000 Subject: [PATCH v1 28/36] mmc: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-29-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. 
Acked-by: Ulf Hansson Cc: Alex Dubov Cc: Ulf Hansson Cc: Jesper Nilsson Cc: Lars Persson Signed-off-by: David Hildenbrand --- drivers/mmc/host/tifm_sd.c | 4 ++-- drivers/mmc/host/usdhi6rol0.c | 4 ++-- 2 files changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/mmc/host/tifm_sd.c b/drivers/mmc/host/tifm_sd.c index ac636efd911d3..2cd69c9e9571b 100644 --- a/drivers/mmc/host/tifm_sd.c +++ b/drivers/mmc/host/tifm_sd.c @@ -191,7 +191,7 @@ static void tifm_sd_transfer_data(struct tifm_sd *host) } off = sg[host->sg_pos].offset + host->block_pos; - pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT); + pg = sg_page(&sg[host->sg_pos]) + (off >> PAGE_SHIFT); p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, cnt); @@ -240,7 +240,7 @@ static void tifm_sd_bounce_block(struct tifm_sd *host, struct mmc_data *r_data) } off = sg[host->sg_pos].offset + host->block_pos; - pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT); + pg = sg_page(&sg[host->sg_pos]) + (off >> PAGE_SHIFT); p_off = offset_in_page(off); p_cnt = PAGE_SIZE - p_off; p_cnt = min(p_cnt, cnt); diff --git a/drivers/mmc/host/usdhi6rol0.c b/drivers/mmc/host/usdhi6rol0.c index 85b49c07918b3..3bccf800339ba 100644 --- a/drivers/mmc/host/usdhi6rol0.c +++ b/drivers/mmc/host/usdhi6rol0.c @@ -323,7 +323,7 @@ static void usdhi6_blk_bounce(struct usdhi6_host *host, host->head_pg.page = host->pg.page; host->head_pg.mapped = host->pg.mapped; - host->pg.page = nth_page(host->pg.page, 1); + host->pg.page = host->pg.page + 1; host->pg.mapped = kmap(host->pg.page); host->blk_page = host->bounce_buf; @@ -503,7 +503,7 @@ static void usdhi6_sg_advance(struct usdhi6_host *host) /* We cannot get here after crossing a page border */ /* Next page in the same SG */ - host->pg.page = nth_page(sg_page(host->sg), host->page_idx); + host->pg.page = sg_page(host->sg) + host->page_idx; host->pg.mapped = kmap(host->pg.page); host->blk_page = host->pg.mapped; -- 2.50.1 From david at redhat.com Wed 
Aug 27 22:10:43 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:10:43 -0000 Subject: [PATCH v1 29/36] scsi: scsi_lib: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-30-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. Reviewed-by: Bart Van Assche Cc: "James E.J. Bottomley" Cc: "Martin K. Petersen" Signed-off-by: David Hildenbrand --- drivers/scsi/scsi_lib.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c index 0c65ecfedfbd6..d7e42293b8645 100644 --- a/drivers/scsi/scsi_lib.c +++ b/drivers/scsi/scsi_lib.c @@ -3148,8 +3148,7 @@ void *scsi_kmap_atomic_sg(struct scatterlist *sgl, int sg_count, /* Offset starting from the beginning of first page in this sg-entry */ *offset = *offset - len_complete + sg->offset; - /* Assumption: contiguous pages can be accessed as "page + i" */ - page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT)); + page = sg_page(sg) + (*offset >> PAGE_SHIFT); *offset &= ~PAGE_MASK; /* Bytes in this sg-entry from *offset to the end of the page */ -- 2.50.1 From david at redhat.com Wed Aug 27 22:11:05 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:11:05 -0000 Subject: [PATCH v1 30/36] scsi: sg: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-31-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. Reviewed-by: Bart Van Assche Cc: Doug Gilbert Cc: "James E.J. Bottomley" Cc: "Martin K. 
Petersen" Signed-off-by: David Hildenbrand --- drivers/scsi/sg.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c index 3c02a5f7b5f39..4c62c597c7be9 100644 --- a/drivers/scsi/sg.c +++ b/drivers/scsi/sg.c @@ -1235,8 +1235,7 @@ sg_vma_fault(struct vm_fault *vmf) len = vma->vm_end - sa; len = (len < length) ? len : length; if (offset < len) { - struct page *page = nth_page(rsv_schp->pages[k], - offset >> PAGE_SHIFT); + struct page *page = rsv_schp->pages[k] + (offset >> PAGE_SHIFT); get_page(page); /* increment page count */ vmf->page = page; return 0; /* success */ -- 2.50.1 From david at redhat.com Wed Aug 27 22:11:13 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:11:13 -0000 Subject: [PATCH v1 31/36] vfio/pci: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-32-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. 
Cc: Brett Creeley Cc: Jason Gunthorpe Cc: Yishai Hadas Cc: Shameer Kolothum Cc: Kevin Tian Cc: Alex Williamson Signed-off-by: David Hildenbrand --- drivers/vfio/pci/pds/lm.c | 3 +-- drivers/vfio/pci/virtio/migrate.c | 3 +-- 2 files changed, 2 insertions(+), 4 deletions(-) diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c index f2673d395236a..4d70c833fa32e 100644 --- a/drivers/vfio/pci/pds/lm.c +++ b/drivers/vfio/pci/pds/lm.c @@ -151,8 +151,7 @@ static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file, lm_file->last_offset_sg = sg; lm_file->sg_last_entry += i; lm_file->last_offset = cur_offset; - return nth_page(sg_page(sg), - (offset - cur_offset) / PAGE_SIZE); + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; } cur_offset += sg->length; } diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c index ba92bb4e9af94..7dd0ac866461d 100644 --- a/drivers/vfio/pci/virtio/migrate.c +++ b/drivers/vfio/pci/virtio/migrate.c @@ -53,8 +53,7 @@ virtiovf_get_migration_page(struct virtiovf_data_buffer *buf, buf->last_offset_sg = sg; buf->sg_last_entry += i; buf->last_offset = cur_offset; - return nth_page(sg_page(sg), - (offset - cur_offset) / PAGE_SIZE); + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; } cur_offset += sg->length; } -- 2.50.1 From david at redhat.com Wed Aug 27 22:11:30 2025 From: david at redhat.com (David Hildenbrand) Date: Wed, 27 Aug 2025 22:11:30 -0000 Subject: [PATCH v1 32/36] crypto: remove nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-1-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> Message-ID: <20250827220141.262669-33-david@redhat.com> It's no longer required to use nth_page() when iterating pages within a single SG entry, so let's drop the nth_page() usage. Cc: Herbert Xu Cc: "David S. 
Miller" Signed-off-by: David Hildenbrand --- crypto/ahash.c | 4 ++-- crypto/scompress.c | 8 ++++---- include/crypto/scatterwalk.h | 4 ++-- 3 files changed, 8 insertions(+), 8 deletions(-) diff --git a/crypto/ahash.c b/crypto/ahash.c index a227793d2c5b5..dfb4f5476428f 100644 --- a/crypto/ahash.c +++ b/crypto/ahash.c @@ -88,7 +88,7 @@ static int hash_walk_new_entry(struct crypto_hash_walk *walk) sg = walk->sg; walk->offset = sg->offset; - walk->pg = nth_page(sg_page(walk->sg), (walk->offset >> PAGE_SHIFT)); + walk->pg = sg_page(walk->sg) + (walk->offset >> PAGE_SHIFT); walk->offset = offset_in_page(walk->offset); walk->entrylen = sg->length; @@ -226,7 +226,7 @@ int shash_ahash_digest(struct ahash_request *req, struct shash_desc *desc) if (!IS_ENABLED(CONFIG_HIGHMEM)) return crypto_shash_digest(desc, data, nbytes, req->result); - page = nth_page(page, offset >> PAGE_SHIFT); + page += offset >> PAGE_SHIFT; offset = offset_in_page(offset); if (nbytes > (unsigned int)PAGE_SIZE - offset) diff --git a/crypto/scompress.c b/crypto/scompress.c index c651e7f2197a9..1a7ed8ae65b07 100644 --- a/crypto/scompress.c +++ b/crypto/scompress.c @@ -198,7 +198,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) } else return -ENOSYS; - dpage = nth_page(dpage, doff / PAGE_SIZE); + dpage += doff / PAGE_SIZE; doff = offset_in_page(doff); n = (dlen - 1) / PAGE_SIZE; @@ -220,12 +220,12 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) } else break; - spage = nth_page(spage, soff / PAGE_SIZE); + spage = spage + soff / PAGE_SIZE; soff = offset_in_page(soff); n = (slen - 1) / PAGE_SIZE; n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE; - if (PageHighMem(nth_page(spage, n)) && + if (PageHighMem(spage + n) && size_add(soff, slen) > PAGE_SIZE) break; src = kmap_local_page(spage) + soff; @@ -270,7 +270,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) if (dlen <= PAGE_SIZE) break; dlen -= PAGE_SIZE; - dpage = nth_page(dpage, 1); + 
dpage++; } } diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h index 15ab743f68c8f..83d14376ff2bc 100644 --- a/include/crypto/scatterwalk.h +++ b/include/crypto/scatterwalk.h @@ -159,7 +159,7 @@ static inline void scatterwalk_map(struct scatter_walk *walk) if (IS_ENABLED(CONFIG_HIGHMEM)) { struct page *page; - page = nth_page(base_page, offset >> PAGE_SHIFT); + page = base_page + (offset >> PAGE_SHIFT); offset = offset_in_page(offset); addr = kmap_local_page(page) + offset; } else { @@ -259,7 +259,7 @@ static inline void scatterwalk_done_dst(struct scatter_walk *walk, end += (offset_in_page(offset) + offset_in_page(nbytes) + PAGE_SIZE - 1) >> PAGE_SHIFT; for (i = start; i < end; i++) - flush_dcache_page(nth_page(base_page, i)); + flush_dcache_page(base_page + i); } scatterwalk_advance(walk, nbytes); } -- 2.50.1 From dave.hansen at intel.com Wed Aug 27 22:52:38 2025 From: dave.hansen at intel.com (Dave Hansen) Date: Wed, 27 Aug 2025 22:52:38 -0000 Subject: [PATCH v1 04/36] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-5-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-5-david@redhat.com> Message-ID: On 8/27/25 15:01, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. Acked-by: Dave Hansen From david at redhat.com Thu Aug 28 07:46:34 2025 From: david at redhat.com (David Hildenbrand) Date: Thu, 28 Aug 2025 07:46:34 -0000 Subject: [PATCH v1 12/36] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250828074356.3xiuqugokg36yuxw@master> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-13-david@redhat.com> <20250828074356.3xiuqugokg36yuxw@master> Message-ID: <0e1c0fe1-4dd1-46dc-8ce8-a6bf6e4c3e80@redhat.com> > > Curious about why it is in page-flags.h. It seems not related to page-flags. Likely because we have the page_folio() in there as well. 
-- Cheers David / dhildenb From catalin.marinas at arm.com Thu Aug 28 10:43:49 2025 From: catalin.marinas at arm.com (Catalin Marinas) Date: Thu, 28 Aug 2025 10:43:49 -0000 Subject: [PATCH v1 02/36] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-3-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-3-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:06AM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. > > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Catalin Marinas > Cc: Will Deacon > Signed-off-by: David Hildenbrand Acked-by: Catalin Marinas From bcreeley at amd.com Thu Aug 28 20:16:00 2025 From: bcreeley at amd.com (Brett Creeley) Date: Thu, 28 Aug 2025 20:16:00 -0000 Subject: [PATCH v1 31/36] vfio/pci: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-32-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-32-david@redhat.com> Message-ID: On 8/27/2025 3:01 PM, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage.
> > Cc: Brett Creeley > Cc: Jason Gunthorpe > Cc: Yishai Hadas > Cc: Shameer Kolothum > Cc: Kevin Tian > Cc: Alex Williamson > Signed-off-by: David Hildenbrand > --- > drivers/vfio/pci/pds/lm.c | 3 +-- > drivers/vfio/pci/virtio/migrate.c | 3 +-- > 2 files changed, 2 insertions(+), 4 deletions(-) > > diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c > index f2673d395236a..4d70c833fa32e 100644 > --- a/drivers/vfio/pci/pds/lm.c > +++ b/drivers/vfio/pci/pds/lm.c > @@ -151,8 +151,7 @@ static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file, > lm_file->last_offset_sg = sg; > lm_file->sg_last_entry += i; > lm_file->last_offset = cur_offset; > - return nth_page(sg_page(sg), > - (offset - cur_offset) / PAGE_SIZE); > + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; > } > cur_offset += sg->length; > } > diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c > index ba92bb4e9af94..7dd0ac866461d 100644 > --- a/drivers/vfio/pci/virtio/migrate.c > +++ b/drivers/vfio/pci/virtio/migrate.c > @@ -53,8 +53,7 @@ virtiovf_get_migration_page(struct virtiovf_data_buffer *buf, > buf->last_offset_sg = sg; > buf->sg_last_entry += i; > buf->last_offset = cur_offset; > - return nth_page(sg_page(sg), > - (offset - cur_offset) / PAGE_SIZE); > + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; > } > cur_offset += sg->length; > } LGTM. Thanks. 
Reviewed-by: Brett Creeley > -- > 2.50.1 > From syzbot+ci0b43493baa45553d at syzkaller.appspotmail.com Thu Aug 21 21:37:15 2025 From: syzbot+ci0b43493baa45553d at syzkaller.appspotmail.com (syzbot ci) Date: Thu, 21 Aug 2025 21:37:15 -0000 Subject: [syzbot ci] Re: mm: remove nth_page() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> Message-ID: <68a79189.050a0220.cb3d1.0004.GAE@google.com> syzbot ci has tested the following series [v1] mm: remove nth_page() https://lore.kernel.org/all/20250821200701.1329277-1-david at redhat.com * [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable * [PATCH RFC 02/35] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" * [PATCH RFC 03/35] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" * [PATCH RFC 04/35] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" * [PATCH RFC 05/35] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config * [PATCH RFC 06/35] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() * [PATCH RFC 07/35] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() * [PATCH RFC 08/35] mm/hugetlb: check for unreasonable folio sizes when registering hstate * [PATCH RFC 09/35] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() * [PATCH RFC 10/35] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() * [PATCH RFC 11/35] mm: sanity-check maximum folio size in folio_set_order() * [PATCH RFC 12/35] mm: limit folio/compound page sizes in problematic kernel configs * [PATCH RFC 13/35] mm: simplify folio_page() and folio_page_idx() * [PATCH RFC 14/35] mm/mm/percpu-km: drop nth_page() usage within single allocation * [PATCH RFC 15/35] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() * [PATCH RFC 16/35] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() * [PATCH RFC 17/35] mm/gup: drop nth_page() usage within folio when 
recording subpages * [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage * [PATCH RFC 19/35] io_uring/zcrx: remove nth_page() usage within folio * [PATCH RFC 20/35] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() * [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges * [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() * [PATCH RFC 23/35] scatterlist: disallow non-contigous page ranges in a single SG entry * [PATCH RFC 24/35] ata: libata-eh: drop nth_page() usage within SG entry * [PATCH RFC 25/35] drm/i915/gem: drop nth_page() usage within SG entry * [PATCH RFC 26/35] mspro_block: drop nth_page() usage within SG entry * [PATCH RFC 27/35] memstick: drop nth_page() usage within SG entry * [PATCH RFC 28/35] mmc: drop nth_page() usage within SG entry * [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry * [PATCH RFC 30/35] vfio/pci: drop nth_page() usage within SG entry * [PATCH RFC 31/35] crypto: remove nth_page() usage within SG entry * [PATCH RFC 32/35] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() * [PATCH RFC 33/35] kfence: drop nth_page() usage * [PATCH RFC 34/35] block: update comment of "struct bio_vec" regarding nth_page() * [PATCH RFC 35/35] mm: remove nth_page() and found the following issue: general protection fault in kfence_guarded_alloc Full report is available here: https://ci.syzbot.org/series/f6f0aea1-9616-4675-8c80-f9b59ba3211c *** general protection fault in kfence_guarded_alloc tree: net-next URL: https://kernel.googlesource.com/pub/scm/linux/kernel/git/netdev/net-next.git base: da114122b83149d1f1db0586b1d67947b651aa20 arch: amd64 compiler: Debian clang version 20.1.7 (++20250616065708+6146a88f6049-1~exp1~20250616065826.132), Debian LLD 20.1.7 config: https://ci.syzbot.org/builds/705b7862-eb10-40bd-a4cf-4820b4912466/config smpboot: CPU0: Intel(R) Xeon(R) CPU @ 2.80GHz (family: 0x6, model: 0x55, stepping: 
0x7) Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 UID: 0 PID: 1 Comm: swapper/0 Not tainted syzkaller #0 PREEMPT(full) Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 RIP: 0010:kfence_guarded_alloc+0x643/0xc70 Code: 41 c1 e5 18 bf 00 00 00 f5 44 89 ee e8 a6 67 9c ff 45 31 f6 41 81 fd 00 00 00 f5 4c 0f 44 f3 49 8d 7e 08 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 05 e8 f1 cb ff ff 4c 8b 6c 24 18 4d 89 6e 08 49 RSP: 0000:ffffc90000047740 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffea0004d90080 RCX: 0000000000000000 RDX: ffff88801c2e8000 RSI: 00000000ff000000 RDI: 0000000000000008 RBP: ffffc90000047850 R08: ffffffff99b2201b R09: 1ffffffff3364403 R10: dffffc0000000000 R11: fffffbfff3364404 R12: dffffc0000000000 R13: 00000000ff000000 R14: 0000000000000000 R15: ffff88813fec7068 FS: 0000000000000000(0000) GS:ffff8880b861c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88813ffff000 CR3: 000000000df36000 CR4: 0000000000350ef0 Call Trace: __kfence_alloc+0x385/0x3b0 __kmalloc_noprof+0x440/0x4f0 __alloc_workqueue+0x103/0x1b70 alloc_workqueue_noprof+0xd4/0x210 init_mm_internals+0x17/0x140 kernel_init_freeable+0x307/0x4b0 kernel_init+0x1d/0x1d0 ret_from_fork+0x3f9/0x770 ret_from_fork_asm+0x1a/0x30 Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:kfence_guarded_alloc+0x643/0xc70 Code: 41 c1 e5 18 bf 00 00 00 f5 44 89 ee e8 a6 67 9c ff 45 31 f6 41 81 fd 00 00 00 f5 4c 0f 44 f3 49 8d 7e 08 48 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 05 e8 f1 cb ff ff 4c 8b 6c 24 18 4d 89 6e 08 49 RSP: 0000:ffffc90000047740 EFLAGS: 00010202 RAX: 0000000000000001 RBX: ffffea0004d90080 RCX: 0000000000000000 RDX: ffff88801c2e8000 RSI: 00000000ff000000 RDI: 0000000000000008 RBP: ffffc90000047850 R08: ffffffff99b2201b R09: 1ffffffff3364403 R10: dffffc0000000000 
R11: fffffbfff3364404 R12: dffffc0000000000 R13: 00000000ff000000 R14: 0000000000000000 R15: ffff88813fec7068 FS: 0000000000000000(0000) GS:ffff8880b861c000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff88813ffff000 CR3: 000000000df36000 CR4: 0000000000350ef0 *** If these findings have caused you to resend the series or submit a separate fix, please add the following tag to your commit message: Tested-by: syzbot at syzkaller.appspotmail.com --- This report is generated by a bot. It may contain errors. syzbot ci engineers can be reached at syzkaller at googlegroups.com. From m.szyprowski at samsung.com Fri Aug 22 08:15:08 2025 From: m.szyprowski at samsung.com (Marek Szyprowski) Date: Fri, 22 Aug 2025 08:15:08 -0000 Subject: [PATCH RFC 22/35] dma-remap: drop nth_page() in dma_common_contiguous_remap() In-Reply-To: <20250821200701.1329277-23-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-23-david@redhat.com> Message-ID: On 21.08.2025 22:06, David Hildenbrand wrote: > dma_common_contiguous_remap() is used to remap an "allocated contiguous > region". Within a single allocation, there is no need to use nth_page() > anymore. > > Neither the buddy, nor hugetlb, nor CMA will hand out problematic page > ranges. 
> > Cc: Marek Szyprowski > Cc: Robin Murphy > Signed-off-by: David Hildenbrand Acked-by: Marek Szyprowski > --- > kernel/dma/remap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c > index 9e2afad1c6152..b7c1c0c92d0c8 100644 > --- a/kernel/dma/remap.c > +++ b/kernel/dma/remap.c > @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size, > if (!pages) > return NULL; > for (i = 0; i < count; i++) > - pages[i] = nth_page(page, i); > + pages[i] = page++; > vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); > kvfree(pages); > Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland From m.szyprowski at samsung.com Fri Aug 22 08:15:20 2025 From: m.szyprowski at samsung.com (Marek Szyprowski) Date: Fri, 22 Aug 2025 08:15:20 -0000 Subject: [PATCH RFC 23/35] scatterlist: disallow non-contiguous page ranges in a single SG entry In-Reply-To: <20250821200701.1329277-24-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-24-david@redhat.com> Message-ID: <8c4b0068-92be-427b-8bfc-9926eea6aa09@samsung.com> On 21.08.2025 22:06, David Hildenbrand wrote: > The expectation is that there is currently no user that would pass in > non-contiguous page ranges: no allocator, not even VMA, will hand these > out. > > The only problematic part would be if someone would provide a range > obtained directly from memblock, or manually merge problematic ranges. > If we find such cases, we should fix them to create separate > SG entries. > > Let's check in sg_set_page() that this is really the case. No need to > check in sg_set_folio(), as pages in a folio are guaranteed to be > contiguous. > > We can now drop the nth_page() usage in sg_page_iter_page().
> > Signed-off-by: David Hildenbrand Acked-by: Marek Szyprowski > --- > include/linux/scatterlist.h | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h > index 6f8a4965f9b98..8196949dfc82c 100644 > --- a/include/linux/scatterlist.h > +++ b/include/linux/scatterlist.h > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > #include > > struct scatterlist { > @@ -158,6 +159,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) > static inline void sg_set_page(struct scatterlist *sg, struct page *page, > unsigned int len, unsigned int offset) > { > + VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE)); > sg_assign_page(sg, page); > sg->offset = offset; > sg->length = len; > @@ -600,7 +602,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter, > */ > static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) > { > - return nth_page(sg_page(piter->sg), piter->sg_pgoffset); > + return sg_page(piter->sg) + piter->sg_pgoffset; > } > > /** Best regards -- Marek Szyprowski, PhD Samsung R&D Institute Poland From asml.silence at gmail.com Fri Aug 22 11:31:29 2025 From: asml.silence at gmail.com (Pavel Begunkov) Date: Fri, 22 Aug 2025 11:31:29 -0000 Subject: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage In-Reply-To: <20250821200701.1329277-19-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-19-david@redhat.com> Message-ID: On 8/21/25 21:06, David Hildenbrand wrote: > We always provide a single dst page, it's unclear why the io_copy_cache > complexity is required. Because it'll need to be pulled outside the loop to reuse the page for multiple copies, i.e. packing multiple fragments of the same skb into it. Not finished, and currently it's wasting memory. Why not do as below? 
Pages there never cross boundaries of their folios. Do you want it to be taken into the io_uring tree? diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index e5ff49f3425e..18c12f4b56b6 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, if (folio_test_partial_kmap(page_folio(dst_page)) || folio_test_partial_kmap(page_folio(src_page))) { - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); + dst_page += dst_offset / PAGE_SIZE; dst_offset = offset_in_page(dst_offset); - src_page = nth_page(src_page, src_offset / PAGE_SIZE); + src_page += src_offset / PAGE_SIZE; src_offset = offset_in_page(src_offset); n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); n = min(n, len); -- Pavel Begunkov From rppt at kernel.org Fri Aug 22 15:09:27 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 22 Aug 2025 15:09:27 -0000 Subject: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250821200701.1329277-2-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-2-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:27PM +0200, David Hildenbrand wrote: > In an ideal world, we wouldn't have to deal with SPARSEMEM without > SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is > considered too costly and consequently not supported. > > However, if an architecture does support SPARSEMEM with > SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just > like we already do for arm64, s390 and x86. > > So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without > SPARSEMEM_VMEMMAP. > > This implies that the option to not use SPARSEMEM_VMEMMAP will now be > gone for loongarch, powerpc, riscv and sparc. All architectures only > enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really > be a big downside to using the VMEMMAP (quite the contrary). 
> > This is a preparation for not supporting > > (1) folio sizes that exceed a single memory section > (2) CMA allocations of non-contiguous page ranges > > in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we > want to limit possible impact as much as possible (e.g., gigantic hugetlb > page allocations suddenly fails). > > Cc: Huacai Chen > Cc: WANG Xuerui > Cc: Madhavan Srinivasan > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Christophe Leroy > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: Alexandre Ghiti > Cc: "David S. Miller" > Cc: Andreas Larsson > Signed-off-by: David Hildenbrand Acked-by: Mike Rapoport (Microsoft) > --- > mm/Kconfig | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 4108bcd967848..330d0e698ef96 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE > bool > > config SPARSEMEM_VMEMMAP > - bool "Sparse Memory virtual memmap" > + def_bool y > depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE > - default y > help > SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise > pfn_to_page and page_to_pfn operations. This is the most > -- > 2.50.1 > -- Sincerely yours, Mike. From rppt at kernel.org Fri Aug 22 15:11:37 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 22 Aug 2025 15:11:37 -0000 Subject: [PATCH RFC 03/35] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250821200701.1329277-4-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-4-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:29PM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. 
> > Cc: Heiko Carstens > Cc: Vasily Gorbik > Cc: Alexander Gordeev > Cc: Christian Borntraeger > Cc: Sven Schnelle > Signed-off-by: David Hildenbrand Reviewed-by: Mike Rapoport (Microsoft) > --- > arch/s390/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index bf680c26a33cf..145ca23c2fff6 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -710,7 +710,6 @@ menu "Memory setup" > config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_VMEMMAP_ENABLE > - select SPARSEMEM_VMEMMAP > > config ARCH_SPARSEMEM_DEFAULT > def_bool y > -- > 2.50.1 > > -- Sincerely yours, Mike. From rppt at kernel.org Fri Aug 22 15:12:05 2025 From: rppt at kernel.org (Mike Rapoport) Date: Fri, 22 Aug 2025 15:12:05 -0000 Subject: [PATCH RFC 04/35] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250821200701.1329277-5-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-5-david@redhat.com> Message-ID: On Thu, Aug 21, 2025 at 10:06:30PM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. > > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Dave Hansen > Signed-off-by: David Hildenbrand Reviewed-by: Mike Rapoport (Microsoft) > --- > arch/x86/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 58d890fe2100e..e431d1c06fecd 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_STATIC if X86_32 > select SPARSEMEM_VMEMMAP_ENABLE if X86_64 > - select SPARSEMEM_VMEMMAP if X86_64 > > config ARCH_SPARSEMEM_DEFAULT > def_bool X86_64 || (NUMA && X86_32) > -- > 2.50.1 > -- Sincerely yours, Mike. 
From sj at kernel.org Fri Aug 22 17:02:22 2025 From: sj at kernel.org (SeongJae Park) Date: Fri, 22 Aug 2025 17:02:22 -0000 Subject: [PATCH RFC 01/35] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250821200701.1329277-2-david@redhat.com> Message-ID: <20250822170217.53169-1-sj@kernel.org> On Thu, 21 Aug 2025 22:06:27 +0200 David Hildenbrand wrote: > In an ideal world, we wouldn't have to deal with SPARSEMEM without > SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is > considered too costly and consequently not supported. > > However, if an architecture does support SPARSEMEM with > SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just > like we already do for arm64, s390 and x86. > > So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without > SPARSEMEM_VMEMMAP. > > This implies that the option to not use SPARSEMEM_VMEMMAP will now be > gone for loongarch, powerpc, riscv and sparc. All architectures only > enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really > be a big downside to using the VMEMMAP (quite the contrary). > > This is a preparation for not supporting > > (1) folio sizes that exceed a single memory section > (2) CMA allocations of non-contiguous page ranges > > in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we > want to limit possible impact as much as possible (e.g., gigantic hugetlb > page allocations suddenly fails). > > Cc: Huacai Chen > Cc: WANG Xuerui > Cc: Madhavan Srinivasan > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Christophe Leroy > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: Alexandre Ghiti > Cc: "David S. Miller" > Cc: Andreas Larsson > Signed-off-by: David Hildenbrand Acked-by: SeongJae Park Thanks, SJ [...] 
From bvanassche at acm.org Fri Aug 22 18:02:37 2025 From: bvanassche at acm.org (Bart Van Assche) Date: Fri, 22 Aug 2025 18:02:37 -0000 Subject: [PATCH RFC 29/35] scsi: core: drop nth_page() usage within SG entry In-Reply-To: <20250821200701.1329277-30-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-30-david@redhat.com> Message-ID: <58816f2c-d4a7-4ec0-a48e-66a876ea1168@acm.org> On 8/21/25 1:06 PM, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. Usually the SCSI core and the SG I/O driver are updated separately. Anyway: Reviewed-by: Bart Van Assche From alexandru.elisei at arm.com Tue Aug 26 10:46:11 2025 From: alexandru.elisei at arm.com (Alexandru Elisei) Date: Tue, 26 Aug 2025 10:46:11 -0000 Subject: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: <20250821200701.1329277-22-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-22-david@redhat.com> Message-ID: Hi David, On Thu, Aug 21, 2025 at 10:06:47PM +0200, David Hildenbrand wrote: > Let's disallow handing out PFN ranges with non-contiguous pages, so we > can remove the nth-page usage in __cma_alloc(), and so any callers don't > have to worry about that either when wanting to blindly iterate pages. > > This is really only a problem in configs with SPARSEMEM but without > SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some > cases. > > Will this cause harm? Probably not, because it's mostly 32bit that does > not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could > look into allocating the memmap for the memory sections spanned by a > single CMA region in one go from memblock. 
> > Signed-off-by: David Hildenbrand > --- > include/linux/mm.h | 6 ++++++ > mm/cma.c | 36 +++++++++++++++++++++++------------- > mm/util.c | 33 +++++++++++++++++++++++++++++++++ > 3 files changed, 62 insertions(+), 13 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index ef360b72cb05c..f59ad1f9fc792 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; > extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > #else > #define nth_page(page,n) ((page) + (n)) > +static inline bool page_range_contiguous(const struct page *page, > + unsigned long nr_pages) > +{ > + return true; > +} > #endif > > /* to align the pointer to the (next) page boundary */ > diff --git a/mm/cma.c b/mm/cma.c > index 2ffa4befb99ab..1119fa2830008 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > unsigned long count, unsigned int align, > struct page **pagep, gfp_t gfp) > { > - unsigned long mask, offset; > - unsigned long pfn = -1; > - unsigned long start = 0; > unsigned long bitmap_maxno, bitmap_no, bitmap_count; > + unsigned long start, pfn, mask, offset; > int ret = -EBUSY; > struct page *page = NULL; > > @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > if (bitmap_count > bitmap_maxno) > goto out; > > - for (;;) { > + for (start = 0; ; start = bitmap_no + mask + 1) { > spin_lock_irq(&cma->lock); > /* > * If the request is larger than the available number > @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > spin_unlock_irq(&cma->lock); > break; > } > + > + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); > 
+ page = pfn_to_page(pfn); > + > + /* > + * Do not hand out page ranges that are not contiguous, so > + * callers can just iterate the pages without having to worry > + * about these corner cases. > + */ > + if (!page_range_contiguous(page, count)) { > + spin_unlock_irq(&cma->lock); > + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", > + __func__, cma->name, pfn, pfn + count - 1); > + continue; > + } > + > bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); > cma->available_count -= count; > /* > @@ -821,29 +835,25 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > */ > spin_unlock_irq(&cma->lock); > > - pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); > mutex_lock(&cma->alloc_mutex); > ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); > mutex_unlock(&cma->alloc_mutex); > - if (ret == 0) { > - page = pfn_to_page(pfn); > + if (!ret) > break; > - } > > cma_clear_bitmap(cma, cmr, pfn, count); > if (ret != -EBUSY) > break; > > pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", > - __func__, pfn, pfn_to_page(pfn)); > + __func__, pfn, page); > > trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), Nitpick: I think you already have the page here. > count, align); > - /* try again with a bit different memory target */ > - start = bitmap_no + mask + 1; > } > out: > - *pagep = page; > + if (!ret) > + *pagep = page; > return ret; > } > > @@ -882,7 +892,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, > */ > if (page) { > for (i = 0; i < count; i++) > - page_kasan_tag_reset(nth_page(page, i)); > + page_kasan_tag_reset(page + i); Had a look at it, not very familiar with CMA, but the changes look equivalent to what was before. 
Not sure that's worth a Reviewed-by tag, but here it is in case you want to add it: Reviewed-by: Alexandru Elisei Just so I can better understand the problem being fixed, I guess you can have two consecutive pfns with non-consecutive associated struct page if you have two adjacent memory sections spanning the same physical memory region, is that correct? Thanks, Alex > } > > if (ret && !(gfp & __GFP_NOWARN)) { > diff --git a/mm/util.c b/mm/util.c > index d235b74f7aff7..0bf349b19b652 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, > { > return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); > } > + > +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +/** > + * page_range_contiguous - test whether the page range is contiguous > + * @page: the start of the page range. > + * @nr_pages: the number of pages in the range. > + * > + * Test whether the page range is contiguous, such that they can be iterated > + * naively, corresponding to iterating a contiguous PFN range. > + * > + * This function should primarily only be used for debug checks, or when > + * working with page ranges that are not naturally contiguous (e.g., pages > + * within a folio are). > + * > + * Returns true if contiguous, otherwise false.
> + */ > + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); > + pfn < end_pfn; pfn += PAGES_PER_SECTION) > + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) > + return false; > + return true; > +} > +#endif > #endif /* CONFIG_MMU */ > -- > 2.50.1 > > From alexandru.elisei at arm.com Tue Aug 26 13:11:46 2025 From: alexandru.elisei at arm.com (Alexandru Elisei) Date: Tue, 26 Aug 2025 13:11:46 -0000 Subject: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-22-david@redhat.com> Message-ID: Hi David, On Tue, Aug 26, 2025 at 03:08:08PM +0200, David Hildenbrand wrote: > On 26.08.25 15:03, Alexandru Elisei wrote: > > Hi David, > > > > On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote: > > .. > > > > Just so I can better understand the problem being fixed, I guess you can have > > > > two consecutive pfns with non-consecutive associated struct page if you have two > > > > adjacent memory sections spanning the same physical memory region, is that > > > > correct? > > > > > > Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not > > > guaranteed that > > > > > > pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1 > > > > > > when we cross memory section boundaries. > > > > > > It can be the case for early boot memory if we allocated consecutive areas > > > from memblock when allocating the memmap (struct pages) per memory section, > > > but it's not guaranteed. > > > > Thank you for the explanation, but I'm a bit confused by the last paragraph. I > > think what you're saying is that we can also have the reverse problem, where > > consecutive struct page * represent non-consecutive pfns, because memmap > > allocations happened to return consecutive virtual addresses, is that right? > > Exactly, that's something we have to deal with elsewhere [1]. 
For this code, > it's not a problem because we always allocate a contiguous PFN range. > > > > > If that's correct, I don't think that's the case for CMA, which deals out > > contiguous physical memory. Or were you just trying to explain the other side of > > the problem, and I'm just overthinking it? > > The latter :) Ok, sorry for the noise then, and thank you for educating me. Alex From asml.silence at gmail.com Wed Aug 27 09:42:40 2025 From: asml.silence at gmail.com (Pavel Begunkov) Date: Wed, 27 Aug 2025 09:42:40 -0000 Subject: [PATCH RFC 18/35] io_uring/zcrx: remove "struct io_copy_cache" and one nth_page() usage In-Reply-To: <473f3576-ddf3-4388-aeec-d486f639950a@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-19-david@redhat.com> <473f3576-ddf3-4388-aeec-d486f639950a@redhat.com> Message-ID: <46d09557-1873-4d97-b073-ce0c7296b954@gmail.com> On 8/22/25 14:59, David Hildenbrand wrote: > On 22.08.25 13:32, Pavel Begunkov wrote: >> On 8/21/25 21:06, David Hildenbrand wrote: >>> We always provide a single dst page, it's unclear why the io_copy_cache >>> complexity is required. >> >> Because it'll need to be pulled outside the loop to reuse the page for >> multiple copies, i.e. packing multiple fragments of the same skb into >> it. Not finished, and currently it's wasting memory. > > Okay, so what you're saying is that there will be follow-up work that will actually make this structure useful. Exactly >> Why not do as below? Pages there never cross boundaries of their folios. > Do you want it to be taken into the io_uring tree? > > This should better all go through the MM tree where we actually guarantee contiguous pages within a folio. (see the cover letter) Makes sense. No objection, hopefully it won't cause too many conflicts. 
>> diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c >> index e5ff49f3425e..18c12f4b56b6 100644 >> --- a/io_uring/zcrx.c >> +++ b/io_uring/zcrx.c >> @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, >>           if (folio_test_partial_kmap(page_folio(dst_page)) || >>               folio_test_partial_kmap(page_folio(src_page))) { >> -            dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); >> +            dst_page += dst_offset / PAGE_SIZE; >>               dst_offset = offset_in_page(dst_offset); >> -            src_page = nth_page(src_page, src_offset / PAGE_SIZE); >> +            src_page += src_offset / PAGE_SIZE; > > Yeah, I can do that in the next version given that you have plans on extending that code soon. If we go with this version: Reviewed-by: Pavel Begunkov -- Pavel Begunkov From ziy at nvidia.com Thu Aug 28 01:04:42 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 28 Aug 2025 01:04:42 -0000 Subject: [PATCH v1 08/36] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <20250827220141.262669-9-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-9-david@redhat.com> Message-ID: On 27 Aug 2025, at 18:01, David Hildenbrand wrote: > Let's check that no hstate that corresponds to an unreasonable folio size > is registered by an architecture. If we were to succeed registering, we > could later try allocating an unsupported gigantic folio size. > > Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER > is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have > to use a BUILD_BUG_ON_INVALID() to make it compile. > > No existing kernel configuration should be able to trigger this check: > either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or > gigantic folios will not exceed a memory section (the case on sparse). 
> > Signed-off-by: David Hildenbrand > --- > mm/hugetlb.c | 2 ++ > 1 file changed, 2 insertions(+) > LGTM. Reviewed-by: Zi Yan -- Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 28 01:14:50 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 28 Aug 2025 01:14:50 -0000 Subject: [PATCH v1 15/36] hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() In-Reply-To: <20250827220141.262669-16-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-16-david@redhat.com> Message-ID: <521A948B-6E62-4CF3-947E-17B93F524DA0@nvidia.com> On 27 Aug 2025, at 18:01, David Hildenbrand wrote: > The nth_page() is not really required anymore, so let's remove it. > While at it, cleanup and simplify the code a bit. > > Signed-off-by: David Hildenbrand > --- > fs/hugetlbfs/inode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > LGTM. Reviewed-by: Zi Yan -- Best Regards, Yan, Zi From ziy at nvidia.com Thu Aug 28 01:18:30 2025 From: ziy at nvidia.com (Zi Yan) Date: Thu, 28 Aug 2025 01:18:30 -0000 Subject: [PATCH v1 16/36] hugetlbfs: cleanup folio in adjust_range_hwpoison() In-Reply-To: <20250827220141.262669-17-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-17-david@redhat.com> Message-ID: <22900121-30DB-4A1B-88A2-E3D158E009E2@nvidia.com> On 27 Aug 2025, at 18:01, David Hildenbrand wrote: > Let's cleanup and simplify the function a bit. > > Signed-off-by: David Hildenbrand > --- > fs/hugetlbfs/inode.c | 33 +++++++++++---------------------- > 1 file changed, 11 insertions(+), 22 deletions(-) > LGTM. 
Reviewed-by: Zi Yan -- Best Regards, Yan, Zi From richard.weiyang at gmail.com Thu Aug 28 07:18:52 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 07:18:52 -0000 Subject: [PATCH v1 01/36] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250827220141.262669-2-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-2-david@redhat.com> Message-ID: <20250828071850.kl7clyh6e75horlk@master> On Thu, Aug 28, 2025 at 12:01:05AM +0200, David Hildenbrand wrote: >In an ideal world, we wouldn't have to deal with SPARSEMEM without >SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is >considered too costly and consequently not supported. > >However, if an architecture does support SPARSEMEM with >SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just >like we already do for arm64, s390 and x86. > >So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without >SPARSEMEM_VMEMMAP. > >This implies that the option to not use SPARSEMEM_VMEMMAP will now be >gone for loongarch, powerpc, riscv and sparc. All architectures only >enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really >be a big downside to using the VMEMMAP (quite the contrary). > >This is a preparation for not supporting > >(1) folio sizes that exceed a single memory section >(2) CMA allocations of non-contiguous page ranges > >in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we >want to limit possible impact as much as possible (e.g., gigantic hugetlb >page allocations suddenly fails). > >Acked-by: Zi Yan >Acked-by: Mike Rapoport (Microsoft) >Acked-by: SeongJae Park >Cc: Huacai Chen >Cc: WANG Xuerui >Cc: Madhavan Srinivasan >Cc: Michael Ellerman >Cc: Nicholas Piggin >Cc: Christophe Leroy >Cc: Paul Walmsley >Cc: Palmer Dabbelt >Cc: Albert Ou >Cc: Alexandre Ghiti >Cc: "David S. 
Miller" >Cc: Andreas Larsson >Signed-off-by: David Hildenbrand Reviewed-by: Wei Yang >--- > mm/Kconfig | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > >diff --git a/mm/Kconfig b/mm/Kconfig >index 4108bcd967848..330d0e698ef96 100644 >--- a/mm/Kconfig >+++ b/mm/Kconfig >@@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE > bool > > config SPARSEMEM_VMEMMAP >- bool "Sparse Memory virtual memmap" >+ def_bool y > depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE >- default y > help > SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise > pfn_to_page and page_to_pfn operations. This is the most >-- >2.50.1 > -- Wei Yang Help you, Help me From richard.weiyang at gmail.com Thu Aug 28 07:31:52 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 07:31:52 -0000 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250827220141.262669-7-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> Message-ID: <20250828073150.jyafkufvkjfqwp3f@master> On Thu, Aug 28, 2025 at 12:01:10AM +0200, David Hildenbrand wrote: >Let's reject them early, which in turn makes folio_alloc_gigantic() reject >them properly. > >To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER >and calculate MAX_FOLIO_NR_PAGES based on that. 
> >Reviewed-by: Zi Yan >Acked-by: SeongJae Park >Signed-off-by: David Hildenbrand Reviewed-by: Wei Yang -- Wei Yang Help you, Help me From richard.weiyang at gmail.com Thu Aug 28 07:35:30 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 07:35:30 -0000 Subject: [PATCH v1 09/36] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: <20250827220141.262669-10-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-10-david@redhat.com> Message-ID: <20250828073527.u4k47fohaquzf3pg@master> On Thu, Aug 28, 2025 at 12:01:13AM +0200, David Hildenbrand wrote: >Grepping for "prep_compound_page" leaves one clueless how devdax gets its >compound pages initialized. > >Let's add a comment that might help finding this open-coded >prep_compound_page() initialization more easily. > >Further, let's be less smart about the ordering of initialization and just >perform the prep_compound_head() call after all tail pages were >initialized: just like prep_compound_page() does. > >No need for a comment to describe the initialization order: again, >just like prep_compound_page(). > >Reviewed-by: Mike Rapoport (Microsoft) >Signed-off-by: David Hildenbrand Reviewed-by: Wei Yang -- Wei Yang Help you, Help me From richard.weiyang at gmail.com Thu Aug 28 07:35:56 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 07:35:56 -0000 Subject: [PATCH v1 10/36] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: <20250827220141.262669-11-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-11-david@redhat.com> Message-ID: <20250828073554.evipmbkxrint3tbs@master> On Thu, Aug 28, 2025 at 12:01:14AM +0200, David Hildenbrand wrote: >Let's sanity-check in folio_set_order() whether we would be trying to >create a folio with an order that would make it exceed MAX_FOLIO_ORDER. 
> >This will enable the check whenever a folio/compound page is initialized >through prepare_compound_head() / prepare_compound_page(). > >Reviewed-by: Zi Yan >Signed-off-by: David Hildenbrand Reviewed-by: Wei Yang -- Wei Yang Help you, Help me From richard.weiyang at gmail.com Thu Aug 28 07:37:57 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 07:37:57 -0000 Subject: [PATCH v1 11/36] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250827220141.262669-12-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-12-david@redhat.com> Message-ID: <20250828073755.gyq5cyafrxb7lnw2@master> On Thu, Aug 28, 2025 at 12:01:15AM +0200, David Hildenbrand wrote: >Let's limit the maximum folio size in problematic kernel configs where >the memmap is allocated per memory section (SPARSEMEM without >SPARSEMEM_VMEMMAP) to a single memory section. > >Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE >but not SPARSEMEM_VMEMMAP: sh. > >Fortunately, the biggest hugetlb size sh supports is 64 MiB >(HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB >(SECTION_SIZE_BITS == 26), so their use case is not degraded. > >As folios and memory sections are naturally aligned to their order-2 size >in memory, consequently a single folio can no longer span multiple memory >sections on these problematic kernel configs. 
> >Reviewed-by: Zi Yan >Acked-by: Mike Rapoport (Microsoft) >Signed-off-by: David Hildenbrand Reviewed-by: Wei Yang -- Wei Yang Help you, Help me From richard.weiyang at gmail.com Thu Aug 28 07:43:57 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 07:43:57 -0000 Subject: [PATCH v1 12/36] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250827220141.262669-13-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-13-david@redhat.com> Message-ID: <20250828074356.3xiuqugokg36yuxw@master> On Thu, Aug 28, 2025 at 12:01:16AM +0200, David Hildenbrand wrote: >Now that a single folio/compound page can no longer span memory sections >in problematic kernel configurations, we can stop using nth_page(). > >While at it, turn both macros into static inline functions and add >kernel doc for folio_page_idx(). > >Reviewed-by: Zi Yan >Signed-off-by: David Hildenbrand Reviewed-by: Wei Yang The code looks good, while one nit below. 
>--- > include/linux/mm.h | 16 ++++++++++++++-- > include/linux/page-flags.h | 5 ++++- > 2 files changed, 18 insertions(+), 3 deletions(-) > >diff --git a/include/linux/mm.h b/include/linux/mm.h >index 2dee79fa2efcf..f6880e3225c5c 100644 >--- a/include/linux/mm.h >+++ b/include/linux/mm.h >@@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) >-#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) > #else > #define nth_page(page,n) ((page) + (n)) >-#define folio_page_idx(folio, p) ((p) - &(folio)->page) > #endif > > /* to align the pointer to the (next) page boundary */ >@@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; > /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ > #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) > >+/** >+ * folio_page_idx - Return the number of a page in a folio. >+ * @folio: The folio. >+ * @page: The folio page. >+ * >+ * This function expects that the page is actually part of the folio. >+ * The returned number is relative to the start of the folio. >+ */ >+static inline unsigned long folio_page_idx(const struct folio *folio, >+ const struct page *page) >+{ >+ return page - &folio->page; >+} >+ > static inline struct folio *lru_to_folio(struct list_head *head) > { > return list_entry((head)->prev, struct folio, lru); >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h >index 5ee6ffbdbf831..faf17ca211b4f 100644 >--- a/include/linux/page-flags.h >+++ b/include/linux/page-flags.h >@@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) > * check that the page number lies within @folio; the caller is presumed > * to have a reference to the page. 
> */ >-#define folio_page(folio, n) nth_page(&(folio)->page, n) >+static inline struct page *folio_page(struct folio *folio, unsigned long n) >+{ >+ return &folio->page + n; >+} > Curious about why it is in page-flags.h. It seems not related to page-flags. > static __always_inline int PageTail(const struct page *page) > { >-- >2.50.1 > -- Wei Yang Help you, Help me From richard.weiyang at gmail.com Thu Aug 28 08:18:33 2025 From: richard.weiyang at gmail.com (Wei Yang) Date: Thu, 28 Aug 2025 08:18:33 -0000 Subject: [PATCH v1 12/36] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <0e1c0fe1-4dd1-46dc-8ce8-a6bf6e4c3e80@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-13-david@redhat.com> <20250828074356.3xiuqugokg36yuxw@master> <0e1c0fe1-4dd1-46dc-8ce8-a6bf6e4c3e80@redhat.com> Message-ID: <20250828081831.fv4bs77kihwbffdi@master> On Thu, Aug 28, 2025 at 09:46:25AM +0200, David Hildenbrand wrote: >> >> Curious about why it is in page-flags.h. It seems not related to page-flags. > >Likely because we have the page_folio() in there as well. > Hmm... sorry for this silly question. >-- >Cheers > >David / dhildenb -- Wei Yang Help you, Help me From lorenzo.stoakes at oracle.com Thu Aug 28 14:12:21 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:12:21 -0000 Subject: [PATCH v1 01/36] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250827220141.262669-2-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-2-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:05AM +0200, David Hildenbrand wrote: > In an ideal world, we wouldn't have to deal with SPARSEMEM without > SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is > considered too costly and consequently not supported. 
> > However, if an architecture does support SPARSEMEM with > SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just > like we already do for arm64, s390 and x86. > > So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without > SPARSEMEM_VMEMMAP. > > This implies that the option to not use SPARSEMEM_VMEMMAP will now be > gone for loongarch, powerpc, riscv and sparc. All architectures only > enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really > be a big downside to using the VMEMMAP (quite the contrary). Nice! And I see SPARSEMEM_VMEMMAP_ENABLE is selected by the arches which support it, as you say 64-bit (or in other words - modern :) > > This is a preparation for not supporting > > (1) folio sizes that exceed a single memory section > (2) CMA allocations of non-contiguous page ranges Nice. This should simplify things... :) > > in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we > want to limit possible impact as much as possible (e.g., gigantic hugetlb > page allocations suddenly fails). > > Acked-by: Zi Yan > Acked-by: Mike Rapoport (Microsoft) > Acked-by: SeongJae Park > Cc: Huacai Chen > Cc: WANG Xuerui > Cc: Madhavan Srinivasan > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Christophe Leroy > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: Alexandre Ghiti > Cc: "David S. Miller" > Cc: Andreas Larsson > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/Kconfig | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 4108bcd967848..330d0e698ef96 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE > bool > > config SPARSEMEM_VMEMMAP > - bool "Sparse Memory virtual memmap" > + def_bool y > depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE > - default y > help > SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise > pfn_to_page and page_to_pfn operations. 
This is the most > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:13:00 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:13:00 -0000 Subject: [PATCH v1 02/36] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-3-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-3-david@redhat.com> Message-ID: <504d82f1-65c9-4835-9138-12f605b0aa54@lucifer.local> On Thu, Aug 28, 2025 at 12:01:06AM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. Do you plan to do this for other cases then I guess? Or was this an outlier? I guess I will see :) > > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Catalin Marinas > Cc: Will Deacon > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > arch/arm64/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index e9bbfacc35a64..b1d1f2ff2493b 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz" > config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_VMEMMAP_ENABLE > - select SPARSEMEM_VMEMMAP > > config HW_PERF_EVENTS > def_bool y > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:13:38 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:13:38 -0000 Subject: [PATCH v1 03/36] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-4-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-4-david@redhat.com> Message-ID: <6b835163-58e4-45e6-920b-c0594f97d315@lucifer.local> On Thu, Aug 28, 2025 at 12:01:07AM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. 
Ah yes there are other cases :) > > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Heiko Carstens > Cc: Vasily Gorbik > Cc: Alexander Gordeev > Cc: Christian Borntraeger > Cc: Sven Schnelle > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > arch/s390/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index bf680c26a33cf..145ca23c2fff6 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -710,7 +710,6 @@ menu "Memory setup" > config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_VMEMMAP_ENABLE > - select SPARSEMEM_VMEMMAP > > config ARCH_SPARSEMEM_DEFAULT > def_bool y > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:26:45 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:26:45 -0000 Subject: [PATCH v1 04/36] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-5-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-5-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:08AM +0200, David Hildenbrand wrote: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. 
> > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Dave Hansen > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > arch/x86/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 58d890fe2100e..e431d1c06fecd 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_STATIC if X86_32 > select SPARSEMEM_VMEMMAP_ENABLE if X86_64 > - select SPARSEMEM_VMEMMAP if X86_64 > > config ARCH_SPARSEMEM_DEFAULT > def_bool X86_64 || (NUMA && X86_32) > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:27:16 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:27:16 -0000 Subject: [PATCH v1 05/36] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config In-Reply-To: <20250827220141.262669-6-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-6-david@redhat.com> Message-ID: <544d9592-403d-4b4b-b00f-250acb593c1b@lucifer.local> On Thu, Aug 28, 2025 at 12:01:09AM +0200, David Hildenbrand wrote: > It's no longer user-selectable (and the default was already "y"), so > let's just drop it. > > It was never really relevant to the wireguard selftests either way. > > Acked-by: Mike Rapoport (Microsoft) > Cc: "Jason A. 
Donenfeld" > Cc: Shuah Khan > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > tools/testing/selftests/wireguard/qemu/kernel.config | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config > index 0a5381717e9f4..1149289f4b30f 100644 > --- a/tools/testing/selftests/wireguard/qemu/kernel.config > +++ b/tools/testing/selftests/wireguard/qemu/kernel.config > @@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y > CONFIG_FUTEX=y > CONFIG_SHMEM=y > CONFIG_SLUB=y > -CONFIG_SPARSEMEM_VMEMMAP=y > CONFIG_SMP=y > CONFIG_SCHED_SMT=y > CONFIG_SCHED_MC=y > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:38:23 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:38:23 -0000 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250827220141.262669-7-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:10AM +0200, David Hildenbrand wrote: > Let's reject them early, which in turn makes folio_alloc_gigantic() reject > them properly. > > To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER > and calculate MAX_FOLIO_NR_PAGES based on that. 
> > Reviewed-by: Zi Yan > Acked-by: SeongJae Park > Signed-off-by: David Hildenbrand Some nits, but overall LGTM so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/mm.h | 6 ++++-- > mm/page_alloc.c | 5 ++++- > 2 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 00c8a54127d37..77737cbf2216a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) > > /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) > +#define MAX_FOLIO_ORDER PUD_ORDER > #else > -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES > +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER > #endif > > +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) BIT()? > + > /* > * compound_nr() returns the number of pages in this potentially compound > * page. compound_nr() can be called on a tail page, and is defined to > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index baead29b3e67b..426bc404b80cc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) > int alloc_contig_range_noprof(unsigned long start, unsigned long end, > acr_flags_t alloc_flags, gfp_t gfp_mask) > { > + const unsigned int order = ilog2(end - start); > unsigned long outer_start, outer_end; > int ret = 0; > > @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > PB_ISOLATE_MODE_CMA_ALLOC : > PB_ISOLATE_MODE_OTHER; > > + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) > + return -EINVAL; Possibly not worth it for a one off, but be nice to have this as a helper function, like: static bool is_valid_order(gfp_t gfp_mask, unsigned int order) { return !(gfp_mask & __GFP_COMP) || order <= MAX_FOLIO_ORDER; } Then makes this: if 
(WARN_ON_ONCE(!is_valid_order(gfp_mask, order))) return -EINVAL; Kinda self-documenting! > + > gfp_mask = current_gfp_context(gfp_mask); > if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) > return -EINVAL; > @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > free_contig_range(end, outer_end - end); > } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { > struct page *head = pfn_to_page(start); > - int order = ilog2(end - start); > > check_new_pages(head, order); > prep_new_page(head, order, gfp_mask, 0); > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:46:26 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:46:26 -0000 Subject: [PATCH v1 08/36] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <20250827220141.262669-9-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-9-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:12AM +0200, David Hildenbrand wrote: > Let's check that no hstate that corresponds to an unreasonable folio size > is registered by an architecture. If we were to succeed registering, we > could later try allocating an unsupported gigantic folio size. > > Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER > is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have > to use a BUILD_BUG_ON_INVALID() to make it compile. > > No existing kernel configuration should be able to trigger this check: > either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or > gigantic folios will not exceed a memory section (the case on sparse). I am guessing it's implicit that MAX_FOLIO_ORDER <= section size? 
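[Editor's note: the is_valid_order() shape suggested above for alloc_contig_range_noprof() can be sketched in user space. Everything in this sketch is a stand-in, not the kernel's definitions: the flag bit models __GFP_COMP, the MAX_FOLIO_ORDER value assumes a 1 GiB PUD with 4 KiB pages (order 18), and range_order() models ilog2(end - start).]

```c
#include <stdbool.h>

/* Illustrative stand-ins for the kernel definitions. */
#define GFP_COMP_FLAG   (1u << 0)   /* models __GFP_COMP */
#define MAX_FOLIO_ORDER 18          /* models PUD_ORDER: 1 GiB / 4 KiB = 2^18 pages */

/* Models ilog2(end - start): floor of log2 of the range length in pages. */
static unsigned int range_order(unsigned long nr_pages)
{
    unsigned int order = 0;

    while ((1UL << (order + 1)) <= nr_pages)
        order++;
    return order;
}

/* The order limit only matters when a compound page (folio) was requested. */
static bool is_valid_order(unsigned int gfp_mask, unsigned int order)
{
    return !(gfp_mask & GFP_COMP_FLAG) || order <= MAX_FOLIO_ORDER;
}
```

The early `WARN_ON_ONCE(... order > MAX_FOLIO_ORDER)` in the patch is exactly the negation of this predicate.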
> > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/hugetlb.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 572b6f7772841..4a97e4f14c0dc 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void) > > BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE < > __NR_HPAGEFLAGS); > + BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER); > > if (!hugepages_supported()) { > if (hugetlb_max_hstate || default_hstate_max_huge_pages) > @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order) > } > BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); > BUG_ON(order < order_base_2(__NR_USED_SUBPAGE)); > + WARN_ON(order > MAX_FOLIO_ORDER); > h = &hstates[hugetlb_max_hstate++]; > __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); > h->order = order; > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 14:55:39 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:55:39 -0000 Subject: [PATCH v1 09/36] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: <20250827220141.262669-10-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-10-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:13AM +0200, David Hildenbrand wrote: > Grepping for "prep_compound_page" leaves on clueless how devdax gets its > compound pages initialized. > > Let's add a comment that might help finding this open-coded > prep_compound_page() initialization more easily. > > Further, let's be less smart about the ordering of initialization and just > perform the prep_compound_head() call after all tail pages were > initialized: just like prep_compound_page() does. > > No need for a comment to describe the initialization order: again, > just like prep_compound_page(). 
Wow this is great, thank you for putting a quality comment for this and thinking of this :) We have too much 'special case you just have to know' stuff sitting around, so this kind of thing is always great to see. > > Reviewed-by: Mike Rapoport (Microsoft) > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/mm_init.c | 15 +++++++-------- > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 5c21b3af216b2..df614556741a4 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -1091,6 +1091,12 @@ static void __ref memmap_init_compound(struct page *head, > unsigned long pfn, end_pfn = head_pfn + nr_pages; > unsigned int order = pgmap->vmemmap_shift; > > + /* > + * We have to initialize the pages, including setting up page links. > + * prep_compound_page() does not take care of that, so instead we > + * open-code prep_compound_page() so we can take care of initializing > + * the pages in the same go. > + */ > __SetPageHead(head); > for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { > struct page *page = pfn_to_page(pfn); > @@ -1098,15 +1104,8 @@ static void __ref memmap_init_compound(struct page *head, > __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); > prep_compound_tail(head, pfn - head_pfn); > set_page_count(page, 0); > - > - /* > - * The first tail page stores important compound page info. > - * Call prep_compound_head() after the first tail page has > - * been initialized, to not have the data overwritten. 
> - */ > - if (pfn == head_pfn + 1) > - prep_compound_head(head, order); > } > + prep_compound_head(head, order); > } > > void __ref memmap_init_zone_device(struct zone *zone, > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 15:01:20 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 15:01:20 -0000 Subject: [PATCH v1 10/36] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: <20250827220141.262669-11-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-11-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:14AM +0200, David Hildenbrand wrote: > Let's sanity-check in folio_set_order() whether we would be trying to > create a folio with an order that would make it exceed MAX_FOLIO_ORDER. > > This will enable the check whenever a folio/compound page is initialized > through prepare_compound_head() / prepare_compound_page(). NIT: with CONFIG_DEBUG_VM set :) > > Reviewed-by: Zi Yan > Signed-off-by: David Hildenbrand LGTM (apart from nit below), so: Reviewed-by: Lorenzo Stoakes > --- > mm/internal.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/mm/internal.h b/mm/internal.h > index 45da9ff5694f6..9b0129531d004 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) > { > if (WARN_ON_ONCE(!order || !folio_test_large(folio))) > return; > + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER); Given we have 'full-fat' WARN_ON*()'s above, maybe worth making this one too? 
> > folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; > #ifdef NR_PAGES_IN_LARGE_FOLIO > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 15:11:18 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 15:11:18 -0000 Subject: [PATCH v1 11/36] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250827220141.262669-12-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-12-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:15AM +0200, David Hildenbrand wrote: > Let's limit the maximum folio size in problematic kernel config where > the memmap is allocated per memory section (SPARSEMEM without > SPARSEMEM_VMEMMAP) to a single memory section. > > Currently, only a single architectures supports ARCH_HAS_GIGANTIC_PAGE > but not SPARSEMEM_VMEMMAP: sh. > > Fortunately, the biggest hugetlb size sh supports is 64 MiB > (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB > (SECTION_SIZE_BITS == 26), so their use case is not degraded. > > As folios and memory sections are naturally aligned to their order-2 size > in memory, consequently a single folio can no longer span multiple memory > sections on these problematic kernel configs. > > nth_page() is no longer required when operating within a single compound > page / folio. > > Reviewed-by: Zi Yan > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: David Hildenbrand Really great comments, like this! I wonder if we could have this be part of the first patch where you fiddle with MAX_FOLIO_ORDER etc. but not a big deal.
Anyway LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/mm.h | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 77737cbf2216a..2dee79fa2efcf 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) > return folio_large_nr_pages(folio); > } > > -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_ORDER PUD_ORDER > -#else > +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) > +/* > + * We don't expect any folios that exceed buddy sizes (and consequently > + * memory sections). > + */ > #define MAX_FOLIO_ORDER MAX_PAGE_ORDER > +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +/* > + * Only pages within a single memory section are guaranteed to be > + * contiguous. By limiting folios to a single memory section, all folio > + * pages are guaranteed to be contiguous. > + */ > +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT Hmmm, was this implicit before somehow? I mean surely by the fact as you say that physical contiguity would not otherwise be guaranteed :)) > +#else > +/* > + * There is no real limit on the folio size. We limit them to the maximum we > + * currently expect (e.g., hugetlb, dax). > + */ This is nice. 
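[Editor's note: the sh numbers quoted earlier in the thread (SECTION_SIZE_BITS == 26, 64 MiB hugetlb pages) can be checked with a quick sketch. The 4 KiB PAGE_SHIFT is an assumption for the arithmetic; the point is that MAX_FOLIO_ORDER == PFN_SECTION_SHIFT keeps a naturally aligned max-order folio inside one memory section.]

```c
/* Assumed values: SECTION_SIZE_BITS from the commit message, 4 KiB pages. */
#define SECTION_SIZE_BITS 26
#define PAGE_SHIFT        12
#define PFN_SECTION_SHIFT (SECTION_SIZE_BITS - PAGE_SHIFT)

/* With MAX_FOLIO_ORDER == PFN_SECTION_SHIFT, the largest folio spans
 * exactly one memory section's worth of pages. */
#define MAX_FOLIO_ORDER    PFN_SECTION_SHIFT
#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER)

/* Which section a pfn belongs to. */
static unsigned long pfn_to_section_nr(unsigned long pfn)
{
    return pfn >> PFN_SECTION_SHIFT;
}
```

Since folios are naturally aligned to their size, a max-order folio starting at a section-aligned pfn ends on the last pfn of that same section, so sh's 64 MiB hugetlb pages are unaffected by the new limit.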
> +#define MAX_FOLIO_ORDER PUD_ORDER > #endif > > #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 15:24:44 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 15:24:44 -0000 Subject: [PATCH v1 12/36] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250827220141.262669-13-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-13-david@redhat.com> Message-ID: <7b7406c2-e309-4481-940b-63b6811b986c@lucifer.local> On Thu, Aug 28, 2025 at 12:01:16AM +0200, David Hildenbrand wrote: > Now that a single folio/compound page can no longer span memory sections > in problematic kernel configurations, we can stop using nth_page(). > > While at it, turn both macros into static inline functions and add > kernel doc for folio_page_idx(). > > Reviewed-by: Zi Yan > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/mm.h | 16 ++++++++++++++-- > include/linux/page-flags.h | 5 ++++- > 2 files changed, 18 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2dee79fa2efcf..f6880e3225c5c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) > #else > #define nth_page(page,n) ((page) + (n)) > -#define folio_page_idx(folio, p) ((p) - &(folio)->page) > #endif > > /* to align the pointer to the (next) page boundary */ > @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; > /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ > #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) > > +/** > + * folio_page_idx - 
Return the number of a page in a folio. > + * @folio: The folio. > + * @page: The folio page. > + * > + * This function expects that the page is actually part of the folio. > + * The returned number is relative to the start of the folio. > + */ > +static inline unsigned long folio_page_idx(const struct folio *folio, > + const struct page *page) > +{ > + return page - &folio->page; Ahh now I see why we did all this stuff with regard to the sparse things before :) very nice. > +} > + > static inline struct folio *lru_to_folio(struct list_head *head) > { > return list_entry((head)->prev, struct folio, lru); > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 5ee6ffbdbf831..faf17ca211b4f 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) > * check that the page number lies within @folio; the caller is presumed > * to have a reference to the page. > */ > -#define folio_page(folio, n) nth_page(&(folio)->page, n) > +static inline struct page *folio_page(struct folio *folio, unsigned long n) > +{ > + return &folio->page + n; > +} > > static __always_inline int PageTail(const struct page *page) > { > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 15:38:42 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 15:38:42 -0000 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <20250827220141.262669-14-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: > We can now safely iterate over all pages in a folio, so no need for the > pfn_to_page(). > > Also, as we already force the refcount in __init_single_page() to 1, Mega huge nit (ignore if you want), but maybe worth saying 'via init_page_count()'.
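[Editor's note: the simplification above relies on struct pages of one folio being contiguous in the memmap, so plain pointer subtraction and the old pfn subtraction agree. A toy user-space model, with a fake memmap array and a hypothetical folio placed at index 8, shows the equivalence; none of these names or sizes come from the kernel.]

```c
#include <stddef.h>

/* Minimal model of a virtually contiguous memmap (vmemmap/flatmem case). */
struct page { unsigned long flags; };
struct folio { struct page page; };

static struct page memmap[64];          /* stand-in for the vmemmap */

static unsigned long page_to_pfn(const struct page *page)
{
    return (unsigned long)(page - memmap);
}

/* Simplified folio_page_idx(): plain pointer subtraction. */
static unsigned long folio_page_idx(const struct folio *folio,
                                    const struct page *page)
{
    return page - &folio->page;
}

/* Hypothetical folio starting at memmap[8]; index of its n-th page. */
static unsigned long idx_in_demo_folio(unsigned long n)
{
    const struct folio *folio = (const struct folio *)&memmap[8];

    return folio_page_idx(folio, &memmap[8 + n]);
}

/* The old pfn-based computation for the same page. */
static unsigned long idx_via_pfn(unsigned long n)
{
    return page_to_pfn(&memmap[8 + n]) - page_to_pfn(&memmap[8]);
}
```

With a contiguous memmap both computations return n, which is why the pfn-based variant can be dropped once folios can no longer straddle sections.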
> we can just set the refcount to 0 and avoid page_ref_freeze() + > VM_BUG_ON. Likely, in the future, we would just want to tell > __init_single_page() to which value to initialize the refcount. Right yes :) > > Further, adjust the comments to highlight that we are dealing with an > open-coded prep_compound_page() variant, and add another comment explaining > why we really need the __init_single_page() only on the tail pages. Ah nice another 'anchor' to grep for! > > Note that the current code was likely problematic, but we never ran into > it: prep_compound_tail() would have been called with an offset that might > exceed a memory section, and prep_compound_tail() would have simply > added that offset to the page pointer -- which would not have done the > right thing on sparsemem without vmemmap. > > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/hugetlb.c | 20 ++++++++++++-------- > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 4a97e4f14c0dc..1f42186a85ea4 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, > { > enum zone_type zone = zone_idx(folio_zone(folio)); > int nid = folio_nid(folio); > + struct page *page = folio_page(folio, start_page_number); > unsigned long head_pfn = folio_pfn(folio); > unsigned long pfn, end_pfn = head_pfn + end_page_number; > - int ret; > - > - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { > - struct page *page = pfn_to_page(pfn); > > + /* > + * We mark all tail pages with memblock_reserved_mark_noinit(), > + * so these pages are completely uninitialized. 
> + */ > + for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) { > __init_single_page(page, pfn, zone, nid); > prep_compound_tail((struct page *)folio, pfn - head_pfn); > - ret = page_ref_freeze(page, 1); > - VM_BUG_ON(!ret); > + set_page_count(page, 0); > } > } > > @@ -3257,12 +3258,15 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, > { > int ret; > > - /* Prepare folio head */ > + /* > + * This is an open-coded prep_compound_page() whereby we avoid > + * walking pages twice by initializing/preparing+freezing them in the > + * same go. > + */ > __folio_clear_reserved(folio); > __folio_set_head(folio); > ret = folio_ref_freeze(folio, 1); > VM_BUG_ON(!ret); > - /* Initialize the necessary tail struct pages */ > hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); > prep_compound_head((struct page *)folio, huge_page_order(h)); > } > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 15:44:40 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 15:44:40 -0000 Subject: [PATCH v1 14/36] mm/mm/percpu-km: drop nth_page() usage within single allocation In-Reply-To: <20250827220141.262669-15-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-15-david@redhat.com> Message-ID: <2ee63b0d-f5d8-41ee-ae7a-0e917638cebc@lucifer.local> On Thu, Aug 28, 2025 at 12:01:18AM +0200, David Hildenbrand wrote: > We're allocating a higher-order page from the buddy. For these pages > (that are guaranteed to not exceed a single memory section) there is no > need to use nth_page(). > > Signed-off-by: David Hildenbrand Oh hello! 
Now it all comes together :) nth_page(): LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/percpu-km.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/percpu-km.c b/mm/percpu-km.c > index fe31aa19db81a..4efa74a495cb6 100644 > --- a/mm/percpu-km.c > +++ b/mm/percpu-km.c > @@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) > } > > for (i = 0; i < nr_pages; i++) > - pcpu_set_page_chunk(nth_page(pages, i), chunk); > + pcpu_set_page_chunk(pages + i, chunk); > > chunk->data = pages; > chunk->base_addr = page_address(pages); > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 15:46:26 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 15:46:26 -0000 Subject: [PATCH v1 15/36] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() In-Reply-To: <20250827220141.262669-16-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-16-david@redhat.com> Message-ID: <1d74a0e2-51ff-462f-8f3c-75639fd21221@lucifer.local> On Thu, Aug 28, 2025 at 12:01:19AM +0200, David Hildenbrand wrote: > The nth_page() is not really required anymore, so let's remove it. > While at it, cleanup and simplify the code a bit. Hm, not sure which bit is the cleanup? Was there meant to be more here or? > > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > fs/hugetlbfs/inode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index 34d496a2b7de6..c5a46d10afaa0 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -217,7 +217,7 @@ static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, > break; > offset += n; > if (offset == PAGE_SIZE) { > - page = nth_page(page, 1); > + page++; LOL at that diff. Great!
> offset = 0; > } > } > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 16:20:58 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 16:20:58 -0000 Subject: [PATCH v1 16/36] fs: hugetlbfs: cleanup folio in adjust_range_hwpoison() In-Reply-To: <20250827220141.262669-17-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-17-david@redhat.com> Message-ID: <71cf3600-d9cf-4d16-951c-44582b46c0fa@lucifer.local> On Thu, Aug 28, 2025 at 12:01:20AM +0200, David Hildenbrand wrote: > Let's cleanup and simplify the function a bit. Ah I guess you separated this out from the previous patch? :) I feel like it might be worth talking about the implementation here in the commit message as it took me a while to figure this out. > > Signed-off-by: David Hildenbrand This original implementation is SO GROSS. God this hurts my mind n = min(bytes, (size_t)PAGE_SIZE - offset); So either it'll be remaining bytes in page or we're only spanning one page first time round Then we res += n; bytes -= n; So bytes comes to end of page if spanning multiple Then offset if spanning multiple pages will be PAGE_SIZE -offset + offset (!!!) therefore PAGE_SIZE And we move to the next page and reset offset to 0: offset += n; if (offset == PAGE_SIZE) { page = nth_page(page, 1); offset = 0; } Then from then on n = min(bytes, PAGE_SIZE) (!!!!!!) So res = remaining safe bytes in first page + num other pages OR bytes if we don't span more than 1. Lord above. Also semantics of 'if bytes == 0, then check first page anyway' which you do capture. 
OK think I have convinced myself this is right, so hopefully no deeply subtle off-by-one issues here :P Anyway, LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > fs/hugetlbfs/inode.c | 33 +++++++++++---------------------- > 1 file changed, 11 insertions(+), 22 deletions(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index c5a46d10afaa0..6ca1f6b45c1e5 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -198,31 +198,20 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr, > static size_t adjust_range_hwpoison(struct folio *folio, size_t offset, > size_t bytes) > { > - struct page *page; > - size_t n = 0; > - size_t res = 0; > - > - /* First page to start the loop. */ > - page = folio_page(folio, offset / PAGE_SIZE); > - offset %= PAGE_SIZE; > - while (1) { > - if (is_raw_hwpoison_page_in_hugepage(page)) > - break; > + struct page *page = folio_page(folio, offset / PAGE_SIZE); > + size_t safe_bytes; > + > + if (is_raw_hwpoison_page_in_hugepage(page)) > + return 0; > + /* Safe to read the remaining bytes in this page. */ > + safe_bytes = PAGE_SIZE - (offset % PAGE_SIZE); > + page++; > > - /* Safe to read n bytes without touching HWPOISON subpage. */ > - n = min(bytes, (size_t)PAGE_SIZE - offset); > - res += n; > - bytes -= n; > - if (!bytes || !n) > + for (; safe_bytes < bytes; safe_bytes += PAGE_SIZE, page++) OK this is quite subtle - so if safe_bytes == bytes, this means we've confirmed that all requested bytes are safe. So offset=0, bytes = 4096 would fail this (as safe_bytes == 4096). Maybe worth putting something like: /* * Now we check page-by-page in the folio to see if any bytes we don't * yet know to be safe are contained within poisoned pages or not. */ Above the loop. Or something like this.
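[Editor's note: the loop analysis above can be checked with a small user-space model of the rewritten function. The PAGE_SIZE value and the poison check are stand-ins (@poisoned is the index of the single hwpoisoned page, -1 for none, replacing is_raw_hwpoison_page_in_hugepage()); only the offset/safe_bytes arithmetic mirrors the patch.]

```c
#include <stddef.h>

#define PAGE_SIZE 4096UL

/* Model of the rewritten adjust_range_hwpoison(): how many of @bytes
 * starting at @offset into the folio can be read without touching a
 * poisoned page. */
static size_t adjust_range_hwpoison(size_t offset, size_t bytes, long poisoned)
{
    long idx = (long)(offset / PAGE_SIZE);
    size_t safe_bytes;

    if (idx == poisoned)
        return 0;
    /* Safe to read the remaining bytes in this page. */
    safe_bytes = PAGE_SIZE - (offset % PAGE_SIZE);
    idx++;

    /* Check page-by-page while some requested bytes are still unproven. */
    for (; safe_bytes < bytes; safe_bytes += PAGE_SIZE, idx++)
        if (idx == poisoned)
            break;

    return safe_bytes < bytes ? safe_bytes : bytes; /* min(safe_bytes, bytes) */
}
```

Note the subtlety called out above: for offset=0, bytes=4096, safe_bytes reaches bytes before page 1 is ever checked, so a poisoned second page does not shrink the result.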
> + if (is_raw_hwpoison_page_in_hugepage(page)) > break; > - offset += n; > - if (offset == PAGE_SIZE) { > - page++; > - offset = 0; > - } > - } > > - return res; > + return min(safe_bytes, bytes); Yeah given above analysis this seems correct. You must have torn your hair out over this :) > } > > /* > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 16:22:56 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 16:22:56 -0000 Subject: [PATCH v1 17/36] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() In-Reply-To: <20250827220141.262669-18-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-18-david@redhat.com> Message-ID: <8842cba4-61bc-48e2-b4aa-df9619409621@lucifer.local> On Thu, Aug 28, 2025 at 12:01:21AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() within a folio, so let's just > drop the nth_page() in folio_walk_start(). > > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/pagewalk.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > index c6753d370ff4e..9e4225e5fcf5c 100644 > --- a/mm/pagewalk.c > +++ b/mm/pagewalk.c > @@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, > found: > if (expose_page) > /* Note: Offset from the mapped page, not the folio start. 
*/ > - fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT); > + fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT); Be nice to clean this horrid one-liner up a bit also but that's out of scope here :) > else > fw->page = NULL; > fw->ptl = ptl; > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 16:38:35 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 16:38:35 -0000 Subject: [PATCH v1 18/36] mm/gup: drop nth_page() usage within folio when recording subpages In-Reply-To: <20250827220141.262669-19-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-19-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:22AM +0200, David Hildenbrand wrote: > nth_page() is no longer required when iterating over pages within a > single folio, so let's just drop it when recording subpages. > > Signed-off-by: David Hildenbrand This looks correct to me, so notwithstanding suggestion below, LGTM and: Reviewed-by: Lorenzo Stoakes > --- > mm/gup.c | 7 +++---- > 1 file changed, 3 insertions(+), 4 deletions(-) > > diff --git a/mm/gup.c b/mm/gup.c > index b2a78f0291273..89ca0813791ab 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -488,12 +488,11 @@ static int record_subpages(struct page *page, unsigned long sz, > unsigned long addr, unsigned long end, > struct page **pages) > { > - struct page *start_page; > int nr; > > - start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT); > + page += (addr & (sz - 1)) >> PAGE_SHIFT; > for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) > - pages[nr] = nth_page(start_page, nr); > + pages[nr] = page++; This is really nice, but I wonder if (while we're here) we can't be even more clear as to what's going on here, e.g.: static int record_subpages(struct page *page, unsigned long sz, unsigned long addr, unsigned long end, struct page **pages) { size_t offset_in_folio = (addr & (sz - 1)) >> PAGE_SHIFT; struct page *subpage = page + 
offset_in_folio; int nr = 0; for (; addr != end; addr += PAGE_SIZE, nr++) *pages++ = subpage++; return nr; } Or some variant of that with the masking stuff self-documented. > > return nr; > } > @@ -1512,7 +1511,7 @@ static long __get_user_pages(struct mm_struct *mm, > } > > for (j = 0; j < page_increm; j++) { > - subpage = nth_page(page, j); > + subpage = page + j; > pages[i + j] = subpage; > flush_anon_page(vma, subpage, start + j * PAGE_SIZE); > flush_dcache_page(subpage); > -- > 2.50.1 > Cheers, Lorenzo From lorenzo.stoakes at oracle.com Thu Aug 28 16:49:39 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 16:49:39 -0000 Subject: [PATCH v1 19/36] io_uring/zcrx: remove nth_page() usage within folio In-Reply-To: <20250827220141.262669-20-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-20-david@redhat.com> Message-ID: <4f366255-5dd9-44ac-878c-e44e557b8484@lucifer.local> On Thu, Aug 28, 2025 at 12:01:23AM +0200, David Hildenbrand wrote: > Within a folio/compound page, nth_page() is no longer required. > Given that we call folio_test_partial_kmap()+kmap_local_page(), the code > would already be problematic if the pages would span multiple folios. > > So let's just assume that all src pages belong to a single > folio/compound page and can be iterated ordinarily. The dst page is > currently always a single page, so we're not actually iterating > anything. 
> > Reviewed-by: Pavel Begunkov > Cc: Jens Axboe > Cc: Pavel Begunkov > Signed-off-by: David Hildenbrand On basis of src pages being within the same folio, LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > io_uring/zcrx.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c > index e5ff49f3425e0..18c12f4b56b6c 100644 > --- a/io_uring/zcrx.c > +++ b/io_uring/zcrx.c > @@ -975,9 +975,9 @@ static ssize_t io_copy_page(struct io_copy_cache *cc, struct page *src_page, > > if (folio_test_partial_kmap(page_folio(dst_page)) || > folio_test_partial_kmap(page_folio(src_page))) { > - dst_page = nth_page(dst_page, dst_offset / PAGE_SIZE); > + dst_page += dst_offset / PAGE_SIZE; > dst_offset = offset_in_page(dst_offset); > - src_page = nth_page(src_page, src_offset / PAGE_SIZE); > + src_page += src_offset / PAGE_SIZE; > src_offset = offset_in_page(src_offset); > n = min(PAGE_SIZE - src_offset, PAGE_SIZE - dst_offset); > n = min(n, len); > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 16:57:46 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 16:57:46 -0000 Subject: [PATCH v1 20/36] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: <20250827220141.262669-21-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-21-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:24AM +0200, David Hildenbrand wrote: > Let's make it clearer that we are operating within a single folio by > providing both the folio and the page. > > This implies that for flush_dcache_folio() we'll now avoid one more > page->folio lookup, and that we can safely drop the "nth_page" usage. 
> > Cc: Thomas Bogendoerfer > Signed-off-by: David Hildenbrand > --- > arch/mips/include/asm/cacheflush.h | 11 +++++++---- > arch/mips/mm/cache.c | 8 ++++---- > 2 files changed, 11 insertions(+), 8 deletions(-) > > diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h > index 5d283ef89d90d..8d79bfc687d21 100644 > --- a/arch/mips/include/asm/cacheflush.h > +++ b/arch/mips/include/asm/cacheflush.h > @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); > extern void (*flush_cache_range)(struct vm_area_struct *vma, > unsigned long start, unsigned long end); > extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); > -extern void __flush_dcache_pages(struct page *page, unsigned int nr); > +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); NIT: Be good to drop the extern. > > #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 > static inline void flush_dcache_folio(struct folio *folio) > { > if (cpu_has_dc_aliases) > - __flush_dcache_pages(&folio->page, folio_nr_pages(folio)); > + __flush_dcache_folio_pages(folio, folio_page(folio, 0), > + folio_nr_pages(folio)); > else if (!cpu_has_ic_fills_f_dc) > folio_set_dcache_dirty(folio); > } > @@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio) > > static inline void flush_dcache_page(struct page *page) > { > + struct folio *folio = page_folio(page); > + > if (cpu_has_dc_aliases) > - __flush_dcache_pages(page, 1); > + __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio)); Hmmm, shouldn't this be 1 not folio_nr_pages()? Seems that the original implementation only flushed a single page even if contained within a larger folio? 
> else if (!cpu_has_ic_fills_f_dc) > - folio_set_dcache_dirty(page_folio(page)); > + folio_set_dcache_dirty(folio); > } > > #define flush_dcache_mmap_lock(mapping) do { } while (0) > diff --git a/arch/mips/mm/cache.c b/arch/mips/mm/cache.c > index bf9a37c60e9f0..e3b4224c9a406 100644 > --- a/arch/mips/mm/cache.c > +++ b/arch/mips/mm/cache.c > @@ -99,9 +99,9 @@ SYSCALL_DEFINE3(cacheflush, unsigned long, addr, unsigned long, bytes, > return 0; > } > > -void __flush_dcache_pages(struct page *page, unsigned int nr) > +void __flush_dcache_folio_pages(struct folio *folio, struct page *page, > + unsigned int nr) > { > - struct folio *folio = page_folio(page); > struct address_space *mapping = folio_flush_mapping(folio); > unsigned long addr; > unsigned int i; > @@ -117,12 +117,12 @@ void __flush_dcache_pages(struct page *page, unsigned int nr) > * get faulted into the tlb (and thus flushed) anyways. > */ > for (i = 0; i < nr; i++) { > - addr = (unsigned long)kmap_local_page(nth_page(page, i)); > + addr = (unsigned long)kmap_local_page(page + i); > flush_data_cache_page(addr); > kunmap_local((void *)addr); > } > } > -EXPORT_SYMBOL(__flush_dcache_pages); > +EXPORT_SYMBOL(__flush_dcache_folio_pages); > > void __flush_anon_page(struct page *page, unsigned long vmaddr) > { > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:29:16 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:29:16 -0000 Subject: [PATCH v1 21/36] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: <20250827220141.262669-22-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-22-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:25AM +0200, David Hildenbrand wrote: > Let's disallow handing out PFN ranges with non-contiguous pages, so we > can remove the nth-page usage in __cma_alloc(), and so any callers don't > have to worry about that either when wanting to blindly iterate pages. 
> > This is really only a problem in configs with SPARSEMEM but without > SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some > cases. I'm guessing this is something that we don't need to worry about in reality? > > Will this cause harm? Probably not, because it's mostly 32bit that does > not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could > look into allocating the memmap for the memory sections spanned by a > single CMA region in one go from memblock. > > Reviewed-by: Alexandru Elisei > Signed-off-by: David Hildenbrand LGTM other than refactoring point below. CMA stuff looks fine afaict after staring at it for a while, on proviso that handing out ranges within the same section is always going to be the case. Anyway overall, LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/mm.h | 6 ++++++ > mm/cma.c | 39 ++++++++++++++++++++++++--------------- > mm/util.c | 33 +++++++++++++++++++++++++++++++++ > 3 files changed, 63 insertions(+), 15 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index f6880e3225c5c..2ca1eb2db63ec 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; > extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > #else > #define nth_page(page,n) ((page) + (n)) > +static inline bool page_range_contiguous(const struct page *page, > + unsigned long nr_pages) > +{ > + return true; > +} > #endif > > /* to align the pointer to the (next) page boundary */ > diff --git a/mm/cma.c b/mm/cma.c > index e56ec64d0567e..813e6dc7b0954 100644 > --- a/mm/cma.c > +++ b/mm/cma.c > @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > unsigned long count, unsigned int align, 
> struct page **pagep, gfp_t gfp) > { > - unsigned long mask, offset; > - unsigned long pfn = -1; > - unsigned long start = 0; > unsigned long bitmap_maxno, bitmap_no, bitmap_count; > + unsigned long start, pfn, mask, offset; > int ret = -EBUSY; > struct page *page = NULL; > > @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > if (bitmap_count > bitmap_maxno) > goto out; > > - for (;;) { > + for (start = 0; ; start = bitmap_no + mask + 1) { > spin_lock_irq(&cma->lock); > /* > * If the request is larger than the available number > @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > spin_unlock_irq(&cma->lock); > break; > } > + > + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); > + page = pfn_to_page(pfn); > + > + /* > + * Do not hand out page ranges that are not contiguous, so > + * callers can just iterate the pages without having to worry > + * about these corner cases. > + */ > + if (!page_range_contiguous(page, count)) { > + spin_unlock_irq(&cma->lock); > + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", > + __func__, cma->name, pfn, pfn + count - 1); > + continue; > + } > + > bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); > cma->available_count -= count; > /* > @@ -821,29 +835,24 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > */ > spin_unlock_irq(&cma->lock); > > - pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); > mutex_lock(&cma->alloc_mutex); > ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); > mutex_unlock(&cma->alloc_mutex); > - if (ret == 0) { > - page = pfn_to_page(pfn); > + if (!ret) > break; > - } > > cma_clear_bitmap(cma, cmr, pfn, count); > if (ret != -EBUSY) > break; > > pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", > - __func__, pfn, pfn_to_page(pfn)); > + __func__, pfn, page); > > - trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), > - count, 
align); > - /* try again with a bit different memory target */ > - start = bitmap_no + mask + 1; > + trace_cma_alloc_busy_retry(cma->name, pfn, page, count, align); > } > out: > - *pagep = page; > + if (!ret) > + *pagep = page; > return ret; > } > > @@ -882,7 +891,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, > */ > if (page) { > for (i = 0; i < count; i++) > - page_kasan_tag_reset(nth_page(page, i)); > + page_kasan_tag_reset(page + i); > } > > if (ret && !(gfp & __GFP_NOWARN)) { > diff --git a/mm/util.c b/mm/util.c > index d235b74f7aff7..0bf349b19b652 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, > { > return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); > } > + > +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +/** > + * page_range_contiguous - test whether the page range is contiguous > + * @page: the start of the page range. > + * @nr_pages: the number of pages in the range. > + * > + * Test whether the page range is contiguous, such that they can be iterated > + * naively, corresponding to iterating a contiguous PFN range. > + * > + * This function should primarily only be used for debug checks, or when > + * working with page ranges that are not naturally contiguous (e.g., pages > + * within a folio are). > + * > + * Returns true if contiguous, otherwise false. > + */ > +bool page_range_contiguous(const struct page *page, unsigned long nr_pages) > +{ > + const unsigned long start_pfn = page_to_pfn(page); > + const unsigned long end_pfn = start_pfn + nr_pages; > + unsigned long pfn; > + > + /* > + * The memmap is allocated per memory section. We need to check > + * each involved memory section once. 
> + */ > + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); > + pfn < end_pfn; pfn += PAGES_PER_SECTION) > + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) > + return false; I find this pretty confusing, my test for this is how many times I have to read the code to understand what it's doing :) So we have something like: (pfn of page) start_pfn pfn = align UP | | v v | section | <-----------------> pfn - start_pfn Then check page + (pfn - start_pfn) == pfn_to_page(pfn) And loop such that: (pfn of page) start_pfn pfn | | v v | section | section | <------------------------------------------> pfn - start_pfn Again check page + (pfn - start_pfn) == pfn_to_page(pfn) And so on. So the logic looks good, but it's just... that took me a hot second to parse :) I think a few simple fixups bool page_range_contiguous(const struct page *page, unsigned long nr_pages) { const unsigned long start_pfn = page_to_pfn(page); const unsigned long end_pfn = start_pfn + nr_pages; /* The PFN of the start of the next section. */ unsigned long pfn = ALIGN(start_pfn, PAGES_PER_SECTION); /* The page we'd expected to see if the range were contiguous. */ struct page *expected = page + (pfn - start_pfn); /* * The memmap is allocated per memory section. We need to check * each involved memory section once. 
*/ for (; pfn < end_pfn; pfn += PAGES_PER_SECTION, expected += PAGES_PER_SECTION) if (unlikely(expected != pfn_to_page(pfn))) return false; return true; } > + return true; > +} > +#endif > #endif /* CONFIG_MMU */ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:30:00 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:30:00 -0000 Subject: [PATCH v1 22/36] dma-remap: drop nth_page() in dma_common_contiguous_remap() In-Reply-To: <20250827220141.262669-23-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-23-david@redhat.com> Message-ID: <7cfdbb15-72c1-47e7-b5e0-b8a243f2a516@lucifer.local> On Thu, Aug 28, 2025 at 12:01:26AM +0200, David Hildenbrand wrote: > dma_common_contiguous_remap() is used to remap an "allocated contiguous > region". Within a single allocation, there is no need to use nth_page() > anymore. > > Neither the buddy, nor hugetlb, nor CMA will hand out problematic page > ranges. > > Acked-by: Marek Szyprowski > Cc: Marek Szyprowski > Cc: Robin Murphy > Signed-off-by: David Hildenbrand Nice! 
LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > kernel/dma/remap.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/dma/remap.c b/kernel/dma/remap.c > index 9e2afad1c6152..b7c1c0c92d0c8 100644 > --- a/kernel/dma/remap.c > +++ b/kernel/dma/remap.c > @@ -49,7 +49,7 @@ void *dma_common_contiguous_remap(struct page *page, size_t size, > if (!pages) > return NULL; > for (i = 0; i < count; i++) > - pages[i] = nth_page(page, i); > + pages[i] = page++; > vaddr = vmap(pages, count, VM_DMA_COHERENT, prot); > kvfree(pages); > > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:54:01 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:54:01 -0000 Subject: [PATCH v1 23/36] scatterlist: disallow non-contigous page ranges in a single SG entry In-Reply-To: <20250827220141.262669-24-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-24-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:27AM +0200, David Hildenbrand wrote: > The expectation is that there is currently no user that would pass in > non-contigous page ranges: no allocator, not even VMA, will hand these > out. > > The only problematic part would be if someone would provide a range > obtained directly from memblock, or manually merge problematic ranges. > If we find such cases, we should fix them to create separate > SG entries. > > Let's check in sg_set_page() that this is really the case. No need to > check in sg_set_folio(), as pages in a folio are guaranteed to be > contiguous. As sg_set_page() gets inlined into modules, we have to > export the page_range_contiguous() helper -- use EXPORT_SYMBOL, there is > nothing special about this helper such that we would want to enforce > GPL-only modules. Ah you mention this here (I wrote end of this first :) > > We can now drop the nth_page() usage in sg_page_iter_page(). 
> > Acked-by: Marek Szyprowski > Signed-off-by: David Hildenbrand All LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/scatterlist.h | 3 ++- > mm/util.c | 1 + > 2 files changed, 3 insertions(+), 1 deletion(-) > > diff --git a/include/linux/scatterlist.h b/include/linux/scatterlist.h > index 6f8a4965f9b98..29f6ceb98d74b 100644 > --- a/include/linux/scatterlist.h > +++ b/include/linux/scatterlist.h > @@ -158,6 +158,7 @@ static inline void sg_assign_page(struct scatterlist *sg, struct page *page) > static inline void sg_set_page(struct scatterlist *sg, struct page *page, > unsigned int len, unsigned int offset) > { > + VM_WARN_ON_ONCE(!page_range_contiguous(page, ALIGN(len + offset, PAGE_SIZE) / PAGE_SIZE)); This is pretty horrible as one statement, but I guess we can't really do better, I had a quick look around for some helper that could work but nothing is clearly suitable. So this should be fine. > sg_assign_page(sg, page); > sg->offset = offset; > sg->length = len; > @@ -600,7 +601,7 @@ void __sg_page_iter_start(struct sg_page_iter *piter, > */ > static inline struct page *sg_page_iter_page(struct sg_page_iter *piter) > { > - return nth_page(sg_page(piter->sg), piter->sg_pgoffset); > + return sg_page(piter->sg) + piter->sg_pgoffset; > } > > /** > diff --git a/mm/util.c b/mm/util.c > index 0bf349b19b652..e8b9da6b13230 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -1312,5 +1312,6 @@ bool page_range_contiguous(const struct page *page, unsigned long nr_pages) > return false; > return true; > } > +EXPORT_SYMBOL(page_range_contiguous); Kinda sad that we're doing this as EXPORT_SYMBOL() rather than EXPORT_SYMBOL_GPL() :( but I guess necessary to stay consistent... 
> #endif > #endif /* CONFIG_MMU */ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:54:23 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:54:23 -0000 Subject: [PATCH v1 24/36] ata: libata-eh: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-25-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-25-david@redhat.com> Message-ID: <7612fdc2-97ff-4b89-a532-90c5de56acdc@lucifer.local> On Thu, Aug 28, 2025 at 12:01:28AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Cc: Damien Le Moal > Cc: Niklas Cassel > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/ata/libata-sff.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/drivers/ata/libata-sff.c b/drivers/ata/libata-sff.c > index 7fc407255eb46..1e2a2c33cdc80 100644 > --- a/drivers/ata/libata-sff.c > +++ b/drivers/ata/libata-sff.c > @@ -614,7 +614,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) > offset = qc->cursg->offset + qc->cursg_ofs; > > /* get the current page and offset */ > - page = nth_page(page, (offset >> PAGE_SHIFT)); > + page += offset >> PAGE_SHIFT; > offset %= PAGE_SIZE; > > /* don't overrun current sg */ > @@ -631,7 +631,7 @@ static void ata_pio_sector(struct ata_queued_cmd *qc) > unsigned int split_len = PAGE_SIZE - offset; > > ata_pio_xfer(qc, page, offset, split_len); > - ata_pio_xfer(qc, nth_page(page, 1), 0, count - split_len); > + ata_pio_xfer(qc, page + 1, 0, count - split_len); > } else { > ata_pio_xfer(qc, page, offset, count); > } > @@ -751,7 +751,7 @@ static int __atapi_pio_bytes(struct ata_queued_cmd *qc, unsigned int bytes) > offset = sg->offset + qc->cursg_ofs; > > /* get the current page and offset */ > - page = nth_page(page, (offset >> PAGE_SHIFT)); > + page += offset >> 
PAGE_SHIFT; > offset %= PAGE_SIZE; > > /* don't overrun current sg */ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:56:42 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:56:42 -0000 Subject: [PATCH v1 25/36] drm/i915/gem: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-26-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-26-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:29AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Cc: Jani Nikula > Cc: Joonas Lahtinen > Cc: Rodrigo Vivi > Cc: Tvrtko Ursulin > Cc: David Airlie > Cc: Simona Vetter > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/gpu/drm/i915/gem/i915_gem_pages.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/i915/gem/i915_gem_pages.c b/drivers/gpu/drm/i915/gem/i915_gem_pages.c > index c16a57160b262..031d7acc16142 100644 > --- a/drivers/gpu/drm/i915/gem/i915_gem_pages.c > +++ b/drivers/gpu/drm/i915/gem/i915_gem_pages.c > @@ -779,7 +779,7 @@ __i915_gem_object_get_page(struct drm_i915_gem_object *obj, pgoff_t n) > GEM_BUG_ON(!i915_gem_object_has_struct_page(obj)); > > sg = i915_gem_object_get_sg(obj, n, &offset); > - return nth_page(sg_page(sg), offset); > + return sg_page(sg) + offset; > } > > /* Like i915_gem_object_get_page(), but mark the returned page dirty */ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:57:42 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:57:42 -0000 Subject: [PATCH v1 26/36] mspro_block: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-27-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-27-david@redhat.com> Message-ID: 
<1e64780f-b408-41a4-8cf3-376e5a1948ca@lucifer.local> On Thu, Aug 28, 2025 at 12:01:30AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Acked-by: Ulf Hansson > Cc: Maxim Levitsky > Cc: Alex Dubov > Cc: Ulf Hansson > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/memstick/core/mspro_block.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/memstick/core/mspro_block.c b/drivers/memstick/core/mspro_block.c > index c9853d887d282..d3f160dc0da4c 100644 > --- a/drivers/memstick/core/mspro_block.c > +++ b/drivers/memstick/core/mspro_block.c > @@ -560,8 +560,7 @@ static int h_mspro_block_transfer_data(struct memstick_dev *card, > t_offset += msb->current_page * msb->page_size; > > sg_set_page(&t_sg, > - nth_page(sg_page(&(msb->req_sg[msb->current_seg])), > - t_offset >> PAGE_SHIFT), > + sg_page(&(msb->req_sg[msb->current_seg])) + (t_offset >> PAGE_SHIFT), > msb->page_size, offset_in_page(t_offset)); > > memstick_init_req_sg(*mrq, msb->data_dir == READ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 17:58:21 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 17:58:21 -0000 Subject: [PATCH v1 27/36] memstick: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-28-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-28-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:31AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. 
> > Acked-by: Ulf Hansson > Cc: Maxim Levitsky > Cc: Alex Dubov > Cc: Ulf Hansson > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/memstick/host/jmb38x_ms.c | 3 +-- > drivers/memstick/host/tifm_ms.c | 3 +-- > 2 files changed, 2 insertions(+), 4 deletions(-) > > diff --git a/drivers/memstick/host/jmb38x_ms.c b/drivers/memstick/host/jmb38x_ms.c > index cddddb3a5a27f..79e66e30417c1 100644 > --- a/drivers/memstick/host/jmb38x_ms.c > +++ b/drivers/memstick/host/jmb38x_ms.c > @@ -317,8 +317,7 @@ static int jmb38x_ms_transfer_data(struct jmb38x_ms_host *host) > unsigned int p_off; > > if (host->req->long_data) { > - pg = nth_page(sg_page(&host->req->sg), > - off >> PAGE_SHIFT); > + pg = sg_page(&host->req->sg) + (off >> PAGE_SHIFT); > p_off = offset_in_page(off); > p_cnt = PAGE_SIZE - p_off; > p_cnt = min(p_cnt, length); > diff --git a/drivers/memstick/host/tifm_ms.c b/drivers/memstick/host/tifm_ms.c > index db7f3a088fb09..0b6a90661eee5 100644 > --- a/drivers/memstick/host/tifm_ms.c > +++ b/drivers/memstick/host/tifm_ms.c > @@ -201,8 +201,7 @@ static unsigned int tifm_ms_transfer_data(struct tifm_ms *host) > unsigned int p_off; > > if (host->req->long_data) { > - pg = nth_page(sg_page(&host->req->sg), > - off >> PAGE_SHIFT); > + pg = sg_page(&host->req->sg) + (off >> PAGE_SHIFT); > p_off = offset_in_page(off); > p_cnt = PAGE_SIZE - p_off; > p_cnt = min(p_cnt, length); > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:00:25 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:00:25 -0000 Subject: [PATCH v1 28/36] mmc: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-29-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-29-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:32AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's 
drop the nth_page() usage. > > Acked-by: Ulf Hansson > Cc: Alex Dubov > Cc: Ulf Hansson > Cc: Jesper Nilsson > Cc: Lars Persson > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/mmc/host/tifm_sd.c | 4 ++-- > drivers/mmc/host/usdhi6rol0.c | 4 ++-- > 2 files changed, 4 insertions(+), 4 deletions(-) > > diff --git a/drivers/mmc/host/tifm_sd.c b/drivers/mmc/host/tifm_sd.c > index ac636efd911d3..2cd69c9e9571b 100644 > --- a/drivers/mmc/host/tifm_sd.c > +++ b/drivers/mmc/host/tifm_sd.c > @@ -191,7 +191,7 @@ static void tifm_sd_transfer_data(struct tifm_sd *host) > } > off = sg[host->sg_pos].offset + host->block_pos; > > - pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT); > + pg = sg_page(&sg[host->sg_pos]) + (off >> PAGE_SHIFT); > p_off = offset_in_page(off); > p_cnt = PAGE_SIZE - p_off; > p_cnt = min(p_cnt, cnt); > @@ -240,7 +240,7 @@ static void tifm_sd_bounce_block(struct tifm_sd *host, struct mmc_data *r_data) > } > off = sg[host->sg_pos].offset + host->block_pos; > > - pg = nth_page(sg_page(&sg[host->sg_pos]), off >> PAGE_SHIFT); > + pg = sg_page(&sg[host->sg_pos]) + (off >> PAGE_SHIFT); > p_off = offset_in_page(off); > p_cnt = PAGE_SIZE - p_off; > p_cnt = min(p_cnt, cnt); > diff --git a/drivers/mmc/host/usdhi6rol0.c b/drivers/mmc/host/usdhi6rol0.c > index 85b49c07918b3..3bccf800339ba 100644 > --- a/drivers/mmc/host/usdhi6rol0.c > +++ b/drivers/mmc/host/usdhi6rol0.c > @@ -323,7 +323,7 @@ static void usdhi6_blk_bounce(struct usdhi6_host *host, > > host->head_pg.page = host->pg.page; > host->head_pg.mapped = host->pg.mapped; > - host->pg.page = nth_page(host->pg.page, 1); > + host->pg.page = host->pg.page + 1; > host->pg.mapped = kmap(host->pg.page); > > host->blk_page = host->bounce_buf; > @@ -503,7 +503,7 @@ static void usdhi6_sg_advance(struct usdhi6_host *host) > /* We cannot get here after crossing a page border */ > > /* Next page in the same SG */ > - host->pg.page = nth_page(sg_page(host->sg), 
host->page_idx); > + host->pg.page = sg_page(host->sg) + host->page_idx; > host->pg.mapped = kmap(host->pg.page); > host->blk_page = host->pg.mapped; > > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:01:28 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:01:28 -0000 Subject: [PATCH v1 29/36] scsi: scsi_lib: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-30-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-30-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:33AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Reviewed-by: Bart Van Assche > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/scsi/scsi_lib.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c > index 0c65ecfedfbd6..d7e42293b8645 100644 > --- a/drivers/scsi/scsi_lib.c > +++ b/drivers/scsi/scsi_lib.c > @@ -3148,8 +3148,7 @@ void *scsi_kmap_atomic_sg(struct scatterlist *sgl, int sg_count, > /* Offset starting from the beginning of first page in this sg-entry */ > *offset = *offset - len_complete + sg->offset; > > - /* Assumption: contiguous pages can be accessed as "page + i" */ Nice to drop this :) > - page = nth_page(sg_page(sg), (*offset >> PAGE_SHIFT)); > + page = sg_page(sg) + (*offset >> PAGE_SHIFT); > *offset &= ~PAGE_MASK; > > /* Bytes in this sg-entry from *offset to the end of the page */ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:01:53 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:01:53 -0000 Subject: [PATCH v1 30/36] scsi: sg: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-31-david@redhat.com> 
References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-31-david@redhat.com> Message-ID: <795d8319-86bf-4087-b4dc-34a093678001@lucifer.local> On Thu, Aug 28, 2025 at 12:01:34AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Reviewed-by: Bart Van Assche > Cc: Doug Gilbert > Cc: "James E.J. Bottomley" > Cc: "Martin K. Petersen" > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/scsi/sg.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/drivers/scsi/sg.c b/drivers/scsi/sg.c > index 3c02a5f7b5f39..4c62c597c7be9 100644 > --- a/drivers/scsi/sg.c > +++ b/drivers/scsi/sg.c > @@ -1235,8 +1235,7 @@ sg_vma_fault(struct vm_fault *vmf) > len = vma->vm_end - sa; > len = (len < length) ? len : length; > if (offset < len) { > - struct page *page = nth_page(rsv_schp->pages[k], > - offset >> PAGE_SHIFT); > + struct page *page = rsv_schp->pages[k] + (offset >> PAGE_SHIFT); > get_page(page); /* increment page count */ > vmf->page = page; > return 0; /* success */ > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:09:58 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:09:58 -0000 Subject: [PATCH v1 33/36] mm/gup: drop nth_page() usage in unpin_user_page_range_dirty_lock() In-Reply-To: <20250827220141.262669-34-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-34-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:37AM +0200, David Hildenbrand wrote: > There is the concern that unpin_user_page_range_dirty_lock() might do > some weird merging of PFN ranges -- either now or in the future -- such > that PFN range is contiguous but the page range might not be. > > Let's sanity-check for that and drop the nth_page() usage. 
> > Signed-off-by: David Hildenbrand Seems one user uses SG and the other is IOMMU and in each instance you'd expect physical contiguity (maybe Jason G. or somebody else more familiar with these uses can also chime in). Anyway, on that basis, LGTM (though 1 small nit below), so: Reviewed-by: Lorenzo Stoakes > --- > mm/gup.c | 7 ++++++- > 1 file changed, 6 insertions(+), 1 deletion(-) > > diff --git a/mm/gup.c b/mm/gup.c > index 89ca0813791ab..c24f6009a7a44 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -237,7 +237,7 @@ void folio_add_pin(struct folio *folio) > static inline struct folio *gup_folio_range_next(struct page *start, > unsigned long npages, unsigned long i, unsigned int *ntails) > { > - struct page *next = nth_page(start, i); > + struct page *next = start + i; > struct folio *folio = page_folio(next); > unsigned int nr = 1; > > @@ -342,6 +342,9 @@ EXPORT_SYMBOL(unpin_user_pages_dirty_lock); > * "gup-pinned page range" refers to a range of pages that has had one of the > * pin_user_pages() variants called on that page. > * > + * The page range must be truly contiguous: the page range corresponds NIT: maybe 'physically contiguous'? > + * to a contiguous PFN range and all pages can be iterated naturally. > + * > * For the page ranges defined by [page .. page+npages], make that range (or > * its head pages, if a compound page) dirty, if @make_dirty is true, and if the > * page range was previously listed as clean. 
> @@ -359,6 +362,8 @@ void unpin_user_page_range_dirty_lock(struct page *page, unsigned long npages, > struct folio *folio; > unsigned int nr; > > + VM_WARN_ON_ONCE(!page_range_contiguous(page, npages)); > > for (i = 0; i < npages; i += nr) { > folio = gup_folio_range_next(page, npages, i, &nr); > if (make_dirty && !folio_test_dirty(folio)) { > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:19:56 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:19:56 -0000 Subject: [PATCH v1 34/36] kfence: drop nth_page() usage In-Reply-To: <20250827220141.262669-35-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-35-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:38AM +0200, David Hildenbrand wrote: > We want to get rid of nth_page(), and kfence init code is the last user. > > Unfortunately, we might actually walk a PFN range where the pages are > not contiguous, because we might be allocating an area from memblock > that could span memory sections in problematic kernel configs (SPARSEMEM > without SPARSEMEM_VMEMMAP). Sad. > > We could check whether the page range is contiguous > using page_range_contiguous() and fail kfence init, or make kfence > incompatible with these problematic kernel configs. Sounds iffy though. > > Let's keep it simple and simply use pfn_to_page() by iterating PFNs. Yes.
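As an editorial aside, the failure mode David describes can be modeled in a few lines of userspace C. Everything below is an illustrative stand-in rather than kernel code: the section layout, the toy pfn_to_page() and the simplified page_range_contiguous() are assumptions chosen only to show why "page + i" pointer arithmetic breaks once a range leaves the memory section whose memmap it started in, while a PFN walk stays correct.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of SPARSEMEM without SPARSEMEM_VMEMMAP: each "memory
 * section" has its own memmap array. All names and sizes here are
 * illustrative stand-ins, not the kernel's. */
#define PAGES_PER_SECTION 4
#define NR_SECTIONS 2

struct page { unsigned long pfn; };

/* The +1 padding entry guarantees section 1's memmap does not start
 * right where section 0's ends, i.e. the memmaps are not adjacent. */
static struct page memmap[NR_SECTIONS][PAGES_PER_SECTION + 1];

static struct page *pfn_to_page(unsigned long pfn)
{
    struct page *p = &memmap[pfn / PAGES_PER_SECTION][pfn % PAGES_PER_SECTION];

    p->pfn = pfn;   /* lazily record the pfn for this model */
    return p;
}

/* Simplified page_range_contiguous(): does "page + i" pointer
 * arithmetic agree with a PFN walk for every page in the range? */
static bool page_range_contiguous(const struct page *page, unsigned long nr_pages)
{
    for (unsigned long i = 0; i < nr_pages; i++)
        if (page + i != pfn_to_page(page->pfn + i))
            return false;
    return true;
}
```

Compiled as a normal userspace program, the model shows that any range inside one section passes the check, while a range crossing the section boundary fails it, which is exactly why iterating PFNs is the safe choice.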
> > Cc: Alexander Potapenko > Cc: Marco Elver > Cc: Dmitry Vyukov > Signed-off-by: David Hildenbrand Stared at this and can't see anything wrong, so - LGTM and: Reviewed-by: Lorenzo Stoakes > --- > mm/kfence/core.c | 12 +++++++----- > 1 file changed, 7 insertions(+), 5 deletions(-) > > diff --git a/mm/kfence/core.c b/mm/kfence/core.c > index 0ed3be100963a..727c20c94ac59 100644 > --- a/mm/kfence/core.c > +++ b/mm/kfence/core.c > @@ -594,15 +594,14 @@ static void rcu_guarded_free(struct rcu_head *h) > */ > static unsigned long kfence_init_pool(void) > { > - unsigned long addr; > - struct page *pages; > + unsigned long addr, start_pfn; > int i; > > if (!arch_kfence_init_pool()) > return (unsigned long)__kfence_pool; > > addr = (unsigned long)__kfence_pool; > - pages = virt_to_page(__kfence_pool); > + start_pfn = PHYS_PFN(virt_to_phys(__kfence_pool)); > > /* > * Set up object pages: they must have PGTY_slab set to avoid freeing > @@ -613,11 +612,12 @@ static unsigned long kfence_init_pool(void) > * enters __slab_free() slow-path. 
> */ > for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { > - struct slab *slab = page_slab(nth_page(pages, i)); > + struct slab *slab; > > if (!i || (i % 2)) > continue; > > + slab = page_slab(pfn_to_page(start_pfn + i)); > __folio_set_slab(slab_folio(slab)); > #ifdef CONFIG_MEMCG > slab->obj_exts = (unsigned long)&kfence_metadata_init[i / 2 - 1].obj_exts | > @@ -665,10 +665,12 @@ static unsigned long kfence_init_pool(void) > > reset_slab: > for (i = 0; i < KFENCE_POOL_SIZE / PAGE_SIZE; i++) { > - struct slab *slab = page_slab(nth_page(pages, i)); > + struct slab *slab; > > if (!i || (i % 2)) > continue; > + > + slab = page_slab(pfn_to_page(start_pfn + i)); > #ifdef CONFIG_MEMCG > slab->obj_exts = 0; > #endif > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:20:36 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:20:36 -0000 Subject: [PATCH v1 35/36] block: update comment of "struct bio_vec" regarding nth_page() In-Reply-To: <20250827220141.262669-36-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-36-david@redhat.com> Message-ID: On Thu, Aug 28, 2025 at 12:01:39AM +0200, David Hildenbrand wrote: > Ever since commit 858c708d9efb ("block: move the bi_size update out of > __bio_try_merge_page"), page_is_mergeable() no longer exists, and the > logic in bvec_try_merge_page() is now a simple page pointer > comparison. > > Signed-off-by: David Hildenbrand Nice! :) LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/bvec.h | 7 ++----- > 1 file changed, 2 insertions(+), 5 deletions(-) > > diff --git a/include/linux/bvec.h b/include/linux/bvec.h > index 0a80e1f9aa201..3fc0efa0825b1 100644 > --- a/include/linux/bvec.h > +++ b/include/linux/bvec.h > @@ -22,11 +22,8 @@ struct page; > * @bv_len: Number of bytes in the address range. > * @bv_offset: Start of the address range relative to the start of @bv_page. 
> * > - * The following holds for a bvec if n * PAGE_SIZE < bv_offset + bv_len: > - * > - * nth_page(@bv_page, n) == @bv_page + n > - * > - * This holds because page_is_mergeable() checks the above property. > + * All pages within a bio_vec starting from @bv_page are contiguous and > + * can simply be iterated (see bvec_advance()). > */ > struct bio_vec { > struct page *bv_page; > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:25:51 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:25:51 -0000 Subject: [PATCH v1 36/36] mm: remove nth_page() In-Reply-To: <20250827220141.262669-37-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-37-david@redhat.com> Message-ID: <18c6a175-507f-464c-b776-67d346863ddf@lucifer.local> On Thu, Aug 28, 2025 at 12:01:40AM +0200, David Hildenbrand wrote: > Now that all users are gone, let's remove it. > > Signed-off-by: David Hildenbrand HAPPY DAYYS!!!! Happy to have reached this bit, great work! 
:) LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > include/linux/mm.h | 2 -- > tools/testing/scatterlist/linux/mm.h | 1 - > 2 files changed, 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2ca1eb2db63ec..b26ca8b2162d9 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -210,9 +210,7 @@ extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > bool page_range_contiguous(const struct page *page, unsigned long nr_pages); > -#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > #else > -#define nth_page(page,n) ((page) + (n)) > static inline bool page_range_contiguous(const struct page *page, > unsigned long nr_pages) > { > diff --git a/tools/testing/scatterlist/linux/mm.h b/tools/testing/scatterlist/linux/mm.h > index 5bd9e6e806254..121ae78d6e885 100644 > --- a/tools/testing/scatterlist/linux/mm.h > +++ b/tools/testing/scatterlist/linux/mm.h > @@ -51,7 +51,6 @@ static inline unsigned long page_to_phys(struct page *page) > > #define page_to_pfn(page) ((unsigned long)(page) / PAGE_SIZE) > #define pfn_to_page(pfn) (void *)((pfn) * PAGE_SIZE) > -#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > > #define __min(t1, t2, min1, min2, x, y) ({ \ > t1 min1 = (x); \ > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 00:27:30 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 00:27:30 -0000 Subject: [PATCH v1 01/36] mm: stop making SPARSEMEM_VMEMMAP user-selectable In-Reply-To: <20250827220141.262669-2-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-2-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:03]: > In an ideal world, we wouldn't have to deal with SPARSEMEM without > SPARSEMEM_VMEMMAP, but in particular for 32bit SPARSEMEM_VMEMMAP is > considered too costly and consequently not supported. 
> > However, if an architecture does support SPARSEMEM with > SPARSEMEM_VMEMMAP, let's forbid the user to disable VMEMMAP: just > like we already do for arm64, s390 and x86. > > So if SPARSEMEM_VMEMMAP is supported, don't allow to use SPARSEMEM without > SPARSEMEM_VMEMMAP. > > This implies that the option to not use SPARSEMEM_VMEMMAP will now be > gone for loongarch, powerpc, riscv and sparc. All architectures only > enable SPARSEMEM_VMEMMAP with 64bit support, so there should not really > be a big downside to using the VMEMMAP (quite the contrary). > > This is a preparation for not supporting > > (1) folio sizes that exceed a single memory section > (2) CMA allocations of non-contiguous page ranges > > in SPARSEMEM without SPARSEMEM_VMEMMAP configs, whereby we > want to limit possible impact as much as possible (e.g., gigantic hugetlb > page allocations suddenly fails). > > Acked-by: Zi Yan > Acked-by: Mike Rapoport (Microsoft) > Acked-by: SeongJae Park > Cc: Huacai Chen > Cc: WANG Xuerui > Cc: Madhavan Srinivasan > Cc: Michael Ellerman > Cc: Nicholas Piggin > Cc: Christophe Leroy > Cc: Paul Walmsley > Cc: Palmer Dabbelt > Cc: Albert Ou > Cc: Alexandre Ghiti > Cc: "David S. Miller" > Cc: Andreas Larsson > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > mm/Kconfig | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/mm/Kconfig b/mm/Kconfig > index 4108bcd967848..330d0e698ef96 100644 > --- a/mm/Kconfig > +++ b/mm/Kconfig > @@ -439,9 +439,8 @@ config SPARSEMEM_VMEMMAP_ENABLE > bool > > config SPARSEMEM_VMEMMAP > - bool "Sparse Memory virtual memmap" > + def_bool y > depends on SPARSEMEM && SPARSEMEM_VMEMMAP_ENABLE > - default y > help > SPARSEMEM_VMEMMAP uses a virtually mapped memmap to optimise > pfn_to_page and page_to_pfn operations. This is the most > -- > 2.50.1 > > From Liam.Howlett at oracle.com Fri Aug 29 00:28:15 2025 From: Liam.Howlett at oracle.com (Liam R. 
Howlett) Date: Fri, 29 Aug 2025 00:28:15 -0000 Subject: [PATCH v1 02/36] arm64: Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-3-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-3-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:03]: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. > > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Catalin Marinas > Cc: Will Deacon > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > arch/arm64/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig > index e9bbfacc35a64..b1d1f2ff2493b 100644 > --- a/arch/arm64/Kconfig > +++ b/arch/arm64/Kconfig > @@ -1570,7 +1570,6 @@ source "kernel/Kconfig.hz" > config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_VMEMMAP_ENABLE > - select SPARSEMEM_VMEMMAP > > config HW_PERF_EVENTS > def_bool y > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 00:29:11 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 00:29:11 -0000 Subject: [PATCH v1 03/36] s390/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-4-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-4-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:03]: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. > > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Heiko Carstens > Cc: Vasily Gorbik > Cc: Alexander Gordeev > Cc: Christian Borntraeger > Cc: Sven Schnelle > Signed-off-by: David Hildenbrand I have a little fear of the Cc's that may come with this one, but.. Reviewed-by: Liam R. 
Howlett > --- > arch/s390/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig > index bf680c26a33cf..145ca23c2fff6 100644 > --- a/arch/s390/Kconfig > +++ b/arch/s390/Kconfig > @@ -710,7 +710,6 @@ menu "Memory setup" > config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_VMEMMAP_ENABLE > - select SPARSEMEM_VMEMMAP > > config ARCH_SPARSEMEM_DEFAULT > def_bool y > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 00:29:37 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 00:29:37 -0000 Subject: [PATCH v1 04/36] x86/Kconfig: drop superfluous "select SPARSEMEM_VMEMMAP" In-Reply-To: <20250827220141.262669-5-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-5-david@redhat.com> Message-ID: <27leccakrwk7gwupltma5f7enjx4vt4utxdcitqpirx3fpnpd4@ythmris3c25e> * David Hildenbrand [250827 18:03]: > Now handled by the core automatically once SPARSEMEM_VMEMMAP_ENABLE > is selected. > > Reviewed-by: Mike Rapoport (Microsoft) > Cc: Thomas Gleixner > Cc: Ingo Molnar > Cc: Borislav Petkov > Cc: Dave Hansen > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > arch/x86/Kconfig | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > index 58d890fe2100e..e431d1c06fecd 100644 > --- a/arch/x86/Kconfig > +++ b/arch/x86/Kconfig > @@ -1552,7 +1552,6 @@ config ARCH_SPARSEMEM_ENABLE > def_bool y > select SPARSEMEM_STATIC if X86_32 > select SPARSEMEM_VMEMMAP_ENABLE if X86_64 > - select SPARSEMEM_VMEMMAP if X86_64 > > config ARCH_SPARSEMEM_DEFAULT > def_bool X86_64 || (NUMA && X86_32) > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 00:29:55 2025 From: Liam.Howlett at oracle.com (Liam R. 
Howlett) Date: Fri, 29 Aug 2025 00:29:55 -0000 Subject: [PATCH v1 05/36] wireguard: selftests: remove CONFIG_SPARSEMEM_VMEMMAP=y from qemu kernel config In-Reply-To: <20250827220141.262669-6-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-6-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:04]: > It's no longer user-selectable (and the default was already "y"), so > let's just drop it. > > It was never really relevant to the wireguard selftests either way. > > Acked-by: Mike Rapoport (Microsoft) > Cc: "Jason A. Donenfeld" > Cc: Shuah Khan > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > tools/testing/selftests/wireguard/qemu/kernel.config | 1 - > 1 file changed, 1 deletion(-) > > diff --git a/tools/testing/selftests/wireguard/qemu/kernel.config b/tools/testing/selftests/wireguard/qemu/kernel.config > index 0a5381717e9f4..1149289f4b30f 100644 > --- a/tools/testing/selftests/wireguard/qemu/kernel.config > +++ b/tools/testing/selftests/wireguard/qemu/kernel.config > @@ -48,7 +48,6 @@ CONFIG_JUMP_LABEL=y > CONFIG_FUTEX=y > CONFIG_SHMEM=y > CONFIG_SLUB=y > -CONFIG_SPARSEMEM_VMEMMAP=y > CONFIG_SMP=y > CONFIG_SCHED_SMT=y > CONFIG_SCHED_MC=y > -- > 2.50.1 > > From Liam.Howlett at oracle.com Fri Aug 29 00:36:22 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 00:36:22 -0000 Subject: [PATCH v1 08/36] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <20250827220141.262669-9-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-9-david@redhat.com> Message-ID: <3cw5er2nr2pnht456e6fh5lassb6y5z64xk3g7ffrao2glkmx7@os7byqmcuddp> * David Hildenbrand [250827 18:04]: > Let's check that no hstate that corresponds to an unreasonable folio size > is registered by an architecture. If we were to succeed registering, we > could later try allocating an unsupported gigantic folio size. 
> > Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER > is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have > to use a BUILD_BUG_ON_INVALID() to make it compile. > > No existing kernel configuration should be able to trigger this check: > either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or > gigantic folios will not exceed a memory section (the case on sparse). > > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > mm/hugetlb.c | 2 ++ > 1 file changed, 2 insertions(+) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 572b6f7772841..4a97e4f14c0dc 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4657,6 +4657,7 @@ static int __init hugetlb_init(void) > > BUILD_BUG_ON(sizeof_field(struct page, private) * BITS_PER_BYTE < > __NR_HPAGEFLAGS); > + BUILD_BUG_ON_INVALID(HUGETLB_PAGE_ORDER > MAX_FOLIO_ORDER); > > if (!hugepages_supported()) { > if (hugetlb_max_hstate || default_hstate_max_huge_pages) > @@ -4740,6 +4741,7 @@ void __init hugetlb_add_hstate(unsigned int order) > } > BUG_ON(hugetlb_max_hstate >= HUGE_MAX_HSTATE); > BUG_ON(order < order_base_2(__NR_USED_SUBPAGE)); > + WARN_ON(order > MAX_FOLIO_ORDER); > h = &hstates[hugetlb_max_hstate++]; > __mutex_init(&h->resize_lock, "resize mutex", &h->resize_key); > h->order = order; > -- > 2.50.1 > > From Liam.Howlett at oracle.com Fri Aug 29 00:38:22 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 00:38:22 -0000 Subject: [PATCH v1 09/36] mm/mm_init: make memmap_init_compound() look more like prep_compound_page() In-Reply-To: <20250827220141.262669-10-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-10-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:05]: > Grepping for "prep_compound_page" leaves one clueless how devdax gets its
> > Let's add a comment that might help finding this open-coded > prep_compound_page() initialization more easily. Thanks for the comment here. > > Further, let's be less smart about the ordering of initialization and just > perform the prep_compound_head() call after all tail pages were > initialized: just like prep_compound_page() does. > > No need for a comment to describe the initialization order: again, > just like prep_compound_page(). > > Reviewed-by: Mike Rapoport (Microsoft) > Signed-off-by: David Hildenbrand Acked-by: Liam R. Howlett > --- > mm/mm_init.c | 15 +++++++-------- > 1 file changed, 7 insertions(+), 8 deletions(-) > > diff --git a/mm/mm_init.c b/mm/mm_init.c > index 5c21b3af216b2..df614556741a4 100644 > --- a/mm/mm_init.c > +++ b/mm/mm_init.c > @@ -1091,6 +1091,12 @@ static void __ref memmap_init_compound(struct page *head, > unsigned long pfn, end_pfn = head_pfn + nr_pages; > unsigned int order = pgmap->vmemmap_shift; > > + /* > + * We have to initialize the pages, including setting up page links. > + * prep_compound_page() does not take care of that, so instead we > + * open-code prep_compound_page() so we can take care of initializing > + * the pages in the same go. > + */ > __SetPageHead(head); > for (pfn = head_pfn + 1; pfn < end_pfn; pfn++) { > struct page *page = pfn_to_page(pfn); > @@ -1098,15 +1104,8 @@ static void __ref memmap_init_compound(struct page *head, > __init_zone_device_page(page, pfn, zone_idx, nid, pgmap); > prep_compound_tail(head, pfn - head_pfn); > set_page_count(page, 0); > - > - /* > - * The first tail page stores important compound page info. > - * Call prep_compound_head() after the first tail page has > - * been initialized, to not have the data overwritten. 
> - */ > - if (pfn == head_pfn + 1) > - prep_compound_head(head, order); > } > + prep_compound_head(head, order); > } > > void __ref memmap_init_zone_device(struct zone *zone, > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Fri Aug 29 12:02:46 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:02:46 -0000 Subject: [PATCH v1 11/36] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-12-david@redhat.com> Message-ID: <32fbe774-d0e4-498e-873f-f028347c1fcb@lucifer.local> On Fri, Aug 29, 2025 at 01:57:22PM +0200, David Hildenbrand wrote: > On 28.08.25 17:10, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:15AM +0200, David Hildenbrand wrote: > > > Let's limit the maximum folio size in problematic kernel configs where > > > the memmap is allocated per memory section (SPARSEMEM without > > > SPARSEMEM_VMEMMAP) to a single memory section. > > > > > > Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE > > > but not SPARSEMEM_VMEMMAP: sh. > > > > > > Fortunately, the biggest hugetlb size sh supports is 64 MiB > > > (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB > > > (SECTION_SIZE_BITS == 26), so their use case is not degraded. > > > > > > As folios and memory sections are naturally aligned to their order-2 size > > > in memory, consequently a single folio can no longer span multiple memory > > > sections on these problematic kernel configs. > > > > > > nth_page() is no longer required when operating within a single compound > > > page / folio. > > > > > > Reviewed-by: Zi Yan > > > Acked-by: Mike Rapoport (Microsoft) > > > Signed-off-by: David Hildenbrand > > > > Really great comments, like this! > > > > I wonder if we could have this be part of the first patch where you fiddle > > with MAX_FOLIO_ORDER etc. but not a big deal.
> > I think it belongs into this patch where we actually impose the > restrictions. Sure it's not a big deal. > > [...] > > > > +/* > > > + * Only pages within a single memory section are guaranteed to be > > > + * contiguous. By limiting folios to a single memory section, all folio > > > + * pages are guaranteed to be contiguous. > > > + */ > > > +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT > > > > Hmmm, was this implicit before somehow? I mean surely by the fact as you say > > that physical contiguity would not otherwise be guaranteed :)) > > Well, my patches until this point made sure that any attempt to use a larger > folio would fail in a way that we could spot now if there is any offender. Ack yeah. > > That is why before this change, nth_page() was required within a folio. > > Hope that clarifies it, thanks! Yes thanks! :) > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo From lorenzo.stoakes at oracle.com Fri Aug 29 12:04:20 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:04:20 -0000 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <0dcef56e-0ae7-401b-9453-f6dc6a4dcebf@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> <0dcef56e-0ae7-401b-9453-f6dc6a4dcebf@redhat.com> Message-ID: <6552e67b-72fd-4d9e-bf35-872cbfae5de0@lucifer.local> On Fri, Aug 29, 2025 at 01:59:19PM +0200, David Hildenbrand wrote: > On 28.08.25 17:37, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:17AM +0200, David Hildenbrand wrote: > > > We can now safely iterate over all pages in a folio, so no need for the > > > pfn_to_page(). > > > > > > Also, as we already force the refcount in __init_single_page() to 1, > > > > Mega huge nit (ignore if you want), but maybe worth saying 'via > > init_page_count()'. > > Will add, thanks! Thanks! 
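As an aside, the guarantee discussed above, that a naturally aligned power-of-two folio no larger than a memory section cannot straddle a section boundary, can be verified with plain PFN arithmetic. The sketch below is illustrative only; the PFN_SECTION_SHIFT value and the helper name are assumptions for the example, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative value only: with 4 KiB pages, PFN_SECTION_SHIFT == 15
 * would correspond to a 128 MiB memory section. */
#define PFN_SECTION_SHIFT 15
#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT

/* A folio of a given order is naturally aligned, i.e. its start PFN is
 * a multiple of its page count. Under that alignment, a folio of order
 * <= MAX_FOLIO_ORDER can never cross a section boundary. */
static bool folio_in_single_section(unsigned long start_pfn, unsigned int order)
{
    unsigned long nr_pages = 1UL << order;
    unsigned long first = start_pfn >> PFN_SECTION_SHIFT;
    unsigned long last = (start_pfn + nr_pages - 1) >> PFN_SECTION_SHIFT;

    assert(start_pfn % nr_pages == 0); /* natural alignment assumed */
    return first == last;
}
```

With this model, every aligned folio of order up to MAX_FOLIO_ORDER lands in exactly one section, while an order one above the limit spans two sections even when aligned, which is the case the series forbids.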
> > -- > Cheers > > David / dhildenb > > From lorenzo.stoakes at oracle.com Fri Aug 29 12:10:38 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:10:38 -0000 Subject: [PATCH v1 15/36] fs: hugetlbfs: remove nth_page() usage within folio in adjust_range_hwpoison() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-16-david@redhat.com> <1d74a0e2-51ff-462f-8f3c-75639fd21221@lucifer.local> Message-ID: <5dd50f11-22bd-4a83-8484-2d23bdf5c10e@lucifer.local> On Fri, Aug 29, 2025 at 02:02:02PM +0200, David Hildenbrand wrote: > On 28.08.25 17:45, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:19AM +0200, David Hildenbrand wrote: > > > The nth_page() is not really required anymore, so let's remove it. > > > While at it, cleanup and simplify the code a bit. > > > > Hm Not sure which bit is the cleanup? Was there meant to be more here or? > > Thanks, leftover from the pre-split of this patch! Thanks! :) (Am replying even on 'not really needing a reply' like this so I know which stuff I replied to :P) > > -- > Cheers > > David / dhildenb > > Cheers, Lorenzo From lorenzo.stoakes at oracle.com Fri Aug 29 12:18:48 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:18:48 -0000 Subject: [PATCH v1 10/36] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-11-david@redhat.com> Message-ID: On Fri, Aug 29, 2025 at 12:10:30PM +0200, David Hildenbrand wrote: > On 28.08.25 17:00, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:14AM +0200, David Hildenbrand wrote: > > > Let's sanity-check in folio_set_order() whether we would be trying to > > > create a folio with an order that would make it exceed MAX_FOLIO_ORDER. > > > > > > This will enable the check whenever a folio/compound page is initialized > > > through prepare_compound_head() / prepare_compound_page(). 
> > > > NIT: with CONFIG_DEBUG_VM set :) > > Yes, will add that. Thanks! > > > > > > > > > Reviewed-by: Zi Yan > > > Signed-off-by: David Hildenbrand > > > > LGTM (apart from nit below), so: > > > > Reviewed-by: Lorenzo Stoakes > > > > > --- > > > mm/internal.h | 1 + > > > 1 file changed, 1 insertion(+) > > > > > > diff --git a/mm/internal.h b/mm/internal.h > > > index 45da9ff5694f6..9b0129531d004 100644 > > > --- a/mm/internal.h > > > +++ b/mm/internal.h > > > @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) > > > { > > > if (WARN_ON_ONCE(!order || !folio_test_large(folio))) > > > return; > > > + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER); > > > > Given we have 'full-fat' WARN_ON*()'s above, maybe worth making this one too? > > The idea is that if you reach this point here, previous such checks I added > failed. So this is the safety net, and for that VM_WARN_ON_ONCE() is > sufficient. > > I think we should rather convert the WARN_ON_ONCE to VM_WARN_ON_ONCE() at > some point, because no sane code should ever trigger that. Ack, ok. I don't think vital for this series though! > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo From lorenzo.stoakes at oracle.com Fri Aug 29 12:19:58 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:19:58 -0000 Subject: [PATCH v1 08/36] mm/hugetlb: check for unreasonable folio sizes when registering hstate In-Reply-To: <5f6e49fa-4c1c-4ece-ba67-0e140e2685da@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-9-david@redhat.com> <5f6e49fa-4c1c-4ece-ba67-0e140e2685da@redhat.com> Message-ID: On Fri, Aug 29, 2025 at 12:07:44PM +0200, David Hildenbrand wrote: > On 28.08.25 16:45, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:12AM +0200, David Hildenbrand wrote: > > > Let's check that no hstate that corresponds to an unreasonable folio size > > > is registered by an architecture. 
If we were to succeed registering, we > > > could later try allocating an unsupported gigantic folio size. > > > > > > Further, let's add a BUILD_BUG_ON() for checking that HUGETLB_PAGE_ORDER > > > is sane at build time. As HUGETLB_PAGE_ORDER is dynamic on powerpc, we have > > > to use a BUILD_BUG_ON_INVALID() to make it compile. > > > > > > No existing kernel configuration should be able to trigger this check: > > > either SPARSEMEM without SPARSEMEM_VMEMMAP cannot be configured or > > > gigantic folios will not exceed a memory section (the case on sparse). > > > > I am guessing it's implicit that MAX_FOLIO_ORDER <= section size? > > Yes, we have a build-time check for that somewhere. OK cool thanks! > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo From lorenzo.stoakes at oracle.com Fri Aug 29 12:32:21 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:32:21 -0000 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <547145e0-9b0e-40ca-8201-e94cc5d19356@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> <547145e0-9b0e-40ca-8201-e94cc5d19356@redhat.com> Message-ID: <34edaa0d-0d5f-4041-9a3d-fb5b2dd584e8@lucifer.local> On Fri, Aug 29, 2025 at 12:06:21PM +0200, David Hildenbrand wrote: > On 28.08.25 16:37, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:10AM +0200, David Hildenbrand wrote: > > > Let's reject them early, which in turn makes folio_alloc_gigantic() reject > > > them properly. > > > > > > To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER > > > and calculate MAX_FOLIO_NR_PAGES based on that.
> > > > > > Reviewed-by: Zi Yan > > > Acked-by: SeongJae Park > > > Signed-off-by: David Hildenbrand > > > > Some nits, but overall LGTM so: > > > > Reviewed-by: Lorenzo Stoakes > > > > > --- > > > include/linux/mm.h | 6 ++++-- > > > mm/page_alloc.c | 5 ++++- > > > 2 files changed, 8 insertions(+), 3 deletions(-) > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index 00c8a54127d37..77737cbf2216a 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) > > > > > > /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > > > #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > > > -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) > > > +#define MAX_FOLIO_ORDER PUD_ORDER > > > #else > > > -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES > > > +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER > > > #endif > > > > > > +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > > > > BIT()? > > I don't think we want to use BIT whenever we convert from order -> folio -- > which is why we also don't do that in other code. It seems a bit arbitrary, like we open-code this (at risk of making a mistake) in some places but not others. > > BIT() is nice in the context of flags and bitmaps, but not really in the > context of converting orders to pages. It's nice for setting a specific bit :) > > One could argue that maybe one would want a order_to_pages() helper (that > could use BIT() internally), but I am certainly not someone that would > suggest that at this point ... :) I mean maybe. Anyway as I said none of this is massively important, the open-coding here is correct, just seems silly. > > > > > > + > > > /* > > > * compound_nr() returns the number of pages in this potentially compound > > > * page. 
compound_nr() can be called on a tail page, and is defined to > > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > > > index baead29b3e67b..426bc404b80cc 100644 > > > --- a/mm/page_alloc.c > > > +++ b/mm/page_alloc.c > > > @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) > > > int alloc_contig_range_noprof(unsigned long start, unsigned long end, > > > acr_flags_t alloc_flags, gfp_t gfp_mask) > > > { > > > + const unsigned int order = ilog2(end - start); > > > unsigned long outer_start, outer_end; > > > int ret = 0; > > > > > > @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > > > PB_ISOLATE_MODE_CMA_ALLOC : > > > PB_ISOLATE_MODE_OTHER; > > > > > > + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) > > > + return -EINVAL; > > > > Possibly not worth it for a one off, but be nice to have this as a helper function, like: > > > > static bool is_valid_order(gfp_t gfp_mask, unsigned int order) > > { > > return !(gfp_mask & __GFP_COMP) || order <= MAX_FOLIO_ORDER; > > } > > > > Then makes this: > > > > if (WARN_ON_ONCE(!is_valid_order(gfp_mask, order))) > > return -EINVAL; > > > > Kinda self-documenting! > > I don't like it -- especially forwarding __GFP_COMP. > > is_valid_folio_order() to wrap the order check? Also not sure. OK, it's not a big deal. Can we have a comment explaining this though? As people might be confused as to why we check this here and not elsewhere. > > So I'll leave it as is I think. Right fine. > > Thanks for all the review!
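The rejection logic being debated here can be modeled in user space. In this sketch the __GFP_COMP bit value and MAX_FOLIO_ORDER are illustrative stand-ins, and ilog2() is open-coded; the helper mirrors the shape Lorenzo suggested above, not the final kernel code.

```c
#include <assert.h>

/* Stand-in values for illustration only. */
#define GFP_COMP_FLAG   (1u << 0)   /* models __GFP_COMP */
#define MAX_FOLIO_ORDER 18          /* assumed limit */

/* Open-coded ilog2(): position of the highest set bit, i.e. the order
 * of the requested pfn range (end - start). */
static unsigned int ilog2_sketch(unsigned long n)
{
        unsigned int order = 0;

        while (n >>= 1)
                order++;
        return order;
}

/* Non-compound requests have no folio-order limit; compound
 * (__GFP_COMP) requests must not exceed MAX_FOLIO_ORDER. */
static int is_valid_order(unsigned int gfp_mask, unsigned int order)
{
        return !(gfp_mask & GFP_COMP_FLAG) || order <= MAX_FOLIO_ORDER;
}
```

The early -EINVAL return corresponds to is_valid_order() evaluating false for a compound request whose order exceeds the maximum.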
> > -- > Cheers > > David / dhildenb > From lorenzo.stoakes at oracle.com Fri Aug 29 12:52:20 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 12:52:20 -0000 Subject: [PATCH v1 20/36] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: <2be7db96-2fa2-4348-837e-648124bd604f@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-21-david@redhat.com> <2be7db96-2fa2-4348-837e-648124bd604f@redhat.com> Message-ID: <549a60a6-25e2-48d5-b442-49404a857014@lucifer.local> On Thu, Aug 28, 2025 at 10:51:46PM +0200, David Hildenbrand wrote: > On 28.08.25 18:57, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:24AM +0200, David Hildenbrand wrote: > > > Let's make it clearer that we are operating within a single folio by > > > providing both the folio and the page. > > > > > > This implies that for flush_dcache_folio() we'll now avoid one more > > > page->folio lookup, and that we can safely drop the "nth_page" usage. 
> > > > > > Cc: Thomas Bogendoerfer > > > Signed-off-by: David Hildenbrand > > > --- > > > arch/mips/include/asm/cacheflush.h | 11 +++++++---- > > > arch/mips/mm/cache.c | 8 ++++---- > > > 2 files changed, 11 insertions(+), 8 deletions(-) > > > > > > diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h > > > index 5d283ef89d90d..8d79bfc687d21 100644 > > > --- a/arch/mips/include/asm/cacheflush.h > > > +++ b/arch/mips/include/asm/cacheflush.h > > > @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); > > > extern void (*flush_cache_range)(struct vm_area_struct *vma, > > > unsigned long start, unsigned long end); > > > extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); > > > -extern void __flush_dcache_pages(struct page *page, unsigned int nr); > > > +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); > > > > NIT: Be good to drop the extern. > > I think I'll leave the one in, though, someone should clean up all of them > in one go. This is how we always clean these up though, buuut to be fair that's in mm. > > Just imagine how the other functions would think about the new guy showing > off here. 
:) ;) > > > > > > > > > #define ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE 1 > > > static inline void flush_dcache_folio(struct folio *folio) > > > { > > > if (cpu_has_dc_aliases) > > > - __flush_dcache_pages(&folio->page, folio_nr_pages(folio)); > > > + __flush_dcache_folio_pages(folio, folio_page(folio, 0), > > > + folio_nr_pages(folio)); > > > else if (!cpu_has_ic_fills_f_dc) > > > folio_set_dcache_dirty(folio); > > > } > > > @@ -64,10 +65,12 @@ static inline void flush_dcache_folio(struct folio *folio) > > > > > > static inline void flush_dcache_page(struct page *page) > > > { > > > + struct folio *folio = page_folio(page); > > > + > > > if (cpu_has_dc_aliases) > > > - __flush_dcache_pages(page, 1); > > > + __flush_dcache_folio_pages(folio, page, folio_nr_pages(folio)); > > > > Hmmm, shouldn't this be 1 not folio_nr_pages()? Seems that the original > > implementation only flushed a single page even if contained within a larger > > folio? > > Yes, reworked it 3 times and messed it up during the last rework. Thanks! Woot I found an actual bug :P Yeah it's fiddly so understandable. :) > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo From Liam.Howlett at oracle.com Fri Aug 29 14:25:05 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 14:25:05 -0000 Subject: [PATCH v1 10/36] mm: sanity-check maximum folio size in folio_set_order() In-Reply-To: <20250827220141.262669-11-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-11-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:05]: > Let's sanity-check in folio_set_order() whether we would be trying to > create a folio with an order that would make it exceed MAX_FOLIO_ORDER. > > This will enable the check whenever a folio/compound page is initialized > through prepare_compound_head() / prepare_compound_page(). > > Reviewed-by: Zi Yan > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. 
Howlett > --- > mm/internal.h | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/mm/internal.h b/mm/internal.h > index 45da9ff5694f6..9b0129531d004 100644 > --- a/mm/internal.h > +++ b/mm/internal.h > @@ -755,6 +755,7 @@ static inline void folio_set_order(struct folio *folio, unsigned int order) > { > if (WARN_ON_ONCE(!order || !folio_test_large(folio))) > return; > + VM_WARN_ON_ONCE(order > MAX_FOLIO_ORDER); > > folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order; > #ifdef NR_PAGES_IN_LARGE_FOLIO > -- > 2.50.1 > > From Liam.Howlett at oracle.com Fri Aug 29 14:29:01 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 14:29:01 -0000 Subject: [PATCH v1 11/36] mm: limit folio/compound page sizes in problematic kernel configs In-Reply-To: <20250827220141.262669-12-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-12-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:05]: > Let's limit the maximum folio size in problematic kernel config where > the memmap is allocated per memory section (SPARSEMEM without > SPARSEMEM_VMEMMAP) to a single memory section. > > Currently, only a single architecture supports ARCH_HAS_GIGANTIC_PAGE > but not SPARSEMEM_VMEMMAP: sh. > > Fortunately, the biggest hugetlb size sh supports is 64 MiB > (HUGETLB_PAGE_SIZE_64MB) and the section size is at least 64 MiB > (SECTION_SIZE_BITS == 26), so their use case is not degraded. > > As folios and memory sections are naturally aligned to their power-of-2 size > in memory, consequently a single folio can no longer span multiple memory > sections on these problematic kernel configs. > > nth_page() is no longer required when operating within a single compound > page / folio. > > Reviewed-by: Zi Yan > Acked-by: Mike Rapoport (Microsoft) > Signed-off-by: David Hildenbrand Reviewed-by: Liam R.
Howlett > --- > include/linux/mm.h | 22 ++++++++++++++++++---- > 1 file changed, 18 insertions(+), 4 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 77737cbf2216a..2dee79fa2efcf 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2053,11 +2053,25 @@ static inline long folio_nr_pages(const struct folio *folio) > return folio_large_nr_pages(folio); > } > > -/* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_ORDER PUD_ORDER > -#else > +#if !defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) > +/* > + * We don't expect any folios that exceed buddy sizes (and consequently > + * memory sections). > + */ > #define MAX_FOLIO_ORDER MAX_PAGE_ORDER > +#elif defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > +/* > + * Only pages within a single memory section are guaranteed to be > + * contiguous. By limiting folios to a single memory section, all folio > + * pages are guaranteed to be contiguous. > + */ > +#define MAX_FOLIO_ORDER PFN_SECTION_SHIFT > +#else > +/* > + * There is no real limit on the folio size. We limit them to the maximum we > + * currently expect (e.g., hugetlb, dax). > + */ > +#define MAX_FOLIO_ORDER PUD_ORDER > #endif > > #define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 14:42:06 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 14:42:06 -0000 Subject: [PATCH v1 12/36] mm: simplify folio_page() and folio_page_idx() In-Reply-To: <20250827220141.262669-13-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-13-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:06]: > Now that a single folio/compound page can no longer span memory sections > in problematic kernel configurations, we can stop using nth_page(). ..but only in a subset of nth_page uses, considering mm.h still has the define. 
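With folio pages guaranteed contiguous, both helpers reduce to plain pointer arithmetic. A small user-space model of the simplified folio_page()/folio_page_idx() (toy struct page, illustration only, not the kernel definitions):

```c
#include <assert.h>

/* Toy struct page purely for illustration. */
struct page { int dummy; };

static struct page demo_folio[8];   /* stands in for a contiguous folio */

/* Simplified folio_page(): index into the contiguous page array. */
static struct page *folio_page_sketch(struct page *head, unsigned long n)
{
        return head + n;
}

/* Simplified folio_page_idx(): plain pointer difference, no
 * pfn round-trip needed once contiguity is guaranteed. */
static unsigned long folio_page_idx_sketch(const struct page *head,
                                           const struct page *page)
{
        return (unsigned long)(page - head);
}
```

The pfn-based variants that the patch removes were only needed when a folio could straddle a gap between per-section memmaps.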
> > While at it, turn both macros into static inline functions and add > kernel doc for folio_page_idx(). > > Reviewed-by: Zi Yan > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > include/linux/mm.h | 16 ++++++++++++++-- > include/linux/page-flags.h | 5 ++++- > 2 files changed, 18 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 2dee79fa2efcf..f6880e3225c5c 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -210,10 +210,8 @@ extern unsigned long sysctl_admin_reserve_kbytes; > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > -#define folio_page_idx(folio, p) (page_to_pfn(p) - folio_pfn(folio)) > #else > #define nth_page(page,n) ((page) + (n)) > -#define folio_page_idx(folio, p) ((p) - &(folio)->page) > #endif > > /* to align the pointer to the (next) page boundary */ > @@ -225,6 +223,20 @@ extern unsigned long sysctl_admin_reserve_kbytes; > /* test whether an address (unsigned long or pointer) is aligned to PAGE_SIZE */ > #define PAGE_ALIGNED(addr) IS_ALIGNED((unsigned long)(addr), PAGE_SIZE) > > +/** > + * folio_page_idx - Return the number of a page in a folio. > + * @folio: The folio. > + * @page: The folio page. > + * > + * This function expects that the page is actually part of the folio. > + * The returned number is relative to the start of the folio. 
> + */ > +static inline unsigned long folio_page_idx(const struct folio *folio, > + const struct page *page) > +{ > + return page - &folio->page; > +} > + > static inline struct folio *lru_to_folio(struct list_head *head) > { > return list_entry((head)->prev, struct folio, lru); > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h > index 5ee6ffbdbf831..faf17ca211b4f 100644 > --- a/include/linux/page-flags.h > +++ b/include/linux/page-flags.h > @@ -316,7 +316,10 @@ static __always_inline unsigned long _compound_head(const struct page *page) > * check that the page number lies within @folio; the caller is presumed > * to have a reference to the page. > */ > -#define folio_page(folio, n) nth_page(&(folio)->page, n) > +static inline struct page *folio_page(struct folio *folio, unsigned long n) > +{ > + return &folio->page + n; > +} > > static __always_inline int PageTail(const struct page *page) > { > -- > 2.50.1 > > From lorenzo.stoakes at oracle.com Fri Aug 29 14:45:11 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 14:45:11 -0000 Subject: [PATCH v1 21/36] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: <62fad23f-e8dc-4fd5-a82f-6419376465b5@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-22-david@redhat.com> <62fad23f-e8dc-4fd5-a82f-6419376465b5@redhat.com> Message-ID: On Fri, Aug 29, 2025 at 04:34:54PM +0200, David Hildenbrand wrote: > On 28.08.25 19:28, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:25AM +0200, David Hildenbrand wrote: > > > Let's disallow handing out PFN ranges with non-contiguous pages, so we > > > can remove the nth-page usage in __cma_alloc(), and so any callers don't > > > have to worry about that either when wanting to blindly iterate pages. > > > > > > This is really only a problem in configs with SPARSEMEM but without > > > SPARSEMEM_VMEMMAP, and only when we would cross memory sections in some > > > cases. 
> > > > I'm guessing this is something that we don't need to worry about in > > reality? > > That my theory yes. Let's hope correct haha, but seems reasonable. > > > > > > > > > Will this cause harm? Probably not, because it's mostly 32bit that does > > > not support SPARSEMEM_VMEMMAP. If this ever becomes a problem we could > > > look into allocating the memmap for the memory sections spanned by a > > > single CMA region in one go from memblock. > > > > > > Reviewed-by: Alexandru Elisei > > > Signed-off-by: David Hildenbrand > > > > LGTM other than refactoring point below. > > > > CMA stuff looks fine afaict after staring at it for a while, on proviso > > that handing out ranges within the same section is always going to be the > > case. > > > > Anyway overall, > > > > LGTM, so: > > > > Reviewed-by: Lorenzo Stoakes > > > > > > > --- > > > include/linux/mm.h | 6 ++++++ > > > mm/cma.c | 39 ++++++++++++++++++++++++--------------- > > > mm/util.c | 33 +++++++++++++++++++++++++++++++++ > > > 3 files changed, 63 insertions(+), 15 deletions(-) > > > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > > index f6880e3225c5c..2ca1eb2db63ec 100644 > > > --- a/include/linux/mm.h > > > +++ b/include/linux/mm.h > > > @@ -209,9 +209,15 @@ extern unsigned long sysctl_user_reserve_kbytes; > > > extern unsigned long sysctl_admin_reserve_kbytes; > > > > > > #if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > > > +bool page_range_contiguous(const struct page *page, unsigned long nr_pages); > > > #define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n)) > > > #else > > > #define nth_page(page,n) ((page) + (n)) > > > +static inline bool page_range_contiguous(const struct page *page, > > > + unsigned long nr_pages) > > > +{ > > > + return true; > > > +} > > > #endif > > > > > > /* to align the pointer to the (next) page boundary */ > > > diff --git a/mm/cma.c b/mm/cma.c > > > index e56ec64d0567e..813e6dc7b0954 100644 > > > --- a/mm/cma.c > > > +++ 
b/mm/cma.c > > > @@ -780,10 +780,8 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > > > unsigned long count, unsigned int align, > > > struct page **pagep, gfp_t gfp) > > > { > > > - unsigned long mask, offset; > > > - unsigned long pfn = -1; > > > - unsigned long start = 0; > > > unsigned long bitmap_maxno, bitmap_no, bitmap_count; > > > + unsigned long start, pfn, mask, offset; > > > int ret = -EBUSY; > > > struct page *page = NULL; > > > > > > @@ -795,7 +793,7 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > > > if (bitmap_count > bitmap_maxno) > > > goto out; > > > > > > - for (;;) { > > > + for (start = 0; ; start = bitmap_no + mask + 1) { > > > spin_lock_irq(&cma->lock); > > > /* > > > * If the request is larger than the available number > > > @@ -812,6 +810,22 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > > > spin_unlock_irq(&cma->lock); > > > break; > > > } > > > + > > > + pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); > > > + page = pfn_to_page(pfn); > > > + > > > + /* > > > + * Do not hand out page ranges that are not contiguous, so > > > + * callers can just iterate the pages without having to worry > > > + * about these corner cases. 
> > > + */ > > > + if (!page_range_contiguous(page, count)) { > > > + spin_unlock_irq(&cma->lock); > > > + pr_warn_ratelimited("%s: %s: skipping incompatible area [0x%lx-0x%lx]", > > > + __func__, cma->name, pfn, pfn + count - 1); > > > + continue; > > > + } > > > + > > > bitmap_set(cmr->bitmap, bitmap_no, bitmap_count); > > > cma->available_count -= count; > > > /* > > > @@ -821,29 +835,24 @@ static int cma_range_alloc(struct cma *cma, struct cma_memrange *cmr, > > > */ > > > spin_unlock_irq(&cma->lock); > > > > > > - pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit); > > > mutex_lock(&cma->alloc_mutex); > > > ret = alloc_contig_range(pfn, pfn + count, ACR_FLAGS_CMA, gfp); > > > mutex_unlock(&cma->alloc_mutex); > > > - if (ret == 0) { > > > - page = pfn_to_page(pfn); > > > + if (!ret) > > > break; > > > - } > > > > > > cma_clear_bitmap(cma, cmr, pfn, count); > > > if (ret != -EBUSY) > > > break; > > > > > > pr_debug("%s(): memory range at pfn 0x%lx %p is busy, retrying\n", > > > - __func__, pfn, pfn_to_page(pfn)); > > > + __func__, pfn, page); > > > > > > - trace_cma_alloc_busy_retry(cma->name, pfn, pfn_to_page(pfn), > > > - count, align); > > > - /* try again with a bit different memory target */ > > > - start = bitmap_no + mask + 1; > > > + trace_cma_alloc_busy_retry(cma->name, pfn, page, count, align); > > > } > > > out: > > > - *pagep = page; > > > + if (!ret) > > > + *pagep = page; > > > return ret; > > > } > > > > > > @@ -882,7 +891,7 @@ static struct page *__cma_alloc(struct cma *cma, unsigned long count, > > > */ > > > if (page) { > > > for (i = 0; i < count; i++) > > > - page_kasan_tag_reset(nth_page(page, i)); > > > + page_kasan_tag_reset(page + i); > > > } > > > > > > if (ret && !(gfp & __GFP_NOWARN)) { > > > diff --git a/mm/util.c b/mm/util.c > > > index d235b74f7aff7..0bf349b19b652 100644 > > > --- a/mm/util.c > > > +++ b/mm/util.c > > > @@ -1280,4 +1280,37 @@ unsigned int folio_pte_batch(struct folio *folio, pte_t *ptep, pte_t pte, > > > { > > 
> return folio_pte_batch_flags(folio, NULL, ptep, &pte, max_nr, 0); > > > } > > > + > > > +#if defined(CONFIG_SPARSEMEM) && !defined(CONFIG_SPARSEMEM_VMEMMAP) > > > +/** > > > + * page_range_contiguous - test whether the page range is contiguous > > > + * @page: the start of the page range. > > > + * @nr_pages: the number of pages in the range. > > > + * > > > + * Test whether the page range is contiguous, such that they can be iterated > > > + * naively, corresponding to iterating a contiguous PFN range. > > > + * > > > + * This function should primarily only be used for debug checks, or when > > > + * working with page ranges that are not naturally contiguous (e.g., pages > > > + * within a folio are). > > > + * > > > + * Returns true if contiguous, otherwise false. > > > + */ > > > +bool page_range_contiguous(const struct page *page, unsigned long nr_pages) > > > +{ > > > + const unsigned long start_pfn = page_to_pfn(page); > > > + const unsigned long end_pfn = start_pfn + nr_pages; > > > + unsigned long pfn; > > > + > > > + /* > > > + * The memmap is allocated per memory section. We need to check > > > + * each involved memory section once. > > > + */ > > > + for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); > > > + pfn < end_pfn; pfn += PAGES_PER_SECTION) > > > + if (unlikely(page + (pfn - start_pfn) != pfn_to_page(pfn))) > > > + return false; > > > > I find this pretty confusing, my test for this is how many times I have to read > > the code to understand what it's doing :) > > > > So we have something like: > > > > (pfn of page) > > start_pfn pfn = align UP > > | | > > v v > > | section | > > <-----------------> > > pfn - start_pfn > > > > Then check page + (pfn - start_pfn) == pfn_to_page(pfn) > > > > And loop such that: > > > > (pfn of page) > > start_pfn pfn > > | | > > v v > > | section | section | > > <------------------------------------------> > > pfn - start_pfn > > > > Again check page + (pfn - start_pfn) == pfn_to_page(pfn) > > > > And so on. 
> > > > So the logic looks good, but it's just... that took me a hot second to > > parse :) > > > > I think a few simple fixups > > > > bool page_range_contiguous(const struct page *page, unsigned long nr_pages) > > { > > const unsigned long start_pfn = page_to_pfn(page); > > const unsigned long end_pfn = start_pfn + nr_pages; > > /* The PFN of the start of the next section. */ > > unsigned long pfn = ALIGN(start_pfn, PAGES_PER_SECTION); > > /* The page we'd expected to see if the range were contiguous. */ > > struct page *expected = page + (pfn - start_pfn); > > > > /* > > * The memmap is allocated per memory section. We need to check > > * each involved memory section once. > > */ > > for (; pfn < end_pfn; pfn += PAGES_PER_SECTION, expected += PAGES_PER_SECTION) > > if (unlikely(expected != pfn_to_page(pfn))) > > return false; > > return true; > > } > > > > Hm, I prefer my variant, especially where the pfn is calculated in the for loop. Likely a > matter of personal taste. Sure this is always a factor in code :) > > But I can see why skipping the first section might be a surprise when not > having the semantics of ALIGN() in the cache. Yup! > > So I'll add the following on top: > > diff --git a/mm/util.c b/mm/util.c > index 0bf349b19b652..fbdb73aaf35fe 100644 > --- a/mm/util.c > +++ b/mm/util.c > @@ -1303,8 +1303,10 @@ bool page_range_contiguous(const struct page *page, unsigned long nr_pages) > unsigned long pfn; > /* > - * The memmap is allocated per memory section. We need to check > - * each involved memory section once. > + * The memmap is allocated per memory section, so no need to check > + * within the first section. However, we need to check each other > + * spanned memory section once, making sure the first page in a > + * section could similarly be reached by just iterating pages. > */ > for (pfn = ALIGN(start_pfn, PAGES_PER_SECTION); > pfn < end_pfn; pfn += PAGES_PER_SECTION) Cool this helps clarify things, that'll do fine! > > Thanks! 
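The loop discussed above can be exercised in user space. Here pfn_to_page() is simulated with a per-section base table so that a "hole" between section memmaps is detectable; the section size of 8 pages and all layout values are illustrative only.

```c
#include <assert.h>

#define PAGES_PER_SECTION 8UL
#define ALIGN_UP(x, a)    (((x) + (a) - 1) & ~((a) - 1))

struct page { int dummy; };

/* Simulated memmap: sections 0 and 1 are adjacent in the backing
 * array, but section 2 starts after a gap (non-contiguous memmap). */
static struct page backing[64];
static struct page *section_base[4] = {
        backing, backing + 8, backing + 24, backing + 32,
};

static struct page *pfn_to_page_sim(unsigned long pfn)
{
        return section_base[pfn / PAGES_PER_SECTION] +
               pfn % PAGES_PER_SECTION;
}

/* Model of page_range_contiguous(): only section boundaries can break
 * contiguity, so it suffices to probe the first pfn of each spanned
 * section after the starting one. */
static int page_range_contiguous_sim(const struct page *page,
                                     unsigned long start_pfn,
                                     unsigned long nr_pages)
{
        unsigned long end_pfn = start_pfn + nr_pages;
        unsigned long pfn;

        for (pfn = ALIGN_UP(start_pfn, PAGES_PER_SECTION);
             pfn < end_pfn; pfn += PAGES_PER_SECTION)
                if (page + (pfn - start_pfn) != pfn_to_page_sim(pfn))
                        return 0;
        return 1;
}
```

A range that stays within adjacent-memmap sections passes; one that crosses into the gapped section is rejected, matching the CMA behavior the patch introduces.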
> > -- > Cheers > > David / dhildenb > > Cheers, Lorenzo From lorenzo.stoakes at oracle.com Fri Aug 29 14:45:53 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 14:45:53 -0000 Subject: [PATCH v1 20/36] mips: mm: convert __flush_dcache_pages() to __flush_dcache_folio_pages() In-Reply-To: References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-21-david@redhat.com> <2be7db96-2fa2-4348-837e-648124bd604f@redhat.com> <549a60a6-25e2-48d5-b442-49404a857014@lucifer.local> Message-ID: On Fri, Aug 29, 2025 at 03:44:20PM +0200, David Hildenbrand wrote: > On 29.08.25 14:51, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 10:51:46PM +0200, David Hildenbrand wrote: > > > On 28.08.25 18:57, Lorenzo Stoakes wrote: > > > > On Thu, Aug 28, 2025 at 12:01:24AM +0200, David Hildenbrand wrote: > > > > > Let's make it clearer that we are operating within a single folio by > > > > > providing both the folio and the page. > > > > > > > > > > This implies that for flush_dcache_folio() we'll now avoid one more > > > > > page->folio lookup, and that we can safely drop the "nth_page" usage. 
> > > > > > > > > > Cc: Thomas Bogendoerfer > > > > > Signed-off-by: David Hildenbrand > > > > > --- > > > > > arch/mips/include/asm/cacheflush.h | 11 +++++++---- > > > > > arch/mips/mm/cache.c | 8 ++++---- > > > > > 2 files changed, 11 insertions(+), 8 deletions(-) > > > > > > > > > > diff --git a/arch/mips/include/asm/cacheflush.h b/arch/mips/include/asm/cacheflush.h > > > > > index 5d283ef89d90d..8d79bfc687d21 100644 > > > > > --- a/arch/mips/include/asm/cacheflush.h > > > > > +++ b/arch/mips/include/asm/cacheflush.h > > > > > @@ -50,13 +50,14 @@ extern void (*flush_cache_mm)(struct mm_struct *mm); > > > > > extern void (*flush_cache_range)(struct vm_area_struct *vma, > > > > > unsigned long start, unsigned long end); > > > > > extern void (*flush_cache_page)(struct vm_area_struct *vma, unsigned long page, unsigned long pfn); > > > > > -extern void __flush_dcache_pages(struct page *page, unsigned int nr); > > > > > +extern void __flush_dcache_folio_pages(struct folio *folio, struct page *page, unsigned int nr); > > > > > > > > NIT: Be good to drop the extern. > > > > > > I think I'll leave the one in, though, someone should clean up all of them > > > in one go. > > > > This is how we always clean these up though, buuut to be fair that's in mm. > > > > Well, okay, I'll make all the other functions jealous and blame it on you! > :P ;) > > -- > Cheers > > David / dhildenb > From Liam.Howlett at oracle.com Fri Aug 29 14:58:40 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 14:58:40 -0000 Subject: [PATCH v1 13/36] mm/hugetlb: cleanup hugetlb_folio_init_tail_vmemmap() In-Reply-To: <20250827220141.262669-14-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-14-david@redhat.com> Message-ID: * David Hildenbrand [250827 18:06]: > We can now safely iterate over all pages in a folio, so no need for the > pfn_to_page(). 
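The lockstep pfn/page iteration that replaces the per-iteration pfn_to_page() call can be sketched in user space (toy struct page and illustrative pfn values; the real code also calls prep_compound_tail() per page):

```c
#include <assert.h>

/* Toy struct page for illustration. */
struct page {
        unsigned long pfn;
        int refcount;
};

/* Advance the struct page pointer together with the pfn instead of
 * calling pfn_to_page() on every iteration -- valid now that all
 * pages of a folio are guaranteed contiguous. Refcounts go straight
 * to 0, mirroring the set_page_count(page, 0) in the patch. */
static void init_tail_pages_sketch(struct page *folio, unsigned long head_pfn,
                                   unsigned long start, unsigned long end)
{
        struct page *page = folio + start;
        unsigned long pfn;

        for (pfn = head_pfn + start; pfn < head_pfn + end; page++, pfn++) {
                page->pfn = pfn;
                page->refcount = 0;
        }
}

/* Run the sketch on an 8-page toy folio and verify the tail pages. */
static int demo_tail_init_ok(void)
{
        static struct page folio[8];
        unsigned long i;

        init_tail_pages_sketch(folio, 100, 1, 8);
        for (i = 1; i < 8; i++)
                if (folio[i].pfn != 100 + i || folio[i].refcount != 0)
                        return 0;
        return 1;
}
```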
> > Also, as we already force the refcount in __init_single_page() to 1, > we can just set the refcount to 0 and avoid page_ref_freeze() + > VM_BUG_ON. Likely, in the future, we would just want to tell > __init_single_page() to which value to initialize the refcount. > > Further, adjust the comments to highlight that we are dealing with an > open-coded prep_compound_page() variant, and add another comment explaining > why we really need the __init_single_page() only on the tail pages. > > Note that the current code was likely problematic, but we never ran into > it: prep_compound_tail() would have been called with an offset that might > exceed a memory section, and prep_compound_tail() would have simply > added that offset to the page pointer -- which would not have done the > right thing on sparsemem without vmemmap. > > Signed-off-by: David Hildenbrand Acked-by: Liam R. Howlett > --- > mm/hugetlb.c | 20 ++++++++++++-------- > 1 file changed, 12 insertions(+), 8 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 4a97e4f14c0dc..1f42186a85ea4 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -3237,17 +3237,18 @@ static void __init hugetlb_folio_init_tail_vmemmap(struct folio *folio, > { > enum zone_type zone = zone_idx(folio_zone(folio)); > int nid = folio_nid(folio); > + struct page *page = folio_page(folio, start_page_number); > unsigned long head_pfn = folio_pfn(folio); > unsigned long pfn, end_pfn = head_pfn + end_page_number; > - int ret; > - > - for (pfn = head_pfn + start_page_number; pfn < end_pfn; pfn++) { > - struct page *page = pfn_to_page(pfn); > > + /* > + * We mark all tail pages with memblock_reserved_mark_noinit(), > + * so these pages are completely uninitialized. 
> + */ > + for (pfn = head_pfn + start_page_number; pfn < end_pfn; page++, pfn++) { > __init_single_page(page, pfn, zone, nid); > prep_compound_tail((struct page *)folio, pfn - head_pfn); > - ret = page_ref_freeze(page, 1); > - VM_BUG_ON(!ret); > + set_page_count(page, 0); > } > } > > @@ -3257,12 +3258,15 @@ static void __init hugetlb_folio_init_vmemmap(struct folio *folio, > { > int ret; > > - /* Prepare folio head */ > + /* > + * This is an open-coded prep_compound_page() whereby we avoid > + * walking pages twice by initializing/preparing+freezing them in the > + * same go. > + */ > __folio_clear_reserved(folio); > __folio_set_head(folio); > ret = folio_ref_freeze(folio, 1); > VM_BUG_ON(!ret); > - /* Initialize the necessary tail struct pages */ > hugetlb_folio_init_tail_vmemmap(folio, 1, nr_pages); > prep_compound_head((struct page *)folio, huge_page_order(h)); > } > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 14:59:55 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 14:59:55 -0000 Subject: [PATCH v1 14/36] mm/mm/percpu-km: drop nth_page() usage within single allocation In-Reply-To: <20250827220141.262669-15-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-15-david@redhat.com> Message-ID: <2vgbxodabnlvacqoiythb2jsvgp3mhgvkv6od4t6nw4unauhc3@vwqvplaiqk5f> * David Hildenbrand [250827 18:06]: > We're allocating a higher-order page from the buddy. For these pages > (that are guaranteed to not exceed a single memory section) there is no > need to use nth_page(). > > Signed-off-by: David Hildenbrand Acked-by: Liam R. 
Howlett > --- > mm/percpu-km.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/percpu-km.c b/mm/percpu-km.c > index fe31aa19db81a..4efa74a495cb6 100644 > --- a/mm/percpu-km.c > +++ b/mm/percpu-km.c > @@ -69,7 +69,7 @@ static struct pcpu_chunk *pcpu_create_chunk(gfp_t gfp) > } > > for (i = 0; i < nr_pages; i++) > - pcpu_set_page_chunk(nth_page(pages, i), chunk); > + pcpu_set_page_chunk(pages + i, chunk); > > chunk->data = pages; > chunk->base_addr = page_address(pages); > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 15:12:44 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 15:12:44 -0000 Subject: [PATCH v1 17/36] mm/pagewalk: drop nth_page() usage within folio in folio_walk_start() In-Reply-To: <20250827220141.262669-18-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-18-david@redhat.com> Message-ID: <6mckxk4fnam3hxhpvdhyeelu2bbut3xbtmwni2oixfgffzox2m@lnod3lwxkb32> * David Hildenbrand [250827 18:07]: > It's no longer required to use nth_page() within a folio, so let's just > drop the nth_page() in folio_walk_start(). > > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > mm/pagewalk.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/mm/pagewalk.c b/mm/pagewalk.c > index c6753d370ff4e..9e4225e5fcf5c 100644 > --- a/mm/pagewalk.c > +++ b/mm/pagewalk.c > @@ -1004,7 +1004,7 @@ struct folio *folio_walk_start(struct folio_walk *fw, > found: > if (expose_page) > /* Note: Offset from the mapped page, not the folio start. 
*/ > - fw->page = nth_page(page, (addr & (entry_size - 1)) >> PAGE_SHIFT); > + fw->page = page + ((addr & (entry_size - 1)) >> PAGE_SHIFT); > else > fw->page = NULL; > fw->ptl = ptl; > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Fri Aug 29 15:21:02 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Fri, 29 Aug 2025 15:21:02 -0000 Subject: [PATCH v1 18/36] mm/gup: drop nth_page() usage within folio when recording subpages In-Reply-To: <632fea32-28aa-4993-9eff-99fc291c64f2@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-19-david@redhat.com> <632fea32-28aa-4993-9eff-99fc291c64f2@redhat.com> Message-ID: <8a26ae97-9a78-4db5-be98-9c1f6e4fb403@lucifer.local> On Fri, Aug 29, 2025 at 03:41:40PM +0200, David Hildenbrand wrote: > On 28.08.25 18:37, Lorenzo Stoakes wrote: > > On Thu, Aug 28, 2025 at 12:01:22AM +0200, David Hildenbrand wrote: > > > nth_page() is no longer required when iterating over pages within a > > > single folio, so let's just drop it when recording subpages. 
> > > > > > Signed-off-by: David Hildenbrand > > This looks correct to me, so notwithstanding suggestion below, LGTM and: > > > > Reviewed-by: Lorenzo Stoakes > > > > > --- > > > mm/gup.c | 7 +++---- > > > 1 file changed, 3 insertions(+), 4 deletions(-) > > > > > > diff --git a/mm/gup.c b/mm/gup.c > > > index b2a78f0291273..89ca0813791ab 100644 > > > --- a/mm/gup.c > > > +++ b/mm/gup.c > > > @@ -488,12 +488,11 @@ static int record_subpages(struct page *page, unsigned long sz, > > > unsigned long addr, unsigned long end, > > > struct page **pages) > > > { > > > - struct page *start_page; > > > int nr; > > > > > > - start_page = nth_page(page, (addr & (sz - 1)) >> PAGE_SHIFT); > > > + page += (addr & (sz - 1)) >> PAGE_SHIFT; > > > for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) > > > - pages[nr] = nth_page(start_page, nr); > > > + pages[nr] = page++; > > > > > > This is really nice, but I wonder if (while we're here) we can't be even > > more clear as to what's going on here, e.g.: > > > > static int record_subpages(struct page *page, unsigned long sz, > > unsigned long addr, unsigned long end, > > struct page **pages) > > { > > size_t offset_in_folio = (addr & (sz - 1)) >> PAGE_SHIFT; > > struct page *subpage = page + offset_in_folio; > > > > for (; addr != end; addr += PAGE_SIZE) > > *pages++ = subpage++; > > > > return nr; > > } > > > > Or some variant of that with the masking stuff self-documented.
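For reference, the offset arithmetic at the heart of this helper can be exercised in user space (4 KiB pages and a 2 MiB leaf are assumed; toy struct page, illustration only):

```c
#include <assert.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

struct page { int dummy; };

static struct page demo_pages[512];  /* memmap of one 2 MiB leaf */
static struct page *demo_out[8];

/* Sketch of record_subpages(): skip to the subpage that addr maps to
 * within the leaf of size sz, then record one page pointer per
 * PAGE_SIZE step until end. */
static int record_subpages_sketch(struct page *page, unsigned long sz,
                                  unsigned long addr, unsigned long end,
                                  struct page **pages)
{
        int nr;

        page += (addr & (sz - 1)) >> PAGE_SHIFT;
        for (nr = 0; addr != end; nr++, addr += PAGE_SIZE)
                pages[nr] = page++;
        return nr;
}
```

For example, recording addresses 5..8 (in page units) of a 2 MiB leaf yields three entries starting at subpage 5.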
> > What about the following cleanup on top: > > > diff --git a/mm/gup.c b/mm/gup.c > index 89ca0813791ab..5a72a135ec70b 100644 > --- a/mm/gup.c > +++ b/mm/gup.c > @@ -484,19 +484,6 @@ static inline void mm_set_has_pinned_flag(struct mm_struct *mm) > #ifdef CONFIG_MMU > #ifdef CONFIG_HAVE_GUP_FAST > -static int record_subpages(struct page *page, unsigned long sz, > - unsigned long addr, unsigned long end, > - struct page **pages) > -{ > - int nr; > - > - page += (addr & (sz - 1)) >> PAGE_SHIFT; > - for (nr = 0; addr != end; nr++, addr += PAGE_SIZE) > - pages[nr] = page++; > - > - return nr; > -} > - > /** > * try_grab_folio_fast() - Attempt to get or pin a folio in fast path. > * @page: pointer to page to be grabbed > @@ -2963,8 +2950,8 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, > if (pmd_special(orig)) > return 0; > - page = pmd_page(orig); > - refs = record_subpages(page, PMD_SIZE, addr, end, pages + *nr); > + refs = (end - addr) >> PAGE_SHIFT; > + page = pmd_page(orig) + ((addr & ~PMD_MASK) >> PAGE_SHIFT); Ah I see we use page_folio() in try_grab_folio_fast() so this being within PMD is ok. 
> folio = try_grab_folio_fast(page, refs, flags); > if (!folio) > @@ -2985,6 +2972,8 @@ static int gup_fast_pmd_leaf(pmd_t orig, pmd_t *pmdp, unsigned long addr, > } > *nr += refs; > + for (; refs; refs--) > + *(pages++) = page++; > folio_set_referenced(folio); > return 1; > } > @@ -3003,8 +2992,8 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, > if (pud_special(orig)) > return 0; > - page = pud_page(orig); > - refs = record_subpages(page, PUD_SIZE, addr, end, pages + *nr); > + refs = (end - addr) >> PAGE_SHIFT; > + page = pud_page(orig) + ((addr & ~PUD_MASK) >> PAGE_SHIFT); > folio = try_grab_folio_fast(page, refs, flags); > if (!folio) > @@ -3026,6 +3015,8 @@ static int gup_fast_pud_leaf(pud_t orig, pud_t *pudp, unsigned long addr, > } > *nr += refs; > + for (; refs; refs--) > + *(pages++) = page++; > folio_set_referenced(folio); > return 1; > } > > > The nice thing is that we only record pages in the array if they actually passed our tests. Yeah that's nice actually. This is fine (not the meme :P) So yes let's do this! > > > -- > Cheers > > David / dhildenb > Cheers, Lorenzo From martin.petersen at oracle.com Sun Aug 31 01:04:36 2025 From: martin.petersen at oracle.com (Martin K. Petersen) Date: Sun, 31 Aug 2025 01:04:36 -0000 Subject: [PATCH v1 29/36] scsi: scsi_lib: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-30-david@redhat.com> (David Hildenbrand's message of "Thu, 28 Aug 2025 00:01:33 +0200") References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-30-david@redhat.com> Message-ID: David, > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. Reviewed-by: Martin K. Petersen -- Martin K. Petersen From martin.petersen at oracle.com Sun Aug 31 01:05:13 2025 From: martin.petersen at oracle.com (Martin K. 
Petersen) Date: Sun, 31 Aug 2025 01:05:13 -0000 Subject: [PATCH v1 30/36] scsi: sg: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-31-david@redhat.com> (David Hildenbrand's message of "Thu, 28 Aug 2025 00:01:34 +0200") References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-31-david@redhat.com> Message-ID: David, > It's no longer required to use nth_page() when iterating pages within > a single SG entry, so let's drop the nth_page() usage. Reviewed-by: Martin K. Petersen -- Martin K. Petersen From jgg at nvidia.com Fri Aug 22 14:30:52 2025 From: jgg at nvidia.com (Jason Gunthorpe) Date: Fri, 22 Aug 2025 14:30:52 -0000 Subject: [PATCH RFC 00/35] mm: remove nth_page() In-Reply-To: <20250821200701.1329277-1-david@redhat.com> References: <20250821200701.1329277-1-david@redhat.com> Message-ID: <20250822143043.GG1311579@nvidia.com> On Thu, Aug 21, 2025 at 10:06:26PM +0200, David Hildenbrand wrote: > As discussed recently with Linus, nth_page() is just nasty and we would > like to remove it. > > To recap, the reason we currently need nth_page() within a folio is because > on some kernel configs (SPARSEMEM without SPARSEMEM_VMEMMAP), the > memmap is allocated per memory section. > > While buddy allocations cannot cross memory section boundaries, hugetlb > and dax folios can. > > So crossing a memory section means that "page++" could do the wrong thing. > Instead, nth_page() on these problematic configs always goes from > page->pfn, to then go from (++pfn)->page, which is rather nasty. > > Likely, many people have no idea when nth_page() is required and when > it might be dropped. > > We refer to such problematic PFN ranges as "non-contiguous pages". > If we only deal with "contiguous pages", there is no need for nth_page().
> > Besides that "obvious" folio case, we might end up using nth_page() > within CMA allocations (again, could span memory sections), and in > one corner case (kfence) when processing memblock allocations (again, > could span memory sections). I browsed the patches and it looks great to me, thanks for doing this Jason From alexandru.elisei at arm.com Tue Aug 26 13:03:28 2025 From: alexandru.elisei at arm.com (Alexandru Elisei) Date: Tue, 26 Aug 2025 13:03:28 -0000 Subject: [PATCH RFC 21/35] mm/cma: refuse handing out non-contiguous page ranges In-Reply-To: References: <20250821200701.1329277-1-david@redhat.com> <20250821200701.1329277-22-david@redhat.com> Message-ID: Hi David, On Tue, Aug 26, 2025 at 01:04:33PM +0200, David Hildenbrand wrote: .. > > Just so I can better understand the problem being fixed, I guess you can have > > two consecutive pfns with non-consecutive associated struct page if you have two > > adjacent memory sections spanning the same physical memory region, is that > > correct? > > Exactly. Essentially on SPARSEMEM without SPARSEMEM_VMEMMAP it is not > guaranteed that > > pfn_to_page(pfn + 1) == pfn_to_page(pfn) + 1 > > when we cross memory section boundaries. > > It can be the case for early boot memory if we allocated consecutive areas > from memblock when allocating the memmap (struct pages) per memory section, > but it's not guaranteed. Thank you for the explanation, but I'm a bit confused by the last paragraph. I think what you're saying is that we can also have the reverse problem, where consecutive struct page * represent non-consecutive pfns, because memmap allocations happened to return consecutive virtual addresses, is that right? If that's correct, I don't think that's the case for CMA, which deals out contiguous physical memory. Or were you just trying to explain the other side of the problem, and I'm just overthinking it? 
Thanks, Alex From lorenzo.stoakes at oracle.com Thu Aug 28 14:40:29 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 14:40:29 -0000 Subject: [PATCH v1 07/36] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() In-Reply-To: <20250827220141.262669-8-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-8-david@redhat.com> Message-ID: <29226dad-e119-4727-9e23-dc7c527e4281@lucifer.local> On Thu, Aug 28, 2025 at 12:01:11AM +0200, David Hildenbrand wrote: > Let's reject unreasonable folio sizes early, where we can still fail. > We'll add sanity checks to prepare_compound_head/prepare_compound_page > next. > > Is there a way to configure a system such that unreasonable folio sizes > would be possible? It would already be rather questionable. > > If so, we'd probably want to bail out earlier, where we can avoid a > WARN and just report a proper error message that indicates where > something went wrong such that we messed up. 
> > Acked-by: SeongJae Park > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > mm/memremap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/memremap.c b/mm/memremap.c > index b0ce0d8254bd8..a2d4bb88f64b6 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > > if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) > return ERR_PTR(-EINVAL); > + if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER, > + "requested folio size unsupported\n")) > + return ERR_PTR(-EINVAL); > > switch (pgmap->type) { > case MEMORY_DEVICE_PRIVATE: > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:02:22 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:02:22 -0000 Subject: [PATCH v1 31/36] vfio/pci: drop nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-32-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-32-david@redhat.com> Message-ID: <1b1b425f-e8de-4760-a70e-f29897fd9367@lucifer.local> On Thu, Aug 28, 2025 at 12:01:35AM +0200, David Hildenbrand wrote: > It's no longer required to use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. 
> > Cc: Brett Creeley > Cc: Jason Gunthorpe > Cc: Yishai Hadas > Cc: Shameer Kolothum > Cc: Kevin Tian > Cc: Alex Williamson > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > drivers/vfio/pci/pds/lm.c | 3 +-- > drivers/vfio/pci/virtio/migrate.c | 3 +-- > 2 files changed, 2 insertions(+), 4 deletions(-) > > diff --git a/drivers/vfio/pci/pds/lm.c b/drivers/vfio/pci/pds/lm.c > index f2673d395236a..4d70c833fa32e 100644 > --- a/drivers/vfio/pci/pds/lm.c > +++ b/drivers/vfio/pci/pds/lm.c > @@ -151,8 +151,7 @@ static struct page *pds_vfio_get_file_page(struct pds_vfio_lm_file *lm_file, > lm_file->last_offset_sg = sg; > lm_file->sg_last_entry += i; > lm_file->last_offset = cur_offset; > - return nth_page(sg_page(sg), > - (offset - cur_offset) / PAGE_SIZE); > + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; > } > cur_offset += sg->length; > } > diff --git a/drivers/vfio/pci/virtio/migrate.c b/drivers/vfio/pci/virtio/migrate.c > index ba92bb4e9af94..7dd0ac866461d 100644 > --- a/drivers/vfio/pci/virtio/migrate.c > +++ b/drivers/vfio/pci/virtio/migrate.c > @@ -53,8 +53,7 @@ virtiovf_get_migration_page(struct virtiovf_data_buffer *buf, > buf->last_offset_sg = sg; > buf->sg_last_entry += i; > buf->last_offset = cur_offset; > - return nth_page(sg_page(sg), > - (offset - cur_offset) / PAGE_SIZE); > + return sg_page(sg) + (offset - cur_offset) / PAGE_SIZE; > } > cur_offset += sg->length; > } > -- > 2.50.1 > From lorenzo.stoakes at oracle.com Thu Aug 28 18:03:49 2025 From: lorenzo.stoakes at oracle.com (Lorenzo Stoakes) Date: Thu, 28 Aug 2025 18:03:49 -0000 Subject: [PATCH v1 32/36] crypto: remove nth_page() usage within SG entry In-Reply-To: <20250827220141.262669-33-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-33-david@redhat.com> Message-ID: <9bfd5683-0eb6-4566-939d-fff01454849f@lucifer.local> On Thu, Aug 28, 2025 at 12:01:36AM +0200, David Hildenbrand wrote: > It's no longer required to 
use nth_page() when iterating pages within a > single SG entry, so let's drop the nth_page() usage. > > Cc: Herbert Xu > Cc: "David S. Miller" > Signed-off-by: David Hildenbrand LGTM, so: Reviewed-by: Lorenzo Stoakes > --- > crypto/ahash.c | 4 ++-- > crypto/scompress.c | 8 ++++---- > include/crypto/scatterwalk.h | 4 ++-- > 3 files changed, 8 insertions(+), 8 deletions(-) > > diff --git a/crypto/ahash.c b/crypto/ahash.c > index a227793d2c5b5..dfb4f5476428f 100644 > --- a/crypto/ahash.c > +++ b/crypto/ahash.c > @@ -88,7 +88,7 @@ static int hash_walk_new_entry(struct crypto_hash_walk *walk) > > sg = walk->sg; > walk->offset = sg->offset; > - walk->pg = nth_page(sg_page(walk->sg), (walk->offset >> PAGE_SHIFT)); > + walk->pg = sg_page(walk->sg) + (walk->offset >> PAGE_SHIFT); > walk->offset = offset_in_page(walk->offset); > walk->entrylen = sg->length; > > @@ -226,7 +226,7 @@ int shash_ahash_digest(struct ahash_request *req, struct shash_desc *desc) > if (!IS_ENABLED(CONFIG_HIGHMEM)) > return crypto_shash_digest(desc, data, nbytes, req->result); > > - page = nth_page(page, offset >> PAGE_SHIFT); > + page += offset >> PAGE_SHIFT; > offset = offset_in_page(offset); > > if (nbytes > (unsigned int)PAGE_SIZE - offset) > diff --git a/crypto/scompress.c b/crypto/scompress.c > index c651e7f2197a9..1a7ed8ae65b07 100644 > --- a/crypto/scompress.c > +++ b/crypto/scompress.c > @@ -198,7 +198,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) > } else > return -ENOSYS; > > - dpage = nth_page(dpage, doff / PAGE_SIZE); > + dpage += doff / PAGE_SIZE; > doff = offset_in_page(doff); > > n = (dlen - 1) / PAGE_SIZE; > @@ -220,12 +220,12 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) > } else > break; > > - spage = nth_page(spage, soff / PAGE_SIZE); > + spage = spage + soff / PAGE_SIZE; > soff = offset_in_page(soff); > > n = (slen - 1) / PAGE_SIZE; > n += (offset_in_page(slen - 1) + soff) / PAGE_SIZE; > - if (PageHighMem(nth_page(spage, n)) && > + 
if (PageHighMem(spage + n) && > size_add(soff, slen) > PAGE_SIZE) > break; > src = kmap_local_page(spage) + soff; > @@ -270,7 +270,7 @@ static int scomp_acomp_comp_decomp(struct acomp_req *req, int dir) > if (dlen <= PAGE_SIZE) > break; > dlen -= PAGE_SIZE; > - dpage = nth_page(dpage, 1); > + dpage++; Can't help but chuckle when I see this simplification each time, really nice! :) > } > } > > diff --git a/include/crypto/scatterwalk.h b/include/crypto/scatterwalk.h > index 15ab743f68c8f..83d14376ff2bc 100644 > --- a/include/crypto/scatterwalk.h > +++ b/include/crypto/scatterwalk.h > @@ -159,7 +159,7 @@ static inline void scatterwalk_map(struct scatter_walk *walk) > if (IS_ENABLED(CONFIG_HIGHMEM)) { > struct page *page; > > - page = nth_page(base_page, offset >> PAGE_SHIFT); > + page = base_page + (offset >> PAGE_SHIFT); > offset = offset_in_page(offset); > addr = kmap_local_page(page) + offset; > } else { > @@ -259,7 +259,7 @@ static inline void scatterwalk_done_dst(struct scatter_walk *walk, > end += (offset_in_page(offset) + offset_in_page(nbytes) + > PAGE_SIZE - 1) >> PAGE_SHIFT; > for (i = start; i < end; i++) > - flush_dcache_page(nth_page(base_page, i)); > + flush_dcache_page(base_page + i); > } > scatterwalk_advance(walk, nbytes); > } > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 00:34:28 2025 From: Liam.Howlett at oracle.com (Liam R. Howlett) Date: Fri, 29 Aug 2025 00:34:28 -0000 Subject: [PATCH v1 06/36] mm/page_alloc: reject unreasonable folio/compound page sizes in alloc_contig_range_noprof() In-Reply-To: <20250827220141.262669-7-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-7-david@redhat.com> Message-ID: <3hpjmfa6p3onfdv4ma4nv2tdggvsyarh7m36aufy6hvwqtp2wd@2odohwxkl3rk> * David Hildenbrand [250827 18:04]: > Let's reject them early, which in turn makes folio_alloc_gigantic() reject > them properly. 
> > To avoid converting from order to nr_pages, let's just add MAX_FOLIO_ORDER > and calculate MAX_FOLIO_NR_PAGES based on that. > > Reviewed-by: Zi Yan > Acked-by: SeongJae Park > Signed-off-by: David Hildenbrand Nit below, but.. Reviewed-by: Liam R. Howlett > --- > include/linux/mm.h | 6 ++++-- > mm/page_alloc.c | 5 ++++- > 2 files changed, 8 insertions(+), 3 deletions(-) > > diff --git a/include/linux/mm.h b/include/linux/mm.h > index 00c8a54127d37..77737cbf2216a 100644 > --- a/include/linux/mm.h > +++ b/include/linux/mm.h > @@ -2055,11 +2055,13 @@ static inline long folio_nr_pages(const struct folio *folio) > > /* Only hugetlbfs can allocate folios larger than MAX_ORDER */ > #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE > -#define MAX_FOLIO_NR_PAGES (1UL << PUD_ORDER) > +#define MAX_FOLIO_ORDER PUD_ORDER > #else > -#define MAX_FOLIO_NR_PAGES MAX_ORDER_NR_PAGES > +#define MAX_FOLIO_ORDER MAX_PAGE_ORDER > #endif > > +#define MAX_FOLIO_NR_PAGES (1UL << MAX_FOLIO_ORDER) > + > /* > * compound_nr() returns the number of pages in this potentially compound > * page. 
compound_nr() can be called on a tail page, and is defined to > diff --git a/mm/page_alloc.c b/mm/page_alloc.c > index baead29b3e67b..426bc404b80cc 100644 > --- a/mm/page_alloc.c > +++ b/mm/page_alloc.c > @@ -6833,6 +6833,7 @@ static int __alloc_contig_verify_gfp_mask(gfp_t gfp_mask, gfp_t *gfp_cc_mask) > int alloc_contig_range_noprof(unsigned long start, unsigned long end, > acr_flags_t alloc_flags, gfp_t gfp_mask) > { > + const unsigned int order = ilog2(end - start); > unsigned long outer_start, outer_end; > int ret = 0; > > @@ -6850,6 +6851,9 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > PB_ISOLATE_MODE_CMA_ALLOC : > PB_ISOLATE_MODE_OTHER; > > + if (WARN_ON_ONCE((gfp_mask & __GFP_COMP) && order > MAX_FOLIO_ORDER)) > + return -EINVAL; > + > gfp_mask = current_gfp_context(gfp_mask); > if (__alloc_contig_verify_gfp_mask(gfp_mask, (gfp_t *)&cc.gfp_mask)) > return -EINVAL; > @@ -6947,7 +6951,6 @@ int alloc_contig_range_noprof(unsigned long start, unsigned long end, > free_contig_range(end, outer_end - end); > } else if (start == outer_start && end == outer_end && is_power_of_2(end - start)) { > struct page *head = pfn_to_page(start); > - int order = ilog2(end - start); You have changed this from an int to a const unsigned int, which is totally fine but it was left out of the change log. Probably not really worth mentioning but curious why the change to unsigned here? > > check_new_pages(head, order); > prep_new_page(head, order, gfp_mask, 0); > -- > 2.50.1 > From Liam.Howlett at oracle.com Fri Aug 29 00:35:06 2025 From: Liam.Howlett at oracle.com (Liam R. 
Howlett) Date: Fri, 29 Aug 2025 00:35:06 -0000 Subject: [PATCH v1 07/36] mm/memremap: reject unreasonable folio/compound page sizes in memremap_pages() In-Reply-To: <20250827220141.262669-8-david@redhat.com> References: <20250827220141.262669-1-david@redhat.com> <20250827220141.262669-8-david@redhat.com> Message-ID: <4rjh3rjhz6c3iov3ukkrhwm7yyma6vfsbfynrzaciw2jkdw3f7@mipp5xo2zqbe> * David Hildenbrand [250827 18:04]: > Let's reject unreasonable folio sizes early, where we can still fail. > We'll add sanity checks to prepare_compound_head/prepare_compound_page > next. > > Is there a way to configure a system such that unreasonable folio sizes > would be possible? It would already be rather questionable. > > If so, we'd probably want to bail out earlier, where we can avoid a > WARN and just report a proper error message that indicates where > something went wrong such that we messed up. > > Acked-by: SeongJae Park > Signed-off-by: David Hildenbrand Reviewed-by: Liam R. Howlett > --- > mm/memremap.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/mm/memremap.c b/mm/memremap.c > index b0ce0d8254bd8..a2d4bb88f64b6 100644 > --- a/mm/memremap.c > +++ b/mm/memremap.c > @@ -275,6 +275,9 @@ void *memremap_pages(struct dev_pagemap *pgmap, int nid) > > if (WARN_ONCE(!nr_range, "nr_range must be specified\n")) > return ERR_PTR(-EINVAL); > + if (WARN_ONCE(pgmap->vmemmap_shift > MAX_FOLIO_ORDER, > + "requested folio size unsupported\n")) > + return ERR_PTR(-EINVAL); > > switch (pgmap->type) { > case MEMORY_DEVICE_PRIVATE: > -- > 2.50.1 > From radon at intuitiveexplanations.com Wed Aug 13 22:41:44 2025 From: radon at intuitiveexplanations.com (Radon Rosborough) Date: Wed, 13 Aug 2025 22:41:44 -0000 Subject: Incorrect behavior of "exclude private IPs" in Android app Message-ID: <98b75751-8cb9-46a1-93a5-fe70952b78ff@app.fastmail.com> Hello, My understanding is that this mailing list serves as the issue tracker for Wireguard, based on my reading of 
https://www.wireguard.com/#about-the-project. Please redirect me to the appropriate destination if I'm in the wrong place. The WireGuard Android app lets the user set an "Allowed IPs" list, so that only traffic destined for a subset of destinations is tunneled. Since excluding local/private IPs from tunneling is a common use case for this, there is a built-in checkbox in the Android app for excluding private IPs. The checkbox populates a default list of CIDR ranges to exclude. However, as far as I can tell, the value used by this checkbox is incorrect. It does not exclude the 127.0.0.0/8 address range, even though this is almost certainly not intended to be tunneled by a user. For example, consider a case where the user has forwarded the WireGuard port from a peer to Android over USB. Then the WireGuard Android app must be configured with a peer address of 127.0.0.1:51820. With the default checkbox settings, WireGuard will attempt to tunnel 127.0.0.1 to itself, and block all traffic, including DNS resolution. The "Exclude private IPs" option was implemented in the Android app originally in https://lists.zx2c4.com/pipermail/wireguard/2018-July/003106.html, where the current list was proposed. I found a website https://www.procustodibus.com/blog/2021/03/wireguard-allowedips-calculator/ which calculates correct "Allowed IPs" settings. It also provides a suggested default value for the "Allowed IPs" option.
Here is what is currently in the app: 0.0.0.0/5, 8.0.0.0/7, 11.0.0.0/8, 12.0.0.0/6, 16.0.0.0/4, 32.0.0.0/3, 64.0.0.0/2, 128.0.0.0/3, 160.0.0.0/5, 168.0.0.0/6, 172.0.0.0/12, 172.32.0.0/11, 172.64.0.0/10, 172.128.0.0/9, 173.0.0.0/8, 174.0.0.0/7, 176.0.0.0/4, 192.0.0.0/9, 192.128.0.0/11, 192.160.0.0/13, 192.169.0.0/16, 192.170.0.0/15, 192.172.0.0/14, 192.176.0.0/12, 192.192.0.0/10, 193.0.0.0/8, 194.0.0.0/7, 196.0.0.0/6, 200.0.0.0/5, 208.0.0.0/4, /32 Here is the proposed alternative: 1.0.0.0/8, 2.0.0.0/7, 4.0.0.0/6, 8.0.0.0/7, 11.0.0.0/8, 12.0.0.0/6, 16.0.0.0/4, 32.0.0.0/3, 64.0.0.0/3, 96.0.0.0/4, 112.0.0.0/5, 120.0.0.0/6, 124.0.0.0/7, 126.0.0.0/8, 128.0.0.0/3, 160.0.0.0/5, 168.0.0.0/8, 169.0.0.0/9, 169.128.0.0/10, 169.192.0.0/11, 169.224.0.0/12, 169.240.0.0/13, 169.248.0.0/14, 169.252.0.0/15, 169.255.0.0/16, 170.0.0.0/7, 172.0.0.0/12, 172.32.0.0/11, 172.64.0.0/10, 172.128.0.0/9, 173.0.0.0/8, 174.0.0.0/7, 176.0.0.0/4, 192.0.0.0/9, 192.128.0.0/11, 192.160.0.0/13, 192.169.0.0/16, 192.170.0.0/15, 192.172.0.0/14, 192.176.0.0/12, 192.192.0.0/10, 193.0.0.0/8, 194.0.0.0/7, 196.0.0.0/6, 200.0.0.0/5, 208.0.0.0/4, 224.0.0.0/4, ::/1, 8000::/2, c000::/3, e000::/4, f000::/5, f800::/6, fe00::/9, fec0::/10, ff00::/8 What do you think? Is this an appropriate change to make, so that users have a higher likelihood of the "Exclude private IPs" option doing what they expect? Thank you, Radon Rosborough PS. I am not subscribed to the development mailing list; so I would like to be copied on replies.