NVDIMM device FSDAX support

Description

Arm64 has supported NVDIMM device drivers since kernel 5.4. For PMDK, if we want an NVDIMM device to bypass the page cache, the device must run in FSDAX mode. NVDIMM, also known as persistent memory, usually supports four access modes:

| mode | description | device path | device type | label metadata | atomicity | filesystems | DAX | PFN metadata | former name |
|--------|----------------|-------------|-------------|----------------|-----------|-------------|-----|--------------|-------------|
| raw | raw | /dev/pmemN | block | no | no | yes | no | no | |
| sector | sector atomic | /dev/pmemNs | block | yes | yes | yes | no | no | |
| fsdax | filesystem DAX | /dev/pmemN | block | yes | no | yes | yes | yes | memory |
| devdax | device DAX | /dev/daxN.M | character | yes | no | no | yes | yes | dax |

On Arm64, the BBU-based NVDIMM-N device on bare metal should support this mode; the potential fix is in the ACPI table and the kernel nvdimm driver. This is a precondition for persistent memory on Arm64, as the page cache should be bypassed to improve performance. On the bare-metal test node, running ndctl to create the fsdax namespace fails with -17 (-EEXIST), as reported in the comments below.

Activity


Kevin Zhao October 20, 2021 at 6:49 AM
Edited

The previous error was caused by a memory access to a reserved region. The persistent-memory iomem info shows:

The persistent-memory memblock is reserved.

Now, running lsblk, we can see that /dev/pmem0 is there. Then run:

The error:

It failed, and the persistent memory also disappeared from lsblk. The iomem info at this point is:

Now run the command to change it back to raw mode:

Then we hit the same issue as in the previous comment, "Unable to handle kernel paging request at virtual address **", a kernel oops. This happens because the virtual address has already been mapped, and the kernel treats the region 5c00000000-5fffffffff as hot-pluggable memory. (Running "ndctl create-namespace --force --mode=fsdax" triggers the memory-hotplug process, which then fails at a check.) The failing call chain is:

The error occurs because this region was already reserved when reserve_regions() ran.

 

For a QEMU-based Arm64 VM with pmem, memory detection does not pick up the persistent-memory info before reserve_regions runs. After memory initialization finishes and the machine has booted, acpi_nfit_add is called to add the persistent memory. On the Kunpeng bare-metal machine with BBU, however, the memory is already detected when reserve_regions runs, and is then added a second time via the NFIT table. The first reserve_regions pass is nevertheless required, since this region is memory and must be marked NOMAP so it is not merged with other memblocks. See the memory dump info below:

Conventional memory appears with the "WB" flag.

Current workaround:

In drivers/firmware/efi/efi-init.c, in static __init void reserve_regions, add code as below:

Kevin Zhao October 13, 2021 at 8:59 AM

Hacking around the rc = -EEXIST, we can get the pmem mounted in fsdax mode.

But it comes with a lot of warnings, and running ndctl create-namespace --mode=raw induces the error:

[ 2636.399613] Unable to handle kernel paging request at virtual address ffff205c00000000
[ 2636.407502] Mem abort info:
[ 2636.410283] ESR = 0x96000005
[ 2636.413324] EC = 0x25: DABT (current EL), IL = 32 bits
[ 2636.418638] SET = 0, FnV = 0
[ 2636.421678] EA = 0, S1PTW = 0
[ 2636.424804] FSC = 0x05: level 1 translation fault
[ 2636.429662] Data abort info:
[ 2636.432530] ISV = 0, ISS = 0x00000005
[ 2636.436350] CM = 0, WnR = 0
[ 2636.439305] swapper pgtable: 4k pages, 48-bit VAs, pgdp=000000002290e000
[ 2636.445978] [ffff205c00000000] pgd=1800205be1f93003, p4d=1800205be1f93003, pud=0000000000000000
[ 2636.454638] Internal error: Oops: 96000005 [#1] SMP
[ 2636.459493] Modules linked in: nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set rfkill nf_tables nfnetlink vfat fat rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser ib_ipoib rdma_cm iw_cm libiscsi ib_cm scsi_transport_iscsi ipmi_ssif mlx5_ib ib_uverbs ib_core acpi_ipmi crct10dif_ce ghash_ce ipmi_si sha1_ce sbsa_gwdt ipmi_devintf dax_pmem_compat spi_dw_mmio ipmi_msghandler nd_pmem device_dax spi_dw nd_btt dax_pmem_core ip_tables xfs libcrc32c hibmc_drm drm_vram_helper drm_kms_helper syscopyarea sysfillrect mlx5_core sysimgblt fb_sys_fops drm_ttm_helper ttm drm hns3 hisi_sas_v3_hw hclge mpt3sas nfit hisi_sas_main mlxfw libnvdimm libsas tls sha2_ce nvme sha256_arm64 nvme_core encrypted_keys sg trusted psample raid_class hnae3 i2c_designware_platform scsi_transport_sas i2c_algo_bit i2c_designware_core asn1_encoder
[ 2636.459570] gpio_dwapb tee fuse
[ 2636.549203] CPU: 31 PID: 7664 Comm: ndctl Tainted: G B W 5.15.0-rc2+ #11
[ 2636.556996] Hardware name: Huawei TaiShan 200 (Model 2280)/BC82AMDDV, BIOS 1.70 01/07/2021
[ 2636.565220] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 2636.572150] pc : __memcpy+0x110/0x260
[ 2636.575798] lr : pmem_do_read+0xac/0x138 [nd_pmem]
[ 2636.580569] sp : ffff80002fe3b4d0
[ 2636.583869] x29: ffff80002fe3b4d0 x28: 0000000000001000 x27: ffff000000000000
[ 2636.590973] x26: fffffc008348e580 x25: 0000000000000000 x24: ffff800010f23128
[ 2636.598075] x23: ffff0020aaa10000 x22: 000000008348e580 x21: ffff205c00000000
[ 2636.605178] x20: 0000000000001000 x19: 0000000000001000 x18: ffffffffffffffff
[ 2636.612280] x17: 4e51455300696462 x16: 3d4d455453595342 x15: 0000000000000008
[ 2636.619382] x14: ffffdfc4d2396000 x13: 0000000000000028 x12: 0101010101010101
[ 2636.626485] x11: 7f7f7f7f7f7f7f7f x10: fefeff306c646c6f x9 : ffff800008c58b80
[ 2636.633587] x8 : ffff0020a5f13e00 x7 : ffff0020a5f148b8 x6 : 000000407935efd2
[ 2636.640690] x5 : ffff0020d2397000 x4 : ffff205c00001000 x3 : 0000000000000000
[ 2636.647792] x2 : 0000000000001000 x1 : ffff205c00000000 x0 : ffff0020d2396000
[ 2636.654895] Call trace:
[ 2636.657330] __memcpy+0x110/0x260
[ 2636.660631] pmem_submit_bio+0x180/0x1f0 [nd_pmem]
[ 2636.665400] submit_bio_noacct+0xdc/0x3f8
[ 2636.669396] submit_bio+0x54/0x150
[ 2636.672782] submit_bh_wbc+0x170/0x1e8
[ 2636.676517] block_read_full_page+0x310/0x430
[ 2636.680854] blkdev_readpage+0x20/0x28
[ 2636.684586] do_read_cache_page+0x2ec/0x3a8
[ 2636.688750] read_cache_page+0x18/0x20
[ 2636.692481] read_part_sector+0x48/0x1c0
[ 2636.696388] read_lba+0xa8/0x1b0
[ 2636.699601] efi_partition+0xac/0x698
[ 2636.703246] bdev_disk_changed+0x1d0/0x640
[ 2636.707325] blkdev_get_whole+0x98/0xa8
[ 2636.711145] blkdev_get_by_dev+0xc0/0x300
[ 2636.715136] device_add_disk+0x364/0x3b0
[ 2636.719040] pmem_attach_disk+0x4a4/0x568 [nd_pmem]
[ 2636.723895] nd_pmem_probe+0x84/0x150 [nd_pmem]
[ 2636.728404] nvdimm_bus_probe+0xa4/0x1f0 [libnvdimm]
[ 2636.733423] really_probe+0xc0/0x428
[ 2636.736983] __driver_probe_device+0x114/0x188
[ 2636.741406] device_driver_attach+0x34/0x68
[ 2636.745569] bind_store+0xa8/0x128
[ 2636.748955] drv_attr_store+0x28/0x38
[ 2636.752603] sysfs_kf_write+0x48/0x58
[ 2636.756251] kernfs_fop_write_iter+0x138/0x1c8
[ 2636.760674] new_sync_write+0x108/0x188
[ 2636.764495] vfs_write+0x1e0/0x2b8
[ 2636.767881] ksys_write+0x6c/0xf0
[ 2636.771183] __arm64_sys_write+0x20/0x28
[ 2636.775088] invoke_syscall.constprop.6+0x50/0xd8
[ 2636.779774] do_el0_svc+0xe8/0x148
[ 2636.783160] el0_svc+0x30/0x110
[ 2636.786290] el0t_64_sync_handler+0x88/0xb0
[ 2636.790453] el0t_64_sync+0x158/0x15c
[ 2636.794101] Code: cb01000e b4fffc2e eb0201df 540004a3 (a940342c)
[ 2636.800165] ---[ end trace 1a1d07353cb8c0b1 ]---
[ 2636.804761] Kernel panic - not syncing: Oops: Fatal exception
[ 2636.810480] SMP: stopping secondary CPUs
[ 2637.312984] Kernel Offset: 0x30000 from 0xffff800010000000
[ 2637.318444] PHYS_OFFSET: 0x0
[ 2637.321312] CPU features: 0x00000201,a3202c40
[ 2637.325648] Memory Limit: none
[ 2637.328687] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---

Kevin Zhao October 13, 2021 at 8:56 AM
Edited

On Kunpeng BM with BBU, we hit the same issue, with -17 (-EEXIST) reported as in the previous comment.

linux/mm/sparse.c

Kevin Zhao September 22, 2021 at 8:33 AM

The calling chain:

nd_pmem_probe(drivers/nvdimm/pmem.c)

pmem_attach_disk(drivers/nvdimm/pmem.c)

devm_memremap_pages(mm/memremap.c)

memremap_pages

pagemap_range

arch_add_memory

__add_pages

sparse_add_section

section_activate

bitmap_intersects → failed with -EEXIST (the section is already present)

Kevin Zhao September 22, 2021 at 8:26 AM

in mm/sparse.c:

The modprobe_pfn failure with error -17 comes from here. Hardcoding rc = 0, we can get a pmem device with DAX support.

Done

Created September 15, 2021 at 8:53 AM
Updated December 3, 2021 at 12:43 PM
Resolved October 25, 2021 at 1:18 AM