Category: linux
2023-04-18 19:41:02
A customer reported that Ubuntu 22.04 consumes more memory than Ubuntu 20.04.
free -lh

ubuntu 22.04 / 5.15.0:
              total        used        free      shared  buff/cache   available
Mem:          376Gi       5.0Gi       368Gi        10Mi       2.5Gi       368Gi
Low:          376Gi       7.5Gi       368Gi
High:            0B          0B          0B
Swap:            0B          0B          0B

ubuntu 20.04 / 5.4.0:
              total        used        free      shared  buff/cache   available
Mem:          376Gi       2.3Gi       371Gi        10Mi       3.0Gi       371Gi
Low:          376Gi       5.2Gi       371Gi
High:            0B          0B          0B
Swap:            0B          0B          0B
free reports 'used' memory of 2.3G on Ubuntu 20.04 but 4.9G on Ubuntu 22.04.
Why does 22.04 use about 2.6G more memory?
From the procps source, 'used' is calculated by reading and parsing /proc/meminfo:
# cat /proc/meminfo
MemTotal:       394594036 kB
MemFree:        389106200 kB
MemAvailable:   389952084 kB
Buffers:             4276 kB
Cached:           2817564 kB
SwapCached:             0 kB
SReclaimable:      281992 kB
The calculation subtracts the free, cached (including reclaimable slab), and buffer memory from the total:
kb_main_cached = kb_page_cache + kb_slab_reclaimable;
mem_used = kb_main_total - kb_main_free - kb_main_cached - kb_main_buffers;
Equivalently:
used = MemTotal - MemFree - Cached - SReclaimable - Buffers
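As a quick check, the same arithmetic can be reproduced straight from /proc/meminfo with a small awk sketch (field handling in other procps versions may differ slightly):

# Recompute free(1)'s "used" from /proc/meminfo (values in kB)
awk '/^MemTotal:/     {total   = $2}
     /^MemFree:/      {free    = $2}
     /^Cached:/       {cached  = $2}
     /^SReclaimable:/ {srecl   = $2}
     /^Buffers:/      {buffers = $2}
     END { printf "used: %d kB\n", total - free - cached - srecl - buffers }' /proc/meminfo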
At this point it is still unclear which program is occupying the used memory.
/proc/meminfo reports the system's memory usage; the meaning of each field can be found via the link below.
cat /proc/meminfo
MemTotal:       394594036 kB
MemFree:        389105524 kB
MemAvailable:   389951424 kB
Buffers:             4276 kB
Cached:           2817564 kB
SwapCached:             0 kB
Active:            687244 kB
Inactive:         2337940 kB
Active(anon):      219916 kB
Inactive(anon):      8600 kB
Active(file):      467328 kB
Inactive(file):   2329340 kB
Unevictable:        17612 kB
Mlocked:            17612 kB
SwapTotal:              0 kB
SwapFree:               0 kB
Dirty:                480 kB
Writeback:              0 kB
AnonPages:         221548 kB
Mapped:            243640 kB
Shmem:              10760 kB
KReclaimable:      282024 kB
Slab:              727528 kB
SReclaimable:      282024 kB
SUnreclaim:        445504 kB
KernelStack:        16432 kB
PageTables:          4552 kB
NFS_Unstable:           0 kB
Bounce:                 0 kB
WritebackTmp:           0 kB
CommitLimit:    197297016 kB
Committed_AS:     2100600 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      1302520 kB
VmallocChunk:           0 kB
Percpu:             61760 kB
HardwareCorrupted:      0 kB
AnonHugePages:          0 kB
ShmemHugePages:         0 kB
ShmemPmdMapped:         0 kB
FileHugePages:          0 kB
FilePmdMapped:          0 kB
CmaTotal:               0 kB
CmaFree:                0 kB
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB
Hugetlb:                0 kB
DirectMap4k:       517584 kB
DirectMap2M:      8556544 kB
DirectMap1G:    394264576 kB
System memory usage can be divided into two categories: user-space memory and kernel-space memory.
From the fields in /proc/meminfo we can roughly account for where the memory goes.
Memory consumed by user space:
(Cached + AnonPages + Buffers) + (HugePages_Total * Hugepagesize)
or
(Active + Inactive + Unevictable) + (HugePages_Total * Hugepagesize)
Mapped onto the 'free' output, the user-space consumption roughly corresponds to the buff/cache portion.
Memory consumed by the kernel:
Slab + VmallocUsed + PageTables + KernelStack + HardwareCorrupted + Bounce + X
· X is memory allocated directly via alloc_pages/__get_free_page; /proc/meminfo has no statistic for this class, so it is a memory black hole;
· vmalloc: detailed vmalloc information is recorded in /proc/vmallocinfo and can be totaled with the following command:
cat /proc/vmallocinfo | grep vmalloc | awk '{ total += $2 } END { printf "total vmalloc: %.2f MB\n", total/1024/1024 }'
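Both estimates can be pulled out of /proc/meminfo with a short awk sketch (the X black hole is, by definition, not counted, and VmallocUsed is only meaningful on kernels that actually report it):

awk '/^Cached:/            {u += $2}
     /^AnonPages:/         {u += $2}
     /^Buffers:/           {u += $2}
     /^HugePages_Total:/   {huge_nr = $2}
     /^Hugepagesize:/      {huge_sz = $2}
     /^Slab:/              {k += $2}
     /^VmallocUsed:/       {k += $2}
     /^PageTables:/        {k += $2}
     /^KernelStack:/       {k += $2}
     /^HardwareCorrupted:/ {k += $2}
     /^Bounce:/            {k += $2}
     END {
         printf "user-space : %.1f MB\n", (u + huge_nr * huge_sz) / 1024
         printf "kernel-side: %.1f MB (excluding the X black hole)\n", k / 1024
     }' /proc/meminfo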
The main difference between the two versions is in kernel-space memory, yet summing the known statistics shows no significant gap.
Suspicion therefore falls on the X portion, but that is a memory black hole and leaves no trace to follow.
Can MemAvailable provide a clue?
The answer is no. MemAvailable is derived from each zone's free pages plus some memory that can be reclaimed and reused.
It still cannot tell us who allocated the pages.
available = vm_zone_stat[NR_FREE_PAGES] + pagecache + reclaimable
For the implementation, see si_mem_available().
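The raw inputs can be inspected directly: nr_free_pages in /proc/vmstat is the per-zone free-page counter that si_mem_available() starts from (a small sketch, assuming 4 kB pages):

grep -E '^(MemFree|MemAvailable):' /proc/meminfo
# nr_free_pages is counted in pages; convert to kB assuming 4 kB pages
awk '/^nr_free_pages / { printf "nr_free_pages: %d pages = %d kB\n", $2, $2 * 4 }' /proc/vmstat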
/proc/zoneinfo reports page metrics per NUMA node and per memory zone.
Its content is very similar to that of /proc/meminfo; in fact, meminfo is computed from zoneinfo.
zoneinfo still does not solve the current problem, but it helps in understanding the relationship between memory, nodes, and zones.
spanned_pages is the total pages spanned by the zone;
present_pages is physical pages existing within the zone;
reserved_pages includes pages allocated by the bootmem allocator;
managed_pages is present pages managed by the buddy system;
present_pages = spanned_pages - absent_pages(pages in holes);
managed_pages = present_pages - reserved_pages;
The total usable memory of the system is the sum of the managed pages of all zones:
total_memory = node0_zone_dma[managed] + node0_zone_dma32[managed] + node0_zone_normal[managed] + node1_zone_normal[managed]
Node 0, zone      DMA
  per-node stats
      nr_inactive_anon 2117
      nr_active_anon 38986
      nr_inactive_file 545121
      nr_active_file 98412
      nr_unevictable 3141
      nr_slab_reclaimable 59505
      nr_slab_unreclaimable 84004
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault 0
      workingset_activate 0
      workingset_restore 0
      workingset_nodereclaim 0
      nr_anon_pages 39335
      nr_mapped    55358
      nr_file_pages 648505
      nr_dirty     136
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     2630
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   663519
      nr_written   576275
      nr_kernel_misc_reclaimable 0
  pages free     3840
        min      0
        low      3
        high     6
        spanned  4095
        present  3993
        managed  3840
        protection: (0, 1325, 191764, 191764, 191764)
Node 0, zone    DMA32
  pages free     347116
        min      56
        low      395
        high     734
        spanned  1044480
        present  429428
        managed  347508
        protection: (0, 0, 190439, 190439, 190439)
Node 0, zone   Normal
  pages free     47545626
        min      8097
        low      56849
        high     105601
        spanned  49545216
        present  49545216
        managed  48754582
        protection: (0, 0, 0, 0, 0)
      nr_free_pages 47545626
      nr_zone_inactive_anon 2117
      nr_zone_active_anon 38986
      nr_zone_inactive_file 545121
      nr_zone_active_file 98412
      nr_zone_unevictable 3141
      nr_zone_write_pending 34
      nr_mlock     3141
      nr_page_table_pages 872
      nr_kernel_stack 10056
      nr_bounce    0
      nr_zspages   0
      nr_free_cma  0
      numa_hit     2365759
      numa_miss    0
      numa_foreign 0
      numa_interleave 43664
      numa_local   2365148
      numa_other   611
Node 1, zone   Normal
  per-node stats
      nr_inactive_anon 33
      nr_active_anon 16139
      nr_inactive_file 37211
      nr_active_file 18441
      nr_unevictable 1262
      nr_slab_reclaimable 11198
      nr_slab_unreclaimable 27613
      nr_isolated_anon 0
      nr_isolated_file 0
      workingset_nodes 0
      workingset_refault 0
      workingset_activate 0
      workingset_restore 0
      workingset_nodereclaim 0
      nr_anon_pages 16213
      nr_mapped    5952
      nr_file_pages 56974
      nr_dirty     0
      nr_writeback 0
      nr_writeback_temp 0
      nr_shmem     60
      nr_shmem_hugepages 0
      nr_shmem_pmdmapped 0
      nr_file_hugepages 0
      nr_file_pmdmapped 0
      nr_anon_transparent_hugepages 0
      nr_unstable  0
      nr_vmscan_write 0
      nr_vmscan_immediate_reclaim 0
      nr_dirtied   59535
      nr_written   37528
      nr_kernel_misc_reclaimable 0
  pages free     49379629
        min      8229
        low      57771
        high     107313
        spanned  50331648
        present  50331648
        managed  49542579
        protection: (0, 0, 0, 0, 0)
At what stage is memory reserved?
During the boot phase, the regular memory management has not been enabled yet, so physical memory allocation is handled by a special allocator, bootmem, whose lifetime starts at setup_arch() and ends at mem_init().
Memory reservation is completed during boot, and the details can be seen in the dmesg log:
[    2.938694] Memory: 394552504K/401241140K available (14339K kernel code, 2390K rwdata, 8352K rodata, 2728K init, 4988K bss, 6688636K reserved, 0K cma-reserved)
401241140K equals the sum of present_pages over all zones in zoneinfo;
394552504K roughly corresponds to the sum of managed_pages over all zones in zoneinfo.
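Both sums can be checked with a one-liner over /proc/zoneinfo (a sketch, again assuming 4 kB pages):

# Sum present and managed pages over all zones and convert to kB
awk '/^ +present / {p += $2} /^ +managed / {m += $2}
     END { printf "present: %d kB, managed: %d kB\n", p * 4, m * 4 }' /proc/zoneinfo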
The memory involved in this issue is allocated from the buddy system, which rules out the possibility that it was reserved.
So far, the known system information does not help locate the problem.
Reading the page allocation path, we find that enabling CONFIG_PAGE_OWNER records the page owner, i.e. the stack at the moment pages are allocated. This makes it possible to trace back to the context of each allocation.
__alloc_pages
  -> get_page_from_freelist
    -> prep_new_page
      -> post_alloc_hook
        -> set_page_owner
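Note that page_owner only records anything when the kernel was built with CONFIG_PAGE_OWNER=y and booted with the page_owner=on parameter, and the page_owner_sort helper has to be built from the kernel source tree (tools/vm in 5.15); roughly:

# Check that the kernel supports page_owner and that it was switched on at boot
grep CONFIG_PAGE_OWNER /boot/config-$(uname -r)
grep -o 'page_owner=on' /proc/cmdline

# Build the sorting helper that ships with the kernel source (path valid for 5.15)
# cd <kernel-source>/tools/vm && make page_owner_sort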
Reference:
Dump the page owner information at the current moment:
cat /sys/kernel/debug/page_owner > page_owner_full.txt
./page_owner_sort page_owner_full.txt sorted_page_owner.txt
This produces output like the following:
253 times:
Page allocated via order 0, mask 0x2a22(GFP_ATOMIC|__GFP_HIGHMEM|__GFP_NOWARN), pid 1, ts 4292808510 ns, free_ts 0 ns
 prep_new_page+0xa6/0xe0
 get_page_from_freelist+0x2f8/0x450
 __alloc_pages+0x178/0x330
 alloc_page_interleave+0x19/0x90
 alloc_pages+0xef/0x110
 __vmalloc_area_node.constprop.0+0x105/0x280
 __vmalloc_node_range+0x74/0xe0
 __vmalloc_node+0x4e/0x70
 __vmalloc+0x1e/0x20
 alloc_large_system_hash+0x264/0x356
 futex_init+0x87/0x131
 do_one_initcall+0x46/0x1d0
 kernel_init_freeable+0x289/0x2f2
 kernel_init+0x1b/0x150
 ret_from_fork+0x1f/0x30
……
1 times:
Page allocated via order 9, mask 0xcc0(GFP_KERNEL), pid 707, ts 9593710865 ns, free_ts 0 ns
 prep_new_page+0xa6/0xe0
 get_page_from_freelist+0x2f8/0x450
 __alloc_pages+0x178/0x330
 __dma_direct_alloc_pages+0x8e/0x120
 dma_direct_alloc+0x66/0x2b0
 dma_alloc_attrs+0x3e/0x50
 irdma_puda_qp_create.constprop.0+0x76/0x4e0 [irdma]
 irdma_puda_create_rsrc+0x26d/0x560 [irdma]
 irdma_initialize_ieq+0xae/0xe0 [irdma]
 irdma_rt_init_hw+0x2a3/0x580 [irdma]
 i40iw_open+0x1c3/0x320 [irdma]
 i40e_client_subtask+0xc3/0x140 [i40e]
 i40e_service_task+0x2af/0x680 [i40e]
 process_one_work+0x228/0x3d0
 worker_thread+0x4d/0x3f0
 kthread+0x127/0x150
This makes it easy to see how many times pages were allocated from the same context, along with the page order, pid and other details.
Aggregating page allocations with order as the key, the 2^9-page (512 * 4 kB) allocations on 5.15.0 stand out as abnormal; looking at the corresponding stacks, almost all of them are related to irdma.
5.15.0:
order:  0, times: 2107310, memory: 8429240 kB, 8231 MB
order:  1, times:    8110, memory:   64880 kB,   63 MB
order:  2, times:    1515, memory:   24240 kB,   23 MB
order:  3, times:    9671, memory:  309472 kB,  302 MB
order:  4, times:     101, memory:    6464 kB,    6 MB
order:  5, times:      33, memory:    4224 kB,    4 MB
order:  6, times:       5, memory:    1280 kB,    1 MB
order:  7, times:       9, memory:    4608 kB,    4 MB
order:  8, times:       3, memory:    3072 kB,    3 MB
order:  9, times:    1426, memory: 2920448 kB, 2852 MB
order: 10, times:       3, memory:   12288 kB,   12 MB
all memory: 11780216 kB, 11 GB

5.4.0:
order:  0, times: 1218829, memory: 4875316 kB, 4761 MB
order:  1, times:   12370, memory:   98960 kB,   96 MB
order:  2, times:    1825, memory:   29200 kB,   28 MB
order:  3, times:    6834, memory:  218688 kB,  213 MB
order:  4, times:     110, memory:    7040 kB,    6 MB
order:  5, times:      17, memory:    2176 kB,    2 MB
order:  6, times:       0, memory:       0 kB,    0 MB
order:  7, times:       2, memory:    1024 kB,    1 MB
order:  8, times:       0, memory:       0 kB,    0 MB
order:  9, times:       0, memory:       0 kB,    0 MB
order: 10, times:       0, memory:       0 kB,    0 MB
all memory: 5232404 kB, 4 GB
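A per-order table like the one above can be produced with an awk sketch over the raw dump; this assumes each "Page allocated via order N" record corresponds to one allocation and that pages are 4 kB:

awk '/^Page allocated via order/ {
         order = $5 + 0              # the fifth field is "N,"; +0 strips the comma
         count[order]++
     }
     END {
         for (o = 0; o <= 10; o++) {
             kb = count[o] * (2 ^ o) * 4
             total += kb
             printf "order: %2d, times: %8d, memory: %10d kB\n", o, count[o], kb
         }
         printf "all memory: %d kB\n", total
     }' page_owner_full.txt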
As far as we know, no current workload uses RDMA, so irdma.ko does not need to be loaded at system initialization. Searching the OS for the keyword turns up no obvious instruction that loads irdma.
We only find that the irdma module is declared as the target of several aliases:
alias i40iw irdma
alias auxiliary:ice.roce irdma
alias auxiliary:ice.iwarp irdma
alias auxiliary:i40e.iwarp irdma
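These aliases can be listed from the module metadata, for example (a quick sketch; output trimmed):

# Aliases declared by the irdma module
modinfo -F alias irdma

# Or search the alias database used by modprobe
grep -w irdma /lib/modules/$(uname -r)/modules.alias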
So how does irdma get loaded automatically?
It turns out that when the i40e NIC driver is loaded manually, irdma is loaded along with it.
# modprobe -v i40e
insmod /lib/modules/5.15.0-26-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko
Which program brings irdma up?
Tracepoints can be used to monitor the kernel's module events:
# cd /sys/kernel/debug/tracing
# tree -d events/module/
events/module/
├── module_free
├── module_get
├── module_load
├── module_put
└── module_request

5 directories

# echo 1 > tracing_on
# echo 1 > ./events/module/enable
# modprobe i40e
# cat trace
......
modprobe-11202      [068] ..... 24358.270849: module_put: i40e call_site=do_init_module refcnt=1
systemd-udevd-11444 [035] ..... 24358.275116: module_get: ib_uverbs call_site=resolve_symbol refcnt=2
systemd-udevd-11444 [035] ..... 24358.275130: module_get: ib_core call_site=resolve_symbol refcnt=3
systemd-udevd-11444 [035] ..... 24358.275185: module_get: ice call_site=resolve_symbol refcnt=2
systemd-udevd-11444 [035] ..... 24358.275247: module_get: i40e call_site=resolve_symbol refcnt=2
systemd-udevd-11444 [009] ..... 24358.295650: module_load: irdma
systemd-udevd-11444 [009] .n... 24358.295730: module_put: irdma call_site=do_init_module refcnt=1
......
The trace log shows that the systemd-udevd service loaded irdma. Running it in debug mode shows systemd-udevd's detailed logs:
/lib/systemd/systemd-udevd --debug
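Alternatively, the log level of the already-running daemon can be raised without restarting it (a sketch; the exact messages vary):

# Turn on udev debug logging, reproduce the module load, then inspect the journal
udevadm control --log-priority=debug
modprobe -r irdma i40e && modprobe i40e
journalctl -u systemd-udevd -b --no-pager | grep -i irdma
udevadm control --log-priority=info      # restore the default level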
Why does systemd-udevd load irdma?
To answer that we need to go back into the kernel and look at the i40e probe flow, which involves a concept called the auxiliary bus.
When the i40e driver is loaded, it creates an auxiliary device named i40e.iwarp and sends a uevent to notify user space.
When systemd-udevd receives the event, it loads the module corresponding to i40e.iwarp; since i40e.iwarp is an alias of irdma, the module actually loaded is irdma.
i40e_probe
  -> i40e_lan_add_device()
    -> i40e_client_add_instance(pf)
      -> i40e_register_auxiliary_dev(&cdev->lan_info, "iwarp")
        -> auxiliary_device_add(aux_dev)
          -> dev_set_name(dev, "%s.%s.%d", modname, auxdev->name, auxdev->id);   // i40e.iwarp.0, i40e.iwarp.1
          -> device_add(dev)
            -> kobject_uevent(&dev->kobj, KOBJ_ADD)
              -> kobject_uevent_env(kobj, action, NULL);   // send a uevent with environmental data
            -> bus_probe_device(dev)
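The chain can be confirmed from user space by watching the uevent and resolving the alias by hand (a sketch; device ids such as i40e.iwarp.0 depend on the machine):

# Watch kernel uevents while loading the NIC driver; the auxiliary device's
# ADD event should carry MODALIAS=auxiliary:i40e.iwarp
udevadm monitor --kernel --property &
modprobe i40e

# Ask modprobe which module that alias expands to (expected: irdma)
modprobe --resolve-alias auxiliary:i40e.iwarp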
How can irdma be prevented from loading automatically?
Add a line to /etc/modprobe.d/blacklist.conf so that irdma is no longer loaded through alias expansion:
# This file lists those modules which we don't want to be loaded by
# alias expansion, usually so some other driver will be loaded for the
# device instead.

blacklist irdma
This does not affect loading irdma manually with modprobe.
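After adding the entry, reloading the NIC driver should no longer pull irdma in (a quick check; removing the modules assumes nothing is currently using them):

modprobe -r irdma i40e
modprobe i40e
lsmod | grep irdma || echo "irdma not loaded"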
To summarize: the NIC used on this server is an Intel X722, which supports iWARP. When the NIC driver is loaded, the i40e driver registers an auxiliary iwarp device and sends a uevent to user space.
After the systemd-udevd service receives the uevent, it loads irdma.ko; during initialization irdma sets up DMA mappings and allocates large chunks of memory through __alloc_pages().
Intel merged a patch into the 5.14 kernel that replaced the iWARP driver with irdma and removed the original i40iw driver.
The 5.4.0 kernel does not load the i40iw driver automatically; when i40iw is loaded manually there, it consumes roughly 3 GB of memory as well.