
Category: linux

2023-04-18 19:41:02

Case analysis: Ubuntu 22.04 consumes more memory than Ubuntu 20.04

A customer reported that Ubuntu 22.04 consumes more memory than Ubuntu 20.04.

 

free -lh

Ubuntu 22.04 / 5.15.0

               total        used        free      shared  buff/cache   available
Mem:           376Gi       5.0Gi       368Gi        10Mi       2.5Gi       368Gi
Low:           376Gi       7.5Gi       368Gi
High:             0B          0B          0B
Swap:             0B          0B          0B

Ubuntu 20.04 / 5.4.0

               total        used        free      shared  buff/cache   available
Mem:           376Gi       2.3Gi       371Gi        10Mi       3.0Gi       371Gi
Low:           376Gi       5.2Gi       371Gi
High:             0B          0B          0B
Swap:             0B          0B          0B

free shows 'used' memory of 2.3G on Ubuntu 20.04 and about 4.9G (displayed as 5.0Gi) on Ubuntu 22.04.

Why does 22.04 use roughly 2.6G more memory?

Phase 1: how is 'used' calculated?

From the procps source, 'used' is computed by reading and parsing the contents of /proc/meminfo:

# cat /proc/meminfo
MemTotal:       394594036 kB
MemFree:        389106200 kB
MemAvailable:   389952084 kB
Buffers:            4276 kB
Cached:          2817564 kB
SwapCached:            0 kB
...
SReclaimable:     281992 kB
...

 

 

The calculation subtracts free, cached, and buffers from total, where the cached value also includes the reclaimable slab:

kb_main_cached = kb_page_cache + kb_slab_reclaimable;

mem_used = kb_main_total - kb_main_free - kb_main_cached - kb_main_buffers;

 

Equivalently:

used = MemTotal - MemFree - Cached - SReclaimable - Buffers
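
As a quick cross-check, the same value can be recomputed straight from /proc/meminfo. The one-liner below is a minimal sketch that mirrors the procps formula above (field names as they appear in /proc/meminfo):

awk '/^MemTotal:/{t=$2} /^MemFree:/{f=$2} /^Cached:/{c=$2} /^SReclaimable:/{s=$2} /^Buffers:/{b=$2} END{printf "used: %.1f GiB\n", (t-f-c-s-b)/1024/1024}' /proc/meminfo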

 

At this point there is still no way to tell which program is occupying the 'used' memory.

Phase 2: /proc/meminfo

'/proc/meminfo' reports how the system's memory is used; the meaning of each field can be looked up in the kernel documentation.

 

cat /proc/meminfo
MemTotal:       394594036 kB
MemFree:        389105524 kB
MemAvailable:   389951424 kB
Buffers:            4276 kB
Cached:          2817564 kB
SwapCached:            0 kB
Active:           687244 kB
Inactive:        2337940 kB
Active(anon):     219916 kB
Inactive(anon):     8600 kB
Active(file):     467328 kB
Inactive(file):  2329340 kB
Unevictable:       17612 kB
Mlocked:           17612 kB
SwapTotal:             0 kB
SwapFree:              0 kB
Dirty:               480 kB
Writeback:             0 kB
AnonPages:        221548 kB
Mapped:           243640 kB
Shmem:             10760 kB
KReclaimable:     282024 kB
Slab:             727528 kB
SReclaimable:     282024 kB
SUnreclaim:       445504 kB
KernelStack:       16432 kB
PageTables:         4552 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    197297016 kB
Committed_AS:    2100600 kB
VmallocTotal:   34359738367 kB
VmallocUsed:     1302520 kB
VmallocChunk:          0 kB
Percpu:            61760 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      517584 kB
DirectMap2M:     8556544 kB
DirectMap1G:    394264576 kB

 

Memory used by the system can be divided into two categories: user-space memory and kernel-space memory.

 

From the fields in /proc/meminfo, a rough breakdown of the allocated memory can be made:

 

User-space memory consumption:

(Cached + AnonPages + Buffers) + (HugePages_Total * Hugepagesize)

or

(Active + Inactive + Unevictable) + (HugePages_Total * Hugepagesize)

Mapped onto the 'free' output, this user-space consumption roughly corresponds to the buff/cache column.

 

Kernel-space memory consumption:

Slab + VmallocUsed + PageTables + KernelStack + HardwareCorrupted + Bounce + X

·      X is memory allocated directly via alloc_pages()/__get_free_page(); /proc/meminfo has no counter for this kind of allocation, so it is a memory black hole.

·      vmalloc: detailed vmalloc allocations are recorded in /proc/vmallocinfo and can be summed with the command below:

cat /proc/vmallocinfo | grep vmalloc | awk '{ total += $2 } END { printf "total vmalloc: %.02f MB\n", total/1024/1024 }'

 

The main difference between the two versions is in kernel-space consumption; however, summing the known counters shows no large difference between them.
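
For reference, the known kernel-side counters can be summed on both machines with a sketch like the one below (field names assumed from /proc/meminfo; the X part is still not counted):

awk '/^(Slab|VmallocUsed|PageTables|KernelStack|HardwareCorrupted|Bounce):/{sum+=$2} END{printf "known kernel-side consumption: %.1f MB\n", sum/1024}' /proc/meminfo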


Suspicion therefore falls on the X part, but X is a memory black hole and leaves no trace to follow.



Could MemAvailable provide a clue?

 

No. MemAvailable is derived from each zone's free pages plus some memory that can be reclaimed and reused, so it likewise cannot tell us who allocated the pages.

 

available = vm_zone_stat[NR_FREE_PAGES] + pagecache + reclaimable

 

For the implementation, see si_mem_available().

Phase 3: /proc/zoneinfo

This file reports per-NUMA-node / per-memory-zone page metrics.

Its content is very similar to /proc/meminfo; in fact meminfo is computed from zoneinfo.

The zoneinfo data still does not solve the current problem.

But this file helps in understanding the relationship between memory nodes and zones.

spanned_pages is the total pages spanned by the zone;

present_pages is physical pages existing within the zone;

reserved_pages includes pages allocated by the bootmem allocator;

managed_pages is present pages managed by the buddy system;

 

present_pages = spanned_pages - absent_pages(pages in holes);

managed_pages = present_pages - reserved_pages;

 

The total usable memory of the system is the sum of managed pages over all zones:

total_memory = node0_zone_dma[managed] + node0_zone_dma32[managed] + node0_zone_normal[managed] + node1_zone_normal[managed]
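
This sum can be computed directly from /proc/zoneinfo; the line below is a sketch that assumes a 4 KiB page size:

awk '/managed/{sum+=$2} END{printf "total managed: %.1f GiB\n", sum*4/1024/1024}' /proc/zoneinfo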

 

node 0, zone      dma

  per-node stats

      nr_inactive_anon 2117

      nr_active_anon 38986

      nr_inactive_file 545121

      nr_active_file 98412

      nr_unevictable 3141

      nr_slab_reclaimable 59505

      nr_slab_unreclaimable 84004

      nr_isolated_anon 0

      nr_isolated_file 0

      workingset_nodes 0

      workingset_refault 0

      workingset_activate 0

      workingset_restore 0

      workingset_nodereclaim 0

      nr_anon_pages 39335

      nr_mapped    55358

      nr_file_pages 648505

      nr_dirty     136

      nr_writeback 0

      nr_writeback_temp 0

      nr_shmem     2630

      nr_shmem_hugepages 0

      nr_shmem_pmdmapped 0

      nr_file_hugepages 0

      nr_file_pmdmapped 0

      nr_anon_transparent_hugepages 0

      nr_unstable  0

      nr_vmscan_write 0

      nr_vmscan_immediate_reclaim 0

      nr_dirtied   663519

      nr_written   576275

      nr_kernel_misc_reclaimable 0

  pages free     3840

        min      0

        low      3

        high     6

        spanned  4095

        present  3993

        managed  3840

        protection: (0, 1325, 191764, 191764, 191764)

node 0, zone    dma32

  pages free     347116

        min      56

        low      395

        high     734

        spanned  1044480

        present  429428

        managed  347508

        protection: (0, 0, 190439, 190439, 190439)

 

 

node 0, zone   normal

  pages free     47545626

        min      8097

        low      56849

        high     105601

        spanned  49545216

        present  49545216

        managed  48754582

        protection: (0, 0, 0, 0, 0)

      nr_free_pages 47545626

      nr_zone_inactive_anon 2117

      nr_zone_active_anon 38986

      nr_zone_inactive_file 545121

      nr_zone_active_file 98412

      nr_zone_unevictable 3141

      nr_zone_write_pending 34

      nr_mlock     3141

      nr_page_table_pages 872

      nr_kernel_stack 10056

      nr_bounce    0

      nr_zspages   0

      nr_free_cma  0

      numa_hit     2365759

      numa_miss    0

      numa_foreign 0

      numa_interleave 43664

      numa_local   2365148

      numa_other   611

node 1, zone   normal

  per-node stats

      nr_inactive_anon 33

      nr_active_anon 16139

      nr_inactive_file 37211

      nr_active_file 18441

      nr_unevictable 1262

      nr_slab_reclaimable 11198

      nr_slab_unreclaimable 27613

      nr_isolated_anon 0

      nr_isolated_file 0

      workingset_nodes 0

      workingset_refault 0

      workingset_activate 0

      workingset_restore 0

      workingset_nodereclaim 0

      nr_anon_pages 16213

      nr_mapped    5952

      nr_file_pages 56974

      nr_dirty     0

      nr_writeback 0

      nr_writeback_temp 0

      nr_shmem     60

      nr_shmem_hugepages 0

      nr_shmem_pmdmapped 0

      nr_file_hugepages 0

      nr_file_pmdmapped 0

      nr_anon_transparent_hugepages 0

      nr_unstable  0

      nr_vmscan_write 0

      nr_vmscan_immediate_reclaim 0

      nr_dirtied   59535

      nr_written   37528

      nr_kernel_misc_reclaimable 0

  pages free     49379629

        min      8229

        low      57771

        high     107313

        spanned  50331648

        present  50331648

        managed  49542579

        protection: (0, 0, 0, 0, 0)

 

Phase 4: could it be reserved memory?

 

At what stage is memory reserved?

 

During the boot phase of system initialization, before the regular memory-management subsystem is enabled, physical memory allocation is handled by a special allocator, bootmem. Its lifetime starts in setup_arch() and ends in mem_init().

 

Memory reservation is completed during boot, and the details can be seen in the dmesg log:

[    2.938694] Memory: 394552504K/401241140K available (14339K kernel code, 2390K rwdata, 8352K rodata, 2728K init, 4988K bss, 6688636K reserved, 0K cma-reserved)

 

401241140K equals the sum of present_pages over all zones in zoneinfo;

394552504K roughly corresponds to the sum of managed_pages over all zones in zoneinfo.
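
Both figures can be compared against /proc/zoneinfo with a quick sketch (4 KiB pages assumed); the managed sum comes out close to, but not exactly equal to, the "available" figure printed by dmesg:

awk '/present/{p+=$2} /managed/{m+=$2} END{printf "present: %d K, managed: %d K\n", p*4, m*4}' /proc/zoneinfo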

 

The memory involved in this issue is allocated from the buddy system, so reserved memory can be ruled out.

 

Phase 5: CONFIG_PAGE_OWNER

The system information gathered so far does not help locate the problem.

Reading the page-allocation code, we find that enabling CONFIG_PAGE_OWNER records the owner of each page, including the stack trace at allocation time, which makes it possible to trace back to the context in which the memory was allocated.
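
Note that page_owner only records data when the kernel is built with CONFIG_PAGE_OWNER=y and booted with page_owner=on on the kernel command line; a quick check might look like this (the config file path assumes a typical Ubuntu layout):

grep CONFIG_PAGE_OWNER /boot/config-$(uname -r)     # expect CONFIG_PAGE_OWNER=y
grep -o page_owner=on /proc/cmdline || echo "add page_owner=on to the kernel command line and reboot"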

__alloc_pages

->get_page_from_freelist

            ->prep_new_page

                        ->post_alloc_hook

                                    ->set_page_owner



 

Dump the page owner information at the current moment:

cat /sys/kernel/debug/page_owner > page_owner_full.txt

./page_owner_sort page_owner_full.txt sorted_page_owner.txt

 

This produces output like the following:

253 times:
page allocated via order 0, mask 0x2a22(GFP_ATOMIC|__GFP_HIGHMEM|__GFP_NOWARN), pid 1, ts 4292808510 ns, free_ts 0 ns
 prep_new_page+0xa6/0xe0
 get_page_from_freelist+0x2f8/0x450
 __alloc_pages+0x178/0x330
 alloc_page_interleave+0x19/0x90
 alloc_pages+0xef/0x110
 __vmalloc_area_node.constprop.0+0x105/0x280
 __vmalloc_node_range+0x74/0xe0
 __vmalloc_node+0x4e/0x70
 __vmalloc+0x1e/0x20
 alloc_large_system_hash+0x264/0x356
 futex_init+0x87/0x131
 do_one_initcall+0x46/0x1d0
 kernel_init_freeable+0x289/0x2f2
 kernel_init+0x1b/0x150
 ret_from_fork+0x1f/0x30

 

……

 

1 times:
page allocated via order 9, mask 0xcc0(GFP_KERNEL), pid 707, ts 9593710865 ns, free_ts 0 ns
 prep_new_page+0xa6/0xe0
 get_page_from_freelist+0x2f8/0x450
 __alloc_pages+0x178/0x330
 __dma_direct_alloc_pages+0x8e/0x120
 dma_direct_alloc+0x66/0x2b0
 dma_alloc_attrs+0x3e/0x50
 irdma_puda_qp_create.constprop.0+0x76/0x4e0 [irdma]
 irdma_puda_create_rsrc+0x26d/0x560 [irdma]
 irdma_initialize_ieq+0xae/0xe0 [irdma]
 irdma_rt_init_hw+0x2a3/0x580 [irdma]
 i40iw_open+0x1c3/0x320 [irdma]
 i40e_client_subtask+0xc3/0x140 [i40e]
 i40e_service_task+0x2af/0x680 [i40e]
 process_one_work+0x228/0x3d0
 worker_thread+0x4d/0x3f0
 kthread+0x127/0x150

It is easy to see how many pages were allocated from the same call context, together with the page order, pid, and other information.

 

Aggregating the page allocations by order gives the statistics below (a sketch of the aggregation follows the table).

On 5.15.0 the order-9 allocations (2^9 pages = 512 * 4K = 2 MB each) stand out as clearly anomalous; the corresponding stacks are almost all related to irdma.

5.15.0:

order: 0, times: 2107310, memory: 8429240 kB (8231 MB)
order: 1, times: 8110, memory: 64880 kB (63 MB)
order: 2, times: 1515, memory: 24240 kB (23 MB)
order: 3, times: 9671, memory: 309472 kB (302 MB)
order: 4, times: 101, memory: 6464 kB (6 MB)
order: 5, times: 33, memory: 4224 kB (4 MB)
order: 6, times: 5, memory: 1280 kB (1 MB)
order: 7, times: 9, memory: 4608 kB (4 MB)
order: 8, times: 3, memory: 3072 kB (3 MB)
order: 9, times: 1426, memory: 2920448 kB (2852 MB)
order: 10, times: 3, memory: 12288 kB (12 MB)
all memory: 11780216 kB (~11 GB)

5.4.0:

order: 0, times: 1218829, memory: 4875316 kB (4761 MB)
order: 1, times: 12370, memory: 98960 kB (96 MB)
order: 2, times: 1825, memory: 29200 kB (28 MB)
order: 3, times: 6834, memory: 218688 kB (213 MB)
order: 4, times: 110, memory: 7040 kB (6 MB)
order: 5, times: 17, memory: 2176 kB (2 MB)
order: 6, times: 0, memory: 0 kB (0 MB)
order: 7, times: 2, memory: 1024 kB (1 MB)
order: 8, times: 0, memory: 0 kB (0 MB)
order: 9, times: 0, memory: 0 kB (0 MB)
order: 10, times: 0, memory: 0 kB (0 MB)
all memory: 5232404 kB (~4 GB)
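
The per-order statistics above come from post-processing the raw dump; the actual script is not shown here, but an awk sketch along these lines (assuming 4 KiB base pages and counting one "page allocated via order N" record per allocation) produces the same kind of table:

awk -F'[ ,]' '/page allocated via order/{cnt[$5]++} END{for (o=0; o<=10; o++) printf "order: %d, times: %d, memory: %d kB\n", o, cnt[o], cnt[o]*4*2^o}' page_owner_full.txt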

As far as we know, no current workload uses the RDMA functionality, so irdma.ko does not need to be loaded at system initialization. Searching the OS for the keyword turned up no obvious command that loads irdma.

 

Only a few aliases defined for the irdma module were found (see the query commands after the list):

alias i40iw irdma

alias auxiliary:ice.roce irdma

alias auxiliary:ice.iwarp irdma

alias auxiliary:i40e.iwarp irdma
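
These aliases can be queried from the installed module metadata, for example:

modinfo irdma | grep alias
grep ' irdma$' /lib/modules/$(uname -r)/modules.alias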

 


How does irdma get loaded automatically?

It turns out that manually loading the i40e NIC driver also pulls in irdma.

# modprobe -v i40e

insmod /lib/modules/5.15.0-26-generic/kernel/drivers/net/ethernet/intel/i40e/i40e.ko

Which program brings irdma up?

 

Use tracepoints to observe the kernel's module activity:

# cd /sys/kernel/debug/tracing

# tree -d events/module/

events/module/

├── module_free

├── module_get

├── module_load

├── module_put

└── module_request

 

5 directories

 

#echo 1 > tracing_on

#echo 1 > ./events/module/enable

 

#modprobe i40e

 

# cat trace

……

   modprobe-11202   [068] ..... 24358.270849: module_put: i40e call_site=do_init_module refcnt=1

   systemd-udevd-11444   [035] ..... 24358.275116: module_get: ib_uverbs call_site=resolve_symbol refcnt=2

   systemd-udevd-11444   [035] ..... 24358.275130: module_get: ib_core call_site=resolve_symbol refcnt=3

   systemd-udevd-11444   [035] ..... 24358.275185: module_get: ice call_site=resolve_symbol refcnt=2

   systemd-udevd-11444   [035] ..... 24358.275247: module_get: i40e call_site=resolve_symbol refcnt=2

   systemd-udevd-11444   [009] ..... 24358.295650: module_load: irdma

   systemd-udevd-11444   [009] .n... 24358.295730: module_put: irdma call_site=do_init_module refcnt=1

……


The trace log shows that the systemd-udevd service loaded irdma. Running systemd-udevd in debug mode shows its detailed logs:

/lib/systemd/systemd-udevd --debug

 

Why does systemd-udevd load irdma?

To answer that, we have to go back to the kernel and look at the i40e probe flow.

 

This involves a concept: the auxiliary bus.

 

Loading the i40e driver creates an auxiliary device, i40e.iwarp, and sends a uevent to notify user space.

When systemd-udevd receives the uevent, it loads the module that corresponds to i40e.iwarp; since i40e.iwarp is an alias of irdma, the module actually loaded is irdma (this can be verified from user space, as shown after the call chain below).

i40e_probe

->i40e_lan_add_device()

    ->i40e_client_add_instance(pf);

        ->i40e_register_auxiliary_dev(&cdev->lan_info, "iwarp")

            ->auxiliary_device_add(aux_dev);

                ->dev_set_name(dev, "%s.%s.%d", modname, auxdev->name, auxdev->id);    //i40e.iwarp.0, i40e.iwarp.1

                ->device_add(dev);

                    ->kobject_uevent(&dev->kobj, KOBJ_ADD);

                        ->kobject_uevent_env(kobj, action, NULL); // send an uevent with environmental data

                    ->bus_probe_device(dev);
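
The same resolution can be checked from user space: the auxiliary device created by i40e carries a MODALIAS in its uevent, and modprobe resolves that alias to irdma (a sketch; the device name i40e.iwarp.0 comes from the dev_set_name() call above):

cat /sys/bus/auxiliary/devices/i40e.iwarp.0/uevent     # shows MODALIAS=auxiliary:i40e.iwarp
modprobe -R auxiliary:i40e.iwarp                       # --resolve-alias: prints irdma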


How can irdma be prevented from loading automatically?

 

Add a line to /etc/modprobe.d/blacklist.conf so that irdma is not loaded through alias expansion:

# this file lists those modules which we don't want to be loaded by

# alias expansion, usually so some other driver will be loaded for the

# device instead.

 

blacklist irdma

This does not affect loading irdma manually with modprobe.
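
A quick way to verify the effect on a test node (a sketch; unloading and reloading i40e will briefly take the NIC down):

modprobe -r irdma i40e
modprobe -v i40e
lsmod | grep irdma || echo "irdma no longer auto-loaded"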

The NIC used on this server is an Intel X722, which supports iWARP. When the NIC driver is loaded, the i40e driver registers an auxiliary iwarp device and sends a uevent to user space.

The systemd-udevd service loads irdma.ko when it receives the uevent; during initialization irdma sets up DMA mappings and allocates large chunks of memory through __alloc_pages().

 

Intel merged a patch into kernel 5.14 that renamed the iWARP driver to irdma and removed the original i40iw driver.
The 5.4.0 kernel does not auto-load the i40iw driver; loading i40iw manually consumes roughly the same ~3 GB of memory.
