This is a follow-up to the entry below.
This time, the entry is about PSI - Pressure Stall Information, specifically /proc/pressure/memory.
PSI lets you measure how long (what fraction of the time) tasks stall on CPU, memory, or IO. What I wasn't sure about was how the implementation actually decides which tasks are in a "memory stall", so I went digging.
The PSI documentation
Honestly, just reading the documentation didn't tell me at what points /proc/pressure/memory actually gets accounted.
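For reference, the file contains a "some" line and a "full" line, the same format that shows up in the experiment output later: avg10/avg60/avg300 percentages plus a cumulative total in microseconds. Here is a minimal parser sketch (field meanings per the PSI docs; nothing kernel-version specific is assumed):

#!/usr/bin/env python3
# Minimal parser for /proc/pressure/memory.
# Each line looks like: "some avg10=0.00 avg60=0.00 avg300=0.00 total=12345"
# "total" is cumulative stall time in microseconds.
def read_pressure(path="/proc/pressure/memory"):
    result = {}
    with open(path) as fh:
        for line in fh:
            kind, *fields = line.split()
            result[kind] = dict(
                (key, float(value))
                for key, value in (field.split("=") for field in fields)
            )
    return result

if __name__ == "__main__":
    print(read_pressure())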
Reading the source
Skimming through ./kernel/sched/psi.c, I found a pair of functions: psi_memstall_enter() and psi_memstall_leave().
psi_memstall_enter()
The mechanism seems to be that psi_memstall_enter() marks the calling task as "starting an operation that can stall on memory" so PSI accounts for it, and psi_memstall_leave() removes that mark.
/**
 * psi_memstall_enter - mark the beginning of a memory stall section
 * @flags: flags to handle nested sections
 *
 * Marks the calling task as being stalled due to a lack of memory,
 * such as waiting for a refault or performing reclaim.
 */
void psi_memstall_enter(unsigned long *flags)
{
        struct rq_flags rf;
        struct rq *rq;

        if (static_branch_likely(&psi_disabled))
                return;

        *flags = current->flags & PF_MEMSTALL;
        if (*flags)
                return;
        /*
         * PF_MEMSTALL setting & accounting needs to be atomic wrt
         * changes to the task's scheduling state, otherwise we can
         * race with CPU migration.
         */
        rq = this_rq_lock_irq(&rf);

        current->flags |= PF_MEMSTALL;
        psi_task_change(current, 0, TSK_MEMSTALL);

        rq_unlock_irq(rq, &rf);
}
/**
 * psi_memstall_leave - mark the end of an memory stall section
 * @flags: flags to handle nested memdelay sections
 *
 * Marks the calling task as no longer stalled due to lack of memory.
 */
void psi_memstall_leave(unsigned long *flags)
{
        struct rq_flags rf;
        struct rq *rq;

        if (static_branch_likely(&psi_disabled))
                return;

        if (*flags)
                return;
        /*
         * PF_MEMSTALL clearing & accounting needs to be atomic wrt
         * changes to the task's scheduling state, otherwise we could
         * race with CPU migration.
         */
        rq = this_rq_lock_irq(&rf);

        current->flags &= ~PF_MEMSTALL;
        psi_task_change(current, TSK_MEMSTALL, 0);

        rq_unlock_irq(rq, &rf);
}
If you chase down the places where these two are called, you can pinpoint which processing inside the Linux kernel ends up accounted in PSI - /proc/pressure/memory.
Easy to follow while reading
Incidentally, since psi_memstall_enter() and psi_memstall_leave() are written to be called as a pair, they also act as an annotation while you read the source: "ah, this is a spot that can stall on memory."
For example, the following code in kcompactd illustrates this nicely.
/*
 * The background compaction daemon, started as a kernel thread
 * from the init process.
 */
static int kcompactd(void *p)
{
        pg_data_t *pgdat = (pg_data_t*)p;
        struct task_struct *tsk = current;
        const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);

        if (!cpumask_empty(cpumask))
                set_cpus_allowed_ptr(tsk, cpumask);

        set_freezable();

        pgdat->kcompactd_max_order = 0;
        pgdat->kcompactd_classzone_idx = pgdat->nr_zones - 1;

        while (!kthread_should_stop()) {
                unsigned long pflags;

                trace_mm_compaction_kcompactd_sleep(pgdat->node_id);
                wait_event_freezable(pgdat->kcompactd_wait,
                                kcompactd_work_requested(pgdat));

                psi_memstall_enter(&pflags); 👈
                kcompactd_do_work(pgdat);
                psi_memstall_leave(&pflags); 👈
        }

        return 0;
}
Even when chasing through kernel source it's hard to tell (or remember) whether a given path can become memory-bound(?), so anything that improves the readability of the source is welcome.
It should also serve as a hint when reading older kernel versions that don't have PSI yet.
Call sites of psi_memstall_enter()
Searching https://github.com/torvalds/linux/ (5.4-rc8 had just been released) for calls to psi_memstall_enter() turns up 9 call sites, so not all that many.
psi_memstall_enter    1722 block/blk-cgroup.c    psi_memstall_enter(&pflags);
psi_memstall_enter    1188 block/blk-core.c      psi_memstall_enter(&pflags);
psi_memstall_enter      22 include/linux/psi.h   void psi_memstall_enter(unsigned long *flags);
psi_memstall_enter    2650 mm/compaction.c       psi_memstall_enter(&pflags);
psi_memstall_enter    1148 mm/filemap.c          psi_memstall_enter(&pflags);
psi_memstall_enter    2494 mm/memcontrol.c       psi_memstall_enter(&pflags);
psi_memstall_enter    3898 mm/page_alloc.c       psi_memstall_enter(&pflags);
psi_memstall_enter    4120 mm/page_alloc.c       psi_memstall_enter(&pflags);
psi_memstall_enter    3380 mm/vmscan.c           psi_memstall_enter(&pflags);
psi_memstall_enter    3592 mm/vmscan.c           psi_memstall_enter(&pflags);
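The listing above shows symbol, line number, file, and the matching line. If you want to reproduce something similar yourself, here is a rough sketch (it assumes a kernel checkout at ./linux, a path you would adjust):

#!/usr/bin/env python3
# List psi_memstall_enter() occurrences in a kernel source tree.
# KERNEL_TREE is an assumed path to a checkout of torvalds/linux.
import pathlib

KERNEL_TREE = pathlib.Path("./linux")

for path in sorted(KERNEL_TREE.rglob("*.[ch]")):
    try:
        lines = path.read_text(errors="ignore").splitlines()
    except OSError:
        continue
    for lineno, line in enumerate(lines, start=1):
        if "psi_memstall_enter(" in line:
            print(f"{path.relative_to(KERNEL_TREE)}:{lineno}: {line.strip()}")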
The following functions end up calling psi_memstall_enter():
- blkcg_maybe_throttle_blkg()
- submit_bio()
- wait_on_page_bit_common()
- mem_cgroup_handle_over_high()
- __alloc_pages_direct_compact()
- __perform_reclaim()
- try_to_free_mem_cgroup_pages()
- kcompactd() ... the kcompactd kernel thread
- balance_pgdat() ... the kswapd kernel thread
These in turn are called from plenty of other functions, so exhaustively mapping every path that can reach them is a lot of work.
Experiments
Now that I've roughly traced the source, let's move on to some experiments.
- Apply a "load" that should affect the PSI - /proc/pressure/memory values
- Observe whether the values actually change
- Capture backtraces with bpftrace and read the corresponding source
That's the plan.
Because psi_memstall_enter() is embedded in the memory subsystem's functions, writing a kernel module to generate an artificial load didn't look like a good fit, so I ran the experiments with the procedure below instead.
Experiment environment
The experiments run on Ubuntu Bionic with a 5.4-rc7 kernel.
root@bionic:~# uname -a
Linux bionic 5.4.0-050400rc7-generic #201911102031 SMP Mon Nov 11 01:34:23 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
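Before running anything, it's worth checking that PSI is actually available on the kernel you're using: it needs CONFIG_PSI=y, and if the kernel was built with CONFIG_PSI_DEFAULT_DISABLED=y it also needs psi=1 on the kernel command line. A small check sketch:

#!/usr/bin/env python3
# Check that PSI is available before running the experiments.
import os
import sys

PATH = "/proc/pressure/memory"

if not os.path.exists(PATH):
    sys.exit(f"{PATH} not found: kernel lacks CONFIG_PSI, or PSI is disabled (boot with psi=1)")

with open(PATH) as fh:
    print(fh.read(), end="")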
Experiment 1: trigger an OOM
Set a limit via the memory cgroup's memory.limit_in_bytes and trigger the OOM Killer.
root@bionic:~# mkdir /sys/fs/cgroup/memory/000
root@bionic:~# echo 100M > /sys/fs/cgroup/memory/000/memory.limit_in_bytes
root@bionic:~# echo $$ > /sys/fs/cgroup/memory/000/tasks

# OOM 🔥
root@bionic:~# perl -e '@a = (); for(;;) { push @a, 1 x 1024 }'
At the same time, watch /proc/pressure/memory:
for i in {1..100}; do cat /proc/pressure/memory | perl -nle 'printf "%s\t%s\n", scalar localtime(), $_'; sleep 1; done;
You can see the /proc/pressure/memory values change at roughly the same time the OOM Killer kicks in.
Mon Nov 18 16:59:18 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:18 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:19 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:19 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:20 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:20 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:21 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:21 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:22 2019  some avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:22 2019  full avg10=0.00 avg60=0.00 avg300=0.00 total=0
Mon Nov 18 16:59:25 2019 👉 some avg10=0.00 avg60=0.00 avg300=0.00 total=28846
Mon Nov 18 16:59:25 2019 👉 full avg10=0.00 avg60=0.00 avg300=0.00 total=11876
Mon Nov 18 16:59:26 2019  some avg10=1.96 avg60=0.41 avg300=0.08 total=286128
Mon Nov 18 16:59:26 2019  full avg10=1.51 avg60=0.31 avg300=0.06 total=221701
Mon Nov 18 16:59:27 2019  some avg10=1.96 avg60=0.41 avg300=0.08 total=286128
Mon Nov 18 16:59:27 2019  full avg10=1.51 avg60=0.31 avg300=0.06 total=221701
Mon Nov 18 16:59:28 2019  some avg10=1.60 avg60=0.39 avg300=0.08 total=286128
Mon Nov 18 16:59:28 2019  full avg10=1.24 avg60=0.30 avg300=0.06 total=221701
Mon Nov 18 16:59:29 2019  some avg10=1.60 avg60=0.39 avg300=0.08 total=286128
Mon Nov 18 16:59:29 2019  full avg10=1.24 avg60=0.30 avg300=0.06 total=221701
Mon Nov 18 16:59:30 2019  some avg10=1.31 avg60=0.38 avg300=0.08 total=286128
Mon Nov 18 16:59:30 2019  full avg10=1.01 avg60=0.29 avg300=0.06 total=221701
Mon Nov 18 16:59:31 2019  some avg10=1.31 avg60=0.38 avg300=0.08 total=286128
Mon Nov 18 16:59:31 2019  full avg10=1.01 avg60=0.29 avg300=0.06 total=221701
Mon Nov 18 16:59:32 2019  some avg10=1.07 avg60=0.37 avg300=0.08 total=286128
Mon Nov 18 16:59:32 2019  full avg10=0.83 avg60=0.28 avg300=0.06 total=221701
Mon Nov 18 16:59:33 2019  some avg10=1.07 avg60=0.37 avg300=0.08 total=286128
Mon Nov 18 16:59:33 2019  full avg10=0.83 avg60=0.28 avg300=0.06 total=221701
Mon Nov 18 16:59:34 2019  some avg10=0.88 avg60=0.36 avg300=0.08 total=286128
Mon Nov 18 16:59:34 2019  full avg10=0.68 avg60=0.27 avg300=0.06 total=221701
Mon Nov 18 16:59:35 2019  some avg10=0.88 avg60=0.36 avg300=0.08 total=286128
Mon Nov 18 16:59:35 2019  full avg10=0.68 avg60=0.27 avg300=0.06 total=221701
Mon Nov 18 16:59:36 2019  some avg10=0.72 avg60=0.34 avg300=0.08 total=286128
Mon Nov 18 16:59:36 2019  full avg10=0.55 avg60=0.26 avg300=0.06 total=221701
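As a sanity check on what these numbers mean: total is cumulative stall time in microseconds, so you can derive a stall percentage over any interval from its delta, which is roughly what avg10/avg60/avg300 track (those are running averages, so the numbers won't line up exactly). A rough sketch:

#!/usr/bin/env python3
# Approximate the "some" stall percentage over a 10 second interval
# from the delta of the cumulative "total" counter (microseconds).
import time

def read_total(kind="some", path="/proc/pressure/memory"):
    with open(path) as fh:
        for line in fh:
            fields = line.split()
            if fields[0] == kind:
                return int(fields[-1].split("=")[1])
    raise ValueError(f"{kind} line not found in {path}")

INTERVAL = 10.0
before = read_total()
time.sleep(INTERVAL)
after = read_total()
print(f"some stall over last {INTERVAL:.0f}s: {(after - before) / (INTERVAL * 1e6) * 100:.2f}%")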
Taking backtraces at kprobe:psi_memstall_enter with bpftrace shows call paths like the following. Quite a lot of them, actually.
vagrant@bionic:~$ sudo bpftrace -e 'kprobe:psi_memstall_enter { @[kstack] = count(); }'
Attaching 1 probe...
^C
@[ psi_memstall_enter+1 ext4_mpage_readpages+1145 ext4_readpages+58 read_pages+113 __do_page_cache_readahead+303 ondemand_readahead+402 page_cache_sync_readahead+120 generic_file_buffered_read+1393 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 1
@[ psi_memstall_enter+1 submit_bh_wbc+389 ll_rw_block+171 __breadahead+68 __ext4_get_inode_loc+945 ext4_reserve_inode_write+84 ext4_mark_inode_dirty+83 ext4_dirty_inode+73 __mark_inode_dirty+640 generic_update_time+157 file_update_time+235 __generic_file_write_iter+154 ext4_file_write_iter+185 new_sync_write+293 __vfs_write+41 vfs_write+185 ksys_write+103 __x64_sys_write+26 do_syscall_64+87 entry_SYSCALL_64_after_hwframe+68 ]: 1
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 __do_page_cache_readahead+161 ondemand_readahead+655 page_cache_sync_readahead+120 generic_file_buffered_read+1393 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 1
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 __do_page_cache_readahead+161 filemap_fault+1742 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 1
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 pagecache_get_page+191 squashfs_readpage_block+275 squashfs_readpage+537 read_pages+294 __do_page_cache_readahead+384 filemap_fault+1742 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 2
@[ psi_memstall_enter+1 ext4_mpage_readpages+1145 ext4_readpages+58 read_pages+113 __do_page_cache_readahead+384 ondemand_readahead+402 page_cache_async_readahead+181 generic_file_buffered_read+1964 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 2
@[ psi_memstall_enter+1 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 3
@[ psi_memstall_enter+1 ext4_mpage_readpages+1145 ext4_readpages+58 read_pages+113 __do_page_cache_readahead+384 ondemand_readahead+402 page_cache_sync_readahead+120 generic_file_buffered_read+1393 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 3
@[ psi_memstall_enter+1 ext4_mpage_readpages+1145 ext4_readpages+58 read_pages+113 __do_page_cache_readahead+303 filemap_fault+1742 ext4_filemap_fault+50 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 4
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 pagecache_get_page+191 __getblk_gfp+241 squashfs_read_data+728 squashfs_readpage_block+667 squashfs_readpage+537 read_pages+294 __do_page_cache_readahead+384 filemap_fault+1742 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 4
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 __do_page_cache_readahead+161 ondemand_readahead+402 page_cache_async_readahead+181 generic_file_buffered_read+1964 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 4
@[ psi_memstall_enter+1 ext4_mpage_readpages+1145 ext4_readpages+58 read_pages+113 __do_page_cache_readahead+384 filemap_fault+1742 ext4_filemap_fault+50 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 5
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 __do_page_cache_readahead+161 ondemand_readahead+402 page_cache_sync_readahead+120 generic_file_buffered_read+1393 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 6
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_vma+127 do_anonymous_page+280 __handle_mm_fault+1888 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 8
@[ psi_memstall_enter+1 filemap_fault+1855 ext4_filemap_fault+50 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 13
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 __do_page_cache_readahead+161 filemap_fault+1742 ext4_filemap_fault+50 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 17
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 pagecache_get_page+191 filemap_fault+1778 ext4_filemap_fault+50 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 19
@[ psi_memstall_enter+1 submit_bh_wbc+389 ll_rw_block+171 squashfs_read_data+787 squashfs_readpage_block+667 squashfs_readpage+537 read_pages+294 __do_page_cache_readahead+384 filemap_fault+1742 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 25
@[ psi_memstall_enter+1 submit_bh_wbc+389 ll_rw_block+171 squashfs_read_data+787 squashfs_readpage_block+667 squashfs_readpage+537 filemap_fault+287 __do_fault+60 do_fault+587 __handle_mm_fault+1221 handle_mm_fault+202 do_user_addr_fault+505 __do_page_fault+88 do_page_fault+44 page_fault+52 ]: 34
@[ psi_memstall_enter+1 __alloc_pages_nodemask+720 alloc_pages_current+135 __page_cache_alloc+114 generic_file_buffered_read+1429 generic_file_read_iter+220 ext4_file_read_iter+79 do_iter_readv_writev+419 do_iter_read+226 vfs_iter_read+25 loop_queue_work+889 kthread_worker_fn+121 loop_kthread_worker_fn+30 kthread+260 ret_from_fork+53 ]: 77
@[ psi_memstall_enter+1 kswapd+511 kthread+260 ret_from_fork+53 ]: 626
^C
Experiment 2: generate a large number of negative dentries
The load this time: generate negative dentries en masse to force slab reclaim to run.
root@bionic:~# mkdir /sys/fs/cgroup/memory/000
root@bionic:~# echo 100M > /sys/fs/cgroup/memory/000/memory.limit_in_bytes
root@bionic:~# echo $$ > /sys/fs/cgroup/memory/000/tasks

# slab reclaim 🔥
root@bionic:~# timeout 30 perl -e 'stat "/$_" for 1..10000000'
At the same time, watch /proc/pressure/memory:
for i in {1..100}; do cat /proc/pressure/memory | perl -nle 'printf "%s\t%s\n", scalar localtime(), $_'; sleep 1; done;
You can see the /proc/pressure/memory values climb gradually. Since negative dentries are being created non-stop, the accounting is presumably happening around where slab reclaim runs.
Mon Nov 18 17:53:50 2019  some avg10=0.00 avg60=0.34 avg300=0.47 total=5822085
Mon Nov 18 17:53:50 2019  full avg10=0.00 avg60=0.22 avg300=0.31 total=4638505
Mon Nov 18 17:53:51 2019  some avg10=0.36 avg60=0.40 avg300=0.48 total=5863706
Mon Nov 18 17:53:51 2019  full avg10=0.18 avg60=0.25 avg300=0.32 total=4670954
Mon Nov 18 17:53:52 2019  some avg10=0.36 avg60=0.40 avg300=0.48 total=5979685
Mon Nov 18 17:53:52 2019  full avg10=0.18 avg60=0.25 avg300=0.32 total=4717815
Mon Nov 18 17:53:53 2019  some avg10=1.92 avg60=0.68 avg300=0.54 total=6048779
Mon Nov 18 17:53:53 2019  full avg10=1.05 avg60=0.40 avg300=0.35 total=4774972
Mon Nov 18 17:53:54 2019  some avg10=1.92 avg60=0.68 avg300=0.54 total=6103945
Mon Nov 18 17:53:54 2019  full avg10=1.05 avg60=0.40 avg300=0.35 total=4811085
Mon Nov 18 17:53:55 2019  some avg10=2.66 avg60=0.85 avg300=0.58 total=6189074
Mon Nov 18 17:53:55 2019  full avg10=1.40 avg60=0.49 avg300=0.37 total=4847986
Mon Nov 18 17:53:56 2019  some avg10=2.66 avg60=0.85 avg300=0.58 total=6302442
Mon Nov 18 17:53:56 2019  full avg10=1.40 avg60=0.49 avg300=0.37 total=4941928
Mon Nov 18 17:53:57 2019  some avg10=3.63 avg60=1.09 avg300=0.63 total=6352685
Mon Nov 18 17:53:57 2019  full avg10=2.23 avg60=0.67 avg300=0.41 total=4975421
Mon Nov 18 17:53:58 2019  some avg10=3.63 avg60=1.09 avg300=0.63 total=6460241
Mon Nov 18 17:53:58 2019  full avg10=2.23 avg60=0.67 avg300=0.41 total=5042527
Mon Nov 18 17:53:59 2019  some avg10=4.78 avg60=1.38 avg300=0.69 total=6556563
Mon Nov 18 17:53:59 2019  full avg10=2.73 avg60=0.81 avg300=0.44 total=5080571
Mon Nov 18 17:54:00 2019  some avg10=4.78 avg60=1.38 avg300=0.69 total=6672685
Mon Nov 18 17:54:00 2019  full avg10=2.73 avg60=0.81 avg300=0.44 total=5168872
Mon Nov 18 17:54:01 2019  some avg10=5.36 avg60=1.60 avg300=0.75 total=6733324
Mon Nov 18 17:54:01 2019  full avg10=3.33 avg60=0.98 avg300=0.48 total=5210281
Mon Nov 18 17:54:02 2019  some avg10=5.36 avg60=1.60 avg300=0.75 total=6799858
Mon Nov 18 17:54:02 2019  full avg10=3.33 avg60=0.98 avg300=0.48 total=5238881
Mon Nov 18 17:54:03 2019  some avg10=5.66 avg60=1.77 avg300=0.79 total=6872770
Mon Nov 18 17:54:03 2019  full avg10=3.45 avg60=1.08 avg300=0.50 total=5283273
Mon Nov 18 17:54:04 2019  some avg10=5.66 avg60=1.77 avg300=0.79 total=6941195
Mon Nov 18 17:54:04 2019  full avg10=3.45 avg60=1.08 avg300=0.50 total=5330769
Mon Nov 18 17:54:05 2019  some avg10=6.08 avg60=1.98 avg300=0.84 total=7045575
Mon Nov 18 17:54:05 2019  full avg10=3.73 avg60=1.21 avg300=0.53 total=5395983
Mon Nov 18 17:54:06 2019  some avg10=6.08 avg60=1.98 avg300=0.84 total=7101877
Mon Nov 18 17:54:06 2019  full avg10=3.73 avg60=1.21 avg300=0.53 total=5430315
Mon Nov 18 17:54:07 2019  some avg10=6.07 avg60=2.11 avg300=0.87 total=7175145
Mon Nov 18 17:54:07 2019  full avg10=3.78 avg60=1.30 avg300=0.55 total=5480289
Mon Nov 18 17:54:08 2019  some avg10=6.07 avg60=2.11 avg300=0.87 total=7323447
Mon Nov 18 17:54:08 2019  full avg10=3.78 avg60=1.30 avg300=0.55 total=5546810
Mon Nov 18 17:54:09 2019  some avg10=6.96 avg60=2.40 avg300=0.94 total=7412187
Mon Nov 18 17:54:09 2019  full avg10=4.00 avg60=1.42 avg300=0.59 total=5594752
Mon Nov 18 17:54:10 2019  some avg10=6.96 avg60=2.40 avg300=0.94 total=7458924
Mon Nov 18 17:54:10 2019  full avg10=4.00 avg60=1.42 avg300=0.59 total=5631377
Mon Nov 18 17:54:11 2019  some avg10=7.33 avg60=2.61 avg300=1.00 total=7619594
Mon Nov 18 17:54:11 2019  full avg10=4.18 avg60=1.54 avg300=0.62 total=5699840
Mon Nov 18 17:54:12 2019  some avg10=7.33 avg60=2.61 avg300=1.00 total=7709262
Mon Nov 18 17:54:12 2019  full avg10=4.18 avg60=1.54 avg300=0.62 total=5774328
Mon Nov 18 17:54:13 2019  some avg10=7.27 avg60=2.76 avg300=1.04 total=7777595
Mon Nov 18 17:54:13 2019  full avg10=4.33 avg60=1.65 avg300=0.65 total=5826863
Take backtraces at kprobe:psi_memstall_enter with bpftrace:
vagrant@bionic:~$ sudo bpftrace -e 'kprobe:psi_memstall_enter { @[kstack] = count(); }'
Attaching 1 probe...
^C
@[
    psi_memstall_enter+1
    try_charge+744
    __memcg_kmem_charge_memcg+64
    alloc_slab_page+404
    allocate_slab+125
    new_slab+74
    ___slab_alloc+824
    __slab_alloc+32
    kmem_cache_alloc+516
    __d_alloc+39
    d_alloc+26
    d_alloc_parallel+90
    __lookup_slow+105
    lookup_slow+59
    walk_component+474
    path_lookupat.isra.0+128
    filename_lookup+174
    user_path_at_empty+58
    vfs_statx+125
    __do_sys_newstat+62
    __x64_sys_newstat+22
    do_syscall_64+87
    entry_SYSCALL_64_after_hwframe+68
]: 141
In the bpftrace output try_charge() looks like the caller, but reading the source, the precise spot is try_to_free_mem_cgroup_pages(), which try_charge() calls; it just doesn't show up as its own frame in the stack.
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *memcg,
					   unsigned long nr_pages,
					   gfp_t gfp_mask,
					   bool may_swap)
{
        struct zonelist *zonelist;
        unsigned long nr_reclaimed;
        unsigned long pflags;
        int nid;
        unsigned int noreclaim_flag;
        struct scan_control sc = {
                .nr_to_reclaim = max(nr_pages, SWAP_CLUSTER_MAX),
                .gfp_mask = (current_gfp_context(gfp_mask) & GFP_RECLAIM_MASK) |
                                (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK),
                .reclaim_idx = MAX_NR_ZONES - 1,
                .target_mem_cgroup = memcg,
                .priority = DEF_PRIORITY,
                .may_writepage = !laptop_mode,
                .may_unmap = 1,
                .may_swap = may_swap,
        };

        set_task_reclaim_state(current, &sc.reclaim_state);
        /*
         * Unlike direct reclaim via alloc_pages(), memcg's reclaim doesn't
         * take care of from where we get pages. So the node where we start the
         * scan does not need to be the current node.
         */
        nid = mem_cgroup_select_victim_node(memcg);

        zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];

        trace_mm_vmscan_memcg_reclaim_begin(0, sc.gfp_mask);

        psi_memstall_enter(&pflags); 👈
        noreclaim_flag = memalloc_noreclaim_save();

        nr_reclaimed = do_try_to_free_pages(zonelist, &sc);

        memalloc_noreclaim_restore(noreclaim_flag);
        psi_memstall_leave(&pflags); 👈

        trace_mm_vmscan_memcg_reclaim_end(nr_reclaimed);
        set_task_reclaim_state(current, NULL);

        return nr_reclaimed;
}
#endif
Summary
- Roughly traced the source and the functions that call psi_memstall_enter()
- Watched /proc/pressure/memory change under OOM and under mass generation of negative dentries
- Once again felt that the memory subsystem is complicated; my head just about exploded
/proc/pressure/memory on its own can't tell you where the stalls are happening, so combining it with the counters in /proc/vmstat should make it easier to work with as a metric, I think.
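As a starting point for that kind of correlation, here is a small sampling sketch that prints per-interval deltas of the PSI "some" total alongside a few /proc/vmstat counters related to reclaim, compaction, and refaults. The counter names below are just an example set from 5.x kernels; they vary between versions, so adjust as needed:

#!/usr/bin/env python3
# Sample /proc/pressure/memory "some total" together with a few /proc/vmstat
# counters and print per-interval deltas. Counter names vary by kernel version;
# the ones below are an assumed example set.
import time

VMSTAT_KEYS = ["pgscan_direct", "pgsteal_direct", "compact_stall", "workingset_refault"]
INTERVAL = 1.0

def psi_some_total(path="/proc/pressure/memory"):
    with open(path) as fh:
        for line in fh:
            if line.startswith("some"):
                return int(line.split()[-1].split("=")[1])
    return 0

def vmstat(path="/proc/vmstat"):
    with open(path) as fh:
        return {key: int(value) for key, value in (line.split() for line in fh)}

prev_psi, prev_vm = psi_some_total(), vmstat()
while True:
    time.sleep(INTERVAL)
    cur_psi, cur_vm = psi_some_total(), vmstat()
    deltas = " ".join(
        f"{key}=+{cur_vm.get(key, 0) - prev_vm.get(key, 0)}" for key in VMSTAT_KEYS
    )
    print(f"psi_some_total=+{cur_psi - prev_psi}us {deltas}")
    prev_psi, prev_vm = cur_psi, cur_vm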