TL;DR: transparent_hugepage set to always looks like a memory leak when running a ceph OSD
Like a lot of digital technology geeks, I have a server running in my basement. Actually, it's a kubernetes cluster (running on k3s) at the moment, set up as a 100% distributed highly-available cluster on 3 nodes. Yes, this is overkill for the workloads I'm running.
To keep everything distributed and flexible, I'm using ceph for storage (with rook, so I can manage it via k8s), with 3 OSDs, one on each node. My nodes are running Debian 11, which defaults to transparent_hugepage enabled.
Shortly after setting this whole structure up and switching my jellyfin server and such over to it, I noticed that after I would run backups, the memory usage of the OSD processes was very high: over 50% more than the 4GB I had configured them to use. Ceph's internal tooling and statistics reported normal-looking memory usage, at which point I started to wonder if I'd found some kind of weird memory leak in ceph.
However, as it turns out, I didn't find a leak in ceph, but a kernel feature that was causing memory fragmentation issues:
$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
(fair warning: the next paragraph matches my experience here, but is not well-researched and may not be completely accurate)
On my nodes, transparent_hugepage was set to [always], meaning the kernel would opportunistically try to merge standard 4KB memory pages into 2MB "huge pages". However, if 512 4KB pages are merged into a hugepage and then all but 4KB of it is freed, the entire 2MB page remains allocated to the process, vastly inflating its memory usage. As I understand it, the kernel is able to break this 2MB page back into 4KB pages under memory pressure, but this costs extra CPU cycles and slows performance when the system is loaded.
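If you want to see how much anonymous memory is actually backed by huge pages, the kernel exposes counters for this. A minimal sketch, reading the system-wide total and the current shell's own stats (substitute an OSD's PID for "self" to inspect it instead; smaps_rollup needs kernel 4.14+):

```shell
# System-wide total of anonymous memory currently backed by huge pages
grep AnonHugePages /proc/meminfo

# Per-process view: swap "self" for an OSD's PID (e.g. /proc/1234/smaps_rollup)
grep AnonHugePages /proc/self/smaps_rollup
```

If the per-process AnonHugePages figure is much larger than you'd expect, fragmented huge pages are a likely suspect.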
To fix this issue, I changed the transparent_hugepage setting to madvise (which is the default on my fedora workstation):
$ echo madvise >/sys/kernel/mm/transparent_hugepage/enabled
This will change the setting for the current boot; to change it for future boots, you need to add transparent_hugepage=madvise to your kernel's boot arguments (on my Debian nodes, by editing /etc/default/grub and running update-grub).
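A sketch of what that edit looks like, assuming the stock Debian GRUB_CMDLINE_LINUX_DEFAULT variable (your existing options will differ; append rather than replace):

```shell
# /etc/default/grub -- append to whatever options you already have
GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=madvise"
```

Then run update-grub (as root) and reboot for the new argument to take effect.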
After making this change, my OSD memory usage stayed within my configured 4GB.
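To confirm which mode is active, whether after the echo or after a reboot with the new kernel argument, you can re-read the same sysfs file; the bracketed value is the one in effect:

```shell
# The currently selected THP mode is shown in square brackets
cat /sys/kernel/mm/transparent_hugepage/enabled
```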