
Transparent Huge Pages and Ceph OSDs

TL;DR: transparent_hugepage set to always looks like a memory leak when running a ceph OSD

Like a lot of digital technology geeks, I have a server running in my basement. Actually, it's a kubernetes cluster (running on k3s) at the moment, set up as a 100% distributed highly-available cluster on 3 nodes. Yes, this is overkill for the workloads I'm running.

To keep everything distributed and flexible, I'm using ceph for storage (with rook, so I can manage it via k8s), with 3 OSDs, one on each node. My nodes are running Debian 11, which defaults to transparent_hugepage set to always.

Shortly after getting all of this set up and switching my jellyfin server and such over to it, I noticed that after running backups, the memory usage of the OSD processes was very high: over 50% more than the 4GB I had configured them to use. Ceph's internal tooling and statistics seemed to report normal-looking memory usage, at which point I started to wonder if I had found some kind of weird memory leak in ceph.
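(If you want to compare those numbers yourself, the commands below are roughly what's involved; osd.0 is just an example ID, and the ceph commands assume you can reach the cluster CLI, e.g. via rook's toolbox pod.)

$ ceph config get osd osd_memory_target   # the configured target, in bytes
$ ceph tell osd.0 dump_mempools           # ceph's own accounting for one OSD
$ ps -o rss= -p "$(pidof ceph-osd)"       # resident memory per the kernel (kB), run on the node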

However, as it turns out, I hadn't found a leak in ceph, but rather a kernel feature that was causing memory fragmentation issues:

$ cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
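As a side check, you can see how much of a process's resident memory the kernel has backed with huge pages. Something like the following works, assuming a single ceph-osd process on the node and a root shell (the OSD isn't running as your user):

$ grep AnonHugePages /proc/meminfo                                        # system-wide
$ grep -E '^(Rss|AnonHugePages)' /proc/"$(pidof ceph-osd)"/smaps_rollup   # for the OSD process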

(fair warning: the next paragraph matches my experience here, but is not well-researched and may not be completely accurate)

On my nodes, transparent_hugepage was set to [always], meaning the kernel would optimistically try to merge standard 4KB memory pages into 2MB "huge pages". However, if 512 full 4KB pages get merged into a huge page, and then all but 4KB of it is freed, the entire 2MB page remains allocated to the process, vastly inflating its memory usage. As I understand it, the kernel is able to break this 2MB page back into 4KB pages under memory pressure, but doing so costs extra CPU cycles and slows performance right when the system is loaded.
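If you're curious, the kernel also exposes counters for this merging and splitting activity in /proc/vmstat; I'm not going to pretend to interpret them all, but they're there:

$ grep ^thp_ /proc/vmstat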

To fix this issue, I changed the transparent_hugepage setting to madvise (which is the default on my fedora workstation):

$ echo madvise >/sys/kernel/mm/transparent_hugepage/enabled

This will change the setting for the current boot; to change it for future boots, you need to add transparent_hugepage=madvise to your kernel's boot arguments (on my Debian 11 install, that means editing /etc/default/grub and running update-grub).
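On Debian, that looks roughly like the following (the existing contents of your GRUB_CMDLINE_LINUX_DEFAULT line will vary; "quiet" here is just the stock default):

# in /etc/default/grub:
GRUB_CMDLINE_LINUX_DEFAULT="quiet transparent_hugepage=madvise"

$ update-grub
$ cat /sys/kernel/mm/transparent_hugepage/enabled   # after a reboot, should show always [madvise] never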

After making this change, my OSD memory usage stayed within my configured 4GB.