Linux memory management at scale - FOSDEM
Linux memory management at scale
Chris Down
Kernel, Facebook
https://chrisdown.name
server
Image: Spc. Christopher Hernandez, US Military Public Domain
Image: Simon Law on Flickr, CC-BY-SA
Image: Orion J on Wikimedia Commons, CC-BY
■ Memory is divided into multiple “types”: anon, cache, buffers, etc.
■ “Reclaimable” or “unreclaimable” is important, but not guaranteed
■ RSS is kinda bullshit, sorry
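To make the "types" point concrete, here is a minimal sketch that splits `/proc/meminfo`-style text into a few of the categories named above. The sample numbers are invented for illustration, and the choice of `AnonPages`, `Cached` and `Buffers` as the fields to show is an assumption, not something taken from the slides.

```python
import re

# Invented sample in /proc/meminfo format (values are illustrative only).
SAMPLE_MEMINFO = """\
MemTotal:       16303428 kB
MemFree:         1203944 kB
Buffers:          312480 kB
Cached:          6150212 kB
AnonPages:       5402116 kB
HugePages_Total:       0
"""


def parse_meminfo(text):
    """Return {field: value}. Values are in kB where the kernel prints a unit."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"^(\S+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def breakdown(fields):
    """Pick out anonymous memory, page cache and buffers (0 if a field is absent)."""
    return {name: fields.get(name, 0) for name in ("AnonPages", "Cached", "Buffers")}


for name, kilobytes in breakdown(parse_meminfo(SAMPLE_MEMINFO)).items():
    print(f"{name}: {kilobytes} kB")
```

On a real machine the same functions can be fed the contents of `/proc/meminfo` instead of the sample string.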
bit.ly/whyswap
■ Swap isn’t about emergency memory; in fact, that’s probably harmful
■ Instead, it increases reclaim equality and reliability of forward progress of the system
■ Also promotes maintaining a small positive pressure (similar to make -j cores+1)
■ OOM killer is reactive, not proactive, based on reclaim failure
■ Hotness obscured by MMU (pte_young), we don’t know we’re OOMing ahead of time
■ Can be very, very late to the party, and sometimes go to the wrong party entirely
■ kswapd reclaim: background, started when resident pages go above a threshold
■ Direct reclaim: blocks the application when there is no memory available to allocate frames
■ Tries to reclaim the coldest pages first
■ Some things might not be reclaimable. Swap can help here (bit.ly/whyswap)
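One rough way to see how much scanning is done by kswapd versus direct reclaim is to compare counters in `/proc/vmstat`. The sketch below assumes the counter names `pgscan_kswapd` and `pgscan_direct`, which exist on recent kernels but can differ on older ones. The sample values are invented.

```python
# Invented sample in /proc/vmstat format (values are illustrative only).
SAMPLE_VMSTAT = """\
pgscan_kswapd 120000
pgscan_direct 3000
pgsteal_kswapd 110000
pgsteal_direct 2500
"""


def parse_vmstat(text):
    """Return {counter_name: value} for every 'name value' line."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def direct_reclaim_share(counters):
    """Fraction of scanned pages that were scanned by direct reclaim."""
    kswapd = counters.get("pgscan_kswapd", 0)
    direct = counters.get("pgscan_direct", 0)
    total = kswapd + direct
    return direct / total if total else 0.0


share = direct_reclaim_share(parse_vmstat(SAMPLE_VMSTAT))
print(f"direct reclaim share of scanning: {share:.1%}")
```

The counters are cumulative since boot, so in practice you would diff two readings taken some seconds apart rather than look at the raw totals.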
“If I had more of this resource, I could probably run N% faster”
■ Find bottlenecks
■ Detect workload health issues before they become severe
■ Used for resource allocation, load shedding, pre-OOM detection
$ cat /sys/fs/cgroup/system.slice/memory.pressure
some avg10=0.21 avg60=0.22 total=4760988587
full avg10=0.21 avg60=0.22 total=4681731696
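The pressure file shown on this slide is easy to consume from a script. The following is a minimal sketch of a parser for that text format, using the exact sample lines from the slide as input. The function name `parse_pressure` is made up for this example.

```python
# The two lines shown on the slide, as they would appear in memory.pressure.
SAMPLE_PRESSURE = """\
some avg10=0.21 avg60=0.22 total=4760988587
full avg10=0.21 avg60=0.22 total=4681731696
"""


def parse_pressure(text):
    """Parse pressure-file text into {"some": {...}, "full": {...}}.

    The avg* fields become floats (percentages); total stays an int.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        values = {}
        for pair in pairs:
            key, _, raw = pair.partition("=")
            values[key] = int(raw) if key == "total" else float(raw)
        result[kind] = values
    return result


pressure = parse_pressure(SAMPLE_PRESSURE)
print(pressure["some"]["avg10"], pressure["full"]["total"])
```

To read live data, pass the contents of a cgroup's `memory.pressure` file (for example the `system.slice` path on the slide) to the same function.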
bit.ly/fboomd
■ Early-warning OOM detection and handling using new memory pressure metrics
■ Highly configurable policy/rule engine
■ Workload QoS and context-aware decisions
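To give a feel for what "early warning" based on pressure can mean, here is a toy rule. It is not oomd's rule engine or configuration format. The function name, the threshold and the window length are all hypothetical and chosen only for illustration.

```python
def sustained_pressure(samples, threshold, min_consecutive):
    """Return True if the last `min_consecutive` samples all exceed `threshold`.

    `samples` is a list of pressure percentages (e.g. successive readings of
    the "full avg10" value), oldest first. This is a toy rule, not oomd's logic.
    """
    if min_consecutive <= 0 or len(samples) < min_consecutive:
        return False
    recent = samples[-min_consecutive:]
    return all(value > threshold for value in recent)


# Hypothetical readings: pressure climbs and stays above 10% for three samples.
readings = [5.0, 12.0, 15.0, 20.0]
print(sustained_pressure(readings, threshold=10.0, min_consecutive=3))
```

Requiring several consecutive samples above the threshold is one simple way to avoid reacting to a short spike. A real policy would also need to decide what to act on, which is where the workload context mentioned above comes in.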
Shift to “protection” mentality
■ Limits (e.g. memory.{high,max}) really don’t compose well
■ Prefer protection (memory.{low,min}) if possible
■ Protections affect memory reclaim behaviour
fbtax2
■ Workload protection: Prevent non-critical services degrading main workload
■ Host protection: Degrade gracefully if machine cannot sustain workload
■ Usability: Avoid introducing performance or operational costs
fbtax2
Base OS
■ Filesystems
■ Swap
■ Kernel tunables…
cgroup v2
■ Default hierarchy
■ Resource configuration
Applications
■ oomd
■ Metric exporting for cgroups
Base OS
■ btrfs as /
  ■ ext4 has priority inversions
  ■ All metadata is annotated
■ Swap
  ■ Yes, you really still want it (bit.ly/whyswap)
  ■ Allows memory pressure to build up gracefully
  ■ Usually disabled on main workload
  ■ btrfs swap file support to avoid tying to provisioning
■ Kernel tunables
  ■ vm.swappiness
  ■ Writeback throttling
fbtax2 cgroup hierarchy: old
web
system.slice (memory.high: 8G, memory.max: 10G)
  Chef
hostcritical.slice
  sshd
  syslog
workload.slice
  workload-container.slice
    HHVM
  workload-deps.slice
    Service discovery
    Config service
fbtax2 cgroup hierarchy
web
system.slice (io.latency: 75ms)
  Chef
hostcritical.slice (memory.min: 352M, io.latency: 50ms)
  sshd
  syslog
workload.slice (memory.low: 17G, io.latency: 50ms)
  workload-container.slice (memory.low: max)
    HHVM
  workload-deps.slice (memory.low: 2.5G)
    Service discovery
    Config service
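The memory protections in this hierarchy map onto cgroup v2 control files. The sketch below only computes which file would receive which value and writes nothing. It assumes the cgroup2 filesystem is mounted at `/sys/fs/cgroup`, that the two `workload-*` slices are nested under `workload.slice`, and that M and G mean binary units. None of this is stated on the slide. The `io.latency` settings are left out because the real file takes a per-device format that the slide does not show.

```python
from pathlib import PurePosixPath

# Assumed mount point of the cgroup v2 hierarchy.
CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup")

# Memory protections copied from the slide; the nesting is an assumption.
PROTECTIONS = {
    "hostcritical.slice": {"memory.min": "352M"},
    "workload.slice": {"memory.low": "17G"},
    "workload.slice/workload-container.slice": {"memory.low": "max"},
    "workload.slice/workload-deps.slice": {"memory.low": "2.5G"},
}

# Binary multipliers (assumed interpretation of the slide's M and G).
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def to_kernel_value(size):
    """Convert a size such as '2.5G' to a byte count string; pass 'max' through."""
    if size == "max":
        return "max"
    if size[-1] in UNITS:
        return str(int(float(size[:-1]) * UNITS[size[-1]]))
    return str(int(size))


def planned_writes(protections, root=CGROUP_ROOT):
    """Return a list of (control_file_path, value) pairs without touching the system."""
    writes = []
    for cgroup, settings in protections.items():
        for filename, size in settings.items():
            writes.append((str(root / cgroup / filename), to_kernel_value(size)))
    return writes


for path, value in planned_writes(PROTECTIONS):
    print(f"{value} -> {path}")
```

Fractional sizes such as 2.5G are converted to whole byte counts here so that the value written is unambiguous. On a systemd-managed host these settings would more likely be applied through unit configuration than by writing the files directly.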
webservers: protection against memory starvation
Try it out: bit.ly/fbtax2