Systemd service hardening

First written 2020-01-25; Last updated 2024-02-28

Systemd provides many configuration settings to reduce the privileges of a service and restrict its access, and thus harden it against potential vulnerabilities. However, these settings are scattered throughout the documentation, which makes them harder to find than necessary. In addition, the commonly suggested settings are not enough to restrict privileged processes (running as root or with special capabilities) because such processes can still access sensitive files like private keys (e.g. in /etc/) or sockets (e.g. in /run/).

Note
One remaining limitation of this setup is that privileged processes can still send signals to any other process. If possible, don’t run services as root; use a separate user instead (User, Group). Required capabilities can be granted even to non-root processes with AmbientCapabilities.
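
As a minimal sketch of this advice, a service that only needs to bind a port below 1024 could run as a dedicated user with a single ambient capability. The account name exampled is hypothetical and must already exist, and CAP_NET_BIND_SERVICE is just one example of a required capability. A service like this would also need PrivateNetwork and PrivateUsers from the default block relaxed.

```ini
[Service]
# Hypothetical dedicated account (assumption, not from the original text)
User=exampled
Group=exampled
# Grant only the capability the service needs, although it is not root
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Reset the bounding set, then permit the same single capability
CapabilityBoundingSet=
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
```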

The following configuration snippet is a collection of all relevant hardening options I could find, followed by a short explanation of what they do and how they are useful; see the systemd man pages (man systemd.directives) for details. These settings work at least since Debian Buster (systemd 241), except where otherwise noted.

Systemd provides the systemd-analyze security command to check how well a service is restricted. It does not take all possible hardening settings into account but gives a good overview of which services require further hardening.

CapabilityBoundingSet=
KeyringMode=private
LockPersonality=yes
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
PrivateDevices=yes
PrivateMounts=yes
PrivateNetwork=yes
PrivateTmp=yes
PrivateUsers=yes
ProtectClock=true
ProtectControlGroups=yes
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=true
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectProc=invisible
ProtectSystem=strict
# Permit AF_UNIX for syslog(3) to help debugging. (Empty setting permits all
# families! A possible workaround would be to blacklist AF_UNIX afterwards.)
RestrictAddressFamilies=
RestrictAddressFamilies=AF_UNIX
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=
SystemCallFilter=@system-service
SystemCallFilter=~@aio @chown @clock @cpu-emulation @debug @keyring @memlock @module @mount @obsolete @privileged @raw-io @reboot @resources @setuid @swap userfaultfd mincore

# Restrict access to potentially sensitive data (kernels, config, mount
# points, private keys). The paths will be created if they don't exist and
# they must not be files.
TemporaryFileSystem=/boot:ro /etc/luks:ro /etc/ssh:ro /etc/ssl/private:ro /media:ro /mnt:ro /run:ro /srv:ro /var:ro
# Permit syslog(3) messages to journald
BindReadOnlyPaths=/run/systemd/journal/dev-log

Note
All settings marking mounts as read-only (e.g. ProtectSystem or ReadOnlyPaths) cannot protect mount points created after the service was started (see the systemd man page of ReadOnlyPaths for details).
All path-based restrictions (e.g. from the previous paragraph or TemporaryFileSystem) can be undone by a privileged process with the ability to perform mount syscalls. The CapabilityBoundingSet and SystemCallFilter settings above prevent this, but one should be aware of this potential issue.

When restricting existing services I use systemctl edit $service to create an override file with these settings (or I put the file manually at the appropriate place). This way my settings override the default restrictions of the service and are kept during system updates.
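
For illustration, the resulting override file could look like the sketch below. The unit name example.service is hypothetical. The path is where systemctl edit places drop-ins for system units, and the hardening settings belong in the [Service] section.

```ini
# /etc/systemd/system/example.service.d/override.conf
# (example.service is a hypothetical unit name)
[Service]
# Paste the complete hardening block from above here; two settings are
# shown as placeholders
NoNewPrivileges=yes
ProtectSystem=strict
```

After placing the file manually (instead of using systemctl edit), run systemctl daemon-reload and restart the service so the settings take effect.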

After this block of default options, specific settings can be changed or extended. For example, PrivateUsers is often too strict; adding PrivateUsers=no after this block will restore the default. Or, to permit access to keyring syscalls, one can add SystemCallFilter=@keyring. Having the default options first, followed by service-specific modifications, makes it easy to update the default settings of multiple service files.
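
A sketch of this layout, using the two modifications mentioned above, might look as follows. The comments are only a suggested way to separate the shared block from the per-service part.

```ini
[Service]
# --- default hardening block (identical in all service files) ---
# ... all settings from the snippet above ...

# --- service-specific modifications ---
# Restore the default because user namespacing is too strict here
PrivateUsers=no
# Add the keyring syscalls back to the allow list built above
SystemCallFilter=@keyring
```

Because later assignments override or extend earlier ones, the shared block can be replaced wholesale when the defaults change while the modifications at the end stay untouched.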

CapabilityBoundingSet restricts the capabilities (man 7 capabilities) of this service; setting it to empty removes all capabilities. Capabilities permit more fine-grained permissions, for example CAP_NET_RAW allows creating raw network sockets without being root.
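
As a small sketch, a service that needs raw sockets but no other capability could reset the bounding set and then add back only CAP_NET_RAW. This assumes the service really requires that capability.

```ini
[Service]
# An empty assignment resets the set and removes all capabilities
CapabilityBoundingSet=
# Later assignments are merged into the set: permit only raw sockets
CapabilityBoundingSet=CAP_NET_RAW
```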

LockPersonality prevents changing the “process execution domain” (man 2 personality), a rarely used feature with potential bugs.

MemoryDenyWriteExecute prevents memory mappings which are both writable and executable to hinder (simple) exploits.

NoNewPrivileges prevents the process from gaining any additional privileges during exec (man 2 execve), for example when running setuid or setcap programs.

Private* provides a separate instance of the named feature to the process. This way, devices (PrivateDevices), mounts (PrivateMounts), network interfaces (PrivateNetwork), /tmp/ and /var/tmp/ directories (PrivateTmp) and users (PrivateUsers) are isolated from the regular system and cannot be modified by the process. In the case of temporary directories this also protects the process against other users of the system: for example, TOCTOU (time-of-check to time-of-use) races in /tmp/ can no longer be used to attack the process. These settings use Linux’s namespaces (man 7 namespaces) to provide isolation.

Protect* restricts access to the named features. This prevents the process from modifying cgroups (man 7 cgroups, ProtectControlGroups), sysctls and other kernel tunables in /proc/ and /sys/ (ProtectKernelTunables), kernel modules (ProtectKernelModules) and the hostname (ProtectHostname). ProtectHome=yes (other values are possible) makes /home/, /root/ and /run/user/ inaccessible. ProtectSystem=strict (other values are possible) mounts the whole file system hierarchy read-only (except for /dev/, /proc/, /sys/; those are protected by PrivateDevices, ProtectKernelTunables, ProtectControlGroups). ReadWritePaths can be used to give write-access selectively. Most of these settings are also implemented using namespaces.
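
For example, a service that must write to a single state directory could combine ProtectSystem=strict with ReadWritePaths as sketched below. The path /var/lib/example is hypothetical. If /var/ is hidden with TemporaryFileSystem as in the snippet above, the directory must additionally be made visible with BindPaths.

```ini
[Service]
# Mount the whole file system hierarchy read-only ...
ProtectSystem=strict
# ... except for this one directory (hypothetical path)
ReadWritePaths=/var/lib/example
```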

Restrict* also restricts access to the named feature. This controls the available support for address families (RestrictAddressFamilies), namespaces (RestrictNamespaces), real-time scheduling (RestrictRealtime) and setting setuid/setgid bits on files/directories (RestrictSUIDSGID). Note that setting RestrictAddressFamilies to the empty value permits all address families! This is unlike other options where the empty value is the most restrictive.
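
As a sketch, a network service could extend the allow list as shown below. Which families are needed depends on the service and is an assumption here; such a service also needs PrivateNetwork=no to reach the real network.

```ini
[Service]
# Reset any earlier value (on its own this would permit all families)
RestrictAddressFamilies=
# Permit UNIX sockets (for syslog) plus IPv4 and IPv6
RestrictAddressFamilies=AF_UNIX AF_INET AF_INET6
```

To deny all families, the workaround hinted at in the snippet's comment would be the line RestrictAddressFamilies=AF_UNIX followed by RestrictAddressFamilies=~AF_UNIX, which should leave an empty allow list.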

SystemCallFilter restricts access to syscalls via seccomp (man 2 seccomp). First the setting is reset to the default (first line), then the systemd default set for services is permitted (second line), and finally additional syscalls which should not be necessary for most services are removed (third line). Two extra syscalls are blacklisted: userfaultfd, which can make timing-sensitive attacks easier to exploit, and mincore, which can leak kernel information.

TemporaryFileSystem mounts tmpfs (read-only with :ro suffix, other settings possible) over the specified directories. This is similar to InaccessiblePaths which also prevents access to the directory contents but TemporaryFileSystem permits nesting to give access to sub-directories. In the example this is used with BindReadOnlyPaths to permit logging to syslog. To give write access to sub-directories use BindPaths in combination with ReadWritePaths.
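
A minimal sketch of the write-access case: /var/ is replaced by an empty read-only tmpfs, and one sub-directory is bind-mounted back in and made writable. The path /var/lib/example is hypothetical and must exist on the host.

```ini
[Service]
ProtectSystem=strict
# Hide the contents of /var/ behind an empty read-only tmpfs
TemporaryFileSystem=/var:ro
# Make one sub-directory visible again inside the tmpfs ...
BindPaths=/var/lib/example
# ... and permit writing to it despite ProtectSystem=strict
ReadWritePaths=/var/lib/example
```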

This restrictive use of TemporaryFileSystem is especially important for privileged processes, which still have access to all root-owned files even with all the other restrictions from above. As this often includes private keys, restricting access via TemporaryFileSystem is very useful.
