systemd provides many configuration settings to reduce the privileges of a
service and restrict its access, hardening it against potential
vulnerabilities. However, these settings are scattered throughout the
documentation, which makes them harder to find than necessary. In addition,
the commonly suggested settings are not enough to restrict privileged
processes (running as root or with special capabilities) because they can
still access sensitive files like private keys (e.g. in /etc/) or sockets
(e.g. in /run/).
Note: One remaining limitation of this setup is that privileged processes can
still send signals to any other process. If possible, don’t run services as
root but as a separate user (User, Group). Required capabilities can be
granted even to non-root processes with AmbientCapabilities.
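As a minimal sketch of the advice in this note, a service that only needs to bind a port below 1024 could run under its own account. The account name exampled is hypothetical and assumed to exist already, and CAP_NET_BIND_SERVICE is just one illustrative capability.

```ini
[Service]
# Hypothetical dedicated account, assumed to exist on the system
User=exampled
Group=exampled
# Grant only the capability needed to bind ports below 1024
AmbientCapabilities=CAP_NET_BIND_SERVICE
# Keep the bounding set limited to the same capability
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
```

Because assigning the empty value resets CapabilityBoundingSet, these lines only take effect if they come after any line that clears it.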
The following configuration snippet is a collection of all relevant hardening
options I could find, followed by a short explanation of what they do and how
they are useful; see the systemd man pages (man systemd.directives) for
details. These settings have worked since at least Debian Buster (systemd
241), except where otherwise noted.
systemd provides the systemd-analyze security command to check whether a
service is restricted. It does not take all possible hardening settings into
account but gives a good overview of which services require further hardening.
CapabilityBoundingSet=
KeyringMode=private
LockPersonality=yes
MemoryDenyWriteExecute=yes
NoNewPrivileges=yes
PrivateDevices=yes
PrivateMounts=yes
PrivateNetwork=yes
PrivateTmp=yes
PrivateUsers=yes
ProtectClock=true
ProtectControlGroups=yes
ProtectHome=yes
ProtectHostname=yes
ProtectKernelLogs=true
ProtectKernelModules=yes
ProtectKernelTunables=yes
ProtectProc=invisible
ProtectSystem=strict
# Permit AF_UNIX for syslog(3) to help debugging. (Empty setting permits all
# families! A possible workaround would be to blacklist AF_UNIX afterwards.)
RestrictAddressFamilies=
RestrictAddressFamilies=AF_UNIX
RestrictNamespaces=yes
RestrictRealtime=yes
RestrictSUIDSGID=yes
SystemCallArchitectures=native
SystemCallFilter=
SystemCallFilter=@system-service
SystemCallFilter=~@aio @chown @clock @cpu-emulation @debug @keyring @memlock @module @mount @obsolete @privileged @raw-io @reboot @resources @setuid @swap userfaultfd mincore
# Restrict access to potentially sensitive data (kernels, config, mount points,
# private keys). The paths will be created if they don't exist and they must
# not be files.
TemporaryFileSystem=/boot:ro /etc/luks:ro /etc/ssh:ro /etc/ssl/private:ro /media:ro /mnt:ro /run:ro /srv:ro /var:ro
# Permit syslog(3) messages to journald
BindReadOnlyPaths=/run/systemd/journal/dev-log
Note: All settings marking mounts as read-only (e.g. ProtectSystem or
ReadOnlyPaths) cannot protect mount points created after the service was
started (see the systemd man page of ReadOnlyPaths for details).
All path-based restrictions (e.g. from the previous paragraph or
TemporaryFileSystem) can be undone by a privileged process with the ability
to perform mount syscalls. The CapabilityBoundingSet and SystemCallFilter
settings above prevent this, but one should be aware of this potential issue.
When restricting existing services, I use systemctl edit $service to create
an override file with these settings (or I put the file manually in the
appropriate place). This way my settings override the default restrictions of
the service and are kept during system updates.
After this block of default options, specific settings can be changed or
extended. For example, PrivateUsers is often too strict; adding
PrivateUsers=no after this block restores the default. Or, to permit access
to keyring syscalls, one can add SystemCallFilter=@keyring. Having the
default options first, followed by service-specific modifications, makes it
easy to update the default settings of multiple service files.
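A sketch of how such an override file could be laid out, using the two adjustments mentioned above. The first comment is a placeholder for the full default block, not literal content.

```ini
[Service]
# (default hardening block goes here, unchanged across services)

# Service-specific adjustments, placed after the defaults
# Restore the default because PrivateUsers=yes is too strict for this service
PrivateUsers=no
# Re-allow the keyring syscalls that the default block removed
SystemCallFilter=@keyring
```

Keeping the adjustments in one clearly separated tail section means the default block can be replaced wholesale when it changes.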
CapabilityBoundingSet restricts the capabilities (man 7 capabilities) of
this service; setting it to empty removes all capabilities. Capabilities
permit more fine-grained permissions, for example CAP_NET_RAW allows
creating raw network sockets without being root.
LockPersonality prevents changing the “process execution domain” (man 2
personality), a rarely used feature with potential bugs.
MemoryDenyWriteExecute prevents creating memory mappings that are both
writable and executable, which hinders (simple) exploits.
NoNewPrivileges prevents the process from gaining any additional privileges
during exec (man 2 execve), for example when running setuid or setcap
programs.
Private* provides a separate instance of the named feature to the process.
This way, devices (PrivateDevices), mounts (PrivateMounts), network
interfaces (PrivateNetwork), the /tmp/ and /var/tmp/ directories (PrivateTmp)
and users (PrivateUsers) are isolated from the regular system and cannot be
modified by the process. In the case of temporary directories, this also
protects the process against other users of the system because, for example,
TOCTOU (time-of-check to time-of-use) races in /tmp/ can no longer be used
to attack the process. These settings use Linux’s namespaces (man 7
namespaces) to provide isolation.
Protect* restricts access to the named features. This prevents the process
from modifying cgroups (man 7 cgroups, ProtectControlGroups), sysctls and
other kernel tunables in /proc/ and /sys/ (ProtectKernelTunables), kernel
modules (ProtectKernelModules) and the hostname (ProtectHostname).
ProtectHome=yes (other values are possible) makes /home/, /root/ and
/run/user/ inaccessible. ProtectSystem=strict (other values are possible)
mounts the whole file system hierarchy read-only (except for /dev/, /proc/
and /sys/; those are protected by PrivateDevices, ProtectKernelTunables and
ProtectControlGroups). ReadWritePaths can be used to give write access
selectively. Most of these settings are also implemented using namespaces.
Restrict* also restricts access to the named feature. This controls the
available support for address families (RestrictAddressFamilies), namespaces
(RestrictNamespaces), real-time scheduling (RestrictRealtime) and setting
setuid/setgid bits on files/directories (RestrictSUIDSGID). Note that setting
RestrictAddressFamilies to the empty value permits all address families!
This is unlike other options where the empty value is the most restrictive.
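For instance, a service that needs TCP/IP networking could extend the defaults as sketched below, assuming these lines are placed after the default block. Repeated RestrictAddressFamilies assignments should be merged, so AF_UNIX stays permitted, but it is worth verifying this against the systemd.exec man page for the systemd version in use.

```ini
[Service]
# The default block sets PrivateNetwork=yes, which leaves only a loopback device
PrivateNetwork=no
# Added to the AF_UNIX entry from the default block (non-empty values accumulate)
RestrictAddressFamilies=AF_INET AF_INET6
```

Writing an empty RestrictAddressFamilies= line here instead would lift the restriction entirely, which is the pitfall described above.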
SystemCallFilter restricts access to syscalls via seccomp (man 2 seccomp).
First the setting is reset to the default (first line), then the systemd
defaults for services are permitted (second line), followed by the removal of
additional syscalls which should not be necessary for most services (third
line). Two extra syscalls are blacklisted: userfaultfd, which can be used to
help exploit timing-sensitive attacks, and mincore, which can leak kernel
information.
TemporaryFileSystem mounts tmpfs (read-only with the :ro suffix, other
settings possible) over the specified directories. This is similar to
InaccessiblePaths, which also prevents access to the directory contents, but
TemporaryFileSystem permits nesting to give access to sub-directories. In
the example this is used with BindReadOnlyPaths to permit logging to syslog.
To give write access to sub-directories, use BindPaths in combination with
ReadWritePaths.
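A sketch of this combination for a service that needs to write its own state, assuming the lines are placed after the default block. The directory /var/lib/example is hypothetical and is assumed to exist on the host.

```ini
[Service]
# /var is hidden behind a read-only tmpfs by TemporaryFileSystem=/var:ro;
# bind the real directory back into the service's view
BindPaths=/var/lib/example
# Exempt it from the read-only file system set up by ProtectSystem=strict
ReadWritePaths=/var/lib/example
```

Everything else below /var remains invisible to the service, so only the one directory it actually needs is exposed.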
This restrictive use of TemporaryFileSystem is especially important for
privileged processes, which still have access to all root-owned files even
with all the other restrictions from above. As this often includes private
keys, restricting access via TemporaryFileSystem is very useful.
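If a service legitimately needs one of the hidden keys, it can be bound back in read-only, in the same way the configuration above re-exposes the journald socket below /run. This is only a sketch: the file name example.key is hypothetical, and the line is assumed to come after the default block.

```ini
[Service]
# /etc/ssl/private is covered by a read-only tmpfs in the default block;
# expose only the single key this service needs
BindReadOnlyPaths=/etc/ssl/private/example.key
```

All other files in /etc/ssl/private stay hidden, even if the service runs as root.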