os/linux/seccomp.zig - Zig 0.15.0-master standard library

API bits for the Secure Computing facility in the Linux kernel, which allows processes to restrict access to the system call API.

Seccomp started life with a single "strict" mode, which only allowed calls to read(2), write(2), _exit(2) and sigreturn(2). This turned out to be too restrictive for general-purpose applications, so a mode that uses user-supplied filters was added.

Seccomp filters are classic BPF programs. Conceptually, a seccomp program is attached to the kernel and is executed on each syscall. The "packet" being validated is the `data` structure, and the verdict is an action that the kernel performs on the calling process. The actions are variations on a "pass" or "fail" result, where a pass allows the syscall to continue and a fail blocks the syscall and returns some sort of error value. See the full list of actions under ::RET for more information. Finally, only word-sized, absolute loads (`ld [k]`) are supported to read from the `data` structure.

There are some issues with the filter API that have traditionally made writing filters a pain:

1. Each CPU architecture supported by Linux has its own unique ABI and syscall API. It is not guaranteed that the syscall numbers and arguments are the same across architectures, or that they're even implemented. Thus, filters cannot be assumed to be portable without consulting documentation like syscalls(2) and testing on target hardware. This also requires checking the value of `data.arch` to make sure that a filter was compiled for the correct architecture.
2. Many syscalls take an `unsigned long` or `size_t` argument, the size of which is dependent on the ABI. Since BPF programs execute in a 32-bit machine, validation of 64-bit arguments necessitates two load-and-compare instructions for the upper and lower words.
3. A further wrinkle to the above is endianness. Unlike network packets, syscall data shares the endianness of the target machine. A filter compiled on a little-endian machine will not work on a big-endian one, and vice-versa. For example: checking the upper 32 bits of `data.arg1` requires a load at `@offsetOf(data, "arg1")` on big-endian systems and `@offsetOf(data, "arg1") + 4` on little-endian systems. Endian-portable filters require adjusting these offsets at compile time, similar to how e.g. OpenSSH does[1].
4. Syscalls with userspace implementations via the vDSO cannot be traced or filtered. The vDSO can be disabled or just ignored, which must be taken into account when writing filters.
5. Software libraries - especially dynamically loaded ones - tend to use more of the syscall API over time, thus filters must evolve with them. Static filters can result in reduced or even broken functionality when calling newer code from these libraries. This is known to happen with critical libraries like glibc[2].

Some of these issues can be mitigated with help from Zig and the standard library. Since the target CPU is known at compile time, the proper syscall numbers are mixed into the `os` namespace under `std.os.SYS` (see the code for `arch_bits` in `os/linux.zig`). Referencing an unimplemented syscall would be a compile error. Endian offsets can also be defined in a similar manner to the OpenSSH example:

```zig
// Endianness of the compilation target.
const native_endian = @import("builtin").cpu.arch.endian();

const offset = if (native_endian == .little) struct {
    pub const low = 0;
    pub const high = @sizeOf(u32);
} else struct {
    pub const low = @sizeOf(u32);
    pub const high = 0;
};
```

Unfortunately, there is no easy solution for issue 5. The most reliable strategy is to keep testing: test newer Zig versions, different libcs, different distros, and design your filter to accommodate all of them. Alternatively, you could inject a filter at runtime. Since filters are preserved across execve(2), a filter could be set up before executing your program, without your program having any knowledge of this happening. This is the method used by systemd[3] and Cloudflare's sandbox library[4].

[1]: https://github.com/openssh/openssh-portable/blob/master/sandbox-seccomp-filter.c#L81
[2]: https://sourceware.org/legacy-ml/libc-alpha/2017-11/msg00246.html
[3]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=
[4]: https://github.com/cloudflare/sandbox

See Also
- seccomp(2), seccomp_unotify(2)
- https://www.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html

const IOCTL = @import("ioctl.zig");
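
Issues 2 and 3 in the overview can be illustrated with a rough sketch that computes the two load offsets and the two 32-bit immediates needed to check one 64-bit argument. The helper names `argLoadOffsets` and `splitExpected` are hypothetical and not part of this file, and `data` is a local copy of the layout so that the example stands alone.

```zig
const std = @import("std");
const builtin = @import("builtin");

// Local copy of the `data` layout from this file (assumption: kept in sync by hand).
const data = extern struct {
    nr: c_int,
    arch: u32,
    instruction_pointer: u64,
    arg0: u64,
    arg1: u64,
    arg2: u64,
    arg3: u64,
    arg4: u64,
    arg5: u64,
};

// Byte offsets of the low and high 32-bit words inside a 64-bit field.
const offset = if (builtin.cpu.arch.endian() == .little) struct {
    pub const low = 0;
    pub const high = @sizeOf(u32);
} else struct {
    pub const low = @sizeOf(u32);
    pub const high = 0;
};

/// Hypothetical helper: the `k` operands for the two `ld [k]` loads that read
/// the low and high words of the named 64-bit field of `data`.
fn argLoadOffsets(comptime field: []const u8) struct { low: u32, high: u32 } {
    const base = @offsetOf(data, field);
    return .{ .low = base + offset.low, .high = base + offset.high };
}

/// Hypothetical helper: splits the expected 64-bit value into the two
/// immediates that the compare instructions would test against.
fn splitExpected(value: u64) struct { low: u32, high: u32 } {
    const low: u32 = @truncate(value);
    const high: u32 = @truncate(value >> 32);
    return .{ .low = low, .high = high };
}
```

A filter would then load `argLoadOffsets("arg1").high`, compare it with `splitExpected(x).high`, and repeat for the low word. Because the offsets are resolved at compile time, the same source builds a correct filter for either endianness.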
// Modes for the prctl(2) form `prctl(PR_SET_SECCOMP, mode)`
pub const MODE = struct {
    /// Seccomp not in use.
    pub const DISABLED = 0;
    /// Uses a hard-coded filter.
    pub const STRICT = 1;
    /// Uses a user-supplied filter.
    pub const FILTER = 2;
};

// Operations for the seccomp(2) form `seccomp(operation, flags, args)`
pub const SET_MODE_STRICT = 0;
pub const SET_MODE_FILTER = 1;
pub const GET_ACTION_AVAIL = 2;
pub const GET_NOTIF_SIZES = 3;

/// Bitflags for the SET_MODE_FILTER operation.
pub const FILTER_FLAG = struct {
    pub const TSYNC = 1 << 0;
    pub const LOG = 1 << 1;
    pub const SPEC_ALLOW = 1 << 2;
    pub const NEW_LISTENER = 1 << 3;
    pub const TSYNC_ESRCH = 1 << 4;
};
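
Since `FILTER_FLAG` values are bitflags, they are combined with bitwise OR before being passed as the `flags` argument of `seccomp(SET_MODE_FILTER, flags, args)`. The following is a minimal sketch using a local copy of the values; `flagsAreKnown` is a hypothetical helper, not part of this file.

```zig
const std = @import("std");

// Local copy of the flag values listed in this file.
const FILTER_FLAG = struct {
    pub const TSYNC = 1 << 0;
    pub const LOG = 1 << 1;
    pub const SPEC_ALLOW = 1 << 2;
    pub const NEW_LISTENER = 1 << 3;
    pub const TSYNC_ESRCH = 1 << 4;
};

/// Hypothetical helper: true if `flags` contains only bits defined above.
fn flagsAreKnown(flags: u32) bool {
    const known: u32 = FILTER_FLAG.TSYNC | FILTER_FLAG.LOG | FILTER_FLAG.SPEC_ALLOW |
        FILTER_FLAG.NEW_LISTENER | FILTER_FLAG.TSYNC_ESRCH;
    return (flags & ~known) == 0;
}

// Example combination: request logging and a user-notification listener.
const example_flags: u32 = FILTER_FLAG.LOG | FILTER_FLAG.NEW_LISTENER;
```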
/// Action values for seccomp BPF programs.
/// The lower 16-bits are for optional return data.
/// The upper 16-bits are ordered from least permissive values to most.
pub const RET = struct {
    /// Kill the process.
    pub const KILL_PROCESS = 0x80000000;
    /// Kill the thread.
    pub const KILL_THREAD = 0x00000000;
    pub const KILL = KILL_THREAD;
    /// Disallow and force a SIGSYS.
    pub const TRAP = 0x00030000;
    /// Return an errno.
    pub const ERRNO = 0x00050000;
    /// Forward the syscall to a userspace supervisor to make a decision.
    pub const USER_NOTIF = 0x7fc00000;
    /// Pass to a tracer or disallow.
    pub const TRACE = 0x7ff00000;
    /// Allow after logging.
    pub const LOG = 0x7ffc0000;
    /// Allow.
    pub const ALLOW = 0x7fff0000;

    // Masks for the return value sections.
    pub const ACTION_FULL = 0xffff0000;
    pub const ACTION = 0x7fff0000;
    pub const DATA = 0x0000ffff;
};
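
The split between the action in the upper 16 bits and the optional data in the lower 16 bits can be sketched as follows. This is only an illustration with a local copy of the relevant values; `retErrno` and `splitRet` are hypothetical helpers, not part of this file.

```zig
const std = @import("std");

// Local copy of the `RET` values that this sketch needs.
const RET = struct {
    pub const ERRNO = 0x00050000;
    pub const ALLOW = 0x7fff0000;
    pub const ACTION_FULL = 0xffff0000;
    pub const DATA = 0x0000ffff;
};

/// Hypothetical helper: builds a filter return value that fails the syscall
/// with the given errno, carried in the lower 16 bits.
fn retErrno(errno: u16) u32 {
    return RET.ERRNO | @as(u32, errno);
}

/// Hypothetical helper: splits a filter return value into action and data.
fn splitRet(value: u32) struct { action: u32, payload: u16 } {
    const payload: u16 = @truncate(value & RET.DATA);
    return .{ .action = value & RET.ACTION_FULL, .payload = payload };
}
```

For example, `retErrno(1)` yields `0x00050001`, which `splitRet` separates back into the `ERRNO` action and a payload of 1 (EPERM on Linux).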
pub const IOCTL_NOTIF = struct {
    pub const RECV = IOCTL.IOWR('!', 0, notif);
    pub const SEND = IOCTL.IOWR('!', 1, notif_resp);
    pub const ID_VALID = IOCTL.IOW('!', 2, u64);
    pub const ADDFD = IOCTL.IOW('!', 3, notif_addfd);
};

/// Tells the kernel that the supervisor allows the syscall to continue.
pub const USER_NOTIF_FLAG_CONTINUE = 1 << 0;

/// See seccomp_unotify(2).
pub const ADDFD_FLAG = struct {
    pub const SETFD = 1 << 0;
    pub const SEND = 1 << 1;
};

pub const data = extern struct {
    /// The system call number.
    nr: c_int,
    /// The CPU architecture/system call convention.
    /// One of the values defined in `std.os.linux.AUDIT`.
    arch: u32,
    instruction_pointer: u64,
    arg0: u64,
    arg1: u64,
    arg2: u64,
    arg3: u64,
    arg4: u64,
    arg5: u64,
};
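
To show how a filter reads the `data` structure, here is a minimal sketch of a classic BPF program that checks `data.arch`, then allows a single syscall number and fails everything else with EPERM. The `sock_filter` struct and the opcode constants mirror the kernel's UAPI headers but are defined locally as assumptions; `allowOnly` is a hypothetical helper and not part of this file.

```zig
const std = @import("std");

// Classic BPF instruction, mirroring the kernel's `struct sock_filter`.
// Assumption: defined locally for the sketch.
const sock_filter = extern struct { code: u16, jt: u8, jf: u8, k: u32 };

// Classic BPF opcode values from the kernel's UAPI headers.
const LD_W_ABS: u16 = 0x20; // BPF_LD | BPF_W | BPF_ABS
const JEQ_K: u16 = 0x15; // BPF_JMP | BPF_JEQ | BPF_K
const RET_K: u16 = 0x06; // BPF_RET | BPF_K

// Offsets into `data`: `nr` is the first field, `arch` follows the 4-byte `nr`.
const off_nr = 0;
const off_arch = 4;

// Action values copied from `RET`.
const KILL_PROCESS = 0x80000000;
const ERRNO = 0x00050000;
const ALLOW = 0x7fff0000;

/// Hypothetical helper: builds a filter that kills the process on an
/// unexpected architecture, allows syscall `nr`, and returns EPERM otherwise.
fn allowOnly(arch: u32, nr: u32) [7]sock_filter {
    return .{
        // 0: load data.arch
        .{ .code = LD_W_ABS, .jt = 0, .jf = 0, .k = off_arch },
        // 1: if it matches fall through, else jump to instruction 6
        .{ .code = JEQ_K, .jt = 0, .jf = 4, .k = arch },
        // 2: load data.nr
        .{ .code = LD_W_ABS, .jt = 0, .jf = 0, .k = off_nr },
        // 3: if it matches go to instruction 4, else to instruction 5
        .{ .code = JEQ_K, .jt = 0, .jf = 1, .k = nr },
        // 4: allow the syscall
        .{ .code = RET_K, .jt = 0, .jf = 0, .k = ALLOW },
        // 5: fail the syscall with EPERM (1)
        .{ .code = RET_K, .jt = 0, .jf = 0, .k = ERRNO | 1 },
        // 6: wrong architecture
        .{ .code = RET_K, .jt = 0, .jf = 0, .k = KILL_PROCESS },
    };
}
```

The `arch` argument would be one of the values in `std.os.linux.AUDIT`, and `nr` a syscall number for that same architecture. Jump offsets in classic BPF are relative to the instruction after the jump, which is why `jf = 4` at index 1 lands on index 6.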
/// Used with the ::GET_NOTIF_SIZES command to check if the kernel structures
/// have changed.
pub const notif_sizes = extern struct {
    /// Size of ::notif.
    notif: u16,
    /// Size of ::notif_resp.
    notif_resp: u16,
    /// Size of ::data.
    data: u16,
};
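
A check of the kind `notif_sizes` is meant for can be sketched as a comparison between the sizes reported by the kernel and the sizes of the structures the program was compiled against. The layouts below are local copies so the example stands alone, and `sizesMatch` is a hypothetical helper; obtaining `reported` through the GET_NOTIF_SIZES operation is left out.

```zig
const std = @import("std");

// Local copies of the layouts from this file (assumption: kept in sync by hand).
const data = extern struct {
    nr: c_int,
    arch: u32,
    instruction_pointer: u64,
    arg0: u64,
    arg1: u64,
    arg2: u64,
    arg3: u64,
    arg4: u64,
    arg5: u64,
};
const notif = extern struct { id: u64, pid: u32, flags: u32, data: data };
const notif_resp = extern struct { id: u64, val: i64, @"error": i32, flags: u32 };
const notif_sizes = extern struct { notif: u16, notif_resp: u16, data: u16 };

/// Hypothetical helper: true if the sizes reported by the kernel equal the
/// sizes of the structures this program was compiled against.
fn sizesMatch(reported: notif_sizes) bool {
    return reported.notif == @sizeOf(notif) and
        reported.notif_resp == @sizeOf(notif_resp) and
        reported.data == @sizeOf(data);
}
```

How to react to a mismatch is a policy decision for the caller; this sketch only detects it.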
pub const notif = extern struct {
    /// Unique notification cookie for each filter.
    id: u64,
    /// ID of the thread that triggered the notification.
    pid: u32,
    /// Bitmask for event information. Currently set to zero.
    flags: u32,
    /// The current system call data.
    data: data,
};

/// The decision payload the supervisor process sends to the kernel.
pub const notif_resp = extern struct {
    /// The filter cookie.
    id: u64,
    /// The return value for a spoofed syscall.
    val: i64,
    /// Set to zero for a spoofed success or a negative error number for a
    /// failure.
    @"error": i32,
    /// Bitmask containing the decision. Either USER_NOTIF_FLAG_CONTINUE to
    /// allow the syscall or zero to spoof the return values.
    flags: u32,
};
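
The two kinds of decision described by the `notif_resp` fields can be sketched as two small constructors. `respContinue` and `respError` are hypothetical helpers, not part of this file, and the struct is a local copy; sending the response through the `IOCTL_NOTIF.SEND` ioctl is not shown.

```zig
const std = @import("std");

// Local copies of the declarations from this file.
const USER_NOTIF_FLAG_CONTINUE = 1 << 0;
const notif_resp = extern struct { id: u64, val: i64, @"error": i32, flags: u32 };

/// Hypothetical helper: lets the syscall identified by `id` run unchanged.
/// `val` and `error` are left at zero because nothing is being spoofed.
fn respContinue(id: u64) notif_resp {
    return .{ .id = id, .val = 0, .@"error" = 0, .flags = USER_NOTIF_FLAG_CONTINUE };
}

/// Hypothetical helper: spoofs a failure with the given errno. The `error`
/// field carries the negative error number, and `flags` stays zero.
fn respError(id: u64, errno: u16) notif_resp {
    return .{ .id = id, .val = 0, .@"error" = -@as(i32, errno), .flags = 0 };
}
```

In both cases `id` is the cookie copied from the `notif` that the supervisor received.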
pub const notif_addfd = extern struct {
    id: u64,
    flags: u32,
    srcfd: u32,
    newfd: u32,
    newfd_flags: u32,
};
Generated by zstd-live on 2025-10-12 02:30:37 UTC.
