[ghak90,V6,02/10] audit: add container id

Submitted by Richard Guy Briggs on April 9, 2019, 3:39 a.m.

Details

Message ID 9edad39c40671fb53f28d76862304cc2647029c6.1554732921.git.rgb@redhat.com
State New
Series "audit: implement container identifier"
Headers show

Commit Message

Richard Guy Briggs April 9, 2019, 3:39 a.m.
Implement the proc fs write to set the audit container identifier of a
process, emitting an AUDIT_CONTAINER_OP record to document the event.

This is a write from the container orchestrator task to a proc entry of
the form /proc/PID/audit_containerid where PID is the process ID of the
newly created task that is to become the first task in a container, or
an additional task added to a container.

The write expects up to a u64 value (unset: 18446744073709551615).

The writer must have capability CAP_AUDIT_CONTROL.

This will produce a record such as this:
  type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes

The "op" field indicates an initial set.  The "pid" to "ses" fields are
the orchestrator while the "opid" field is the object's PID, the process
being "contained".  New and old audit container identifier values are
given in the "contid" fields, while res indicates its success.

It is not permitted to unset the audit container identifier.
A child inherits its parent's audit container identifier.

Please see the github audit kernel issue for the main feature:
  https://github.com/linux-audit/audit-kernel/issues/90
Please see the github audit userspace issue for supporting additions:
  https://github.com/linux-audit/audit-userspace/issues/51
Please see the github audit testsuiite issue for the test case:
  https://github.com/linux-audit/audit-testsuite/issues/64
Please see the github audit wiki for the feature overview:
  https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Acked-by: Serge Hallyn <serge@hallyn.com>
Acked-by: Steve Grubb <sgrubb@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
---
 fs/proc/base.c             | 36 ++++++++++++++++++++++++
 include/linux/audit.h      | 25 +++++++++++++++++
 include/uapi/linux/audit.h |  2 ++
 kernel/audit.c             | 69 ++++++++++++++++++++++++++++++++++++++++++++++
 kernel/audit.h             |  1 +
 kernel/auditsc.c           |  4 +++
 6 files changed, 137 insertions(+)

Patch hide | download patch | download mbox

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ddef482f1334..43fd0c4b87de 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1294,6 +1294,40 @@  static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
 	.read		= proc_sessionid_read,
 	.llseek		= generic_file_llseek,
 };
+
+static ssize_t proc_contid_write(struct file *file, const char __user *buf,
+				   size_t count, loff_t *ppos)
+{
+	struct inode *inode = file_inode(file);
+	u64 contid;
+	int rv;
+	struct task_struct *task = get_proc_task(inode);
+
+	if (!task)
+		return -ESRCH;
+	if (*ppos != 0) {
+		/* No partial writes. */
+		put_task_struct(task);
+		return -EINVAL;
+	}
+
+	rv = kstrtou64_from_user(buf, count, 10, &contid);
+	if (rv < 0) {
+		put_task_struct(task);
+		return rv;
+	}
+
+	rv = audit_set_contid(task, contid);
+	put_task_struct(task);
+	if (rv < 0)
+		return rv;
+	return count;
+}
+
+static const struct file_operations proc_contid_operations = {
+	.write		= proc_contid_write,
+	.llseek		= generic_file_llseek,
+};
 #endif
 
 #ifdef CONFIG_FAULT_INJECTION
@@ -3033,6 +3067,7 @@  static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
 #ifdef CONFIG_AUDIT
 	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
+	REG("audit_containerid", S_IWUSR, proc_contid_operations),
 #endif
 #ifdef CONFIG_FAULT_INJECTION
 	REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
@@ -3431,6 +3466,7 @@  static int proc_tid_comm_permission(struct inode *inode, int mask)
 #ifdef CONFIG_AUDIT
 	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
 	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
+	REG("audit_containerid", S_IWUSR, proc_contid_operations),
 #endif
 #ifdef CONFIG_FAULT_INJECTION
 	REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
diff --git a/include/linux/audit.h b/include/linux/audit.h
index bde346e73f0c..301337776193 100644
--- a/include/linux/audit.h
+++ b/include/linux/audit.h
@@ -89,6 +89,7 @@  struct audit_field {
 struct audit_task_info {
 	kuid_t			loginuid;
 	unsigned int		sessionid;
+	u64			contid;
 #ifdef CONFIG_AUDITSYSCALL
 	struct audit_context	*ctx;
 #endif
@@ -189,6 +190,15 @@  static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
 	return tsk->audit->sessionid;
 }
 
+extern int audit_set_contid(struct task_struct *tsk, u64 contid);
+
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+	if (!tsk->audit)
+		return AUDIT_CID_UNSET;
+	return tsk->audit->contid;
+}
+
 extern u32 audit_enabled;
 #else /* CONFIG_AUDIT */
 static inline int audit_alloc(struct task_struct *task)
@@ -250,6 +260,11 @@  static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
 	return AUDIT_SID_UNSET;
 }
 
+static inline u64 audit_get_contid(struct task_struct *tsk)
+{
+	return AUDIT_CID_UNSET;
+}
+
 #define audit_enabled AUDIT_OFF
 #endif /* CONFIG_AUDIT */
 
@@ -606,6 +621,16 @@  static inline bool audit_loginuid_set(struct task_struct *tsk)
 	return uid_valid(audit_get_loginuid(tsk));
 }
 
+static inline bool audit_contid_valid(u64 contid)
+{
+	return contid != AUDIT_CID_UNSET;
+}
+
+static inline bool audit_contid_set(struct task_struct *tsk)
+{
+	return audit_contid_valid(audit_get_contid(tsk));
+}
+
 static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
 {
 	audit_log_n_string(ab, buf, strlen(buf));
diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
index 3901c51c0b93..4a6a8bf1de32 100644
--- a/include/uapi/linux/audit.h
+++ b/include/uapi/linux/audit.h
@@ -71,6 +71,7 @@ 
 #define AUDIT_TTY_SET		1017	/* Set TTY auditing status */
 #define AUDIT_SET_FEATURE	1018	/* Turn an audit feature on or off */
 #define AUDIT_GET_FEATURE	1019	/* Get which features are enabled */
+#define AUDIT_CONTAINER_OP	1020	/* Define the container id and info */
 
 #define AUDIT_FIRST_USER_MSG	1100	/* Userspace messages mostly uninteresting to kernel */
 #define AUDIT_USER_AVC		1107	/* We filter this differently */
@@ -485,6 +486,7 @@  struct audit_tty_status {
 
 #define AUDIT_UID_UNSET (unsigned int)-1
 #define AUDIT_SID_UNSET ((unsigned int)-1)
+#define AUDIT_CID_UNSET ((u64)-1)
 
 /* audit_rule_data supports filter rules with both integer and string
  * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
diff --git a/kernel/audit.c b/kernel/audit.c
index 3fb09783cd4a..182b0f2c183d 100644
--- a/kernel/audit.c
+++ b/kernel/audit.c
@@ -244,6 +244,7 @@  int audit_alloc(struct task_struct *tsk)
 	}
 	info->loginuid = audit_get_loginuid(current);
 	info->sessionid = audit_get_sessionid(current);
+	info->contid = audit_get_contid(current);
 	tsk->audit = info;
 
 	ret = audit_alloc_syscall(tsk);
@@ -258,6 +259,7 @@  int audit_alloc(struct task_struct *tsk)
 struct audit_task_info init_struct_audit = {
 	.loginuid = INVALID_UID,
 	.sessionid = AUDIT_SID_UNSET,
+	.contid = AUDIT_CID_UNSET,
 #ifdef CONFIG_AUDITSYSCALL
 	.ctx = NULL,
 #endif
@@ -2341,6 +2343,73 @@  int audit_set_loginuid(kuid_t loginuid)
 }
 
 /**
+ * audit_set_contid - set current task's audit contid
+ * @contid: contid value
+ *
+ * Returns 0 on success, -EPERM on permission failure.
+ *
+ * Called (set) from fs/proc/base.c::proc_contid_write().
+ */
+int audit_set_contid(struct task_struct *task, u64 contid)
+{
+	u64 oldcontid;
+	int rc = 0;
+	struct audit_buffer *ab;
+	uid_t uid;
+	struct tty_struct *tty;
+	char comm[sizeof(current->comm)];
+
+	task_lock(task);
+	/* Can't set if audit disabled */
+	if (!task->audit) {
+		task_unlock(task);
+		return -ENOPROTOOPT;
+	}
+	oldcontid = audit_get_contid(task);
+	read_lock(&tasklist_lock);
+	/* Don't allow the audit containerid to be unset */
+	if (!audit_contid_valid(contid))
+		rc = -EINVAL;
+	/* if we don't have caps, reject */
+	else if (!capable(CAP_AUDIT_CONTROL))
+		rc = -EPERM;
+	/* if task has children or is not single-threaded, deny */
+	else if (!list_empty(&task->children))
+		rc = -EBUSY;
+	else if (!(thread_group_leader(task) && thread_group_empty(task)))
+		rc = -EALREADY;
+	read_unlock(&tasklist_lock);
+	if (!rc)
+		task->audit->contid = contid;
+	task_unlock(task);
+
+	if (!audit_enabled)
+		return rc;
+
+	ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
+	if (!ab)
+		return rc;
+
+	uid = from_kuid(&init_user_ns, task_uid(current));
+	tty = audit_get_tty();
+	audit_log_format(ab,
+			 "op=set opid=%d contid=%llu old-contid=%llu pid=%d uid=%u auid=%u tty=%s ses=%u",
+			 task_tgid_nr(task), contid, oldcontid,
+			 task_tgid_nr(current), uid,
+			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
+			 tty ? tty_name(tty) : "(none)",
+			 audit_get_sessionid(current));
+	audit_put_tty(tty);
+	audit_log_task_context(ab);
+	audit_log_format(ab, " comm=");
+	audit_log_untrustedstring(ab, get_task_comm(comm, current));
+	audit_log_d_path_exe(ab, current->mm);
+	audit_log_format(ab, " res=%d", !rc);
+	audit_log_end(ab);
+	return rc;
+}
+
+/**
  * audit_log_end - end one audit record
  * @ab: the audit_buffer
  *
diff --git a/kernel/audit.h b/kernel/audit.h
index c00e2ee3c6b3..e2912924af0d 100644
--- a/kernel/audit.h
+++ b/kernel/audit.h
@@ -148,6 +148,7 @@  struct audit_context {
 	kuid_t		    target_uid;
 	unsigned int	    target_sessionid;
 	u32		    target_sid;
+	u64		    target_cid;
 	char		    target_comm[TASK_COMM_LEN];
 
 	struct audit_tree_refs *trees, *first_trees;
diff --git a/kernel/auditsc.c b/kernel/auditsc.c
index fd7ca983de4f..1f7edf035b16 100644
--- a/kernel/auditsc.c
+++ b/kernel/auditsc.c
@@ -113,6 +113,7 @@  struct audit_aux_data_pids {
 	kuid_t			target_uid[AUDIT_AUX_PIDS];
 	unsigned int		target_sessionid[AUDIT_AUX_PIDS];
 	u32			target_sid[AUDIT_AUX_PIDS];
+	u64			target_cid[AUDIT_AUX_PIDS];
 	char 			target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
 	int			pid_count;
 };
@@ -2368,6 +2369,7 @@  void __audit_ptrace(struct task_struct *t)
 	context->target_uid = task_uid(t);
 	context->target_sessionid = audit_get_sessionid(t);
 	security_task_getsecid(t, &context->target_sid);
+	context->target_cid = audit_get_contid(t);
 	memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
 }
 
@@ -2408,6 +2410,7 @@  int audit_signal_info(int sig, struct task_struct *t)
 		ctx->target_uid = t_uid;
 		ctx->target_sessionid = audit_get_sessionid(t);
 		security_task_getsecid(t, &ctx->target_sid);
+		ctx->target_cid = audit_get_contid(t);
 		memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
 		return 0;
 	}
@@ -2429,6 +2432,7 @@  int audit_signal_info(int sig, struct task_struct *t)
 	axp->target_uid[axp->pid_count] = t_uid;
 	axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
 	security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
+	axp->target_cid[axp->pid_count] = audit_get_contid(t);
 	memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
 	axp->pid_count++;
 

Comments

Tycho Andersen May 29, 2019, 2:57 p.m.
On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.

...

>  /**
> + * audit_set_contid - set current task's audit contid
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> +	u64 oldcontid;
> +	int rc = 0;
> +	struct audit_buffer *ab;
> +	uid_t uid;
> +	struct tty_struct *tty;
> +	char comm[sizeof(current->comm)];
> +
> +	task_lock(task);
> +	/* Can't set if audit disabled */
> +	if (!task->audit) {
> +		task_unlock(task);
> +		return -ENOPROTOOPT;
> +	}
> +	oldcontid = audit_get_contid(task);
> +	read_lock(&tasklist_lock);
> +	/* Don't allow the audit containerid to be unset */
> +	if (!audit_contid_valid(contid))
> +		rc = -EINVAL;
> +	/* if we don't have caps, reject */
> +	else if (!capable(CAP_AUDIT_CONTROL))
> +		rc = -EPERM;
> +	/* if task has children or is not single-threaded, deny */
> +	else if (!list_empty(&task->children))
> +		rc = -EBUSY;
> +	else if (!(thread_group_leader(task) && thread_group_empty(task)))
> +		rc = -EALREADY;
> +	read_unlock(&tasklist_lock);
> +	if (!rc)
> +		task->audit->contid = contid;
> +	task_unlock(task);
> +
> +	if (!audit_enabled)
> +		return rc;

...but it is allowed to change it (assuming
capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
immediately useful since we still live in the world of majority
privileged containers if we didn't allow changing it, in addition to
un-setting it.

Tycho
Paul Moore May 29, 2019, 3:29 p.m.
On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > It is not permitted to unset the audit container identifier.
> > A child inherits its parent's audit container identifier.
>
> ...
>
> >  /**
> > + * audit_set_contid - set current task's audit contid
> > + * @contid: contid value
> > + *
> > + * Returns 0 on success, -EPERM on permission failure.
> > + *
> > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > + */
> > +int audit_set_contid(struct task_struct *task, u64 contid)
> > +{
> > +     u64 oldcontid;
> > +     int rc = 0;
> > +     struct audit_buffer *ab;
> > +     uid_t uid;
> > +     struct tty_struct *tty;
> > +     char comm[sizeof(current->comm)];
> > +
> > +     task_lock(task);
> > +     /* Can't set if audit disabled */
> > +     if (!task->audit) {
> > +             task_unlock(task);
> > +             return -ENOPROTOOPT;
> > +     }
> > +     oldcontid = audit_get_contid(task);
> > +     read_lock(&tasklist_lock);
> > +     /* Don't allow the audit containerid to be unset */
> > +     if (!audit_contid_valid(contid))
> > +             rc = -EINVAL;
> > +     /* if we don't have caps, reject */
> > +     else if (!capable(CAP_AUDIT_CONTROL))
> > +             rc = -EPERM;
> > +     /* if task has children or is not single-threaded, deny */
> > +     else if (!list_empty(&task->children))
> > +             rc = -EBUSY;
> > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > +             rc = -EALREADY;
> > +     read_unlock(&tasklist_lock);
> > +     if (!rc)
> > +             task->audit->contid = contid;
> > +     task_unlock(task);
> > +
> > +     if (!audit_enabled)
> > +             return rc;
>
> ...but it is allowed to change it (assuming
> capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> immediately useful since we still live in the world of majority
> privileged containers if we didn't allow changing it, in addition to
> un-setting it.

The idea is that only container orchestrators should be able to
set/modify the audit container ID, and since setting the audit
container ID can have a significant effect on the records captured
(and their routing to multiple daemons when we get there) modifying
the audit container ID is akin to modifying the audit configuration
which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
is that you would only change the audit container ID from one
set/inherited value to another if you were nesting containers, in
which case the nested container orchestrator would need to be granted
CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
compromise).  We did consider allowing for a chain of nested audit
container IDs, but the implications of doing so are significant
(implementation mess, runtime cost, etc.) so we are leaving that out
of this effort.

From a practical perspective, un-setting the audit container ID is
pretty much the same as changing it from one set value to another so
most of the above applies to that case as well.
Tycho Andersen May 29, 2019, 3:34 p.m.
On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > It is not permitted to unset the audit container identifier.
> > > A child inherits its parent's audit container identifier.
> >
> > ...
> >
> > >  /**
> > > + * audit_set_contid - set current task's audit contid
> > > + * @contid: contid value
> > > + *
> > > + * Returns 0 on success, -EPERM on permission failure.
> > > + *
> > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > + */
> > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > +{
> > > +     u64 oldcontid;
> > > +     int rc = 0;
> > > +     struct audit_buffer *ab;
> > > +     uid_t uid;
> > > +     struct tty_struct *tty;
> > > +     char comm[sizeof(current->comm)];
> > > +
> > > +     task_lock(task);
> > > +     /* Can't set if audit disabled */
> > > +     if (!task->audit) {
> > > +             task_unlock(task);
> > > +             return -ENOPROTOOPT;
> > > +     }
> > > +     oldcontid = audit_get_contid(task);
> > > +     read_lock(&tasklist_lock);
> > > +     /* Don't allow the audit containerid to be unset */
> > > +     if (!audit_contid_valid(contid))
> > > +             rc = -EINVAL;
> > > +     /* if we don't have caps, reject */
> > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > +             rc = -EPERM;
> > > +     /* if task has children or is not single-threaded, deny */
> > > +     else if (!list_empty(&task->children))
> > > +             rc = -EBUSY;
> > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > +             rc = -EALREADY;
> > > +     read_unlock(&tasklist_lock);
> > > +     if (!rc)
> > > +             task->audit->contid = contid;
> > > +     task_unlock(task);
> > > +
> > > +     if (!audit_enabled)
> > > +             return rc;
> >
> > ...but it is allowed to change it (assuming
> > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > immediately useful since we still live in the world of majority
> > privileged containers if we didn't allow changing it, in addition to
> > un-setting it.
> 
> The idea is that only container orchestrators should be able to
> set/modify the audit container ID, and since setting the audit
> container ID can have a significant effect on the records captured
> (and their routing to multiple daemons when we get there) modifying
> the audit container ID is akin to modifying the audit configuration
> which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> is that you would only change the audit container ID from one
> set/inherited value to another if you were nesting containers, in
> which case the nested container orchestrator would need to be granted
> CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> compromise).

But then don't you want some kind of ns_capable() instead (probably
not the obvious one, though...)? With capable(), you can't really nest
using the audit-id and user namespaces together.

Tycho
Paul Moore May 29, 2019, 4:03 p.m.
On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
>
> On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > It is not permitted to unset the audit container identifier.
> > > > A child inherits its parent's audit container identifier.
> > >
> > > ...
> > >
> > > >  /**
> > > > + * audit_set_contid - set current task's audit contid
> > > > + * @contid: contid value
> > > > + *
> > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > + *
> > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > + */
> > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > +{
> > > > +     u64 oldcontid;
> > > > +     int rc = 0;
> > > > +     struct audit_buffer *ab;
> > > > +     uid_t uid;
> > > > +     struct tty_struct *tty;
> > > > +     char comm[sizeof(current->comm)];
> > > > +
> > > > +     task_lock(task);
> > > > +     /* Can't set if audit disabled */
> > > > +     if (!task->audit) {
> > > > +             task_unlock(task);
> > > > +             return -ENOPROTOOPT;
> > > > +     }
> > > > +     oldcontid = audit_get_contid(task);
> > > > +     read_lock(&tasklist_lock);
> > > > +     /* Don't allow the audit containerid to be unset */
> > > > +     if (!audit_contid_valid(contid))
> > > > +             rc = -EINVAL;
> > > > +     /* if we don't have caps, reject */
> > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > +             rc = -EPERM;
> > > > +     /* if task has children or is not single-threaded, deny */
> > > > +     else if (!list_empty(&task->children))
> > > > +             rc = -EBUSY;
> > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > +             rc = -EALREADY;
> > > > +     read_unlock(&tasklist_lock);
> > > > +     if (!rc)
> > > > +             task->audit->contid = contid;
> > > > +     task_unlock(task);
> > > > +
> > > > +     if (!audit_enabled)
> > > > +             return rc;
> > >
> > > ...but it is allowed to change it (assuming
> > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > immediately useful since we still live in the world of majority
> > > privileged containers if we didn't allow changing it, in addition to
> > > un-setting it.
> >
> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).
>
> But then don't you want some kind of ns_capable() instead (probably
> not the obvious one, though...)? With capable(), you can't really nest
> using the audit-id and user namespaces together.

You want capable() and not ns_capable() because you want to ensure
that the orchestrator has the rights in the init_ns as changes to the
audit container ID could have an auditing impact that spans the entire
system.  Setting the audit container ID is equivalent to munging the
kernel's audit configuration, and the audit configuration is not
"namespaced" in any way.  The audit container ID work is about
providing the right "container context" (as defined by userspace) with
the audit records so that admins have a better understanding about
what is going on in the system; it is very explicitly not creating an
audit namespace.

At some point in the future we will want to support running multiple
audit daemons, and have a configurable way of routing audit records
based on the audit container ID, which will blur the line regarding
audit namespaces, but even then I would argue we are not creating an
audit namespace.
Tycho Andersen May 29, 2019, 10:28 p.m.
On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > >
> > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > > It is not permitted to unset the audit container identifier.
> > > > > A child inherits its parent's audit container identifier.
> > > >
> > > > ...
> > > >
> > > > >  /**
> > > > > + * audit_set_contid - set current task's audit contid
> > > > > + * @contid: contid value
> > > > > + *
> > > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > > + *
> > > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > > + */
> > > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > > +{
> > > > > +     u64 oldcontid;
> > > > > +     int rc = 0;
> > > > > +     struct audit_buffer *ab;
> > > > > +     uid_t uid;
> > > > > +     struct tty_struct *tty;
> > > > > +     char comm[sizeof(current->comm)];
> > > > > +
> > > > > +     task_lock(task);
> > > > > +     /* Can't set if audit disabled */
> > > > > +     if (!task->audit) {
> > > > > +             task_unlock(task);
> > > > > +             return -ENOPROTOOPT;
> > > > > +     }
> > > > > +     oldcontid = audit_get_contid(task);
> > > > > +     read_lock(&tasklist_lock);
> > > > > +     /* Don't allow the audit containerid to be unset */
> > > > > +     if (!audit_contid_valid(contid))
> > > > > +             rc = -EINVAL;
> > > > > +     /* if we don't have caps, reject */
> > > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > > +             rc = -EPERM;
> > > > > +     /* if task has children or is not single-threaded, deny */
> > > > > +     else if (!list_empty(&task->children))
> > > > > +             rc = -EBUSY;
> > > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > > +             rc = -EALREADY;
> > > > > +     read_unlock(&tasklist_lock);
> > > > > +     if (!rc)
> > > > > +             task->audit->contid = contid;
> > > > > +     task_unlock(task);
> > > > > +
> > > > > +     if (!audit_enabled)
> > > > > +             return rc;
> > > >
> > > > ...but it is allowed to change it (assuming
> > > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > > immediately useful since we still live in the world of majority
> > > > privileged containers if we didn't allow changing it, in addition to
> > > > un-setting it.
> > >
> > > The idea is that only container orchestrators should be able to
> > > set/modify the audit container ID, and since setting the audit
> > > container ID can have a significant effect on the records captured
> > > (and their routing to multiple daemons when we get there) modifying
> > > the audit container ID is akin to modifying the audit configuration
> > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > is that you would only change the audit container ID from one
> > > set/inherited value to another if you were nesting containers, in
> > > which case the nested container orchestrator would need to be granted
> > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > compromise).
> >
> > But then don't you want some kind of ns_capable() instead (probably
> > not the obvious one, though...)? With capable(), you can't really nest
> > using the audit-id and user namespaces together.
> 
> You want capable() and not ns_capable() because you want to ensure
> that the orchestrator has the rights in the init_ns as changes to the
> audit container ID could have an auditing impact that spans the entire
> system.

Ok but,

> > > The current thinking
> > > is that you would only change the audit container ID from one
> > > set/inherited value to another if you were nesting containers, in
> > > which case the nested container orchestrator would need to be granted
> > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > compromise).

won't work in user namespaced containers, because they will never be
capable(CAP_AUDIT_CONTROL); so I don't think this will work for
nesting as is. But maybe nobody cares :)

Tycho
Paul Moore May 29, 2019, 10:39 p.m.
On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > >
> > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > >
> > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > > > It is not permitted to unset the audit container identifier.
> > > > > > A child inherits its parent's audit container identifier.
> > > > >
> > > > > ...
> > > > >
> > > > > >  /**
> > > > > > + * audit_set_contid - set current task's audit contid
> > > > > > + * @contid: contid value
> > > > > > + *
> > > > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > > > + *
> > > > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > > > + */
> > > > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > > > +{
> > > > > > +     u64 oldcontid;
> > > > > > +     int rc = 0;
> > > > > > +     struct audit_buffer *ab;
> > > > > > +     uid_t uid;
> > > > > > +     struct tty_struct *tty;
> > > > > > +     char comm[sizeof(current->comm)];
> > > > > > +
> > > > > > +     task_lock(task);
> > > > > > +     /* Can't set if audit disabled */
> > > > > > +     if (!task->audit) {
> > > > > > +             task_unlock(task);
> > > > > > +             return -ENOPROTOOPT;
> > > > > > +     }
> > > > > > +     oldcontid = audit_get_contid(task);
> > > > > > +     read_lock(&tasklist_lock);
> > > > > > +     /* Don't allow the audit containerid to be unset */
> > > > > > +     if (!audit_contid_valid(contid))
> > > > > > +             rc = -EINVAL;
> > > > > > +     /* if we don't have caps, reject */
> > > > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > > > +             rc = -EPERM;
> > > > > > +     /* if task has children or is not single-threaded, deny */
> > > > > > +     else if (!list_empty(&task->children))
> > > > > > +             rc = -EBUSY;
> > > > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > > > +             rc = -EALREADY;
> > > > > > +     read_unlock(&tasklist_lock);
> > > > > > +     if (!rc)
> > > > > > +             task->audit->contid = contid;
> > > > > > +     task_unlock(task);
> > > > > > +
> > > > > > +     if (!audit_enabled)
> > > > > > +             return rc;
> > > > >
> > > > > ...but it is allowed to change it (assuming
> > > > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > > > immediately useful since we still live in the world of majority
> > > > > privileged containers if we didn't allow changing it, in addition to
> > > > > un-setting it.
> > > >
> > > > The idea is that only container orchestrators should be able to
> > > > set/modify the audit container ID, and since setting the audit
> > > > container ID can have a significant effect on the records captured
> > > > (and their routing to multiple daemons when we get there) modifying
> > > > the audit container ID is akin to modifying the audit configuration
> > > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > > is that you would only change the audit container ID from one
> > > > set/inherited value to another if you were nesting containers, in
> > > > which case the nested container orchestrator would need to be granted
> > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > compromise).
> > >
> > > But then don't you want some kind of ns_capable() instead (probably
> > > not the obvious one, though...)? With capable(), you can't really nest
> > > using the audit-id and user namespaces together.
> >
> > You want capable() and not ns_capable() because you want to ensure
> > that the orchestrator has the rights in the init_ns as changes to the
> > audit container ID could have an auditing impact that spans the entire
> > system.
>
> Ok but,
>
> > > > The current thinking
> > > > is that you would only change the audit container ID from one
> > > > set/inherited value to another if you were nesting containers, in
> > > > which case the nested container orchestrator would need to be granted
> > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > compromise).
>
> won't work in user namespaced containers, because they will never be
> capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> nesting as is. But maybe nobody cares :)

That's fun :)

To be honest, I've never been a big fan of supporting nested
containers from an audit perspective, so I'm not really too upset
about this.  The k8s/cri-o folks seem okay with this, or at least I
haven't heard any objections; lxc folks, what do you have to say?
Serge E. Hallyn May 30, 2019, 5:09 p.m.
On Wed, May 29, 2019 at 06:39:48PM -0400, Paul Moore wrote:
> On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > >
> > > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > >
> > > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > > > > > It is not permitted to unset the audit container identifier.
> > > > > > > A child inherits its parent's audit container identifier.
> > > > > >
> > > > > > ...
> > > > > >
> > > > > > >  /**
> > > > > > > + * audit_set_contid - set current task's audit contid
> > > > > > > + * @contid: contid value
> > > > > > > + *
> > > > > > > + * Returns 0 on success, -EPERM on permission failure.
> > > > > > > + *
> > > > > > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > > > > > + */
> > > > > > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > > > > > +{
> > > > > > > +     u64 oldcontid;
> > > > > > > +     int rc = 0;
> > > > > > > +     struct audit_buffer *ab;
> > > > > > > +     uid_t uid;
> > > > > > > +     struct tty_struct *tty;
> > > > > > > +     char comm[sizeof(current->comm)];
> > > > > > > +
> > > > > > > +     task_lock(task);
> > > > > > > +     /* Can't set if audit disabled */
> > > > > > > +     if (!task->audit) {
> > > > > > > +             task_unlock(task);
> > > > > > > +             return -ENOPROTOOPT;
> > > > > > > +     }
> > > > > > > +     oldcontid = audit_get_contid(task);
> > > > > > > +     read_lock(&tasklist_lock);
> > > > > > > +     /* Don't allow the audit containerid to be unset */
> > > > > > > +     if (!audit_contid_valid(contid))
> > > > > > > +             rc = -EINVAL;
> > > > > > > +     /* if we don't have caps, reject */
> > > > > > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > > > > > +             rc = -EPERM;
> > > > > > > +     /* if task has children or is not single-threaded, deny */
> > > > > > > +     else if (!list_empty(&task->children))
> > > > > > > +             rc = -EBUSY;
> > > > > > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > > > > > +             rc = -EALREADY;
> > > > > > > +     read_unlock(&tasklist_lock);
> > > > > > > +     if (!rc)
> > > > > > > +             task->audit->contid = contid;
> > > > > > > +     task_unlock(task);
> > > > > > > +
> > > > > > > +     if (!audit_enabled)
> > > > > > > +             return rc;
> > > > > >
> > > > > > ...but it is allowed to change it (assuming
> > > > > > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > > > > > immediately useful since we still live in the world of majority
> > > > > > privileged containers if we didn't allow changing it, in addition to
> > > > > > un-setting it.
> > > > >
> > > > > The idea is that only container orchestrators should be able to
> > > > > set/modify the audit container ID, and since setting the audit
> > > > > container ID can have a significant effect on the records captured
> > > > > (and their routing to multiple daemons when we get there) modifying
> > > > > the audit container ID is akin to modifying the audit configuration
> > > > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > > > is that you would only change the audit container ID from one
> > > > > set/inherited value to another if you were nesting containers, in
> > > > > which case the nested container orchestrator would need to be granted
> > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > compromise).
> > > >
> > > > But then don't you want some kind of ns_capable() instead (probably
> > > > not the obvious one, though...)? With capable(), you can't really nest
> > > > using the audit-id and user namespaces together.
> > >
> > > You want capable() and not ns_capable() because you want to ensure
> > > that the orchestrator has the rights in the init_ns as changes to the
> > > audit container ID could have an auditing impact that spans the entire
> > > system.
> >
> > Ok but,
> >
> > > > > The current thinking
> > > > > is that you would only change the audit container ID from one
> > > > > set/inherited value to another if you were nesting containers, in
> > > > > which case the nested container orchestrator would need to be granted
> > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > compromise).
> >
> > won't work in user namespaced containers, because they will never be
> > capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> > nesting as is. But maybe nobody cares :)
> 
> That's fun :)
> 
> To be honest, I've never been a big fan of supporting nested
> containers from an audit perspective, so I'm not really too upset
> about this.  The k8s/cri-o folks seem okay with this, or at least I
> haven't heard any objections; lxc folks, what do you have to say?

I actually thought the answer to this (when last I looked, "some time" ago)
was that userspace should track an audit message saying "task X in
container Y is changing its auditid to Z", and then decide to also track Z.
This should be doable, but a lot of extra work in userspace.

Per-userns containerids would also work.  So task X1 is in containerid
1 on the host and creates a new task Y in new userns;  it continues to
be reported in init_user_ns as containerid 1 forever;  but in its own
userns it can request to be known as some other containerid.  Audit
socks would be per-userns, allowing root in a container to watch for
audit events in its own (and descendent) namespaces.

But again I'm sure we've gone over all this in the last few years.

I suppose we can look at this as a "first step", and talk about
making it user-ns-nestable later.  But agreed it's not useful in a
lot of situations as is.

-serge
Paul Moore May 30, 2019, 7:29 p.m.
On Thu, May 30, 2019 at 1:09 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> On Wed, May 29, 2019 at 06:39:48PM -0400, Paul Moore wrote:
> > On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > > > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:

...

> > > > > > The current thinking
> > > > > > is that you would only change the audit container ID from one
> > > > > > set/inherited value to another if you were nesting containers, in
> > > > > > which case the nested container orchestrator would need to be granted
> > > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > > compromise).
> > >
> > > won't work in user namespaced containers, because they will never be
> > > capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> > > nesting as is. But maybe nobody cares :)
> >
> > That's fun :)
> >
> > To be honest, I've never been a big fan of supporting nested
> > containers from an audit perspective, so I'm not really too upset
> > about this.  The k8s/cri-o folks seem okay with this, or at least I
> > haven't heard any objections; lxc folks, what do you have to say?
>
> I actually thought the answer to this (when last I looked, "some time" ago)
> was that userspace should track an audit message saying "task X in
> container Y is changing its auditid to Z", and then decide to also track Z.
> This should be doable, but a lot of extra work in userspace.
>
> Per-userns containerids would also work.  So task X1 is in containerid
> 1 on the host and creates a new task Y in new userns;  it continues to
> be reported in init_user_ns as containerid 1 forever;  but in its own
> userns it can request to be known as some other containerid.  Audit
> socks would be per-userns, allowing root in a container to watch for
> audit events in its own (and descendent) namespaces.
>
> But again I'm sure we've gone over all this in the last few years.
>
> I suppose we can look at this as a "first step", and talk about
> making it user-ns-nestable later.  But agreed it's not useful in a
> lot of situations as is.

[REMINDER: It is an "*audit* container ID" and not a general
"container ID" ;)  Smiley aside, I'm not kidding about that part.]

I'm not interested in supporting/merging something that isn't useful;
if this doesn't work for your use case then we need to figure out what
would work.  It sounds like nested containers are much more common in
the lxc world, can you elaborate a bit more on this?

As far as the possible solutions you mention above, I'm not sure I
like the per-userns audit container IDs, I'd much rather just emit the
necessary tracking information via the audit record stream and let the
log analysis tools figure it out.  However, the bigger question is how
to limit (re)setting the audit container ID when you are in a non-init
userns.  For reasons already mentioned, using capable() is a non
starter for everything but the initial userns, and using ns_capable()
is equally poor as it essentially allows any userns the ability to
munge it's audit container ID (obviously not good).  It appears we
need a different method for controlling access to the audit container
ID.

Punting this to a LSM hook is an obvious thing to do, and something we
might want to do anyway, but currently audit doesn't rely on the LSM
for proper/safe operation and I'm not sure I want to change that now.

The next obvious thing is to create some sort of access control knob
in audit itself.  Perhaps an auditctl operation that would allow the
administrator to specify which containers, via their corresponding
audit container IDs, are allowed to change their audit container ID?
The permission granting would need to be done in the init userns, but
it would allow containers with a non-init userns the ability to change
their audit container ID.  We would probably still want a
ns_capable(CAP_AUDIT_CONTROL) restriction in this case.

Does anyone else have any other ideas?
Tycho Andersen May 30, 2019, 9:29 p.m.
On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> 
> [REMINDER: It is an "*audit* container ID" and not a general
> "container ID" ;)  Smiley aside, I'm not kidding about that part.]

This sort of seems like a distinction without a difference; presumably
audit is going to want to differentiate between everything that people
in userspace call a container. So you'll have to support all this
insanity anyway, even if it's "not a container ID".

> I'm not interested in supporting/merging something that isn't useful;
> if this doesn't work for your use case then we need to figure out what
> would work.  It sounds like nested containers are much more common in
> the lxc world, can you elaborate a bit more on this?
> 
> As far as the possible solutions you mention above, I'm not sure I
> like the per-userns audit container IDs, I'd much rather just emit the
> necessary tracking information via the audit record stream and let the
> log analysis tools figure it out.  However, the bigger question is how
> to limit (re)setting the audit container ID when you are in a non-init
> userns.  For reasons already mentioned, using capable() is a non
> starter for everything but the initial userns, and using ns_capable()
> is equally poor as it essentially allows any userns the ability to
> munge it's audit container ID (obviously not good).  It appears we
> need a different method for controlling access to the audit container
> ID.

One option would be to make it a string, and have it be append only.
That should be safe with no checks.

I know there was a long thread about what type to make this thing. I
think you could accomplish the append-only-ness with a u64 if you had
some rule about only allowing setting lower order bits than those that
are already set. With 4 bits for simplicity:

1100         # initial container id
1100 -> 1011 # not allowed
1100 -> 1101 # allowed, but now 1101 is set in stone since there are
             # no lower order bits left

There are probably fancier ways to do it if you actually understand
math :)

Since userns nesting is limited to 32 levels (right now, IIRC), and
you have 64 bits, this might be reasonable. You could just teach
container engines to use the first say N bits for themselves, with a 1
bit for the barrier at the end.

Tycho
Paul Moore May 30, 2019, 11:26 p.m.
On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> >
> > [REMINDER: It is an "*audit* container ID" and not a general
> > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
>
> This sort of seems like a distinction without a difference; presumably
> audit is going to want to differentiate between everything that people
> in userspace call a container. So you'll have to support all this
> insanity anyway, even if it's "not a container ID".

That's not quite right.  Audit doesn't care about what a container is,
or is not, it also doesn't care if the "audit container ID" actually
matches the ID used by the container engine in userspace and I think
that is a very important line to draw.  Audit is simply given a value
which it calls the "audit container ID", it ensures that the value is
inherited appropriately (e.g. children inherit their parent's audit
container ID), and it uses the value in audit records to provide some
additional context for log analysis.  The distinction isn't limited to
the value itself, but also to how it is used; it is an "audit
container ID" and not a "container ID" because this value is
exclusively for use by the audit subsystem.  We are very intentionally
not adding a generic container ID to the kernel.  If the kernel does
ever grow a general purpose container ID we will be one of the first
ones in line to make use of it, but we are not going to be the ones to
generically add containers to the kernel.  Enough people already hate
audit ;)

> > I'm not interested in supporting/merging something that isn't useful;
> > if this doesn't work for your use case then we need to figure out what
> > would work.  It sounds like nested containers are much more common in
> > the lxc world, can you elaborate a bit more on this?
> >
> > As far as the possible solutions you mention above, I'm not sure I
> > like the per-userns audit container IDs, I'd much rather just emit the
> > necessary tracking information via the audit record stream and let the
> > log analysis tools figure it out.  However, the bigger question is how
> > to limit (re)setting the audit container ID when you are in a non-init
> > userns.  For reasons already mentioned, using capable() is a non
> > starter for everything but the initial userns, and using ns_capable()
> > is equally poor as it essentially allows any userns the ability to
> > munge it's audit container ID (obviously not good).  It appears we
> > need a different method for controlling access to the audit container
> > ID.
>
> One option would be to make it a string, and have it be append only.
> That should be safe with no checks.
>
> I know there was a long thread about what type to make this thing. I
> think you could accomplish the append-only-ness with a u64 if you had
> some rule about only allowing setting lower order bits than those that
> are already set. With 4 bits for simplicity:
>
> 1100         # initial container id
> 1100 -> 1011 # not allowed
> 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
>              # no lower order bits left
>
> There are probably fancier ways to do it if you actually understand
> math :)

 ;)

> Since userns nesting is limited to 32 levels (right now, IIRC), and
> you have 64 bits, this might be reasonable. You could just teach
> container engines to use the first say N bits for themselves, with a 1
> bit for the barrier at the end.

I like the creativity, but I worry that at some point these
limitations are going to be raised (limits have a funny way of doing
that over time) and we will be in trouble.  I say "trouble" because I
want to be able to quickly do an audit container ID comparison and
we're going to pay a penalty for these larger values (we'll need this
when we add multiple auditd support and the requisite record routing).

Thinking about this makes me also realize we probably need to think a
bit longer about audit container ID conflicts between orchestrators.
Right now we just take the value that is given to us by the
orchestrator, but if we want to allow multiple container orchestrators
to work without some form of cooperation in userspace (I think we have
to assume the orchestrators will not talk to each other) we likely
need to have some way to block reuse of an audit container ID.  We
would either need to prevent the orchestrator from explicitly setting
an audit container ID to a currently in use value, or instead generate
the audit container ID in the kernel upon an event triggered by the
orchestrator (e.g. a write to a /proc file).  I suspect we should
start looking at the idr code, I think we will need to make use of it.
Richard Guy Briggs May 31, 2019, 12:20 a.m.
On 2019-05-30 19:26, Paul Moore wrote:
> On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > >
> > > [REMINDER: It is an "*audit* container ID" and not a general
> > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > This sort of seems like a distinction without a difference; presumably
> > audit is going to want to differentiate between everything that people
> > in userspace call a container. So you'll have to support all this
> > insanity anyway, even if it's "not a container ID".
> 
> That's not quite right.  Audit doesn't care about what a container is,
> or is not, it also doesn't care if the "audit container ID" actually
> matches the ID used by the container engine in userspace and I think
> that is a very important line to draw.  Audit is simply given a value
> which it calls the "audit container ID", it ensures that the value is
> inherited appropriately (e.g. children inherit their parent's audit
> container ID), and it uses the value in audit records to provide some
> additional context for log analysis.  The distinction isn't limited to
> the value itself, but also to how it is used; it is an "audit
> container ID" and not a "container ID" because this value is
> exclusively for use by the audit subsystem.  We are very intentionally
> not adding a generic container ID to the kernel.  If the kernel does
> ever grow a general purpose container ID we will be one of the first
> ones in line to make use of it, but we are not going to be the ones to
> generically add containers to the kernel.  Enough people already hate
> audit ;)
> 
> > > I'm not interested in supporting/merging something that isn't useful;
> > > if this doesn't work for your use case then we need to figure out what
> > > would work.  It sounds like nested containers are much more common in
> > > the lxc world, can you elaborate a bit more on this?
> > >
> > > As far as the possible solutions you mention above, I'm not sure I
> > > like the per-userns audit container IDs, I'd much rather just emit the
> > > necessary tracking information via the audit record stream and let the
> > > log analysis tools figure it out.  However, the bigger question is how
> > > to limit (re)setting the audit container ID when you are in a non-init
> > > userns.  For reasons already mentioned, using capable() is a non
> > > starter for everything but the initial userns, and using ns_capable()
> > > is equally poor as it essentially allows any userns the ability to
> > > munge it's audit container ID (obviously not good).  It appears we
> > > need a different method for controlling access to the audit container
> > > ID.
> >
> > One option would be to make it a string, and have it be append only.
> > That should be safe with no checks.
> >
> > I know there was a long thread about what type to make this thing. I
> > think you could accomplish the append-only-ness with a u64 if you had
> > some rule about only allowing setting lower order bits than those that
> > are already set. With 4 bits for simplicity:
> >
> > 1100         # initial container id
> > 1100 -> 1011 # not allowed
> > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> >              # no lower order bits left
> >
> > There are probably fancier ways to do it if you actually understand
> > math :)
> 
>  ;)
> 
> > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > you have 64 bits, this might be reasonable. You could just teach
> > container engines to use the first say N bits for themselves, with a 1
> > bit for the barrier at the end.
> 
> I like the creativity, but I worry that at some point these
> limitations are going to be raised (limits have a funny way of doing
> that over time) and we will be in trouble.  I say "trouble" because I
> want to be able to quickly do an audit container ID comparison and
> we're going to pay a penalty for these larger values (we'll need this
> when we add multiple auditd support and the requisite record routing).
> 
> Thinking about this makes me also realize we probably need to think a
> bit longer about audit container ID conflicts between orchestrators.
> Right now we just take the value that is given to us by the
> orchestrator, but if we want to allow multiple container orchestrators
> to work without some form of cooperation in userspace (I think we have
> to assume the orchestrators will not talk to each other) we likely
> need to have some way to block reuse of an audit container ID.  We
> would either need to prevent the orchestrator from explicitly setting
> an audit container ID to a currently in use value, or instead generate
> the audit container ID in the kernel upon an event triggered by the
> orchestrator (e.g. a write to a /proc file).  I suspect we should
> start looking at the idr code, I think we will need to make use of it.

My first reaction to using the IDR code is that once an idr is given up,
it can be reused.  I suppose we request IDRs and then never give them up
to avoid reuse...

I already had some ideas of preventing an existing ID from being reused,
but that makes the practice of some container engines injecting
processes into existing containers difficult if not impossible.

> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore May 31, 2019, 12:44 p.m.
On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 19:26, Paul Moore wrote:
> > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > >
> > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > >
> > > This sort of seems like a distinction without a difference; presumably
> > > audit is going to want to differentiate between everything that people
> > > in userspace call a container. So you'll have to support all this
> > > insanity anyway, even if it's "not a container ID".
> >
> > That's not quite right.  Audit doesn't care about what a container is,
> > or is not, it also doesn't care if the "audit container ID" actually
> > matches the ID used by the container engine in userspace and I think
> > that is a very important line to draw.  Audit is simply given a value
> > which it calls the "audit container ID", it ensures that the value is
> > inherited appropriately (e.g. children inherit their parent's audit
> > container ID), and it uses the value in audit records to provide some
> > additional context for log analysis.  The distinction isn't limited to
> > the value itself, but also to how it is used; it is an "audit
> > container ID" and not a "container ID" because this value is
> > exclusively for use by the audit subsystem.  We are very intentionally
> > not adding a generic container ID to the kernel.  If the kernel does
> > ever grow a general purpose container ID we will be one of the first
> > ones in line to make use of it, but we are not going to be the ones to
> > generically add containers to the kernel.  Enough people already hate
> > audit ;)
> >
> > > > I'm not interested in supporting/merging something that isn't useful;
> > > > if this doesn't work for your use case then we need to figure out what
> > > > would work.  It sounds like nested containers are much more common in
> > > > the lxc world, can you elaborate a bit more on this?
> > > >
> > > > As far as the possible solutions you mention above, I'm not sure I
> > > > like the per-userns audit container IDs, I'd much rather just emit the
> > > > necessary tracking information via the audit record stream and let the
> > > > log analysis tools figure it out.  However, the bigger question is how
> > > > to limit (re)setting the audit container ID when you are in a non-init
> > > > userns.  For reasons already mentioned, using capable() is a non
> > > > starter for everything but the initial userns, and using ns_capable()
> > > > is equally poor as it essentially allows any userns the ability to
> > > > munge it's audit container ID (obviously not good).  It appears we
> > > > need a different method for controlling access to the audit container
> > > > ID.
> > >
> > > One option would be to make it a string, and have it be append only.
> > > That should be safe with no checks.
> > >
> > > I know there was a long thread about what type to make this thing. I
> > > think you could accomplish the append-only-ness with a u64 if you had
> > > some rule about only allowing setting lower order bits than those that
> > > are already set. With 4 bits for simplicity:
> > >
> > > 1100         # initial container id
> > > 1100 -> 1011 # not allowed
> > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > >              # no lower order bits left
> > >
> > > There are probably fancier ways to do it if you actually understand
> > > math :)
> >
> >  ;)
> >
> > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > you have 64 bits, this might be reasonable. You could just teach
> > > container engines to use the first say N bits for themselves, with a 1
> > > bit for the barrier at the end.
> >
> > I like the creativity, but I worry that at some point these
> > limitations are going to be raised (limits have a funny way of doing
> > that over time) and we will be in trouble.  I say "trouble" because I
> > want to be able to quickly do an audit container ID comparison and
> > we're going to pay a penalty for these larger values (we'll need this
> > when we add multiple auditd support and the requisite record routing).
> >
> > Thinking about this makes me also realize we probably need to think a
> > bit longer about audit container ID conflicts between orchestrators.
> > Right now we just take the value that is given to us by the
> > orchestrator, but if we want to allow multiple container orchestrators
> > to work without some form of cooperation in userspace (I think we have
> > to assume the orchestrators will not talk to each other) we likely
> > need to have some way to block reuse of an audit container ID.  We
> > would either need to prevent the orchestrator from explicitly setting
> > an audit container ID to a currently in use value, or instead generate
> > the audit container ID in the kernel upon an event triggered by the
> > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > start looking at the idr code, I think we will need to make use of it.
>
> My first reaction to using the IDR code is that once an idr is given up,
> it can be reused.  I suppose we request IDRs and then never give them up
> to avoid reuse...

I'm not sure we ever what to guarantee that an audit container ID
won't be reused during the lifetime of the system, it is a discrete
integer after all.  What I think we do want to guarantee is that we
won't allow an unintentional audit container ID collision between
different orchestrators; if a single orchestrator wants to reuse an
audit container ID then that is its choice.

> I already had some ideas of preventing an existing ID from being reused,

Cool.  I only made the idr suggestion since it is used for PIDs and
solves a very similar problem.

> but that makes the practice of some container engines injecting
> processes into existing containers difficult if not impossible.

Yes, we'll need some provision to indicate which orchestrator
"controls" that particular audit container ID, and allow that
orchestrator to reuse that particular audit container ID (until all
those containers disappear and the audit container ID is given back to
the pool).
Steve Grubb June 3, 2019, 8:24 p.m.
Hello Paul,

I am curious about this. We seemed to be close to getting this patch pulled 
in. A lot of people are waiting for it. Can you summarize what you think the 
patches need and who we think needs to do it? I'm lost. Does LXC people need 
to propose something? Does Richard? Someone else? Who's got the ball?

Thank,
-Steve

On Friday, May 31, 2019 8:44:45 AM EDT Paul Moore wrote:
> On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-05-30 19:26, Paul Moore wrote:
> > > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > > > 
> > > > This sort of seems like a distinction without a difference;
> > > > presumably
> > > > audit is going to want to differentiate between everything that
> > > > people
> > > > in userspace call a container. So you'll have to support all this
> > > > insanity anyway, even if it's "not a container ID".
> > > 
> > > That's not quite right.  Audit doesn't care about what a container is,
> > > or is not, it also doesn't care if the "audit container ID" actually
> > > matches the ID used by the container engine in userspace and I think
> > > that is a very important line to draw.  Audit is simply given a value
> > > which it calls the "audit container ID", it ensures that the value is
> > > inherited appropriately (e.g. children inherit their parent's audit
> > > container ID), and it uses the value in audit records to provide some
> > > additional context for log analysis.  The distinction isn't limited to
> > > the value itself, but also to how it is used; it is an "audit
> > > container ID" and not a "container ID" because this value is
> > > exclusively for use by the audit subsystem.  We are very intentionally
> > > not adding a generic container ID to the kernel.  If the kernel does
> > > ever grow a general purpose container ID we will be one of the first
> > > ones in line to make use of it, but we are not going to be the ones to
> > > generically add containers to the kernel.  Enough people already hate
> > > audit ;)
> > > 
> > > > > I'm not interested in supporting/merging something that isn't
> > > > > useful;
> > > > > if this doesn't work for your use case then we need to figure out
> > > > > what
> > > > > would work.  It sounds like nested containers are much more common
> > > > > in
> > > > > the lxc world, can you elaborate a bit more on this?
> > > > > 
> > > > > As far as the possible solutions you mention above, I'm not sure I
> > > > > like the per-userns audit container IDs, I'd much rather just emit
> > > > > the
> > > > > necessary tracking information via the audit record stream and let
> > > > > the
> > > > > log analysis tools figure it out.  However, the bigger question is
> > > > > how
> > > > > to limit (re)setting the audit container ID when you are in a
> > > > > non-init
> > > > > userns.  For reasons already mentioned, using capable() is a non
> > > > > starter for everything but the initial userns, and using
> > > > > ns_capable()
> > > > > is equally poor as it essentially allows any userns the ability to
> > > > > munge it's audit container ID (obviously not good).  It appears we
> > > > > need a different method for controlling access to the audit
> > > > > container
> > > > > ID.
> > > > 
> > > > One option would be to make it a string, and have it be append only.
> > > > That should be safe with no checks.
> > > > 
> > > > I know there was a long thread about what type to make this thing. I
> > > > think you could accomplish the append-only-ness with a u64 if you had
> > > > some rule about only allowing setting lower order bits than those
> > > > that
> > > > are already set. With 4 bits for simplicity:
> > > > 
> > > > 1100         # initial container id
> > > > 1100 -> 1011 # not allowed
> > > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > > > 
> > > >              # no lower order bits left
> > > > 
> > > > There are probably fancier ways to do it if you actually understand
> > > > math :)
> > >  
> > >  ;)
> > >  
> > > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > > you have 64 bits, this might be reasonable. You could just teach
> > > > container engines to use the first say N bits for themselves, with a
> > > > 1
> > > > bit for the barrier at the end.
> > > 
> > > I like the creativity, but I worry that at some point these
> > > limitations are going to be raised (limits have a funny way of doing
> > > that over time) and we will be in trouble.  I say "trouble" because I
> > > want to be able to quickly do an audit container ID comparison and
> > > we're going to pay a penalty for these larger values (we'll need this
> > > when we add multiple auditd support and the requisite record routing).
> > > 
> > > Thinking about this makes me also realize we probably need to think a
> > > bit longer about audit container ID conflicts between orchestrators.
> > > Right now we just take the value that is given to us by the
> > > orchestrator, but if we want to allow multiple container orchestrators
> > > to work without some form of cooperation in userspace (I think we have
> > > to assume the orchestrators will not talk to each other) we likely
> > > need to have some way to block reuse of an audit container ID.  We
> > > would either need to prevent the orchestrator from explicitly setting
> > > an audit container ID to a currently in use value, or instead generate
> > > the audit container ID in the kernel upon an event triggered by the
> > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > start looking at the idr code, I think we will need to make use of it.
> > 
> > My first reaction to using the IDR code is that once an idr is given up,
> > it can be reused.  I suppose we request IDRs and then never give them up
> > to avoid reuse...
> 
> I'm not sure we ever what to guarantee that an audit container ID
> won't be reused during the lifetime of the system, it is a discrete
> integer after all.  What I think we do want to guarantee is that we
> won't allow an unintentional audit container ID collision between
> different orchestrators; if a single orchestrator wants to reuse an
> audit container ID then that is its choice.
> 
> > I already had some ideas of preventing an existing ID from being reused,
> 
> Cool.  I only made the idr suggestion since it is used for PIDs and
> solves a very similar problem.
> 
> > but that makes the practice of some container engines injecting
> > processes into existing containers difficult if not impossible.
> 
> Yes, we'll need some provision to indicate which orchestrator
> "controls" that particular audit container ID, and allow that
> orchestrator to reuse that particular audit container ID (until all
> those containers disappear and the audit container ID is given back to
> the pool).
Paul Moore June 18, 2019, 10:12 p.m.
On Mon, Jun 3, 2019 at 4:24 PM Steve Grubb <sgrubb@redhat.com> wrote:
> Hello Paul,
>
> I am curious about this. We seemed to be close to getting this patch pulled
> in. A lot of people are waiting for it. Can you summarize what you think the
> patches need and who we think needs to do it? I'm lost. Does LXC people need
> to propose something? Does Richard? Someone else? Who's got the ball?

[My apologies, this was lost in my inbox and I just not noticed it.]

Please don't top post on things sent to the mailing lists Steve, you
know better than that.

Yes, things were moving along well, but upon talking with the LXC
folks it appears we underestimated the importance of nested
orchestrators.  I suspect my reply to Dan on the 4th covered your
questions, if you didn't see it, here is the relevant snippet:

"To be clear, that's where we are at: we need to figure out what the
kernel API would look like to support nested container orchestrators.
My gut feeling is that this isn't going to be terribly complicated
compared to the rest of the audit container ID work, but it is going
to be some work.  We had a discussion about some potential solutions
in the cover letter and it sounds like Richard is working up some
ideas now, let's wait to see what that looks like."

... and that is where we are at.  I'm looking forward to seeing
Richard's next patchset.

> On Friday, May 31, 2019 8:44:45 AM EDT Paul Moore wrote:
> > On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-05-30 19:26, Paul Moore wrote:
> > > > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > > > >
> > > > > This sort of seems like a distinction without a difference;
> > > > > presumably
> > > > > audit is going to want to differentiate between everything that
> > > > > people
> > > > > in userspace call a container. So you'll have to support all this
> > > > > insanity anyway, even if it's "not a container ID".
> > > >
> > > > That's not quite right.  Audit doesn't care about what a container is,
> > > > or is not, it also doesn't care if the "audit container ID" actually
> > > > matches the ID used by the container engine in userspace and I think
> > > > that is a very important line to draw.  Audit is simply given a value
> > > > which it calls the "audit container ID", it ensures that the value is
> > > > inherited appropriately (e.g. children inherit their parent's audit
> > > > container ID), and it uses the value in audit records to provide some
> > > > additional context for log analysis.  The distinction isn't limited to
> > > > the value itself, but also to how it is used; it is an "audit
> > > > container ID" and not a "container ID" because this value is
> > > > exclusively for use by the audit subsystem.  We are very intentionally
> > > > not adding a generic container ID to the kernel.  If the kernel does
> > > > ever grow a general purpose container ID we will be one of the first
> > > > ones in line to make use of it, but we are not going to be the ones to
> > > > generically add containers to the kernel.  Enough people already hate
> > > > audit ;)
> > > >
> > > > > > I'm not interested in supporting/merging something that isn't
> > > > > > useful;
> > > > > > if this doesn't work for your use case then we need to figure out
> > > > > > what
> > > > > > would work.  It sounds like nested containers are much more common
> > > > > > in
> > > > > > the lxc world, can you elaborate a bit more on this?
> > > > > >
> > > > > > As far as the possible solutions you mention above, I'm not sure I
> > > > > > like the per-userns audit container IDs, I'd much rather just emit
> > > > > > the
> > > > > > necessary tracking information via the audit record stream and let
> > > > > > the
> > > > > > log analysis tools figure it out.  However, the bigger question is
> > > > > > how
> > > > > > to limit (re)setting the audit container ID when you are in a
> > > > > > non-init
> > > > > > userns.  For reasons already mentioned, using capable() is a non
> > > > > > starter for everything but the initial userns, and using
> > > > > > ns_capable()
> > > > > > is equally poor as it essentially allows any userns the ability to
> > > > > > munge it's audit container ID (obviously not good).  It appears we
> > > > > > need a different method for controlling access to the audit
> > > > > > container
> > > > > > ID.
> > > > >
> > > > > One option would be to make it a string, and have it be append only.
> > > > > That should be safe with no checks.
> > > > >
> > > > > I know there was a long thread about what type to make this thing. I
> > > > > think you could accomplish the append-only-ness with a u64 if you had
> > > > > some rule about only allowing setting lower order bits than those
> > > > > that
> > > > > are already set. With 4 bits for simplicity:
> > > > >
> > > > > 1100         # initial container id
> > > > > 1100 -> 1011 # not allowed
> > > > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > > > >
> > > > >              # no lower order bits left
> > > > >
> > > > > There are probably fancier ways to do it if you actually understand
> > > > > math :)
> > > >
> > > >  ;)
> > > >
> > > > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > > > you have 64 bits, this might be reasonable. You could just teach
> > > > > container engines to use the first say N bits for themselves, with a
> > > > > 1
> > > > > bit for the barrier at the end.
> > > >
> > > > I like the creativity, but I worry that at some point these
> > > > limitations are going to be raised (limits have a funny way of doing
> > > > that over time) and we will be in trouble.  I say "trouble" because I
> > > > want to be able to quickly do an audit container ID comparison and
> > > > we're going to pay a penalty for these larger values (we'll need this
> > > > when we add multiple auditd support and the requisite record routing).
> > > >
> > > > Thinking about this makes me also realize we probably need to think a
> > > > bit longer about audit container ID conflicts between orchestrators.
> > > > Right now we just take the value that is given to us by the
> > > > orchestrator, but if we want to allow multiple container orchestrators
> > > > to work without some form of cooperation in userspace (I think we have
> > > > to assume the orchestrators will not talk to each other) we likely
> > > > need to have some way to block reuse of an audit container ID.  We
> > > > would either need to prevent the orchestrator from explicitly setting
> > > > an audit container ID to a currently in use value, or instead generate
> > > > the audit container ID in the kernel upon an event triggered by the
> > > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > > start looking at the idr code, I think we will need to make use of it.
> > >
> > > My first reaction to using the IDR code is that once an idr is given up,
> > > it can be reused.  I suppose we request IDRs and then never give them up
> > > to avoid reuse...
> >
> > I'm not sure we ever what to guarantee that an audit container ID
> > won't be reused during the lifetime of the system, it is a discrete
> > integer after all.  What I think we do want to guarantee is that we
> > won't allow an unintentional audit container ID collision between
> > different orchestrators; if a single orchestrator wants to reuse an
> > audit container ID then that is its choice.
> >
> > > I already had some ideas of preventing an existing ID from being reused,
> >
> > Cool.  I only made the idr suggestion since it is used for PIDs and
> > solves a very similar problem.
> >
> > > but that makes the practice of some container engines injecting
> > > processes into existing containers difficult if not impossible.
> >
> > Yes, we'll need some provision to indicate which orchestrator
> > "controls" that particular audit container ID, and allow that
> > orchestrator to reuse that particular audit container ID (until all
> > those containers disappear and the audit container ID is given back to
> > the pool).
>
>
>
>


--
paul moore
www.paul-moore.com
Richard Guy Briggs June 18, 2019, 10:46 p.m.
On 2019-06-18 18:12, Paul Moore wrote:
> On Mon, Jun 3, 2019 at 4:24 PM Steve Grubb <sgrubb@redhat.com> wrote:
> > Hello Paul,
> >
> > I am curious about this. We seemed to be close to getting this patch pulled
> > in. A lot of people are waiting for it. Can you summarize what you think the
> > patches need and who we think needs to do it? I'm lost. Does LXC people need
> > to propose something? Does Richard? Someone else? Who's got the ball?
> 
> [My apologies, this was lost in my inbox and I just not noticed it.]
> 
> Please don't top post on things sent to the mailing lists Steve, you
> know better than that.
> 
> Yes, things were moving along well, but upon talking with the LXC
> folks it appears we underestimated the importance of nested
> orchestrators.  I suspect my reply to Dan on the 4th covered your
> questions, if you didn't see it, here is the relevant snippet:
> 
> "To be clear, that's where we are at: we need to figure out what the
> kernel API would look like to support nested container orchestrators.
> My gut feeling is that this isn't going to be terribly complicated
> compared to the rest of the audit container ID work, but it is going
> to be some work.  We had a discussion about some potential solutions
> in the cover letter and it sounds like Richard is working up some
> ideas now, let's wait to see what that looks like."
> 
> ... and that is where we are at.  I'm looking forward to seeing
> Richard's next patchset.

I've rebased everything and am trying out some code to see if it will
address the concerns raised...  There will be more overhead on contid
write, and a tiny bit more for normal operations...

> > On Friday, May 31, 2019 8:44:45 AM EDT Paul Moore wrote:
> > > On Thu, May 30, 2019 at 8:21 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-05-30 19:26, Paul Moore wrote:
> > > > > On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > > > > > > [REMINDER: It is an "*audit* container ID" and not a general
> > > > > > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > > > > >
> > > > > > This sort of seems like a distinction without a difference;
> > > > > > presumably
> > > > > > audit is going to want to differentiate between everything that
> > > > > > people
> > > > > > in userspace call a container. So you'll have to support all this
> > > > > > insanity anyway, even if it's "not a container ID".
> > > > >
> > > > > That's not quite right.  Audit doesn't care about what a container is,
> > > > > or is not, it also doesn't care if the "audit container ID" actually
> > > > > matches the ID used by the container engine in userspace and I think
> > > > > that is a very important line to draw.  Audit is simply given a value
> > > > > which it calls the "audit container ID", it ensures that the value is
> > > > > inherited appropriately (e.g. children inherit their parent's audit
> > > > > container ID), and it uses the value in audit records to provide some
> > > > > additional context for log analysis.  The distinction isn't limited to
> > > > > the value itself, but also to how it is used; it is an "audit
> > > > > container ID" and not a "container ID" because this value is
> > > > > exclusively for use by the audit subsystem.  We are very intentionally
> > > > > not adding a generic container ID to the kernel.  If the kernel does
> > > > > ever grow a general purpose container ID we will be one of the first
> > > > > ones in line to make use of it, but we are not going to be the ones to
> > > > > generically add containers to the kernel.  Enough people already hate
> > > > > audit ;)
> > > > >
> > > > > > > I'm not interested in supporting/merging something that isn't
> > > > > > > useful;
> > > > > > > if this doesn't work for your use case then we need to figure out
> > > > > > > what
> > > > > > > would work.  It sounds like nested containers are much more common
> > > > > > > in
> > > > > > > the lxc world, can you elaborate a bit more on this?
> > > > > > >
> > > > > > > As far as the possible solutions you mention above, I'm not sure I
> > > > > > > like the per-userns audit container IDs, I'd much rather just emit
> > > > > > > the
> > > > > > > necessary tracking information via the audit record stream and let
> > > > > > > the
> > > > > > > log analysis tools figure it out.  However, the bigger question is
> > > > > > > how
> > > > > > > to limit (re)setting the audit container ID when you are in a
> > > > > > > non-init
> > > > > > > userns.  For reasons already mentioned, using capable() is a non
> > > > > > > starter for everything but the initial userns, and using
> > > > > > > ns_capable()
> > > > > > > is equally poor as it essentially allows any userns the ability to
> > > > > > > munge it's audit container ID (obviously not good).  It appears we
> > > > > > > need a different method for controlling access to the audit
> > > > > > > container
> > > > > > > ID.
> > > > > >
> > > > > > One option would be to make it a string, and have it be append only.
> > > > > > That should be safe with no checks.
> > > > > >
> > > > > > I know there was a long thread about what type to make this thing. I
> > > > > > think you could accomplish the append-only-ness with a u64 if you had
> > > > > > some rule about only allowing setting lower order bits than those
> > > > > > that
> > > > > > are already set. With 4 bits for simplicity:
> > > > > >
> > > > > > 1100         # initial container id
> > > > > > 1100 -> 1011 # not allowed
> > > > > > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> > > > > >
> > > > > >              # no lower order bits left
> > > > > >
> > > > > > There are probably fancier ways to do it if you actually understand
> > > > > > math :)
> > > > >
> > > > >  ;)
> > > > >
> > > > > > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > > > > > you have 64 bits, this might be reasonable. You could just teach
> > > > > > container engines to use the first say N bits for themselves, with a
> > > > > > 1
> > > > > > bit for the barrier at the end.
> > > > >
> > > > > I like the creativity, but I worry that at some point these
> > > > > limitations are going to be raised (limits have a funny way of doing
> > > > > that over time) and we will be in trouble.  I say "trouble" because I
> > > > > want to be able to quickly do an audit container ID comparison and
> > > > > we're going to pay a penalty for these larger values (we'll need this
> > > > > when we add multiple auditd support and the requisite record routing).
> > > > >
> > > > > Thinking about this makes me also realize we probably need to think a
> > > > > bit longer about audit container ID conflicts between orchestrators.
> > > > > Right now we just take the value that is given to us by the
> > > > > orchestrator, but if we want to allow multiple container orchestrators
> > > > > to work without some form of cooperation in userspace (I think we have
> > > > > to assume the orchestrators will not talk to each other) we likely
> > > > > need to have some way to block reuse of an audit container ID.  We
> > > > > would either need to prevent the orchestrator from explicitly setting
> > > > > an audit container ID to a currently in use value, or instead generate
> > > > > the audit container ID in the kernel upon an event triggered by the
> > > > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > > > start looking at the idr code, I think we will need to make use of it.
> > > >
> > > > My first reaction to using the IDR code is that once an idr is given up,
> > > > it can be reused.  I suppose we request IDRs and then never give them up
> > > > to avoid reuse...
> > >
> > > I'm not sure we ever what to guarantee that an audit container ID
> > > won't be reused during the lifetime of the system, it is a discrete
> > > integer after all.  What I think we do want to guarantee is that we
> > > won't allow an unintentional audit container ID collision between
> > > different orchestrators; if a single orchestrator wants to reuse an
> > > audit container ID then that is its choice.
> > >
> > > > I already had some ideas of preventing an existing ID from being reused,
> > >
> > > Cool.  I only made the idr suggestion since it is used for PIDs and
> > > solves a very similar problem.
> > >
> > > > but that makes the practice of some container engines injecting
> > > > processes into existing containers difficult if not impossible.
> > >
> > > Yes, we'll need some provision to indicate which orchestrator
> > > "controls" that particular audit container ID, and allow that
> > > orchestrator to reuse that particular audit container ID (until all
> > > those containers disappear and the audit container ID is given back to
> > > the pool).
> 
> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs July 8, 2019, 5:51 p.m.
On 2019-05-29 11:29, Paul Moore wrote:
> On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> >
> > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> > > It is not permitted to unset the audit container identifier.
> > > A child inherits its parent's audit container identifier.
> >
> > ...
> >
> > >  /**
> > > + * audit_set_contid - set current task's audit contid
> > > + * @contid: contid value
> > > + *
> > > + * Returns 0 on success, -EPERM on permission failure.
> > > + *
> > > + * Called (set) from fs/proc/base.c::proc_contid_write().
> > > + */
> > > +int audit_set_contid(struct task_struct *task, u64 contid)
> > > +{
> > > +     u64 oldcontid;
> > > +     int rc = 0;
> > > +     struct audit_buffer *ab;
> > > +     uid_t uid;
> > > +     struct tty_struct *tty;
> > > +     char comm[sizeof(current->comm)];
> > > +
> > > +     task_lock(task);
> > > +     /* Can't set if audit disabled */
> > > +     if (!task->audit) {
> > > +             task_unlock(task);
> > > +             return -ENOPROTOOPT;
> > > +     }
> > > +     oldcontid = audit_get_contid(task);
> > > +     read_lock(&tasklist_lock);
> > > +     /* Don't allow the audit containerid to be unset */
> > > +     if (!audit_contid_valid(contid))
> > > +             rc = -EINVAL;
> > > +     /* if we don't have caps, reject */
> > > +     else if (!capable(CAP_AUDIT_CONTROL))
> > > +             rc = -EPERM;
> > > +     /* if task has children or is not single-threaded, deny */
> > > +     else if (!list_empty(&task->children))
> > > +             rc = -EBUSY;
> > > +     else if (!(thread_group_leader(task) && thread_group_empty(task)))
> > > +             rc = -EALREADY;
> > > +     read_unlock(&tasklist_lock);
> > > +     if (!rc)
> > > +             task->audit->contid = contid;
> > > +     task_unlock(task);
> > > +
> > > +     if (!audit_enabled)
> > > +             return rc;
> >
> > ...but it is allowed to change it (assuming
> > capable(CAP_AUDIT_CONTROL), of course)? Seems like this might be more
> > immediately useful since we still live in the world of majority
> > privileged containers if we didn't allow changing it, in addition to
> > un-setting it.
> 
> The idea is that only container orchestrators should be able to
> set/modify the audit container ID, and since setting the audit
> container ID can have a significant effect on the records captured
> (and their routing to multiple daemons when we get there) modifying
> the audit container ID is akin to modifying the audit configuration
> which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> is that you would only change the audit container ID from one
> set/inherited value to another if you were nesting containers, in
> which case the nested container orchestrator would need to be granted
> CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> compromise).  We did consider allowing for a chain of nested audit
> container IDs, but the implications of doing so are significant
> (implementation mess, runtime cost, etc.) so we are leaving that out
> of this effort.

We had previously discussed the idea of restricting
orchestrators/engines from only being able to set the audit container
identifier on their own descendants, but it was discarded.  I've added a
check to ensure this is now enforced.

I've also added a check to ensure that a process can't set its own audit
container identifier and that if the identifier is already set, then the
orchestrator/engine must be in a descendant user namespace from the
orchestrator that set the previously inherited audit container
identifier.

> From a practical perspective, un-setting the audit container ID is
> pretty much the same as changing it from one set value to another so
> most of the above applies to that case as well.
> 
> -- 
> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs July 8, 2019, 6:05 p.m.
On 2019-05-30 15:29, Paul Moore wrote:
> On Thu, May 30, 2019 at 1:09 PM Serge E. Hallyn <serge@hallyn.com> wrote:
> > On Wed, May 29, 2019 at 06:39:48PM -0400, Paul Moore wrote:
> > > On Wed, May 29, 2019 at 6:28 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > On Wed, May 29, 2019 at 12:03:58PM -0400, Paul Moore wrote:
> > > > > On Wed, May 29, 2019 at 11:34 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > On Wed, May 29, 2019 at 11:29:05AM -0400, Paul Moore wrote:
> > > > > > > On Wed, May 29, 2019 at 10:57 AM Tycho Andersen <tycho@tycho.ws> wrote:
> > > > > > > > On Mon, Apr 08, 2019 at 11:39:09PM -0400, Richard Guy Briggs wrote:
> 
> ...
> 
> > > > > > > The current thinking
> > > > > > > is that you would only change the audit container ID from one
> > > > > > > set/inherited value to another if you were nesting containers, in
> > > > > > > which case the nested container orchestrator would need to be granted
> > > > > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > > > > compromise).
> > > >
> > > > won't work in user namespaced containers, because they will never be
> > > > capable(CAP_AUDIT_CONTROL); so I don't think this will work for
> > > > nesting as is. But maybe nobody cares :)
> > >
> > > That's fun :)
> > >
> > > To be honest, I've never been a big fan of supporting nested
> > > containers from an audit perspective, so I'm not really too upset
> > > about this.  The k8s/cri-o folks seem okay with this, or at least I
> > > haven't heard any objections; lxc folks, what do you have to say?
> >
> > I actually thought the answer to this (when last I looked, "some time" ago)
> > was that userspace should track an audit message saying "task X in
> > container Y is changing its auditid to Z", and then decide to also track Z.
> > This should be doable, but a lot of extra work in userspace.
> >
> > Per-userns containerids would also work.  So task X1 is in containerid
> > 1 on the host and creates a new task Y in new userns;  it continues to
> > be reported in init_user_ns as containerid 1 forever;  but in its own
> > userns it can request to be known as some other containerid.  Audit
> > socks would be per-userns, allowing root in a container to watch for
> > audit events in its own (and descendent) namespaces.
> >
> > But again I'm sure we've gone over all this in the last few years.
> >
> > I suppose we can look at this as a "first step", and talk about
> > making it user-ns-nestable later.  But agreed it's not useful in a
> > lot of situations as is.
> 
> [REMINDER: It is an "*audit* container ID" and not a general
> "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> 
> I'm not interested in supporting/merging something that isn't useful;
> if this doesn't work for your use case then we need to figure out what
> would work.  It sounds like nested containers are much more common in
> the lxc world, can you elaborate a bit more on this?
> 
> As far as the possible solutions you mention above, I'm not sure I
> like the per-userns audit container IDs, I'd much rather just emit the
> necessary tracking information via the audit record stream and let the
> log analysis tools figure it out.  However, the bigger question is how
> to limit (re)setting the audit container ID when you are in a non-init
> userns.  For reasons already mentioned, using capable() is a non
> starter for everything but the initial userns, and using ns_capable()
> is equally poor as it essentially allows any userns the ability to
> munge it's audit container ID (obviously not good).  It appears we
> need a different method for controlling access to the audit container
> ID.

We're not quite ready yet for multiple audit daemons and possibly not
yet for audit namespaces, but this is starting to look a lot like the
latter.

If we can't trust ns_capable() then why are we passing on
CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
by the orchestrator/engine.  If ns_capable() isn't inherited how is it
gained otherwise?  Can it be inserted by cotainer image?  I think the
answer is "no".  Either we trust ns_capable() or we have audit
namespaces (recommend based on user namespace) (or both).

At this point I would say we are at an impasse unless we trust
ns_capable() or we implement audit namespaces.

I don't think another mechanism to trust nested orchestrators/engines
will buy us anything.

Am I missing something?

> Punting this to a LSM hook is an obvious thing to do, and something we
> might want to do anyway, but currently audit doesn't rely on the LSM
> for proper/safe operation and I'm not sure I want to change that now.
> 
> The next obvious thing is to create some sort of access control knob
> in audit itself.  Perhaps an auditctl operation that would allow the
> administrator to specify which containers, via their corresponding
> audit container IDs, are allowed to change their audit container ID?
> The permission granting would need to be done in the init userns, but
> it would allow containers with a non-init userns the ability to change
> their audit container ID.  We would probably still want a
> ns_capable(CAP_AUDIT_CONTROL) restriction in this case.

This auditctl knob of which you speak is an additional API, not changing
the existing proposed one.

> Does anyone else have any other ideas?
> 
> -- 
> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs July 8, 2019, 6:12 p.m.
On 2019-05-30 19:26, Paul Moore wrote:
> On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
> > On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
> > >
> > > [REMINDER: It is an "*audit* container ID" and not a general
> > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > This sort of seems like a distinction without a difference; presumably
> > audit is going to want to differentiate between everything that people
> > in userspace call a container. So you'll have to support all this
> > insanity anyway, even if it's "not a container ID".
> 
> That's not quite right.  Audit doesn't care about what a container is,
> or is not, it also doesn't care if the "audit container ID" actually
> matches the ID used by the container engine in userspace and I think
> that is a very important line to draw.  Audit is simply given a value
> which it calls the "audit container ID", it ensures that the value is
> inherited appropriately (e.g. children inherit their parent's audit
> container ID), and it uses the value in audit records to provide some
> additional context for log analysis.  The distinction isn't limited to
> the value itself, but also to how it is used; it is an "audit
> container ID" and not a "container ID" because this value is
> exclusively for use by the audit subsystem.  We are very intentionally
> not adding a generic container ID to the kernel.  If the kernel does
> ever grow a general purpose container ID we will be one of the first
> ones in line to make use of it, but we are not going to be the ones to
> generically add containers to the kernel.  Enough people already hate
> audit ;)
> 
> > > I'm not interested in supporting/merging something that isn't useful;
> > > if this doesn't work for your use case then we need to figure out what
> > > would work.  It sounds like nested containers are much more common in
> > > the lxc world, can you elaborate a bit more on this?
> > >
> > > As far as the possible solutions you mention above, I'm not sure I
> > > like the per-userns audit container IDs, I'd much rather just emit the
> > > necessary tracking information via the audit record stream and let the
> > > log analysis tools figure it out.  However, the bigger question is how
> > > to limit (re)setting the audit container ID when you are in a non-init
> > > userns.  For reasons already mentioned, using capable() is a non
> > > starter for everything but the initial userns, and using ns_capable()
> > > is equally poor as it essentially allows any userns the ability to
> > > munge it's audit container ID (obviously not good).  It appears we
> > > need a different method for controlling access to the audit container
> > > ID.
> >
> > One option would be to make it a string, and have it be append only.
> > That should be safe with no checks.
> >
> > I know there was a long thread about what type to make this thing. I
> > think you could accomplish the append-only-ness with a u64 if you had
> > some rule about only allowing setting lower order bits than those that
> > are already set. With 4 bits for simplicity:
> >
> > 1100         # initial container id
> > 1100 -> 1011 # not allowed
> > 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
> >              # no lower order bits left
> >
> > There are probably fancier ways to do it if you actually understand
> > math :)
> 
>  ;)
> 
> > Since userns nesting is limited to 32 levels (right now, IIRC), and
> > you have 64 bits, this might be reasonable. You could just teach
> > container engines to use the first say N bits for themselves, with a 1
> > bit for the barrier at the end.
> 
> I like the creativity, but I worry that at some point these
> limitations are going to be raised (limits have a funny way of doing
> that over time) and we will be in trouble.  I say "trouble" because I
> want to be able to quickly do an audit container ID comparison and
> we're going to pay a penalty for these larger values (we'll need this
> when we add multiple auditd support and the requisite record routing).
> 
> Thinking about this makes me also realize we probably need to think a
> bit longer about audit container ID conflicts between orchestrators.
> Right now we just take the value that is given to us by the
> orchestrator, but if we want to allow multiple container orchestrators
> to work without some form of cooperation in userspace (I think we have
> to assume the orchestrators will not talk to each other) we likely
> need to have some way to block reuse of an audit container ID.  We
> would either need to prevent the orchestrator from explicitly setting
> an audit container ID to a currently in use value, or instead generate
> the audit container ID in the kernel upon an event triggered by the
> orchestrator (e.g. a write to a /proc file).  I suspect we should
> start looking at the idr code, I think we will need to make use of it.

To address this, I'd suggest that it is enforced to only allow the
setting of descendants and to maintain a master list of audit container
identifiers (with a hash table if necessary later) that includes the
container owner.

This also allows the orchestrator/engine to inject processes into
existing containers by checking that the audit container identifier is
only used again by the same owner.

I have working code for both.

> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore July 8, 2019, 8:43 p.m.
On July 8, 2019 8:12:56 PM Richard Guy Briggs <rgb@redhat.com> wrote:

> On 2019-05-30 19:26, Paul Moore wrote:
>> On Thu, May 30, 2019 at 5:29 PM Tycho Andersen <tycho@tycho.ws> wrote:
>>> On Thu, May 30, 2019 at 03:29:32PM -0400, Paul Moore wrote:
>>>>
>>>>
>>>> [REMINDER: It is an "*audit* container ID" and not a general
>>>> "container ID" ;)  Smiley aside, I'm not kidding about that part.]
>>>
>>> This sort of seems like a distinction without a difference; presumably
>>> audit is going to want to differentiate between everything that people
>>> in userspace call a container. So you'll have to support all this
>>> insanity anyway, even if it's "not a container ID".
>>
>> That's not quite right.  Audit doesn't care about what a container is,
>> or is not, it also doesn't care if the "audit container ID" actually
>> matches the ID used by the container engine in userspace and I think
>> that is a very important line to draw.  Audit is simply given a value
>> which it calls the "audit container ID", it ensures that the value is
>> inherited appropriately (e.g. children inherit their parent's audit
>> container ID), and it uses the value in audit records to provide some
>> additional context for log analysis.  The distinction isn't limited to
>> the value itself, but also to how it is used; it is an "audit
>> container ID" and not a "container ID" because this value is
>> exclusively for use by the audit subsystem.  We are very intentionally
>> not adding a generic container ID to the kernel.  If the kernel does
>> ever grow a general purpose container ID we will be one of the first
>> ones in line to make use of it, but we are not going to be the ones to
>> generically add containers to the kernel.  Enough people already hate
>> audit ;)
>>
>>>> I'm not interested in supporting/merging something that isn't useful;
>>>> if this doesn't work for your use case then we need to figure out what
>>>> would work.  It sounds like nested containers are much more common in
>>>> the lxc world, can you elaborate a bit more on this?
>>>>
>>>>
>>>> As far as the possible solutions you mention above, I'm not sure I
>>>> like the per-userns audit container IDs, I'd much rather just emit the
>>>> necessary tracking information via the audit record stream and let the
>>>> log analysis tools figure it out.  However, the bigger question is how
>>>> to limit (re)setting the audit container ID when you are in a non-init
>>>> userns.  For reasons already mentioned, using capable() is a non
>>>> starter for everything but the initial userns, and using ns_capable()
>>>> is equally poor as it essentially allows any userns the ability to
>>>> munge it's audit container ID (obviously not good).  It appears we
>>>> need a different method for controlling access to the audit container
>>>> ID.
>>>
>>> One option would be to make it a string, and have it be append only.
>>> That should be safe with no checks.
>>>
>>> I know there was a long thread about what type to make this thing. I
>>> think you could accomplish the append-only-ness with a u64 if you had
>>> some rule about only allowing setting lower order bits than those that
>>> are already set. With 4 bits for simplicity:
>>>
>>> 1100         # initial container id
>>> 1100 -> 1011 # not allowed
>>> 1100 -> 1101 # allowed, but now 1101 is set in stone since there are
>>>       # no lower order bits left
>>>
>>> There are probably fancier ways to do it if you actually understand
>>> math :)
>>
>> ;)
>>
>>> Since userns nesting is limited to 32 levels (right now, IIRC), and
>>> you have 64 bits, this might be reasonable. You could just teach
>>> container engines to use the first say N bits for themselves, with a 1
>>> bit for the barrier at the end.
>>
>> I like the creativity, but I worry that at some point these
>> limitations are going to be raised (limits have a funny way of doing
>> that over time) and we will be in trouble.  I say "trouble" because I
>> want to be able to quickly do an audit container ID comparison and
>> we're going to pay a penalty for these larger values (we'll need this
>> when we add multiple auditd support and the requisite record routing).
>>
>> Thinking about this makes me also realize we probably need to think a
>> bit longer about audit container ID conflicts between orchestrators.
>> Right now we just take the value that is given to us by the
>> orchestrator, but if we want to allow multiple container orchestrators
>> to work without some form of cooperation in userspace (I think we have
>> to assume the orchestrators will not talk to each other) we likely
>> need to have some way to block reuse of an audit container ID.  We
>> would either need to prevent the orchestrator from explicitly setting
>> an audit container ID to a currently in use value, or instead generate
>> the audit container ID in the kernel upon an event triggered by the
>> orchestrator (e.g. a write to a /proc file).  I suspect we should
>> start looking at the idr code, I think we will need to make use of it.
>
> To address this, I'd suggest that it is enforced to only allow the
> setting of descendants and to maintain a master list of audit container
> identifiers (with a hash table if necessary later) that includes the
> container owner.
>
> This also allows the orchestrator/engine to inject processes into
> existing containers by checking that the audit container identifier is
> only used again by the same owner.
>
> I have working code for both.

Just a quick note that due to some holiday travel I'm not going to be able to adequately respond to your latest messages on this thread for at least another week, likely a bit more.  I'm only checking mail to put out fires, and the audit container ID work tends to be something that starts them ;)

--
paul moore
www.paul-moore.com
Paul Moore July 15, 2019, 8:38 p.m.
On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-29 11:29, Paul Moore wrote:

...

> > The idea is that only container orchestrators should be able to
> > set/modify the audit container ID, and since setting the audit
> > container ID can have a significant effect on the records captured
> > (and their routing to multiple daemons when we get there) modifying
> > the audit container ID is akin to modifying the audit configuration
> > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > is that you would only change the audit container ID from one
> > set/inherited value to another if you were nesting containers, in
> > which case the nested container orchestrator would need to be granted
> > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > compromise).  We did consider allowing for a chain of nested audit
> > container IDs, but the implications of doing so are significant
> > (implementation mess, runtime cost, etc.) so we are leaving that out
> > of this effort.
>
> We had previously discussed the idea of restricting
> orchestrators/engines from only being able to set the audit container
> identifier on their own descendants, but it was discarded.  I've added a
> check to ensure this is now enforced.

When we weren't allowing nested orchestrators it wasn't necessary, but
with the move to support nesting I believe this will be a requirement.
We might also need/want to restrict audit container ID changes if a
descendant is acting as a container orchestrator and managing one or
more audit container IDs; although I'm less certain of the need for
this.

> I've also added a check to ensure that a process can't set its own audit
> container identifier ...

What does this protect against, or what problem does this solve?
Considering how easy it is to fork/exec, it seems like this could be
trivially bypassed.

> ... and that if the identifier is already set, then the
> orchestrator/engine must be in a descendant user namespace from the
> orchestrator that set the previously inherited audit container
> identifier.

You lost me here ... although I don't like the idea of relying on X
namespace inheritance for a hard coded policy on setting the audit
container ID; we've worked hard to keep this independent of any
definition of a "container" and it would sadden me greatly if we had
to go back on that.
Paul Moore July 15, 2019, 9:04 p.m.
On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 15:29, Paul Moore wrote:

...

> > [REMINDER: It is an "*audit* container ID" and not a general
> > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> >
> > I'm not interested in supporting/merging something that isn't useful;
> > if this doesn't work for your use case then we need to figure out what
> > would work.  It sounds like nested containers are much more common in
> > the lxc world, can you elaborate a bit more on this?
> >
> > As far as the possible solutions you mention above, I'm not sure I
> > like the per-userns audit container IDs, I'd much rather just emit the
> > necessary tracking information via the audit record stream and let the
> > log analysis tools figure it out.  However, the bigger question is how
> > to limit (re)setting the audit container ID when you are in a non-init
> > userns.  For reasons already mentioned, using capable() is a non
> > starter for everything but the initial userns, and using ns_capable()
> > is equally poor as it essentially allows any userns the ability to
> > munge it's audit container ID (obviously not good).  It appears we
> > need a different method for controlling access to the audit container
> > ID.
>
> We're not quite ready yet for multiple audit daemons and possibly not
> yet for audit namespaces, but this is starting to look a lot like the
> latter.

A few quick comments on audit namespaces: the audit container ID is
not envisioned as a new namespace (even in nested form) and neither do
I consider running multiple audit daemons to be a new namespace.

From my perspective we create namespaces to allow us to redefine a
global resource for some subset of the system, e.g. providing a unique
/tmp for some number of processes on the system.  While it may be
tempting to think of the audit container ID as something we could
"namespace", especially when multiple audit daemons are concerned, in
some ways this would be counter productive; the audit container ID is
intended to be a global ID that can be used to associate audit event
records with a "container" where the "container" is defined by an
orchestrator outside the audit subsystem.  The global nature of the
audit container ID allows us to maintain a sane(ish) view of the
system in the audit log, if we were to "namespace" the audit container
ID such that the value was no longer guaranteed to be unique
throughout the system, we would need to additionally track the audit
namespace along with the audit container ID which starts to border on
insanity IMHO.

> If we can't trust ns_capable() then why are we passing on
> CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> gained otherwise?  Can it be inserted by cotainer image?  I think the
> answer is "no".  Either we trust ns_capable() or we have audit
> namespaces (recommend based on user namespace) (or both).

My thinking is that since ns_capable() checks the credentials with
respect to the current user namespace we can't rely on it to control
access since it would be possible for a privileged process running
inside an unprivileged container to manipulate the audit container ID
(containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
the container, while the container itself does not).

> At this point I would say we are at an impasse unless we trust
> ns_capable() or we implement audit namespaces.

I'm not sure how we can trust ns_capable(), but if you can think of a
way I would love to hear it.  I'm also not sure how namespacing audit
is helpful (see my above comments), but if you think it is please
explain.

> I don't think another mechanism to trust nested orchestrators/engines
> will buy us anything.
>
> Am I missing something?

Based on your questions/comments above it looks like your
understanding of ns_capable() does not match mine; if I'm thinking
about ns_capable() incorrectly, please educate me.
Paul Moore July 15, 2019, 9:09 p.m.
On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-05-30 19:26, Paul Moore wrote:

...

> > I like the creativity, but I worry that at some point these
> > limitations are going to be raised (limits have a funny way of doing
> > that over time) and we will be in trouble.  I say "trouble" because I
> > want to be able to quickly do an audit container ID comparison and
> > we're going to pay a penalty for these larger values (we'll need this
> > when we add multiple auditd support and the requisite record routing).
> >
> > Thinking about this makes me also realize we probably need to think a
> > bit longer about audit container ID conflicts between orchestrators.
> > Right now we just take the value that is given to us by the
> > orchestrator, but if we want to allow multiple container orchestrators
> > to work without some form of cooperation in userspace (I think we have
> > to assume the orchestrators will not talk to each other) we likely
> > need to have some way to block reuse of an audit container ID.  We
> > would either need to prevent the orchestrator from explicitly setting
> > an audit container ID to a currently in use value, or instead generate
> > the audit container ID in the kernel upon an event triggered by the
> > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > start looking at the idr code, I think we will need to make use of it.
>
> To address this, I'd suggest that it is enforced to only allow the
> setting of descendants and to maintain a master list of audit container
> identifiers (with a hash table if necessary later) that includes the
> container owner.

We're discussing the audit container ID management policy elsewhere in
this thread so I won't comment on that here, but I did want to say
that we will likely need something better than a simple list of audit
container IDs from the start.  It's common for systems to have
thousands of containers now (or multiple thousands), which tells me
that a list is a poor choice.  You mentioned a hash table, so I would
suggest starting with that over the list for the initial patchset.
Richard Guy Briggs July 16, 2019, 3:37 p.m.
On 2019-07-15 17:09, Paul Moore wrote:
> On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-05-30 19:26, Paul Moore wrote:
> 
> ...
> 
> > > I like the creativity, but I worry that at some point these
> > > limitations are going to be raised (limits have a funny way of doing
> > > that over time) and we will be in trouble.  I say "trouble" because I
> > > want to be able to quickly do an audit container ID comparison and
> > > we're going to pay a penalty for these larger values (we'll need this
> > > when we add multiple auditd support and the requisite record routing).
> > >
> > > Thinking about this makes me also realize we probably need to think a
> > > bit longer about audit container ID conflicts between orchestrators.
> > > Right now we just take the value that is given to us by the
> > > orchestrator, but if we want to allow multiple container orchestrators
> > > to work without some form of cooperation in userspace (I think we have
> > > to assume the orchestrators will not talk to each other) we likely
> > > need to have some way to block reuse of an audit container ID.  We
> > > would either need to prevent the orchestrator from explicitly setting
> > > an audit container ID to a currently in use value, or instead generate
> > > the audit container ID in the kernel upon an event triggered by the
> > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > start looking at the idr code, I think we will need to make use of it.
> >
> > To address this, I'd suggest that it is enforced to only allow the
> > setting of descendants and to maintain a master list of audit container
> > identifiers (with a hash table if necessary later) that includes the
> > container owner.
> 
> We're discussing the audit container ID management policy elsewhere in
> this thread so I won't comment on that here, but I did want to say
> that we will likely need something better than a simple list of audit
> container IDs from the start.  It's common for systems to have
> thousands of containers now (or multiple thousands), which tells me
> that a list is a poor choice.  You mentioned a hash table, so I would
> suggest starting with that over the list for the initial patchset.

I saw that as an internal incremental improvement that did not affect
the API, so I wanted to keep things a bit simpler (as you've requested
in the past) to get this going, and add that enhancement later.

I'll start working on it now.  The hash table would simply point to
lists anyways unless you can recommend a better approach.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore July 16, 2019, 4:08 p.m.
On Tue, Jul 16, 2019 at 11:37 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-15 17:09, Paul Moore wrote:
> > On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-05-30 19:26, Paul Moore wrote:
> >
> > ...
> >
> > > > I like the creativity, but I worry that at some point these
> > > > limitations are going to be raised (limits have a funny way of doing
> > > > that over time) and we will be in trouble.  I say "trouble" because I
> > > > want to be able to quickly do an audit container ID comparison and
> > > > we're going to pay a penalty for these larger values (we'll need this
> > > > when we add multiple auditd support and the requisite record routing).
> > > >
> > > > Thinking about this makes me also realize we probably need to think a
> > > > bit longer about audit container ID conflicts between orchestrators.
> > > > Right now we just take the value that is given to us by the
> > > > orchestrator, but if we want to allow multiple container orchestrators
> > > > to work without some form of cooperation in userspace (I think we have
> > > > to assume the orchestrators will not talk to each other) we likely
> > > > need to have some way to block reuse of an audit container ID.  We
> > > > would either need to prevent the orchestrator from explicitly setting
> > > > an audit container ID to a currently in use value, or instead generate
> > > > the audit container ID in the kernel upon an event triggered by the
> > > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > > start looking at the idr code, I think we will need to make use of it.
> > >
> > > To address this, I'd suggest that it is enforced to only allow the
> > > setting of descendants and to maintain a master list of audit container
> > > identifiers (with a hash table if necessary later) that includes the
> > > container owner.
> >
> > We're discussing the audit container ID management policy elsewhere in
> > this thread so I won't comment on that here, but I did want to say
> > that we will likely need something better than a simple list of audit
> > container IDs from the start.  It's common for systems to have
> > thousands of containers now (or multiple thousands), which tells me
> > that a list is a poor choice.  You mentioned a hash table, so I would
> > suggest starting with that over the list for the initial patchset.
>
> I saw that as an internal incremental improvement that did not affect
> the API, so I wanted to keep things a bit simpler (as you've requested
> in the past) to get this going, and add that enhancement later.

In general a simple approach is a good way to start when the
problem/use-case is not very well understood; in other words, don't
spend a lot of time/effort optimizing something you don't yet
understand.  In this case we know that people want to deploy a *lot*
of containers on a single system so we should design the data
structures appropriately.  A list is simply not a good fit here, I
believe/hope you know that too.

> I'll start working on it now.  The hash table would simply point to
> lists anyways unless you can recommend a better approach.

I assume when you say "point to lists" you are talking about using
lists for the hash buckets?  If so, yes that should be fine at this
point.  In general if the per-bucket lists become a bottleneck we can
look at the size of the table (or make it tunable) or even use a
different approach entirely.  Ultimately the data store is an
implementation detail private to the audit subsystem in the kernel so
we should be able to change it as necessary without breaking anything.
Richard Guy Briggs July 16, 2019, 4:26 p.m.
On 2019-07-16 12:08, Paul Moore wrote:
> On Tue, Jul 16, 2019 at 11:37 AM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-07-15 17:09, Paul Moore wrote:
> > > On Mon, Jul 8, 2019 at 2:12 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > > On 2019-05-30 19:26, Paul Moore wrote:
> > >
> > > ...
> > >
> > > > > I like the creativity, but I worry that at some point these
> > > > > limitations are going to be raised (limits have a funny way of doing
> > > > > that over time) and we will be in trouble.  I say "trouble" because I
> > > > > want to be able to quickly do an audit container ID comparison and
> > > > > we're going to pay a penalty for these larger values (we'll need this
> > > > > when we add multiple auditd support and the requisite record routing).
> > > > >
> > > > > Thinking about this makes me also realize we probably need to think a
> > > > > bit longer about audit container ID conflicts between orchestrators.
> > > > > Right now we just take the value that is given to us by the
> > > > > orchestrator, but if we want to allow multiple container orchestrators
> > > > > to work without some form of cooperation in userspace (I think we have
> > > > > to assume the orchestrators will not talk to each other) we likely
> > > > > need to have some way to block reuse of an audit container ID.  We
> > > > > would either need to prevent the orchestrator from explicitly setting
> > > > > an audit container ID to a currently in use value, or instead generate
> > > > > the audit container ID in the kernel upon an event triggered by the
> > > > > orchestrator (e.g. a write to a /proc file).  I suspect we should
> > > > > start looking at the idr code, I think we will need to make use of it.
> > > >
> > > > To address this, I'd suggest that it is enforced to only allow the
> > > > setting of descendants and to maintain a master list of audit container
> > > > identifiers (with a hash table if necessary later) that includes the
> > > > container owner.
> > >
> > > We're discussing the audit container ID management policy elsewhere in
> > > this thread so I won't comment on that here, but I did want to say
> > > that we will likely need something better than a simple list of audit
> > > container IDs from the start.  It's common for systems to have
> > > thousands of containers now (or multiple thousands), which tells me
> > > that a list is a poor choice.  You mentioned a hash table, so I would
> > > suggest starting with that over the list for the initial patchset.
> >
> > I saw that as an internal incremental improvement that did not affect
> > the API, so I wanted to keep things a bit simpler (as you've requested
> > in the past) to get this going, and add that enhancement later.
> 
> In general a simple approach is a good way to start when the
> problem/use-case is not very well understood; in other words, don't
> spend a lot of time/effort optimizing something you don't yet
> understand.  In this case we know that people want to deploy a *lot*
> of containers on a single system so we should design the data
> structures appropriately.  A list is simply not a good fit here, I
> believe/hope you know that too.

Yes, I knew that, which is why I alluded to a hash table...

> > I'll start working on it now.  The hash table would simply point to
> > lists anyways unless you can recommend a better approach.
> 
> I assume when you say "point to lists" you are talking about using
> lists for the hash buckets?  If so, yes that should be fine at this
> point.  In general if the per-bucket lists become a bottleneck we can
> look at the size of the table (or make it tunable) or even use a
> different approach entirely.  Ultimately the data store is an
> implementation detail private to the audit subsystem in the kernel so
> we should be able to change it as necessary without breaking anything.

Yes, this is what I had in mind.  It would be tunable either by a macro
or a config option, so the exact value isn't a critical implementation
detail that can be easily tuned as we gain experience with it.  And yes,
the intent was that it was a non-user-perceivable implementation choice
other than performace metrics.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Richard Guy Briggs July 16, 2019, 7:38 p.m.
On 2019-07-15 16:38, Paul Moore wrote:
> On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-05-29 11:29, Paul Moore wrote:
> 
> ...
> 
> > > The idea is that only container orchestrators should be able to
> > > set/modify the audit container ID, and since setting the audit
> > > container ID can have a significant effect on the records captured
> > > (and their routing to multiple daemons when we get there) modifying
> > > the audit container ID is akin to modifying the audit configuration
> > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > is that you would only change the audit container ID from one
> > > set/inherited value to another if you were nesting containers, in
> > > which case the nested container orchestrator would need to be granted
> > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > compromise).  We did consider allowing for a chain of nested audit
> > > container IDs, but the implications of doing so are significant
> > > (implementation mess, runtime cost, etc.) so we are leaving that out
> > > of this effort.
> >
> > We had previously discussed the idea of restricting
> > orchestrators/engines from only being able to set the audit container
> > identifier on their own descendants, but it was discarded.  I've added a
> > check to ensure this is now enforced.
> 
> When we weren't allowing nested orchestrators it wasn't necessary, but
> with the move to support nesting I believe this will be a requirement.
> We might also need/want to restrict audit container ID changes if a
> descendant is acting as a container orchestrator and managing one or
> more audit container IDs; although I'm less certain of the need for
> this.

I was of the opinion it was necessary before with single-layer parallel
orchestrators/engines.

> > I've also added a check to ensure that a process can't set its own audit
> > container identifier ...
> 
> What does this protect against, or what problem does this solve?
> Considering how easy it is to fork/exec, it seems like this could be
> trivially bypassed.

Well, for starters, it would remove one layer of nesting.  It would
separate the functional layers of processes.  Other than that, it seems
like a gut feeling that it is just wrong to allow it.  It seems like a
layer violation that one container orchestrator/engine could set its own
audit container identifier and then set its children as well.  It would
be its own parent.  It would make it harder to verify adherance to
descendancy and inheritance rules.

> > ... and that if the identifier is already set, then the
> > orchestrator/engine must be in a descendant user namespace from the
> > orchestrator that set the previously inherited audit container
> > identifier.
> 
> You lost me here ... although I don't like the idea of relying on X
> namespace inheritance for a hard coded policy on setting the audit
> container ID; we've worked hard to keep this independent of any
> definition of a "container" and it would sadden me greatly if we had
> to go back on that.

This would seem to be the one concession I'm reluctantly making to try
to solve this nested container orchestrator/engine challenge.

Would backing off on that descendant user namespace requirement and only
require that a nested audit container identifier only be permitted on a
descendant task be sufficient?  It may for this use case, but I suspect
not for additional audit daemons (we're not there yet) and message
routing to those daemons.

The one difference here is that it does not depend on this if the audit
container identifier has not already been set.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore July 16, 2019, 9:39 p.m.
On Tue, Jul 16, 2019 at 3:38 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-15 16:38, Paul Moore wrote:
> > On Mon, Jul 8, 2019 at 1:51 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > > On 2019-05-29 11:29, Paul Moore wrote:
> >
> > ...
> >
> > > > The idea is that only container orchestrators should be able to
> > > > set/modify the audit container ID, and since setting the audit
> > > > container ID can have a significant effect on the records captured
> > > > (and their routing to multiple daemons when we get there) modifying
> > > > the audit container ID is akin to modifying the audit configuration
> > > > which is why it is gated by CAP_AUDIT_CONTROL.  The current thinking
> > > > is that you would only change the audit container ID from one
> > > > set/inherited value to another if you were nesting containers, in
> > > > which case the nested container orchestrator would need to be granted
> > > > CAP_AUDIT_CONTROL (which everyone to date seems to agree is a workable
> > > > compromise).  We did consider allowing for a chain of nested audit
> > > > container IDs, but the implications of doing so are significant
> > > > (implementation mess, runtime cost, etc.) so we are leaving that out
> > > > of this effort.
> > >
> > > We had previously discussed the idea of restricting
> > > orchestrators/engines from only being able to set the audit container
> > > identifier on their own descendants, but it was discarded.  I've added a
> > > check to ensure this is now enforced.
> >
> > When we weren't allowing nested orchestrators it wasn't necessary, but
> > with the move to support nesting I believe this will be a requirement.
> > We might also need/want to restrict audit container ID changes if a
> > descendant is acting as a container orchestrator and managing one or
> > more audit container IDs; although I'm less certain of the need for
> > this.
>
> I was of the opinion it was necessary before with single-layer parallel
> orchestrators/engines.

One of the many things we've disagreed on, but it doesn't really
matter at this point.

> > > I've also added a check to ensure that a process can't set its own audit
> > > container identifier ...
> >
> > What does this protect against, or what problem does this solve?
> > Considering how easy it is to fork/exec, it seems like this could be
> > trivially bypassed.
>
> Well, for starters, it would remove one layer of nesting.  It would
> separate the functional layers of processes.

This doesn't seem like something we need to protect against, what's
the harm?  My opinion at this point is that we should only add
restrictions to protect against problematic or dangerous situations; I
don't believe one extra layer of nesting counts as either.

Perhaps the container folks on the To/CC line can comment on this?  If
there is a valid reason for this restriction, great, let's do it,
otherwise it seems like an unnecessary hard coded policy to me.

> Other than that, it seems
> like a gut feeling that it is just wrong to allow it.  It seems like a
> layer violation that one container orchestrator/engine could set its own
> audit container identifier and then set its children as well.  It would
> be its own parent.

I suspect you are right that the current crop of container engines
won't do this, but who knows what we'll be doing with "containers" 5,
or even 10, years from now.  With that in mind, let me ask the
question again: is allowing an orchestrator the ability to set its own
audit container ID problematic and/or dangerous?

> It would make it harder to verify adherance to descendancy and inheritance rules.

The audit log should contain all the information needed to track that,
right?  If it doesn't, then I think we have a problem with the
information we are logging.  Right?

> > > ... and that if the identifier is already set, then the
> > > orchestrator/engine must be in a descendant user namespace from the
> > > orchestrator that set the previously inherited audit container
> > > identifier.
> >
> > You lost me here ... although I don't like the idea of relying on X
> > namespace inheritance for a hard coded policy on setting the audit
> > container ID; we've worked hard to keep this independent of any
> > definition of a "container" and it would sadden me greatly if we had
> > to go back on that.
>
> This would seem to be the one concession I'm reluctantly making to try
> to solve this nested container orchestrator/engine challenge.

As I said, you lost me on this - how does this help?  A more detailed
explanation of how this helps resolve the nesting problem would be
useful.

> Would backing off on that descendant user namespace requirement and only
> require that a nested audit container identifier only be permitted on a
> descendant task be sufficient?  It may for this use case, but I suspect
> not for additional audit daemons (we're not there yet) and message
> routing to those daemons.
>
> The one difference here is that it does not depend on this if the audit
> container identifier has not already been set.
Richard Guy Briggs July 16, 2019, 10:03 p.m.
On 2019-07-15 17:04, Paul Moore wrote:
> On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-05-30 15:29, Paul Moore wrote:
> 
> ...
> 
> > > [REMINDER: It is an "*audit* container ID" and not a general
> > > "container ID" ;)  Smiley aside, I'm not kidding about that part.]
> > >
> > > I'm not interested in supporting/merging something that isn't useful;
> > > if this doesn't work for your use case then we need to figure out what
> > > would work.  It sounds like nested containers are much more common in
> > > the lxc world, can you elaborate a bit more on this?
> > >
> > > As far as the possible solutions you mention above, I'm not sure I
> > > like the per-userns audit container IDs, I'd much rather just emit the
> > > necessary tracking information via the audit record stream and let the
> > > log analysis tools figure it out.  However, the bigger question is how
> > > to limit (re)setting the audit container ID when you are in a non-init
> > > userns.  For reasons already mentioned, using capable() is a non
> > > starter for everything but the initial userns, and using ns_capable()
> > > is equally poor as it essentially allows any userns the ability to
> > > munge it's audit container ID (obviously not good).  It appears we
> > > need a different method for controlling access to the audit container
> > > ID.
> >
> > We're not quite ready yet for multiple audit daemons and possibly not
> > yet for audit namespaces, but this is starting to look a lot like the
> > latter.
> 
> A few quick comments on audit namespaces: the audit container ID is
> not envisioned as a new namespace (even in nested form) and neither do
> I consider running multiple audit daemons to be a new namespace.

I can picture either one.

> From my perspective we create namespaces to allow us to redefine a
> global resource for some subset of the system, e.g. providing a unique
> /tmp for some number of processes on the system.  While it may be
> tempting to think of the audit container ID as something we could
> "namespace", especially when multiple audit daemons are concerned, in
> some ways this would be counter productive; the audit container ID is
> intended to be a global ID that can be used to associate audit event
> records with a "container" where the "container" is defined by an
> orchestrator outside the audit subsystem.  The global nature of the
> audit container ID allows us to maintain a sane(ish) view of the
> system in the audit log, if we were to "namespace" the audit container
> ID such that the value was no longer guaranteed to be unique
> throughout the system, we would need to additionally track the audit
> namespace along with the audit container ID which starts to border on
> insanity IMHO.

Understood.  And mostly agree.  Any audit namespace would have to be a
hybrid anyways, since only the init one would have full access to audit
resources.  All the others would be somewhat neutered.  And in the case
of checking for previous usage of a contid, if it was not already in use
in the hypothetical audit namespace but was in use elsewhere in the
system and we blocked its usage in this namespace, it would leak that
information by blocking it.

I saw it as a way of permitting layering with the natural descendancy
structure showing that hierarchy.  The potential flaw with my reasoning
is that a parent could exit and its children would get re-parented onto
its next ancestor, so the intermediate task with an intermediate contid
would break that contid documentation chain.

> > If we can't trust ns_capable() then why are we passing on
> > CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> > by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> > gained otherwise?  Can it be inserted by cotainer image?  I think the
> > answer is "no".  Either we trust ns_capable() or we have audit
> > namespaces (recommend based on user namespace) (or both).
> 
> My thinking is that since ns_capable() checks the credentials with
> respect to the current user namespace we can't rely on it to control
> access since it would be possible for a privileged process running
> inside an unprivileged container to manipulate the audit container ID
> (containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
> the container, while the container itself does not).

What makes an unprivileged container unprivileged?  "root", or "CAP_*"?

If CAP_AUDIT_CONTROL is granted, does "root" matter?  Does it matter
what user namespace it is in?  I understand that root is *gained* in an
unprivileged user namespace, but capabilities are inherited or permitted
and that process either has it or it doesn't and an unprivileged user
namespace can't gain a capability that has been rescinded.  Different
subsystems use the userid or capabilities or both to determine
privileges.  In this case, is the userid relevant?

> > At this point I would say we are at an impasse unless we trust
> > ns_capable() or we implement audit namespaces.
> 
> I'm not sure how we can trust ns_capable(), but if you can think of a
> way I would love to hear it.  I'm also not sure how namespacing audit
> is helpful (see my above comments), but if you think it is please
> explain.

So if we are not namespacing, why do we not trust capabilities?

> > I don't think another mechanism to trust nested orchestrators/engines
> > will buy us anything.
> >
> > Am I missing something?
> 
> Based on your questions/comments above it looks like your
> understanding of ns_capable() does not match mine; if I'm thinking
> about ns_capable() incorrectly, please educate me.
> 
> -- 
> paul moore
> www.paul-moore.com

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore July 16, 2019, 11:30 p.m.
On Tue, Jul 16, 2019 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-15 17:04, Paul Moore wrote:
> > On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:

...

> > > If we can't trust ns_capable() then why are we passing on
> > > CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> > > by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> > > gained otherwise?  Can it be inserted by cotainer image?  I think the
> > > answer is "no".  Either we trust ns_capable() or we have audit
> > > namespaces (recommend based on user namespace) (or both).
> >
> > My thinking is that since ns_capable() checks the credentials with
> > respect to the current user namespace we can't rely on it to control
> > access since it would be possible for a privileged process running
> > inside an unprivileged container to manipulate the audit container ID
> > (containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
> > the container, while the container itself does not).
>
> What makes an unprivileged container unprivileged?  "root", or "CAP_*"?

My understanding is that when most people refer to an unprivileged
container they are referring to a container run without capabilities
or a container run by a user other than root.  I'm sure there are
better definitions out there, by folks much smarter than me on these
things, but that's my working definition.

> If CAP_AUDIT_CONTROL is granted, does "root" matter?

Our discussions here have been about capabilities, not UIDs.  The only
reason root might matter is that it generally has the full capability
set.

> Does it matter what user namespace it is in?

What likely matters is what check is called: capable() or
ns_capable().  Those can yield very different results.

> I understand that root is *gained* in an
> unprivileged user namespace, but capabilities are inherited or permitted
> and that process either has it or it doesn't and an unprivileged user
> namespace can't gain a capability that has been rescinded.  Different
> subsystems use the userid or capabilities or both to determine
> privileges.

Once again, I believe the important thing to focus on here is
capable() vs ns_capable().  We can't safely rely on ns_capable() for
the audit container ID policy since that is easily met inside the
container regardless of the process' creds which started the
container.

> In this case, is the userid relevant?

We don't do UID checks, we do capability checks, so yes, the UID is irrelevant.

> > > At this point I would say we are at an impasse unless we trust
> > > ns_capable() or we implement audit namespaces.
> >
> > I'm not sure how we can trust ns_capable(), but if you can think of a
> > way I would love to hear it.  I'm also not sure how namespacing audit
> > is helpful (see my above comments), but if you think it is please
> > explain.
>
> So if we are not namespacing, why do we not trust capabilities?

We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
Richard Guy Briggs July 18, 2019, 12:51 a.m.
On 2019-07-16 19:30, Paul Moore wrote:
> On Tue, Jul 16, 2019 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> > On 2019-07-15 17:04, Paul Moore wrote:
> > > On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> 
> ...
> 
> > > > If we can't trust ns_capable() then why are we passing on
> > > > CAP_AUDIT_CONTROL?  It is being passed down and not stripped purposely
> > > > by the orchestrator/engine.  If ns_capable() isn't inherited how is it
> > > > gained otherwise?  Can it be inserted by cotainer image?  I think the
> > > > answer is "no".  Either we trust ns_capable() or we have audit
> > > > namespaces (recommend based on user namespace) (or both).
> > >
> > > My thinking is that since ns_capable() checks the credentials with
> > > respect to the current user namespace we can't rely on it to control
> > > access since it would be possible for a privileged process running
> > > inside an unprivileged container to manipulate the audit container ID
> > > (containerized process has CAP_AUDIT_CONTROL, e.g. running as root in
> > > the container, while the container itself does not).
> >
> > What makes an unprivileged container unprivileged?  "root", or "CAP_*"?
> 
> My understanding is that when most people refer to an unprivileged
> container they are referring to a container run without capabilities
> or a container run by a user other than root.  I'm sure there are
> better definitions out there, by folks much smarter than me on these
> things, but that's my working definition.

Close enough to my understanding...

> > If CAP_AUDIT_CONTROL is granted, does "root" matter?
> 
> Our discussions here have been about capabilities, not UIDs.  The only
> reason root might matter is that it generally has the full capability
> set.

Good, that's my understanding.

> > Does it matter what user namespace it is in?
> 
> What likely matters is what check is called: capable() or
> ns_capable().  Those can yield very different results.

Ok, I finally found what I was looking for to better understand the
challenge with trusting ns_capable().  Sorry for being so dense and slow
on this one.  I thought I had gone through the code carefully enough,
but this time I finally found it.  set_cred_user_ns() sets a full set of
capabilities rather than inheriting them from the parent user_ns, called
from userns_install() or create_userns().  Even if the container
orchestrator/engine restricts those capabilities on its own containers,
they could easily unshare a userns and get a full set unless it also
restricted CAP_SYS_ADMIN, which is used too many other places to be
practical to restrict.

> > I understand that root is *gained* in an
> > unprivileged user namespace, but capabilities are inherited or permitted
> > and that process either has it or it doesn't and an unprivileged user
> > namespace can't gain a capability that has been rescinded.  Different
> > subsystems use the userid or capabilities or both to determine
> > privileges.
> 
> Once again, I believe the important thing to focus on here is
> capable() vs ns_capable().  We can't safely rely on ns_capable() for
> the audit container ID policy since that is easily met inside the
> container regardless of the process' creds which started the
> container.

Agreed.

> > In this case, is the userid relevant?
> 
> We don't do UID checks, we do capability checks, so yes, the UID is irrelevant.

Agreed.

> > > > At this point I would say we are at an impasse unless we trust
> > > > ns_capable() or we implement audit namespaces.
> > >
> > > I'm not sure how we can trust ns_capable(), but if you can think of a
> > > way I would love to hear it.  I'm also not sure how namespacing audit
> > > is helpful (see my above comments), but if you think it is please
> > > explain.
> >
> > So if we are not namespacing, why do we not trust capabilities?
> 
> We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
> ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).

Ok.  So does a process in a non-init user namespace have two (or more)
sets of capabilities stored in creds, one in the init_user_ns, and one
in current_user_ns?  Or does it get stripped of all its capabilities in
init_user_ns once it has its own set in current_user_ns?  If the former,
then we can use capable().  If the latter, we need another mechanism, as
you have suggested might be needed.

If some random unprivileged user wants to fire up a container
orchestrator/engine in his own user namespace, then audit needs to be
namespaced.  Can we safely discard this scenario for now?  That user can
use a VM.

> paul moore

- RGB

--
Richard Guy Briggs <rgb@redhat.com>
Sr. S/W Engineer, Kernel Security, Base Operating Systems
Remote, Ottawa, Red Hat Canada
IRC: rgb, SunRaycer
Voice: +1.647.777.2635, Internal: (81) 32635
Paul Moore July 18, 2019, 9:52 p.m.
On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2019-07-16 19:30, Paul Moore wrote:

...

> > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
> > ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>
> Ok.  So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns?  Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns?  If the former,
> then we can use capable().  If the latter, we need another mechanism, as
> you have suggested might be needed.

Unfortunately I think the problem is that ultimately we need to allow
any container orchestrator that has been given privileges to manage
the audit container ID to also grant that privilege to any of the
child process/containers it manages.  I don't believe we can do that
with capabilities based on the code I've looked at, and the
discussions I've had, but if you find a way I would leave to hear it.

> If some random unprivileged user wants to fire up a container
> orchestrator/engine in his own user namespace, then audit needs to be
> namespaced.  Can we safely discard this scenario for now?

I think the only time we want to allow a container orchestrator to
manage the audit container ID is if it has been granted that privilege
by someone who has that privilege already.  In the zero-container, or
single-level of containers, case this is relatively easy, and we can
accomplish it using CAP_AUDIT_CONTROL as the privilege.  If we start
nesting container orchestrators it becomes more complicated as we need
to be able to support granting and inheriting this privilege in a
manner; this is why I suggested a new mechanism *may* be necessary.

--
paul moore
www.paul-moore.com
Eric W. Biederman July 19, 2019, 3:32 p.m.
Richard Guy Briggs <rgb@redhat.com> writes:

> On 2019-07-16 19:30, Paul Moore wrote:
>> On Tue, Jul 16, 2019 at 6:03 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> > On 2019-07-15 17:04, Paul Moore wrote:
>> > > On Mon, Jul 8, 2019 at 2:06 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> 
>> > > > At this point I would say we are at an impasse unless we trust
>> > > > ns_capable() or we implement audit namespaces.
>> > >
>> > > I'm not sure how we can trust ns_capable(), but if you can think of a
>> > > way I would love to hear it.  I'm also not sure how namespacing audit
>> > > is helpful (see my above comments), but if you think it is please
>> > > explain.
>> >
>> > So if we are not namespacing, why do we not trust capabilities?
>> 
>> We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
>> ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>
> Ok.  So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns?  Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns?  If the former,
> then we can use capable().  If the latter, we need another mechanism, as
> you have suggested might be needed.

The latter.  There is only one set of capabilities and it is in the
processes current user namespace.

Eric
Eric W. Biederman July 19, 2019, 4 p.m.
Paul Moore <paul@paul-moore.com> writes:

> On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com> wrote:
>> On 2019-07-16 19:30, Paul Moore wrote:
>
> ...
>
>> > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
>> > ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
>>
>> Ok.  So does a process in a non-init user namespace have two (or more)
>> sets of capabilities stored in creds, one in the init_user_ns, and one
>> in current_user_ns?  Or does it get stripped of all its capabilities in
>> init_user_ns once it has its own set in current_user_ns?  If the former,
>> then we can use capable().  If the latter, we need another mechanism, as
>> you have suggested might be needed.
>
> Unfortunately I think the problem is that ultimately we need to allow
> any container orchestrator that has been given privileges to manage
> the audit container ID to also grant that privilege to any of the
> child process/containers it manages.  I don't believe we can do that
> with capabilities based on the code I've looked at, and the
> discussions I've had, but if you find a way I would leave to hear it.

>> If some random unprivileged user wants to fire up a container
>> orchestrator/engine in his own user namespace, then audit needs to be
>> namespaced.  Can we safely discard this scenario for now?
>
> I think the only time we want to allow a container orchestrator to
> manage the audit container ID is if it has been granted that privilege
> by someone who has that privilege already.  In the zero-container, or
> single-level of containers, case this is relatively easy, and we can
> accomplish it using CAP_AUDIT_CONTROL as the privilege.  If we start
> nesting container orchestrators it becomes more complicated as we need
> to be able to support granting and inheriting this privilege in a
> manner; this is why I suggested a new mechanism *may* be necessary.


Let me segway a bit and see if I can get this conversation out of the
rut it seems to have drifted into.

Unprivileged containers and nested containers exist today and are going
to become increasingly common.  Let that be a given.

As I recall the interesting thing for audit to log is actions by
privileged processes.  Audit can log more but generally configuring
logging by of the actions of unprivileged users is effectively a self
DOS.

So I think the initial implementation can safely ignore actions of
nested containers and unprivileged containers because you don't care
about their actions. 

If we start allow running audit in a container then we need to deal with
all of the nesting issues but until then I don't think you folks care.

Or am I wrong.  Do the requirements for securely auditing things from
the kernel care about the actions of unprivileged users?

Eric
Eric W. Biederman July 19, 2019, 4:07 p.m.
Richard Guy Briggs <rgb@redhat.com> writes:

> Implement the proc fs write to set the audit container identifier of a
> process, emitting an AUDIT_CONTAINER_OP record to document the event.
>
> This is a write from the container orchestrator task to a proc entry of
> the form /proc/PID/audit_containerid where PID is the process ID of the
> newly created task that is to become the first task in a container, or
> an additional task added to a container.
>
> The write expects up to a u64 value (unset: 18446744073709551615).
>
> The writer must have capability CAP_AUDIT_CONTROL.
>
> This will produce a record such as this:
>   type=CONTAINER_OP msg=audit(2018-06-06 12:39:29.636:26949) : op=set opid=2209 contid=123456 old-contid=18446744073709551615 pid=628 auid=root uid=root tty=ttyS0 ses=1 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 comm=bash exe=/usr/bin/bash res=yes
>
> The "op" field indicates an initial set.  The "pid" to "ses" fields are
> the orchestrator while the "opid" field is the object's PID, the process
> being "contained".  New and old audit container identifier values are
> given in the "contid" fields, while res indicates its success.
>
> It is not permitted to unset the audit container identifier.
> A child inherits its parent's audit container identifier.

Why get proc involved in this?  I know it more or less fits as
this is about a process and it's descendants.  But this seems to
encouarge being able to read this value, and being able to read
this value seems to encourage misuse.

So I am not of fan of using proc for this.

> Please see the github audit kernel issue for the main feature:
>   https://github.com/linux-audit/audit-kernel/issues/90
> Please see the github audit userspace issue for supporting additions:
>   https://github.com/linux-audit/audit-userspace/issues/51
> Please see the github audit testsuiite issue for the test case:
>   https://github.com/linux-audit/audit-testsuite/issues/64
> Please see the github audit wiki for the feature overview:
>   https://github.com/linux-audit/audit-kernel/wiki/RFE-Audit-Container-ID
>
> Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
> Acked-by: Serge Hallyn <serge@hallyn.com>
> Acked-by: Steve Grubb <sgrubb@redhat.com>
> Acked-by: Neil Horman <nhorman@tuxdriver.com>
> Reviewed-by: Ondrej Mosnacek <omosnace@redhat.com>
> ---
>  fs/proc/base.c             | 36 ++++++++++++++++++++++++
>  include/linux/audit.h      | 25 +++++++++++++++++
>  include/uapi/linux/audit.h |  2 ++
>  kernel/audit.c             | 69 ++++++++++++++++++++++++++++++++++++++++++++++
>  kernel/audit.h             |  1 +
>  kernel/auditsc.c           |  4 +++
>  6 files changed, 137 insertions(+)
>
> diff --git a/fs/proc/base.c b/fs/proc/base.c
> index ddef482f1334..43fd0c4b87de 100644
> --- a/fs/proc/base.c
> +++ b/fs/proc/base.c
> @@ -1294,6 +1294,40 @@ static ssize_t proc_sessionid_read(struct file * file, char __user * buf,
>  	.read		= proc_sessionid_read,
>  	.llseek		= generic_file_llseek,
>  };
> +
> +static ssize_t proc_contid_write(struct file *file, const char __user *buf,
> +				   size_t count, loff_t *ppos)
> +{
> +	struct inode *inode = file_inode(file);
> +	u64 contid;
> +	int rv;
> +	struct task_struct *task = get_proc_task(inode);
> +
> +	if (!task)
> +		return -ESRCH;
> +	if (*ppos != 0) {
> +		/* No partial writes. */
> +		put_task_struct(task);
> +		return -EINVAL;
> +	}
> +
> +	rv = kstrtou64_from_user(buf, count, 10, &contid);
> +	if (rv < 0) {
> +		put_task_struct(task);
> +		return rv;
> +	}
> +
> +	rv = audit_set_contid(task, contid);
> +	put_task_struct(task);
> +	if (rv < 0)
> +		return rv;
> +	return count;
> +}
> +
> +static const struct file_operations proc_contid_operations = {
> +	.write		= proc_contid_write,
> +	.llseek		= generic_file_llseek,
> +};
>  #endif
>  
>  #ifdef CONFIG_FAULT_INJECTION
> @@ -3033,6 +3067,7 @@ static int proc_stack_depth(struct seq_file *m, struct pid_namespace *ns,
>  #ifdef CONFIG_AUDIT
>  	REG("loginuid",   S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> +	REG("audit_containerid", S_IWUSR, proc_contid_operations),
>  #endif
>  #ifdef CONFIG_FAULT_INJECTION
>  	REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> @@ -3431,6 +3466,7 @@ static int proc_tid_comm_permission(struct inode *inode, int mask)
>  #ifdef CONFIG_AUDIT
>  	REG("loginuid",  S_IWUSR|S_IRUGO, proc_loginuid_operations),
>  	REG("sessionid",  S_IRUGO, proc_sessionid_operations),
> +	REG("audit_containerid", S_IWUSR, proc_contid_operations),
>  #endif
>  #ifdef CONFIG_FAULT_INJECTION
>  	REG("make-it-fail", S_IRUGO|S_IWUSR, proc_fault_inject_operations),
> diff --git a/include/linux/audit.h b/include/linux/audit.h
> index bde346e73f0c..301337776193 100644
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -89,6 +89,7 @@ struct audit_field {
>  struct audit_task_info {
>  	kuid_t			loginuid;
>  	unsigned int		sessionid;
> +	u64			contid;
>  #ifdef CONFIG_AUDITSYSCALL
>  	struct audit_context	*ctx;
>  #endif
> @@ -189,6 +190,15 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
>  	return tsk->audit->sessionid;
>  }
>  
> +extern int audit_set_contid(struct task_struct *tsk, u64 contid);
> +
> +static inline u64 audit_get_contid(struct task_struct *tsk)
> +{
> +	if (!tsk->audit)
> +		return AUDIT_CID_UNSET;
> +	return tsk->audit->contid;
> +}
> +
>  extern u32 audit_enabled;
>  #else /* CONFIG_AUDIT */
>  static inline int audit_alloc(struct task_struct *task)
> @@ -250,6 +260,11 @@ static inline unsigned int audit_get_sessionid(struct task_struct *tsk)
>  	return AUDIT_SID_UNSET;
>  }
>  
> +static inline u64 audit_get_contid(struct task_struct *tsk)
> +{
> +	return AUDIT_CID_UNSET;
> +}
> +
>  #define audit_enabled AUDIT_OFF
>  #endif /* CONFIG_AUDIT */
>  
> @@ -606,6 +621,16 @@ static inline bool audit_loginuid_set(struct task_struct *tsk)
>  	return uid_valid(audit_get_loginuid(tsk));
>  }
>  
> +static inline bool audit_contid_valid(u64 contid)
> +{
> +	return contid != AUDIT_CID_UNSET;
> +}
> +
> +static inline bool audit_contid_set(struct task_struct *tsk)
> +{
> +	return audit_contid_valid(audit_get_contid(tsk));
> +}
> +
>  static inline void audit_log_string(struct audit_buffer *ab, const char *buf)
>  {
>  	audit_log_n_string(ab, buf, strlen(buf));
> diff --git a/include/uapi/linux/audit.h b/include/uapi/linux/audit.h
> index 3901c51c0b93..4a6a8bf1de32 100644
> --- a/include/uapi/linux/audit.h
> +++ b/include/uapi/linux/audit.h
> @@ -71,6 +71,7 @@
>  #define AUDIT_TTY_SET		1017	/* Set TTY auditing status */
>  #define AUDIT_SET_FEATURE	1018	/* Turn an audit feature on or off */
>  #define AUDIT_GET_FEATURE	1019	/* Get which features are enabled */
> +#define AUDIT_CONTAINER_OP	1020	/* Define the container id and info */
>  
>  #define AUDIT_FIRST_USER_MSG	1100	/* Userspace messages mostly uninteresting to kernel */
>  #define AUDIT_USER_AVC		1107	/* We filter this differently */
> @@ -485,6 +486,7 @@ struct audit_tty_status {
>  
>  #define AUDIT_UID_UNSET (unsigned int)-1
>  #define AUDIT_SID_UNSET ((unsigned int)-1)
> +#define AUDIT_CID_UNSET ((u64)-1)
>  
>  /* audit_rule_data supports filter rules with both integer and string
>   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> diff --git a/kernel/audit.c b/kernel/audit.c
> index 3fb09783cd4a..182b0f2c183d 100644
> --- a/kernel/audit.c
> +++ b/kernel/audit.c
> @@ -244,6 +244,7 @@ int audit_alloc(struct task_struct *tsk)
>  	}
>  	info->loginuid = audit_get_loginuid(current);
>  	info->sessionid = audit_get_sessionid(current);
> +	info->contid = audit_get_contid(current);
>  	tsk->audit = info;
>  
>  	ret = audit_alloc_syscall(tsk);
> @@ -258,6 +259,7 @@ int audit_alloc(struct task_struct *tsk)
>  struct audit_task_info init_struct_audit = {
>  	.loginuid = INVALID_UID,
>  	.sessionid = AUDIT_SID_UNSET,
> +	.contid = AUDIT_CID_UNSET,
>  #ifdef CONFIG_AUDITSYSCALL
>  	.ctx = NULL,
>  #endif
> @@ -2341,6 +2343,73 @@ int audit_set_loginuid(kuid_t loginuid)
>  }
>  
>  /**
> + * audit_set_contid - set current task's audit contid
> + * @contid: contid value
> + *
> + * Returns 0 on success, -EPERM on permission failure.
> + *
> + * Called (set) from fs/proc/base.c::proc_contid_write().
> + */
> +int audit_set_contid(struct task_struct *task, u64 contid)
> +{
> +	u64 oldcontid;
> +	int rc = 0;
> +	struct audit_buffer *ab;
> +	uid_t uid;
> +	struct tty_struct *tty;
> +	char comm[sizeof(current->comm)];
> +
> +	task_lock(task);
> +	/* Can't set if audit disabled */
> +	if (!task->audit) {
> +		task_unlock(task);
> +		return -ENOPROTOOPT;
> +	}
> +	oldcontid = audit_get_contid(task);
> +	read_lock(&tasklist_lock);
> +	/* Don't allow the audit containerid to be unset */
> +	if (!audit_contid_valid(contid))
> +		rc = -EINVAL;
> +	/* if we don't have caps, reject */
> +	else if (!capable(CAP_AUDIT_CONTROL))
> +		rc = -EPERM;
> +	/* if task has children or is not single-threaded, deny */
> +	else if (!list_empty(&task->children))
> +		rc = -EBUSY;
> +	else if (!(thread_group_leader(task) && thread_group_empty(task)))
> +		rc = -EALREADY;
> +	read_unlock(&tasklist_lock);
> +	if (!rc)
> +		task->audit->contid = contid;
> +	task_unlock(task);
> +
> +	if (!audit_enabled)
> +		return rc;
> +
> +	ab = audit_log_start(audit_context(), GFP_KERNEL, AUDIT_CONTAINER_OP);
> +	if (!ab)
> +		return rc;
> +
> +	uid = from_kuid(&init_user_ns, task_uid(current));
> +	tty = audit_get_tty();
> +	audit_log_format(ab,
> +			 "op=set opid=%d contid=%llu old-contid=%llu pid=%d uid=%u auid=%u tty=%s ses=%u",
> +			 task_tgid_nr(task), contid, oldcontid,
> +			 task_tgid_nr(current), uid,
> +			 from_kuid(&init_user_ns, audit_get_loginuid(current)),
> +			 tty ? tty_name(tty) : "(none)",
> +			 audit_get_sessionid(current));
> +	audit_put_tty(tty);
> +	audit_log_task_context(ab);
> +	audit_log_format(ab, " comm=");
> +	audit_log_untrustedstring(ab, get_task_comm(comm, current));
> +	audit_log_d_path_exe(ab, current->mm);
> +	audit_log_format(ab, " res=%d", !rc);
> +	audit_log_end(ab);
> +	return rc;
> +}
> +
> +/**
>   * audit_log_end - end one audit record
>   * @ab: the audit_buffer
>   *
> diff --git a/kernel/audit.h b/kernel/audit.h
> index c00e2ee3c6b3..e2912924af0d 100644
> --- a/kernel/audit.h
> +++ b/kernel/audit.h
> @@ -148,6 +148,7 @@ struct audit_context {
>  	kuid_t		    target_uid;
>  	unsigned int	    target_sessionid;
>  	u32		    target_sid;
> +	u64		    target_cid;
>  	char		    target_comm[TASK_COMM_LEN];
>  
>  	struct audit_tree_refs *trees, *first_trees;
> diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> index fd7ca983de4f..1f7edf035b16 100644
> --- a/kernel/auditsc.c
> +++ b/kernel/auditsc.c
> @@ -113,6 +113,7 @@ struct audit_aux_data_pids {
>  	kuid_t			target_uid[AUDIT_AUX_PIDS];
>  	unsigned int		target_sessionid[AUDIT_AUX_PIDS];
>  	u32			target_sid[AUDIT_AUX_PIDS];
> +	u64			target_cid[AUDIT_AUX_PIDS];
>  	char 			target_comm[AUDIT_AUX_PIDS][TASK_COMM_LEN];
>  	int			pid_count;
>  };
> @@ -2368,6 +2369,7 @@ void __audit_ptrace(struct task_struct *t)
>  	context->target_uid = task_uid(t);
>  	context->target_sessionid = audit_get_sessionid(t);
>  	security_task_getsecid(t, &context->target_sid);
> +	context->target_cid = audit_get_contid(t);
>  	memcpy(context->target_comm, t->comm, TASK_COMM_LEN);
>  }
>  
> @@ -2408,6 +2410,7 @@ int audit_signal_info(int sig, struct task_struct *t)
>  		ctx->target_uid = t_uid;
>  		ctx->target_sessionid = audit_get_sessionid(t);
>  		security_task_getsecid(t, &ctx->target_sid);
> +		ctx->target_cid = audit_get_contid(t);
>  		memcpy(ctx->target_comm, t->comm, TASK_COMM_LEN);
>  		return 0;
>  	}
> @@ -2429,6 +2432,7 @@ int audit_signal_info(int sig, struct task_struct *t)
>  	axp->target_uid[axp->pid_count] = t_uid;
>  	axp->target_sessionid[axp->pid_count] = audit_get_sessionid(t);
>  	security_task_getsecid(t, &axp->target_sid[axp->pid_count]);
> +	axp->target_cid[axp->pid_count] = audit_get_contid(t);
>  	memcpy(axp->target_comm[axp->pid_count], t->comm, TASK_COMM_LEN);
>  	axp->pid_count++;
Burn Alting July 19, 2019, 10:41 p.m.
On Fri, 2019-07-19 at 11:00 -0500, Eric W. Biederman wrote:
> Paul Moore <paul@paul-moore.com> writes:
> 
> 
> On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com> wrote:
> 
> On 2019-07-16 19:30, Paul Moore wrote:
> 
> ...
> 
> 
> 
> We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit container
> ID policy, we can not trust ns_capable(CAP_AUDIT_CONTROL).
> 
> Ok.  So does a process in a non-init user namespace have two (or more)
> sets of capabilities stored in creds, one in the init_user_ns, and one
> in current_user_ns?  Or does it get stripped of all its capabilities in
> init_user_ns once it has its own set in current_user_ns?  If the former,
> then we can use capable().  If the latter, we need another mechanism, as
> you have suggested might be needed.
> 
> Unfortunately I think the problem is that ultimately we need to allow
> any container orchestrator that has been given privileges to manage
> the audit container ID to also grant that privilege to any of the
> child process/containers it manages.  I don't believe we can do that
> with capabilities based on the code I've looked at, and the
> discussions I've had, but if you find a way I would leave to hear it.
> 
> 
> 
> If some random unprivileged user wants to fire up a container
> orchestrator/engine in his own user namespace, then audit needs to be
> namespaced.  Can we safely discard this scenario for now?
> 
> I think the only time we want to allow a container orchestrator to
> manage the audit container ID is if it has been granted that privilege
> by someone who has that privilege already.  In the zero-container, or
> single-level of containers, case this is relatively easy, and we can
> accomplish it using CAP_AUDIT_CONTROL as the privilege.  If we start
> nesting container orchestrators it becomes more complicated as we need
> to be able to support granting and inheriting this privilege in a
> manner; this is why I suggested a new mechanism *may* be necessary.
> 
> 
> Let me segway a bit and see if I can get this conversation out of the
> rut it seems to have drifted into.
> 
> Unprivileged containers and nested containers exist today and are going
> to become increasingly common.  Let that be a given.
> 
> As I recall the interesting thing for audit to log is actions by
> privileged processes.  Audit can log more but generally configuring
> logging by of the actions of unprivileged users is effectively a self
> DOS.
> 
> So I think the initial implementation can safely ignore actions of
> nested containers and unprivileged containers because you don't care
> about their actions. 
> 
> If we start allow running audit in a container then we need to deal with
> all of the nesting issues but until then I don't think you folks care.
> 
> Or am I wrong.  Do the requirements for securely auditing things from
> the kernel care about the actions of unprivileged users?

Depending on the sensitivity of the information the host or system manages, yes the
actions of
unprivileged users is important to security auditing. Kernel auditing sometimes is
the only opportunity
an incident responder has to identify a user's (privileged or not) interaction with
the data the host manages.

> 
> Eric
> 
> --
> Linux-audit mailing list
> Linux-audit@redhat.com
> https://www.redhat.com/mailman/listinfo/linux-audit
>
James Bottomley July 20, 2019, 2:19 a.m.
On Fri, 2019-07-19 at 11:00 -0500, Eric W. Biederman wrote:
> Paul Moore <paul@paul-moore.com> writes:
> 
> > On Wed, Jul 17, 2019 at 8:52 PM Richard Guy Briggs <rgb@redhat.com>
> > wrote:
> > > On 2019-07-16 19:30, Paul Moore wrote:
> > 
> > ...
> > 
> > > > We can trust capable(CAP_AUDIT_CONTROL) for enforcing audit
> > > > container ID policy, we can not trust
> > > > ns_capable(CAP_AUDIT_CONTROL).
> > > 
> > > Ok.  So does a process in a non-init user namespace have two (or
> > > more) sets of capabilities stored in creds, one in the
> > > init_user_ns, and one in current_user_ns?  Or does it get
> > > stripped of all its capabilities in init_user_ns once it has its
> > > own set in current_user_ns?  If the former, then we can use
> > > capable().  If the latter, we need another mechanism, as
> > > you have suggested might be needed.
> > 
> > Unfortunately I think the problem is that ultimately we need to
> > allow any container orchestrator that has been given privileges to
> > manage the audit container ID to also grant that privilege to any
> > of the child process/containers it manages.  I don't believe we can
> > do that with capabilities based on the code I've looked at, and the
> > discussions I've had, but if you find a way I would leave to hear
> > it.
> > > If some random unprivileged user wants to fire up a container
> > > orchestrator/engine in his own user namespace, then audit needs
> > > to be namespaced.  Can we safely discard this scenario for now?
> > 
> > I think the only time we want to allow a container orchestrator to
> > manage the audit container ID is if it has been granted that
> > privilege by someone who has that privilege already.  In the zero-
> > container, or single-level of containers, case this is relatively
> > easy, and we can accomplish it using CAP_AUDIT_CONTROL as the
> > privilege.  If we start nesting container orchestrators it becomes
> > more complicated as we need to be able to support granting and
> > inheriting this privilege in a manner; this is why I suggested a
> > new mechanism *may* be necessary.
> 
> 
> Let me segway a bit and see if I can get this conversation out of the
> rut it seems to have drifted into.
> 
> Unprivileged containers and nested containers exist today and are
> going to become increasingly common.  Let that be a given.

Agree fully.

> As I recall the interesting thing for audit to log is actions by
> privileged processes.  Audit can log more but generally configuring
> logging by of the actions of unprivileged users is effectively a self
> DOS.
> 
> So I think the initial implementation can safely ignore actions of
> nested containers and unprivileged containers because you don't care
> about their actions. 

I don't entirely agree here:  remember there might be two consumers for
the audit data: the physical system owner (checking up on the tenants)
and the tenant themselves who might be watching either their sub
tenants or their users (and who, obviously, won't get the full audit
stream).  In either case, the tenant may or may not be privileged, and
if they're privileged, it might be through the user_ns in which case
the physical system owner and the kernel would see them as "not
privileged".  So I think we are ultimately going to need the ability to
audit unprivileged containers.

I also think audit has a role to play in intrusion detection and
forensic analysis for fully unprivileged containers running external
services, but I don't think we have to solve that case immediately.

> If we start allow running audit in a container then we need to deal
> with all of the nesting issues but until then I don't think you folks
> care.
> 
> Or am I wrong.  Do the requirements for securely auditing things from
> the kernel care about the actions of unprivileged users?

I think ultimately we have to care, but it could be three phases: first
would be genuinely privileged containers (i.e. with real root inside,
being our most dangerous problem) the second would be user_ns
privileged containers (i.e. with both user_ns and an interior root
mapping) and the third would be unprivileged containers (with or
without user_ns but no interior root).

James