[v4] Introduce v3 namespaced file capabilities

Submitted by Serge E. Hallyn on May 8, 2017, 6:11 p.m.

Details

Message ID 20170508181156.GA23112@mail.hallyn.com
State New
Series "64fa03de33: BUG:Dentry_still_in_use"
Headers show

Commit Message

Serge E. Hallyn May 8, 2017, 6:11 p.m.
Root in a non-initial user ns cannot be trusted to write a traditional
security.capability xattr.  If it were allowed to do so, then any
unprivileged user on the host could map his own uid to root in a private
namespace, write the xattr, and execute the file with privilege on the
host.

However supporting file capabilities in a user namespace is very
desirable.  Not doing so means that any programs designed to run with
limited privilege must continue to support other methods of gaining and
dropping privilege.  For instance a program installer must detect
whether file capabilities can be assigned, and assign them if so but set
setuid-root otherwise.  The program in turn must know how to drop
partial capabilities, and do so only if setuid-root.

This patch introduces v3 of the security.capability xattr.  It builds a
vfs_ns_cap_data struct by appending a uid_t rootid to struct
vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
namespace which mounted the filesystem, usually init_user_ns) of the
root id in whose namespaces the file capabilities may take effect.

When a task asks to write a v2 security.capability xattr, if it is
privileged with respect to the userns which mounted the filesystem, then
nothing should change.  Otherwise, the kernel will transparently rewrite
the xattr as a v3 with the appropriate rootid.  This is done during the
execution of setxattr() to catch user-space-initiated capability writes.
Subsequently, any task executing the file which has the noted kuid as
its root uid, or which is in a descendent user_ns of such a user_ns,
will run the file with capabilities.

Similarly when asking to read file capabilities, a v3 capability will
be presented as v2 if it applies to the caller's namespace.

If a task writes a v3 security.capability, then it can provide a uid for
the xattr so long as the uid is valid in its own user namespace, and it
is privileged with CAP_SETFCAP over its namespace.  The kernel will
translate that rootid to an absolute uid, and write that to disk.  After
this, a task in the writer's namespace will not be able to use those
capabilities (unless rootid was 0), but a task in a namespace where the
given uid is root will.

Only a single security.capability xattr may exist at a time for a given
file.  A task may overwrite an existing xattr so long as it is
privileged over the inode.  Note this is a departure from previous
semantics, which required privilege to remove a security.capability
xattr.  This check can be re-added if deemed useful.

This allows a simple setxattr to work, allows tar/untar to work, and
allows us to tar in one namespace and untar in another while preserving
the capability, without risking leaking privilege into a parent
namespace.

Example using tar:

 $ cp /bin/sleep sleepx
 $ mkdir b1 b2
 $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
 $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
 $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
 $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
 $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
   b2/sleepx = cap_sys_admin+ep
 # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
   v3 xattr, rootid is 100001

A patch to linux-test-project adding a new set of tests for this
functionality is in the nsfscaps branch at github.com/hallyn/ltp

Changelog:
   Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
   Nov 07 2016: convert rootid from and to fs user_ns
   (From ebiederm: mar 28 2017)
     commoncap.c: fix typos - s/v4/v3
     get_vfs_caps_from_disk: clarify the fs_ns root access check
     nsfscaps: change the code split for cap_inode_setxattr()
   Apr 09 2017:
       don't return v3 cap for caps owned by current root.
      return a v2 cap for a true v2 cap in non-init ns
   Apr 18 2017:
      . Change the flow of fscap writing to support s_user_ns writing.
      . Remove refuse_fcap_overwrite().  The value of the previous
        xattr doesn't matter.
   Apr 24 2017:
      . incorporate Eric's incremental diff
      . move cap_convert_nscap to setxattr and simplify its usage
   May 8, 2017:
      . fix leaking dentry refcount in cap_inode_getsecurity

Signed-off-by: Serge Hallyn <serge@hallyn.com>
---
 fs/xattr.c                      |   6 +
 include/linux/capability.h      |   2 +
 include/linux/security.h        |   2 +
 include/uapi/linux/capability.h |  22 +++-
 security/commoncap.c            | 270 +++++++++++++++++++++++++++++++++++++---
 5 files changed, 280 insertions(+), 22 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/xattr.c b/fs/xattr.c
index 7e3317c..0a9dea4 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -444,6 +444,12 @@  setxattr(struct dentry *d, const char __user *name, const void __user *value,
 		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
 		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
 			posix_acl_fix_xattr_from_user(kvalue, size);
+		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
+			error = cap_convert_nscap(d, &kvalue, size);
+			if (error < 0)
+				goto out;
+			size = error;
+		}
 	}
 
 	error = vfs_setxattr(d, kname, kvalue, size, flags);
diff --git a/include/linux/capability.h b/include/linux/capability.h
index 6ffb67e..b52e278 100644
--- a/include/linux/capability.h
+++ b/include/linux/capability.h
@@ -248,4 +248,6 @@  extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
 /* audit system wants to get cap info from files as well */
 extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
 
+extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
+
 #endif /* !_LINUX_CAPABILITY_H */
diff --git a/include/linux/security.h b/include/linux/security.h
index 96899fa..bd49cc1 100644
--- a/include/linux/security.h
+++ b/include/linux/security.h
@@ -86,6 +86,8 @@  extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
 extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
 extern int cap_inode_need_killpriv(struct dentry *dentry);
 extern int cap_inode_killpriv(struct dentry *dentry);
+extern int cap_inode_getsecurity(struct inode *inode, const char *name,
+				 void **buffer, bool alloc);
 extern int cap_mmap_addr(unsigned long addr);
 extern int cap_mmap_file(struct file *file, unsigned long reqprot,
 			 unsigned long prot, unsigned long flags);
diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index 49bc062..fd4f87d 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -60,9 +60,13 @@  typedef struct __user_cap_data_struct {
 #define VFS_CAP_U32_2           2
 #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
 
-#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
-#define VFS_CAP_U32             VFS_CAP_U32_2
-#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
+#define VFS_CAP_REVISION_3	0x03000000
+#define VFS_CAP_U32_3           2
+#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
+
+#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
+#define VFS_CAP_U32             VFS_CAP_U32_3
+#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
 
 struct vfs_cap_data {
 	__le32 magic_etc;            /* Little endian */
@@ -72,6 +76,18 @@  struct vfs_cap_data {
 	} data[VFS_CAP_U32];
 };
 
+/*
+ * same as vfs_cap_data but with a rootid at the end
+ */
+struct vfs_ns_cap_data {
+	__le32 magic_etc;
+	struct {
+		__le32 permitted;    /* Little endian */
+		__le32 inheritable;  /* Little endian */
+	} data[VFS_CAP_U32];
+	__le32 rootid;
+};
+
 #ifndef __KERNEL__
 
 /*
diff --git a/security/commoncap.c b/security/commoncap.c
index 78b3783..c28d126 100644
--- a/security/commoncap.c
+++ b/security/commoncap.c
@@ -332,6 +332,209 @@  int cap_inode_killpriv(struct dentry *dentry)
 	return error;
 }
 
+static bool rootid_owns_currentns(kuid_t kroot)
+{
+	struct user_namespace *ns;
+
+	if (!uid_valid(kroot))
+		return false;
+
+	for (ns = current_user_ns(); ; ns = ns->parent) {
+		if (from_kuid(ns, kroot) == 0)
+			return true;
+		if (ns == &init_user_ns)
+			break;
+	}
+
+	return false;
+}
+
+static __u32 sansflags(__u32 m)
+{
+	return m & ~VFS_CAP_FLAGS_EFFECTIVE;
+}
+
+static bool is_v2header(size_t size, __le32 magic)
+{
+	__u32 m = le32_to_cpu(magic);
+	if (size != XATTR_CAPS_SZ_2)
+		return false;
+	return sansflags(m) == VFS_CAP_REVISION_2;
+}
+
+static bool is_v3header(size_t size, __le32 magic)
+{
+	__u32 m = le32_to_cpu(magic);
+
+	if (size != XATTR_CAPS_SZ_3)
+		return false;
+	return sansflags(m) == VFS_CAP_REVISION_3;
+}
+
+/*
+ * getsecurity: We are called for security.* before any attempt to read the
+ * xattr from the inode itself.
+ *
+ * This gives us a chance to read the on-disk value and convert it.  If we
+ * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
+ *
+ * Note we are not called by vfs_getxattr_alloc(), but that is only called
+ * by the integrity subsystem, which really wants the unconverted values -
+ * so that's good.
+ */
+int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
+			  bool alloc)
+{
+	int size, ret;
+	kuid_t kroot;
+	uid_t root, mappedroot;
+	char *tmpbuf = NULL;
+	struct vfs_cap_data *cap;
+	struct vfs_ns_cap_data *nscap;
+	struct dentry *dentry;
+	struct user_namespace *fs_ns;
+
+	if (strcmp(name, "capability") != 0)
+		return -EOPNOTSUPP;
+
+	dentry = d_find_alias(inode);
+	if (!dentry)
+		return -EINVAL;
+
+	size = sizeof(struct vfs_ns_cap_data);
+	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
+				 &tmpbuf, size, GFP_NOFS);
+	dput(dentry);
+
+	if (ret < 0)
+		return ret;
+
+	fs_ns = inode->i_sb->s_user_ns;
+	cap = (struct vfs_cap_data *) tmpbuf;
+	if (is_v2header((size_t) ret, cap->magic_etc)) {
+		/* If this is sizeof(vfs_cap_data) then we're ok with the
+		 * on-disk value, so return that.  */
+		if (alloc)
+			*buffer = tmpbuf;
+		else
+			kfree(tmpbuf);
+		return ret;
+	} else if (!is_v3header((size_t) ret, cap->magic_etc)) {
+		kfree(tmpbuf);
+		return -EINVAL;
+	}
+
+	nscap = (struct vfs_ns_cap_data *) tmpbuf;
+	root = le32_to_cpu(nscap->rootid);
+	kroot = make_kuid(fs_ns, root);
+
+	/* If the root kuid maps to a valid uid in current ns, then return
+	 * this as a nscap. */
+	mappedroot = from_kuid(current_user_ns(), kroot);
+	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
+		if (alloc) {
+			*buffer = tmpbuf;
+			nscap->rootid = cpu_to_le32(mappedroot);
+		} else
+			kfree(tmpbuf);
+		return size;
+	}
+
+	if (!rootid_owns_currentns(kroot)) {
+		kfree(tmpbuf);
+		return -EOPNOTSUPP;
+	}
+
+	/* This comes from a parent namespace.  Return as a v2 capability */
+	size = sizeof(struct vfs_cap_data);
+	if (alloc) {
+		*buffer = kmalloc(size, GFP_ATOMIC);
+		if (*buffer) {
+			struct vfs_cap_data *cap = *buffer;
+			__le32 nsmagic, magic;
+			magic = VFS_CAP_REVISION_2;
+			nsmagic = le32_to_cpu(nscap->magic_etc);
+			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
+				magic |= VFS_CAP_FLAGS_EFFECTIVE;
+			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
+			cap->magic_etc = cpu_to_le32(magic);
+		}
+	}
+	kfree(tmpbuf);
+	return size;
+}
+
+static kuid_t rootid_from_xattr(const void *value, size_t size,
+				struct user_namespace *task_ns)
+{
+	const struct vfs_ns_cap_data *nscap = value;
+	uid_t rootid = 0;
+
+	if (size == XATTR_CAPS_SZ_3)
+		rootid = le32_to_cpu(nscap->rootid);
+
+	return make_kuid(task_ns, rootid);
+}
+
+static bool validheader(size_t size, __le32 magic)
+{
+	return is_v2header(size, magic) || is_v3header(size, magic);
+}
+
+/*
+ * User requested a write of security.capability.  If needed, update the
+ * xattr to change from v2 to v3, or to fixup the v3 rootid.
+ *
+ * If all is ok, we return the new size, on error return < 0.
+ */
+int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
+{
+	struct vfs_ns_cap_data *nscap;
+	uid_t nsrootid;
+	const struct vfs_cap_data *cap = *ivalue;
+	__u32 magic, nsmagic;
+	struct inode *inode = d_backing_inode(dentry);
+	struct user_namespace *task_ns = current_user_ns(),
+		*fs_ns = inode->i_sb->s_user_ns;
+	kuid_t rootid;
+	size_t newsize;
+
+	if (!*ivalue)
+		return -EINVAL;
+	if (!validheader(size, cap->magic_etc))
+		return -EINVAL;
+	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
+		return -EPERM;
+	if (size == XATTR_CAPS_SZ_2)
+		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
+			/* user is privileged, just write the v2 */
+			return size;
+
+	rootid = rootid_from_xattr(*ivalue, size, task_ns);
+	if (!uid_valid(rootid))
+		return -EINVAL;
+
+	nsrootid = from_kuid(fs_ns, rootid);
+	if (nsrootid == -1)
+		return -EINVAL;
+
+	newsize = sizeof(struct vfs_ns_cap_data);
+	nscap = kmalloc(newsize, GFP_ATOMIC);
+	if (!nscap)
+		return -ENOMEM;
+	nscap->rootid = cpu_to_le32(nsrootid);
+	nsmagic = VFS_CAP_REVISION_3;
+	magic = le32_to_cpu(cap->magic_etc);
+	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
+		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
+	nscap->magic_etc = cpu_to_le32(nsmagic);
+	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
+
+	kvfree(*ivalue);
+	*ivalue = nscap;
+	return newsize;
+}
+
 /*
  * Calculate the new process capability sets from the capability sets attached
  * to a file.
@@ -385,7 +588,10 @@  int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 	__u32 magic_etc;
 	unsigned tocopy, i;
 	int size;
-	struct vfs_cap_data caps;
+	struct vfs_ns_cap_data data, *nscaps = &data;
+	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
+	kuid_t rootkuid;
+	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
 
 	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
 
@@ -393,18 +599,20 @@  int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 		return -ENODATA;
 
 	size = __vfs_getxattr((struct dentry *)dentry, inode,
-			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
+			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
 	if (size == -ENODATA || size == -EOPNOTSUPP)
 		/* no data, that's ok */
 		return -ENODATA;
+
 	if (size < 0)
 		return size;
 
 	if (size < sizeof(magic_etc))
 		return -EINVAL;
 
-	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
+	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
 
+	rootkuid = make_kuid(fs_ns, 0);
 	switch (magic_etc & VFS_CAP_REVISION_MASK) {
 	case VFS_CAP_REVISION_1:
 		if (size != XATTR_CAPS_SZ_1)
@@ -416,15 +624,27 @@  int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
 			return -EINVAL;
 		tocopy = VFS_CAP_U32_2;
 		break;
+	case VFS_CAP_REVISION_3:
+		if (size != XATTR_CAPS_SZ_3)
+			return -EINVAL;
+		tocopy = VFS_CAP_U32_3;
+		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
+		break;
+
 	default:
 		return -EINVAL;
 	}
+	/* Limit the caps to the mounter of the filesystem
+	 * or the more limited uid specified in the xattr.
+	 */
+	if (!rootid_owns_currentns(rootkuid))
+		return -ENODATA;
 
 	CAP_FOR_EACH_U32(i) {
 		if (i >= tocopy)
 			break;
-		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
-		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
+		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
+		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
 	}
 
 	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
@@ -462,8 +682,8 @@  static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
 	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
 	if (rc < 0) {
 		if (rc == -EINVAL)
-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
-				__func__, rc, bprm->filename);
+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
+					bprm->filename);
 		else if (rc == -ENODATA)
 			rc = 0;
 		goto out;
@@ -660,15 +880,19 @@  int cap_bprm_secureexec(struct linux_binprm *bprm)
 int cap_inode_setxattr(struct dentry *dentry, const char *name,
 		       const void *value, size_t size, int flags)
 {
-	if (!strcmp(name, XATTR_NAME_CAPS)) {
-		if (!capable(CAP_SETFCAP))
-			return -EPERM;
+	/* Ignore non-security xattrs */
+	if (strncmp(name, XATTR_SECURITY_PREFIX,
+			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+		return 0;
+
+	/*
+	 * For XATTR_NAME_CAPS the check will be done in
+	 * cap_convert_nscap(), called by setxattr()
+	 */
+	if (strcmp(name, XATTR_NAME_CAPS) == 0)
 		return 0;
-	}
 
-	if (!strncmp(name, XATTR_SECURITY_PREFIX,
-		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
-	    !capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -686,15 +910,22 @@  int cap_inode_setxattr(struct dentry *dentry, const char *name,
  */
 int cap_inode_removexattr(struct dentry *dentry, const char *name)
 {
-	if (!strcmp(name, XATTR_NAME_CAPS)) {
-		if (!capable(CAP_SETFCAP))
+	/* Ignore non-security xattrs */
+	if (strncmp(name, XATTR_SECURITY_PREFIX,
+			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
+		return 0;
+
+	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
+		/* security.capability gets namespaced */
+		struct inode *inode = d_backing_inode(dentry);
+		if (!inode)
+			return -EINVAL;
+		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
 			return -EPERM;
 		return 0;
 	}
 
-	if (!strncmp(name, XATTR_SECURITY_PREFIX,
-		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
-	    !capable(CAP_SYS_ADMIN))
+	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 	return 0;
 }
@@ -1082,6 +1313,7 @@  struct security_hook_list capability_hooks[] = {
 	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
 	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
 	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
+	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
 	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
 	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
 	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),

Comments

Eric W. Biederman May 9, 2017, 4:55 p.m.
"Serge E. Hallyn" <serge@hallyn.com> writes:
> Changelog:
[snip]
>    May 8, 2017:
>       . fix leaking dentry refcount in cap_inode_getsecurity
>
[snip]
> +/*
> + * getsecurity: We are called for security.* before any attempt to read the
> + * xattr from the inode itself.
> + *
> + * This gives us a chance to read the on-disk value and convert it.  If we
> + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
> + *
> + * Note we are not called by vfs_getxattr_alloc(), but that is only called
> + * by the integrity subsystem, which really wants the unconverted values -
> + * so that's good.
> + */
> +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
> +			  bool alloc)
> +{
> +	int size, ret;
> +	kuid_t kroot;
> +	uid_t root, mappedroot;
> +	char *tmpbuf = NULL;
> +	struct vfs_cap_data *cap;
> +	struct vfs_ns_cap_data *nscap;
> +	struct dentry *dentry;
> +	struct user_namespace *fs_ns;
> +
> +	if (strcmp(name, "capability") != 0)
> +		return -EOPNOTSUPP;
> +
> +	dentry = d_find_alias(inode);
> +	if (!dentry)
> +		return -EINVAL;
> +
> +	size = sizeof(struct vfs_ns_cap_data);
> +	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
> +				 &tmpbuf, size, GFP_NOFS);
> +	dput(dentry);

This looks like a good fix but ouch! That interface is wrong.

The dentry is needed because vfs_getxattr_alloc does:
	error = handler->get(handler, dentry, inode, name, NULL, 0);

Which is has no business taking a dentry as xattrs are inode concepts.

I have no issue with your patch but it looks like that handler issue
is going to need to be fixed with xattrs.

Eric
Serge E. Hallyn May 9, 2017, 8:37 p.m.
Quoting Eric W. Biederman (ebiederm@xmission.com):
> "Serge E. Hallyn" <serge@hallyn.com> writes:
> > Changelog:
> [snip]
> >    May 8, 2017:
> >       . fix leaking dentry refcount in cap_inode_getsecurity
> >
> [snip]
> > +/*
> > + * getsecurity: We are called for security.* before any attempt to read the
> > + * xattr from the inode itself.
> > + *
> > + * This gives us a chance to read the on-disk value and convert it.  If we
> > + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
> > + *
> > + * Note we are not called by vfs_getxattr_alloc(), but that is only called
> > + * by the integrity subsystem, which really wants the unconverted values -
> > + * so that's good.
> > + */
> > +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
> > +			  bool alloc)
> > +{
> > +	int size, ret;
> > +	kuid_t kroot;
> > +	uid_t root, mappedroot;
> > +	char *tmpbuf = NULL;
> > +	struct vfs_cap_data *cap;
> > +	struct vfs_ns_cap_data *nscap;
> > +	struct dentry *dentry;
> > +	struct user_namespace *fs_ns;
> > +
> > +	if (strcmp(name, "capability") != 0)
> > +		return -EOPNOTSUPP;
> > +
> > +	dentry = d_find_alias(inode);
> > +	if (!dentry)
> > +		return -EINVAL;
> > +
> > +	size = sizeof(struct vfs_ns_cap_data);
> > +	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
> > +				 &tmpbuf, size, GFP_NOFS);
> > +	dput(dentry);
> 
> This looks like a good fix but ouch! That interface is wrong.
> 
> The dentry is needed because vfs_getxattr_alloc does:
> 	error = handler->get(handler, dentry, inode, name, NULL, 0);
> 
> Which is has no business taking a dentry as xattrs are inode concepts.
> 
> I have no issue with your patch but it looks like that handler issue
> is going to need to be fixed with xattrs.

True, it's a bit clunky.

Any reason not to just have the current vfs_getxattr_alloc() become a
lightweight wrapper calling inode_getxattr_alloc(dentry->d_inode)?
Eric W. Biederman May 9, 2017, 10:27 p.m.
"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Eric W. Biederman (ebiederm@xmission.com):
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>> > Changelog:
>> [snip]
>> >    May 8, 2017:
>> >       . fix leaking dentry refcount in cap_inode_getsecurity
>> >
>> [snip]
>> > +/*
>> > + * getsecurity: We are called for security.* before any attempt to read the
>> > + * xattr from the inode itself.
>> > + *
>> > + * This gives us a chance to read the on-disk value and convert it.  If we
>> > + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
>> > + *
>> > + * Note we are not called by vfs_getxattr_alloc(), but that is only called
>> > + * by the integrity subsystem, which really wants the unconverted values -
>> > + * so that's good.
>> > + */
>> > +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
>> > +			  bool alloc)
>> > +{
>> > +	int size, ret;
>> > +	kuid_t kroot;
>> > +	uid_t root, mappedroot;
>> > +	char *tmpbuf = NULL;
>> > +	struct vfs_cap_data *cap;
>> > +	struct vfs_ns_cap_data *nscap;
>> > +	struct dentry *dentry;
>> > +	struct user_namespace *fs_ns;
>> > +
>> > +	if (strcmp(name, "capability") != 0)
>> > +		return -EOPNOTSUPP;
>> > +
>> > +	dentry = d_find_alias(inode);
>> > +	if (!dentry)
>> > +		return -EINVAL;
>> > +
>> > +	size = sizeof(struct vfs_ns_cap_data);
>> > +	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
>> > +				 &tmpbuf, size, GFP_NOFS);
>> > +	dput(dentry);
>> 
>> This looks like a good fix but ouch! That interface is wrong.
>> 
>> The dentry is needed because vfs_getxattr_alloc does:
>> 	error = handler->get(handler, dentry, inode, name, NULL, 0);
>> 
>> Which is has no business taking a dentry as xattrs are inode concepts.
>> 
>> I have no issue with your patch but it looks like that handler issue
>> is going to need to be fixed with xattrs.
>
> True, it's a bit clunky.
>
> Any reason not to just have the current vfs_getxattr_alloc() become a
> lightweight wrapper calling inode_getxattr_alloc(dentry->d_inode)?

My deep issue is that handler is functions like posix_acl_xattr_get.
And all of those functions that vfs_getxattr_alloc calls should not
take a dentry.

So I feel like I have just spotted the tip of an iceberg that needs
sorting out.

Eric
Stefan Berger June 13, 2017, 3:47 p.m.
On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> Root in a non-initial user ns cannot be trusted to write a traditional
> security.capability xattr.  If it were allowed to do so, then any
> unprivileged user on the host could map his own uid to root in a private
> namespace, write the xattr, and execute the file with privilege on the
> host.
>
> However supporting file capabilities in a user namespace is very
> desirable.  Not doing so means that any programs designed to run with
> limited privilege must continue to support other methods of gaining and
> dropping privilege.  For instance a program installer must detect
> whether file capabilities can be assigned, and assign them if so but set
> setuid-root otherwise.  The program in turn must know how to drop
> partial capabilities, and do so only if setuid-root.

Hi Serge,


   I have been looking at patch below primarily to learn how we could 
apply a similar technique to security.ima and security.evm for a 
namespaced IMA. From the paragraphs above I thought that you solved the 
problem of a shared filesystem where one now can write different 
security.capability xattrs by effectively supporting for example 
security.capability[uid=1000] and security.capability[uid=2000] written 
into the filesystem. Each would then become visible as 
security.capability if the userns mapping is set appropriately. However, 
this doesn't seem to be how it is implemented. There seems to be only a 
single such entry with uid appended to it and, if it was a shared 
filesystem, the first one to set this attribute blocks everyone else 
from writing the xattr. Is that how it works? Would that work 
differently with an overlay filesystem ? I think a similar model could 
also work for IMA, but maybe you have some thoughts. The only thing I 
would be concerned about is blocking the parent container's root user 
from setting an xattr.

Regards,
    Stefan


>
> This patch introduces v3 of the security.capability xattr.  It builds a
> vfs_ns_cap_data struct by appending a uid_t rootid to struct
> vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
> namespace which mounted the filesystem, usually init_user_ns) of the
> root id in whose namespaces the file capabilities may take effect.
>
> When a task asks to write a v2 security.capability xattr, if it is
> privileged with respect to the userns which mounted the filesystem, then
> nothing should change.  Otherwise, the kernel will transparently rewrite
> the xattr as a v3 with the appropriate rootid.  This is done during the
> execution of setxattr() to catch user-space-initiated capability writes.
> Subsequently, any task executing the file which has the noted kuid as
> its root uid, or which is in a descendent user_ns of such a user_ns,
> will run the file with capabilities.
>
> Similarly when asking to read file capabilities, a v3 capability will
> be presented as v2 if it applies to the caller's namespace.
>
> If a task writes a v3 security.capability, then it can provide a uid for
> the xattr so long as the uid is valid in its own user namespace, and it
> is privileged with CAP_SETFCAP over its namespace.  The kernel will
> translate that rootid to an absolute uid, and write that to disk.  After
> this, a task in the writer's namespace will not be able to use those
> capabilities (unless rootid was 0), but a task in a namespace where the
> given uid is root will.
>
> Only a single security.capability xattr may exist at a time for a given
> file.  A task may overwrite an existing xattr so long as it is
> privileged over the inode.  Note this is a departure from previous
> semantics, which required privilege to remove a security.capability
> xattr.  This check can be re-added if deemed useful.
>
> This allows a simple setxattr to work, allows tar/untar to work, and
> allows us to tar in one namespace and untar in another while preserving
> the capability, without risking leaking privilege into a parent
> namespace.
>
> Example using tar:
>
>   $ cp /bin/sleep sleepx
>   $ mkdir b1 b2
>   $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
>   $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
>   $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
>   $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
>   $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
>     b2/sleepx = cap_sys_admin+ep
>   # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
>     v3 xattr, rootid is 100001
>
> A patch to linux-test-project adding a new set of tests for this
> functionality is in the nsfscaps branch at github.com/hallyn/ltp
>
> Changelog:
>     Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
>     Nov 07 2016: convert rootid from and to fs user_ns
>     (From ebiederm: mar 28 2017)
>       commoncap.c: fix typos - s/v4/v3
>       get_vfs_caps_from_disk: clarify the fs_ns root access check
>       nsfscaps: change the code split for cap_inode_setxattr()
>     Apr 09 2017:
>         don't return v3 cap for caps owned by current root.
>        return a v2 cap for a true v2 cap in non-init ns
>     Apr 18 2017:
>        . Change the flow of fscap writing to support s_user_ns writing.
>        . Remove refuse_fcap_overwrite().  The value of the previous
>          xattr doesn't matter.
>     Apr 24 2017:
>        . incorporate Eric's incremental diff
>        . move cap_convert_nscap to setxattr and simplify its usage
>     May 8, 2017:
>        . fix leaking dentry refcount in cap_inode_getsecurity
>
> Signed-off-by: Serge Hallyn <serge@hallyn.com>
> ---
>   fs/xattr.c                      |   6 +
>   include/linux/capability.h      |   2 +
>   include/linux/security.h        |   2 +
>   include/uapi/linux/capability.h |  22 +++-
>   security/commoncap.c            | 270 +++++++++++++++++++++++++++++++++++++---
>   5 files changed, 280 insertions(+), 22 deletions(-)
>
> diff --git a/fs/xattr.c b/fs/xattr.c
> index 7e3317c..0a9dea4 100644
> --- a/fs/xattr.c
> +++ b/fs/xattr.c
> @@ -444,6 +444,12 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
>   		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
>   		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
>   			posix_acl_fix_xattr_from_user(kvalue, size);
> +		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
> +			error = cap_convert_nscap(d, &kvalue, size);
> +			if (error < 0)
> +				goto out;
> +			size = error;
> +		}
>   	}
>
>   	error = vfs_setxattr(d, kname, kvalue, size, flags);
> diff --git a/include/linux/capability.h b/include/linux/capability.h
> index 6ffb67e..b52e278 100644
> --- a/include/linux/capability.h
> +++ b/include/linux/capability.h
> @@ -248,4 +248,6 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
>   /* audit system wants to get cap info from files as well */
>   extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
>
> +extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
> +
>   #endif /* !_LINUX_CAPABILITY_H */
> diff --git a/include/linux/security.h b/include/linux/security.h
> index 96899fa..bd49cc1 100644
> --- a/include/linux/security.h
> +++ b/include/linux/security.h
> @@ -86,6 +86,8 @@ extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
>   extern int cap_inode_need_killpriv(struct dentry *dentry);
>   extern int cap_inode_killpriv(struct dentry *dentry);
> +extern int cap_inode_getsecurity(struct inode *inode, const char *name,
> +				 void **buffer, bool alloc);
>   extern int cap_mmap_addr(unsigned long addr);
>   extern int cap_mmap_file(struct file *file, unsigned long reqprot,
>   			 unsigned long prot, unsigned long flags);
> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 49bc062..fd4f87d 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -60,9 +60,13 @@ typedef struct __user_cap_data_struct {
>   #define VFS_CAP_U32_2           2
>   #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
>
> -#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> -#define VFS_CAP_U32             VFS_CAP_U32_2
> -#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
> +#define VFS_CAP_REVISION_3	0x03000000
> +#define VFS_CAP_U32_3           2
> +#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
> +
> +#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
> +#define VFS_CAP_U32             VFS_CAP_U32_3
> +#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
>
>   struct vfs_cap_data {
>   	__le32 magic_etc;            /* Little endian */
> @@ -72,6 +76,18 @@ struct vfs_cap_data {
>   	} data[VFS_CAP_U32];
>   };
>
> +/*
> + * same as vfs_cap_data but with a rootid at the end
> + */
> +struct vfs_ns_cap_data {
> +	__le32 magic_etc;
> +	struct {
> +		__le32 permitted;    /* Little endian */
> +		__le32 inheritable;  /* Little endian */
> +	} data[VFS_CAP_U32];
> +	__le32 rootid;
> +};
> +
>   #ifndef __KERNEL__
>
>   /*
> diff --git a/security/commoncap.c b/security/commoncap.c
> index 78b3783..c28d126 100644
> --- a/security/commoncap.c
> +++ b/security/commoncap.c
> @@ -332,6 +332,209 @@ int cap_inode_killpriv(struct dentry *dentry)
>   	return error;
>   }
>
> +static bool rootid_owns_currentns(kuid_t kroot)
> +{
> +	struct user_namespace *ns;
> +
> +	if (!uid_valid(kroot))
> +		return false;
> +
> +	for (ns = current_user_ns(); ; ns = ns->parent) {
> +		if (from_kuid(ns, kroot) == 0)
> +			return true;
> +		if (ns == &init_user_ns)
> +			break;
> +	}
> +
> +	return false;
> +}
> +
> +static __u32 sansflags(__u32 m)
> +{
> +	return m & ~VFS_CAP_FLAGS_EFFECTIVE;
> +}
> +
> +static bool is_v2header(size_t size, __le32 magic)
> +{
> +	__u32 m = le32_to_cpu(magic);
> +	if (size != XATTR_CAPS_SZ_2)
> +		return false;
> +	return sansflags(m) == VFS_CAP_REVISION_2;
> +}
> +
> +static bool is_v3header(size_t size, __le32 magic)
> +{
> +	__u32 m = le32_to_cpu(magic);
> +
> +	if (size != XATTR_CAPS_SZ_3)
> +		return false;
> +	return sansflags(m) == VFS_CAP_REVISION_3;
> +}
> +
> +/*
> + * getsecurity: We are called for security.* before any attempt to read the
> + * xattr from the inode itself.
> + *
> + * This gives us a chance to read the on-disk value and convert it.  If we
> + * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
> + *
> + * Note we are not called by vfs_getxattr_alloc(), but that is only called
> + * by the integrity subsystem, which really wants the unconverted values -
> + * so that's good.
> + */
> +int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
> +			  bool alloc)
> +{
> +	int size, ret;
> +	kuid_t kroot;
> +	uid_t root, mappedroot;
> +	char *tmpbuf = NULL;
> +	struct vfs_cap_data *cap;
> +	struct vfs_ns_cap_data *nscap;
> +	struct dentry *dentry;
> +	struct user_namespace *fs_ns;
> +
> +	if (strcmp(name, "capability") != 0)
> +		return -EOPNOTSUPP;
> +
> +	dentry = d_find_alias(inode);
> +	if (!dentry)
> +		return -EINVAL;
> +
> +	size = sizeof(struct vfs_ns_cap_data);
> +	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
> +				 &tmpbuf, size, GFP_NOFS);
> +	dput(dentry);
> +
> +	if (ret < 0)
> +		return ret;
> +
> +	fs_ns = inode->i_sb->s_user_ns;
> +	cap = (struct vfs_cap_data *) tmpbuf;
> +	if (is_v2header((size_t) ret, cap->magic_etc)) {
> +		/* If this is sizeof(vfs_cap_data) then we're ok with the
> +		 * on-disk value, so return that.  */
> +		if (alloc)
> +			*buffer = tmpbuf;
> +		else
> +			kfree(tmpbuf);
> +		return ret;
> +	} else if (!is_v3header((size_t) ret, cap->magic_etc)) {
> +		kfree(tmpbuf);
> +		return -EINVAL;
> +	}
> +
> +	nscap = (struct vfs_ns_cap_data *) tmpbuf;
> +	root = le32_to_cpu(nscap->rootid);
> +	kroot = make_kuid(fs_ns, root);
> +
> +	/* If the root kuid maps to a valid uid in current ns, then return
> +	 * this as a nscap. */
> +	mappedroot = from_kuid(current_user_ns(), kroot);
> +	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
> +		if (alloc) {
> +			*buffer = tmpbuf;
> +			nscap->rootid = cpu_to_le32(mappedroot);
> +		} else
> +			kfree(tmpbuf);
> +		return size;
> +	}
> +
> +	if (!rootid_owns_currentns(kroot)) {
> +		kfree(tmpbuf);
> +		return -EOPNOTSUPP;
> +	}
> +
> +	/* This comes from a parent namespace.  Return as a v2 capability */
> +	size = sizeof(struct vfs_cap_data);
> +	if (alloc) {
> +		*buffer = kmalloc(size, GFP_ATOMIC);
> +		if (*buffer) {
> +			struct vfs_cap_data *cap = *buffer;
> +			__le32 nsmagic, magic;
> +			magic = VFS_CAP_REVISION_2;
> +			nsmagic = le32_to_cpu(nscap->magic_etc);
> +			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
> +				magic |= VFS_CAP_FLAGS_EFFECTIVE;
> +			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> +			cap->magic_etc = cpu_to_le32(magic);
> +		}
> +	}
> +	kfree(tmpbuf);
> +	return size;
> +}
> +
> +static kuid_t rootid_from_xattr(const void *value, size_t size,
> +				struct user_namespace *task_ns)
> +{
> +	const struct vfs_ns_cap_data *nscap = value;
> +	uid_t rootid = 0;
> +
> +	if (size == XATTR_CAPS_SZ_3)
> +		rootid = le32_to_cpu(nscap->rootid);
> +
> +	return make_kuid(task_ns, rootid);
> +}
> +
> +static bool validheader(size_t size, __le32 magic)
> +{
> +	return is_v2header(size, magic) || is_v3header(size, magic);
> +}
> +
> +/*
> + * User requested a write of security.capability.  If needed, update the
> + * xattr to change from v2 to v3, or to fixup the v3 rootid.
> + *
> + * If all is ok, we return the new size, on error return < 0.
> + */
> +int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
> +{
> +	struct vfs_ns_cap_data *nscap;
> +	uid_t nsrootid;
> +	const struct vfs_cap_data *cap = *ivalue;
> +	__u32 magic, nsmagic;
> +	struct inode *inode = d_backing_inode(dentry);
> +	struct user_namespace *task_ns = current_user_ns(),
> +		*fs_ns = inode->i_sb->s_user_ns;
> +	kuid_t rootid;
> +	size_t newsize;
> +
> +	if (!*ivalue)
> +		return -EINVAL;
> +	if (!validheader(size, cap->magic_etc))
> +		return -EINVAL;
> +	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
> +		return -EPERM;
> +	if (size == XATTR_CAPS_SZ_2)
> +		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
> +			/* user is privileged, just write the v2 */
> +			return size;
> +
> +	rootid = rootid_from_xattr(*ivalue, size, task_ns);
> +	if (!uid_valid(rootid))
> +		return -EINVAL;
> +
> +	nsrootid = from_kuid(fs_ns, rootid);
> +	if (nsrootid == -1)
> +		return -EINVAL;
> +
> +	newsize = sizeof(struct vfs_ns_cap_data);
> +	nscap = kmalloc(newsize, GFP_ATOMIC);
> +	if (!nscap)
> +		return -ENOMEM;
> +	nscap->rootid = cpu_to_le32(nsrootid);
> +	nsmagic = VFS_CAP_REVISION_3;
> +	magic = le32_to_cpu(cap->magic_etc);
> +	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
> +		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
> +	nscap->magic_etc = cpu_to_le32(nsmagic);
> +	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> +
> +	kvfree(*ivalue);
> +	*ivalue = nscap;
> +	return newsize;
> +}
> +
>   /*
>    * Calculate the new process capability sets from the capability sets attached
>    * to a file.
> @@ -385,7 +588,10 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>   	__u32 magic_etc;
>   	unsigned tocopy, i;
>   	int size;
> -	struct vfs_cap_data caps;
> +	struct vfs_ns_cap_data data, *nscaps = &data;
> +	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> +	kuid_t rootkuid;
> +	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
>
>   	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
>
> @@ -393,18 +599,20 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>   		return -ENODATA;
>
>   	size = __vfs_getxattr((struct dentry *)dentry, inode,
> -			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
> +			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
>   	if (size == -ENODATA || size == -EOPNOTSUPP)
>   		/* no data, that's ok */
>   		return -ENODATA;
> +
>   	if (size < 0)
>   		return size;
>
>   	if (size < sizeof(magic_etc))
>   		return -EINVAL;
>
> -	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
> +	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
>
> +	rootkuid = make_kuid(fs_ns, 0);
>   	switch (magic_etc & VFS_CAP_REVISION_MASK) {
>   	case VFS_CAP_REVISION_1:
>   		if (size != XATTR_CAPS_SZ_1)
> @@ -416,15 +624,27 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
>   			return -EINVAL;
>   		tocopy = VFS_CAP_U32_2;
>   		break;
> +	case VFS_CAP_REVISION_3:
> +		if (size != XATTR_CAPS_SZ_3)
> +			return -EINVAL;
> +		tocopy = VFS_CAP_U32_3;
> +		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
> +		break;
> +
>   	default:
>   		return -EINVAL;
>   	}
> +	/* Limit the caps to the mounter of the filesystem
> +	 * or the more limited uid specified in the xattr.
> +	 */
> +	if (!rootid_owns_currentns(rootkuid))
> +		return -ENODATA;
>
>   	CAP_FOR_EACH_U32(i) {
>   		if (i >= tocopy)
>   			break;
> -		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
> -		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
> +		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
> +		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
>   	}
>
>   	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> @@ -462,8 +682,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
>   	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
>   	if (rc < 0) {
>   		if (rc == -EINVAL)
> -			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> -				__func__, rc, bprm->filename);
> +			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> +					bprm->filename);
>   		else if (rc == -ENODATA)
>   			rc = 0;
>   		goto out;
> @@ -660,15 +880,19 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
>   int cap_inode_setxattr(struct dentry *dentry, const char *name,
>   		       const void *value, size_t size, int flags)
>   {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> -		if (!capable(CAP_SETFCAP))
> -			return -EPERM;
> +	/* Ignore non-security xattrs */
> +	if (strncmp(name, XATTR_SECURITY_PREFIX,
> +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> +		return 0;
> +
> +	/*
> +	 * For XATTR_NAME_CAPS the check will be done in
> +	 * cap_convert_nscap(), called by setxattr()
> +	 */
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0)
>   		return 0;
> -	}
>
> -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> -	    !capable(CAP_SYS_ADMIN))
> +	if (!capable(CAP_SYS_ADMIN))
>   		return -EPERM;
>   	return 0;
>   }
> @@ -686,15 +910,22 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
>    */
>   int cap_inode_removexattr(struct dentry *dentry, const char *name)
>   {
> -	if (!strcmp(name, XATTR_NAME_CAPS)) {
> -		if (!capable(CAP_SETFCAP))
> +	/* Ignore non-security xattrs */
> +	if (strncmp(name, XATTR_SECURITY_PREFIX,
> +			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> +		return 0;
> +
> +	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
> +		/* security.capability gets namespaced */
> +		struct inode *inode = d_backing_inode(dentry);
> +		if (!inode)
> +			return -EINVAL;
> +		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
>   			return -EPERM;
>   		return 0;
>   	}
>
> -	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> -		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> -	    !capable(CAP_SYS_ADMIN))
> +	if (!capable(CAP_SYS_ADMIN))
>   		return -EPERM;
>   	return 0;
>   }
> @@ -1082,6 +1313,7 @@ struct security_hook_list capability_hooks[] = {
>   	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
>   	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
>   	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
> +	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
>   	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
>   	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
>   	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
Jann Horn via Containers June 13, 2017, 5:14 p.m.
Hi Stefan,

On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > Root in a non-initial user ns cannot be trusted to write a traditional
> > security.capability xattr.  If it were allowed to do so, then any
> > unprivileged user on the host could map his own uid to root in a private
> > namespace, write the xattr, and execute the file with privilege on the
> > host.
> > 
> > However supporting file capabilities in a user namespace is very
> > desirable.  Not doing so means that any programs designed to run with
> > limited privilege must continue to support other methods of gaining and
> > dropping privilege.  For instance a program installer must detect
> > whether file capabilities can be assigned, and assign them if so but set
> > setuid-root otherwise.  The program in turn must know how to drop
> > partial capabilities, and do so only if setuid-root.
> 
> Hi Serge,
> 
> 
>   I have been looking at patch below primarily to learn how we could apply a
> similar technique to security.ima and security.evm for a namespaced IMA.
> From the paragraphs above I thought that you solved the problem of a shared
> filesystem where one now can write different security.capability xattrs by
> effectively supporting for example security.capability[uid=1000] and
> security.capability[uid=2000] written into the filesystem. Each would then
> become visible as security.capability if the userns mapping is set
> appropriately.

One disadvantage of this approach is that whoever is setting up the
container would need to go touch the security.ima attribute for each
file in the contianer, which would slow down container creation time.
For capabilities this makes sense, because you might want the file to
have different capabilities in different namespaces, but for IMA,
since the file hash will be the same in every namespace, it would be
nice to use a design that avoids touching each file on new ns
creation.

Cheers,

Tycho
Serge E. Hallyn June 13, 2017, 5:18 p.m.
Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> >Root in a non-initial user ns cannot be trusted to write a traditional
> >security.capability xattr.  If it were allowed to do so, then any
> >unprivileged user on the host could map his own uid to root in a private
> >namespace, write the xattr, and execute the file with privilege on the
> >host.
> >
> >However supporting file capabilities in a user namespace is very
> >desirable.  Not doing so means that any programs designed to run with
> >limited privilege must continue to support other methods of gaining and
> >dropping privilege.  For instance a program installer must detect
> >whether file capabilities can be assigned, and assign them if so but set
> >setuid-root otherwise.  The program in turn must know how to drop
> >partial capabilities, and do so only if setuid-root.
> 
> Hi Serge,
> 
> 
>   I have been looking at patch below primarily to learn how we could
> apply a similar technique to security.ima and security.evm for a
> namespaced IMA. From the paragraphs above I thought that you solved
> the problem of a shared filesystem where one now can write different
> security.capability xattrs by effectively supporting for example
> security.capability[uid=1000] and security.capability[uid=2000]
> written into the filesystem. Each would then become visible as
> security.capability if the userns mapping is set appropriately.
> However, this doesn't seem to be how it is implemented. There seems
> to be only a single such entry with uid appended to it and, if it
> was a shared filesystem, the first one to set this attribute blocks
> everyone else from writing the xattr. Is that how it works? Would

Yes, that's how this works here.  I'd considered allowing multiple
entries, but I didn't feel that was needed for this case.  In a previous
implementation (which is probably in the lkml archives somewhere) I
supported variable length xattr so that multiple containers could
each write a value tagged with their own userns.rootid.  Instead,
in the final version, if root in any parent container writes an
xattr, it will take effect in child user namespaces.  Which is
sensible - the parent presumbly laid out the filesystem to create
the child container.

> that work differently with an overlay filesystem ? I think a similar

Certainly an overlay filesystem should be an easy case as the container
can have its own copy of the inode with its own xattr.  Btrfs/zfs
would be nicer as the whole file wouldn't need to be copied.

> model could also work for IMA, but maybe you have some thoughts. The
> only thing I would be concerned about is blocking the parent
> container's root user from setting an xattr.

So if you have container c1 creating child container c2 on host h1,
then if c1 creates an xattr, can c2 not use that?  And if h1 writes it,
can c1 and c2 use it?

If they can't, then I guess for IMA multiple xattrs would need to be
supported.

-serge
Stefan Berger June 13, 2017, 5:42 p.m.
On 06/13/2017 01:14 PM, Tycho Andersen wrote:
> Hi Stefan,
>
> On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
>> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
>>> Root in a non-initial user ns cannot be trusted to write a traditional
>>> security.capability xattr.  If it were allowed to do so, then any
>>> unprivileged user on the host could map his own uid to root in a private
>>> namespace, write the xattr, and execute the file with privilege on the
>>> host.
>>>
>>> However supporting file capabilities in a user namespace is very
>>> desirable.  Not doing so means that any programs designed to run with
>>> limited privilege must continue to support other methods of gaining and
>>> dropping privilege.  For instance a program installer must detect
>>> whether file capabilities can be assigned, and assign them if so but set
>>> setuid-root otherwise.  The program in turn must know how to drop
>>> partial capabilities, and do so only if setuid-root.
>> Hi Serge,
>>
>>
>>    I have been looking at patch below primarily to learn how we could apply a
>> similar technique to security.ima and security.evm for a namespaced IMA.
>>  From the paragraphs above I thought that you solved the problem of a shared
>> filesystem where one now can write different security.capability xattrs by
>> effectively supporting for example security.capability[uid=1000] and
>> security.capability[uid=2000] written into the filesystem. Each would then
>> become visible as security.capability if the userns mapping is set
>> appropriately.
> One disadvantage of this approach is that whoever is setting up the
> container would need to go touch the security.ima attribute for each
> file in the contianer, which would slow down container creation time.
> For capabilities this makes sense, because you might want the file to
> have different capabilities in different namespaces, but for IMA,
> since the file hash will be the same in every namespace, it would be
> nice to use a design that avoids touching each file on new ns
> creation.

Actually IMA in appraisal mode also supports signatures in the extended 
attributes. Depending on which (public) keys you put on the IMA keyring 
for a namespaced IMA, you may need a different signature on a file to be 
able to access it (execute it for example). For this to work containers 
would have to be able to ship with security.ima xattrs embedded in them 
and users should be able to apply signatures on files on a running 
container (or while building it). I worked on a prototype for namespaces 
IMA before where one of the issues was the CAP_SYS_ADMIN gate that 
disallows setting of security.ima when dropped. So some form of 
exception would have to be granted to be allowed to set security.ima 
from inside a container while CAP_SYS_ADMIN isn't there. And of course 
we need to protect the host's filesystem from an attack where the user 
just modifies the security.ima signature associated with a file.

   Stefan


> Cheers,
>
> Tycho
>
James Bottomley June 13, 2017, 5:45 p.m.
On Tue, 2017-06-13 at 11:14 -0600, Tycho Andersen via Containers wrote:
> Hi Stefan,
> 
> On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
> > On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > > Root in a non-initial user ns cannot be trusted to write a 
> > > traditional security.capability xattr.  If it were allowed to do 
> > > so, then any unprivileged user on the host could map his own uid 
> > > to root in a private namespace, write the xattr, and execute the 
> > > file with privilege on the host.
> > > 
> > > However supporting file capabilities in a user namespace is very
> > > desirable.  Not doing so means that any programs designed to run 
> > > with limited privilege must continue to support other methods of
> > > gaining and dropping privilege.  For instance a program installer 
> > > must detect whether file capabilities can be assigned, and assign 
> > > them if so but set setuid-root otherwise.  The program in turn 
> > > must know how to drop partial capabilities, and do so only if
> > > setuid-root.
> > 
> > Hi Serge,
> > 
> > 
> >   I have been looking at patch below primarily to learn how we 
> > could apply a similar technique to security.ima and security.evm 
> > for a namespaced IMA. From the paragraphs above I thought that you 
> > solved the problem of a shared filesystem where one now can write 
> > different security.capability xattrs by effectively supporting for 
> > example security.capability[uid=1000] and
> > security.capability[uid=2000] written into the filesystem. Each
> > would then become visible as security.capability if the userns 
> > mapping is set appropriately.
> 
> One disadvantage of this approach is that whoever is setting up the
> container would need to go touch the security.ima attribute for each
> file in the contianer, which would slow down container creation time.
> For capabilities this makes sense, because you might want the file to
> have different capabilities in different namespaces, but for IMA,
> since the file hash will be the same in every namespace,

Actually, this isn't necessarily true: IMA may have the hash, you're
right, but I suspect in most container use cases it will have the
signature.  It's definitely a use case that the container will be using
a different keyring from the host, so different signatures are surely
possible for the same underlying image file.

One might imagine doing the above via overlays, because the new
signature should override the old.

James

>  it would be nice to use a design that avoids touching each file on 
> new ns creation.
Stefan Berger June 13, 2017, 6:12 p.m.
On 06/13/2017 01:18 PM, Serge E. Hallyn wrote:
> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
>>> Root in a non-initial user ns cannot be trusted to write a traditional
>>> security.capability xattr.  If it were allowed to do so, then any
>>> unprivileged user on the host could map his own uid to root in a private
>>> namespace, write the xattr, and execute the file with privilege on the
>>> host.
>>>
>>> However supporting file capabilities in a user namespace is very
>>> desirable.  Not doing so means that any programs designed to run with
>>> limited privilege must continue to support other methods of gaining and
>>> dropping privilege.  For instance a program installer must detect
>>> whether file capabilities can be assigned, and assign them if so but set
>>> setuid-root otherwise.  The program in turn must know how to drop
>>> partial capabilities, and do so only if setuid-root.
>> Hi Serge,
>>
>>
>>    I have been looking at patch below primarily to learn how we could
>> apply a similar technique to security.ima and security.evm for a
>> namespaced IMA. From the paragraphs above I thought that you solved
>> the problem of a shared filesystem where one now can write different
>> security.capability xattrs by effectively supporting for example
>> security.capability[uid=1000] and security.capability[uid=2000]
>> written into the filesystem. Each would then become visible as
>> security.capability if the userns mapping is set appropriately.
>> However, this doesn't seem to be how it is implemented. There seems
>> to be only a single such entry with uid appended to it and, if it
>> was a shared filesystem, the first one to set this attribute blocks
>> everyone else from writing the xattr. Is that how it works? Would
> Yes, that's how this works here.  I'd considered allowing multiple
> entries, but I didn't feel that was needed for this case.  In a previous
> implementation (which is probably in the lkml archives somewhere) I
> supported variable length xattr so that multiple containers could
> each write a value tagged with their own userns.rootid.  Instead,
> in the final version, if root in any parent container writes an
> xattr, it will take effect in child user namespaces.  Which is
> sensible - the parent presumbly laid out the filesystem to create
> the child container.
>
>> that work differently with an overlay filesystem ? I think a similar
> Certainly an overlay filesystem should be an easy case as the container
> can have its own copy of the inode with its own xattr.  Btrfs/zfs
> would be nicer as the whole file wouldn't need to be copied.
>
>> model could also work for IMA, but maybe you have some thoughts. The
>> only thing I would be concerned about is blocking the parent
>> container's root user from setting an xattr.
> So if you have container c1 creating child container c2 on host h1,
> then if c1 creates an xattr, can c2 not use that?  And if h1 writes it,
> can c1 and c2 use it?

In the case of IMA appraisal the extended attribute security.ima would 
be a signature. For c1 and c2 to use that file they would all have to 
have the same key on their (isolated IMA namespace ) keyring. I think 
this type of setup could be arranged.
Following your attack description in the introduction I would say that 
we would want to prevent malicious modification of a security.ima 
extended attribute:

"Root in a non-initial user ns cannot be trusted to write a traditional security.ima xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the signature in the security.ima xattr, and prevent the file from being accessible on the host."



>
> If they can't, then I guess for IMA multiple xattrs would need to be
> supported.

I am not sure about that. I suppose any extended attribute modifications 
would have to be designed for the case where a shared filesystem is used 
that also shares the extended attributes, not assuming an overlay 
filesystem that automatically isolates the extend attributes. With the 
shared filesystem I'd like to prevent any type of setting of extended 
attributes by a child container or more generally anyone mounting it as 
a '2nd consumer', which would make it a shared filesystem. Only the 
process that mounts a filesystem as the '1st consumer' would be able to 
set the extended attributes. I am assuming that using an overlay fs 
would always make you the '1st consumer' -- I would hope that these 
conditions could be detected. And probably the process should also write 
along its host uid as part of writing out the xattr. If all extended 
attributes were to support this model, maybe the 'uid' could be 
associated with the 'name' of the xattr rather than its 'value' (not 
sure whether that's possible).

    Stefan

>
> -serge
>
Jann Horn via Containers June 13, 2017, 8:46 p.m.
On Tue, Jun 13, 2017 at 10:45:02AM -0700, James Bottomley wrote:
> On Tue, 2017-06-13 at 11:14 -0600, Tycho Andersen via Containers wrote:
> > Hi Stefan,
> > 
> > On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
> > > On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > > > Root in a non-initial user ns cannot be trusted to write a 
> > > > traditional security.capability xattr.  If it were allowed to do 
> > > > so, then any unprivileged user on the host could map his own uid 
> > > > to root in a private namespace, write the xattr, and execute the 
> > > > file with privilege on the host.
> > > > 
> > > > However supporting file capabilities in a user namespace is very
> > > > desirable.  Not doing so means that any programs designed to run 
> > > > with limited privilege must continue to support other methods of
> > > > gaining and dropping privilege.  For instance a program installer 
> > > > must detect whether file capabilities can be assigned, and assign 
> > > > them if so but set setuid-root otherwise.  The program in turn 
> > > > must know how to drop partial capabilities, and do so only if
> > > > setuid-root.
> > > 
> > > Hi Serge,
> > > 
> > > 
> > >   I have been looking at patch below primarily to learn how we 
> > > could apply a similar technique to security.ima and security.evm 
> > > for a namespaced IMA. From the paragraphs above I thought that you 
> > > solved the problem of a shared filesystem where one now can write 
> > > different security.capability xattrs by effectively supporting for 
> > > example security.capability[uid=1000] and
> > > security.capability[uid=2000] written into the filesystem. Each
> > > would then become visible as security.capability if the userns 
> > > mapping is set appropriately.
> > 
> > One disadvantage of this approach is that whoever is setting up the
> > container would need to go touch the security.ima attribute for each
> > file in the contianer, which would slow down container creation time.
> > For capabilities this makes sense, because you might want the file to
> > have different capabilities in different namespaces, but for IMA,
> > since the file hash will be the same in every namespace,
> 
> Actually, this isn't necessarily true: IMA may have the hash, you're
> right, but I suspect in most container use cases it will have the
> signature.  It's definitely a use case that the container will be using
> a different keyring from the host, so different signatures are surely
> possible for the same underlying image file.
> 
> One might imagine doing the above via overlays, because the new
> signature should override the old.

Yes, good point, thanks. Assuming the container and the host are using
the same keyring, we could design it in such a way that the container
engine doesn't need to touch every file on creation, which would be
very nice.

Tycho
Stefan Berger June 13, 2017, 8:49 p.m.
On 06/13/2017 04:46 PM, Tycho Andersen wrote:
> On Tue, Jun 13, 2017 at 10:45:02AM -0700, James Bottomley wrote:
>> On Tue, 2017-06-13 at 11:14 -0600, Tycho Andersen via Containers wrote:
>>> Hi Stefan,
>>>
>>> On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
>>>> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
>>>>> Root in a non-initial user ns cannot be trusted to write a
>>>>> traditional security.capability xattr.  If it were allowed to do
>>>>> so, then any unprivileged user on the host could map his own uid
>>>>> to root in a private namespace, write the xattr, and execute the
>>>>> file with privilege on the host.
>>>>>
>>>>> However supporting file capabilities in a user namespace is very
>>>>> desirable.  Not doing so means that any programs designed to run
>>>>> with limited privilege must continue to support other methods of
>>>>> gaining and dropping privilege.  For instance a program installer
>>>>> must detect whether file capabilities can be assigned, and assign
>>>>> them if so but set setuid-root otherwise.  The program in turn
>>>>> must know how to drop partial capabilities, and do so only if
>>>>> setuid-root.
>>>> Hi Serge,
>>>>
>>>>
>>>>    I have been looking at patch below primarily to learn how we
>>>> could apply a similar technique to security.ima and security.evm
>>>> for a namespaced IMA. From the paragraphs above I thought that you
>>>> solved the problem of a shared filesystem where one now can write
>>>> different security.capability xattrs by effectively supporting for
>>>> example security.capability[uid=1000] and
>>>> security.capability[uid=2000] written into the filesystem. Each
>>>> would then become visible as security.capability if the userns
>>>> mapping is set appropriately.
>>> One disadvantage of this approach is that whoever is setting up the
>>> container would need to go touch the security.ima attribute for each
>>> file in the contianer, which would slow down container creation time.
>>> For capabilities this makes sense, because you might want the file to
>>> have different capabilities in different namespaces, but for IMA,
>>> since the file hash will be the same in every namespace,
>> Actually, this isn't necessarily true: IMA may have the hash, you're
>> right, but I suspect in most container use cases it will have the
>> signature.  It's definitely a use case that the container will be using
>> a different keyring from the host, so different signatures are surely
>> possible for the same underlying image file.
>>
>> One might imagine doing the above via overlays, because the new
>> signature should override the old.
> Yes, good point, thanks. Assuming the container and the host are using
> the same keyring, we could design it in such a way that the container
> engine doesn't need to touch every file on creation, which would be
> very nice.

I don't think this will be the general case. The host may be Ubuntu, the 
guest could be Fedora and you'll have different keys. I don't think you 
would want the container keys on the host keyring.

    Stefan


>
> Tycho
>
Jann Horn via Containers June 13, 2017, 8:51 p.m.
On Tue, Jun 13, 2017 at 01:42:24PM -0400, Stefan Berger wrote:
> On 06/13/2017 01:14 PM, Tycho Andersen wrote:
> > Hi Stefan,
> > 
> > On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
> > > On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > > > Root in a non-initial user ns cannot be trusted to write a traditional
> > > > security.capability xattr.  If it were allowed to do so, then any
> > > > unprivileged user on the host could map his own uid to root in a private
> > > > namespace, write the xattr, and execute the file with privilege on the
> > > > host.
> > > > 
> > > > However supporting file capabilities in a user namespace is very
> > > > desirable.  Not doing so means that any programs designed to run with
> > > > limited privilege must continue to support other methods of gaining and
> > > > dropping privilege.  For instance a program installer must detect
> > > > whether file capabilities can be assigned, and assign them if so but set
> > > > setuid-root otherwise.  The program in turn must know how to drop
> > > > partial capabilities, and do so only if setuid-root.
> > > Hi Serge,
> > > 
> > > 
> > >    I have been looking at patch below primarily to learn how we could apply a
> > > similar technique to security.ima and security.evm for a namespaced IMA.
> > >  From the paragraphs above I thought that you solved the problem of a shared
> > > filesystem where one now can write different security.capability xattrs by
> > > effectively supporting for example security.capability[uid=1000] and
> > > security.capability[uid=2000] written into the filesystem. Each would then
> > > become visible as security.capability if the userns mapping is set
> > > appropriately.
> > One disadvantage of this approach is that whoever is setting up the
> > container would need to go touch the security.ima attribute for each
> > file in the contianer, which would slow down container creation time.
> > For capabilities this makes sense, because you might want the file to
> > have different capabilities in different namespaces, but for IMA,
> > since the file hash will be the same in every namespace, it would be
> > nice to use a design that avoids touching each file on new ns
> > creation.
> 
> Actually IMA in appraisal mode also supports signatures in the extended
> attributes. Depending on which (public) keys you put on the IMA keyring for
> a namespaced IMA, you may need a different signature on a file to be able to
> access it (execute it for example). For this to work containers would have
> to be able to ship with security.ima xattrs embedded in them and users
> should be able to apply signatures on files on a running container (or while
> building it).

Yes, we will definitely support shipping images with the security.ima
xattrs set when namespaced support is available in the kernel.

> I worked on a prototype for namespaces IMA before where one of
> the issues was the CAP_SYS_ADMIN gate that disallows setting of security.ima
> when dropped. So some form of exception would have to be granted to be
> allowed to set security.ima from inside a container while CAP_SYS_ADMIN
> isn't there. And of course we need to protect the host's filesystem from an
> attack where the user just modifies the security.ima signature associated
> with a file.

At least initially, I think it would be fine to require
capable(CAP_SYS_ADMIN); right now the container engine is the one
responsible for setting up the container's rootfs (via downloading an
image and extracting it), and these typically run as root.

Eventually it would be great to relax this, because
"unprivileged/rootless" containers are a thing people are interested
in, but as a first cut I'm not sure it's necessary.

Cheers, and thanks for looking at this!

Tycho
Jann Horn via Containers June 13, 2017, 8:53 p.m.
On Tue, Jun 13, 2017 at 04:49:03PM -0400, Stefan Berger wrote:
> On 06/13/2017 04:46 PM, Tycho Andersen wrote:
> > On Tue, Jun 13, 2017 at 10:45:02AM -0700, James Bottomley wrote:
> > > On Tue, 2017-06-13 at 11:14 -0600, Tycho Andersen via Containers wrote:
> > > > Hi Stefan,
> > > > 
> > > > On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
> > > > > On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > > > > > Root in a non-initial user ns cannot be trusted to write a
> > > > > > traditional security.capability xattr.  If it were allowed to do
> > > > > > so, then any unprivileged user on the host could map his own uid
> > > > > > to root in a private namespace, write the xattr, and execute the
> > > > > > file with privilege on the host.
> > > > > > 
> > > > > > However supporting file capabilities in a user namespace is very
> > > > > > desirable.  Not doing so means that any programs designed to run
> > > > > > with limited privilege must continue to support other methods of
> > > > > > gaining and dropping privilege.  For instance a program installer
> > > > > > must detect whether file capabilities can be assigned, and assign
> > > > > > them if so but set setuid-root otherwise.  The program in turn
> > > > > > must know how to drop partial capabilities, and do so only if
> > > > > > setuid-root.
> > > > > Hi Serge,
> > > > > 
> > > > > 
> > > > >    I have been looking at patch below primarily to learn how we
> > > > > could apply a similar technique to security.ima and security.evm
> > > > > for a namespaced IMA. From the paragraphs above I thought that you
> > > > > solved the problem of a shared filesystem where one now can write
> > > > > different security.capability xattrs by effectively supporting for
> > > > > example security.capability[uid=1000] and
> > > > > security.capability[uid=2000] written into the filesystem. Each
> > > > > would then become visible as security.capability if the userns
> > > > > mapping is set appropriately.
> > > > One disadvantage of this approach is that whoever is setting up the
> > > > container would need to go touch the security.ima attribute for each
> > > > file in the contianer, which would slow down container creation time.
> > > > For capabilities this makes sense, because you might want the file to
> > > > have different capabilities in different namespaces, but for IMA,
> > > > since the file hash will be the same in every namespace,
> > > Actually, this isn't necessarily true: IMA may have the hash, you're
> > > right, but I suspect in most container use cases it will have the
> > > signature.  It's definitely a use case that the container will be using
> > > a different keyring from the host, so different signatures are surely
> > > possible for the same underlying image file.
> > > 
> > > One might imagine doing the above via overlays, because the new
> > > signature should override the old.
> > Yes, good point, thanks. Assuming the container and the host are using
> > the same keyring, we could design it in such a way that the container
> > engine doesn't need to touch every file on creation, which would be
> > very nice.
> 
> I don't think this will be the general case. The host may be Ubuntu, the
> guest could be Fedora and you'll have different keys. I don't think you
> would want the container keys on the host keyring.

I guess it depends: if your entire infrastructure needs to be signed
by your ops team, it would (presumably) all be the same ops key. If
you're running off the shelf stuff from the distros or from a vendor,
probably not, I agree.

Cheers,

Tycho
Stefan Berger June 13, 2017, 8:58 p.m.
On 06/13/2017 04:53 PM, Tycho Andersen wrote:
> On Tue, Jun 13, 2017 at 04:49:03PM -0400, Stefan Berger wrote:
>> On 06/13/2017 04:46 PM, Tycho Andersen wrote:
>>> On Tue, Jun 13, 2017 at 10:45:02AM -0700, James Bottomley wrote:
>>>> On Tue, 2017-06-13 at 11:14 -0600, Tycho Andersen via Containers wrote:
>>>>> Hi Stefan,
>>>>>
>>>>> On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
>>>>>> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
>>>>>>> Root in a non-initial user ns cannot be trusted to write a
>>>>>>> traditional security.capability xattr.  If it were allowed to do
>>>>>>> so, then any unprivileged user on the host could map his own uid
>>>>>>> to root in a private namespace, write the xattr, and execute the
>>>>>>> file with privilege on the host.
>>>>>>>
>>>>>>> However supporting file capabilities in a user namespace is very
>>>>>>> desirable.  Not doing so means that any programs designed to run
>>>>>>> with limited privilege must continue to support other methods of
>>>>>>> gaining and dropping privilege.  For instance a program installer
>>>>>>> must detect whether file capabilities can be assigned, and assign
>>>>>>> them if so but set setuid-root otherwise.  The program in turn
>>>>>>> must know how to drop partial capabilities, and do so only if
>>>>>>> setuid-root.
>>>>>> Hi Serge,
>>>>>>
>>>>>>
>>>>>>     I have been looking at patch below primarily to learn how we
>>>>>> could apply a similar technique to security.ima and security.evm
>>>>>> for a namespaced IMA. From the paragraphs above I thought that you
>>>>>> solved the problem of a shared filesystem where one now can write
>>>>>> different security.capability xattrs by effectively supporting for
>>>>>> example security.capability[uid=1000] and
>>>>>> security.capability[uid=2000] written into the filesystem. Each
>>>>>> would then become visible as security.capability if the userns
>>>>>> mapping is set appropriately.
>>>>> One disadvantage of this approach is that whoever is setting up the
>>>>> container would need to go touch the security.ima attribute for each
>>>>> file in the contianer, which would slow down container creation time.
>>>>> For capabilities this makes sense, because you might want the file to
>>>>> have different capabilities in different namespaces, but for IMA,
>>>>> since the file hash will be the same in every namespace,
>>>> Actually, this isn't necessarily true: IMA may have the hash, you're
>>>> right, but I suspect in most container use cases it will have the
>>>> signature.  It's definitely a use case that the container will be using
>>>> a different keyring from the host, so different signatures are surely
>>>> possible for the same underlying image file.
>>>>
>>>> One might imagine doing the above via overlays, because the new
>>>> signature should override the old.
>>> Yes, good point, thanks. Assuming the container and the host are using
>>> the same keyring, we could design it in such a way that the container
>>> engine doesn't need to touch every file on creation, which would be
>>> very nice.
>> I don't think this will be the general case. The host may be Ubuntu, the
>> guest could be Fedora and you'll have different keys. I don't think you
>> would want the container keys on the host keyring.
> I guess it depends: if your entire infrastructure needs to be signed
> by your ops team, it would (presumably) all be the same ops key. If
> you're running off the shelf stuff from the distros or from a vendor,
> probably not, I agree.

I think this will largely depend on how IMA works with namespaces. In 
the prototype I mention we had each container have its own _ima keyring 
and the public keys were inside the container at the typical location 
(for physical machines) and the mgmt. stack (docker) basically emulating 
what the initramfs scripts are doing, which is loading all keys from 
/etc/keys/ima into that keyring.

    Stefan

> Cheers,
>
> Tycho
>
Mimi Zohar June 13, 2017, 8:59 p.m.
On Tue, 2017-06-13 at 14:53 -0600, Tycho Andersen wrote:
> On Tue, Jun 13, 2017 at 04:49:03PM -0400, Stefan Berger wrote:
> > On 06/13/2017 04:46 PM, Tycho Andersen wrote:
> > > On Tue, Jun 13, 2017 at 10:45:02AM -0700, James Bottomley wrote:
> > > > On Tue, 2017-06-13 at 11:14 -0600, Tycho Andersen via Containers wrote:
> > > > > Hi Stefan,
> > > > > 
> > > > > On Tue, Jun 13, 2017 at 11:47:26AM -0400, Stefan Berger wrote:
> > > > > > On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > > > > > > Root in a non-initial user ns cannot be trusted to write a
> > > > > > > traditional security.capability xattr.  If it were allowed to do
> > > > > > > so, then any unprivileged user on the host could map his own uid
> > > > > > > to root in a private namespace, write the xattr, and execute the
> > > > > > > file with privilege on the host.
> > > > > > > 
> > > > > > > However supporting file capabilities in a user namespace is very
> > > > > > > desirable.  Not doing so means that any programs designed to run
> > > > > > > with limited privilege must continue to support other methods of
> > > > > > > gaining and dropping privilege.  For instance a program installer
> > > > > > > must detect whether file capabilities can be assigned, and assign
> > > > > > > them if so but set setuid-root otherwise.  The program in turn
> > > > > > > must know how to drop partial capabilities, and do so only if
> > > > > > > setuid-root.
> > > > > > Hi Serge,
> > > > > > 
> > > > > > 
> > > > > >    I have been looking at patch below primarily to learn how we
> > > > > > could apply a similar technique to security.ima and security.evm
> > > > > > for a namespaced IMA. From the paragraphs above I thought that you
> > > > > > solved the problem of a shared filesystem where one now can write
> > > > > > different security.capability xattrs by effectively supporting for
> > > > > > example security.capability[uid=1000] and
> > > > > > security.capability[uid=2000] written into the filesystem. Each
> > > > > > would then become visible as security.capability if the userns
> > > > > > mapping is set appropriately.
> > > > > One disadvantage of this approach is that whoever is setting up the
> > > > > container would need to go touch the security.ima attribute for each
> > > > > file in the contianer, which would slow down container creation time.
> > > > > For capabilities this makes sense, because you might want the file to
> > > > > have different capabilities in different namespaces, but for IMA,
> > > > > since the file hash will be the same in every namespace,
> > > > Actually, this isn't necessarily true: IMA may have the hash, you're
> > > > right, but I suspect in most container use cases it will have the
> > > > signature.  It's definitely a use case that the container will be using
> > > > a different keyring from the host, so different signatures are surely
> > > > possible for the same underlying image file.
> > > > 
> > > > One might imagine doing the above via overlays, because the new
> > > > signature should override the old.
> > > Yes, good point, thanks. Assuming the container and the host are using
> > > the same keyring, we could design it in such a way that the container
> > > engine doesn't need to touch every file on creation, which would be
> > > very nice.
> > 
> > I don't think this will be the general case. The host may be Ubuntu, the
> > guest could be Fedora and you'll have different keys. I don't think you
> > would want the container keys on the host keyring.
> 
> I guess it depends: if your entire infrastructure needs to be signed
> by your ops team, it would (presumably) all be the same ops key. If
> you're running off the shelf stuff from the distros or from a vendor,
> probably not, I agree.

Assuming you want to support container specific executables, you would
want them specifically signed by a key not on the system IMA keyring.

Mimi
Jann Horn via Containers June 13, 2017, 9:09 p.m.
On Tue, Jun 13, 2017 at 04:59:30PM -0400, Mimi Zohar wrote:
> Assuming you want to support container specific executables, you would
> want them specifically signed by a key not on the system IMA keyring.

Yes, this is a good point.

Cheers,

Tycho
Serge E. Hallyn June 13, 2017, 11:42 p.m.
Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> >Root in a non-initial user ns cannot be trusted to write a traditional
> >security.capability xattr.  If it were allowed to do so, then any
> >unprivileged user on the host could map his own uid to root in a private
> >namespace, write the xattr, and execute the file with privilege on the
> >host.
> >
> >However supporting file capabilities in a user namespace is very
> >desirable.  Not doing so means that any programs designed to run with
> >limited privilege must continue to support other methods of gaining and
> >dropping privilege.  For instance a program installer must detect
> >whether file capabilities can be assigned, and assign them if so but set
> >setuid-root otherwise.  The program in turn must know how to drop
> >partial capabilities, and do so only if setuid-root.
> 
> Hi Serge,
> 
> 
>   I have been looking at patch below primarily to learn how we could
> apply a similar technique to security.ima and security.evm for a
> namespaced IMA. From the paragraphs above I thought that you solved
> the problem of a shared filesystem where one now can write different
> security.capability xattrs by effectively supporting for example
> security.capability[uid=1000] and security.capability[uid=2000]

Interesting idea.  Worth considering.

> written into the filesystem. Each would then become visible as
> security.capability if the userns mapping is set appropriately.
> However, this doesn't seem to be how it is implemented. There seems

Indeed, when I was considering supporting multiple simulatenous
xattrs, I did it as something like:

struct vfs_ns_cap_data {
	struct {
		__le32 permitted;
		__le32 inheritable;
	} data[VFS_CAP_U32];
	__le32 rootid;
};

struct vfs_ns_cap {
	__le32 magic_etc;
	__le32 n_entries;
	struct ns_cap_data data[0];
}; // followed by n_entries of struct ns_cap_data

You're instead suggesting encoding the rootuid in the name,
which is interesting.

> to be only a single such entry with uid appended to it and, if it
> was a shared filesystem, the first one to set this attribute blocks
> everyone else from writing the xattr. Is that how it works? Would

Approximately - indeed there is only a single xattr.  But it can be
overwritten, so long as the writer has CAP_SETFCAP over the user_ns
which mounted the filesystem.

> that work differently with an overlay filesystem ? I think a similar
> model could also work for IMA, but maybe you have some thoughts. The
> only thing I would be concerned about is blocking the parent
> container's root user from setting an xattr.
> 
> Regards,
>    Stefan
> 
> 
> >
> >This patch introduces v3 of the security.capability xattr.  It builds a
> >vfs_ns_cap_data struct by appending a uid_t rootid to struct
> >vfs_cap_data.  This is the absolute uid_t (that is, the uid_t in user
> >namespace which mounted the filesystem, usually init_user_ns) of the
> >root id in whose namespaces the file capabilities may take effect.
> >
> >When a task asks to write a v2 security.capability xattr, if it is
> >privileged with respect to the userns which mounted the filesystem, then
> >nothing should change.  Otherwise, the kernel will transparently rewrite
> >the xattr as a v3 with the appropriate rootid.  This is done during the
> >execution of setxattr() to catch user-space-initiated capability writes.
> >Subsequently, any task executing the file which has the noted kuid as
> >its root uid, or which is in a descendent user_ns of such a user_ns,
> >will run the file with capabilities.
> >
> >Similarly when asking to read file capabilities, a v3 capability will
> >be presented as v2 if it applies to the caller's namespace.
> >
> >If a task writes a v3 security.capability, then it can provide a uid for
> >the xattr so long as the uid is valid in its own user namespace, and it
> >is privileged with CAP_SETFCAP over its namespace.  The kernel will
> >translate that rootid to an absolute uid, and write that to disk.  After
> >this, a task in the writer's namespace will not be able to use those
> >capabilities (unless rootid was 0), but a task in a namespace where the
> >given uid is root will.
> >
> >Only a single security.capability xattr may exist at a time for a given
> >file.  A task may overwrite an existing xattr so long as it is
> >privileged over the inode.  Note this is a departure from previous
> >semantics, which required privilege to remove a security.capability
> >xattr.  This check can be re-added if deemed useful.
> >
> >This allows a simple setxattr to work, allows tar/untar to work, and
> >allows us to tar in one namespace and untar in another while preserving
> >the capability, without risking leaking privilege into a parent
> >namespace.
> >
> >Example using tar:
> >
> >  $ cp /bin/sleep sleepx
> >  $ mkdir b1 b2
> >  $ lxc-usernsexec -m b:0:100000:1 -m b:1:$(id -u):1 -- chown 0:0 b1
> >  $ lxc-usernsexec -m b:0:100001:1 -m b:1:$(id -u):1 -- chown 0:0 b2
> >  $ lxc-usernsexec -m b:0:100000:1000 -- tar --xattrs-include=security.capability --xattrs -cf b1/sleepx.tar sleepx
> >  $ lxc-usernsexec -m b:0:100001:1000 -- tar --xattrs-include=security.capability --xattrs -C b2 -xf b1/sleepx.tar
> >  $ lxc-usernsexec -m b:0:100001:1000 -- getcap b2/sleepx
> >    b2/sleepx = cap_sys_admin+ep
> >  # /opt/ltp/testcases/bin/getv3xattr b2/sleepx
> >    v3 xattr, rootid is 100001
> >
> >A patch to linux-test-project adding a new set of tests for this
> >functionality is in the nsfscaps branch at github.com/hallyn/ltp
> >
> >Changelog:
> >    Nov 02 2016: fix invalid check at refuse_fcap_overwrite()
> >    Nov 07 2016: convert rootid from and to fs user_ns
> >    (From ebiederm: mar 28 2017)
> >      commoncap.c: fix typos - s/v4/v3
> >      get_vfs_caps_from_disk: clarify the fs_ns root access check
> >      nsfscaps: change the code split for cap_inode_setxattr()
> >    Apr 09 2017:
> >        don't return v3 cap for caps owned by current root.
> >       return a v2 cap for a true v2 cap in non-init ns
> >    Apr 18 2017:
> >       . Change the flow of fscap writing to support s_user_ns writing.
> >       . Remove refuse_fcap_overwrite().  The value of the previous
> >         xattr doesn't matter.
> >    Apr 24 2017:
> >       . incorporate Eric's incremental diff
> >       . move cap_convert_nscap to setxattr and simplify its usage
> >    May 8, 2017:
> >       . fix leaking dentry refcount in cap_inode_getsecurity
> >
> >Signed-off-by: Serge Hallyn <serge@hallyn.com>
> >---
> >  fs/xattr.c                      |   6 +
> >  include/linux/capability.h      |   2 +
> >  include/linux/security.h        |   2 +
> >  include/uapi/linux/capability.h |  22 +++-
> >  security/commoncap.c            | 270 +++++++++++++++++++++++++++++++++++++---
> >  5 files changed, 280 insertions(+), 22 deletions(-)
> >
> >diff --git a/fs/xattr.c b/fs/xattr.c
> >index 7e3317c..0a9dea4 100644
> >--- a/fs/xattr.c
> >+++ b/fs/xattr.c
> >@@ -444,6 +444,12 @@ setxattr(struct dentry *d, const char __user *name, const void __user *value,
> >  		if ((strcmp(kname, XATTR_NAME_POSIX_ACL_ACCESS) == 0) ||
> >  		    (strcmp(kname, XATTR_NAME_POSIX_ACL_DEFAULT) == 0))
> >  			posix_acl_fix_xattr_from_user(kvalue, size);
> >+		else if (strcmp(kname, XATTR_NAME_CAPS) == 0) {
> >+			error = cap_convert_nscap(d, &kvalue, size);
> >+			if (error < 0)
> >+				goto out;
> >+			size = error;
> >+		}
> >  	}
> >
> >  	error = vfs_setxattr(d, kname, kvalue, size, flags);
> >diff --git a/include/linux/capability.h b/include/linux/capability.h
> >index 6ffb67e..b52e278 100644
> >--- a/include/linux/capability.h
> >+++ b/include/linux/capability.h
> >@@ -248,4 +248,6 @@ extern bool ptracer_capable(struct task_struct *tsk, struct user_namespace *ns);
> >  /* audit system wants to get cap info from files as well */
> >  extern int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data *cpu_caps);
> >
> >+extern int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size);
> >+
> >  #endif /* !_LINUX_CAPABILITY_H */
> >diff --git a/include/linux/security.h b/include/linux/security.h
> >index 96899fa..bd49cc1 100644
> >--- a/include/linux/security.h
> >+++ b/include/linux/security.h
> >@@ -86,6 +86,8 @@ extern int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >  extern int cap_inode_removexattr(struct dentry *dentry, const char *name);
> >  extern int cap_inode_need_killpriv(struct dentry *dentry);
> >  extern int cap_inode_killpriv(struct dentry *dentry);
> >+extern int cap_inode_getsecurity(struct inode *inode, const char *name,
> >+				 void **buffer, bool alloc);
> >  extern int cap_mmap_addr(unsigned long addr);
> >  extern int cap_mmap_file(struct file *file, unsigned long reqprot,
> >  			 unsigned long prot, unsigned long flags);
> >diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> >index 49bc062..fd4f87d 100644
> >--- a/include/uapi/linux/capability.h
> >+++ b/include/uapi/linux/capability.h
> >@@ -60,9 +60,13 @@ typedef struct __user_cap_data_struct {
> >  #define VFS_CAP_U32_2           2
> >  #define XATTR_CAPS_SZ_2         (sizeof(__le32)*(1 + 2*VFS_CAP_U32_2))
> >
> >-#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_2
> >-#define VFS_CAP_U32             VFS_CAP_U32_2
> >-#define VFS_CAP_REVISION	VFS_CAP_REVISION_2
> >+#define VFS_CAP_REVISION_3	0x03000000
> >+#define VFS_CAP_U32_3           2
> >+#define XATTR_CAPS_SZ_3         (sizeof(__le32)*(2 + 2*VFS_CAP_U32_3))
> >+
> >+#define XATTR_CAPS_SZ           XATTR_CAPS_SZ_3
> >+#define VFS_CAP_U32             VFS_CAP_U32_3
> >+#define VFS_CAP_REVISION	VFS_CAP_REVISION_3
> >
> >  struct vfs_cap_data {
> >  	__le32 magic_etc;            /* Little endian */
> >@@ -72,6 +76,18 @@ struct vfs_cap_data {
> >  	} data[VFS_CAP_U32];
> >  };
> >
> >+/*
> >+ * same as vfs_cap_data but with a rootid at the end
> >+ */
> >+struct vfs_ns_cap_data {
> >+	__le32 magic_etc;
> >+	struct {
> >+		__le32 permitted;    /* Little endian */
> >+		__le32 inheritable;  /* Little endian */
> >+	} data[VFS_CAP_U32];
> >+	__le32 rootid;
> >+};
> >+
> >  #ifndef __KERNEL__
> >
> >  /*
> >diff --git a/security/commoncap.c b/security/commoncap.c
> >index 78b3783..c28d126 100644
> >--- a/security/commoncap.c
> >+++ b/security/commoncap.c
> >@@ -332,6 +332,209 @@ int cap_inode_killpriv(struct dentry *dentry)
> >  	return error;
> >  }
> >
> >+static bool rootid_owns_currentns(kuid_t kroot)
> >+{
> >+	struct user_namespace *ns;
> >+
> >+	if (!uid_valid(kroot))
> >+		return false;
> >+
> >+	for (ns = current_user_ns(); ; ns = ns->parent) {
> >+		if (from_kuid(ns, kroot) == 0)
> >+			return true;
> >+		if (ns == &init_user_ns)
> >+			break;
> >+	}
> >+
> >+	return false;
> >+}
> >+
> >+static __u32 sansflags(__u32 m)
> >+{
> >+	return m & ~VFS_CAP_FLAGS_EFFECTIVE;
> >+}
> >+
> >+static bool is_v2header(size_t size, __le32 magic)
> >+{
> >+	__u32 m = le32_to_cpu(magic);
> >+	if (size != XATTR_CAPS_SZ_2)
> >+		return false;
> >+	return sansflags(m) == VFS_CAP_REVISION_2;
> >+}
> >+
> >+static bool is_v3header(size_t size, __le32 magic)
> >+{
> >+	__u32 m = le32_to_cpu(magic);
> >+
> >+	if (size != XATTR_CAPS_SZ_3)
> >+		return false;
> >+	return sansflags(m) == VFS_CAP_REVISION_3;
> >+}
> >+
> >+/*
> >+ * getsecurity: We are called for security.* before any attempt to read the
> >+ * xattr from the inode itself.
> >+ *
> >+ * This gives us a chance to read the on-disk value and convert it.  If we
> >+ * return -EOPNOTSUPP, then vfs_getxattr() will call the i_op handler.
> >+ *
> >+ * Note we are not called by vfs_getxattr_alloc(), but that is only called
> >+ * by the integrity subsystem, which really wants the unconverted values -
> >+ * so that's good.
> >+ */
> >+int cap_inode_getsecurity(struct inode *inode, const char *name, void **buffer,
> >+			  bool alloc)
> >+{
> >+	int size, ret;
> >+	kuid_t kroot;
> >+	uid_t root, mappedroot;
> >+	char *tmpbuf = NULL;
> >+	struct vfs_cap_data *cap;
> >+	struct vfs_ns_cap_data *nscap;
> >+	struct dentry *dentry;
> >+	struct user_namespace *fs_ns;
> >+
> >+	if (strcmp(name, "capability") != 0)
> >+		return -EOPNOTSUPP;
> >+
> >+	dentry = d_find_alias(inode);
> >+	if (!dentry)
> >+		return -EINVAL;
> >+
> >+	size = sizeof(struct vfs_ns_cap_data);
> >+	ret = (int) vfs_getxattr_alloc(dentry, XATTR_NAME_CAPS,
> >+				 &tmpbuf, size, GFP_NOFS);
> >+	dput(dentry);
> >+
> >+	if (ret < 0)
> >+		return ret;
> >+
> >+	fs_ns = inode->i_sb->s_user_ns;
> >+	cap = (struct vfs_cap_data *) tmpbuf;
> >+	if (is_v2header((size_t) ret, cap->magic_etc)) {
> >+		/* If this is sizeof(vfs_cap_data) then we're ok with the
> >+		 * on-disk value, so return that.  */
> >+		if (alloc)
> >+			*buffer = tmpbuf;
> >+		else
> >+			kfree(tmpbuf);
> >+		return ret;
> >+	} else if (!is_v3header((size_t) ret, cap->magic_etc)) {
> >+		kfree(tmpbuf);
> >+		return -EINVAL;
> >+	}
> >+
> >+	nscap = (struct vfs_ns_cap_data *) tmpbuf;
> >+	root = le32_to_cpu(nscap->rootid);
> >+	kroot = make_kuid(fs_ns, root);
> >+
> >+	/* If the root kuid maps to a valid uid in current ns, then return
> >+	 * this as a nscap. */
> >+	mappedroot = from_kuid(current_user_ns(), kroot);
> >+	if (mappedroot != (uid_t)-1 && mappedroot != (uid_t)0) {
> >+		if (alloc) {
> >+			*buffer = tmpbuf;
> >+			nscap->rootid = cpu_to_le32(mappedroot);
> >+		} else
> >+			kfree(tmpbuf);
> >+		return size;
> >+	}
> >+
> >+	if (!rootid_owns_currentns(kroot)) {
> >+		kfree(tmpbuf);
> >+		return -EOPNOTSUPP;
> >+	}
> >+
> >+	/* This comes from a parent namespace.  Return as a v2 capability */
> >+	size = sizeof(struct vfs_cap_data);
> >+	if (alloc) {
> >+		*buffer = kmalloc(size, GFP_ATOMIC);
> >+		if (*buffer) {
> >+			struct vfs_cap_data *cap = *buffer;
> >+			__le32 nsmagic, magic;
> >+			magic = VFS_CAP_REVISION_2;
> >+			nsmagic = le32_to_cpu(nscap->magic_etc);
> >+			if (nsmagic & VFS_CAP_FLAGS_EFFECTIVE)
> >+				magic |= VFS_CAP_FLAGS_EFFECTIVE;
> >+			memcpy(&cap->data, &nscap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> >+			cap->magic_etc = cpu_to_le32(magic);
> >+		}
> >+	}
> >+	kfree(tmpbuf);
> >+	return size;
> >+}
> >+
> >+static kuid_t rootid_from_xattr(const void *value, size_t size,
> >+				struct user_namespace *task_ns)
> >+{
> >+	const struct vfs_ns_cap_data *nscap = value;
> >+	uid_t rootid = 0;
> >+
> >+	if (size == XATTR_CAPS_SZ_3)
> >+		rootid = le32_to_cpu(nscap->rootid);
> >+
> >+	return make_kuid(task_ns, rootid);
> >+}
> >+
> >+static bool validheader(size_t size, __le32 magic)
> >+{
> >+	return is_v2header(size, magic) || is_v3header(size, magic);
> >+}
> >+
> >+/*
> >+ * User requested a write of security.capability.  If needed, update the
> >+ * xattr to change from v2 to v3, or to fixup the v3 rootid.
> >+ *
> >+ * If all is ok, we return the new size, on error return < 0.
> >+ */
> >+int cap_convert_nscap(struct dentry *dentry, void **ivalue, size_t size)
> >+{
> >+	struct vfs_ns_cap_data *nscap;
> >+	uid_t nsrootid;
> >+	const struct vfs_cap_data *cap = *ivalue;
> >+	__u32 magic, nsmagic;
> >+	struct inode *inode = d_backing_inode(dentry);
> >+	struct user_namespace *task_ns = current_user_ns(),
> >+		*fs_ns = inode->i_sb->s_user_ns;
> >+	kuid_t rootid;
> >+	size_t newsize;
> >+
> >+	if (!*ivalue)
> >+		return -EINVAL;
> >+	if (!validheader(size, cap->magic_etc))
> >+		return -EINVAL;
> >+	if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
> >+		return -EPERM;
> >+	if (size == XATTR_CAPS_SZ_2)
> >+		if (ns_capable(inode->i_sb->s_user_ns, CAP_SETFCAP))
> >+			/* user is privileged, just write the v2 */
> >+			return size;
> >+
> >+	rootid = rootid_from_xattr(*ivalue, size, task_ns);
> >+	if (!uid_valid(rootid))
> >+		return -EINVAL;
> >+
> >+	nsrootid = from_kuid(fs_ns, rootid);
> >+	if (nsrootid == -1)
> >+		return -EINVAL;
> >+
> >+	newsize = sizeof(struct vfs_ns_cap_data);
> >+	nscap = kmalloc(newsize, GFP_ATOMIC);
> >+	if (!nscap)
> >+		return -ENOMEM;
> >+	nscap->rootid = cpu_to_le32(nsrootid);
> >+	nsmagic = VFS_CAP_REVISION_3;
> >+	magic = le32_to_cpu(cap->magic_etc);
> >+	if (magic & VFS_CAP_FLAGS_EFFECTIVE)
> >+		nsmagic |= VFS_CAP_FLAGS_EFFECTIVE;
> >+	nscap->magic_etc = cpu_to_le32(nsmagic);
> >+	memcpy(&nscap->data, &cap->data, sizeof(__le32) * 2 * VFS_CAP_U32);
> >+
> >+	kvfree(*ivalue);
> >+	*ivalue = nscap;
> >+	return newsize;
> >+}
> >+
> >  /*
> >   * Calculate the new process capability sets from the capability sets attached
> >   * to a file.
> >@@ -385,7 +588,10 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >  	__u32 magic_etc;
> >  	unsigned tocopy, i;
> >  	int size;
> >-	struct vfs_cap_data caps;
> >+	struct vfs_ns_cap_data data, *nscaps = &data;
> >+	struct vfs_cap_data *caps = (struct vfs_cap_data *) &data;
> >+	kuid_t rootkuid;
> >+	struct user_namespace *fs_ns = inode->i_sb->s_user_ns;
> >
> >  	memset(cpu_caps, 0, sizeof(struct cpu_vfs_cap_data));
> >
> >@@ -393,18 +599,20 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >  		return -ENODATA;
> >
> >  	size = __vfs_getxattr((struct dentry *)dentry, inode,
> >-			      XATTR_NAME_CAPS, &caps, XATTR_CAPS_SZ);
> >+			      XATTR_NAME_CAPS, &data, XATTR_CAPS_SZ);
> >  	if (size == -ENODATA || size == -EOPNOTSUPP)
> >  		/* no data, that's ok */
> >  		return -ENODATA;
> >+
> >  	if (size < 0)
> >  		return size;
> >
> >  	if (size < sizeof(magic_etc))
> >  		return -EINVAL;
> >
> >-	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps.magic_etc);
> >+	cpu_caps->magic_etc = magic_etc = le32_to_cpu(caps->magic_etc);
> >
> >+	rootkuid = make_kuid(fs_ns, 0);
> >  	switch (magic_etc & VFS_CAP_REVISION_MASK) {
> >  	case VFS_CAP_REVISION_1:
> >  		if (size != XATTR_CAPS_SZ_1)
> >@@ -416,15 +624,27 @@ int get_vfs_caps_from_disk(const struct dentry *dentry, struct cpu_vfs_cap_data
> >  			return -EINVAL;
> >  		tocopy = VFS_CAP_U32_2;
> >  		break;
> >+	case VFS_CAP_REVISION_3:
> >+		if (size != XATTR_CAPS_SZ_3)
> >+			return -EINVAL;
> >+		tocopy = VFS_CAP_U32_3;
> >+		rootkuid = make_kuid(fs_ns, le32_to_cpu(nscaps->rootid));
> >+		break;
> >+
> >  	default:
> >  		return -EINVAL;
> >  	}
> >+	/* Limit the caps to the mounter of the filesystem
> >+	 * or the more limited uid specified in the xattr.
> >+	 */
> >+	if (!rootid_owns_currentns(rootkuid))
> >+		return -ENODATA;
> >
> >  	CAP_FOR_EACH_U32(i) {
> >  		if (i >= tocopy)
> >  			break;
> >-		cpu_caps->permitted.cap[i] = le32_to_cpu(caps.data[i].permitted);
> >-		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps.data[i].inheritable);
> >+		cpu_caps->permitted.cap[i] = le32_to_cpu(caps->data[i].permitted);
> >+		cpu_caps->inheritable.cap[i] = le32_to_cpu(caps->data[i].inheritable);
> >  	}
> >
> >  	cpu_caps->permitted.cap[CAP_LAST_U32] &= CAP_LAST_U32_VALID_MASK;
> >@@ -462,8 +682,8 @@ static int get_file_caps(struct linux_binprm *bprm, bool *effective, bool *has_c
> >  	rc = get_vfs_caps_from_disk(bprm->file->f_path.dentry, &vcaps);
> >  	if (rc < 0) {
> >  		if (rc == -EINVAL)
> >-			printk(KERN_NOTICE "%s: get_vfs_caps_from_disk returned %d for %s\n",
> >-				__func__, rc, bprm->filename);
> >+			printk(KERN_NOTICE "Invalid argument reading file caps for %s\n",
> >+					bprm->filename);
> >  		else if (rc == -ENODATA)
> >  			rc = 0;
> >  		goto out;
> >@@ -660,15 +880,19 @@ int cap_bprm_secureexec(struct linux_binprm *bprm)
> >  int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >  		       const void *value, size_t size, int flags)
> >  {
> >-	if (!strcmp(name, XATTR_NAME_CAPS)) {
> >-		if (!capable(CAP_SETFCAP))
> >-			return -EPERM;
> >+	/* Ignore non-security xattrs */
> >+	if (strncmp(name, XATTR_SECURITY_PREFIX,
> >+			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> >+		return 0;
> >+
> >+	/*
> >+	 * For XATTR_NAME_CAPS the check will be done in
> >+	 * cap_convert_nscap(), called by setxattr()
> >+	 */
> >+	if (strcmp(name, XATTR_NAME_CAPS) == 0)
> >  		return 0;
> >-	}
> >
> >-	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >-		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >-	    !capable(CAP_SYS_ADMIN))
> >+	if (!capable(CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  	return 0;
> >  }
> >@@ -686,15 +910,22 @@ int cap_inode_setxattr(struct dentry *dentry, const char *name,
> >   */
> >  int cap_inode_removexattr(struct dentry *dentry, const char *name)
> >  {
> >-	if (!strcmp(name, XATTR_NAME_CAPS)) {
> >-		if (!capable(CAP_SETFCAP))
> >+	/* Ignore non-security xattrs */
> >+	if (strncmp(name, XATTR_SECURITY_PREFIX,
> >+			sizeof(XATTR_SECURITY_PREFIX) - 1) != 0)
> >+		return 0;
> >+
> >+	if (strcmp(name, XATTR_NAME_CAPS) == 0) {
> >+		/* security.capability gets namespaced */
> >+		struct inode *inode = d_backing_inode(dentry);
> >+		if (!inode)
> >+			return -EINVAL;
> >+		if (!capable_wrt_inode_uidgid(inode, CAP_SETFCAP))
> >  			return -EPERM;
> >  		return 0;
> >  	}
> >
> >-	if (!strncmp(name, XATTR_SECURITY_PREFIX,
> >-		     sizeof(XATTR_SECURITY_PREFIX) - 1) &&
> >-	    !capable(CAP_SYS_ADMIN))
> >+	if (!capable(CAP_SYS_ADMIN))
> >  		return -EPERM;
> >  	return 0;
> >  }
> >@@ -1082,6 +1313,7 @@ struct security_hook_list capability_hooks[] = {
> >  	LSM_HOOK_INIT(bprm_secureexec, cap_bprm_secureexec),
> >  	LSM_HOOK_INIT(inode_need_killpriv, cap_inode_need_killpriv),
> >  	LSM_HOOK_INIT(inode_killpriv, cap_inode_killpriv),
> >+	LSM_HOOK_INIT(inode_getsecurity, cap_inode_getsecurity),
> >  	LSM_HOOK_INIT(mmap_addr, cap_mmap_addr),
> >  	LSM_HOOK_INIT(mmap_file, cap_mmap_file),
> >  	LSM_HOOK_INIT(task_fix_setuid, cap_task_fix_setuid),
>
Serge E. Hallyn June 13, 2017, 11:50 p.m.
Quoting Serge E. Hallyn (serge@hallyn.com):
> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> > On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> > >Root in a non-initial user ns cannot be trusted to write a traditional
> > >security.capability xattr.  If it were allowed to do so, then any
> > >unprivileged user on the host could map his own uid to root in a private
> > >namespace, write the xattr, and execute the file with privilege on the
> > >host.
> > >
> > >However supporting file capabilities in a user namespace is very
> > >desirable.  Not doing so means that any programs designed to run with
> > >limited privilege must continue to support other methods of gaining and
> > >dropping privilege.  For instance a program installer must detect
> > >whether file capabilities can be assigned, and assign them if so but set
> > >setuid-root otherwise.  The program in turn must know how to drop
> > >partial capabilities, and do so only if setuid-root.
> > 
> > Hi Serge,
> > 
> > 
> >   I have been looking at patch below primarily to learn how we could
> > apply a similar technique to security.ima and security.evm for a
> > namespaced IMA. From the paragraphs above I thought that you solved
> > the problem of a shared filesystem where one now can write different
> > security.capability xattrs by effectively supporting for example
> > security.capability[uid=1000] and security.capability[uid=2000]
> 
> Interesting idea.  Worth considering.
> 
> > written into the filesystem. Each would then become visible as
> > security.capability if the userns mapping is set appropriately.
> > However, this doesn't seem to be how it is implemented. There seems
> 
> Indeed, when I was considering supporting multiple simulatenous
> xattrs, I did it as something like:
> 
> struct vfs_ns_cap_data {
> 	struct {
> 		__le32 permitted;
> 		__le32 inheritable;
> 	} data[VFS_CAP_U32];
> 	__le32 rootid;
> };
> 
> struct vfs_ns_cap {
> 	__le32 magic_etc;
> 	__le32 n_entries;
> 	struct ns_cap_data data[0];
> }; // followed by n_entries of struct ns_cap_data
> 
> You're instead suggesting encoding the rootuid in the name,
> which is interesting.
> 
> > to be only a single such entry with uid appended to it and, if it
> > was a shared filesystem, the first one to set this attribute blocks
> > everyone else from writing the xattr. Is that how it works? Would
> 
> Approximately - indeed there is only a single xattr.  But it can be
> overwritten, so long as the writer has CAP_SETFCAP over the user_ns
> which mounted the filesystem.

Hang on.  I've mis-spoken.  That's the requirement for writing a
v2 xattr.  To write a v3 xattr you only need to be privileged
(with CAP_SETFCAP) against the inode.
Serge E. Hallyn June 13, 2017, 11:55 p.m.
Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> On 06/13/2017 01:18 PM, Serge E. Hallyn wrote:
> >Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> >>On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
> >>>Root in a non-initial user ns cannot be trusted to write a traditional
> >>>security.capability xattr.  If it were allowed to do so, then any
> >>>unprivileged user on the host could map his own uid to root in a private
> >>>namespace, write the xattr, and execute the file with privilege on the
> >>>host.
> >>>
> >>>However supporting file capabilities in a user namespace is very
> >>>desirable.  Not doing so means that any programs designed to run with
> >>>limited privilege must continue to support other methods of gaining and
> >>>dropping privilege.  For instance a program installer must detect
> >>>whether file capabilities can be assigned, and assign them if so but set
> >>>setuid-root otherwise.  The program in turn must know how to drop
> >>>partial capabilities, and do so only if setuid-root.
> >>Hi Serge,
> >>
> >>
> >>   I have been looking at patch below primarily to learn how we could
> >>apply a similar technique to security.ima and security.evm for a
> >>namespaced IMA. From the paragraphs above I thought that you solved
> >>the problem of a shared filesystem where one now can write different
> >>security.capability xattrs by effectively supporting for example
> >>security.capability[uid=1000] and security.capability[uid=2000]
> >>written into the filesystem. Each would then become visible as
> >>security.capability if the userns mapping is set appropriately.
> >>However, this doesn't seem to be how it is implemented. There seems
> >>to be only a single such entry with uid appended to it and, if it
> >>was a shared filesystem, the first one to set this attribute blocks
> >>everyone else from writing the xattr. Is that how it works? Would
> >Yes, that's how this works here.  I'd considered allowing multiple
> >entries, but I didn't feel that was needed for this case.  In a previous
> >implementation (which is probably in the lkml archives somewhere) I
> >supported variable length xattr so that multiple containers could
> >each write a value tagged with their own userns.rootid.  Instead,
> >in the final version, if root in any parent container writes an
> >xattr, it will take effect in child user namespaces.  Which is
> >sensible - the parent presumbly laid out the filesystem to create
> >the child container.
> >
> >>that work differently with an overlay filesystem ? I think a similar
> >Certainly an overlay filesystem should be an easy case as the container
> >can have its own copy of the inode with its own xattr.  Btrfs/zfs
> >would be nicer as the whole file wouldn't need to be copied.
> >
> >>model could also work for IMA, but maybe you have some thoughts. The
> >>only thing I would be concerned about is blocking the parent
> >>container's root user from setting an xattr.
> >So if you have container c1 creating child container c2 on host h1,
> >then if c1 creates an xattr, can c2 not use that?  And if h1 writes it,
> >can c1 and c2 use it?
> 
> In the case of IMA appraisal the extended attribute security.ima
> would be a signature. For c1 and c2 to use that file they would all
> have to have the same key on their (isolated IMA namespace )
> keyring. I think this type of setup could be arranged.

Ok.  If it's too much of a restriction then certainly we can make
it more flexible.  I don't think we want to support too many versions
of magic in this code, so if there's a chance we'll want to make it
more flexible later, then perhaps we should discuss the other options
in more detail now.

> Following your attack description in the introduction I would say
> that we would want to prevent malicious modification of a
> security.ima extended attribute:
> 
> "Root in a non-initial user ns cannot be trusted to write a traditional security.ima xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the signature in the security.ima xattr, and prevent the file from being accessible on the host."

Of course.

The way this is handled with nsfscaps is not by just forbidding the
write, but by only respecting the xattr if the rootid which was
written in the xattr (which is translated and enforced by the kernel
at write time) is root in the caller's user_ns or a parent thereof.

I think that would suffice for ima as well?

> >If they can't, then I guess for IMA multiple xattrs would need to be
> >supported.
> 
> I am not sure about that. I suppose any extended attribute
> modifications would have to be designed for the case where a shared
> filesystem is used that also shares the extended attributes, not
> assuming an overlay filesystem that automatically isolates the
> extend attributes. With the shared filesystem I'd like to prevent
> any type of setting of extended attributes by a child container or
> more generally anyone mounting it as a '2nd consumer', which would
> make it a shared filesystem. Only the process that mounts a
> filesystem as the '1st consumer' would be able to set the extended
> attributes.

Right, again that's currently the case in the nscaps patch.

>  I am assuming that using an overlay fs would always make
> you the '1st consumer' -- I would hope that these conditions could
> be detected. And probably the process should also write along its
> host uid as part of writing out the xattr.

I think that's what the rootid in the nscaps xattr is.

>  If all extended
> attributes were to support this model, maybe the 'uid' could be
> associated with the 'name' of the xattr rather than its 'value' (not
> sure whether that's possible).

Right, I missed that in your original email when I saw it this morning.
It's not what my patch does, but it's an interesting idea.  Do you have
a patch to that effect?  We might even be able to generalize that to
namespace any security.* xattrs.  Wouldn't be automatically enabled
for anything but ima and capabilities, but we could make the infrastructure
generic and re-usable.
Stefan Berger June 14, 2017, 12:27 p.m.
On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>> On 06/13/2017 01:18 PM, Serge E. Hallyn wrote:
>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>> On 05/08/2017 02:11 PM, Serge E. Hallyn wrote:
>>>>> Root in a non-initial user ns cannot be trusted to write a traditional
>>>>> security.capability xattr.  If it were allowed to do so, then any
>>>>> unprivileged user on the host could map his own uid to root in a private
>>>>> namespace, write the xattr, and execute the file with privilege on the
>>>>> host.
>>>>>
>>>>> However supporting file capabilities in a user namespace is very
>>>>> desirable.  Not doing so means that any programs designed to run with
>>>>> limited privilege must continue to support other methods of gaining and
>>>>> dropping privilege.  For instance a program installer must detect
>>>>> whether file capabilities can be assigned, and assign them if so but set
>>>>> setuid-root otherwise.  The program in turn must know how to drop
>>>>> partial capabilities, and do so only if setuid-root.
>>>> Hi Serge,
>>>>
>>>>
>>>>    I have been looking at patch below primarily to learn how we could
>>>> apply a similar technique to security.ima and security.evm for a
>>>> namespaced IMA. From the paragraphs above I thought that you solved
>>>> the problem of a shared filesystem where one now can write different
>>>> security.capability xattrs by effectively supporting for example
>>>> security.capability[uid=1000] and security.capability[uid=2000]
>>>> written into the filesystem. Each would then become visible as
>>>> security.capability if the userns mapping is set appropriately.
>>>> However, this doesn't seem to be how it is implemented. There seems
>>>> to be only a single such entry with uid appended to it and, if it
>>>> was a shared filesystem, the first one to set this attribute blocks
>>>> everyone else from writing the xattr. Is that how it works? Would
>>> Yes, that's how this works here.  I'd considered allowing multiple
>>> entries, but I didn't feel that was needed for this case.  In a previous
>>> implementation (which is probably in the lkml archives somewhere) I
>>> supported variable length xattr so that multiple containers could
>>> each write a value tagged with their own userns.rootid.  Instead,
>>> in the final version, if root in any parent container writes an
>>> xattr, it will take effect in child user namespaces.  Which is
>>> sensible - the parent presumbly laid out the filesystem to create
>>> the child container.
>>>
>>>> that work differently with an overlay filesystem ? I think a similar
>>> Certainly an overlay filesystem should be an easy case as the container
>>> can have its own copy of the inode with its own xattr.  Btrfs/zfs
>>> would be nicer as the whole file wouldn't need to be copied.
>>>
>>>> model could also work for IMA, but maybe you have some thoughts. The
>>>> only thing I would be concerned about is blocking the parent
>>>> container's root user from setting an xattr.
>>> So if you have container c1 creating child container c2 on host h1,
>>> then if c1 creates an xattr, can c2 not use that?  And if h1 writes it,
>>> can c1 and c2 use it?
>> In the case of IMA appraisal the extended attribute security.ima
>> would be a signature. For c1 and c2 to use that file they would all
>> have to have the same key on their (isolated IMA namespace )
>> keyring. I think this type of setup could be arranged.
> Ok.  If it's too much of a restriction then certainly we can make
> it more flexible.  I don't think we want to support too many versions
> of magic in this code, so if there's a chance we'll want to make it
> more flexible later, then perhaps we should discuss the other options
> in more detail now.
>
>> Following your attack description in the introduction I would say
>> that we would want to prevent malicious modification of a
>> security.ima extended attribute:
>>
>> "Root in a non-initial user ns cannot be trusted to write a traditional security.ima xattr. If it were allowed to do so, then any unprivileged user on the host could map his own uid to root in a private namespace, write the signature in the security.ima xattr, and prevent the file from being accessible on the host."
> Of course.
>
> The way this is handled with nsfscaps is not by just forbidding the
> write, but by only respecting the xattr if the rootid which was
> written in the xattr (which is translated and enforced by the kernel
> at write time) is root in the caller's user_ns or a parent thereof.
>
> I think that would suffice for ima as well?
>
>>> If they can't, then I guess for IMA multiple xattrs would need to be
>>> supported.
>> I am not sure about that. I suppose any extended attribute
>> modifications would have to be designed for the case where a shared
>> filesystem is used that also shares the extended attributes, not
>> assuming an overlay filesystem that automatically isolates the
>> extend attributes. With the shared filesystem I'd like to prevent
>> any type of setting of extended attributes by a child container or
>> more generally anyone mounting it as a '2nd consumer', which would
>> make it a shared filesystem. Only the process that mounts a
>> filesystem as the '1st consumer' would be able to set the extended
>> attributes.
> Right, again that's currently the case in the nscaps patch.
>
>>   I am assuming that using an overlay fs would always make
>> you the '1st consumer' -- I would hope that these conditions could
>> be detected. And probably the process should also write along its
>> host uid as part of writing out the xattr.
> I think that's what the rootid in the nscaps xattr is.
>
>>   If all extended
>> attributes were to support this model, maybe the 'uid' could be
>> associated with the 'name' of the xattr rather than its 'value' (not
>> sure whether that's possible).
> Right, I missed that in your original email when I saw it this morning.
> It's not what my patch does, but it's an interesting idea.  Do you have
> a patch to that effect?  We might even be able to generalize that to

No, I don't have a patch. It may not be possible to implement it. The 
xattr_handler's  take the name of the xattr as input to get(). So one 
could try to encode the mapped uid in the name. However, that could lead 
to problems with stale xattrs in a shared filesystem over time unless 
one could limit the number of xattrs with the same prefix, e.g., 
security.capability*. So I doubt that it would work. Otherwise it would 
be good if the value was wrapped in a data structure use by all xattrs, 
but that doesn't seem to be the case, either. So I guess we have to go 
into each type of value structure and add a uid field there.

> namespace any security.* xattrs.  Wouldn't be automatically enabled
> for anything but ima and capabilities, but we could make the infrastructure
> generic and re-usable.
>
Serge E. Hallyn June 15, 2017, 3:05 a.m.
On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
> >Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> >>  If all extended
> >>attributes were to support this model, maybe the 'uid' could be
> >>associated with the 'name' of the xattr rather than its 'value' (not
> >>sure whether that's possible).
> >Right, I missed that in your original email when I saw it this morning.
> >It's not what my patch does, but it's an interesting idea.  Do you have
> >a patch to that effect?  We might even be able to generalize that to
> 
> No, I don't have a patch. It may not be possible to implement it.
> The xattr_handler's  take the name of the xattr as input to get().

That may be ok though.  Assume the host created a container with
100000 as the uid for root, which created a container with 130000 as
uid for root.  If root in the nested container tries to read the
xattr, the kernel can check for security.foo[130000] first, then
security.foo[100000], then security.foo.  Or, it can do a listxattr
and look for those.  Am I overlooking one?

> So one could try to encode the mapped uid in the name. However, that

I thought that's exactly what you were suggesting in your original
email?  "security.capability[uid=2000]"

> could lead to problems with stale xattrs in a shared filesystem over
> time unless one could limit the number of xattrs with the same
> prefix, e.g., security.capability*. So I doubt that it would work.

Hm.  Yeah.  But really how many setups are there like that?  I.e. if
you launch a regular docker or lxd container, the image doesn't do a
bind mount of a shared image, it layers something above it or does a
copy.  What setups do you know of where multiple containers in different
user namespaces mount the same filesystem shared and writeable?

> Otherwise it would be good if the value was wrapped in a data
> structure use by all xattrs, but that doesn't seem to be the case,
> either. So I guess we have to go into each type of value structure
> and add a uid field there.
> 
> >namespace any security.* xattrs.  Wouldn't be automatically enabled
> >for anything but ima and capabilities, but we could make the infrastructure
> >generic and re-usable.
> >
Christian Brauner June 16, 2017, 9:02 a.m.
> "Serge E. Hallyn" <serge@hallyn.com> hat am 15. Juni 2017 um 05:05 geschrieben:
> 
> 
> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
> > On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
> > >Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> > >>  If all extended
> > >>attributes were to support this model, maybe the 'uid' could be
> > >>associated with the 'name' of the xattr rather than its 'value' (not
> > >>sure whether that's possible).
> > >Right, I missed that in your original email when I saw it this morning.
> > >It's not what my patch does, but it's an interesting idea.  Do you have
> > >a patch to that effect?  We might even be able to generalize that to
> > 
> > No, I don't have a patch. It may not be possible to implement it.
> > The xattr_handler's  take the name of the xattr as input to get().
> 
> That may be ok though.  Assume the host created a container with
> 100000 as the uid for root, which created a container with 130000 as
> uid for root.  If root in the nested container tries to read the
> xattr, the kernel can check for security.foo[130000] first, then
> security.foo[100000], then security.foo.  Or, it can do a listxattr
> and look for those.  Am I overlooking one?
> 
> > So one could try to encode the mapped uid in the name. However, that
> 
> I thought that's exactly what you were suggesting in your original
> email?  "security.capability[uid=2000]"
> 
> > could lead to problems with stale xattrs in a shared filesystem over
> > time unless one could limit the number of xattrs with the same
> > prefix, e.g., security.capability*. So I doubt that it would work.
> 
> Hm.  Yeah.  But really how many setups are there like that?  I.e. if
> you launch a regular docker or lxd container, the image doesn't do a
> bind mount of a shared image, it layers something above it or does a
> copy.  What setups do you know of where multiple containers in different
> user namespaces mount the same filesystem shared and writeable?

Iiuc, this should also be something that will be addressed by a viable shifts
solution so it's probably not worth worrying about it here. I was going to
point out that there are plans of sharing filesystems between containers
in different user namespaces. However, without shifts this really will only
work nicely if the container's setup identical id mappings in which case you
won't have to worry about stale xattrs.

> 
> > Otherwise it would be good if the value was wrapped in a data
> > structure use by all xattrs, but that doesn't seem to be the case,
> > either. So I guess we have to go into each type of value structure
> > and add a uid field there.
> > 
> > >namespace any security.* xattrs.  Wouldn't be automatically enabled
> > >for anything but ima and capabilities, but we could make the infrastructure
> > >generic and re-usable.
> > >
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Stefan Berger June 16, 2017, 10:24 p.m.
On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>   If all extended
>>>> attributes were to support this model, maybe the 'uid' could be
>>>> associated with the 'name' of the xattr rather than its 'value' (not
>>>> sure whether that's possible).
>>> Right, I missed that in your original email when I saw it this morning.
>>> It's not what my patch does, but it's an interesting idea.  Do you have
>>> a patch to that effect?  We might even be able to generalize that to
>> No, I don't have a patch. It may not be possible to implement it.
>> The xattr_handler's  take the name of the xattr as input to get().
> That may be ok though.  Assume the host created a container with
> 100000 as the uid for root, which created a container with 130000 as
> uid for root.  If root in the nested container tries to read the
> xattr, the kernel can check for security.foo[130000] first, then
> security.foo[100000], then security.foo.  Or, it can do a listxattr
> and look for those.  Am I overlooking one?

So that sounds like a child would 'inherit' the value of an xattr from 
the closest parent if it doesn't have one itself. I guess it would 
depend on the xattr whether that should apply? And removing an xattr 
becomes difficult then if the parent container's xattr always shines 
through...

>
>> So one could try to encode the mapped uid in the name. However, that
> I thought that's exactly what you were suggesting in your original
> email?  "security.capability[uid=2000]"
>
>> could lead to problems with stale xattrs in a shared filesystem over
>> time unless one could limit the number of xattrs with the same
>> prefix, e.g., security.capability*. So I doubt that it would work.
> Hm.  Yeah.  But really how many setups are there like that?  I.e. if
> you launch a regular docker or lxd container, the image doesn't do a
> bind mount of a shared image, it layers something above it or does a
> copy.  What setups do you know of where multiple containers in different
> user namespaces mount the same filesystem shared and writeable?

So you think it's a good idea? I am not sure when I would get to it, 
though...

    Stefan


>
>> Otherwise it would be good if the value was wrapped in a data
>> structure use by all xattrs, but that doesn't seem to be the case,
>> either. So I guess we have to go into each type of value structure
>> and add a uid field there.
>>
>>> namespace any security.* xattrs.  Wouldn't be automatically enabled
>>> for anything but ima and capabilities, but we could make the infrastructure
>>> generic and re-usable.
>>>
Stefan Berger June 17, 2017, 8:56 p.m.
On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>   If all extended
>>>> attributes were to support this model, maybe the 'uid' could be
>>>> associated with the 'name' of the xattr rather than its 'value' (not
>>>> sure whether that's possible).
>>> Right, I missed that in your original email when I saw it this morning.
>>> It's not what my patch does, but it's an interesting idea.  Do you have
>>> a patch to that effect?  We might even be able to generalize that to
>> No, I don't have a patch. It may not be possible to implement it.
>> The xattr_handler's  take the name of the xattr as input to get().
> That may be ok though.  Assume the host created a container with
> 100000 as the uid for root, which created a container with 130000 as
> uid for root.  If root in the nested container tries to read the
> xattr, the kernel can check for security.foo[130000] first, then
> security.foo[100000], then security.foo.  Or, it can do a listxattr
> and look for those.  Am I overlooking one?
>
>> So one could try to encode the mapped uid in the name. However, that
> I thought that's exactly what you were suggesting in your original
> email?  "security.capability[uid=2000]"
>
>> could lead to problems with stale xattrs in a shared filesystem over
>> time unless one could limit the number of xattrs with the same
>> prefix, e.g., security.capability*. So I doubt that it would work.
> Hm.  Yeah.  But really how many setups are there like that?  I.e. if
> you launch a regular docker or lxd container, the image doesn't do a
> bind mount of a shared image, it layers something above it or does a
> copy.  What setups do you know of where multiple containers in different
> user namespaces mount the same filesystem shared and writeable?

I think I have something now that accomodates userns access to 
security.capability:

https://github.com/stefanberger/linux/commits/xattr_for_userns

Encoding of uid is in the attribute name now as follows: 
security.foo@uid=<uid>

1) The 'plain' security.capability is only r/w accessible from the host 
(init_user_ns).
2) When userns reads/writes 'security.capability' it will read/write 
security.capability@uid=<uid> instead, with uid being the uid of root , 
e.g. 1000.
3) When listing xattrs for userns the host's security.capability is 
filtered out to avoid read failures iof 'security.capability' if 
security.capability@uid=<uid> is read but not there. (see 1) and 2))
4) security.capability* may all be read from anywhere
5) security.capability@uid=<uid> may be read or written directly from a 
userns if <uid> matches the uid of root (current_uid())
6) security.capability@* are 'reserved' and may be read but not written 
to unless 5) applies.


Similat, from the text of one of the functions in the code:

+ * In a user namespace we prevent read/write accesses to the _host's_
+ * security.foo to protect these extended attributes.
+ *
+ * Reading: Reading security.foo from a user namespace will read
+ * security.foo@uid=<uid> instead. Reading security.foo@uid=<uid> directly
+ * also works. In general, all security.foo*, except for security.foo 
of the
+ * host, can be read from a user namespace.
+ *
+ * Writing: Writing security.foo from a user namespace will write
+ * security.foo@uid=<uid> instead. Writing security.foo@uid=<uid> directly
+ * also work.s No other security.foo* attributes, including the 
security.foo
+ * offthe host, can be written to. All security.foo@* are 'reserved'.
+ *
+ * Removing: The same rules for writing apply to removing of extended
+ * attributes from a user namespace.



    Stefan
Serge E. Hallyn June 18, 2017, 10:14 p.m.
Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
> >On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
> >>On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
> >>>Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> >>>>  If all extended
> >>>>attributes were to support this model, maybe the 'uid' could be
> >>>>associated with the 'name' of the xattr rather than its 'value' (not
> >>>>sure whether that's possible).
> >>>Right, I missed that in your original email when I saw it this morning.
> >>>It's not what my patch does, but it's an interesting idea.  Do you have
> >>>a patch to that effect?  We might even be able to generalize that to
> >>No, I don't have a patch. It may not be possible to implement it.
> >>The xattr_handler's  take the name of the xattr as input to get().
> >That may be ok though.  Assume the host created a container with
> >100000 as the uid for root, which created a container with 130000 as
> >uid for root.  If root in the nested container tries to read the
> >xattr, the kernel can check for security.foo[130000] first, then
> >security.foo[100000], then security.foo.  Or, it can do a listxattr
> >and look for those.  Am I overlooking one?
> >
> >>So one could try to encode the mapped uid in the name. However, that
> >I thought that's exactly what you were suggesting in your original
> >email?  "security.capability[uid=2000]"
> >
> >>could lead to problems with stale xattrs in a shared filesystem over
> >>time unless one could limit the number of xattrs with the same
> >>prefix, e.g., security.capability*. So I doubt that it would work.
> >Hm.  Yeah.  But really how many setups are there like that?  I.e. if
> >you launch a regular docker or lxd container, the image doesn't do a
> >bind mount of a shared image, it layers something above it or does a
> >copy.  What setups do you know of where multiple containers in different
> >user namespaces mount the same filesystem shared and writeable?
> 
> I think I have something now that accomodates userns access to
> security.capability:
> 
> https://github.com/stefanberger/linux/commits/xattr_for_userns

Thanks!

> Encoding of uid is in the attribute name now as follows:
> security.foo@uid=<uid>
> 
> 1) The 'plain' security.capability is only r/w accessible from the
> host (init_user_ns).
> 2) When userns reads/writes 'security.capability' it will read/write
> security.capability@uid=<uid> instead, with uid being the uid of
> root , e.g. 1000.
> 3) When listing xattrs for userns the host's security.capability is
> filtered out to avoid read failures iof 'security.capability' if
> security.capability@uid=<uid> is read but not there. (see 1) and 2))
> 4) security.capability* may all be read from anywhere
> 5) security.capability@uid=<uid> may be read or written directly
> from a userns if <uid> matches the uid of root (current_uid())

This looks very close to what we want.  One exception - we do want
to support root in a user namespace being able to write
security.capability@uid=<x> where <x> is a valid uid mapped in its
namespace.  In that case the name should be rewritten to be
security.capability@uid=<y> where y is the unmapped kuid.val.

Eric,

so far my patch hasn't yet hit Linus' tree.  Given that, would you
mind taking a look and seeing what you think of this approach?  If
we may decide to go this route, we probably should stop my patch
from hitting Linus' tree before we have to continue supporting it.

Stefan,

so do you think the general framework could be re-used by IMA?  If
we can move the capability-specific code in fs/xattr.c into
an LSM hook in a way that IMA can also use, then this is a definite
win.

-serge
Stefan Berger June 19, 2017, 1:13 a.m.
On 06/18/2017 06:14 PM, Serge E. Hallyn wrote:
> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
>>> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>>>> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>>>   If all extended
>>>>>> attributes were to support this model, maybe the 'uid' could be
>>>>>> associated with the 'name' of the xattr rather than its 'value' (not
>>>>>> sure whether that's possible).
>>>>> Right, I missed that in your original email when I saw it this morning.
>>>>> It's not what my patch does, but it's an interesting idea.  Do you have
>>>>> a patch to that effect?  We might even be able to generalize that to
>>>> No, I don't have a patch. It may not be possible to implement it.
>>>> The xattr_handler's  take the name of the xattr as input to get().
>>> That may be ok though.  Assume the host created a container with
>>> 100000 as the uid for root, which created a container with 130000 as
>>> uid for root.  If root in the nested container tries to read the
>>> xattr, the kernel can check for security.foo[130000] first, then
>>> security.foo[100000], then security.foo.  Or, it can do a listxattr
>>> and look for those.  Am I overlooking one?
>>>
>>>> So one could try to encode the mapped uid in the name. However, that
>>> I thought that's exactly what you were suggesting in your original
>>> email?  "security.capability[uid=2000]"
>>>
>>>> could lead to problems with stale xattrs in a shared filesystem over
>>>> time unless one could limit the number of xattrs with the same
>>>> prefix, e.g., security.capability*. So I doubt that it would work.
>>> Hm.  Yeah.  But really how many setups are there like that?  I.e. if
>>> you launch a regular docker or lxd container, the image doesn't do a
>>> bind mount of a shared image, it layers something above it or does a
>>> copy.  What setups do you know of where multiple containers in different
>>> user namespaces mount the same filesystem shared and writeable?
>> I think I have something now that accomodates userns access to
>> security.capability:
>>
>> https://github.com/stefanberger/linux/commits/xattr_for_userns
> Thanks!
>
>> Encoding of uid is in the attribute name now as follows:
>> security.foo@uid=<uid>
>>
>> 1) The 'plain' security.capability is only r/w accessible from the
>> host (init_user_ns).
>> 2) When userns reads/writes 'security.capability' it will read/write
>> security.capability@uid=<uid> instead, with uid being the uid of
>> root , e.g. 1000.
>> 3) When listing xattrs for userns the host's security.capability is
>> filtered out to avoid read failures iof 'security.capability' if
>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
>> 4) security.capability* may all be read from anywhere
>> 5) security.capability@uid=<uid> may be read or written directly
>> from a userns if <uid> matches the uid of root (current_uid())
> This looks very close to what we want.  One exception - we do want
> to support root in a user namespace being able to write
> security.capability@uid=<x> where <x> is a valid uid mapped in its
> namespace.  In that case the name should be rewritten to be
> security.capability@uid=<y> where y is the unmapped kuid.val.

I'll try to write a patch on top of the existing one.

Can  you adapt your test cases. I haven't tried them, but having them 
would be important.

>
> Eric,
>
> so far my patch hasn't yet hit Linus' tree.  Given that, would you
> mind taking a look and seeing what you think of this approach?  If
> we may decide to go this route, we probably should stop my patch
> from hitting Linus' tree before we have to continue supporting it.
>
> Stefan,
>
> so do you think the general framework could be re-used by IMA?  If
> we can move the capability-specific code in fs/xattr.c into
> an LSM hook in a way that IMA can also use, then this is a definite
> win.

I am fairly sure that this would be easily possible and some of the if 
statements with string comparisons would likely only have to be extended 
with another comparison.

Regards,
    Stefan

>
> -serge
>
Stefan Berger June 19, 2017, 1:05 p.m.
On 06/18/2017 09:13 PM, Stefan Berger wrote:
> On 06/18/2017 06:14 PM, Serge E. Hallyn wrote:
>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
>>>> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>>>>> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>>>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>>>>   If all extended
>>>>>>> attributes were to support this model, maybe the 'uid' could be
>>>>>>> associated with the 'name' of the xattr rather than its 'value' 
>>>>>>> (not
>>>>>>> sure whether that's possible).
>>>>>> Right, I missed that in your original email when I saw it this 
>>>>>> morning.
>>>>>> It's not what my patch does, but it's an interesting idea.  Do 
>>>>>> you have
>>>>>> a patch to that effect?  We might even be able to generalize that to
>>>>> No, I don't have a patch. It may not be possible to implement it.
>>>>> The xattr_handler's  take the name of the xattr as input to get().
>>>> That may be ok though.  Assume the host created a container with
>>>> 100000 as the uid for root, which created a container with 130000 as
>>>> uid for root.  If root in the nested container tries to read the
>>>> xattr, the kernel can check for security.foo[130000] first, then
>>>> security.foo[100000], then security.foo.  Or, it can do a listxattr
>>>> and look for those.  Am I overlooking one?
>>>>
>>>>> So one could try to encode the mapped uid in the name. However, that
>>>> I thought that's exactly what you were suggesting in your original
>>>> email?  "security.capability[uid=2000]"
>>>>
>>>>> could lead to problems with stale xattrs in a shared filesystem over
>>>>> time unless one could limit the number of xattrs with the same
>>>>> prefix, e.g., security.capability*. So I doubt that it would work.
>>>> Hm.  Yeah.  But really how many setups are there like that? I.e. if
>>>> you launch a regular docker or lxd container, the image doesn't do a
>>>> bind mount of a shared image, it layers something above it or does a
>>>> copy.  What setups do you know of where multiple containers in 
>>>> different
>>>> user namespaces mount the same filesystem shared and writeable?
>>> I think I have something now that accomodates userns access to
>>> security.capability:
>>>
>>> https://github.com/stefanberger/linux/commits/xattr_for_userns
>> Thanks!
>>
>>> Encoding of uid is in the attribute name now as follows:
>>> security.foo@uid=<uid>
>>>
>>> 1) The 'plain' security.capability is only r/w accessible from the
>>> host (init_user_ns).
>>> 2) When userns reads/writes 'security.capability' it will read/write
>>> security.capability@uid=<uid> instead, with uid being the uid of
>>> root , e.g. 1000.
>>> 3) When listing xattrs for userns the host's security.capability is
>>> filtered out to avoid read failures iof 'security.capability' if
>>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
>>> 4) security.capability* may all be read from anywhere
>>> 5) security.capability@uid=<uid> may be read or written directly
>>> from a userns if <uid> matches the uid of root (current_uid())
>> This looks very close to what we want.  One exception - we do want
>> to support root in a user namespace being able to write
>> security.capability@uid=<x> where <x> is a valid uid mapped in its
>> namespace.  In that case the name should be rewritten to be
>> security.capability@uid=<y> where y is the unmapped kuid.val.
>
> I'll try to write a patch on top of the existing one.

Did that now in a 2nd patch (that also fixes a few problems of the 1st). 
In a user ns mapped to 1000 root can write security.capability@uid=123, 
which then ends up writing to security.capability@uid=1123. The reading 
also works with @uid=123. When listing xattrs only those get shown that 
actually have valid mappings.

https://github.com/stefanberger/linux/commits/xattr_for_userns

    Stefan
Eric W. Biederman June 19, 2017, 9:34 p.m.
"Serge E. Hallyn" <serge@hallyn.com> writes:

> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
>> >On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>> >>On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>> >>>Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>> >>>>  If all extended
>> >>>>attributes were to support this model, maybe the 'uid' could be
>> >>>>associated with the 'name' of the xattr rather than its 'value' (not
>> >>>>sure whether that's possible).
>> >>>Right, I missed that in your original email when I saw it this morning.
>> >>>It's not what my patch does, but it's an interesting idea.  Do you have
>> >>>a patch to that effect?  We might even be able to generalize that to
>> >>No, I don't have a patch. It may not be possible to implement it.
>> >>The xattr_handler's  take the name of the xattr as input to get().
>> >That may be ok though.  Assume the host created a container with
>> >100000 as the uid for root, which created a container with 130000 as
>> >uid for root.  If root in the nested container tries to read the
>> >xattr, the kernel can check for security.foo[130000] first, then
>> >security.foo[100000], then security.foo.  Or, it can do a listxattr
>> >and look for those.  Am I overlooking one?
>> >
>> >>So one could try to encode the mapped uid in the name. However, that
>> >I thought that's exactly what you were suggesting in your original
>> >email?  "security.capability[uid=2000]"
>> >
>> >>could lead to problems with stale xattrs in a shared filesystem over
>> >>time unless one could limit the number of xattrs with the same
>> >>prefix, e.g., security.capability*. So I doubt that it would work.
>> >Hm.  Yeah.  But really how many setups are there like that?  I.e. if
>> >you launch a regular docker or lxd container, the image doesn't do a
>> >bind mount of a shared image, it layers something above it or does a
>> >copy.  What setups do you know of where multiple containers in different
>> >user namespaces mount the same filesystem shared and writeable?
>> 
>> I think I have something now that accomodates userns access to
>> security.capability:
>> 
>> https://github.com/stefanberger/linux/commits/xattr_for_userns
>
> Thanks!
>
>> Encoding of uid is in the attribute name now as follows:
>> security.foo@uid=<uid>
>> 
>> 1) The 'plain' security.capability is only r/w accessible from the
>> host (init_user_ns).
>> 2) When userns reads/writes 'security.capability' it will read/write
>> security.capability@uid=<uid> instead, with uid being the uid of
>> root , e.g. 1000.
>> 3) When listing xattrs for userns the host's security.capability is
>> filtered out to avoid read failures iof 'security.capability' if
>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
>> 4) security.capability* may all be read from anywhere
>> 5) security.capability@uid=<uid> may be read or written directly
>> from a userns if <uid> matches the uid of root (current_uid())
>
> This looks very close to what we want.  One exception - we do want
> to support root in a user namespace being able to write
> security.capability@uid=<x> where <x> is a valid uid mapped in its
> namespace.  In that case the name should be rewritten to be
> security.capability@uid=<y> where y is the unmapped kuid.val.
>
> Eric,
>
> so far my patch hasn't yet hit Linus' tree.  Given that, would you
> mind taking a look and seeing what you think of this approach?  If
> we may decide to go this route, we probably should stop my patch
> from hitting Linus' tree before we have to continue supporting it.

Agreed.  I will take a look.  I also want to see how all of this works
in the context of stackable filesystems.  As that is the one case that
looked like it could be a problem case in your current patchset.

Eric
Amir Goldstein June 20, 2017, 5:42 a.m.
On Tue, Jun 20, 2017 at 12:34 AM, Eric W. Biederman
<ebiederm@xmission.com> wrote:
> "Serge E. Hallyn" <serge@hallyn.com> writes:
>
>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
>>> >On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>>> >>On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>> >>>Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>> >>>>  If all extended
>>> >>>>attributes were to support this model, maybe the 'uid' could be
>>> >>>>associated with the 'name' of the xattr rather than its 'value' (not
>>> >>>>sure whether that's possible).
>>> >>>Right, I missed that in your original email when I saw it this morning.
>>> >>>It's not what my patch does, but it's an interesting idea.  Do you have
>>> >>>a patch to that effect?  We might even be able to generalize that to
>>> >>No, I don't have a patch. It may not be possible to implement it.
>>> >>The xattr_handler's  take the name of the xattr as input to get().
>>> >That may be ok though.  Assume the host created a container with
>>> >100000 as the uid for root, which created a container with 130000 as
>>> >uid for root.  If root in the nested container tries to read the
>>> >xattr, the kernel can check for security.foo[130000] first, then
>>> >security.foo[100000], then security.foo.  Or, it can do a listxattr
>>> >and look for those.  Am I overlooking one?
>>> >
>>> >>So one could try to encode the mapped uid in the name. However, that
>>> >I thought that's exactly what you were suggesting in your original
>>> >email?  "security.capability[uid=2000]"
>>> >
>>> >>could lead to problems with stale xattrs in a shared filesystem over
>>> >>time unless one could limit the number of xattrs with the same
>>> >>prefix, e.g., security.capability*. So I doubt that it would work.
>>> >Hm.  Yeah.  But really how many setups are there like that?  I.e. if
>>> >you launch a regular docker or lxd container, the image doesn't do a
>>> >bind mount of a shared image, it layers something above it or does a
>>> >copy.  What setups do you know of where multiple containers in different
>>> >user namespaces mount the same filesystem shared and writeable?
>>>
>>> I think I have something now that accomodates userns access to
>>> security.capability:
>>>
>>> https://github.com/stefanberger/linux/commits/xattr_for_userns
>>
>> Thanks!
>>
>>> Encoding of uid is in the attribute name now as follows:
>>> security.foo@uid=<uid>
>>>
>>> 1) The 'plain' security.capability is only r/w accessible from the
>>> host (init_user_ns).
>>> 2) When userns reads/writes 'security.capability' it will read/write
>>> security.capability@uid=<uid> instead, with uid being the uid of
>>> root , e.g. 1000.
>>> 3) When listing xattrs for userns the host's security.capability is
>>> filtered out to avoid read failures iof 'security.capability' if
>>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
>>> 4) security.capability* may all be read from anywhere
>>> 5) security.capability@uid=<uid> may be read or written directly
>>> from a userns if <uid> matches the uid of root (current_uid())
>>
>> This looks very close to what we want.  One exception - we do want
>> to support root in a user namespace being able to write
>> security.capability@uid=<x> where <x> is a valid uid mapped in its
>> namespace.  In that case the name should be rewritten to be
>> security.capability@uid=<y> where y is the unmapped kuid.val.
>>
>> Eric,
>>
>> so far my patch hasn't yet hit Linus' tree.  Given that, would you
>> mind taking a look and seeing what you think of this approach?  If
>> we may decide to go this route, we probably should stop my patch
>> from hitting Linus' tree before we have to continue supporting it.
>
> Agreed.  I will take a look.  I also want to see how all of this works
> in the context of stackable filesystems.  As that is the one case that
> looked like it could be a problem case in your current patchset.
>

Apropos stackable filesystems [cc some overlayfs folks], is there any
way that parts of this work could be generalized towards ns aware
trusted@uid.* xattr?

With overlayfs, files are written to underlying fs with mounter's
credentials. How this affects v3 security capabilities and how exactly
security xattrs are handled in overtlayfs I'm not sure. Vivek?

But, if we had an infrastructure to store trusted@<rootid> xattr, then
unprivileged overlayfs mount would become a very reachable goal.
Much closer goal then loop mounting...

Amir.
Serge E. Hallyn June 20, 2017, 6:23 a.m.
On Sun, Jun 18, 2017 at 09:13:28PM -0400, Stefan Berger wrote:
> Can  you adapt your test cases. I haven't tried them, but having
> them would be important.

branch nsfscaps of github.com/hallyn/ltp now has a patch on top
which makes it work with your capabilities.  Tests are passing.

thanks,
-serge
Stefan Berger June 20, 2017, 12:19 p.m.
On 06/20/2017 01:42 AM, Amir Goldstein wrote:
> On Tue, Jun 20, 2017 at 12:34 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>
>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
>>>>> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>>>>>> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>>>>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>>>>>   If all extended
>>>>>>>> attributes were to support this model, maybe the 'uid' could be
>>>>>>>> associated with the 'name' of the xattr rather than its 'value' (not
>>>>>>>> sure whether that's possible).
>>>>>>> Right, I missed that in your original email when I saw it this morning.
>>>>>>> It's not what my patch does, but it's an interesting idea.  Do you have
>>>>>>> a patch to that effect?  We might even be able to generalize that to
>>>>>> No, I don't have a patch. It may not be possible to implement it.
>>>>>> The xattr_handler's  take the name of the xattr as input to get().
>>>>> That may be ok though.  Assume the host created a container with
>>>>> 100000 as the uid for root, which created a container with 130000 as
>>>>> uid for root.  If root in the nested container tries to read the
>>>>> xattr, the kernel can check for security.foo[130000] first, then
>>>>> security.foo[100000], then security.foo.  Or, it can do a listxattr
>>>>> and look for those.  Am I overlooking one?
>>>>>
>>>>>> So one could try to encode the mapped uid in the name. However, that
>>>>> I thought that's exactly what you were suggesting in your original
>>>>> email?  "security.capability[uid=2000]"
>>>>>
>>>>>> could lead to problems with stale xattrs in a shared filesystem over
>>>>>> time unless one could limit the number of xattrs with the same
>>>>>> prefix, e.g., security.capability*. So I doubt that it would work.
>>>>> Hm.  Yeah.  But really how many setups are there like that?  I.e. if
>>>>> you launch a regular docker or lxd container, the image doesn't do a
>>>>> bind mount of a shared image, it layers something above it or does a
>>>>> copy.  What setups do you know of where multiple containers in different
>>>>> user namespaces mount the same filesystem shared and writeable?
>>>> I think I have something now that accomodates userns access to
>>>> security.capability:
>>>>
>>>> https://github.com/stefanberger/linux/commits/xattr_for_userns
>>> Thanks!
>>>
>>>> Encoding of uid is in the attribute name now as follows:
>>>> security.foo@uid=<uid>
>>>>
>>>> 1) The 'plain' security.capability is only r/w accessible from the
>>>> host (init_user_ns).
>>>> 2) When userns reads/writes 'security.capability' it will read/write
>>>> security.capability@uid=<uid> instead, with uid being the uid of
>>>> root , e.g. 1000.
>>>> 3) When listing xattrs for userns the host's security.capability is
>>>> filtered out to avoid read failures iof 'security.capability' if
>>>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
>>>> 4) security.capability* may all be read from anywhere
>>>> 5) security.capability@uid=<uid> may be read or written directly
>>>> from a userns if <uid> matches the uid of root (current_uid())
>>> This looks very close to what we want.  One exception - we do want
>>> to support root in a user namespace being able to write
>>> security.capability@uid=<x> where <x> is a valid uid mapped in its
>>> namespace.  In that case the name should be rewritten to be
>>> security.capability@uid=<y> where y is the unmapped kuid.val.
>>>
>>> Eric,
>>>
>>> so far my patch hasn't yet hit Linus' tree.  Given that, would you
>>> mind taking a look and seeing what you think of this approach?  If
>>> we may decide to go this route, we probably should stop my patch
>>> from hitting Linus' tree before we have to continue supporting it.
>> Agreed.  I will take a look.  I also want to see how all of this works
>> in the context of stackable filesystems.  As that is the one case that
>> looked like it could be a problem case in your current patchset.
>>
> Apropos stackable filesystems [cc some overlayfs folks], is there any
> way that parts of this work could be generalized towards ns aware
> trusted@uid.* xattr?

I am at least removing all string comparison with xattr names from the 
core code and move the enabled xattr names into a list. For the 
security.* extended attribute names we would enumerated the enabled ones 
in that list, only security.capability for now. I am not sure how the 
trusted.* space works.

     Stefan

>
> With overlayfs, files are written to underlying fs with mounter's
> credentials. How this affects v3 security capabilities and how exactly
> security xattrs are handled in overtlayfs I'm not sure. Vivek?
>
> But, if we had an infrastructure to store trusted@<rootid> xattr, then
> unprivileged overlayfs mount would become a very reachable goal.
> Much closer goal then loop mounting...
>
> Amir.
>
Stefan Berger June 20, 2017, 5:33 p.m.
On 06/20/2017 08:19 AM, Stefan Berger wrote:
> On 06/20/2017 01:42 AM, Amir Goldstein wrote:
>> On Tue, Jun 20, 2017 at 12:34 AM, Eric W. Biederman
>> <ebiederm@xmission.com> wrote:
>>> "Serge E. Hallyn" <serge@hallyn.com> writes:
>>>
>>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
>>>>>> On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
>>>>>>> On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
>>>>>>>> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
>>>>>>>>>   If all extended
>>>>>>>>> attributes were to support this model, maybe the 'uid' could be
>>>>>>>>> associated with the 'name' of the xattr rather than its 
>>>>>>>>> 'value' (not
>>>>>>>>> sure whether that's possible).
>>>>>>>> Right, I missed that in your original email when I saw it this 
>>>>>>>> morning.
>>>>>>>> It's not what my patch does, but it's an interesting idea.  Do 
>>>>>>>> you have
>>>>>>>> a patch to that effect?  We might even be able to generalize 
>>>>>>>> that to
>>>>>>> No, I don't have a patch. It may not be possible to implement it.
>>>>>>> The xattr_handler's  take the name of the xattr as input to get().
>>>>>> That may be ok though.  Assume the host created a container with
>>>>>> 100000 as the uid for root, which created a container with 130000 as
>>>>>> uid for root.  If root in the nested container tries to read the
>>>>>> xattr, the kernel can check for security.foo[130000] first, then
>>>>>> security.foo[100000], then security.foo.  Or, it can do a listxattr
>>>>>> and look for those.  Am I overlooking one?
>>>>>>
>>>>>>> So one could try to encode the mapped uid in the name. However, 
>>>>>>> that
>>>>>> I thought that's exactly what you were suggesting in your original
>>>>>> email?  "security.capability[uid=2000]"
>>>>>>
>>>>>>> could lead to problems with stale xattrs in a shared filesystem 
>>>>>>> over
>>>>>>> time unless one could limit the number of xattrs with the same
>>>>>>> prefix, e.g., security.capability*. So I doubt that it would work.
>>>>>> Hm.  Yeah.  But really how many setups are there like that?  I.e. if
>>>>>> you launch a regular docker or lxd container, the image doesn't do a
>>>>>> bind mount of a shared image, it layers something above it or does a
>>>>>> copy.  What setups do you know of where multiple containers in 
>>>>>> different
>>>>>> user namespaces mount the same filesystem shared and writeable?
>>>>> I think I have something now that accomodates userns access to
>>>>> security.capability:
>>>>>
>>>>> https://github.com/stefanberger/linux/commits/xattr_for_userns
>>>> Thanks!
>>>>
>>>>> Encoding of uid is in the attribute name now as follows:
>>>>> security.foo@uid=<uid>
>>>>>
>>>>> 1) The 'plain' security.capability is only r/w accessible from the
>>>>> host (init_user_ns).
>>>>> 2) When userns reads/writes 'security.capability' it will read/write
>>>>> security.capability@uid=<uid> instead, with uid being the uid of
>>>>> root , e.g. 1000.
>>>>> 3) When listing xattrs for userns the host's security.capability is
>>>>> filtered out to avoid read failures iof 'security.capability' if
>>>>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
>>>>> 4) security.capability* may all be read from anywhere
>>>>> 5) security.capability@uid=<uid> may be read or written directly
>>>>> from a userns if <uid> matches the uid of root (current_uid())
>>>> This looks very close to what we want.  One exception - we do want
>>>> to support root in a user namespace being able to write
>>>> security.capability@uid=<x> where <x> is a valid uid mapped in its
>>>> namespace.  In that case the name should be rewritten to be
>>>> security.capability@uid=<y> where y is the unmapped kuid.val.
>>>>
>>>> Eric,
>>>>
>>>> so far my patch hasn't yet hit Linus' tree.  Given that, would you
>>>> mind taking a look and seeing what you think of this approach?  If
>>>> we may decide to go this route, we probably should stop my patch
>>>> from hitting Linus' tree before we have to continue supporting it.
>>> Agreed.  I will take a look.  I also want to see how all of this works
>>> in the context of stackable filesystems.  As that is the one case that
>>> looked like it could be a problem case in your current patchset.
>>>
>> Apropos stackable filesystems [cc some overlayfs folks], is there any
>> way that parts of this work could be generalized towards ns aware
>> trusted@uid.* xattr?
>
> I am at least removing all string comparison with xattr names from the 
> core code and move the enabled xattr names into a list. For the 
> security.* extended attribute names we would enumerated the enabled 
> ones in that list, only security.capability for now. I am not sure how 
> the trusted.* space works.

I extended 'the infrastructure' now to support prefix matching for 
trusted.* and probably others as well. It's fairly easy to do that but 
would not write the code like that for exact string matching to support 
security.capability. The patch lets me write trusted.foo@uid=100 from 
within the userns if uid=100 exists, rejects it otherwise. It may be 
written out as trusted.foo@uid=1100 for root mapping to uid 1000. I can 
list this entry on the host. For some reason trusted.* is not listed at 
all inside the userns. So something else needs to be enabled as well. 
For now it looks like this:


https://github.com/stefanberger/linux/commit/8ae131e731c9e1def92a2100697632ea35e007d0

Regards,
     Stefan
Amir Goldstein June 20, 2017, 7:56 p.m.
On Tue, Jun 20, 2017 at 8:33 PM, Stefan Berger
<stefanb@linux.vnet.ibm.com> wrote:
> On 06/20/2017 08:19 AM, Stefan Berger wrote:
>>
>> On 06/20/2017 01:42 AM, Amir Goldstein wrote:

>>>>
>>> Apropos stackable filesystems [cc some overlayfs folks], is there any
>>> way that parts of this work could be generalized towards ns aware
>>> trusted@uid.* xattr?
>>
>>
>> I am at least removing all string comparison with xattr names from the
>> core code and move the enabled xattr names into a list. For the security.*
>> extended attribute names we would enumerated the enabled ones in that list,
>> only security.capability for now. I am not sure how the trusted.* space
>> works.
>
>
> I extended 'the infrastructure' now to support prefix matching for trusted.*
> and probably others as well. It's fairly easy to do that but would not write
> the code like that for exact string matching to support security.capability.
> The patch lets me write trusted.foo@uid=100 from within the userns if
> uid=100 exists, rejects it otherwise. It may be written out as
> trusted.foo@uid=1100 for root mapping to uid 1000. I can list this entry on
> the host. For some reason trusted.* is not listed at all inside the userns.
> So something else needs to be enabled as well. For now it looks like this:
>
>
> https://github.com/stefanberger/linux/commit/8ae131e731c9e1def92a2100697632ea35e007d0
>

That looks useful!
I hope someone who knows his way around trusted xattr can say what's missing.

Thanks,
Amir.
Vivek Goyal June 20, 2017, 7:57 p.m.
On Tue, Jun 20, 2017 at 08:42:45AM +0300, Amir Goldstein wrote:
> On Tue, Jun 20, 2017 at 12:34 AM, Eric W. Biederman
> <ebiederm@xmission.com> wrote:
> > "Serge E. Hallyn" <serge@hallyn.com> writes:
> >
> >> Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> >>> On 06/14/2017 11:05 PM, Serge E. Hallyn wrote:
> >>> >On Wed, Jun 14, 2017 at 08:27:40AM -0400, Stefan Berger wrote:
> >>> >>On 06/13/2017 07:55 PM, Serge E. Hallyn wrote:
> >>> >>>Quoting Stefan Berger (stefanb@linux.vnet.ibm.com):
> >>> >>>>  If all extended
> >>> >>>>attributes were to support this model, maybe the 'uid' could be
> >>> >>>>associated with the 'name' of the xattr rather than its 'value' (not
> >>> >>>>sure whether that's possible).
> >>> >>>Right, I missed that in your original email when I saw it this morning.
> >>> >>>It's not what my patch does, but it's an interesting idea.  Do you have
> >>> >>>a patch to that effect?  We might even be able to generalize that to
> >>> >>No, I don't have a patch. It may not be possible to implement it.
> >>> >>The xattr_handler's  take the name of the xattr as input to get().
> >>> >That may be ok though.  Assume the host created a container with
> >>> >100000 as the uid for root, which created a container with 130000 as
> >>> >uid for root.  If root in the nested container tries to read the
> >>> >xattr, the kernel can check for security.foo[130000] first, then
> >>> >security.foo[100000], then security.foo.  Or, it can do a listxattr
> >>> >and look for those.  Am I overlooking one?
> >>> >
> >>> >>So one could try to encode the mapped uid in the name. However, that
> >>> >I thought that's exactly what you were suggesting in your original
> >>> >email?  "security.capability[uid=2000]"
> >>> >
> >>> >>could lead to problems with stale xattrs in a shared filesystem over
> >>> >>time unless one could limit the number of xattrs with the same
> >>> >>prefix, e.g., security.capability*. So I doubt that it would work.
> >>> >Hm.  Yeah.  But really how many setups are there like that?  I.e. if
> >>> >you launch a regular docker or lxd container, the image doesn't do a
> >>> >bind mount of a shared image, it layers something above it or does a
> >>> >copy.  What setups do you know of where multiple containers in different
> >>> >user namespaces mount the same filesystem shared and writeable?
> >>>
> >>> I think I have something now that accomodates userns access to
> >>> security.capability:
> >>>
> >>> https://github.com/stefanberger/linux/commits/xattr_for_userns
> >>
> >> Thanks!
> >>
> >>> Encoding of uid is in the attribute name now as follows:
> >>> security.foo@uid=<uid>
> >>>
> >>> 1) The 'plain' security.capability is only r/w accessible from the
> >>> host (init_user_ns).
> >>> 2) When userns reads/writes 'security.capability' it will read/write
> >>> security.capability@uid=<uid> instead, with uid being the uid of
> >>> root , e.g. 1000.
> >>> 3) When listing xattrs for userns the host's security.capability is
> >>> filtered out to avoid read failures iof 'security.capability' if
> >>> security.capability@uid=<uid> is read but not there. (see 1) and 2))
> >>> 4) security.capability* may all be read from anywhere
> >>> 5) security.capability@uid=<uid> may be read or written directly
> >>> from a userns if <uid> matches the uid of root (current_uid())
> >>
> >> This looks very close to what we want.  One exception - we do want
> >> to support root in a user namespace being able to write
> >> security.capability@uid=<x> where <x> is a valid uid mapped in its
> >> namespace.  In that case the name should be rewritten to be
> >> security.capability@uid=<y> where y is the unmapped kuid.val.
> >>
> >> Eric,
> >>
> >> so far my patch hasn't yet hit Linus' tree.  Given that, would you
> >> mind taking a look and seeing what you think of this approach?  If
> >> we may decide to go this route, we probably should stop my patch
> >> from hitting Linus' tree before we have to continue supporting it.
> >
> > Agreed.  I will take a look.  I also want to see how all of this works
> > in the context of stackable filesystems.  As that is the one case that
> > looked like it could be a problem case in your current patchset.
> >
> 
> Apropos stackable filesystems [cc some overlayfs folks], is there any
> way that parts of this work could be generalized towards ns aware
> trusted@uid.* xattr?
> 
> With overlayfs, files are written to underlying fs with mounter's
> credentials.

We do switch to mounter's credential for privileged operations but
for a newly created file final selinux label is created as if task
created that file.

>How this affects v3 security capabilities and how exactly
> security xattrs are handled in overtlayfs I'm not sure. Vivek?

Given we switch to mounter's creds for operations on underlying
filesystem (setxattr, getxattr), I thought that it probably
will call xattr_userns_name() in nested manner. Once using tasks's
creds and second time using mounter's creds. So that probably
should have made it security.foo@uid@uid. I tried patches quickly
but setcap/getcap inside containers work. So may be it is
due to the fact that mounting was done from init_user_ns and
following line of code will avoid adding @uid in that case.

+       /* no name changes for init_user_ns or uid == 0 */
+       if (current_user_ns() == &init_user_ns || uid.val == 0)
+               goto out_copy;
+

I have not looked deeper. Still curious how getxattr() path works
when we switch to mounter's creds. In that case underlying file
system will get the impression that mounter task is trying to
do getxattr() in security.capability set by container task. I
am assuming we allow that?

I need to spend more time understanding this.

Vivek