[RFC] rlimit: Account nproc per-usernamespace/per-user

Submitted by Nikolay Borisov on Oct. 26, 2016, 12:40 p.m.

Details

Message ID 1477485627-16177-1-git-send-email-kernel@kyup.com
State New
Series "rlimit: Account nproc per-usernamespace/per-user"
Headers show

Commit Message

Nikolay Borisov Oct. 26, 2016, 12:40 p.m.
There are container setups which map the same kuids to
different containers. In such situation what will happen is
that same uid's in different containers will map to the same
underlying user on the matchine (e.g. same struct user). One
implication of this is that the number of processes for that
particular user are going to be shared among all the same uids
in the container. This is problematic, as it means a user in
containerA can potentially exhaust the process limit such that
a user in containerB cannot spawn any processes.

Fix this by utilising the newly introduced UCOUNT infrastructure,
so that process counts are now being accounted per-userns/per-user.
In case when a machine is running in non-container setup then
the accounting semantics is virtually the same, given that there is
going to be one ucount-per-user struct.

Signed-off-by: Nikolay Borisov <kernel@kyup.com>
---
Hello Eric, 

Now that ucount has been merged I took the liberty of 
experimenting what the nproc-per-userns approach would 
look like and here is the result. In my previous [0] you 
expressed concerns regarding not having hierarchical limits, 
essentially allowing users to circumvent their limits. So
this version doesn't do anything to rectify this issue. 
However, I do think the idea of the rlimit is not to 
forbid the user of spawning more processes, since currently
the limit is being set on a per-process basis, so a malicious 
process can in fact increase that limits and spawn a large number 
of processes. Given this I believe the current patch doesn't make
the situation any worse in that regard. 

[0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html

 fs/exec.c                      |  2 +-
 include/linux/cred.h           |  1 +
 include/linux/sched.h          |  1 -
 include/linux/user_namespace.h |  8 ++++++++
 kernel/cred.c                  | 18 +++++++++++-------
 kernel/exit.c                  |  2 +-
 kernel/fork.c                  |  7 +++++--
 kernel/sys.c                   |  2 +-
 kernel/ucount.c                | 16 ++++++++++++++++
 kernel/user.c                  |  1 -
 10 files changed, 44 insertions(+), 14 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/exec.c b/fs/exec.c
index 6fcfb3f7b137..8126e00a8d3e 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -1651,7 +1651,7 @@  static int do_execveat_common(int fd, struct filename *filename,
 	 * whether NPROC limit is still exceeded.
 	 */
 	if ((current->flags & PF_NPROC_EXCEEDED) &&
-	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
+	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
 		retval = -EAGAIN;
 		goto out_ret;
 	}
diff --git a/include/linux/cred.h b/include/linux/cred.h
index f0e70a1bb3ac..c3bb4d20ff16 100644
--- a/include/linux/cred.h
+++ b/include/linux/cred.h
@@ -141,6 +141,7 @@  struct cred {
 	void		*security;	/* subjective LSM security */
 #endif
 	struct user_struct *user;	/* real user ID subscription */
+	struct ucounts *ucounts;		/* container for per-userns/per-counts */
 	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
 	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
 	struct rcu_head	rcu;		/* RCU deletion hook */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b0ec92..3709d70fe2de 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -842,7 +842,6 @@  static inline int signal_group_exit(const struct signal_struct *sig)
  */
 struct user_struct {
 	atomic_t __count;	/* reference count */
-	atomic_t processes;	/* How many processes does this user have? */
 	atomic_t sigpending;	/* How many pending signals does this user have? */
 #ifdef CONFIG_INOTIFY_USER
 	atomic_t inotify_watches; /* How many inotify watches does this user have? */
diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
index eb209d4523f5..d4fc31ae9d85 100644
--- a/include/linux/user_namespace.h
+++ b/include/linux/user_namespace.h
@@ -35,6 +35,11 @@  enum ucount_type {
 	UCOUNT_COUNTS,
 };
 
+enum ulimit_type {
+	ULIMIT_NPROC,
+	ULIMIT_COUNTS,
+};
+
 struct user_namespace {
 	struct uid_gid_map	uid_map;
 	struct uid_gid_map	gid_map;
@@ -67,6 +72,7 @@  struct ucounts {
 	kuid_t uid;
 	atomic_t count;
 	atomic_t ucount[UCOUNT_COUNTS];
+	atomic_t ulimit[ULIMIT_COUNTS];
 };
 
 extern struct user_namespace init_user_ns;
@@ -75,6 +81,8 @@  bool setup_userns_sysctls(struct user_namespace *ns);
 void retire_userns_sysctls(struct user_namespace *ns);
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
 void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
+struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
+void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
 
 #ifdef CONFIG_USER_NS
 
diff --git a/kernel/cred.c b/kernel/cred.c
index 5f264fb5737d..dd9ffc0396bb 100644
--- a/kernel/cred.c
+++ b/kernel/cred.c
@@ -330,13 +330,16 @@  int copy_creds(struct task_struct *p, unsigned long clone_flags)
 #endif
 		clone_flags & CLONE_THREAD
 	    ) {
-		p->real_cred = get_cred(p->cred);
+		struct cred *c;
+		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
+						ULIMIT_NPROC);
+		p->real_cred = c = get_cred(p->cred);
 		get_cred(p->cred);
 		alter_cred_subscribers(p->cred, 2);
 		kdebug("share_creds(%p{%d,%d})",
 		       p->cred, atomic_read(&p->cred->usage),
 		       read_cred_subscribers(p->cred));
-		atomic_inc(&p->cred->user->processes);
+		c->ucounts = uc;
 		return 0;
 	}
 
@@ -369,7 +372,8 @@  int copy_creds(struct task_struct *p, unsigned long clone_flags)
 	}
 #endif
 
-	atomic_inc(&new->user->processes);
+	new->ucounts = inc_ulimit(new->user_ns, new->uid,
+				      ULIMIT_NPROC);
 	p->cred = p->real_cred = get_cred(new);
 	alter_cred_subscribers(new, 2);
 	validate_creds(new);
@@ -461,12 +465,12 @@  int commit_creds(struct cred *new)
 	 * in set_user().
 	 */
 	alter_cred_subscribers(new, 2);
-	if (new->user != old->user)
-		atomic_inc(&new->user->processes);
+	if (new->user != old->user || old->user_ns != new->user_ns)
+		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
 	rcu_assign_pointer(task->real_cred, new);
 	rcu_assign_pointer(task->cred, new);
-	if (new->user != old->user)
-		atomic_dec(&old->user->processes);
+	if (new->user != old->user || old->user_ns != new->user_ns)
+		dec_ulimit(old->ucounts, ULIMIT_NPROC);
 	alter_cred_subscribers(old, -2);
 
 	/* send notifications */
diff --git a/kernel/exit.c b/kernel/exit.c
index 9d68c45ebbe3..7ac8954230bd 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -173,7 +173,7 @@  void release_task(struct task_struct *p)
 	/* don't need to get the RCU readlock here - the process is dead and
 	 * can't be modifying its own credentials. But shut RCU-lockdep up */
 	rcu_read_lock();
-	atomic_dec(&__task_cred(p)->user->processes);
+	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
 	rcu_read_unlock();
 
 	proc_flush_task(p);
diff --git a/kernel/fork.c b/kernel/fork.c
index 623259fc794d..0b16c9d920b1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -425,6 +425,7 @@  int arch_task_struct_size __read_mostly;
 void __init fork_init(void)
 {
 	int i;
+	struct cred *c = init_task.cred;
 #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
 #ifndef ARCH_MIN_TASKALIGN
 #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
@@ -448,6 +449,8 @@  void __init fork_init(void)
 	for (i = 0; i < UCOUNT_COUNTS; i++) {
 		init_user_ns.ucount_max[i] = max_threads/2;
 	}
+
+	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
 }
 
 int __weak arch_dup_task_struct(struct task_struct *dst,
@@ -1515,7 +1518,7 @@  static __latent_entropy struct task_struct *copy_process(
 	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
 #endif
 	retval = -EAGAIN;
-	if (atomic_read(&p->real_cred->user->processes) >=
+	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
 			task_rlimit(p, RLIMIT_NPROC)) {
 		if (p->real_cred->user != INIT_USER &&
 		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
@@ -1859,7 +1862,7 @@  static __latent_entropy struct task_struct *copy_process(
 #endif
 	delayacct_tsk_free(p);
 bad_fork_cleanup_count:
-	atomic_dec(&p->cred->user->processes);
+	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
 	exit_creds(p);
 bad_fork_free:
 	put_task_stack(p);
diff --git a/kernel/sys.c b/kernel/sys.c
index 89d5be418157..143d04ed9da5 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -433,7 +433,7 @@  static int set_user(struct cred *new)
 	 * for programs doing set*uid()+execve() by harmlessly deferring the
 	 * failure to the execve() stage.
 	 */
-	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
+	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
 			new_user != INIT_USER)
 		current->flags |= PF_NPROC_EXCEEDED;
 	else
diff --git a/kernel/ucount.c b/kernel/ucount.c
index 9d20d5dd298a..d8f5362f6e17 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -181,6 +181,22 @@  static inline bool atomic_inc_below(atomic_t *v, int u)
 	}
 }
 
+struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
+			   enum ulimit_type type)
+{
+	struct ucounts *ucounts = get_ucounts(ns, uid);
+	atomic_inc(&ucounts->ulimit[type]);
+
+	return ucounts;
+}
+
+void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
+{
+	atomic_dec(&ucounts->ulimit[type]);
+	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
+	put_ucounts(ucounts);
+}
+
 struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
 			   enum ucount_type type)
 {
diff --git a/kernel/user.c b/kernel/user.c
index b069ccbfb0b0..ce1ba1fb96c0 100644
--- a/kernel/user.c
+++ b/kernel/user.c
@@ -90,7 +90,6 @@  static DEFINE_SPINLOCK(uidhash_lock);
 /* root_user.__count is 1, for init task cred */
 struct user_struct root_user = {
 	.__count	= ATOMIC_INIT(1),
-	.processes	= ATOMIC_INIT(1),
 	.sigpending	= ATOMIC_INIT(0),
 	.locked_shm     = 0,
 	.uid		= GLOBAL_ROOT_UID,

Comments

Serge E. Hallyn Oct. 26, 2016, 5:25 p.m.
On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
> There are container setups which map the same kuids to
> different containers. In such situation what will happen is
> that same uid's in different containers will map to the same
> underlying user on the matchine (e.g. same struct user). One
> implication of this is that the number of processes for that
> particular user are going to be shared among all the same uids
> in the container. This is problematic, as it means a user in
> containerA can potentially exhaust the process limit such that
> a user in containerB cannot spawn any processes.

Hi - thanks for the description.  Based on that, though, I worry
that it is a feature we do not want.  Nothing explicitly prohibits
sharing kuids in different containers, but it is is sharing.  If
you want greater isolation between two containers, you must not share
any kuids.

I'm not saying nack, but i am saying it seems a misguided feature
which could lead people to think sharing uids is safer than it is.

> Fix this by utilising the newly introduced UCOUNT infrastructure,
> so that process counts are now being accounted per-userns/per-user.
> In case when a machine is running in non-container setup then
> the accounting semantics is virtually the same, given that there is
> going to be one ucount-per-user struct.
> 
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
> Hello Eric, 
> 
> Now that ucount has been merged I took the liberty of 
> experimenting what the nproc-per-userns approach would 
> look like and here is the result. In my previous [0] you 
> expressed concerns regarding not having hierarchical limits, 
> essentially allowing users to circumvent their limits. So
> this version doesn't do anything to rectify this issue. 
> However, I do think the idea of the rlimit is not to 
> forbid the user of spawning more processes, since currently
> the limit is being set on a per-process basis, so a malicious 
> process can in fact increase that limits and spawn a large number 
> of processes. Given this I believe the current patch doesn't make
> the situation any worse in that regard. 
> 
> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
> 
>  fs/exec.c                      |  2 +-
>  include/linux/cred.h           |  1 +
>  include/linux/sched.h          |  1 -
>  include/linux/user_namespace.h |  8 ++++++++
>  kernel/cred.c                  | 18 +++++++++++-------
>  kernel/exit.c                  |  2 +-
>  kernel/fork.c                  |  7 +++++--
>  kernel/sys.c                   |  2 +-
>  kernel/ucount.c                | 16 ++++++++++++++++
>  kernel/user.c                  |  1 -
>  10 files changed, 44 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/exec.c b/fs/exec.c
> index 6fcfb3f7b137..8126e00a8d3e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	 * whether NPROC limit is still exceeded.
>  	 */
>  	if ((current->flags & PF_NPROC_EXCEEDED) &&
> -	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> +	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
>  		retval = -EAGAIN;
>  		goto out_ret;
>  	}
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index f0e70a1bb3ac..c3bb4d20ff16 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -141,6 +141,7 @@ struct cred {
>  	void		*security;	/* subjective LSM security */
>  #endif
>  	struct user_struct *user;	/* real user ID subscription */
> +	struct ucounts *ucounts;		/* container for per-userns/per-counts */
>  	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
>  	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
>  	struct rcu_head	rcu;		/* RCU deletion hook */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 348f51b0ec92..3709d70fe2de 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
>   */
>  struct user_struct {
>  	atomic_t __count;	/* reference count */
> -	atomic_t processes;	/* How many processes does this user have? */
>  	atomic_t sigpending;	/* How many pending signals does this user have? */
>  #ifdef CONFIG_INOTIFY_USER
>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb209d4523f5..d4fc31ae9d85 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -35,6 +35,11 @@ enum ucount_type {
>  	UCOUNT_COUNTS,
>  };
>  
> +enum ulimit_type {
> +	ULIMIT_NPROC,
> +	ULIMIT_COUNTS,
> +};
> +
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> @@ -67,6 +72,7 @@ struct ucounts {
>  	kuid_t uid;
>  	atomic_t count;
>  	atomic_t ucount[UCOUNT_COUNTS];
> +	atomic_t ulimit[ULIMIT_COUNTS];
>  };
>  
>  extern struct user_namespace init_user_ns;
> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
>  void retire_userns_sysctls(struct user_namespace *ns);
>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
>  
>  #ifdef CONFIG_USER_NS
>  
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 5f264fb5737d..dd9ffc0396bb 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  #endif
>  		clone_flags & CLONE_THREAD
>  	    ) {
> -		p->real_cred = get_cred(p->cred);
> +		struct cred *c;
> +		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
> +						ULIMIT_NPROC);
> +		p->real_cred = c = get_cred(p->cred);
>  		get_cred(p->cred);
>  		alter_cred_subscribers(p->cred, 2);
>  		kdebug("share_creds(%p{%d,%d})",
>  		       p->cred, atomic_read(&p->cred->usage),
>  		       read_cred_subscribers(p->cred));
> -		atomic_inc(&p->cred->user->processes);
> +		c->ucounts = uc;
>  		return 0;
>  	}
>  
> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  	}
>  #endif
>  
> -	atomic_inc(&new->user->processes);
> +	new->ucounts = inc_ulimit(new->user_ns, new->uid,
> +				      ULIMIT_NPROC);
>  	p->cred = p->real_cred = get_cred(new);
>  	alter_cred_subscribers(new, 2);
>  	validate_creds(new);
> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
>  	 * in set_user().
>  	 */
>  	alter_cred_subscribers(new, 2);
> -	if (new->user != old->user)
> -		atomic_inc(&new->user->processes);
> +	if (new->user != old->user || old->user_ns != new->user_ns)
> +		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
>  	rcu_assign_pointer(task->real_cred, new);
>  	rcu_assign_pointer(task->cred, new);
> -	if (new->user != old->user)
> -		atomic_dec(&old->user->processes);
> +	if (new->user != old->user || old->user_ns != new->user_ns)
> +		dec_ulimit(old->ucounts, ULIMIT_NPROC);
>  	alter_cred_subscribers(old, -2);
>  
>  	/* send notifications */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9d68c45ebbe3..7ac8954230bd 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
>  	/* don't need to get the RCU readlock here - the process is dead and
>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
>  	rcu_read_lock();
> -	atomic_dec(&__task_cred(p)->user->processes);
> +	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
>  	rcu_read_unlock();
>  
>  	proc_flush_task(p);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 623259fc794d..0b16c9d920b1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
>  void __init fork_init(void)
>  {
>  	int i;
> +	struct cred *c = init_task.cred;
>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
>  #ifndef ARCH_MIN_TASKALIGN
>  #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
> @@ -448,6 +449,8 @@ void __init fork_init(void)
>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
>  		init_user_ns.ucount_max[i] = max_threads/2;
>  	}
> +
> +	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
>  }
>  
>  int __weak arch_dup_task_struct(struct task_struct *dst,
> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
>  	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
>  #endif
>  	retval = -EAGAIN;
> -	if (atomic_read(&p->real_cred->user->processes) >=
> +	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
>  			task_rlimit(p, RLIMIT_NPROC)) {
>  		if (p->real_cred->user != INIT_USER &&
>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
>  #endif
>  	delayacct_tsk_free(p);
>  bad_fork_cleanup_count:
> -	atomic_dec(&p->cred->user->processes);
> +	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
>  	exit_creds(p);
>  bad_fork_free:
>  	put_task_stack(p);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 89d5be418157..143d04ed9da5 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
>  	 * for programs doing set*uid()+execve() by harmlessly deferring the
>  	 * failure to the execve() stage.
>  	 */
> -	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
> +	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
>  			new_user != INIT_USER)
>  		current->flags |= PF_NPROC_EXCEEDED;
>  	else
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 9d20d5dd298a..d8f5362f6e17 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
>  	}
>  }
>  
> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
> +			   enum ulimit_type type)
> +{
> +	struct ucounts *ucounts = get_ucounts(ns, uid);
> +	atomic_inc(&ucounts->ulimit[type]);
> +
> +	return ucounts;
> +}
> +
> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
> +{
> +	atomic_dec(&ucounts->ulimit[type]);
> +	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
> +	put_ucounts(ucounts);
> +}
> +
>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
>  			   enum ucount_type type)
>  {
> diff --git a/kernel/user.c b/kernel/user.c
> index b069ccbfb0b0..ce1ba1fb96c0 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
>  /* root_user.__count is 1, for init task cred */
>  struct user_struct root_user = {
>  	.__count	= ATOMIC_INIT(1),
> -	.processes	= ATOMIC_INIT(1),
>  	.sigpending	= ATOMIC_INIT(0),
>  	.locked_shm     = 0,
>  	.uid		= GLOBAL_ROOT_UID,
> -- 
> 2.5.0
> 
> _______________________________________________
> Containers mailing list
> Containers@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
Nikolay Borisov Oct. 27, 2016, 7:01 a.m.
On 10/26/2016 08:25 PM, Serge E. Hallyn wrote:
> On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
>> There are container setups which map the same kuids to
>> different containers. In such situation what will happen is
>> that same uid's in different containers will map to the same
>> underlying user on the matchine (e.g. same struct user). One
>> implication of this is that the number of processes for that
>> particular user are going to be shared among all the same uids
>> in the container. This is problematic, as it means a user in
>> containerA can potentially exhaust the process limit such that
>> a user in containerB cannot spawn any processes.
> 
> Hi - thanks for the description.  Based on that, though, I worry
> that it is a feature we do not want.  Nothing explicitly prohibits
> sharing kuids in different containers, but it is is sharing.  If
> you want greater isolation between two containers, you must not share
> any kuids.
> 
> I'm not saying nack, but i am saying it seems a misguided feature
> which could lead people to think sharing uids is safer than it is.

I agree that in order for this to be considered "secure" it relies on
the assumption that there is no leakage between containers. However,
there are currently setups which rely on this behavior for whatever
(mis)guided reasons. Furthermore the current design of namespaces
doesn't do anything to prevent such uses. Given this I don't think it be
fair to completely disregard them, hence the patch.

> 
>> Fix this by utilising the newly introduced UCOUNT infrastructure,
>> so that process counts are now being accounted per-userns/per-user.
>> In case when a machine is running in non-container setup then
>> the accounting semantics is virtually the same, given that there is
>> going to be one ucount-per-user struct.
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>> ---
>> Hello Eric, 
>>
>> Now that ucount has been merged I took the liberty of 
>> experimenting what the nproc-per-userns approach would 
>> look like and here is the result. In my previous [0] you 
>> expressed concerns regarding not having hierarchical limits, 
>> essentially allowing users to circumvent their limits. So
>> this version doesn't do anything to rectify this issue. 
>> However, I do think the idea of the rlimit is not to 
>> forbid the user of spawning more processes, since currently
>> the limit is being set on a per-process basis, so a malicious 
>> process can in fact increase that limits and spawn a large number 
>> of processes. Given this I believe the current patch doesn't make
>> the situation any worse in that regard. 
>>
>> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
>>
>>  fs/exec.c                      |  2 +-
>>  include/linux/cred.h           |  1 +
>>  include/linux/sched.h          |  1 -
>>  include/linux/user_namespace.h |  8 ++++++++
>>  kernel/cred.c                  | 18 +++++++++++-------
>>  kernel/exit.c                  |  2 +-
>>  kernel/fork.c                  |  7 +++++--
>>  kernel/sys.c                   |  2 +-
>>  kernel/ucount.c                | 16 ++++++++++++++++
>>  kernel/user.c                  |  1 -
>>  10 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 6fcfb3f7b137..8126e00a8d3e 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>>  	 * whether NPROC limit is still exceeded.
>>  	 */
>>  	if ((current->flags & PF_NPROC_EXCEEDED) &&
>> -	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
>> +	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
>>  		retval = -EAGAIN;
>>  		goto out_ret;
>>  	}
>> diff --git a/include/linux/cred.h b/include/linux/cred.h
>> index f0e70a1bb3ac..c3bb4d20ff16 100644
>> --- a/include/linux/cred.h
>> +++ b/include/linux/cred.h
>> @@ -141,6 +141,7 @@ struct cred {
>>  	void		*security;	/* subjective LSM security */
>>  #endif
>>  	struct user_struct *user;	/* real user ID subscription */
>> +	struct ucounts *ucounts;		/* container for per-userns/per-counts */
>>  	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
>>  	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
>>  	struct rcu_head	rcu;		/* RCU deletion hook */
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 348f51b0ec92..3709d70fe2de 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
>>   */
>>  struct user_struct {
>>  	atomic_t __count;	/* reference count */
>> -	atomic_t processes;	/* How many processes does this user have? */
>>  	atomic_t sigpending;	/* How many pending signals does this user have? */
>>  #ifdef CONFIG_INOTIFY_USER
>>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index eb209d4523f5..d4fc31ae9d85 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -35,6 +35,11 @@ enum ucount_type {
>>  	UCOUNT_COUNTS,
>>  };
>>  
>> +enum ulimit_type {
>> +	ULIMIT_NPROC,
>> +	ULIMIT_COUNTS,
>> +};
>> +
>>  struct user_namespace {
>>  	struct uid_gid_map	uid_map;
>>  	struct uid_gid_map	gid_map;
>> @@ -67,6 +72,7 @@ struct ucounts {
>>  	kuid_t uid;
>>  	atomic_t count;
>>  	atomic_t ucount[UCOUNT_COUNTS];
>> +	atomic_t ulimit[ULIMIT_COUNTS];
>>  };
>>  
>>  extern struct user_namespace init_user_ns;
>> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
>>  void retire_userns_sysctls(struct user_namespace *ns);
>>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
>>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
>> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
>> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
>>  
>>  #ifdef CONFIG_USER_NS
>>  
>> diff --git a/kernel/cred.c b/kernel/cred.c
>> index 5f264fb5737d..dd9ffc0396bb 100644
>> --- a/kernel/cred.c
>> +++ b/kernel/cred.c
>> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>  #endif
>>  		clone_flags & CLONE_THREAD
>>  	    ) {
>> -		p->real_cred = get_cred(p->cred);
>> +		struct cred *c;
>> +		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
>> +						ULIMIT_NPROC);
>> +		p->real_cred = c = get_cred(p->cred);
>>  		get_cred(p->cred);
>>  		alter_cred_subscribers(p->cred, 2);
>>  		kdebug("share_creds(%p{%d,%d})",
>>  		       p->cred, atomic_read(&p->cred->usage),
>>  		       read_cred_subscribers(p->cred));
>> -		atomic_inc(&p->cred->user->processes);
>> +		c->ucounts = uc;
>>  		return 0;
>>  	}
>>  
>> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>  	}
>>  #endif
>>  
>> -	atomic_inc(&new->user->processes);
>> +	new->ucounts = inc_ulimit(new->user_ns, new->uid,
>> +				      ULIMIT_NPROC);
>>  	p->cred = p->real_cred = get_cred(new);
>>  	alter_cred_subscribers(new, 2);
>>  	validate_creds(new);
>> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
>>  	 * in set_user().
>>  	 */
>>  	alter_cred_subscribers(new, 2);
>> -	if (new->user != old->user)
>> -		atomic_inc(&new->user->processes);
>> +	if (new->user != old->user || old->user_ns != new->user_ns)
>> +		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
>>  	rcu_assign_pointer(task->real_cred, new);
>>  	rcu_assign_pointer(task->cred, new);
>> -	if (new->user != old->user)
>> -		atomic_dec(&old->user->processes);
>> +	if (new->user != old->user || old->user_ns != new->user_ns)
>> +		dec_ulimit(old->ucounts, ULIMIT_NPROC);
>>  	alter_cred_subscribers(old, -2);
>>  
>>  	/* send notifications */
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 9d68c45ebbe3..7ac8954230bd 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
>>  	/* don't need to get the RCU readlock here - the process is dead and
>>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
>>  	rcu_read_lock();
>> -	atomic_dec(&__task_cred(p)->user->processes);
>> +	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
>>  	rcu_read_unlock();
>>  
>>  	proc_flush_task(p);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 623259fc794d..0b16c9d920b1 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
>>  void __init fork_init(void)
>>  {
>>  	int i;
>> +	struct cred *c = init_task.cred;
>>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
>>  #ifndef ARCH_MIN_TASKALIGN
>>  #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
>> @@ -448,6 +449,8 @@ void __init fork_init(void)
>>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
>>  		init_user_ns.ucount_max[i] = max_threads/2;
>>  	}
>> +
>> +	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
>>  }
>>  
>>  int __weak arch_dup_task_struct(struct task_struct *dst,
>> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
>>  	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
>>  #endif
>>  	retval = -EAGAIN;
>> -	if (atomic_read(&p->real_cred->user->processes) >=
>> +	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
>>  			task_rlimit(p, RLIMIT_NPROC)) {
>>  		if (p->real_cred->user != INIT_USER &&
>>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
>>  #endif
>>  	delayacct_tsk_free(p);
>>  bad_fork_cleanup_count:
>> -	atomic_dec(&p->cred->user->processes);
>> +	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
>>  	exit_creds(p);
>>  bad_fork_free:
>>  	put_task_stack(p);
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 89d5be418157..143d04ed9da5 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
>>  	 * for programs doing set*uid()+execve() by harmlessly deferring the
>>  	 * failure to the execve() stage.
>>  	 */
>> -	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
>> +	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
>>  			new_user != INIT_USER)
>>  		current->flags |= PF_NPROC_EXCEEDED;
>>  	else
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 9d20d5dd298a..d8f5362f6e17 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
>>  	}
>>  }
>>  
>> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
>> +			   enum ulimit_type type)
>> +{
>> +	struct ucounts *ucounts = get_ucounts(ns, uid);
>> +	atomic_inc(&ucounts->ulimit[type]);
>> +
>> +	return ucounts;
>> +}
>> +
>> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
>> +{
>> +	atomic_dec(&ucounts->ulimit[type]);
>> +	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
>> +	put_ucounts(ucounts);
>> +}
>> +
>>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
>>  			   enum ucount_type type)
>>  {
>> diff --git a/kernel/user.c b/kernel/user.c
>> index b069ccbfb0b0..ce1ba1fb96c0 100644
>> --- a/kernel/user.c
>> +++ b/kernel/user.c
>> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
>>  /* root_user.__count is 1, for init task cred */
>>  struct user_struct root_user = {
>>  	.__count	= ATOMIC_INIT(1),
>> -	.processes	= ATOMIC_INIT(1),
>>  	.sigpending	= ATOMIC_INIT(0),
>>  	.locked_shm     = 0,
>>  	.uid		= GLOBAL_ROOT_UID,
>> -- 
>> 2.5.0
>>
>> _______________________________________________
>> Containers mailing list
>> Containers@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
Serge E. Hallyn Oct. 27, 2016, 2:37 p.m.
Quoting Nikolay Borisov (kernel@kyup.com):
> 
> 
> On 10/26/2016 08:25 PM, Serge E. Hallyn wrote:
> > On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
> >> There are container setups which map the same kuids to
> >> different containers. In such situation what will happen is
> >> that same uid's in different containers will map to the same
> >> underlying user on the matchine (e.g. same struct user). One
> >> implication of this is that the number of processes for that
> >> particular user are going to be shared among all the same uids
> >> in the container. This is problematic, as it means a user in
> >> containerA can potentially exhaust the process limit such that
> >> a user in containerB cannot spawn any processes.
> > 
> > Hi - thanks for the description.  Based on that, though, I worry
> > that it is a feature we do not want.  Nothing explicitly prohibits
> > sharing kuids in different containers, but it is is sharing.  If
> > you want greater isolation between two containers, you must not share
> > any kuids.
> > 
> > I'm not saying nack, but i am saying it seems a misguided feature
> > which could lead people to think sharing uids is safer than it is.
> 
> I agree that in order for this to be considered "secure" it relies on
> the assumption that there is no leakage between containers. However,
> there are currently setups which rely on this behavior for whatever
> (mis)guided reasons. Furthermore the current design of namespaces
> doesn't do anything to prevent such uses. Given this I don't think it be
> fair to completely disregard them, hence the patch.

I somehow had missed the fact that (if I read below correctly) you
are actually solving the problem for RLIMIT_NPROC?  That's worthwhile
then.  I thought the ucounts checks were independent and RLIMIT_NPROC
failures were still going to mysteriously plague sibling containers.

I do still worry about the performance impact of adding the get_ucounts()
in those hot paths below.  Have you done any perf measurements?

> >> Fix this by utilising the newly introduced UCOUNT infrastructure,
> >> so that process counts are now being accounted per-userns/per-user.
> >> In case when a machine is running in non-container setup then
> >> the accounting semantics is virtually the same, given that there is
> >> going to be one ucount-per-user struct.
> >>
> >> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> >> ---
> >> Hello Eric, 
> >>
> >> Now that ucount has been merged I took the liberty of 
> >> experimenting what the nproc-per-userns approach would 
> >> look like and here is the result. In my previous [0] you 
> >> expressed concerns regarding not having hierarchical limits, 
> >> essentially allowing users to circumvent their limits. So
> >> this version doesn't do anything to rectify this issue. 
> >> However, I do think the idea of the rlimit is not to 
> >> forbid the user of spawning more processes, since currently
> >> the limit is being set on a per-process basis, so a malicious 
> >> process can in fact increase that limits and spawn a large number 
> >> of processes. Given this I believe the current patch doesn't make
> >> the situation any worse in that regard. 
> >>
> >> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
> >>
> >>  fs/exec.c                      |  2 +-
> >>  include/linux/cred.h           |  1 +
> >>  include/linux/sched.h          |  1 -
> >>  include/linux/user_namespace.h |  8 ++++++++
> >>  kernel/cred.c                  | 18 +++++++++++-------
> >>  kernel/exit.c                  |  2 +-
> >>  kernel/fork.c                  |  7 +++++--
> >>  kernel/sys.c                   |  2 +-
> >>  kernel/ucount.c                | 16 ++++++++++++++++
> >>  kernel/user.c                  |  1 -
> >>  10 files changed, 44 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/exec.c b/fs/exec.c
> >> index 6fcfb3f7b137..8126e00a8d3e 100644
> >> --- a/fs/exec.c
> >> +++ b/fs/exec.c
> >> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
> >>  	 * whether NPROC limit is still exceeded.
> >>  	 */
> >>  	if ((current->flags & PF_NPROC_EXCEEDED) &&
> >> -	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> >> +	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
> >>  		retval = -EAGAIN;
> >>  		goto out_ret;
> >>  	}
> >> diff --git a/include/linux/cred.h b/include/linux/cred.h
> >> index f0e70a1bb3ac..c3bb4d20ff16 100644
> >> --- a/include/linux/cred.h
> >> +++ b/include/linux/cred.h
> >> @@ -141,6 +141,7 @@ struct cred {
> >>  	void		*security;	/* subjective LSM security */
> >>  #endif
> >>  	struct user_struct *user;	/* real user ID subscription */
> >> +	struct ucounts *ucounts;		/* container for per-userns/per-counts */
> >>  	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
> >>  	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
> >>  	struct rcu_head	rcu;		/* RCU deletion hook */
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 348f51b0ec92..3709d70fe2de 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
> >>   */
> >>  struct user_struct {
> >>  	atomic_t __count;	/* reference count */
> >> -	atomic_t processes;	/* How many processes does this user have? */
> >>  	atomic_t sigpending;	/* How many pending signals does this user have? */
> >>  #ifdef CONFIG_INOTIFY_USER
> >>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
> >> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> >> index eb209d4523f5..d4fc31ae9d85 100644
> >> --- a/include/linux/user_namespace.h
> >> +++ b/include/linux/user_namespace.h
> >> @@ -35,6 +35,11 @@ enum ucount_type {
> >>  	UCOUNT_COUNTS,
> >>  };
> >>  
> >> +enum ulimit_type {
> >> +	ULIMIT_NPROC,
> >> +	ULIMIT_COUNTS,
> >> +};
> >> +
> >>  struct user_namespace {
> >>  	struct uid_gid_map	uid_map;
> >>  	struct uid_gid_map	gid_map;
> >> @@ -67,6 +72,7 @@ struct ucounts {
> >>  	kuid_t uid;
> >>  	atomic_t count;
> >>  	atomic_t ucount[UCOUNT_COUNTS];
> >> +	atomic_t ulimit[ULIMIT_COUNTS];
> >>  };
> >>  
> >>  extern struct user_namespace init_user_ns;
> >> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
> >>  void retire_userns_sysctls(struct user_namespace *ns);
> >>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
> >>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> >> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
> >> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
> >>  
> >>  #ifdef CONFIG_USER_NS
> >>  
> >> diff --git a/kernel/cred.c b/kernel/cred.c
> >> index 5f264fb5737d..dd9ffc0396bb 100644
> >> --- a/kernel/cred.c
> >> +++ b/kernel/cred.c
> >> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> >>  #endif
> >>  		clone_flags & CLONE_THREAD
> >>  	    ) {
> >> -		p->real_cred = get_cred(p->cred);
> >> +		struct cred *c;
> >> +		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
> >> +						ULIMIT_NPROC);
> >> +		p->real_cred = c = get_cred(p->cred);
> >>  		get_cred(p->cred);
> >>  		alter_cred_subscribers(p->cred, 2);
> >>  		kdebug("share_creds(%p{%d,%d})",
> >>  		       p->cred, atomic_read(&p->cred->usage),
> >>  		       read_cred_subscribers(p->cred));
> >> -		atomic_inc(&p->cred->user->processes);
> >> +		c->ucounts = uc;
> >>  		return 0;
> >>  	}
> >>  
> >> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
> >>  	}
> >>  #endif
> >>  
> >> -	atomic_inc(&new->user->processes);
> >> +	new->ucounts = inc_ulimit(new->user_ns, new->uid,
> >> +				      ULIMIT_NPROC);
> >>  	p->cred = p->real_cred = get_cred(new);
> >>  	alter_cred_subscribers(new, 2);
> >>  	validate_creds(new);
> >> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
> >>  	 * in set_user().
> >>  	 */
> >>  	alter_cred_subscribers(new, 2);
> >> -	if (new->user != old->user)
> >> -		atomic_inc(&new->user->processes);
> >> +	if (new->user != old->user || old->user_ns != new->user_ns)
> >> +		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
> >>  	rcu_assign_pointer(task->real_cred, new);
> >>  	rcu_assign_pointer(task->cred, new);
> >> -	if (new->user != old->user)
> >> -		atomic_dec(&old->user->processes);
> >> +	if (new->user != old->user || old->user_ns != new->user_ns)
> >> +		dec_ulimit(old->ucounts, ULIMIT_NPROC);
> >>  	alter_cred_subscribers(old, -2);
> >>  
> >>  	/* send notifications */
> >> diff --git a/kernel/exit.c b/kernel/exit.c
> >> index 9d68c45ebbe3..7ac8954230bd 100644
> >> --- a/kernel/exit.c
> >> +++ b/kernel/exit.c
> >> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
> >>  	/* don't need to get the RCU readlock here - the process is dead and
> >>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
> >>  	rcu_read_lock();
> >> -	atomic_dec(&__task_cred(p)->user->processes);
> >> +	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
> >>  	rcu_read_unlock();
> >>  
> >>  	proc_flush_task(p);
> >> diff --git a/kernel/fork.c b/kernel/fork.c
> >> index 623259fc794d..0b16c9d920b1 100644
> >> --- a/kernel/fork.c
> >> +++ b/kernel/fork.c
> >> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
> >>  void __init fork_init(void)
> >>  {
> >>  	int i;
> >> +	struct cred *c = init_task.cred;
> >>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
> >>  #ifndef ARCH_MIN_TASKALIGN
> >>  #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
> >> @@ -448,6 +449,8 @@ void __init fork_init(void)
> >>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
> >>  		init_user_ns.ucount_max[i] = max_threads/2;
> >>  	}
> >> +
> >> +	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
> >>  }
> >>  
> >>  int __weak arch_dup_task_struct(struct task_struct *dst,
> >> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
> >>  	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
> >>  #endif
> >>  	retval = -EAGAIN;
> >> -	if (atomic_read(&p->real_cred->user->processes) >=
> >> +	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
> >>  			task_rlimit(p, RLIMIT_NPROC)) {
> >>  		if (p->real_cred->user != INIT_USER &&
> >>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> >> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
> >>  #endif
> >>  	delayacct_tsk_free(p);
> >>  bad_fork_cleanup_count:
> >> -	atomic_dec(&p->cred->user->processes);
> >> +	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
> >>  	exit_creds(p);
> >>  bad_fork_free:
> >>  	put_task_stack(p);
> >> diff --git a/kernel/sys.c b/kernel/sys.c
> >> index 89d5be418157..143d04ed9da5 100644
> >> --- a/kernel/sys.c
> >> +++ b/kernel/sys.c
> >> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
> >>  	 * for programs doing set*uid()+execve() by harmlessly deferring the
> >>  	 * failure to the execve() stage.
> >>  	 */
> >> -	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
> >> +	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
> >>  			new_user != INIT_USER)
> >>  		current->flags |= PF_NPROC_EXCEEDED;
> >>  	else
> >> diff --git a/kernel/ucount.c b/kernel/ucount.c
> >> index 9d20d5dd298a..d8f5362f6e17 100644
> >> --- a/kernel/ucount.c
> >> +++ b/kernel/ucount.c
> >> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
> >>  	}
> >>  }
> >>  
> >> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
> >> +			   enum ulimit_type type)
> >> +{
> >> +	struct ucounts *ucounts = get_ucounts(ns, uid);
> >> +	atomic_inc(&ucounts->ulimit[type]);
> >> +
> >> +	return ucounts;
> >> +}
> >> +
> >> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
> >> +{
> >> +	atomic_dec(&ucounts->ulimit[type]);
> >> +	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
> >> +	put_ucounts(ucounts);
> >> +}
> >> +
> >>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
> >>  			   enum ucount_type type)
> >>  {
> >> diff --git a/kernel/user.c b/kernel/user.c
> >> index b069ccbfb0b0..ce1ba1fb96c0 100644
> >> --- a/kernel/user.c
> >> +++ b/kernel/user.c
> >> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
> >>  /* root_user.__count is 1, for init task cred */
> >>  struct user_struct root_user = {
> >>  	.__count	= ATOMIC_INIT(1),
> >> -	.processes	= ATOMIC_INIT(1),
> >>  	.sigpending	= ATOMIC_INIT(0),
> >>  	.locked_shm     = 0,
> >>  	.uid		= GLOBAL_ROOT_UID,
> >> -- 
> >> 2.5.0
> >>
> >> _______________________________________________
> >> Containers mailing list
> >> Containers@lists.linux-foundation.org
> >> https://lists.linuxfoundation.org/mailman/listinfo/containers
Eric W. Biederman Nov. 1, 2016, 3:01 p.m.
Nikolay Borisov <kernel@kyup.com> writes:

> There are container setups which map the same kuids to
> different containers. In such situation what will happen is
> that same uid's in different containers will map to the same
> underlying user on the matchine (e.g. same struct user). One
> implication of this is that the number of processes for that
> particular user are going to be shared among all the same uids
> in the container. This is problematic, as it means a user in
> containerA can potentially exhaust the process limit such that
> a user in containerB cannot spawn any processes.
>
> Fix this by utilising the newly introduced UCOUNT infrastructure,
> so that process counts are now being accounted per-userns/per-user.
> In case when a machine is running in non-container setup then
> the accounting semantics is virtually the same, given that there is
> going to be one ucount-per-user struct.
>
> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
> ---
> Hello Eric, 
>
> Now that ucount has been merged I took the liberty of 
> experimenting what the nproc-per-userns approach would 
> look like and here is the result. In my previous [0] you 
> expressed concerns regarding not having hierarchical limits, 
> essentially allowing users to circumvent their limits. So
> this version doesn't do anything to rectify this issue. 
> However, I do think the idea of the rlimit is not to 
> forbid the user of spawning more processes, since currently
> the limit is being set on a per-process basis, so a malicious 
> process can in fact increase that limits and spawn a large number 
> of processes. Given this I believe the current patch doesn't make
> the situation any worse in that regard.

I think conceptually this can work.

Reading through the code I don't see anything capturing the current
processes RLIMIT_NPROC when creating a new user namespace.  Which is
critical if the limits are going to be enforced hierarchically.

I have a small concern that we toss the per user accounting entirely as
that means we loose a little in ensuring one uid does not have too many
processes.  But that is comparatively minor.

I am buried with Kernel Summit and Plumbers this week, so I will be slow
on review etc.

But overall I think this a viable approach assuming there are no
performance issues in structuring things this way.

Eric

> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
>
>  fs/exec.c                      |  2 +-
>  include/linux/cred.h           |  1 +
>  include/linux/sched.h          |  1 -
>  include/linux/user_namespace.h |  8 ++++++++
>  kernel/cred.c                  | 18 +++++++++++-------
>  kernel/exit.c                  |  2 +-
>  kernel/fork.c                  |  7 +++++--
>  kernel/sys.c                   |  2 +-
>  kernel/ucount.c                | 16 ++++++++++++++++
>  kernel/user.c                  |  1 -
>  10 files changed, 44 insertions(+), 14 deletions(-)
>
> diff --git a/fs/exec.c b/fs/exec.c
> index 6fcfb3f7b137..8126e00a8d3e 100644
> --- a/fs/exec.c
> +++ b/fs/exec.c
> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>  	 * whether NPROC limit is still exceeded.
>  	 */
>  	if ((current->flags & PF_NPROC_EXCEEDED) &&
> -	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
> +	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
>  		retval = -EAGAIN;
>  		goto out_ret;
>  	}
> diff --git a/include/linux/cred.h b/include/linux/cred.h
> index f0e70a1bb3ac..c3bb4d20ff16 100644
> --- a/include/linux/cred.h
> +++ b/include/linux/cred.h
> @@ -141,6 +141,7 @@ struct cred {
>  	void		*security;	/* subjective LSM security */
>  #endif
>  	struct user_struct *user;	/* real user ID subscription */
> +	struct ucounts *ucounts;		/* container for per-userns/per-counts */
>  	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
>  	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
>  	struct rcu_head	rcu;		/* RCU deletion hook */
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 348f51b0ec92..3709d70fe2de 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
>   */
>  struct user_struct {
>  	atomic_t __count;	/* reference count */
> -	atomic_t processes;	/* How many processes does this user have? */
>  	atomic_t sigpending;	/* How many pending signals does this user have? */
>  #ifdef CONFIG_INOTIFY_USER
>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
> index eb209d4523f5..d4fc31ae9d85 100644
> --- a/include/linux/user_namespace.h
> +++ b/include/linux/user_namespace.h
> @@ -35,6 +35,11 @@ enum ucount_type {
>  	UCOUNT_COUNTS,
>  };
>  
> +enum ulimit_type {
> +	ULIMIT_NPROC,
> +	ULIMIT_COUNTS,
> +};
> +
>  struct user_namespace {
>  	struct uid_gid_map	uid_map;
>  	struct uid_gid_map	gid_map;
> @@ -67,6 +72,7 @@ struct ucounts {
>  	kuid_t uid;
>  	atomic_t count;
>  	atomic_t ucount[UCOUNT_COUNTS];
> +	atomic_t ulimit[ULIMIT_COUNTS];
>  };
>  
>  extern struct user_namespace init_user_ns;
> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
>  void retire_userns_sysctls(struct user_namespace *ns);
>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
>  
>  #ifdef CONFIG_USER_NS
>  
> diff --git a/kernel/cred.c b/kernel/cred.c
> index 5f264fb5737d..dd9ffc0396bb 100644
> --- a/kernel/cred.c
> +++ b/kernel/cred.c
> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  #endif
>  		clone_flags & CLONE_THREAD
>  	    ) {
> -		p->real_cred = get_cred(p->cred);
> +		struct cred *c;
> +		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
> +						ULIMIT_NPROC);
> +		p->real_cred = c = get_cred(p->cred);
>  		get_cred(p->cred);
>  		alter_cred_subscribers(p->cred, 2);
>  		kdebug("share_creds(%p{%d,%d})",
>  		       p->cred, atomic_read(&p->cred->usage),
>  		       read_cred_subscribers(p->cred));
> -		atomic_inc(&p->cred->user->processes);
> +		c->ucounts = uc;
>  		return 0;
>  	}
>  
> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>  	}
>  #endif
>  
> -	atomic_inc(&new->user->processes);
> +	new->ucounts = inc_ulimit(new->user_ns, new->uid,
> +				      ULIMIT_NPROC);
>  	p->cred = p->real_cred = get_cred(new);
>  	alter_cred_subscribers(new, 2);
>  	validate_creds(new);
> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
>  	 * in set_user().
>  	 */
>  	alter_cred_subscribers(new, 2);
> -	if (new->user != old->user)
> -		atomic_inc(&new->user->processes);
> +	if (new->user != old->user || old->user_ns != new->user_ns)
> +		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
>  	rcu_assign_pointer(task->real_cred, new);
>  	rcu_assign_pointer(task->cred, new);
> -	if (new->user != old->user)
> -		atomic_dec(&old->user->processes);
> +	if (new->user != old->user || old->user_ns != new->user_ns)
> +		dec_ulimit(old->ucounts, ULIMIT_NPROC);
>  	alter_cred_subscribers(old, -2);
>  
>  	/* send notifications */
> diff --git a/kernel/exit.c b/kernel/exit.c
> index 9d68c45ebbe3..7ac8954230bd 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
>  	/* don't need to get the RCU readlock here - the process is dead and
>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
>  	rcu_read_lock();
> -	atomic_dec(&__task_cred(p)->user->processes);
> +	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
>  	rcu_read_unlock();
>  
>  	proc_flush_task(p);
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 623259fc794d..0b16c9d920b1 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
>  void __init fork_init(void)
>  {
>  	int i;
> +	struct cred *c = init_task.cred;
>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
>  #ifndef ARCH_MIN_TASKALIGN
>  #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
> @@ -448,6 +449,8 @@ void __init fork_init(void)
>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
>  		init_user_ns.ucount_max[i] = max_threads/2;
>  	}
> +
> +	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
>  }
>  
>  int __weak arch_dup_task_struct(struct task_struct *dst,
> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
>  	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
>  #endif
>  	retval = -EAGAIN;
> -	if (atomic_read(&p->real_cred->user->processes) >=
> +	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
>  			task_rlimit(p, RLIMIT_NPROC)) {
>  		if (p->real_cred->user != INIT_USER &&
>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
>  #endif
>  	delayacct_tsk_free(p);
>  bad_fork_cleanup_count:
> -	atomic_dec(&p->cred->user->processes);
> +	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
>  	exit_creds(p);
>  bad_fork_free:
>  	put_task_stack(p);
> diff --git a/kernel/sys.c b/kernel/sys.c
> index 89d5be418157..143d04ed9da5 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
>  	 * for programs doing set*uid()+execve() by harmlessly deferring the
>  	 * failure to the execve() stage.
>  	 */
> -	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
> +	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
>  			new_user != INIT_USER)
>  		current->flags |= PF_NPROC_EXCEEDED;
>  	else
> diff --git a/kernel/ucount.c b/kernel/ucount.c
> index 9d20d5dd298a..d8f5362f6e17 100644
> --- a/kernel/ucount.c
> +++ b/kernel/ucount.c
> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
>  	}
>  }
>  
> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
> +			   enum ulimit_type type)
> +{
> +	struct ucounts *ucounts = get_ucounts(ns, uid);
> +	atomic_inc(&ucounts->ulimit[type]);
> +
> +	return ucounts;
> +}
> +
> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
> +{
> +	atomic_dec(&ucounts->ulimit[type]);
> +	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
> +	put_ucounts(ucounts);
> +}
> +
>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
>  			   enum ucount_type type)
>  {
> diff --git a/kernel/user.c b/kernel/user.c
> index b069ccbfb0b0..ce1ba1fb96c0 100644
> --- a/kernel/user.c
> +++ b/kernel/user.c
> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
>  /* root_user.__count is 1, for init task cred */
>  struct user_struct root_user = {
>  	.__count	= ATOMIC_INIT(1),
> -	.processes	= ATOMIC_INIT(1),
>  	.sigpending	= ATOMIC_INIT(0),
>  	.locked_shm     = 0,
>  	.uid		= GLOBAL_ROOT_UID,
Nikolay Borisov Nov. 3, 2016, 1:49 p.m.
On 10/27/2016 05:37 PM, Serge E. Hallyn wrote:
> Quoting Nikolay Borisov (kernel@kyup.com):
>>
>>
>> On 10/26/2016 08:25 PM, Serge E. Hallyn wrote:
>>> On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
>>>> There are container setups which map the same kuids to
>>>> different containers. In such situation what will happen is
>>>> that same uid's in different containers will map to the same
>>>> underlying user on the matchine (e.g. same struct user). One
>>>> implication of this is that the number of processes for that
>>>> particular user are going to be shared among all the same uids
>>>> in the container. This is problematic, as it means a user in
>>>> containerA can potentially exhaust the process limit such that
>>>> a user in containerB cannot spawn any processes.
>>>
>>> Hi - thanks for the description.  Based on that, though, I worry
>>> that it is a feature we do not want.  Nothing explicitly prohibits
>>> sharing kuids in different containers, but it is is sharing.  If
>>> you want greater isolation between two containers, you must not share
>>> any kuids.
>>>
>>> I'm not saying nack, but i am saying it seems a misguided feature
>>> which could lead people to think sharing uids is safer than it is.
>>
>> I agree that in order for this to be considered "secure" it relies on
>> the assumption that there is no leakage between containers. However,
>> there are currently setups which rely on this behavior for whatever
>> (mis)guided reasons. Furthermore the current design of namespaces
>> doesn't do anything to prevent such uses. Given this I don't think it be
>> fair to completely disregard them, hence the patch.
> 
> I somehow had missed the fact that (if I read below correctly) you
> are actually solving the problem for RLIMIT_NPROC?  That's worthwhile
> then.  I thought the ucounts checks were independent and RLIMIT_NPROC
> failures were still going to mysteriously plague sibling containers.
> 
> I do still worry about the performance impact of adding the get_ucounts()
> in those hot paths below.  Have you done any perf measurements?
> 

Finally managed to get around doing some benchmarking. I performed tests on 
4.9-rc1 with and without my patch. On every kernel I performed 3 tests: 
 - 5 stress-ng instances, doing 250k forks in the init_user_ns
 - 5 stress-ng instances, doing 250k forks in child of init_user_ns (nest factor of 1)
 - 5 stress-ng instances, doing 250k forks in child of init_user_ns (nest factor of 5). 

I ran every experiment 5 times and got the stdev and the average values. Here is how I invoked stress-ng : 

for i in {1..5}; do time /home/projects/kernel-testing/stress-ng/stress-ng -f 5 --fork-ops 250000 ; done

And every unsharing of a namespace was performed by means of "unshare -r"

The results are as follows: 

		Stock 4.9-rc1			Patched 4.9-rc1-nproc			
		Real	STDEV	Sys	STDEV	Real	STDEV	Sys	STDEV
init_user_ns	28.872s	0.167	49.714	0.51	28.386	0.168	49.217	0.899
depth of 1 	27.808	0.448	48.869	0.275	28.303	0.225	50.44	0.174
depth of 5	27.614	0.226	49.336	0.43	28.22	0.737	49.72	0.934

The results are almost identical with only 3% diffrence in the sys time of 1 child ns, 
and 1% with 5 ns child. I think the results are rather noisy. While the tests were 
running I also observed the % for (dec|inc)_ulimit and they were around 0,06%. Very 
negligible. I think on the performance side of things this should be good.  


[SNIP]
Nikolay Borisov Nov. 3, 2016, 2:59 p.m.
On 11/01/2016 05:01 PM, Eric W. Biederman wrote:
> Nikolay Borisov <kernel@kyup.com> writes:
> 
>> There are container setups which map the same kuids to
>> different containers. In such situation what will happen is
>> that same uid's in different containers will map to the same
>> underlying user on the matchine (e.g. same struct user). One
>> implication of this is that the number of processes for that
>> particular user are going to be shared among all the same uids
>> in the container. This is problematic, as it means a user in
>> containerA can potentially exhaust the process limit such that
>> a user in containerB cannot spawn any processes.
>>
>> Fix this by utilising the newly introduced UCOUNT infrastructure,
>> so that process counts are now being accounted per-userns/per-user.
>> In case when a machine is running in non-container setup then
>> the accounting semantics is virtually the same, given that there is
>> going to be one ucount-per-user struct.
>>
>> Signed-off-by: Nikolay Borisov <kernel@kyup.com>
>> ---
>> Hello Eric, 
>>
>> Now that ucount has been merged I took the liberty of 
>> experimenting what the nproc-per-userns approach would 
>> look like and here is the result. In my previous [0] you 
>> expressed concerns regarding not having hierarchical limits, 
>> essentially allowing users to circumvent their limits. So
>> this version doesn't do anything to rectify this issue. 
>> However, I do think the idea of the rlimit is not to 
>> forbid the user of spawning more processes, since currently
>> the limit is being set on a per-process basis, so a malicious 
>> process can in fact increase that limits and spawn a large number 
>> of processes. Given this I believe the current patch doesn't make
>> the situation any worse in that regard.
> 
> I think conceptually this can work.
> 
> Reading through the code I don't see anything capturing the current
> processes RLIMIT_NPROC when creating a new user namespace.  Which is
> critical if the limits are going to be enforced hierarchically.

Right, now the question is whether the limits can in fact be enforced
hierarchically. Because what's really happening is that the actual limit
is set per-process (in task->signal->rlim[limit].rlim_cur). However, the
signal struct is being copied when we fork a new process.

With such a setup nothing prevents the parent process to die and thus
loosing the hierarchical relationship.  Otherwise a rework would need to
be done to move the rlimits in the struct user_namespace. Essentially
this is an open problem and it would require some more design thinking
to get it right.

> 
> I have a small concern that we toss the per user accounting entirely as
> that means we loose a little in ensuring one uid does not have too many
> processes.  But that is comparatively minor.
> 
> I am buried with Kernel Summit and Plumbers this week, so I will be slow
> on review etc.
> 
> But overall I think this a viable approach assuming there are no
> performance issues in structuring things this way.

According to my experiments (see reply I sent to Serge) ucount doesn't
introduce major performance bottlenecks. If you have ideas for a
particular usecases worth investigating I'm happy to do it. But plain
old forking should be sufficient.

> 
> Eric
> 
>> [0] https://lists.linux-foundation.org/pipermail/containers/2015-September/036229.html
>>
>>  fs/exec.c                      |  2 +-
>>  include/linux/cred.h           |  1 +
>>  include/linux/sched.h          |  1 -
>>  include/linux/user_namespace.h |  8 ++++++++
>>  kernel/cred.c                  | 18 +++++++++++-------
>>  kernel/exit.c                  |  2 +-
>>  kernel/fork.c                  |  7 +++++--
>>  kernel/sys.c                   |  2 +-
>>  kernel/ucount.c                | 16 ++++++++++++++++
>>  kernel/user.c                  |  1 -
>>  10 files changed, 44 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/exec.c b/fs/exec.c
>> index 6fcfb3f7b137..8126e00a8d3e 100644
>> --- a/fs/exec.c
>> +++ b/fs/exec.c
>> @@ -1651,7 +1651,7 @@ static int do_execveat_common(int fd, struct filename *filename,
>>  	 * whether NPROC limit is still exceeded.
>>  	 */
>>  	if ((current->flags & PF_NPROC_EXCEEDED) &&
>> -	    atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
>> +	    atomic_read(&current_cred()->ucounts->ulimit[ULIMIT_NPROC]) > rlimit(RLIMIT_NPROC)) {
>>  		retval = -EAGAIN;
>>  		goto out_ret;
>>  	}
>> diff --git a/include/linux/cred.h b/include/linux/cred.h
>> index f0e70a1bb3ac..c3bb4d20ff16 100644
>> --- a/include/linux/cred.h
>> +++ b/include/linux/cred.h
>> @@ -141,6 +141,7 @@ struct cred {
>>  	void		*security;	/* subjective LSM security */
>>  #endif
>>  	struct user_struct *user;	/* real user ID subscription */
>> +	struct ucounts *ucounts;		/* container for per-userns/per-counts */
>>  	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
>>  	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
>>  	struct rcu_head	rcu;		/* RCU deletion hook */
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 348f51b0ec92..3709d70fe2de 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -842,7 +842,6 @@ static inline int signal_group_exit(const struct signal_struct *sig)
>>   */
>>  struct user_struct {
>>  	atomic_t __count;	/* reference count */
>> -	atomic_t processes;	/* How many processes does this user have? */
>>  	atomic_t sigpending;	/* How many pending signals does this user have? */
>>  #ifdef CONFIG_INOTIFY_USER
>>  	atomic_t inotify_watches; /* How many inotify watches does this user have? */
>> diff --git a/include/linux/user_namespace.h b/include/linux/user_namespace.h
>> index eb209d4523f5..d4fc31ae9d85 100644
>> --- a/include/linux/user_namespace.h
>> +++ b/include/linux/user_namespace.h
>> @@ -35,6 +35,11 @@ enum ucount_type {
>>  	UCOUNT_COUNTS,
>>  };
>>  
>> +enum ulimit_type {
>> +	ULIMIT_NPROC,
>> +	ULIMIT_COUNTS,
>> +};
>> +
>>  struct user_namespace {
>>  	struct uid_gid_map	uid_map;
>>  	struct uid_gid_map	gid_map;
>> @@ -67,6 +72,7 @@ struct ucounts {
>>  	kuid_t uid;
>>  	atomic_t count;
>>  	atomic_t ucount[UCOUNT_COUNTS];
>> +	atomic_t ulimit[ULIMIT_COUNTS];
>>  };
>>  
>>  extern struct user_namespace init_user_ns;
>> @@ -75,6 +81,8 @@ bool setup_userns_sysctls(struct user_namespace *ns);
>>  void retire_userns_sysctls(struct user_namespace *ns);
>>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid, enum ucount_type type);
>>  void dec_ucount(struct ucounts *ucounts, enum ucount_type type);
>> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid, enum ulimit_type type);
>> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type);
>>  
>>  #ifdef CONFIG_USER_NS
>>  
>> diff --git a/kernel/cred.c b/kernel/cred.c
>> index 5f264fb5737d..dd9ffc0396bb 100644
>> --- a/kernel/cred.c
>> +++ b/kernel/cred.c
>> @@ -330,13 +330,16 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>  #endif
>>  		clone_flags & CLONE_THREAD
>>  	    ) {
>> -		p->real_cred = get_cred(p->cred);
>> +		struct cred *c;
>> +		struct ucounts *uc = inc_ulimit(p->cred->user_ns, p->cred->uid,
>> +						ULIMIT_NPROC);
>> +		p->real_cred = c = get_cred(p->cred);
>>  		get_cred(p->cred);
>>  		alter_cred_subscribers(p->cred, 2);
>>  		kdebug("share_creds(%p{%d,%d})",
>>  		       p->cred, atomic_read(&p->cred->usage),
>>  		       read_cred_subscribers(p->cred));
>> -		atomic_inc(&p->cred->user->processes);
>> +		c->ucounts = uc;
>>  		return 0;
>>  	}
>>  
>> @@ -369,7 +372,8 @@ int copy_creds(struct task_struct *p, unsigned long clone_flags)
>>  	}
>>  #endif
>>  
>> -	atomic_inc(&new->user->processes);
>> +	new->ucounts = inc_ulimit(new->user_ns, new->uid,
>> +				      ULIMIT_NPROC);
>>  	p->cred = p->real_cred = get_cred(new);
>>  	alter_cred_subscribers(new, 2);
>>  	validate_creds(new);
>> @@ -461,12 +465,12 @@ int commit_creds(struct cred *new)
>>  	 * in set_user().
>>  	 */
>>  	alter_cred_subscribers(new, 2);
>> -	if (new->user != old->user)
>> -		atomic_inc(&new->user->processes);
>> +	if (new->user != old->user || old->user_ns != new->user_ns)
>> +		new->ucounts = inc_ulimit(new->user_ns, new->uid, ULIMIT_NPROC);
>>  	rcu_assign_pointer(task->real_cred, new);
>>  	rcu_assign_pointer(task->cred, new);
>> -	if (new->user != old->user)
>> -		atomic_dec(&old->user->processes);
>> +	if (new->user != old->user || old->user_ns != new->user_ns)
>> +		dec_ulimit(old->ucounts, ULIMIT_NPROC);
>>  	alter_cred_subscribers(old, -2);
>>  
>>  	/* send notifications */
>> diff --git a/kernel/exit.c b/kernel/exit.c
>> index 9d68c45ebbe3..7ac8954230bd 100644
>> --- a/kernel/exit.c
>> +++ b/kernel/exit.c
>> @@ -173,7 +173,7 @@ void release_task(struct task_struct *p)
>>  	/* don't need to get the RCU readlock here - the process is dead and
>>  	 * can't be modifying its own credentials. But shut RCU-lockdep up */
>>  	rcu_read_lock();
>> -	atomic_dec(&__task_cred(p)->user->processes);
>> +	dec_ulimit(__task_cred(p)->ucounts, ULIMIT_NPROC);
>>  	rcu_read_unlock();
>>  
>>  	proc_flush_task(p);
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 623259fc794d..0b16c9d920b1 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -425,6 +425,7 @@ int arch_task_struct_size __read_mostly;
>>  void __init fork_init(void)
>>  {
>>  	int i;
>> +	struct cred *c = init_task.cred;
>>  #ifndef CONFIG_ARCH_TASK_STRUCT_ALLOCATOR
>>  #ifndef ARCH_MIN_TASKALIGN
>>  #define ARCH_MIN_TASKALIGN	L1_CACHE_BYTES
>> @@ -448,6 +449,8 @@ void __init fork_init(void)
>>  	for (i = 0; i < UCOUNT_COUNTS; i++) {
>>  		init_user_ns.ucount_max[i] = max_threads/2;
>>  	}
>> +
>> +	c->ucounts = inc_ulimit(&init_user_ns, make_kuid(&init_user_ns, 0), ULIMIT_NPROC);
>>  }
>>  
>>  int __weak arch_dup_task_struct(struct task_struct *dst,
>> @@ -1515,7 +1518,7 @@ static __latent_entropy struct task_struct *copy_process(
>>  	DEBUG_LOCKS_WARN_ON(!p->softirqs_enabled);
>>  #endif
>>  	retval = -EAGAIN;
>> -	if (atomic_read(&p->real_cred->user->processes) >=
>> +	if (atomic_read(&p->real_cred->ucounts->ulimit[ULIMIT_NPROC]) >=
>>  			task_rlimit(p, RLIMIT_NPROC)) {
>>  		if (p->real_cred->user != INIT_USER &&
>>  		    !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN))
>> @@ -1859,7 +1862,7 @@ static __latent_entropy struct task_struct *copy_process(
>>  #endif
>>  	delayacct_tsk_free(p);
>>  bad_fork_cleanup_count:
>> -	atomic_dec(&p->cred->user->processes);
>> +	dec_ulimit(p->cred->ucounts, ULIMIT_NPROC);
>>  	exit_creds(p);
>>  bad_fork_free:
>>  	put_task_stack(p);
>> diff --git a/kernel/sys.c b/kernel/sys.c
>> index 89d5be418157..143d04ed9da5 100644
>> --- a/kernel/sys.c
>> +++ b/kernel/sys.c
>> @@ -433,7 +433,7 @@ static int set_user(struct cred *new)
>>  	 * for programs doing set*uid()+execve() by harmlessly deferring the
>>  	 * failure to the execve() stage.
>>  	 */
>> -	if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
>> +	if (atomic_read(&new->ucounts->ulimit[ULIMIT_NPROC]) >= rlimit(RLIMIT_NPROC) &&
>>  			new_user != INIT_USER)
>>  		current->flags |= PF_NPROC_EXCEEDED;
>>  	else
>> diff --git a/kernel/ucount.c b/kernel/ucount.c
>> index 9d20d5dd298a..d8f5362f6e17 100644
>> --- a/kernel/ucount.c
>> +++ b/kernel/ucount.c
>> @@ -181,6 +181,22 @@ static inline bool atomic_inc_below(atomic_t *v, int u)
>>  	}
>>  }
>>  
>> +struct ucounts *inc_ulimit(struct user_namespace *ns, kuid_t uid,
>> +			   enum ulimit_type type)
>> +{
>> +	struct ucounts *ucounts = get_ucounts(ns, uid);
>> +	atomic_inc(&ucounts->ulimit[type]);
>> +
>> +	return ucounts;
>> +}
>> +
>> +void dec_ulimit(struct ucounts *ucounts, enum ulimit_type type)
>> +{
>> +	atomic_dec(&ucounts->ulimit[type]);
>> +	WARN_ON(atomic_read(&ucounts->ulimit[type]) < 0);
>> +	put_ucounts(ucounts);
>> +}
>> +
>>  struct ucounts *inc_ucount(struct user_namespace *ns, kuid_t uid,
>>  			   enum ucount_type type)
>>  {
>> diff --git a/kernel/user.c b/kernel/user.c
>> index b069ccbfb0b0..ce1ba1fb96c0 100644
>> --- a/kernel/user.c
>> +++ b/kernel/user.c
>> @@ -90,7 +90,6 @@ static DEFINE_SPINLOCK(uidhash_lock);
>>  /* root_user.__count is 1, for init task cred */
>>  struct user_struct root_user = {
>>  	.__count	= ATOMIC_INIT(1),
>> -	.processes	= ATOMIC_INIT(1),
>>  	.sigpending	= ATOMIC_INIT(0),
>>  	.locked_shm     = 0,
>>  	.uid		= GLOBAL_ROOT_UID,
Serge E. Hallyn Nov. 7, 2016, 4:56 p.m.
Quoting Nikolay Borisov (kernel@kyup.com):
> 
> 
> On 10/27/2016 05:37 PM, Serge E. Hallyn wrote:
> > Quoting Nikolay Borisov (kernel@kyup.com):
> >>
> >>
> >> On 10/26/2016 08:25 PM, Serge E. Hallyn wrote:
> >>> On Wed, Oct 26, 2016 at 03:40:27PM +0300, Nikolay Borisov wrote:
> >>>> There are container setups which map the same kuids to
> >>>> different containers. In such situation what will happen is
> >>>> that same uid's in different containers will map to the same
> >>>> underlying user on the matchine (e.g. same struct user). One
> >>>> implication of this is that the number of processes for that
> >>>> particular user are going to be shared among all the same uids
> >>>> in the container. This is problematic, as it means a user in
> >>>> containerA can potentially exhaust the process limit such that
> >>>> a user in containerB cannot spawn any processes.
> >>>
> >>> Hi - thanks for the description.  Based on that, though, I worry
> >>> that it is a feature we do not want.  Nothing explicitly prohibits
> >>> sharing kuids in different containers, but it is is sharing.  If
> >>> you want greater isolation between two containers, you must not share
> >>> any kuids.
> >>>
> >>> I'm not saying nack, but i am saying it seems a misguided feature
> >>> which could lead people to think sharing uids is safer than it is.
> >>
> >> I agree that in order for this to be considered "secure" it relies on
> >> the assumption that there is no leakage between containers. However,
> >> there are currently setups which rely on this behavior for whatever
> >> (mis)guided reasons. Furthermore the current design of namespaces
> >> doesn't do anything to prevent such uses. Given this I don't think it be
> >> fair to completely disregard them, hence the patch.
> > 
> > I somehow had missed the fact that (if I read below correctly) you
> > are actually solving the problem for RLIMIT_NPROC?  That's worthwhile
> > then.  I thought the ucounts checks were independent and RLIMIT_NPROC
> > failures were still going to mysteriously plague sibling containers.
> > 
> > I do still worry about the performance impact of adding the get_ucounts()
> > in those hot paths below.  Have you done any perf measurements?
> > 
> 
> Finally managed to get around doing some benchmarking. I performed tests on 
> 4.9-rc1 with and without my patch. On every kernel I performed 3 tests: 
>  - 5 stress-ng instances, doing 250k forks in the init_user_ns
>  - 5 stress-ng instances, doing 250k forks in child of init_user_ns (nest factor of 1)
>  - 5 stress-ng instances, doing 250k forks in child of init_user_ns (nest factor of 5). 
> 
> I ran every experiment 5 times and got the stdev and the average values. Here is how I invoked stress-ng : 
> 
> for i in {1..5}; do time /home/projects/kernel-testing/stress-ng/stress-ng -f 5 --fork-ops 250000 ; done
> 
> And every unsharing of a namespace was performed by means of "unshare -r"
> 
> The results are as follows: 
> 
> 		Stock 4.9-rc1			Patched 4.9-rc1-nproc			
> 		Real	STDEV	Sys	STDEV	Real	STDEV	Sys	STDEV
> init_user_ns	28.872s	0.167	49.714	0.51	28.386	0.168	49.217	0.899
> depth of 1 	27.808	0.448	48.869	0.275	28.303	0.225	50.44	0.174
> depth of 5	27.614	0.226	49.336	0.43	28.22	0.737	49.72	0.934
> 
> The results are almost identical with only 3% diffrence in the sys time of 1 child ns, 
> and 1% with 5 ns child. I think the results are rather noisy. While the tests were 
> running I also observed the % for (dec|inc)_ulimit and they were around 0,06%. Very 
> negligible. I think on the performance side of things this should be good.  

Awesome, thanks for that.
Eric W. Biederman Nov. 7, 2016, 5:28 p.m.
Nikolay Borisov <kernel@kyup.com> writes:

> On 11/01/2016 05:01 PM, Eric W. Biederman wrote:

>> I think conceptually this can work.
>> 
>> Reading through the code I don't see anything capturing the current
>> processes RLIMIT_NPROC when creating a new user namespace.  Which is
>> critical if the limits are going to be enforced hierarchically.
>
> Right, now the question is whether the limits can in fact be enforced
> hierarchically. Because what's really happening is that the actual limit
> is set per-process (in task->signal->rlim[limit].rlim_cur). However, the
> signal struct is being copied when we fork a new process.
>
> With such a setup nothing prevents the parent process to die and thus
> loosing the hierarchical relationship.  Otherwise a rework would need to
> be done to move the rlimits in the struct user_namespace. Essentially
> this is an open problem and it would require some more design thinking
> to get it right.

The only point I see in using the ucount infrastructure is if we limit a
user namespace to the number of processes the creator of user namespace
was allowed to create.  Thus making it impossible to escape your limit
by creating a user namespace and using multiple users.

Your current patch achieves the opposite.  Making it even easier to
escape your current rlimit which is a regression and unacceptable.

>> 
>> I have a small concern that we toss the per user accounting entirely as
>> that means we loose a little in ensuring one uid does not have too many
>> processes.  But that is comparatively minor.
>> 
>> I am buried with Kernel Summit and Plumbers this week, so I will be slow
>> on review etc.
>> 
>> But overall I think this a viable approach assuming there are no
>> performance issues in structuring things this way.
>
> According to my experiments (see reply I sent to Serge) ucount doesn't
> introduce major performance bottlenecks. If you have ideas for a
> particular usecases worth investigating I'm happy to do it. But plain
> old forking should be sufficient.

Which is nice but you have only measured half of what is interesting.
Until there is hierarchical limit enforcement this is uninteresting.

Eric