inotify: Extend ioctl to allow to request id of new watch descriptor

Submitted by Kirill Tkhai on Feb. 14, 2018, 11:34 a.m.

Details

Message ID 151860805967.32478.3189700389678694572.stgit@localhost.localdomain
State New
Series "inotify: Extend ioctl to allow to request id of new watch descriptor"
Headers show

Commit Message

Kirill Tkhai Feb. 14, 2018, 11:34 a.m.
[This patch is accepted in ms and is going to main tree.
 As there is no idr_set_cursor() in 3.10 I've directly
 used data->idr.cur on port].

Watch descriptor is id of the watch created by inotify_add_watch().
It is allocated in inotify_add_to_idr(), and takes the numbers
starting from 1. Every new inotify watch obtains next available
number (usually, old + 1), as served by idr_alloc_cyclic().

CRIU (Checkpoint/Restore In Userspace) project supports inotify
files, and restores watched descriptors with the same numbers,
they had before dump. Since there was no kernel support, we
had to use cycle to add a watch with specific descriptor id:

	while (1) {
		int wd;

		wd = inotify_add_watch(inotify_fd, path, mask);
		if (wd < 0) {
			break;
		} else if (wd == desired_wd_id) {
			ret = 0;
			break;
		}

		inotify_rm_watch(inotify_fd, wd);
	}

(You may find the actual code at the below link:
https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)

The cycle is suboptiomal and very expensive, but since there is no better
kernel support, it was the only way to restore that. Happily, we had met
mostly descriptors with small id, and this approach had worked somehow.

But recent time containers with inotify with big watch descriptors
begun to come, and this way stopped to work at all. When descriptor id
is something about 0x34d71d6, the restoring process spins in busy loop
for a long time, and the restore hungs and delay of migration from node
to node could easily be watched.

This patch aims to solve this problem. It introduces new ioctl
INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
watch descriptor from userspace. It simply calls idr_set_cursor() primitive
to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
will return this id, if it is not occupied. This is the way which is
used to restore some other resources from userspace. For example,
/proc/sys/kernel/ns_last_pid works the same for task pids.

The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
may exclude it.

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/notify/inotify/inotify_user.c |   14 ++++++++++++++
 include/uapi/linux/inotify.h     |    8 ++++++++
 2 files changed, 22 insertions(+)

Patch hide | download patch | download mbox

diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index cc39a5d84a0e..fe93507e4d98 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -309,6 +309,20 @@  static long inotify_ioctl(struct file *file, unsigned int cmd,
 		spin_unlock(&group->notification_lock);
 		ret = put_user(send_len, (int __user *) p);
 		break;
+#ifdef CONFIG_CHECKPOINT_RESTORE
+	case INOTIFY_IOC_SETNEXTWD:
+		ret = -EINVAL;
+		if (arg >= 1 && arg <= INT_MAX) {
+			struct inotify_group_private_data *data;
+
+			data = &group->inotify_data;
+			spin_lock(&data->idr_lock);
+			data->idr.cur = (unsigned int)arg;
+			spin_unlock(&data->idr_lock);
+			ret = 0;
+		}
+		break;
+#endif /* CONFIG_CHECKPOINT_RESTORE */
 	}
 
 	return ret;
diff --git a/include/uapi/linux/inotify.h b/include/uapi/linux/inotify.h
index e6bf35b2dd34..ce8ac99480fa 100644
--- a/include/uapi/linux/inotify.h
+++ b/include/uapi/linux/inotify.h
@@ -70,5 +70,13 @@  struct inotify_event {
 #define IN_CLOEXEC O_CLOEXEC
 #define IN_NONBLOCK O_NONBLOCK
 
+/*
+ * ioctl numbers: inotify uses 'I' prefix for all ioctls,
+ * except historical FIONREAD, which is based on 'T'.
+ *
+ * INOTIFY_IOC_SETNEXTWD: set desired number of next created
+ * watch descriptor.
+ */
+#define INOTIFY_IOC_SETNEXTWD	_IOW('I', 0, __s32)
 
 #endif /* _UAPI_LINUX_INOTIFY_H */

Comments

Kirill Tkhai Feb. 14, 2018, 11:34 a.m.
https://jira.sw.ru/browse/PSBM-81411

On 14.02.2018 14:34, Kirill Tkhai wrote:
> [This patch is accepted in ms and is going to main tree.
>  As there is no idr_set_cursor() in 3.10 I've directly
>  used data->idr.cur on port].
> 
> Watch descriptor is id of the watch created by inotify_add_watch().
> It is allocated in inotify_add_to_idr(), and takes the numbers
> starting from 1. Every new inotify watch obtains next available
> number (usually, old + 1), as served by idr_alloc_cyclic().
> 
> CRIU (Checkpoint/Restore In Userspace) project supports inotify
> files, and restores watched descriptors with the same numbers,
> they had before dump. Since there was no kernel support, we
> had to use cycle to add a watch with specific descriptor id:
> 
> 	while (1) {
> 		int wd;
> 
> 		wd = inotify_add_watch(inotify_fd, path, mask);
> 		if (wd < 0) {
> 			break;
> 		} else if (wd == desired_wd_id) {
> 			ret = 0;
> 			break;
> 		}
> 
> 		inotify_rm_watch(inotify_fd, wd);
> 	}
> 
> (You may find the actual code at the below link:
> https://github.com/checkpoint-restore/criu/blob/v3.7/criu/fsnotify.c#L577)
> 
> The cycle is suboptiomal and very expensive, but since there is no better
> kernel support, it was the only way to restore that. Happily, we had met
> mostly descriptors with small id, and this approach had worked somehow.
> 
> But recent time containers with inotify with big watch descriptors
> begun to come, and this way stopped to work at all. When descriptor id
> is something about 0x34d71d6, the restoring process spins in busy loop
> for a long time, and the restore hungs and delay of migration from node
> to node could easily be watched.
> 
> This patch aims to solve this problem. It introduces new ioctl
> INOTIFY_IOC_SETNEXTWD, which allows to request the number of next created
> watch descriptor from userspace. It simply calls idr_set_cursor() primitive
> to populate idr::idr_next, so that next idr_alloc_cyclic() allocation
> will return this id, if it is not occupied. This is the way which is
> used to restore some other resources from userspace. For example,
> /proc/sys/kernel/ns_last_pid works the same for task pids.
> 
> The new code is under CONFIG_CHECKPOINT_RESTORE #define, so small system
> may exclude it.
> 
> Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
> Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
> Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
> ---
>  fs/notify/inotify/inotify_user.c |   14 ++++++++++++++
>  include/uapi/linux/inotify.h     |    8 ++++++++
>  2 files changed, 22 insertions(+)
> 
> diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
> index cc39a5d84a0e..fe93507e4d98 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -309,6 +309,20 @@ static long inotify_ioctl(struct file *file, unsigned int cmd,
>  		spin_unlock(&group->notification_lock);
>  		ret = put_user(send_len, (int __user *) p);
>  		break;
> +#ifdef CONFIG_CHECKPOINT_RESTORE
> +	case INOTIFY_IOC_SETNEXTWD:
> +		ret = -EINVAL;
> +		if (arg >= 1 && arg <= INT_MAX) {
> +			struct inotify_group_private_data *data;
> +
> +			data = &group->inotify_data;
> +			spin_lock(&data->idr_lock);
> +			data->idr.cur = (unsigned int)arg;
> +			spin_unlock(&data->idr_lock);
> +			ret = 0;
> +		}
> +		break;
> +#endif /* CONFIG_CHECKPOINT_RESTORE */
>  	}
>  
>  	return ret;
> diff --git a/include/uapi/linux/inotify.h b/include/uapi/linux/inotify.h
> index e6bf35b2dd34..ce8ac99480fa 100644
> --- a/include/uapi/linux/inotify.h
> +++ b/include/uapi/linux/inotify.h
> @@ -70,5 +70,13 @@ struct inotify_event {
>  #define IN_CLOEXEC O_CLOEXEC
>  #define IN_NONBLOCK O_NONBLOCK
>  
> +/*
> + * ioctl numbers: inotify uses 'I' prefix for all ioctls,
> + * except historical FIONREAD, which is based on 'T'.
> + *
> + * INOTIFY_IOC_SETNEXTWD: set desired number of next created
> + * watch descriptor.
> + */
> +#define INOTIFY_IOC_SETNEXTWD	_IOW('I', 0, __s32)
>  
>  #endif /* _UAPI_LINUX_INOTIFY_H */
>