[RHEL8,COMMIT] ve/vfs: introduce "fs.odirect_enable" sysctl and disable it by default

Submitted by Konstantin Khorenko on June 17, 2020, 1:21 p.m.

Details

Message ID 202006171321.05HDLcfQ030399@finist-co8.sw.ru
State New
Series "ve/vfs: introduce "fs.odirect_enable" sysctl and disable it by default"

Commit Message

Konstantin Khorenko June 17, 2020, 1:21 p.m.
The commit is pushed to "branch-rh8-4.18.0-80.1.2.vz8.3.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh8-4.18.0-80.1.2.vz8.3.12
------>
commit b2215480f4199d1d1b0e9a96bbf036607d4ad054
Author: Konstantin Khorenko <khorenko@virtuozzo.com>
Date:   Wed Jun 17 14:03:12 2020 +0300

    ve/vfs: introduce "fs.odirect_enable" sysctl and disable it by default
    
    We have observed that, with many Containers on a node, even a small
    amount of direct disk I/O in each CT brings the whole node to its knees
    (100 CTs, each writing 5 lines of logs every 20-30 seconds).
    Admittedly, the node had slow HDDs.
    
    Note that disabling O_DIRECT significantly slows down async reads: they
    can be asynchronous only in direct mode. If they are issued in cached
    mode, they effectively become synchronous when there is more than one
    writer.
    
    Example:
     # fio --randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 \
       --name=test --filename=test --bs=4k --iodepth=64 --size=1G \
       --readwrite=randrw --rwmixread=75
    
    The VPS here achieved 20MB/s read and 6.8MB/s write, while another VPS
    (with O_DIRECT enabled) achieved 230MB/s read and 76MB/s write.
    
    The root cause is known: libaio becomes synchronous in case of cached io.
    
    So userspace had better check whether the underlying disk is fast enough
    and enable O_DIRECT in that case.
    
    https://jira.sw.ru/browse/PSBM-53458
    https://jira.sw.ru/browse/PSBM-68005
    https://jira.sw.ru/browse/PSBM-68656
    https://jira.sw.ru/browse/PSBM-100671
    https://jira.sw.ru/browse/PSBM-104338
    
    Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
    
    ===============================================================
    ===============================================================
    Original commit message:
    
     commit f5829bccbd390437013bd914d68caabf79d09b3e
     Author: Konstantin Khorenko <khorenko@virtuozzo.com>
     Date:   Mon Dec 11 23:00:45 2017 +0300
    
        ve/fs: introduce "fs.fsync-enable" and "fs.odirect_enable" sysctls
    
        ve/vfs: introduce "odirect_enable" sysctl and disable it by default
    
        khorenko@: we want to disable direct access from inside a Container
                because there is a limited number of direct requests available
                on the system (128), and when they are all busy the next request
                is served only after some request has completed.
                There is no scheduler at this level => DDoS is possible
                from inside a CT: just run _many_ processes writing with O_DIRECT.
    
        diff-vfs-odirect-enable && diff-vfs-odirect-enable-location-fix
    
        Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
    
        +++
        ve/fs: Port fs.fsync-enable and fs.odirect_enable sysctls
    
        This is a part of 74-diff-ve-mix-combined.
    
        https://jira.sw.ru/browse/PSBM-17903
    
        Signed-off-by: Kirill Tkhai <ktkhai@parallels.com>
    
        =====================================================
    
        ve/fs: check container odirect and fsync settings in __dentry_open
    
        sys_open for conventional filesystems doesn't call dentry_open,
        it calls __dentry_open (in nameidata_to_filp), so we have to move
        the checks for odirect and fsync behaviour to __dentry_open
        to make them work on ploop containers.
    
        https://jira.sw.ru/browse/PSBM-17157
    
        Signed-off-by: Dmitry Guryanov <dguryanov@parallels.com>
    
        Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
        Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
    
        ================================================
    
        ve: initialize fsync_enable also for non ve0 environment
    
        Patchset description:
    
        ve: fix initialization and remove sysctl_fsync_enable
    
        v2:
        - initialize only on ve cgroup creation, remove get_ve_features
        - rename setup_iptables_mask into ve_setup_iptables_mask
    
        https://jira.sw.ru/browse/PSBM-34286
        https://jira.sw.ru/browse/PSBM-34285
    
        Pavel Tikhomirov (4):
          ve: remove sysctl_fsync_enable and use ve_fsync_behavior instead
          ve: initialize fsync_enable also for non ve0 environment
          ve: iptables: fix mask initialization and changing
          ve: cgroup: initialize odirect_enable, features and _randomize_va_space
    
        =====================================================================
        This patch description:
    
        v2: only on ve cgroup creation
    
        https://jira.sw.ru/browse/PSBM-34286
        Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
        Acked-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/fcntl.c         | 30 ++++++++++++++++++++++++++++++
 fs/open.c          |  3 +++
 include/linux/fs.h |  2 ++
 include/linux/ve.h |  1 +
 kernel/sysctl.c    |  7 +++++++
 kernel/ve/ve.c     |  2 ++
 6 files changed, 45 insertions(+)
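
The commit message leaves it to userspace to decide whether the underlying
disk is fast enough to enable O_DIRECT. A minimal userspace sketch of one
such policy follows. It rests on an assumption that is not part of this
patch: "fast enough" is approximated by the block device being
non-rotational, as reported in /sys/block/<dev>/queue/rotational. The helper
names are hypothetical, and the path /proc/sys/fs/odirect_enable is inferred
from the sysctl name "fs.odirect_enable".

```c
#include <stdio.h>

/*
 * Read a sysfs "rotational" attribute.
 * Returns 1 for a rotational disk, 0 for a non-rotational one,
 * and -1 if the file cannot be read or holds anything else.
 */
int disk_is_rotational(const char *sysfs_path)
{
	FILE *f = fopen(sysfs_path, "r");
	int c;

	if (!f)
		return -1;
	c = fgetc(f);
	fclose(f);

	if (c == '0')
		return 0;
	if (c == '1')
		return 1;
	return -1;
}

/*
 * Hypothetical policy (an assumption, not taken from the patch):
 * honor O_DIRECT (1) only on non-rotational disks, otherwise ignore it (0).
 * An unknown disk type (-1) is treated as slow.
 */
int choose_odirect_enable(int rotational)
{
	return rotational == 0 ? 1 : 0;
}

/*
 * Write an integer to a sysctl file such as /proc/sys/fs/odirect_enable.
 * Returns 0 on success, -1 on failure.
 */
int write_sysctl_int(const char *path, int value)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%d\n", value) > 0;
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

A caller would typically chain the three helpers, for example
write_sysctl_int("/proc/sys/fs/odirect_enable",
choose_odirect_enable(disk_is_rotational("/sys/block/sda/queue/rotational"))),
where "sda" is only a placeholder device name. A real tool would probably
measure the disk rather than rely on the rotational flag alone.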


diff --git a/fs/fcntl.c b/fs/fcntl.c
index 1f2fd840f50c..5aa733d00ce9 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -25,6 +25,7 @@ 
 #include <linux/user_namespace.h>
 #include <linux/memfd.h>
 #include <linux/compat.h>
+#include <linux/ve.h>
 
 #include <linux/poll.h>
 #include <asm/siginfo.h>
@@ -32,11 +33,40 @@ 
 
 #define SETFL_MASK (O_APPEND | O_NONBLOCK | O_NDELAY | O_DIRECT | O_NOATIME)
 
+/*
+ * Host is always allowed to use O_DIRECT.
+ * Host's value of sysctl "fs.odirect_enable" might affect Containers only.
+ *
+ * Container's "fs.odirect_enable" sysctl value means:
+ *  0: Container ignores O_DIRECT flag
+ *  1: Container honors  O_DIRECT flag (in fact, any X>0 && X != 2)
+ *  2: Container checks the host's sysctl value and works according to it
+ */
+int may_use_odirect(void)
+{
+	int may;
+
+	if (ve_is_super(get_exec_env()))
+		return 1;
+
+	may = capable(CAP_SYS_RAWIO);
+	if (!may) {
+		may = get_exec_env()->odirect_enable;
+		if (may == 2)
+			may = get_ve0()->odirect_enable;
+	}
+
+	return may;
+}
+
 static int setfl(int fd, struct file * filp, unsigned long arg)
 {
 	struct inode * inode = file_inode(filp);
 	int error = 0;
 
+	if (!may_use_odirect())
+		arg &= ~O_DIRECT;
+
 	/*
 	 * O_APPEND cannot be cleared if the file is marked as append-only
 	 * and the file is open for write.
diff --git a/fs/open.c b/fs/open.c
index 2874bfc68f08..47c1e5ac2c97 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -727,6 +727,9 @@  static int do_dentry_open(struct file *f,
 	/* Ensure that we skip any errors that predate opening of the file */
 	f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
 
+	if (!may_use_odirect())
+		f->f_flags &= ~O_DIRECT;
+
 	if (unlikely(f->f_flags & O_PATH)) {
 		f->f_mode = FMODE_PATH | FMODE_OPENED;
 		f->f_op = &empty_fops;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 84ebf6e57c2f..c5ecb02684b3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -175,6 +175,8 @@  typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
  */
 #define CHECK_IOVEC_ONLY -1
 
+extern int may_use_odirect(void);
+
 /*
  * Attribute flags.  These should be or-ed together to figure out what
  * has been changed!
diff --git a/include/linux/ve.h b/include/linux/ve.h
index ba84d3058ad2..b659e779cb49 100644
--- a/include/linux/ve.h
+++ b/include/linux/ve.h
@@ -67,6 +67,7 @@  struct ve_struct {
 #ifdef CONFIG_VE_IPTABLES
 	__u64			ipt_mask;
 #endif
+	int			odirect_enable;
 
 	u64			_uevent_seqnum;
 
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 898ec305032a..d74004c2be59 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1857,6 +1857,13 @@  static struct ctl_table fs_table[] = {
 		.child		= sysctl_mount_point,
 	},
 #endif
+	{
+		.procname	= "odirect_enable",
+		.data		= &ve0.odirect_enable,
+		.maxlen		= sizeof(int),
+		.mode		= 0644 | S_ISVTX,
+		.proc_handler	= proc_dointvec_virtual,
+	},
 	{
 		.procname	= "pipe-max-size",
 		.data		= &pipe_max_size,
diff --git a/kernel/ve/ve.c b/kernel/ve/ve.c
index 0f07c4ecf849..befc5163cfe6 100644
--- a/kernel/ve/ve.c
+++ b/kernel/ve/ve.c
@@ -562,6 +562,8 @@  static struct cgroup_subsys_state *ve_create(struct cgroup_subsys_state *parent_
 	ve->features = VE_FEATURES_DEF;
 	ve->_randomize_va_space = ve0._randomize_va_space;
 
+	ve->odirect_enable = 2;
+
 #ifdef CONFIG_VE_IPTABLES
 	ve->ipt_mask = ve_setup_iptables_mask(VE_IP_DEFAULT);
 #endif
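
For readers who want to trace the decision made by may_use_odirect() without
a kernel tree, the following is a userspace model of the same logic. It is
only a sketch: the kernel helpers ve_is_super(), capable() and get_ve0() are
replaced by plain parameters, and MODEL_O_DIRECT is a stand-in bit chosen
for illustration, not the real O_DIRECT value.

```c
#include <stdbool.h>

/* Stand-in flag bit for this model only; not the kernel's O_DIRECT. */
#define MODEL_O_DIRECT 0x4000u

/*
 * Model of may_use_odirect() from the patch.
 *   is_host    - stands in for ve_is_super(get_exec_env())
 *   has_rawio  - stands in for capable(CAP_SYS_RAWIO)
 *   ct_value   - the Container's odirect_enable (2 on ve creation)
 *   host_value - the host's (ve0) odirect_enable
 * A non-zero result means O_DIRECT is honored.
 */
int may_use_odirect_model(bool is_host, bool has_rawio,
			  int ct_value, int host_value)
{
	int may;

	if (is_host)
		return 1;	/* the host is always allowed */

	may = has_rawio ? 1 : 0;
	if (!may) {
		may = ct_value;
		if (may == 2)	/* 2: defer to the host's setting */
			may = host_value;
	}
	return may;
}

/*
 * Model of the checks added to setfl() and do_dentry_open():
 * the flag is silently dropped instead of failing the call.
 */
unsigned int filter_flags(unsigned int flags, int may)
{
	if (!may)
		flags &= ~MODEL_O_DIRECT;
	return flags;
}
```

With the value 2 assigned in ve_create(), a new Container simply follows the
host: if the host value is 0, O_DIRECT is ignored inside the Container, and
the open or fcntl call still succeeds with the flag stripped.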