[Devel,RHEL7,COMMIT] ms/xfs: rework buffer dispose list tracking

Submitted by Konstantin Khorenko on Jan. 11, 2017, 10:34 a.m.

Details

Message ID 201701111034.v0BAYdb5003372@finist_cl7.x64_64.work.ct
State New
Series "rebase xfs lru patches"
Headers show

Commit Message

Konstantin Khorenko Jan. 11, 2017, 10:34 a.m.
The commit is pushed to "branch-rh7-3.10.0-514.vz7.27.x-ovz" and will appear at https://src.openvz.org/scm/ovz/vzkernel.git
after rh7-3.10.0-514.vz7.27.8
------>
commit 80545dc5820da58dc3b7512dab667992ebcd487d
Author: Dmitry Monakhov <dmonakhov@openvz.org>
Date:   Wed Jan 11 14:34:39 2017 +0400

    ms/xfs: rework buffer dispose list tracking
    
    Patchset description:
    [7.3] rebase xfs lru patches
    
    rh7-3.10.0-514 already has 'fs-xfs-rework-buffer-dispose-list-tracking', but
    originally it depens on ms/xfs-convert-buftarg-LRU-to-generic, so
    In order to preserve original logic I've revert rhel's patch (1'st one),
    and reapply it later in natural order:
    TOC:
    0001-Revert-fs-xfs-rework-buffer-dispose-list-tracking.patch
    
    0002-ms-xfs-convert-buftarg-LRU-to-generic-code.patch
    0003-From-c70ded437bb646ace0dcbf3c7989d4edeed17f7e-Mon-Se.patch [not changed]
    0004-ms-xfs-rework-buffer-dispose-list-tracking.patch
    
    ===============================================================
    This patch description:
    
    In converting the buffer lru lists to use the generic code, the locking
    for marking the buffers as on the dispose list was lost.  This results in
    confusion in LRU buffer tracking and acocunting, resulting in reference
    counts being mucked up and filesystem beig unmountable.
    
    To fix this, introduce an internal buffer spinlock to protect the state
    field that holds the dispose list information.  Because there is now
    locking needed around xfs_buf_lru_add/del, and they are used in exactly
    one place each two lines apart, get rid of the wrappers and code the logic
    directly in place.
    
    Further, the LRU emptying code used on unmount is less than optimal.
    Convert it to use a dispose list as per a normal shrinker walk, and repeat
    the walk that fills the dispose list until the LRU is empty.  Thi avoids
    needing to drop and regain the LRU lock for every item being freed, and
    allows the same logic as the shrinker isolate call to be used.  Simpler,
    easier to understand.
    
    Signed-off-by: Dave Chinner <dchinner@redhat.com>
    Signed-off-by: Glauber Costa <glommer@openvz.org>
    Cc: "Theodore Ts'o" <tytso@mit.edu>
    Cc: Adrian Hunter <adrian.hunter@intel.com>
    Cc: Al Viro <viro@zeniv.linux.org.uk>
    Cc: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
    Cc: Arve Hjønnevåg <arve@android.com>
    Cc: Carlos Maiolino <cmaiolino@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Chuck Lever <chuck.lever@oracle.com>
    Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Gleb Natapov <gleb@redhat.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: J. Bruce Fields <bfields@redhat.com>
    Cc: Jan Kara <jack@suse.cz>
    Cc: Jerome Glisse <jglisse@redhat.com>
    Cc: John Stultz <john.stultz@linaro.org>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: Kent Overstreet <koverstreet@google.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Cc: Marcelo Tosatti <mtosatti@redhat.com>
    Cc: Mel Gorman <mgorman@suse.de>
    Cc: Steven Whitehouse <swhiteho@redhat.com>
    Cc: Thomas Hellstrom <thellstrom@vmware.com>
    Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    (cherry picked from commit a408235726aa82c0358c9ec68124b6f4bc0a79df)
    
    https://jira.sw.ru/browse/PSBM-55577
    
    Signed-off-by: Dmitry Monakhov <dmonakhov@openvz.org>
---
 fs/xfs/xfs_buf.c | 147 +++++++++++++++++++++++++++----------------------------
 fs/xfs/xfs_buf.h |   8 ++-
 2 files changed, 78 insertions(+), 77 deletions(-)

Patch hide | download patch | download mbox

diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index bf933d5..8d8c9ce 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -80,37 +80,6 @@  xfs_buf_vmap_len(
 }
 
 /*
- * xfs_buf_lru_add - add a buffer to the LRU.
- *
- * The LRU takes a new reference to the buffer so that it will only be freed
- * once the shrinker takes the buffer off the LRU.
- */
-static void
-xfs_buf_lru_add(
-	struct xfs_buf	*bp)
-{
-	if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
-		bp->b_lru_flags &= ~_XBF_LRU_DISPOSE;
-		atomic_inc(&bp->b_hold);
-	}
-}
-
-/*
- * xfs_buf_lru_del - remove a buffer from the LRU
- *
- * The unlocked check is safe here because it only occurs when there are not
- * b_lru_ref counts left on the inode under the pag->pag_buf_lock. it is there
- * to optimise the shrinker removing the buffer from the LRU and calling
- * xfs_buf_free().
- */
-static void
-xfs_buf_lru_del(
-	struct xfs_buf	*bp)
-{
-	list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
-}
-
-/*
  * Bump the I/O in flight count on the buftarg if we haven't yet done so for
  * this buffer. The count is incremented once per buffer (per hold cycle)
  * because the corresponding decrement is deferred to buffer release. Buffers
@@ -181,12 +150,14 @@  xfs_buf_stale(
 	 */
 	xfs_buf_ioacct_dec(bp);
 
-	atomic_set(&(bp)->b_lru_ref, 0);
-	if (!(bp->b_lru_flags & _XBF_LRU_DISPOSE) &&
+	spin_lock(&bp->b_lock);
+	atomic_set(&bp->b_lru_ref, 0);
+	if (!(bp->b_state & XFS_BSTATE_DISPOSE) &&
 	    (list_lru_del(&bp->b_target->bt_lru, &bp->b_lru)))
 		atomic_dec(&bp->b_hold);
 
 	ASSERT(atomic_read(&bp->b_hold) >= 1);
+	spin_unlock(&bp->b_lock);
 }
 
 static int
@@ -987,10 +958,28 @@  xfs_buf_rele(
 	/* the last reference has been dropped ... */
 	xfs_buf_ioacct_dec(bp);
 	if (!(bp->b_flags & XBF_STALE) && atomic_read(&bp->b_lru_ref)) {
-		xfs_buf_lru_add(bp);
+		/*
+		 * If the buffer is added to the LRU take a new
+		 * reference to the buffer for the LRU and clear the
+		 * (now stale) dispose list state flag
+		 */
+		if (list_lru_add(&bp->b_target->bt_lru, &bp->b_lru)) {
+			bp->b_state &= ~XFS_BSTATE_DISPOSE;
+			atomic_inc(&bp->b_hold);
+		}
 		spin_unlock(&pag->pag_buf_lock);
 	} else {
-		xfs_buf_lru_del(bp);
+		/*
+		 * most of the time buffers will already be removed from
+		 * the LRU, so optimise that case by checking for the
+		 * XFS_BSTATE_DISPOSE flag indicating the last list the
+		 * buffer was on was the disposal list
+		 */
+		if (!(bp->b_state & XFS_BSTATE_DISPOSE)) {
+			list_lru_del(&bp->b_target->bt_lru, &bp->b_lru);
+		} else {
+			ASSERT(list_empty(&bp->b_lru));
+		}
 		ASSERT(!(bp->b_flags & _XBF_DELWRI_Q));
 		rb_erase(&bp->b_rbnode, &pag->pag_buf_tree);
 		spin_unlock(&pag->pag_buf_lock);
@@ -1575,66 +1564,65 @@  xfs_buf_iomove(
 static enum lru_status
 xfs_buftarg_wait_rele(
 	struct list_head	*item,
+	struct list_lru_one     *lru,
 	spinlock_t		*lru_lock,
 	void			*arg)
 
 {
 	struct xfs_buf		*bp = container_of(item, struct xfs_buf, b_lru);
-
-	/*
-	 * First wait on the buftarg I/O count for all in-flight buffers to be
-	 * released. This is critical as new buffers do not make the LRU until
-	 * they are released.
-	 *
-	 * Next, flush the buffer workqueue to ensure all completion processing
-	 * has finished. Just waiting on buffer locks is not sufficient for
-	 * async IO as the reference count held over IO is not released until
-	 * after the buffer lock is dropped. Hence we need to ensure here that
-	 * all reference counts have been dropped before we start walking the
-	 * LRU list.
-	 */
-	while (percpu_counter_sum(&btp->bt_io_count))
-		delay(100);
-	flush_workqueue(btp->bt_mount->m_buf_workqueue);
+	struct list_head        *dispose = arg;
 
 	if (atomic_read(&bp->b_hold) > 1) {
 		/* need to wait */
 		trace_xfs_buf_wait_buftarg(bp, _RET_IP_);
-		spin_unlock(lru_lock);
-		delay(100);
-	} else {
-		/*
-		 * clear the LRU reference count so the buffer doesn't get
-		 * ignored in xfs_buf_rele().
-		 */
-		atomic_set(&bp->b_lru_ref, 0);
-		spin_unlock(lru_lock);
-		if (bp->b_flags & XBF_WRITE_FAIL) {
-			xfs_alert(btp->bt_mount,
-"Corruption Alert: Buffer at block 0x%llx had permanent write failures!",
-				(long long)bp->b_bn);
-			xfs_alert(btp->bt_mount,
-"Please run xfs_repair to determine the extent of the problem.");
-		}
-		xfs_buf_rele(bp);
+		return LRU_SKIP;
 	}
-
-	spin_lock(lru_lock);
-	return LRU_RETRY;
+	if (!spin_trylock(&bp->b_lock))
+		return LRU_SKIP;
+	/*
+	 * clear the LRU reference count so the buffer doesn't get
+	 * ignored in xfs_buf_rele().
+	 */
+	atomic_set(&bp->b_lru_ref, 0);
+	bp->b_state |= XFS_BSTATE_DISPOSE;
+	list_move(item, dispose);
+	spin_unlock(&bp->b_lock);
+	return LRU_REMOVED;
 }
 
 void
 xfs_wait_buftarg(
 	struct xfs_buftarg	*btp)
 {
-	while (list_lru_count(&btp->bt_lru))
+	LIST_HEAD(dispose);
+	int loop = 0;
+
+	/* loop until there is nothing left on the lru list. */
+	while (list_lru_count(&btp->bt_lru)) {
 		list_lru_walk(&btp->bt_lru, xfs_buftarg_wait_rele,
-			      NULL, LONG_MAX);
+			      &dispose, LONG_MAX);
+
+		while (!list_empty(&dispose)) {
+			struct xfs_buf *bp;
+			bp = list_first_entry(&dispose, struct xfs_buf, b_lru);
+			list_del_init(&bp->b_lru);
+			if (bp->b_flags & XBF_WRITE_FAIL) {
+				xfs_alert(btp->bt_mount,
+"Corruption Alert: Buffer at block 0x%llx had permanent write failures!\n"
+"Please run xfs_repair to determine the extent of the problem.",
+					(long long)bp->b_bn);
+			}
+			xfs_buf_rele(bp);
+		}
+		if (loop++ != 0)
+			delay(100);
+	}
 }
 
 static enum lru_status
 xfs_buftarg_isolate(
 	struct list_head	*item,
+	struct list_lru_one     *lru,
 	spinlock_t		*lru_lock,
 	void			*arg)
 {
@@ -1642,15 +1630,24 @@  xfs_buftarg_isolate(
 	struct list_head	*dispose = arg;
 
 	/*
+	 * we are inverting the lru lock/bp->b_lock here, so use a trylock.
+	 * If we fail to get the lock, just skip it.
+	 */
+	if (!spin_trylock(&bp->b_lock))
+		return LRU_SKIP;
+	/*
 	 * Decrement the b_lru_ref count unless the value is already
 	 * zero. If the value is already zero, we need to reclaim the
 	 * buffer, otherwise it gets another trip through the LRU.
 	 */
-	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0))
+	if (!atomic_add_unless(&bp->b_lru_ref, -1, 0)) {
+		spin_unlock(&bp->b_lock);
 		return LRU_ROTATE;
+	}
 
-	bp->b_lru_flags |= _XBF_LRU_DISPOSE;
+	bp->b_state |= XFS_BSTATE_DISPOSE;
 	list_move(item, dispose);
+	spin_unlock(&bp->b_lock);
 	return LRU_REMOVED;
 }
 
diff --git a/fs/xfs/xfs_buf.h b/fs/xfs/xfs_buf.h
index cb1421d..09f16fb 100644
--- a/fs/xfs/xfs_buf.h
+++ b/fs/xfs/xfs_buf.h
@@ -63,7 +63,6 @@  typedef enum {
 #define _XBF_KMEM	 (1 << 21)/* backed by heap memory */
 #define _XBF_DELWRI_Q	 (1 << 22)/* buffer on a delwri queue */
 #define _XBF_COMPOUND	 (1 << 23)/* compound buffer */
-#define _XBF_LRU_DISPOSE (1 << 24)/* buffer being discarded */
 #define _XBF_IN_FLIGHT	 (1 << 26) /* I/O in flight, for accounting purposes */
 
 typedef unsigned int xfs_buf_flags_t;
@@ -85,9 +84,13 @@  typedef unsigned int xfs_buf_flags_t;
 	{ _XBF_KMEM,		"KMEM" }, \
 	{ _XBF_DELWRI_Q,	"DELWRI_Q" }, \
 	{ _XBF_COMPOUND,	"COMPOUND" }, \
-	{ _XBF_LRU_DISPOSE,	"LRU_DISPOSE" }, \
 	{ _XBF_IN_FLIGHT,	"IN_FLIGHT" }
 
+/*
+ * Internal state flags.
+ */
+#define XFS_BSTATE_DISPOSE	 (1 << 0)	/* buffer being discarded */
+
 
 /*
  * The xfs_buftarg contains 2 notions of "sector size" -
@@ -162,6 +165,7 @@  typedef struct xfs_buf {
 	struct list_head	b_lru;		/* lru list */
 	xfs_buf_flags_t		b_lru_flags;	/* internal lru status flags */
 	spinlock_t		b_lock;		/* internal state lock */
+	unsigned int		b_state;	/* internal state flags */
 	int			b_io_error;	/* internal IO error state */
 	wait_queue_head_t	b_waiters;	/* unpin waiters */
 	struct list_head	b_list;