math: optimize lrint on 32bit targets

Submitted by Szabolcs Nagy on Sept. 22, 2019, 8:43 p.m.

Details

Message ID: 20190922204335.GC22009@port70.net
State: New
Series: "math: optimize lrint on 32bit targets"

Commit Message

Szabolcs Nagy Sept. 22, 2019, 8:43 p.m.
* Szabolcs Nagy <nsz@port70.net> [2019-09-21 17:52:35 +0200]:
> this was discussed on irc.

did more benchmarks: on i486 branches seem better
than setting the sign bit, but on arm the branch is
worse, so i kept the original code and just changed
the code style (asuint macro instead of union).


From 67990a5c85fc5db55831f9ddddc58317e5b344b6 Mon Sep 17 00:00:00 2001
From: Szabolcs Nagy <nsz@port70.net>
Date: Mon, 16 Sep 2019 20:33:11 +0000
Subject: [PATCH] math: optimize lrint on 32bit targets

lrint in (LONG_MAX, 1/DBL_EPSILON) and in (-1/DBL_EPSILON, LONG_MIN)
is not trivial: rounding to int may be inexact, but the conversion to
int may overflow and then the inexact flag must not be raised. (the
overflow threshold is rounding mode dependent).

this matters on 32bit targets (without single instruction lrint or
rint), so the common case (when there is no overflow) is optimized by
inlining the lrint logic, otherwise the old code is kept as a fallback.

on my laptop an i486 lrint call is asm:10ns, old c:30ns, new c:21ns
on a smaller arm core: old c:71ns, new c:34ns
on a bigger arm core: old c:27ns, new c:19ns
---
 src/math/lrint.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/src/math/lrint.c b/src/math/lrint.c
index bdca8b7c..ddee7a0d 100644
--- a/src/math/lrint.c
+++ b/src/math/lrint.c
@@ -1,5 +1,6 @@ 
 #include <limits.h>
 #include <fenv.h>
+#include <math.h>
 #include "libm.h"
 
 /*
@@ -26,7 +27,18 @@ as a double.
 */
 
 #if LONG_MAX < 1U<<53 && defined(FE_INEXACT)
-long lrint(double x)
+#include <float.h>
+#include <stdint.h>
+#if FLT_EVAL_METHOD==0 || FLT_EVAL_METHOD==1
+#define EPS DBL_EPSILON
+#elif FLT_EVAL_METHOD==2
+#define EPS LDBL_EPSILON
+#endif
+#ifdef __GNUC__
+/* avoid stack frame in lrint */
+__attribute__((noinline))
+#endif
+static long lrint_slow(double x)
 {
 	#pragma STDC FENV_ACCESS ON
 	int e;
@@ -38,6 +50,20 @@ long lrint(double x)
 	/* conversion */
 	return x;
 }
+
+long lrint(double x)
+{
+	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
+	uint64_t sign = asuint64(x) & (1ULL << 63);
+
+	if (abstop < 0x41dfffff) {
+		/* |x| < 0x7ffffc00, no overflow */
+		double_t toint = asdouble(asuint64(1/EPS) | sign);
+		double_t y = x + toint - toint;
+		return (long)y;
+	}
+	return lrint_slow(x);
+}
 #else
 long lrint(double x)
 {
-- 
2.21.0
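
A minimal standalone sketch of the fast-path rounding trick above (an editor's illustration, not part of the patch; it uses memcpy-based bit casts in place of libm.h's asuint64/asdouble and assumes FLT_EVAL_METHOD == 0):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double d) { uint64_t i; memcpy(&i, &d, 8); return i; }
static double double_of(uint64_t i) { double d; memcpy(&d, &i, 8); return d; }

int main(void)
{
	double x = -2.5; /* any value below the 0x41dfffff cutoff, i.e. |x| < 0x7ffffc00 */
	uint64_t sign = bits_of(x) & (1ULL << 63);
	/* 1/DBL_EPSILON == 0x1p52, bit pattern 0x4330000000000000: adding a
	   sign-matched 2^52 pushes the fraction bits of x out of the 53-bit
	   mantissa, so the addition rounds x at the ones place in the current
	   rounding mode; subtracting it back leaves the rounded value. */
	double toint = double_of(0x4330000000000000ULL | sign);
	double y = x + toint - toint;
	printf("%g -> %g\n", x, y); /* -2.5 -> -2 under round-to-nearest-even */
	return 0;
}

Since 0x7ffffc00 is 2^31 - 1024, the rounded result always fits in a 32-bit long, which is why the fast path's final conversion can neither overflow nor raise a spurious inexact flag.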


Comments

Rich Felker Sept. 23, 2019, 2:24 p.m.
On Sun, Sep 22, 2019 at 10:43:35PM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2019-09-21 17:52:35 +0200]:
> > this was discussed on irc.
> 
> did more benchmarks: on i486 branches seem better
> than setting the sign bit, but on arm the branch is
> worse, so i kept the original code and just changed
> the code style (asuint macro instead of union).
> 

> [...]
> +long lrint(double x)
> +{
> +	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> +	uint64_t sign = asuint64(x) & (1ULL << 63);
> +
> +	if (abstop < 0x41dfffff) {
> +		/* |x| < 0x7ffffc00, no overflow */
> +		double_t toint = asdouble(asuint64(1/EPS) | sign);
> +		double_t y = x + toint - toint;
> +		return (long)y;
> +	}
> +	return lrint_slow(x);
> +}
>  #else
>  long lrint(double x)
>  {
> -- 

Looks good! Thanks for working on this.

Does asuint64(1/EPS) compile to an integer constant rather than
needing to load a floating point operand? I would assume so but just
want to check, since otherwise it might make more sense to write this
as an expression involving [L]DBL_MANT_DIG and integer bitshifts.

Rich
Szabolcs Nagy Sept. 23, 2019, 2:54 p.m.
* Rich Felker <dalias@libc.org> [2019-09-23 10:24:36 -0400]:
> On Sun, Sep 22, 2019 at 10:43:35PM +0200, Szabolcs Nagy wrote:
> > +long lrint(double x)
> > +{
> > +	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> > +	uint64_t sign = asuint64(x) & (1ULL << 63);
> > +
> > +	if (abstop < 0x41dfffff) {
> > +		/* |x| < 0x7ffffc00, no overflow */
> > +		double_t toint = asdouble(asuint64(1/EPS) | sign);
> > +		double_t y = x + toint - toint;
> > +		return (long)y;
> > +	}
> > +	return lrint_slow(x);
> > +}
> >  #else
> >  long lrint(double x)
> >  {
> > -- 
> 
> Looks good! Thanks for working on this.
> 
> Does asuint64(1/EPS) compile to an integer constant rather than
> needing to load a floating point operand? I would assume so but just
> want to check, since otherwise it might make more sense to write this
> as an expression involving [L]DBL_MANT_DIG and integer bitshifts.

i think if 1/EPS was rounding mode dependent then it would
be computed at runtime, but since it's an exact power-of-two
gcc const folds it (on arm there is no load, the value is put
together by bitops with immediates)
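
For the EPS == DBL_EPSILON case, the bitshift form Rich asked about would look something like the sketch below (an illustration of the idea, not code from the thread); it yields the same constant that gcc folds asuint64(1/EPS) into:

#include <float.h>
#include <stdint.h>

/* 1/DBL_EPSILON is exactly 2^(DBL_MANT_DIG-1), so its IEEE-754 binary64
   encoding is just the biased exponent field with a zero mantissa.
   (For FLT_EVAL_METHOD == 2, i.e. EPS == LDBL_EPSILON, the exponent
   would be LDBL_MANT_DIG-1 instead.) */
#define TOINT_BITS ((uint64_t)(DBL_MANT_DIG - 1 + 1023) << 52)
/* TOINT_BITS == 0x4330000000000000 == asuint64(0x1p52) */
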
Rich Felker Sept. 23, 2019, 4:08 p.m.
On Mon, Sep 23, 2019 at 04:54:23PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-09-23 10:24:36 -0400]:
> > On Sun, Sep 22, 2019 at 10:43:35PM +0200, Szabolcs Nagy wrote:
> > > +long lrint(double x)
> > > +{
> > > +	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> > > +	uint64_t sign = asuint64(x) & (1ULL << 63);
> > > +
> > > +	if (abstop < 0x41dfffff) {
> > > +		/* |x| < 0x7ffffc00, no overflow */
> > > +		double_t toint = asdouble(asuint64(1/EPS) | sign);
> > > +		double_t y = x + toint - toint;
> > > +		return (long)y;
> > > +	}
> > > +	return lrint_slow(x);
> > > +}
> > >  #else
> > >  long lrint(double x)
> > >  {
> > > -- 
> > 
> > Looks good! Thanks for working on this.
> > 
> > Does asuint64(1/EPS) compile to an integer constant rather than
> > needing to load a floating point operand? I would assume so but just
> > want to check, since otherwise it might make more sense to write this
> > as an expression involving [L]DBL_MANT_DIG and integer bitshifts.
> 
> i think if 1/EPS was rounding mode dependent then it would
> be computed at runtime, but since it's an exact power-of-two
> gcc const folds it (on arm there is no load, the value is put
> together by bitops with immediates)

Nice. I'm checking and yes it looks fine. Looks similar (modulo bad
codegen in general) on sh4 too, chosen as another arch with hardfloat
but no optimized lrint, and where there's not really any good way to
write one -- the only conversion insn is a truncating one.

Rich
Rich Felker Sept. 23, 2019, 5:40 p.m.
On Sun, Sep 22, 2019 at 10:43:35PM +0200, Szabolcs Nagy wrote:
> * Szabolcs Nagy <nsz@port70.net> [2019-09-21 17:52:35 +0200]:
> > this was discussed on irc.
> 
> did more benchmarks: on i486 branches seem better
> than setting the sign bit, but on arm the branch is
> worse, so i kept the original code and just changed
> the code style (asuint macro instead of union).
> 

> [...]
> +long lrint(double x)
> +{
> +	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> +	uint64_t sign = asuint64(x) & (1ULL << 63);
> +
> +	if (abstop < 0x41dfffff) {
> +		/* |x| < 0x7ffffc00, no overflow */
> +		double_t toint = asdouble(asuint64(1/EPS) | sign);
> +		double_t y = x + toint - toint;
> +		return (long)y;
> +	}
> +	return lrint_slow(x);
> +}
>  #else
>  long lrint(double x)
>  {

This code should be considerably faster than calling rint on 64-bit
archs too, no? I wonder if it should be something like (untested,
written inline here):

long lrint(double x)
{
	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
	uint64_t sign = asuint64(x) & (1ULL << 63);

#if LONG_MAX < 1U<<53 && defined(FE_INEXACT)
	if (abstop >= 0x41dfffff) return lrint_slow(x);
#endif
	/* |x| < 0x7ffffc00, no overflow */
	double_t toint = asdouble(asuint64(1/EPS) | sign);
	double_t y = x + toint - toint;
	return (long)y;
}

Rich
Szabolcs Nagy Sept. 23, 2019, 6:38 p.m.
* Rich Felker <dalias@libc.org> [2019-09-23 13:40:29 -0400]:
> On Sun, Sep 22, 2019 at 10:43:35PM +0200, Szabolcs Nagy wrote:
> > +long lrint(double x)
> > +{
> > +	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> > +	uint64_t sign = asuint64(x) & (1ULL << 63);
> > +
> > +	if (abstop < 0x41dfffff) {
> > +		/* |x| < 0x7ffffc00, no overflow */
> > +		double_t toint = asdouble(asuint64(1/EPS) | sign);
> > +		double_t y = x + toint - toint;
> > +		return (long)y;
> > +	}
> > +	return lrint_slow(x);
> > +}
> >  #else
> >  long lrint(double x)
> >  {
> 
> This code should be considerably faster than calling rint on 64-bit
> archs too, no? I wonder if it should be something like (untested,
> written inline here):

yeah i'd expect it to be a bit faster, but e.g. a
target may prefer sign?-1/EPS:1/EPS to 1/EPS|sign,
and you need a threshold check even if there is no
inexact overflow issue:

> long lrint(double x)
> {
> 	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> 	uint64_t sign = asuint64(x) & (1ULL << 63);
> 
> #if LONG_MAX < 1U<<53 && defined(FE_INEXACT)
> 	if (abstop >= 0x41dfffff) return lrint_slow(x);

#else
	if (abstop >= 0x43300000) return (long)x;
	/* |x| < 2^52 <= 1/EPS */

> #endif
> 	/* |x| < 0x7ffffc00, no overflow */
> 	double_t toint = asdouble(asuint64(1/EPS) | sign);
> 	double_t y = x + toint - toint;
> 	return (long)y;
> }

i can try to benchmark this (although on x86_64 and
aarch64 there is single instruction lrint so i can
only benchmark machines where this is not relevant).
Rich Felker Sept. 23, 2019, 8:42 p.m.
On Mon, Sep 23, 2019 at 08:38:19PM +0200, Szabolcs Nagy wrote:
> * Rich Felker <dalias@libc.org> [2019-09-23 13:40:29 -0400]:
> > On Sun, Sep 22, 2019 at 10:43:35PM +0200, Szabolcs Nagy wrote:
> > > +long lrint(double x)
> > > +{
> > > +	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> > > +	uint64_t sign = asuint64(x) & (1ULL << 63);
> > > +
> > > +	if (abstop < 0x41dfffff) {
> > > +		/* |x| < 0x7ffffc00, no overflow */
> > > +		double_t toint = asdouble(asuint64(1/EPS) | sign);
> > > +		double_t y = x + toint - toint;
> > > +		return (long)y;
> > > +	}
> > > +	return lrint_slow(x);
> > > +}
> > >  #else
> > >  long lrint(double x)
> > >  {
> > 
> > This code should be considerably faster than calling rint on 64-bit
> > archs too, no? I wonder if it should be something like (untested,
> > written inline here):
> 
> yeah i'd expect it to be a bit faster, but e.g. a
> target may prefer sign?-1/EPS:1/EPS to 1/EPS|sign,

Yes, perhaps. But this will be a lot less of a difference than an
extra call frame I think.

> and you need a threshold check even if there is no
> inexact overflow issue:

Ah yes.

> > long lrint(double x)
> > {
> > 	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
> > 	uint64_t sign = asuint64(x) & (1ULL << 63);
> > 
> > #if LONG_MAX < 1U<<53 && defined(FE_INEXACT)
> > 	if (abstop >= 0x41dfffff) return lrint_slow(x);
> 
> #else
> 	if (abstop >= 0x43300000) return (long)x;
> 	/* |x| < 2^52 <= 1/EPS */
> 
> > #endif
> > 	/* |x| < 0x7ffffc00, no overflow */
> > 	double_t toint = asdouble(asuint64(1/EPS) | sign);
> > 	double_t y = x + toint - toint;
> > 	return (long)y;
> > }
> 
> i can try to benchmark this (although on x86_64 and
> aarch64 there is single instruction lrint so i can
> only benchmark machines where this is not relevant).

Yeah I think it's mainly riscv64 and s390x that might benefit right
now. Everything else 64-bit seems to have an optimized one.

Rich
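
Putting the thread together, the generalized version discussed above would look roughly like this (untested, assembled from Rich's sketch plus the #else branch Szabolcs added; the EPS definition from the patch would need to move outside the 32-bit-only #if for the 64-bit path to build):

long lrint(double x)
{
	uint32_t abstop = asuint64(x)>>32 & 0x7fffffff;
	uint64_t sign = asuint64(x) & (1ULL << 63);

#if LONG_MAX < 1U<<53 && defined(FE_INEXACT)
	/* rounding may be inexact while the conversion to long overflows,
	   so leave the boundary region to the fenv-aware fallback */
	if (abstop >= 0x41dfffff) return lrint_slow(x);
#else
	/* |x| >= 2^52: x already has no fraction bits, convert directly */
	if (abstop >= 0x43300000) return (long)x;
#endif
	/* no overflow possible: round in the current mode via the 2^52 trick */
	double_t toint = asdouble(asuint64(1/EPS) | sign);
	double_t y = x + toint - toint;
	return (long)y;
}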