math: move i386 sqrt to C

Submitted by Alexander Monakov on Jan. 7, 2020, 1:06 p.m.

Details

Message ID 20200107130605.7618-1-amonakov@ispras.ru
State New
Headers show

Patch hide | download patch | download mbox

diff --git a/src/math/i386/sqrt.c b/src/math/i386/sqrt.c
new file mode 100644
index 00000000..619df056
--- /dev/null
+++ b/src/math/i386/sqrt.c
@@ -0,0 +1,15 @@ 
+#include "libm.h"
+
+double sqrt(double x)
+{
+	union ldshape ux;
+	unsigned fpsr;
+	__asm__ ("fsqrt; fnstsw %%ax": "=t"(ux.f), "=a"(fpsr) : "0"(x));
+	if ((ux.i.m & 0x7ff) != 0x400)
+		return (double)ux.f;
+	/* Rounding to double would have encountered an exact halfway case.
+	   Adjust mantissa downwards if fsqrt rounded up, else upwards.
+	   (result of fsqrt could not have been exact) */
+	ux.i.m ^= (fpsr & 0x200) + 0x200;
+	return (double)ux.f;
+}
diff --git a/src/math/i386/sqrt.s b/src/math/i386/sqrt.s
deleted file mode 100644
index 57837e25..00000000
--- a/src/math/i386/sqrt.s
+++ /dev/null
@@ -1,21 +0,0 @@ 
-.global sqrt
-.type sqrt,@function
-sqrt:	fldl 4(%esp)
-	fsqrt
-	fnstsw %ax
-	sub $12,%esp
-	fld %st(0)
-	fstpt (%esp)
-	mov (%esp),%ecx
-	and $0x7ff,%ecx
-	cmp $0x400,%ecx
-	jnz 1f
-	and $0x200,%eax
-	sub $0x100,%eax
-	sub %eax,(%esp)
-	fstp %st(0)
-	fldt (%esp)
-1:	add $12,%esp
-	fstpl 4(%esp)
-	fldl 4(%esp)
-	ret

Comments

Rich Felker Jan. 8, 2020, 7:26 a.m.
On Tue, Jan 07, 2020 at 04:06:05PM +0300, Alexander Monakov wrote:
> ---
> Since union ldshape does not have a dedicated field for 32 least significant
> bits of the x87 long double mantissa, keeping the original approach with
> 
>     ux.i.m -= (fpsr & 0x200) - 0x100;
> 
> would lead to a 64-bit subtraction that is not trivial for the compiler to
> optimize to 32-bit subtraction as done in the original assembly. Therefore
> I have elected to change the approach and use
> 
>     ux.i.m ^= (fpsr & 0x200) + 0x200;
> 
> which is easier to optimize to a 32-bit rather than 64-bit xor.
> 
> Thoughts?

I think it looks like a good change. Ideally the compiler would see,
since the branch was taken where the low bits are 0x400, that
subtracting a value less than 0x400 can't borrow from the high 32
bits. But I don't think current compilers do such range/value
analysis, and it doesn't seem like there are anny downsides to your
approach anyway.

Rich
Rich Felker March 21, 2020, 5:53 p.m.
On Tue, Jan 07, 2020 at 04:06:05PM +0300, Alexander Monakov wrote:
> ---
> Since union ldshape does not have a dedicated field for 32 least significant
> bits of the x87 long double mantissa, keeping the original approach with
> 
>     ux.i.m -= (fpsr & 0x200) - 0x100;
> 
> would lead to a 64-bit subtraction that is not trivial for the compiler to
> optimize to 32-bit subtraction as done in the original assembly. Therefore
> I have elected to change the approach and use
> 
>     ux.i.m ^= (fpsr & 0x200) + 0x200;
> 
> which is easier to optimize to a 32-bit rather than 64-bit xor.
> 
> Thoughts?

I'm getting test failures with sqrt and this seems to be the culprit
-- I don't think it's equivalent. The original version could offset
the value by +0x100 or -0x100 before rounding, and offsets in the
opposite direction of the rounding that already occurred. Your version
can only offset it by +0x200 or -0x400.

The (well, one) particular failing case is:

src/math/ucb/sqrt.h:49: RU sqrt(0x1.fffffffffffffp+1023) want 0x1p+512
got 0x1.fffffffffffffp+511 ulperr -0.250 = -0x1p-1 + 0x1p-2

Here the mantissa is

fffffffffffffc00

and offset by -0x400 yields:

fffffffffffff800

which has exactly 53 bits and therefore does not round up like it
should.

I still like your approach better if there's a way to salvage it. Do
you see one?

Rich
Rich Felker March 21, 2020, 5:57 p.m.
On Sat, Mar 21, 2020 at 01:53:51PM -0400, Rich Felker wrote:
> On Tue, Jan 07, 2020 at 04:06:05PM +0300, Alexander Monakov wrote:
> > ---
> > Since union ldshape does not have a dedicated field for 32 least significant
> > bits of the x87 long double mantissa, keeping the original approach with
> > 
> >     ux.i.m -= (fpsr & 0x200) - 0x100;
> > 
> > would lead to a 64-bit subtraction that is not trivial for the compiler to
> > optimize to 32-bit subtraction as done in the original assembly. Therefore
> > I have elected to change the approach and use
> > 
> >     ux.i.m ^= (fpsr & 0x200) + 0x200;
> > 
> > which is easier to optimize to a 32-bit rather than 64-bit xor.
> > 
> > Thoughts?
> 
> I'm getting test failures with sqrt and this seems to be the culprit
> -- I don't think it's equivalent. The original version could offset
> the value by +0x100 or -0x100 before rounding, and offsets in the
> opposite direction of the rounding that already occurred. Your version
> can only offset it by +0x200 or -0x400.
> 
> The (well, one) particular failing case is:
> 
> src/math/ucb/sqrt.h:49: RU sqrt(0x1.fffffffffffffp+1023) want 0x1p+512
> got 0x1.fffffffffffffp+511 ulperr -0.250 = -0x1p-1 + 0x1p-2
> 
> Here the mantissa is
> 
> fffffffffffffc00
> 
> and offset by -0x400 yields:
> 
> fffffffffffff800
> 
> which has exactly 53 bits and therefore does not round up like it
> should.
> 
> I still like your approach better if there's a way to salvage it. Do
> you see one?

And, I think I do. Changing it to:

    ux.i.m ^= (fpsr & 0x200) + 0x300;

yields an offset of +0x300 (^0x300) or -0x300 (^0x500). This looks
like it should work theoretically, and indeed it passes libc-test.

Rich
Alexander Monakov March 21, 2020, 8:30 p.m.
On Sat, 21 Mar 2020, Rich Felker wrote:

> On Sat, Mar 21, 2020 at 01:53:51PM -0400, Rich Felker wrote:
> > On Tue, Jan 07, 2020 at 04:06:05PM +0300, Alexander Monakov wrote:
> > > ---
> > > Since union ldshape does not have a dedicated field for 32 least significant
> > > bits of the x87 long double mantissa, keeping the original approach with
> > > 
> > >     ux.i.m -= (fpsr & 0x200) - 0x100;
> > > 
> > > would lead to a 64-bit subtraction that is not trivial for the compiler to
> > > optimize to 32-bit subtraction as done in the original assembly. Therefore
> > > I have elected to change the approach and use
> > > 
> > >     ux.i.m ^= (fpsr & 0x200) + 0x200;
> > > 
> > > which is easier to optimize to a 32-bit rather than 64-bit xor.
> > > 
> > > Thoughts?
> > 
> > I'm getting test failures with sqrt and this seems to be the culprit
> > -- I don't think it's equivalent. The original version could offset
> > the value by +0x100 or -0x100 before rounding, and offsets in the
> > opposite direction of the rounding that already occurred. Your version
> > can only offset it by +0x200 or -0x400.
> > 
> > The (well, one) particular failing case is:
> > 
> > src/math/ucb/sqrt.h:49: RU sqrt(0x1.fffffffffffffp+1023) want 0x1p+512
> > got 0x1.fffffffffffffp+511 ulperr -0.250 = -0x1p-1 + 0x1p-2
> > 
> > Here the mantissa is
> > 
> > fffffffffffffc00
> > 
> > and offset by -0x400 yields:
> > 
> > fffffffffffff800
> > 
> > which has exactly 53 bits and therefore does not round up like it
> > should.
> > 
> > I still like your approach better if there's a way to salvage it. Do
> > you see one?
> 
> And, I think I do. Changing it to:
> 
>     ux.i.m ^= (fpsr & 0x200) + 0x300;
> 
> yields an offset of +0x300 (^0x300) or -0x300 (^0x500). This looks
> like it should work theoretically, and indeed it passes libc-test.

Indeed, I was considering only the default (to-nearest) rounding mode
and did not notice the problem for upwards rounding mode.

I think your change solves this nicely.

Thanks
Alexander