math: move x86_64 fabs, fabsf to C with inline asm

Submitted by Alexander Monakov on Jan. 5, 2020, 4:36 p.m.

Details

Message ID 20200105163639.25963-1-amonakov@ispras.ru
State New
Series "math: add x86_64 remquol"
Headers show

Commit Message

Alexander Monakov Jan. 5, 2020, 4:36 p.m.
---

Questions:

Why are there amd64-specific fabs implementations in the first place?
(Only) because GCC generated poor code for the generic C version?

Do annotations for mask manipulation in the patch help? Any way to make
them less ambiguous?


 src/math/x86_64/fabs.c  | 10 ++++++++++
 src/math/x86_64/fabs.s  |  9 ---------
 src/math/x86_64/fabsf.c | 10 ++++++++++
 src/math/x86_64/fabsf.s |  7 -------
 4 files changed, 20 insertions(+), 16 deletions(-)
 create mode 100644 src/math/x86_64/fabs.c
 delete mode 100644 src/math/x86_64/fabs.s
 create mode 100644 src/math/x86_64/fabsf.c
 delete mode 100644 src/math/x86_64/fabsf.s

Patch hide | download patch | download mbox

diff --git a/src/math/x86_64/fabs.c b/src/math/x86_64/fabs.c
new file mode 100644
index 00000000..16562477
--- /dev/null
+++ b/src/math/x86_64/fabs.c
@@ -0,0 +1,10 @@ 
+#include <math.h>
+
+double fabs(double x)
+{
+	double t;
+	__asm__ ("pcmpeqd %0, %0" : "=x"(t));          // t = ~0
+	__asm__ ("psrlq   $1, %0" : "+x"(t));          // t >>= 1
+	__asm__ ("andps   %1, %0" : "+x"(x) : "x"(t)); // x &= t
+	return x;
+}
diff --git a/src/math/x86_64/fabs.s b/src/math/x86_64/fabs.s
deleted file mode 100644
index 5715005e..00000000
--- a/src/math/x86_64/fabs.s
+++ /dev/null
@@ -1,9 +0,0 @@ 
-.global fabs
-.type fabs,@function
-fabs:
-	xor %eax,%eax
-	dec %rax
-	shr %rax
-	movq %rax,%xmm1
-	andpd %xmm1,%xmm0
-	ret
diff --git a/src/math/x86_64/fabsf.c b/src/math/x86_64/fabsf.c
new file mode 100644
index 00000000..36ea7481
--- /dev/null
+++ b/src/math/x86_64/fabsf.c
@@ -0,0 +1,10 @@ 
+#include <math.h>
+
+float fabsf(float x)
+{
+	float t;
+	__asm__ ("pcmpeqd %0, %0" : "=x"(t));          // t = ~0
+	__asm__ ("psrld   $1, %0" : "+x"(t));          // t >>= 1
+	__asm__ ("andps   %1, %0" : "+x"(x) : "x"(t)); // x &= t
+	return x;
+}
diff --git a/src/math/x86_64/fabsf.s b/src/math/x86_64/fabsf.s
deleted file mode 100644
index 501a1f17..00000000
--- a/src/math/x86_64/fabsf.s
+++ /dev/null
@@ -1,7 +0,0 @@ 
-.global fabsf
-.type fabsf,@function
-fabsf:
-	mov $0x7fffffff,%eax
-	movq %rax,%xmm1
-	andps %xmm1,%xmm0
-	ret

Comments

Rich Felker Jan. 5, 2020, 8:05 p.m.
On Sun, Jan 05, 2020 at 07:36:39PM +0300, Alexander Monakov wrote:
> ---
> 
> Questions:
> 
> Why are there amd64-specific fabs implementations in the first place?
> (Only) because GCC generated poor code for the generic C version?

I think so. It generates:

0000000000000000 <fabs>:
   0:   66 48 0f 7e c0          movq   %xmm0,%rax
   5:   48 ba ff ff ff ff ff    movabs $0x7fffffffffffffff,%rdx
   c:   ff ff 7f
   f:   48 21 d0                and    %rdx,%rax
  12:   48 89 44 24 f8          mov    %rax,-0x8(%rsp)
  17:   f2 0f 10 44 24 f8       movsd  -0x8(%rsp),%xmm0
  1d:   c3                      retq

> Do annotations for mask manipulation in the patch help? Any way to make
> them less ambiguous?

I think so. I like how you did individual asm statements with
dependency relationship between them so compiler could even schedule
them if it likes. I wonder if you could just write 0x7fffffffffffffff
as an operand and have the compiler load it, though.

Rich
Alexander Monakov Jan. 5, 2020, 9:32 p.m.
On Sun, 5 Jan 2020, Rich Felker wrote:

> On Sun, Jan 05, 2020 at 07:36:39PM +0300, Alexander Monakov wrote:
> > ---
> > 
> > Questions:
> > 
> > Why are there amd64-specific fabs implementations in the first place?
> > (Only) because GCC generated poor code for the generic C version?
> 
> I think so. It generates:
[snip]

*nod* In my eyes that's a missed optimization, but one that is probably not
going to be fully fixed anytime soon, although for the particular case of
generic fabs gcc-9 has improved:

        movq    %xmm0, %rax
        btrq    $63, %rax
        movq    %rax, %xmm0			

On Aarch64 GCC seems to do better with float bit manipulations (can emit code
that does them on vector registers directly without copying to/from general
registers). On x86 LLVM compiles fabs well, but not copysign.

(ideally the language would allow to express bit manipulations of floats
directly, then compilers probably would have better support as well)

FWIW GCC bugreport is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039
but I'm not holding my breath.

By this logic, specialized implementations of copysign are also desirable,
right?  (2 instructions longer than fabs, except for long double)

> > Do annotations for mask manipulation in the patch help? Any way to make
> > them less ambiguous?
> 
> I think so. I like how you did individual asm statements with
> dependency relationship between them so compiler could even schedule
> them if it likes. I wonder if you could just write 0x7fffffffffffffff
> as an operand and have the compiler load it, though.

In this case the mask is so simple that building it with pcmpeq-psrl is cheaper
than loading from memory or moving from a general register. So not using an
immediate is intentional.

Alexander
Rich Felker Jan. 5, 2020, 10:43 p.m.
On Mon, Jan 06, 2020 at 12:32:38AM +0300, Alexander Monakov wrote:
> 
> 
> On Sun, 5 Jan 2020, Rich Felker wrote:
> 
> > On Sun, Jan 05, 2020 at 07:36:39PM +0300, Alexander Monakov wrote:
> > > ---
> > > 
> > > Questions:
> > > 
> > > Why are there amd64-specific fabs implementations in the first place?
> > > (Only) because GCC generated poor code for the generic C version?
> > 
> > I think so. It generates:
> [snip]
> 
> *nod* In my eyes that's a missed optimization, but one that is probably not
> going to be fully fixed anytime soon, although for the particular case of
> generic fabs gcc-9 has improved:
> 
>         movq    %xmm0, %rax
>         btrq    $63, %rax
>         movq    %rax, %xmm0			
> 
> On Aarch64 GCC seems to do better with float bit manipulations (can emit code
> that does them on vector registers directly without copying to/from general
> registers). On x86 LLVM compiles fabs well, but not copysign.
> 
> (ideally the language would allow to express bit manipulations of floats
> directly, then compilers probably would have better support as well)
> 
> FWIW GCC bugreport is https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93039
> but I'm not holding my breath.
> 
> By this logic, specialized implementations of copysign are also desirable,
> right?  (2 instructions longer than fabs, except for long double)

I'm not sure if "this logic" carries over. fabs is a common operation
(ideally compiler would inline it anyway in the caller, though).
copysign not so much.

Really I'm not even sure it makes sense to have the asm here at all
for fabs either, but perhaps with the gratuitous stack access in the
older-GCC version it does...?

> > > Do annotations for mask manipulation in the patch help? Any way to make
> > > them less ambiguous?
> > 
> > I think so. I like how you did individual asm statements with
> > dependency relationship between them so compiler could even schedule
> > them if it likes. I wonder if you could just write 0x7fffffffffffffff
> > as an operand and have the compiler load it, though.
> 
> In this case the mask is so simple that building it with pcmpeq-psrl is cheaper
> than loading from memory or moving from a general register. So not using an
> immediate is intentional.

OK, I was figuring the compiler might be able to generate it easily
with vector insns if there were no "non-vector" arithmetic/bitwise ops
involved in the use of the result, but that's probably expecting too
much...

Rich
Alexander Monakov Jan. 6, 2020, 8:17 a.m.
On Sun, 5 Jan 2020, Rich Felker wrote:

> I'm not sure if "this logic" carries over. fabs is a common operation
> (ideally compiler would inline it anyway in the caller, though).
> copysign not so much.

Indeed, in practice we want compilers to recognize and inline simple math
functions, especially on x86 where all float registers are call-clobbered.
This makes it moot, except maybe for musl-internal uses like in floatscan.c,
but there the right solution is probably to use __builtin_ forms if available.

(to be clear: I think musl may not implement math functions in terms of
math builtins because it might result in a circular dependency, but using
builtins for complex math and higher-level functions like in floatscan.c
should be fair game)

For functions like sqrt GCC can be changed to default to -fno-math-errno
on *-musl targets, as progress on flipping it globally seems slow.

> Really I'm not even sure it makes sense to have the asm here at all
> for fabs either, but perhaps with the gratuitous stack access in the
> older-GCC version it does...?

That's why I'm asking, the situation just looks ambiguous to me, I can't
deduce the intent and I could imagine arguments both ways. I think I'd
prefer a more inclusive playground. Please come up with some policy or
guidelines if possible.

Alexander