More patches for math subtree

Submitted by Stefan Kanthak on Dec. 10, 2019, 4:57 p.m.

Details

Message ID 2C3325A208DA4260A1A0F7B4517D6DFA@H270
State New
Series "More patches for math subtree"
Headers show

Commit Message

Stefan Kanthak Dec. 10, 2019, 4:57 p.m.
Some more optimisations: the current implementations of ceil(), floor()
and trunc() for i386 change the rounding control using fldcw instructions,
which are SLOW; these patches provide faster and smaller branch-free (!)
implementations.

JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!

Patch hide | download patch | download mbox

--- -/src/math/i386/floor.s
+++ +/src/math/i386/floor.s
@@ -1,67 +1,26 @@ 
 .global floorf
 .type floorf,@function
 floorf:
         flds 4(%esp)
         jmp 1f
 
 .global floorl
 .type floorl,@function
 floorl:
         fldt 4(%esp)
         jmp 1f
 
 .global floor
 .type floor,@function
 floor:
         fldl 4(%esp)
+1:      fld %st(0)
+        frndint
+        fxch %st(1)
+        fucomip %st(1),%st(0)
+        fld1
+        fldz
+        fcmovb %st(1),%st(0)
+        fsubp %st(0),%st(2)
+        fstp %st(0)
+        ret
-1:      mov $0x7,%al
-1:      fstcw 4(%esp)
-        mov 5(%esp),%ah
-        mov %al,5(%esp)
-        fldcw 4(%esp)
-        frndint
-        mov %ah,5(%esp)
-        fldcw 4(%esp)
-        ret
-
-.global ceil
-.type ceil,@function
-ceil:
-        fldl 4(%esp)
-        mov $0xb,%al
-        jmp 1b
-
-.global ceilf
-.type ceilf,@function
-ceilf:
-        flds 4(%esp)
-        mov $0xb,%al
-        jmp 1b
-
-.global ceill
-.type ceill,@function
-ceill:
-        fldt 4(%esp)
-        mov $0xb,%al
-        jmp 1b
-
-.global trunc
-.type trunc,@function
-trunc:
-        fldl 4(%esp)
-        mov $0xf,%al
-        jmp 1b
-
-.global truncf
-.type truncf,@function
-truncf:
-        flds 4(%esp)
-        mov $0xf,%al
-        jmp 1b
-
-.global truncl
-.type truncl,@function
-truncl:
-        fldt 4(%esp)
-        mov $0xf,%al
-        jmp 1b

--- -/src/math/i386/ceilf.s
+++ +/src/math/i386/ceilf.s
@@ -1,1 +1,1 @@ 
-# see floor.s
+# see ceil.s

--- -/src/math/i386/ceill.s
+++ +/src/math/i386/ceill.s
@@ -1,1 +1,1 @@ 
-# see floor.s
+# see ceil.s

--- -/src/math/i386/ceil.s
+++ +/src/math/i386/ceil.s
@@ -1,1 +1,26 @@ 
-# see floor.s
+.global ceilf
+.type ceilf,@function
+ceilf:
+        flds 4(%esp)
+        jmp 1f
+
+.global ceill
+.type ceill,@function
+ceill:
+        fldt 4(%esp)
+        jmp 1f
+
+.global ceil
+.type ceil,@function
+ceil:
+        fldl 4(%esp)
+1:      fld %st(0)
+        frndint
+        fxch %st(1)
+        fucomip %st(1),%st(0)
+        fld1
+        fldz
+        fcmovnbe %st(1),%st(0)
+        faddp %st(0),%st(1)
+        fstp %st(0)
+        ret

--- -/src/math/i386/truncf.s
+++ +/src/math/i386/truncf.s
@@ -1,1 +1,1 @@ 
-# see floor.s
+# see trunc.s

--- -/src/math/i386/truncl.s
+++ +/src/math/i386/truncl.s
@@ -1,1 +1,1 @@ 
-# see floor.s
+# see trunc.s

--- -/src/math/i386/trunc.s
+++ +/src/math/i386/trunc.s
@@ -1,1 +1,32 @@ 
-# see floor.s
+.global truncf
+.type truncf,@function
+truncf:
+        flds 4(%esp)
+        jmp 1f
+
+.global truncl
+.type truncl,@function
+truncl:
+        fldt 4(%esp)
+        jmp 1f
+
+.global trunc
+.type trunc,@function
+trunc:
+        fldl 4(%esp)
+1:      fld %st(0)
+        fabs
+        fld %st(0)
+        frndint
+        fxch %st(1)
+        fucomip %st(1),%st(0)
+        fldz
+        fld1
+        fcmovnb %st(1),%st(0)
+        fsubp %st(0),%st(2)
+        fucomip %st(2),%st(0)
+        fst %st(1)
+        fchs
+        fcmovbe %st(1),%st(0)
+        fstp %st(1)
+        ret

Comments

Rich Felker Dec. 10, 2019, 7:35 p.m.
On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> Some more optimisations: the current implementations of ceil(), floor()
> and trunc() for i386 change the rounding control using fldcw instructions,
> which are SLOW; these patches provide faster and smaller branch-free (!)
> implementations.
> 
> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> 
> --- -/src/math/i386/floor.s
> +++ +/src/math/i386/floor.s
> @@ -1,67 +1,26 @@
>  .global floorf
>  .type floorf,@function
>  floorf:
>          flds 4(%esp)
>          jmp 1f
>  
>  .global floorl
>  .type floorl,@function
>  floorl:
>          fldt 4(%esp)
>          jmp 1f
>  
>  .global floor
>  .type floor,@function
>  floor:
>          fldl 4(%esp)
> +1:      fld %st(0)
> +        frndint
> +        fxch %st(1)
> +        fucomip %st(1),%st(0)
> +        fld1
> +        fldz
> +        fcmovb %st(1),%st(0)
           ^^^^^^

fcmovb is not in the baseline ISA. Otherwise, I *think* the idea of
this patch looks good, provided I'm not missing anything with respect
to how status flags are affected.

As noted in the other email (sorry about not CC'ing you before; I've
got you on CC now), I really want to get rid of all these .s files in
favor of __asm__ statements with proper constraints in C source files.
That makes them inlineable with LTO, and makes it possible for the
compiler to select to use an instruction like fcmovb conditionally
based on the targeted ISA level rather than having to do a .S file
with hard-coded preprocessor conditionals. It also precludes x87 stack
imbalance bugs like CVE-2019-14697, which make me really wary of
manual changes to these files.

Would you be interested in working on converting over the files you
want to optimize (or even others too) to that form at the same time as
doing the optimizations? It would really help with review process and
with improving the overall code state.

Rich
Stefan Kanthak Dec. 10, 2019, 9:32 p.m.
"Rich Felker" <dalias@libc.org> wrote:


> On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
>> Some more optimisations: the current implementations of ceil(), floor()
>> and trunc() for i386 change the rounding control using fldcw instructions,
>> which are SLOW; these patches provide faster and smaller branch-free (!)
>> implementations.
>> 
>> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
>> 
>> --- -/src/math/i386/floor.s
>> +++ +/src/math/i386/floor.s
>> @@ -1,67 +1,26 @@
>>  .global floorf
>>  .type floorf,@function
>>  floorf:
>>          flds 4(%esp)
>>          jmp 1f
>>  
>>  .global floorl
>>  .type floorl,@function
>>  floorl:
>>          fldt 4(%esp)
>>          jmp 1f
>>  
>>  .global floor
>>  .type floor,@function
>>  floor:
>>          fldl 4(%esp)
>> +1:      fld %st(0)
>> +        frndint
>> +        fxch %st(1)
>> +        fucomip %st(1),%st(0)
>> +        fld1
>> +        fldz
>> +        fcmovb %st(1),%st(0)
>           ^^^^^^
> 
> fcmovb is not in the baseline ISA.

This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
FUCOMI* were introduced with the PentiumPro. If you allow the use of
the latter you can safely use the former too. And FCOMI* and FUCOMI*
are already used in other .S files.

> Otherwise, I *think* the idea of this patch looks good, provided I'm
> not missing anything with respect to how status flags are affected.

FRNDINT takes care of them!

> As noted in the other email (sorry about not CC'ing you before; I've
> got you on CC now), I really want to get rid of all these .s files in
> favor of __asm__ statements with proper constraints in C source files.
> That makes them inlineable with LTO, and makes it possible for the
> compiler to select to use an instruction like fcmovb conditionally
> based on the targeted ISA level rather than having to do a .S file
> with hard-coded preprocessor conditionals.

While this is generally good idea, there's no guarantee that a compiler
will emit a branch-free instruction sequence like those shown above.
I also doubt that a compiler will produce the 5 instruction sequence
shown in my patch for src/math/i386/remquo.S which collects the FPU
flags C0, C3 and C1 set by FPREM.

I noticed that you provide .S files for "long double" on x86-64, but
not for "double" and "float". I therefore assume that you use the
SSE floating-point instructions there, respectively let the compiler
use them.
Does any compiler emit branch-free instruction sequences like the
following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?

        .code   ; Intel syntax
ceil    proc    public
        extern  __real@8000000000000000:real8
        movsd   xmm1, __real@8000000000000000
        extern  __real@3ff0000000000000:real8
        movsd   xmm2, __real@3ff0000000000000
        extern  __real@4330000000000000:real8
        movsd   xmm3, __real@4330000000000000
        movsd   xmm4, xmm1
        andnpd  xmm1, xmm0
        andpd   xmm4, xmm0
        cmpltsd xmm1, xmm3
        andpd   xmm1, xmm3
        orpd    xmm1, xmm4
        movsd   xmm3, xmm0
        addsd   xmm0, xmm1
        subsd   xmm0, xmm1
        movsd   xmm1, xmm0
        cmpltsd xmm0, xmm3
        andpd   xmm0, xmm2
        addsd   xmm0, xmm1
        orpd    xmm0, xmm4
        ret
ceil    endp

Or instruction sequences like

        .code   ; Intel syntax
copysign proc   public
        movd    rcx, xmm0
        movd    rdx, xmm1
        shld    rcx, rdx, 1
        ror     rcx, 1
        movd    xmm0, rcx
        ret
copysign endp

        .code   ; Intel syntax
fdim    proc    public
        movsd   xmm2, xmm0
        cmpsd   xmm0, xmm1, 6
        subsd   xmm2, xmm1
        andpd   xmm0, xmm2
        ret
fdim    endp

> It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> make me really wary of manual changes to these files.
> 
> Would you be interested in working on converting over the files you
> want to optimize (or even others too) to that form at the same time as
> doing the optimizations?

I don't use musl-libc; I also don't use an OS or a compiler/assembler
which can be used to build it.
I just stumbled upon the functions for which I sent in patches while
searching for code which uses Intel's FPU.

> It would really help with review process and with improving the overall
> code state.

If I start using musl-libc I'd be interested and rewrite these parts.

regards
Stefan
Rich Felker Dec. 10, 2019, 10:17 p.m.
On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
> "Rich Felker" <dalias@libc.org> wrote:
> 
> 
> > On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> >> Some more optimisations: the current implementations of ceil(), floor()
> >> and trunc() for i386 change the rounding control using fldcw instructions,
> >> which are SLOW; these patches provide faster and smaller branch-free (!)
> >> implementations.
> >> 
> >> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> >> 
> >> --- -/src/math/i386/floor.s
> >> +++ +/src/math/i386/floor.s
> >> @@ -1,67 +1,26 @@
> >>  .global floorf
> >>  .type floorf,@function
> >>  floorf:
> >>          flds 4(%esp)
> >>          jmp 1f
> >>  
> >>  .global floorl
> >>  .type floorl,@function
> >>  floorl:
> >>          fldt 4(%esp)
> >>          jmp 1f
> >>  
> >>  .global floor
> >>  .type floor,@function
> >>  floor:
> >>          fldl 4(%esp)
> >> +1:      fld %st(0)
> >> +        frndint
> >> +        fxch %st(1)
> >> +        fucomip %st(1),%st(0)
> >> +        fld1
> >> +        fldz
> >> +        fcmovb %st(1),%st(0)
> >           ^^^^^^
> > 
> > fcmovb is not in the baseline ISA.
> 
> This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
> FUCOMI* were introduced with the PentiumPro. If you allow the use of
> the latter you can safely use the former too. And FCOMI* and FUCOMI*
> are already used in other .S files.

This is why we're not using them. I think you're looking at x86_64
where they are in the baseline ISA.

> > Otherwise, I *think* the idea of this patch looks good, provided I'm
> > not missing anything with respect to how status flags are affected.
> 
> FRNDINT takes care of them!

OK.

> > As noted in the other email (sorry about not CC'ing you before; I've
> > got you on CC now), I really want to get rid of all these .s files in
> > favor of __asm__ statements with proper constraints in C source files.
> > That makes them inlineable with LTO, and makes it possible for the
> > compiler to select to use an instruction like fcmovb conditionally
> > based on the targeted ISA level rather than having to do a .S file
> > with hard-coded preprocessor conditionals.
> 
> While this is generally good idea, there's no guarantee that a compiler
> will emit a branch-free instruction sequence like those shown above.
> I also doubt that a compiler will produce the 5 instruction sequence
> shown in my patch for src/math/i386/remquo.S which collects the FPU
> flags C0, C3 and C1 set by FPREM.

For that you'd probably put the collection of bits inside the asm. It
still makes just a few instructions of asm, with no need for external
call ABI logic in the asm.

> I noticed that you provide .S files for "long double" on x86-64, but
> not for "double" and "float". I therefore assume that you use the
> SSE floating-point instructions there, respectively let the compiler
> use them.

On the x86_64 ABI, float and double arithmetic are performed in SSE
rather than in excess precision with the x87 unit.

> Does any compiler emit branch-free instruction sequences like the
> following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
> 
>         .code   ; Intel syntax
> ceil    proc    public
>         extern  __real@8000000000000000:real8
>         movsd   xmm1, __real@8000000000000000
>         extern  __real@3ff0000000000000:real8
>         movsd   xmm2, __real@3ff0000000000000
>         extern  __real@4330000000000000:real8
>         movsd   xmm3, __real@4330000000000000
>         movsd   xmm4, xmm1
>         andnpd  xmm1, xmm0
>         andpd   xmm4, xmm0
>         cmpltsd xmm1, xmm3
>         andpd   xmm1, xmm3
>         orpd    xmm1, xmm4
>         movsd   xmm3, xmm0
>         addsd   xmm0, xmm1
>         subsd   xmm0, xmm1
>         movsd   xmm1, xmm0
>         cmpltsd xmm0, xmm3
>         andpd   xmm0, xmm2
>         addsd   xmm0, xmm1
>         orpd    xmm0, xmm4
>         ret
> ceil    endp
> 
> Or instruction sequences like
> 
>         .code   ; Intel syntax
> copysign proc   public
>         movd    rcx, xmm0
>         movd    rdx, xmm1
>         shld    rcx, rdx, 1
>         ror     rcx, 1
>         movd    xmm0, rcx
>         ret
> copysign endp

Not quite (but it might be possible to write the C in terms of shifts
instead of masks such that it does), but I also don't think it's clear
which version is better. Yours here is mildly smaller and might
perform better, but when making changes that aren't clearly better
there should be some evidence that it's actually an improvement --
especially if it's not just improving existing arch optimizations but
adding new ones where the C was formerly used. Generally musl avoids
asm and arch-specific files as much as possible, using them only for
things that aren't representable in C or where the C is a lot larger
or slower or both.

>         .code   ; Intel syntax
> fdim    proc    public
>         movsd   xmm2, xmm0
>         cmpsd   xmm0, xmm1, 6
>         subsd   xmm2, xmm1
>         andpd   xmm0, xmm2
>         ret
> fdim    endp

Does this handle nans correctly?

> > It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> > make me really wary of manual changes to these files.
> > 
> > Would you be interested in working on converting over the files you
> > want to optimize (or even others too) to that form at the same time as
> > doing the optimizations?
> 
> I don't use musl-libc; I also don't use an OS or a compiler/assembler
> which can be used to build it.
> I just stumbled upon the functions for which I sent in patches while
> searching for code which uses Intel's FPU.
> 
> > It would really help with review process and with improving the overall
> > code state.
> 
> If I start using musl-libc I'd be interested and rewrite these parts.

OK. I don't mind looking at these patches further as-is, and I'll try
to continue offering constructive comments now, but it'll be after
this release cycle (hopefully wrapping that up in the next week or so)
before consideration for merging. musl 1.2.0 is already going to be a
release with big changes (time64) and I don't want to risk subtle
breakage with new changes that haven't been reviewed in detail yet or
had time for users to test.


Rich
Rosen Penev Dec. 11, 2019, 1:13 a.m.
On Tue, Dec 10, 2019 at 2:17 PM Rich Felker <dalias@libc.org> wrote:
>
> On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:
> > "Rich Felker" <dalias@libc.org> wrote:
> >
> >
> > > On Tue, Dec 10, 2019 at 05:57:55PM +0100, Stefan Kanthak wrote:
> > >> Some more optimisations: the current implementations of ceil(), floor()
> > >> and trunc() for i386 change the rounding control using fldcw instructions,
> > >> which are SLOW; these patches provide faster and smaller branch-free (!)
> > >> implementations.
> > >>
> > >> JFTR: I'm NOT subscribed to your mailing list, so CC: me in replies!
> > >>
> > >> --- -/src/math/i386/floor.s
> > >> +++ +/src/math/i386/floor.s
> > >> @@ -1,67 +1,26 @@
> > >>  .global floorf
> > >>  .type floorf,@function
> > >>  floorf:
> > >>          flds 4(%esp)
> > >>          jmp 1f
> > >>
> > >>  .global floorl
> > >>  .type floorl,@function
> > >>  floorl:
> > >>          fldt 4(%esp)
> > >>          jmp 1f
> > >>
> > >>  .global floor
> > >>  .type floor,@function
> > >>  floor:
> > >>          fldl 4(%esp)
> > >> +1:      fld %st(0)
> > >> +        frndint
> > >> +        fxch %st(1)
> > >> +        fucomip %st(1),%st(0)
> > >> +        fld1
> > >> +        fldz
> > >> +        fcmovb %st(1),%st(0)
> > >           ^^^^^^
> > >
> > > fcmovb is not in the baseline ISA.
> >
> > This is but irrelevant or inconsequent: FCMOV* as well as FCOMI* and
> > FUCOMI* were introduced with the PentiumPro. If you allow the use of
> > the latter you can safely use the former too. And FCOMI* and FUCOMI*
> > are already used in other .S files.
>
> This is why we're not using them. I think you're looking at x86_64
> where they are in the baseline ISA.
>
> > > Otherwise, I *think* the idea of this patch looks good, provided I'm
> > > not missing anything with respect to how status flags are affected.
> >
> > FRNDINT takes care of them!
>
> OK.
>
> > > As noted in the other email (sorry about not CC'ing you before; I've
> > > got you on CC now), I really want to get rid of all these .s files in
> > > favor of __asm__ statements with proper constraints in C source files.
> > > That makes them inlineable with LTO, and makes it possible for the
> > > compiler to select to use an instruction like fcmovb conditionally
> > > based on the targeted ISA level rather than having to do a .S file
> > > with hard-coded preprocessor conditionals.
> >
> > While this is generally good idea, there's no guarantee that a compiler
> > will emit a branch-free instruction sequence like those shown above.
> > I also doubt that a compiler will produce the 5 instruction sequence
> > shown in my patch for src/math/i386/remquo.S which collects the FPU
> > flags C0, C3 and C1 set by FPREM.
>
> For that you'd probably put the collection of bits inside the asm. It
> still makes just a few instructions of asm, with no need for external
> call ABI logic in the asm.
>
> > I noticed that you provide .S files for "long double" on x86-64, but
> > not for "double" and "float". I therefore assume that you use the
> > SSE floating-point instructions there, respectively let the compiler
> > use them.
>
> On the x86_64 ABI, float and double arithmetic are performed in SSE
> rather than in excess precision with the x87 unit.
>
> > Does any compiler emit branch-free instruction sequences like the
> > following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
> >
> >         .code   ; Intel syntax
> > ceil    proc    public
> >         extern  __real@8000000000000000:real8
> >         movsd   xmm1, __real@8000000000000000
> >         extern  __real@3ff0000000000000:real8
> >         movsd   xmm2, __real@3ff0000000000000
> >         extern  __real@4330000000000000:real8
> >         movsd   xmm3, __real@4330000000000000
> >         movsd   xmm4, xmm1
> >         andnpd  xmm1, xmm0
> >         andpd   xmm4, xmm0
> >         cmpltsd xmm1, xmm3
> >         andpd   xmm1, xmm3
> >         orpd    xmm1, xmm4
> >         movsd   xmm3, xmm0
> >         addsd   xmm0, xmm1
> >         subsd   xmm0, xmm1
> >         movsd   xmm1, xmm0
> >         cmpltsd xmm0, xmm3
> >         andpd   xmm0, xmm2
> >         addsd   xmm0, xmm1
> >         orpd    xmm0, xmm4
> >         ret
> > ceil    endp
> >
> > Or instruction sequences like
> >
> >         .code   ; Intel syntax
> > copysign proc   public
> >         movd    rcx, xmm0
> >         movd    rdx, xmm1
> >         shld    rcx, rdx, 1
> >         ror     rcx, 1
> >         movd    xmm0, rcx
> >         ret
> > copysign endp
>
> Not quite (but it might be possible to write the C in terms of shifts
> instead of masks such that it does), but I also don't think it's clear
> which version is better. Yours here is mildly smaller and might
> perform better, but when making changes that aren't clearly better
> there should be some evidence that it's actually an improvement --
> especially if it's not just improving existing arch optimizations but
> adding new ones where the C was formerly used. Generally musl avoids
> asm and arch-specific files as much as possible, using them only for
> things that aren't representable in C or where the C is a lot larger
> or slower or both.
>
> >         .code   ; Intel syntax
> > fdim    proc    public
> >         movsd   xmm2, xmm0
> >         cmpsd   xmm0, xmm1, 6
> >         subsd   xmm2, xmm1
> >         andpd   xmm0, xmm2
> >         ret
> > fdim    endp
>
> Does this handle nans correctly?
>
> > > It also precludes x87 stack imbalance bugs like CVE-2019-14697, which
> > > make me really wary of manual changes to these files.
> > >
> > > Would you be interested in working on converting over the files you
> > > want to optimize (or even others too) to that form at the same time as
> > > doing the optimizations?
> >
> > I don't use musl-libc; I also don't use an OS or a compiler/assembler
> > which can be used to build it.
> > I just stumbled upon the functions for which I sent in patches while
> > searching for code which uses Intel's FPU.
> >
> > > It would really help with review process and with improving the overall
> > > code state.
> >
> > If I start using musl-libc I'd be interested and rewrite these parts.
>
> OK. I don't mind looking at these patches further as-is, and I'll try
> to continue offering constructive comments now, but it'll be after
> this release cycle (hopefully wrapping that up in the next week or so)
> before consideration for merging. musl 1.2.0 is already going to be a
> release with big changes (time64) and I don't want to risk subtle
> breakage with new changes that haven't been reviewed in detail yet or
> had time for users to test.
Since you guys are discussing math optimizations, here's another one:
https://www.openwall.com/lists/musl/2019/11/08/1
>
>
> Rich
Stefan Kanthak Dec. 11, 2019, 9:53 a.m.
"Rich Felker" <dalias@libc.org> wrote:

> On Tue, Dec 10, 2019 at 10:32:26PM +0100, Stefan Kanthak wrote:

[ asm vs. C ]

>> Does any compiler emit branch-free instruction sequences like the
>> following for Intel CPUs without SSE4.1, i.e. without ROUNDSS/ROUNDSD?
>> 
>>         .code   ; Intel syntax
>> ceil    proc    public
>>         extern  __real@8000000000000000:real8
>>         movsd   xmm1, __real@8000000000000000
>>         extern  __real@3ff0000000000000:real8
>>         movsd   xmm2, __real@3ff0000000000000
>>         extern  __real@4330000000000000:real8
>>         movsd   xmm3, __real@4330000000000000
>>         movsd   xmm4, xmm1
>>         andnpd  xmm1, xmm0
>>         andpd   xmm4, xmm0
>>         cmpltsd xmm1, xmm3
>>         andpd   xmm1, xmm3
>>         orpd    xmm1, xmm4
>>         movsd   xmm3, xmm0
>>         addsd   xmm0, xmm1
>>         subsd   xmm0, xmm1
>>         movsd   xmm1, xmm0
>>         cmpltsd xmm0, xmm3
>>         andpd   xmm0, xmm2
>>         addsd   xmm0, xmm1
>>         orpd    xmm0, xmm4
>>         ret
>> ceil    endp
>> 
>> Or instruction sequences like
>> 
>>         .code   ; Intel syntax
>> copysign proc   public
>>         movd    rcx, xmm0
>>         movd    rdx, xmm1
>>         shld    rcx, rdx, 1
>>         ror     rcx, 1
>>         movd    xmm0, rcx
>>         ret
>> copysign endp
> 
> Not quite (but it might be possible to write the C in terms of shifts
> instead of masks such that it does), but I also don't think it's clear
> which version is better. Yours here is mildly smaller and might
> perform better, but when making changes that aren't clearly better
> there should be some evidence that it's actually an improvement --
> especially if it's not just improving existing arch optimizations but
> adding new ones where the C was formerly used.

Correct.
I expect the compiler to emit such properly optimised code instead of
calls to the library for standard functions like copysign(), fdim(),
etc. which can be written with just a few instructions ... what the
compiler but not (always) does.

JFTR: I don't know whether GCC or clang either provide intrinsics or
      __builtin_* for such (or all those) small standard functions.

> Generally musl avoids asm and arch-specific files as much as possible,
> using them only for things that aren't representable in C or where
> the C is a lot larger or slower or both.
> 
>>         .code   ; Intel syntax
>> fdim    proc    public
>>         movsd   xmm2, xmm0
>>         cmpsd   xmm0, xmm1, 6
>>         subsd   xmm2, xmm1
>>         andpd   xmm0, xmm2
>>         ret
>> fdim    endp
> 
> Does this handle nans correctly?

Of course! It's equivalent to

double fdim(double a, double b)
{
    uint64_t mask = (a <= b) ? 0ull : ~0ull;
    union {double dbl; uint64_t ull;} u = {a - b};
    u.ull &= mask;
    return u.dbl;
}

[...]

> OK. I don't mind looking at these patches further as-is, and I'll try
> to continue offering constructive comments now, but it'll be after
> this release cycle (hopefully wrapping that up in the next week or so)
> before consideration for merging. musl 1.2.0 is already going to be a
> release with big changes (time64) and I don't want to risk subtle
> breakage with new changes that haven't been reviewed in detail yet or
> had time for users to test.

That's OK.

Stefan
Szabolcs Nagy Dec. 11, 2019, 10:28 a.m.
* Stefan Kanthak <stefan.kanthak@nexgo.de> [2019-12-11 10:53:41 +0100]:
> JFTR: I don't know whether GCC or clang either provide intrinsics or
>       __builtin_* for such (or all those) small standard functions.

they do.

they may not inline them, but that's a compiler decision
and should be addressed there.