12 Commits

Author SHA1 Message Date
Tue Ly
76bb278ebb [libc][math] Implement double precision exp10 function correctly rounded for all rounding modes.
Implement double precision exp10 function correctly rounded for all
rounding modes.  Using the same algorithm as double precision exp
(https://reviews.llvm.org/D158551) and exp2 (https://reviews.llvm.org/D158812)
functions.

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D159143
2023-08-30 08:43:50 -04:00
Tue Ly
8ca614aa22 [libc][math] Implement double precision exp2 function correctly rounded for all rounding modes.
Implement double precision exp2 function correctly rounded for all
rounding modes.  Using the same algorithm as double precision exp function in
https://reviews.llvm.org/D158551.

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D158812
2023-08-25 10:15:08 -04:00
Tue Ly
434bf16084 [libc][math] Implement double precision exp function correctly rounded for all rounding modes.
Implement double precision exp function correctly rounded for all
rounding modes.  Using 4 stages:
- Range reduction: reduce to `exp(x) = 2^hi * 2^mid1 * 2^mid2 * exp(lo)`.
- Use 64 + 64 LUT for 2^mid1 and 2^mid2, and use cubic Taylor polynomial to
approximate `(exp(lo) - 1) / lo` in double precision.  Relative error in this
step is bounded by 1.5 * 2^-63.
- If the rounding test fails, use degree-6 Taylor polynomial to approximate
`exp(lo)` in double-double precision.  Relative error in this step is bounded by
2^-99.
- If the rounding test still fails, use degree-7 Taylor polynomial to compute
`exp(lo)` in ~128-bit precision.

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D158551
2023-08-24 10:17:17 -04:00
Tue Ly
f320fefc4a [libc][math] Implement erff function correctly rounded to all rounding modes.
Implement correctly rounded `erff` functions.

For `x >= 4`, `erff(x) = 1` for `FE_TONEAREST` or `FE_UPWARD`, `0x1.ffffep-1` for `FE_DOWNWARD` or `FE_TOWARDZERO`.

For `0 <= x < 4`, we divide into 32 sub-intervals of length `1/8`, and use a degree-15 odd polynomial to approximate `erff(x)` in each sub-interval:
```
  erff(x) ~ x * (c0 + c1 * x^2 + c2 * x^4 + ... + c7 * x^14).
```

For `x < 0`, we can use the same formula as above, since the odd part is factored out.

Performance tested with `perf.sh` tool from the CORE-MATH project on AMD Ryzen 9 5900X:

Reciprocal throughput (clock cycles / op)
```
$ ./perf.sh erff --path2
GNU libc version: 2.35
GNU libc release: stable
-- CORE-MATH reciprocal throughput --  with -march=native      (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 11.790 + 0.182 clc/call; Median-Min = 0.154 clc/call; Max = 12.255 clc/call;
-- CORE-MATH reciprocal throughput --  with -march=x86-64-v2      (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 14.205 + 0.151 clc/call; Median-Min = 0.159 clc/call; Max = 15.893 clc/call;

-- System LIBC reciprocal throughput --
[####################] 100 %
Ntrial = 20 ; Min = 45.519 + 0.445 clc/call; Median-Min = 0.552 clc/call; Max = 46.345 clc/call;

-- LIBC reciprocal throughput --  with -mavx2 -mfma     (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 9.595 + 0.214 clc/call; Median-Min = 0.220 clc/call; Max = 9.887 clc/call;
-- LIBC reciprocal throughput --  with -msse4.2     (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 10.223 + 0.190 clc/call; Median-Min = 0.222 clc/call; Max = 10.474 clc/call;
```

and latency (clock cycles / op):
```
$ ./perf.sh erff --path2
GNU libc version: 2.35
GNU libc release: stable
-- CORE-MATH latency --  with -march=native      (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 38.566 + 0.391 clc/call; Median-Min = 0.503 clc/call; Max = 39.170 clc/call;
-- CORE-MATH latency --  with -march=x86-64-v2      (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 43.223 + 0.667 clc/call; Median-Min = 0.680 clc/call; Max = 43.913 clc/call;

-- System LIBC latency --
[####################] 100 %
Ntrial = 20 ; Min = 111.613 + 1.267 clc/call; Median-Min = 1.696 clc/call; Max = 113.444 clc/call;

-- LIBC latency --  with -mavx2 -mfma     (with FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 40.138 + 0.410 clc/call; Median-Min = 0.536 clc/call; Max = 40.729 clc/call;
-- LIBC latency --  with -msse4.2     (without FMA instructions)
[####################] 100 %
Ntrial = 20 ; Min = 44.858 + 0.872 clc/call; Median-Min = 0.814 clc/call; Max = 46.019 clc/call;
```

Reviewed By: michaelrj

Differential Revision: https://reviews.llvm.org/D153683
2023-06-28 13:58:37 -04:00
Tue Ly
e557b8a142 [libc][RISCV] Add log, log2, log1p, log10 for RISC-V64 entrypoints.
Add log, log2, log1p, log10 RISCV64 entrypoints.

Reviewed By: michaelrj, sivachandra

Differential Revision: https://reviews.llvm.org/D151674
2023-05-30 14:18:19 -04:00
Tue Ly
0bda541829 [libc][doc] Update math function status page to show more targets.
Show availability of math functions on each target.

Reviewed By: jeffbailey

Differential Revision: https://reviews.llvm.org/D151489
2023-05-25 19:24:33 -04:00
Kazu Hirata
9a515d8142 [libc] Fix typos in documentation 2023-05-22 23:27:59 -07:00
Kazu Hirata
e042efdab6 [libc] Fix typos in documentation 2023-04-24 23:31:48 -07:00
Tue Ly
f63025f52f [libc][Obvious] Fix the performance table in math function documentation. 2023-04-18 14:10:26 -04:00
Tue Ly
9af8dca70f [libc][math] Update range reduction step for log10f and reduce its latency.
Simplify the range reduction steps by choosing the reduction constants
carefully so that the reduced arguments v = r*m_x - 1 and v^2 are exact in double
precision, even without FMA instructions, and -2^-8 <= v < 2^-7.  This allows the
polynomial evaluations to be parallelized more efficiently.

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D147676
2023-04-07 10:31:46 -04:00
Tue Ly
6c7894a8e6 [libc][doc] Move docs/math.rst to docs/math/index.rst
Move docs/math.rst to docs/math/index.rst

Reviewed By: michaelrj

Differential Revision: https://reviews.llvm.org/D144028
2023-02-14 13:41:44 -05:00
Tue Ly
5814b7b279 [libc][math] Implement log10 function correctly rounded for all rounding modes
Implement double precision log10 function correctly rounded for all
rounding modes.  This implementation currently needs FMA instructions for
correctness.

Use 2 passes:
Fast pass:
- 1 step range reduction with a lookup table of `2^7 = 128` elements to reduce the ranges to `[-2^-7, 2^-7]`.
- Use a degree-7 minimax polynomial generated by Sollya, evaluated using a mixed of double-double and double precisions.
- Apply Ziv's test for accuracy.
Accurate pass:
- Apply 5 more range reduction steps to reduce the ranges further to [-2^-27, 2^-27].
- Use a degree-4 minimax polynomial generated by Sollya, evaluated using 192-bit precisions.
- By the result of Lefevre (add quote), this is more than enough for correct rounding to all rounding modes.

In progress: Adding detail documentations about the algorithm.

Depend on: https://reviews.llvm.org/D136799

Reviewed By: zimmermann6

Differential Revision: https://reviews.llvm.org/D139846
2023-01-08 17:41:54 -05:00