56301 Commits

Author SHA1 Message Date
Craig Topper
1bc9de2474 [RISCV] Add test cases for llvm.tan/asin/acos/atan/atan2/sinh/cosh/tanh. NFC 2024-11-27 10:24:34 -08:00
Craig Topper
d7643e8610 [RISCV][GISel] Support f32/f64 llvm.exp10 intrinsics. 2024-11-27 10:24:33 -08:00
Craig Topper
dae9cf3816 [RISCV] Move scalar llvm.exp10 tests into half/float/double-intrinsics.ll. NFC
Improves coverage for more configurations.
2024-11-27 10:24:33 -08:00
Petar Avramovic
87503fa51c
Revert "AMDGPU/GlobalISel: Add stub custom regbankselect pass" (#113913)
This reverts commit e9c49901a43f5b16c3df416460b7e4dbdd24ce03.
Current AMDGPURegBankSelect does nothing different than RegBankSelect.
Revert to using generic RegBankSelect in preparation for adding new
regbankselect passes. New AMDGPURegBankSelect, that will use uniformity
analysis for regbank select decisions, will not subclass RegBankSelect.
Revert regression tests to use regbankselect since amdgpu-regbankselect
will be used by new pass and behavior will be different.
2024-11-27 13:16:22 -05:00
RolandF77
a475180498
[PowerPC] Use setbc for values from vector compare conditions (#114858)
For P10 use the setbc instruction to get int values from vector compare
summary condition results.
2024-11-27 12:47:10 -05:00
knickish
c29e895ad2
[M68k] Handle 16 bit MOVs to and from CCR (#114714)
Builds on @TechnoElf's CCR MOV PR
https://github.com/llvm/llvm-project/pull/107591 and adds some tests.

Fixes https://github.com/llvm/llvm-project/issues/106210.

---------

Co-authored-by: TechnoElf <technoelf@undertheprinter.com>
2024-11-27 09:37:57 -08:00
Sander de Smalen
318c69de52 Reland "[AArch64] Define high bits of FPR and GPR registers (take 2) (#114827)"
The issue with slow compile-time was caused by an assert in
AArch64RegisterInfo.cpp. The assert invokes 'checkAllSuperRegsMarked'
after adding all the reserved registers. This call gets very expensive
after adding the _HI registers due to the way the function searches
in the 'Exception' list, which is expected to be a small list but isn't
(the patch added 190 _HI regs).

It was possible to rewrite the code in such a way that the _HI registers
are marked as reserved after the check. This makes the problem go away
entirely and restores compile-time to what it was before (tested for
`check-runtimes`, which previously showed a ~5x slowdown).

This reverts commits:
  1434d2ab215e3ea9c5f34689d056edd3d4423a78
  2704647fb7986673b89cef1def729e3b022e2607
2024-11-27 13:31:59 +00:00
Simon Pilgrim
f30f7a084c [X86] canonicalizeShuffleWithOp - initial support for shuffle(cvt(x),cvt(y)) -> cvt(shuffle(x,y))
Initial support is just for UNPCKL(CVTPH2PS(X),CVTPH2PS(Y)) -> CVTPH2PS(UNPCKL(X,Y))

Making this more general for other shuffles/conversions will have to be done carefully as we have to handle changes in src/dst element width, so I just handled the CVTPH2PS regression case.
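
The fold works because an elementwise conversion commutes with a lane interleave, even though the shuffle moves from f32 lanes to 16-bit lanes. A minimal Python sketch of that equivalence follows. It assumes UNPCKL interleaves the low halves of its two inputs and CVTPH2PS converts the low four half-precision lanes; `cvtph2ps` and `unpckl` are illustrative models written for this note, not LLVM or X86 APIs.

```python
import struct

def cvtph2ps(halves):
    """Model of CVTPH2PS: convert the low four f16 bit patterns to floats."""
    return [struct.unpack("<e", struct.pack("<H", h))[0] for h in halves[:4]]

def unpckl(a, b):
    """Model of UNPCKL: interleave the low halves of two equal-length vectors."""
    half = len(a) // 2
    out = []
    for i in range(half):
        out += [a[i], b[i]]
    return out

# Eight 16-bit lanes per source; the values are arbitrary f16 bit patterns.
x = [0x3C00, 0x4000, 0x4200, 0x4400, 0x4500, 0x4600, 0x4700, 0x4800]
y = [0xBC00, 0xC000, 0xC200, 0xC400, 0xC500, 0xC600, 0xC700, 0xC800]

before = unpckl(cvtph2ps(x), cvtph2ps(y))  # shuffle of two 4 x f32 results
after = cvtph2ps(unpckl(x, y))             # convert after an 8 x 16-bit shuffle
assert before == after
```

The two forms agree only because the 16-bit interleave places the same four source lanes in the low half that the conversion reads, which is the element-width bookkeeping the message warns about.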

Fixes #83414
2024-11-27 12:38:52 +00:00
Igor Kirillov
e874c8fc27
[SelectOpt] Refactor to prepare for supporting more select-like operations (#117582)
* Enables conversion of several select-like instructions within one
group
* Any number of auxiliary instructions depending on the same condition
can be in between select-like instructions
* After splitting the basic block, move select-like instructions into
the relevant basic blocks and optimise them
* Make it easier to add support for shift-based select-like instructions and
also any mixture of zext/sext/not instructions
2024-11-27 11:35:59 +00:00
Anatoly Trosinenko
1fccba5ca1
[AArch64][PAC] Eliminate excessive MOVs when computing blend (#115185)
As function calls do not generally preserve X16 and X17, it is beneficial
to allow AddrDisc operand of BLRA instruction to reside in these
registers and make use of this condition when computing the
discriminator.

This can save up to two MOVs in cases such as loading a (signed) virtual
function pointer via a (signed) pointer to vtable, for example

    ldr   x9, [x16]
    mov   x8, x16
    mov   x17, x8
    movk  x17, #34646, lsl #48
    blraa x9, x17

can be simplified to

    ldr   x8, [x16]
    movk  x16, #34646, lsl #48
    blraa x8, x16
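
As a rough check that both sequences compute the same discriminator, the sketch below models `movk ..., lsl #48` as replacing bits 63:48 of a 64-bit register. The function names and the address value are hypothetical, chosen only for illustration.

```python
MASK48 = (1 << 48) - 1

def movk_lsl48(reg: int, imm16: int) -> int:
    """Model of `movk reg, #imm16, lsl #48`: replace bits 63:48, keep bits 47:0."""
    return (reg & MASK48) | ((imm16 & 0xFFFF) << 48)

def blend_with_copies(x16: int, imm16: int) -> int:
    """Longer sequence: copy x16 through x8 into x17, then movk x17."""
    x8 = x16
    x17 = x8
    return movk_lsl48(x17, imm16)

def blend_in_place(x16: int, imm16: int) -> int:
    """Shorter sequence: movk applied directly to x16."""
    return movk_lsl48(x16, imm16)

vtable_addr = 0x0000_7FFF_1234_5678  # hypothetical address value
assert blend_with_copies(vtable_addr, 34646) == blend_in_place(vtable_addr, 34646)
assert hex(blend_in_place(vtable_addr, 34646)) == "0x87567fff12345678"
```

The in-place form overwrites x16, which is acceptable here because the call that follows does not generally preserve X16 and X17.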
2024-11-27 13:24:32 +03:00
David Green
712ef7d0ba [AArch64][GlobalISel] Fix smull and umull intrinsics.
These were the wrong way around somehow, with aarch64_neon_umull being converted
to G_SMULL.
2024-11-27 10:11:06 +00:00
tangaac
427be07675
[LoongArch] Support amcas[_db].{b/h/w/d} instructions. (#114189)
Two options for clang: -mlamcas & -mno-lamcas.
Enable or disable amcas[_db].{b/h} instructions.
The default is -mno-lamcas.
Only works on LoongArch64.
2024-11-27 17:36:13 +08:00
Craig Topper
50dfb0772b [RISCV] Support f32/f64 libcalls for sin/cos/pow/log/log2/log10/exp/exp2
Test cases copied from SelectionDAG.
2024-11-26 23:35:52 -08:00
tangaac
53c0a25db7
[LoongArch] Use div.w/mod.w to eliminate unnecessary sign-extend for sdiv/srem i32. (#117298) 2024-11-27 14:35:53 +08:00
Matt Arsenault
b4a16a78c2
AMDGPU: Match and Select BITOP3 on gfx950 (#117843)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2024-11-27 01:31:19 -05:00
Matt Arsenault
6934870a13
AMDGPU: Remove FeatureCvtFP8VOP1Bug from gfx950 (#117827) 2024-11-27 01:28:09 -05:00
Durgadoss R
40d0058e6a
[NVPTX] Add TMA bulk tensor reduction intrinsics (#116854)
This patch adds NVVM intrinsics and NVPTX codegen for:
* cp.async.bulk.tensor.reduce.1D -> 5D variants, supporting both Tile
   and Im2Col modes.
* These intrinsics optionally support cache_hints as indicated by the
   boolean flag argument.
* Lit tests are added for all combinations of these intrinsics in
   cp-async-bulk-tensor-reduce.ll.
* The generated PTX is verified with a 12.3 ptxas executable.
* Added docs for these intrinsics in NVPTXUsage.rst file.

PTX Spec reference:
https://docs.nvidia.com/cuda/parallel-thread-execution/#data-movement-and-conversion-instructions-cp-reduce-async-bulk-tensor

Signed-off-by: Durgadoss R <durgadossr@nvidia.com>
2024-11-27 10:57:51 +05:30
Thurston Dang
0d15d46362
[ubsan] Change ubsan-unique-traps to use nomerge instead of counter (#117651)
https://github.com/llvm/llvm-project/pull/65972 (continuation of
https://reviews.llvm.org/D148654) had considered adding nomerge to
ubsantrap, but did not proceed with that because of
https://github.com/llvm/llvm-project/issues/53011. Instead, it added a
counter (based on TrapBB->getParent()->size()) to each ubsantrap call.
However, this counter is not guaranteed to be unique after inlining, as
shown by https://github.com/llvm/llvm-project/pull/83470, which can
result in ubsantraps being merged by the backend.

https://github.com/llvm/llvm-project/pull/101549 has since fixed the
nomerge limitation ("It sets nomerge flag for the node if the
instruction has nomerge arrtibute."). This patch therefore takes
advantage of nomerge instead of using the counter, guaranteeing that the
ubsantraps are not merged.

This patch is equivalent to
https://github.com/llvm/llvm-project/pull/83470 but also adds nomerge
and updates tests (https://github.com/llvm/llvm-project/pull/117649:
ubsan-trap-merge.c; https://github.com/llvm/llvm-project/pull/117657:
ubsan-trap-merge.ll, ubsan-trap-nomerge.ll; catch-undef-behavior.c).
2024-11-26 21:13:00 -08:00
Sergei Barannikov
61a23646c9
[SjLjEHPrepare] Configure call sites correctly (#117656)
After 9fe78db4, the pass inserts `store volatile i32 -1, ptr %call_site`
before all invoke instructions except the one in the entry block, which
has the effect of bypassing landing pads on exceptions.

When configuring the call site for a potentially throwing instruction
check that it is not `InvokeInst` -- they are handled by earlier code.
2024-11-27 08:03:47 +03:00
Matt Arsenault
5615657209
AMDGPU: Builtin & CodeGen support for v_cvt_sr_{bf16|f16}_f32 instructions (#117824)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:37:05 -05:00
Matt Arsenault
62dc8f3069
AMDGPU: Add builtins & codegen support for bitop3_b{16|32} of gfx950. (#117823)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 23:33:07 -05:00
Matt Arsenault
142b33c58b
AMDGPU: Allocate different registers for vdst & src in v_cvt_scalef32* (#117822)
For multipass instructions, overlap between VDST and SRC
would result in a HW race and undefined results.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 23:29:11 -05:00
Matt Arsenault
265e209ceb
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_{bf8|fp8}_{f16|bf16|f32} (#117821)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:24:01 -05:00
Matt Arsenault
301c8e6047
AMDGPU: Add support for v_cvt_scalef32_sr instructions (#117820)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 23:20:16 -05:00
antangelo
512defe603
[NFC][GISel][AArch64] Pre-commit baseline tests for translation of @llvm.expect.with.probability (#117842)
Pre-commit of tests for generic GlobalISel translation of
`@llvm.expect.with.probability` for when optimizations are not enabled
2024-11-26 22:58:56 -05:00
Brandon Wu
4a7dbede6b
[RISCV] Support svukte extension (#115657)
This is the extension for "Address-Independent Latency of User-Mode
Faults to Supervisor Addresses".
Spec: https://github.com/riscv/riscv-isa-manual/pull/1564,
https://lf-riscv.atlassian.net/browse/RVS-2977
The spec states that the `svukte` depends on `sv39`, but we don't have
`sv39` yet, so I didn't add it to the implied list.
2024-11-27 10:54:57 +08:00
Craig Topper
38a3cce90a [RISCV][GISel] Copy fneg test cases from SelectionDAG into float/double-arith.ll. NFC
The test cases use fcmp which was not fully supported before
43b6b78771e9ab4da912b574664e713758c43110.
2024-11-26 18:20:56 -08:00
antangelo
dd4844722d
[SelectionDAG] Add generic implementation for @llvm.expect.with.probability when optimizations are disabled (#117459)
Handle @llvm.expect.with.probability in SelectionDAGBuilder, FastISel,
and IntrinsicLowering in the same way @llvm.expect is handled, where
the value is passed through as-is. This can be reached if the intrinsic
is used without optimizations, where it would otherwise be properly
transformed out.

Fixes #115411 for SelectionDAG. A similar patch is likely needed for
GlobalISel.
2024-11-26 20:22:25 -05:00
Sam Clegg
ea58410d0f
[WebAssembly] Implement %llvm.thread.pointer intrinsic (#117817)
We can simply use the `__tls_base` global for this which is guaranteed
to be non-zero and unique per thread.

Fixes: #117433
2024-11-26 17:19:14 -08:00
Matt Arsenault
76715787f4
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_sr_pk_fp4 instructions (#117798)
Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-11-26 19:59:14 -05:00
Matt Arsenault
c8ee1ee057
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk_fp4_{f|bf}16 for gfx950 (#117794)
These instructions have non-standard use of OPSEL bits to select
dest write byte. The src2_modifiers operand is used without having
its corresponding src2 operand by introducing dummy src2.

OPSEL ASM Syntax: opsel:[a,b,c,d]
a & b are meaningless; c & d together decide which byte to write in the dst reg.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:38:23 -05:00
Matt Arsenault
065dc93d96
AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{bf|f}16_{bf|fp}8 for gfx950 (#117793)
OPSEL[0] selects src_word to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:35:18 -05:00
Matt Arsenault
991dcbc468
AMDGPU: Builtin & codegen support for v_cvt_scalef32_pk32_{bf|f}16_{bf|fp}6 for gfx950 (#117747)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:30:04 -05:00
Matt Arsenault
0f4fcca546
AMDGPU: Builtin & CodeGen support for v_cvt_scalef32_pk32_f32_[fp|bf]6 for gfx950 (#117745)
Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:26:07 -05:00
Matt Arsenault
eeb76880f3
AMDGPU: Builtins & CodeGen support for v_cvt_scalef32_pk_{f|bf}16_fp4 for gfx950 (#117744)
OPSEL ASM Syntax for v_cvt_scalef32_pk_{f|bf}16_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

Note: Conventional Inst{13} i.e. OPSEL[2] is ignored in asm syntax.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:23:15 -05:00
Matt Arsenault
2b9e947d43
AMDGPU: Builtins & Codegen support for v_cvt_scale_fp4<->f32 for gfx950 (#117743)
OPSEL ASM Syntax for v_cvt_scalef32_pk_f32_fp4 : opsel:[x,y,z]
where, x & y i.e. OPSEL[1 : 0] selects which src_byte to read.

OPSEL ASM Syntax for v_cvt_scalef32_pk_fp4_f32 : opsel:[a,b,c,d]
where, c & d i.e. OPSEL[3 : 2] selects which dst_byte  to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:20:09 -05:00
Matt Arsenault
4527894143
Builtins & Codegen support for v_cvt_scalef32_pk_{fp|bf}8_{f|bf}16 for gfx950 (#117742)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:16:08 -05:00
Matt Arsenault
62584f32eb
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_f32_{fp8|bf8} for gfx950 (#117741)
OPSEL[0] determines low/high 16 bits of src0 to read.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 19:12:18 -05:00
Craig Topper
43b6b78771
[RISCV][GISel] Use libcalls for f32/f64 G_FCMP without F/D extensions. (#117660)
LegalizerHelper only supported f128 libcalls and incorrectly assumed that
the destination register for the G_FCMP was s32.
2024-11-26 15:48:49 -08:00
Pradeep Kumar
e84614833e
[LLVM][NVPTX] Add support for div.full instruction (#116482)
This commit adds NVPTX support for div.full PTX instruction with test
under div.ll. [For more information, see PTX
ISA](https://docs.nvidia.com/cuda/parallel-thread-execution/#floating-point-instructions-div)
2024-11-27 04:57:42 +05:30
Thurston Dang
dde7f4d024
[NFC][clang] Add ubsan-trap-merge.ll test to show absence of nomerge considered harmful (#117657)
These testcases demonstrate that ubsan intrinsics are merged in the
backend iff nomerge is missing from ubsantrap intrinsics.

This is based on the observation and testcase by Vitaly Buka in
https://github.com/llvm/llvm-project/pull/83470.
2024-11-26 14:21:05 -08:00
Matt Arsenault
803bd812b1
AMDGPU: Builtins & Codegen support for v_cvt_scalef32_pk_{fp8|bf8}_f32 for gfx950 (#117740)
OPSEL[3] determines low/high 16 bits of word to write.

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:57:09 -05:00
Matt Arsenault
815069c701
AMDGPU: Builtins & Codegen support for: v_cvt_scalef32_[f16|f32]_[bf8|fp8] (#117739)
OPSEL[1:0] collectively decide which byte to read
from src input.

Builtin takes additional imm argument which
represents index (with valid values:[0:3]) of src
byte read. Out-of-bounds checks will be added in the next
patch.

OPSEL ASM Syntax: opsel:[x,y,z]
where,
    opsel[x] = Inst{11} = src0_modifier{2}
    opsel[y] = Inst{12} = src1_modifier{2}
    opsel[z] = Inst{14} = src0_modifier{3}

Note: Inst{13} i.e. OPSEL[2] is ignored in
asm syntax and opsel[z] is meaningless
for v_cvt_scalef32_f32_{fp|bf}8

Co-authored-by: Pravin Jagtap <Pravin.Jagtap@amd.com>
2024-11-26 14:54:10 -05:00
Matt Arsenault
7221bc74bc
AMDGPU: Make v2f16 minimum/maximum legal for gfx950 (#117738) 2024-11-26 14:51:05 -05:00
Matt Arsenault
f5e92eb04b
AMDGPU: Handle f32 minimum3/maximum3 pattern for gfx950 (#117737) 2024-11-26 14:47:52 -05:00
Matt Arsenault
e57b327be2
AMDGPU: Legalize fminimum and fmaximum f32 for gfx950 (#117634)
Select to minimum3/maximum3. Leave f16/v2f16 for later
since it's complicated by only having the vector version.
2024-11-26 14:44:09 -05:00
Matt Arsenault
5a3299a684
AMDGPU: Remove some -verify-machineinstrs from tests (#117736)
We should leave these for EXPENSIVE_CHECKS builds. Some of these
were near the top of the slowest tests.
2024-11-26 12:59:15 -05:00
Philip Reames
c55a080c08 [RISCV] Add shuffle coverage for compress, decompress, and repeat idioms
compress is intended to match vcompress from the ISA manual. Note that
  deinterleave is a subset of this, and is already tested elsewhere.

decompress is the synthetic pattern defined in the same manual - though we can often
  do better than the mentioned iota/vrgather.  Note that some of these
  can also be expressed as interleave with at least one undef source,
  and are already tested elsewhere.

repeat repeats each input element N times in the output.  It can be
  described as an interleave operation, but we can sometimes do
  better lowering-wise.
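
The three idioms can be pictured as shuffle masks. The sketch below is an illustration under assumptions rather than the masks used in the tests: it treats unselected lanes as undef (written as -1), whereas a real decompress may take those lanes from a second source, and all function names are hypothetical.

```python
UNDEF = -1  # stand-in for an undefined shuffle-mask lane

def compress_mask(select):
    """Indices of the selected lanes, packed to the front; the tail is undef."""
    picked = [i for i, keep in enumerate(select) if keep]
    return picked + [UNDEF] * (len(select) - len(picked))

def decompress_mask(select):
    """Inverse of compress: each selected lane reads the next packed source lane."""
    mask, next_src = [], 0
    for keep in select:
        mask.append(next_src if keep else UNDEF)
        next_src += 1 if keep else 0
    return mask

def repeat_mask(num_elts, n):
    """Each input lane appears n times in a row; the output keeps num_elts lanes."""
    return [i // n for i in range(num_elts)]

def apply_mask(src, mask):
    """Apply a single-source shuffle mask; undef lanes become None."""
    return [src[i] if i != UNDEF else None for i in mask]

select = [1, 0, 1, 1, 0, 0, 1, 0]
packed = apply_mask(list("abcdefgh"), compress_mask(select))
restored = apply_mask(packed, decompress_mask(select))
```

Here `restored` holds the original elements at the selected positions, which is the sense in which decompress undoes compress.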
2024-11-26 09:27:56 -08:00
Zaara Syeda
b1a34b80b8
[NFC][Test] Fix PowerPC test gcov_ctr_ref_init.ll (#117577) 2024-11-26 12:09:49 -05:00
SpencerAbson
2a0162c019
[AArch64][SVE] Change the immediate argument in svextq (#115340)
In order to align with `svext` and NEON `vext`/`vextq`, this patch
changes the immediate argument in `svextq` such that it refers to elements
of the size of those of the source vector, rather than bytes. The [spec
for this
intrinsic](https://github.com/ARM-software/acle/blob/main/main/acle.md#extq)
is ambiguous about the meaning of this argument; this issue was raised
after the implementers of the ACLE in GCC interpreted it differently.

For example (with our current implementation):

`svextq_f64(zn_f64, zm_f64, 1)` would, for each 128-bit segment of
`zn_f64`, concatenate the highest 15 bytes of this segment with the
first byte of the corresponding segment of `zm_f64`.

After this patch, the behavior of `svextq_f64(zn_f64, zm_f64, 1)` would
be, for each 128-bit vector segment of `zn_f64`, to concatenate the
higher doubleword of this segment with the lower doubleword of the
corresponding segment of `zm_f64`.

The range of the immediate argument in `svextq` would be modified such
that it is:
- [0,15] for `svextq_{s8,u8}`
- [0,7] for `svextq_{s16,u16,f16,bf16}`
- [0,3] for `svextq_{s32,u32,f32}`
- [0,1] for `svextq_{s64,u64,f64}`
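
The two interpretations can be compared with a small model of one 128-bit segment. This is a sketch assuming the extract takes 16 consecutive bytes from the concatenation of the `zn` and `zm` segments; `extq_segment` and `max_imm` are hypothetical helper names, not ACLE functions.

```python
SEGMENT_BYTES = 16  # svextq works on each 128-bit segment independently

def max_imm(elt_bytes: int) -> int:
    """Largest element-based immediate for a given element size in bytes."""
    return SEGMENT_BYTES // elt_bytes - 1

def extq_segment(zn_seg: bytes, zm_seg: bytes, imm: int, elt_bytes: int) -> bytes:
    """Extract one segment starting imm elements into zn_seg, continuing into zm_seg."""
    if not 0 <= imm <= max_imm(elt_bytes):
        raise ValueError("immediate out of range for this element size")
    start = imm * elt_bytes
    return (zn_seg + zm_seg)[start:start + SEGMENT_BYTES]

zn = bytes(range(0, 16))   # bytes 0..15 of the zn segment
zm = bytes(range(16, 32))  # bytes 16..31 stand for the zm segment

# Byte-based reading of imm=1: upper 15 bytes of zn, then the first byte of zm.
old_f64 = extq_segment(zn, zm, 1, elt_bytes=1)
# Element-based reading of imm=1 for f64: upper doubleword of zn, lower of zm.
new_f64 = extq_segment(zn, zm, 1, elt_bytes=8)
```

Passing `elt_bytes=1` reproduces the byte-based behavior for any element type, which makes the difference between the two readings easy to see side by side.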
2024-11-26 16:50:51 +00:00