56693 Commits

Author SHA1 Message Date
Piotr Fusik
6e7312bda6
[RISCV] Select and/or/xor with certain constants to Zbb ANDN/ORN/XNOR (#120221)
    (and X, (C<<12|0xfff)) -> (ANDN X, ~C<<12)
    (or  X, (C<<12|0xfff)) -> (ORN  X, ~C<<12)
    (xor X, (C<<12|0xfff)) -> (XNOR X, ~C<<12)

Emits better code, typically by avoiding an `ADDI HI, -1` instruction.
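
The bit identity behind these three rewrites can be checked with a short Python sketch. This is only an illustration and not code from the patch: the 64-bit register width, the helper names `andn`, `orn`, `xnor`, `check`, and the sample values are all assumptions.

```python
MASK64 = (1 << 64) - 1  # assumption: RV64, so registers are 64 bits wide


def andn(x, y):
    """Model of Zbb ANDN: x & ~y."""
    return x & ~y & MASK64


def orn(x, y):
    """Model of Zbb ORN: x | ~y."""
    return (x | ~y) & MASK64


def xnor(x, y):
    """Model of Zbb XNOR: ~(x ^ y)."""
    return ~(x ^ y) & MASK64


def check(x, c):
    """Return True if all three rewrites hold for value x and constant c."""
    k = ((c << 12) | 0xFFF) & MASK64   # original constant: C<<12|0xfff
    inv = (~c << 12) & MASK64          # rewritten constant: ~C<<12
    return (
        inv == (~k & MASK64)           # the new constant is the inverse of the old one
        and inv & 0xFFF == 0           # and its low 12 bits are all zero
        and (x & k) == andn(x, inv)
        and (x | k) == orn(x, inv)
        and (x ^ k) == xnor(x, inv)
    )


if __name__ == "__main__":
    print(all(check(x, c) for x in (0, 1, 0x1234_5678_9ABC_DEF0, MASK64)
              for c in (0, 1, 0x7FF, 0xABCDE)))
```

Because the inverted constant has no low 12 bits set, it no longer needs the low part that `C<<12|0xfff` carries, which is consistent with the `ADDI` saving described above.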

Co-authored-by: Craig Topper <craig.topper@sifive.com>
2024-12-19 21:38:20 +01:00
Konstantina Mitropoulou
d3508ccd15
[AMDGPU] Emit S_CBRANCH_SCC for floating-point conditions. (#120588)
- **[AMDGPU] Add new test.**
- **[AMDGPU] Emit S_CBRANCH_SCC for floating-point conditions.**

---------

Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>
2024-12-19 11:20:43 -08:00
Justin Bogner
aa07f92210
[DirectX][SPIRV] Consistent names for HLSL resource intrinsics (#120466)
Rename HLSL resource-related intrinsics to be consistent with the naming
conventions discussed in [wg-hlsl:0014].

This is an entirely mechanical change, consisting of the following
commands and automated formatting.

```sh
git grep -l handle.fromBinding | xargs perl -pi -e \
  's/(dx|spv)(.)handle.fromBinding/$1$2resource$2handlefrombinding/g'
git grep -l typedBufferLoad_checkbit | xargs perl -pi -e \
  's/(dx|spv)(.)typedBufferLoad_checkbit/$1$2resource$2loadchecked$2typedbuffer/g'
git grep -l typedBufferLoad | xargs perl -pi -e \
  's/(dx|spv)(.)typedBufferLoad/$1$2resource$2load$2typedbuffer/g'
git grep -l typedBufferStore | xargs perl -pi -e \
  's/(dx|spv)(.)typedBufferStore/$1$2resource$2store$2typedbuffer/g'
git grep -l bufferUpdateCounter | xargs perl -pi -e \
  's/(dx|spv)(.)bufferUpdateCounter/$1$2resource$2updatecounter/g'
git grep -l cast_handle | xargs perl -pi -e \
  's/(dx|spv)(.)cast.handle/$1$2resource$2casthandle/g'
```

[wg-hlsl:0014]: https://github.com/llvm/wg-hlsl/blob/main/proposals/0014-consistent-naming-for-dx-intrinsics.md
2024-12-19 12:17:21 -07:00
Michael Maitland
3710050566
[RISCV][VLOPT] Set CommonVL as the largest of the users (#120349)
Prior to this patch, we required that all users had the same VL in order
to optimize. But as the FIXME said, we can use the largest VL to
optimize, as long as we can determine what the largest is. This patch
implements the FIXME.
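
A simplified model of the idea, written in Python rather than as the pass itself: each user's VL is either a known integer, the sentinel `VLMAX`, or an opaque register name. The function name `common_vl`, the sentinel, and this representation are hypothetical and chosen only for illustration; the real pass works on machine operands.

```python
VLMAX = "VLMAX"  # hypothetical sentinel for "the maximum possible VL"


def common_vl(user_vls):
    """Return the largest VL among the users, or None if it cannot be determined."""
    # Old behaviour: every user has the same VL (even an opaque register).
    if len(set(user_vls)) == 1:
        return user_vls[0]

    best = None
    for vl in user_vls:
        if vl == VLMAX:
            best = VLMAX                # nothing compares larger than VLMAX
        elif isinstance(vl, int):
            if best is None:
                best = vl
            elif best != VLMAX:
                best = max(best, vl)
        else:
            return None                 # unknown register: ordering is not known
    return best


if __name__ == "__main__":
    print(common_vl([2, 4, 3]))
    print(common_vl(["x10", 4]))
```

The sketch gives up as soon as two different VLs cannot be ordered, mirroring the "as long as we can determine what the largest is" condition above.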
2024-12-19 13:22:31 -05:00
Piotr Fusik
01b96385fd [RISCV][test] Add zbb-logic-neg-imm.ll 2024-12-19 18:44:21 +01:00
Alex MacLean
310e798757
[NVPTX] Avoid introducing unnecessary ProxyRegs and Movs in ISel (#120486)
Avoid introducing `ProxyReg` and `MOV` nodes during ISel when lowering
`bitconvert` or similar operations. These nodes are all erased by a
later pass but not introducing them in the first place is simpler and
likely saves compile time.

Also remove redundant `MOV` instruction definitions.
2024-12-19 07:55:03 -08:00
Benjamin Maxwell
ca98a3d9bb
[AArch64][SVE] Use SVE for scalar FP converts in streaming[-compatible] functions (1/n) (#118505)
In streaming[-compatible] functions, use SVE for scalar FP conversions
to/from integer types. This can help avoid moves between FPRs and GPRs,
which could be costly.

This patch also updates definitions of SCVTF_ZPmZ_StoD and
UCVTF_ZPmZ_StoD to disallow lowering to them from ISD nodes, as doing so
requires creating a [U|S]INT_TO_FP_MERGE_PASSTHRU node with inconsistent
types.

Follow up to #112213.

Note: This PR does not include support for f64 <-> i32 conversions (like
#112564), which needs a bit more work to support.
2024-12-19 13:16:31 +00:00
Feng Zou
eb812d28f5
[X86] Put R20/R21/R28/R29 later in GR64 list (#120510)
These registers require an extra byte to encode in certain memory forms,
so putting them later in the list reduces code size when EGPR is enabled.
The GR8, GR16 and GR32 lists are aligned to the same order.
Example:

    movq (%r20), %r11  # encoding: [0xd5,0x1c,0x8b,0x1c,0x24]
    movq (%r22), %r11  # encoding: [0xd5,0x1c,0x8b,0x1e]
2024-12-19 20:16:34 +08:00
David Green
e020f46027 [ARM] Fix BF16 lowering with FullFP16
This adds test coverage for bf16 instructions, making sure that lowering bf16
works with and without +fullfp16.
2024-12-19 10:20:35 +00:00
Kerry McLaughlin
9829598933
[AArch64][SME2] Extend getRegAllocationHints for ZPRStridedOrContiguousReg (#119865)
ZPR2StridedOrContiguous loads used by a FORM_TRANSPOSED_REG_TUPLE
pseudo should attempt to assign a strided register to avoid unnecessary
copies, even though this may overlap with the list of SVE callee-saved registers.
2024-12-19 09:40:13 +00:00
Pengcheng Wang
2c782ab271
[RISCV] Add software pipeliner support (#117546)
This patch adds basic support for `MachinePipeliner` and disables
it by default.

The functionality should be OK and all llvm-test-suite tests have
passed.
2024-12-19 13:00:08 +08:00
Craig Topper
dc72ec808d
[RISCV] Custom legalize vp.merge for mask vectors. (#120479)
The default legalization uses vmslt with a vector of XLen to compute a
mask. This doesn't work if the type isn't legal. For fixed vectors it
will scalarize. For scalable vectors it crashes the compiler.

This patch uses an alternate strategy that promotes the i1 vector to an
i8 vector and does the merge. I don't claim this to be the best
lowering. I wrote it quickly almost 3 years ago when a crash was
reported in our downstream.
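
The promote-then-merge strategy can be modelled in a few lines of Python. This is a sketch of the semantics only, assuming the usual `vp.merge` behaviour (lanes below the pivot/EVL where the condition is true come from the first value operand, all other lanes from the second); the function names are hypothetical and the lists stand in for vectors.

```python
def vp_merge(cond, on_true, on_false, evl):
    """Reference semantics: pick on_true where lane < evl and cond is set."""
    return [t if (lane < evl and c) else f
            for lane, (c, t, f) in enumerate(zip(cond, on_true, on_false))]


def vp_merge_mask_via_i8(cond, on_true, on_false, evl):
    """Merge two i1 vectors by promoting them to i8 first."""
    true_i8 = [int(b) for b in on_true]     # zero-extend i1 -> i8
    false_i8 = [int(b) for b in on_false]
    merged_i8 = vp_merge(cond, true_i8, false_i8, evl)
    return [v != 0 for v in merged_i8]      # compare against zero to get i1 back


if __name__ == "__main__":
    cond = [True, False, True, True]
    a = [True, True, False, True]
    b = [False, False, True, False]
    print(vp_merge_mask_via_i8(cond, a, b, evl=3))
```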

Fixes #120405.
2024-12-18 19:19:14 -08:00
Zhaoxin Yang
f334db92be
[llvm][CodeGen] Intrinsic llvm.powi.* code gen for vector arguments (#118242)
Scalarize vector FPOWI instead of promoting the type. This allows the
scalar FPOWIs to be visited and converted to libcalls before promoting
the type.

FIXME: This should be done in LegalizeVectorOps/LegalizeDAG, but call
lowering needs the unpromoted EVT.

Without this patch, in some backends, such as RISCV64 and LoongArch64,
the i32 type is illegal and will be promoted. This causes exponent type
check to fail when ISD::FPOWI node generates a libcall.

Fix https://github.com/llvm/llvm-project/issues/118079
2024-12-19 08:57:31 +08:00
Farzon Lotfi
6457aee5b7
[DirectX] Bug fix for Data Scalarization crash (#118426)
Two bugs here. First, calling `Inst->getFunction()` has undefined
behavior if the instruction is not tracked to a function. I suspect the
`replaceAllUsesWith` was leaving the GEPs in a weird ghost parent
situation. I switched up the visitor to be able to `eraseFromParent` as
part of visiting and then everything started working.

The second bug was in `DXILFlattenArrays.cpp`. I was unaware that you
can have multidimensional arrays of `zeroinitializer` and `undef`, so I
fixed up the initializer to handle these two cases.

fixes #117273
2024-12-18 16:33:49 -05:00
Justin Bogner
9b3d85f0f4
[DirectX] TypedUAVLoadAdditionalFormats shader flag (#120477)
Set the TypedUAVLoadAdditionalFormats flag if the shader contains a load
from a multicomponent UAV.

Fixes #114557
2024-12-18 13:42:12 -07:00
Justin Bogner
bfd05102d8
[DirectX] Lower ops after translating metadata (#120157)
Move the DXILOpLoweringPass after DXILTranslateMetadata, and add asserts
in DXILShaderFlags to ensure it isn't scheduled after op lowering. This
will allow us to rely on DirectX intrinsics in the shader flags analysis
rather than having to recover information from lowered operations.

Fixes #120119.
2024-12-18 12:03:05 -07:00
Jun Wang
d57230c72e
[AMDGPU][MC] Disallow op_sel in some VOP3P dot instructions (#100485)
In v_dot4 and v_dot8 instructions with 4- or 8-bit packed data (e.g.,
v_dot4_u32_u8, v_dot8_u32_u4), the op_sel modifier should not be
allowed.
2024-12-18 10:50:47 -08:00
Brox Chen
c6f753b9a0
[AMDGPU][True16][MC] true16 for v_pack_b32_f16 (#119630)
Support true16 format for v_pack_b32_f16 in MC.

Since we are replacing v_alignbit_b32 with
`v_pack_b32_f16_t16/v_pack_b32_f16_fake16` post-GFX11, we have to update
the CodeGen pattern for `v_pack_b32_f16_fake16` to get the CodeGen tests
passing. No pattern is modified or created; `v_pack_b32_f16` is just
replaced with the fake16 format.

Some of the true16 CodeGen tests are impacted since `v_pack_b32_f16`
selection is removed post-GFX11 while `v_pack_b32_f16_t16` is not
yet supported. The CodeGen patch for `v_pack_b32_f16_t16` will be done
in the following patch.
2024-12-18 13:28:42 -05:00
Justin Bogner
0e2466f624
[DirectX] Create symbols for resource handles (#119775)
We need to create symbols with "the original shape of resource and
element type" to put in the resource metadata in order to generate valid
DXIL.

Note that DXC generally doesn't emit an actual symbol outside of library
shaders (it emits an undef of a pointer to the type), but since we have
to deal with opaque pointers we would need a way to smuggle the type
through to match that. Instead, we simply emit symbols for now.

Fixed #116849
2024-12-18 10:47:12 -07:00
Justin Bogner
0fca76d576
[DirectX] Introduce the DXILResourceAccess pass (#116726)
This pass transforms resource access via `llvm.dx.resource.getpointer`
into buffer loads and stores.

Fixes #114848.
2024-12-18 10:13:45 -07:00
Simon Pilgrim
49fd2dde21
[X86] LowerShift - don't prematurely lower to x86 vector shift imm instructions (#120282)
When splitting 2 unique amount shifts to shuffle(shift(x,c1),shift(x,c2)), don't use getTargetVShiftByConstNode directly to lower, use generic shifts to ensure we make use of any further canonicalization: shl(X,1) to add(X,X) etc. - this can have notably better throughput on some x86 targets.

Noticed on #120270
2024-12-18 16:08:45 +00:00
Justin Bogner
3eca15cbb9
[DirectX] Split resource info into type and binding info. NFC (#119773)
This splits the DXILResourceAnalysis pass into TypeAnalysis and
BindingAnalysis passes. The type analysis pass is made immutable and
populated lazily so that it can be used earlier in the pipeline without
needing to carefully maintain the invariants of the binding analysis.

Fixes #118400
2024-12-18 09:02:28 -07:00
Sergei Barannikov
d3750412aa
[TableGen][GISel] Improve dead register handling (#120426)
A dead implicit def wasn't marked as dead if it is also an implicit use.
The new approach should also be more straightforward and simplifies
future changes for supporting optional defs and physical register defs.

Pull Request: https://github.com/llvm/llvm-project/pull/120426
2024-12-18 18:58:37 +03:00
Florian Hahn
76714be5fd
Revert "Add support for single reductions in ComplexDeinterleavingPass (#112875)"
This reverts commit b3eede5e1fa7ab742b86e9be22db7bccd2505b8a.

This has been breaking most AArch64 stage2 builds for 4+ hours,
reverting to get the bots back to green.

https://lab.llvm.org/buildbot/#/builders/41/builds/4172
https://lab.llvm.org/buildbot/#/builders/4/builds/4281
https://lab.llvm.org/buildbot/#/builders/199/builds/263
https://lab.llvm.org/buildbot/#/builders/198/builds/334
https://lab.llvm.org/buildbot/#/builders/143/builds/4276
https://lab.llvm.org/buildbot/#/builders/17/builds/4725
2024-12-18 15:06:52 +00:00
Aaditya
0446990cc7
Reapply "[NFC][AMDGPU] Pre-commit clang and llvm tests for dynamic allocas" (#120410)
This reapplies commit https://github.com/llvm/llvm-project/pull/120063.

A machine-verifier bug was causing a crash in the previous commit. 
This has been addressed in
https://github.com/llvm/llvm-project/pull/120393.
2024-12-18 18:20:45 +05:30
Simon Pilgrim
f270c9a7d0 [X86] urem-seteq-illegal-types.ll - regenerate VPTERNLOG comment 2024-12-18 11:58:49 +00:00
Paul Walker
3146911eb0
[LLVM][AsmPrinter] Add vector ConstantInt/FP support to emitGlobalConstantImpl. (#120077)
This fixes a failure path for fixed length vector globals when
ConstantInt/FP is used to represent splats instead of
ConstantDataVector.
2024-12-18 11:51:01 +00:00
Sergei Barannikov
1941f34172
[TableGen][GISel] Import more "multi-level" patterns (#120332)
Previously, if the destination DAG has an untyped leaf, we would import
the pattern only if that leaf is defined by the *top-level* source DAG.
This is an unnecessary restriction.

Here is an example of such pattern:
```
def : Pat<(add (mul v8i16:$vA, v8i16:$vB), v8i16:$vC),
          (VMLADDUHM $vA, $vB, $vC)>;
```

Previously, it failed to import because `add` defines neither
`$vA` nor `$vB`.

This change reduces the number of skipped patterns as follows:

```
AArch64: 8695 ->  8548 (-147)
AMDGPU: 11333 -> 11240 (-93)
ARM:     4297 ->  4278 (-1)
PowerPC: 3955 ->  3010 (-945)
```

Other GISel-enabled targets are unaffected.
2024-12-18 14:44:55 +03:00
Simon Pilgrim
dd8e1adbf2
[X86] LowerShift - track the number and location of constant shift elements. (#120270)
We have several vector shift lowering strategies that have to analyse
the distribution of non-uniform constant vector shift amounts; at the
moment there is very little sharing of data between these analyses.

This patch creates a SmallDenseMap of the different LEGAL constant shift
amounts used, with a mask of which elements they are used in. So far
I've only updated the shuffle(immshift(x,c1),immshift(x,c2)) lowering
pattern to use it for clarity; there are several more that can be done in
followups. It's hoped that the proposed patch #117980 can be simplified
after this patch as well.

vec_shift6.ll - the existing shuffle(immshift(x,c1),immshift(x,c2))
lowering bails on out of range shift amounts, while this patch now skips
them and treats them as UNDEF - this means we manage to fold more cases
that before would have to lower to a SHL->MUL pattern, including some
legalized cases.
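
As a rough model of the bookkeeping described above, the following Python sketch maps each in-range constant shift amount to a bitmask of the lanes that use it. It is not the patch's code: a plain `dict` stands in for the SmallDenseMap, `None` stands in for an undef amount, and the function name is hypothetical.

```python
def collect_shift_amounts(amounts, elt_bits):
    """Map each legal constant shift amount to a bitmask of the lanes using it."""
    uses = {}
    for lane, amt in enumerate(amounts):
        # Out-of-range (or undef) amounts are skipped, i.e. treated as UNDEF.
        if amt is None or not 0 <= amt < elt_bits:
            continue
        uses[amt] = uses.get(amt, 0) | (1 << lane)
    return uses


if __name__ == "__main__":
    # v8i16 shift by <1,1,3,3,1,40,3,3>: lane 5 is out of range for 16-bit lanes.
    uses = collect_shift_amounts([1, 1, 3, 3, 1, 40, 3, 3], elt_bits=16)
    for amt, lanes in sorted(uses.items()):
        print(amt, format(lanes, "08b"))
```

With exactly two entries in the map, the vector is a candidate for the shuffle(immshift(x,c1),immshift(x,c2)) form, and the lane masks say which shifted copy each element should be taken from.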
2024-12-18 11:36:54 +00:00
Mikhail Goncharov
41c1992a16 [NVPTX] fix nvcl-param-align.ll
fix for f9c8c01d38f8fbea81db99ab90b7d0f2bdcc8b4d
2024-12-18 11:41:44 +01:00
Aaditya
414c462a83
[AMDGPU] Modify Dyn Alloca test to account for Machine-Verifier bug (#120393)
Machine-Verifier crashes in kernel functions,
but fails gracefully in device functions.

This is due to the buffer resource descriptor selected
during G-ISEL, before the fallback path.
Device functions use `$sgpr0_sgpr1_sgpr2_sgpr3`,
while kernel functions select `$private_rsrc_reg`,
where the machine verifier complains:
`$private_rsrc_reg is not a SReg_128 register.`

Modifying test case to capture both behaviors, this is related to
https://github.com/llvm/llvm-project/pull/120063
2024-12-18 16:08:17 +05:30
Nicholas Guy
b3eede5e1f
Add support for single reductions in ComplexDeinterleavingPass (#112875)
The Complex Deinterleaving pass assumes that all values emitted will
result in complex numbers, this patch aims to remove that assumption and
adds support for emitting just the real or imaginary components, not
both.
2024-12-18 10:34:26 +00:00
Simon Pilgrim
0b4ee8d4ee
[X86] combineKSHIFT - fold kshiftr(kshiftr/extract_subvector(X,C1),C2) --> kshiftr(X,C1+C2) (#115528)
Merge serial KSHIFTR nodes, possibly separated by EXTRACT_SUBVECTOR, to allow mask instructions to be computed in parallel.
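
The arithmetic fact the fold relies on can be checked exhaustively for a small mask width. This Python sketch models a mask register as an integer; the helper name `kshiftr`, the 8-bit width used in the loop, and the treatment of oversized shift amounts (result is zero) are assumptions of the model, not statements about the DAG combine's exact preconditions.

```python
def kshiftr(mask, amount, width=16):
    """Logical right shift of a width-bit mask value."""
    return (mask & ((1 << width) - 1)) >> amount


def fold_holds(width=8):
    """Check kshiftr(kshiftr(x, c1), c2) == kshiftr(x, c1 + c2) for all inputs."""
    return all(
        kshiftr(kshiftr(x, c1, width), c2, width) == kshiftr(x, c1 + c2, width)
        for x in range(1 << width)
        for c1 in range(width)
        for c2 in range(width)
    )


if __name__ == "__main__":
    print(fold_holds())
```

After the fold, the second shift no longer depends on the result of the first, which is what allows the two mask instructions to issue in parallel.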
2024-12-18 09:48:38 +00:00
Csanád Hajdú
96bb281b63
[AArch64] Prevent unnecessary truncation in bool vector reduce code generation (#120096)
Prevent unnecessarily truncating results of 128 bit wide vector
comparisons to 64 bit wide vector values in boolean vector reduce
operations.
2024-12-18 09:14:12 +00:00
Aaditya
d6e8ab1fa6
Revert "[NFC][AMDGPU] Pre-commit clang and llvm tests for dynamic allocas" (#120369)
Reverts llvm/llvm-project#120063 due to build-bot failures
2024-12-18 14:06:49 +07:00
Aaditya
99c2e3b782
[NFC][AMDGPU] Pre-commit clang and llvm tests for dynamic allocas (#120063)
For #119822
2024-12-18 12:14:37 +05:30
Ruiling, Song
67c55b1ffc
[AMDGPU] Make max dwords of memory cluster configurable (#119342)
We find it helpful to increase the value for graphics workload. Make it
configurable so we can experiment with a different value.
2024-12-18 14:17:27 +08:00
Michael Maitland
a61eeaa748
[RISCV][VLOPT] Add vector indexed loads and stores to getOperandInfo (#119748)
Use `MO.getOperandNo() == 0` instead of `IsMODef` so naming is clear for the store, since the store should treat its operand 0 like that even though it is not a def. The load should treat its operand 0 def in the same way.
2024-12-17 23:51:45 -05:00
Michael Maitland
fb33268d2f
[RISCV][VLOPT] Add support for VID and VIOTA (#120331)
We already cover vid in `llvm/test/CodeGen/RISCV/rvv/vl-opt-op-info.mir`
so no need to add tests for that instruction.
2024-12-17 21:15:23 -05:00
Drew Kersnar
932d9c13fa
[NVPTX] Generalize and extend upsizing when lowering 8/16-bit-element vector loads/stores (#119622)
This addresses the following issue I opened:
https://github.com/llvm/llvm-project/issues/118851.

This change generalizes the Type Legalization mechanism that currently
handles `v8[i/f/bf]16` upsizing to include loads _and_ stores of `v8i8`
+ `v16i8`, allowing all of the mentioned vectors to be lowered to ptx as
vectors of `b32`. This extension also allows us to remove the DagCombine
that only handled exactly `load v16i8`, thus centralizing all the
upsizing logic into one place.

Test changes include adding v8i8, v16i8, and v8i16 cases to
load-store.ll, and updating the CHECKs for other tests to match the
improved codegen.
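
To make the "upsizing" concrete, here is a small Python sketch that repacks a vector of 8-bit elements into 32-bit words, the way a `v8i8` or `v16i8` value could be viewed as 2 or 4 `b32` elements. The function name is hypothetical and little-endian element order is an assumption of the sketch; it models the data layout only, not the legalizer.

```python
import struct


def upsize_to_b32(elements):
    """Repack a list of 8-bit values into little-endian 32-bit words."""
    if len(elements) % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    raw = bytes(elements)                           # each element must fit in 8 bits
    return list(struct.unpack("<%dI" % (len(raw) // 4), raw))


if __name__ == "__main__":
    v8i8 = [0x11, 0x22, 0x33, 0x44, 0x00, 0x00, 0x00, 0x80]
    print([hex(w) for w in upsize_to_b32(v8i8)])
```

A `v16i8` input yields four words, so a single vector load or store of four `b32` values can carry it.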
2024-12-17 15:23:22 -08:00
Farzon Lotfi
c03fc929ff
[DirectX] Add support for vector_reduce_add (#117646)
Use of `vector_reduce_add` will make it easier to write more intrinsics
in `hlsl_intrinsics.h`.
2024-12-17 17:32:50 -05:00
Michael Maitland
169c32eb49
[RISCV][VLOPT] Enable the RISCVVLOptimizer by default (#119461)
Now that we have testing of all instructions in the isSupportedInstr
switch, and better coverage of getOperandInfo, I think it is a good time
to enable this by default.
2024-12-17 16:19:35 -05:00
Philip Reames
984cb791db
[RISCV] Use vmv.v.x to materialize masks in deinterleave2 lowering (#118500)
This is a follow up to 2af2634 to use vmv.v.x of i8 constants instead of
the prior vid/vand/vmsne sequence. The advantage of the vmv.v.x sequence
is that it's always m1 (so cheaper at high LMUL), and can be
rematerialized by the register allocator if needed to locally reduce
register pressure.
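
The reason a splat of an i8 constant can serve as a mask is that mask bit `i` lives in bit `i % 8` of byte `i / 8`, so a repeating byte gives a repeating lane pattern. The Python sketch below illustrates this; the specific constants `0x55` and `0xAA` and the helper names are assumptions made for the example, not values quoted from the patch.

```python
def mask_from_i8_splat(byte, num_lanes):
    """Splat `byte` over enough i8 elements, then read it back as one bit per lane."""
    splat = bytes([byte]) * ((num_lanes + 7) // 8)
    return [bool((splat[lane // 8] >> (lane % 8)) & 1) for lane in range(num_lanes)]


def deinterleave2(vec):
    """Split vec into its even-indexed and odd-indexed elements using the masks."""
    even_mask = mask_from_i8_splat(0x55, len(vec))  # 0b01010101: lanes 0, 2, 4, ...
    odd_mask = mask_from_i8_splat(0xAA, len(vec))   # 0b10101010: lanes 1, 3, 5, ...
    even = [v for v, m in zip(vec, even_mask) if m]
    odd = [v for v, m in zip(vec, odd_mask) if m]
    return even, odd


if __name__ == "__main__":
    print(deinterleave2(list(range(16))))
```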
2024-12-17 12:50:09 -08:00
Simon Pilgrim
2a922903bf [X86] vector-shift tests - regenerate VPTERNLOG comments 2024-12-17 18:19:15 +00:00
Michael Maitland
904849f297
[RISCV][VLOPT] Add support for more instructions in vl-opt-op-info.mir (#119416)
Specifically, some more where EMUL=LMUL and EEW=SEW.
2024-12-17 12:57:29 -05:00
Michael Maitland
345a35259c
[RISCV][VLOPT] Avoid crash when user produces scalar def (#120255)
I found this crash when trying to enable the VLOptimizer pass. We need
this patch before we can enable by default. The old assert was not
checking that USE and DEF were vector registers. The correct condition
is guarded at the callsite of tryReduceVL.
2024-12-17 12:07:29 -05:00
Mikhail Goncharov
17b3dd03a0 [NVPTX][test] fix CodeGen/NVPTX/surf-write.ll
ptxas needs a proper triplet

for 133352feb30605ec51b15f77826ed3a2fbf8db56
2024-12-17 15:45:06 +01:00
Florian Hahn
c1f5937eb4
[SelectOpt] Support BinOps with SExt operands. (#115879)
Building on top of https://github.com/llvm/llvm-project/pull/115489
extend support for binops with SExt operand.

PR: https://github.com/llvm/llvm-project/pull/115879
2024-12-17 11:52:15 +00:00
SpencerAbson
908e30658d
[AArch64] Implement intrinsics for FP8 SME FMLAL/FMLALL (multi) (#119546)
This patch implements the following intrinsics:

Multi-vector 8-bit floating-point multiply-add long (multiple vectors).

``` c
// Only if __ARM_FEATURE_SME_F8F16 != 0
void svmla_za16[_mf8]_vg2x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                fpm_t fpm) __arm_streaming __arm_inout("za");

void svmla_za16[_mf8]_vg2x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                fpm_t fpm) __arm_streaming __arm_inout("za");
// Only if __ARM_FEATURE_SME_F8F32 != 0
void svmla_za32[_mf8]_vg4x2_fpm(uint32_t slice, svmfloat8x2_t zn, svmfloat8x2_t zm,
                                fpm_t fpm) __arm_streaming __arm_inout("za");

void svmla_za32[_mf8]_vg4x4_fpm(uint32_t slice, svmfloat8x4_t zn, svmfloat8x4_t zm,
                                fpm_t fpm) __arm_streaming __arm_inout("za");                              
```

In accordance with https://github.com/ARM-software/acle/pull/323
2024-12-17 11:47:20 +00:00
Benjamin Maxwell
a7dafea384
[SDAG] Allow folding stack slots into sincos/frexp in more cases (#118117)
This adds a new helper `canFoldStoreIntoLibCallOutputPointers()` to
check that it is safe to fold a store into a node that will expand to a
library call that takes output pointers. This requires checking for two
(independent) properties:

1. The store is not within a CALLSEQ_START..CALLSEQ_END pair
   * If it is, the expansion would lead to nested call sequences (which is
     invalid)
2. The node does not appear as a predecessor to the store
   * If it does, attempting to merge the store into the call would result
     in a cycle in the DAG

These two properties are checked as part of the same traversal in
`canFoldStoreIntoLibCallOutputPointers()`
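
The two properties can be illustrated on a toy DAG in Python. This is a loose sketch and not the helper's implementation: the `Node` class, the opcode strings, and the function names are hypothetical, the two checks are done as separate walks for readability (the helper above does them in one traversal), and the direct use of the node's result as the stored value is skipped on the assumption that it is the edge being folded.

```python
class Node:
    def __init__(self, opcode, operands=(), chain=None):
        self.opcode = opcode
        self.operands = list(operands)  # value inputs
        self.chain = chain              # chain input, or None


def inside_call_sequence(store):
    """Walk up the chain: seeing CALLSEQ_START before CALLSEQ_END means 'inside'."""
    node = store.chain
    while node is not None:
        if node.opcode == "CALLSEQ_END":
            return False
        if node.opcode == "CALLSEQ_START":
            return True
        node = node.chain
    return False


def reaches(target, roots):
    """Return True if `target` is reachable from `roots` via operands or chains."""
    stack, seen = list(roots), set()
    while stack:
        node = stack.pop()
        if node is target:
            return True
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.operands)
        if node.chain is not None:
            stack.append(node.chain)
    return False


def can_fold_store(store, libcall_node):
    if inside_call_sequence(store):
        return False                    # property 1: would nest call sequences
    # Skip the direct use of libcall_node as the stored value (assumption).
    roots = [op for op in store.operands if op is not libcall_node]
    if store.chain is not None:
        roots.append(store.chain)
    return not reaches(libcall_node, roots)  # property 2: would create a cycle


if __name__ == "__main__":
    entry = Node("ENTRY")
    sincos = Node("FSINCOS", [Node("ARG")])
    slot = Node("FRAMEINDEX")
    print(can_fold_store(Node("STORE", [sincos, slot], chain=entry), sincos))
```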
2024-12-17 10:54:17 +00:00