7376 Commits

Author SHA1 Message Date
Jay Foad
466d266945
[AMDGPU] Fix GFX90x check prefixes in tests (#92254) 2024-05-15 15:13:53 +01:00
Yingwei Zheng
3aae916ff7
Reland "[ValueTracking] Compute knownbits from known fp classes" (#92084)
This patch relands https://github.com/llvm/llvm-project/pull/86409.

I mistakenly thought that `Known.makeNegative()` clears the sign bit of
`Known.Zero`. This patch fixes the assertion failure by explicitly
clearing the sign bit.
2024-05-14 18:10:28 +08:00
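The failure mode described in 3aae916ff7 can be illustrated with a toy model of known bits. This is a minimal sketch, not the LLVM code: the `KnownBits` class below is a hypothetical Python stand-in, and it assumes that a "make negative" helper only sets the sign bit in the known-one mask, so a stale sign bit in the known-zero mask produces a conflict unless it is cleared explicitly.

```python
from dataclasses import dataclass


@dataclass
class KnownBits:
    """Toy model: a set bit in `zero` means "known 0", in `one` means "known 1"."""
    width: int
    zero: int = 0
    one: int = 0

    @property
    def sign_mask(self) -> int:
        return 1 << (self.width - 1)

    def has_conflict(self) -> bool:
        # A bit cannot be known-zero and known-one at the same time.
        return (self.zero & self.one) != 0

    def make_negative(self) -> None:
        # Assumed behaviour: only records "sign bit is 1", leaves `zero` alone.
        self.one |= self.sign_mask

    def make_negative_clearing_zero(self) -> None:
        # Clear the sign bit in `zero` first, then mark it as known-one.
        self.zero &= ~self.sign_mask
        self.one |= self.sign_mask


# A value previously believed non-negative (sign bit known zero).
stale = KnownBits(width=32, zero=1 << 31)
stale.make_negative()

fixed = KnownBits(width=32, zero=1 << 31)
fixed.make_negative_clearing_zero()
```

In this model `stale` ends up with a conflicting sign bit, which is the kind of state an assertion would reject, while `fixed` does not.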
Martin Storsjö
2e165a2c4b Revert "[ValueTracking] Compute knownbits from known fp classes (#86409)"
This reverts commit d03a1a6e5838c7c2c0836d71507dfdf7840ade49.

This change caused failed assertions, see
https://github.com/llvm/llvm-project/pull/86409#issuecomment-2109469845
for details.
2024-05-14 10:37:23 +03:00
Stanislav Mekhanoshin
efc7bbb917
[AMDGPU] Make v2bf16 BUILD_VECTOR legal (#92022)
There is nothing specific here and it is not different from i16 or f16.
2024-05-13 14:53:26 -07:00
Leon Clark
bd679865c0
[AMDGPU] Add tests for vector rebroadcast. (#91322)
Co-authored-by: Leon Clark <leoclark@amd.com>
2024-05-13 19:39:26 +01:00
Yingwei Zheng
d03a1a6e58
[ValueTracking] Compute knownbits from known fp classes (#86409)
This patch calculates knownbits from fp instructions/dominating fcmp
conditions. It will enable more optimizations with signbit idioms.
2024-05-14 01:58:45 +08:00
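As a rough illustration of the idea in d03a1a6e58, the sketch below derives sign-bit knowledge for a 32-bit float from what is known about its possible sign. The function names and the two boolean inputs are assumptions made for this example; the actual patch works on LLVM's floating-point class information.

```python
import struct

SIGN = 0x8000_0000  # sign bit of an IEEE-754 binary32 value


def float_bits(x: float) -> int:
    """Return the binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def sign_bit_knowledge(can_have_sign_set: bool, can_have_sign_clear: bool):
    """Return (known_zero_mask, known_one_mask) covering only the sign bit.

    "Sign set" here means the sign bit is 1, which includes -0.0.
    """
    if not can_have_sign_set:
        return (SIGN, 0)   # sign bit is known to be 0
    if not can_have_sign_clear:
        return (0, SIGN)   # sign bit is known to be 1
    return (0, 0)          # nothing known
```

For example, a value guarded by a dominating comparison that rules out every sign-set class would get `(SIGN, 0)`, which lets integer-side sign-bit idioms fold.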
Vikram Hegde
cb4fca929b
[AMDGPU] Extend llvm.amdgcn.update.dpp intrinsic to support f64 (#91190)
Follow up patch to https://github.com/llvm/llvm-project/pull/89217,
before we make changes to atomic optimizer.
2024-05-13 10:20:37 +05:30
Stanislav Mekhanoshin
5d18d575d8
[AMDGPU] Make fneg/fabs/copysign legal for bf16 (#91676)
These are just bit operations, exactly the same as with f16.
2024-05-10 14:33:47 -07:00
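The "just bit operations" remark in 5d18d575d8 can be sketched on raw 16-bit patterns. This is an illustration only, with hypothetical helper names; it relies on bf16 keeping its sign in bit 15, the same position as f16.

```python
BF16_SIGN = 0x8000  # sign bit of a 16-bit bfloat pattern
BF16_MASK = 0xFFFF


def bf16_fneg(bits: int) -> int:
    """Flip the sign bit."""
    return (bits ^ BF16_SIGN) & BF16_MASK


def bf16_fabs(bits: int) -> int:
    """Clear the sign bit."""
    return bits & ~BF16_SIGN & BF16_MASK


def bf16_copysign(mag: int, sign: int) -> int:
    """Take the magnitude bits of `mag` and the sign bit of `sign`."""
    return (mag & ~BF16_SIGN & BF16_MASK) | (sign & BF16_SIGN)
```

With 1.0 encoded as `0x3F80`, `bf16_fneg(0x3F80)` gives `0xBF80` (that is, -1.0), and no floating-point arithmetic is involved.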
Petar Avramovic
f5e49279c0
AMDGPU: fix isSafeToSink expecting exactly one predecessor (#89224)
isSafeToSink needs to check if machine cycle has divergent exit branch
but first it needs the MBB that contains cycle exit branch.
Early-tailduplication can delete exit block created by structurize-cfg
so there is still exactly one cycle exit block but the new cycle exit
block can have multiple predecessors.
Simplify search for MBBs that contain cycle exit branch by introducing
helper method getExitingBlocks in GenericCycle.

Fixes #89200
2024-05-10 13:02:05 +02:00
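The notion of "blocks that contain a cycle exit branch" in f5e49279c0 can be sketched on a plain successor map. This is a minimal sketch under the usual definition (a block inside the cycle with at least one successor outside it); the function name and the dictionary-based CFG are assumptions for the example, not the `GenericCycle` interface.

```python
def exiting_blocks(successors, cycle):
    """Return the blocks in `cycle` that branch to a block outside `cycle`.

    successors: dict mapping block name -> list of successor block names
    cycle:      iterable of block names that form the cycle
    """
    members = set(cycle)
    return sorted(
        block
        for block in members
        if any(succ not in members for succ in successors.get(block, ()))
    )


# A -> B -> C -> A is the cycle; B and C both branch out of it.
cfg = {"A": ["B"], "B": ["C", "X"], "C": ["A", "Y"], "X": [], "Y": []}
result = exiting_blocks(cfg, ["A", "B", "C"])
```

Searching from the exiting blocks inside the cycle avoids any assumption about how many predecessors the exit block has.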
Matt Arsenault
6a8d30b1c1
DAG: Skip 0 sign handling in minimum/maximum lowering for _ieee case (#91326)
dc9664a8adae17f2083fbcc8e96cfce606c56d57 changed the documentation to
assume these order -0 as less than +0.
2024-05-09 14:41:13 +02:00
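The signed-zero ordering mentioned in 6a8d30b1c1 can be sketched as follows. This is an illustrative model of a `minimum` operation that propagates NaN and treats -0 as less than +0; it is not the DAG lowering itself, and the function name is hypothetical.

```python
import math


def ieee_minimum(a: float, b: float) -> float:
    """NaN-propagating minimum that orders -0.0 below +0.0."""
    if math.isnan(a) or math.isnan(b):
        return math.nan
    if a == b == 0.0:
        # Both are zeros; prefer the one whose sign bit is set.
        return a if math.copysign(1.0, a) < 0 else b
    return a if a < b else b
```

When the underlying min/max operation already orders zeros this way, the explicit zero check above is redundant, which is the case the commit skips.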
Jay Foad
6eb9e214b3
RFC: [AMDGPU] Check subtarget features for consistency (#86957)
Implement GCNSubtarget::checkSubtargetFeatures as a canonical place to
check subtarget features for consistency and diagnose any
inconsistencies. To start with, the implementation just checks that
either wavefrontsize32 or wavefrontsize64 is selected.

checkSubtargetFeatures is called at the start of instruction selection.
This is pretty arbitrary. It is just a convenient point at which we have
access to the subtarget that we're going to use for codegenning a
particular function.
2024-05-09 11:37:28 +01:00
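The kind of consistency check described in 6eb9e214b3 might look like the sketch below. It assumes the rule is "exactly one of the two wavefront-size features is selected"; the function name, the feature-set representation, and the diagnostic text are all hypothetical.

```python
WAVE_FEATURES = ("wavefrontsize32", "wavefrontsize64")


def check_wavefront_features(features):
    """Return a diagnostic string if the feature set is inconsistent, else None."""
    selected = [f for f in WAVE_FEATURES if f in features]
    if len(selected) != 1:
        # Hypothetical wording, for illustration only.
        return "must specify exactly one of wavefrontsize32 and wavefrontsize64"
    return None
```

Running one central check per function keeps later code free to assume that the feature set is consistent.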
Jay Foad
2cbfe4a823
[AMDGPU] Remove duplicate -mtriple options in tests (#91576) 2024-05-09 10:59:25 +01:00
Matt Arsenault
b5afda8d76 AMDGPU: Add some more ctlz_zero_undef tests 2024-05-08 16:13:22 +02:00
Stanislav Mekhanoshin
2a3903fa0e
[AMDGPU] Prevent FMINIMUM and FMAXIMUM being fully scalarized (#91378)
This is the same logic as with FMINNUM_IEEE/FMAXNUM_IEEE.
2024-05-07 14:20:13 -07:00
Shilei Tian
0b50d095bc
[AMDGPU] Don't optimize agpr phis if the operand doesn't have subreg use (#91267)
If the operand doesn't have any subreg use, the optimization could
potentially generate `V_ACCVGPR_READ_B32_e64` with the wrong register
class. The following example demonstrates the issue.

Input MIR:

```
bb.0:
  %0:sgpr_32 = S_MOV_B32 0
  %1:sgpr_128 = REG_SEQUENCE %0:sgpr_32, %subreg.sub0, %0:sgpr_32, %subreg.sub1, %0:sgpr_32, %subreg.sub2, %0:sgpr_32, %subreg.sub3
  %2:vreg_128 = COPY %1:sgpr_128
  %3:areg_128 = COPY %2:vreg_128, implicit $exec

bb.1:
  %4:areg_128 = PHI %3:areg_128, %bb.0, %6:areg_128, %bb.1
  %5:areg_128 = PHI %3:areg_128, %bb.0, %7:areg_128, %bb.1
  ...
```

Output of current implementation:

```
bb.0:
  %0:agpr_32 = V_ACCVGPR_WRITE_B32_e64 0, implicit $exec
  %1:agpr_32 = V_ACCVGPR_WRITE_B32_e64 0, implicit $exec
  %2:agpr_32 = V_ACCVGPR_WRITE_B32_e64 0, implicit $exec
  %3:agpr_32 = V_ACCVGPR_WRITE_B32_e64 0, implicit $exec
  %4:areg_128 = REG_SEQUENCE %0:agpr_32, %subreg.sub0, %1:agpr_32, %subreg.sub1, %2:agpr_32, %subreg.sub2, %3:agpr_32, %subreg.sub3
  %5:vreg_128 = V_ACCVGPR_READ_B32_e64 %4:areg_128, implicit $exec
  %6:areg_128 = COPY %46:vreg_128

bb.1:
  %7:areg_128 = PHI %6:areg_128, %bb.0, %9:areg_128, %bb.1
  %8:areg_128 = PHI %6:areg_128, %bb.0, %10:areg_128, %bb.1
  ...
```

The problem is the generated `V_ACCVGPR_READ_B32_e64` instruction.
Apparently the operand `%4:areg_128` is not valid for this.

In this patch, we don't count non-subreg uses because
`V_ACCVGPR_READ_B32_e64` can't handle non-32-bit operands.

Fixes: SWDEV-459556
2024-05-07 16:44:00 -04:00
Matt Arsenault
f548c4d83c AMDGPU: Add mode register use to s_getreg_b32
This should fix reading the wrong mode after setting the mode.
Ideally we would have separate pseudos for the case that we know
does not read mode.
2024-05-07 16:06:51 +02:00
Sameer Sahasrabuddhe
8a65ee8b2a
[AMDGPU] don't mark control-flow intrinsics as convergent (#90026)
This is really a workaround to allow control flow lowering in the
presence of convergence control tokens. Control-flow intrinsics in LLVM
IR are convergent because they indirectly represent the wave CFG, i.e.,
sets of threads that are "converged" or "execute in lock-step". But they
exist during a small window in the lowering process, inserted after the
structurizer and then translated to equivalent MIR pseudos. So rather
than create convergence tokens for these builtins, we simply mark them
as not convergent.

The corresponding MIR pseudos are marked as having side effects, which
is sufficient to prevent optimizations without having to mark them as
convergent.
2024-05-06 14:07:11 +05:30
Matt Arsenault
d654278bde
Reapply "AMDGPU: Implement llvm.set.rounding (#88587)" series (#91113)
Revert "Revert 4 last AMDGPU commits to unbreak Windows bots"

This reverts commit 0d493ed2c6e664849a979b357a606dcd8273b03f.

MSVC does not like constexpr on the definition after an extern
declaration of a global.
2024-05-06 09:09:19 +02:00
Mehdi Amini
0d493ed2c6 Revert 4 last AMDGPU commits to unbreak Windows bots
Revert "AMDGPU: Try to fix build error with old gcc"
This reverts commit c7ad12d0d7606b0b9fb531b0b273bdc5f1490ddb.

Revert "AMDGPU: Use umin in set.rounding expansion"
This reverts commit a56f0b51dd988ad2b533de759c98457c1ed42456.

Revert "AMDGPU: Optimize set_rounding if input is known to fit in 2 bits (#88588)"
This reverts commit b4e751e2ab0ff152ed18dea59ebf9691e963e1dd.

Revert "AMDGPU: Implement llvm.set.rounding (#88587)"
This reverts commit 9731b77e80261c627d79980f8c275700bdaf6591.
2024-05-04 19:57:33 +02:00
Matt Arsenault
7ec698e6ed
AMDGPU: Add tests for minimum and maximum intrinsics (#90997)
Baseline tests for new expansion. I think we can do better and avoid the
classes.
2024-05-03 21:43:30 +02:00
Abhinav Garg
76508dce43
[AMDGPU] Fix mode register pass for constrained FP operations (#90085)
This PR fixes the si-mode-register pass, which was inserting an extra
setreg instruction for constrained FP operations. The pass is now
skipped for strictfp functions.
2024-05-03 19:47:15 +02:00
Matt Arsenault
a56f0b51dd AMDGPU: Use umin in set.rounding expansion
Addresses comment from #88587
2024-05-03 19:22:19 +02:00
Jay Foad
cd4287bc44
[AMDGPU] Convert PrologEpilogSGPRSpills from DenseMap to sorted vector (#90957)
In practice PrologEpilogSGPRSpills never has more than 3 entries so
DenseMap is overkill. In addition this means that iteration happens in
register number order, instead of DenseMap's hashed order, so it will
not be affected by future patches that define new physical registers.
This should reduce future test case churn.
2024-05-03 13:42:40 +01:00
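The trade-off described in cd4287bc44 can be sketched with a small sorted container. This is an illustration with a hypothetical class name, using integer register numbers as keys; it shows why iteration order becomes independent of hashing.

```python
import bisect


class SortedRegMap:
    """Tiny map kept sorted by register number; suited to a handful of entries."""

    def __init__(self):
        self._keys = []
        self._vals = []

    def insert(self, reg: int, info) -> None:
        i = bisect.bisect_left(self._keys, reg)
        if i < len(self._keys) and self._keys[i] == reg:
            self._vals[i] = info          # overwrite an existing entry
        else:
            self._keys.insert(i, reg)     # keep both lists in register order
            self._vals.insert(i, info)

    def find(self, reg: int):
        i = bisect.bisect_left(self._keys, reg)
        if i < len(self._keys) and self._keys[i] == reg:
            return self._vals[i]
        return None

    def items(self):
        # Always iterates in increasing register number.
        return list(zip(self._keys, self._vals))


spills = SortedRegMap()
spills.insert(33, "fp")
spills.insert(7, "exec")
spills.insert(34, "bp")
```

Because `items()` depends only on the keys, adding unrelated registers elsewhere cannot reorder the output, which is what keeps test expectations stable.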
Matt Arsenault
4e67b5058e AMDGPU: Add more tests for atomicrmw handling
Add agent scope copies of atomicrmw atomics tests.
Expand testing for the undo identity atomicrmw case.
Test 16-bit atomic expansions.
2024-05-03 11:50:59 +02:00
Matt Arsenault
9f9856d623 AMDGPU: Update name for amdgpu.no.remote.memory metadata 2024-05-03 11:50:59 +02:00
Matt Arsenault
b4e751e2ab
AMDGPU: Optimize set_rounding if input is known to fit in 2 bits (#88588)
We don't need to figure out the weird extended rounding modes or
handle offsets to keep the lookup table in 64-bits.
    
https://reviews.llvm.org/D153258

Depends #88587
2024-05-03 11:17:18 +02:00
Carl Ritson
44648ccb8b
[AMDGPU] Always emit lds_size in PAL ELF Metadata 3.0 (#87222)
Emit lds_size for all shader types in PAL metadata.
2024-05-03 17:01:03 +09:00
Matt Arsenault
9731b77e80
AMDGPU: Implement llvm.set.rounding (#88587)
Use a shift of a magic constant and some offsetting to convert from
flt_rounds values.

I don't know why the enum defines Dynamic = 7. The standard suggests -1
is the "cannot determine" value. If we could start the extended values
at 4 we wouldn't need the extra compare, sub and select.

https://reviews.llvm.org/D153257
2024-05-03 09:41:27 +02:00
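The "shift of a magic constant" technique from 9731b77e80 amounts to packing a small lookup table into one integer and indexing it with a shift and a mask. The sketch below shows only that mechanism: the 4-bit entry width and the table contents are illustrative assumptions, not the values used by the patch.

```python
ENTRY_BITS = 4
ENTRY_MASK = (1 << ENTRY_BITS) - 1

# Illustrative table: index is the incoming rounding-mode value,
# entry is the (hypothetical) hardware encoding to produce.
TABLE = [3, 0, 1, 2]

# Pack the table into a single constant, entry i at bit offset i * ENTRY_BITS.
MAGIC = sum(value << (i * ENTRY_BITS) for i, value in enumerate(TABLE))


def lookup(index: int) -> int:
    """Extract entry `index` from the packed constant with one shift and one mask."""
    return (MAGIC >> (index * ENTRY_BITS)) & ENTRY_MASK
```

A table with more or wider entries than fit in 64 bits would need an index offset or a second constant, which is the sort of extra work the message alludes to.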
Stanislav Mekhanoshin
57216f7bd6
[AMDGPU] Support byte_sel modifier for v_cvt_f32_fp8 and v_cvt_f32_bf8 (#90887) 2024-05-02 12:03:51 -07:00
Scott Egerton
eb8236381b
[AMDGPU] Group multiple single use producers under one single use instruction. (#90713)
Previously each single use producer would be marked with a
"S_SINGLEUSE_VDST 1" instruction. This patch adds support for
larger immediates that encode multiple single use producers into
one S_SINGLEUSE_VDST instruction.
2024-05-02 17:30:11 +01:00
paperchalice
3fe282a83d
[Pass] Add pre-isel-intrinsic-lowering to pass registry (#90851)
This was removed to avoid a circular dependency between `CodeGen` and
`Passes`, but that is no longer a problem now that
`CodeGenPassBuilder` has moved into `Passes`.
2024-05-02 21:57:46 +08:00
Valery Pykhtin
981aa6fcf6
[AMDGPU] Fix incorrect stepping in gdb for amdgcn.end.cf intrinsic. (#83010)
After #73958, the gdb.rocm/lane-execution.exp test started to fail due
to an incorrect debug location. This is effectively a revert patch.
2024-05-02 12:59:31 +02:00
Gang Chen
167427f5db
[AMDGPU] change order of fp and sp in kernel prologue (#90626)
Change the order of fp and sp in the kernel prologue, and update the
related codegen tests, to make it easier to merge code into our
downstream branches.

Signed-off-by: gangc <gangc@amd.com>
2024-05-01 08:16:55 -07:00
Matt Arsenault
39e24bdd8e
MachineLICM: Allow hoisting REG_SEQUENCE (#90638) 2024-05-01 16:52:04 +02:00
David Stuttard
f898161bfa
[AMDGPU] Fix image_msaa_load waitcnt insertion for pre-gfx12 (#90710)
https://github.com/llvm/llvm-project/pull/90201 made some fixes for
gfx12 image_msaa_load waitcnt insertion.
That fix might break in some situations for pre-gfx12. This patch fixes
that by explicitly checking for VSAMPLE, which always requires an
s_wait_samplecnt, and leaves the previous logic intact for non-gfx12.
2024-05-01 11:37:57 +01:00
David Stuttard
5fb1e2825f
[AMDGPU] Enhance s_waitcnt insertion before barrier for gfx12 (#90595)
Code to determine if a waitcnt is required before a barrier instruction
only considered S_BARRIER.
gfx12 adds barrier_signal/wait, so the existing code needs to be
enhanced to look for a barrier start (which is just an S_BARRIER for
earlier architectures).
2024-05-01 11:37:13 +01:00
Jay Foad
0b21b25eac
[AMDGPU] Do not optimize away pre-existing waitcnt instructions at -O0 (#90716)
The autogenerated memory legalizer tests use -O0 so this allows us to
see the exact waitcnts that were inserted by the memory legalizer
without them being optimized away.
2024-05-01 11:29:11 +01:00
Scott Egerton
d97f25b948
[AMDGPU] Emit s_singleuse_vdst instructions when a register is used multiple times in the same instruction. (#89601)
Previously, multiple uses of a register within the same instruction were
being counted as multiple uses. This has been corrected to
only count as a single use as per the specification allowing for
more optimisation candidates.
2024-04-30 17:14:01 +01:00
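The counting change in d97f25b948 can be sketched by comparing two ways of tallying register uses. This is an illustration only: instructions are modelled as plain lists of operand register names, and the function name is hypothetical.

```python
from collections import Counter


def count_uses(instructions, once_per_instruction=True):
    """Count how many times each register is used.

    instructions: list of operand lists, e.g. [["v0", "v0"], ["v1"]]
    once_per_instruction: if True, repeated operands within one instruction
        count as a single use of that register.
    """
    uses = Counter()
    for operands in instructions:
        regs = set(operands) if once_per_instruction else operands
        uses.update(regs)
    return uses


program = [["v0", "v0"], ["v1", "v2"]]
new_counts = count_uses(program)
old_counts = count_uses(program, once_per_instruction=False)
```

Under the per-instruction rule `v0` has one use and so remains a single-use candidate; under the per-operand rule it would have been counted twice and rejected.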
David Stuttard
62dea99a7d
[AMDGPU] Fix gfx12 waitcnt type for image_msaa_load (#90201)
image_msaa_load is actually encoded as a VSAMPLE instruction and
requires the appropriate waitcnt variant.
2024-04-30 10:41:51 +01:00
Shilei Tian
d47c4984e9
[AMDGPU][ISel] Add more trunc store actions regarding bf16 (#90493) 2024-04-29 18:27:52 -04:00
Jannik Silvanus
5f9ae61dee
[Support][YamlTraits] Add quoting for keys in textual YAML representation (#88763)
The support library contains helpers to parse and emit YAML documents.

In the textual YAML representation, some strings need to be quoted, e.g.
when containing unprintable characters.

We already have such quoting implemented for YAML values.

This patch applies the same quoting to YAML *keys*.

One affected case is output of control registers in AMDGPU Msgpack
metadata, which are printed in a format like this:

```
   0x2cca (SPI_SHADER_PGM_RSRC1_ES): 42
```

With this patch, the key is quoted:

```
   '0x2cca (SPI_SHADER_PGM_RSRC1_ES)': 42
```

Most test changes come from this pattern.
2024-04-29 15:37:42 +02:00
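The effect of 5f9ae61dee on keys can be sketched with a simplified quoting rule. This is not the YAML support library's logic: the set of "safe" characters below is an assumption chosen for the example, and the helper names are hypothetical. Single-quoted YAML scalars escape an embedded quote by doubling it.

```python
import string

# Assumed set of characters that never require quoting in this sketch.
SAFE_CHARS = set(string.ascii_letters + string.digits + "_-./")


def quote_key(key: str) -> str:
    """Return the key unchanged if it is plainly safe, else single-quote it."""
    if key and all(c in SAFE_CHARS for c in key):
        return key
    return "'" + key.replace("'", "''") + "'"


def emit_pair(key: str, value) -> str:
    """Emit one `key: value` line, applying the same quoting to the key."""
    return f"{quote_key(key)}: {value}"
```

With this rule, `emit_pair("0x2cca (SPI_SHADER_PGM_RSRC1_ES)", 42)` produces the quoted form shown in the commit message, while a plain key such as `lds_size` is left as-is.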
Shilei Tian
8e17c84836
[AMDGPU][ISel] Set trunc store action to expand for v4f32->v4bf16 (#90427) 2024-04-29 09:08:54 -04:00
Bjorn Pettersson
55c6bda01e Revert "Revert "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)" and more..."
This reverts commit 16bd10a38730fed27a3bf111076b8ef7a7e7b3ee.

Re-applies:
    b3c55b707110084a9f50a16aade34c3be6fa18da - "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)"
    8e2f6495c0bac1dd6ee32b6a0d24152c9c343624 - "[DAGCombiner] Do not always fold FREEZE over BUILD_VECTOR (#85932)"
    73472c5996716cda0dbb3ddb788304e0e7e6a323 - "[SelectionDAG] Treat CopyFromReg as freezing the value (#85932)"

with a fix in DAGCombiner::visitFREEZE.
2024-04-29 13:08:52 +02:00
David Stuttard
2914a11e3f
[AMDGPU] Fix hard clausing for image instructions on gfx12 (#90221)
Also updated hard-clauses.mir to have separate versions for gfx11 and
gfx12 since the MIR instructions are different for each of them.
2024-04-29 11:42:36 +01:00
David Spickett
16bd10a387 Revert "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)" and more...
This reverts:
b3c55b707110084a9f50a16aade34c3be6fa18da - "[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)"
(because it updates a test case that I don't know how to resolve the conflict for)
8e2f6495c0bac1dd6ee32b6a0d24152c9c343624 - "[DAGCombiner] Do not always fold FREEZE over BUILD_VECTOR (#85932)"
73472c5996716cda0dbb3ddb788304e0e7e6a323 - "[SelectionDAG] Treat CopyFromReg as freezing the value (#85932)"

Due to a test suite failure on AArch64 when compiling for SVE.
https://lab.llvm.org/buildbot/#/builders/197/builds/13955

clang: ../llvm/llvm/include/llvm/CodeGen/ValueTypes.h:307: MVT llvm::EVT::getSimpleVT() const: Assertion `isSimple() && "Expected a SimpleValueType!"' failed.
2024-04-29 09:47:41 +01:00
Björn Pettersson
b3c55b7071
[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison (#84921)
[SelectionDAG] Handle more opcodes in canCreateUndefOrPoison

Handle SELECT_CC similarly as SETCC.

Handle these operations that only propagate poison/undef based on the
input operands:
  SADDSAT, UADDSAT, SSUBSAT, USUBSAT, MULHU, MULHS,
  SMIN, SMAX, UMIN, UMAX

These operations may create poison based on shift amount and exact
flag being violated:
  SRL, SRA

One goal here is to allow pushing freeze through these operations
when allowed, as well as letting analyses such as
isGuaranteedNotToBeUndefOrPoison to not break on such operations.

Since some problems have been observed with pushing freeze through
SRA/SRL we block that explicitly in DAGCombiner::visitFreeze now.
That way we can still model SRA/SRL properly in
SelectionDAG::canCreateUndefOrPoison, e.g. when used by
isGuaranteedNotToBeUndefOrPoison, even if we do not want to push
freeze through those instructions.
2024-04-29 07:56:49 +02:00
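The opcode classification described in b3c55b7071 can be summarised in a small sketch. This is a simplified model with a hypothetical function name and boolean inputs; it covers only the opcodes named in the message and answers conservatively for everything else.

```python
# Opcodes that only propagate poison/undef from their operands.
PROPAGATE_ONLY = {
    "SADDSAT", "UADDSAT", "SSUBSAT", "USUBSAT", "MULHU", "MULHS",
    "SMIN", "SMAX", "UMIN", "UMAX",
}

# Shifts that may create poison on their own.
SHIFTS = {"SRL", "SRA"}


def can_create_poison(opcode: str, shift_amount_in_range: bool = False,
                      has_exact_flag: bool = False) -> bool:
    """Return True if the operation itself may introduce poison."""
    if opcode in PROPAGATE_ONLY:
        return False
    if opcode in SHIFTS:
        # Poison can come from an out-of-range shift amount or a violated
        # `exact` flag.
        return has_exact_flag or not shift_amount_in_range
    return True  # conservative default for opcodes not modelled here
```

Note that the message keeps two questions separate: whether an opcode can create poison, and whether a freeze should be pushed through it. Shifts can be modelled precisely for the first question while still being blocked for the second.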
Stanislav Mekhanoshin
6e722bbe30
[AMDGPU] Support byte_sel modifier on v_cvt_sr_fp8_f32 and v_cvt_sr_bf8_f32 (#90244) 2024-04-26 13:02:57 -07:00
Matt Arsenault
2a95022cff AMDGPU: Add atomic bfloat load/store codegen tests 2024-04-25 16:08:11 +02:00
Emma Pilkington
2c50f8ffbb
[AMDGPU] Include missing FeatureMADIntraFwdBug in gfx11-generic (#89936)
It seems like this happened because #79460 moved this from
`FeatureISAVersion11_Common` to `FeatureISAVersion11_0_Common` while
#76955 was being reviewed.
2024-04-25 09:24:07 -04:00
Abhinav Garg
007e859258
AMDGPU: Pre-commit test to verify mode change in fp constrained operations (#88858)
This test will check the mode register in case of constrained floating
point operations.

---------

Co-authored-by: Matt Arsenault <Matthew.Arsenault@amd.com>
2024-04-24 20:36:05 +02:00