6856 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
84f398af74
[AMDGPU] Add missing test checks. NFC. (#69484) 2023-10-18 11:26:39 -07:00
Stanislav Mekhanoshin
47ed921985
[AMDGPU] Add legality check when folding short 64-bit literals (#69391)
We can only fold it if it can fit into 32-bit. I believe it did not
trigger yet because we do not select 64-bit literals generally.
2023-10-18 09:22:23 -07:00
Sirish Pande
28e4f97320
[AMDGPU] Save/Restore SCC bit across waterfall loop. (#68363)
Waterfall loop is overwriting SCC bit of status register. Make sure SCC
bit is saved and restored across.
We need to save/restore only in cases where SCC is live across waterfall
loop.

Co-authored-by: Sirish Pande <sirish.pande@amd.com>
2023-10-18 08:43:29 -05:00
pvanhout
868abf0961 Revert "[AMDGPU] Remove Code Object V3 (#67118)"
This reverts commit 544d91280c26fd5f7acd70eac4d667863562f4cc.
2023-10-18 12:55:36 +02:00
Jay Foad
104db26004
[AMDGPU] Fix image intrinsic optimizer on loads from different resources (#69355)
The image intrinsic optimizer pass was neglecting to check any arguments
of the load intrinsic after the VAddr arguments. For example multiple
loads from different resources should not have been combined but were,
because the pass was not checking the resource argument.
2023-10-18 11:08:01 +01:00
Pierre van Houtryve
c464fea779
[DAG] Constant fold FMAD (#69324)
This has very little effect on codegen in practice, but is a nice to
have I think.

See #68315
2023-10-18 07:46:24 +02:00
Stanislav Mekhanoshin
a22a1fe151
[AMDGPU] support 64-bit immediates in SIInstrInfo::FoldImmediate (#69260)
This is a part of https://github.com/llvm/llvm-project/issues/67781.
Until we select more 64-bit move immediates the impact is minimal.
2023-10-17 10:53:22 -07:00
Guozhi Wei
760e7d00d1 [X86, Peephole] Enable FoldImmediate for X86
Enable FoldImmediate for X86 by implementing X86InstrInfo::FoldImmediate.

Also enhanced peephole by deleting identical instructions after FoldImmediate.

Differential Revision: https://reviews.llvm.org/D151848
2023-10-17 16:22:42 +00:00
Pierre van Houtryve
cc3d2533cc
[AMDGPU] Add i1 mul patterns (#67291)
i1 muls can sometimes happen after SCEV. They resulted in ISel failures
because we were missing the patterns for them.

Solves SWDEV-423354
2023-10-16 16:18:27 +02:00
Pierre van Houtryve
4d6fc88946
[AMDGPU] Add patterns for V_CMP_O/U (#69157)
Fixes SWDEV-427162
2023-10-16 13:07:56 +02:00
Pierre van Houtryve
544d91280c
[AMDGPU] Remove Code Object V3 (#67118)
V3 has been deprecated for a while as well, so it can safely be removed
like V2 was removed.

- [Clang] Set minimum code object version to 4
- [lld] Fix tests using code object v3
- Remove code object V3 from the AMDGPU backend, and delete or port v3
tests to v4.
- Update docs to make it clear V3 can no longer be emitted.
2023-10-16 08:21:48 +02:00
Stephen Thomas
720be6c535
[AMDGPU] Add encoding/decoding support for non-result-returning ATOMIC_CSUB instructions (#68684)
The BUFFER_ATOMIC_CSUB and GLOBAL_ATOMIC_CSUB instructions have
encodings for
non-value-returning forms, although actually using them isn't supported
by
hardware. However, these encodings aren't supported by the backend,
meaning
that they can't even be assembled or disassembled.

Add support for the non-returning encodings, but gate actually using
them
in instruction selection behind a new feature
FeatureAtomicCSubNoRtnInsts,
which no target uses. This does allow the non-returning instructions to
be
tested manually and llvm.amdgcn.atomic.csub.ll is extended to cover
them.
The feature does not gate assembling or disassembling them, this is now
not an error, and encoding and decoding tests have been adapted
accordingly.
2023-10-11 11:37:27 +01:00
Piotr Sobczak
2888fa4313 [AMDGPU] Update test remat-smrd.mir
Update test/CodeGen/AMDGPU/remat-smrd.mir:

* Convert a negative case of non-dereferenceable invariant load to positive one.
* Add new cases for subreg.
2023-10-11 10:19:22 +02:00
Thomas Symalla
aa5158cd1e
[AMDGPU] Use absolute relocations when compiling for AMDPAL and Mesa3D (#67791)
The primary ISA-independent justification for using PC-relative
addressing is that it makes code position-independent and therefore
allows sharing of .text pages between processes.

When not sharing .text pages, we can use absolute relocations instead,
which will possibly prevent a bubble introduced by s_getpc_b64.

Co-authored-by: Thomas Symalla <thomas.symalla@amd.com>
2023-10-10 09:22:02 +02:00
Jay Foad
7b3bbd83c0 Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)"
This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c.

Reverted due to various buildbot failures.
2023-10-09 12:31:32 +01:00
Jay Foad
2501ae58e3
[CodeGen] Really renumber slot indexes before register allocation (#67038)
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
2023-10-09 11:44:41 +01:00
Jeffrey Byrnes
6afceba510
[AMDGPU][IGLP] SingleWaveOpt: Cache DSW Counters from PreRA (#67759)
Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose.

This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline.

Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).
2023-10-06 17:34:14 -07:00
Petar Avramovic
2fa7d652d0 AMDGPU: Fix temporal divergence introduced by machine-sink (#67456)
Temporal divergence that was present in input or introduced in IR
transforms, like code-sinking or LICM, is handled in SIFixSGPRCopies
by changing sgpr source instr to vgpr instr.
After 5b657f5, that moved LICM after AMDGPUCodeGenPrepare,
machine-sinking can introduce temporal divergence by sinking
instructions outside of the cycle.
Add isSafeToSink callback in TargetInstrInfo.
2023-10-06 15:00:08 +02:00
Petar Avramovic
2d7fe90a3e AMDGPU: Add test for temporal divergence introduced by machine-sink
Introduced by 5b657f50b8e8dc5836fb80e566ca7569fd04c26f that moved
LICM after AMDGPUCodeGenPrepare. Some instructions are no longer
sunk during ir optimizations but in machine-sinking instead.
If vgpr instruction used sgpr defined inside the cycle is sunk outside
of the cycle we end up with not-handled case of temporal divergence.
Add test for theoretical case when SALU instruction (represents
uniform value) is sunk outside of the cycle.
Add a test when SALU instruction can be sunk if it edits lane mask.
2023-10-06 15:00:08 +02:00
Petar Avramovic
ccf68ab432 Revert "MachineSink: Fix sinking VGPR def out of a divergent loop"
This reverts commit 3f8ef57bede94445b1a1042c987cc914a886e7ff.
2023-10-06 15:00:08 +02:00
Diana Picus
2e1718adc8 Reland "AMDGPU: Duplicate instead of COPY constants from VGPR to SGPR (#66882)"
Teach the si-fix-sgpr-copies pass to deal with REG_SEQUENCE, PHI or
INSERT_SUBREG where the result is an SGPR, but some of the inputs are
constants materialized into VGPRs. This may happen in cases where for
instance several instructions use an immediate zero and SelectionDAG
chooses to put it in a VGPR to satisfy all of them. This however causes
the si-fix-sgpr-copies to try to switch the whole chain to VGPR and may
lead to illegal VGPR-to-SGPR copies. Rematerializing the constant into
an SGPR fixes the issue.

This was originally reverted because it triggered an unrelated bug in
PEI on one of the OpenMP buildbots. That bug has been fixed in #68299,
so it should be ok to try again.
2023-10-06 10:03:50 +02:00
Diana
be382de059
[AMDGPU] Use correct operand order for shifts (#68299)
In a special case in frame index elimination (when the offset is 0), we
generate either a S_LSHR_B32 or a V_LSHRREV_B32 using the same code.
However, they don't expect their operands in the same order - S_LSHR_B32
takes the value to be shifted first and then the shift amount, whereas
V_LSHRREV_B32 has the operands reversed (hence the REV in its name).
Update the code & tests to take this into account. Also remove an
outdated comment (this code is definitely reachable now that non-entry
functions no longer have a fixed emergency scavenge slot).
2023-10-06 09:43:04 +02:00
Matt Arsenault
5082e827c1 AMDGPU/GlobalISel: Add test for packed sub selection
Mirror of the add test, I've had this lying around for a long
time.
2023-10-05 10:07:57 -07:00
Matt Arsenault
b5ebf07499 AMDGPU/GlobalISel: Add global-isel run lines to shrink add/sub test 2023-10-05 10:07:57 -07:00
Matt Arsenault
2ca30eb8fd
AMDGPU/GlobalISel: Handle mubuf load/store for more types (#68268)
Fixes MUBUF path for most vectors and pointers, which unblocks fixing
the gfx6/7 run lines in assorted tests. Also fixes inconsistent behavior
for -flat-for-global.
2023-10-05 05:36:16 -07:00
Ivan Kosarev
f04aa1f814
[AMDGPU][CodeGen] Fold immediates in src1 operands of V_MAD/MAC/FMA/FMAC. (#68002) 2023-10-05 14:22:29 +03:00
Kirill Stoimenov
0a776996af Revert "[DAG] Attempt shl narrowing in SimplifyDemandedBits"
This reverts commit 7a8c04ef84ecdab4390b451d4c2fe17bc45a7b63.
2023-10-04 22:15:41 +00:00
Jeffrey Byrnes
7794e16b49 [AMDGPU]: Allow combining into v_dot4
Differential Revision: https://reviews.llvm.org/D155995

Change-Id: Id15d232629a32a3549b13d47bf84d7a61b28b928
2023-10-04 13:31:36 -07:00
Alex Richardson
e86d6a43f0 Regenerate test checks for tests affected by D141060 2023-10-04 10:51:35 -07:00
Alex Richardson
83c4227ab7 Auto-generate test checks for tests affected by D141060
These files had manual CHECK lines which make the diff from D141060
very difficult to review.
2023-10-04 10:51:35 -07:00
Ivan Kosarev
cf80defae2
[AMDGPU][GFX11] Do not rewrite V_FMA/FMAC_* to V_FMAAK_F16_t16 on operand legalization. (#66202)
V_FMAAK_F16_t16 takes VGPR_32_Lo128 operands whereas the original
instructions would have VGPR_32 operands. Switching the opcodes without
updating operands' register classes leads to MachineVerifier complaining
about the classes not matching instruction definitions. The problem only
reveals itself of builds with expensive checks enabled because of
missing -verify-machineinstrs in the test.

This is the third attempt to update CodeGen/AMDGPU/fma.f16.ll to run for
GFX11, following the second attempt in a1e38e0b8e3e, partially reverted
in eaf737a4e004.
2023-10-04 12:41:46 +01:00
Simon Pilgrim
7a8c04ef84 [DAG] Attempt shl narrowing in SimplifyDemandedBits
If a shl node leaves the upper half bits zero / undemanded, then see if we can profitably perform this with a half-width shl and a free trunc/zext.

Followup to D146121

Differential Revision: https://reviews.llvm.org/D155472
2023-10-04 10:23:02 +01:00
JP Lehr
e816c89c84 Revert "InlineSpiller: Consider if all subranges are the same when avoiding redundant spills"
This reverts commit d8127b2ba8a87a610851b9a462f2fc2526c36e37.
2023-10-02 06:26:33 -05:00
Matt Arsenault
d8127b2ba8 InlineSpiller: Consider if all subranges are the same when avoiding redundant spills
This avoids some redundant spills of subranges, and avoids a compile failure.
This greatly reduces the numbers of spills in a loop.

The main range is not informative when multiple instructions are needed to fully define
a register. A common scenario is a lowered reg_sequence where every subregister
is sequentially defined, but each def changes the main range's value number. If
we look at specific lanes at the use index, we can see the value is actually the
same.

In this testcase, there are a large number of materialized 64-bit constant defs
which are hoisted outside of the loop by MachineLICM. These are feeding REG_SEQUENCES,
which is not considered rematerializable inside the loop. After coalescing, the split
constant defs produce main ranges with an apparent phi def. There's no phi def if you look
at each individual subrange, and only half of the register is really redefined to a constant.

Fixes: SWDEV-380865

https://reviews.llvm.org/D147079
2023-10-01 11:37:53 +03:00
Matt Arsenault
7252787dd9 RegAllocGreedy: Fix detection of lanes read by a bundle
SplitKit creates questionably formed bundles of copies
when it needs to copy a subset of live lanes and can't do
it with a single subregister index. These are merely marked
as part of a bundle, and don't start with a BUNDLE instruction.
Queries for the slot index would give the first copy in the
bundle, and we need to inspect the operands of all the other
bundled copies.

Also fix and simplify detection of read lane subsets. This causes
some RISCV test regressions, but these look like accidentally beneficial
splits. I don't see a subrange based reason to perform these splits.

Avoids some really ugly regressions in a future patch.

https://reviews.llvm.org/D146859
2023-10-01 11:37:48 +03:00
Jay Foad
6e3d2a4b38
[ISel] Fix another crash in new FMA DAG combine (#67818)
Following on from D135150, this patch fixes another crash caused by this
DAG combine:

fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E)

The combine calls ReplaceAllUsesOfValueWith to replace (fmul C, D) with
(fma C, D, E). This can cause nodes to get CSEd. In D135150 the problem
was that the (fma C, D, E) node got CSEd away. In this new case, the
problem is that the outer fadd node gets CSEd away. To fix it we have
to return SDValue(N, 0) from the combine and be careful not to add a
deleted node to the worklist.
2023-09-29 17:18:23 +01:00
Nikita Popov
4251aa7a6f [IRBuilder] Migrate most casts to folding API
Migrate creation of most casts to use the FoldXYZ rather than
CreateXYZ style APIs. This means that InstSimplifyFolder now
works for these, which is what accounts for the AMDGPU test changes.
2023-09-29 12:40:38 +02:00
Mirko Brkušanin
2cd2445c21
[AMDGPU] Src1 of VOP3 DPP instructions can be SGPR on supported subtargets (#67461)
In order to avoid duplicating every dpp pseudo opcode that has src1, we
allow it for all opcodes and add manual checks on subtargets that do not
support it.
2023-09-29 11:54:49 +02:00
Yashwant Singh
7ac532efc8
[AMDGPU] Introduce AMDGPU::SGPR_SPILL asm comment flag (#67091)
Use this flag to give more context to implicit def comments in assembly.

Reviewed on phabricator: 
https://reviews.llvm.org/D153754
2023-09-29 11:15:01 +05:30
Tobias Stadler
305fbc1b32 Revert "[GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND"
This reverts commit 3686a0b611c65f0d7190345b8e3e73cdca9fa657.
This seems to have broken some sanitizer tests:
https://lab.llvm.org/buildbot/#/builders/184/builds/7721
2023-09-29 03:35:40 +02:00
Tobias Stadler
3686a0b611 [GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND
The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:
- We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
- The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
- This cleans up a lot of legalizer output and even improves compilation-times.
AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count).

Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D159140
2023-09-29 02:11:57 +02:00
Jay Foad
c3939eb827
[AMDGPU] Fix typo in scheduler option name (#67661)
Fix: -amdgpu-disable-unclustred-high-rp-reschedule
Now: -amdgpu-disable-unclustered-high-rp-reschedule
2023-09-28 20:54:57 +01:00
Jay Foad
a0a06b1804 [AMDGPU] Make a check slightly more robust
Previously this was relying on [[RESULT]] having been defined in an earlier function.
2023-09-28 13:09:51 +01:00
Ivan Kosarev
be8b559956 [AMDGPU] Test codegen'ing True16 additions.
The GlobalISel part is to be addressed later.

Differential Revision: https://reviews.llvm.org/D156106
2023-09-27 11:10:48 +01:00
Ivan Kosarev
3ff7d51eb8 [AMDGPU][True16] Pre-commit addition tests.
Differential Revision: https://reviews.llvm.org/D156529
2023-09-27 10:27:33 +01:00
Jay Foad
e3d714f2cc [AMDGPU] Add gfx1150 test coverage in trans-forwarding-hazards.mir
This demonstrates that gfx1150 does not have FeatureVALUTransUseHazard.
2023-09-26 17:24:43 +01:00
Ivan Kosarev
64482d5766
[AMDGPU] Fix passing CodeGen/AMDGPU/frem.ll on gfx1150. (#67425)
We would currently crash on it trying to use t16 instructions instead of
fake16 ones.
2023-09-26 15:13:23 +01:00
Ivan Kosarev
287f6cdd17 [AMDGPU] Remove the support for non-True16 copies between different register sizes.
Differential Revision: https://reviews.llvm.org/D156985
2023-09-26 14:46:34 +01:00
Jingu Kang
ff68e43c81 [MachineLICM] Handle Subloops
It is a re-commit from reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500.

Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outermost loop's blocks once.

Differential Revision: https://reviews.llvm.org/D154205
2023-09-26 14:25:11 +01:00
Jay Foad
d85d143ad9
[AMDGPU] New image intrinsic optimizer pass (#67151)
Implement a new pass to combine multiple image_load_2dmsaa and
2darraymsaa intrinsic calls into a single image_msaa_load if:

- they refer to the same vaddr except for sample_id,
- they use a constant sample_id and they fall into the same group,
- they have the same dmask and the number of instructions and the
  number of vaddr/vdata dword transfers is reduced by the combine

This should be valid on all GFX11 but a hardware bug renders it
unworkable on GFX11.0.* so it is only enabled for GFX11.5.

Based on a patch by Rodrigo Dominguez!
2023-09-26 09:33:49 +01:00