6824 Commits

Author SHA1 Message Date
JP Lehr
e816c89c84 Revert "InlineSpiller: Consider if all subranges are the same when avoiding redundant spills"
This reverts commit d8127b2ba8a87a610851b9a462f2fc2526c36e37.
2023-10-02 06:26:33 -05:00
Matt Arsenault
d8127b2ba8 InlineSpiller: Consider if all subranges are the same when avoiding redundant spills
This avoids some redundant spills of subranges, and avoids a compile failure.
This greatly reduces the numbers of spills in a loop.

The main range is not informative when multiple instructions are needed to fully define
a register. A common scenario is a lowered reg_sequence where every subregister
is sequentially defined, but each def changes the main range's value number. If
we look at specific lanes at the use index, we can see the value is actually the
same.

In this testcase, there are a large number of materialized 64-bit constant defs
which are hoisted outside of the loop by MachineLICM. These are feeding REG_SEQUENCES,
which is not considered rematerializable inside the loop. After coalescing, the split
constant defs produce main ranges with an apparent phi def. There's no phi def if you look
at each individual subrange, and only half of the register is really redefined to a constant.

Fixes: SWDEV-380865

https://reviews.llvm.org/D147079
2023-10-01 11:37:53 +03:00
Matt Arsenault
7252787dd9 RegAllocGreedy: Fix detection of lanes read by a bundle
SplitKit creates questionably formed bundles of copies
when it needs to copy a subset of live lanes and can't do
it with a single subregister index. These are merely marked
as part of a bundle, and don't start with a BUNDLE instruction.
Queries for the slot index would give the first copy in the
bundle, and we need to inspect the operands of all the other
bundled copies.

Also fix and simplify detection of read lane subsets. This causes
some RISCV test regressions, but these look like accidentally beneficial
splits. I don't see a subrange based reason to perform these splits.

Avoids some really ugly regressions in a future patch.

https://reviews.llvm.org/D146859
2023-10-01 11:37:48 +03:00
Jay Foad
6e3d2a4b38
[ISel] Fix another crash in new FMA DAG combine (#67818)
Following on from D135150, this patch fixes another crash caused by this
DAG combine:

fadd (fma A, B, (fmul C, D)), E --> fma A, B, (fma C, D, E)

The combine calls ReplaceAllUsesOfValueWith to replace (fmul C, D) with
(fma C, D, E). This can cause nodes to get CSEd. In D135150 the problem
was that the (fma C, D, E) node got CSEd away. In this new case, the
problem is that the outer fadd node gets CSEd away. To fix it we have
to return SDValue(N, 0) from the combine and be careful not to add a
deleted node to the worklist.
2023-09-29 17:18:23 +01:00
Nikita Popov
4251aa7a6f [IRBuilder] Migrate most casts to folding API
Migrate creation of most casts to use the FoldXYZ rather than
CreateXYZ style APIs. This means that InstSimplifyFolder now
works for these, which is what accounts for the AMDGPU test changes.
2023-09-29 12:40:38 +02:00
Mirko Brkušanin
2cd2445c21
[AMDGPU] Src1 of VOP3 DPP instructions can be SGPR on supported subtargets (#67461)
In order to avoid duplicating every dpp pseudo opcode that has src1, we
allow it for all opcodes and add manual checks on subtargets that do not
support it.
2023-09-29 11:54:49 +02:00
Yashwant Singh
7ac532efc8
[AMDGPU] Introduce AMDGPU::SGPR_SPILL asm comment flag (#67091)
Use this flag to give more context to implicit def comments in assembly.

Reviewed on phabricator: 
https://reviews.llvm.org/D153754
2023-09-29 11:15:01 +05:30
Tobias Stadler
305fbc1b32 Revert "[GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND"
This reverts commit 3686a0b611c65f0d7190345b8e3e73cdca9fa657.
This seems to have broken some sanitizer tests:
https://lab.llvm.org/buildbot/#/builders/184/builds/7721
2023-09-29 03:35:40 +02:00
Tobias Stadler
3686a0b611 [GlobalISel] LegalizationArtifactCombiner: Elide redundant G_AND
The legalizer currently generates lots of G_AND artifacts.
For example between boolean uses and defs there is always a G_AND with a mask of 1, but when the target uses ZeroOrOneBooleanContents, this is unnecessary.
Currently these artifacts have to be removed using post-legalize combines.
Omitting these artifacts at their source in the artifact combiner has a few advantages:
- We know that the emitted G_AND is very likely to be useless, so our KnownBits call is likely worth it.
- The G_AND and G_CONSTANT can interrupt e.g. G_UADDE/... sequences generated during legalization of wide adds which makes it harder to detect these sequences in the instruction selector (e.g. useful to prevent unnecessary reloading of AArch64 NZCV register).
- This cleans up a lot of legalizer output and even improves compilation-times.
AArch64 CTMark geomean: `O0` -5.6% size..text; `O0` and `O3` ~-0.9% compilation-time (instruction count).

Since this introduces KnownBits into code-paths used by `O0`, I reduced the default recursion depth.
This doesn't seem to make a difference in CTMark, but should prevent excessive recursive calls in the worst case.

Reviewed By: aemerson

Differential Revision: https://reviews.llvm.org/D159140
2023-09-29 02:11:57 +02:00
Jay Foad
c3939eb827
[AMDGPU] Fix typo in scheduler option name (#67661)
Fix: -amdgpu-disable-unclustred-high-rp-reschedule
Now: -amdgpu-disable-unclustered-high-rp-reschedule
2023-09-28 20:54:57 +01:00
Jay Foad
a0a06b1804 [AMDGPU] Make a check slightly more robust
Previously this was relying on [[RESULT]] having been defined in an earlier function.
2023-09-28 13:09:51 +01:00
Ivan Kosarev
be8b559956 [AMDGPU] Test codegen'ing True16 additions.
The GlobalISel part is to be addressed later.

Differential Revision: https://reviews.llvm.org/D156106
2023-09-27 11:10:48 +01:00
Ivan Kosarev
3ff7d51eb8 [AMDGPU][True16] Pre-commit addition tests.
Differential Revision: https://reviews.llvm.org/D156529
2023-09-27 10:27:33 +01:00
Jay Foad
e3d714f2cc [AMDGPU] Add gfx1150 test coverage in trans-forwarding-hazards.mir
This demonstrates that gfx1150 does not have FeatureVALUTransUseHazard.
2023-09-26 17:24:43 +01:00
Ivan Kosarev
64482d5766
[AMDGPU] Fix passing CodeGen/AMDGPU/frem.ll on gfx1150. (#67425)
We would currently crash on it trying to use t16 instructions instead of
fake16 ones.
2023-09-26 15:13:23 +01:00
Ivan Kosarev
287f6cdd17 [AMDGPU] Remove the support for non-True16 copies between different register sizes.
Differential Revision: https://reviews.llvm.org/D156985
2023-09-26 14:46:34 +01:00
Jingu Kang
ff68e43c81 [MachineLICM] Handle Subloops
It is a re-commit from reverted commit 3454cf67bd0a650097dc6ca99874a34e1d59b500.

Following discussion on https://reviews.llvm.org/D154205, make MachineLICM pass
handle subloops with only visiting outermost loop's blocks once.

Differential Revision: https://reviews.llvm.org/D154205
2023-09-26 14:25:11 +01:00
Jay Foad
d85d143ad9
[AMDGPU] New image intrinsic optimizer pass (#67151)
Implement a new pass to combine multiple image_load_2dmsaa and
2darraymsaa intrinsic calls into a single image_msaa_load if:

- they refer to the same vaddr except for sample_id,
- they use a constant sample_id and they fall into the same group,
- they have the same dmask and the number of instructions and the
  number of vaddr/vdata dword transfers is reduced by the combine

This should be valid on all GFX11 but a hardware bug renders it
unworkable on GFX11.0.* so it is only enabled for GFX11.5.

Based on a patch by Rodrigo Dominguez!
2023-09-26 09:33:49 +01:00
Ivan Kosarev
053478bbd0 [AMDGPU] Switch to using real True16 operands.
The DPP source and e64 destination operands remain unchanged for now.

Reviewed By: Joe_Nash

Differential Revision: https://reviews.llvm.org/D156104
2023-09-25 18:21:13 +01:00
Austin Kerbow
0455596e1e [AMDGPU] Add DAG ISel support for preloaded kernel arguments
This patch adds the DAG isel changes for kernel argument preloading.
These changes are not usable with older firmware but subsequent patches
in the series will make the codegen backwards compatible. This patch
should only be submitted alongside that subsequent patch.

Preloading here begins from the start of the kernel arguments until the
amount of arguments indicated by the CL flag
amdgpu-kernarg-preload-count.

Aggregates and arguments passed by-ref are not supported.

Special care for the alignment of the kernarg segment is needed as well
as consideration of the alignment of addressable SGPR tuples when we
cannot directly use misaligned large tuples that the arguments are
loaded to.

Reviewed By: bcahoon

Differential Revision: https://reviews.llvm.org/D158579
2023-09-25 09:32:59 -07:00
Austin Kerbow
7b70af297a [AMDGPU] Add IR lowering changes for preloaded kernargs
Preloaded kernel arguments should not be lowered in the IR pass
AMDGPULowerKernelArguments. Therefore it's necessary to calculate the
total number of user SGPRs that are available for preloading and how
many SGPRs would be required to preload each argument to determine
whether we should skip lowering i.e. the argument will be preloaded
instead.

Reviewed By: bcahoon

Differential Revision: https://reviews.llvm.org/D156853
2023-09-25 08:54:07 -07:00
Diana Picus
327fdcf789 Revert "AMDGPU: Duplicate instead of COPY constants from VGPR to SGPR (#66882)"
This reverts commit a04603993b43e5ebac1531293d288315f1885886 because
it broke the OpenMP buildbot.
2023-09-25 13:40:38 +02:00
Diana
a04603993b
AMDGPU: Duplicate instead of COPY constants from VGPR to SGPR (#66882)
Teach the si-fix-sgpr-copies pass to deal with REG_SEQUENCE, PHI or
INSERT_SUBREG where the result is an SGPR, but some of the inputs are
constants materialized into VGPRs. This may happen in cases where for
instance several instructions use an immediate zero and SelectionDAG
chooses to put it in a VGPR to satisfy all of them. This however causes
the si-fix-sgpr-copies to try to switch the whole chain to VGPR and may
lead to illegal VGPR-to-SGPR copies. Rematerializing the constant into
an SGPR fixes the issue.
2023-09-25 13:20:08 +02:00
Simon Pilgrim
8b36d082c4 [DAG] getNode() - fold (zext (trunc x)) -> x iff the upper bits are known zero - add SRL support
This is part of the work to address the D155472 regressions, there's a number of issues with generalizing this fold which is why I'm just adding SRL support atm.

Differential Revision: https://reviews.llvm.org/D159533
2023-09-24 13:40:07 +01:00
Simon Pilgrim
142efd6d61 [AMDGPU] Add ISD::FSHR Handling to AMDGPUISD::PERM matching
Pulled out of D159533, which encourages (zext (trunc x)) -> x folds, leading to more ISD::FSHR nodes, which was breaking some existing AMDGPUISD::PERM tests

Differential Revision: https://reviews.llvm.org/D159533
2023-09-24 13:40:07 +01:00
Ivan Kosarev
fab28e0e14 Reapply "[AMDGPU] Introduce real and keep fake True16 instructions."
Reverts 6cb3866b1ce9d835402e414049478cea82427cf1.

Analysis of failures on buildbots with expensive checks enabled showed
that the problem was triggered by changes in another commit,
469b3bfad20550968ac428738eb1f8bb8ce3e96d, and was caused by the bug
addressed in #67245.
2023-09-23 22:07:41 +01:00
Ivan Kosarev
6cb3866b1c Revert "[AMDGPU] Introduce real and keep fake True16 instructions."
This reverts commit 0f864c7b8bc9323293ec3d85f4bd5322f8f61b16 due to
failures on expensive checks.
2023-09-22 15:40:26 +01:00
Mirko Brkusanin
72e3713009 [IRTranslator] Set NUW flag for inbounds gep and load/store offsets
Patch by: Acim Maravic

Differential Revision: https://reviews.llvm.org/D159515
2023-09-22 16:16:28 +02:00
Mirko Brkusanin
a657deb42e [AMDGPU] Update RUN line in test (NFC) 2023-09-22 12:41:54 +02:00
Ivan Kosarev
c62f208c05 [AMDGPU] Don't suppress printing the .l and .h register suffixes.
We don't seem to have a use for the -amdgpu-keep-16-bit-reg-suffixes
option anymore. Was introduced in <https://reviews.llvm.org/D79435>.

Reviewed By: Joe_Nash, foad

Differential Revision: https://reviews.llvm.org/D156102
2023-09-22 11:13:05 +01:00
Ivan Kosarev
0f864c7b8b [AMDGPU] Introduce real and keep fake True16 instructions.
The existing fake True16 instructions using 32-bit VGPRs are supposed to
co-exist with real ones until all the necessary True16 functionality is
implemented and relevant tests are updated.

Reviewed By: arsenm, Joe_Nash

Differential Revision: https://reviews.llvm.org/D156101
2023-09-22 10:57:56 +01:00
Ivan Kosarev
469b3bfad2 [AMDGPU] Add True16 register classes.
Reviewed By: rampitec, Joe_Nash

Differential Revision: https://reviews.llvm.org/D156099
2023-09-22 10:17:02 +01:00
Sirish Pande
e6f9483f77
[SelectionDAG] Flags are dropped when creating a new FMUL (#66701)
While simplifying some vector operators in DAG combine, we may need to
create new instructions for simplified vectors. At that time, we need to
make sure that all the flags of the new instruction are copied/modified
from the old instruction.

If "contract" is dropped from an instruction like FMUL, it may not
generate FMA instruction which would impact performance.

Here's an example where "contract" flag is dropped when FMUL is created.

Replacing.2 t42: v2f32 = fmul contract t41, t38
With: t48: v2f32 = fmul t38, t38

Co-authored-by: Sirish Pande <sirish.pande@amd.com>
2023-09-21 10:26:34 -05:00
Jeffrey Byrnes
acb4854563
[AMDGPU] Precommit test for D159533 (#66965)
Precommit test ahead of https://reviews.llvm.org/D159533 for ISD::FSHR /
AMDGPUISD::PERM combine
2023-09-21 12:17:59 +01:00
Mirko Brkušanin
ecfdc23dd2
[AMDGPU] Select gfx1150 SALU Float instructions (#66885) 2023-09-21 12:22:55 +02:00
Pierre van Houtryve
fe2f67e4ba
[AMDGPU] Remove Code Object V2 (#65715)
Code Object V2 has been deprecated for more than a year now. We can
safely remove it from LLVM.

- [clang] Remove support for the `-mcode-object-version=2` option.
- [lld] Remove/refactor tests that were still using COV2
- [llvm] Update AMDGPUUsage.rst
- Code Object V2 docs are left for informational purposes because those
code objects may still be supported by the runtime/loaders for a while.
- [AMDGPU] Remove COV2 emission capabilities.
- [AMDGPU] Remove `MetadataStreamerYamlV2` which was only used by COV2
- [AMDGPU] Update all tests that were still using COV2 - They are either
deleted or ported directly to code object v4 (as v3 is also planned to
be removed soon).
2023-09-21 12:00:45 +02:00
Noah Goldstein
47c642f9a0 [DAGCombiner] Fold IEEE fmul/fdiv by Pow2 to add/sub of exp
Note: This is moving D154678 which previously implemented this in
InstCombine. Concerns where brought up that this was de-canonicalizing
and really targeting a codegen improvement, so placing in DAGCombiner.

This implements:

```
(fmul C, (uitofp Pow2))
    -> (bitcast_to_FP (add (bitcast_to_INT C), Log2(Pow2) << mantissa))
(fdiv C, (uitofp Pow2))
    -> (bitcast_to_FP (sub (bitcast_to_INT C), Log2(Pow2) << mantissa))
```
The motivation is mostly fdiv where 2^(-p) is a fairly common
expression.

The patch is intentionally conservative about the transform, only
doing so if we:
    1) have IEEE floats
    2) C is normal
    3) add/sub of max(Log2(Pow2)) stays in the min/max exponent
       bounds.

Alive2 can't realistically prove this, but did test float16/float32
cases (within the bounds of the above rules) exhaustively.

Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D154805
2023-09-20 13:28:24 -05:00
Noah Goldstein
32a46919a2 [AMDGPU] Add tests for folding fmul/fdiv by Pow2 to add/sub of exp; NFC
Reviewed By: RKSimon

Differential Revision: https://reviews.llvm.org/D159405
2023-09-20 13:28:24 -05:00
Sirish Pande
cc3491fd45
[SelectionDAG] [NFC] Add pre-commit test for PR66701. (#66796)
[SelectionDAG] [NFC] Add pre-commit test for PR66701.

Co-authored-by: Sirish Pande <sirish.pande@amd.com>
2023-09-20 11:37:18 -05:00
Simon Pilgrim
2ec697b4c7 [AMDGPU] Regenerate always-uniform.ll 2023-09-20 16:58:00 +01:00
Joe Nash
2c0f2b510c
[AMDGPU] Convert tests rotr.ll and rotl.ll to be auto-generated (#66828)
and add GFX11 coverage. NFC
2023-09-20 10:32:04 -04:00
Jay Foad
a68c7241ec
[AMDGPU] Run twoaddr tests with -early-live-intervals (#66775)
Sample test case:

%3 = V_FMAC_F32_e32 killed %0, %1, %2, implicit $mode, implicit $exec

With LiveVariables this is converted to three-address form just because
there is no "killed" flag on %2. To make it do the same thing with
LiveIntervals I added a later use of %2:

%3 = V_FMAC_F32_e32 killed %0, %1, %2, implicit $mode, implicit $exec
    S_ENDPGM 0, implicit %2
2023-09-20 08:22:00 +01:00
Dhruv Chawla
3e992d81af
[InferAlignment] Enable InferAlignment pass by default
This gives an improvement of 0.6%:
https://llvm-compile-time-tracker.com/compare.php?from=7d35fe6d08e2b9b786e1c8454cd2391463832167&to=0456c8e8a42be06b62ad4c3e3cf34b21f2633d1e&stat=instructions:u

Differential Revision: https://reviews.llvm.org/D158600
2023-09-20 12:08:52 +05:30
Austin Kerbow
60a227c464 [AMDGPU] Use inreg for hint to preload kernel arguments
This patch is the first in a series that adds support for pre-loading
kernel arguments into SGPRs. The command-line argument
'amdgpu-kernarg-preload-count' is used to specify the number of
arguments sequentially from the first that we should attempt to preload,
the default is 0.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D156852
2023-09-19 15:13:38 -07:00
Jay Foad
e0919b189b [CodeGen] Renumber slot indexes before register allocation (#66334)
RegAllocGreedy uses SlotIndexes::getApproxInstrDistance to approximate
the length of a live range for its heuristics. Renumbering all slot
indexes with the default instruction distance ensures that this estimate
will be as accurate as possible, and will not depend on the history of
how instructions have been added to and removed from SlotIndexes's maps.

This also means that enabling -early-live-intervals, which runs the
SlotIndexes analysis earlier, will not cause large amounts of churn due
to different register allocator decisions.
2023-09-19 11:18:12 +01:00
Jay Foad
1d305f95d6 [AMDGPU] Fix line endings in a test 2023-09-19 11:09:03 +01:00
Matt Arsenault
1328a8534b
AMDGPU: Fix handling of -0 in round lowering (#65761) 2023-09-19 09:14:17 +03:00
Guozhi Wei
cbdccb30c2 [RA] Split a virtual register in cold blocks if it is not assigned preferred physical register
If a virtual register is not assigned preferred physical register, it means some
COPY instructions will be changed to real register move instructions. In this
case we can try to split the virtual register in colder blocks, if success, the
original COPY instructions can be deleted, and the new COPY instructions in
colder blocks will be generated as register move instructions. It results in
fewer dynamic register move instructions executed.

The new test case split-reg-with-hint.ll gives an example, the hot path contains
24 instructions without this patch, now it is only 4 instructions with this
patch.

Differential Revision: https://reviews.llvm.org/D156491
2023-09-15 19:52:50 +00:00
Benjamin Kramer
3454cf67bd Revert "[MachineLICM] Handle Subloops"
This reverts commit 5ec9699c4d1f165364586d825baef434e2c110b4. It
accesses MI after it has been hoisted.
2023-09-15 13:20:31 +02:00
Jay Foad
ceb68eea8c
[AMDGPU] Remove repeated -mtriple options from RUN lines (#66486) 2023-09-15 11:29:24 +01:00