7565 Commits

Author SHA1 Message Date
paperchalice
abde52aa66
[CodeGen][NewPM] Port LiveIntervals to new pass manager (#98118)
- Add `LiveIntervalsAnalysis`.
- Add `LiveIntervalsPrinterPass`.
- Use `LiveIntervalsWrapperPass` in legacy pass manager.
- Use `std::unique_ptr` instead of raw pointer for `LICalc`, so
destructor and default move constructor can handle it correctly.

This would be the last analysis required by `PHIElimination`.
2024-07-10 19:34:48 +08:00
Fabian Ritter
17316a5989
Revert "[LowerMemIntrinsics] Use correct alignment in residual loop for variable llvm.memcpy" (#98295)
Reverts llvm/llvm-project#97998
This seems to cause a buildbot failure on clang-hip-vega20 in the HIP
test-suite; this needs investigation.
2024-07-10 12:16:20 +02:00
Fabian Ritter
6c84bba218
[LowerMemIntrinsics] Use correct alignment in residual loop for variable llvm.memcpy (#97998)
Memcpy intrinsics with statically unknown loop sizes are lowered with
two load/store loops: one with access widths specified by the target,
and a residual loop that copies remaining bytes individually.

As the residual loop operates byte-wise, its accesses are only
1-aligned. However, we currently use the alignment that is optimal for
the first loop in both, which is unsound. With this patch, we use the
correct alignment in the residual loop.

The lowering of memcpy with a static size already handles alignments for
the residual correctly.
2024-07-10 11:29:26 +02:00
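The alignment problem described in the LowerMemIntrinsics patch above can be illustrated with a small model. This is only a sketch, not the LLVM implementation: the function name, the `loop_op_size` and `loop_align` parameters, and the tuple layout are hypothetical, and it assumes the copy starts at an address aligned to `loop_align`.

```python
def lowered_memcpy_accesses(length, loop_op_size, loop_align):
    """Model the two-loop lowering as a list of (offset, width, alignment)."""
    accesses = []
    main_bytes = (length // loop_op_size) * loop_op_size

    # Main loop: accesses use the width and alignment chosen for the target.
    for offset in range(0, main_bytes, loop_op_size):
        accesses.append((offset, loop_op_size, loop_align))

    # Residual loop: byte-wise accesses, so only 1-alignment may be claimed.
    # Reusing loop_align here would claim, e.g., 4-alignment at offset 9.
    for offset in range(main_bytes, length):
        accesses.append((offset, 1, 1))

    return accesses


accesses = lowered_memcpy_accesses(length=10, loop_op_size=4, loop_align=4)
```

In this model every claimed alignment divides the access offset, which would not hold if the residual loop inherited the main loop's alignment.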
Carl Ritson
7eb1a320cc
[AMDGPU] Update EXECZ retention in SIPreEmitPeephole for GFX10/12 (#97676)
The check to maintain EXECZ branches only checks S_WAITCNT.
Add handling for new waitcnt instructions in GFX10 and GFX12.
2024-07-09 14:44:31 +09:00
Manish Kausik H
69192e0193
[LegalizeDAG] Optimize CodeGen for ISD::CTLZ_ZERO_UNDEF (#83039)
Previously we had the same instructions being generated for `ISD::CTLZ` and `ISD::CTLZ_ZERO_UNDEF` which did not take advantage of the fact that zero is an invalid input for `ISD::CTLZ_ZERO_UNDEF`. This commit separates codegen for the two cases to allow for the optimization for the latter case.

The details of the optimization are outlined in #82075

Fixes #82075

Co-authored-by: Manish Kausik H <hmamishkausik@gmail.com>
2024-07-08 14:01:32 +01:00
Vikram Hegde
2a9607168b
[AMDGPU] Cleanup bitcast spam in atomic optimizer (#96933) 2024-07-08 10:53:16 +05:30
Matt Arsenault
611212fc9a
AMDGPU/GlobalISel: Legalize atomicrmw fmin/fmax (#97048)
We only handled the easy LDS case before. Handle the other address
spaces with the more complicated legality logic.
2024-07-03 23:30:05 +02:00
Jeffrey Byrnes
5da7179cb3 [AMDGPU] Reland: Add IR LiveReg type-based optimization 2024-07-03 09:26:19 -07:00
Yingwei Zheng
d5c9ffd545
[SDAG] Intersect poison-generating flags after CSE (#97434)
This patch fixes a miscompilation when `N` gets CSEed to `Existing`:
```
Existing: t5: i32 = sub nuw Constant:i32<0>, t3
N: t30: i32 = sub Constant:i32<0>, t3
```

Fixes https://github.com/llvm/llvm-project/issues/96366.
2024-07-03 20:32:46 +08:00
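The idea behind the SDAG fix above can be sketched outside of LLVM. The `Node` class and `cse_lookup_or_insert` helper below are hypothetical stand-ins, not SelectionDAG APIs; the sketch only assumes that when a new node is merged into an existing one, the surviving node may keep just the poison-generating flags that both nodes carried.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    opcode: str
    operands: tuple
    flags: set = field(default_factory=set)


def cse_lookup_or_insert(table, node):
    """Return the canonical node for `node`, intersecting flags on a hit."""
    key = (node.opcode, node.operands)
    existing = table.get(key)
    if existing is None:
        table[key] = node
        return node
    # A flag such as nuw is only safe to keep if every merged node had it.
    existing.flags &= node.flags
    return existing


table = {}
existing = cse_lookup_or_insert(table, Node("sub", ("0", "t3"), {"nuw"}))
merged = cse_lookup_or_insert(table, Node("sub", ("0", "t3")))
```

After the second lookup, `merged` is the same object as `existing`, and its `nuw` flag has been dropped because the new node did not carry it.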
Fabian Ritter
d37e7ec2c5
[LowerMemIntrinsics] Respect the volatile argument of llvm.memmove (#97545)
So far, we ignored whether a memmove intrinsic is volatile when lowering it
to loops in the IR. This change generates volatile loads and stores in
this case (similar to how memcpy is handled) and adds tests for volatile
memmoves and memcpys.
2024-07-03 13:37:38 +02:00
Jay Foad
b76dd4edbf
[AMDGPU] Disable atomic optimization of fadd/fsub with result (#96479)
An atomic fadd instruction like this should return %x:

  ; value at %ptr is %x
  %r = atomicrmw fadd ptr %ptr, float %y

After atomic optimization, if %y is uniform, the result is calculated
as %r = %x + * %y * +0.0. This has a couple of problems:

1. If %y is Inf or NaN, this will return NaN instead of %x.
2. If %x is -0.0 and %y is positive, this will return +0.0 instead of
   -0.0.

Avoid these problems by disabling the "%y is uniform" path if there are
any uses of the result.
2024-07-03 11:35:51 +01:00
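The two problems listed in the fadd commit above can be reproduced with ordinary IEEE-754 arithmetic. This sketch assumes the result seen by the first active lane reduces to `x + y * +0.0`; the helper name is hypothetical and Python doubles stand in for the `float` type in the IR.

```python
import math


def first_lane_result(x, y):
    # Assumed shape of the optimized computation for the first lane.
    return x + y * 0.0


inf_case = first_lane_result(1.0, math.inf)    # Inf * 0.0 is NaN
nan_case = first_lane_result(1.0, math.nan)    # NaN propagates
neg_zero_case = first_lane_result(-0.0, 2.0)   # -0.0 + +0.0 is +0.0

# The sign of a zero is not visible through ==, so inspect it with copysign.
neg_zero_sign = math.copysign(1.0, neg_zero_case)
```

In each case the expected return value is `x` itself, which is why the uniform path is only safe when the result is unused.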
Alexis Engelke
bb260eb87d
[CodeGen] Only deduplicate PHIs on critical edges (#97064)
PHIElim deduplicates identical PHI nodes to reduce the number of copies
inserted. There are two cases:

1. Identical PHI nodes are in different blocks. That's the reason for
   this optimization; this can't be avoided at SSA-level. A necessary
   prerequisite for this is that the predecessors of all basic blocks
   (where such a PHI node could occur) are the same. This implies that
   all (>= 2) predecessors must have multiple successors, i.e. all edges
   into the block are critical edges.

2. Identical PHI nodes are in the same block. CSE can remove these.
   There are a few cases, however, where they still occur regardless:

   - expand-large-div-rem creates PHI nodes with large integers, which
     get lowered into one PHI per MVT. Later, some identical values
     (zeroes) get folded, resulting in identical PHI nodes.
   - peephole-opt occasionally inserts PHIs for the same value.
   - Some pseudo instruction emitters create redundant PHI nodes (e.g.,
     AVR's insertShift), merging the same values more than once.

   In any case, this happens rarely and MachineCSE handles most cases
   anyway, so that PHIElim only gets to see very few of such cases (see
   changed test files).

Currently, all PHI nodes are inserted into a DenseMap that checks
equality not by pointer but by operands. This hash map is pretty
expensive (hashing itself and the hash map), but only really useful in
the first case.

Avoid this expensive hashing most of the time by restricting it to basic
blocks with only critical input edges. This improves performance for
code with many PHI nodes, especially at -O0. (Note that Clang often
doesn't generate PHI nodes and -O0 includes no mem2reg. Other
compilers always generate PHI nodes.)
2024-07-03 11:19:05 +02:00
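The restriction described in the PHIElim commit above can be sketched as follows. This is a simplified model, not the pass itself: the CFG is a pair of plain dictionaries, PHI nodes are `(name, operands)` tuples, and both function names are hypothetical.

```python
def has_only_critical_input_edges(block, preds, succs):
    """True if the block has >= 2 predecessors that all have >= 2 successors."""
    block_preds = preds[block]
    return len(block_preds) >= 2 and all(len(succs[p]) >= 2 for p in block_preds)


def find_duplicate_phis(block, phis, preds, succs, seen):
    """Return (duplicate, original) pairs; `seen` is shared across blocks."""
    if not has_only_critical_input_edges(block, preds, succs):
        return []  # skip the expensive operand hashing entirely
    duplicates = []
    for name, operands in phis:
        key = tuple(sorted(operands))  # compare by operands, not identity
        if key in seen:
            duplicates.append((name, seen[key]))
        else:
            seen[key] = name
    return duplicates


# A and D both branch to B and C, so all edges into B and C are critical.
succs = {"A": ["B", "C"], "D": ["B", "C"], "B": ["E"], "C": ["E"], "E": []}
preds = {"A": [], "D": [], "B": ["A", "D"], "C": ["A", "D"], "E": ["B", "C"]}
incoming = [("v1", "A"), ("v2", "D")]

seen = {}
dups_b = find_duplicate_phis("B", [("phi_b", incoming)], preds, succs, seen)
dups_c = find_duplicate_phis("C", [("phi_c", incoming)], preds, succs, seen)
dups_e = find_duplicate_phis(
    "E", [("phi_e", [("x", "B"), ("y", "C")])], preds, succs, seen
)
```

Block `E` is skipped because its predecessors each have a single successor, so no hashing work is done for it.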
Jay Foad
f3a02253e9
[test] Remove immarg parameter attribute from calls (#97432)
It is documented that immarg is only valid on intrinsic declarations,
although the verifier also tolerates it on intrinsic calls.

This patch updates tests that are not specifically testing the
behavior of the IR parser or verifier.
2024-07-03 09:02:31 +01:00
Fabian Ritter
e1094dd889
[AMDGPU][DAG] Enable ganging up of memcpy loads/stores for AMDGPU (#96185)
In the SelectionDAG lowering of the memcpy intrinsic, this optimization
introduces additional chains between fixed-size groups of loads and the
corresponding stores. While initially introduced to ensure that wider
load/store-pair instructions are generated on AArch64, this optimization
also improves code generation for AMDGPU: Ganged loads are scheduled
into a clause; stores only await completion of their corresponding load.

The chosen value of 16 performed well in microbenchmarks; values of 8,
32, or 64 would perform similarly.
The testcase updates are autogenerated by
utils/update_llc_test_checks.py.

See also:
 - PR introducing this optimization: https://reviews.llvm.org/D46477

Part of SWDEV-455845.
2024-07-03 08:32:35 +02:00
Matt Arsenault
79516ddbee
AMDGPU: Fix assert from wrong address space size assumption (#97267)
This was assuming the source address space was at least as large
as the destination of the cast. I'm not sure why this was casting
to begin with; the assumption seems to be that the source
address space of the root addrspacecast matches the underlying
object, so check that directly.

Fixes #97457
2024-07-02 23:18:25 +02:00
Simon Pilgrim
1f7d31e342 [AMDGPU] Regenerate srem.ll tests - more closely match the testing in sdiv.ll 2024-07-02 17:14:39 +01:00
Jay Foad
43b9888214
[AMDGPU] Use nan as the identity for atomicrmw fmax/fmin (#97411)
atomicrmw fmax/fmin perform the same operation as llvm.maxnum/minnum
which return the other operand if one operand is nan. This means that,
in the presence of nan arguments, +/- inf is not an identity for these
operations but nan is (at least if you don't care about nan payloads).
2024-07-02 15:45:36 +01:00
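The identity argument in the fmax/fmin commit above can be checked with a small model of maxnum. The `maxnum`, `same`, and `is_identity` helpers are hypothetical names for this sketch; NaN payloads and signed zeros are ignored.

```python
import math


def maxnum(a, b):
    """Model of maxnum: if one operand is NaN, return the other one."""
    if math.isnan(a):
        return b
    if math.isnan(b):
        return a
    return max(a, b)


def same(a, b):
    """Equality that treats any two NaNs as equal."""
    return (math.isnan(a) and math.isnan(b)) or a == b


def is_identity(candidate, samples):
    """An identity e satisfies maxnum(e, x) == x for every x."""
    return all(same(maxnum(candidate, x), x) for x in samples)


samples = [-math.inf, -1.0, 0.0, 2.5, math.inf, math.nan]
nan_is_identity = is_identity(math.nan, samples)
neg_inf_is_identity = is_identity(-math.inf, samples)
```

`-inf` fails for the NaN sample, because `maxnum(-inf, nan)` yields `-inf` rather than NaN.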
Matt Arsenault
940ea5b8c5 AMDGPU: Add some exotic truncating store tests
PR#97010 is touching the legalize rules for 5 vector stores,
but not all of them, so check some more cases to make sure they work.
2024-07-02 16:33:36 +02:00
Shilei Tian
9a4f57ec1e
[SelectionDAG] Use EVT::getIntegerVT in getBitcastedAnyExtOrTrunc (#96658)
`SelectionDAG::getBitcastedAnyExtOrTrunc` assumes that there is always a
valid integer type corresponding to another type, which is not always true
when it comes to vector types. For example, `<3 x i8>` doesn't have a
corresponding integer type.

Fix SWDEV-464698.
2024-07-01 15:10:57 -04:00
Matt Arsenault
bff619f910 Revert "AMDGPU: Use real copysign in fast pow (#97152)"
This reverts commit d3e7c4ce7a3d7f08cea02cba8f34c590a349688b.
2024-07-01 20:54:50 +02:00
Matt Arsenault
d3e7c4ce7a
AMDGPU: Use real copysign in fast pow (#97152)
Previously this would introduce some codegen regressions, but
those have been avoided by simplifying demanded bits on copysign
operations.
2024-07-01 20:16:22 +02:00
Jeffrey Byrnes
f903e3ec77 [AMDGPU] Reset kill flags for multiple uses of SDWAInst Ops
Change-Id: I8b56d86a55c397623567945a87ad2f55749680bc
2024-07-01 09:14:02 -07:00
Matt Arsenault
0d88f662ff GlobalISel: ComputeNumSignBits from load range metadata
We're missing SimplifyDemandedBits styles of optimizations,
so one case differs from the DAG from not trimming the constant.
The other case is an optimization we get that the DAG doesn't do to
split the 64-bit shift.

https://reviews.llvm.org/D138082
2024-07-01 15:26:50 +02:00
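How a value range bounds the number of known sign bits can be sketched in a few lines. This is a simplified model and not the GlobalISel code: it assumes a non-wrapping signed half-open range `[lo, hi)` as in `!range` metadata, and both function names are hypothetical.

```python
def num_sign_bits(value, width):
    """Count leading bits equal to the sign bit of `value` in `width` bits."""
    magnitude = value if value >= 0 else ~value
    return width - magnitude.bit_length()


def num_sign_bits_from_range(lo, hi, width):
    """Lower bound on sign bits for any value in the signed range [lo, hi)."""
    # The extreme values of the range have the fewest sign bits.
    return min(num_sign_bits(lo, width), num_sign_bits(hi - 1, width))


i8_in_i32 = num_sign_bits_from_range(-128, 128, 32)
u8_in_i32 = num_sign_bits_from_range(0, 256, 32)
```

A load whose range covers the signed 8-bit values therefore has at least 25 known sign bits in a 32-bit register.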
Matt Arsenault
7032076242
GlobalISel: Drop vector range metadata on bitcast lowering (#97279)
If we are reinterpreting the type, the range metadata also needs to be
converted. I believe the DAG has the same bug.
2024-07-01 15:26:09 +02:00
Matt Arsenault
8eee6d33f7
DAG: Call SimplifyDemandedBits on copysign value operand (#97180)
So far the only cases that seem to benefit are the weird
copysign with different typed inputs.
2024-07-01 12:29:11 +02:00
Matt Arsenault
db9252b115
DAG: Call SimplifyDemandedBits on fcopysign sign value (#97151)
Math library code has quite a few places with complex bit
logic that are ultimately fed into a copysign. This helps
avoid some regressions in a future patch.

This assumes the position in the float type, which should
at least be valid for IEEE types. Not sure if we need to guard
against ppc_fp128 or anything else weird.

There appears to be some value in simplifying the value operand
as well, but I'll address that separately.
2024-07-01 12:19:17 +02:00
Matt Arsenault
c769dc457c
AMDGPU: Add baseline test for copysign combine (#97150)
Pre-commit tests showing we try to SimplifyDemandedBits on the
sign operand.
2024-07-01 12:14:57 +02:00
Matt Arsenault
3562001007 AMDGPU: Regenerate test checks to avoid spurious diff 2024-07-01 11:10:17 +02:00
Nuno Lopes
0e6257fbc2 SSAUpdater: use poison instead of undef in phi entries for unreachable predecessors 2024-06-30 11:51:30 +01:00
Matt Arsenault
76bc071418
DAG: Fix assert when legalizing v3f16 ldexp (#97098)
For the v3f16.v3i32 case, the v3f16 would request widening
to v4f16, but the v3i32 does not require widening to be a legal
type, so GetWidenedVector would fail. We need to widen the exponent
vector to the same element count as the result.

Fixes: SWDEV-470951
2024-06-30 08:29:20 +02:00
Vitaly Buka
3e53c97d33
Revert "[AMDGPU] Add IR LiveReg type-based optimization" (#97138)
Part of #66838.

https://lab.llvm.org/buildbot/#/builders/52/builds/404
https://lab.llvm.org/buildbot/#/builders/55/builds/358
https://lab.llvm.org/buildbot/#/builders/164/builds/518

This reverts commit ded956440739ae326a99cbaef18ce4362e972679.
2024-06-28 23:18:26 -07:00
Vigneshwar Jayakumar
d2c817df84
[AMDGPU] Fix DynLDS causing crash when LowerLDS is run at fullLTO pipeline (#96038)
Direct mapped dynamic LDS is not lowered in the LowerLDSModule pass.
Hence it is not marked with an absolute symbol. When the LowerLDS pass is
rerun in LTO, compilation fails with an assert "cannot mix abs and non-abs LDVs".
This patch adds an additional check for direct mapped dynLDS to skip the assert.

Fixes SWDEV-454281
2024-06-28 21:05:48 -05:00
Jeffrey Byrnes
ded9564407 [AMDGPU] Add IR LiveReg type-based optimization
Change-Id: Ia0d11b79b8302e79247fe193ccabc0dad2d359a0
2024-06-28 15:01:39 -07:00
Matt Arsenault
2df2373eb8
DAG/GlobalISel: Set disjoint for or in copysign lowering (#97057)
We masked out the sign bit from one value, and the non-sign bits
from the other so there should be no common bits set.

No idea how to test this on the DAG path, other than scraping
the debug logs. A few targets hit this path with f16 values, but
the resulting i16 ors get anyext promoted and lose the disjoint
flag. In the fp128 case, PPC gets further and the or loses the flag
somewhere else later. Adding a haveNoCommonBits assert shows this
works though.
2024-06-28 23:03:39 +02:00
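The reasoning for the disjoint flag in the copysign commit above can be shown with a bit-level copysign on 32-bit floats. This is a sketch using Python's `struct` module rather than anything from LLVM; the helper names are hypothetical.

```python
import struct

SIGN_MASK = 0x80000000
MAGNITUDE_MASK = 0x7FFFFFFF


def f32_bits(x):
    """Reinterpret a float as its 32-bit IEEE-754 pattern."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(bits):
    """Reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def copysign_f32(magnitude, sign):
    mag_bits = f32_bits(magnitude) & MAGNITUDE_MASK  # sign bit cleared
    sign_bits = f32_bits(sign) & SIGN_MASK           # only the sign bit kept
    # The two masks are complementary, so the operands share no set bits.
    # This is the property that justifies marking the `or` as disjoint.
    assert mag_bits & sign_bits == 0
    return bits_f32(mag_bits | sign_bits)


negated = copysign_f32(1.5, -2.0)
cleared = copysign_f32(-3.0, 1.0)
```

Because the operands have no common bits, the `or` behaves exactly like an `add` here.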
Matt Arsenault
28d142a485 AMDGPU/GlobalISel: Make pk f16 atomicrmw fadd legal for gfx908
The subtarget features for these are a bit of a mess; the no return
version should probably be implied by the with-return feature.
2024-06-28 11:33:42 +02:00
isuckatcs
937d79bc9d
[GlobalISel][AArch64][AMDGPU] Expand FPOWI into series of multiplication (#95217)
SelectionDAG already converts FPOWI into a series of optimized multiplications;
this patch introduces the same optimization into GlobalISel.
2024-06-28 09:57:50 +02:00
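One common way to turn an integer power into a short series of multiplications is exponentiation by squaring. The sketch below illustrates that technique only; it is an assumption that the expansion in the FPOWI patch above has this shape, and the `powi` function here is a hypothetical Python stand-in.

```python
def powi(base, exponent):
    """Compute base**exponent for an integer exponent via repeated squaring."""
    result = 1.0
    remaining = abs(exponent)
    current = base
    while remaining:
        if remaining & 1:
            result *= current  # this bit of the exponent contributes
        current *= current     # square for the next bit
        remaining >>= 1
    # A negative exponent is handled with a single reciprocal at the end.
    return 1.0 / result if exponent < 0 else result


power = powi(2.0, 10)
inverse = powi(2.0, -2)
```

The number of multiplications grows with the bit length of the exponent rather than with its value.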
Matt Arsenault
a2a73d892a
AMDGPU: Fix no return atomicrmw fadd v2f16 selection for gfx908 (#96948)
We previously would always expand this with a cmpxchg loop, while
it should be the same conditions as the f32 case (except for the
denormal concern).
2024-06-27 21:17:16 +02:00
Jay Foad
4e70720139 [AMDGPU] Add some gfx1200 test coverage 2024-06-27 14:53:59 +01:00
Matt Arsenault
4477ff6836
AMDGPU: Remove ds_fmin/ds_fmax intrinsics (#96739)
These have been replaced with atomicrmw.
2024-06-27 15:35:24 +02:00
Jay Foad
bf536cc7db
[AMDGPU] Fix unwanted LICM/CSE of llvm.amdgcn.pops.exiting.wave.id (#96190)
Mark both the intrinsic and the selected MachineInstr as having side
effects to prevent MachineLICM and MachineCSE from moving/removing them.
2024-06-27 09:27:52 +01:00
Janek van Oirschot
17eaa23f7e
[AMDGPU] MCExpr-ify AMDGPU HSAMetadata (#94788)
Enables MCExpr for HSAMetadata, particularly, HSAMetadata's msgpack format.
2024-06-26 16:39:08 +01:00
Vikram Hegde
35f7b60aa6
[AMDGPU] Extend permlane16, permlanex16 and permlane64 intrinsic lowering for generic types (#92725)
These are incremental changes over #89217, with the core logic being the
same. This patch, along with #89217 and #91190, should get us ready to enable
64-bit optimizations in the atomic optimizer.
2024-06-26 09:24:09 +05:30
Matt Arsenault
4f80f362a5 AMDGPU: Add new metadata and expand atomicrmw fadd expansion tests 2024-06-25 23:42:48 +02:00
Matt Arsenault
8bba070ef8 AMDGPU: Expand testing of atomicrmw fmin/fmax lowering
Cover amdgpu.no.fine.grained.memory vs. amdgpu.no.remote.memory.
2024-06-25 23:42:48 +02:00
Jay Foad
aaf50bf34f
[AMDGPU] Disallow negative s_load offsets in isLegalAddressingMode (#91327) 2024-06-25 17:43:00 +01:00
Matt Arsenault
889f3c5741
AMDGPU: Handle legal v2bf16 atomicrmw fadd for gfx12 (#95930)
Annoyingly gfx90a/940 support this for global/flat but not buffer.
2024-06-25 17:45:34 +02:00
Vikram Hegde
5feb32ba92
[AMDGPU] Extend readlane, writelane and readfirstlane intrinsic lowering for generic types (#89217)
This patch is intended to be the first of a series whose end goal is to
adapt the atomic optimizer pass to support i64 and f64 operations (along
with removing all unnecessary bitcasts). It legalizes 64-bit readlane,
writelane and readfirstlane ops pre-ISel.

---------

Co-authored-by: vikramRH <vikhegde@amd.com>
2024-06-25 14:35:19 +05:30
vangthao95
3aef525aa4
[AMDGPU] Fix negative immediate offset for unbuffered smem loads (#89165)
For unbuffered smem loads, it is illegal for the immediate offset to be
negative if the resulting IOFFSET + (SGPR[Offset] or M0 or zero) is
negative.

New PR of https://github.com/llvm/llvm-project/pull/79553.
2024-06-24 14:18:23 -07:00
Mariusz Sikora
689c5c4829
[AMDGPU] Set total VGPRs to 1536 for gfx12 (#96272)
- Use Feature1_5xVGPRs
2024-06-24 13:26:03 +02:00
vg0204
c2fc7f75f6 Revert "[AMDGPU]Optimize SGPR spills (#93668)"
This reverts commit 4b9112e88a998ce620e4683548f2afd17cc5fe95. A separate
issue (#96353) has been opened to track the problem.
2024-06-24 12:36:36 +05:30