8259 Commits

Author SHA1 Message Date
Robert Imschweiler
bcba3117c0
[AMDGPU] SelDAG: fix lowering of undefined workitem intrinsics (#126058)
GlobalISel already handles undefined workitem.id.{x,y,z} intrinsics,
SelDAG failed in AMDGPUISelLowering.cpp due to a failed assertion in
`AMDGPUTargetLowering::loadInputValue`: `Arg && "Attempting to load
missing argument"`. This commit changes the behavior of SelDAG to
instead use a zero constant.

This LLVM defect was identified via the AMD Fuzzing project.
2025-02-12 18:41:41 -05:00
Jeffrey Byrnes
c5a4512d85
[AMDGPU] iglp.opt does not clobber memory operands (#126976)
I think it was an accident that this wasn't included.
2025-02-12 14:11:02 -08:00
Akshat Oke
7b60e03d73
Reland "[CodeGen][NewPM] Port MachineScheduler to NPM. (#125703)" (#126684)
`RegisterClassInfo` was supposed to be kept alive between pass runs,
which wasn't being done leading to recomputations increasing the compile
time.

Now the Impl class is a member of the legacy and new passes so that it
is not reconstructed on every pass run.
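
A minimal sketch of the ownership change described above, assuming nothing about the real classes: `ExpensiveInfo`, `SchedulerImpl`, and both pass wrappers are hypothetical stand-ins, not LLVM types. It only shows why holding the Impl as a member avoids rebuilding cached state on every run.

```cpp
#include <cassert>

// Counts how often the (hypothetical) expensive state is built.
int g_infoConstructions = 0;

// Hypothetical stand-in for cached analysis state such as RegisterClassInfo.
struct ExpensiveInfo {
  ExpensiveInfo() { ++g_infoConstructions; }
};

// Hypothetical Impl class that owns the expensive state.
struct SchedulerImpl {
  ExpensiveInfo info;
  int runs = 0;
  void run() { ++runs; }
};

// Before: the Impl (and its cached state) is rebuilt on every run.
struct RebuildEveryRunPass {
  void run() {
    SchedulerImpl impl;
    impl.run();
  }
};

// After: the Impl is a member, so its state survives across runs.
struct ImplAsMemberPass {
  SchedulerImpl impl;
  void run() { impl.run(); }
};

// Returns how many times the expensive state was constructed.
int constructionsAfterRuns(bool keepImplAlive, int numRuns) {
  g_infoConstructions = 0;
  if (keepImplAlive) {
    ImplAsMemberPass pass;
    for (int i = 0; i < numRuns; ++i) pass.run();
  } else {
    RebuildEveryRunPass pass;
    for (int i = 0; i < numRuns; ++i) pass.run();
  }
  return g_infoConstructions;
}
```

With three runs, the rebuild-per-run wrapper constructs the state three times and the member-holding wrapper constructs it once.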

---------

Co-authored-by: Christudasan Devadasan <christudasan.devadasan@amd.com>
2025-02-12 18:54:39 +05:30
Vikram Hegde
9c725ef368
[AMDGPU][NewPM] Port "GCNRewritePartialRegUses" pass to NPM (#126024) 2025-02-12 11:21:40 +05:30
Krzysztof Drewniak
934c97dd16
[LowerBufferFatPointers] Fix support for GEP T, p7, <N x T> idxs (#126126)
The lowering for GEP didn't properly support the case where the pointer
argument was being implicitly broadcast by a vector of indices. Fix
that.
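
As an illustration of the implicit broadcast only: a scalar base combined with a vector of indices yields one address per lane. This sketch models pointers as flat 64-bit integers, which is a simplification (buffer fat pointers are not flat addresses), and `gepBroadcast` is a hypothetical helper, not part of the pass.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical model: the scalar base is broadcast to every lane, then
// each lane adds its own index scaled by the element size.
template <std::size_t N>
std::array<uint64_t, N> gepBroadcast(uint64_t base,
                                     const std::array<int64_t, N>& idx,
                                     uint64_t elemSize) {
  std::array<uint64_t, N> out{};
  for (std::size_t i = 0; i < N; ++i) {
    // Unsigned wrap-around makes negative indices subtract as expected.
    out[i] = base + static_cast<uint64_t>(idx[i]) * elemSize;
  }
  return out;
}
```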

---------

Co-authored-by: Matt Arsenault <arsenm2@gmail.com>
2025-02-11 18:22:50 -06:00
Brox Chen
ad6cd7e8b2
[AMDGPU][True16][CodeGen] true16 codegen for MadFmaMixPat (#124892)
true16 codegen for MadFmaMixPat. The GISEL test is not enabled and will
be added later, once GISEL is supported.
2025-02-11 17:36:44 -05:00
Vigneshwar Jayakumar
1188b1ff7b
AMDGPU: Handle gfx950 XDL Write-VGPR-VALU-WAW wait state change (#126132)
There are additional wait states for XDL write VALU WAW hazard in gfx950
compared to gfx940.
2025-02-12 01:32:23 +07:00
Vigneshwar Jayakumar
a2263eba4d
AMDGPU: Handle gfx950 XDL-write-VGPR-VALU-Mem-Exp wait state change (#126727) 2025-02-12 01:30:53 +07:00
Vigneshwar Jayakumar
c837f57286
AMDGPU: Handle gfx950 XDL-write-VGPR-Overlap-Src-AB wait state (#126732)
gfx950 needs more wait states than gfx940.
2025-02-11 22:30:16 +07:00
Juan Manuel Martinez Caamaño
dd59198647
[NFC][AMDGPU] Rename test (#126725)
The demote-scc transformation is no longer needed and the old test name
doesn't make sense anymore.

The test checks the generated assembly for different branch cases
* without metadata,
* with the same branch_weights on each edge and
* with a branch_weights that corresponds to the [[likely]] attribute
2025-02-11 15:10:37 +01:00
Vikram Hegde
3293bff5d2
[AMDGPU][NewPM] Port "GCNPreRAOptimizations" pass to NPM (#126040) 2025-02-11 11:09:38 +05:30
Shilei Tian
bde8ce6a5c
[AMDGPU] Only run AMDGPUPrintfRuntimeBindingPass at non-prelink phase (#125162) 2025-02-10 08:24:50 -05:00
Shilei Tian
70fdd9f0a2
[GlobalISel] Check whether G_CTLZ is legal in matchUMulHToLShr (#126457)
We need to check `G_CTLZ` because the combine uses `G_CTLZ` to get log
base 2, and it is not always legal on a target.
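
A rough sketch of the arithmetic behind this combine, assuming 32-bit values and a power-of-two multiplier greater than 1. The helper names are hypothetical and the count-leading-zeros loop is a portable stand-in for `G_CTLZ`; this is not the GlobalISel code.

```cpp
#include <cassert>
#include <cstdint>

// Portable count-leading-zeros for a non-zero 32-bit value.
inline unsigned countLeadingZeros32(uint32_t v) {
  assert(v != 0 && "undefined for zero in this sketch");
  unsigned n = 0;
  while ((v & 0x80000000u) == 0) {
    v <<= 1;
    ++n;
  }
  return n;
}

// High 32 bits of the full 64-bit product.
inline uint32_t umulh32(uint32_t a, uint32_t b) {
  return static_cast<uint32_t>((static_cast<uint64_t>(a) * b) >> 32);
}

// For pow2 = 2^k with 1 <= k <= 31: umulh(x, pow2) == x >> (32 - k),
// where k is recovered as 31 - ctlz(pow2).
inline uint32_t umulhPow2AsShift(uint32_t x, uint32_t pow2) {
  assert(pow2 > 1 && (pow2 & (pow2 - 1)) == 0 && "expects a power of two > 1");
  unsigned log2 = 31 - countLeadingZeros32(pow2);
  return x >> (32 - log2);
}
```

The shift form needs the log base 2 of the multiplier, which is why the legality of the count-leading-zeros operation matters.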

Fixes SWDEV-512440.
2025-02-10 00:11:09 -05:00
Shilei Tian
967973512b
[AMDGPU] Don't unify divergent exit nodes with musttail calls (#126395)
Fixes SWDEV-512254.
2025-02-09 21:48:24 -05:00
Akshat Oke
564b9b7f4d
Revert "[CodeGen][NewPM] Port MachineScheduler to NPM. (#125703)" (#126268)
This reverts commit 5aa4979c47255770cac7b557f3e4a980d0131d69 while I
investigate what's causing the compile-time regression.
2025-02-08 15:36:48 +05:30
Brox Chen
2b43543afb
[AMDGPU][True16][MC][CodeGen] true16 for v_alignbyte_b32 (#125706)
Support true16 format for v_alignbyte_b32 in MC and CodeGen
2025-02-07 09:54:11 -05:00
Matt Arsenault
d21fc58aee
AMDGPU: Use default shouldRewriteCopySrc (#125535)
This was ultimately working around bugs in subregister handling
in peephole-opt. In the common case, it would give up on folding
anything into a subregister extract copy.
2025-02-07 12:31:14 +07:00
Jeffrey Byrnes
16f7e961c6
[AMDGPU] Allow rematerialization of instructions with virtual register uses (#124327)
Remove the restriction that scheduling rematerialization candidates
cannot have virtual reg uses.

Currently, this only allows for virtual reg uses which are already live
at the rematerialization point, so bring in allUsesAvailableAt to check
for this condition. Because of this condition, the uses of the remats
will already be live in to the region, so the remat won't increase
live-in pressure.

Add an expensive check to check this condition.
2025-02-06 10:16:28 -08:00
Matt Arsenault
58a88001f3
PeepholeOpt: Fix looking for def of current copy to coalesce (#125533)
This fixes the handling of subregister extract copies. This
will allow AMDGPU to remove its implementation of
shouldRewriteCopySrc, which exists as a 10-year-old workaround
to this bug. peephole-opt-fold-reg-sequence-subreg.mir will
show the expected improvement once the custom implementation
is removed.

The copy coalescing processing here is overly abstracted
from what's actually happening. Previously when visiting
coalescable copy-like instructions, we would parse the
sources one at a time and then pass the def of the root
instruction into findNextSource. This means that the
first thing the new ValueTracker constructed would do
is getVRegDef to find the instruction we are currently
processing. This adds an unnecessary step, placing
a useless entry in the RewriteMap, and required skipping
the no-op case where getNewSource would return the original
source operand. This was a problem since in the case
of a subregister extract, shouldRewriteCopySrc would always
say that it is useful to rewrite and the use-def chain walk
would abort, returning the original operand. Move the process
to start looking at the source operand to begin with.

This does not fix the confused handling in the uncoalescable
copy case which is proving to be more difficult. Some currently
handled cases have multiple defs from a single source, and other
handled cases have 0 input operands. It would be simpler if
this was implemented with isCopyLikeInstr, rather than guessing
at the operand structure as it does now.

There are some improvements and some regressions. The
regressions appear to be downstream issues for the most part. One
of the uglier regressions is in PPC, where a sequence of insert_subregs
is used to build registers. I opened #125502 to use reg_sequence instead,
which may help.

The worst regression is an absurd SPARC testcase using a <251 x fp128>,
which uses a very long chain of insert_subregs.

We need improved subregister handling locally in PeepholeOptimizer,
and other passes like MachineCSE to fix some of the other regressions.
We should handle subregister composes and folding more indexes
into insert_subreg and reg_sequence.
2025-02-05 23:29:02 +07:00
Christudasan Devadasan
b83c960bad
[CodeGen][NewPM] Port SIWholeQuadMode to NPM. (#125833) 2025-02-05 18:44:57 +05:30
Akshat Oke
f77f777f35
[CodeGen][NewPM] Port RenameIndependentSubregs to NPM (#125192) 2025-02-05 17:54:57 +05:30
Christudasan Devadasan
44f638f88e
[CodeGen][NewPM] Port PostRAScheduler to NPM. (#125798) 2025-02-05 12:45:59 +05:30
Christudasan Devadasan
5aa4979c47
[CodeGen][NewPM] Port MachineScheduler to NPM. (#125703) 2025-02-05 12:17:59 +05:30
Robert Imschweiler
21560fe6b9
GlobalISel: Fix defined register of invariant.start (#125664)
In contrast to SelectionDAG, GlobalISel created a new virtual register
for the return value of invariant.start, leaving subsequent users of the
invariant.start value with an undefined reference.
A minimal example:
```
  %tmp = alloca i32, align 4, addrspace(5)
  %tmpI = call ptr @llvm.invariant.start.p5(i64 4, ptr addrspace(5) %tmp) #3
  call void @llvm.invariant.end.p5(ptr %tmpI, i64 4, ptr addrspace(5) %tmp) #3
  store i32 %i, ptr %tmpI, align 4
```
Although the return value of invariant.start might not be intended for
any use beyond invariant.end (the fuzzer might not have created a
sensible situation here), an implicit definition of the corresponding
virtual register avoids a segfault in the target instruction selector
later.

This LLVM defect was identified via the AMD Fuzzing project.
2025-02-04 23:59:03 +07:00
Brox Chen
5eff19f48b
[AMDGPU][True16][Codegen] true16 codegen for FPtoI1 (#125120)
True16 codegen for FPtoi1.

It seems tablegen figured out the pattern even without this pat in
place, and the fptoui/fptosi.ll already got the right transformation.
Additionally updated the mir file and split it to pre-gfx11 and
post-gfx11.
2025-02-04 11:20:40 -05:00
Brox Chen
6515fdf73d
[AMDGPU][True16][CodeGen] true16 codegen for FPMinMax pat (#125107)
true16 codegen for FPMinMax Pattern
2025-02-04 11:20:17 -05:00
Akshat Oke
4313345f2e
[CodeGen][NewPM] Port MachineCopyPropagation to NPM (#125202) 2025-02-04 15:45:03 +05:30
Matt Arsenault
2f2ac3de69
DAG: Avoid stack usage in bitcast operand promotion to legal vector (#125637)
Fix introducing stack usage if a bitcast source operand is an illegal
integer type cast to a legal vector type. This should cover more
situations, but this is the first one I noticed.
2025-02-04 16:43:42 +07:00
Matt Arsenault
cdca04913a
DAG: Avoid introducing stack usage in vector->int bitcast int op promotion
(#125636)

Avoids stack usage in the v5i32 to i160 case for AMDGPU, which appears
in fat pointer lowering.
2025-02-04 16:32:47 +07:00
David Stuttard
6c560ef33e
[AMDGPU] Add .entry_point back into PAL metadata (#125505) 2025-02-04 08:19:05 +00:00
Fabian Ritter
b95a6c750c
[AMDGPU] Remove special cases in TTI::getMemcpyLoop(Residual)LoweringType (#125507)
These special cases limit the width of memory operations we use for
lowering memcpy/memmove when the pointer arguments are 2-aligned or in
the LDS/GDS.

I found that performance in microbenchmarks on gfx90a, gfx1030, and
gfx1100 is better without this limitation.
2025-02-04 08:18:24 +01:00
Matt Arsenault
077e0c134a
AMDGPU: Generalize truncate of shift of cast build_vector combine (#125617)
Previously we only handled cases that looked like the high element
extract of a 64-bit shift. Generalize this to handle any multiple
indexing. I was hoping this would help avoid some regressions,
but it did not. It does however reduce the number of steps the DAG
takes to process these cases.

NFC-ish, I have yet to find an example where this changes the
final output.
2025-02-04 11:46:30 +07:00
Simon Pilgrim
b7c8271601
[DAG] getNode - convert scalar i1 arithmetic calls to bitwise instructions (#125486)
We already do this for vector vXi1 types - this patch removes the vector constraint to handle it for all bool types.
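
For intuition only: on a one-bit type, arithmetic wraps modulo 2, so add and sub behave like xor and mul behaves like and. The sketch below checks those identities exhaustively; the helper names are hypothetical, and which opcodes the patch actually rewrites is not stated here.

```cpp
#include <cassert>

// One-bit arithmetic modelled as integer arithmetic reduced modulo 2.
inline bool addI1(bool a, bool b) {
  return ((a ? 1 : 0) + (b ? 1 : 0)) % 2 != 0;
}
inline bool subI1(bool a, bool b) {
  // (0 - 1) % 2 is -1 in C++, which is still non-zero, i.e. true.
  return ((a ? 1 : 0) - (b ? 1 : 0)) % 2 != 0;
}
inline bool mulI1(bool a, bool b) {
  return ((a ? 1 : 0) * (b ? 1 : 0)) % 2 != 0;
}

// Checks all four input combinations against the bitwise equivalents.
inline bool bitwiseMatchesArithmetic() {
  const bool values[2] = {false, true};
  for (bool a : values) {
    for (bool b : values) {
      if (addI1(a, b) != (a != b)) return false;  // add == xor
      if (subI1(a, b) != (a != b)) return false;  // sub == xor
      if (mulI1(a, b) != (a && b)) return false;  // mul == and
    }
  }
  return true;
}
```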
2025-02-03 16:36:01 +00:00
Akshat Oke
fe9a97ca38
[CodeGen][NewPM] Port RegisterCoalescer to NPM (#124698) 2025-02-03 13:41:51 +07:00
David Green
070e129304
[AArch64][GlobalISel] Add disjoint handling for add_and_or_is_add. (#123594)
This allows us to easily detect, without known-bits, that the or in a
fshl/fshr is disjoint allowing us to use usra under aarch64.
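
A small sketch of why a disjoint `or` can be treated as an `add`, using a 32-bit funnel shift as the example. The function names are hypothetical, and this models only the arithmetic identity, not the AArch64 selection pattern.

```cpp
#include <cassert>
#include <cstdint>

// Funnel shift left: the top 32 bits of (hi:lo) shifted left by amt.
inline uint32_t fshl32(uint32_t hi, uint32_t lo, unsigned amt) {
  amt %= 32;
  if (amt == 0) return hi;
  uint32_t left = hi << amt;          // low `amt` bits are zero
  uint32_t right = lo >> (32 - amt);  // only the low `amt` bits can be set
  return left | right;
}

// Same computation with add: the two halves share no set bits, so no
// carries occur and a + b equals a | b.
inline uint32_t fshl32ViaAdd(uint32_t hi, uint32_t lo, unsigned amt) {
  amt %= 32;
  if (amt == 0) return hi;
  uint32_t left = hi << amt;
  uint32_t right = lo >> (32 - amt);
  assert((left & right) == 0 && "operands are disjoint");
  return left + right;  // shift right and accumulate
}
```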
2025-02-02 21:01:49 +00:00
Sergei Barannikov
ff9c041d96
[MachineScheduler] Fix physreg dependencies of ExitSU (#123541)
Providing the correct operand index allows addPhysRegDataDeps to compute
the correct latency.

Pull Request: https://github.com/llvm/llvm-project/pull/123541
2025-02-01 20:40:50 +03:00
Brox Chen
a51798e3d6
[AMDGPU][True16][CodeGen] true16 codegen pat for fptrunc_round (#124044)
true16 codegen pattern for fptrunc_round f32 to f16.

For the mir test, split it into preGFX11 and postGFX11, and add a true16
and a fake16 test accordingly.
2025-01-30 18:31:52 -05:00
Jon Chesterfield
4f358d75d0 [amdgpu][nfc] Post-commit feedback on c39fba209 2025-01-30 20:07:44 +00:00
Stanislav Mekhanoshin
8a20c6459e
[AMDGPU] Create new option for force flush load counter (#124974)
In certain situations it is beneficial to wait for all outstanding
loads, regardless of which specific load's data we need. This may
reduce the number of cache requests.

Fixes: SWDEV-511507
2025-01-30 11:14:38 -08:00
Brox Chen
33d401fb15
[AMDGPU][True16][CodeGen] true16 codegen for icmp and is_fpclass (#124757)
True16 codegen pattern for icmp patterns and is_fpclass
2025-01-30 12:18:00 -05:00
Jon Chesterfield
c39fba209c
[AMDGPU] S_SET_GPR_IDX_ON can be passed an immediate index (#125086)
Oversight found by the ISel fuzz effort. The code assumed the argument is
a register, but in some cases it can be an immediate. Tablegen's type for
the instruction is SSrc_b32, i.e. either a register or an immediate is
fine. Added the repro from the bug reporter as a test case; prior to this
patch llvm will assert in getReg.

Fixes SWDEV-508589
2025-01-30 16:40:12 +00:00
Matt Arsenault
41f76070f3 AMDGPU: Regenerate test checks 2025-01-30 22:55:33 +07:00
Matt Arsenault
d246cc618a
PeepholeOpt: Do not add subregister indexes to reg_sequence operands (#124111)
Given the rest of the pass just gives up when it needs to compose
subregisters, folding a subregister extract directly into a reg_sequence
is counterproductive. Later fold attempts in the function will give up
on the subregister operand, preventing looking up through the reg_sequence.

It may still be profitable to do these folds if we start handling
the composes. There are some test regressions, but this mostly
looks better.
2025-01-30 20:42:02 +07:00
Matt Arsenault
6017480461
MachineVerifier: Fix check for range type (#124894)
We need to permit scalar extending loads with range annotations.

Fix expensive_checks failures after 11db7fb09b36e656a801117d6a2492133e9c2e46
2025-01-30 10:56:12 +07:00
Matt Arsenault
97a1f494a6
DAG: Avoid breaking legal vector_shuffle with multiple uses (#123712)
Previously this combine would undo AMDGPU's new custom legalization of
wide vector shuffles into 2 element pieces. The comment also
states that this combine is only done before legalization,
but the case with a build_vector source was unconditional.

We probably don't want to do this if the multiple uses are full
scalarization of the vector, but this seems to work well enough.
Scalarizing extracts should have folded out pre-legalize.
2025-01-30 10:55:21 +07:00
Carl Ritson
a3a3e6997b
[AMDGPU] Rewrite GFX12 SGPR hazard handling to dedicated pass (#118750)
- Algorithm operates over whole IR to attempt to minimize waits.
- Add support for VALU->VALU SGPR hazards via VA_SDST/VA_VCC.
2025-01-30 11:21:11 +09:00
Konstantina Mitropoulou
9adc99bcc5
[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point branches. (#124028)
- **[NFC] Use GCNPat instead of Pat.**
- **[AMDGPU] Always emit SI_KILL_I1_PSEUDO for uniform floating point
branches.**

---------

Co-authored-by: Konstantina Mitropoulou <KonstantinaMitropoulou@amd.com>
2025-01-29 09:00:40 -08:00
Nikita Popov
29441e4f5f
[IR] Convert from nocapture to captures(none) (#123181)
This PR removes the old `nocapture` attribute, replacing it with the new
`captures` attribute introduced in #116990. This change is
intended to be essentially NFC, replacing existing uses of `nocapture`
with `captures(none)` without adding any new analysis capabilities.
Making use of non-`none` values is left for a followup.

Some notes:
* `nocapture` will be upgraded to `captures(none)` by the bitcode
   reader.
* `nocapture` will also be upgraded by the textual IR reader. This is to
   make it easier to use old IR files and somewhat reduce the test churn in
   this PR.
* Helper APIs like `doesNotCapture()` will check for `captures(none)`.
* MLIR import will convert `captures(none)` into an `llvm.nocapture`
   attribute. The representation in the LLVM IR dialect should be updated
   separately.
2025-01-29 16:56:47 +01:00
Acim Maravic
3a29dfe37c
[LLVM][AMDGPU] Add Intrinsic and Builtin for ds_bpermute_fi_b32 (#124616) 2025-01-29 14:04:10 +01:00
David Green
66e0498daf
[GlobalISel] Do not run verifier after ResetMachineFunctionPass (#124799)
After we fall back from GlobalISel to SDAG, the verifier gets called,
which calls getReservedRegs which uses SIMachineFunctionInfo::usesAGPRs
which caches the result of UsesAGPRs. Because we have just fallen back,
the function is empty and the result is incorrectly cached as false. This
patch makes sure we don't try to run the verifier whilst the function is
empty.
2025-01-29 12:48:11 +00:00