17 Commits

Author SHA1 Message Date
Fabian Ritter
2697c8cb45
[LowerMemIntrinsics] Factor control flow generation out of the memcpy lowering (#169039)
So far, memcpy with known size, memcpy with unknown size, memmove with known
size, and memmove with unknown size have individual optimized loop lowering
implementations, while memset and memset.pattern use an unoptimized loop
lowering. This patch extracts the parts of the memcpy lowerings (for known and
unknown sizes) that generate the control flow for the loop expansion into an
`insertLoopExpansion` function. The `createMemCpyLoop(Unk|K)nownSize` functions
then only collect the necessary arguments for `insertLoopExpansion`, call it,
and fill the generated loop basic blocks.

The immediate benefit of this is that logic from the two memcpy lowerings is
deduplicated. Moreover, it enables follow-up patches that will use
`insertLoopExpansion` to optimize the memset and memset.pattern implementations
similarly to memcpy, since they can use the exact same control flow patterns.

The test changes are due to more consistent and useful basic block names in the
loop expansion and an improvement in basic block ordering: previously, the
basic block that determines if the residual loop is executed would be put at
the end of the function, now it is put before the residual loop body.
Otherwise, the generated code should be equivalent.

This patch doesn't affect memmove; deduplicating its logic would also be nice,
but to extract all CF generation from the memmove lowering,
`insertLoopExpansion` would need to be able to also create code that iterates
backwards over the argument buffers. That would make `insertLoopExpansion` a
lot more complex for a code path that's only used for memmove, so it's probably
not worth refactoring.

For SWDEV-543208.
2025-12-03 11:13:52 +01:00
Matt Arsenault
c7019c7eda
AMDGPU: Really use AV classes by default for vector classes (#166483)
AMDGPU: Really use AV classes by default for vector classes

Update getRegClassFor to use AV classes in place of VGPRs for
gfx90a-gfx950. There are a handful of regressions. Most are
enabling unprofitable rematerialization which reduce register
count by 1 but add an unnecessary instruction.
2025-11-13 18:54:02 +00:00
Nicolai Hähnle
fa050eadab
Reland: CodeGen: Record MMOs in finalizeBundle (#166689)
(original PR: #166210)

This allows more accurate alias analysis to apply at the bundle level.
This has a bunch of minor effects in post-RA scheduling that look mostly
beneficial to me, all of them in AMDGPU (the Thumb2 change is cosmetic).

The pre-existing (and unchanged) test in
CodeGen/MIR/AMDGPU/custom-pseudo-source-values.ll tests that MIR with a
bundle with MMOs can be parsed successfully.

v2:
- use cloneMergedMemRefs
- add another test to explicitly check the MMO bundling behavior

v3:
- use poison instead of undef to initialize the global variable in the
test

v4:
- treat bundle memory accesses as never trivially disjoint
2025-11-06 07:34:36 -08:00
Jan Patrick Lehr
833983918d
Revert "CodeGen: Record MMOs in finalizeBundle" (#166520)
Reverts llvm/llvm-project#166210

Buildbot failures in the libc on GPU bot:
https://lab.llvm.org/buildbot/#/builders/10/builds/16711
2025-11-05 11:11:08 +01:00
Nicolai Hähnle
304d2ff4d9
CodeGen: Record MMOs in finalizeBundle (#166210)
This allows more accurate alias analysis to apply at the bundle level.
This has a bunch of minor effects in post-RA scheduling that look mostly
beneficial to me, all of them in AMDGPU (the Thumb2 change is cosmetic).

The pre-existing (and unchanged) test in
CodeGen/MIR/AMDGPU/custom-pseudo-source-values.ll tests that MIR with a
bundle with MMOs can be parsed successfully.

v2:
- use cloneMergedMemRefs
- add another test to explicitly check the MMO bundling behavior

v3:
- use poison instead of undef to initialize the global variable in the
test
2025-11-05 06:56:19 +00:00
Mirko Brkušanin
bdec5bf69c
[AMDGPU][GlobalISel] Combine (or s64, zext(s32)) (#151519)
If we only deal with a one part of 64bit value we can just generate
merge and unmerge which will be either combined away or
selected into copy / mov_b32.
2025-10-24 17:25:00 +02:00
Gang Chen
640644d68a
[AMDGPU] Move LowerBufferFatPointers after LoadStoreVectorizer and remove the fixme (#161531)
Move LowerBufferFatPointers pass after CodegenPrepare and
LoadStoreVectorizer pass, and remove the fixme about that.
2025-10-01 17:52:15 -07:00
Matt Arsenault
1614c3b3c7
AMDGPU: Always use AV spill pseudos on targets with AGPRs (#149099)
This increases allocator freedom to inflate register classes
to the AV class, we don't need to introduce a new restriction
by basing the opcode on the current virtual register class.
Ideally we would avoid this if we don't have any allocatable
AGPRs for the function, but it probably doesn't make much
difference in the end result if they are excluded from the
final allocation order.
2025-07-18 15:31:50 +09:00
Ruiling, Song
3e47d8deba
MachineScheduler: Reset next cluster candidate for each node (#139513)
When a node is picked, we should reset its next cluster candidate to
null before releasing its successors/predecessors.
2025-05-28 14:53:46 +08:00
Austin Kerbow
2c9a46cce3
[AMDGPU] Move kernarg preload logic to separate pass (#130434)
Moves kernarg preload logic to its own module pass. Cloned function
declarations are removed when preloading hidden arguments. The inreg
attribute is now added in this pass instead of AMDGPUAttributor. The
rest of the logic is copied from AMDGPULowerKernelArguments which now
only check whether an arguments is marked inreg to avoid replacing
direct uses of preloaded arguments. This change requires test updates to
remove inreg from lit tests with kernels that don't actually want
preloading.
2025-05-11 21:18:11 -07:00
Alexander Richardson
ee13638362
[AMDGPU] Remove explicit datalayout from tests where not needed
Since e39f6c1844fab59c638d8059a6cf139adb42279a opt will infer the
correct datalayout when given a triple. Avoid explicitly specifying it
in tests that depend on the AMDGPU target being present to avoid the
string becoming out of sync with the TargetInfo value.
Only tests with REQUIRES: amdgpu-registered-target or a local lit.cfg
were updated to ensure that tests for non-target-specific passes that
happen to use the AMDGPU layout still pass when building with a limited
set of targets.

Reviewed By: shiltian, arsenm

Pull Request: https://github.com/llvm/llvm-project/pull/137921
2025-04-30 10:58:17 -07:00
Ana Mihajlovic
459b4e3fe1
Reland "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)" (#131111)
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from
it). When VALU is issued, it increments internal counter VA_SDST used to
track use of this SGPR. SALU will not issue until VA_SDST is zero, that
is when VALU is finished writing. Therefore, delays added by s_delay_alu
are not needed in this situation.
2025-03-13 10:26:20 +01:00
Kazu Hirata
aa008e0008 Revert "[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)"
This reverts commit 71582c6667a6334c688734cae628e906b3c1ac1d.

Multiple buildbot failures have been reported:
https://github.com/llvm/llvm-project/pull/127212
2025-03-12 12:09:09 -07:00
Ana Mihajlovic
71582c6667
[AMDGPU] Remove s_delay_alu for VALU->SGPR->SALU (#127212)
We have a VALU->SGPR->SALU (VALU writing to SGPR and SALU reading from
it). When VALU is issued, it increments internal counter VA_SDST used to
track use of this SGPR. SALU will not issue until VA_SDST is zero, that
is when VALU is finished writing. Therefore, delays added by s_delay_alu
are not needed in this situation.
2025-03-12 09:33:07 -07:00
Krzysztof Drewniak
f8cc509b69
Reapply "[AMDGPU] Handle memcpy()-like ops in LowerBufferFatPointers (#126621)" (#129078)
This reverts commit 1559a65efaf327f9c72e14d4bb1834f076e7fc20.

Fixed test (I suspect broken by unrelated change in the merge)
2025-02-27 11:26:13 -06:00
Kazu Hirata
1559a65efa Revert "[AMDGPU] Handle memcpy()-like ops in LowerBufferFatPointers (#126621)"
This reverts commit 469757efafebdd5772d993fca4dc0dfa7cbda17c.

Multiple buildbot failures have been reported:
https://github.com/llvm/llvm-project/pull/126621
2025-02-26 14:35:07 -08:00
Krzysztof Drewniak
469757efaf
[AMDGPU] Handle memcpy()-like ops in LowerBufferFatPointers (#126621)
Since LowerBufferFatPointers runs before PreISelIntrinsicLowering, which
normally handles unsupported memcpy()s,, and since you can't have a
`noalias {ptr addrspace(8), i32}` becasue it crashes later passes,
manually expand memcpy()s involving buffer fat pointers to loops.

Additionally, though they're unlikely to be used, this commit adds
support for memset().

This commit doesn't implement writing direct-to-LDS loads as the
intrinsics, but leaves the option in the future.
2025-02-26 16:03:32 -06:00