7767 Commits

Author SHA1 Message Date
Piotr Sobczak
adf02ae41f
[AMDGPU] Simplify lowerBUILD_VECTOR (#109094)
Simplify `lowerBUILD_VECTOR` by commoning up the way the vectors
are split.
Also reorder the checks to avoid a long condition inside `if`.
2024-09-18 12:58:16 +02:00
Aditi Medhane
5a8d2dd1f9
[AMDGPU] Handle subregisters properly in generic operand legalizer (#108496)
Fix for the issue found during COPY introduction during legalization of
PHI operands for sgpr to vgpr copy when subreg is involved.
2024-09-18 13:14:49 +05:30
Thorsten Schütt
acfa294b5e
[GlobalIsel] Canonicalize G_FCMP (#108891)
As a side-effect, we start constant folding fcmps.
2024-09-17 09:42:04 +02:00
Thorsten Schütt
5c348f692a
[GlobalIsel] Canonicalize G_ICMP (#108755)
As a side-effect, we start constant folding icmps.

Split out from https://github.com/llvm/llvm-project/pull/105991.
2024-09-16 19:25:34 +02:00
Stanislav Mekhanoshin
18f1c980bc
[AMDGPU] Avoid unneeded waitcounts before spill stores (#108303)
Implicit defs and uses on spill stores were accounted as real defs and
uses, while only exist for liveness accounting. As a result unneded
waits were generated.

Fixes: SWDEV-484177
2024-09-14 02:22:28 -07:00
Stanislav Mekhanoshin
d0e7714de7
[AMDGPU] Error on non-global pointer with s_prefetch_data (#107624) 2024-09-13 11:14:28 -07:00
Diana Picus
3356208531
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512)
This reverts commit
7792b4ae79.

The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).

...if only we had a merge queue.
2024-09-13 11:54:30 +02:00
Pankaj Dwivedi
991c842b38
[AMDGPU] eliminate frame index v_add wave32 test (#107832)
PR: #102346  v_add_u32_e64  test cases for wave32
2024-09-13 13:55:14 +05:30
Aditi Medhane
5237f0dbcb
[AMDGPU] Precommit and Modify phi_moveimm_subreg_input testcase (#108389)
- Updated `phi_moveimm_subreg_input` test case to introduce
sub-registers as PHI input operands.
Currently subreg is making the testcase in non-SSA format, need to fix
this by giving subreg as an input operand to PHI instead defining the
subreg register.

This change is relevant for : [[AMDGPU] Add MachineVerifier check to
detect illegal copies from vector register to SGPR
](https://github.com/llvm/llvm-project/pull/105494)
2024-09-12 19:13:00 +05:30
Jay Foad
c657a6f6aa
[AMDGPU] Fix selection of s_load_b96 on GFX11 (#108029)
Fix a bug which resulted in selection of s_load_b96 on GFX11, which only
exists in GFX12.

The root cause was a mismatch between legalization and selection. The
condition used to check that the load was uniform in legalization
(SITargetLowering::LowerLOAD) was "!Op->isDivergent()". The condition
used to detect a non-uniform load during selection
(AMDGPUDAGToDAGISel::isUniformLoad()) was
"N->isDivergent() && !AMDGPUInstrInfo::isUniformMMO(MMO)". This makes a
difference when IR uniformity analysis has more information than SDAG's
built in analysis. In the test case this is because IR UA reports that
everything is uniform if isSingleLaneExecution() returns true, e.g. if
the specified max flat workgroup size is 1, but SDAG does not have this
optimization.

The immediate fix is to use the same condition to detect uniform loads
in legalization and selection. In future SDAG should learn about
isSingleLaneExecution(), and then it could probably stop relying on IR
metadata to detect uniform loads.
2024-09-12 13:41:40 +01:00
Aditi Medhane
36ad0720de
[AMDGPU] Autogenerate checks for phi-vgpr-input-moveimm.mir (#108372)
Update the MIR checks for phi-vgpr-input-moveimm testcase.
2024-09-12 17:28:24 +05:30
Diana Picus
7792b4ae79
Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)"" (#108341)
Reverts llvm/llvm-project#108173

si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
2024-09-12 10:12:09 +02:00
Diana Picus
703ebca869
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)" (#108173)
This reverts commit
c7a7767fca.

The buildbots failed because I removed a MI from its parent before
updating LIS. This PR should fix that.
2024-09-12 09:11:41 +02:00
Jay Foad
e55d6f5ea2
[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889)
Always generate v_cndmask_b32 instead of modifying exec around
v_mov_b32. This is expected to be faster because
modifying exec generally causes pipeline stalls.
2024-09-11 17:16:06 +01:00
Brox Chen
35e27c0ee5
[AMDGPU][True16][MC] 16bit vsrc and vdst support in MC (#104510)
This is a large patch includes the MC level support for V_CVT_F16_F32,
V_CVT_F32_F16 and V_LDEXP_F16 in true16 format.

This patch includes the asm/disasm changes to encode/decode the 16bit
vsrc, vdst and src modifieres for vop and dpp format. This patch is a
dependency for many 16 bit instructions while only three instructions
are updated to make it easier to review.

There will be another patch to support these three instructions in the
codeGen level, this patch just replaces these two instructions with its
fake16 format.
2024-09-11 10:48:11 -04:00
Matt Arsenault
ee61a4db3c AMDGPU: Add tests for minimumnum/maximumnum intrinsics
Vector cases are broken, so leave those for later.
2024-09-11 18:20:03 +04:00
Thorsten Schütt
ba4bcce5f5
[GlobalIsel] Combine trunc of binop (#107721)
trunc (binop X, C) --> binop (trunc X, trunc C)  --> binop (trunc X, C`)

Try to narrow the width of math or bitwise logic instructions by pulling
a truncate ahead of binary operators.

Vx and Nx cores consider 32-bit and 64-bit basic arithmetic equal in
costs.
2024-09-11 15:04:55 +02:00
Akshat Oke
e1ee07d0ff
[AMDGPU][NewPM] Port SIPeepholeSDWA pass to NPM (#107049) 2024-09-11 14:30:16 +04:00
Simon Pilgrim
704116373a [AMDGPU] Regenerate buffer intrinsic tests with update_llc_test_checks. NFC. 2024-09-11 11:06:55 +01:00
Vitaly Buka
c7a7767fca
Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)
Breaks bots, see #105822.

Reverts llvm/llvm-project#105822
2024-09-10 09:51:43 -07:00
Diana Picus
44556e64f2
[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822)
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.

As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):

```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```

It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).

The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
2024-09-10 13:24:53 +02:00
sstipanovic
914ab366c2
[AMDGPU] Overload image atomic swap to allow float as well. (#107283)
LLPC can generate llvm.amdgcn.image.atomic.swap intrinsic with data
argument as float type as well as float return type. This went unnoticed
until CreateIntrinsic with implicit mangling was used.
2024-09-09 17:54:30 +02:00
Pierre van Houtryve
eaac4a2613
[AMDGPU] Document & Finalize GFX12 Memory Model (#98599)
Documents the memory model implemented as of #98591, with some
fixes/optimizations to the implementation.
2024-09-09 15:35:28 +02:00
Stanislav Mekhanoshin
0745219d4a
[AMDGPU] Add target intrinsic for s_buffer_prefetch_data (#107293) 2024-09-06 11:41:21 -07:00
Shilei Tian
ce2e38653f
[Attributor] Add support for atomic operations in AAAddressSpace (#106927) 2024-09-06 12:45:16 -04:00
Shilei Tian
109cd11dc4
[Attributor] Skip AS specialization for volatile memory instructions (#107250) 2024-09-06 11:00:30 -04:00
Matt Arsenault
a9daad8280
AMDGPU: Update live intervals in convertToThreeAddress (#104610)
Fixes #98741
2024-09-06 18:18:27 +04:00
Chaitanya
50be4f17a0
[AMDGPU] Skip lowerNonKernelLDSAccesses if function is declaration. (#106975)
This PR skips lowering non-kernel LDS i.e lowerNonKernelLDSAccesses,
when function is a declaration or there are no lds globals to process.
2024-09-06 16:04:17 +05:30
Changpeng Fang
24267a7e14
AMDGPU: Add f64 to f32 support for llvm.fptrunc.round (#107481) 2024-09-05 22:57:27 -07:00
Stanislav Mekhanoshin
bd840a4004
[AMDGPU] Add target intrinsic for s_prefetch_data (#107133) 2024-09-05 15:14:31 -07:00
Changpeng Fang
e44a67543c
AMDGPU: Add a few unsupported checks for llvm.fptrunc.round intrinsic (#107330)
A check here can be removed when we implement support for the
corresponding types/mode.
2024-09-05 15:05:15 -07:00
Carl Ritson
16cda01d22
[AMDGPU] V_SET_INACTIVE optimizations (#98864)
Optimize V_SET_INACTIVE by allow it to run in WWM.
Hence WWM sections are not broken up for inactive lane setting.
WWM V_SET_INACTIVE can typically be lower to V_CNDMASK.
Some cases require use of exec manipulation V_MOV as previous code.
GFX9 sees slight instruction count increase in edge cases due to
smaller constant bus.

Additionally avoid introducing exec manipulation and V_MOVs where
a source of V_SET_INACTIVE is the destination.
This is a common pattern as WWM register pre-allocation often
assigns the same register.
2024-09-05 14:39:28 +09:00
Juan Manuel Martinez Caamaño
2d7339ad24
[AMDGPU][LDS] Fix dynamic LDS interaction with "amdgpu-no-lds-kernel-id" (#107092)
Dynamic lds and Table lds both use the amdgpu_lds_kernel_id intrinsic.
Kernels and functons that make an indirect use of this should not have
the
"amdgpu-no-lds-kernel-id" attribute.

For the later, this was done. For the dynamic lds case, this was
missing. This patch fixes it.
2024-09-04 16:41:43 +02:00
Christudasan Devadasan
6c143a86cd
[CodeGen][NewPM] Port MachineCSE pass to new pass manager. (#106605) 2024-09-04 18:54:07 +05:30
Juan Manuel Martinez Caamaño
43b8ae3cea
[AMDGPU][LDS] Pre-Commit tests for 'Fix dynamic LDS interaction with "amdgpu-no-lds-kernel-id" (#107091) 2024-09-04 13:11:45 +02:00
Jay Foad
5a6926ce49 [AMDGPU] Fix test update after #107108 2024-09-04 11:48:08 +01:00
Jay Foad
126d6f2710
[AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (#107108)
Use poison for an unused input to the permlanex16 intrinsic, to improve
register allocation and avoid an unnecessary v_mov instruction.
2024-09-04 11:03:22 +01:00
paperchalice
69657eb7f6
[llc] Provide opt like verifier options (#106665)
- Support `verify-each` option.
- Default behavior is verifying output only.
2024-09-04 17:37:34 +08:00
Carl Ritson
86627149f6
[AMDGPU] Mitigate GFX12 VALU read SGPR hazard (#100067)
Any SGPR read by a VALU can potentially obscure SALU writes to the same
register.
Insert s_wait_alu instructions to mitigate the hazard on affected paths.

Compute a global cache of SGPRs with any VALU reads and use this to
avoid inserting mitigation for SGPRs never accessed by VALUs.

To avoid excessive search when compile time is priority implement
secondary mode where all SALU writes are mitigated.

Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-09-04 12:15:20 +09:00
Christudasan Devadasan
042104985c
[AMDGPU][NewPM] Port SIShrinkInstructions to new pass manager. (#106967) 2024-09-03 10:52:50 +05:30
Shilei Tian
cb949b74e8 [NFC][FIX] Work around update_test_checks bug 2024-09-02 12:33:24 -04:00
Shilei Tian
f32f0289fd [NFC] Update check lines of the test case llvm/test/CodeGen/AMDGPU/remove-no-kernel-id-attribute.ll 2024-09-02 12:23:26 -04:00
Akshat Oke
da13754103
AMDGPU/NewPM Port SILoadStoreOptimizer to NPM (#106362) 2024-09-02 11:41:56 +05:30
Changpeng Fang
26b0bef192
AMDGPU: Use pattern to select instruction for intrinsic llvm.fptrunc.round (#105761)
Use GCNPat instead of Custom Lowering to select instructions for
intrinsic llvm.fptrunc.round. "SupportedRoundMode : TImmLeaf" is used as
a predicate to select only when the rounding mode is supported.
"as_hw_round_mode : SDNodeXForm" is developed to translate the round
modes to the corresponding ones that hardware recognizes.
2024-08-29 11:43:58 -07:00
Stephen Tozer
3d08ade7bd
[ExtendLifetimes] Implement llvm.fake.use to extend variable lifetimes (#86149)
This patch is part of a set of patches that add an `-fextend-lifetimes`
flag to clang, which extends the lifetimes of local variables and
parameters for improved debuggability. In addition to that flag, the
patch series adds a pragma to selectively disable `-fextend-lifetimes`,
and an `-fextend-this-ptr` flag which functions as `-fextend-lifetimes`
for this pointers only. All changes and tests in these patches were
written by Wolfgang Pieb (@wolfy1961), while Stephen Tozer (@SLTozer)
has handled review and merging. The extend lifetimes flag is intended to
eventually be set on by `-Og`, as discussed in the RFC
here:

https://discourse.llvm.org/t/rfc-redefine-og-o1-and-add-a-new-level-of-og/72850

This patch implements a new intrinsic instruction in LLVM,
`llvm.fake.use` in IR and `FAKE_USE` in MIR, that takes a single operand
and has no effect other than "using" its operand, to ensure that its
operand remains live until after the fake use. This patch does not emit
fake uses anywhere; the next patch in this sequence causes them to be
emitted from the clang frontend, such that for each variable (or this) a
fake.use operand is inserted at the end of that variable's scope, using
that variable's value. This patch covers everything post-frontend, which
is largely just the basic plumbing for a new intrinsic/instruction,
along with a few steps to preserve the fake uses through optimizations
(such as moving them ahead of a tail call or translating them through
SROA).

Co-authored-by: Stephen Tozer <stephen.tozer@sony.com>
2024-08-29 17:53:32 +01:00
Pierre van Houtryve
1f8f2ed66a
[NFC][AMDGPU] Autogenerate tests for uniform i32 promo in ISel (#106382)
Many tests were easy to update, but these are quite big and I think it's
better to autogenerate them to see the difference well.
2024-08-29 15:20:32 +02:00
Matt Arsenault
7b7b0b95b2
DAG: Check if is_fpclass is custom, instead of isLegalOrCustom (#105577)
For some reason, isOperationLegalOrCustom is not the same as
isOperationLegal || isOperationCustom. Unfortunately, it checks
if the type is legal which makes it uesless for custom lowering
on non-legal types (which is always ppcf128).

Really the DAG builder shouldn't be going to expand this in the
builder, it makes it difficult to work with. It's only here to work
around the DAG requiring legal integer types the same size as
the FP type after type legalization.
2024-08-29 14:05:43 +04:00
Akshat Oke
fdca2c33a1
AMDGPU/NewPM Port GCNDPPCombine to NPM (#105816)
Co-authored-by: Akshat Oke <Akshat.Oke@amd.com>
2024-08-29 14:49:52 +05:30
Akshat Oke
2adc94cd6c
AMDGPU/NewPM: Port SIFoldOperands to new pass manager (#105801) 2024-08-29 11:34:54 +05:30
Shilei Tian
572d2fd327
[Attributor] Fix an issue that could potentially cause AccessList and OffsetBins out of sync (#106187)
The implementation of `AAPointerInfo::RangeList::set_difference` doesn't
consider the case where two ranges have the same offset but different
sizes.
This could cause `AccessList` and `OffsetBins` out of sync because a
range has
been already updated in `AccessList` but missing in `ToRemove`.

I do have a reproducer but the reproducer itself is 248kb. `llvm-reduce`
can't
further reduce it. Not sure how I can make a smaller reproducer.

Fixes: SWDEV-479757.
2024-08-29 01:02:19 -04:00