58 Commits

Author SHA1 Message Date
Stanislav Mekhanoshin
3277c7cd28
[AMDGPU] Skip VGPR deallocation for waveslot limited kernels (#112765)
MSG_DEALLOC_VGPRS slows down very small waveslot limited kernels. It's
been identified this message is only really needed for VGPR limited
kernels. A kernel becomes VGPR limited if a total number of VGPRs per
SIMD / number of used VGPRs is more than a number of wave slots.
2024-10-21 09:39:52 -07:00
Pierre van Houtryve
924a64a348
[AMDGPU] Only emit SCOPE_SYS global_wb (#110636)
global_wb with scopes lower than SCOPE_SYS is unnecessary for
correctness.

I was initially optimistic they would be very cheap no-ops but they can
actually be quite expensive so let's avoid them.
2024-10-07 07:35:31 +02:00
Matt Arsenault
8632e8bd64
AMDGPU: Fix implicit vcc def to vcc_lo on wave32 targets (#109514) 2024-09-23 13:20:21 +04:00
Jay Foad
e55d6f5ea2
[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive (#107889)
Always generate v_cndmask_b32 instead of modifying exec around
v_mov_b32. This is expected to be faster because
modifying exec generally causes pipeline stalls.
2024-09-11 17:16:06 +01:00
Carl Ritson
16cda01d22
[AMDGPU] V_SET_INACTIVE optimizations (#98864)
Optimize V_SET_INACTIVE by allowing it to run in WWM.
Hence WWM sections are not broken up for inactive lane setting.
WWM V_SET_INACTIVE can typically be lowered to V_CNDMASK.
Some cases still require exec manipulation and V_MOV, as in the previous code.
GFX9 sees a slight instruction count increase in edge cases due to
its smaller constant bus.

Additionally avoid introducing exec manipulation and V_MOVs where
a source of V_SET_INACTIVE is the destination.
This is a common pattern as WWM register pre-allocation often
assigns the same register.
2024-09-05 14:39:28 +09:00
Jay Foad
5a6926ce49 [AMDGPU] Fix test update after #107108 2024-09-04 11:48:08 +01:00
Jay Foad
126d6f2710
[AMDGPU] Improve codegen for GFX10+ DPP reductions and scans (#107108)
Use poison for an unused input to the permlanex16 intrinsic, to improve
register allocation and avoid an unnecessary v_mov instruction.
2024-09-04 11:03:22 +01:00
Carl Ritson
86627149f6
[AMDGPU] Mitigate GFX12 VALU read SGPR hazard (#100067)
Any SGPR read by a VALU can potentially obscure SALU writes to the same
register.
Insert s_wait_alu instructions to mitigate the hazard on affected paths.

Compute a global cache of SGPRs with any VALU reads and use this to
avoid inserting mitigation for SGPRs never accessed by VALUs.

To avoid an excessive search when compile time is a priority, implement
a secondary mode in which all SALU writes are mitigated.

Co-authored-by: Shilei Tian <shilei.tian@amd.com>
2024-09-04 12:15:20 +09:00
Changpeng Fang
a82032918c
[AMDGPU] Remove -wavefrontsize32 and -wavefrontsize64 from GFX10+ tests (NFC) (#100711)
They are no longer needed after the patch: [AMDGPU] Remove wavefrontsize
feature from GFX10: https://github.com/llvm/llvm-project/pull/98400
The exception is when "target-features" are set to "+wavefrontsize32" or
"+wavefrontsize64": we still need to remove a wavefrontsize feature
before adding a different one, to make sure only one of them is present.
2024-07-26 00:42:24 -07:00
Christudasan Devadasan
229e118559
[AMDGPU] Codegen support for constrained multi-dword sloads (#96163)
For targets that support the xnack replay feature (gfx8+), the
multi-dword scalar loads shouldn't clobber any register that
holds the src address. The constrained versions of the scalar
loads have the early clobber flag attached to the dst operand
to restrict RA from re-allocating any of the src regs for its
dst operand.
2024-07-23 13:59:15 +05:30
Pierre van Houtryve
b3a446650c
[AMDGPU] Implement GFX12 Memory Model (#98591)
- Emit GLOBAL_WB instructions
- Reflect synscope on instructions' `scope:` operand

Fixes SWDEV-468508
Fixes SWDEV-470735
Fixes SWDEV-468392
Fixes SWDEV-469622
2024-07-16 10:53:06 +02:00
Vikram Hegde
cf230e7799
[AMDGPU] Enable atomic optimizer for divergent i64 and double values (#96934) 2024-07-15 17:49:09 +05:30
Matt Arsenault
b1bcb7ca46 Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commit adaff46d087799072438dd744b038e6fd50a2d78.

Drop the -O3 checks from default-attributes.hip. I don't know why they
are different on some bots but reverting this is far too disruptive.
2024-07-15 11:51:44 +04:00
dyung
adaff46d08
Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and
78bc1b64a6dc3fb6191355a5e1b502be8b3668e7.

The test CodeGenHIP/default-attributes.hip is failing on multiple bots
even after the attempted fix including the following:
- https://lab.llvm.org/buildbot/#/builders/3/builds/1473
- https://lab.llvm.org/buildbot/#/builders/65/builds/1380
- https://lab.llvm.org/buildbot/#/builders/161/builds/595
- https://lab.llvm.org/buildbot/#/builders/154/builds/1372
- https://lab.llvm.org/buildbot/#/builders/133/builds/1547
- https://lab.llvm.org/buildbot/#/builders/81/builds/755
- https://lab.llvm.org/buildbot/#/builders/40/builds/570
- https://lab.llvm.org/buildbot/#/builders/13/builds/748
- https://lab.llvm.org/buildbot/#/builders/12/builds/1845
- https://lab.llvm.org/buildbot/#/builders/11/builds/1695
- https://lab.llvm.org/buildbot/#/builders/190/builds/1829
- https://lab.llvm.org/buildbot/#/builders/193/builds/962
- https://lab.llvm.org/buildbot/#/builders/23/builds/991
- https://lab.llvm.org/buildbot/#/builders/144/builds/2256
- https://lab.llvm.org/buildbot/#/builders/46/builds/1614

These bots have been broken for a day, so reverting to get everything
back to green.
2024-07-14 18:48:54 -07:00
Matt Arsenault
78bc1b64a6
AMDGPU: Move attributor into optimization pipeline (#83131)
Removing it from the codegen pipeline induces a lot of test churn
because llc is no longer optimizing out implicit arguments to kernels.

Mostly mechanical, but there are some creative test updates. I preferred
to take the changes as-is in tests where the ABI isn't relevant. In
cases where it's more relevant, or the optimize out logic was too
ingrained in the test, I pre-run the optimization. Some cases manually
add attributes to disable inputs.
2024-07-14 08:36:33 +04:00
Vikram Hegde
2a9607168b
[AMDGPU] Cleanup bitcast spam in atomic optimizer (#96933) 2024-07-08 10:53:16 +05:30
Pierre van Houtryve
87d7711934
[AMDGPU][SIMemoryLegalizer] Fix order of GL0/1_INV on GFX10/11 (#81450)
Fixes SWDEV-443292
2024-02-13 09:07:51 +01:00
Jay Foad
ba52f06f9d
[AMDGPU] CodeGen for GFX12 S_WAIT_* instructions (#77438)
Update SIMemoryLegalizer and SIInsertWaitcnts to use separate wait
instructions per counter (e.g. S_WAIT_LOADCNT) and split VMCNT into
separate LOADCNT, SAMPLECNT and BVHCNT counters.
2024-01-18 10:47:45 +00:00
Jay Foad
e9e9d1b0b1
[AMDGPU] Disable V_MAD_U64_U32/V_MAD_I64_I32 workaround for GFX12 (#77927) 2024-01-17 11:52:19 +00:00
Fangrui Song
9e9907f1cf
[AMDGPU,test] Change llc -march= to -mtriple= (#75982)
Similar to 806761a7629df268c8aed49657aeccffa6bca449.

For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.

Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outrightly.
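
The difference between the two flags can be pictured with a toy model. This is only a sketch of the behavior described above, not llc's actual triple handling; the helper names and the host default triple are hypothetical.

```python
def apply_march(default_triple: str, arch: str) -> str:
    """Toy model of -march=: only the architecture component is replaced."""
    _, rest = default_triple.split("-", 1)
    return f"{arch}-{rest}"


def apply_mtriple(default_triple: str, triple: str) -> str:
    """Toy model of -mtriple=: the given triple is used as-is."""
    return triple


# Hypothetical default triple on a macOS host (an assumption for illustration).
host_default = "x86_64-apple-darwin"

march_result = apply_march(host_default, "amdgcn")
mtriple_result = apply_mtriple(host_default, "amdgcn")

print(march_result)    # vendor and OS leak in from the host default
print(mtriple_result)  # no host components are inherited
```

In this model the `-march=` result inherits the vendor and OS of whatever machine runs the test, which is the kind of nonsensical triple the message warns about.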

This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:

```
  LLVM :: CodeGen/AMDGPU/fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fabs.ll
  LLVM :: CodeGen/AMDGPU/floor.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
  LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
  LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
2024-01-16 21:54:58 -08:00
Jay Foad
daa4728dee
[AMDGPU] Add CodeGen support for GFX12 s_mul_u64 (#75825) 2024-01-08 19:13:38 +00:00
Mirko Brkušanin
7ca4473dd9
[AMDGPU] Add new cache flushing instructions for GFX12 (#76944)
Co-authored-by: Diana Picus <Diana-Magda.Picus@amd.com>
2024-01-08 14:06:58 +00:00
Acim Maravic
48f36c6e74
[LLVM] Make use of s_flbit_i32_b64 and s_ff1_i32_b64 (#75158)
Update DAG ISel to support the 64-bit versions S_FF1_I32_B64 and
S_FLBIT_I32_B64

---------

Co-authored-by: Acim Maravic <Acim.Maravic@amd.com>
2023-12-25 11:55:20 +01:00
Mirko Brkušanin
5879162f7f
[AMDGPU] CodeGen for GFX12 VBUFFER instructions (#75492) 2023-12-15 13:45:03 +01:00
Pierre van Houtryve
ef067f5204
[AMDGPU][SIInsertWaitcnts] Do not add s_waitcnt when the counters are known to be 0 already (#72830)
Co-authored-by: Juan Manuel MARTINEZ CAAMAÑO <juamarti@amd.com>
2023-12-15 12:33:32 +01:00
Jay Foad
7b3bbd83c0 Revert "[CodeGen] Really renumber slot indexes before register allocation (#67038)"
This reverts commit 2501ae58e3bb9a70d279a56d7b3a0ed70a8a852c.

Reverted due to various buildbot failures.
2023-10-09 12:31:32 +01:00
Jay Foad
2501ae58e3
[CodeGen] Really renumber slot indexes before register allocation (#67038)
PR #66334 tried to renumber slot indexes before register allocation, but
the numbering was still affected by list entries for instructions which
had been erased. Fix this to make the register allocator's live range
length heuristics even less dependent on the history of how instructions
have been added to and removed from SlotIndexes's maps.
2023-10-09 11:44:41 +01:00
Matt Arsenault
87b6f85c2b AMDGPU: Add syncscopes to some atomic tests
These were not testing what was intended, which should be the cases we
can directly select to the instructions.
2023-08-08 14:38:06 -04:00
Jay Foad
7fa7a08f21 [AMDGPU] Insert s_nop before s_sendmsg sendmsg(MSG_DEALLOC_VGPRS)
Differential Revision: https://reviews.llvm.org/D155681
2023-07-19 10:33:11 +01:00
Sameer Sahasrabuddhe
7a101798b7 Revert "[AMDGPU] Mark mbcnt as convergent"
This reverts commit 37114036aa57e53217a57afacd7f47b36114edfb.

The output of mbcnt does not depend on other active lanes, and hence it is not
convergent. The original change was made as a possible fix for

https://github.com/ROCm-Developer-Tools/HIP/issues/3172

But changing mbcnt does not fix that issue.

Reviewed By: ruiling, foad, yaxunl

Differential Revision: https://reviews.llvm.org/D153953
2023-06-30 13:10:44 +05:30
Pravin Jagtap
597fb7fb46 [AMDGPU] Switch to the new cl option amdgpu-atomic-optimizer-strategy.
The atomic optimizer is turned on by default through D152649. This patch
removes the usage of the old command line option amdgpu-atomic-optimizations
and transfers the responsibility to `amdgpu-atomic-optimizer-strategy`.

We can safely remove the old option once LLPC removes all of its uses.

Reviewed By: foad, arsenm, #amdgpu, cdevadas

Differential Revision: https://reviews.llvm.org/D153007
2023-06-22 07:06:42 -04:00
Pravin Jagtap
f6c8a8e9cb [AMDGPU] Iterative scan implementation for atomic optimizer.
This patch provides an alternative implementation to DPP for Scan Computations.

The alternative implementation iterates over all active lanes of the wavefront
using llvm.cttz and performs the following steps:
    1.  Read the value that needs to be atomically incremented using
        llvm.amdgcn.readlane intrinsic
    2.  Accumulate the result.
    3.  Update the scan result using llvm.amdgcn.writelane intrinsic
        if intermediate scan results are needed later in the kernel.
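
The steps above can be simulated on the host as a rough sketch. This is not the pass's code: the wavefront is modeled as a Python list, the exec mask as an integer, the operation is assumed to be an add, and the scan is assumed to be exclusive (each lane sees the sum of the active lanes before it).

```python
def cttz(mask: int) -> int:
    """Count trailing zeros: index of the lowest set bit (mask must be nonzero)."""
    return (mask & -mask).bit_length() - 1


def iterative_scan(values, exec_mask):
    """Walk the active lanes from lowest to highest.

    Returns (total, scan), where scan[lane] is the running sum before that
    lane's contribution and stays None for inactive lanes.
    """
    scan = [None] * len(values)
    acc = 0
    active = exec_mask
    while active:
        lane = cttz(active)        # next active lane
        lane_value = values[lane]  # stands in for llvm.amdgcn.readlane
        scan[lane] = acc           # stands in for llvm.amdgcn.writelane
        acc += lane_value          # accumulate the result
        active &= active - 1       # clear the lane we just handled
    return acc, scan


total, scan = iterative_scan([1, 2, 3, 4], 0b1011)
print(total, scan)
```

The loop runs once per active lane, so its cost scales with the number of active lanes rather than with the wavefront size.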

Reviewed By: arsenm, cdevadas

Differential Revision: https://reviews.llvm.org/D147408
2023-06-09 01:08:44 -04:00
Nikita Popov
bdf2fbba9c [AMDGPU] Convert some tests to opaque pointers (NFC) 2022-12-19 12:41:13 +01:00
Jay Foad
5cae88164e [AMDGPU] Add GFX11 test coverage
Add GFX11 test coverage to a bunch of tests where it was easy to do so,
mostly because the checks are autogenerated and/or GFX11 can share the
same checks as GFX10.

Differential Revision: https://reviews.llvm.org/D129295
2022-07-08 09:13:59 +01:00
David Stuttard
77851cc1cf [AMDGPU] Change use null for dead sdst to be gfx1030+
Before gfx1030, null for sdst behaves differently.
Commit c97436f8b6e2 ("[AMDGPU] Use null for dead sdst operand") requires a
change so that it does not apply to pre-gfx1030 targets.

Differential Revision: https://reviews.llvm.org/D127869
2022-06-16 10:39:06 +01:00
Stanislav Mekhanoshin
c97436f8b6 [AMDGPU] Use null for dead sdst operand
Differential Revision: https://reviews.llvm.org/D127542
2022-06-13 14:41:40 -07:00
Stanislav Mekhanoshin
23db8e4b43 [AMDGPU] Use v_mad_u64_u32 for IMAD32
Nic Curtis ran the experiments showing it is faster than a
separate mul and add.

Fixes: SWDEV-332806

Differential Revision: https://reviews.llvm.org/D127253
2022-06-09 11:39:49 -07:00
Nicolai Hähnle
6c2a01ce3a AMDGPU/SDAG: Refine the fold to v_mad_[iu]64_[iu]32
Only fold for uniform values on pre-GFX9 chips. GFX9+ allow us
to keep the calculation entirely on the SALU.

For subtargets where integer multiplication isn't full-rate, avoid
folding if the multiply has too many uses.

Finally, we expand 64x32 and 64x64 multiplies here as well, if they
feed into an addition. This results in better code generation than
the generic expansion for such multiplies because we end up using
the accumulator of the MAD instructions.
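
As a rough illustration of why the MAD accumulator helps, the sketch below models a 64x64 multiply feeding an add. It is an arithmetic model only, not the DAG combine: `mad_u64_u32` here is a Python stand-in for the instruction's semantics, and the decomposition shown is one possible expansion chosen for illustration.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def mad_u64_u32(a: int, b: int, c: int) -> int:
    """Model: 32-bit a times 32-bit b plus 64-bit c, truncated to 64 bits."""
    return ((a & MASK32) * (b & MASK32) + (c & MASK64)) & MASK64


def mul64_add(x: int, y: int, z: int) -> int:
    """Compute (x * y + z) mod 2**64 from 32-bit pieces."""
    x_lo, x_hi = x & MASK32, (x >> 32) & MASK32
    y_lo, y_hi = y & MASK32, (y >> 32) & MASK32
    # The addend rides along in the accumulator of the low partial product.
    acc = mad_u64_u32(x_lo, y_lo, z)
    # Cross terms only affect the upper 32 bits of the 64-bit result.
    cross = (x_lo * y_hi + x_hi * y_lo) & MASK32
    return (acc + (cross << 32)) & MASK64


x, y, z = 0x1_0000_0003, 0x2_0000_0005, 7
result = mul64_add(x, y, z)
print(hex(result))
```

The addition costs nothing extra here because it is folded into the first multiply-add.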

Differential Revision: https://reviews.llvm.org/D123835
2022-05-10 09:15:51 -05:00
Ruiling Song
0719c43735 AMDGPU: Don't clobber source register for V_SET_INACTIVE_*
The WWM register has unmodeled register liveness. For v_set_inactive_*,
clobbering the source register is dangerous because it will overwrite the
inactive lanes. When the source vgpr is dead at v_set_inactive_lane,
the inactive lanes may not really be dead. This may cause common
optimizations to do the wrong thing.

For example in a simple if-then cfg in Machine IR:
bb.if:
  %src =

bb.then:
  %src1 = COPY %src
  %dst = V_SET_INACTIVE %src1(tied-def 0), %inactive

bb.end:
  ... = PHI [0, %bb.then] [%src, %bb.if]

The register coalescer will think it is safe to optimize "%src1 = COPY %src"
in bb.then. And at the same time, there is no interference for the PHI in
bb.end. The source and destination values of the PHI will be assigned
the same register. The single PHI register will be overwritten by the
v_set_inactive, so we would get the wrong value in bb.end.

With this change, we will copy the content of the source register before
setting inactive lanes after register allocation. Yes, this will sacrifice
the WWM code generation a little, but I don't have any better idea to do things
correctly.

Differential Revision: https://reviews.llvm.org/D117482
2022-02-06 12:38:26 +08:00
Austin Kerbow
da067ed569 [AMDGPU] Set most sched model resource's BufferSize to one
Using a BufferSize of one for memory ProcResources will result in better
ILP since it more accurately models the dependencies between memory ops
and their consumers on an in-order processor. After this change, the
scheduler will treat the data edges from loads as blocking so that
stalls are guaranteed when waiting for data to be retrieved from memory.
Since we don't actually track waitcnt here, this should do a better job
at modeling their behavior.

Practically, this means that the scheduler will trigger the 'STALL'
heuristic more often.

This type of change needs to be evaluated experimentally. Preliminary
results are positive.

Fixes: SWDEV-282962

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D114777
2021-12-01 22:31:28 -08:00
Jay Foad
d7e03df719 [AMDGPU] Implement widening multiplies with v_mad_i64_i32/v_mad_u64_u32
Select SelectionDAG ops smul_lohi/umul_lohi to
v_mad_i64_i32/v_mad_u64_u32 respectively, with an addend of 0.
v_mul_lo, v_mul_hi and v_mad_i64/u64 are all quarter-rate instructions
so it is better to use one instruction than two.
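
The selection can be pictured with a small arithmetic model. This is a sketch of the instruction semantics as described above, written as Python stand-ins; it is not the ISel pattern itself.

```python
MASK32 = (1 << 32) - 1
MASK64 = (1 << 64) - 1


def to_signed32(x: int) -> int:
    """Reinterpret a 32-bit pattern as a signed integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def mad_u64_u32(a: int, b: int, c: int) -> int:
    """Model: unsigned 32x32 multiply plus 64-bit addend, truncated to 64 bits."""
    return ((a & MASK32) * (b & MASK32) + c) & MASK64


def mad_i64_i32(a: int, b: int, c: int) -> int:
    """Model: signed 32x32 multiply plus 64-bit addend, truncated to 64 bits."""
    return (to_signed32(a) * to_signed32(b) + c) & MASK64


def umul_lohi(a: int, b: int):
    """Both halves of the unsigned product from one multiply-add, addend 0."""
    wide = mad_u64_u32(a, b, 0)
    return wide & MASK32, wide >> 32


def smul_lohi(a: int, b: int):
    """Both halves of the signed product from one multiply-add, addend 0."""
    wide = mad_i64_i32(a, b, 0)
    return wide & MASK32, wide >> 32


print([hex(v) for v in umul_lohi(0xFFFFFFFF, 2)])
print([hex(v) for v in smul_lohi(0xFFFFFFFF, 2)])
```

One wide operation yields the low and high halves together, where a separate v_mul_lo and v_mul_hi pair would need two.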

Further improvements are possible to make better use of the addend
operand, but this is already a strict improvement over what we have
now.

Differential Revision: https://reviews.llvm.org/D113986
2021-11-24 11:25:02 +00:00
RamNalamothu
18f9351223 [AMDGPU] Do not generate ELF symbols for the local branch target labels
The compiler was generating symbols in the final code object for local
branch target labels. This bloats the code object, slows down the loader,
and is only used to simplify disassembly.

Use '--symbolize-operands' with llvm-objdump to improve readability of the
branch target operands in disassembly.

Fixes: SWDEV-312223

Reviewed By: scott.linder

Differential Revision: https://reviews.llvm.org/D114273
2021-11-20 10:32:41 +05:30
Joe Nash
3ce1b9631a [AMDGPU] Switch PostRA sched to MachineSched
Use GCNHazardRecognizer in postra sched.
Updated tests for the new schedules.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D109536

Change-Id: Ia86ba2ae168f12fb34b4d8efdab491f84d936cde
2021-09-14 15:11:27 -04:00
Matt Arsenault
d719f1c3cc AMDGPU: Add alloc priority to global ranges
The requested register class priorities weren't respected
globally. Not sure why this is a target option, and not just the
expected behavior (recently added in
1a6dc92be7d68611077f0fb0b723b361817c950c). This avoids an allocation
failure when many wide tuple spills are introduced. I think this is a
workaround since I would not expect the allocation priority to be
required, and only a performance hint. The allocator should be smarter
about when only a subregister needs to be spilled and restored.

This does regress a couple of degenerate store stress lit tests which
shouldn't be too important.
2021-08-10 13:12:34 -04:00
Dmitry Preobrazhensky
cd953434f2 [AMDGPU][MC][GFX10][GFX90A] Corrected _e32/_e64 suffixes
Fixed bugs https://bugs.llvm.org//show_bug.cgi?id=49643, https://bugs.llvm.org//show_bug.cgi?id=49644, https://bugs.llvm.org//show_bug.cgi?id=49645.

Differential Revision: https://reviews.llvm.org/D99413
2021-04-01 14:21:00 +03:00
Jay Foad
87248e852b [AMDGPU] Rationalize some check prefixes and use more common prefixes. NFC. 2021-03-19 16:48:33 +00:00
Jay Foad
5df52f7708 [AMDGPU] Remove weird target triples from tests. NFC. 2021-03-19 16:48:32 +00:00
Simon Pilgrim
9d2df96407 [DAG] computeKnownBits - add ISD::MULHS/MULHU/SMUL_LOHI/UMUL_LOHI handling
Reuse the existing KnownBits multiplication code to handle the 'extend + multiply + extract high bits' pattern for multiply-high ops.
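
A minimal sketch of the "extend + multiply + extract high bits" pattern follows. It is not LLVM's KnownBits code: it tracks only known leading zeros, a deliberate simplification, and the helper names are hypothetical.

```python
def mulhu(a: int, b: int, bits: int) -> int:
    """Unsigned multiply-high: zero-extend, multiply, keep the top `bits` bits."""
    mask = (1 << bits) - 1
    return (((a & mask) * (b & mask)) >> bits) & mask


def leading_zeros(x: int, bits: int) -> int:
    """Number of leading zero bits of x in a `bits`-wide value."""
    return bits - x.bit_length()


def mulhu_min_leading_zeros(lz_a: int, lz_b: int, bits: int) -> int:
    """Lower bound on the leading zeros of mulhu(a, b).

    If a < 2**(bits - lz_a) and b < 2**(bits - lz_b), the full product is
    below 2**(2*bits - lz_a - lz_b), so its high half has at least
    lz_a + lz_b leading zeros (capped at the width).
    """
    return min(bits, lz_a + lz_b)


hi = mulhu(0x0F, 0x0F, 8)
bound = mulhu_min_leading_zeros(leading_zeros(0x0F, 8), leading_zeros(0x0F, 8), 8)
print(hi, bound)
```

Two 4-bit values in 8-bit registers have a product below 2**8, so the high half is known to be zero without looking at the actual operands.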

Noticed while looking at the codegen for D88785 / D98587 - the patch helps division-by-constant expansion code in particular, which suggests that we might have some further KnownBits div/rem cases we could handle - but this was far easier to implement.

Differential Revision: https://reviews.llvm.org/D98857
2021-03-19 16:02:31 +00:00
Simon Pilgrim
388fbefb4f [AMDGPU] Regenerate atomic_optimizations_global_pointer.ll tests 2021-03-18 11:15:44 +00:00
Nicolai Hähnle
52bc2e7577 [AMDGPU][SelectionDAG] Don't combine uniform multiplies to MUL_[UI]24
Prefer to keep uniform (non-divergent) multiplies on the scalar ALU when
possible. This significantly improves some game cases by eliminating
v_readfirstlane instructions when the result feeds into a scalar
operation, like the address calculation for a scalar load or store.

Since isDivergent is only an approximation of whether a value is in
SGPRs, it can potentially regress some situations where a uniform value
ends up in a VGPR. These should be rare in real code, although the test
changes do contain a number of examples.

Most of the test changes are just using s_mul instead of v_mul/mad which
is generally better for both register pressure and latency (at least on
GFX10 where sgpr pressure doesn't affect occupancy and vector ALU
instructions have significantly longer latency than scalar ALU). Some
R600 tests now use MULLO_INT instead of MUL_UINT24.

GlobalISel appears to handle more scenarios in the desirable way,
although it can also be thrown off and fails to select the 24-bit
multiplies in some cases.

An alternative solution considered and rejected was to allow selecting
MUL_[UI]24 to S_MUL_I32. I've rejected this because those SD operations
are defined to be don't-care on the most significant 8 bits,
and this fact is used in some combines via SimplifyDemandedBits.

Based on a patch by Nicolai Hähnle.

Differential Revision: https://reviews.llvm.org/D97063
2021-02-23 15:39:19 +00:00