180 Commits

Author SHA1 Message Date
Lucas Ramirez
6206f5444f
[AMDGPU] Occupancy w.r.t. workgroup size range is also a range (#123748)
Occupancy (i.e., the number of waves per EU) depends, in addition to
register usage, on per-workgroup LDS usage as well as on the range of
possible workgroup sizes. Mirroring the latter, occupancy should
therefore be expressed as a range since different group sizes generally
yield different achievable occupancies.

`getOccupancyWithLocalMemSize` currently returns a scalar occupancy
based on the maximum workgroup size and LDS usage. With respect to the
workgroup size range, this scalar can be the minimum, the maximum, or
neither of the two of the range of achievable occupancies. This commit
fixes the function by making it compute and return the range of
achievable occupancies w.r.t. workgroup size and LDS usage; it also
renames it to `getOccupancyWithWorkGroupSizes` since it is the range of
workgroup sizes that produces the range of achievable occupancies.

Computing the achievable occupancy range is surprisingly involved.
Minimum/maximum workgroup sizes do not necessarily yield maximum/minimum
occupancies, i.e., sometimes workgroup sizes inside the range yield the
occupancy bounds. The implementation finds these sizes in constant time;
heavy documentation explains the rationale behind the sometimes
relatively obscure calculations.

As a justifying example, consider a target with 10 waves / EU, 4 EUs/CU,
64-wide waves. Also consider a function with no LDS usage and a flat
workgroup size range of [513,1024].

- A group of 513 items requires 9 waves per group. Only 4 groups made up
of 9 waves each can fit fully on a CU at any given time, for a total of
36 waves on the CU, or 9 per EU. However, filling as much as possible
the remaining 40-36=4 wave slots without decreasing the number of groups
reveals that a larger group of 640 items yields 40 waves on the CU, or
10 per EU.
- Similarly, a group of 1024 items requires 16 waves per group. Only 2
groups made up of 16 waves each can fit fully on a CU at any given time,
for a total of 32 waves on the CU, or 8 per EU. However, removing as
many waves as possible from the groups without being able to fit another
equal-sized group on the CU reveals that a smaller group of 896 items
yields 28 waves on the CU, or 7 per EU.

Therefore the achievable occupancy range for this function is not [8,9]
as the group size bounds directly yield, but [7,10].
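
The arithmetic in this example can be checked by brute force. The sketch below is not the constant-time implementation the commit describes; it simply tries every group size in the range, using the example target's figures (10 waves per EU, 4 EUs per CU, 64-wide waves, no LDS limit). The function names are hypothetical, and rounding waves per EU up is an assumption (the four sizes quoted above divide evenly, so they are unaffected by it).

```python
def occupancy_for_group_size(group_size, wave_size=64, waves_per_eu=10, eus_per_cu=4):
    """Waves per EU when the CU is filled with whole groups of `group_size` items."""
    waves_per_group = -(-group_size // wave_size)   # ceiling division
    wave_slots = waves_per_eu * eus_per_cu          # 40 slots on the example target
    groups_on_cu = wave_slots // waves_per_group    # only whole groups fit
    waves_on_cu = groups_on_cu * waves_per_group
    # Assumption: waves spread evenly, so the busiest EU holds the ceiling.
    return -(-waves_on_cu // eus_per_cu)


def occupancy_range(min_size, max_size):
    """(min, max) occupancy over every group size in [min_size, max_size]."""
    occupancies = [occupancy_for_group_size(s) for s in range(min_size, max_size + 1)]
    return min(occupancies), max(occupancies)


print(occupancy_for_group_size(513), occupancy_for_group_size(1024))  # 9 8
print(occupancy_range(513, 1024))                                     # (7, 10)
```

The bounds come from sizes strictly inside the range (640 and 896 items here), which is why evaluating only the two endpoints is not enough.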

Naturally this change causes a lot of test churn as instruction
scheduling is driven by achievable occupancy estimates. In most unit
tests the flat workgroup size range is the default [1,1024] which,
ignoring potential LDS limitations, would previously produce a scalar
occupancy of 8 (derived from 1024) on a lot of targets, whereas we now
consider the maximum occupancy to be 10 in such cases. Most tests are
updated automatically and checked manually for sanity. I also manually
changed some non-automatically generated assertions when necessary.

Fixes #118220.
2025-01-23 16:07:57 +01:00
Stanislav Mekhanoshin
21704a685d
[AMDGPU] Fix printing hasInitWholeWave in mir (#123232) 2025-01-17 03:00:02 -08:00
Ruiling, Song
67c55b1ffc
[AMDGPU] Make max dwords of memory cluster configurable (#119342)
We find it helpful to increase the value for graphics workload. Make it
configurable so we can experiment with a different value.
2024-12-18 14:17:27 +08:00
Shilei Tian
6548b6354d Reapply "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit ca33649abe5fad93c57afef54e43ed9b3249cd86.
2024-11-08 20:21:16 -05:00
Shilei Tian
ca33649abe Revert "[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403)"
This reverts commit e215a1e27d84adad2635a52393621eb4fa439dc9 as it broke both
hip and openmp buildbots.
2024-11-08 16:36:35 -05:00
Shilei Tian
e215a1e27d
[AMDGPU] Still set up the two SGPRs for queue ptr even it is COV5 (#112403) 2024-11-08 13:05:35 -05:00
dyung
bc7e099aa8
Revert "[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes" (#115353)
Reverts llvm/llvm-project#115291

Reverting due to test failures on many bots including
https://lab.llvm.org/buildbot/#/builders/174/builds/8049
2024-11-07 13:02:51 -05:00
Akshat Oke
21835ee28d
[AMDGPU][MIR] Serialize NumPhysicalVGPRSpillLanes (#115291) 2024-11-07 20:08:36 +05:30
Akshat Oke
e76d9214c8
[AMDGPU] Fix 3495d04 MIR test (#114963)
Needed to specify scratchRSrcReg and spreg in order to stop after
prologepilog.

- Fixes #113129 test failure
2024-11-05 17:11:47 +05:30
Akshat Oke
3495d04560
[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129) 2024-11-05 13:17:25 +05:30
Akshat Oke
6360652e9f
Reland [AMDGPU] Serialize WWM_REG vreg flag (#110229) (#112492)
A reland but not an exact copy as `VRegInfo.Flags` from the parser is
now an int8 instead of a vector; so only need to copy over the value.
2024-10-21 13:44:09 +05:30
Peter Collingbourne
3cab8827fd Revert "[AMDGPU] Serialize WWM_REG vreg flag (#110229)"
This reverts commit bec839d8eed9dd13fa7eaffd50b28f8f913de2e2.

Caused buildbot failures, e.g.
https://lab.llvm.org/buildbot/#/builders/52/builds/2928
2024-10-15 13:18:43 -07:00
Akshat Oke
bec839d8ee
[AMDGPU] Serialize WWM_REG vreg flag (#110229) 2024-10-14 14:37:21 +05:30
Diana Picus
3356208531
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512)
This reverts commit
7792b4ae79.

The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).

...if only we had a merge queue.
2024-09-13 11:54:30 +02:00
Diana Picus
7792b4ae79
Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)"" (#108341)
Reverts llvm/llvm-project#108173

si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
2024-09-12 10:12:09 +02:00
Diana Picus
703ebca869
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)" (#108173)
This reverts commit
c7a7767fca.

The buildbots failed because I removed a MI from its parent before
updating LIS. This PR should fix that.
2024-09-12 09:11:41 +02:00
Vitaly Buka
c7a7767fca
Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)
Breaks bots, see #105822.

Reverts llvm/llvm-project#105822
2024-09-10 09:51:43 -07:00
Diana Picus
44556e64f2
[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822)
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.

As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):

```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```

It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).

The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
2024-09-10 13:24:53 +02:00
Carl Ritson
16cda01d22
[AMDGPU] V_SET_INACTIVE optimizations (#98864)
Optimize V_SET_INACTIVE by allowing it to run in WWM.
Hence WWM sections are not broken up for inactive lane setting.
WWM V_SET_INACTIVE can typically be lowered to V_CNDMASK.
Some cases still require exec manipulation and V_MOV, as in the previous code.
GFX9 sees slight instruction count increase in edge cases due to
smaller constant bus.

Additionally avoid introducing exec manipulation and V_MOVs where
a source of V_SET_INACTIVE is the destination.
This is a common pattern as WWM register pre-allocation often
assigns the same register.
2024-09-05 14:39:28 +09:00
Matt Arsenault
b1bcb7ca46 Reapply "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commit adaff46d087799072438dd744b038e6fd50a2d78.

Drop the -O3 checks from default-attributes.hip. I don't know why they
are different on some bots but reverting this is far too disruptive.
2024-07-15 11:51:44 +04:00
dyung
adaff46d08
Revert "AMDGPU: Move attributor into optimization pipeline (#83131)" and follow up commit "clang/AMDGPU: Defeat attribute optimization in attribute test" (#98851)
This reverts commits 677cc15e0ff2e0e6aa30538eb187990a6a8f53c0 and
78bc1b64a6dc3fb6191355a5e1b502be8b3668e7.

The test CodeGenHIP/default-attributes.hip is failing on multiple bots
even after the attempted fix including the following:
- https://lab.llvm.org/buildbot/#/builders/3/builds/1473
- https://lab.llvm.org/buildbot/#/builders/65/builds/1380
- https://lab.llvm.org/buildbot/#/builders/161/builds/595
- https://lab.llvm.org/buildbot/#/builders/154/builds/1372
- https://lab.llvm.org/buildbot/#/builders/133/builds/1547
- https://lab.llvm.org/buildbot/#/builders/81/builds/755
- https://lab.llvm.org/buildbot/#/builders/40/builds/570
- https://lab.llvm.org/buildbot/#/builders/13/builds/748
- https://lab.llvm.org/buildbot/#/builders/12/builds/1845
- https://lab.llvm.org/buildbot/#/builders/11/builds/1695
- https://lab.llvm.org/buildbot/#/builders/190/builds/1829
- https://lab.llvm.org/buildbot/#/builders/193/builds/962
- https://lab.llvm.org/buildbot/#/builders/23/builds/991
- https://lab.llvm.org/buildbot/#/builders/144/builds/2256
- https://lab.llvm.org/buildbot/#/builders/46/builds/1614

These bots have been broken for a day, so reverting to get everything
back to green.
2024-07-14 18:48:54 -07:00
Matt Arsenault
78bc1b64a6
AMDGPU: Move attributor into optimization pipeline (#83131)
Removing it from the codegen pipeline induces a lot of test churn
because llc is no longer optimizing out implicit arguments to kernels.

Mostly mechanical, but there are some creative test updates. I preferred
to take the changes as-is in tests where the ABI isn't relevant. In
cases where it's more relevant, or the optimize out logic was too
ingrained in the test, I pre-run the optimization. Some cases manually
add attributes to disable inputs.
2024-07-14 08:36:33 +04:00
Jonas Paulsson
09bc6abba6
[MachineFrameInfo] Refactoring around computeMaxcallFrameSize() (NFC) (#78001)
- Use computeMaxCallFrameSize() in PEI::calculateCallFrameInfo() instead of duplicating the code.

- Set AdjustsStack in FinalizeISel instead of in computeMaxCallFrameSize().
2024-03-18 10:37:59 -04:00
Diana Picus
bc6955f18c
[AMDGPU] Don't fix the scavenge slot at offset 0 (#79136)
At the moment, the emergency spill slot is a fixed object for entry
functions and chain functions, and a regular stack object otherwise.
This patch adopts the latter behaviour for entry/chain functions too. It
seems this was always the intention [1] and it will also save us a bit
of stack space in cases where the first stack object has a large
alignment.

[1]
34c8b835b1
2024-02-09 09:20:25 +01:00
Nikita Popov
ff9af4c43a [CodeGen] Convert tests to opaque pointers (NFC) 2024-02-05 14:07:09 +01:00
Mirko Brkušanin
1d286ad59b
[AMDGPU] Add mark last scratch load pass (#75512) 2024-01-18 09:36:44 +01:00
Fangrui Song
9e9907f1cf
[AMDGPU,test] Change llc -march= to -mtriple= (#75982)
Similar to 806761a7629df268c8aed49657aeccffa6bca449.

For IR files without a target triple, -mtriple= specifies the full
target triple while -march= merely sets the architecture part of the
default target triple, leaving a target triple which may not make sense,
e.g. amdgpu-apple-darwin.

Therefore, -march= is error-prone and not recommended for tests without
a target triple. The issue has been benign as we recognize
$unknown-apple-darwin as ELF instead of rejecting it outright.

This patch changes AMDGPU tests to not rely on the default
OS/environment components. Tests that need fixes are not changed:

```
  LLVM :: CodeGen/AMDGPU/fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fabs.ll
  LLVM :: CodeGen/AMDGPU/floor.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.f64.ll
  LLVM :: CodeGen/AMDGPU/fneg-fabs.ll
  LLVM :: CodeGen/AMDGPU/r600-infinite-loop-bug-while-reorganizing-vector.ll
  LLVM :: CodeGen/AMDGPU/schedule-if-2.ll
```
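
A mechanical rewrite of this kind could be scripted roughly as follows. This is a minimal sketch and not the script used for the commit; the helper name and the regular expression are assumptions, and it only touches lit `RUN:` lines.

```python
import re

# Hypothetical pattern: "-march=<arch>" as a standalone command-line option.
MARCH = re.compile(r"(?<!\S)-march=(\S+)")


def use_mtriple(line: str) -> str:
    """Rewrite -march=<arch> to -mtriple=<arch> on RUN lines; leave other lines alone."""
    if "RUN:" not in line:
        return line
    return MARCH.sub(r"-mtriple=\1", line)


before = "; RUN: llc -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s"
print(use_mtriple(before))
# ; RUN: llc -mtriple=amdgcn -mcpu=gfx900 < %s | FileCheck %s
```

Tests whose output depends on the default OS/environment components would still need manual attention, which matches the list of unchanged tests above.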
2024-01-16 21:54:58 -08:00
Matt Arsenault
c44dca15a4
MachineVerifier: Reject extra non-register operands on instructions (#73758)
We were allowing extra immediate arguments, and only bothering to check
if registers were implicit or not.

Also consolidate extra operand checks in verifier, to make this
testable. We had 3 different places checking if you were trying to build
an instruction with more operands than allowed by the definition. We had
an assertion in addOperand, a direct check in the MIRParser to avoid the
assertion, and the machine verifier checks. Remove the assert and parser
check so the verifier can provide a consistent verification experience,
which will also handle instructions modified in place.
2023-11-30 22:33:42 +09:00
Diana Picus
5272ae667d [AMDGPU] Add IsChainFunction to the MachineFunctionInfo
This will represent functions with the amdgpu_cs_chain or
amdgpu_cs_chain_preserve calling conventions.

Differential Revision: https://reviews.llvm.org/D156410
2023-08-21 12:37:32 +02:00
Sameer Sahasrabuddhe
ef38e6d97f [GlobalISel] introduce MIFlag::NoConvergent
Some opcodes in MIR are defined to be convergent by the target by setting
IsConvergent in the corresponding TD file. For example, in AMDGPU, the opcodes
G_SI_CALL and G_INTRINSIC* are marked as convergent. But this is too
conservative, since calls to functions that do not execute convergent operations
should not be marked convergent. This information is available in LLVM IR.

The new flag MIFlag::NoConvergent now allows the IR translator to mark an
instruction as not performing any convergent operations. It is relevant only on
occurrences of opcodes that are marked isConvergent in the target.

Differential Revision: https://reviews.llvm.org/D157475
2023-08-20 21:14:46 +05:30
Stanislav Mekhanoshin
0c7e8c06bc [AMDGPU] Change syncscopes.mir not to use undefined cpol bits. NFC. 2023-08-04 11:19:12 -07:00
Matt Arsenault
4d42e8b5d1 Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.

The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
2023-07-31 20:15:45 -04:00
Vitaly Buka
a496c8be6e Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
And dependent commits.

Details in D150388.

This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.

No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

Reviewed By: alexfh

Differential Revision: https://reviews.llvm.org/D156381
2023-07-26 22:13:32 -07:00
Fangrui Song
9996e71f2d [Support] Implement LLVM_ENABLE_REVERSE_ITERATION for StringMap
ProgrammersManual.html says

> StringMap iteration order, however, is not guaranteed to be deterministic, so any uses which require that should instead use a std::map.

This patch makes -DLLVM_REVERSE_ITERATION=on (currently
-DLLVM_ENABLE_REVERSE_ITERATION=on works as well) shuffle StringMap
iteration order (actually flipping the hash so that elements not in the
same bucket are reversed) to catch violations, similar to D35043 for
DenseMap. This should help change the hash function (e.g., D142862,
D155781).

With a lot of fixes, there are still some violations. This patch
implements the "reverse_iteration" lit feature to skip such tests.
Eventually we should remove this feature.

`ninja check-{llvm,clang,clang-tools}` are clean with
`#define LLVM_ENABLE_REVERSE_ITERATION 1`.

Reviewed By: jhenderson

Differential Revision: https://reviews.llvm.org/D155789
2023-07-21 08:46:51 -07:00
Christudasan Devadasan
7a98f084c4 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions: SGPR spilling fails
when there are not enough VGPRs, and we are then
forced to spill the leftover into memory during PEI.
The custom spill handling during PEI has many edge
cases and breaks the compiler from time to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

Differential Revision: https://reviews.llvm.org/D124196
2023-07-07 23:14:32 +05:30
Christudasan Devadasan
b78b36e1a2 [AMDGPU] Implement whole wave register spill
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.
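
The effect of the exec mask on a spill can be illustrated with a toy model. This is only a sketch of the idea, assuming a 64-lane wave and representing a VGPR as a Python list; the function names are hypothetical and nothing here corresponds to actual compiler code.

```python
WAVE_SIZE = 64
ALL_LANES = (1 << WAVE_SIZE) - 1


def spill(vgpr, exec_mask):
    """Store only the lanes enabled in exec; disabled lanes never reach the slot."""
    return [v if (exec_mask >> lane) & 1 else None for lane, v in enumerate(vgpr)]


def whole_wave_spill(vgpr, exec_mask):
    """Spill with every lane enabled, then put the original exec mask back."""
    saved_exec = exec_mask      # remember the mask the code was running with
    exec_mask = ALL_LANES       # temporarily set all exec bits to one
    slot = spill(vgpr, exec_mask)
    exec_mask = saved_exec      # restore before continuing
    return slot, exec_mask


vgpr = list(range(WAVE_SIZE))   # one value per lane
partial = spill(vgpr, 0b1111)   # lanes 4..63 are lost
full, exec_after = whole_wave_spill(vgpr, 0b1111)
```

With the ordinary spill, values written by a whole wave mode operation in inactive lanes are dropped; the whole wave variant preserves them and leaves exec unchanged afterwards.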

This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually cause the whole
wave register spills during allocation.

Reviewed By: arsenm, cdevadas

Differential Revision: https://reviews.llvm.org/D143759
2023-07-07 22:51:45 +05:30
Brendon Cahoon
853b2a84cb [AMDGPU] Reserve SGPR pair when long branches are present
Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the
case when an indirect branch target is too far away. The register
scavenger may not find available registers, which causes a “did not find
scavenging index” assert to occur in assignRegToScavengingIndex.

In this patch, we estimate before register allocation whether an
indirect branch is likely to be needed, and reserve 2 SGPRs if the
branch distance is found to be above a threshold. The distance threshold
is an approximation as the exact code size and branch distance are
unknown prior to register allocation.

Patch by Corbin Robeck. Thanks!

Differential Review: https://reviews.llvm.org/D149775
2023-06-29 16:50:46 -05:00
Krzysztof Drewniak
ab37937812 [AMDGPU] Use resource base for buffer instruction MachineMemOperands
1. Remove the existing code that would encode the constant offsets (if
there were any) on buffer intrinsic operations onto their
`MachineMemOperand`s. As far as I can tell, this use of `offset` has
no substantial impact on the generated code, especially since the same
reasoning is performed by areMemAccessesTriviallyDisjoint().

2. When a buffer resource intrinsic takes a pointer argument as the
base resource/descriptor, place that memory argument in the value
field of the MachineMemOperand attached to that intrinsic.

This is more conservative than what would be produced by more typical
LLVM code using GEP, as the Value (for alias analysis purposes)
corresponding to accessing buffer[0] and buffer[1] is the same.
However, the target-specific analysis of disjoint offsets covers a lot
of the simple usecases.

Despite this limitation, the new buffer intrinsics, combined with
LLVM's existing pointer annotations, allow for non-trivial
optimizations, as seen in the new tests, where marking two buffer
descriptors "noalias" allows merging together loads and stores in a
"load from A, modify loaded value, store to B" sequence, which would
not be possible previously.

Depends on D147547

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D148184
2023-06-05 17:06:57 +00:00
Krzysztof Drewniak
faa2c678aa [AMDGPU] Add buffer intrinsics that take resources as pointers
In order to enable the LLVM frontend to better analyze buffer
operations (and to potentially enable more precise analyses on the
backend), define versions of the raw and structured buffer intrinsics
that use `ptr addrspace(8)` instead of `<4 x i32>` to represent their
rsrc arguments.

The new intrinsics are named by replacing `buffer.` with `buffer.ptr`.

One advantage to these intrinsic definitions is that, instead of
specifying that a buffer load/store will read/write some memory, we
can indicate that the memory read or written will be based on the
pointer argument. This means that, for example, a read from a
`noalias` buffer can be pulled out of a loop that is modifying a
distinct buffer.

In the future, we will define custom PseudoSourceValues that will
allow us to package up the (buffer, index, offset) triples that buffer
intrinsics contain and allow for more precise backend analysis.

This work also enables creating address space 7, which represents
manipulation of raw buffers using native LLVM load and store
instructions.

Where tests simply used a buffer intrinsic while testing some other
code path (such as the tests for VGPR spills), they have been updated
to use the new intrinsic form. Tests that are "about" buffer
intrinsics (for instance, those that ensure that they codegen as
expected) have been duplicated, either within existing files or into
new ones.

Depends on D145441

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D147547
2023-06-05 16:59:07 +00:00
Tobias Hieta
f84bac329b
[NFC][Py Reformat] Reformat lit.local.cfg python files in llvm
This is a follow-up to b71edfaa4ec3c998aadb35255ce2f60bba2940b0
since I forgot the lit.local.cfg files in that one.

Reformatting is done with `black`.

If you end up having problems merging this commit because you
have made changes to a python file, the best way to handle that
is to run git checkout --ours <yourfile> and then reformat it
with black.

If you run into any problems, post to discourse about it and
we will try to help.

RFC Thread below:

https://discourse.llvm.org/t/rfc-document-and-standardize-python-code-style

Reviewed By: barannikov88, kwk

Differential Revision: https://reviews.llvm.org/D150762
2023-05-17 17:03:15 +02:00
Krzysztof Drewniak
f0415f2a45 Re-land "[AMDGPU] Define data layout entries for buffers"
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.

This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.

Differential Revision: https://reviews.llvm.org/D149776
2023-05-03 19:43:56 +00:00
Krzysztof Drewniak
3f2fbe92d0 Revert "[AMDGPU] Define data layout entries for buffers"
This reverts commit f9c1ede2543b37fabe9f2d8f8fed5073c475d850.

Differential Revision: https://reviews.llvm.org/D149758
2023-05-03 16:11:00 +00:00
Krzysztof Drewniak
f9c1ede254 [AMDGPU] Define data layout entries for buffers
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.

The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.

The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them, typed as ptr addrspace(8), instead of <4 x i32> as resource
arguments. These pointers are 128 bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
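
The sizes described above map onto LLVM data layout pointer entries of the form `p<n>:<size>:<abi>[:<pref>[:<idx>]]`. The sketch below parses two such entries; the strings `p7:160:256:256:32` and `p8:128:128` are written out from the description in this message rather than copied from the patch, and the parser itself is a hypothetical helper.

```python
from typing import NamedTuple


class PointerSpec(NamedTuple):
    addrspace: int
    size: int         # pointer width in bits
    abi_align: int    # ABI alignment in bits
    pref_align: int   # preferred alignment in bits
    index_width: int  # width of the index/offset used in address computations


def parse_pointer_spec(entry: str) -> PointerSpec:
    """Parse one 'p<n>:<size>:<abi>[:<pref>[:<idx>]]' data layout entry."""
    head, size, abi, *rest = entry.split(":")
    if not head.startswith("p"):
        raise ValueError(f"not a pointer entry: {entry!r}")
    addrspace = int(head[1:] or 0)
    size, abi = int(size), int(abi)
    pref = int(rest[0]) if len(rest) > 0 else abi   # pref defaults to abi
    idx = int(rest[1]) if len(rest) > 1 else size   # index defaults to size
    return PointerSpec(addrspace, size, abi, pref, idx)


fat = parse_pointer_spec("p7:160:256:256:32")   # descriptor plus offset
rsrc = parse_pointer_spec("p8:128:128")         # buffer resource
```

Here `fat.size` equals `rsrc.size + fat.index_width`, which reflects the "128-bit buffer descriptor and a 32-bit offset" composition of address space 7 pointers.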

Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.

This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.

Depends on D143437

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D145441
2023-05-03 15:25:58 +00:00
Matt Arsenault
7ac3ab34cb AMDGPU: Fix missing MIR serialization for PSInputAddr/PSInputEnable
Resuming any mir test for a pixel shader would assert in the AsmPrinter.
2023-04-08 07:05:35 -04:00
Nicolai Hähnle
10cef708a7 AMDGPU: Clean up LDS-related occupancy calculations
Occupancy is expressed as waves per SIMD. This means that we need to
take into account the number of SIMDs per "CU" or, to be more precise,
the number of SIMDs over which a workgroup may be distributed.

getOccupancyWithLocalMemSize was wrong because it didn't take SIMDs
into account at all.

At the same time, we need to take into account that WGP mode offers
access to a larger total amount of LDS, since this can affect how
non-power-of-two LDS allocations are rounded. To make this work
consistently, we distinguish between (available) local memory size and
addressable local memory size (which is always limited by 64kB on
gfx10+, even with WGP mode).

This change results in a massive amount of test churn. A lot of it is
caused by the fact that the default work group size is 1024, which means
that (due to rounding effects) the default occupancy on older hardware
is 8 instead of 10, which affects scheduling via register pressure
estimates. I've adjusted most tests by just running the UTC tools, but
in some cases I manually changed the work group size to 32 or 64 to make
sure that work group size chunkiness has no effect.
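
A simplified model of the calculation, expressed in waves per SIMD, might look like the sketch below. It is not the code in this patch: the function name is hypothetical, the defaults (64 kB of LDS, 4 SIMDs, 10 waves per SIMD, 64-wide waves) are assumed example figures, and LDS allocation rounding and WGP mode are ignored.

```python
def lds_limited_occupancy(lds_per_group, group_size, lds_size=64 * 1024,
                          wave_size=64, simds=4, max_waves_per_simd=10):
    """Waves per SIMD when whole work groups are packed onto one CU."""
    waves_per_group = -(-group_size // wave_size)              # ceiling division
    groups = (simds * max_waves_per_simd) // waves_per_group   # limit from wave slots
    if lds_per_group > 0:
        groups = min(groups, lds_size // lds_per_group)        # limit from LDS
    waves = groups * waves_per_group
    # A work group's waves are distributed over all SIMDs of the CU.
    return min(max_waves_per_simd, -(-waves // simds))


print(lds_limited_occupancy(0, 1024))          # 8
print(lds_limited_occupancy(0, 64))            # 10
print(lds_limited_occupancy(16 * 1024, 256))   # 4
```

The first two calls show the rounding effect mentioned above: a work group size of 1024 yields 8 rather than 10, while a size of 64 avoids the chunkiness.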

Differential Revision: https://reviews.llvm.org/D139468
2023-01-23 21:43:06 +01:00
Matt Arsenault
61be261549 MIR: Fix test error message 2022-12-22 08:58:56 -05:00
Matt Arsenault
20d72c4917 MIR: Don't assert if a virtual register uses a non-allocatable class 2022-12-22 08:18:07 -05:00
Christudasan Devadasan
a3028239a7 Revert "[AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs"
This reverts commit 40ba0942e2ab1107f83aa5a0ee5ae2980bf47b1a.
2022-12-21 16:17:42 +05:30
Nikita Popov
376ab5f413 [MIR] Convert some tests to opaque pointers (NFC) 2022-12-19 12:54:50 +01:00
Christudasan Devadasan
40ba0942e2 [AMDGPU][SILowerSGPRSpills] Spill SGPRs to virtual VGPRs
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions: SGPR spilling fails
when there are not enough VGPRs, and we are then
forced to spill the leftover into memory during PEI.
The custom spill handling during PEI has many edge
cases and breaks the compiler from time to time.

This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.

Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which is an unproblematic case.

This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D124196
2022-12-17 11:56:32 +05:30