589 Commits

Author SHA1 Message Date
Petar Avramovic
fef54d0393
AMDGPU/GlobalISel: Add skeletons for new register bank select passes (#112862)
New register bank select for AMDGPU will be split in two passes:
- AMDGPURegBankSelect: select banks based on machine uniformity analysis
- AMDGPURegBankLegalize: lower instructions that can't be inst-selected
  with register banks assigned by AMDGPURegBankSelect.
AMDGPURegBankLegalize is similar to the legalizer but has the context of
uniformity analysis. It does not change already assigned banks.
Main goal of AMDGPURegBankLegalize is to provide high level table-like
overview of how to lower generic instructions based on available target
features and uniformity info (uniform vs divergent).
See RegBankLegalizeRules.

Summary of new features:
At the moment register bank select assigns a register bank to the output
register using a simple algorithm:
- if one of the inputs is vgpr, the output is vgpr
- if all inputs are sgpr, the output is sgpr.
When a function does not contain divergent control flow, propagating
register banks like this works. In general, the first point is still correct
but the second is not when the function contains divergent control flow.
Examples:
- Phi with uniform inputs that go through divergent branch
- Instruction with temporal divergent use.
To fix this AMDGPURegBankSelect will use machine uniformity analysis
to assign vgpr to each divergent and sgpr to each uniform instruction.
But some instructions are only available on VALU (for example floating
point instructions before gfx1150) and we need to assign vgpr to them.
Since we are no longer propagating register banks we need to ensure that
uniform instructions get their inputs in sgpr in some way.
In AMDGPURegBankLegalize, uniform instructions that are only available on
VALU will have all operands reassigned to vgpr, and the vgpr output will
be copied to the original sgpr output with read-any-lane.
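
The difference between the two bank assignment schemes described above can be illustrated with a minimal Python sketch. This is not the pass implementation: the function names, the string bank labels, and the `valu_only` flag are hypothetical, and the divergence bit is assumed to come from a uniformity analysis.

```python
SGPR, VGPR = "sgpr", "vgpr"

def bank_by_propagation(input_banks):
    # Old scheme: the output is vgpr if any input is vgpr, otherwise sgpr.
    return VGPR if VGPR in input_banks else SGPR

def bank_by_uniformity(is_divergent):
    # New scheme: divergent results get vgpr, uniform results get sgpr.
    # `is_divergent` is assumed to be provided by a uniformity analysis.
    return VGPR if is_divergent else SGPR

def legalize_bank(assigned_bank, valu_only):
    # A uniform instruction that only exists on VALU executes on vgpr
    # operands; its result is then copied back to sgpr (read-any-lane).
    # Returns (bank used for execution, needs_read_any_lane_copy).
    if assigned_bank == SGPR and valu_only:
        return VGPR, True
    return assigned_bank, False

# A phi with uniform (sgpr) inputs reached through a divergent branch:
# propagation picks sgpr, while the uniformity-based scheme picks vgpr.
phi_inputs = [SGPR, SGPR]
print(bank_by_propagation(phi_inputs), bank_by_uniformity(True))
```

The phi case is the first example listed in the commit message: both inputs are uniform, so propagation alone cannot see that the result is divergent.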
2024-12-03 16:02:00 -05:00
Christudasan Devadasan
c5ab28a42d
[AMDGPU][NewPM] Port SIOptimizeVGPRLiveRange pass to NPM. (#117686) 2024-11-29 09:11:24 +05:30
Petar Avramovic
87503fa51c
Revert "AMDGPU/GlobalISel: Add stub custom regbankselect pass" (#113913)
This reverts commit e9c49901a43f5b16c3df416460b7e4dbdd24ce03.
Current AMDGPURegBankSelect does nothing different than RegBankSelect.
Revert to using generic RegBankSelect in preparation for adding new
regbankselect passes. New AMDGPURegBankSelect, that will use uniformity
analysis for regbank select decisions, will not subclass RegBankSelect.
Revert regression tests to use regbankselect since amdgpu-regbankselect
will be used by new pass and behavior will be different.
2024-11-27 13:16:22 -05:00
Jay Foad
89cb0eefcb
[AMDGPU] Move GCNPreRAOptimizations after MachineScheduler (#116211)
This is in preparation for adding a new optimization to the pass that
cares about the order of instructions. The existing optimization does
not care, so this just causes minor codegen differences.
2024-11-16 09:40:46 +00:00
Matin Raayai
bb3f5e1fed
Overhaul the TargetMachine and LLVMTargetMachine Classes (#111234)
Following discussions in #110443, and the following earlier discussions
in https://lists.llvm.org/pipermail/llvm-dev/2017-October/117907.html,
https://reviews.llvm.org/D38482, https://reviews.llvm.org/D38489, this
PR attempts to overhaul the `TargetMachine` and `LLVMTargetMachine`
interface classes. More specifically:
1. Makes `TargetMachine` the only class implemented under
`TargetMachine.h` in the `Target` library.
2. `TargetMachine` contains target-specific interface functions that
relate to IR/CodeGen/MC constructs, whereas before (at least on paper)
it was supposed to have only IR/MC constructs. Any Target that doesn't
want to use the independent code generator simply does not implement
them, and returns either `false` or `nullptr`.
3. Renames `LLVMTargetMachine` to `CodeGenCommonTMImpl`. This renaming
aims to make the purpose of `LLVMTargetMachine` clearer. Its interface
was moved under the CodeGen library, to further emphasize its usage in
Targets that use CodeGen directly.
4. Makes `TargetMachine` the only interface used across LLVM and its
projects. With these changes, `CodeGenCommonTMImpl` is simply a set of
shared function implementations of `TargetMachine`, and CodeGen users
don't need to static cast to `LLVMTargetMachine` every time they need a
CodeGen-specific feature of the `TargetMachine`.
5. More importantly, does not change any requirements regarding library
linking.

cc @arsenm @aeubanks
2024-11-14 13:30:05 -08:00
Kazu Hirata
be187369a0
[AMDGPU] Remove unused includes (NFC) (#116154)
Identified with misc-include-cleaner.
2024-11-13 21:10:03 -08:00
Jay Foad
2560505203
[AMDGPU] Reorder GCNPassConfig::addOptimizedRegAlloc. NFC. (#115873)
This just makes it so that the added passes are mentioned in this
function in the same order that they will appear in the final pass
pipeline.
2024-11-13 14:38:23 +00:00
Akshat Oke
3495d04560
[AMDGPU][MIR] Serialize SpillPhysVGPRs (#113129) 2024-11-05 13:17:25 +05:30
Shilei Tian
390300d9f4
[PassBuilder] Add ThinOrFullLTOPhase to optimizer pipeline (#114577) 2024-11-03 23:25:29 -05:00
Shilei Tian
dc45ff1d2a
[PassBuilder] Add ThinOrFullLTOPhase to early simplication EP call backs (#114547)
The early simplification pipeline is used in the non-LTO and (Thin/Full)LTO
pre-link stages. There are some passes that we want to run in non-LTO mode,
but not at the LTO pre-link stage. The control for this is currently missing.
This PR adds the support. To demonstrate the use, we only enable the
internalization pass in non-LTO mode for AMDGPU because having it run in the
pre-link stage causes some issues.
2024-11-03 23:24:10 -05:00
Shilei Tian
10a1ea9b53
[NFC][AMDGPU] Remove the empty FPM as well as the adaptor to MPM (#114558) 2024-11-01 12:21:26 -04:00
Akshat Oke
ca32bd643b
[NewPM][AMDGPU] Port SIPreAllocateWWMRegs to NPM (#109939) 2024-10-22 15:37:08 +05:30
Akshat Oke
6360652e9f
Reland [AMDGPU] Serialize WWM_REG vreg flag (#110229) (#112492)
A reland but not an exact copy as `VRegInfo.Flags` from the parser is
now an int8 instead of a vector; so only need to copy over the value.
2024-10-21 13:44:09 +05:30
Christudasan Devadasan
72a7b471de
[AMDGPU][NewPM] Fill out addILPOpts. (#108514) 2024-10-16 13:30:46 +05:30
Christudasan Devadasan
488d3924dd
[CodeGen][NewPM] Port EarlyIfConversion pass to NPM. (#108508) 2024-10-16 13:22:57 +05:30
Peter Collingbourne
3cab8827fd
Revert "[AMDGPU] Serialize WWM_REG vreg flag (#110229)"
This reverts commit bec839d8eed9dd13fa7eaffd50b28f8f913de2e2.

Caused buildbot failures, e.g.
https://lab.llvm.org/buildbot/#/builders/52/builds/2928
2024-10-15 13:18:43 -07:00
Akshat Oke
bec839d8ee
[AMDGPU] Serialize WWM_REG vreg flag (#110229) 2024-10-14 14:37:21 +05:30
Akshat Oke
039e6f879c
[AMDGPU][NewPM] Fill out AMDGPU addMachineSSAOptimizations (#111658)
Implement the addMachineSSAOptimizations passes for AMDGPU. Porting
the other generic passes in this category is WIP.
2024-10-10 15:35:11 +05:30
Jay Foad
8d13e7b8c3
[AMDGPU] Qualify auto. NFC. (#110878)
Generated automatically with:
$ clang-tidy -fix -checks=-*,llvm-qualified-auto $(find
lib/Target/AMDGPU/ -type f)
2024-10-03 13:07:54 +01:00
vikashgu
870bdc6ea7
Reapply "[AMDGPU]Optimize SGPR spills (#93668)"
This reverts commit c2fc7f75f67039bb1ed577bc0edbd699a850cd9d, as the
dependent patch that splits the vgpr regalloc pipeline solved the issue (#96353).
2024-10-03 09:47:15 +00:00
Christudasan Devadasan
ac0f64f06d
[AMDGPU] Split vgpr regalloc pipeline (#93526)
Allocating wwm-registers and per-thread VGPR operands
together imposes many challenges in the way the
registers are reused during allocation. There are
times when regalloc reuses the registers of regular
VGPR operations for wwm-operations in a small range,
leading to unwanted clobbering of their inactive lanes
and causing correctness issues that are hard to trace.

This patch splits the VGPR allocation pipeline further
to allocate wwm-registers first and the regular VGPR
operands in a separate pipeline. The splitting would
ensure that the physical registers used for wwm
allocations won't take part in the next allocation
pipeline to avoid any such clobbering.
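
The idea of the split can be shown with a toy Python sketch. It is only an illustration under strong assumptions: live ranges are ignored, regular virtual registers simply cycle through the remaining pool, and all names (`allocate_split`, the register labels) are hypothetical rather than taken from the patch.

```python
def allocate_split(wwm_vregs, regular_vregs, phys_regs):
    """Toy two-phase assignment: wwm virtual registers first, then the rest."""
    if len(wwm_vregs) >= len(phys_regs):
        raise ValueError("not enough physical registers for this toy model")

    # Phase 1: give every wwm virtual register its own physical register.
    assignment = dict(zip(wwm_vregs, phys_regs))
    reserved = set(assignment.values())

    # Phase 2: regular VGPR operands only see registers not used in phase 1,
    # so they can never clobber the inactive lanes of a wwm register.
    regular_pool = [r for r in phys_regs if r not in reserved]
    for i, vreg in enumerate(regular_vregs):
        # Round-robin reuse stands in for real live-range based reuse.
        assignment[vreg] = regular_pool[i % len(regular_pool)]
    return assignment, reserved

assignment, reserved = allocate_split(
    ["wwm0"], ["v0", "v1", "v2"], ["vgpr0", "vgpr1", "vgpr2"]
)
print(assignment, reserved)
```

Registers may be reused freely within the second phase, but never across the two phases, which is the property the patch relies on.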
2024-09-30 19:55:42 +05:30
Matt Arsenault
a87640c97e
AMDGPU: Fix assertion on load of vector of pointers (#110436)
Fix InferAddressSpaces asserting on a load of a vector of flat
pointers.

Fixes #110433
2024-09-30 10:16:38 +04:00
Scott Egerton
396f677514
[AMDGPU] Remove unused VGPRSingleUseHintInsts feature (#109769) 2024-09-24 10:58:00 +01:00
Akshat Oke
0b0874755d
[AMDGPU][NewPM] Port SILowerSGPRSpills to NPM (#108934) 2024-09-21 09:59:36 +05:30
Akshat Oke
d2d78e584b
[NewPM][CodeGen] Port MachineLICM to NPM (#107376) 2024-09-20 11:34:18 +05:30
Jay Foad
e03f427196
[LLVM] Use {} instead of std::nullopt to initialize empty ArrayRef (#109133)
It is almost always simpler to use {} instead of std::nullopt to
initialize an empty ArrayRef. This patch changes all occurrences I could
find in LLVM itself. In future the ArrayRef(std::nullopt_t) constructor
could be deprecated or removed.
2024-09-19 16:16:38 +01:00
Diana Picus
3356208531
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108512)
This reverts commit
7792b4ae79.

The problem was a conflict with
e55d6f5ea2
"[AMDGPU] Simplify and improve codegen for llvm.amdgcn.set.inactive
(https://github.com/llvm/llvm-project/pull/107889)"
which changed the syntax of V_SET_INACTIVE (and thus made my MIR test
crash).

...if only we had a merge queue.
2024-09-13 11:54:30 +02:00
Diana Picus
7792b4ae79
Revert "Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)"" (#108341)
Reverts llvm/llvm-project#108173

si-init-whole-wave.mir crashes on some buildbots (although it passed
both locally with sanitizers enabled and in pre-merge tests).
Investigating.
2024-09-12 10:12:09 +02:00
Diana Picus
703ebca869
Reland "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)" (#108173)
This reverts commit
c7a7767fca.

The buildbots failed because I removed a MI from its parent before
updating LIS. This PR should fix that.
2024-09-12 09:11:41 +02:00
Akshat Oke
e1ee07d0ff
[AMDGPU][NewPM] Port SIPeepholeSDWA pass to NPM (#107049) 2024-09-11 14:30:16 +04:00
Vitaly Buka
c7a7767fca
Revert "[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic" (#108054)
Breaks bots, see #105822.

Reverts llvm/llvm-project#105822
2024-09-10 09:51:43 -07:00
Diana Picus
44556e64f2
[amdgpu] Add llvm.amdgcn.init.whole.wave intrinsic (#105822)
This intrinsic is meant to be used in functions that have a "tail" that
needs to be run with all the lanes enabled. The "tail" may contain
complex control flow that makes it unsuitable for the use of the
existing WWM intrinsics. Instead, we will pretend that the function
starts with all the lanes enabled, then branches into the actual body of
the function for the lanes that were meant to run it, and then finally
all the lanes will rejoin and run the tail.

As such, the intrinsic will return the EXEC mask for the body of the
function, and is meant to be used only as part of a very limited pattern
(for now only in amdgpu_cs_chain functions):

```
entry:
  %func_exec = call i1 @llvm.amdgcn.init.whole.wave()
  br i1 %func_exec, label %func, label %tail

func:
  ; ... stuff that should run with the actual EXEC mask
  br label %tail

tail:
  ; ... stuff that runs with all the lanes enabled;
  ; can contain more than one basic block
```

It's an error to use the result of this intrinsic for anything
other than a branch (but unfortunately checking that in the verifier is
non-trivial because SIAnnotateControlFlow will introduce an amdgcn.if
between the intrinsic and the branch).

The intrinsic is lowered to a SI_INIT_WHOLE_WAVE pseudo, which for now
is expanded in si-wqm (which is where SI_INIT_EXEC is handled too);
however the information that the function was conceptually started in
whole wave mode is stored in the machine function info
(hasInitWholeWave). This will be useful in prolog epilog insertion,
where we can skip saving the inactive lanes for CSRs (since if the
function started with all the lanes active, then there are no inactive
lanes to preserve).
2024-09-10 13:24:53 +02:00
Christudasan Devadasan
6c143a86cd
[CodeGen][NewPM] Port MachineCSE pass to new pass manager. (#106605) 2024-09-04 18:54:07 +05:30
Christudasan Devadasan
042104985c
[AMDGPU][NewPM] Port SIShrinkInstructions to new pass manager. (#106967) 2024-09-03 10:52:50 +05:30
Akshat Oke
da13754103
AMDGPU/NewPM Port SILoadStoreOptimizer to NPM (#106362) 2024-09-02 11:41:56 +05:30
Shilei Tian
84ed3c29e8
Revert "[AMDGPU][LTO] Assume closed world after linking (#105845)" (#106889)
We can't assume closed world even in the full LTO post-link stage. It is
only true if we are building a "GPU executable". However, AMDGPU does
support "dynamic library". I'm not aware of any approach to tell if it is
a relocatable link when we create the pass. For now let's revert the patch
as it is currently breaking things.
We can re-enable it once we can handle it correctly.
2024-09-01 09:32:08 -04:00
Akshat Oke
fdca2c33a1
AMDGPU/NewPM Port GCNDPPCombine to NPM (#105816)
Co-authored-by: Akshat Oke <Akshat.Oke@amd.com>
2024-08-29 14:49:52 +05:30
Akshat Oke
2adc94cd6c
AMDGPU/NewPM: Port SIFoldOperands to new pass manager (#105801) 2024-08-29 11:34:54 +05:30
Chaitanya
1f02be2e17
[AMDGPU] Enable "amdgpu-sw-lower-lds" pass in pipeline. (#89206)
This PR enables "amdgpu-sw-lower-lds" pass in the pipeline.
Also introduces "amdgpu-enable-sw-lower-lds" cmd line flag to
enable/disable the pass.
2024-08-26 14:21:19 +05:30
Chaitanya
7bc9d95b7e
[AMDGPU] Introduce "amdgpu-sw-lower-lds" pass to lower LDS accesses. (#87265)
This PR introduces new pass "amdgpu-sw-lower-lds". 

This pass lowers the local data store, LDS, uses in kernel and
non-kernel functions in module to use dynamically allocated global
memory. Packed LDS Layout is emulated in the global memory.
The lowered memory instructions from LDS to global memory are then
instrumented for address sanitizer, to catch addressing errors.
This pass only works when address sanitizer has been enabled and has
instrumented the IR. It detects that the IR has been instrumented by
checking the "nosanitize_address" module flag.

For a kernel, LDS access can be static or dynamic which are direct
(accessed within kernel) and indirect (accessed through non-kernels).

**Replacement of Kernel LDS accesses:** 
- All the LDS accesses corresponding to the kernel will be packed together,
where all static LDS accesses will be allocated first and then dynamic
LDS follows. The total size with alignment is calculated. A new LDS
global will be created for the kernel called "SW LDS" and it will have
the attribute "amdgpu-lds-size" attached with the value of the size
calculated. All the LDS accesses in the module will be replaced by GEP
with offset into the "SW LDS".
- A new "llvm.amdgcn.<kernel>.dynlds" is created per kernel accessing
the dynamic LDS. This will be marked used by the kernel and will have
MD_absolute_symbol metadata set to the total static LDS size, since dynamic
LDS allocation starts after all static LDS allocation.

- Device global memory equal to the total LDS size will be allocated.
At the prologue of the kernel, a single work-item from the work-group
does a "malloc" and stores the pointer to the allocation in "SW LDS". To
store the offsets corresponding to all LDS accesses, another global
variable is created which will be called "SW LDS metadata" in this pass.

- **SW LDS:** 
It is LDS global of ptr type with name
"llvm.amdgcn.sw.lds.<kernel-name>".

- **SW LDS Metadata:** 
It is of struct type, with n members. n equals the number of LDS globals
accessed by the kernel(direct and indirect). Each member of struct is
another struct of type {i32, i32, i32}. First member corresponds to
offset, second member corresponds to size of LDS global being replaced
and third represents the total aligned size. It will have name
"llvm.amdgcn.sw.lds.<kernel-name>.md". This global will have an
initializer with static LDS related offsets and sizes initialized. But
for dynamic LDS related entries, offsets will be initialized to the previous
static LDS allocation end offset. Sizes for them will be zero initially.
These dynamic LDS offset and size values will be updated within the
kernel, since the kernel can read the dynamic LDS size allocation done at
runtime with a query to the "hidden_dynamic_lds_size" hidden kernel argument.

- At the epilogue of the kernel, the allocated memory is freed by the
same single work-item.
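
The packed layout and the "SW LDS metadata" entries described above can be sketched as follows. This is a simplified model, not the pass itself: the function names are hypothetical, a single common alignment is assumed for every LDS global, and any extra padding the real pass may add is ignored.

```python
def align_to(value, alignment):
    # Round value up to the next multiple of alignment.
    return (value + alignment - 1) // alignment * alignment

def build_sw_lds_metadata(static_globals, dynamic_globals, alignment=8):
    """static_globals: list of (name, size in bytes); dynamic_globals: names.

    Returns (metadata, total_static_size), where metadata maps each LDS
    global to an (offset, size, aligned_size) triple, mirroring the
    {i32, i32, i32} members described in the commit message.
    """
    metadata = {}
    offset = 0
    # Static LDS is packed first.
    for name, size in static_globals:
        aligned_size = align_to(size, alignment)
        metadata[name] = (offset, size, aligned_size)
        offset += aligned_size
    # Dynamic LDS starts at the end of static LDS; its sizes are zero in the
    # initializer and would be filled in by the kernel at runtime.
    for name in dynamic_globals:
        metadata[name] = (offset, 0, 0)
    return metadata, offset

md, static_size = build_sw_lds_metadata([("a", 4), ("b", 10)], ["dyn"])
print(md, static_size)
```

In this model the returned total static size plays the role of the value stored in the absolute-symbol metadata of the per-kernel dynlds variable.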

**Replacement of non-kernel LDS accesses:** 
- Multiple kernels can access the same non-kernel function. All the
kernels accessing LDS through non-kernels are sorted and assigned a
kernel-id. All the LDS globals accessed by non-kernels are sorted.

- This information is used to build two tables: 
- **Base table:** 
Base table will have single row, with elements of the row placed as per
kernel ID. Each element in the row corresponds to ptr of "SW LDS"
variable created for that kernel.

- **Offset table:** 
Offset table will have multiple rows and columns. Rows are assumed to be
from 0 to (n-1). n is total number of kernels accessing the LDS through
non-kernels. Each row will have m elements. m is the total number of
unique LDS globals accessed by all non-kernels. Each element in the row
corresponds to the ptr of the replacement of the LDS global done by that
particular kernel.

- An LDS variable in a non-kernel will be replaced based on the information
from base and offset tables. Based on kernel-id query, ptr of "SW LDS"
for that corresponding kernel is obtained from base table. The Offset
into the base "SW LDS" is obtained from corresponding element in offset
table. With this information, replacement value is obtained.
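
The base and offset tables can be modelled with a short Python sketch. It is an approximation under stated assumptions: table elements are plain names and integer offsets rather than pointers, a kernel that does not access a given LDS global gets `None` in that column, and the helper names (`build_tables`, `resolve`) are hypothetical.

```python
def build_tables(kernel_offsets):
    """kernel_offsets: {kernel name: {LDS global name: offset into SW LDS}}."""
    # Kernels are sorted and the position in that order is the kernel-id.
    kernels = sorted(kernel_offsets)
    # All LDS globals accessed through non-kernels are sorted as well.
    lds_globals = sorted({g for offs in kernel_offsets.values() for g in offs})
    # Base table: a single row with one "SW LDS" entry per kernel-id.
    base_table = [f"llvm.amdgcn.sw.lds.{k}" for k in kernels]
    # Offset table: one row per kernel, one column per unique LDS global.
    offset_table = [
        [kernel_offsets[k].get(g) for g in lds_globals] for k in kernels
    ]
    return kernels, lds_globals, base_table, offset_table

def resolve(kernel_id, lds_global, lds_globals, base_table, offset_table):
    # Replacement for an LDS global inside a non-kernel function:
    # the kernel's "SW LDS" base plus that kernel's offset for the global.
    column = lds_globals.index(lds_global)
    return base_table[kernel_id], offset_table[kernel_id][column]

kernels, lds_globals, base, offsets = build_tables(
    {"k1": {"x": 0, "y": 8}, "k0": {"y": 0}}
)
print(resolve(1, "y", lds_globals, base, offsets))
```

The kernel-id is looked up at runtime inside the non-kernel function, so the same function body works for every kernel that calls it.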
2024-08-26 08:59:26 +05:30
Anshil Gandhi
033e225d90
Revert "Revert "[AMDGPU][LTO] Assume closed world after linking (#105845)" (#106000)" (#106001)
This reverts commit 4b6c064dd124c70ff163411dff120c6174e0e022.

Add a requirement for an amdgpu target in the test.
2024-08-25 17:23:36 -04:00
Anshil Gandhi
4b6c064dd1
Revert "[AMDGPU][LTO] Assume closed world after linking (#105845)" (#106000)
This reverts commit 33f3ebc86e7d3afcb65c551feba5bbc2421b42ed.
2024-08-25 14:56:39 -04:00
Anshil Gandhi
33f3ebc86e
[AMDGPU][LTO] Assume closed world after linking (#105845) 2024-08-25 14:06:29 -04:00
Juan Manuel Martinez Caamaño
5def27c72c
[AMDGPU] Remove "amdgpu-enable-structurizer-workarounds" flag (#105819) 2024-08-23 15:04:03 +02:00
Juan Manuel Martinez Caamaño
2b4b909509
[AMDGPU] Remove unused amdgpu-disable-structurizer flag (#105800) 2024-08-23 14:14:17 +02:00
Juan Manuel Martinez Caamaño
cbf34a5f77
[AMDGPU] Remove dead pass: AMDGPUMachineCFGStructurizer (#105645) 2024-08-23 14:06:17 +02:00
Matt Arsenault
dd90c72b05
AMDGPU: Temporarily stop adding AtomicExpand to new PM passes
This breaks using -passes=atomic-expand (but only sometimes?).
Somehow an AtomicExpand pass ends up running without a TargetMachine,
despite always being constructed with one.
2024-08-21 00:19:37 +04:00
Matt Arsenault
33e18b2b43
AMDGPU/NewPM: Start filling out addIRPasses (#102884)
This is not complete, but gets AtomicExpand running. I was able
to get further than I expected; we're quite close to having all
the IR codegen passes ported.
2024-08-20 23:38:05 +04:00
Matt Arsenault
afeef4dbc3
AMDGPU/NewPM: Fill out passes in addCodeGenPrepare (#102867)
AMDGPUAnnotateKernelFeatures hasn't been ported yet, but it
should be soon removable.
2024-08-20 23:35:01 +04:00
Matt Arsenault
7022498ac2
AMDGPU/NewPM: Start implementing addCodeGenPrepare (#102816) 2024-08-20 00:10:45 +04:00