501 Commits

Author SHA1 Message Date
paperchalice
a2dfc9ac7d
[NewPM][AMDGPU] Add AMDGPUPassRegistry.def (#86095)
Move the pass registry to a separate file, prepare for porting dag-isel.
2024-03-22 08:49:29 +08:00
Pierre van Houtryve
95a834a16c
(Reland) [AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#85626)
Reland of #75333
2024-03-21 11:44:47 +01:00
pvanhout
3493438605 Revert "[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333)"
This reverts commit 9b98692eedb78aa106539c36ba02944f32cae1ff.
2024-03-18 11:18:57 +01:00
Pierre van Houtryve
9b98692eed
[AMDGPU] Run LowerLDS at the end of the fullLTO pipeline (#75333)
This change allows us to use `--lto-partitions` in some cases (not at
all guaranteed it works perfectly), as LDS is lowered before the module
is split for parallel codegen.

We must run LowerLDS before splitting modules as it needs to see all
callers of functions with LDS to properly lower them.
2024-03-18 09:09:43 +01:00
Krzysztof Drewniak
6540f1635a
[AMDGPU] Add IR-level pass to rewrite away address space 7 (#77952)
This commit adds the -lower-buffer-fat-pointers pass, which is
applicable to all AMDGCN compilations.

The purpose of this pass is to remove the type `ptr addrspace(7)` from
incoming IR. This must be done at the LLVM IR level because `ptr
addrspace(7)`, as a 160-bit primitive type, cannot be correctly handled
by SelectionDAG.

The detailed operation of the pass is described in comments, but, in
summary, the removal proceeds by:
1. Rewriting loads and stores of ptr addrspace(7) to loads and stores of
i160 (including vectors and aggregates). This is needed because the
in-register representation of these pointers will stop matching their
in-memory representation in step 2, and so ptrtoint/inttoptr operations
are used to preserve the expected memory layout

2. Mutating the IR to replace all occurrences of `ptr addrspace(7)` with
the type `{ptr addrspace(8), ptr addrspace(6) }`, which makes the two
parts of a buffer fat pointer (the 128-bit address space 8 resource and
the 32-bit address space 6 offset) visible in the IR. This also impacts
the argument and return types of functions.

3. *Splitting* the resource and offset parts. All instructions that
produce or consume buffer fat pointers (like GEP or load) are rewritten
to produce or consume the resource and offset parts separately. For
example, GEP updates the offset part of the result and a load uses the
resource and offset parts to populate the relevant
llvm.amdgcn.raw.ptr.buffer.load intrinsic call.

At the end of this process, the original mutated instructions are
replaced by their new split counterparts, ensuring no invalidly-typed IR
escapes this pass. (For operations like call, where the struct form is
needed, insertelement operations are inserted).

Compared to LGC's PatchBufferOp (

32cda89776/lgc/patch/PatchBufferOp.cpp
): this pass
- Also handles vectors of ptr addrspace(7)s
- Also handles function boundaries
- Includes the same uniform buffer optimization for loops and
conditionals
- Does *not* handle memcpy() and friends (this is future work)
- Does *not* break up large loads and stores into smaller parts. This
should be handled by extending the legalization
of *.buffer.{load,store} to handle larger types by producing multiple
instructions (the same way ordinary LOAD and STORE are legalized). That
work is planned for a followup commit.
- Does *not* have special logic for handling divergent buffer
descriptors. The logic in LGC is, as far as I can tell, incorrect in
general, and, per discussions with @nhaehnle, isn't widely used.
Therefore, divergent descriptors are handled with waterfall loops later
in legalization.

As a final matter, this commit updates atomic expansion to treat buffer
operations analogously to global ones.

(One question for reviewers: is the new pass is the right place? Should
it be later in the pipeline?)

Differential Revision: https://reviews.llvm.org/D158463
2024-03-06 09:49:58 -06:00
Sameer Sahasrabuddhe
60822637bf Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
This restores commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.
Previously reverted in f010b1bef4dda2c7082cbb41dbabf1f149cce306.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-06 12:19:32 +05:30
Mitch Phillips
f010b1bef4 Revert "Restore "Implement convergence control in MIR using SelectionDAG (#71785)""
This reverts commit c7fdd8c11e54585dc9d15d63de9742067e0506b9.

Reason: Broke the sanitizer buildbots. See the comments at
https://github.com/llvm/llvm-project/pull/71785
for more information.
2024-03-04 17:05:34 +01:00
Sameer Sahasrabuddhe
c7fdd8c11e Restore "Implement convergence control in MIR using SelectionDAG (#71785)"
Original commit 79889734b940356ab3381423c93ae06f22e772c9.
Perviously reverted in commit a2afcd5721869d1d03c8146bae3885b3385ba15e.

LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-03-04 13:28:04 +05:30
Jeffrey Byrnes
cf1c97b2d2
[AMDGPU] Do not attempt to fallback to default mutations (#83208)
IGLP itself will be in SavedMutations via mutations added during
Scheduler creation, thus falling back results in reapplying IGLP.

In PostRA scheduling, if we have multiple regions with IGLP
instructions, then we may have infinite loop.

Disable the feature for now.
2024-02-27 18:04:59 -08:00
Rishabh Bali
fe42e72db2
[CodeGen] Port AtomicExpand to new Pass Manager (#71220)
Port the `atomicexpand` pass to the new Pass Manager. 
Fixes #64559
2024-02-25 18:42:22 +05:30
Jeffrey Byrnes
8f2bd8ae68
[AMDGPU] Introduce iglp_opt(2): Generalized exp/mfma interleaving for select kernels (#81342)
This implements the basic pipelining structure of exp/mfma interleaving
for better extensibility. While it does have improved extensibility,
there are controls which only enable it for DAGs with certain
characteristics (matching the DAGs it has been designed against).
2024-02-23 17:13:20 -08:00
Sameer Sahasrabuddhe
a2afcd5721 Revert "Implement convergence control in MIR using SelectionDAG (#71785)"
This reverts commit 79889734b940356ab3381423c93ae06f22e772c9.

Encountered multiple buildbot failures.
2024-02-21 11:07:02 +05:30
Sameer Sahasrabuddhe
79889734b9
Implement convergence control in MIR using SelectionDAG (#71785)
LLVM function calls carry convergence control tokens as operand bundles, where
the tokens themselves are produced by convergence control intrinsics. This patch
implements convergence control tokens in MIR as follows:

1. Introduce target-independent ISD opcodes and MIR opcodes for convergence
   control intrinsics.
2. Model token values as untyped virtual registers in MIR.

The change also introduces an additional ISD opcode CONVERGENCECTRL_GLUE and a
corresponding machine opcode with the same spelling. This glues the convergence
control token to SDNodes that represent calls to intrinsics. The glued token is
later translated to an implicit argument in the MIR.

The lowering of calls to user-defined functions is target-specific. On AMDGPU,
the convergence control operand bundle at a non-intrinsic call is translated to
an explicit argument to the SI_CALL_ISEL instruction. Post-selection adjustment
converts this explicit argument to an implicit argument on the SI_CALL
instruction.
2024-02-21 10:06:37 +05:30
Matin Raayai
87e04b471e
Fix Passing TargetOptions by Value in TargetMachines for AMDGPU (#79866)
`TargetOptions` is currently passed by value in AMDGPU targets, which
makes unnecessary copies. This PR fixes this issue.
2024-02-01 09:50:44 -08:00
Mirko Brkušanin
1d286ad59b
[AMDGPU] Add mark last scratch load pass (#75512) 2024-01-18 09:36:44 +01:00
paperchalice
ffb1f20e0d
[CodeGen] Add flag to populate target pass names (#76328)
`print-pipeline-passes` can show target pass names.
2024-01-03 09:07:02 +08:00
Mariusz Sikora
a018c8cdbb
GFX12: Add LoopDataPrefetchPass (#75625)
It is currently disabled by default. It will need experiments on a real
HW to tune and decide on the profitability.

---------

Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-19 08:32:16 +01:00
Jessica Del
32f9983c06
[AMDGPU] - Add address space for strided buffers (#74471)
This is an experimental address space for strided buffers. These buffers
can have structs as elements and
a stride > 1.
These pointers allow the indexed access in units of stride, i.e., they
point at `buffer[index * stride]`.
Thus, we can use the `idxen` modifier for buffer loads.

We assign address space 9 to 192-bit buffer pointers which contain a
128-bit descriptor, a 32-bit offset and a 32-bit index. Essentially,
they are fat buffer pointers with an additional 32-bit index.
2023-12-15 15:49:25 +01:00
Valery Pykhtin
dd051295bc
[AMDGPU] Enable GCNRewritePartialRegUses pass by default. (#72975)
Let's try once again after #69957 has landed.
2023-12-14 14:10:27 +01:00
Petar Avramovic
6892c175c5
AMDGPU/GlobalISel: add AMDGPUGlobalISelDivergenceLowering pass (#75340)
Add empty AMDGPUGlobalISelDivergenceLowering pass. This pass will
implement
- selection of divergent i1 phis as lane mask phis, requires lane mask
merging in some cases
- lower uses of divergent i1 values outside of the cycle using lane mask
merging
- lowering of all cases of temporal divergence:
- lower uses of uniform i1 values outside of the cycle using lane mask
merging
- lower uses of uniform non-i1 values outside of the cycle using a copy
to vgpr inside of the cycle

Add very detailed set of regression tests for cases mentioned above.

patch 1 from: https://github.com/llvm/llvm-project/pull/73337
2023-12-13 16:42:56 +01:00
Piotr Sobczak
fac093dd08
[AMDGPU] Update IEEE and DX10_CLAMP for GFX12 (#75030)
Co-authored-by: Stanislav Mekhanoshin <Stanislav.Mekhanoshin@amd.com>
2023-12-13 13:52:40 +01:00
Kazu Hirata
586ecdf205
[llvm] Use StringRef::{starts,ends}_with (NFC) (#74956)
This patch replaces uses of StringRef::{starts,ends}with with
StringRef::{starts,ends}_with for consistency with
std::{string,string_view}::{starts,ends}_with in C++20.

I'm planning to deprecate and eventually remove
StringRef::{starts,ends}with.
2023-12-11 21:01:36 -08:00
Jay Foad
28233b11ac
[AMDGPU] New AMDGPUInsertSingleUseVDST pass (#72388)
Add support for emitting GFX11.5 s_singleuse_vdst instructions. This is
a power saving feature whereby the compiler can annotate VALU
instructions whose results are known to have only a single use, so the
hardware can in some cases avoid writing the result back to VGPR RAM.

To begin with the pass is disabled by default because of one missing
feature: we need an exclusion list of opcodes that never qualify as
single-use producers and/or consumers. A future patch will implement
this and enable the pass by default.

---------

Co-authored-by: Scott Egerton <scott.egerton@amd.com>
2023-11-24 10:23:06 +00:00
Carl Ritson
af6ff98c53
[AMDGPU] Move WWM register pre-allocation to during regalloc (#70618)
Move SIPreAllocateWWMRegs pass to just before VGPR allocation. This
saves recomputation of the virtual matrix and live reg map, with the
slight regression in O0 that live intervals and slot indexes must be
computed.
2023-11-08 11:54:28 +09:00
Matt Arsenault
d34a10a47d
AMDGPU: Port AMDGPUAttributor to new pass manager (#71349) 2023-11-07 15:40:40 +09:00
Valery Pykhtin
e808f8a616
[AMDGPU] GCNRegPressurePrinter pass to print GCNRegPressure values for testing. (#70031)
Using GCNDownwardRPTracker or GCNUpwardRPTracker the pass collects register pressure values for a function and prints these values next to instructions. Output can be used to generate Filecheck rules in mir tests.
2023-11-01 23:01:39 +01:00
Alex Voicu
0ce6255a50 [HIP][LLVM][Opt] Add LLVM support for hipstdpar
This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:

1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:

    - Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
    - Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
    - After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
    - All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;

2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
    - Iterate all functions in a Module;
    - If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
        - If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
    - We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.

Reviewed by: JonChesterfield, yaxunl

Differential Revision: https://reviews.llvm.org/D155856
2023-10-12 11:26:48 +01:00
Alex Voicu
25935c384d Revert "[HIP][LLVM][Opt] Add LLVM support for hipstdpar"
This reverts commit c5bba7ea5a05f540948f76a189c880eb24a5e8c6.
2023-10-11 12:27:03 +01:00
Alex Voicu
c5bba7ea5a [HIP][LLVM][Opt] Add LLVM support for hipstdpar
This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:

1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:

    - Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
    - Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
    - After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
    - All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;

2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
    - Iterate all functions in a Module;
    - If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
        - If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
    - We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.

Reviewed by: JonChesterfield, yaxunl

Differential Revision: https://reviews.llvm.org/D155856
2023-10-11 12:22:00 +01:00
Alex Voicu
98eda5dda7 Revert "[HIP][LLVM][Opt] Add LLVM support for hipstdpar" in order to address build breakage.
This reverts commit 9b98ebb0eb43b005921926a622177f10e13b1ac6.
2023-10-10 12:16:10 +01:00
Alex Voicu
9b98ebb0eb [HIP][LLVM][Opt] Add LLVM support for hipstdpar
This patch adds the LLVM changes needed for enabling HIP parallel algorithm offload on AMDGPU targets. What we do here is add two passes, one mandatory and one optional:

1. HipStdParAcceleratorCodeSelectionPass is mandatory, depends on CallGraphAnalysis, and implements the following transform:

    - Traverse the call-graph, and check for functions that are roots for accelerator execution (at the moment, these are GPU kernels exclusively, and would originate in the accelerator specific algorithm library the toolchain uses as an implementation detail);
    - Starting from a root, do a BFS to find all functions that are reachable (called directly or indirectly via a call- chain) and record them;
    - After having done the above for all roots in the Module, we have the computed the set of reachable functions, which is the union of roots and functions reachable from roots;
    - All functions that are not in the reachable set are removed; for the special case where the reachable set is empty we completely clear the module;

2. HipStdParAllocationInterpositionPass is optional, is meant as a fallback with restricted functionality for cases where on-demand paging is unavailable on a platform, and implements the following transform:
    - Iterate all functions in a Module;
    - If a function's name is in a predefined set of allocation / deallocation that the runtime implementation is allowed and expected to interpose, replace all its uses with the equivalent accelerator aware function, iff the latter is available;
        - If the accelerator aware equivalent is unavailable we warn, but compilation will go ahead, which means that it is possible to get issues around the accelerator trying to access inaccessible memory at run time;
    - We rely on direct name matching as opposed to using the new alloc-kind family of attributes and / or the LibCall analysis pass because some of the legacy functions that need replacing would not carry the former or be identified by the latter.

Reviewed by: JonChesterfield, yaxunl

Differential Revision: https://reviews.llvm.org/D155856
2023-10-10 12:02:05 +01:00
Jeffrey Byrnes
6afceba510
[AMDGPU][IGLP] SingleWaveOpt: Cache DSW Counters from PreRA (#67759)
Save the DSW counters from PreRA scheduling. While this avoids recalculation in the postRA pass, that isn't the main purpose.

This is required because of physical register dependencies in PostRA scheduling -- they alter the DAG s.t. our counters may become incorrect -- which alters the layout of the pipeline. By preserving the values from PreRA, we can be sure that we accurately construct the pipeline.

Additionally, remove a bad assert in SharesPredWithPrevNthGroup -- it is possible that we will have an empty cache if OtherGroup has no elements which have a V_PERM pred (possible if the V_PERM SG is empty).
2023-10-06 17:34:14 -07:00
Jay Foad
d85d143ad9
[AMDGPU] New image intrinsic optimizer pass (#67151)
Implement a new pass to combine multiple image_load_2dmsaa and
2darraymsaa intrinsic calls into a single image_msaa_load if:

- they refer to the same vaddr except for sample_id,
- they use a constant sample_id and they fall into the same group,
- they have the same dmask and the number of instructions and the
  number of vaddr/vdata dword transfers is reduced by the combine

This should be valid on all GFX11 but a hardware bug renders it
unworkable on GFX11.0.* so it is only enabled for GFX11.5.

Based on a patch by Rodrigo Dominguez!
2023-09-26 09:33:49 +01:00
Arthur Eubanks
0a1aa6cda2
[NFC][CodeGen] Change CodeGenOpt::Level/CodeGenFileType into enum classes (#66295)
This will make it easy for callers to see issues with and fix up calls
to createTargetMachine after a future change to the params of
TargetMachine.

This matches other nearby enums.

For downstream users, this should be a fairly straightforward
replacement,
e.g. s/CodeGenOpt::Aggressive/CodeGenOptLevel::Aggressive
or s/CGFT_/CodeGenFileType::
2023-09-14 14:10:14 -07:00
Kazu Hirata
a9c7ba964f [AMDGPU] Fix a warning
This patch fixes:

  llvm/lib/Target/AMDGPU/AMDGPU.h:297:18: error: private field 'TM' is
  not used [-Werror,-Wunused-private-field]
2023-09-12 14:02:07 -07:00
jwanggit86
b853988e0d
[AMDGPU] Port AMDGPURewriteUndefForPHI to new pass manager (#66008)
This patch ports the AMDGPURewriteUndefForPHI pass to the new pass
manager. With this, the pass is supported under both the legacy and the
new pass managers.

---------

Co-authored-by: Jun Wang <jun.wang7@amd.com>
2023-09-12 13:32:02 -07:00
Matt Arsenault
f7dcabe502 AMDGPU: Pass in TargetMachine to AMDGPULowerModuleLDSPass
https://reviews.llvm.org/D157660
2023-09-02 12:02:36 -04:00
Pravin Jagtap
6ef6c954c6 [AMDGPU] Reorder atomic optimizer to avoid CAS loop.
Expand-Atomic pass emits the CAS loop for FP operations
which limits the optimizations offered by atomic optimizer.

Moving atomic optimizer before expand-atomics allows
better codegen.

Reviewed By: arsenm, #amdgpu

Differential Revision: https://reviews.llvm.org/D157265
2023-08-30 12:05:21 -04:00
pvanhout
89e91e4c0c [AMDGPU] Remove post-PromoteAlloca SROA run
PromoteAlloca now uses SSAUpdater, it doesn't need SROA to clean-up after it anymore.

Internal testing shows no noticeable performance impact.

Reviewed By: #amdgpu, arsenm

Differential Revision: https://reviews.llvm.org/D156398
2023-08-11 08:29:21 +02:00
Matt Arsenault
58e87c961e AMDGPU: Port AMDGPULowerKernelArguments to new pass manager
https://reviews.llvm.org/D157498
2023-08-09 18:34:30 -04:00
Matt Arsenault
9a806551a0 AMDGPU: Delete old PM support for libcall passes
This has no reason to run in the codegen pipeline.
2023-08-01 18:22:02 -04:00
Matt Arsenault
5dfdd3494b AMDGPU: Don't try to fold wavefrontsize intrinsic in libcall simplify
It's not a libcall so doesn't really belong here to begin
with. Relying on checking the target name and explicit features isn't
particularly sound either. The library doesn't use the intrinsic
anymore, so it doesn't matter anyway.
2023-08-01 18:20:50 -04:00
Matt Arsenault
4d42e8b5d1 Reapply "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
This reverts commit a496c8be6e638ae58bb45f13113dbe3a4b7b23fd.

The workaround in c26dfc81e254c78dc23579cf3d1336f77249e1f6 should work
around the underlying problem with SUBREG_TO_REG.
2023-07-31 20:15:45 -04:00
Matt Arsenault
5b5bd81b71 AMDGPU: Move placement of RemoveIncompatibleFunctions
This should be approximately first and run with other module passes.

https://reviews.llvm.org/D155987
2023-07-31 19:22:04 -04:00
Vitaly Buka
a496c8be6e Revert "[CodeGen]Allow targets to use target specific COPY instructions for live range splitting"
And dependent commits.

Details in D150388.

This reverts commit 825b7f0ca5f2211ec3c93139f98d1e24048c225c.
This reverts commit 7a98f084c4d121244ef7286bc6503b6a181d446e.
This reverts commit b4a62b1fa546312d882fa12dfdcd015177d66826.
This reverts commit b7836d856206ec39509d42529f958c920368166b.

No conflicts in the code, few tests had conflicts in autogenerated CHECKs:
llvm/test/CodeGen/Thumb2/mve-float32regloops.ll
llvm/test/CodeGen/AMDGPU/fix-frame-reg-in-custom-csr-spills.ll

Reviewed By: alexfh

Differential Revision: https://reviews.llvm.org/D156381
2023-07-26 22:13:32 -07:00
Christudasan Devadasan
b4a62b1fa5 [AMDGPU] Enable whole wave register copy
So far, we haven't exposed the allocation of whole-wave
registers to regalloc. We hand-picked them for various
whole wave mode operations. With a future patch, we
want the allocator to efficiently allocate them rather
than using the custom pre-allocation pass.

Any liverange split of virtual registers involved in
whole-wave operations require the resulting COPY
introduced with the split to be performed for all
lanes. It isn't implemented in the compiler yet.

This patch would identify all such copies and
manipulate the exec mask around them to enable all
lanes without affecting the value of exec mask
elsewhere.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D143762
2023-07-07 22:58:55 +05:30
Christudasan Devadasan
b78b36e1a2 [AMDGPU] Implement whole wave register spill
To reduce the register pressure during allocation,
when the allocator spills a virtual register that
corresponds to a whole wave mode operation, the
spill loads and restores should be activated for
all lanes by temporarily flipping all bits in exec
register to one just before the spills. It is not
implemented in the compiler as of today and this
patch enables the necessary support.

This is a pre-patch before the SGPR spill to virtual
VGPR lanes that would eventually causes the whole
wave register spills during allocation.

Reviewed By: arsenm, cdevadas

Differential Revision: https://reviews.llvm.org/D143759
2023-07-07 22:51:45 +05:30
Ivan Kosarev
ee165cdb1b [AMDGPU] Eliminate SIMCCodeEmitter and de-virtualise encoding methods.
Simplifies some future changes needed for
<https://github.com/llvm/llvm-project/issues/62629>.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D154337
2023-07-05 10:13:33 +01:00
Brendon Cahoon
853b2a84cb [AMDGPU] Reserve SGPR pair when long branches are present
Branch relaxation requires 2 additional SGPRs for AMDGPU to handle the
case when an indirect branch target is too far away. The register
scavanger may not find available registers, which causes a “did not find
scavenging index” assert to occur in assignRegToScavengingIndex.

In this patch, we estimate before register allocation whether an
indirect branch is likely to be needed, and reserve 2 SGPRs if the
branch distance is found to be above a threshold. The distance threshold
is an approximation as the exact code size and branch distance are
unknown prior to register allocation.

Patch by Corbin Robeck. Thanks!

Differential Review: https://reviews.llvm.org/D149775
2023-06-29 16:50:46 -05:00
Matt Arsenault
d7d4aa539c AMDGPU: Move AMDGPUAttributor run earlier
Move it up with other module passes. It's a higher level optimization
that should probably be done before hacking up the IR for codegen. It
should really be done earlier than this. We could possibly move this
with other IPO passes, but we'd have to stop inferring the lack of
lds.kernel.id calls and have the LDS module pass mark functions which
don't need the ID.

The one test change is because that pass is relying on the backend run
of SROA (which we ideally wouldn't have).
2023-06-28 12:42:40 -04:00