12 Commits

Author SHA1 Message Date
Jon Chesterfield
3c76e5f0c8 [amdgpu][nfc] Remove dead code associated with LDS lowering
Pass disabled since approximately D104962 for miscompiling openmp

The functions under ReplaceConstant miscompile phis as noted in D112717 and
have no users in tree other than the disabled pass. It seems likely it has no
users out of tree.

Deletes the test cases associated with the disabled pass as well.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D147586
2023-04-05 22:24:22 +01:00
Jon Chesterfield
0507448d82 [amdgpu] Implement dynamic LDS accesses from non-kernel functions
The premise here is to allow non-kernel functions to locate external LDS variables without using LDS or extra magic SGPRs to do so.

1/ First it crawls the callgraph to work out which external LDS variables are reachable from a given kernel
2/ Then it creates a new `extern char[0]` variable for each kernel, which will alias all the other extern LDS variables because that's the documented behaviour of these variables
3/ The address of that variable is written to a lookup table. The global variable is tagged with metadata to track what address it was allocated at by codegen
4/ The assembler builds the lookup table using the metadata
5/ Any non-kernel functions use the same magic intrinsic used by table lookups of non-dynamic LDS variables to find the address to use

Heavy overlap with the code paths taken for other lowering, in particular the same intrinsic is used to pass the dynamic scope information through the same sgpr as for table lookups of static LDS.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D144233
2023-04-04 20:06:34 +01:00
Jon Chesterfield
56202c51d4 Revert "[amdgpu][lds] Use the same isKernel predicate consistently"
Looks like this composed poorly with a nominally independent patch, will fix
This reverts commit 0ba0398517778514eb44cb7ba9bf9d4d20a856e0.
2022-11-09 16:54:20 +00:00
Jon Chesterfield
0ba0398517 [amdgpu][lds] Use the same isKernel predicate consistently
isKernelCC != isKernel(F->getCallingConv())
There's a test case (lower-kernel-lds.ll) that explicitly skips amdgpu_ps
so this change picks the isKernel predicate that continues to skip that
calling convention.

isKernel returns true for AMDGPU_KERNEL and SPIR_KERNEL. isKernelCC also
returns true for other calling conventions.

Reviewed By: arsenm

Differential Revision: https://reviews.llvm.org/D136599
2022-11-09 16:45:05 +00:00
Kazu Hirata
20d764aff0 [llvm] Don't including SetVector.h (NFC)
llvm/lib/ProfileData/RawMemProfReader.cpp uses SetVector without
including SetVector.h, so this patch adds an appropriate #include
there.
2022-09-17 12:36:43 -07:00
Jon Chesterfield
cdb9738963 [amdgpu] Expand all ConstantExpr users of LDS variables in instructions
Bug noted in D112717 can be sidestepped with this change.

Expanding all ConstantExpr involved with LDS up front makes the variable specialisation simpler. Excludes ConstantExpr that don't access LDS to avoid disturbing codegen elsewhere.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D133422
2022-09-14 07:55:46 +01:00
Jon Chesterfield
a28bbd00c6 [amdgpu][nfc] Factor predicate out of findLDSVariablesToLower 2022-08-31 15:44:51 +01:00
Kazu Hirata
e20d210eef [llvm] Qualify auto (NFC)
Identified with readability-qualified-auto.
2022-08-07 23:55:27 -07:00
Austin Kerbow
f5b21680d1 [AMDGPU] Add amdgcn_sched_group_barrier builtin
This builtin allows the creation of custom scheduling pipelines on a per-region
basis. Like the sched_barrier builtin this is intended to be used either for
testing, in situations where the default scheduler heuristics cannot be
improved, or in critical kernels where users are trying to get performance that
is close to handwritten assembly. Obviously using these builtins will require
extra work from the kernel writer to maintain the desired behavior.

The builtin can be used to create groups of instructions called "scheduling
groups" where ordering between the groups is enforced by the scheduler.
__builtin_amdgcn_sched_group_barrier takes three parameters. The first parameter
is a mask that determines the types of instructions that you would like to
synchronize around and add to a scheduling group. These instructions will be
selected from the bottom up starting from the sched_group_barrier's location
during instruction scheduling. The second parameter is the number of matching
instructions that will be associated with this sched_group_barrier. The third
parameter is an identifier which is used to describe what other
sched_group_barriers should be synchronized with. Note that multiple
sched_group_barriers must be added in order for them to be useful since they
only synchronize with other sched_group_barriers. Only "scheduling groups" with
a matching third parameter will have any enforced ordering between them.

As an example, the code below tries to create a pipeline of 1 VMEM_READ
instruction followed by 1 VALU instruction followed by 5 MFMA instructions...
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 1 VALU
__builtin_amdgcn_sched_group_barrier(2, 1, 0)
// 5 MFMA
__builtin_amdgcn_sched_group_barrier(8, 5, 0)
// 1 VMEM_READ
__builtin_amdgcn_sched_group_barrier(32, 1, 0)
// 3 VALU
__builtin_amdgcn_sched_group_barrier(2, 3, 0)
// 2 VMEM_WRITE
__builtin_amdgcn_sched_group_barrier(64, 2, 0)

Reviewed By: jrbyrnes

Differential Revision: https://reviews.llvm.org/D128158
2022-07-28 10:43:14 -07:00
Austin Kerbow
2db700215a [AMDGPU] Add llvm.amdgcn.sched.barrier intrinsic
Adds an intrinsic/builtin that can be used to fine tune scheduler behavior. If
there is a need to have highly optimized codegen and kernel developers have
knowledge of inter-wave runtime behavior which is unknown to the compiler this
builtin can be used to tune scheduling.

This intrinsic creates a barrier between scheduling regions. The immediate
parameter is a mask to determine the types of instructions that should be
prevented from crossing the sched_barrier. In this initial patch, there are only
two variations. A mask of 0 means that no instructions may be scheduled across
the sched_barrier. A mask of 1 means that non-memory, non-side-effect inducing
instructions may cross the sched_barrier.

Note that this intrinsic is only meant to work with the scheduling passes. Any
other transformations that may move code will not be impacted in the ways
described above.

Reviewed By: rampitec

Differential Revision: https://reviews.llvm.org/D124700
2022-05-11 13:22:51 -07:00
Stanislav Mekhanoshin
c7eb846345 [AMDGPU] Merge AMDGPULDSUtils into AMDGPUMemoryUtils
Differential Revision: https://reviews.llvm.org/D119502
2022-02-11 10:32:24 -08:00
Stanislav Mekhanoshin
290e5722e8 [AMDGPU] Improve clobbering checks in the kernel argument promotion
Use same MSSA clobbering checks as in the AMDGPUAnnotateUniformValues.
Kernel argument promotion needs exactly the same information so factor
out utility function isClobberedInFunction.

Differential Revision: https://reviews.llvm.org/D119480
2022-02-10 14:51:47 -08:00