Atomic optimizer is turned on by default through D152649. This patch
removes the usage of old command line option amdgpu-atomic-optimizations
and transfer the responsibility to `amdgpu-atomic-optimizer-strategy`.
We can safely remove old option when LLPC remove its all usage.
Reviewed By: foad, arsenm, #amdgpu, cdevadas
Differential Revision: https://reviews.llvm.org/D153007
The optimizing, non-broken features have all been moved to
AMDGPUAttributor. The only remaining piece of functionality was the
broken propagation of the wavesize features. This was fundamentally
broken and a hack for device library linking. It doesn't matter when
the device libraries are correctly linked and internalized.
In case of linked-as-normal-bitcode (as comgr still does), we're
reliant on the global subtarget anyway. If we can get away without
forcing target-cpu, we should just as well be able to get away without
propagating target-features.
Implement this optimization in SIInsertWaitcnts, where we already have
information about whether there might be outstanding VMEM store
instructions. This has the following advantages:
- Correctly handles atomics-with-return.
- Correctly handles call instructions.
- Should be faster because it does not require running a separate pass.
Differential Revision: https://reviews.llvm.org/D153279
The D147408 implemented new Iterative approach for scan computations
and added new flag `amdgpu-atomic-optimizer-strategy` which is
defaulted to DPP.
The changeset https://github.com/GPUOpen-Drivers/llpc/pull/2506
adapts to the new changes in LLPC.
This patch enables atomic optimizer pass and selects Iterative
approach for scan computations by default for compute pipeline.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D152649
The commit that added the run says it's to hoist uniform parts of
integer division expansion. That expansion is performed later, so this
didn't do anything in that case. Move this later so the original test
shows the improvement.
This also saves a run of "Canonicalize natural loops". Not sure why
this appears to be still getting a separate loop PM run. Also feels a
bit heavy to run this just for divide. Is there a way to specifically
hoist the divide sequence when it expands?
Expand large or unknown size memory intrinsics into loops in the
default lowering pipeline if the target doesn't have the corresponding
libfunc. Previously AMDGPU had a custom pass which existed to call the
expansion utilities.
With a default no-libcall option, we can remove the libfunc checks in
LoopIdiomRecognize for these, which never made any sense. This also
provides a path to lifting the immarg restriction on
llvm.memcpy.inline.
There seems to be a bug where TLI reports functions as available if
you use -march and not -mtriple.
This patch provides an alternative implementation to DPP for Scan Computations.
An alternative implementation iterates over all active lanes of Wavefront
using llvm.cttz and performs the following steps:
1. Read the value that needs to be atomically incremented using
llvm.amdgcn.readlane intrinsic
2. Accumulate the result.
3. Update the scan result using llvm.amdgcn.writelane intrinsic
if intermediate scan results are needed later in the kernel.
Reviewed By: arsenm, cdevadas
Differential Revision: https://reviews.llvm.org/D147408
There is a failure with this pass in the case when target register class for a subregister isn't known from instruction description (for ex. COPY).
Currently in this situation the RC is obtained using TargetRegisterInfo::getSubRegisterClass but in general it's not working.
In order to fix this two things should be done:
1. Stop processing a subregister if the target register class is unknown (conservative approach)
2. Improve deduction of subregister' target register class (i.e by processing COPY chain)
I was going to implement point 1 but my tests use implicit operands for S_NOP and they don't have associated target register class and all tests fail.
Therefore I decided to turn off the pass now, implement point 1 and fix my tests.
Reviewed By: arsenm, #amdgpu
Differential Revision: https://reviews.llvm.org/D152291
The main purpose of this is to simplify register pressure tracking as after the pass there is no need
to track subreg liveness anymore.
On the other hand this pass creates more possibilites for the subreg unaware code, as many of the subregs
becomes ordinary registers.
Intersting sideeffect: spill-vgpr.ll has lost a lot of spills.
Reviewed By: #amdgpu, arsenm
Differential Revision: https://reviews.llvm.org/D139732
Currently AMDGPU offers extra ctor / dtor lowering by emitting a kernel
that can be called. It's possible to handle ctors and dtors using the
standard method as shown in D149340's commit message. In which case we
on't need these extra kernels as they won't be called. This patch simply
adds a way to conditionally turn off this handling if we do not want to
get extra kernels in the output.
Unrelated, but we could convert this handling to an ODR function that simply
calls the code in D149340 constructed via LLVM-IR. That would handle priority
correctly and would then be correct if not run in LTO mode.
Reviewed By: yaxunl
Differential Revision: https://reviews.llvm.org/D150565
Re-land D145441 with data layout upgrade code fixed to not break OpenMP.
This reverts commit 3f2fbe92d0f40bcb46db7636db9ec3f7e7899b27.
Differential Revision: https://reviews.llvm.org/D149776
Per discussion at
https://discourse.llvm.org/t/representing-buffer-descriptors-in-the-amdgpu-target-call-for-suggestions/68798,
we define two new address spaces for AMDGCN targets.
The first is address space 7, a non-integral address space (which was
already in the data layout) that has 160-bit pointers (which are
256-bit aligned) and uses a 32-bit offset. These pointers combine a
128-bit buffer descriptor and a 32-bit offset, and will be usable with
normal LLVM operations (load, store, GEP). However, they will be
rewritten out of existence before code generation.
The second of these is address space 8, the address space for "buffer
resources". These will be used to represent the resource arguments to
buffer instructions, and new buffer intrinsics will be defined that
take them instead of <4 x i32> as resource arguments. ptr
addrspace(8). These pointers are 128-bits long (with the same
alignment). They must not be used as the arguments to getelementptr or
otherwise used in address computations, since they can have
arbitrarily complex inherent addressing semantics that can't be
represented in LLVM. Even though, like their address space 7 cousins,
these pointers have deterministic ptrtoint/inttoptr semantics, they
are defined to be non-integral in order to prevent optimizations that
rely on pointers being a [0, [addr_max]] value from applying to them.
Future work includes:
- Defining new buffer intrinsics that take ptr addrspace(8) resources.
- A late rewrite to turn address space 7 operations into buffer
intrinsics and offset computations.
This commit also updates the "fallback address space" for buffer
intrinsics to the buffer resource, and updates the alias analysis
table.
Depends on D143437
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D145441
Removed definitions of vectorizeBasicBlock and VectorizeConfig
(possibly a remnant from the BBVectorize pass that was removed
way back in 2017).
Also reduced amount of include dependencies to Transforms/Vectorize.h.
Pass disabled since approximately D104962 for miscompiling openmp
The functions under ReplaceConstant miscompile phis as noted in D112717 and
have no users in tree other than the disabled pass. It seems likely it has no
users out of tree.
Deletes the test cases associated with the disabled pass as well.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D147586
AMDGPUPrintfRuntimeBindingPass is not run in the IR optimization
pipeline with -O0.
This means that with OpenCL the printf definition coming from
device_libs gets linked with the user's code, which blocks
AMDGPUPrintfRuntimeBindingPass from working after the linkage is done.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D146720
The legacy pass is only used in AMDGPU codegen, which doesn't care about running it in call graph order (it actually has to work around that fact).
Make the legacy pass a module pass and share code with the new pass.
This allows us to remove the legacy inliner infrastructure.
Reviewed By: mtrofin
Differential Revision: https://reviews.llvm.org/D146446
Adds a new pass that removes functions
if they use features that are not supported on the current GPU.
This change is aimed at preventing crashes when building code at O0 that
uses idioms such as `if (ISA_VERSION >= N) intrinsic_a(); else intrinsic_b();`
where ISA_VERSION is not constexpr, and intrinsic_a is not selectable
on older targets.
This is a pattern that's used all over the ROCm device libs. The main
motive behind this change is to allow code using ROCm device libs
to be built at O0.
Note: the feature checking logic is done ad-hoc in the pass. There is no other
pass that needs (or will need in the foreseeable future) to do similar
feature-checking logic so I did not see a need to generalize the feature
checking logic yet. It can (and should probably) be generalized later and
moved to a TargetInfo-like class or helper file.
Reviewed By: arsenm, Joe_Nash
Differential Revision: https://reviews.llvm.org/D139000
Uniformity analysis needs to be the fundamental basis for
regbank decisions. The considerations of the default pass
are secondary, but potentially useful for some edge cases (e.g.
selecting AGPRs when arbitrary loads and stores can directly use
them). This needs to be a separate pass since it requires new
analysis dependencies.
Boilerplate to subclass the existing pass which does nothing
different.
This simplies a future patch. The MIR handling should be fixed. We're
still printing these in custom MachineFunctionInfo as bools (plus the
inverted meaning is hard to follow).
This fixes what I consider to be an API flaw I've tripped over
multiple times. The point this is constructed isn't well defined, so
depending on where this is first called, you can conclude different
information based on the MachineFunction. For example, the AMDGPU
implementation inspected the MachineFrameInfo on construction for the
stack objects and if the frame has calls. This kind of worked in
SelectionDAG which visited all allocas up front, but broke in
GlobalISel which hasn't visited any of the IR when arguments are
lowered.
I've run into similar problems before with the MIR parser and trying
to make use of other MachineFunction fields, so I think it's best to
just categorically disallow dependency on the MachineFunction state in
the constructor and to always construct this at the same time as the
MachineFunction itself.
A missing feature I still could use is a way to access an custom
analysis pass on the IR here.
Quite a few passes are not default constructible. In order to properly
support -{start|stop}-{before|after}= for these passes, we would like to
continue to use INITIALIZE_PASS, but not necessarily provide a default
constructor.
Delete the default constructors of classes derived from
SelectionDAGISel.
Link: https://github.com/llvm/llvm-project/issues/59538
Reviewed By: efriedma
Differential Revision: https://reviews.llvm.org/D140349
Currently, the custom SGPR spill lowering pass spills
SGPRs into physical VGPR lanes and the remaining VGPRs
are used by regalloc for vector regclass allocation.
This imposes many restrictions that we ended up with
unsuccessful SGPR spilling when there won't be enough
VGPRs and we are forced to spill the leftover into
memory during PEI. The custom spill handling during PEI
has many edge cases and often breaks the compiler time
to time.
This patch implements spilling SGPRs into virtual VGPR
lanes. Since we now split the register allocation for
SGPRs and VGPRs, the virtual registers introduced for
the spill lanes would get allocated automatically in
the subsequent regalloc invocation for VGPRs.
Spill to virtual registers will always be successful,
even in the high-pressure situations, and hence it avoids
most of the edge cases during PEI. We are now left with
only the custom SGPR spills during PEI for special registers
like the frame pointer which isn an unproblematic case.
This patch also implements the whole wave spills which
might occur if RA spills any live range of virtual registers
involved in the whole wave operations. Earlier, we had
been hand-picking registers for such machine operands.
But now with SGPR spills into virtual VGPR lanes, we are
exposing them to the allocator.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D124196
C++17 allows us to call constructors pair and tuple instead of helper
functions make_pair and make_tuple.
Differential Revision: https://reviews.llvm.org/D139828
This patch mechanically replaces None with std::nullopt where the
compiler would warn if None were deprecated. The intent is to reduce
the amount of manual work required in migrating from Optional to
std::optional.
This is part of an effort to migrate from llvm::Optional to
std::optional:
https://discourse.llvm.org/t/deprecating-llvm-optional-x-hasvalue-getvalue-getvalueor/63716
The use of a PSV for buffer intrinsics is misleading because it may be
misinterpreted as all buffer intrinsics accessing the same address in
memory, which is clearly not true.
Instead, build MachineMemOperands without a pointer value but with an
address space, so that address space-based alias analysis can still
work.
There is a lot of test churn because previously address space 4
(constant address space) was used as an address space for buffer
intrinsics. This doesn't make much sense and seems to have been an
accident -- see the change in
AMDGPUTargetMachine::getAddressSpaceForPseudoSourceKind.
Differential Revision: https://reviews.llvm.org/D138711
Since opt no longer supports to run default (O0/O1/O2/O3/Os/Oz)
pipelines using the legacy PM, there are no in-tree uses of
TargetMachine::adjustPassManager remaining. This patch removes the
no longer used adjustPassManager functions.
Reviewed By: aeubanks
Differential Revision: https://reviews.llvm.org/D137796
For the pattern of IR (%if terminates with a divergent branch.),
divergence analysis will report %phi as uniform to help optimal code
generation.
```
%if
| \
| %then
| /
%endif: %phi = phi [ %uniform, %if ], [ %undef, %then ]
```
In the backend, %phi and %uniform will be assigned a scalar register.
But the %undef from %then will make the scalar register dead in %then.
This will likely cause the register being over-written in %then. To fix
the issue, we will rewrite %undef as %uniform. For details, please refer
the comment in AMDGPURewriteUndefForPHI.cpp. Currently there is no test
changes shown, but this is mandatory for later changes.
Reviewed by: sameerds
Differential Revision: https://reviews.llvm.org/D133840
Adds a builtin that serves as an optimization hint to apply specific optimized
DAG mutations during scheduling. This also disables any other mutations or
clustering that may interfere with the desired pipeline. The first optimization
strategy that is added here is designed to improve the performance of small gemm
kernels on gfx90a.
Reviewed By: jrbyrnes
Differential Revision: https://reviews.llvm.org/D132079
When compiling for multiple targets the scheduler that is selected via the
-misched option is applied globally. This patch adds a target CL option instead.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D131022
Creates a new scheduling strategy that attempts to maximize ILP for a single
wave.
Reviewed By: rampitec
Differential Revision: https://reviews.llvm.org/D130869
This pass seems to have very little effect because all it does is hoist
some instructions, but it is followed later in the codegen pipeline by
the IR CodeSinking pass which does the opposite.
Differential Revision: https://reviews.llvm.org/D130258
Implement an intrinsic for use lowering LDS variables to different
addresses from different kernels. This will allow kernels that cannot
reach an LDS variable to avoid wasting space for it.
There are a number of implicit arguments accessed by intrinsic already
so this implementation closely follows the existing handling. It is slightly
novel in that this SGPR is written by the kernel prologue.
It is necessary in the general case to put variables at different addresses
such that they can be compactly allocated and thus necessary for an
indirect function call to have some means of determining where a
given variable was allocated. Claiming an arbitrary SGPR into which
an integer can be written by the kernel, in this implementation based
on metadata associated with that kernel, which is then passed on to
indirect call sites is sufficient to determine the variable address.
The intent is to emit a __const array of LDS addresses and index into it.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D125060
We form VOPD instructions in the GCNCreateVOPD pass by combining
back-to-back component instructions. There are strict register
constraints for creating a legal VOPD, namely that the matching operands
(e.g. src0x and src0y, src1x and src1y) must be in different register
banks. We add a PostRA scheduler
mutation to put possible VOPD components back-to-back.
Depends on D128442, D128270
Reviewed By: #amdgpu, rampitec
Differential Revision: https://reviews.llvm.org/D128656
GFX11 has a new message type MSG_DEALLOC_VGPRS which can be used to
release a shader's VGPRs. Sending this at the end of a shader (just
before the s_endpgm) can help overall system performance in cases where
the s_endpgm would have to wait for outstanding VMEM stores to complete
before releasing the VGPRs.
Differential Revision: https://reviews.llvm.org/D128442