[amdgpu] Update med3 combine to skip i64
Fixes an assumption that a type which is not i32 will be i16. This asserts
when trying to sign/zero extend an i64 to i32.
Test case was cut down from an openmp application. Variations on it are hit by
other combines before reaching the problematic one, e.g. replacing the
immediate values with other function arguments changes the codegen path and
misses this combine.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D98872
byval requires an implicit copy between the caller and callee such
that the callee may write into the stack area without it modifying the
value in the parent. Previously, this was passing through the raw
pointer value which would break if the callee wrote into it.
Most of the time, this copy can be optimized out (however we don't
have the optimization SelectionDAG does yet).
This will trigger more fallbacks for AMDGPU now, since we don't have
legalization for memcpy yet (although we should stop using byval
anyway).
RA can insert something like a sub1_sub2 COPY of a wide VGPR
tuple which results in the unaligned acces with v_pk_mov_b32
after the copy is expanded. This is regression after D97316.
Differential Revision: https://reviews.llvm.org/D98549
Replace individual operands GLC, SLC, and DLC with a single cache_policy
bitmask operand. This will reduce the number of operands in MIR and I hope
the amount of code. These operands are mostly 0 anyway.
Additional advantage that parser will accept these flags in any order unlike
now.
Differential Revision: https://reviews.llvm.org/D96469
[amdgpu] Implement lower function LDS pass
Local variables are allocated at kernel launch. This pass collects global
variables that are used from non-kernel functions, moves them into a new struct
type, and allocates an instance of that type in every kernel. Uses are then
replaced with a constantexpr offset.
Prior to this pass, accesses from a function are compiled to trap. With this
pass, most such accesses are removed before reaching codegen. The trap logic
is left unchanged by this pass. It is still reachable for the cases this pass
misses, notably the extern shared construct from hip and variables marked
constant which survive the optimizer.
This is of interest to the openmp project because the deviceRTL runtime library
uses cuda shared variables from functions that cannot be inlined. Trunk llvm
therefore cannot compile some openmp kernels for amdgpu. In addition to the
unit tests attached, this patch applied to ROCm llvm with fixed-abi enabled
and the function pointer hashing scheme deleted passes the openmp suite.
This lowering will use more LDS than strictly necessary. It is intended to be
a functionally correct fallback for cases that are difficult to target from
future optimisation passes.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D94648
When tracking defined lanes through phi nodes in the live range
graph each branch of the phi must be handled independently.
Also rewrite the marking algorithm to reduce unnecessary
operations.
Previously a shared set of defined lanes was used which caused
marking to stop prematurely. This was observable in existing lit
tests, but test patterns did not cover this detail.
Reviewed By: piotr
Differential Revision: https://reviews.llvm.org/D98614
This reverts commit 329aeb5db43f5e69df038fb20d2def77fe6f8595,
and relands commit 61f006ac655431bd44b9e089f74c73bec0c1a48c.
This is a continuation of D89456.
As it was suggested there, now that SCEV models `PtrToInt`,
we can try to improve SCEV's pointer handling.
In particular, i believe, i will need this in the future
to further fix `SCEVAddExpr`operation type handling.
This removes special handling of `ConstantPointerNull`
from `ScalarEvolution::createSCEV()`, and add constant folding
into `ScalarEvolution::getPtrToIntExpr()`.
This way, `null` constants stay as such in SCEV's,
but gracefully become zero integers when asked.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D98147
This is a continuation of D89456.
As it was suggested there, now that SCEV models `PtrToInt`,
we can try to improve SCEV's pointer handling.
In particular, i believe, i will need this in the future
to further fix `SCEVAddExpr`operation type handling.
This removes special handling of `ConstantPointerNull`
from `ScalarEvolution::createSCEV()`, and add constant folding
into `ScalarEvolution::getPtrToIntExpr()`.
This way, `null` constants stay as such in SCEV's,
but gracefully become zero integers when asked.
Reviewed By: Meinersbur
Differential Revision: https://reviews.llvm.org/D98147
This also briefly tests a larger set of architectures than the more
exhaustive functionality tests for AArch64 and x86.
As requested in D88785
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D98339
byval arguments need to be assumed writable. Only implicitly stack
passed arguments which aren't addressable in the IR can be assumed
immutable.
Mips is still broken since for some reason its doing its own thing
with the ValueHandlers (and x86 doesn't actually handle byval
arguments now, although some of the code is there).
This was essentially ignoring byval and treating them as a pointer
argument which needed to be loaded from. This should copy the frame
index value to the virtual register, not insert a load from the frame
index into the pointer value.
For AMDGPU, this was producing a load from the byval pointer argument,
to a pointer used for the byval arguments. I do not understand how
AArch64 managed to work before since it appears to be similarly
broken.
We could also change the ValueHandler API to avoid the extra copy from
the frame index, since currently it returns a new register.
I believe there is still an issue with outgoing byval arguments. These
should have a copy inserted in case the callee decided to overwrite
the memory.
As llvm.amdgcn.kill is lowered to a terminator it can cause
else branch annotations to end up in the wrong block.
Do not annotate conditionals as else branches where there is
a kill to avoid this.
Reviewed By: arsenm
Differential Revision: https://reviews.llvm.org/D97427
This instruction is only valid on 2D MSAA and 2D MSAA Array
surfaces. Remove intrinsic support for other dimension types,
and block assembly for unsupported dimensions.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D98397
We have amdgpu_gfx functions that have high register pressure. If
we do not reserve VGPR for SGPR spill, we will fall into the path
to spill the SGPR to memory, which does not only have correctness issue,
but also have really bad performance.
I don't know why there is the check for hasStackObjects(), in our case,
we don't have stack objects at the time of finalizeLowering(). So just
remove the check that we always reserve a VGPR for possible SGPR spill
in non-entry functions.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D98345
For attribute sets, the return index is at 0, and arguments start at
1. getParamAlignment adds the offset of 1, so we need to convert from
attribute index back to IR index.
As we may overwrite inactive lanes of a caller-save-vgpr, we should
always save/restore the reserved vgpr for sgpr spill.
Reviewed by: arsenm
Differential Revision: https://reviews.llvm.org/D98319
D57708 changed SIInstrInfo::isReallyTriviallyReMaterializable to reject
V_MOVs with extra implicit operands, but it accidentally rejected all
V_MOVs because of their implicit use of exec. Fix it but avoid adding a
moderately expensive call to MI.getDesc().getNumImplicitUses().
In real graphics shaders this changes quite a few vgpr copies into move-
immediates, which is good for avoiding stalls on GFX10.
Differential Revision: https://reviews.llvm.org/D98347
It is good to have a combined `divrem` instruction when the
`div` and `rem` are computed from identical input operands.
Some targets can lower them through a single expansion that
computes both division and remainder. It effectively reduces
the number of instructions than individually expanding them.
Reviewed By: arsenm, paquette
Differential Revision: https://reviews.llvm.org/D96013
AMDGPU target tries to handle the SGPR and VGPR spills in a
custom pass before the actual frame lowering pass. Once they
are handled and the respective frames are eliminated in the
custom pass, certain uses of them still remain. For instance,
the DBG_VALUE instructions inserted by the allocator alongside
the spill instruction will use the corresponding frame index.
They become dead later during PEI and causes a crash while trying to
replace the frame indices. We should possibly avoid this custom pass.
For now, replacing such dead references with null register value.
Reviewed By: arsenm, scott.linder
Differential Revision: https://reviews.llvm.org/D98038
This is already deprecated, so remove code working on this.
Also update the tests by using S_CBRANCH_EXECZ instead of SI_MASK_BRANCH.
Reviewed By: foad
Differential Revision: https://reviews.llvm.org/D97545
The result of ISD::USUBSAT will never be larger than the LHS. We
can use this to put a bound on the number of leading zeros.
Reviewed By: RKSimon
Differential Revision: https://reviews.llvm.org/D98133
This reverts commit e58d68fcd06ddc7743e0419c0b364df3d44121b6.
This reinstates commit fc28f600e558c1344618bda149a068d6162b6f0b
with a fix to initialize HasShaderCyclesRegister. See
https://reviews.llvm.org/D97928.
gfx1030 added a new way to implement readcyclecounter using the
SHADER_CYCLES hardware register, but the s_memtime instruction still
exists, so the MC layer should still accept it and the
llvm.amdgcn.s.memtime intrinsic should still work.
Differential Revision: https://reviews.llvm.org/D97928
Same as other memory instructions, ds instructions add latency even if
exec is zero. Jumping over them if exec=0 is cheaper than executing
them.
With this change, the branch instruction that skips over a basic block
if exec=0 is not removed when the block contains a ds instruction.
Differential Revision: https://reviews.llvm.org/D97922
Recommit bf5a5826504754788a8f1e3fec7a7dc95cda5782. Depends on
4c8fb7ddd6fa49258e0e9427e7345fb56ba522d4 which was reverted.
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.
Differential Revision: https://reviews.llvm.org/D95432
Recommit 4112299ee761a9b6a309c8ff4a7e75f8c8d8851b. Depends on
4c8fb7ddd6fa49258e0e9427e7345fb56ba522d4 which was reverted.
Combine zext(trunc x) to x when truncated bits are known to be zero.
Differential Revision: https://reviews.llvm.org/D96031
The hazard where a VMEM reads an SGPR written by a VALU counts as a data
dependency hazard, so no nops are required on GFX10. Tested with Vulkan
CTS on GFX10.1 and GFX10.3.
Differential Revision: https://reviews.llvm.org/D97926
This is recommit of 4c8fb7ddd6fa49258e0e9427e7345fb56ba522d4.
MIR in one unit test had mismatched types.
For vectors we consider a bit as known if it is the same for all demanded
vector elements (all elements by default). KnownBits BitWidth for vector
type is size of vector element. Add support for G_BUILD_VECTOR.
This allows combines of urem_pow2_to_mask in pre-legalizer combiner.
Differential Revision: https://reviews.llvm.org/D96122
:: (store 1 + 4, addrspace 1)
->
:: (store 1 into undef + 4, addrspace 1)
An offset without a base isn't terribly useful but it's convenient to update
the offset without checking the value. For example, when breaking apart
stores into smaller units
Differential Revision: https://reviews.llvm.org/D97812
RegBankSelect creates zext and trunc when it selects banks for uniform i1.
Add zext_trunc_fold from generic combiner to post RegBankSelect combiner.
Differential Revision: https://reviews.llvm.org/D95432
For vectors we consider a bit as known if it is the same for all demanded
vector elements (all elements by default). KnownBits BitWidth for vector
type is size of vector element. Add support for G_BUILD_VECTOR.
This allows combines of urem_pow2_to_mask in pre-legalizer combiner.
Differential Revision: https://reviews.llvm.org/D96122
VirtRegRewriter may sometimes fail to correctly apply the kill flag where necessary,
which causes unecessary code gen on PowerPC. This patch fixes the way masks for
defined lanes are computed and the way mask for used lanes is computed.
Contact albion.fung@ibm.com instead of author for problems related to this commit.
Differential Revision: https://reviews.llvm.org/D92405
Refactor insertion of the asserting ops. This enables using them for
AMDGPU.
This code should essentially be the same for every target. Mips, X86
and ARM all have different code there now, but this seems to be an
accident. The assignment functions are called with different types
than they would be in the DAG, so this is all likely an assortment of
hacks to get around that.